<a href="https://colab.research.google.com/github/Ayanlola2002/DATA-SCIENCE-PROJECTS/blob/master/ProjectWebscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Web Scraping 101 

#### Kindly ensure you have the legal rights to scrape and use data from a site before doing so. Propertypro is more flexible about this as seen in the terms and conditions page however Nigeria Property center is not. Check below for more: 

#### https://www.propertypro.ng/terms
#### https://nigeriapropertycentre.com/terms-of-use 

.

#### Import Beautiful Soup for scraping, requests for making request to a website and re for regular expressions

In [0]:
import requests, re
from bs4 import BeautifulSoup

#### Make a request to the website and extract its content (page source)

In [0]:
r=requests.get("https://www.propertypro.ng/property-for-rent?search=gbagada")
c=r.content

#### Parse the page source using the Beautiful soup HTML parser
#### Find all property features on the page

In [0]:
soup=BeautifulSoup(c,"html.parser")

real=soup.find_all("div",{"class":"prop-features"})

#### Collect property features on the page.For this add the index of the feature at the end of the code e.g. [0] for bed, [2] for bath.
#### This first method is not ideal because the location of the specific feature might change 

In [48]:
real[0].get_text().strip().split()

['4', 'bed', '4', 'bath', '5', 'toilet']

#### This second method uses regular expressions and is a better way to collect feature information, because it checks for the feature before collecting. If the feature does not exist it will give a none. For more on regular expressions check https://www.w3schools.com/python/python_regex.asp 

In [49]:
re.findall("..bath",real[0].get_text().strip())#[0][0]

['4 bath']

#### You can change div and class below to search for something else. 

In [0]:
real=soup.find_all("div",{"class":"prop-features"})

#### Websites typically have a structure which allows for easy automation. For example location and page number can eaasily be changed and the website will respond accordingly. Try changing the location below and page number to surulere and page 2 respectively.

In [0]:
#https://www.propertypro.ng/property-for-rent?search=gbagada&page=1

#### There is a slight challenge from above. You will need to get the total number of pages. This can be calculated using total number of items divided by number of listing on each page. The listing is written within a paragraph of text so this has to be extracted using regular expressions. 

In [52]:
items = int(re.findall("\d+",soup.find_all("div",{"class":"jumbotron m-hide"})[0].text.split("total of")[1][:6].replace(",","").strip())[0])
listings = 20
page_nr = int(items/listings)
page_nr

30

#### This is the full code below, the first for loop is used to extract the details on a page, the second is to extract across multiple pages while the last is to extract across locations. Please as indicated in the instructions, do not scrape multiple locations or pages until off peak hours (>6pm) to avoid overloading the site.

In [0]:
l=[]
location = ["gbagada","ikeja","surulere","ikeja","ogba","iyana ipaja","lekki","ajah","ikorodu"]


for place in location:
    base_url="https://www.propertypro.ng/property-for-rent?search="+place+ "&auto=&type=&bedroom=&max_price="
    r=requests.get(base_url+".html")
    c=r.content
    soup=BeautifulSoup(c,"html.parser")

    items = int(re.findall("\d+",soup.find_all("div",{"class":"jumbotron m-hide"})[0].text.split("total of")[1][:6].replace(",","").strip())[0])
    listings = 20 #This can be edited back to 20 items as stated on site. However your code will take a while to run
    page_nr = int(items/listings)
    
    #base_url="https://www.propertypro.ng/property-for-rent?search="+place+ "&auto=&type=&bedroom=&max_price="
    for page in range(1,int(page_nr),1):

        r=requests.get(base_url+".html"+"&page="+str(page))
        c=r.content

        soup=BeautifulSoup(c,"html.parser")
        
        classes = ["col-lg-6 col-md-6 col-sm-6 col-xs-12 prop-meta-data","col-lg-8 col-md-8 col-sm-7 col-xs-12 prop-meta-data text-left",
                   "col-lg-9 col-md-9 col-sm-12 col-xs-12 main-listing-cont"]
        for class_ in classes:
            real=soup.find_all("div",{"class":class_})

            for i in list(range(0,len(real))):
                d={}
                d['page']= page
                try:
                    d["location"] = real[i].find("h3",{"class":"pro-location"}).text.strip()
                except (IndexError,TypeError,AttributeError):
                    d["location"] = None
                try:
                    d["specific_location"] = real[i].find("h3",{"class":"pro-location"}).text.strip().split("gbagada")[0].replace("-","").strip()
                except(IndexError,TypeError,AttributeError):
                    d['specific_location'] = None
                try:
                    d["features"]=real[i].find("span",{"class":"prop-aminities float-left"}).text.strip()  
                except (AttributeError,IndexError) as e:
                    d["features"]= None
                try:
                    d["bedrooms"]= re.findall("..bed",real[i].find("span",{"class":"prop-aminities float-left"}).text.strip())[0][0]
                except (IndexError,TypeError,AttributeError) as e:
                    d["bedrooms"]= None 
                try:
                    d["bathrooms"]= re.findall("..bath",real[i].find("span",{"class":"prop-aminities float-left"}).text.strip())[0][0]
                except (IndexError,TypeError,AttributeError) as e:
                    d["bathrooms"]= None
                try:
                    d["toilets"]= re.findall("..toilet",real[i].find("span",{"class":"prop-aminities float-left"}).text.strip())[0][0]
                except (IndexError,TypeError,AttributeError) as e:
                    d["toilets"]=None
                try:
                    d["description"]=real[i].find("p",{"class":"pro-description"}).text.strip()
                except (IndexError,TypeError,AttributeError) as e:
                    d["description"]= None
                try:
                    d["other_description"]=real[i].find("p",{"class":"pro-description readmore"}).text.strip()
                except (IndexError,TypeError,AttributeError) as e:
                    d["other_description"]= None     
                
                try:
                    d["price"]=real[i].find("p",{"class":"prop-price"}).text.strip().replace("₦","").replace(",","")
                except (IndexError,TypeError,AttributeError) as e:
                    d["price"] = None
                l.append(d)
                #print(l)
                #print(" "

#### Convert output to dataframe

In [55]:
import pandas as pd
ld = pd.DataFrame(l)
ld

Unnamed: 0,page,location,specific_location,features,bedrooms,bathrooms,toilets,description,other_description,price
0,1,gbagada lagos,,1 bed 1 bath toilet,1,1,,1 bedroom mini flat Mini flat Flat / Apartment...,1 bedroom mini flat Mini flat Flat / Apartment...,450000
1,1,phase 2 gbagada lagos,phase 2,3 bed 3 bath 4 toilet,3,3,4,3 bedroom flat upstairs apartment ... ...,3 bedroom flat upstairs apartment ... ...,1500000
2,1,ifako gbagada gbagada lagos,ifako,2 bed 2 bath 2 toilet,2,2,2,"Decent 2 bedrooms ground flat in a block of 4,...","Decent 2 bedrooms ground flat in a block of 4,...",800000
3,1,ifako ifako gbagada gbagada lagos,ifako ifako,2 bed 2 bath 3 toilet,2,2,3,"2 bedroom apartment in a block of 4, upstairs ...","2 bedroom apartment in a block of 4, upstairs ...",1000000
4,1,millenuim ups gbagada lagos,millenuim ups,3 bed 3 bath 4 toilet,3,3,4,two tenants in the compound ... Security. Park...,two tenants in the compound ... Security. Park...,1800000
...,...,...,...,...,...,...,...,...,...,...
5116,11,,,,,,,,,
5117,12,,,,,,,,,
5118,13,,,,,,,,,
5119,14,,,,,,,,,


In [0]:
scrapedata=ld.to_csv("data1.csv",index=False)

In [57]:
ld.head(10)

Unnamed: 0,page,location,specific_location,features,bedrooms,bathrooms,toilets,description,other_description,price
0,1,gbagada lagos,,1 bed 1 bath toilet,1,1,,1 bedroom mini flat Mini flat Flat / Apartment...,1 bedroom mini flat Mini flat Flat / Apartment...,450000
1,1,phase 2 gbagada lagos,phase 2,3 bed 3 bath 4 toilet,3,3,4.0,3 bedroom flat upstairs apartment ... ...,3 bedroom flat upstairs apartment ... ...,1500000
2,1,ifako gbagada gbagada lagos,ifako,2 bed 2 bath 2 toilet,2,2,2.0,"Decent 2 bedrooms ground flat in a block of 4,...","Decent 2 bedrooms ground flat in a block of 4,...",800000
3,1,ifako ifako gbagada gbagada lagos,ifako ifako,2 bed 2 bath 3 toilet,2,2,3.0,"2 bedroom apartment in a block of 4, upstairs ...","2 bedroom apartment in a block of 4, upstairs ...",1000000
4,1,millenuim ups gbagada lagos,millenuim ups,3 bed 3 bath 4 toilet,3,3,4.0,two tenants in the compound ... Security. Park...,two tenants in the compound ... Security. Park...,1800000
5,1,ifako gbagada gbagada lagos,ifako,3 bed 3 bath 4 toilet,3,3,4.0,3 bedroom Self Contain Flat / Apartment for re...,3 bedroom Self Contain Flat / Apartment for re...,1600000
6,1,odo-eran iyana orowo gbagada oworonshoki gbaga...,odoeran iyana orowo,2 bed 2 bath 3 toilet,2,2,3.0,Newly built two bedroom flat with all rooms en...,Newly built two bedroom flat with all rooms en...,550000
7,1,after deperlife church gbagada lagos,after deperlife church,1 bed 1 bath 1 toilet,1,1,1.0,Mini flat (1bedroom Apartment with kitchen toi...,Mini flat (1bedroom Apartment with kitchen toi...,450000
8,1,soluyi gbagada lagos,soluyi,1 bed 1 bath 1 toilet,1,1,1.0,Mini flat with kitchen and prepaid metre plus ...,Mini flat with kitchen and prepaid metre plus ...,450000
9,1,"off gbagada phase2 estate, gbagada, phase 2 gb...",off,2 bed 3 bath 3 toilet,2,3,3.0,A newly built classical luxuriously built 2bed...,A newly built classical luxuriously built 2bed...,1300000


In [58]:
sum(ld.apply(lambda x: sum(x.isnull().values), axis = 0)>0)

9

In [0]:
#removing all null values
ld = ld.dropna(how='any',axis=0) 

In [60]:
#checking for null values
ld.isnull().sum()

page                 0
location             0
specific_location    0
features             0
bedrooms             0
bathrooms            0
toilets              0
description          0
other_description    0
price                0
dtype: int64

In [0]:
#ld['location'] = ld['location'].str.extract(r'(gbagada|ikeja|surulere|ogba|iyana ipaja|lekki|ajah|ikorodu)').map({'gbagada':'gbagada','ikeja':'ikeja','ogba':'ogba','iyana ipaja':'iyana ipaja','lekki':'lekki','ajah':'ajah','ikorodu':'ikorodu'})

In [61]:
ld["location"][ld['location'].str.contains("gbagada")] ="gbagada"
ld["location"][ld['location'].str.contains("ikeja")] ="ikeja"
ld["location"][ld['location'].str.contains("iyana ipaja")] ="iyana ipaja"
ld["location"][ld['location'].str.contains("surulere")] ="surulere"
ld["location"][ld['location'].str.contains("ogba")] ="ogba"
ld["location"][ld['location'].str.contains("lekki")] ="lekki"
ld["location"][ld['location'].str.contains("ajah")] ="ajah"
ld["location"][ld['location'].str.contains("ikorodu")] ="ikorodu"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value 

In [62]:
#testing the values
ld['location'][ld['location']=='ogba']

1673    ogba
1675    ogba
1676    ogba
1679    ogba
1680    ogba
        ... 
1790    ogba
4994    ogba
5001    ogba
5111    ogba
5113    ogba
Name: location, Length: 97, dtype: object

In [63]:
ld.head()

Unnamed: 0,page,location,specific_location,features,bedrooms,bathrooms,toilets,description,other_description,price
0,1,gbagada,,1 bed 1 bath toilet,1,1,,1 bedroom mini flat Mini flat Flat / Apartment...,1 bedroom mini flat Mini flat Flat / Apartment...,450000
1,1,gbagada,phase 2,3 bed 3 bath 4 toilet,3,3,4.0,3 bedroom flat upstairs apartment ... ...,3 bedroom flat upstairs apartment ... ...,1500000
2,1,gbagada,ifako,2 bed 2 bath 2 toilet,2,2,2.0,"Decent 2 bedrooms ground flat in a block of 4,...","Decent 2 bedrooms ground flat in a block of 4,...",800000
3,1,gbagada,ifako ifako,2 bed 2 bath 3 toilet,2,2,3.0,"2 bedroom apartment in a block of 4, upstairs ...","2 bedroom apartment in a block of 4, upstairs ...",1000000
4,1,gbagada,millenuim ups,3 bed 3 bath 4 toilet,3,3,4.0,two tenants in the compound ... Security. Park...,two tenants in the compound ... Security. Park...,1800000
