
# Web Scraping Nigeria Housing Data

---



Housing data is not readily available in Nigeria, so data professionals in this part of the world often rely on web scraping to bridge the gap. In this project, I collect data for a Nigerian house price project by scraping three popular websites used to search for homes. The data is meant to answer business questions such as:

- Average Price of Homes by Location
- Average Price of Homes by Type

The three sources are:

- Nigeria Property Centre: https://nigeriapropertycentre.com/
- Property Pro: https://www.propertypro.ng/
- Jiji: https://jiji.ng/api_web/v1/listing?slug=houses-apartments-for-rent

The description of the home or apartment, its location, and its price are collected from each of the websites and the API. Information such as room size is not included because the websites do not provide it; further research also suggests that size is determined by the owners of the properties.

One of the most popular ways to scrape the web in Python is with BeautifulSoup, a Python package for extracting data from HTML and XML content. I use this method in this project and also introduce extracting data through APIs.
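
As a minimal sketch of how BeautifulSoup locates data, the snippet below parses a hand-written HTML card. The tag and class names match the ones used for Nigeria Property Centre later in this notebook, but the HTML itself is an assumption written for illustration and is not copied from the live site.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that imitates one listing card (an assumption, not real site markup).
sample_html = """
<div class="wp-block-body">
  <h4 class="content-title">3 bedroom flat / apartment for rent</h4>
  <address class="voffset-bottom-10">Thomas Estate, Ajah, Lagos</address>
  <span class="price">₦</span>
  <span class="price">2,000,000</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag; find_all() returns every match in a list.
card = soup.find("div", class_="wp-block-body")
title = card.find("h4", class_="content-title").get_text(strip=True)
address = card.find("address", class_="voffset-bottom-10").get_text(strip=True)
price_spans = card.find_all("span", class_="price")
amount = price_spans[1].get_text(strip=True)

print(title, "|", address, "|", amount)
```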

API is the acronym for Application Programming Interface. It is a software intermediary that allows two applications to communicate, and APIs are commonly used to share and retrieve data.

Read more here: https://en.wikipedia.org/wiki/API


# BeautifulSoup
## Nigeria Property Centre

In this section, I use requests to fetch the contents of the first two websites, parse them with BeautifulSoup, and finally store the collected data in a csv file.

In [None]:
# import necessary libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

In [None]:
# create an empty list to hold one row per listing
listings = []

# a browser-like User-Agent header, sent with every request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

# scrape pages 1 to 999 (range excludes the stop value)
for i in range(1, 1000):
  # build the url for the current page
  url = f"https://nigeriapropertycentre.com/for-rent?page={i}"

  # get the content of the page
  response = requests.get(url, headers=headers)
  # skip pages that do not return status code 200
  if response.status_code != 200:
    continue

  # parse the page content with BeautifulSoup
  soup = BeautifulSoup(response.content, "html.parser")

  # the data needed is stored in cards
  # find all the cards using the find_all function
  # pass in the element and class where the card is stored
  items = soup.find_all('div', class_='wp-block-body')

  # iterate through all the cards and collect the data needed
  # by locating where each value is nested in the html
  for item in items:
    description = item.find('h4', class_='content-title').get_text()
    location = item.find('address', class_='voffset-bottom-10').get_text()
    # the price is stored in a span tag
    # two span tags in each card share the class name 'price'
    # so find_all is used to collect both of them in a list
    prices = item.find_all('span', class_='price')
    # the amount is in the second position of the list
    # use indexing to locate it
    price = prices[1].get_text()
    # append the row to the listings list
    listings.append([description, location, price])
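
The loop above requests a fixed range of pages. If a site has fewer pages than the range covers, the remaining requests return nothing useful. One possible pattern, sketched below, is to stop at the first page that yields no cards. Both `scrape_until_empty` and `fake_fetch_cards` are hypothetical names introduced for illustration; the fake fetcher stands in for the request-and-parse step so the sketch runs without network access.

```python
from typing import Callable, List


def scrape_until_empty(fetch_cards: Callable[[int], List[list]], max_pages: int = 1000) -> List[list]:
    """Collect rows page by page, stopping at the first page with no cards."""
    rows: List[list] = []
    for page in range(1, max_pages + 1):
        cards = fetch_cards(page)
        if not cards:
            # an empty page is treated as the end of the listings
            break
        rows.extend(cards)
    return rows


# Stand-in for a real page fetcher: pretends the site has only three pages.
def fake_fetch_cards(page: int) -> List[list]:
    if page > 3:
        return []
    return [[f"listing {page}-{n}", "Lagos", "1,000,000"] for n in range(2)]


rows = scrape_until_empty(fake_fetch_cards)
print(len(rows))  # 6
```

In a real run, the fetcher would wrap the requests and BeautifulSoup calls shown above and return one list per card.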

In [None]:
# convert the list into a dataframe 
df= pd.DataFrame(listings, columns=['Description', 'Location', 'Price'])
df.head()

Unnamed: 0,Description,Location,Price
0,4 bedroom semi-detached duplex for rent,"Lekki Conservation Road, Lekki, Lagos",5500000
1,4 bedroom terraced duplex for rent,"Ikota, Lekki, Lagos",3500000
2,3 bedroom flat / apartment for rent,"Thomas Estate, Ajah, Lagos",2000000
3,2 bedroom flat / apartment for rent,"By Dunamis, Lugbe District, Abuja",1500000
4,Self contain (single rooms) for rent,Newly Built Roomself At Diya Road Gbagada Fo...,600000


In [None]:
# write the dataframe to a csv file
# the file is saved when the cell is run
df.to_csv('Nigerian Properties.csv')

In [None]:
df.shape

(20979, 3)

## Property Pro

The same method used for the previous website is applied here. The difference is in how the data is accessed, since the two websites have different structures.

In [None]:
listed_homes = []

for i in range(1,1000):
  url = f'https://www.propertypro.ng/property-for-rent?psafe_param=1&gclid=CjwKCAjw8-OhBhB5EiwADyoY1YL38QnX6VQtuVQEL5gD74EeWUTHlnUMAgpo9EOS1_CSoeSt-YzdHxoCF_oQAvD_BwE&page={i}'
  headers ={'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'}
  res = requests.get(url,headers=headers)

  pro_soup = BeautifulSoup(res.content, 'html.parser')
  homes = pro_soup.find_all('div', class_='single-room-sale listings-property')
  # print(homes)

  for home in homes:
    # description and location are both stored in h4 tags
    # the location tag has no class attached to it
    # so find_all is used to collect all the h4 tags in a list
    # and each one is accessed by indexing
    tag= home.find_all('h4')
    description= tag[0].get_text()
    location = tag[1].get_text()
    prices = home.find('h3', class_='listings-price').get_text()
    
    listed_homes.append([description, location, prices])
    

In [None]:
dx= pd.DataFrame(listed_homes, columns=['Description', 'Location', 'Prices'])
dx.head()

Unnamed: 0,Description,Location,Prices
0,5 BEDROOM HOUSE FOR RENT,Maitama Abuja,"₦ 35,000,000/year"
1,5 BEDROOM HOUSE FOR RENT,Katampe Ext Abuja,"₦ 7,000,000/year"
2,4 BEDROOM HOUSE FOR RENT,"Idado/agungi, Lekki Lagos","₦ 6,000,000/year"
3,COMMERCIAL PROPERTY FOR RENT,"Phase 1, Lekki Lagos","₦ 3,500,000/year"
4,4 BEDROOM HOUSE FOR RENT,"Legislative Quarters, Zone E Apo Abuja","₦ 5,000,000/year"
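
The Prices column above holds strings such as "₦ 35,000,000/year", which cannot be averaged directly. A minimal sketch of one way to convert them to integers with the re module follows. The helper name `parse_price` is hypothetical, and the sketch assumes prices are whole naira amounts with no decimal part.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[int]:
    """Return the digits of a price string as an int, or None if there are none."""
    # drop everything that is not a digit: currency symbol, commas, spaces, "/year"
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


print(parse_price("₦ 35,000,000/year"))  # 35000000
```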


In [None]:
dx.to_csv('Property Homes.csv')

In [None]:
dx.shape

(18913, 3)

# API

There are different types of APIs, but the most widely used is the REST API. The API content is requested with requests and then parsed as JSON. JSON stands for JavaScript Object Notation; it is a human-readable file format and data interchange format. The data is stored as key-value pairs and arrays. Its syntax:
- Data is in key-value pairs
- Data is separated by commas
- Curly brackets hold objects
- Square brackets hold arrays

Once parsed, it is similar to a Python dictionary and can be accessed the same way.

Read more about JSON: https://blog.hubspot.com/website/json-files
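
To make the dictionary comparison concrete, here is a small sketch that parses a JSON string with the standard json module. The key names mirror those used in the Jiji section of this notebook; the JSON text itself is made up for illustration and is not a real API response.

```python
import json

# Made-up response text that imitates the nesting used in this notebook.
raw = """
{
  "adverts_list": {
    "adverts": [
      {
        "fb_view_content_data": {"content_name": "1bdrm Apartment in Ido for Rent"},
        "region_name": "Ido",
        "price_obj": {"value": 110000}
      }
    ]
  }
}
"""

# JSON objects become Python dicts and JSON arrays become Python lists.
data = json.loads(raw)
first = data["adverts_list"]["adverts"][0]
row = [
    first["fb_view_content_data"]["content_name"],
    first["region_name"],
    first["price_obj"]["value"],
]
print(row)
```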

## Jiji

In [None]:
homes = []



for i in range(1, 1000):
    # build the api url for the current page
    url= f'https://jiji.ng/api_web/v1/listing?slug=houses-apartments-for-rent&page={i}'
    response = requests.get(url)
    # print(response.status_code)

    # parse the response content as json
    data = response.json()
    # print(data)

    # access the content of the response
    # similar to the previous websites, the data is stored in cards
    # and each card is one advert in the adverts list
    add_list= data["adverts_list"]["adverts"]

    for add in add_list:
        description = add['fb_view_content_data']['content_name']
        location = add['region_name']
        price = add['price_obj']['value']
        # print(len(price))
        
        homes.append([description, location, price])

In [None]:
dy = pd.DataFrame(homes, columns= ['Description', 'Location', 'Price'])
dy.head()

Unnamed: 0,Description,Location,Price
0,1bdrm Apartment in Ido for Rent,Ido,110000
1,1bdrm Apartment in Ido for Rent,Ido,150000
2,"1bdrm Bungalow in Iragon, Badagry / Badagry fo...",Badagry / Badagry,100000
3,4bdrm House in Durumi for Rent,Durumi,9000000
4,5bdrm Duplex in Ikoyi Lagos for Rent,Ikoyi,30000000


In [None]:
dy.shape

(19642, 3)

In [None]:
dy.to_csv('Jiji.csv')
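
With numeric prices, the first business question (average price by location) could be approached with a pandas groupby. The sketch below uses three rows taken from the Jiji preview above rather than the full scraped data, so the averages are only illustrative.

```python
import pandas as pd

# A few rows in the same shape as the scraped tables (taken from the Jiji preview).
sample = pd.DataFrame(
    [
        ["1bdrm Apartment in Ido for Rent", "Ido", 110000],
        ["1bdrm Apartment in Ido for Rent", "Ido", 150000],
        ["4bdrm House in Durumi for Rent", "Durumi", 9000000],
    ],
    columns=["Description", "Location", "Price"],
)

# group the rows by location and take the mean price of each group
avg_by_location = sample.groupby("Location")["Price"].mean()
print(avg_by_location)
```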