# LAND PRICE PREDICTION APP USING AWS SAGEMAKER'S IN-BUILT XGBOOST  - End-to-End
We will build a Land Price Prediction App that helps people looking to buy land in Cameroon get the expected price of land in the quartier (neighbourhood) where they intend to buy.
The following steps will be taken:
- I)   PROBLEM STATEMENT:

Many people in Cameroon want to buy land, but they have trouble finding out what price per square metre to expect in the quartier where they want to buy. They also want to compare the prices of several quartiers before making their final choice.
This is a difficult process in Cameroon, because prospective buyers have to make many phone calls to ask people about the price of land in those quartiers.
So the objective is to scrape the data already available on the biggest classified ads website in Cameroon (Jumia Cameroon): https://www.jumia.cm/en/land-plots

This data will be cleaned and used to train the built-in XGBoost algorithm on AWS SageMaker. An endpoint will then be created in AWS, which will be used to make predictions from the following inputs:
- The quartier the customer wants to buy land in
- The size of the land the customer intends to buy (in square metres)

The output of the model will be the predicted price per square metre for the quartier the customer requested.


- II)   SCRAPING THE DATA:

Scrape the data from a classified ads website where people post land for sale per quartier in Cameroon. Sellers typically enter the price per square metre and the total area of the land available for sale.
- III)  PERFORM EXPLORATORY DATA ANALYSIS 

Inspect the data to validate the quality of what was scraped from the classified ads website. Analyse the distribution of missing values and outliers, and gain other relevant insights from the data.
- IV) DO FEATURE ENGINEERING & SELECTION

Handle the missing values and outliers, and apply the transformations needed to make the data well suited to the machine learning model, making the most of the insights gained during the Exploratory Data Analysis phase.
- V)  BUILD,TRAIN AND DEPLOY THE MODEL IN SAGEMAKER

The Boto3 library (the AWS SDK for Python) will be used to create the S3 buckets that store the preprocessed dataset. SageMaker's built-in XGBoost algorithm will then be built, trained and deployed, with hyperparameters tuned to get the best RMSE (Root Mean Squared Error). An endpoint will be created once the model is built.
This endpoint will be used to predict the price per square metre when the inputs "Quartier" and "Land size" are fed to it.
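
As a rough illustration of what feeding "Quartier" and "Land size" to the endpoint could look like, here is a minimal sketch. It assumes the endpoint accepts a single `text/csv` row (one of the formats SageMaker's built-in XGBoost accepts for inference) and that the quartier has already been encoded as a number during feature engineering. The `quartier_codes` mapping, the feature order and the endpoint name are hypothetical placeholders, not values from this project.

```python
def build_csv_payload(quartier, land_size_m2, quartier_codes):
    """Build one CSV row (encoded quartier, land size) for the endpoint.

    Assumption: the model was trained on exactly these two features, in this order.
    """
    code = quartier_codes[quartier]  # raises KeyError for an unknown quartier
    return f"{code},{land_size_m2}"


def predict_price_per_m2(endpoint_name, payload):
    """Send the CSV row to a deployed endpoint and return the prediction as a float."""
    import boto3  # imported here so build_csv_payload can be used without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,  # hypothetical name, set at deployment time
        ContentType="text/csv",
        Body=payload,
    )
    return float(response["Body"].read().decode("utf-8"))


# Hypothetical encoding, for illustration only
example_codes = {"Bastos": 0, "Odza": 1}
example_payload = build_csv_payload("Odza", 500, example_codes)
```

Calling `predict_price_per_m2("my-endpoint", example_payload)` would only work once an endpoint with that name actually exists in the AWS account.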

### II) SCRAPING THE DATA
We will perform the following tasks in order to successfully scrape the data we need:
- a.) Importing the necessary Libraries 
- b.) Writing the ETL functions to obtain the data 
- c.) Scraping and storing the data to a dictionary
- d.) Saving the final scraped dataframe to a CSV file

#### a.) Importing all the necessary libraries 

In [1]:
# Importing Libraries required to scrape the data
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### b.) Writing ETL functions to Extract and Load the data to a Dictionary

In [1]:
# Create the function using Requests and BeautifulSoup to collect the URLs of the pages we will need to scrape
def get_urls(page_number):
    base_url = 'https://www.jumia.cm'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
    # Pass headers by keyword: the second positional argument of requests.get is params, not headers
    request = requests.get(f'https://www.jumia.cm/en/land-plots?page={page_number}&xhr=ugmii', headers=headers)
    soup = BeautifulSoup(request.text, 'html.parser')
    partial_url_list = soup.find_all('article')
    for partial_url in partial_url_list:
        new_url = base_url + partial_url.find('a')['href']
        url_list.append(new_url)
        print(f"Getting the Urls for page {page_number}")
    return
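
The function above builds each listing URL by concatenating `base_url` with the `href` of the first link in every `article`. A slightly more defensive variant, sketched below, uses `urllib.parse.urljoin` so that an `href` that is already absolute is left untouched, and removes repeated URLs while keeping their order. The helper name `to_absolute_urls` and the sample paths are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.jumia.cm'


def to_absolute_urls(hrefs, base_url=BASE_URL):
    """Turn relative hrefs into absolute URLs and drop duplicates, keeping order."""
    absolute = [urljoin(base_url, href) for href in hrefs]
    # dict.fromkeys keeps the first occurrence of each URL in insertion order
    return list(dict.fromkeys(absolute))


# Made-up hrefs, for illustration only
sample_hrefs = [
    '/en/land-plots/example-1',
    'https://www.jumia.cm/en/land-plots/example-2',
    '/en/land-plots/example-1',
]
sample_urls = to_absolute_urls(sample_hrefs)
```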

In [3]:
# Create a function using BeautifulSoup to download and parse each URL collected by the function above
def extract_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
    # Pass headers by keyword: the second positional argument of requests.get is params, not headers
    request = requests.get(url, headers=headers)
    soup = BeautifulSoup(request.text, 'html.parser')
    return soup
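
When scraping many pages, a request can hang or come back with an error status. The sketch below shows one way to guard against both with a timeout and `raise_for_status()`. To keep it testable without a network connection, the HTTP function is passed in as the `get` argument (in practice this would be `requests.get`); the name `fetch_html` and the 10-second default are assumptions, not part of the original notebook.

```python
def fetch_html(url, get, headers=None, timeout=10):
    """Download a page and return its HTML, failing loudly on HTTP errors.

    `get` is expected to behave like requests.get.
    """
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises if the server answered with a 4xx/5xx status
    return response.text
```

With Requests installed, usage would look like `BeautifulSoup(fetch_html(url, requests.get, headers=headers), 'html.parser')`.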

In [4]:
# Create a function to pull the price, location and area out of a parsed page and append them to land_data_list as a dictionary
def transform_page(soup):
    main_div = soup.find('div', class_='twocolumn')
    price = main_div.find('span', {'class': 'price'}).get_text(strip=True).replace('FCFA',"")
    location = main_div.select('dl > dd')[1].text.strip()
    try:
        area = main_div.find_all('h3')[1].get_text(strip=True).replace('Area', "").replace(' m2',"")
    except IndexError:
        area = ''

    items = {
        'Price': price,
        'Location': location,
        'Area': area
    }
    land_data_list.append(items)

    print(f"Scrapping the page '{soup.find('title').text}'...")
    return

#### c.) Scraping and Storing the data into a dictionary

In [5]:
# Extracting all the URLs from page 1 up to the number of pages required. In this case I extracted just 1 page as a demo
url_list = []
for page_number in range(1, 2):
    get_urls(page_number)

Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1


In [7]:
# Extracting and transforming all the data from the pages selected above
land_data_list = []
for url in url_list:
    page = extract_page(url)
    transform_page(page)

Scrapping the page 'Terrain Titré 2000 M² À Odza  | Odza | Jumia Deals'...
Scrapping the page 'OPPORTUNITÉ De TERRAIN À OMNISPORT | Omnisports | Jumia Deals'...
Scrapping the page 'Vente terrain titré de 2hectared à Bastos | Bastos | Jumia Deals'...
Scrapping the page 'Terrain À Vendre a Bonaberi ( Bonadale) | Bonaberi | Jumia Deals'...
Scrapping the page 'Terrain Titré Et Viable a Ngombé | Douala | Jumia Deals'...
Scrapping the page 'A Vendre  Terrain | Limbé | Jumia Deals'...
Scrapping the page 'A Vendre  Terrain | Limbé | Jumia Deals'...
Scrapping the page 'A Vendre  Terrain | Limbé | Jumia Deals'...
Scrapping the page 'Terrain À Vendre a Bonaberi ( BONADALE) | Bonaberi | Jumia Deals'...
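
With more than a handful of pages, a single listing with an unexpected layout would stop the loop above: `soup.find(...)` returns `None` when the `twocolumn` div is missing, so the next `.find` call raises `AttributeError`, and `select('dl > dd')[1]` raises `IndexError` when there are fewer than two `dd` elements. Here is a minimal sketch of a loop that records such pages and carries on, pausing between requests. The function name `scrape_all` and the one-second delay are assumptions made for the example.

```python
import time


def scrape_all(urls, extract, transform, delay_seconds=1.0):
    """Run extract then transform on each URL; return the URLs that failed to parse."""
    failed = []
    for url in urls:
        try:
            transform(extract(url))
        except (AttributeError, IndexError) as error:
            # The page did not have the expected structure; keep it for later inspection
            failed.append((url, repr(error)))
        time.sleep(delay_seconds)  # small pause between requests to go easy on the website
    return failed
```

In this notebook it would be called as `failed_pages = scrape_all(url_list, extract_page, transform_page)`.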


#### d.) Saving the scraped data as a CSV file using pandas

In [8]:
# Creating a pandas dataframe
df = pd.DataFrame(land_data_list)
print('Printing first 05 elements...')
print(df.head())


Printing first 05 elements...
     Price    Location    Area
0   15,500        Odza    2000
1   65,000  Omnisports     630
2  500,000      Bastos  20.000
3    8,000    Bonaberi    5000
4    7,000      Douala    8500


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Price     9 non-null      object
 1   Location  9 non-null      object
 2   Area      9 non-null      object
dtypes: object(3)
memory usage: 344.0+ bytes


In [10]:
# Formatting the Area and Price columns from text to numeric
# Assign the result back instead of using inplace=True on a column, which newer pandas versions warn about
df['Area'] = df['Area'].replace({' m2': '', ',': ''}, regex=True)
df['Area'] = pd.to_numeric(df['Area'], errors='coerce')

df['Price'] = df['Price'].replace({'FCFA': '', ',': ''}, regex=True)
df['Price'] = pd.to_numeric(df['Price'])

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Price     9 non-null      int64  
 1   Location  9 non-null      object 
 2   Area      9 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 344.0+ bytes


In [12]:
df.to_csv('land_price_data.csv',index = False)
df.head()

    Price    Location    Area
0   15500        Odza  2000.0
1   65000  Omnisports   630.0
2  500000      Bastos    20.0
3    8000    Bonaberi  5000.0
4    7000      Douala  8500.0
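
One thing to watch in this preview: the Bastos listing was scraped with an area of `20.000` and ends up as `20.0` after the conversion, although its title mentions 2 hectares (20,000 m²). The dot was most likely a thousands separator. Here is a minimal sketch of a parser that handles this case, under the assumption that a `.` or `,` followed by exactly three digits is always a thousands separator (which would misread a genuine decimal such as `1.500`). The helper name `parse_area` is made up for this example.

```python
import re


def parse_area(text):
    """Convert an area string such as '20.000', '1,500' or '630 m2' to square metres."""
    cleaned = text.replace('m2', '').replace(' ', '')
    # Assumption: '.' or ',' followed by exactly three digits is a thousands separator
    cleaned = re.sub(r'[.,](?=\d{3}(?:\D|$))', '', cleaned)
    try:
        return float(cleaned)
    except ValueError:
        return float('nan')  # empty or unreadable areas become missing values
```

It would be applied to the raw strings, before any numeric conversion, with `df['Area'] = df['Area'].map(parse_area)`.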


Great! We have scraped the data from the classified ads website and saved it as a CSV file (land_price_data.csv). Let us move on to the next phase: Exploratory Data Analysis.