# LAND PRICE PREDICTION APP USING AWS SAGEMAKER - End-to-End
We will build a Land Price Prediction App to help people looking to buy land in Cameroon. When they enter the name of a neighbourhood (quartier) and the size of the land they intend to purchase there, the app returns the expected price per square metre of land in that neighbourhood. They can also enter other neighbourhoods to compare prices.
Following the Best Practices for Machine Learning Projects on AWS and the CRISP-DM process, the following steps will be taken to build this machine learning app:
- I)   PROBLEM STATEMENT:

Many people in Cameroon want to buy land, but they have trouble finding out what price per square metre to expect in the quartier where they want to buy. They also want to be able to compare the prices of several quartiers before making their choice.
This is a difficult process in Cameroon, because prospective buyers have to make phone calls and ask around to learn the price of land in different areas.
So the objective is to scrape the data already available on the biggest classified ads website in Cameroon (Jumia Cameroon): https://www.jumia.cm/en/land-plots

This data will be cleaned and used to train the built-in XGBoost algorithm on AWS SageMaker. An endpoint will then be created in AWS and used to make predictions from the following inputs:
- The quartier the customer wants to buy land in
- The size of the land the customer intends to buy (in square metres)

The output of the model will be the predicted price per square metre for the quartier the customer requested.


- II)   SCRAPING THE DATA:

Scrape the data from a classified ads website where people post land for sale per quartier in Cameroon. Sellers typically enter the price per square metre and the total area of the land available for sale.
- III)  PERFORM EXPLORATORY DATA ANALYSIS 

Inspect the data to validate the quality of what was scraped from the classified ads website. Look at the distribution of missing values and outliers, and gather other insights that will be used in the Feature Engineering stage to prepare features from which the machine learning model can make accurate predictions.
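The two checks named above can be sketched with the standard library alone. This is a minimal illustration, assuming the scraped rows are held as a list of dictionaries with `Price`, `Location` and `Area` keys; the helper names `count_missing` and `iqr_outliers` are hypothetical, and the 1.5 × IQR rule is one common outlier convention rather than the one this project necessarily settles on.

```python
import statistics

# Sample rows in the shape the scraper produces (None marks a missing value)
rows = [
    {"Price": 350, "Location": "Mfou", "Area": 6000.0},
    {"Price": 60000, "Location": "Douala", "Area": 1000.0},
    {"Price": 35000, "Location": "PK13", "Area": 387.0},
    {"Price": 23000, "Location": "PK19", "Area": 500.0},
    {"Price": 30000, "Location": "PK12", "Area": None},
]

def count_missing(rows):
    """Count None values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

missing = count_missing(rows)
areas = [row["Area"] for row in rows if row["Area"] is not None]
area_outliers = iqr_outliers(areas)
print(missing, area_outliers)
```

With only a handful of rows the quartiles are very wide, so a rule like this becomes informative only once many pages have been scraped.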
- IV) DO FEATURE ENGINEERING & SELECTION

Handle the missing values and outliers, and apply the transformations needed to make the data well suited to the machine learning model, making full use of the insights gained in the Exploratory Data Analysis phase.
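As a rough illustration of what handling missing values and outliers can look like, the sketch below fills gaps with the median and clips values to fixed bounds. Both choices are assumptions made for the example: the helper names are hypothetical, and the bounds of 100 and 5000 square metres are arbitrary placeholders, not values derived from the data.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    return [median if v is None else v for v in values]

def clip(values, low, high):
    """Limit every value to the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]

areas = [6000.0, 1000.0, 387.0, 500.0, None]
filled = impute_median(areas)
# Placeholder bounds, chosen only to show the mechanics
clipped = clip(filled, 100.0, 5000.0)
print(filled, clipped)
```

Median imputation is used here because it is less sensitive to extreme listings than the mean; dropping incomplete rows would be an equally valid choice.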
- V)  BUILD,TRAIN AND DEPLOY THE MODEL IN SAGEMAKER

The Boto3 SDK will be used to create the S3 bucket that stores the preprocessed dataset. SageMaker's built-in XGBoost algorithm will then be trained and deployed, with hyperparameters tuned to get the best (lowest) RMSE (Root Mean Squared Error). An endpoint will be created once the model is trained.
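Two details of this step can be shown without an AWS account. For CSV training data, the built-in XGBoost algorithm expects the target in the first column and no header row, and RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below illustrates both; the helper names and the numeric `Location_code` column are hypothetical, and uploading the file to S3 and launching the training job are not shown.

```python
import csv
import io
import math

def to_training_csv(rows, target, features):
    """Serialise rows with the target first and no header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[target]] + [row[name] for name in features])
    return buffer.getvalue()

def rmse(actual, predicted):
    """Root Mean Squared Error of two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Location_code is a hypothetical numeric encoding of the quartier
rows = [
    {"Price": 60000, "Area": 1000.0, "Location_code": 0},
    {"Price": 35000, "Area": 387.0, "Location_code": 1},
]
training_csv = to_training_csv(rows, "Price", ["Area", "Location_code"])
print(training_csv)
```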

- VI)   MODEL INFERENCE IN SAGEMAKER

The Endpoint created above will be used to predict the price per metre square when the inputs of "Quartier" and "Land size" are entered.
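Because the model only accepts numeric features, the two inputs have to be encoded before they reach the endpoint. The sketch below shows one possible encoding, assuming the quartier was one-hot encoded during training and that the area is the first feature; `build_payload` and the list of known quartiers are hypothetical, and the column order must match whatever was used to train the model.

```python
def build_payload(quartier, area_m2, known_quartiers):
    """Return one CSV row: the area followed by one-hot quartier flags."""
    if quartier not in known_quartiers:
        raise ValueError(f"Unknown quartier: {quartier!r}")
    flags = [1 if name == quartier else 0 for name in known_quartiers]
    return ",".join(str(value) for value in [area_m2] + flags)

# Hypothetical list of quartiers seen at training time, in training order
known = ["Douala", "Mfou", "PK13"]
payload = build_payload("Mfou", 500, known)
print(payload)
```

A string like this could then be sent as a `text/csv` request body, for example through the `invoke_endpoint` call of Boto3's SageMaker runtime client.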

### II) SCRAPING THE DATA
We will perform the following tasks in order to successfully scrape the data we need:
- a.) Importing all the necessary Libraries 
- b.) Writing the ETL functions to obtain the data 
- c.) Scraping and storing the data to a dictionary
- d.) Saving the final scraped dataframe to a CSV file using pandas

#### a.) Importing all the necessary libraries 

In [2]:
# Importing Libraries required to scrape the data
import requests
from bs4 import BeautifulSoup
import pandas as pd

#Importing Libraries required to store the scraped data in an AWS S3 bucket
#import sagemaker
#import boto3
#from sagemaker.session import s3_input, Session

#### b.) Writing ETL functions to Extract and Load the data to a Dictionary

In [3]:
# Collect the URLs of the listings on one results page using requests and BeautifulSoup
def get_urls(page_number):
    base_url = 'https://www.jumia.cm'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
    # Pass headers by keyword: the second positional argument of requests.get is params
    request = requests.get(f'https://www.jumia.cm/en/land-plots?page={page_number}&xhr=ugmii', headers=headers)
    soup = BeautifulSoup(request.text, 'html.parser')
    partial_url_list = soup.find_all('article')
    for partial_url in partial_url_list:
        new_url = base_url + partial_url.find('a')['href']
        # url_list is the module-level list created before this function is called
        url_list.append(new_url)
        print(f"Getting the Urls for page {page_number}")
    return
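The string concatenation in `get_urls` works as long as every `href` is a site-relative path. A slightly more defensive alternative, sketched here as an assumption rather than as part of the original scraper, is `urllib.parse.urljoin`, which also leaves already absolute links untouched. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.jumia.cm"

def absolute_url(href, base_url=BASE_URL):
    """Resolve a relative or absolute href against the site root."""
    return urljoin(base_url, href)

from_relative = absolute_url("/en/land-plots?page=2")
from_absolute = absolute_url("https://www.jumia.cm/en/land-plots")
print(from_relative, from_absolute)
```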

In [4]:
# Download one listing page and parse it with BeautifulSoup
def extract_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
    # Pass headers by keyword: the second positional argument of requests.get is params
    request = requests.get(url, headers=headers)
    soup = BeautifulSoup(request.text, 'html.parser')
    return soup

In [5]:
# Create function to obtain the data we need from all those URLs above and store in a dictionary
def transform_page(soup):
    main_div = soup.find('div', class_='twocolumn')
    price = main_div.find('span', {'class': 'price'}).get_text(strip=True).replace('FCFA',"")
    location = main_div.select('dl > dd')[1].text.strip()
    try:
        area = main_div.find_all('h3')[1].get_text(strip=True).replace('Area', "").replace(' m2',"")
    except IndexError:
        area = ''

    items = {
        'Price': price,
        'Location': location,
        'Area': area
    }
    land_data_list.append(items)

    print(f"Scrapping the page '{soup.find('title').text}'...")
    return

#### c.) Scraping and Storing the data into a dictionary

In [6]:
# Extracting all the URLs from page 1 up to the number of pages required
url_list = []
for page_number in range(1, 2):
    get_urls(page_number)

Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1
Getting the Urls for page 1


In [7]:
# Extracting and transforming all the data from the pages selected above
land_data_list = []
for url in url_list:
    page = extract_page(url)
    transform_page(page)

Scrapping the page 'Terrain 60 Hectares À Nkoabang À Vendre | Mfou | Jumia Deals'...
Scrapping the page 'Terrain  Titré à vendre Logbessou 200m2/1000m2 /500m2  | Douala | Jumia Deals'...
Scrapping the page 'TERRAIN A VENDRE A PK 13 | PK13 | Jumia Deals'...
Scrapping the page 'Terrain titré à vendre a pk 19 | PK19 | Jumia Deals'...
Scrapping the page 'TERRAIN TITRE A VENDRE A PK 12 | PK12 | Jumia Deals'...
Scrapping the page 'TERRAIN TITRE A VENDRE A  ARIE ( après bocom village) | Village | Jumia Deals'...
Scrapping the page 'Terrain à vendre  | Bonaberi | Jumia Deals'...
Scrapping the page 'Terrain Titre a Vendre  a Pk 26 | PK26 | Jumia Deals'...
Scrapping the page 'Terrain commercial à vendre : Odza 3000 m² | Odza | Jumia Deals'...
Scrapping the page 'Terrain à vendre : Yaoundé - Simbock 1 200 m² | Yaoundé | Jumia Deals'...


#### d.) Saving the scraped data as a CSV file using pandas

In [8]:
# Creating a pandas dataframe
df = pd.DataFrame(land_data_list)
print('Printing first 05 elements...')
print(df.head())


Printing first 05 elements...
    Price Location  Area
0     350     Mfou  6000
1  60,000   Douala  1000
2  35,000     PK13   387
3  23,000     PK19   500
4  30,000     PK12      


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Price     10 non-null     object
 1   Location  10 non-null     object
 2   Area      10 non-null     object
dtypes: object(3)
memory usage: 368.0+ bytes


In [10]:
# Formatting the Area and Price columns from text to numeric
# Assign the result back instead of using inplace=True on a selected column
df['Area'] = df['Area'].replace({' m2': '', ',': ''}, regex=True)
# errors='coerce' turns blank or unparseable areas into NaN
df['Area'] = pd.to_numeric(df['Area'], errors='coerce')

df['Price'] = df['Price'].replace({'FCFA': '', ',': ''}, regex=True)
df['Price'] = pd.to_numeric(df['Price'])
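To make the effect of the cleaning step concrete, the sketch below mirrors it in plain Python for a single value: unit labels and thousands separators are stripped, and anything that still cannot be parsed becomes NaN, which is what `errors='coerce'` does for the `Area` column. The function name `to_number` is hypothetical and it is not a replacement for the pandas code above.

```python
import math
import re

def to_number(text):
    """Strip unit labels and separators; return NaN if nothing numeric remains."""
    cleaned = re.sub(r"FCFA|m2|,|\s", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return math.nan

price = to_number("60,000")
area = to_number(" 387 m2")
blank = to_number("")
print(price, area, blank)
```

This is why a listing scraped with an empty area ends up as a missing value after conversion instead of raising an error.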

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Price     10 non-null     int64  
 1   Location  10 non-null     object 
 2   Area      9 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 368.0+ bytes


In [12]:
df.to_csv('land_price_data.csv',index = False)
df.head()

   Price Location    Area
0    350     Mfou  6000.0
1  60000   Douala  1000.0
2  35000     PK13   387.0
3  23000     PK19   500.0
4  30000     PK12     NaN

