# LAND PRICE PREDICTION APP USING AWS SAGEMAKER'S IN-BUILT XGBOOST  - End-to-End
We will build a Land Price Prediction App to help people looking to buy land in Cameroon, get the expected price of land per quartier they intend to buy land from.
The following steps will be taken:
- I)   PROBLEM STATEMENT:

Many people in Cameroon want to buy lands and they have trouble getting information on what to expect as price per square metre for the quartier they want to buy the land from.They also want to be able to consult the prices of several quartiers before making their final choice.
This is a difficult process in Cameroon as it will mean these people who want to buy lands will have to go about making many phone calls to people asking them the price of land in those quartiers.
So the objective is to scrape the data already available on the biggest Classified adds website in Cameroon (Jumia Cameroon) https://www.jumia.cm/en/land-plots

This data will be cleaned and trained using the in-built XGBoost Algorithm on AWS Sagemaker, and an endpoint will be created in AWS ,which wll be used to make predictions when given the inputs like 
- The Quartier the customer wants to buy land from
- The size of the land the customer intends to buy (in metres square)
- And the output of the model will be the predicted Price per metres square for the Quartier the customer requested.


- II)   SCRAPING THE DATA:

Scrape the data from a Classified ads website, where people post lands for sale per quartier in Cameroon.They typically type in the price per metres square and the total area of the land availlable for sale.
- III)  PERFORM EXPLORATORY DATA ANALYSIS 

Inspect the data to validate the quality of the data scraped from the classified ads website. Analyse the distribution of missing values, outliers and gain other relevant insights from the model
- IV) DO FEATURE ENGINEERING & SELECTION

Handle the mising values, outliers and do the necessary transformations which will ensure the data is well suited for the machine learning model.And also to maximise the insights gotten from the Exploratory Data Analysis phase.
- V)  BUILD,TRAIN AND DEPLOY THE MODEL IN SAGEMAKER

The Boto3 Container will be used to create the S3 buckets to store the preprocessed dataset.The Sagemaker's inbuilt XGBoost algorithm, will be built, trained and deployed.Including the use of optimal hyperparameters to get the best results for the RMSE( Root Mean Squared Error).An Endpoint will be created after the model is built.
The Endpoint created awill be used to predict the price per metre square when the inputs of "Quartier" and "Land size" are fed to the endpoint.

### V) BUILD,TRAIN AND DEPLOY THE MODEL IN SAGEMAKER
We will perform the following tasks, in order to successully scrape the data we need
- a.) Importing the necessary Libraries and create S3 bucket
- b.) Download the train and test data and store in S3
- c.) Build and Train the Inbuilt XGBoost model
- d.) Deploy the model to an Endpoint
- e.) Test the predictions
- f.) Delete the Endpoint
- g.) Conclusion

#### a.) Importing all the necessary libraries and creating S3 bucket

In [1]:
import sagemaker
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri 
from sagemaker.image_uris import retrieve
from sagemaker.session import s3_input, Session

In [2]:
bucket_name = 'landpriceapp' # <--- Give this a unique name, since there can be no 02 bucket names in AWS
my_region = boto3.session.Session().region_name # set the region of the instance
print(my_region)

us-east-1


In [3]:
s3 = boto3.resource('s3')
try:
    if  my_region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ',e)

S3 bucket created successfully


In [4]:
# set an output path where the trained model will be saved
prefix = 'xgboost-inbuilt-algo'
output_path ='s3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)

s3://landpriceapp/xgboost-inbuilt-algo/output


#### b.) Download the train and test data and store in S3

In [9]:
import pandas as pd

import numpy as np
import urllib

pd.set_option("display.max_columns", None) #setting pandas to display all columns

In [10]:
#Importing the train dataset
try:
    urllib.request.urlretrieve ("https://raw.githubusercontent.com/Bandolo/AWS-Machine-Learning-Projects/main/LandPriceApp-XGBoost-Sagemaker/train_clean.csv", "train_clean.csv")
    print('Success: downloaded train_clean.csv.')
except Exception as e:
    print('Data load error: ',e)

try:
    train_clean = pd.read_csv('./train_clean.csv')
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

Success: downloaded train_clean.csv.
Success: Data loaded into dataframe.


In [11]:
#Importing the test dataset
try:
    urllib.request.urlretrieve ("https://raw.githubusercontent.com/Bandolo/AWS-Machine-Learning-Projects/main/LandPriceApp-XGBoost-Sagemaker/test_clean.csv", "test_clean.csv")
    print('Success: downloaded train_clean.csv.')
except Exception as e:
    print('Data load error: ',e)

try:
    test_clean = pd.read_csv('./test_clean.csv')
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

Success: downloaded train_clean.csv.
Success: Data loaded into dataframe.


In [12]:
print(test_clean.shape)

(624, 26)


In [13]:
print(train_clean.head())

     Area  Price_log  Awae  Bastos  Bonaberi  Bonamoussadi  Japoma  Kotto  \
0  3000.0  10.463103     0       0         0             0       0      0   
1   200.0  11.608236     0       0         0             0       0      0   
2   450.0  10.596635     0       0         0             0       0      0   
3   798.0  10.596635     0       0         0             0       0      0   
4  1900.0   9.798127     0       0         0             0       0      0   

   Kribi  Lendi  Limbé  Logbessou  Logpom  Makepe  Mfou  Nkoabang  Odza  \
0      1      0      0          0       0       0     0         0     0   
1      0      0      0          0       0       1     0         0     0   
2      0      0      0          0       0       0     0         0     0   
3      0      0      0          0       0       0     0         0     1   
4      0      0      0          0       0       0     0         0     0   

   Olembe  Omnisports  PK12  PK14  PK16  Soa  Village  Yaoundé  Yassa  
0       0     

In [14]:
print(test_clean.head())

     Area  Price_log  Awae  Bastos  Bonaberi  Bonamoussadi  Japoma  Kotto  \
0   500.0  10.463103     0       0         0             0       0      0   
1   500.0  11.002100     0       0         0             0       0      1   
2  3000.0   7.949091     0       0         0             0       0      0   
3   325.0  11.512925     0       0         0             0       0      0   
4  1000.0   9.903488     0       0         0             0       0      0   

   Kribi  Lendi  Limbé  Logbessou  Logpom  Makepe  Mfou  Nkoabang  Odza  \
0      0      0      0          0       0       0     0         0     0   
1      0      0      0          0       0       0     0         0     0   
2      0      0      0          0       0       0     1         0     0   
3      0      0      0          0       0       1     0         0     0   
4      0      0      0          0       0       0     0         0     1   

   Olembe  Omnisports  PK12  PK14  PK16  Soa  Village  Yaoundé  Yassa  
0       0     

In [68]:
### Saving Train And Test Into Buckets
## We start with Train Data
import os
pd.concat([train_clean['Price_log'], train_clean.drop(['Price_log'], 
                                                axis=1)], 
                                                axis=1).to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

In [69]:
# Test Data Into Buckets
pd.concat([test_clean['Price_log'], test_clean.drop(['Price_log'], 
                                              axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
s3_input_test = sagemaker.TrainingInput(s3_data='s3://{}/{}/test'.format(bucket_name, prefix), content_type='csv')

#### c.) Build and Train the Inbuilt XGBoost model

In [70]:
# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
# specify the repo_version depending on your preference.
container = retrieve('xgboost',boto3.Session().region_name,'latest')

In [71]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.25",
        "gamma":"0.3",
        "min_child_weight":"7",
        "subsample":"1",
        "objective":"reg:linear",
        "num_round":50
        }

In [72]:
# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path,
                                          use_spot_instances=True,
                                          max_run=300,
                                          max_wait=600)

In [73]:
estimator.fit({'train': s3_input_train,'validation': s3_input_test})

2022-02-25 15:06:07 Starting - Starting the training job...
2022-02-25 15:06:16 Starting - Launching requested ML instancesProfilerReport-1645801567: InProgress
.........
2022-02-25 15:07:57 Starting - Preparing the instances for training......
2022-02-25 15:09:08 Downloading - Downloading input data...
2022-02-25 15:09:38 Training - Training image download completed. Training in progress.
2022-02-25 15:09:38 Uploading - Uploading generated training model[34mArguments: train[0m
[34m[2022-02-25:15:09:30:INFO] Running standalone xgboost training.[0m
[34m[2022-02-25:15:09:30:INFO] File size need to be processed in the node: 0.22mb. Available memory size in the node: 23769.68mb[0m
[34m[2022-02-25:15:09:30:INFO] Determined delimiter of CSV input is ','[0m
[34m[15:09:30] S3DistributionType set as FullyReplicated[0m
[34m[15:09:30] 2495x25 matrix with 62375 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,[0m
[34m[2022-02-25:15:09:30:INFO] Determin

#### d.) Deploy the model to an Endpoint

In [74]:
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

---------!

#### e.) Test the predictions

In [75]:
#from sagemaker.predictor import csv_serializer
from sagemaker.serializers import CSVSerializer

test_data_array = test_clean.drop(['Price_log'], axis=1).values #load the data into an array
#xgb_predictor.content_type = 'text/csv' # set the data type for an inference
xgb_predictor.serializer = CSVSerializer() # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array.shape)

(624,)


In [76]:
predictions_array[0:10]

array([ 0.99798775, 10.81545544,  8.42318058, 11.17301559, 10.12899113,
       10.35171318, 10.78121758, 10.18690586, 10.59714508, 10.59128189])

In [77]:
test_clean.head(10)

Unnamed: 0,Area,Price_log,Awae,Bastos,Bonaberi,Bonamoussadi,Japoma,Kotto,Kribi,Lendi,...,Odza,Olembe,Omnisports,PK12,PK14,PK16,Soa,Village,Yaoundé,Yassa
0,500.0,10.463103,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,500.0,11.0021,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3000.0,7.949091,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,325.0,11.512925,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1000.0,9.903488,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
5,1000.0,10.463103,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
6,2900.0,11.512925,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,1000.0,10.463103,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,482.0,10.463103,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,655.0,11.156251,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [92]:
prediction = pd.DataFrame(np.exp(predictions_array[0:10]),columns=["Predicted Price"])
print(round(prediction))

   Predicted Price
0              3.0
1          49784.0
2           4551.0
3          71183.0
4          25059.0
5          31311.0
6          48109.0
7          26553.0
8          40020.0
9          39786.0


In [102]:
actual = np.exp(test_clean.Price_log.head(10))
print(actual)

0     35000.0
1     60000.0
2      2833.0
3    100000.0
4     20000.0
5     35000.0
6    100000.0
7     35000.0
8     35000.0
9     70000.0
Name: Price_log, dtype: float64


#### f.) Deleting The Endpoints

In [3]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()

NameError: name 'sagemaker' is not defined

Congratulations!!! You just built an end-to-end machine learning app.

#### g.)Conclusion

Whe have successfully gone through the machine learning cycle from Problem framing till deployment.And since Machine Learning is an iterative process, we can always go back and optimise from Feature Engineering to improve our accuracy.

The __accuracy__ of the current model is low, due to the following reasons:
- The __prices per m2__ square was poorly entered by some sales agents on the Jumia website. Instead of entering price per metres squared, they entered the total price of the whole piece of land.
- There are still __some farmlands__ in the causing variations in prices from the price for redential areas.
- The __house prices vary alot__, so the model is finding it hard to predict a price which is close to the actual price we are measutrig at the time of model evaluation.

Possible __corrective actions__ can be taken to improve the efficiency of the model:
- Recruit junior staff to go through the prices and __correct the wrong prices__ entered by the sales agents
- Complety eliminate observations with very low prices, which could indicate __farmlands__
- Or simply use the __median house price per location__, irrespective of the metres square bought.

Feel free to use any resources here to improve on the model's performance.

### Wish you Good Data Luck!!!