# LAND PRICE PREDICTION APP USING AWS SAGEMAKER'S IN-BUILT XGBOOST  - End-to-End
We will build a Land Price Prediction App to help people looking to buy land in Cameroon, get the expected price of land per quartier they intend to buy land from.
The following steps will be taken:
- I)   PROBLEM STATEMENT:

Many people in Cameroon want to buy lands and they have trouble getting information on what to expect as price per square metre for the quartier they want to buy the land from.They also want to be able to consult the prices of several quartiers before making their final choice.
This is a difficult process in Cameroon as it will mean these people who want to buy lands will have to go about making many phone calls to people asking them the price of land in those quartiers.
So the objective is to scrape the data already available on the biggest Classified adds website in Cameroon (Jumia Cameroon) https://www.jumia.cm/en/land-plots

This data will be cleaned and trained using the in-built XGBoost Algorithm on AWS Sagemaker, and an endpoint will be created in AWS ,which wll be used to make predictions when given the inputs like 
- The Quartier the customer wants to buy land from
- The size of the land the customer intends to buy (in metres square)
- And the output of the model will be the predicted Price per metres square for the Quartier the customer requested.


- II)   SCRAPING THE DATA:

Scrape the data from a Classified ads website, where people post lands for sale per quartier in Cameroon.They typically type in the price per metres square and the total area of the land availlable for sale.
- III)  PERFORM EXPLORATORY DATA ANALYSIS 

Inspect the data to validate the quality of the data scraped from the classified ads website. Analyse the distribution of missing values, outliers and gain other relevant insights from the model
- IV) DO FEATURE ENGINEERING & SELECTION

Handle the mising values, outliers and do the necessary transformations which will ensure the data is well suited for the machine learning model.And also to maximise the insights gotten from the Exploratory Data Analysis phase.
- V)  BUILD,TRAIN AND DEPLOY THE MODEL IN SAGEMAKER

The Boto3 Container will be used to create the S3 buckets to store the preprocessed dataset.The Sagemaker's inbuilt XGBoost algorithm, will be built, trained and deployed.Including the use of optimal hyperparameters to get the best results for the RMSE( Root Mean Squared Error).An Endpoint will be created after the model is built.
The Endpoint created awill be used to predict the price per metre square when the inputs of "Quartier" and "Land size" are fed to the endpoint.

### V) BUILD,TRAIN AND DEPLOY THE MODEL IN SAGEMAKER
We will perform the following tasks, in order to successully scrape the data we need
- a.) Importing the necessary Libraries and create S3 bucket
- b.) Download the train and test data and store in S3
- c.) Build and Train the Inbuilt XGBoost model
- d.) Deploy the model to an Endpoint
- e.) Test the predictions
- f.) Delete the Endpoint
- g.) Conclusion

#### a.) Importing all the necessary libraries and creating S3 bucket

In [1]:
import sagemaker
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri 
from sagemaker.image_uris import retrieve
from sagemaker.session import s3_input, Session

In [2]:
bucket_name = 'landpriceapp' # <--- Give this a unique name, since there can be no 02 bucket names in AWS
my_region = boto3.session.Session().region_name # set the region of the instance
print(my_region)

us-east-1


In [3]:
s3 = boto3.resource('s3')
try:
    if  my_region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ',e)

S3 bucket created successfully


In [4]:
# set an output path where the trained model will be saved
prefix = 'xgboost-inbuilt-algo'
output_path ='s3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)

s3://landpriceapp/xgboost-inbuilt-algo/output


#### b.) Download the train and test data and store in S3

In [5]:
import pandas as pd
import urllib

In [27]:
#Importing the train dataset
try:
    urllib.request.urlretrieve ("https://raw.githubusercontent.com/Bandolo/AWS-Machine-Learning-Projects/main/LandPriceApp-XGBoost-Sagemaker/train_clean.csv", "train_clean.csv")
    print('Success: downloaded train_clean.csv.')
except Exception as e:
    print('Data load error: ',e)

try:
    train_clean = pd.read_csv('./train_clean.csv',index_col=0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

Success: downloaded train_clean.csv.
Success: Data loaded into dataframe.


In [28]:
#Importing the test dataset
try:
    urllib.request.urlretrieve ("https://raw.githubusercontent.com/Bandolo/AWS-Machine-Learning-Projects/main/LandPriceApp-XGBoost-Sagemaker/test_clean.csv", "test_clean.csv")
    print('Success: downloaded train_clean.csv.')
except Exception as e:
    print('Data load error: ',e)

try:
    train_clean = pd.read_csv('./test_clean.csv',index_col=0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

Success: downloaded train_clean.csv.
Success: Data loaded into dataframe.


In [30]:
print(test.shape)

(695, 24)


In [33]:
print(train.head())

          Price_log  Awae  Bastos  Bonaberi  Bonamoussadi  Douala  Japoma  \
Area                                                                        
10000.0    8.987197     0       0         0             0       1       0   
200000.0   9.615805     0       0         0             0       0       0   
500.0     10.915088     0       0         0             0       0       0   
310.0     10.463103     0       0         0             0       0       0   
1000.0    10.126631     0       0         0             0       0       1   

          Kotto  Kribi  Lendi  ...  Mfou  Nkoabang  Odza  PK12  PK16  PK21  \
Area                           ...                                           
10000.0       0      0      0  ...     0         0     0     0     0     0   
200000.0      0      0      0  ...     0         0     0     0     0     0   
500.0         0      0      0  ...     0         0     0     0     0     0   
310.0         0      0      0  ...     0         0     0     0     0  

In [34]:
print(test.head())

        Price_log  Awae  Bastos  Bonaberi  Bonamoussadi  Douala  Japoma  \
Area                                                                      
1800.0  11.512925     0       0         0             0       0       1   
9000.0   8.160518     0       0         0             0       0       0   
3000.0   9.740969     0       0         0             0       0       0   
500.0    9.798127     0       0         0             0       0       0   
500.0   11.002100     0       0         0             0       0       0   

        Kotto  Kribi  Lendi  ...  Mfou  Nkoabang  Odza  PK12  PK16  PK21  Soa  \
Area                         ...                                                
1800.0      0      0      0  ...     0         0     0     0     0     0    0   
9000.0      0      0      1  ...     0         0     0     0     0     0    0   
3000.0      0      0      0  ...     0         0     0     0     0     0    0   
500.0       0      0      0  ...     0         0     1     0     0   

In [35]:
### Saving Train And Test Into Buckets
## We start with Train Data
import os
pd.concat([train['Price_log'], train.drop(['Price_log'], 
                                                axis=1)], 
                                                axis=1).to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

In [36]:
# Test Data Into Buckets
pd.concat([test['Price_log'], test.drop(['Price_log'], 
                                              axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
s3_input_test = sagemaker.TrainingInput(s3_data='s3://{}/{}/test'.format(bucket_name, prefix), content_type='csv')

#### c.) Build and Train the Inbuilt XGBoost model

In [37]:
# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
# specify the repo_version depending on your preference.
container = retrieve('xgboost',boto3.Session().region_name,'latest')

In [38]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"0.3",
        "min_child_weight":"7",
        "subsample":"1",
        "objective":"reg:linear",
        "num_round":50
        }

In [39]:
# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path,
                                          use_spot_instances=True,
                                          max_run=300,
                                          max_wait=600)

In [40]:
estimator.fit({'train': s3_input_train,'validation': s3_input_test})

2021-12-10 14:57:08 Starting - Starting the training job...
2021-12-10 14:57:15 Starting - Launching requested ML instancesProfilerReport-1639148228: InProgress
......
2021-12-10 14:58:26 Starting - Preparing the instances for training.........
2021-12-10 15:00:07 Downloading - Downloading input data
2021-12-10 15:00:07 Training - Downloading the training image.[34mArguments: train[0m
[34m[2021-12-10:15:00:10:INFO] Running standalone xgboost training.[0m
[34m[2021-12-10:15:00:10:INFO] File size need to be processed in the node: 0.21mb. Available memory size in the node: 23762.52mb[0m
[34m[2021-12-10:15:00:10:INFO] Determined delimiter of CSV input is ','[0m
[34m[15:00:10] S3DistributionType set as FullyReplicated[0m
[34m[15:00:10] 2777x23 matrix with 63871 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,[0m
[34m[2021-12-10:15:00:10:INFO] Determined delimiter of CSV input is ','[0m
[34m[15:00:10] S3DistributionType set as FullyReplicated

#### d.) Deploy the model to an Endpoint

In [41]:
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

---------!

#### e.) Test the predictions

In [44]:
#from sagemaker.predictor import csv_serializer
from sagemaker.serializers import CSVSerializer

test_data_array = test.drop(['Price_log'], axis=1).values #load the data into an array
#xgb_predictor.content_type = 'text/csv' # set the data type for an inference
xgb_predictor.serializer = CSVSerializer() # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array.shape)

(695,)


In [50]:
predictions_array[0:10]

array([ 0.41550922,  9.58652115, 10.30050278, 10.1236887 , 10.1236887 ,
        9.02184963,  9.67544365,  9.67544365, 10.1236887 ,  8.48962975])

In [49]:
test.head(10)

Unnamed: 0_level_0,Price_log,Awae,Bastos,Bonaberi,Bonamoussadi,Douala,Japoma,Kotto,Kribi,Lendi,...,Mfou,Nkoabang,Odza,PK12,PK16,PK21,Soa,Village,Yaoundé,Yassa
Area,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1800.0,11.512925,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9000.0,8.160518,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3000.0,9.740969,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
500.0,9.798127,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
500.0,11.0021,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10000.0,8.987197,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
300.0,9.903488,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2000.0,11.082143,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
783.0,10.126631,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
500.0,8.853665,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


#### f.) Deleting The Endpoints

In [3]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()

NameError: name 'sagemaker' is not defined

Congratulations!!! You just built an end-to-end machine learning app.

#### g.)Conclusion

Because the house prices have a very high variance the model is RMSE cannot go very low.
There are options to improve the model, like replacing all the house prices per Location with just the median price per location.That way the model will have less varaince.

Feel free to use any resources here to improve on the model's performence.

Wish you Good Data Luck!!!

### Wish you Good Data Luck!!!