### **Quick Tutorial - Machine Learning Models for Residential Appraisers, Part 2**
### **Random Forest and Extreme Gradient Boosting (XGBoost)**

&nbsp;  
*Important Note: There are plenty of free machine learning tutorials and courses online. Anyone can access them, learn, and run predictive models for home values. However, this tutorial has been designed specifically for residential appraisers, and some of the material will be irrelevant or less important for other industries. If you are not a residential appraiser, and/or you are looking to learn about machine learning as a broader field, this tutorial may not be adequate for you.*

The purpose of this tutorial is to help residential appraisers interested in machine learning get their feet wet. This is **not** a comprehensive machine learning tutorial and it does **not** cover everything there is to know about the topic. 


&nbsp;  
**Prerequisites:**
1.	Some understanding of Python programming.
2.	Access to, and familiarity with, Jupyter Notebook.
3.	If you want to use your own data, you will need access to home sales data and custom exports from your local MLS.

&nbsp;  
**Obtain the data:**
We will be using a dataset of home sales that includes sales prices and several predicting features, in csv format ***(Already pre-processed from Part 1 of the tutorial)***. 

The easiest way to obtain a dataset for your specific market area is by creating a custom csv export from your local MLS system. If you don’t know how to create it you should get technical support from your MLS provider.

The sales dataset contains the following features:
Lot size, Water View, Year Built, Bedrooms, Bathrooms, GLA, Garage, Carport, Fireplace, Pool, and Sales Price.
Note: These features are based on the subject’s market area. You should export a dataset that includes all of the value-affecting features you consider relevant for your specific market area. 

&nbsp;  
**Assumptions:**
We will make the following assumptions for the purpose of this tutorial. 
1.	Stable market conditions. In rapidly increasing or decreasing markets you will likely need to add a “sales date” column/feature.
2.	Accurate data sources. For this tutorial, MLS data is assumed to be accurate and reliable.
3.	Relevant predicting features. Make sure you include all of the appropriate value-affecting features for the subject’s market area.
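
If the market is not stable, the first assumption suggests adding a sale-date feature. The following is a minimal sketch of one way to do that, assuming the MLS export has a date column. The column names `Sale_Date` and `Days_Since_Sale` and the three sample rows are hypothetical and are not part of the tutorial dataset.

```python
import pandas as pd

# Hypothetical sample; 'Sale_Date' is an assumed column name, not part of the tutorial dataset
sales = pd.DataFrame({
    "Sale_Date": ["2023-01-15", "2023-06-30", "2023-12-01"],
    "Sales_Price": [300000, 310000, 325000],
})

# Parse the text dates into datetime values
sales["Sale_Date"] = pd.to_datetime(sales["Sale_Date"])

# Express each sale as the number of days before the most recent sale,
# which gives the model a numeric feature it can use to pick up a price trend
latest_sale = sales["Sale_Date"].max()
sales["Days_Since_Sale"] = (latest_sale - sales["Sale_Date"]).dt.days

print(sales[["Sale_Date", "Days_Since_Sale", "Sales_Price"]])
```

A numeric column like this can be added to the predicting features in the same way as Site or GLA.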

### **IMPORT LIBRARIES**

#### First, we need to import the libraries we are going to be using.

In [1]:
import pandas as pd
import numpy as np
import sklearn as sk


### **IMPORT DATASET**

#### Next, we import the home sales dataset into our notebook. (The dataset is already pre-processed from Part 1 of the tutorial.) 




In [2]:
file_path = '/content/Sales_Dataset2.csv'       #Create file path to csv dataset
sales_df = pd.read_csv(file_path)               #Read dataset and assign the name sales_df
sales_df.head()                                 #Display data (first 5 rows) to make sure it was successfully imported

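|   | Site | View | Age | Bedrooms | Bathrooms | GLA | Garage | Fireplace | Pool | Sales_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8549 | 1 | 5 | 5 | 4.0 | 4439 | 3 | 0 | 1 | 675000 |
| 1 | 11108 | 0 | 3 | 5 | 4.0 | 4069 | 3 | 0 | 0 | 540000 |
| 2 | 9920 | 0 | 22 | 5 | 4.0 | 3834 | 3 | 1 | 1 | 515000 |
| 3 | 10035 | 1 | 22 | 5 | 4.0 | 3828 | 3 | 1 | 1 | 495000 |
| 4 | 9600 | 1 | 14 | 5 | 4.0 | 3382 | 3 | 0 | 1 | 494700 |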


#### Every time we modify the dataframe, it is good practice to check a sample of it to confirm the code did what was intended. For this we use sales_df.head(), which shows the first 5 rows of the dataframe.

&nbsp;  
&nbsp;  
&nbsp;  
#### **EXPLORE THE DATA**

#### Now we can explore the characteristics of the dataset.

In [3]:
sales_df.describe()     #Get main statistics of the dataset

|   | Site | View | Age | Bedrooms | Bathrooms | GLA | Garage | Fireplace | Pool | Sales_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 210.0 | 210.0 | 210.0 | 210.0 | 210.0 | 210.0 | 210.0 | 210.0 | 210.0 | 210.0 |
| mean | 8608.233333 | 0.195238 | 20.461905 | 3.7 | 2.466667 | 2183.052381 | 4.128571 | 0.157143 | 0.333333 | 306573.128571 |
| std | 2779.318027 | 0.397331 | 4.749311 | 0.641842 | 0.611949 | 596.492637 | 26.066489 | 0.364805 | 0.472531 | 71470.906697 |
| min | 1684.0 | 0.0 | 2.0 | 2.0 | 2.0 | 1191.0 | 0.0 | 0.0 | 0.0 | 167850.0 |
| 25% | 6577.5 | 0.0 | 18.0 | 3.0 | 2.0 | 1805.75 | 2.0 | 0.0 | 0.0 | 259925.0 |
| 50% | 8115.5 | 0.0 | 21.0 | 4.0 | 2.0 | 2023.0 | 2.0 | 0.0 | 0.0 | 285000.0 |
| 75% | 10080.0 | 0.0 | 24.0 | 4.0 | 3.0 | 2454.5 | 3.0 | 0.0 | 1.0 | 340750.0 |
| max | 18439.0 | 1.0 | 28.0 | 5.0 | 4.0 | 4439.0 | 380.0 | 1.0 | 1.0 | 675000.0 |
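
One thing worth checking in the summary statistics: Garage has a maximum of 380 while its 75th percentile is 3, which may point to a data-entry error. The sketch below shows one common way to flag such values, the interquartile range (IQR) rule. The sample values are hypothetical (only the 380 mirrors the reported maximum), and the 1.5 multiplier is a conventional choice rather than something the tutorial prescribes.

```python
import pandas as pd

# Hypothetical Garage values; 380 mimics the maximum reported by describe()
garage = pd.Series([2, 2, 3, 2, 3, 0, 2, 380], name="Garage")

# IQR rule: flag values far above the third quartile
q1 = garage.quantile(0.25)
q3 = garage.quantile(0.75)
upper_limit = q3 + 1.5 * (q3 - q1)

# Keep only the rows that exceed the limit so they can be reviewed by hand
outliers = garage[garage > upper_limit]
print(outliers)
```

Whether a flagged value is an error or a genuine property characteristic is a judgment call that still requires reviewing the MLS record.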


In [4]:
sales_df.info()         #Get general information including number of rows, columns, and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Site         210 non-null    int64  
 1   View         210 non-null    int64  
 2   Age          210 non-null    int64  
 3   Bedrooms     210 non-null    int64  
 4   Bathrooms    210 non-null    float64
 5   GLA          210 non-null    int64  
 6   Garage       210 non-null    int64  
 7   Fireplace    210 non-null    int64  
 8   Pool         210 non-null    int64  
 9   Sales_Price  210 non-null    int64  
dtypes: float64(1), int64(9)
memory usage: 16.5 KB


&nbsp;
#### Since we already covered the data pre-processing steps in Part 1, we will not cover those steps again in Part 2. Instead, we will jump right into data modeling and predictions.
&nbsp;

#### **MODELS**

#### First, we import the libraries we are going to be using.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor


&nbsp;  
&nbsp;  
#### Before running our models we must define the predicting features (Site, View, Age, Bedrooms, Bathrooms, GLA, Garage, Fireplace, and Pool) as 'X' and the target feature (Sales_Price) as 'y'.

In [6]:
X = sales_df.iloc[:,:-1]      #Define predicting features, X
y = sales_df.iloc[:,-1]       #Define target, y
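
The position-based selection above relies on Sales_Price being the last column. As an alternative sketch, the same split can be written by column name, which keeps working if the column order changes. The two rows below are copied from the head() output purely for illustration.

```python
import pandas as pd

# First two rows of the tutorial dataset, as shown by sales_df.head()
sales_df = pd.DataFrame({
    "Site": [8549, 11108], "View": [1, 0], "Age": [5, 3],
    "Bedrooms": [5, 5], "Bathrooms": [4.0, 4.0], "GLA": [4439, 4069],
    "Garage": [3, 3], "Fireplace": [0, 0], "Pool": [1, 0],
    "Sales_Price": [675000, 540000],
})

# Name-based selection: everything except the target is a predicting feature
X = sales_df.drop(columns="Sales_Price")
y = sales_df["Sales_Price"]

# Confirm it matches the position-based selection used in the tutorial
same_as_iloc = X.equals(sales_df.iloc[:, :-1]) and y.equals(sales_df.iloc[:, -1])
print(list(X.columns), same_as_iloc)
```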

&nbsp;  
&nbsp;  
#### Next, we split the dataframe into two sets: the train set (to train the model) and the test set (to test the model with new data).
#### We assign 80% of the data to the train set and 20% to the test set. (Other common splits are 75/25, 85/15, and 90/10.)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, test_size=0.2, random_state=0)     #train-test split
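
To see what the split produces, it helps to check the resulting sizes. The sketch below uses placeholder arrays with 210 rows and 9 features (the shape of the tutorial dataset, not its actual values). With an 80/20 split, 168 rows go to the train set and 42 to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 210 rows and 9 features, matching the tutorial dataset's shape
X = np.arange(210 * 9).reshape(210, 9)
y = np.arange(210)

# Same call as in the tutorial
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, test_size=0.2, random_state=0
)

# Rows and columns in each set
print(X_train.shape, X_test.shape)
```

With only 42 test rows, the MAE scores can shift noticeably from one random split to another, so small differences between models should be read with caution.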

&nbsp;  
&nbsp;  

#### **MODEL: Random Forest**


#### The next step is to define the random forest model and fit the train set.

In [8]:
random_forest = RandomForestRegressor(random_state=0)   #Define the model

random_forest.fit(X_train,y_train)                      #Fit the train set


RandomForestRegressor(random_state=0)
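
Once a random forest is fitted, scikit-learn exposes a `feature_importances_` attribute that indicates how much each feature contributed to the model's splits. The sketch below illustrates this on synthetic data (the prices are generated by a made-up formula dominated by GLA), so the numbers say nothing about any real market.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "GLA": rng.integers(1200, 4400, size=200),
    "Age": rng.integers(2, 28, size=200),
    "Pool": rng.integers(0, 2, size=200),
})

# Made-up price formula in which GLA is the main driver
y = 100 * X["GLA"] - 1000 * X["Age"] + 10000 * X["Pool"] + rng.normal(0, 5000, size=200)

model = RandomForestRegressor(random_state=0).fit(X, y)

# Importances sum to 1; larger values mean the feature was used more to reduce error
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

Importances describe what the model relied on, not market-supported adjustments, so they are best treated as a sanity check.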

&nbsp;  
&nbsp;  
#### Now that the Random Forest model has been created, we want to know how well it performs with new data (test set). We will use "Mean Absolute Error" (MAE) to score the model.

In [9]:
rf_pred = random_forest.predict(X_test)                     #Get predictions using the test set

rf_mae = mean_absolute_error(y_test, rf_pred)               #Calculate MAE (actual values first, then predictions)

print("Random Forest MAE:", format(rf_mae, ',.2f'))         #Display MAE score

Random Forest MAE: 17,733.81


#### The Mean Absolute Error (MAE) is 17,733.81
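
To make the score concrete, MAE is the average of the absolute differences between actual and predicted prices. The sketch below computes it by hand for three hypothetical homes (the prices are made up for illustration) and compares the result with scikit-learn's `mean_absolute_error`. It then expresses the Random Forest MAE as a share of the mean sale price reported by describe() for the full dataset (306,573.13), which works out to roughly 5.8%.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual prices and model predictions for three homes
y_true = np.array([300000, 250000, 400000])
y_pred = np.array([310000, 245000, 380000])

# MAE by hand: average of the absolute errors (10,000, 5,000, and 20,000)
manual_mae = np.mean(np.abs(y_true - y_pred))

# Same calculation using scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)
print(format(manual_mae, ',.2f'), format(sklearn_mae, ',.2f'))

# Random Forest MAE from the tutorial relative to the dataset's mean sale price
relative_error = 17733.81 / 306573.13
print(format(relative_error, '.1%'))
```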

&nbsp; 
&nbsp; 
#### Next, we can work on the Extreme Gradient Boosting (XGBoost) model.

#### **MODEL: Extreme Gradient Boosting (XGBoost)**

#### Let's define the model and fit the train set.

In [10]:
xg_boost = XGBRegressor()             #Define the model
 
xg_boost.fit(X_train,y_train)         #Fit the train set



XGBRegressor()

&nbsp;  
&nbsp;  
#### We just created the Extreme Gradient Boosting model. Now we want to know how well it performs with new data (test set). We will use "Mean Absolute Error" (MAE) to score the model.

In [11]:
xgb_pred = xg_boost.predict(X_test)                 #Get predictions using the test set

xgb_mae = mean_absolute_error(y_test, xgb_pred)     #Calculate MAE (actual values first, then predictions)

print("XGBoost MAE:", format(xgb_mae, ',.2f'))      #Display MAE score


XGBoost MAE: 20,499.71


#### The Mean Absolute Error (MAE) is 20,499.71

#### Based on the MAE scores it appears that the **Random Forest** model, with a lower MAE, is a better fit for our dataset.
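
Because the comparison above rests on a single test set, one way to double-check it is k-fold cross-validation, which averages the MAE over several different train/test splits. The sketch below is an illustration under stated assumptions: the data is synthetic, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so the example runs with scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only (210 rows, 3 made-up features)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(210, 3))
y = 300000 + 100000 * X[:, 0] - 20000 * X[:, 1] + rng.normal(0, 10000, size=210)

models = {
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting (scikit-learn stand-in)": GradientBoostingRegressor(random_state=0),
}

cv_mae = {}
for name, model in models.items():
    # scikit-learn reports negative MAE so that higher is always better; flip the sign back
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()
    print(name, "cross-validated MAE:", format(cv_mae[name], ',.2f'))
```

If one model wins consistently across folds, that is stronger evidence than a single 42-row test set.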

&nbsp;  
&nbsp;  
#### Next, we will run both of the trained models using the subject property data.

#### **PREDICTIONS**

#### Import a csv file that contains the subject's information in the same format as the cleaned dataframe. The only missing feature is the "Sales Price" since that is the target we are trying to predict.

In [12]:
sp_data_path = '/content/SP_Data2.csv'          #Create file path to csv dataset 
sp_data = pd.read_csv(sp_data_path)             #Read dataset in dataframe

sp_data.head()                                  #Display data (first 5 rows) just to make sure it was successfully imported

|   | Site | View | Age | Bedrooms | Bathrooms | GLA | Garage | Fireplace | Pool |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12100 | 0 | 16 | 4 | 3 | 2915 | 3 | 0 | 1 |
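
Since the subject file must match the format of the training data, a quick check before predicting can catch missing or misnamed columns. The following is a minimal sketch of such a check. It rebuilds the subject row from the output above instead of reading the csv file, and the variable names `missing`, `extra`, and `has_gaps` are illustrative.

```python
import pandas as pd

# Predicting features in the same order used to train the models
features = ['Site', 'View', 'Age', 'Bedrooms', 'Bathrooms', 'GLA', 'Garage', 'Fireplace', 'Pool']

# Subject property row as displayed in the tutorial output
sp_data = pd.DataFrame([[12100, 0, 16, 4, 3, 2915, 3, 0, 1]], columns=features)

# Columns the models expect but the subject file lacks, and vice versa
missing = [col for col in features if col not in sp_data.columns]
extra = [col for col in sp_data.columns if col not in features]

# Select the features in training order and check for empty cells
sp_X = sp_data[features]
has_gaps = bool(sp_X.isna().any().any())

print("Missing:", missing, "Extra:", extra, "Empty cells:", has_gaps)
```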


&nbsp;

&nbsp;  
#### Now we define the predicting features and then run the trained models to predict the subject's estimated value.

In [13]:
features = ['Site', 'View', 'Age', 'Bedrooms', 'Bathrooms', 'GLA','Garage','Fireplace','Pool']    #Define predicting features
sp_X = sp_data[features]

prediction_random_forest = int(random_forest.predict(sp_X)[0])        #Predict the subject's value using Random Forest model ([0] takes the single value from the returned array)
prediction_xgboost = int(xg_boost.predict(sp_X)[0])                   #Predict the subject's value using Extreme Gradient Boosting model

print('Estimated Values')                                             #Display estimated values
print()
print('Random Forest: $', format(prediction_random_forest,','))
print('Extreme Gradient Boosting: $', format(prediction_xgboost,','))

Estimated Values

Random Forest: $ 395,426
Extreme Gradient Boosting: $ 407,910


&nbsp;  
&nbsp;  
#### Subject's estimated value:
#### 1) Random Forest: $395,426
#### 2) Extreme Gradient Boosting: $407,910
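
One way to put these point estimates in context is to pair each with its model's test-set MAE (17,733.81 for Random Forest and 20,499.71 for XGBoost). The sketch below is simple arithmetic on the numbers reported in this tutorial. Note that MAE is an average error on the test set, so the resulting band is a rough indication and not a statistical confidence interval.

```python
# Point estimates and test-set MAE values reported earlier in this tutorial
estimates = {
    "Random Forest": (395426, 17733.81),
    "Extreme Gradient Boosting": (407910, 20499.71),
}

ranges = {}
for name, (value, mae) in estimates.items():
    # Estimate minus/plus the model's average absolute error
    low, high = value - mae, value + mae
    ranges[name] = (low, high)
    print(f"{name}: ${low:,.0f} to ${high:,.0f}")
```

The two bands overlap substantially, which suggests the models broadly agree about the subject's value.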

&nbsp;  
&nbsp;
#### **What to do with the results?**

#### Similar to Part 1 of the tutorial, we ran a couple of models that predicted the subject's value, with test-set MAEs of roughly 17,700 to 20,500. However, the fact that our models produced these results doesn’t mean that we must use them to derive a final estimate of value. Perhaps we just want them as an additional tool to support our own analysis. Or maybe we do want to use them for low-risk collateral analysis. The point is that machine learning models are just tools; we are ultimately in charge of deciding whether or not to use them. 
&nbsp;

#### **Conclusion**
#### In Part 2 of this tutorial we covered data collection, data modeling, and prediction of values.
&nbsp;
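#### In summary, the estimated values obtained from our four models (Linear Regression and Decision Tree from Part 1, Random Forest and Extreme Gradient Boosting from Part 2) are: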
#### Linear Regression: 405,604 
#### Decision Tree: 420,000
#### Random Forest: 395,426
#### Extreme Gradient Boosting: 407,910
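
As a quick way to compare the four estimates, the sketch below computes their average and spread. This is plain arithmetic on the values listed above, not an additional model, and averaging the estimates is only one possible way to summarize them.

```python
# Estimated values from the four models listed above
estimates = {
    "Linear Regression": 405604,
    "Decision Tree": 420000,
    "Random Forest": 395426,
    "Extreme Gradient Boosting": 407910,
}

values = list(estimates.values())

# Average of the four estimates and the gap between the highest and lowest
average_estimate = sum(values) / len(values)
spread = max(values) - min(values)
spread_pct = spread / average_estimate

print(f"Average: ${average_estimate:,.0f}")
print(f"Range: ${min(values):,} to ${max(values):,} (spread ${spread:,}, {spread_pct:.1%} of the average)")
```

The four estimates average $407,235 and fall within a spread of $24,574, or about 6% of the average.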
&nbsp;

