## Let's Briefly Review the ML Pipeline

In [1]:
import pandas as pd
import numpy as np
import pickle as pkl
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

Start by reading in your data

In [2]:
#read in the data
df = pd.read_csv('data/Housing.csv')

In [3]:
#get the head of the data
df.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


Raw data description

In [4]:
df.describe()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,parking
count,545.0,545.0,545.0,545.0,545.0,545.0
mean,4766729.0,5150.541284,2.965138,1.286239,1.805505,0.693578
std,1870440.0,2170.141023,0.738064,0.50247,0.867492,0.861586
min,1750000.0,1650.0,1.0,1.0,1.0,0.0
25%,3430000.0,3600.0,2.0,1.0,1.0,0.0
50%,4340000.0,4600.0,3.0,1.0,2.0,0.0
75%,5740000.0,6360.0,3.0,2.0,2.0,1.0
max,13300000.0,16200.0,6.0,4.0,4.0,3.0


With massively different scales across the columns, let's rescale price and area to make the modeling a bit easier. Think of this as an abbreviated version of the EDA and feature engineering phase.

In [5]:
#let's play with some scaling: express price in millions
df['price'] = df['price'] / 10**6

In [6]:
#let's play with scaling further: express area in thousands
df['area'] = df['area'] / 10**3

Let's see the result

In [7]:
#now let's see how our data is acting overall
df.describe()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,parking
count,545.0,545.0,545.0,545.0,545.0,545.0
mean,4.766729,5.150541,2.965138,1.286239,1.805505,0.693578
std,1.87044,2.170141,0.738064,0.50247,0.867492,0.861586
min,1.75,1.65,1.0,1.0,1.0,0.0
25%,3.43,3.6,2.0,1.0,1.0,0.0
50%,4.34,4.6,3.0,1.0,2.0,0.0
75%,5.74,6.36,3.0,2.0,2.0,1.0
max,13.3,16.2,6.0,4.0,4.0,3.0
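The rescaling above is a plain change of units, so it can be written as a single column-wise division. Here is a minimal sketch on a toy frame (the frame `toy` is an illustrative stand-in, with two rows copied from the head of the data):

```python
import pandas as pd

# Toy stand-in for the housing data (two rows copied from the head above)
toy = pd.DataFrame({"price": [13300000, 12250000], "area": [7420, 8960]})

# Column-wise division is vectorized, so no Python-level loop is needed
toy["price"] = toy["price"] / 10**6  # price in millions
toy["area"] = toy["area"] / 10**3    # area in thousands

print(toy)
```

Because the target is now in millions, every error metric reported later is in millions too.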


Let's continue the feature engineering by encoding our categorical features

In [8]:
#let's separate the columns into numerical and categorical columns
cat_cols = []
num_cols = []
#iterate through the column names
for i in df.columns:
    #check the type of the first value in the column
    if isinstance(df[i].iloc[0], str):
        #add to one list if it is a string
        cat_cols.append(i)
    else:
        #add to another if not a string
        num_cols.append(i)

In [9]:
#create two dataframes for numerical and categorical data
df_cats = df[cat_cols]
df_numerical = df[num_cols]
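As an alternative to checking the first value of each column, pandas can select columns by dtype. The sketch below uses a small toy frame (the frame and its values are illustrative assumptions) and treats every non-numeric column as categorical:

```python
import pandas as pd

# Toy frame with two numeric and two string columns
toy = pd.DataFrame({
    "price": [13.3, 12.25],
    "bedrooms": [4, 3],
    "mainroad": ["yes", "no"],
    "furnishingstatus": ["furnished", "semi-furnished"],
})

# Numeric columns are selected by dtype
num_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical
cat_cols = [c for c in toy.columns if c not in num_cols]

print(num_cols)
print(cat_cols)
```

This looks at the whole column's dtype rather than a single value, so it does not depend on what the first row contains.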

This may seem premature, but you'll soon see why we need to define this saving function early.

In [10]:
#function used to save the model
def save_models(file,model):
    #open the pickle file; the with block closes it even if the dump fails
    with open(f'model/{file}','wb') as model_file:
        #dump information
        pkl.dump(model,model_file)
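A saving function usually needs a loading counterpart, for example when the dashboard reads the pickled objects back in. The sketch below is one way to pair them. The names `save_model` and `load_model` and the temporary directory are assumptions for illustration, not part of the notebook:

```python
import pickle as pkl
import tempfile
from pathlib import Path

def save_model(path, model):
    # Write the object to disk; the with block closes the file for us
    with open(path, "wb") as f:
        pkl.dump(model, f)

def load_model(path):
    # Read the object back from disk
    with open(path, "rb") as f:
        return pkl.load(f)

# Round trip through a temporary directory (hypothetical location)
target = Path(tempfile.mkdtemp()) / "example.pkl"
save_model(target, {"n_estimators": 100})
restored = load_model(target)
print(restored)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.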

Now let's create the encoder and use it to transform the data

In [11]:
#instantiate the encoder
ohe = OneHotEncoder()
#fit the data
ohe_fit = ohe.fit(df_cats)
#transform the data
ohe_data = ohe.transform(df_cats)

In practice, refitting the encoder will cause errors when your saved model tries to make predictions, because the encoded columns may no longer match the ones the model was trained on. If new categories appear, they need to be stored, and both the encoder and the model have to be retrained.
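To see what happens when a category shows up that the encoder never saw, here is a small sketch. The category `"partly-furnished"` is a made-up value for illustration. With default settings the fitted encoder raises an error, while `handle_unknown="ignore"` encodes the unseen value as all zeros:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"furnishingstatus": ["furnished", "semi-furnished", "unfurnished"]})
# Hypothetical category that was not present when the encoder was fitted
new = pd.DataFrame({"furnishingstatus": ["partly-furnished"]})

# Default behaviour: unknown categories raise a ValueError at transform time
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    strict_failed = False
except ValueError:
    strict_failed = True

# Tolerant behaviour: unknown categories become a row of zeros
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = tolerant.transform(new).toarray()

print(strict_failed)
print(encoded)
```

Ignoring unknown values keeps predictions running, but the model learns nothing about the new category until the encoder and model are retrained.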

Let's save the encoder

In [12]:
#save the ohe fit to be used later
save_models('ohe.pkl',ohe)

Now let's take our encoded data and create a dataframe from it. Make sure to:
* Take the feature names
* Condense the output
* Turn it into a dataframe
* Name the columns

In [13]:
#create the ohe data frame
#get names
ohe_names = ohe.get_feature_names_out()
#condense to array and convert to dataframe
df_ohe = pd.DataFrame(ohe_data.toarray())
#set columns
df_ohe.columns = ohe_names

Merge the datasets together and shuffle the rows

In [14]:
#merge the numerical and categorical sets
complete_df = pd.merge(df_numerical,df_ohe,left_index = True, right_index = True)
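#shuffle the rows (sample returns a new dataframe, so this only displays a shuffled copy;
#assign the result back to complete_df if the shuffled order should be kept)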
complete_df.sample(frac = 1)

Unnamed: 0,price,area,bedrooms,bathrooms,stories,parking,mainroad_no,mainroad_yes,guestroom_no,guestroom_yes,...,basement_yes,hotwaterheating_no,hotwaterheating_yes,airconditioning_no,airconditioning_yes,prefarea_no,prefarea_yes,furnishingstatus_furnished,furnishingstatus_semi-furnished,furnishingstatus_unfurnished
461,3.080,4.960,2,1,1,0,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
93,6.300,7.200,3,2,1,3,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
234,4.620,3.880,3,2,2,2,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
214,4.865,4.350,2,1,1,0,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
40,7.875,6.550,3,1,2,0,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
134,5.803,7.000,3,1,1,2,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
150,5.600,5.136,3,1,2,0,0.0,1.0,0.0,1.0,...,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
337,3.920,2.145,4,2,1,0,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
282,4.270,2.175,3,1,2,0,1.0,0.0,0.0,1.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
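Note that `sample` returns a new dataframe rather than shuffling in place. A small sketch of the difference, using a toy frame and a fixed `random_state` (both assumptions for illustration):

```python
import pandas as pd

frame = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Calling sample without assignment leaves the original frame untouched
frame.sample(frac=1, random_state=0)
unchanged = frame["price"].tolist()

# Assigning the result keeps the shuffled order; reset_index renumbers the rows
shuffled = frame.sample(frac=1, random_state=0).reset_index(drop=True)

print(unchanged)
print(shuffled)
```

Whether the shuffle is kept matters later: an integer `cv` in `cross_val_score` splits a regression dataset into consecutive blocks of rows.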


Now let's get to training:
* Split the data into inputs and outputs
* Split into model and validation set
* Split the model set into training and testing sets

In [15]:
#drop the var you're trying to predict
X = complete_df.drop('price',axis = 1)
#get output variable
y = complete_df['price']
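#split into model and validation sets
#note: test_size = 0.8 holds out 80% of the rows for validation, leaving 20% for the model set
X_model, X_validation, y_model, y_validation = train_test_split(X, y, test_size = 0.8)
#split set into training and testing
#note: test_size = 0.8 again leaves only 20% of the model set for training; use 0.2 to train on 80%
X_train, X_test, y_train, y_test = train_test_split(X_model, y_model, test_size = 0.8)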
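It is worth checking how many rows each split actually receives. With `test_size = 0.8` applied twice, the training set ends up with roughly 0.2 × 0.2 = 4% of the rows. With `test_size = 0.2` it gets roughly 0.8 × 0.8 = 64%. The sketch below compares the two on 100 synthetic rows (the helper `split_sizes` is a hypothetical name used only here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows, so the sizes read as percentages
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

def split_sizes(test_size):
    # First split: model set vs. validation set
    X_model, X_val, y_model, y_val = train_test_split(
        X, y, test_size=test_size, random_state=0)
    # Second split: training set vs. testing set
    X_train, X_test, y_train, y_test = train_test_split(
        X_model, y_model, test_size=test_size, random_state=0)
    return len(X_train), len(X_test), len(X_val)

sizes_08 = split_sizes(0.8)  # (train, test, validation) with test_size = 0.8
sizes_02 = split_sizes(0.2)  # (train, test, validation) with test_size = 0.2

print(sizes_08)
print(sizes_02)
```

A very small training set tends to inflate the error of any model, so the split proportions are worth confirming before comparing algorithms.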

Now let's train the model
* Instantiate it
* Train it
* Predict
* Evaluate

In [16]:
#instantiate the model
rf = RandomForestRegressor()
#fit on training
rf_fit = rf.fit(X_train,y_train)
#make the predictions
rf_preds = rf_fit.predict(X_test)
#get mean squared error
rf_mse = mean_squared_error(y_test,rf_preds)
#get mean absolute error
rf_mae = mean_absolute_error(y_test,rf_preds)
print(rf_mse)
print(np.sqrt(rf_mse))
print(rf_mae)

2.0497378071883556
1.4316905416982943
1.0604264204545464
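The three printed numbers are the MSE, the RMSE (its square root), and the MAE. Since price was rescaled to millions, the RMSE and MAE are in millions as well. As a worked example of the arithmetic on three made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

errors = y_true - y_pred        # [1, 0, -2]
mse = np.mean(errors ** 2)      # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)             # back in the units of the target
mae = np.mean(np.abs(errors))   # (1 + 0 + 2) / 3

# The hand-computed values agree with scikit-learn's metrics
print(mse, mean_squared_error(y_true, y_pred))
print(rmse)
print(mae, mean_absolute_error(y_true, y_pred))
```

MSE penalizes large misses more heavily than MAE, which is why the RMSE is never smaller than the MAE.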


Now let's cross validate to complete our evaluation

In [17]:
#cross validate the model to ensure reliability 
cv_rf = cross_val_score(rf,X,y,cv = 5,scoring='neg_mean_squared_error')
print(cv_rf)

[-8.75620903 -1.33637697 -1.09021075 -0.7705125  -1.90793304]
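The scores are negated MSE values, so values closer to zero are better. The first fold is far worse than the rest. One plausible cause is that an integer `cv` splits a regression dataset into consecutive, unshuffled blocks, and the head of the data suggests the rows are ordered by price. The sketch below shows how to pass a shuffled `KFold` instead. It uses synthetic data and a `LinearRegression` stand-in rather than the notebook's model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Shuffled folds, so no fold is a contiguous block of rows
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(LinearRegression(), X, y, cv=cv,
                          scoring="neg_mean_squared_error")

# Flip the sign and take the root to get an RMSE per fold
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold)
```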


Let's validate the model on the held-out validation set. This is an important check for overfitting.

In [18]:
#validation preds
rf_val_preds = rf_fit.predict(X_validation)
#get mean squared error
rf_mse_val = mean_squared_error(y_validation,rf_val_preds)
#get mean absolute error
rf_mae_val = mean_absolute_error(y_validation,rf_val_preds)
print(rf_mse_val)
print(np.sqrt(rf_mse_val))
print(rf_mae_val)

2.2788329579164017
1.5095803913393953
1.09444133027523


Now let's repeat this process with XGBoost to compare

In [19]:
#instantiate xgboost
xgb = XGBRegressor()
#fit on training data
xgb_fit = xgb.fit(X_train,y_train)
#make the predictions
xgb_preds = xgb_fit.predict(X_test)
#get mean squared error
xgb_mse = mean_squared_error(y_test,xgb_preds)
#get the mean absolute error
xgb_mae = mean_absolute_error(y_test,xgb_preds)
print(xgb_mse)
print(np.sqrt(xgb_mse))
print(xgb_mae)

2.596646431957968
1.6114113168145394
1.1916549621881138


We can again repeat the cross validation process

In [20]:
#cross validate the xgboost regressor
cv_xgb = cross_val_score(xgb,X,y,cv = 5,scoring='neg_mean_squared_error')
print(cv_xgb)

[-8.62796591 -1.81489535 -1.44665018 -0.975073   -1.99202287]


Let's repeat the validation process

In [21]:
#validation preds
xgb_val_preds = xgb_fit.predict(X_validation)
#get mean squared error
xgb_mse_val = mean_squared_error(y_validation,xgb_val_preds)
#get mean absolute error
xgb_mae_val = mean_absolute_error(y_validation,xgb_val_preds)
print(xgb_mse_val)
print(np.sqrt(xgb_mse_val))
print(xgb_mae_val)

2.6705387960378784
1.634178324430317
1.1896957568903583
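Since the same predict-and-score steps are repeated for each model and each dataset, they could be wrapped in a small helper. This is a sketch, not the notebook's code: `evaluate` is a hypothetical name, and a `DummyRegressor` stands in for the fitted models.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

def evaluate(model, X, y):
    # Predict once, then compute all three metrics from the same predictions
    preds = model.predict(X)
    mse = mean_squared_error(y, preds)
    return {"mse": mse, "rmse": np.sqrt(mse), "mae": mean_absolute_error(y, preds)}

# Tiny demo: a baseline that always predicts the mean of y (2.5 here)
X_demo = np.zeros((4, 1))
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
baseline = DummyRegressor(strategy="mean").fit(X_demo, y_demo)

scores = evaluate(baseline, X_demo, y_demo)
print(scores)
```

With a helper like this, the test and validation scores for both models could be collected into one table and compared side by side.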


Now let's take the best algorithm (the random forest, which had the lower errors above) and train it on the entire dataset. This model will be used for the dashboard.

In [22]:
#make a complete fit of the random forest model for the dash
rf_full = RandomForestRegressor()
full_model = rf_full.fit(X,y)

Let's save the complete trained model

In [23]:
#save this complete fit
save_models('rf_model.pkl',full_model)