# Part 3 - Model Building for Production

### Contents

<a href = "part-1_EDA.ipynb">Part 1 - Exploratory Data Analysis</a><br>

<a href = "part-2_data_cleaning">Part 2 - Data Cleaning</a>
    
Part 3 - Production Model Building

- [Summary of Steps](#Summary-of-Steps)
- [Preprocessing](#Preprocessing)
- [Hyperparameter Tuning](#Hyperparameter-Tuning)
- [Model Selection](#Model-Selection)
- [Kaggle Submission](#Kaggle-Submission)
- [Conclusion and Discussion](#Conclusion-and-Discussion)

In [1]:
# imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV, ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore")

#### Importing Data
[top](#Contents)

In [2]:
# Importing dataset

ames_data_lasso = ("datasets/final_ames.csv")

ames = pd.read_csv(ames_data_lasso)

ames.drop(columns = ["Unnamed: 0"], inplace = True)

---

## Summary of Steps

With the dataset cleaned and the features chosen, a linear regression model will have to be selectedas the base for the model. There are 4 choices: Linear Regression, Lasso Regression , Ridge regression and ElasticNet regression. An additional dummy regressor was added to serve as a baseline for measurement.

The data was split into training and test data sets to provide a measure of performance for each of the models, and then standard scaled. The best hyperparameters for each model were found, and all 5 models were cross validated. 

From the cross-validation, a Ridge Regression model with an alpha of about 48 provided the best score, and was used to generate the predictions for the Kaggle submission

## Preprocessing
[top](#Contents)

#### Splitting data into train and test sets

In [3]:
# Splitting varibles into features and target

features = list(ames.columns)
features.remove("saleprice")

X = ames[features]
y = ames["saleprice"]

In [4]:
# Splitting data into training set and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 7)

#### Scaling data

In [5]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

---

## Hyperparameter Tuning
[top](#Contents)

#### RidgeCV

In [6]:
r_alphas = np.logspace(0, 5, 200)
rid_cv = RidgeCV(alphas=r_alphas, cv = 10)
rid_cv = rid_cv.fit(X_train_scaled, y_train)

print (f'Best ridge alpha: {rid_cv.alpha_}')
print (f'Best ridge r2: {rid_cv.score(X_train_scaled, y_train)}')

Best ridge alpha: 34.09285069746813
Best ridge r2: 0.8901254491515729


#### LassoCV

In [7]:
l_alphas = np.arange(0.01, 0.20, 0.01)
las_cv = LassoCV(alphas=l_alphas, cv = 10, max_iter = 5000)
las_cv = las_cv.fit(X_train_scaled, y_train)

print (f'Best Lasso alpha: {las_cv.alpha_}')
print (f'Best Lasso r2: {las_cv.score(X_train_scaled, y_train)}')

Best Lasso alpha: 0.15000000000000002
Best Lasso r2: 0.8902576594786166


#### ElasticNetCV

In [None]:
# Checking for best ElasticNet alpha

e_alphas = np.arange(0.01, 0.20, 0.01)
l1_ratios = np.arange(0, 1, 0.1)
el_cv = ElasticNetCV(alphas=e_alphas, l1_ratio = l1_ratios, cv = 10, max_iter = 5000)
el_cv = el_cv.fit(X_train_scaled, y_train)

print (f'Best ElasticNet alpha: {el_cv.alpha_}')
print (f'Best ElasticNet L1 ratio: {el_cv.l1_ratio_}')
print (f'Best ElasticNet r2: {el_cv.score(X_train_scaled, y_train)}')

---

## Model Selection
[top](#Contents)

In [None]:
# Calling models
lr = LinearRegression()
rd = Ridge(alpha = rid_cv.alpha_)
ls = Lasso(alpha = las_cv.alpha_)
el = ElasticNet(alpha = el_cv.alpha_, l1_ratio = el_cv.l1_ratio_)
dum = DummyRegressor()

#### Cross Validation

In [None]:
# Custom function for cross validation

def validator(model, x_train, x_test, y_train, y_test, nfolds = 10):
    
    model.fit(x_train, y_train)
    
    kf = KFold(nfolds, shuffle = True, random_state = 5)
    
    train_rmse = np.sqrt(-cross_val_score(model, x_train, y_train, cv = kf, scoring = 'neg_mean_squared_error'))
    train_r2 = cross_val_score(model, x_train, y_train, cv=kf)
    
    y_predict = model.predict(x_test)
    
    test_rmse = np.sqrt(mean_squared_error(y_test, y_predict))
    test_r2 = model.score(x_test, y_test)
    
    print(f'For {model} with seen data:\nMean RMSE: {round(train_rmse.mean(),5)},\nMean CV r2:{round(train_r2.mean(),5)}\n')
    print(f'For {model} with unseen data:\nMean RMSE: {round(test_rmse.mean(),5)},\nMean CV r2:{round(test_r2.mean(),5)}\n')
    print("-----------------\n")

In [None]:
# Running function on all models

model_list = [dum,lr,rd,ls,el]

for i in model_list:
    validator(i, X_train_scaled,X_test_scaled, y_train,y_test)

---

## Kaggle Submission
[top](#Contents)

#### Importing and Cleaning Test Data

In [None]:
# Importing dataset

testcsv = ("datasets/test.csv")

test = pd.read_csv(testcsv)

# title cleanup

def edit_title (title):
    
    title = (title.replace(" ","_")).lower()
    
    return title

test.rename(columns = lambda i:edit_title(i), inplace = True)

# 3 column names start with numbers. Replacing numbers with strings.

test.rename(columns = {'1st_flr_sf':"first_flr_sf",  '2nd_flr_sf':'second_flr_sf',"3ssn_porch":"threessn_porch"}, inplace = True)

# Removing unwanted columns

no_data = ['id', 'pid','misc_feature', 'misc_val']

l_num = ["mas_vnr_area", "bsmtfin_sf_2","low_qual_fin_sf", "bsmt_half_bath", "half_bath", 
                          "kitchen_abvgr", "pool_area"]

l_nom = ["street","alley","land_contour","condition_1","condition_2","roof_matl","bsmtfin_type_2",
                         "heating","paved_drive","sale_type"]

l_ord = ["utilities", "land_slope","exter_cond","bsmt_cond","central_air", "electrical",
                          "functional", "garage_qual", "garage_cond", "pool_qc"]

low_cor = ["mo_sold","yr_sold"]

high_null = ['fireplace_qu','fence']

comb_list = no_data+l_num+l_nom+l_ord+low_cor+high_null

test.drop(columns = comb_list, inplace = True)

colinear = ['exter_qual', 'kitchen_qual','garage_yr_blt','garage_cars']

test.drop(columns = colinear, inplace = True)

# combining different types of porch sf to a single column

porch_sfs = ['open_porch_sf', 'enclosed_porch', 'threessn_porch','screen_porch']

test['porch_sf'] = test[porch_sfs].sum(axis = 1)

# dropping porch columns

test.drop(columns = porch_sfs, inplace = True)

# creating list of continuous data columns

num_cols = [i for i in test.columns if test[i].dtypes == int or test[i].dtypes == float]

# ms_subclass, while numerical in value, is nominal in nature. Removing it from the num_cols list.

# months and years should also be classified as ordinal rather than numerical data

list_nonnums = ["ms_subclass", "year_built", "year_remod/add"]

for i in list_nonnums:
    num_cols.remove(i)
    
# The remanining data had to be cross examined with the data dictionary to determine if it was nominal or ordinal.

# Listing out norminal data

cat_nom_cols = ['ms_subclass','ms_zoning',
 'lot_config',
 'neighborhood',
 'bldg_type',
 'house_style',
 'roof_style',
 'exterior_1st',
 'exterior_2nd',
 'mas_vnr_type',
 'foundation',
 'bsmtfin_type_1',
 'garage_type']

# Creating ordinal data list

cat_ord_cols = [i for i in test.columns if i not in num_cols and i not in cat_nom_cols]

# Changing all num_cols to floats.

for i in num_cols:
    test[i] = test[i].map(lambda x:float(x))

#for the "ms_subclass" column, data should be strings insted of floats

test["ms_subclass"] = test["ms_subclass"].map(lambda x:str(x))


# mean imputation for numerical columns

def mean_impute(df, col):
    for i in col:
        mean = df[i].mean()
        df[i] = df[i].fillna(mean)
        
mean_impute(test, num_cols)

# filling categorical columns with "NA"

test[cat_nom_cols] = test[cat_nom_cols].fillna("NA")
test[cat_ord_cols] = test[cat_ord_cols].fillna("NA")


# lable encoding lot_shape

lot_shape_dict = {"NA":0, "Reg":1, "IR1": 2, "IR2": 3, "IR3":4}

# lable encoding exterqual, extercon, bsmtqual, bsmtcon, heatingqc, kitchenqual, fireplacequ, garagequal, garagecond
# poolqc

qualcon_dict = {"Ex":5, "Gd":4, "TA":3, "Fa":2, "Po":1, "NA":0}

# lable encoding bsmtexposure

bsmtexp_dict = {"Gd":4, "Av":3, "Mn":2, "No":1, "NA":0}

# lable encoding garagefinish

garfin_dict = {"Fin":3, "RFn":2, "Unf":1, "NA":0}

# Combining dictionaries:

combine_dict = {**garfin_dict,**bsmtexp_dict,**qualcon_dict,**lot_shape_dict}

# applying dictionary to ordinal non numerical columns

for i in cat_ord_cols:
    if i not in list_nonnums:
        test[i] = test[i].map(combine_dict)
        
# converting nominal columns to dummies

dummy_noms = pd.get_dummies(test[cat_nom_cols], drop_first = True)

# dropping nominal columns

test.drop(columns = cat_nom_cols, inplace=True)

# log transform year built, year remod

def logt(df, col):
    df[col] = np.log(df[col])

for i in ["year_built", "year_remod/add"]:
    logt (test, i)
    
# merging dummies back into test set

final_test = test.merge(dummy_noms, left_index = True, right_index = True)

# Creating empty columns for dummy features in final test that are not in the trained model

for i in features:
    if i not in final_test.columns:
        final_test[i] = 0

#### Making Final Predictions

In [None]:
# Scaling training and test data

X_exam = final_test[features]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_e_scaled = scaler.fit_transform(X_exam)

In [None]:
r_alphas = np.logspace(0, 5, 200)
rid_cv = RidgeCV(alphas=r_alphas, cv = 10)
rid_cv = rid_cv.fit(X_scaled, y)

print (f'Best ridge alpha: {rid_cv.alpha_}')
print (f'Best ridge r2: {rid_cv.score(X_scaled, y)}')

In [None]:
# Selecting Ridge Regression with optimal alpha
rd = Ridge(alpha = rid_cv.alpha_)
rd.fit(X_scaled, y)
rd.score(X_scaled, y)

In [None]:
predictions = rd.predict(X_e_scaled)

In [None]:
# Creating Kaggle submission file

submission = pd.read_csv(testcsv)
submission['SalePrice'] = predictions
submission = submission[["Id", "SalePrice"]]
submission.to_csv('./datasets/kaggle_submission.csv',index=False)

---

## Conclusion and Discussion

The final Ridge regression model has 25 Features and a mean RMSE of 23752 on unseen data. It's public and private kaggle scores were 28919.01006 and 30388.85877 respectively.


These scores were not the lowest possible obtained(further discussed below), and suggest that the model is underfitting the test data. However, the model strikes a balance between accuracy and usability that conforms to the requirements of the client, and would be a successful back-end predictor for the interactive website as descirbed in the brief. 


There is, of course, room for further improvement in this model. 


Of great help would be the inclusion of a domain expert, i.e. an experienced property agent, for consultation. This domain expert could provide deep insights during the examination of data, as well as assist in feature engineering, providing more comprehensive and accurate results for the feature selection process. 


There are also methods for feature selection that are not limited to Linear Regression (e.g. random forests), as well as non-parametric models that might be more effective in predicting the price from the housing data.


Worth a further mention were several other iterations of this model which resulted in lower Kaggle scores. 


A model using 90 features, with everything else unchanged, scored an average of 500 points lower over multiple attempts, hinting that 25 features might not be providing the model enough data. 

Another attemt was made with log-converted sales price scores which resulted in a public kaggle score of around 22700, but a public score of around 38000. The large discrepancy suggests a difference in the test data for both score sets, and the real estate company should more acutely examine their test data for discrepancies.
