## AMES HOUSING DATA - PREDICTIVE ANALYSIS

Based on the preprocessed data (cleaned of missing values and outliers), the next step is data modeling, where we implement several ML algorithms/models available through the ***Scikit-Learn*** library.

**Objective**

* Fit the prepared data to regression models
* Implement techniques such as Cross-Validation, Regularization, Scaling/Normalization, and hyperparameter tuning
* The main aim is to reach the best possible performance metrics, evaluated and compared through the errors/scores obtained from multiple approaches.
* Since feature engineering derived many new columns from the categorical columns, the aim is also to identify the features that are most valuable in predicting house prices. (I'll be using the ***Lasso Regression*** model for this.)

#### PART 1: DATA LOADING

In [None]:
# importing useful libraries

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [27]:
df = pd.read_csv("Ames_Final_DF.csv")

In [29]:
df.drop("Unnamed: 0", axis = 1, inplace=True)
df.head()

Unnamed: 0,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,...,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD,Sale Condition_AdjLand,Sale Condition_Alloca,Sale Condition_Family,Sale Condition_Normal,Sale Condition_Partial
0,141.0,31770,6,5,1960,1960,112.0,639.0,0.0,441.0,...,False,False,False,False,True,False,False,False,True,False
1,80.0,11622,5,6,1961,1961,0.0,468.0,144.0,270.0,...,False,False,False,False,True,False,False,False,True,False
2,81.0,14267,6,6,1958,1958,108.0,923.0,0.0,406.0,...,False,False,False,False,True,False,False,False,True,False
3,93.0,11160,7,5,1968,1968,0.0,1065.0,0.0,1045.0,...,False,False,False,False,True,False,False,False,True,False
4,74.0,13830,5,5,1997,1998,0.0,791.0,0.0,137.0,...,False,False,False,False,True,False,False,False,True,False


In [31]:
#making X and y from dataframe
X = df.drop("SalePrice", axis = 1)
y = df['SalePrice']

In [33]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Columns: 273 entries, Lot Frontage to Sale Condition_Partial
dtypes: bool(238), float64(11), int64(24)
memory usage: 1.4 MB


There are 273 features in our set of independent variables. Not all of them will be useful for predicting the dependent variable (SalePrice). We can identify the most crucial features with Lasso Regression.

#### PART 2: DATA SPLITTING (TRAIN | TEST), SCALING & MODELING

**Steps to follow**

* Split
* Scale
* Model
* Retrain/Evaluate
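The four steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data from `make_regression` (an assumption, not the Ames data), and it uses `make_pipeline` as one possible way to make sure the scaler is fitted on the training split only; the notebook itself does the same thing manually with `fit` and `transform`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and prices (assumption).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=11
)

# 1. Split: hold out a test set before any fitting happens.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=11
)

# 2. Scale and 3. Model: the pipeline fits the scaler on X_tr only,
# then applies the same transformation when predicting on X_te.
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=100000))
pipe.fit(X_tr, y_tr)

# 4. Evaluate on the held-out data.
demo_mae = mean_absolute_error(y_te, pipe.predict(X_te))
print("MAE on held-out data:", round(demo_mae, 2))
```

Fitting the scaler on the training split alone keeps information from the test set out of the model, which is why the split comes first.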

In [36]:
#Splitting Train | Test data sets
from sklearn.model_selection import train_test_split

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=11)

In [45]:
#Scaling the X features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [48]:
#implementing LassoCV Regression
from sklearn.linear_model import LassoCV

In [54]:
model_ls = LassoCV(
    n_alphas = 100,
    max_iter = 1000000,
    eps = 0.001,
    cv = 10
)

In [56]:
model_ls.fit(X_train, y_train)

In [76]:
model_ls.alpha_

257.7945496809009

In [78]:
model_ls.intercept_

181399.4793261868

In [86]:
coef_array = model_ls.coef_

In [90]:
col_array = X.columns

In [98]:
col_coef = pd.DataFrame(data = [coef_array, col_array]).T
col_coef.columns = ['coefficient', 'columns']

In [129]:
del_columns = list(col_coef[col_coef['coefficient'] == 0]['columns'])
del_columns

['Bsmt Unf SF',
 '1st Flr SF',
 'Garage Yr Blt',
 '3Ssn Porch',
 'Misc Val',
 'MS SubClass_180',
 'MS SubClass_190',
 'MS SubClass_40',
 'MS SubClass_50',
 'MS SubClass_70',
 'MS SubClass_75',
 'MS SubClass_80',
 'MS Zoning_C (all)',
 'MS Zoning_FV',
 'MS Zoning_RH',
 'MS Zoning_RL',
 'Lot Shape_IR3',
 'Land Contour_Lvl',
 'Lot Config_FR2',
 'Neighborhood_ClearCr',
 'Neighborhood_CollgCr',
 'Neighborhood_IDOTRR',
 'Neighborhood_Landmrk',
 'Neighborhood_MeadowV',
 'Neighborhood_SawyerW',
 'Condition 1_Feedr',
 'Condition 2_Norm',
 'Condition 2_RRAn',
 'Condition 2_RRNn',
 'House Style_1.5Unf',
 'House Style_1Story',
 'House Style_2.5Unf',
 'House Style_2Story',
 'House Style_SFoyer',
 'House Style_SLvl',
 'Roof Style_Gable',
 'Roof Style_Gambrel',
 'Roof Matl_Metal',
 'Roof Matl_Roll',
 'Roof Matl_Tar&Grv',
 'Exterior 1st_AsphShn',
 'Exterior 1st_BrkComm',
 'Exterior 1st_CBlock',
 'Exterior 1st_VinylSd',
 'Exterior 1st_WdShing',
 'Exterior 2nd_Brk Cmn',
 'Exterior 2nd_CBlock',
 'Exterio

Above is the list of columns ignored by LassoCV. A coefficient of exactly zero means the fitted model does not use that feature, so these features are not essential for predicting house prices.
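To see how Lasso drives coefficients to exactly zero, here is a small sketch on a hypothetical toy frame in which only the columns `a` and `b` influence the target. The column names, the data, and `alpha=0.5` are all assumptions chosen for illustration. A `pd.Series` indexed by column name is one compact way to pair coefficients with features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical toy data: 'a' and 'b' drive the target, the rest is noise.
toy = pd.DataFrame(
    rng.normal(size=(200, 4)), columns=["a", "b", "noise_1", "noise_2"]
)
target = 5.0 * toy["a"] - 3.0 * toy["b"] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(toy, target)

# Pair every coefficient with its column name.
coefs = pd.Series(lasso.coef_, index=toy.columns)

# Columns with a zero coefficient are not used by the fitted model.
dropped = coefs[coefs == 0].index.tolist()
kept = coefs[coefs != 0].index.tolist()

print(coefs)
print("dropped:", dropped)
print("kept:", kept)
```

The L1 penalty shrinks every coefficient, and features whose contribution is smaller than the penalty are set to exactly zero rather than merely made small.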

**Conclusion**

* Deciding whether to remove these features involves a trade-off.
  * Since we have a large dataset of 274 columns, it would help to reduce it somewhat, say by removing these 99 features.
  * On the other hand, removing them may lead to a marginally higher error. (After removing the columns, we can try models other than LassoCV and compare the errors.)

To summarize, let's find the error margin from this trained model. In the next step, I will remove these features from original dataset and apply other regression models to check and evaluate.

In [117]:
#importing error metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
y_test_pred = model_ls.predict(X_test)
print("RMSE is:", round(np.sqrt(mean_squared_error(y_test, y_test_pred))))
print("MAE is:", round(mean_absolute_error(y_test, y_test_pred)))

RMSE is: 21431
MAE is: 14414


With LassoCV Regression, the MAE is ***$14414***, meaning the predicted price is off by this amount on average.
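As a reminder of what the two metrics measure, the sketch below computes them by hand for four made-up prices (the numbers are illustrative assumptions, not taken from the dataset).

```python
import numpy as np

# Hypothetical actual and predicted prices.
actual = np.array([200000, 150000, 300000, 250000])
predicted = np.array([210000, 140000, 330000, 250000])

errors = predicted - actual  # [10000, -10000, 30000, 0]

# MAE: mean of the absolute errors.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error.
rmse = np.sqrt(np.mean(errors ** 2))

print("MAE:", mae)              # 12500.0
print("RMSE:", round(rmse, 2))  # 16583.12
```

RMSE squares the errors before averaging, so the single 30000 miss weighs more heavily in it than in MAE. That is why RMSE is never smaller than MAE and why the two metrics can rank models differently.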

In [322]:
#removing the unused columns from the original dataframe to create a new one
Imp_dataframe = df.drop(del_columns, axis = 1)

In [300]:
Imp_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Columns: 175 entries, Lot Frontage to Sale Condition_Partial
dtypes: bool(144), float64(9), int64(22)
memory usage: 1.1 MB


#### Working with new dataframe with 175 columns

In [133]:
#converting X and y from new dataframe
X = Imp_dataframe.drop('SalePrice', axis = 1)
y = Imp_dataframe['SalePrice']

In [137]:
#left columns
len(X.columns)

174

In [139]:
#splitting the dataset into train | test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=11)

In [143]:
#scaling X features
new_scale = StandardScaler()
X_train = new_scale.fit_transform(X_train)
X_test = new_scale.transform(X_test)

**Let's start with a basic linear regression model that uses no alpha values or cross-validation**

In [151]:
from sklearn.linear_model import LinearRegression

In [153]:
model_lr = LinearRegression()

In [162]:
model_lr.fit(X_train, y_train)

In [168]:
y_test_pred = model_lr.predict(X_test)

In [170]:
print("RMSE is:", round(np.sqrt(mean_squared_error(y_test, y_test_pred))))
print("MAE is:", round(mean_absolute_error(y_test, y_test_pred)))

RMSE is: 21257
MAE is: 14483


Not a significant difference: the MAE is slightly higher at 14483, although the RMSE improved (21257 vs. 21431).

**Improving the regression through cross-validation gives a more reliable picture of how the model will behave on the test set. The options include manual validation, cross_val_score, and cross_validate; here GridSearchCV is preferred because it searches the candidate alpha values and handles the CV folds in one step.**
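The difference between these options can be illustrated with a short sketch on synthetic data. It assumes plain `Ridge` (with the scalar parameter `alpha`) as the estimator, which is a common pairing with `GridSearchCV` and is shown only as an alternative illustration; the alpha candidates are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (assumption).
X_cv, y_cv = make_regression(
    n_samples=200, n_features=10, noise=15.0, random_state=11
)

# cross_val_score: one score per fold for a single, fixed alpha.
fold_scores = cross_val_score(
    Ridge(alpha=1.0), X_cv, y_cv, scoring="neg_mean_squared_error", cv=5
)
print("per-fold scores:", fold_scores)

# GridSearchCV: repeats the same cross-validation for every candidate
# alpha and keeps the one with the best mean score.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0, 100.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_cv, y_cv)
print("best alpha:", search.best_params_["alpha"])
```

The scores are negated mean squared errors, so values closer to zero are better. After fitting, `search.predict` uses the best estimator refitted on all the data passed to `fit`.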

In [186]:
from sklearn.linear_model import RidgeCV

In [191]:
base_model_rg = RidgeCV()

In [334]:
#creating a dictionary for parameter grid to be passed in GridSearchCV
param_grid = {'alphas':[100, 200, 253, 254, 255, 256, 257.79, 260, 280, 300, 320, 360, 400, 450, 480]}

In [155]:
from sklearn.model_selection import GridSearchCV

In [336]:
grid_model = GridSearchCV(
    estimator = base_model_rg,
    param_grid = param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
    verbose = 2
)

In [338]:
grid_model.fit(X_train, y_train)

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV] END .........................................alphas=100; total time=   0.0s
[CV] END .........................................alphas=100; total time=   0.0s
[CV] END .........................................alphas=100; total time=   0.0s
[CV] END .........................................alphas=100; total time=   0.0s
[CV] END .........................................alphas=100; total time=   0.0s
[CV] END .........................................alphas=200; total time=   0.0s
[CV] END .........................................alphas=200; total time=   0.0s
[CV] END .........................................alphas=200; total time=   0.0s
[CV] END .........................................alphas=200; total time=   0.0s
[CV] END .........................................alphas=200; total time=   0.0s
[CV] END .........................................alphas=253; total time=   0.0s
[CV] END .......................................

In [340]:
grid_model.best_params_

{'alphas': 100}

In [342]:
y_test_pred = grid_model.predict(X_test)

In [344]:
print("RMSE is:", round(np.sqrt(mean_squared_error(y_test, y_test_pred))))
print("MAE is:", round(mean_absolute_error(y_test, y_test_pred)))

RMSE is: 21842
MAE is: 14770


Let's compare the performance metrics from all three approaches:

| Model | Features | MAE |
| ----- | -------- | --- |
| LassoCV | all features | 14414 |
| Linear Regression | selected features | 14483 |
| RidgeCV with GridSearchCV | selected features | 14770 |

#### NOTE: RidgeCV with GridSearchCV picked an alpha value of 100, while LassoCV picked an alpha value of 257.79

Let's use ElasticNet with GridSearchCV

**OBJECTIVE**
* This tests which regression model, Ridge or Lasso, suits the new dataframe better (through the `l1_ratio` parameter)
* It also lets us analyze the best possible alpha values
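The role of `l1_ratio` can be checked directly: with `l1_ratio=1` the ElasticNet penalty is purely L1, so the fit matches `Lasso` at the same alpha, while smaller values mix in an L2 (Ridge-style) penalty. The sketch below uses synthetic data and an arbitrary `alpha=1.0`, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic stand-in data (assumption).
X_en, y_en = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=11
)

lasso = Lasso(alpha=1.0).fit(X_en, y_en)

# l1_ratio=1.0: pure L1 penalty, i.e. Lasso.
enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_en, y_en)

# l1_ratio=0.1: mostly L2 penalty, closer to Ridge behaviour.
enet_mixed = ElasticNet(alpha=1.0, l1_ratio=0.1).fit(X_en, y_en)

same_as_lasso = np.allclose(lasso.coef_, enet_l1.coef_)
mixed_differs = not np.allclose(lasso.coef_, enet_mixed.coef_)

print("l1_ratio=1 matches Lasso:", same_as_lasso)
print("l1_ratio=0.1 differs from Lasso:", mixed_differs)
```

So when the grid search returns `l1_ratio = 1`, it is effectively selecting a Lasso model, and only the alpha value remains to be compared.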

In [234]:
from sklearn.linear_model import ElasticNet

In [238]:
base_elastic_model = ElasticNet(max_iter=1000000)

In [346]:
param_grid = { 'alpha' : [100, 200, 253, 254, 255, 256, 257.79, 260, 280, 300, 320, 360, 400, 450, 480],
    'l1_ratio' : [0.1, 0.2, 0.5, 0.7, 0.9, 0.95, 0.99, 1]
}

In [348]:
grid_model = GridSearchCV(
    estimator = base_elastic_model,
    param_grid = param_grid,
    scoring='neg_mean_squared_error',
    cv=25,
    verbose = 0
)

In [350]:
grid_model.fit(X_train, y_train)

In [262]:
#let's check different attributes based on trained grid model
grid_model.best_params_

{'alpha': 200, 'l1_ratio': 1}

In [330]:
y_test_pred = grid_model.predict(X_test)
print("RMSE is:", round(np.sqrt(mean_squared_error(y_test, y_test_pred))))
print("MAE is:", round(mean_absolute_error(y_test, y_test_pred)))

RMSE is: 21306
MAE is: 14374


#### Conclusion
The alpha is still different, although ElasticNet chose pure Lasso Regression (l1_ratio = 1).

* This is worth noticing; either way, the best-fit model for the dataset is **LassoCV** (whether trained on the full dataset or on the new dataframe).
* Another thing to notice is the performance of **LassoCV** on its own, rather than being selected indirectly through ElasticNet + GridSearchCV. The best fit will be the model trained through **LassoCV** directly.

----

#### OPTIONAL
Retrain the model with LassoCV alone on the new dataframe of 175 columns (with the same hyperparameters).

This produced a slight change, as follows:

In [270]:
model_new_ls = LassoCV(
    n_alphas = 100,
    max_iter = 1000000,
    eps = 0.001,
    cv = 10
)

In [272]:
model_new_ls.fit(X_train, y_train)

In [274]:
y_test_pred = model_new_ls.predict(X_test)
print("RMSE is:", round(np.sqrt(mean_squared_error(y_test, y_test_pred))))
print("MAE is:", round(mean_absolute_error(y_test, y_test_pred)))

RMSE is: 21213
MAE is: 14404


In [315]:
model_new_ls.alpha_
#getting new alpha value

63.85767837180903

In [286]:
obs = model_new_ls.coef_
col_array = X.columns
new_obs = pd.DataFrame(data = [obs, col_array]).T
new_obs.columns = ['coefficient', 'columns']
new_obs

Unnamed: 0,coefficient,columns
0,1672.465221,Lot Frontage
1,4479.959398,Lot Area
2,9559.93825,Overall Qual
3,6136.190088,Overall Cond
4,10363.190841,Year Built
...,...,...
169,-681.514236,Sale Type_Oth
170,979.401207,Sale Condition_AdjLand
171,728.73888,Sale Condition_Alloca
172,3047.873171,Sale Condition_Normal


In [317]:
del_col = list(new_obs[new_obs['coefficient'] == 0]['columns'])
del_col
#some of the remaining 175 columns still get a zero coefficient, i.e. they aren't used for the price prediction

['Low Qual Fin SF',
 'Roof Style_Hip',
 'Exterior 1st_HdBoard',
 'Exterior 1st_Wd Sdng',
 'Mas Vnr Type_BrkFace']

In [319]:
new_df = Imp_dataframe.drop(del_col, axis = 1)
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Columns: 170 entries, Lot Frontage to Sale Condition_Partial
dtypes: bool(140), float64(9), int64(21)
memory usage: 1.1 MB


----

Running Lasso Regression on each new dataframe formed after deleting columns is an iterative process. We can see minor reductions in MAE along the way, but they do not make a significant difference (unless stricter feature selection is required).
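The iteration described here could be automated roughly as follows. This is a sketch on synthetic data with hypothetical column names `f0` to `f29` and an arbitrary cap of 10 rounds. For brevity it fits on the whole frame, whereas in the notebook the fit would be done on the training split only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical column names (assumption).
X_arr, y_it = make_regression(
    n_samples=300, n_features=30, n_informative=8, noise=20.0, random_state=11
)
X_it = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(30)])

max_rounds = 10  # safety cap, chosen arbitrarily
for round_no in range(1, max_rounds + 1):
    # Rescale and refit on the columns that are still present.
    X_scaled = StandardScaler().fit_transform(X_it)
    model = LassoCV(cv=5, max_iter=100000).fit(X_scaled, y_it)

    # Columns whose coefficient is exactly zero in this round.
    zero_cols = X_it.columns[model.coef_ == 0].tolist()
    print(f"round {round_no}: {X_it.shape[1]} features, "
          f"{len(zero_cols)} with a zero coefficient")

    if not zero_cols:
        break  # nothing left to remove
    X_it = X_it.drop(columns=zero_cols)

print("features kept:", X_it.shape[1])
```

Each round can select a different alpha, which is why columns that survived one round may still be zeroed out in the next.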