## Table of Contents <a name = 'content'></a>   

1. [Data Loading](#loading)  
2. [Linear Regression](#lin_reg)    
3. [Ridge Regression](#ridge)    
4. [Lasso Regression](#lasso)   
5. [Elastic Net](#el_net)   
6. [Decision Tree](#tree)   
7. [Random Forest](#forest)   

In [77]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import warnings
from sklearn.exceptions import ConvergenceWarning

# Data Loading <a name = 'loading'></a>  

[Table of Contents](#content)  

In [78]:
data = pd.read_csv('../../Data/data_processed.csv', index_col = 'Id')

data.head()

| Id | OverallQual | GrLivArea | GarageArea | YearBuilt | TotalBsmtSF | FullBath | YearRemodAdd | 2ndFlrSF | Fireplaces | LotArea | ... | Exterior1st_BrkFace | Exterior1st_CemntBd | Exterior1st_HdBoard | Exterior1st_MetalSd | Exterior1st_Plywood | Exterior1st_Stucco | Exterior1st_VinylSd | Exterior1st_Wd Sdng | Exterior1st_WdShing | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7 | 1710 | 548 | 2003 | 856 | 2 | 2003 | 854 | 0 | 8450 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 208500 |
| 2 | 6 | 1262 | 460 | 1976 | 1262 | 2 | 1976 | 0 | 1 | 9600 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 181500 |
| 3 | 7 | 1786 | 608 | 2001 | 920 | 2 | 2002 | 866 | 1 | 11250 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 223500 |
| 4 | 7 | 1717 | 642 | 1915 | 756 | 1 | 1970 | 756 | 1 | 9550 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 140000 |
| 5 | 8 | 2198 | 836 | 2000 | 1145 | 2 | 2000 | 1053 | 1 | 14260 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 250000 |


In [79]:
X = data.copy()  

y = X['SalePrice']

X = X.drop(['SalePrice'], axis = 1)

In [80]:
# Series for storing each model's cross-validated score
# (an explicit dtype avoids the pandas warning about empty Series)
models_results = pd.Series(dtype = 'float64')

# List for storing all models
all_models = []


# Classical Linear Regression <a name = 'lin_reg'></a>  

[Table of Contents](#content)

In [81]:
linear_regression = LinearRegression()

linear_regression_scores = cross_val_score(linear_regression,
                         X,
                         y,
                         cv = 5,
                         scoring = 'neg_mean_absolute_error')

print(linear_regression_scores.mean())

-14231.924375192946


In [82]:
models_results['linear_regression'] = linear_regression_scores.mean()

all_models.append(linear_regression)

# Ridge Regression <a name = 'ridge'></a>  

[Table of Contents](#content)

In [83]:
ridge_sample = Ridge()
ridge_hyper_params = {'alpha': range(1, 100, 5), 'random_state': [0]}
ridge_regression = GridSearchCV(ridge_sample, ridge_hyper_params, scoring = 'neg_mean_absolute_error', cv = 10)
ridge_regression.fit(X, y)

print('Best value of λ: ', ridge_regression.best_params_)
print('Best score: ', ridge_regression.best_score_)

Best value of λ:  {'alpha': 6, 'random_state': 0}
Best score:  -14054.142695898288


The coarse search places the best alpha somewhere near 6. Let's refine the grid around that value to get a more accurate estimate.
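
The coarse-then-fine search described above can be sketched as a small reusable routine. This is only an illustration under stated assumptions: it runs on synthetic data from `make_regression` rather than the housing data, and the names `search_alpha`, `X_demo`, `y_demo`, and `coarse_step` are hypothetical. The coarse optimum is added to the fine grid so the second stage cannot score worse than the first.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing data (assumption, not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=15.0, random_state=0)


def search_alpha(alphas):
    """Run a cross-validated grid search over the given alpha values."""
    search = GridSearchCV(Ridge(), {'alpha': list(alphas)},
                          scoring='neg_mean_absolute_error', cv=5)
    search.fit(X_demo, y_demo)
    return search.best_params_['alpha'], search.best_score_


# Stage 1: coarse grid with a wide step
coarse_step = 5
coarse_alpha, coarse_score = search_alpha(range(1, 100, coarse_step))

# Stage 2: fine grid spanning one coarse step on either side of the coarse optimum
low = max(coarse_alpha - coarse_step, 1e-3)
high = coarse_alpha + coarse_step
fine_grid = np.append(np.linspace(low, high, 40), coarse_alpha)
fine_alpha, fine_score = search_alpha(fine_grid)

print('coarse:', coarse_alpha, coarse_score)
print('fine:  ', fine_alpha, fine_score)
```
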

In [84]:
ridge_hyper_params = {'alpha': np.linspace(1, 10, 40), 'random_state': [0]}
ridge_regression = GridSearchCV(ridge_sample, ridge_hyper_params, scoring = 'neg_mean_absolute_error', cv = 10)
ridge_regression.fit(X, y)

print('Best value of λ: ', ridge_regression.best_params_)
print('Best score: ', ridge_regression.best_score_)

Best value of λ:  {'alpha': 3.076923076923077, 'random_state': 0}
Best score:  -14039.208670160035


Okay, now we'll save the ridge regression model with the best value of alpha.    

**NOTE!** The process of finding the best hyperparameters is the same for each model. I don't want to overload this notebook with similar blocks of code, so I will leave only the "best attempts" for each model.
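
Since the tuning procedure is the same for every model, it could also be wrapped in one helper. The sketch below is not the notebook's code: `tune`, `demo_results`, `demo_models`, and the synthetic data are hypothetical names and inputs chosen for illustration. It returns the best cross-validated score together with an unfitted estimator configured with the best parameters.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV


def tune(estimator, param_grid, X, y, cv=10):
    """Grid-search `estimator`; return the best CV score and an
    unfitted copy of the estimator set to the best parameters."""
    search = GridSearchCV(estimator, param_grid,
                          scoring='neg_mean_absolute_error', cv=cv)
    search.fit(X, y)
    # set_params takes keyword arguments, so the dict is unpacked with **
    best_model = clone(estimator).set_params(**search.best_params_)
    return search.best_score_, best_model


# Synthetic stand-in for the housing data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=15.0, random_state=0)

demo_results = {}
demo_models = []
for name, estimator, grid in [
    ('ridge', Ridge(), {'alpha': [0.1, 1.0, 10.0]}),
    ('lasso', Lasso(), {'alpha': [0.1, 1.0, 10.0]}),
]:
    demo_results[name], model = tune(estimator, grid, X_demo, y_demo)
    demo_models.append(model)

print(demo_results)
print(demo_models)
```

Unpacking the dictionary matters here: passing `best_params_` positionally would assign the whole dictionary to the estimator's first parameter instead of setting each hyperparameter.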

In [85]:
models_results['ridge_regression'] = ridge_regression.best_score_

# Unpack the dict so each entry becomes a keyword argument
ridge_regression = Ridge(**ridge_regression.best_params_)

In [86]:
all_models.append(ridge_regression)

# Lasso Regression <a name = 'lasso'></a>  

[Table of Contents](#content)

In [87]:
# I don't want to overload the output of the LASSO regression and Elastic Net
warnings.filterwarnings("ignore", category=ConvergenceWarning)

In [88]:
lasso_sample = Lasso()
lasso_hyper_params = {'alpha': range(45, 60), 'random_state': [0]}
lasso_regression = GridSearchCV(lasso_sample, lasso_hyper_params, scoring = 'neg_mean_absolute_error', cv = 10)
lasso_regression.fit(X, y)

print('best alpha: ', lasso_regression.best_params_)
print('score: ', lasso_regression.best_score_)

best alpha:  {'alpha': 52, 'random_state': 0}
score:  -13943.969522193185


Save the model with the best parameters.

In [89]:
models_results['lasso_regression'] = lasso_regression.best_score_

# Unpack the dict so each entry becomes a keyword argument
lasso_regression = Lasso(**lasso_regression.best_params_)

In [90]:
all_models.append(lasso_regression)

# Elastic Net <a name = 'el_net'></a>  

[Table of Contents](#content)

Repeat the same steps as for Ridge and Lasso regression. 

In [91]:
elastic_net_sample = ElasticNet()
elnet_hyper_params = {'alpha': range(45, 60), 'l1_ratio': np.linspace(0.99, 1, 5), 'random_state': [0]}
elastic_net = GridSearchCV(elastic_net_sample, elnet_hyper_params, scoring = 'neg_mean_absolute_error', cv = 10)
elastic_net.fit(X, y)

print('best alpha and l1_ratio: ', elastic_net.best_params_)
print('score: ', elastic_net.best_score_)

best alpha and l1_ratio:  {'alpha': 52, 'l1_ratio': 1.0, 'random_state': 0}
score:  -13943.969522193185


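The best `l1_ratio` is 1.0, at which Elastic Net reduces to Lasso, which is why the score is identical to the Lasso score. A pure Lasso penalty seems to be the best fit. 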
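
The identical score follows from the form of the penalty. In scikit-learn, the Elastic Net penalty is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`, so with `l1_ratio = 1.0` the L2 term vanishes and only the Lasso penalty remains. A minimal check on synthetic data (the data and the `*_demo` names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic stand-in for the housing data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=15.0, random_state=0)

lasso_demo = Lasso(alpha=1.0, random_state=0).fit(X_demo, y_demo)
elnet_demo = ElasticNet(alpha=1.0, l1_ratio=1.0,
                        random_state=0).fit(X_demo, y_demo)

# With l1_ratio = 1.0 both models solve the same optimisation problem
same_coefficients = np.allclose(lasso_demo.coef_, elnet_demo.coef_)
print(same_coefficients)
```
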

In [92]:
models_results['elastic_net'] = elastic_net.best_score_

# Unpack the dict so each entry becomes a keyword argument
elastic_net = ElasticNet(**elastic_net.best_params_)

In [93]:
all_models.append(elastic_net)

# Decision Tree <a name = 'tree'></a>  

[Table of Contents](#content)

In [100]:
decision_tree_sample = DecisionTreeRegressor()
decision_tree_hyper_params = {'max_depth': [8],
                              'min_samples_split': range(2, 4),
                              'min_samples_leaf': range(2, 5),
                              'max_features': range(40, 42),
                              'random_state': [0],
                              'min_impurity_decrease': range(0, 2),
                              'ccp_alpha': np.linspace(0, 1, 3)
                             }
decision_tree_regressor = GridSearchCV(decision_tree_sample, decision_tree_hyper_params, 
                                       scoring = 'neg_mean_absolute_error', cv = 10)
decision_tree_regressor.fit(X, y)

print('Best DT params: ', decision_tree_regressor.best_params_)
print('Best score: ', decision_tree_regressor.best_score_)

Best DT params:  {'ccp_alpha': 0.0, 'max_depth': 8, 'max_features': 41, 'min_impurity_decrease': 0, 'min_samples_leaf': 4, 'min_samples_split': 2, 'random_state': 0}
Best score:  -16727.753705152823


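The best values of `min_impurity_decrease` and `ccp_alpha` are both zero, so additional pruning does not improve the cross-validated score. With `max_depth = 8` and `min_samples_leaf = 4`, the model does not seem to suffer from overfitting.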
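
One way to check for overfitting more directly is to compare training error with validation error: a large gap between the two suggests the tree memorises the training folds. The sketch below uses `cross_validate` with `return_train_score=True` on synthetic data. The data, the helper `train_validation_mae`, and the two example trees are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=20.0, random_state=0)


def train_validation_mae(model):
    """Return the mean training MAE and mean validation MAE over 5 folds."""
    cv = cross_validate(model, X_demo, y_demo, cv=5,
                        scoring='neg_mean_absolute_error',
                        return_train_score=True)
    # Scores are negated MAE, so flip the sign to get the error itself
    return -cv['train_score'].mean(), -cv['test_score'].mean()


# Unconstrained tree versus a tree limited in depth and leaf size
deep_train, deep_val = train_validation_mae(
    DecisionTreeRegressor(random_state=0))
shallow_train, shallow_val = train_validation_mae(
    DecisionTreeRegressor(max_depth=8, min_samples_leaf=4, random_state=0))

print('unconstrained: train', deep_train, 'validation', deep_val)
print('constrained:   train', shallow_train, 'validation', shallow_val)
```
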

In [101]:
models_results['decision_tree_regressor'] = decision_tree_regressor.best_score_

best_params = decision_tree_regressor.best_params_
decision_tree_regressor = DecisionTreeRegressor(**best_params)

In [102]:
all_models.append(decision_tree_regressor)

# Random Forest <a name = 'forest'></a>  

[Table of Contents](#content)

In [96]:
random_forest_sample = RandomForestRegressor()
random_forest_hyper_params = {'n_estimators': [1100],
                              'max_depth': [31],
                              'min_samples_split': [3],
                              'min_samples_leaf': [1, 5, 10],
                              'max_features': [15],
                              'random_state': [0],
                              'n_jobs': [-1]
                             }
random_forest_regressor = GridSearchCV(random_forest_sample, random_forest_hyper_params, 
                                       scoring = 'neg_mean_absolute_error', cv = 5)
random_forest_regressor.fit(X, y)

print('Best parameters: ', random_forest_regressor.best_params_)
print('Best score: ', random_forest_regressor.best_score_)

Best parameters:  {'max_depth': 31, 'max_features': 15, 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 1100, 'n_jobs': -1, 'random_state': 0}
Best score:  -12596.608348303494


In [98]:
models_results['random_forest_regressor'] = random_forest_regressor.best_score_

# Unpack the dict so each entry becomes a keyword argument
random_forest_regressor = RandomForestRegressor(**random_forest_regressor.best_params_)

In [99]:
all_models.append(random_forest_regressor)

In [103]:
print(models_results.sort_values(ascending = False), '\n')

all_models

random_forest_regressor   -12596.608348
lasso_regression          -13943.969522
elastic_net               -13943.969522
ridge_regression          -14039.208670
linear_regression         -14231.924375
decision_tree_regressor   -16727.753705
dtype: float64 



[LinearRegression(),
 Ridge(alpha=3.076923076923077, random_state=0),
 Lasso(alpha=52, random_state=0),
 ElasticNet(alpha=52, l1_ratio=1.0, random_state=0),
 RandomForestRegressor(max_depth=31, max_features=15, min_samples_split=3,
                       n_estimators=1100, n_jobs=-1, random_state=0),
 DecisionTreeRegressor(max_depth=8, max_features=41, min_impurity_decrease=0,
                       min_samples_leaf=4, random_state=0)]