# Fitting and Evaluation of Models   

The purpose of this notebook is to try out different ML models on my data, tune their hyperparameters, and figure out which model best fits my data.   

In this project, I will be using different types of linear models and some tree-based ensemble models.

### Steps <a name = 'content'></a>   

1. [Data Loading](#loading)  
2. [Scoring Method](#scoring)    
3. [Linear Regression](#lin_reg)    
4. [Ridge Regression](#ridge)    
5. [Lasso Regression](#lasso)   
6. [Elastic Net](#el_net)   
7. [Decision Tree](#tree)   
8. [Random Forest](#forest)   
9. [Extreme Gradient Boosting](#xgboost)   
10. [Check and Save](#save)

In [18]:
# Install xgboost first so that the import below succeeds
%pip install xgboost

import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

Note: you may need to restart the kernel to use updated packages.


In [19]:
# Import the xgboost module itself; importing only XGBRegressor does not define the name `xgboost`
import xgboost

print(pd.__version__)
print(np.__version__)
#print(sklearn.__version__)
print(xgboost.__version__)

1.3.4
1.22.4

# Data Loading <a name = 'loading'></a>  

[Table of Contents](#content)  

First of all, we load the training data prepared in the EDA notebook.

In [2]:
data = pd.read_csv('../../Data/train_data_processed.csv', index_col = 'Id')
data.shape

(1132, 66)

Next, separate the target variable from the other features.

In [3]:
X = data.copy()  
y = X['SalePrice']
X = X.drop(['SalePrice'], axis = 1)

For convenience, I will save all model results into a pd.Series.

In [4]:
# An explicit dtype avoids the warning pandas raises for an empty Series
models_results = pd.Series(dtype='float64')


# Scoring Method <a name = 'scoring'></a>   

[Table of Contents](#content)

A word or two about the evaluation method used in this notebook. I will not use standard metrics such as `mean_absolute_error` or `root_mean_squared_error` directly. Instead, I will use the root-mean-squared error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price, because that is the metric used by the Kaggle competition the data comes from.     

Since this is not one of scikit-learn's built-in metrics, I need to define functions that can be passed to the `scoring` parameter of `cross_val_score` and `GridSearchCV`.    

In [5]:
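# Scorer for cross_val_score: a callable with the signature (estimator, X, y).
# Returns the RMSE of the log1p-transformed prices as a positive number (lower is better).
def rmse_log(model, X, y):
    y_pred = model.predict(X)
    return np.sqrt(mean_squared_error(np.log1p(y), np.log1p(y_pred)))

# Metric for GridSearchCV, wrapped with make_scorer below.
# Note: it uses np.log rather than np.log1p; for large values such as sale prices the difference is tiny.
def rmse_log_grid(y_true, y_pred):
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))

# greater_is_better=False makes scikit-learn negate the metric,
# so GridSearchCV reports negative scores (closer to zero is better).
custom_scorer = make_scorer(rmse_log_grid, greater_is_better=False)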
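
The two scoring styles above report the same kind of error with opposite signs, which matters later when scores are compared. The following is a minimal sketch on synthetic data, not the notebook's dataset; the names `rmsle`, `rmsle_callable`, `rmsle_scorer`, `X_demo` and `y_demo` are made up for illustration, and both variants use `log1p` here so that they are directly comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score


def rmsle(y_true, y_pred):
    """RMSE between log1p of the true and predicted values."""
    return np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))


def rmsle_callable(model, X, y):
    """Scorer with the (estimator, X, y) signature; returns a positive error."""
    return rmsle(y, model.predict(X))


# make_scorer with greater_is_better=False flips the sign of the metric
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# Synthetic, strictly positive "prices" (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 100_000 + X_demo @ np.array([5_000.0, 3_000.0, 1_000.0]) + rng.normal(0, 2_000, 200)

model = LinearRegression()
positive = cross_val_score(model, X_demo, y_demo, cv=5, scoring=rmsle_callable)
negated = cross_val_score(model, X_demo, y_demo, cv=5, scoring=rmsle_scorer)

print(positive.mean(), negated.mean())
```

With the same folds, the two results differ only in sign, so a score from a plain callable and a score from a `make_scorer(..., greater_is_better=False)` scorer should be compared by absolute value.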

# Classical Linear Regression <a name = 'lin_reg'></a>  

[Table of Contents](#content)

In [8]:
# Create a model instance
linear_regression = LinearRegression()

# Get cross-validation score
linear_regression_scores = cross_val_score(linear_regression,
                         X,
                         y,
                         cv = 5,
                         scoring = rmse_log)

print(linear_regression_scores.mean())

0.12149916987994222


In [51]:
# Store the result
models_results['linear_regression'] = linear_regression_scores.mean()

# Ridge Regression <a name = 'ridge'></a>  

[Table of Contents](#content)

In [10]:
# Create a model instance
ridge_sample = Ridge()

# Set the search grid for GridSearchCV
ridge_hyper_params = {'alpha': range(1, 100, 5), 'random_state': [0]}

# Create a GridSearchCV instance
ridge_regression = GridSearchCV(ridge_sample, ridge_hyper_params, scoring = custom_scorer, cv = 10)

# Fit the model
ridge_regression.fit(X, y)

print('Best value of λ: ', ridge_regression.best_params_)
print('Best score: ', ridge_regression.best_score_)

Best value of λ:  {'alpha': 1, 'random_state': 0}
Best score:  -0.1187338322310516


The coarse search picks `alpha = 1`, the smallest value in the grid, so the best value lies near the lower end of the range. Let's search a finer grid between 1 and 5 to get a more accurate value.

In [53]:
ridge_hyper_params = {'alpha': np.linspace(1, 5, 40), 'random_state': [0]}   
ridge_regression = GridSearchCV(ridge_sample, ridge_hyper_params, scoring = custom_scorer, cv = 10)
ridge_regression.fit(X, y)

print('Best value of λ: ', ridge_regression.best_params_)
print('Best score: ', ridge_regression.best_score_)

Best value of λ:  {'alpha': 2.641025641025641, 'random_state': 0}
Best score:  -0.11852401419682393


Okay, now we'll save the ridge regression model with the best value of alpha.    

**NOTE!** The process of finding the best hyperparameters is the same for each model. I don't want to overload this notebook with similar blocks of code, so I will leave only the "best attempts" for each model.

In [54]:
models_results['ridge_regression'] = ridge_regression.best_score_

ridge_best_params = ridge_regression.best_params_
ridge_regression = Ridge(**ridge_best_params)
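
Since the tuning process is the same for every model, it can be wrapped in a small helper. The sketch below is one possible way to do it, not the code used in this notebook: the `tune` function, the synthetic data and the built-in `neg_root_mean_squared_error` scorer (used instead of `custom_scorer` to keep the example self-contained) are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV


def tune(estimator, param_grid, X, y, cv=5):
    """Run a grid search and return the best parameters and the best score."""
    search = GridSearchCV(estimator, param_grid,
                          scoring="neg_root_mean_squared_error", cv=cv)
    search.fit(X, y)
    return search.best_params_, search.best_score_


# Synthetic regression data (hypothetical)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

# Step 1: coarse search over several orders of magnitude
coarse_params, coarse_score = tune(Ridge(), {"alpha": [0.01, 0.1, 1, 10, 100]}, X_demo, y_demo)

# Step 2: finer search around the coarse optimum (the coarse optimum itself stays in the grid)
best = coarse_params["alpha"]
fine_grid = np.unique(np.append(np.geomspace(best / 10, best * 10, 9), best))
fine_params, fine_score = tune(Ridge(), {"alpha": fine_grid}, X_demo, y_demo)

print("coarse:", coarse_params, coarse_score)
print("fine:  ", fine_params, fine_score)
```

Keeping the coarse optimum in the fine grid means the second search cannot end up with a worse score than the first one.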

# Lasso Regression <a name = 'lasso'></a>  

[Table of Contents](#content)

In [55]:
# I don't want to overload the output of the LASSO regression and Elastic Net
warnings.filterwarnings("ignore", category=ConvergenceWarning)

In [74]:
lasso_sample = Lasso()
lasso_hyper_params = {'alpha': range (30, 45), 'random_state': [0]}
lasso_regression = GridSearchCV(lasso_sample, lasso_hyper_params, scoring = custom_scorer, cv = 10)
lasso_regression.fit(X, y)

print('best alpha: ', lasso_regression.best_params_)
print('score: ', lasso_regression.best_score_)

best alpha:  {'alpha': 41, 'random_state': 0}
score:  -0.11749945302004305


Save the model with the best parameters.

In [75]:
models_results['lasso_regression'] = lasso_regression.best_score_

lasso_best_params = lasso_regression.best_params_
lasso_regression = Lasso(**lasso_best_params)

# Elastic Net <a name = 'el_net'></a>  

[Table of Contents](#content)

Repeat the same steps as for Ridge and Lasso regression. 

In [79]:
elastic_net_sample = ElasticNet()
elnet_hyper_params = {'alpha': range(35, 45), 'l1_ratio': np.linspace(0.99, 1, 10), 'random_state': [0]}
elastic_net = GridSearchCV(elastic_net_sample, elnet_hyper_params, scoring = custom_scorer, cv = 10)
elastic_net.fit(X, y)

print('best alpha and l1_ratio: ', elastic_net.best_params_)
print('score: ', elastic_net.best_score_)

best alpha and l1_ratio:  {'alpha': 41, 'l1_ratio': 1.0, 'random_state': 0}
score:  -0.11749945302004305


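The best `l1_ratio` is 1.0, which turns Elastic Net into pure Lasso regression; that is why the best alpha and the score are identical to the Lasso results above. Lasso regression seems to be the best fit among the linear models. 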

In [80]:
models_results['elastic_net'] = elastic_net.best_score_

elnet_best_params = elastic_net.best_params_
elastic_net = ElasticNet(**elnet_best_params)
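
The identical Lasso and Elastic Net scores are expected: with `l1_ratio = 1.0` the Elastic Net penalty contains only the L1 term. A minimal check on synthetic data (the data and the alpha value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression data (hypothetical)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Same alpha for both models; l1_ratio=1.0 removes the L2 part of the penalty
lasso = Lasso(alpha=0.5, random_state=0).fit(X_demo, y_demo)
enet = ElasticNet(alpha=0.5, l1_ratio=1.0, random_state=0).fit(X_demo, y_demo)

same_coefficients = np.allclose(lasso.coef_, enet.coef_)
print("Same coefficients:", same_coefficients)
```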

# Decision Tree <a name = 'tree'></a>  

[Table of Contents](#content)

All the same steps again.

In [85]:
decision_tree_sample = DecisionTreeRegressor()
decision_tree_hyper_params = {'max_depth': range(5, 8),
                              'min_samples_split': [2, 3],
                              'min_samples_leaf': [6, 7 ,8],
                              'max_features': range(32, 38),
                              'random_state': [0],
                              'min_impurity_decrease': range(0, 2),
                              'ccp_alpha': np.linspace(0, 1, 3)
                             }
decision_tree_regressor = GridSearchCV(decision_tree_sample, decision_tree_hyper_params, 
                                       scoring = custom_scorer, cv = 10)
decision_tree_regressor.fit(X, y)

print('Best DT params: ', decision_tree_regressor.best_params_)
print('Best score: ', decision_tree_regressor.best_score_)

Best DT params:  {'ccp_alpha': 0.0, 'max_depth': 6, 'max_features': 35, 'min_impurity_decrease': 0, 'min_samples_leaf': 6, 'min_samples_split': 2, 'random_state': 0}
Best score:  -0.1371179620163833


Extra pruning does not seem to help here: the best values of `min_impurity_decrease` and `ccp_alpha` are both zero, which suggests that the limits on depth and leaf size already keep overfitting in check.

In [86]:
models_results['decision_tree_regressor'] = decision_tree_regressor.best_score_

decision_tree_best_params = decision_tree_regressor.best_params_
decision_tree_regressor = DecisionTreeRegressor(**decision_tree_best_params)
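
One caveat worth checking, assuming `SalePrice` is on a raw price scale: `ccp_alpha` is compared against impurity reductions measured in squared target units, so candidate values between 0 and 1 may be too small to prune anything. A possible way to get candidates on the right scale is `cost_complexity_pruning_path`. The sketch below uses synthetic data and is not part of the original search.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Effective alphas at which the fully grown tree loses another subtree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)
ccp_alphas = np.clip(path.ccp_alphas, 0, None)  # guard against tiny negative round-off

# Pick five alphas spread over the whole path, from no pruning to maximal pruning
picked = ccp_alphas[np.linspace(0, len(ccp_alphas) - 1, 5).astype(int)]

leaves = []
for alpha in picked:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_demo, y_demo)
    leaves.append(tree.get_n_leaves())
    print(f"ccp_alpha={alpha:.3f} -> {leaves[-1]} leaves")
```

The values in `picked` could then serve as the `ccp_alpha` grid for `GridSearchCV` instead of a fixed range.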

# Random Forest <a name = 'forest'></a>  

[Table of Contents](#content)

In [93]:
random_forest_sample = RandomForestRegressor()
random_forest_hyper_params = {'n_estimators': [1150],
                              'max_depth': [27], # 30 14 - 0.11081
                              'min_samples_split': [3],
                              'min_samples_leaf': [1],
                              'max_features': [12],
                              'random_state': [0],
                              'n_jobs': [-1]
                             }
random_forest_regressor = GridSearchCV(random_forest_sample, random_forest_hyper_params, 
                                       scoring = custom_scorer, cv = 5)
random_forest_regressor.fit(X, y)

print('Best parameters: ', random_forest_regressor.best_params_)
print('Best score: ', random_forest_regressor.best_score_)

Best parameters:  {'max_depth': 27, 'max_features': 12, 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 1150, 'n_jobs': -1, 'random_state': 0}
Best score:  -0.11115767045055536


In [94]:
models_results['random_forest_regressor'] = random_forest_regressor.best_score_

random_forest_best_params = random_forest_regressor.best_params_
random_forest_regressor = RandomForestRegressor(**random_forest_best_params)

# Extreme Gradient Boosting <a name = 'xgboost'></a>  

[Table of Contents](#content)

In [95]:
# Split the data to use eval_set in selecting the best values for n_estimators and learning_rate
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [96]:
# Find the best values for n_estimators and learning_rate

# The best learning_rate was found through experimentation and multiple code executions. Here you can only see the result
xgb_regressor_presearch = XGBRegressor(n_estimators = 1000, learning_rate = 0.25, random_state = 0) 
xgb_regressor_presearch.fit(X_train, y_train,
                 early_stopping_rounds = 100,
                 eval_set = [(X_valid, y_valid)],
                 verbose = False)

print("Best value for n_estimators: ", xgb_regressor_presearch.best_iteration)

xgb_best_iteration = xgb_regressor_presearch.best_iteration
xgb_regressor_ = XGBRegressor(n_estimators = xgb_best_iteration)


# Get the cross-validated log-RMSE (rmse_log) for the pre-search XGBRegressor
scores = cross_val_score(xgb_regressor_presearch,
                         X, y,
                         cv = 10,
                         scoring = rmse_log
                        )
print('XGBRegressor score: ', scores.mean())



Best value for n_estimators:  38
XGBRegressor score:  0.11799976470681713
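
The idea behind this pre-search is to watch the error on a held-out validation set and keep the number of trees at which it is lowest. The sketch below illustrates the same idea with scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost, on synthetic data; it is not the XGBoost early-stopping API, and all names in it are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical)
X_demo, y_demo = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Fit with a generous number of trees
gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1, random_state=0)
gbr.fit(X_train, y_train)

# Validation RMSE after each boosting stage (1 tree, 2 trees, ...)
valid_rmse = [np.sqrt(mean_squared_error(y_valid, pred))
              for pred in gbr.staged_predict(X_valid)]

# argmin is a zero-based index, so add 1 to turn it into a tree count
best_n_estimators = int(np.argmin(valid_rmse)) + 1
print("Best number of trees:", best_n_estimators)
```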


In [98]:
# Find the best values for all other parameters using GridSearchCV

# All the values were found through experimentation and multiple code executions. Here you can only see the result
xgb_regressor_sample = XGBRegressor()
xgb_hyper_params = {'n_estimators': [xgb_best_iteration],
                    'learning_rate': [0.25],
                    'max_depth': [2, 3], 
                    'min_child_weight': [6, 7, 8],  
                    'gamma': [0],
                    'subsample': [1],
                    'colsample_bytree': [1],
                    'reg_alpha': [0],
                    'reg_lambda': [1],
                    'random_state': [0]
                   }

xgb_regressor = GridSearchCV(xgb_regressor_sample, xgb_hyper_params, scoring = custom_scorer, cv = 10)
xgb_regressor.fit(X, y)

print('Best parameters: ', xgb_regressor.best_params_)
print('Best score: ', xgb_regressor.best_score_)

Best parameters:  {'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.25, 'max_depth': 3, 'min_child_weight': 7, 'n_estimators': 38, 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 1, 'subsample': 1}
Best score:  -0.11238739375225304


In [99]:
models_results['xgb_regressor'] = xgb_regressor.best_score_

xgb_best_params = xgb_regressor.best_params_
xgb_regressor = XGBRegressor(**xgb_best_params)

# Check and Save <a name = 'save'></a>  

[Table of Contents](#content)

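Check all the scores. Note that the linear regression score comes from `rmse_log` and is a positive RMSE, while the other scores come from `custom_scorer` and are negated, so the models should be compared by absolute value: lower is better.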

In [100]:
models_results = models_results.sort_values(ascending = False)
models_results

linear_regression          0.121499
random_forest_regressor   -0.111158
xgb_regressor             -0.112387
elastic_net               -0.117499
lasso_regression          -0.117499
ridge_regression          -0.118524
decision_tree_regressor   -0.137118
dtype: float64
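
Because the scores above mix two sign conventions, the sort order is misleading: linear regression appears first only because its score is positive. A small sketch that re-enters the printed values by hand (rounded as shown above) and ranks the models by error magnitude:

```python
import pandas as pd

# Scores copied from the output above (rounded to six decimals)
scores = pd.Series({
    "linear_regression": 0.121499,
    "random_forest_regressor": -0.111158,
    "xgb_regressor": -0.112387,
    "elastic_net": -0.117499,
    "lasso_regression": -0.117499,
    "ridge_regression": -0.118524,
    "decision_tree_regressor": -0.137118,
})

# Put everything on the same scale (positive RMSE) and sort from best to worst
rmse_ranking = scores.abs().sort_values()
print(rmse_ranking)
```

On this common scale the random forest has the lowest error and the decision tree the highest, with linear regression second to last.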

Collect all the model objects in a list.

In [108]:
models = [linear_regression, ridge_regression, lasso_regression, elastic_net, decision_tree_regressor, 
          random_forest_regressor, xgb_regressor]
models

[LinearRegression(),
 Ridge(alpha=2.641025641025641, random_state=0),
 Lasso(alpha=41, random_state=0),
 ElasticNet(alpha=41, l1_ratio=1.0, random_state=0),
 DecisionTreeRegressor(max_depth=6, max_features=35, min_impurity_decrease=0,
                       min_samples_leaf=6, random_state=0),
 RandomForestRegressor(max_depth=27, max_features=12, min_samples_split=3,
                       n_estimators=1150, n_jobs=-1, random_state=0),
 XGBRegressor(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None, colsample_bytree=1,
              device=None, early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=0, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.25, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=3,
              max_leaves=None, min_child_weight=

The model parameters are saved correctly.

Save scores and model objects.

In [109]:
%store models
%store models_results

Stored 'models' (list)
Stored 'models_results' (Series)


And we are done here. Let's try our models on validation data and see the real results!