In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


%matplotlib inline

## 1. Cross Val Score

**Recall house prices prediction problem**

We will use the same dataset as the previous week, but already preprocessed.

In [2]:
data = pd.read_csv('house_prices_prep.csv')
data.head()

Unnamed: 0,SalePrice,LotArea,OverallQual,MasVnrArea,TotalBsmtSF,GrLivArea,FullBath,GarageCars,Fireplaces,WoodDeckSF,...,SaleCondition_Normal,SaleCondition_Partial,SaleCondition_other,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_None
0,208500,8450.0,7,196.0,856.0,1710.0,2,2,0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,181500,9600.0,6,0.0,1262.0,1262.0,2,2,1,298.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,223500,11250.0,7,162.0,920.0,1786.0,2,2,1,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,140000,9550.0,7,0.0,756.0,1717.0,1,3,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,250000,14260.0,8,350.0,1145.0,2198.0,2,3,1,192.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [3]:
from sklearn.model_selection import train_test_split

tr, te = train_test_split(data, test_size=0.2, random_state=42)

y_train = tr.SalePrice
y_test = te.SalePrice
X_train = tr.drop(['SalePrice'], axis=1)
X_test = te.drop(['SalePrice'], axis=1)

All the preprocessing was already done, so the only thing we need to do is scale numerical features. For example, we can use `StandardScaler` for that.

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

num_cols = ['LotArea', 'MasVnrArea', 'TotalBsmtSF', 'GrLivArea', 'WoodDeckSF', 'OpenPorchSF', 'Age', 'RemodAge']

# transform
column_transforms = ColumnTransformer([
    ('scaling', StandardScaler(), num_cols)
], remainder='passthrough')
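
To see what such a transformer returns, here is a minimal sketch on a toy frame. The values and the `toy` names are made up for illustration and are not part of the housing data. It suggests that the scaled columns come first in the output and the passthrough columns are appended after them, so the column order can differ from the input.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical values): one dummy column and two numeric columns.
toy = pd.DataFrame({
    'GarageType_Attchd': [1.0, 0.0, 1.0, 0.0],
    'LotArea': [8450.0, 9600.0, 11250.0, 9550.0],
    'GrLivArea': [1710.0, 1262.0, 1786.0, 1717.0],
})

toy_transform = ColumnTransformer([
    ('scaling', StandardScaler(), ['LotArea', 'GrLivArea'])
], remainder='passthrough')

toy_out = toy_transform.fit_transform(toy)

# Columns 0 and 1 are the scaled features, column 2 is the untouched dummy.
print(toy_out.round(2))
```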

Total Pipeline:
 - Column Transformer
 - Linear Regression

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# define pipeline
pipe = make_pipeline(
    column_transforms,
    LinearRegression()
)

But what if we want to try different preprocessing? For example, we could use `MinMaxScaler` for numerical features instead of `StandardScaler`.


In [6]:
from sklearn.preprocessing import MinMaxScaler

# option 2
column_transforms_2 = ColumnTransformer([
    ('scaling', MinMaxScaler(), num_cols)
], remainder='passthrough')

pipe_2 = make_pipeline(column_transforms_2, LinearRegression())


We would like to compare Linear Regression with these two types of preprocessing **before** evaluating the model on the test set. Cross-validation is very useful in this case.

![im](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

The `sklearn.model_selection` module has a function `cross_val_score`.

**Parameters**:
 - estimator (model or the whole pipeline)
 - training data
 - number of folds or custom CV object
 - scorer 
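
As a rough illustration of these four arguments, the sketch below runs `cross_val_score` on synthetic data (an assumption, used only so the example is self-contained) with a `KFold` object as the custom CV splitter. It also compares the `'neg_mean_squared_error'` scorer with `'neg_root_mean_squared_error'`, which returns the per-fold RMSE directly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (assumption, for illustration only).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Custom CV object: 5 shuffled folds with a fixed seed.
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)

mse_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             cv=cv_demo, scoring='neg_mean_squared_error')
rmse_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                              cv=cv_demo, scoring='neg_root_mean_squared_error')

# Scorers follow "greater is better", so errors are returned negated.
print(np.sqrt(-mse_scores))
print(-rmse_scores)
```

Both printed arrays hold one RMSE value per fold; averaging them gives the kind of CV score used in this notebook.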

In [7]:
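# possible scorers
# (on recent scikit-learn versions, where `SCORERS` has been removed,
#  use `sorted(sklearn.metrics.get_scorer_names())` instead)
import sklearn.metrics
sorted(sklearn.metrics.SCORERS.keys())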

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'max_error',
 'mutual_info_score',
 'neg_brier_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'neg_root_mean_squared_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'roc_auc_ovo',
 'roc_auc_ovo_weighted',
 'roc_auc_ovr',
 'roc_auc_ovr_weighted',
 'v_measure_score']

We will use K-Fold cross-validation, but other, more sophisticated options are available. You can read about them [here](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators).

CV-score of the first pipeline:

In [8]:
from sklearn.model_selection import cross_val_score

np.mean(
    (-cross_val_score(pipe, 
                     X_train, 
                     y_train, 
                     cv=10, 
                     scoring='neg_mean_squared_error'))**0.5
)

35405.07542373271

CV-score of the second pipeline:

In [9]:
# your code here
np.mean(
    (-cross_val_score(pipe_2, 
                     X_train, 
                     y_train, 
                     cv=10, 
                     scoring='neg_mean_squared_error'))**0.5
)

35405.075423732706

The two scores agree up to floating-point error, so for plain Linear Regression the choice between `StandardScaler` and `MinMaxScaler` makes no difference here.
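
A likely explanation for the matching scores is that ordinary least squares with an intercept gives the same predictions under any invertible per-feature rescaling, because the rescaling is absorbed into the coefficients. The sketch below checks this on synthetic data (an assumption, not the housing set) and contrasts it with Ridge, where the penalty makes the feature scale matter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_demo, y_demo = make_regression(n_samples=100, n_features=4,
                                 noise=5.0, random_state=0)

def fitted_predictions(scaler, model):
    """Fit scaler + model on the demo data and return in-sample predictions."""
    return make_pipeline(scaler, model).fit(X_demo, y_demo).predict(X_demo)

# Unpenalized regression: predictions do not depend on the scaler.
ols_same = np.allclose(
    fitted_predictions(StandardScaler(), LinearRegression()),
    fitted_predictions(MinMaxScaler(), LinearRegression()))

# Penalized regression: the same alpha shrinks differently scaled features
# by different amounts, so the predictions change.
ridge_same = np.allclose(
    fitted_predictions(StandardScaler(), Ridge(alpha=10.0)),
    fitted_predictions(MinMaxScaler(), Ridge(alpha=10.0)))

print(ols_same, ridge_same)
```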

---

## 2. Linear Regression with Regularization

**Lasso**
$$
\min_{w} MSE + \lambda \|w\|_1
$$


**Ridge**
$$
\min_{w} MSE + \lambda \|w\|_2^2
$$
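
The practical difference between the two penalties can be sketched on synthetic data (an assumption, chosen so that only 3 of 20 features carry signal): the $\|w\|_1$ penalty tends to set some coefficients exactly to zero, while the $\|w\|_2^2$ penalty only shrinks them. In scikit-learn $\lambda$ is the `alpha` argument. Note that the library scales the data-fit term differently for the two models (Lasso divides the residual sum of squares by $2n$, Ridge leaves it unnormalized), so equal `alpha` values are not directly comparable between them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 20 features, only 3 of which affect the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=3,
                                 noise=10.0, random_state=0)

lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each fitted model.
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
```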

Let us use cross-validation to compare Lasso and Ridge regression.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso, Ridge

# define pipelines
pipe_lasso = Pipeline([
    ('transform', column_transforms),
    ('lasso', Lasso())
])

pipe_ridge = Pipeline([
    ('transform', column_transforms),
    ('ridge', Ridge())
])

In [13]:
# lasso cv score
np.mean(
    (-cross_val_score(pipe_lasso, 
                     X_train, 
                     y_train, 
                     cv=10, 
                     scoring='neg_mean_squared_error'))**0.5
)

35404.12693645839

In [14]:
# ridge cv score
np.mean(
    (-cross_val_score(pipe_ridge, 
                     X_train, 
                     y_train, 
                     cv=10, 
                     scoring='neg_mean_squared_error'))**0.5
)

35390.22070413702

In [15]:
# take a look at our pipeline
pipe_lasso.steps

[('transform',
  ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                    transformer_weights=None,
                    transformers=[('scaling',
                                   StandardScaler(copy=True, with_mean=True,
                                                  with_std=True),
                                   ['LotArea', 'MasVnrArea', 'TotalBsmtSF',
                                    'GrLivArea', 'WoodDeckSF', 'OpenPorchSF',
                                    'Age', 'RemodAge'])],
                    verbose=False)),
 ('lasso',
  Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
        normalize=False, positive=False, precompute=False, random_state=None,
        selection='cyclic', tol=0.0001, warm_start=False))]

But now we also want to try different values of the regularization coefficient. Creating a new pipeline for each option would be too much work, so we need a better solution. `GridSearchCV` will help us.

In [16]:
from sklearn.model_selection import GridSearchCV

In [20]:
# define parameter grid 
param_grid = {
    'ridge__alpha': [1e-4, 1e-2, 0.1, 1., 10.]
}


# define `GridSearchCV` object
pipe_cv = GridSearchCV(pipe_ridge, param_grid=param_grid,
                      cv=10, scoring='neg_mean_squared_error')
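
The key `ridge__alpha` in the grid follows the `<step name>__<parameter>` naming convention of pipelines. A minimal sketch of that convention, using a small stand-in pipeline (`demo_pipe` is a hypothetical name, not the notebook's pipeline):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([('scaling', StandardScaler()), ('ridge', Ridge())])

# Every tunable parameter is exposed under '<step>__<parameter>'.
ridge_keys = sorted(k for k in demo_pipe.get_params() if k.startswith('ridge__'))
print(ridge_keys)

# set_params accepts the same names; a grid search applies each candidate
# value through this mechanism.
demo_pipe.set_params(ridge__alpha=10.0)
print(demo_pipe.named_steps['ridge'].alpha)
```

Calling `get_params()` on a pipeline is a quick way to find the exact key to put into a parameter grid.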

In [22]:
# fit `pipe_cv`
pipe_cv.fit(X_train, y_train)

# get best estimator
pipe_cv.best_estimator_

Pipeline(memory=None,
         steps=[('transform',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('scaling',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True),
                                                  ['LotArea', 'MasVnrArea',
                                                   'TotalBsmtSF', 'GrLivArea',
                                                   'WoodDeckSF', 'OpenPorchSF',
                                                   'Age', 'RemodAge'])],
                                   verbose=False)),
                ('ridge',
                 Ridge(alpha=10.0, copy_X=True, fit_intercept=True,
    

So, among the values tried, the best regularization parameter for Ridge is $\lambda=10$ (`alpha` in scikit-learn). Since this is the largest value in the grid, larger values may be worth trying as well.
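
Besides `best_estimator_`, a fitted grid search exposes `best_params_`, `best_score_` and the full table `cv_results_`. The sketch below shows them on synthetic data with a simplified pipeline (both are assumptions for illustration). Note that the square root of `-best_score_` is the root of the mean CV MSE, which is not exactly the same quantity as the mean of per-fold RMSEs computed earlier with `cross_val_score`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=20.0, random_state=0)

demo_pipe = Pipeline([('scaling', StandardScaler()), ('ridge', Ridge())])
demo_grid = {'ridge__alpha': [1e-4, 1e-2, 0.1, 1., 10.]}

demo_cv = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=5,
                       scoring='neg_mean_squared_error')
demo_cv.fit(X_demo, y_demo)

print(demo_cv.best_params_)
print((-demo_cv.best_score_) ** 0.5)  # square root of the mean CV MSE

# One row per candidate, with mean score and rank across folds.
results = pd.DataFrame(demo_cv.cv_results_)
print(results[['param_ridge__alpha', 'mean_test_score', 'rank_test_score']])
```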

### Compare models with GridSearchCV
So far we've used cross-validation to:
- Compare two different models
- Select best set of hyperparameters within one model

But what if we want to do both? We can use `GridSearchCV` to compare different models with different sets of hyperparameters and select the best one. 

To do that, we need to add different models into the parameter grid. 

In [25]:
from sklearn.pipeline import Pipeline

# define pipe 
pipe = Pipeline([
    ('preprocess', column_transforms),
    ('reg', Ridge())
])

# define param grid
param_grid = {
    'reg': [Ridge(), Lasso()],
    'reg__alpha': [1e-2, 0.1, 1., 10.] 
}

# define grid search object
pipe_cv = GridSearchCV(pipe, param_grid=param_grid,
                      cv=10, scoring='neg_mean_squared_error')

In [26]:
# fit `pipe_cv`
pipe_cv.fit(X_train, y_train)

# get best estimator
pipe_cv.best_estimator_

Pipeline(memory=None,
         steps=[('preprocess',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('scaling',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True),
                                                  ['LotArea', 'MasVnrArea',
                                                   'TotalBsmtSF', 'GrLivArea',
                                                   'WoodDeckSF', 'OpenPorchSF',
                                                   'Age', 'RemodAge'])],
                                   verbose=False)),
                ('reg',
                 Ridge(alpha=10.0, copy_X=True, fit_intercept=True,
     

In [5]:
# print the score of the best model (RMSE computed from the mean CV MSE)
print((-pipe_cv.best_score_) ** 0.5)

---

**[Optional task]**
Finally, what if we also want to compare the Linear Regression model with Ridge and Lasso? 

We cannot add it to the list of models in the parameter grid above, because it does not have an `alpha` parameter.
It turns out that `GridSearchCV` can handle this situation as well: we can pass a **list of dictionaries** as the param grid.

In [23]:
# define pipe 
pipe = Pipeline([
    ('preprocess', column_transforms),
    ('reg', Ridge())
])

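# define param grid
param_grid = [
    # parameter grid for lasso and ridge (model and regularization coefficient)
    {'reg': [Ridge(), Lasso()], 'reg__alpha': [1e-2, 0.1, 1., 10.]},
    # parameter grid for linear regression (only model)
    {'reg': [LinearRegression()]}
]

# define grid search object
pipe_cv = GridSearchCV(pipe, param_grid=param_grid,
                      cv=10, scoring='neg_mean_squared_error')

In [None]:
# fit and print best estimator
pipe_cv.fit(X_train, y_train)
pipe_cv.best_estimator_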

### Train best model on the whole train and evaluate on test

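Now we can take the best estimator found by the grid search, train it on the whole training dataset, and evaluate it on the test dataset. With the default `refit=True`, `GridSearchCV` has already refit `best_estimator_` on the whole training set, so fitting it again is redundant but harmless.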
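
A small check of this workflow on synthetic data (an assumption, with a bare `Ridge` in place of the full pipeline): the estimator stored in `best_estimator_` can predict on held-out data right away, and fitting a fresh clone by hand on the same training split should give matching predictions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

demo_cv = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1., 10.]}, cv=5,
                       scoring='neg_mean_squared_error')
demo_cv.fit(X_tr, y_tr)

# best_estimator_ is already fitted on all of X_tr (refit=True by default).
pred_refit = demo_cv.best_estimator_.predict(X_te)

# Fitting an unfitted copy by hand on the same data gives the same predictions.
pred_manual = clone(demo_cv.best_estimator_).fit(X_tr, y_tr).predict(X_te)

test_rmse = np.sqrt(np.mean((y_te - pred_refit) ** 2))
print(np.allclose(pred_refit, pred_manual), test_rmse)
```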

In [25]:
# get the best model from `pipe_cv`
best_m = pipe_cv.best_estimator_

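# fit on the train dataset
best_m.fit(X_train, y_train)

# calculate predictions on test
y_pred = best_m.predict(X_test)

In [None]:
# calculate root mean squared error on the test set
np.sqrt(np.mean((y_test - y_pred) ** 2))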
