In [4]:
# Run this once for each new runtime:
# Colab has scikit-learn 1.0.2 pre-installed,
# and we need a newer version (1.2.0 or higher)
# to use the .set_output() method.
# !pip install scikit-learn --upgrade

# If you re-run the whole notebook during the same runtime,
# you can leave the line above commented out.

# Titanic 3: Creating a Pipeline and tuning the model with Grid Search Cross Validation

# 1. Data reading and preprocessing for the Housing data

We will first review what we did in the previous notebook.

In [6]:
import pandas as pd

In [115]:
url = "https://drive.google.com/file/d/1SxHrO6j5552c7uVUWKqqFKaSSkx06Gh8/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]

house = pd.read_csv(path)
house.head()

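| Unnamed: 0 | LotArea | LotFrontage | TotalBsmtSF | BedroomAbvGr | Fireplaces | PoolArea | GarageCars | WoodDeckSF | ScreenPorch | Expensive |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8450 | 65.0 | 856 | 3 | 0 | 0 | 2 | 0 | 0 | 0 |
| 1 | 9600 | 80.0 | 1262 | 3 | 1 | 0 | 2 | 298 | 0 | 0 |
| 2 | 11250 | 68.0 | 920 | 3 | 1 | 0 | 2 | 0 | 0 | 0 |
| 3 | 9550 | 60.0 | 756 | 3 | 1 | 0 | 3 | 0 | 0 | 0 |
| 4 | 14260 | 84.0 | 1145 | 4 | 1 | 0 | 3 | 192 | 0 | 0 |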
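The `path` built above depends on where the file ID sits inside the Drive sharing link. A minimal sketch of that string manipulation, assuming the link keeps the `/file/d/<id>/view` layout (the name `file_id` is introduced here only for illustration):

```python
# Sharing link copied from the cell above
url = "https://drive.google.com/file/d/1SxHrO6j5552c7uVUWKqqFKaSSkx06Gh8/view?usp=share_link"

# Splitting on "/" leaves the file ID as the second-to-last piece
file_id = url.split("/")[-2]

# Direct-download address that pd.read_csv can read
path = "https://drive.google.com/uc?export=download&id=" + file_id
print(path)
```

If the sharing link had a different shape, the `[-2]` index would pick the wrong piece, so it is worth printing `path` once before reading the file.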


## 1.1. Set X and y

- **X**: columns that help us make a prediction.
- **y**: the column that we want to predict.

In [65]:
X = house.drop(columns=["Fireplaces", "PoolArea", "ScreenPorch"])
y = X.pop("Expensive")

Since scikit-learn models cannot deal with categorical features directly, we keep only numerical features.
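A small sketch of the two ideas in this step, on a toy frame with made-up values. The `Street` column is hypothetical and is there only to show how a categorical column would be filtered out; `select_dtypes` is one possible way to do that, not necessarily the one used in the course.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
    "Expensive": [0, 0, 1],
})

X_toy = toy.copy()
# pop() removes the column from X_toy and returns it as a Series
y_toy = X_toy.pop("Expensive")

# Keep only the numerical columns
X_num = X_toy.select_dtypes(include="number")
print(list(X_num.columns))
```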

## 1.2. Split the data

In [66]:
from sklearn.model_selection import train_test_split

In [67]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,  random_state=123)
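To see what `test_size` and `random_state` do, here is a minimal sketch on a toy array of 10 rows (all variable names are illustrative). Calling the function twice with the same `random_state` should produce the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 rows, 2 features
y_demo = np.arange(10)

# Two calls with the same seed
a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123)
c_train, c_test, d_train, d_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123)

print(a_train.shape, a_test.shape)        # 80% / 20% of the rows
print(np.array_equal(a_test, c_test))     # same rows in both test sets
```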

## 1.3. Impute missing values

(Fit on train, transform train & test)

In [68]:
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer().set_output(transform='pandas') # initialize
my_imputer.fit(X_train) # fit on the train set
X_imputed_train = my_imputer.transform(X_train) # transform the train set
X_imputed_test = my_imputer.transform(X_test) # transform the test set
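The reason for fitting on the train set only is that the fill value must be learned from training data and then reused for the test data. A minimal sketch with toy numbers (not the housing data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [10.0]])

imp = SimpleImputer()   # default strategy is "mean"
imp.fit(train)          # learns the mean of the observed train values

train_filled = imp.transform(train)
test_filled = imp.transform(test)   # the test gap is filled with the TRAIN mean

print(imp.statistics_)
print(test_filled)
```

The missing test value is filled with the mean of 1.0 and 3.0 from the train set; the 10.0 in the test set plays no role in the fill value.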

## 1.4. Modelling: Decision Tree

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [71]:
from sklearn.tree import DecisionTreeClassifier

Initialize the model.

In [72]:
my_tree = DecisionTreeClassifier(max_depth=4,
                                 min_samples_leaf=10
                                )

Fit the model to the train data.

In [73]:
my_tree.fit(X = X_imputed_train, 
            y = y_train)

## 1.5. Check accuracy on the train set

In [74]:
from sklearn.metrics import accuracy_score

Use the model and the preprocessed **train** data to make predictions.

In [75]:
y_pred_tree_train = my_tree.predict(X_imputed_train)

In [76]:
y_pred_tree_train

array([1, 0, 0, ..., 0, 0, 0], dtype=int64)

Compare the predicted values with the true values in `y_train`. The greater the share of correct predictions, the closer the accuracy score will be to 1.

In [77]:
accuracy_score(y_true = y_train,
               y_pred = y_pred_tree_train)

0.922945205479452
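Accuracy is simply the fraction of predictions that match the true labels. A toy sketch with made-up labels, comparing `accuracy_score` with a manual count:

```python
from sklearn.metrics import accuracy_score

y_true_demo = [1, 0, 0, 1]
y_pred_demo = [1, 0, 1, 1]   # one mistake out of four

acc = accuracy_score(y_true_demo, y_pred_demo)

# The same number computed by hand: correct predictions / all predictions
manual = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)

print(acc, manual)
```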

## 1.6. Check accuracy on the test set

To check whether our model is only good at predicting the values it was trained on (overfitting) or is also useful for predicting new data, use the model and the preprocessed **test** data to make predictions.

In [78]:
y_pred_tree_test = my_tree.predict(X_imputed_test)

Then compare the predicted values with the true values in `y_test`.

Ideally, the accuracy on the train data and on the test data is similar.

In [79]:
my_test_score = accuracy_score(y_true = y_test,
               y_pred = y_pred_tree_test)
round(my_test_score,4)

0.9007
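A gap between train and test accuracy is the usual symptom of overfitting. The sketch below uses synthetic data from `make_classification` with deliberately noisy labels (an assumption for illustration, not the housing data) to compare a fully grown tree with a constrained one.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 randomly reassigns 20% of the labels (noise)
X_syn, y_syn = make_classification(n_samples=500, n_features=8,
                                   flip_y=0.2, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=123)

# A tree with no limits can memorize the training set
deep_tree = DecisionTreeClassifier(random_state=0).fit(Xs_train, ys_train)
deep_train_acc = accuracy_score(ys_train, deep_tree.predict(Xs_train))
deep_test_acc = accuracy_score(ys_test, deep_tree.predict(Xs_test))

# The same constraints used in this notebook
small_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                    random_state=0).fit(Xs_train, ys_train)
small_train_acc = accuracy_score(ys_train, small_tree.predict(Xs_train))
small_test_acc = accuracy_score(ys_test, small_tree.predict(Xs_test))

print("deep :", deep_train_acc, deep_test_acc)
print("small:", small_train_acc, small_test_acc)
```

The unconstrained tree fits the training rows perfectly, noise included; the constrained tree typically shows a smaller train/test gap, although the exact numbers depend on the data.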

# 2. Pipeline creation

Before moving forward in our quest to improve the model, take a moment to learn how to use scikit-learn Pipelines. They will not increase the performance of your model, but they let you compress all the data preparation and modelling steps into a single object. This will become very relevant as we move forward and keep adding more steps:

* Read the lesson "Scikit-Learn Pipelines" on the platform.

* Check the docs: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html

In [80]:
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

## 2.1 Initialize transformer and model

In [81]:
imputer = SimpleImputer(strategy="median")
dtree = DecisionTreeClassifier(max_depth=4,
                               min_samples_leaf=10)

## 2.2 Create a pipeline

In [82]:
pipe = make_pipeline(imputer, dtree).set_output(transform='pandas')

Note that `make_pipeline` is just a more concise alternative to `Pipeline`: it does not require you to name each step (the names are generated from the lowercased class names), but the behaviour is equivalent.

In [83]:
from sklearn.pipeline import Pipeline
pipe_2 = Pipeline([("imputer", imputer), ("classifier", dtree)]).set_output(transform='pandas')
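The only practical difference between the two constructors is the step names, which matter later when building a parameter grid. A short sketch that lists them (the variable names `auto_named` and `hand_named` are illustrative):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier

# make_pipeline generates the step names automatically
auto_named = make_pipeline(SimpleImputer(strategy="median"),
                           DecisionTreeClassifier(max_depth=4, min_samples_leaf=10))

# Pipeline uses the names you provide
hand_named = Pipeline([("imputer", SimpleImputer(strategy="median")),
                       ("classifier", DecisionTreeClassifier(max_depth=4, min_samples_leaf=10))])

auto_names = list(auto_named.named_steps)
hand_names = list(hand_named.named_steps)
print(auto_names)
print(hand_names)
```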

## 2.3 Fit the pipeline to the training data

In [84]:
pipe.fit(X_train, y_train)

To display the pipeline steps as text:

In [85]:
from sklearn import set_config

set_config(display="text")
pipe

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(max_depth=4, min_samples_leaf=10))])

To switch back to the diagram display:

In [86]:
set_config(display="diagram")
pipe

In [87]:
y_train_predict = pipe.predict(X_train)
accuracy_score(y_train,y_train_predict)

0.922945205479452

In [88]:
accuracy_score(y_train,pipe.predict(X_train))

0.922945205479452

## 2.4 Use the pipeline to make predictions

In [89]:
accuracy_score(y_test,pipe.predict(X_test))

0.9006849315068494

Now, the object `pipe` can take (almost) raw data as input and output predictions. We no longer need to impute missing values and use the model to make predictions in separate steps.
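To check that the pipeline really does the same work as the separate steps, the sketch below compares both approaches on synthetic data with missing values. The data and variable names are assumptions for illustration; `random_state` is fixed so both trees are built identically.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic training data; the target depends on the first feature
X_raw = rng.normal(size=(200, 3))
y_raw = (X_raw[:, 0] > 0).astype(int)
X_raw[rng.random(X_raw.shape) < 0.1] = np.nan   # roughly 10% missing values

# Synthetic "new" data, also with gaps
X_new = rng.normal(size=(50, 3))
X_new[rng.random(X_new.shape) < 0.1] = np.nan

# Manual route: impute, then fit and predict in separate steps
imp = SimpleImputer(strategy="median").fit(X_raw)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(imp.transform(X_raw), y_raw)
manual_pred = tree.predict(imp.transform(X_new))

# Pipeline route: one object handles both steps
pipe_demo = make_pipeline(SimpleImputer(strategy="median"),
                          DecisionTreeClassifier(max_depth=4, random_state=0))
pipe_pred = pipe_demo.fit(X_raw, y_raw).predict(X_new)

print(np.array_equal(manual_pred, pipe_pred))
```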

# 3. Use GridSearchCV to find the best parameters of the model

So far, we have tuned the hyperparameters of the decision tree manually. This is not ideal, for two reasons:

- It is a slow way to find the best combination of parameters.
- If we keep checking the performance on the test set over and over again, we might end up creating a model that fits that particular test set but does not generalize as well to new data. Test sets are meant to remain unseen until the very last moment of ML development, so we have been cheating a bit!

Grid Search Cross Validation solves both issues:

* Read the lesson "Housing Prices: Iteration 2, Grid Search & Cross Validation" on the platform.

* Check the docs: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
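Before running a full grid search, it can help to see plain cross validation on a single model. The sketch below uses `cross_val_score` on synthetic data (an assumption for illustration): the training data is split into 5 folds, and each fold is used once for validation, so the test set is never touched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = make_classification(n_samples=300, n_features=6, random_state=0)

# One accuracy value per fold
scores = cross_val_score(DecisionTreeClassifier(max_depth=4, random_state=0),
                         X_cv, y_cv, cv=5, scoring="accuracy")

print(scores)
print(scores.mean())   # the kind of average that GridSearchCV compares across candidates
```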

In [90]:
# 1. initialize transformers & model without specifying the parameters
imputer = SimpleImputer()
dtree = DecisionTreeClassifier()

In [91]:
# 2. Create a pipeline
pipe = make_pipeline(imputer, dtree).set_output(transform='pandas')

In [92]:
pipe

To define the parameter grid for cross validation, you need to create a dictionary, where:

- The keys are the name of the pipeline step, followed by two underscores and the name of the parameter you want to tune.
- The values are lists (or "ranges") with all the values you want to try for each parameter.

In [93]:
param_grid = {
    'simpleimputer__strategy':['median','mean'],
    'decisiontreeclassifier__max_depth': range(2, 12),
    'decisiontreeclassifier__min_samples_leaf': range(3, 10, 2),
    'decisiontreeclassifier__min_samples_split': range(3, 40, 5),
    'decisiontreeclassifier__criterion':['gini', 'entropy']
    }
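If you are unsure how a key should be spelled, the pipeline can list every parameter name it accepts. A short sketch using `get_params()` on a pipeline built the same way as above (`pipe_demo` and `tunable` are illustrative names):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

pipe_demo = make_pipeline(SimpleImputer(), DecisionTreeClassifier())

# Every valid key for a parameter grid; nested ones follow the step__parameter pattern
tunable = sorted(pipe_demo.get_params().keys())
for name in tunable:
    if "__" in name:
        print(name)
```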

When defining the cross validation, we pass our pipeline (`pipe`), our parameter grid (`param_grid`) and the number of folds (commonly 5 or 10). You can also set the parameter `verbose` if you want to receive a bit more information about the CV task.

In [94]:
from sklearn.model_selection import GridSearchCV

In [95]:
search = GridSearchCV(pipe, # you have defined this beforehand
                      param_grid, # your parameter grid
                      cv=5, # the value for K in K-fold Cross Validation
                      scoring='accuracy', # the performance metric to use, 
                      verbose=1) # we want informative outputs during the training process

Fit your "search" to the training data (`X` and `y`), as we used to do with our model alone or with our pipeline:

In [96]:
# pass the raw training data: the imputer inside the pipeline handles missing values
search.fit(X_train, y_train)

Fitting 5 folds for each of 1280 candidates, totalling 6400 fits
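The number of candidates is the product of the number of values tried for each parameter, and each candidate is fitted once per fold. A sketch that reproduces the count with `ParameterGrid`, using the same values as the grid above (written as lists here):

```python
from sklearn.model_selection import ParameterGrid

grid = {
    "simpleimputer__strategy": ["median", "mean"],                             # 2 values
    "decisiontreeclassifier__max_depth": list(range(2, 12)),                   # 10 values
    "decisiontreeclassifier__min_samples_leaf": list(range(3, 10, 2)),         # 4 values
    "decisiontreeclassifier__min_samples_split": list(range(3, 40, 5)),        # 8 values
    "decisiontreeclassifier__criterion": ["gini", "entropy"],                  # 2 values
}

n_candidates = len(ParameterGrid(grid))   # 2 * 10 * 4 * 8 * 2
n_fits = n_candidates * 5                 # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)
```

Because the count grows multiplicatively, adding one more parameter with a handful of values can multiply the running time several times over.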


Explore the best parameters and the best score achieved with your cross validation:

In [97]:
search.best_params_

{'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 4,
 'decisiontreeclassifier__min_samples_leaf': 3,
 'decisiontreeclassifier__min_samples_split': 33,
 'simpleimputer__strategy': 'median'}

In [98]:
# the mean cross-validated score of the best estimator
search.best_score_

0.9126811195480723
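Beyond the single best score, the fitted search stores the results of every candidate in `cv_results_`. The sketch below runs a small search on synthetic data (an assumption, to keep it quick) and loads the results into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X_gs, y_gs = make_classification(n_samples=300, n_features=6, random_state=0)

search_demo = GridSearchCV(
    make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=0)),
    {"simpleimputer__strategy": ["mean", "median"],
     "decisiontreeclassifier__max_depth": [2, 4, 6]},
    cv=5,
    scoring="accuracy",
)
search_demo.fit(X_gs, y_gs)

# One row per candidate, sorted from best to worst
results = pd.DataFrame(search_demo.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]].head(3)
print(top)
```

Looking at the top few rows shows whether the best candidate is clearly ahead or only marginally better than the runners-up.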

In [99]:
# training accuracy
y_train_pred = search.predict(X_train)

accuracy_score(y_train, y_train_pred)

0.922945205479452

In [100]:
# testing accuracy
y_test_pred = search.predict(X_test)

accuracy_score(y_test, y_test_pred)

0.9006849315068494
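Calling `search.predict` works because, with the default `refit=True`, the search refits the best candidate on the whole training set and stores it as `best_estimator_`. A minimal sketch on synthetic data (names and data are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_rf, y_rf = make_classification(n_samples=300, n_features=6, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_rf, y_rf, test_size=0.2, random_state=123)

search_small = GridSearchCV(DecisionTreeClassifier(random_state=0),
                            {"max_depth": [2, 4]}, cv=5)
search_small.fit(Xr_train, yr_train)

# search.predict delegates to the refitted best estimator
same = np.array_equal(search_small.predict(Xr_test),
                      search_small.best_estimator_.predict(Xr_test))
print(same)
print(search_small.best_estimator_)
```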

# 4. Use GridSearchCV to find the best parameters of the pipeline

Add a scaler to the pipeline, and use GridSearchCV to tune the parameters of the scaler, as well as the parameters of the imputer and the decision tree.

This shows how Grid Search Cross Validation can be used to not only tune the parameters of the model but also the parameters of all the transformers in a pipeline, thus helping us find the best preprocessing strategy for our data.

In [101]:
from sklearn.preprocessing import StandardScaler

In [102]:
# initialize transformers & model
imputer = SimpleImputer()
scaler = StandardScaler()
dtree = DecisionTreeClassifier()

In [103]:
# create the pipeline
pipe = make_pipeline(imputer,
                     scaler,
                     dtree).set_output(transform='pandas')

We can see the steps in the pipeline. Note that they have been given names: `simpleimputer`, `standardscaler` and `decisiontreeclassifier`. We will use these names when defining the parameter grid for the cross validation.

In [104]:
pipe

In [105]:
# create parameter grid
param_grid = {
    "simpleimputer__strategy":["mean", "median"],
    "standardscaler__with_mean":[True, False],
    "standardscaler__with_std":[True, False],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(3, 20),
    "decisiontreeclassifier__criterion":["gini", "entropy"]
}
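A misspelled key in the grid only fails once the search starts fitting. As a quick sanity check, the sketch below compares the grid keys with the parameter names the three-step pipeline accepts (`pipe_check`, `grid_check` and `unknown_keys` are illustrative names; the grid repeats the keys used above with shortened value lists).

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe_check = make_pipeline(SimpleImputer(), StandardScaler(), DecisionTreeClassifier())

grid_check = {
    "simpleimputer__strategy": ["mean", "median"],
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
    "decisiontreeclassifier__max_depth": [2, 4],
    "decisiontreeclassifier__min_samples_leaf": [3, 5],
    "decisiontreeclassifier__criterion": ["gini", "entropy"],
}

step_names = list(pipe_check.named_steps)

# Any key listed here would not be recognised by the pipeline
unknown_keys = set(grid_check) - set(pipe_check.get_params())
print(step_names)
print(unknown_keys)
```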

In [106]:
# define cross validation
search = GridSearchCV(pipe,
                      param_grid,
                      cv=5,
                      verbose=1)

In [107]:
# fit
search.fit(X_train, y_train)

Fitting 5 folds for each of 3264 candidates, totalling 16320 fits


In [108]:
# cross validation average accuracy
search.best_score_

0.9118264186933714

In [109]:
# best parameters
search.best_params_

{'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 4,
 'decisiontreeclassifier__min_samples_leaf': 18,
 'simpleimputer__strategy': 'mean',
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True}

In [110]:
# training accuracy
y_train_pred = search.predict(X_train)

accuracy_score(y_train, y_train_pred)

0.9203767123287672

In [111]:
# testing accuracy
y_test_pred = search.predict(X_test)

accuracy_score(y_test, y_test_pred)

0.9006849315068494

## **Your challenge**

In a new notebook, apply everything you have learned here to the Housing project, following the Learning platform when needed.