# Housing Prices 3: Creating a Pipeline and tuning the model with Grid Search Cross Validation

## 1. Read the data

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

original_housing_df = pd.read_csv('housing_iteration_0_2_classification.csv')
original_housing_df.head()

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,Expensive
0,8450,65.0,856,3,0,0,2,0,0,0
1,9600,80.0,1262,3,1,0,2,298,0,0
2,11250,68.0,920,3,1,0,2,0,0,0
3,9550,60.0,756,3,1,0,3,0,0,0
4,14260,84.0,1145,4,1,0,3,192,0,0


## 1.1. Set X and y

- **X**: columns that help us make a prediction.
- **y**: the column that we want to predict.

In [3]:
X = original_housing_df.copy()  # work on a copy so the original DataFrame stays untouched
y = X.pop("Expensive")  # remove the target column from X and keep it as y

## 1.2. Feature selection

Scikit-learn models expect numerical input, so categorical features would first need to be encoded as numbers. For now, we keep only the numerical features.

In [9]:
X_num = X.select_dtypes(include="number")
X_num.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   LotArea       1460 non-null   int64  
 1   LotFrontage   1201 non-null   float64
 2   TotalBsmtSF   1460 non-null   int64  
 3   BedroomAbvGr  1460 non-null   int64  
 4   Fireplaces    1460 non-null   int64  
 5   PoolArea      1460 non-null   int64  
 6   GarageCars    1460 non-null   int64  
 7   WoodDeckSF    1460 non-null   int64  
 8   ScreenPorch   1460 non-null   int64  
dtypes: float64(1), int64(8)
memory usage: 102.8 KB
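
As a small illustration of what `select_dtypes(include="number")` keeps and drops, here is a sketch on a made-up DataFrame. The `toy` frame and its `Street` column are invented for this example and are not part of the housing file used above.

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "LotArea": [8450, 9600],       # integers -> kept
    "LotFrontage": [65.0, None],   # floats with a missing value -> kept
    "Street": ["Pave", "Grvl"],    # text -> dropped
})

toy_num = toy.select_dtypes(include="number")
print(list(toy_num.columns))  # ['LotArea', 'LotFrontage']
```

Note that missing values do not affect the selection: a numerical column with gaps is still numerical, which is why `LotFrontage` survives and has to be imputed later.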


## 1.3. Split the data

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
X_num_train, X_num_test, y_train, y_test = train_test_split(X_num,
                                                            y,
                                                            test_size=0.2,
                                                            random_state=31416)

## 1.4. Impute missing values

(Fit on train, transform train & test)

In [27]:
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer().set_output(transform='pandas') # initialize
my_imputer.fit(X_num_train) # fit on the train set
X_num_imputed_train = my_imputer.transform(X_num_train) # transform the train set
X_num_imputed_test = my_imputer.transform(X_num_test) # transform the test set
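
To see what "fit on train, transform train & test" means in practice, the following sketch uses two tiny arrays invented for this example (`toy_train`, `toy_test`). It suggests that the fill value is learned from the train set only and then reused unchanged on the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration: one feature, one missing value per set
toy_train = np.array([[10.0], [20.0], [np.nan], [30.0]])
toy_test = np.array([[np.nan], [100.0]])

toy_imputer = SimpleImputer()   # default strategy is "mean"
toy_imputer.fit(toy_train)      # learns the mean of the observed train values

print(toy_imputer.statistics_)  # fill value learned from the train set
toy_test_imputed = toy_imputer.transform(toy_test)
print(toy_test_imputed)         # the test gap is filled with the train mean
```

The test value 100.0 plays no part in the fill value, so no information from the test set leaks into the preprocessing.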

## 1.5. Modelling: Decision Tree

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [28]:
from sklearn.tree import DecisionTreeClassifier

Initialize the model.

In [29]:
my_tree = DecisionTreeClassifier(max_depth=5,
                                 min_samples_leaf=8
                                )

Fit the model to the train data.

In [30]:
my_tree.fit(X = X_num_imputed_train,
            y = y_train)

## 1.6. Check accuracy on the train set

In [31]:
from sklearn.metrics import accuracy_score

Use the model and the preprocessed **train** data to make predictions.

In [32]:
y_pred_tree_train = my_tree.predict(X_num_imputed_train)

In [33]:
y_pred_tree_train

array([0, 0, 0, ..., 0, 1, 0])

Compare the predicted values with the true values in `y_train`. The greater the share of correct predictions, the closer the accuracy score is to 1.

In [34]:
accuracy_score(y_true = y_train,
               y_pred = y_pred_tree_train)

0.9323630136986302
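
As a sanity check on what the accuracy score measures, this sketch computes it by hand on five made-up labels (`toy_true` and `toy_pred` are invented for this example) and compares the result with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Toy labels, invented for illustration
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]

# Accuracy by hand: share of positions where the prediction matches the truth
n_correct = sum(t == p for t, p in zip(toy_true, toy_pred))
manual_accuracy = n_correct / len(toy_true)

sklearn_accuracy = accuracy_score(y_true=toy_true, y_pred=toy_pred)
print(manual_accuracy, sklearn_accuracy)
```

Four of the five predictions are correct, so both values come out at 0.8.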

## 1.7. Check accuracy on the test set

To check whether our model is only good at predicting the values it was trained on (overfitting) or is also useful for predicting new data, use the model and the preprocessed **test** data to make predictions.

In [35]:
y_pred_tree_test = my_tree.predict(X_num_imputed_test)

Then, compare the predicted values with the true values in `y_test`.

Ideally, the accuracy on the train and the test data is similar.

In [36]:
accuracy_score(y_true = y_test,
               y_pred = y_pred_tree_test)

0.934931506849315

# 2. Pipeline creation

Before moving forward in our quest to improve the model, take a moment to learn how to use scikit-learn Pipelines. They will not increase the performance of your model, but they bundle all the data preparation and modelling steps into a single object. This becomes very relevant as we move forward and keep adding more steps:

* Read the lesson "Scikit-Learn Pipelines" on the platform.

* Check the docs: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html

In [37]:
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

## 2.1 Initialize transformer and model

In [38]:
imputer = SimpleImputer(strategy="median")
dtree = DecisionTreeClassifier(max_depth=5,
                               min_samples_leaf=8,
                               random_state=31416)

## 2.2 Create a pipeline

In [39]:
pipe = make_pipeline(imputer, dtree).set_output(transform='pandas')

In [42]:
pipe

Note that `make_pipeline` is just a more concise alternative to `Pipeline`: it does not require you to name each step, because it names each one after its lowercased class name. Otherwise their behaviour is equivalent.

In [43]:
from sklearn.pipeline import Pipeline
pipe_2 = Pipeline([("imputer", imputer), ("classifier", dtree)]).set_output(transform='pandas')

In [44]:
pipe_2
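
The difference between the two constructors shows up in the step names. A minimal sketch, using fresh objects (`auto_named` and `hand_named` are names chosen for this example) so that it does not touch `pipe` or `pipe_2`:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier

# make_pipeline derives the step names from the class names
auto_named = make_pipeline(SimpleImputer(), DecisionTreeClassifier())

# Pipeline uses the names we pass in
hand_named = Pipeline([("imputer", SimpleImputer()),
                       ("classifier", DecisionTreeClassifier())])

print(list(auto_named.named_steps))  # ['simpleimputer', 'decisiontreeclassifier']
print(list(hand_named.named_steps))  # ['imputer', 'classifier']
```

These names matter later: a parameter grid addresses a step by its name, so the same tree depth would be `decisiontreeclassifier__max_depth` in the first pipeline and `classifier__max_depth` in the second.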

## 2.3 Fit the pipeline to the training data

In [45]:
pipe.fit(X_num_train, y_train)

If you want the pipeline steps displayed as text:

In [46]:
from sklearn import set_config

set_config(display="text")
pipe

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(max_depth=5, min_samples_leaf=8,
                                        random_state=31416))])

To switch back to diagram:

In [47]:
set_config(display="diagram")
pipe

## 2.4 Use the pipeline to make predictions

In [49]:
y_pred_tree_train_pipe = pipe.predict(X_num_train)

In [50]:
y_pred_tree_train_pipe

array([0, 0, 0, ..., 0, 1, 0])

In [51]:
accuracy_score(y_true = y_train,
               y_pred = y_pred_tree_train_pipe)

0.9323630136986302

In [52]:
y_pred_tree_test_pipe = pipe.predict(X_num_test)

In [53]:
accuracy_score(y_true = y_test,
               y_pred = y_pred_tree_test_pipe)

0.934931506849315

Now, the object `pipe` can take (almost) raw data as input and output predictions. We no longer need to impute missing values and use the model to make predictions in separate steps.

# 3. Use GridSearchCV to find the best parameters of the model

So far, we tuned the hyperparameters of the decision tree manually. This is not ideal, for two reasons:

- It is an inefficient way to find the best combination of parameters.
- If we keep checking the performance on the test set over and over again, we might end up creating a model that fits that particular test set but does not generalize as well to new data. Test sets are meant to remain unseen until the very last moment of ML development, so we have been cheating a bit!

Grid Search Cross Validation solves both issues:

* Read the lesson "Housing Prices: Iteration 2, Grid Search & Cross Validation" on the platform.

* Check the docs: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [54]:
# 1. initialize transformers & model without specifying the parameters
imputer = SimpleImputer()
dtree = DecisionTreeClassifier()

In [55]:
# 2. Create a pipeline
pipe = make_pipeline(imputer, dtree).set_output(transform='pandas')

In [37]:
pipe

To define the parameter grid for cross validation, you need to create a dictionary, where:

- The keys are the name of the pipeline step, followed by two underscores and the name of the parameter you want to tune.
- The values are lists (or "ranges") with all the values you want to try for each parameter.

In [57]:
param_grid = {
    'decisiontreeclassifier__max_depth': range(2, 12),
    'decisiontreeclassifier__min_samples_leaf': range(3, 10, 2),
    'decisiontreeclassifier__min_samples_split': range(3, 40, 5),
    'decisiontreeclassifier__criterion':['gini', 'entropy']
    }
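
Two quick checks can catch mistakes in a grid before running a long search. This is a sketch, not part of the original workflow: it rebuilds the same pipeline and grid under new names (`check_pipe`, `check_grid`), verifies that every key is a parameter the pipeline knows, and counts the combinations with `ParameterGrid`.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

check_pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier())

check_grid = {
    'decisiontreeclassifier__max_depth': range(2, 12),
    'decisiontreeclassifier__min_samples_leaf': range(3, 10, 2),
    'decisiontreeclassifier__min_samples_split': range(3, 40, 5),
    'decisiontreeclassifier__criterion': ['gini', 'entropy'],
}

# Every key in the grid must be a parameter name the pipeline exposes
valid_names = check_pipe.get_params().keys()
unknown = [name for name in check_grid if name not in valid_names]
print(unknown)  # an empty list means all keys are valid

# Number of parameter combinations: 10 * 4 * 8 * 2
n_candidates = len(ParameterGrid(check_grid))
print(n_candidates)  # 640
```

Multiplying the number of candidates by the number of folds gives the total number of fits, which is a useful estimate of how long the search will take.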

When defining the cross validation, we pass our pipeline (`pipe`), our parameter grid (`param_grid`) and the number of folds (an arbitrary number, usually 5 or 10). You can also set the parameter `verbose` if you want to receive a bit more information about the CV task.

In [60]:
from sklearn.model_selection import GridSearchCV

In [61]:
search = GridSearchCV(pipe, # you have defined this beforehand
                      param_grid, # your parameter grid
                      cv=10, # the value for K in K-fold Cross Validation
                      scoring='accuracy', # the performance metric to use,
                      verbose=1) # we want informative outputs during the training process
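
To make the number of folds more concrete, the sketch below splits ten made-up samples into five folds. The arrays `toy_X` and `toy_y` are invented for this example. It uses `StratifiedKFold` because that is the splitter scikit-learn applies by default when `cv` is an integer and the estimator is a classifier.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data, invented for illustration: 10 samples, 5 per class
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

folds = StratifiedKFold(n_splits=5)
fold_sizes = []
for train_idx, valid_idx in folds.split(toy_X, toy_y):
    # Each round trains on 4 folds and validates on the remaining one
    fold_sizes.append((len(train_idx), len(valid_idx)))
    print("train:", train_idx, "validate:", valid_idx)
```

Every sample is used for validation exactly once, and the score reported for a parameter combination is the average over the validation folds.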

Fit your "search" to the training data (`X` and `y`), as we used to do with our model alone or with our pipeline:

In [66]:
search.fit(X_num_train, y_train)

Fitting 10 folds for each of 640 candidates, totalling 6400 fits


Explore the best parameters and the best score achieved with your cross validation:

In [67]:
search.best_params_

{'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 5,
 'decisiontreeclassifier__min_samples_leaf': 3,
 'decisiontreeclassifier__min_samples_split': 3}

In [64]:
# the mean cross-validated score of the best estimator
search.best_score_

0.9289493073975832
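
The best score is only one row of a larger results table. As a sketch on synthetic data (generated with `make_classification`; the names `toy_search` and `results` are chosen for this example), `cv_results_` can be turned into a DataFrame to compare all candidates:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          {"max_depth": [2, 4, 6]},
                          cv=5,
                          scoring="accuracy")
toy_search.fit(toy_X, toy_y)

# One row per candidate, with the mean validation score and its rank
results = pd.DataFrame(toy_search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))
```

Here `best_score_` equals the `mean_test_score` of the top-ranked row, so the table shows how far the other candidates fall behind.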

In [68]:
# training accuracy
y_train_pred = search.predict(X_num_train)

accuracy_score(y_train, y_train_pred)

0.9409246575342466

In [69]:
# testing accuracy
y_test_pred = search.predict(X_num_test)

accuracy_score(y_test, y_test_pred)

0.928082191780822

In [76]:
param_grid_2 = {
    'decisiontreeclassifier__max_depth': range(2, 8),
    'decisiontreeclassifier__min_samples_leaf': range(1, 10),
    'decisiontreeclassifier__min_samples_split': range(1, 7),
    'decisiontreeclassifier__criterion':['gini', 'entropy']
    }

In [77]:
search = GridSearchCV(pipe, # you have defined this beforehand
                      param_grid_2, # your parameter grid
                      cv=5, # the value for K in K-fold Cross Validation
                      scoring='accuracy', # the performance metric to use,
                      verbose=1) # we want informative outputs during the training process

In [78]:
search.fit(X_num_train, y_train)

Fitting 5 folds for each of 648 candidates, totalling 3240 fits


540 fits failed out of a total of 3240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
540 fits failed with the following error:
Traceback (most recent call last):
  File "/Applications/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Applications/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Applications/anaconda3/lib/python3.11/site-packages/sklearn/pipeline.py", line 420, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Applications/anaconda3/lib/pyth

The 540 failed fits come from `min_samples_split=1` in `param_grid_2`: scikit-learn requires an integer `min_samples_split` of at least 2. The value 1 appears in 6 × 9 × 2 = 108 candidates, and 108 candidates × 5 folds = 540 fits. As the warning says, those candidates are scored as nan. Starting the range at 2 avoids the failures.

In [79]:
search.best_params_

{'decisiontreeclassifier__criterion': 'gini',
 'decisiontreeclassifier__max_depth': 5,
 'decisiontreeclassifier__min_samples_leaf': 5,
 'decisiontreeclassifier__min_samples_split': 3}

In [80]:
y_test_pred = search.predict(X_num_test)

accuracy_score(y_test, y_test_pred)

0.9315068493150684

# 4. Use GridSearchCV to find the best parameters of the pipeline

Add a scaler to the pipeline, and use GridSearchCV to tune the parameters of the scaler, as well as the parameters of the imputer and the decision tree.

This shows how Grid Search Cross Validation can be used to not only tune the parameters of the model but also the parameters of all the transformers in a pipeline, thus helping us find the best preprocessing strategy for our data.

In [81]:
from sklearn.preprocessing import StandardScaler

In [82]:
# initialize transformers & model
imputer = SimpleImputer()
scaler = StandardScaler()
dtree = DecisionTreeClassifier()

In [83]:
# create the pipeline
pipe = make_pipeline(imputer,
                     scaler,
                     dtree).set_output(transform='pandas')

We can see the steps in the pipeline. Note that they have been given names: `simpleimputer`, `standardscaler` and `decisiontreeclassifier`. We will use these names when defining the parameter grid for the cross validation.

In [84]:
pipe

In [85]:
# create parameter grid
param_grid = {
    "simpleimputer__strategy":["mean", "median"],
    "standardscaler__with_mean":[True, False],
    "standardscaler__with_std":[True, False],
    "decisiontreeclassifier__max_depth": range(2, 10),
    "decisiontreeclassifier__min_samples_leaf": range(1, 7),
    "decisiontreeclassifier__criterion":["gini", "entropy"]
}

In [86]:
# define cross validation
search = GridSearchCV(pipe,
                      param_grid,
                      cv=10,
                      verbose=1)

In [87]:
# fit
search.fit(X_num_train, y_train)

Fitting 10 folds for each of 768 candidates, totalling 7680 fits


In [88]:
# cross validation average accuracy
search.best_score_

0.9306587091069849

In [92]:
# best parameters
search.best_params_

{'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 5,
 'decisiontreeclassifier__min_samples_leaf': 2,
 'simpleimputer__strategy': 'mean',
 'standardscaler__with_mean': False,
 'standardscaler__with_std': True}
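
After the search, the winning settings are not only reported in `best_params_` but also applied to a refitted pipeline, `best_estimator_`. The following sketch runs on synthetic data with a deliberately small grid (all `toy_` names are invented for this example) and reads the tuned values back from the individual steps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)
toy_X[::10, 0] = np.nan  # add some missing values so the imputer has work to do

toy_pipe = make_pipeline(SimpleImputer(), StandardScaler(),
                         DecisionTreeClassifier(random_state=0))
toy_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "decisiontreeclassifier__max_depth": [2, 4],
}
toy_search = GridSearchCV(toy_pipe, toy_grid, cv=5).fit(toy_X, toy_y)

# best_estimator_ is the whole pipeline, refitted with the best parameters
best_pipe = toy_search.best_estimator_
print(best_pipe.named_steps["simpleimputer"].strategy)
print(best_pipe.named_steps["decisiontreeclassifier"].max_depth)
```

Calling `search.predict(...)` uses this refitted pipeline, which is why the search object can be used directly for predictions.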

In [93]:
y_train_pred = search.predict(X_num_train)

accuracy_score(y_train, y_train_pred)

0.9409246575342466

In [94]:
y_test_pred = search.predict(X_num_test)

accuracy_score(y_test, y_test_pred)

0.9315068493150684

In [130]:
selected_row = X_num_test.iloc[60]
selected_row_df = pd.DataFrame(selected_row).transpose()
selected_row_df

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
1036,12898.0,89.0,1620.0,2.0,1.0,0.0,3.0,228.0,0.0


In [131]:
y_test_pred_new = search.predict(selected_row_df)
y_test_pred_new

array([1])

In [129]:
y_test.iloc[60]

1
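
The single-row table above shows every value as a float (for example `12898.0`), because selecting one row with `.iloc[60]` returns a Series with a single dtype. A possible alternative, sketched here on a made-up frame (`toy` is invented for this example), is to index with a list of positions, which returns a one-row DataFrame and keeps each column's dtype.

```python
import pandas as pd

# Toy frame, invented for illustration: one integer and one float column
toy = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                    "LotFrontage": [65.0, 80.0, 68.0]})

# Approach used above: Series -> DataFrame -> transpose
row_via_transpose = pd.DataFrame(toy.iloc[1]).transpose()

# Alternative: a list of positions returns a DataFrame directly
row_via_list = toy.iloc[[1]]

print(row_via_transpose.dtypes)  # both columns become float64
print(row_via_list.dtypes)       # LotArea stays int64
```

Both forms should give the same prediction for a decision tree here, since the values themselves are unchanged.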