# Creating a sklearn pipeline and applying Cross Validation

The goal of this notebook is to build a scikit-learn pipeline and tune it with grid search cross validation.

The following articles on the platform will help you complete this notebook:
* [Scikit-Learn Pipelines](https://platform.wbscodingschool.com/courses/data-science/14411/)
* [Grid Search & Cross Validation](https://platform.wbscodingschool.com/courses/data-science/14418/)

In [None]:
import pandas as pd

housing = pd.read_csv('https://raw.githubusercontent.com/JoanClaverol/housing_data/main/housing-classification-iter3.csv')
housing.head()

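| | LotArea | LotFrontage | TotalBsmtSF | BedroomAbvGr | Fireplaces | PoolArea | GarageCars | WoodDeckSF | ScreenPorch | Expensive | MSZoning | Condition1 | Heating | Street | CentralAir | Foundation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8450 | 65.0 | 856 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | RL | Norm | GasA | Pave | Y | PConc |
| 1 | 9600 | 80.0 | 1262 | 3 | 1 | 0 | 2 | 298 | 0 | 0 | RL | Feedr | GasA | Pave | Y | CBlock |
| 2 | 11250 | 68.0 | 920 | 3 | 1 | 0 | 2 | 0 | 0 | 0 | RL | Norm | GasA | Pave | Y | PConc |
| 3 | 9550 | 60.0 | 756 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | RL | Norm | GasA | Pave | Y | BrkTil |
| 4 | 14260 | 84.0 | 1145 | 4 | 1 | 0 | 3 | 192 | 0 | 0 | RL | Norm | GasA | Pave | Y | PConc |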


## 1. Create train and test sets

How can you split the data into train and test sets?

In [None]:
X = housing.copy()
y = X.pop('Expensive')                   # target column, removed from X
X = X.select_dtypes(include='number')    # keep only the numerical features

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=123)
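As a quick illustration of what this split does, here is a minimal sketch on a hypothetical 100-row toy dataset (`X_toy` and `y_toy` are made up for the example and are not the housing data). It shows the 80/20 proportions produced by `test_size=0.2` and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one feature, a binary target.
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series(np.arange(100) % 2, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)
print(X_tr.shape, X_te.shape)  # (80, 1) (20, 1)

# Rows keep their original index labels, so features and target stay aligned.
print((X_tr.index == y_tr.index).all())  # True

# Using the same random_state again reproduces the same split.
X_tr2, _, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=123)
print(X_tr.index.equals(X_tr2.index))  # True
```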

## 2. Preprocess the data

Is there any preprocessing that needs to be applied to the numerical columns?

In [None]:
X.isna().sum()

LotArea           0
LotFrontage     259
TotalBsmtSF       0
BedroomAbvGr      0
Fireplaces        0
PoolArea          0
GarageCars        0
WoodDeckSF        0
ScreenPorch       0
dtype: int64

In [None]:
y.isna().sum()

0
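Since `LotFrontage` is the only column with missing values, one common option is to fill the gaps with the median of that column. The sketch below uses hypothetical toy frames (`train_toy`, `test_toy`) rather than the housing data, and shows that `SimpleImputer` learns the median from the data it is fitted on and reuses that value when transforming new data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with missing values (not the real housing data).
train_toy = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})
test_toy = pd.DataFrame({"LotFrontage": [np.nan, 90.0]})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(train_toy)  # learns the median of the non-missing training values

print(toy_imputer.statistics_)  # [70.]

# The training median is reused to fill gaps in unseen data.
filled_test = toy_imputer.transform(test_toy)
print(filled_test.ravel())  # [70. 90.]
```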

## 3. Pipeline Creation

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

### 3.1 Initialise your transformer and model

In [None]:
imputer = SimpleImputer(strategy="median")
dtree = DecisionTreeClassifier(max_depth=4,
                               min_samples_leaf=10,
                               random_state=42)

### 3.2 Creating pipeline

In [None]:
pipe = make_pipeline(imputer, dtree).set_output(transform='pandas')

### 3.3 Fit the pipeline to the training set

In [None]:
pipe.fit(X_train, y_train)

### 3.4 Use the Pipeline to make Predictions and calculate accuracy

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
y_pred_train = pipe.predict(X_train)

accuracy_score(y_true = y_train,
               y_pred = y_pred_train)

0.9238013698630136

In [None]:
y_pred_test = pipe.predict(X_test)

accuracy_score(y_true = y_test,
               y_pred = y_pred_test)

0.9212328767123288

## 4. Use GridSearchCV to find the best parameters of the model

So far, we have tuned the hyperparameters of the decision tree manually. This is not ideal, for two reasons:

- It's an inefficient way to find the best combination of parameters.
- If we keep checking the performance on the test set over and over again, we might end up with a model that fits that particular test set but does not generalize as well to new data. Test sets are meant to remain unseen until the very last moment of ML development, so we have been cheating a bit!

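Grid Search Cross Validation solves both issues: it tries every combination of parameters systematically, and it scores each one on held-out folds of the training set instead of the test set. To learn more: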

* Read the lesson "Housing Prices: Iteration 2, Grid Search & Cross Validation" on the platform.

* Check the docs: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
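To make the cross validation part concrete before running a full grid search, here is a minimal sketch using `cross_val_score` on synthetic data (`X_demo` and `y_demo` are generated stand-ins, not the housing data). It evaluates a single model with 5 folds and never touches a test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a training set (not the housing data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold CV: each fold is held out once while the model trains on the other four.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(len(scores))    # 5, one accuracy value per fold
print(scores.mean())  # the average is the kind of score a grid search compares
```

A grid search repeats this procedure for every parameter combination and keeps the one with the best mean score.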

In [None]:
# 1. initialize transformers & model without specifying the parameters
imputer = SimpleImputer()
dtree = DecisionTreeClassifier()

In [None]:
# 2. Create a pipeline
pipe = make_pipeline(imputer, dtree).set_output(transform='pandas')

In [None]:
pipe

To define the parameter grid for cross validation, you need to create a dictionary, where:

- The keys are the name of the pipeline step, followed by two underscores and the name of the parameter you want to tune.
- The values are lists (or "ranges") with all the values you want to try for each parameter.
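If you are unsure what the step names are, the pipeline can tell you. The sketch below builds a throwaway pipeline (`demo_pipe`, a name used only for this example) and lists the step names that `make_pipeline` generates, along with a check that the `step__parameter` keys exist.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

demo_pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier())

# make_pipeline names each step after its class, in lowercase.
print(list(demo_pipe.named_steps))  # ['simpleimputer', 'decisiontreeclassifier']

# get_params() lists every key that can be used in a parameter grid.
all_params = demo_pipe.get_params()
tree_keys = [k for k in all_params if k.startswith("decisiontreeclassifier__")]

print("decisiontreeclassifier__max_depth" in tree_keys)  # True
print("simpleimputer__strategy" in all_params)           # True
```

The same pattern also lets you tune the transformer, for example by adding a `simpleimputer__strategy` key to the grid.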

In [None]:
param_grid = {
    'decisiontreeclassifier__max_depth': range(2, 12),
    'decisiontreeclassifier__min_samples_leaf': range(3, 10, 2),
    'decisiontreeclassifier__min_samples_split': range(3, 40, 5),
    'decisiontreeclassifier__criterion':['gini', 'entropy']
    }

When defining the cross validation, we pass our pipeline (`pipe`), our parameter grid (`param_grid`) and the number of folds (a number you choose, usually 5 or 10). You can also set the parameter `verbose` if you want to receive a bit more information about the CV task.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
search = GridSearchCV(pipe, # the pipeline defined above
                      param_grid, # your parameter grid
                      cv=5, # the value for K in K-fold Cross Validation
                      scoring='accuracy', # the performance metric to use
                      verbose=1) # print progress information while fitting

In [None]:
# Fit your "search" to the training data (`X_train` and `y_train`), as we did with the model alone or with the pipeline:
search.fit(X_train, y_train)

Fitting 5 folds for each of 640 candidates, totalling 3200 fits
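The candidate count in the message above can be reproduced without fitting anything. The following sketch rebuilds the same grid (as a separate dictionary named `grid`, so it does not depend on earlier cells) and counts the combinations with `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    "decisiontreeclassifier__max_depth": list(range(2, 12)),             # 10 values
    "decisiontreeclassifier__min_samples_leaf": list(range(3, 10, 2)),   # 4 values
    "decisiontreeclassifier__min_samples_split": list(range(3, 40, 5)),  # 8 values
    "decisiontreeclassifier__criterion": ["gini", "entropy"],            # 2 values
}

n_candidates = len(ParameterGrid(grid))
print(n_candidates)      # 640 = 10 * 4 * 8 * 2
print(n_candidates * 5)  # 3200 fits with cv=5
```

Because the number of fits is the product of all list lengths times the number of folds, adding one more parameter to the grid can multiply the running time considerably.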


Explore the best parameters and the best score achieved with your cross validation:

In [None]:
search.best_params_

{'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 6,
 'decisiontreeclassifier__min_samples_leaf': 3,
 'decisiontreeclassifier__min_samples_split': 33}

In [None]:
# the mean cross-validated score of the best estimator
search.best_score_

0.9255236418326547
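Beyond the single best score, a fitted search stores the results of every candidate in `cv_results_`. The sketch below is a small illustration on synthetic data with a deliberately tiny grid (`demo_search`, `X_demo` and `y_demo` are names invented for this example), showing how to view the results as a DataFrame.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a tiny grid, to keep the sketch fast.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 6]},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate; rank_test_score == 1 marks the best mean CV score.
results = pd.DataFrame(demo_search.cv_results_)
cols = ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))

# The highest mean score is the number reported by best_score_.
print(results["mean_test_score"].max() == demo_search.best_score_)  # True
```

Looking at `std_test_score` next to the mean can help judge whether the winning candidate is clearly better or only marginally ahead of the others.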

In [None]:
# training accuracy
y_train_pred = search.predict(X_train)

accuracy_score(y_train, y_train_pred)

0.9392123287671232

In [None]:
# testing accuracy
y_test_pred = search.predict(X_test)

accuracy_score(y_test, y_test_pred)

0.9212328767123288