### Introduction

#### Here we reduce model complexity (pruning the tree) by setting hyperparameters, and we build pipelines that combine the data preparation phase and the modelling phase into a single object

In [1]:
import pandas as pd
X = pd.read_csv('housing-classification-iter-0-2.csv')
y = X.pop('Expensive')
X.head(3)

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
0,8450,65.0,856,3,0,0,2,0,0
1,9600,80.0,1262,3,1,0,2,298,0
2,11250,68.0,920,3,1,0,2,0,0


#### Splitting the data, then fitting the preprocessing on train and transforming train and test

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=38276)

In [3]:
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
my_imputer.fit(X_train)
X_imputer_train = my_imputer.transform(X_train)
X_imputer_test = my_imputer.transform(X_test)
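The cell above fits the imputer on the training set only and then reuses it on the test set. A minimal sketch of why that matters, using a tiny made-up array rather than the housing data (the `toy_*` names are hypothetical and exist only for this illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (an assumption for illustration, not the housing dataset)
toy_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
toy_test = np.array([[np.nan], [100.0]])

toy_imputer = SimpleImputer()  # default strategy is "mean"
toy_imputer.fit(toy_train)     # the fill value is learned from the training data only

# Mean of the observed training values 1, 3 and 5
learned_mean = toy_imputer.statistics_[0]

# The missing test value is filled with the training mean, not the test mean
toy_test_filled = toy_imputer.transform(toy_test)
print(learned_mean, toy_test_filled.ravel())
```

Because the fill value comes from the training data, no information from the test set leaks into the preprocessing step.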

#### Modelling

In [4]:
from sklearn.tree import DecisionTreeClassifier
My_tree = DecisionTreeClassifier(max_depth=4,
                                 min_samples_leaf=10)
My_tree.fit(X=X_imputer_train,
            y=y_train)

#### Checking accuracy on train and test set

In [5]:
from sklearn.metrics import accuracy_score
y_pred_tree_train = My_tree.predict(X_imputer_train)
accuracy_score(y_true=y_train,
               y_pred=y_pred_tree_train)

0.9212328767123288

In [6]:
y_pred_tree_test = My_tree.predict(X_imputer_test)
accuracy_score(y_true=y_test,
               y_pred=y_pred_tree_test)

0.910958904109589

### Creating pipelines

In [10]:
from sklearn.pipeline import make_pipeline
# Initializing transformers and model
imputer = SimpleImputer(strategy='median')
dtree = DecisionTreeClassifier(max_depth=4,
                              min_samples_leaf=10)
# creating pipeline
pipe = make_pipeline(imputer, dtree)

# fitting pipeline to training dataset
pipe.fit(X_train, y_train)

# predicting test set with pipe
pipe.predict(X_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0])
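`make_pipeline` names each step automatically after its lowercased class name, and those names are what you use to address a step's parameters with the `<step>__<parameter>` syntax. A small sketch (the `demo_pipe` variable is hypothetical and used only here):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

demo_pipe = make_pipeline(SimpleImputer(strategy='median'),
                          DecisionTreeClassifier(max_depth=4,
                                                 min_samples_leaf=10))

# Each step is stored as a (name, estimator) pair
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)

# A step's parameter is addressed as "<step name>__<parameter name>"
demo_pipe.set_params(decisiontreeclassifier__max_depth=6)
depth = demo_pipe.named_steps['decisiontreeclassifier'].max_depth
print(depth)
```

The same double-underscore naming is what a parameter grid for a pipeline has to use.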

#### Using GridSearchCV to find the best parameters

#### So far, we have tuned the hyperparameters of the decision tree manually. This is not ideal, for two reasons:

- It is not an efficient way to find the best combination of parameters.
- If we keep checking the performance on the test set over and over again, we might end up with a model that fits that particular test set but does not generalize as well to new data. Test sets are meant to remain unseen until the very last moment of ML development, so we have been cheating a bit!


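Grid search cross-validation addresses both issues: it tries every combination in a parameter grid automatically, and it scores each one with k-fold cross-validation on the training data, so the test set stays untouched: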

In [14]:
imputer = SimpleImputer()
dtree = DecisionTreeClassifier()

pipe = make_pipeline(imputer, dtree)

param_grid = {'decisiontreeclassifier__max_depth': range(2,12),
             'decisiontreeclassifier__min_samples_leaf': range(3,10,2),
             'decisiontreeclassifier__min_samples_split': range(3,40,5),
             'decisiontreeclassifier__criterion':['gini', 'entropy']
             }

from sklearn.model_selection import GridSearchCV
search = GridSearchCV(pipe,
                     param_grid,
                     cv=5,  # value of k in k-fold cross-validation
                     scoring='accuracy',  # performance metric to use
                     verbose=1  # print progress information during training
                     )
search.fit(X_train, y_train)

Fitting 5 folds for each of 640 candidates, totalling 3200 fits
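The candidate count in that message is the product of the number of values tried for each parameter, and the number of fits is that count times the number of folds. One way to check this, sketched with scikit-learn's `ParameterGrid` (the `grid`, `n_candidates` and `n_fits` names are introduced only for this illustration):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
grid = {'decisiontreeclassifier__max_depth': range(2, 12),            # 10 values
        'decisiontreeclassifier__min_samples_leaf': range(3, 10, 2),  # 4 values
        'decisiontreeclassifier__min_samples_split': range(3, 40, 5), # 8 values
        'decisiontreeclassifier__criterion': ['gini', 'entropy']      # 2 values
        }

n_candidates = len(ParameterGrid(grid))  # 10 * 4 * 8 * 2
n_fits = n_candidates * 5                # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)
```

Since the cost grows multiplicatively with every parameter added, it is worth keeping each range only as wide as needed.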


In [15]:
search.best_params_

{'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 6,
 'decisiontreeclassifier__min_samples_leaf': 5,
 'decisiontreeclassifier__min_samples_split': 13}

In [16]:
search.best_score_

0.9160925864788526
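`best_score_` is the mean cross-validated accuracy on the training folds, not a test-set score. A natural last step is a single evaluation on the held-out test set. The sketch below shows that pattern on synthetic data from `make_classification`, since the housing CSV is not available here; the `*_demo` and `Xd_*`/`yd_*` names and the small grid are assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the housing dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

demo_search = GridSearchCV(
    make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=0)),
    {'decisiontreeclassifier__max_depth': range(2, 5)},  # small grid to keep it fast
    cv=5,
    scoring='accuracy')
demo_search.fit(Xd_train, yd_train)

# With the default refit=True, the best pipeline is refitted on the whole
# training set, so predict() uses it directly
test_accuracy = accuracy_score(yd_test, demo_search.predict(Xd_test))
print(demo_search.best_score_, test_accuracy)
```

In the notebook itself, the equivalent would be calling `search.predict(X_test)` once and comparing the result against `y_test`.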