## Advanced tuning of parameters

In this tutorial, we apply the skills from previous tutorials to build a classifier using the `Pipeline` and `FeatureUnion` classes from sklearn.

In [1]:
# IMPORT PACKAGES
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

In [13]:
from sklearn.model_selection import train_test_split


### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [2]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [6]:
df=pd.read_csv('pima-indians-diabetes.csv', sep=';')

In [7]:
df.head()

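| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |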


### Task

Build a classifier that predicts the target variable `class` using the rest of the attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.

> #### Note
> **In this exercise, we focus on implementing the pipeline. Since our dataset has only 9 columns, PCA is probably not the best data-preparation technique from a methodological point of view.**

In [10]:
# features: the first 8 columns; target: the last column (`class`)
X, y = df.iloc[:, :8], df.iloc[:, -1]

In [11]:
X.head()

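| | preg | plas | pres | skin | test | mass | pedi | age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |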


In [12]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: class, dtype: int64

In [14]:
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.25, random_state=100)

In [15]:
pca = PCA(n_components=2)

selection = SelectKBest(k=3)

In [16]:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
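
To see what `FeatureUnion` produces, the following minimal sketch runs the same two transformers on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, and `union_demo` are assumptions made for illustration; they are not the diabetes data or objects from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in for the real data: 200 rows, 8 numeric features (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

union_demo = FeatureUnion(
    [("pca", PCA(n_components=2)), ("univ_select", SelectKBest(k=3))]
)

# Both transformers are fitted on the same input; their outputs are stacked column-wise
X_combined = union_demo.fit_transform(X_demo, y_demo)
print(X_combined.shape)  # (200, 5): 2 principal components + 3 selected original columns
```

The classifier placed after the union therefore sees `n_components + k` columns, which is why both values are worth tuning together.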

In [17]:
rf=RandomForestClassifier(n_estimators=100)

In [18]:
# create our pipeline from FeatureUnion 
pipeline = Pipeline([("features", combined_features), ("randomforest", rf)])

# set up our parameters grid
param_grid = {"features__pca__n_components": [1, 2, 3],
              "features__univ_select__k": [1, 2, 3],
              "randomforest__n_estimators":[100, 150, 200],
              "randomforest__max_depth":[5,7, 10],
             }

# create a Grid Search object
grid_search = GridSearchCV(pipeline, param_grid, verbose=10, refit=True)    

# fit the model and tune parameters
grid_search.fit(X_train, y_train)


Fitting 5 folds for each of 81 candidates, totalling 405 fits
[CV 1/5; 1/81] START features__pca__n_components=1, features__univ_select__k=1, randomforest__max_depth=5, randomforest__n_estimators=100
[CV 1/5; 1/81] END features__pca__n_components=1, features__univ_select__k=1, randomforest__max_depth=5, randomforest__n_estimators=100;, score=0.707 total time=   1.4s
[CV 2/5; 1/81] START features__pca__n_components=1, features__univ_select__k=1, randomforest__max_depth=5, randomforest__n_estimators=100
[CV 2/5; 1/81] END features__pca__n_components=1, features__univ_select__k=1, randomforest__max_depth=5, randomforest__n_estimators=100;, score=0.722 total time=   0.4s
[CV 3/5; 1/81] START features__pca__n_components=1, features__univ_select__k=1, randomforest__max_depth=5, randomforest__n_estimators=100
[CV 3/5; 1/81] END features__pca__n_components=1, features__univ_select__k=1, randomforest__max_depth=5, randomforest__n_estimators=100;, score=0.765 total time=   0.4s
[... verbose output for the remaining folds and candidates omitted ...]

GridSearchCV(estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('pca',
                                                                        PCA(n_components=2)),
                                                                       ('univ_select',
                                                                        SelectKBest(k=3))])),
                                       ('randomforest',
                                        RandomForestClassifier())]),
             param_grid={'features__pca__n_components': [1, 2, 3],
                         'features__univ_select__k': [1, 2, 3],
                         'randomforest__max_depth': [5, 7, 10],
                         'randomforest__n_estimators': [100, 150, 200]},
             verbose=10)
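
The keys in the parameter grid follow the `<step>__<sub-step>__<parameter>` naming convention for nested estimators. The sketch below, which rebuilds an equivalent pipeline under the hypothetical names `pipe_demo` and `grid_demo`, checks that every key in the grid is a valid parameter name and counts the candidates, which should match the 81 candidates and 405 fits reported in the output above.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline

# Same structure as the pipeline in this notebook (names here are illustrative)
pipe_demo = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest())])),
    ("randomforest", RandomForestClassifier()),
])

grid_demo = {
    "features__pca__n_components": [1, 2, 3],
    "features__univ_select__k": [1, 2, 3],
    "randomforest__n_estimators": [100, 150, 200],
    "randomforest__max_depth": [5, 7, 10],
}

# get_params() lists every tunable name, including nested ones
available = pipe_demo.get_params().keys()
unknown = [name for name in grid_demo if name not in available]
print(unknown)  # an empty list means every key in the grid is valid

# 3 * 3 * 3 * 3 combinations, each fitted once per cross-validation fold
n_candidates = len(ParameterGrid(grid_demo))
print(n_candidates, n_candidates * 5)  # 81 candidates, 405 fits with 5 folds
```

Checking the names before launching the search can save time, because a misspelled key only raises an error once fitting starts.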

In [20]:
grid_search.best_params_

{'features__pca__n_components': 2,
 'features__univ_select__k': 3,
 'randomforest__max_depth': 7,
 'randomforest__n_estimators': 150}
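
Besides `best_params_`, a fitted `GridSearchCV` object keeps the cross-validation scores of every candidate in `cv_results_`. The following sketch shows one way to rank them; it uses a deliberately small grid on synthetic data so that it runs quickly, and `search_demo`, `X_demo`, and `y_demo` are hypothetical names rather than objects from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption), small grid to keep the run short
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {"max_depth": [2, 4]},
    cv=3,
)
search_demo.fit(X_demo, y_demo)

# One row per candidate; rank 1 is the candidate with the best mean CV score
results = pd.DataFrame(search_demo.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))

# best_score_ is the mean cross-validated score of the best candidate
print(search_demo.best_score_)
```

Looking at `std_test_score` next to the mean helps judge whether the winning candidate is meaningfully better than the runners-up or only ahead by noise.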

In [23]:
print('Final training score is: ', grid_search.score(X_train, y_train))

Final training score is:  0.9496527777777778


In [24]:
print('Final testing score is: ', grid_search.score(X_test, y_test))

Final testing score is:  0.734375
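
The scores above are accuracies, and the gap between roughly 0.95 on the training set and 0.73 on the test set suggests that the fitted model overfits the training data. Accuracy alone also hides which class is being misclassified. The sketch below shows one way to look closer using a confusion matrix and a per-class report; it runs on synthetic data with hypothetical names (`model_demo`, `X_tr`, `X_te`, and so on), so its numbers say nothing about the diabetes model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 400 rows, 8 features, binary target
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=100
)

model_demo = RandomForestClassifier(n_estimators=50, max_depth=7, random_state=0)
model_demo.fit(X_tr, y_tr)
y_pred = model_demo.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_te, y_pred))
```

With the real model, the same calls would take `grid_search.predict(X_test)` and `y_test` as inputs.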
