## Advanced tuning of parameters

In this tutorial, we apply the skills from previous tutorials to build a classifier using the `Pipeline` and `FeatureUnion` classes from scikit-learn.

In [4]:
# IMPORT PACKAGES
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information about the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [5]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [6]:
data = pd.read_csv('data/pima-indians-diabetes.csv', sep=';')
X = data.drop('class', axis=1)
y = data['class']

X_train, X_test, y_train, y_test = train_test_split(X, y)
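
Called without extra arguments, `train_test_split` shuffles the rows differently on every run and holds out 25% of them, so the scores further down will vary between runs. The sketch below shows one way to make the split reproducible and keep the class ratio similar in both parts. It runs on synthetic stand-in data, and the names `X_demo`, `y_demo`, and the value `random_state=42` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (assumption): 768 rows, 8 features,
# and an imbalanced binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = (rng.random(768) < 0.35).astype(int)

# random_state fixes the shuffle; stratify keeps the class ratio similar
# in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With 768 rows and `test_size=0.25`, the training part has 576 rows, which matches the length of `y_train` shown later in this notebook.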


In [7]:
X_train.head()

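| | preg | plas | pres | skin | test | mass | pedi | age |
|---|---|---|---|---|---|---|---|---|
| 482 | 4 | 85 | 58 | 22 | 49 | 27.8 | 0.306 | 28 |
| 528 | 0 | 117 | 66 | 31 | 188 | 30.8 | 0.493 | 22 |
| 433 | 2 | 139 | 75 | 0 | 0 | 25.6 | 0.167 | 29 |
| 741 | 3 | 102 | 44 | 20 | 94 | 30.8 | 0.4 | 26 |
| 222 | 7 | 119 | 0 | 0 | 0 | 25.2 | 0.209 | 37 |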


### Task

Build a classifier that predicts the target variable `class` from the remaining attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.
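
Before assembling the full pipeline, it may help to see what `FeatureUnion` does on its own: it fits each transformer on the same input and concatenates their outputs column-wise. The following is a minimal sketch on synthetic data generated with `make_classification`; the names `X_demo`, `y_demo`, and `union` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic data (assumption): 200 samples, 8 features, like the diabetes set.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Two principal components plus the three best original features
# (SelectKBest defaults to the ANOVA F-test, f_classif).
union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=3)),
])

# y is passed along because SelectKBest needs the target; PCA ignores it.
X_union = union.fit_transform(X_demo, y_demo)
print(X_union.shape)  # (200, 5): 2 PCA columns + 3 selected columns
```

The downstream estimator therefore sees `n_components + k` columns, which is why both values are worth tuning together.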

> #### Note
> **In this exercise, we focus on implementing the pipeline. Since our dataset has only 9 columns, PCA is probably not the best data-preparation technique from a methodological point of view.**

In [8]:
y_train

482    0
528    0
433    0
741    0
222    0
      ..
522    0
101    0
523    1
615    0
407    0
Name: class, Length: 576, dtype: int64

In [21]:
pca = PCA(n_components=2)
selection = SelectKBest(k=3)
combined_features = FeatureUnion([('pca', pca), ('univ_select', selection)])
randomForest = RandomForestClassifier(max_depth=2)

pipeline = Pipeline([('features', combined_features), 
                     ('random_forest', randomForest)])

param_grid = {'features__pca__n_components': [1, 2, 3, 4],
              'features__univ_select__k': [1, 2, 3],
              'random_forest__max_depth': [2, 4, 6, 10, None]}

grid_search = GridSearchCV(pipeline, param_grid, verbose=1, refit=True)

grid_search.fit(X_train, y_train)


Fitting 5 folds for each of 60 candidates, totalling 300 fits


GridSearchCV(estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('pca',
                                                                        PCA(n_components=2)),
                                                                       ('univ_select',
                                                                        SelectKBest(k=3))])),
                                       ('random_forest',
                                        RandomForestClassifier(max_depth=2))]),
             param_grid={'features__pca__n_components': [1, 2, 3, 4],
                         'features__univ_select__k': [1, 2, 3],
                         'random_forest__max_depth': [2, 4, 6, 10, None]},
             verbose=1)

In [22]:
grid_search.score(X_test, y_test)

0.7604166666666666
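
For a classifier, `score` returns mean accuracy, and accuracy is easier to interpret next to a trivial baseline. The sketch below compares a majority-class `DummyClassifier` with a random forest on synthetic, imbalanced data. The data, the class weights, and the names `baseline` and `model` are assumptions for illustration, and the numbers it prints say nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 65% negatives, 35% positives.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# Baseline that always predicts the most frequent class in the training data.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = RandomForestClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```

Running the same comparison with `grid_search` and the real `X_test`, `y_test` would show how much the tuned pipeline gains over always predicting the majority class.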

In [23]:
grid_search.best_params_

{'features__pca__n_components': 3,
 'features__univ_select__k': 2,
 'random_forest__max_depth': 4}
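
`best_params_` reports only the winning combination. To see how close the other candidates were, the fitted search also exposes `best_score_` and `cv_results_`. The sketch below is self-contained, so it refits a smaller grid on synthetic data; `demo_pipeline`, `demo_grid`, and `demo_search` are hypothetical names, and the reduced grid, `cv=3`, and `n_estimators=20` are assumptions chosen only to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic data (assumption) standing in for X_train, y_train.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

demo_pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest())])),
    ("random_forest", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Parameter names follow the <step>__<sub-step>__<parameter> convention.
demo_grid = {
    "features__pca__n_components": [1, 2],
    "features__univ_select__k": [1, 2],
    "random_forest__max_depth": [2, 4],
}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# One row per candidate, sorted so the best-ranked combination comes first.
results = pd.DataFrame(demo_search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results.sort_values("rank_test_score")[cols].head())
print(demo_search.best_score_)
```

If several candidates score within one standard deviation of the best, the simpler one (for example, a shallower `max_depth`) may be a reasonable choice.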