## Advanced tuning of parameters

In this tutorial, we apply the skills from the previous tutorials and build a classifier using the `Pipeline` and `FeatureUnion` classes from scikit-learn.

In [22]:
# import packages
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [23]:
import pandas as pd
import numpy as np

from pathlib import Path
data_path = Path('./data')

### Data

We will build a binary classifier that predicts whether a patient has diabetes, using information about the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [24]:
diabetes_data = pd.read_csv(data_path / 'diabetes.csv', sep=';')
# column names expected in the file, kept for reference (not passed to read_csv)
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [25]:
# features are all columns except the last; the target is the last column
xtrain, xtest, ytrain, ytest = train_test_split(diabetes_data.iloc[:, :-1], diabetes_data.iloc[:, -1])

### Task

Build a classifier that predicts the target variable `class` from the remaining attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.

> #### Note
> **This exercise focuses on implementing the pipeline. Since the dataset has only 9 columns (8 features plus the target), PCA is probably not the best data-preparation technique from a methodological point of view.**

In [26]:
# instantiate transformer objects
pca = PCA()
kbest = SelectKBest()
feature_select = FeatureUnion([('pca', pca), ('kbest', kbest)])

# instantiate the classifier
rf = RandomForestClassifier()
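
To see what `FeatureUnion` does with these two transformers, the sketch below fits a union on synthetic data and checks the output shape. The data from `make_classification` and the names `X_demo`, `y_demo` and `union_demo` are illustrative assumptions, not part of the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in for the real data (assumption): 200 rows, 8 numeric features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Each transformer sees the same input; their outputs are stacked side by side.
union_demo = FeatureUnion([
    ('pca', PCA(n_components=2)),
    ('kbest', SelectKBest(k=3)),
])

# SelectKBest is supervised, so the target must be passed when fitting.
X_union = union_demo.fit_transform(X_demo, y_demo)

# 2 PCA components followed by 3 selected original columns.
print(X_union.shape)  # (200, 5)
```

The number of columns passed on to the classifier is therefore `n_components + k`, which is why both values are worth tuning together.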

In [29]:
# define the pipeline
pipeline = Pipeline([('features', feature_select), ('forest', rf)])

In [30]:
# define grid parameters
param_grid = {
    'features__pca__n_components' : [1, 2, 3],
    'features__kbest__k' : [3, 4, 5],
    'forest__n_estimators' : [10, 30, 1000],
    'forest__max_depth' : [3, 5, 8]
}
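
The keys in `param_grid` follow the `step__substep__parameter` naming pattern, and a misspelled key only fails once the search is fitted. As a rough sanity check, the sketch below rebuilds the same pipeline under the hypothetical names `demo_pipeline` and `demo_grid`, verifies every key against `get_params()`, and counts the candidates with `ParameterGrid`.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline

# Same structure as the tutorial pipeline, rebuilt here so the sketch is self-contained.
demo_pipeline = Pipeline([
    ('features', FeatureUnion([('pca', PCA()), ('kbest', SelectKBest())])),
    ('forest', RandomForestClassifier()),
])

demo_grid = {
    'features__pca__n_components': [1, 2, 3],
    'features__kbest__k': [3, 4, 5],
    'forest__n_estimators': [10, 30, 1000],
    'forest__max_depth': [3, 5, 8],
}

# Every grid key must be one of the names the pipeline exposes.
valid_names = demo_pipeline.get_params().keys()
unknown = [name for name in demo_grid if name not in valid_names]
print(unknown)  # [] when all keys are valid

# 3 * 3 * 3 * 3 parameter combinations.
n_candidates = len(ParameterGrid(demo_grid))
print(n_candidates)  # 81
```

With `cv=5`, a grid of this size means 81 × 5 = 405 pipeline fits, plus one final refit on the full training set when `refit=True`.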

In [32]:
# initialize grid search
gs = GridSearchCV(pipeline, param_grid, refit=True, cv=5, return_train_score=True)

In [33]:
gs.fit(xtrain, ytrain)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('pca',
                                                                        PCA()),
                                                                       ('kbest',
                                                                        SelectKBest())])),
                                       ('forest', RandomForestClassifier())]),
             param_grid={'features__kbest__k': [3, 4, 5],
                         'features__pca__n_components': [1, 2, 3],
                         'forest__max_depth': [3, 5, 8],
                         'forest__n_estimators': [10, 30, 1000]},
             return_train_score=True)
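
Because the search was created with `return_train_score=True`, the fitted object stores both training and validation scores in `cv_results_`. The sketch below shows one way to read them, using a deliberately small search on synthetic data; `demo_search`, the data and the tiny grid are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    {'max_depth': [2, 8]},
    cv=3,
    return_train_score=True,
)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read.
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ['param_max_depth', 'mean_train_score', 'mean_test_score', 'rank_test_score']
]
print(summary.sort_values('rank_test_score'))
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate is a common sign of overfitting.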

In [40]:
print(gs.best_params_)
print(gs.best_score_)

ypred = gs.predict(xtest)

{'features__kbest__k': 5, 'features__pca__n_components': 3, 'forest__max_depth': 8, 'forest__n_estimators': 30}
0.7604497751124438
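
Note that `best_score_` is the mean cross-validated score on the training data, while `ypred` above is computed but never scored. A minimal sketch of the missing step, again on synthetic data with hypothetical names (`search`, `X_te`, `y_te`), could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {'max_depth': [3, 5]},
    cv=3,
)
search.fit(X_tr, y_tr)

# predict() uses the best estimator, refitted on the full training set.
y_pred = search.predict(X_te)
test_accuracy = accuracy_score(y_te, y_pred)

print(search.best_score_)  # mean cross-validated accuracy on the training data
print(test_accuracy)       # accuracy on the held-out test set
```

In the tutorial itself, the equivalent call would be `accuracy_score(ytest, ypred)`, or `gs.score(xtest, ytest)` as a shortcut.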
