## Advanced tuning of parameters

In this tutorial, we will apply the skills from previous tutorials and build a classifier using the `Pipeline` and `FeatureUnion` classes from scikit-learn.

In [15]:
# IMPORT PACKAGES
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [3]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [7]:
df = pd.read_csv('pima-indians-diabetes.csv', delimiter=';')

In [10]:
# Separate the target column from the feature columns
y = df['class']
x = df.drop('class', axis=1)

In [17]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

### Task

Build a classifier that predicts the target variable `class` from the remaining attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters by running a grid search (`GridSearchCV`) over the `Pipeline`.

> #### Note
> **This exercise focuses on implementing the pipeline. Since our dataset has only 9 columns (8 features plus the target), PCA is probably not the best data preparation technique from a methodological point of view.**
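
One way to judge whether PCA is worthwhile on so few columns is to look at how much variance each component explains. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the 8 feature columns (not the diabetes data), and it adds a `StandardScaler` step that is not part of the pipeline in this tutorial.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features (assumption: NOT the diabetes data).
X_demo, _ = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X_demo)

# Keep all components and accumulate their share of the total variance.
pca_demo = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative.round(3))
```

If the cumulative share rises slowly, most components carry information and there is little to gain from dropping any of them.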

In [18]:
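# Reduce the features to 2 principal components
# (initial value; the grid search below tunes n_components)
pca = PCA(n_components=2)

# Also keep the 3 original features with the highest univariate scores
# (initial value; the grid search below tunes k)
selection = SelectKBest(k=3)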

In [19]:
# Build a transformer from PCA and univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
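
To see what `FeatureUnion` produces, the following sketch fits the same two transformers on synthetic data (an assumption, used so that the snippet runs without the CSV file) and compares the input and output shapes. The outputs of the transformers are concatenated column-wise, so 2 principal components plus 3 selected features give 5 columns.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in with 8 features (assumption: NOT the diabetes data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=3)),
])

# SelectKBest needs the target to score features; PCA ignores it.
X_combined = union.fit_transform(X_demo, y_demo)
print(X_demo.shape, "->", X_combined.shape)  # (200, 8) -> (200, 5)
```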

In [20]:
# We will initialize the classifier
RFC = RandomForestClassifier()

In [27]:
pipeline = Pipeline([("features", combined_features), ("rfc", RFC)])

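# Parameter names follow the pattern <step>__<parameter>;
# nested steps chain their names, e.g. features__pca__n_components.
param_grid = {
    'features__pca__n_components': [1, 3],
    'features__univ_select__k': [1, 3],
    # Further parameters that could be searched (commented out here):
    # 'rfc__bootstrap': [True, False],
    # 'rfc__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    # 'rfc__max_features': ['auto', 'sqrt'],  # 'auto' was removed in scikit-learn 1.3
    # 'rfc__min_samples_leaf': [1, 2, 4],
    'rfc__min_samples_split': [2, 5, 10],
    'rfc__n_estimators': [200, 400],
}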
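
Before launching a search, it can help to confirm that every key in the grid is a valid pipeline parameter and to count the candidates. The sketch below rebuilds the same pipeline under hypothetical names (`demo_pipeline`, `demo_grid`) so that it runs on its own; it uses `get_params()` and `ParameterGrid`, which need no data.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline

# Same structure as the tutorial pipeline, rebuilt under illustrative names.
demo_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("pca", PCA(n_components=2)),
        ("univ_select", SelectKBest(k=3)),
    ])),
    ("rfc", RandomForestClassifier()),
])

demo_grid = {
    "features__pca__n_components": [1, 3],
    "features__univ_select__k": [1, 3],
    "rfc__min_samples_split": [2, 5, 10],
    "rfc__n_estimators": [200, 400],
}

# Every grid key must appear among the pipeline's parameter names.
valid_names = set(demo_pipeline.get_params().keys())
unknown = [name for name in demo_grid if name not in valid_names]

# Number of parameter combinations: 2 * 2 * 3 * 2.
n_candidates = len(ParameterGrid(demo_grid))
print(unknown, n_candidates)  # [] 24
```

With the default 5-fold cross-validation, 24 candidates mean 120 model fits, plus one final refit of the best candidate on the full training set.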



In [28]:
# create a Grid Search object
grid_search = GridSearchCV(pipeline, param_grid, verbose=1, refit=True)    

# fit the model and tune parameters
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


GridSearchCV(estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('pca',
                                                                        PCA(n_components=2)),
                                                                       ('univ_select',
                                                                        SelectKBest(k=3))])),
                                       ('rfc', RandomForestClassifier())]),
             param_grid={'features__pca__n_components': [1, 3],
                         'features__univ_select__k': [1, 3],
                         'rfc__min_samples_split': [2, 5, 10],
                         'rfc__n_estimators': [200, 400]},
             verbose=1)

In [29]:
grid_search.best_params_

{'features__pca__n_components': 1,
 'features__univ_select__k': 3,
 'rfc__min_samples_split': 10,
 'rfc__n_estimators': 400}
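
Besides `best_params_`, a fitted `GridSearchCV` exposes the cross-validation scores of every candidate in `cv_results_`. The sketch below shows one way to rank them in a DataFrame. To keep it quick and self-contained it uses synthetic data, a plain random forest, and a tiny grid, all of which are assumptions rather than the setup from this tutorial.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: NOT the diabetes data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Deliberately small search so the example runs in a few seconds.
demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {"min_samples_split": [2, 10]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best candidates first.
results = pd.DataFrame(demo_search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(ranked)

# Mean cross-validated score of the best candidate.
print(demo_search.best_score_)
```

Comparing `mean_test_score` with `std_test_score` shows whether the winning candidate is clearly better or only marginally ahead of the others.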

In [30]:
grid_search.score(X_test, y_test)

0.7272727272727273
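
For a classifier, `score` returns the mean accuracy on the given data when no other `scoring` is set. A single accuracy number hides how the errors are split between the two classes, so a confusion matrix and per-class report can be a useful follow-up. The sketch below illustrates this on synthetic data with a plain random forest (assumptions; it does not reproduce the result above).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the diabetes data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# accuracy_score matches what clf.score(X_te, y_te) reports.
acc = accuracy_score(y_te, y_pred)
print(acc)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_te, y_pred))
```

The same calls work with the fitted search from this tutorial by replacing `clf.predict(X_te)` with `grid_search.predict(X_test)`.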