## Advanced tuning of parameters

In this tutorial, we will apply the skills from previous tutorials and build a classifier using the `Pipelines` and `FeatureUnion` functions from sklearn.

In [1]:
# IMPORT PACKAGES

In [20]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

### Data

We will be building a binary classifier that predicts wheter a person has diabetes or not, using information on the patient's health. 

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [2]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [7]:
df = pd.read_csv('data/pima-indians-diabetes.csv', sep =';')
df.head()

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [21]:
# Test-train split
X = df.iloc[:,:-1].to_numpy()
y = df.iloc[:,-1].to_numpy()

test_size = 0.25
seed = 88

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

### Task

Build classifier which predicts the target variable `class` using rest of the attributes. The model should be fitted using pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and grid_search.

> #### Note
> **In this exercise, we are focusing on the implementation of pipeline. Since we have only 9 columns in our dataset the PCA is probably not the best technique to use during the data preparation from a methodology point of view.**

In [22]:
# Combine features using feature union
pca = PCA(n_components=2)
selection = SelectKBest(k=7)

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

In [23]:
rf = RandomForestClassifier()

In [26]:
pipeline = Pipeline([("features", combined_features), ("random_forest", rf)])

param_grid = {"features__pca__n_components":[1,2,3],
                "features__univ_select__k": [3,4,5,6,7,8],
                "random_forest__n_estimators":[10, 50, 100,300],
                 "random_forest__max_depth":[2, 10, 20]}

grid_search = GridSearchCV(pipeline, param_grid, verbose=10, refit=True)    

# fit the model and tune parameters
grid_search.fit(X_train, y_train)

In [27]:
print(grid_search.best_params_)

{'features__pca__n_components': 3, 'features__univ_select__k': 5, 'random_forest__max_depth': 20, 'random_forest__n_estimators': 50}
