## Advanced tuning of parameters

In this tutorial, we will apply the skills from previous tutorials and build a classifier using the `Pipeline` and `FeatureUnion` classes from sklearn.

In [1]:
# IMPORT PACKAGES
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

In [16]:
import warnings
warnings.filterwarnings('ignore')

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [1]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [9]:
# load the dataset
data = pd.read_csv("pima-indians-diabetes.csv", delimiter=";")
data.head()

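|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |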


In [10]:
X = data.drop(columns=['class'])
y = data['class']
print(X.shape)
print(y.shape)

(768, 8)
(768,)


### Task

Build a classifier that predicts the target variable `class` using the rest of the attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.

> #### Note
> **In this exercise, we focus on implementing the pipeline. Since our dataset has only 9 columns (8 features plus the target), PCA is probably not the best data-preparation technique from a methodological point of view.**

In [11]:
# Create the PCA and SelectKBest transformers
pca = PCA()
select_k_best = SelectKBest()

# Create the FeatureUnion to combine transformers
combined_features = FeatureUnion([("pca", pca), ("select_k_best", select_k_best)])

# Create the RandomForest classifier
clf = RandomForestClassifier()
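
It may help to see what `FeatureUnion` actually produces before it goes into a pipeline. The sketch below is an illustration only: it uses synthetic data from `make_classification` as a stand-in for the diabetes table, and the names `X_demo`, `y_demo`, `union` and `combined` are made up for this example. It shows that the union places the outputs of both transformers side by side, so the number of output columns is `n_components + k`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in (assumption): 200 rows, 8 numeric features, binary target
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("select_k_best", SelectKBest(k=3)),
])

# SelectKBest is supervised, so the target is passed along with the features
combined = union.fit_transform(X_demo, y_demo)

# 2 PCA components + 3 selected original columns = 5 columns
print(combined.shape)
```

The classifier that follows the union therefore sees a mix of derived components and original columns, which is why both `n_components` and `k` are worth tuning together.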

In [14]:
# create the pipeline from FeatureUnion
pipeline = Pipeline([
    ("features", combined_features),
    ("classifier", clf)
])

# Define the parameter grid for GridSearch
param_grid = {
    'features__pca__n_components': [1, 2, 3],  # Adjust the number of PCA components
    'features__select_k_best__k': [1, 2, 3],  # Adjust the number of selected features
    'classifier__n_estimators': [50, 100, 200],  # Random Forest hyperparameters
    'classifier__max_depth': [None, 10, 20]  # Random Forest hyperparameters
}
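
The grid keys follow the `step__substep__parameter` naming convention, and a misspelled key only fails once the search starts. As a hedged sanity check, the sketch below rebuilds an equivalent pipeline under the hypothetical name `demo_pipeline`, compares the grid keys against `get_params()`, and counts the candidate combinations with `ParameterGrid`.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline

# Same structure as the tutorial pipeline, rebuilt here so the snippet is self-contained
demo_pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("select_k_best", SelectKBest())])),
    ("classifier", RandomForestClassifier()),
])

demo_grid = {
    "features__pca__n_components": [1, 2, 3],
    "features__select_k_best__k": [1, 2, 3],
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [None, 10, 20],
}

# Every grid key should be a tunable parameter of the pipeline
available = demo_pipeline.get_params().keys()
missing = [key for key in demo_grid if key not in available]
print("Unknown grid keys:", missing)

# 3 * 3 * 3 * 3 = 81 candidates; with cv=5 that is 405 model fits
n_candidates = len(ParameterGrid(demo_grid))
print("Candidates:", n_candidates, "-> fits with 5 folds:", n_candidates * 5)
```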

In [17]:
# Perform GridSearch to find the best set of parameters
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X, y)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

In [18]:
print("Best Parameters:", best_params)
print("Best Score:", best_score)

Best Parameters: {'classifier__max_depth': 10, 'classifier__n_estimators': 50, 'features__pca__n_components': 2, 'features__select_k_best__k': 3}
Best Score: 0.7643748408454292
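
The best score above is a cross-validated mean computed on the same data that was used to pick the parameters. One possible follow-up, sketched here on synthetic data rather than the diabetes file, is to hold out a test set with `train_test_split`, run the search on the training part only, and inspect `cv_results_` for the runner-up candidates. The smaller grid and the names `pipe`, `small_grid` and `search` are assumptions chosen to keep the example quick.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic stand-in (assumption) for the 8-feature diabetes data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

pipe = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("select_k_best", SelectKBest())])),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Deliberately small grid so the example runs quickly
small_grid = {
    "features__pca__n_components": [1, 2],
    "features__select_k_best__k": [1, 2],
    "classifier__n_estimators": [25],
}

search = GridSearchCV(pipe, small_grid, cv=3)
search.fit(X_train, y_train)

# One row per candidate, ordered from best to worst mean CV accuracy
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score"]].head(3))

# Accuracy of the refitted best pipeline on data the search never saw
test_accuracy = search.score(X_test, y_test)
print("Held-out accuracy:", test_accuracy)
```

Comparing `mean_test_score` with `std_test_score` across the top rows gives a sense of whether the winning combination is clearly better or just within fold-to-fold noise.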
