# Sklearn Pipeline: Practice

We will use an sklearn pipeline to build a model sequentially. The purpose of a pipeline is to apply several steps in sequence as one combined estimator rather than running them one by one. In this lab we will build a simple pipeline and then run a random search over it for hyperparameter optimization.

* [Sklearn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

For this we use the breast cancer dataset, loaded with sklearn's `load_breast_cancer`. We will train an SVM model, but first we will apply `MinMaxScaler` for scaling and PCA for feature reduction. We will do all of this with an sklearn pipeline in a single step rather than running each stage separately.

Later we will do a random search on the pipeline for hyperparameter optimization.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
%matplotlib inline

## Load dataset.
We are using the breast cancer dataset. A modified version of the dataset is already available in the sklearn dataset module.

In [None]:
cancer_data = load_breast_cancer()

## Data Inspection

In [None]:
print('Sample and Features:', cancer_data.data.shape)
print('Target class:', cancer_data.target_names)
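
As a further inspection step, the class balance can be checked as well. The following is a minimal sketch; the names `data` and `counts` are illustrative and not part of the lab.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Count how many samples belong to each target class
counts = np.bincount(data.target)
for name, count in zip(data.target_names, counts):
    print(f'{name}: {count}')

# The first few feature names give a feel for the measurements
print(list(data.feature_names[:5]))
```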

### Split the dataset

Split the dataset for training and testing. We split it into a training set (80%) and a test set (20%).

In [None]:
# split the dataset (P101)
X_train, X_test, y_train, y_test = train_test_split(cancer_data.data, cancer_data.target, test_size = .2)
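
The split above is random, so the results change from run to run. One possible variation, sketched below, fixes `random_state` for reproducibility and passes `stratify` so that both sets keep roughly the same class proportions. The seed value 42 and the names `X_tr`, `X_te`, `y_tr`, `y_te` are arbitrary choices for illustration, not something the lab prescribes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Reproducible 80/20 split that preserves the class ratio in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target,
    test_size=0.2,
    random_state=42,       # assumed seed, any integer works
    stratify=data.target,  # keep class proportions similar in train and test
)

print('Train:', X_tr.shape, 'Test:', X_te.shape)
print('Fraction of class 1 (train / test):', y_tr.mean(), y_te.mean())
```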

# Stage 1: Building a Pipeline 
We will build a pipeline that uses `MinMaxScaler` for data scaling, PCA for reducing the dimensionality of the features, and then a classifier for training and predicting with the data.

## Defining the segments of the pipe

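Here we define a pipeline as an ordered list of named steps. Every step except the last transforms the data and passes it on; the last step is the estimator that makes the predictions.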

In the example below:

  1. Data --> Scale --> Scaled_Features
  2. Scaled_Features --> PCA --> Data_Features
  3. Data_Features --> SVC --> Classifications

Therefore, 

  1. Data --> Pipeline --> Classifications
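
To make the equivalence above concrete, here is a rough sketch that runs the three stages by hand and compares the result with a pipeline built from the same steps. The names `demo_pipe`, `manual_pred` and `pipe_pred`, and the fixed `random_state=0`, are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

# Manual version: fit each stage on the training data only,
# then reuse the fitted stages on the test data
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
pca = PCA(n_components=20)
X_tr_reduced = pca.fit_transform(X_tr_scaled)
svc = SVC(kernel='rbf').fit(X_tr_reduced, y_tr)
manual_pred = svc.predict(pca.transform(scaler.transform(X_te)))

# Pipeline version: the same stages chained into one estimator
demo_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('PCA', PCA(n_components=20)),
    ('SVC', SVC(kernel='rbf')),
])
demo_pipe.fit(X_tr, y_tr)
pipe_pred = demo_pipe.predict(X_te)

print('Same predictions:', np.array_equal(manual_pred, pipe_pred))
```

Besides being shorter, the pipeline version makes it harder to accidentally fit the scaler or PCA on the test data.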

In [None]:
# For stage 1, set the pca_components
pca_components = 20

# Define the pipeline (P102)
pipe = Pipeline([
    ('scale', MinMaxScaler()),                  # Scale the data
    ('PCA', PCA(n_components=pca_components)),  # Reduce the feature vector to 20 components
    ('SVC', SVC(kernel='rbf'))                  # Train an SVC on the reduced 20-component feature vector
])

## Train the pipeline

In [None]:
# Fit the pipeline (P103)
pipe.fit(X_train, y_train)
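
After fitting, the individual steps stay accessible through `named_steps`, which is handy for checking how much variance the PCA step retains. The sketch below assumes a pipeline with the same step names as above; for brevity it is fitted on the whole dataset, and `demo_pipe` and `ratio` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()

demo_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('PCA', PCA(n_components=20)),
    ('SVC', SVC(kernel='rbf')),
])
demo_pipe.fit(data.data, data.target)

# Look up the fitted PCA step by the name given in the pipeline
fitted_pca = demo_pipe.named_steps['PCA']
ratio = fitted_pca.explained_variance_ratio_

print('Components kept:', fitted_pca.n_components_)
print('Share of variance kept by these components:', ratio.sum())
```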

## Predict with the pipeline

In [None]:
# Predict with the test set (P104)
preidcted_y = pipe.predict(X_test)

# Count the correct predictions
correct_prediction = np.sum(preidcted_y  == y_test)

print('Total correct prediction: ', correct_prediction, '\nTotal test set: ', len(y_test))

# Pipeline Evaluation
The pipeline's `score` method returns the mean accuracy of the trained pipeline on the given test data and labels.

In [None]:
# Get the score of the pipeline (P105)
pipe.score(X_test, y_test)
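
For a classifier pipeline, `score` should agree with computing the accuracy from the predictions directly. A small sketch of that check follows; `demo_pipe` and the fixed `random_state=0` are assumptions made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

demo_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('PCA', PCA(n_components=20)),
    ('SVC', SVC(kernel='rbf')),
]).fit(X_tr, y_tr)

# Accuracy via the pipeline's own score method
acc_from_score = demo_pipe.score(X_te, y_te)

# Accuracy computed from the predictions
acc_manual = accuracy_score(y_te, demo_pipe.predict(X_te))

print(acc_from_score, acc_manual)
```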

In [None]:
# classification report (P106)
print(classification_report(y_test, preidcted_y))

In [None]:
# confusion matrix (P107)
print(confusion_matrix(y_test, preidcted_y))

# Stage 2: RandomSearch over a pipeline 

It is convenient to have a single pipeline that does the preprocessing and the training at once. But it is not a good idea to set the parameters of each part of the pipeline by hand. A further benefit is that we can run a grid search (`GridSearchCV`) or a random search (`RandomizedSearchCV`) over the whole pipeline for hyperparameter tuning.

To tune hyperparameters over a pipeline, we prefix each parameter name with the name of its pipeline step, joined by a double underscore `__`. For example, to search over the `kernel` parameter of `SVC`, whose step is named `SVC` in our pipeline, the parameter name in the configuration is `SVC__kernel`.
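
The available parameter names can be listed with `get_params`, and the same names work with `set_params`. The sketch below assumes a pipeline with the step names used in this lab; `demo_pipe` and `param_names` are illustrative names.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

demo_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('PCA', PCA()),
    ('SVC', SVC()),
])

# Every tunable parameter appears as <step name>__<parameter name>
param_names = sorted(demo_pipe.get_params().keys())
print([name for name in param_names if name.startswith('SVC__')])

# The same naming scheme is used to set parameters on individual steps
demo_pipe.set_params(SVC__kernel='linear', PCA__n_components=10)
print(demo_pipe.named_steps['SVC'].kernel)
print(demo_pipe.named_steps['PCA'].n_components)
```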


In [None]:
# import random search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform


# configure parameters for randomsearch (P108)

# Parameter distributions for the random search. Each parameter
# name is prefixed with its pipeline step name followed by __

# Due to limited CPU resources, some parameters have only a single option
param_grid = {'SVC__C': uniform(1000, 100000), # sample uniformly from the range (1000, 1000 + 100000)
              'SVC__gamma': uniform(0, 0.1), 
              'PCA__n_components': [20],
              'SVC__kernel': ['rbf']}


# Now build the pipeline again (P109)
clf_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('PCA', PCA()),  # n_components is set by the search (20 here)
    ('SVC', SVC())   # C, gamma and kernel are set by the search
])

# Now define a random search with the pipe (P110)
rand_model = RandomizedSearchCV(clf_pipe, param_distributions = param_grid, n_jobs=5, cv=5)
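
The two arguments of `scipy.stats.uniform` are `loc` and `scale`, so `uniform(1000, 100000)` covers the range from 1000 to 101000, not 1000 to 100000. The short sketch below draws a few candidate values to illustrate this; `c_dist` and `samples` are illustrative names and the seed is arbitrary.

```python
from scipy.stats import uniform

# Same distribution as used for SVC__C: loc=1000, scale=100000
c_dist = uniform(loc=1000, scale=100000)

# Draw a few candidate values for C, as the random search would
samples = c_dist.rvs(size=5, random_state=0)
print(samples)

# Lower and upper ends of the range: loc and loc + scale
print(c_dist.ppf(0.0), c_dist.ppf(1.0))
```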

## Fit the random search model

In [None]:
# fit the pipeline (P111)
rand_model.fit(X_train, y_train)

# Show the best pipeline found, including its chosen params
print(rand_model.best_estimator_)
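
Besides `best_estimator_`, a fitted search also exposes `best_params_`, `best_score_` and `cv_results_`. The following sketch runs a deliberately small search to show them; the names `search` and `results`, the reduced budget (`n_iter=5`, `cv=3`) and the seeds are assumptions chosen to keep the example quick, not the lab's settings.

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

demo_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('PCA', PCA(n_components=20)),
    ('SVC', SVC(kernel='rbf')),
])

search = RandomizedSearchCV(
    demo_pipe,
    param_distributions={
        'SVC__C': uniform(1000, 100000),
        'SVC__gamma': uniform(0, 0.1),
    },
    n_iter=5,        # small budget, for illustration only
    cv=3,
    random_state=0,
)
search.fit(X_tr, y_tr)

# Best sampled parameter combination and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# One row per sampled combination, best rank first
results = pd.DataFrame(search.cv_results_)
columns = ['param_SVC__C', 'param_SVC__gamma', 'mean_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
```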

## Evaluation

In [None]:
# Classification report (P112)
predicted_y = rand_model.predict(X_test)

print(classification_report(y_test, predicted_y))

In [None]:
# Confusion Matrix (P113)
pd.DataFrame(confusion_matrix(y_test, predicted_y))
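
The bare DataFrame above shows only the indices 0 and 1. One possible way to make it easier to read, sketched below, is to label rows and columns with the class names. The names `demo_pipe` and `cm_df` and the fixed seed are assumptions for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

demo_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('PCA', PCA(n_components=20)),
    ('SVC', SVC(kernel='rbf')),
]).fit(X_tr, y_tr)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_te, demo_pipe.predict(X_te), labels=[0, 1])
cm_df = pd.DataFrame(
    cm,
    index=[f'true {name}' for name in data.target_names],
    columns=[f'predicted {name}' for name in data.target_names],
)
print(cm_df)
```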