# Sklearn Pipeline: Practice

We will use the scikit-learn `Pipeline` to build a model as a sequence of steps. A pipeline chains several processing steps so that they are applied together, in order, rather than one by one. In this lab we build a simple pipeline and then run a randomized search over it for hyperparameter optimization.

* [Sklearn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

We use the breast cancer dataset, loaded with sklearn's `load_breast_cancer`, and train an SVM classifier. Before training, we apply `MinMaxScaler` for scaling and PCA for feature reduction. The pipeline lets us do all of this in a single step rather than running each stage separately.

Later we run a randomized search over the pipeline for hyperparameter optimization.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
%matplotlib inline

## Load the dataset
We are using the breast cancer dataset. A modified version of the dataset is already available in the `sklearn.datasets` module.

In [2]:
cancer_data = load_breast_cancer()

## Data Inspection

In [3]:
print('Sample and Features:', cancer_data.data.shape)
print('Target class:', cancer_data.target_names)

Sample and Features: (569, 30)
Target class: ['malignant' 'benign']


### Split the dataset

Split the dataset into a training set (80%) and a test set (20%).

In [4]:
# split the dataset (P101)
X_train, X_test, y_train, y_test = train_test_split(cancer_data.data, cancer_data.target, test_size = 0.2)
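
The split above is random, so the numbers reported later in this lab change from run to run. As a minimal sketch of a reproducible variant, the example below fixes the shuffle and keeps the class ratio equal in both splits. The `random_state=42` and `stratify` settings, and the `X_tr`/`X_te` names, are assumptions for illustration and are not used elsewhere in this lab.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer_data = load_breast_cancer()

# random_state fixes the shuffle; stratify preserves the class ratio in both splits.
# Both settings are illustrative assumptions, not part of the original lab.
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer_data.data,
    cancer_data.target,
    test_size=0.2,
    random_state=42,
    stratify=cancer_data.target,
)

print('Train / test shapes:', X_tr.shape, X_te.shape)
print('Benign fraction (train, test):', y_tr.mean(), y_te.mean())
```

With 569 samples and `test_size=0.2`, the test set holds 114 samples, which matches the test-set size reported later in this lab.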


# Stage 1: Building a Pipeline 
We will build a pipeline that uses `MinMaxScaler` for data scaling, PCA for reducing the dimensionality of the features, and then a classifier for training and prediction.

## Defining the segments of the pipe

Here we define a pipeline as an ordered list of named steps. Each step is a `(name, estimator)` pair, and the data passes through the steps in order.

In the example below:

  1. Data --> Scale --> Scaled_Features
  2. Scaled_Features --> PCA --> Data_Features
  3. Data_Features --> SVC --> Classifications

Therefore, 

  1. Data --> Pipeline --> Classifications
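
To make the three arrows concrete, here is a rough sketch of what the pipeline does internally, written out by hand. It assumes the same three components used in this lab. The variable names and the `random_state=0` split are hypothetical and only serve the illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

scaler = MinMaxScaler()
pca = PCA(n_components=20)
svc = SVC(kernel='rbf')

# Training: each transformer is fitted on the training data and then transforms it.
scaled_tr = scaler.fit_transform(X_tr)     # Data --> Scale --> Scaled_Features
reduced_tr = pca.fit_transform(scaled_tr)  # Scaled_Features --> PCA --> Data_Features
svc.fit(reduced_tr, y_tr)                  # Data_Features --> SVC --> Classifications

# Prediction: the test data is only transformed, never used for fitting.
manual_pred = svc.predict(pca.transform(scaler.transform(X_te)))
print(manual_pred[:10])
```

A `Pipeline` bundles these calls, so a single `fit` and a single `predict` replace the manual chain and the test data cannot accidentally be used to fit the scaler or PCA.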

In [7]:
# For stage 1, set the pca_components
pca_components = 20

# Define the pipeline (P102)
pipe = Pipeline([
    ('Scale', MinMaxScaler()), 
    ('PCA', PCA(n_components = pca_components)), 
    ('SVC', SVC(kernel = 'rbf'))
])


## Train the pipeline

In [8]:
# Fit the pipeline (P103)
pipe.fit(X_train, y_train)


Pipeline(steps=[('Scale', MinMaxScaler()), ('PCA', PCA(n_components=20)),
                ('SVC', SVC())])

## Predict with the pipeline

In [9]:
# Predict with the test set (P104)
preidcted_y = pipe.predict(X_test)

# Count the correct predictions
correct_prediction = np.sum(preidcted_y  == y_test)

print('Total correct prediction: ', correct_prediction, '\nTotal test set: ', len(y_test))

Total correct prediction:  112 
Total test set:  114


## Pipeline Evaluation
The pipeline's `score` method returns the accuracy of the trained pipeline on the data it is given, because the final step is a classifier.

In [10]:
# Get the score of the pipeline (P105)
pipe.score(X_test, y_test)


0.9824561403508771
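
The score above is the fraction of correct predictions: 112 / 114 ≈ 0.9825. As a small sketch, the example below checks that `score`, `accuracy_score`, and a hand-computed fraction agree. The `demo_pipe` name and the `random_state=0` split are assumptions for illustration, so the printed value will differ from the one above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

demo_pipe = Pipeline([
    ('Scale', MinMaxScaler()),
    ('PCA', PCA(n_components=20)),
    ('SVC', SVC(kernel='rbf')),
])
demo_pipe.fit(X_tr, y_tr)

pred = demo_pipe.predict(X_te)
acc_from_score = demo_pipe.score(X_te, y_te)   # accuracy via the pipeline
acc_from_metric = accuracy_score(y_te, pred)   # accuracy via sklearn.metrics
acc_by_hand = np.mean(pred == y_te)            # correct predictions / total

print(acc_from_score, acc_from_metric, acc_by_hand)
```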

In [11]:
# classification report (P106)
print(classification_report(y_test, preidcted_y))


              precision    recall  f1-score   support

           0       0.98      0.98      0.98        43
           1       0.99      0.99      0.99        71

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114



In [12]:
# confusion matrix (P107)
print(confusion_matrix(y_test, preidcted_y))


[[42  1]
 [ 1 70]]
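
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, both in sorted label order. Here label 0 is malignant and label 1 is benign, so the matrix above shows one malignant sample predicted as benign and one benign sample predicted as malignant. The toy labels in the sketch below are made up purely to show the layout.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen only to illustrate the matrix layout.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# cm[i, j] counts samples whose true class is i and predicted class is j.
true0_pred0, true0_pred1, true1_pred0, true1_pred1 = cm.ravel()
print('True 0 predicted as 1:', true0_pred1)
print('True 1 predicted as 0:', true1_pred0)
```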


# Stage 2: RandomSearch over a pipeline 

It is convenient to have a single pipeline that preprocesses and trains in one call. However, hand-picking the parameters for each step of the pipeline is rarely a good idea. Conveniently, a pipeline can be passed to `GridSearchCV` or `RandomizedSearchCV` for hyperparameter tuning.

To tune hyperparameters through a pipeline, prefix each parameter name with the name of its pipeline step, joined by a double underscore `__`. For example, to search over the `kernel` parameter of the step named `SVC`, the key in the configuration is `SVC__kernel`.
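
One way to discover the valid parameter names is to ask the pipeline itself. The sketch below lists the parameters of the `SVC` step via `get_params` and sets two of them via `set_params`, using the step name, a double underscore, and the parameter name. The `demo_pipe` name and the values passed to `set_params` are assumptions for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

demo_pipe = Pipeline([
    ('Scale', MinMaxScaler()),
    ('PCA', PCA()),
    ('SVC', SVC()),
])

# Every nested parameter is exposed as <step name>__<parameter name>.
svc_params = sorted(name for name in demo_pipe.get_params() if name.startswith('SVC__'))
print(svc_params)

# The same names work with set_params and in search configurations.
demo_pipe.set_params(SVC__kernel='linear', PCA__n_components=10)
print(demo_pipe.named_steps['SVC'].kernel)
print(demo_pipe.named_steps['PCA'].n_components)
```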


In [24]:
# import random search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

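# configure parameters for random search (P108)

# Select the parameters to search over. Each key is the pipeline step name,
# followed by a double underscore (__) and the parameter name.
# scipy's uniform(loc, scale) samples from the interval [loc, loc + scale].

param_grid = {'PCA__n_components': [20], 
              'SVC__C': uniform(1e3, 5e3),     # C sampled from [1000, 6000]
              'SVC__gamma': uniform(0, 0.1),   # gamma sampled from [0, 0.1]
              'SVC__kernel': ['rbf']}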

# Now build the pipeline again (P109)
clf_pipe = Pipeline([
    ('Scale', MinMaxScaler()),
    ('PCA', PCA()), 
    ('SVC', SVC())
])

# Now define a random search with the pipe (P110)
rand_model = RandomizedSearchCV(clf_pipe, param_distributions = param_grid, cv = 5, n_jobs = 2)
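
A detail that is easy to misread: `scipy.stats.uniform(loc, scale)` takes a start and a width, not a lower and an upper bound. The short sketch below checks the range actually sampled for `SVC__C`. The sample size and `random_state=0` are arbitrary choices for illustration.

```python
from scipy.stats import uniform

# uniform(loc, scale) covers [loc, loc + scale], so this is [1000, 6000], not [1000, 5000].
c_dist = uniform(1e3, 5e3)

lower, upper = c_dist.support()
print('Support of C:', lower, upper)

# Draw a few values, as RandomizedSearchCV does for each candidate.
c_samples = c_dist.rvs(size=1000, random_state=0)
print('Sampled min / max:', c_samples.min(), c_samples.max())
```

Note also that `RandomizedSearchCV` tries `n_iter` candidates, which defaults to 10 when it is not set, as in the cell above.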


## Fit the random search model

In [25]:
# fit the pipeline (P111)
rand_model.fit(X_train, y_train)

# Show the best pipeline found by the search
print(rand_model.best_estimator_)


Pipeline(steps=[('Scale', MinMaxScaler()), ('PCA', PCA(n_components=20)),
                ('SVC', SVC(C=1448.2856706182731, gamma=0.0462140887665545))])
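
Besides `best_estimator_`, a fitted search also exposes the winning parameter values, their mean cross-validated score, and a per-candidate table. The sketch below shows one way to inspect them. The `search` name, `n_iter=5`, and the `random_state=0` values are assumptions made to keep the example small and repeatable, so its results will not match the output above.

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

demo_pipe = Pipeline([
    ('Scale', MinMaxScaler()),
    ('PCA', PCA(n_components=20)),
    ('SVC', SVC(kernel='rbf')),
])

search = RandomizedSearchCV(
    demo_pipe,
    param_distributions={'SVC__C': uniform(1e3, 5e3), 'SVC__gamma': uniform(0, 0.1)},
    n_iter=5,        # number of sampled candidates (illustrative)
    cv=5,
    random_state=0,  # makes the sampled candidates repeatable
)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Best mean CV accuracy:', search.best_score_)

# One row per sampled candidate, with its mean CV score and rank.
results = pd.DataFrame(search.cv_results_)[
    ['param_SVC__C', 'param_SVC__gamma', 'mean_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
```

With the default `refit=True`, the search refits the best pipeline on the whole training set, which is why calling `predict` on the search object works directly.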


## Evaluation

In [26]:
# Classification report (P112)
predicted_y = rand_model.predict(X_test)

print(classification_report(y_test, predicted_y))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99        43
           1       1.00      0.99      0.99        71

    accuracy                           0.99       114
   macro avg       0.99      0.99      0.99       114
weighted avg       0.99      0.99      0.99       114



In [28]:
# Confusion Matrix (P113)
pd.DataFrame(confusion_matrix(y_test, predicted_y))

    0   1
0  43   0
1   1  70
