Scikit-learn's Pipeline class provides a manageable way to apply a series of data transformations followed by an estimator.

This simple tool is useful for:

- Convenience: creating a coherent, easy-to-understand workflow
- Enforcing the workflow and the order in which its steps are applied
- Reproducibility
- Persistence: an entire pipeline object can be saved and reloaded, which supports both reproducibility and convenience

We will build three pipelines, each with a different estimator (classification algorithm) using default hyperparameters:

- Logistic Regression
- Support Vector Machine
- Decision Tree

To demonstrate pipeline transforms, we will perform:

- feature scaling
- dimensionality reduction, using PCA to project the data onto an 8-dimensional space
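
As a rough illustration of what a pipeline automates, the sketch below chains the same three steps by hand and then through Pipeline, and checks that both give identical predictions. It uses scikit-learn's bundled load_wine data as a stand-in (an assumption made so the snippet runs on its own; it is not the wine-quality CSV used in this notebook), and max_iter=1000 is likewise an assumption added to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 178 samples, 13 features, 3 classes
X_demo, y_demo = load_wine(return_X_y=True)

# Manual chaining: every step is fitted and applied by hand
scaler = StandardScaler()
pca = PCA(n_components=8)
clf = LogisticRegression(random_state=42, max_iter=1000)
X_scaled = scaler.fit_transform(X_demo)
X_reduced = pca.fit_transform(X_scaled)
clf.fit(X_reduced, y_demo)
manual_pred = clf.predict(pca.transform(scaler.transform(X_demo)))

# The same steps expressed as a single Pipeline object
pipe_demo = Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=8)),
                      ('clf', LogisticRegression(random_state=42, max_iter=1000))])
pipe_demo.fit(X_demo, y_demo)
pipe_pred = pipe_demo.predict(X_demo)

# Both routes should agree on every prediction
same_predictions = np.array_equal(manual_pred, pipe_pred)
print('Identical predictions:', same_predictions)
```

The pipeline version has one fit call and one predict call, so there is no way to forget a transform or apply the steps in the wrong order.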

In [2]:
# import packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import tree
import numpy as np 
import pandas as pd 

import warnings 
warnings.filterwarnings("ignore") 

In [3]:
# load data
data = pd.read_csv("data/winedataset.csv", sep=";")

In [4]:
# show first five rows
data.head() 

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [6]:
# show the shape of the data  
data.shape 

(1599, 12)

In [7]:
# show the columns of the dataset
data.columns  

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [9]:
# show data information 
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [11]:
# check whether the data has missing values
data.isnull().sum() 

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [14]:
# split data into features and target 
X = data.drop(['quality'], axis=1)
Y = data['quality']

In [15]:
# split data into train and test set 
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.1, random_state=42, stratify=Y)
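
The stratify argument in the split above asks train_test_split to keep the class proportions roughly equal in the train and test sets. A minimal sketch of that effect follows, using the bundled load_wine data as a stand-in (an assumption, so the snippet is self-contained); class_shares is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): three classes with unequal counts
X_demo, y_demo = load_wine(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo)

def class_shares(labels, n_classes=3):
    """Return the fraction of samples that falls into each class."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

full_shares = class_shares(y_demo)
train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)

print('full :', np.round(full_shares, 3))
print('train:', np.round(train_shares, 3))
print('test :', np.round(test_shares, 3))
```

With the notebook's own data, the same comparison could be made with Y.value_counts(normalize=True) and y_test.value_counts(normalize=True).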


In [42]:
# Construct some pipelines
pipe_lr = Pipeline([('scl', StandardScaler()),
            ('pca', PCA(n_components=8)),
            ('clf', LogisticRegression(random_state=42))])

pipe_svm = Pipeline([('scl', StandardScaler()),
            ('pca', PCA(n_components=8)),
            ('clf', svm.SVC(random_state=42))])

pipe_dt = Pipeline([('scl', StandardScaler()),
            ('pca', PCA(n_components=8)),
            ('clf', tree.DecisionTreeClassifier(random_state=42))]) 

In [43]:
# List of pipelines for ease of iteration
pipelines = [pipe_lr, pipe_svm, pipe_dt]

In [44]:
# Dictionary of pipelines and classifier types for ease of reference
pipe_dict = {0: 'Logistic Regression', 1: 'Support Vector Machine', 2: 'Decision Tree'}


In [45]:
# Fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)
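
Once a pipeline is fitted, each step stays accessible by the name given in the constructor, which can be handy for checking what the transforms did. The sketch below reads the fitted PCA step and reports how much variance the 8 components retain. It uses the bundled load_wine data as a stand-in (an assumption), and max_iter=1000 is also an assumption added to avoid convergence warnings.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 13 numeric features
X_demo, y_demo = load_wine(return_X_y=True)

pipe_demo = Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=8)),
                      ('clf', LogisticRegression(random_state=42, max_iter=1000))])
pipe_demo.fit(X_demo, y_demo)

# Look up the fitted PCA step by its name
pca_step = pipe_demo.named_steps['pca']
retained = pca_step.explained_variance_ratio_.sum()

print('Components kept: {}'.format(pca_step.n_components_))
print('Share of variance retained: {:.3f}'.format(retained))
```

With the notebook's own objects, the equivalent lookup would be pipe_lr.named_steps['pca'].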


In [46]:
# Compare accuracies
for idx, val in enumerate(pipelines):
    print('{} pipeline test accuracy: {:.3f}'.format(pipe_dict[idx], val.score(X_test, y_test)))


Logistic Regression pipeline test accuracy: 0.606
Support Vector Machine pipeline test accuracy: 0.656
Decision Tree pipeline test accuracy: 0.700
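
The accuracies above come from a single held-out split, so the ranking can shift with a different random_state. One way to get a steadier comparison, sketched here under assumptions, is to pass each pipeline to cross_val_score. Because the whole pipeline is refitted in every fold, the scaler and PCA are fitted only on that fold's training part. The load_wine data is a stand-in (an assumption), and cv=5 and max_iter=1000 are illustrative choices, not settings from this notebook.

```python
from sklearn import svm, tree
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption)
X_demo, y_demo = load_wine(return_X_y=True)

def make_pipeline_for(estimator):
    """Wrap an estimator in the scaling + PCA steps used in this notebook."""
    return Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=8)),
                     ('clf', estimator)])

candidates = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Support Vector Machine': svm.SVC(random_state=42),
    'Decision Tree': tree.DecisionTreeClassifier(random_state=42),
}

cv_scores = {}
for name, estimator in candidates.items():
    # 5-fold stratified cross-validation; each fold refits the full pipeline
    cv_scores[name] = cross_val_score(make_pipeline_for(estimator),
                                      X_demo, y_demo, cv=5)
    print('{}: mean accuracy {:.3f} (std {:.3f})'.format(
        name, cv_scores[name].mean(), cv_scores[name].std()))
```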


In [47]:
# Identify the most accurate model on test data
best_acc = 0.0
best_clf = 0
best_pipe = None
for idx, pipe in enumerate(pipelines):
    # Score each pipeline once and keep the best one seen so far
    acc = pipe.score(X_test, y_test)
    if acc > best_acc:
        best_acc = acc
        best_pipe = pipe
        best_clf = idx
print('Classifier with best accuracy: {}'.format(pipe_dict[best_clf]))


Classifier with best accuracy: Decision Tree


In [48]:
# Save pipeline to file
joblib.dump(best_pipe, 'models/best_pipeline.pkl', compress=1)
print('Saved {} pipeline to file'.format(pipe_dict[best_clf]))

Saved Decision Tree pipeline to file
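
A saved pipeline can later be restored with joblib.load and used for prediction without refitting, since the scaler, PCA and classifier are stored together. The sketch below shows the round trip under assumptions: it uses the bundled load_wine data as a stand-in and writes to a temporary directory rather than models/best_pipeline.pkl so that it runs on its own.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption)
X_demo, y_demo = load_wine(return_X_y=True)

pipe_demo = Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=8)),
                      ('clf', tree.DecisionTreeClassifier(random_state=42))])
pipe_demo.fit(X_demo, y_demo)
original_pred = pipe_demo.predict(X_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a throwaway directory
    path = os.path.join(tmp_dir, 'demo_pipeline.pkl')
    joblib.dump(pipe_demo, path, compress=1)

    # Restore the fitted pipeline and predict straight away
    restored_pipe = joblib.load(path)
    restored_pred = restored_pipe.predict(X_demo)

round_trip_ok = np.array_equal(original_pred, restored_pred)
print('Restored pipeline reproduces predictions:', round_trip_ok)
```

Pickled pipelines are generally only safe to load with the same scikit-learn version that wrote them, and only from sources you trust.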
