### What is a Pipeline in scikit-learn?
* A Pipeline sequentially applies a list of transforms followed by a final estimator. Every intermediate step must implement the fit and transform methods; the final estimator only needs to implement fit.

* As the name suggests, the Pipeline class chains multiple processing steps into a single scikit-learn estimator. It has fit, predict and score methods just like any other estimator (e.g. LinearRegression).
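
To make the chaining idea concrete, here is a minimal pure-Python sketch of what a pipeline does when it is fitted and when it predicts. It is not scikit-learn's implementation: `MeanCenterer`, `ThresholdClassifier`, and `TinyPipeline` are hypothetical names invented for this illustration, and the data is made up.

```python
# Illustrative sketch only; NOT scikit-learn's actual Pipeline code.
# MeanCenterer, ThresholdClassifier and TinyPipeline are hypothetical names.

class MeanCenterer:
    """Transformer: learns the mean in fit, subtracts it in transform."""
    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class ThresholdClassifier:
    """Final estimator: predicts 1 for positive inputs, else 0."""
    def fit(self, X, y):
        return self  # nothing to learn in this toy example

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


class TinyPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, object) tuples

    def fit(self, X, y):
        # Fit and apply every intermediate transformer in order
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X, y).transform(X)
        # Fit the final estimator on the transformed data
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Reuse the already-fitted transformers; they are not refitted here
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


pipe = TinyPipeline([("center", MeanCenterer()), ("clf", ThresholdClassifier())])
pipe.fit([1.0, 2.0, 3.0, 6.0], [0, 0, 0, 1])
predictions = pipe.predict([2.0, 5.0])
print(predictions)  # prints [0, 1]
```

The point of the sketch is that the transformer's learned state (here the mean of the training data) is reused at prediction time rather than recomputed on the new data.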

In [1]:
# import packages
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 
import sklearn
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler  
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score, accuracy_score 
import warnings 
warnings.filterwarnings("ignore") 

In [29]:
# load data
data = pd.read_csv("data/winedataset.csv", sep=";")

In [30]:
#show the first five rows
data.head() 

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [31]:
# show columns 
data.columns 

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [32]:
# show data information 
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [33]:
# check the shape of the data
data.shape 

(1599, 12)

In [34]:
#check missing values 
data.isnull().sum() 

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

### Implementing the Pipeline

In [35]:
# separate features and target
X=data.drop(['quality'],axis=1)
Y=data['quality'] 

In [36]:
# create pipeline with steps 
steps = [('scaler', StandardScaler()), ('SVM', SVC())]


pipeline = Pipeline(steps) # define the pipeline object.

* Here the steps are a StandardScaler followed by a support vector machine classifier (SVC). The steps are passed as a list of tuples, each consisting of a name and an instance of a transformer or estimator.
* The last tuple in this list must be an estimator.
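
The names given in the tuples are more than labels: they let you reach an individual step and address its parameters. The short sketch below rebuilds the same two-step pipeline and uses `named_steps`, `set_params`, and `get_params`; the values 10 and 0.1 are arbitrary choices for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])

# Each step is reachable by the name given in its tuple
svm_step = pipe.named_steps["SVM"]

# Parameters of a step are addressed as <step name>__<parameter name>
pipe.set_params(SVM__C=10, SVM__gamma=0.1)

print(svm_step.C, svm_step.gamma)
print("SVM__C" in pipe.get_params())
```

This double-underscore naming is the same convention a parameter grid uses when the pipeline is tuned with a grid search.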

In [37]:
# split the dataset into train and test sets 
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=30)

In [38]:
# define parameter grid 
# keys follow the <step name>__<parameter name> convention; note that 10e5 equals 1,000,000
parameters = {'SVM__C':[0.001,0.1,10,100,10e5], 'SVM__gamma':[0.1,0.01]} 

In [39]:
# set five folds for cross validation 
grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
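
To get a feel for how much work this search involves, the sketch below counts the candidate settings with `ParameterGrid` from `sklearn.model_selection`. The grid is copied from the cell above; the variable names are chosen only for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above: 5 values of C times 2 values of gamma
param_grid = {"SVM__C": [0.001, 0.1, 10, 100, 10e5], "SVM__gamma": [0.1, 0.01]}

candidates = list(ParameterGrid(param_grid))
n_folds = 5
n_candidates = len(candidates)
total_fits = n_candidates * n_folds  # one fit per candidate per fold

print(n_candidates, total_fits)  # prints 10 50
```

With refit enabled (the default), the search should add one more fit of the best candidate on the whole training set after these 50 cross-validation fits.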

In [40]:
# fit the train dataset 
grid.fit(X_train,y_train) 



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('SVM', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'SVM__C': [0.001, 0.1, 10, 100, 1000000.0], 'SVM__gamma': [0.1, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [41]:
# show accuracy 
accuracy = grid.score(X_test,y_test)
print("Accuracy:{}".format(accuracy)) 

Accuracy:0.6625


In [42]:
# show best parameters 
print(grid.best_params_) 

{'SVM__C': 10, 'SVM__gamma': 0.1}


In [43]:
# find the f1 score 
be = grid.best_estimator_
y_pred = be.predict(X_test)
fscore = f1_score(y_test,y_pred, average="weighted")
print("F1 Score is: {:.3f}".format(fscore)) 

F1 Score is: 0.643
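
For readers unsure what the weighted average means here, the following pure-Python sketch computes a support-weighted F1 by hand on a tiny made-up example. The function `weighted_f1` and the labels are hypothetical and only mirror the definition (per-class F1, weighted by the number of true samples in each class); it is not scikit-learn's code.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative sketch)."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        # A class that is never predicted gets precision 0 by convention
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true
    return total / len(y_true)

# Made-up labels, loosely resembling wine quality grades
y_true = [5, 5, 5, 6, 6, 7]
y_pred = [5, 5, 6, 6, 6, 6]
score = weighted_f1(y_true, y_pred)
print(round(score, 3))  # prints 0.622
```

In this toy example class 7 is never predicted, so it contributes an F1 of 0 and pulls the weighted score down in proportion to its support.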


