# Pipeline Implementation

A machine learning pipeline automates a machine learning workflow.

It consists of several steps that train a model and, when rerun, help to improve the model's accuracy iteratively.

A pipeline consists of a **sequence of components, each of which performs a computation**. Data is passed through these components in order, and each one transforms the data before handing it to the next.
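To make the idea of chained components concrete, here is a minimal sketch in plain Python. The classes `CenterStep`, `ScaleStep` and `ToyPipeline` are hypothetical names invented for this illustration; they are not scikit-learn's implementation, only a toy version of the same pattern (each step learns from the training data, then transforms whatever passes through it).

```python
# Toy illustration only: these classes are hypothetical and are NOT scikit-learn code.

class CenterStep:
    """Subtracts the mean learned from the training values."""
    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


class ScaleStep:
    """Divides by the largest absolute value learned from the training values."""
    def fit(self, values):
        # Fall back to 1.0 so an all-zero input does not cause division by zero.
        self.max_abs_ = max(abs(v) for v in values) or 1.0
        return self

    def transform(self, values):
        return [v / self.max_abs_ for v in values]


class ToyPipeline:
    """Runs each step in order, feeding the output of one step into the next."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values


toy = ToyPipeline([CenterStep(), ScaleStep()])
toy.fit([1.0, 2.0, 3.0])                         # learn mean and scale from "training" data
toy_result = toy.transform([1.0, 2.0, 3.0, 4.0])  # apply the learned steps to new data
print(toy_result)  # [-1.0, 0.0, 1.0, 2.0]
```

Note that the statistics are learned only in `fit`, so new data is transformed with the values learned from the training data.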

A typical machine learning pipeline consists of the following processes:

- Data collection
- Data cleaning
- Feature extraction (labelling and dimensionality reduction)
- Model validation
- Visualisation

### Importing Libraries

In [11]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

### Loading and splitting dataset

In [10]:
# The data file has no header row, so we supply the column names ourselves
# (the names are taken from the dataset's source description).
# Create a Python list of column names called "colnames".

colnames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Load the file from the local directory using pd.read_csv
# and pass the "colnames" list as the column names.

pima_df = pd.read_csv("pima-indians-diabetes.data", names=colnames)


array = pima_df.values
# Select all rows and columns 0 to 6, i.e. the first 7 attributes.
# Note: the slice 0:7 stops before 'age'; use 0:8 to include all 8 attributes.
X = array[:,0:7]
Y = array[:,8]   # select all rows and the column at index 8 (the 9th column), the 0/1 diabetes class label
test_size = 0.30 # 70:30 split between training and test sets
seed = 7  # random seed, for repeatability of the results
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
type(X_train)

numpy.ndarray
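Positional slices such as `array[:,0:7]` are easy to get wrong by one column. As a hedged alternative, the features and label can be selected by column name. The sketch below uses a tiny made-up DataFrame (`demo_df` and its values are assumptions for illustration, not rows from the real file) so that it runs without the data file.

```python
import pandas as pd

colnames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Made-up rows for illustration only; they are not taken from the real dataset.
demo_df = pd.DataFrame(
    [[1, 100, 70, 20, 0, 25.0, 0.3, 30, 0],
     [4, 150, 80, 30, 100, 32.5, 0.6, 45, 1],
     [0, 120, 60, 25, 50, 28.0, 0.4, 22, 0]],
    columns=colnames,
)

X_demo = demo_df.drop(columns='class').values  # every column except the label
y_demo = demo_df['class'].values               # the label column only

print(X_demo.shape, y_demo.shape)  # (3, 8) (3,)
```

Selecting by name keeps all eight attributes regardless of their position in the file.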

### Implementing Pipeline

In [5]:
# Pipeline takes a list of (name, estimator) tuples. The last entry is the modelling algorithm.
pipeline = Pipeline([ ('scaler',StandardScaler()), ('clf', LogisticRegression()) ])

# Use the pipeline object as you would a regular classifier
pipeline.fit(X_train,y_train)

Pipeline(steps=[('scaler', StandardScaler()), ('clf', LogisticRegression())])

In [13]:
from sklearn import metrics

y_predict = pipeline.predict(X_test)
model_score = pipeline.score(X_test, y_test)
print('Score is',model_score)
print('Confusion Matrix',metrics.confusion_matrix(y_test, y_predict))
print('Classification Report',metrics.classification_report(y_test,y_predict))

Score is 0.7792207792207793
Confusion Matrix [[132  15]
 [ 36  48]]
Classification Report               precision    recall  f1-score   support

         0.0       0.79      0.90      0.84       147
         1.0       0.76      0.57      0.65        84

    accuracy                           0.78       231
   macro avg       0.77      0.73      0.75       231
weighted avg       0.78      0.78      0.77       231
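The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes the accuracy and the class `1.0` precision, recall and F1 from the four counts printed above, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
# Counts copied from the confusion matrix printed above
# (rows = actual class, columns = predicted class).
tn, fp = 132, 15   # actual 0: predicted 0, predicted 1
fn, tp = 36, 48    # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision_pos = tp / (tp + fp)               # of the predicted 1s, how many are really 1
recall_pos = tp / (tp + fn)                  # of the actual 1s, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(accuracy, 2), round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
# 0.78 0.76 0.57 0.65
```

The recall of 0.57 for class `1.0` shows that the model misses 36 of the 84 positive cases, which the overall accuracy alone does not reveal.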



Here we create a Pipeline object and fit the training data through it, instead of scaling the data ourselves and fitting the model object directly.
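To show what the pipeline does on our behalf, the following sketch compares it with the equivalent manual steps. It uses synthetic data from `make_classification` as a stand-in (an assumption, so that the example runs without the diabetes file); the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=7)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=7)

# Pipeline version: scaling and model fitting happen in one call.
pipe_demo = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipe_demo.fit(Xs_train, ys_train)
pipe_pred = pipe_demo.predict(Xs_test)

# Manual version: fit the scaler on the training data only,
# then apply the same fitted scaler to both training and test data.
scaler = StandardScaler().fit(Xs_train)
clf = LogisticRegression().fit(scaler.transform(Xs_train), ys_train)
manual_pred = clf.predict(scaler.transform(Xs_test))

print(np.array_equal(pipe_pred, manual_pred))

# Individual steps remain accessible by the names given in the tuples.
print(pipe_demo.named_steps['clf'].coef_.shape)
```

Both versions should give the same predictions; the pipeline simply removes the risk of forgetting to transform the test data or of fitting the scaler on it.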

# make_pipeline()

Creating a pipeline this way can be cumbersome, and specifying a name for each stage may not be necessary.

Alternatively, the **make_pipeline() function** creates the pipeline and names each step automatically, so we do not need to supply names.

a. from sklearn.pipeline import make_pipeline

b. pipe = make_pipeline(MinMaxScaler(), SVC())

c. print("Pipeline steps:\n{}".format(pipe.steps))

Here we have not given the stages any names. The names are assigned automatically and are the lowercased class names, with a numeric suffix added when the same class appears more than once.
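The automatic naming can be inspected through the `steps` attribute. The sketch below is illustrative: the second pipeline repeats `MinMaxScaler` purely to show how duplicate class names are disambiguated, and the value `C=10.0` is an arbitrary choice for demonstrating `set_params`.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# One estimator per class: names are the lowercased class names.
simple_pipe = make_pipeline(MinMaxScaler(), SVC())
simple_names = [name for name, _ in simple_pipe.steps]
print(simple_names)  # ['minmaxscaler', 'svc']

# Contrived case with a repeated class: a numeric suffix keeps the names unique.
dup_pipe = make_pipeline(MinMaxScaler(), MinMaxScaler(), SVC())
dup_names = [name for name, _ in dup_pipe.steps]
print(dup_names)  # ['minmaxscaler-1', 'minmaxscaler-2', 'svc']

# The generated names are what you use to address a step's parameters,
# in the form <step name>__<parameter name>.
simple_pipe.set_params(svc__C=10.0)
print(simple_pipe.named_steps['svc'].C)  # 10.0
```

Knowing the generated names matters mainly when a step's parameters need to be read or changed after the pipeline has been built.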

In [19]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

In [20]:
pipe = make_pipeline(MinMaxScaler(), SVC())
print("Pipeline steps:\n{}".format(pipe.steps))

Pipeline steps:
[('minmaxscaler', MinMaxScaler()), ('svc', SVC())]


In [21]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('minmaxscaler', MinMaxScaler()), ('svc', SVC())])

In [27]:
# print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))
print(pipe.score(X_test,y_test))

0.7532467532467533
