# Pipeline and make_pipeline

## Context

Diabetes is one of the most common diseases worldwide, and the number of diabetic patients continues to grow. Its main cause remains unknown, but scientists believe that both genetic factors and lifestyle play a major role.

Individuals with diabetes face a risk of developing secondary health issues such as heart disease and nerve damage. Early detection and treatment can therefore prevent complications and reduce the risk of severe health problems.
Although diabetes is incurable, it can be managed with treatment and medication.

Researchers at the Bio-Solutions lab want to better understand this disease among women and plan to use machine learning models to identify patients who are at risk of diabetes.

We will use the Pima Indians Diabetes dataset to see how to use Pipeline and make_pipeline.

## Data Dictionary

* Pregnancies (column Preg): Number of times pregnant
* Glucose (column Plas): Plasma glucose concentration 2 hours into an oral glucose tolerance test
* BloodPressure (column Pres): Diastolic blood pressure (mm Hg)
* SkinThickness (column skin): Triceps skinfold thickness (mm)
* Insulin (column test): 2-hour serum insulin (mu U/ml)
* BMI (column mass): Body mass index (weight in kg/(height in m)^2)
* Pedigree (column pedi): Diabetes pedigree function, which scores the likelihood of diabetes based on family history
* Age (column age): Age in years
* Class (column class): Class variable (0: the person is not diabetic, 1: the person is diabetic)





## Import libraries

In [1]:
# to work with dataframes
import pandas as pd
import numpy as np

# to split data into train and test
from sklearn.model_selection import train_test_split

# to build logistic regression model
from sklearn.linear_model import LogisticRegression

# to create k folds of data and get cross validation score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# to create pipeline and make_pipeline
from sklearn.pipeline import Pipeline, make_pipeline

# to use standard scaler
from sklearn.preprocessing import StandardScaler

# to ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Load and view dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df = pd.read_csv('/content/drive/MyDrive/Python Course/Model Tuning/Week 2 _ ML Pipeline and Hyperparameter Tuning/pima-indians-diabetes.csv')

df.head()

Unnamed: 0,Preg,Plas,Pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
# separating data into X and Y
X = df.drop(['class'], axis = 1)
Y = df['class']

In [5]:
# splitting data into train (70%) and test (30%) sets
# stratify=Y keeps the class proportions the same in both sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=1, stratify=Y)
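
The effect of the stratify argument can be checked on a small example. The sketch below does not use the diabetes data; it assumes a made-up label vector (200 samples, 35% positive) and hypothetical names (X_demo, y_demo) purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels (an assumption, not the diabetes data): 70 positives out of 200
y_demo = np.array([1] * 70 + [0] * 130)
X_demo = np.arange(200).reshape(-1, 1)  # placeholder single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

# With stratify, the share of positives in each split stays close to the original 35%
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without stratify, the two rates can drift apart, which matters most when one class is much rarer than the other.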

## Pipeline

In [6]:
# Pipeline takes a list of (name, estimator) tuples.
# Every step except the last must be a transformer; the last step is the final estimator (here, the model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# "scaler" is the name assigned to StandardScaler
# "clf" is the name assigned to LogisticRegression

In [7]:
# Any step of the pipeline can be accessed later by its assigned name
# (this fits the scaler on its own; pipeline.fit in the next cell refits it)
pipeline['scaler'].fit(X_train)
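
Besides indexing by name, a step can also be reached through named_steps or by position, and a slice returns a sub-pipeline. The following is a minimal sketch on randomly generated toy data; demo_pipeline, X_demo and y_demo are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Toy data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

demo_pipeline.fit(X_demo, y_demo)

# Three equivalent ways to reach the same fitted step
by_name = demo_pipeline['scaler']
by_attr = demo_pipeline.named_steps['scaler']
by_pos = demo_pipeline[0]

# Per-feature means learned by the scaler during fit
print(by_name.mean_)

# Slicing returns a new Pipeline holding only the selected steps,
# here everything except the final classifier
preprocessing_only = demo_pipeline[:-1]
X_scaled = preprocessing_only.transform(X_demo)
print(X_scaled.mean(axis=0))
```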

In [8]:
# the pipeline object can now be used like any other classifier:
# fit scales X_train with the scaler and then trains the model on the scaled data
pipeline.fit(X_train, Y_train)

In [13]:
# pipeline object's accuracy on the train set
pipeline.score(X_train, Y_train)

0.7839851024208566

In [14]:
# pipeline object's accuracy on the test set
pipeline.score(X_test, Y_test)

0.7619047619047619
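
KFold and cross_val_score are imported above, and a pipeline is a natural fit for them: because the scaler lives inside the pipeline, it is refit on the training folds only in each split, so no information from the held-out fold leaks into the scaling. The sketch below assumes synthetic data from make_classification and hypothetical names (cv_pipeline, X_demo, y_demo); it is not the notebook's own evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 features (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

cv_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# 5 shuffled folds; the whole pipeline (scaler + model) is cloned and refit per fold
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(cv_pipeline, X_demo, y_demo, cv=kfold, scoring='accuracy')

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

On the notebook's data, the same call could be made with X_train and Y_train in place of the synthetic arrays.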


## make_pipeline

- make_pipeline is a shorthand for constructing a Pipeline. We don't need to name each step ourselves: the names are generated automatically from the lowercased class names of the estimators.

In [9]:
# defining pipe using make_pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())

In [10]:
# make_pipeline assigned a name to each step automatically
pipe.steps

[('standardscaler', StandardScaler()),
 ('logisticregression', LogisticRegression())]
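
The automatic naming can be inspected directly. The sketch below is illustrative (demo_pipe and dup_pipe are hypothetical names) and also shows what happens when the same estimator class appears twice: a numeric suffix keeps the names unique.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Each step is named after its class, lowercased
demo_pipe = make_pipeline(StandardScaler(), MinMaxScaler(), LogisticRegression())
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)  # ['standardscaler', 'minmaxscaler', 'logisticregression']

# Repeated classes get a "-1", "-2", ... suffix
dup_pipe = make_pipeline(StandardScaler(), StandardScaler(), LogisticRegression())
dup_names = [name for name, _ in dup_pipe.steps]
print(dup_names)  # ['standardscaler-1', 'standardscaler-2', 'logisticregression']
```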

In [11]:
# now you can use the pipe object as a normal classifier
pipe.fit(X_train, Y_train)

In [12]:
# pipe object's accuracy on the train set
pipe.score(X_train, Y_train)

0.7839851024208566

In [15]:
# pipe object's accuracy on the test set
pipe.score(X_test, Y_test)

0.7619047619047619
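
Step names matter beyond readability: they are the prefix used to read or change a step's parameters, written as step name, double underscore, parameter name. The sketch below assumes hypothetical pipeline objects (auto_named, hand_named) and an arbitrary value of C chosen only for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# With make_pipeline, the prefix is the auto-generated step name
auto_named = make_pipeline(StandardScaler(), LogisticRegression())
auto_named.set_params(logisticregression__C=0.1)
auto_C = auto_named.get_params()['logisticregression__C']

# With Pipeline, the prefix is the name we chose ourselves
hand_named = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])
hand_named.set_params(clf__C=0.1)
hand_C = hand_named.get_params()['clf__C']

print(auto_C, hand_C)
```

The same double-underscore names are what a hyperparameter search over a pipeline would use as keys in its parameter grid, so shorter hand-picked names from Pipeline can be more convenient there.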