In [1]:
import pandas as pd
import numpy as np

This notebook was created using the tutorial videos from Data School. Video link: https://youtu.be/irHhDMbw3xo

Why should you use a Pipeline?

The point of a pipeline is to chain steps sequentially: we put the preprocessing steps and the model-building step into one pipeline.
A pipeline lets us properly cross-validate a process rather than just a model. `cross_val_score` takes a model and data, sklearn.model_selection.cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None,......). When the preprocessing is done outside of cross-validation, the scores can be unreliable, because the preprocessing has already seen the rows that are later used for validation.
A pipeline is useful here because it lets us cross-validate a process that includes both preprocessing and model building. We can also run a grid search or randomized search over a pipeline, which tunes the parameters of the model and of the preprocessing steps together.

Using a pipeline together with grid search, we can search over the preprocessing parameters in combination with the model parameters.
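A minimal sketch of searching over a preprocessing parameter and a model parameter at the same time. The toy dataframe is made up for illustration, and the grid values (`drop` for the encoder, `C` for the model) are assumptions chosen only to show the naming pattern `step__parameter`; the step names are the lowercased class names that `make_pipeline` and `make_column_transformer` generate.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up dataset, only for illustration (not the Titanic data)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male', 'female', 'male'],
    'Pclass': [3, 1, 3, 1, 2, 3, 1, 2],
    'Survived': [0, 1, 1, 0, 1, 0, 1, 0],
})
X_toy = toy[['Sex', 'Pclass']]
y_toy = toy['Survived']

# Preprocessing step followed by the model
pre = make_column_transformer((OneHotEncoder(), ['Sex']), remainder='passthrough')
pipe = make_pipeline(pre, LogisticRegression())

# Keys follow the pattern <pipeline step>__<nested step>__<parameter>
param_grid = {
    'columntransformer__onehotencoder__drop': [None, 'first'],  # preprocessing parameter
    'logisticregression__C': [0.1, 1, 10],                      # model parameter
}

grid = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
grid.fit(X_toy, y_toy)

print(grid.best_params_)
```

`pipe.get_params().keys()` lists every parameter name that can be used as a key in `param_grid`.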

When to use sklearn.compose.ColumnTransformer?

We use ColumnTransformer when the features in a dataframe need different preprocessing.

Use a column transformer to apply different preprocessing to different columns:

> Select columns by name from the dataframe

> passthrough or drop unspecified columns
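The two points above can be sketched on a small made-up dataframe (the data is an assumption for illustration). The columns are selected by name, and `remainder` decides what happens to the columns that are not listed.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows, only for illustration
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Pclass': [3, 1, 2],
})

# Columns are selected by name; Pclass is not listed, so `remainder` applies to it
keep_rest = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']), remainder='passthrough')
drop_rest = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']), remainder='drop')

kept = keep_rest.fit_transform(toy)      # one-hot columns plus the untouched Pclass
dropped = drop_rest.fit_transform(toy)   # one-hot columns only

print(kept.shape)
print(dropped.shape)
```

`remainder='drop'` is the default, so unlisted columns disappear unless `passthrough` is requested.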

How do you encode categorical features with OneHotEncoder?

Using sklearn.preprocessing.OneHotEncoder we can one-hot encode the categorical columns

`one_hot = OneHotEncoder()
encoded_feature = one_hot.fit_transform(feature_array)`

How do you apply OneHotEncoder to selected columns with ColumnTransformer?

`from sklearn.compose import make_column_transformer
col_trans = make_column_transformer((OneHotEncoder(), [col1, col2 ... ]),...)
transformed_features = col_trans.fit_transform(feature_array)`

How do you build and cross-validate a Pipeline?

`from sklearn.pipeline import make_pipeline
pipe = make_pipeline(data_preprocessing_step , modelling_step)`

We can pass this entire pipeline into cross_val_score to cross-validate a process that includes both preprocessing and modelling.

`cv_score = cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()`

After splitting the data, `cross_val_score` fits the whole pipeline on the training part of each fold and scores it on the held-out part.
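To make the per-fold behaviour concrete, here is a rough sketch that repeats by hand what `cross_val_score` does with a pipeline. The toy data and the choice of `StratifiedKFold(n_splits=2)` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up dataset, only for illustration
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male', 'female', 'male'],
    'Pclass': [3, 1, 3, 1, 2, 3, 1, 2],
    'Survived': [0, 1, 1, 0, 1, 0, 1, 0],
})
X_toy = toy[['Sex', 'Pclass']]
y_toy = toy['Survived']

pipe = make_pipeline(
    make_column_transformer((OneHotEncoder(), ['Sex']), remainder='passthrough'),
    LogisticRegression(),
)

cv = StratifiedKFold(n_splits=2)

manual_scores = []
for train_idx, test_idx in cv.split(X_toy, y_toy):
    fold_pipe = clone(pipe)  # fresh, unfitted copy for every fold
    # the encoder and the model are both fitted on the training rows only
    fold_pipe.fit(X_toy.iloc[train_idx], y_toy.iloc[train_idx])
    # the fitted encoder is then applied to the held-out rows before scoring
    manual_scores.append(fold_pipe.score(X_toy.iloc[test_idx], y_toy.iloc[test_idx]))

auto_scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring='accuracy')
print(np.allclose(manual_scores, auto_scores))
```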

How do you make predictions on new data using a Pipeline?

`pipe.fit(X, y)  
pipe.predict(X_test)`

Why should you use scikit-learn (rather than pandas) for preprocessing?

We could have used `pandas.get_dummies`, but scikit-learn's one-hot encoding combined with a pipeline is better.
Advantages

* We don't have to create a big dataframe:
one-hot encoding does not affect our dataframe.

* When new data comes in, we don't have to call `pandas.get_dummies` on it.
We would have problems if the test data had different categories than the training data, because `pandas.get_dummies` would not produce correctly shaped data.

* With a pipeline we can tune the preprocessing parameters as well as the model parameters.

* In some cases, preprocessing outside of scikit-learn makes `cross_val_score` less reliable.
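The shape problem with new data can be sketched as follows. The two small dataframes are made up, and `handle_unknown='ignore'` is an assumption added here (it is not used elsewhere in this notebook) so that an unseen category does not raise an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: the new batch contains only one of the three training categories
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
new = pd.DataFrame({'Embarked': ['S', 'S']})

# get_dummies builds its columns from whatever values it happens to see
dummies_train = pd.get_dummies(train)
dummies_new = pd.get_dummies(new)
print(dummies_train.shape, dummies_new.shape)  # the column counts differ

# A fitted OneHotEncoder remembers the training categories
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)
encoded_new = enc.transform(new)
print(encoded_new.shape)  # same number of columns as the training encoding

# A category never seen during fit is encoded as all zeros instead of failing
unseen = enc.transform(pd.DataFrame({'Embarked': ['X']})).toarray()
print(unseen)
```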

## Import the data

In [2]:
df = pd.read_csv('http://bit.ly/kaggletrain')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# let's find the shape of the dataset
df.shape

(891, 12)

In [4]:
# let's check the columns of the dataset
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [5]:
# let's check for missing values in the dataset
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
# let's create a dataframe with the columns 'Survived', 'Pclass', 'Sex', 'Embarked', dropping the rows where Embarked is missing

df_1 = df.loc[df.Embarked.notna(), ['Survived', 'Pclass', 'Sex', 'Embarked']]
df_1.head()

Unnamed: 0,Survived,Pclass,Sex,Embarked
0,0,3,male,S
1,1,1,female,C
2,1,3,female,S
3,1,1,female,S
4,0,3,male,S


In [7]:
# let's check the shape of the data
df_1.shape

(889, 4)

## Experiments 
### Cross-validate a model with one feature
Let's select `Pclass` as the feature because it is numeric.

In [8]:
# Feature and label
X = df_1.loc[:,['Pclass']]
y = df_1.Survived

In [9]:
# check the shape of features and labels
X.shape, y.shape

((889, 1), (889,))

#### Let's use logistic regression, a common go-to method for binary classification problems

In [10]:
from sklearn.linear_model import LogisticRegression


logreg = LogisticRegression()

In [11]:
from sklearn.model_selection import cross_val_score

cross_val_score(logreg, X, y, scoring='accuracy', cv=5).mean()

0.6783406335301212

In [12]:
# let's check the class distribution of the labels; the share of the majority class is the baseline accuracy to beat
y.value_counts(normalize=True)

0    0.617548
1    0.382452
Name: Survived, dtype: float64
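The majority-class share is the accuracy obtained by always predicting the most frequent label, which gives a baseline for the cross-validated score. A small sketch with `DummyClassifier`: the label counts (549 and 340) are reconstructed from the proportions shown for the 889 rows, so the array stands in for `y` and is an assumption rather than the real column.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Stand-in labels reconstructed from the proportions above (549 zeros, 340 ones)
y_demo = np.array([0] * 549 + [1] * 340)
# The features are ignored by this strategy, so a placeholder column is enough
X_demo = np.zeros((len(y_demo), 1))

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

null_accuracy = baseline.score(X_demo, y_demo)
print(round(null_accuracy, 6))
```

A model is only useful if its cross-validated accuracy is clearly above this value.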

## Encoding Categorical Variables

In [13]:
# we can use pandas.get_dummies for this purpose
pd.get_dummies(df_1).head()

Unnamed: 0,Survived,Pclass,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,3,0,1,0,0,1
1,1,1,1,0,1,0,0
2,1,3,1,0,0,0,1
3,1,1,1,0,0,0,1
4,0,3,0,1,0,0,1


In [14]:
pd.get_dummies(df_1).shape

(889, 7)

### Dummy encoding of categorical features using sklearn.preprocessing.OneHotEncoder

In [15]:
from sklearn.preprocessing import OneHotEncoder

# sparse=False returns a dense array; scikit-learn 1.2 renamed this argument to sparse_output
one_hot = OneHotEncoder(sparse=False)

In [16]:
one_hot.fit_transform(df_1[['Sex']])

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 1.]])

In [17]:
# let's find out the categories
one_hot.categories_

[array(['female', 'male'], dtype=object)]

In [18]:
one_hot.fit_transform(df_1[['Embarked']])

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [19]:
# let's find out the categories (refitting on Embarked replaced the ones learned from Sex)
one_hot.categories_

[array(['C', 'Q', 'S'], dtype=object)]

## Cross-validate a pipeline with all features

In [20]:
# create features and labels (y is reused from the single-feature experiment)
X = df_1.drop(columns=['Survived'])
X.shape

(889, 3)

In [21]:
y.shape

(889,)

In [22]:
# use ColumnTransformer when different features need different preprocessing
from sklearn.compose import ColumnTransformer

col_transformer = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['Sex','Embarked'])],
                                   remainder='passthrough')


In [23]:
# output of col_transformer
col_transformer.fit_transform(X)

array([[0., 1., 0., 0., 1., 3.],
       [1., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 1., 3.],
       ...,
       [1., 0., 0., 0., 1., 3.],
       [0., 1., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 3.]])

In [24]:
# chain the sequential steps together using a pipeline
from sklearn.pipeline import Pipeline


model_pipeline = Pipeline(steps=[('preprocess',col_transformer),('model',logreg)])

In [25]:

# cross-validate the entire process
# thus, preprocessing occurs within each fold of cross-validation

cross_val_score(model_pipeline, X, y, scoring='accuracy', cv=5).mean()

0.7727924839713071

## Make predictions on test data

`pandas.DataFrame.sample`

Return a random sample of items from an axis of object.

In [26]:
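# note: these rows are sampled from df_1, which is also used to fit the pipeline below,
# so they are not truly unseen data
test_data = df_1.sample(n=10)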

test_data.head()

Unnamed: 0,Survived,Pclass,Sex,Embarked
697,1,3,female,Q
536,0,1,male,S
197,0,3,male,S
777,1,3,female,S
418,0,2,male,S


In [27]:
X_test = test_data.drop(columns=['Survived'])
X_test.shape

(10, 3)

In [28]:
y_test=test_data['Survived']
y_test.shape

(10,)

In [29]:
# fit the pipeline (preprocessing + model) on all of the data

model_pipeline.fit(X,y)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehot', OneHotEncoder(),
                                                  ['Sex', 'Embarked'])])),
                ('model', LogisticRegression())])

In [30]:
y_preds = model_pipeline.predict(X_test)
y_preds

array([1, 0, 0, 1, 0, 0, 1, 0, 1, 1], dtype=int64)

In [31]:
pd.DataFrame(data={'true':y_test,'prediction':y_preds})

Unnamed: 0,true,prediction
697,1,1
536,0,0
197,0,0
777,1,1
418,0,0
321,0,0
255,1,1
410,0,0
816,0,1
452,0,1


In [32]:
# check the accuracy on the sampled rows (optimistic, because they were part of the training data)
model_pipeline.score(X_test, y_test)

0.8
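Because the ten sampled rows were also used for fitting, the score above is optimistic. One possible way to get an honest estimate is to hold rows out before fitting. The sketch below uses made-up data, and the split settings (`test_size=0.25`, `random_state=0`) are assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up dataset (8 distinct rows repeated 5 times), only for illustration
base = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male', 'female', 'male'],
    'Pclass': [3, 1, 3, 1, 2, 3, 1, 2],
    'Survived': [0, 1, 1, 0, 1, 0, 1, 0],
})
toy = pd.concat([base] * 5, ignore_index=True)
X_toy = toy[['Sex', 'Pclass']]
y_toy = toy['Survived']

# Hold out a quarter of the rows before any fitting happens
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0)

pipe = make_pipeline(
    make_column_transformer((OneHotEncoder(), ['Sex']), remainder='passthrough'),
    LogisticRegression(),
)

# Fit on the training rows only, then score on rows the pipeline has never seen
pipe.fit(X_train, y_train)
holdout_accuracy = pipe.score(X_holdout, y_holdout)
print(holdout_accuracy)
```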