![Image](https://drive.google.com/uc?export=view&id=10B8NecPfn9sXRescmijQ8Zc2CO08fQm7)

## Modeling Pipeline with SKLearn
### ACC Tech Challenge Series, Winter 2020
### Harper Xiang

Recall that a typical machine learning process usually includes some combination of the following steps:

- Data Cleaning
- EDA
- Standard Data Transformation (incl. standardization, PCA)
- Data Imputation
- Feature Transformation
- Feature Selection
- Modeling
- Evaluation

We have covered most of these topics in our training sessions, and you should have worked through some of them in detail in your courses.

In this session, we recap the modeling pipeline procedure using the machine learning package `sklearn`.


In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data/train.csv')

In [3]:
data.set_index('PassengerId', inplace=True)

In [4]:
data.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |


In [5]:
# Use this code to check how many NAs are in each column

data.isna().sum(axis=0)

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

Preprocessing with pipelines is important because it lets you preprocess any new data that comes in easily and reproducibly, with less room for error. It is therefore good practice to use pipelines.

We are going to build a simple pipeline to impute missing data and transform numeric and categorical values separately.

In [18]:
# Drop some columns we are not gonna use for simplicity

data.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

It is good practice to first separate the numeric and categorical columns and store each group in a list. Use `df.select_dtypes(include=[dtypes...]).columns` to get the numeric and categorical columns quickly, without much manual labor.

In [20]:
# Make features and target in separate dataframes

y = data.Survived
X = data.iloc[:, 1:].copy()  # copy so the Pclass cast below does not write to a slice of data

In [26]:
# Transform this column to object because it is a categorical column

X['Pclass'] = X['Pclass'].astype('object')

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
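
To make the column split concrete, here is a minimal sketch on a made-up four-column frame (the name `toy` and its values are assumptions, not the Titanic data). It uses `include='number'` and `exclude='number'` rather than listing dtype names, which is one way to avoid missing a numeric dtype such as `int32`.

```python
import pandas as pd

# Toy frame standing in for the feature table; the values are made up.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "Fare": [7.25, 71.28, 7.93],
})

# Pclass is stored as integers but is really a category, so cast it first.
toy["Pclass"] = toy["Pclass"].astype("object")

# Split the column names by dtype family.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['Age', 'Fare']
print(categorical_cols)  # ['Pclass', 'Sex']
```

If the cast to `object` were skipped, `Pclass` would land in the numeric list and be scaled instead of one-hot encoded.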

In [28]:
# make training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
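
The split above is random, so the exact rows in each set change between runs. A minimal sketch, assuming you want a repeatable split: pass `random_state`, and optionally `stratify` to keep the class balance similar in both sets. The arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, 12 negatives and 8 positives.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 12 + [1] * 8)

# Same seed, same split: calling twice returns identical partitions.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_te1.shape)                    # (4, 2)
print(np.array_equal(X_te1, X_te2))   # True
```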

In [29]:
# Make the pipelines for numeric and categorical columns

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
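
To see what each of these two pipelines produces, here is a small sketch that rebuilds them under different names (`num_pipe`, `cat_pipe`) and runs them on made-up single-column frames. The frames `ages` and `ports` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Made-up inputs, each with one missing value.
ages = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
ports = pd.DataFrame({"Embarked": ["S", np.nan, "C"]}, dtype=object)

# The missing age becomes the median (30), which scales to exactly 0.
scaled = num_pipe.fit_transform(ages)

# The missing port becomes its own category, 'missing'.
encoded = cat_pipe.fit_transform(ports).toarray()
categories = list(cat_pipe.named_steps["onehot"].categories_[0])

print(scaled.ravel())
print(categories)   # ['C', 'S', 'missing']
print(encoded)
```

Filling categorical gaps with a constant keeps "was missing" as information the model can use, instead of guessing a port.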

The `ColumnTransformer` object consolidates the entire preprocessing process in one place, so you can pass all feature columns to it at once, which reduces the chance of error.

In [30]:
# Make the column transformer

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
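
As a rough illustration of how a `ColumnTransformer` routes columns, the sketch below applies a scaler to one column and an encoder to another on a made-up frame. The frame `toy`, the transformer `ct`, and the `Note` column are hypothetical. With the default `remainder='drop'`, columns that are not listed are left out of the output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame; 'Note' is not listed in any transformer below.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Sex": ["male", "female", "male", "female"],
    "Note": ["a", "b", "c", "d"],
})

ct = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

out = ct.fit_transform(toy)

# One scaled Fare column plus two one-hot Sex columns; 'Note' is dropped.
print(out.shape)  # (4, 3)
```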

The `ColumnTransformer` object can itself be included in another pipeline, for example together with a model.

In [31]:
from sklearn.ensemble import RandomForestClassifier

rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

Note that you can make a single pipeline object out of multiple pipeline objects, as you can see above.

Now all you need to do is fit the `rf` pipeline on your data; the data will be transformed and then modeled in one step.

In [32]:
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)



Note that I did not have to transform `X_test` explicitly; all the transformation steps are applied for me inside `Pipeline.predict()`.
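
A small sketch of what that means in practice: calling `predict` on a fitted pipeline should give the same result as transforming the new data with the fitted preprocessing step and then calling the model by hand. The data here is randomly generated, and the names `pipe`, `X_tr`, and `X_new` are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric data; the label depends on the first feature.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(60, 3))
y_tr = (X_tr[:, 0] > 0).astype(int)
X_new = rng.normal(size=(10, 3))

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(X_tr, y_tr)

# Manual route: reuse the fitted steps one at a time.
scaler = pipe.named_steps["scaler"]
clf = pipe.named_steps["classifier"]
manual_pred = clf.predict(scaler.transform(X_new))

# Pipeline route: one call does both steps.
pipe_pred = pipe.predict(X_new)

print(np.array_equal(manual_pred, pipe_pred))  # True
```

The scaler uses the mean and variance learned from the training data, not from `X_new`, which is the behaviour you want for unseen data.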

In [35]:
from sklearn.metrics import confusion_matrix, accuracy_score

print(accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

0.7988826815642458


array([[98, 18],
       [18, 45]])
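
The accuracy can be checked by hand from the confusion matrix: correct predictions sit on the diagonal. The sketch below copies the matrix printed above and also derives precision and recall for the positive class, which are not computed in the original cell. It relies on sklearn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[98, 18],
               [18, 45]])

# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # (98 + 45) / 179
precision = tp / (tp + fp)        # 45 / 63
recall = tp / (tp + fn)           # 45 / 63

print(round(accuracy, 3))   # 0.799
print(round(precision, 3))  # 0.714
print(round(recall, 3))     # 0.714
```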

You can also save pipeline objects for later use with the `pickle` package. This step is important for reproducibility.

In [36]:
# save object

import pickle

# the with-block closes the file even if dump() raises
with open("rf.pkl", "wb") as f:
    pickle.dump(rf, f)

In [37]:
# load object

with open("rf.pkl", "rb") as f:
    rf_restored = pickle.load(f)

In [38]:
# The restored pipeline object gives reproducible results.

y_pred_restored = rf_restored.predict(X_test)

print(accuracy_score(y_test, y_pred_restored))
confusion_matrix(y_test, y_pred_restored)

0.7988826815642458


array([[98, 18],
       [18, 45]])
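
The same round trip can be sketched end to end on made-up data. This version writes to a temporary directory so it leaves no file behind; the names `model`, `restored`, and `model.pkl` are assumptions for the example.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data for a small fitted pipeline.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])
model.fit(X_demo, y_demo)

# Save and reload; the with-blocks close the file handles.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object carries the fitted state, so predictions match.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```

Two cautions apply to pickled models in general: load them with the same sklearn version that saved them, and only unpickle files from sources you trust, since unpickling can execute arbitrary code.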

To summarize, sklearn pipelines are easy to use and make data processing more reproducible and less finicky, so use them often.

Basic syntax: 

```
pipeline_object = Pipeline(steps=[('step_name_1', pipeline_object_1), ('step_name_2', pipeline_object_2), ...])
```

You can replace the pipeline objects with other sklearn objects, or with any object that has `fit`/`transform`/`predict` methods.
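
As a minimal sketch of that last point, the class below is a hypothetical transformer (`LogTransform` is not part of sklearn) that implements only `fit` and `transform`, and it slots into a `Pipeline` like any built-in step. `LogisticRegression` is used here as a stand-in final step, and the data is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class LogTransform(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: applies log(1 + x) to every column."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained.
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Made-up fares and labels.
X_demo = np.array([[7.25], [71.28], [8.05], [53.10], [8.46], [51.86]])
y_demo = np.array([0, 1, 0, 1, 0, 1])

pipe = Pipeline(steps=[
    ("log", LogTransform()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X_demo, y_demo)
preds = pipe.predict(X_demo)

print(preds.shape)  # (6,)
```

Intermediate steps need `fit` and `transform`; only the final step needs `predict`.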