# Data Science Quick Tip #003: Using Scikit-Learn Pipelines!
In this notebook, I'll show you how to build a pipeline that ends up as a single serialized binary file, which keeps inference clean. The goal is NOT to build an accurate model here, so don't worry if your accuracy scores are mediocre. This project focuses only on Scikit-Learn's default transformers. In the next quick tip post, I'll show you how to create custom transformers and use them within this same pipeline format.

## Project Setup
Let's go ahead and import the libraries we'll be using, along with the training dataset.

In [1]:
# Importing the libraries we'll be using for this project
import pandas as pd
import joblib

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

In [2]:
# Importing the training dataset
raw_train = pd.read_csv('data/train.csv')

In [3]:
# Splitting the training data into appropriate training and validation sets
X = raw_train.drop(columns = ['Survived'])
# Selecting the target as a 1-D Series, the shape scikit-learn estimators expect for y
y = raw_train['Survived']

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 42)
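
As a quick aside, here is a minimal sketch of what that split does when no `test_size` is given. The toy DataFrame and its `feature` column are made up purely for illustration. By default, `train_test_split` holds out 25% of the rows, and a target selected with single brackets stays a 1-D Series.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'feature': range(8),
    'Survived': [0, 1] * 4,
})

X_toy = toy.drop(columns=['Survived'])
y_toy = toy['Survived']  # single brackets give a 1-D Series

# No test_size given, so the default of 0.25 applies
X_toy_train, X_toy_val, y_toy_train, y_toy_val = train_test_split(
    X_toy, y_toy, random_state=42
)

print(len(X_toy_train), len(X_toy_val))
print(y_toy_train.ndim)
```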

In [4]:
# Viewing first few rows of X_train dataset
X_train.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 298 | 299 | 1 | Saalfeld, Mr. Adolphe | male | NaN | 0 | 0 | 19988 | 30.5 | C106 | S |
| 884 | 885 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.05 | NaN | S |
| 247 | 248 | 2 | Hamalainen, Mrs. William (Anna) | female | 24.0 | 0 | 2 | 250649 | 14.5 | NaN | S |
| 478 | 479 | 3 | Karlsson, Mr. Nils August | male | 22.0 | 0 | 0 | 350060 | 7.5208 | NaN | S |
| 305 | 306 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |


## Creating Our Pipeline
With our data imported, we're ready to start creating our pipeline. As mentioned above, we'll only be using the default transformers here, so don't expect great results from the model's predictions. But that's okay! The purpose here is to learn how to use a pipeline.

Note: When you get to the next cell, you might wonder why we're creating a column transformer for a single column. That's because in the next post, we'll be adding custom transformers using mostly the same code you'll see below (with a few additions!).

In [5]:
# Creating a preprocessor to transform the 'Sex' column
data_preprocessor = ColumnTransformer(transformers = [
    ('sex_transformer', OneHotEncoder(), ['Sex'])
])
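
To make it concrete what this preprocessor hands to the next step, here is a small sketch on a toy DataFrame (the rows are hypothetical, not taken from the Titanic data). It shows two things: `OneHotEncoder` turns `Sex` into one column per category, and every column not listed in the transformer is dropped, because `remainder` defaults to `'drop'`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values, for illustration only)
toy_passengers = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, 26.0],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('sex_transformer', OneHotEncoder(), ['Sex'])
])

encoded = toy_preprocessor.fit_transform(toy_passengers)

# Convert to a dense array if a sparse matrix comes back, so it prints cleanly
encoded_dense = encoded.toarray() if hasattr(encoded, 'toarray') else encoded

# Categories are sorted, so the two columns are [female, male];
# 'Age' is gone because it was not listed in the transformer
print(encoded_dense)
```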

In [6]:
# Creating our pipeline that first preprocesses the data, then scales it, then fits a RandomForestClassifier to it
rfc_pipeline = Pipeline(steps = [
    ('data_preprocessing', data_preprocessor),
    ('data_scaling', StandardScaler()),
    ('model', RandomForestClassifier(max_depth = 10,
                                     min_samples_leaf = 3,
                                     min_samples_split = 4,
                                     n_estimators = 200))
])
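
If you want to peek inside a fitted pipeline, each step is reachable by the name you gave it. The sketch below rebuilds the same three-step layout on toy data (hypothetical rows, and a smaller forest with a fixed `random_state` just to keep it quick and repeatable) and then looks up the fitted scaler through `named_steps`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values, for illustration only)
toy_X = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male', 'female', 'male']})
toy_y = pd.Series([0, 1, 1, 0, 1, 0])

toy_pipeline = Pipeline(steps=[
    ('data_preprocessing', ColumnTransformer(transformers=[
        ('sex_transformer', OneHotEncoder(), ['Sex'])
    ])),
    ('data_scaling', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=10, random_state=0)),
])

# One fit call runs fit_transform on each transformer in order,
# then fits the final estimator on the transformed data
toy_pipeline.fit(toy_X, toy_y)

# Steps keep the names they were given
step_names = list(toy_pipeline.named_steps)
print(step_names)

# The scaler learned its statistics from the one-hot encoded columns
fitted_scaler = toy_pipeline.named_steps['data_scaling']
print(fitted_scaler.mean_)
```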

In [7]:
# Fitting our pipeline to the training data
rfc_pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('data_preprocessing',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('sex_transformer',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  ['Sex'])],
                                   verbose=False)),
                ('data_scaling',
                 StandardScaler(copy=True,...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
    

In [8]:
# Saving our pipeline to a binary pickle file (the 'model/' directory must already exist)
joblib.dump(rfc_pipeline, 'model/rfc_pipeline.pkl')

['model/rfc_pipeline.pkl']

In [9]:
# Loading back in our serialized model
loaded_model = joblib.load('model/rfc_pipeline.pkl')
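
Here is why the single file is handy at inference time. This is a rough, self-contained sketch on toy data: the rows, the temporary directory, and the `toy_pipeline.pkl` file name are all made up for illustration. The reloaded object takes raw, untransformed rows and runs the encoding, scaling, and prediction itself.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (hypothetical values, for illustration only)
toy_X = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
})
toy_y = pd.Series([0, 1, 1, 0, 1, 0])

toy_pipeline = Pipeline(steps=[
    ('data_preprocessing', ColumnTransformer(transformers=[
        ('sex_transformer', OneHotEncoder(), ['Sex'])
    ])),
    ('data_scaling', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(toy_X, toy_y)

# Round trip through a temporary file (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_pipeline.pkl')
    joblib.dump(toy_pipeline, toy_path)
    reloaded_pipeline = joblib.load(toy_path)

# New raw rows with the same columns as the training data
new_passengers = pd.DataFrame({
    'Sex': ['female', 'male'],
    'Age': [30.0, 40.0],
})

# No manual preprocessing needed: the pipeline handles every step
reloaded_preds = reloaded_pipeline.predict(new_passengers)
original_preds = toy_pipeline.predict(new_passengers)
print(reloaded_preds)
```

One caveat: a pickled pipeline is best loaded with the same scikit-learn version that saved it.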

In [10]:
# Checking out our predicted results using the validation dataset
pipeline_preds = loaded_model.predict(X_val)

val_accuracy = accuracy_score(y_val, pipeline_preds)
val_roc_auc = roc_auc_score(y_val, pipeline_preds)
val_confusion_matrix = confusion_matrix(y_val, pipeline_preds)

print(f'Accuracy Score: {val_accuracy}')
print(f'ROC AUC Score: {val_roc_auc}')
print(f'Confusion Matrix: \n{val_confusion_matrix}')

Accuracy Score: 0.7847533632286996
ROC AUC Score: 0.7718430320308569
Confusion Matrix: 
[[112  22]
 [ 26  63]]
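
One side note on the ROC AUC number: the cell above passes hard 0/1 predictions to `roc_auc_score`. The metric is more commonly computed from predicted probabilities, which lets it rank examples across every threshold. Below is a minimal sketch on synthetic data (not the Titanic set, so the numbers will not match the ones above) comparing the two approaches.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only so this sketch runs on its own
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)
X_demo_train, X_demo_val, y_demo_train, y_demo_val = train_test_split(
    X_demo, y_demo, random_state=42
)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(X_demo_train, y_demo_train)

# Hard labels collapse every prediction to 0 or 1
hard_labels = demo_model.predict(X_demo_val)

# Column 1 of predict_proba is the probability of the positive class
positive_proba = demo_model.predict_proba(X_demo_val)[:, 1]

auc_from_labels = roc_auc_score(y_demo_val, hard_labels)
auc_from_proba = roc_auc_score(y_demo_val, positive_proba)

print(f'ROC AUC from hard labels: {auc_from_labels}')
print(f'ROC AUC from probabilities: {auc_from_proba}')
```

With the pipeline from this notebook, the equivalent call would be `loaded_model.predict_proba(X_val)[:, 1]`, since a `Pipeline` exposes `predict_proba` whenever its final estimator does.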
