# 6.4 - Preprocessing and Pipelining

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Oftentimes, the data we get is not in a proper format for machine learning algorithms.
The titanic dataset has missing values.

In [3]:
titanic = sns.load_dataset('titanic')

# examine a summary of the dataset
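One way to examine a summary is sketched below. It uses a small hand-made frame (`sample`, a hypothetical stand-in and not the real data) so that it runs without downloading anything; the same two calls could be applied to `titanic`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few titanic columns, with some values missing.
sample = pd.DataFrame({
    'age': [22.0, np.nan, 26.0],
    'sex': ['male', 'female', 'female'],
    'deck': [None, 'C', None],
})

# info() prints column dtypes and non-null counts.
sample.info()

# isnull().sum() counts the missing values in each column.
missing_counts = sample.isnull().sum()
print(missing_counts)
```

Columns whose non-null count is lower than the number of rows are the ones that will need imputation later.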



It also has categorical features encoded as strings.

And some algorithms, like support vector machines and principal component analysis, require the data to be normalized ahead of time.

# Ordinal features
Notice how **class**, while not strictly a continuous numerical feature, refers to categories that have an intrinsic ordered relationship. Other examples of ordinal features:
- Freshman, sophomore, junior, senior ...
- Republican, Independent, Democrat
- Strongly disagree, disagree, no opinion, agree, strongly agree

In [None]:
# define function convert_class

In [4]:
# convert the class column
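A minimal sketch of what the two cells above might contain. It assumes the class labels are the strings 'First', 'Second' and 'Third' (an assumption about the dataset's labels) and runs on a small hypothetical frame called `sample` rather than the real data.

```python
import pandas as pd

def convert_class(label):
    """Map an ordinal class label to an integer that preserves its order."""
    # Assumed labels; adjust the mapping if the column uses different names.
    mapping = {'First': 1, 'Second': 2, 'Third': 3}
    return mapping[label]

# Hypothetical stand-in for the class column.
sample = pd.DataFrame({'class': ['Third', 'First', 'Third', 'Second']})

# Apply the conversion to every value in the column.
sample['class'] = sample['class'].map(convert_class)
print(sample['class'].tolist())
```

If the real column is stored as a pandas categorical, the mapped result may need a final `.astype(int)` to become a plain integer column.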


# Categorical features

In contrast to ordinal features, the possible values of categorical features have no ordered relationship. For example, what is the average of *cat* and *dog*?

In [9]:
titanic.select_dtypes(exclude = [float, bool, int]).head()

      sex embarked  class    who deck  embark_town alive
0    male        S      3    man  NaN  Southampton    no
1  female        C      1  woman    C    Cherbourg   yes
2  female        S      3  woman  NaN  Southampton   yes
3  female        S      1  woman    C  Southampton   yes
4    male        S      3    man  NaN  Southampton    no


Gender is easy. We encode this as a binary variable.

If a feature with two categories can be encoded with one binary column, then a feature with $n$ categories can be encoded with $n-1$ binary columns. Encoding categories as binary indicator columns is called one-hot encoding (the $n-1$ column variant is also known as dummy encoding). We use the get_dummies() function from pandas.

S    644
C    168
Q     77
Name: embarked, dtype: int64
     C    Q    S
0  0.0  0.0  1.0
1  1.0  0.0  0.0
2  0.0  0.0  1.0
3  0.0  0.0  1.0
4  0.0  0.0  1.0


Without setting drop_first = True, a feature with $n$ categories gets encoded as $n$ binary columns. But as we discussed before, one of the columns will end up being redundant.
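The encodings described in this section can be sketched as follows. The frame `sample` is a hypothetical three-row stand-in for the *sex* and *embarked* columns, and coding male as 1 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in for two categorical columns.
sample = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'embarked': ['S', 'C', 'Q'],
})

# A two-category feature needs only one binary column.
sample['sex'] = (sample['sex'] == 'male').astype(int)

# Default behaviour: one column per category (n columns).
full = pd.get_dummies(sample['embarked'])

# drop_first = True removes the first category, leaving n - 1 columns.
reduced = pd.get_dummies(sample['embarked'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first = True`, a row whose remaining columns are all zero belongs to the dropped category, so no information is lost.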

Finally, we drop *alive* because it is just *survived*, the target feature, written as yes/no. Keeping it would be an example of **data leakage**. If ever your results seem too good to be true, make sure this isn't the cause.

# Scikit-learn transformers
Scikit-learn has two very important basic classes. You've already seen the **estimator** classes which allow you to *fit* and *predict* on datasets. The other basic class is a **transformer**, which provides *fit* and *transform* functionality. The genius of the API is that you use it much the same way as an estimator.

Why does a transformer need a fit method? Whereas the previous preprocessing operations merely changed the format of the data, the following ones change its content. That is, they calculate something from the available data (a column mean, for example) and then use that calculation to change the data. For this reason, we cannot simply fit and transform on our entire dataset. We must fit the transformer on the training data, which saves the calculated values, and then transform both the training and the test data using those saved values.

# Missing value imputation

In [5]:
# import SimpleImputer, a transformer class (it replaced the older Imputer class as of scikit-learn 0.20)


# initialize an imputer object


# fit the imputer on training data


# alternatively, you could use X_train = imp.fit_transform(X_train)


# transform test data using the imputer. DO NOT fit on the test data.
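The steps listed in this cell could look roughly like the sketch below. It uses `SimpleImputer` from `sklearn.impute` with a mean strategy and two tiny hypothetical arrays (`X_train_demo`, `X_test_demo`) instead of the titanic split; the strategy is an illustrative choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test data with missing values.
X_train_demo = np.array([[1.0, 10.0],
                         [np.nan, 20.0],
                         [3.0, np.nan]])
X_test_demo = np.array([[np.nan, np.nan]])

# Initialize the imputer; missing entries will be replaced by the column mean.
imp = SimpleImputer(strategy='mean')

# Fit on the training data only: this stores the training column means.
imp.fit(X_train_demo)
X_train_filled = imp.transform(X_train_demo)

# Transform the test data with the stored training means. No fitting here.
X_test_filled = imp.transform(X_test_demo)

print(imp.statistics_)
print(X_test_filled)
```

The test row is filled with the means learned from the training rows, which is the behaviour the warning in the last comment is meant to protect.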



Now we can use a model on all titanic data and not just the numeric data like before. Let's use a logistic regression.

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

lr = LogisticRegression()

lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)

print('Training ROC:', roc_auc_score(y_train, y_train_pred))

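# cross_val_score reports accuracy for classifiers unless a scoring argument such as scoring = 'roc_auc' is passed
print('Validation accuracy:', cross_val_score(estimator = lr, X = X_train, y = y_train, cv = 5).mean())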

y_test_pred = lr.predict(X_test)

print('Test ROC:', roc_auc_score(y_test, y_test_pred))


NameError: name 'X_train' is not defined

# Feature scaling
Some algorithms require that features be normalized beforehand. For this, we use a transformer called StandardScaler. Its interface is much the same as what we've already seen.
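A minimal sketch of the StandardScaler interface, using two small hypothetical arrays (`X_train_demo`, `X_test_demo`) in place of the titanic split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X_train_demo = np.array([[1.0, 100.0],
                         [2.0, 200.0],
                         [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler = StandardScaler()

# Learn each column's mean and standard deviation from the training data,
# then rescale the training data to zero mean and unit variance.
X_train_scaled = scaler.fit_transform(X_train_demo)

# Reuse the training statistics on the test data. No fitting here.
X_test_scaled = scaler.transform(X_test_demo)

print(scaler.mean_)
print(X_test_scaled)
```

The scaled arrays would then be passed to the model in place of the raw features.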

In [49]:
from sklearn.svm import SVC

svm = SVC(C = 0.5, gamma = 0.2)

svm.fit(X_train, y_train)
y_train_pred = svm.predict(X_train)

print('Training ROC:', roc_auc_score(y_train, y_train_pred))

# cross_val_score reports accuracy for classifiers unless a scoring argument such as scoring = 'roc_auc' is passed
print('Validation accuracy:', cross_val_score(estimator = svm, X = X_train, y = y_train, cv = 5).mean())

y_test_pred = svm.predict(X_test)

print('Test ROC:', roc_auc_score(y_test, y_test_pred))

Training ROC: 0.806828582636
Validation accuracy: 0.816949593496
Test ROC: 0.800411885849


# Pipelining
In machine learning, whatever transformations we apply to the training data must also be applied to the test data. However, the transformations must be fit to the training data independently of the test data. This can get cumbersome, so scikit-learn offers a Pipeline object that combines transformers and predictors. This will save your code from getting messy and also save you from mistakes (hopefully). Just remember this about a pipeline: **you can put into it as many transformers as you want, but if you add an estimator, you may only add one at the end of the pipeline**.

Let's bring back our imperfect titanic dataset. We want to perform missing value imputation, feature scaling, and prediction using an SVM.

In [59]:
from sklearn.model_selection import train_test_split

X = titanic.drop('survived', axis = 'columns')
y = titanic.survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y)

In [8]:
# import Pipeline class


# initialize objects we want to put in the pipeline


# define the steps we want our pipeline to take. here we add transformers and an optional estimator at the end.
# these must always take the form of a list of tuples.
# the names you give in the first position of each tuple are not important yet.


# initialize pipeline with the list of steps


# treat pipeline as an estimator
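One way the cell above might be filled in, as a sketch under stated assumptions. It runs on synthetic data (`X_demo`, `y_demo`, generated here and not the titanic features), and the step names, the median strategy and the split are illustrative choices. The SVC settings are the ones used earlier in this section.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 60 rows, 3 numeric features, binary target.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo[::7, 1] = np.nan  # knock out some values to mimic missing data

X_train_demo, X_test_demo = X_demo[:40], X_demo[40:]
y_train_demo, y_test_demo = y_demo[:40], y_demo[40:]

# Steps are a list of (name, object) tuples: transformers first, estimator last.
steps = [
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('svm', SVC(C=0.5, gamma=0.2)),
]

pipe = Pipeline(steps)

# fit() fits each transformer on the training data, then fits the estimator.
pipe.fit(X_train_demo, y_train_demo)

# predict() applies the already fitted transformers before predicting.
test_pred = pipe.predict(X_test_demo)
print(test_pred)
```

Because the imputer and scaler are only ever fit inside `pipe.fit`, the test rows never influence the statistics they store.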

