# 6.4 - Preprocessing and Pipelining

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Oftentimes, the data we get is not in a proper format for machine learning algorithms.
The titanic dataset has missing values.

In [4]:
titanic = sns.load_dataset('titanic')
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.2+ KB


It also has categorical features encoded as strings.

In [5]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


And some algorithms, like support vector machines and principal component analysis, require the data to be normalized ahead of time.

# Ordinal features
Notice how **class**, while not strictly a continuous numerical feature, refers to categories that have an intrinsic ordered relationship. Other examples of ordinal features:
- Freshman, sophomore, junior, senior ...
- Republican, Independent, Democrat
- Strongly disagree, disagree, no opinion, agree, strongly agree
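
One way to encode an ordinal feature is a plain lookup table from label to rank, applied with the pandas Series.map method. The sketch below uses made-up survey responses, and `likert_order` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Made-up survey responses (illustrative data only)
responses = pd.Series(['agree', 'no opinion', 'strongly disagree', 'agree'])

# Hypothetical lookup table: the integers preserve the order of the categories
likert_order = {
    'strongly disagree': 1,
    'disagree': 2,
    'no opinion': 3,
    'agree': 4,
    'strongly agree': 5,
}

# Series.map replaces each label with its rank; labels missing from the table become NaN
encoded = responses.map(likert_order)
print(encoded.tolist())
```

A dictionary keeps the ordering in one visible place, which can be easier to audit than a chain of if/elif branches.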

In [6]:
def convert_class(in_string):
    if in_string == 'First':
        return 1
    elif in_string == 'Second':
        return 2
    elif in_string == 'Third':
        return 3

In [7]:
titanic['class'] = titanic['class'].apply(convert_class)

In [8]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,3,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,1,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,3,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,1,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,3,man,True,,Southampton,no,True


# Categorical features

In contrast to ordinal features, the possible values of categorical features have no ordered relationship. For example, what is the average of *cat* and *dog*?

In [9]:
titanic.select_dtypes(exclude = [float, bool, int]).head()

Unnamed: 0,sex,embarked,class,who,deck,embark_town,alive
0,male,S,3,man,,Southampton,no
1,female,C,1,woman,C,Cherbourg,yes
2,female,S,3,woman,,Southampton,yes
3,female,S,1,woman,C,Southampton,yes
4,male,S,3,man,,Southampton,no


Gender is easy. We encode this as a binary variable.

In [10]:
print(titanic.sex.value_counts())
titanic['sex'] = titanic['sex'].apply(lambda x: 1 if x == 'male' else 0)
print(titanic.sex.value_counts())

male      577
female    314
Name: sex, dtype: int64
1    577
0    314
Name: sex, dtype: int64


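If a feature with two categories can be encoded with one binary column, then a feature with $n$ categories can be encoded with $n-1$ binary columns, because the omitted category is implied whenever all the other columns are 0. Encoding categories as binary columns is called one-hot encoding (the $n-1$ column variant is often called dummy encoding). We use the get_dummies() function from pandas.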

In [11]:
print(titanic.embarked.value_counts())
print(pd.get_dummies(titanic.embarked).head())

S    644
C    168
Q     77
Name: embarked, dtype: int64
     C    Q    S
0  0.0  0.0  1.0
1  1.0  0.0  0.0
2  0.0  0.0  1.0
3  0.0  0.0  1.0
4  0.0  0.0  1.0


Without setting drop_first = True, a feature with $n$ categories gets encoded as $n$ binary columns. But as we discussed before, one of the columns will end up being redundant.
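
To see why one column is redundant, here is a minimal sketch on a toy Series (illustrative values, not the full dataset). It compares the two encodings and rebuilds the dropped column from the remaining ones.

```python
import pandas as pd

# Toy port-of-embarkation values (illustrative only)
ports = pd.Series(['S', 'C', 'Q', 'S'])

full = pd.get_dummies(ports).astype(int)                      # n = 3 columns: C, Q, S
reduced = pd.get_dummies(ports, drop_first=True).astype(int)  # n - 1 = 2 columns: Q, S

# The dropped column is recoverable: C is 1 exactly when Q and S are both 0
recovered_c = 1 - reduced.sum(axis='columns')
print((recovered_c == full['C']).all())
```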

In [12]:
embarked_onehot = pd.get_dummies(titanic.embarked, drop_first=True)
embark_town_onehot = pd.get_dummies(titanic.embark_town, drop_first=True)
# deck is dropped below without being encoded (it is missing for most passengers)
who_onehot = pd.get_dummies(titanic.who, drop_first=True)

In [13]:
titanic = pd.concat([titanic, embarked_onehot, embark_town_onehot, who_onehot], axis = 'columns')
titanic = titanic.drop(['embarked','embark_town','deck','who'], axis = 'columns')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,class,adult_male,alive,alone,Q,S,Queenstown,Southampton,man,woman
0,0,3,1,22.0,1,0,7.25,3,True,no,False,0.0,1.0,0.0,1.0,1.0,0.0
1,1,1,0,38.0,1,0,71.2833,1,False,yes,False,0.0,0.0,0.0,0.0,0.0,1.0
2,1,3,0,26.0,0,0,7.925,3,False,yes,True,0.0,1.0,0.0,1.0,0.0,1.0
3,1,1,0,35.0,1,0,53.1,1,False,yes,False,0.0,1.0,0.0,1.0,0.0,1.0
4,0,3,1,35.0,0,0,8.05,3,True,no,True,0.0,1.0,0.0,1.0,1.0,0.0


Finally, we drop *alive*, which is just *survived*, the target feature, written as yes/no. Keeping it would hand the model the answer. This is an example of **data leakage**. If ever your results seem too good to be true, make sure this isn't the cause.
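
A rough way to spot this kind of leak is to check whether each value of a feature always comes with the same target value. The sketch below does this on a toy frame with illustrative values, not the real rows. It is only a heuristic: a unique identifier column would pass the same check without being informative.

```python
import pandas as pd

# Toy frame mimicking the two columns (illustrative values, not the real rows)
toy = pd.DataFrame({
    'survived': [0, 1, 1, 0, 1],
    'alive': ['no', 'yes', 'yes', 'no', 'yes'],
})

# For each value of the feature, count how many distinct target values occur.
# If every feature value maps to exactly one target value, the feature
# determines the target on this data and deserves a closer look.
targets_per_value = toy.groupby('alive')['survived'].nunique()
is_leak = bool((targets_per_value == 1).all())
print(is_leak)
```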

In [14]:
titanic = titanic.drop('alive', axis = 'columns')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,class,adult_male,alone,Q,S,Queenstown,Southampton,man,woman
0,0,3,1,22.0,1,0,7.25,3,True,False,0.0,1.0,0.0,1.0,1.0,0.0
1,1,1,0,38.0,1,0,71.2833,1,False,False,0.0,0.0,0.0,0.0,0.0,1.0
2,1,3,0,26.0,0,0,7.925,3,False,True,0.0,1.0,0.0,1.0,0.0,1.0
3,1,1,0,35.0,1,0,53.1,1,False,False,0.0,1.0,0.0,1.0,0.0,1.0
4,0,3,1,35.0,0,0,8.05,3,True,True,0.0,1.0,0.0,1.0,1.0,0.0


# Scikit-learn transformers
Scikit-learn has two very important basic classes. You've already seen the **estimator** classes which allow you to *fit* and *predict* on datasets. The other basic class is a **transformer**, which provides *fit* and *transform* functionality. The genius of the API is that you use it much the same way as an estimator.

Why does a transformer need a fit method? Whereas the previous preprocessing operations merely changed the format of the data, the following change the content of the data. That is, the following transformations perform a calculation based on the available data and then use this calculation to change the data. For this reason, we cannot simply transform on our entire dataset. We must perform a transformation on the training data, save the calculations, and then transform the test data using training calculations.
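
The fit/transform split can be sketched with a hand-written transformer. `MeanCenterer` below is a hypothetical class written for illustration, not part of scikit-learn. It learns column means in fit and reuses them in transform, so the test data is adjusted with the training calculation.

```python
import numpy as np

class MeanCenterer:
    """Hypothetical minimal transformer: subtracts the column means learned in fit."""

    def fit(self, X):
        # The "calculation" is learned from whatever data is passed here
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Reuse the stored means; nothing is recomputed from X
        return np.asarray(X, dtype=float) - self.means_

train = np.array([[1.0], [2.0], [3.0]])   # toy training column, mean 2.0
test = np.array([[10.0]])                 # toy test column

centerer = MeanCenterer().fit(train)
print(centerer.transform(test))           # centered with the training mean
```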

In [18]:
from sklearn.model_selection import train_test_split

X = titanic.drop('survived', axis = 'columns')
y = titanic.survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y)

# Missing value imputation

In [20]:
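# import the imputer, a transformer class
# (the Imputer class was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer from sklearn.impute replaces it and is aliased here to keep the name)
from sklearn.impute import SimpleImputer as Imputer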

# initialize an imputer object
imp = Imputer(strategy='mean')

# fit the imputer on training data
imp.fit(X_train)
X_train = imp.transform(X_train)

# alternatively, you could use X_train = imp.fit_transform(X_train)
print(X_train)

# transform test data using the imputer. DO NOT fit on the test data.
X_test = imp.transform(X_test)

[[  3.          1.          2.        ...,   1.          0.          0.       ]
 [  3.          0.         20.        ...,   1.          0.          1.       ]
 [  3.          0.         28.9926506 ...,   1.          0.          1.       ]
 ..., 
 [  2.          1.         30.        ...,   1.          1.          0.       ]
 [  2.          0.         29.        ...,   1.          0.          1.       ]
 [  3.          0.         45.        ...,   1.          0.          1.       ]]
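
To make the stored calculation visible, here is a small sketch on toy single-column data. It assumes a scikit-learn version that provides sklearn.impute.SimpleImputer (0.20 or later). The fill value learned from the training data is kept in the statistics_ attribute and reused on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-column data with missing entries (illustrative only)
train = np.array([[20.0], [np.nan], [40.0]])
test = np.array([[np.nan], [100.0]])

toy_imp = SimpleImputer(strategy='mean')
toy_imp.fit(train)                 # learns the mean of the observed training values

print(toy_imp.statistics_)         # the stored fill value per column
print(toy_imp.transform(test))     # the test NaN is filled with the training mean
```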


Now we can use a model on all titanic data and not just the numeric data like before. Let's use a logistic regression.

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20

lr = LogisticRegression()

lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)

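# note: roc_auc_score is applied to hard 0/1 predictions here, not to predicted probabilities
print('Training ROC:', roc_auc_score(y_train, y_train_pred))

# note: without a scoring argument, cross_val_score reports mean accuracy rather than ROC AUC
print('Validation ROC:', cross_val_score(estimator = lr, X = X_train, y = y_train, cv = 5).mean())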

y_test_pred = lr.predict(X_test)

print('Test ROC:', roc_auc_score(y_test, y_test_pred))


Training ROC: 0.802028852859
Validation ROC: 0.799245528455
Test ROC: 0.830773756987
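
As a hedged aside on the metric itself: ROC AUC is usually computed from continuous scores such as predicted probabilities, and cross_val_score only reports it when asked through the scoring argument. The sketch below shows both on synthetic stand-in data from make_classification. The data and variable names are assumptions for illustration, and the numbers will not match the titanic results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (not the titanic set)
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

clf = LogisticRegression().fit(X_tr, y_tr)

# AUC from predicted probabilities of the positive class rather than hard labels
test_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Without `scoring`, cross_val_score uses the estimator's default score (accuracy)
cv_accuracy = cross_val_score(clf, X_tr, y_tr, cv=5).mean()
cv_auc = cross_val_score(clf, X_tr, y_tr, cv=5, scoring='roc_auc').mean()

print(test_auc, cv_accuracy, cv_auc)
```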


# Feature scaling
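Some algorithms require that features be on a comparable scale beforehand. For this, we use a transformer called StandardScaler, which rescales each feature to zero mean and unit variance using statistics learned from the training data. Its interface is much the same as what we've already seen.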
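
Here is a minimal sketch of what the scaler learns, on toy single-column data (illustrative values only). The fitted mean_ and scale_ attributes hold the training mean and standard deviation, and both training and test data are rescaled with them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # toy training column
test = np.array([[2.0], [5.0]])           # toy test column

toy_ss = StandardScaler().fit(train)
print(toy_ss.mean_, toy_ss.scale_)        # statistics learned from the training data

train_scaled = toy_ss.transform(train)    # zero mean and unit variance by construction
test_scaled = toy_ss.transform(test)      # rescaled with the training statistics
print(test_scaled)
```

Note that the scaled test data need not have zero mean or unit variance, since it is measured against the training statistics.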

In [31]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

print(X_test)

[[  8.46128909e-01   7.25920646e-01   1.19754394e-16 ...,   6.17555441e-01
    8.13224036e-01  -6.54903873e-01]
 [ -1.53372802e+00   7.25920646e-01   7.02536200e-01 ...,   6.17555441e-01
    8.13224036e-01  -6.54903873e-01]
 [  8.46128909e-01   7.25920646e-01   1.19754394e-16 ...,   6.17555441e-01
    8.13224036e-01  -6.54903873e-01]
 ..., 
 [  8.46128909e-01  -1.37756104e+00   1.19754394e-16 ...,  -1.61928781e+00
   -1.22967344e+00   1.52694165e+00]
 [ -1.53372802e+00   7.25920646e-01   3.97836343e+00 ...,   6.17555441e-01
    8.13224036e-01  -6.54903873e-01]
 [ -3.43799557e-01  -1.37756104e+00  -1.24736096e+00 ...,   6.17555441e-01
   -1.22967344e+00  -6.54903873e-01]]


In [49]:
from sklearn.svm import SVC

svm = SVC(C = 0.5, gamma = 0.2)

svm.fit(X_train, y_train)
y_train_pred = svm.predict(X_train)

print('Training ROC:', roc_auc_score(y_train, y_train_pred))

# note: cross_val_score reports accuracy here, since no scoring argument is given
print('Validation ROC:', cross_val_score(estimator = svm, X = X_train, y = y_train, cv = 5).mean())

y_test_pred = svm.predict(X_test)

print('Test ROC:', roc_auc_score(y_test, y_test_pred))

Training ROC: 0.806828582636
Validation ROC: 0.816949593496
Test ROC: 0.800411885849


# Pipelining
In machine learning, whatever transformations we apply to the training data must also be applied to the test data. However, the transformations must be fit on the training data alone, independently of the test data. This can get cumbersome, so scikit-learn offers a Pipeline object that chains transformers and a final estimator. This keeps your code from getting messy and also saves you from mistakes (hopefully). Just remember this about a pipeline: **you can put as many transformers into it as you want, but if you add an estimator, you may only add one, and it must be the last step**.

Let's bring back our imperfect titanic dataset. We want to perform missing value imputation, feature scaling, and prediction using an SVM.

In [59]:
X = titanic.drop('survived', axis = 'columns')
y = titanic.survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y)

In [61]:
# import Pipeline class
from sklearn.pipeline import Pipeline

# initialize objects we want to put in the pipeline
imp = Imputer(strategy = 'median')
ss = StandardScaler()
svm = SVC()

# define the steps we want our pipeline to take: any number of transformers, then an optional estimator at the end.
# the steps must always take the form of a list of (name, object) tuples.
# the names you give in the first position of each tuple are not important yet.
steps = [('imputer', imp), ('scaler', ss), ('svc', svm)]

# initialize pipeline with the list of steps
pipeline = Pipeline(steps = steps)

# treat pipeline as an estimator
pipeline.fit(X_train, y_train)
y_train_pred = pipeline.predict(X_train)

print('Training ROC:', roc_auc_score(y_train, y_train_pred))

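# each fold refits the imputer and scaler on that fold's training part;
# the reported score is accuracy, since no scoring argument is given
print('Validation ROC:', cross_val_score(estimator = pipeline, X = X_train, y = y_train, cv = 5).mean())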

y_test_pred = pipeline.predict(X_test)

print('Test ROC:', roc_auc_score(y_test, y_test_pred))

Training ROC: 0.80789639993
Validation ROC: 0.829827642276
Test ROC: 0.799794057076
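
One place the step names do matter is when setting parameters of a step from outside the pipeline. A minimal sketch follows, using a smaller hypothetical pipeline (`toy_pipeline`) and parameter values borrowed from the earlier SVC cell. Parameters are addressed as the step name, two underscores, and the parameter name.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical two-step pipeline for illustration
toy_pipeline = Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])

# <step name>__<parameter name> addresses a parameter of one step
toy_pipeline.set_params(svc__C=0.5, svc__gamma=0.2)

print(toy_pipeline.named_steps['svc'].C)
print(toy_pipeline.get_params()['svc__gamma'])
```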
