# The Standard Workflow

In this chapter, we review the basics of a supervised learning workflow: model fitting, tuning and selection, feature engineering and selection, and data splitting techniques. We will see how these steps depend on each other, and how each of them can contribute to, or fight against, overfitting: the data scientist's worst enemy. By the end of the chapter, we will be fluent in supervised learning and ready to move on to the more advanced material in later chapters.

In [22]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

## Supervised learning pipelines

We are tasked with predicting whether each applicant in a new cohort of loan applicants is likely to default on their loan. We have a historical dataset and wish to train a classifier on it. Many of its features are stored as strings, which scikit-learn classifiers cannot consume directly. We therefore encode the string columns numerically using `LabelEncoder()`.

In [6]:
credit = pd.read_csv('data/credit.csv')

# Inspect the data types of the columns of the data frame
print(credit.dtypes)
# Inspect the first few rows of the data using head()
credit.head(3)

checking_status           object
duration                   int64
credit_history            object
purpose                   object
credit_amount              int64
savings_status            object
employment                object
installment_commitment     int64
personal_status           object
other_parties             object
residence_since            int64
property_magnitude        object
age                        int64
other_payment_plans       object
housing                   object
existing_credits           int64
job                       object
num_dependents             int64
own_telephone             object
foreign_worker            object
class                     object
dtype: object


| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | '<0' | 6 | 'critical/other existing credit' | buy_radio_tv | 1169 | 'no known savings' | '>=7' | 4 | 'male single' | none | ... | 'real estate' | 67 | none | own | 2 | skilled | 1 | yes | yes | good |
| 1 | '0<=X<200' | 48 | 'existing paid' | buy_radio_tv | 5951 | '<100' | '1<=X<4' | 2 | 'female div/dep/mar' | none | ... | 'real estate' | 22 | none | own | 1 | skilled | 1 | none | yes | bad |
| 2 | 'no checking' | 12 | 'critical/other existing credit' | education | 2096 | '<100' | '4<=X<7' | 2 | 'male single' | none | ... | 'real estate' | 49 | none | own | 1 | 'unskilled resident' | 2 | none | yes | good |


In [7]:
non_numeric_columns =   ['checking_status',
                         'credit_history',
                         'purpose',
                         'savings_status',
                         'employment',
                         'personal_status',
                         'other_parties',
                         'property_magnitude',
                         'other_payment_plans',
                         'housing',
                         'job',
                         'own_telephone',
                         'foreign_worker']

# Create a label encoder for each column. Encode the values
for column in non_numeric_columns:
    le = LabelEncoder()
    credit[column] = le.fit_transform(credit[column])

# Inspect the data types of the columns of the data frame
print(credit.dtypes)

checking_status            int32
duration                   int64
credit_history             int32
purpose                    int32
credit_amount              int64
savings_status             int32
employment                 int32
installment_commitment     int64
personal_status            int32
other_parties              int32
residence_since            int64
property_magnitude         int32
age                        int64
other_payment_plans        int32
housing                    int32
existing_credits           int64
job                        int32
num_dependents             int64
own_telephone              int32
foreign_worker             int32
class                     object
dtype: object
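
To see what `LabelEncoder()` does to a single column, here is a minimal sketch on a toy series. The values are illustrative and not taken from `credit.csv`. The point to notice is that the integer codes follow the alphabetical order of the labels, so they imply an ordering that may carry no meaning.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (illustrative values, not read from credit.csv)
housing = pd.Series(['own', 'rent', 'for free', 'own', 'rent'])

le = LabelEncoder()
codes = le.fit_transform(housing)

# classes_ holds the sorted unique labels; each code is an index into it
print(le.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original strings
restored = le.inverse_transform(codes)
print(list(restored))
```

Tree-based models are usually tolerant of such arbitrary codes, while distance-based models such as k-nearest neighbors may be more sensitive to them.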


With a fully numeric feature matrix, we can now use pretty much any classifier we like! Let's try a couple out.

Our colleague has used `AdaBoostClassifier` on the credit scoring dataset. We also want to try a random forest classifier. In this exercise, we will fit this classifier to the data and compare it to `AdaBoostClassifier`. We will split the data into train and test sets so that accuracy is measured on rows the model has not seen, which keeps an overfitted model from looking better than it is.

In [13]:
# AdaBoost accuracy from our colleague, taken as given (not recomputed here)
accuracies = {'ab': 0.75}

X, y = credit.drop('class', axis=1), credit['class']

# Split the data into train and test, with 20% as test
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=1)

# Create a random forest classifier, fixing the seed to 2
rf_model = RandomForestClassifier(random_state=2).fit(
  X_train, y_train)

# Use it to predict the labels of the test data
rf_predictions = rf_model.predict(X_test)

# Assess the accuracy of both classifiers
accuracies['rf'] = accuracy_score(y_test, rf_predictions)

print(accuracies)

{'ab': 0.75, 'rf': 0.775}
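
The value of holding out a test set shows up when we compare accuracy on the training rows with accuracy on the held-out rows. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the credit data. The sample size, noise level and the `_demo` names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (assumed: 1000 rows, noisy labels)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

model = RandomForestClassifier(random_state=2).fit(X_tr, y_tr)

# Accuracy on rows the model was fitted on versus rows it has never seen
train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two numbers is the usual symptom of overfitting; the test accuracy is the one to report.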


We have just built our first pipeline. Did you wonder whether there are any additional parameters that we could tune to make AdaBoost even better? The answer is yes! Let's explore that in the next section on tuning parameters.

## Model complexity and overfitting

Most classifiers have one or more hyperparameters that control their complexity. We can tune them using `GridSearchCV()`. In this exercise, we will perfect this skill. We will experiment with:

- The number of trees, `n_estimators`, in a `RandomForestClassifier`.
- The number of weak learners, `n_estimators`, in an `AdaBoostClassifier`.
- The number of nearest neighbors, `n_neighbors`, in `KNeighborsClassifier`.

In [17]:
# Set a range for n_estimators from 10 to 40 in steps of 10
param_grid = {'n_estimators': range(10, 50, 10)}

# Optimize for a RandomForestClassifier() using GridSearchCV
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
grid.fit(X, y)
grid.best_params_

{'n_estimators': 40}

In [18]:
# Define a grid for n_estimators ranging from 1 to 10
param_grid = {'n_estimators': range(1, 11)}

# Optimize for an AdaBoostClassifier() using GridSearchCV
grid = GridSearchCV(AdaBoostClassifier(), param_grid, cv=3)
grid.fit(X, y)
grid.best_params_

{'n_estimators': 10}

In [19]:
# Define a grid for n_neighbors with values 10, 50 and 100
param_grid = {'n_neighbors': [10, 50, 100]}

# Optimize for KNeighborsClassifier() using GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
grid.fit(X, y)
grid.best_params_

{'n_neighbors': 50}
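
`best_params_` only reports the winner. To judge how close the other candidates came, we can inspect `cv_results_` and `best_score_` on the fitted search. The sketch below does this on a synthetic dataset; the data and the `_demo` names are assumptions for illustration, not the credit data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumed sizes, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, random_state=0)

demo_grid = GridSearchCV(
    KNeighborsClassifier(), {'n_neighbors': [10, 50, 100]}, cv=3)
demo_grid.fit(X_demo, y_demo)

# One row per candidate value, with its mean and spread across the folds
results = pd.DataFrame(demo_grid.cv_results_)[
    ['param_n_neighbors', 'mean_test_score',
     'std_test_score', 'rank_test_score']]
print(results)

# Best candidate and its mean cross-validated accuracy
print(demo_grid.best_params_, demo_grid.best_score_)
```

If the mean scores differ by less than their standard deviations, the choice between candidates is not well determined by the data.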

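We now know how to deal with the extremely important issue of model complexity. Note that for both ensembles the selected value (40 and 10) is the largest one in its grid, which suggests that a wider range would be worth searching.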

## Feature engineering and overfitting

Our colleague has converted the columns in the credit dataset to numeric values using `LabelEncoder()`, but left one out: `credit_history`, which records the credit history of the applicant. We want to create two versions of the dataset for comparison purposes: one that encodes `credit_history` with `LabelEncoder()` and another that uses one-hot encoding.

In [21]:
credit = pd.read_csv('data/credit.csv')

# Create numeric encoding for credit_history
credit_history_num = LabelEncoder().fit_transform(credit['credit_history'])

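# X from the earlier cell may already hold a label-encoded credit_history;
# drop it so that each version below encodes the column exactly once
X_base = X.drop(columns='credit_history', errors='ignore')

# Create a new feature matrix including the numeric encoding
X_num = pd.concat(
    [X_base, pd.Series(credit_history_num, index=X_base.index,
                       name='credit_history')], axis=1)

# Create new feature matrix with dummies for credit_history
X_hot = pd.concat(
    [X_base, pd.get_dummies(credit['credit_history'],
                            prefix='credit_history')], axis=1)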

# Compare the number of features of the resulting DataFrames
print(X_hot.shape[1] > X_num.shape[1])

True
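
The difference in width comes from how each encoding represents a categorical column: label encoding produces one integer column, while one-hot encoding produces one indicator column per distinct category. A minimal sketch on a toy frame (the rows and category names are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with one numeric and one categorical column (illustrative values)
toy = pd.DataFrame({
    'amount': [1000, 6000, 2000, 8000],
    'credit_history': ['critical', 'existing paid', 'critical', 'delayed'],
})

# Label encoding: the categorical column becomes a single integer column
label_version = toy[['amount']].copy()
label_version['credit_history'] = LabelEncoder().fit_transform(
    toy['credit_history'])

# One-hot encoding: one indicator column per distinct category
dummies = pd.get_dummies(toy['credit_history'], prefix='credit_history')
hot_version = pd.concat([toy[['amount']], dummies], axis=1)

print(label_version.shape, hot_version.shape)
print(list(hot_version.columns))
```

More columns give the model more freedom to fit each category separately, which also means more room to overfit when some categories are rare.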


We are discussing the credit dataset with the bank manager. She suggests that the safest loan applications tend to request mid-range credit amounts. Values that are either too low or too high suggest high risk. This means that a non-linear relationship might exist between this variable and the class. We want to test this hypothesis. We will construct a non-linear transformation of the feature. Then, we will assess which of the two features is better at predicting the class using `SelectKBest()` and the `chi2()` metric.

In [23]:
# Function computing absolute difference from column mean
def abs_diff(x):
    return np.abs(x-np.mean(x))

# Apply it to the credit amount and store to new column
credit['diff'] = abs_diff(credit['credit_amount'])

# Create a feature selector with chi2 that picks one feature
sk = SelectKBest(chi2, k=1)

# Use the selector to pick between credit_amount and diff
sk.fit(credit[['diff','credit_amount']], credit['class'])

# Inspect the results
sk.get_support()

array([False,  True])
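
`get_support()` only returns the mask. The fitted selector also exposes the underlying scores through `scores_`, which can be labeled with the column names. The sketch below uses synthetic data in which the class is deliberately driven by distance from the mean; the data-generating rule and all names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Hypothetical amounts; by construction, extreme amounts are labeled 'bad'
amount = rng.uniform(0, 10000, size=500)
distance = np.abs(amount - amount.mean())
label = np.where(distance > 2500, 'bad', 'good')

features = pd.DataFrame({'diff': distance, 'credit_amount': amount})

selector = SelectKBest(chi2, k=1).fit(features, label)

# Label the raw chi2 scores and the selected column with feature names
scores = pd.Series(selector.scores_, index=features.columns)
selected = features.columns[selector.get_support()].tolist()

print(scores)
print(selected)
```

One caveat: `chi2` treats feature values like counts, so its score depends on the scale of each feature. Comparing features that live on very different scales should be done with some care.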

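The selector keeps `credit_amount` and drops `diff`, so by the chi-squared score the transformed feature does not beat the original one, and this test does not support the bank manager's hypothesis. We now have one more tool at our disposal to decide which features are worth introducing to our dataset.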

We just joined an arrhythmia detection startup and want to train a model on the arrhythmias dataset `arrh`. We noticed that random forests tend to win quite a few Kaggle competitions, so we want to try that out with a maximum depth of 2, 5, or 10, using grid search. We also observe that the dimension of the dataset is quite high so we wish to consider the effect of a feature selection method.

In [26]:
arrh = pd.read_csv('data/arrh.csv')

X, y = arrh.drop('class', axis=1), arrh['class']
# Split the data into train and test, with 20% as test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Find the best value for max_depth among values 2, 5 and 10
grid_search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid={'max_depth': [2,5,10]})
best_value = grid_search.fit(X_train, y_train).best_params_['max_depth']

# Using the best value from above, fit a random forest
clf = RandomForestClassifier(random_state=1, max_depth=best_value).fit(X_train, y_train)

# Apply SelectKBest with chi2 and pick top 100 features
# (chi2 requires non-negative inputs, hence the absolute values)
vt = SelectKBest(chi2, k=100).fit(abs(X_train), y_train)

# Create a new dataset only containing the selected features
X_train_reduced = vt.transform(X_train)
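
To evaluate a model on the reduced features, the selector fitted on the training data has to be applied unchanged to the test data. Refitting it on the test set would leak information and could pick different columns. The sketch below shows the pattern on synthetic data. The dataset, the fixed `max_depth=5` and the `_demo` names are assumptions, not results from `arrh`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

# Synthetic high-dimensional stand-in for the arrhythmia data (assumed sizes)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=200, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

# Fit the selector on the training rows only; chi2 needs non-negative inputs
selector = SelectKBest(chi2, k=100).fit(np.abs(X_tr), y_tr)

# transform() only picks columns, so the same 100 columns are kept in both sets
X_tr_reduced = selector.transform(X_tr)
X_te_reduced = selector.transform(X_te)

clf_demo = RandomForestClassifier(
    random_state=1, max_depth=5).fit(X_tr_reduced, y_tr)
test_accuracy = clf_demo.score(X_te_reduced, y_te)
print(test_accuracy)
```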

We are already able to handle hundreds of features in a few lines of code! But what if the optimal maximum depth is different once we first apply feature selection? In Chapter 3 we will learn how to put our pipelines on steroids so that such questions can be asked in just one line of code.