# Model Lifecycle Management

In the previous chapter, we explored different ways of incorporating expert feedback into our workflow and of evaluating models in ways that are aligned with business value. Now it is time to practice the skills needed to productize our model and to ensure that it keeps performing well by improving it iteratively. We will also learn to diagnose dataset shift and to mitigate the effect that a changing environment can have on our model's accuracy.

In [1]:
import numpy as np
import pandas as pd

import pickle

from sklearn.preprocessing import LabelEncoder, FunctionTransformer
from sklearn.feature_selection import chi2, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, precision_score, confusion_matrix, roc_auc_score, make_scorer
from sklearn.pipeline import Pipeline

## From workflows to pipelines

Back at the arrhythmia startup, our monthly review is coming up, and as part of it an expert Python programmer will be reviewing our code. We decide to tidy up by following best practices and replacing our script for feature selection and random forest classification with a pipeline. We are using a training dataset available as `X_train` and `y_train`, along with `RandomForestClassifier`, `SelectKBest()` and `f_classif()` for feature selection, as well as `GridSearchCV` and `Pipeline`.

In [2]:
arrh = pd.read_csv('data/arrh.csv')
# arrh['class'] = arrh['class'] == 'bad'
X, y = arrh.drop('class', axis=1), arrh['class']

# Workaround for an error raised by SelectKBest: keep only the columns
# whose standard deviation exceeds 2.1
X = X[X.columns[X.std() > 2.1]]


# Split the data into train and test, with 20% as test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)


In [3]:
# Create pipeline with feature selector and classifier
pipe = Pipeline([
    ('feature_selection', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier(random_state=2))])

# Create a parameter grid
params = {
   'feature_selection__k':[10, 20],
    'clf__n_estimators':[2, 5]}

# Initialize the grid search object
grid_search = GridSearchCV(pipe, param_grid=params)

# Fit it to the data and print the best value combination
print(grid_search.fit(X_train, y_train).best_params_)

{'clf__n_estimators': 5, 'feature_selection__k': 20}


Wrapping up our workflow inside a pipeline is a sign of a true professional!
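
The keys of the parameter grid follow scikit-learn's `<step name>__<parameter name>` convention. A minimal sketch of how to look up the available names and inspect the winning pipeline, using a synthetic dataset from `make_classification` as a stand-in for the arrhythmia data (an assumption made only so that the example is self-contained):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the arrhythmia data (assumption, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0)

demo_pipe = Pipeline([
    ('feature_selection', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier(random_state=2))])

# Every tunable parameter is exposed as '<step name>__<parameter name>'
tunable = sorted(demo_pipe.get_params().keys())
print([name for name in tunable if name.startswith('feature_selection__')])

demo_search = GridSearchCV(
    demo_pipe,
    param_grid={'feature_selection__k': [10, 20], 'clf__n_estimators': [2, 5]},
    cv=3)
demo_search.fit(X_demo, y_demo)

# best_estimator_ is the whole pipeline, refit on all of the data
selector = demo_search.best_estimator_.named_steps['feature_selection']
print(demo_search.best_params_, selector.get_support().sum())
```

The number of selected features reported by `get_support()` should match the winning value of `feature_selection__k`.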

We are proud of the improvement in our code quality, but we just remembered that we previously had to use a custom scoring metric to account for the fact that false positives are costlier to our startup than false negatives. We therefore want to equip our pipeline with scorers other than accuracy, including `roc_auc_score()`, `f1_score()`, and our own custom scoring function.

In [4]:
# Create a custom scorer
scorer = make_scorer(roc_auc_score)

# Initialize the CV object
gs = GridSearchCV(pipe, param_grid=params, scoring=scorer)

# Fit it to the data and print the winning combination
print(gs.fit(X_train, y_train).best_params_)

{'clf__n_estimators': 5, 'feature_selection__k': 20}


In [5]:
# Create a custom scorer
scorer = make_scorer(f1_score)

# Initialise the CV object
gs = GridSearchCV(pipe, param_grid=params, scoring=scorer)

# Fit it to the data and print the winning combination
print(gs.fit(X_train, y_train).best_params_)

{'clf__n_estimators': 5, 'feature_selection__k': 20}


In [6]:
def my_metric(y_test, y_est, cost_fp=10.0, cost_fn=1.0):
    tn, fp, fn, tp = confusion_matrix(y_test, y_est).ravel()
    return cost_fp * fp + cost_fn * fn

# Create a custom scorer
scorer = make_scorer(my_metric)

# Initialise the CV object
gs = GridSearchCV(pipe, param_grid=params, scoring=scorer)

# Fit it to the data and print the winning combination
print(gs.fit(X_train, y_train).best_params_)

{'clf__n_estimators': 5, 'feature_selection__k': 10}
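
One caveat when turning a cost into a scorer: `make_scorer()` assumes by default that larger values are better, so a grid search over a cost function like `my_metric()` favours the most expensive combination unless `greater_is_better=False` is passed. A minimal sketch on synthetic data (an assumption, not the arrhythmia set); `cost_metric` is a hypothetical variant of `my_metric()` that pins the label order so that `confusion_matrix()` always returns four cells:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV

def cost_metric(y_true, y_pred, cost_fp=10.0, cost_fn=1.0):
    # labels=[0, 1] guarantees a 2x2 matrix even if a fold lacks one class
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return cost_fp * fp + cost_fn * fn

# Hand-checkable case: one false positive and one false negative
print(cost_metric([0, 0, 1, 1], [0, 1, 0, 1]))  # 10*1 + 1*1 = 11.0

# greater_is_better=False makes the scorer return the negated cost,
# so that maximising the score minimises the cost
cost_scorer = make_scorer(cost_metric, greater_is_better=False)

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=2),
    param_grid={'n_estimators': [2, 5]},
    scoring=cost_scorer, cv=3)
search.fit(X_demo, y_demo)

# Scores reported by the search are negated costs (never above zero)
print(search.cv_results_['mean_test_score'])
```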


We can now incorporate the knowledge we acquired in Chapter 2 into our pipelines.

## Model deployment

Finally, it is time for us to push our first model to production. It is a random forest classifier, which we will use as a baseline while we work on developing a better alternative. We have access to the data, split into training and test sets with their usual names `X_train`, `X_test`, `y_train` and `y_test`, as well as to `RandomForestClassifier()` and the `pickle` module, whose functions `pickle.load()` and `pickle.dump()` we will need for this exercise.

In [7]:
# Fit a random forest to the training set
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Save it to a file, to be pushed to production
with open('model.pkl', 'wb') as file:
    pickle.dump(clf, file=file)

# Now load the model from file in the production environment
with open('model.pkl', 'rb') as file:
    clf_from_file = pickle.load(file)

# Predict the labels of the test dataset
preds = clf_from_file.predict(X_test)
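
A quick sanity check after deserialising is to confirm that the reloaded model reproduces the original predictions. A minimal sketch on synthetic data (an assumption), using `pickle.dumps()` and `pickle.loads()` to round-trip through bytes in memory instead of a file:

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Serialise to bytes and back; the same round trip happens via a file
restored = pickle.loads(pickle.dumps(model))

# The restored model should give exactly the same predictions
same = np.array_equal(model.predict(X_te), restored.predict(X_te))
print(same)
```

Pickle files should only be loaded from trusted sources, and a pickled model is generally expected to be loaded with the same scikit-learn version that wrote it.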

At some point, we were told that the sensors might perform poorly for obese individuals. Previously we dealt with that using weights, but now we think this information might also be useful for feature engineering, so we decide to replace the recorded weight of an individual with an indicator of whether they are obese. We want to do this inside a pipeline, using `FunctionTransformer()`.

In [8]:
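# Define a feature extractor to flag very large values
def more_than_average(X, multiplier=1.0):
    # Copy into a float NumPy array so that positional indexing
    # also works when X is a pandas DataFrame
    Z = np.array(X, dtype=float)
    # Replace column 1 with a 0/1 flag for values above multiplier * mean
    Z[:, 1] = Z[:, 1] > multiplier * np.mean(Z[:, 1])
    return Z

# Convert your function so that it can be used in a pipeline
pipe = Pipeline([
    ('ft', FunctionTransformer(more_than_average)),
    ('clf', RandomForestClassifier(random_state=2))])

# Optimize the parameter multiplier using GridSearchCV.
# FunctionTransformer forwards extra arguments through kw_args,
# so the grid is defined over kw_args dictionaries
params = {'ft__kw_args': [{'multiplier': m} for m in [1, 2, 3]]}
grid_search = GridSearchCV(pipe, param_grid=params)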
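
`FunctionTransformer` passes extra arguments to the wrapped function through its `kw_args` parameter, so one way to tune `multiplier` is to search over `kw_args` dictionaries. A minimal sketch on synthetic data (an assumption) that also fits the search; `flag_large` is a hypothetical NumPy-only variant of `more_than_average()`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def flag_large(X, multiplier=1.0):
    # Copy into a float array so the input is never modified in place
    Z = np.array(X, dtype=float)
    # Replace column 1 by a 0/1 flag: above multiplier * column mean or not
    Z[:, 1] = Z[:, 1] > multiplier * np.mean(Z[:, 1])
    return Z

# Hand-checkable case: column 1 is [1, 2, 9], whose mean is 4
print(flag_large(np.array([[0., 1.], [0., 2.], [0., 9.]]))[:, 1])  # [0. 0. 1.]

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
# Make column 1 positive, as a body weight would be (illustrative only)
X_demo[:, 1] = np.abs(X_demo[:, 1]) + 50

demo_pipe = Pipeline([
    ('ft', FunctionTransformer(flag_large)),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=2))])

# Each candidate is a full kw_args dictionary for the transformer
demo_grid = {'ft__kw_args': [{'multiplier': m} for m in [1, 2, 3]]}
demo_search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

Because the wrapped function is stateless, the mean is recomputed on whatever batch is passed in, including at prediction time.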

## Iterating without overfitting

Having pushed our random forest to production, we suddenly worry that a naive Bayes classifier might be better. We want to run a champion-challenger test, comparing a naive Bayes classifier, acting as the challenger, against exactly the model that is currently in production, which we will load from file to make sure there is no confusion. We will use the F1 score for assessment.

In [9]:
# Load the current model from disk
with open('model.pkl', 'rb') as file:
    champion = pickle.load(file)

# Fit a Gaussian Naive Bayes to the training data
challenger = GaussianNB().fit(X_train, y_train)

# Print the F1 test scores of both champion and challenger
print(f1_score(y_test, champion.predict(X_test)))
print(f1_score(y_test, challenger.predict(X_test)))

# Write the best-performing model back to disk (here, the champion)
with open('model.pkl', 'wb') as file:
    pickle.dump(champion, file=file)

0.8461538461538461
0.8064516129032258


This way of working is very similar to agile software development, and can greatly accelerate our workflows.
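
The write-back step can also be made conditional, so that the challenger replaces the champion only when it scores higher. A minimal sketch on synthetic data (an assumption), with the models kept in memory rather than on disk:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

candidates = {
    'champion': RandomForestClassifier(random_state=42).fit(X_tr, y_tr),
    'challenger': GaussianNB().fit(X_tr, y_tr)}

# Score both models on the same held-out data
scores = {name: f1_score(y_te, model.predict(X_te))
          for name, model in candidates.items()}

# The champion is kept on ties: max() returns the first maximal key
winner = max(scores, key=scores.get)
print(scores, winner)
```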

We used grid search CV to tune our random forest classifier, and now want to inspect the cross-validation results to make sure we did not overfit. In particular, for each parameter combination we would like to subtract the mean training score from the mean test score, both averaged over the folds.

In [10]:
# Create pipeline with feature selector and classifier
pipe = Pipeline([
    ('feature_selection', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier(random_state=2))])

# Create a parameter grid
params = {
   'feature_selection__k':[10, 20],
    'clf__n_estimators':[2, 5]}


# Set up GridSearchCV with three folds, recording the training scores too
grid_search = GridSearchCV(pipe, params, cv=3, return_train_score=True)

# Fit the grid search
gs = grid_search.fit(X_train, y_train)

# Store the results of CV into a pandas dataframe
results = pd.DataFrame(gs.cv_results_)

# Print the difference between mean test and training scores
print(
  results['mean_test_score']-results['mean_train_score'])

0   -0.260235
1   -0.184092
2   -0.314394
3   -0.256147
dtype: float64


The difference between training and test performance is quite large here: for every parameter combination, the test score is between 0.18 and 0.31 lower than the training score, which is a telltale sign of overfitting!
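
Each row of `cv_results_` corresponds to one parameter combination, so it can help to print the gap next to the parameters that produced it. A minimal sketch on synthetic data (an assumption), with a small random forest grid chosen only for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=300, random_state=0)

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=2),
    param_grid={'n_estimators': [2, 5], 'max_depth': [2, None]},
    cv=3, return_train_score=True)
demo_search.fit(X_demo, y_demo)

demo_results = pd.DataFrame(demo_search.cv_results_)
# Negative gap: the model does worse on held-out folds than on training folds
demo_results['gap'] = (
    demo_results['mean_test_score'] - demo_results['mean_train_score'])

# Sort so that the combinations with the smallest drop come first
print(demo_results[['params', 'gap']].sort_values('gap', ascending=False))
```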

## Dataset shift

We want to check for ourselves that the optimal window size for the arrhythmia dataset is 50. We have been given the dataset as a pandas data frame called `arrh`, and want to use a subset of the data up to time `t_now`. Our test data is available as `X_test` and `y_test`. We will try out a number of window sizes, ranging from 10 to 90 in steps of 10, fit a naive Bayes classifier to each window, assess its F1 score on the test data, and then pick the best-performing window size.

In [11]:
arrh = pd.read_csv('data/arrh.csv')
# arrh['class'] = arrh['class'] == 'bad'
X, y = arrh.drop('class', axis=1), arrh['class']

# Split the data into train and test, with 20% as test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)


t_now = 400
f1_scores = []
wrange = range(10, 100, 10)

# Loop over window sizes
for w_size in wrange:

    # Define the sliding window: the w_size most recent rows up to t_now
    # (.loc slices by label and includes both endpoints)
    sliding = arrh.loc[(t_now - w_size + 1):t_now]

    # Extract X and y from the sliding window
    X_win, y_win = sliding.drop('class', axis=1), sliding['class']

    # Fit the classifier and store the F1 score on the test data
    preds = GaussianNB().fit(X_win, y_win).predict(X_test)
    f1_scores.append(f1_score(y_test, preds))

# Pick the best-performing window size
optimal_window = wrange[np.argmax(f1_scores)]

We now realise that the possibility of dataset shift introduces yet another parameter to optimize: the window size. This cannot be done with standard cross-validation on historical data, which treats observations as interchangeable regardless of when they were recorded; it requires the technique shown here instead.
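
The same procedure can be sketched on a synthetic stream whose decision boundary drifts over time. Everything here is an assumption made for illustration: the data generator, the hypothetical helpers `make_stream` and `choose_window`, and the choice to evaluate on the observations that come right after `t_now`:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB

def make_stream(n=600, seed=0):
    # Two features; the class boundary on feature 0 drifts upwards with time
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    threshold = np.linspace(-1.0, 1.0, n)
    y = (X[:, 0] > threshold).astype(int)
    return X, y

def choose_window(X, y, t_now, horizon, window_sizes):
    # Evaluate on the observations that follow t_now
    X_next, y_next = X[t_now:t_now + horizon], y[t_now:t_now + horizon]
    scores = []
    for w in window_sizes:
        # Train only on the w most recent observations before t_now
        X_win, y_win = X[t_now - w:t_now], y[t_now - w:t_now]
        preds = GaussianNB().fit(X_win, y_win).predict(X_next)
        scores.append(f1_score(y_next, preds))
    return window_sizes[int(np.argmax(scores))], scores

X_stream, y_stream = make_stream()
window_sizes = list(range(50, 401, 50))
best_w, window_scores = choose_window(
    X_stream, y_stream, t_now=400, horizon=100, window_sizes=window_sizes)
print(best_w)
```

Small windows track the current boundary but give noisy estimates, while large windows are more stable but include stale data, which is why the window size has to be tuned.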

We have two concerns about our pipeline at the arrhythmia detection startup:

- The app was trained on patients of all ages, but it is primarily being used by fitness users, who tend to be young. We suspect this might be a case of domain shift, and hence want to disregard all examples from patients aged 50 or over.
- We are still concerned about overfitting, so we want to see if making the random forest classifier less complex and selecting some features might help with that.

We will create a pipeline with a `SelectKBest()` feature selection step followed by a `RandomForestClassifier`. We also have access to `GridSearchCV()`, `Pipeline`, `numpy` as `np` and `pickle`.

In [21]:
# Create a pipeline 
pipe = Pipeline([
  ('ft', SelectKBest()), ('clf', RandomForestClassifier(random_state=2))])

# Create a parameter grid
grid = {'ft__k':[5, 10], 'clf__max_depth':[10, 20]}

# Execute grid search CV on a dataset containing under 50s
grid_search = GridSearchCV(pipe, param_grid=grid)
arrh = pd.read_csv('data/arrh.csv')
arrh = arrh.loc[arrh['age'] < 50]
# arrh['class'] = arrh['class'] == 'bad'
X, y = arrh.drop('class', axis=1), arrh['class']

# Workaround for an error raised by SelectKBest: keep only the columns
# whose standard deviation exceeds 2.25
X = X[X.columns[X.std() > 2.25]]

grid_search.fit(X, y)

# Push the fitted grid search, which wraps the refitted best pipeline, to production
with open('pipe.pkl', 'wb') as file:
    pickle.dump(grid_search, file)

We are now sklearn ninjas and nothing can stop us. Except for ... a lack of labelled data! Let's see what we can do about that in the next chapter.