In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import pickle

# Pickling for Deployment Example
This notebook shows the basic outline for training a model, evaluating it, then using it in a "production" context to make predictions about new data.

## 1. Extract, Transform, Load Data
This is easy here because I'm using a tidy, ready-made dataset from sklearn.

In [2]:
# get premade wine dataset from sklearn
data = load_wine()

In [3]:
print(data.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

## 2. Build a Model to Make Predictions

In [4]:
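# let's build a model to predict the class of wine
# train_test_split holds out 25% of the rows for testing by default;
# no random_state is set, so the split (and the scores below) can vary between runs
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)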
classifier = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=100)
classifier.fit(X_train, y_train)

RandomForestClassifier(max_depth=2, random_state=0)

## 3. Evaluate the Model
Not necessarily the most realistic performance, but let's go with it!

In [5]:
classifier.score(X_test, y_test)

1.0

In [6]:
metrics.confusion_matrix(y_test, classifier.predict(X_test))

array([[14,  0,  0],
       [ 0, 18,  0],
       [ 0,  0, 13]])
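
To make the link between the two evaluation cells explicit, here is a small sketch that recomputes the score from the confusion matrix printed above. It assumes the usual sklearn layout (rows are true classes, columns are predicted classes) and copies the matrix values by hand rather than using the fitted classifier.

```python
import numpy as np

# The confusion matrix printed above (rows = true class, columns = predicted class)
cm = np.array([[14, 0, 0],
               [0, 18, 0],
               [0, 0, 13]])

# Overall accuracy: correct predictions (the diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions for each class over that class's true count
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall_per_class)
```

Every off-diagonal entry is zero, so all 45 test rows were classified correctly, which is what the score of 1.0 reports.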

## 4. Export the Model
As far as I can tell, the [`pickle` format](https://docs.python.org/3/library/pickle.html) is the most popular for this task in Python right now. Pickling is a form of serialization or flattening, which basically means converting everything about an object in memory into bits of data that can be stored in a file.

In [7]:
# "wb" means "write binary"; the with block closes the file even if dump() fails
with open("wine_classifier.pickle", "wb") as output_file:
    pickle.dump(classifier, output_file)
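
To see what "converting an object into bits of data" looks like without touching the file system, here is a minimal sketch using `pickle.dumps` and `pickle.loads`. The dictionary is only an illustrative stand-in for the trained classifier, not part of the original notebook.

```python
import pickle

# A plain dictionary standing in for a trained model (illustrative only)
stand_in = {"name": "wine_classifier", "max_depth": 2, "n_estimators": 100}

# dumps() serializes to a bytes object instead of writing to a file
serialized = pickle.dumps(stand_in)
print(type(serialized))

# loads() rebuilds an equal, but separate, object from those bytes
restored = pickle.loads(serialized)
print(restored == stand_in)
print(restored is stand_in)
```

`pickle.dump` does the same thing as `pickle.dumps`, except that it writes the bytes to an open file.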

## 5. Load the Model
In practice, this step would almost never live in the same file as the previous one. The goal is to take an object that existed in memory at one point and save it so it can be used later. This is especially useful here because training a model is usually much slower than using it to make a prediction, so loading the saved model spares us from re-running that costly training step each time.

In [8]:
# "rb" means "read binary"; the with block closes the file once loading is done
with open("wine_classifier.pickle", "rb") as model_file:
    loaded_model = pickle.load(model_file)
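
One way to convince yourself that nothing was lost in the round trip is to compare predictions from the original and the reloaded model. The sketch below is self-contained: it retrains a classifier with the same settings as above and writes to a temporary directory (my choice, so that it does not overwrite the real `wine_classifier.pickle`).

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
original = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=100)
original.fit(data.data, data.target)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "wine_classifier.pickle")
    with open(path, "wb") as f:
        pickle.dump(original, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give identical predictions without any retraining
same_predictions = np.array_equal(
    original.predict(data.data), restored.predict(data.data)
)
print(same_predictions)
```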

## 6. Make a Prediction with the Loaded Model

In this section I'm constructing a request JSON that resembles what would come from a user who wants a predicted class of wine based on these feature values. This code would not actually exist in your deployed application; the request would instead be created automatically by whatever protocol generated it.

In [9]:
# make a fake request JSON from the user with all the headings
expected_features = ("Alcohol", "Malic acid", "Ash", "Alcalinity of ash",
        "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols",
        "Proanthocyanins", "Color intensity", "Hue",
        "OD280/OD315 of diluted wines", "Proline")
example_values = [1.282e+01, 3.370e+00, 2.300e+00, 1.950e+01, 8.800e+01, 1.480e+00,
       6.600e-01, 4.000e-01, 9.700e-01, 1.026e+01, 7.200e-01, 1.750e+00,
       6.850e+02]

request_json = dict(zip(expected_features, example_values))

request_json

{'Alcohol': 12.82,
 'Malic acid': 3.37,
 'Ash': 2.3,
 'Alcalinity of ash': 19.5,
 'Magnesium': 88.0,
 'Total phenols': 1.48,
 'Flavanoids': 0.66,
 'Nonflavanoid phenols': 0.4,
 'Proanthocyanins': 0.97,
 'Color intensity': 10.26,
 'Hue': 0.72,
 'OD280/OD315 of diluted wines': 1.75,
 'Proline': 685.0}

This is the section that more closely resembles what you might have in your application. I'm checking that all of the expected features are present in `request_json`, transforming them into the right format to make a prediction, then printing out that prediction. In your actual deployed code, you would most likely be returning the response, not printing it.

In [10]:
if request_json and all(feature in request_json for feature in expected_features):
    # unpack all of the relevant values from the request into a list
    test_value = [request_json[feature] for feature in expected_features]
    
    # make a prediction from the "user input"
    predicted_class = int(loaded_model.predict([test_value])[0])
    
    # construct a response
    response_json = {"prediction": predicted_class}
    print(response_json)
else:
    print("something was missing from the request")

{'prediction': 2}
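
As a rough sketch of the "return, don't print" version described above, the same check can be wrapped in a function that hands back a response dictionary. The names `handle_request` and `StubModel` are hypothetical, and the stub always predicts class 2 so the example runs without the pickle file; in a real application you would pass in the unpickled classifier instead.

```python
import json

# Feature names in the order the model was trained on
EXPECTED_FEATURES = (
    "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
    "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
    "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline",
)


class StubModel:
    """Hypothetical stand-in for the unpickled classifier; always predicts class 2."""

    def predict(self, rows):
        return [2 for _ in rows]


def handle_request(request_json, model, expected_features=EXPECTED_FEATURES):
    """Return a response dict: a prediction, or an error naming what is missing."""
    if not isinstance(request_json, dict):
        return {"error": "request body must be a JSON object"}
    missing = [f for f in expected_features if f not in request_json]
    if missing:
        return {"error": "missing features", "missing": missing}
    # Put the values in the order the model expects
    row = [request_json[f] for f in expected_features]
    return {"prediction": int(model.predict([row])[0])}


example_values = [12.82, 3.37, 2.3, 19.5, 88.0, 1.48, 0.66,
                  0.4, 0.97, 10.26, 0.72, 1.75, 685.0]
# Simulate the raw request body a client would send, then parse it
body = json.dumps(dict(zip(EXPECTED_FEATURES, example_values)))
ok_response = handle_request(json.loads(body), StubModel())
bad_response = handle_request({"Alcohol": 12.82}, StubModel())

print(ok_response)
print(bad_response["error"], len(bad_response["missing"]))
```

Returning the list of missing features, rather than a generic message, makes it easier for whoever sent the request to fix it.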
