
# Custom Python Code in CMLE Prediction: User Guide


## Background
Cloud ML Engine Online Prediction now supports custom Python code in two forms:


1.   Custom transforms in scikit-learn pipelines
2.   Custom prediction routine


Use the custom code in scikit-learn pipelines option if: 

*  Your trained scikit-learn pipeline uses a custom transform.


Use the custom prediction routine option if:
* You’re not using the scikit-learn framework and require custom processing.
* You need to perform pre/post processing outside of the scikit-learn pipeline.
* You need to load in saved state outside of the exported scikit-learn pipeline.

**Note**: You must be whitelisted to use the custom code feature. Please fill out [this google form](https://goo.gl/forms/WgFzm97AJEpXiBDv2) to get started.

## Setup

Before we start, let's install the `google-cloud` package and authenticate, so that we can interact with Google Cloud Machine Learning Engine more easily:


In [0]:
!pip install google-cloud

from google.colab import auth
auth.authenticate_user()

Collecting google-cloud
  Downloading https://files.pythonhosted.org/packages/ba/b1/7c54d1950e7808df06642274e677dbcedba57f75307adf2e5ad8d39e5e0e/google_cloud-0.34.0-py2.py3-none-any.whl
Installing collected packages: google-cloud
Successfully installed google-cloud-0.34.0


Let's also define the project name, model name, and GCS bucket name that we'll refer to later, and create the model resource:

In [0]:
BUCKET="demo"
MODEL_DIR="model_files"
PACKAGES_DIR="my_packages"
PROJECT_NAME = 'demo-project'
MODEL_NAME = 'demo_model'
VERSION_NAME = 'v1'
RUNTIME_VERSION = "1.8"

!gcloud config set project {PROJECT_NAME}
!gcloud ml-engine models create {MODEL_NAME}
!gcloud ml-engine models list

## Option 1: Custom Code in scikit-learn Pipelines

If you need a custom data transformation that the built-in scikit-learn modules (such as [pre-processing](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) or [feature_selection](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)) do not provide, you can write your own Python function and use it in your `Pipeline` through the `FunctionTransformer` wrapper. For the function to be available during prediction, you need to package it and upload it along with your model when you create a version.

Here is an example. Assume we start with basic iris training code similar to the following:

In [0]:
from sklearn import datasets
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

from sklearn.externals import joblib

# Load the iris dataset
iris = datasets.load_iris()

# Set up a pipeline with a feature selection preprocessor that
# selects the top 2 features to use.
# The pipeline then uses a RandomForestClassifier to train the
# model.
pipeline = Pipeline([
      ('feature_selection', SelectKBest(chi2, k=2)),
      ('classification', RandomForestClassifier())
    ])

pipeline.fit(iris.data, iris.target)

# Export the classifier to a file
joblib.dump(pipeline, 'model.joblib')


Now suppose you want to add a preprocessing step that appends the sum of all features as a new feature. You can write a simple function that performs this feature manipulation in a file called `my_module.py`:


In [0]:
%%writefile my_module.py

import numpy as np

def add_sum(X):

  # Append the sum of each row as a new feature.
  sums = X.sum(1)[...,None]
  new_features = np.append(X, sums, 1)
  return new_features


Writing my_module.py
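
To make the transformation concrete, the sketch below applies the same `add_sum` logic to a small hand-made array (the sample values are made up for illustration). Each row gains one extra column holding the row total, so a four-feature instance becomes a five-feature one before feature selection.

```python
import numpy as np


def add_sum(X):
    # Same logic as my_module.add_sum: append each row's sum as a new column.
    sums = X.sum(1)[..., None]      # shape (n_rows, 1)
    return np.append(X, sums, 1)    # shape (n_rows, n_features + 1)


# Two made-up instances with four features each.
X = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])
result = add_sum(X)
print(result.shape)
print(result)
```
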


You can now modify your training code to use this function in your scikit-learn pipeline:

In [0]:

from sklearn import datasets
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.externals import joblib
from my_module import add_sum

# Load the iris dataset
iris = datasets.load_iris()

# Set up a pipeline that first appends the row sum as a new feature,
# then selects the top 2 features and trains a
# RandomForestClassifier on them.
pipeline = Pipeline([
      ('add_sum_step', FunctionTransformer(add_sum)),
      ('feature_selection', SelectKBest(chi2, k=2)),
      ('classification', RandomForestClassifier())
    ])

pipeline.fit(iris.data, iris.target)

# Export the classifier to a file
joblib.dump(pipeline, 'model.joblib')

['model.joblib']

The rest of training and exporting the model remains the same. So we can go ahead and upload this exported model to `MODEL_DIR`:


In [0]:
!gsutil cp ./model.joblib gs://{BUCKET}/{MODEL_DIR}/

Copying file://./model.joblib [Content-Type=application/octet-stream]...
/ [1 files][ 21.2 KiB/ 21.2 KiB]                                                
Operation completed over 1 objects/21.2 KiB.                                     


In order for your model to have access to `my_module` at prediction time, you need to package and upload it along with your exported model. To do so, first you need to create a pip-installable package. One way to do so is to add a setup.py file in the same directory similar to this example:


In [0]:
%%writefile setup.py
from setuptools import setup
setup(name="my_package",
      version="0.1",
      scripts=["my_module.py"]
      )


Writing setup.py


Then you can create a package by running:

In [0]:
!python setup.py sdist

running sdist
running egg_info
creating my_package.egg-info
writing my_package.egg-info/PKG-INFO
writing top-level names to my_package.egg-info/top_level.txt
writing dependency_links to my_package.egg-info/dependency_links.txt
writing manifest file 'my_package.egg-info/SOURCES.txt'
reading manifest file 'my_package.egg-info/SOURCES.txt'
writing manifest file 'my_package.egg-info/SOURCES.txt'

running check


creating my_package-0.1
creating my_package-0.1/my_package.egg-info
copying files to my_package-0.1...
copying my_module.py -> my_package-0.1
copying setup.py -> my_package-0.1
copying my_package.egg-info/PKG-INFO -> my_package-0.1/my_package.egg-info
copying my_package.egg-info/SOURCES.txt -> my_package-0.1/my_package.egg-info
copying my_package.egg-info/dependency_links.txt -> my_package-0.1/my_package.egg-info
copying my_package.egg-info/top_level.txt -> my_package-0.1/my_package.egg-info
Writing my_package-0.1/setup.cfg
creating dist
Creating tar archive
re

This will create a `.tar.gz` package under the `dist/` directory. The name of the package will be `$name-$version.tar.gz`, where `$name` and `$version` are the values specified in `setup.py`.
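
The naming rule can be sketched in a few lines of Python. The helper name `sdist_filename` is a hypothetical introduced for illustration; the bucket and directory values are the ones defined in the setup cell earlier.

```python
def sdist_filename(name, version):
    # Mirrors the `$name-$version.tar.gz` rule described above.
    return "{}-{}.tar.gz".format(name, version)


# Name and version as given in setup.py; bucket and directory as set earlier.
filename = sdist_filename("my_package", "0.1")
local_path = "dist/" + filename
gcs_uri = "gs://{}/{}/{}".format("demo", "my_packages", filename)

print(local_path)
print(gcs_uri)
```

The resulting URI is the value to pass as `--package-uris` when creating the version.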

Once you have successfully created the package, you can upload it to `GCS`:


In [0]:
!gsutil cp ./dist/my_package-0.1.tar.gz gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.1.tar.gz


Copying file://./dist/my_package-0.1.tar.gz [Content-Type=application/x-tar]...

Operation completed over 1 objects/705.0 B.                                      


In [0]:
!gsutil acl -r ch -u AllUsers:O gs://{BUCKET}/{PACKAGES_DIR}

In [0]:
!gsutil ls gs://{BUCKET}/{PACKAGES_DIR}/

In [0]:
!gcloud alpha ml-engine versions create {VERSION_NAME} --model {MODEL_NAME} \
 --origin gs://{BUCKET}/{MODEL_DIR} \
 --runtime-version {RUNTIME_VERSION} \
 --framework SCIKIT_LEARN \
 --package-uris gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.1.tar.gz


Creating version (this might take a few minutes)......done.


Once the version has been created (this should take 1-2 minutes), you can send a prediction request to your model:

In [0]:
%%writefile input.json
[1, 2, 3, 4]

Writing input.json
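
The file above holds a single instance. To our understanding, `--json-instances` reads newline-delimited JSON with one instance per line, so several instances can be sent in one request. A minimal sketch that writes such a file follows; the file name `multi_input.json` and the sample values are made up for illustration.

```python
import json
import os
import tempfile

# Made-up instances; each has the four iris features.
instances = [
    [1, 2, 3, 4],
    [5.1, 3.5, 1.4, 0.2],
]

# Write one JSON value per line (newline-delimited JSON).
path = os.path.join(tempfile.mkdtemp(), "multi_input.json")
with open(path, "w") as f:
    for instance in instances:
        f.write(json.dumps(instance) + "\n")

# Read the file back to confirm that each line parses on its own.
with open(path) as f:
    parsed = [json.loads(line) for line in f]
print(parsed)
```
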


In [0]:
!gcloud alpha ml-engine predict --model {MODEL_NAME} --version {VERSION_NAME} --json-instances input.json

[1]


## Option 2: Custom Prediction Routine

If you need to apply a custom data transformation before or after the model's prediction, you can provide a custom Python class along with your exported model. Be aware of training-serving skew, however: it arises when the preprocessing applied at prediction time inadvertently differs from the preprocessing applied at training time, which biases the predictions.

To avoid this problem, we strongly recommend placing all preprocessing code in one method so that it can be reused during both training and prediction.

The custom class you upload with your exported model should implement the `Model` interface shown below:

In [0]:
class Model(object):
  """Interface for constructing custom models."""

  def predict(self, instances, **kwargs):
    """Performs custom prediction.

    Instances are the decoded values from the request. Clients need not worry
    about decoding json nor base64 decoding.

    Args:
      instances: list of instances, as described in the API.


    Returns:
      A list of outputs containing the prediction results. This list must be 
      JSON serializable.
    """
    raise NotImplementedError()

  @classmethod
  def from_path(cls, model_dir):
    """Creates an instance of Model using the given path.

    Loading of the model should be done in this method.

    Args:
      model_dir: The local directory that contains the exported model file 
          along with any additional files uploaded when creating the version 
          resource.

    Returns:
      An instance implementing this Model class.
    """
    raise NotImplementedError()


Your implementation of the `Model` class should have a `predict` method that accepts a list of input `instances`, plus `**kwargs`, which holds any additional parameters you need to pass to your function.

> **Note**: Everything in the directory passed as the `origin` flag when creating a version is available under the `model_dir` argument of the `from_path` method.
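
To make the contract concrete, here is a toy implementation that uses only the standard library. The class name `ThresholdModel`, the file name `threshold.json`, and the way the class is driven at the bottom are all assumptions made for illustration; the real service loads your class from the uploaded package and calls it for you.

```python
import json
import os
import tempfile


class ThresholdModel(object):
    """Toy model: predicts 1 if the sum of an instance exceeds a threshold."""

    def __init__(self, threshold):
        self._threshold = threshold

    def predict(self, instances, **kwargs):
        # Return a JSON-serializable list with one output per instance.
        return [int(sum(instance) > self._threshold) for instance in instances]

    @classmethod
    def from_path(cls, model_dir):
        # All loading happens here, from files found in model_dir.
        with open(os.path.join(model_dir, "threshold.json")) as f:
            state = json.load(f)
        return cls(state["threshold"])


# Simulate the directory passed as `origin`, then load and predict.
model_dir = tempfile.mkdtemp()
with open(os.path.join(model_dir, "threshold.json"), "w") as f:
    json.dump({"threshold": 8}, f)

model = ThresholdModel.from_path(model_dir)
outputs = model.predict([[1, 2, 3, 4], [1, 1, 1, 1]])
print(json.dumps(outputs))
```
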

The following examples show how to use this feature in two different ways.



### Example 1

In this example we show how to call a custom `preprocess` function by customizing the `predict` method.



#### Training the model

Let’s start with the following existing training code, which loads the iris dataset and performs the following preprocessing:

* takes the log of the values in the third column (index 2)
* replaces non-finite values (such as `-inf` and `+inf`) with `0`

Finally, it fits a classifier and exports the trained model to a file (`model.joblib`):

In [0]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import svm
from sklearn.externals import joblib

# Load the dataset
iris = datasets.load_iris()
data = pd.DataFrame(iris.data)
# Preprocess the data
data.iloc[:, 2] = np.log(data.iloc[:,2])
data[~np.isfinite(data)] = 0 

# Train a classifier
classifier = svm.SVC()
classifier.fit(data, iris.target)

# Export the classifier to a file
joblib.dump(classifier, 'model.joblib')


['model.joblib']

Now to make this script fit into the recommended approach above, we can make a few changes, so that the preprocessing happens in a separate class that can be shared at both training and prediction time: 

In [0]:
%%writefile preprocess.py

import numpy as np
import pandas as pd
import logging

class MyProcessor(object): 
    

  def preprocess(self, data):            
    data.iloc[:, 2] = np.log(data.iloc[:,2])
    data[~np.isfinite(data)] = 0
    return data


Writing preprocess.py
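
To see why the non-finite replacement is needed, the sketch below runs the same two steps on a couple of made-up rows. It is a NumPy-only analogue of `MyProcessor.preprocess` (which operates on a DataFrame), written as a standalone function for illustration.

```python
import numpy as np


def preprocess(X):
    # NumPy analogue of MyProcessor.preprocess (which works on a DataFrame).
    X = np.array(X, dtype=float)          # copy, so the caller's data is untouched
    with np.errstate(divide="ignore"):    # log(0) yields -inf; silence the warning
        X[:, 2] = np.log(X[:, 2])
    X[~np.isfinite(X)] = 0                # replace -inf, +inf and NaN with 0
    return X


# Made-up rows: the first has a zero in the column at index 2.
rows = [[5.1, 3.5, 0.0, 0.2],
        [4.9, 3.0, np.e, 0.2]]
out = preprocess(rows)
print(out)
```

The zero in the first row becomes `-inf` after the log and is then reset to `0`, while the other columns pass through unchanged.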


The training script should now reference the preprocessing module:

In [0]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn import svm
from sklearn.externals import joblib

from preprocess import MyProcessor

# Load the dataset
iris = datasets.load_iris()
data = pd.DataFrame(iris.data)

# Preprocess the data
processor = MyProcessor()
data = processor.preprocess(data)


# Train a classifier
classifier = svm.SVC()
classifier.fit(data, iris.target)

# Export the classifier to a file
joblib.dump(classifier, 'model.joblib')


['model.joblib']

#### Defining the custom `Model` class

To provide custom prediction logic, we need to implement the `Model` API shown above. The example below shows how to load the model object (`model.joblib`) and perform predictions.

In [0]:
%%writefile my_model.py
import os
import numpy as np
import pandas as pd
from sklearn.externals import joblib
from preprocess import MyProcessor
import pickle

class ModelExample(object):

  def __init__(self, model):
    self._model = model
    self._processor = MyProcessor() 

  def predict(self, instances, **kwargs):
    data = pd.DataFrame(instances)
    preprocessed_data = self._processor.preprocess(data)
    return self._model.predict(preprocessed_data).tolist()

  @classmethod
  def from_path(cls, model_dir):
    model = joblib.load(os.path.join(model_dir, 'model.joblib'))    
    return cls(model)


Writing my_model.py




#### Packaging up files and uploading to GCS

Just like in the normal workflow, we need to upload our model file, and any other data files we want to access from our custom model.

In [0]:
!gsutil cp model.joblib gs://{BUCKET}/{MODEL_DIR}/
!gsutil ls gs://{BUCKET}/{MODEL_DIR}/

Now we need to package the classes that are needed at prediction time. We need to create a pip-installable tar file with our `my_model.py` and `preprocess.py`:


In [0]:
%%writefile setup.py
from setuptools import setup
setup(name="my_package",
      version="0.1",
      scripts=["preprocess.py", "my_model.py"]
      )

Writing setup.py


Then you can create a package by running:

In [0]:
!python setup.py sdist

running sdist
running egg_info
creating my_package.egg-info
writing my_package.egg-info/PKG-INFO
writing top-level names to my_package.egg-info/top_level.txt
writing dependency_links to my_package.egg-info/dependency_links.txt
writing manifest file 'my_package.egg-info/SOURCES.txt'
reading manifest file 'my_package.egg-info/SOURCES.txt'
writing manifest file 'my_package.egg-info/SOURCES.txt'

running check


creating my_package-0.1
creating my_package-0.1/my_package.egg-info
copying files to my_package-0.1...
copying my_model.py -> my_package-0.1
copying preprocess.py -> my_package-0.1
copying setup.py -> my_package-0.1
copying my_package.egg-info/PKG-INFO -> my_package-0.1/my_package.egg-info
copying my_package.egg-info/SOURCES.txt -> my_package-0.1/my_package.egg-info
copying my_package.egg-info/dependency_links.txt -> my_package-0.1/my_package.egg-info
copying my_package.egg-info/top_level.txt -> my_package-0.1/my_package.egg-info
Writing my_package-0.1/setup.cfg
creating dist
Creat

This will create a `.tar.gz` package under the `dist/` directory. The name of the package will be `$name-$version.tar.gz`, where `$name` and `$version` are the values specified in `setup.py`.

Once you have successfully created the package, you can upload it to `GCS`:


In [0]:
!gsutil cp ./dist/my_package-0.1.tar.gz gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.1.tar.gz

Copying file://./dist/my_package-0.1.tar.gz [Content-Type=application/x-tar]...
/ [1 files][   1023 B/   1023 B]                                                
Operation completed over 1 objects/1023.0 B.                                     


#### Create a `Version`

Once you have your custom package ready, you can specify this as an argument when creating a `version` resource. Note that you need to provide the path to your package (as `package-uris`) and also the class name that contains your custom `predict` method (as `model-class`). 


In [0]:
!gcloud alpha ml-engine versions list --model {MODEL_NAME}

Listed 0 items.


In [0]:
!gcloud alpha ml-engine versions create {VERSION_NAME} --model {MODEL_NAME} \
--origin gs://{BUCKET}/{MODEL_DIR}/ \
--runtime-version {RUNTIME_VERSION} \
--package-uris gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.1.tar.gz \
--model-class=my_model.ModelExample
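
The `--model-class` value is a dotted `module.ClassName` path. How the service resolves it internally is not covered in this guide; the sketch below only illustrates how such a path is commonly resolved in Python, using a hypothetical `load_class` helper and a standard-library class as a stand-in.

```python
import importlib


def load_class(dotted_path):
    # Split "package.module.ClassName" into a module path and a class name.
    module_name, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in for "my_model.ModelExample", which is only importable once the
# package has been installed.
cls = load_class("collections.OrderedDict")
print(cls.__name__)
```

Whatever the actual mechanism, the module name in the path must match a module shipped in the uploaded package.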

Once the version has been created (this should take 1-2 minutes), you can send a prediction request to your model:

In [0]:
%%writefile input.json
[1, 2, 3, 4]

Writing input.json


In [0]:
!gcloud alpha ml-engine predict --model {MODEL_NAME} --version {VERSION_NAME} --json-instances input.json

[2]


### Example 2

This example is very similar to the one above, with the slight difference that it also uses an additional artifact produced during training.




#### Training the model

As in the example above, we start with training code that performs the same preprocessing, with one addition: it also binarizes the first column based on that column's mean. The training data, and therefore the mean, is available at training time but not at serving time, so we need to store the `mean` value and use it later in our custom predict function.


In [0]:
%%writefile stateful_preprocess.py

import numpy as np
import pandas as pd
import logging

class MyStatefulProcessor(object):
  def __init__(self):
    self._mean = None

  def preprocess(self, data):        
    if self._mean is None: # during training
      self._mean = np.mean(data, axis=0)[0]
    data.iloc[:, 0] = (data.iloc[:,0] > self._mean) * 1
    data.iloc[:, 2] = np.log(data.iloc[:,2])
    data[~np.isfinite(data)] = 0
    return data


Writing stateful_preprocess.py
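
One detail worth checking in a processor like this is how "not yet fitted" is detected. A truthiness test treats a stored mean of exactly `0.0` the same as `None`, so an explicit `is None` comparison is the safer check. The sketch below uses a hypothetical, standard-library-only `MeanBinarizer` class to show the behavior; it is not the processor above.

```python
class MeanBinarizer(object):
    """Hypothetical helper: learns a mean once, then reuses it."""

    def __init__(self):
        self._mean = None

    def transform(self, values):
        if self._mean is None:  # first call only: learn the mean
            self._mean = sum(values) / float(len(values))
        return [int(v > self._mean) for v in values]


binarizer = MeanBinarizer()
train_out = binarizer.transform([-1.0, 1.0])  # the learned mean is 0.0
serve_out = binarizer.transform([5.0])        # the mean must stay 0.0

print(binarizer._mean)
print(train_out, serve_out)
```

With a truthiness check (`if not self._mean`), the second call would relearn the mean from the single serving instance and classify it differently.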


And our training script:

In [0]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn import svm
from sklearn.externals import joblib

from stateful_preprocess import MyStatefulProcessor

# Load the dataset
iris = datasets.load_iris()
data = pd.DataFrame(iris.data)

# Preprocess the data
processor = MyStatefulProcessor()
data = processor.preprocess(data)


# Train a classifier
classifier = svm.SVC()
classifier.fit(data, iris.target)

# Export the classifier to a file
joblib.dump(classifier, 'model.joblib')


['model.joblib']

In addition to the `model.joblib` file, we also export our `processor` object since it holds the `mean` value observed during training. We can reference it later during prediction:

In [0]:
import pickle
with open('./my_processor_state.pkl', 'wb') as f:
  pickle.dump(processor, f)


#### Defining the custom `Model` class

As in Example 1, we need to implement the `Model` API for our custom prediction module. The difference here is that, when the model is loaded, it reads the custom pickled file (in this case `my_processor_state.pkl`) as well as the model object (`model.joblib`). This works because everything uploaded to the model directory is available on the local file system under the `model_dir` path that gets passed to the `from_path` method.

In [0]:
%%writefile my_model.py
import os
import numpy as np
import pandas as pd
from sklearn.externals import joblib
from stateful_preprocess import MyStatefulProcessor
import pickle

class ModelExample(object):

  def __init__(self, model, processor):
    self._model = model
    self._processor = processor


  def predict(self, instances, **kwargs):
    data = pd.DataFrame(instances)
    preprocessed_data = self._processor.preprocess(data)
    return self._model.predict(preprocessed_data).tolist()

  @classmethod
  def from_path(cls, model_dir):
    model = joblib.load(os.path.join(model_dir, 'model.joblib'))
    with open(os.path.join(model_dir, 'my_processor_state.pkl'), 'rb') as f:
      processor = pickle.load(f)
    return cls(model, processor)


Overwriting my_model.py



#### Packaging up files and uploading to GCS

To be able to access any additional artifacts at prediction time, we need to put them in the same GCS directory as our model:


In [0]:
!gsutil cp model.joblib gs://{BUCKET}/{MODEL_DIR}/
!gsutil cp my_processor_state.pkl gs://{BUCKET}/{MODEL_DIR}/

Copying file://model.joblib [Content-Type=application/octet-stream]...
/ [1 files][  5.1 KiB/  5.1 KiB]                                                
Operation completed over 1 objects/5.1 KiB.                                      
Copying file://my_processor_state.pkl [Content-Type=application/octet-stream]...
/ [1 files][  272.0 B/  272.0 B]                                                
Operation completed over 1 objects/272.0 B.                                      


The rest of the work is the same as in Example 1. Let's create a pip-installable tar file with our `my_model.py` and `stateful_preprocess.py`:


In [0]:
%%writefile setup.py
from setuptools import setup
setup(name="my_package",
      version="0.2",
      scripts=["stateful_preprocess.py", "my_model.py"]
      )

Overwriting setup.py


Then we can create a package by running:

In [0]:
!python setup.py sdist

running sdist
running egg_info
writing my_package.egg-info/PKG-INFO
writing top-level names to my_package.egg-info/top_level.txt
writing dependency_links to my_package.egg-info/dependency_links.txt
reading manifest file 'my_package.egg-info/SOURCES.txt'
writing manifest file 'my_package.egg-info/SOURCES.txt'

running check


creating my_package-0.2
creating my_package-0.2/my_package.egg-info
copying files to my_package-0.2...
copying my_model.py -> my_package-0.2
copying preprocess.py -> my_package-0.2
copying setup.py -> my_package-0.2
copying stateful_preprocess.py -> my_package-0.2
copying my_package.egg-info/PKG-INFO -> my_package-0.2/my_package.egg-info
copying my_package.egg-info/SOURCES.txt -> my_package-0.2/my_package.egg-info
copying my_package.egg-info/dependency_links.txt -> my_package-0.2/my_package.egg-info
copying my_package.egg-info/top_level.txt -> my_package-0.2/my_package.egg-info
Writing my_package-0.2/setup.cfg
Creating tar archive
removing 'my_package-0.2' (and eve

This will create a `.tar.gz` package under the `dist/` directory. The name of the package will be `$name-$version.tar.gz`, where `$name` and `$version` are the values specified in `setup.py`.

Once you have successfully created the package, you can upload it to `GCS`:


In [0]:
!gsutil cp ./dist/my_package-0.2.tar.gz gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.2.tar.gz

Copying file://./dist/my_package-0.2.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  1.1 KiB/  1.1 KiB]                                                
Operation completed over 1 objects/1.1 KiB.                                      


#### Create a `Version`

Once you have your custom package ready, you can specify this as an argument when creating a `version` resource. Note that you need to provide the path to your package (as `package-uris`) and also the class name that contains your custom `predict` method (as `model-class`). 


In [0]:
!gcloud alpha ml-engine versions list --model {MODEL_NAME}

NAME  DEPLOYMENT_URI                STATE
v1    gs://nedam-iris/model_files/  READY


In [0]:
VERSION_NAME='v2'
!gcloud alpha ml-engine versions create {VERSION_NAME} --model {MODEL_NAME} \
--origin gs://{BUCKET}/{MODEL_DIR}/ \
--runtime-version {RUNTIME_VERSION} \
--package-uris gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.2.tar.gz \
--model-class=my_model.ModelExample

Once the version has been created (this should take 1-2 minutes), you can send a prediction request to your model:

In [0]:
%%writefile input.json
[1, 2, 3, 4]

Overwriting input.json


In [0]:
!gcloud alpha ml-engine predict --model {MODEL_NAME} --version {VERSION_NAME} --json-instances input.json

[2]


# Questions? Feedback?

Feel free to send us an email (cloudml-feedback@google.com) if you run into any issues or have any questions/feedback!
