# Model operationalization via s2i

You can use [s2i](https://github.com/openshift/source-to-image) (source-to-image) to operationalize models that have been trained in a notebook. The relevant builder image is [simple-model-s2i](https://github.com/willb/simple-model-s2i); it works best if your notebook follows a few basic conventions. The rest of this notebook demonstrates those conventions with a simple example.

## requirements

The first convention is to declare your model's requirements as a list of `[package, version]` pairs in a variable called `requirements`. The s2i builder uses these to generate a `requirements.txt` file, and it installs the listed packages while building the image. This step is optional, but it is necessary if your model depends on any libraries.

In [None]:
requirements = [["numpy", "1.15"], ["scikit-learn", "0.19.2"], ["scipy", "1.0.1"], 
                ["boto3","1.9.112"],["pandas", "0.19.2"]]
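To make the mapping from `requirements` to `requirements.txt` concrete, here is a minimal sketch of how such a list could be rendered. The helper name `to_requirements_txt` and the `==` pinning are assumptions for illustration; the builder's actual implementation may format the file differently.

```python
# Hypothetical helper: the real builder may format or pin versions differently.
def to_requirements_txt(requirements):
    """Render [package, version] pairs as pinned requirements.txt lines."""
    return "\n".join(f"{name}=={version}" for name, version in requirements) + "\n"

# A shortened example list, named differently so it does not shadow `requirements`.
example_requirements = [["numpy", "1.15"], ["scikit-learn", "0.19.2"]]
requirements_txt = to_requirements_txt(example_requirements)
print(requirements_txt, end="")
```

Each inner pair becomes one line, so a typo in a package name or version would most likely show up as a failed install during the image build.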

## model code

Your model training code can just appear in this notebook as it would in any other.  Note that the s2i build process will execute every cell in the notebook in order.

In [1]:
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import io

import boto3
import os



In [2]:
DIMENSIONS = 2
randos = np.random.random((40000,DIMENSIONS))
labels = list(np.zeros(20000)) + list(np.ones(20000))

In [3]:
# my_bucket = "MICHAEL-DATA-ODSC"
# conn = boto3.client(service_name='s3',
#         aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
#         aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
#         endpoint_url= os.environ.get("S3_ENDPOINT_URL"))

# obj = conn.get_object(Bucket=my_bucket, Key='demo-data/creditcard.csv')
# df = pd.read_csv(io.BytesIO(obj['Body'].read()))

EndpointConnectionError: Could not connect to the endpoint URL: "https://s3.upshift.redhat.com/MICHAEL-DATA-ODSC/demo-data/creditcard.csv"
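Because the build executes every cell in order, a cell that needs network access or credentials can fail when those are unavailable, as the connection error above illustrates. A minimal sketch of guarding such a cell follows. The helper `s3_configured` is hypothetical, and the variable names are simply the ones the commented-out cell reads.

```python
import os

def s3_configured(env=os.environ):
    """Return True only if every variable the S3 cell reads is set and non-empty."""
    needed = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_ENDPOINT_URL")
    return all(env.get(name) for name in needed)

if s3_configured():
    print("S3 settings found; the data-loading cell can run")
else:
    print("S3 settings missing; falling back to synthetic data")
```

This only checks that the settings exist. It does not prove the endpoint is reachable, so a build could still fail on a network error.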

In [4]:
# df_train, df_test = train_test_split(df, train_size=0.75)
# print("Random Forest Classifier")
# model = RandomForestClassifier(n_estimators=100, max_depth=4, n_jobs=10)
# model.fit(df_train.drop(['Time', 'Class'], axis=1),df_train['Class'])
# test_pred = model.predict(df_test.drop(['Time', 'Class'] ,axis=1))
# test_label = df_test['Class']
# test_acc = np.sum(test_pred==test_label) / len(test_pred)
# print(f'test_acc = {test_acc}')


Random Forest Classifier
test_acc = 0.9993960843796522


In [9]:
#DIMENSIONS = df_train.drop(['Time', 'Class'], axis=1).shape[1]

In [None]:
kmodel = KMeans(n_clusters=7).fit(randos)
#kmodel = RandomForestClassifier().fit(randos,labels)
#kmodel = model

In [None]:
kmodel

## validate and predict

Given a trained model, you simply need to provide two functions:

* `predictor`, which will make a single prediction from a single sample, and
* `validator`, which will return `True` if a single sample is of the correct type.

In [None]:
def predictor(x):
    return kmodel.predict([x]).tolist()[0]
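The `.tolist()` call in `predictor` converts NumPy's result into built-in Python values. One practical effect, assuming the service encodes predictions as JSON (an assumption; this notebook does not say how responses are serialized), is that the returned label can be serialized directly. The sketch below uses a hand-built array in place of the output of `kmodel.predict`.

```python
import json
import numpy as np

# Stand-in for the array returned by a predict() call (assumed integer labels).
fake_prediction = np.array([3], dtype=np.int32)

raw = fake_prediction[0]             # NumPy integer scalar
plain = fake_prediction.tolist()[0]  # built-in int

try:
    json.dumps(raw)
    raw_ok = True
except TypeError:
    # The standard json module does not know how to encode NumPy integers.
    raw_ok = False

print(type(raw).__name__, "serializable:", raw_ok)
print(type(plain).__name__, "serialized as:", json.dumps(plain))
```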

`validator` is optional, but it will make your model service easier to use. If you don't provide one, your model service will accept any input, so calling it with bogus input will likely lead to confusing error messages (such as crashes somewhere in `predictor`).

In [None]:
def validator(x):
    return len(x) == DIMENSIONS
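To show how the two functions fit together, here is a minimal sketch of a request handler that validates before predicting. The `handle` function, the response dictionaries, and the `demo_*` stand-ins are all hypothetical; the actual service generated by the builder may behave differently.

```python
DEMO_DIMENSIONS = 2

def demo_validator(x):
    # Same shape check as `validator` above, with a separate name to avoid shadowing it.
    return len(x) == DEMO_DIMENSIONS

def demo_predictor(x):
    # Stand-in for a trained model: returns 0 or 1 based on the first coordinate.
    return int(x[0] > 0.5)

def handle(sample, predictor, validator=None):
    """Hypothetical request handler: validate first, then predict."""
    if validator is not None:
        try:
            ok = validator(sample)
        except Exception:
            ok = False  # a validator that raises is treated as a rejection
        if not ok:
            return {"error": "invalid sample"}
    return {"prediction": predictor(sample)}

print(handle([0.9, 0.1], demo_predictor, demo_validator))  # valid sample
print(handle([0.9], demo_predictor, demo_validator))       # wrong length
print(handle(7, demo_predictor, demo_validator))           # not a sequence at all
```

Wrapping the validator call in `try`/`except` matters here because a length check like `len(x) == DIMENSIONS` raises `TypeError` for inputs that have no length, such as a bare number.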