Created by [Nimblebox Inc.](https://www.nimblebox.ai/).
<img style="float:right; margin-right: 50px" src="https://idroot.us/wp-content/uploads/2019/03/TensorFlow-logo.png">
<img style="float:left; margin-right: 50px" src="https://media-exp1.licdn.com/dms/image/C4E1BAQH3ErUUfLXoHQ/company-background_10000/0?e=2159024400&v=beta&t=9Z2hcX4LqsxlDd2BAAW8xDc-Obfvk_rziT1AkPKBcCc" alt="Nimblebox Logo" width="500" height="600">

# Introduction

Today's session sheds some light on an often overlooked aspect of the Machine Learning lifecycle: Model Serving. It is overlooked mainly because, unlike other topics discussed in this webinar series, the issues and processes around Model Serving are newer and constantly evolving. There is no solution that fits all (or even most) use cases.

## Basic Idea
The basic idea and the first step in Model serving is to freeze and export the model. A Keras model consists of multiple components:

- An architecture, or configuration, which specifies what layers the model contains and how they're connected.
- A set of weight values (the "state of the model").
- An optimizer (defined by compiling the model).
- A set of losses and metrics (defined by compiling the model or calling ```add_loss()``` or ```add_metric()```).

The Keras API makes it possible to save all of these pieces to disk at once, or to selectively save only some of them:

- Saving everything into a single archive in the TensorFlow SavedModel format (or in the older Keras H5 format). This is the standard practice.
- Saving the architecture / configuration only, typically as a JSON file.
- Saving the weight values only. This is generally used during training, for example to write checkpoints.
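
The three options above can be sketched as follows. This is a minimal sketch, assuming a tiny stand-in model rather than the webinar's CNN; the file names are hypothetical, and the older H5 format is used for the full save only because a `.h5` path is accepted across a wide range of Keras versions.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# A tiny stand-in model (hypothetical; not the webinar's CNN).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

out_dir = tempfile.mkdtemp()

# 1. Save everything (architecture, weights, compile settings) in one file.
full_path = os.path.join(out_dir, "model.h5")
model.save(full_path)
restored = tf.keras.models.load_model(full_path)

# 2. Save the architecture only, as a JSON string.
config_json = model.to_json()
rebuilt = tf.keras.models.model_from_json(config_json)

# 3. Save the weights only, then load them into the rebuilt architecture.
weights_path = os.path.join(out_dir, "model.weights.h5")
model.save_weights(weights_path)
rebuilt.load_weights(weights_path)

# Both copies should now reproduce the original model's predictions.
x = np.ones((2, 3), dtype="float32")
reference = model.predict(x, verbose=0)
same_full = np.allclose(reference, restored.predict(x, verbose=0))
same_rebuilt = np.allclose(reference, rebuilt.predict(x, verbose=0))
print(same_full, same_rebuilt)
```

Note that the model rebuilt from JSON has freshly initialised weights until `load_weights` is called, which is why options 2 and 3 are usually used together.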

### Building a model (which we will later export)
For this webinar, we will build and export the CNN digit classifier we built in the earlier session. Please refer to the Day 9 notebook for more information on the model.

In [16]:
tf.keras.backend.clear_session()

In [1]:
import tensorflow as tf

### Loading Data ###
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()


### Pre-processing images ###
# Reshaping the array to 4-dims so that it can work with the Keras API
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)

input_shape = (28, 28, 1)

# Converting to float and scaling the pixels to [0, 1] is left disabled here, so the
# model trains on raw 0-255 pixel values and any client must send the same range.
# x_train = x_train.astype('float32')
# x_test = x_test.astype('float32')

# Normalizing the grayscale pixel values by dividing by the max value (255).
# x_train /= 255
# x_test /= 255

# Sanity check
print('Number of images in x_train', x_train.shape[0])
print('Number of images in x_test', x_test.shape[0])


### Defining the model
# Importing the required Keras modules containing model and layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D
# Creating a Sequential Model and adding the layers
model = Sequential([
                     Conv2D(28, kernel_size=(3,3), input_shape=input_shape),
                     MaxPooling2D(pool_size=(2, 2)),
                     Flatten(),
                     Dense(128, activation=tf.nn.relu),
                     Dropout(0.2),
                     Dense(10,activation=tf.nn.softmax)
                    ], name="MNIST_CNN")

## Building and compiling it
model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])
model.summary()

Number of images in x_train 60000
Number of images in x_test 10000
Model: "MNIST_CNN"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 26, 26, 28)        280       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 28)        0         
_________________________________________________________________
flatten (Flatten)            (None, 4732)              0         
_________________________________________________________________
dense (Dense)                (None, 128)               605824    
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
Total params: 607,394
Trainable params: 607,394
Non-trai

In [2]:
### Training the model
history = model.fit(x_train, y_train, epochs=5)

### And Evaluate 
model.evaluate(x_test, y_test)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


[0.13116046786308289, 0.9708999991416931]

Our model is trained, performing well, and ready to ship!

### Exporting a model

In [3]:
model.save('./MNIST-CNN')

Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: ./MNIST-CNN/assets


In [4]:
%ls -la MNIST-CNN/

total 224
drwxr-xr-x  5 shubham  staff     160 Aug 21 07:49 ./
drwxr-xr-x  6 shubham  staff     192 Aug 21 07:49 ../
drwxr-xr-x  2 shubham  staff      64 Aug 21 07:49 assets/
-rw-r--r--  1 shubham  staff  111980 Aug 21 07:49 saved_model.pb
drwxr-xr-x  4 shubham  staff     128 Aug 21 07:49 variables/


#### Exporting sklearn models

In [5]:
!conda install -q -y joblib

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [6]:
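from joblib import dump, load
from sklearn import datasets, svm

# Train a support vector classifier on the iris dataset
clf = svm.SVC()
X, y = datasets.load_iris(return_X_y=True)
clf.fit(X, y)
print(clf.predict(X[0:1]))

# Serialize the fitted estimator to disk with joblib
dump(clf, "iris-svm.skl")

# Load it back and confirm it makes the same prediction
loaded_clf = load("iris-svm.skl")
loaded_clf.predict(X[0:1])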

[0]


array([0])

## Model Deployment 
There are many ways to deploy models, with new paradigms being introduced and updated often. Some of them are listed here:
1. **REST APIs** - Using a web framework such as [Flask](https://flask.palletsprojects.com/en/1.1.x/), [FastAPI](https://fastapi.tiangolo.com/), or Django, you can build a REST API that runs inference on the web server. These can be easily integrated into existing web applications.
1. **Using TensorFlow Serving** - Usually run as a container, [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving) provides a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs.
1. [**Kubeflow**](https://www.kubeflow.org/) - A free and open-source machine learning platform designed to make deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable.
1. **On-device Inference** - Inference can be performed on the device itself using:
    1. [TensorFlow Lite](https://www.tensorflow.org/lite) - Allows deploying ML models on mobile (Android and iOS) and IoT devices.
    1. [TensorFlow.js](https://www.tensorflow.org/js) - Allows developing and deploying TensorFlow models using JavaScript. It can be used to build server-side (using Node.js) as well as client-side apps for running inference.
    1. [Intel OpenVINO™ Toolkit](https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit.html) - Allows model optimisation and deployment of Computer Vision models on a variety of Intel hardware.
1. **Cloud Based Tools** - GCP AI Platform, Amazon SageMaker, and Azure Machine Learning are all cloud-based ML tools.
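
To give a feel for the TensorFlow Serving option, here is a minimal sketch of how a client might build a request for its REST predict endpoint and read the reply. It assumes the row-oriented `instances` / `predictions` JSON format and the commonly used REST port 8501; the model name, host, helper names, and the response body are hypothetical placeholders, and nothing is sent over the network.

```python
import json


def build_predict_request(image, model_name="mnist_cnn", host="localhost", port=8501):
    """Return (url, body) for a TensorFlow Serving REST predict call.

    model_name, host and port are hypothetical placeholders.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    # The row format wraps each input example in an "instances" list.
    body = json.dumps({"instances": [image]})
    return url, body


def parse_predictions(response_text):
    """Return the index of the highest score for each prediction row."""
    rows = json.loads(response_text)["predictions"]
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# A blank 28x28x1 image as nested lists, standing in for a real digit.
blank_image = [[[0] for _ in range(28)] for _ in range(28)]
url, body = build_predict_request(blank_image)
print(url)

# A made-up response body, shaped like the server's "predictions" field.
fake_response = json.dumps({"predictions": [[0.1, 0.7, 0.2]]})
print(parse_predictions(fake_response))
```

With a server actually running, the body could be sent with something like `requests.post(url, data=body)`.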

# Introducing Flask

Flask is a micro web framework built around a small core and an easy-to-extend philosophy. It is classified as a microframework because it does not require particular tools or libraries. It has no database abstraction layer, form validation, or any other components where pre-existing third-party libraries provide common functions.

## Running a simple webserver

In [10]:
%%writefile flask_api_demo.py

# load Flask 
import flask
app = flask.Flask(__name__)


@app.route("/predict", methods=["GET", "POST"])
def predict():
    data = {"success": False}
    # get the request parameters from the JSON body, if there is one
    params = flask.request.get_json(silent=True)

    # otherwise fall back to the query-string arguments
    if params is None:
        params = flask.request.args

    # if parameters are found, echo the msg parameter
    if params is not None:
        data["response"] = params.get("msg")
        data["success"] = True

    # return a response in json format
    return flask.jsonify(data)


# start the flask development server; 127.0.0.1 accepts local connections only
if __name__ == "__main__":
    app.run(host='127.0.0.1')

Overwriting flask_api_demo.py


### Running the development webserver
On your local machine, run this:

    FLASK_ENV=development python3 flask_api_demo.py

In [40]:
import requests
r = requests.post("http://127.0.0.1:5000/predict", json={"msg": "hello"})
r.json()

{'response': 'hello', 'success': True}
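
The endpoint can also be exercised without starting a server, using Flask's built-in test client. The following is a minimal sketch that redefines a version of the echo endpoint inline so that it is self-contained (an assumption made for illustration; in practice you would import `app` from `flask_api_demo.py`).

```python
import flask

app = flask.Flask(__name__)


@app.route("/predict", methods=["GET", "POST"])
def predict():
    data = {"success": False}
    # Prefer a JSON body; fall back to query-string arguments.
    params = flask.request.get_json(silent=True)
    if params is None:
        params = flask.request.args
    data["response"] = params.get("msg")
    data["success"] = True
    return flask.jsonify(data)


# The test client sends requests straight to the app, with no network involved.
client = app.test_client()

post_result = client.post("/predict", json={"msg": "hello"}).get_json()
get_result = client.get("/predict?msg=hi").get_json()

print(post_result)
print(get_result)
```

This style of check is convenient for automated tests, since it does not depend on a port being free or a server process being up.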

## Modifying the webserver to load TF Model on startup

In [51]:
%%writefile flask_api_tf.py

# load Flask 
import flask

# Adding TF import
import tensorflow as tf
from tensorflow import keras
import numpy as np
from json import loads

model = None

app = flask.Flask(__name__)

def load_model(path="./MNIST-CNN"):
    # load the exported model once, at startup, and keep it in a module-level global
    global model
    model = keras.models.load_model(path)
    model.summary()
    
def make_prediction(model_in):
    try:
        np_image = np.asarray(loads(model_in))
        np_image = np_image.reshape(1, 28, 28, 1)
        pred = model.predict(np_image).argmax().item()
    except Exception as e:
        print("ERROR")
        print(e)
        pred = -1
    print(pred)
    return pred

@app.route("/predict", methods=["POST"])
def predict():
    data = {"success": False}
    
    # get the request parameters from the JSON body, if there is one
    params = flask.request.get_json(silent=True)

    # otherwise fall back to the query-string arguments
    if params is None:
        params = flask.request.args

    if params is not None:
        model_input_str = params.get("data")
        resp = make_prediction(model_input_str)
        data["response"] = resp
        # make_prediction returns -1 when inference fails
        data["success"] = resp > -1
        
    # return a response in json format 
    return flask.jsonify(data)

# load the model, then start the flask development server (local connections only)
if __name__ == "__main__":
    load_model()
    app.run(host='127.0.0.1')

Overwriting flask_api_tf.py


In [54]:
import requests
import json
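# the server reshapes whatever it receives to (1, 28, 28, 1), so any layout of the 784 pixels works
inp = json.dumps(x_test[12].reshape(1, 28, 28, 1).tolist())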
r = requests.post("http://127.0.0.1:5000/predict", json={"data": inp})
r.json()

{'response': 9, 'success': True}

In [26]:
y_test[12]

9

Now let's see what inference speed we get for this simple model served via Flask:

In [48]:
%%timeit
inp = json.dumps(x_test[12].reshape(1, 28, 28, 1).tolist())
r = requests.post("http://127.0.0.1:5000/predict", json={"data": inp})

33.2 ms ± 726 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
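
`%%timeit` reports a mean and a standard deviation; for a served model, the slowest requests are often of interest too. Below is a minimal sketch using only the standard library. `measure_latency` and its parameters are hypothetical names, and the `time.sleep` call is a stand-in workload that you would swap for the actual request.

```python
import statistics
import time


def measure_latency(call, warmup=3, runs=30):
    """Time repeated invocations of `call` and summarise them in milliseconds."""
    # Warm-up calls are not recorded, so one-off start-up costs do not skew results.
    for _ in range(warmup):
        call()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    # Nearest-rank style index for the 95th percentile.
    p95_index = min(len(samples_ms) - 1, int(round(0.95 * (len(samples_ms) - 1))))
    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


# Stand-in workload; replace with a function that posts to the /predict endpoint.
stats = measure_latency(lambda: time.sleep(0.001), warmup=1, runs=10)
print(stats)
```

Note that the timed cell above also includes the cost of `json.dumps` on the client, so the figure is an end-to-end number rather than pure model inference time.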


## Next Steps
Naturally, the next steps would be to implement better error handling, authentication, and rate limiting (if required) for these Flask APIs. The built-in development server we've been using to serve Flask is not suitable for production. For deploying Flask apps, please take a look at the [documentation](https://flask.palletsprojects.com/en/1.1.x/tutorial/deploy/).
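
As one example of better error handling, the request payload can be validated before it reaches the model. The sketch below uses only the standard library and assumes the payload convention used in this notebook (a JSON-encoded, possibly nested, list of 784 pixel values in the 0-255 range). `InvalidInput` and `parse_image_payload` are hypothetical names.

```python
import json

EXPECTED_PIXELS = 28 * 28


class InvalidInput(ValueError):
    """Raised when a request payload cannot be turned into a 28x28 image."""


def _flatten(value):
    # Walk arbitrarily nested lists and yield the leaf values in order.
    if isinstance(value, list):
        for item in value:
            yield from _flatten(item)
    else:
        yield value


def parse_image_payload(raw):
    """Decode a JSON string into a flat list of 784 pixel values, or raise InvalidInput."""
    if not isinstance(raw, str):
        raise InvalidInput("payload must be a JSON-encoded string")
    try:
        decoded = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidInput("payload is not valid JSON") from exc

    pixels = list(_flatten(decoded))
    if len(pixels) != EXPECTED_PIXELS:
        raise InvalidInput(f"expected {EXPECTED_PIXELS} values, got {len(pixels)}")
    for pixel in pixels:
        is_number = isinstance(pixel, (int, float)) and not isinstance(pixel, bool)
        if not is_number or not 0 <= pixel <= 255:
            raise InvalidInput("pixel values must be numbers between 0 and 255")
    return pixels


good = parse_image_payload(json.dumps([[0] * 28 for _ in range(28)]))
print(len(good))

try:
    parse_image_payload("[1, 2, 3]")
except InvalidInput as err:
    print("rejected:", err)
```

In the Flask handler, an `InvalidInput` could be mapped to an HTTP 400 response so that clients can tell a bad request apart from a server-side failure.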