# Earnings Model Deploy

2022 June 15

## What should a deployed model do?

1. Training. Train or tune the model based on an example data set and business requirements
1. Predict. Compute expected model outputs based on novel input data (We refer to these as _predictions_ because they are potentially novel and their quality is provisional.)
1. Data Validation. Verify validity of input data
1. Model Performance. Provide metrics of model performance
1. Provenance. Keep track of provenance in a string of computations, over time and over model execution
1. Orchestration. Cooperate with other elements in the data pipeline
1. Encapsulation. Encapsulate model functionality and optimized parameters in ways that support iteration

## Simple Example

Deploy our Decision Tree model from last week to K8s as a scalable service with a RESTful interface.

### Training
Offline, batch mode training. Pre-deployment of model.

### Predict
Batch prediction of earnings level on 1 or more rows of valid input demographic data.

### Data Validation
Use the training data to set statistical warnings when input data is out of bounds.
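
As a rough illustration of this idea, the sketch below checks each feature value against summary statistics of the training data. It is not the project's `vet_features` implementation: the function name `vet_rows`, the two-feature subset, and the rule that values outside the training min/max count as errors are all assumptions. The statistics rows are copied from the validation data printed later in this notebook, and the warning text mimics the format shown there.

```python
# Summary rows in the order: name, count, mean, std, min, 25%, 50%, 75%, max.
# Values copied from the notebook's validation data; only two features are
# used here to keep the sketch short.
VETTING = [
    ["age", 48842.0, 38.64358543876172, 13.710509934443555, 17.0, 28.0, 37.0, 48.0, 90.0],
    ["hours-per-week", 48842.0, 40.422382375824085, 12.391444024252307, 1.0, 40.0, 40.0, 45.0, 99.0],
]


def vet_rows(rows, vetting=VETTING):
    """Hypothetical checker: errors outside [min, max], warnings outside [Q1, Q3]."""
    errors, warnings = [], []
    # Iterate feature by feature so messages are grouped per feature.
    for col, stats in enumerate(vetting):
        name, lo, q1, q3, hi = stats[0], stats[4], stats[5], stats[7], stats[8]
        for i, row in enumerate(rows):
            value = float(row[col])
            if not lo <= value <= hi:
                # Assumption: anything never seen in training is an error.
                errors.append(
                    f"feature {name} value ({value}) out of range "
                    f"[{lo}, {hi}] of training data in vector {i}"
                )
            elif not q1 <= value <= q3:
                warnings.append(
                    f"feature {name} value ({value}) out of quartile range "
                    f"[{q1}, {q3}] of training data in vector {i}"
                )
    return {"errors": errors, "warnings": warnings}


# Each row is [age, hours-per-week]; the third row has an impossible age.
result = vet_rows([[36, 40], [19, 78], [120, 40]])
print("\n".join(result["errors"]))
print("\n".join(result["warnings"]))
```

Because the interquartile range covers only half of the training data by construction, warnings of this kind are expected to be frequent and are better read as debugging hints than as failures.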

### Model Performance 
Batch, post training and before deploy.

### Provenance
Model versioning. Minimal logging.
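
One way "model versioning plus minimal logging" might look is a small record written for each prediction request. This is a sketch under stated assumptions, not the project's logging code: the logger name, the `provenance_record` helper, and the choice to hash the input rows are all hypothetical. The version string and the minute-resolution timestamp format follow what the `/version` endpoint returns later in this notebook.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("earnings_model")  # hypothetical logger name

MODEL_VERSION = "0.1.0"  # same value the service reports; where it is stored is an assumption


def provenance_record(rows, predictions, when=None):
    """Build a small, JSON-serializable record tying inputs to a model version."""
    when = when or datetime.now(timezone.utc)
    # Hash a JSON encoding of the input so identical requests give the same digest.
    digest = hashlib.sha256(json.dumps(rows).encode("utf-8")).hexdigest()
    return {
        "model_version": MODEL_VERSION,
        "timestamp": when.strftime("%Y-%m-%dT%H:%M"),
        "input_sha256": digest,
        "n_rows": len(rows),
        "predictions": list(predictions),
    }


record = provenance_record(
    [[36, 180150, 1, 8, 0, 0, 40]], [0], when=datetime(2022, 6, 14, 21, 48)
)
logger.info(json.dumps(record))
```

Storing a digest rather than the raw rows keeps the log small while still letting a later run confirm that it saw the same input.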

### Orchestration
K8s Service with 2 pods, K8s Ingress with NGINX load balancing.

### Encapsulation
Project in GitHub and deployed Docker image include data, validation data, model code and pickled model (parameters) by version and build number.
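
To make "pickled model by version and build number" concrete, here is a minimal sketch that stores the model together with its version metadata in one file. The notebook itself uses `joblib`; the standard-library `pickle` module is swapped in here so the example runs without extra packages. The file naming scheme, the `save_bundle`/`load_bundle` helpers, and the dictionary standing in for a trained model are all assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def save_bundle(model, directory, version, build):
    """Pickle the model with its version and build number (hypothetical layout)."""
    path = Path(directory) / f"decision_tree-{version}-b{build}.pkl"
    with path.open("wb") as fh:
        pickle.dump({"version": version, "build": build, "model": model}, fh)
    return path


def load_bundle(path):
    """Load a bundle written by save_bundle."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    # A plain dict stands in for the trained model object.
    saved = save_bundle({"max_depth": 5}, tmp, version="0.1.0", build=7)
    saved_name = saved.name
    bundle = load_bundle(saved)

print(saved_name, bundle["version"], bundle["build"])
```

Keeping the metadata inside the pickled object means the service can report its version from the artifact it actually loaded rather than from a separate setting.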

# Using the building blocks directly

In [1]:
from joblib import dump, load
from model.training_data import *

In [2]:
# Unpickle the model
model = load("../data/decision_tree.pkl")

In [3]:
# Utility to give a random sample of the training data
data = random_feature_sample_array(n=24)
print(data)

[[    36 110622      0     13      0      0      8]
 [    49 543922      1      9      0      0     42]
 [    56  67841      1      9      0      0     40]
 [    19 146189      0      9      0      0     78]
 [    23 186006      0      9      0      0     37]
 [    57 195176      1     14      0      0     80]
 [    24 291355      0     10      0      0     60]
 [    36 350103      1      9      0      0     40]
 [    23 107801      0     13      0      0     20]
 [    39 179016      1      9      0      0     40]
 [    62 197918      0      9      0      0     40]
 [    40 436493      1      9      0      0     25]
 [    20 103277      0     10      0      0     35]
 [    19 318061      0     10      0      0     80]
 [    40 180123      1      9      0      0     40]
 [    36 174308      1      7      0      0     40]
 [    28  46987      0     11   2174      0     36]
 [    24 320615      1      9      0   2205     40]
 [    23 124802      0     10      0      0     40]
 [    33 304

In [4]:
# Utilities to read and access validation data
print(vkeys)
print("\n".join([str(x) for x in vetting]))

['age', 48842.0, 38.64358543876172, 13.710509934443555, 17.0, 28.0, 37.0, 48.0, 90.0]
['fnlwgt', 48842.0, 189664.13459727284, 105604.02542315728, 12285.0, 117550.5, 178144.5, 237642.0, 1490400.0]
['sex-val', 48842.0, 0.6684820441423365, 0.47076356938045266, 0.0, 0.0, 1.0, 1.0, 1.0]
['education-num', 48842.0, 10.078088530363212, 2.5709727555922566, 1.0, 9.0, 10.0, 12.0, 16.0]
['capital-gain', 48842.0, 1079.0676262233324, 7452.019057655394, 0.0, 0.0, 0.0, 0.0, 99999.0]
['capital-loss', 48842.0, 87.50231358257237, 403.00455212435907, 0.0, 0.0, 0.0, 0.0, 4356.0]
['hours-per-week', 48842.0, 40.422382375824085, 12.391444024252307, 1.0, 40.0, 40.0, 45.0, 99.0]


In [5]:
res = vet_features(data)
print(res.keys())



In [6]:
# Training data, so no errors in our sample set (by definition)
print("\n".join(res["errors"]))




In [7]:
# Quartiles give us a reasonable way to start debugging when we see poor performance
print("\n".join(res["warnings"]))

feature fnlwgt value (110622.0) out of quartile range [117550.5, 237642.0] of training data in vector 0
feature fnlwgt value (543922.0) out of quartile range [117550.5, 237642.0] of training data in vector 1
feature fnlwgt value (67841.0) out of quartile range [117550.5, 237642.0] of training data in vector 2
feature fnlwgt value (291355.0) out of quartile range [117550.5, 237642.0] of training data in vector 6
feature fnlwgt value (350103.0) out of quartile range [117550.5, 237642.0] of training data in vector 7
feature fnlwgt value (107801.0) out of quartile range [117550.5, 237642.0] of training data in vector 8
feature fnlwgt value (436493.0) out of quartile range [117550.5, 237642.0] of training data in vector 11
feature fnlwgt value (103277.0) out of quartile range [117550.5, 237642.0] of training data in vector 12
feature fnlwgt value (318061.0) out of quartile range [117550.5, 237642.0] of training data in vector 13
feature fnlwgt value (46987.0) out of quartile range [117550.5

In [8]:
prediction = model.predict(data)
print(prediction)

[' <=50K' ' >50K' ' >50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K'
 ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K'
 ' <=50K' ' <=50K' ' <=50K' ' >50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K']




# Create a Service

In [9]:
! cat ../model/model_serve.py | grep route

@app.route('/version')
@app.route('/predict', methods=['POST'])
@app.route('/example')
@app.route('/examples/<n>')
@app.route('/vetter', methods=["POST"])
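
The route list suggests a small JSON API built with Flask-style decorators. The sketch below shows how two of those endpoints might behave, written as a plain WSGI callable from the standard library so that it runs without Flask or uWSGI installed. It is not the contents of `model_serve.py`: the `predict_rows` stand-in, the `call` test helper, and the 404 body are hypothetical, and only the response shapes (`version`, and `size` plus `data`) are taken from the service output shown later in this notebook.

```python
import io
import json

VERSION = {"version": "0.1.0"}  # shape follows the /version response; value assumed


def predict_rows(rows):
    """Hypothetical stand-in for the unpickled model: 1 when capital-gain > 5000."""
    return [1 if row[4] > 5000 else 0 for row in rows]


def app(environ, start_response):
    """Minimal WSGI application handling GET /version and POST /predict."""
    path = environ.get("PATH_INFO", "")
    method = environ.get("REQUEST_METHOD", "GET")
    if path == "/version" and method == "GET":
        status, body = "200 OK", VERSION
    elif path == "/predict" and method == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length))
        preds = predict_rows(payload["data"])
        status, body = "200 OK", {"size": len(preds), "data": preds}
    else:
        status, body = "404 Not Found", {"error": "not found"}
    raw = json.dumps(body).encode("utf-8")
    headers = [("Content-Type", "application/json"), ("Content-Length", str(len(raw)))]
    start_response(status, headers)
    return [raw]


def call(application, method, path, payload=None):
    """Invoke a WSGI app in-process and return (status, decoded JSON body)."""
    raw = json.dumps(payload).encode("utf-8") if payload is not None else b""
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(application(environ, start_response))
    return captured["status"], json.loads(body)


rows = [[36, 180150, 1, 8, 0, 0, 40], [71, 118119, 1, 10, 20051, 0, 50]]
print(call(app, "GET", "/version"))
print(call(app, "POST", "/predict", {"size": 2, "data": rows}))
```

Calling the application in-process like this is also a cheap way to test request handling before building the container image.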


In [10]:
! cat ../Dockerfile

FROM python:3.8.12
RUN apt update && apt upgrade -y

ENV APP /model
RUN mkdir $APP
WORKDIR $APP

RUN apt install make curl -y
RUN pip install --upgrade pip

# get and install poetry package manager
RUN curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
ENV PATH /root/.poetry/bin:$PATH
# set poetry so it does not use a virtual environment in deployment container
RUN poetry config virtualenvs.create false

ENV PYTHONPATH $APP
COPY ./pyproject.toml ./poetry.lock $APP/
RUN poetry install --no-dev

COPY ./model $APP/
RUN mkdir ./data
COPY ./data /data

ENTRYPOINT poetry run uwsgi --ini model_serve.ini

In [11]:
! cat ../Deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: earnings-model-service
  labels:
    app: earnings-model-service
    tier: backend
    version: v1
spec:
  selector:
    matchLabels:
      app: earnings-model-service
  replicas: 2
  template:
    metadata:
      labels:
        app: earnings-model-service
    spec:
      containers:
      - name: model-service
        image: localhost:32000/earnings-model-server
        ports:
        - containerPort: 8080
        env:
        - name: APP_NAME
          value: EARNINGS


In [12]:
!kubectl describe ingress

Name:             k8s-ingress
Labels:           <none>
Namespace:        default
Address:          127.0.0.1
Ingress Class:    public
Default backend:  <default>
Rules:
  Host        Path  Backends
  ----        ----  --------
  *           
              /foo                   foo-app:8080 (10.1.118.82:8080,10.1.235.221:8080)
              /bar                   bar-app:8080 (10.1.118.81:8080,10.1.235.222:8080)
              /books/(.*)            book-service:8083 (10.1.36.215:8083,10.1.7.152:8083)
              /ts-model/(.*)         ts-model-service:8085 (10.1.118.91:8085,10.1.7.168:8085)
              /earnings-model/(.*)   earnings-model-service:8080 (10.1.7.167:8080,10.1.78.211:8080)
Annotations:  nginx.ingress.kubernetes.io/proxy-connect-timeout: 160
              nginx.ingress.kubernetes.io/proxy-next-upstream-timeout: 160
              nginx.ingress.kubernetes.io/proxy-read-timeout: 160
              nginx.ingress.kubernetes.io/proxy-send-timeout: 160
     

In [13]:
import requests

In [14]:
url = "http://192.168.127.8/earnings-model/"
res = requests.get(url + "version")
print(res.json())

{'version': '0.1.0', 'date': '2022-06-14T21:48'}


In [15]:
data_res = requests.get(url + "examples/20")
print(data_res.json())

{'size': 20, 'data': [[36, 180150, 1, 8, 0, 0, 40], [20, 215495, 0, 10, 0, 0, 40], [71, 118119, 1, 10, 20051, 0, 50], [35, 123809, 0, 13, 15024, 0, 35], [75, 31195, 0, 9, 0, 0, 20], [30, 337908, 0, 12, 0, 0, 20], [57, 367334, 0, 9, 0, 0, 40], [33, 144064, 1, 13, 0, 0, 50], [29, 255187, 0, 10, 0, 0, 40], [23, 170070, 0, 8, 0, 0, 38], [24, 214542, 1, 7, 0, 0, 40], [20, 526734, 0, 9, 0, 0, 30], [36, 185394, 0, 9, 0, 0, 40], [44, 154993, 0, 10, 0, 0, 55], [41, 89226, 1, 9, 0, 0, 40], [49, 119565, 1, 14, 0, 0, 40], [31, 191001, 0, 9, 0, 0, 40], [34, 241259, 1, 9, 0, 0, 40], [45, 160599, 1, 12, 0, 0, 40], [30, 104052, 1, 9, 0, 1741, 42]]}


In [16]:
payload = data_res.json()
valid_res = requests.post(url=url + "vetter", json=payload)
print("\n".join(valid_res.json()["warnings"]))

feature fnlwgt value (31195.0) out of quartile range [117550.5, 237642.0] of training data in vector 4
feature fnlwgt value (337908.0) out of quartile range [117550.5, 237642.0] of training data in vector 5
feature fnlwgt value (367334.0) out of quartile range [117550.5, 237642.0] of training data in vector 6
feature fnlwgt value (255187.0) out of quartile range [117550.5, 237642.0] of training data in vector 8
feature fnlwgt value (526734.0) out of quartile range [117550.5, 237642.0] of training data in vector 11
feature fnlwgt value (89226.0) out of quartile range [117550.5, 237642.0] of training data in vector 14
feature fnlwgt value (241259.0) out of quartile range [117550.5, 237642.0] of training data in vector 17
feature fnlwgt value (104052.0) out of quartile range [117550.5, 237642.0] of training data in vector 19
feature education-num value (8.0) out of quartile range [9.0, 12.0] of training data in vector 0
feature education-num value (13.0) out of quartile range [9.0, 12.0] 

In [17]:
service_prediction = requests.post(url=url + "predict", json=payload)
print(service_prediction.json())

{'size': 20, 'data': [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]}


In [18]:
payload = {"size": 24, "data": [x.tolist() for x in data]}
service_prediction_compare = requests.post(url=url + "predict", json=payload)
print(service_prediction_compare.json())

{'size': 24, 'data': [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]}


In [19]:
print(prediction)

[' <=50K' ' >50K' ' >50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K'
 ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K'
 ' <=50K' ' <=50K' ' <=50K' ' >50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K']
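
The service returns integer codes while the local model returns string labels, so comparing the two by eye is error-prone. The sketch below converts labels to codes and lists any positions that disagree. The mapping of `<=50K` to 0 and `>50K` to 1 is an assumption inferred from the two outputs above, and `labels_to_codes` is a hypothetical helper; the sample values are the first four entries of each output.

```python
# Assumed encoding, inferred from comparing the two printed outputs.
LABEL_TO_CODE = {"<=50K": 0, ">50K": 1}


def labels_to_codes(labels):
    """Map string labels (possibly with leading spaces) to integer codes."""
    return [LABEL_TO_CODE[label.strip()] for label in labels]


local = [" <=50K", " >50K", " >50K", " <=50K"]  # first four local predictions
remote = [0, 1, 1, 0]                           # first four service predictions

mismatches = [
    i for i, (a, b) in enumerate(zip(labels_to_codes(local), remote)) if a != b
]
print(mismatches)
```

In the notebook the same check could be run on the full `prediction` array and on `service_prediction_compare.json()["data"]`; an empty list would indicate that the deployed service and the local model agree on this sample.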
