# MOJO Scoring: Two Approaches

Now we will use the model we built on the Lending Club data to score the test cases we pickled. To mimic the scoring performance we would see if the model were deployed in a real-time environment, we will score the records one at a time. We will use the MOJO we downloaded from H2O to score these records in two different ways:

1. Use the `mojo_predict_pandas` function from `h2o.utils.shared_utils` to score one record at a time.

2. Use the Java application we just built to score one record at a time. To do so, we will first start a Java virtual machine (JVM) using Python's `subprocess` module. This JVM will create an instance of our scoring class, which loads the model just once at initialization. As we will see, loading the model once is far more efficient than repeatedly calling `mojo_predict_pandas`, which reloads the model on every call. We will then establish a gateway to our JVM using `JavaGateway` from `py4j` and score our test cases one at a time.

Timing these two approaches shows that the second is far faster than the first. On my machine, the first approach takes more than 300 *milliseconds* per record, whereas the second takes less than 100 *microseconds* per record. For many real-time production applications, that gap is the difference between easily meeting an SLA and almost always missing it.
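To make the per-record figures concrete, the sketch below converts whole-loop `%%timeit` totals into per-record latencies. It assumes the totals reported in the timing cells of this notebook (3.1 s and 988 µs per loop) and a loop over the 10 test cases; the variable names are illustrative only.

```python
# Whole-loop timings as reported by %%timeit in this notebook (one machine).
n_records = 10                 # test cases scored in each timed loop
approach_1_loop_s = 3.1        # mojo_predict_pandas, seconds per loop
approach_2_loop_s = 988e-6     # py4j gateway to the JVM, seconds per loop

# Divide by the number of records to get per-record latency.
per_record_1_ms = approach_1_loop_s / n_records * 1e3
per_record_2_us = approach_2_loop_s / n_records * 1e6
speedup = approach_1_loop_s / approach_2_loop_s

print(f"Approach 1: {per_record_1_ms:.0f} ms per record")
print(f"Approach 2: {per_record_2_us:.1f} µs per record")
print(f"Speedup: roughly {speedup:,.0f}x")
```

With these inputs, the first approach works out to about 310 ms per record and the second to just under 100 µs, a gap of roughly three thousand times.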

### Imports

In [1]:
import os, sys, json, pickle
import pandas as pd
import subprocess
from ast import literal_eval
from py4j.java_gateway import JavaGateway
from h2o.utils import shared_utils as su

### Read in our pickled test cases and feature engineering pipeline

In [2]:
test_data = pd.read_pickle('test_cases.pkl')

In [3]:
with open('pipeline.pkl','rb') as f:
    p = pickle.load(f)

In [4]:
test_data.head()

| | loan_amnt | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | ... | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | application_type | mort_acc | pub_rec_bankruptcies | address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 316824 | 16000.0 | 36 months | 12.49 | 535.19 | B | B5 | Pediatric Sonographer | 10+ years | OWN | 60000.0 | ... | 8.0 | 1.0 | 3594.0 | 47.3 | 36.0 | f | INDIVIDUAL | 2.0 | 1.0 | 239 Mccarty Pines\r\nWest Tarabury, MD 05113 |
| 316825 | 17000.0 | 36 months | 13.67 | 578.3 | B | B5 | Canadain National Railroad | 10+ years | MORTGAGE | 70000.0 | ... | 10.0 | 0.0 | 15194.0 | 53.7 | 14.0 | f | INDIVIDUAL | 1.0 | 0.0 | 48036 Nicholson Roads Suite 299\r\nSouth Melan... |
| 316826 | 6000.0 | 36 months | 14.33 | 206.03 | C | C1 | joycone company | 8 years | RENT | 35000.0 | ... | 8.0 | 0.0 | 8422.0 | 53.6 | 12.0 | f | INDIVIDUAL | 0.0 | 0.0 | 1391 Logan Flats\r\nLeehaven, MS 70466 |
| 316827 | 21600.0 | 36 months | 18.49 | 786.22 | E | E2 | CWI | 1 year | RENT | 70000.0 | ... | 5.0 | 0.0 | 440.0 | 21.0 | 36.0 | f | INDIVIDUAL | 0.0 | 0.0 | 49851 Tammy Brook\r\nPort Jeffreytown, MI 86630 |
| 316828 | 12000.0 | 36 months | 10.25 | 314.14 | B | B2 | OfficeMax | 3 years | MORTGAGE | 741600.0 | ... | 12.0 | 0.0 | 7006.0 | 53.5 | 17.0 | f | INDIVIDUAL | NaN | 0.0 | 2246 Jessica Knolls\r\nParkerfort, IL 29597 |


### Apply feature engineering

In real-time production scoring, these transformations would contribute to the end-to-end runtime of the application and therefore influence whether scoring meets its SLA. Here we are primarily interested in the time it takes to score with the MOJO itself under the two approaches outlined above, so we exclude the feature engineering step from the timing.
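If you do want an end-to-end figure, one option is to time each stage separately and add the results. The sketch below uses `time.perf_counter` for that; `fake_transform` and `fake_score` are hypothetical stand-ins for the pipeline and the scorer, not the objects used in this notebook.

```python
import time

def mean_latency_seconds(fn, items, repeats=5):
    """Return the mean wall-clock seconds that fn takes per item."""
    start = time.perf_counter()
    for _ in range(repeats):
        for item in items:
            fn(item)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(items))

# Hypothetical stand-ins for the feature engineering and scoring stages.
def fake_transform(record):
    return dict(record)

def fake_score(record):
    return sum(record.values())

records = [{"loan_amnt": 16000.0, "int_rate": 12.49} for _ in range(10)]
prepped = [fake_transform(r) for r in records]

transform_latency = mean_latency_seconds(fake_transform, records)
score_latency = mean_latency_seconds(fake_score, prepped)
end_to_end_latency = transform_latency + score_latency

print(f"transform:  {transform_latency * 1e6:.2f} µs per record")
print(f"score:      {score_latency * 1e6:.2f} µs per record")
print(f"end to end: {end_to_end_latency * 1e6:.2f} µs per record")
```

Keeping the stages separate makes it easier to see which one dominates before deciding where to optimize.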

In [5]:
test_data_prepped = (
    p.transform(test_data)
     .reset_index(drop=True)
     .drop(labels = 'loan_status',axis=1))

In [6]:
test_data_prepped.head()

| | loan_amnt | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | ... | revol_bal | revol_util | total_acc | initial_list_status | application_type | mort_acc | pub_rec_bankruptcies | mort_acc_na | pub_rec_bankruptcies_na | revol_util_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16000.0 | 36 months | 12.49 | 535.19 | B | B5 | `_RARE_` | 10+ years | OWN | 60000.0 | ... | 3594.0 | 47.3 | 36.0 | f | INDIVIDUAL | 2.0 | 1.0 | 0 | 0 | 0 |
| 1 | 17000.0 | 36 months | 13.67 | 578.3 | B | B5 | `_RARE_` | 10+ years | MORTGAGE | 70000.0 | ... | 15194.0 | 53.7 | 14.0 | f | INDIVIDUAL | 1.0 | 0.0 | 0 | 0 | 0 |
| 2 | 6000.0 | 36 months | 14.33 | 206.03 | C | C1 | `_RARE_` | 8 years | RENT | 35000.0 | ... | 8422.0 | 53.6 | 12.0 | f | INDIVIDUAL | 0.0 | 0.0 | 0 | 0 | 0 |
| 3 | 21600.0 | 36 months | 18.49 | 786.22 | E | `_RARE_` | `_RARE_` | 1 year | RENT | 70000.0 | ... | 440.0 | 21.0 | 36.0 | f | INDIVIDUAL | 0.0 | 0.0 | 0 | 0 | 0 |
| 4 | 12000.0 | 36 months | 10.25 | 314.14 | B | B2 | `_RARE_` | 3 years | MORTGAGE | 741600.0 | ... | 7006.0 | 53.5 | 17.0 | f | INDIVIDUAL | 1.0 | 0.0 | 1 | 0 | 0 |


In [7]:
predictors = test_data_prepped.columns.to_list()

### Scoring Approach 1: `h2o`'s `mojo_predict_pandas` function

In [8]:
mojo_zip_path = 'lendingclub-app/src/main/resources/final_gbm.zip'
genmodel_jar_path = 'h2o-genmodel.jar'

records = [test_data_prepped.iloc[[i]] for i in range(test_data_prepped.shape[0])]

In [9]:
%%timeit

results = []

for record in records:
    pred = su.mojo_predict_pandas(
        record,
        mojo_zip_path,
        genmodel_jar_path)
    results.append(pred)

3.1 s ± 40.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
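Much of the per-record cost here is start-up work repeated on every call: the model is reloaded each time and, to my understanding, `mojo_predict_pandas` also launches a fresh Java process per call. The sketch below illustrates that general effect without H2O by comparing a task run in a new process each time against the same task run in-process. The Python child process is only a stand-in for the JVM, so the absolute numbers are not comparable to the timing above.

```python
import subprocess
import sys
import time

def score_in_new_process(x):
    # Stand-in for starting a fresh runtime for every record.
    out = subprocess.run(
        [sys.executable, "-c", f"print({x} * 2)"],
        capture_output=True, text=True, check=True)
    return int(out.stdout)

def score_in_process(x):
    # Stand-in for calling an already-loaded model.
    return x * 2

def seconds_per_call(fn, inputs):
    """Return (mean seconds per call, list of results)."""
    start = time.perf_counter()
    results = [fn(x) for x in inputs]
    return (time.perf_counter() - start) / len(inputs), results

inputs = list(range(5))
spawn_s, spawn_results = seconds_per_call(score_in_new_process, inputs)
local_s, local_results = seconds_per_call(score_in_process, inputs)

print(f"new process per call: {spawn_s * 1e3:.2f} ms")
print(f"in-process call:      {local_s * 1e6:.2f} µs")
```

Both paths return the same results; only the fixed cost paid per call differs.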


In [10]:
results = []

for record in records:
    pred = su.mojo_predict_pandas(
        record,
        mojo_zip_path,
        genmodel_jar_path)
    results.append(pred)

In [11]:
# Predictions:
pd.concat(results)

| | predict | Charged Off | Fully Paid |
|---|---|---|---|
| 0 | Fully Paid | 0.195452 | 0.804548 |
| 0 | Fully Paid | 0.109756 | 0.890244 |
| 0 | Fully Paid | 0.146885 | 0.853115 |
| 0 | Fully Paid | 0.534436 | 0.465564 |
| 0 | Fully Paid | 0.073596 | 0.926404 |
| 0 | Fully Paid | 0.12821 | 0.87179 |
| 0 | Charged Off | 0.576497 | 0.423503 |
| 0 | Fully Paid | 0.065011 | 0.934989 |
| 0 | Fully Paid | 0.151643 | 0.848357 |
| 0 | Fully Paid | 0.037864 | 0.962136 |


### Scoring Approach 2: Our Java Application

In [12]:
## Start JVM using subprocess

# Pass an argument list so the command also runs on POSIX without shell=True
cmd = [
    "java", "-cp",
    "lendingclub-app/target/"
    "lendingclub-app-1.0-SNAPSHOT-jar-with-dependencies.jar",
    "com.lendingclub.app.MojoScoringEntryPoint",
]
jvm = subprocess.Popen(cmd)
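`subprocess.Popen` returns immediately, so the JVM may not be listening yet when the next cell tries to connect. One way to guard against that race is to poll the gateway's port before connecting. The helper below is a hypothetical sketch, not part of the original notebook; it assumes the entry point starts a py4j `GatewayServer` on the default port, 25333.

```python
import socket
import time

def wait_for_port(host="127.0.0.1", port=25333, timeout=10.0, interval=0.1):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet; wait briefly and try again.
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` after `subprocess.Popen(cmd)` and before `JavaGateway()` would let the notebook fail with a clear message instead of a connection error if the JVM does not come up in time.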

In [13]:
## Establish gateway with the JVM

gateway = JavaGateway()
mojoscorer = gateway.entry_point.getScorer()

In [14]:
## Construct cases as list of JSON objects

cases = test_data_prepped[predictors].to_dict(orient='records')
cases = [json.dumps(case) for case in cases]

In [15]:
%%timeit
results = []

for case in cases:
    results.append(literal_eval(mojoscorer.predict(case)))

988 µs ± 213 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
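The loop above parses each response with `ast.literal_eval`, which works when the returned string is also a valid Python literal. If the scorer returns JSON (an assumption; the response format is not shown here), `json.loads` is the more direct parser and also handles JSON-only tokens such as `null`. The response strings below are made up for illustration.

```python
import json
from ast import literal_eval

# Hypothetical response shaped like the predictions in this notebook.
response = '{"Charged Off": 0.195452, "Fully Paid": 0.804548}'

# For plain numbers and strings, both parsers give the same dict.
same_result = json.loads(response) == literal_eval(response)

# JSON-only tokens are valid JSON but not valid Python literals.
response_with_null = '{"Charged Off": null, "Fully Paid": 0.804548}'
parsed = json.loads(response_with_null)  # null becomes None

try:
    literal_eval(response_with_null)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False

print(same_result, parsed, literal_eval_ok)
```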


In [16]:
results = []

for case in cases:
    results.append(literal_eval(mojoscorer.predict(case)))

pd.DataFrame(results)

| | Charged Off | Fully Paid |
|---|---|---|
| 0 | 0.195452 | 0.804548 |
| 1 | 0.109756 | 0.890244 |
| 2 | 0.146885 | 0.853115 |
| 3 | 0.534436 | 0.465564 |
| 4 | 0.073596 | 0.926404 |
| 5 | 0.12821 | 0.87179 |
| 6 | 0.576497 | 0.423503 |
| 7 | 0.065011 | 0.934989 |
| 8 | 0.151643 | 0.848357 |
| 9 | 0.037864 | 0.962136 |


In [17]:
## Kill JVM

jvm.kill()