# 7. Production-Ready Machine Learning (BentoML)

In this section, we'll use BentoML, an open-source library, to deploy our model to production.

## Credit Risk Scoring with XGBoost

This is a continuation of the previous session, where we found that XGBoost was our best model for credit risk scoring. Now we'll deploy this model with the help of BentoML.

To begin, we import the libraries required for the project:

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score

%matplotlib inline

## Data Cleaning and Preparation

- Download the dataset
- Re-encode the categorical variables
- Do the train/test split

In [2]:
# Dataset url
data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-06-trees/CreditScoring.csv'

# Read the data in dataframe
df = pd.read_csv(data)
df.head()

,Status,Seniority,Home,Time,Age,Marital,Records,Job,Expenses,Income,Assets,Debt,Amount,Price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


In [3]:
# Check the number of rows and columns
df.shape

(4455, 14)

In [4]:
# Convert column names to lowercase
df.columns = df.columns.str.lower()
df.head()

,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


In [5]:
# Check the column data types
df.dtypes

status       int64
seniority    int64
home         int64
time         int64
age          int64
marital      int64
records      int64
job          int64
expenses     int64
income       int64
assets       int64
debt         int64
amount       int64
price        int64
dtype: object

In [6]:
# List of categorical columns
categorical_cols = ['status', 'home', 'marital', 'records', 'job']

# Check the unique values in each column
for c in categorical_cols:
    display(df[c].value_counts())

1    3200
2    1254
0       1
Name: status, dtype: int64

2    2107
1     973
5     783
6     319
3     247
4      20
0       6
Name: home, dtype: int64

2    3241
1     978
4     130
3      67
5      38
0       1
Name: marital, dtype: int64

1    3682
2     773
Name: records, dtype: int64

1    2806
3    1024
2     452
4     171
0       2
Name: job, dtype: int64

Some of the columns above contain `0` values, which we'll treat as unknown (`unk`). We'll replace the remaining codes with descriptive labels using the pandas [map()](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) method:

In [7]:
# Map dict for 'status'
status_values = {
    1: 'ok',
    2: 'default',
    0: 'unk'
}

df.status = df.status.map(status_values)

In [8]:
# Check unique values of 'status' after reformatting
df.status.value_counts()

ok         3200
default    1254
unk           1
Name: status, dtype: int64

In [9]:
# Apply the same reformatting to the rest of the categorical columns
home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unk'
}

df.home = df.home.map(home_values)

marital_values = {
    1: 'single',
    2: 'married',
    3: 'widow',
    4: 'separated',
    5: 'divorced',
    0: 'unk'
}

df.marital = df.marital.map(marital_values)

records_values = {
    1: 'no',
    2: 'yes',
    0: 'unk'
}

df.records = df.records.map(records_values)

job_values = {
    1: 'fixed',
    2: 'partime',
    3: 'freelance',
    4: 'others',
    0: 'unk'
}

df.job = df.job.map(job_values)
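
The mappings above all follow the same pattern, so they can also be applied in a loop. The sketch below uses a small made-up DataFrame (not the course dataset) and shows one detail of `Series.map()` with a dict: any code missing from the dict becomes `NaN`, which is why each mapping includes an explicit `0: 'unk'` entry.

```python
import pandas as pd

# Toy data for illustration only; the codes mimic the dataset's encoding
toy = pd.DataFrame({'records': [1, 2, 0], 'job': [1, 3, 9]})

mappings = {
    'records': {1: 'no', 2: 'yes', 0: 'unk'},
    'job': {1: 'fixed', 2: 'partime', 3: 'freelance', 4: 'others', 0: 'unk'},
}

# Apply each mapping to its column
for col, values in mappings.items():
    toy[col] = toy[col].map(values)

# Code 9 is not in the 'job' mapping, so map() leaves NaN in that row
unmapped = toy['job'].isna().sum()
```

Checking `isna().sum()` after mapping is a cheap way to catch codes that the dict does not cover.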

In [10]:
# View the dataframe
df.head()

,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,ok,9,rent,60,30,married,no,freelance,73,129,0,0,800,846
1,ok,17,rent,60,58,widow,no,fixed,48,131,0,0,1000,1658
2,default,10,owner,36,46,married,yes,freelance,90,200,3000,0,2000,2985
3,ok,0,rent,60,24,single,no,fixed,63,182,2500,0,900,1325
4,ok,0,rent,36,26,single,no,fixed,46,107,0,0,310,910


The columns are now correctly formatted. Next, let's look at the summary statistics of the numerical columns.

In [11]:
df.describe().round()

,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,763317.0,1060341.0,404382.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,8703625.0,10217569.0,6344253.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3500.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,166.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,99999999.0,99999999.0,99999999.0,5000.0,11140.0


The maximum value of `income`, `assets`, and `debt` is an unusual `99999999`, which looks like a placeholder for missing data rather than a real amount. We'll replace these values with `NaN`.

In [12]:
# Replace the value 99999999 with NaN
for c in ['income', 'assets', 'debt']:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)

Since we have replaced the above values with NaNs, we need one more step: filling these missing values with `0` so that we can use the data for modeling.

In [13]:
# Fill missing values with 0
df = df.fillna(0)
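
To see why the placeholder has to go before computing statistics, here is a small worked example on made-up numbers. A single `99999999` entry dominates the mean; after replacing it with `NaN` and filling with `0`, the mean returns to a plausible range. Filling with `0` is the choice made in this notebook; other strategies exist but are not used here.

```python
import numpy as np
import pandas as pd

# Made-up incomes; 99999999 stands in for a missing value
income = pd.Series([80, 120, 160, 99999999])

mean_with_sentinel = income.mean()

# Same two steps as in the notebook: placeholder -> NaN -> 0
cleaned = income.replace(to_replace=99999999, value=np.nan).fillna(0)
mean_cleaned = cleaned.mean()

print(mean_with_sentinel, mean_cleaned)
```

Introducing `NaN` turns an integer column into a float column, which is why `income`, `assets`, and `debt` hold float values after this step.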

In [14]:
# Check the summary statistics again
df.describe().round()

,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,130.0,5346.0,342.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,87.0,11525.0,1244.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,119.0,3000.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,164.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,959.0,300000.0,30000.0,5000.0,11140.0


The maximum values are now in a reasonable range. Next, we'll deal with the categorical values one more time. Our target column `status` has three categories, `ok`, `default`, and `unk`, but we are only interested in clients whose status is either `ok` or `default`. Therefore, we'll keep only the rows where `status` is not `unk`.

In [15]:
# Extract rows of the 'status' column where the value is not 'unk'
df = df[df.status != 'unk'].reset_index(drop=True) # reset index
df.shape

(4454, 14)

Next, we'll split the data into 80% train and 20% test sets with a random state of 11.

In [16]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)

In [17]:
# Reset index
df_full_train = df_full_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [18]:
# Convert target variable 'status' to binary integers
y_full_train = (df_full_train.status == 'default').astype(int).values
y_test = (df_test.status == 'default').astype(int).values

In [19]:
# Delete target variable 'status' from 'df_full_train' and 'df_test'
del df_full_train['status']
del df_test['status']

In [20]:
# Verify the split
df_full_train.shape[0], df_test.shape[0]

(3563, 891)

In [21]:
df.shape[0]

4454

In [22]:
df_full_train.shape[0] + df_test.shape[0]

4454

In [23]:
y_full_train.shape, y_test.shape

((3563,), (891,))
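
The sizes above can be checked by hand. The arithmetic below assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set; under that assumption it reproduces the 3563/891 split.

```python
import math

n_samples = 4454   # rows left after dropping the 'unk' status
test_size = 0.2

# Assumption: the test set size is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)   # ceil(890.8)
n_train = n_samples - n_test

print(n_train, n_test)  # 3563 891
```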

## XGBoost Model

The train and test data need to be transformed into feature matrices, which is what happens in the cell below:

In [24]:
# Convert 'df_full_train' dataframe to dictionary
dicts_full_train = df_full_train.to_dict(orient='records')

# Instantiate dictvectorizer
dv = DictVectorizer(sparse=False)
# Create feature matrix 'X_full_train'
X_full_train = dv.fit_transform(dicts_full_train)

# Convert 'df_test' dataframe to dictionary
dicts_test = df_test.to_dict(orient='records')
# Create feature matrix 'X_test'
X_test = dv.transform(dicts_test)

We need to wrap our train and test data in a special XGBoost data structure called `DMatrix`. This data structure is optimized for training XGBoost models faster.

`DMatrix` has various parameters, but the important ones are the feature matrix and the target variable.

*Note*: we have to train the model without providing feature names to `DMatrix`; otherwise BentoML will throw a `ValueError`.

In [25]:
# Apply DMatrix wrapper on train data
dfulltrain = xgb.DMatrix(X_full_train, label=y_full_train)

# Apply DMatrix on test data (without labels because we don't train model on test)
dtest = xgb.DMatrix(X_test)

Train the XGBoost model with the optimal parameter settings that we found in the session 6 experiments.

In [26]:
# Train xgboost on 'dfulltrain' for 175 iterations
xgb_params = {
    'eta': 0.1, 
    'max_depth': 3,
    'min_child_weight': 1,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'nthread': 4,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dfulltrain, num_boost_round=175)

In [27]:
# Make predictions and calculate AUC
y_pred = model.predict(dtest)
roc_auc_score(y_test, y_pred)

0.8324067738624701

## BentoML

Now we are going to save our model with BentoML using `bentoml.xgboost.save_model()`. The function takes the model name and the trained model, and through the `custom_objects` argument we also store the `DictVectorizer` alongside it.

In [28]:
import bentoml

In [29]:
bentoml.xgboost.save_model('credit_risk_model', model,
                           custom_objects={'DictVectorizer': dv})

Model(tag="credit_risk_model:mfg6gqso3gomohdv", path="C:\Users\awon\bentoml\models\credit_risk_model\mfg6gqso3gomohdv\")

**Test**

Let's extract the information about one client from `df_test` for testing purposes.

In [30]:
import json

In [35]:
request = df_test.iloc[50].to_dict()
print(json.dumps(request, indent=2))

{
  "seniority": 10,
  "home": "owner",
  "time": 60,
  "age": 40,
  "marital": "married",
  "records": "no",
  "job": "fixed",
  "expenses": 60,
  "income": 194.0,
  "assets": 7000.0,
  "debt": 0.0,
  "amount": 1600,
  "price": 2171
}
