# Batch prediction notebook

### Intro

In this notebook we will see how we can utilize a trained model to create batch predictions for new data.
The necessary steps are the following:

1. Train and serialize a model
2. Package the trained model and scoring logic in a Docker container
3. Deploy that container to a Compute Engine instance

## 1. Training and serializing a model


### 1.1 Training script

In [38]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

import warnings
warnings.simplefilter(action='ignore')

seed = 42
np.random.seed(seed)

train = pd.read_csv('gs://home-credit-simonyi-workshop/input/application_train.subsample.csv')

print('Train dataset shape (rows, columns): ', train.shape)

target = 'TARGET'

features = [
    'DAYS_EMPLOYED',
    'DAYS_BIRTH',
    'AMT_INCOME_TOTAL',
    'AMT_CREDIT',
    'CNT_FAM_MEMBERS',
    'AMT_ANNUITY',
    'EXT_SOURCE_1',
    'EXT_SOURCE_2',
    'EXT_SOURCE_3',
    'NAME_TYPE_SUITE', # categorical
    'NAME_INCOME_TYPE', # categorical
]

X = train.loc[:, features]
y = train.loc[:, target]

print("Train features DataFrame shape:", X.shape)
print("Train target Series shape:", y.shape)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=train[target], test_size=0.5, random_state=seed)

print('Train features shape: ', X_train.shape)
print('Train target shape: ', y_train.shape)
print('Validate features shape: ', X_valid.shape)
print('Validate target shape: ', y_valid.shape)


# Numeric columns are the first nine entries of `features`; missing values are mean-imputed.
num_feats = features[:9]
num_transform = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

# Categorical columns: missing values get a constant label, then the column is one-hot encoded.
cat_feats = ['NAME_TYPE_SUITE', 'NAME_INCOME_TYPE']
cat_transform = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Columns are selected by name, which works because X is a DataFrame.
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transform, num_feats),
    ('cat', cat_transform, cat_feats)
])

pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, max_depth=12))
])

pipe.fit(X_train, y_train)


# Check model performance on validation set

metrics = [
    ('Precision', precision_score, False),
    ('Recall', recall_score, False),
#     ('MCC', matthews_corrcoef, False),
#     ('F1', f1_score, False),
     ('ROC-AUC', roc_auc_score, True)
]

pred_valid = pipe.predict(X_valid)
proba_valid = pipe.predict_proba(X_valid)[:,1]

print('-'*15, 'Model performance', '.'*15)

for m in metrics:
    score = m[1](y_valid, proba_valid) if m[2] else m[1](y_valid, pred_valid)
    print('%s on CV: %.3f' % (m[0], score))

Train dataset shape (rows, columns):  (2000, 123)
Train features DataFrame shape: (2000, 11)
Train target Series shape: (2000,)
Train features shape:  (1000, 11)
Train target shape:  (1000,)
Validate features shape:  (1000, 11)
Validate target shape:  (1000,)
--------------- Model performance ...............
Precision on CV: 0.000
Recall on CV: 0.000
ROC-AUC on CV: 0.676
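
One plausible reading of the output above is that the classifier never predicts the positive class at its default decision rule (roughly a 0.5 probability cutoff), while the probabilities still rank applicants reasonably, which is what ROC-AUC measures. The sketch below illustrates that effect with made-up labels and scores (not the notebook's data) and a hypothetical helper `precision_recall`.

```python
def precision_recall(y_true, proba, threshold=0.5):
    """Precision and recall of the positive class at a given probability cutoff."""
    pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: positives are rare and no score reaches 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
proba = [0.02, 0.05, 0.10, 0.08, 0.30, 0.12, 0.03, 0.35, 0.07, 0.40]

for cutoff in (0.5, 0.25):
    p, r = precision_recall(y_true, proba, cutoff)
    print('threshold=%.2f precision=%.3f recall=%.3f' % (cutoff, p, r))
```

With the real pipeline, the same idea would apply to `proba_valid`: compare it against a lower cutoff instead of relying on `pipe.predict`.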


### 1.2 Serializing the model

We need to serialize our model so that the scoring script can load it later. Two common options are the `pickle` module from the standard library and the `joblib` library. Here we use `pickle`.

In [42]:
import pickle

with open('trained_pipe.pkl', 'wb') as f:
    pickle.dump(pipe, f)
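
A pickled scikit-learn model is only expected to load reliably in an environment with matching library versions, which matters here because the file is later loaded inside a Docker container. The following is a minimal sketch, not part of the original workflow, that stores the Python version next to the model and checks it on load. The helpers `save_with_metadata` and `load_with_check` are hypothetical, and a plain dictionary stands in for the fitted pipeline; the scikit-learn version could be recorded the same way.

```python
import pickle
import platform
import tempfile
from pathlib import Path


def save_with_metadata(model, path):
    """Pickle the model together with the Python version used to create it."""
    bundle = {'model': model, 'python_version': platform.python_version()}
    with open(path, 'wb') as f:
        pickle.dump(bundle, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_with_check(path):
    """Load the bundle and print a warning if the Python version differs."""
    with open(path, 'rb') as f:
        bundle = pickle.load(f)
    if bundle['python_version'] != platform.python_version():
        print('Warning: model was pickled under Python', bundle['python_version'])
    return bundle['model']


# A plain dict stands in for the fitted pipeline in this sketch.
stand_in_model = {'n_estimators': 100, 'max_depth': 12}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / 'trained_pipe.pkl'
    save_with_metadata(stand_in_model, model_path)
    restored = load_with_check(model_path)

print(restored == stand_in_model)
```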


Let's load it back and check that it works.

In [43]:
with open('trained_pipe.pkl', 'rb') as f:
    old_pipe = pickle.load(f)

In [44]:
pred_valid = old_pipe.predict(X_valid)
proba_valid = old_pipe.predict_proba(X_valid)[:,1]

print('-'*15, 'Model performance', '.'*15)

for m in metrics:
    score = m[1](y_valid, proba_valid) if m[2] else m[1](y_valid, pred_valid)
    print('%s on CV: %.3f' % (m[0], score))

--------------- Model performance ...............
Precision on CV: 0.000
Recall on CV: 0.000
ROC-AUC on CV: 0.676


## 2. Creating our prediction pipeline 

### 2.1 Creating our scoring function


In the following sections we will create the logic that will execute the prediction on new data.

It will have 3 main steps:
1. Load new data from Cloud Storage
2. Score the new data with the trained pipeline
3. Upload the predictions to BigQuery.

We will create it as a Python script using the IPython cell magic `%%writefile`.
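
For readers working outside Jupyter, the effect of the magic can be approximated in plain Python. This is a rough sketch with a hypothetical helper `write_cell`; the returned messages only mimic what the notebook prints and are not IPython's own implementation.

```python
import tempfile
from pathlib import Path


def write_cell(path, body, append=False):
    """Write `body` to `path`, overwriting by default or appending on request."""
    path = Path(path)
    if append:
        message = 'Appending to %s' % path.name
    elif path.exists():
        message = 'Overwriting %s' % path.name
    else:
        message = 'Writing %s' % path.name
    with open(path, 'a' if append else 'w') as f:
        f.write(body)
    return message


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'scoring.py'
    first = write_cell(target, "print('Success.')\n")
    second = write_cell(target, "print('Success.')\n")
    third = write_cell(target, "print('Done.')\n", append=True)
    content = target.read_text()

print(first, second, third, sep='\n')
```

The `append=True` case corresponds to the `-a` option that is used later when the Dockerfile is built up cell by cell.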

In [24]:
%%writefile scoring.py

import pickle
import pandas as pd
import pandas_gbq
import datetime
# import some functions we need to shutdown the VM if we are running in Google Cloud
from shutdown import kill_vm
import atexit


# load new data from Cloud Storage
input_data = pd.read_csv('gs://home-credit-simonyi-workshop/input/application_train.subsample.csv')


# load our saved pipeline pickle file.
with open('trained_pipe.pkl', 'rb') as f:
    pipe = pickle.load(f)

# Define our feature columns
feature_cols = [
    'DAYS_EMPLOYED',
    'DAYS_BIRTH',
    'AMT_INCOME_TOTAL',
    'AMT_CREDIT',
    'CNT_FAM_MEMBERS',
    'AMT_ANNUITY',
    'EXT_SOURCE_1',
    'EXT_SOURCE_2',
    'EXT_SOURCE_3',
    'NAME_TYPE_SUITE', # categorical
    'NAME_INCOME_TYPE', # categorical
]
    
# Create the predictions and add them to the input dataframe.
input_data = input_data.assign(prediction=pipe.predict_proba(input_data[feature_cols])[:,1],
                               time=datetime.datetime.utcnow())

# Create our final result dataframe
out_data = input_data[['SK_ID_CURR', 'prediction','time']]

# Upload it to BigQuery.
bq_table = 'simonyi_ml.prediction_scores'
pandas_gbq.to_gbq(dataframe=out_data,
                  destination_table=bq_table,
                  project_id='norbert-liki-sandbox',
                  if_exists='append')

print('Success.')

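# Shut down the VM when the interpreter exits. The hook is registered last,
# so it only runs after the upload above has succeeded.
atexit.register(kill_vm)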

Overwriting scoring.py


#### Let's try to run our scoring script and check its results.

In [25]:
!python scoring.py

1it [00:09,  9.42s/it]
Success.
Not running inside a VM
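
The last output line appears to come from `kill_vm`, whose module `shutdown.py` is not shown in this notebook. The sketch below is only a guess at how such a helper could work, assuming it asks the Compute Engine metadata server for the instance name and zone. The helper names `read_metadata` and `build_delete_command` are hypothetical, and the sketch prints the `gcloud` command instead of running it, because `gcloud` may not be available inside the container.

```python
import urllib.error
import urllib.request

# The metadata server is only reachable from inside a Compute Engine instance.
METADATA_URL = 'http://metadata.google.internal/computeMetadata/v1/instance/'


def read_metadata(key, timeout=2):
    """Return a metadata value, or None when no metadata server is reachable."""
    request = urllib.request.Request(
        METADATA_URL + key, headers={'Metadata-Flavor': 'Google'})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode()
    except (urllib.error.URLError, OSError):
        return None


def build_delete_command(name, zone_path):
    """Build a gcloud command that deletes the named instance."""
    # The zone arrives as 'projects/<number>/zones/<zone>'; keep the last part.
    zone = zone_path.rsplit('/', 1)[-1]
    return ['gcloud', 'compute', 'instances', 'delete', name,
            '--zone', zone, '--quiet']


def kill_vm():
    """Print the delete command on a VM, or a notice when running elsewhere."""
    name = read_metadata('name')
    zone_path = read_metadata('zone')
    if name is None or zone_path is None:
        print('Not running inside a VM')
        return
    print('Would run:', ' '.join(build_delete_command(name, zone_path)))
```

Whether the real helper deletes or merely stops the instance is not stated in the notebook; deleting is an assumption made here.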


In [47]:
import pandas_gbq

def validate_bq_results():
    bq_table = 'simonyi_ml.prediction_scores'
    query = f"select * from {bq_table}"

    check_df = pandas_gbq.read_gbq(query,project_id='norbert-liki-sandbox')
    print(check_df.head())
    print('-'*15, 'Prediction records by date.')
    print(check_df.groupby('time').size())
    
validate_bq_results()

Downloading: 100%|██████████| 12000/12000 [00:01<00:00, 9614.92rows/s]

   SK_ID_CURR  prediction                             time
0      268490    0.000000 2020-03-12 13:45:28.364581+00:00
1      401057    0.033122 2020-03-12 13:45:28.364581+00:00
2      166801    0.128110 2020-03-12 13:45:28.364581+00:00
3      130052    0.018069 2020-03-12 13:45:28.364581+00:00
4      224534    0.444930 2020-03-12 13:45:28.364581+00:00
--------------- Prediction records by date.
time
2020-03-12 13:39:44.305047+00:00    2000
2020-03-12 13:41:29.782480+00:00    2000
2020-03-12 13:45:28.364581+00:00    2000
2020-03-12 13:48:38.421967+00:00    2000
2020-03-13 16:10:00.333668+00:00    2000
2020-03-13 16:13:00.288943+00:00    2000
dtype: int64
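
Because the scoring script uploads with `if_exists='append'`, every run adds another 2000 rows for the same applicants, as the grouped counts show. A consumer of the table may therefore want only the most recent prediction per `SK_ID_CURR`. Here is a minimal sketch in plain Python with made-up prediction values and a hypothetical helper `latest_per_id`.

```python
from datetime import datetime


def latest_per_id(rows):
    """Keep the most recent row (by 'time') for each SK_ID_CURR."""
    latest = {}
    for row in rows:
        key = row['SK_ID_CURR']
        if key not in latest or row['time'] > latest[key]['time']:
            latest[key] = row
    return sorted(latest.values(), key=lambda row: row['SK_ID_CURR'])


# Made-up rows imitating two scoring runs over the same two applicants.
rows = [
    {'SK_ID_CURR': 268490, 'prediction': 0.10, 'time': datetime(2020, 3, 12, 13, 39)},
    {'SK_ID_CURR': 401057, 'prediction': 0.20, 'time': datetime(2020, 3, 12, 13, 39)},
    {'SK_ID_CURR': 268490, 'prediction': 0.15, 'time': datetime(2020, 3, 13, 16, 10)},
    {'SK_ID_CURR': 401057, 'prediction': 0.25, 'time': datetime(2020, 3, 13, 16, 10)},
]

deduplicated = latest_per_id(rows)
for row in deduplicated:
    print(row['SK_ID_CURR'], row['prediction'], row['time'].isoformat())
```

On the `check_df` DataFrame, the same result should be obtainable with `check_df.sort_values('time').drop_duplicates('SK_ID_CURR', keep='last')`.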





### 2.2 Packaging our scoring logic in Docker container

We are going to use Docker to package up our scoring function.
Docker has several benefits and its use is widespread in the IT industry:
- Manage applications, not machines
- Code works the same everywhere:
    + Across dev, test and production
    + Across bare-metal, VMs and cloud
- Packaged apps speed development:
    + Agile creation and deployment
    + Continuous integration and delivery

![alt text](container.PNG "Title")

In [98]:
%%writefile Dockerfile
FROM continuumio/miniconda3 as builder

Overwriting Dockerfile


First we define the base image that we are going to use. Docker images consist of layers that we can stack and build upon, which lets us reuse existing images within an organization. For example, we can create a base image for our organization that has all the security options preconfigured, so we do not have to manage them ourselves and only need to care about our application.


In our case we start from a miniconda image, which has the essentials installed for our Python data project.

Next we add a few files to our container:
- **conda_env.yaml** describes the conda environment, i.e. how to create our Python environment
- **scoring.py** contains the main scoring logic in Python
- **shutdown.py** contains the logic for shutting down a Compute Engine instance
- **trained_pipe.pkl** is the serialized pipeline from section 1.2

In [99]:
%%writefile Dockerfile -a

ADD conda_env.yaml /
RUN conda env create -f conda_env.yaml && \
    conda clean -a -y

Appending to Dockerfile


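The step above creates the conda environment with the necessary Python packages. Next we start a new stage from the builder image and add the scoring script, the shutdown helper and the serialized model.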

In [100]:
%%writefile Dockerfile -a

FROM builder
ADD scoring.py shutdown.py trained_pipe.pkl /

Appending to Dockerfile


Then we define an ENTRYPOINT for our image. It makes the image executable: the given command runs once the container is started.

In [101]:
%%writefile Dockerfile -a

ENTRYPOINT ["conda", "run", "-n", "simonyi_workshop", "python", "scoring.py"]

Appending to Dockerfile


### 2.3 Building our container with Cloud Build

Once our container definition is ready, we need to build it. We can use the Cloud Build service for that. It stores the image in Google Container Registry for later use.

In [95]:
PROJECT_ID = "norbert-liki-sandbox"  # REPLACE THIS WITH YOUR PROJECT NAME!!!

In [103]:
!gcloud builds submit . -t "gcr.io/$PROJECT_ID/home_credit_scoring:latest" --project=$PROJECT_ID --timeout=1000

Creating temporary tarball archive of 128 file(s) totalling 6.6 MiB before compression.
Uploading tarball of [.] to [gs://norbert-liki-sandbox_cloudbuild/source/1584123969.16-ad85cd62f5ba46f395e71cb256fe3bfb.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/norbert-liki-sandbox/builds/afbb3a86-cfdf-43ae-89da-723db5aac3a8].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/afbb3a86-cfdf-43ae-89da-723db5aac3a8?project=31878841857].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "afbb3a86-cfdf-43ae-89da-723db5aac3a8"

FETCHSOURCE
Fetching storage object: gs://norbert-liki-sandbox_cloudbuild/source/1584123969.16-ad85cd62f5ba46f395e71cb256fe3bfb.tgz#1584124018464635
Copying gs://norbert-liki-sandbox_cloudbuild/source/1584123969.16-ad85cd62f5ba46f395e71cb256fe3bfb.tgz#1584124018464635...
/ [1 files][  5.7 MiB/  5.7 MiB]                                                
Operation completed over 1 objects/5.7 MiB

## 3. Deploying the image to a Compute Engine instance

Using our freshly built image, we can deploy it as a background service with a single gcloud command.
Note that we are using the `--preemptible` flag, which saves money: these instances cost almost 80% less than normal ones. On the other hand, they can be shut down at any time and stay up for at most 24 hours.

**You can use this logic to run any long-running computation in the background.**
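
The preemptible trade-off mentioned above comes down to simple arithmetic. The sketch below uses a made-up hourly price (not a real Google Cloud price) and takes the 80% discount and the 24-hour limit from the text; the helpers `run_cost` and `fits_preemptible` are hypothetical.

```python
def run_cost(hours, hourly_price, discount=0.0):
    """Cost of one run, given its duration, the hourly price and a discount."""
    return hours * hourly_price * (1.0 - discount)


def fits_preemptible(hours, max_hours=24):
    """A preemptible instance stays up for at most `max_hours`."""
    return hours <= max_hours


HYPOTHETICAL_HOURLY_PRICE = 0.05  # made-up price per hour, for illustration only
hours = 3

regular = run_cost(hours, HYPOTHETICAL_HOURLY_PRICE)
preemptible = run_cost(hours, HYPOTHETICAL_HOURLY_PRICE, discount=0.8)

print('regular: %.3f, preemptible: %.3f' % (regular, preemptible))
print('fits in a preemptible instance:', fits_preemptible(hours))
```

A job that needs more than 24 hours, or that cannot tolerate being interrupted, would need a regular instance or some form of checkpointing.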

In [104]:
%%bash
gcloud compute instances create-with-container credit-scoring3 \
--container-image="gcr.io/norbert-liki-sandbox/home_credit_scoring:latest" \
--project=norbert-liki-sandbox \
--zone=europe-west4-a \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--maintenance-policy=TERMINATE \
--preemptible

NAME             ZONE            MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP   STATUS
credit-scoring3  europe-west4-a  n1-standard-1  true         10.164.0.18  34.91.119.55  RUNNING


Created [https://www.googleapis.com/compute/v1/projects/norbert-liki-sandbox/zones/europe-west4-a/instances/credit-scoring3].


In [105]:
validate_bq_results()

Downloading: 100%|██████████| 14000/14000 [00:01<00:00, 8868.70rows/s]

   SK_ID_CURR  prediction                             time
0      268490    0.000000 2020-03-13 16:13:00.288943+00:00
1      401057    0.033122 2020-03-13 16:13:00.288943+00:00
2      166801    0.128110 2020-03-13 16:13:00.288943+00:00
3      130052    0.018069 2020-03-13 16:13:00.288943+00:00
4      224534    0.444930 2020-03-13 16:13:00.288943+00:00
--------------- Prediction records by date.
time
2020-03-12 13:39:44.305047+00:00    2000
2020-03-12 13:41:29.782480+00:00    2000
2020-03-12 13:45:28.364581+00:00    2000
2020-03-12 13:48:38.421967+00:00    2000
2020-03-13 16:10:00.333668+00:00    2000
2020-03-13 16:13:00.288943+00:00    2000
2020-03-13 18:38:57.188150+00:00    2000
dtype: int64



