# Batch prediction notebook

### Intro

In this notebook we will see how to use a trained model to create batch predictions for new data.
The necessary steps are the following:

1. Train and serialize a model
2. Package the trained model and scoring logic in a Docker container
3. Deploy that container to a Compute Engine instance

## 1. Training and serializing a model

To prepare for today's workshop, we need to update our environment with the correct Python packages.

**Restart the kernel afterwards**

In [None]:
!conda install -c conda-forge scikit-learn==0.20.4 oauth2client pandas-gbq -y

In [None]:
PROJECT_ID = "norbert-liki-sandbox"  # REPLACE THIS WITH YOUR PROJECT NAME!!!

### 1.1 Training script

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

import warnings
warnings.simplefilter(action='ignore')

seed = 42
np.random.seed(seed)

train = pd.read_csv('gs://home-credit-simonyi-workshop/input/application_train.subsample.csv')

print('Train dataset shape (rows, columns): ', train.shape)

target = 'TARGET'

features = [
    'DAYS_EMPLOYED',
    'DAYS_BIRTH',
    'AMT_INCOME_TOTAL',
    'AMT_CREDIT',
    'CNT_FAM_MEMBERS',
    'AMT_ANNUITY',
    'EXT_SOURCE_1',
    'EXT_SOURCE_2',
    'EXT_SOURCE_3',
    'NAME_TYPE_SUITE', # categorical
    'NAME_INCOME_TYPE', # categorical
]

X = train.loc[:, features]
y = train.loc[:, target]

print("Train features DataFrame shape:", X.shape)
print("Train target Series shape:", y.shape)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=train[target], test_size=0.5, random_state=seed)

print('Train features shape: ', X_train.shape)
print('Train target shape: ', y_train.shape)
print('Validate features shape: ', X_valid.shape)
print('Validate target shape: ', y_valid.shape)


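# Column positions within `features`: the first nine are numeric, the last two categorical.
num_feats = list(range(0, 9))
cat_feats = [9, 10]

# Numeric columns: fill missing values with the column mean.
num_transform = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

# Categorical columns: fill missing values with a placeholder, then one-hot encode.
cat_transform = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', num_transform, num_feats),
    ('cat', cat_transform, cat_feats)
])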

pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, max_depth=3))
])

pipe.fit(X_train, y_train)


# Check model performance on validation set

# Each entry: (display name, metric function, needs probabilities instead of labels)
metrics = [
    ('Precision', precision_score, False),
    ('Recall', recall_score, False),
    ('ROC-AUC', roc_auc_score, True)
]

pred_valid = pipe.predict(X_valid)
proba_valid = pipe.predict_proba(X_valid)[:,1]

print('-'*15, 'Model performance', '-'*15)

for m in metrics:
    score = m[1](y_valid, proba_valid) if m[2] else m[1](y_valid, pred_valid)
    print('%s on validation set: %.3f' % (m[0], score))
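
The metrics list above passes hard labels to precision and recall but probabilities to ROC-AUC. The following is a minimal pure-Python sketch of why, not the scikit-learn implementation: the toy labels and scores are made up for illustration, and the 0.5 cut-off is an assumption.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall computed from hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for this illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 cut-off

print(precision_recall(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Precision and recall depend on the chosen cut-off, while ROC-AUC only depends on how the scores rank positives against negatives.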

### 1.2 Serializing the model

We need to serialize our model in some format. There are many ways to do this. With scikit-learn models we can use either the pickle or the joblib library.

In [None]:
import pickle

with open('batch_prediction_src/trained_pipe.pkl', 'wb') as f:
    pickle.dump(pipe, f)

Let's load it back and see if it works.

In [None]:
with open('batch_prediction_src/trained_pipe.pkl', 'rb') as f:
    old_pipe = pickle.load(f)

In [None]:
pred_valid = old_pipe.predict(X_valid)
proba_valid = old_pipe.predict_proba(X_valid)[:,1]

print('-'*15, 'Model performance', '-'*15)

for m in metrics:
    score = m[1](y_valid, proba_valid) if m[2] else m[1](y_valid, pred_valid)
    print('%s on validation set: %.3f' % (m[0], score))
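
A pickled scikit-learn model is not guaranteed to load under a different library version, which is why the environment is pinned at the top of this notebook. One possible safeguard, sketched below, is to store the version next to the model and check it at load time. The helper names `save_with_version` and `load_checked` are hypothetical, and a plain dictionary stands in for the fitted pipeline so that the example runs without scikit-learn.

```python
import pickle
import tempfile
from pathlib import Path


def save_with_version(model, path, lib_version):
    """Pickle the model together with the library version it was trained with."""
    with open(path, "wb") as f:
        pickle.dump({"lib_version": lib_version, "model": model}, f)


def load_checked(path, current_version):
    """Load the bundle and refuse to continue on a version mismatch."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    if bundle["lib_version"] != current_version:
        raise RuntimeError(
            f"trained with {bundle['lib_version']}, running {current_version}"
        )
    return bundle["model"]


# Stand-in for a fitted pipeline; in the notebook this would be `pipe`
# and the version would come from sklearn.__version__.
toy_model = {"weights": [0.1, 0.2]}
path = Path(tempfile.mkdtemp()) / "bundle.pkl"

save_with_version(toy_model, path, "0.20.4")
restored = load_checked(path, "0.20.4")
print(restored == toy_model)
```

Calling `load_checked(path, "1.0")` on the same file raises a `RuntimeError` instead of returning a model that may behave differently.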

## 2. Creating our prediction pipeline 

### 2.1 Creating our scoring function


In the following sections we will create the logic that runs the prediction on new data.

It has 3 main steps:
1. Load new data from Cloud Storage
2. Score the new data with the trained model
3. Upload the predictions to BigQuery

We will create it as a Python script using the IPython magic function *writefile*.

In [None]:
%%writefile batch_prediction_src/scoring.py

import pickle
import pandas as pd
import pandas_gbq
import datetime
# import some functions we need to shutdown the VM if we are running in Google Cloud
from shutdown import kill_vm
import atexit


# load new data from Cloud Storage
input_data = pd.read_csv('gs://home-credit-simonyi-workshop/input/application_train.subsample.csv')


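# Load our saved pipeline pickle file. Try the current working directory first
# (as in the container), then fall back to the notebook's batch_prediction_src/ folder.
# If neither file exists, the FileNotFoundError stops the script here.
try:
    with open('trained_pipe.pkl', 'rb') as f:
        pipe = pickle.load(f)
except FileNotFoundError:
    with open('batch_prediction_src/trained_pipe.pkl', 'rb') as f:
        pipe = pickle.load(f)
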

# Define our feature columns
feature_cols = [
    'DAYS_EMPLOYED',
    'DAYS_BIRTH',
    'AMT_INCOME_TOTAL',
    'AMT_CREDIT',
    'CNT_FAM_MEMBERS',
    'AMT_ANNUITY',
    'EXT_SOURCE_1',
    'EXT_SOURCE_2',
    'EXT_SOURCE_3',
    'NAME_TYPE_SUITE', # categorical
    'NAME_INCOME_TYPE', # categorical
]
    
# Create the predictions and add them to the input dataframe.
input_data = input_data.assign(prediction=pipe.predict_proba(input_data[feature_cols])[:,1],
                               time=datetime.datetime.utcnow())

# Create our final result dataframe
out_data = input_data[['SK_ID_CURR', 'prediction','time']]

# Upload it to BigQuery.
bq_table = 'simonyi_ml.prediction_scores'
pandas_gbq.to_gbq(dataframe=out_data,
                  destination_table=bq_table,
                  project_id='norbert-liki-sandbox',   ## CHANGE THIS TO YOUR PROJECT_ID
                  if_exists='append')

print('Success.')
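
The script above hardcodes the project ID, so every participant has to edit it by hand. One possible alternative is sketched below, under the assumption that an environment variable named `PROJECT_ID` is set wherever the script runs (the notebook or the container). The helper `resolve_project_id` and the demo value are hypothetical.

```python
import os


def resolve_project_id(default=None):
    """Return the project ID from the PROJECT_ID environment variable (assumed name)."""
    project_id = os.environ.get("PROJECT_ID", default)
    if not project_id:
        raise RuntimeError("Set the PROJECT_ID environment variable")
    return project_id


os.environ["PROJECT_ID"] = "my-demo-project"  # hypothetical value for the demo
print(resolve_project_id())
```

The returned value could then be passed as `project_id` to `pandas_gbq.to_gbq` instead of the literal string.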

#### Let's try to run our scoring script and check its results.

In [None]:
!python batch_prediction_src/scoring.py

In [None]:
import pandas_gbq

def validate_bq_results():
    bq_table = 'simonyi_ml.prediction_scores'
    query = f"select * from {bq_table}"

    check_df = pandas_gbq.read_gbq(query,project_id=PROJECT_ID)
    print(check_df.head())
    print('-'*15, 'Prediction records by date.')
    print(check_df.groupby('time').size())
    
validate_bq_results()

We append a single line to the end of the script so that the VM is stopped when the scoring has finished.

In [None]:
%%writefile batch_prediction_src/scoring.py -a

atexit.register(kill_vm)
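
Because the hook is registered on the last line of the script, an exception raised earlier means the registration is never reached and the VM keeps running. The sketch below illustrates how `atexit` behaves when the hook is registered before the work starts. It runs a small child script in a separate Python process, and the `kill_vm` defined there is a stand-in that only prints, not the function from `shutdown.py`.

```python
import subprocess
import sys
import textwrap

# Child script: registers a stand-in shutdown hook first, then fails while "scoring".
CHILD = textwrap.dedent("""
    import atexit

    def kill_vm():  # hypothetical stand-in for shutdown.kill_vm
        print("kill_vm called")

    atexit.register(kill_vm)  # registered before any work starts
    raise RuntimeError("scoring failed")
""")

result = subprocess.run(
    [sys.executable, "-c", CHILD], capture_output=True, text=True
)
print(result.stdout.strip())
print("exit code:", result.returncode)
```

The hook still runs although the child process exits with an error, so registering it near the top of `scoring.py` would also cover failed runs. Whether that is desirable depends on whether you want to inspect a failed VM before it shuts down.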

### 2.2 Packaging our scoring logic in a Docker container

We are going to use Docker to package up our scoring function.
Docker has several benefits and its use is widespread in the IT industry:
- Manage applications, not machines
- Code works the same everywhere:
  + Across dev, test and production
  + Across bare-metal, VMs and cloud
- Packaged apps speed development:
  + Agile creation and deployment
  + Continuous integration and delivery

![alt text](batch_prediction_src/container.PNG "Title")

In [None]:
%%writefile batch_prediction_src/Dockerfile
FROM continuumio/miniconda3 as builder

First we define the base container image that we are going to use. Docker images consist of layers that we can stack and build upon, which lets us reuse existing images within an organization. For example, we can create a base image for our organization that holds all the security options preconfigured, so we do not have to manage them ourselves and only need to care about our application.

In our case we start from a miniconda image, which has the essentials installed for our Python data project.

Next we add a file to our container and create our Python environment:
- **conda_env.yaml** contains the conda environment description used to create our Python environment

In [None]:
%%writefile batch_prediction_src/Dockerfile -a

ADD conda_env.yaml /
RUN conda env create -f conda_env.yaml && \
    conda clean -a -y

To reduce image size, we build the container in two steps. Next we add more files to our image:
- **scoring.py** contains the main scoring logic in Python
- **shutdown.py** contains the logic for shutting down a Compute Engine instance
- **trained_pipe.pkl** is our trained scikit-learn model

In [None]:
%%writefile batch_prediction_src/Dockerfile -a

FROM builder
ADD scoring.py shutdown.py trained_pipe.pkl /

Then we define an ENTRYPOINT for our image. It makes the image executable: the command runs once the container is started.

In [None]:
%%writefile batch_prediction_src/Dockerfile -a

ENTRYPOINT ["conda", "run", "-n", "simonyi_workshop", "python", "scoring.py"]

### 2.3 Building our container with Cloud Build

Once our container definition is ready, we need to build it. We can use the Cloud Build service for that. It will store the image in Google Container Registry for later use. Using Cloud Build is easy, but we need a bit of configuration:

1. Create *cloudbuild.yml* to define our build steps and output images
2. Add *.gcloudignore* to explicitly include our trained_pipe.pkl in the upload; otherwise it would be excluded
3. Submit the build using the *gcloud CLI*

In [None]:
%%writefile batch_prediction_src/cloudbuild.yml

steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/home_credit_scoring', 'batch_prediction_src/']
options:
  machineType: 'N1_HIGHCPU_8'
tags: ['latest']
images: ['gcr.io/$PROJECT_ID/home_credit_scoring']

In [None]:
%%writefile .gcloudignore

!trained_pipe.pkl
*.ipynb
.ipynb_checkpoints
__pycache__

Submit the build with the command below.
In the meantime, open Cloud Build in the console and observe the build status there.



In [None]:
!gcloud builds submit --config batch_prediction_src/cloudbuild.yml  --project=$PROJECT_ID

## 3. Deploying the image to a Compute Engine instance

We can deploy our freshly built container as a service running in the background with a single gcloud command.
Note that we are using the `--preemptible` flag to save money: these instances cost almost 80% less than normal ones. On the other hand, they can be shut down at any time and stay up for at most 24 hours.

**You can use this logic to create any long running computation in the background.**

In [None]:
%%bash
# CHANGE THE PROJECT_ID IN THE IMAGE PATH BELOW TO YOUR PROJECT
# (a comment after a trailing backslash would break the line continuation)
gcloud compute instances create-with-container credit-scoring \
--container-image="gcr.io/norbert-liki-sandbox/home_credit_scoring:latest" \
--zone=europe-west4-a \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--maintenance-policy=TERMINATE \
--preemptible
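
The command above contains the project ID as a literal, which is easy to forget to change. As a minimal sketch, the same command could be assembled in Python from the `PROJECT_ID` variable defined at the top of the notebook. The helper `build_deploy_command` and the demo project name are hypothetical; the flags are the ones used in the cell above, and the sketch only prints the command without running it.

```python
import shlex


def build_deploy_command(project_id, instance="credit-scoring", zone="europe-west4-a"):
    """Assemble the gcloud deployment command as an argument list."""
    image = f"gcr.io/{project_id}/home_credit_scoring:latest"
    return [
        "gcloud", "compute", "instances", "create-with-container", instance,
        f"--container-image={image}",
        f"--zone={zone}",
        "--scopes=https://www.googleapis.com/auth/cloud-platform",
        "--maintenance-policy=TERMINATE",
        "--preemptible",
    ]


cmd = build_deploy_command("my-demo-project")  # hypothetical project name
print(shlex.join(cmd))
```

To actually deploy, the list could be passed to `subprocess.run(cmd, check=True)` from a machine where the gcloud CLI is installed and authenticated.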

In [None]:
validate_bq_results()