## Packaging Champion Model (MLeap Flavor) for GCP deployment

This notebook walks through the process of:

1. Writing and running PySpark jobs on Cloud Dataproc to deploy the model in batch
2. Saving the model with MLflow (MLeap flavor)
3. Storing the model in GitHub

#### Author: 

**Nardini, Ivan - Sr. Customer Advisor | CI & Analytics Team | ModelOps & Decisioning**

## Setup

MLeap needs its jar files to be available inside `SPARK_HOME/jars`. Some of them are:

1. mleap-spark-base_2.11-0.7.0.jar
2. mleap-core_2.11-0.7.0.jar
3. mleap-runtime_2.11-0.7.0.jar
4. mleap-spark_2.11-0.7.0.jar
5. bundle-ml_2.11-0.7.0.jar
6. config-0.3.0.jar
7. scalapb-runtime_2.11-0.6.1.jar
8. mleap-tensor_2.11-0.7.0.jar

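The MLeap Python API is then installed with pip (the `mleap` package). The jar names above refer to MLeap 0.7.0, whereas the install cell below pins `mleap==0.15.0`; use jar and Python package versions that match each other.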
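
Before starting Spark, it can help to confirm which of these jars are actually present. The following is a minimal sketch, assuming the jars live in `$SPARK_HOME/jars` as stated above. The helper name `missing_jars` is hypothetical, and the list only covers the jars named in this section.

```python
import os
from pathlib import Path

# Jar names copied from the list above (the notebook says "some of them",
# so this is not necessarily exhaustive).
REQUIRED_JARS = [
    "mleap-spark-base_2.11-0.7.0.jar",
    "mleap-core_2.11-0.7.0.jar",
    "mleap-runtime_2.11-0.7.0.jar",
    "mleap-spark_2.11-0.7.0.jar",
    "bundle-ml_2.11-0.7.0.jar",
    "config-0.3.0.jar",
    "scalapb-runtime_2.11-0.6.1.jar",
    "mleap-tensor_2.11-0.7.0.jar",
]


def missing_jars(jars_dir, required=REQUIRED_JARS):
    """Return the required jar file names that are not found in jars_dir."""
    present = {p.name for p in Path(jars_dir).glob("*.jar")}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home is None:
        print("SPARK_HOME is not set")
    else:
        for name in missing_jars(Path(spark_home) / "jars"):
            print("Missing jar:", name)
```

An empty result from `missing_jars` only means the files exist; it does not check that their versions are compatible with the installed Python package.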

In [2]:
# Check whether pyspark is already installed
# !pip freeze

In [6]:
!pip install mleap==0.15.0
!pip install pyspark==2.4.5

Collecting pyspark
  Downloading pyspark-2.4.5.tar.gz (217.8 MB)
Collecting py4j==0.10.7
  Downloading py4j-0.10.7-py2.py3-none-any.whl (197 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.5-py2.py3-none-any.whl size=218257927 sha256=aa43968732d3d43923a0e4cd6297a76f9bf061fa63616cebe8e7ca3e1dd2425d
  Stored in directory: /home/jupyter/.cache/pip/wheels/01/c0/03/1c241c9c482b647d4d99412a98a5c7f87472728ad41ae55e1e
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.5


In [None]:
# Restart the kernel so that the newly installed packages are picked up
import os
os._exit(0)

Let's create a simple MLflow project programmatically in three steps:

1. Create a bucket

2. Create the PySpark job for scoring: score.py

3. Create the .sh entrypoint file to:

    - Create a Spark cluster
    - Install MLeap and its jars
    - Run the batch scoring job based on score.py in the cloud bucket
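
The cluster and job steps of the entrypoint could be scripted roughly as follows. This is only a sketch: it builds `gcloud dataproc` command lines as Python lists without running them. The cluster name, the `gs://` object paths and the helper names are hypothetical placeholders, not values taken from this notebook, and the MLeap/jar installation step is not covered.

```python
def create_cluster_cmd(cluster, region):
    """Command line that creates a Dataproc cluster with default settings."""
    return ["gcloud", "dataproc", "clusters", "create", cluster,
            f"--region={region}"]


def submit_scoring_cmd(cluster, region, script_uri,
                       input_path, model_path, output_path):
    """Command line that submits score.py as a PySpark job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
        f"--cluster={cluster}",
        f"--region={region}",
        "--",  # arguments after this separator are passed to score.py
        "--input", input_path,
        "--model", model_path,
        "--output", output_path,
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(create_cluster_cmd("demo-cluster", "europe-west1")))
    print(" ".join(submit_scoring_cmd(
        "demo-cluster", "europe-west1",
        "gs://my-bucket/score.py",
        "gs://my-bucket/data.csv",
        "/tmp/model.zip",
        "/tmp/scored.csv",
    )))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the placeholder values have been replaced with real ones.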

## 1. Create a Bucket

In [10]:
# change these to try this notebook out
BUCKET = 'cloud-demo-databrick-gcp'
PROJECT = 'gel-sassandbox'
REGION = 'europe-west1'

In [11]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

# print(os.environ)

In [13]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
    gsutil mb -l ${REGION} gs://${BUCKET}
fi

Creating gs://cloud-demo-databrick-gcp/...
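
The same existence check can be expressed in Python when the bucket listing is already at hand. This sketch assumes `listing` holds the text printed by `gsutil ls` with no arguments (one `gs://name/` URL per line); `bucket_in_listing` is a hypothetical helper.

```python
def bucket_in_listing(listing, bucket):
    """Return True if gs://<bucket>/ appears as a full line of the listing."""
    urls = {line.strip() for line in listing.splitlines()}
    return f"gs://{bucket}/" in urls


if __name__ == "__main__":
    sample = "gs://some-other-bucket/\ngs://cloud-demo-databrick-gcp/\n"
    print(bucket_in_listing(sample, "cloud-demo-databrick-gcp"))
```

Unlike the `grep` pattern, this comparison is an exact string match, so dots in a bucket name are not interpreted as regular-expression wildcards.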


## 2. Create pyspark job for scoring

In [3]:
%%writefile score.py
#!/usr/bin/python

import argparse
import warnings

# Importing the MLeap modules adds serializeToBundle/deserializeFromBundle
# to pyspark transformers, so these imports are needed even if unused by name.
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml import PipelineModel
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType


def read_data_csv(spark, inputPath_CSV):
    
    '''
    Function to load data in the Spark Session 
    :param spark: Spark session 
    :param inputPath_CSV: path of the CSV file to read 
    :return: df, a pyspark DataFrame with the schema defined below
    '''
    
    print('Trying to read the data...')
    
    try:
        schema = StructType([
          StructField('crim',DoubleType(),True),
          StructField('zn',DoubleType(),True),
          StructField('indus',DoubleType(),True),
          StructField('chas',IntegerType(),True),
          StructField('nox',DoubleType(),True),
          StructField('rm',DoubleType(),True),
          StructField('age',DoubleType(),True),
          StructField('dis',DoubleType(),True),
          StructField('rad',IntegerType(),True),
          StructField('tax',IntegerType(),True),
          StructField('ptratio',DoubleType(),True),
          StructField('b',DoubleType(),True),
          StructField('lstat',DoubleType(),True),
          StructField('medv',DoubleType(),True)]
        )
        
        df = (spark.read
          .option("HEADER", True)
          .schema(schema)
          .csv(inputPath_CSV))
    
    except ValueError:
        print('At least one variable has the wrong format! Please check the data')
        # Re-raise so the job stops here instead of returning None silently
        raise
      
    else:
        print('Data to score have been read successfully!')
        return df

def preprocessing(df):

    '''
    Function to preprocess data 
    :param df: A pyspark DataFrame 
    :return: abt_to_score
    '''
    
    print('Data preprocessing...')

    # Every column except the last one (medv, the target) is used as a feature
    features = df.schema.names[:-1]
    assembler_features = VectorAssembler(inputCols=features, outputCol="features")
    abt_to_score = assembler_features.transform(df)
    
    print('Data have been processed successfully!')
    return abt_to_score

def score_data(abt_to_score, modelPath):
    
    '''
    Function to score data 
    :param abt_to_score: A pyspark DataFrame to score
    :param modelPath: Path to the .zip MLeap bundle of the model
    :return: scoredData
    '''
    print('Scoring process starts...')
    
    deserializedPipeline = PipelineModel.deserializeFromBundle("jar:file:{}".format(modelPath))
    scoredData = deserializedPipeline.transform(abt_to_score)
    return scoredData  
  
def write_output_csv(scoredData, outputPath_CSV):
    '''
    Function to write predictions
    :param scoredData: A pyspark DataFrame of predictions
    :param outputPath_CSV: The path to write the output table
    :return: the predictions as a dict (column -> {row index -> value})
    '''
    print('Writing Prediction in {}'.format(outputPath_CSV))
    # Collect to pandas once and reuse it for both the CSV file and the dict
    scored_pdf = scoredData.toPandas()
    scored_pdf.to_csv(outputPath_CSV, sep=',', index=False)
    return scored_pdf.to_dict()

def evaluator(predictions):
    
    '''
    Function to produce some evaluation stats
    :param predictions: A pyspark DataFrame of predictions
    :return: rmse, mse, r2, mae
    '''
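    # RegressionEvaluator defaults to RMSE; the other metrics are requested explicitly.
    # The local name differs from the function name to avoid shadowing it.
    reg_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="medv")
    rmse = reg_evaluator.evaluate(predictions)
    mse = reg_evaluator.evaluate(predictions, {reg_evaluator.metricName: "mse"})
    r2 = reg_evaluator.evaluate(predictions, {reg_evaluator.metricName: "r2"})
    mae = reg_evaluator.evaluate(predictions, {reg_evaluator.metricName: "mae"})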
    
    return rmse, mse, r2, mae

def main():
    
    parser = argparse.ArgumentParser(description='Score')
    
    parser.add_argument('--input', dest="inputpath_CSV",
                        required=True, help='Provide the input path of data to score')
    
    parser.add_argument('--model', dest="modelPath",
                        required=True, help='Provide the model path to score')
    
    parser.add_argument('--output', dest="outputpath_CSV",
                        required=True, help='Provide the output path for the scored data')

    args = parser.parse_args()
    input_path_CSV = args.inputpath_CSV
    modelPath = args.modelPath
    output_path_CSV = args.outputpath_CSV
  
    try:
#         spark = SparkSession \
#         .builder \
#         .master(SPARK_MASTER) \
#         .config('spark.executor.memory', TOTAL_MEMORY) \
#         .config('spark.cores.max', TOTAL_CORES) \
#         .config('spark.jars.packages',
#                 'ml.combust.mleap:mleap-spark-base_2.11:0.9.3,ml.combust.mleap:mleap-spark_2.11:0.9.3') \
#         .appName("RegressionScoring") \
#         .getOrCreate()
        spark = SparkSession.builder.appName('RegressionScoring').getOrCreate()
        spark.sparkContext.setLogLevel("OFF")
        print('Created a SparkSession')
    
    except ValueError:
        warnings.warn('Check')
  
    #Read data
    data_to_process = read_data_csv(spark, input_path_CSV)
    #Preprocessing
    abt = preprocessing(data_to_process)
    #Scoring
    abt_scored = score_data(abt, modelPath)
    #Write data
    write_output_csv(abt_scored, output_path_CSV)
    #Evaluate Model
    evalstats = evaluator(abt_scored)
    return evalstats
    
    
if __name__=="__main__":
    
    stats = main()
    print('-'*20)
    print('Process Log')
    print('-'*20)
    print('Scoring Job ends successfully!')
    print("RMSE for the model: {}".format(stats[0]))
    print("MSE for the model: {}".format(stats[1]))
    print("R2 for the model: {}".format(stats[2]))
    print("MAE for the model: {}".format(stats[3]))
    print('Look at the Storage Bucket to get predictions!')
    

Overwriting score.py
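
`preprocessing` treats every column except the last one as a feature, so the column order of the input file matters. Below is a small sketch of a pre-flight header check in plain Python. It assumes the CSV header uses the same names as the schema in `read_data_csv` (compared case-insensitively); `EXPECTED_COLUMNS` and `check_header` are hypothetical names and are not part of score.py.

```python
import csv
import io

# Column names copied from the schema in read_data_csv, in the same order.
EXPECTED_COLUMNS = [
    "crim", "zn", "indus", "chas", "nox", "rm", "age",
    "dis", "rad", "tax", "ptratio", "b", "lstat", "medv",
]

# Same slice as df.schema.names[:-1]: everything except the target medv.
FEATURE_COLUMNS = EXPECTED_COLUMNS[:-1]


def check_header(csv_text, expected=EXPECTED_COLUMNS):
    """Return True if the first CSV row matches the expected column names."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [name.strip().lower() for name in header] == list(expected)


if __name__ == "__main__":
    sample = ",".join(EXPECTED_COLUMNS) + "\n"
    print(check_header(sample), len(FEATURE_COLUMNS))
```

A check like this could run on the first line of the file before the Spark job is submitted, so that a reordered or truncated file is caught early.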


### Test score.py

In [2]:
%%bash
python score.py --input "/home/jovyan/work/1_data/boston_house_prices.csv" \
    --model "/home/jovyan/work/2_notebooks/output/ModelProjects_Boston_ML_lrModel.zip" \
    --output "/home/jovyan/work/1_data/boston_house_prices_scored.csv"

Created a SparkSession
Trying to read the data...
Data to score have been read successfully!
Data preprocessing...
Data have been processed successfully!
Scoring process starts...
Writing Prediction in /home/jovyan/work/1_data/boston_house_prices_scored.csv
--------------------
Process Log
--------------------
Scoring Job ends successfully!
RMSE for the model: 4.696684029858866
MSE for the model: 22.05884087633132
R2 for the model: 0.7386998714429953
MAE for the model: 3.3284024432759862
Look at the Storage Bucket to get predictions!


20/04/17 08:57:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
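
For reference, the four statistics reported in the process log can be reproduced on plain Python lists. The sketch below uses the standard definitions and does not call Spark; `regression_metrics` is a hypothetical helper, and the sample values are made up for illustration.

```python
import math


def regression_metrics(labels, predictions):
    """Return (rmse, mse, r2, mae) for two equally long lists of numbers."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return rmse, mse, r2, mae


if __name__ == "__main__":
    # Toy data, not taken from the Boston dataset.
    print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

A quick consistency check on the log above: the square root of the reported MSE (22.05884087633132) is approximately 4.6967, which agrees with the reported RMSE.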
