The purpose of this notebook is to use our trained model to generate predictions that may be imported into a downstream CRM system.  It should be run on a cluster leveraging Databricks ML 7.1+ and **CPU-based** nodes.

###Step 1: Retrieve Data for Scoring

The purpose of training our churn prediction model is to identify target customers for proactive retention management. As such, we need to periodically make predictions from feature information and make those predictions available within the systems supporting such campaigns.

With this in mind, we'll examine how we might retrieve our recently trained model and use it to generate scored output which can be imported into Salesforce, Microsoft Dynamics and many other systems accepting custom data imports.  While there are multiple paths for the integration of such output with these systems, we'll explore the simplest, *i.e.* a flat-file export.

To get started, we'll first retrieve feature data associated with the period for which we intend to make predictions.  Given we trained our model on February 2017 data and evaluated it on March 2017 data, it would make sense for us to generate prediction output for April 2017.  That said, we want to avoid stepping on the toes of the Kaggle competition associated with this dataset, so we'll limit ourselves to generating March 2017 prediction output.

Unlike in previous notebooks, we'll limit data retrieval to features and a customer identifier, ignoring the churn labels, as we would not have these if we were making actual future predictions. We'll load the data first into a Spark DataFrame and then into a pandas DataFrame so that we might demonstrate two different techniques for generating output, each of which depends on a different DataFrame type:

In [0]:
churningThreshold = int(dbutils.widgets.get("Churning Threshold"))

In [0]:
import mlflow
import mlflow.pyfunc

import pandas as pd
import shutil, os

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import struct

In [0]:
# retrieve features & identifier to Spark DataFrame
input = spark.sql('''
  SELECT
    a.*,
    b.days_total,
    b.days_with_session,
    b.ratio_days_with_session_to_days,
    b.days_after_exp,
    b.days_after_exp_with_session,
    b.ratio_days_after_exp_with_session_to_days_after_exp,
    b.sessions_total,
    b.ratio_sessions_total_to_days_total,
    b.ratio_sessions_total_to_days_with_session,
    b.sessions_total_after_exp,
    b.ratio_sessions_total_after_exp_to_days_after_exp,
    b.ratio_sessions_total_after_exp_to_days_after_exp_with_session,
    b.seconds_total,
    b.ratio_seconds_total_to_days_total,
    b.ratio_seconds_total_to_days_with_session,
    b.seconds_total_after_exp,
    b.ratio_seconds_total_after_exp_to_days_after_exp,
    b.ratio_seconds_total_after_exp_to_days_after_exp_with_session,
    b.number_uniq,
    b.ratio_number_uniq_to_days_total,
    b.ratio_number_uniq_to_days_with_session,
    b.number_uniq_after_exp,
    b.ratio_number_uniq_after_exp_to_days_after_exp,
    b.ratio_number_uniq_after_exp_to_days_after_exp_with_session,
    b.number_total,
    b.ratio_number_total_to_days_total,
    b.ratio_number_total_to_days_with_session,
    b.number_total_after_exp,
    b.ratio_number_total_after_exp_to_days_after_exp,
    b.ratio_number_total_after_exp_to_days_after_exp_with_session
  FROM kkbox.test_trans_features a
  INNER JOIN kkbox.test_act_features b
    ON a.msno=b.msno
  ''')

# extract features to pandas DataFrame
input_pd = input.toPandas()
X = input_pd.drop(['msno'], axis=1) # features for making predictions
msno = input_pd[['msno']] # customer identifiers to which we will append predictions

###Load model previously registered

In [0]:
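# name under which the model was registered (not used by the run-based URI below)
model_name = 'e2e-demo-churning'

# replace the placeholder with the ID of the MLflow run that logged the model
model = mlflow.pyfunc.load_model('runs:/YOUR RUN ID MODEL/model')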
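
The heading refers to a registered model, while the cell above loads the model from a run URI. As a minimal sketch, assuming the model was registered in the MLflow Model Registry under `model_name`, a registry URI of the form `models:/<name>/<version>` could be passed to `mlflow.pyfunc.load_model` instead. The helper `registry_model_uri` and the version number `1` are hypothetical and not part of the original notebook.

```python
def registry_model_uri(name, version):
    """Build an MLflow Model Registry URI of the form models:/<name>/<version>."""
    return f"models:/{name}/{version}"

# version 1 is an assumption; use whichever version was actually registered
uri = registry_model_uri("e2e-demo-churning", 1)
print(uri)

# in the notebook, this URI could then replace the run-based one:
# model = mlflow.pyfunc.load_model(uri)
```

Loading by registry URI avoids hard-coding a run ID, at the cost of depending on the registry entry existing in the workspace.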

In [0]:
# databricks location for the output file
output_path = '/mnt/churning/synapse/tables/predictions/'
shutil.rmtree('/dbfs'+output_path, ignore_errors=True) # delete folder & contents if exists
dbutils.fs.mkdirs(output_path) # recreate folder

# generate predictions
y_prob = model.predict(X)

# assemble output dataset
output = pd.concat([
    msno, 
    pd.DataFrame(y_prob, columns=['churn'])
    ], axis=1
  )
output['period']='2017-03-01'

# keep only the customers predicted to churn
output = output[(output.churn == 1)]

#predictions = spark.createDataFrame(output)
predictions = spark.createDataFrame(output).limit(churningThreshold)

# small amount of data, so write it out as a single file
predictions.repartition(1).write.mode('overwrite').parquet(output_path)

In [0]:
# example: write the same predictions in Delta format, which Azure Synapse serverless can read
output_path_delta = '/mnt/churning/synapse/tables/delta/predictions/'
shutil.rmtree('/dbfs'+output_path_delta, ignore_errors=True) # delete folder & contents if exists
dbutils.fs.mkdirs(output_path_delta)
predictions.repartition(1).write.mode('overwrite').format("delta").save(output_path_delta)
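
The introduction describes a flat-file export as the simplest integration path, while the cells above write Parquet and Delta. As a minimal sketch, assuming the target CRM accepts CSV with a header row, the same three columns could be serialized with Python's standard `csv` module. The function name `predictions_to_csv` and the sample rows are hypothetical; in the notebook the rows would come from the `output` pandas DataFrame.

```python
import csv
import io

def predictions_to_csv(rows, fieldnames=("msno", "churn", "period")):
    """Serialize prediction rows (dicts) to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# hypothetical sample rows, shaped like the notebook's output columns
sample_rows = [
    {"msno": "customer-a", "churn": 1, "period": "2017-03-01"},
    {"msno": "customer-b", "churn": 1, "period": "2017-03-01"},
]
csv_text = predictions_to_csv(sample_rows)
print(csv_text)
```

With a pandas DataFrame, `output.to_csv(path, index=False)` would produce an equivalent file without the intermediate list of dicts.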

######We can inspect the results of the predictions

In [0]:
display(predictions)

msno,churn,period
/v8rP9CCwSo5n6M4sGRSjf3vNXgh+oRX6yLEGhbj4Po=,1,2017-03-01
/ztvahZ/ayo/o5S9tSszQ05LSH+FV2O4OaRRRV3YneA=,1,2017-03-01
0+R9FUdBcpUTJ0tZzm/BzUgbaJZcRKbY+wmJFsL3LbM=,1,2017-03-01
0AZgLIsRHl/oC2zqAsJ52CzghoXwpFFUklCtPLD4J4s=,1,2017-03-01
0JebV1vXh7Ufbuc+WActALZrYLQYpJq3B86ZZV60bV4=,1,2017-03-01
0Vw7PPAEpG+EhTzK6Lr9Fa5bP6DTv/WMChUJXr208Yw=,1,2017-03-01
0chT5agYBPTYbrOr0IVAc/6QbgXkFPyAK4yhHrPB8cI=,1,2017-03-01
0wfPO25Zm41W+IsiBZs9C7rEFQhUgGYMPhZD1I3Ca6U=,1,2017-03-01
1/ofYQ52I5fqq8VXcKnCTQQ07rmjqaAdWNP0hO6cBQM=,1,2017-03-01
20NH49L58FyEr4BSqNhDx0bwE7DCAXCnJUJ44i9LloE=,1,2017-03-01
