# Create and Test Scoring Pipeline 




Before executing this notebook on IBM Cloud, you need to:<br>
1) Insert a project token: click **More -> Insert project token** in the top-right menu section and run the cell. <br>
![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)
2) Provide your IBM Cloud API key in the subsequent cell.<br>
3) Step through the notebook cell by cell by pressing Shift-Enter, or execute the entire notebook by selecting **Cell -> Run All** from the menu.<br>



#### Insert IBM Cloud API key
Your Cloud API key can be generated by going to the [API Keys section of the Cloud console](https://cloud.ibm.com/iam/apikeys). From that page, scroll down to the API Keys section, and click Create an IBM Cloud API key. Give your key a name and click Create, then copy the created key and paste it below. 

If you are running this notebook on Cloud Pak for Data on-premises, leave the `ibmcloud_api_key` field blank.

In [1]:
ibmcloud_api_key = ''

In [4]:
try:
    project
except NameError:
    # READING AND WRITING PROJECT ASSETS
    import project_lib
    project = project_lib.Project() 

### Introduction

Now that we have built the machine learning models, stored and deployed them, we can use the models to score new data. 

In the first part of the notebook we will:

* Programmatically get the IDs for the deployment space and model deployment that were created in the **1-model-training** notebook
* Promote assets required for scoring new data into the deployment space
* Create a deployable function which will take raw data for scoring, prep it into the format required for the model and score it
* Deploy the function
* Create the required payload, invoke the deployed function and return predictions


In [5]:
import pandas as pd
import datetime
from ibm_watson_machine_learning import APIClient
import os



if ibmcloud_api_key != '':
    wml_credentials = {
        "apikey": ibmcloud_api_key,
        "url": 'https://' + os.environ['RUNTIME_ENV_REGION']  + '.ml.cloud.ibm.com'
    }
else:
    token = os.environ['USER_ACCESS_TOKEN']
    wml_credentials = {
        "token": token,
        "instance_id" : "openshift",
        "url": os.environ['RUNTIME_ENV_APSX_URL'],
        "version": "3.5"
     }
client = APIClient(wml_credentials)



### User Inputs

Enter the path to the csv file with raw data to be scored.

In [6]:
# specify the name of the csv file with the raw data that we would like to score
filename = 'Bill Payment View.csv'
# copy the file from the project storage to the local file system
with open(filename, 'wb') as f:
    f.write(project.get_file(filename).getbuffer())

### Set up Deployment Space, Deployments and Assets

The following code programmatically gets the deployment space and the model deployment details which were created in **1-model-training**. <br>

We use the space name and deployment name that were used when creating the deployments, as specified below. If multiple spaces with the same name exist, the code uses the last matching space returned by the client. If multiple deployments within the selected space have the same name, the most recently created deployment is used.

Alternatively, you can manually enter the space and deployment GUIDs.

The code also promotes some assets into the deployment space; specifically, the dataset with raw data for scoring, the metadata that was stored in **1-model-training** and the transformer object that prepares the data. After being promoted into the deployment space, these assets are available and can be accessed by the deployed function.

In [7]:
space_name = 'Utilities Payment Risk Prediction Space'
model_name = 'utilities_payment_risk_prediction_model'
deployment_name = 'utilities_payment_risk_prediction_model_deployment'


Get the space we are working in (found using the name that was hardcoded in 1-model-training). If you would like to use a different space, manually set the space_id.

Set the space as the default space for working.

In [8]:
space_id = None
for space_details in client.spaces.get_details()['resources']:
    if space_details['entity']['name'] == space_name:
        space_id = space_details['metadata']['id']

# fail early with a clear message if the space does not exist
if space_id is None:
    raise ValueError("No deployment space named '" + space_name + "' was found")

# set this space as default space
client.set.default_space(space_id)

'SUCCESS'

Get the deployment id. If there are multiple deployments with the same name in the same space, we take the latest.

In [9]:
l_deployment_details = []
l_deployment_details_created_times = []

for deployment in client.deployments.get_details()['resources']:
    if deployment['entity']['name'] == deployment_name:
        l_deployment_details.append(deployment)
        l_deployment_details_created_times.append(datetime.datetime.strptime(deployment['metadata']['created_at'], '%Y-%m-%dT%H:%M:%S.%fZ'))

# get the index of the latest created date from the list and use that to get the deployment_id
list_latest_index = l_deployment_details_created_times.index(max(l_deployment_details_created_times))
deployment_id = l_deployment_details[list_latest_index]['metadata']['id']
print("Deployment ID of",deployment_name,"is",deployment_id)

Deployment ID of utilities_payment_risk_prediction_model_deployment is c99e10a9-1f0e-47f1-a6d9-895a7a409ecf
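
The "take the latest" selection above can be tried out without a Watson Machine Learning connection. The following is a minimal sketch on plain dictionaries. The records, IDs and the helper name `latest_id_by_name` are hypothetical and only mimic the `metadata` / `entity` layout used in the cell above. They are not part of the client library.

```python
import datetime

# Hypothetical records shaped like the 'resources' entries read above
resources = [
    {"metadata": {"id": "dep-old", "created_at": "2021-01-05T10:15:30.123Z"},
     "entity": {"name": "my_deployment"}},
    {"metadata": {"id": "dep-new", "created_at": "2021-03-20T08:00:00.456Z"},
     "entity": {"name": "my_deployment"}},
    {"metadata": {"id": "dep-other", "created_at": "2021-04-01T00:00:00.000Z"},
     "entity": {"name": "another_deployment"}},
]

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # same format string as in the cell above


def latest_id_by_name(records, name):
    """Return the id of the most recently created record with the given name."""
    matches = [r for r in records if r["entity"]["name"] == name]
    if not matches:
        # an explicit error is easier to read than max() failing on an empty list
        raise ValueError("No resource named " + repr(name) + " was found")
    latest = max(
        matches,
        key=lambda r: datetime.datetime.strptime(r["metadata"]["created_at"], TIMESTAMP_FORMAT),
    )
    return latest["metadata"]["id"]


latest_id = latest_id_by_name(resources, "my_deployment")
print(latest_id)
```

Using `max` with a `key` function avoids keeping two parallel lists in sync, and the explicit check gives a readable error when no deployment matches the name.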


Promote the assets into the deployment space. The transformer which preps the data is promoted into the space. We also store the raw data dataset in the deployment space.

In [10]:
# get the transformer object created in training - this file was saved as .txt so that the mimetype could be recognised when creating the asset
transformer_asset_details = client.data_assets.create('preprocessor_transformer.joblib', file_path='preprocessor_transformer.txt')

dataset_asset_details = client.data_assets.create(filename, file_path=filename)

transformer_id = transformer_asset_details['metadata']['guid']
dataset_id = dataset_asset_details['metadata']['guid']

Creating data asset...
SUCCESS
Creating data asset...
SUCCESS


## Create the Deployable Function

Functions can be deployed in Watson Machine Learning in the same way as models. The Python client or REST API can be used to send data to the deployed function. A deployed function lets us prepare the data and pass it to the model for scoring in a single call.

We start off by creating the dictionary of default parameters to be passed to the function. We get the ID of the assets that have been promoted into the deployment space. We also add the model deployment ID and space ID into the dictionary.

In [12]:
assets_dict = {'dataset_asset_id' : dataset_id, 'dataset_name' : filename, 'transformer_asset_id' : transformer_id}

In [13]:
# create the wml_credentials again. After already creating the client using the credentials, the instance_id gets updated to 999
# re-create the dictionary so that the correct instance_id is used
if ibmcloud_api_key=="":
    wml_credentials["instance_id"] = "openshift"

ai_parms = {'wml_credentials' : wml_credentials, 'space_id' : space_id, 'assets' : assets_dict, 'model_deployment_id' : deployment_id}

### Scoring Pipeline Function

The function below takes a customer ID and billing date as payload. It preps the customer raw data, loads the model, executes model scoring and generates predictions.

In [14]:
def scoring_pipeline(parms=ai_parms):
     
    import pandas as pd
    import requests
    import os
    import json
    import joblib
    
    from ibm_watson_machine_learning import APIClient

    client = APIClient(parms["wml_credentials"])
    client.set.default_space(parms['space_id'])

    # call the function to download the stored dataset asset and return the path
    dataset_path = client.data_assets.download(parms['assets']['dataset_asset_id'], parms['assets']['dataset_name'])
    df_raw = pd.read_csv(dataset_path)
    
    # call the function to download the transformer joblib file and return the path
    transformer_object_path = client.data_assets.download(parms['assets']['transformer_asset_id'], 'preprocessor_transformer.joblib')
    fitted_preprocessor = joblib.load(transformer_object_path)
    
    # get the ratio of the data passed in for this month vs the previous month and vs the average of the lookback window
    # e.g. how does this month's bill compare to the previous month's bill?
    def cur_month_vs_historical_summary(df, col, customer_id_col, lookback_window):
        # how does this month's data compare to the previous month?
        df[col + '_PREVIOUS_MONTH'] = df.groupby(customer_id_col)[col].shift(1)
        df['RATIO_THIS_MONTH_' + col + '_VS_LAST_MONTH'] = df[col] / df[col + '_PREVIOUS_MONTH']
        # how does this month's data compare to the average of the lookback window?

        # get the average of the lookback window
        df[col + '_AVG_LOOKBACK_WINDOW'] = df.groupby(customer_id_col)[col].shift(1).rolling(lookback_window).mean()
        df['RATIO_THIS_MONTH_' + col + '_VS_AVG_LOOKBACK_WINDOW'] = df[col] / df[col + '_AVG_LOOKBACK_WINDOW']

        df.drop([col + '_AVG_LOOKBACK_WINDOW', col + '_PREVIOUS_MONTH'], axis=1, inplace=True)

        return df

    def prep_data(cust_id, scoring_billing_month, user_inputs):

        target_col = user_inputs['target_col']
        overdue_balance_col = user_inputs['overdue_balance_col'] 
        billing_date_col  = user_inputs['billing_date_col']
        customer_id_col = user_inputs['customer_id_col'] 
        lookback_window = user_inputs['lookback_window']
        l_cols_to_summarise = user_inputs['l_cols_to_summarise']  
        
        # filter by customer id; copy so that the column assignments below do not act on a view of df_raw
        df_prep = df_raw[df_raw[customer_id_col] == cust_id].copy()

        # create the column for billing month
        df_prep['BILLING_MONTH'] = df_prep[billing_date_col].astype(str).str[0:6]
        df_prep['BILLING_MONTH'] = df_prep['BILLING_MONTH'].astype(int)
        
        # if any records exist after the billing month, remove them
        df_prep = df_prep[df_prep['BILLING_MONTH']<=scoring_billing_month]
        
        # sort by billing date
        df_prep = df_prep.sort_values('BILLING_MONTH')
    
        # shift the overdue balance back 1 record per customer to create our target variable
        # we need this variable for previous months so we can calculate if the customer missed payments in the lookback window
        df_prep[target_col] = df_prep.groupby(customer_id_col)[overdue_balance_col].shift(-1)

        # if we don't know if the customer missed their next payment we will just fill with 0
        df_prep[target_col] = df_prep[target_col].fillna(0)
        
        df_prep.loc[df_prep[target_col] != 0, target_col] = 1
    
        # create summary variables
        for col in l_cols_to_summarise:
            df_prep = cur_month_vs_historical_summary(df_prep, col, customer_id_col, lookback_window)

        # how many times has a customer missed a payment in the lookback period?
        df_prep['NUM_MISSED_PAYMENTS_LOOKBACK_WINDOW'] = df_prep.groupby(customer_id_col)[target_col].shift(1).rolling(lookback_window).sum()
        # now that we've calculated all of our features based on lookbacks, we can delete the old data
        # we are only interested in scoring the customer for the specified billing month
        df_prep = df_prep[df_prep['BILLING_MONTH']==scoring_billing_month]
        # drop the target column
        df_prep.drop(target_col, axis=1, inplace=True)
        # create the variable for the billing month
        df_prep['BILLING_MONTH_NUMBER'] = df_prep['BILLING_MONTH'].astype(str).str[4:6]
        # use transformer to prep the data
        X_postprocess = fitted_preprocessor.transform(df_prep)
        
        return X_postprocess
    
    def score(payload):
        import json
        
        scoring_billing_date = payload['input_data'][0]['values'][0][1]
        # convert into int with year and month
        scoring_billing_month = int(scoring_billing_date[3:7] + scoring_billing_date[0:2])
        
        cust_id = payload['input_data'][0]['values'][0][0]
        
        # we stored the metadata dictionary when we deployed the model, we retrieve it in the following line of code
        metadata_dict = client.deployments.get_details(parms['model_deployment_id'])['entity']['custom']  
        
        probability_threshold = metadata_dict['probability_threshold']
        pre_balancing_target_density = metadata_dict['training_data_pre_balancing_target_density']
        post_balancing_target_density = metadata_dict['training_data_post_balancing_target_density']
        
        try:
            prepped_data = prep_data(cust_id, scoring_billing_month, metadata_dict['user_inputs'])
        except Exception:
            return {"predictions" : [{'values' : 'Data prep filtered out customer data. Unable to score. Check that the billing date is valid for the input data (e.g. 06-2019).'}]}
            
        if prepped_data is None or prepped_data.shape[0] == 0:
            return {"predictions" : [{'values' : 'Data prep filtered out customer data. Unable to score. Check that the billing date is valid for the input data.'}]}
        else:
            scoring_payload = {"input_data":  [{ "values" : prepped_data.tolist()}]}
            predictions = client.deployments.score(parms['model_deployment_id'], scoring_payload)
            
            # we need to adjust the probabilities because we trained on the balanced dataset
            # also update the predicted class returned based on our threshold
            # by default the predicted class is based on 0.5 probability, we changed this based on ROC curve
            for idx, val in enumerate(predictions['predictions'][0]['values']):
                # adjust the probabilities
                predictions['predictions'][0]['values'][idx][1][1] = 1/(1+(1/pre_balancing_target_density-1)/(1/post_balancing_target_density-1)*(1/predictions['predictions'][0]['values'][idx][1][1]-1))
                predictions['predictions'][0]['values'][idx][1][0] = 1 - predictions['predictions'][0]['values'][idx][1][1]
                if predictions['predictions'][0]['values'][idx][1][1] >= probability_threshold:

                    predictions['predictions'][0]['values'][idx][0] = 1
                else:

                    predictions['predictions'][0]['values'][idx][0] = 0
                    
            predictions["Payment Risk"]=str(round(predictions['predictions'][0]['values'][idx][1][1]*100,2))+"%"
                
        
        return {"predictions" : [{'values' : predictions}]}
            
    return score


# scoring_pipeline()({"input_data":[{"fields":["CUSTOMER ID","Billing Date(%mm-%YYYY)"],"values":[[212,"06-2019"]]}]})
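
The lookback features computed by `cur_month_vs_historical_summary` can be illustrated on a single customer's history without pandas. This is a simplified sketch, not the deployed implementation. The function name `current_vs_history` and the sample values are hypothetical, and missing history is reported as `None` here, whereas the pandas version yields `NaN`.

```python
def current_vs_history(values, lookback_window):
    """Compare the latest value with the previous month and with the lookback average.

    values: one customer's monthly values, oldest first; the last entry is the
    month being scored.
    Returns (ratio_vs_previous_month, ratio_vs_lookback_average).
    """
    if len(values) < 2:
        return None, None  # no previous month to compare with

    current = values[-1]
    previous = values[-2]
    ratio_prev = current / previous if previous else None

    history = values[:-1]  # everything before the month being scored
    if len(history) < lookback_window:
        return ratio_prev, None  # not enough months to fill the window

    window = history[-lookback_window:]
    average = sum(window) / lookback_window
    ratio_avg = current / average if average else None
    return ratio_prev, ratio_avg


# Hypothetical bill amounts for four consecutive months
bills = [100.0, 120.0, 80.0, 150.0]
ratios = current_vs_history(bills, lookback_window=3)
print(ratios)
```

With these sample values the latest bill (150) is compared with the previous bill (80) and with the average of the three preceding bills (100). A ratio well above 1 signals a bill that is unusually high for that customer.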


### Deploy the Function

You can specify the name of the function and deployment in the code below. As we have previously seen, we use tags in the metadata to allow us to programmatically identify the deployed function.

In [15]:
# store the function and deploy it 
function_name = 'utilities_payment_risk_prediction_scoring_pipeline_function'
function_deployment_name = 'utilities_payment_risk_prediction_scoring_pipeline_function_deployment'

We use tags, input data schemas, output data schemas and software specifications in the metadata to store the function. Input data schemas provide an easy way to enter data for scoring in the deployment space. Example metadata to store the function can be viewed using `client.repository.FunctionMetaNames.get_example_values()`. Similarly, example metadata to deploy the function can be viewed using `client.deployments.ConfigurationMetaNames.get_example_values()`.

The Software Specification refers to the runtime used in the Notebook, WML training and WML deployment. We use the `default_py3.7` software specification to store the function. We get the ID of the software specification and include it in the metadata when storing the function. Available Software specifications can be retrieved using `client.software_specifications.list()`.

In [16]:
software_spec_id = client.software_specifications.get_id_by_name("default_py3.7")

In [17]:
# add the metadata for the function and deployment    
meta_data = {
    client.repository.FunctionMetaNames.NAME : function_name,
    client.repository.FunctionMetaNames.TAGS : ['utilities_payment_risk_prediction_scoring_pipeline_function_tag'],
    client.repository.FunctionMetaNames.INPUT_DATA_SCHEMAS:[{'id': '1','type': 'struct','fields': [{'name': 'CUSTOMER ID', 'type': 'int'},{'name': 'Billing Date', 'type': 'MM-YYYY'}]}],
    client.repository.FunctionMetaNames.OUTPUT_DATA_SCHEMAS: [{'id': '1','type': 'struct','fields': [{'name': 'Payment_Risk','type': 'double'}]}],
    client.repository.FunctionMetaNames.SOFTWARE_SPEC_UID: software_spec_id
}

function_details = client.repository.store_function(meta_props=meta_data, function=scoring_pipeline)

function_id = function_details["metadata"]["id"]

meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: function_deployment_name,
    client.deployments.ConfigurationMetaNames.TAGS : ['utilities_payment_risk_prediction_scoring_pipeline_function_deployment_tag'],
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

# deploy the stored function
function_deployment_details = client.deployments.create(artifact_uid=function_id, meta_props=meta_props)



#######################################################################################

Synchronous deployment creation for uid: '753b8073-0f76-437c-9cf6-07fba281262b' started

#######################################################################################


initializing.........
ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='83fabc06-4ef8-48b9-98d7-9d8946378ba3'
------------------------------------------------------------------------------------------------




### Score New Data

Get the GUID of the deployed function, create the payload and use the Python client to score the data. The deployed function returns the classification prediction along with the probabilities.

The payload contains two values. The first is the ID of the customer for whom we would like to make the prediction. The second is the billing date for scoring. This is the date on which a bill would be created; the amount and usage to be billed are known at this point. We pass it in the **month number - year** format (for example, `06-2019`).
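
As a rough illustration of the payload convention, the sketch below reads the customer ID and billing date from a payload and converts the date into the integer year-month form (for example `201906`) that the scoring function compares against `BILLING_MONTH`. The helper name `parse_scoring_payload` is hypothetical. It validates the date with `datetime.strptime`, whereas the deployed function slices the string directly.

```python
import datetime


def parse_scoring_payload(payload):
    """Return (customer_id, billing_month) from a scoring payload.

    The first value is the customer ID and the second is the billing date
    in month-year format, matching the payload used in this notebook.
    """
    customer_id, billing_date = payload["input_data"][0]["values"][0]
    # strptime raises ValueError when the date is not in month-year format
    parsed = datetime.datetime.strptime(billing_date, "%m-%Y")
    billing_month = parsed.year * 100 + parsed.month
    return customer_id, billing_month


example_payload = {
    "input_data": [
        {"fields": ["CUSTOMER ID", "Billing Date"], "values": [[217, "06-2019"]]}
    ]
}
parsed_payload = parse_scoring_payload(example_payload)
print(parsed_payload)
```

Validating the date up front makes a wrongly ordered value such as `2019-06` fail immediately with a `ValueError` instead of producing a meaningless month number.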

In [18]:
scoring_deployment_id = client.deployments.get_uid(function_deployment_details)

# date should be %m-%Y format
payload = [{"fields":["CUSTOMER ID","Billing Date(%mm-%YYYY)"],"values":[[217,"06-2019"]]}]

payload_metadata = {client.deployments.ScoringMetaNames.INPUT_DATA: payload}
# score
funct_output = client.deployments.score(scoring_deployment_id, payload_metadata)
funct_output

{'predictions': [{'values': {'predictions': [{'fields': ['prediction',
       'probability'],
      'values': [[1, [0.274550050596425, 0.725449949403575]]]}],
    'Payment Risk': '72.54%'}}]}
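
The probabilities in the output above have been adjusted inside the scoring function, because the model was trained on a balanced dataset. The sketch below isolates that adjustment and the threshold step. The formula is the one used in `score`, but the densities, the threshold and the raw probability are hypothetical values chosen for illustration. They are not the values stored in the deployment metadata.

```python
def adjust_probability(p_balanced, pre_density, post_density):
    """Rescale a probability from a model trained on rebalanced data.

    pre_density:  share of positive cases before balancing
    post_density: share of positive cases after balancing
    The adjustment rescales the odds of the negative class by the ratio of
    the original and the balanced class odds.
    """
    odds_ratio = (1 / pre_density - 1) / (1 / post_density - 1)
    return 1 / (1 + odds_ratio * (1 / p_balanced - 1))


# Hypothetical values for illustration only
pre_density = 0.2    # 20% positives in the original training data
post_density = 0.5   # 50% positives after balancing
threshold = 0.3      # probability threshold used instead of the default 0.5

p_adjusted = adjust_probability(0.8, pre_density, post_density)
predicted_class = 1 if p_adjusted >= threshold else 0
payment_risk = str(round(p_adjusted * 100, 2)) + "%"
print(p_adjusted, predicted_class, payment_risk)
```

When the two densities are equal, the adjustment leaves the probability unchanged. Note that a raw probability of exactly 0 would cause a division by zero in this form, so a production version may need to guard against it.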

**The R Shiny dashboard invokes this scoring pipeline to visualize the results.**<br>
**Follow the instructions in the README to launch the R Shiny dashboard.**


<hr>

Sample Materials, provided under <a href="https://github.com/IBM/Industry-Accelerators/blob/master/CPD%20SaaS/LICENSE" target="_blank" rel="noopener noreferrer">license.</a> <br>
Licensed Materials - Property of IBM. <br>
© Copyright IBM Corp. 2020, 2021. All Rights Reserved. <br>
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. <br>