# Industry Accelerators - Utilities Customer Attrition Prediction Model

### Introduction

Now that we have built the machine learning pipeline, and stored and deployed it using <a href="http://ibm-wml-api-pyclient.mybluemix.net" target="_blank" rel="noopener noreferrer">ibm-watson-machine-learning</a>, we can use the pipeline to ingest new data, prepare it and score it.



Before executing this notebook on IBM Cloud, you need to:<br>
1) Insert a project access token. When you import this project into an IBM Cloud environment, a project access token should be inserted at the top of this notebook as a code cell. <br>
If you do not see that cell, insert a project token: click **More -> Insert project token** in the top-right menu section and run the cell. <br>

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

2) Provide your IBM Cloud API key in the subsequent cell.<br>
3) Step through the notebook cell by cell by pressing Shift-Enter, or execute the entire notebook by selecting **Cell -> Run All** from the menu.<br>


#### Insert IBM Cloud API key
Your Cloud API key can be generated by going to the <a href="https://cloud.ibm.com/iam/apikeys" target="_blank" rel="noopener noreferrer">API Keys section of the Cloud console</a>. From that page, scroll down to the API Keys section, and click Create an IBM Cloud API key. Give your key a name and click Create, then copy the created key and paste it below. 

If you are running this notebook on Cloud Pak for Data on-premises, leave the `ibmcloud_api_key` field blank.

In [None]:
ibmcloud_api_key = ''

In [3]:
try:
    project
except NameError:
    # READING AND WRITING PROJECT ASSETS
    import project_lib
    project = project_lib.Project() 

## Create and Test Scoring Pipeline 

In the first part of the notebook we will:

* Programmatically get the IDs for the deployment space and model deployment that were created in the **1-model_training** notebook.
* Create a deployable function which will take raw data for scoring, complete the initial prep, feed it to the pipeline and score it.
* Deploy the function.
* Create the required payload, invoke the deployed function and return predictions.


In [5]:
import pandas as pd
import datetime
from ibm_watson_machine_learning import APIClient
import os



if ibmcloud_api_key != '':
    wml_credentials = {
        "apikey": ibmcloud_api_key,
        "url": 'https://' + os.environ['RUNTIME_ENV_REGION'] + '.ml.cloud.ibm.com'
    }
else:
    token = os.environ['USER_ACCESS_TOKEN']
    wml_credentials = {
        "token": token,
        "instance_id" : "openshift",
        "url": os.environ['RUNTIME_ENV_APSX_URL'],
        "version": "3.5"
     }
client = APIClient(wml_credentials)
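
The credential selection above can be wrapped in a small helper so that a missing environment variable fails with a readable message. This is only a minimal sketch: `build_wml_credentials` is a hypothetical helper name, and its `env` parameter is an assumption added so the logic can be tried without a real runtime. The dictionary keys and environment variable names are the ones used in the cell above.

```python
import os


def build_wml_credentials(api_key, env=None):
    """Return a credentials dict shaped like the one built in the cell above.

    `build_wml_credentials` and the `env` parameter are hypothetical; they are
    not part of ibm-watson-machine-learning.
    """
    env = os.environ if env is None else env

    def require(name):
        # Fail early with a clear message instead of a bare KeyError.
        if name not in env:
            raise RuntimeError(f"Environment variable {name} is not set")
        return env[name]

    if api_key:
        # IBM Cloud: authenticate with the API key against the regional endpoint.
        return {
            "apikey": api_key,
            "url": "https://" + require("RUNTIME_ENV_REGION") + ".ml.cloud.ibm.com",
        }

    # Cloud Pak for Data: authenticate with the user access token.
    return {
        "token": require("USER_ACCESS_TOKEN"),
        "instance_id": "openshift",
        "url": require("RUNTIME_ENV_APSX_URL"),
        "version": "3.5",
    }


# The region value here is only an example.
example = build_wml_credentials("my-key", env={"RUNTIME_ENV_REGION": "us-south"})
print(example["url"])
```

The returned dictionary could then be passed to `APIClient` exactly as `wml_credentials` is in the cell above.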


### Set up Deployment Space, Deployments and Assets

The following code programmatically gets the deployment space and the model deployment details that were created in 1-model_training. 
We look them up by the space name and deployment name that were used when creating the deployments, as specified below. If multiple deployments within the selected space have the same name, the most recently created deployment is used.
Alternatively, the user can manually enter the space and deployment IDs.

The code also promotes an asset into the deployment space. Before passing data to the pipeline, we completed one data preparation step: we aggregated categories that had a low number of cases. This step must also be completed when scoring any new data. We saved the names of the categories that were aggregated in a JSON file, `metadata.json`, and we promote this asset into the deployment space so that the deployed function can access it.
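
To illustrate the aggregation step, here is a minimal sketch of how a mapping of rare categories could be applied to raw records before scoring. It is written in plain Python on a list of records rather than on a DataFrame. The `CITY` column comes from the dataset, but the rare category values, the `Other` label and the `apply_grouping` helper are made up for the example; the real mapping is whatever was saved during training.

```python
# Hypothetical mapping: column name -> {rare category: aggregated label}.
grouping_cols = {
    "CITY": {"Smallville": "Other", "Tinytown": "Other"},
}

# Made-up records in the same "list of dictionaries" form as the scoring payload.
records = [
    {"CUSTOMER_ID": 1, "CITY": "Mountain View"},
    {"CUSTOMER_ID": 2, "CITY": "Smallville"},
]


def apply_grouping(rows, mapping):
    """Return copies of the rows with rare categories replaced."""
    grouped = []
    for row in rows:
        new_row = dict(row)
        for column, replacements in mapping.items():
            if column in new_row:
                # Values without an entry in the mapping are left unchanged.
                new_row[column] = replacements.get(new_row[column], new_row[column])
        grouped.append(new_row)
    return grouped


grouped_records = apply_grouping(records, grouping_cols)
print([row["CITY"] for row in grouped_records])  # ['Mountain View', 'Other']
```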

In [6]:
space_name = 'Utilities Customer Attrition Space'
model_name = 'attrition_pipeline'
deployment_name = 'attrition_pipeline_deployment'

Get the space we are working in, which is found using the name that was hardcoded in **1-model_training**. 
If you would like to use a different space, set the **space_id** manually.

Set the space as the default space for working.

In [7]:
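# find the space whose name matches space_name
space_id = None
for space_details in client.spaces.get_details()['resources']:
    if space_details['entity']['name'] == space_name:
        space_id = space_details['metadata']['id']

# fail with a clear message instead of a NameError when no space matches
if space_id is None:
    raise ValueError("No deployment space named '" + space_name + "' was found")

# set this space as default space
client.set.default_space(space_id)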

'SUCCESS'

Get the deployment ID. If there are multiple deployments with the same name in the same space, we take the most recently created one.

In [8]:
l_deployment_details = []
l_deployment_details_created_times = []

for deployment in client.deployments.get_details()['resources']:
    if deployment['entity']['name'] == deployment_name:
        l_deployment_details.append(deployment)
        l_deployment_details_created_times.append(
            datetime.datetime.strptime(deployment['metadata']['created_at'], '%Y-%m-%dT%H:%M:%S.%fZ'))

# get the index of the latest created date from the list and use that to get the deployment_id
list_latest_index = l_deployment_details_created_times.index(max(l_deployment_details_created_times))
deployment_id = l_deployment_details[list_latest_index]['metadata']['id']
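
The "take the latest" rule can also be written as a single `max` over the matching records. The sketch below is one possible formulation, not the notebook's own code: `latest_deployment_id` is a hypothetical helper, and the sample records are made up to mimic only the fields read in the cell above.

```python
import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def latest_deployment_id(resources, name):
    """Return the id of the most recently created deployment with this name."""
    matches = [r for r in resources if r["entity"]["name"] == name]
    if not matches:
        raise ValueError(f"No deployment named '{name}' was found")
    latest = max(
        matches,
        key=lambda r: datetime.datetime.strptime(
            r["metadata"]["created_at"], TIMESTAMP_FORMAT
        ),
    )
    return latest["metadata"]["id"]


# Made-up records that mimic the fields used in the cell above.
sample_resources = [
    {"entity": {"name": "attrition_pipeline_deployment"},
     "metadata": {"id": "old-id", "created_at": "2021-01-05T10:00:00.000Z"}},
    {"entity": {"name": "attrition_pipeline_deployment"},
     "metadata": {"id": "new-id", "created_at": "2021-03-01T08:30:00.000Z"}},
    {"entity": {"name": "another_deployment"},
     "metadata": {"id": "other-id", "created_at": "2021-04-01T00:00:00.000Z"}},
]

print(latest_deployment_id(sample_resources, "attrition_pipeline_deployment"))  # new-id
```

Raising an error when nothing matches avoids the less obvious failure of calling `max` on an empty list.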

### Create the Deployable Function

Functions can be deployed in Watson Machine Learning in the same way as models. The Python client or REST API can be used to send data to the deployed function. Using a deployed function allows us to prepare the data and pass it to the pipeline for scoring, all within a single call.

We start off by creating the dictionary of default parameters to be passed to the function. We get the ID of the asset that has been promoted into the deployment space. We also add the model deployment ID and space ID into the dictionary.

In [9]:
# After the client has been created from the credentials, the instance_id in wml_credentials gets updated to 999.
# Reset it to "openshift" before passing the credentials to the function.

if ibmcloud_api_key == '':
    wml_credentials["instance_id"] = "openshift"  

ai_parms = {'wml_credentials' : wml_credentials, 'space_id' : space_id, 'model_deployment_id' : deployment_id}

#### Scoring Pipeline Function

The function below takes a dictionary of raw data to be scored as a payload. Any aggregation of categorical columns that is required is completed before the data is passed to the deployed pipeline. The pipeline completes the remaining data preparation steps, passes the data to the model and returns the predicted class and probabilities for attrition.

In [10]:
def scoring_pipeline(parms=ai_parms):
    
    from ibm_watson_machine_learning import APIClient
    client = APIClient(parms["wml_credentials"])
    client.set.default_space(parms['space_id'])    

    def score(payload):
        import json
        import requests
        import pandas as pd
     
        extracted_payload = payload['input_data'][0]['values']
        
        # the data passed in from the R Shiny app will be in string format
        # convert it to JSON so we can read it into a dataframe
        if isinstance(extracted_payload, str):
            # we need to remove the \ from the string
            extracted_payload = extracted_payload.replace('\\', '')
            extracted_payload = json.loads(extracted_payload)
        
        # create the dataframe from the values and fields that have been passed in the payload
        df = pd.DataFrame(extracted_payload)
        
        l_customer_ids = df['CUSTOMER_ID'].tolist()
        
        metadata_dict = client.deployments.get_details(parms['model_deployment_id'])['entity']['custom']          
        
        grouping_dict = metadata_dict['grouping_cols']
        # loop through each key in the dictionary, which is the name of a column that needs some aggregation 
        for key, value_dict in grouping_dict.items():    
            # assign the result back instead of replacing in place on a column selection
            df[key] = df[key].replace(value_dict)
            
        # all other prep steps are handled by the pipeline - columns not needed are removed, missing values are replaced
        # get the deployment and score the data      
        scoring_payload = {"input_data":  [{ "values" : df.values.tolist()}]}
        predictions = client.deployments.score(parms['model_deployment_id'], scoring_payload)
        
        # update the predicted class returned based on our threshold
        # by default the predicted class is based on 0.5 probability, we changed this based on ROC curve
        for idx, val in enumerate(predictions['predictions'][0]['values']):
            if predictions['predictions'][0]['values'][idx][1][1] >= metadata_dict['probability_threshold']:
                predictions['predictions'][0]['values'][idx][0] = 1
            else:
                predictions['predictions'][0]['values'][idx][0] = 0
            
            
        return {"predictions" : [{'values' : predictions, 'customer_ids' : l_customer_ids}]}
            
    return score
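
The outer/inner structure above can be tried locally before deploying. The sketch below is a stripped-down stand-in, assuming nothing about Watson Machine Learning itself: `make_score` is a hypothetical name, the backslash stripping is omitted, and the call to the deployed pipeline is replaced by a stub that echoes the customer IDs. It shows how the parameters are captured by the outer function and how a payload sent as a JSON string is handled the same way as a list of records.

```python
import json


def make_score(parms):
    """Mimic the outer/inner structure of a deployable function."""

    def score(payload):
        values = payload["input_data"][0]["values"]
        # A caller may send the records as a JSON string instead of a list.
        if isinstance(values, str):
            values = json.loads(values)
        customer_ids = [record["CUSTOMER_ID"] for record in values]
        # A real function would call the deployed pipeline here; this stub
        # only echoes back what it received.
        return {"predictions": [{"customer_ids": customer_ids,
                                 "space_id": parms["space_id"]}]}

    return score


# The space id is a placeholder, not a real identifier.
score = make_score({"space_id": "hypothetical-space-id"})

as_list = {"input_data": [{"values": [{"CUSTOMER_ID": 7}]}]}
as_string = {"input_data": [{"values": json.dumps([{"CUSTOMER_ID": 7}])}]}

print(score(as_list) == score(as_string))  # True
```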

### Deploy the Function

The user can specify the name of the function and deployment in the code below. As we have previously seen, we use tags in the metadata to allow us to programmatically identify the deployed function.

In [11]:
# store the function and deploy it 
function_name = 'attrition_scoring_pipeline_function'
function_deployment_name = 'attrition_scoring_pipeline_function_deployment'


The Software Specification refers to the runtime used in the Notebook, WML training and WML deployment. We use the software specification `default_py3.7` to store the function. We get the ID of the software specification and include it in the metadata when storing the function. Available Software specifications can be retrieved using `client.software_specifications.list()`.

In [12]:
software_spec_id = client.software_specifications.get_id_by_name("default_py3.7")

In [13]:
# add the metadata for the function and deployment    
meta_data = {
    client.repository.FunctionMetaNames.NAME: function_name,
    client.repository.FunctionMetaNames.TAGS: ['utilities_attrition_scoring_pipeline_function_tag'],
    client.repository.FunctionMetaNames.SOFTWARE_SPEC_UID: software_spec_id
}

function_details = client.repository.store_function(meta_props=meta_data, function=scoring_pipeline)

function_id = function_details["metadata"]["id"]

meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: function_deployment_name,
    client.deployments.ConfigurationMetaNames.TAGS: ['utilities_attrition_scoring_pipeline_function_deployment_tag'],
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

# deploy the function
function_deployment_details = client.deployments.create(artifact_uid=function_id, meta_props=meta_props)





#######################################################################################

Synchronous deployment creation for uid: 'e7b84bf6-45ab-48cf-b03a-59507c294f61' started

#######################################################################################


initializing....
ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='de7e83a7-a859-4965-b94f-cb0de4dde0d4'
------------------------------------------------------------------------------------------------




### Score New Data

To create the payload, we pass a dictionary with raw data as the function payload. For demonstration purposes we use the same CSV file that was used in the **1-model_training** notebook as the raw data. We take 5 records and convert them into a list of dictionaries to be passed in the payload.  

We then get the ID of the deployed function and use the Python client to score the data. The deployed function returns the classification prediction along with the probabilities. 

In [14]:
# specify the name of the csv file with raw customer data that we would like to score for
dataset_name = 'Attrition View.csv'

my_file = project.get_file(dataset_name)
my_file.seek(0)
df_raw_data = pd.read_csv(my_file)

# remove the target variable so the data has the same inputs as training data
df_raw_data.drop('ATTRITION_STATUS', axis=1, inplace=True)

In [15]:
payload_input_dict = df_raw_data.head(5).to_dict(orient='records')

Looking at the payload, not all of these fields are used in the model. The transformers and the pipeline take care of removing columns that aren't used.

In [16]:
payload_input_dict[1]

{'CUSTOMER_ID': 2,
 'GENDER_ID': 1,
 'FIRST_NAME': 'Ima',
 'LAST_NAME': 'Labadie',
 'PHONE_1': '505-339-5197',
 'EMAIL': 'Ima.Labadie@allie.tv',
 'AGE': 34,
 'ENERGY_USAGE_PER_MONTH': 4970,
 'ENERGY_EFFICIENCY': 0.35600000000000004,
 'IS_REGISTERED_FOR_ALERTS': 0,
 'OWNS_HOME': 1,
 'COMPLAINTS': 1,
 'HAS_THERMOSTAT': 1,
 'HAS_HOME_AUTOMATION': 0,
 'PV_ZONING': 1,
 'WIND_ZONING': 0,
 'SMART_METER_COMMENTS': 'Negative',
 'IS_CAR_OWNER': 1,
 'HAS_EV': 0,
 'HAS_PV': 0,
 'HAS_WIND': 0,
 'TENURE': 11,
 'EBILL': 0,
 'IN_WARRANTY': 1,
 'CITY': 'Mountain View',
 'CURRENT_OFFER': 'Free Energy Audits',
 'CURRENT_CONTRACT': 'Dynamic Pricing 240 minute plan',
 'CURRENT_ISSUE': 'Billing Issue',
 'MARITAL_STATUS': 'U',
 'EDUCATION': "Bachelor's degree",
 'SEGMENT': 'GOLD',
 'EMPLOYMENT': 'Employed full-time',
 'STD_YRLY_USAGE_CUR_YEAR_MINUS_1': 52098,
 'STD_YRLY_USAGE_CUR_YEAR_MINUS_2': 40740,
 'STD_YRLY_USAGE_CUR_YEAR_MINUS_3': 26666,
 'STD_YRLY_USAGE_CUR_YEAR_MINUS_4': 26666,
 'STD_YRLY_USAGE_CUR_Y

In [17]:
scoring_deployment_id = client.deployments.get_uid(function_deployment_details)

payload = [{'values' : payload_input_dict}]

payload_metadata = {client.deployments.ScoringMetaNames.INPUT_DATA: payload}
# score
funct_output = client.deployments.score(scoring_deployment_id, payload_metadata)
funct_output

{'predictions': [{'values': {'predictions': [{'fields': ['prediction',
       'probability'],
      'values': [[0, [0.8085850664760637, 0.19141493352393651]],
       [1, [0.2273804809663913, 0.7726195190336088]],
       [0, [0.780502232264842, 0.21949776773515772]],
       [1, [0.3863743779099732, 0.6136256220900268]],
       [1, [0.604246298971217, 0.39575370102878316]]]}]},
   'customer_ids': [1, 2, 3, 4, 5]}]}
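
In the output above, the fifth record is labelled 1 although its probability for class 1 is about 0.396, which is consistent with a stored `probability_threshold` below 0.5. The relabelling loop in `score` can be sketched as follows. The threshold of 0.35 is an assumption chosen for the example, since the stored value is not shown in this notebook, and `apply_threshold` is a hypothetical helper name.

```python
def apply_threshold(values, threshold):
    """Relabel [label, [p_class_0, p_class_1]] rows using a custom threshold."""
    relabelled = []
    for _, probabilities in values:
        label = 1 if probabilities[1] >= threshold else 0
        relabelled.append([label, probabilities])
    return relabelled


# Probabilities rounded from the output above; labels start at a placeholder 0.
rows = [
    [0, [0.809, 0.191]],
    [0, [0.227, 0.773]],
    [0, [0.781, 0.219]],
    [0, [0.386, 0.614]],
    [0, [0.604, 0.396]],
]

print([label for label, _ in apply_threshold(rows, threshold=0.35)])  # [0, 1, 0, 1, 1]
```

With the default cut-off of 0.5 the same rows would be labelled `[0, 1, 0, 1, 0]`, so only the fifth record changes.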

**The R Shiny dashboard invokes this scoring pipeline to visualize the results.**<br>
**Follow the instructions in the Readme to launch the R Shiny dashboard.**

<hr>

Sample Materials, provided under <a href="https://github.com/IBM/Industry-Accelerators/blob/master/CPD%20SaaS/LICENSE" target="_blank" rel="noopener noreferrer">license</a>. <br>
Licensed Materials - Property of IBM. <br>
© Copyright IBM Corp. 2020, 2021. All Rights Reserved. <br>
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. <br>