# Industry Accelerators - Financial Markets Customer Segmentation Model

## Introduction

Now that we have built the machine learning pipeline and stored and deployed it using <a href="http://ibm-wml-api-pyclient.mybluemix.net" target="_blank" rel="noopener noreferrer">ibm-watson-machine-learning</a>, we can use the pipeline to ingest new data, prepare it and score it.



Before executing this notebook on IBM Cloud, you need to:<br>
1) Check for a project access token. When you import this project into an IBM Cloud environment, a project access token should be inserted at the top of this notebook as a code cell. <br>
If you do not see that cell, insert a project token: click **More -> Insert project token** in the top-right menu section and run the cell. <br>

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

2) Provide your IBM Cloud API key in the subsequent cell.<br>
3) Step through the notebook cell by cell by pressing Shift-Enter, or execute the entire notebook by selecting **Cell -> Run All** from the menu.<br>


#### Insert IBM Cloud API key

Your Cloud API key can be generated by going to the [API Keys section of the Cloud console](https://cloud.ibm.com/iam/apikeys). From that page, scroll down to the API Keys section, and click Create an IBM Cloud API key. Give your key a name and click Create, then copy the created key and paste it below. 

If you are running this notebook on Cloud Pak for Data on-premises, leave the `ibmcloud_api_key` field blank.

In [23]:
ibmcloud_api_key = ''

In [24]:
try:
    project
except NameError:
    # READING AND WRITING PROJECT ASSETS
    import project_lib
    project = project_lib.Project() 

## Create and Test Scoring Pipeline 

In this notebook we will:

- Programmatically get the IDs for the deployment space and model deployment that were created in the **2-model-training** notebook.
- Promote assets required for scoring new data into the deployment space.
- Create a deployable function which will take raw data for scoring, prep it into the format required for the models and score it.
- Deploy the function.
- Create the required payload, invoke the deployed function and return clusters.


In [25]:
import os
import pandas as pd
import datetime
from ibm_watson_machine_learning import APIClient

if ibmcloud_api_key != '':
    wml_credentials = {
        "apikey": ibmcloud_api_key,
        "url": 'https://' + os.environ['RUNTIME_ENV_REGION'] + '.ml.cloud.ibm.com'
    }
else:
    token = os.environ['USER_ACCESS_TOKEN']
    wml_credentials = {
        "token": token,
        "instance_id" : "openshift",
        "url": os.environ['RUNTIME_ENV_APSX_URL'],
        "version": "3.5"
     }
client = APIClient(wml_credentials)

### User Inputs

Enter the name of the CSV file containing the raw data to be scored. The file is downloaded to the local path so that it can be promoted to the deployment space later.

In [26]:
# specify the name of the csv file with the raw data that we would like to score
filename = 'customer_full_summary_latest.csv'
# copy the project asset to the local file system; the with block closes the file for us
with open(filename, 'wb') as f:
    f.write(project.get_file(filename).getbuffer())

### Set up Deployment Space, Deployments and Assets

The following code programmatically gets the deployment space and the model deployment details which were created in **2-model-training**. 
We use the space name and deployment names that were used when creating the deployments as specified below. If multiple deployments within the selected space have the same name, the most recently created deployment is used.

Alternatively, you can manually enter the space and deployment IDs.

The code also promotes some assets into the deployment space, specifically, the dataset with raw data for scoring, the python script file which is used for prepping the data, the metadata that was stored when prepping the data and the PCA object that was created during training. By promoting these assets into the deployment space, they are available and can be accessed by the deployed function. 

In [27]:
space_name = 'Customer Segmentation Space'
model_name = 'customer_segmentation_model'
deployment_name = 'customer_segmentation_model_deployment'

Get the space we are working in, which is found using the name that was hardcoded in **2-model-training**. If you would like to use a different space, set **space_id** manually.

Set the space as the default space for working.

In [28]:
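# look up the deployment space by name
space_id = None
for space_details in client.spaces.get_details()['resources']:
    if space_details['entity']['name'] == space_name:
        space_id = space_details['metadata']['id']

# fail early with a clear message instead of a NameError further down
if space_id is None:
    raise ValueError("No deployment space named '" + space_name + "' was found. Run 2-model-training first or set space_id manually.")

# set this space as default space
client.set.default_space(space_id)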

'SUCCESS'

In [29]:
l_deployment_details = []
l_deployment_details_created_times = []

# collect every deployment with the expected name, together with its creation time
for deployment in client.deployments.get_details()['resources']:
    if deployment['entity']['name'] == deployment_name:
        l_deployment_details.append(deployment)
        l_deployment_details_created_times.append(datetime.datetime.strptime(deployment['metadata']['created_at'], '%Y-%m-%dT%H:%M:%S.%fZ'))

# get the index of the latest created date from the list and use that to get the deployment_id
list_latest_index = l_deployment_details_created_times.index(max(l_deployment_details_created_times))
deployment_id = l_deployment_details[list_latest_index]['metadata']['id']
print("Deployment ID of",deployment_name,"is",deployment_id)

Deployment ID of customer_segmentation_model_deployment is 504ec2f9-5ebc-44ef-ab66-73fc19e2eeff
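The "most recently created deployment wins" rule can also be written as a small helper that returns `None` when nothing matches, rather than failing on an empty list. This is a minimal sketch, not part of the accelerator: `latest_by_name` and the sample records are hypothetical, and the records only mimic the `entity`/`metadata` shape returned by `client.deployments.get_details()['resources']`.

```python
from datetime import datetime

def latest_by_name(resources, name):
    """Return the ID of the most recently created resource called `name`, or None."""
    matches = [r for r in resources if r["entity"]["name"] == name]
    if not matches:
        return None
    # created_at uses the same timestamp format as the notebook cell above
    newest = max(
        matches,
        key=lambda r: datetime.strptime(r["metadata"]["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ"),
    )
    return newest["metadata"]["id"]

# made-up records for illustration only
sample = [
    {"entity": {"name": "demo"}, "metadata": {"id": "a", "created_at": "2021-04-01T10:00:00.000Z"}},
    {"entity": {"name": "demo"}, "metadata": {"id": "b", "created_at": "2021-04-13T14:45:39.244Z"}},
    {"entity": {"name": "other"}, "metadata": {"id": "c", "created_at": "2021-05-01T00:00:00.000Z"}},
]
print(latest_by_name(sample, "demo"))  # b
```

In the notebook, the same helper could be called with the real list of deployments and `deployment_name`.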


### Promote Assets to Deployment space

Promote the assets into the deployment space. We will use the prep script to get the raw data into the format required for scoring. We also need the prep metadata that was saved as JSON during the prep for training; this ensures that the user inputs used to prep the training data are also used for scoring. Finally, we store the PCA object used in training and the raw data dataset in the deployment space.

In [None]:
# we will use the prep script for getting the raw data into the format required for scoring
# we also need the prep metadata that was saved as json during the prep for training - this ensures that the user inputs specified for prepping the data for training are the same as the ones used for scoring.
# we need to add these files into the deployment space
asset_details_json = client.data_assets.create('training_data_metadata.json', file_path='training_data_metadata.json')
asset_details_script = client.data_assets.create('customer_segmentation_prep.py', file_path='customer_segmentation_prep.py')

# get the pca object created in training - this file was saved as .txt so that the mimetype could be recognised when creating the asset
asset_details_pca = client.data_assets.create('pca.joblib', file_path='pca.txt')

asset_details_dataset = client.data_assets.create(filename, file_path=filename)

Creating data asset...
SUCCESS
Creating data asset...
SUCCESS
Creating data asset...
SUCCESS
Creating data asset...


In [10]:
client.data_assets.list()

--------------------------------  ----------  -------  ------------------------------------
NAME                              ASSET_TYPE  SIZE     ASSET_ID
customer_full_summary_latest.csv  data_asset  1607327  bb16a02b-7944-49bc-b230-e032f84cc549
pca.joblib                        data_asset  6857     ce86f8af-940c-4c7b-886f-942619968f55
training_data_metadata.json       data_asset  3404     7a3dde2f-d037-41ab-8caf-e91acb2b841f
customer_segmentation_prep.py     data_asset  17871    1cd9260b-d0e4-4d5f-b77d-d0e5c8f55406
--------------------------------  ----------  -------  ------------------------------------


### Create the Deployable Function

Functions can be deployed in Watson Machine Learning in the same way models can be deployed. The python client or REST API can be used to send data to the deployed function. Using the deployed function allows us to prepare the data and pass it to the model for scoring all within the deployed function.

We start off by creating the dictionary of default parameters to be passed to the function. We get the IDs of all assets that have been promoted into the deployment space. We also add the model deployment ID and space ID into the dictionary.


In [11]:
# get the assets that were stored in the space - in this version of the package we need to manually assign the id
metadata_id = asset_details_json['metadata']['guid']
prep_id = asset_details_script['metadata']['guid']
pca_id = asset_details_pca['metadata']['guid']
dataset_id = asset_details_dataset['metadata']['guid']

In [12]:
assets_dict = {'dataset_asset_id' : dataset_id, 'metadata_asset_id' : metadata_id, 'pca_asset_id' : pca_id,
                   'prep_script_asset_id' : prep_id, 'dataset_name' : filename}

In [13]:
# create the wml_credentials again. After already creating the client using the credentials, the instance_id gets updated to 999
# re-create the dictionary so that the correct instance_id is used
#wml_credentials["instance_id"] = "openshift"
    
ai_parms = {'wml_credentials' : wml_credentials,'space_id' : space_id, 'assets' : assets_dict, 'model_deployment_id' : deployment_id}

### Scoring Pipeline Function

The function below takes new customers to be scored as a payload. It preps the customer raw data, loads the model, executes the model scoring and assigns each customer to a cluster. 

The following rules are required to make a valid deployable function:

* The deployable function must include a nested function named "score".
* The score function accepts a single payload, a dictionary whose "input_data" entry is a list.
* Each item in that list must include an array with the name "values".
* The score function must return an array with the name "predictions", with a list as the value, which in turn contains an array with the name "values". Example: ```{"predictions" : [{"values" : [...]}]}```
* We pass these into the function:

 - Default parameters
 - Credentials and space detail
 - Details of the assets that were promoted into the space
 - Model deployment guid

* The assets are downloaded into the deployment space and imported as variables. The raw data to be scored is then prepared and the function calls the model deployment endpoint to score and return predictions. 
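The rules above can be tried out locally before anything is deployed. The following is a minimal sketch of the closure pattern only: `make_scorer` and its `offset` parameter are hypothetical names, and nothing here calls Watson Machine Learning.

```python
def make_scorer(parms={"offset": 1}):
    # outer function: set-up code goes here and is shared by every call to score
    offset = parms["offset"]

    def score(payload):
        # payload is a dict with an "input_data" list; each item carries "values"
        rows = payload["input_data"][0]["values"]
        shifted = [[value + offset for value in row] for row in rows]
        # the response must be a "predictions" list whose items carry "values"
        return {"predictions": [{"values": shifted}]}

    return score

score = make_scorer()
result = score({"input_data": [{"values": [[0], [4], [2]]}]})
print(result)  # {'predictions': [{'values': [[1], [5], [3]]}]}
```

Calling the returned `score` function directly like this is a quick way to check the input and output shapes before storing the function.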

In [14]:
def scoring_pipeline(parms=ai_parms):
     
    import pandas as pd
    import requests
    import os
    import sys  # needed for the sys.stderr messages written in prep
    import json
    import joblib
   
    from ibm_watson_machine_learning import APIClient
    client = APIClient(parms["wml_credentials"])
    client.set.default_space(parms['space_id'])
    
    # call the function to download the stored dataset asset and return the path
    dataset_path = client.data_assets.download(parms['assets']['dataset_asset_id'], parms['assets']['dataset_name'])
    df_raw = pd.read_csv(dataset_path, infer_datetime_format=True, 
                             parse_dates=['CUSTOMER_RELATIONSHIP_START_DATE', 'CUSTOMER_SUMMARY_END_DATE', 'CUSTOMER_SUMMARY_START_DATE'])
    
    
    # call the function to download the prep script and return the path
    prep_script_path = client.data_assets.download(parms['assets']['prep_script_asset_id'], 'prep_data_script.py')
    # remove the rest of path and .py at end of file name to get the name of the script for importing
    script_name = os.path.basename(prep_script_path).replace('.py', '')
    
    
    # call the function to download the pca joblib file and return the path
    pca_object_path = client.data_assets.download(parms['assets']['pca_asset_id'], 'pca.joblib')
    pca = joblib.load(pca_object_path)

    # call the function to download the prep metadata and return the path
    metadata_path = client.data_assets.download(parms['assets']['metadata_asset_id'], 'user_inputs.json')   
    
    def prep(cust_ids, scoring_date):
        import requests
        import os
        # import the prep script that we downloaded into the deployment space
        prep_data_script = __import__(script_name)
        
        
        with open(metadata_path, 'r') as f:
            metadata = json.load(f)
        
        # create new variables for all elements in the dictionary. A new variable is created for each key
        globals().update(metadata)
        
        input_df = df_raw[df_raw[granularity_key].isin(cust_ids)]
        
        # call the script to prep the data
        scoring_prep = prep_data_script.CustomerSegmentationPrep('score', granularity_key=granularity_key,
                                            customer_start_date=customer_start_date,
                                            customer_end_date=customer_end_date,
                                            status_attribute=status_attribute,
                                            status_flag_active=status_flag_active,
                                            date_customer_joined=date_customer_joined,
                                            columns_required=columns_required,
                                            default_attributes=default_attributes,
                                            risk_tolerance_list=risk_tolerance_list,
                                            investment_objective_list=investment_objective_list,
                                            effective_date=scoring_date,
                                            std_multiplier=std_multiplier,
                                            max_num_cat_cardinality=max_num_cat_cardinality,
                                            nulls_threshold=nulls_threshold)
        prepped_data = scoring_prep.prep_data(input_df, 'score')
    
        # handle missing or empty data
        if prepped_data is None or prepped_data.shape[0] == 0:
            print("Data prep filtered out customer data. Unable to score.", file=sys.stderr)
            return None
    
        # the dataset contains a mix of continuous variables and categorical variables
        # categorical variables are converted into dummy variables
        # continuous variables are standardised using z-score

        categorical_cols = list(prepped_data.select_dtypes(include=[object]).columns)
        # create dummy variables for categorical variables and drop original
        for col in categorical_cols:
            prepped_data = pd.concat([prepped_data, pd.get_dummies(prepped_data[col], prefix=col, drop_first=True)], axis=1)
            prepped_data.drop(col, axis=1, inplace=True)
        
        # since we're using distance based clustering, we need to scale numeric variables
        # z score was used in training. We stored the mean and standard deviations used in standardscaler in the metadata, use these to standardise new data
        for i in range(0, len(cols_to_standardise)):
            current_col = cols_to_standardise[i]
            current_col_mean = scaler_means[i]
            current_col_standard_dev = scaler_standard_dev[i]
            # scale the variable 
            prepped_data[current_col] = (prepped_data[current_col] - current_col_mean) / current_col_standard_dev
        
        # if a column does not exist in scoring but is in training, add the column to scoring dataset
        for col in cols_used_for_training:
            if col not in list(prepped_data.columns):
                prepped_data[col] = 0

        # if a column exists in scoring but not in training, delete it from scoring dataset
        for col in list(prepped_data.columns):
            if col not in cols_used_for_training:
                prepped_data.drop(col, axis=1, inplace=True)

        # make sure order of scoring columns is same as training dataset
        prepped_data = prepped_data[cols_used_for_training]
        
        # get the pca object that was loaded in the outer function  and apply to the prepped dataset
        prepped_data = pd.DataFrame(pca.transform(prepped_data))
        
        return prepped_data
    
    def score(payload):
        import json
        
        scoring_date = payload['input_data'][0]['values']
        cust_ids = payload['input_data'][0]['cust_id']
        
        prepped_data = prep(cust_ids, scoring_date)
        
        # nothing left to score after data prep
        if prepped_data is None or prepped_data.shape[0] == 0:
            return {"predictions" : [{'values' : 'Data prep filtered out customer data. Unable to score.'}]}

        # send the prepped rows to the model deployment
        scoring_payload = {"input_data":  [{ "values" : prepped_data.values.tolist()}]}
        response_scoring = client.deployments.score(parms['model_deployment_id'], scoring_payload)

        result = []
        # increment each cluster by 1 so that it starts at 1 instead of 0
        for cluster_num in response_scoring['predictions'][0]['values']:
            result.append(cluster_num[0] + 1)

        return {"predictions" : [{'values' : result}]}
        
    return score
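
Two of the preparation steps inside `prep` are easy to check in isolation: scaling with the means and standard deviations stored at training time, and aligning the scoring columns with the training columns. The sketch below reproduces both on a single record using plain dictionaries instead of a DataFrame. The column names and numbers are made up for illustration, and `standardise` and `align` are hypothetical helpers, not functions from the prep script.

```python
def standardise(record, cols, means, std_devs):
    """Apply z-score scaling using statistics saved from training."""
    out = dict(record)
    for col, mean, std in zip(cols, means, std_devs):
        out[col] = (out[col] - mean) / std
    return out

def align(record, training_cols):
    """Keep only training columns, in training order; fill missing ones with 0."""
    return {col: record.get(col, 0) for col in training_cols}

# hypothetical scoring record: EXTRA was never seen in training, GENDER_U is missing
record = {"AGE": 50.0, "INCOME": 80000.0, "GENDER_M": 1, "EXTRA": 9}
scaled = standardise(record, ["AGE", "INCOME"], [40.0, 60000.0], [10.0, 20000.0])
aligned = align(scaled, ["AGE", "INCOME", "GENDER_M", "GENDER_U"])
print(aligned)  # {'AGE': 1.0, 'INCOME': 1.0, 'GENDER_M': 1, 'GENDER_U': 0}
```

Reusing the training statistics, rather than recomputing them on the scoring data, keeps new customers on the same scale as the data the clusters were fitted on.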

### Deploy the Function

The user can specify the name of the function and deployment in the code below. 

In [15]:
# store the function and deploy it 
function_name = 'customer_segmentation_scoring_pipeline_function'
function_deployment_name = 'customer_segmentation_scoring_pipeline_function_deployment'

### Get the ID of software specification to be used with the function

We use tags, input data schemas, output data schemas and software specifications in the metadata to store the function. Input data schemas make it easy to enter data for scoring in the deployment space. An example of the metadata used to store a function can be viewed with `client.repository.FunctionMetaNames.get_example_values()`.
Similarly, an example of the metadata used to deploy a function can be viewed with `client.deployments.ConfigurationMetaNames.get_example_values()`. <br>
The software specification refers to the runtime used in the notebook, WML training and WML deployment. We use the software specification `default_py3.7` to store the function. We get the ID of the software specification and include it in the metadata when storing the function. Available software specifications can be retrieved using `client.software_specifications.list()`.


In [16]:
software_spec_id = client.software_specifications.get_id_by_name("default_py3.7")

In [17]:
# add the metadata for the function and deployment    
meta_data = {
    client.repository.FunctionMetaNames.NAME : function_name,
    client.repository.FunctionMetaNames.TAGS : ['customer_segmentation_scoring_pipeline_function_tag'],
    client.repository.FunctionMetaNames.SOFTWARE_SPEC_UID: software_spec_id,    
}

function_details = client.repository.store_function(meta_props=meta_data, function=scoring_pipeline)

function_id = function_details["metadata"]["id"]

meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: function_deployment_name,
    client.deployments.ConfigurationMetaNames.TAGS : ['customer_segmentation_scoring_pipeline_function_deployment_tag'],
    client.deployments.ConfigurationMetaNames.ONLINE: {},
    
}

# deploy the stored function
function_deployment_details = client.deployments.create(artifact_uid=function_id, meta_props=meta_props)



#######################################################################################

Synchronous deployment creation for uid: '6460d1bd-2574-4c12-8883-a1d4e5dacc38' started

#######################################################################################


initializing........
ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='66dbc2f5-f572-49de-afdd-95bbf51fc59d'
------------------------------------------------------------------------------------------------




### Score New Data

Get the ID of the deployed function, create the payload and use the Python client to score the data. The deployed function returns the assigned cluster for each customer scored.

The payload contains two values. The first is the effective date for scoring, which is the date for which the clustering is computed. The second is a list of the customer IDs that we would like predictions for.
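
As a minimal sketch of that payload shape, the hypothetical helper below builds the request body and rejects a badly formed date before anything is sent. It assumes the date is an ISO `YYYY-MM-DD` string, as in this notebook's example, and that the literal key `"input_data"` is what `client.deployments.ScoringMetaNames.INPUT_DATA` resolves to, which matches the key the nested `score` function reads.

```python
from datetime import date

def build_payload(scoring_date, cust_ids):
    """Build the request body: an effective date plus the customer IDs to score."""
    # raises ValueError if scoring_date is not a valid YYYY-MM-DD string
    date.fromisoformat(scoring_date)
    if not cust_ids:
        raise ValueError("cust_ids must contain at least one customer ID")
    return {"input_data": [{"values": scoring_date, "cust_id": list(cust_ids)}]}

example = build_payload("2018-09-30", [1000, 1001, 1002])
print(example)
# {'input_data': [{'values': '2018-09-30', 'cust_id': [1000, 1001, 1002]}]}
```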

In [18]:
scoring_deployment_id = client.deployments.get_uid(function_deployment_details)
client.deployments.get_details(scoring_deployment_id)

{'entity': {'asset': {'id': '6460d1bd-2574-4c12-8883-a1d4e5dacc38'},
  'custom': {},
  'deployed_asset_type': 'function',
  'hardware_spec': {'id': 'Not_Applicable', 'name': 'XS', 'num_nodes': 1},
  'name': 'customer_segmentation_scoring_pipeline_function_deployment',
  'online': {},
  'space_id': '8b134edd-9809-4e60-94ff-cca6afc7789f',
  'status': {'online_url': {'url': 'https://us-south.ml.cloud.ibm.com/ml/v4/deployments/66dbc2f5-f572-49de-afdd-95bbf51fc59d/predictions'},
   'state': 'ready'}},
 'metadata': {'created_at': '2021-04-13T14:45:39.244Z',
  'id': '66dbc2f5-f572-49de-afdd-95bbf51fc59d',
  'modified_at': '2021-04-13T14:45:39.244Z',
  'name': 'customer_segmentation_scoring_pipeline_function_deployment',
  'owner': 'IBMid-664001TADX',
  'space_id': '8b134edd-9809-4e60-94ff-cca6afc7789f',
  'tags': ['customer_segmentation_scoring_pipeline_function_deployment_tag']}}

In [19]:
cust_ids = [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008]

payload = [{'values' : "2018-09-30", 'cust_id' : cust_ids}]

payload_metadata = {client.deployments.ScoringMetaNames.INPUT_DATA: payload}
# score
funct_output = client.deployments.score(scoring_deployment_id, payload_metadata)
funct_output

{'predictions': [{'values': [6, 1, 5, 5, 7, 4, 5, 4, 3]}]}

**The R Shiny dashboard invokes this scoring pipeline to visualize the results. Follow the instructions in the README to launch the R Shiny dashboard.**

<hr>
This project contains Sample Materials, provided under this <a href="https://github.com/IBM/Industry-Accelerators/blob/master/CPD%20SaaS/LICENSE" target="_blank" rel="noopener noreferrer">license</a>. <br/>
Licensed Materials - Property of IBM. <br/>
© Copyright IBM Corp. 2019, 2020, 2021. All Rights Reserved. <br/>
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.<br/>