# Batch Scoring on IBM Cloud Pak for Data as a Service

This notebook creates and runs a batch scoring job against a model that was previously created and deployed to the Watson Machine Learning (WML) instance on Cloud Pak for Data as a Service (CP4DaaS).

## 1.0 Install required packages


There are a couple of Python packages we will use in this notebook.

WML Client: http://ibm-wml-api-pyclient.mybluemix.net/

In [1]:
!pip uninstall watson-machine-learning-client -y | tail -n 1
!pip uninstall watson-machine-learning-client-V4 -y | tail -n 1

!pip install --upgrade ibm-watson-machine-learning==1.0.22 --user --no-cache | tail -n 1



In [2]:
import json
from ibm_watson_machine_learning import APIClient
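Because the install cell pins `ibm-watson-machine-learning==1.0.22`, it can be useful to confirm which version the kernel actually sees before continuing. The cell below is a minimal sketch that uses only the standard library (`importlib.metadata`, Python 3.8+). The helper name `installed_version` is hypothetical and not part of the WML client.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# The package name is the one used in the pip install command above.
print(installed_version("ibm-watson-machine-learning"))
```

If this prints `None` or an unexpected version, re-running the install cell is a reasonable first step.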

## 2.0 Create Batch Deployment Job

### 2.1 Instantiate Watson Machine Learning Client

To authenticate to the Watson Machine Learning service on IBM Cloud, you need to provide a platform `api_key` and an endpoint URL. The endpoint URL is based on the `location` of the WML instance. To get these values, you can use either the IBM Cloud CLI or the IBM Cloud UI.


#### IBM Cloud CLI

You can use the [IBM Cloud CLI](https://cloud.ibm.com/docs/cli/index.html) to create a platform API Key and retrieve your instance location.

- To generate the Cloud API Key, run the following commands:
```
ibmcloud login
ibmcloud iam api-key-create API_KEY_NAME
```
  - Copy the value of `api_key` from the output.


- To retrieve the location of your WML instance, run the following commands:
```
ibmcloud login --apikey API_KEY -a https://cloud.ibm.com
ibmcloud resource service-instance "WML_INSTANCE_NAME"
```
> Note: WML_INSTANCE_NAME is the name of your Watson Machine Learning instance and should be quoted in the command.

  - Copy the value of `Location` from the output.

#### IBM Cloud UI

To generate Cloud API key:
- Go to the [**Users** section of the Cloud console](https://cloud.ibm.com/iam#/users). 
- From that page, click your name in the top right corner, scroll down to the **API Keys** section, and click **Create an IBM Cloud API key**. 
- Give your key a name and click **Create**, then copy the created key to use below.

To retrieve the location of your WML instance:
- Go to the [**Resources List** section of the Cloud console](https://cloud.ibm.com/resources).
- From that page, expand the **Services** section and find your Watson Machine Learning Instance.
- Based on the location displayed on that page, select one of the following values for the `location` variable:

|Displayed Location|Location|
|-|-|
|Dallas|us-south|
|London|eu-gb|
|Frankfurt|eu-de|
|Tokyo|jp-tok|
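The table above can also be expressed as a small lookup. The sketch below maps a displayed location to its code and builds the endpoint URL using the same `https://<location>.ml.cloud.ibm.com` pattern that this notebook uses when it builds `wml_credentials`. The names `LOCATION_CODES` and `wml_endpoint` are hypothetical helpers for illustration.

```python
# Mapping copied from the table above.
LOCATION_CODES = {
    "Dallas": "us-south",
    "London": "eu-gb",
    "Frankfurt": "eu-de",
    "Tokyo": "jp-tok",
}


def wml_endpoint(displayed_location):
    """Build the WML endpoint URL for a location name shown in the Cloud console."""
    try:
        code = LOCATION_CODES[displayed_location]
    except KeyError:
        raise ValueError("Unknown location: {!r}".format(displayed_location))
    return "https://" + code + ".ml.cloud.ibm.com"


print(wml_endpoint("Dallas"))
```

Raising an error on an unknown location surfaces a typo immediately instead of producing a URL that fails later during authentication.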


**<font color='red'><< Enter your `api_key` and `location` in the following cell. >></font>**

In [3]:
# Be sure to update these credentials before running the cell.
api_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
location = 'us-south'
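If you would rather not paste the key into a cell that may be shared, one option is to read it from an environment variable. This is only a sketch: the variable name `WML_API_KEY` and the helper `read_api_key` are assumptions made for illustration, not something the WML client defines.

```python
import os


def read_api_key(env_var="WML_API_KEY"):
    """Return the API key stored in the given environment variable."""
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError("Set the {} environment variable first.".format(env_var))
    return value


# Example only: a real key would be set outside the notebook.
# setdefault keeps an existing value and otherwise stores a placeholder.
os.environ.setdefault("WML_API_KEY", "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
api_key_from_env = read_api_key()
```

The result is stored in a separate name, `api_key_from_env`, so running this cell does not overwrite the `api_key` you entered above.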

In [4]:
wml_credentials = {
    "apikey": api_key,
    "url": 'https://' + location + '.ml.cloud.ibm.com'
}

wml_client = APIClient(wml_credentials)

In [5]:
wml_client.spaces.list()

Note: 'limit' is not provided. Only first 50 records will be displayed if the number of records exceed 50
------------------------------------  -------------------------  ------------------------
ID                                    NAME                       CREATED
63964afe-458e-4e0f-a98a-1df55e344f91  CreditRiskDeploymentSpace  2020-10-14T03:49:26.453Z
------------------------------------  -------------------------  ------------------------


### 2.2 Find Deployment Space

We will try to find the `GUID` for the deployment space you want to use and set it as the default space for the client.

- **<font color='red'><< UPDATE THE VARIABLE 'DEPLOYMENT_SPACE_NAME' TO THE NAME OF THE DEPLOYMENT SPACE CREATED PREVIOUSLY>></font>**

> Note: You should copy the name of your deployment space from the output of the previous cell to the variable in the next cell. The deployment space ID will be looked up based on the name specified below. If you do not receive a space GUID as an output to the next cell, do not proceed until you have created a deployment space.

In [6]:
# Be sure to update the name of the space with the one you want to use.
DEPLOYMENT_SPACE_NAME = "CreditRiskDeploymentSpace"

In [7]:
wml_client.spaces.list()
all_spaces = wml_client.spaces.get_details()['resources']
space_id = None
for space in all_spaces:
    if space['entity']['name'] == DEPLOYMENT_SPACE_NAME:
        space_id = space["metadata"]["id"]
        print("\nDeployment Space ID: ", space_id)

if space_id is None:
    print("WARNING: Your space does not exist. Create a deployment space before proceeding to the next cell.")

Note: 'limit' is not provided. Only first 50 records will be displayed if the number of records exceed 50
------------------------------------  -------------------------  ------------------------
ID                                    NAME                       CREATED
63964afe-458e-4e0f-a98a-1df55e344f91  CreditRiskDeploymentSpace  2020-10-14T03:49:26.453Z
------------------------------------  -------------------------  ------------------------

Deployment Space ID:  63964afe-458e-4e0f-a98a-1df55e344f91
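The lookup loop above has no `break`, so if two spaces happened to share a name it would silently keep the last match. The sketch below is one way to make that case explicit. `find_space_id` is a hypothetical helper, and `sample_spaces` is hand-written data shaped like the `resources` list used above, not a real API response.

```python
def find_space_id(spaces, name):
    """Return the id of the single space called `name`, or None if absent."""
    matches = [s["metadata"]["id"] for s in spaces if s["entity"]["name"] == name]
    if len(matches) > 1:
        raise ValueError("{} spaces are named {!r}".format(len(matches), name))
    return matches[0] if matches else None


# Hand-written sample shaped like the 'resources' list used above.
sample_spaces = [
    {"metadata": {"id": "63964afe-458e-4e0f-a98a-1df55e344f91"},
     "entity": {"name": "CreditRiskDeploymentSpace"}},
]
print(find_space_id(sample_spaces, "CreditRiskDeploymentSpace"))
```

With the real client, the first argument would be `wml_client.spaces.get_details()['resources']`, as in the cell above.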


In [8]:
# Now set the default space to the GUID for your deployment space. If this is successful, you will see a 'SUCCESS' message.
wml_client.set.default_space(space_id)

'SUCCESS'

In [9]:
# These are the models and deployments we currently have in our deployment space.
wml_client.repository.list_models()
wml_client.deployments.list()

------------------------------------  -----------------------------  ------------------------  ---------
ID                                    NAME                           CREATED                   TYPE
1e815778-35f1-495e-b112-810928203cc6  JRTCreditRiskSparkModel1014v1  2020-10-14T19:10:52.002Z  mllib_2.4
------------------------------------  -----------------------------  ------------------------  ---------
------------------------------------  -------------------------  -----  ------------------------
GUID                                  NAME                       STATE  CREATED
659c3ba6-92c5-431d-807a-ea8d7237bae7  SparkModelBatchDeployment  ready  2020-10-14T19:14:20.378Z
------------------------------------  -------------------------  -----  ------------------------


### 2.3 Find Batch Deployment

We will try to find the batch deployment which was created.

- <font color=red>**<< UPDATE THE VARIABLE 'DEPLOYMENT_NAME' BELOW WITH THE NAME OF THE BATCH DEPLOYMENT YOU CREATED PREVIOUSLY >>**</font>

>Note: You should copy the deployment name from the output of the previous cell to the variable in this next cell. 

In [10]:
DEPLOYMENT_NAME = "SparkModelBatchDeployment"

In [11]:
wml_deployments = wml_client.deployments.get_details()
deployment_uid = None
deployment_details = None
for deployment in wml_deployments['resources']:
    if DEPLOYMENT_NAME == deployment['entity']['name']:
        deployment_uid = deployment['metadata']['id']
        deployment_details = deployment
        #print(json.dumps(deployment_details, indent=3))
        break

print("Deployment id: {}".format(deployment_uid))
wml_client.deployments.get_details(deployment_uid)

Deployment id: 659c3ba6-92c5-431d-807a-ea8d7237bae7


{'entity': {'asset': {'id': '1e815778-35f1-495e-b112-810928203cc6'},
  'batch': {},
  'custom': {},
  'deployed_asset_type': 'model',
  'hardware_spec': {'id': 'f3ebac7d-0a75-410c-8b48-a931428cc4c5',
   'name': 'XS',
   'num_nodes': 1},
  'name': 'SparkModelBatchDeployment',
  'space_id': '63964afe-458e-4e0f-a98a-1df55e344f91',
  'status': {'state': 'ready'}},
 'metadata': {'created_at': '2020-10-14T19:14:20.378Z',
  'id': '659c3ba6-92c5-431d-807a-ea8d7237bae7',
  'modified_at': '2020-10-14T19:14:20.378Z',
  'name': 'SparkModelBatchDeployment',
  'owner': 'IBMid-060000B8AS',
  'space_id': '63964afe-458e-4e0f-a98a-1df55e344f91'}}
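Before submitting a job, it can help to check that the deployment you found is a batch deployment and is in the `ready` state. The sketch below reads only keys that appear in the output above. Treating the presence of the `batch` key as the marker for a batch deployment is an assumption, and `is_ready_batch_deployment` is a hypothetical helper name.

```python
def is_ready_batch_deployment(details):
    """Return True if the details describe a batch deployment in the 'ready' state."""
    entity = details.get("entity", {})
    # Assumption: batch deployments carry a 'batch' key, as in the output above.
    is_batch = "batch" in entity
    state = entity.get("status", {}).get("state")
    return is_batch and state == "ready"


# Trimmed copy of the deployment details shown above.
sample_details = {
    "entity": {
        "batch": {},
        "name": "SparkModelBatchDeployment",
        "status": {"state": "ready"},
    }
}
print(is_ready_batch_deployment(sample_details))
```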

### 2.4 Get Batch Test Data

We will load some data to run the batch predictions against.

**<font color='red'><< FOLLOW THE INSTRUCTIONS BELOW TO LOAD THE DATASET >></font>**

* Highlight the cell below by clicking in it, under the first commented line.
* Click the `01/00` "Find data" icon in the upper right of the notebook.
* Find the project file `German-Credit-Risk-SmallBatchSet.csv` in the `Files` tab. Then click `Insert to code` and choose `pandas DataFrame`.
* The code to bring the data into the notebook environment and create a Pandas DataFrame will be added to the cell below.
* Run the cell.


In [12]:
# Place cursor below and insert the Pandas DataFrame for the German-Credit-Risk-SmallBatchSet.csv
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_42566de6c062499c8afc68156fe822fd = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='h2evvwzj8U3c8HU5R_pq3yC5Ls9BQ4Vdi0vBnVTNEpac',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_42566de6c062499c8afc68156fe822fd.get_object(Bucket='creditriskproject-donotdelete-pr-atqrxmoe8zw0sw',Key='German-Credit-Risk-SmallBatchSet.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_2 = pd.read_csv(body)
df_data_2.head()


Unnamed: 0,CheckingStatus,LoanDuration,CreditHistory,LoanPurpose,LoanAmount,ExistingSavings,EmploymentDuration,InstallmentPercent,Sex,OthersOnLoan,CurrentResidenceDuration,OwnsProperty,Age,InstallmentPlans,Housing,ExistingCreditsCount,Job,Dependents,Telephone,ForeignWorker
0,0_to_200,4,all_credits_paid_back,car_new,250,100_to_500,less_1,2,female,none,3,real_estate,26,bank,rent,1,unskilled,1,none,yes
1,0_to_200,14,credits_paid_to_date,car_new,3148,less_100,1_to_4,3,male,none,3,car_other,41,none,own,2,skilled,1,none,yes
2,greater_200,19,credits_paid_to_date,radio_tv,5351,100_to_500,greater_7,4,male,none,3,savings_insurance,49,none,own,2,skilled,1,yes,yes
3,greater_200,34,outstanding_credit,other,5790,500_to_1000,greater_7,5,male,none,4,car_other,44,stores,own,2,unskilled,1,yes,yes
4,less_0,4,all_credits_paid_back,car_new,250,100_to_500,less_1,1,female,none,1,real_estate,21,bank,rent,1,unskilled,1,none,yes


We'll follow the Pandas naming convention and use `df` in the name of our DataFrame. Make sure that the cell below uses the DataFrame name generated in the cell above. For a locally uploaded file, the generated name looks like `df_data_1`, `df_data_2`, or `df_data_x`.

<font color=red>**<< UPDATE THE VARIABLE ASSIGNMENT TO THE VARIABLE GENERATED ABOVE. >>**</font>

In [13]:
# Replace df_data_2 with the variable name generated above.
batch_df = df_data_2

### 2.5 Create Job

We can now use the information about the deployment and the test data to create a new job against our batch deployment. We submit the data as an inline payload and want the results (i.e., the predictions) stored in a CSV file.

In [14]:
import time
timestr = time.strftime("%Y%m%d_%H%M%S")
job_payload = {
    wml_client.deployments.ScoringMetaNames.INPUT_DATA: [{
        'fields': batch_df.columns.values.tolist(),
        'values': batch_df.values.tolist()
    }],
    wml_client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
            "type": "data_asset",
            "connection": {},
            "location": {
                "name": "batchres_{}_{}.csv".format(timestr,deployment_uid),
                "description": "results"
            }
    }
}

job = wml_client.deployments.create_job(deployment_id=deployment_uid, meta_props=job_payload)
job_uid = wml_client.deployments.get_job_uid(job)

print('Job uid = {}'.format(job_uid))

Job uid = 9dacfdf0-30d5-4657-8987-e18a3cfa247d
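The inline payload in the cell above is a list of column names (`fields`) plus a list of rows (`values`), each row in the same column order. The sketch below builds that shape from plain Python dicts so the structure is easy to see without a DataFrame. `to_fields_and_values` is a hypothetical helper, and the two sample rows are a two-column excerpt of the data shown earlier.

```python
def to_fields_and_values(records):
    """Convert a list of row dicts into {'fields': [...], 'values': [[...], ...]}."""
    if not records:
        raise ValueError("No rows to score")
    fields = list(records[0].keys())
    values = []
    for i, row in enumerate(records):
        # Every row must have the same columns in the same order.
        if list(row.keys()) != fields:
            raise ValueError("Row {} has different columns".format(i))
        values.append([row[f] for f in fields])
    return {"fields": fields, "values": values}


# Two-column excerpt of the first two rows shown earlier.
sample_rows = [
    {"CheckingStatus": "0_to_200", "LoanDuration": 4},
    {"CheckingStatus": "0_to_200", "LoanDuration": 14},
]
print(to_fields_and_values(sample_rows))
```

With pandas, `batch_df.to_dict('records')` should produce a list of row dicts in this form, although the cell above reaches the same result more directly with `columns.values.tolist()` and `values.tolist()`.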


In [15]:
wml_client.deployments.list_jobs()

------------------------------------  ------  ------------------------  ------------------------------------
JOB-UID                               STATE   CREATED                   DEPLOYMENT-ID
9dacfdf0-30d5-4657-8987-e18a3cfa247d  queued  2020-10-14T19:37:02.294Z  659c3ba6-92c5-431d-807a-ea8d7237bae7
------------------------------------  ------  ------------------------  ------------------------------------


## 3.0 Monitor Batch Job Status

The batch job is an asynchronous operation, so we use the job identifier to track its progress. The cell below polls until the job completes or fails.

In [16]:
def poll_async_job(client, job_uid):
    import time
    while True:
        job_status = client.deployments.get_job_status(job_uid)
        print(job_status)
        state = job_status['state']
        if state == 'completed' or 'fail' in state:
            return client.deployments.get_job_details(job_uid)
        time.sleep(5)
            
job_details = poll_async_job(wml_client, job_uid)

{'completed_at': '', 'running_at': '', 'state': 'queued'}
{'completed_at': '', 'running_at': '', 'state': 'queued'}
{'completed_at': '', 'running_at': '', 'state': 'queued'}
{'completed_at': '', 'running_at': '', 'state': 'queued'}
{'completed_at': '2020-10-14T19:37:38.000Z', 'running_at': '2020-10-14T19:37:36.000Z', 'state': 'completed'}
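The polling loop above runs until the job reaches a terminal state, so a job that stays queued would block the notebook indefinitely. The sketch below is one possible variant that adds a timeout. It takes a status function as an argument so that it can be tried without a live WML instance. The name `poll_until_done` and the default `interval` and `timeout` values are assumptions for illustration, and the terminal-state test is the same one used above.

```python
import time


def poll_until_done(get_status, interval=5.0, timeout=600.0):
    """Call get_status() until the job completes or fails, or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        state = status["state"]
        # Same terminal-state test as the polling cell above.
        if state == "completed" or "fail" in state:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Job still {!r} after {} seconds".format(state, timeout))
        time.sleep(interval)


# Stand-in for the real status call: replays a canned sequence of states.
fake_states = iter(["queued", "queued", "completed"])
final = poll_until_done(lambda: {"state": next(fake_states)}, interval=0)
print(final["state"])
```

Against the real service, the first argument would be `lambda: wml_client.deployments.get_job_status(job_uid)`.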


In [17]:
wml_client.deployments.list_jobs()

------------------------------------  ---------  ------------------------  ------------------------------------
JOB-UID                               STATE      CREATED                   DEPLOYMENT-ID
9dacfdf0-30d5-4657-8987-e18a3cfa247d  completed  2020-10-14T19:37:02.294Z  659c3ba6-92c5-431d-807a-ea8d7237bae7
------------------------------------  ---------  ------------------------  ------------------------------------


### 3.1 Check Results

With the job complete, we can inspect the job details. The predictions themselves are stored in the CSV data asset named in the job payload.

In [18]:
wml_client.deployments.get_job_details()

{'resources': [{'entity': {'deployment': {'id': '659c3ba6-92c5-431d-807a-ea8d7237bae7'},
    'scoring': {'input_data': [{'fields': ['CheckingStatus',
        'LoanDuration',
        'CreditHistory',
        'LoanPurpose',
        'LoanAmount',
        'ExistingSavings',
        'EmploymentDuration',
        'InstallmentPercent',
        'Sex',
        'OthersOnLoan',
        'CurrentResidenceDuration',
        'OwnsProperty',
        'Age',
        'InstallmentPlans',
        'Housing',
        'ExistingCreditsCount',
        'Job',
        'Dependents',
        'Telephone',
        'ForeignWorker'],
       'values': [['0_to_200',
         4,
         'all_credits_paid_back',
         'car_new',
         250,
         '100_to_500',
         'less_1',
         2,
         'female',
         'none',
         3,
         'real_estate',
         26,
         'bank',
         'rent',
         1,
         'unskilled',
         1,
         'none',
         'yes'],
        ['0_to_200',
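The nested `fields` and `values` lists in the job details are hard to read at a glance. The sketch below pairs each value with its field name so that each scored row can be inspected as a dict. It relies only on the `entity.scoring.input_data` keys visible in the output above. `input_rows` is a hypothetical helper, and `sample_job` is a hand-trimmed excerpt rather than a full response.

```python
def input_rows(job_details):
    """Yield each submitted input row as a {field: value} dict."""
    for block in job_details["entity"]["scoring"]["input_data"]:
        fields = block["fields"]
        for values in block["values"]:
            yield dict(zip(fields, values))


# Hand-trimmed excerpt of the job details, limited to three columns.
sample_job = {
    "entity": {
        "scoring": {
            "input_data": [{
                "fields": ["CheckingStatus", "LoanDuration", "CreditHistory"],
                "values": [["0_to_200", 4, "all_credits_paid_back"]],
            }]
        }
    }
}
for row in input_rows(sample_job):
    print(row)
```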
       

In [19]:
print(json.dumps(job_details, indent=2))

{
  "entity": {
    "deployment": {
      "id": "659c3ba6-92c5-431d-807a-ea8d7237bae7"
    },
    "scoring": {
      "input_data": [
        {
          "fields": [
            "CheckingStatus",
            "LoanDuration",
            "CreditHistory",
            "LoanPurpose",
            "LoanAmount",
            "ExistingSavings",
            "EmploymentDuration",
            "InstallmentPercent",
            "Sex",
            "OthersOnLoan",
            "CurrentResidenceDuration",
            "OwnsProperty",
            "Age",
            "InstallmentPlans",
            "Housing",
            "ExistingCreditsCount",
            "Job",
            "Dependents",
            "Telephone",
            "ForeignWorker"
          ],
          "values": [
            [
              "0_to_200",
              4,
              "all_credits_paid_back",
              "car_new",
              250,
              "100_to_500",
              "less_1",
              2,
              "female",
     

## Congratulations, you have created and submitted a job for batch scoring!