<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# IBM Watson OpenScale: monitoring models with data in a remote Hive location

This notebook must be run in the Python 3.10 runtime environment. It requires Watson OpenScale service credentials.

The notebook demonstrates how to onboard a model (which stores its runtime data in a remote Hive table) for monitoring in IBM Watson OpenScale. Use the notebook to enable quality, drift, fairness and explainability monitoring. Before you can run the notebook, you must have the following resources:

1. The configuration package (archive) containing common configuration JSON, drift archive and explainability archive generated by using the [common configuration notebook](https://github.com/IBM/watson-openscale-samples/blob/main/Cloud%20Pak%20for%20Data/Batch%20Support/Configuration%20generation%20for%20OpenScale%20batch%20subscription.ipynb).
2. Details of the feedback, payload, drifted transactions, explanation queue and explanation result tables (either existing or to be created) in Hive.

## Contents

1. [Setup](#setup)
2. [Provide Spark Compute Engine Details](#spark)
3. [Provide Storage Details](#backend-storage)
4. [Provide Table Details](#table-details)
5. [Connect to IBM Watson OpenScale Instance](#connect-openscale)
6. [Connect service provider in IBM Watson OpenScale Instance](#create-service-provider)
7. [Onboard model for monitoring in IBM Watson OpenScale Instance](#create-subscription)
8. [Enable services to monitor model](#enable-monitors)

## Setup <a name="setup"></a>

### Installing Required Libraries

First, install the packages that the notebook needs. After the installation finishes, restart the kernel.

### Import configuration archive/package

The configuration archive/package created by the configuration notebook is required to onboard the model for monitoring in IBM Watson OpenScale. Provide the path to the archive here.

Note: if you are running this notebook in IBM Watson Studio, first upload the configuration archive/package to the project, then use the provided code snippet to download it to the local directory of this notebook.

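Before onboarding, it can be useful to confirm that the archive exists at the given path and to see what it contains. The following is a minimal sketch that uses only the Python standard library; `list_archive_members` is a hypothetical helper name, not part of the OpenScale SDK, and the file names inside your archive depend on how the configuration notebook generated it.

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path):
    """Hypothetical helper: return the names of the files stored in a .tar.gz archive."""
    path = Path(archive_path)
    if not path.is_file():
        raise FileNotFoundError(f"Configuration archive not found: {path}")
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(path, "r:gz") as archive:
        return sorted(member.name for member in archive.getmembers() if member.isfile())
```

Calling `list_archive_members(archive_file_path)` after the cell below has set `archive_file_path` prints nothing by itself; wrap it in `print(...)` to inspect the result.
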
In [None]:
import warnings
warnings.filterwarnings("ignore")
%env PIP_DISABLE_PIP_VERSION_CHECK=1

# Note: Restart kernel after the dependencies are installed
!pip install --upgrade ibm-watson-openscale
!pip install "ibm_wos_utils>=4.8.0"

# # Download "configuration_archive.tar.gz" from project to local directory
# from ibm_watson_studio_lib import access_project_or_space
# wslib = access_project_or_space()
# wslib.download_file("configuration_archive.tar.gz")

archive_file_path = "configuration_archive.tar.gz"

## Provide Spark Connection Details <a name="spark"></a>

To generate configuration for monitoring models in IBM Watson OpenScale, a Spark compute engine is required. It can be either IBM Analytics Engine or your own Spark cluster. Provide the details of one of them in this section.

Note: if you are using your own Spark cluster, check the IBM Watson OpenScale documentation on how to set up the Spark Manager API so that the cluster can be used with IBM Watson OpenScale services.

### Parameters for IBM Analytics Engine
If your job is going to run on a Spark cluster that is part of an IBM Analytics Engine instance on IBM Cloud Pak for Data, enter the following details:

| Parameter | Description | Possible Value(s) |
| :- | :- | :- |
| display_name | Display Name of the Spark instance in IBM Analytics Engine | |
| location_type | Identifies if compute engine is IBM IAE or Remote Spark. For IBM IAE, this must be set to `cpd_iae`. | `cpd_iae` |
| endpoint | Spark Jobs Endpoint for IBM Analytics Engine | |
| volume | IBM Cloud Pak for Data storage volume name | |
| username | IBM Cloud Pak for Data username | |
| apikey | IBM Cloud Pak for Data API key | |

### Parameters for Remote Spark Cluster
If your job is going to run on a Spark cluster that is part of a remote Hadoop ecosystem, enter the following details:

| Parameter | Description | Possible Value(s) |
| :- | :- | :- |
| location_type | Identifies if compute engine is IBM IAE or Remote Spark. For Remote Spark, this must be set to `custom`. | `custom` |
| endpoint | Endpoint URL where the Spark Manager Application is running | |
| username | Username to connect to Spark Manager Application | |
| password | Password to connect to Spark Manager Application | |

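The two parameter tables above can be sketched as a small builder that fills only the keys listed for the chosen engine. `build_spark_connection_info` is a hypothetical helper, not an SDK function. Whether the OpenScale client accepts a dictionary without the unused keys is an assumption; the template cell in this notebook keeps all keys, so verify against your environment before relying on this.

```python
def build_spark_connection_info(location_type, endpoint, username, *,
                                password=None, apikey=None,
                                display_name=None, volume=None):
    """Hypothetical helper: return a connection dictionary for one engine type."""
    if location_type == "cpd_iae":
        # IBM Analytics Engine on Cloud Pak for Data: API key based credentials.
        required = {"display_name": display_name, "volume": volume, "apikey": apikey}
        connection = {
            "endpoint": endpoint,
            "location_type": location_type,
            "display_name": display_name,
            "volume": volume,
        }
        credentials = {"username": username, "apikey": apikey}
    elif location_type == "custom":
        # Remote Spark cluster reached through the Spark Manager Application.
        required = {"password": password}
        connection = {"endpoint": endpoint, "location_type": location_type}
        credentials = {"username": username, "password": password}
    else:
        raise ValueError(f"Unsupported location_type: {location_type!r}")

    missing = sorted(name for name, value in required.items() if not value)
    if missing:
        raise ValueError(f"Missing values for {location_type}: {', '.join(missing)}")
    return {"connection": connection, "credentials": credentials}
```

Failing early on a missing value keeps the error close to the cell where the value is entered, rather than surfacing later when the subscription is created.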

### Provide Spark Resource Settings [Optional]
Configure how much of your Spark cluster's resources this job can consume. Leave the variable `spark_settings` set to `{}` if no customisation is required.

| Parameter | Description |
| :- | :- |
| max_num_executors | Maximum Number of executors to launch for this session |
| min_executors | Minimum Number of executors to launch for this session |
| executor_cores | Number of cores to use for each executor |
| executor_memory | Amount of memory (in GBs) to use per executor process |
| driver_cores | Number of cores to use for the driver process |
| driver_memory | Amount of memory (in GBs) to use for the driver process |

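A quick sanity check of `spark_settings` before submitting it can catch misspelled keys. The sketch below is a hypothetical helper, not part of the SDK. It assumes that every value is a whole number given as a string or an integer (as in the example in the next cell) and that `min_executors` should not exceed `max_num_executors`; neither rule is stated by the service documentation quoted here.

```python
ALLOWED_SPARK_SETTINGS = {
    "max_num_executors", "min_executors", "executor_cores",
    "executor_memory", "driver_cores", "driver_memory",
}


def check_spark_settings(settings):
    """Hypothetical helper: return a list of problems found in spark_settings."""
    problems = [f"unknown setting: {key}"
                for key in sorted(set(settings) - ALLOWED_SPARK_SETTINGS)]
    parsed = {}
    for key in sorted(set(settings) & ALLOWED_SPARK_SETTINGS):
        try:
            parsed[key] = int(settings[key])
        except (TypeError, ValueError):
            problems.append(f"{key} is not a whole number: {settings[key]!r}")
            continue
        if parsed[key] < 1:
            problems.append(f"{key} must be at least 1")
    # Assumed rule: the minimum executor count cannot exceed the maximum.
    if ("min_executors" in parsed and "max_num_executors" in parsed
            and parsed["min_executors"] > parsed["max_num_executors"]):
        problems.append("min_executors is greater than max_num_executors")
    return problems
```

An empty list means no problem was found under these assumptions; an empty `spark_settings` dictionary therefore passes.
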
In [None]:
spark_connection_info = {
    "connection": {
        "endpoint": "<to_be_edited>",
        "location_type": "<to_be_edited>",
        "display_name": "<to_be_edited>",
        "volume": "<to_be_edited>"
    },
    "credentials": {
        "username": "<to_be_edited>",
        "password": "<to_be_edited>",
        "apikey": "<to_be_edited>"
    }
}


"""
Example:

spark_settings = {
    # max_num_executors: Maximum Number of executors to launch for this session
    "max_num_executors": "2",
    
    # min_executors: Minimum Number of executors to launch for this session
    "min_executors": "1",
    
    # executor_cores: Number of cores to use for each executor
    "executor_cores": "2",
    
    # executor_memory: Amount of memory (in GBs) to use per executor process
    "executor_memory": "2",
    
    # driver_cores: Number of cores to use for the driver process
    "driver_cores": "2",
    
    # driver_memory: Amount of memory (in GBs) to use for the driver process 
    "driver_memory": "1"
}
"""
spark_settings = {}

spark_connection_info["spark_settings"] = spark_settings

## Provide Backend Storage Details <a name="backend-storage"></a>

IBM Watson OpenScale services monitor models by analyzing runtime data, that is, the data the model makes predictions on. For this analysis, most services require access to the runtime data (also called payload data). In addition, some services may require access to manually labelled runtime data (also called feedback data). You therefore need to store this data in a backend storage and connect that storage to IBM Watson OpenScale.

### Provide Hive database connection details

| Parameter | Description | Possible Value(s) |
| :- | :- | :- |
| type | Describes the type of storage being used. For hive, this must be set to `hive`. | `hive` |
| metastore_url | An optional string value specifying hive metastore url. Example: `thrift://localhost:9083` | |
| location_type | Identifies the type of location for connection to use. For hive, this must be set to `metastore`. | `metastore` |

#### Provide additional details for the Hadoop delegation token if Hive is secured with Kerberos and Spark in IBM Analytics Engine is used [Optional]
| Parameter | Description | Possible Value(s) |
| :- | :- | :- |
| kerberos_principal | The kerberos principal used to generate the delegation token. | |
| delegation_token_urn | The secret_urn of the CP4D vault where the delegation token is stored. | |
| delegation_token_endpoint | The REST endpoint which generates and returns the delegation token. | |

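The connection and Kerberos parameters above can be assembled as sketched below. `build_hive_details` is a hypothetical helper that mirrors the dictionary layout used in the next cell. Treating the `<to_be_edited>` placeholder as "not provided" and raising an error when neither token source is given are choices made for this sketch.

```python
PLACEHOLDER = "<to_be_edited>"


def _provided(value):
    """Return True when a value is set and is not the notebook placeholder."""
    return bool(value) and value != PLACEHOLDER


def build_hive_details(metastore_url, kerberos_enabled=False,
                       kerberos_principal=None, delegation_token_urn=None,
                       delegation_token_endpoint=None):
    """Hypothetical helper: return the Hive storage details dictionary."""
    details = {
        "type": "hive",
        "connection": {"location_type": "metastore", "metastore_url": metastore_url},
        "credentials": {},
    }
    if not kerberos_enabled:
        return details
    if not (_provided(delegation_token_urn) or _provided(delegation_token_endpoint)):
        raise ValueError("Provide delegation_token_urn or delegation_token_endpoint")
    details["connection"]["kerberos_enabled"] = True
    details["credentials"]["kerberos_principal"] = kerberos_principal
    # Only forward the token source(s) that were actually filled in.
    if _provided(delegation_token_urn):
        details["credentials"]["delegation_token_urn"] = delegation_token_urn
    if _provided(delegation_token_endpoint):
        details["credentials"]["delegation_token_endpoint"] = delegation_token_endpoint
    return details
```

Skipping placeholder values avoids sending the literal string `<to_be_edited>` as a vault URN or endpoint when only one of the two is used.
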
In [None]:
datawarehouse_details = {
    "type": "hive",
    "connection": {
        "location_type": "metastore",
        "metastore_url": "<to_be_edited>"
    },
    "credentials": {}
}

# Flag to indicate if the Hive is secured with Kerberos and Spark in IAE is used
kerberos_enabled = False

# Provide Hadoop delegation token details if kerberos_enabled is True.
# Provide either the secret_urn of the CP4D vault OR the delegation token endpoint;
# one of the two fields is mandatory to fetch the delegation token. Set the field
# you do not use to None so that the placeholder value is not sent.
kerberos_principal = "<to_be_edited>"
delegation_token_urn = "<to_be_edited>"
delegation_token_endpoint = "<to_be_edited>"

if kerberos_enabled is True:
    datawarehouse_details["connection"]["kerberos_enabled"] = True
    datawarehouse_details["credentials"]["kerberos_principal"] = kerberos_principal
    if delegation_token_urn:
        datawarehouse_details["credentials"]["delegation_token_urn"] = delegation_token_urn
    if delegation_token_endpoint:
        datawarehouse_details["credentials"]["delegation_token_endpoint"] = delegation_token_endpoint

## Provide details of different tables <a name="table-details"></a>

IBM Watson OpenScale services require different tables to perform their analysis. Depending on which services you have enabled, provide details of the corresponding tables. The tables are:

| Table | Description |
| :- | :- |
| Payload Table | Hosts the runtime data that the model makes predictions on. Required for detecting fairness and drift in runtime data. |
| Feedback Table | Hosts the manually labelled runtime data (also called feedback data). Required for tracking the quality of the model by analyzing feedback data. |
| Drifted Transactions Table | Hosts the data identified as drifted. |
| Explain Queue Table | Hosts the data for which explanations need to be generated. This can be the same as the payload table. |
| Explain Results Table | Hosts the explanations generated for records in the explain queue table. |

For each table, the following information is required:

| Parameter | Description | Possible Value(s) |
| :- | :- | :- |
| database | Name of the database hosting the schema. | |
| schema | Name of the schema hosting the table. | |
| table | Name of the table. | |
| auto_create | Boolean value identifying if the table already exists or has to be created via IBM Watson OpenScale. | `True` or `False`|
| hive_storage_format | Storage format to use for data in tables. Used only when tables are created using IBM Watson OpenScale. | `csv`, `parquet`, `orc` |

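Because every table uses the same layout, the dictionaries in the next cell can also be produced by a small factory. The sketch below is a hypothetical helper; it validates `hive_storage_format` against the three values listed in the table above and leaves out the `parameters` key when no format is given, which is an assumption based on the explain results table in this notebook.

```python
HIVE_STORAGE_FORMATS = ("csv", "parquet", "orc")


def make_table_details(database, schema, table, auto_create=True,
                       hive_storage_format=None):
    """Hypothetical helper: return the details dictionary for one table."""
    details = {
        "data": {
            "auto_create": auto_create,  # False if the table already exists
            "database": database,
            "schema": schema,
            "table": table,
        }
    }
    if hive_storage_format is not None:
        if hive_storage_format not in HIVE_STORAGE_FORMATS:
            raise ValueError(
                f"hive_storage_format must be one of {HIVE_STORAGE_FORMATS}, "
                f"got {hive_storage_format!r}")
        details["parameters"] = {"hive_storage_format": hive_storage_format}
    return details
```

For example, `make_table_details(DATABASE_NAME, SCHEMA_NAME, "payload", hive_storage_format="parquet")` produces the same shape as the `payload_table` dictionary below.
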
In [None]:
DATABASE_NAME="<to_be_edited>"
SCHEMA_NAME="<to_be_edited>"

# Payload table information
payload_table = {
    "data": {
        "auto_create": True, #set it to False if table already exists
        "database": DATABASE_NAME,
        "schema": SCHEMA_NAME,
        "table": "<to_be_edited>"
    },
    "parameters":{
        "hive_storage_format": "<to_be_edited>"
    }
}

# Feedback table information
feedback_table = {
    "data": {
        "auto_create": True, #set it to False if table already exists
        "database": DATABASE_NAME,
        "schema": SCHEMA_NAME,
        "table": "<to_be_edited>"
    },
    "parameters":{
        "hive_storage_format": "<to_be_edited>"
    }
}

# Drifted Transaction table
# Set this table information if drift is enabled
drifted_transaction_table = {
    "data": {
        "auto_create": True, #set it to False if table already exists
        "database": DATABASE_NAME,
        "schema": SCHEMA_NAME,
        "table": "<to_be_edited>"
    },
    "parameters":{}
}

# Explanation Result table
# Set this table information if Explain is enabled
explain_result_table = {
    "data": {
        "auto_create": True, #set it to False if table already exists
        "database": DATABASE_NAME,
        "schema": SCHEMA_NAME,
        "table": "<to_be_edited>"
    }
}

# Explanation Queue table
# Set this table information if Explain is enabled
explain_queue_table = {
    "data": {
        "auto_create": True, #set it to False if table already exists
        "database": DATABASE_NAME,
        "schema": SCHEMA_NAME,
        "table": "<to_be_edited>"
    },
    "parameters":{
        "hive_storage_format": "<to_be_edited>"
    }
}

## Connect to IBM Watson OpenScale instance <a name="connect-openscale"></a>

The following information is required to connect to the IBM Watson OpenScale instance:

| Parameter | Description |
| :- | :- |
| url | Base url of your Cloud Pak for Data cluster hosting IBM Watson OpenScale instance. |
| username | Username to connect to your IBM Watson OpenScale instance in Cloud Pak for Data cluster. |
| password | Password to connect to your IBM Watson OpenScale instance in Cloud Pak for Data cluster. One of `password` or `apikey` must be provided. |
| apikey | API Key to connect to your IBM Watson OpenScale instance in Cloud Pak for Data cluster. One of `password` or `apikey` must be provided. |
| service_instance_id | Id of your IBM Watson OpenScale Instance |

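The cells in this notebook use `<to_be_edited>` as a placeholder for values you must supply. Before connecting, it can help to check that no placeholder is left in a dictionary. The following recursive scan is a sketch; `find_placeholders` is a hypothetical helper name and the dotted path format is an arbitrary choice.

```python
def find_placeholders(config, placeholder="<to_be_edited>", path=""):
    """Hypothetical helper: return the paths of values still equal to the placeholder."""
    found = []
    if isinstance(config, dict):
        for key, value in config.items():
            child_path = f"{path}.{key}" if path else str(key)
            found.extend(find_placeholders(value, placeholder, child_path))
    elif isinstance(config, (list, tuple)):
        for index, value in enumerate(config):
            found.extend(find_placeholders(value, placeholder, f"{path}[{index}]"))
    elif config == placeholder:
        found.append(path)
    return found
```

Running `find_placeholders(service_credentials)` after filling in the next cell should return an empty list. For dictionaries where some keys are intentionally unused, such as the Spark credentials, leftover placeholders are not necessarily an error.
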
In [None]:
from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator
from ibm_watson_openscale import APIClient
from ibm_watson_openscale.supporting_classes.enums import *

import warnings
warnings.filterwarnings('ignore')

service_instance_id = "<SERVICE_INSTANCE_ID>" #Default is 00000000-0000-0000-0000-000000000000
service_credentials = {
    "url": "<to_be_edited>",
    "username": "<to_be_edited>",
    "password": "<to_be_edited>",
#     "apikey":"<to_be_edited>"
}

authenticator = CloudPakForDataAuthenticator(
    url=service_credentials['url'],
    username=service_credentials['username'],
    password=service_credentials['password'],
#     apikey=service_credentials['apikey'],
    disable_ssl_verification=True
)

client = APIClient(
    service_url=service_credentials['url'],
    service_instance_id=service_instance_id,
    authenticator=authenticator
)

print(client.version)

## Configure Machine Learning Provider in IBM Watson OpenScale instance <a name="create-service-provider"></a>

Before configuring a model for monitoring in IBM Watson OpenScale, you need to connect your machine learning provider to the IBM Watson OpenScale instance. Because the model being configured stores its runtime data remotely from IBM Watson OpenScale, we create a custom machine learning provider in the given instance.

The following details are required:

| Parameter | Description |
| :- | :- |
| name | Name of the machine learning provider being configured. This can be any string value. |
| description | Description for the machine learning provider being configured. |
| service_type | Identifies type of the machine learning provider. In this case, this value must be `ServiceTypes.CUSTOM_MACHINE_LEARNING` |
| credentials | Optional input, stores username and password to connect to machine learning provider. |
| operational_space_id | Defines the classification of machine learning provider. Possible values are `pre-production` and `production`. |

In [None]:
# [OPTIONAL] Delete existing service provider with the same name as provided

# SERVICE_PROVIDER_NAME = "<to_be_edited>"
# service_providers = client.service_providers.list().result.service_providers
# for provider in service_providers:
#     if provider.entity.name == SERVICE_PROVIDER_NAME:
#         client.service_providers.delete(service_provider_id=provider.metadata.id)
#         break

# Add Service Provider
from ibm_watson_openscale.supporting_classes.enums import ServiceTypes

added_service_provider_result = client.service_providers.add(
        name="<to_be_edited>",
        description="<to_be_edited>",
        service_type=ServiceTypes.CUSTOM_MACHINE_LEARNING,
        credentials={},
        operational_space_id="<to_be_edited>",
        background_mode=False
    ).result

service_provider_id = added_service_provider_result.metadata.id

client.service_providers.show()

## Onboard model for monitoring in IBM Watson OpenScale instance <a name="create-subscription"></a>

When you configure a model for monitoring in an IBM Watson OpenScale instance, a corresponding subscription is created for the model. The following details are required:

| Parameter | Description |
| :- | :- |
| subscription_name | Name of the subscription to use. This can be any string value typically identifying model being monitored. |
| datamart_id | Same as id of IBM Watson OpenScale instance. |
| service_provider_id | Id of the machine learning provider instance created in IBM Watson OpenScale. |
| configuration_archive | Path to configuration package archive. |
| spark_credentials | Connection details of Spark compute engine to use for analysis by different IBM Watson OpenScale services. |
| data_warehouse_connection | Details of the backend storage hosting tables for data, feedback data, etc. |
| payload_table | Details of the payload table to be used with this subscription. |
| feedback_table | Details of the feedback table to be used with this subscription. |

In [None]:
# Provide the path to the configuration file. If executing in IBM Watson Studio then leave as it is
subscription_id = client.subscriptions.create_subscription(
    subscription_name="My SDK Batch Subscription-hive",
    datamart_id=service_instance_id,
    service_provider_id=service_provider_id,
    configuration_archive=archive_file_path,
    spark_credentials=spark_connection_info,
    data_warehouse_connection=datawarehouse_details,
    payload_table=payload_table,
    feedback_table=feedback_table
)

# print("Subscription id is {}".format(subscription_id))

# Wait for the subscription to get in active state and to create the 
# required tables in the background before moving onto enabling monitors

# import time
# from datetime import datetime

# subscription_status = None
# while subscription_status not in ("active", "error"):
#     subscription_status = client.subscriptions.get(subscription_id).result.entity.status.state
#     if subscription_status not in ("active", "error"):
#         print(datetime.now().strftime("%H:%M:%S"), subscription_status)
#         time.sleep(15)
        
# print(datetime.now().strftime("%H:%M:%S"), subscription_status)

## Enable different services to monitor model <a name="enable-monitors"></a>

Depending on the services enabled in the configuration package and the availability of their corresponding artefacts, different services are enabled in the given subscription. These services are called monitors.

The following details are required:

| Parameter | Description |
| :- | :- |
| datamart_id | Same as id of IBM Watson OpenScale instance. |
| service_provider_id | Id of the machine learning provider instance created in IBM Watson OpenScale. |
| subscription_id | Id of the subscription created for given model in IBM Watson OpenScale instance. |
| configuration_archive | Path to configuration package archive. |
| drifted_transaction_table | Details of the drifted transactions table to be used with this subscription. |
| explain_queue_table | Details of the explain queue table to be used with this subscription. |
| explain_results_table | Details of the explain results table to be used with this subscription. |

In [None]:
instance_ids = client.monitor_instances.enable_monitors(
    datamart_id=service_instance_id,
    service_provider_id=service_provider_id,
    subscription_id=subscription_id,
    configuration_archive=archive_file_path,
    drifted_transaction_table=drifted_transaction_table,
    explain_queue_table=explain_queue_table,
    explain_results_table=explain_result_table
)

print(instance_ids)

## Track each monitor instance status
# for key, value in instance_ids.items():
#     monitor_instance_status = None

#     while monitor_instance_status not in ("active", "error"):
#         monitor_instance_details = client.monitor_instances.get(monitor_instance_id=value).result
#         monitor_instance_status = monitor_instance_details.entity.status.state
#         if monitor_instance_status not in ("active", "error"):
#             print(datetime.now().strftime("%H:%M:%S"), monitor_instance_status)
#             time.sleep(30)

#     print(key, monitor_instance_status)

## Congratulations!

All the monitors have been enabled. It will take some time for the monitors to reach the active state. You can track the status of each monitor by using the code snippet above.

Once all monitors are active, load data into the payload or feedback table and either run on-demand evaluations or wait for the scheduled evaluations to complete for each monitor. You can check more details in the [IBM Watson OpenScale Dashboard](https://url-to-your-cp4d-cluster/aiopenscale).

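The commented-out status loops in the cells above poll until the state is `active` or `error`, without an upper time limit. The following sketch wraps the same idea in a helper with a timeout. `wait_for_state` is a hypothetical name, and the default interval and timeout are arbitrary values chosen for illustration.

```python
import time


def wait_for_state(fetch_state, terminal_states=("active", "error"),
                   interval_seconds=15, timeout_seconds=1800,
                   sleep=time.sleep, clock=time.monotonic):
    """Hypothetical helper: poll fetch_state() until a terminal state or timeout."""
    deadline = clock() + timeout_seconds
    while True:
        state = fetch_state()
        if state in terminal_states:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"Still in state {state!r} after {timeout_seconds} seconds")
        sleep(interval_seconds)


# Usage with the calls shown in the commented snippets of this notebook:
# state = wait_for_state(
#     lambda: client.subscriptions.get(subscription_id).result.entity.status.state)
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without real waiting.
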
## Helper Methods

### Cleanup subscription and its related artefacts
Crawls through the subscription JSON and identifies the entities to be deleted. Currently, the following entities are identified and deleted:
- Analytics Engine integrated system
- Data Warehouse Connection integrated system(s)

In [None]:
# # Uncomment and update following if you are running this at a later point of time or 
# # separate from this notebook with no subscription id and wos client session

# from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator
# from ibm_watson_openscale import APIClient

# import warnings
# warnings.filterwarnings('ignore')

# service_instance_id = "<SERVICE_INSTANCE_ID>" #Default is 00000000-0000-0000-0000-000000000000
# service_credentials = {
#     "url": "<to_be_edited>",
#     "username": "<to_be_edited>",
#     "password": "<to_be_edited>",
# #     "apikey":"<to_be_edited>"
# }

# authenticator = CloudPakForDataAuthenticator(
#     url=service_credentials['url'],
#     username=service_credentials['username'],
#     password=service_credentials['password'],
# #     apikey=service_credentials['apikey'],
#     disable_ssl_verification=True
# )

# client = APIClient(
#     service_url=service_credentials['url'],
#     service_instance_id=service_instance_id,
#     authenticator=authenticator
# )

# print(client.version)

# subscription_id = "<to_be_edited>"

subscription_details = client.subscriptions.get(
    subscription_id=subscription_id).result.to_dict()
subscription_entity = subscription_details.get("entity", {})

integrated_systems_id = []

# add analytics engine integrated system id
analytics_engine = subscription_entity.get("analytics_engine", {})
if analytics_engine and analytics_engine.get("integrated_system_id"):
    print("Found integrated system for analytics engine with type: {}".format(
        analytics_engine.get("type")))
    integrated_systems_id.append(analytics_engine.get("integrated_system_id"))

# add data source integrated system ids
data_sources = subscription_entity.get("data_sources", [])
for data_source in data_sources:
    if not data_source.get("connection"):
        continue

    if not data_source.get("connection").get("integrated_system_id"):
        continue

    integrated_system_id = data_source.get("connection").get("integrated_system_id")
    if integrated_system_id in integrated_systems_id:
        continue

    print("Found integrated system for data source with type: {}".format(
        data_source.get("type")))
    integrated_systems_id.append(integrated_system_id)
    
print("Integrated Systems to delete: {}".format(integrated_systems_id))
    
# delete subscription
client.subscriptions.delete(
    subscription_id=subscription_id,
    background_mode=False)

# wait time for subscription delete to complete
import time
time.sleep(30)

# delete all integrated systems
for integrated_system_id in integrated_systems_id:
    print("Deleting integrated system with id: {}".format(integrated_system_id))
    client.integrated_systems.delete(integrated_system_id)
    
    # wait time for integrated system delete to complete
    time.sleep(10)
    
print("Cleanup Complete!!!")