<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# Tutorial on generating an explanation for a text-based model on Watson OpenScale

This notebook walks through creating a text-based Watson Machine Learning model, creating a subscription, configuring explainability, and generating an explanation for a transaction.

### Contents
- [1. Setup](#setup)
- [2. Creating and deploying a text-based model](#deploy)
- [3. Subscriptions](#subscription)
- [4. Explainability](#explainability)

**Note**: This notebook works correctly with a `Python 3.10.x` kernel and pyspark 3.3.x.

<a id="setup"></a>
## 1. Setup

### 1.1 Install Watson OpenScale and WML packages

In [None]:
!pip install --upgrade ibm-watson-openscale --no-cache --user | tail -n 1

In [None]:
!pip install --upgrade ibm-watson-machine-learning --no-cache | tail -n 1

**Note**: Restart the kernel to ensure the new libraries are used.
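
After restarting, it can be useful to confirm which versions are actually installed. The sketch below uses only the standard library; the helper name `installed_version` is made up for this example and is not part of either IBM package.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are the ones passed to pip in the cells above.
for name in ("ibm-watson-openscale", "ibm-watson-machine-learning"):
    print(name, installed_version(name))
```

A value of `None` means pip did not install the package into the environment this kernel is using.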

### 1.2 Configure credentials

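Provide the values below. `WOS_CREDENTIALS` and `WML_CREDENTIALS` are required; the database credentials (named `DB_CREDENTIALS` in the code) and `SCHEMA_NAME` are needed only when setting up an external data mart in section 3.1.

- WOS_CREDENTIALS (CP4D)
- WML_CREDENTIALS (CP4D)
- DATABASE_CREDENTIALS (DB2 on CP4D or Cloud Object Storage (COS))
- SCHEMA_NAME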

In [None]:
WOS_CREDENTIALS = {
    "url": "***",
    "username": "***",
    "password": "***"
}

In [None]:
WML_CREDENTIALS = {
    "url": "***",
    "username": "***",
    "password": "***",
    "instance_id": "wml_local",
    "version": "4.6"  # If your environment is CP4D 4.x.x, specify "4.x.x" instead of "4.6"
}
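
Before moving on, a quick check can catch credentials that still hold the `***` placeholder. This is a minimal sketch: the helper `unfilled_keys` and the example URL are hypothetical and only illustrate the idea.

```python
def unfilled_keys(credentials, placeholder="***"):
    """Return the keys whose value is still the placeholder string."""
    return sorted(k for k, v in credentials.items() if v == placeholder)


# Hypothetical example values; replace with WOS_CREDENTIALS or WML_CREDENTIALS.
example = {"url": "https://cpd.example.com", "username": "***", "password": "***"}
print(unfilled_keys(example))
```

An empty list means every value has been filled in.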

## 2. Creating and deploying a text-based model <a id="deploy"></a>

The dataset used is the UCI-ML SMS Spam Collection Dataset which can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/. It is a binary classification dataset with the labels being 'ham' and 'spam'.

### 2.1 Loading the training data

In [None]:
!rm -rf SMSSpam.csv
!wget 'https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/spam_detection/SMSSpam.csv'

In [None]:
# Alternative: build 'SMSSpam.csv' from the original UCI archive instead of the public link above.

# !pip install pandas
# !rm -f smsspamcollection.zip
# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
# !unzip smsspamcollection.zip
# import pandas as pd
# pd.read_csv("SMSSpamCollection", sep="\t", header=None, encoding="utf-8").to_csv("SMSSpam.csv", header=["label", "text"], sep=",", index=False)

# !rm SMSSpamCollection
# !rm readme
# !rm smsspamcollection.zip
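
Because this is a binary classification task, it can help to look at the label balance before training. The sketch below counts labels with the standard library, assuming the CSV has `label` and `text` columns as in the conversion step above. The three sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file):
    """Count how often each value appears in the 'label' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


# Invented sample rows standing in for SMSSpam.csv.
sample = io.StringIO(
    'label,text\n'
    'ham,"See you at lunch, ok?"\n'
    'spam,"WIN a prize now"\n'
    'ham,Call me later\n'
)
counts = label_counts(sample)
print(counts)
```

For the real file, pass `open("SMSSpam.csv", newline="", encoding="utf-8")` instead of the in-memory sample.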

### 2.2 Creating a model

**Note**: Skip the pyspark install step below if you are using a Spark kernel on Watson Studio.

In [None]:
!pip install --upgrade pyspark==3.3.0

**Note**: When running this notebook locally, if the `SparkSession` import below fails, set the `SPARK_HOME` environment variable to the path of your `pyspark` installation.
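
One way to set `SPARK_HOME` from Python is sketched below. It assumes pyspark was installed with pip, in which case the package directory is usually a workable `SPARK_HOME`; for other installation methods the path may differ. The helper name `guess_spark_home` is made up for this example.

```python
import importlib.util
import os


def guess_spark_home():
    """Return the directory of the installed pyspark package, or None if it is not installed."""
    spec = importlib.util.find_spec("pyspark")
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


spark_home = guess_spark_home()
# Only set the variable when it is missing, so an existing configuration is left alone.
if spark_home and "SPARK_HOME" not in os.environ:
    os.environ["SPARK_HOME"] = spark_home
print(os.environ.get("SPARK_HOME"))
```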

In [None]:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv(path="SMSSpam.csv", header=True, multiLine=True, escape='"')
df.show(5, truncate = False)

In [None]:
train_df, test_df = df.randomSplit([0.8, 0.2], seed=12345)
print("Total count of data set: {}".format(df.count()))
print("Total count of training data set: {}".format(train_df.count()))
print("Total count of test data set: {}".format(test_df.count()))

In [None]:
!pip install nltk
from pyspark.ml.feature import StringIndexer, IndexToString, CountVectorizer, Tokenizer, IDF, StopWordsRemover
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline, Model
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
stop_words = list(set(stopwords.words('english')))

stringIndexer_label = StringIndexer(inputCol="label", outputCol="label_ix").fit(df)
tokenizer = Tokenizer(inputCol="text", outputCol="words")
stopword_remover = StopWordsRemover(inputCol="words", outputCol="filtered_words").setStopWords(stop_words)
count = CountVectorizer(inputCol="filtered_words", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="features")
# Gradient-boosted trees classifier; `nb` is only the variable name used in the pipeline below
nb = GBTClassifier(labelCol="label_ix")
labelConverter = IndexToString(inputCol="prediction", outputCol="predictionLabel", labels=stringIndexer_label.labels)
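
To make the feature stages concrete, here is a plain-Python sketch of what the tokenizer, stop-word remover, count vectorizer, and IDF stages compute together. It is an illustration rather than the Spark implementation: the stop-word set and sample messages are invented, and the IDF formula `log((N + 1) / (df + 1))` is the smoothed form described in the Spark MLlib documentation.

```python
import math
from collections import Counter

STOP_WORDS = {"to", "a", "the", "you"}  # tiny illustrative set, not the NLTK list


def tokenize(text):
    """Lowercase and split on whitespace, like a simple tokenizer."""
    return text.lower().split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    token_lists = [remove_stop_words(tokenize(d)) for d in docs]
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in token_lists]


docs = ["win cash now", "call you now", "win a prize"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, w)
```

Terms that appear in fewer documents, such as `cash`, receive a higher weight than terms shared across documents, such as `now`.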

In [None]:
pipeline = Pipeline(stages=[stringIndexer_label, tokenizer, stopword_remover, count, idf, nb, labelConverter])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
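# Use the raw scores rather than the hard 0/1 predictions so the ROC curve covers more than one threshold
evaluator = BinaryClassificationEvaluator(labelCol="label_ix", rawPredictionCol="rawPrediction", metricName="areaUnderROC")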
auc = evaluator.evaluate(predictions)

print("Area under ROC curve = %g" % auc)
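
The area under the ROC curve can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as half. The sketch below computes it by comparing every positive/negative pair; it is a small illustration of the metric, not the method Spark uses internally, and the sample labels and scores are invented.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly, ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```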

In [None]:
import json
from ibm_watson_machine_learning import APIClient

wml_client = APIClient(WML_CREDENTIALS)
wml_client.version

In [None]:
wml_client.spaces.list(limit=10)

In [None]:
WML_SPACE_ID='***' # use space id here
wml_client.set.default_space(WML_SPACE_ID)

In [None]:
MODEL_NAME = "Text Binary Classifier"

In [None]:
software_spec_uid = wml_client.software_specifications.get_id_by_name("spark-mllib_3.3")
print("Software Specification ID: {}".format(software_spec_uid))
# Use the public ModelMetaNames accessor instead of the private `_models` attribute
model_props = {
        wml_client.repository.ModelMetaNames.NAME: MODEL_NAME,
        wml_client.repository.ModelMetaNames.TYPE: "mllib_3.3",
        wml_client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
        wml_client.repository.ModelMetaNames.LABEL_FIELD: "label",
    }

In [None]:
print("Storing model ...")
published_model_details = wml_client.repository.store_model(
    model=model, 
    meta_props=model_props, 
    training_data=train_df, 
    pipeline=pipeline)

model_uid = wml_client.repository.get_model_id(published_model_details)
print("Done")
print("Model ID: {}".format(model_uid))

### 2.3 Deploying the model

In [None]:
deployment_details = wml_client.deployments.create(
    model_uid, 
    meta_props={
        wml_client.deployments.ConfigurationMetaNames.NAME: "{}".format(MODEL_NAME + " deployment"),
        wml_client.deployments.ConfigurationMetaNames.ONLINE: {}
    }
)
scoring_url = wml_client.deployments.get_scoring_href(deployment_details)
deployment_uid=wml_client.deployments.get_id(deployment_details)

print("Scoring URL: " + scoring_url)
print("Model id: {}".format(model_uid))
print("Deployment id: {}".format(deployment_uid))

## 3. Subscriptions <a id="subscription"></a>

### 3.1 Configuring OpenScale

In [None]:
from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator
from ibm_watson_openscale import APIClient

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *


authenticator = CloudPakForDataAuthenticator(
        url=WOS_CREDENTIALS['url'],
        username=WOS_CREDENTIALS['username'],
        password=WOS_CREDENTIALS['password'],
        disable_ssl_verification=True
    )

wos_client = APIClient(service_url=WOS_CREDENTIALS['url'],authenticator=authenticator)
wos_client.version

**Note**: Please re-run the above cell if it doesn't work the first time.

In [None]:
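# For an external data mart, fill in DB_CREDENTIALS and SCHEMA_NAME; leave both as None to use the internal database.
#DB_CREDENTIALS= {"hostname":"","username":"","password":"","database":"","port":"","ssl":True,"sslmode":"","certificate_base64":""}
DB_CREDENTIALS = None
SCHEMA_NAME = None
KEEP_MY_INTERNAL_POSTGRES = True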

In [None]:
data_marts = wos_client.data_marts.list().result.data_marts
if len(data_marts) == 0:
    if DB_CREDENTIALS is not None:
        if SCHEMA_NAME is None: 
            # Stop here; continuing without a schema name would fail in the call below
            raise ValueError("Please specify the SCHEMA_NAME and rerun the cell")

        print('Setting up external datamart')
        added_data_mart_result = wos_client.data_marts.add(
                background_mode=False,
                name="WOS Data Mart",
                description="Data Mart created by WOS tutorial notebook",
                database_configuration=DatabaseConfigurationRequest(
                    database_type=DatabaseType.POSTGRESQL,
                    credentials=PrimaryStorageCredentialsLong(
                        hostname=DB_CREDENTIALS['hostname'],
                        username=DB_CREDENTIALS['username'],
                        password=DB_CREDENTIALS['password'],
                        db=DB_CREDENTIALS['database'],
                        port=DB_CREDENTIALS['port'],
                        ssl=True,
                        sslmode=DB_CREDENTIALS['sslmode'],
                        certificate_base64=DB_CREDENTIALS['certificate_base64']
                    ),
                    location=LocationSchemaName(
                        schema_name= SCHEMA_NAME
                    )
                )
             ).result
    else:
        print('Setting up internal datamart')
        added_data_mart_result = wos_client.data_marts.add(
                background_mode=False,
                name="WOS Data Mart",
                description="Data Mart created by WOS tutorial notebook", 
                internal_database = True).result
        
    data_mart_id = added_data_mart_result.metadata.id
    
else:
    data_mart_id=data_marts[0].metadata.id
    print('Using existing datamart {}'.format(data_mart_id))

In [None]:
SERVICE_PROVIDER_NAME = "Watson Machine Learning V2_test"
SERVICE_PROVIDER_DESCRIPTION = "Added by tutorial WOS notebook."

In [None]:
service_providers = wos_client.service_providers.list().result.service_providers
for service_provider in service_providers:
    service_instance_name = service_provider.entity.name
    if service_instance_name == SERVICE_PROVIDER_NAME:
        service_provider_id = service_provider.metadata.id
        wos_client.service_providers.delete(service_provider_id)
        print("Deleted existing service_provider for WML instance: {}".format(service_provider_id))

In [None]:
added_service_provider_result = wos_client.service_providers.add(
        name=SERVICE_PROVIDER_NAME,
        description=SERVICE_PROVIDER_DESCRIPTION,
        service_type=ServiceTypes.WATSON_MACHINE_LEARNING,
        deployment_space_id = WML_SPACE_ID,
        operational_space_id = "production",
        credentials=WMLCredentialsCP4D(
            url=WML_CREDENTIALS["url"],
            username=WML_CREDENTIALS["username"],
            password=WML_CREDENTIALS["password"],
            instance_id=None
        ),
        background_mode=False
    ).result
service_provider_id = added_service_provider_result.metadata.id

In [None]:
asset_deployment_details_list = wos_client.service_providers.list_assets(data_mart_id=data_mart_id, service_provider_id=service_provider_id, deployment_space_id = WML_SPACE_ID).result['resources']
DEPLOYMENT_NAME='Text Binary Classifier deployment' # name of the deployment created in section 2.3
asset_deployment_details = [asset for asset in asset_deployment_details_list if asset['entity']["name"]==DEPLOYMENT_NAME]

if len(asset_deployment_details)>0:
    [asset_deployment_details] = asset_deployment_details
else:
    raise ValueError('deployment with name "{}" not found.'.format(DEPLOYMENT_NAME))
asset_deployment_details
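
The lookup above assumes exactly one deployment has the given name; with several matches the list unpacking fails with a generic error. A slightly more explicit variant is sketched below. The helper `find_unique_by_name` and the sample records are hypothetical, and the records only mimic the `entity.name` and `metadata.guid` fields used in this notebook.

```python
def find_unique_by_name(assets, name):
    """Return the single asset whose entity name matches, or raise with a clear message."""
    matches = [a for a in assets if a["entity"]["name"] == name]
    if len(matches) != 1:
        raise ValueError(
            'expected exactly one deployment named "{}", found {}'.format(name, len(matches))
        )
    return matches[0]


# Invented records shaped like the entries in asset_deployment_details_list.
assets = [
    {"metadata": {"guid": "dep-1"}, "entity": {"name": "Text Binary Classifier deployment"}},
    {"metadata": {"guid": "dep-2"}, "entity": {"name": "Another deployment"}},
]
match = find_unique_by_name(assets, "Text Binary Classifier deployment")
print(match["metadata"]["guid"])
```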

In [None]:
model_asset_details_from_deployment=wos_client.service_providers.get_deployment_asset(data_mart_id=data_mart_id,service_provider_id=service_provider_id,deployment_id=deployment_uid,deployment_space_id=WML_SPACE_ID)
model_asset_details_from_deployment

### 3.2 Subscribe the asset

In [None]:
subscriptions = wos_client.subscriptions.list().result.subscriptions
for subscription in subscriptions:
    sub_model_id = subscription.entity.asset.asset_id
    if sub_model_id == model_uid:
        wos_client.subscriptions.delete(subscription.metadata.id)
        print('Deleted existing subscription for model', sub_model_id)

In [None]:
from ibm_watson_openscale.base_classes.watson_open_scale_v2 import ScoringEndpointRequest

In [None]:
subscription_details = wos_client.subscriptions.add(
        data_mart_id=data_mart_id,
        service_provider_id=service_provider_id,
        asset=Asset(
            asset_id=model_asset_details_from_deployment["entity"]["asset"]["asset_id"],
            name=model_asset_details_from_deployment["entity"]["asset"]["name"],
            url=model_asset_details_from_deployment["entity"]["asset"]["url"],
            asset_type=AssetTypes.MODEL,
            input_data_type=InputDataType.UNSTRUCTURED_TEXT,
            problem_type=ProblemType.BINARY_CLASSIFICATION
        ),
        deployment=AssetDeploymentRequest(
            deployment_id=asset_deployment_details['metadata']['guid'],
            name=asset_deployment_details['entity']['name'],
            deployment_type= DeploymentTypes.ONLINE,
            url=model_asset_details_from_deployment['entity']['asset']['url'],
            scoring_endpoint=ScoringEndpointRequest(url=scoring_url) # scoring model without shadow deployment
        ),
        asset_properties=AssetPropertiesRequest(
            label_column='label',
            probability_fields=['probability'],
            prediction_field='predictionLabel',
            feature_fields = ["text"],
            categorical_fields = ["text"],
            training_data_schema=SparkStruct.from_dict(model_asset_details_from_deployment["entity"]["asset_properties"]["training_data_schema"])
        )
    ).result
subscription_id = subscription_details.metadata.id
subscription_id

In [None]:
import time

time.sleep(5)
payload_data_set_id = None
payload_data_sets = wos_client.data_sets.list(type=DataSetTypes.PAYLOAD_LOGGING, 
                                              target_target_id=subscription_id, 
                                              target_target_type=TargetTypes.SUBSCRIPTION).result.data_sets
# Guard against an empty list; indexing [0] directly would raise before the check below runs
if payload_data_sets:
    payload_data_set_id = payload_data_sets[0].metadata.id
if payload_data_set_id is None:
    print("Payload data set not found. Please check subscription status.")
else:
    print("Payload data set id: ", payload_data_set_id)

### 3.3 Get subscription

In [None]:
wos_client.subscriptions.show()

In [None]:
wos_client.subscriptions.get(subscription_id).result.to_dict()

### 3.4 Score the model and get transaction-id

In [None]:
text = "SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info"
payload = {"input_data": [{"fields": ["text"], "values": [[text]]}]}

response = wml_client.deployments.score(deployment_uid,payload)
print(response)
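
The scoring payload pairs a list of field names with a list of rows, one row per message. A small helper for building it from several texts could look like the sketch below; `build_payload` is a made-up name, and the second sample message is invented.

```python
def build_payload(texts, field="text"):
    """Build a scoring payload with one row per input text."""
    return {"input_data": [{"fields": [field], "values": [[t] for t in texts]}]}


payload_example = build_payload(["SIX chances to win CASH!", "See you at lunch"])
print(payload_example)
```

The result has the same shape as `payload` above and can be passed to `wml_client.deployments.score(deployment_uid, payload_example)`.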

In [None]:
wos_client.data_sets.get_records_count(payload_data_set_id)

## 4. Explainability <a id="explainability"></a>

### 4.1 Configure Explainability

In [None]:
target = Target(
    target_type=TargetTypes.SUBSCRIPTION,
    target_id=subscription_id
)
parameters = {
    "enabled": True
}
explainability_details = wos_client.monitor_instances.create(
    data_mart_id=data_mart_id,
    background_mode=False,
    monitor_definition_id=wos_client.monitor_definitions.MONITORS.EXPLAINABILITY.ID,
    target=target,
    parameters=parameters
).result

explainability_monitor_id = explainability_details.metadata.id

### 4.2 Get explanation for the transaction

In [None]:
pl_records_resp = wos_client.data_sets.get_list_of_records(data_set_id=payload_data_set_id, limit=1, offset=0).result
scoring_ids = [pl_records_resp["records"][0]["entity"]["values"]["scoring_id"]]
print("Running explanations on scoring IDs: {}".format(scoring_ids))
explanation_types = ["lime", "contrastive"]
result = wos_client.monitor_instances.explanation_tasks(scoring_ids=scoring_ids, explanation_types=explanation_types, subscription_id=subscription_id).result
print(result)

In [None]:
explanation_task_id=result.to_dict()['metadata']['explanation_task_ids'][0]
wos_client.monitor_instances.get_explanation_tasks(explanation_task_id=explanation_task_id, subscription_id=subscription_id).result.to_dict()
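
Explanation tasks run asynchronously, so the first response may not contain the finished explanation yet. The sketch below polls until the task reaches a terminal state. It makes two assumptions that should be checked against the actual response: that the state lives at `entity.status.state`, and that the terminal values are `finished` and `error`. The helper names and the fake responses are invented for illustration.

```python
import time

TERMINAL_STATES = {"finished", "error"}  # assumed state names, verify against a real response


def task_state(task):
    """Read the state from a task dict, assuming it is stored at entity.status.state."""
    return task.get("entity", {}).get("status", {}).get("state")


def wait_for_task(fetch_task, attempts=10, delay_seconds=5.0):
    """Call fetch_task up to `attempts` times until the task reaches a terminal state."""
    task = fetch_task()
    for _ in range(attempts - 1):
        if task_state(task) in TERMINAL_STATES:
            break
        time.sleep(delay_seconds)
        task = fetch_task()
    return task


# Fake responses standing in for the real API call.
responses = iter([
    {"entity": {"status": {"state": "in_progress"}}},
    {"entity": {"status": {"state": "finished"}}},
])
final = wait_for_task(lambda: next(responses), attempts=5, delay_seconds=0)
print(task_state(final))
```

In the notebook, `fetch_task` would wrap the `get_explanation_tasks(...).result.to_dict()` call from the cell above.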