<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# Notebook for analyzing payload transactions causing drift for JDBC storage

This notebook helps users of IBM Watson OpenScale analyze the payload transactions that are causing drift, that is, a drop in accuracy, a drop in data consistency, or both.

The notebook is designed to give users a jump start in their analysis of the payload transactions. It is by no means a comprehensive analysis. 

The user needs to provide the necessary inputs (where marked) to be able to proceed with the analysis. 

**Note:** This notebook analyzes one drift monitor run at a time for a given subscription.

**Contents:**
1. [Pre-requisites](#Pre-requisites)
2. [Installing Dependencies](#Installing-Dependencies)
3. [User Inputs](#User-Inputs)
4. [Setting up Services](#Setting-up-Services)
5. [Measurement Summary](#Measurement-Summary)
6. [Counts from Drifted Transactions Table](#Counts-from-Drifted-Transactions-Table)
7. [Analyse Transactions Causing Drop in Accuracy](#Analyse-Transactions-Causing-Drop-in-Accuracy)
    * [Get all transactions causing drop in data accuracy](#Get-all-transactions-causing-drop-in-data-accuracy)
    * [Get all transactions causing drop in accuracy in given range of drift model confidence](#Get-all-transactions-causing-drop-in-accuracy-in-given-range-of-drift-model-confidence)
8. [Analyse Transactions Causing Drop in Accuracy and Drop in Data Consistency](#Analyse-Transactions-Causing-Drop-in-Accuracy-and-Drop-in-Data-Consistency)
    * [Get all transactions causing drop in accuracy and drop in data consistency](#Get-all-transactions-causing-drop-in-accuracy-and--drop-in-data-consistency)
    * [Get all transactions causing drop in accuracy and drop in data consistency in given range of drift model confidence](#Get-all-transactions-causing-drop-in-accuracy-and-drop-in-data-consistency-in-given-range-of-drift-model-confidence)
9. [Analyse Transactions Causing Drop in Data Consistency](#Analyse-Transactions-Causing-Drop-in-Data-Consistency)
    * [Get all transactions causing drop in data consistency](#Get-all-transactions-causing-drop-in-data-consistency)
    * [Get all transactions violating a data constraint](#Get-all-transactions-violating-a-data-constraint)
    * [Get all transactions where a column is causing drop in data consistency](#Get-all-transactions-where-a-column-is-causing-drop-in-data-consistency)
    * [Explain categorical distribution constraint violations](#Explain-categorical-distribution-constraint-violations)
    * [Explain numeric range constraint violations](#Explain-numeric-range-constraint-violations)
    * [Explain cat-numeric range constraint violations](#Explain-cat-numeric-range-constraint-violations)
    * [Explain cat-cat distribution constraint violations](#Explain-cat-cat-distribution-constraint-violations)

## Pre-requisites

1. Download the jars required to connect to the JDBC storage.
2. If running locally, place the jars in the `jars` directory of your Spark installation.
3. If running from a project in Watson Studio, upload the jars as Data Assets to the project.

## Installing Dependencies

In [None]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5737-H76
# Copyright IBM Corp. 2021, 2024
# The source code for this Notebook is not published or other-wise divested of its trade
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "jdbc-1.1.9"

# Version history

# jdbc-1.1.9 : Upgrade ibm-wos-utils to 5.0.0
# jdbc-1.1.8 : Upgrade ibm-wos-utils to 4.8.0
# jdbc-1.1.7 : Upgrade ibm-wos-utils to 4.7.0
# jdbc-1.1.6 : Install pyspark as a pre-requisite; Add clarification for JDBC_SSL_CERTIFICATE
# jdbc-1.1.5 : Upgrade ibm-wos-utils to 4.5.0
# jdbc-1.1.4 : Upgrade ibm-wos-utils to 4.1.1 (scikit-learn has been upgraded to 1.0.2)
# jdbc-1.1.3 : Upgrade ibm-wos-utils to 4.0.34
# jdbc-1.1.2 : Upgrade ibm-wos-utils to 4.0.31
# jdbc-1.1.1 : Add comment about conda install for zLinux environments; Upgrade ibm-wos-utils to 4.0.25
# jdbc-1.1   : Add partition column information for the payload and drifted transactions table
# 1.0        : First public release

In [None]:
import warnings
warnings.filterwarnings("ignore")
%env PIP_DISABLE_PIP_VERSION_CHECK=1

In [None]:
import sys

PYTHON = sys.executable

!$PYTHON -m pip install --no-warn-conflicts --upgrade tabulate ibm-watson-openscale pyspark | tail -n 1   

**Note:** For IBM Watson OpenScale Cloud Pak for Data version 5.0.x, use the cell below:

In [None]:
# When this notebook is to be run on a zLinux cluster,
# install scikit-learn==1.3.1 using conda before installing ibm-wos-utils
# !conda install scikit-learn=1.3.1

!$PYTHON -m pip install --no-warn-conflicts "ibm-wos-utils~=5.0.0" | tail -n 1

## User Inputs

The following inputs are required:

1. **IBM_CPD_ENDPOINT:** The URL representing the IBM Cloud Pak for Data service endpoint.
2. **IBM_CPD_USERNAME:** IBM Cloud Pak for Data username used to obtain a bearer token.
3. **IBM_CPD_PASSWORD:** IBM Cloud Pak for Data password used to obtain a bearer token.
4. **JDBC_HOST:** Hostname of the JDBC Connection
5. **JDBC_PORT:** Port of the JDBC Connection
6. **JDBC_USE_SSL:** Boolean Flag to indicate whether to use SSL while connecting.
7. **JDBC_SSL_CERTIFICATE:** Path to SSL Certificate file. Ignored if JDBC_USE_SSL is False.
    - If running on local Jupyter, please provide the absolute path for the SSL Certificate
    - If running in a Watson Studio environment, upload the SSL Certificate as an asset and provide its path, e.g. `JDBC_SSL_CERTIFICATE = "/project_data/data_asset/my_cert.arm"`
8. **JDBC_DRIVER:** Class name of the JDBC driver used to connect. For DB2, use the default value "com.ibm.db2.jcc.DB2Driver".
9. **JDBC_USERNAME:** Username of the JDBC Connection
10. **JDBC_PASSWORD:** Password of the JDBC Connection
11. **JDBC_DATABASE_NAME:** Name of the JDBC database to connect to.
12. **ANALYSIS_INPUT_PARAMETERS:** Analysis Input Parameters to be copied from the IBM Watson OpenScale UI

In [None]:
# IBM Cloud Pak for Data credentials
IBM_CPD_ENDPOINT = "<The URL representing the IBM Cloud Pak for Data service endpoint.>"
IBM_CPD_USERNAME = "<IBM Cloud Pak for Data username used to obtain a bearer token.>"
IBM_CPD_PASSWORD = "<IBM Cloud Pak for Data password used to obtain a bearer token.>"

# JDBC
JDBC_HOST = "<Hostname of the JDBC Connection>"
JDBC_PORT = "<Port of the JDBC Connection>"
JDBC_USE_SSL = False
JDBC_SSL_CERTIFICATE = "<Path to SSL Certificate file. Ignored if JDBC_USE_SSL is False.>"
JDBC_DRIVER = "com.ibm.db2.jcc.DB2Driver"
JDBC_USERNAME = "<Username of the JDBC Connection>"
JDBC_PASSWORD = "<Password of the JDBC Connection>"
JDBC_DATABASE_NAME = "<Name of the JDBC Database to connect.>"

# NUM_PARTITIONS decides the number of simultaneous connections Spark will make to your JDBC instance.
# e.g. DB2 on Cloud Free Plan has a maximum limit of 15 simultaneous connections
# This is the default value used if no value is set in the data sources in the subscription.
NUM_PARTITIONS = "10"

# FETCH_SIZE determines how many rows to fetch per round trip.
FETCH_SIZE = "100"

# Analysis Input Parameters to be copied from UI
# Please make sure that the quotes around the key-values 
# are correct after copying from UI
ANALYSIS_INPUT_PARAMETERS = {
    "data_mart_id": "<data_mart_id>",
    "subscription_id": "<subscription_id>",
    "monitor_instance_id": "<monitor_instance_id>",
    "measurement_id": "<measurement_id>"
}
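
Before moving on, it can help to confirm that none of the inputs above still holds its `<...>` placeholder text. The following is a minimal sketch of such a check; the helper name `find_unfilled_inputs` and the example values are hypothetical and not part of the notebook or of any IBM library.

```python
def find_unfilled_inputs(inputs):
    """Return the names of inputs that are empty or still look like a '<...>' placeholder."""
    unfilled = []
    for name, value in inputs.items():
        if value is None:
            unfilled.append(name)
        elif isinstance(value, str):
            stripped = value.strip()
            # An untouched template value starts with '<' and ends with '>'.
            if not stripped or (stripped.startswith("<") and stripped.endswith(">")):
                unfilled.append(name)
    return unfilled


# Hypothetical values, for illustration only.
example_inputs = {
    "IBM_CPD_ENDPOINT": "https://cpd.example.com",
    "JDBC_HOST": "<Hostname of the JDBC Connection>",
    "JDBC_PORT": "50000",
    "JDBC_DATABASE_NAME": "",
}

missing = find_unfilled_inputs(example_inputs)
print("Inputs that still need a value:", missing)
```

In the notebook itself, the same helper could be called with a dictionary built from the actual variables. `measurement_id` may legitimately be left empty, because a later cell lists recent measurements to pick from.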

In [None]:
DATAMART_ID = ANALYSIS_INPUT_PARAMETERS.get("data_mart_id")
SUBSCRIPTION_ID = ANALYSIS_INPUT_PARAMETERS.get("subscription_id")
MONITOR_INSTANCE_ID = ANALYSIS_INPUT_PARAMETERS.get("monitor_instance_id")
MEASUREMENT_ID = ANALYSIS_INPUT_PARAMETERS.get("measurement_id")

In [None]:
jdbc_url = "jdbc:db2://{}:{}/{}".format(JDBC_HOST, JDBC_PORT, JDBC_DATABASE_NAME)

connection_properties = {
    "user": JDBC_USERNAME,
    "password": JDBC_PASSWORD,
    "driver": JDBC_DRIVER,
    "fetchsize": FETCH_SIZE
}

if JDBC_USE_SSL:
    connection_properties["sslConnection"] = "true"
    connection_properties["sslCertLocation"] = JDBC_SSL_CERTIFICATE

## Setting up Services

In [None]:
import pandas as pd
import pyspark.sql.functions as F
from pyspark import SparkConf
from pyspark.sql import SparkSession

from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator
from ibm_watson_openscale import APIClient

from ibm_wos_utils.drift.batch.util.constants import ConstraintName
from ibm_wos_utils.joblib.utils.analyze_notebook_utils import (
    explain_catcat_distribution_constraint,
    explain_categorical_distribution_constraint,
    explain_catnum_range_constraint, explain_numeric_range_constraint,
    get_column_query, get_drift_archive_contents,
    get_table_details_from_subscription, show_constraints_by_column,
    show_dataframe, show_last_n_drift_measurements, get_record_timestamp_column)

In [None]:
conf = SparkConf().setAppName("Analyze Drifted Transactions")

# Uncomment the following line if running this notebook from a Watson Studio Project.
# Here, db2jcc4.jar is the DB2 specific jar required to run this notebook.
# conf = conf.set("spark.jars", "/project_data/data_asset/db2jcc4.jar")

spark = SparkSession.builder.config(conf=conf).getOrCreate()

In [None]:
authenticator = CloudPakForDataAuthenticator(
        url=IBM_CPD_ENDPOINT,
        username=IBM_CPD_USERNAME,
        password=IBM_CPD_PASSWORD,
        disable_ssl_verification=True
    )
wos_client = APIClient(authenticator=authenticator, service_url=IBM_CPD_ENDPOINT)

In [None]:
%%time

if not DATAMART_ID or not SUBSCRIPTION_ID:
    raise Exception("DATAMART_ID and SUBSCRIPTION_ID are required to proceed.")

subscription = wos_client.subscriptions.get(subscription_id=SUBSCRIPTION_ID).result
monitor_instance = wos_client.monitor_instances.list(data_mart_id=DATAMART_ID, target_target_id=SUBSCRIPTION_ID, monitor_definition_id="drift").result.monitor_instances[0]
model_drift_enabled = monitor_instance.entity.parameters.get("model_drift_enabled", False)
data_drift_enabled = monitor_instance.entity.parameters.get("data_drift_enabled", False)

if not MONITOR_INSTANCE_ID:
    MONITOR_INSTANCE_ID = monitor_instance.metadata.id
    
drift_archive = wos_client.monitor_instances.download_drift_model(monitor_instance_id=MONITOR_INSTANCE_ID).result.content
schema, ddm_properties, constraints_set = get_drift_archive_contents(drift_archive, model_drift_enabled, data_drift_enabled)
_, payload_schema_name, payload_table_name, payload_partition_column, payload_num_partitions = get_table_details_from_subscription(subscription, "payload", NUM_PARTITIONS)
_, drift_schema_name, drift_table_name, drift_partition_column, drift_num_partitions = get_table_details_from_subscription(subscription, "drift", NUM_PARTITIONS)

This notebook relies heavily on filtering transactions in the Drifted Transactions table based on three columns: `run_id`, `is_model_drift` and `is_data_drift`. 

It is, therefore, recommended that you create an index for these columns, if not done already as part of the common configuration notebook. You can use the following DDL to create the index.

In [None]:
ddl_string = "CREATE INDEX \"{1}_index\" ON \"{0}\".\"{1}\" (\"run_id\", \"is_model_drift\", \"is_data_drift\")".format(drift_schema_name, drift_table_name)
print(ddl_string)

In [None]:
if not MEASUREMENT_ID:
    print("Please pick a measurement to analyze from the following list:")
    
show_last_n_drift_measurements(10, wos_client, SUBSCRIPTION_ID)

In [None]:
# If you have not selected MEASUREMENT_ID so far, please enter a measurement ID
# from the above cell's output to analyze.

# MEASUREMENT_ID = None

In [None]:
if not MEASUREMENT_ID:
    raise Exception("MEASUREMENT_ID is required to proceed.")

measurement = wos_client.monitor_instances.measurements.get(measurement_id=MEASUREMENT_ID, monitor_instance_id=MONITOR_INSTANCE_ID).result
measurement_data = measurement.entity.sources[0].data
MONITOR_RUN_ID = measurement.entity.run_id
MONITOR_RUN_ID

## Measurement Summary

### Counts of transactions causing drop in accuracy and drop in data consistency

In [None]:
print("IBM Watson OpenScale analyzed {} transactions between {} and {} for drift. Here's a summary.".format(measurement_data["transactions_count"], measurement_data["start"], measurement_data["end"]))

if model_drift_enabled:
    print("  - Total {} transactions out of {} transactions are causing drop in accuracy.".format(measurement_data["drifted_transactions"]["count"], measurement_data["transactions_count"]))

if data_drift_enabled:
    print("  - Total {} transactions out of {} transactions are causing drop in data consistency.".format(measurement_data["data_drifted_transactions"]["count"], measurement_data["transactions_count"]))
    
if model_drift_enabled and data_drift_enabled:
    print("  - Total {} transactions out of {} transactions are causing both drop in accuracy and drop in data consistency.".format(measurement_data["model_data_drifted_transactions"]["count"], measurement_data["transactions_count"]))

### Counts of transactions causing drop in accuracy - percent bins

In [None]:
if model_drift_enabled:
    rows_df = pd.DataFrame(measurement_data["drifted_transactions"]["drift_model_confidence_count"])
    rows_df = rows_df[["lower_limit", "upper_limit", "count"]]
    rows_df.columns = ["Drift Model Confidence - Lower Limit", "Drift Model Confidence - Upper Limit", "Violated Transactions Count"]
    display(rows_df)
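
To see where the drifted transactions concentrate, the per-bin counts can also be read as shares of the total. The sketch below assumes each bin is a dictionary with `lower_limit`, `upper_limit` and `count` keys, as selected in the cell above; the function name `bin_shares` and the sample numbers are hypothetical.

```python
def bin_shares(confidence_bins):
    """Return (lower_limit, upper_limit, count, percent_of_total) per bin, sorted by lower limit."""
    total = sum(b["count"] for b in confidence_bins)
    rows = []
    for b in sorted(confidence_bins, key=lambda item: item["lower_limit"]):
        # Guard against an empty measurement so we never divide by zero.
        share = 100.0 * b["count"] / total if total else 0.0
        rows.append((b["lower_limit"], b["upper_limit"], b["count"], round(share, 1)))
    return rows


# Hypothetical bins, for illustration only.
example_bins = [
    {"lower_limit": 0.9, "upper_limit": 1.0, "count": 60},
    {"lower_limit": 0.5, "upper_limit": 0.6, "count": 10},
    {"lower_limit": 0.6, "upper_limit": 0.7, "count": 30},
]

for lower, upper, count, share in bin_shares(example_bins):
    print("{} - {}: {} transactions ({}%)".format(lower, upper, count, share))
```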

### Counts of transactions causing drop in data consistency - feature columns

In [None]:
if data_drift_enabled:
    rows_df = pd.Series(measurement_data["data_drifted_transactions"]["features_count"])\
                .sort_values(ascending=False).to_frame()
    rows_df.reset_index(inplace=True)
    rows_df.columns = ["Feature Column", "Violated Transactions Count"]
    display(rows_df)

### Counts of transactions causing drop in data consistency - constraints list

In [None]:
if data_drift_enabled:
    rows_df = pd.Series(measurement_data["data_drifted_transactions"]["constraints_count"])\
                .sort_values(ascending=False).to_frame()
    rows_df.reset_index(inplace=True)
    rows_df.columns = ["Constraint Name", "Violated Transactions Count"]
    display(rows_df)


## Counts from Drifted Transactions Table

To let Spark use multiple workers when reading from the JDBC tables, the cells below add partition information (partition column, lower and upper bounds, and number of partitions) to the connection properties.
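
Spark uses the partition column, the two bounds and the number of partitions only to decide how the read is split into parallel range queries; the bounds do not filter rows. The sketch below approximates the per-partition predicates for a numeric column. It is a simplified illustration, not Spark's actual implementation (which also handles rounding and date or timestamp columns), and the function name `sketch_partition_predicates` and the column `id` are hypothetical.

```python
def sketch_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the WHERE clauses of a range-partitioned JDBC read on a numeric column."""
    if num_partitions <= 1 or lower_bound >= upper_bound:
        return [None]  # a single partition reads the whole table without a predicate

    stride = (upper_bound - lower_bound) / num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower_bound + i * stride
        end = start + stride
        if i == 0:
            # The first partition is open below and also picks up NULL values.
            predicates.append('"{0}" < {1} OR "{0}" IS NULL'.format(column, end))
        elif i == num_partitions - 1:
            # The last partition is open above, so rows beyond upper_bound are still read.
            predicates.append('"{0}" >= {1}'.format(column, start))
        else:
            predicates.append('"{0}" >= {1} AND "{0}" < {2}'.format(column, start, end))
    return predicates


for predicate in sketch_partition_predicates("id", 0, 100, 4):
    print(predicate)
```

Because each partition opens its own connection, the number of partitions should stay within the connection limit of the database, as noted for `NUM_PARTITIONS` in the user inputs.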

In [None]:
sql = "(select min(\"{0}\") \"rtmin\", max(\"{0}\") \"rtmax\" from \"{1}\".\"{2}\" where \"run_id\" = '{3}')".format(drift_partition_column, drift_schema_name, drift_table_name, MONITOR_RUN_ID)
print(sql)

result = spark.read.jdbc(url=jdbc_url,\
                 table=sql,\
                 properties=connection_properties).collect()[0]


drift_connection_properties = {}
drift_connection_properties.update(connection_properties)
drift_connection_properties["partitionColumn"] = drift_partition_column
drift_connection_properties["lowerBound"] = str(result.rtmin)
drift_connection_properties["upperBound"] = str(result.rtmax)
drift_connection_properties["numPartitions"] = str(drift_num_partitions)

drift_connection_properties

In [None]:
drift_table_df = spark.read.jdbc(url=jdbc_url,\
                                 table="\"{}\".\"{}\"".format(drift_schema_name, drift_table_name),\
                                 properties=drift_connection_properties)

drift_table_df = drift_table_df.where(drift_table_df.run_id == MONITOR_RUN_ID)
drift_table_df.printSchema()

In [None]:
record_timestamp_column = get_record_timestamp_column(subscription)

# Convert ISO format timestamp to DB2 SQL compatible format
start = measurement_data["start"].replace("T", " ")
end = measurement_data["end"].replace("T", " ")

sql = "(select min(\"{0}\") \"rtmin\", max(\"{0}\") \"rtmax\" from \"{1}\".\"{2}\" where \"{3}\" >= '{4}' and \"{3}\" <= '{5}')".format(payload_partition_column, payload_schema_name, payload_table_name, record_timestamp_column, start, end)
print(sql)

result = spark.read.jdbc(url=jdbc_url,\
                 table=sql,\
                 properties=connection_properties).collect()[0]


payload_connection_properties = {}
payload_connection_properties.update(connection_properties)
# Use the payload table's own partition column and partition count;
# the bounds above were computed on payload_partition_column.
payload_connection_properties["partitionColumn"] = payload_partition_column
payload_connection_properties["lowerBound"] = str(result.rtmin)
payload_connection_properties["upperBound"] = str(result.rtmax)
payload_connection_properties["numPartitions"] = str(payload_num_partitions)

payload_connection_properties
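
The cell above converts the measurement's ISO timestamps by replacing `T` with a space. If the values carry a trailing `Z` or a UTC offset, a slightly more defensive conversion may be useful. The sketch below assumes that record timestamps are stored in UTC and that the database accepts the `YYYY-MM-DD HH:MM:SS.ffffff` string form; the helper name `iso_to_sql_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def iso_to_sql_timestamp(value):
    """Convert an ISO 8601 timestamp string to 'YYYY-MM-DD HH:MM:SS.ffffff' (assumed UTC)."""
    text = value.strip()
    # Older Python versions do not parse a trailing 'Z', so spell out the offset.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is not None:
        # Normalize to UTC and drop the offset (assumption: the table stores UTC values).
        parsed = parsed.astimezone(timezone.utc).replace(tzinfo=None)
    return parsed.strftime("%Y-%m-%d %H:%M:%S.%f")


print(iso_to_sql_timestamp("2024-01-31T10:15:30.123456Z"))
```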

In [None]:
payload_table_df = spark.read.jdbc(url=jdbc_url,\
                                 table="\"{}\".\"{}\"".format(payload_schema_name, payload_table_name),\
                                 properties=payload_connection_properties)

payload_table_df.printSchema()

In [None]:
%%time

print("Total number of drifted transactions: {}".format(drift_table_df.count()))
print("Total number of model drift transactions: {}".format(drift_table_df.where("is_model_drift").count()))
print("Total number of data drift transactions: {}".format(drift_table_df.where("is_data_drift").count()))
print("Total number of model + data drift transactions: {}".format(drift_table_df.where("is_model_drift").where("is_data_drift").count()))
print()

## Analyse Transactions Causing Drop in Accuracy

### Get all transactions causing drop in data accuracy

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
%%time

drifted_transactions_df = drift_table_df\
    .where("is_model_drift")\
    .select(["scoring_id","drift_model_confidence"])

count = drifted_transactions_df.count()

print("Total {} transactions are causing drop in accuracy.".format(count))

if count:
    num_rows = 10
    print("Showing {} such transactions in the order of drift_model_confidence".format(num_rows))

    drifted_transactions_df = payload_table_df\
        .join(drifted_transactions_df, ["scoring_id"], "leftsemi")\
        .join(drifted_transactions_df, ["scoring_id"], "left")\
        .sort(["drift_model_confidence"], ascending=False)

    show_dataframe(drifted_transactions_df, num_rows=num_rows, priority_columns=["drift_model_confidence"])


### Get all transactions causing drop in accuracy in given range of drift model confidence

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
%%time

# Drift Model Confidence Lower Limit
dm_conf_lower = 0.5
# Drift Model Confidence Upper Limit
dm_conf_upper = 1.0

drifted_transactions_df = drift_table_df\
    .where("is_model_drift")\
    .where(drift_table_df.drift_model_confidence.between(dm_conf_lower,dm_conf_upper))\
    .select(["scoring_id","drift_model_confidence"])

count = drifted_transactions_df.count()

print("Total {} transactions are causing drop in accuracy where drift model confidence is between {} and {}".format(count, dm_conf_lower, dm_conf_upper))

if count:
    num_rows = 10
    print("Showing {} such transactions in the order of drift_model_confidence".format(num_rows))

    drifted_transactions_df = payload_table_df\
        .join(drifted_transactions_df, ["scoring_id"], "leftsemi")\
        .join(drifted_transactions_df, ["scoring_id"], "left")\
        .sort(["drift_model_confidence"], ascending=False)

    show_dataframe(drifted_transactions_df, num_rows=num_rows, priority_columns=["drift_model_confidence"])


## Analyse Transactions Causing Drop in Accuracy and Drop in Data Consistency

### Get all transactions causing drop in accuracy and  drop in data consistency

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
%%time

drifted_transactions_df = drift_table_df\
    .where("is_model_drift")\
    .where("is_data_drift")\
    .select(["scoring_id","drift_model_confidence"])

count = drifted_transactions_df.count()

print("Total {} transactions are causing both drop in accuracy and drop in data consistency".format(count))

if count:
    num_rows = 10
    print("Showing {} such transactions in the order of drift_model_confidence".format(num_rows))

    drifted_transactions_df = payload_table_df\
        .join(drifted_transactions_df, ["scoring_id"], "leftsemi")\
        .join(drifted_transactions_df, ["scoring_id"], "left")\
        .sort(["drift_model_confidence"], ascending=False)

    show_dataframe(drifted_transactions_df, num_rows=num_rows, priority_columns=["drift_model_confidence"])


### Get all transactions causing drop in accuracy and drop in data consistency in given range of drift model confidence

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
%%time

# Drift Model Confidence Lower Limit
dm_conf_lower = 0.5
# Drift Model Confidence Upper Limit
dm_conf_upper = 1.0

drifted_transactions_df = drift_table_df\
    .where("is_model_drift")\
    .where("is_data_drift")\
    .where(drift_table_df.drift_model_confidence.between(dm_conf_lower,dm_conf_upper))\
    .select(["scoring_id","drift_model_confidence"])

count = drifted_transactions_df.count()

print("Total {} transactions are causing drop in accuracy and drop in data consistency where drift model confidence is between {} and {}".format(count, dm_conf_lower, dm_conf_upper))

if count:
    num_rows = 10
    print("Showing {} such transactions in the order of drift_model_confidence".format(num_rows))

    drifted_transactions_df = payload_table_df\
        .join(drifted_transactions_df, ["scoring_id"], "leftsemi")\
        .join(drifted_transactions_df, ["scoring_id"], "left")\
        .sort(["drift_model_confidence"], ascending=False)

    show_dataframe(drifted_transactions_df, num_rows=num_rows, priority_columns=["drift_model_confidence"])


## Analyse Transactions Causing Drop in Data Consistency

### Get all transactions causing drop in data consistency

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
%%time

drifted_transactions_df = drift_table_df\
    .where("is_data_drift")\
    .select(["scoring_id"])

count = drifted_transactions_df.count()

print("Total {} transactions are causing drop in data consistency".format(count))

if count:
    num_rows = 10
    print("Showing {} such transactions".format(num_rows))

    drifted_transactions_df = payload_table_df\
        .join(drifted_transactions_df, ["scoring_id"], "leftsemi")\
        .join(drifted_transactions_df, ["scoring_id"], "left")

    show_dataframe(drifted_transactions_df, num_rows=num_rows)


### Get all transactions violating a data constraint

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
%%time

constraint_name = ConstraintName.CATEGORICAL_DISTRIBUTION_CONSTRAINT

drifted_transactions_df = drift_table_df\
        .where("is_data_drift")\
        .where(F.col(constraint_name.value).like("%1%"))\
        .select(["scoring_id"])

count = drifted_transactions_df.count()

print("Total {} transactions are violating {}.".format(count, constraint_name.value))

if count:
    num_rows = 10
    print("Showing {} such transactions.".format(num_rows))
    
    drifted_transactions_df = payload_table_df\
        .join(drifted_transactions_df, ["scoring_id"], "leftsemi")\
        .join(drifted_transactions_df, ["scoring_id"], "left")

    show_dataframe(drifted_transactions_df, num_rows=num_rows)



### Get all transactions where a column is causing drop in data consistency

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
filter_query = get_column_query(constraints_set, schema, column="<column_name>")

drifted_transactions_df = drift_table_df\
    .where("is_data_drift")\
    .where(filter_query)\
    .select(["scoring_id"])
count = drifted_transactions_df.count()

print("Total {} transactions are satisfying the given query.".format(count))

if count:
    num_rows = 10
    print("Showing {} such transactions.".format(num_rows))

    drifted_transactions_df = payload_table_df\
        .join(drifted_transactions_df, ["scoring_id"], "leftsemi")\
        .join(drifted_transactions_df, ["scoring_id"], "left")

    show_dataframe(drifted_transactions_df, num_rows=num_rows)


### Query all the learnt constraints based on a column name

Use the `show_constraints_by_column` method to find all the constraints learnt for a particular column at training time. The constraint ids shown in the cell output can be used to explain the corresponding constraint in subsequent cells.

In [None]:
show_constraints_by_column(constraints_set, "<column_name>")

### Explain categorical distribution constraint violations

Explains categorical distribution constraint violations for a given constraint id. The constraint id can be obtained by running [this cell](#Query-all-the-learnt-constraints-based-on-a-column-name).

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
%%time 

constraint_id = "<constraint_id>"

drifted_transactions_df = explain_categorical_distribution_constraint(drifted_transactions_df=drift_table_df,
                              payload_table_df=payload_table_df,
                              constraints_set=constraints_set,
                              schema=schema,
                              constraint_id=constraint_id)

if drifted_transactions_df:
    num_rows = 10
    print("Showing {} such transactions.".format(num_rows))

    show_dataframe(drifted_transactions_df, num_rows=num_rows)

### Explain numeric range constraint violations

Explains numeric range constraint violations for a given constraint id. The constraint id can be obtained by running [this cell](#Query-all-the-learnt-constraints-based-on-a-column-name).

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
%%time

constraint_id = "<constraint_id>"

drifted_transactions_df = explain_numeric_range_constraint(drifted_transactions_df=drift_table_df,
                              payload_table_df=payload_table_df,
                              constraints_set=constraints_set,
                              schema=schema,
                              constraint_id=constraint_id)


if drifted_transactions_df:
    num_rows = 10
    print("Showing {} such transactions.".format(num_rows))

    show_dataframe(drifted_transactions_df, num_rows=num_rows)

### Explain cat-numeric range constraint violations

Explains cat-numeric range constraint violations for a given constraint id. The constraint id can be obtained by running [this cell](#Query-all-the-learnt-constraints-based-on-a-column-name).

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
%%time

constraint_id = "<constraint_id>"

drifted_transactions_df = explain_catnum_range_constraint(drifted_transactions_df=drift_table_df,
                              payload_table_df=payload_table_df,
                              constraints_set=constraints_set,
                              schema=schema,
                              constraint_id=constraint_id)

if drifted_transactions_df:
    num_rows = 10
    print("Showing {} such transactions.".format(num_rows))

    show_dataframe(drifted_transactions_df, num_rows=num_rows)

### Explain cat-cat distribution constraint violations

Explains cat-cat distribution constraint violations for a given constraint id. The constraint id can be obtained by running [this cell](#Query-all-the-learnt-constraints-based-on-a-column-name).

The `drifted_transactions_df` can be exported to a format of your choice for further analysis.

In [None]:
%%time

constraint_id = "<constraint_id>"

drifted_transactions_df = explain_catcat_distribution_constraint(drifted_transactions_df=drift_table_df,
                              payload_table_df=payload_table_df,
                              constraints_set=constraints_set,
                              schema=schema,
                              constraint_id=constraint_id)


if drifted_transactions_df:
    num_rows = 10
    print("Showing {} such transactions.".format(num_rows))

    show_dataframe(drifted_transactions_df, num_rows=num_rows)

#### Authors
Developed by [Prem Piyush Goyal](mailto:prempiyush@in.ibm.com)