# Custom Data Classes in IBM Knowledge Catalog

IBM Knowledge Catalog ships with many predefined data classes and offers several mechanisms for creating new ones, for example from reference data or regular expressions.

You can now also create a data class that is defined through custom code. This notebook demonstrates how to do that.

In principle, to define a custom data class through your own code, you need to:
1. Define a so-called "deployment" with Watson Machine Learning (WML) that provides a REST endpoint for your custom code.
2. Associate your data class with this deployment through the Data Classes UI or REST API.

We will implement the code for a custom data class (which we will call `MyID`) that should be selected for a column if all of these criteria are met:
1. The column name contains the string `ID` (the check is case-insensitive).
2. The inferred data type of the column is `STRING`.
3. The inferred length of the column is less than 15 characters.

### Prerequisites

1. A Watson Machine Learning deployment space on your Cloud Pak for Data system.
2. If the notebook is not running inside IBM Watson Studio: User ID and password of a user with access to the deployment space.

## Set up this notebook

Enter the required settings to connect to your CPD instance.

Tip: Consider replacing the calls to `getpass.getpass()` and `input()` with clear text values if appropriate to avoid having to enter them each time you run the notebook.
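
As an alternative to hard-coding values, the settings could be read from environment variables, with the prompt as a fallback. The following is only a minimal sketch: the helper `read_setting` and the variable name `CP4D_URL` are hypothetical and not part of the notebook or of any IBM library.

```python
import os

def read_setting(env_var, prompt, environ=os.environ, ask=input):
    """Return the value of env_var if it is set and non-empty, otherwise prompt for it."""
    value = environ.get(env_var)
    if value:
        return value
    # Fall back to an interactive prompt (pass getpass.getpass as `ask` for passwords)
    return ask(prompt)

# Hypothetical usage; CP4D_URL is an assumed variable name:
# cp4d_url = read_setting("CP4D_URL", "Enter your Cloud Pak for Data cluster URL: ")
```

With this approach, the prompts only appear when the corresponding variable is not set in the notebook environment.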

In [None]:
cp4d_url = input("Enter your Cloud Pak for Data cluster URL (example: https://cpd-wkc.apps.example.com): ")

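# pkgutil.find_loader() is deprecated in recent Python versions; find_spec() performs the same check
import importlib.util

if importlib.util.find_spec('ibm_watson_studio_lib') is not None: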
    from ibm_watson_studio_lib import access_project_or_space
    wslib = access_project_or_space()
    wkc_token = wslib.auth.get_current_token()

else:
    import requests
    import getpass
    import os

    cp4d_username = input("Enter your Cloud Pak for Data user ID: ")
    cp4d_password = getpass.getpass("Enter your Cloud Pak for Data password: ")

    def get_authentication_token():
        print("Authenticating")
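        # verify=False disables TLS certificate verification; use it only for clusters with self-signed certificates
        auth_response = requests.get(
            cp4d_url + "/v1/preauth/validateAuth",
            headers={"username": cp4d_username, "password": cp4d_password},
            verify=False,
        )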
        auth_response.raise_for_status()
        print("Authentication successful")
        wkc_token = auth_response.json()['accessToken']
        return wkc_token
    wkc_token = get_authentication_token()

from ibm_watson_machine_learning import APIClient

def initialize_wml_client(wkc_token):
    api_client_auth = { 'token': wkc_token, 'url': cp4d_url, "instance_id": "openshift", 'version': '4.8' }   
    print("Creating WML client")
    wml_client = APIClient(api_client_auth)
    return wml_client

wml_client = initialize_wml_client(wkc_token)
print("Notebook configured.")

#### Optional: List available deployment spaces

In [None]:
print("Some available deployment spaces:")
wml_client.spaces.list(limit=50)

### Set deployment space ID

In [None]:
# ID of deployment space for artifacts created by this notebook
wml_space_id = input("Enter the ID of your Watson Machine Learning deployment space: ")
wml_client.set.default_space(wml_space_id)

print(f"Artifacts will be created in space {wml_space_id}.")

## Scoring Functions for Custom Data Classes

The scoring function for a custom data class receives the data profile of a column as input and returns a confidence value between 0 and 1 (that is, 0% to 100%) indicating how confident it is that the column belongs to the custom data class.

Since scoring functions in Watson Machine Learning work on tables, the input is technically a table with a single column and a single row.
The value of this row is similar to the response of the `/v2/data_profiles?project_id=<project_id>&dataset_id=<data_asset_id>` API under the path `resources[0]/columns[name=<column_name>]`. 

Here is an example:

```
{
    "input_data": [
        {
            "values": [
                [
                    {
                        "name": "MY_ID_1",
                        "value_analysis": {
                            "unique_count": 10,
                            "null_count": 0,
                            "inferred_type": {
                                "type": {
                                    "length": 8,
                                    "type": "STRING"
                                }
                            }
                        }
                    }
                ]
            ]
        }
    ]
}
```

There is only one entry in `input_data` and this one entry has a single row in the `values` list which, in turn, contains the data profile of the column.

The column is called `MY_ID_1` and the data profile indicates that it has 10 unique values, no null values, and that the values were found to be strings of length 8.
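
To make the nesting concrete, here is a minimal sketch of how the relevant fields could be read from such a payload. The helper name `extract_column_profile` is made up for illustration, and the sketch assumes the payload has exactly the shape shown in the example above.

```python
def extract_column_profile(payload):
    """Return (column name, inferred type, inferred length) from a scoring payload."""
    # One entry in input_data, one row in values, one cell holding the data profile
    data_profile = payload["input_data"][0]["values"][0][0]
    inferred = data_profile.get("value_analysis", {}).get("inferred_type", {}).get("type", {})
    return data_profile.get("name"), inferred.get("type"), inferred.get("length")

example_payload = {
    "input_data": [
        {
            "values": [
                [
                    {
                        "name": "MY_ID_1",
                        "value_analysis": {
                            "unique_count": 10,
                            "null_count": 0,
                            "inferred_type": {"type": {"length": 8, "type": "STRING"}},
                        },
                    }
                ]
            ]
        }
    ]
}

print(extract_column_profile(example_payload))  # ('MY_ID_1', 'STRING', 8)
```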


Similarly, the output of the scoring function is a table with a single column and a single row. For instance, suppose we want to express that the column with the input from above belongs to our data class with a confidence of 95%. Then the function should return a response like this:

```
{
    "predictions": [
        { 
            "fields": ["MY_ID_1"], 
            "values": [[0.95]]
        }
    ]
}
```
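
A response of this shape could be assembled with a small helper such as the one sketched below. The name `build_scoring_response` is hypothetical, and the range check assumes that the confidence is expressed as a fraction between 0 and 1, as in the example above.

```python
def build_scoring_response(column_name, confidence):
    """Build a single-column, single-row scoring response for one column."""
    # Assumption: confidence is a fraction, so 0.95 stands for 95%
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {confidence}")
    return {
        "predictions": [
            {
                "fields": [column_name],
                "values": [[confidence]],
            }
        ]
    }

print(build_scoring_response("MY_ID_1", 0.95))
```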

## Create a Scoring Function

In this section, we will create the code that determines whether a column belongs to the `MyID` custom data class.

The main entry point is the `score(payload)` function. It extracts the needed data from the input and returns a confidence of 1.0
if the column matches all the criteria above and 0.0 otherwise.

In [None]:
# The WML client API for creating deployments requires a function that, in turn, returns a function object
def custom_dataclass_scoring_function():

    def score(payload):
        # The payload holds one entry with one row, which contains the data profile of the column
        data_profile = payload.get("input_data")[0].get("values")[0][0]
        column_name = data_profile.get("name")
        inferred_type = data_profile.get("value_analysis", {}).get("inferred_type", {}).get("type", {})
        inferred_length = inferred_type.get("length")
        
        confidence = 0.0

        # If "id" is part of the column name (case-insensitive)
        #    and inferred data type is string
        #    and length is known and less than 15
        # then return confidence 1
        if "id" in column_name.lower() \
            and inferred_type.get("type") == "STRING" \
            and inferred_length is not None and inferred_length < 15:
            confidence = 1.0
        
        return {
            'predictions': [
                {
                    'fields': [column_name], 
                    'values': [[confidence]]
                }
            ]
        } 

    # Return the function
    return score

### (Optional) Test Scoring Function

The following cell contains code that calls the scoring function against three different test payloads. For readability, these test payloads only contain a very small subset of the data profile.

In [None]:
import json

# 1. Column matches all criteria of the MyID data class
test_input1 = '''
{
    "input_data": [
        {
            "values": [
                [
                    {
                        "name": "MY_ID_1",
                        "value_analysis": {
                            "unique_count": 10,
                            "null_count": 0,
                            "inferred_type": {
                                "type": {
                                    "length": 8,
                                    "type": "STRING"
                                }
                            }
                        }
                    }
                ]
            ]
        }
    ]
}                   
'''
response1 = custom_dataclass_scoring_function()(json.loads(test_input1))
print(f"Scoring test 1 output (confidence should be 1.0): {response1}" )


# 2. Column does not have 'id' in its name
test_input2 = '''
{
    "input_data": [
        {
            "values": [
                [
                    {
                        "name": "MY_COLUMN_2",
                        "value_analysis": {
                            "unique_count": 10,
                            "null_count": 0,
                            "inferred_type": {
                                "type": {
                                    "length": 8,
                                    "type": "STRING"
                                }
                            }
                        }
                    }
                ]
            ]
        }
    ]
}                   
'''
response2 = custom_dataclass_scoring_function()(json.loads(test_input2))
print(f"Scoring test 2 output (confidence should be 0.0): {response2}" )


# 3. Column is of numeric data type 
test_input3 = '''
{
    "input_data": [
        {
            "values": [
                [
                    {
                        "name": "MY_ID_3",
                        "value_analysis": {
                            "unique_count": 10,
                            "null_count": 0,
                            "inferred_type": {
                                "type": {
                                    "length": 8,
                                    "type": "NUMERIC"
                                }
                            }
                        }
                    }
                ]
            ]
        }
    ]
}                   
'''

response3 = custom_dataclass_scoring_function()(json.loads(test_input3))
print(f"Scoring test 3 output (confidence should be 0.0): {response3}" )

## Create Deployment 

This section stores the scoring function in the WML repository and creates a deployment of that function in the WML space.

### Artifact names
Modify the variables in the following cell to change the names under which the scoring function and deployment are created.

In [None]:
dataclass_name = "data_class_myid"
scoring_function_name = dataclass_name + "_scoring_function"
deployment_name = dataclass_name + "_deployment"
serving_name = scoring_function_name

The next cell will store our scoring function and create a deployment for it on the WML instance.

In [None]:
# store function
scoring_function_sw_spec_uid = wml_client.software_specifications.get_uid_by_name("runtime-23.1-py3.10")
meta_props = {
    wml_client.repository.FunctionMetaNames.NAME: scoring_function_name,
    wml_client.repository.FunctionMetaNames.DESCRIPTION: f"Scoring function for custom data class {dataclass_name}",
    wml_client.repository.FunctionMetaNames.SOFTWARE_SPEC_UID: scoring_function_sw_spec_uid
}
scoring_function_details = wml_client.repository.store_function(meta_props=meta_props, function=custom_dataclass_scoring_function)
scoring_function_id = wml_client.repository.get_function_id(scoring_function_details)

# deploy function
meta_props = {
    wml_client.deployments.ConfigurationMetaNames.NAME: deployment_name,
    wml_client.deployments.ConfigurationMetaNames.DESCRIPTION: f"Deployment of scoring function for custom data class {dataclass_name}",
    wml_client.deployments.ConfigurationMetaNames.ONLINE: { "parameters": { "serving_name": serving_name } }
}
scoring_function_deployment_details = wml_client.deployments.create(scoring_function_id, meta_props=meta_props)
scoring_function_deployment_id = wml_client.deployments.get_id(scoring_function_deployment_details)

print(f"Scoring function deployed with deployment ID: {scoring_function_deployment_id}")

### (Optional) Test the deployed scoring function

We can test the deployment with the same test payloads as above. 

In [None]:
import json

response1 = wml_client.deployments.score(serving_name, json.loads(test_input1))
print(f"Scoring test 1 output (confidence should be 1.0): {response1}" )

response2 = wml_client.deployments.score(serving_name, json.loads(test_input2))
print(f"Scoring test 2 output (confidence should be 0.0): {response2}" )

response3 = wml_client.deployments.score(serving_name, json.loads(test_input3))
print(f"Scoring test 3 output (confidence should be 0.0): {response3}" )
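
Instead of comparing the printed output by eye, the confidence can be pulled out of a response and checked with an assertion. This is a sketch only: `extract_confidence` is a hypothetical helper, and it assumes the deployment returns the same `predictions` structure that the scoring function produces.

```python
def extract_confidence(response):
    """Return the single confidence value from a scoring response."""
    # Single prediction entry, single row, single value
    return response["predictions"][0]["values"][0][0]

# Sample response in the shape produced by the scoring function
sample_response = {"predictions": [{"fields": ["MY_ID_1"], "values": [[1.0]]}]}
assert extract_confidence(sample_response) == 1.0
```

Applied to the responses above, the checks would read `assert extract_confidence(response1) == 1.0`, and `== 0.0` for the other two test cases.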

## Congratulations!

You have successfully deployed and tested the scoring function for a custom data class.
To try this in your Cloud Pak for Data system, create a new data class in the glossary, set its
matching method to "Custom Service", and select the deployment you have just created.