## Consume Data Attribute Recommendation via AI API 

Deep dive into the Python SDK for the Data Attribute Recommendation service using the AI API from SAP AI Core

## Business Scenario

Let's examine a business scenario involving product master data. Creating and maintaining product master data requires manually choosing the right categories from a pre-set list for each product.
In this notebook, we will explore how to automate this tedious manual task using the Data Attribute Recommendation service.

This example will cover:
    
* Data Upload
* Model Training and Deployment
* Inference Requests
    
We will work through a basic example of how to achieve these tasks using the AI API Client of the [Python SDK for Data Attribute Recommendation](https://github.com/SAP/data-attribute-recommendation-python-sdk).




## Table of Contents

* [Exercise 01.1](#Exercise-01.1) - Installing the SDK and preparing the service key
* [Exercise 01.2](#Exercise-01.2) - Uploading data via DAR AI API
* [Exercise 01.3](#Exercise-01.3) - Training the model via DAR AI API
* [Exercise 01.4](#Exercise-01.4) - Deploying the model via DAR AI API
* [Exercise 01.5](#Exercise-01.5) - Predicting labels via DAR Inference Client
* [Cleaning up a service instance](#Cleaning-up-a-service-instance) - Clean up all resources on the service instance

# Exercise 01.1

*Back to [table of contents](#Table-of-Contents)*

In exercise 01.1, we will install the SDK and prepare the service key.

## Installing the SDK

The Data Attribute Recommendation SDK is available from the Python package repository. It can be installed with the standard `pip` tool:

In [None]:
! pip install data-attribute-recommendation-sdk

*Note: If you are not using a Jupyter notebook, but instead a regular Python development environment, we recommend using a Python virtual environment to set up your development environment. Please see [the dedicated tutorial to learn how to install the SDK inside a Python virtual environment](https://developers.sap.com/tutorials/cp-aibus-dar-sdk-setup.html).*

## Creating a service instance and key on BTP Trial

Please log in to your trial account: https://account.hanatrial.ondemand.com/

In your global account screen, go to the "Boosters" tab:

![trial_booster.png](images/trial_booster.png)

*Boosters are also available in production. If you are using a production environment, please follow this [tutorial](https://developers.sap.com/tutorials/cp-aibus-dar-booster-free-key.html) to set up your account for Data Attribute Recommendation and obtain a service key using either the free or the standard plan.*

In the Boosters tab, enter "Data Attribute Recommendation" into the search box. Then, select the
service tile from the search results: 
    
![trial_locate_dar_booster.png](images/trial_locate_dar_booster.png)

The resulting screen shows details of the booster pack. Here, click the "Start" button and wait a few seconds.

![trial_start_booster.png](images/trial_start_booster.png)

Once the booster has finished, click the "Download Service Key" link to obtain your service key and save it to disk.

![trial_booster_finished.png](images/trial_booster_finished.png)

## Loading the service key into your Jupyter Notebook

Once you have downloaded the service key from the Cockpit, upload it to your notebook environment. The service key must be uploaded to the same directory where the `Data_Attribute_Recommendation_AI_API.ipynb` file is stored.

When using JupyterLab, a file browser is visible to the left of the notebook view. Click the upload button there to upload the `default_key.json` file we downloaded earlier from the BTP Cockpit.


![service_key_main_jupyter_page.png](images/service_key_main_jupyter_page.png)



![service_key_upload.png](images/service_key_upload.png)

Once you click the upload button, a file chooser dialog opens where you can select the `default_key.json` file. After the upload has finished successfully, you should see `default_key.json` in the file browser.

**Make sure that the file name is `default_key.json`. If your service key file has a different name, this notebook will not work.**

The service key contains your credentials to access the service. Please treat this as carefully as you would treat any password. We keep the service key as a separate file outside this notebook to avoid leaking the secret credentials.

The service key is a JSON file. We will load this file once and use the credentials throughout this workshop. 

In [2]:
# First, set up logging so we can see the actions performed by the SDK behind the scenes
import sys
import logging

logging.basicConfig(level=logging.INFO, stream=sys.stdout)

from pprint import pprint  # for better output formatting

In [None]:
import json
import os

if not os.path.exists("default_key.json"):
    msg = "'default_key.json' is not found. Please follow instructions above to create a service key of"
    msg += " Data Attribute Recommendation. Then, upload it into the same directory where"
    msg += " this notebook is saved."
    print(msg)
    raise ValueError(msg)

with open("default_key.json") as file_handle:
    key = file_handle.read()
    SERVICE_KEY = json.loads(key)
    print("Service URL: ")
    pprint(SERVICE_KEY["url"])
    print("Client ID:")
    pprint(SERVICE_KEY["uaa"]["clientid"])
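Before using the credentials, it can be useful to confirm that the service key contains every field this notebook reads later (`url`, `uaa.url`, `uaa.clientid` and `uaa.clientsecret`). The following is a minimal sketch; the helper name `missing_service_key_fields` and the example values are made up for illustration and are not part of the SDK.

```python
from typing import Any, Dict, List

# Key paths that this notebook reads from the service key.
REQUIRED_KEY_PATHS = [
    ("url",),
    ("uaa", "url"),
    ("uaa", "clientid"),
    ("uaa", "clientsecret"),
]


def missing_service_key_fields(service_key: Dict[str, Any]) -> List[str]:
    """Return the dotted paths of required fields that are absent or empty."""
    missing = []
    for path in REQUIRED_KEY_PATHS:
        node: Any = service_key
        for part in path:
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if not node:
            missing.append(".".join(path))
    return missing


# Made-up, incomplete key for illustration (the client secret is missing).
example_key = {
    "url": "https://example.invalid",
    "uaa": {"url": "https://auth.example.invalid", "clientid": "my-client"},
}
print(missing_service_key_fields(example_key))
```

With the real key, `missing_service_key_fields(SERVICE_KEY)` should return an empty list. The helper reports only field names, so no secret values end up in the notebook output.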

## Summary Exercise 01.1

In exercise 01.1, we have covered the following topics:

* How to install the Python SDK for Data Attribute Recommendation
* How to obtain a service key for the Data Attribute Recommendation service

# Exercise 01.2

*Back to [table of contents](#Table-of-Contents)*

*To perform this exercise, you need to execute the code in all previous exercises.*

In exercise 01.2, we will upload our demo dataset to the service.

## The Dataset

### Obtaining data

The dataset we use in this workshop is a CSV file containing scientific paper titles and their topic categories. This dataset is well suited to use cases where the labels are independent of one another, meaning that the presence or absence of one label does not influence the others.

Let's inspect the data:

In [None]:
# if you are experiencing an import error here, run the following in a new cell:
# ! pip install pandas
import pandas as pd

df = pd.read_csv("data/arxiv.csv")
df.head(5)

In [None]:
df.tail()

In [None]:
print()
print(f"Data has {df.shape[0]} rows and {df.shape[1]} columns.")

The CSV contains the titles of several scientific papers. For each title, the set of associated topics is provided as labels. The labels and their full forms are:
- CSC: Computer Science
- STA: Statistics
- QFI: Quantitative Finance
- QBI: Quantitative Biology
- PHY: Physics

For example, the instance of the dataset with the title `Contemporary machine learning: a guide for practitioners in the physical sciences` has the following set of labels:
- Computer Science
- Physics

We will use the Data Attribute Recommendation service to predict the labels for a given paper based on its **title**. However, you can add other attributes such as length of the paper, number of words, conference name and type to improve the classifier further.

### Initializing DAR AI API client

First, we initialize the DAR AI API Client that we will use throughout the rest of this notebook.

In [15]:
from sap.aibus.dar.client.aiapi.dar_ai_api_client import DARAIAPIClient

url = SERVICE_KEY['url']
client_id = SERVICE_KEY['uaa']['clientid']
client_secret = SERVICE_KEY['uaa']['clientsecret']
auth_url = SERVICE_KEY['uaa']['url']

dar_ai_api_client = DARAIAPIClient(
    base_url=url + '/model-manager/v2/lm',
    auth_url=auth_url + '/oauth/token',
    client_id=client_id,
    client_secret=client_secret
)
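The cell above derives the client arguments from the service key by appending fixed paths to two URLs. As a hedged sketch, this derivation can be wrapped in a small function so that a trailing slash in the key does not produce a doubled `//`. The function name `build_ai_api_client_kwargs` and the example values are hypothetical; only the two path suffixes and the four argument names come from the cell above.

```python
from typing import Any, Dict


def build_ai_api_client_kwargs(service_key: Dict[str, Any]) -> Dict[str, str]:
    """Derive the four DARAIAPIClient arguments used in the cell above."""
    uaa = service_key["uaa"]
    return {
        # rstrip guards against a trailing slash producing "//" in the URL
        "base_url": service_key["url"].rstrip("/") + "/model-manager/v2/lm",
        "auth_url": uaa["url"].rstrip("/") + "/oauth/token",
        "client_id": uaa["clientid"],
        "client_secret": uaa["clientsecret"],
    }


# Made-up values for illustration only.
fake_key = {
    "url": "https://dar.example.invalid/",
    "uaa": {
        "url": "https://auth.example.invalid",
        "clientid": "my-client",
        "clientsecret": "my-secret",
    },
}
client_kwargs = build_ai_api_client_kwargs(fake_key)
print(client_kwargs["base_url"])
print(client_kwargs["auth_url"])
```

With the real key, the client could then be created as `DARAIAPIClient(**build_ai_api_client_kwargs(SERVICE_KEY))`.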

### Creating the dataset schema

We first have to describe the format of our data by creating a dataset schema. This schema informs the service about the individual column types found in the CSV. It also declares which columns are the target columns used for training. These are the columns that will later be predicted.

The service currently supports three column types: **TEXT**, **CATEGORY** and **NUMBER**. As labels to be predicted, only **CATEGORY** and **NUMBER** are currently supported.

For this example, we have already prepared the dataset schema. It can be found in [data/schema_arxiv.json](./data/schema_arxiv.json). We can look at it as follows:

In [None]:
file_path = "data/schema_arxiv.json"

# Read the JSON file
with open(file_path, "r") as json_file:
    schema = json.load(json_file)

pprint(schema)

### Uploading data to the service

We will now upload our dataset and dataset schema files using the DAR AI API Client which we created earlier.

The dataset must be a CSV file and conform to the dataset schema. The CSV file can optionally be `gzip` compressed.

In [17]:
# Compress file first for a faster upload
! gzip -9 -c data/arxiv.csv > data/arxiv.csv.gz

The dataset and dataset schema files are uploaded using the `file_upload_client.put_file()` method of the `DARAIAPIClient`.

In [None]:
dataset_upload_response = dar_ai_api_client.file_upload_client.put_file(
    local_path='data/arxiv.csv.gz',
    remote_path='/trial-test/arxiv.csv.gz',
    overwrite=True,
)

dataset_url = dataset_upload_response.json()["url"]
print("The uploaded dataset URL: ", dataset_url)
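The compression step in this exercise relies on the `gzip` command line tool, which may not be available on every system (for example on Windows). A portable alternative is sketched below using only the Python standard library. The helper name `gzip_file` is made up, and the demo compresses a throwaway temporary file rather than the workshop data.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(source: str, target: str, level: int = 9) -> int:
    """Compress `source` into `target` and return the compressed size in bytes."""
    with open(source, "rb") as src, gzip.open(target, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)
    return os.path.getsize(target)


# Demo on a temporary CSV so that the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    with open(csv_path, "w") as handle:
        handle.write("title,label\n" * 1000)
    compressed_size = gzip_file(csv_path, csv_path + ".gz")
    print(f"{os.path.getsize(csv_path)} bytes -> {compressed_size} bytes")
```

For the workshop data, the equivalent call would be `gzip_file("data/arxiv.csv", "data/arxiv.csv.gz")`.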

In [None]:
schema_upload_response = dar_ai_api_client.file_upload_client.put_file(
    local_path='data/schema_arxiv.json',
    remote_path='/trial-test/schema_arxiv.json',
    overwrite=True,
)

schema_url = schema_upload_response.json()["url"]
print("The uploaded dataset schema URL: ", schema_url)
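If you adapt this notebook to your own data, it may help to check a schema against the column type rules from this exercise before uploading it: features may be TEXT, CATEGORY or NUMBER, while labels may only be CATEGORY or NUMBER. The sketch below assumes that the schema has top-level `features` and `labels` lists whose entries carry `label` and `type` keys. That layout is an assumption here, so compare it with the output of `pprint(schema)` before relying on it. The helper `check_schema` and the example schema are hypothetical.

```python
from typing import Any, Dict, List

FEATURE_TYPES = {"TEXT", "CATEGORY", "NUMBER"}
LABEL_TYPES = {"CATEGORY", "NUMBER"}  # per the text above, TEXT cannot be predicted


def check_schema(schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for section, allowed in (("features", FEATURE_TYPES), ("labels", LABEL_TYPES)):
        columns = schema.get(section) or []
        if not columns:
            problems.append(f"'{section}' is missing or empty")
        for column in columns:
            column_type = column.get("type")
            if column_type not in allowed:
                problems.append(
                    f"{section}: column {column.get('label')!r} "
                    f"has unsupported type {column_type!r}"
                )
    return problems


# Hypothetical schema with a mistake: a TEXT column used as a label.
example_schema = {
    "features": [{"label": "title", "type": "TEXT"}],
    "labels": [{"label": "category", "type": "TEXT"}],
}
print(check_schema(example_schema))
```

This check covers only the type rules stated in this notebook. The service may apply further validation when the schema is used for training.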

## Summary Exercise 01.2

In exercise 01.2, we have covered the following topics:

* How to create a dataset schema
* How to upload a dataset and the dataset schema files to the service

# Exercise 01.3

*Back to [table of contents](#Table-of-Contents)*

*To perform this exercise, you need to execute the code in all previous exercises.*

In exercise 01.3, we will register the artifacts and train the model.

## Select the Scenario

To train a machine learning model, we first need to select the correct scenario. You can refer to the [official documentation on Scenarios](https://help.sap.com/docs/data-attribute-recommendation/data-attribute-recommendation/scenarios?locale=en-US) to learn more. Additional scenarios may be added over time, so check back regularly.

We can also query the list of scenarios through the DAR AI API:

In [None]:
from ai_api_client_sdk.models.scenario_query_response import ScenarioQueryResponse

scenario_query_response: ScenarioQueryResponse = dar_ai_api_client.scenario.query()
for scenario in scenario_query_response.resources:
    pprint(scenario.__dict__)

In this exercise, we are building a model to predict labels which are independent of one another. The scenario **Generic model template** is the correct one for this exercise.

In [21]:
scenario_id = "ccb99c7c-07c1-45f5-b51b-3e7d8b76eb0c" # Scenario ID of Generic model template
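Copying an ID by hand from the query output is error-prone. As a sketch, the ID can instead be looked up by name from the `resources` list returned by the query. This assumes that each resource exposes `name` and `id` attributes and that the scenario's `name` matches the text shown in the query output. The helper `find_id_by_name` and the stand-in objects below are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def find_id_by_name(resources: Iterable[Any], name: str) -> str:
    """Return the id of the single resource whose name matches exactly."""
    matches = [r for r in resources if getattr(r, "name", None) == name]
    if len(matches) != 1:
        raise LookupError(
            f"Expected exactly one resource named {name!r}, found {len(matches)}"
        )
    return matches[0].id


# Stand-in objects for illustration; real code would pass
# scenario_query_response.resources instead.
fake_resources = [
    SimpleNamespace(id="id-1", name="Some other template"),
    SimpleNamespace(id="id-2", name="Generic model template"),
]
print(find_id_by_name(fake_resources, "Generic model template"))
```

Raising an error when zero or several names match avoids silently training against the wrong scenario. The same helper could be applied to the list of executables, under the same assumption about attribute names.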

## Artifact Registration 

Before training the model, the uploaded dataset and dataset schema files need to be registered as artifacts.

In [22]:
from ai_api_client_sdk.models.artifact_create_response import ArtifactCreateResponse
from ai_api_client_sdk.models.artifact import Artifact

In [None]:
artifact_response: ArtifactCreateResponse = dar_ai_api_client.artifact.create(
    name="datasetschema",
    kind=Artifact.Kind.OTHER,
    url=schema_url,
    scenario_id=scenario_id,
    description="Trial test dataset schema"
)

datasetschema_artifact_id = artifact_response.id
print(f"The artifact (dataset schema) ID is {artifact_response.id}")

In [None]:
artifact_response: ArtifactCreateResponse = dar_ai_api_client.artifact.create(
    name="dataset",
    kind=Artifact.Kind.DATASET,
    url=dataset_url,
    scenario_id=scenario_id,
    description="Trial test dataset")

dataset_artifact_id = artifact_response.id
print(f"The artifact (dataset) ID is {artifact_response.id}")

## Select the Executable
Each Scenario comes with multiple Executables to do different tasks, for example a training executable or a deployment executable. You can refer to the [official documentation on Supported Executables](https://help.sap.com/docs/data-attribute-recommendation/data-attribute-recommendation/supported-executables) to learn more. 

We can also query the list of executables for the selected scenario through the DAR AI API:

In [None]:
executables = dar_ai_api_client.executable.query(scenario_id=scenario_id, version_id='3.0')
for executable in executables.resources:
    pprint(executable.__dict__)

The "Generic Training Executable" is selected as the training executable from the list.

In [26]:
training_executable_id = "40dcde13-ce0f-45cc-aac0-74da78175305"

## Create a Training Configuration



To start the training execution, we need to bring together the information we've assembled so far: the IDs of the dataset artifact and dataset schema artifact, the training executable and the desired scenario. We also have to provide a name for the model.

Generally, parameters are added using ParameterBinding objects and input artifacts are added using InputArtifactBinding objects.

*Only one model of a given name can exist. If you receive a message stating 'The model name specified is already in use', you either have to remove the training execution that created the model with that name or you have to change the `modelName` in the ParameterBinding below. You can also [clean up the entire service instance](#Cleaning-up-a-service-instance).*

In [27]:
from ai_api_client_sdk.models.input_artifact_binding import InputArtifactBinding
from ai_api_client_sdk.models.parameter_binding import ParameterBinding

In [None]:
# Create input artifact bindings
input_artifact_bindings = [
    InputArtifactBinding(key="dataset", artifact_id=dataset_artifact_id),
    InputArtifactBinding(key="datasetSchema", artifact_id=datasetschema_artifact_id)
]

# Create parameter bindings
parameter_bindings = [
    ParameterBinding(key="modelName", value="trial_model")
]

# Create the Configuration
training_configuration = dar_ai_api_client.configuration.create(
    name="trial_training_config",
    scenario_id=scenario_id,
    executable_id=training_executable_id,
    input_artifact_bindings=input_artifact_bindings,
    parameter_bindings=parameter_bindings,
)
print(f"Training Configuration ID: {training_configuration.id}")

## Create a Training Execution

In [29]:
from ai_api_client_sdk.models.execution_create_response import ExecutionCreateResponse

In [None]:
execution_response: ExecutionCreateResponse = dar_ai_api_client.execution.create(configuration_id=training_configuration.id)
print(f"Execution ID: {execution_response.id}, Status: {execution_response.status}, Message: {execution_response.message}")

## Get Training Status
The training execution is now running in the background and we can poll the execution's status until it reaches "COMPLETED".
The `execution.get()` method of the `DARAIAPIClient` returns the current status of the training execution.

In [None]:
from ai_api_client_sdk.models.execution import Execution

training_execution: Execution = dar_ai_api_client.execution.get(execution_id=execution_response.id)
print(f"The current status of the Execution {execution_response.id} is {training_execution.status}")

Repeat the above cell execution until the reported status is "COMPLETED". Once that is the case, the trained model will be listed as an output artifact of the training execution.

In [None]:
print(training_execution.output_artifacts[0].url)
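Instead of re-running the status cell by hand, the polling can be automated. The sketch below is a generic helper that repeatedly calls a function returning the current status until the target status is reached. The helper name `wait_for_status`, the default interval and timeout, and the treatment of "DEAD" as a failure status are assumptions made for illustration. Status values that are enums are compared by their `.value`.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], object],
    target: str,
    failed: Iterable[str] = ("DEAD",),  # assumed failure status
    interval_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns `target`; raise on failure or timeout."""
    failed_statuses = set(failed)
    deadline = time.monotonic() + timeout_seconds
    while True:
        raw = fetch_status()
        # Accept both plain strings and enum members with a .value attribute.
        status = str(getattr(raw, "value", raw))
        if status == target:
            return status
        if status in failed_statuses:
            raise RuntimeError(f"Reached status {status!r} while waiting for {target!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_seconds} seconds")
        sleep(interval_seconds)


# Demo with a scripted sequence of statuses and no real waiting.
scripted = iter(["PENDING", "RUNNING", "COMPLETED"])
print(wait_for_status(lambda: next(scripted), "COMPLETED", sleep=lambda _: None))
```

In this notebook, the status function could be `lambda: dar_ai_api_client.execution.get(execution_id=execution_response.id).status`. After the wait returns, fetch the execution once more with `execution.get()` so that `training_execution` contains the output artifacts.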

## Summary Exercise 01.3

In exercise 01.3, we have covered the following topics:

* How to select the appropriate Scenario and Executable
* How to register dataset and dataset schema as Artifacts
* How to configure a training execution and obtain a model artifact

# Exercise 01.4

*Back to [table of contents](#Table-of-Contents)*

*To perform this exercise, you need to execute the code in all previous exercises.*

In exercise 01.4, we will deploy the model.

## Select the Executable
The training execution has finished and the model is ready to be deployed. By deploying the model, we create a server process in the background on the Data Attribute Recommendation service which will serve inference requests.

Just like for the training execution, we need to select the appropriate Executable (this time the "Deployment Executable" instead of the "Training Executable").


In [None]:
# List the Executables
executables = dar_ai_api_client.executable.query(scenario_id=scenario_id, version_id='3.0')
for executable in executables.resources:
    pprint(executable.__dict__)

# We select the Deployment Executable
deployment_executable_id = "88d4a864-117c-43df-882a-81b490c1919d"

## Create a Deployment Configuration
The deployment configuration is assembled similarly to the training configuration that we created earlier. We use the model artifact from the training execution as an input artifact for the deployment configuration.

In [51]:
model_artifact_id = training_execution.output_artifacts[0].id

input_artifact_bindings = [
    InputArtifactBinding(key="model", artifact_id=model_artifact_id)
]

deployment_configuration = dar_ai_api_client.configuration.create(
    name="trial_deployment_config",
    scenario_id=scenario_id,
    executable_id=deployment_executable_id,
    input_artifact_bindings=input_artifact_bindings,
)
      
print(f"Deployment Configuration ID: {deployment_configuration.id}")

## Create a Deployment

In [54]:
from ai_api_client_sdk.models.deployment_create_response import DeploymentCreateResponse

In [None]:
deployment_response: DeploymentCreateResponse = dar_ai_api_client.deployment.create(configuration_id=deployment_configuration.id)
print(f"Deployment ID: {deployment_response.id}, Status: {deployment_response.status}, Message: {deployment_response.message}")

## Get Deployment Status
We can poll the API until the Deployment is in status "RUNNING". The `deployment.get()` method of the `DARAIAPIClient` returns the current status of the Deployment.

In [56]:
from ai_api_client_sdk.models.deployment import Deployment

deployment_execution: Deployment = dar_ai_api_client.deployment.get(deployment_id=deployment_response.id)
print(f"The current status of the Deployment {deployment_response.id} is {deployment_execution.status}")

Repeat the above cell execution until the reported status is "RUNNING". Once that is the case, we can extract the URL of our deployment.

In [None]:
deployment_url = deployment_execution.deployment_url
print(deployment_url)

## Summary Exercise 01.4

In exercise 01.4, we have covered the following topics:

* How to select the appropriate Executable for a Deployment
* How to configure and create a Deployment

# Exercise 01.5

*Back to [table of contents](#Table-of-Contents)*

*To perform this exercise, you need to execute the code in all previous exercises.*

In exercise 01.5, we will predict labels for some unlabeled data.

With a single inference request, we can send up to 50 objects to the service to predict the labels. The data sent to the service must match the `features` section of the dataset schema created earlier. The `labels` defined inside of the dataset schema will be predicted for each object and returned as a response to the request.

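In the SDK, the `InferenceClient` handles the submission of inference requests. Its [`InferenceClient.create_inference_request()`](https://data-attribute-recommendation-python-sdk.readthedocs.io/en/latest/api.html#sap.aibus.dar.client.inference_client.InferenceClient.create_inference_request) method is described in the SDK documentation. In this notebook, we use the `create_inference_request_with_url()` variant, which sends the request to the deployment URL obtained in exercise 01.4.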

In [77]:
from sap.aibus.dar.client.inference_client import InferenceClient

inference_client = InferenceClient.construct_from_credentials(
    dar_url=url,
    clientid=client_id,
    clientsecret=client_secret,
    uaa_url=auth_url,
)

objects_to_be_classified = [
    {
        "features": [
            {"name": "title", "value": "Not even wrong: The spurious link between biodiversity and ecosystem functioning"}
        ],
    },
]

inference_response = inference_client.create_inference_request_with_url(url=deployment_url, objects=objects_to_be_classified)
pprint(inference_response)

*Note: For trial accounts, you only have a limited number of objects which you can classify.*
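Because a single inference request accepts at most 50 objects, a longer list has to be split into batches first. The following sketch shows one way to do this. The helper name `batches` and the example titles are made up, and the object layout mirrors the `features` structure used in the cell above.

```python
from typing import Any, Dict, Iterator, List, Sequence

MAX_OBJECTS_PER_REQUEST = 50  # limit stated in this exercise


def batches(
    objects: Sequence[Dict[str, Any]], size: int = MAX_OBJECTS_PER_REQUEST
) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of `objects` with at most `size` entries each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(objects), size):
        yield list(objects[start:start + size])


# 120 made-up objects are split into batches of 50, 50 and 20.
titles = [f"Hypothetical paper title {i}" for i in range(120)]
objects = [{"features": [{"name": "title", "value": title}]} for title in titles]
print([len(batch) for batch in batches(objects)])
```

Each batch could then be sent with `inference_client.create_inference_request_with_url(url=deployment_url, objects=batch)`.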

You can also try to come up with your own example:

In [None]:
my_own_items = [
    {
        "features": [
            {"name": "title", "value": "EDIT THIS"}        
        ],
    },
]

inference_response = inference_client.create_inference_request_with_url(url=deployment_url, objects=my_own_items)
print()
print("Inference request processed. Response:")
print()
pprint(inference_response)

In some cases, the predicted category has the special value `nan`. In the `arxiv.csv` dataset, not all records have the full set of categories: some records have only one label, while others have up to three. The model learns this from the data and will occasionally suggest that a record should not have a label.
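When post-processing predictions, the `nan` placeholders can be filtered out so that only actual labels remain. The sketch below assumes that the response is a dictionary with a `predictions` list, where each prediction has a `labels` list and each label has a `name` plus a `results` list of `value` and `probability` pairs. This layout is an assumption, so compare it with the output of `pprint(inference_response)` before use. The helper `top_labels` and the sample response are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def top_labels(
    response: Dict[str, Any], drop_nan: bool = True
) -> List[List[Tuple[str, str, float]]]:
    """For each predicted object, return (label name, best value, probability)."""
    rows = []
    for prediction in response.get("predictions", []):
        row = []
        for label in prediction.get("labels", []):
            results = label.get("results", [])
            if not results:
                continue
            best = max(results, key=lambda result: result["probability"])
            # Skip the special value that stands for "no label".
            if drop_nan and str(best["value"]).lower() == "nan":
                continue
            row.append((label["name"], best["value"], best["probability"]))
        rows.append(row)
    return rows


# Made-up response in the assumed layout.
fake_response = {
    "predictions": [
        {
            "objectId": "hypothetical-1",
            "labels": [
                {"name": "label_1", "results": [{"value": "CSC", "probability": 0.91}]},
                {"name": "label_2", "results": [{"value": "nan", "probability": 0.77}]},
            ],
        }
    ]
}
print(top_labels(fake_response))
```

Passing `drop_nan=False` keeps the placeholder entries, which can be useful when inspecting how confident the model is that a record has no further label.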

## Summary Exercise 01.5

In exercise 01.5, we have covered the following topics:

* How to predict labels for some unlabeled data.

# Wrapping up AI API

In this workshop, we looked into the following topics:

* Installation of the Python SDK for Data Attribute Recommendation 
* Uploading and registering a dataset and dataset schema to the DAR AI API
* Training a model
* Deploying the trained model
* Predicting labels for unlabelled data

Using these tools, we are able to solve the problem of missing Master Data attributes starting from just a CSV file containing training data.

## Cleaning up a service instance

In this notebook, we have created several resources on the Data Attribute Recommendation Service:

* Uploaded csv and json files
* Training execution and the trained model
* Deployment

The SDK provides several methods to delete these resources. Note that artifacts and configurations cannot be deleted.

## Clean the Service Instance 
In general, if Executions or Deployments are in "PENDING" or "RUNNING" status, they need to be set to "STOPPED" before they can be deleted. This is usually the case for Deployments because their final status is "RUNNING".

### Stop and delete the deployment

In [37]:
from ai_api_client_sdk.models.target_status import TargetStatus

deployment_modify_response = dar_ai_api_client.deployment.modify(
    deployment_id=deployment_response.id,
    target_status=TargetStatus.STOPPED,
)

In [None]:
# Check deployment status
deployment_execution = dar_ai_api_client.deployment.get(deployment_id=deployment_response.id)
print(f"The current status of the Deployment {deployment_response.id} is {deployment_execution.status}")

Repeat the above cell until the deployment's status is "STOPPED", then move on to delete it.

In [None]:
deployment_deletion_response = dar_ai_api_client.deployment.delete(deployment_id=deployment_response.id)

### Delete the execution

In [40]:
execution_deletion_response = dar_ai_api_client.execution.delete(execution_id=execution_response.id)

### Delete the uploaded files

In [88]:
deletedataset_response = dar_ai_api_client.file_upload_client.delete_file(remote_path='/trial-test/arxiv.csv.gz')
deletedatasetschema_response = dar_ai_api_client.file_upload_client.delete_file(remote_path='/trial-test/schema_arxiv.json')

## List the Remaining Executions and Deployments

In [None]:
executions = dar_ai_api_client.execution.query()

for d in executions.resources:
    print(f"{d}: {d.status} ({d.target_status})")

In [None]:
deployments = dar_ai_api_client.deployment.query()
 
for d in deployments.resources:
    print(f"{d}: {d.status} ({d.target_status}). deployment URL: {d.deployment_url}")
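The rule from this section, that resources in "PENDING" or "RUNNING" status must be stopped before they can be deleted, can be applied to the listings above to plan the remaining cleanup. The sketch below groups resource IDs accordingly. It assumes that each resource exposes `id` and `status` attributes. The helper `plan_cleanup` and the stand-in objects are hypothetical, and treating every other status as deletable is a simplification.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List

ACTIVE_STATUSES = {"PENDING", "RUNNING"}  # must be stopped before deletion


def plan_cleanup(resources: Iterable[Any]) -> Dict[str, List[str]]:
    """Group resource ids by whether they need to be stopped before deletion."""
    plan: Dict[str, List[str]] = {"stop_first": [], "delete_now": []}
    for resource in resources:
        # Accept both plain strings and enum members with a .value attribute.
        status = str(getattr(resource.status, "value", resource.status))
        # Simplification: any status other than PENDING/RUNNING counts as deletable.
        bucket = "stop_first" if status in ACTIVE_STATUSES else "delete_now"
        plan[bucket].append(resource.id)
    return plan


# Stand-in objects; real code would pass deployments.resources or executions.resources.
fake_resources = [
    SimpleNamespace(id="d-1", status="RUNNING"),
    SimpleNamespace(id="e-1", status="COMPLETED"),
]
print(plan_cleanup(fake_resources))
```

IDs listed under `stop_first` would go through the stop, wait and delete steps shown above for the deployment, while those under `delete_now` could be passed directly to the corresponding `delete()` call.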