# Clustering Demo

In this notebook, we prepare data for clustering, upload it, start training, and run inference.

### Install dependencies if not already installed

In [None]:
!pip install pyjwt
!pip install circlify
!pip install colorspacious
!pip install matplotlib
!pip install squarify

In [None]:
import os
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import jwt
import requests
import base64
import json

### Load and prepare data

We have a small dataset of mixed content. The dataset contains topic labels, but we take an unsupervised learning approach. The labels are ignored during training; we use them only during inference, to evaluate the content of each cluster.

The code block below loads the data from file.

In [None]:
df = pd.read_csv("../datasets/clustering_train.csv")

### Let's see the data

In [None]:
pd.set_option('display.max_colwidth', None)
df.head()
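Because the labels are only used to evaluate clusters after the fact, one simple check is the share of the most common label in each cluster. The following is a minimal sketch using toy data; the function name `majority_label_share` and the `(cluster_id, label)` pair format are assumptions made for illustration and are not part of the STI API or of `clustering_functions`.

```python
from collections import Counter, defaultdict


def majority_label_share(pairs):
    """Summarize clusters from an iterable of (cluster_id, label) pairs.

    Returns a dict mapping each cluster id to its size, its most common
    label, and the fraction of members carrying that label ("purity").
    """
    by_cluster = defaultdict(Counter)
    for cluster_id, label in pairs:
        by_cluster[cluster_id][label] += 1

    summary = {}
    for cluster_id, counts in by_cluster.items():
        top_label, top_count = counts.most_common(1)[0]
        size = sum(counts.values())
        summary[cluster_id] = {
            "size": size,
            "top_label": top_label,
            "purity": top_count / size,
        }
    return summary


# Toy assignments (hypothetical cluster ids, not real service output)
toy_pairs = [
    (0, "crypto"), (0, "crypto"), (0, "sports"),
    (1, "sports"), (1, "sports"),
]
for cluster_id, stats in sorted(majority_label_share(toy_pairs).items()):
    print(cluster_id, stats)
```

A purity close to 1.0 suggests a cluster is dominated by one topic; a low value suggests mixed content.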

### Let's select the input and output mappings for training

The mapping describes which columns in the upload file should be used as sample input and which ones are to be saved and retrieved during inference.

In [None]:
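# Columns used as sample input for training
input_cols = ["title", "content"]
# Columns saved with the model and returned during inference
output_cols = ["id", "title", "label"]
# Union of both lists, keeping order and dropping the duplicate "title"
all_cols = list(dict.fromkeys(input_cols + output_cols))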
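Since the mapping refers to columns by name, it can be worth checking that every mapped column exists in the file before uploading. This is a minimal sketch; `missing_columns` is a hypothetical helper, and the list of available columns below is toy data standing in for `df.columns`.

```python
def missing_columns(available, input_cols, output_cols):
    """Return mapped column names that are absent from `available`."""
    available = set(available)
    # De-duplicate while keeping the order in which columns were requested
    wanted = list(dict.fromkeys(list(input_cols) + list(output_cols)))
    return [col for col in wanted if col not in available]


# Toy column list standing in for df.columns
available = ["id", "title", "content", "label"]

print(missing_columns(available, ["title", "content"], ["id", "title", "label"]))
# -> []
print(missing_columns(available, ["title", "body"], ["id"]))
# -> ['body']
```

In the notebook, the same check could be run as `missing_columns(df.columns, input_cols, output_cols)`.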

# STI REST Endpoints

The STI service can be accessed and controlled through REST endpoints.
Documentation is available at the following link: https://help.sap.com/viewer/product/SERVICE_TICKET_INTELLIGENCE

## Subscription and Authentication

Now we are ready to train a model using the Service Ticket Intelligence API. This requires a valid subscription to the STI API.

Note: Download the service key for STI and save it in the project root as `default_key.json`, one directory above this notebook. The key values are available under `service_keys` of your STI instance in the Cloud Foundry cockpit.

In [None]:
import configparser
from pathlib import Path
import sys

sys.path.append("..")
import clustering_functions 

In [None]:
 # import importlib
 # importlib.reload(clustering_functions)

In [None]:
STI_BASE_DIR = Path.cwd().parent
config_file_path = STI_BASE_DIR / 'default_key.json'

connection = clustering_functions.get_connection_object(config_file=config_file_path)
sti = clustering_functions.STIFunctions(connection)
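If the connection step fails, a quick first check is whether the key file exists and parses as JSON. The sketch below is an assumption-laden illustration: `load_service_key` is a hypothetical helper (not part of `clustering_functions`), and the `url` field in the demo file is invented for the example; the real service key fields may differ.

```python
import json
import tempfile
from pathlib import Path


def load_service_key(path):
    """Load a service key file and return its contents as a dict."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Service key not found: {path}")
    with path.open(encoding="utf-8") as fh:
        key = json.load(fh)
    if not isinstance(key, dict):
        raise ValueError("Service key must be a JSON object")
    return key


# Demo with a throwaway file; the "url" field is purely illustrative
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "default_key.json"
    demo_path.write_text(json.dumps({"url": "https://example.invalid"}), encoding="utf-8")
    demo_key = load_service_key(demo_path)

print(sorted(demo_key))
```

In the notebook, `load_service_key(config_file_path)` would raise a clear error if `default_key.json` is missing or malformed.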

## List models

Now let's call the list-models function to view all the models in this account.

In [None]:
sti.list_models()

## File upload

This process can take a few minutes, depending on the file size. If the upload succeeds, the response contains a model ID, a UUID that we can use to refer to the uploaded training file.

In [None]:

payload = {
    "scenario": {
        "desc": "testing data for clustering",
        "type": "clustering",
        "language": "en",
        "business_object": "ticket",
    },
    "mapping": {
        "input": input_cols,
        "output": output_cols
    },
    "training": {
        # Base64-encode the CSV text so it can be embedded in the JSON payload
        "file": base64.b64encode(df.to_csv(index=False).encode("utf-8")).decode("utf-8")
    },
}

response = sti.file_upload(payload)
our_model_id = response.get("model_id")
response
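The training file is sent as Base64 text inside the JSON payload. The following minimal sketch shows the encoding step on its own and confirms that it is reversible; the helper names and the sample CSV text are assumptions for illustration only.

```python
import base64


def encode_csv_text(csv_text: str) -> str:
    """Encode CSV text as a Base64 string suitable for a JSON field."""
    return base64.b64encode(csv_text.encode("utf-8")).decode("utf-8")


def decode_csv_text(encoded: str) -> str:
    """Reverse encode_csv_text, recovering the original CSV text."""
    return base64.b64decode(encoded).decode("utf-8")


# Toy CSV content standing in for df.to_csv(index=False)
csv_text = "id,title,content,label\n1,Hello,Some text,crypto\n"
encoded = encode_csv_text(csv_text)

# The round trip must give back exactly what was encoded
assert decode_csv_text(encoded) == csv_text
print(len(csv_text), len(encoded))
```

Base64 output is roughly one third larger than its input, which is worth keeping in mind for large training files.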

## Start training on uploaded file

Take the model ID from the file upload response and pass it when starting the model training.

In [None]:
# our_model_id = '763f5e0f9ec5484191dad6540ac30814'

sti.start_model_training(model_id=our_model_id)

## Wait for training to succeed

After starting the model training, query the model status and check whether it is `READY`.

The model status transitions from `NEW` to `PENDING_TRAINING` once training is submitted, then to `IN_TRAINING`, and finally to `READY` when training succeeds.

Training can take 10-20 minutes from the time it is submitted. Run the cell below repeatedly to get the latest model status, and proceed to the next step only once the status is `READY`.

In [None]:
# our_model_id = "2fc0bb96169741b5b2950354210961a8"

status = sti.get_model_status(model_id=our_model_id)
status["model_status"]
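Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. This is a minimal sketch under stated assumptions: `wait_for_status` is a hypothetical helper, the interval and timeout defaults are arbitrary, and the demo uses a fake status source rather than the real service. Only the status names come from the description above.

```python
import time


def wait_for_status(get_status, target="READY", interval_s=30.0,
                    timeout_s=1800.0, sleep=time.sleep):
    """Poll get_status() until it returns `target` or the timeout elapses.

    get_status is any zero-argument callable returning the status string.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    waited = 0.0
    while True:
        current = get_status()
        if current == target:
            return current
        if waited >= timeout_s:
            raise TimeoutError(
                f"Status is still {current!r} after {waited:.0f} seconds"
            )
        sleep(interval_s)
        waited += interval_s


# Demo with a fake status sequence and a no-op sleep
fake_statuses = iter(["NEW", "PENDING_TRAINING", "IN_TRAINING", "READY"])
final = wait_for_status(lambda: next(fake_statuses), interval_s=0.0,
                        sleep=lambda seconds: None)
print(final)
```

In the notebook, this could be called as `wait_for_status(lambda: sti.get_model_status(model_id=our_model_id)["model_status"])`. The timeout guards against waiting forever if training never reaches `READY`.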

## Activate the model

Once model training is complete, the model needs to be activated before inference can be run on it.

In [None]:
# our_model_id = "2fc0bb96169741b5b2950354210961a8"

status = sti.activate_model(model_id=our_model_id)
status

## Let's send some inference requests

Retrieve all the clusters in the training dataset. They have been saved together with the model during training.

In [None]:
inference_payload = {}
    
inference_response = sti.clustering(data_payload=inference_payload)
len(inference_response["en"]["clusters"])
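The cell above reads the cluster list under the `"en"` key. A small helper can report the cluster count for every language present in a response. This is a sketch that assumes the response is a dict keyed by language code, each holding a `"clusters"` collection, as the indexing above suggests; `clusters_per_language` and the toy response are hypothetical.

```python
def clusters_per_language(response):
    """Return {language: number of clusters} for a clustering response."""
    counts = {}
    for lang, body in response.items():
        # Skip any top-level entries that do not carry a cluster list
        if isinstance(body, dict) and "clusters" in body:
            counts[lang] = len(body["clusters"])
    return counts


# Toy response; real cluster entries will contain actual data
toy_response = {"en": {"clusters": [{}, {}, {}]}}
print(clusters_per_language(toy_response))
# -> {'en': 3}
```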

Retrieve clusters, limiting the result with the `top_k_clusters` option.

In [None]:
inference_payload = {
    "options": {
        "top_k_clusters": 10
    }
}

inference_response = sti.clustering(data_payload=inference_payload)
len(inference_response["en"]["clusters"])

Retrieve clusters filtered with the `cluster_groupby` option, here by the value of the `label` column.

In [None]:
inference_payload = {
    "options": {
        "language": "en",
        "cluster_groupby": {
            "column": "label",
            "value": ["crypto"]
        }
    }
}

inference_response = sti.clustering(data_payload=inference_payload)
len(inference_response["en"]["clusters"])
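The three inference payloads above differ only in their `options`. A small builder can keep them consistent. This is a minimal sketch: `build_clustering_payload` is a hypothetical helper, and whether the service accepts several options in a single request is an assumption that is not confirmed here. The option keys themselves are the ones used in the cells above.

```python
def build_clustering_payload(top_k=None, groupby_column=None,
                             groupby_values=None, language=None):
    """Build an inference payload from optional filter settings."""
    options = {}
    if language is not None:
        options["language"] = language
    if top_k is not None:
        if top_k <= 0:
            raise ValueError("top_k must be a positive integer")
        options["top_k_clusters"] = top_k
    if groupby_column is not None:
        options["cluster_groupby"] = {
            "column": groupby_column,
            "value": list(groupby_values or []),
        }
    # With no options set, fall back to the empty payload used for "all clusters"
    return {"options": options} if options else {}


print(build_clustering_payload())
# -> {}
print(build_clustering_payload(top_k=10))
# -> {'options': {'top_k_clusters': 10}}
print(build_clustering_payload(language="en", groupby_column="label",
                               groupby_values=["crypto"]))
```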

## Visualize the clusters

In [None]:
clusters = inference_response.copy()
sti.clustering_plot_treemap(clusters=clusters, lang="en", top_k=50)

In [None]:
clusters = inference_response.copy()
sti.clustering_plot_circlepacking(clusters=clusters, lang="en", top_k=50)

## Deactivate model

We can deactivate any active models here.

In [None]:
#sti.deactivate_model(model_id="")

## Delete model

We can delete any unused models here.

In [None]:
#sti.delete_model(model_id="")