# Classification Demo

In this notebook, we will see how to prepare data for classification, upload it, start training, and run inference.

### Install the required libraries if not already installed

In [None]:
!pip install pyjwt
!pip install pandas
!pip install scikit-learn
!pip install matplotlib

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
%matplotlib inline
import matplotlib.pyplot as plt
import jwt
import time
import json

%load_ext autoreload
%autoreload 2

In [None]:
import requests
import base64

## Prepare training and test data

We have a small dataset of service request tickets from the travel industry. We will attempt to build a classifier that automatically assigns these tickets to their respective categories.

The code block below loads the data from a file.

In [None]:
df = pd.read_csv("../datasets/travel.csv")

### Let's see the data

In [None]:
df.head()

### Let's select the input and output mappings for training

The mapping describes which columns in the upload file should be used as sample input and which ones are to be used as the classification output that the model should learn.

In [None]:
input_cols = ['Description']
output_cols = ['Category']
all_cols = input_cols + output_cols

### Check the data distribution

After loading the data into a dataframe, we check the distribution of classes (the target variable).

The model usually works best if the dataset is balanced, i.e. the classes are roughly equally distributed with little skew, and each class has at least 1000 data points.

In [None]:
fig, axes = plt.subplots(nrows=len(output_cols), ncols=1, figsize=(15, 10))
for idx, output_col in enumerate(output_cols):
    distribution = df[output_col].value_counts()
    distribution.plot(kind='bar', rot=90, ax=axes[idx] if len(output_cols)!=1 else axes)
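
Besides the bar chart, a single number can help judge how balanced the classes are. The sketch below computes per-class counts and the ratio between the largest and smallest class using only the standard library. The label list is hypothetical and stands in for `df[output_col].tolist()`; the helper name `class_balance` is ours, not part of STI.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class counts and the ratio of the largest to the smallest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return dict(counts), largest / smallest

# Hypothetical labels; in the notebook this would be df[output_col].tolist()
labels = ["Booking"] * 6 + ["Refund"] * 3 + ["Baggage"] * 1
counts, imbalance_ratio = class_balance(labels)
print(counts)
print("Imbalance ratio (largest / smallest):", imbalance_ratio)
```

A ratio close to 1 indicates a balanced dataset; a large ratio suggests some classes may be under-represented.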

## Training, test data split

We split the loaded data into two sets:
1. Training data, which we upload and use to train the model
2. Test data, which we use to test the generated model

In [None]:
df_train, df_test = train_test_split(df[all_cols], test_size = 0.01, shuffle=True)

In [None]:
for output_col in output_cols:
    print(output_col)
    print("\nTraining Data:")
    print(df_train[output_col].value_counts())
    print("\nTesting Data:")
    print(df_test[output_col].value_counts())
    print("\n \n")

In [None]:
fig, axes = plt.subplots(nrows=len(output_cols), ncols=2, figsize=(20, 10))
for idx, output_col in enumerate(output_cols):
    df_train[output_col].value_counts().plot(kind='bar', rot=90, title='Training data', 
                                             ax=axes[idx][0] if len(output_cols)!=1 else axes[0] )
    df_test[output_col].value_counts().plot(kind='bar', rot=90, title='Test Data',
                                           ax=axes[idx][1] if len(output_cols)!=1 else axes[1])
fig.tight_layout()
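
The split above is purely random, so with a very small test fraction a rare class may not appear in the test set at all. A stratified split keeps the class proportions the same in both sets; scikit-learn's `train_test_split` offers a `stratify` argument for this. The sketch below only illustrates the idea in plain Python. The function name `stratified_split` and the sample labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with about test_fraction of each class in the test set."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take at least one test sample per class (an illustrative choice)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 80 samples of class "A", 20 of class "B"
labels = ["A"] * 80 + ["B"] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.1)
print(len(train_idx), len(test_idx))
```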

# STI REST Endpoints

The STI service can be accessed and controlled through REST endpoints.
Documentation is available at: https://help.sap.com/viewer/product/SERVICE_TICKET_INTELLIGENCE

## Subscription and Authentication

Now we are ready to train a model using the Service Ticket Intelligence API. This requires a valid subscription to the STI API.

Note: Update the values for `service url`, `uaa url`, `client id` and `client secret` in the config file `sti_config.ini`, which is located one directory above this notebook. These values are available in the `service_keys` of your STI instance in the Cloud Foundry cockpit.
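
Before connecting, it can be useful to check that none of the four values has been left empty. The sketch below does this with `configparser`. The section name `sti` and the key names are assumptions made for illustration; the real layout is whatever `sti_functions.get_connection_object` expects, so adjust the names to match your `sti_config.ini`.

```python
import configparser

# Hypothetical layout: the section and key names are assumptions,
# not taken from sti_functions.py.
EXAMPLE_CONFIG = """
[sti]
service_url = https://example.invalid/sti
uaa_url = https://example.invalid/uaa
client_id = my-client-id
client_secret = my-client-secret
"""

REQUIRED_KEYS = ("service_url", "uaa_url", "client_id", "client_secret")

def missing_keys(config_text, section="sti"):
    """Return the required keys that are absent or empty in the given section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(section):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS
            if not parser.get(section, key, fallback="").strip()]

print("Missing keys:", missing_keys(EXAMPLE_CONFIG))
```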

Now we will use functions from `sti_functions.py` to access STI's REST endpoints. Feel free to browse its source code to see what's happening under the hood.

In [None]:
import configparser
from pathlib import Path
import sys

sys.path.append("..")
import sti_functions

In [None]:
# import importlib
# importlib.reload(sti_functions)

In [None]:
STI_BASE_DIR = Path.cwd().parent
config_file_path = STI_BASE_DIR / 'sti_config.ini'

connection = sti_functions.get_connection_object(config_file=config_file_path)
sti = sti_functions.STIFunctions(connection)

## List models

Now let's call the `list_models` function to view all the models in this account.

In [None]:
sti.list_models()

### Let's check if we need to delete any unused models
Based on the model list above, ensure that the number of models does not exceed 20. Otherwise, we need to delete some unused models.

In [None]:
# sti.delete_model("8c99a13d405948de82e9ccdf4f9ada17")

## File upload

This process will take a few minutes to complete, depending on the file size. If the file upload is successful, the response will contain a model id: a UUID that we can use as a reference to the uploaded training file.

In [None]:
df_train_base64 = base64.b64encode(df_train.to_csv(index=False).encode('utf-8'))
payload = {
    "scenario":
      {
          "desc":"Travel data for classification",
          "type":"classification",
          "language":"en",
          "business_object":"ticket"
      },
      "mapping":
      {
            "input": input_cols,
            "output": output_cols
      },
      "training":
      {
            "file": "{}".format(df_train_base64.decode('utf-8'))
      }
}
response = sti.file_upload(payload)
payload = {}
our_model_id = response.get('model_id')
response
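
The training file is sent as a base64-encoded CSV string. As a sanity check, it can help to confirm that the encoded string decodes back to the original rows, including fields that contain commas. The sketch below shows this round trip with the standard library only; the two sample rows and the helper names `encode_csv` and `decode_csv` are hypothetical.

```python
import base64
import csv
import io

def encode_csv(rows, columns):
    """Serialize rows (a list of dicts) to CSV and return it as a base64 string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return base64.b64encode(buffer.getvalue().encode("utf-8")).decode("utf-8")

def decode_csv(encoded):
    """Decode a base64 CSV string back into a list of dicts."""
    text = base64.b64decode(encoded).decode("utf-8")
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

# Hypothetical sample rows; note the comma inside the first description
rows = [
    {"Description": "My flight was cancelled, please refund", "Category": "Refund"},
    {"Description": "Need to change my seat", "Category": "Booking"},
]
encoded = encode_csv(rows, ["Description", "Category"])
print("Round trip OK:", decode_csv(encoded) == rows)
```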

Note that the model status is `NEW` at this point. Once we submit training, the `model_status` will transition from `NEW` -> `PENDING_TRAINING` -> `IN_TRAINING` -> `READY`.

## Start training on uploaded file

Take the model id from the file upload response and pass it when starting the model training.

In [None]:
# our_model_id = "a33807253a204ba5a2f6192a45b727d6"
sti.start_model_training(model_id=our_model_id)

## Wait for training to succeed

After starting the model training, request the model status and check whether it is `READY`.

The model status transitions from `NEW` to `PENDING_TRAINING` once training is submitted, then to `IN_TRAINING`, and finally to `READY` when training succeeds.

In [None]:
# our_model_id = "cc078a539d6a433a92f0ac0a2fb445d2"
status = sti.get_model_status(model_id=our_model_id)
print("Model status: {}".format(status.get('model_status')))

Wait for the model status to be `READY` before proceeding to the next step. This can take up to 10-20 minutes from the time training was submitted. Re-run the above cell to get the latest model status.

Once the model status is `READY`, proceed to the next step.
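
Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below is one way to do that; `wait_until_ready` is a hypothetical helper, not part of `sti_functions`. It takes any callable that returns the current status string, so it is demonstrated here with a fake status sequence. This notebook does not list a failure status, so the sketch only stops on `READY` or on a timeout.

```python
import time

def wait_until_ready(get_status, interval_seconds=60, timeout_seconds=1800,
                     sleep=time.sleep):
    """Call get_status() repeatedly until it returns 'READY' or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status == "READY":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                "Model not READY after {} seconds (last status: {})".format(waited, status))
        sleep(interval_seconds)
        waited += interval_seconds

# Fake status source for demonstration. In the notebook this could be:
#   lambda: sti.get_model_status(model_id=our_model_id).get("model_status")
fake_statuses = iter(["PENDING_TRAINING", "IN_TRAINING", "READY"])
final_status = wait_until_ready(lambda: next(fake_statuses), sleep=lambda seconds: None)
print("Model status:", final_status)
```

The `sleep` parameter is injectable only so the example runs instantly; with the default `time.sleep`, the loop waits `interval_seconds` between checks.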

## Model accuracy

The model accuracy, confusion matrix and other metrics (such as F1 score, precision and recall) can be retrieved once training is completed and the status becomes `READY`.

In [None]:
# our_model_id = "cc078a539d6a433a92f0ac0a2fb445d2"
status = sti.get_model_status(model_id=our_model_id)
print("Model combined accuracy:", status["combined_accuracy"])

In [None]:
# our_model_id = "cc078a539d6a433a92f0ac0a2fb445d2"
accuracy = sti.get_model_accuracy(model_id=our_model_id)
for result in accuracy["validation_results"]:
    print("\nField:", result["field"])
    print("Model average f1 score:", result["average_f1_score"])
    print("Model average precision:", result["average_precision"])
    print("Model average recall:", result["average_recall"])

We can also plot the confusion matrix to see the performance of the model visually.

In [None]:
sti.plot_confusion_matrix(model_id=our_model_id)
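
To relate the confusion matrix to the precision, recall and F1 values reported above, the sketch below derives these per-class metrics from a small matrix. It assumes the scikit-learn convention where rows are true classes and columns are predicted classes; whether STI's matrix uses the same orientation is not confirmed here. The matrix values and the function name `per_class_metrics` are hypothetical.

```python
def per_class_metrics(matrix, labels):
    """Compute precision, recall and F1 per class.

    matrix[i][j] is the number of samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column sum
        actually_k = sum(matrix[k])                     # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Hypothetical two-class confusion matrix
labels = ["Booking", "Refund"]
matrix = [[8, 2],
          [1, 9]]
metrics = per_class_metrics(matrix, labels)
for label in labels:
    print(label, metrics[label])
```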

## Activate the model

Once you are satisfied with the results, the model needs to be activated before inference can be run on it.

In [None]:
sti.activate_model(model_id=our_model_id)

## Build inference payload and send request

We will select an example from our `df_test` dataframe, which was not used for training, and evaluate how the model performs.

In [None]:
df_test.iloc[8]

In [None]:
payload = {}
payload["business_object"] = "ticket"
payload["messages"] = [{"id": 2001, "contents": []}]
for input_col in input_cols:
    payload["messages"][0]['contents'].append({"field": input_col, "value": df_test.iloc[8][input_col]})
    
inference_response = sti.classify_text(payload)
inference_response

You can experiment by passing different inputs from `df_test`, or your own input, and see how the model performs.

## Let's evaluate the STI model performance ourselves

We can also run inference against all the data in `df_test` and evaluate for ourselves how the STI model performs. We will compare the results from STI against the original values in `df_test`.

In [None]:
payload = {
    "business_object": "ticket",
    "messages": []
}
for index, row in df_test.iterrows():
    tmp = {'id': index, 'contents': []}
    for input_col in input_cols:
        tmp['contents'].append({"field": input_col, "value": row[input_col]})

    payload['messages'].append(tmp)

inference_response = sti.classify_text(payload)
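
The cell above sends all test rows in a single request. For a larger test set you may prefer to split the messages into several smaller requests. The sketch below builds the same payload structure in batches. Whether the API limits the number of messages per request is not stated in this notebook, so the batch size is arbitrary, and `build_payloads` is a hypothetical helper.

```python
def build_payloads(rows, input_cols, batch_size=50, business_object="ticket"):
    """Yield inference payloads with at most batch_size messages each.

    rows is an iterable of (row_id, row) pairs, where row maps column names
    to values (for example, the output of df_test.iterrows()).
    """
    messages = [
        {"id": row_id,
         "contents": [{"field": col, "value": row[col]} for col in input_cols]}
        for row_id, row in rows
    ]
    for start in range(0, len(messages), batch_size):
        yield {"business_object": business_object,
               "messages": messages[start:start + batch_size]}

# Hypothetical rows standing in for df_test.iterrows()
sample_rows = [(i, {"Description": "ticket text {}".format(i)}) for i in range(5)]
payloads = list(build_payloads(sample_rows, ["Description"], batch_size=2))
print("Number of requests:", len(payloads))
print("Messages per request:", [len(p["messages"]) for p in payloads])
```

Each payload could then be passed to `sti.classify_text` and the `results` lists concatenated in order.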

In [None]:
inference_response

In [None]:
from sklearn.metrics import classification_report
y_true_collection = []
y_pred_collection = []
for idx, output_col in enumerate(output_cols):
    y_true = df_test[output_col].tolist()
    y_pred = [classification['classification'][idx]['value'] for classification in inference_response['results']]
    assert len(y_true) == len(y_pred)
    y_true_collection.append(y_true)
    y_pred_collection.append(y_pred)
    print(classification_report(y_true, y_pred))

In [None]:
model_results = sti.get_model_accuracy(model_id=our_model_id)
for idx, (y_true, y_pred) in enumerate(zip(y_true_collection, y_pred_collection)):
    fig = plt.figure(figsize=(20,20))
    cnf_mtrx = confusion_matrix(y_true, y_pred)
    sti_functions.plot(cnf_mtrx, 
                       classes=model_results["validation_results"][idx]["confusion_matrix"]["labels"], 
                       title='Confusion matrix')

# Using STI's pre-trained models

Apart from building custom models with your own data, STI also provides pre-trained models for sentiment analysis and language detection.

## Language Detection

You can use this to detect the language of a text.
The ISO language code of each message's content is returned in the response.

In [None]:
payload = {
    "business_object": "ticket",
    "messages": [
        {
            'id': 2001,
            'contents': [
                {
                    'field': input_cols[0],
                    'value': "I don't like your service"
                }
            ]
        },
        {
            'id': 2002,
            'contents': [
                {
                    'field': input_cols[0],
                    'value': 'Ich mag deinen Service nicht'
                }
            ]
        }
    ],
    "options": {
        "services": {
            "detect_language": True
        }
    }
}
inference_response = sti.classify_text(payload)
inference_response

## Sentiment Analysis

This service returns a sentiment score for the input content, ranging from -1 to 1. A highly negative sentiment has a score close to -1 and a highly positive sentiment has a score close to +1. A score around 0 may denote a neutral phrase.

In [None]:
payload = {
    "business_object": "ticket",
    "messages": [
        {
            'id': 2001,
            'contents': [
                {
                    'field': input_cols[0],
                    'value': "I don't like your service"
                }
            ]
        },
        {
            'id': 2002,
            'contents': [
                {
                    'field': input_cols[0],
                    'value': 'Ich mag deinen Service nicht'
                }
            ]
        }
    ],
    "options": {
        "services": {
            "detect_sentiment": True
        }
    }
}
inference_response = sti.classify_text(payload)
inference_response
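
For reporting, it is sometimes convenient to turn the numeric score into a coarse label. The sketch below does this with a symmetric neutral band around 0. The band width of 0.25 is an arbitrary choice for illustration, not something defined by STI, and `sentiment_label` is a hypothetical helper.

```python
def sentiment_label(score, neutral_band=0.25):
    """Map a sentiment score in [-1, 1] to 'negative', 'neutral' or 'positive'."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be between -1 and 1, got {}".format(score))
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

# Hypothetical scores, not actual STI output
for score in (-0.9, -0.1, 0.0, 0.6):
    print(score, "->", sentiment_label(score))
```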

## Multiple services in the same request

You can combine category classification, sentiment analysis and language detection in a single inference request.

In [None]:
payload = {
    "business_object": "ticket",
    "messages": [
        {
            'id': 2001,
            'contents': [
                {
                    'field': input_cols[0],
                    'value': "I don't like your service"
                }
            ]
        }
    ],
    "options": {
        "services": {
            "detect_category": True,
            "detect_sentiment": True,
            "detect_language": True,
        }
    }
}
inference_response = sti.classify_text(payload)
inference_response

## Deactivate model

We can deactivate any active models here.

In [None]:
# sti.deactivate_model(model_id="")

## Delete model

We can delete any unused models here.

In [None]:
# sti.delete_model(model_id="")