## Text Classification - Emotion Detection 

This sample shows how to fine tune a model to detect emotions using the emotion dataset and deploy it to an online endpoint for real time inference. The model is trained on a small sample of the dataset for a small number of epochs to illustrate the fine tuning approach.

### Training data
We will use the [emotion](https://huggingface.co/datasets/dair-ai/emotion) dataset, a copy of which is available in the `azureml` system registry for easy access.

### Model
Models that can perform the `fill-mask` task are generally good candidates to fine tune for `text-classification`. You can browse models of the `fill-mask` type in the Model Catalog and pick one. If you opened this notebook from a specific model card, copy and paste the model `Asset ID`. Optionally, if you need to fine tune a model that is available on HuggingFace but not in the `azureml` system registry, you can either [import](https://github.com/Azure/azureml-examples) the model or use the `huggingface_id` parameter to use a model directly from HuggingFace. [Learn more]().

### Outline
* Set up prerequisites such as compute.
* Pick a model to fine tune.
* Pick and explore training data.
* Configure the fine tuning job.
* Run the fine tuning job.
* Register the fine tuned model. 
* Deploy the fine tuned model for real time inference.
* Clean up resources. 



### 1. Set up prerequisites
* Install dependencies.
* Connect to the AzureML workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace `<WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` below.
* Connect to the `azureml` system registry.
* Set an optional experiment name.
* Check or create compute. A single GPU node can have multiple GPU cards. For example, one node of `Standard_ND40rs_v2` has 8 NVIDIA V100 GPUs, while one node of `Standard_NC12s_v3` has 2 NVIDIA V100 GPUs. Refer to the [docs](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) for this information. The number of GPU cards per node is set in the `gpus_per_node` parameter below. Setting this value correctly ensures that all GPUs in the node are used. The recommended GPU compute SKUs are listed [here](https://learn.microsoft.com/en-us/azure/virtual-machines/ncv3-series) and [here](https://learn.microsoft.com/en-us/azure/virtual-machines/ndv2-series).

In [64]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential, ClientSecretCredential
from azure.ai.ml.entities import AmlCompute
import time

try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    credential = InteractiveBrowserCredential()

workspace_ml_client = MLClient(
        credential,
        subscription_id =  "21d8f407-c4c4-452e-87a4-e609bfb86248", #"<SUBSCRIPTION_ID>"
        resource_group_name =  "rg-contoso-819prod", #"<RESOURCE_GROUP>",
        workspace_name =  "mlw-contoso-819prod", #"<WORKSPACE_NAME>",
)
registry_ml_client = MLClient(credential, registry_name="azureml-preview")

experiment_name = "text-classification-emotion-detection"

# If you already have a GPU cluster, set its name here. Otherwise, a new cluster named 'gpu-cluster-big' will be created.
compute_cluster = "gpu-cluster-big"
try:
    workspace_ml_client.compute.get(compute_cluster)
except Exception as ex:
    compute = AmlCompute(
        name = compute_cluster, 
        size= "Standard_ND40rs_v2",
        max_instances= 2 # For multi node training set this to an integer value more than 1
    )
    workspace_ml_client.compute.begin_create_or_update(compute).wait()

# Number of GPUs in a single node of the cluster's VM size.
# Setting this to less than the number of GPUs will leave GPUs underutilized and lengthen training.
# Setting this to more than the number of GPUs will result in an error.
# Note: 2 matches `Standard_NC12s_v3`; a `Standard_ND40rs_v2` node has 8 GPUs.
gpus_per_node = 2 

# generating a unique timestamp that can be used for names and versions that need to be unique
timestamp = str(int(time.time())) 
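
The `gpus_per_node` value in the cell above has to agree with the VM size of the cluster. One way to keep the two in sync is a small lookup, sketched below. The helper name `gpus_for_vm_size` is hypothetical (it is not part of the AzureML SDK), and the table holds only the two VM sizes quoted in this notebook; extend it from the Azure VM size docs as needed.

```python
# Hypothetical helper, not part of the AzureML SDK.
# GPU counts are the two examples quoted in this notebook.
GPUS_PER_VM_SIZE = {
    "Standard_ND40rs_v2": 8,
    "Standard_NC12s_v3": 2,
}


def gpus_for_vm_size(vm_size: str) -> int:
    """Return the number of GPUs in one node of the given VM size."""
    try:
        return GPUS_PER_VM_SIZE[vm_size]
    except KeyError:
        raise ValueError(
            f"Unknown VM size {vm_size!r}; add it to GPUS_PER_VM_SIZE"
        ) from None


print(gpus_for_vm_size("Standard_ND40rs_v2"))
```

With a helper like this, `gpus_per_node` could be derived from the same string that is passed as `size` to `AmlCompute`, instead of being typed in separately.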


### 2. Pick a model to fine tune

Models that support `fill-mask` tasks are good candidates to fine tune for `text-classification`. You can browse these models in the Model Catalog in the AzureML Studio, filtering by the `fill-mask` task. In this example, we use the `bert-base-uncased` model. If you have opened this notebook for a different model, replace the model name and version accordingly. 

Note the `id` property of the model; it is passed as input to the fine tuning job. The same value is shown as the `Asset ID` field on the model details page in the AzureML Studio Model Catalog. 

In [65]:
model_name = "bert-base-uncased"
model_version = "3"
foundation_model=registry_ml_client.models.get(model_name, model_version)
print ("\n\nUsing model name: {0}, version: {1}, id: {2} for fine tuning".format(foundation_model.name, foundation_model.version, foundation_model.id))



Using model name: bert-base-uncased, version: 3, id: azureml://registries/azureml-preview/models/bert-base-uncased/versions/3 for fine tuning
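
If you start from an `Asset ID` copied from a model card, you need the model name and version to call `models.get`. The sketch below splits an id of the shape printed above into its parts. The pattern is inferred from that one printed id and is an assumption, not a documented contract; `parse_model_asset_id` is a hypothetical helper.

```python
import re

# Pattern inferred from the id printed in this notebook; treat it as an assumption.
ASSET_ID_PATTERN = re.compile(
    r"^azureml://registries/(?P<registry>[^/]+)"
    r"/models/(?P<name>[^/]+)"
    r"/versions/(?P<version>[^/]+)$"
)


def parse_model_asset_id(asset_id: str) -> dict:
    """Split a registry model asset id into registry, name and version."""
    match = ASSET_ID_PATTERN.match(asset_id)
    if match is None:
        raise ValueError(f"Unrecognized asset id: {asset_id!r}")
    return match.groupdict()


parts = parse_model_asset_id(
    "azureml://registries/azureml-preview/models/bert-base-uncased/versions/3"
)
print(parts)
```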


### 3. Pick the dataset to use for fine-tuning the model

A copy of the emotion dataset is available in the [emotion-dataset](./emotion-dataset/) folder. The next few cells show basic data preparation for fine tuning:
* Visualize some data rows
* Replace numerical categories in data with the actual string labels. This mapping is available in [./emotion-dataset/label.json](./emotion-dataset/label.json).
* We want this sample to run quickly, so save smaller `train`, `validation` and `test` files containing 10% of the original. This means the fine tuned model will have lower accuracy, hence it should not be put to real-world use. 

In [66]:
# load the ./emotion-dataset/train.jsonl file into a pandas dataframe and show the first 5 rows
import pandas as pd
df = pd.read_json("./emotion-dataset/train.jsonl", lines=True)
df.head()


Unnamed: 0,text,label
0,i didnt feel humiliated,0
1,i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake,0
2,im grabbing a minute to post i feel greedy wrong,3
3,i am ever feeling nostalgic about the fireplace i will know that it is still on the property,2
4,i am feeling grouchy,3


In [67]:
# load the id2label json element of the ./emotion-dataset/label.json file into pandas table with keys as 'label' column of int64 type and values as 'label_string' column as string type
import json
with open("./emotion-dataset/label.json") as f:
    id2label = json.load(f)
    id2label = id2label['id2label']
    label_df = pd.DataFrame.from_dict(id2label, orient='index', columns=['label_string'])
    label_df['label'] = label_df.index.astype('int64')
    label_df = label_df[['label', 'label_string']]
label_df.head()

Unnamed: 0,label,label_string
0,0,anger
1,1,fear
2,2,joy
3,3,love
4,4,sadness


In [68]:
# load test.jsonl, train.jsonl and validation.jsonl from the ./emotion-dataset folder into pandas dataframes
test_df = pd.read_json("./emotion-dataset/test.jsonl", lines=True)
train_df = pd.read_json("./emotion-dataset/train.jsonl", lines=True)
validation_df = pd.read_json("./emotion-dataset/validation.jsonl", lines=True)
# join the train, validation and test dataframes with the id2label dataframe to get the label_string column
train_df = train_df.merge(label_df, on='label', how='left')
validation_df = validation_df.merge(label_df, on='label', how='left')
test_df = test_df.merge(label_df, on='label', how='left')
# show the first 5 rows of the train dataframe
train_df.head()

Unnamed: 0,text,label,label_string
0,i didnt feel humiliated,0,anger
1,i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake,0,anger
2,im grabbing a minute to post i feel greedy wrong,3,love
3,i am ever feeling nostalgic about the fireplace i will know that it is still on the property,2,joy
4,i am feeling grouchy,3,love
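
Because the merge above uses `how='left'`, a row whose `label` has no entry in `label.json` is kept but ends up with an empty `label_string`. The sketch below shows one way to detect such labels before training. It is pandas-free for brevity, `find_unmapped_labels` is a hypothetical helper, and the record with label `5` is made up to trigger the check.

```python
def find_unmapped_labels(records, id2label):
    """Return the sorted label ids present in records but absent from id2label."""
    known_ids = {int(key) for key in id2label}
    seen_ids = {row["label"] for row in records}
    return sorted(seen_ids - known_ids)


# Mapping as displayed in the label table of this notebook (JSON keys are strings).
id2label = {"0": "anger", "1": "fear", "2": "joy", "3": "love", "4": "sadness"}

records = [
    {"text": "i didnt feel humiliated", "label": 0},
    {"text": "i am feeling grouchy", "label": 3},
    {"text": "made-up row with an unmapped label", "label": 5},  # illustrative only
]

missing = find_unmapped_labels(records, id2label)
print(missing)
```

With pandas, the equivalent check on the merged dataframes is to count missing values in the `label_string` column.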


In [69]:
# save 10% of the rows from the train, validation and test dataframes into files with small_ prefix in the ./emotion-dataset folder
train_df.sample(frac=0.1).to_json("./emotion-dataset/small_train.jsonl", orient='records', lines=True)
validation_df.sample(frac=0.1).to_json("./emotion-dataset/small_validation.jsonl", orient='records', lines=True)
test_df.sample(frac=0.1).to_json("./emotion-dataset/small_test.jsonl", orient='records', lines=True)
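
Calling `sample(frac=0.1)` without a seed draws a different subset on every run, so results are not repeatable between runs. The sketch below illustrates seeded sampling with the standard library only; `sample_fraction` is a hypothetical helper and the rows are synthetic.

```python
import random


def sample_fraction(rows, frac, seed):
    """Return a reproducible random subset with round(len(rows) * frac) rows."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    count = round(len(rows) * frac)
    return rng.sample(rows, count)


# Synthetic rows for illustration only.
rows = [{"text": f"example {i}", "label": i % 5} for i in range(100)]

first = sample_fraction(rows, frac=0.1, seed=42)
second = sample_fraction(rows, frac=0.1, seed=42)
print(len(first), first == second)
```

In pandas, the same effect comes from passing `random_state` to `DataFrame.sample`.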


### 4. Submit the fine tuning job using the model and data as inputs
 
Create the job that uses the `text-classification` pipeline component. [Learn more]() about all the parameters supported for fine tuning.

In [70]:
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import CommandComponent, PipelineComponent, Job, Component
from azure.ai.ml import PyTorchDistribution, Input

# fetch the pipeline component
pipeline_component_func = registry_ml_client.components.get(name="text_classification_pipeline", version="0.0.1")

# define the pipeline job
@pipeline()
def create_pipeline():
    finetuning_job = pipeline_component_func( 
        mlflow_model_path = foundation_model.id,
        # huggingface_id = 'bert-base-uncased', # if you want to use a huggingface model, uncomment this line and comment the above line
        compute_model_selector = compute_cluster,
        compute_preprocess = compute_cluster,
        compute_finetune = compute_cluster,
        train_file_path = Input(type="uri_file", path="./emotion-dataset/small_train.jsonl"),
        validation_file_path = Input(type="uri_file", path="./emotion-dataset/small_validation.jsonl"),
        test_file_path = Input(type="uri_file", path="./emotion-dataset/small_test.jsonl"),
        sentence1_key = "text", 
        label_key = "label_string", 
        number_of_gpu_to_use_finetuning = gpus_per_node, # set to the number of GPUs available in the compute
        num_train_epochs = 3,
        learning_rate = 2e-5, 
    )
    return {
        "trained_model": finetuning_job.outputs.mlflow_model_folder
    }

pipeline_object = create_pipeline()
pipeline_object.display_name =  "text-classification-using-" + model_name
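
The job above reads the input text from the column named by `sentence1_key` and the target from the column named by `label_key`. A quick local check that every record carries both keys can catch a bad file before a long job is submitted. The sketch below is one way to do that; `check_jsonl_columns` is a hypothetical helper and the demo file is a throwaway created only for the example.

```python
import json
import os
import tempfile


def check_jsonl_columns(path, required_keys):
    """Return the 1-based line numbers of records missing any required key."""
    bad_lines = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            if not all(key in record for key in required_keys):
                bad_lines.append(line_number)
    return bad_lines


# Demo on a temporary file; in the notebook you would pass
# "./emotion-dataset/small_train.jsonl" instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps({"text": "i am feeling grouchy", "label_string": "love"}) + "\n")
        handle.write(json.dumps({"text": "record without a label"}) + "\n")
    problems = check_jsonl_columns(demo_path, ["text", "label_string"])

print(problems)
```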

Submit the job

In [71]:
pipeline_job = workspace_ml_client.jobs.create_or_update(pipeline_object, experiment_name=experiment_name)
workspace_ml_client.jobs.stream(pipeline_job.name)

Uploading small_train.jsonl (< 1 MB): 100%|██████████| 226k/226k [00:00<00:00, 1.27MB/s]
Uploading small_validation.jsonl (< 1 MB): 100%|██████████| 27.2k/27.2k [00:00<00:00, 424kB/s]
Uploading small_test.jsonl (< 1 MB): 100%|██████████| 27.9k/27.9k [00:00<00:00, 435kB/s]

RunId: musing_ticket_s25xz3pydm
Web View: https://ml.azure.com/runs/musing_ticket_s25xz3pydm?wsid=/subscriptions/21d8f407-c4c4-452e-87a4-e609bfb86248/resourcegroups/rg-contoso-819prod/workspaces/mlw-contoso-819prod

Streaming logs/azureml/executionlogs.txt

[2023-03-23 00:04:35Z] Submitting 1 runs, first five are: 966466eb:7f00352e-f556-4561-a7ca-62a23f877214
.........[2023-03-23 00:25:47Z] Completing processing run id 7f00352e-f556-4561-a7ca-62a23f877214.

Execution Summary
RunId: musing_ticket_s25xz3pydm
Web View: https://ml.azure.com/runs/musing_ticket_s25xz3pydm?wsid=/subscriptions/21d8f407-c4c4-452e-87a4-e609bfb86248/resourcegroups/rg-contoso-819prod/workspaces/mlw-contoso-819prod



### 5. Register the fine tuned model with the workspace

We will register the model from the output of the fine tuning job. This tracks lineage between the fine tuned model and the fine tuning job. The fine tuning job, in turn, tracks lineage to the foundation model, data and training code.

In [72]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes
# check if the `trained_model` output is available
print ("pipeline job outputs: ", workspace_ml_client.jobs.get(pipeline_job.name).outputs)

# build the path to the model from the pipeline job's `trained_model` output
model_path_from_job = ("azureml://jobs/{0}/outputs/{1}".format(pipeline_job.name, "trained_model"))

finetuned_model_name = model_name + "-emotion-detection"
print("path to register model: ", model_path_from_job)
prepare_to_register_model = Model(
    path=model_path_from_job,
    type=AssetTypes.MLFLOW_MODEL,
    name=finetuned_model_name,
    version=timestamp, # use timestamp as version to avoid version conflict
    description=model_name + " fine tuned model for emotion detection"
)
print("prepare to register model: \n", prepare_to_register_model)
#register the model from pipeline job output 
registered_model = workspace_ml_client.models.create_or_update(prepare_to_register_model)
print ("registered model: \n", registered_model)


pipeline job outputs:  {'trained_model': <azure.ai.ml.entities._job.pipeline._io.base.PipelineOutput object at 0x7f83602a2d00>}
path to register model:  azureml://jobs/musing_ticket_s25xz3pydm/outputs/trained_model
prepare to register model: 
 description: bert-base-uncased fine tuned model for emotion detection
name: bert-base-uncased-emotion-detection
path: azureml://jobs/musing_ticket_s25xz3pydm/outputs/trained_model
properties: {}
tags: {}
type: mlflow_model
version: '1679529863'

registered model: 
 creation_context:
  created_at: '2023-03-23T00:26:37.445152+00:00'
  created_by: Manoj Bableshwar
  created_by_type: User
  last_modified_at: '2023-03-23T00:26:37.445152+00:00'
  last_modified_by: Manoj Bableshwar
  last_modified_by_type: User
description: bert-base-uncased fine tuned model for emotion detection
flavors:
  hftransformers:
    code: ''
    hf_config_class: BertConfig
    hf_pretrained_class: BertForSequenceClassification
    hf_tokenizer_class: BertTokenizerFast
    hug

### 6. Deploy the fine tuned model to an online endpoint
Online endpoints give a durable REST API that can be used to integrate with applications that need to use the model.

In [73]:
import time, sys
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

# Create online endpoint - endpoint names need to be unique in a region, hence using timestamp to create unique endpoint name

online_endpoint_name = "emotion-" + timestamp
# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Online endpoint for " + registered_model.name + ", fine tuned model for emotion detection",
    auth_mode="key"
)
workspace_ml_client.begin_create_or_update(endpoint).wait()

In [74]:
# create a deployment
demo_deployment = ManagedOnlineDeployment(
    name="demo",
    endpoint_name=online_endpoint_name,
    model=registered_model.id,
    instance_type="Standard_DS2_v2",
    instance_count=1,
)
workspace_ml_client.online_deployments.begin_create_or_update(demo_deployment).wait()
endpoint.traffic = {"demo": 100}
workspace_ml_client.begin_create_or_update(endpoint).result()

Check: endpoint emotion-1679529863 exists
data_collector is not a known attribute of class <class 'azure.ai.ml._restclient.v2022_02_01_preview.models._models_py3.ManagedOnlineDeployment'> and will be ignored


.................................................................................................................

ManagedOnlineEndpoint({'public_network_access': 'Enabled', 'provisioning_state': 'Succeeded', 'scoring_uri': 'https://emotion-1679529863.eastus.inference.ml.azure.com/score', 'openapi_uri': 'https://emotion-1679529863.eastus.inference.ml.azure.com/swagger.json', 'name': 'emotion-1679529863', 'description': 'Online endpoint for bert-base-uncased-emotion-detection, fine tuned model for emotion detection', 'tags': {}, 'properties': {'azureml.onlineendpointid': '/subscriptions/21d8f407-c4c4-452e-87a4-e609bfb86248/resourcegroups/rg-contoso-819prod/providers/microsoft.machinelearningservices/workspaces/mlw-contoso-819prod/onlineendpoints/emotion-1679529863', 'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/21d8f407-c4c4-452e-87a4-e609bfb86248/providers/Microsoft.MachineLearningServices/locations/eastus/mfeOperationsStatus/oe:4b0187f3-cadc-4d54-91fe-55b6c38c8b45:3ef2bfb3-2334-41d8-9dc2-708a77b71a78?api-version=2022-02-01-preview'}, 'print_as_yaml': True, 'id': '/subscript
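
Since the endpoint is created with `auth_mode="key"`, an application can call it over REST by posting JSON to the `scoring_uri` with one of the endpoint keys as a bearer token. The sketch below only builds the request so that it runs without a live endpoint. The URI and key are placeholders, and the payload shape (`{"inputs": [...]}`) is the one this notebook uses when testing the endpoint.

```python
import json
import urllib.request

# Placeholders: substitute the scoring_uri of your endpoint and one of its keys.
scoring_uri = "https://your-endpoint.your-region.inference.ml.azure.com/score"
api_key = "<ENDPOINT_KEY>"

# Same payload shape that this notebook sends when testing the endpoint.
payload = {"inputs": ["i feel gloomy and tired"]}

request = urllib.request.Request(
    scoring_uri,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    },
    method="POST",
)

# Sending is left commented out so the sketch runs without a deployed endpoint.
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
print(request.get_method(), request.full_url)
```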

### 7. Test the endpoint with sample data

We will fetch some sample data from the test dataset and submit it to the online endpoint for inference. We will then display the scored labels alongside the ground truth labels.

In [75]:
# read ./emotion-dataset/small_test.jsonl into a pandas dataframe and load first 10 rows into a new dataframe
test_df = pd.read_json("./emotion-dataset/small_test.jsonl", lines=True)
# rename the label_string column to ground_truth_label
test_df = test_df.rename(columns={"label_string": "ground_truth_label"})
test_df = test_df.head(10)
test_df.head(10)

Unnamed: 0,text,label,ground_truth_label
0,i still do feel left out i do feel like the most hated kid in the asian crew,3,love
1,i felt so bad for the bad grade and feeling like having to hide it that i didnt know what to say except to declare in all my frustration that i hated school,0,anger
2,im feeling a lot less ugly duckling and a lot more a href http,0,anger
3,i feel for all of you who have been supporting me is so extreme there would be no way to put a number value on it,1,fear
4,i dont know it if is the freshness of both but i feel more energetic during these seasons,1,fear
5,i tune out the rest of the world and focus on the rhythm of the needles and the softness of the yarn and for that time i feel my most peaceful,1,fear
6,i feel gloomy and tired,0,anger
7,i feel really selfish and feel guilty when i think about hurting myself,3,love
8,i came out of the airport that makes me feel irritable uncomfortable and even sadder,3,love
9,i was trying to demonstrate that i understood what she was feeling but she was very alarmed and worried for my safety,4,sadness


In [76]:
# create a json object with the key as "inputs" and value as a list of values from the text column of the test dataframe
test_json = {"inputs": test_df["text"].tolist()}
# save the json object to a file named sample_score.json in the ./emotion-dataset folder
with open("./emotion-dataset/sample_score.json", "w") as f:
    json.dump(test_json, f)

In [77]:
# score the sample_score.json file using the online endpoint with the azureml endpoint invoke method
response=workspace_ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="demo",
    request_file="./emotion-dataset/sample_score.json"
)
print("raw response: \n", response, "\n")
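# convert the response to a pandas dataframe and rename the label column as scored_label
# wrap the JSON string in StringIO, since newer pandas versions deprecate passing literal JSON to read_json
from io import StringIO
response_df = pd.read_json(StringIO(response))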
response_df = response_df.rename(columns={"label": "scored_label"})
response_df.head(10)

raw response: 
 [{"label": "anger", "score": 0.9985894560813904}, {"label": "anger", "score": 0.983283519744873}, {"label": "anger", "score": 0.9988788962364197}, {"label": "love", "score": 0.9091328382492065}, {"label": "fear", "score": 0.999154806137085}, {"label": "fear", "score": 0.9991065859794617}, {"label": "anger", "score": 0.9990180730819702}, {"label": "love", "score": 0.8813483119010925}, {"label": "love", "score": 0.9977255463600159}, {"label": "sadness", "score": 0.9982689619064331}] 



Unnamed: 0,scored_label,score
0,anger,0.998589
1,anger,0.983284
2,anger,0.998879
3,love,0.909133
4,fear,0.999155
5,fear,0.999107
6,anger,0.999018
7,love,0.881348
8,love,0.997726
9,sadness,0.998269


In [78]:
# merge the test dataframe and the response dataframe on the index
merged_df = pd.merge(test_df, response_df, left_index=True, right_index=True)
pd.set_option('display.max_colwidth', 0) # set the max column width to 0 to display the full text
merged_df.head(10)

Unnamed: 0,text,label,ground_truth_label,scored_label,score
0,i still do feel left out i do feel like the most hated kid in the asian crew,3,love,anger,0.998589
1,i felt so bad for the bad grade and feeling like having to hide it that i didnt know what to say except to declare in all my frustration that i hated school,0,anger,anger,0.983284
2,im feeling a lot less ugly duckling and a lot more a href http,0,anger,anger,0.998879
3,i feel for all of you who have been supporting me is so extreme there would be no way to put a number value on it,1,fear,love,0.909133
4,i dont know it if is the freshness of both but i feel more energetic during these seasons,1,fear,fear,0.999155
5,i tune out the rest of the world and focus on the rhythm of the needles and the softness of the yarn and for that time i feel my most peaceful,1,fear,fear,0.999107
6,i feel gloomy and tired,0,anger,anger,0.999018
7,i feel really selfish and feel guilty when i think about hurting myself,3,love,love,0.881348
8,i came out of the airport that makes me feel irritable uncomfortable and even sadder,3,love,love,0.997726
9,i was trying to demonstrate that i understood what she was feeling but she was very alarmed and worried for my safety,4,sadness,sadness,0.998269
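
To turn the side-by-side table into a single number, compare the two label columns row by row. The sketch below does this for the ten rows shown above, with the label pairs copied by hand from that output. Ten rows is far too small a sample to judge the model; it only illustrates the calculation.

```python
# (ground_truth_label, scored_label) pairs copied from the ten rows shown above.
pairs = [
    ("love", "anger"),
    ("anger", "anger"),
    ("anger", "anger"),
    ("fear", "love"),
    ("fear", "fear"),
    ("fear", "fear"),
    ("anger", "anger"),
    ("love", "love"),
    ("love", "love"),
    ("sadness", "sadness"),
]

correct = sum(1 for truth, scored in pairs if truth == scored)
accuracy = correct / len(pairs)
print(f"{correct} of {len(pairs)} correct, accuracy = {accuracy:.2f}")
```

On the dataframe itself, the same figure is `(merged_df["ground_truth_label"] == merged_df["scored_label"]).mean()`.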


### 8. Delete the online endpoint
Don't forget to delete the online endpoint; otherwise the billing meter keeps running for the compute used by the endpoint.

In [79]:
workspace_ml_client.online_endpoints.begin_delete(name=online_endpoint_name).wait()

....................................................................................