## Text Classification Evaluation - Sentiment Analysis

This sample shows how to evaluate a group of models against a given set of metrics for the `text-classification` task.

### Evaluation dataset
The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. It uses the two-way (positive/negative) class split, with only sentence-level labels. The [SST2](https://huggingface.co/datasets/glue/viewer/sst2/validation) dataset is a subset of the larger [General Language Understanding Evaluation](https://gluebenchmark.com/) dataset. A copy of this dataset is available in the [glue-sst2-dataset](./glue-sst2-dataset/) folder.

### Model
The goal of evaluating models is to compare their performance on a variety of metrics. `text-classification` is a generic task type that can be used for scenarios such as sentiment analysis, emotion detection, grammar checking, spam filtering, etc. As such, the models you pick to compare must be finetuned for the same scenario. Since we have a sentiment analysis dataset, we look for models finetuned for this specific scenario. We will compare `distilbert-base-uncased-finetuned-sst-2-english` and `finiteautomata-bertweet-base-sentiment-analysis` in this sample, which are available in the `azureml` system registry.

If you'd like to evaluate models that are not in the system registry, you can import those models to your workspace or organization registry and then evaluate them using the approach outlined in this sample. Review the sample notebook for [importing models](../../import/import-model-from-huggingface.ipynb). 

### Outline
* Setup pre-requisites such as compute.
* Pick the models to evaluate.
* Pick and explore the evaluation data.
* Configure the evaluation jobs.
* Run the evaluation jobs.
* Review the evaluation metrics. 

### 1. Setup pre-requisites
* Install dependencies
* Connect to AzureML Workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace `<AML_WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` below.
* Connect to `azureml` system registry
* Set an optional experiment name
* Check or create compute. A single GPU node can have multiple GPU cards. For example, in one node of `Standard_ND40rs_v2` there are 8 NVIDIA V100 GPUs while in `Standard_NC12s_v3`, there are 2 NVIDIA V100 GPUs. Refer to the [docs](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) for this information. The number of GPU cards per node is set in the param `gpus_per_node` below. Setting this value correctly will ensure utilization of all GPUs in the node. The recommended GPU compute SKUs can be found [here](https://learn.microsoft.com/en-us/azure/virtual-machines/ncv3-series) and [here](https://learn.microsoft.com/en-us/azure/virtual-machines/ndv2-series).
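The relationship between a compute SKU and `gpus_per_node` described in the list above can be sketched with a small lookup table. This is a minimal illustration only: the table holds just the two GPU counts quoted above, and the helper names `resolve_gpus_per_node` and `pick_device` are hypothetical, not part of the AzureML SDK.

```python
# GPU counts per node, taken from the two SKUs mentioned in the text above.
# Extend this table from the VM size documentation for other SKUs.
GPUS_PER_NODE_BY_SKU = {
    "standard_nd40rs_v2": 8,
    "standard_nc12s_v3": 2,
}


def resolve_gpus_per_node(vm_size: str, default: int = 1) -> int:
    """Return the GPU count for a VM size, or `default` if the size is unknown."""
    return GPUS_PER_NODE_BY_SKU.get(vm_size.lower(), default)


def pick_device(gpus_per_node: int) -> str:
    """Choose the device string based on whether any GPU is available."""
    return "gpu" if gpus_per_node > 0 else "cpu"


print(resolve_gpus_per_node("Standard_ND40rs_v2"))  # 8
print(pick_device(0))  # cpu
```

A hard-coded table is only practical for a handful of SKUs; querying the workspace for the sizes it supports avoids maintaining such a table by hand.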

Install dependencies by running the cell below. This step is required when running in a new environment.

In [None]:
%pip install azure-ai-ml
%pip install azure-identity
%pip install datasets==2.9.0

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml.entities import AmlCompute
import time

try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    credential = InteractiveBrowserCredential()

workspace_ml_client = None
try:
    workspace_ml_client = MLClient.from_config(credential)
    subscription_id = workspace_ml_client.subscription_id
    workspace = workspace_ml_client.workspace_name
    resource_group = workspace_ml_client.resource_group_name
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    workspace_ml_client = MLClient(
        credential, subscription_id, resource_group, workspace
    )

# the models, pipelines and environments are available in the AzureML system registry, "azureml"
registry = "azureml"

registry_ml_client = MLClient(
    credential, subscription_id, resource_group, registry_name=registry
)
registry_ml_client

In [None]:
workspace_ml_client

In [None]:
# If you already have a GPU cluster, set its name here. Otherwise a new one named 'gpu-cluster-big' is created.
compute_cluster = "gpu-cluster-big"
try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print(f"GPU compute '{compute_cluster}' found.")
except Exception as ex:
    print(f"GPU compute '{compute_cluster}' not found. Creating new one.")
    compute = AmlCompute(
        name=compute_cluster,
        size="Standard_ND40rs_v2",
        max_instances=2,  # For multi node training set this to an integer value more than 1
    )
    workspace_ml_client.compute.begin_create_or_update(compute).wait()

# generating a unique timestamp that can be used for names and versions that need to be unique
timestamp = str(int(time.time()))

The snippet below queries the number of GPUs present on the compute. We use it to set `gpus_per_node` so that all GPUs in the node are utilized.

In [None]:
# This is the number of GPUs in a single node of the selected 'vm_size' compute.
# Setting this to less than the number of GPUs will result in underutilized GPUs and a longer run.
# Setting this to more than the number of GPUs will result in an error.
gpus_per_node = 1  # default value, kept if the compute size is not listed
ws_computes = workspace_ml_client.compute.list_sizes()
for ws_compute in ws_computes:
    if ws_compute.name.lower() == compute.size.lower():
        gpus_per_node = ws_compute.gpus
        print(f"Number of GPUs in compute {ws_compute.name}: {ws_compute.gpus}")
# flag whether the compute has at least one GPU; print a message if it has none
gpu_count_found = gpus_per_node > 0
if not gpu_count_found:
    print(f"No GPUs found in compute {compute.size}.")

### 2. Pick the models to evaluate

Verify that the models selected for evaluation are available in the system registry.

In [None]:
# need to specify model versions until the bug to support fetching the latest version using latest label is fixed
models = [
    {"name": "distilbert-base-uncased-finetuned-sst-2-english", "version": "4"},
    # to evaluate the model below, prepare an appropriate dataset and config in a similar way, then uncomment it
    #     {"name": "finiteautomata-bertweet-base-sentiment-analysis", "version": "1"},
]
for model in models:
    # use a separate name so the dict entries in `models` are not shadowed
    registry_model = registry_ml_client.models.get(
        model["name"], version=model["version"]
    )
    print(registry_model.id)

### 3. Pick the test dataset for evaluation
A copy of the SST2 dataset is available in the [glue-sst2-dataset](./glue-sst2-dataset/) folder. The next few cells show basic data preparation:
* Visualize some data rows
* Replace numerical categories in data with the actual string labels. This mapping is available in [./glue-sst2-dataset/label.json](./glue-sst2-dataset/label.json). This step is needed because the selected models return labels such as `POSITIVE`, `NEGATIVE`, etc. when running prediction. If the labels in your ground truth data are left as `0`, `1`, etc., they will not match the prediction labels returned by the models.
* The dataset contains `sentence` and `label` as two different columns. 
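To see why the label replacement matters, here is a minimal sketch that compares predictions against ground truth before and after mapping. The `id2label` dictionary and the sample predictions are assumptions made up for illustration; the real mapping is the one stored in `label.json`.

```python
# Integer labels as stored in the raw data.
ground_truth_ids = [1, 0, 1, 1]
# String labels in the style a model might return (illustrative values).
predictions = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

# Assumed mapping for this sketch; the real one is read from label.json.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def accuracy(y_true, y_pred):
    """Fraction of positions where the label and the prediction are equal."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Comparing integers with strings never matches, so every row counts as wrong.
raw_accuracy = accuracy(ground_truth_ids, predictions)

# After mapping ids to strings, both sides use the same vocabulary.
mapped_truth = [id2label[i] for i in ground_truth_ids]
mapped_accuracy = accuracy(mapped_truth, predictions)

print(raw_accuracy)     # 0.0
print(mapped_accuracy)  # 0.75
```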

In [None]:
import os
import pandas as pd
from datasets import load_dataset

dataset_dir = "./glue-sst2-dataset"

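# keep the first 800 rows of the validation split; `select` works across `datasets` versions
df = pd.DataFrame(
    load_dataset("glue", "sst2", split="validation").select(range(800))
)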
df.head()

In [None]:
# load the id2label json element of the label.json file into pandas table with keys as 'label' column of int64 type and values as 'label_string' column as string type
import json

label_file = "label.json"
with open(os.path.join(dataset_dir, label_file)) as f:
    id2label = json.load(f)
    id2label = id2label["id2label"]
    label_df = pd.DataFrame.from_dict(
        id2label, orient="index", columns=["label_string"]
    )
    label_df["label"] = label_df.index.astype("int64")
    label_df = label_df[["label", "label_string"]]
label_df.head()

In [None]:
# join the validation dataframe with the id2label dataframe to get the label_string column
df = df.merge(label_df, on="label", how="left")
# creating a new column to match the signature of mlflow base model
df["input_string"] = df["sentence"]
# drop the idx, sentence columns as they are not needed
df = df.drop(columns=["idx", "sentence"])
df.head()

In [None]:
# save 10% of the rows from the validation dataframe into a file with the small_ prefix in the dataset_dir folder
small_data_file = "small_validation.jsonl"
df.sample(frac=0.1).to_json(
    os.path.join(dataset_dir, small_data_file), orient="records", lines=True
)
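For reference, the file written above is in JSON Lines format: one JSON object per line, with the columns left after data preparation (`label`, `label_string`, `input_string`). The sketch below reproduces that shape with the standard library only. The records are placeholders, and the fixed seed is an assumption added so the sample is repeatable; the notebook cell itself samples without a seed.

```python
import json
import random

# Placeholder records with the same columns as the prepared dataframe.
label_names = ["NEGATIVE", "POSITIVE"]
records = [
    {
        "label": i % 2,
        "label_string": label_names[i % 2],
        "input_string": f"placeholder sentence {i}",
    }
    for i in range(20)
]

# Draw 10% of the rows; a seeded generator makes the draw repeatable.
rng = random.Random(0)
sample = rng.sample(records, k=max(1, len(records) // 10))

# JSON Lines: serialize each record on its own line.
jsonl_text = "\n".join(json.dumps(record) for record in sample)
print(jsonl_text)
```

With pandas, passing `random_state` to `DataFrame.sample` gives the same kind of repeatability.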

### 4. Submit the evaluation jobs using the model and data as inputs
 
Create the job that uses the `model_evaluation_pipeline` component. We will submit one job per model. 

Note that the metrics that the evaluation jobs need to calculate are specified in the [sst2-eval-config.json](./sst2-eval-config.json) file.

All supported evaluation configurations for `text-classification` can be found in the [README](./README.md).

In [None]:
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import CommandComponent, PipelineComponent, Job, Component
from azure.ai.ml import PyTorchDistribution, Input
from azure.ai.ml.constants import AssetTypes

# fetch the pipeline component
pipeline_component_func = registry_ml_client.components.get(
    name="model_evaluation_pipeline", label="latest"
)


# define the pipeline job
@pipeline()
def evaluation_pipeline(mlflow_model):
    evaluation_job = pipeline_component_func(
        # specify the foundation model available in the azureml system registry or a model from the workspace
        # mlflow_model = Input(type=AssetTypes.MLFLOW_MODEL, path=f"{mlflow_model_path}"),
        mlflow_model=mlflow_model,
        # test data
        test_data=Input(
            type=AssetTypes.URI_FILE, path=os.path.join(dataset_dir, small_data_file)
        ),
        # The following parameters map to the dataset fields
        input_column_names="input_string",
        label_column_name="label_string",
        # Evaluation settings
        task="text-classification",
        # config file containing the details of evaluation metrics to calculate
        evaluation_config=Input(
            type=AssetTypes.URI_FILE, path="./sst2-eval-config.json"
        ),
        # device the job runs on
        # use "gpu" if the compute has at least one GPU, otherwise "cpu"
        device="gpu" if gpu_count_found else "cpu",
    )
    return {"evaluation_result": evaluation_job.outputs.evaluation_result}

Submit the jobs, passing the model as a parameter to the pipeline created in the above step.

In [None]:
# submit the pipeline job for each model that we want to evaluate
# you could consider submitting the pipeline jobs in parallel, provided your cluster has multiple nodes
pipeline_jobs = []

experiment_name = "text-classification-sentiment-analysis"

for model in models:
    model_object = registry_ml_client.models.get(
        model["name"], version=model["version"]
    )
    pipeline_object = evaluation_pipeline(
        mlflow_model=Input(type=AssetTypes.MLFLOW_MODEL, path=f"{model_object.id}"),
    )
    # don't reuse cached results from previous jobs
    pipeline_object.settings.force_rerun = True
    pipeline_object.settings.default_compute = compute_cluster
    pipeline_object.display_name = f"eval-{model['name']}-{timestamp}"
    pipeline_job = workspace_ml_client.jobs.create_or_update(
        pipeline_object, experiment_name=experiment_name
    )
    # add model['name'] and pipeline_job.name as key value pairs to a dictionary
    pipeline_jobs.append({"model_name": model["name"], "job_name": pipeline_job.name})
    # wait for the pipeline job to complete
    workspace_ml_client.jobs.stream(pipeline_job.name)
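The loop above submits a job and waits for it before moving to the next model. If the cluster has several nodes, one possible alternative is to submit every job first and only then wait on each. The sketch below shows that two-phase pattern with stand-in callables. `submit_then_wait`, `fake_submit` and `fake_wait` are hypothetical names; in this notebook, `submit` would wrap `workspace_ml_client.jobs.create_or_update` and `wait` would wrap `workspace_ml_client.jobs.stream`.

```python
def submit_then_wait(models, submit, wait):
    """Submit one job per model first, then block on each job in turn."""
    # Phase 1: submit everything so the cluster can schedule jobs concurrently.
    submitted = [
        {"model_name": model["name"], "job_name": submit(model)} for model in models
    ]
    # Phase 2: wait for each job to finish.
    for job in submitted:
        wait(job["job_name"])
    return submitted


# Stand-ins that record the call order instead of contacting a workspace.
events = []


def fake_submit(model):
    events.append(("submit", model["name"]))
    return f"job-{model['name']}"


def fake_wait(job_name):
    events.append(("wait", job_name))


jobs = submit_then_wait(
    [{"name": "model-a"}, {"name": "model-b"}], fake_submit, fake_wait
)
print(events)
```

The returned list has the same `model_name` and `job_name` keys as `pipeline_jobs`, so later cells that read those keys would work unchanged.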

### 5. Review evaluation metrics
Viewing the job in AzureML studio is the best way to analyze logs, metrics and outputs of jobs. You can create custom charts and compare metrics across different jobs. See [view jobs/runs information in the studio](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-log-view-metrics?tabs=interactive#view-jobsruns-information-in-the-studio) to learn more.

![Model evaluation dashboard in AzureML studio](./sst2-eval-dashboard.png)

However, we may also need to access and review metrics programmatically. For this we use MLflow, which is the recommended client for logging and querying metrics.

In [None]:
import mlflow, json

mlflow_tracking_uri = workspace_ml_client.workspaces.get(
    workspace_ml_client.workspace_name
).mlflow_tracking_uri
mlflow.set_tracking_uri(mlflow_tracking_uri)

metrics_df = pd.DataFrame()
for job in pipeline_jobs:
    # build the filter "tags.mlflow.rootRunId='<job name>'" to select the runs under this pipeline job
    filter_string = f"tags.mlflow.rootRunId='{job['job_name']}'"
    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=filter_string,
        output_format="list",
    )
    # get the compute_metrics runs.
    # using a hacky way till 'Bug 2320997: not able to show eval metrics in FT notebooks - mlflow client now showing display names' is fixed
    for run in runs:
        # keep only the runs that logged an accuracy metric
        if "accuracy" in run.data.metrics:
            # get the metrics from the mlflow run
            run_metric = run.data.metrics
            # add the model name to the run_metric dictionary
            run_metric["model_name"] = job["model_name"]
            # convert the run_metric dictionary to a pandas dataframe
            temp_df = pd.DataFrame(run_metric, index=[0])
            # concat the temp_df to the metrics_df
            metrics_df = pd.concat([metrics_df, temp_df], ignore_index=True)

# move the model_name column to the first position
cols = metrics_df.columns.tolist()
cols = cols[-1:] + cols[:-1]
metrics_df = metrics_df[cols]
metrics_df.head()
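Once the metrics are collected per model, a natural next step is to rank the models on one metric. The sketch below does this on plain dictionaries, the shape produced by `metrics_df.to_dict(orient="records")`. The helper `rank_by_metric` is hypothetical, and the metric values are placeholders invented to show the data shape, not results from this evaluation.

```python
def rank_by_metric(rows, metric="accuracy"):
    """Sort per-model metric rows from best to worst on one metric."""
    # Rows that lack the metric sort last instead of raising KeyError.
    return sorted(
        rows, key=lambda row: row.get(metric, float("-inf")), reverse=True
    )


# Placeholder values, not real evaluation results.
example_rows = [
    {"model_name": "model-a", "accuracy": 0.80},
    {"model_name": "model-b", "accuracy": 0.90},
    {"model_name": "model-c"},
]

ranked = rank_by_metric(example_rows)
print([row["model_name"] for row in ranked])  # ['model-b', 'model-a', 'model-c']
```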