## Text Classification Inference using Online Endpoints

This sample shows how to deploy `text-classification` type models to an online endpoint for inference.

### Task
`text-classification` is a generic task type that can be used for scenarios such as sentiment analysis, emotion detection, grammar checking, and spam filtering. In this example, we test for entailment vs. contradiction: given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral).
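To make the three classes concrete, here is a minimal sketch with one premise/hypothesis pair per label. The sentence pairs and labels are taken from the sample data shown in this notebook; the `NliExample` class is only an illustrative container and is not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NliExample:
    """Illustrative container for one natural language inference example."""

    premise: str
    hypothesis: str
    label: str  # one of ENTAILMENT, CONTRADICTION, NEUTRAL


# one example per class, copied from the sample data in this notebook
examples = [
    NliExample(
        "In 2002, 55 TIG awards were granted.",
        "There were 55 TIG awards that year.",
        "ENTAILMENT",
    ),
    NliExample(
        "Dreams of the Lefty (Al Pacino) and Brasco (Depp) (30 seconds) :",
        "Al Pacino never had a role in Dreams of the Lefty.",
        "CONTRADICTION",
    ),
    NliExample(
        "There was Sage.",
        "Sage was there at the mill.",
        "NEUTRAL",
    ),
]

for example in examples:
    print(f"{example.label:<13} | {example.premise} -> {example.hypothesis}")
```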

### Inference data
The Multi-Genre Natural Language Inference (MNLI) corpus is a crowdsourced collection of sentence pairs with textual entailment annotations. The [MNLI](https://huggingface.co/datasets/glue) dataset is a subset of the larger [General Language Understanding Evaluation](https://gluebenchmark.com/) (GLUE) benchmark. A copy of this dataset is available in the [glue-mnli-dataset](./glue-mnli-dataset/) folder.

### Model
Look for models tagged with `text-classification` in the system registry. Filtering on the `text-classification` tag alone is not sufficient: you need to check that the model is specifically fine-tuned for entailment vs. contradiction by studying the model card and looking at the input/output samples or signatures of the model. In this notebook, we use the `microsoft-deberta-base-mnli` model.

### Outline
* Set up pre-requisites.
* Pick a model to deploy.
* Prepare data for inference. 
* Deploy the model for real-time inference.
* Test the endpoint.
* Clean up resources.

### 1. Set up pre-requisites
* Install dependencies
* Connect to your AzureML workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace the `subscription_id`, `resource_group_name` and `workspace_name` values below with your own.
* Connect to the `azureml-preview` system registry.

In [1]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

try:
    credential = DefaultAzureCredential()
    # check that the default credential can actually obtain a token
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # fall back to interactive browser login if the default credential fails
    credential = InteractiveBrowserCredential()

workspace_ml_client = MLClient(
    credential,
    subscription_id="ea4faa5b-5e44-4236-91f6-5483d5b17d14",
    resource_group_name="amyharrispersonal",
    workspace_name="amyharris-canary",
)
# the models, fine tuning pipelines and environments are available in the AzureML system registry, "azureml-preview"
registry_ml_client = MLClient(credential, registry_name="azureml-preview")



### 2. Pick a model to deploy

Browse models in the Model Catalog in the AzureML Studio, filtering by the `text-classification` task. In this example, we use the `microsoft-deberta-base-mnli` model. If you have opened this notebook for a different model, replace the model name and version accordingly.

In [2]:
model_name = "microsoft-deberta-base-mnli"
model_version = "2"
foundation_model = registry_ml_client.models.get(model_name, model_version)
print(
    "\n\nUsing model name: {0}, version: {1}, id: {2} for inference".format(
        foundation_model.name, foundation_model.version, foundation_model.id
    )
)



Using model name: microsoft-deberta-base-mnli, version: 2, id: azureml://registries/azureml-preview/models/microsoft-deberta-base-mnli/versions/2 for inference
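If you would rather not hard-code the version, the following is a minimal sketch of picking the newest one. It assumes that `models.list(name=...)` on the registry client yields the registered versions of that model and that version strings are numeric; `latest_model_version` is a hypothetical helper, not part of the SDK.

```python
def latest_model_version(ml_client, model_name):
    """Return the highest version string registered under model_name.

    Assumes every version string is numeric (e.g. "1", "2", "10").
    """
    versions = [model.version for model in ml_client.models.list(name=model_name)]
    if not versions:
        raise ValueError(f"No versions found for model {model_name!r}")
    # compare as integers so that "10" sorts after "2"
    return max(versions, key=int)


# Usage with the registry client created earlier (requires Azure access):
# model_version = latest_model_version(registry_ml_client, "microsoft-deberta-base-mnli")
```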


### 3. Prepare data for inference

A subset of the MNLI dataset is available in the [glue-mnli-dataset](./glue-mnli-dataset/) folder. The next few cells show basic data preparation:
* Visualize some data rows.
* Replace numerical categories in data with the actual string labels. This mapping is available in [./glue-mnli-dataset/label.json](./glue-mnli-dataset/label.json). This step is needed because the selected model returns labels such as `ENTAILMENT`, `CONTRADICTION`, and `NEUTRAL` when running prediction. If the labels in your ground truth data are left as `0`, `1`, `2`, etc., they will not match the prediction labels returned by the model.
* The dataset contains `premise` and `hypothesis` as two different columns. However, the model expects a single string for prediction in the format `[CLS] <premise text> [SEP] <hypothesis text> [SEP]`. Hence we merge the columns and drop the original columns.
* We want this sample to run quickly, so we save a smaller dataset containing 10% of the original.
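The label replacement and column merging steps above can be sketched on a single record without pandas. The `id2label` mapping used here is an illustrative placeholder, not the contents of `label.json`, and it assumes string keys as produced by `json.load`; `to_model_input` and `prepare_record` are hypothetical helper names.

```python
def to_model_input(premise, hypothesis):
    """Merge premise and hypothesis into the single string format described above."""
    return f"[CLS] {premise} [SEP] {hypothesis} [SEP]"


def prepare_record(record, id2label):
    """Replace the numeric label with its string label and merge the text columns."""
    return {
        "ground_truth_label": id2label[str(record["label"])],
        "text": to_model_input(record["premise"], record["hypothesis"]),
    }


# ASSUMPTION: placeholder mapping for illustration only; load the real one from label.json
example_id2label = {"0": "LABEL_A", "1": "LABEL_B", "2": "LABEL_C"}

record = {
    "idx": 0,
    "premise": "There was Sage.",
    "hypothesis": "Sage was there at the mill.",
    "label": 1,
}
prepared = prepare_record(record, example_id2label)
print(prepared["text"])
```

Keeping the mapping as a parameter means the same helper works with whatever `id2label` the chosen model ships with.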

In [3]:
import os

dataset_dir = "./glue-mnli-dataset"
data_file = "train_100.jsonl"

# load the train_100.jsonl file into a pandas dataframe and show the first 5 rows
import pandas as pd

pd.set_option(
    "display.max_colwidth", 0
)  # set the max column width to 0 to display the full text
df = pd.read_json(os.path.join(dataset_dir, data_file), lines=True)
df.head()

# load the id2label json element of the label.json file into pandas table with keys as 'label' column of int64 type and values as 'label_string' column as string type
import json

label_file = "label.json"
with open(os.path.join(dataset_dir, label_file)) as f:
    id2label = json.load(f)
    id2label = id2label["id2label"]
    label_df = pd.DataFrame.from_dict(
        id2label, orient="index", columns=["label_string"]
    )
    label_df["label"] = label_df.index.astype("int64")
    label_df = label_df[["label", "label_string"]]

# join the dataframe with the id2label dataframe to get the label_string column
df = df.merge(label_df, on="label", how="left")
# concatenate the premise and hypothesis columns, with "[CLS]" at the beginning and "[SEP]" in the middle and at the end, to get the text column
df["text"] = "[CLS] " + df["premise"] + " [SEP] " + df["hypothesis"] + " [SEP]"
# drop the idx, premise, hypothesis and label columns as they are no longer needed
df = df.drop(columns=["idx", "premise", "hypothesis", "label"])
# rename the label_string column to ground_truth_label
df = df.rename(columns={"label_string": "ground_truth_label"})

# save 10% of the rows into a file with the small_ prefix in the dataset_dir folder
small_data_file = "small_train.jsonl"
df.sample(frac=0.1).to_json(
    os.path.join(dataset_dir, small_data_file), orient="records", lines=True
)

df.head()

|   | ground_truth_label | text |
|---|---|---|
| 0 | ENTAILMENT | [CLS] In 2002, 55 TIG awards were granted. [SEP] There were 55 TIG awards that year. [SEP] |
| 1 | CONTRADICTION | [CLS] In part, grants from the Colorado Lawyers Trust Account Foundation defray some of the costs associated with maintaining a strong pro bono, program. [SEP] There is not a pro bono program associated with the Colorado Lawyers Trust Account Foundation. [SEP] |
| 2 | CONTRADICTION | [CLS] The communes were strong enough to confine German Emperor Frederick Barbarossa's Italian ambitions to the south, where he secured Sicily for his Hohenstaufen heirs by marrying his son into the Norman royal family. [SEP] German Emperor Frederick Barbarossa took over all of Europe. [SEP] |
| 3 | CONTRADICTION | [CLS] For a time, Drudge straddled the divide between Web and checkout-line cultures, making the national press corps regularly scramble to cover a tabloid item. [SEP] Drudge only covered mainstream news and never delved into tabloid journalism. [SEP] |
| 4 | NEUTRAL | [CLS] Maybe to David Arnold Hanson, the famed engineer, no task was impossible. [SEP] Maybe David Arnold Hanson could do difficult tasks that other engineers couldn't. [SEP] |


### 4. Deploy the model to an online endpoint
Online endpoints give a durable REST API that can be used to integrate with applications that need to use the model.

In [4]:
import time
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

# Create online endpoint - endpoint names need to be unique in a region, hence using timestamp to create unique endpoint name
timestamp = int(time.time())
online_endpoint_name = "entail-contra-" + str(timestamp)
# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Online endpoint for "
    + foundation_model.name
    + ", to detect entailment v/s contradiction",
    auth_mode="key",
)
workspace_ml_client.begin_create_or_update(endpoint).wait()

In [5]:
# create a deployment
demo_deployment = ManagedOnlineDeployment(
    name="demo",
    endpoint_name=online_endpoint_name,
    model=foundation_model.id,
    instance_type="Standard_DS2_v2",
    instance_count=1,
)
workspace_ml_client.online_deployments.begin_create_or_update(demo_deployment).wait()
endpoint.traffic = {"demo": 100}
workspace_ml_client.begin_create_or_update(endpoint).result()

Instance type Standard_DS2_v2 may be too small for compute resources. Minimum recommended compute SKU is Standard_DS3_v2 for general purpose endpoints. Learn more about SKUs here: https://learn.microsoft.com/en-us/azure/machine-learning/referencemanaged-online-endpoints-vm-sku-list
Check: endpoint entail-contra-1684194378 exists
data_collector is not a known attribute of class <class 'azure.ai.ml._restclient.v2022_02_01_preview.models._models_py3.ManagedOnlineDeployment'> and will be ignored


......................................................................................................................

ManagedOnlineEndpoint({'public_network_access': 'Enabled', 'provisioning_state': 'Succeeded', 'scoring_uri': 'https://entail-contra-1684194378.eastus2euap.inference.ml.azure.com/score', 'openapi_uri': 'https://entail-contra-1684194378.eastus2euap.inference.ml.azure.com/swagger.json', 'name': 'entail-contra-1684194378', 'description': 'Online endpoint for microsoft-deberta-base-mnli, to detect entailment v/s contradiction', 'tags': {}, 'properties': {'azureml.onlineendpointid': '/subscriptions/ea4faa5b-5e44-4236-91f6-5483d5b17d14/resourcegroups/amyharrispersonal/providers/microsoft.machinelearningservices/workspaces/amyharris-canary/onlineendpoints/entail-contra-1684194378', 'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/ea4faa5b-5e44-4236-91f6-5483d5b17d14/providers/Microsoft.MachineLearningServices/locations/eastus2euap/mfeOperationsStatus/oe:c76e6446-545b-4141-80f9-e8ad59c471f2:72e8d8a9-2dc1-4d43-8405-46e5039242db?api-version=2022-02-01-preview'}, 'print_as_yam
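Because the endpoint is a plain REST API, applications can call its `scoring_uri` directly instead of going through the SDK. The sketch below only builds the HTTP request and does not send it. It assumes key authentication passed as a bearer token and the `{"inputs": {"input_string": [...]}}` payload shape that this sample uses for scoring; `build_scoring_request` is a hypothetical helper, and the URL and key in the usage lines are placeholders.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, api_key, texts, deployment_name=None):
    """Build (but do not send) a POST request for an online endpoint."""
    body = json.dumps({"inputs": {"input_string": list(texts)}}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # key-based auth: the endpoint key is sent as a bearer token
        "Authorization": f"Bearer {api_key}",
    }
    if deployment_name:
        # optional header that routes the call to one specific deployment
        headers["azureml-model-deployment"] = deployment_name
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )


# placeholder URL and key, for illustration only
request = build_scoring_request(
    "https://example.invalid/score",
    "<ENDPOINT_KEY>",
    ["[CLS] There was Sage. [SEP] Sage was there at the mill. [SEP]"],
    deployment_name="demo",
)
print(request.get_method(), request.full_url)
```

To actually send the request, pass it to `urllib.request.urlopen(request)` with the real scoring URI; the key can be fetched with `workspace_ml_client.online_endpoints.get_keys(name=online_endpoint_name).primary_key`.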

### 5. Test the endpoint with sample data

We will fetch some sample data from the small dataset saved earlier and submit it to the online endpoint for inference. We will then display the scored labels alongside the ground truth labels.

In [6]:
import json

data_file_small = "small_train.jsonl"
score_file = "sample_score.json"
# read the data file into a pandas dataframe
df = pd.read_json(os.path.join(dataset_dir, data_file_small), lines=True)
# pick 5 random rows
sample_df = df.sample(5)
# reset the index of sample_df
sample_df = sample_df.reset_index(drop=True)

# build the request payload in the format expected by the endpoint
test_json = {"inputs": {"input_string": sample_df["text"].tolist()}}
# save the json object to a file named sample_score.json in the ./glue-mnli-dataset folder
with open(os.path.join(".", dataset_dir, score_file), "w") as f:
    json.dump(test_json, f)
sample_df.head()

|   | ground_truth_label | text |
|---|---|---|
| 0 | ENTAILMENT | [CLS] that's that's on our list of things to see [SEP] That's on our list. [SEP] |
| 1 | NEUTRAL | [CLS] At the south end of the square is the 14th-century Loggia della Signoria or dei Lanzi, transformed from the city fathers' ceremonial grandstand into a guardroom for Swiss mercenary Lands?­knechte. [SEP] The Loggia della Signoria gets a ton of visitors trying to view the Swiss mercenary Landsknecht. [SEP] |
| 2 | CONTRADICTION | [CLS] Dreams of the Lefty (Al Pacino) and Brasco (Depp) (30 seconds) : [SEP] Al Pacino never had a role in Dreams of the Lefty. [SEP] |
| 3 | NEUTRAL | [CLS] There was Sage. [SEP] Sage was there at the mill. [SEP] |
| 4 | NEUTRAL | [CLS] California Rural Legal Assistance has purchased an Oxnard building it had been renting since earlier this year, moving the provider of legal services for Ventura County's poor closer to two goals. [SEP] California Rural Legal Assistance bought a building for their new office. [SEP] |


In [7]:
# score the sample_score.json file using the online endpoint with the azureml endpoint invoke method
response = workspace_ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="demo",
    request_file=os.path.join(".", dataset_dir, score_file),
)
print("raw response: \n", response, "\n")
# convert the json response to a pandas dataframe
response_df = pd.read_json(response)
# the response has a single column named 0; rename it to predicted_label
response_df.columns = ["predicted_label"]
response_df.head()

raw response: 
 [{"0": "ENTAILMENT"}, {"0": "NEUTRAL"}, {"0": "CONTRADICTION"}, {"0": "NEUTRAL"}, {"0": "NEUTRAL"}] 



|   | predicted_label |
|---|---|
| 0 | ENTAILMENT |
| 1 | NEUTRAL |
| 2 | CONTRADICTION |
| 3 | NEUTRAL |
| 4 | NEUTRAL |


In [8]:
# merge the sample_df and response_df dataframes
merged_df = sample_df.merge(response_df, left_index=True, right_index=True)
merged_df.head()

|   | ground_truth_label | text | predicted_label |
|---|---|---|---|
| 0 | ENTAILMENT | [CLS] that's that's on our list of things to see [SEP] That's on our list. [SEP] | ENTAILMENT |
| 1 | NEUTRAL | [CLS] At the south end of the square is the 14th-century Loggia della Signoria or dei Lanzi, transformed from the city fathers' ceremonial grandstand into a guardroom for Swiss mercenary Lands?­knechte. [SEP] The Loggia della Signoria gets a ton of visitors trying to view the Swiss mercenary Landsknecht. [SEP] | NEUTRAL |
| 2 | CONTRADICTION | [CLS] Dreams of the Lefty (Al Pacino) and Brasco (Depp) (30 seconds) : [SEP] Al Pacino never had a role in Dreams of the Lefty. [SEP] | CONTRADICTION |
| 3 | NEUTRAL | [CLS] There was Sage. [SEP] Sage was there at the mill. [SEP] | NEUTRAL |
| 4 | NEUTRAL | [CLS] California Rural Legal Assistance has purchased an Oxnard building it had been renting since earlier this year, moving the provider of legal services for Ventura County's poor closer to two goals. [SEP] California Rural Legal Assistance bought a building for their new office. [SEP] | NEUTRAL |
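Beyond eyeballing the merged table, the agreement between ground truth and predictions can be computed directly. This is a minimal sketch using only the standard library; it assumes the response is a JSON list of single-entry objects, as in the raw response printed above, and `extract_labels` and `agreement_rate` are hypothetical helper names.

```python
import json


def extract_labels(raw_response):
    """Pull the predicted label out of each single-entry object in the response."""
    rows = json.loads(raw_response)
    return [next(iter(row.values())) for row in rows]


def agreement_rate(ground_truth, predicted):
    """Return the fraction of positions where prediction and ground truth match."""
    if len(ground_truth) != len(predicted):
        raise ValueError("ground_truth and predicted must have the same length")
    if not ground_truth:
        raise ValueError("cannot compute agreement on empty inputs")
    matches = sum(g == p for g, p in zip(ground_truth, predicted))
    return matches / len(ground_truth)


# raw response and ground truth labels copied from the run shown above
raw = (
    '[{"0": "ENTAILMENT"}, {"0": "NEUTRAL"}, {"0": "CONTRADICTION"}, '
    '{"0": "NEUTRAL"}, {"0": "NEUTRAL"}]'
)
truth = ["ENTAILMENT", "NEUTRAL", "CONTRADICTION", "NEUTRAL", "NEUTRAL"]

predicted = extract_labels(raw)
print(agreement_rate(truth, predicted))  # prints 1.0 for this sample
```

Five rows are far too few to say anything about model quality; the same helper can be applied to a larger sample for a more meaningful number.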


### 6. Delete the online endpoint
Don't forget to delete the online endpoint; otherwise the billing meter keeps running for the compute used by the endpoint.

In [9]:
workspace_ml_client.online_endpoints.begin_delete(name=online_endpoint_name).wait()

...................................................................................
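If the notebook was run several times, endpoints from earlier runs may still exist and keep incurring cost. The following is a minimal sketch that removes every endpoint created with this sample's name prefix. It assumes `online_endpoints.list()` yields objects with a `name` attribute and reuses the `begin_delete(name=...).wait()` call shown above; `delete_endpoints_with_prefix` is a hypothetical helper, not part of the SDK.

```python
def delete_endpoints_with_prefix(ml_client, prefix):
    """Delete every online endpoint whose name starts with prefix.

    Returns the names of the deleted endpoints, in listing order.
    """
    deleted = []
    # materialize the listing first so deletions do not affect iteration
    for endpoint in list(ml_client.online_endpoints.list()):
        if endpoint.name.startswith(prefix):
            # block until this endpoint is fully deleted before moving on
            ml_client.online_endpoints.begin_delete(name=endpoint.name).wait()
            deleted.append(endpoint.name)
    return deleted


# Usage with the workspace client created earlier (requires Azure access):
# delete_endpoints_with_prefix(workspace_ml_client, "entail-contra-")
```

Check the prefix carefully before running this in a shared workspace, since it deletes every matching endpoint, not only your own.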