## Fill Mask Inference using Batch Endpoints

This sample shows how to deploy `fill-mask` models to a batch endpoint for inference.

### Task
The `fill-mask` task predicts masked words in a sentence. Models that perform this task have a good understanding of the language structure and domain of the dataset they are trained on. `fill-mask` models are typically used as foundation models for more scenario-oriented tasks such as `text-classification` or `token-classification`.

### Model
Models that can perform the `fill-mask` task are tagged with `task: fill-mask`. We will use the `bert-base-uncased` model in this notebook. If you opened this notebook from a specific model card, remember to replace the specific model name. If you don't find a model that suits your scenario or domain, you can discover and [import models from HuggingFace hub](../../import/import-model-from-huggingface.ipynb) and then use them for inference. 

### Inference data
We will use the [book corpus](https://huggingface.co/datasets/bookcorpus) dataset. A copy of this dataset is available in the [book-corpus-dataset](./book-corpus-dataset/) folder. 

### Outline
* Setup pre-requisites.
* Pick a model to deploy.
* Prepare data for inference. 
* Deploy the model for batch inference.
* Run a batch inference job.
* Review inference predictions.
* Clean up resources.

### 1. Setup pre-requisites
* Install dependencies.
* Connect to AzureML Workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace  `<WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` below.
* Connect to `azureml` system registry.
* Check or create compute.

In [None]:
# Import packages used by the following code snippets
import csv
import json
import os
import random
import sys
import time

import pandas as pd
import urllib.request

from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml.entities import (
    AmlCompute,
    BatchDeployment,
    BatchEndpoint,
    BatchRetrySettings,
    Model,
)

In [None]:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group_name = "<RESOURCE_GROUP>"
workspace_name = "<WORKSPACE_NAME>"

In [None]:
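try:
    # Prefer the default credential chain
    credential = DefaultAzureCredential()
    # Check that the credential can actually obtain a token
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to an interactive browser login if the default chain fails
    credential = InteractiveBrowserCredential()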

workspace_ml_client = MLClient(
    credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group_name,
    workspace_name=workspace_name,
)
# the models, fine tuning pipelines and environments are available in the AzureML system registry, "azureml"
registry_ml_client = MLClient(credential, registry_name="azureml")

# Generate a unique timestamp that can be used for names and versions that need to be unique
timestamp = str(int(time.time())) 

#### Create a compute cluster.

In [None]:
compute_name = "cpu-cluster"

compute_cluster = AmlCompute(
    name=compute_name,
    description="An AML compute cluster",
    size="Standard_DS3_V2", # Use the model card from the AzureML system registry to check the minimum required inferencing sku.
    min_instances=0,
    max_instances=3,
    idle_time_before_scale_down=120) # 120 seconds

# Wait for the cluster to be provisioned before moving on
workspace_ml_client.begin_create_or_update(compute_cluster).result()
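
The pre-requisites list mentions checking for existing compute, while the cell above always submits a create-or-update request. A minimal sketch of a check-then-create pattern follows. `get_or_create` is a hypothetical helper, not part of the SDK, and `fake_get` and `fake_create` are stand-ins so the sketch runs without an Azure subscription.

```python
def get_or_create(name, get_fn, create_fn, not_found_errors=(Exception,)):
    """Return (resource, created): reuse the resource if lookup succeeds, else create it."""
    try:
        return get_fn(name), False
    except not_found_errors:
        return create_fn(name), True


# Stand-ins for the workspace calls (hypothetical, for illustration only)
_existing = {"cpu-cluster": {"name": "cpu-cluster", "size": "Standard_DS3_V2"}}


def fake_get(name):
    return _existing[name]  # raises KeyError when the cluster does not exist


def fake_create(name):
    _existing[name] = {"name": name, "size": "Standard_DS3_V2"}
    return _existing[name]


compute, created = get_or_create("cpu-cluster", fake_get, fake_create, (KeyError,))
print(compute["name"], "created" if created else "reused")
```

With the real client, we would likely pass `workspace_ml_client.compute.get` as the lookup, wrap `begin_create_or_update(compute_cluster).result()` as the create function, and catch `azure.core.exceptions.ResourceNotFoundError`. Treat that exception choice as an assumption to verify against your SDK version.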

### 2. Pick a model to deploy

Browse models in the Model Catalog in the AzureML Studio, filtering by the `fill-mask` task. In this example, we use the `bert-base-uncased` model. If you have opened this notebook for a different model, replace the model name and version accordingly. 

In [None]:
model_name = "bert-base-uncased"
model_version = "1"
foundation_model = registry_ml_client.models.get(model_name, model_version)
print(
    f"Using model name: {foundation_model.name}, version: {foundation_model.version}, "
    f"id: {foundation_model.id} for inferencing."
)

### 3. Prepare data for inference

A copy of the book corpus dataset is available in the [book-corpus-dataset](./book-corpus-dataset/) folder. The next few cells show basic data preparation:
* Visualize some data rows.
* Replace one word in each sentence with the model's mask token so that the model can predict the masked word.
* We want this sample to run quickly, so save a smaller dataset containing a fraction of the original.


In [None]:
# Define directories and filenames as variables
dataset_dir = "book-corpus-dataset"
training_datafile = "train.jsonl"

batch_input_file = "batch_input.csv"
batch_dir = os.path.join(dataset_dir, "batch")
os.makedirs(batch_dir, exist_ok=True)

In [None]:
# Load the ./book-corpus-dataset/train.jsonl file into a pandas dataframe and show the first 5 rows
pd.set_option('display.max_colwidth', 0) # set the max column width to 0 to display the full text
train_df = pd.read_json(os.path.join(".", dataset_dir, training_datafile), lines=True)
train_df.head()

Transform the data using the masking token.

In [None]:
# Get the right mask token from huggingface
with urllib.request.urlopen(f"https://huggingface.co/api/models/{model_name}") as url:
    data = json.load(url)
    mask_token = data["mask_token"]

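# Take the value of the "text" column, replace a random word with the mask token, and save the result in the "masked_text" column.
# Note: str.replace matches substrings, so the first occurrence of the chosen word is masked even if it falls inside a longer word.
train_df["masked_text"] = train_df["text"].apply(lambda x: x.replace(random.choice(x.split()), mask_token, 1))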

# Save the train_df dataframe to a jsonl file in the ./book-corpus-dataset folder with the `masked_` prefix
masked_datafile = os.path.join(".", dataset_dir, "masked_" + training_datafile)
train_df.to_json(masked_datafile, orient="records", lines=True)
train_df.head()
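
Substring replacement can place the mask inside a longer word, for example when the chosen word is "a" and an earlier word contains that letter. As an alternative, here is a minimal sketch of word-level masking. `mask_one_word` is a hypothetical helper, and the default `"[MASK]"` token is an assumption that matches BERT-style models; in the notebook we would pass the `mask_token` fetched above.

```python
import random


def mask_one_word(text, mask_token="[MASK]", rng=random):
    """Replace exactly one whitespace-delimited word in `text` with `mask_token`."""
    words = text.split()
    if not words:
        return text  # nothing to mask in an empty or whitespace-only string
    idx = rng.randrange(len(words))  # pick a word position, not a word value
    words[idx] = mask_token
    return " ".join(words)


# Seeded generator so the example is repeatable
example = mask_one_word("a man and a dog walked home", rng=random.Random(0))
print(example)
```

Note that joining on a single space collapses any repeated whitespace in the original text, which is usually acceptable for this kind of input.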

Save a small sample (1% of the masked data) to a file for testing batch inference. The MLflow model's signature specifies that the input should be a column named `"input_string"`, so rename the transformed `"masked_text"` column.

In [None]:
batch_df = train_df[['masked_text']].rename(columns={'masked_text': 'input_string'}).sample(frac=0.01)
batch_df.to_csv(os.path.join(batch_dir, batch_input_file), quoting=csv.QUOTE_ALL)
batch_df.head()
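
Because the model expects an `input_string` column, it can help to check the CSV header before submitting a job. The sketch below uses only the standard library. `has_input_column` is a hypothetical helper, and the sample text is an illustrative guess at what a fully quoted file with an unnamed index column looks like, not actual output.

```python
import csv
import io


def has_input_column(csv_text, column="input_string"):
    """Return True if the first row of `csv_text` contains `column`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return column in header


# Illustrative sample: unnamed index column followed by the model input column
sample = '"","input_string"\n"42","the [MASK] sat on the mat"\n'
print(has_input_column(sample))
```

For the file written above, we could read it with `open(os.path.join(batch_dir, batch_input_file)).read()` and pass the text to the helper.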

### 4. Deploy the model to a batch endpoint
Batch endpoints are used to batch score large datasets by running jobs. They receive pointers to data and run jobs asynchronously to process the data in parallel on compute clusters. Batch endpoints store outputs to a data store for further analysis.

* Create a batch endpoint.
* Create a batch deployment.
* Set the deployment as default; doing so allows invoking the endpoint without specifying the deployment's name.

#### Create the endpoint.

In [None]:
# Endpoint names need to be unique in a region, hence using timestamp to create unique endpoint name
timestamp = int(time.time())
endpoint_name = "fill-mask-" + str(timestamp)

endpoint = BatchEndpoint(
    name=endpoint_name,
    description="Batch endpoint for " + foundation_model.name + ", for fill-mask task"
)
workspace_ml_client.begin_create_or_update(endpoint).result()

#### Create the deployment.

In [None]:
deployment_name = "demo"

deployment = BatchDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=foundation_model.id,
    compute=compute_name,
    error_threshold=0,
    instance_count=1,
    logging_level="info",
    max_concurrency_per_instance=2,
    mini_batch_size=10,
    output_file_name="predictions.csv",
    retry_settings=BatchRetrySettings(max_retries=3, timeout=300),
)
workspace_ml_client.begin_create_or_update(deployment).result()
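
As we understand it, `mini_batch_size` counts input files per mini-batch rather than rows, so a single input CSV would form one mini-batch regardless of the setting. The sketch below illustrates that grouping under this assumption. `make_mini_batches` is a hypothetical helper and does not reflect how the service is implemented internally.

```python
def make_mini_batches(files, mini_batch_size):
    """Group `files` into consecutive chunks of at most `mini_batch_size` items."""
    if mini_batch_size < 1:
        raise ValueError("mini_batch_size must be at least 1")
    return [
        files[i:i + mini_batch_size]
        for i in range(0, len(files), mini_batch_size)
    ]


# One input file, as in this sample: a single mini-batch
print(make_mini_batches(["batch_input.csv"], 10))

# Twenty-five hypothetical files: three mini-batches of sizes 10, 10, and 5
many_files = [f"part_{i}.csv" for i in range(25)]
print([len(batch) for batch in make_mini_batches(many_files, 10)])
```

Under this reading, splitting a large input into several files is what would let `instance_count` and `max_concurrency_per_instance` process data in parallel.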

#### Set the deployment as default.

In [None]:
endpoint = workspace_ml_client.batch_endpoints.get(endpoint_name)
endpoint.defaults.deployment_name = deployment_name
workspace_ml_client.begin_create_or_update(endpoint).wait()

endpoint = workspace_ml_client.batch_endpoints.get(endpoint_name)
print(f"The default deployment is {endpoint.defaults.deployment_name}")

### 5. Run a batch inference job

Invoke the batch endpoint with the input parameter pointing to the folder containing the batch inference input. This creates a pipeline job using the default deployment in the endpoint. Wait for the job to complete.

In [None]:
# Use a name other than `input` to avoid shadowing the Python built-in
job_input = Input(
    path=batch_dir,
    type=AssetTypes.URI_FOLDER)

job = workspace_ml_client.batch_endpoints.invoke(
    endpoint_name=endpoint.name,
    input=job_input)

workspace_ml_client.jobs.stream(job.name)
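
Streaming blocks until the job finishes and prints its logs. If we only need the final status, a polling loop is an alternative. The following is a minimal sketch: `wait_for_terminal_status` is a hypothetical helper, the set of terminal status strings is an assumption to check against what your jobs report, and the simulated status sequence stands in for a real job.

```python
import time

# Assumed terminal job states; verify against the statuses your job reports
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll `get_status()` until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in status {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence so the sketch runs without a workspace
_statuses = iter(["Queued", "Running", "Completed"])
final_status = wait_for_terminal_status(
    lambda: next(_statuses), poll_seconds=1, sleep=lambda seconds: None
)
print(final_status)
```

Against a real job, the status callable could be `lambda: workspace_ml_client.jobs.get(job.name).status`.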

### 6. Review inference predictions
Download the predictions from the job output and review the predictions using a dataframe.

In [None]:
scoring_job = list(workspace_ml_client.jobs.list(parent_job_name=job.name))[0]

workspace_ml_client.jobs.download(
    name=scoring_job.name,
    download_path=dataset_dir,
    output_name="score")

predictions_file = os.path.join(dataset_dir, "named-outputs", "score", "predictions.csv")

# Load the batch predictions file with no headers into a dataframe and set your column names.
score_df = pd.read_csv(
    predictions_file,
    header=None,
    index_col=0,
    names= ['prediction', 'batch_input_file_name'])
score_df.head()

Join the predictions with input data to compare ground truth with predictions.

In [None]:
# Drop the batch_input_file_name column as it is not needed for reference since we only scored one file.
score_df = score_df.drop(columns=['batch_input_file_name'])

# Set the index from the batch input file.
# This assumes the predictions are returned in the same order as the input rows.
score_df.set_index(batch_df.index, inplace=True)

In [None]:
# Join the ground truth dataframe with the score_df dataframe on the index row.
df = score_df.join(train_df)

# Show the first 10 rows of the dataframe.
df.head(10)
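
To compare a prediction with the ground truth, we need the word that was masked, which the joined dataframe does not store directly. A minimal sketch of recovering it from the original and masked sentences follows. `recover_masked_word` is a hypothetical helper, and the default `"[MASK]"` token is an assumption that matches BERT-style models.

```python
def recover_masked_word(original, masked, mask_token="[MASK]"):
    """Return the text in `original` that `mask_token` replaced in `masked`, or None."""
    pos = masked.find(mask_token)
    if pos == -1:
        return None  # no mask token present
    prefix = masked[:pos]
    suffix = masked[pos + len(mask_token):]
    # The text around the mask must match the original sentence
    if not (original.startswith(prefix) and original.endswith(suffix)):
        return None
    end = len(original) - len(suffix)
    if end < len(prefix):
        return None  # prefix and suffix overlap, so the inputs are inconsistent
    return original[len(prefix):end]


print(recover_masked_word("the cat sat", "the [MASK] sat"))
```

Applied row by row, for example with `df.apply(lambda r: recover_masked_word(r["text"], r["masked_text"], mask_token), axis=1)`, this would give a ground-truth column to place next to the predictions.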

### 7. Clean up resources
Batch endpoints use compute resources only when jobs are submitted. You can keep the batch endpoint for your reference without worrying about compute bills, or choose to delete the endpoint. If you created your compute cluster to have zero minimum instances and scale down soon after being idle, you won't be charged for an unused compute.

In [None]:
workspace_ml_client.batch_endpoints.begin_delete(name=endpoint_name).result()
workspace_ml_client.compute.begin_delete(name=compute_name).result()