## Deploy fill-mask models from HuggingFaceHub to AzureML Online Endpoints

This sample shows how to deploy `fill-mask` models from the Hugging Face Hub to an online endpoint for inference. Learn more about the `fill-mask` task: https://huggingface.co/tasks/fill-mask.

A large set of models hosted on [Hugging Face Hub](https://huggingface.co/models) is available in the Hugging Face Hub collection in the AzureML Model Catalog. This collection is powered by the Hugging Face Hub community registry. Integration with the AzureML Model Catalog enables seamless deployment of Hugging Face Hub models in AzureML.

### Outline
* Set up prerequisites.
* Pick a model to deploy.
* Deploy the model for real-time inference.
* Try sample inference.
* Clean up resources.

### Set up prerequisites
* Install dependencies.
* Connect to the AzureML workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace `<WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` below.
* Connect to the `HuggingFaceHub` community registry.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

# Prefer the default credential chain; fall back to an interactive browser
# login if a management token cannot be acquired.
try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception:
    credential = InteractiveBrowserCredential()
# connect to a workspace
workspace_ml_client = None
try:
    workspace_ml_client = MLClient.from_config(credential)
    subscription_id = workspace_ml_client.subscription_id
    workspace = workspace_ml_client.workspace_name
    resource_group = workspace_ml_client.resource_group_name
except Exception as ex:
    print(ex)
    # Enter details of your workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<WORKSPACE_NAME>"
    workspace_ml_client = MLClient(
        credential, subscription_id, resource_group, workspace
    )
# Connect to the HuggingFaceHub registry
registry_ml_client = MLClient(credential, registry_name="HuggingFaceHub")

### Pick a model to deploy

Open the Model Catalog in AzureML Studio and choose the Hugging Face Hub collection. Filter by the `fill-mask` task and search for any specific models you are interested in. This example uses the `bert-base-uncased` model. If you plan to deploy a different model, replace the model name and version accordingly.
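
The cell below pins the model version to `"19"`. If you would rather select the newest registered version, the following is a minimal sketch of one way to do it. It assumes version strings are plain integers, which may not hold for every model, and `pick_latest_version` is a hypothetical helper, not part of the SDK.

```python
def pick_latest_version(versions):
    """Return the highest version from an iterable of version strings.

    Assumption: versions are plain integer strings such as "19".
    Non-numeric versions are skipped.
    """
    numeric = [v for v in versions if str(v).isdigit()]
    if not numeric:
        raise ValueError("no numeric versions found")
    return max(numeric, key=int)


# Hypothetical usage with the registry client from the setup cell:
# versions = [m.version for m in registry_ml_client.models.list(name=model_name)]
# foundation_model = registry_ml_client.models.get(
#     model_name, version=pick_latest_version(versions)
# )
print(pick_latest_version(["2", "19", "7"]))
```

Comparing with `key=int` avoids the string-ordering pitfall where `"7"` would sort above `"19"`.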

In [None]:
model_name = "bert-base-uncased"
foundation_model = registry_ml_client.models.get(model_name, version="19")
print(
    "\n\nUsing model name: {0}, version: {1}, id: {2} for inferencing".format(
        foundation_model.name, foundation_model.version, foundation_model.id
    )
)

### Deploy the model to an online endpoint
Online endpoints provide a durable REST API that applications can call to use the model. First create an online endpoint, then create an online deployment under it. You must specify the virtual machine SKU when creating the deployment. To find a suitable CPU or GPU SKU for a model, open the quick deployment dialog from the model page in the AzureML Model Catalog, then set that SKU as the `instance_type` in the deployment settings below.

Online endpoints typically require you to provide a scoring script and a Docker container image (through an AzureML environment) in addition to the model. You don't need to provide either for Hugging Face Hub models in the AzureML Model Catalog: these models support 'no code deployments' because the scoring script and container image are packaged along with the model.

Learn more about Online Endpoints: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-online-endpoints
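
The cell below derives a unique endpoint name from the current timestamp. As a hedged sketch, the same idea can be wrapped in a helper that also checks the name before any request is sent. The naming rule encoded here (3 to 32 characters, starting with a letter, then letters, digits, or hyphens) is an assumption to verify against the current AzureML documentation, and `make_endpoint_name` is a hypothetical helper.

```python
import re
import time

# Assumed naming rule: 3 to 32 characters, starting with a letter,
# followed by letters, digits, or hyphens.
_NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9-]{2,31}$")


def make_endpoint_name(prefix, now=None):
    """Build a timestamped endpoint name and check it against the assumed rule."""
    timestamp = int(time.time() if now is None else now)
    name = f"{prefix}-{timestamp}"
    if not _NAME_PATTERN.match(name):
        raise ValueError(f"endpoint name {name!r} does not satisfy the assumed naming rule")
    return name


print(make_endpoint_name("fill-mask", now=1700000000))
```

Failing early on an invalid name is cheaper than waiting for the service to reject the endpoint creation request.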

In [None]:
import time
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    OnlineRequestSettings,
)

# Endpoint names must be unique within a region, so append a timestamp to the name
timestamp = int(time.time())
online_endpoint_name = "fill-mask-" + str(timestamp)
# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Online endpoint for " + foundation_model.name + ", for fill-mask task",
    auth_mode="key",
)
workspace_ml_client.begin_create_or_update(endpoint).wait()

In [None]:
# create a deployment
demo_deployment = ManagedOnlineDeployment(
    name="demo",
    endpoint_name=online_endpoint_name,
    model=foundation_model.id,
    instance_type="Standard_DS2_v2",
    instance_count=1,
)
workspace_ml_client.online_deployments.begin_create_or_update(demo_deployment).wait()
# An online endpoint can host multiple deployments, with traffic split or mirrored between them.
# Route 100% of traffic to the demo deployment.
endpoint.traffic = {"demo": 100}
workspace_ml_client.begin_create_or_update(endpoint).result()

### Try sample inference

Online endpoints expose a REST API that can be integrated into your applications. Learn how to fetch the scoring REST API and credentials for online endpoints here: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-authenticate-online-endpoint
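
For applications that call the REST API directly, the request could look roughly like the sketch below. It only builds the request and does not send it. The URL and key are placeholders, `build_scoring_request` is a hypothetical helper, and the `azureml-model-deployment` routing header is an assumption to confirm against the linked documentation.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, api_key, payload, deployment_name=None):
    """Build (but do not send) a POST request for an online endpoint."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    if deployment_name:
        # Assumption: this header routes the call to one named deployment.
        headers["azureml-model-deployment"] = deployment_name
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


# Placeholder values. With the SDK they would typically come from
# workspace_ml_client.online_endpoints.get(...).scoring_uri and
# workspace_ml_client.online_endpoints.get_keys(...).primary_key.
request = build_scoring_request(
    "https://example.invalid/score",
    "<API_KEY>",
    {"inputs": ["Paris is the [MASK] of France."]},
    deployment_name="demo",
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request).read()
```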

In this example, we use the Python SDK helper method to invoke the endpoint.

In [None]:
# Get the right mask token from HuggingFaceHub
import urllib.request, json

raw_data = urllib.request.urlopen(
    "https://huggingface.co/api/models/" + foundation_model.tags["modelId"]
)
data = json.load(raw_data)
print(
    "Mask token fetched from https://huggingface.co/api/models/{0} is '{1}'".format(
        foundation_model.tags["modelId"], data["mask_token"]
    )
)

In [None]:
# check if there is sample inference data available on HuggingFaceHub for the model, else try with the backup sample data
scoring_file = "./sample_score.json"
inputs = []
if "widgetData" in data:
    # use a name that does not shadow the built-in input()
    for widget in data["widgetData"]:
        inputs.append(widget["text"])
    # write the sample_score.json file
    score_dict = {"inputs": inputs}
    with open(scoring_file, "w") as outfile:
        json.dump(score_dict, outfile)
else:
    scoring_file = "./sample_score_backup.json"

# print the sample scoring file
print("\n\nSample scoring file: ")
with open(scoring_file) as json_file:
    scoring_data = json.load(json_file)
    print(scoring_data)
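
The fallback branch above expects `./sample_score_backup.json` to already exist. If it does not, a payload in the same `{"inputs": [...]}` shape can be built from the mask token fetched earlier. This is a minimal sketch: the example sentences and the `build_fill_mask_payload` helper are hypothetical, and in the notebook you would pass `data["mask_token"]` instead of the literal token.

```python
import json


def build_fill_mask_payload(mask_token, templates):
    """Fill a "{mask}" placeholder in each template with the model's mask token."""
    return {"inputs": [t.format(mask=mask_token) for t in templates]}


# Hypothetical example sentences; "[MASK]" is the token used by bert-base-uncased.
payload = build_fill_mask_payload(
    "[MASK]",
    ["Paris is the {mask} of France.", "The goal of life is {mask}."],
)
print(json.dumps(payload, indent=2))
```

Keeping the token as a parameter matters because models differ: some use `[MASK]`, others use `<mask>`.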

In [None]:
# score the sample_score.json file using the online endpoint with the azureml endpoint invoke method
response = workspace_ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="demo",
    request_file=scoring_file,
)
response_json = json.loads(response)
print(json.dumps(response_json, indent=2))

In [None]:
# load the response into a dataframe to better visualize the results
import pandas as pd

response_df = pd.DataFrame()
for result in response_json:
    for row in result:
        df = pd.DataFrame(row, index=[0])
        response_df = pd.concat([response_df, df], ignore_index=True)

response_df.head(len(response_df))

### Delete the online endpoint
Delete the online endpoint when you are done. Otherwise, the compute used by the endpoint keeps the billing meter running.

In [None]:
workspace_ml_client.online_endpoints.begin_delete(name=online_endpoint_name).wait()