# Using Azure ML Pipelines to Train and Use Hugging Face Models for Text Summarization Tasks

**Learning Objectives** 
By the end of this two-part tutorial, you should be able to use Azure Machine Learning (AML) to finetune Hugging Face NLP models.

    

**Requirements**
In order to benefit from this tutorial, you need to have:
- basic understanding of Machine Learning projects workflow
- an Azure subscription. If you don't have an Azure subscription, [create a free account](https://aka.ms/AMLFree) before you begin.
- a working AML workspace. A workspace can be created via Azure Portal, Azure CLI, or Python SDK. [Read more](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?tabs=python).
- a Python environment
- [installed Azure Machine Learning Python SDK v2](https://github.com/Azure/azureml-examples/blob/sdk-preview/sdk/setup.sh)
- familiarity with Hugging Face framework


**Motivations** 
In this tutorial, we will create an AML pipeline to finetune a Hugging Face model. Specifically, we train a lightweight transformer model (`t5-small`) on the CNN/DailyMail news dataset to perform text summarization. We then evaluate the trained model's performance on a specific domain (here, medical), finetune the model on that domain, and compare the results.


# Introduction
Text summarization is a task that condenses a text, usually an article, into a few sentences. The idea is to keep the key informational elements of the context. Since the output of this task is a short paragraph, evaluation should be done using text comparison methods. One widely used scoring method is the [ROUGE](https://aclanthology.org/W04-1013/) score, which counts the unigrams, bigrams, and subsequences shared between the candidate and the reference texts to evaluate the quality of the generated summary.
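
To make the n-gram counting concrete, here is a minimal sketch of a ROUGE-N style score. It is a simplification for illustration only: it assumes lowercased whitespace tokenization and skips the stemming and longest-common-subsequence variants that full ROUGE implementations provide. The function names `ngrams` and `rouge_n` are ours, not part of any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Toy ROUGE-N: n-gram overlap precision, recall and F1 on whitespace tokens."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts only as often as it occurs in both texts
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


unigram_scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
bigram_scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
print(unigram_scores)
print(bigram_scores)
```

In this example five of the six unigrams and three of the five bigrams are shared, so the unigram score is higher than the bigram score.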

Text summarization metrics are not standardized, so it is good practice to create a baseline against which to compare the results we get from our transformer models. A simple baseline is to use the first three sentences of the text as the summary.
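
The first-three-sentences baseline can be sketched in a few lines. This is an illustrative sketch, not the code of the baseline component: it assumes a naive sentence splitter based on punctuation, and `lead_n_summary` is a hypothetical helper name.

```python
import re


def lead_n_summary(text, n=3):
    """Return the first n sentences of text, using a naive punctuation-based splitter."""
    # Split on whitespace that follows '.', '!' or '?'; abbreviations will fool this
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])


article = "First point. Second point! Third point? Fourth point. Fifth point."
summary = lead_n_summary(article)
print(summary)  # First point. Second point! Third point?
```

A production baseline would likely use a proper sentence tokenizer, since splitting on punctuation alone breaks on abbreviations and decimal numbers.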

![](media/baseline_train_evaluate.png)

In this work, we first run the baseline evaluator to calculate some reference summarization scores, then we prepare our data for finetuning, finetune a model, and evaluate the model performance against the prelabeled data.

### Connect to AzureML

Before we dive into the code, we'll need to create an instance of `MLClient` to connect to Azure ML. Please provide the references to your workspace below.



In [None]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

### Provision the required resources for this notebook
We'll need a compute cluster for this notebook; you can use either a CPU cluster or a GPU cluster. First, let's create a minimal cluster for the task.

In [None]:
# Set your preferred compute type here
USE_GPU = True

from azure.ai.ml.entities import AmlCompute

# Let's create the AML compute object with the intended parameters
cluster_basic = AmlCompute(
    # Name assigned to the compute cluster
    name="gpu-cluster" if USE_GPU else "cpu-cluster",
    # AML Compute is AML's on-demand VM service
    type="amlcompute",
    # VM Family: 1 x NVIDIA Tesla K80 or 14 GB RAM, 4 CPU VM
    size="Standard_NC6" if USE_GPU else "Standard_DS3_v2",
    # Minimum running nodes when there is no job running
    min_instances=0,
    # Maximum number of nodes in the cluster
    max_instances=6,
    # How many seconds a node keeps running after its job terminates
    idle_time_before_scale_down=600,
    # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
    tier="Dedicated",
)

# Now, we pass the object to the client's begin_create_or_update method
cluster_basic = ml_client.begin_create_or_update(cluster_basic)

print(
    f"AMLCompute with name {cluster_basic.name} is created, the compute size is {cluster_basic.size}"
)


# 1. Preparing the Resources

## 1.1. Create a Job Environment
So far, in the requirements section, we have created a development environment on our development machine. AML also needs to know what environment to use for each step of the pipeline. We can use any published Docker image as is, or add our required dependencies to the image. In this example, we create a conda environment for our jobs, using a [conda yaml file](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually) and add it to an Ubuntu image in Microsoft Container Registry. For more information on AML environments and Azure Container Registries, please check the [AML documentation](https://docs.microsoft.com/en-us/azure/machine-learning/concept-environments).


In [None]:
from azure.ai.ml.entities import Environment
import os

custom_env_name = "transformers-gpu" if USE_GPU else "transformers-cpu"

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for transformer model training",
    tags={"transformers": ">4.11.0"},
    conda_file=os.path.join("dependencies", "transformers_conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04"
    if USE_GPU
    else "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20220218.v1",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is created, the version is {pipeline_job_env.version}"
)


## 1.2. Create or Load Components
Now that we have our workspace, compute and input data ready, let's work on the individual steps of our pipeline. 


### 1.2.1 Baseline Evaluation Component
We have implemented a baseline component below. It treats the first three sentences of each article as the summary and evaluates that against the provided labels using ROUGE scores.

In [None]:
# importing the CommandComponent Package
from azure.ai.ml.entities import CommandComponent

src_dir = "src/summarization/"


baseline_component = CommandComponent(
    # Name of the component
    name="baseline_summarization",
    # Component Version, no Version and the component will be automatically versioned
    # version="26",
    # The dictionary of the inputs. Each item is a dictionary itself.
    inputs=dict(
        dataset_name=dict(type="string"),
        dataset_config=dict(type="string"),
        text_column=dict(type="string"),
        summary_column=dict(type="string"),
        max_samples=dict(type="integer", default=-1),
    ),
    # The source folder of the component
    code=src_dir,
    # The environment the component job will be using
    environment=f"{pipeline_job_env.name}@latest",
    # The command that will be run in the component
    command="python baseline_evaluation.py --dataset_name ${{inputs.dataset_name}} --dataset_config ${{inputs.dataset_config}} "
    "--text_column ${{inputs.text_column}} --summary_column ${{inputs.summary_column}} "
    "--max_samples ${{inputs.max_samples}}",
)

baseline_component = ml_client.create_or_update(baseline_component)

print(
    f"Component {baseline_component.name} with Version {baseline_component.version} is registered"
)


### 1.2.2 Data Preparation Component
We can use a preprocessing step to save time on dataset download and tokenization. This is especially useful in scenarios involving hyperparameter optimization.
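
As a rough picture of what the preprocessing step fixes ahead of training, the sketch below mimics prefixing, truncation, and padding to a fixed length. It is a toy stand-in under stated assumptions: a whitespace split replaces the real subword tokenizer, and `encode` is a hypothetical helper, not a function from `data_prep.py` or from the Hugging Face libraries.

```python
def encode(text, max_length, prefix="", pad_token="<pad>"):
    """Toy stand-in for a tokenizer call: prefix, split, truncate, then pad."""
    tokens = (prefix + text).split()
    tokens = tokens[:max_length]  # truncation to at most max_length tokens
    tokens += [pad_token] * (max_length - len(tokens))  # pad up to max_length
    return tokens


encoded = encode("Doctors diagnosed the patient", max_length=8, prefix="summarize: ")
print(encoded)
```

Because every example ends up with the same length, the encoded dataset can be stored once and reused by several training runs, which is where the time saving comes from.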

In [None]:
# importing the CommandComponent Package
from azure.ai.ml.entities import CommandComponent

src_dir = "src/summarization/"


data_prep_component = CommandComponent(
    # Name of the component
    name="data_prep_summarization",
    # Component Version, no Version and the component will be automatically versioned
    # version="26",
    # The dictionary of the inputs. Each item is a dictionary itself.
    inputs=dict(
        dataset_name=dict(type="string"),
        dataset_config=dict(type="string"),
        text_column=dict(type="string"),
        summary_column=dict(type="string"),
        max_samples=dict(type="integer", default=-1),
        max_input_length=dict(type="integer", optional=True, default=512),
        max_target_length=dict(type="integer", optional=True, default=40),
        padding=dict(type="string", optional=True, default="max_length"),
        model_checkpoint=dict(type="string"),
        source_prefix=dict(type="string", optional=True, default=None),
    ),
    # The dictionary of the outputs. Each item is a dictionary itself.
    outputs=dict(
        encodings=dict(type="path"),
    ),
    # The source folder of the component
    code=src_dir,
    # The environment the component job will be using
    environment=f"{pipeline_job_env.name}@latest",
    # The command that will be run in the component
    command="python data_prep.py --dataset_name ${{inputs.dataset_name}} --dataset_config ${{inputs.dataset_config}} "
    "--text_column ${{inputs.text_column}} --summary_column ${{inputs.summary_column}} "
    "--max_samples ${{inputs.max_samples}} --model_checkpoint ${{inputs.model_checkpoint}} "
    "[--max_input_length ${{inputs.max_input_length}}]  [--max_target_length ${{inputs.max_target_length}}] "
    "[--padding ${{inputs.padding}}] [--source_prefix ${{inputs.source_prefix}}] "
    "--encodings ${{outputs.encodings}}",
)

data_prep_component = ml_client.create_or_update(data_prep_component)

print(
    f"Component {data_prep_component.name} with Version {data_prep_component.version} is registered"
)


### 1.2.3 Train Component
We have created a generic summarization script that can handle training and evaluation tasks with the right arguments. The code is based on [`run_summarization.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization.py) in the Hugging Face examples repository. We use the same script to create two components, one for training and one for evaluation. The design choice was arbitrary, as a single component with different inputs could be used for both tasks.

Here both data-related and training-related arguments are exposed. We invoke the training action by adding the `--do_train` argument to the script's command.
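
The idea of one script switching between training and evaluation through flags can be sketched with `argparse`. This is a hypothetical illustration only: the actual script may parse its arguments differently, and `build_parser` and `planned_actions` are names invented for this sketch.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing the two action flags and one training argument."""
    parser = argparse.ArgumentParser(description="Toy train/eval switch")
    parser.add_argument("--do_train", action="store_true", help="run the training loop")
    parser.add_argument("--do_eval", action="store_true", help="run evaluation")
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    return parser


def planned_actions(argv):
    """Return the list of actions the script would perform for the given arguments."""
    args = build_parser().parse_args(argv)
    actions = []
    if args.do_train:
        actions.append("train")
    if args.do_eval:
        actions.append("evaluate")
    return actions


print(planned_actions(["--do_train", "--do_eval"]))  # ['train', 'evaluate']
print(planned_actions(["--do_eval"]))  # ['evaluate']
```

The train component's command passes both flags, while the evaluate component's command passes only `--do_eval`.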


In [None]:
# importing the CommandComponent Package
from azure.ai.ml.entities import CommandComponent
from azure.ai.ml import Input

src_dir = "src/summarization/"


train_component = CommandComponent(
    # Name of the component
    name="train_summarization",
    # Component Version, no Version and the component will be automatically versioned
    # version="26",
    # The dictionary of the inputs. Each item is a dictionary itself.
    inputs=dict(
        dataset_name=dict(type="string", optional=True),
        dataset_config=dict(type="string", optional=True),
        text_column=dict(type="string", optional=True),
        summary_column=dict(type="string", optional=True),
        preprocessed_datasets=dict(type="path", optional=True, default=None),
        model_name=dict(type="string", optional=True, default=None),
        registered_model_name=dict(type="string", optional=True, default=None),
        max_samples=dict(type="integer", default=-1),
        model_path=dict(type="path", optional=True, default=None),
        learning_rate=dict(type="number", default=2e-5),
        num_train_epochs=dict(type="integer", default=5),
        per_device_train_batch_size=dict(type="integer", default=16),
        per_device_eval_batch_size=dict(type="integer", default=16),
        source_prefix=dict(type="string", optional=True, default=None),
    ),
    # The dictionary of the outputs. Each item is a dictionary itself.
    outputs=dict(
        trained_model_path=dict(type="path"),
    ),
    # The source folder of the component
    code=src_dir,
    # The environment the component job will be using
    environment=f"{pipeline_job_env.name}@latest",
    distribution=dict(type="pytorch", process_count_per_instance=1),  # number of gpus
    resources=dict(instance_count=1),  # number of nodes
    # The command that will be run in the component
    command="python run_summarization.py [--dataset_name ${{inputs.dataset_name}}] [--dataset_config ${{inputs.dataset_config}}] "
    "[--text_column ${{inputs.text_column}}] [--summary_column ${{inputs.summary_column}}] [--preprocessed_datasets ${{inputs.preprocessed_datasets}}] "
    "--learning_rate ${{inputs.learning_rate}} --per_device_train_batch_size ${{inputs.per_device_train_batch_size}} "
    "--per_device_eval_batch_size ${{inputs.per_device_eval_batch_size}} --max_samples ${{inputs.max_samples}} "
    "[--model_name ${{inputs.model_name}}] [--registered_model_name ${{inputs.registered_model_name}}] --output_dir outputs "
    "--num_train_epochs ${{inputs.num_train_epochs}}  --trained_model_path ${{outputs.trained_model_path}} "
    "--disable_tqdm True --do_train --do_eval [--source_prefix ${{inputs.source_prefix}}] [--model_path ${{inputs.model_path}}]",
)

train_component = ml_client.create_or_update(train_component)

print(
    f"Component {train_component.name} with Version {train_component.version} is registered"
)


### 1.2.4 Evaluate Component
For the evaluate component, only the data arguments are required.

In [None]:
# importing the CommandComponent Package
from azure.ai.ml.entities import CommandComponent

src_dir = "src/summarization/"


evaluate_component = CommandComponent(
    # Name of the component
    name="evaluate_summarization",
    # Component Version, no Version and the component will be automatically versioned
    # version="26",
    # The dictionary of the inputs. Each item is a dictionary itself.
    inputs=dict(
        dataset_name=dict(type="string", optional=True),
        dataset_config=dict(type="string", optional=True),
        text_column=dict(type="string", optional=True),
        summary_column=dict(type="string", optional=True),
        preprocessed_datasets=dict(type="path", optional=True, default=None),
        model_name=dict(type="string", optional=True, default=None),
        max_samples=dict(type="integer", default=-1),
        model_path=dict(type="path", optional=True, default=None),
    ),
    # The dictionary of the outputs. Each item is a dictionary itself.
    outputs=dict(
        trained_model_path=dict(type="path"),
    ),
    # The source folder of the component
    code=src_dir,
    # The environment the component job will be using
    environment=f"{pipeline_job_env.name}@latest",
    # The command that will be run in the component
    command="python run_summarization.py [--dataset_name ${{inputs.dataset_name}}] [--dataset_config ${{inputs.dataset_config}}] "
    "[--text_column ${{inputs.text_column}}] [--summary_column ${{inputs.summary_column}}] [--preprocessed_datasets ${{inputs.preprocessed_datasets}}] "
    "[--model_name ${{inputs.model_name}}] --max_samples ${{inputs.max_samples}} --output_dir outputs "
    "[--model_path ${{inputs.model_path}}] --trained_model_path ${{outputs.trained_model_path}} --do_eval",
)

evaluate_component = ml_client.create_or_update(evaluate_component)

print(
    f"Component {evaluate_component.name} with Version {evaluate_component.version} is registered"
)


# 2. Train and Evaluate Hugging Face Model in AML

## 2.1 Creating Azure ML Pipeline
The created components can be connected to each other as steps in a pipeline. For training, we use the [CNN Dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset for text summarization. To mimic the scenario we explained in the first section, we use the [PubMed dataset](https://huggingface.co/datasets/ccdv/pubmed-summarization) for evaluation, to estimate the performance of a generically trained model on a medical-domain task.

Our training component supports distributed training; the distribution settings must be set when the pipeline is built. Depending on the compute we are using, we might be able to increase the number of GPUs per node.
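
The two settings multiply: the sketch below shows how the node and GPU counts translate into a number of training processes and an effective batch size. It assumes plain data-parallel training with one process per GPU and no gradient accumulation; `distributed_layout` is a hypothetical helper written for this illustration.

```python
def distributed_layout(num_nodes, num_gpus_per_node, per_device_batch_size):
    """Compute the process count and effective batch size of a data-parallel run."""
    world_size = num_nodes * num_gpus_per_node  # one process per GPU
    return {
        "world_size": world_size,
        "effective_batch_size": world_size * per_device_batch_size,
    }


layout = distributed_layout(num_nodes=4, num_gpus_per_node=1, per_device_batch_size=8)
print(layout)  # {'world_size': 4, 'effective_batch_size': 32}
```

With four single-GPU nodes and a per-device batch size of 8, each optimizer step sees 32 examples, which may be worth keeping in mind when choosing the learning rate.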

In [None]:
# Set the number of nodes and number of GPUs per node
num_nodes = 4
num_gpus_per_node = 1

In [None]:
# Import the pipeline DSL along with the Input and Output types
from azure.ai.ml import dsl, Input, Output


# the dsl decorator tells the sdk that we are defining an AML pipeline
@dsl.pipeline(
    name="summarization-example_pipeline",
    compute="gpu-cluster" if USE_GPU else "cpu-cluster",
    description="Text Summarization pipeline",
)
def txt_summarization_pipeline(
    model_name, max_samples=-1, num_train_epochs=1, batch_size=8, learning_rate=5e-5
):
    baseline_step = baseline_component(
        dataset_name="ccdv/pubmed-summarization",
        dataset_config="section",
        text_column="article",
        summary_column="abstract",
        max_samples=max_samples,
    )
    baseline_step.compute = "cpu-cluster"

    data_prep_for_training_step = data_prep_component(
        dataset_name="cnn_dailymail",
        dataset_config="3.0.0",
        text_column="article",
        summary_column="highlights",
        max_samples=max_samples,
        max_input_length=512,
        max_target_length=40,
        padding="max_length",
        model_checkpoint=model_name,
        source_prefix="summarize: ",
    )

    train_step = train_component(
        preprocessed_datasets=data_prep_for_training_step.outputs.encodings,
        model_name=model_name,
        max_samples=max_samples,
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        source_prefix="summarize: ",
    )

    train_step.distribution = dict(
        type="pytorch", process_count_per_instance=num_gpus_per_node
    )  # number of gpus
    train_step.resources = dict(instance_count=num_nodes)  # number of nodes

    data_prep_for_evaluation_step = data_prep_component(
        dataset_name="ccdv/pubmed-summarization",
        dataset_config="section",
        text_column="article",
        summary_column="abstract",
        max_samples=max_samples,
        max_input_length=512,
        max_target_length=40,
        padding="max_length",
        model_checkpoint=model_name,
        source_prefix="summarize: ",
    )

    evaluate_step = evaluate_component(
        preprocessed_datasets=data_prep_for_evaluation_step.outputs.encodings,
        model_name="",
        model_path=train_step.outputs.trained_model_path,
        max_samples=max_samples,
    )


Let's now use our pipeline definition to instantiate a pipeline with the parameters we choose for our run. Let's start with a quick run using `max_samples=1000` and `num_train_epochs=1` to validate the whole pipeline.

In [None]:
# Let's instantiate the pipeline with the parameters of our choice
model_name = "t5-small"
max_samples = 1000
num_train_epochs = 1
batch_size = 8
learning_rate = 5e-5


pipeline = txt_summarization_pipeline(
    model_name=model_name,
    max_samples=max_samples,
    num_train_epochs=num_train_epochs,
    batch_size=batch_size,
    learning_rate=learning_rate,
)


## 2.2. Submitting a Job to AML Workspace
It is now time to submit the job for running in AML. This time we use `create_or_update` on `ml_client.jobs`. Here we also pass an experiment name. An experiment is a container for all the iterations one does on a certain project. All the jobs submitted under the same experiment name are listed next to each other in AML studio.

In [None]:
import webbrowser

# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline,
    # Project's name
    experiment_name="text-summarization-example",
    # If there is no dependency, pipeline run will continue even after the failure of one component
    continue_run_on_step_failure=True,
    tags={
        "model_name": model_name,
        "max_samples": max_samples,
        "num_train_epochs": num_train_epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "num_nodes": num_nodes,
        "num_gpus_per_node":num_gpus_per_node 
    },
)
# get a URL for the status of the job
webbrowser.open(returned_job.services["Studio"].endpoint)


Wait until the pipeline run ends. Now it is time to call the pipeline with more data points and a more effective number of epochs.


In [None]:
model_name = "t5-small"
max_samples = 10000
num_train_epochs = 5
batch_size = 8
learning_rate = 5e-5


pipeline = txt_summarization_pipeline(
    model_name=model_name,
    max_samples=max_samples,
    num_train_epochs=num_train_epochs,
    batch_size=batch_size,
    learning_rate=learning_rate,
)


In [None]:
# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline,
    # Project's name
    experiment_name="text-summarization-example",
    # If there is no dependency, pipeline run will continue even after the failure of one component
    continue_run_on_step_failure=True,
    tags={
        "model_name": model_name,
        "max_samples": max_samples,
        "num_train_epochs": num_train_epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "num_nodes": num_nodes,
        "num_gpus_per_node":num_gpus_per_node 
    },
)
# get a URL for the status of the job
webbrowser.open(returned_job.services["Studio"].endpoint)

Checking the results and comparing them with the baseline, it seems that there is room for finetuning the model to be more domain-aware. Here we use the PubMed dataset to finetune the model we trained in the previous stage and see the effect on the evaluation metrics.

![](media/baseline_train_finetune_evaluate.png)

Let's use the same components but expand the pipeline to include the extra finetuning step. There will be no cost for rerunning the previous steps, as the system will reuse the results if no change is made to the inputs.
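
One way to picture this reuse is a cache keyed on a step's name and inputs. The sketch below is a toy mental model only and makes no claim about how Azure ML implements reuse internally; `run_step` and `fake_data_prep` are hypothetical names.

```python
import hashlib
import json

_results = {}


def run_step(name, inputs, fn):
    """Run fn(**inputs) unless an identical (name, inputs) pair has already run."""
    key_source = json.dumps({"name": name, "inputs": inputs}, sort_keys=True)
    key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
    if key in _results:
        return _results[key], True  # result reused, nothing recomputed
    _results[key] = fn(**inputs)
    return _results[key], False  # result freshly computed


def fake_data_prep(dataset_name, max_samples):
    """Placeholder for an expensive step."""
    return f"{dataset_name}:{max_samples}"


first = run_step("data_prep", {"dataset_name": "cnn_dailymail", "max_samples": 1000}, fake_data_prep)
second = run_step("data_prep", {"dataset_name": "cnn_dailymail", "max_samples": 1000}, fake_data_prep)
third = run_step("data_prep", {"dataset_name": "cnn_dailymail", "max_samples": 10000}, fake_data_prep)
print(first[1], second[1], third[1])  # False True False
```

In this picture, changing `max_samples` produces a new key, so the step runs again, while an unchanged step returns its stored output.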

| Method | eval_rouge1 | eval_rouge2 | eval_rougeL | eval_rougeLsum |
|--------|-------------|-------------|-------------|----------------|
| baseline | 26.99 | 9.16 | 16.95 | 24.38 |
| t5-small trained on CNN news | 25.963 | 10.028 | 20.131 | 21.801 |
| t5-small trained on CNN news, then finetuned on PubMed | 28.399 | 11.477 | 22.814 | 24.292 |
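
To read the table more easily, the differences between rows can be computed directly. The snippet below copies the values from the table above; the dictionary keys and the `delta` helper are names chosen for this sketch.

```python
scores = {
    "baseline": {"rouge1": 26.99, "rouge2": 9.16, "rougeL": 16.95, "rougeLsum": 24.38},
    "cnn": {"rouge1": 25.963, "rouge2": 10.028, "rougeL": 20.131, "rougeLsum": 21.801},
    "cnn_pubmed": {"rouge1": 28.399, "rouge2": 11.477, "rougeL": 22.814, "rougeLsum": 24.292},
}


def delta(method, reference):
    """Per-metric difference (method minus reference), rounded to three decimals."""
    return {m: round(scores[method][m] - scores[reference][m], 3) for m in scores[method]}


finetune_gain = delta("cnn_pubmed", "cnn")
versus_baseline = delta("cnn_pubmed", "baseline")
print(finetune_gain)
print(versus_baseline)
```

Finetuning on PubMed improves every metric over the CNN-only model, and the finetuned model is ahead of the baseline on all metrics except `eval_rougeLsum`, where it is marginally lower.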

In [None]:
# Set the number of nodes and number of GPUs per node
num_nodes = 4
num_gpus_per_node = 1

In [None]:
# the dsl decorator tells the sdk that we are defining an AML pipeline
@dsl.pipeline(
    name="summarization-example_pipeline",
    compute="gpu-cluster" if USE_GPU else "cpu-cluster",
    description="Text Summarization pipeline",
)
def txt_summarization_pipeline(
    model_name, max_samples=-1, num_train_epochs=1, batch_size=8, learning_rate=5e-5
):
    baseline_step = baseline_component(
        dataset_name="ccdv/pubmed-summarization",
        dataset_config="section",
        text_column="article",
        summary_column="abstract",
        max_samples=max_samples,
    )
    baseline_step.compute = "cpu-cluster"

    data_prep_for_training_step = data_prep_component(
        dataset_name="cnn_dailymail",
        dataset_config="3.0.0",
        text_column="article",
        summary_column="highlights",
        max_samples=max_samples,
        max_input_length=512,
        max_target_length=40,
        padding="max_length",
        model_checkpoint=model_name,
        source_prefix="summarize: ",
    )

    train_step = train_component(
        preprocessed_datasets=data_prep_for_training_step.outputs.encodings,
        model_name=model_name,
        max_samples=max_samples,
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        source_prefix="summarize: ",
    )

    train_step.distribution = dict(
        type="pytorch", process_count_per_instance=num_gpus_per_node
    )  # number of gpus
    train_step.resources = dict(instance_count=num_nodes)  # number of nodes

    data_prep_for_evaluation_step = data_prep_component(
        dataset_name="ccdv/pubmed-summarization",
        dataset_config="section",
        text_column="article",
        summary_column="abstract",
        max_samples=max_samples,
        max_input_length=512,
        max_target_length=40,
        padding="max_length",
        model_checkpoint=model_name,
        source_prefix="summarize: ",
    )

    evaluate_step = evaluate_component(
        preprocessed_datasets=data_prep_for_evaluation_step.outputs.encodings,
        model_name="",
        model_path=train_step.outputs.trained_model_path,
        max_samples=max_samples,
    )

    train_step_2 = train_component(
        preprocessed_datasets=data_prep_for_evaluation_step.outputs.encodings,
        registered_model_name="t5-small-cnn-pubmed",
        model_path=train_step.outputs.trained_model_path,
        max_samples=max_samples,
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        source_prefix="summarize: ",
    )
    train_step_2.distribution = dict(
        type="pytorch", process_count_per_instance=num_gpus_per_node
    )  # number of gpus
    train_step_2.resources = dict(instance_count=num_nodes)  # number of nodes
    
    # This is not effective yet
    train_step_2.comment = "fine tuning on medical data"

    evaluate_step_2 = evaluate_component(
        preprocessed_datasets=data_prep_for_evaluation_step.outputs.encodings,
        model_name="",
        model_path=train_step_2.outputs.trained_model_path,
        max_samples=max_samples,
    )


In [None]:
model_name = "t5-small"
max_samples = 10000
num_train_epochs = 5
batch_size = 8
learning_rate = 5e-5


pipeline = txt_summarization_pipeline(
    model_name=model_name,
    max_samples=max_samples,
    num_train_epochs=num_train_epochs,
    batch_size=batch_size,
    learning_rate=learning_rate,
)
import webbrowser

# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline,
    # Project's name
    experiment_name="text-summarization-example",
    # If there is no dependency, pipeline run will continue even after the failure of one component
    continue_run_on_step_failure=True,
    tags={
        "model_name": model_name,
        "max_samples": max_samples,
        "num_train_epochs": num_train_epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "num_nodes": num_nodes,
        "num_gpus_per_node":num_gpus_per_node 
    },
)
# get a URL for the status of the job
webbrowser.open(returned_job.services["Studio"].endpoint)


# 4. Deploy the Model as an Online Endpoint
Let's learn how to deploy your machine learning model as a web service in the Azure cloud (see the [SDK v1 deployment documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where?tabs=azcli)).
A typical situation for a deployed machine learning service is that you need the following resources:

 - The model assets (files, metadata) that you want deployed. We have already registered these in our training component.
 - Some code to run as a service. It executes the model on a given input request. This entry script receives data submitted to a deployed web service and passes it to the model. It then returns the model's response to the client. The script is specific to your model. The entry script must understand the data that the model expects and returns.

The two things you need to accomplish in your entry script are:

- Loading your model (using a function called `init()`)
- Running your model on input data (using a function called `run()`)

Such an entry script is located under *./src/deployment/*.
In this implementation, the `init()` function loads the model, and the `run()` function expects the data in JSON format, with the input stored under the `data` key.
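
The shape of such an entry script can be sketched as follows. This is not the script shipped under *./src/deployment/*: the model is replaced by a placeholder that returns the first five words, and the `summary` key in the response is an assumption made for the example.

```python
import json

model = None


def _placeholder_summarizer(text):
    """Stand-in for a real model: return the first five words of the text."""
    return " ".join(text.split()[:5])


def init():
    """Called once when the service starts; a real script would load the model here."""
    global model
    model = _placeholder_summarizer


def run(raw_data):
    """Called for each request; raw_data is the JSON request body as a string."""
    text = json.loads(raw_data)["data"]
    return {"summary": model(text)}


init()
response = run(json.dumps({"data": "one two three four five six seven"}))
print(response)  # {'summary': 'one two three four five'}
```

Keeping the model load in `init()` means the expensive step happens once per service start rather than once per request.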

In [None]:
deploy_dir = "./src/deployment/"

## 4.1 Create an Inference Environment

In [None]:
from azure.ai.ml.entities import Environment
import os

custom_env_name = "transformer_inference"

endpoint_env = Environment(
    name=custom_env_name,
    description="Custom environment for transformer endpoints",
    tags={"transformers": "4.17.0", "azureml-defaults": "1.39.0"},
    conda_file=os.path.join("dependencies", "transformers_inference_conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04",
)
endpoint_env = ml_client.environments.create_or_update(endpoint_env)

print(
    f"Environment with name {endpoint_env.name} is created, the version is {endpoint_env.version}"
)


## 4.2. Create a New Online Endpoint
It is now straightforward to create an online endpoint. First, we create an endpoint by providing its description. The endpoint name needs to be unique in the entire Azure region; therefore, for this tutorial, we create a unique name using a [`UUID`](https://en.wikipedia.org/wiki/Universally_unique_identifier).

In [None]:
import uuid

# Creating a unique name for the endpoint
online_endpoint_name = "summarization-endpoint" + str(uuid.uuid4())[:8]


In [None]:
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
)

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="this is an online endpoint",
    auth_mode="key",
    tags={
        "training_dataset": "cnn_news",
        "model_type": "t5-small",
    },
)

endpoint = ml_client.begin_create_or_update(endpoint)

print(f"Endpoint {endpoint.name} provisioning state: {endpoint.provisioning_state}")


If you have previously created an endpoint, you can retrieve it as below:

In [None]:
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)


## 4.3. Deploy the Model to the Endpoint

Once the endpoint is created, we deploy the model with the entry script. Each endpoint can have multiple deployments, and rules specify how traffic is split among them. Here we create a single deployment that handles 100% of the incoming traffic. Deployment names are arbitrary: color names such as *blue*, *green*, or *red* are a common choice, while the code below names the deployment `cpu`.
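
What a traffic split means can be illustrated with a small weighted router. This is a conceptual sketch, not the Azure ML routing logic: it assumes percentages that add up to 100, and `pick_deployment` is a hypothetical helper.

```python
import random


def pick_deployment(traffic, rng):
    """Pick a deployment name with probability proportional to its traffic share."""
    if sum(traffic.values()) != 100:
        raise ValueError("traffic percentages must add up to 100 in this sketch")
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
single = {"cpu": 100}
chosen = {pick_deployment(single, rng) for _ in range(1000)}
print(chosen)  # {'cpu'}
```

With a single deployment at 100%, every request lands on it; a split such as `{"blue": 90, "green": 10}` would send roughly one request in ten to the second deployment.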

You can check the *Models* page on the Azure ML Studio, to identify the latest version of your registered model. Alternatively, the code below can surface the latest version, if integer numbers are used for versioning.

In [None]:
registered_model_name = "t5-small-cnn-pubmed"

# Let's pick the latest version of the model
latest_model_version = max(
    [int(m.version) for m in ml_client.models.list(name=registered_model_name)]
)
latest_model_version


In [None]:
# picking the model to deploy. Here we use the latest version of our registered model
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)


# create an online deployment.
cpu_deployment = ManagedOnlineDeployment(
    name="cpu",
    endpoint_name=online_endpoint_name,
    model=model,
    environment=f"{endpoint_env.name}:{endpoint_env.version}",
    code_path=deploy_dir,
    scoring_script="score.py",
    instance_type="STANDARD_DS5_V2",  #'Standard_NC12s_v3',
    instance_count=1,
)

cpu_deployment = ml_client.begin_create_or_update(cpu_deployment)


## 4.4. Test with a Sample Query

With the endpoint already published, we can run inference with it.

Let's create a sample request file following the format expected by the `run` method in the scoring script. The file must contain valid JSON, with strings properly escaped.
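
Hand-escaping quotes and line breaks is error-prone, so one option is to let Python's `json` module write the file. The sketch below uses a short made-up text and a temporary directory; both are assumptions for illustration, and only the `data` key comes from the scoring script's expected format.

```python
import json
import os
import tempfile

# Text containing characters that need escaping in JSON: double quotes and a newline
text = 'The doctor said: "I was asked to diagnose the patient."\nShe declined to elaborate.'
request = {"data": text}

# Write the request to a temporary location; json.dump escapes the string for us
path = os.path.join(tempfile.mkdtemp(), "sample-request.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(request, f)

# Read it back to confirm the file is valid JSON and the text survives unchanged
with open(path, encoding="utf-8") as f:
    raw = f.read()
loaded = json.loads(raw)
print(loaded["data"] == text)  # True
```

Note that a JSON string cannot contain a literal line break, so a newline inside the text must appear in the file as the two-character escape `\n`.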

In [None]:
%%writefile {deploy_dir}/sample-request.json
{
   "data":"Los Angeles (CNN) -- A medical doctor in Vancouver, British Columbia, said Thursday that California arson suspect Harry Burkhart suffered from severe mental illness in 2010, when she examined him as part of a team of doctors. Dr. Blaga Stancheva, a family physician and specialist in obstetrics, said both Burkhart and his mother, Dorothee, were her patients in Vancouver while both were applying for refugee status in Canada. \\\"I was asked to diagnose and treat Harry to support a claim explaining why he was unable to show up in a small-claims court case,\\\" Stancheva told CNN in a phone interview. She declined to cite the case or Burkhart\\'s role in it. Stancheva said she and other doctors including a psychiatrist diagnosed Burkhart with \\\"autism, severe anxiety, post-traumatic stress disorder and depression.\\\" The diagnosis was spelled out in a letter she wrote for the small-claims court case, Stancheva said. Stancheva, citing doctor-patient confidentiality, would not elaborate further, nor would she identify the psychiatrist involved in the diagnosis. Burkhart, a 24-year-old German national, has been charged with 37 counts of arson following a string of 52 fires in Los Angeles. The charges are in connection with arson fires at 12 locations scattered through Hollywood, West Hollywood and Sherman Oaks, according to authorities. Stancheva said the refugee applications by Burkhart and his mother were denied by the Canadian government, and she has not seen Burkhart since early March of 2010. \\\"I was shocked and dismayed at what happened in Los Angeles, and it appears he was not being treated for his depression,\\\" she said. Burkhart was in court on Wednesday for a preliminary hearing. Prosecutors said his \\\"rage against Americans,\\\" triggered by his mother\\'s arrest last week, motivated his \\\"campaign of terror\\\" with dozens of fires in Hollywood and nearby communities. 
Burkhart kept his eyes closed and remained limp during most of his hearing, requiring sheriff\\'s deputies to hold him up. The district attorney called his courtroom behavior \\\"very bizarre.\\\" \\\"This defendant has engaged in a protracted campaign in which he has set, the people believe, upwards of 52 arson fires in what essentially amounts to a campaign of terror against this community,\\\" Los Angeles County Deputy District Attorney Sean Carney said. \\\"The people believe he has engaged in this conduct because he has a hatred for Americans.\\\" Carney told the court Burkhart would flee the country if he was allowed out of jail on bond, but Los Angeles Superior Court Judge Upinder Kalra said he had no choice but to set bail. To go free while awaiting trial, Burkhart must post a $2.85 million bond and surrender his German passport. It was revealed that Burkhart is also under investigation for arson and fraud in relation to a fire in Neukirchen, near Frankfurt, Germany. The worst arson sprees in the city\\'s history began last Friday morning with a car fire in Hollywood that spread to apartments above a garage, but no new fires have happened since Burkhart was arrested Monday, Los Angeles District Attorney Steve Cooley said. No one was hurt in the fires, but property damage costs are likely to reach $3 million, authorities said. Cooley called it \\\"almost attempted murder,\\\" because people were sleeping in apartments above where Burkhart allegedly set cars on fire with incendiary devices placed under their engines. The criminal complaint filed Wednesday also alleged that the fires were \\\"caused by use of a device designed to accelerate the fire,\\\" Cooley said. \\\"If found true, the allegation could mean additional custody time for the defendant.\\\" \\\"In numerous instances, the cars were parked in carports, resulting in the fires spreading to the adjacent occupied apartment buildings,\\\" a sworn affidavit from a Los Angeles arson investigator said. 
\\\"The vast majority of these fires occurred late at night when the occupants of the apartment buildings were asleep.\\\" Investigator Edward Nordskog\\'s affidavit detailed Burkhart\\'s behavior a day before the fires began, when he was in a federal courtroom during extradition proceedings for his mother. \\\"While in the audience, the defendant (Burkhart) began yelling in an angry manner, \\'F--k all Americans.\\' The defendant also attempted to communicate with his mother who was in custody. Shortly thereafter, the defendant was ejected from the courtroom by Deputy U.S. Marshals,\\\" Nordskog wrote. Dorothee Burkhart was arrested a day before on an international arrest warrant issued by a district court in Frankfurt, Germany, said federal court spokesman Gunther Meilinger. The 53-year-old German woman is wanted on 16 counts of fraud and three counts of embezzlement, he said. The charges include an allegation that she failed to pay for a breast enhancement operation performed on her in 2004, Meilinger said. Most of the German charges, however, stem from phony real estate deals that Dorothee Burkhart allegedly conducted between 2000 and 2006. \\\"It is my opinion that the defendant\\'s criminal spree was motivated by his rage against Americans and that by setting these fires the defendant intended to harm and terrorize as many residents of the city and county of Los Angeles as possible,\\\" Nordskog wrote. A search of Burkhart\\'s Hollywood apartment found newspaper clippings about the Los Angeles fires and articles from Germany reporting similar car fires in Frankfurt, Germany in September, 2011, the investigator said. \\\"It is my opinion based on my experience that it is highly likely the defendant has a history of setting arson fires in Germany before he came to the United States,\\\" Nordskog wrote. Burkhart\\'s mother is scheduled for another extradition hearing Friday, while he is due back in court for arraignment on January 24. 
Meanwhile, both Burkharts are housed in a Los Angeles jail."
}
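
Escaping a long article by hand is error-prone. As an alternative sketch, the request file can be generated with Python's `json` module, which escapes quotes and newlines automatically. The article text below is a shortened stand-in, and the temporary directory is a placeholder for `deploy_dir`. The `"data"` key matches the sample request above.

```python
import json
import os
import tempfile

# Shortened stand-in for the article; it contains both quote styles
article = (
    'Stancheva said: "I was shocked and dismayed at what happened '
    "in Los Angeles.\" Burkhart's mother was arrested last week."
)

# Placeholder output directory; in the tutorial this would be deploy_dir
out_dir = tempfile.mkdtemp()
request_path = os.path.join(out_dir, "sample-request.json")

# json.dump takes care of escaping quotes, backslashes, and newlines
with open(request_path, "w", encoding="utf-8") as f:
    json.dump({"data": article}, f)

# Reading the file back confirms that it is valid JSON
with open(request_path, encoding="utf-8") as f:
    round_trip = json.load(f)
```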

In [None]:
# test the cpu deployment with some sample data
ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file=f"{deploy_dir}/sample-request.json",
    deployment_name="cpu",
)


It is also possible to consume the REST endpoint directly, or to test it through the UI in Azure Machine Learning Studio.

![](media/endpoint.png)
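
For the direct REST option, a minimal sketch using only the standard library is shown below. The scoring URI and key are placeholders; the real values are shown on the endpoint's page in the Studio. The helper name `build_scoring_request` is hypothetical. The sketch assumes key-based authentication passed as a bearer token, and uses the `azureml-model-deployment` header to target a specific deployment.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, api_key, text, deployment_name=None):
    """Build a POST request carrying the same payload as the sample file."""
    body = json.dumps({"data": text}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    if deployment_name:
        # Route the request to one deployment instead of the traffic rules
        headers["azureml-model-deployment"] = deployment_name
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )


# Placeholder URI and key; replace them with your endpoint's values
request = build_scoring_request(
    "https://example-endpoint.region.inference.ml.azure.com/score",
    "<your-endpoint-key>",
    "Text to summarize.",
    deployment_name="cpu",
)

# Sending the request requires a live endpoint:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```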


## 4.5. Delete the Online Endpoint to Release Resources

The online endpoint consumes resources while it is running. To release the allocated resources, delete the deployment if you do not plan to use it.

In [None]:
ml_client.online_deployments.delete(name="cpu", endpoint_name=online_endpoint_name)
