# Using Azure ML Pipelines to Finetune HuggingFace models for GLUE Tasks

**Learning Objectives** 
By the end of this two-part tutorial, you should be able to use Azure Machine Learning (AML) to finetune Hugging Face NLP models.

    

**Requirements**
To benefit from this tutorial, you need:
- a basic understanding of the workflow of machine learning projects
- an Azure subscription. If you don't have an Azure subscription, [create a free account](https://aka.ms/AMLFree) before you begin.
- a working AML workspace. A workspace can be created via Azure Portal, Azure CLI, or Python SDK. [Read more](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?tabs=python).
- a Python environment
- the Azure Machine Learning Python SDK v2, installed with:
```bash
pip install azure-ml==0.0.139 --extra-index-url https://azuremlsdktestpypi.azureedge.net/sdk-cli-v2
```
- familiarity with the Hugging Face framework


**Motivations** 
In this tutorial, we will create an AML pipeline to finetune a Hugging Face model. Specifically, we finetune a light BERT model to perform one of the [GLUE tasks](https://gluebenchmark.com/). We have picked the [Microsoft Research Paraphrase Corpus](https://gluebenchmark.com/tasks) (MRPC) task for demonstration, but the code can easily be changed to work for other purposes.

The finetuning is performed in a single Python file, which we run inside an AML Command Job. We then use AML's built-in hyperparameter optimization to get the best performance from our model.


### Connect to AzureML

Before we dive into the code, we need to create an instance of `MLClient` to connect to Azure ML. Please provide the details of your workspace below.



In [None]:
# handle to the workspace
from azure.ml import MLClient

# authentication package
from azure.identity import InteractiveBrowserCredential

# get a handle to the workspace
ml_client = MLClient(
    InteractiveBrowserCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)
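
Hardcoding workspace identifiers in a notebook makes it easy to commit them by accident. A minimal sketch of one alternative, assuming the three values are exported as environment variables. The variable names `AML_SUBSCRIPTION_ID`, `AML_RESOURCE_GROUP` and `AML_WORKSPACE_NAME` and the helper `workspace_settings_from_env` are hypothetical and not part of the SDK:

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
_ENV_TO_KWARG = {
    "AML_SUBSCRIPTION_ID": "subscription_id",
    "AML_RESOURCE_GROUP": "resource_group_name",
    "AML_WORKSPACE_NAME": "workspace_name",
}


def workspace_settings_from_env(environ=None):
    """Collect the MLClient keyword arguments from environment variables."""
    environ = os.environ if environ is None else environ
    missing = [name for name in _ENV_TO_KWARG if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: environ[name] for name, kwarg in _ENV_TO_KWARG.items()}
```

The returned dictionary could then be unpacked into the constructor shown above, for example `MLClient(InteractiveBrowserCredential(), **workspace_settings_from_env())`.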

### Provision the required resources for this notebook
We need a compute cluster for this notebook; you can use either a CPU cluster or a GPU cluster. First, let's create a minimal cluster for the task.

In [None]:
# Set your preferred compute type here
USE_GPU = True

from azure.ml.entities import AmlCompute

# Let's create the AML compute object with the intended parameters
cluster_basic = AmlCompute(
    # Name assigned to the compute cluster
    name="gpu-cluster" if USE_GPU else "cpu-cluster",

    # AML Compute is AML's on-demand VM service
    type="amlcompute",

    # VM size: Standard_NC6 (1 x NVIDIA Tesla K80 GPU) or Standard_DS3_v2 (4 vCPUs, 14 GB RAM)
    size="Standard_NC6" if USE_GPU else "Standard_DS3_v2",

    # Minimum number of nodes kept running when no job is running
    min_instances=0,

    # Maximum number of nodes in the cluster
    max_instances=6,

    # How many seconds a node stays up after its job ends before it is scaled down
    idle_time_before_scale_down=120,

    # Dedicated or LowPriority. The latter is cheaper, but jobs may be preempted and terminated
    tier="Dedicated",
)

# Now, we pass the object to the client's begin_create_or_update method
cluster_basic = ml_client.begin_create_or_update(cluster_basic)

print(
    f"AMLCompute with name {cluster_basic.name} is created, the compute size is {cluster_basic.size}"
)

# 1. Preparing the Resources

## 1.1. Create a Job Environment
So far, following the requirements section, we have set up a development environment on our development machine. AML also needs to know what environment to use for each step of the pipeline. We can use any published Docker image as is, or add our required dependencies to the image. In this example, we create a conda environment for our jobs, using a [conda yaml file](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually), and add it to an Ubuntu image from the Microsoft Container Registry. For more information on AML environments and Azure Container Registries, please check the [AML environments documentation](https://docs.microsoft.com/en-us/azure/machine-learning/concept-environments).


In [None]:
from azure.ml.entities import Environment
import os

custom_env_name = "transformers-gpu" if USE_GPU else "transformers-cpu"

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for transformer model training",
    tags={"scikit-learn": "0.24.2", "azureml-defaults": "1.38.0"},
    conda_file=os.path.join("components","finetune", "conda.yml"),
    image="mcr.microsoft.com/azureml/intelmpi2018.3-cuda10.0-cudnn7-ubuntu16.04"
    if USE_GPU
    else "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20220218.v1",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is created, the version is {pipeline_job_env.version}"
)
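
The `Environment` above expects a conda specification at `components/finetune/conda.yml`, whose contents are not shown in this tutorial. As a rough sketch only, the helper below writes a minimal specification. The package list, the Python version and the environment name are assumptions inferred from the tags and the task, and the helper name `write_conda_file` is hypothetical. Only the two versions that appear in the tags above are pinned.

```python
import os
import tempfile

# Assumed contents: the real conda.yml used by this tutorial may differ.
CONDA_SPEC = """\
name: transformers-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - pip:
      - transformers
      - datasets
      - torch
      - scikit-learn==0.24.2
      - azureml-defaults==1.38.0
"""


def write_conda_file(path):
    """Write the sketch specification to `path`, creating parent folders."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(CONDA_SPEC)
    return path


# Demonstrate on a temporary folder so that no existing file is overwritten.
demo_path = write_conda_file(
    os.path.join(tempfile.mkdtemp(), "components", "finetune", "conda.yml")
)
print(demo_path)
```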

## 1.2. Create or Load Components
Now that we have our workspace, compute and environment ready, let's work on the individual steps of our pipeline.

### 1.2.1 GLUE Component
We have created a Python script that loads the dataset, loads the target model and its weights, and trains and evaluates the model. Here we use the general-purpose **CommandComponent**, which runs a command-line action: either a direct system command or a script. In this example, the command runs a Python script that finetunes the Hugging Face model. The inputs and outputs are accessible in the command via the `${{parameter}}` notation, and the `${{inputs}}` placeholders in the command are filled in from the inputs section of the job. The script uses `transformers.HfArgumentParser` for argument parsing, which allows us to set all the training parameters on the command line. Any extra arguments can be passed through `additional_args` using the `--arg value` format.

Once the model is trained, it is registered into AML, so that it can be used in future inference tasks.

Please refer to the commented `finetune_glue.py` for the finetuning process.
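
`HfArgumentParser` is a subclass of `argparse.ArgumentParser` that generates command-line options from dataclass fields. To show how the flags in the component command reach the script, here is a standard-library-only sketch of that idea. The `FinetuneArgs` dataclass and the `parse_args` helper are hypothetical stand-ins; the real `finetune_glue.py` relies on `transformers` and may define its arguments differently.

```python
import argparse
import shlex
from dataclasses import dataclass, fields


@dataclass
class FinetuneArgs:
    # Hypothetical subset of the options used by the component command.
    model_checkpoint: str = "distilbert-base-uncased"
    learning_rate: float = 2e-5
    num_train_epochs: int = 5
    seed: int = 42


def parse_args(command_line):
    """Build one --option per dataclass field and parse a command string."""
    parser = argparse.ArgumentParser()
    for field in fields(FinetuneArgs):
        # Infer the option type from the type of the default value.
        parser.add_argument(
            f"--{field.name}", type=type(field.default), default=field.default
        )
    namespace = parser.parse_args(shlex.split(command_line))
    return FinetuneArgs(**vars(namespace))


# "--seed 37" plays the role of additional_args appended to the command.
args = parse_args(
    "--model_checkpoint distilbert-base-uncased --learning_rate 3e-5 --seed 37"
)
print(args)
```

Options that are not given on the command line, such as `num_train_epochs` here, keep their dataclass defaults.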

In [None]:
# import the CommandComponent and Code entities
from azure.ml.entities import CommandComponent, Code

src_dir = "components/finetune/"


glue_component = CommandComponent(
    # Name of the component
    name="GLUE_mrpc",
    
    # Component version; if omitted, the component is versioned automatically
    # version="26",
    
    # The dictionary of the inputs. Each item is a dictionary itself.
    inputs=dict(
        model_checkpoint=dict(type="string"),
        learning_rate=dict(type="number", default=2e-5),
        num_train_epochs=dict(type="integer", default=5),
        per_device_train_batch_size=dict(type="integer", default=16),
        per_device_eval_batch_size=dict(type="integer", default=16),
        additional_args=dict(type="string", default="")
    ),
    
    # The dictionary of the outputs. Each item is a dictionary itself.
    outputs=dict(trained_model=dict(type="path")),
    
    # The source folder of the component
    code=Code(local_path=src_dir),
    
    # The environment the component job will be using
#     environment=pipeline_job_env,
    environment="transformers-gpu:1" if USE_GPU else "transformers-cpu:1",
    
    # The command that will be run in the component
    command="python finetune_glue.py --model_checkpoint ${{inputs.model_checkpoint}} --output_dir outputs "
    "--num_train_epochs ${{inputs.num_train_epochs}} "
    "--learning_rate ${{inputs.learning_rate}} --per_device_train_batch_size ${{inputs.per_device_train_batch_size}} --per_device_eval_batch_size ${{inputs.per_device_eval_batch_size}} "
    "--disable_tqdm True --trained_model ${{outputs.trained_model}} ${{inputs.additional_args}}",
)
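
When the job runs, each `${{...}}` placeholder in the command is replaced with the corresponding input value or output path. The substitution can be pictured with the following sketch. `render_command` is a hypothetical helper written for illustration, not the way the service is implemented, and the output path is made up.

```python
import re

# Matches ${{inputs.name}} and ${{outputs.name}}.
_PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template, inputs, outputs):
    """Replace ${{inputs.x}} / ${{outputs.y}} with values from the dicts."""
    values = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        section, name = match.group(1), match.group(2)
        return str(values[section][name])

    return _PLACEHOLDER.sub(substitute, template)


template = (
    "python finetune_glue.py --model_checkpoint ${{inputs.model_checkpoint}} "
    "--learning_rate ${{inputs.learning_rate}} "
    "--trained_model ${{outputs.trained_model}} ${{inputs.additional_args}}"
)
command = render_command(
    template,
    inputs={
        "model_checkpoint": "distilbert-base-uncased",
        "learning_rate": 2e-5,
        "additional_args": "--seed 37",
    },
    # Hypothetical path; AML decides where outputs are actually mounted.
    outputs={"trained_model": "/tmp/trained_model"},
)
print(command)
```

Because `additional_args` is substituted as plain text at the end of the command, anything placed in it reaches the script's argument parser like any other flag.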

# 2. Finetune Hugging Face Model in AML

## 2.1 Creating Azure ML Pipeline
The component can now be used as a step in a pipeline and, if required, connected to other steps.

In [None]:
from azure.ml import dsl
from azure.ml.entities import Component
from pathlib import Path

# turn the component definition into a callable that creates a pipeline step
glue_func = dsl.load_component(component=glue_component)

# the dsl decorator tells the sdk that we are defining an AML pipeline
@dsl.pipeline(
    name="glue-example_pipeline",
    compute="gpu-cluster" if USE_GPU else "cpu-cluster",
    description="mrpc GLUE Finetune pipeline",
)
def glue_pipeline(
    pipeline_job_model_checkpoint,
    pipeline_job_learning_rate,
    pipeline_job_num_train_epochs,
    pipeline_job_per_device_train_batch_size,
    pipeline_job_per_device_eval_batch_size,
    pipeline_job_additional_args,
):
    glue_step = glue_func(
        model_checkpoint=pipeline_job_model_checkpoint,
        learning_rate=pipeline_job_learning_rate,
        num_train_epochs=pipeline_job_num_train_epochs,
        per_device_train_batch_size=pipeline_job_per_device_train_batch_size,
        per_device_eval_batch_size=pipeline_job_per_device_eval_batch_size,
        additional_args=pipeline_job_additional_args,
    )
    
    return {"model": glue_step.outputs.trained_model}

Let's now use our pipeline definition to instantiate a pipeline with the parameters we choose for our run.

In [None]:
# Let's instantiate the pipeline with the parameters of our choice
pipeline = glue_pipeline(
    pipeline_job_model_checkpoint="distilbert-base-uncased",
    pipeline_job_learning_rate=2e-5,
    pipeline_job_num_train_epochs=1,
    pipeline_job_per_device_train_batch_size=16,
    pipeline_job_per_device_eval_batch_size=16,
    pipeline_job_additional_args="--seed 37",
)

## 2.2. Submitting a Finetuning Job to AML Workspace
It is now time to submit the job to AML. This time we use `create_or_update` on `ml_client.jobs`. We also pass an experiment name. An experiment is a container for all the iterations one does on a certain project. All the jobs submitted under the same experiment name are listed next to each other in AML studio.

In [None]:
import webbrowser

# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline,
    
    # Project's name
    experiment_name="glue-example",
    
    # If there is no dependency, pipeline run will continue even after the failure of one component
    continue_run_on_step_failure=True,
)
# open the job's status page in AML studio in the browser
webbrowser.open(returned_job.services["Studio"].endpoint)
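
The submission above hands the job to AML; the training itself then runs on the cluster. If a later cell depends on the result, one option is to poll the job status until it reaches a terminal state. The sketch below is deliberately SDK-agnostic: `wait_for_status` is a hypothetical helper that takes any zero-argument callable returning the current status string, and the set of terminal status names is an assumption to be checked against your SDK version.

```python
import time

# Assumed terminal states; verify the exact names for your SDK version.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_status(
    get_status, poll_seconds=30.0, timeout_seconds=7200.0, sleep=time.sleep
):
    """Poll get_status() until it returns a terminal status or time runs out."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in state {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for a real job: reports "Running" twice, then "Completed".
fake_statuses = iter(["Running", "Running", "Completed"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```

With a real job, `get_status` could be something like `lambda: ml_client.jobs.get(returned_job.name).status`, assuming the installed SDK version exposes `jobs.get` and a `status` attribute on the returned job.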

# 3. Hyperparameter Optimization

## 3.1. Run a Sweep Job on a Command Component

Sweep jobs are designed for hyperparameter tuning. Data scientists can define the search space for each hyperparameter (e.g. provide a list of values to pick from or a distribution from which to sample) as well as the objective and the sampling algorithm. The system then runs as many jobs as the user specifies, each with a different parameter combination. The output of a sweep job is the list of outputs of the run that yielded the best result.

A component can be used as the trial function of a sweep job. The search space is a dictionary in which the distribution of each hyperparameter is defined. Discrete values can be assigned using `Choice`, `QUniform`, `QNormal`, etc. Continuous hyperparameters can be defined by `Normal`, `LogNormal`, `LogUniform` or `Uniform` distributions.

The resource budget for the trial runs is defined in `SweepJobLimits`.


In [None]:
# import required libraries
from azure.ml.entities import (
    CommandJob,
    JobInput,
    SweepJob,
    Choice,
    Uniform,
    SweepJobLimits,
    TruncationSelectionPolicy,
    Objective,
    CommandComponent,
)

# Each trial job is given a different combination of hyperparameter values, sampled by the system from the search_space.
search_space = {"learning_rate": Uniform(min_value=0.01, max_value=0.9)}

# Define the limits for this sweep
limits = SweepJobLimits(max_total_trials=4, max_concurrent_trials=2, timeout=7200)

# Set the sampling algorithm for the trials: random, grid or Bayesian sampling can be chosen
sampling_algorithm = "random"

# Specify the primary metric you want hyperparameter tuning to optimize.
objective = Objective(goal="Minimize", primary_metric="loss")

# The early termination policy uses the primary metric to identify low-performance runs.
# Here it is set to None, so no trial is terminated early.
early_termination = None
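
To make the roles of the search space, the limits and the objective concrete, the sketch below imitates a random sweep with the standard library only. It is a toy model, not the AML implementation: `toy_loss` is a made-up stand-in for a full finetuning run, and `run_random_sweep` is a hypothetical helper. The bounds and the trial count mirror the values used above.

```python
import random


def toy_loss(learning_rate):
    """Made-up objective standing in for a finetuning run; lower is better."""
    return (learning_rate - 0.1) ** 2


def run_random_sweep(min_value, max_value, max_total_trials, goal="Minimize", seed=0):
    """Sample learning rates uniformly, score each trial, return the best."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_total_trials):
        learning_rate = rng.uniform(min_value, max_value)
        trials.append(
            {"learning_rate": learning_rate, "loss": toy_loss(learning_rate)}
        )
    # The objective's goal decides whether the lowest or highest metric wins.
    pick = min if goal == "Minimize" else max
    best = pick(trials, key=lambda trial: trial["loss"])
    return best, trials


best_trial, all_trials = run_random_sweep(0.01, 0.9, max_total_trials=4)
print(best_trial)
```

In the real sweep, each trial is a full run of the component, and `max_concurrent_trials=2` means that the four trials run at most two at a time.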

Once the search space, budget, sampling algorithm and optimization objective are set, we can create our sweep job. AML will generate the trials and identify the best-performing hyperparameter combination.


In [None]:
# run sweep using this component
inputs={
    "model_checkpoint": "distilbert-base-uncased",
    "learning_rate": 2e-5,  # overridden by the value sampled from the search space
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "additional_args": "--seed 37",
}

cmd_sweep_job = SweepJob(
    trial=glue_component,
    compute="gpu-cluster" if USE_GPU else "cpu-cluster",
    sampling_algorithm=sampling_algorithm,
    inputs=inputs,
    search_space=search_space,
    objective=objective,
    limits=limits,
    early_termination=early_termination,
    display_name='sweep job on glue task',
    experiment_name='glue-example',
    description='Run a hyperparameter sweep job using component for GLUE mrpc.'
)

In [None]:
# submit the sweep job
returned_sweep_job_cmd = ml_client.jobs.create_or_update(cmd_sweep_job)
# get a URL for the status of the job
returned_sweep_job_cmd.services["Studio"].endpoint