
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# PyTorch Pretrained BERT on AzureML with GLUE Dataset

This notebook covers the following:
- Downloading the GLUE dataset on the remote compute and storing it in Azure storage
- Speeding up BERT fine-tuning on the GLUE dataset with AzureML GPU clusters

## Prerequisites
Follow the instructions in the BERT_pretraining.ipynb notebook to set up AzureML.

In [1]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)



## Initialize workspace

To create or access an Azure ML Workspace, you need to import the AML library and provide the following information:
* A name for your workspace
* Your subscription id
* The resource group name

Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step or create a new one. 

In [0]:
from azureml.core.workspace import Workspace
ws = Workspace.setup()
ws_details = ws.get_details()
print('Name:\t\t{}\nLocation:\t{}'
      .format(ws_details['name'],
              ws_details['location']))
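
The three values listed above are also what the SDK's `config.json` file stores, which `Workspace.from_config()` can read as an alternative to interactive setup. Below is a minimal sketch that writes such a file using only the standard library. The helper name `write_workspace_config` is hypothetical, and all values shown are placeholders rather than values from this tutorial.

```python
import json
import os


def write_workspace_config(path, subscription_id, resource_group, workspace_name):
    """Write the three workspace identifiers to a JSON file (hypothetical helper)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    # Create the parent directory if it does not exist yet.
    os.makedirs(os.path.dirname(os.path.abspath(path)), exist_ok=True)
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return config


# Placeholder values; replace them with your own before use.
# write_workspace_config("config.json", "<subscription-id>", "<resource-group>", "<workspace-name>")
# ws = Workspace.from_config()  # looks for config.json starting from the current directory
```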


### Create an experiment
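Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this distributed PyTorch tutorial. The `Experiment` object itself is instantiated in the "Submit and Monitor your run" section, just before the run is submitted.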

## Download GLUE dataset on the remote compute

Before fine-tuning the pretrained BERT model, we need to download the [GLUE data](https://gluebenchmark.com/tasks) by running this [script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to an Azure Blob container.

### Define AzureML datastore to collect training dataset

To make data accessible for remote training, AML provides a convenient way to do so via a [Datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data). The datastore provides a mechanism for you to upload/download data to Azure Storage, and interact with it from your remote compute targets.

Each workspace is associated with a default Azure Blob datastore named `'workspaceblobstore'`. In this notebook, we use this default datastore to hold the GLUE training dataset.

In [0]:
from azureml.core import Datastore
ds = ws.get_default_datastore()

### Create a project directory
Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

In [0]:
import os
import os.path as path
project_root = path.abspath(path.join(os.getcwd(),"../../../"))

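Download the GLUE dataset into the BingBert/ directory (the code below expects it under `data/glue_data` in the project root), then upload it to the datastore: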

In [0]:
ds.upload(src_dir=os.path.join(project_root,'data','glue_data'), target_path='glue_data')
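
Before running the upload, it can help to confirm that the task folder holds the files the fine-tuning script will read. Below is a minimal sketch, assuming the download script produced `train.tsv` and `dev.tsv` in the task folder. The helper name `missing_glue_files` and the example path are hypothetical.

```python
import os

# Assumed file names for a GLUE task folder; adjust if your download differs.
EXPECTED_FILES = ("train.tsv", "dev.tsv")


def missing_glue_files(task_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from task_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(task_dir, name))]


# Example (the path is illustrative):
# missing = missing_glue_files(os.path.join("data", "glue_data", "RTE"))
# if missing:
#     print("Missing files:", missing)
```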

Create a folder named `bert-large-checkpoints` that contains the `.pt` BERT checkpoint file you want to run your evaluation tasks against. The following code uploads the folder to the datastore. The checkpoint is available at: https://bertonazuremlwestus2.blob.core.windows.net/public/models/bert_large_uncased_original/bert_encoder_epoch_200.pt

In [0]:
ds.upload(src_dir=os.path.join(project_root,'data','bert-large-checkpoints') , target_path='bert-large-checkpoints')
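
The checkpoint folder can be populated by hand or with a small script before the upload. Below is a sketch that uses only the standard library. The helper name `fetch_checkpoint` is hypothetical, and the destination path in the usage comment simply mirrors the folder used in the upload call.

```python
import os
import urllib.request
from urllib.parse import urlparse


def fetch_checkpoint(url, dest_dir):
    """Download url into dest_dir unless a file with that name already exists."""
    os.makedirs(dest_dir, exist_ok=True)
    filename = os.path.basename(urlparse(url).path)
    dest_path = os.path.join(dest_dir, filename)
    if not os.path.exists(dest_path):
        urllib.request.urlretrieve(url, dest_path)
    return dest_path


# Usage (large download, so it is left commented out here):
# CHECKPOINT_URL = ("https://bertonazuremlwestus2.blob.core.windows.net/public/models/"
#                   "bert_large_uncased_original/bert_encoder_epoch_200.pt")
# fetch_checkpoint(CHECKPOINT_URL, os.path.join("data", "bert-large-checkpoints"))
```

The file name is derived from the URL, so the value passed later as `--model_file` should be checked against the name of the file that actually lands in the folder.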

Upload the bert-large config file to the datastore:

In [0]:
ds.upload(src_dir=os.path.join(project_root,'pretrain','configs'), target_path='config')

**Remove the /data folder before submitting the run to avoid uploading a source folder larger than 300 MB.**
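
To check whether the source folder is under that limit, its size can be measured locally. Below is a minimal sketch using only the standard library. The helper name `dir_size_mb` is hypothetical, and the 300 MB threshold is the figure quoted above.

```python
import os


def dir_size_mb(root):
    """Return the total size of regular files under root, in MiB."""
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so linked files are not counted.
            if not os.path.islink(path):
                total_bytes += os.path.getsize(path)
    return total_bytes / (1024 * 1024)


# Usage (the path is illustrative):
# size = dir_size_mb("../../../")
# if size > 300:
#     print("Source folder is {:.1f} MB; remove large folders such as data/".format(size))
```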

## Fine-tuning BERT with Distributed Training
Now that our `GLUE` dataset is ready in Azure storage, we can start fine-tuning the model by exploiting the power of distributed training.

### Create a GPU remote compute target

We need to create a GPU [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) to perform the fine-tuning. In this example, we create an AmlCompute cluster as our training compute resource.

This code creates a cluster for you if it does not already exist in your workspace.

In [0]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
gpu_cluster_name = "bertcodetesting"

try:
    gpu_compute_target = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC24', max_nodes=4)

    # create the cluster
    gpu_compute_target = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    gpu_compute_target.wait_for_completion(show_output=True)

# Use the 'status' property to get a detailed status for the current cluster. 
print(gpu_compute_target.status.serialize())

### Create a PyTorch estimator for fine-tuning
Let us create a new PyTorch estimator to run the fine-tuning script `run_classifier.py`, which is provided in [the original repository](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_classifier.py). Please refer to [the repository's README](https://github.com/huggingface/pytorch-pretrained-BERT#fine-tuning-with-bert-running-the-examples) for more details about the script.

The original `run_classifier.py` script uses the PyTorch distributed launch utility to launch multiple processes across nodes and GPUs. We prepared a modified version, [run_classifier_azureml.py](./run_classifier_azureml.py), so that it can be launched with AzureML's built-in MPI backend.
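
When a script is started by an MPI launcher rather than by the PyTorch launch utility, each process has to discover its rank some other way. One common approach with Open MPI is to read the environment variables it sets for every process. The sketch below illustrates that idea only; it is not taken from `run_classifier_azureml.py`, and the helper name `mpi_context` is hypothetical.

```python
import os


def mpi_context(env=None):
    """Read world size and ranks from Open MPI environment variables.

    Falls back to a single-process layout when the variables are absent,
    for example when the script is run directly without an MPI launcher.
    """
    env = os.environ if env is None else env
    return {
        "world_size": int(env.get("OMPI_COMM_WORLD_SIZE", 1)),
        "rank": int(env.get("OMPI_COMM_WORLD_RANK", 0)),
        "local_rank": int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
    }


ctx = mpi_context()
print("process {rank} of {world_size} (local rank {local_rank})".format(**ctx))
```

The local rank is typically what a training script would use to pick a GPU on its node, while the global rank and world size describe the process's position in the whole job.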

To use AML's tracking and metrics capabilities, we need to add a small amount of AzureML code inside the training script.

In `run_classifier_azureml.py`, we will log some metrics to our AML run. To do so, we will access the AML run object within the script:
```Python
from azureml.core.run import Run
run = Run.get_context()
```
Further within `run_classifier_azureml.py`, we log the learning rate, the training loss, and the evaluation accuracy the model achieves as follows:
```Python
# The built-in float() is used here because the np.float alias was removed in NumPy 1.24
run.log('lr', float(args.learning_rate))
...

for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")): 
    ...
    run.log('train_loss', float(loss))

...

result = {'eval_loss': eval_loss,
          'eval_accuracy': eval_accuracy}
for key in sorted(result.keys()):
    run.log(key, str(result[key]))
```

The following code runs the GLUE RTE task against a bert-large checkpoint with the parameters used by Hugging Face for fine-tuning:
- num_train_epochs = 3
- max_seq_length = 128
- train_batch_size = 8
- learning_rate = 2e-5
- gradient_accumulation_steps = 2
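
To see how these settings interact, the arithmetic can be sketched as follows. This assumes the convention of the original Hugging Face script, where `train_batch_size` is the batch size per optimizer update and is divided by the accumulation steps to get the size of each forward/backward pass. It also assumes a single process and an RTE training set of 2,490 examples. The function name `accumulation_plan` is hypothetical.

```python
import math


def accumulation_plan(num_examples, train_batch_size, accumulation_steps, epochs):
    """Return (micro-batch size, total optimizer updates) for one process."""
    # Each forward/backward pass sees only a fraction of the update batch.
    micro_batch = train_batch_size // accumulation_steps
    passes_per_epoch = math.ceil(num_examples / micro_batch)
    # An update happens once per full accumulation window; a trailing
    # incomplete window at the end of an epoch performs no update.
    updates_per_epoch = passes_per_epoch // accumulation_steps
    return micro_batch, updates_per_epoch * epochs


micro_batch, total_updates = accumulation_plan(
    num_examples=2490,       # assumed size of the RTE training set
    train_batch_size=8,
    accumulation_steps=2,
    epochs=3,
)
print(micro_batch, total_updates)
```

With several processes, each one sees only a share of the data, so the per-process counts would be smaller than in this single-process sketch.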

In [0]:
from azureml.train.dnn import PyTorch

# Use a predefined public Docker image published for AzureML
image_name = 'mcr.microsoft.com/azureml/bert:pretrain-openmpi3.1.2-cuda10.0-cudnn7-ubuntu16.04'

estimator = PyTorch(source_directory='../../../',
                    compute_target=gpu_compute_target,
                    # Docker image; the Python environment comes from the image
                    use_docker=True,
                    custom_docker_image=image_name,
                    user_managed=True,
                    script_params={
                        '--bert_model': 'bert-large-uncased',
                        # Folder and file name of the checkpoint in the datastore
                        '--model_file_location': ds.path('bert-large-checkpoints/').as_mount(),
                        '--model_file': 'bert_encoder_epoch_245.pt',
                        '--task_name': 'RTE',
                        '--data_dir': ds.path('glue_data/RTE/').as_mount(),
                        '--output_dir': ds.path('output/').as_mount(),
                        # Flags that take no value
                        '--do_train': '',
                        '--do_eval': '',
                        '--do_lower_case': '',
                        '--fp16': '',
                        # Fine-tuning hyperparameters
                        '--max_seq_length': 128,
                        '--train_batch_size': 8,
                        '--gradient_accumulation_steps': 2,
                        '--learning_rate': 2e-5,
                        '--num_train_epochs': 3.0,
                    },
                    entry_script='./finetune/run_classifier_azureml.py',
                    # One node with four processes, launched through MPI
                    node_count=1,
                    process_count_per_node=4,
                    distributed_backend='mpi',
                    use_gpu=True)

# path to the Python environment in the custom Docker image
estimator._estimator_config.environment.python.interpreter_path = '/opt/miniconda/envs/amlbert/bin/python'
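
On the receiving side, the entry script reads these parameters from its command line. The sketch below shows one way a parameter dictionary could map onto `argparse` options. It assumes that an empty-string value stands for a bare flag, which is how the dictionary above appears to be used. The helper `to_argv` is hypothetical and is not the SDK's own serialization code, and only a few of the options are reproduced.

```python
import argparse


def to_argv(script_params):
    """Flatten a parameter dict into an argument list (hypothetical helper).

    Assumption: an empty-string value means "pass the key as a bare flag".
    """
    argv = []
    for key, value in script_params.items():
        argv.append(key)
        if value != "":
            argv.append(str(value))
    return argv


parser = argparse.ArgumentParser()
parser.add_argument("--task_name", type=str, required=True)
parser.add_argument("--max_seq_length", type=int, default=128)
parser.add_argument("--learning_rate", type=float, default=5e-5)
# Flags are False unless they appear on the command line.
parser.add_argument("--do_train", action="store_true")
parser.add_argument("--fp16", action="store_true")

args = parser.parse_args(to_argv({
    "--task_name": "RTE",
    "--max_seq_length": 128,
    "--learning_rate": 2e-5,
    "--do_train": "",
}))
print(args.task_name, args.max_seq_length, args.learning_rate, args.do_train, args.fp16)
```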

### Submit and Monitor your run

In [0]:
from azureml.core import Experiment

experiment_name = 'bert-large-RTE'
experiment = Experiment(ws, name=experiment_name)

In [0]:
from azureml.widgets import RunDetails

# Submit the estimator and show live run details in the notebook
run = experiment.submit(estimator)
RunDetails(run).show()

In [0]:
# Uncomment to cancel the run
# run.cancel()