# Natural Language Inference on MultiNLI Dataset using BERT with Azure Machine Learning

## Summary
In this notebook, we demonstrate how to use the BERT model for natural language inference in English. We use the English portion of the [XNLI](https://github.com/facebookresearch/XNLI) dataset, whose English training data comes from MultiNLI. The task is to classify sentence pairs into three classes: contradiction, entailment, and neutral.  
The figure below shows how [BERT](https://arxiv.org/abs/1810.04805) classifies sentence pairs. It concatenates the tokens of the two sentences in each pair and separates them with a [SEP] token. A [CLS] token is prepended to the token list and used as the aggregate sequence representation for the classification task.
<img src="https://nlpbp.blob.core.windows.net/images/bert_two_sentence.PNG">
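
The input layout in the figure can be sketched in a few lines of plain Python. This is only an illustration: the helper name `pack_sentence_pair` is hypothetical, and whitespace splitting stands in for BERT's WordPiece tokenizer. Following the BERT paper, a second [SEP] closes the sequence, and segment ids mark which sentence each token belongs to.

```python
def pack_sentence_pair(tokens_a, tokens_b):
    """Build the BERT input tokens and segment ids for one sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP];
    # segment 1 covers sentence B and the closing [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Whitespace splitting is an assumption made for readability only.
premise = "a man is playing a guitar".split()
hypothesis = "a person makes music".split()

tokens, segment_ids = pack_sentence_pair(premise, hypothesis)
print(tokens)
print(segment_ids)
```

The classifier reads the final hidden state at the [CLS] position (index 0 of `tokens`) to predict one of the three labels.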

Azure Machine Learning features highlighted in the notebook:

- Distributed training with Horovod

In [None]:
#Imports

import sys
sys.path.append("../..")

import os
import shutil

import torch
import azureml.core
from azureml.train.dnn import PyTorch
from azureml.core.runconfig import MpiConfiguration
from azureml.core import Experiment
from azureml.widgets import RunDetails
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from utils_nlp.azureml.azureml_utils import get_or_create_workspace

## 2. AzureML Setup

### 2.1 Link to or create a Workspace

First, go through the [Configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML `Workspace`. This will create a config.json file containing the values needed below to create a workspace.

**Note**: You do not need to fill in these values if you have a config.json in the same folder as this notebook.
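
As a rough sketch of what such a file holds, the snippet below writes a placeholder config.json to a temporary folder and reads it back. The three keys are the ones the Azure ML SDK typically stores in this file. The helper `load_workspace_config` and the placeholder values are hypothetical and only serve the illustration.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Read a workspace config file and return the fields in REQUIRED_KEYS."""
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError("config.json is missing: " + ", ".join(missing))
    return {key: config[key] for key in REQUIRED_KEYS}


# Placeholder values; replace them with your own workspace details.
example = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write the placeholder config to a temporary folder, then read it back.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(example, f, indent=4)

print(load_workspace_config(config_path))
```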

In [None]:
ws = get_or_create_workspace(
    subscription_id="15ae9cb6-95c1-483d-a0e3-b1a1a3b06324",
    resource_group="nlprg",
    workspace_name="MAIDAIPBERT-eastus",
    workspace_region="East US",
)


In [None]:
print(
    "Workspace name: " + ws.name,
    "Azure region: " + ws.location,
    "Subscription id: " + ws.subscription_id,
    "Resource group: " + ws.resource_group,
    sep="\n",
)

### 2.2 Link AmlCompute Compute Target

We need to link a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training our model (see [compute options](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#supported-compute-targets) for an explanation of the different options). We will use an [AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) target: the cell below links to an existing target if `cluster_name` exists, and otherwise creates a STANDARD_NC6 GPU cluster that autoscales from 0 to 4 nodes. Creating a new AmlCompute cluster takes approximately 5 minutes.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
cluster_name = "bertncrs24"
#cluster_name = "gpu-entail"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found compute target: {}".format(cluster_name))
except ComputeTargetException:
    print("Creating new compute target: {}".format(cluster_name))
    # Autoscale from 0 to 4 nodes; max_nodes must be at least the job's node count
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6", min_nodes=0, max_nodes=4
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)


print(compute_target.get_status().serialize())

In [None]:
DEBUG = True
project_dir = "./entail_utils"
# In debug mode, remove any previous copy so the latest utils_nlp code is uploaded with the job
if DEBUG and os.path.exists(project_dir):
    shutil.rmtree(project_dir)
shutil.copytree("../../utils_nlp", os.path.join(project_dir, "utils_nlp"))

## 3. Prepare Training Script

In [1]:
%%writefile $project_dir/train.py

import horovod.torch as hvd
import torch
import numpy as np
import time
import argparse
from torch.utils.data import DataLoader
from utils_nlp.dataset.xnli_torch_dataset import XnliDataset
from utils_nlp.models.bert.common import Language
from pytorch_pretrained_bert.optimization import BertAdam
from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier
from sklearn.metrics import classification_report

print("Torch version:", torch.__version__)

hvd.init()

LANGUAGE_ENGLISH = "en"
TRAIN_FILE_SPLIT = "train"
TEST_FILE_SPLIT = "test"
TO_LOWERCASE = True 
PRETRAINED_BERT_LNG = Language.ENGLISH
LEARNING_RATE = 5e-5
WARMUP_PROPORTION = 0.1
BATCH_SIZE = 32
NUM_GPUS = 4

## each worker (Horovod rank) gets its own copy of the data
CACHE_DIR = './xnli-data-%d' % hvd.rank()

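parser = argparse.ArgumentParser()
# Training settings
parser.add_argument('--seed', type=int, default=42, metavar='S', help='random seed (default: 42)')
parser.add_argument('--no-cuda', action='store_true', default=False, help='disables CUDA training')
# The script reads args.epochs and args.num_workers further down, so both must be defined.
# The default values below are assumptions; adjust them to your setup.
parser.add_argument('--epochs', type=int, default=1, metavar='N', help='number of training epochs (default: 1)')
parser.add_argument('--num-workers', type=int, default=4, metavar='W', help='data loading processes (default: 4)')

args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()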


'''
Note: For example, with 4 nodes and 4 GPUs per node, you spawn 16 workers.
Every worker has a global rank in [0, 15] and a local_rank in [0, 3] within its node.
'''
if args.cuda:
    torch.cuda.set_device(hvd.local_rank())
    torch.cuda.manual_seed(args.seed)

# num_workers is the number of DataLoader subprocesses used to load data;
# here it is set to the number of GPUs per machine
kwargs = {'num_workers': 4, 'pin_memory': True} if args.cuda else {}

train_dataset = XnliDataset(file_split=TRAIN_FILE_SPLIT, 
                            cache_dir=CACHE_DIR, 
                            language=LANGUAGE_ENGLISH,
                            to_lowercase=TO_LOWERCASE,
                            tok_language=PRETRAINED_BERT_LNG)

# Each worker reads a distinct shard of the training set
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank()
)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=train_sampler, **kwargs)
    
#set the label_encoder for evaluation
label_encoder = train_dataset.label_encoder
num_labels = len(np.unique(train_dataset.labels))

classifier = BERTSequenceClassifier(language=PRETRAINED_BERT_LNG,
                                            num_labels=num_labels,
                                            cache_dir=CACHE_DIR,
                                            )

# optimizer configurations
num_samples = len(train_loader.dataset)
num_batches = int(num_samples/BATCH_SIZE)
num_workers = args.num_workers
num_train_optimization_steps = num_batches*args.epochs #int(num_batches/hvd.size()) * args.epochs 
optimizer_grouped_parameters = classifier.optimizer_params

# Scale the learning rate by the number of workers
lr = LEARNING_RATE * hvd.size()

# Use the warmup schedule only when a warmup proportion is set
if WARMUP_PROPORTION is None:
    bert_optimizer = BertAdam(optimizer_grouped_parameters, lr=lr)
else:
    bert_optimizer = BertAdam(
        optimizer_grouped_parameters,
        lr=lr,
        t_total=num_train_optimization_steps,
        warmup=WARMUP_PROPORTION,
    )


## Distributed optimizer
bert_optimizer = hvd.DistributedOptimizer(bert_optimizer, classifier.model.named_parameters())
hvd.broadcast_parameters(classifier.model.state_dict(), root_rank=0)
    
classifier.fit(train_loader, bert_optimizer, args.epochs, NUM_GPUS, hvd.rank())

# Evaluation runs on the rank 0 worker only
if hvd.rank() == 0:
    NUM_GPUS = 0
    kwargs = {}
    test_dataset = XnliDataset(file_split=TEST_FILE_SPLIT,
                           cache_dir=CACHE_DIR,
                           language=LANGUAGE_ENGLISH,
                           to_lowercase=TO_LOWERCASE,
                           tok_language=PRETRAINED_BERT_LNG
                          )    
    
    test_loader = DataLoader(test_dataset, **kwargs)
    
    predictions = classifier.predict(test_loader, NUM_GPUS, probabilities=False)
    print('=================== Predictions =====================')
    print(predictions)

    # Collect the gold labels in dataset order
    test_labels = [data['labels'] for data in test_dataset]

    # Map predicted class ids back to label names
    predictions = label_encoder.inverse_transform(predictions)
    print(classification_report(test_labels, predictions))

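
The training script relies on `DistributedSampler` to give every Horovod worker a different slice of the training set. The sketch below imitates that partitioning in plain Python so it can be run without PyTorch. It is a simplified illustration, not the library's code: `shard_indices` is a hypothetical helper, shuffling is left out, and it assumes there are at least as many samples as workers.

```python
import math


def shard_indices(num_samples, num_replicas, rank):
    """Return the sample indices one worker sees in an epoch (no shuffling)."""
    per_worker = math.ceil(num_samples / num_replicas)
    total_size = per_worker * num_replicas
    indices = list(range(num_samples))
    # Pad by repeating indices from the start so every worker gets the same count.
    indices += indices[: total_size - num_samples]
    # Worker `rank` takes every num_replicas-th index, starting at `rank`.
    return indices[rank:total_size:num_replicas]


NUM_SAMPLES = 10  # toy value, not the real dataset size
WORLD_SIZE = 4    # plays the role of hvd.size()
BATCH_SIZE = 2    # toy value

for rank in range(WORLD_SIZE):
    shard = shard_indices(NUM_SAMPLES, WORLD_SIZE, rank)
    steps_per_epoch = math.ceil(len(shard) / BATCH_SIZE)
    print("rank", rank, "indices", shard, "steps per epoch", steps_per_epoch)
```

Because each worker only iterates over its own shard, the number of optimizer steps per worker and epoch is roughly `num_samples / (hvd.size() * BATCH_SIZE)`, which is worth keeping in mind when choosing the total step count for the warmup schedule.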

## 4. Create a PyTorch Estimator

The BERT implementation used here is built on PyTorch, so we use the AzureML SDK's PyTorch estimator to submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, see [How to Train Pytorch Models on AzureML](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch). The Conda and pip dependencies of the training script are passed to the estimator through `conda_packages` and `pip_packages`.

In [None]:
NODE_COUNT = 2
# Horovod runs one process per GPU, so this should match the number of GPUs on each node
mpiConfig = MpiConfiguration()
mpiConfig.process_count_per_node = 4

est = PyTorch(
    source_directory=project_dir,
    compute_target=compute_target,
    entry_script="train.py",
    node_count=NODE_COUNT,
    distributed_training=mpiConfig,
    use_gpu=True,
    framework_version="1.0",
    conda_packages=["scikit-learn=0.20.3", "numpy", "spacy", "nltk"],
    pip_packages=["pandas","seqeval[gpu]", "pytorch-pretrained-bert"],
)
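
To see what this configuration implies for the training script, the sketch below enumerates the workers for 2 nodes with 4 processes each and derives the effective batch size and the scaled learning rate used in `train.py`. It assumes ranks are assigned node by node, which is a common default but is not guaranteed by every MPI setup. The helper `worker_layout` is hypothetical.

```python
def worker_layout(node_count, process_count_per_node):
    """List (rank, node, local_rank) per worker, assuming node-by-node rank assignment."""
    layout = []
    for node in range(node_count):
        for local_rank in range(process_count_per_node):
            rank = node * process_count_per_node + local_rank
            layout.append((rank, node, local_rank))
    return layout


# Values taken from the estimator cell and the training script.
NODE_COUNT = 2
PROCESS_COUNT_PER_NODE = 4
BASE_LEARNING_RATE = 5e-5
PER_WORKER_BATCH_SIZE = 32

layout = worker_layout(NODE_COUNT, PROCESS_COUNT_PER_NODE)
world_size = len(layout)  # what hvd.size() would report

for rank, node, local_rank in layout:
    print("rank", rank, "node", node, "local_rank", local_rank)

print("world size:", world_size)
print("effective batch size:", PER_WORKER_BATCH_SIZE * world_size)
print("scaled learning rate:", BASE_LEARNING_RATE * world_size)
```

The script pins each process to the GPU given by its `local_rank`, which is why the number of processes per node should not exceed the number of GPUs on the chosen VM size.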

## 5. Create Experiment and Submit a Job
Submit the estimator object to run your experiment. Results can be monitored using a Jupyter widget. The widget and run are asynchronous and update every 10-15 seconds until job completion.

In [None]:
experiment = Experiment(ws, name="NLP-Entailment-BERT")
run = experiment.submit(est)

In [None]:
RunDetails(run).show()