# Natural Language Inference on MultiNLI Dataset using BERT with Azure Machine Learning

## Summary
In this notebook, we demonstrate using the BERT model for natural language inference in English. We use the [XNLI](https://github.com/facebookresearch/XNLI) dataset, and the task is to classify sentence pairs into three classes: contradiction, entailment, and neutral.  
The figure below shows how [BERT](https://arxiv.org/abs/1810.04805) classifies sentence pairs. It concatenates the tokens of each sentence pair and separates the two sentences with the [SEP] token. A [CLS] token is prepended to the token list and used as the aggregate sequence representation for the classification task.
<img src="https://nlpbp.blob.core.windows.net/images/bert_two_sentence.PNG">
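
As a rough illustration of the input layout described above, the sketch below builds the token list, segment ids, and attention mask for one sentence pair. It uses whitespace splitting in place of BERT's WordPiece tokenizer, and `encode_pair` is a hypothetical helper written for this example, not part of `utils_nlp` or any BERT library.

```python
from typing import List, Tuple


def encode_pair(
    premise: str, hypothesis: str, max_len: int = 16
) -> Tuple[List[str], List[int], List[int]]:
    """Lay out a sentence pair the way BERT expects (illustrative only).

    Whitespace splitting stands in for WordPiece tokenization.
    """
    tokens_a = premise.lower().split()
    tokens_b = hypothesis.lower().split()

    # [CLS] sentence A [SEP] sentence B [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 for [CLS], sentence A and its [SEP]; 1 for sentence B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # Naive truncation; real implementations trim the sentences so that the
    # final [SEP] is always kept.
    tokens = tokens[:max_len]
    segment_ids = segment_ids[:max_len]

    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(tokens)
    padding = max_len - len(tokens)
    tokens += ["[PAD]"] * padding
    segment_ids += [0] * padding
    attention_mask += [0] * padding
    return tokens, segment_ids, attention_mask


tokens, segment_ids, attention_mask = encode_pair(
    "A man is playing a guitar", "A person makes music"
)
print(tokens)
print(segment_ids)
print(attention_mask)
```

The classifier reads the final hidden state at the [CLS] position, so the three lists together are everything the model needs to tell the two sentences apart.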

Azure Machine Learning features highlighted in the notebook:

- Distributed training with Horovod

In [1]:
#Imports
import sys
sys.path.append("../..")

import os
import shutil

import torch
import azureml.core
from azureml.train.dnn import PyTorch
from azureml.core.runconfig import MpiConfiguration
from azureml.core import Experiment
from azureml.widgets import RunDetails
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from utils_nlp.azureml.azureml_utils import get_or_create_workspace

print("System version: {}".format(sys.version))
print("Azure ML SDK Version:", azureml.core.VERSION)

System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
Azure ML SDK Version: 1.0.48


## 2. AzureML Setup

### 2.1 Link to or create a Workspace

First, go through the [Configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML `Workspace`. This will create a config.json file containing the values needed below to create a workspace.

**Note**: You do not need to fill in these values if you have a config.json in the same folder as this notebook.
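
For orientation, a workspace `config.json` is a small JSON file holding the subscription id, resource group, and workspace name. The sketch below writes and reads such a file with placeholder values. The values and the temporary path are assumptions for illustration only; in practice the file comes from the Configuration notebook or the Azure portal.

```python
import json
import os
import tempfile

# Placeholder values for illustration only; substitute your own.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write to a temporary directory so that no real config.json is overwritten.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(workspace_config, f, indent=4)

# Read the file back to confirm that it holds the three expected keys.
with open(config_path) as f:
    loaded_config = json.load(f)

print(sorted(loaded_config))
```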

In [2]:

ws = get_or_create_workspace(
    subscription_id="15ae9cb6-95c1-483d-a0e3-b1a1a3b06324",
    resource_group="nlprg",
    workspace_name="MAIDAP-Entailment",
    workspace_region="East US",
)


In [3]:
print(
    "Workspace name: " + ws.name,
    "Azure region: " + ws.location,
    "Subscription id: " + ws.subscription_id,
    "Resource group: " + ws.resource_group,
    sep="\n",
)

Workspace name: MAIDAP-Entailment
Azure region: eastus
Subscription id: 15ae9cb6-95c1-483d-a0e3-b1a1a3b06324
Resource group: nlprg


### 2.2 Link AmlCompute Compute Target

We need to link a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training our model (see [compute options](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#supported-compute-targets) for an explanation of the different options). We use an [AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) target: the code below links to an existing cluster if one named `cluster_name` exists, and otherwise creates a STANDARD_NC6 GPU cluster that autoscales from 0 to 4 nodes. Creating a new AmlCompute cluster takes approximately 5 minutes.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [4]:
#cluster_name = "bertncrs24"
cluster_name = "gpu-entail"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found compute target: {}".format(cluster_name))
except ComputeTargetException:
    print("Creating new compute target: {}".format(cluster_name))
    # max_nodes must be at least the node count requested by the estimator (4).
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6", max_nodes=4
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)


print(compute_target.get_status().serialize())

Found compute target: gpu-entail
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-07-31T22:29:25.780000+00:00', 'errors': None, 'creationTime': '2019-07-27T02:14:46.127092+00:00', 'modifiedTime': '2019-07-27T02:15:07.181277+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6S_V2'}
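
The serialized status above includes the cluster's scale settings, which are worth checking before submitting a multi-node job. Below is a minimal sketch of such a check, assuming a dictionary shaped like that output. `can_run` is a hypothetical helper, and the dictionary is a trimmed copy of the printed status.

```python
def can_run(status: dict, node_count: int) -> bool:
    """Return True if the cluster is provisioned and can scale to `node_count` nodes."""
    scale = status["scaleSettings"]
    return (
        status["provisioningState"] == "Succeeded"
        and node_count <= scale["maxNodeCount"]
    )


# Trimmed copy of the status printed above.
status = {
    "provisioningState": "Succeeded",
    "scaleSettings": {"minNodeCount": 0, "maxNodeCount": 4},
    "vmSize": "STANDARD_NC6S_V2",
}

print(can_run(status, node_count=4))
print(can_run(status, node_count=8))
```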


In [5]:
DEBUG = True
project_dir = "./entail_utils"
if DEBUG and os.path.exists(project_dir): 
    shutil.rmtree(project_dir) 
shutil.copytree("../../utils_nlp", os.path.join(project_dir, "utils_nlp"))

'./entail_utils\\utils_nlp'

## 3. Prepare Training Script

In [6]:
%%writefile $project_dir/train.py
import torch
import numpy as np
import time
import argparse
import horovod.torch as hvd
from torch.utils.data import DataLoader, SequentialSampler
from utils_nlp.dataset.xnli import load_pandas_df
from utils_nlp.models.bert.common import Language, Tokenizer
from utils_nlp.models.bert.sequence_classification_distributed import (
    BERTSequenceDistClassifier,
)
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

print("Torch version:", torch.__version__)

# constants to use in the file
TRAIN_FILE_SPLIT = "train"
TEST_FILE_SPLIT = "test"
TO_LOWER = True
LANGUAGE_ENGLISH = Language.ENGLISH
MAX_SEQ_LENGTH = 128
LABEL_COL = "label"
TEXT_COL = "text"
TRAIN_DATA_USED_PERCENT = 0.0025
LEARNING_RATE = 5e-5
WARMUP_PROPORTION = 0.1
BATCH_SIZE = 32
NUM_EPOCHS = 2
NUM_GPUS = 1

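# Each worker gets its own copy of the data.
# NOTE: hvd.rank() requires Horovod to be initialized; this assumes that importing
# sequence_classification_distributed above calls hvd.init().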
CACHE_DIR = "./xnli-data-%d" % hvd.rank()

parser = argparse.ArgumentParser()
parser.add_argument(
    "--seed",
    type=int,
    default=42,
    metavar="S",
    help="random seed (default: 42)",
)
parser.add_argument(
    "--no-cuda",
    action="store_true",
    default=False,
    help="disables CUDA training",
)
args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()
print(args.cuda)

if args.cuda:
    torch.cuda.set_device(hvd.local_rank())
    torch.cuda.manual_seed(args.seed)

train_df = load_pandas_df(
    local_cache_path=CACHE_DIR, file_split="train", language="en"
)
train_data_used_count = round(TRAIN_DATA_USED_PERCENT * train_df.shape[0])
train_df = train_df.loc[:train_data_used_count]

dev_df = load_pandas_df(
    local_cache_path=CACHE_DIR, file_split="dev", language="en"
)
test_df = load_pandas_df(
    local_cache_path=CACHE_DIR, file_split="test", language="en"
)

tokenizer = Tokenizer(LANGUAGE_ENGLISH, to_lower=TO_LOWER, cache_dir=CACHE_DIR)
train_tokens = tokenizer.tokenize(train_df[TEXT_COL])
test_tokens = tokenizer.tokenize(test_df[TEXT_COL])

label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(train_df[LABEL_COL])
num_labels = len(np.unique(train_labels))

train_token_ids, train_input_mask, train_token_type_ids = tokenizer.preprocess_classification_tokens(
    train_tokens, max_len=MAX_SEQ_LENGTH
)

classifier = BERTSequenceDistClassifier(
    language=LANGUAGE_ENGLISH, num_labels=num_labels, cache_dir=CACHE_DIR
)

classifier.fit(
    token_ids=train_token_ids,
    input_mask=train_input_mask,
    token_type_ids=train_token_type_ids,
    labels=train_labels,
    num_gpus=NUM_GPUS,
    num_epochs=NUM_EPOCHS,
    batch_size=BATCH_SIZE,
    lr=LEARNING_RATE,
    warmup_proportion=WARMUP_PROPORTION,
)


# evaluation
if hvd.rank() == 0:
    NUM_GPUS = 1
    kwargs = {}

    test_token_ids, test_input_mask, test_token_type_ids = tokenizer.preprocess_classification_tokens(
        test_tokens, max_len=MAX_SEQ_LENGTH
    )

    predictions, labels = classifier.predict(
        token_ids=test_token_ids,
        input_mask=test_input_mask,
        token_type_ids=test_token_type_ids,
        batch_size=BATCH_SIZE,
    )
    
    predictions = label_encoder.inverse_transform(predictions)
    
    print(classification_report(test_df[LABEL_COL], predictions.tolist(), digits=3))


Writing ./entail_utils/train.py
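
The script passes `WARMUP_PROPORTION = 0.1` to `fit`, but how the classifier applies it is not shown in this notebook. The sketch below assumes a linear warmup followed by a linear decay, a common choice for BERT fine-tuning. `warmup_linear_lr` is a hypothetical helper written for illustration, not the library's implementation.

```python
def warmup_linear_lr(
    step: int,
    total_steps: int,
    base_lr: float = 5e-5,
    warmup_proportion: float = 0.1,
) -> float:
    """Learning rate at `step` under linear warmup then linear decay (assumed schedule)."""
    progress = step / total_steps
    if progress < warmup_proportion:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * progress / warmup_proportion
    # Decay linearly from base_lr to 0 over the remaining steps.
    return base_lr * max(0.0, (1.0 - progress) / (1.0 - warmup_proportion))


total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, warmup_linear_lr(step, total_steps))
```

Under this assumed schedule, the first 10% of optimizer steps use a reduced learning rate, which tends to stabilize the early updates of a pretrained model.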


## 4. Create a PyTorch Estimator

The BERT implementation used here is built on PyTorch, so we use the AzureML SDK's PyTorch estimator to submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, see [How to Train Pytorch Models on AzureML](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch). The dependencies of the training script are passed directly to the estimator through `conda_packages` and `pip_packages`, and distributed training is configured with an `MpiConfiguration` that runs one process per node.

In [7]:
NODE_COUNT = 4
mpiConfig = MpiConfiguration()
mpiConfig.process_count_per_node = 1

est = PyTorch(
    source_directory=project_dir,
    compute_target=compute_target,
    entry_script="train.py",
    node_count=NODE_COUNT,
    distributed_training=mpiConfig,
    use_gpu=True,
    framework_version="1.0",
    conda_packages=["scikit-learn=0.20.3", "numpy", "spacy", "nltk"],
    pip_packages=["pandas","seqeval[gpu]", "pytorch-pretrained-bert"],
)
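
With `NODE_COUNT = 4` and one MPI process per node, the job launches four workers. The arithmetic below assumes that each worker uses the script's `BATCH_SIZE` of 32 for its own mini-batches and that gradients are averaged across workers, as is usual in data-parallel training. Whether the classifier shards the data exactly this way is not shown in this notebook.

```python
node_count = 4               # NODE_COUNT passed to the estimator
process_count_per_node = 1   # mpiConfig.process_count_per_node
per_worker_batch_size = 32   # BATCH_SIZE in train.py

# Total number of Horovod workers across the cluster.
world_size = node_count * process_count_per_node
# Number of examples contributing to each synchronized gradient step.
global_batch_size = world_size * per_worker_batch_size

print(world_size, global_batch_size)
```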

## 5. Create Experiment and Submit a Job
Submit the estimator object to run your experiment. Results can be monitored using a Jupyter widget. The widget and run are asynchronous and update every 10-15 seconds until job completion.

In [8]:
experiment = Experiment(ws, name="Nlp-Entailment-BERT")
run = experiment.submit(est)

In [9]:
RunDetails(run).show()


In [None]:
run.cancel()