# Parallel BERT Experiments with AzureML Pipelines

SentEval is a widely used benchmarking tool for evaluating general-purpose sentence embeddings. It provides a simple interface for evaluating your embeddings on up to 17 supported downstream tasks (such as sentiment classification, natural language inference, and semantic similarity).

In this notebook, we show how to evaluate BERT sentence encodings on SentEval **in parallel** with AzureML: each combination of experiment parameters runs as its own pipeline step.

### 00 Global Settings

In [1]:
import itertools
import os
import pickle
import shutil
import sys
import torch
from collections import OrderedDict
import numpy as np
import pandas as pd
from copy import deepcopy

from azureml.core import Experiment
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.widgets import RunDetails
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

sys.path.append("../../")
from utils_nlp.azureml.azureml_utils import get_or_create_workspace, get_or_create_amlcompute
from utils_nlp.models.bert.common import Language, Tokenizer
from utils_nlp.models.bert.sequence_encoding import BERTSentenceEncoder, PoolingStrategy
from utils_nlp.eval.senteval import SentEvalConfig

In [2]:
# device config
NUM_GPUS = 1

# model config
LANGUAGE = Language.ENGLISH
TO_LOWER = True
MAX_SEQ_LENGTH = 128

# path config
CACHE_DIR = "./temp"
PATH_TO_SENTEVAL = "../../../SentEval"

# experiment config
BATCH_SIZE = 32
TRANSFER_TASKS = ["STSBenchmark"]
EXP_PARAMS = {
    "layer_index": [-1, -2],
    "pooling_strategy": [PoolingStrategy.MEAN, PoolingStrategy.MAX],
}

# azureml config
CONFIG_PATH = ".azureml"
EXPERIMENT_NAME = "NLP-SS-bert"
CLUSTER_NAME = "eval-gpu"
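
The two keys in `EXP_PARAMS` select which layer's embeddings are evaluated (`layer_index`) and how per-token vectors are collapsed into a single sentence vector (`pooling_strategy`). As a rough illustration only, the sketch below implements mean and max pooling over a toy list of token vectors in plain Python. The helper names `mean_pool` and `max_pool` and the toy data are hypothetical; the actual `PoolingStrategy` implementation in `utils_nlp` may differ, for example in how it treats padding tokens.

```python
def mean_pool(token_vectors):
    """Average each dimension over all tokens (hypothetical helper)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def max_pool(token_vectors):
    """Take the per-dimension maximum over all tokens (hypothetical helper)."""
    return [max(column) for column in zip(*token_vectors)]


# Toy "sentence" of three tokens, each with a 2-dimensional vector.
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]

mean_vec = mean_pool(tokens)
max_vec = max_pool(tokens)
print(mean_vec)  # [2.0, 2.0]
print(max_vec)   # [3.0, 4.0]
```

Both strategies return a vector with the same dimensionality as a single token vector, so downstream SentEval tasks can consume either one without changes.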

### 01 Set up AzureML resources

In [3]:
ws = get_or_create_workspace(config_path=CONFIG_PATH)
exp = Experiment(workspace=ws, name=EXPERIMENT_NAME)
ds = ws.get_default_datastore()

compute = get_or_create_amlcompute(
    workspace=ws,
    compute_name=CLUSTER_NAME,
    vm_size="STANDARD_NC6",
    max_nodes=8,
    idle_seconds_before_scaledown=300,
    verbose=False,
)

In [4]:
ds.upload(
    src_dir=PATH_TO_SENTEVAL,
    target_path=os.path.join(EXPERIMENT_NAME, "senteval"),
    overwrite=False,
    show_progress=True,
)

Uploading an estimated of 200 files
Target already exists. Skipping upload for NLP-SS-bert/senteval/LICENSE
Target already exists. Skipping upload for NLP-SS-bert/senteval/README.md
Target already exists. Skipping upload for NLP-SS-bert/senteval/setup.py
Target already exists. Skipping upload for NLP-SS-bert/senteval/.gitignore
Target already exists. Skipping upload for NLP-SS-bert/senteval/senteval/sts.py
Target already exists. Skipping upload for NLP-SS-bert/senteval/senteval/binary.py
Target already exists. Skipping upload for NLP-SS-bert/senteval/senteval/__init__.py
Target already exists. Skipping upload for NLP-SS-bert/senteval/senteval/engine.py
Target already exists. Skipping upload for NLP-SS-bert/senteval/senteval/snli.py
Target already exists. Skipping upload for NLP-SS-bert/senteval/senteval/utils.py
Target already exists. Skipping upload for NLP-SS-bert/senteval/senteval/probing.py
Target already exists. Skipping upload for NLP-SS-bert/senteval/senteval/sick.py
Target alre

Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/SNLI/s1.test
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/CR/custrev.neg
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/CR/custrev.pos
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/MRPC/msr_paraphrase_test.txt
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/MRPC/msr_paraphrase_train.txt
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/COCO/train.pkl
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/COCO/test.pkl
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/COCO/valid.pkl
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/MPQA/mpqa.neg
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/MPQA/mpqa.pos
Target already exists. Ski

Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/STS/STS15-en-test/STS.input.images.txt
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/STS/STS15-en-test/correlation-noconfidence.pl
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/STS/STS15-en-test/STS.gs.answers-forums.txt
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/STS/STS15-en-test/STS.gs.images.txt
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/STS/STSBenchmark/sts-test.csv
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/STS/STSBenchmark/sts-dev.csv
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/STS/STSBenchmark/readme.txt
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstream/STS/STSBenchmark/correlation.pl
Target already exists. Skipping upload for NLP-SS-bert/senteval/data/downstr

$AZUREML_DATAREFERENCE_e9cdbb86f83c464fb686e73f4e80f436

### 02 Define the model

In [5]:
se = BERTSentenceEncoder(
    language=LANGUAGE,
    num_gpus=NUM_GPUS,
    cache_dir=CACHE_DIR,
    to_lower=TO_LOWER,
    max_len=MAX_SEQ_LENGTH,
)

### 03 Define SentEval configurations
As specified in the SentEval repo, we implement two functions:

**prepare** sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors, and so on. Here it encodes every sentence once and caches the embeddings.

**batcher** transforms a batch of text sentences into sentence embeddings. Here it looks up the embeddings that **prepare** cached.

In [6]:
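def prepare(params, samples):
    # Encode every sentence of the task once and cache the result in params.
    sentences = [" ".join(s).lower() for s in samples]
    params["embeddings"] = params["model"].encode(
        sentences, batch_size=params["batch_size"], as_numpy=False
    )
    # Map each sentence to its row index so batcher can look it up later.
    # OrderedDict is imported directly from collections in the first cell.
    params["sentence2idx"] = OrderedDict(
        list(zip(sentences, range(len(sentences))))
    )
    return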


def batcher(params, batch):
    sentences = [" ".join(s).lower() for s in batch]
    sentence_indices = [params["sentence2idx"][s] for s in sentences]

    df = params["embeddings"]
    embeddings = []
    for i in sentence_indices:
        values = np.squeeze(
            df.loc[
                (df["text_index"] == i) & (df["layer_index"] == params["layer_index"])
            ]["values"].values
        ).tolist()
        embeddings.append(values)
    embeddings = np.array(embeddings)
    return embeddings
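
To see the `prepare`/`batcher` contract in isolation, here is a minimal sketch that swaps the BERT encoder for a toy function and uses plain lists instead of a DataFrame. `toy_encode` is a hypothetical stand-in, not part of `utils_nlp` or SentEval; only the two-function shape mirrors the cell above.

```python
from collections import OrderedDict


def toy_encode(sentences):
    """Hypothetical encoder: maps a sentence to [word count, character count]."""
    return [[float(len(s.split())), float(len(s))] for s in sentences]


def prepare(params, samples):
    # samples holds every tokenized sentence of the task (lists of words).
    sentences = [" ".join(s).lower() for s in samples]
    params["embeddings"] = toy_encode(sentences)
    params["sentence2idx"] = OrderedDict((s, i) for i, s in enumerate(sentences))


def batcher(params, batch):
    # Rebuild the same normalized key used in prepare, then look up the vector.
    sentences = [" ".join(s).lower() for s in batch]
    return [params["embeddings"][params["sentence2idx"][s]] for s in sentences]


params = {}
samples = [["A", "dog", "barks"], ["Hello", "world"]]
prepare(params, samples)

batch_embeddings = batcher(params, [["Hello", "world"]])
print(batch_embeddings)  # [[2.0, 11.0]]
```

The lookup only works because both functions normalize sentences the same way (join on spaces, then lowercase); changing one without the other would raise a `KeyError` in `batcher`.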

In [7]:
sec = SentEvalConfig(
    path=ds.path("{}/senteval".format(EXPERIMENT_NAME)).as_mount(),
    model=se,
    prepare_func=prepare,
    batcher_func=batcher,
    transfer_tasks=TRANSFER_TASKS,
    params={"usepytorch": True, "batch_size": BATCH_SIZE},
)

### 04 Define the script run 

In [8]:
src_dir = os.path.join(CACHE_DIR, EXPERIMENT_NAME)
os.makedirs(src_dir, exist_ok=True)
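utils_dst = os.path.join(src_dir, "utils_nlp")
# copytree raises if the destination exists, so clear any copy left by a previous run.
shutil.rmtree(utils_dst, ignore_errors=True)
shutil.copytree("../../utils_nlp", utils_dst)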

'./temp/NLP-SS-bert/utils_nlp'

In [9]:
%%writefile $src_dir/run.py
import argparse
import pickle
import sys  # needed for sys.path.insert below

from utils_nlp.eval.senteval import SentEvalConfig

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config",
        type=str,
        dest="config",
        help="Filename of serialized SentEvalConfig object",
    )
    parser.add_argument(
        "--output",
        type=str,
        dest="output",
        help="Filename to write serialized results to",
    )
    args = parser.parse_args()
    
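    # Load the serialized SentEvalConfig for this step.
    with open(args.config, "rb") as f:
        config = pickle.load(f)
    # Make the SentEval copy at config.path importable before importing it.
    sys.path.insert(0, config.path)
    import senteval

    se = senteval.engine.SE(config.params, config.prepare, config.batcher)

    results = se.eval(config.transfer_tasks)
    with open(args.output, "wb") as f:
        pickle.dump(results, f)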

Overwriting ./temp/NLP-SS-bert/run.py
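
The script follows a simple pattern: read a pickled config from `--config`, evaluate, and pickle the results to `--output`. The sketch below reproduces that round trip locally with the standard library only. The dictionary config, `fake_evaluate`, and the temporary file names are hypothetical placeholders for `SentEvalConfig`, the SentEval engine, and the AzureML-provided paths.

```python
import argparse
import os
import pickle
import tempfile


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, dest="config")
    parser.add_argument("--output", type=str, dest="output")
    return parser.parse_args(argv)


def run(argv, evaluate):
    """Load a pickled config, evaluate it, and pickle the results."""
    args = parse_args(argv)
    with open(args.config, "rb") as f:
        config = pickle.load(f)
    results = evaluate(config)
    with open(args.output, "wb") as f:
        pickle.dump(results, f)
    return results


def fake_evaluate(config):
    # Hypothetical evaluator: echoes the layer index for each task.
    return {task: {"layer_index": config["layer_index"]} for task in config["transfer_tasks"]}


workdir = tempfile.mkdtemp()
config_file = os.path.join(workdir, "config000.pkl")
output_file = os.path.join(workdir, "results000.pkl")

with open(config_file, "wb") as f:
    pickle.dump({"transfer_tasks": ["STSBenchmark"], "layer_index": -1}, f)

run(["--config", config_file, "--output", output_file], fake_evaluate)

with open(output_file, "rb") as f:
    saved = pickle.load(f)
print(saved)  # {'STSBenchmark': {'layer_index': -1}}
```

Passing the argument list explicitly to `parse_args` keeps the sketch testable in a notebook, where `sys.argv` belongs to the kernel rather than to the script.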


### 05 Run experiments in parallel

In [10]:
parameter_groups = list(itertools.product(*list(EXP_PARAMS.values())))
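
`itertools.product` expands the lists in `EXP_PARAMS` into every combination, which is why the pipeline below ends up with four steps. The following sketch shows the expansion with plain strings standing in for the `PoolingStrategy` members; `expand_grid` and `exp_params_grid` are hypothetical names used only for illustration.

```python
import itertools

# Stand-in for EXP_PARAMS; strings replace the PoolingStrategy enum members.
exp_params_grid = {
    "layer_index": [-1, -2],
    "pooling_strategy": ["mean", "max"],
}


def expand_grid(grid):
    """Return one dict per combination of the grid's values."""
    keys = list(grid.keys())
    return [dict(zip(keys, values)) for values in itertools.product(*grid.values())]


combos = expand_grid(exp_params_grid)
for i, combo in enumerate(combos):
    print("config{0:03d}".format(i), combo)
```

The last key varies fastest, so `config000` pairs layer `-1` with mean pooling and `config003` pairs layer `-2` with max pooling. Adding a third key with three values would grow the grid to twelve configurations.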

In [11]:
conda_dependencies = CondaDependencies.create(
    conda_packages=[
        "numpy",
        "pandas",
    ],
    pip_packages=[
        "azureml-sdk==1.0.43.*",
        "torch==1.1",
        "tqdm==4.31.1",
        "pytorch-pretrained-bert>=0.6",
    ],
    python_version="3.6.8",
)

rc = RunConfiguration(conda_dependencies=conda_dependencies)
rc.target = CLUSTER_NAME
rc.environment.docker.enabled = True

In [12]:
steps = []
for i, p in enumerate(parameter_groups):
    exp_params = dict(zip(EXP_PARAMS.keys(), p))
    sc = deepcopy(sec)
    sc.append_params(exp_params)
    for k, v in exp_params.items():
        setattr(sc.model, k, v)

    # Serialize this step's config locally, then upload it to the datastore.
    config_file = os.path.join(CACHE_DIR, "config{0:03d}.pkl".format(i))
    with open(config_file, "wb") as f:
        pickle.dump(sc, f)
    ds.upload_files(
        [config_file],
        target_path=EXPERIMENT_NAME,
        overwrite=False,
        show_progress=False,
    )

    input_config = DataReference(
        datastore=ds,
        data_reference_name="config{0:03d}".format(i),
        path_on_datastore="{0}/config{1:03d}.pkl".format(EXPERIMENT_NAME, i),
    )
    output_results = PipelineData(
        datastore=ds,
        name="results{0:03d}".format(i),
        output_path_on_compute="{0}/results{1:03d}.pkl".format(EXPERIMENT_NAME, i),
    )

    step = PythonScriptStep(
        source_directory=src_dir,
        script_name="run.py",
        arguments=[
            "--config",
            input_config,
            "--output",
            output_results,
        ],
        inputs=[input_config],
        outputs=[output_results],
        runconfig=rc,
    )

    steps.append(step)
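
The loop deep-copies the base config before applying each parameter group, so no step can see another step's settings. Here is a minimal sketch of that pattern, assuming toy classes in place of the real ones: `ToyModel` and `ToyConfig` are hypothetical, and the dictionary-merge behavior of `append_params` is an assumption about `SentEvalConfig`, not a confirmed detail.

```python
from copy import deepcopy


class ToyModel:
    """Hypothetical stand-in for the sentence encoder."""

    def __init__(self):
        self.layer_index = -1
        self.pooling_strategy = "mean"


class ToyConfig:
    """Hypothetical stand-in for SentEvalConfig."""

    def __init__(self, model, params):
        self.model = model
        self.params = params

    def append_params(self, extra):
        # Assumption: append_params merges the extra entries into params.
        self.params.update(extra)


base = ToyConfig(ToyModel(), {"batch_size": 32})
grid = [
    {"layer_index": -1, "pooling_strategy": "max"},
    {"layer_index": -2, "pooling_strategy": "mean"},
]

configs = []
for exp_params in grid:
    cfg = deepcopy(base)  # independent copy of both params and model
    cfg.append_params(exp_params)
    for key, value in exp_params.items():
        setattr(cfg.model, key, value)
    configs.append(cfg)

print(base.params)                              # {'batch_size': 32}
print([c.model.layer_index for c in configs])   # [-1, -2]
```

With a shallow copy, or no copy at all, every config would share one `params` dictionary and one model object, and all steps would end up with the settings of the last parameter group.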

In [13]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()

Step run.py is ready to be created [d0fba441]
Step run.py is ready to be created [a90bd4e1]
Step run.py is ready to be created [34ebeb9e]
Step run.py is ready to be created [481b4441]


[]

In [14]:
pipeline_run = exp.submit(pipeline, regenerate_outputs=False)

Created step run.py [d0fba441][c68d387a-e3b1-4f4e-b602-7d44645f478c], (This step will run and generate new outputs)
Created step run.py [a90bd4e1][d0e78112-92a5-41d8-976a-f92c038d909d], (This step will run and generate new outputs)
Created step run.py [34ebeb9e][68b9c214-29d1-4ffb-a0f0-47616f676d35], (This step will run and generate new outputs)
Created step run.py [481b4441][ef9c270e-5884-443f-85b0-cecbb5c37eac], (This step will run and generate new outputs)
Using data reference config000 for StepId [ad497262][3fb28726-32ae-48a5-b91d-f23228d42611], (Consumers of this data are eligible to reuse prior runs.)
Using data reference config001 for StepId [af28eb33][95ed72e0-6882-4bc2-803f-727c945ad703], (Consumers of this data are eligible to reuse prior runs.)
Using data reference config002 for StepId [0554c17c][c0fc39fd-63f1-41c1-af53-9fb56ccb5c2f], (Consumers of this data are eligible to reuse prior runs.)
Using data reference config003 for StepId [d62120c0][02844cfe-8247-4705-8b3c-9f8ce0

In [15]:
RunDetails(pipeline_run).show()
