Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Automated Machine Learning
_**Named Entity Recognition Using AutoML NLP**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Inference](#Inference)

## Introduction
This notebook demonstrates Named Entity Recognition (NER) with text data using AutoML NLP.

The AutoML highlight here is end-to-end deep learning for NLP tasks such as NER.

Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

Notebook synopsis:

1. Creating an Experiment in an existing Workspace
2. Configuring and running AutoML remotely on the CoNLL 2003 dataset for the NER task
3. Evaluating the trained model on a test set

## Setup

In [None]:
import logging
import os
import tempfile

import pandas as pd

import azureml.core
from azureml.core import Dataset
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.data.datapath import DataPath
from azureml.core.run import Run
from azureml.core.script_run_config import ScriptRunConfig
from azureml.train.automl import AutoMLConfig

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

In [None]:
print("This notebook was created using version 1.39.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

As part of the setup you have already created a <b>Workspace</b>. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem.

In [None]:
ws = Workspace.from_config()

# Choose an experiment name.
experiment_name = "automl-nlp-text-ner"

experiment = Experiment(ws, experiment_name)

output = {}
output["Subscription ID"] = ws.subscription_id
output["Workspace Name"] = ws.name
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
output["Experiment Name"] = experiment.name
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

## Set up a compute cluster
This section uses a user-provided compute cluster (named "gpu-compute" in this example). If no cluster with this name exists in your workspace, the code below creates one. You can adjust the cluster parameters, such as the VM size and the maximum number of nodes, in the code.

In [None]:
num_nodes = 1

# Choose a name for your cluster.
amlcompute_cluster_name = "gpu-compute"

# Reuse the cluster if it already exists; otherwise create it.
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6", max_nodes=num_nodes  # Use GPU only
    )
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Data
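
The cells in this section upload `train.txt` and `dev.txt` without showing their contents. As a rough sketch, assuming the files follow a CoNLL-style layout (one token per line with its label in the last whitespace-separated column, and a blank line between sentences), the helper below parses such text into sentences. The function name `parse_conll` and the sample text are hypothetical and are not part of the Azure ML SDK.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def parse_conll(text: str) -> List[Sentence]:
    """Parse CoNLL-style text into a list of sentences of (token, label) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # A blank line or a document marker ends the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"Expected a token and a label, got: {raw_line!r}")
        # The token is the first column and the label is the last one,
        # so extra columns (e.g. part-of-speech tags) are ignored.
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample, not taken from the dataset files.
sample = """EU B-ORG
rejects O
German B-MISC
call O

Peter B-PER
Blackburn I-PER
"""

parsed = parse_conll(sample)
print(len(parsed), "sentences")
print(parsed[1])
```

Running a check like this on a local copy of the data before uploading can catch malformed lines early, before a remote run is submitted.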

In [None]:
# Upload dataset to datastore
data_dir = "data"  # Local directory to store data
blobstore_datadir = data_dir  # Blob store directory to store data in

datastore = ws.get_default_datastore()
target = DataPath(datastore=datastore, path_on_datastore=blobstore_datadir)
Dataset.File.upload_directory(
    src_dir=data_dir, target=target, overwrite=True, show_progress=True
)

In [None]:
datastore_path = [(datastore, blobstore_datadir + "/train.txt")]
train_data = Dataset.File.from_files(path=datastore_path)

In [None]:
datastore_path = [(datastore, blobstore_datadir + "/dev.txt")]
val_data = Dataset.File.from_files(path=datastore_path)

In [None]:
train_data = train_data.register(
    workspace=ws,
    name="CoNLL_2003_train",
    description="NER train data",
    create_new_version=True,
)

val_data = val_data.register(
    workspace=ws,
    name="CoNLL_2003_val",
    description="NER val data",
    create_new_version=True,
)

# Train

## Submit AutoML run

We do not set the `primary_metric` parameter here because only one model is trained, so there are no models to rank. The run uses the default primary metric, `accuracy`, for reporting purposes only.
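
For sequence labelling, accuracy is commonly computed per token. The notebook does not show how AutoML computes its `accuracy` metric, so the following is only a sketch of that common definition under the assumption of token-level scoring. The function `token_accuracy` and the label sequences are illustrative and do not come from the run.

```python
from typing import Sequence


def token_accuracy(
    y_true: Sequence[Sequence[str]], y_pred: Sequence[Sequence[str]]
) -> float:
    """Return the fraction of tokens whose predicted label matches the reference."""
    correct = 0
    total = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        if len(true_labels) != len(pred_labels):
            raise ValueError("Each sentence needs exactly one prediction per token.")
        correct += sum(t == p for t, p in zip(true_labels, pred_labels))
        total += len(true_labels)
    if total == 0:
        raise ValueError("No tokens to score.")
    return correct / total


# Illustrative labels only, not output from the AutoML run.
reference = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG"]]
predicted = [["B-PER", "O", "O", "B-LOC"], ["O", "B-ORG"]]

score = token_accuracy(reference, predicted)
print(f"token accuracy: {score:.3f}")
```

Because most tokens in NER data carry the `O` label, token-level accuracy tends to look high even when entities are missed, which is one reason to read it alongside the other logged metrics.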

In [None]:
automl_settings = {
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    task="text-ner",
    debug_log="automl_errors.log",
    compute_target=compute_target,
    training_data=train_data,
    validation_data=val_data,
    **automl_settings
)

#### Submit AutoML Run

In [None]:
automl_run = experiment.submit(automl_config, show_output=False)
_ = automl_run.wait_for_completion(show_output=False)

## Download Metrics

These metrics, logged with the training run, are computed with the trained model on the validation dataset.

In [None]:
validation_metrics = automl_run.get_metrics()
pd.DataFrame(
    {
        "metric_name": list(validation_metrics.keys()),
        "value": list(validation_metrics.values()),
    }
)

You can also get the best run and the best model with the `get_output` method.

In [None]:
best_run, best_model = automl_run.get_output()
best_run

# Inference

Now you can use the trained model to run inference on unseen data. We use a `ScriptRun` for this, with a script that we provide. The following cells register the test dataset, download the inference script, and trigger the inference run. Unlike the multiclass and multilabel scenarios, the inference run for NER saves the evaluation metrics, so we do not need to download the predictions and can read the metrics directly.
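
The inference script itself is downloaded from the training run and is not shown in this notebook. To illustrate how a script could receive the three command-line arguments passed in the cells below (`--run_id`, `--experiment_name`, `--input_dataset_id`), here is a hypothetical sketch using `argparse`. The function `parse_scoring_args` and the placeholder values are assumptions for illustration and do not reproduce the contents of `score_script.py`.

```python
import argparse
from typing import List, Optional


def parse_scoring_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the arguments a scoring entry point might receive (hypothetical)."""
    parser = argparse.ArgumentParser(description="Hypothetical NER scoring entry point")
    parser.add_argument("--run_id", required=True, help="ID of the training run")
    parser.add_argument("--experiment_name", required=True, help="Experiment of the training run")
    parser.add_argument("--input_dataset_id", required=True, help="ID of the dataset to score")
    return parser.parse_args(argv)


# Placeholder values; in a real run these are supplied by the submitted job.
args = parse_scoring_args(
    [
        "--run_id", "example-run-id",
        "--experiment_name", "automl-nlp-text-ner",
        "--input_dataset_id", "example-dataset-id",
    ]
)
print(args.run_id, args.experiment_name, args.input_dataset_id)
```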

## Submit Inference Run

In [None]:
datastore_path = [(datastore, blobstore_datadir + "/test.txt")]
test_data = Dataset.File.from_files(path=datastore_path)

In [None]:
test_data = test_data.register(
    workspace=ws, name="CoNLL_2003_test", description="NER test data"
)

In [None]:
# Load training script run corresponding to AutoML run above.
training_run_id = best_run.id
training_run = Run(experiment, training_run_id)

In [None]:
# Inference script run arguments
arguments = [
    "--run_id",
    training_run_id,
    "--experiment_name",
    experiment.name,
    "--input_dataset_id",
    test_data.as_named_input("test_data"),
]

In [None]:
scoring_args = arguments
with tempfile.TemporaryDirectory() as tmpdir:
    # Download required files from training run into temp folder.
    entry_script_name = "score_script.py"
    output_path = os.path.join(tmpdir, entry_script_name)
    training_run.download_file("outputs/" + entry_script_name, output_path)

    script_run_config = ScriptRunConfig(
        source_directory=tmpdir,
        script=entry_script_name,
        compute_target=compute_target,
        environment=training_run.get_environment(),
        arguments=scoring_args,
    )
    scoring_run = experiment.submit(script_run_config)

In [None]:
scoring_run

In [None]:
_ = scoring_run.wait_for_completion(show_output=False)

## Get Evaluation Metrics

In [None]:
test_metrics = scoring_run.get_metrics()
test_metrics

In [None]:
pd.DataFrame(
    {"metric_name": list(test_metrics.keys()), "value": list(test_metrics.values())}
)
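
Since `get_metrics()` returns a plain dictionary for both the training run and the scoring run, the two sets of metrics can be compared side by side. The sketch below keeps only metrics that appear in both dictionaries with scalar numeric values and reports the difference. The helper `compare_scalar_metrics` is hypothetical, and the dictionaries hold placeholder values rather than results from this notebook.

```python
from numbers import Real
from typing import Any, Dict, List, Tuple


def _is_scalar(value: Any) -> bool:
    """True for real numbers, excluding booleans."""
    return isinstance(value, Real) and not isinstance(value, bool)


def compare_scalar_metrics(
    validation: Dict[str, Any], test: Dict[str, Any]
) -> List[Tuple[str, float, float, float]]:
    """Return (name, validation, test, test - validation) for shared scalar metrics."""
    rows = []
    for name in sorted(set(validation) & set(test)):
        val_value, test_value = validation[name], test[name]
        if _is_scalar(val_value) and _is_scalar(test_value):
            rows.append((name, val_value, test_value, test_value - val_value))
    return rows


# Placeholder values for illustration only; substitute the real
# validation_metrics and test_metrics dictionaries from the cells above.
example_validation = {"accuracy": 0.75, "example_metric": 0.5, "example_table": [1, 2]}
example_test = {"accuracy": 0.5, "example_metric": 0.5, "other_metric": 0.25}

for name, val_value, test_value, delta in compare_scalar_metrics(
    example_validation, example_test
):
    print(f"{name}: validation={val_value} test={test_value} delta={delta:+}")
```

A large drop from validation to test on the real metrics would suggest that the model does not generalize well to the test split.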