Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Automated Machine Learning
_**Multilabel Text Classification Using AutoML NLP**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Inference](#Inference)

## Introduction
This notebook demonstrates multilabel classification with text data using AutoML NLP.

The AutoML highlight here is end-to-end deep learning for NLP tasks such as multilabel text classification.

Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

Notebook synopsis:

1. Creating an Experiment in an existing Workspace
2. Configuring and remotely running AutoML on a multilabel text classification dataset from [Kaggle](https://www.kaggle.com): [arXiv Paper Abstracts](https://www.kaggle.com/spsayakpaul/arxiv-paper-abstracts)
3. Evaluating the trained model on a test set

## Setup

In [None]:
import ast
import logging
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

from azureml.automl.dnn.nlp.classification.io.read.read_utils import load_model_wrapper
from azureml.automl.runtime.shared.score.scoring import score_classification
import azureml.core
from azureml.core import Dataset
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.data.datapath import DataPath
from azureml.core.run import Run
from azureml.core.script_run_config import ScriptRunConfig
from azureml.train.automl import AutoMLConfig

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

In [None]:
print("This notebook was created using version 1.39.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

As part of the setup you have already created a <b>Workspace</b>. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem.

In [None]:
ws = Workspace.from_config()

# Choose an experiment name.
experiment_name = "automl-nlp-text-classification-multilabel"

experiment = Experiment(ws, experiment_name)

output = {}
output["Subscription ID"] = ws.subscription_id
output["Workspace Name"] = ws.name
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
output["Experiment Name"] = experiment.name
pd.set_option("display.max_colwidth", None)  # None means no truncation; -1 is rejected by newer pandas
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

## Set up a compute cluster
This section uses a compute cluster named "parallel-2". If no cluster with this name exists in your workspace, the code below creates one. You can adjust the cluster parameters, such as the VM size and the number of nodes, in the code.

In [None]:
num_nodes = 2

# Choose a name for your cluster.
amlcompute_cluster_name = "parallel-{}".format(num_nodes)

# Reuse the cluster if it already exists; otherwise create it.
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6", max_nodes=num_nodes  # Use GPU Nodes
    )
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

## Data

Since the original dataset is very large, we leverage a subsampled dataset to allow for faster training for the purposes of running this example notebook. To run the full dataset (50K+ samples and 1k+ labels) you might need a GPU instance with larger memory and it may take longer to finish training.

To run the code below, first download `arxiv_data.csv` from [this link](https://www.kaggle.com/spsayakpaul/arxiv-paper-abstracts) and save it in the same directory as this notebook. Then run `preprocessing.py` to create a subset of the data split into training, validation, and test sets.
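
`preprocessing.py` is not shown in this notebook, so the following is only a minimal sketch of what subsampling and splitting could look like. The function name, the split fractions, and the `text` column are hypothetical; only the `terms` label column comes from this notebook.

```python
import random


def subsample_and_split(rows, sample_size, valid_frac=0.1, test_frac=0.1, seed=0):
    """Randomly subsample `rows`, then split the sample into train/valid/test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sample = rng.sample(rows, min(sample_size, len(rows)))
    n_valid = int(len(sample) * valid_frac)
    n_test = int(len(sample) * test_frac)
    valid = sample[:n_valid]
    test = sample[n_valid:n_valid + n_test]
    train = sample[n_valid + n_test:]
    return train, valid, test


# Toy stand-in for the rows of arxiv_data.csv (hypothetical content).
rows = [{"text": f"abstract {i}", "terms": "['cs.LG']"} for i in range(1000)]
train, valid, test = subsample_and_split(rows, sample_size=200)
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the three subsets stable across reruns, which makes validation and test metrics comparable between experiments.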

Now we register the training and validation datasets for use in training. We will register the test dataset later.

In [None]:
# Upload dataset to datastore
data_dir = "data"  # Local directory to store data
blobstore_datadir = data_dir  # Blob store directory to store data in

datastore = ws.get_default_datastore()
target = DataPath(datastore=datastore, path_on_datastore=blobstore_datadir)
Dataset.File.upload_directory(
    src_dir=data_dir, target=target, overwrite=True, show_progress=True
)

In [None]:
# Obtain training data as a Tabular dataset to pass into AutoMLConfig
train_dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, blobstore_datadir + "/arxiv_abstract_train.csv")]
)
valid_dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, blobstore_datadir + "/arxiv_abstract_valid.csv")]
)
test_dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, blobstore_datadir + "/arxiv_abstract_test.csv")]
)

In [None]:
train_dataset = train_dataset.register(
    workspace=ws,
    name="arxiv_abstract_train",
    description="Multilabel train dataset",
    create_new_version=True,
)

valid_dataset = valid_dataset.register(
    workspace=ws,
    name="arxiv_abstract_valid",
    description="Multilabel validation dataset",
    create_new_version=True,
)

# Train

## Submit AutoML run

Now we can start the run with the prepared compute cluster and datasets. On a single-node `STANDARD_NC6` cluster, training takes around 25 minutes, not counting the time needed to start the nodes. To make training faster, we use a `STANDARD_NC6` cluster with 2 nodes and enable distributed training.

To use distributed training, set `enable_distributed_dnn_training = True` and set `max_concurrent_iterations` to the number of nodes available in your compute cluster.

We do not set the `primary_metric` parameter because only one model is trained, so there are no models to rank. The run uses the default primary metric, `accuracy`, for reporting purposes only.

In [None]:
automl_settings = {
    "max_concurrent_iterations": num_nodes,
    "enable_distributed_dnn_training": True,
    "verbosity": logging.INFO,
}
target_column_name = "terms"
automl_config = AutoMLConfig(
    task="text-classification-multilabel",
    debug_log="automl_errors.log",
    compute_target=compute_target,
    training_data=train_dataset,
    validation_data=valid_dataset,
    label_column_name=target_column_name,
    **automl_settings,
)

In [None]:
automl_run = experiment.submit(
    automl_config, show_output=False
)  # You might see a warning about "enable_distributed_dnn_training". Please simply ignore.
_ = automl_run.wait_for_completion(show_output=False)

## Download Metrics

These metrics, logged with the training run, are computed with the trained model on the validation dataset.

In [None]:
validation_metrics = automl_run.get_metrics()
pd.DataFrame(
    {"metric_name": validation_metrics.keys(), "value": validation_metrics.values()}
)

You can also get the best run and the best model with the `get_output` method.

In [None]:
# You might see a warning about "enable_distributed_dnn_training". Please simply ignore.
best_run, best_model = automl_run.get_output()
best_run

# Inference

Now you can use the trained model to run inference on unseen data. We use a `ScriptRun` for this, with a script that we provide. The following cells register the test dataset, download the inference script, and trigger the inference run. The inference run does not log metrics directly, so we download the results and calculate the metrics offline.

## Submit Inference Run

In [None]:
test_dataset = test_dataset.register(
    workspace=ws,
    name="arxiv_abstract_test",
    description="Multilabel test dataset",
    create_new_version=True,
)

In [None]:
# Load training script run corresponding to AutoML run above.
training_run_id = best_run.id
training_run = Run(experiment, training_run_id)

In [None]:
# Inference script run arguments
arguments = [
    "--run_id",
    training_run_id,
    "--experiment_name",
    experiment.name,
    "--input_dataset_id",
    test_dataset.as_named_input("test_data"),
]
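
The actual `score_script.py` is downloaded from the training run and is not shown here. As a rough illustration only, a script receiving the three arguments above could parse them as follows. The parser, the help texts, and the placeholder values are assumptions, not the real script.

```python
import argparse


def parse_scoring_args(argv):
    """Parse the three command-line arguments passed to the scoring run."""
    parser = argparse.ArgumentParser(description="Hypothetical scoring arguments")
    parser.add_argument("--run_id", required=True, help="ID of the training run")
    parser.add_argument("--experiment_name", required=True, help="Experiment name")
    parser.add_argument("--input_dataset_id", required=True, help="Test dataset ID")
    return parser.parse_args(argv)


# Placeholder values; in the notebook they come from best_run, experiment,
# and the registered test dataset.
args = parse_scoring_args(
    [
        "--run_id", "example-run-id",
        "--experiment_name", "automl-nlp-text-classification-multilabel",
        "--input_dataset_id", "example-dataset-id",
    ]
)
print(args.run_id, args.experiment_name, args.input_dataset_id)
```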

In [None]:
scoring_args = arguments
with tempfile.TemporaryDirectory() as tmpdir:
    # Download required files from training run into temp folder.
    entry_script_name = "score_script.py"
    output_path = os.path.join(tmpdir, entry_script_name)
    training_run.download_file("outputs/" + entry_script_name, output_path)

    script_run_config = ScriptRunConfig(
        source_directory=tmpdir,
        script=entry_script_name,
        compute_target=compute_target,
        environment=training_run.get_environment(),
        arguments=scoring_args,
    )
    scoring_run = experiment.submit(script_run_config)

In [None]:
scoring_run

In [None]:
_ = scoring_run.wait_for_completion(show_output=False)

## Download Prediction

In [None]:
output_prediction_file = "./preds_multilabel.csv"
scoring_run.download_file(
    "outputs/predictions.csv", output_file_path=output_prediction_file
)

In [None]:
test_data_df = test_dataset.to_pandas_dataframe()
test_set_predictions_df = pd.read_csv(output_prediction_file)
test_set_predictions_df["label_confidence"] = test_set_predictions_df[
    "label_confidence"
].apply(lambda x: [float(num) for num in x.split(",")])
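
The parsing step above implies that each `label_confidence` cell holds one comma-separated string of scores, one score per label. A minimal sketch with the standard library, using made-up values rather than real output of the scoring run:

```python
import csv
import io

# Made-up stand-in for predictions.csv; the field is quoted because it contains commas.
csv_text = 'label_confidence\n"0.91,0.05,0.40"\n"0.10,0.80,0.75"\n'

rows = list(csv.DictReader(io.StringIO(csv_text)))
confidences = [
    [float(num) for num in row["label_confidence"].split(",")] for row in rows
]
print(confidences)  # [[0.91, 0.05, 0.4], [0.1, 0.8, 0.75]]
```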

In [None]:
# install this package to run the following block
# !pip install azureml-automl-dnn-nlp

In [None]:
y_transformer = load_model_wrapper(training_run).y_transformer

## Offline Evaluation

We will use the evaluation module within AzureML to calculate the metrics. 

In [None]:
test_y = y_transformer.transform(
    test_data_df[target_column_name].apply(ast.literal_eval)
).toarray()
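
The cell above parses each `terms` string into a Python list and turns it into a multi-hot row. The sketch below reproduces that idea in plain Python, assuming `y_transformer` behaves like a binarizing encoder over a fixed class list. The class names and label strings are toy values.

```python
import ast


def multi_hot(label_strings, classes):
    """Convert stringified label lists into multi-hot rows over `classes`."""
    index = {label: i for i, label in enumerate(classes)}
    encoded = []
    for label_string in label_strings:
        labels = ast.literal_eval(label_string)  # "['a', 'b']" -> ['a', 'b']
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        encoded.append(row)
    return encoded


classes = ["cs.CV", "cs.LG", "stat.ML"]  # toy class list
label_strings = ["['cs.CV', 'cs.LG']", "['stat.ML']"]
print(multi_hot(label_strings, classes))  # [[1, 1, 0], [0, 0, 1]]
```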

In [None]:
# Stack the per-row confidence lists into a 2-D array (rows x labels).
test_pred_probs = np.array(test_set_predictions_df["label_confidence"].tolist())

In [None]:
L = len(y_transformer.classes_)
test_metrics = score_classification(
    test_y,
    test_pred_probs,
    list(validation_metrics.keys()),
    np.arange(L),
    np.arange(L),
    y_transformer=y_transformer,
    multilabel=True,
)

In [None]:
pd.DataFrame({"metric_name": test_metrics.keys(), "value": test_metrics.values()})

## Classification Report

We also provide the following function, which lets you evaluate the trained model for each class, and averaged across classes, at any threshold you choose.

In [None]:
def classification_report_multilabel(
    test_df, pred_df, label_col, y_transformer, threshold=0.5
):

    message = (
        "test_df and pred_df should have the same number of rows, but get {} and {}"
    )
    assert test_df.shape[0] == pred_df.shape[0], message.format(
        test_df.shape[0], pred_df.shape[0]
    )

    label_set = y_transformer.classes_
    n = len(label_set)

    y_true = []
    y_pred = []

    for row in range(test_df.shape[0]):
        true_labels = y_transformer.transform(
            [ast.literal_eval(test_df.loc[row, label_col])]
        ).toarray()[0]
        pred_labels = pred_df.loc[row, "label_confidence"]
        for ind, (label, prob) in enumerate(zip(true_labels, pred_labels)):
            predict_positive = prob >= threshold
            if label or predict_positive:
                y_true.append(label_set[ind] if label else "")
                y_pred.append(label_set[ind] if predict_positive else "")

    # `labels` must be passed by keyword in recent scikit-learn versions.
    print(classification_report(y_true, y_pred, labels=label_set))
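
To see how the function above turns one multilabel row into single-label pairs, the sketch below isolates its inner loop. Each label that is either true or predicted contributes one `(y_true, y_pred)` pair, with an empty string standing in for "absent". Labels that are neither true nor predicted are skipped. The helper name and toy values are illustrative.

```python
def flatten_decisions(true_row, prob_row, classes, threshold):
    """Mirror the inner loop of classification_report_multilabel for one row."""
    y_true, y_pred = [], []
    for ind, (label, prob) in enumerate(zip(true_row, prob_row)):
        predict_positive = prob >= threshold
        if label or predict_positive:
            y_true.append(classes[ind] if label else "")
            y_pred.append(classes[ind] if predict_positive else "")
    return y_true, y_pred


classes = ["cs.CV", "cs.LG", "stat.ML", "math.OC"]  # toy class list
true_row = [1, 0, 1, 0]
prob_row = [0.8, 0.6, 0.2, 0.1]
y_true, y_pred = flatten_decisions(true_row, prob_row, classes, threshold=0.5)
print(y_true)  # ['cs.CV', '', 'stat.ML']
print(y_pred)  # ['cs.CV', 'cs.LG', '']
```

Here `cs.CV` is a true positive, `cs.LG` a false positive, `stat.ML` a false negative, and `math.OC` is dropped because it is neither true nor predicted.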

In [None]:
classification_report_multilabel(
    test_data_df,
    test_set_predictions_df,
    target_column_name,
    y_transformer,
    threshold=0.1,
)

In [None]:
classification_report_multilabel(
    test_data_df,
    test_set_predictions_df,
    target_column_name,
    y_transformer,
    threshold=0.5,
)

In [None]:
classification_report_multilabel(
    test_data_df,
    test_set_predictions_df,
    target_column_name,
    y_transformer,
    threshold=0.9,
)
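
The three reports above differ only in the threshold. As a rough illustration of why that matters, the sketch below computes micro-averaged precision and recall on toy data at the same three thresholds. The helper and the data are hypothetical and independent of the AzureML scoring module.

```python
def micro_precision_recall(y_true, y_prob, threshold):
    """Micro-averaged precision and recall for multi-hot labels at a threshold."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for label, prob in zip(true_row, prob_row):
            predicted = prob >= threshold
            if predicted and label:
                tp += 1
            elif predicted:
                fp += 1
            elif label:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [[1, 0, 1], [0, 1, 0]]  # toy multi-hot labels
y_prob = [[0.95, 0.30, 0.60], [0.20, 0.92, 0.15]]  # toy confidences
for threshold in (0.1, 0.5, 0.9):
    precision, recall = micro_precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, a threshold of 0.1 accepts every label (precision 0.50, recall 1.00), while a threshold of 0.9 keeps only the most confident predictions (precision 1.00, recall 0.67).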