Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Automated Machine Learning
_**Multiclass Text Classification Using AutoML NLP**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Inference](#Inference)

## Introduction
This notebook demonstrates multiclass text classification with AutoML NLP.

The highlight here is AutoML's use of end-to-end deep learning for NLP tasks such as multiclass text classification.

Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

Notebook synopsis:

1. Creating an Experiment in an existing Workspace
2. Configuring and remotely running AutoML on a multiclass text dataset from scikit-learn, the [20 Newsgroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)
3. Evaluating the trained model on a test set

## Setup

In [None]:
import logging
import os
import tempfile

import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.script_run_config import ScriptRunConfig
from azureml.core.run import Run
from azureml.data.datapath import DataPath
from azureml.train.automl import AutoMLConfig
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

In [None]:
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")
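
If you want the notebook to fail early on an older SDK, one option is to compare the reported version against a minimum. The sketch below works on plain version strings. The minimum version `1.30.0` and the helper names `parse_version` and `is_at_least` are assumptions made for illustration; they are not values or functions from this notebook or the SDK.

```python
def parse_version(version):
    """Turn a dotted version string such as '1.42.0' into a tuple of ints.

    Non-numeric suffixes (for example 'rc1') are ignored. This is a
    simplification, not a full PEP 440 parser.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Return True when `installed` is the same as or newer than `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '1.30' equals '1.30.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Hypothetical minimum version, chosen only for this example.
MINIMUM_SDK_VERSION = "1.30.0"

print(is_at_least("1.42.0", MINIMUM_SDK_VERSION))
print(is_at_least("1.9.0", MINIMUM_SDK_VERSION))
```

In the notebook itself you would pass `azureml.core.VERSION` as the first argument.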

As part of the setup you have already created a <b>Workspace</b>. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem.

In [None]:
ws = Workspace.from_config()

# Choose an experiment name.
experiment_name = "automl-nlp-text-classification-multiclass"

experiment = Experiment(ws, experiment_name)

output = {}
output["Subscription ID"] = ws.subscription_id
output["Workspace Name"] = ws.name
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
output["Experiment Name"] = experiment.name
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

## Set up a compute cluster
This section uses a user-provided compute cluster (named "gpu-compute" in this example). If a cluster with this name does not exist in your workspace, the code below creates one. You can adjust the cluster parameters, such as the VM size and the maximum number of nodes, as noted in the comments.

In [None]:
num_nodes = 1

# Choose a name for your cluster.
amlcompute_cluster_name = "gpu-compute"

# Reuse the cluster if it already exists; otherwise create it.
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC6s_v3", max_nodes=num_nodes  # use GPU Nodes
    )
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Data
For this notebook we use the 20 Newsgroups data from scikit-learn. We filter the data to four classes and take a small sample as training data. Note that more data is needed to improve accuracy; this small-data example is meant as a template that you can reuse with your own, larger datasets.

In [None]:
target_column_name = "y"
feature_column_name = "X"


def get_20newsgroups_data():
    """Fetch a four-category subset of 20 Newsgroups from scikit-learn.

    Returns the train, validation, and test splits as pandas DataFrames.
    """
    remove = ("headers", "footers", "quotes")
    categories = [
        "rec.sport.baseball",
        "rec.sport.hockey",
        "comp.graphics",
        "sci.space",
    ]

    data = fetch_20newsgroups(
        subset="train",
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=remove,
    )
    data = pd.DataFrame(
        {feature_column_name: data.data, target_column_name: data.target}
    )

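    # Slice by position: label-based .loc slices include both endpoints,
    # which would put rows 200 and 300 into two splits at once.
    data_train = data.iloc[:200]
    data_val = data.iloc[200:300]
    data_test = data.iloc[300:400]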

    data_train = remove_blanks_20news(data_train)
    data_val = remove_blanks_20news(data_val)
    data_test = remove_blanks_20news(data_test)

    return data_train, data_val, data_test


def remove_blanks_20news(data):
    """Replace newlines with spaces, strip whitespace, and drop empty rows."""
    data = data.copy()
    data[feature_column_name] = (
        data[feature_column_name]
        .replace(r"\n", " ", regex=True)
        .apply(lambda x: x.strip())
    )
    data = data[data[feature_column_name] != ""]

    return data
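
To see what the cleaning step does without downloading the dataset, the same logic can be sketched on a plain list of strings. The helper name `clean_posts` and the sample posts are made up for illustration.

```python
def clean_posts(posts):
    """Replace newlines with spaces, trim whitespace, and drop empty posts."""
    cleaned = []
    for post in posts:
        text = post.replace("\n", " ").strip()
        if text != "":
            cleaned.append(text)
    return cleaned


# Made-up sample posts; the second and third become empty after cleaning.
sample_posts = [
    "The shuttle launch\nwas delayed.",
    "\n\n",
    "   ",
    "  Great game last night!\n",
]

print(clean_posts(sample_posts))
```

Posts can end up empty because headers, footers, and quotes are removed when the data is fetched, so the number of rows in each split may be smaller than the slice size.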

## Fetch data and upload to datastore

In [None]:
data_train, data_val, data_test = get_20newsgroups_data()

data_dir = "data"  # Local directory to store data
blobstore_datadir = data_dir  # Blob store directory to store data in
if not os.path.isdir(data_dir):
    os.mkdir(data_dir)

train_data_fname = data_dir + "/train_data.csv"
val_data_fname = data_dir + "/val_data.csv"
test_data_fname = data_dir + "/test_data.csv"

data_train.to_csv(train_data_fname, index=False)
data_val.to_csv(val_data_fname, index=False)
data_test.to_csv(test_data_fname, index=False)

datastore = ws.get_default_datastore()
target = DataPath(
    datastore=datastore, path_on_datastore=blobstore_datadir, name="news_group_data"
)
Dataset.File.upload_directory(
    src_dir=data_dir, target=target, overwrite=True, show_progress=True
)
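
Free text often contains commas and quotation marks, so it is worth confirming that a CSV round trip preserves it. The sketch below uses only the standard `csv` module and made-up rows with the same `X` and `y` column names. It illustrates minimal quoting, which is also the default behavior of pandas' `to_csv`; it is not part of the upload step.

```python
import csv
import io

# Made-up rows for illustration; the text contains commas and quotes.
rows = [
    {"X": 'He said "nice save", then left.', "y": 2},
    {"X": "Rendering, shading, and ray tracing", "y": 0},
]

# Write the rows to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["X", "y"])
writer.writeheader()
writer.writerows(rows)

# Read them back and check that the text survived unchanged.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["X"])
print(restored[1]["X"])
```

Note that the `csv` module reads every field back as a string, so the labels would need converting to integers again.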

In [None]:
train_dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, blobstore_datadir + "/train_data.csv")]
)
val_dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, blobstore_datadir + "/val_data.csv")]
)

In [None]:
train_dataset = train_dataset.register(
    workspace=ws,
    name="20newsgroups_data_train",
    description="20newsgroups_data_train",
    create_new_version=True,
)

val_dataset = val_dataset.register(
    workspace=ws,
    name="20newsgroups_data_val",
    description="20newsgroups_data_val",
    create_new_version=True,
)

# Train

## Submit AutoML run

Now we can start the run with the prepared compute resource and datasets. This should only take a few minutes.

We do not set the `primary_metric` parameter here because only one model is trained, so there are no models to rank. The run uses the default primary metric, `accuracy`, for reporting purposes only.

In [None]:
automl_settings = {
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    task="text-classification",
    debug_log="automl_errors.log",
    compute_target=compute_target,
    training_data=train_dataset,
    validation_data=val_dataset,
    label_column_name=target_column_name,
    **automl_settings
)


In [None]:
automl_run = experiment.submit(automl_config, show_output=False)
automl_run.wait_for_completion(show_output=False)

## Download Metrics

These metrics, logged with the training run, are computed by evaluating the trained model on the validation dataset.

In [None]:
validation_metrics = automl_run.get_metrics()
pd.DataFrame(
    {"metric_name": validation_metrics.keys(), "value": validation_metrics.values()}
)
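
A metrics dictionary may contain non-scalar entries alongside plain numbers, which makes a simple table harder to read. The sketch below keeps only the numeric values. The helper name `scalar_metrics` and the example values are assumptions for illustration, not results from this run.

```python
def scalar_metrics(metrics):
    """Keep only metrics whose value is a plain number, sorted by name."""
    return {
        name: value
        for name, value in sorted(metrics.items())
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


# Made-up values shaped like a metrics dictionary; not real results.
example_metrics = {
    "f1_score_macro": 0.7,
    "accuracy": 0.75,
    "confusion_matrix": {"note": "non-scalar placeholder"},
}

print(scalar_metrics(example_metrics))
```

In the notebook you could pass `validation_metrics` to the helper before building the DataFrame.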

You can also retrieve the best child run, including its run ID, with the `get_best_child` method.

In [None]:
best_run = automl_run.get_best_child()
best_run

# Inference

Now you can use the trained model to run inference on unseen data. We use a `ScriptRun` to do this, with a script that we provide. The following cells register the test dataset, download the inference script, and trigger the inference run. The inference run does not log metrics directly, so we download the predictions and calculate the metrics offline.

## Submit Inference Run

In [None]:
test_dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, blobstore_datadir + "/test_data.csv")]
)

In [None]:
test_dataset = test_dataset.register(
    workspace=ws,
    name="20newsgroups_data_test",
    description="20newsgroups_data_test",
    create_new_version=True,
)

In [None]:
training_run_id = best_run.id
training_run = Run(experiment, training_run_id)

In [None]:
# Inference script run arguments
arguments = [
    "--run_id",
    training_run_id,
    "--experiment_name",
    experiment.name,
    "--input_dataset_id",
    test_dataset.as_named_input("test_data"),
]
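
For orientation, a script receiving the arguments above could read them with `argparse`. The sketch below is hypothetical: it is not the contents of the downloaded `score_script.py`, and the function name `parse_scoring_args` and the placeholder values are assumptions for illustration.

```python
import argparse


def parse_scoring_args(argv):
    """Parse the three flags used in the arguments list above."""
    parser = argparse.ArgumentParser(description="Hypothetical scoring script")
    parser.add_argument("--run_id", required=True)
    parser.add_argument("--experiment_name", required=True)
    parser.add_argument("--input_dataset_id", required=True)
    return parser.parse_args(argv)


# Placeholder values; a real run receives actual identifiers.
args = parse_scoring_args(
    [
        "--run_id",
        "example-run-id",
        "--experiment_name",
        "automl-nlp-text-classification-multiclass",
        "--input_dataset_id",
        "example-dataset-id",
    ]
)
print(args.run_id, args.experiment_name, args.input_dataset_id)
```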

In [None]:
scoring_args = arguments
with tempfile.TemporaryDirectory() as tmpdir:
    # Download required files from training run into temp folder.
    entry_script_name = "score_script.py"
    output_path = os.path.join(tmpdir, entry_script_name)
    training_run.download_file("outputs/" + entry_script_name, output_path)

    script_run_config = ScriptRunConfig(
        source_directory=tmpdir,
        script=entry_script_name,
        compute_target=compute_target,
        environment=training_run.get_environment(),
        arguments=scoring_args,
    )
    scoring_run = experiment.submit(script_run_config)

In [None]:
scoring_run

In [None]:
_ = scoring_run.wait_for_completion(show_output=False)

## Download Prediction

In [None]:
output_prediction_file = "./preds_multiclass.csv"
scoring_run.download_file(
    "outputs/predictions.csv", output_file_path=output_prediction_file
)

In [None]:
test_set_predictions_df = pd.read_csv(output_prediction_file)

In [None]:
test_data_df = test_dataset.to_pandas_dataframe()

## Offline Evaluation

In [None]:
print(
    classification_report(
        test_data_df[target_column_name], test_set_predictions_df[target_column_name]
    )
)
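
The report lists precision, recall, and F1 for each class. As a rough illustration of how those per-class numbers are derived, the sketch below computes them by hand from two label lists. The helper name `per_class_scores` and the labels are made up for illustration and are not predictions from this run.

```python
def per_class_scores(y_true, y_pred):
    """Compute precision, recall, and F1 for each class label."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

scores = per_class_scores(y_true, y_pred)
for label, values in scores.items():
    print(label, values)
```

As in the `classification_report` call, the true labels come first and the predictions second.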