Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Automated Machine Learning
_**New metric features in Azure AutoML**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Binary Classification Metrics](#Binary-Classification-Metrics)
1. [Confidence Interval](#Confidence-Interval)

## Introduction

In this example notebook we use the sklearn datasets [digits](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) and [boston](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) to help you get familiar with binary classification metrics and confidence intervals. The goal is to learn how to use these features through examples.

This notebook uses remote compute to train the model.

If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, if you haven't already, go through the [configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook first to establish your connection to the AzureML Workspace.

In this notebook you will learn how to:
1. Have binary classification metrics calculated for AutoML runs
2. Find binary classification metrics in the UI and retrieve their values through code
3. Have confidence intervals calculated for both classification and regression AutoML runs
4. Find confidence intervals in the UI and retrieve their values through code

## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import logging

import pandas as pd
import os

from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.train.automl import AutoMLConfig

In [None]:
ws = Workspace.from_config()

experiment_name = "metrics-new-feature-test"

experiment = Experiment(ws, experiment_name)

output = {}
output["Subscription ID"] = ws.subscription_id
output["Workspace"] = ws.name
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
output["Experiment Name"] = experiment.name
pd.set_option("display.max_colwidth", None)  # None disables truncation; -1 is deprecated since pandas 1.0
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

## Create or Attach existing AmlCompute
A compute target is required to execute the Automated ML run. In this tutorial, you create AmlCompute as your training compute resource.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

#### Creation of AmlCompute takes approximately 5 minutes. 
If an AmlCompute with that name is already in your workspace, this code skips the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster-1"

# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS12_V2", max_nodes=6
    )
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Data

### Load Data

We load the datasets from sklearn, save them to local files, and upload them to the workspace's default datastore.

For classification, we use the [digits dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits).

For regression, we use the [boston dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston). Note that `load_boston` was removed in scikit-learn 1.2, so the regression example requires an earlier scikit-learn version.

In [None]:
import numpy as np
import sklearn.datasets


def load_classification_data():
    if os.path.exists("./data/digits.csv"):
        print("Found downloaded dataset. Loading")
    else:
        print("Downloading dataset")
        os.makedirs("./data", exist_ok=True)
        classification_dataset = sklearn.datasets.load_digits()
        X = classification_dataset["data"]
        y = classification_dataset["target"]
        full_data = np.concatenate([X, y.reshape(-1, 1)], axis=1).astype("int")
        columns = ["feature_{}".format(i) for i in range(X.shape[1])] + ["label"]
        full_data = pd.DataFrame(data=full_data, columns=columns)
        full_data.to_csv("./data/digits.csv", index=False)
        print("Dataset downloaded")
    ws = Workspace.from_config()
    datastore = ws.get_default_datastore()
    datastore.upload(
        src_dir="./data", target_path="data/new-metric-features/", overwrite=True
    )
    data = Dataset.Tabular.from_delimited_files(
        path=[(datastore, ("data/new-metric-features/digits.csv"))]
    )
    train, test = data.random_split(percentage=0.8, seed=101)
    validation, test = test.random_split(percentage=0.5, seed=47)
    return train, validation, test, np.arange(10), "label"


(
    digit_train,
    digit_validation,
    digit_test,
    labels,
    label_column_name,
) = load_classification_data()
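
The two chained `random_split` calls above (0.8, then 0.5 on the remainder) yield roughly an 80/10/10 train/validation/test split. The sketch below illustrates that arithmetic with plain Python on the 1797 rows of the digits dataset. It is only a stand-in that assigns each row independently at random; the helper name `two_stage_split` is hypothetical, and this is not how Azure ML implements `random_split`.

```python
import random


def two_stage_split(n_rows, first=0.8, second=0.5, seed=101):
    """Assign row indices to train/validation/test with two random splits."""
    rng = random.Random(seed)
    train, rest = [], []
    for i in range(n_rows):
        # First split: about `first` of the rows go to train.
        (train if rng.random() < first else rest).append(i)
    validation, test = [], []
    for i in rest:
        # Second split: the remainder is divided between validation and test.
        (validation if rng.random() < second else test).append(i)
    return train, validation, test


n_rows = 1797  # number of samples in the sklearn digits dataset
train_idx, validation_idx, test_idx = two_stage_split(n_rows)
for name, part in [
    ("train", train_idx),
    ("validation", validation_idx),
    ("test", test_idx),
]:
    print(f"{name}: {len(part)} rows ({len(part) / n_rows:.1%})")
```

Because the assignment is random, the proportions are approximate rather than exact.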

# Binary Classification Metrics

In this section we will explain how to set parameters for AutoML runs to have binary classification metrics calculated.

## Binary Classification Metrics
Binary classification metrics will be calculated for AutoML in two cases:
1. There are exactly two classes.
2. The `positive_label` parameter in `AutoMLConfig` is set to an existing class.

When a `positive_label` is specified for a multiclass classification task, all other classes are treated as the negative class when calculating the binary classification metrics.
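
To make the one-vs-rest idea concrete, here is a minimal sketch in plain Python. It collapses multiclass labels into positive/negative and computes precision and recall for the positive class. The helpers `binarize` and `binary_precision_recall` and the toy labels are illustrative assumptions, not part of the AutoML SDK, and AutoML's internal computation may differ.

```python
def binarize(y, positive_label):
    """Map each label to 1 if it equals positive_label, otherwise 0."""
    return [1 if label == positive_label else 0 for label in y]


def binary_precision_recall(y_true, y_pred, positive_label):
    """Compute precision and recall for positive_label, one-vs-rest."""
    t = binarize(y_true, positive_label)
    p = binarize(y_pred, positive_label)
    tp = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(t, p) if a == 0 and b == 1)
    fn = sum(1 for a, b in zip(t, p) if a == 1 and b == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy multiclass labels (made up for illustration).
y_true = [4, 4, 4, 1, 7, 9, 4, 0]
y_pred = [4, 4, 9, 4, 7, 9, 4, 0]

precision, recall = binary_precision_recall(y_true, y_pred, positive_label=4)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With class `4` as the positive class, the confusion between classes `7`, `9`, and `0` no longer matters; only "is it a 4 or not" is scored.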

When there are exactly two classes and no `positive_label` is given, `np.unique()` is used to sort the classes, and the class with the larger index is used as the positive class. However, we recommend always specifying `positive_label` when you want binary classification metrics, to make sure they are calculated for the correct class. In the example below, we use class `4` as the positive class.

In [None]:
automl_settings = {
    "primary_metric": "AUC_weighted",
    "enable_early_stopping": True,
    "max_concurrent_iterations": 6,
    "experiment_timeout_hours": 0.25,
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    task="classification",
    debug_log="automl_errors.log",
    compute_target=compute_target,
    training_data=digit_train,
    validation_data=digit_validation,
    label_column_name=label_column_name,
    positive_label=4,  # specify the positive class with this parameter
    **automl_settings
)

classification_run = experiment.submit(automl_config, show_output=False)

In [None]:
classification_run.wait_for_completion(show_output=False)

## Find Binary Metrics in UI

After training, you can click the link above to visit the page of this run. You can find all training runs under the `Child runs` tab:

![](imgs/child-runs.png)

Then, under the `Metrics` tab, you can find metric names that end with `_binary`. These are the binary classification metrics for the specified positive class.

![](imgs/binary-metrics.png)

## Retrieve Binary Metrics with Code

You can also retrieve the metric values for any training run with code. The returned value is a dictionary with the structure `{name: value}`. The example below retrieves the metrics of the best trained model.

In [None]:
best_run, fitted_model = classification_run.get_output()
training_metrics = best_run.get_metrics()
training_metrics["AUC_binary"]
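
Since the metrics come back as a plain `{name: value}` dictionary, the binary metrics can be picked out by their `_binary` suffix. The sketch below uses a hand-written dictionary in place of `best_run.get_metrics()`; the values are made up, and the metric names other than `AUC_binary` are placeholders chosen for illustration.

```python
def binary_metrics(metrics):
    """Return only the entries whose metric name ends with '_binary'."""
    return {
        name: value for name, value in metrics.items() if name.endswith("_binary")
    }


# Hypothetical values standing in for best_run.get_metrics().
example_metrics = {
    "AUC_weighted": 0.99,
    "AUC_binary": 0.98,
    "accuracy": 0.95,
    "f1_score_binary": 0.90,
}

selected = binary_metrics(example_metrics)
for name, value in sorted(selected.items()):
    print(f"{name}: {value:.4f}")
```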

With the data downloaded, you can also calculate the binary classification metrics with other classes as the positive class.

To calculate metrics with code, you will need to import Azure AutoML's scoring modules and set `positive_label` as desired. See the example code below:

In [None]:
from azureml.automl.runtime.shared.score import constants, scoring

test_df = digit_test.to_pandas_dataframe()
y_test = test_df[label_column_name]
test_df = test_df.drop(columns=[label_column_name])
y_pred_proba = fitted_model.predict_proba(test_df)

In [None]:
for positive_label in range(10):
    metrics = scoring.score_classification(
        y_test,
        y_pred_proba,
        constants.CLASSIFICATION_SCALAR_SET,
        labels,
        labels,
        positive_label=positive_label,
    )
    print(
        "AUC_binary for label {} is {:.4f}".format(
            positive_label, metrics["AUC_binary"]
        )
    )

## Wrong Value of `positive_label` Fails the Run

The value of `positive_label` passed into `AutoMLConfig` must be exactly the same as it is in the dataset. If you pass in a `positive_label` that cannot be found in the training dataset, the run will fail. See the example below, where the correct value `4` is replaced by its string version, `'4'`.

In [None]:
automl_settings = {
    "primary_metric": "AUC_weighted",
    "enable_early_stopping": True,
    "max_concurrent_iterations": 6,
    "experiment_timeout_hours": 0.25,
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    task="classification",
    debug_log="automl_errors.log",
    compute_target=compute_target,
    training_data=digit_train,
    validation_data=digit_validation,
    label_column_name=label_column_name,
    positive_label="4",  # replace the correct integer value with its string version
    **automl_settings
)

classification_run = experiment.submit(automl_config, show_output=False)

In [None]:
classification_run.wait_for_completion(show_output=False)
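
A failed remote run is an expensive way to discover a type mismatch, so it can help to check the label locally before submitting. The following is a minimal sketch of such a check; `check_positive_label` is a hypothetical helper, not an AutoML function, and it assumes you already have the list of class labels at hand.

```python
def check_positive_label(positive_label, class_labels):
    """Return positive_label if it is one of class_labels, else raise ValueError."""
    if positive_label in class_labels:
        return positive_label
    # Look for a label that matches only after string conversion, e.g. "4" vs 4.
    lookalikes = [c for c in class_labels if str(c) == str(positive_label)]
    hint = f" Did you mean {lookalikes[0]!r}?" if lookalikes else ""
    raise ValueError(
        f"positive_label {positive_label!r} is not among the class labels.{hint}"
    )


digit_classes = list(range(10))

# The integer 4 is a valid class label.
check_positive_label(4, digit_classes)

# The string "4" is not, because "4" != 4 in Python.
message = ""
try:
    check_positive_label("4", digit_classes)
except ValueError as err:
    message = str(err)
print(message)
```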

# Confidence Interval

We calculate confidence intervals for metrics using the bootstrap, and we give conservative estimates. As with binary classification metrics, you can find the confidence intervals in the UI and also retrieve them with code.
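
For readers unfamiliar with the bootstrap, here is a minimal sketch of a generic percentile bootstrap for accuracy, in plain Python. It is an assumption-laden illustration rather than AutoML's procedure: this notebook does not describe how AutoML resamples or how it makes its estimates conservative. The function names and toy data are hypothetical; the result keys mirror the `lower_ci_95`, `upper_ci_95`, and `value` keys used later in this notebook.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample rows with replacement and re-score."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[round(alpha / 2 * n_resamples)]
    upper = scores[round((1 - alpha / 2) * n_resamples) - 1]
    return {
        "lower_ci_95": lower,
        "upper_ci_95": upper,
        "value": metric(y_true, y_pred),
    }


# Toy data: 200 rows, with every tenth prediction made wrong on purpose.
y_true = [i % 10 for i in range(200)]
y_pred = list(y_true)
for i in range(0, 200, 10):
    y_pred[i] = (y_true[i] + 1) % 10

ci = bootstrap_ci(y_true, y_pred, accuracy)
print(ci)
```

Each resample re-scores the whole test set, which is why bootstrap intervals get slow when many metrics and many resamples are involved.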

To calculate confidence intervals in AutoML runs, we need to pass two additional parameters to `AutoMLConfig`:
1. `enable_metric_confidence=True`, to tell the run to calculate confidence intervals.
2. `test_data`, to activate a test run, as confidence intervals are only calculated for test runs.

Currently, if the task is classification, only primary metrics will have their confidence intervals logged with the run. To get confidence intervals for other metrics, you can use code; examples are provided below.

In [None]:
automl_settings = {
    "primary_metric": "AUC_weighted",
    "enable_early_stopping": True,
    "max_concurrent_iterations": 6,
    "experiment_timeout_hours": 0.25,
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    task="classification",
    debug_log="automl_errors.log",
    compute_target=compute_target,
    training_data=digit_train,
    validation_data=digit_validation,
    test_data=digit_test,  # if you have only one held-out set, pass it here as test_data instead of as validation_data
    label_column_name=label_column_name,
    enable_metric_confidence=True,
    **automl_settings
)

classification_run = experiment.submit(automl_config, show_output=False)
classification_run.wait_for_completion(show_output=False)

## Find Confidence Interval in UI

To locate the confidence intervals in the UI, we must first find the run that produced the best model, as only the best model is evaluated on the test set. To do so, click the link above for the AutoML run and go to the `Models` tab. The model listed at the top is the one with the best performance:

![](imgs/best-model.png)

Then, for this best model, go to its `Child runs` tab and click the run labeled `Test model`.

![](imgs/test-run.png)

For this test run, under the `Metrics` tab, you can find metrics whose names end with `extras`. By switching `View as` from `Chart` to `Table`, you can see the confidence intervals for those metrics.

![](imgs/confidence-intervals.png)

## Find Confidence Interval with Code

You can retrieve the `Run` object for the test run with the following code and get the confidence intervals from its metrics.

In [None]:
best_run, fitted_model = classification_run.get_output()
test_run = next(best_run.get_children(type="automl.model_test"))
test_run.wait_for_completion(show_output=False, wait_post_processing=True)
test_metrics = test_run.get_metrics()

In [None]:
CIs = {"metric_name": [], "lower_ci_95": [], "upper_ci_95": [], "value": []}

for key, ci in test_metrics.items():
    if key.endswith("extras"):
        CIs["metric_name"].append(key[:-7])  # remove "_extras" to get metric name
        for ci_key, ci_value in ci.items():
            CIs[ci_key].append(ci_value)

pd.DataFrame(CIs)
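
The loop above relies on each `*_extras` entry being a dictionary with the keys `lower_ci_95`, `upper_ci_95`, and `value`. The sketch below shows the same extraction on a hand-written dictionary so the shape is easy to see. The numbers are made up, and `confidence_rows` is a hypothetical helper that builds one record per metric instead of parallel lists.

```python
def confidence_rows(metrics, suffix="_extras"):
    """Build one record per metric from the '<metric>_extras' entries."""
    rows = []
    for key, extras in metrics.items():
        if key.endswith(suffix):
            row = {"metric_name": key[: -len(suffix)]}  # strip "_extras"
            row.update(extras)
            rows.append(row)
    return rows


# Hypothetical values standing in for test_run.get_metrics().
example_test_metrics = {
    "AUC_weighted": 0.99,
    "AUC_weighted_extras": {"lower_ci_95": 0.98, "upper_ci_95": 1.0, "value": 0.99},
}

rows = confidence_rows(example_test_metrics)
print(rows)
```

A list of records like `rows` can be passed directly to `pd.DataFrame(rows)` to get the same table as above.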

Alternatively, you can retrieve the best model, run inference yourself, and get confidence intervals for all metrics. However, since the confidence interval calculation involves a large number of bootstrap resamples, it will take some time.

In [None]:
test_df = digit_test.to_pandas_dataframe()
y_test = test_df[label_column_name]
test_df = test_df.drop(columns=[label_column_name])
y_pred_proba = fitted_model.predict_proba(test_df)

In [None]:
from azureml.automl.runtime._ml_engine.classification_ml_engine import (
    evaluate_classifier,
)

test_metrics = evaluate_classifier(
    y_test,
    y_pred_proba,
    constants.CLASSIFICATION_SCALAR_SET,
    labels,
    labels,
    enable_metric_confidence=True,
)

In [None]:
CIs = {"metric_name": [], "lower_ci_95": [], "upper_ci_95": [], "value": []}

for key, ci in test_metrics.items():
    if key.endswith("extras"):
        CIs["metric_name"].append(key[:-7])  # remove "_extras" to get metric name
        for ci_key, ci_value in ci.items():
            CIs[ci_key].append(ci_value)

pd.DataFrame(CIs)

## Confidence Interval for Regression

Confidence intervals are also supported for regression runs, and all of them can be found in the UI by following exactly the same steps as for a classification run. Here we only provide example code for a regression run, code to retrieve the confidence intervals, and a screenshot of them.

In [None]:
def load_regression_data():
    if os.path.exists("./data/boston.csv"):
        print("Found downloaded dataset. Loading")
    else:
        print("Downloading dataset")
        os.makedirs("./data", exist_ok=True)
        regression_data = sklearn.datasets.load_boston()
        X = regression_data["data"]
        y = regression_data["target"]
        full_data = np.concatenate([X, y.reshape(-1, 1)], axis=1)
        columns = ["feature_{}".format(i) for i in range(X.shape[1])] + ["label"]
        full_data = pd.DataFrame(data=full_data, columns=columns)
        full_data.to_csv("./data/boston.csv", index=False)
        print("Dataset downloaded")
    ws = Workspace.from_config()
    datastore = ws.get_default_datastore()
    datastore.upload(
        src_dir="./data", target_path="data/new-metric-features/", overwrite=True
    )
    data = Dataset.Tabular.from_delimited_files(
        path=[(datastore, ("data/new-metric-features/boston.csv"))]
    )
    train, test = data.random_split(percentage=0.8, seed=101)
    validation, test = test.random_split(percentage=0.5, seed=47)
    return train, validation, test, "label"


boston_train, boston_validation, boston_test, label_column_name = load_regression_data()

In [None]:
automl_settings = {
    "primary_metric": "normalized_root_mean_squared_error",
    "enable_early_stopping": True,
    "max_concurrent_iterations": 6,
    "experiment_timeout_hours": 0.25,
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    task="regression",
    debug_log="automl_errors.log",
    compute_target=compute_target,
    training_data=boston_train,
    validation_data=boston_validation,
    test_data=boston_test,  # if you have only one held-out set, pass it here as test_data instead of as validation_data
    label_column_name=label_column_name,
    enable_metric_confidence=True,
    **automl_settings
)

regression_run = experiment.submit(automl_config, show_output=False)
regression_run.wait_for_completion(show_output=False)

In [None]:
best_run, fitted_model = regression_run.get_output()
test_run = next(best_run.get_children(type="automl.model_test"))
test_run.wait_for_completion(show_output=False, wait_post_processing=True)
test_metrics = test_run.get_metrics()

CIs = {"metric_name": [], "lower_ci_95": [], "upper_ci_95": [], "value": []}

for key, ci in test_metrics.items():
    if key.endswith("extras"):
        CIs["metric_name"].append(key[:-7])  # remove "_extras" to get metric name
        for ci_key, ci_value in ci.items():
            CIs[ci_key].append(ci_value)

pd.DataFrame(CIs)

![](imgs/regression-confidence-interval.png)