# Assess predictions on multiclass wine data with a DNN model

This notebook is an adaptation of the [corresponding notebook in the `responsible-ai-toolbox` repository](https://github.com/microsoft/responsible-ai-toolbox/blob/main/notebooks/responsibleaidashboard/responsibleaidashboard-multiclass-dnn-model-debugging.ipynb) to work with the Responsible AI components in AzureML.

We will use the Responsible AI components to assess a multiclass classification model trained on data about wine. Next, we will walk through the API calls necessary to create a widget with model analysis insights, then undertake a visual analysis of the model.

First, we need to specify the version of the RAI components that are available in the workspace. This version was set when the components were uploaded, and defaults to '1':

In [1]:
version_string = '1'

We also need to give the name of the compute cluster we want to use in AzureML. Later in this notebook, we will create it if it does not already exist:

In [2]:
compute_name = "cpucluster"

Finally, we need to specify a version for the data and components we will create while running this notebook. This should be unique for the workspace, but the specific value doesn't matter:

In [3]:
rai_wine_multiclass_example_version_string = '6'
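
If picking a fresh value by hand becomes tedious, one option is to derive the version from the current time. The sketch below is an assumption about workflow rather than part of the original notebook, and `make_version_string` is a hypothetical helper name:

```python
import time


def make_version_string() -> str:
    """Return the current Unix time in whole seconds, as a string.

    Hypothetical helper: any value works, as long as it has not
    already been used for these assets in the workspace.
    """
    return str(int(time.time()))


# Could be used in place of the hard-coded '6' above
candidate_version = make_version_string()
print(candidate_version)
```

Two runs within the same second would produce the same value, so this is only suitable for interactive use.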

## Accessing the data

First, we need to obtain the dataset and upload it to our AzureML workspace:

In [4]:
from sklearn.datasets import load_wine
import pandas as pd

from sklearn.model_selection import train_test_split

In [16]:
wine = load_wine()
X = wine['data']
y = wine['target']
classes = wine['target_names']
feature_names = wine['feature_names']
target_column_name = 'y'

data_df = pd.DataFrame(data=X, columns=feature_names)
data_df[target_column_name] = y

display(data_df)

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 | 2 |
| 174 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 | 2 |
| 175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 | 2 |
| 176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 | 2 |


Split the data into training and test sets:

In [18]:
data_train, data_test = train_test_split(data_df, test_size=0.5, random_state=1+1+2+3+5+8)
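
As a sanity check on the split, the sketch below rebuilds the DataFrame from the cells above and prints the size and per-class counts of each half. It is illustrative only: the `seed` variable is introduced here to make explicit that `1+1+2+3+5+8` is simply the integer 20.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Rebuild the DataFrame in the same way as the cells above
wine = load_wine()
target_column_name = "y"
data_df = pd.DataFrame(data=wine["data"], columns=wine["feature_names"])
data_df[target_column_name] = wine["target"]

# The random_state expression in the notebook evaluates to 20
seed = 1 + 1 + 2 + 3 + 5 + 8
data_train, data_test = train_test_split(data_df, test_size=0.5, random_state=seed)

# The split is not stratified, so class counts may differ between the halves
print(len(data_train), len(data_test))
print(data_train[target_column_name].value_counts().sort_index())
print(data_test[target_column_name].value_counts().sort_index())
```

If the class proportions turn out to be uneven, `train_test_split` also accepts a `stratify` argument, although the notebook does not use it.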

Write to parquet files:

In [19]:
train_filename = "wine_multiclass_train.parquet"
test_filename = "wine_multiclass_test.parquet"

data_train.to_parquet(train_filename, index=False)
data_test.to_parquet(test_filename, index=False)

We are going to create two Datasets in AzureML, one for the training data and one for the test data. The first step is to create an `MLClient` to perform the upload. The method we use assumes that a `config.json` file (downloadable from the Azure or AzureML portals) is present in the same directory as this notebook file:

In [20]:
from azure.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential(exclude_shared_token_cache_credential=True),
                     logging_enable=True)

Found the config file in: C:\Users\riedgar\source\repos\RAI-vNext-Preview\config.json
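
If `MLClient.from_config` fails, a quick pre-flight check on the file can help. The sketch below assumes the file contains the keys `subscription_id`, `resource_group` and `workspace_name`, as in the file offered for download by the portal. `check_workspace_config` is a hypothetical helper, and the values in the demonstration are placeholders:

```python
import json
import os
import tempfile

# Keys assumed to be present in a workspace config.json
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path: str) -> dict:
    """Load a workspace config file and confirm the expected keys exist."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config file is missing keys: {missing}")
    return config


# Demonstrate on a throwaway file with placeholder values
with tempfile.TemporaryDirectory() as td:
    demo_path = os.path.join(td, "config.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(
            {
                "subscription_id": "<subscription-id>",
                "resource_group": "<resource-group>",
                "workspace_name": "<workspace-name>",
            },
            f,
        )
    demo_config = check_workspace_config(demo_path)
    print(sorted(demo_config))
```

In practice, the helper would be called as `check_workspace_config("config.json")` before creating the client.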


We can then define and upload the datasets:

In [22]:
from azure.ml.entities import Data
from azure.ml.constants import AssetTypes

input_train_data = "wine_multiclass_train_pq"
input_test_data = "wine_multiclass_test_pq"

train_data = Data(
    path=train_filename,
    type=AssetTypes.URI_FILE,
    description="RAI wine_multiclass example training data",
    name=input_train_data,
    version=rai_wine_multiclass_example_version_string,
)
ml_client.data.create_or_update(train_data)

test_data = Data(
    path=test_filename,
    type=AssetTypes.URI_FILE,
    description="RAI wine_multiclass example test data",
    name=input_test_data,
    version=rai_wine_multiclass_example_version_string,
)
ml_client.data.create_or_update(test_data)

Uploading wine_multiclass_train.parquet (< 1 MB): 100%|############################| 15.3k/15.3k [00:00<00:00, 281kB/s]

Uploading wine_multiclass_test.parquet (< 1 MB): 100%|#############################| 15.3k/15.3k [00:00<00:00, 252kB/s]

Data({'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'wine_multiclass_test_pq', 'description': 'RAI wine_multiclass example test data', 'tags': {}, 'properties': {}, 'id': '/subscriptions/589c7ae9-223e-45e3-a191-98433e0821a9/resourceGroups/amlisdkv2-rg-1651831398/providers/Microsoft.MachineLearningServices/workspaces/amlisdkv21651831398/data/wine_multiclass_test_pq/versions/6', 'base_path': './', 'creation_context': <azure.ml._restclient.v2022_05_01.models._models_py3.SystemData object at 0x000002022C8229A0>, 'serialize': <msrest.serialization.Serializer object at 0x0000020227A83E20>, 'version': '6', 'latest_version': None, 'path': 'azureml://subscriptions/589c7ae9-223e-45e3-a191-98433e0821a9/resourcegroups/amlisdkv2-rg-1651831398/workspaces/amlisdkv21651831398/datastores/workspaceblobstore/paths/LocalUpload/0adb0f324e8547238f870136179ad20a/wine_multiclass_test.parquet', 'referenced_uris': None})

## A model training pipeline

To simplify the model creation process, we're going to use a pipeline. This will have two stages:

1. The actual training component
1. A model registration component

We have to register the model in AzureML in order for our RAI insights components to use it.

### The Training Component

The training component is specific to this model. Although the title mentions a DNN, the training script below uses a scikit-learn estimator, which it fits on the input data and saves using MLflow. The script takes command line arguments that specify the location of the input data, the location where MLflow should write the output model, and the name of the target column in the dataset.

We start by creating a directory to hold the component source:

In [24]:
import os

os.makedirs('component_src', exist_ok=True)

Next, we write our training script into that directory:

In [None]:
%%writefile component_src/wine_multiclass_training_script.py

import argparse
import os
import shutil
import tempfile


from azureml.core import Run

import mlflow
import mlflow.sklearn

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--training_data", type=str, help="Path to training data")
    parser.add_argument("--target_column_name", type=str, help="Name of target column")
    parser.add_argument("--model_output", type=str, help="Path of output model")

    # parse args
    args = parser.parse_args()

    # return args
    return args


def main(args):
    current_experiment = Run.get_context().experiment
    tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()
    print("tracking_uri: {0}".format(tracking_uri))
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(current_experiment.name)

    # Read in data
    print("Reading data")
    all_data = pd.read_parquet(args.training_data)

    print("Extracting X_train, y_train")
    print("all_data cols: {0}".format(all_data.columns))
    y_train = all_data[args.target_column_name]
    X_train = all_data.drop(labels=args.target_column_name, axis="columns")
    print("X_train cols: {0}".format(X_train.columns))

    print("Training model")
    # A classifier is used because the target is a class label.
    # The estimator can be changed to suit
    model = RandomForestClassifier()
    model.fit(X_train, y_train)

    # Saving model with mlflow - leave this section unchanged
    with tempfile.TemporaryDirectory() as td:
        print("Saving model with MLFlow to temporary directory")
        tmp_output_dir = os.path.join(td, "my_model_dir")
        mlflow.sklearn.save_model(sk_model=model, path=tmp_output_dir)

        print("Copying MLFlow model to output path")
        for file_name in os.listdir(tmp_output_dir):
            print("  Copying: ", file_name)
            # Since Python 3.8, shutil.copytree accepts dirs_exist_ok,
            # which would remove the need for listdir
            shutil.copy2(src=os.path.join(tmp_output_dir, file_name), dst=os.path.join(args.model_output, file_name))


# run script
if __name__ == "__main__":
    # add space in logs
    print("*" * 60)
    print("\n\n")

    # parse args
    args = parse_args()

    # run main function
    main(args)

    # add space in logs
    print("*" * 60)
    print("\n\n")
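
To see how the three arguments fit together, the sketch below rebuilds the same parser and feeds it a sample argument list. The paths are hypothetical; in a pipeline run, AzureML would supply the real ones. `build_parser` is a name introduced here for illustration:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the argument parser defined in the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data", type=str, help="Path to training data")
    parser.add_argument("--target_column_name", type=str, help="Name of target column")
    parser.add_argument("--model_output", type=str, help="Path of output model")
    return parser


# Hypothetical values, for illustration only
sample_argv = [
    "--training_data", "wine_multiclass_train.parquet",
    "--target_column_name", "y",
    "--model_output", "./model_output",
]
args = build_parser().parse_args(sample_argv)
print(args.training_data, args.target_column_name, args.model_output)
```

Passing an explicit list to `parse_args` makes it possible to try out the arguments in a notebook cell without launching the script.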