Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Automated Machine Learning: Explain classification model and visualize the explanation

In this example we use scikit-learn's [iris dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) to show how to use the AutoML classifier for a simple classification problem and how to explain the resulting model.

Make sure you have executed the [configuration](../configuration.ipynb) before running this notebook.

In this notebook you will see how to:
1. Create an Experiment in an existing Workspace
2. Instantiate AutoMLConfig
3. Train the model using local compute and explain the model
4. Visualize the model's feature importance in a widget
5. Explore the best model's explanation


## Create Experiment

As part of the setup you have already created a <b>Workspace</b>. For AutoML you also need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b> that is used to run experiments.

In [2]:
! pip install --upgrade azureml-sdk[automl]

Collecting azureml-sdk[automl]
  Using cached https://files.pythonhosted.org/packages/75/5d/b9a03efc12c2d18bac509cc8754c3015ee70a50749a63f3b1ba0070c01de/azureml_sdk-1.0.43-py3-none-any.whl
Collecting azureml-train==1.0.43.* (from azureml-sdk[automl])
  Using cached https://files.pythonhosted.org/packages/67/e4/b5a3d84ac40ceaf4203ca0ef0629e8de9c27edefd9ba0e7c32f5630f1930/azureml_train-1.0.43-py3-none-any.whl
Collecting azureml-dataprep<1.2.0a,>=1.1.3a (from azureml-sdk[automl])
  Using cached https://files.pythonhosted.org/packages/bd/ec/dd8521421adaf64264aa26ab31a8be4ffd01c29d0600497eed7b955868ac/azureml_dataprep-1.1.5-py3-none-any.whl
Collecting azureml-core==1.0.43.* (from azureml-sdk[automl])
  Using cached https://files.pythonhosted.org/packages/f6/b2/ba8fde6c28251cec7fee4f6040ba13476a42ecbc138785bf958a5f500704/azureml_core-1.0.43.1-py2.py3-none-any.whl
Collecting azureml-pipeline==1.0.43.* (from azureml-sdk[automl])
  Using cached https://files.pythonhosted.org/packages/67/31/9266

ERROR: keras2onnx 1.5.0 has requirement onnxconverter-common>=1.5.0, but you'll have onnxconverter-common 1.4.2 which is incompatible.


In [6]:
pip install azureml-sdk[notebooks]

Collecting azureml-sdk[notebooks]
  Using cached https://files.pythonhosted.org/packages/75/5d/b9a03efc12c2d18bac509cc8754c3015ee70a50749a63f3b1ba0070c01de/azureml_sdk-1.0.43-py3-none-any.whl
Collecting azureml-dataprep<1.2.0a,>=1.1.3a (from azureml-sdk[notebooks])
  Using cached https://files.pythonhosted.org/packages/bd/ec/dd8521421adaf64264aa26ab31a8be4ffd01c29d0600497eed7b955868ac/azureml_dataprep-1.1.5-py3-none-any.whl
Collecting azureml-core==1.0.43.* (from azureml-sdk[notebooks])
  Using cached https://files.pythonhosted.org/packages/f6/b2/ba8fde6c28251cec7fee4f6040ba13476a42ecbc138785bf958a5f500704/azureml_core-1.0.43.1-py2.py3-none-any.whl
Collecting azureml-pipeline==1.0.43.* (from azureml-sdk[notebooks])
  Using cached https://files.pythonhosted.org/packages/67/31/9266e565b2965616ed694aabb70f035f01627f8e7cfeff48553c3631f0d7/azureml_pipeline-1.0.43-py3-none-any.whl
Collecting azureml-train==1.0.43.* (from azureml-sdk[notebooks])
  Using cached https://files.pythonhosted.org/p

In [15]:
import logging
import os
import random

import pandas as pd
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace

In [13]:
import azureml.train.automl

ModuleNotFoundError: No module named 'azureml.train.automl'

In [16]:
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

ModuleNotFoundError: No module named 'azureml.train.automl'

In [0]:
from azureml.core.authentication import AzureCliAuthentication

cli_auth = AzureCliAuthentication()

ws = Workspace(subscription_id="3c3bb71f-3a4c-436f-9e0a-7407d75a82fa",
               resource_group="eml-training",
               workspace_name="eml01-student99",
               auth=cli_auth)

#ws = Workspace.from_config()

print("Found workspace {} at location {}".format(ws.name, ws.location))

Found workspace eml01-student99 at location southcentralus


In [0]:
# choose a name for experiment
experiment_name = 'automl-local-classification'
# project folder
project_folder = './sample_projects/automl-local-classification-model-explanation'

experiment=Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data = output, index = ['']).T

| | |
|-|-|
|SDK version|1.0.17|
|Subscription ID|3c3bb71f-3a4c-436f-9e0a-7407d75a82fa|
|Workspace Name|eml01-student99|
|Resource Group|eml-training|
|Location|southcentralus|
|Project Directory|./sample_projects/automl-local-classification-model-explanation|
|Experiment Name|automl-local-classification|


## Diagnostics

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [0]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

Turning diagnostics collection on. 


## Load Iris Data Set

In [0]:
from sklearn import datasets

iris = datasets.load_iris()
y = iris.target
X = iris.data

features = iris.feature_names

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    random_state=100,
                                                    stratify=y)

X_train = pd.DataFrame(X_train, columns=features)
X_test = pd.DataFrame(X_test, columns=features)
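
The `stratify=y` argument in the split above asks `train_test_split` to keep the class proportions of `y` in both subsets. The following is a minimal standard-library sketch of that idea, not scikit-learn's implementation. The helper name `stratified_split` and the synthetic labels (three classes of 50 samples, mirroring the shape of the iris targets) are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels: 3 classes x 50 samples, like the iris target vector.
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.1)

test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx))
print(dict(test_counts))
```

With `test_size=0.1`, each class contributes the same number of samples to the held-out set, so validation metrics are not skewed by an unlucky draw.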

## Instantiate Auto ML Config

Instantiate an AutoMLConfig object. This defines the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline on the data|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification.  This should be an array of integers. |
|**X_valid**|(sparse) array-like, shape = [n_samples, n_features]|
|**y_valid**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]|
|**model_explainability**|Whether to explain each trained pipeline|
|**path**|Relative path to the project folder.  AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder. |
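
To make the `AUC_weighted` primary metric more concrete, here is a small standard-library sketch. It assumes the usual definition: a one-vs-rest AUC is computed for each class, and the per-class scores are averaged with weights proportional to class frequency. AutoML's own computation is not shown in this notebook, so treat the function names `binary_auc` and `auc_weighted` and the sample data as hypothetical.

```python
from collections import Counter


def binary_auc(is_positive, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_weighted(y_true, proba):
    """One-vs-rest AUC per class, averaged with class-frequency weights.

    Assumes integer class labels that index the columns of `proba`.
    """
    support = Counter(y_true)
    total = 0.0
    for cls, count in support.items():
        is_positive = [label == cls for label in y_true]
        class_scores = [row[cls] for row in proba]
        total += count * binary_auc(is_positive, class_scores)
    return total / len(y_true)


# Hypothetical labels and predicted class probabilities for six samples.
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.1, 0.4, 0.5],
]
score = auc_weighted(y_true, proba)
print(score)
```

In this toy data, classes 0 and 2 are ranked perfectly while class 1 is not, so the weighted score falls below 1.0.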

In [0]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'AUC_weighted',
                             iteration_timeout_minutes = 200,
                             iterations = 10,
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = y_train,
                             X_valid = X_test,
                             y_valid = y_test,
                             model_explainability=True,
                             path=project_folder)

## Training the Model

Call the *submit* method on the experiment object and pass the AutoMLConfig object. For local runs the execution is synchronous. Depending on the data and the number of iterations, this can run for a while.
The currently running iterations are printed to the console.

In [0]:
! pip install --upgrade numpy

In [0]:
local_run = experiment.submit(automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_e79f03e4-55a1-448f-b218-f073420e35ba
********************************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
SAMPLING %: Percent of the training data to sample.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
********************************************************************************************************************

 ITERATION   PIPELINE                                       SAMPLING %  DURATION      METRIC      BEST
         0   MaxAbsScaler LightGBM                          100.0000    0:00:16       1.0000    1.0000
         1   RobustScaler LightGBM                          100.0000    0:00:18       1.0000    1.0000
         2   RobustScaler LogisticRegression                100

## Exploring the results

### Widget for monitoring runs

The widget shows "loading" until the first iteration completes; after that, an auto-updating graph and table appear. It refreshes once per minute, so you should see the graph update as child runs complete.

NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details.

In [0]:
from azureml.widgets import RunDetails
RunDetails(local_run).show() 

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO', 'sd…

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The *get_output* method on *local_run* returns the best run and the fitted model from the last *fit* invocation. Overloads of *get_output* let you retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [0]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: automl-local-classification,
Id: AutoML_e79f03e4-55a1-448f-b218-f073420e35ba_9,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('LightGBM_8', Pipeline(memory=None,
     steps=[('StandardScalerWrapper', <automl.client.core.common.model_wrappers.StandardScalerWrapper object at 0x000001BE4399EA20>), ('L...x000001BE43A67828>)]))],
               flatten_transform=None, weights=[0.6, 0.1, 0.1, 0.1, 0.1]))])
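
The printed model is a soft-voting ensemble with weights `[0.6, 0.1, 0.1, 0.1, 0.1]`. Assuming `PreFittedSoftVotingClassifier` combines its members the way a conventional soft-voting ensemble does (a weighted average of each member's class probabilities), the effect of those weights can be sketched as follows. The function `soft_vote` and the per-model probabilities are hypothetical; only the weights come from the output above.

```python
def soft_vote(probabilities, weights):
    """Return the weighted average class probabilities and the winning class index."""
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * model_proba[c] for w, model_proba in zip(weights, probabilities))
        / total_weight
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return averaged, predicted


# Weights copied from the printed ensemble; the probabilities are hypothetical.
weights = [0.6, 0.1, 0.1, 0.1, 0.1]
probabilities = [
    [0.7, 0.2, 0.1],  # the most heavily weighted member
    [0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.7, 0.1],
]
averaged, predicted = soft_vote(probabilities, weights)
print(averaged, predicted)
```

Here four of the five members prefer class 1, yet the ensemble predicts class 0 because the first member carries 60% of the total weight.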


### Best Model's explanation

Retrieve the explanation from the best_run. The explanation information includes:

1.	shap_values: The explanation information generated by the SHAP library
2.	expected_values: The expected value of the model applied to the set of X_train data.
3.	overall_summary: The model level feature importance values sorted in descending order
4.	overall_imp: The feature names sorted in the same order as in overall_summary
5.	per_class_summary: The class level feature importance values sorted in descending order. Only available for the classification case
6.	per_class_imp: The feature names sorted in the same order as in per_class_summary. Only available for the classification case
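
To illustrate how `shap_values` and `expected_values` relate, the sketch below uses a hypothetical two-feature linear model rather than the AutoML pipeline. For a linear model with independent features, SHAP values have a closed form, and the expected value plus the SHAP values of a sample adds up to the model's prediction for that sample. All names and numbers here are illustrative assumptions.

```python
from statistics import mean

# Hypothetical linear model f(x) = bias + sum(w_i * x_i) and background data.
weights = [2.0, -1.0]
bias = 0.5
background = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]


def predict(x):
    """Evaluate the linear model on one sample."""
    return bias + sum(w * v for w, v in zip(weights, x))


# Expected value: the mean model output over the background data.
expected_value = mean(predict(row) for row in background)

# For a linear model with independent features, the SHAP value of feature i
# is w_i * (x_i - mean of feature i over the background data).
feature_means = [mean(col) for col in zip(*background)]
x = [4.0, 1.0]
shap_values = [w * (v - m) for w, v, m in zip(weights, x, feature_means)]

# Additivity: expected value plus the SHAP values recovers the prediction.
print(expected_value, shap_values, predict(x))
```

The summary lists in items 3 to 6 are aggregates of such per-sample values, which is why they can be read as feature importances.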

In [0]:
! pip install --upgrade azureml-sdk[explain]

In [0]:
! conda upgrade numpy

In [0]:
from azureml.train.automl.automlexplainer import retrieve_model_explanation

shap_values, expected_values, overall_summary, overall_imp, per_class_summary, per_class_imp = \
    retrieve_model_explanation(best_run)

In [0]:
print(overall_summary)
print(overall_imp)

[0.14166846392786872, 0.1335369079959715, 0.053450419494447406, 0.04017985901439138]
['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']
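
Because `overall_summary` and `overall_imp` are parallel lists, they are easier to read once paired. A minimal sketch using the values printed above (the variable `importance_by_feature` and the share-of-total column are additions for illustration):

```python
# Values copied from the output above.
overall_summary = [0.14166846392786872, 0.1335369079959715,
                   0.053450419494447406, 0.04017985901439138]
overall_imp = ['petal length (cm)', 'petal width (cm)',
               'sepal length (cm)', 'sepal width (cm)']

# The lists are parallel, so zip pairs each feature with its importance.
importance_by_feature = dict(zip(overall_imp, overall_summary))

# Share of the summed importance carried by each feature.
total = sum(overall_summary)
for name, value in importance_by_feature.items():
    print(f"{name:<20} {value:.4f} {value / total:6.1%}")
```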


In [0]:
print(per_class_summary)
print(per_class_imp)

[[0.13848804955812774, 0.11406277723118231, 0.07270583269504863, 0.05967851129014119], [0.1414412060638979, 0.1315330126039155, 0.05192642321504621, 0.04845770098122552], [0.1550149341528167, 0.1450761361615805, 0.039187724807068086, 0.008934642537986743]]
[['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)'], ['petal length (cm)', 'petal width (cm)', 'sepal width (cm)', 'sepal length (cm)'], ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']]
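
Each class has its own feature ordering, so the per-class lists must be re-keyed by feature name before they can be compared across classes. The sketch below does that with the values printed above and then averages each feature over the three classes. For this particular output the averages coincide with `overall_summary`; that is an observation about these numbers, not a documented guarantee of the library.

```python
# Values copied from the outputs above.
per_class_summary = [
    [0.13848804955812774, 0.11406277723118231, 0.07270583269504863, 0.05967851129014119],
    [0.1414412060638979, 0.1315330126039155, 0.05192642321504621, 0.04845770098122552],
    [0.1550149341528167, 0.1450761361615805, 0.039187724807068086, 0.008934642537986743],
]
per_class_imp = [
    ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)'],
    ['petal length (cm)', 'petal width (cm)', 'sepal width (cm)', 'sepal length (cm)'],
    ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)'],
]
overall = {
    'petal length (cm)': 0.14166846392786872,
    'petal width (cm)': 0.1335369079959715,
    'sepal length (cm)': 0.053450419494447406,
    'sepal width (cm)': 0.04017985901439138,
}

# One {feature: importance} mapping per class index.
per_class = [dict(zip(names, values))
             for names, values in zip(per_class_imp, per_class_summary)]

# Average each feature's importance over the classes and compare with the overall value.
for feature, overall_value in overall.items():
    class_mean = sum(mapping[feature] for mapping in per_class) / len(per_class)
    print(f"{feature:<20} mean over classes={class_mean:.6f} overall={overall_value:.6f}")

# The most important feature for each class index.
top_features = [names[0] for names in per_class_imp]
print(top_features)
```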


Besides retrieving the existing model explanation, you can also explain the model with different train/test data.

In [0]:
from azureml.train.automl.automlexplainer import explain_model

shap_values, expected_values, overall_summary, overall_imp, per_class_summary, per_class_imp = \
    explain_model(fitted_model, X_train, X_test)

100%|██████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 86.16it/s]


In [0]:
print(overall_summary)
print(overall_imp)

[0.14166846392786872, 0.1335369079959715, 0.053450419494447406, 0.04017985901439138]
['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']
