Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Model analysis for binary classification scenarios
**This notebook demonstrates how to compute Responsible AI insights such as explanations, counterfactual examples, causal effects and error analysis on remote compute for a binary classification model.**

## Contents
1. [Prerequisites](#Prerequisites)
1. [Dataset](#Dataset)
1. [Create or attach existing AmlCompute cluster](#AmlCompute)
1. [Train model on remote compute](#Train)
1. [Generate RAI insights](#Generate)
1. [Responsible AI dashboard](#Dashboard)

# Prerequisites
## Install azureml-responsibleai 
Please make sure that you have the latest PyPI version of `azureml-responsibleai` installed in your environment. If you do not, run `pip install --upgrade azureml-responsibleai` before running this notebook.

In [None]:
# !pip install --upgrade azureml-responsibleai
# !pip install liac-arff

In [None]:
from responsibleai import ModelAnalysis

import azureml.core
from azureml.core import Experiment, Run, Workspace

from azureml.responsibleai.common.pickle_model_loader import PickleModelLoader
from azureml.responsibleai.tools.model_analysis.model_analysis_config import ModelAnalysisConfig
from azureml.responsibleai.tools.model_analysis.model_analysis_run import ModelAnalysisRun
from azureml.responsibleai.tools.model_analysis.explain_config import ExplainConfig

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

In [None]:
print("This notebook was created using version AZUREML-SDK-VERSION of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")
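
To confirm which package versions are installed locally, you can also query the package metadata directly. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is illustrative and is not part of the Azure ML SDK.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this notebook relies on.
for package in ("azureml-core", "azureml-responsibleai"):
    version = installed_version(package)
    print(package, "->", version if version is not None else "not installed")
```

If the two versions differ, upgrading both packages together is usually the simplest way to bring them back in line.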

## Link an AzureML workspace

To use this notebook, an Azure Machine Learning workspace is required.
Please see the [configuration notebook](../../configuration.ipynb) for information about creating one, if required.

In [None]:
user_workspace = Workspace.from_config()
print('Workspace name: ' + user_workspace.name, 
      'Azure region: ' + user_workspace.location, 
      'Subscription id: ' + user_workspace.subscription_id, 
      'Resource group: ' + user_workspace.resource_group, sep = '\n')

# Dataset
This notebook uses the Adult Census dataset, which can be used to solve a binary classification task: given demographic data on about 32,000 individuals, predict whether a person's annual income is above or below fifty thousand dollars per year. Records with income > 50K are assigned to category "1", while records with income <= 50K are assigned to category "0".

In [None]:
from utilities import fetch_census_dataset

dataset = fetch_census_dataset()
data, y = dataset['data'], dataset['target']
label_name = 'income'
data[label_name] = y
data = data.replace({label_name: {'<=50K': 0, '>50K': 1}})
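
Note that `replace` leaves any value it has no mapping for unchanged, so a label with stray whitespace would pass through silently. The sketch below shows one way to fail loudly instead. It is plain Python for illustration; the helper `encode_labels` and the sample labels are hypothetical and not part of the notebook's utilities.

```python
from collections import Counter

LABEL_MAP = {"<=50K": 0, ">50K": 1}  # same mapping as the cell above


def encode_labels(raw_labels, label_map=LABEL_MAP):
    """Map raw income strings to 0/1, raising on any unexpected value."""
    unexpected = sorted({label for label in raw_labels if label not in label_map})
    if unexpected:
        raise ValueError(f"Unmapped label values: {unexpected}")
    return [label_map[label] for label in raw_labels]


# Hypothetical sample of raw labels, for illustration only.
sample = ["<=50K", ">50K", "<=50K", "<=50K"]
encoded = encode_labels(sample)
print(Counter(encoded))  # Counter({0: 3, 1: 1})
```

The class counts also give a quick view of how imbalanced the two categories are.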

In [None]:
data.head(5)

## Prepare the datasets for training and evaluation
Split the dataset into train and test datasets. The features **age**, **hours-per-week**, **fnlwgt**, **education-num**, **capital-gain** and **capital-loss** are treated as continuous features, while the rest are treated as categorical features.

In [None]:
from sklearn.model_selection import train_test_split
# Create train and test datasets; test_size=0.001 holds out about 0.1% of the rows for testing
data_train, data_test = train_test_split(data, test_size=0.001)

# Separate the training data and the label column
x_train = data_train.drop(label_name, axis=1)
y_train = data_train[label_name].values

# Separate the test data and the label column
x_test = data_test.drop(label_name, axis=1)
y_test = data_test[label_name].values
print(x_test.shape)

# Create lists of categorical and continuous (numerical) features
continuous_features = ['age', 'hours-per-week', 'fnlwgt', 'education-num',
                       'capital-gain', 'capital-loss']
categorical_features = x_train.columns.difference(continuous_features).tolist()

# feature_names comprises both the categorical and the continuous features
feature_names = list(x_train.columns)
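
`Index.difference` returns its result in sorted order, so `categorical_features` above does not follow the original column order. If you prefer to keep the column order and also want to catch misspelled continuous feature names, a sketch along the following lines may help. The helper `split_feature_names` and the column subset are hypothetical.

```python
def split_feature_names(all_columns, continuous, label):
    """Split column names into (features, categorical), keeping column order."""
    features = [name for name in all_columns if name != label]
    missing = [name for name in continuous if name not in features]
    if missing:
        raise KeyError(f"Continuous features not found in the data: {missing}")
    continuous_set = set(continuous)
    categorical = [name for name in features if name not in continuous_set]
    return features, categorical


# Hypothetical subset of the Adult Census columns, for illustration only.
columns = ["age", "workclass", "education", "hours-per-week", "sex", "income"]
features, categorical = split_feature_names(
    columns, continuous=["age", "hours-per-week"], label="income")
print(categorical)  # ['workclass', 'education', 'sex']
```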

## Upload the train and test dataset to datastore
In the cell below, we upload the train and test datasets to the default datastore and register them as AzureML datasets.

In [None]:
from azureml.core import Dataset
datastore = user_workspace.get_default_datastore()

# Upload train data to datastore
train_name = 'adult_train'
train_datastore_path = (datastore, train_name)
train_dataset = Dataset.Tabular.register_pandas_dataframe(
    data_train, train_datastore_path, train_name)


# Upload test data to datastore
test_name = 'adult_test'
test_datastore_path = (datastore, test_name)
test_dataset = Dataset.Tabular.register_pandas_dataframe(
    data_test, test_datastore_path, test_name)

# Create or attach existing AmlCompute cluster

You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model and computing RAI insights for the trained model. In this tutorial, you create `AmlCompute` as your training compute resource.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

**Creation of AmlCompute takes approximately 5 minutes.** If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation process.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "rai-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=user_workspace, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',
                                                           max_nodes=6)
    compute_target = ComputeTarget.create(user_workspace, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Train model on remote compute
In this section, we train a simple classification model on the remote compute.

Add `azureml-responsibleai` as a pip dependency in the run configuration.

In [None]:
from azureml.core import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

run_config = RunConfiguration(framework="python")

conda_dependencies = CondaDependencies.create()
run_config.environment.python.conda_dependencies = conda_dependencies
run_config.environment.python.conda_dependencies.add_pip_package("azureml-responsibleai=={}".format(azureml.core.VERSION))
run_config.target = compute_target
run_config

Copy the training script into the script directory. This script trains the model and registers it on the remote compute.

In [None]:
import shutil
import os

# Create the script folder if it does not already exist
script_folder = './sample_projects/rai-adult-classification'
os.makedirs(script_folder, exist_ok=True)

# Copy the sample script to script folder.
shutil.copy('train.py', script_folder)

# Path to the training script that will run on the remote compute.
script_file_name = script_folder + '/train.py'
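
The script is copied into its own folder because the contents of the source directory are uploaded with the run, so a dedicated folder keeps that upload small. The staging step can be sketched as a reusable helper. This is an illustrative sketch only; `stage_script` is a hypothetical name, and the example works on throwaway files in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def stage_script(script_path, target_folder):
    """Copy a script into target_folder (created if needed) and return the new path."""
    target = Path(target_folder)
    target.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(script_path, target))


# Demonstrate with throwaway files so nothing in the working directory is touched.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "train.py"
    source.write_text("print('training placeholder')\n")
    staged = stage_script(source, Path(workdir) / "sample_projects" / "demo")
    print(staged.name, staged.exists())  # train.py True
```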

Submit the train script via `ScriptRunConfig` to train the model on remote compute.

In [None]:
# Now submit a run on AmlCompute to train the model
from azureml.core.script_run_config import ScriptRunConfig

exp_name = "RAI-Classification-Adult"
experiment = Experiment(user_workspace, exp_name)


script_run_config = ScriptRunConfig(source_directory=script_folder,
                                    script='train.py',
                                    run_config=run_config)

run = experiment.submit(script_run_config)

# Show run details
run

Wait for the above model training run to complete.

In [None]:
run.wait_for_completion(raise_on_error=True, wait_post_processing=True)

# Generate RAI insights

This section walks you through computing Responsible AI insights such as model explanations, counterfactual examples, causal effects and error analysis, using the model analysis workflow on your remote compute for the model trained in the previous section.

## Configure model analysis and submit RAI insight computation runs
In this section, we demonstrate how to configure model analysis, submit the model analysis run, and submit the individual RAI computations for explanations, counterfactual examples, error analysis and causal effects for your trained model.

### Create ModelAnalysis configuration

Create a `ModelAnalysisConfig` for computing the RAI insights for the trained model. The `ModelAnalysisConfig` requires the following:
1. The model that was registered during model training.
2. The train and test datasets.
3. `confidential_datastore_name`, which is the name of the datastore where the analyses will be uploaded.
4. The list of feature column names, obtained by dropping the label column name from the list of all column names.
5. The list of categorical features.
6. The AzureML run configuration that was set up in the previous section.


In [None]:
from azureml.core import Model

registered_model = Model.list(user_workspace, 'adult')[0]
model_loader = PickleModelLoader('adult.pkl')

train_dataset = Dataset.get_by_name(workspace=user_workspace, name='adult_train')
test_dataset = Dataset.get_by_name(workspace=user_workspace, name='adult_test')

ma = ModelAnalysisConfig(
    title="RAI Classification", # The name to assign to this model analysis
    model=registered_model, # The registered model in AzureML to analyse
    model_type='classification', # Type of model it is, 'classification' or 'regression'
    model_loader=model_loader, # The model loader module for loading the model. Specify 'mlflow' to load using mlflow.
    train_dataset=train_dataset, # The training dataset to use for this analysis.
    test_dataset=test_dataset, # The test dataset to use for this analysis.
    X_column_names=feature_names, # The names of the feature columns in the train dataset (excluding the target).
    target_column_name=label_name, # The name of the target column.
    confidential_datastore_name=user_workspace.get_default_datastore().name, # The name of the confidential datastore where the analyses will be uploaded.
    run_configuration=run_config, # The RunConfiguration specifying the compute on which this analysis will be computed.
    categorical_column_names=categorical_features # List of all categorical columns in the dataset.
)
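
The column arguments passed to the configuration need to agree with each other and with both datasets. The following sketch shows one way to sanity-check them locally before submitting a run. It is plain Python; the helper `check_analysis_columns` and the sample column lists are hypothetical and are not part of `azureml-responsibleai`.

```python
def check_analysis_columns(train_columns, test_columns, feature_names,
                           categorical_names, target_name):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if target_name in feature_names:
        problems.append(f"target '{target_name}' must not be listed as a feature")
    required = list(feature_names) + [target_name]
    for split_name, columns in (("train", train_columns), ("test", test_columns)):
        absent = [c for c in required if c not in columns]
        if absent:
            problems.append(f"{split_name} dataset is missing columns: {absent}")
    not_features = [c for c in categorical_names if c not in feature_names]
    if not_features:
        problems.append(f"categorical columns are not features: {not_features}")
    return problems


# Hypothetical column lists, for illustration only.
all_columns = ["age", "workclass", "sex", "income"]
print(check_analysis_columns(
    train_columns=all_columns,
    test_columns=all_columns,
    feature_names=["age", "workclass", "sex"],
    categorical_names=["workclass", "sex"],
    target_name="income",
))  # []
```

In this notebook the equivalent call would pass `list(data_train.columns)`, `list(data_test.columns)`, `feature_names`, `categorical_features` and `label_name`.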

### Submit model analysis run

The model analysis run takes a snapshot of the data in preparation for the model explanation, error analysis, causal and counterfactual computations.
It is the parent run for the model explanation, error analysis, causal and counterfactual runs.

In [None]:
model_analysis_run = experiment.submit(ma)
model_analysis_run.wait_for_completion(raise_on_error=True,
                                       wait_post_processing=True)
model_analysis_run

### Submit run for explanations

Run model explanation based on the model analysis.
The explanation run is a child run of the model analysis run.
In the future, the `add_request` method will allow extra parameters to configure the explanation generated.

In [None]:
ec = ExplainConfig(model_analysis_run, run_config)
ec.add_request(
    comment="Compute Explanations" # Comment to identify the explain configuration
)
explain_run = model_analysis_run.submit_child(ec)

### Submit run for error analysis

Run error analysis based on the model analysis.
The error analysis run is a child run of the model analysis run.

In [None]:
from azureml.responsibleai.tools.model_analysis.error_analysis_config import ErrorAnalysisConfig

ec = ErrorAnalysisConfig(model_analysis_run, run_config)
ec.add_request(
    filter_features=['capital-gain', 'hours-per-week'], # One or two features to use for the matrix filter
    max_depth=3, # The maximum depth of the error analysis tree
    comment="Compute ErrorAnalysis" # Comment to identify the error analysis configuration
)
error_analysis_run = model_analysis_run.submit_child(ec)
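
To give an intuition for what a matrix filter over two features reports, the sketch below groups predictions into cells defined by two binned features and computes the error rate in each cell. It is a conceptual illustration in plain Python with made-up rows and bins, not the implementation used by the error analysis run.

```python
from collections import defaultdict


def error_rate_matrix(rows, feature_a, feature_b, bin_a, bin_b):
    """Group rows by two binned features and return the error rate per cell.

    Each row is a dict holding the two feature values plus 'y_true' and 'y_pred'.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [errors, total]
    for row in rows:
        cell = (bin_a(row[feature_a]), bin_b(row[feature_b]))
        counts[cell][0] += int(row["y_true"] != row["y_pred"])
        counts[cell][1] += 1
    return {cell: errors / total for cell, (errors, total) in counts.items()}


# Hypothetical predictions, for illustration only.
rows = [
    {"capital-gain": 0, "hours-per-week": 40, "y_true": 0, "y_pred": 0},
    {"capital-gain": 0, "hours-per-week": 60, "y_true": 1, "y_pred": 0},
    {"capital-gain": 0, "hours-per-week": 55, "y_true": 1, "y_pred": 1},
    {"capital-gain": 9000, "hours-per-week": 45, "y_true": 1, "y_pred": 1},
]
matrix = error_rate_matrix(
    rows, "capital-gain", "hours-per-week",
    bin_a=lambda gain: "gain>0" if gain > 0 else "gain=0",
    bin_b=lambda hours: ">50h" if hours > 50 else "<=50h",
)
for cell, rate in sorted(matrix.items()):
    print(cell, rate)
```

Cells with a high error rate point to cohorts where the model underperforms and that may deserve a closer look.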

### Submit run for counterfactual examples

Generate counterfactuals for all the samples in the `test_dataset` based on the model analysis.
The counterfactual run is a child run of the model analysis run.
The `add_request` method accepts extra parameters that configure the counterfactual examples to be generated.

In [None]:
from azureml.responsibleai.tools.model_analysis.counterfactual_config import CounterfactualConfig

cf_config = CounterfactualConfig(model_analysis_run, run_config)
cf_config.add_request(
    total_CFs=10, # Total number of counterfactuals required
)
cf_run = model_analysis_run.submit_child(cf_config)
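
A counterfactual example is a modified copy of a record for which the model's prediction changes. The toy sketch below illustrates the idea with a hand-written scoring rule and a brute-force search over single-feature edits. Both `toy_model` and `find_counterfactuals` are hypothetical; the counterfactual run itself uses a more sophisticated search that this notebook does not describe.

```python
def toy_model(person):
    """Hypothetical stand-in classifier: predicts 1 (income > 50K) from a simple score."""
    score = person["education-num"] * 2 + person["hours-per-week"] * 0.5
    return int(score >= 50)


def find_counterfactuals(person, model, candidates, total_cfs):
    """Return up to total_cfs single-feature edits that flip the model's prediction."""
    original = model(person)
    found = []
    for feature, values in candidates.items():
        for value in values:
            if value == person[feature]:
                continue  # not an edit
            edited = {**person, feature: value}
            if model(edited) != original:
                found.append(edited)
                if len(found) == total_cfs:
                    return found
    return found


person = {"education-num": 10, "hours-per-week": 40}  # hypothetical record
candidates = {"education-num": [12, 14, 16], "hours-per-week": [50, 60, 70]}
counterfactuals = find_counterfactuals(person, toy_model, candidates, total_cfs=3)
for cf in counterfactuals:
    print(cf)
```

Here the original record is predicted as 0, and each returned edit changes exactly one feature so that the toy model predicts 1.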

### Submit run for causal effects

Compute causal effects based on the model analysis.
The causal run is a child run of the model analysis run.
The `add_request` method accepts extra parameters that configure the causal effects to be computed.

In [None]:
from azureml.responsibleai.tools.model_analysis.causal_config import CausalConfig

causal_config = CausalConfig(model_analysis_run, run_config)
causal_config.add_request(
    treatment_features=['capital-gain', 'hours-per-week'], # Treatment feature names
    nuisance_model='linear', # Model type to use for nuisance estimation
)
causal_run = model_analysis_run.submit_child(causal_config)
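
To illustrate the role a linear nuisance model can play, the sketch below estimates a treatment effect by "partialling out" a confounder: it fits a linear model of the treatment and of the outcome on the confounder, then regresses the outcome residuals on the treatment residuals. This is a simplified, standard-library illustration on synthetic data with a known effect. It is not the estimator used by the causal run, and all names in it are hypothetical.

```python
def ols_residuals(x, y):
    """Residuals of a simple least-squares fit y ~ a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def partialled_out_effect(confounder, treatment, outcome):
    """Estimate the treatment effect after removing the confounder's linear influence."""
    t_res = ols_residuals(confounder, treatment)  # nuisance model for the treatment
    y_res = ols_residuals(confounder, outcome)    # nuisance model for the outcome
    return sum(t * y for t, y in zip(t_res, y_res)) / sum(t * t for t in t_res)


# Synthetic data with a known effect of 2.0, for illustration only.
confounder = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
wiggle = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
treatment = [0.5 * x + w for x, w in zip(confounder, wiggle)]
outcome = [2.0 * t + 3.0 * x for t, x in zip(treatment, confounder)]
effect = partialled_out_effect(confounder, treatment, outcome)
print(round(effect, 6))  # 2.0
```

A plain regression of the outcome on the treatment alone would be biased here, because the confounder influences both.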

## Download and inspect RAI insights
In this section, we demonstrate how to download the RAI insights computed in the previous section and look at different aspects of your trained model.
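
The four child runs were submitted without waiting on each other, so each download step below waits for its own run first. If you would rather wait for all of them up front and see every status at once, a small helper like the following can be used. It is a sketch that relies only on the `wait_for_completion` and `get_status` methods of a run; `wait_for_all` and `_FakeRun` are hypothetical names, and `_FakeRun` exists only so the example runs without a workspace.

```python
def wait_for_all(named_runs):
    """Block until every run finishes and return a {name: status} summary.

    raise_on_error=False lets the loop collect every status instead of
    stopping at the first failed run.
    """
    statuses = {}
    for name, run in named_runs.items():
        run.wait_for_completion(raise_on_error=False, wait_post_processing=True)
        statuses[name] = run.get_status()
    return statuses


class _FakeRun:
    """Stand-in used only to exercise the helper without a workspace."""

    def __init__(self, status):
        self._status = status

    def wait_for_completion(self, raise_on_error=True, wait_post_processing=False):
        return {"status": self._status}

    def get_status(self):
        return self._status


summary = wait_for_all({"explain": _FakeRun("Completed"), "causal": _FakeRun("Failed")})
print(summary)  # {'explain': 'Completed', 'causal': 'Failed'}
```

In this notebook, the call would look like `wait_for_all({"explain": explain_run, "error_analysis": error_analysis_run, "counterfactual": cf_run, "causal": causal_run})`.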

### Download explanations and view global feature importance
Before downloading the explanations, make sure that the `explain_run` has completed.

The `explanation_manager.list` method below returns a list of metadata dictionaries, one for each explain run. In this case there is a single explain run, so the list contains a single dictionary.

You can then download the computed explanations using the `download_by_id` method in the `explanation_manager` and look at the feature importance.

In [None]:
explain_run.wait_for_completion(raise_on_error=True, wait_post_processing=True)
explanations_meta = model_analysis_run.explanation_manager.list()
explanation = model_analysis_run.explanation_manager.download_by_id(explanations_meta[0]['id'])

In [None]:
explanation.get_feature_importance_dict()
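
Assuming the returned dictionary maps feature names to importance values, a short helper can list the most influential features. The helper `top_features` and the numbers below are hypothetical and serve only to illustrate the ranking.

```python
def top_features(importance_by_feature, k=5):
    """Return the k (feature, importance) pairs with the largest absolute importance."""
    ranked = sorted(importance_by_feature.items(),
                    key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Hypothetical importances, for illustration only.
example = {"age": 0.12, "capital-gain": 0.31, "sex": 0.05, "education-num": 0.20}
for name, value in top_features(example, k=3):
    print(f"{name:15s} {value:.2f}")
```

With the notebook's objects, the call would be `top_features(explanation.get_feature_importance_dict())`.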

### Download error analysis report
Before downloading the error analysis report, make sure that the `error_analysis_run` has completed.

The `error_analysis_manager.list` method below returns a list of metadata dictionaries, one for each error analysis run. In this case there is a single error analysis run, so the list contains a single dictionary.

You can then download the computed error analysis report using the `download_by_id` method in the `error_analysis_manager` and inspect the error analysis report.

In [None]:
error_analysis_run.wait_for_completion(raise_on_error=True, wait_post_processing=True)
erroranalysis_meta = model_analysis_run.error_analysis_manager.list()
erroranalysis_report = model_analysis_run.error_analysis_manager.download_by_id(erroranalysis_meta[0]['id'])

You can view the JSON tree and heatmap representations of the error analysis report directly, without the visualization widget and without uploading the report to AzureML.

In [None]:
erroranalysis_report.tree

### Download counterfactual examples
Before downloading the counterfactual examples, make sure that the `cf_run` has completed.

The `counterfactual_manager.list` method below returns a list of metadata dictionaries, one for each counterfactual run. In this case there is a single counterfactual run, so the list contains a single dictionary.

The `download_by_id()` method available in the `counterfactual_manager` can be used to download the counterfactual examples.


In [None]:
cf_run.wait_for_completion(raise_on_error=True, wait_post_processing=True)
cf_meta = model_analysis_run.counterfactual_manager.list()
cf_meta
counterfactual_object = model_analysis_run.counterfactual_manager.download_by_id(cf_meta[0]['id'])

You can use the `visualize_as_dataframe()` method to view the generated counterfactual examples for the samples in `test_dataset`.

In [None]:
counterfactual_object.visualize_as_dataframe()

You can use the `summary_importance` property to see the feature importance that is computed when generating counterfactual examples.

In [None]:
counterfactual_object.summary_importance

### Download causal effects
Before downloading the causal effects, make sure that the `causal_run` has completed.

The `causal_manager.list` method below returns a list of metadata dictionaries, one for each causal effects run. In this case there is a single causal effects run, so the list contains a single dictionary.

You can then download the computed causal effects using the `download_by_id` method in the `causal_manager` and inspect the downloaded causal effects.

In [None]:
causal_run.wait_for_completion(raise_on_error=True, wait_post_processing=True)
causal_meta = model_analysis_run.causal_manager.list()
causal_object = model_analysis_run.causal_manager.download_by_id(causal_meta[0]['id'])

In [None]:
causal_object['global_effects']

# Responsible AI dashboard
The dashboard containing the Responsible AI insights computed in the previous sections can be found under the Models section in [AzureML studio](https://ml.azure.com/).