Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Automated Machine Learning
_**Classification with Deployment using a Bank Marketing Dataset**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Train](#Train)
1. [Results](#Results)
1. [Deploy](#Deploy)
1. [Test](#Test)
1. [Acknowledgements](#Acknowledgements)

## Introduction

In this example we use the UCI Bank Marketing dataset to showcase how you can use AutoML for a classification problem and deploy the resulting model to an Azure Container Instance (ACI). The classification goal is to predict whether the client will subscribe to a term deposit with the bank.

If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, if you haven't already, go through the [configuration](../../../configuration.ipynb) notebook first to establish your connection to the AzureML Workspace.

ONNX-related documentation is available [here](https://github.com/onnx/onnx).

In this notebook you will learn how to:
1. Create an experiment using an existing workspace.
2. Configure AutoML using `AutoMLConfig`.
3. Train the model with the ONNX-compatible configuration turned on.
4. Explore the results and featurization transparency options, and save the ONNX model.
5. Run inference with the ONNX model.
6. Register the model.
7. Create a container image.
8. Create an Azure Container Instance (ACI) service.
9. Test the ACI service.

In addition, this notebook showcases the following features:
- **Blocking** certain pipelines
- Specifying **target metrics** to indicate stopping criteria
- Handling **missing data** in the input

## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [1]:
%env AZURE_EXTENSION_DIR=/home/schrodinger/automl/sdk-cli-v2/src/cli/src
%env AZURE_ML_CLI_PRIVATE_FEATURES_ENABLED=true

env: AZURE_EXTENSION_DIR=/home/schrodinger/automl/sdk-cli-v2/src/cli/src
env: AZURE_ML_CLI_PRIVATE_FEATURES_ENABLED=true


In [2]:
import azure.ml
from azure.ml import MLClient

from azure.core.exceptions import ResourceExistsError

from azure.ml.entities import Workspace
from azure.ml.entities import AmlCompute
from azure.ml.entities import Data

import pandas as pd

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

In [3]:
# TODO: Versions need to change
print("This notebook was created using version 1.31.0 of the Azure ML SDK")
print("You are currently using SDK version", azure.ml.version.VERSION, "of the Azure ML SDK")

This notebook was created using version 1.31.0 of the Azure ML SDK
You are currently using SDK version 0.0.86 of the Azure ML SDK


#### TODO: Equivalents for the following may be missing.

Accessing the Azure ML workspace requires authentication with Azure.

The default authentication is interactive authentication using the default tenant.  Executing the `ws = Workspace.from_config()` line in the cell below will prompt for authentication the first time that it is run.

If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:

```
from azureml.core.authentication import InteractiveLoginAuthentication
auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')
ws = Workspace.from_config(auth = auth)
```

If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:

```
from azureml.core.authentication import ServicePrincipalAuthentication
auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')
ws = Workspace.from_config(auth = auth)
```
For more details, see [aka.ms/aml-notebook-auth](http://aka.ms/aml-notebook-auth)

### Initialize MLClient

Create an `MLClient` object to interact with Azure ML resources such as computes and jobs.

In [4]:
subscription_id = '381b38e9-9840-4719-a5a0-61d9585e1e91'
resource_group_name = 'gasi_rg_centraleuap'
workspace_name = "gasi_ws_centraleuap"
experiment_name = "automl-classification-bmarketing-all"

client = MLClient(subscription_id, resource_group_name, default_workspace_name=workspace_name)

client

<azure.ml._ml_client.MLClient at 0x7f2ecec30350>
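
The cell above hardcodes the subscription, resource group, and workspace names. One possible alternative, sketched here under the assumption that these values are supplied through environment variables, is to read them at startup. The variable names (`AZUREML_SUBSCRIPTION_ID` and so on) and the helper `load_workspace_settings` are hypothetical and not part of the SDK.

```python
import os


def load_workspace_settings(env=os.environ):
    """Read workspace identifiers from environment variables.

    The variable names below are illustrative assumptions, not an SDK convention.
    """
    required = {
        "subscription_id": "AZUREML_SUBSCRIPTION_ID",
        "resource_group_name": "AZUREML_RESOURCE_GROUP",
        "workspace_name": "AZUREML_WORKSPACE_NAME",
    }
    # Fail early with a clear message if anything is missing or empty
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in required.items()}


# Example with an explicit mapping instead of the real environment
example_env = {
    "AZUREML_SUBSCRIPTION_ID": "00000000-0000-0000-0000-000000000000",
    "AZUREML_RESOURCE_GROUP": "my_resource_group",
    "AZUREML_WORKSPACE_NAME": "my_workspace",
}
settings = load_workspace_settings(example_env)
print(settings["workspace_name"])
```

The returned values could then be passed to `MLClient` in the same way as the hardcoded strings above.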

### Initialize MLFlowClient

Create an MLFlowClient to interact with the resources that the AutoML job creates, such as models and metrics.

In [5]:
!pip install azureml-core azureml-mlflow

Collecting docker<5.0.0
  Using cached docker-4.4.4-py2.py3-none-any.whl (147 kB)
Collecting azure-mgmt-resource<15.0.0,>=1.2.1
  Using cached azure_mgmt_resource-13.0.0-py2.py3-none-any.whl (1.3 MB)
Collecting azure-mgmt-storage<16.0.0,>=1.5.0
  Using cached azure_mgmt_storage-11.2.0-py2.py3-none-any.whl (547 kB)


Installing collected packages: docker, azure-mgmt-storage, azure-mgmt-resource
  Attempting uninstall: docker
    Found existing installation: docker 5.0.0
    Uninstalling docker-5.0.0:
      Successfully uninstalled docker-5.0.0
  Attempting uninstall: azure-mgmt-storage
    Found existing installation: azure-mgmt-storage 18.0.0
    Uninstalling azure-mgmt-storage-18.0.0:
      Successfully uninstalled azure-mgmt-storage-18.0.0
  Attempting uninstall: azure-mgmt-resource
    Found existing installation: azure-mgmt-resource 18.0.0
    Uninstalling azure-mgmt-resource-18.0.0:
      Successfully uninstalled azure-mgmt-resource-18.0.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
azure-cli 2.25.0 requires azure-mgmt-resource==18.0.0, but you have azure-mgmt-resource 13.0.0 which is incompatible.
azure-cli 2.25.0 requires azure-mgmt-storage~=18.0.0, but you

In [6]:
import mlflow

########
# TODO: The API to get the tracking URI is not yet available on the Workspace object.
from azureml.core import Workspace as WorkspaceV1
ws = WorkspaceV1(workspace_name=workspace_name, resource_group=resource_group_name, subscription_id=subscription_id)
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
del ws
########

# Not sure why this doesn't work w/o the double + single quotes
# mlflow.set_tracking_uri("azureml://northeurope.experiments.azureml.net/mlflow/v1.0/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_neu/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_neu?")
mlflow.set_experiment(experiment_name)

print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))


Current tracking uri: azureml://master.experiments.azureml-test.net/mlflow/v1.0/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap?
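
The tracking URI printed above encodes the subscription, resource group, and workspace. As a quick sanity check that MLflow points at the intended workspace, the URI can be split into its parts. This is a minimal sketch using only the standard library; the helper name `parse_tracking_uri` is hypothetical, and the path layout is assumed from the URI shown in the output above.

```python
from urllib.parse import urlparse


def parse_tracking_uri(uri):
    """Extract host, subscription, resource group and workspace from a tracking URI."""
    parsed = urlparse(uri)
    parts = [p for p in parsed.path.split("/") if p]
    lowered = [p.lower() for p in parts]

    def value_after(key):
        # The value follows its key segment, e.g. .../workspaces/<name>
        return parts[lowered.index(key.lower()) + 1]

    return {
        "host": parsed.netloc,
        "subscription_id": value_after("subscriptions"),
        "resource_group": value_after("resourceGroups"),
        "workspace": value_after("workspaces"),
    }


uri = (
    "azureml://master.experiments.azureml-test.net/mlflow/v1.0/"
    "subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/"
    "resourceGroups/gasi_rg_centraleuap/providers/"
    "Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap?"
)
info = parse_tracking_uri(uri)
print(info["resource_group"], info["workspace"])
```

Comparing `info["workspace"]` with `workspace_name` would catch a stale tracking URI left over from another workspace.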


## Create or Attach existing AmlCompute
You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

#### Creation of AmlCompute takes approximately 5 minutes. 
If an AmlCompute with that name already exists in your workspace, this code skips the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [24]:
# Set or create compute

cpu_cluster_name = "cpu-cluster"
compute = AmlCompute(
    name=cpu_cluster_name, size="STANDARD_D13_V2",
    min_instances=0, max_instances=3,
    idle_time_before_scale_down=120
)

# Load directly from YAML file
# compute = Compute.load("./compute.yaml")

try:
    # TODO: This currently results in an exception in Azure ML, please create compute manually.
    client.compute.create(compute)
except ResourceExistsError as re:
    print(re)
except Exception as e:
    import traceback
    
    print("Could not create compute.", str(e))
#     traceback.print_exc()
    # Reload an existing compute target
    compute = client.compute.get(cpu_cluster_name)

compute

Could not create compute. 'NoneType' object has no attribute 'properties'


AmlCompute({'name': 'cpu-cluster', 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap/computes/cpu-cluster', 'description': None, 'tags': {}, 'properties': {}, 'base_path': './', 'location': 'centraluseuap', 'type': 'amlcompute', 'enable_public_ip': False, 'resource_id': None, 'provisioning_state': 'Succeeded', 'provisioning_errors': None, 'created_on': None, 'size': 'STANDARD_DS2_V2', 'min_instances': 0, 'max_instances': 2, 'idle_time_before_scale_down': 120.0, 'identity_type': None, 'user_assigned_identities': None, 'admin_username': 'azureuser', 'admin_password': None, 'ssh_key_value': None, 'vnet_name': None, 'subnet': None, 'priority': 'Dedicated'})
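
In the output above, creation failed and the existing `cpu-cluster` was reloaded, so the cluster in use does not necessarily match the requested definition (the request asked for `STANDARD_D13_V2` with up to 3 instances, while the reloaded cluster reports `STANDARD_DS2_V2` with up to 2). A small comparison like the following sketch makes such differences visible. It works on plain dictionaries; the helper `diff_settings` is hypothetical, and the values are copied from the cell and its output.

```python
def diff_settings(requested, actual):
    """Return {key: (requested_value, actual_value)} for every key that differs."""
    return {
        key: (value, actual.get(key))
        for key, value in requested.items()
        if actual.get(key) != value
    }


# Values requested in the AmlCompute(...) call
requested = {
    "size": "STANDARD_D13_V2",
    "min_instances": 0,
    "max_instances": 3,
    "idle_time_before_scale_down": 120,
}
# Values reported by the reloaded compute in the output
actual = {
    "size": "STANDARD_DS2_V2",
    "min_instances": 0,
    "max_instances": 2,
    "idle_time_before_scale_down": 120.0,
}

mismatches = diff_settings(requested, actual)
for key, (want, got) in mismatches.items():
    print(f"{key}: requested {want!r}, existing cluster has {got!r}")
```

Whether a mismatch matters depends on the workload; a smaller VM size or fewer nodes mainly affects how many trials can run concurrently and how long they take.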

# Data

### Load Data

Leverage Azure compute to load the bank marketing dataset as a Tabular Dataset into the dataset variable.

### Training Data

In [25]:
data = pd.read_csv("https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv")
data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,57,technician,married,high.school,no,no,yes,cellular,may,mon,...,1,999,1,failure,-1.8,92.893,-46.2,1.299,5099.1,no
1,55,unknown,married,unknown,unknown,yes,no,telephone,may,thu,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,no
2,33,blue-collar,married,basic.9y,no,no,no,cellular,may,fri,...,1,999,1,failure,-1.8,92.893,-46.2,1.313,5099.1,no
3,36,admin.,married,high.school,no,no,no,telephone,jun,fri,...,4,999,0,nonexistent,1.4,94.465,-41.8,4.967,5228.1,no
4,27,housemaid,married,high.school,no,yes,no,cellular,jul,fri,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228.1,no


In [26]:
# Add missing values in 75% of the lines.
import numpy as np

missing_rate = 0.75
n_missing_samples = int(np.floor(data.shape[0] * missing_rate))
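# np.bool was deprecated in NumPy 1.20 and removed in later releases; the builtin bool is equivalent here
missing_samples = np.hstack((np.zeros(data.shape[0] - n_missing_samples, dtype=bool), np.ones(n_missing_samples, dtype=bool)))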
rng = np.random.RandomState(0)
rng.shuffle(missing_samples)
missing_features = rng.randint(0, data.shape[1], n_missing_samples)

data.values[np.where(missing_samples)[0], missing_features] = np.nan

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
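
The cell above picks 75% of the rows and blanks one randomly chosen column in each. The same selection logic can be sketched in pure Python on a small list of rows, which makes the row count easy to verify. This is an illustration only: the helper `inject_missing` is hypothetical, and it uses `random` rather than NumPy, so it will not reproduce the exact cells chosen by the notebook.

```python
import math
import random


def inject_missing(rows, missing_rate, seed=0):
    """Blank one randomly chosen cell in floor(len(rows) * missing_rate) rows."""
    rng = random.Random(seed)
    n_missing = int(math.floor(len(rows) * missing_rate))
    # Choose which rows receive a missing value, without repetition
    chosen_rows = rng.sample(range(len(rows)), n_missing)
    result = [list(row) for row in rows]  # copy so the input is left untouched
    for row_index in chosen_rows:
        col_index = rng.randrange(len(result[row_index]))
        result[row_index][col_index] = None
    return result


rows = [[i, i * 2, i * 3] for i in range(8)]
with_gaps = inject_missing(rows, missing_rate=0.75)
rows_with_gap = sum(1 for row in with_gaps if None in row)
print(rows_with_gap, "of", len(rows), "rows now contain a missing value")
```

In the pandas version it may be worth confirming that the values actually landed, for example with `data.isna().sum().sum()`, because `DataFrame.values` can return a copy when the columns have mixed types.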
  


In [27]:
# Create validation and test datasets

validation_data = pd.read_csv("https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_validate.csv")
test_data = pd.read_csv("https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_test.csv")

validation_data.shape, test_data.shape

((4118, 21), (4120, 21))

In [28]:
# Save the CSV file locally, so that it can be uploaded to create a 
# tabular dataset

import os

if not os.path.isdir('data'):
    os.mkdir('data')
    
# Save the train-test-valid data to a csv to be uploaded to the datastore
data.to_csv("data/train_data.csv", index=False)
validation_data.to_csv("data/valid_data.csv", index=False)
test_data.to_csv("data/test_data.csv", index=False)

In [68]:
# TODO: This doesn't work yet; make sure the dataset is created via the UI
# Create dataset

dataset_name = "bankmarketing_train"
dataset_version = 1

try:
    training_data = client.data.get(dataset_name, dataset_version)
#     training_data = Data(name=dataset_name, version=dataset_version, local_path="./data")
#     training_data = client.data.create_or_update(training_data)
#     print("Uploaded to path  : ", data.path)
#     print("Datastore location: ", data.datastore)
except Exception as e:
    print("Could not retrieve dataset. ", str(e))

training_data

Data({'is_anonymous': False, 'name': 'bankmarketing_train', 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap/data/bankmarketing_train/versions/1', 'description': None, 'tags': {}, 'properties': {}, 'base_path': './', 'creation_context': <azure.ml._restclient.v2021_03_01_preview.models._models_py3.SystemData object at 0x7f60a3f11150>, 'version': 1, 'datastore': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap/datastores/workspaceblobstore', 'path': 'UI/06-28-2021_072836_UTC/bank_marketing_train_data.csv', 'local_path': None})

### Validation Data

In [30]:
validation_dataset_name = "bankmarketing_valid"
validation_data = client.data.get(validation_dataset_name, dataset_version)

validation_data

Data({'is_anonymous': False, 'name': 'bankmarketing_valid', 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap/data/bankmarketing_valid/versions/1', 'description': None, 'tags': {}, 'properties': {}, 'base_path': './', 'creation_context': <azure.ml._restclient.v2021_03_01_preview.models._models_py3.SystemData object at 0x7f1fc0405290>, 'version': 1, 'datastore': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap/datastores/workspaceblobstore', 'path': 'UI/06-28-2021_072923_UTC/bank_marketing_valid_data.csv', 'local_path': None})

### Test Data

In [69]:
test_dataset_name = "bankmarketing_test"
test_data = client.data.get(test_dataset_name, dataset_version)

test_data

Data({'is_anonymous': False, 'name': 'bankmarketing_test', 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap/data/bankmarketing_test/versions/1', 'description': None, 'tags': {}, 'properties': {}, 'base_path': './', 'creation_context': <azure.ml._restclient.v2021_03_01_preview.models._models_py3.SystemData object at 0x7f604e3c97d0>, 'version': 1, 'datastore': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap/datastores/workspaceblobstore', 'path': 'UI/06-28-2021_072954_UTC/bank_marketing_test_data.csv', 'local_path': None})

## Train

Instantiate an `AutoMLConfig` object. This defines the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression or forecasting|
|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**blocked_models** | *List* of *strings* indicating machine learning algorithms for AutoML to avoid in this run. <br><br> Allowed values for **Classification**<br><i>LogisticRegression</i><br><i>SGD</i><br><i>MultinomialNaiveBayes</i><br><i>BernoulliNaiveBayes</i><br><i>SVM</i><br><i>LinearSVM</i><br><i>KNN</i><br><i>DecisionTree</i><br><i>RandomForest</i><br><i>ExtremeRandomTrees</i><br><i>LightGBM</i><br><i>GradientBoosting</i><br><i>TensorFlowDNN</i><br><i>TensorFlowLinearClassifier</i><br><br>Allowed values for **Regression**<br><i>ElasticNet</i><br><i>GradientBoosting</i><br><i>DecisionTree</i><br><i>KNN</i><br><i>LassoLars</i><br><i>SGD</i><br><i>RandomForest</i><br><i>ExtremeRandomTrees</i><br><i>LightGBM</i><br><i>TensorFlowLinearRegressor</i><br><i>TensorFlowDNN</i><br><br>Allowed values for **Forecasting**<br><i>ElasticNet</i><br><i>GradientBoosting</i><br><i>DecisionTree</i><br><i>KNN</i><br><i>LassoLars</i><br><i>SGD</i><br><i>RandomForest</i><br><i>ExtremeRandomTrees</i><br><i>LightGBM</i><br><i>TensorFlowLinearRegressor</i><br><i>TensorFlowDNN</i><br><i>Arima</i><br><i>Prophet</i>|
|**allowed_models** |  *List* of *strings* indicating machine learning algorithms for AutoML to use in this run. Same values listed above for **blocked_models** allowed for **allowed_models**.|
|**experiment_exit_score**| Value indicating the target for *primary_metric*. <br>Once the target is surpassed the run terminates.|
|**experiment_timeout_hours**| Maximum amount of time in hours that all iterations combined can take before the experiment terminates.|
|**enable_early_stopping**| Flag to enable early termination if the score is not improving in the short term.|
|**featurization**| 'auto' / 'off'  Indicator for whether featurization step should be done automatically or not. Note: If the input data is sparse, featurization cannot be turned on.|
|**n_cross_validations**|Number of cross validation splits.|
|**training_data**|Input dataset, containing both features and label column.|
|**label_column_name**|The name of the label column.|

**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)
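
Model names in the block list are plain strings, so a typo is easy to miss. The following sketch checks a list against the classification values from the table above before it is passed to the job settings. The helper `unknown_models` is hypothetical, and the allowed set is copied from the table rather than queried from the service, so it may lag behind what the service accepts.

```python
# Classification model names as listed in the table above
ALLOWED_CLASSIFICATION_MODELS = {
    "LogisticRegression", "SGD", "MultinomialNaiveBayes", "BernoulliNaiveBayes",
    "SVM", "LinearSVM", "KNN", "DecisionTree", "RandomForest",
    "ExtremeRandomTrees", "LightGBM", "GradientBoosting",
    "TensorFlowDNN", "TensorFlowLinearClassifier",
}


def unknown_models(names, allowed=ALLOWED_CLASSIFICATION_MODELS):
    """Return the names that are not in the allowed set (comparison is case-sensitive)."""
    return sorted(set(names) - allowed)


blocked = ["KNN", "LinearSVM"]
print("Unknown names:", unknown_models(blocked))
print("Unknown names:", unknown_models(["KNN", "LinearSvm"]))
```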

In [32]:
from azure.ml._restclient.v2020_09_01_preview.models import (
    GeneralSettings,
    DataSettings,
    LimitSettings,
    TrainingDataSettings,
    ValidationDataSettings,
    TestDataSettings,
    # FeaturizationSettings is imported from azure.ml.entities below; that import would shadow one made here
)

from azure.ml.entities._job.automl.training_settings import TrainingSettings
from azure.ml.entities._job.automl.featurization import FeaturizationSettings
from azure.ml.entities import AutoMLJob, ComputeConfiguration


compute_settings = ComputeConfiguration(target=cpu_cluster_name)

general_settings = GeneralSettings(
    task_type="classification",
    primary_metric= "auc_weighted",
    log_verbosity="Info")

limit_settings = LimitSettings(
    timeout=60,
    trial_timeout=5,
    max_concurrent_trials=4,
    enable_early_termination=True)

training_data_settings = TrainingDataSettings(
    dataset_arm_id="{}:{}".format(training_data.name, training_data.version)
)
validation_data_settings = ValidationDataSettings(
    dataset_arm_id="{}:{}".format(validation_data.name, validation_data.version),
)

data_settings = DataSettings(
    training_data=training_data_settings,
    target_column_name="y",
    validation_data=validation_data_settings
)

featurization_settings = FeaturizationSettings(
    featurization_config="auto"
)

training_settings = TrainingSettings(
    block_list_models=['KNN','LinearSVM'],
    enable_onnx_compatible_models=True,
)

extra_automl_settings = {"save_mlflow": True}

automl_job = AutoMLJob(
    compute=compute_settings,
    general_settings=general_settings,
    limit_settings=limit_settings,
    data_settings=data_settings,
    training_settings=training_settings,
    featurization_settings=featurization_settings,
    properties=extra_automl_settings,
)

automl_job

AutoMLJob({'name': '7e7a9a95-23d2-4ea7-8b03-8c034afcc48a', 'id': None, 'description': None, 'tags': {}, 'properties': {'save_mlflow': True}, 'base_path': './', 'type': 'automl_job', 'creation_context': None, 'experiment_name': 'classification-bank-marketing-all-features', 'status': None, 'interaction_endpoints': None, 'log_files': None, 'output': None, 'general_settings': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.GeneralSettings object at 0x7f1fbf666450>, 'data_settings': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.DataSettings object at 0x7f1fbf666510>, 'limit_settings': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.LimitSettings object at 0x7f1fbf666310>, 'forecasting_settings': None, 'training_settings': <azure.ml.entities._job.automl.training_settings.TrainingSettings object at 0x7f1fbf666250>, 'featurization_settings': <azure.ml.entities._job.automl.featurization.FeaturizationSettings object at 0x7f1fbf666210>, 'compute': {'instan

Submit the job by passing the `AutoMLJob` object to `client.jobs.create_or_update`. Depending on the data and the number of iterations, the job can run for a while. The returned object reports the job's current status.

In [33]:
created_job = client.jobs.create_or_update(automl_job)
created_job

AutoMLJob({'name': '7e7a9a95-23d2-4ea7-8b03-8c034afcc48a', 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap/jobs/7e7a9a95-23d2-4ea7-8b03-8c034afcc48a', 'description': None, 'tags': {}, 'properties': {'save_mlflow': 'True'}, 'base_path': './', 'type': 'automl_job', 'creation_context': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.SystemData object at 0x7f1fbe3e45d0>, 'experiment_name': 'classification-bank-marketing-all-features', 'status': 'NotStarted', 'interaction_endpoints': {'Tracking': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.JobEndpoint object at 0x7f1fbe3e4890>, 'Studio': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.JobEndpoint object at 0x7f1fbe3e4650>}, 'log_files': None, 'output': None, 'general_settings': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.GeneralSettings object at 0x7f1fbe3e4ad0>,

In [9]:
print("Studio URL: ", created_job.interaction_endpoints["Studio"].endpoint)

Studio URL:  https://ml.azure.com/runs/7e7a9a95-23d2-4ea7-8b03-8c034afcc48a?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/gasi_rg_centraleuap/workspaces/gasi_ws_centraleuap&tid=72f988bf-86f1-41af-91ab-2d7cd011db47


In [11]:
client.jobs.stream("7e7a9a95-23d2-4ea7-8b03-8c034afcc48a")

RunId: 7e7a9a95-23d2-4ea7-8b03-8c034afcc48a
Web View: https://ml.azure.com/runs/7e7a9a95-23d2-4ea7-8b03-8c034afcc48a?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/gasi_rg_centraleuap/workspaces/gasi_ws_centraleuap

Execution Summary
RunId: 7e7a9a95-23d2-4ea7-8b03-8c034afcc48a
Web View: https://ml.azure.com/runs/7e7a9a95-23d2-4ea7-8b03-8c034afcc48a?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/gasi_rg_centraleuap/workspaces/gasi_ws_centraleuap



Run the following cell to access previous runs. Uncomment the cell below and update the run_id.

In [None]:
# TODO: Wait for the remote run to complete
# remote_run.wait_for_completion()
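
Until a `wait_for_completion` equivalent is available, a simple polling loop can fill the gap. The sketch below is deliberately independent of the SDK: `get_status` is any callable that returns the job's current status string. The terminal state names are assumptions, as is the idea of supplying something like `lambda: client.jobs.get(job_name).status` as the callable.

```python
import time

# Assumed names for terminal states; adjust to what the service actually reports
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_completion(get_status, poll_seconds=30.0, timeout_seconds=3600.0,
                        sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("Job still in state {!r} after {} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake status source and no real sleeping
fake_statuses = iter(["NotStarted", "Running", "Running", "Completed"])
final_status = wait_for_completion(lambda: next(fake_statuses), sleep=lambda seconds: None)
print("Final status:", final_status)
```

Injecting `sleep` keeps the loop testable; in real use the default `time.sleep` applies.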

In [5]:
from mlflow.tracking import MlflowClient

# TODO: Use this run, as it has MLFlow model stored on the run
job_name = "AutoML_16f21d80-5bb0-4258-b452-da0361219f2b"
# job_name = created_job.name

mlflow_client = MlflowClient()
mlflow_parent_run = mlflow_client.get_run(job_name)

best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
print("Found best child run id: ", best_child_run_id)

best_run_customized = mlflow_client.get_run(best_child_run_id)
best_run_customized

Found best child run id:  AutoML_16f21d80-5bb0-4258-b452-da0361219f2b_0


<Run: data=<RunData: metrics={'AUC_macro': 0.9537056697431252,
 'AUC_micro': 0.9826734687571428,
 'AUC_weighted': 0.9537056697431252,
 'accuracy': 0.9237493929091792,
 'average_precision_score_macro': 0.8430327340366603,
 'average_precision_score_micro': 0.9834011986147951,
 'average_precision_score_weighted': 0.959992859687783,
 'balanced_accuracy': 0.7717056650246306,
 'f1_score_macro': 0.7936078137928921,
 'f1_score_micro': 0.9237493929091792,
 'f1_score_weighted': 0.9205655010653557,
 'log_loss': 0.16640483179348017,
 'matthews_correlation': 0.5909070642243929,
 'norm_macro_recall': 0.5434113300492611,
 'precision_score_macro': 0.8212770320032137,
 'precision_score_micro': 0.9237493929091792,
 'precision_score_weighted': 0.9188551905972525,
 'recall_score_macro': 0.7717056650246306,
 'recall_score_micro': 0.9237493929091792,
 'recall_score_weighted': 0.9237493929091792,
 'weighted_accuracy': 0.9617508999033835}, params={}, tags={'_aml_system_ComputeTargetStatus': '{"AllocationState
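
Some of the logged metrics are related to each other. As a worked check, `norm_macro_recall` is commonly defined as macro recall rescaled so that random guessing scores 0 and a perfect model scores 1: $(\text{recall\_macro} - 1/C) / (1 - 1/C)$ for $C$ classes. The sketch below applies that formula to the values printed above, assuming this definition and $C = 2$ (the label `y` takes the values yes and no).

```python
import math


def norm_macro_recall(recall_macro, n_classes):
    """Rescale macro recall so that chance level maps to 0 and perfect recall to 1."""
    chance = 1.0 / n_classes
    return (recall_macro - chance) / (1.0 - chance)


# Values copied from the metrics output above
recall_score_macro = 0.7717056650246306
logged_norm_macro_recall = 0.5434113300492611

computed = norm_macro_recall(recall_score_macro, n_classes=2)
print(computed)
print(math.isclose(computed, logged_norm_macro_recall, rel_tol=1e-9))
```

The same output also shows `balanced_accuracy` equal to `recall_score_macro`, which is expected since balanced accuracy is the unweighted mean of per-class recall.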

In [9]:
# This step requires AutoML runtime libraries to be installed
# !pip install azureml-train-automl-runtime

import mlflow.sklearn

fitted_model_customized = mlflow.sklearn.load_model("runs:/{}/outputs".format(best_run_customized.info.run_id))

## Transparency

View updated featurization summary

In [11]:
custom_featurizer = fitted_model_customized.named_steps['datatransformer']
df = custom_featurizer.get_featurization_summary()
pd.DataFrame(data=df)

Unnamed: 0,RawFeatureName,TypeDetected,Dropped,EngineeredFeatureCount,Transformations
0,age,Numeric,No,1,[MeanImputer]
1,duration,Numeric,No,1,[MeanImputer]
2,emp.var.rate,Numeric,No,1,[MeanImputer]
3,cons.price.idx,Numeric,No,1,[MeanImputer]
4,cons.conf.idx,Numeric,No,1,[MeanImputer]
5,euribor3m,Numeric,No,1,[MeanImputer]
6,nr.employed,Numeric,No,1,[MeanImputer]
7,job,Categorical,No,12,[StringCast-CharGramCountVectorizer]
8,marital,Categorical,No,4,[StringCast-CharGramCountVectorizer]
9,education,Categorical,No,8,[StringCast-CharGramCountVectorizer]


Set `is_user_friendly=False` to get a more detailed summary for the transforms being applied.

In [12]:
df = custom_featurizer.get_featurization_summary(is_user_friendly=False)
pd.DataFrame(data=df)

Unnamed: 0,RawFeatureName,TypeDetected,Dropped,EngineeredFeatureCount,Transformations,TransformationParams
0,age,Numeric,No,1,[MeanImputer],"{'Transformer1': {'Input': ['age'], 'Transform..."
1,duration,Numeric,No,1,[MeanImputer],"{'Transformer1': {'Input': ['duration'], 'Tran..."
2,emp.var.rate,Numeric,No,1,[MeanImputer],"{'Transformer1': {'Input': ['emp.var.rate'], '..."
3,cons.price.idx,Numeric,No,1,[MeanImputer],"{'Transformer1': {'Input': ['cons.price.idx'],..."
4,cons.conf.idx,Numeric,No,1,[MeanImputer],"{'Transformer1': {'Input': ['cons.conf.idx'], ..."
5,euribor3m,Numeric,No,1,[MeanImputer],"{'Transformer1': {'Input': ['euribor3m'], 'Tra..."
6,nr.employed,Numeric,No,1,[MeanImputer],"{'Transformer1': {'Input': ['nr.employed'], 'T..."
7,job,Categorical,No,12,[StringCast-CharGramCountVectorizer],"{'Transformer1': {'Input': ['job'], 'Transform..."
8,marital,Categorical,No,4,[StringCast-CharGramCountVectorizer],"{'Transformer1': {'Input': ['marital'], 'Trans..."
9,education,Categorical,No,8,[StringCast-CharGramCountVectorizer],"{'Transformer1': {'Input': ['education'], 'Tra..."
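
The featurization summary can be aggregated to see how many engineered features each detected type contributes. The following sketch does this on plain dictionaries for the ten rows displayed above. It covers only those rows; the real summary may contain more, and with pandas the same result could be obtained from the DataFrame directly.

```python
from collections import defaultdict

# (RawFeatureName, TypeDetected, EngineeredFeatureCount) for the rows shown above
summary_rows = [
    ("age", "Numeric", 1),
    ("duration", "Numeric", 1),
    ("emp.var.rate", "Numeric", 1),
    ("cons.price.idx", "Numeric", 1),
    ("cons.conf.idx", "Numeric", 1),
    ("euribor3m", "Numeric", 1),
    ("nr.employed", "Numeric", 1),
    ("job", "Categorical", 12),
    ("marital", "Categorical", 4),
    ("education", "Categorical", 8),
]

# Sum the engineered feature counts per detected type
engineered_by_type = defaultdict(int)
for _name, detected_type, engineered_count in summary_rows:
    engineered_by_type[detected_type] += engineered_count

total_engineered = sum(engineered_by_type.values())
print(dict(engineered_by_type), "total:", total_engineered)
```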


In [13]:
df = custom_featurizer.get_stats_feature_type_summary()
pd.DataFrame(data=df)

## Results

In [None]:
# Widgets don't exist yet in V2

# from azureml.widgets import RunDetails
# RunDetails(remote_run).show() 

### Retrieve the Best Model's explanation
Retrieve the explanation from the best_run which includes explanations for engineered features and raw features. Make sure that the run for generating explanations for the best model is completed.

In [20]:
# Wait for the best model explanation run to complete

model_explainability_run_id = job_name + "_" + "ModelExplain"
print(model_explainability_run_id)

model_explainability_run = mlflow_client.get_run(model_explainability_run_id)

# TODO: Wait for the remote run to complete
assert model_explainability_run.info.status == "FINISHED"

AutoML_16f21d80-5bb0-4258-b452-da0361219f2b_ModelExplain


#### Download engineered feature importance from artifact store
You can use ExplanationClient to download the engineered feature explanations from the artifact store of the best_run.

In [22]:
# TODO: This wouldn't work due to v1 Run dependencies for Explanation client

# client = ExplanationClient.from_run(best_run)
# engineered_explanations = client.download_model_explanation(raw=False)
# exp_data = engineered_explanations.get_feature_importance_dict()
# exp_data

#### Download raw feature importance from artifact store
You can use ExplanationClient to download the raw feature explanations from the artifact store of the best_run.

In [23]:
# TODO: This wouldn't work due to v1 Run dependencies for Explanation client

# client = ExplanationClient.from_run(best_run)
# raw_explanations = client.download_model_explanation(raw=True)
# exp_data = raw_explanations.get_feature_importance_dict()
# exp_data

### Retrieve the Best ONNX Model

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The Model includes the pipeline and any pre-processing.  Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

Set the parameter `return_onnx_model=True` to retrieve the best ONNX model, instead of the Python model.

In [44]:
# Search all child runs with a parent id
experiment = mlflow_client.get_experiment_by_name(experiment_name)
print(experiment)

###########################################################################################
# Steps:
# 1. Get all child runs for the parent run, filtered on runs that have ONNX resource on the properties, 
# & sorted on primary metrics
# 2. Take the head of that list - which will be the best ONNX model

# TODO: This filter should work - but currently, the child runs don't have this tag set.
# The single quotes around 'mlflow.parentRunId' are required due to a bug in AzureML MLFlow.
# query = "tags.'mlflow.parentRunId' = '{}'".format(mlflow_parent_run.info.run_id)
# print(query)
# results = mlflow_client.search_runs(experiment_ids=experiment.experiment_id, filter_string=query)
###########################################################################################

# print(results[["run_id", "params.child", "tags.mlflow.runName"]])
# results

<Experiment: artifact_location='', experiment_id='f1c9dd89-f798-42c2-b08c-75fafc18ad2b', lifecycle_stage='active', name='automl-classification-bmarketing-all', tags={}>
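
The two steps listed in the cell above (filter child runs that carry an ONNX resource, sort by the primary metric, take the head) can be sketched on plain records while the MLflow filter is not yet usable. The record layout, the property key `onnx_model_resource`, and the helper `best_onnx_run` are all hypothetical stand-ins for whatever the child runs actually expose.

```python
def best_onnx_run(child_runs, primary_metric, onnx_key="onnx_model_resource"):
    """Return the child run with the highest primary metric among runs with an ONNX resource."""
    candidates = [
        run for run in child_runs
        if onnx_key in run.get("properties", {})
        and primary_metric in run.get("metrics", {})
    ]
    # Higher is better for metrics such as AUC_weighted
    candidates.sort(key=lambda run: run["metrics"][primary_metric], reverse=True)
    return candidates[0] if candidates else None


# Made-up child runs for illustration only
child_runs = [
    {"run_id": "child_0", "metrics": {"AUC_weighted": 0.95}, "properties": {}},
    {"run_id": "child_1", "metrics": {"AUC_weighted": 0.91},
     "properties": {"onnx_model_resource": "{}"}},
    {"run_id": "child_2", "metrics": {"AUC_weighted": 0.94},
     "properties": {"onnx_model_resource": "{}"}},
]

best = best_onnx_run(child_runs, "AUC_weighted")
print(best["run_id"])
```

Note that `child_0` has the highest score but is skipped because it has no ONNX resource. For metrics where lower is better (for example `log_loss`), the sort order would need to be reversed.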


In [40]:
# Assuming we already have the Run ID of a run with a valid ONNX model, the code below shows how to retrieve and
# load it
import os

def download_outputs_via_mlflow_client(mlflow_client, run_id, path) -> str:
    """Download `path` (file or dir) from the run artifacts and return the local download path."""
    # download_artifacts places the artifact at <download_dir>/<path>
    download_dir = os.path.join("/tmp/artifact_downloads/{}".format(run_id), path)
    local_path = os.path.join(download_dir, path)
    if os.path.exists(local_path):
        # Return the same path as a fresh download would, so callers behave identically on re-runs
        print("Directory {} already exists. Skipping download.".format(local_path))
    else:
        # The destination directory must exist before downloading
        os.makedirs(download_dir, exist_ok=True)
        local_path = mlflow_client.download_artifacts(run_id, path, download_dir)
        print("Artifacts downloaded to: {}".format(local_path))
        print("Artifacts: {}".format(os.listdir(local_path)))
    return local_path

output_path = download_outputs_via_mlflow_client(mlflow_client, best_run_customized.info.run_id, "outputs")
onnx_model_path = os.path.join(output_path, "model.onnx")
onnx_resource_json_path = os.path.join(output_path, "model_onnx.json")

print("Downloaded onnx model to path: ", onnx_model_path)

Downloaded onnx model to path:  /tmp/artifact_downloads/AutoML_16f21d80-5bb0-4258-b452-da0361219f2b_0/outputs/outputs/model.onnx


In [None]:
# best_run, onnx_mdl = remote_run.get_output(return_onnx_model=True)

### Save the best ONNX model

In [None]:
# from azureml.automl.runtime.onnx_convert import OnnxConverter
# onnx_fl_path = "./best_model.onnx"
# OnnxConverter.save_onnx_model(onnx_mdl, onnx_fl_path)

### Predict with the ONNX model, using onnxruntime package

In [43]:
import json

def get_onnx_res(onnx_resource_json_path):
    with open(onnx_resource_json_path) as f:
        onnx_res = json.load(f)
    return onnx_res

# Loading an ONNX model with MLflow can be done via the following; however, we currently don't save the flavor
# information in the MLModel file.
# mlflow.onnx.load_model(onnx_model_path)

# Loading via OnnxConverter
from azureml.automl.runtime.onnx_convert import OnnxConverter
from azureml.automl.runtime.onnx_convert import OnnxInferenceHelper

fitted_onnx_model = OnnxConverter.load_onnx_model(onnx_model_path)

mdl_bytes = fitted_onnx_model.SerializeToString()
onnx_res = get_onnx_res(onnx_resource_json_path)

onnxrt_helper = OnnxInferenceHelper(mdl_bytes, onnx_res)

test_data_url = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_test.csv"
test_pdf = pd.read_csv(test_data_url)

pred_onnx, pred_prob_onnx = onnxrt_helper.predict(test_pdf)

print(pred_onnx)
print(pred_prob_onnx)

['yes' 'no' 'no' ... 'yes' 'no' 'no']
[[0.1733194  0.8266806 ]
 [0.9801272  0.01987278]
 [0.90318245 0.09681755]
 ...
 [0.31706262 0.6829374 ]
 [0.9961349  0.00386512]
 [0.99343926 0.00656074]]
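
The two outputs above are related: each row of `pred_prob_onnx` holds one probability per class, and the predicted label is the class with the highest probability. The sketch below reproduces this for the first three rows printed above. It assumes the columns are ordered `['no', 'yes']`, which is inferred from the printed output rather than read from the model metadata.

```python
# Class order is an assumption inferred from the printed predictions.
class_labels = ["no", "yes"]

# First three probability rows copied from the output above.
probabilities = [
    [0.1733194, 0.8266806],
    [0.9801272, 0.01987278],
    [0.90318245, 0.09681755],
]


def label_from_probabilities(row, labels):
    """Return the label whose probability is largest in `row`."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return labels[best_index]


predicted = [label_from_probabilities(row, class_labels) for row in probabilities]
print(predicted)
```

Under that assumption, the result matches the first three entries of `pred_onnx`.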


## Deploy

### Retrieve the Best Model

Below we select the best pipeline from our iterations.  The `get_output` method returns the best run and the fitted model. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

#### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [51]:
# Can't access properties right now - https://msdata.visualstudio.com/Vienna/_workitems/edit/1252056
model_name = "AutoML_model" # best_run.properties['model_name']

# Outputs are already downloaded for the best child run, as part of the ONNX step
# script_file_name = 'inference/score.py'
# best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'inference/score.py')

### Register the Fitted Model for Deployment
If neither `metric` nor `iteration` is specified in the `register_model` call, the iteration with the best primary metric is registered.

### Deploy the model as a Web Service on Azure Container Instance

In [54]:
from azure.ml.entities._assets import Model

# Note: This is not using MLFlow's deployment mechanism at all (flavors, scoring script / examples etc.)
# Create / register the model
# TODO: This doesn't track the lineage (run id) from which the model is created. 
azure_model = Model(name=model_name, version=1, local_path=os.path.join(output_path, "model.pkl"))
azure_model = client.models.create_or_update(azure_model)
azure_model

Uploading model.pkl: 100%|██████████| 396k/396k [00:00<00:00, 1.13MB/s]


Model({'is_anonymous': False, 'name': 'AutoML_model', 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap/models/AutoML_model/versions/1', 'description': None, 'tags': {}, 'properties': {'azureml.modelFormat': 'CUSTOM'}, 'base_path': './', 'creation_context': <azure.ml._restclient.v2021_03_01_preview.models._models_py3.SystemData object at 0x7f604e38f210>, 'version': 1, 'datastore': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_centraleuap/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_centraleuap/datastores/workspaceblobstore', 'path': 'LocalUpload/e708532d0eb3d88d2d9a71590073567f/model.pkl', 'local_path': None, 'utc_time_created': None, 'flavors': {}})

In [55]:
# Register via MLflow (not recommended)
mlflow_model = mlflow.register_model(
    "runs:/{}/outputs".format(best_run_customized.info.run_id),
    "mlflow_automl_model"
)

mlflow_model

Successfully registered model 'mlflow_automl_model'.
2021/07/06 10:47:01 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: mlflow_automl_model, version 1
Created version '1' of model 'mlflow_automl_model'.


<ModelVersion: creation_timestamp=1625593628865, current_stage='None', description='', last_updated_timestamp=1625593628865, name='mlflow_automl_model', run_id='AutoML_16f21d80-5bb0-4258-b452-da0361219f2b_0', run_link='', source='azureml://experiments/classification-bank-marketing-all-features/runs/AutoML_16f21d80-5bb0-4258-b452-da0361219f2b_0/artifacts/outputs', status='READY', status_message='', tags={}, user_id='', version='1'>

In [64]:
from azure.ml.entities import (
    Endpoint, ManagedOnlineEndpoint, Environment,
    CodeConfiguration, ManagedOnlineDeployment, ManualScaleSettings, Code,
)

inference_script_file_name = os.path.join(output_path, "scoring_file_v_1_0_0.py")
conda_environment_yaml = os.path.join(output_path, "conda.yaml")

print("Inference File: ", inference_script_file_name)
print("Conda Environment File: ", conda_environment_yaml)

assert os.path.exists(inference_script_file_name)
assert os.path.exists(conda_environment_yaml)


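# Prepare the deployment configuration
# Note: the run ID used here starts with "AutoML_", so run_id[:6] evaluates to "AutoML"
# and the names below are not unique per run.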
environment = Environment(
    name="environment-{}".format(best_run_customized.info.run_id[:6]),
    version=1,
    path=".",
    conda_file=conda_environment_yaml,
    docker_image="mcr.microsoft.com/azureml/intelmpi2018.3-ubuntu16.04:20210301.v1",
)

code = Code(
    name="environment-{}".format(best_run_customized.info.run_id[:6]),
    version=1,
    local_path=inference_script_file_name,
)
code_configuration = CodeConfiguration(
    code=code,
    scoring_script=inference_script_file_name
)

scale_settings = ManualScaleSettings(
    scale_type="Manual",
    min_instances=1,
    max_instances=2,
    instance_count=1
)
deployment = ManagedOnlineDeployment(
    name="deployment-{}".format(best_run_customized.info.run_id[:6]),
    model=azure_model,
    environment=environment,
    code_configuration=code_configuration,
    instance_type="Standard_F2s_v2",
    scale_settings=scale_settings,
)
online_endpoint = ManagedOnlineEndpoint(
    name="endpoint-{}".format(best_run_customized.info.run_id[:6]),
    deployments=[deployment],
    description="Demo model deployment",
    tags={"deployed_using": "sdkv2"}
)
##### Loading from YAML
# endpoint = Endpoint.load("/home/schrodinger/automl/Easy-AutoML-MLOps/notebooks/3-automl-remote-compute-run/endpoint.yml")

try:
    client.endpoints.create(online_endpoint)
except Exception as e:
    import traceback
    print("Deployment failed: ", str(e))
    traceback.print_exc()

Inference File:  /tmp/artifact_downloads/AutoML_16f21d80-5bb0-4258-b452-da0361219f2b_0/outputs/outputs/scoring_file_v_1_0_0.py
Conda Environment File:  /tmp/artifact_downloads/AutoML_16f21d80-5bb0-4258-b452-da0361219f2b_0/outputs/outputs/conda.yaml


The deployment request gasi_ws_centraleuap-endpoint-automl-7882909 was accepted. ARM deployment URI for reference: 
https://ms.portal.azure.com/#blade/HubsExtension/DeploymentDetailsBlade/overview/id/%2Fsubscriptions%2F381b38e9-9840-4719-a5a0-61d9585e1e91%2FresourceGroups%2Fgasi_rg_centraleuap%2Fproviders%2FMicrosoft.Resources%2Fdeployments%2Fgasi_ws_centraleuap-endpoint-automl-7882909
Registering environment version (environment-AutoML:1)  Done (3s)
Registering model version (AutoML_model:1)  Done (2s)
Creating endpoint endpoint-automl ..  Done (23s)


Deployment failed:  (DeploymentFailed) At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.


Traceback (most recent call last):
  File "/home/schrodinger/anaconda3/envs/dpv2sdk/lib/python3.7/site-packages/azure/core/polling/base_polling.py", line 482, in run
    self._poll()
  File "/home/schrodinger/anaconda3/envs/dpv2sdk/lib/python3.7/site-packages/azure/core/polling/base_polling.py", line 521, in _poll
    raise OperationFailed("Operation failed or canceled")
azure.core.polling.base_polling.OperationFailed: Operation failed or canceled

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<ipython-input-64-fce66328ac65>", line 58, in <module>
    client.endpoints.create(online_endpoint)
  File "/home/schrodinger/automl/sdk-cli-v2/src/azure-ml/azure/ml/_operations/endpoint_operations.py", line 214, in create
    return self._create_online_endpoint(internal_endpoint=endpoint, no_wait=no_wait)
  File "/home/schrodinger/automl/sdk-cli-v2/src/azure-ml/azure/ml/_operations/endpoint_operations.py", line 592, in _create_onl

### Get Logs from a Deployed Web Service

Gets logs from a deployed web service.

In [66]:
# Returns nothing at the moment (?)
client.endpoints.get_deployment_logs(online_endpoint.name, deployment.name, lines=200)

## Test

Now that the model is trained, run the test data through the trained model to get the predicted values.  This calls the ACI web service to do the prediction.

Note that the JSON passed to the ACI web service is an array of rows of data.  Each row should either be an array of values in the same order that was used for training or a dictionary where the keys are the same as the column names used for training.  The example below uses dictionary rows.
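
A minimal sketch of the two payload shapes described above, built with the standard `json` module. The column names and values are illustrative placeholders (assumptions), not the full set of training columns.

```python
import json

# Hypothetical subset of columns; a real request needs every training column.
columns = ["age", "job", "marital"]

# Dictionary rows: keys match the training column names.
dict_rows = [
    {"age": 41, "job": "technician", "marital": "married"},
    {"age": 29, "job": "student", "marital": "single"},
]

# Array rows: values listed in the same order as the training columns.
array_rows = [[row[name] for name in columns] for row in dict_rows]

dict_payload = json.dumps({"data": dict_rows})
array_payload = json.dumps({"data": array_rows})

print(dict_payload)
print(array_payload)
```

Dictionary rows are more verbose but do not depend on column order, which is likely why the example in this notebook uses them.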

In [None]:
# `array` is used below to convert the test labels to a NumPy array.
from numpy import array

In [None]:
X_test = test_dataset.drop_columns(columns=['y'])
y_test = test_dataset.keep_columns(columns=['y'], validate=True)
test_dataset.take(5).to_pandas_dataframe()

In [None]:
X_test = X_test.to_pandas_dataframe()
y_test = y_test.to_pandas_dataframe()

In [None]:
import json
import requests

X_test_json = X_test.to_json(orient='records')
data = '{"data": ' + X_test_json + '}'
headers = {'Content-Type': 'application/json'}

resp = requests.post(aci_service.scoring_uri, data, headers=headers)

y_pred = json.loads(json.loads(resp.text))['result']
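
The nested `json.loads` call above suggests that the response body is a JSON string whose content is itself JSON. This happens when a scoring script returns `json.dumps(...)` and the service serializes that return value a second time. The sketch below simulates such a body locally. The `simulated_body` value is made up for illustration and no service is called.

```python
import json

# Simulate a scoring script that returns a JSON string ...
script_return_value = json.dumps({"result": ["yes", "no", "no"]})
# ... which the web service then serializes a second time.
simulated_body = json.dumps(script_return_value)

# The first decode yields a str, the second yields the dictionary.
first_decode = json.loads(simulated_body)
second_decode = json.loads(first_decode)

print(type(first_decode).__name__)
print(second_decode["result"])
```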

In [None]:
actual = array(y_test)
actual = actual[:,0]
print(len(y_pred), " ", len(actual))

### Calculate metrics for the prediction

Now visualize the data as a confusion matrix that compares the predicted values against the actual values.
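
As a sketch of what a confusion matrix contains, the pure-Python function below counts label pairs using the same layout as scikit-learn: rows are actual classes and columns are predicted classes. The sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Made-up labels, for illustration only.
sample_actual = ["no", "no", "yes", "yes", "no", "yes"]
sample_predicted = ["no", "yes", "yes", "no", "no", "yes"]

matrix = confusion_counts(sample_actual, sample_predicted, ["no", "yes"])
print(matrix)
```

The diagonal cells count correct predictions; off-diagonal cells count the two kinds of error.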


In [None]:
%matplotlib notebook
import itertools

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

cf = confusion_matrix(actual, y_pred)
plt.imshow(cf, cmap=plt.cm.Blues, interpolation='nearest')
plt.colorbar()
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
class_labels = ['no', 'yes']
tick_marks = np.arange(len(class_labels))
plt.xticks(tick_marks, class_labels)
plt.yticks([-0.5, 0, 1, 1.5], ['', 'no', 'yes', ''])
# Plot the count inside each cell, using white text on dark cells
thresh = cf.max() / 2.
for i, j in itertools.product(range(cf.shape[0]), range(cf.shape[1])):
    plt.text(j, i, format(cf[i, j], 'd'), horizontalalignment='center',
             color='white' if cf[i, j] > thresh else 'black')
plt.show()
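
Summary metrics can be derived from the four cells of a binary confusion matrix. The sketch below treats `'yes'` as the positive class and uses made-up counts (assumptions, not results from this notebook).

```python
# Made-up counts laid out as [[TN, FP], [FN, TP]] with 'yes' as the positive class.
cells = [[50, 10], [5, 35]]
(tn, fp), (fn, tp) = cells

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted 'yes' that were actually 'yes'
recall = tp / (tp + fn)     # share of actual 'yes' that were predicted 'yes'

print("accuracy:", accuracy)
print("precision:", round(precision, 3))
print("recall:", recall)
```

On an imbalanced dataset, precision and recall for the minority class are usually more informative than accuracy alone.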

### Delete a Web Service

Deletes the specified web service.

In [None]:
client.endpoints.delete(name=online_endpoint.name)

## Acknowledgements

This Bank Marketing dataset is made available under the Creative Commons (CC0: Public Domain) License: https://creativecommons.org/publicdomain/zero/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: https://creativecommons.org/publicdomain/zero/1.0/. The dataset is available at: https://www.kaggle.com/janiobachmann/bank-marketing-dataset.

_**Acknowledgements**_
This dataset was originally made available in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/bank+marketing

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014