# AutoML on remote AML Compute (Porto Seguro's Safe Driving Prediction)

This notebook is a refactored version of the original local AutoML training notebook. It runs AutoML on a remote AML Compute cluster and uses AML Datasets for training instead of pandas DataFrames.

## Import Needed Packages

Import the packages needed for this notebook. Among the most widely used Python packages for machine learning and data analysis are [scikit-learn](https://scikit-learn.org/stable/), [pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started), and [numpy](https://numpy.org/). Together they cover data manipulation, numerical computing, and a broad set of clustering, regression, and classification algorithms, which makes them a good choice for data mining and data analysis.

In [18]:
import numpy as np
import pandas as pd
import joblib
from sklearn import metrics

## Check Azure ML SDK version

In [19]:
import azureml.core
print("This notebook was tested using version 1.24.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

This notebook was tested using version 1.24.0 of the Azure ML SDK
You are currently using version 0.1.0.56562636 of the Azure ML SDK


## Get Azure ML Workspace to use

In [20]:
from azureml.core import Workspace, Dataset

# Get the Workspace defined in the default config.json file
ws = Workspace.from_config()

output = {}
output["Subscription ID"] = ws.subscription_id
output["Workspace"] = ws.name
output["SKU"] = ws.sku
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

Unnamed: 0,Unnamed: 1
Subscription ID,102a16c3-37d3-48a8-9237-4c9b1e8e80e0
Workspace,cesardl-automl-westcentralus-ws
SKU,Basic
Resource Group,automlpmdemo
Location,westcentralus


### (Optional) Submit dataset file into DataStore (Azure Blob under the covers)

In [21]:
datastore = ws.get_default_datastore()
datastore.upload(src_dir='../../data/', 
                 target_path='Datasets/porto_seguro_safe_driver_prediction', overwrite=True, show_progress=True)

Uploading an estimated of 1 files
Uploading ../../data\README-Download-Dataset-And-Copy-Here.txt
Uploaded ../../data\README-Download-Dataset-And-Copy-Here.txt, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_b87b8aced9144acf9b70d5528d963a29

## Load data into Azure ML Dataset and Register into Workspace

In [22]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file in the HTTP URL
found = False
aml_dataset_name = "porto_seguro_safe_driver_prediction_train"

if aml_dataset_name in ws.datasets.keys():
    found = True
    aml_dataset = ws.datasets[aml_dataset_name]
    print("Dataset loaded from the Workspace")

if not found:
    # Create AML Dataset and register it into Workspace
    print("Dataset does not exist in the current Workspace. It will be imported and registered.")

    # Option A: Create AML Dataset from file in AML DataStore
    # datastore = ws.get_default_datastore()
    # aml_dataset = Dataset.Tabular.from_delimited_files(path=datastore.path('Datasets/porto_seguro_safe_driver_prediction/porto_seguro_safe_driver_prediction_train.csv'))
    # data_origin_type = 'AMLDataStore'

    # Option B: Create AML Dataset from file in HTTP URL
    data_url = 'https://azmlworkshopdata.blob.core.windows.net/safedriverdata/porto_seguro_safe_driver_prediction_train.csv'
    aml_dataset = Dataset.Tabular.from_delimited_files(data_url)
    data_origin_type = 'HttpUrl'

    print(aml_dataset)

    # Register Dataset in Workspace
    registration_method = 'SDK'  # or 'UI'
    aml_dataset = aml_dataset.register(workspace=ws,
                                       name=aml_dataset_name,
                                       description='Porto Seguro Safe Driver Prediction Train dataset file',
                                       tags={'Registration-Method': registration_method, 'Data-Origin-Type': data_origin_type},
                                       create_new_version=True)

    print("Dataset created from file and registered in the Workspace")


Dataset loaded from the Workspace


In [23]:
# Use a pandas DataFrame just to take a sneak peek at some data and the schema
data_df = aml_dataset.to_pandas_dataframe()
print(data_df.shape)
# print(data_df.describe())
data_df.head(5)

(595212, 59)


Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,...,3,1,1,3,0,0,0,1,1,0


## Split Data into Train and Test AML Tabular Datasets

Remote AML training requires AML Datasets; you cannot submit pandas DataFrames to remote AutoMLConfig runs.

Note that the AutoMLConfig below does not use the Test dataset. You provide a single training dataset, and AutoML either splits it internally into training and validation sets or uses cross-validation, depending on its size: the boundary is 20k rows, with cross-validation used for datasets smaller than that. You can also set the validation strategy explicitly.

The Test dataset will be used at the end of the notebook to manually calculate the quality metrics with a dataset not seen by AutoML training.
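
The default choice between a train/validation split and cross-validation described above can be sketched roughly as follows. This only illustrates the 20k-row rule as stated in this notebook; it is not AutoML's actual implementation. The function name `choose_validation_strategy`, its parameters, and the returned strings are hypothetical and are not part of the Azure ML SDK.

```python
from typing import Optional

# Boundary stated in this notebook (assumption: applied as "fewer than").
ROW_THRESHOLD = 20_000


def choose_validation_strategy(n_rows: int,
                               validation_size: Optional[float] = None,
                               n_cross_validations: Optional[int] = None) -> str:
    """Return a label describing how validation data would be obtained."""
    # Explicit user settings take precedence over the size-based default.
    if validation_size is not None:
        return f"train/validation split (validation_size={validation_size})"
    if n_cross_validations is not None:
        return f"{n_cross_validations}-fold cross-validation"
    # Size-based default: small datasets use cross-validation.
    if n_rows < ROW_THRESHOLD:
        return "cross-validation (default for small datasets)"
    return "train/validation split (default for large datasets)"


print(choose_validation_strategy(535_580, validation_size=0.1))
print(choose_validation_strategy(535_580))
print(choose_validation_strategy(5_000))
```

In this notebook the configuration sets `validation_size = 0.1` explicitly, so the first case applies.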

In [24]:
# Split in train/test datasets (Test=10%, Train=90%)

train_dataset, test_dataset = aml_dataset.random_split(0.9, seed=0)

# Use Pandas DF only to check the data
train_df = train_dataset.to_pandas_dataframe()
test_df = test_dataset.to_pandas_dataframe()
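
A random split by percentage is usually approximate rather than exact, which is why the row counts printed below are close to, but not exactly, 90% and 10% of the 595,212 rows. The sketch below shows one way such a split could behave, assuming each row is assigned independently with the given probability. That is an assumption made for illustration, not a description of how `random_split` is implemented; `approximate_random_split` is a hypothetical helper operating on stand-in row ids.

```python
import random
from typing import List, Sequence, Tuple


def approximate_random_split(rows: Sequence[int],
                             percentage: float,
                             seed: int = 0) -> Tuple[List[int], List[int]]:
    """Assign each row to the first part with probability `percentage`."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first: List[int] = []
    second: List[int] = []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second


rows = list(range(10_000))  # stand-in row ids, not the real dataset
train_rows, test_rows = approximate_random_split(rows, 0.9, seed=0)
print(len(train_rows), len(test_rows))  # close to 9000 / 1000, usually not exact
```

Reusing the same seed reproduces the same two parts, which mirrors the purpose of `seed=0` in the cell above.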

In [25]:
print(train_df.shape)
print(test_df.shape)

train_df.describe()

(535580, 59)
(59632, 59)


Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
count,535580.0,535580.0,535580.0,535580.0,535580.0,535580.0,535580.0,535580.0,535580.0,535580.0,...,535580.0,535580.0,535580.0,535580.0,535580.0,535580.0,535580.0,535580.0,535580.0,535580.0
mean,743573.3,0.036523,1.899649,1.358977,4.423797,0.416492,0.405812,0.393949,0.256823,0.163878,...,5.441572,1.441902,2.87181,7.538396,0.122327,0.627701,0.554033,0.28741,0.348887,0.153101
std,429492.5,0.187588,1.983804,0.66462,2.70066,0.493258,1.351799,0.488624,0.436881,0.370166,...,2.334558,1.202918,1.694975,2.746512,0.327664,0.483418,0.497072,0.452555,0.476619,0.360085
min,7.0,0.0,0.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,371661.8,0.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,...,4.0,1.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,743136.0,0.0,1.0,1.0,4.0,0.0,0.0,0.0,0.0,0.0,...,5.0,1.0,3.0,7.0,0.0,1.0,1.0,0.0,0.0,0.0
75%,1115584.0,0.0,3.0,2.0,6.0,1.0,0.0,1.0,1.0,0.0,...,7.0,2.0,4.0,9.0,0.0,1.0,1.0,1.0,1.0,0.0
max,1488027.0,1.0,7.0,4.0,11.0,1.0,6.0,1.0,1.0,1.0,...,19.0,10.0,13.0,23.0,1.0,1.0,1.0,1.0,1.0,1.0


In [26]:
train_df.head(5)

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,...,3,1,1,3,0,0,0,1,1,0


## Connect to Remote AML Compute (Existing AML cluster)

In [27]:
from azureml.core.compute import AmlCompute, ComputeTarget
# Define remote compute target to use
# Further docs on Remote Compute Target: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-remote

# Choose a name for your cluster.
amlcompute_cluster_name = "cpu-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing training cluster.')
    # Get existing cluster
    # Method 1:
    aml_remote_compute = cts[amlcompute_cluster_name]
    # Method 2:
    # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)

if not found:
    print('Creating a new training cluster...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS12_V2",  # for GPU, use "STANDARD_NC12"
                                                                # vm_priority='lowpriority',  # optional
                                                                max_nodes=5)
    # Create the cluster.
    aml_remote_compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
aml_remote_compute.wait_for_completion(show_output = True, min_node_count = 0, timeout_in_minutes = 20)

Creating a new training cluster...
Checking cluster status...
InProgress.
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [28]:
# For additional details of current AmlCompute status:
aml_remote_compute.get_status()

<azureml.core.compute.amlcompute.AmlComputeStatus at 0x271820d1a58>

## Train with Azure AutoML automatically searching for the 'best model' (Best algorithms and best hyper-parameters)

### List and select primary metric to drive the AutoML classification problem

In [29]:
from azureml.train import automl

# List of possible primary metrics is here:
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric
    
# Get a list of valid metrics for your given task
automl.utilities.get_primary_metrics('classification')

['accuracy',
 'precision_score_weighted',
 'average_precision_score_weighted',
 'AUC_weighted',
 'norm_macro_recall']

## Define AutoML Experiment settings

In [30]:
import logging
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='classification',
                             primary_metric='AUC_weighted',                           
                             training_data=train_dataset, 
                             validation_size = 0.1,
                             label_column_name="target",
                             blocked_models = ['LogisticRegression', 'ExtremeRandomTrees', 'RandomForest'], 
                             # allowed_models = ['LightGBM'],                          
                             enable_early_stopping= True,
                             iterations=10,                         
                             featurization= 'auto',   # (auto/off) All feature columns in this dataset are numbers, no need to featurize with AML Dataset. 
                             model_explainability=True,
                             enable_code_generation=True
                             )

# Explanation of Settings: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#configure-your-experiment-settings

# AutoMLConfig info on: 
# https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig

############################################
# A more extensive example AutoML configuration:

# You can provide additional settings as a **kwargs parameter for the AutoMLConfig object
# automl_settings = {
#       # "validation_data": validation_df,  # If you have an explicit validation set
#       # "n_cross_validations": 5, # If using cross validation
#       "experiment_exit_score": 0.64,
#       "max_cores_per_iteration": -1,
#       # "enable_batch_run": True,
#       # "save_mlflow": True,
#       "enable_code_generation": True
# }

# from azureml.train.automl import AutoMLConfig

# automl_config = AutoMLConfig(compute_target=aml_remote_compute,
#                              task='classification',
#                              primary_metric='AUC_weighted',                           
#                              training_data=train_dataset, # AML Dataset
#                              validation_size = 0.1,
#                              label_column_name="target",
#                              blocked_models = ['LogisticRegression', 'ExtremeRandomTrees', 'RandomForest'], 
#                              # allowed_models = ['LightGBM'],
#                              enable_voting_ensemble = True,
#                              enable_stack_ensemble = False,
#                              enable_early_stopping= True,
#                              iterations=10,
#                              max_concurrent_iterations=5,
#                              experiment_timeout_hours=1,                           
#                              featurization= 'auto',   # (auto/off) All feature columns in this dataset are numbers, no need to featurize with AML Dataset. 
#                              debug_log='automated_ml_errors.log',
#                              verbosity= logging.DEBUG,
#                              model_explainability=True,
#                              enable_onnx_compatible_models=False,
#                              **automl_settings
#                              )

## Run Experiment (on remote AML Compute) with multiple child runs under the covers

In [31]:
from azureml.core import Experiment

experiment_name = "SDK_Codegen_remote_porto_seguro"
print(experiment_name)

experiment = Experiment(workspace=ws, 
                        name=experiment_name)

import time
start_time = time.time()
            
parent_run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s minutes needed for running the whole Remote AutoML Experiment ---' % ((time.time() - start_time)/60))


SDK_Codegen_remote_porto_seguro
Submitting remote run.
No run_configuration provided, running on cpu-cluster with default configuration
Running on remote compute: cpu-cluster


Experiment,Id,Type,Status,Details Page,Docs Page
SDK_Codegen_remote_porto_seguro,AutoML_a353756b-8d35-4964-a1e8-7a6c548ad53f,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias towards one class.
+------------------------------+--------------------------------+--------------------------------------+
|Size of the smallest c

## Explore results with Widget

In [16]:
# Explore the results of automatic training with a Jupyter widget: https://docs.microsoft.com/en-us/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py
from azureml.widgets import RunDetails
RunDetails(parent_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [18]:
# Wait for the remote parent run to complete
parent_run.wait_for_completion()

{'runId': 'AutoML_8dcf09d7-acfa-4299-9aa2-307aa1e38c13',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-08-18T17:55:00.076809Z',
 'endTimeUtc': '2021-08-18T18:18:02.733732Z',
 'properties': {'num_iterations': '10',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0.1',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'cpu-cluster',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"02e66846-1b8c-452a-9a48-b1d46dbde4c0\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': 'False',
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-tensorboard": "1.6.0", "azureml-contrib-services": "1.6.0", "azureml-contrib-server": "1.6.0", "azureml-contrib-pipeline-steps": "1.6.0", "azureml-contrib-notebook": "1.6.0", "azureml-widgets": "0.1.0.43589424", "azureml-train": "0.1.0.435894

### Measure Parent Run Time needed for the whole AutoML process 

In [19]:
import time
from datetime import datetime

run_details = parent_run.get_details()

# Like: 2020-01-12T23:11:56.292703Z
end_time_utc_str = run_details['endTimeUtc'].split(".")[0]
start_time_utc_str = run_details['startTimeUtc'].split(".")[0]
timestamp_end = time.mktime(datetime.strptime(end_time_utc_str, "%Y-%m-%dT%H:%M:%S").timetuple())
timestamp_start = time.mktime(datetime.strptime(start_time_utc_str, "%Y-%m-%dT%H:%M:%S").timetuple())

parent_run_time = timestamp_end - timestamp_start
print('Run Timing: --- %s minutes needed for running the whole Remote AutoML Experiment ---' % (parent_run_time/60))

Run Timing: --- 23.033333333333335 minutes needed for running the whole Remote AutoML Experiment ---
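
The cell above drops the fractional seconds and converts through `time.mktime`, which interprets the values as local time. As an alternative sketch, the duration can be computed directly from the two UTC strings with `datetime` arithmetic. It assumes the timestamps always have the form shown in the run details (fractional seconds followed by `Z`); `parse_utc` and `run_minutes` are hypothetical helper names.

```python
from datetime import datetime, timezone

# Assumed timestamp layout, e.g. '2021-08-18T17:55:00.076809Z'
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_utc(timestamp: str) -> datetime:
    """Parse a run-details timestamp string as a timezone-aware UTC datetime."""
    return datetime.strptime(timestamp, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def run_minutes(start_time_utc: str, end_time_utc: str) -> float:
    """Elapsed minutes between two UTC timestamp strings."""
    elapsed = parse_utc(end_time_utc) - parse_utc(start_time_utc)
    return elapsed.total_seconds() / 60


# Values copied from the run details printed earlier in this notebook.
minutes = run_minutes("2021-08-18T17:55:00.076809Z", "2021-08-18T18:18:02.733732Z")
print(f"{minutes:.2f} minutes")
```

Because fractional seconds are kept, the result differs slightly from the truncated value printed above.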


### Creating ModelProxy for submitting prediction runs to the training environment.
We will create a ModelProxy for the best child run, which lets us submit a run that performs the prediction in the training environment. Unlike the local client, which may have different versions of some libraries, the training environment already has all the libraries compatible with the model.

In [20]:
from azureml.train.automl.model_proxy import ModelProxy

best_run = parent_run.get_best_child()
# best_run = parent_run.get_best_child(metric = "accuracy")

best_model_proxy = ModelProxy(best_run, aml_remote_compute)

Class ModelProxy: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [21]:
y_test = test_dataset.keep_columns('target')
test_data_no_label = test_dataset.drop_columns('target')

test_data_no_label_df = test_data_no_label.to_pandas_dataframe()
print(test_data_no_label_df.shape)

(59657, 58)


In [22]:
import time
start_time = time.time()
            
y_pred_test = best_model_proxy.predict(test_data_no_label)

print('Manual run timing: --- %s minutes needed for Predicting with ModelProxy ---' % ((time.time() - start_time)/60))

y_pred_test

Manual run timing: --- 3.0498180707295734 minutes needed for Predicting with ModelProxy ---


{
  "source": [
    "('workspaceblobstore', 'ExperimentRun/dcid.SDK_Codegen_remote_porto_seguro_1629314660_659048c0/predictions/predictions.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ]
}

### Show hyperparameters
Show the model pipeline used for the best run with its hyperparameters.

In [71]:
import json

run_properties = json.loads(best_run.get_details()['properties']['pipeline_script'])
print(json.dumps(run_properties, indent=1))

{
 "pipeline_id": "__AutoML_Ensemble__",
 "objects": [
  {
   "module": "azureml.train.automl.ensemble",
   "class_name": "Ensemble",
   "spec_class": "sklearn",
   "param_args": [],
   "param_kwargs": {
    "automl_settings": "{'task_type':'classification','primary_metric':'AUC_weighted','verbosity':10,'ensemble_iterations':15,'is_timeseries':False,'name':'SDK_remote_porto_seguro_driver_pred','compute_target':'cpu-cluster','subscription_id':'381b38e9-9840-4719-a5a0-61d9585e1e91','region':'eastus2euap','spark_service':None}",
    "ensemble_run_id": "AutoML_7fddc313-1f37-48dd-a117-5c10c1d7a963_14",
    "experiment_name": "SDK_remote_porto_seguro_driver_pred",
    "workspace_name": "cesardl-automl-eastus2euap-ws",
    "subscription_id": "381b38e9-9840-4719-a5a0-61d9585e1e91",
    "resource_group_name": "cesardl-automl-eastus2euap-resgrp"
   }
  }
 ]
}


## Retrieve the 'Best' Model

In [46]:
best_run, fitted_model = parent_run.get_output()
print(best_run)
print('--------')
print(fitted_model)

Run(Experiment: SDK_remote_porto_seguro_driver_pred,
Id: AutoML_7fddc313-1f37-48dd-a117-5c10c1d7a963_14,
Type: azureml.scriptrun,
Status: Completed)
--------
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                num_leaves=161,
                                                                                                objective=None,
          

#### Retrieve METRICS for All Child Runs
You can also use SDK methods to fetch all the child runs and see individual metrics that we log.

In [47]:
children = list(parent_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    # Named run_metrics to avoid shadowing the sklearn `metrics` module imported earlier
    run_metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = run_metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
recall_score_macro,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
f1_score_macro,0.49,0.49,0.49,0.49,0.49,0.49,0.49,0.49,0.49,0.49,0.49,0.49,0.49,0.49,0.49
average_precision_score_macro,0.52,0.52,0.52,0.52,0.52,0.52,0.52,0.52,0.52,0.52,0.52,0.52,0.52,0.5,0.52
precision_score_weighted,0.95,0.93,0.93,0.93,0.93,0.93,0.93,0.93,0.93,0.93,0.93,0.93,0.93,0.93,0.93
AUC_micro,0.97,0.97,0.97,0.97,0.97,0.97,0.97,0.97,0.97,0.97,0.97,0.97,0.97,0.96,0.97
AUC_weighted,0.63,0.63,0.61,0.63,0.63,0.61,0.63,0.64,0.64,0.64,0.63,0.62,0.61,0.52,0.64
matthews_correlation,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
accuracy,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96
recall_score_micro,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96
precision_score_macro,0.82,0.48,0.48,0.48,0.48,0.48,0.48,0.48,0.48,0.48,0.48,0.48,0.48,0.48,0.48


## Retrieve the Best Model's explanation
Retrieve the explanation from the best_run which includes explanations for engineered features and raw features. Make sure that the run for generating explanations for the best model is completed.

In [48]:
# Wait for the best model explanation run to complete
from azureml.core.run import Run

# AutoML_525e9be6-0cb8-4750-9c4b-b8518636b0ce_ModelExplain
model_explainability_run_id = parent_run.id + "_" + "ModelExplain"
print(model_explainability_run_id)

model_explainability_run = Run(experiment=experiment, run_id=model_explainability_run_id)
model_explainability_run.wait_for_completion()


AutoML_7fddc313-1f37-48dd-a117-5c10c1d7a963_ModelExplain


{'runId': 'AutoML_7fddc313-1f37-48dd-a117-5c10c1d7a963_ModelExplain',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-03-27T01:33:55.876623Z',
 'endTimeUtc': '2021-03-27T01:36:29.563341Z',
 'properties': {'azureml.runsource': 'automl',
  'parentRunId': 'AutoML_7fddc313-1f37-48dd-a117-5c10c1d7a963_14',
  '_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '7432bb83-4a3a-4e7f-9f15-a9f88ab6d797',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json',
  'dependencies_versions': '{"azureml-train-automl-runtime": "1.24.0", "azureml-train-automl-client": "1.24.0", "azureml-telemetry": "1.24.0", "azureml-pipeline-core": "1.24.0", "azureml-model-management-sdk": "1.0.1b6.post1", "azureml-mlflow": "1.24.0", "azureml-interpret": "1.24.0", "azureml-defaults": "1.24.0", "azureml-dataset-runtime": "1.24.0", "azureml-dataprep": "2.11.2", "azureml-dataprep-rslex": "1.9.1", "azureml-dataprep-native": "30.

### Download and Print engineered feature importance from artifact store
You can use ExplanationClient to download the engineered feature explanations from the artifact store of the best_run.

In [49]:
from azureml.interpret import ExplanationClient

client = ExplanationClient.from_run(best_run)
engineered_explanations = client.download_model_explanation(raw=False)
exp_data = engineered_explanations.get_feature_importance_dict()
exp_data

{'ps_car_13_MeanImputer': 0.12413941787718004,
 'ps_ind_05_cat_CharGramCountVectorizer_0': 0.07218309962514628,
 'ps_reg_03_MeanImputer': 0.06728167953002182,
 'ps_ind_17_bin_ModeCatImputer_LabelEncoder': 0.05378192369925554,
 'ps_ind_06_bin_ModeCatImputer_LabelEncoder': 0.041771263979577034,
 'ps_reg_02_MeanImputer': 0.03721567809935376,
 'ps_car_01_cat_CharGramCountVectorizer_7': 0.034438749128948766,
 'ps_reg_01_MeanImputer': 0.03347948966577743,
 'ps_ind_07_bin_ModeCatImputer_LabelEncoder': 0.03342231843436714,
 'ps_ind_03_CharGramCountVectorizer_3': 0.032890840542521824,
 'ps_ind_16_bin_ModeCatImputer_LabelEncoder': 0.03270286142702199,
 'ps_ind_03_CharGramCountVectorizer_2': 0.028811311358928966,
 'ps_car_07_cat_CharGramCountVectorizer_1': 0.028351156858802063,
 'ps_car_03_cat_CharGramCountVectorizer_-1': 0.027110927328239245,
 'ps_ind_04_cat_CharGramCountVectorizer_0': 0.019418141091614546,
 'ps_car_15_MeanImputer': 0.0185750273162823,
 'ps_ind_03_CharGramCountVectorizer_4': 0.0

### Download raw feature importance from artifact store
You can use ExplanationClient to download the raw feature explanations from the artifact store of the best_run.

In [50]:
client = ExplanationClient.from_run(best_run)
raw_explanations = client.download_model_explanation(raw=True)
exp_data = raw_explanations.get_feature_importance_dict()
exp_data

{'ps_car_13': 0.12413941787718004,
 'ps_ind_03': 0.0955921606782005,
 'ps_ind_05_cat': 0.07969469607480711,
 'ps_ind_15': 0.07101131128282193,
 'ps_reg_03': 0.06728167953002182,
 'ps_ind_17_bin': 0.05378192369925554,
 'ps_car_01_cat': 0.053332432963143106,
 'ps_car_03_cat': 0.04350569733129902,
 'ps_ind_06_bin': 0.041771263979577034,
 'ps_reg_02': 0.03721567809935376,
 'ps_reg_01': 0.03347948966577743,
 'ps_ind_07_bin': 0.03342231843436714,
 'ps_ind_16_bin': 0.03270286142702199,
 'ps_car_09_cat': 0.030894487383221918,
 'ps_car_07_cat': 0.029925740176601694,
 'ps_ind_01': 0.02287323677321861,
 'ps_ind_04_cat': 0.0223085987664737,
 'ps_car_04_cat': 0.021950455296653647,
 'ps_car_15': 0.0185750273162823,
 'ps_car_06_cat': 0.01676893513373356,
 'ps_ind_09_bin': 0.012858081592979598,
 'ps_car_11_cat': 0.012744837576436824,
 'ps_ind_02_cat': 0.01189136104865129,
 'ps_ind_08_bin': 0.01033079680748262,
 'ps_car_11': 0.009410274755862147,
 'ps_calc_02': 0.008395891161325528,
 'ps_calc_11': 0.00

## Register Model in Workspace model registry

In [51]:

registered_model = parent_run.register_model(model_name='porto-seg-automl-remote-compute', 
                                           description='Porto Seguro Model from plain AutoML in remote AML compute')

print(parent_run.model_id)
registered_model

porto-seg-automl-remote-compute


Model(workspace=Workspace.create(name='cesardl-automl-eastus2euap-ws', subscription_id='381b38e9-9840-4719-a5a0-61d9585e1e91', resource_group='cesardl-automl-eastus2euap-resgrp'), name=porto-seg-automl-remote-compute, id=porto-seg-automl-remote-compute:3, version=3, tags={}, properties={})

## See files associated with the 'Best run'

In [52]:
print(best_run.get_file_names())

# best_run.download_file('azureml-logs/70_driver_log.txt')

['accuracy_table', 'automl_driver.py', 'azureml-logs/55_azureml-execution-tvmps_befcca04d7bfa028d869a119734221745338430c01ead02580d6823dbdea3861_d.txt', 'azureml-logs/65_job_prep-tvmps_befcca04d7bfa028d869a119734221745338430c01ead02580d6823dbdea3861_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_befcca04d7bfa028d869a119734221745338430c01ead02580d6823dbdea3861_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'confusion_matrix', 'explanation/f1bcffd2/classes.interpret.json', 'explanation/f1bcffd2/expected_values.interpret.json', 'explanation/f1bcffd2/features.interpret.json', 'explanation/f1bcffd2/global_names/0.interpret.json', 'explanation/f1bcffd2/global_rank/0.interpret.json', 'explanation/f1bcffd2/global_values/0.interpret.json', 'explanation/f1bcffd2/local_importance_values.interpret.json', 'explanation/f1bcffd2/local_importance_viz.interpret.json', 'explanation/f1bcffd2/per_class_names/0.interpret.json', 'explanation/f1bcffd2

## Make Predictions and calculate metrics

### Prep Test Data: Extract X values (feature columns) and y values (labels) from the test dataset for predicting

In [53]:
import pandas as pd

x_test_df = test_df.copy()

if 'target' in x_test_df.columns:
    y_test_df = x_test_df.pop('target')

print(test_df.shape)
print(x_test_df.shape)
print(y_test_df.shape)

(59291, 59)
(59291, 58)
(59291,)


In [54]:
y_test_df.describe()

count   59291.00
mean        0.04
std         0.19
min         0.00
25%         0.00
50%         0.00
75%         0.00
max         1.00
Name: target, dtype: float64

### Make predictions in bulk

In [55]:
# Try the best model making predictions with the test dataset
y_predictions = fitted_model.predict(x_test_df)

print(y_predictions.shape)
print('30 predictions: ')
print(y_predictions[:30])

(59291,)
30 predictions: 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


### Get all the predictions' probabilities needed to calculate ROC AUC

In [56]:
class_probabilities = fitted_model.predict_proba(x_test_df)
print(class_probabilities.shape)

print('Some class probabilities...: ')
print(class_probabilities[:3])

print('Probabilities for class 1:')
print(class_probabilities[:,1])

print('Probabilities for class 0:')
print(class_probabilities[:,0])

(59291, 2)
Some class probabilities...: 
[[0.94909273 0.05090727]
 [0.97383981 0.02616019]
 [0.97830645 0.02169355]]
Probabilities for class 1:
[0.05090727 0.02616019 0.02169355 ... 0.03003549 0.05621396 0.04049687]
Probabilities for class 0:
[0.94909273 0.97383981 0.97830645 ... 0.96996451 0.94378604 0.95950313]


## Evaluate Model

Evaluating performance is an essential task in machine learning. Because this is a binary classification problem, the data scientist elected to use the AUC-ROC metric: the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate against the false positive rate across decision thresholds, and the AUC summarizes it as a single number, which makes it one of the most widely used metrics for evaluating a classification model's performance.

<img src="https://www.researchgate.net/profile/Oxana_Trifonova/publication/276079439/figure/fig2/AS:614187332034565@1523445079168/An-example-of-ROC-curves-with-good-AUC-09-and-satisfactory-AUC-065-parameters.png"
     alt="Example ROC curves with good (AUC 0.9) and satisfactory (AUC 0.65) parameters"
     style="float: left; margin-right: 12px; width: 320px; height: 239px;" />
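One way to build intuition for ROC AUC is to read it as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly with the standard library. `pairwise_auc` is a hypothetical helper written for illustration, and the toy labels and scores are invented, not taken from the Porto Seguro data.

```python
from itertools import product

def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

# Toy data, invented for illustration
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

The pairwise loop is quadratic in the number of rows, so it is only suitable for small examples; on the full test set, `roc_auc_score` from scikit-learn is the practical choice.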

### Calculate the ROC AUC with probabilities vs. the Test Dataset

In [57]:
from sklearn.metrics import roc_auc_score

print('ROC AUC *method 1*:')
print(roc_auc_score(y_test_df, class_probabilities[:,1]))

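print('ROC AUC Weighted:')
# Note: scikit-learn ignores `average` when y_true is binary,
# so this prints the same value as method 1.
print(roc_auc_score(y_test_df, class_probabilities[:,1], average='weighted'))
# For comparison, AUC with plain LightGBM was: 0.6374553321494826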

ROC AUC *method 1*:
0.6414290862854027
ROC AUC Weighted:
0.6414290862854027


In [58]:
# print('ROC AUC *method 2*:')
# fpr, tpr, thresholds = metrics.roc_curve(y_test_df, class_probabilities[:,1])
# metrics.auc(fpr, tpr)

### Calculate the Accuracy with predictions vs. the Test Dataset

In [59]:
print(y_test_df.shape)
print(y_predictions.shape)

(59291,)
(59291,)


In [60]:
from sklearn.metrics import accuracy_score

print('Accuracy:')
print(accuracy_score(y_test_df, y_predictions))


Accuracy:
0.9630298021622169
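Accuracy can look strong on imbalanced data. The `describe()` output earlier shows a target mean of about 0.04, so a model that always predicts 0 would already be correct on roughly 96% of rows. The sketch below illustrates this with synthetic labels (100 rows, 4 positives), not the real test set; `accuracy` and `positive_recall` are hypothetical helpers.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def positive_recall(y_true, y_pred):
    """Fraction of actual positives that were predicted as positive."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)

# Synthetic labels with 4% positives (illustration only)
toy_labels = [1] * 4 + [0] * 96

# A "model" that never predicts the positive class
always_zero = [0] * len(toy_labels)

baseline_accuracy = accuracy(toy_labels, always_zero)
baseline_recall = positive_recall(toy_labels, always_zero)

print(baseline_accuracy)  # high accuracy...
print(baseline_recall)    # ...while finding none of the positives
```

For this reason, the accuracy figure above is best read alongside ROC AUC rather than on its own.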


### Load model in memory

#### (Option A: Load from model .pkl file)

In [61]:
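# Download the model file from the best run, then load it into memory
import joblib

# Write the file explicitly to ./model.pkl so the load path below is unambiguous
best_run.download_file('outputs/model.pkl', output_file_path='model.pkl')

fitted_model = joblib.load('model.pkl')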
print(fitted_model)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                num_leaves=161,
                                                                                                objective=None,
                                                                                                random_state=None,
                                                     

#### (Option B: Load from model registry in Workspace)

In [62]:
from azureml.core.model import Model

# Not used later; shown only to inspect the registered model definition
registered_model_definition = Model(ws, 'porto-seg-automl-remote-compute')
print(registered_model_definition)

Model(workspace=Workspace.create(name='cesardl-automl-eastus2euap-ws', subscription_id='381b38e9-9840-4719-a5a0-61d9585e1e91', resource_group='cesardl-automl-eastus2euap-resgrp'), name=porto-seg-automl-remote-compute, id=porto-seg-automl-remote-compute:3, version=3, tags={}, properties={})


In [63]:
# Load model from model registry in Workspace
from azureml.core.model import Model

model_path = Model.get_model_path('porto-seg-automl-remote-compute', _workspace=ws)
fitted_model = joblib.load(model_path)
print(fitted_model)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                num_leaves=161,
                                                                                                objective=None,
                                                                                                random_state=None,
                                                     

## Try model inference with hardcoded input data

In [64]:
# Data from Dataframe for comparison with hardcoded data
# x_test_df.head(1)

In [65]:
# Data from Dataframe for comparison with hardcoded data
# print(x_test_df.head(1).values)
# print(x_test_df.head(1).columns)

In [66]:
import json

raw_data = json.dumps({
     'data': [[20,2,1,3,1,0,0,1,0,0,0,0,0,0,0,8,1,0,0,0.6,0.1,0.61745445,6,1,-1,0,1,11,1,1,0,1,99,2,0.31622777,0.6396829,0.36878178,3.16227766,0.2,0.6,0.5,2,2,8,1,8,3,10,3,0,0,10,0,1,0,0,1,0]],
     'method': 'predict'  # If you have a classification model, you can get probabilities by changing this to 'predict_proba'.
 })

print(json.loads(raw_data)['data'])

numpy_data = np.array(json.loads(raw_data)['data'])

df_data = pd.DataFrame(data=numpy_data, columns=['id', 'ps_ind_01', 'ps_ind_02_cat', 'ps_ind_03', 'ps_ind_04_cat',
                                               'ps_ind_05_cat', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin',
                                               'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin',
                                               'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15', 'ps_ind_16_bin',
                                               'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03',
                                               'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat',
                                               'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat',
                                               'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat', 'ps_car_11',
                                               'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15', 'ps_calc_01',
                                               'ps_calc_02', 'ps_calc_03', 'ps_calc_04', 'ps_calc_05', 'ps_calc_06',
                                               'ps_calc_07', 'ps_calc_08', 'ps_calc_09', 'ps_calc_10', 'ps_calc_11',
                                               'ps_calc_12', 'ps_calc_13', 'ps_calc_14', 'ps_calc_15_bin',
                                               'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin',
                                               'ps_calc_20_bin'])
df_data

[[20, 2, 1, 3, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 8, 1, 0, 0, 0.6, 0.1, 0.61745445, 6, 1, -1, 0, 1, 11, 1, 1, 0, 1, 99, 2, 0.31622777, 0.6396829, 0.36878178, 3.16227766, 0.2, 0.6, 0.5, 2, 2, 8, 1, 8, 3, 10, 3, 0, 0, 10, 0, 1, 0, 0, 1, 0]]


Unnamed: 0,id,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,20.0,2.0,1.0,3.0,1.0,0.0,0.0,1.0,0.0,0.0,...,3.0,0.0,0.0,10.0,0.0,1.0,0.0,0.0,1.0,0.0


In [67]:
# Get predictions from the model
y_predictions = fitted_model.predict(df_data)  # Alternatively, pass x_test_df.head(1)
y_predictions  # Returns array([0]) or array([1]) depending on the predicted class

array([0])
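The JSON payload above carries a `'method'` field, but the notebook calls `predict` directly and never reads it. As a rough sketch of how a scoring function might honor that field, the example below dispatches to `predict` or `predict_proba` based on the payload. `score`, `StubModel`, and `ALLOWED_METHODS` are all hypothetical names introduced for illustration; the stub stands in for `fitted_model` and returns fixed values.

```python
import json

class StubModel:
    """Stand-in for fitted_model; returns fixed outputs (illustration only)."""

    def predict(self, rows):
        return [0 for _ in rows]

    def predict_proba(self, rows):
        return [[0.95, 0.05] for _ in rows]

# Only these model methods may be requested by a payload
ALLOWED_METHODS = {"predict", "predict_proba"}

def score(raw_json, model):
    """Parse the payload and call the requested method on the model."""
    payload = json.loads(raw_json)
    method = payload.get("method", "predict")  # default to hard labels
    if method not in ALLOWED_METHODS:
        raise ValueError(f"Unsupported method: {method!r}")
    return getattr(model, method)(payload["data"])

request = json.dumps({"data": [[1, 2, 3]], "method": "predict_proba"})
result = score(request, StubModel())
print(result)
```

Restricting the method name to an explicit allow-list avoids calling arbitrary attributes on the model from untrusted input.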

## Retrieve the Best ONNX Model
Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any preprocessing. Overloads of `get_output` allow you to retrieve the best run and fitted model for any logged metric or for a particular iteration.

Set the parameter `return_onnx_model=True` to retrieve the best ONNX model instead of the Python model.

In [68]:
best_run, onnx_mdl = parent_run.get_output(return_onnx_model=True)

### Save the best ONNX model to local path

In [69]:
from azureml.automl.runtime.onnx_convert import OnnxConverter
onnx_fl_path = "./best_model.onnx"
OnnxConverter.save_onnx_model(onnx_mdl, onnx_fl_path)

### Predict with the ONNX model, using onnxruntime package

In [70]:
import sys
import json
from azureml.automl.core.onnx_convert import OnnxConvertConstants
from azureml.train.automl import constants

# The inference helper only runs on Python versions below the incompatible version
python_version_compatible = sys.version_info < OnnxConvertConstants.OnnxIncompatiblePythonVersion

import onnxruntime
from azureml.automl.runtime.onnx_convert import OnnxInferenceHelper

def get_onnx_res(run):
    res_path = 'onnx_resource.json'
    run.download_file(name=constants.MODEL_RESOURCE_PATH_ONNX, output_file_path=res_path)
    with open(res_path) as f:
        onnx_res = json.load(f)
    return onnx_res

if python_version_compatible:
    test_df = test_dataset.to_pandas_dataframe()
    mdl_bytes = onnx_mdl.SerializeToString()
    onnx_res = get_onnx_res(best_run)

    onnxrt_helper = OnnxInferenceHelper(mdl_bytes, onnx_res)
    pred_onnx, pred_prob_onnx = onnxrt_helper.predict(test_df)

    print(pred_onnx)
    print(pred_prob_onnx)
else:
    print('Please use Python version 3.6 or 3.7 to run the inference helper.')

[0 0 0 ... 0 0 0]
[[0.9490927  0.05090731]
 [0.9738398  0.02616016]
 [0.9783065  0.02169357]
 ...
 [0.96996456 0.03003547]
 [0.94378614 0.05621394]
 [0.9595032  0.04049686]]
