# AutoML on remote AML Compute (Porto Seguro's Safe Driving Prediction)

This notebook is a refactoring of the original local AutoML training notebook: it runs AutoML on a remote AML Compute cluster.
It also uses AML Datasets for training instead of pandas DataFrames.

## Import Needed Packages

Import the packages needed for this notebook. Among the most widely used Python packages for machine learning are [scikit-learn](https://scikit-learn.org/stable/), [pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started), and [numpy](https://numpy.org/). Together they provide data structures for tabular and numeric data, plus a wide range of clustering, regression, and classification algorithms, which makes them a good choice for data mining and data analysis.

In [1]:
import numpy as np
import pandas as pd
import joblib
from sklearn import metrics

## Check Azure ML SDK version

In [2]:
import azureml.core
print("This notebook was tested using version 1.6.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

This notebook was tested using version 1.6.0 of the Azure ML SDK
You are currently using version 1.20.0 of the Azure ML SDK
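
The output above shows that the tested and installed SDK versions differ (1.6.0 vs. 1.20.0). A minimal sketch of how such a comparison could be made explicit, assuming plain dotted version strings. `parse_version`, `is_newer`, and `TESTED_VERSION` are hypothetical helper names, not part of the Azure ML SDK. Comparing tuples of integers avoids the pitfall of comparing the strings directly, where `"1.20.0" < "1.6.0"`.

```python
import re

TESTED_VERSION = "1.6.0"  # version quoted in the cell above

def parse_version(version):
    """Return the leading numeric components of a dotted version string as a tuple of ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric suffixes such as 'post1'
        parts.append(int(match.group()))
    return tuple(parts)

def is_newer(installed, tested=TESTED_VERSION):
    """True when the installed version is strictly newer than the tested one."""
    return parse_version(installed) > parse_version(tested)

if is_newer("1.20.0"):
    print("Installed SDK is newer than the tested version; some APIs may behave differently.")
```

In the notebook itself, `azureml.core.VERSION` would be passed in place of the literal `"1.20.0"`.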


##  Get Azure ML Workspace to use

In [None]:
from azureml.core import Workspace, Dataset

# Get the Workspace defined in the default config.json file
ws = Workspace.from_config()

### (Optional) Submit dataset file into DataStore (Azure Blob under the covers)

In [4]:
datastore = ws.get_default_datastore()
datastore.upload(src_dir='../../data/', 
                 target_path='Datasets/porto_seguro_safe_driver_prediction', overwrite=True, show_progress=True)

Uploading an estimated of 1 files
Uploading ../../data/README-Download-Dataset-And-Copy-Here.txt
Uploaded ../../data/README-Download-Dataset-And-Copy-Here.txt, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_73dec002895d42ca8f95ed379814889b

## Load data into Azure ML Dataset and Register into Workspace

In [5]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file in the AML DataStore (or, optionally, from an HTTP URL)
found = False
aml_dataset_name = "porto_seguro_safe_driver_prediction_train"

if aml_dataset_name in ws.datasets.keys(): 
       found = True
       aml_dataset = ws.datasets[aml_dataset_name] 
       print("Dataset loaded from the Workspace")
       
if not found:
        # Create AML Dataset and register it into Workspace
        print("Dataset does not exist in the current Workspace. It will be imported and registered.")
        
        # Option A: Create AML Dataset from file in AML DataStore
        datastore = ws.get_default_datastore()
        aml_dataset = Dataset.Tabular.from_delimited_files(path=datastore.path('Datasets/porto_seguro_safe_driver_prediction/porto_seguro_safe_driver_prediction_train.csv'))
        data_origin_type = 'AMLDataStore'
        
        # Option B: Create AML Dataset from file in HTTP URL
        # data_url = 'https://azmlworkshopdata.blob.core.windows.net/safedriverdata/porto_seguro_safe_driver_prediction_train.csv'
        # aml_dataset = Dataset.Tabular.from_delimited_files(data_url)  
        # data_origin_type = 'HttpUrl'
        
        print(aml_dataset)
                
        #Register Dataset in Workspace
        registration_method = 'SDK'  # or 'UI'
        aml_dataset = aml_dataset.register(workspace=ws,
                                           name=aml_dataset_name,
                                           description='Porto Seguro Safe Driver Prediction Train dataset file',
                                           tags={'Registration-Method': registration_method, 'Data-Origin-Type': data_origin_type},
                                           create_new_version=True)
        
        print("Dataset created from file and registered in the Workspace")


Dataset loaded from the Workspace


In [6]:
# Use a pandas DataFrame just to sneak a peek at some data and the schema
data_df = aml_dataset.to_pandas_dataframe()
print(data_df.shape)
# print(data_df.describe())
data_df.head(5)

(595212, 59)


Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,...,3,1,1,3,0,0,0,1,1,0


## Split Data into Train and Test AML Tabular Datasets

For remote AML training you need to use AML Datasets; you cannot submit pandas DataFrames to remote runs of AutoMLConfig.

Note that the AutoMLConfig below does not use the Test dataset. You provide a single dataset, which AutoML internally either splits into train/validation datasets or evaluates with cross-validation, depending on the size of the dataset. The boundary is 20k rows: cross-validation is used for fewer than 20k rows. You can also set the validation strategy explicitly.

The Test dataset will be used at the end of the notebook to manually calculate the quality metrics with a dataset not seen by AutoML training.
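
The size-based rule described above can be sketched as a small function. This only encodes the rule as stated in this notebook (20k-row boundary); it is not AutoML's actual implementation, and `default_validation_strategy` is a hypothetical name.

```python
CROSS_VALIDATION_ROW_LIMIT = 20_000  # boundary quoted in the text above

def default_validation_strategy(n_rows, row_limit=CROSS_VALIDATION_ROW_LIMIT):
    """Return the validation strategy implied by the dataset size, per the rule stated above."""
    if n_rows < row_limit:
        return "cross_validation"
    return "train_validation_split"

# The training split in this notebook has several hundred thousand rows
print(default_validation_strategy(535_837))
print(default_validation_strategy(5_000))
```

Setting `validation_size` or `n_cross_validations` in the AutoML settings, as done later in this notebook, overrides the default choice.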

In [7]:
# Split in train/test datasets (Test=10%, Train=90%)

train_dataset, test_dataset = aml_dataset.random_split(0.9, seed=0)

# Use Pandas DF only to check the data
train_df = train_dataset.to_pandas_dataframe()
test_df = test_dataset.to_pandas_dataframe()

In [8]:
print(train_df.shape)
print(test_df.shape)

train_df.describe()

(535837, 59)
(59375, 59)


Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
count,535837.0,535837.0,535837.0,535837.0,535837.0,535837.0,535837.0,535837.0,535837.0,535837.0,...,535837.0,535837.0,535837.0,535837.0,535837.0,535837.0,535837.0,535837.0,535837.0,535837.0
mean,743665.1,0.036464,1.900459,1.358796,4.42104,0.416595,0.404123,0.39395,0.256837,0.163902,...,5.440893,1.442558,2.872773,7.539302,0.122509,0.627775,0.554002,0.286763,0.349114,0.153134
std,429322.2,0.187443,1.984136,0.664525,2.698681,0.493283,1.348814,0.488624,0.436889,0.370187,...,2.332877,1.203214,1.695767,2.747292,0.327873,0.483398,0.497076,0.45225,0.47669,0.360117
min,7.0,0.0,0.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,371913.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,...,4.0,1.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,743313.0,0.0,1.0,1.0,4.0,0.0,0.0,0.0,0.0,0.0,...,5.0,1.0,3.0,7.0,0.0,1.0,1.0,0.0,0.0,0.0
75%,1115137.0,0.0,3.0,2.0,6.0,1.0,0.0,1.0,1.0,0.0,...,7.0,2.0,4.0,9.0,0.0,1.0,1.0,1.0,1.0,0.0
max,1488027.0,1.0,7.0,4.0,11.0,1.0,6.0,1.0,1.0,1.0,...,19.0,10.0,13.0,23.0,1.0,1.0,1.0,1.0,1.0,1.0


In [9]:
train_df.head(5)

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,19,0,5,1,4,0,0,0,0,0,...,4,2,0,9,0,1,0,1,1,1


## Connect to Remote AML Compute (Existing AML cluster)

In [10]:
from azureml.core.compute import AmlCompute, ComputeTarget
# Define remote compute target to use
# Further docs on Remote Compute Target: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-remote

# Choose a name for your cluster.
amlcompute_cluster_name = "cpu-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
     found = True
     print('Found existing training cluster.')
     # Get existing cluster
     # Method 1:
     aml_remote_compute = cts[amlcompute_cluster_name]
     # Method 2:
     # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)
    
if not found:
     print('Creating a new training cluster...')
     provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D13_V2", # for GPU, use "STANDARD_NC12"
                                                                 #vm_priority = 'lowpriority', # optional
                                                                 max_nodes = 5)
     # Create the cluster.
     aml_remote_compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
aml_remote_compute.wait_for_completion(show_output = True, min_node_count = 0, timeout_in_minutes = 20)

Found existing training cluster.
Checking cluster status...
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [11]:
# For additional details of current AmlCompute status:
aml_remote_compute.get_status()

<azureml.core.compute.amlcompute.AmlComputeStatus at 0x7f4f1dd93dd8>

## Train with Azure AutoML automatically searching for the 'best model' (Best algorithms and best hyper-parameters)

### List and select primary metric to drive the AutoML classification problem

In [12]:
from azureml.train import automl

# List of possible primary metrics is here:
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric
    
# Get a list of valid metrics for your given task
automl.utilities.get_primary_metrics('classification')

['precision_score_weighted',
 'norm_macro_recall',
 'AUC_weighted',
 'average_precision_score_weighted',
 'accuracy']
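
A small guard can catch a mistyped metric name before a remote run is submitted. The sketch below hardcodes the list printed above; `check_primary_metric` is a hypothetical helper, not an SDK function.

```python
# Metric names copied from the output above
VALID_PRIMARY_METRICS = [
    "precision_score_weighted",
    "norm_macro_recall",
    "AUC_weighted",
    "average_precision_score_weighted",
    "accuracy",
]

def check_primary_metric(name, valid=VALID_PRIMARY_METRICS):
    """Return the metric name unchanged, or raise if it is not in the list of valid metrics."""
    if name not in valid:
        raise ValueError(f"Unknown primary metric {name!r}; expected one of {valid}")
    return name

primary_metric = check_primary_metric("AUC_weighted")
print(primary_metric)
```

Metric names are case-sensitive in this check, so `"auc_weighted"` would be rejected.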

## Define AutoML Experiment settings

In [13]:
import logging

# You can provide additional settings as a **kwargs parameter for the AutoMLConfig object
automl_settings = {
      # "blocked_models": ['LogisticRegression', 'ExtremeRandomTrees', 'RandomForest'],  # formerly "blacklist_models"
      # "allowed_models": ['LightGBM'],  # formerly "whitelist_models"
      "validation_size": 0.1,
      # "validation_data": validation_df,  # If you have an explicit validation set
      # "n_cross_validations": 5,
      # "experiment_exit_score": 0.7,
      # "max_cores_per_iteration": -1,
      "enable_batch_run": True,
      "enable_voting_ensemble": True,
      "enable_stack_ensemble": True
}

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='classification',
                             primary_metric='AUC_weighted',                           
                             training_data=train_dataset, # AML Dataset
                             label_column_name="target",                                                    
                             enable_early_stopping= True,
                             iterations=5,
                             max_concurrent_iterations=5,
                             experiment_timeout_hours=3,                           
                             featurization= 'auto',   # 'auto' or 'off'. All feature columns in this dataset are numeric, so 'off' is also an option.
                             debug_log='automated_ml_errors.log',
                             verbosity= logging.DEBUG,
                             model_explainability=True,
                             enable_onnx_compatible_models=True,
                             **automl_settings
                             )

# Explanation of Settings: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#configure-your-experiment-settings

# AutoMLConfig info on: 
# https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig
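
The cell above passes extra settings with `**automl_settings`. A minimal sketch of how that unpacking behaves, using a stand-in function (`build_config` is hypothetical and only mimics a constructor that accepts keyword arguments):

```python
def build_config(**kwargs):
    """Stand-in for a constructor that accepts arbitrary keyword settings."""
    return dict(kwargs)

settings = {"validation_size": 0.1, "enable_voting_ensemble": True}

# Explicit keyword arguments and the unpacked dictionary are merged into one call
config = build_config(task="classification", iterations=5, **settings)
print(config)

# Passing the same key both explicitly and through the dictionary is an error
try:
    build_config(iterations=5, **{"iterations": 10})
    duplicate_rejected = False
except TypeError:
    duplicate_rejected = True
print(duplicate_rejected)
```

This is why each setting should live either in the dictionary or in the explicit argument list, not in both.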

## Run Experiment (on remote AML Compute) with multiple child runs under the covers

In [14]:
from azureml.core import Experiment

experiment_name = "SDK_remote_porto_seguro_driver_pred"
print(experiment_name)

experiment = Experiment(workspace=ws, 
                        name=experiment_name)

import time
start_time = time.time()
            
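# With show_output=False, submit() returns once the run has been submitted and does not wait for it to finish,
# so the timing printed below covers only the submission, not the whole experiment.
parent_run = experiment.submit(automl_config, show_output=False)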

print('Manual run timing: --- %s minutes needed for running the whole Remote AutoML Experiment ---' % ((time.time() - start_time)/60))


SDK_remote_porto_seguro_driver_pred
Running on remote.
Manual run timing: --- 0.2428380290667216 minutes needed for running the whole Remote AutoML Experiment ---




## Explore results with Widget

In [45]:
# Explore the results of automatic training with a Jupyter widget: https://docs.microsoft.com/en-us/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py
from azureml.widgets import RunDetails
RunDetails(parent_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

### Measure Parent Run Time needed for the whole AutoML process 

In [17]:
import time
from datetime import datetime

run_details = parent_run.get_details()

# Like: 2020-01-12T23:11:56.292703Z
end_time_utc_str = run_details['endTimeUtc'].split(".")[0]
start_time_utc_str = run_details['startTimeUtc'].split(".")[0]
timestamp_end = time.mktime(datetime.strptime(end_time_utc_str, "%Y-%m-%dT%H:%M:%S").timetuple())
timestamp_start = time.mktime(datetime.strptime(start_time_utc_str, "%Y-%m-%dT%H:%M:%S").timetuple())

parent_run_time = timestamp_end - timestamp_start
print('Run Timing: --- %s minutes needed for running the whole Remote AutoML Experiment ---' % (parent_run_time/60))

Run Timing: --- 18.266666666666666 minutes needed for running the whole Remote AutoML Experiment ---
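
The cell above uses `time.mktime`, which interprets the parsed time as local time even though the timestamps are UTC. The difference between two timestamps is usually still correct, but a timezone-aware variant avoids surprises around daylight-saving changes. A minimal sketch, assuming timestamps with six fractional digits and a trailing `Z`; the helper names and sample values are illustrative, not taken from this run.

```python
from datetime import datetime, timezone

def parse_utc(timestamp):
    """Parse a timestamp such as '2020-01-12T23:11:56.292703Z' as a timezone-aware UTC datetime."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)

def duration_minutes(start_utc, end_utc):
    """Elapsed time between two UTC timestamp strings, in minutes."""
    return (parse_utc(end_utc) - parse_utc(start_utc)).total_seconds() / 60

# Sample timestamps in the same format (illustrative values)
elapsed = duration_minutes("2020-01-12T23:11:56.292703Z", "2020-01-12T23:30:12.292703Z")
print(elapsed)
```

In the notebook, `run_details['startTimeUtc']` and `run_details['endTimeUtc']` would be passed in directly.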


## Retrieve the 'Best' Model

In [18]:
best_run, fitted_model = parent_run.get_output()
print(best_run)
print('--------')
print(fitted_model)

Package:azureml-automl-runtime, training version:1.24.0, current version:1.20.0
Package:azureml-core, training version:1.24.0.post1, current version:1.20.0
Package:azureml-dataprep, training version:2.11.2, current version:2.7.3
Package:azureml-dataprep-native, training version:30.0.0, current version:27.0.0
Package:azureml-dataprep-rslex, training version:1.9.1, current version:1.5.0
Package:azureml-dataset-runtime, training version:1.24.0, current version:1.20.0
Package:azureml-defaults, training version:1.24.0, current version:1.20.0
Package:azureml-interpret, training version:1.24.0, current version:1.20.0
Package:azureml-mlflow, training version:1.24.0, current version:1.20.0.post1
Package:azureml-pipeline-core, training version:1.24.0, current version:1.20.0
Package:azureml-telemetry, training version:1.24.0, current version:1.20.0
Package:azureml-train-automl-client, training version:1.24.0, current version:1.20.0
Package:azureml-train-automl-runtime, training version:1.24.0, cu

Run(Experiment: SDK_remote_porto_seguro_driver_pred,
Id: AutoML_13a2f742-dfbe-4d80-a727-0f26162bf517_4,
Type: azureml.scriptrun,
Status: Completed)
--------
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                learning_rate=0.1,
                                                                                                max_depth=-1,
          

#### Retrieve METRICS for All Child Runs
You can also use SDK methods to fetch all the child runs and see individual metrics that we log.

In [19]:
children = list(parent_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    # Named run_metrics so it does not shadow the sklearn `metrics` module imported at the top
    run_metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = run_metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

Unnamed: 0,0,1,2,3,4
average_precision_score_micro,0.97,0.97,0.97,0.62,0.97
precision_score_macro,0.48,0.48,0.48,0.51,0.48
average_precision_score_weighted,0.94,0.94,0.94,0.94,0.94
AUC_macro,0.64,0.64,0.63,0.63,0.64
f1_score_weighted,0.95,0.95,0.95,0.72,0.95
precision_score_micro,0.96,0.96,0.96,0.6,0.96
average_precision_score_macro,0.52,0.52,0.52,0.52,0.52
f1_score_micro,0.96,0.96,0.96,0.6,0.96
AUC_weighted,0.64,0.64,0.63,0.63,0.64
balanced_accuracy,0.5,0.5,0.5,0.6,0.5


## Retrieve the Best Model's explanation
Retrieve the explanation from the best_run which includes explanations for engineered features and raw features. Make sure that the run for generating explanations for the best model is completed.

In [20]:
# Wait for the best model explanation run to complete
from azureml.core.run import Run

# AutoML_525e9be6-0cb8-4750-9c4b-b8518636b0ce_ModelExplain
model_explainability_run_id = parent_run.id + "_" + "ModelExplain"
print(model_explainability_run_id)

model_explainability_run = Run(experiment=experiment, run_id=model_explainability_run_id)
model_explainability_run.wait_for_completion()


AutoML_13a2f742-dfbe-4d80-a727-0f26162bf517_ModelExplain


{'runId': 'AutoML_13a2f742-dfbe-4d80-a727-0f26162bf517_ModelExplain',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-03-25T21:02:49.410545Z',
 'endTimeUtc': '2021-03-25T21:06:02.350958Z',
 'properties': {'azureml.runsource': 'automl',
  'parentRunId': 'AutoML_13a2f742-dfbe-4d80-a727-0f26162bf517_4',
  '_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '63f063af-7e2b-4231-ac53-e4471bf8a98c',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json',
  'dependencies_versions': '{"azureml-train-automl-runtime": "1.24.0", "azureml-train-automl-client": "1.24.0", "azureml-telemetry": "1.24.0", "azureml-pipeline-core": "1.24.0", "azureml-model-management-sdk": "1.0.1b6.post1", "azureml-mlflow": "1.24.0", "azureml-interpret": "1.24.0", "azureml-defaults": "1.24.0", "azureml-dataset-runtime": "1.24.0", "azureml-dataprep": "2.11.2", "azureml-dataprep-rslex": "1.9.1", "azureml-dataprep-native": "30.0

### Download and Print engineered feature importance from artifact store
You can use ExplanationClient to download the engineered feature explanations from the artifact store of the best_run.

In [21]:
from azureml.interpret import ExplanationClient

client = ExplanationClient.from_run(best_run)
engineered_explanations = client.download_model_explanation(raw=False)
exp_data = engineered_explanations.get_feature_importance_dict()
exp_data

{'ps_car_13_MeanImputer': 0.14417766307980923,
 'ps_reg_03_MeanImputer': 0.0788330602992105,
 'ps_ind_05_cat_CharGramCountVectorizer_0': 0.072647124042033,
 'ps_ind_17_bin_ModeCatImputer_LabelEncoder': 0.0540467680958349,
 'ps_ind_06_bin_ModeCatImputer_LabelEncoder': 0.04482576568234897,
 'ps_ind_16_bin_ModeCatImputer_LabelEncoder': 0.04128472977011646,
 'ps_reg_01_MeanImputer': 0.039818282528853326,
 'ps_car_01_cat_CharGramCountVectorizer_7': 0.038397028884868935,
 'ps_ind_07_bin_ModeCatImputer_LabelEncoder': 0.037541942111168625,
 'ps_reg_02_MeanImputer': 0.03256620329878524,
 'ps_ind_03_CharGramCountVectorizer_3': 0.03203785063157633,
 'ps_car_03_cat_CharGramCountVectorizer_-1': 0.026998845290837566,
 'ps_ind_03_CharGramCountVectorizer_2': 0.024175150579557225,
 'ps_car_07_cat_CharGramCountVectorizer_1': 0.023873471392716235,
 'ps_car_03_cat_CharGramCountVectorizer_1': 0.023836376761082648,
 'ps_car_04_cat_CharGramCountVectorizer_0': 0.01875590688999048,
 'ps_car_09_cat_CharGramCoun

### Download raw feature importance from artifact store
You can use ExplanationClient to download the raw feature explanations from the artifact store of the best_run.

In [22]:
client = ExplanationClient.from_run(best_run)
raw_explanations = client.download_model_explanation(raw=True)
exp_data = raw_explanations.get_feature_importance_dict()
exp_data

{'ps_car_13': 0.14417766307980923,
 'ps_ind_03': 0.08396152554585654,
 'ps_ind_05_cat': 0.08061774457158648,
 'ps_reg_03': 0.0788330602992105,
 'ps_ind_17_bin': 0.0540467680958349,
 'ps_car_01_cat': 0.052247339681112115,
 'ps_car_03_cat': 0.05121265733768188,
 'ps_ind_06_bin': 0.04482576568234897,
 'ps_ind_15': 0.04203968917068513,
 'ps_ind_16_bin': 0.04128472977011646,
 'ps_reg_01': 0.039818282528853326,
 'ps_ind_07_bin': 0.037541942111168625,
 'ps_reg_02': 0.03256620329878524,
 'ps_car_04_cat': 0.027340630813597878,
 'ps_car_07_cat': 0.026190065291555945,
 'ps_car_09_cat': 0.018739135528751626,
 'ps_ind_01': 0.018050916237341768,
 'ps_ind_04_cat': 0.015247811033962257,
 'ps_ind_09_bin': 0.010169779011031622,
 'ps_ind_08_bin': 0.007786698097683488,
 'ps_car_15': 0.007729405117228737,
 'ps_ind_02_cat': 0.005649083974565207,
 'ps_car_11_cat': 0.0032674909298780167,
 'ps_car_06_cat': 0.003255573520637106,
 'ps_car_11': 0.002549767342496647,
 'ps_car_14': 0.0021318109228354283,
 'ps_car_0
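
A feature importance dictionary like the one above is easier to read when reduced to its top entries. A minimal sketch, using a few values from the output above rounded to four decimals; `top_features` is a hypothetical helper.

```python
# A few raw feature importances from the output above, rounded and deliberately unordered
sample_importances = {
    "ps_reg_03": 0.0788,
    "ps_car_13": 0.1442,
    "ps_ind_17_bin": 0.0540,
    "ps_ind_03": 0.0840,
    "ps_ind_05_cat": 0.0806,
}

def top_features(importances, k=3):
    """Return the k (name, importance) pairs with the highest importance, largest first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

for name, value in top_features(sample_importances):
    print(f"{name}: {value:.4f}")
```

In the notebook, `exp_data` would be passed in place of `sample_importances`.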

## Register Model in Workspace model registry

In [23]:

registered_model = parent_run.register_model(model_name='porto-seg-automl-remote-compute', 
                                           description='Porto Seguro Model from plain AutoML in remote AML compute')

print(parent_run.model_id)
registered_model



porto-seg-automl-remote-compute


Model(workspace=Workspace.create(name='cesardl-automl-centraluseuap-ws', subscription_id='102a16c3-37d3-48a8-9237-4c9b1e8e80e0', resource_group='automlpmdemo'), name=porto-seg-automl-remote-compute, id=porto-seg-automl-remote-compute:6, version=6, tags={}, properties={})

## See files associated with the 'Best run'

In [24]:
print(best_run.get_file_names())

# best_run.download_file('azureml-logs/70_driver_log.txt')

['accuracy_table', 'automl_driver.py', 'azureml-logs/55_azureml-execution-tvmps_87211ed8621640b825fbca142b2facd33bd03dd48df59404b750f4d1053c6080_d.txt', 'azureml-logs/65_job_prep-tvmps_87211ed8621640b825fbca142b2facd33bd03dd48df59404b750f4d1053c6080_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_87211ed8621640b825fbca142b2facd33bd03dd48df59404b750f4d1053c6080_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'confusion_matrix', 'explanation/1db4a4e2/classes.interpret.json', 'explanation/1db4a4e2/expected_values.interpret.json', 'explanation/1db4a4e2/features.interpret.json', 'explanation/1db4a4e2/global_names/0.interpret.json', 'explanation/1db4a4e2/global_rank/0.interpret.json', 'explanation/1db4a4e2/global_values/0.interpret.json', 'explanation/1db4a4e2/local_importance_values.interpret.json', 'explanation/1db4a4e2/local_importance_viz.interpret.json', 'explanation/1db4a4e2/per_class_names/0.interpret.json', 'explanation/1db4a4e2

## Make Predictions and calculate metrics

### Prep Test Data: Extract X values (feature columns) and y values (target column) from the test dataset for predicting

In [30]:
import pandas as pd

x_test_df = test_df.copy()

if 'target' in x_test_df.columns:
    y_test_df = x_test_df.pop('target')

print(test_df.shape)
print(x_test_df.shape)
print(y_test_df.shape)

(59375, 59)
(59375, 58)
(59375,)


In [31]:
y_test_df.describe()

count   59375.00
mean        0.04
std         0.19
min         0.00
25%         0.00
50%         0.00
75%         0.00
max         1.00
Name: target, dtype: float64

### Make predictions in bulk

In [32]:
# Try the best model making predictions with the test dataset
y_predictions = fitted_model.predict(x_test_df)

print(y_predictions.shape)
print('30 predictions: ')
print(y_predictions[:30])

(59375,)
30 predictions: 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


### Get all the predictions' probabilities needed to calculate ROC AUC

In [33]:
class_probabilities = fitted_model.predict_proba(x_test_df)
print(class_probabilities.shape)

print('Some class probabilities...: ')
print(class_probabilities[:3])

print('Probabilities for class 1:')
print(class_probabilities[:,1])

print('Probabilities for class 0:')
print(class_probabilities[:,0])

(59375, 2)
Some class probabilities...: 
[[0.96949423 0.03050576]
 [0.96898435 0.03101564]
 [0.96545951 0.03454051]]
Probabilities for class 1:
[0.03050576 0.03101564 0.03454051 ... 0.02641905 0.01898987 0.02541346]
Probabilities for class 0:
[0.96949423 0.96898435 0.96545951 ... 0.97358098 0.98101014 0.97458654]
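
The class 1 probabilities shown above are all far below 0.5, which is consistent with the all-zero predictions in the previous cell if labels are taken at the usual 0.5 cut-off. On an imbalanced target, a lower cut-off is sometimes applied to the probabilities instead. A minimal sketch, with a hypothetical helper and a purely illustrative threshold of 0.031:

```python
def labels_from_probabilities(positive_probabilities, threshold=0.5):
    """Assign class 1 when the class 1 probability reaches the threshold, else class 0."""
    return [1 if p >= threshold else 0 for p in positive_probabilities]

# First three class 1 probabilities from the output above
sample = [0.03050576, 0.03101564, 0.03454051]

default_labels = labels_from_probabilities(sample)
custom_labels = labels_from_probabilities(sample, threshold=0.031)
print(default_labels)
print(custom_labels)
```

The threshold has no effect on ROC AUC, which is computed from the probabilities themselves, but it changes accuracy, precision, and recall.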


## Evaluate Model

Evaluating performance is an essential task in machine learning. Because this is a binary classification problem, the data scientist elected to use the AUC-ROC curve. The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate across decision thresholds, and the AUC (Area Under the Curve) summarizes that curve in a single number. It is one of the most widely used evaluation metrics for checking a classification model's performance.

<img src="https://www.researchgate.net/profile/Oxana_Trifonova/publication/276079439/figure/fig2/AS:614187332034565@1523445079168/An-example-of-ROC-curves-with-good-AUC-09-and-satisfactory-AUC-065-parameters.png"
     alt="Example ROC curves with good (AUC 0.9) and satisfactory (AUC 0.65) parameters"
     style="float: left; margin-right: 12px; width: 320px; height: 239px;" />

### Calculate the ROC AUC with probabilities vs. the Test Dataset

In [None]:
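# Import the functions directly so this cell does not depend on the module-level name `metrics`
from sklearn.metrics import roc_curve, auc

print('ROC AUC *method 1*:')
fpr, tpr, thresholds = roc_curve(y_test_df, class_probabilities[:,1])
print(auc(fpr, tpr))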

In [35]:
from sklearn.metrics import roc_auc_score

print('ROC AUC *method 2*:')
print(roc_auc_score(y_test_df, class_probabilities[:,1]))

# For a binary target the 'average' argument has no effect, so this matches the value above
print('ROC AUC Weighted:')
print(roc_auc_score(y_test_df, class_probabilities[:,1], average='weighted'))
# AUC with plain LightGBM was: 0.6374553321494826 

ROC AUC *method 2*:
0.6406061434233158
ROC AUC Weighted:
0.6406061434233158
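
ROC AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. A minimal pure-Python sketch of that definition, intended for small illustrative inputs only since it compares every positive/negative pair; `roc_auc_pairwise` is a hypothetical helper, not the scikit-learn implementation.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute ROC AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Three of the four positive/negative pairs are ordered correctly
print(roc_auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Under this reading, a value of about 0.64 means the model ranks a positive above a negative in roughly 64% of such pairs.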


### Calculate the Accuracy with predictions vs. the Test Dataset

In [36]:
print(y_test_df.shape)
print(y_predictions.shape)

(59375,)
(59375,)


In [37]:
from sklearn.metrics import accuracy_score

print('Accuracy:')
print(accuracy_score(y_test_df, y_predictions))


Accuracy:
0.9637052631578947
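
On a target this imbalanced (the test labels above have a mean of about 0.04), accuracy alone says little: always predicting the majority class already scores very high. A minimal sketch with synthetic labels (4 positives out of 100, chosen for illustration) comparing accuracy with balanced accuracy, the mean of the per-class recalls. Both helpers are hypothetical pure-Python stand-ins.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in indices) / len(indices))
    return sum(recalls) / len(recalls)

# Synthetic, illustrative labels: 4% positives, and a "model" that always predicts 0
y_true = [1] * 4 + [0] * 96
y_pred = [0] * 100

print(accuracy(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
```

This is why the notebook drives the AutoML run with `AUC_weighted` rather than accuracy.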


### Load model in memory

#### (Option A: Load from model .pkl file)

In [38]:
# Load the model into memory from downloaded file
import joblib

best_run.download_file('outputs/model.pkl', output_file_path='model.pkl')

fitted_model = joblib.load('model.pkl')
print(fitted_model)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                learning_rate=0.1,
                                                                                                max_depth=-1,
                                                                                                min_child_samples=20,
                                                 

#### (Option B: Load from model registry in Workspace)

In [39]:
from azureml.core.model import Model

# Not used for inference; shown only to inspect the registered model definition
registered_model_definition = Model(ws, 'porto-seg-automl-remote-compute')
print(registered_model_definition)

Model(workspace=Workspace.create(name='cesardl-automl-centraluseuap-ws', subscription_id='102a16c3-37d3-48a8-9237-4c9b1e8e80e0', resource_group='automlpmdemo'), name=porto-seg-automl-remote-compute, id=porto-seg-automl-remote-compute:6, version=6, tags={}, properties={})


In [40]:
# Load model from model registry in Workspace
from azureml.core.model import Model

model_path = Model.get_model_path('porto-seg-automl-remote-compute', _workspace=ws)
fitted_model = joblib.load(model_path)
print(fitted_model)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                learning_rate=0.1,
                                                                                                max_depth=-1,
                                                                                                min_child_samples=20,
                                                 

## Try model inference with hardcoded input data for the model to predict

In [41]:
# Data from Dataframe for comparison with hardcoded data
# x_test_df.head(1)

In [42]:
# Data from Dataframe for comparison with hardcoded data
# print(x_test_df.head(1).values)
# print(x_test_df.head(1).columns)

In [43]:
import json

raw_data = json.dumps({
     'data': [[20,2,1,3,1,0,0,1,0,0,0,0,0,0,0,8,1,0,0,0.6,0.1,0.61745445,6,1,-1,0,1,11,1,1,0,1,99,2,0.31622777,0.6396829,0.36878178,3.16227766,0.2,0.6,0.5,2,2,8,1,8,3,10,3,0,0,10,0,1,0,0,1,0]],
     'method': 'predict'  # If you have a classification model, you can get probabilities by changing this to 'predict_proba'.
 })

print(json.loads(raw_data)['data'])

numpy_data = np.array(json.loads(raw_data)['data'])

df_data = pd.DataFrame(data=numpy_data, columns=['id', 'ps_ind_01', 'ps_ind_02_cat', 'ps_ind_03', 'ps_ind_04_cat',
                                               'ps_ind_05_cat', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin',
                                               'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin',
                                               'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15', 'ps_ind_16_bin',
                                               'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03',
                                               'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat',
                                               'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat',
                                               'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat', 'ps_car_11',
                                               'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15', 'ps_calc_01',
                                               'ps_calc_02', 'ps_calc_03', 'ps_calc_04', 'ps_calc_05', 'ps_calc_06',
                                               'ps_calc_07', 'ps_calc_08', 'ps_calc_09', 'ps_calc_10', 'ps_calc_11',
                                               'ps_calc_12', 'ps_calc_13', 'ps_calc_14', 'ps_calc_15_bin',
                                               'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin',
                                               'ps_calc_20_bin'])
df_data

[[20, 2, 1, 3, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 8, 1, 0, 0, 0.6, 0.1, 0.61745445, 6, 1, -1, 0, 1, 11, 1, 1, 0, 1, 99, 2, 0.31622777, 0.6396829, 0.36878178, 3.16227766, 0.2, 0.6, 0.5, 2, 2, 8, 1, 8, 3, 10, 3, 0, 0, 10, 0, 1, 0, 0, 1, 0]]


Unnamed: 0,id,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,20.0,2.0,1.0,3.0,1.0,0.0,0.0,1.0,0.0,0.0,...,3.0,0.0,0.0,10.0,0.0,1.0,0.0,0.0,1.0,0.0


In [44]:
# Get predictions from the model
y_predictions = fitted_model.predict(df_data) # x_test_df.head(1)
y_predictions # Should return a [0] or [1] depending on the prediction result

array([0])
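
The hardcoded example above relies on the 58 input values lining up with the 58 column names. A minimal sketch of a check that fails early when they do not, shown with a three-column toy payload. `payload_to_records` is a hypothetical helper, and the payload shape (a `data` key holding a list of rows) follows the JSON used above.

```python
import json

def payload_to_records(raw_json, columns):
    """Turn a {'data': [[...], ...]} JSON payload into a list of {column: value} dictionaries."""
    rows = json.loads(raw_json)["data"]
    records = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"Expected {len(columns)} values, got {len(row)}")
        records.append(dict(zip(columns, row)))
    return records

# Toy payload using the first three columns and values of the example above
columns = ["id", "ps_ind_01", "ps_ind_02_cat"]
raw = json.dumps({"data": [[20, 2, 1]]})

records = payload_to_records(raw, columns)
print(records)
```

The resulting records can be passed to `pd.DataFrame(records)` to obtain a DataFrame with named columns for `fitted_model.predict`.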