# AutoML on remote AML Compute (Porto Seguro's Safe Driving Prediction)

This notebook is a refactored version of the original AutoML local training notebook. It runs AutoML on a remote AML compute cluster.
It also uses AML Datasets for training instead of Pandas DataFrames.

## Import Needed Packages

Import the packages needed for this solution notebook. Among the most widely used Python packages for machine learning and data analysis are [scikit-learn](https://scikit-learn.org/stable/), [pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started), and [numpy](https://numpy.org/). Together they provide data structures and numerical routines, plus many clustering, regression, and classification algorithms, which makes them a good choice for data mining and data analysis.

In [None]:
import numpy as np
import pandas as pd
import joblib
from sklearn import metrics

## Check Azure ML SDK version

In [None]:
import azureml.core
print("This notebook was created and tested using version 1.2.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")
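
The two print statements above leave the comparison to the reader. As a minimal sketch, the check could be automated by comparing version strings as integer tuples. The helper names `version_tuple` and `is_older` are hypothetical, and the approach assumes purely numeric dotted versions (a suffix such as `rc1` would raise a `ValueError`).

```python
def version_tuple(version: str) -> tuple:
    """Turn a dotted numeric version such as '1.2.0' into (1, 2, 0)."""
    return tuple(int(part) for part in version.split("."))


def is_older(current: str, tested: str) -> bool:
    """Return True when `current` is an older version than `tested`."""
    return version_tuple(current) < version_tuple(tested)


# Hypothetical usage; in the notebook, `current` would be azureml.core.VERSION.
tested_version = "1.2.0"
for current in ("1.0.85", "1.2.0", "1.10.0"):
    status = "older than" if is_older(current, tested_version) else "at least"
    print(current, "is", status, "the tested version", tested_version)
```

Comparing tuples rather than raw strings matters because `"1.10.0" < "1.2.0"` is true for strings but false for versions.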

##  Get Azure ML Workspace to use

In [None]:
from azureml.core import Workspace, Dataset

# Get the Workspace defined in the default config.json file
ws = Workspace.from_config()

### (Optional) Submit dataset file into DataStore (Azure Blob under the covers)

In [None]:
datastore = ws.get_default_datastore()
datastore.upload(src_dir='../../data/', 
                 target_path='Datasets/porto_seguro_safe_driver_prediction', overwrite=True, show_progress=True)

## Load data into Azure ML Dataset and Register into Workspace

In [None]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file in the AML DataStore (Option A) or from the HTTP URL (Option B)
found = False
aml_dataset_name = "porto_seguro_safe_driver_prediction_train"

if aml_dataset_name in ws.datasets.keys():
        found = True
        aml_dataset = ws.datasets[aml_dataset_name]
        print("Dataset loaded from the Workspace")
       
if not found:
        # Create AML Dataset and register it into Workspace
        print("Dataset does not exist in the current Workspace. It will be imported and registered.")
        
        # Option A: Create AML Dataset from file in AML DataStore
        datastore = ws.get_default_datastore()
        aml_dataset = Dataset.Tabular.from_delimited_files(path=datastore.path('Datasets/porto_seguro_safe_driver_prediction/porto_seguro_safe_driver_prediction_train.csv'))
        data_origin_type = 'AMLDataStore'
        
        # Option B: Create AML Dataset from file in HTTP URL
        # data_url = 'https://azmlworkshopdata.blob.core.windows.net/safedriverdata/porto_seguro_safe_driver_prediction_train.csv'
        # aml_dataset = Dataset.Tabular.from_delimited_files(data_url)  
        # data_origin_type = 'HttpUrl'
        
        print(aml_dataset)
                
        # Register Dataset in Workspace
        registration_method = 'SDK'  # or 'UI'
        aml_dataset = aml_dataset.register(workspace=ws,
                                           name=aml_dataset_name,
                                           description='Porto Seguro Safe Driver Prediction Train dataset file',
                                           tags={'Registration-Method': registration_method, 'Data-Origin-Type': data_origin_type},
                                           create_new_version=True)
        
        print("Dataset created from file and registered in the Workspace")


In [None]:
# Use a Pandas DataFrame just to take a sneak peek at some data and the schema
data_df = aml_dataset.to_pandas_dataframe()
print(data_df.shape)
# print(data_df.describe())
data_df.head(5)

## Split Data into Train and Test AML Tabular Datasets

For remote AML training you need to use AML Datasets; you cannot submit Pandas DataFrames to remote runs of AutoMLConfig.

Note that AutoMLConfig below does not use the Test dataset. You provide a single training dataset, and AutoML either splits it internally into train and validation sets or uses cross-validation, depending on the size of the dataset. The boundary is 20k rows: cross-validation is used for fewer than 20k rows, and a train/validation split otherwise. The user can also choose the validation approach explicitly.

The Test dataset will be used at the end of the notebook to manually calculate the quality metrics with a dataset not seen by AutoML training.
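
The default validation rule described above can be written down as a small function. This is only a sketch of the rule as stated in this notebook, not AutoML's actual implementation; the function name `choose_validation_strategy` and its return strings are hypothetical.

```python
def choose_validation_strategy(n_rows: int, threshold: int = 20_000) -> str:
    """Mirror the default rule described in the text.

    Fewer rows than the threshold -> cross-validation,
    otherwise -> a single train/validation split.
    """
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if n_rows < threshold:
        return "cross_validation"
    return "train_validation_split"


# Illustrative row counts only
for n_rows in (5_000, 19_999, 20_000, 500_000):
    print(n_rows, "->", choose_validation_strategy(n_rows))
```

Setting `n_cross_validations` or `validation_data` in the AutoML settings (both appear commented out later in this notebook) is how the user would override the default.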

In [None]:
# Split in train/test datasets (Test=10%, Train=90%)

train_dataset, test_dataset = aml_dataset.random_split(0.9, seed=0)

# Use Pandas DF only to check the data
train_df = train_dataset.to_pandas_dataframe()
test_df = test_dataset.to_pandas_dataframe()

In [None]:
print(train_df.shape)
print(test_df.shape)

train_df.describe()

In [None]:
train_df.head(5)
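
The shapes printed above will usually be close to, but not exactly, 90% and 10% of the rows, because `random_split` divides the records approximately by the given percentage. The sketch below illustrates one way such a split can behave, assuming each row is assigned independently with a seeded random generator. It is not the AML implementation, and `approximate_random_split` is a hypothetical name.

```python
import random


def approximate_random_split(rows, percentage, seed=None):
    """Assign each row to the first part with probability `percentage`.

    The part sizes are only approximately percentage / (1 - percentage),
    but the same seed always reproduces the same split.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < percentage:
            first.append(row)
        else:
            second.append(row)
    return first, second


train_rows, test_rows = approximate_random_split(range(1000), 0.9, seed=0)
print(len(train_rows), len(test_rows))
```

Fixing the seed, as the notebook does with `seed=0`, keeps the train and test sets stable across reruns.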

## Connect to Remote AML Compute (Existing AML cluster)

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget
# Define remote compute target to use
# Further docs on Remote Compute Target: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-remote

# Choose a name for your cluster.
amlcompute_cluster_name = "cpu-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
     found = True
     print('Found existing training cluster.')
     # Get existing cluster
     # Method 1:
     aml_remote_compute = cts[amlcompute_cluster_name]
     # Method 2:
     # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)
    
if not found:
     print('Creating a new training cluster...')
     provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D13_V2", # for GPU, use "STANDARD_NC12"
                                                                 #vm_priority = 'lowpriority', # optional
                                                                 max_nodes = 5)
     # Create the cluster.
     aml_remote_compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
aml_remote_compute.wait_for_completion(show_output = True, min_node_count = 0, timeout_in_minutes = 20)

In [None]:
# For additional details of current AmlCompute status:
aml_remote_compute.get_status()

## Train with Azure AutoML automatically searching for the 'best model' (Best algorithms and best hyper-parameters)

### List and select primary metric to drive the AutoML classification problem

In [None]:
from azureml.train import automl

# List of possible primary metrics is here:
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric
    
# Get a list of valid metrics for your given task
automl.utilities.get_primary_metrics('classification')

## Define AutoML Experiment settings

In [None]:
import logging

# You can provide additional settings as a **kwargs parameter for the AutoMLConfig object
automl_settings = {
      "blacklist_models":['LogisticRegression', 'ExtremeRandomTrees', 'RandomForest'], 
      # "whitelist_models": ['LightGBM'],
      # "n_cross_validations": 5,
      # "validation_data": test_dataset,   # Better to holdout the Test Dataset
      "experiment_exit_score": 0.7
}

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='classification',
                             primary_metric='AUC_weighted',                           
                             training_data=train_dataset, # AML Dataset
                             label_column_name="target",                                                    
                             enable_early_stopping= True,
                             iterations=20,
                             max_concurrent_iterations=5,
                             experiment_timeout_hours=1,                           
                             featurization= 'auto',   # 'auto' or 'off'. All feature columns in this dataset are numeric, so little featurization is needed.
                             debug_log='automated_ml_errors.log',
                             verbosity= logging.INFO,
                             model_explainability=True,
                             enable_onnx_compatible_models=False,
                             **automl_settings
                             )

# Explanation of Settings: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#configure-your-experiment-settings

# AutoMLConfig info on: 
# https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig
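
The configuration above combines two stopping conditions: a cap of `iterations=20` and `experiment_exit_score=0.7`. The toy loop below sketches how such conditions interact for a metric that is maximized. It is a sequential simplification, not AutoML's scheduler (which also runs child runs concurrently and applies early stopping and timeouts); the function name `run_until_exit` and the scores are made up for illustration.

```python
from typing import Iterable, Optional, Tuple


def run_until_exit(scores: Iterable[float],
                   max_iterations: int,
                   exit_score: Optional[float] = None) -> Tuple[int, float]:
    """Return (iterations_used, best_score) for a toy sequential search.

    Stops when the iteration cap is hit or the best score so far
    reaches `exit_score`, whichever comes first.
    """
    best = float("-inf")
    used = 0
    for score in scores:
        if used >= max_iterations:
            break
        used += 1
        best = max(best, score)
        if exit_score is not None and best >= exit_score:
            break
    return used, best


# Made-up AUC values, one per hypothetical child run
toy_scores = [0.61, 0.64, 0.71, 0.73, 0.69]
print(run_until_exit(toy_scores, max_iterations=20, exit_score=0.7))
```

With these toy values the loop stops at the third run, the first one to reach 0.7, even though a later run would have scored higher. That is the trade-off an exit score makes: less compute in exchange for possibly missing a better model.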

## Run Experiment (on remote AML Compute) with multiple child runs under the covers

In [None]:
from azureml.core import Experiment

experiment_name = "SDK_remote_porto_seguro_driver_pred"
print(experiment_name)

experiment = Experiment(workspace=ws, 
                        name=experiment_name)

import time
start_time = time.time()
            
run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s minutes needed for running the whole Remote AutoML Experiment ---' % ((time.time() - start_time)/60))


## Explore results with Widget

In [None]:
# Explore the results of automatic training with a Jupyter widget: https://docs.microsoft.com/en-us/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py
from azureml.widgets import RunDetails
RunDetails(run).show()

### Measure Parent Run Time needed for the whole AutoML process 

In [None]:
from datetime import datetime

run_details = run.get_details()

# Timestamps look like: 2020-01-12T23:11:56.292703Z
time_format = "%Y-%m-%dT%H:%M:%S"
end_time_utc_str = run_details['endTimeUtc'].split(".")[0]
start_time_utc_str = run_details['startTimeUtc'].split(".")[0]

# Subtract the datetimes directly; time.mktime would interpret these UTC values as local time
end_time = datetime.strptime(end_time_utc_str, time_format)
start_time = datetime.strptime(start_time_utc_str, time_format)

parent_run_time = (end_time - start_time).total_seconds()
print('Run Timing: --- %s minutes needed for running the whole Remote AutoML Experiment ---' % (parent_run_time/60))

## Retrieve the 'Best' Model

In [None]:
best_run, fitted_model = run.get_output()
print(best_run)
print('--------')
print(fitted_model)

## Register Model in Workspace model registry

In [None]:
registered_model = run.register_model(model_name='porto-seg-automl-remote-compute', 
                                      description='Porto Seguro Model from plain AutoML in remote AML compute')

print(run.model_id)
registered_model

## See files associated with the 'Best run'

In [None]:
print(best_run.get_file_names())

# best_run.download_file('azureml-logs/70_driver_log.txt')

## Make Predictions and calculate metrics

### Prep Test Data: Extract X values (feature columns) and y values (labels) from the test dataset for predicting

In [None]:
import pandas as pd

x_test_df = test_df.copy()

if 'target' in x_test_df.columns:
    y_test_df = x_test_df.pop('target')

print(test_df.shape)
print(x_test_df.shape)
print(y_test_df.shape)

In [None]:
y_test_df.describe()

### Make predictions in bulk

In [None]:
# Try the best model making predictions with the test dataset
y_predictions = fitted_model.predict(x_test_df)

print('30 predictions: ')
print(y_predictions[:30])

### Get all the predictions' probabilities needed to calculate ROC AUC

In [None]:
class_probabilities = fitted_model.predict_proba(x_test_df)
print(class_probabilities.shape)

print('Some class probabilities...: ')
print(class_probabilities[:3])

print('Probabilities for class 1:')
print(class_probabilities[:,1])

print('Probabilities for class 0:')
print(class_probabilities[:,0])
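
Each row of `class_probabilities` holds one probability per class, with column 0 for class 0 and column 1 for class 1. For many scikit-learn style classifiers, the label returned by `predict` is the class with the highest probability in that row. The sketch below shows that relationship in plain Python. Treat it as an assumption about the fitted AutoML pipeline rather than a guarantee; `labels_from_probabilities` is a hypothetical helper.

```python
def labels_from_probabilities(probabilities, classes=(0, 1)):
    """Pick, for each row, the class whose probability is highest.

    Ties go to the class listed first in `classes`.
    """
    labels = []
    for row in probabilities:
        best_index = max(range(len(row)), key=lambda i: row[i])
        labels.append(classes[best_index])
    return labels


# Made-up probabilities for three rows: [P(class 0), P(class 1)]
toy_probabilities = [[0.96, 0.04], [0.30, 0.70], [0.55, 0.45]]
print(labels_from_probabilities(toy_probabilities))
```

The ROC AUC calculation uses the class 1 column directly rather than these hard labels, because it needs a ranking of the rows, not a yes/no decision.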

## Evaluate Model

Evaluating performance is an essential task in machine learning. Because this is a binary classification problem, the data scientist elected to use the area under the ROC curve (ROC AUC). The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate across classification thresholds, and the AUC (Area Under the Curve) summarizes that curve in a single number: 1.0 means a perfect ranking and 0.5 is no better than random guessing. It is one of the most widely used evaluation metrics for checking a classification model's performance.

<img src="https://www.researchgate.net/profile/Oxana_Trifonova/publication/276079439/figure/fig2/AS:614187332034565@1523445079168/An-example-of-ROC-curves-with-good-AUC-09-and-satisfactory-AUC-065-parameters.png"
     alt="An example of ROC curves with good (AUC 0.9) and satisfactory (AUC 0.65) parameters"
     style="float: left; margin-right: 12px; width: 320px; height: 239px;" />
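
To make the metric concrete, the sketch below computes ROC AUC from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting as half. It is a plain-Python illustration that is quadratic in the number of rows, so it is only meant for small inputs; the notebook itself uses scikit-learn for the real calculation. The function name `pairwise_roc_auc` is hypothetical.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute ROC AUC")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Small made-up example: 3 of the 4 positive/negative pairs are ordered correctly
print(pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This view explains why only the class 1 probabilities are needed: the metric depends on how the rows are ordered, not on any particular decision threshold.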

### Calculate the ROC AUC with probabilities vs. the Test Dataset

In [None]:
print('ROC AUC *method 1*:')
fpr, tpr, thresholds = metrics.roc_curve(y_test_df, class_probabilities[:,1])
metrics.auc(fpr, tpr)

In [None]:
from sklearn.metrics import roc_auc_score

print('ROC AUC *method 2*:')
print(roc_auc_score(y_test_df, class_probabilities[:,1]))

print('ROC AUC Weighted:')
# For binary labels, scikit-learn ignores the `average` argument, so this matches method 2
print(roc_auc_score(y_test_df, class_probabilities[:,1], average='weighted'))
# AUC with plain LightGBM was: 0.6374553321494826 

### Calculate the Accuracy with predictions vs. the Test Dataset

In [None]:
from sklearn.metrics import accuracy_score

print('Accuracy:')
print(accuracy_score(y_test_df, y_predictions))


### Load model in memory

#### (Option A: Load from model .pkl file)

In [None]:
# Load the model into memory from downloaded file
import joblib

fitted_model = joblib.load('model.pkl')
print(fitted_model)

#### (Option B: Load from model registry in Workspace)

In [None]:
# Load model from model registry in Workspace
from azureml.core.model import Model

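# model_from_reg = Model(ws, 'porto-seg-automl-remote-compute')

# Model registered earlier in this notebook
name_model_from_plain_automl = 'porto-seg-automl-remote-compute'
# Model name presumably registered by a separate AutoMLStep pipeline notebook; use it only if that model exists in the Workspace
name_model_from_pipeline_automlstep = 'porto-model-from-automlstep'

model_path = Model.get_model_path(name_model_from_plain_automl, _workspace=ws)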
fitted_model = joblib.load(model_path)
print(fitted_model)

## Try model inference with hardcoded input data for the model to predict

In [None]:
# Data from Dataframe for comparison with hardcoded data
# x_test_df.head(1)

In [None]:
# Data from Dataframe for comparison with hardcoded data
# print(x_test_df.head(1).values)
# print(x_test_df.head(1).columns)

In [None]:
import json

raw_data = json.dumps({
     'data': [[20,2,1,3,1,0,0,1,0,0,0,0,0,0,0,8,1,0,0,0.6,0.1,0.61745445,6,1,-1,0,1,11,1,1,0,1,99,2,0.31622777,0.6396829,0.36878178,3.16227766,0.2,0.6,0.5,2,2,8,1,8,3,10,3,0,0,10,0,1,0,0,1,0]],
     'method': 'predict'  # If you have a classification model, you can get probabilities by changing this to 'predict_proba'.
 })

print(json.loads(raw_data)['data'])

numpy_data = np.array(json.loads(raw_data)['data'])

df_data = pd.DataFrame(data=numpy_data, columns=['id', 'ps_ind_01', 'ps_ind_02_cat', 'ps_ind_03', 'ps_ind_04_cat',
                                               'ps_ind_05_cat', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin',
                                               'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin',
                                               'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15', 'ps_ind_16_bin',
                                               'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03',
                                               'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat',
                                               'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat',
                                               'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat', 'ps_car_11',
                                               'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15', 'ps_calc_01',
                                               'ps_calc_02', 'ps_calc_03', 'ps_calc_04', 'ps_calc_05', 'ps_calc_06',
                                               'ps_calc_07', 'ps_calc_08', 'ps_calc_09', 'ps_calc_10', 'ps_calc_11',
                                               'ps_calc_12', 'ps_calc_13', 'ps_calc_14', 'ps_calc_15_bin',
                                               'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin',
                                               'ps_calc_20_bin'])
df_data

In [None]:
# Get predictions from the model
y_predictions = fitted_model.predict(df_data) # x_test_df.head(1)
y_predictions # Should return a [0] or [1] depending on the prediction result
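
A hardcoded row like the one above is easy to get wrong by one value, which silently shifts every later feature into the wrong column. As a minimal sketch, the payload can be validated against the column list before it is handed to the model. The helper `payload_to_records` is hypothetical and uses only the standard library.

```python
import json


def payload_to_records(raw_json, columns):
    """Parse a {'data': [[...], ...]} payload into a list of column->value dicts.

    Raises ValueError if any row does not have exactly one value per column.
    """
    payload = json.loads(raw_json)
    records = []
    for index, row in enumerate(payload["data"]):
        if len(row) != len(columns):
            raise ValueError(
                "Row %d has %d values but %d columns were expected"
                % (index, len(row), len(columns))
            )
        records.append(dict(zip(columns, row)))
    return records


# Tiny made-up payload with three columns instead of the full feature list
toy_columns = ["id", "ps_ind_01", "ps_reg_01"]
toy_payload = json.dumps({"data": [[20, 2, 0.6]], "method": "predict"})
print(payload_to_records(toy_payload, toy_columns))
```

The resulting list of dictionaries can be passed to `pd.DataFrame(records)`, which names the columns from the dictionary keys.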