# Porto Seguro's Safe Driving Prediction (AutoML Local Compute)

This notebook reproduces a possible bug in AutoML on local compute when training from a Pandas DataFrame:

#### RELATED BUG:
https://msdata.visualstudio.com/Vienna/_workitems/edit/583733

## Import Needed Packages

Import the packages needed for this notebook. Among the most widely used Python packages for machine learning and data analysis are [scikit-learn](https://scikit-learn.org/stable/), [pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started), and [numpy](https://numpy.org/). scikit-learn provides a large collection of clustering, regression, and classification algorithms, while pandas and numpy handle tabular data and numerical arrays, which makes the combination a good choice for data mining and data analysis.

In [7]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import joblib
from sklearn import metrics

## Check Azure ML SDK version

In [8]:
import azureml.core
print("This notebook was created and tested using version 1.2.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

This notebook was created and tested using version 1.2.0 of the Azure ML SDK
You are currently using version 1.3.0 of the Azure ML SDK


## Get Azure ML Workspace to use

In [9]:
from azureml.core import Workspace, Dataset

# Get the Workspace defined in the default config.json file
ws = Workspace.from_config()
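
`Workspace.from_config()` reads the workspace coordinates from a `config.json` file. As a minimal sketch of what that file contains, the snippet below writes one with the three expected keys. All three values are placeholders, not real identifiers, and the file is written to a temporary folder so that an existing `config.json` is not overwritten.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values (hypothetical): replace with your own subscription,
# resource group, and workspace name.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write to a temporary folder so an existing config.json is not overwritten.
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=2))

print(config_path.read_text())
```

If the file lives somewhere other than the working directory, its location can be passed through the `path` argument of `Workspace.from_config()`.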

## Load data into Azure ML Dataset and Register into Workspace

In [10]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file at the HTTP URL
aml_dataset_name = "porto_seguro_safe_driver_prediction_train"

if aml_dataset_name in ws.datasets.keys():
    aml_dataset = ws.datasets[aml_dataset_name]
    print("Dataset loaded from the Workspace")
else:
    # Create AML Dataset and register it into the Workspace
    print("Dataset does not exist in the current Workspace. It will be imported and registered.")

    # Create AML Dataset from the file at the HTTP URL
    data_url = 'https://azmlworkshopdata.blob.core.windows.net/safedriverdata/porto_seguro_safe_driver_prediction_train.csv'
    aml_dataset = Dataset.Tabular.from_delimited_files(data_url)
    data_origin_type = 'HttpUrl'

    print(aml_dataset)

    # Register Dataset in Workspace
    registration_method = 'SDK'  # or 'UI'
    aml_dataset = aml_dataset.register(workspace=ws,
                                       name=aml_dataset_name,
                                       description='Porto Seguro Safe Driver Prediction Train dataset file',
                                       tags={'Registration-Method': registration_method, 'Data-Origin-Type': data_origin_type},
                                       create_new_version=True)

    print("Dataset created from file and registered in the Workspace")

Dataset loaded from the Workspace


In [11]:
# Use a Pandas DataFrame just to take a sneak peek at some data and the schema
data_df = aml_dataset.to_pandas_dataframe()
# print(data_df.describe())

print(data_df.shape)
data_df.head(5)

(595212, 59)


Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,...,3,1,1,3,0,0,0,1,1,0


## Split Data into Train and Test Sets using AML Dataset random_split()

In [14]:
# Split into train/test datasets (Train=90%, Test=10%, approximately)

train_dataset, test_dataset = aml_dataset.random_split(0.9, seed=0)

# Use Pandas DF only to check the data
train_df = train_dataset.to_pandas_dataframe()
test_df = test_dataset.to_pandas_dataframe()

print(train_df.shape)
print(test_df.shape)

train_df.describe()

(535850, 59)
(59362, 59)


Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
count,535850.0,535850.0,535850.0,535850.0,535850.0,535850.0,535850.0,535850.0,535850.0,535850.0,...,535850.0,535850.0,535850.0,535850.0,535850.0,535850.0,535850.0,535850.0,535850.0,535850.0
mean,743612.9,0.036529,1.89961,1.359192,4.422809,0.416652,0.404369,0.394075,0.257141,0.16405,...,5.442555,1.44078,2.871871,7.537524,0.122517,0.627788,0.554633,0.287006,0.349128,0.153354
std,429421.6,0.187602,1.983225,0.664765,2.700265,0.493273,1.34878,0.488652,0.437058,0.370321,...,2.334103,1.202311,1.695257,2.746149,0.327883,0.483395,0.497007,0.452365,0.476695,0.360329
min,7.0,0.0,0.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,371606.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,...,4.0,1.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,743346.5,0.0,1.0,1.0,4.0,0.0,0.0,0.0,0.0,0.0,...,5.0,1.0,3.0,7.0,0.0,1.0,1.0,0.0,0.0,0.0
75%,1115294.0,0.0,3.0,2.0,6.0,1.0,0.0,1.0,1.0,0.0,...,7.0,2.0,4.0,9.0,0.0,1.0,1.0,1.0,1.0,0.0
max,1488027.0,1.0,7.0,4.0,11.0,1.0,6.0,1.0,1.0,1.0,...,19.0,10.0,13.0,23.0,1.0,1.0,1.0,1.0,1.0,1.0
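
The split above is purely random, and the `target` column is heavily imbalanced (its mean is about 0.0365). As a hedged alternative sketch, not what this notebook does, a stratified split with scikit-learn's `train_test_split` keeps the share of positives the same in both parts. The frame below is synthetic stand-in data with made-up values; `toy_df`, `toy_train`, and `toy_test` are illustrative names only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 3.65% positives
# (assumption taken from the mean of `target` in the table above).
rng = np.random.default_rng(0)
n_rows = 20_000
toy_df = pd.DataFrame({
    "feature": rng.normal(size=n_rows),
    "target": (rng.random(n_rows) < 0.0365).astype(int),
})

# stratify keeps the proportion of positives the same in train and test.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.1, random_state=0, stratify=toy_df["target"]
)

print(toy_train.shape, toy_test.shape)
print("positive rate (train):", toy_train["target"].mean())
print("positive rate (test): ", toy_test["target"].mean())
```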


In [15]:
test_df.head(5)

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,20,0,2,1,3,1,0,0,1,0,...,3,0,0,10,0,1,0,0,1,0
1,74,0,2,1,2,1,0,0,1,0,...,7,1,3,9,0,1,0,1,0,0
2,78,0,0,1,7,0,0,1,0,0,...,6,4,4,4,0,0,1,1,0,1
3,89,0,0,1,6,0,0,1,0,0,...,5,3,1,11,0,1,0,0,1,0
4,111,0,1,1,9,0,0,1,0,0,...,0,0,2,5,0,1,1,0,1,1


In [16]:
train_df.head(5)

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,...,3,1,1,3,0,0,0,1,1,0


## Train with Azure AutoML automatically searching for the 'best model' (Best algorithms and best hyper-parameters)

## Define AutoML Experiment settings

IMPORTANT: Use one or the other of the following options to reproduce the bug:

- Pandas DataFrame: set `training_data=train_df` (reproduces the bug; the AUC in the custom calculation at the end of the notebook will be around 0.49)
- AML Dataset: set `training_data=train_dataset` (works as expected; the AUC in the custom calculation at the end of the notebook will be around 0.63)

In [17]:
import logging

# You can provide additional settings as a **kwargs parameter for the AutoMLConfig object
automl_settings = {
      "blacklist_models":['LogisticRegression', 'ExtremeRandomTrees', 'RandomForest'], 
      # "whitelist_models": ['LightGBM'],
      # "n_cross_validations": 5,
      # "validation_data": test_df,  # Better to holdout the Test Dataset
      "experiment_exit_score": 0.7
}

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',                           
                             training_data=train_df,  # Pandas DataFrame (reproduces the bug); use train_dataset for the AML Dataset
                             label_column_name="target",                                                    
                             enable_early_stopping= True,
                             iterations=10,
                             experiment_timeout_hours=1,  # Enforced: Cannot be less than 1h due to the size of this dataset (Cols*Rows)                         
                             featurization= 'auto',        # (auto/off) All feature columns in this dataset are numbers, no need to featurize. 
                             debug_log='automated_ml_errors.log',
                             verbosity= logging.INFO,
                             model_explainability=True,
                             enable_onnx_compatible_models=False,
                             **automl_settings
                             )

# Explanation of Settings: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#configure-your-experiment-settings

# AutoMLConfig info on: 
# https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig

## Run Experiment with multiple child runs under the covers

In [None]:
from azureml.core import Experiment

experiment_name = "SDK_local_porto_seguro_driver_pred"
print(experiment_name)

experiment = Experiment(workspace=ws, 
                        name=experiment_name)

import time
start_time = time.time()

run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s minutes needed for running the whole local AutoML Experiment ---' % ((time.time() - start_time)/60))

SDK_local_porto_seguro_driver_pred
Running on local machine
Parent Run ID: AutoML_c7ab11fc-4b09-4c48-b33f-9910a5f7716a

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Train-Test data split
STATUS:       DONE
DESCRIPTION:  Your input data has been split into a training dataset and a holdout test dataset for validation of the model. The test holdout dataset reflects the original distribution of your input data.
PARAMETERS:   Dataset : train, Row counts : 482265, Percentage : 90.0
              Dataset : t

## Explore results with Widget

In [None]:
# Explore the results of automatic training with a Jupyter widget: https://docs.microsoft.com/en-us/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py
from azureml.widgets import RunDetails
RunDetails(run).show()

## Retrieve the 'Best' Model

In [None]:
best_run, fitted_model = run.get_output()
print(best_run)
print('--------')
print(fitted_model)

## Make Predictions and calculate metrics

### Prep Test Data: Separate the X values (feature columns) from the label column for predicting

In [14]:
import pandas as pd

x_test_df = test_df.copy()

if 'target' in x_test_df.columns:
    y_test_df = x_test_df.pop('target')

print(test_df.shape)
print(x_test_df.shape)
print(y_test_df.shape)


(59374, 59)
(59374, 58)
(59374,)
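
In the cell above, `y_test_df` is only defined when the `target` column is present. A minimal sketch of a slightly more defensive variant follows; `split_features_label` is a hypothetical helper, not part of the notebook or of any library, and the small frame uses made-up values.

```python
import pandas as pd

def split_features_label(df: pd.DataFrame, label_column: str = "target"):
    """Return (features, labels) without modifying df; fail loudly if the label is missing."""
    if label_column not in df.columns:
        raise KeyError(f"Label column '{label_column}' not found in DataFrame")
    features = df.drop(columns=[label_column])
    labels = df[label_column]
    return features, labels

# Tiny illustrative frame (made-up values, not rows from the real dataset).
toy_test_df = pd.DataFrame({"id": [1, 2, 3], "ps_ind_01": [2, 1, 5], "target": [0, 0, 1]})
x_toy, y_toy = split_features_label(toy_test_df)
print(x_toy.shape, y_toy.shape)
```

Raising an error instead of silently skipping the split makes a missing label column visible right away, rather than as a `NameError` in a later cell.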


In [15]:
y_test_df.describe()

count   59374.00
mean        0.04
std         0.19
min         0.00
25%         0.00
50%         0.00
75%         0.00
max         1.00
Name: target, dtype: float64

### Make predictions in bulk

In [16]:
# Try the best model making predictions with the test dataset
y_predictions = fitted_model.predict(x_test_df)

print('30 predictions: ')
print(y_predictions[:30])

30 predictions: 
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


### Get all the predictions' probabilities needed to calculate ROC AUC

In [17]:
class_probabilities = fitted_model.predict_proba(x_test_df)
print(class_probabilities.shape)

print('Some class probabilities...: ')
print(class_probabilities[:3])

print('Probabilities for class 1:')
print(class_probabilities[:,1])

print('Probabilities for class 0:')
print(class_probabilities[:,0])

(59374, 2)
Some class probabilities...: 
[[0.83451319 0.16548681]
 [0.92256433 0.07743567]
 [0.78280498 0.21719502]]
Probabilities for class 1:
[0.16548681 0.07743567 0.21719502 ... 0.52304236 0.22136817 0.22178124]
Probabilities for class 0:
[0.83451319 0.92256433 0.78280498 ... 0.47695764 0.77863183 0.77821876]
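
Each row of `class_probabilities` holds the probability of class 0 and of class 1, and the two sum to 1. The sketch below reuses four of the rows printed above to show how a decision threshold on the class 1 column turns probabilities into labels. It assumes a plain fixed threshold; the fitted AutoML model may apply its own decision rule inside `predict()`.

```python
import numpy as np

# Four rows copied from the output above: [P(class 0), P(class 1)].
sample_probabilities = np.array([
    [0.83451319, 0.16548681],
    [0.92256433, 0.07743567],
    [0.78280498, 0.21719502],
    [0.47695764, 0.52304236],
])

# Each row is a probability distribution over the two classes.
row_sums = sample_probabilities.sum(axis=1)

positive_probability = sample_probabilities[:, 1]

# Assumed thresholds for illustration: the common default 0.5 and a lower 0.2.
labels_at_050 = (positive_probability >= 0.5).astype(int)
labels_at_020 = (positive_probability >= 0.2).astype(int)

print(row_sums)
print(labels_at_050)
print(labels_at_020)
```

Lowering the threshold flags more rows as positive, which is why threshold-independent metrics such as ROC AUC are computed from the probabilities rather than from the predicted labels.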


## Evaluate Model

Evaluating performance is an essential task in machine learning. Because this is a binary classification problem, the data scientist elected to use the area under the ROC curve (ROC AUC). The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate across all decision thresholds, and the AUC (Area Under the Curve) summarizes that curve as a single number: 1.0 means the model ranks every positive above every negative, while 0.5 is no better than random guessing. It is one of the most widely used metrics for evaluating a classification model's performance.

<img src="https://www.researchgate.net/profile/Oxana_Trifonova/publication/276079439/figure/fig2/AS:614187332034565@1523445079168/An-example-of-ROC-curves-with-good-AUC-09-and-satisfactory-AUC-065-parameters.png"
     alt="Example ROC curves with good (AUC 0.9) and satisfactory (AUC 0.65) parameters"
     style="float: left; margin-right: 12px; width: 320px; height: 239px;" />
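
One way to read the ROC AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on four made-up points by comparing scikit-learn's `roc_auc_score` with a direct pairwise count; the labels and scores are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc_sklearn = roc_auc_score(y_true, y_score)

positive_scores = y_score[y_true == 1]
negative_scores = y_score[y_true == 0]

# Compare every positive with every negative; ties count as half a win.
greater = (positive_scores[:, None] > negative_scores[None, :]).sum()
ties = (positive_scores[:, None] == negative_scores[None, :]).sum()
auc_pairwise = (greater + 0.5 * ties) / (len(positive_scores) * len(negative_scores))

print(auc_sklearn, auc_pairwise)  # both 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```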

### Calculate the ROC AUC with probabilities vs. the Test Dataset

In [18]:
print('ROC AUC *method 1*:')
fpr, tpr, thresholds = metrics.roc_curve(y_test_df, class_probabilities[:,1])
metrics.auc(fpr, tpr)


ROC AUC *method 1*:


0.5145758345189049

In [19]:
from sklearn.metrics import roc_auc_score

print('ROC AUC *method 2*:')
print(roc_auc_score(y_test_df, class_probabilities[:,1]))

print('ROC AUC Weighted:')
print(roc_auc_score(y_test_df, class_probabilities[:,1], average='weighted'))

# ********** THIS IS THE BUG when training with Pandas DataFrame ***********
# AUC should be around 0.63 instead of 0.49 or 0.5


ROC AUC *method 2*:
0.5145758345189049
ROC AUC Weighted:
0.5145758345189049
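
The three values above are identical for a reason: `roc_curve` followed by `auc` computes the same quantity as `roc_auc_score`, and scikit-learn ignores the `average` argument when the target is binary. The sketch below illustrates both points on synthetic data in which the scores are deliberately unrelated to the labels, so the AUC lands near 0.5, the same range as the result of the bug reproduction.

```python
import numpy as np
from sklearn import metrics

# Synthetic data: the scores carry no information about the labels.
rng = np.random.default_rng(42)
toy_labels = rng.integers(0, 2, size=1000)
toy_scores = rng.random(1000)

# Method 1: build the ROC curve, then integrate it.
fpr, tpr, _ = metrics.roc_curve(toy_labels, toy_scores)
auc_method_1 = metrics.auc(fpr, tpr)

# Method 2: compute the area directly.
auc_method_2 = metrics.roc_auc_score(toy_labels, toy_scores)

# For a binary target the `average` argument has no effect.
auc_weighted = metrics.roc_auc_score(toy_labels, toy_scores, average="weighted")

print(auc_method_1, auc_method_2, auc_weighted)
```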


### Calculate the Accuracy with predictions vs. the Test Dataset

In [20]:
from sklearn.metrics import accuracy_score

print('Accuracy:')
print(accuracy_score(y_test_df, y_predictions))


Accuracy:
0.8500690537945902
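
Accuracy needs care on a dataset this imbalanced. With only about 4% positives (the mean of `y_test_df` shown earlier), a model that always predicts class 0 already scores roughly 0.96, which is higher than the 0.85 reported above. The sketch below demonstrates this baseline on synthetic labels; the 4% rate is taken from the notebook output, everything else is made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic labels with about 4% positives (assumption based on the
# mean of y_test_df shown earlier in the notebook).
rng = np.random.default_rng(0)
toy_y_true = (rng.random(50_000) < 0.04).astype(int)

# A "model" that always predicts the majority class 0.
toy_y_always_zero = np.zeros_like(toy_y_true)

baseline_accuracy = accuracy_score(toy_y_true, toy_y_always_zero)

print("Positive rate:    ", toy_y_true.mean())
print("Baseline accuracy:", baseline_accuracy)

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(toy_y_true, toy_y_always_zero))
```

The baseline accuracy equals one minus the positive rate, so for this problem the ROC AUC is the more informative of the two metrics.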
