# AutoML - distributed training for classification

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Train](#Train)
1. [Featurization transparency and model explanation](#Featurization-transparency-and-model-explanation)
1. [Deploy](#Deploy)
1. [Test](#Test)
1. [Acknowledgements](#Acknowledgements)

## Introduction

In this example we use the associated credit card dataset to showcase how you can use AutoML distributed training for a simple classification problem. The goal is to predict if a credit card transaction is considered a fraudulent charge.

This notebook uses multiple remote compute nodes to train the model.

In this notebook you will learn how to:
1. Create an experiment using an existing workspace.
2. Configure AutoML using `AutoMLConfig` for distributed training.
3. Train the model using multiple remote compute nodes.

## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import json
import logging

from matplotlib import pyplot as plt
import pandas as pd
import os

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.train.automl import AutoMLConfig
from azureml.interpret import ExplanationClient

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

In [None]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = "automl-distributed-training-classification"

experiment = Experiment(ws, experiment_name)

output = {}
output["Subscription ID"] = ws.subscription_id
output["Workspace"] = ws.name
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
output["Experiment Name"] = experiment.name
output["SDK Version"] = azureml.core.VERSION
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

## Create or Attach existing AmlCompute
A compute target is required to execute the Automated ML run. In this tutorial, you create AmlCompute as your training compute resource.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

#### Creation of AmlCompute takes approximately 5 minutes. 
If the AmlCompute with that name is already in your workspace this code will skip the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "automl-distributed-training"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS12_V2", max_nodes=8
    )
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
compute_target.wait_for_completion(show_output=True)

## Data

### Load data

The training data used in this notebook is not large, so the dataset is constructed dynamically from public CSV files.
For large data, we recommend registering the dataset in your workspace before running this notebook and using the `Dataset.get_by_name()` API to retrieve the training data, as shown in the commented code below.


In [None]:
from azureml.data import DataType
data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard.csv"
training_data = Dataset.Tabular.from_delimited_files(data, set_column_types={'Time': DataType.to_float()})

# optional - validation data. Specify None for auto splitting, or specify validation_size to control the auto-split size
data_validate = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard_validate.csv"
validation_data = Dataset.Tabular.from_delimited_files(data_validate, set_column_types={'Time': DataType.to_float()})

# optional - test data. Specify None for auto splitting, or specify test_size to control the auto-split size
data_test = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard_test.csv"
test_data = Dataset.Tabular.from_delimited_files(data_test, set_column_types={'Time': DataType.to_float()})

label_column_name = "Class"

# Please register your large dataset in the workspace with proper types configured for each column,
# and then replace the lines above with the following code
# training_data = Dataset.get_by_name(ws, "name-of-training-dataset")
# validation_data = Dataset.get_by_name(ws, "name-of-validation-dataset")
# test_data = Dataset.get_by_name(ws, "name-of-test-dataset")
# label_column_name = "name-of-label-column"


## Train

Instantiate an `AutoMLConfig` object. Besides the properties shown in the sample, specify the following properties to configure distributed training properly:

|Property|Description|
|-|-|
|**use_distributed**|Specify `True` to enable distributed training. The default is `False`.|
|**max_nodes**|Specify how many nodes to use for this job. Make sure the compute cluster you specify can allocate this number of nodes.|
|**allowed_models**|Specify LightGBM, as that is the only algorithm supported. Specifying this will not be necessary in upcoming versions.|
|**experiment_timeout_hours**|Specify at least a few hours. Training on large data takes more time.|
|**validation_data**|Consider providing validation data. Specify `None` for auto splitting, or specify `validation_size` to control the auto-split size.|
|**test_data**|Consider providing test data. Specify `None` for auto splitting, or specify `test_size` to control the auto-split size.|
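
The constraints in the table can be checked before a job is submitted. The following is a minimal sketch in plain Python, not part of the Azure ML SDK: the helper `check_distributed_settings` and its `cluster_max_nodes` argument are hypothetical names introduced here for illustration, and the rules it enforces are only the ones stated in the table above.

```python
def check_distributed_settings(settings, cluster_max_nodes):
    """Return a list of problems found; an empty list means the settings look consistent.

    Hypothetical helper for illustration only; it is not an Azure ML API.
    """
    problems = []

    # Distributed training has to be switched on explicitly (the default is False)
    if settings.get("use_distributed") is not True:
        problems.append("use_distributed must be True")

    # The job cannot use more nodes than the cluster can allocate
    max_nodes = settings.get("max_nodes")
    if not isinstance(max_nodes, int) or max_nodes < 1:
        problems.append("max_nodes must be a positive integer")
    elif max_nodes > cluster_max_nodes:
        problems.append(
            f"max_nodes={max_nodes} exceeds the cluster limit of {cluster_max_nodes}"
        )

    # LightGBM is the only algorithm the table lists as supported
    if settings.get("allowed_models") != ["LightGBM"]:
        problems.append("allowed_models must be ['LightGBM']")

    return problems


settings = {
    "use_distributed": True,
    "max_nodes": 8,
    "allowed_models": ["LightGBM"],
    "experiment_timeout_hours": 24,
}

issues_ok = check_distributed_settings(settings, cluster_max_nodes=8)
issues_bad = check_distributed_settings({**settings, "max_nodes": 16}, cluster_max_nodes=8)
print(issues_ok)
print(issues_bad)
```

Running a check like this locally is cheaper than discovering a mismatch after the remote job has been queued.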


In [None]:
automl_settings = {
    "use_distributed": True,
    "max_nodes": 8,
    "allowed_models": ["LightGBM"],
    "experiment_timeout_hours": 24,
    "primary_metric": "average_precision_score_weighted",
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    task="classification",
    debug_log="automl_errors.log",
    compute_target=compute_target,
    training_data=training_data,
    validation_data=validation_data,
    test_data=test_data, 
    label_column_name=label_column_name,
    iterations=10,
    **automl_settings,
)

Call the `submit` method on the experiment object and pass the run configuration. Depending on the data and the number of iterations this can run for a while. Validation errors and current status will be shown when setting `show_output=True` and the execution will be synchronous.

In [None]:
remote_run = experiment.submit(automl_config, show_output=True)

In [None]:
# Retrieve the best Run object
best_run = remote_run.get_best_child()

## Featurization transparency and model explanation

View the featurization summary for the best model to study how different features were transformed.

In [None]:
# Download the featurization summary JSON file locally
best_run.download_file(
    "outputs/featurization_summary.json", "featurization_summary.json"
)

# Render the JSON as a pandas DataFrame
with open("featurization_summary.json", "r") as f:
    records = json.load(f)

pd.DataFrame.from_records(records)

Retrieve the explanation from the best_run which includes explanations for engineered features and raw features. Make sure that the run for generating explanations for the best model is completed.

In [None]:
# Wait for the best model explanation run to complete
from azureml.core.run import Run

model_explainability_run_id = remote_run.id + "_" + "ModelExplain"
print(model_explainability_run_id)
model_explainability_run = Run(
    experiment=experiment, run_id=model_explainability_run_id
)
model_explainability_run.wait_for_completion()

client = ExplanationClient.from_run(best_run)
# raw=True returns importances for the raw (original) features;
# pass raw=False to get importances for the engineered features instead
raw_explanations = client.download_model_explanation(raw=True)
exp_data = raw_explanations.get_feature_importance_dict()
exp_data
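
Because the feature importance dictionary maps feature names to scores, it is easy to reduce it to the few most influential features. The sketch below uses plain Python and does not call the SDK. The helper `top_features` is a hypothetical name, and the importance values are made up for illustration (only the column names come from the credit card dataset); real values come from `get_feature_importance_dict()`.

```python
def top_features(importance, k=5):
    """Return the k (feature, importance) pairs with the largest absolute importance."""
    ranked = sorted(importance.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Illustrative values only; they are not results from an actual run
example_importance = {
    "V14": 0.42,
    "V4": 0.31,
    "Amount": 0.05,
    "V12": 0.27,
    "Time": 0.01,
}

top3 = top_features(example_importance, k=3)
for name, score in top3:
    print(f"{name}: {score:.2f}")
```

Sorting by absolute value keeps the helper usable whether or not the dictionary is already ordered and whether or not the scores can be negative.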

## Deploy

### Retrieve the best model for deployment and register it


In [None]:
model_name = best_run.properties["model_name"]
script_file_name = "inference/score.py"
best_run.download_file("outputs/scoring_file_v_1_0_0.py", "inference/score.py")

description = "AutoML Model trained using distributed training"
tags = None
model = remote_run.register_model(
    model_name=model_name, description=description, tags=tags
)

print(
    remote_run.model_id
)  # This will be written to the script file later in the notebook.

### Deploy the model as a Web Service on Azure Container Instance

In [None]:
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice
from azureml.core.model import Model
from azureml.core.environment import Environment

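# Reuse the environment of the best run so the service has the packages the model was trained with
inference_config = InferenceConfig(
    environment=best_run.get_environment(), entry_script=script_file_name
)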

aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=2,
    memory_gb=2,
    tags={"area": "bmData", "type": "automl_classification"},
    description="sample service for Automl Classification",
)

aci_service_name = model_name.lower()
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

### Call the ACI web service to do the prediction

In [None]:
import requests
from numpy import array
pd.set_option('display.max_rows', None)

# take(20) returns the first 20 rows; take_sample() expects a sampling probability, not a row count
sampled_test_data = test_data.take(20)
X_test = sampled_test_data.drop_columns(columns=[label_column_name]).to_pandas_dataframe()
y_test = sampled_test_data.keep_columns(columns=[label_column_name], validate=True).to_pandas_dataframe()

X_test_json = X_test.to_json(orient="records")
data = '{"data": ' + X_test_json + "}"
headers = {"Content-Type": "application/json"}

resp = requests.post(aci_service.scoring_uri, data, headers=headers)

y_pred = json.loads(json.loads(resp.text))["result"]

actual = array(y_test)
actual = actual[:, 0]

compare_results_df = pd.DataFrame({'actual': actual, 'predicted': y_pred})
print(compare_results_df)
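
The cell above builds the request body by string concatenation and decodes the response with two nested `json.loads` calls. The following sketch shows the same round trip with the standard library only, under the assumption that the scoring script returns a JSON string which the service then JSON-encodes again. The helpers `build_payload` and `parse_predictions`, the shortened rows, and the simulated response are hypothetical and for illustration only.

```python
import json


def build_payload(records):
    """Serialize a list of row dictionaries into the {"data": [...]} request body."""
    return json.dumps({"data": records})


def parse_predictions(response_text):
    """Extract the "result" list from a response that may be JSON-encoded twice."""
    decoded = json.loads(response_text)
    if isinstance(decoded, str):
        # The body was a JSON string containing JSON, so decode the inner document too
        decoded = json.loads(decoded)
    return decoded["result"]


# Shortened rows for illustration; real rows carry every feature column
rows = [{"Time": 0.0, "Amount": 149.62}, {"Time": 1.0, "Amount": 2.69}]
payload = build_payload(rows)

# Simulated, doubly encoded response; the prediction values are made up
simulated_response = json.dumps(json.dumps({"result": [0, 1]}))
predictions = parse_predictions(simulated_response)

print(payload)
print(predictions)
```

Building the body with `json.dumps` avoids hand-assembled JSON, and checking the decoded type lets the parser handle both singly and doubly encoded responses.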

### Delete the ACI web service

In [None]:
aci_service.delete()

## Acknowledgements

This Credit Card Fraud Detection dataset is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/. The dataset is available at: https://www.kaggle.com/mlg-ulb/creditcardfraud

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.
More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project.

Please cite the following works:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019