Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Automated Machine Learning
_**Classification of fraudulent credit card transactions with a local run**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Train](#Train)
1. [Results](#Results)
1. [Test](#Test)
1. [Acknowledgements](#Acknowledgements)

## Introduction

In this example, we use the associated credit card dataset to showcase how you can use AutoML for a simple classification problem. The goal is to predict whether a credit card transaction is fraudulent.

This notebook trains the model on local compute.

If you are using an Azure Machine Learning [Notebook VM](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-1st-experiment-sdk-setup), you are all set. Otherwise, if you haven't already, go through the [configuration](../../../configuration.ipynb) notebook first to establish your connection to the Azure ML Workspace.

In this notebook you will learn how to:
1. Create an experiment using an existing workspace.
2. Configure AutoML using `AutoMLConfig`.
3. Train the model.
4. Explore the results.
5. Test the fitted model.

## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [1]:
import logging

from matplotlib import pyplot as plt
import pandas as pd
import os

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.train.automl import AutoMLConfig

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'automl-onn-pickle-perf'

experiment=Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

| | |
|-|-|
|SDK version|1.0.85|
|Subscription ID|672be801-622a-4828-aa13-743ec59b8e29|
|Workspace|aibuilder-dev|
|Resource Group|aibuilder-dev|
|Location|westcentralus|
|Experiment Name|automl-onn-pickle-perf|


In [24]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=50000, n_features=40)

In [52]:
X = pd.DataFrame(X, columns = [f"f_{i}" for i in range(X.shape[1])])

In [53]:
X.head()

| |f_0|f_1|f_2|f_3|f_4|f_5|f_6|f_7|f_8|f_9|...|f_30|f_31|f_32|f_33|f_34|f_35|f_36|f_37|f_38|f_39|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|-0.24|1.83|0.13|-2.36|0.55|0.55|-0.93|0.82|-0.18|0.5|...|0.89|-1.88|-1.4|0.54|-1.24|-2.14|0.18|0.72|0.48|-0.01|
|1|1.61|0.1|-1.32|1.2|-0.69|-0.59|-0.32|-0.74|0.23|0.56|...|0.05|0.01|0.85|0.89|1.7|0.34|-1.38|-0.07|-0.48|0.93|
|2|-0.95|-0.06|0.18|2.58|-1.6|0.68|-1.08|0.28|-0.56|-0.76|...|-0.71|-0.66|0.36|-0.49|2.01|1.68|1.09|0.21|-0.35|0.74|
|3|0.67|1.79|0.87|-0.27|1.16|-1.18|-0.38|0.94|-0.88|0.05|...|0.22|0.88|-0.38|0.79|0.07|0.89|1.47|0.27|0.59|2.16|
|4|1.5|-1.65|1.15|-1.9|-0.6|0.44|1.49|0.61|1.33|-0.35|...|0.62|1.0|0.76|0.48|-1.49|0.08|0.16|0.66|-0.52|-1.0|


### Load Data

Load the credit card dataset from a CSV file containing both training features and labels. The features are inputs to the model, while the training labels represent the expected output of the model. Next, we'll split the data using `random_split` and extract the training data for the model.
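
To make the split step concrete, here is a minimal, library-free sketch of a seeded random split by percentage. It is not the Azure ML `random_split` implementation; the function name `random_split_rows`, the 0.8 ratio, the seed, and the toy rows are all assumptions chosen for illustration.

```python
import random


def random_split_rows(rows, percentage, seed=None):
    """Shuffle a copy of `rows` and cut it into two parts.

    `percentage` is the fraction of rows placed in the first part.
    """
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    cut = int(round(len(shuffled) * percentage))
    return shuffled[:cut], shuffled[cut:]


# Illustrative data: 10 (features, label) rows; ratio and seed are assumptions.
rows = [([i, i * 2], i % 2) for i in range(10)]
train_rows, valid_rows = random_split_rows(rows, percentage=0.8, seed=223)
print(len(train_rows), len(valid_rows))
```

Fixing the seed keeps the split repeatable across runs, which makes later model comparisons easier to interpret.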

## Train

Instantiate an `AutoMLConfig` object. This defines the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|
|**enable_early_stopping**|Stop the run if the metric score is not showing improvement.|
|**n_cross_validations**|Number of cross validation splits.|
|**training_data**|Input dataset, containing both features and label column.|
|**label_column_name**|The name of the label column.|

**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)
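
For intuition about what a ranking metric such as `AUC_weighted` rewards, the sketch below computes plain binary ROC AUC as the fraction of (positive, negative) pairs that the scores rank correctly, counting ties as half. This is a minimal illustration, not AutoML's metric code; `binary_auc` is a hypothetical helper and the labels and scores are made-up values.

```python
def binary_auc(y_true, y_score):
    """ROC AUC for 0/1 labels via pairwise comparison (O(n_pos * n_neg))."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up example: three of the four positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(binary_auc(y_true, y_score))  # 0.75
```

Because the metric depends only on how scores are ordered, it does not change with the decision threshold, which is one reason it is a common choice for fraud-style problems.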

In [54]:
automl_settings = {
    # "n_cross_validations": 3,
    "primary_metric": 'AUC_weighted',
    "preprocess": True,
    "experiment_timeout_hours": 0.6, # Time limit for testing purposes only; remove it for real use cases, because it drastically limits the ability to find the best model
    "verbosity": logging.INFO,
    "whitelist_models": ["LightGBM"],
    "iterations": 1,
    "enable_onnx_compatible_models": True,
    "enable_stack_ensemble": False
}

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             X = X,
                             y = y,
                             **automl_settings
                            )



Call the `submit` method on the experiment object and pass the run configuration. Depending on the data and the number of iterations this can run for a while.
In this example, we specify `show_output = True` to print currently running iterations to the console.

In [55]:
local_run = experiment.submit(automl_config, show_output = True)

Running on local machine
Parent Run ID: AutoML_de8b3217-7085-4f48-a0b7-fc98b5a57120

Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Train-Test data split
STATUS:       DONE
DESCRIPTION:  Your input data has been split into a training dataset and a holdout test dataset for validation of the model. The test holdout dataset reflects the original distribution of your input data.
PARAMETERS:   Dataset : train, Row counts : 45000, Percentage : 90.0
              Dataset : test, Row counts : 5000, Percentage : 10.0
              
TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Classes are ba

In [7]:
# If you need to retrieve a run that already started, use the following code
#from azureml.train.automl.run import AutoMLRun
#local_run = AutoMLRun(experiment = experiment, run_id = '<replace with your run id>')

In [56]:
local_run

|Experiment|Id|Type|Status|Details Page|Docs Page|
|-|-|-|-|-|-|
|automl-onn-pickle-perf|AutoML_de8b3217-7085-4f48-a0b7-fc98b5a57120|automl|Completed|Link to Azure Machine Learning studio|Link to Documentation|


## Analyze results

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method on `local_run` returns the best run and the fitted model for the last invocation. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [57]:
best_run, fitted_model = local_run.get_output()
fitted_model

Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
        feature_sweeping_config=None, feature_sweeping_timeout=None,
        featurization_config=None, force_text_dnn=None,
        is_cross_validation=None, is_onnx_compatible=None, logger=None,
        obser...      silent=True, subsample=1, subsample_for_bin=200000,
          subsample_freq=0, verbose=-10))])
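
The `get_output` overloads for a logged metric or a particular iteration can be sketched as thin helpers. The `metric` and `iteration` keyword names follow the Azure ML SDK's `AutoMLRun.get_output`; the helper names are hypothetical, and the helpers accept any object that exposes a compatible `get_output` method.

```python
def best_for_metric(automl_run, metric_name):
    """Return (run, fitted_model) for the best child run under `metric_name`."""
    return automl_run.get_output(metric=metric_name)


def output_for_iteration(automl_run, iteration):
    """Return (run, fitted_model) for one specific iteration index."""
    return automl_run.get_output(iteration=iteration)
```

With the run from this notebook, usage might look like `best_for_metric(local_run, "log_loss")` or `output_for_iteration(local_run, 0)`. Both argument values are assumptions; since `iterations` is set to 1 here, `0` is the natural index to try.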

In [58]:
from azureml.explain.model.mimic.mimic_explainer import MimicExplainer
from azureml.explain.model.mimic.models.lightgbm_model import LGBMExplainableModel
import time

In [59]:
start = time.time()
explainer = MimicExplainer(fitted_model, X, LGBMExplainableModel, augment_data=False)
end = time.time()
print(end - start)

1.197920322418213


In [61]:
best_run, onnx_model = local_run.get_output(return_onnx_model=True)

In [62]:
out_name = "outputs/model_onnx.json"
onnx_res = best_run._download_artifact_contents_to_string(out_name)

In [63]:
onnx_res

'{"RawColumnNameToOnnxNameMap": {"f_0": "f_0", "f_1": "f_1", "f_2": "f_2", "f_3": "f_3", "f_4": "f_4", "f_5": "f_5", "f_6": "f_6", "f_7": "f_7", "f_8": "f_8", "f_9": "f_9", "f_10": "f_10", "f_11": "f_11", "f_12": "f_12", "f_13": "f_13", "f_14": "f_14", "f_15": "f_15", "f_16": "f_16", "f_17": "f_17", "f_18": "f_18", "f_19": "f_19", "f_20": "f_20", "f_21": "f_21", "f_22": "f_22", "f_23": "f_23", "f_24": "f_24", "f_25": "f_25", "f_26": "f_26", "f_27": "f_27", "f_28": "f_28", "f_29": "f_29", "f_30": "f_30", "f_31": "f_31", "f_32": "f_32", "f_33": "f_33", "f_34": "f_34", "f_35": "f_35", "f_36": "f_36", "f_37": "f_37", "f_38": "f_38", "f_39": "f_39"}, "InputRawColumnSchema": {"f_0": "floating", "f_1": "floating", "f_2": "floating", "f_3": "floating", "f_4": "floating", "f_5": "floating", "f_6": "floating", "f_7": "floating", "f_8": "floating", "f_9": "floating", "f_10": "floating", "f_11": "floating", "f_12": "floating", "f_13": "floating", "f_14": "floating", "f_15": "floating", "f_16": "fl
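
The downloaded string is JSON, so it can be parsed to inspect the two mappings visible in the output above. The sketch below uses a shortened, hand-written sample with only two columns in place of the real `onnx_res`; `summarize_onnx_resource` is a hypothetical helper, and any other keys in the real resource are ignored.

```python
import json

# Shortened stand-in for `onnx_res`, keeping only two of the columns shown above.
sample_res = (
    '{"RawColumnNameToOnnxNameMap": {"f_0": "f_0", "f_1": "f_1"}, '
    '"InputRawColumnSchema": {"f_0": "floating", "f_1": "floating"}}'
)


def summarize_onnx_resource(res_json):
    """Map each raw column to (ONNX input name, declared raw type)."""
    res = json.loads(res_json)
    name_map = res["RawColumnNameToOnnxNameMap"]
    schema = res["InputRawColumnSchema"]
    return {col: (name_map[col], schema.get(col)) for col in name_map}


summary = summarize_onnx_resource(sample_res)
for column, (onnx_name, raw_type) in summary.items():
    print(column, "->", onnx_name, raw_type)
```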

In [64]:
from azureml.automl.runtime.onnx_convert import OnnxInferenceHelper
from typing import Any, Tuple
from numpy import ndarray


class OnnxModelWrapper:
    """
    Helper class for prediction when using an ONNX model.
    """
    def __init__(self, onnx_model_bytes: bytes, onnx_input_map: dict):
        """
        :param onnx_model_bytes: the onnx model in bytes
        :param onnx_input_map: the onnx_resource dictionary
        """
        self.onnx_model_bytes = onnx_model_bytes
        self.onnx_input_map = onnx_input_map
        self.wrapper_model = OnnxInferenceHelper(self.onnx_model_bytes, self.onnx_input_map)

    def predict(self, X) -> Tuple[Any, Any]:
        """
        predict by using OnnxInferenceHelper
        :param X: features to predict
        :returns tuple of <label, prob>
        """
        return self.wrapper_model.predict(X)

    def predict_proba(self, X) -> ndarray:
        """
        predict proba by using OnnxInferenceHelper
        :param X: features to predict
        :returns ndarray of prob
        """
        _, y_prob = self.wrapper_model.predict(X, with_prob=True)
        return y_prob


In [65]:
import json

In [67]:
onnxrt_wrapper = OnnxModelWrapper(onnx_model.SerializeToString(), json.loads(onnx_res))
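
Before timing the explainer against the ONNX wrapper, it may be worth checking that the wrapper and the pickled pipeline predict the same labels. The sketch below is one simple way to do that; `agreement_rate` is a hypothetical helper, and the label lists are made-up stand-ins for the two models' predictions.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of positions where two label sequences match."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("label sequences must not be empty")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


# Made-up labels standing in for the two models' predictions.
pickle_labels = [0, 1, 1, 0, 1]
onnx_labels = [0, 1, 0, 0, 1]
print(agreement_rate(pickle_labels, onnx_labels))  # 0.8
```

A rate well below 1.0 would suggest the two models are not equivalent, in which case a timing comparison between them says little.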

In [69]:
start = time.time()
explainer = MimicExplainer(onnxrt_wrapper, X, LGBMExplainableModel, augment_data=False)
end = time.time()
print(end - start)

34.04661011695862
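
The two timing cells repeat the same start/stop pattern. A small context manager, sketched below with the hypothetical name `timed`, could capture both measurements in one dictionary. The two durations are copied from the outputs above; the `MimicExplainer` calls are replaced by a placeholder workload so the sketch runs on its own.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    sum(range(100_000))  # placeholder workload; put the explainer construction here

# Durations printed earlier in this notebook, in seconds.
pickle_seconds = 1.197920322418213
onnx_seconds = 34.04661011695862
print(f"ONNX wrapper / pickled model: {onnx_seconds / pickle_seconds:.1f}x")  # 28.4x
```

`time.perf_counter` is used here instead of `time.time` because it is a monotonic clock intended for measuring short intervals.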
