# AutoML 013: Test for automatic blacklisting of models based on size of dataset
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


In this example we use scikit-learn's [digits dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) to showcase how the AutoML classifier omits algorithms that are unlikely to perform well given the structure of the training data (especially the number of samples).

Make sure you have executed the [setup](setup.ipynb) before running this notebook.

In this notebook you will see:
1. Creating or reusing an existing Project and Workspace
2. Instantiating an AutoML Classifier
3. Training the Model
4. Exploring the results
5. Testing the fitted model

In addition, this notebook showcases the following feature:
- **Automatic blacklist** of certain pipelines


## Create Project and Workspace

As part of the setup you have already created a workspace. For AutoML you also need to create a <b>Project</b>. A Project is a local folder that contains the files for your Azure ML experiments. It is associated with a run history, which is a cloud container of run metrics and output artifacts from your experiments. You can either attach a local folder as a new project, or load a local folder as a project if it has been attached before.

In [None]:
import logging
import os
import random

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

In [None]:
ws = Workspace.from_config()

# choose a name for the run history container in the workspace
experiment_name = 'automl-local-missing-data'
# project folder
project_folder = './sample_projects/automl-local-missing-data'

experiment=Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment_name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data=output, index=['']).T

## Diagnostics
Opt in to diagnostics collection to improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

Set your primary metric:

In [None]:
primary_metric = "AUC_weighted"
data_library = "numpy"

## Instantiate Auto ML


Instantiate an AutoML object. This creates an Experiment in Azure ML. You can reuse this object to trigger multiple runs. Each run will be part of the same experiment.

|Property|Description|
|-|-|
|**primary_metric**|This is the metric that you want to optimize.<br> Auto ML Classifier supports the following primary metrics <br><i>AUC_macro</i><br><i>AUC_weighted</i><br><i>accuracy</i><br><i>weighted_accuracy</i><br><i>norm_macro_recall</i><br><i>balanced_accuracy</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration|
|**iterations**|Number of iterations. In each iteration Auto ML Classifier trains a specific pipeline on the data|
|**n_cross_validations**|Number of cross-validation splits|
|**preprocess**| *True/False* <br>Setting this to *True* enables Auto ML Classifier to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*|
|**experiment_exit_score**|*double* value indicating the target for *primary_metric*. <br> Once the target is surpassed the run terminates|
|**blacklist_models**|*Array* of *strings* indicating pipelines to ignore for Auto ML.<br><br> Allowed values for **Classification**<br><i>logistic regression</i><br><i>SGD classifier</i><br><i>MultinomialNB</i><br><i>BernoulliNB</i><br><i>SVM</i><br><i>LinearSVM</i><br><i>kNN</i><br><i>DT</i><br><i>RF</i><br><i>extra trees</i><br><i>gradient boosting</i><br><i>lgbm_classifier</i><br><br>Allowed values for **Regression**<br><i>Elastic net</i><br><i>Gradient boosting regressor</i><br><i>DT regressor</i><br><i>kNN regressor</i><br><i>Lasso lars</i><br><i>SGD regressor</i><br><i>RF regressor</i><br><i>extra trees regressor</i><br><i>lightGBM regressor</i>|

### Creating Large and Small Datasets

In [None]:
digits = datasets.load_digits()

# Split into training and test data
X_digits_train = digits.data[10:,:]
y_digits_train = digits.target[10:]

X_digits_train_large = np.repeat(X_digits_train, 3, 0)
y_digits_train_large = np.repeat(y_digits_train, 3)

X_digits_test = digits.data[:10,:]
y_digits_test = digits.target[:10]
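
The effect of `np.repeat` on the sample count can be checked in isolation. The sketch below uses a small synthetic array in place of the digits data. `sample_threshold` is a hypothetical value chosen for illustration, not the SDK's actual `MAX_SAMPLES_BLACKLIST` constant.

```python
import numpy as np

# Synthetic stand-in for the training data: 6 samples, 4 features.
X_small = np.arange(24).reshape(6, 4)
y_small = np.array([0, 1, 2, 0, 1, 2])

# Repeat every row three times along axis 0, as the notebook does.
X_large = np.repeat(X_small, 3, axis=0)
y_large = np.repeat(y_small, 3)

# Hypothetical threshold, for illustration only.
sample_threshold = 10

print(X_small.shape, X_large.shape)
print(len(y_small) > sample_threshold, len(y_large) > sample_threshold)
```

Because both arrays are repeated element by element, each feature row stays aligned with its label. Only the sample count changes, which is what lets the large dataset cross a size threshold.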

In [None]:
df = pd.DataFrame(data=X_digits_train)
df['Label'] = pd.Series(y_digits_train, index=df.index)
df.head()

In [None]:
x_data = X_digits_train_large
y_data = y_digits_train_large

# Optionally wrap the same training data in pandas objects
if data_library == 'pandas':
    x_data = pd.DataFrame(X_digits_train_large)
    y_data = pd.DataFrame(y_digits_train_large)

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = primary_metric,
                             iteration_timeout_minutes = 60,
                             iterations = 10,
                             n_cross_validations = 2,
                             verbosity = logging.INFO,
                             X = x_data, 
                             y = y_data,
                             preprocess = True,
                             path = project_folder)

## Training the Model

Call the `submit` method on the experiment and pass the AutoML configuration. For local runs the execution is synchronous. Depending on the data and the number of iterations, this can run for a while.
The currently running iterations are printed to the console.

The following parameters control training. In this notebook, `X` and `y` are set on `AutoMLConfig`, and `show_output` is passed to `submit`.

|**Parameter**|**Description**|
|-|-|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples] or [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification.|
|**compute_target**|Indicates the compute used for training. <i>local</i> means training runs on the same compute that hosts the Jupyter notebook. <br>For DSVM and Batch AI please refer to the relevant notebooks.|
|**show_output**| True/False to turn on/off console output|

In [None]:
local_run = experiment.submit(automl_config, show_output=True)

## Exploring the results

#### Widget for monitoring runs

The widget will sit on "loading" until the first iteration completes, then you will see an auto-updating graph and table show up. It refreshes once per minute, so you should see the graph update as child runs complete.

NOTE: The widget will display a link at the bottom. This will not currently work, but will eventually link to a web UI to explore the individual run details.

In [None]:
from azureml.widgets import RunDetails
from azureml.train.automl.constants import MAX_SAMPLES_BLACKLIST_ALGOS as blacklist, MAX_SAMPLES_BLACKLIST as blacklist_threshold

run_details = RunDetails(local_run)
run_details.show()

#### Test Blacklist
Check for blacklisted algorithms in the child runs when the number of samples is above the assigned threshold

In [None]:
child_runs = run_details.get_widget_data()['child_runs']
pipeline_names = [child_run['run_name'] for child_run in child_runs]

def flatten(xs):
    return [y for x in xs for y in x]

stages = flatten([name.split(', ') for name in pipeline_names])
autoblacklist_success = set(stages).isdisjoint(blacklist)
if len(y_data) > blacklist_threshold:
    # Above the threshold, no child run may use a blacklisted algorithm.
    assert autoblacklist_success, "A blacklisted algorithm was used in a child run"
    print("Successful autoblacklist")
else:
    print("Not enough data to autoblacklist")
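
The check above can be exercised without a completed run. In the sketch below, the pipeline names, the blacklist contents, and the `uses_blacklisted_stage` helper are all hypothetical. It assumes, as the cell above does, that each run name lists its stages separated by `', '`.

```python
def flatten(xs):
    """Flatten one level of nesting."""
    return [y for x in xs for y in x]


def uses_blacklisted_stage(pipeline_names, blacklist):
    """Return True if any pipeline stage appears in the blacklist."""
    stages = flatten(name.split(', ') for name in pipeline_names)
    return not set(stages).isdisjoint(blacklist)


# Hypothetical values, for illustration only.
example_blacklist = {'SVM', 'kNN'}
clean_runs = ['StandardScaler, RF', 'MaxAbsScaler, DT']
mixed_runs = ['StandardScaler, RF', 'MaxAbsScaler, kNN']

print(uses_blacklisted_stage(clean_runs, example_blacklist))
print(uses_blacklisted_stage(mixed_runs, example_blacklist))
```

Using `isdisjoint` avoids building the intersection explicitly. The test only needs to know whether any overlap exists, not which stages overlap.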


#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see the individual metrics that are logged.

In [None]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = run.get_metrics()    
    metricslist[properties['iteration']] = metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
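
How the dictionary of per-iteration metrics becomes a table can be seen with placeholder values. The metric values below are made up for illustration and do not come from a real run, and the iteration keys are assumed to be strings.

```python
import pandas as pd

# Placeholder metrics keyed by iteration; values are made up for illustration.
metricslist = {
    '2': {'AUC_weighted': 0.90, 'accuracy': 0.80},
    '0': {'AUC_weighted': 0.70, 'accuracy': 0.60},
    '1': {'AUC_weighted': 0.80, 'accuracy': 0.70},
}

# Outer keys become columns, inner keys become the row index.
rundata = pd.DataFrame(metricslist).sort_index(axis=1)

print(list(rundata.columns))
print(list(rundata.index))
```

Sorting along `axis=1` orders the iteration columns, so each metric can be read across iterations from left to right. If the keys are strings, the order is lexicographic, so `'10'` would sort before `'2'`.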