Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# AutoML 05: Blacklisting Models, Early Termination, and Handling Missing Data

In this example we use scikit-learn's [digits dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) to showcase how you can use AutoML to handle missing values in data. We also provide a stopping metric: a target for the primary metric that lets AutoML terminate the run without necessarily going through all the iterations. Finally, if you want to avoid certain pipelines, you can specify a blacklist of algorithms that AutoML will ignore for this run.

Make sure you have executed the [00.configuration](00.configuration.ipynb) notebook before running this notebook.

In this notebook you will see how to:
1. Create an Experiment using an existing Workspace
2. Instantiate AutoMLConfig
3. Train the model
4. Explore the results
5. Test the fitted model

In addition, this notebook showcases the following features:
- **Blacklisting** certain pipelines
- Specifying a **target metric** as a stopping criterion
- Handling **missing data** in the input



## Create Experiment

As part of the setup you have already created a <b>Workspace</b>. For AutoML you need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b> that is used to run experiments.

In [None]:
import logging
import os
import random

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

In [None]:
ws = Workspace.from_config()

# choose a name for the experiment
experiment_name = 'automl-local-missing-data'
# project folder
project_folder = './sample_projects/automl-local-missing-data'

experiment=Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data=output, index=['']).T

## Diagnostics

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

### Creating Missing Data

In [None]:
from scipy import sparse

digits = datasets.load_digits()
X_digits = digits.data[10:,:]
y_digits = digits.target[10:]

# Add one missing value to 75% of the rows
missing_rate = 0.75
n_missing_samples = int(np.floor(X_digits.shape[0] * missing_rate))
# Use the built-in bool: the np.bool alias was removed from recent NumPy releases
missing_samples = np.hstack((np.zeros(X_digits.shape[0] - n_missing_samples, dtype=bool), np.ones(n_missing_samples, dtype=bool)))
rng = np.random.RandomState(0)
rng.shuffle(missing_samples)
missing_features = rng.randint(0, X_digits.shape[1], n_missing_samples)
X_digits[np.where(missing_samples)[0], missing_features] = np.nan
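
As a sanity check on the recipe above, the following sketch applies the same steps to a small synthetic array (a stand-in for `X_digits`, chosen only to keep the example self-contained) and counts how many rows end up with a missing value. The array shape and the variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic stand-in for X_digits: 8 rows, 4 features, no missing values yet.
X = np.arange(32, dtype=float).reshape(8, 4)

# Same recipe as the cell above: select 75% of the rows,
# then blank one randomly chosen feature in each selected row.
missing_rate = 0.75
n_missing = int(np.floor(X.shape[0] * missing_rate))
mask = np.hstack((np.zeros(X.shape[0] - n_missing, dtype=bool),
                  np.ones(n_missing, dtype=bool)))
rng = np.random.RandomState(0)
rng.shuffle(mask)
features = rng.randint(0, X.shape[1], n_missing)
X[np.where(mask)[0], features] = np.nan

# Each selected row should now hold exactly one NaN; the others hold none.
nan_per_row = np.isnan(X).sum(axis=1)
rows_with_nan = int((nan_per_row > 0).sum())
print(rows_with_nan, "of", X.shape[0], "rows contain a missing value")
```

The same `np.isnan(...).sum(axis=1)` check can be run on `X_digits` itself to confirm that roughly three quarters of the rows were affected.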

In [None]:
df = pd.DataFrame(data=X_digits)
df['Label'] = pd.Series(y_digits, index=df.index)
df.head()
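
The `NaN` entries are left in place on purpose, because handling them is delegated to AutoML's preprocessing. To give a rough idea of what "handling missing data" can mean, here is a minimal sketch of column-mean imputation. This is an illustrative technique only, not a description of what AutoML actually does internally, and `impute_column_means` is a hypothetical helper.

```python
import numpy as np

def impute_column_means(X):
    """Return a copy of X with each NaN replaced by the mean of its column."""
    X = np.array(X, dtype=float)          # copy so the caller's data is untouched
    col_means = np.nanmean(X, axis=0)     # per-column mean, ignoring NaN entries
    rows, cols = np.where(np.isnan(X))    # positions of the missing entries
    X[rows, cols] = col_means[cols]
    return X

# Toy input with two missing values.
X_toy = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [np.nan, 8.0]])
X_filled = impute_column_means(X_toy)
print(X_filled)
```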

## Instantiate Auto ML Config


This defines the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize.<br> Classification supports the following primary metrics <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i>|
|**max_time_sec**|Time limit in seconds for each iteration|
|**iterations**|Number of iterations. In each iteration Auto ML trains a specific pipeline on the data|
|**n_cross_validations**|Number of cross validation splits|
|**preprocess**| *True/False* <br>Setting this to *True* enables Auto ML to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*|
|**exit_score**|*double* value indicating the target for *primary_metric*. <br> Once the target is surpassed the run terminates|
|**blacklist_algos**|*Array* of *strings* indicating pipelines to ignore for Auto ML.<br><br> Allowed values for **Classification**<br><i>LogisticRegression</i><br><i>SGDClassifierWrapper</i><br><i>NBWrapper</i><br><i>BernoulliNB</i><br><i>SVCWrapper</i><br><i>LinearSVMWrapper</i><br><i>KNeighborsClassifier</i><br><i>DecisionTreeClassifier</i><br><i>RandomForestClassifier</i><br><i>ExtraTreesClassifier</i><br><i>LightGBMClassifier</i><br><br>Allowed values for **Regression**<br><i>ElasticNet</i><br><i>GradientBoostingRegressor</i><br><i>DecisionTreeRegressor</i><br><i>KNeighborsRegressor</i><br><i>LassoLars</i><br><i>SGDRegressor</i><br><i>RandomForestRegressor</i><br><i>ExtraTreesRegressor</i>|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples] or [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|
|**path**|Relative path to the project folder.  AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder. |

In [None]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'AUC_weighted',
                             max_time_sec = 3600,
                             iterations = 20,
                             n_cross_validations = 5,
                             preprocess = True,
                             exit_score = 0.994,
                             blacklist_algos = ['KNeighborsClassifier','LinearSVMWrapper'],
                             verbosity = logging.INFO,
                             X = X_digits, 
                             y = y_digits,
                             path=project_folder)
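
To make the interaction between `iterations`, `exit_score`, and `blacklist_algos` concrete, here is a conceptual sketch in plain Python. It is not how AutoML is implemented: `run_search` is a hypothetical function, the candidate order is fixed, and the scores are invented rather than produced by training. It only illustrates the behavior described in the table: blacklisted algorithms are never tried, and the run stops once the primary metric surpasses the target.

```python
def run_search(candidates, exit_score, blacklist, max_iterations):
    """Conceptual sketch: try candidates in order, skip blacklisted ones,
    and stop as soon as a score exceeds exit_score."""
    history = []
    allowed = [c for c in candidates if c[0] not in blacklist]
    for name, score in allowed[:max_iterations]:
        history.append((name, score))
        if score > exit_score:
            break  # target surpassed: terminate early
    return history

# Hypothetical (algorithm name, primary metric score) pairs; scores are invented.
candidates = [
    ('KNeighborsClassifier', 0.990),
    ('LogisticRegression', 0.985),
    ('LinearSVMWrapper', 0.991),
    ('RandomForestClassifier', 0.995),
    ('ExtraTreesClassifier', 0.996),
]

history = run_search(candidates,
                     exit_score=0.994,
                     blacklist=['KNeighborsClassifier', 'LinearSVMWrapper'],
                     max_iterations=20)
for name, score in history:
    print(name, score)
```

In this toy run only two candidates are evaluated: the blacklisted ones are skipped, and the search ends at the first score above 0.994 even though the iteration budget is 20.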

## Training the Model

You can call the submit method on the experiment object and pass the run configuration. For local runs the execution is synchronous. Depending on the data and the number of iterations, this can run for a while.
You will see the currently running iterations printed to the console.

In [None]:
local_run = experiment.submit(automl_config, show_output=True)

## Exploring the results

#### Widget for monitoring runs

The widget will sit on "loading" until the first iteration has completed; then you will see an auto-updating graph and table appear. It refreshes once per minute, so you should see the graph update as child runs complete.

NOTE: The widget will display a link at the bottom. This link does not currently work, but it will eventually lead to a web UI for exploring the individual run details.

In [None]:
from azureml.train.widgets import RunDetails
RunDetails(local_run).show() 


#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see the individual metrics that are logged.

In [None]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
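
Once the metrics are collected per iteration, it is straightforward to look up which iteration did best on a given metric. The sketch below uses a dictionary in the same shape as `metricslist` (iteration number mapped to a dictionary of metrics). The values are invented for illustration, `best_iteration` is a hypothetical helper, and it assumes that a higher value is better for the chosen metric.

```python
# Hypothetical metrics in the same shape as metricslist: {iteration: {metric: value}}.
# The numbers are invented for illustration.
example_metrics = {
    0: {'AUC_weighted': 0.981, 'accuracy': 0.902},
    1: {'AUC_weighted': 0.993, 'accuracy': 0.948},
    2: {'AUC_weighted': 0.989, 'accuracy': 0.951},
}

def best_iteration(metrics_by_iteration, metric):
    """Return the iteration whose value for `metric` is highest."""
    scored = {i: m[metric] for i, m in metrics_by_iteration.items() if metric in m}
    return max(scored, key=scored.get)

best_auc = best_iteration(example_metrics, 'AUC_weighted')
best_acc = best_iteration(example_metrics, 'accuracy')
print(best_auc, best_acc)
```

Note that the best iteration can differ from one metric to another, which is why it can be useful to retrieve a model by a metric other than the primary one.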

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The *get_output* method returns two values: the best child run and the fitted model produced by that run.

In [None]:
best_run, fitted_model = local_run.get_output()

#### Best Model based on any other metric

In [None]:
# lookup_metric = "accuracy"
# best_run, fitted_model = local_run.get_output(metric=lookup_metric)

#### Model from a specific iteration

In [None]:
# iteration = 3
# best_run, fitted_model = local_run.get_output(iteration=iteration)

### Register fitted model for deployment

In [None]:
description = 'AutoML Model'
tags = None
local_run.register_model(description=description, tags=tags)
local_run.model_id # Use this id to deploy the model as a web service in Azure

### Testing the Fitted Model 

In [None]:
digits = datasets.load_digits()
X_digits = digits.data[:10, :]
y_digits = digits.target[:10]
images = digits.images[:10]

# Randomly select two distinct digits and test the fitted model on them
for index in np.random.choice(len(y_digits), 2, replace=False):
    print(index)
    predicted = fitted_model.predict(X_digits[index:index + 1])[0]
    label = y_digits[index]
    title = "Label value = %d  Predicted value = %d " % (label, predicted)
    fig = plt.figure(1, figsize=(3,3))
    ax1 = fig.add_axes((0,0,.8,.8))
    ax1.set_title(title)
    plt.imshow(images[index], cmap=plt.cm.gray_r, interpolation='nearest')
    plt.show()
