# AutoML 001: Classification numpy

In this example we use scikit-learn's [digits dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) to show how you can use the AutoML classifier for a simple classification problem.

Make sure you have executed the [setup](setup.ipynb) before running this notebook.

In this notebook you will see:
1. Creating or reusing an existing Project and Workspace
2. Instantiating AutoML Classifier
3. Training the Model using local compute
4. Exploring the results
5. Testing the fitted model


## Create Project and Workspace

As part of the setup you have already created a workspace. For AutoML you also need a <b>Project</b>. A Project is a local folder that contains the files for your Azure ML experiments. It is associated with a run history, a cloud container that holds the run metrics and output artifacts from your experiments. You can either attach a local folder as a new project, or load a local folder as a project if it has been attached before.

In [None]:
import logging

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig

In [None]:
ws = Workspace.from_config()

# choose a name for the run history container in the workspace
experiment_name = 'automl-classification-numpy'
# project folder
project_folder = './sample_projects/automl-classification-numpy'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Run History Name'] = experiment_name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data=output, index=['']).T

## Diagnostics
Opt in to diagnostics collection to improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

Set your primary metric and the data library used to pass the training data:

In [None]:
primary_metric = "AUC_weighted"
data_library = "numpy"

## Load Digits Dataset

In [None]:
digits = datasets.load_digits()

# only take the first 100 rows if you want the training steps to run faster
#X_digits = digits.data[:100, :]
#y_digits = digits.target[:100]

# use full dataset
X_digits = digits.data
y_digits = digits.target
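
The subset and full-dataset options above differ only in how the rows are sliced. As a minimal sketch of that slicing, the block below uses a zero-filled stand-in array with the same shape as the digits data (1797 samples, 64 features) so that it runs without scikit-learn. The names `X_full`, `X_first`, and `X_rest` are illustrative and not part of the notebook.

```python
import numpy as np

# Stand-in for digits.data / digits.target: same shapes, filled with zeros
# so that this sketch does not depend on scikit-learn.
X_full = np.zeros((1797, 64))
y_full = np.zeros(1797, dtype=int)

# First 100 rows: a small subset for quick experiments.
X_first, y_first = X_full[:100, :], y_full[:100]

# Everything except the first 100 rows.
X_rest, y_rest = X_full[100:, :], y_full[100:]

print(X_first.shape, y_first.shape)
print(X_rest.shape, y_rest.shape)
```

Note that `[:100]` keeps the first 100 rows, while `[100:]` drops them and keeps the remainder.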

## Instantiate AutoMLConfig

Instantiate an AutoMLConfig object. This defines the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize.<br> Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i><br><i>norm_macro_recall</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline on the data|
|**n_cross_validations**|Number of cross validation splits|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ] or [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers. |
|**path**|Relative path to the project folder.  AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder. |
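
To make the **n_cross_validations** setting more concrete, here is a generic sketch of how a dataset can be divided into k folds, where each fold serves once as the validation set. This is an illustration only: how AutoML actually builds its splits (shuffling, stratification, seeding) is not described in this notebook, and `k_fold_indices` is a hypothetical helper, not an SDK function.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, valid_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for i, valid_idx in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx

# 1797 is the number of samples in the digits dataset.
splits = list(k_fold_indices(n_samples=1797, n_splits=2))
for train_idx, valid_idx in splits:
    print(len(train_idx), len(valid_idx))
```

With two splits, every sample is used for validation exactly once and for training exactly once.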

In [None]:
automl_config = None

def InitAutoMLConfig():
    global automl_config    
    X_data = X_digits
    y_data = y_digits
    
    if data_library == 'pandas':
        # Only X is wrapped in a DataFrame. y is intentionally left as a NumPy array,
        # because a DataFrame yields a 2-D array and a 1-D array is needed here.
        X_data = pd.DataFrame(X_digits)
        
    automl_config = AutoMLConfig(task = 'classification',
                                     debug_log="{0}_{1}_normal.log".format(primary_metric, data_library),
                                     primary_metric = primary_metric,
                                     iteration_timeout_minutes = 60,
                                     iterations = 10,
                                     X = X_data, 
                                     y = y_data,
                                     n_cross_validations = 2,
                                     verbosity=logging.INFO
                                    )    
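
The comment about skipping `y` in the pandas branch refers to array dimensionality. As a small sketch with toy labels (not the digits targets), wrapping a 1-D label array in a DataFrame produces a one-column, 2-D structure:

```python
import numpy as np
import pandas as pd

y = np.array([0, 1, 2, 3, 4])       # toy labels, 1-D

y_frame = pd.DataFrame(y)           # one-column DataFrame
print(y_frame.values.shape)         # 2-D: one row per sample, one column

y_flat = y_frame.values.ravel()     # flatten back to 1-D if needed
print(y_flat.shape)
```

Leaving `y` as a NumPy array, as the notebook does, avoids this round trip.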


In [None]:
local_run = None

def Submit():    
    global local_run
    local_run = experiment.submit(automl_config, show_output = True)

## Exploring the results

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The *get_output* method on `local_run` returns the best run and its fitted model. Optional arguments to *get_output* let you retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [None]:
def ValidateBestFitPrimaryMetric():
    best_run, fitted_model = local_run.get_output()
    metric_value = best_run.get_metrics()[primary_metric]
    lower_limit = .93
    if primary_metric == 'norm_macro_recall':
        lower_limit = .7
        
    if not (lower_limit < float(metric_value) <= 1):
        raise Exception('Metric value of {0} is not in the valid range.'.format(metric_value))
    print("\n Finished running 'ValidateBestFitPrimaryMetric'")    
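
The range check in `ValidateBestFitPrimaryMetric` can be isolated into a pure function, which makes the boundary behaviour easier to see. This is a sketch only: `metric_in_valid_range` and `lower_limit_for` are hypothetical helpers, and the thresholds are copied from the cell above.

```python
def metric_in_valid_range(metric_value, lower_limit, upper_limit=1.0):
    """Return True when lower_limit < metric_value <= upper_limit."""
    return lower_limit < float(metric_value) <= upper_limit

def lower_limit_for(metric_name):
    # Thresholds copied from ValidateBestFitPrimaryMetric.
    return 0.7 if metric_name == 'norm_macro_recall' else 0.93

print(metric_in_valid_range(0.95, lower_limit_for('accuracy')))           # True
print(metric_in_valid_range(0.93, lower_limit_for('accuracy')))           # False: lower bound is exclusive
print(metric_in_valid_range(0.75, lower_limit_for('norm_macro_recall')))  # True
```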

#### Best Model based on any other metric

In [None]:
def ValidateBestFitOtherMetric():
    best_run, fitted_model = local_run.get_output(metric=primary_metric)
    if fitted_model is None:
        raise Exception('Fitted model is None for {metric}.'.format(metric=primary_metric))
    print("\n Finished running 'ValidateBestFitOtherMetric'")    

#### Best Model based on any iteration

In [None]:
def ValidateAllModelsPrimaryMetric():
    for iteration in range(0, 10):
        best_run, fitted_model = local_run.get_output(iteration=iteration)        
        try:
            fitted_model.predict(X_digits[[0]])
        except Exception as e:
            raise Exception('Invalid fitted model returned for iteration'
                            ' {0} for {1}.'.format(iteration, primary_metric)) from e
    print("\n Finished running 'ValidateAllModelsPrimaryMetric'")     
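
The `raise ... from e` form used above keeps the original error attached to the new one as its `__cause__`, so the underlying failure is not lost. A minimal sketch of that pattern, using a hypothetical `BrokenModel` class in place of a real fitted model:

```python
class BrokenModel:
    """Hypothetical stand-in for a fitted model whose predict() fails."""
    def predict(self, X):
        raise ValueError("bad input shape")

def check_model(model, sample, iteration):
    try:
        model.predict(sample)
    except Exception as e:
        # Chain the new error to the original one.
        raise RuntimeError(
            'Invalid fitted model returned for iteration {0}.'.format(iteration)
        ) from e

try:
    check_model(BrokenModel(), [[0.0]], iteration=3)
except RuntimeError as err:
    caught = err                 # keep a reference; `err` is cleared after the block
    print(err)
    print(repr(err.__cause__))
```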

### Testing our best pipeline

In [None]:
def TestPipeline():
    #load test data
    digits = datasets.load_digits()
    X_digits = digits.data[:10, :]
    y_digits = digits.target[:10]
    images = digits.images[:10]

    #Randomly select digits and test
    best_run, fitted_model = local_run.get_output()
    for index in np.random.choice(len(y_digits), 2):
        print(index)
        predicted = fitted_model.predict(pd.DataFrame(X_digits[index:index + 1]) if data_library == "pandas" else X_digits[index:index + 1])[0]
        label = y_digits[index]
        title = "Label value = %d  Predicted value = %d " % ( label,predicted)
        fig = plt.figure(1, figsize=(3,3))
        ax1 = fig.add_axes((0,0,.8,.8))
        ax1.set_title(title)
        plt.imshow(images[index], cmap=plt.cm.gray_r, interpolation='nearest')
        plt.show()
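
One detail of the random selection above: `np.random.choice(n, 2)` samples with replacement by default, so the same index can be drawn twice. If distinct test digits are preferred, a sketch along these lines could be used. The seed and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, used only for reproducibility
n_test = 10                          # number of test rows, as in TestPipeline

with_replacement = rng.choice(n_test, size=2)                    # may repeat an index
without_replacement = rng.choice(n_test, size=2, replace=False)  # always distinct

print(with_replacement, without_replacement)
```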

### Test other primary metrics and data libraries

We can repeat the same steps for other primary metrics.

In [None]:
steps = [ InitAutoMLConfig, Submit, ValidateBestFitPrimaryMetric, ValidateBestFitOtherMetric, ValidateAllModelsPrimaryMetric, TestPipeline]
primary_metrics = ['accuracy', 'precision_score_weighted', 'norm_macro_recall']
print("data_library is '%s'" % (data_library))
for metric in primary_metrics:    
    primary_metric = metric
    print("primary_metric is '%s'" % (primary_metric))
    for step in steps:
        step()
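
The loop above works because each step function reads the module-level `primary_metric` at call time. As a possible alternative, the sketch below passes the state explicitly instead of going through globals. The step functions here are simplified stand-ins that only record what they were given; they do not call the Azure ML SDK.

```python
def init_config(state):
    # Stand-in for InitAutoMLConfig: store the settings for this metric.
    state['config'] = {'primary_metric': state['primary_metric']}

def submit(state):
    # Stand-in for experiment.submit(...): record which metric was used.
    state['run'] = 'run-for-' + state['config']['primary_metric']

steps = [init_config, submit]
results = {}
for metric in ['accuracy', 'norm_macro_recall']:
    state = {'primary_metric': metric}   # fresh state for each metric
    for step in steps:
        step(state)
    results[metric] = state['run']

print(results)
```

Passing a fresh `state` per metric keeps one metric's run from leaking into the next, at the cost of changing every step's signature.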