# AutoML 015: Monte Carlo CV
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


In this example we use scikit-learn's [20 newsgroups dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) to showcase how you can use the AutoML Classifier with Monte Carlo cross validation and sparse data.

Make sure you have executed the [setup](setup.ipynb) before running this notebook.

In this notebook you will see how to:
1. Create a Project and Workspace, or reuse existing ones
2. Instantiate an AutoML Classifier
3. Train the model
4. Explore the results
5. Test the fitted model

In addition this notebook showcases the following features
- **Monte Carlo** cross validation
- **Sparse data**

## Create Project and Workspace
As part of the setup you have already created a workspace. For AutoML you also need to create a <b>Project</b>. A Project is a local folder that contains files for your Azure ML experiments. It is associated with a run history, a cloud container of run metrics and output artifacts from your experiments. You can either attach a local folder as a new project, or load a local folder as a project if it has been attached before.

In [None]:
import logging
import os
import random

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

In [None]:
ws = Workspace.from_config()

# choose a name for the run history container in the workspace
experiment_name = 'automl-local-missing-data'
# project folder
project_folder = './sample_projects/automl-local-missing-data'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment_name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data=output, index=['']).T

## Diagnostics
Opt in to diagnostics collection to improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

Set your primary metric:

In [None]:
primary_metric = "AUC_weighted"
data_library = "numpy"
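For intuition about the metric chosen above, the following sketch computes a class-weighted one-vs-rest AUC with scikit-learn on hypothetical labels and scores. Treating this as equivalent to AutoML's `AUC_weighted` is an assumption; it is the closest scikit-learn analogue, not a statement of how AutoML computes the value.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted class probabilities (rows sum to 1).
y_true = np.array([0, 1, 2, 0, 1, 2, 0, 0])
y_score = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.3, 0.2],
    [0.4, 0.5, 0.1],
    [0.3, 0.4, 0.3],  # a class 2 sample that is ranked poorly
    [0.6, 0.1, 0.3],
    [0.3, 0.5, 0.2],  # a class 0 sample that is ranked poorly
])

# One-vs-rest AUC per class, averaged with weights proportional to class support.
auc_weighted = roc_auc_score(y_true, y_score, multi_class='ovr', average='weighted')
print("weighted AUC: {:.3f}".format(auc_weighted))
```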

## Creating Sparse Data

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split

remove = ('headers', 'footers', 'quotes')
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data_train.data, data_train.target, test_size=0.1, random_state=42)


vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False, n_features=2**16)

X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

summary_df = pd.DataFrame(index = ['No of Samples', 'No of Features'])
summary_df['Train'] = [X_train.shape[0], X_train.shape[1]]
summary_df['Test'] = [X_test.shape[0], X_test.shape[1]]
summary_df
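The output of `HashingVectorizer` is a SciPy sparse matrix, which is why this notebook counts as a sparse-data example. The sketch below uses the same vectorizer settings on three hypothetical documents to show how few entries are actually stored.

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical mini corpus, used only to illustrate sparsity.
docs = [
    "The rocket reached orbit around Mars",
    "Rendering polygons with a graphics card",
    "Debating religion and atheism online",
]

vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                               n_features=2**16)
X = vectorizer.transform(docs)

# Fraction of entries that are non-zero; only these are stored in memory.
density = X.nnz / (X.shape[0] * X.shape[1])
print(type(X).__name__, X.shape, "density: {:.6f}".format(density))
```

With 2**16 hashed columns and only a handful of distinct words per document, nearly every entry is zero, so a dense array would waste most of its memory.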

## Instantiate AutoMLConfig Object

Instantiate an AutoMLConfig object. It contains all the configuration values expected by an experiment.

|Property|Description|
|-|-|
|**primary_metric**|This is the metric that you want to optimize.<br> AutoML Classifier supports the following primary metrics <br><i>AUC_macro</i><br><i>AUC_weighted</i><br><i>accuracy</i><br><i>weighted_accuracy</i><br><i>norm_macro_recall</i><br><i>balanced_accuracy</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration|
|**iterations**|Number of iterations. In each iteration AutoML Classifier trains the data with a specific pipeline|
|**n_cross_validations**|Number of cross validation splits|
|**validation_size**|Fraction of the data held out for validation in each split.<br>Specifying it together with *n_cross_validations* selects Monte Carlo cross validation|
|**preprocess**| *True/False* <br>Setting this to *True* enables AutoML Classifier to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*<br>*Note: If the input data is sparse you cannot use preprocess=True*|

In [None]:
local_run = None

X_data = X_train
y_data = y_train

if data_library == 'pandas':
    X_data = pd.SparseDataFrame(X_data)
    y_data = pd.DataFrame(y_data)

n_cross_validations = 3
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = primary_metric,
                             iteration_timeout_minutes = 60,
                             iterations = 10,
                             n_cross_validations = n_cross_validations,
                             validation_size=1 / n_cross_validations,
                             verbosity = logging.INFO,
                             X = X_data, 
                             y = y_data,
                             preprocess = False,                             
                             path=project_folder) 

## Training the Model

You can call the submit method on the Experiment instance and pass the AutoML configuration object to it. For local runs the execution is synchronous. Depending on the data and the number of iterations, this can run for a while.
You will see the currently running iterations printed to the console.

The *submit* method on Experiment triggers the training of the model. It can be called with the following parameters:

|**Parameter**|**Description**|
|-|-|
|**automl_config**|Configuration values for the experiment|
|**show_output**| True/False to turn on/off console output|

In [None]:
local_run = experiment.submit(automl_config, show_output = True)

## Exploring the results

#### Widget for monitoring runs

The widget will sit on "loading" until the first iteration completes, then you will see an auto-updating graph and table show up. It refreshes once per minute, so you should see the graph update as child runs complete.

NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details.

In [None]:
from azureml.widgets import RunDetails
RunDetails(local_run).show() 


#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see the individual metrics that are logged.

In [None]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = run.get_metrics()    
    metricslist[properties['iteration']] = metrics

# Sort the columns (one per iteration); pass axis by keyword for newer pandas
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
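Once the metrics are in a DataFrame of this shape (one row per metric, one column per iteration), ordinary pandas operations can rank the iterations. The sketch below uses hypothetical metric values in place of a real run; the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical metrics keyed by iteration, shaped like `metricslist` above.
metricslist = {
    0: {'AUC_weighted': 0.91, 'accuracy': 0.80},
    1: {'AUC_weighted': 0.95, 'accuracy': 0.84},
    2: {'AUC_weighted': 0.93, 'accuracy': 0.86},
}
rundata = pd.DataFrame(metricslist).sort_index(axis=1)

# Column label (iteration) with the highest value of the chosen metric.
best_iteration = rundata.loc['AUC_weighted'].idxmax()

# Transpose so that iterations become rows, then rank them by the metric.
ranked = rundata.T.sort_values('AUC_weighted', ascending=False)
print("best iteration:", best_iteration)
print(ranked)
```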

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The *get_output* method on the run object (`local_run`) returns the best run and the fitted model from the last training run. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*.

In [None]:
def ValidateBestFitPrimaryMetric(primary_metric, data_library):
    best_run, fitted_model = local_run.get_output()
    metric_value = best_run.get_metrics()[primary_metric]
    if not (.90 < float(metric_value) <= 1):
        raise Exception('Metric value of {0} is not in the valid range.'.format(metric_value))
        
ValidateBestFitPrimaryMetric(primary_metric, data_library)

#### Best Model based on any other metric

In [None]:
def ValidateBestFitOtherMetric(primary_metric, data_library):
    best_run, fitted_model = local_run.get_output(metric=primary_metric)
    if fitted_model is None:
        raise Exception('Fitted model is None for {metric}.'.format(metric=primary_metric))
        
ValidateBestFitOtherMetric(primary_metric, data_library)

#### Best Model based on any iteration

In [None]:
def ValidateAllModelsPrimaryMetric(primary_metric, data_library):
    for iteration in range(0, 10):
        best_run, fitted_model = local_run.get_output(iteration=iteration)        
        try:
            fitted_model.predict(X_data[[0]])
        except Exception as e:
            raise Exception('Invalid fitted model returned for iteration'
                            ' {0} for {1}.'.format(iteration, primary_metric)) from e
    print("\n Finished running 'ValidateAllModelsPrimaryMetric'")     
ValidateAllModelsPrimaryMetric(primary_metric, data_library)

### Register fitted model for deployment

In [None]:
description = 'AutoML Model'
tags = None
local_run.register_model(description=description, tags=tags)
local_run.model_id # Use this id to deploy the model as a web service in Azure

### Testing the Fitted Model 

In [None]:
#### Load Test Data
import sklearn
from pandas_ml import ConfusionMatrix


remove = ('headers', 'footers', 'quotes')
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]


data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)

vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                               n_features=2**16)

X_test = vectorizer.transform(data_test.data)
y_test = data_test.target

#### Testing our best pipeline

def TestPipeline(primary_metric, data_library):
    best_run, fitted_model = local_run.get_output()
    ypred = fitted_model.predict(X_test)
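    # Target indices follow data_test.target_names (sorted alphabetically),
    # not the order of the `categories` list defined above.
    ypred_strings = [data_test.target_names[i] for i in ypred]
    ytest_strings = [data_test.target_names[i] for i in y_test]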

    cm = ConfusionMatrix(ytest_strings, ypred_strings)
    print(cm)
    cm.plot()
TestPipeline(primary_metric, data_library)
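If `pandas_ml` is not available in your environment, a similar table can be built with scikit-learn and pandas alone. This is a minimal sketch on hypothetical labels and predictions, not output from the fitted model.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# Hypothetical true and predicted labels, for illustration only.
y_true = ['sci.space', 'comp.graphics', 'sci.space',
          'alt.atheism', 'talk.religion.misc']
y_pred = ['sci.space', 'comp.graphics', 'comp.graphics',
          'alt.atheism', 'alt.atheism']

# Rows are actual classes and columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
cm_df.index.name = 'Actual'
cm_df.columns.name = 'Predicted'
print(cm_df)
```

In the notebook, `ytest_strings` and `ypred_strings` from `TestPipeline` would take the place of the hypothetical lists.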