Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Lab 3: Automated Machine Learning with AutoML
_**Classification with Local Compute**_

## Introduction
In this lab, you will use the AdventureWorks model training feature data you created earlier and let AutoML find the best performing model.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Configure AutoML using `AutoMLConfig`.
3. Train the model using local compute.
4. Explore the results.
5. Test the best fitted model.

## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [12]:
import azureml.core
print(azureml.core.VERSION)

1.0.72


In [13]:
from azureml.core import Workspace

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.0.72


## Connect to your Azure Machine Learning service workspace.

There are two ways to do this:

- Using the workspace config file we created earlier.
- Calling the `Workspace.get()` method.

In [14]:
# Method 1: Using the workspace config file...

ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')

MBAUER	eastus	RG-DP100


In [11]:
# Method 2: Using the get() method...

from azureml.core import Workspace, Experiment, Run
ws = Workspace.get(name='MBAUER',
                   subscription_id='52b56929-ee84-495c-91c3-a84dfacbc9d2',
                   resource_group='RG-DP100'
                  )
print(ws.name, ws.location, ws.resource_group, sep='\t')

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code FUKLTN3L9 to authenticate.
Interactive authentication successfully completed.


UserErrorException: UserErrorException:
	Message: You are currently logged-in to 60623c36-25e7-4dec-a900-05b500441e54 tenant. You don't have access to <your-subscription_id subscription, please check if it is in this tenant. All the subscriptions that you have access to in this tenant are = 
 [SubscriptionInfo(subscription_name='MSDN Platforms', subscription_id='52b56929-ee84-495c-91c3-a84dfacbc9d2')]. 
 Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.
	InnerException None
	ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "You are currently logged-in to 60623c36-25e7-4dec-a900-05b500441e54 tenant. You don't have access to <your-subscription_id subscription, please check if it is in this tenant. All the subscriptions that you have access to in this tenant are = \n [SubscriptionInfo(subscription_name='MSDN Platforms', subscription_id='52b56929-ee84-495c-91c3-a84dfacbc9d2')]. \n Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk."
    }
}

In [48]:
import logging

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
print('Modules loaded...')

In [52]:
#ws = Workspace.from_config()

# Choose a name for the experiment and specify the project folder.
experiment_name = 'dp100lab-automl'
project_folder = './dp100lab'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

| | |
|-|-|
|Experiment Name|automl-local-classification|
|Location|eastus2|
|Project Directory|./sample_projects/automl-local-classification|
|Resource Group|rg_amlsstudentworkspace1|
|SDK version|1.0.17|
|Subscription ID|050aedf0-1ce2-495f-a726-be8756f5ab18|
|Workspace Name|amlsstudentworkspace1|


## Explore Data

You already explored the data in the last lab. You need to copy the data into the cloud so that your cloud training environment can access it. The model training data was saved to a CSV file, so all you have to do here is load it.

In [3]:
import pandas as pd

df_features = pd.read_csv(r'./BikeModelFeatures.csv')
df_features.head()

In [22]:
from sklearn.model_selection import train_test_split

x_train, x_test = train_test_split(df_features.values, test_size=0.2)  # hold out 20% of the rows for testing

In [31]:
y_train = x_train[:,3]
y_train

array(['Touring-2000', 'Touring-1000', 'Road-550-W', ..., 'Road-650',
       'Road-750', 'Mountain-100'], dtype=object)

In [32]:
y_test = x_test[:,3]
y_test

array(['Touring-2000', 'Road-750', 'Mountain-200', ..., 'Mountain-200',
       'Mountain-400-W', 'Touring-2000'], dtype=object)

In [39]:
x_train = x_train[:,0:3]
x_train

array([['S', 'M', 'GB'],
       ['S', 'M', 'AU'],
       ['M', 'F', 'DE'],
       ...,
       ['M', 'F', 'US'],
       ['M', 'F', 'US'],
       ['M', 'M', 'AU']], dtype=object)

In [40]:
x_test = x_test[:,0:3]
x_test

array([['S', 'M', 'AU'],
       ['M', 'M', 'US'],
       ['M', 'M', 'GB'],
       ...,
       ['M', 'M', 'GB'],
       ['S', 'F', 'FR'],
       ['M', 'F', 'AU']], dtype=object)
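The feature arrays above hold categorical strings rather than numbers, which is why the configuration later in this lab sets `preprocess=True`. As a rough illustration of what turning such values into numeric columns can look like, here is a minimal pure-Python one-hot encoding sketch. The helper `one_hot_encode` is hypothetical and is not a description of what AutoML does internally; the three sample rows are copied from the `x_train` output above.

```python
def one_hot_encode(rows):
    """One-hot encode a list of equal-length rows of categorical strings."""
    n_cols = len(rows[0])
    # Sorted distinct values per column give a stable output column order.
    categories = [sorted({row[c] for row in rows}) for c in range(n_cols)]
    encoded = []
    for row in rows:
        vector = []
        for c, value in enumerate(row):
            # One 0/1 indicator per known category of this column.
            vector.extend(1 if value == cat else 0 for cat in categories[c])
        encoded.append(vector)
    return categories, encoded


# Illustrative rows taken from the x_train output shown above.
sample_rows = [['S', 'M', 'GB'], ['S', 'M', 'AU'], ['M', 'F', 'DE']]
categories, encoded = one_hot_encode(sample_rows)
print(categories)
for vector in encoded:
    print(vector)
```

Each input row becomes a vector with one indicator per distinct value in each column, so three string columns expand to seven numeric columns for this small sample.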

In [20]:
print(len(x_train))
len(x_test)

12164


3041
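The two row counts above add up to 15205 rows, with 3041 of them held out for testing. The sketch below reproduces that arithmetic in plain Python, assuming the test share is a fraction of the row count rounded up. The helper `split_rows` and its `seed` parameter are hypothetical and only mimic the idea of a shuffled hold-out split; they are not the scikit-learn implementation.

```python
import math
import random


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle rows and hold out a test_size fraction (rounded up) for testing."""
    n_test = math.ceil(len(rows) * test_size)
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(15205))  # stand-in for the 12164 + 3041 rows counted above
train_rows, test_rows = split_rows(rows, test_size=0.2)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which can help when comparing results across reruns of the notebook.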

## Train

Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|
|**n_cross_validations**|Number of cross validation splits.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ], target values.|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
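To make the `primary_metric` used in this lab more concrete, the following sketch computes a weighted precision by hand: per-class precision, averaged with weights proportional to how often each class occurs in the true labels. The function `weighted_precision` and the sample labels are illustrative assumptions; AutoML's own metric computation is not shown in this notebook and may handle edge cases differently.

```python
from collections import Counter


def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to each class's true count."""
    support = Counter(y_true)      # how often each class really occurs
    predicted = Counter(y_pred)    # how often each class was predicted
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        # A class that is never predicted contributes a precision of 0.
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        score += precision * count / total
    return score


# Hypothetical labels, chosen only to keep the arithmetic easy to follow.
y_true = ['Road-750', 'Road-750', 'Mountain-200', 'Touring-2000']
y_pred = ['Road-750', 'Mountain-200', 'Mountain-200', 'Road-750']
print(weighted_precision(y_true, y_pred))
```

In this example `Road-750` and `Mountain-200` each have a precision of 0.5 and `Touring-2000` is never predicted, so the weighted result is 0.5 × 2/4 + 0.5 × 1/4 + 0 × 1/4 = 0.375.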

In [60]:
automl_config = AutoMLConfig(task = 'classification',
                             iteration_timeout_minutes = 10,
                             iterations = 10,
                             primary_metric = 'precision_score_weighted',
                             n_cross_validations = 5,
                             debug_log = 'automl.log',
                             verbosity = logging.INFO,
                             X = x_train, 
                             y = y_train,
                             preprocess=True,
                             path = project_folder)
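The setting `n_cross_validations = 5` means each candidate pipeline is scored on five different train/validation partitions of the training data. The sketch below shows one simple way such partitions can be formed, using contiguous folds of near-equal size. The generator `k_fold_indices` is hypothetical; how AutoML actually shuffles or stratifies its folds is not stated in this notebook.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_idx, valid_idx) pairs for contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size


# 12164 is the number of training rows printed earlier in this notebook.
fold_sizes = []
for fold, (train_idx, valid_idx) in enumerate(k_fold_indices(12164, n_splits=5)):
    fold_sizes.append(len(valid_idx))
    print(fold, len(train_idx), len(valid_idx))
```

Every row lands in exactly one validation fold, so each pipeline is validated on all of the training data once across the five splits.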

In [57]:
print('Done')

Done


Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.
In this example, we specify `show_output = True` to print currently running iterations to the console.

In [62]:
local_run = experiment.submit(automl_config, show_output = True)

Running on local machine
Parent Run ID: AutoML_2e1a8864-811d-45cf-adf6-4a161a165ef6
********************************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
SAMPLING %: Percent of the training data to sample.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
********************************************************************************************************************

 ITERATION   PIPELINE                                       SAMPLING %  DURATION      METRIC      BEST
         0   MaxAbsScaler SGD                               100.0000    0:00:34       0.0976    0.0976
         1   MaxAbsScaler ExtremeRandomTrees                100.0000    0:00:26       0.0110    0.0976
         2   MaxAbsScaler SGD                               100

In [63]:
local_run

|Experiment|Id|Type|Status|Details Page|Docs Page|
|-|-|-|-|-|-|
|automl-local-classification|AutoML_2e1a8864-811d-45cf-adf6-4a161a165ef6|automl|Completed|Link to Azure Portal|Link to Documentation|


## Results

#### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [15]:
from azureml.widgets import RunDetails
RunDetails(local_run).show() 

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…


#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see individual metrics that we log.

In [16]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

| |0|1|2|3|4|5|6|7|8|9|
|-|-|-|-|-|-|-|-|-|-|-|
|explained_variance|0.5|0.48|0.48|0.49|-0.03|0.54|0.45|0.52|0.54|0.53|
|mean_absolute_error|45.89|46.43|47.35|46.27|62.63|43.64|46.07|45.17|43.64|44.33|
|median_absolute_error|41.6|39.25|43.73|40.7|51.0|37.72|39.72|41.39|37.67|39.8|
|normalized_mean_absolute_error|0.14|0.14|0.15|0.14|0.2|0.14|0.14|0.14|0.14|0.14|
|normalized_median_absolute_error|0.13|0.12|0.14|0.13|0.16|0.12|0.12|0.13|0.12|0.12|
|normalized_root_mean_squared_error|0.17|0.18|0.18|0.18|0.25|0.17|0.18|0.17|0.17|0.17|
|normalized_root_mean_squared_log_error|0.17|0.17|0.17|0.17|0.23|0.16|0.17|0.16|0.16|0.16|
|r2_score|0.48|0.46|0.45|0.47|-0.06|0.52|0.43|0.5|0.52|0.51|
|root_mean_squared_error|55.44|56.84|57.18|56.22|79.6|53.31|58.35|54.54|53.3|54.06|
|root_mean_squared_log_error|0.43|0.43|0.44|0.44|0.59|0.42|0.44|0.43|0.42|0.42|


### Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [17]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: automl-local-regression,
Id: AutoML_64ef6e81-abdf-47b5-a4a6-acb0cd127f66_8,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('StandardScalerWrapper', <automl.client.core.common.model_wrappers.StandardScalerWrapper object at 0x7fbd8ff38780>), ('LassoLars', LassoLars(alpha=0.001, copy_X=True, eps=2.220446049250313e-16,
     fit_intercept=True, fit_path=True, max_iter=500, normalize=False,
     positive=False, precompute='auto', verbose=False))])
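Conceptually, choosing the "best" run comes down to comparing one logged score per iteration and knowing whether larger or smaller is better for that metric. The sketch below illustrates that idea in plain Python. The helper `best_iteration` is hypothetical and is not how `get_output` is implemented; the precision scores are the two complete rows from the training log earlier in this notebook, while the loss values are made up for illustration.

```python
def best_iteration(metric_by_iteration, higher_is_better=True):
    """Return the (iteration, score) pair with the best score."""
    pick = max if higher_is_better else min
    return pick(metric_by_iteration.items(), key=lambda item: item[1])


# Scores for iterations 0 and 1, copied from the training log shown earlier.
precision_scores = {0: 0.0976, 1: 0.0110}
print(best_iteration(precision_scores))

# Hypothetical loss values: for a loss-style metric, the smallest score wins.
loss_scores = {0: 1.9, 1: 2.4}
print(best_iteration(loss_scores, higher_is_better=False))
```

Iteration 0 wins in both cases here: it has the larger precision and, in the made-up loss data, the smaller loss.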


#### Best Model Based on Any Other Metric
Show the run and the model that has the smallest `log_loss` value:

In [18]:
lookup_metric = "log_loss"
best_run, fitted_model = local_run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)

Run(Experiment: automl-local-regression,
Id: AutoML_64ef6e81-abdf-47b5-a4a6-acb0cd127f66_8,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('StandardScalerWrapper', <automl.client.core.common.model_wrappers.StandardScalerWrapper object at 0x7fbd8fef6160>), ('LassoLars', LassoLars(alpha=0.001, copy_X=True, eps=2.220446049250313e-16,
     fit_intercept=True, fit_path=True, max_iter=500, normalize=False,
     positive=False, precompute='auto', verbose=False))])


#### Model from a Specific Iteration
Show the run and the model from iteration 3 (iterations are numbered from 0, so this is the fourth one):

In [19]:
iteration = 3
third_run, third_model = local_run.get_output(iteration = iteration)
print(third_run)
print(third_model)

Run(Experiment: automl-local-regression,
Id: AutoML_64ef6e81-abdf-47b5-a4a6-acb0cd127f66_3,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('StandardScalerWrapper', <automl.client.core.common.model_wrappers.StandardScalerWrapper object at 0x7fbd8fe8e0b8>), ('LightGBMRegressor', <automl.client.core.common.model_wrappers.LightGBMRegressor object at 0x7fbd8fecd160>)])


## Test

Predict on the test set and calculate the model accuracy.

In [None]:
y_predtest = fitted_model.predict(x_test)

In [None]:
# calculate accuracy on the prediction
acc = np.average(y_predtest == y_test)
print('Accuracy is', acc)
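A single accuracy number can hide large differences between classes, which matters here because the target has many bike models. As a possible follow-up, the sketch below breaks accuracy down per true label using only the standard library. The function `per_class_accuracy` and the sample labels are illustrative assumptions; with the real arrays you would pass `y_test` and `y_predtest` instead.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples for each true label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical labels standing in for y_test and y_predtest.
y_true = ['Road-750', 'Road-750', 'Mountain-200', 'Touring-2000']
y_pred = ['Road-750', 'Mountain-200', 'Mountain-200', 'Road-750']

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print('Overall accuracy:', overall)
for label, acc in per_class_accuracy(y_true, y_pred).items():
    print(label, acc)
```

In this small example the overall accuracy is 0.5, yet `Mountain-200` is always predicted correctly and `Touring-2000` never is.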