Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Automated Machine Learning
_**Regression with Aml Compute**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Results](#Results)
1. [Test](#Test)



## Introduction
In this example we use an experimental feature, Model Proxy, to run predictions with the best generated model without downloading the model locally. The prediction runs on the same compute and in the same environment that were used to train the model. This feature is currently experimental, which means that its API may change. If you face any issues, make sure you are running the latest version of this notebook.

If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration](../../../../configuration.ipynb) notebook first, if you haven't already, to establish your connection to the AzureML Workspace.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Configure AutoML using `AutoMLConfig`.
3. Train the model using remote compute.
4. Explore the results.
5. Test the best fitted model.

## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [1]:
import logging

import json


import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.train.automl import AutoMLConfig

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

In [2]:
print("This notebook was created using version AZUREML-SDK-VERSION of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

This notebook was created using version AZUREML-SDK-VERSION of the Azure ML SDK
You are currently using version 1.24.0 of the Azure ML SDK


In [3]:
ws = Workspace.from_config()

# Choose a name for the experiment.
experiment_name = 'automl-regression-model-proxy'

experiment = Experiment(ws, experiment_name)

output = {}
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Run History Name'] = experiment_name
output

{'Subscription ID': '381b38e9-9840-4719-a5a0-61d9585e1e91',
 'Workspace': 'cesardl-automl-eastus2euap-ws',
 'Resource Group': 'cesardl-automl-eastus2euap-resgrp',
 'Location': 'eastus2euap',
 'Run History Name': 'automl-regression-model-proxy'}

### Using AmlCompute
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you use `AmlCompute` as your training compute resource.

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
# Try to ensure that the cluster name is unique across the notebooks
cpu_cluster_name = "reg-model-proxy"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Data


### Load Data
Load the hardware dataset from a CSV file containing both training features and labels. The features are inputs to the model, while the training labels represent the expected output of the model. Next, we split the data into training and test sets using `random_split`.

In [5]:
data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/machineData.csv"
dataset = Dataset.Tabular.from_delimited_files(data)

# Split the dataset into train and test datasets
train_data, test_data = dataset.random_split(percentage=0.8, seed=223)

label = "ERP"
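
The shapes printed later in this notebook (164 training rows and 45 test rows) are not an exact 80/20 split, which is consistent with `random_split` dividing the records only approximately by the requested percentage. The sketch below illustrates one way such a seeded, approximate split can behave. It is a plain-Python illustration, not the SDK's implementation: the helper `approximate_random_split` is hypothetical, and the 209 rows are stand-in integers rather than the real data.

```python
import random


def approximate_random_split(rows, percentage, seed):
    """Send each row to the first split with probability `percentage`."""
    rng = random.Random(seed)  # a seeded generator makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second


# Stand-in for the 209 records of the dataset (164 + 45 in the notebook output).
rows = list(range(209))
train_rows, test_rows = approximate_random_split(rows, percentage=0.8, seed=223)
print(len(train_rows), len(test_rows))
```

Because every row is assigned independently, the split sizes vary around the requested percentage, while reusing the same seed reproduces the same split.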


## Train

Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification, regression or forecasting|
|**primary_metric**|This is the metric that you want to optimize. Regression supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**n_cross_validations**|Number of cross validation splits.|
|**training_data**|Input dataset (here a `TabularDataset`) containing both the training features and the label column.|
|**label_column_name**|The name of the label column in `training_data`.|

**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)
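
To make two of the primary metrics listed above more concrete, here is a minimal plain-Python sketch of `r2_score` and a normalized root mean squared error. The sample values are made up for illustration. Dividing the RMSE by the range of the true values is an assumption about how the normalization is defined, so check the linked documentation for the exact definition used by AutoML.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range of the true values (assumed normalization)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))


# Made-up values, not taken from the hardware dataset.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

r2 = r2_score(y_true, y_pred)
nrmse = normalized_rmse(y_true, y_pred)
print(r2, nrmse)
```

A higher `r2_score` is better, while a lower normalized error is better, which matters when comparing child runs on different metrics.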

In [6]:
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": 'r2_score',
    "enable_early_stopping": True,
    "experiment_timeout_hours": 0.3,  # for real scenarios we recommend a timeout of at least one hour
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(task = 'regression',
                             compute_target = compute_target,
                             training_data = train_data,
                             label_column_name = label,
                             iterations=5,
                             **automl_settings
                            )

Call the `submit` method on the experiment object and pass the run configuration. Execution of remote runs is asynchronous, and depending on the data and the number of iterations a run can take a while. If you set `show_output=True`, the call becomes synchronous and prints validation errors and the current status as the run progresses.

In [7]:
remote_run = experiment.submit(automl_config, show_output = False)

Running on remote.


In [8]:
# If you need to retrieve a run that already started, use the following code
#from azureml.train.automl.run import AutoMLRun
#remote_run = AutoMLRun(experiment = experiment, run_id = '<replace with your run id>')

In [9]:
remote_run

Experiment,Id,Type,Status,Details Page,Docs Page
automl-regression-model-proxy,AutoML_6c7cb439-ca43-4fa6-a4ca-8820f026ab2a,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Results

In [10]:
remote_run.wait_for_completion(show_output=True)


Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       DONE
DESCRIPTION:  High cardinality features were detected in your inputs and handled.
              Learn more about high cardinality feature handling: https://aka.ms/AutomatedMLFeaturization
DETAILS:      High cardinality features refer to colum

{'runId': 'AutoML_6c7cb439-ca43-4fa6-a4ca-8820f026ab2a',
 'target': 'reg-model-proxy',
 'status': 'Completed',
 'startTimeUtc': '2021-03-27T01:42:24.386655Z',
 'endTimeUtc': '2021-03-27T01:53:13.978258Z',
 'properties': {'num_iterations': '5',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'r2_score',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '3',
  'target': 'reg-model-proxy',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"e35b057d-82fc-4ad8-9088-f789c0be3a8e\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"isArchive\\\\\\": false, \\\\\\"path\\\\\\": {\\\\\\"target\\\\\\": 1, \\\\\\"resourceDetails\\\\\\": [{\\\\\\"path\\\\\\": \\\\\\"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/machineData.csv\\\\\\", \\\\\\"sas\\\\\\": null, \\\\\\"storageAccountName\\\\\\": null

### Retrieve the Best Child Run

Below we select the best pipeline from our iterations. The `get_best_child` method returns the best run. Overloads on `get_best_child` allow you to retrieve the best run for *any* logged metric.

In [11]:
best_run = remote_run.get_best_child()
print(best_run)

Run(Experiment: automl-regression-model-proxy,
Id: AutoML_6c7cb439-ca43-4fa6-a4ca-8820f026ab2a_1,
Type: azureml.scriptrun,
Status: Completed)


#### Show hyperparameters
Show the model pipeline used for the best run with its hyperparameters.

In [12]:
run_properties = json.loads(best_run.get_details()['properties']['pipeline_script'])
print(json.dumps(run_properties, indent = 1)) 

{
 "objects": [
  {
   "spec_class": "preproc",
   "class_name": "MaxAbsScaler",
   "module": "sklearn.preprocessing",
   "param_args": [],
   "param_kwargs": {},
   "prepared_kwargs": {}
  },
  {
   "spec_class": "sklearn",
   "class_name": "XGBoostRegressor",
   "module": "automl.client.core.common.model_wrappers",
   "param_args": [],
   "param_kwargs": {
    "tree_method": "auto"
   },
   "prepared_kwargs": {}
  }
 ],
 "pipeline_id": "4bc4ec47eb8df2d5d68b361cd60120e65196f757",
 "module": "sklearn.pipeline",
 "class_name": "Pipeline"
}
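
The pipeline specification printed above is plain JSON, so it can be summarized with the standard library. The sketch below embeds a shortened copy of that output and lists each step with its keyword arguments. The helper name `summarize_pipeline` is hypothetical, and the assumption is that other pipelines follow the same `objects` / `class_name` / `param_kwargs` layout seen here.

```python
import json

# Shortened copy of the pipeline specification shown in the output above.
pipeline_script = """
{
 "objects": [
  {"spec_class": "preproc", "class_name": "MaxAbsScaler",
   "module": "sklearn.preprocessing", "param_args": [], "param_kwargs": {}},
  {"spec_class": "sklearn", "class_name": "XGBoostRegressor",
   "module": "automl.client.core.common.model_wrappers",
   "param_args": [], "param_kwargs": {"tree_method": "auto"}}
 ],
 "module": "sklearn.pipeline",
 "class_name": "Pipeline"
}
"""


def summarize_pipeline(spec_json):
    """Return (class_name, param_kwargs) for every step in the specification."""
    spec = json.loads(spec_json)
    return [(step["class_name"], step["param_kwargs"]) for step in spec["objects"]]


steps = summarize_pipeline(pipeline_script)
for class_name, kwargs in steps:
    print(class_name, kwargs)
```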


#### Best Child Run Based on Any Other Metric
Show the run that has the smallest `root_mean_squared_error` value. In the output below this is child run `_3`, whereas the primary metric `r2_score` selected child run `_1` above:

In [13]:
lookup_metric = "root_mean_squared_error"
best_run = remote_run.get_best_child(metric = lookup_metric)
print(best_run)

Run(Experiment: automl-regression-model-proxy,
Id: AutoML_6c7cb439-ca43-4fa6-a4ca-8820f026ab2a_3,
Type: azureml.scriptrun,
Status: Completed)
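
Selecting the best child run depends on whether the chosen metric should be maximized or minimized, which is why the two lookups above can return different runs. The sketch below shows that idea in plain Python. It is not how `get_best_child` is implemented: the run names, the metric values, and the `LOWER_IS_BETTER` set are all hypothetical and not taken from this experiment.

```python
# Hypothetical metric values for three child runs (not from this experiment).
child_metrics = {
    "run_0": {"r2_score": 0.90, "root_mean_squared_error": 30.0},
    "run_1": {"r2_score": 0.95, "root_mean_squared_error": 25.0},
    "run_3": {"r2_score": 0.94, "root_mean_squared_error": 24.0},
}

# Error metrics are minimized; score metrics such as r2_score are maximized.
LOWER_IS_BETTER = {"root_mean_squared_error", "mean_absolute_error"}


def best_child(metrics_by_run, metric):
    """Return the run id with the best value for `metric`."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(metrics_by_run, key=lambda run_id: metrics_by_run[run_id][metric])


print(best_child(child_metrics, "r2_score"))
print(best_child(child_metrics, "root_mean_squared_error"))
```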


In [14]:
# Separate the label column from the features for both splits.
y_test = test_data.keep_columns(label)
test_data = test_data.drop_columns(label)

y_train = train_data.keep_columns(label)
train_data = train_data.drop_columns(label)

train_df = train_data.to_pandas_dataframe()
print(train_df.shape)

test_df = test_data.to_pandas_dataframe()
print(test_df.shape)

(164, 9)
(45, 9)


#### Creating ModelProxy for submitting prediction runs to the training environment
We will create a ModelProxy for the best child run, which will allow us to submit a run that does the prediction in the training environment. Unlike the local client, which can have different versions of some libraries, the training environment will have all the compatible libraries for the model already.

In [15]:
from azureml.train.automl.model_proxy import ModelProxy
best_model_proxy = ModelProxy(best_run, compute_target)


Class ModelProxy: This is an experimental class, and may change at any time. For more information, see https://aka.ms/azuremlexperimental.


In [16]:
import time
start_time = time.time()
            
y_pred_train = best_model_proxy.predict(train_data)
y_pred_test = best_model_proxy.predict(test_data)

print('Manual run timing: --- %s minutes needed for Predicting with ModelProxy ---' % ((time.time() - start_time)/60))

Method predict: This is an experimental method, and may change at any time. For more information, see https://aka.ms/azuremlexperimental.
Method predict: This is an experimental method, and may change at any time. For more information, see https://aka.ms/azuremlexperimental.


Manual run timing: --- 4.830551930268606 minutes needed for Predicting with ModelProxy ---


#### Exploring results

In [17]:
y_pred_train = y_pred_train.to_pandas_dataframe().values.flatten()
y_train = y_train.to_pandas_dataframe().values.flatten()
y_residual_train = y_train - y_pred_train

y_pred_test = y_pred_test.to_pandas_dataframe().values.flatten()
y_test = y_test.to_pandas_dataframe().values.flatten()
y_residual_test = y_test - y_pred_test
print(y_residual_train)
print(y_residual_test)

[  2.37783813   6.31307983   6.31307983   4.4949646   -6.55200195
  17.17581177 -11.55050659   1.80633545  10.73364258   0.67485619
  -6.62722778  -4.21692085  -3.60374451  -0.13629341   0.79958344
  -0.96817398  -1.22341919  -2.20858383  -0.82394028  -3.02576065
  -0.83440399  -3.02576065  -0.5663166   -0.70326424  -2.15982056
  -1.62332916  -1.62332916   5.15447998   5.33682251  -3.16680908
  -0.24840355   0.91962242  -0.16282654   0.88231468   1.78347397
  -0.47849274  -2.73429871  -0.06510544  -3.25606537  -1.62138939
   1.63004303  -0.66490173  -0.36995697   1.91962242  -0.83935547
   2.87353134  -0.36995697  -0.36995697  -0.36995697  -0.36995697
  -0.36995697  -1.0384922  -10.1778183   -1.12951279   2.03851318
   1.79985809   4.13475037   0.17902565   0.35430908  -0.78735733
  -2.70798492  -1.13717461  -0.02788544  -0.16744232  -0.36228943
   0.34705162  -0.28208542   0.59902382   2.68049622   3.69607544
   1.37962341   1.32244873   0.59902382 -10.79672241  15.22167969
  -2.66299
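
Raw residual arrays are hard to read at a glance, so a few summary statistics can help. The sketch below is one possible way to condense them, using only the first five training residuals copied from the printed output above. The helper name `summarize_residuals` is hypothetical, and the numbers it prints describe only this small sample, not the full model.

```python
import math

# First five training residuals copied from the printed output above.
residuals = [2.37783813, 6.31307983, 6.31307983, 4.4949646, -6.55200195]


def summarize_residuals(res):
    """Condense a list of residuals (y_true - y_pred) into summary statistics."""
    n = len(res)
    return {
        "mean_error": sum(res) / n,  # signed average, indicates bias
        "mean_absolute_error": sum(abs(r) for r in res) / n,
        "root_mean_squared_error": math.sqrt(sum(r * r for r in res) / n),
        "max_absolute_error": max(abs(r) for r in res),
    }


summary = summarize_residuals(residuals)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A mean error far from zero suggests systematic over- or under-prediction, while a root mean squared error much larger than the mean absolute error points to a few large misses.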

In [18]:
# Check particular Child run environment

# import azureml.core
# from azureml.core.run import Run
# 
# child_run = Run(experiment, run_id = 'AutoML_e068009e-b245-4028-b3e6-7910acf759b8_0')
# child_run.get_environment()