Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# AutoML 022: Forecasting with Remote Execution using DSVM (Ubuntu)

In this example, we show how AutoML can be used for energy demand forecasting.


Make sure you have executed the [00.configuration](00.configuration.ipynb) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create a DSVM, or reuse an existing one, as the compute target for a workspace.
3. Configure AutoML using `AutoMLConfig`.
4. Train the model using the DSVM.
5. Get the best fitted model.
6. Test the fitted model.

## Create an Experiment

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import azureml.core
import pandas as pd
import numpy as np
import os
import logging
import time

from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
ws = Workspace.from_config()

# Choose a name for the run history container in the workspace.
experiment_name = 'timeseries-remote'
project_folder = './sample_projects/timeseries-remote'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data = output, index = ['']).T

## Diagnostics

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics = True)

## Create a Remote Linux DSVM
**Note:** If creation fails with a message about Marketplace purchase eligibility, start creation of a DSVM through the [Azure portal](https://portal.azure.com), and select "Want to create programmatically" to enable programmatic creation. Once you've enabled this setting, you can exit the portal without actually creating the DSVM, and creation of the DSVM through the notebook should work.


In [None]:
from azureml.core.compute import DsvmCompute

dsvm_name = 'forecastdsvm'

try:
    dsvm_compute = DsvmCompute(ws, dsvm_name)
    print('Found an existing DSVM.')
except Exception:  # most likely no DSVM with this name exists in the workspace yet
    print('Creating a new DSVM.')
    dsvm_config = DsvmCompute.provisioning_configuration(vm_size = "Standard_D2_v2")
    dsvm_compute = DsvmCompute.create(ws, name = dsvm_name, provisioning_configuration = dsvm_config)
    dsvm_compute.wait_for_completion(show_output = True)
    print("Waiting one minute for ssh to be accessible")
    time.sleep(60) # Wait for ssh to be accessible
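
The fixed one-minute sleep above is a simple heuristic. As an alternative, a minimal sketch could poll the SSH port until it accepts a TCP connection. The helper name `wait_for_port` and its parameters are illustrative and not part of the Azure ML SDK; the host address would have to be looked up separately (for example in the Azure portal).

```python
import socket
import time


def wait_for_port(host, port=22, timeout=120.0, interval=2.0):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper: it only checks that the port accepts connections,
    not that SSH authentication will succeed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError if the port is closed or unreachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

With a placeholder address `dsvm_host`, a call such as `wait_for_port(dsvm_host, 22)` could replace the `time.sleep(60)` line; it returns as soon as the port opens instead of always waiting a full minute.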

## Add dependencies to the run config
The time-series package has some dependencies from PyPI, so we need to add them to the run configuration.

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to the Linux DSVM
conda_run_config.target = dsvm_compute

cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]', 'dill', 'h5py', 'keras', 'numexpr', 'statsmodels', 'distributed==1.23.1'], conda_packages=['numpy'])
conda_run_config.environment.python.conda_dependencies = cd

## Create Get Data File
For remote executions you should author a `get_data.py` file containing a `get_data()` function. This file should be in the root directory of the project. In this file you can encapsulate the code that reads data from either blob storage or local disk.
In this example, the `get_data()` function returns NYC energy data.

In [None]:
if not os.path.exists(project_folder):
    os.makedirs(project_folder)

In [None]:
%%writefile $project_folder/get_data.py

# Only pandas is needed by get_data(); the unused imports were removed.
import pandas as pd

def get_data():
    df = pd.read_csv("https://automldata.blob.core.windows.net/datasets/nyc_energy.csv", parse_dates=['timeStamp'])
    train = df[df['timeStamp'] < '2017-02-01']

    X_train = train[train['timeStamp'] < '2017-01-01']
    X_valid = train[train['timeStamp'] >= '2017-01-01']

    y_train = X_train.pop('demand').values
    y_valid = X_valid.pop('demand').values

    return { "X" : X_train, "y" : y_train, "X_valid" : X_valid, "y_valid" : y_valid, "x_raw_column_names" : train.columns }

In [None]:
%run  $project_folder/get_data.py
data_dict = get_data()
df = data_dict["X"]
y = data_dict["y"]
df.head()
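
The split in `get_data()` is chronological: rows before 2017-01-01 are used for training and rows from 2017-01-01 up to 2017-02-01 for validation, while the test cell later in the notebook takes rows from 2017-02-01 onward. The sketch below illustrates that three-way split on synthetic daily data. It uses plain tuples instead of a DataFrame to stay dependency-free, and the helper name `chronological_split` and the demand values are made up for illustration.

```python
from datetime import datetime, timedelta


def chronological_split(rows, valid_start, test_start):
    """Split (timestamp, value) rows into train/valid/test by two cutoffs."""
    train = [r for r in rows if r[0] < valid_start]
    valid = [r for r in rows if valid_start <= r[0] < test_start]
    test = [r for r in rows if r[0] >= test_start]
    return train, valid, test


# Synthetic daily data standing in for the NYC energy file (values are made up).
start = datetime(2016, 12, 25)
rows = [(start + timedelta(days=i), 100.0 + i) for i in range(50)]

train, valid, test = chronological_split(
    rows, valid_start=datetime(2017, 1, 1), test_start=datetime(2017, 2, 1)
)

# No timestamp in an earlier split may fall after one in a later split.
assert max(t for t, _ in train) < min(t for t, _ in valid)
assert max(t for t, _ in valid) < min(t for t, _ in test)
print(len(train), len(valid), len(test))
```

Splitting by time rather than at random keeps future observations out of the training data, which matters when the goal is to forecast forward.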

## Instantiate Auto ML Config

Instantiate an `AutoMLConfig` object. This defines the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|This is the metric that you want to optimize.<br>Forecasting supports the following primary metrics:<br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iterations**|Number of iterations. In each iteration, Auto ML trains a specific pipeline on the given data.|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
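
To give a feel for the `normalized_root_mean_squared_error` metric, here is a minimal sketch that assumes the normalization divides the RMSE by the range (max minus min) of the actual values. The function name `normalized_rmse` is illustrative, and the service's exact computation may differ in details.

```python
import math


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the actual values.

    Assumption: this mirrors the usual 'divide by range' normalization;
    it is not taken from the AutoML source code.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("range of y_true is zero; normalization is undefined")
    return math.sqrt(mse) / value_range
```

Under this definition, a perfect forecast scores 0, and the value is independent of the units of `demand`, which makes runs on differently scaled series easier to compare.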

In [None]:
time_column_name = 'timeStamp'
automl_settings = {
    "iteration_timeout_minutes": 5,
    "iterations": 10,
    "primary_metric": 'normalized_root_mean_squared_error',
    "time_column_name": time_column_name,
    "debug_log": 'automl_forecast_remote.log',

}

automl_config = AutoMLConfig(task = 'forecasting',           
                             path = project_folder, 
                             compute_target = dsvm_compute,
                             data_script = project_folder + "/get_data.py",
                             run_configuration=conda_run_config,
                             **automl_settings
                            )


**Note:** The first run on a new DSVM may take several minutes to prepare the environment.

## Train the Models

Call the `submit` method on the experiment object and pass the run configuration. For remote runs the execution is asynchronous, so you will see the iterations get populated as they complete. You can interact with the widgets and models even when the experiment is running to retrieve the best model up to that point. Once you are satisfied with the model, you can cancel a particular iteration or the whole run.

In this example, we specify `show_output = False` to suppress console output while the run is in progress.

In [None]:
remote_run = experiment.submit(automl_config, show_output = False)

## Exploring the Results <a class="anchor" id="Exploring-the-Results-Remote-DSVM"></a>
#### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

You can click on a pipeline to see run properties and output logs.  Logs are also available on the DSVM under `/tmp/azureml_run/{iterationid}/azureml-logs`

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [None]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show()

In [None]:
# Wait until the run finishes.
remote_run.wait_for_completion(show_output = True)

### Pre-process cache cleanup
The preprocessed data is cached in the user's default file store. Once the run has completed, the cache can be cleaned up by running the cell below.

In [None]:
remote_run.clean_preprocessor_cache()

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [None]:
best_run, fitted_model = remote_run.get_output()
print(best_run)
print(fitted_model)

### Test the Best Fitted Model

Predict on the test set (rows from 2017-02-01 onward). The predictions are then compared with the actual values to calculate error metrics.

In [None]:
df = pd.read_csv("https://automldata.blob.core.windows.net/datasets/nyc_energy.csv", parse_dates=['timeStamp'])
    
test = df[df['timeStamp'] >= '2017-02-01']
y_test = test.pop('demand').values
X_test = test

In [None]:
y_pred = fitted_model.predict(X_test)
y_pred

### Define a Check Data Function

Remove the NaN values from `y_test` and `y_pred` to avoid errors when calculating metrics.

In [None]:
def _check_calc_input(y_true, y_pred, rm_na=True):
    """
    Check that 'y_true' and 'y_pred' are non-empty and
    have equal length.

    :param y_true: Vector of actual values
    :type y_true: array-like

    :param y_pred: Vector of predicted values
    :type y_pred: array-like

    :param rm_na:
        If rm_na=True, remove entries where y_true=NA or y_pred=NA.
    :type rm_na: boolean

    :return:
        Tuple (y_true, y_pred). if rm_na=True,
        the returned vectors may differ from their input values.
    :rtype: Tuple with 2 entries
    """
    if len(y_true) != len(y_pred):
        raise ValueError(
            'the true values and prediction values do not have equal length.')
    elif len(y_true) == 0:
        raise ValueError(
            'y_true and y_pred are empty.')
    # if there is any non-numeric element in the y_true or y_pred,
    # the ValueError exception will be thrown.
    y_true = np.array(y_true).astype(float)
    y_pred = np.array(y_pred).astype(float)
    if rm_na:
        # Drop every position where y_true or y_pred is missing,
        # using one shared mask so the two vectors stay aligned.
        keep = ~(np.isnan(y_true) | np.isnan(y_pred))
        return y_true[keep], y_pred[keep]
    return y_true, y_pred

In [None]:
y_test, y_pred = _check_calc_input(y_test, y_pred)
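
The important property of `_check_calc_input` is that a position is dropped from both vectors whenever either value is missing, so actuals and predictions stay aligned. A dependency-free sketch of the same idea, with the illustrative name `drop_missing_pairs` and made-up numbers, looks like this:

```python
import math


def drop_missing_pairs(y_true, y_pred):
    """Keep only positions where both the actual and predicted value are present."""
    kept = [
        (t, p)
        for t, p in zip(y_true, y_pred)
        if not (math.isnan(t) or math.isnan(p))
    ]
    true_kept = [t for t, _ in kept]
    pred_kept = [p for _, p in kept]
    return true_kept, pred_kept


nan = float("nan")
y_true_demo = [10.0, nan, 30.0, 40.0]   # actual value missing at position 1
y_pred_demo = [11.0, 21.0, nan, 39.0]   # prediction missing at position 2
print(drop_missing_pairs(y_true_demo, y_pred_demo))
```

Filtering each vector on its own NaN positions would leave the two lists misaligned, and the metrics would then compare unrelated points.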

### Calculate metrics for the prediction


In [None]:
print("[Test Data] \nRoot Mean squared error: %.2f" % np.sqrt(mean_squared_error(y_test, y_pred)))
print('mean_absolute_error score: %.2f' % mean_absolute_error(y_test, y_pred))
# R2 score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y_test, y_pred))


%matplotlib notebook
# Plot outputs
test_pred = plt.scatter(y_test, y_pred, color='b')
test_test = plt.scatter(y_test, y_test, color='g')
plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)
plt.show()
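
For readers who want to see what the printed scores measure, the sketch below spells out the standard textbook definitions of mean absolute error and R2 in plain Python. The function names are illustrative; the notebook itself relies on the scikit-learn implementations, which also handle options such as sample weights that are omitted here.

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
print(r_squared(actual, actual))        # perfect forecast
print(r_squared(actual, [2.5] * 4))     # always predicting the mean
```

A perfect forecast gives an R2 of 1, while a model that always predicts the mean of the actual values gives 0, which is why R2 is often read as the improvement over that naive baseline.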