Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Machine Learning Experimentation with AML Compute

## Introduction

### Recap

In the previous notebooks we developed a solution for online anomaly detection.  
- We implemented an algorithm for calculating running averages of sensor data online, rather than batch.
- We also tried to determine how many recent sensor readings we need to store so that our algorithm can still identify seasonal and linear trends.  

These improvements to the batch solution for AD saved both time and space.  
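
As a reminder of the idea, here is a minimal sketch of such an online update in plain Python. It mirrors the update rule of the `running_avg` function that appears in the execution script later in this notebook. The helper name `update_running_avg`, the sample readings, and the use of a `deque` to cap the number of stored values are illustrative assumptions, not the code from the previous labs.

```python
from collections import deque


def update_running_avg(prev_avg, new_value, n_seen, com=6):
    """Return the running average after one new reading.

    n_seen is the number of readings processed before this one (>= 1).
    """
    curr_com = float(min(com, n_seen))
    return prev_avg + (new_value - prev_avg) / (curr_com + 1)


readings = [10.0, 12.0, 11.0, 50.0, 12.0]  # made-up sensor values
window_size = 3                             # hypothetical number of values to keep

avg = readings[0]
recent = deque([avg], maxlen=window_size)   # only the newest values are stored

for n_seen, value in enumerate(readings[1:], start=1):
    avg = update_running_avg(avg, value, n_seen)
    recent.append(avg)

print(list(recent))
```

Each update only needs the previous average and the new reading, which is why the online version does not have to revisit the full history.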

To test whether the resulting online version worked as well as the original batch solution (which can be considered the *ground truth*), we used a test script (`sample_run.py`) that compared the performance of the two solutions. 


### Goals

In the previous lab, we ran our test script manually with different parameters to see how well our online solution compared to the original batch solution.

In this lab, we will develop a more sophisticated approach to optimizing our online solution, leveraging AML SDK tools for *Machine Learning experimentation* that allow us to log model performance.

You will learn:
- How to define a remote `AmlCompute` compute target for ML Experimentation on Azure
- How to configure this compute target for auto scaling, so you only pay for computing resources you are using
- How to modify your analysis script, so that you can run it on AmlCompute
- How to investigate the run history to determine which hyperparameter settings in your analysis script gave you the best results.

## Getting started

Let's get started. First let's import some Python libraries.

In [1]:
# %matplotlib inline

import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt

In [2]:
import azureml
from azureml.core import Workspace, Run

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.0.6


## Diagnostics
Opt-in diagnostics for better experience, quality, and security of future releases.

In [3]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

Turning diagnostics collection on. 


## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [4]:
from azureml.core.workspace import Workspace

# If you are on Azure Databricks, use this
# config_path = '/dbfs/tmp/'

# If you are running this on Jupyter, you may want to run 
# config_path = os.path.expanduser('~')
config_path = '..'

ws = Workspace.from_config(path=os.path.join(config_path, 'aml_config','config.json'))

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Found the config file in: /home/nbuser/library/aml_config/config.json
Workspace name: myADworkspace
Azure region: westus2
Subscription id: 5be49961-ea44-42ec-8021-b728be90d58c
Resource group: wopauli_AD


## Create an Azure ML experiment
Let's create an experiment named "ADMLExp" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

In [5]:
from azureml.core import Experiment

script_folder = 'scripts'

os.makedirs(script_folder, exist_ok=True)

exp = Experiment(workspace=ws, name='ADMLExp')

## Download telemetry dataset
In order to test on the telemetry dataset, we first need to download the data files from a public Azure Blob Storage container and save them in a `data` folder locally.

In [6]:
import os
import urllib.request  # importing the submodule explicitly makes urllib.request.urlretrieve available

data_path = os.path.join(config_path, 'data')
os.makedirs(data_path, exist_ok=True)

container = 'https://sethmottstore.blob.core.windows.net/predmaint/'

urllib.request.urlretrieve(container + 'telemetry.csv', filename=os.path.join(data_path, 'telemetry.csv'))
urllib.request.urlretrieve(container + 'anoms.csv', filename=os.path.join(data_path, 'anoms.csv'))

('../data/anoms.csv', <http.client.HTTPMessage at 0x7efd498af320>)
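
If you re-run the notebook often, you may want to skip files that are already on disk. The following is a minimal sketch of that idea; the helper `download_if_missing` is hypothetical, and the demo uses a local `file://` URL with a made-up CSV header so that it runs without network access.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url, target_path):
    """Download url to target_path unless the file already exists.

    Returns True if a download happened, False if the existing file was reused.
    """
    if os.path.exists(target_path):
        return False
    target_dir = os.path.dirname(target_path) or "."
    os.makedirs(target_dir, exist_ok=True)
    urllib.request.urlretrieve(url, filename=target_path)
    return True


# Demo with a local file:// URL (stand-in for the blob container URL).
demo_dir = tempfile.mkdtemp()
source_file = Path(demo_dir) / "source.csv"
source_file.write_text("datetime,machineID,volt\n")  # made-up header
target_file = os.path.join(demo_dir, "data", "telemetry.csv")

first = download_if_missing(source_file.as_uri(), target_file)
second = download_if_missing(source_file.as_uri(), target_file)
print(first, second)
```

In the notebook you would pass `container + 'telemetry.csv'` as the URL instead of the local stand-in.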

## Upload dataset to default datastore 
A [datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data) is a place where data can be stored that is then made accessible to a Run either by mounting or by copying the data to the compute target. A datastore can be backed either by Azure Blob Storage or by an Azure File Share (ADLS will be supported in the future). For simple data handling, each workspace provides a default datastore that can be used in case the data is not already in Blob Storage or a File Share.

In this next step, we will upload the data files into the workspace's default datastore, which we will later mount on the AmlCompute cluster.

In [7]:
ds = ws.get_default_datastore()
ds.upload(src_dir=data_path, target_path='telemetry', overwrite=True, show_progress=True)

Uploading ../data/anoms.csv
Uploading ../data/failures.csv
Uploading ../data/machines.csv
Uploading ../data/maintenance.csv
Uploading ../data/telemetry.csv
Uploaded ../data/machines.csv, 1 files out of an estimated total of 5
Uploaded ../data/failures.csv, 2 files out of an estimated total of 5
Uploaded ../data/maintenance.csv, 3 files out of an estimated total of 5
Uploaded ../data/anoms.csv, 4 files out of an estimated total of 5
Uploaded ../data/telemetry.csv, 5 files out of an estimated total of 5


$AZUREML_DATAREFERENCE_d46375d1e41d47468eb9a4ea61e724bb

## Create AmlCompute cluster as compute target
AmlCompute (formerly known as [Batch AI](https://docs.microsoft.com/en-us/azure/batch-ai/overview)) is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's create a new AmlCompute cluster in the current workspace, if it doesn't already exist. We will then run the training script on this compute target.

The code below first looks for an existing cluster with the given name. If it cannot find one, it creates a new AmlCompute cluster of `Standard_DS3_v2` CPU VMs. This process is broken down into 3 steps:
1. create the configuration (this step is local and only takes a second)
2. create the cluster (this step will take about **20 seconds**)
3. provision the VMs to bring the cluster to its minimum size (`min_nodes` in the configuration below). When nodes have to be started, this step takes about **3-5 minutes** and provides only sparse output in the process. Please make sure to wait until the call returns before moving to the next cell.

> Pay close attention to the `provisioning_configuration` below! We can configure our compute target such that it auto-scales, and if we set the minimum number of nodes to zero, we won't be paying anything for the compute while we are not using it.

In [8]:
help(azureml.core.compute.AmlCompute)

Help on class AmlCompute in module azureml.core.compute.amlcompute:

class AmlCompute(azureml.core.compute.compute.ComputeTarget)
 |  Class for managing AmlCompute target objects.
 |  
 |  Method resolution order:
 |      AmlCompute
 |      azureml.core.compute.compute.ComputeTarget
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  delete(self)
 |      Removes the AmlCompute object from its associated workspace.
 |      
 |      .. remarks::
 |          If this object was created through Azure ML,
 |          the corresponding cloud based objects will also be deleted. If this object was created externally and only
 |          attached to the workspace, it will raise exception and nothing will be changed.
 |      
 |      :raises: ComputeTargetException
 |  
 |  detach(self)
 |      Detach is not supported for AmlCompute object. Try to use delete instead.
 |      
 |      :raises: ComputeTargetException
 |  
 |  get(self)
 |      Returns compute object
 |  

In [9]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "ADPMAmlCompute"

try:
    # look for the existing cluster by name
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    if isinstance(compute_target, AmlCompute):
        print('Found existing compute target {}.'.format(cluster_name))
    else:
        print('{} exists but it is not an AmlCompute cluster. Please choose a different name.'.format(cluster_name))
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size="Standard_DS3_v2",
                                                           idle_seconds_before_scaledown=1800,
                                                           min_nodes=0,
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # Use the 'status' property to get a detailed status for the current cluster. 
    print(compute_target.status.serialize())

Found existing compute target ADPMAmlCompute.
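
To build some intuition for the `min_nodes=0` and `max_nodes=4` settings used above, here is a deliberately simplified model of auto-scaling. It assumes one node per queued job and is not the actual scheduling logic of the service; the function `target_node_count` is hypothetical.

```python
def target_node_count(queued_jobs, min_nodes=0, max_nodes=4):
    """Simplified model: one node per queued job, clamped to [min_nodes, max_nodes]."""
    return max(min_nodes, min(max_nodes, queued_jobs))


for jobs in (0, 1, 3, 10):
    print(jobs, "queued job(s) ->", target_node_count(jobs), "node(s)")
```

With a minimum of zero, an idle cluster in this model holds no nodes at all, which is the reason the configuration above avoids compute charges between runs.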


## Create an Execution script for Azure ML experimentation

### Azure ML concepts  

Please note the following three things in the code below:
1. The script accepts arguments using the argparse package. One of them is `--data-folder`, which specifies the file system folder in which the script can find the telemetry data.
```
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
```
2. The script accesses the Azure ML `Run` object by executing `run = Run.get_context()`. Further down, the script uses `run` to report the F2 score for a given choice of `window_size`.
```
    run.log('fbeta_score', float(score))
```
3. When running the script on Azure ML, you can write files out to a folder `./outputs` that is relative to the root directory. This folder is specially tracked by Azure ML in the sense that any files written to that folder during script execution on the remote target will be picked up by Run History; these files (known as artifacts) will be available as part of the run history record.
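
As a small illustration of the third point, the sketch below writes a results file into an `outputs` folder relative to the working directory. The file name `results.json` and its contents are made up for this example; on a remote run, a file written this way would be among those picked up as artifacts.

```python
import json
import os

# Create the specially tracked folder if it does not exist yet.
os.makedirs('outputs', exist_ok=True)

# Illustrative values only; a real script would store its own results here.
results = {'window_size': 500, 'com': 12, 'note': 'illustrative values only'}

results_file = os.path.join('outputs', 'results.json')
with open(results_file, 'w') as f:
    json.dump(results, f, indent=2)

print('Wrote', results_file)
```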

### Hands-on Lab: Adapt the execution script

In order for you to be able to use HyperDrive and AmlCompute, we have to modify the script from the previous notebook in several ways.  

The main adaptations are:
1. The handling of input arguments (e.g. the hyperparameter `window_size` that we are interested in).  
2. We log the results rather than returning them as function output.  For this purpose, we use the `context` of our Azure ML [Run](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py) object, which allows us to log our results.

*Instructions:*

We have done most of the heavy lifting for you in this regard, but left two exercises for you:
1. Modify the `main` function, so that it also parses the `n_epochs` argument to the execution script.
2. Modify the `sample_run` function, so that it also logs the `fbeta_score` to the `context` of our `Run`.
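
Before you start, it may help to recall what the `fbeta_score` measures. The sketch below computes it from scratch for boolean labels, using $F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP}$. The function `fbeta` and the example labels are illustrative; the script itself uses `sklearn.metrics.fbeta_score`. Returning 0.0 when there are no positives at all is a choice made for this sketch.

```python
def fbeta(y_true, y_pred, beta=2.0):
    """F-beta score for boolean labels; beta > 1 weights recall more than precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# Made-up labels: 2 true positives, 2 false positives, 1 false negative.
y_true = [True, True, True, False, False, False]
y_pred = [True, True, False, True, True, False]

print("F2:", fbeta(y_true, y_pred, beta=2))
print("F1:", fbeta(y_true, y_pred, beta=1))
```

With `beta=2`, a missed anomaly (false negative) costs more than a false alarm, which suits a setting where the batch result is treated as ground truth.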

In [11]:
# %%writefile scripts/sample_run_AmlCompute.py

# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

"""

This script was modified from the sample_run function of lab 1.2, such that it can be run on AmlCompute.

"""

import pandas as pd
import numpy as np
from sklearn.metrics import fbeta_score
import os
import time

from pyculiarity import detect_ts

from azureml.core import Run

import argparse # for parsing input arguments

def running_avg(ts, com=6):
    rm_o = np.zeros_like(ts)
    rm_o[0] = ts[0]
    
    for r in range(1, len(ts)):
        curr_com = float(min(com, r))
        rm_o[r] = rm_o[r-1] + (ts[r] - rm_o[r-1])/(curr_com + 1)
    
    return rm_o


def detect_ts_online(df_smooth, window_size, stop):
    is_anomaly = False
    run_time = 9999
    start_index = max(0, stop - window_size)
    df_win = df_smooth.iloc[start_index:stop, :]
    start_time = time.time()
    results = detect_ts(df_win, alpha=0.05, max_anoms=0.02, only_last=None, longterm=False, e_value=False, direction='both')
    run_time = time.time() - start_time
    if results['anoms'].shape[0] > 0:
        timestamp = df_win['timestamp'].tail(1).values[0]
        if timestamp == results['anoms'].tail(1)['timestamp'].values[0]:
            is_anomaly = True
    return is_anomaly, run_time


def sample_run(df, anoms_batch, run, window_size = 500, com = 12, n_epochs=10):

    # create arrays that will hold the results of batch AD (y_true) and online AD (y_pred)
    y_true = [False] * n_epochs
    y_pred = [True] * n_epochs
    run_times = []

    # check which unique machines, sensors, and timestamps we have in the dataset
    machineIDs = df['machineID'].unique()
    sensors = df.columns[2:]
    timestamps = df['datetime'].unique()[window_size:]

    # sample n_epochs random machines and sensors
    random_machines = np.random.choice(machineIDs, n_epochs)
    random_sensors = np.random.choice(sensors, n_epochs)

    # we initialize an array that holds a random sample of timestamps
    random_timestamps = np.random.choice(timestamps, n_epochs)

    for i in range(0, n_epochs):
        # take a slice of the dataframe that only contains the measures of one random machine
        df_s = df[df['machineID'] == random_machines[i]]
        
        # smooth the values of one random sensor, using our running_avg function
        smooth_values = running_avg(df_s[random_sensors[i]].values, com)

        # create a data frame with two columns: timestamp, and smoothed values
        df_smooth = pd.DataFrame(data={'timestamp': df_s['datetime'].values, 'value': smooth_values})

        # load the results of batch AD for this machine and sensor
        anoms_s = anoms_batch[((anoms_batch['machineID'] == random_machines[i]) & (anoms_batch['errorID'] == random_sensors[i]))]

        # only do anomaly detection online, if the batch solution actually found an anomaly
        if anoms_s.shape[0] > 0:
            # Let's make sure we have at least one anomaly in our sample! Otherwise it doesn't make sense to calculate
            # any performance metric.  We can just use the timestamp of the last anomaly
            if i == 0:
                anoms_timestamps = anoms_s['datetime'].values
                random_timestamps[i] = anoms_timestamps[-1:][0]

            # select the row of the test case
            test_case = df_smooth[df_smooth['timestamp'] == random_timestamps[i]]
            test_case_index = test_case.index.values[0]

            # check whether the batch AD found an anomaly at that timestamp (this becomes y_true[i])
            y_true_i = random_timestamps[i] in anoms_s['datetime'].values

            # perform online AD, and write result to y_pred
            y_pred_i, run_times_i = detect_ts_online(df_smooth, window_size, test_case_index)
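        else:
            # batch AD found no anomaly for this machine/sensor pair, so online AD is skipped
            y_pred_i, y_true_i, run_times_i = False, False, None

        y_true[i] = y_true_i
        y_pred[i] = y_pred_i
        # only record a run time when online AD was actually executed
        if run_times_i is not None:
            run_times.append(run_times_i)

        score = float(fbeta_score(y_true, y_pred, beta=2))
        print("fbeta_score: %s" % round(score, 2))

        if run_times:
            run.log('run_time', np.mean(run_times))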
        run.log(<your solution for logging the fbeta_score goes here>)
        
    run.log('final_fbeta_score', float(score))

        
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
    parser.add_argument('--window_size', type=int, dest='window_size', default=100, help='window size')
    parser.add_argument('--com', type=int, dest='com', default=12, help='Specify decay in terms of center of mass for running avg')
    parser.add_argument(<your solution for handling the n_epochs parameter goes here>)
    args = parser.parse_args()

    data_folder = os.path.join(args.data_folder, 'telemetry')
    window_size = args.window_size
    com = args.com
    n_epochs = args.n_epochs
    
    # start an Azure ML run
    run = Run.get_context()

    print("Reading data ... ", end="")
    df = pd.read_csv(os.path.join(data_folder, 'telemetry.csv'))
    print("Done.")

    print("Parsing datetime...", end="")
    df['datetime'] = pd.to_datetime(df['datetime'], format="%m/%d/%Y %I:%M:%S %p")
    print("Done.")
    
    print("Reading data ... ", end="")
    anoms_batch = pd.read_csv(os.path.join(data_folder, 'anoms.csv'))
    anoms_batch['datetime'] = pd.to_datetime(anoms_batch['datetime'], format="%Y-%m-%d %H:%M:%S")
    print("Done.")

    print('Dataset is stored here: ', data_folder)

    sample_run(df, anoms_batch, run, window_size, com, n_epochs)

    
if __name__ == "__main__":
    main()


usage: __main__.py [-h] [--data-folder DATA_FOLDER]
                   [--window_size WINDOW_SIZE] [--com COM]
                   [--n_epochs N_EPOCHS]
__main__.py: error: unrecognized arguments: -f /home/nbuser/.local/share/jupyter/runtime/kernel-3d9b45c4-dbcd-4069-878b-0462e4101786.json


SystemExit: 2

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
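
The error shown above is expected when the cell is executed directly in the notebook: the Jupyter kernel is started with a `-f <connection file>` argument, which `parse_args()` does not recognize. The script works once it is saved and submitted as a run. If you want to try out a parser interactively, one common workaround is `parse_known_args()`, which returns unrecognized arguments instead of exiting. The sketch below is not part of the lab solution; it only uses the two optional arguments already defined in the script, and the argument list is a made-up stand-in for what a kernel passes.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--window_size', type=int, dest='window_size', default=100, help='window size')
parser.add_argument('--com', type=int, dest='com', default=12,
                    help='Specify decay in terms of center of mass for running avg')

# Stand-in for sys.argv[1:] inside a notebook kernel (the path is hypothetical).
notebook_argv = ['-f', '/path/to/kernel.json', '--window_size', '200']

# parse_known_args returns the parsed namespace plus a list of leftover arguments.
args, unknown = parser.parse_known_args(notebook_argv)

print('window_size:', args.window_size)
print('com:', args.com)
print('ignored:', unknown)
```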


### Save your solution

Once you are done making those changes, you can uncomment the first line of the above cell, and execute the cell so that your solution is saved.

## End of Hands-on Lab

## Copy the test script into the script folder

The next step is to copy the script to the staging folder for your ML Experiment.

In [13]:
from shutil import copyfile

copyfile('../solutions/sample_run_AmlCompute.py', os.path.join(script_folder,'sample_run_AmlCompute.py'))

'scripts/sample_run_AmlCompute.py'

## Create an AzureML training Estimator

Next, we construct an `azureml.train.estimator.Estimator` object, use the AmlCompute cluster as compute target, and pass the mount point of the datastore to the training code as a parameter. The `azureml.train` module contains several estimators.  Here we use `Estimator` because it is the most general one, but if you used e.g. PyTorch, you could use a dedicated estimator for that ([see AML SDK documentation](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.dnn.pytorch?view=azure-ml-py)). 

The estimator provides a simple way of launching a custom job on a compute target.  It automatically provides a Docker image. If additional pip or conda packages are required, their names can be passed in via the `pip_packages` and `conda_packages` arguments, and they will be included in the resulting Docker image.

In our case, we need to install the following `pip_packages`: `numpy`, `pandas`, `scikit-learn`, and `pyculiarity`.

In [14]:
from azureml.train.estimator import Estimator

# input arguments to the script
script_params = {
    '--data-folder': ws.get_default_datastore().as_mount(),
    '--window_size': 500,
    '--n_epochs': 1000,
    '--com': 12
}

est = Estimator(source_directory=script_folder,  # the folder where you saved the sample_run script
                script_params=script_params,     # the input arguments to the script
                compute_target=compute_target,   # the compute target object you created above
                entry_script='sample_run_AmlCompute.py',
                pip_packages=['numpy', 'pandas', 'scikit-learn', 'pyculiarity'])
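
Conceptually, the keys and values of `script_params` end up as command-line arguments of the entry script. The sketch below shows roughly what that looks like; the exact command line assembled by the SDK may differ, and the `$AZUREML_DATAREFERENCE_example` value is a hypothetical placeholder for the mount path that the datastore reference resolves to at run time.

```python
# Same settings as above, with a placeholder instead of the datastore reference.
script_params = {
    '--data-folder': '$AZUREML_DATAREFERENCE_example',  # hypothetical placeholder
    '--window_size': 500,
    '--n_epochs': 1000,
    '--com': 12,
}

# Flatten the dictionary into an argument list: flag, value, flag, value, ...
argv = []
for flag, value in script_params.items():
    argv.extend([flag, str(value)])

command = ' '.join(['python', 'sample_run_AmlCompute.py'] + argv)
print(command)
```

This is why every key in `script_params` has to match an argument that the script's `argparse` parser defines.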

## Submit job to run
Calling the `submit` function of the experiment with the estimator as its `config` submits the job to Azure ML for execution. Submitting the job should only take a few seconds.

In [15]:
run = exp.submit(config=est)

### Monitor the Run
As the Run is executed, it will go through the following stages:
1. Preparing: A docker image is created matching the Python environment specified by the estimator and it will be uploaded to the workspace's Azure Container Registry. This step will only happen once for each Python environment -- the container will then be cached for subsequent runs.  Creating and uploading the image takes about **5 minutes**. While the job is preparing, logs are streamed to the run history and can be viewed to monitor the progress of the image creation.

2. Scaling: If the compute needs to be scaled up (i.e. the AmlCompute cluster requires more nodes to execute the run than are currently available), the cluster will attempt to scale up in order to make the required number of nodes available. Scaling typically takes about **5 minutes**.

3. Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted/copied and the `entry_script` is executed. While the job is running, stdout and the `./logs` folder are streamed to the run history and can be viewed to monitor the progress of the run.

4. Post-Processing: The `./outputs` folder of the run is copied over to the run history.

We can periodically check the status of the run object, and navigate to Azure portal to monitor the run.

In [16]:
run

Experiment,Id,Type,Status,Details Page,Docs Page
ADMLExp,ADMLExp_1547664508693,azureml.scriptrun,Queued,Link to Azure Portal,Link to Documentation
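
Checking the status periodically can also be scripted. The following is a minimal polling sketch, assuming the run object exposes `get_status()` and that `Completed`, `Failed`, and `Canceled` are the terminal states. The helper `wait_until_done` and the `FakeRun` stand-in are hypothetical; `FakeRun` exists only so that the example runs without a workspace.

```python
import time

TERMINAL_STATES = {'Completed', 'Failed', 'Canceled'}  # assumed terminal states


def wait_until_done(run, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll run.get_status() until a terminal state or until max_polls is reached."""
    status = None
    for _ in range(max_polls):
        status = run.get_status()
        print('Run status:', status)
        if status in TERMINAL_STATES:
            break
        sleep(poll_seconds)
    return status


class FakeRun:
    """Stand-in for a real run; replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self):
        return next(self._statuses)


fake_run = FakeRun(['Queued', 'Preparing', 'Running', 'Completed'])
final_status = wait_until_done(fake_run, poll_seconds=0)
```

With a real run you would call `wait_until_done(run)` and keep the default polling interval.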



### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

Note: The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.


In [19]:
from azureml.widgets import RunDetails
RunDetails(run).show() 

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO', 's…

### Hands-on Lab: Explore AML workspace on the Azure Portal

When you are running this for the first time, the job will take a while to prepare, because the AmlCompute target has to be fully provisioned first.  You can seize this opportunity to become familiar with the Azure Portal.

Try to find the following items and explore them
- ML workspace in the portal (make sure you use the one you created and saved to config.json)
- Experiments, and run objects
- Compute Targets
- Look at the log files for the job you have submitted. These can be very helpful for monitoring progress and troubleshooting.

### End of Lab

Optionally, you can run the following command to make this notebook wait until the run has completed.

In [None]:
# run.wait_for_completion(show_output = True)

### The Run object

The Run object provides the interface to the run history, both to the job and to the control plane (this notebook), and both while the job is running and after it has completed. It provides a number of useful features, for instance:
* `run.get_details()`: Provides a rich set of properties of the run
* `run.get_metrics()`: Provides a dictionary with all the metrics that were reported for the Run
* `run.get_file_names()`: List all the files that were uploaded to the run history for this Run. This will include the `outputs` and `logs` folder, azureml-logs and other logs, as well as files that were explicitly uploaded to the run using `run.upload_file()`

Below are some examples -- please run through them and inspect their output.

In [None]:
run.get_details()

In [None]:
run.get_metrics()

In [None]:
run.get_file_names()
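
Once you have the metrics of several runs, comparing them is plain dictionary handling. The sketch below assumes that a metric logged repeatedly under one name comes back as a list and a metric logged once comes back as a single value. All numbers and run labels are made up for illustration and are not results of this lab.

```python
# Made-up metrics in the assumed shape of run.get_metrics(), keyed by a run label.
metrics_by_run = {
    'window_size=100': {'fbeta_score': [0.20, 0.35, 0.40], 'final_fbeta_score': 0.40},
    'window_size=500': {'fbeta_score': [0.30, 0.55, 0.70], 'final_fbeta_score': 0.70},
    'window_size=1000': {'fbeta_score': [0.25, 0.50, 0.65], 'final_fbeta_score': 0.65},
}

# Pick the run label with the highest final score.
best_run = max(metrics_by_run, key=lambda name: metrics_by_run[name]['final_fbeta_score'])

for name, metrics in metrics_by_run.items():
    print(name, 'final_fbeta_score =', metrics['final_fbeta_score'])
print('Best run:', best_run)
```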

## Plot F-beta score over epochs
Since we can retrieve the metrics from the run, we can easily make plots using `matplotlib` in the notebook. Then we can add the plotted image to the run using `run.log_image()`, so all information about the run is kept together.

In [None]:
import os
os.makedirs('./imgs', exist_ok = True)
metrics = run.get_metrics()

plt.close()
plt.figure(figsize = (13,5))
plt.plot(metrics['fbeta_score'])
plt.xlabel('epochs', fontsize = 14)
plt.ylabel('fbeta score', fontsize = 14)
plt.title('Fbeta Score over Epochs', fontsize = 16)
run.log_image(name = 'fbeta_score_over_epochs', plot = plt)
display()

## Hands-on lab

The goal of this lab is to have you become comfortable with ML model experimentation:
- Running your model with different parameters
- Investigating the results in the Azure portal

### Instructions:
1. Try different settings for the `script_params` above
  - Change the `script_params` settings
  - Execute the cell to update the `Estimator`
  - `submit` the updated `Estimator` as a new `Run` in your ML `Experiment`
2. After you have submitted a couple of different runs, go into the Azure portal to explore the results
3. Try to figure out which of the runs had the best performance.

## End Lab

## Clean up

At this point we could delete the compute target. But remember that if you set `min_nodes` to 0 when you created the cluster, all nodes are released automatically once the jobs are finished and the idle timeout has passed. So you don't have to delete the cluster itself, since an idle cluster won't incur any compute cost. The next time you submit jobs to it, the cluster will automatically grow again, up to `max_nodes`, which is set to 4.

Also, if you are planning to take the next lab on Hyperdrive, you will be using the same compute target again.

In [None]:
# # Are you sure you want to delete the compute target?
# compute_target.delete()

## Hyperparameter Tuning with HyperDrive and AmlCompute (fka BatchAI)

In the next lab, we will show you how to use HyperDrive to do this tedious labor for you!

If you want to, you can already read up on HyperDrive:
- [documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters)
- [sample notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/training/03.train-hyperparameter-tune-deploy-with-tensorflow)