# AutoML 017: Time-series dataset
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


In this example we use the Appliances energy prediction data set (https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction) to show how you can use the AutoML regressor on IoT data.

Make sure you have executed the [setup](setup.ipynb) before running this notebook.

In this notebook you will see how to:
1. Create a Project and Workspace, or reuse existing ones
2. Load a time-series dataset
3. Instantiate an AutoML regressor
4. Train the model locally
5. Explore the results



## Create Project and Workspace

As part of the setup you have already created a workspace. For AutoML you also need a <b>Project</b>. A Project is a local folder that contains the files for your Azure ML experiments. It is associated with a run history, which is a cloud container for the run metrics and output artifacts of your experiments. You can either attach a local folder as a new project, or load a local folder as a project if it has been attached before.

In [None]:
import logging
import os
import random

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

In [None]:
ws = Workspace.from_config()

# choose a name for the run history container in the workspace
experiment_name = 'automl-iot-remote-timeseries'
# project folder
project_folder = './sample_projects/automl-iot-remote-timeseries'

experiment=Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment_name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data=output, index=['']).T

In [None]:
# Create the project folder and its aml_config subfolder if they do not exist yet.
os.makedirs(os.path.join(project_folder, "aml_config"), exist_ok=True)

In [None]:
from azureml.core.runconfig import CondaDependencies
cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy'])


In [None]:
!pip install xlrd


## Load the time-series data set
This dataset records appliance energy use in a low-energy building. The temperature and humidity inside the house were monitored with a ZigBee wireless sensor network. Weather data from the nearest airport weather station was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column.

In [None]:
##data prep
data = pd.read_excel("EnergyData_Complete.xlsx")
print(data.shape)

def split(data, sort_column='Date', drop_cols=True):
    """Split the rows, in their existing order, into 80/10/10 train/valid/test blocks.

    sort_column and drop_cols are accepted but not used; the rows are assumed
    to be in chronological order already. Note that data is modified in place
    (the target column is popped), so pass a copy.
    """
    # Separate the target column from the feature matrix.
    y = data.pop('Appliances').values
    X = data.values
    N = data.shape[0]

    perc_train = 0.80
    perc_valid = 0.10
    perc_test  = 0.10

    # Block sizes in rows; int() truncates, so a few trailing rows may be left out.
    train = int(N * perc_train)
    valid = int(N * perc_valid)
    test  = int(N * perc_test)

    # Contiguous blocks with no shuffling, so later rows never leak into earlier blocks.
    y_train, y_valid, y_test = y[:train], y[train:train+valid], y[train+valid:train+valid+test]
    X_train, X_valid, X_test = X[:train], X[train:train+valid], X[train+valid:train+valid+test]
    # Train and validation blocks combined.
    X_full, y_full = X[:train+valid], y[:train+valid]
    return X_train, y_train, X_valid, y_valid, X_test, y_test, train, valid, test, X, y, X_full, y_full

##Extract data
_, _, _, _, _, _, train, valid, test, X, y, X_full, y_full = split(data.copy(), drop_cols=True)
N = X.shape[0]

X_train = X[:train+valid]
Y_train = y[:train+valid]
X_test = X[train+valid:train+valid+test]

#print(Y_train)
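
The split above keeps the rows in their existing order and cuts them into contiguous blocks. The following is a minimal sketch of the same idea on synthetic data, so it can be run without the Excel file. The function name `chronological_split`, the column names `T1` and `RH_1`, and the generated values are assumptions made for illustration only; they are not taken from the energy data set.

```python
import numpy as np
import pandas as pd


def chronological_split(frame, target, fractions=(0.8, 0.1, 0.1)):
    """Cut the rows, in their existing order, into train/valid/test blocks."""
    n = len(frame)
    n_train = int(n * fractions[0])
    n_valid = int(n * fractions[1])

    y = frame[target].to_numpy()
    X = frame.drop(columns=[target]).to_numpy()

    train = slice(0, n_train)
    valid = slice(n_train, n_train + n_valid)
    test = slice(n_train + n_valid, n)  # the remainder goes to the test block
    return (X[train], y[train]), (X[valid], y[valid]), (X[test], y[test])


# Synthetic stand-in for the sensor data: 100 rows at 10-minute intervals.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "T1": rng.normal(20.0, 1.0, size=100),
        "RH_1": rng.normal(40.0, 5.0, size=100),
        "Appliances": rng.integers(10, 200, size=100),
    },
    index=pd.date_range("2016-01-01", periods=100, freq="10min"),
)

(train_X, train_y), (valid_X, valid_y), (test_X, test_y) = chronological_split(
    demo, "Appliances"
)
print(train_X.shape, valid_X.shape, test_X.shape)
```

Unlike the notebook's `split`, this sketch assigns all remaining rows to the test block instead of truncating it to a fixed fraction. Neither version shuffles the rows, which matters for time-series data because a shuffled split would let the model train on observations that come after the ones it is evaluated on.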


## Diagnostics
Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

## Instantiate AutoML <a class="anchor" id="Instatiate-AutoML-Remote-DSVM"></a>

You can also specify automl_settings as **kwargs**. Note that you can use the get_data() semantics for local executions too.

<i>Note: For Remote DSVM and Batch AI you cannot pass Numpy arrays directly to the fit method.</i>

|Property|Description|
|-|-|
|**primary_metric**|This is the metric that you want to optimize.<br> The AutoML regressor supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration|
|**iterations**|Number of iterations. In each iteration the AutoML regressor trains the data with a specific pipeline|
|**n_cross_validations**|Number of cross-validation splits|
|**max_concurrent_iterations**|Maximum number of iterations that are executed in parallel|
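
The configuration in this notebook optimizes `spearman_correlation`. As a rough illustration of what that metric measures, the sketch below computes the Spearman rank correlation as the Pearson correlation of the ranks. The helper name `spearman` and the sample values are assumptions for illustration; this is not necessarily how AutoML computes the metric internally.

```python
import numpy as np
import pandas as pd


def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the (average) ranks."""
    true_ranks = pd.Series(y_true).rank()
    pred_ranks = pd.Series(y_pred).rank()
    return float(np.corrcoef(true_ranks, pred_ranks)[0, 1])


y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Predictions that preserve the ordering, even on a very different scale.
same_order = y_true ** 3
# Predictions that reverse the ordering.
reverse_order = -y_true

print(spearman(y_true, same_order))
print(spearman(y_true, reverse_order))
```

Because only the ordering matters, a model can score well on this metric while its predictions are far off in absolute terms. If the size of the error matters for your use case, an error-based metric such as `normalized_root_mean_squared_error` may be the better choice.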

In [None]:
primary_metric = 'spearman_correlation'
experiment_name = "AutoML_IOT_TimeSeries"
automl_settings = {
    "name": experiment_name,     
    "preprocess": True
    }

automl_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automl_errors.log',                             
                             primary_metric = primary_metric,
                             iteration_timeout_minutes = 60,
                             iterations = 3,       
                             n_cross_validations = 2,
                             X = X_train,                             
                             y = Y_train,                            
                             path = project_folder,
                             **automl_settings
                            )


## Training the Model <a class="anchor" id="Training-the-model-Remote-DSVM"></a>

Call the *submit* method on the Experiment instance and pass it the AutoML configuration. This notebook trains locally, so with `show_output = True` each iteration is printed to the console as it completes. For remote runs the execution is asynchronous, so you will see the iterations get populated as they complete, and you can interact with the widgets and models while the experiment is still running to retrieve the best model up to that point. Once you are satisfied with the model, you can cancel a particular iteration or the whole run.


The *submit* method on the experiment triggers the training of the model. It can be called with the following parameters:

|**Parameter**|**Description**|
|-|-|
|**automl_config**|Indicates the automl configuration used|
|**show_output**| True/False to turn on/off console output|

In [None]:
local_run = experiment.submit(automl_config, show_output = True)

## Exploring the results


### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see the individual metrics that are logged for each one.

In [None]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = run.get_metrics()    
    metricslist[properties['iteration']] = metrics

# One column per iteration, one row per metric; sort the columns by iteration.
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
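
Once the metrics are in a table with one column per iteration, picking the best iteration for a given metric is a single lookup. The sketch below shows this on a hand-written dictionary. The metric values are made up for illustration and are not results from this experiment; the integer keys are also an assumption, since the `iteration` property of a real run may be returned as a string.

```python
import pandas as pd

# Hypothetical per-iteration metrics, shaped like the metricslist dictionary above.
metrics_by_iteration = {
    0: {"spearman_correlation": 0.41, "r2_score": 0.12},
    1: {"spearman_correlation": 0.56, "r2_score": 0.31},
    2: {"spearman_correlation": 0.49, "r2_score": 0.27},
}

# One column per iteration, one row per metric.
table = pd.DataFrame(metrics_by_iteration).sort_index(axis=1)

# Column label (iteration) with the highest value in the chosen metric row.
best_iteration = table.loc["spearman_correlation"].idxmax()

print(table)
print("best iteration:", best_iteration)
```

For metrics where lower is better, such as an error metric, use `idxmin()` instead of `idxmax()`.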

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The *get_output* method on the run returns the best run and the fitted model for the last invocation. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [None]:
best_run, fitted_model = local_run.get_output()
fitted_model
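
The notebook sets aside `X_test` but does not score the model on it. A minimal sketch of such a hold-out check follows, assuming only that the fitted model exposes a `predict` method. The `evaluate` helper and the `LeastSquaresStandIn` class are hypothetical; the stand-in takes the place of `fitted_model` so the example runs without an Azure workspace.

```python
import numpy as np


def evaluate(model, X, y):
    """Score any object with a predict(X) method on held-out data."""
    errors = model.predict(X) - y
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mae": float(np.mean(np.abs(errors))),
    }


class LeastSquaresStandIn:
    """Hypothetical stand-in for fitted_model: ordinary least squares with an intercept."""

    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([X, np.ones(len(X))])
        return A @ self.coef_


# Synthetic data: a linear signal plus a small amount of noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

# Fit on the first 160 rows, score on the last 40.
model = LeastSquaresStandIn().fit(X[:160], y[:160])
scores = evaluate(model, X[160:], y[160:])
print(scores)
```

In this notebook the matching targets for `X_test` would be `y[train+valid:train+valid+test]`, so a call along the lines of `evaluate(fitted_model, X_test, y[train+valid:train+valid+test])` should give a hold-out estimate for the best model.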


#### Best Model based on any iteration

In [None]:
# Iterations are numbered from 0, so with iterations = 3 the valid values are 0, 1 and 2.
# iteration = 2
# best_run, fitted_model = local_run.get_output(iteration=iteration)