Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# NYC Energy Demand Forecasting


## Problem overview:

This scenario focuses on energy demand forecasting where the goal is to predict the future load on an energy grid. It is a critical business operation for companies in the energy sector as operators need to maintain the fine balance between the energy consumed on a grid and the energy supplied to it. Too much power supplied to the grid can result in waste of energy or technical faults. However, if too little power is supplied it can lead to blackouts, leaving customers without power. Typically, grid operators can take short-term decisions to manage energy supply to the grid and keep the load in balance. An accurate short-term forecast of energy demand is therefore essential for the operator to make these decisions with confidence.

This scenario details the development of a machine learning energy demand forecasting model. The model is trained on a public dataset from the New York Independent System Operator (NYISO), which operates the power grid for New York State. The dataset includes hourly power demand data for New York City over a period of five years. 


## Solution overview:

We will use the automated ML capability ([what is that?](https://www.youtube.com/watch?v=l8c-4iDPE0M&t=27s)) in [Azure Machine Learning service](https://azure.microsoft.com/en-us/services/machine-learning-service/) to quickly train a model that can predict the future load on an energy grid. This example can be extended to all sorts of forecasting use cases. Automated ML empowers data scientists to identify an end-to-end machine learning pipeline for any problem, and they are able to achieve higher accuracy while spending far less time. 

1. Basic setup

2. Data prep

3. Model training

4. Explore the results and test the best model 

5. Model Explainability: which features matter for the forecast?


## Section 1. Basic setup
Before starting this step, you need to create an Azure Machine Learning service <b>workspace</b> ([instructions](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-workspace)). 

Let's get started by creating an experiment in your Azure Machine Learning workspace. An <b>experiment</b> is a named object in a <b>workspace</b>, which is used to do model training.

In [None]:
import azureml.core
import pandas as pd
import numpy as np
import logging
import warnings
# Squash warning messages for cleaner output in the notebook
warnings.showwarning = lambda *args, **kwargs: None


from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from matplotlib import pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

### <font color="red">Action Required</font>

<font color="red"> In the next cell, replace each placeholder string with your own value: the Azure subscription ID, resource group, workspace region, and workspace name that you would like to use for this experiment. Make sure you remove the angle brackets. The values should be inside the quotes. </font>

In [None]:
subscription_id = "<subscription id goes here>"
resource_group = "<resource group goes here>"
workspace_region = "<workspace region goes here>"
workspace_name = "<workspace name goes here>"

### <font color="red">Action Required</font>
    
<font color="red"> Executing the next cell will ask you to open a URL to authenticate. Once authentication succeeds, you will see the message "Interactive authentication successfully completed". In this step you are logging in to your Azure subscription. </font>

In [None]:
ws = Workspace(workspace_name = workspace_name,
               subscription_id = subscription_id,
               resource_group = resource_group)

In [None]:
# choose a name for the run history container in the workspace
experiment_name = 'NYCEnergyForecast'

# project folder
project_folder = './sample_projects/automl-energydemandforecasting'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data = output, index = ['']).T

## Section 2. Data prep
`nyc_weather.csv` contains hourly weather values for New York City over the same years, 2012-2017. The cell below loads `nyc_energy.csv` and parses the `timeStamp` column as dates.

In [None]:
data = pd.read_csv("nyc_energy.csv", parse_dates=['timeStamp'])

### 2.1 Inspect data
Display the first few rows of the data

In [None]:
data.head()

In [None]:
plt_df = data.loc[(data.timeStamp>'2016-07-01') & (data.timeStamp<='2016-07-07')]
plt.plot(plt_df['timeStamp'], plt_df['demand'])
plt.title('New York City power demand over one week in July 2016')
plt.xticks(rotation=45)
plt.show()

In [None]:
plt_df = data.loc[(data['timeStamp'] >= '2016-01-01') & (data['timeStamp'] < '2017-01-01')]
plt.plot(plt_df['timeStamp'], plt_df['demand'], markersize=1)
plt.title('Hourly demand in 2016')
plt.ylabel('demand')
plt.xticks(rotation=45)
plt.show()
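
Before splitting the data, it can help to confirm that the hourly series is complete and regular. The following is a minimal sketch on a small synthetic frame, not the NYISO data itself. The helper name `summarize_hourly_series` is hypothetical; the column names `timeStamp` and `demand` are the ones used in the cells above.

```python
import numpy as np
import pandas as pd


def summarize_hourly_series(df, time_col="timeStamp", value_col="demand"):
    """Return basic quality checks for an hourly series (hypothetical helper)."""
    ts = df[time_col].sort_values()
    # Differences between consecutive timestamps; the first one is NaT
    steps = ts.diff().dropna()
    return {
        "rows": len(df),
        "missing_values": int(df[value_col].isna().sum()),
        "duplicate_timestamps": int(ts.duplicated().sum()),
        "irregular_steps": int((steps != pd.Timedelta(hours=1)).sum()),
    }


# Synthetic stand-in for `data`: one missing demand value and one skipped hour
toy = pd.DataFrame({
    "timeStamp": pd.to_datetime([
        "2016-07-01 00:00", "2016-07-01 01:00", "2016-07-01 02:00",
        "2016-07-01 04:00", "2016-07-01 05:00", "2016-07-01 06:00",
    ]),
    "demand": [5000.0, 4900.0, np.nan, 4800.0, 4950.0, 5100.0],
})

summary = summarize_hourly_series(toy)
print(summary)
```

Calling `summarize_hourly_series(data)` on the real frame would report the same counts for the full dataset.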

### 2.2 Split the data into train and test sets


In [None]:
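# let's take note of which column means what in the data
time_column_name = 'timeStamp'
target_column_name = 'demand'

# copy the slices so that pop() below modifies independent frames, not views of `data`
X_train = data[data[time_column_name] < '2017-02-01'].copy()
X_test = data[data[time_column_name] >= '2017-02-01'].copy()
# pop() removes the target column from the features and returns it
y_train = X_train.pop(target_column_name).values
y_test = X_test.pop(target_column_name).values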
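
For a forecasting task the split has to be chronological: every training timestamp must come before every test timestamp, and the target must not remain among the features. The sketch below reproduces the same split logic on a tiny synthetic frame so these properties can be checked. `split_by_time` is a hypothetical helper, not part of the Azure ML SDK.

```python
import pandas as pd


def split_by_time(df, time_col, target_col, cutoff):
    """Split a frame chronologically at `cutoff` (hypothetical helper)."""
    train = df[df[time_col] < cutoff].copy()
    test = df[df[time_col] >= cutoff].copy()
    # pop() removes the target from the features and returns it
    y_tr = train.pop(target_col).values
    y_te = test.pop(target_col).values
    return train, test, y_tr, y_te


# Synthetic data straddling the cutoff used in this notebook
toy = pd.DataFrame({
    "timeStamp": pd.to_datetime([
        "2017-01-31 21:00", "2017-01-31 22:00", "2017-01-31 23:00",
        "2017-02-01 00:00", "2017-02-01 01:00", "2017-02-01 02:00",
    ]),
    "demand": [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = split_by_time(
    toy, "timeStamp", "demand", "2017-02-01"
)

# The last training timestamp must precede the first test timestamp
print(toy_X_train["timeStamp"].max() < toy_X_test["timeStamp"].min())
```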

## Section 3. Model training

In this section you will configure automated ML ([what is that?](https://www.youtube.com/watch?v=l8c-4iDPE0M&t=27s)) and run an automated ML experiment that generates machine learning models. The training jobs run on VMs provided and managed by Azure Notebooks.

What I love most about automated ML is that, even though it accelerates my work as a data scientist, I still have total <b>Control</b>, <b>Transparency</b>, and <b>Visibility</b> over what I am doing with my data, the training process, and all the metrics it uses to evaluate different ML approaches.

Below is the <b>configuration object</b> for submitting an automated machine learning experiment in Azure Machine Learning service.

This configuration object contains and persists the experiment run parameters, as well as the training data to be used at run time.


|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|The metric that you want to optimize.<br> Forecasting supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iterations**|Number of iterations. In each iteration, automated ML trains a specific pipeline on the given data.|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ], target values.|
|**n_cross_validations**|Number of cross-validation splits.|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
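
To make the `normalized_root_mean_squared_error` metric more concrete, here is a minimal sketch that assumes the normalization divides the RMSE by the range of the actual values. Check the Azure Machine Learning documentation for the exact definition used by the service. The function name `normalized_rmse` and the numbers are illustrative only.

```python
import math


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range of the actual values (assumed definition)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("actual values are constant, so the range is zero")
    return math.sqrt(mse) / value_range


# Made-up values: every prediction is off by 10, and the actuals span 300
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
score = normalized_rmse(actual, predicted)
print(round(score, 4))
```

Because the error is scaled by the spread of the target, a lower value is better, and the score is comparable across series with different demand levels.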

### 3.1 Automated ML configuration 
Automated ML comes with many levers that you can configure, which gives you flexibility and control. For example, `primary_metric` is the metric that automated ML optimizes when building the machine learning model, and several different primary metrics are supported. To find out more about all the configuration settings, see [the configuration documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train).

Notice that the task is set to forecasting. We also pass the training data (X_train and y_train, which we prepared in section 2.2); validation uses the cross-validation splits set by n_cross_validations.

In [None]:
automl_settings = {
    "time_column_name": time_column_name, 
    "max_horizon": 20
}


automl_config = AutoMLConfig(task = 'forecasting',
                             debug_log = 'automl_nyc_energy_errors.log',
                             primary_metric= 'normalized_root_mean_squared_error',
                             iterations = 5,
                             iteration_timeout_minutes = 5,
                             X = X_train,
                             y = y_train,
                             n_cross_validations = 3,
                             path=project_folder,
                             verbosity = logging.INFO,
                             **automl_settings)
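
The configuration sets `n_cross_validations = 3`. With time series data, validation folds need to respect temporal order so that a model is never validated on data older than what it was trained on. One common scheme is rolling-origin evaluation, sketched below. This illustrates the general idea and is not necessarily the exact splitting that automated ML performs; `rolling_origin_splits` is a hypothetical helper.

```python
def rolling_origin_splits(n_samples, n_splits, horizon):
    """Yield (train_indices, validation_indices) pairs in time order.

    Each fold trains on everything before its origin and validates on the
    next `horizon` points. Hypothetical helper for illustration.
    """
    if n_splits * horizon >= n_samples:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_splits):
        origin = n_samples - (n_splits - fold) * horizon
        yield list(range(origin)), list(range(origin, origin + horizon))


splits = list(rolling_origin_splits(n_samples=12, n_splits=3, horizon=2))
for train_idx, valid_idx in splits:
    print(f"train 0..{train_idx[-1]}  validate {valid_idx}")
```

Each successive fold moves the origin forward, so later folds train on more history.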

### 3.2 Train your models on local compute
You can call the `submit` method on the experiment object and pass the `AutoMLConfig` object you instantiated above. `submit` generates as many machine learning models as the number of iterations you set (in this case, 5 different models). Depending on the data and the number of iterations, this can run for a while. For local runs the execution is synchronous, and you will see the currently running iterations printed to the console.

In [None]:
local_run = experiment.submit(automl_config, show_output=True)

### 3.3 Monitor training

When you execute the cell below, it will render a widget showing the status of all the iterations in a table. The iterations are shown in a leaderboard format, with the best performing model at the top. You can also sort them by clicking on a column heading. If you hover your mouse over any of the pipelines, you will see all the hyperparameters used to build that particular model. If you click on any of the pipelines, a window will pop up with further information, including various metrics of the model and a few visual charts.
You will also see a step chart showing the metric for each iteration.

In [None]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

## Section 4. Explore the results and test the best model



### 4.1 Retrieve all child runs
You can fetch all the child runs and see individual metrics that we log for each of the models.

In [None]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    
    metricslist[int(properties['iteration'])] = metrics

# sort the columns (one per iteration) by iteration number
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
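
Once the per-iteration metrics are collected, you may want to pick the best iteration for a metric other than the primary one. The sketch below works on a plain dictionary shaped like `metricslist` above. The metric values are invented for illustration and do not come from a real run; `best_iteration` is a hypothetical helper.

```python
# Hypothetical metrics keyed by iteration; the numbers are made up
example_metrics = {
    0: {"normalized_root_mean_squared_error": 0.081, "r2_score": 0.71},
    1: {"normalized_root_mean_squared_error": 0.064, "r2_score": 0.82},
    2: {"normalized_root_mean_squared_error": 0.073, "r2_score": 0.77},
}


def best_iteration(metrics_by_iteration, metric, lower_is_better=True):
    """Return the iteration number with the best value for `metric`."""
    scored = {
        iteration: values[metric]
        for iteration, values in metrics_by_iteration.items()
        if metric in values
    }
    if not scored:
        raise ValueError(f"no iteration logged the metric {metric!r}")
    chooser = min if lower_is_better else max
    return chooser(scored, key=scored.get)


best_by_nrmse = best_iteration(example_metrics, "normalized_root_mean_squared_error")
best_by_r2 = best_iteration(example_metrics, "r2_score", lower_is_better=False)
print(best_by_nrmse, best_by_r2)
```

Error metrics are minimized while correlation-style metrics such as `r2_score` are maximized, which is why the direction is an explicit argument.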

### 4.2 Retrieve best fitted model
Below we select the best model from our iterations. 

The `get_output` method allows you to retrieve the best run and fitted model for any logged metric or a particular iteration.

In [None]:
best_run, fitted_model = local_run.get_output()
fitted_model.steps

### 4.3 Test the best fitted model

For forecasting, we will use the `forecast` function instead of the `predict` function. There are two reasons for this.

First, we need to pass the recent values of the target variable `y`, whereas the scikit-compatible `predict` function only takes the non-target variables `X`. In our case, the test data immediately follows the training data, and we fill the `y` variable with `NaN`. The `NaN` serves as a question mark for the forecaster to fill in with forecast values. The last time at which a definite (non-NaN) value is seen is the _forecast origin_, that is, the last time when the value of the target is known. Using the `forecast` function produces forecasts using the shortest possible forecast horizon.

Second, using the `predict` method would result in getting predictions for EVERY horizon the forecaster can predict at. This is useful when training and evaluating the performance of the forecaster at various horizons, but the level of detail is excessive for normal use.
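
The role of the `NaN` placeholders and the forecast origin can be illustrated without a trained model. The sketch below uses made-up demand values and two hypothetical helpers; it only shows how a query vector is laid out, not how the forecaster works internally.

```python
import numpy as np


def make_forecast_query(y_known, n_future):
    """Append `n_future` NaN placeholders to the known target values."""
    known = np.asarray(y_known, dtype=float)
    return np.concatenate([known, np.full(n_future, np.nan)])


def forecast_origin_index(y_query):
    """Index of the last known (non-NaN) target value, or None if all are NaN."""
    known = np.flatnonzero(~np.isnan(y_query))
    return int(known[-1]) if known.size else None


# Three known hourly values followed by four hours to be forecast
y_query_example = make_forecast_query([5200.0, 5100.0, 5050.0], n_future=4)
origin = forecast_origin_index(y_query_example)
print(y_query_example, origin)
```

In this notebook the query is entirely `NaN`, because the test data immediately follows the training data and the last known value is the final training observation.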

In [None]:
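# predict() takes only the features X; the forecast() call with a
# NaN-filled y is shown in section 5
y_pred_test = fitted_model.predict(X_test)
# residuals: actual minus predicted demand
y_residual_test = y_test - y_pred_test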

plt.plot(X_test['timeStamp'], y_test, label='Actual')
plt.plot(X_test['timeStamp'], y_pred_test, label="Predicted")
plt.xticks(rotation=90)
plt.title('Actual demand vs predicted for test data')
plt.legend()
plt.show()
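
Besides the plot, a few summary numbers make the test performance easier to compare across runs. The setup cell imports `mean_absolute_error`, `mean_squared_error`, and `r2_score` from scikit-learn, which could be applied to `y_test` and `y_pred_test` directly. The sketch below computes the same kinds of quantities with NumPy on made-up arrays so that it runs without a trained model; `summarize_errors` is a hypothetical helper.

```python
import numpy as np


def summarize_errors(y_true, y_pred):
    """Return MAE, RMSE, and R^2, ignoring positions with missing values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Keep only positions where both the actual and predicted values exist
    mask = ~(np.isnan(y_true) | np.isnan(y_pred))
    actual, err = y_true[mask], y_true[mask] - y_pred[mask]
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((actual - actual.mean()) ** 2))
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up values for illustration only
toy_actual = [100.0, 200.0, 300.0, 400.0]
toy_predicted = [110.0, 190.0, 310.0, 390.0]
report = summarize_errors(toy_actual, toy_predicted)
print(report)
```

With the real arrays, the call would be `summarize_errors(y_test, y_pred_test)`.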

## Section 5. Model Explainability: Which features matter for the forecast?

I can use model explainability with automated ML. This gives me transparency into how the model was built and which features have the most influence on the prediction.

The informative features make intuitive sense. Temperature is a strong driver of heating and cooling demand in NYC. Apart from that, the daily life cycle, expressed by `hour`, and the weekly cycle, expressed by `wday`, drive people's energy use habits.

In [None]:
from azureml.train.automl.automlexplainer import explain_model

# use the built-in float: the np.float alias was removed in recent NumPy releases
y_query = y_test.copy().astype(float)
y_query.fill(np.nan)
y_fcst, X_trans = fitted_model.forecast(X_test, y_query)

# feature names are everything in the transformed data except the target
features = X_trans.columns[:-1]
expl = explain_model(fitted_model, X_train, X_test, features = features, best_run=best_run, y_train = y_train)

# unpack the tuple
shap_values, expected_values, feat_overall_imp, feat_names, per_class_summary, per_class_imp = expl
best_run

Open the best run in the Azure portal to see the top features chart.
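
If you prefer to inspect the ranking in the notebook, the overall importance values and feature names unpacked above can be paired and sorted. The sketch below uses invented numbers and partly hypothetical feature names (`temp` and `precip` are assumptions; `hour` and `wday` are the engineered features mentioned earlier), so it only shows the mechanics.

```python
# Hypothetical stand-ins for `feat_names` and `feat_overall_imp`;
# the values are made up and do not come from a real explanation
example_feat_names = ["precip", "hour", "temp", "wday"]
example_feat_imp = [0.02, 0.31, 0.42, 0.15]


def top_features(names, importances, k=3):
    """Return the k (name, importance) pairs with the largest magnitude."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(
        zip(names, importances), key=lambda pair: abs(pair[1]), reverse=True
    )
    return ranked[:k]


top3 = top_features(example_feat_names, example_feat_imp, k=3)
for name, importance in top3:
    print(f"{name:>8}: {importance:.2f}")
```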