<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Household energy prediction for utility load forecasting

_Version 1.0_  
_Author(s): Jon Reifschneider, Duke University Pratt School of Engineering_

<img align="left" style="padding-top:10px;" src="smart_meter.jpg">

Image source: Portland General Electric

## _About this teaching case_
**Level:** Intermediate  
**Language:** Python  
**Libraries:** pandas, matplotlib, scikit-learn  
**Industry:** Energy

**Learning Topics:**  
- Time Series
- Exploratory Data Analysis
- Feature Engineering & Feature Selection
- Supervised Learning Model Selection
- Hyperparameter Tuning

**Learning Objectives**   
- Learn strategies for feature engineering and creation with time series data
- Build skills in exploratory data analysis using visual and statistical analysis techniques
- Learn how to apply the different methods of feature selection
- Build experience in supervised model selection, tuning and validation/testing

**Pre-requisites**  
- Basic proficiency in Python and pandas
- Familiarity with the theoretical foundations of supervised machine learning algorithms

**Case Structure**  
This teaching case is structured to follow the ***Modified CRISP-DM Data Science Process*** used in Duke University's AI for Product Innovation graduate programs. 

**Datasets Used**  
Makonin, S., Ellert, B., Bajić, I. et al. Electricity, water, and natural gas consumption of a residential house in Canada from 2012 to 2014. Sci Data 3, 160037 (2016). https://doi.org/10.1038/sdata.2016.37

Data has been adapted from the original dataset for purposes of this learning activity. The original dataset can be accessed from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FIE0S4

# Contents
[1: Business Understanding](#1)  
[2: Data Understanding](#2)  
[3: Prepare Data](#3)  
[4: Modeling](#4)  
[5: Evaluation & Interpretation](#5)  

In [1]:
# Run this before any other code cell
# This downloads the csv data files into the same directory where you have saved this notebook

import urllib.request
from pathlib import Path
import os
path = Path()

import warnings
warnings.filterwarnings("ignore")

# Dictionary of file names and download links
files = {'Electricity_simple.csv':'https://storage.googleapis.com/aipi_datasets/Electricity_simple.csv',
        'Weather_simple.csv': 'https://storage.googleapis.com/aipi_datasets/Weather_simple.csv'}

# Download each file
for key,value in files.items():
    filename = path/key
    url = value
    # If the file does not already exist in the directory, download it
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url,filename)

# Step 1: Business Understanding <a class="anchor" id="1"></a>

You are working as an analyst for the Canadian electric utility BC Hydro, which serves 1.8 million customers in British Columbia.  You have been asked to work on the utility's load forecasting model for predicting future energy demand.  Accurate demand prediction is critical for the utility to produce sufficient power to meet demand on a short-term and long-term basis.  

The expansion of smart meter usage among many utilities has created increasing interest in machine learning models which predict energy demand at the individual home or building level, which can then be aggregated to a utility's entire territory.  You have been asked to explore this approach using your organization's smart meter data.  A review of advantages and limitations of different machine learning based approaches can be found <a href='https://www.informs-sim.org/wsc15papers/396.pdf'> here</a>.

In particular, you have been asked to better understand the relationship between changing weather conditions and the electric demand for a household. Your first objective is to examine the relationships between various weather parameters and electric consumption and determine which weather parameters most influence a home's energy usage.  Once you have understood these relationships, you have been asked to evaluate different machine learning approaches to modeling a home's energy usage and report back to your team on what features are most valuable to use in prediction, which modeling approach(es) have shown most promise, and how well you are able to predict load at an individual home level.  You will use root mean squared error (RMSE) as the evaluation metric for your modeling work.

If your work ultimately leads to an improvement upon your organization's load forecasting capability, it can result in significant operational savings.  Overforecasting of power demand leads to greater production of power than needed, which must then either be stored or sold. In times when demand is high, overforecasting can also lead to the unnecessary use of expensive peaker plants to produce extra power.  On the other hand, underforecasting can have significant consequences as the utility must purchase additional power, employ demand response measures to curtail demand, or in the worst case enforce rolling blackouts.  



# Step 2: Data Understanding <a class="anchor" id="2"></a>

To assist you with your analysis, you have been provided a dataset containing energy usage per minute for a home in the Vancouver area for the time period April 1 2012 through March 30 2014. The dataset contains the energy consumption in Watts for each minute over the time period.  You have also received a separate dataset of weather information, which contains weather parameters for the local area of the home on an hourly basis over the time period.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### LOAD ADDITIONAL LIBRARIES AS NEEDED



import warnings
warnings.filterwarnings("ignore")

pd.options.display.float_format = '{:,.2f}'.format

## Gather data

First let's import the energy consumption data and convert the index to a time series. 

In [3]:
# Import electricity data
raw_data = pd.read_csv("Electricity_simple.csv")
energy_data = raw_data.copy()

# Convert index to timeseries
energy_data.index = pd.to_datetime(energy_data['UNIX_TS'],unit='s').values
energy_data = energy_data.drop(labels='UNIX_TS',axis=1)

### Check for anomalous values

Plot your data to visually check for potential outliers. 

In [1]:
### BEGIN SOLUTION


### END SOLUTION

When working with sensor data, we occasionally have null values due to problems with power or communications.  Sometimes these are recorded as null values, and other times a value of 0 is reported.  If we look closely at the plot we generated above, there appear to be a few instances where the consumption is reported as 0.

Filter the data to check for any instances where the 'Consumption' value is 0.  If there are 0 values, forward-fill them from the last previously recorded energy usage value.

In [2]:
### BEGIN SOLUTION

    
### END SOLUTION
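
As a hint for the step above, here is a minimal sketch of zero detection and forward-filling. It runs on a small synthetic frame rather than the real meter data: the values are made up, and only the column name `Consumption` comes from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level readings standing in for energy_data (values are made up)
idx = pd.date_range("2012-04-01 00:00", periods=6, freq="min")
demo = pd.DataFrame({"Consumption": [950.0, 0.0, 0.0, 1010.0, 0.0, 990.0]}, index=idx)

# Count the suspicious zero readings
n_zeros = int((demo["Consumption"] == 0).sum())

# Treat zeros as missing, then carry the last valid reading forward
cleaned = demo.copy()
cleaned["Consumption"] = cleaned["Consumption"].replace(0, np.nan).ffill()
```

Note that a forward-fill cannot repair a zero in the very first row, since there is no earlier reading to copy; that case would need separate handling.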

### Resample the energy consumption to hourly

Our energy consumption data has been collected every minute over the time period. For purposes of modeling, aggregate the data to an hourly frequency.  Then, visualize the energy consumption by hour of the day to understand the usage pattern throughout a normal day.

In [3]:
### BEGIN SOLUTION



### END SOLUTION
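
One possible approach to the resampling step, sketched on synthetic data. Aggregating with the mean (which keeps the unit in watts) is an assumption; summing or converting to watt-hours would be an equally defensible choice.

```python
import pandas as pd

# Two days of synthetic minute-level data (made-up values, not the real meter data)
idx = pd.date_range("2012-04-01", periods=2 * 24 * 60, freq="min")
demo = pd.DataFrame({"Consumption": 500.0 + 10.0 * idx.hour.to_numpy()}, index=idx)

# Aggregate to hourly frequency ("60min" is one hour); the mean keeps the unit in watts
hourly = demo.resample("60min").mean()

# Average consumption for each hour of the day (0-23)
by_hour = hourly.groupby(hourly.index.hour)["Consumption"].mean()
```

Calling `by_hour.plot(kind="bar")` would then give a simple view of the daily usage profile.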

### Add in weather data
Now, add in weather data to your dataframe.  The hypothesis you have been asked to explore is that weather conditions impact energy usage, and so you expect that a model which includes weather parameters will perform better than one which accounts only for the time-dependent fluctuations in consumption over the course of a day.

Read in the weather data from 'Weather_simple.csv', add it to your dataframe and convert any categorical variables into numerical variables.

In [4]:
### BEGIN SOLUTION


### END SOLUTION
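
A sketch of joining weather data onto hourly consumption and converting a text column to numbers. The weather column names `Temp` and `Weather` are hypothetical stand-ins, since the actual columns of 'Weather_simple.csv' are not described here; reading the file and building its datetime index are left out for the same reason.

```python
import pandas as pd

idx = pd.date_range("2012-04-01", periods=4, freq="60min")
hourly = pd.DataFrame({"Consumption": [900.0, 850.0, 800.0, 1200.0]}, index=idx)

# Hypothetical weather table: the column names "Temp" and "Weather" are assumptions
weather = pd.DataFrame(
    {"Temp": [8.5, 8.1, 7.9, 9.4], "Weather": ["Cloudy", "Rain", "Rain", "Clear"]},
    index=idx,
)

# Align the two tables on their shared hourly timestamps
merged = hourly.join(weather, how="left")

# Map each category label to an integer code (labels are ordered alphabetically)
merged["Weather"] = merged["Weather"].astype("category").cat.codes
```

Integer codes imply an ordering between categories that may not be meaningful, which is one reason to revisit the encoding before modeling.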

## Validate data

Check to see if there are any null values in the dataset.  If there are null values, replace them using a forward-fill approach (fill each null value with the last previously reported value).

In [5]:
### BEGIN SOLUTION


### END SOLUTION

## Explore data

Visually analyze the relationships between Consumption and the weather parameters you have available, and explain the relationships you can visually identify.

In [6]:
### BEGIN SOLUTION


### END SOLUTION

# Step 3: Prepare Data <a class="anchor" id="3"></a>
## Feature Engineering & Feature Selection
### Add time series features
Create new features which may help explain the variation of consumption over time due to seasonality.  Add these features into your dataframe.

In [7]:
### BEGIN SOLUTION


### END SOLUTION
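
A minimal sketch of calendar-based features derived from a `DatetimeIndex`. The feature names (`hour`, `dayofweek`, `month`, `is_weekend`) are illustrative choices, not a prescribed set.

```python
import pandas as pd

# Three days of hourly timestamps; 2012-04-01 is a Sunday
idx = pd.date_range("2012-04-01", periods=72, freq="60min")
demo = pd.DataFrame({"Consumption": 1000.0}, index=idx)

# Calendar features capture daily, weekly and annual seasonality
demo["hour"] = demo.index.hour
demo["dayofweek"] = demo.index.dayofweek  # Monday=0 ... Sunday=6
demo["month"] = demo.index.month
demo["is_weekend"] = (demo.index.dayofweek >= 5).astype(int)
```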

Visually analyze your new features and comment on whether you would expect them to have value in modeling the consumption over time, based on your visualizations.

In [8]:
### BEGIN SOLUTION



### END SOLUTION

### Univariate feature selection  
First, you should create a subset of the data to use for feature selection which includes only the training data.  We do not use our test data set for feature selection so it does not influence our choice of features to use for modeling.  We will train our model on the data from the period April 2012 - February 2014.  We will then use our trained model to predict the energy consumption for each hour in the month of March 2014 as a test set to evaluate our model performance.  Below, create a subset to use for feature selection.

In [9]:
### BEGIN SOLUTION



### END SOLUTION
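
The date-based split described above can be sketched with label slicing on a `DatetimeIndex`. The frame below is a synthetic placeholder covering the same period; only the date boundaries come from the case description.

```python
import pandas as pd

# Hourly index covering the study period (placeholder values)
idx = pd.date_range("2012-04-01", "2014-03-30 23:00", freq="60min")
demo = pd.DataFrame({"Consumption": 1000.0}, index=idx)

# Label-based slicing on a DatetimeIndex includes both endpoints,
# and a date string such as "2014-02-28" covers that whole day
train = demo.loc[:"2014-02-28"]
test = demo.loc["2014-03-01":]
```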

**Continuous variables**  
Perform univariate feature selection on your continuous variables, selecting an appropriate statistical test, to identify any features which are duplicative or unnecessary.

In [10]:
### BEGIN SOLUTION


### END SOLUTION
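
One way to approach this, sketched on synthetic data: Pearson correlations between features can expose duplicative pairs, and scikit-learn's `f_regression` F-test can flag features with little linear relationship to the target. The feature names `temp`, `dew_point` and `pressure` are hypothetical, and other tests may suit your data better.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 500
temp = rng.normal(10, 5, n)
train = pd.DataFrame({
    "temp": temp,
    "dew_point": temp - 2 + rng.normal(0, 0.1, n),  # nearly duplicates temp
    "pressure": rng.normal(101, 1, n),              # unrelated to the target
})
train["Consumption"] = 1000 - 20 * temp + rng.normal(0, 10, n)

features = ["temp", "dew_point", "pressure"]

# Feature-to-feature Pearson correlations reveal duplicative pairs
feature_corr = train[features].corr()

# F-test of each feature against the target flags features with little linear signal
f_stats, p_values = f_regression(train[features], train["Consumption"])
scores = pd.DataFrame({"F": f_stats, "p_value": p_values}, index=features)
```

Both checks are linear, so a feature with a strongly curved relationship to consumption could score poorly here and still be useful.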

**Categorical variables**  
Perform univariate feature selection on your categorical variables, selecting an appropriate statistical test, to identify any features which are duplicative or unnecessary.


In [11]:
### BEGIN SOLUTION


### END SOLUTION
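
For a categorical feature and a continuous target, a one-way ANOVA is one reasonable test. The sketch below uses `scipy.stats.f_oneway` on synthetic data; the feature names `hour_block` and `coin` and the helper `anova_p_value` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n = 600
train = pd.DataFrame({
    "hour_block": rng.choice(["night", "day", "evening"], n),  # hypothetical feature
    "coin": rng.choice(["heads", "tails"], n),                 # unrelated to the target
})
base = train["hour_block"].map({"night": 600.0, "day": 900.0, "evening": 1400.0})
train["Consumption"] = base + rng.normal(0, 50, n)

def anova_p_value(df, feature, target="Consumption"):
    """One-way ANOVA: does the mean of the target differ across the feature's categories?"""
    groups = [g[target].to_numpy() for _, g in df.groupby(feature)]
    return f_oneway(*groups).pvalue

p_values = {col: anova_p_value(train, col) for col in ["hour_block", "coin"]}
```

A small p-value suggests the category means differ, which is evidence the feature carries information about consumption.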

**Drop unnecessary features**  
Based on your univariate feature selection tests, which variables have you determined can be dropped from the dataset?  Any features you identified as duplicative or clearly unnecessary can now be dropped to simplify the dataset.  Drop them below.

In [12]:
### BEGIN SOLUTION


### END SOLUTION

**Add polynomial terms for continuous features**  
As we could see in the scatterplots above, the relationship between Consumption and some of our continuous variables may not be linear.  To better model these relationships, we can add polynomial terms as new features.  Do this below, before you conduct your final round of feature selection, so you can include and evaluate these quadratic features and interaction terms.

In [13]:
### BEGIN SOLUTION


### END SOLUTION

### Embedded feature selection using Feature Importance

Use the feature importances from a random forest model in scikit-learn to evaluate the importance of each feature in your dataset.

In [14]:
### BEGIN SOLUTION


### END SOLUTION
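
A sketch of reading impurity-based importances from a fitted `RandomForestRegressor`. The data is synthetic, with a target driven by `hour` only, and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "temp": rng.normal(10, 5, n),
    "noise": rng.normal(0, 1, n),
})
# Synthetic target that depends on hour only
y = 800 + 30 * X["hour"] + rng.normal(0, 20, n)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Importances sum to 1; larger values mean the feature was used for more impurity reduction
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
```

Fit the forest on training data only, for the same reason the test set is excluded from univariate feature selection.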

Using the results from your feature importance model, determine which features you would like to keep and drop the features that do not have value.

In [15]:
### BEGIN SOLUTION


### END SOLUTION

## Encode categorical variables

Encode your categorical variables to prepare them for modeling, using an encoding method of your choice.

In [16]:
### BEGIN SOLUTION


### END SOLUTION

# Step 4: Modeling <a class="anchor" id="4"></a>

## Define validation/test approach

Before you begin training and evaluating models, you need to determine your approach for validation and testing. Earlier we determined that we will use the data from March 2014 as our test set.  We will use the remainder for training.  

However, to evaluate and compare different modeling approaches you should also define a validation set strategy.  Because we are working with time series data, standard shuffled k-fold cross-validation is not a good strategy: consumption values that are close in time are likely to be similar, so randomly assigned folds would let the model train on observations adjacent to the ones it is validated on.  Thus, we can either create hold-out validation sets or we can use scikit-learn's TimeSeriesSplit, which creates successive training sets as supersets of those that come before them. Since you are going to perform algorithm selection as well as hyperparameter tuning, the preferred approach would be to use nested validation or nested cross-validation.  

Using a nested cross-validation approach, you will use an inner cross-validation loop to perform hyperparameter optimization of each algorithm.  You will then use an outer cross-validation loop to compare the prediction ability of the optimized model from each algorithm and select the algorithm that performs best.  Finally, you will re-do the hyperparameter tuning using the full training plus validation data, and use the test set to evaluate the performance of your final optimized model. 
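
To make the behavior of `TimeSeriesSplit` concrete, the sketch below lists the index sets it produces for 12 time-ordered observations (a toy array, not the case data).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

# Each split is (training indices, validation indices)
splits = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X)
]
# splits[0] -> ([0, 1, 2], [3, 4, 5])
# splits[1] -> ([0, 1, 2, 3, 4, 5], [6, 7, 8])
# splits[2] -> ([0, 1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11])
```

Each training set extends the previous one, and every validation fold lies strictly after its training data, so no future observations leak into training.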

Begin by splitting data into training and test set.

In [20]:
### BEGIN SOLUTION


### END SOLUTION

## Establish a baseline for performance
It is usually a good idea to start with a simple model and evaluate its performance to establish a baseline.  Then, you can proceed with more advanced model selection methods and hyperparameter tuning and easily determine how much they improve your performance relative to your simple baseline.

Train a linear regression model on your training set and then calculate the RMSE of your baseline model on your test set.

In [17]:
### BEGIN SOLUTION


### END SOLUTION
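
A sketch of a linear regression baseline scored with RMSE. The data is synthetic and uses a single hypothetical feature `temp`; your own feature list will differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
idx = pd.date_range("2014-01-01", "2014-03-30 23:00", freq="60min")
data = pd.DataFrame({"temp": rng.normal(5, 4, len(idx))}, index=idx)
data["Consumption"] = 1200 - 25 * data["temp"] + rng.normal(0, 30, len(idx))

# Hold out March 2014 as the test set
train, test = data.loc[:"2014-02-28"], data.loc["2014-03-01":]
features = ["temp"]  # hypothetical feature list

baseline = LinearRegression().fit(train[features], train["Consumption"])
pred = baseline.predict(test[features])

# RMSE is the square root of the mean squared error
rmse = float(np.sqrt(mean_squared_error(test["Consumption"], pred)))
```

Because RMSE is in the same unit as the target, the baseline value gives a direct yardstick for later models.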

## Model selection

Now you are ready to train and evaluate models on the data. Determine which algorithms you would like to evaluate.  Then use a time series specific nested cross-validation approach to optimize models for each algorithm and compare your optimized models.  

In [18]:
### BEGIN SOLUTION


### END SOLUTION
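
A compact sketch of time-series nested cross-validation, with `GridSearchCV` as the inner loop and a manual outer loop over `TimeSeriesSplit` folds. The two candidate algorithms, their small grids and the synthetic data are illustrative assumptions, not a recommended shortlist.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))
y = 1000 + 50 * X[:, 0] - 30 * X[:, 1] + rng.normal(0, 10, n)

# Illustrative candidates: (estimator, hyperparameter grid)
candidates = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "forest": (RandomForestRegressor(n_estimators=30, random_state=0), {"max_depth": [3, 6]}),
}

outer = TimeSeriesSplit(n_splits=3)
results = {}
for name, (estimator, grid) in candidates.items():
    fold_rmse = []
    for train_idx, val_idx in outer.split(X):
        # Inner loop: tune hyperparameters using only the outer-training portion
        search = GridSearchCV(estimator, grid, cv=TimeSeriesSplit(n_splits=3),
                              scoring="neg_root_mean_squared_error")
        search.fit(X[train_idx], y[train_idx])
        # Outer loop: score the tuned model on the later, unseen validation fold
        pred = search.predict(X[val_idx])
        fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], pred)))
    results[name] = float(np.mean(fold_rmse))

# The algorithm with the lowest average outer-fold RMSE
best_algorithm = min(results, key=results.get)
```

The outer scores compare algorithms only; the hyperparameters chosen inside each fold are discarded, which is why a final tuning pass on the full training data is still needed.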

## Hyperparameter tuning on selected model  
Based on the results of your above model selection, determine the algorithm you will use.  Then, use the entire training dataset to tune and train your model. Start by tuning your model's hyperparameters to optimize it.  Consider your choice of validation strategy for the tuning - either a hold-out validation set or a time series cross validation approach.

In [19]:
### BEGIN SOLUTION


### END SOLUTION

# Step 5: Evaluation & Interpretation <a class="anchor" id="5"></a>

You can now use your optimized model to generate predictions on your test set and calculate the test set RMSE.

In [20]:
### BEGIN SOLUTION


### END SOLUTION

Also, create a residual plot of your model predictions to help diagnose any potential issues with your model.  For a high-quality model, the residuals should be small and randomly distributed around the centerline.  Patterns in the residual plot mean the model is unable to capture some of the explanatory information. We can also use the residual plot as another check for the existence of outliers.

In [21]:
### BEGIN SOLUTION


### END SOLUTION
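
Alongside the plot, residual patterns can also be checked numerically. The sketch below uses synthetic values and a deliberately poor hypothetical model that predicts a constant, so the leftover hour-of-day pattern is easy to see.

```python
import pandas as pd

idx = pd.date_range("2014-03-01", periods=48, freq="60min")
actual = pd.Series(1000.0 + 20.0 * idx.hour.to_numpy(), index=idx)

# Hypothetical predictions from a model that ignores the hour-of-day pattern
predicted = pd.Series(actual.mean(), index=idx)

# Residual = actual minus predicted
residuals = actual - predicted

# Residuals averaged by hour of day; a flat profile near zero would suggest no leftover pattern
residual_by_hour = residuals.groupby(residuals.index.hour).mean()
```

Here the residuals average to zero overall but rise steadily through the day, which is the kind of structure a residual plot is meant to reveal.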

Finally, for the test period plot the predicted consumption values and the actual consumption values.

In [22]:
### BEGIN SOLUTION


### END SOLUTION

### Model Interpretation

Based on your feature selection and model, what have you discovered about the key drivers of energy consumption for this household?

Based on your final model performance and your visual analyses of the predictions and residuals, what insights can you derive about your model? Does it do a reasonable job in predicting future energy consumption at the household level? What might you do to further improve it?

If your utility organization were to design a new load forecasting system for their whole network using a bottom-up approach based on your individual household model, what challenges might you foresee?

# The End