In [None]:
!pip install -r https://raw.githubusercontent.com/EluciDATALab/elucidatalab.starterkits/main/notebooks/SK_1_3_Resource_Demand_Forecasting/requirements.txt
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
# from IPython.core.display import display, HTML   # deprecated
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

import plotly.io as pio
pio.renderers.default="plotly_mimetype+notebook"

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.float_format', '{:.3f}'.format)

from starterkits.starterkit_1_3.support import read_consumption_data, read_climate_data, printmd
from starterkits.starterkit_1_3.visualizations import run_app, plot_effects_resampling, plot_temperature_power_one_year, plot_yearly_data, \
    plot_weekly_data, plot_correlation, plot_consumption_with_holidays_weekends, plot_auto_correlation
import holidays

from pathlib import Path
DATA_PATH = Path('../../data/')

In [None]:
data = read_consumption_data(DATA_PATH)
ext_data = read_climate_data(DATA_PATH)

# Starter Kit 1.3: Resource Demand Forecasting

## Business context

*Resource demand forecasting* concerns accurately predicting the future need for a resource, typically using historical information. It is one of the most essential steps in resource demand management, i.e. ensuring sufficient resources are available to satisfy a fluctuating demand. Resource demand forecasting allows one to plan ahead and guarantee that sufficient resources are available when needed and to avoid costly countermeasures in case of shortage.

A typical example of resource demand forecasting is provided by the energy sector. For energy suppliers it is more beneficial to buy electricity on the day-ahead market than on the spot market. Consequently, the more accurately the energy consumption can be predicted, the lower the cost. Given that more and more houses and appliances are equipped with a smart meter, an increasing amount of data becomes available that can enable and improve this prediction. Furthermore, these kinds of predictions also allow energy suppliers to better balance demand and supply, and help them to ensure proper grid operation (e.g. to prepare for possible peak loads in winter time).

This Starter Kit will focus on energy consumption forecasting as a use case to illustrate how resource demand forecasting using historical data can be realized.
 
## Business goal

The business goal related to this Starter Kit is **forecasting the demand for a particular resource at a particular point in the future** based on usage data of that resource that is available from the past. More specifically, we will illustrate a data-driven approach for forecasting the energy consumption of households. 



## Application contexts

Forecasting the demand of a particular resource can be useful for several purposes in a variety of industrial contexts:
* Forecast parking demand or traffic density in a particular neighbourhood in order to control traffic in intelligent ways (e.g. divert to alternative routes or parking spots)
* Predict battery consumption based on operating conditions in order to intelligently suggest recharging moments or switch to low-power mode
* Estimate the usage of consumables (e.g., ink, paper) based on product usage data in order to deliver new consumables exactly on time before stock breakage or avoid expensive storage of superfluous stock (e.g. in a manufacturing environment)
* ...

## Data requirements

In order to realize the business goal, we need a dataset (time series) that includes:
1. a value indicating the demand for the resource at each moment in time.
2. the factors that influence the demand of the resource. These can be internal or external, for example:
    - the energy consumption of a building is influenced by the outside temperature, the amount of people present in the building, etc.
    - the consumption of raw materials is influenced by the type of product that is being produced, the production line settings, etc.
    - the traffic density is influenced by the time of day, the presence of road works, the weather, etc.
3. 'representative' historical data. The amount of data that is needed to make an accurate prediction is typically determined by the length of the temporal patterns in the data. For example, if the data exhibits a yearly seasonality, historical data spanning multiple years is needed.
4. data sampled with the 'appropriate' frequency (i.e. how many data points you gather per minute/hour/day). This frequency limits the achievable forecast frequency, e.g. it is impossible to forecast every 10 minutes when only 15-minute data is available.
5. in some use cases, where resource demand forecasting concerns particular processes, (expert) knowledge of these processes (i.e. as labeled data).
    

## Starter Kit outline

In this Starter Kit, we will illustrate a data-driven approach for forecasting the energy consumption of a household. 
As a dataset we will use the energy consumption of the household and climate data collected by a nearby weather station. We will use this dataset to illustrate:

1. How to prepare your data for the analysis. We will describe how to handle the typical problems of raw data and how to perform data fusion when multiple sources of data are available.

2. How to get insights from your dataset. We will present visual and numerical techniques in order to explore your data for identifying interesting patterns. 

3. How to transform the obtained insights into useful features for building forecasting models. 

4. How to compare correctly the performance of multiple forecasting models.

## Data loading and preprocessing

For the **household energy consumption** use case we consider here, two types of data are relevant:
1. Household data, which refers to the total household energy consumption along with the consumption of the individual energy-consuming appliances
2. Climate data, which contains information on climate factors (e.g. temperature) that can influence the energy consumption of households

In the following subsections, we will load these datasets and perform some basic preprocessing, e.g. correctly handling missing values and fusing both datasets.

### Household data

As a dataset we use the [public energy consumption data set from the UCI repository](http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption). This dataset contains measurements of electric power consumption in one household, located near Paris, with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and sub-metering values are available, as listed below:

1. `Date_Time`: the date and time of day of the measurement
2. `Global_active_power`: household global minute-averaged active power (in kilowatt)
3. `Global_reactive_power`: household global minute-averaged reactive power (in kilowatt)
4. `Voltage`: minute-averaged voltage (in volt)
5. `Global_intensity`: household global minute-averaged current intensity (in ampere)
6. `Sub_metering_1`: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
7. `Sub_metering_2`: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
8. `Sub_metering_3`: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.

In [None]:
start_date = data.index.min().date()
end_date = data.index.max().date()
printmd(f"The dataset contains data from {start_date} to {end_date}.")

data.head()

From the table below, we can see that every sensor has 2,049,280 instances. This already indicates that there are no missing values. Beyond that, nothing unusual stands out immediately, e.g. there are no anomalies in the sensor data such as negative power consumption.

In [None]:
data.describe()

For the purpose of this Starter Kit, only the `Date_Time` and the `Global_active_power` attributes will be considered from now onwards. 

In [None]:
data = data[['Global_active_power']]

### Climate data

As an additional dataset, we use the climate information recorded close to the city of Paris, where the household is located, provided by the <i>Wunderground</i> website. It contains climate conditions recorded every 30 minutes, described by the attributes listed below:

1. `time_UTC`: the date and hour of the measurement 
2. `temperature`: temperature (degrees Celsius)
3. `dew_point`: dew point (degrees Celsius)
4. `humidity`: air humidity (percentage)
5. `sea_lvl_pressure`: pressure at sea level (hPa)
6. `visibility`: visibility (km)
7. `wind_dir_degrees`: wind direction expressed in degrees (degrees)

In [None]:
ext_data.head(5)

From the statistics in the table below, we can see the data contains minimum values of -9999 for some variables. According to [the documentation](https://www.wunderground.com/weather/api/d/docs?d=resources/phrase-glossary&_ga=1.264468047.1395257589.1468305478&MR=1), values of -9999 or -999 signify missing values. We will replace these with the proper missing value marker NaN, since otherwise they would be treated as real measurements and distort the analysis.

In [None]:
ext_data.describe()

In [None]:
ext_data.replace(to_replace=-9999, value=np.nan, inplace=True)
ext_data.describe()

We can now notice that the temperature values seem to be more plausible. As expected, we also see that not all sensors have the same number of instances, meaning there are missing values.
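
As a minimal sketch of this sentinel handling, assuming a toy frame with made-up readings (the column values below are illustrative and not taken from the dataset), both documented sentinel values can be mapped to NaN in one call, after which the remaining gaps can be counted per column:

```python
import numpy as np
import pandas as pd

# Toy climate readings (illustrative values, not from the dataset).
toy_climate = pd.DataFrame({
    "temperature": [3.0, -9999.0, 4.5, 5.0],
    "humidity": [81.0, 79.0, -999.0, -9999.0],
})

# Map both documented sentinel values to NaN.
cleaned = toy_climate.replace([-9999, -999], np.nan)

# Count the missing values per column after the replacement.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Unequal counts per column, as in this toy frame, are what the differing instance counts in the `describe()` output reflect.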

For the purpose of this Starter Kit, only the `time_UTC` and `temperature` attributes will be considered from now on.

### Data fusion

The evolution of household energy consumption may be influenced by climatological factors, e.g. lower or higher temperatures can lead to a higher energy consumption for some appliances such as water-heaters and air-conditioners. 

To understand and analyze this relationship, we need to fuse the household data with the climate data. However, these datasets have different sampling rates: the household data is sampled every minute, whereas the climate data is sampled every 30 minutes. The integration of these 2 datasets thus requires careful consideration.

Two options are possible: we can either upsample the climate data or downsample the household data. We select the second option: upsampling would require us to impute the missing values in the climate dataset, whereas downsampling also reduces the size of the dataset and thus the compute time required for the analysis.

The effect of downsampling the household data can be seen below. The user can experiment with how window sizes of 30 minutes, 1 hour, or 4 hours affect the evolution of the global active power.

In [None]:
plot_effects_resampling(data)

As we can see from the plot, resampling reduces the (small) peaks of the original signal. The effect becomes more pronounced for larger window sizes. A window size of 4 hours arguably removes too much variation. A window of 30 minutes or 1 hour, on the other hand, still captures the general evolution of the active power, which is exactly what is needed for characterizing the typical household consumption. Between the 30-minute and 60-minute windows, we choose 60 minutes since this reduces the dataset size and speeds up the computations required later in the analysis.

To downsample the climate dataset, the mean of the values within each hour is used. To downsample the household dataset, the sum of the values within each hour is taken.
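
The attenuation of peaks can be sketched on a synthetic signal. The example below is only an illustration under stated assumptions: a made-up flat 1 kW load with a single 10-minute spike, averaged over the three window sizes discussed above.

```python
import pandas as pd

# Synthetic one-minute signal: a flat 1 kW load with a single 10-minute spike.
index = pd.date_range("2008-01-01", periods=8 * 60, freq="min")
power = pd.Series(1.0, index=index)
power.iloc[100:110] = 5.0

# Maximum of the window-averaged signal for each window size.
peaks = {window: power.resample(window).mean().max()
         for window in ["30min", "60min", "240min"]}
print(peaks)
```

The larger the window, the more the spike is diluted by the surrounding baseline, so the maximum of the averaged signal moves closer to 1 kW.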

An excerpt of the downsampling of household dataset is shown in the table below.

In [None]:
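# Hourly mean of each climate variable
ext_data_h = ext_data.resample('1H').mean()

# Hourly sum of the minute-averaged power (converted to Wh in the next step)
data_h = data.resample('1H').agg({'Global_active_power': 'sum'})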

data_h[:3]

Since `Global_active_power` is expressed in minute-averaged kW, we will convert it to Wh over an hour, so that the unit is in line with the resampling frequency. To this end, the following transformation is applied (the summing was already done in the previous step):

\begin{align*}
\mathit{globalactive}\ Wh &= \sum^{60}_{min=1} 1000\cdot\mathit{globalactivepower}_{\mathit{min}}\ kW\cdot \frac{1}{60}\ h\\
\end{align*}
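
As a quick sanity check of this formula, consider a hypothetical hour in which every minute-averaged reading equals 1.2 kW (an illustrative value). A constant 1.2 kW load sustained for one hour should correspond to 1200 Wh:

```python
# Sixty minute-averaged readings of 1.2 kW each (illustrative values).
readings_kw = [1.2] * 60

# Each reading covers 1/60 h; the factor 1000 converts kWh to Wh.
energy_wh = sum(readings_kw) * 1000 / 60
print(round(energy_wh, 6))
```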

The result of this transformation can be seen in the table below.

In [None]:
# Hourly sum of minute-averaged kW -> Wh (vectorized instead of a row-wise lambda)
data_h['Global_active_power'] = data_h['Global_active_power'] * 1000 / 60
data_h[:3]

Now we can finally merge the two datasets into an extended dataset that will be explored in the next section.

In [None]:
new_data = pd.concat((data_h,ext_data_h),axis=1,join='inner')
new_data[:3]

Before proceeding, we inspect to what extent this extended dataset contains missing data.

In [None]:
new_data.isna().mean()

If there were many missing values, we could consider data imputation. In this case, however, only `temperature` has missing values (0.01%), so there is no need for imputation.
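
Should imputation become necessary for another dataset, one common option is linear interpolation in time. The sketch below is not part of this Starter Kit's pipeline; it uses a toy hourly series with made-up values, and the limit of two consecutive gaps is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy hourly temperatures with one gap (illustrative values).
index = pd.date_range("2008-01-01", periods=5, freq="60min")
temperature = pd.Series([2.0, 3.0, np.nan, 5.0, 6.0], index=index)

# Linear interpolation in time, filling at most two consecutive missing values.
filled = temperature.interpolate(method="time", limit=2)
print(filled)
```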

## Data Exploration

In this section, we explore the extended dataset in order to understand which factors influence the household energy consumption. Based on this exploration, we will later on identify which features are useful for training a machine learning algorithm to forecast energy consumption.

There are several approaches to data exploration. In this section we will use:
- visual exploration by means of time plots
- statistical data exploration

### Visual Data Exploration


Time plots are a common way for exploring time series data, such as the energy consumption that we consider here. We explore time plots in different time frames in order to highlight different aspects in the data:
- we start with a yearly plot, which allows us to identify global patterns. 
- a monthly plot allows us to zoom in and understand relationships between energy consumption in different weeks and days
- a weekly plot allows us to zoom in even further and identify daily and hourly patterns

#### Yearly patterns

It is expected that energy consumption depends on the outdoor temperature, since in winter time heaters are used more frequently than in summer time for example. For this reason, we plot the evolution of the energy consumption along with the temperature. 

In the plot below, we resample the hourly data of both datasets to daily data, so that we don't get lost in details but can observe more global patterns. 

The user can toggle the normalization of the data on or off. The normalization that is applied is called _min-max normalization_ and is explained in detail in SK 3.4. It applies the following formula to the data to ensure that all values lie within the [0, 1] range:

\begin{align*}
\mathit{X_{scaled}} = \frac{X - X_{min}}{X_{max} - {X_{min}}}\\
\end{align*}
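
A minimal sketch of this formula in code, using a hypothetical helper `min_max_scale` (not part of the Starter Kit library) and made-up values:

```python
import pandas as pd

def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale a series linearly so its minimum maps to 0 and its maximum to 1."""
    value_range = values.max() - values.min()
    if value_range == 0:
        # A constant series has no spread; return zeros to avoid dividing by zero.
        return values * 0.0
    return (values - values.min()) / value_range

scaled = min_max_scale(pd.Series([2.0, 4.0, 6.0]))
print(scaled.tolist())
```

Because each variable is rescaled with its own minimum and maximum, two series with very different units end up on the same [0, 1] axis.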


The user is invited to verify, without selecting the normalization option, whether temperature and global active power follow the same trend. The user can then repeat the analysis with the normalization option enabled.

In [None]:
plot_temperature_power_one_year(new_data)

It should be clear that without normalization, it is difficult to compare the temperature with the global active power because both use different ranges, and the range of the global active power dominates the range of the temperature. 

With normalization, however, the plot above reveals interesting insights. Both the temperature and the global active power follow a seasonal trend. Obviously, the temperature is lower in winter and higher in summer, whereas the reverse is true for the global active power. This is to be expected: energy consumption is higher during winter time. In more formal terms, we can observe that the temperature and global active power exhibit an inverse correlation: when one increases the other decreases, and vice versa.

#### Monthly patterns

Let's now see if there is a monthly recurring pattern as well. In the following figure, the active power of the selected year(s) is shown. The user can change the resampling rate to make the patterns clearer, and can (de)select individual years on the right of the figure to reduce clutter.

In [None]:
plot_yearly_data(new_data)

While we can again clearly see the seasonal effect, there is no clear pattern recurring every month. One interesting observation is that strong dips in power consumption tend to happen at the same time in February and November in 2007 and 2008. Most likely, these are holidays, which tend to fall in the same periods for a given household. However, these dips do not occur at the same time in 2009.

#### Weekly patterns

In this section we further explore the evolution of the energy consumption in a more fine-grained time frame.
We want to verify whether the energy consumption exhibits the same patterns from week to week.

In the plot below, the user can inspect and compare the energy consumption in December of different years. We mark weekends in yellow and Christmas day in orange. We invite the user to see for each year if they can notice a difference between week days and weekends and if there are any changes around the Holidays.

In [None]:
plot_consumption_with_holidays_weekends(new_data)

What should be obvious from exploring this plot is that the daily energy consumption depends on the day of the week: on most weekend days of the three years, we can observe slightly higher energy consumption than on weekdays. This may be explained by common habits such as working at the office or children being at school.

The effect of the holidays can also be seen. On Christmas day in 2007, the power consumption is slightly higher than on other typical weekdays, but similar values can be seen e.g. exactly one week before. So the holidays don't have a strong impact here. In 2008, on the other hand, the power consumption is lower, indicating that perhaps the household celebrated outside their house. In 2009, there is a peak in consumption on Christmas Eve.

We can conclude that whether a day is a weekday, a weekend day or a holiday has an influence on the expected power consumption.

#### Daily patterns

Besides yearly, monthly and weekly patterns, we are also interested in daily patterns. Below, we plot the evolution of the global active power for one week (the first week of December 2009).

We can see how the power consumption follows a daily recurring pattern, with peaks in the morning and evening and troughs at night and midday. In fact, this pattern is also already present in the previous plot, where daily morning and evening peaks can also be observed. 

During the night and the day, we still see low amounts of power usage, most likely due to appliances like refrigerators and heating.

In [None]:
plot_weekly_data(new_data)

### Statistical Data Exploration

The visualizations above highlighted interesting patterns in the energy consumption of the household that can be exploited for feature extraction. In this section we use statistical methods to verify and quantify those observations.

#### Autocorrelation

In order to find repetitive patterns in time series data, we can use *autocorrelation*, which provides the correlation between the time series and a delayed copy of itself. In order to investigate this autocorrelation, the plot below visualizes the autocorrelation of the global active power with time lags of 1 day, 2 days, etc. Note that we do not show the results of a delay of 0 days, since the correlation with the exact same day would trivially be 1.

In [None]:
plot_auto_correlation(new_data)

The plot above reveals that
- a clear peak occurs after 1 day, indicating the consumption of a day is highly correlated with the consumption of the previous day
- another clear peak occurs every 7 days, which confirms the existence of a weekly pattern, i.e. the consumption of a day is highly correlated with the consumption of the same day a week earlier
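
How a weekly pattern shows up in the autocorrelation can be sketched with `pandas.Series.autocorr`. The example below is an illustration on a made-up daily series (higher values on the last two days of each week), not a reproduction of `plot_auto_correlation`:

```python
import pandas as pd

# Synthetic daily consumption: 20 weeks with higher values on the last two days of each week.
weekly_pattern = [1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0]
daily = pd.Series(weekly_pattern * 20,
                  index=pd.date_range("2008-01-07", periods=140, freq="D"))

# Autocorrelation for lags of 1 to 14 days (lag 0 is trivially 1).
autocorrelation = {lag: daily.autocorr(lag=lag) for lag in range(1, 15)}
print({lag: round(value, 2) for lag, value in autocorrelation.items()})
```

In this toy series the lags of 7 and 14 days stand out because the shifted copy lines up exactly with the original weekly cycle.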

#### Correlation

As visually observed above, the temperature influences the energy consumption of a household. In order to verify the strength of this relationship, we can compute the correlation between these two variables. For this purpose, we compute Pearson's correlation. 

The plot below again shows both temperature and consumption and allows us to investigate their relationship. This relationship is influenced by the sampling frequency of the dataset, e.g. temperature changes are typically less drastic than energy consumption changes. We invite the user to experiment with the different resampling periods to understand how they change the correlation and which frequency leads to the strongest correlation.

In [None]:
plot_correlation(new_data)

What should be obvious from the above is that there is a negative correlation between temperature and energy consumption, meaning the consumption rises when the temperature drops and vice versa. The larger the resampling window, the stronger the correlation. This is because larger windows smooth out more of the short-term fluctuations, leaving only the seasonal pattern described above.
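
The effect of the resampling window on Pearson's correlation can be sketched on synthetic data. The series below are assumptions made purely for illustration: an invented seasonal temperature curve, a consumption curve with the opposite trend, and independent random noise on both.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One synthetic year of hourly data (illustrative, not the real measurements).
index = pd.date_range("2008-01-01", periods=24 * 364, freq="60min")
day_of_year = index.dayofyear.to_numpy()
seasonal = 10 * np.sin(2 * np.pi * (day_of_year - 100) / 365)

toy = pd.DataFrame({
    # Temperature follows the season; consumption follows the opposite trend.
    "temperature": 12 + seasonal + rng.normal(0, 10, len(index)),
    "consumption": 1000 - 40 * seasonal + rng.normal(0, 400, len(index)),
}, index=index)

# Pearson correlation after averaging over increasingly large windows.
correlations = {
    window: toy.resample(window).mean().corr(method="pearson").loc["temperature", "consumption"]
    for window in ["60min", "1D", "7D"]
}
print(correlations)
```

Averaging suppresses the independent short-term noise while keeping the shared seasonal component, so the correlation becomes more strongly negative as the window grows.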


## Feature engineering

In this section, we will extract several features from the extended dataset that can be used for forecasting energy consumption. A feature is an individual measurable property of the phenomenon being observed. Choosing informative, discriminating and independent features is a crucial step for building effective machine learning algorithms. 

Based on the previous visual and statistical observations, we will consider the following features to predict the energy consumption for the hour ahead:

- energy consumption at the same hour one day before
- energy consumption at the same hour 7 days before
- the hour of the day, which allows us to distinguish between different periods of the day (i.e. night, morning, afternoon, evening)
- the day of the week, which allows us to distinguish between weekdays and weekend days
- whether the day is a holiday or not
- the month of the year, which allows us to distinguish between the seasons of the year
- the temperature
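
The lag and calendar features in the list above can be sketched on a toy series. The data is made up so that the value of each row encodes its day number, which makes the effect of `shift(..., freq='D')` easy to read: the timestamps are moved forward, so each row is paired with the value observed 1 or 7 days earlier.

```python
import pandas as pd

# Toy hourly consumption over 10 days: the value encodes the day number (illustrative).
index = pd.date_range("2008-01-01", periods=24 * 10, freq="60min")
toy = pd.DataFrame({"consumption": [float(ts.day) for ts in index]}, index=index)

# Shift the timestamps forward so each row sees the value from 1 or 7 days earlier.
toy["Pastday"] = toy["consumption"].shift(1, freq="D")
toy["Pastweek"] = toy["consumption"].shift(7, freq="D")

# Calendar features derived from the DatetimeIndex.
toy["Hour"] = toy.index.hour
toy["Weekday"] = toy.index.dayofweek  # Monday=0, Sunday=6

# Rows without a full week of history contain NaN and are dropped.
toy = toy.dropna()
print(toy.head(3))
```

Only the rows from the eighth day onward survive the `dropna()`, since earlier rows have no value from 7 days before.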

In [None]:
new_feats = new_data[['Global_active_power']].rename(columns=({'Global_active_power':'consumption'}))
new_feats['Pastday'] = new_feats['consumption'].shift(1, freq='D')
new_feats['Pastweek'] = new_feats['consumption'].shift(7, freq='D')
new_feats['Hour'] = new_feats.index.hour
new_feats['Weekday'] = new_feats.index.dayofweek
new_feats['Month'] = new_feats.index.month
fr_holidays = holidays.France()  # build the holiday calendar once instead of once per row
new_feats['Holiday'] = [int(date in fr_holidays) for date in new_feats.index]
new_feats['Temperature'] = new_data['temperature']

new_feats = new_feats.dropna()

## Modeling 


This section allows the user to discover the most important factors for training a machine learning model. 
More specifically, the user will get insights on the influence of the *training strategy*, type of *machine learning model* and effect of data *normalization* on model performance.

To evaluate the performance of the models, the mean absolute error (MAE) is used, a metric commonly used in the literature for this purpose. This metric quantifies how close the model forecasts are to the real values. As the name suggests, the mean absolute error is the average of the absolute errors. The lower the MAE of a model, the better its performance.
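
As a minimal sketch, the MAE of $n$ forecasts $\hat{y}_i$ against actual values $y_i$ is $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. The helper below is written for illustration only (its name and the sample values are assumptions, and it is not the implementation used by the application):

```python
import numpy as np

def mean_absolute_error(actual, forecast):
    """Average of the absolute differences between actual and forecast values."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(actual - forecast)))

# Absolute errors are 10, 0 and 20, so their average is 10.
mae = mean_absolute_error([100.0, 200.0, 300.0], [110.0, 200.0, 280.0])
print(mae)
```

The MAE is expressed in the same unit as the target, here Wh, which makes it straightforward to interpret.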

This section ends with an interactive application where the user can test all options in order to find the one that leads the model to achieve the best performance. 

### Training strategies

The choice of training data influences model performance.
For this experiment, the following options are available:

1. Use **1 month before** the test month as training data. In this experiment, several models are trained, one set for the predictions of each individual month. Each model is trained on one month of data preceding the month we are making predictions for. A one-month gap is introduced between the training and the test month (to prevent the last day of the training set from also being included in the first day of the test set). For example, the models that will make the predictions for April 2008 will be trained on the data from February 2008.
2. Use **6 months before** the test month as training data. Similar to the above experiment, but with 6 months of training data. It still includes a 1-month gap.
3. Use **1 year before** the test month as training data. Similar to the above experiment, but with 1 year of training data, still including a 1-month gap.
4. Use **1 month the year before** as training data. Similar to strategy number 1, but with an 11-month gap. This way, the training data and test data are taken from the same month, one year apart.
5. Use **all months before** the test month as training data. For each test month, train a model using all the data prior to this (including a 1-month gap). This strategy simulates the scenario where the model is retrained as new data comes in. Since this requires training several models on potentially large training sets, it can take considerable computation time.
6. Use **train-test split**. Train on the first year of data and make predictions on the rest. This strategy is commonly used for training these kinds of models.

Notice that for the benchmark model, there is no training phase.
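
The window arithmetic behind the month-based strategies can be sketched with `pandas.Period`. The helper `training_window` and its parameters are hypothetical names introduced for this illustration; the application's own implementation may differ.

```python
import pandas as pd

def training_window(test_month, train_months, gap_months=1):
    """Return the first and last training month for a given test month.

    Hypothetical helper: `gap_months` full months are left unused between the
    last training month and the test month.
    """
    test = pd.Period(test_month, freq="M")
    last_train = test - gap_months - 1
    first_train = last_train - (train_months - 1)
    return first_train, last_train

# Strategy 1: one month of training data, one-month gap.
print(training_window("2008-04", train_months=1))
# Strategy 2: six months of training data, one-month gap.
print(training_window("2008-04", train_months=6))
# Strategy 4: the same month one year earlier, i.e. an 11-month gap.
print(training_window("2008-04", train_months=1, gap_months=11))
```

For the test month April 2008, strategy 1 yields February 2008 as the only training month, matching the example given in the list above.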

### Machine learning models

There is a plethora of models that can be used for forecasting. In this notebook, we use two models that are commonly used in the literature for the task at hand: **Random Forest Regression (RFR)** and **Support Vector Regression (SVR)**. Besides these models, we will also use a simple **benchmark model** against which we compare the performance of the other models. For the prediction of the energy consumption of a certain day, this benchmark model simply takes the energy consumption of the previous day at the same time.
We will use these models to forecast the energy consumption based on the past energy measurements (see training strategies).
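
The benchmark model can be sketched in a few lines. The example assumes a made-up hourly series whose level rises by exactly 1 Wh per day, so that the error of the "same time yesterday" forecast is easy to check by hand:

```python
import numpy as np
import pandas as pd

# Toy hourly consumption: a daily profile plus a level that rises by 1 Wh per day.
index = pd.date_range("2008-01-01", periods=24 * 5, freq="60min")
hours = index.hour.to_numpy()
days = np.arange(len(index)) // 24
consumption = pd.Series(100.0 + 10.0 * np.sin(2 * np.pi * hours / 24) + days,
                        index=index)

# Benchmark forecast: the consumption observed 24 hours earlier.
forecast = consumption.shift(24)

# Mean absolute error over the hours for which a forecast exists.
benchmark_mae = (consumption - forecast).abs().dropna().mean()
print(benchmark_mae)
```

Since the daily profile repeats exactly and only the level changes, every forecast in this toy series is off by the daily increase of 1 Wh.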

### Normalization

Before being used for training, data may need to be normalized, i.e. rescaled so that all input and output values are between 0 and 1. This is a requirement for the correct training of some machine learning models.
Discovering which model requires the data to be normalized is left as an exercise to the user.

In [None]:
from IPython.utils import io

mode = 'inline'  # or 'external'

if mode == 'inline':
    run_app(new_feats, mode='inline')
else:
    PORT = '8095'
    HOST = '0.0.0.0'  # in AWS replace here by public IP
    with io.capture_output() as captured:
        run_app(new_feats, mode='external', port=PORT, host=HOST)

    # HOST and PORT only exist in external mode, so report the URL here
    print(f'Dash app running on http://{HOST}:{PORT}')

The mean absolute error of each experiment that was run is reported in the table above.

### Insights from the experiments

These are the basic insights the user could have noticed when trying different models, parameters and training strategies.


#### Effect of the training strategy

For the SVR, we see that the standard train-test split has the best performance, although using all months and using a one-year window before the test month as training data both have a similar performance. For the RFR, on the other hand, we clearly see that using a one-year window before the test month as training data leads to better performance than the other training strategies. We note that for both this "one year window" strategy and the standard train-test split, the training data spans one year. The difference, however, is that for the former the training set changes, while for the latter it is fixed (using all the data from 2007).


A general trend that can be seen is that the performance increases as a larger training set is used. We therefore urge the user to look at the predictions made using the training strategy where all months before the test month are used as training data: we can expect the predictions for the later months to be better than those for the first months, since more training data was used. However, this is hard to judge by eye in the prediction plot directly.


#### Effect of the chosen model

We can see that, for most parameter choices, both the SVR and RFR can outperform the simple benchmark model we set up. 
The RFR outperforms the SVR for most (if not all) parameter choices and training strategies. 
This shows the importance of knowing which model is best suited for your problem and testing different ones.

#### Effect of the chosen parameters

The search for the optimal set of hyperparameters for your machine learning model is a non-trivial task: because there may be a strong interplay between the parameters, we cannot optimize them one by one. Here, we let the user try a pre-set number of choices, but more automated and smarter search algorithms exist to tackle this problem. We found that for the RFR, the higher the number of estimators, the better the results, but at a greater computational cost and with diminishing returns. We see that 50 estimators with the max depth left at 10 is a good choice. For the SVR, $C=10$ and $\gamma=0.01$ work well.
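
One example of such an automated search is an exhaustive grid search with time-ordered cross-validation, which evaluates parameter combinations jointly rather than one by one. The sketch below uses scikit-learn on synthetic data; the grid values, the data and the number of folds are assumptions chosen for illustration, and this is not the procedure used inside the application.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)

# Synthetic features and target (illustrative, not the Starter Kit data).
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)

# Candidate values for two random forest hyperparameters (hypothetical grid).
param_grid = {"n_estimators": [10, 50], "max_depth": [5, 10]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, hence the negated MAE
    cv=TimeSeriesSplit(n_splits=3),     # folds that never train on later samples
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

`TimeSeriesSplit` is used here instead of shuffled folds so that each validation fold lies after its training data, which mirrors the forecasting setting.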


#### Effect of normalization

Normalizing the data, i.e. rescaling it so that the input and output variables all have values within a similar range (in this case, between 0 and 1), is a common step when setting up machine learning models. This is because some models work much better with values of order 1. We see that normalization indeed greatly improves the predictions for the SVR, but has very little influence on the RFR. Indeed, algorithms such as random forests and decision trees are not influenced by the scale of the input variables.




## Conclusion

In this Starter Kit we demonstrated forecasting the demand for a particular resource at a particular point in the future, illustrated in the case of energy demand forecasting. To this end, we first analysed the business problem, preprocessed the data, and explored it visually and numerically. We then illustrated the use of different models (Random Forest Regression, Support Vector Regression and a benchmark) to perform the actual forecasting with different amounts of available historical data. From the analysis, it should be clear that achieving good prediction results depends on different factors (appropriate data preprocessing, the amount of available training data, algorithm parametrisation, etc.).

## Additional information

Copyright © 2022 Sirris

This Starter Kit was developed in the context of the EluciDATA project (http://www.elucidata.be). For more information, please contact info@elucidata.be.

 
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Notebook"), to deal in the Notebook without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Notebook, and to permit persons to whom the Notebook is provided to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies of the Notebook and/or copies of substantial portions of the Notebook.

THE NOTEBOOK IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL SIRRIS, THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, DIRECT OR INDIRECT, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE NOTEBOOK OR THE USE OR OTHER DEALINGS IN THE NOTEBOOK.