# Energy consumption prediction using LSTM/GRU in PyTorch

In this notebook, we tackle a time series forecasting task using GRU and LSTM models implemented with PyTorch. Our objective is to predict the next hour’s energy consumption based on historical usage data. We use the Hourly Energy Consumption dataset, which provides hourly power usage across various U.S. regions.

We compare the performance of GRU and LSTM by training both models on past data and evaluating them on a separate test set. The workflow includes feature engineering, data preprocessing, model definition, training, and evaluation. Common Python libraries are used throughout the process to support data analysis and modeling.

Source: [Kaggle](https://www.kaggle.com/robikscube/hourly-energy-consumption)

## GRU/LSTM cells

* Long Short-Term Memory networks (LSTMs) use gated memory cells to retain information over long sequences, which a vanilla RNN struggles to do because its gradients tend to vanish.

* The Gated Recurrent Unit (GRU) is the younger sibling of the more popular Long Short-Term Memory (LSTM) network, and is also a type of Recurrent Neural Network (RNN). Like the LSTM, the GRU can retain long-term dependencies in sequential data, which addresses the “short-term memory” problem that plagues vanilla RNNs.
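
To make the difference between the two cells concrete, the sketch below counts the trainable parameters of a single-layer `nn.GRU` and `nn.LSTM`. The sizes (5 input features, 256 hidden units) are illustrative assumptions, and `count_parameters` is a helper introduced only for this example. The GRU has three gate blocks per layer and the LSTM has four, so the counts should differ by a factor of 3/4.

```python
import torch.nn as nn


def count_parameters(module: nn.Module) -> int:
    """Return the total number of trainable parameters (hypothetical helper)."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Illustrative sizes, not necessarily the values used in config.yaml
input_dim, hidden_dim = 5, 256

gru = nn.GRU(input_dim, hidden_dim, num_layers=1, batch_first=True)
lstm = nn.LSTM(input_dim, hidden_dim, num_layers=1, batch_first=True)

gru_params = count_parameters(gru)
lstm_params = count_parameters(lstm)

# The GRU stacks 3 gate blocks per layer, the LSTM stacks 4
print(f"GRU parameters:  {gru_params}")
print(f"LSTM parameters: {lstm_params}")
print(f"Ratio GRU/LSTM:  {gru_params / lstm_params:.2f}")
```

Fewer parameters is one reason a GRU is often cheaper to train than an LSTM of the same width, although actual run times depend on the hardware and the backend.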

## The ML Pipeline

In [1]:
# Built-in
import os
import time
import gc
from pathlib import Path

# Third-party
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yaml
from tqdm.notebook import tqdm as tqdm_notebook
from sklearn.preprocessing import MinMaxScaler

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

In [2]:
# Local module
from fct import move_sliding_window
from fct import GRUNet, LSTMNet


In [3]:
print(torch.__version__)

2.7.1+cpu


In [4]:
# Import of parameters
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

In [5]:
# Access the configuration variables
label_col_index = config["label_col_index"]
inputs_cols_indices = config["inputs_cols_indices"]
window_size = config["window_size"]
num_files_for_dataset = config["num_files_for_dataset"]

In [6]:
# Paths
data_raw_path = Path.cwd().parent / "data" / "raw"
data_processed_path = Path.cwd().parent / "data" / "processed"
model_path = Path.cwd().parent / "models"

## Exploratory Data Analysis (EDA)

In [7]:
# Try reading a single file
pd.read_csv(os.path.join(data_raw_path, "DEOK_hourly.csv")).head()


|   | Datetime | DEOK_MW |
|---|---|---|
| 0 | 2012-12-31 01:00:00 | 2945.0 |
| 1 | 2012-12-31 02:00:00 | 2868.0 |
| 2 | 2012-12-31 03:00:00 | 2812.0 |
| 3 | 2012-12-31 04:00:00 | 2812.0 |
| 4 | 2012-12-31 05:00:00 | 2860.0 |


We have a total of **12** *.csv* files containing hourly energy trend data (note that *'est_hourly.parquet'* and *'pjm_hourly_est.csv'* are excluded). Our next step consists of reading these files and preprocessing the data in the following order:

- Extract and generalize the time features for each time step:
    - Hour of the day (0–23)
    - Day of the week (0–6, with Monday = 0, as returned by pandas `dt.dayofweek`)
    - Month (1–12)
    - Day of the year (1–366, since leap years have 366 days)

- Scale the data to values between 0 and 1:
    - Scaling helps algorithms perform better and converge faster by putting all features on a comparable scale.
    - Min-max scaling preserves the shape of the original distribution. It does not reduce the impact of outliers: the minimum and maximum define the range, so extreme values compress the rest of the data.

- Organize the data into sequences to serve as model inputs and prepare the corresponding labels:
    - The **sequence length** or **window_size** defines how many historical data points the model will use to predict the future.
    - The label corresponds to the data point immediately following the last point in the input sequence.

- Finally, split the inputs and labels into training and test sets for model development.
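
The first two steps can be sketched on a toy frame. This is a minimal illustration, assuming a frame with the same `Datetime` and `DEOK_MW` columns as the file previewed above; the `toy` frame and the `DEOK_MW_scaled` column are introduced only for this example, and the min-max formula is written out by hand rather than with `MinMaxScaler`.

```python
import pandas as pd

# Toy frame built from the first three rows of DEOK_hourly.csv shown above
toy = pd.DataFrame(
    {
        "Datetime": pd.to_datetime(
            ["2012-12-31 01:00:00", "2012-12-31 02:00:00", "2012-12-31 03:00:00"]
        ),
        "DEOK_MW": [2945.0, 2868.0, 2812.0],
    }
)

# Step 1: derive the time features from the timestamp
toy = toy.assign(
    hour=toy["Datetime"].dt.hour,            # 0-23
    dayofweek=toy["Datetime"].dt.dayofweek,  # 0-6, Monday = 0
    month=toy["Datetime"].dt.month,          # 1-12
    dayofyear=toy["Datetime"].dt.dayofyear,  # 1-366
)

# Step 2: min-max scaling, x' = (x - min) / (max - min)
mw = toy["DEOK_MW"]
toy["DEOK_MW_scaled"] = (mw - mw.min()) / (mw.max() - mw.min())

print(toy.drop(columns="Datetime"))
```

2012-12-31 was a Monday in a leap year, so `dayofweek` is 0 and `dayofyear` is 366 for every row of this toy frame.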


## Create training instances by moving sliding window
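
The notebook imports `move_sliding_window` from the local `fct` module, whose source is not shown here. The sketch below is one plausible way to build such windows, not the author's implementation: `sliding_window_sketch` is a hypothetical name, and its arguments only mirror the call made later in the notebook.

```python
import numpy as np


def sliding_window_sketch(data, window_size, inputs_cols_indices, label_col_index):
    """Hypothetical stand-in for fct.move_sliding_window.

    Each sample holds `window_size` consecutive rows; its label is the
    value of the label column in the row right after the window.
    """
    n_samples = len(data) - window_size
    inputs = np.zeros((n_samples, window_size, len(inputs_cols_indices)))
    labels = np.zeros((n_samples, 1))

    for i in range(window_size, len(data)):
        inputs[i - window_size] = data[i - window_size : i][:, inputs_cols_indices]
        labels[i - window_size] = data[i, label_col_index]

    return inputs, labels


# Toy data: 10 time steps, 2 features (rows are [0, 1], [2, 3], ..., [18, 19])
data = np.arange(20, dtype=np.float64).reshape(10, 2)
inputs, labels = sliding_window_sketch(
    data, window_size=3, inputs_cols_indices=[0, 1], label_col_index=0
)

print(inputs.shape, labels.shape)  # (7, 3, 2) (7, 1)
print(labels[0])                   # label of the first window: data[3, 0] = 6.0
```

With this construction a series of `N` rows yields `N - window_size` samples.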

## Integrate files to build the training set
To speed things up, I only use the first `num_files_for_dataset` .csv files (5 in the run shown below) to build the dataset. Feel free to run the notebook on the entire dataset if you have the time and computing capacity.

In [8]:
# The scaler objects will be stored in this dictionary so that our output test data from the model can be re-scaled during evaluation
label_scalers = {}

train_x = []
test_x = {}
test_y = {}

# Keep only the .csv files; sort them because os.listdir() returns entries in arbitrary order
processing_files = sorted(
    file for file in os.listdir(data_raw_path) if os.path.splitext(file)[1] == ".csv"
)

for file in tqdm_notebook(processing_files[:num_files_for_dataset]):
    print(f"Processing {file} ...")
    # Store csv file in a Pandas DataFrame
    df = pd.read_csv(os.path.join(data_raw_path, file), parse_dates=["Datetime"])

    # Processing the time data into suitable input formats
    df = df.assign(
        hour=df["Datetime"].dt.hour,
        dayofweek=df["Datetime"].dt.dayofweek,
        month=df["Datetime"].dt.month,
        dayofyear=df["Datetime"].dt.dayofyear,
    )
    df = df.sort_values("Datetime").drop("Datetime", axis=1)

    # Scaling the input data
    sc = MinMaxScaler()
    label_sc = MinMaxScaler()
    data = sc.fit_transform(df.values)
    

    # Fit a separate scaler on the label column (usage data) so that model
    # outputs can be re-scaled to actual values during evaluation
    label_sc.fit(df.iloc[:, label_col_index].values.reshape(-1, 1))
    label_scalers[file] = label_sc

    # Move the window
    inputs, labels = move_sliding_window(
        data,
        window_size,
        inputs_cols_indices=inputs_cols_indices,
        label_col_index=label_col_index,
    )
    
    # Reduce the precision of the data to float32 to save memory
    data = data.astype(np.float32)
    inputs = inputs.astype(np.float32)
    labels = labels.astype(np.float32)

    # Split each file into train/test portions (last 10% for testing), then
    # concatenate the training instances from all .csv files into a single array
    test_portion = int(0.1 * len(inputs))
    if len(train_x) == 0:  # first iteration
        train_x = inputs[:-test_portion]
        train_y = labels[:-test_portion]
    else:
        train_x = np.concatenate((train_x, inputs[:-test_portion]))
        train_y = np.concatenate((train_y, labels[:-test_portion]))
    test_x[file] = inputs[-test_portion:]
    test_y[file] = labels[-test_portion:]
    
    # Remove temporary variables
    del df, data, inputs, labels
    gc.collect()
    

  0%|          | 0/5 [00:00<?, ?it/s]

Processing AEP_hourly.csv ...


(121183, 90, 5) (121183, 1)
Processing COMED_hourly.csv ...
(66407, 90, 5) (66407, 1)
Processing DAYTON_hourly.csv ...
(121185, 90, 5) (121185, 1)
Processing DEOK_hourly.csv ...
(57649, 90, 5) (57649, 1)
Processing DOM_hourly.csv ...
(116099, 90, 5) (116099, 1)
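
The per-file `label_scalers` exist so that predictions, which live in the scaled 0–1 range, can be mapped back to megawatts. A minimal sketch of that round trip, assuming a handful of usage values taken from the DEOK preview and some made-up scaled predictions (`pretend_predictions` is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A few usage values (MW) standing in for one file's label column
usage_mw = np.array([[2945.0], [2868.0], [2812.0], [2860.0]])

label_sc = MinMaxScaler()
label_sc.fit(usage_mw)  # learns min = 2812, max = 2945

# Made-up model outputs in the scaled 0-1 range
pretend_predictions = np.array([[0.0], [0.5], [1.0]])

# Map the scaled values back to megawatts
restored = label_sc.inverse_transform(pretend_predictions)
print(restored.ravel())
```

Here 0 maps back to the minimum (2812 MW), 1 to the maximum (2945 MW), and 0.5 to the midpoint (2878.5 MW).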


## What have we made?

In [9]:
train_x.shape, test_x["DEOK_hourly.csv"].shape

((434274, 90, 5), (5764, 90, 5))

## PyTorch data loaders/generators

To speed up training, we process the data in mini-batches, so that the model updates its weights once per batch rather than once per sample. The `TensorDataset` and `DataLoader` classes are useful for splitting our data into batches and shuffling them.

In [10]:
batch_size = config["batch_size"]

train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))

# Drop the last incomplete batch
train_loader = DataLoader(
    train_data, shuffle=True, batch_size=batch_size, drop_last=True
)

In [11]:
print(
    f"Train Size: {train_x.shape}, Batch Size: {batch_size}, # of iterations per epoch: {int(train_x.shape[0]/batch_size)}"
)

Train Size: (434274, 90, 5), Batch Size: 500, # of iterations per epoch: 868
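
The iteration count follows from `drop_last=True`: the loader yields only full batches, so there are `floor(n_samples / batch_size)` of them (434274 // 500 = 868 here). A small sketch with toy tensors, whose sizes are arbitrary assumptions chosen for illustration:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy sizes: 23 samples do not divide evenly into batches of 5
n_samples, window, n_features, toy_batch_size = 23, 4, 5, 5

toy_x = torch.zeros(n_samples, window, n_features)
toy_y = torch.zeros(n_samples, 1)

toy_loader = DataLoader(
    TensorDataset(toy_x, toy_y),
    shuffle=True,
    batch_size=toy_batch_size,
    drop_last=True,  # discard the final, incomplete batch of 3 samples
)

n_batches = len(toy_loader)  # 23 // 5 = 4
first_x, first_y = next(iter(toy_loader))

print(n_batches, first_x.shape, first_y.shape)
```

Dropping the incomplete batch keeps every batch the same size, which matters later because the hidden state is initialized for a fixed `batch_size`.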


In [12]:
# release some memory
del train_x, train_y

We can also check whether a GPU is available, since training on one can be considerably faster. If you run this code on Google Colab (https://colab.research.google.com/) with a GPU runtime, the training time will be significantly reduced.

In [13]:
# torch.cuda.is_available() checks and returns a Boolean True if a GPU is available, else it'll return False
is_cuda = torch.cuda.is_available()

# If we have a GPU available, we'll set our device to GPU. We'll use this device variable later in our code.
if is_cuda:
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")

Next, we define the structure of the GRU and LSTM models. Both models have the same structure; the only differences are the **recurrent layer** (GRU/LSTM) and the initialization of the hidden state. The hidden state for the LSTM is a tuple containing both the **hidden state** and the **cell state**, whereas the **GRU only has a single hidden state**.
Please refer to the official PyTorch documentation to get familiar with the GRU and LSTM interfaces in PyTorch:

- https://pytorch.org/docs/stable/nn.html#torch.nn.GRU
- https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM
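
`GRUNet` and `LSTMNet` come from the local `fct` module, whose source is not shown in this notebook. The sketch below shows what such models could look like, assuming a recurrent layer followed by a linear layer on the last time step. The class names `GRUNetSketch` and `LSTMNetSketch` are hypothetical, and the real classes may differ (for example, by adding dropout or an activation).

```python
import torch
import torch.nn as nn


class GRUNetSketch(nn.Module):
    """Hypothetical stand-in for fct.GRUNet."""

    def __init__(self, input_dim, hidden_dim, output_dim, n_layers):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.rnn = nn.GRU(input_dim, hidden_dim, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, h):
        out, h = self.rnn(x, h)      # out: (batch, seq_len, hidden_dim)
        return self.fc(out[:, -1]), h  # predict from the last time step

    def init_hidden(self, batch_size):
        # The GRU carries a single hidden state tensor
        weight = next(self.parameters())
        return weight.new_zeros(self.n_layers, batch_size, self.hidden_dim)


class LSTMNetSketch(GRUNetSketch):
    """Hypothetical stand-in for fct.LSTMNet."""

    def __init__(self, input_dim, hidden_dim, output_dim, n_layers):
        super().__init__(input_dim, hidden_dim, output_dim, n_layers)
        self.rnn = nn.LSTM(input_dim, hidden_dim, n_layers, batch_first=True)

    def init_hidden(self, batch_size):
        # The LSTM carries a tuple: (hidden state, cell state)
        zeros = super().init_hidden(batch_size)
        return (zeros, zeros.clone())


x = torch.zeros(3, 90, 5)  # (batch, window_size, features)
gru_net = GRUNetSketch(5, 16, 1, 2)
lstm_net = LSTMNetSketch(5, 16, 1, 2)
gru_out, gru_h = gru_net(x, gru_net.init_hidden(3))
lstm_out, lstm_h = lstm_net(x, lstm_net.init_hidden(3))
print(gru_out.shape, gru_h.shape, lstm_out.shape, len(lstm_h))
```

Both sketches expose the same `forward(x, h)` and `init_hidden(batch_size)` interface, which is what the training function below relies on.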

In [14]:
def train(
    train_loader,
    learn_rate,
    hidden_dim=256,
    n_layers=2,
    n_epochs=5,
    model_type="GRU",
    print_every=100,
):

    input_dim = next(iter(train_loader))[0].shape[2]  # 5

    # Each batch is a pair (train_data, train_label); with a batch size of 500:
    # print(next(iter(train_loader))[0].shape, next(iter(train_loader))[1].shape) # torch.Size([500, 90, 5]) torch.Size([500, 1])

    output_dim = 1

    # Instantiating the models
    if model_type == "GRU":
        model = GRUNet(input_dim, hidden_dim, output_dim, n_layers)
    else:
        model = LSTMNet(input_dim, hidden_dim, output_dim, n_layers)
    model.to(device)

    # Defining loss function and optimizer
    criterion = nn.MSELoss()  # Mean Squared Error
    optimizer = torch.optim.Adam(model.parameters(), lr=learn_rate)

    model.train()
    print("Starting Training of {} model".format(model_type))
    epoch_times = []

    # Start training loop
    for epoch in range(1, n_epochs + 1):
        start_time = time.process_time()
        h = model.init_hidden(batch_size)
        avg_loss = 0.0
        counter = 0
        for x, label in train_loader:
            counter += 1
            if model_type == "GRU":
                h = h.data
            # Unpack and detach both h_0 and c_0
            elif model_type == "LSTM":
                h = tuple([e.data for e in h])

            # Set the gradients to zero before backpropagation because
            # PyTorch accumulates gradients across successive backward passes
            model.zero_grad()

            out, h = model(x.to(device).float(), h)
            loss = criterion(out, label.to(device).float())

            # Perform backpropagation
            loss.backward()
            optimizer.step()

            avg_loss += loss.item()
            if counter % print_every == 0:
                print(
                    f"Epoch {epoch} - Step: {counter}/{len(train_loader)} - Average Loss for Epoch: {avg_loss/counter}"
                )
        current_time = time.process_time()

        print(
            f"Epoch {epoch}/{n_epochs} Done, Total Loss: {avg_loss/len(train_loader)}"
        )

        print(f"Time Elapsed for Epoch: {current_time-start_time} seconds")

        epoch_times.append(current_time - start_time)

    print(f"Total Training Time: {sum(epoch_times)} seconds")
    return model
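
The `h = h.data` step in the loop above cuts the hidden state off from the previous batch's computation graph. Without it, the second call to `backward()` would try to backpropagate through a graph that has already been freed. The sketch below reproduces the pattern on a toy `nn.GRU`, using `detach()`, which is generally preferred over `.data` in current PyTorch; all sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

gru = nn.GRU(input_size=5, hidden_size=8, batch_first=True)
h = torch.zeros(1, 4, 8)  # (n_layers, batch, hidden_dim)

completed_steps = 0
for step in range(3):
    # Keep the values of the hidden state but drop its gradient history
    h = h.detach()

    x = torch.randn(4, 10, 5)  # (batch, seq_len, features)
    out, h = gru(x, h)
    loss = out.pow(2).mean()   # dummy loss for illustration

    gru.zero_grad()
    loss.backward()            # only backpropagates through the current batch
    completed_steps += 1

print(completed_steps, h.requires_grad, h.detach().requires_grad)
```

After the forward pass `h` is attached to the new graph again, which is why it has to be detached at the start of every iteration.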

## Training the GRU model

In [15]:
# seq_len = 90  # (timestamps)
# Training parameters
n_hidden = config["n_hidden"]
n_layers = config["n_layers"]
n_epochs = config["n_epochs"]
print_every = config["print_every"]
lr = config["lr"]

gru_model = train(
    train_loader,
    learn_rate=lr,
    hidden_dim=n_hidden,
    n_layers=n_layers,
    n_epochs=n_epochs,
    model_type="GRU",
    print_every=print_every,
)

Starting Training of GRU model
Epoch 1 - Step: 100/868 - Average Loss for Epoch: 0.025420364928431808
Epoch 1 - Step: 200/868 - Average Loss for Epoch: 0.013931295262300409
Epoch 1 - Step: 300/868 - Average Loss for Epoch: 0.00988527845707722
Epoch 1 - Step: 400/868 - Average Loss for Epoch: 0.007751718809595331
Epoch 1 - Step: 500/868 - Average Loss for Epoch: 0.006418049654108473
Epoch 1 - Step: 600/868 - Average Loss for Epoch: 0.0054986996486938245
Epoch 1 - Step: 700/868 - Average Loss for Epoch: 0.004825032595212438
Epoch 1 - Step: 800/868 - Average Loss for Epoch: 0.004310546857595909
Epoch 1/5 Done, Total Loss: 0.004023309500712563
Time Elapsed for Epoch: 1023.469607352 seconds
Epoch 2 - Step: 100/868 - Average Loss for Epoch: 0.0006136846003937535
Epoch 2 - Step: 200/868 - Average Loss for Epoch: 0.0005840548567357473
Epoch 2 - Step: 300/868 - Average Loss for Epoch: 0.000565221977303736
Epoch 2 - Step: 400/868 - Average Loss for Epoch: 0.0005450130865210667
Epoch 2 - Step: 50

## Save the GRU model

In [16]:
torch.save(gru_model.state_dict(), os.path.join(model_path, "gru_model.pt"))
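
Since only the `state_dict` is saved, reloading requires building a model with the same architecture first and then loading the weights into it. A minimal round-trip sketch, using a bare `nn.GRU` and a temporary directory as stand-ins for the notebook's model and `model_path`:

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.GRU(5, 8, batch_first=True)
model.eval()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_gru.pt")  # illustrative file name
    torch.save(model.state_dict(), path)

    # Re-create the same architecture, then load the saved weights into it
    restored = nn.GRU(5, 8, batch_first=True)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()  # switch to inference mode before evaluating

x = torch.randn(2, 10, 5)
with torch.no_grad():
    same = torch.allclose(model(x)[0], restored(x)[0])

print(same)
```

`map_location="cpu"` lets weights saved on a GPU machine be loaded on a CPU-only one.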

## Train and Save an LSTM model

In [17]:
lstm_model = train(
    train_loader,
    learn_rate=lr,
    hidden_dim=n_hidden,
    n_layers=n_layers,
    n_epochs=n_epochs,
    model_type="LSTM",
    print_every=print_every,
)

Starting Training of LSTM model
Epoch 1 - Step: 100/868 - Average Loss for Epoch: 0.040653518978506324
Epoch 1 - Step: 200/868 - Average Loss for Epoch: 0.02273187775281258
Epoch 1 - Step: 300/868 - Average Loss for Epoch: 0.01625208417031293
Epoch 1 - Step: 400/868 - Average Loss for Epoch: 0.01276947476726491
Epoch 1 - Step: 500/868 - Average Loss for Epoch: 0.010554868601961061
Epoch 1 - Step: 600/868 - Average Loss for Epoch: 0.009018121115902129
Epoch 1 - Step: 700/868 - Average Loss for Epoch: 0.007883682757466367
Epoch 1 - Step: 800/868 - Average Loss for Epoch: 0.0070186707939865305
Epoch 1/5 Done, Total Loss: 0.0065334923662624025
Time Elapsed for Epoch: 908.0744691480004 seconds
Epoch 2 - Step: 100/868 - Average Loss for Epoch: 0.0007986272289417684
Epoch 2 - Step: 200/868 - Average Loss for Epoch: 0.0007659769992460496
Epoch 2 - Step: 300/868 - Average Loss for Epoch: 0.000732149876615343
Epoch 2 - Step: 400/868 - Average Loss for Epoch: 0.0007022703820257448
Epoch 2 - Step:

In [18]:
torch.save(lstm_model.state_dict(), os.path.join(model_path, "lstm_model.pt"))

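Comparing the total training times of both models, the GRU finished its 5 training epochs faster than the LSTM. This is consistent with the GRU using fewer gates, and therefore fewer parameters, than an LSTM of the same size. Per-epoch timings can still vary: in the truncated logs shown above, the first epoch took about 1,023 seconds for the GRU and about 908 seconds for the LSTM. Note also that `time.process_time()` measures CPU time rather than wall-clock time.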