## UK natural gas system price prediction project

The purpose of this project is to investigate how well machine learning can predict a commodity price, given just a few market fundamentals, and previous prices, as features. I have chosen the UK natural gas market because the key data on supply, demand and prices are freely available and up to date via https://data.nationalgas.com/

The goal is to predict the next day's daily System Average Price and System Marginal (Buy and Sell) Prices. The System Average Price is the volume-weighted average price of trades on the UK natural gas On-the-Day Commodity Market, i.e. gas for immediate delivery. The System Marginal Price (Buy) is related to the day's highest price, and is the price that suppliers must pay for the balance of gas used by their customers, if that is more than the amount they have supplied to the system (a "short imbalance"). The System Marginal Price (Sell) is related to the day's lowest price, and is the price that suppliers receive for any surplus gas that they have supplied to the system, which their customers have not used (a "long imbalance"). All prices are in pence per kilowatt-hour (p/kWh).
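
To make the "volume-weighted average" idea concrete, here is a minimal sketch. The trades are invented for illustration (they are not National Gas data) and `volume_weighted_average` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical on-the-day trades: (price in p/kWh, volume in kWh). Illustrative values only.
trades = [(2.50, 1000), (2.70, 3000), (2.40, 1000)]

def volume_weighted_average(trades):
    """Return sum(price * volume) / sum(volume)."""
    total_volume = sum(volume for _, volume in trades)
    return sum(price * volume for price, volume in trades) / total_volume

vwap = volume_weighted_average(trades)
simple_mean = sum(price for price, _ in trades) / len(trades)

# The large trade at 2.70 pulls the weighted average above the simple mean
print(round(vwap, 4), round(simple_mean, 4))
```

The weighted figure sits closer to the price at which most of the volume changed hands, which is why a single small trade at an extreme price moves it very little.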

The dataset is drawn from the five year history available at https://data.nationalgas.com/, focusing on the fields that make up the Daily Summary Report, with the data for the three target prices coming from 

Model performance will be measured based on Root Mean Squared Error (RMSE), as compared to the RMSE of a naive predictor that simply assumes that the next day's price will be the same as the current day's price. RMSE has been chosen as most suitable to price prediction because it penalises larger errors more harshly than smaller ones.
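
As a rough illustration of the baseline, the sketch below scores a "tomorrow equals today" forecast on a short, made-up price series. The numbers are assumptions chosen so that one large jump dominates the error.

```python
import numpy as np

# Made-up daily prices in p/kWh, for illustration only
prices = np.array([2.0, 2.1, 2.6, 2.5, 2.5])

actual_next_day = prices[1:]   # what we want to predict
naive_forecast = prices[:-1]   # naive predictor: tomorrow's price equals today's

errors = naive_forecast - actual_next_day
rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

# RMSE is larger than the mean absolute error because the single 0.5 jump is squared
print(round(rmse, 4), round(mae, 4))
```

Any model worth keeping should produce a lower RMSE than this baseline on the same days.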

### Initial setup steps

First we'll make sure the required libraries are available

In [25]:
import requests
import datetime
import time
import pandas as pd
import numpy as np
import joblib

from pathlib import Path

import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
#!pip install scikit-optimize # needed on Google Colab
from skopt import BayesSearchCV
from skopt.space import Real, Integer

from xgboost import XGBRegressor

import tensorflow as tf
from tensorflow.keras import layers, models


Initial steps if running on Google Colab, to download support files from GitHub and set the working directory

In [2]:
# 1. Clone the repo (for running on Colab)
!git clone https://github.com/MBWestcott/gas-forecast.git

# 2. Change into the repo directory
%cd /content/gas-forecast/notebooks



[WinError 3] The system cannot find the path specified: '/content/gas-forecast/notebooks'
d:\dev\gas-forecast\notebooks


Cloning into 'gas-forecast'...


### First download the raw data from the National Gas data portal

In [3]:
raw_data_folder = Path("../data/raw/")

def download_csv(url, output_file):
    """
    Downloads a CSV file from the given URL and saves it to the specified file.

    :param url: URL to download the CSV data from.
    :param output_file: Path to the local file where the CSV will be saved.
    """
    try:
        # Send a GET request to the URL
        response = requests.get(url)
        response.raise_for_status()  # Ensure we notice bad responses

        # Write the content (CSV data) to a file in binary mode
        with open(output_file, 'wb') as f:
            f.write(response.content)

        print(f"CSV file has been successfully downloaded and saved as '{output_file}'.")

    except requests.HTTPError as http_err:
        print(f"HTTP error occurred: {http_err}")
    except Exception as err:
        print(f"An error occurred: {err}")


def download_raw_data():
    pubIdsFile = Path("../PUB ids.txt")
    with open(pubIdsFile) as f:
        pubIds = f.read()
        pubIds = pubIds.replace("\n", ",").strip()

    earliest = datetime.date(2020,4,1) # Download data going back 5 years
    
    download_from = datetime.date.today().replace(day=1) # start first download on first day of current month
    download_to = datetime.date.today() # end first download on today's date
    while(download_from > earliest):

        # Format the date in yyyy-mm-dd format
        formatted_from = download_from.strftime("%Y-%m-%d")
        formatted_to = download_to.strftime("%Y-%m-%d")

        csv_url = f"https://data.nationalgas.com/api/find-gas-data-download?applicableFor=Y&dateFrom={formatted_from}&dateTo={formatted_to}&dateType=GASDAY&latestFlag=Y&ids={pubIds}&type=CSV"
        month_format = download_from.strftime("%Y-%m")
        output_filename = raw_data_folder /  f"{month_format}.csv"

        download_csv(csv_url, output_filename)
        time.sleep(2) # brief courtesy sleep
        download_to = download_from - datetime.timedelta(days=1) # next download should go up to the day before the previous download start date
        download_from = download_to.replace(day=1) # next download should start on the first day of the month

# Do the download if the raw data is not there already
csvCount = sum(1 for f in raw_data_folder.iterdir() if f.is_file() and f.suffix == '.csv')
if(csvCount < 60):
    download_raw_data()
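
The month-by-month windowing in `download_raw_data` can be checked without touching the network. The sketch below reproduces just the date arithmetic; `month_windows` is a hypothetical helper, and the "today" date is an assumption taken from the timestamps in the outputs further down.

```python
import datetime

def month_windows(today, earliest):
    """Yield (date_from, date_to) pairs, newest first, one per calendar month.

    The first window runs from the first of the current month to `today`.
    A month is skipped once its first day is on or before `earliest`.
    """
    date_from = today.replace(day=1)
    date_to = today
    while date_from > earliest:
        yield date_from, date_to
        date_to = date_from - datetime.timedelta(days=1)  # last day of the previous month
        date_from = date_to.replace(day=1)                # first day of the previous month

windows = list(month_windows(datetime.date(2025, 4, 25), datetime.date(2020, 4, 1)))
print(len(windows), windows[0], windows[-1])
```

Under these assumptions the loop produces 60 windows, the oldest covering May 2020, which is consistent with the `csvCount < 60` check above.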

### Load the raw data

Load the raw CSVs into a single dataframe, pivoted so that each column represents a feature.
Rename the Applicable For date field to Gas Day, and rename the columns that are going to be reused for ground truth and time series.
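
Here is a small sketch of the long-to-wide pivot on toy data. The values are invented, and the only column names assumed are the ones used in the raw CSVs ("Gas Day", "Applicable At", "Data Item", "Value"). One item is deliberately published twice for the same gas day to show that the later publication wins.

```python
import pandas as pd

# Toy long-format data; values are invented. "SAP" for 1 May was published twice.
long_df = pd.DataFrame({
    "Gas Day": pd.to_datetime(["2020-05-01", "2020-05-01", "2020-05-01",
                               "2020-05-02", "2020-05-02"]),
    "Applicable At": pd.to_datetime(["2020-05-01 10:00", "2020-05-01 18:00", "2020-05-01 10:00",
                                     "2020-05-02 10:00", "2020-05-02 10:00"]),
    "Data Item": ["SAP", "SAP", "Demand", "SAP", "Demand"],
    "Value": [0.47, 0.48, 250.0, 0.49, 245.0],
})

# Keep the most recently published value for each (gas day, item) pair
latest = (
    long_df.sort_values("Applicable At")
    .groupby(["Gas Day", "Data Item"], as_index=False)
    .last()
)

# One row per gas day, one column per data item
wide_df = latest.pivot(index="Gas Day", columns="Data Item", values="Value").reset_index()
print(wide_df)
```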

In [None]:
price_targets = ["SAP", "SMPBuy", "SMPSell"]

def pivot(df : pd.DataFrame, cols):

    #only keep the values we are interested in
    mask = df["Data Item"].isin(cols)

    df_filtered = df[mask]

    # if there are duplicates for the field and gas day, take the latest
    df_latest = (
        df_filtered
        .sort_values("Applicable At")
        .groupby(["Gas Day", "Data Item"])
        .last()  # this takes the row with the highest (i.e. latest) "Applicable At" per group
        .reset_index()
    )

    # pivot to get 1 row per gas day
    df_latest = df_latest.pivot(index="Gas Day", columns="Data Item", values="Value").reset_index()

    df_latest = df_latest.sort_values("Gas Day", ascending=True)

    return df_latest

def load_data():
    #Read raw CSVs
    pathlist = sorted(Path(raw_data_folder).rglob('*.csv'))  # sort by path so the yyyy-mm files are concatenated in date order
    file_count = len(pathlist)
    dfs = []
    files_done = 0
    for path_obj in pathlist:
        path = str(path_obj)

        df = pd.read_csv(path,
            parse_dates=["Applicable At", "Applicable For", "Generated Time"],
            dayfirst=True)

        df.rename(columns={'Applicable For': 'Gas Day'}, inplace=True)
        df['Gas Day'] = pd.to_datetime(df['Gas Day'], dayfirst=True)
        # pivot to 1 row per gas day, with features as columns

        daily_cols = df["Data Item"].unique()

        df_daily = pivot(df, daily_cols)
        dfs.append(df_daily)

        files_done += 1
        if files_done % 10 == 0:
            print(f"Processed {files_done} of {file_count} raw files")

    df = pd.concat(dfs)

    #Rename the columns that are going to be reused for ground truth and time series
    df.rename(columns={"SAP, Actual Day": 'SAP', "SMP Buy, Actual Day": 'SMPBuy', "SMP Sell, Actual Day": 'SMPSell'}, inplace=True)
    return df

df = load_data()
df.to_csv(Path("../data/processed/pivoted.csv"), index=False)
df.info()

Processed 10 of 60 raw files
Processed 20 of 60 raw files
Processed 30 of 60 raw files
Processed 40 of 60 raw files
Processed 50 of 60 raw files
Processed 60 of 60 raw files
<class 'pandas.core.frame.DataFrame'>
Index: 1819 entries, 0 to 22
Data columns (total 45 columns):
 #   Column                                                    Non-Null Count  Dtype         
---  ------                                                    --------------  -----         
 0   Gas Day                                                   1819 non-null   datetime64[ns]
 1   Aggregate LNG Importations - Daily Flow                   1816 non-null   float64       
 2   Beach Including Norway - Daily Flow                       1816 non-null   float64       
 3   Beach and IOG - Beach Delivery                            1815 non-null   float64       
 4   Beach and IOG - Daily Flow                                1816 non-null   float64       
 5   Composite Weather Variable - Actual                       1554 

### Preprocess data

Add the previous 5 days' prices as lag features, and 7- and 30-day rolling averages and standard deviations. Also add day of week features, and a cyclical coding of the day of year for seasonality.
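
The point of the cyclical encoding is that 31 December and 1 January end up next to each other, which a raw day-of-year number (365 vs 1) does not achieve. A minimal sketch, using the same `sin_DoY` / `cos_DoY` formula as the notebook on three arbitrarily chosen dates:

```python
import numpy as np
import pandas as pd

days = pd.DataFrame({"Gas Day": pd.to_datetime(["2021-12-31", "2022-01-01", "2022-07-01"])})
doy = days["Gas Day"].dt.dayofyear   # 365, 1, 182
angle = 2 * np.pi * doy / 365
days["sin_DoY"] = np.sin(angle)
days["cos_DoY"] = np.cos(angle)

def distance(a, b):
    """Euclidean distance between two rows in (sin, cos) space."""
    return np.hypot(days.loc[a, "sin_DoY"] - days.loc[b, "sin_DoY"],
                    days.loc[a, "cos_DoY"] - days.loc[b, "cos_DoY"])

print(distance(0, 1))  # New Year's Eve vs New Year's Day: small
print(distance(0, 2))  # New Year's Eve vs 1 July: close to 2, the largest possible
```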

In [None]:
def preprocess(df: pd.DataFrame, add_lags=True, add_labels=True):
    """Add lagged prices, rolling means and stds, day-of-week flags, cyclic encoding for seasonality, and next-day labels"""

    if add_lags:
        lag_days = 5
        for i in range(1, lag_days+1):
            for pt in price_targets:
                df[f"{pt} D-{i}"] = df[pt].shift(i)

        # add rolling averages and stds
        for pt in price_targets:
            for window in [7, 30]:
                df[f'{pt} D{window} roll mean'] = (
                    df[pt]
                    .shift(1)               # so today's feature doesn't include today's price
                    .rolling(window=window, min_periods=1)  
                    .mean()
                    )
                df[f'{pt} D{window} roll std'] = (
                    df[pt]
                    .shift(1)               # so today's feature doesn't include today's price
                    .rolling(window=window, min_periods=1)  
                    .std()
                )

    # add day of week
    df['Day of Week'] = df['Gas Day'].dt.weekday
    df['Is Weekday'] = (df['Gas Day'].dt.weekday < 5).astype(int)
    df['Next Day Is Weekday'] = ((df['Gas Day'] + pd.Timedelta(days=1)).dt.weekday < 5).astype(int)
    # cyclic encoding for seasonality
    df['Day of Year'] = df['Gas Day'].dt.dayofyear
    df['sin_DoY'] = np.sin(2 * np.pi * df['Day of Year'] / 365)
    df['cos_DoY'] = np.cos(2 * np.pi * df['Day of Year'] / 365)

    if add_labels:
        # Add labels for next day's actuals
        for pt in price_targets:
            df[f"Next Day {pt}"] = df[pt].shift(-1)

    return df

df = preprocess(df)
df.to_csv(Path("../data/processed/preprocessed.csv"), index=False)
df.head()

Data Item,Gas Day,Aggregate LNG Importations - Daily Flow,Beach Including Norway - Daily Flow,Beach and IOG - Beach Delivery,Beach and IOG - Daily Flow,Composite Weather Variable - Actual,Composite Weather Variable - Cold,Composite Weather Variable - Normal,Composite Weather Variable - Warm,Demand - Cold,...,SMPSell D30 roll std,Day of Week,Is Weekday,Next Day Is Weekday,Day of Year,sin_DoY,cos_DoY,Next Day SAP,Next Day SMPBuy,Next Day SMPSell
0,2020-05-01,66.1511,135.26344,201.41454,201.41454,10.5824,7.85,11.36,14.94,268.090483,...,,4,1,0,122,0.863142,-0.504961,0.477,0.5123,0.4417
1,2020-05-02,58.7863,131.66283,190.44913,190.44913,11.3089,7.99,11.47,15.02,244.938804,...,,5,0,0,123,0.854322,-0.519744,0.484,0.5193,0.4487
2,2020-05-03,56.55015,141.57363,198.12378,198.12378,11.6531,8.13,11.55,15.09,242.854588,...,0.003748,6,0,1,124,0.845249,-0.534373,0.472,0.5073,0.4367
3,2020-05-04,52.82721,152.87412,205.70133,205.70133,11.9252,8.26,11.65,15.14,242.098717,...,0.00617,0,1,1,125,0.835925,-0.548843,0.479,0.5143,0.4437
4,2020-05-05,62.21188,134.50813,196.72001,196.72001,11.4803,8.4,11.76,15.18,247.112422,...,0.005755,1,1,1,126,0.826354,-0.563151,0.5017,0.537,0.4664
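
The `NaN` rolling standard deviations in the first two rows above follow from the `shift(1)`: the first row has no earlier price at all, and the second has only one, which is not enough for a sample standard deviation. A small sketch with made-up prices shows the same pattern:

```python
import pandas as pd

prices = pd.Series([0.470, 0.477, 0.484, 0.472])  # made-up prices

# shift(1) means each row only sees prices from earlier days
roll_mean = prices.shift(1).rolling(window=7, min_periods=1).mean()
roll_std = prices.shift(1).rolling(window=7, min_periods=1).std()

print(pd.DataFrame({"price": prices, "roll mean": roll_mean, "roll std": roll_std}))
```

Rows with these `NaN` values are later removed by the `dropna` call in the cleaning step.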


### Clean missing values and outliers

Most of the missing values are missing "Composite Weather Variable - Actual" from 2020-21. These affect around 15% of the dataset. The best way to fill those in is with the Normal forecast, which should usually be the closest. Apart from that there are very few missing readings, so it is feasible to discard any remaining rows with missing data (done at the end, to avoid introducing errors into the lag features).

Also remove outliers where any of the prices was 0, or where a next-day price was 50% or more away from the current day's price.
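
A sketch of the outlier rule on a toy frame, for one price only. The numbers are invented so that each row exercises a different branch of the rule.

```python
import pandas as pd

toy = pd.DataFrame({
    "SAP":          [2.0, 2.1, 0.0, 2.0, 4.0],
    "Next Day SAP": [2.1, 0.0, 2.0, 4.0, 3.9],
})

# Rule 1: neither today's nor tomorrow's price may be zero
nonzero = (toy["SAP"] != 0) & (toy["Next Day SAP"] != 0)

# Rule 2: tomorrow's price must be less than 50% away from today's
relative_change = (toy["SAP"] - toy["Next Day SAP"]).abs() / toy["SAP"]

kept = toy[nonzero & (relative_change < 0.5)]
print(kept)  # rows 0 and 4 survive
```

Row 1 and row 2 are dropped for containing a zero price, and row 3 because the price doubles overnight.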

In [None]:
def clean(df: pd.DataFrame, remove_outliers=True):
    # fill missing CWV actuals with the normal forecast
    df['Composite Weather Variable - Actual'] = df['Composite Weather Variable - Actual'].fillna(df['Composite Weather Variable - Normal'])

    # There should be very few remaining rows that have any NaNs, so we can drop any that do
    df.dropna(inplace=True)

    # Can drop the composite weather forecasts
    df.drop(columns=["Composite Weather Variable - Normal", "Composite Weather Variable - Cold", "Composite Weather Variable - Warm"], inplace=True)

    if(remove_outliers):
        for pt in price_targets:    
            # remove outliers where any of the prices was 0
            print(df.shape)
            df = df[df[pt] != 0]
            print(df.shape)
            df = df[df[f"Next Day {pt}"] != 0]
            print(df.shape)
            #... and where the next day price is at least 50% away from the current day's price
            df = df[abs(df[pt] - df[f"Next Day {pt}"])/df[pt] < 0.5]
            print(df.shape)
    return df    

df = clean(df)
df.to_csv(Path("../data/processed/preprocessed_and_cleaned.csv"), index=False)
df.head()

(1802, 78)
(1802, 78)
(1802, 78)
(1789, 78)
(1789, 78)
(1789, 78)
(1789, 78)
(1784, 78)
(1784, 78)
(1783, 78)
(1783, 78)
(1762, 78)


Data Item,Gas Day,Aggregate LNG Importations - Daily Flow,Beach Including Norway - Daily Flow,Beach and IOG - Beach Delivery,Beach and IOG - Daily Flow,Composite Weather Variable - Actual,Demand - Cold,"Demand - Cold, (excluding interconnector and storage)",Demand - Warm,"Demand - Warm, (excluding interconnector and storage)",...,SMPSell D30 roll std,Day of Week,Is Weekday,Next Day Is Weekday,Day of Year,sin_DoY,cos_DoY,Next Day SAP,Next Day SMPBuy,Next Day SMPSell
5,2020-05-06,50.2373,142.20118,192.43848,192.43848,12.0645,245.39958,216.15049,145.432835,116.183744,...,0.005142,2,1,1,127,0.816538,-0.577292,0.4834,0.5187,0.4481
6,2020-05-07,53.5977,141.87295,195.47065,195.47065,13.4655,243.927941,214.341578,145.153439,115.567076,...,0.01118,3,1,1,128,0.80648,-0.591261,0.4756,0.5109,0.4403
7,2020-05-08,51.08822,135.19872,186.28694,186.28694,15.54,171.0,131.0,154.0,114.0,...,0.010249,4,1,0,129,0.796183,-0.605056,0.4722,0.5075,0.4369
8,2020-05-09,53.28634,127.04213,180.32847,180.32847,15.08,172.0,134.0,141.0,104.0,...,0.009697,5,0,0,130,0.78565,-0.618671,0.4615,0.4968,0.4262
9,2020-05-10,53.14522,127.29315,180.43837,180.43837,12.62,264.115666,226.223933,180.336933,142.445199,...,0.009489,6,0,1,131,0.774884,-0.632103,0.4569,0.4922,0.4216


### Split the data into training and test sets
Using two configurations:
- Split the data by date - earliest portion to train, then later portion to validate, and the last to test. Designed to test whether the model will generalise to the most recent period, despite having been trained on earlier periods
- Split the data randomly regardless of date

By default, discard the earliest data from training, which coincided with Covid restrictions, as experimentally this seems to improve performance.
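
To show what the date-based configuration guarantees, here is a sketch that splits ten toy days by position. `split_by_position` is a hypothetical stand-in that uses plain slicing; the notebook itself uses `train_test_split` with `shuffle=False`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gas Day": pd.date_range("2024-01-01", periods=10, freq="D"),
    "SAP": range(10),
})

def split_by_position(df, n_train=0.7, n_validate=0.2):
    """Chronological split by row position; whatever is left becomes the test set."""
    df = df.sort_values("Gas Day")
    train_end = int(len(df) * n_train)
    validate_end = train_end + int(len(df) * n_validate)
    return df.iloc[:train_end], df.iloc[train_end:validate_end], df.iloc[validate_end:]

train, validate, test = split_by_position(toy)

# Every validation day is later than every training day, and likewise for test
print(train["Gas Day"].max(), validate["Gas Day"].min(), test["Gas Day"].min())
```

With a random split this ordering is lost, so a validation row can sit between two training rows whose lag features already contain its price.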

In [None]:

def split_sequential(df, n_train = 0.7, n_validate = 0.2, n_test = 0.1, discard_before_date = '2021-04-01'):
    """Split based on date"""
    # Sort chronologically, then drop everything before the cut-off date
    
    df_sorted = df.sort_values("Gas Day", ascending=True)
    df_filtered = df_sorted[df_sorted['Gas Day'] >= discard_before_date]
    train_df, vt_df = train_test_split(df_filtered, test_size=n_validate + n_test, train_size=n_train, shuffle=False)
    validate_df, test_df = train_test_split(vt_df, test_size=n_test/(n_validate + n_test), train_size=n_validate/(n_validate + n_test), shuffle=False)
    
    return train_df, validate_df, test_df

def split_random(df, n_train = 0.7, n_validate = 0.2, n_test = 0.1, discard_before_date = '2021-04-01'):
    """Split based on number or fraction of rows"""
    
    df_filtered = df[df['Gas Day'] >= discard_before_date]
    # Split the DataFrame into training and testing sets
    train_df, vt_df = train_test_split(df_filtered, test_size=n_validate + n_test, train_size=n_train, shuffle=True)
    validate_df, test_df = train_test_split(vt_df, test_size=n_test/(n_validate + n_test), train_size=n_validate/(n_validate + n_test), shuffle=True)
    
    return train_df, validate_df, test_df

def get_X(df):
    ys = ["Next Day " + col for col in price_targets]
    df2 = df.drop(columns=ys)
    df2.drop(columns=["Gas Day"], inplace=True)
    
    return df2

#train, validate, test = split_sequential(df,0.7, 0.2, 0.1, '2023-09-01')
#X_train = get_X(train)
#X_test = get_X(test)


### Use Root Mean Squared Error as the measure of accuracy

This is appropriate to price forecasting because it penalises larger inaccuracies more heavily than smaller ones.

In [8]:
# Root mean squared error - penalises larger errors more than smaller ones
def get_rmse(actuals, predictions):
    rmse =  np.sqrt(np.mean((predictions - actuals)**2))
    return round(rmse, 4)

def print_model_stats(model, X):

    # 1. Coefficients and intercept
    if hasattr(model, "coef_"):
        #print("Coefficients:", model.coef_)      # array of shape (n_features,)
        cdf = pd.DataFrame(model.coef_, X.columns, columns=['Coefficients'])
        cdf = cdf.sort_values(by='Coefficients', ascending=False)
        print(cdf)
    if hasattr(model, "intercept_"):
        print("Intercept:", model.intercept_)    # scalar (or array if multi-output)

    # 2. Model parameters
    print("Parameters:", model.get_params())

    # 3. Linear algebra internals
    if hasattr(model, "rank_"):
        print("Rank of design matrix:", model.rank_)
    if hasattr(model, "singular_"):
        print("Singular values of X:", model.singular_)

### Set up a framework to train models, and compare their performance on the test dataset against a naive predictor

The naive predictor takes the current day's System Average Price and System Marginal Prices as the predictions for the next day

In [None]:
SPLIT_RANDOM = "Random"
SPLIT_SEQUENTIAL = "Sequential"

class Context:
    """Context for a model evaluation"""

    def __init__(self, model_type, test_set):
        self.model_type = model_type
        self.test_set = test_set

    def __repr__(self):
        return f"Context(model_type={self.model_type}, test_set={self.test_set})"
    
class Result:
    """Result of a model evaluation"""
    
    def __init__(self, context:Context, price_label, model_rmse, naive_rmse):
        self.context = context
        self.price_label = price_label
        self.model_rmse = model_rmse
        self.naive_rmse = naive_rmse
        self.timestamp = datetime.datetime.now()

    def __repr__(self):
        return f"GasPredictResult(context={self.context}, price_label={self.price_label}, model_rmse={self.model_rmse}, naive_rmse={self.naive_rmse}, timestamp={self.timestamp})"    

def get_y(df, col):
    return df["Next Day " + col]

def validate_model(model, X, y):
    y_pred = model.predict(X)
    rmse = get_rmse(y, y_pred)
    return rmse

def train_and_validate_model(model, df_train, df_validate, col):
    X_train = get_X(df_train)
    X_validate = get_X(df_validate)
    y_train = get_y(df_train, col)
    y_validate = get_y(df_validate, col)
    #scaler = StandardScaler()
    #X_train_scaled = scaler.fit_transform(X_train)
    #X_test_scaled = scaler.fit_transform(X_test)
    #X_train_scaled = X_train
    #X_validate_scaled = X_validate

    model.fit(X_train, y_train)

    rmse_train = validate_model(model, X_train, y_train)
    rmse_validate = validate_model(model, X_validate, y_validate)

    return model, rmse_train, rmse_validate

def train_validate_and_report_for_prices(model_factory, df_train: pd.DataFrame, df_validate: pd.DataFrame, context:Context, print_model_stats=True):
    results = []
    for pt in price_targets:
        # Instantiate model.
        model = model_factory()

        # Train and test it
        model, rmse_train, rmse_validate = train_and_validate_model(model, df_train, df_validate, pt)

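        # Print model details. The boolean parameter shadows the module-level
        # print_model_stats function, so look the function up in globals().
        if print_model_stats:
            X_train = get_X(df_train)
            globals()["print_model_stats"](model, X_train)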

        # Get naive prediction stats for comparison
        rmse_naive_train = naive_predictions(df_train, pt)
        rmse_naive_validate = naive_predictions(df_validate, pt)

        print_results(pt + " train", rmse_naive_train, rmse_train)
        print_results(pt + " validate", rmse_naive_validate, rmse_validate)

        testResult = Result(context, pt, rmse_validate, rmse_naive_validate)
        results.append(testResult)
    return results

def naive_predictions(df, priceTarget):
    naive_predictions = df[priceTarget]
    actuals = df[f"Next Day {priceTarget}"]
    return get_rmse(actuals, naive_predictions)

def print_results(case, rmse_naive, rmse_model):
    headline = "Worse" if rmse_naive <= rmse_model else "Better"
    print(f"{case} - {headline} - model {rmse_model} v naive {rmse_naive}")


all_results = []

### Try linear regression models

...to predict each of SAP (System Average Price), SMPBuy (System Marginal Price - Buy) and SMPSell (System Marginal Price - Sell). This generally performs worse than the naive predictor on the validation set, especially using a date-based split.

In [10]:
print ("Linear regression model:")
model_factory = lambda: LinearRegression()

print("Using random train-validate-test split...")
context = Context("Linear regression", SPLIT_RANDOM)
train, validate, test = split_random(df)
all_results += train_validate_and_report_for_prices(model_factory, train, validate, context, print_model_stats=False)

print("Using sequential train-validate-test split...")
context = Context("Linear regression", SPLIT_SEQUENTIAL)
train, validate, test = split_sequential(df)
all_results += train_validate_and_report_for_prices(model_factory, train, validate, context, print_model_stats=False)

print(all_results)

Linear regression model:
Using random train-validate-test split...
SAP train - Better - model 0.3678 v naive 0.4575
SAP validate - Better - model 0.4737 v naive 0.4908
SMPBuy train - Better - model 0.4424 v naive 0.5233
SMPBuy validate - Better - model 0.5469 v naive 0.5526
SMPSell train - Better - model 0.5243 v naive 0.6093
SMPSell validate - Worse - model 0.7367 v naive 0.633
Using sequential train-validate-test split...
SAP train - Better - model 0.4455 v naive 0.549
SAP validate - Worse - model 0.2533 v naive 0.0724
SMPBuy train - Better - model 0.5346 v naive 0.6244
SMPBuy validate - Worse - model 0.3255 v naive 0.0824
SMPSell train - Better - model 0.6468 v naive 0.7189
SMPSell validate - Worse - model 0.4198 v naive 0.0738
[GasPredictResult(context=Context(model_type=Linear regression, test_set=Random), price_label=SAP, model_rmse=0.4737, naive_rmse=0.4908, timestamp=2025-04-25 09:31:22.311062), GasPredictResult(context=Context(model_type=Linear regression, test_set=Random), pr

### Try a random forest model
Linear regression generally performed worse than the naive predictor in testing, especially using a date-based split, so let's try a random forest model. The hyperparameters for the best version were obtained by random search in the second code block below

In [11]:
print ("Random forest model:")
#RandomForestRegressor(n_estimators = 500, min_samples_split = 2, min_samples_leaf= 2, max_features = 0.9, max_depth = 20, ccp_alpha = 0.0) # best from random search
#RandomForestRegressor(n_estimators = 200, min_samples_split = 2, min_samples_leaf= 2, max_features = 0.7, max_depth = 20, ccp_alpha = 0.0) # best for SAP: SAP test - Better - model 0.77 v naive 0.8
model_factory = lambda: RandomForestRegressor(n_estimators = 500, min_samples_split = 2, min_samples_leaf= 2, max_features = 0.9, max_depth = 20, ccp_alpha = 0.0)

print("Using random train-validate-test split...")
context = Context("Random forest", SPLIT_RANDOM)
train, validate, test = split_random(df)
all_results += train_validate_and_report_for_prices(model_factory, train, validate, context, print_model_stats=False)

print("Using sequential train-validate-test split...")
context = Context("Random forest", SPLIT_SEQUENTIAL)
train, validate, test = split_sequential(df)
all_results += train_validate_and_report_for_prices(model_factory, train, validate, context, print_model_stats=False)


Random forest model:
Using random train-validate-test split...
SAP train - Better - model 0.1835 v naive 0.4222
SAP validate - Worse - model 0.5861 v naive 0.5534
SMPBuy train - Better - model 0.223 v naive 0.4987
SMPBuy validate - Worse - model 0.6367 v naive 0.5954
SMPSell train - Better - model 0.2741 v naive 0.586
SMPSell validate - Worse - model 0.7464 v naive 0.659
Using sequential train-validate-test split...
SAP train - Better - model 0.2155 v naive 0.549
SAP validate - Worse - model 0.0912 v naive 0.0724
SMPBuy train - Better - model 0.2582 v naive 0.6244
SMPBuy validate - Worse - model 0.1083 v naive 0.0824
SMPSell train - Better - model 0.3196 v naive 0.7189
SMPSell validate - Worse - model 0.1095 v naive 0.0738


### Next, try gradient boosting
In the run shown above, the random forest fits the training data far better than the naive predictor but does worse on the validation set for both splits, which suggests overfitting. Let's try tree-based gradient boosting.

In [12]:
print("Gradient Boosting model (XGBoost XGBRegressor):")

model_factory = lambda: XGBRegressor(
        n_estimators=200,
        max_depth=6,

        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.7,
        reg_alpha=0.0,
        reg_lambda=1.0,
        random_state=42
    )

print("Using random train-validate-test split...")
context = Context("Gradient boosting", SPLIT_RANDOM)
train, validate, test = split_random(df)
all_results += train_validate_and_report_for_prices(model_factory, train, validate, context, print_model_stats=False)

print("Using sequential train-validate-test split...")
context = Context("Gradient boosting", SPLIT_SEQUENTIAL)
train, validate, test = split_sequential(df)
all_results += train_validate_and_report_for_prices(model_factory, train, validate, context, print_model_stats=False)


Gradient Boosting model (XGBoost XGBRegressor):
Using random train-validate-test split...
SAP train - Better - model 0.0231 v naive 0.4558
SAP validate - Better - model 0.386 v naive 0.4248
SMPBuy train - Better - model 0.0254 v naive 0.5266
SMPBuy validate - Better - model 0.4474 v naive 0.4827
SMPSell train - Better - model 0.0321 v naive 0.573
SMPSell validate - Better - model 0.5738 v naive 0.6944
Using sequential train-validate-test split...
SAP train - Better - model 0.0275 v naive 0.549
SAP validate - Worse - model 0.1391 v naive 0.0724
SMPBuy train - Better - model 0.0322 v naive 0.6244
SMPBuy validate - Worse - model 0.1433 v naive 0.0824
SMPSell train - Better - model 0.0346 v naive 0.7189
SMPSell validate - Worse - model 0.1146 v naive 0.0738


### Try Recurrent Neural Network

In the run shown above, the gradient booster beat the naive predictor on the random split, but not when trained on the earlier data and validated on the later data. Let's try a neural net. For a simple time series, a Temporal Convolutional Net would be the obvious choice, but in this case we have a lot of market fundamentals to use as additional features, so an RNN seems the better fit.

(1) Because of how the inputs need to be shaped into sequences, we'll load the data again, skipping the manually-engineered lag features and next-day labels. We'll still fill in missing actual Composite Weather Variables with the normal forecast, but won't delete the few with outlying prices in case the RNN is sophisticated enough to make good use of them.

In [13]:
#Reload with minimal preprocessing and cleaning
df = load_data()
df = preprocess(df, add_lags=False, add_labels=False)
df = clean(df, remove_outliers=False)
df = df.sort_values('Gas Day').reset_index(drop=True) # Should already be sorted, but just in case
df = df[df['Gas Day'] >= '2021-04-01'] # discard the earliest data, as per the train/val/test split default
df.head()

Processed 10 of 60 raw files
Processed 20 of 60 raw files
Processed 30 of 60 raw files
Processed 40 of 60 raw files
Processed 50 of 60 raw files
Processed 60 of 60 raw files


Data Item,Gas Day,Aggregate LNG Importations - Daily Flow,Beach Including Norway - Daily Flow,Beach and IOG - Beach Delivery,Beach and IOG - Daily Flow,Composite Weather Variable - Actual,Demand - Cold,"Demand - Cold, (excluding interconnector and storage)",Demand - Warm,"Demand - Warm, (excluding interconnector and storage)",...,"Storage, Short Range, Maximum potential flow","Storage, Short Range, Stock Levels","System Entry Flows, National, Forecast","System Entry Flows, National, Physical",Day of Week,Is Weekday,Next Day Is Weekday,Day of Year,sin_DoY,cos_DoY
332,2021-04-01,56.64959,147.32989,202.42128,202.42128,8.04,305.628045,299.428045,192.178205,185.978205,...,0.0,0.0,216.5818,235.986349,3,1,1,91,0.999991,0.004304
333,2021-04-02,56.98982,166.66129,222.13281,222.13281,8.42962,294.933461,288.293461,184.176358,177.536358,...,0.0,0.0,226.122795,221.070202,4,1,0,92,0.999917,-0.01291
334,2021-04-03,59.86041,163.58733,221.85874,221.85874,7.78921,282.792862,275.712862,171.236588,164.156588,...,0.0,0.0,232.033606,249.997365,5,0,0,93,0.999546,-0.03012
335,2021-04-04,56.81811,162.65719,217.8417,217.8417,8.09275,280.273502,272.753502,168.677337,161.157337,...,0.0,0.0,219.171588,210.53911,6,0,1,94,0.99888,-0.047321
336,2021-04-05,58.17918,155.78494,212.49572,212.49572,6.20191,297.694091,289.734091,185.852197,177.892197,...,0.0,0.0,221.80961,212.792903,0,1,1,95,0.997917,-0.064508


(2) Make the sequences, covering 30 days of the salient features
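
Before running this on the real data, it may help to see the shapes on a tiny example. The sketch below uses a hypothetical `make_windows` helper with a window of 3 days instead of 30, and invented arrays in place of the dataframe columns.

```python
import numpy as np

def make_windows(values, targets, window_size):
    """Pair each run of `window_size` consecutive rows with the target of the following row."""
    X, y = [], []
    for i in range(len(values) - window_size):
        X.append(values[i : i + window_size])
        y.append(targets[i + window_size])
    return np.array(X), np.array(y)

values = np.arange(12).reshape(6, 2)   # 6 days, 2 features per day
targets = np.arange(100, 106)          # one target per day

X_toy, y_toy = make_windows(values, targets, window_size=3)
print(X_toy.shape, y_toy)  # (samples, window, features) and the matching targets
```

Six days with a window of three give three samples: days 0-2 predict day 3, days 1-3 predict day 4, and days 2-4 predict day 5.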

In [None]:

WINDOW_SIZE = 30

feature_cols = ['Composite Weather Variable - Actual', 'Demand Actual, NTS, D+1', 'Demand Forecast, NTS, hourly update', 'Interconnector - Daily Flow', 'Medium Storage - Actual Stock',
              'Medium Storage - Stock Level at Max Flow', 'Predicted Closing Linepack (PCLP1)', 
              'SAP', 'SMPBuy', 'SMPSell', 
              'Storage - Daily Flow','Storage - Delivery', 'Storage, Medium Range, Stock Levels', 'System Entry Flows, National, Forecast', 'System Entry Flows, National, Physical',
              'Day of Week','Is Weekday','Next Day Is Weekday','Day of Year']

def make_sequences(df, feature_cols):
    X, Y = [], []
    for i in range(len(df) - WINDOW_SIZE):
        X.append(df[feature_cols].iloc[i : i + WINDOW_SIZE].values)
        Y.append(df[price_targets].iloc[i + WINDOW_SIZE].values) # using SAP, SMPBuy and SMPSell as labels as before
    return np.array(X), np.array(Y)

X, y = make_sequences(df, feature_cols)

(3) Split sequentially into training, validation and test sets, so that the new gas days introduced at each stage are later than the days already seen. Then scale the training and validation sets using a scaler fitted on the training set only.
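
The scaling step has to flatten the 3D windows to 2D, compute per-feature statistics on the training windows only, and then apply those same statistics everywhere. A sketch of that idea in plain NumPy follows. The random arrays are stand-ins for the real windows, and the manual mean/std is an assumption that mirrors what `StandardScaler` computes (population standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)
train_windows = rng.normal(loc=5.0, scale=2.0, size=(50, 30, 4))  # (samples, window, features)
val_windows = rng.normal(loc=6.0, scale=2.0, size=(10, 30, 4))    # later period, shifted mean

n_features = train_windows.shape[2]
flat_train = train_windows.reshape(-1, n_features)

# Per-feature statistics come from the training windows only
mean = flat_train.mean(axis=0)
std = flat_train.std(axis=0)

def scale(windows):
    """Standardise each feature using the training statistics (broadcast over the last axis)."""
    return (windows - mean) / std

train_scaled = scale(train_windows)
val_scaled = scale(val_windows)
print(train_scaled.mean().round(3), val_scaled.mean().round(3))
```

The validation mean is deliberately not zero after scaling: that offset is real information about the later period, and refitting the scaler on the validation data would hide it.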

In [15]:
train_size = int(0.7 * len(X))
val_size   = int(0.2 * len(X))

X_train, y_train = X[:train_size], y[:train_size]
X_validate,   y_validate   = X[train_size:train_size+val_size], y[train_size:train_size+val_size]
X_test,  y_test  = X[train_size+val_size:], y[train_size+val_size:]
print(f"Training records: {X_train.shape[0]}")
print(f"Validation records: {X_validate.shape[0]}")
print(f"Test records: {X_test.shape[0]}")


n_feats = X_train.shape[2]
scaler = StandardScaler()
X_train_2d = X_train.reshape(-1, n_feats)
scaler.fit(X_train_2d)

def scale_split(X):
    X_2d = X.reshape(-1, n_feats)
    Xs = scaler.transform(X_2d)
    return Xs.reshape(-1, WINDOW_SIZE, n_feats)

#Take a copy of the unscaled validation data for comparison against the naive predictor
X_validate_unscaled = X_validate.copy()

X_train = scale_split(X_train)
X_validate   = scale_split(X_validate)


Training records: 1012
Validation records: 289
Test records: 145
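
The reshape, scale, reshape pattern in the cell above can be checked on random data. This sketch assumes plain NumPy standardisation (per-feature mean and population standard deviation), which is the same arithmetic `StandardScaler` applies with its default settings; the array sizes and names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
window, n_feats = 4, 3                       # hypothetical sizes
train = rng.normal(5.0, 2.0, size=(50, window, n_feats))
later = rng.normal(6.0, 2.0, size=(10, window, n_feats))

# Fit the statistics on the training windows only
flat = train.reshape(-1, n_feats)
mean, std = flat.mean(axis=0), flat.std(axis=0)

def scale(X):
    # Flatten to 2D, standardise per feature, then restore the 3D window shape
    return ((X.reshape(-1, n_feats) - mean) / std).reshape(-1, window, n_feats)

train_s, later_s = scale(train), scale(later)
print(train_s.shape, later_s.shape)
```

Only the training data ends up with exactly zero mean and unit variance per feature; the later data is shifted by the same amounts, so it never influences the scaling.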


(4) Train and validate the model. Unfortunately it doesn't fit into the framework used for the sklearn-type models, so it gets its own training and validation function.

In [None]:
def make_rnn():
    #model = models.Sequential([
        #layers.LSTM(128, return_sequences=True, input_shape=(WINDOW_SIZE, n_feats)),
        #layers.Dropout(0.02),
        #layers.LSTM(64),
        #layers.Dropout(0.02),
        #layers.Dense(32, activation='relu'),
        #layers.Dense(3, name='multi_output')   # predicts [SAP, SMPBuy, SMPSell]
    #])
    model = models.Sequential([
        # Declare the input shape explicitly (avoids the Keras warning about passing input_shape to a layer)
        layers.Input(shape=(WINDOW_SIZE, n_feats)),

        # Single, small LSTM: no return_sequences, so it only outputs the last hidden state
        layers.LSTM(32),

        # (Optional) small dense "bottleneck" to pick up any non-linear mix
        layers.Dense(16, activation='relu'),

        # Multi-output head predicts [SAP, SMPBuy, SMPSell]
        layers.Dense(3, name='multi_output')
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss='mse',
        metrics=[tf.keras.metrics.RootMeanSquaredError(name='rmse')]
    )

    model.summary()
    return model

def train_and_validate_rnn(model, context:Context):

    callbacks = [
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=10,
            restore_best_weights=True
        ),
        tf.keras.callbacks.ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.5,
            patience=5
        )
    ]

    history = model.fit(
        X_train, y_train,
        validation_data=(X_validate, y_validate),
        epochs=100,
        batch_size=32,
        callbacks=callbacks
    )

    # Evaluate the overall model RMSE (across all three targets) on the validation set
    eval_results = model.evaluate(X_validate, y_validate, return_dict=True)
    model_rmse = eval_results['rmse']
    print(f"Overall RMSE: {model_rmse:.4f}")

    results = []
    #Individual RMSE for each target
    y_pred = model.predict(X_validate)
    for i, name in enumerate(price_targets):
        #get model RMSE for each target
        rmse = get_rmse(y_validate[:,i], y_pred[:,i])
        print(f"{name} RMSE: {rmse:.4f}")

        # get naive predictor RMSE based on the unscaled inputs
        feat_idx = feature_cols.index(name)
        y_pred_naive = X_validate_unscaled[:, -1, feat_idx]
        y_true = y_validate[:, i]
        naive_rmse = get_rmse(y_true, y_pred_naive)
        print(f"{name} naive predictor RMSE: {naive_rmse:.4f}")

        # add stats
        testResult = Result(context, name, rmse, naive_rmse)

        # add to the running list of results
        results.append(testResult)

    return results

model = make_rnn()
context = Context("RNN", SPLIT_SEQUENTIAL)
all_results += train_and_validate_rnn(model, context)



Epoch 1/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 8s 81ms/step - loss: 33.2486 - rmse: 5.7651 - val_loss: 4.4083 - val_rmse: 2.0996 - learning_rate: 0.0010
Epoch 2/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - loss: 23.1846 - rmse: 4.8112 - val_loss: 1.5409 - val_rmse: 1.2413 - learning_rate: 0.0010
Epoch 3/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 25ms/step - loss: 8.7249 - rmse: 2.9276 - val_loss: 0.9626 - val_rmse: 0.9811 - learning_rate: 0.0010
Epoch 4/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 3.9230 - rmse: 1.9558 - val_loss: 0.5775 - val_rmse: 0.7599 - learning_rate: 0.0010
Epoch 5/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - loss: 1.9024 - rmse: 1.3775 - val_loss: 0.3993 - val_rmse: 0.6319 - learning_rate: 0.0010
Epoch 6/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - loss: 1.4181 - rmse:

### Finally, try reframing RNN with a residual model

The RNN still performs worse than the naive predictor on the validation set, which suggests that the network is not extracting anything from the additional fields that improves on the current day's prices.
As a final option, try a residual model that predicts the delta from the current day's price.
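
The idea behind the residual framing can be illustrated with plain NumPy, without any network. In this sketch the price series is invented and `rmse` is a hypothetical helper; it only shows that a zero delta reproduces the naive predictor exactly, so a residual model starts from the naive baseline and only has to learn the correction.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

prices = np.array([10.0, 10.5, 10.2, 11.0, 10.8])  # invented daily prices
today, tomorrow = prices[:-1], prices[1:]

naive_pred = today                       # naive: tomorrow equals today
true_delta = tomorrow - today            # what a residual model is trained to predict

zero_delta_pred = today + np.zeros_like(today)   # residual model that predicts no change
perfect_pred = today + true_delta                # residual model with the exact delta

print(rmse(tomorrow, naive_pred), rmse(tomorrow, zero_delta_pred), rmse(tomorrow, perfect_pred))
```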



In [None]:

def make_rnn_residual():
    # 1) Inputs
    inputs = layers.Input(shape=(WINDOW_SIZE, n_feats))

    # 2) Core LSTM
    x = layers.LSTM(32)(inputs)
    x = layers.Dense(16, activation='relu')(x)

    # 3) Delta prediction head (predict tomorrow’s Δ for each series)
    delta = layers.Dense(3, name='delta')(x)  
    #   outputs [ΔSAP, ΔSMPBuy, ΔSMPSell]
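    idxs_of_labels = [feature_cols.index(pt) for pt in price_targets]
    # 4) Grab today's values from the last timestep of the sequence
    #    This gives shape (batch, 3) corresponding to [SAP_t, SMPBuy_t, SMPSell_t].
    #    Note: the inputs have been standardised by scale_split, so these are scaled
    #    prices rather than p/kWh, and the delta head also has to absorb that difference.
    last_vals = layers.Lambda(lambda z: tf.gather(z[:, -1, :], idxs_of_labels, axis=1),
                            name='last_vals')(inputs)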

    # 5) Add skip-connection: tomorrow = today + predicted Δ
    outputs = layers.Add(name='residual_output')([last_vals, delta])

    # 6) Assemble and compile
    model = models.Model(inputs, outputs)
    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=[tf.keras.metrics.RootMeanSquaredError(name='rmse')]
    )

    model.summary()
    return model

model = make_rnn_residual()
context = Context("Residual RNN", SPLIT_SEQUENTIAL)
all_results += train_and_validate_rnn(model, context)




Epoch 1/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 5s 31ms/step - loss: 33.4401 - rmse: 5.7796 - val_loss: 12.2580 - val_rmse: 3.5011 - learning_rate: 0.0010
Epoch 2/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - loss: 27.9117 - rmse: 5.2822 - val_loss: 10.6101 - val_rmse: 3.2573 - learning_rate: 0.0010
Epoch 3/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - loss: 26.0676 - rmse: 5.1006 - val_loss: 5.3083 - val_rmse: 2.3040 - learning_rate: 0.0010
Epoch 4/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - loss: 14.3820 - rmse: 3.7763 - val_loss: 1.0811 - val_rmse: 1.0397 - learning_rate: 0.0010
Epoch 5/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - loss: 4.4569 - rmse: 2.1105 - val_loss: 0.9065 - val_rmse: 0.9521 - learning_rate: 0.0010
Epoch 6/100
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - loss: 2.9813 - r

### Pick the best performing model type
No model outperformed the naive predictor, so I'll have to choose the least bad one to tune, based on the gathered results.


In [18]:
# Filter results to only include the "Latest" data (sequential) split
filtered_results = [r for r in all_results if r.context.test_set == SPLIT_SEQUENTIAL]

# Sort by the improvement over the naive predictor (naive_rmse - model_rmse), best first
sorted_results = sorted(filtered_results, key=lambda r: r.naive_rmse - r.model_rmse, reverse=True)

for r in sorted_results:
    difference = r.naive_rmse - r.model_rmse
    print(f"{r.context.model_type} ({r.price_label}): difference over naive RMSE = {difference:.4f}")

Random forest (SAP): difference over naive RMSE = -0.0188
Random forest (SMPBuy): difference over naive RMSE = -0.0259
Random forest (SMPSell): difference over naive RMSE = -0.0357
Gradient boosting (SMPSell): difference over naive RMSE = -0.0408
Gradient boosting (SMPBuy): difference over naive RMSE = -0.0609
Gradient boosting (SAP): difference over naive RMSE = -0.0667
Residual RNN (SMPBuy): difference over naive RMSE = -0.0672
Residual RNN (SAP): difference over naive RMSE = -0.0712
Residual RNN (SMPSell): difference over naive RMSE = -0.0909
Linear regression (SAP): difference over naive RMSE = -0.1809
RNN (SMPSell): difference over naive RMSE = -0.1959
RNN (SMPBuy): difference over naive RMSE = -0.2373
Linear regression (SMPBuy): difference over naive RMSE = -0.2431
RNN (SAP): difference over naive RMSE = -0.2544
Linear regression (SMPSell): difference over naive RMSE = -0.3460
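
One way to make the ranking explicit is to average the per-target differences for each model type. The sketch below copies the figures printed above into a hypothetical `differences` dictionary; it illustrates the comparison and is not part of the original pipeline.

```python
from statistics import mean

# Differences (naive RMSE - model RMSE) copied from the output above,
# listed in the order SAP, SMPBuy, SMPSell
differences = {
    "Random forest":     [-0.0188, -0.0259, -0.0357],
    "Gradient boosting": [-0.0667, -0.0609, -0.0408],
    "Residual RNN":      [-0.0712, -0.0672, -0.0909],
    "Linear regression": [-0.1809, -0.2431, -0.3460],
    "RNN":               [-0.2544, -0.2373, -0.1959],
}

# Rank model types by their mean difference, closest to the naive predictor first
ranking = sorted(differences, key=lambda m: mean(differences[m]), reverse=True)
for model_type in ranking:
    print(f"{model_type}: mean difference = {mean(differences[model_type]):.4f}")
```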


Random forest came out best, so I'll try to improve on its hyperparameters.

In [None]:
# Set up a search framework so that both Bayesian and random search optimization can be tried
def search_hyperparams(search, df_train, df_validate, priceTarget):
    X_train = get_X(df_train)
    y_train = get_y(df_train, priceTarget)
    
    # Run the hyperparameter search
    start = time.perf_counter()
    search.fit(X_train, y_train)
    end = time.perf_counter()
    
    elapsed_minutes = (end - start) / 60
    print(f"Search took {elapsed_minutes:.4f} minutes")

    print("Best hyperparameters:", search.best_params_)
    print("Best CV RMSE on train set: {:.4f}".format(-search.best_score_))

    # Get the best model and evaluate it on the validation set        
    
    X_validate = get_X(df_validate)
    y_validate = get_y(df_validate, priceTarget)

    best_model = search.best_estimator_

    rmse_validate = validate_model(best_model, X_validate, y_validate)
    
    # Get naive predictor RMSE for comparison
    rmse_naive_validate = naive_predictions(df_validate, priceTarget)
    
    print_results(priceTarget + " validate", rmse_naive_validate, rmse_validate)
    return best_model, search.best_params_, rmse_validate, rmse_naive_validate
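
Because the data form a time series, the cross-validation folds used inside the search matter. With an integer `cv`, scikit-learn uses unshuffled K-fold splits for regressors, so some folds train on later days and validate on earlier ones. A forward-chaining scheme avoids that; scikit-learn provides one as `TimeSeriesSplit`, which can be passed as the `cv` argument of `RandomizedSearchCV`. The sketch below hand-rolls a similar idea in plain Python so that the fold boundaries are visible; `forward_chaining_splits` is a hypothetical helper, not something this notebook defines.

```python
def forward_chaining_splits(n_samples, n_splits):
    """Yield (train_indices, validation_indices) where validation always follows training."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The final fold absorbs any leftover samples
        val_end = n_samples if k == n_splits else fold * (k + 1)
        yield list(range(train_end)), list(range(train_end, val_end))

splits = list(forward_chaining_splits(n_samples=12, n_splits=3))
for train_idx, val_idx in splits:
    print(f"train 0..{train_idx[-1]}, validate {val_idx[0]}..{val_idx[-1]}")
```

Every validation fold lies strictly after its training data, which mirrors the sequential train, validate and test split used elsewhere in this notebook.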

In [None]:
# reload the data for the sklearn-style framework
df = load_data()
df = preprocess(df)
df = clean(df)
train, validate, test = split_sequential(df)
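models_dir = Path('..') / 'models'
models_dir.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists before saving models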

# Start with a wide range of candidates
random_search_grid = {
    'n_estimators':     [100, 200, 500],
    'max_depth':        [None, 10, 20],
    'min_samples_split':[2,5],
    'min_samples_leaf': [1,2],
    'max_features':     [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 1.0],
    'ccp_alpha':        [0.0, 0.001]
}
hyperparam_results = {}
for pt in price_targets:
    rf = RandomForestRegressor(
            random_state=42,
            n_jobs=-1,  
            oob_score=True
        )
    # Try random search
    rand_search = RandomizedSearchCV(
        estimator=rf,
        param_distributions=random_search_grid,             
        n_iter=50,                                  
        cv=5,
        scoring='neg_root_mean_squared_error',      
        n_jobs=-1,
        verbose=3,
        random_state=42
    )

    print(f"{pt}: Random search")
    best_model, best_params, rmse_validate, rmse_naive_validate = search_hyperparams(rand_search, train, validate, pt)
    hyperparam_results[pt + " Random"]= {
        'model': best_model,
        'params': best_params,
        'model_rmse': rmse_validate,
        'naive_rmse': rmse_naive_validate
    }
    # Persist the best model found by the random search
    # (best_model, not the leftover RNN stored in `model`)
    file_path = models_dir / f"{pt}_random_best_rf.joblib"
    joblib.dump(best_model, file_path)

    # Try Bayesian search
    bayes_search_spaces = {
        'n_estimators':      Integer(100, 500),
        'max_depth':         Integer(10, 50),
        'min_samples_split': Integer(2, 5),
        'min_samples_leaf':  Integer(1, 2),
        'max_features':      Real(0.1, 1.0),
        'ccp_alpha':         Real(0.0, 0.01)
    }

    rf = RandomForestRegressor(
            random_state=42,
            n_jobs=-1,
            oob_score=True
        )

    bayes_search = BayesSearchCV(
        estimator=rf,
        search_spaces=bayes_search_spaces,
        n_iter=50,
        cv=5,
        scoring='neg_root_mean_squared_error',
        n_jobs=-1,
        verbose=3,
        random_state=51
    )

    print(f"{pt}: Bayesian search")
    best_model, best_params, rmse_validate, rmse_naive_validate = search_hyperparams(bayes_search, train, validate, pt)
    hyperparam_results[pt + " Bayesian"]= {
        'model': best_model,
        'params': best_params,
        'model_rmse': rmse_validate,
        'naive_rmse': rmse_naive_validate
    }
    # Persist the best model found by the Bayesian search
    # (best_model, not the leftover RNN stored in `model`)
    file_path = models_dir / f"{pt}_bayesian_best_rf.joblib"
    joblib.dump(best_model, file_path)

# Print out details of the best estimators from each search
sorted_results = sorted(
    hyperparam_results.items(),
    key=lambda kv: kv[1]['naive_rmse'] - kv[1]['model_rmse'],
    reverse=True,
)
for run_name, result in sorted_results:
    difference = result['naive_rmse'] - result['model_rmse']
    print(f"{run_name}: difference over naive RMSE = {difference:.4f} with parameters: {result['params']}")

Processed 10 of 60 raw files
Processed 20 of 60 raw files
Processed 30 of 60 raw files
Processed 40 of 60 raw files
Processed 50 of 60 raw files
Processed 60 of 60 raw files
(1802, 78)
(1802, 78)
(1802, 78)
(1789, 78)
(1789, 78)
(1789, 78)
(1789, 78)
(1784, 78)
(1784, 78)
(1783, 78)
(1783, 78)
(1762, 78)
SAP: Random search
Fitting 5 folds for each of 50 candidates, totalling 250 fits


KeyboardInterrupt: 

Based on the output, the best-tuned estimator for each price target came from the random search, even though the Bayesian search was given a search space covering nearly all of the random search grid (everything except an unlimited `max_depth`) and took three times as long. Now to reload the best estimator for each price target and finally test it against the latest 10% of the data, which has been held out so far.

In [None]:
X_test = get_X(test)
model_rmses = []
naive_rmses = []
for pt in price_targets:
    file_path = models_dir / f"{pt}_random_best_rf.joblib"
    model = joblib.load(file_path)
    y_test = get_y(test, pt)
    model_rmse = validate_model(model, X_test, y_test)
    model_rmses.append(model_rmse)
    naive_rmse = naive_predictions(test, pt)
    naive_rmses.append(naive_rmse)
    

x = range(len(price_targets))
width = 0.35

fig, ax = plt.subplots()
ax.bar([i - width/2 for i in x], model_rmses, width, label='Model RMSE')
ax.bar([i + width/2 for i in x], naive_rmses, width, label='Naive predictor RMSE')

ax.set_xlabel('Price Target')
ax.set_ylabel('RMSE')
ax.set_title('Model vs. Naive RMSE by Price Target')
ax.set_xticks(x)
ax.set_xticklabels(price_targets)
ax.legend()

plt.tight_layout()
plt.show()