# Train regional groundwater model

This notebook trains a regional LSTM model on borehole data. It does not perform validation or inference.

**Tasks**
- Load in the data
- Split data into training, testing and validation with finetuning:
    - First 80% of data (`train_split`) from 50 boreholes for training
    - Remaining data from 50 boreholes for testing
    - First 20% of data from unseen boreholes for finetuning
    - Final 80% of data from unseen boreholes for validation
- Instantiate, fit and transform data with scalers:
    - Instantiate shared scalers for precip and PET, and a GWL scaler for each borehole
    - Fit StandardScalers using the training period only
    - Transform both the training and testing data using the fitted scalers
- Reshape dynamic and static data into sequences
- Create datasets and dataloaders
- Train model

Train the model on the training portion (set by `train_split`) of data from 50 boreholes. Set the input size of the embedding layer to 54, the total number of boreholes, so that the model can later be finetuned on the 4 unseen boreholes.
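The reason for sizing the embedding for all boreholes can be illustrated with a small sketch. It assumes that borehole ids run from 0 to 53 with the last 4 held out, as in the data loaded later; the variable names are illustrative only.

```python
import torch
import torch.nn as nn

# Assumed counts: 54 boreholes in total, the last 4 held out for finetuning
num_boreholes, num_validation, embedding_size = 54, 4, 16

# One embedding row per borehole, including those unseen during training
embedding = nn.Embedding(num_boreholes, embedding_size)

# Ids 50..53 are valid lookups even though training never visits them
unseen_ids = torch.arange(num_boreholes - num_validation, num_boreholes)
unseen_vectors = embedding(unseen_ids)
print(unseen_vectors.shape)  # torch.Size([4, 16])
```

Rows for the unseen boreholes receive no gradient during training, so they stay at their initial values until finetuning looks them up.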

In [70]:
import numpy as np
import pandas as pd
from pathlib import Path
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim.lr_scheduler import ReduceLROnPlateau
from sklearn.preprocessing import StandardScaler
import plotly.express as px
import plotly.graph_objects as go

In [71]:
# Define hyperparameters
# Number of validation boreholes (not used in training or testing)
num_validation = 4
# Sequence length also acts similarly to the warm up period in conventional models
seq_length = 730
# Portion of data for training, from which test proportion is inferred
train_split = 0.8
# Batch size is conventionally a power of 2
batch_size = 512
# Hidden size of LSTM
hidden_size = 20
# Number of stacked layers of LSTM
num_layers = 3
# Embedding size
embedding_size = 16
# Initial learning rate
lr = 0.0001
# Number of epochs
epochs = 20
# Learning rate scheduler patience
patience = 10


# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [72]:
# Function to load data
def load_data(file_path: str|Path, prefix: str) -> pd.DataFrame:
    """
    Load and concatenate CSV files from a directory into a DataFrame.

    This function reads all CSV files in the specified directory that match the provided prefix, 
    parses the "Date" column into datetime format with day first, and concatenates the data into a single DataFrame.

    Args:
        file_path (str|Path): The directory path where the CSV files are located.
        prefix (str): The prefix of the CSV files to be loaded.

    Returns:
        pd.DataFrame: The concatenated DataFrame of all matched CSV files.

    """
    return pd.concat(
        [
            pd.read_csv(p, parse_dates=["Date"], dayfirst=True)
            for p in Path(file_path).glob(f"{prefix}*.csv")
        ]
    )

In [73]:
# Loop through the data
# Add borehole id to one of the dataframes
# Count the number of boreholes
# Split the data into training and testing

In [74]:
# Load in observed groundwater and meteorological data
df_gwl = load_data("../data_processed/", "AquiMod_")
df_met = load_data("../data_processed/", "ukcp18_")
# Generate incremental borehole_id
df_gwl['bhid'] = (df_gwl['Borehole'] != df_gwl['Borehole'].shift()).cumsum() - 1
num_boreholes = df_gwl["bhid"].max() + 1
num_training = num_boreholes - num_validation
# Merge data
df_data = pd.merge(left=df_gwl, right=df_met, on=["Borehole", "Model", "Date"], how="inner").dropna().reset_index(drop=True)
# Save memory
del df_gwl
del df_met
df_data

Unnamed: 0,Borehole,Model,Date,Sim,Obs,bhid,precipwsnow,PET
0,Allington No 2,AquiMod,2006-09-21,72.5336,67.997000,0,0.156802,2.210
1,Allington No 2,AquiMod,2006-09-22,72.4902,67.933000,0,8.686279,2.210
2,Allington No 2,AquiMod,2006-09-23,72.4493,67.903000,0,12.740111,2.210
3,Allington No 2,AquiMod,2006-09-24,72.4086,67.859000,0,0.284288,2.210
4,Allington No 2,AquiMod,2006-09-25,72.3679,67.754000,0,0.002568,2.210
...,...,...,...,...,...,...,...,...
825373,Woodend Farm,AquiMod,2005-02-14,85.1202,86.970968,53,0.002877,0.675
825374,Woodend Farm,AquiMod,2005-02-15,85.1124,86.983226,53,0.000412,0.675
825375,Woodend Farm,AquiMod,2005-02-16,85.1042,86.995484,53,0.008233,0.675
825376,Woodend Farm,AquiMod,2005-02-17,85.0955,87.007742,53,0.470635,0.675
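The `shift`/`cumsum` idiom used to build `bhid` can be checked on a toy frame. This sketch assumes, like the real data, that the rows for each borehole are contiguous; the borehole names are made up.

```python
import pandas as pd

# Toy frame with made-up borehole names; rows for each borehole are contiguous
toy = pd.DataFrame({"Borehole": ["A", "A", "B", "B", "B", "C"]})

# A new id starts wherever the name differs from the previous row
changed = toy["Borehole"] != toy["Borehole"].shift()
toy["bhid"] = changed.cumsum() - 1

print(toy["bhid"].tolist())  # [0, 0, 1, 1, 1, 2]
```

If a borehole's rows were split across non-adjacent blocks, this idiom would assign it more than one id, so it relies on the files being concatenated one borehole at a time.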


In [75]:
# Split data into separate dataframes for training, testing and validation
# Fit GWL scalers
# Transform GWL data with scalers
# Create sequences
# I have just realised that the validation data needs to include the final seq_length values from the training data

df_train_test = df_data.query(f"bhid < {num_training}")
# df_validation = df_data.query(f"bhid >= {num_training}")
train_list = []
test_list = []
# Loop through training boreholes
for i in range(num_training):
    # Slice dataframe to borehole
    df = df_train_test.query("bhid == @i").copy()
    # Split data into training and testing
    train_size = int((len(df) - seq_length) * train_split)
    train_list.append(df.iloc[:train_size])
    test_list.append(df.iloc[train_size:])

df_train = pd.concat(train_list)
df_test = pd.concat(test_list)

StandardScaler subtracts the mean and divides by the standard deviation, so that the fitted data have zero mean and unit variance. To prevent data leakage from the testing or validation datasets into training, the scaler is fit using only the training dataset. The testing and validation data are scaled using the pre-fitted scalers.
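As a minimal illustration of fitting on the training period only, the sketch below scales made-up groundwater levels for a single borehole. The values and variable names are illustrative and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up groundwater levels for a single borehole
train = np.array([[10.0], [12.0], [14.0]])
test = np.array([[16.0]])

scaler = StandardScaler()
# The mean and standard deviation come from the training period only
train_scaled = scaler.fit_transform(train)
# The test period reuses those statistics; nothing is refit
test_scaled = scaler.transform(test)

# inverse_transform recovers the original units, e.g. for reporting predictions
restored = scaler.inverse_transform(test_scaled)
print(scaler.mean_, scaler.scale_)
```

Because the test value lies above the training range, its scaled value falls outside the range of the scaled training data, which is expected when statistics are not refit.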

In [76]:
# Initialise scalers
precip_scaler = StandardScaler()
pet_scaler = StandardScaler()
gwl_scalers = [StandardScaler() for _ in range(num_boreholes)]

# Fit and transform borehole-independent scalers
precip_train = precip_scaler.fit_transform(df_train["precipwsnow"].values.reshape(-1, 1))
precip_test = precip_scaler.transform(df_test["precipwsnow"].values.reshape(-1, 1))
pet_train = pet_scaler.fit_transform(df_train["PET"].values.reshape(-1, 1))
pet_test = pet_scaler.transform(df_test["PET"].values.reshape(-1, 1))
# Extract bhid data
bhid_train = df_train["bhid"].values.reshape(-1, 1)
bhid_test = df_test["bhid"].values.reshape(-1, 1)

# Fit and transform borehole scalers
gwl_train = []
gwl_test = []

for i in range(num_training):
    scaler = gwl_scalers[i]
    gwl_train.append(scaler.fit_transform(df_train[df_train["bhid"] == i]["Obs"].values.reshape(-1, 1)))
    gwl_test.append(scaler.transform(df_test[df_test["bhid"] == i]["Obs"].values.reshape(-1, 1)))

gwl_train = np.vstack(gwl_train)
gwl_test = np.vstack(gwl_test)

In [77]:
# Prepend the final (seq_length - 1) training timesteps to the testing data to generate continuous sequences
# (seq_length - 1) is used instead of (seq_length) because prepending the full seq_length
# would create one extra sequence whose target lies within the training data
# Ultimately, it is only one day and doesn't actually matter much
# Note: these arrays stack all boreholes, so each slice below takes the tail of the last training borehole only
precip_test = np.concatenate((precip_train[-(seq_length - 1):], precip_test), axis=0)
pet_test = np.concatenate((pet_train[-(seq_length - 1):], pet_test), axis=0)
bhid_test = np.concatenate((bhid_train[-(seq_length - 1):], bhid_test), axis=0)
gwl_test = np.concatenate((gwl_train[-(seq_length - 1):], gwl_test), axis=0)
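Because the arrays at this point stack all boreholes end to end, a slice such as `precip_train[-(seq_length - 1):]` returns rows from the last borehole only. One way to give every borehole a continuous warm-up is to prepend the tail of its own training period. The following is a minimal sketch of that idea on toy 1D data, not the notebook's implementation; the helper name `prepend_train_tail` is hypothetical.

```python
import numpy as np

def prepend_train_tail(train, test, bhid_train, bhid_test, seq_length):
    """Prepend each borehole's last (seq_length - 1) training rows to its test rows."""
    n_tail = seq_length - 1
    pieces = []
    for i in np.unique(bhid_test):
        rows = train[bhid_train == i]
        # Guard against boreholes with fewer than n_tail training rows
        tail = rows[max(len(rows) - n_tail, 0):]
        pieces.append(np.concatenate((tail, test[bhid_test == i])))
    return np.concatenate(pieces)

# Toy data: two boreholes, four training and two testing timesteps each
train = np.array([1, 2, 3, 4, 10, 20, 30, 40])
bhid_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
test = np.array([5, 6, 50, 60])
bhid_test = np.array([0, 0, 1, 1])

extended = prepend_train_tail(train, test, bhid_train, bhid_test, seq_length=3)
print(extended)  # [ 3  4  5  6 30 40 50 60]
```

The same helper would need to be applied to the borehole id array itself so that the masks used later still line up with the extended data.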

In [78]:
print(precip_train.shape)
print(precip_test.shape)
print(pet_train.shape)
print(pet_test.shape)
print(bhid_train.shape)
print(bhid_test.shape)
print(gwl_train.shape)
print(gwl_test.shape)

(575170, 1)
(181046, 1)
(575170, 1)
(181046, 1)
(575170, 1)
(181046, 1)
(575170, 1)
(181046, 1)


In [79]:
def create_sequences(data: np.ndarray, seq_length: int) -> np.ndarray:
    """
    Transforms 2D time-series data into an array of sequences of a specified length.

    Parameters:
    data (np.ndarray): A 2D numpy array where each row is a time step and each column is a feature.
    seq_length (int): The number of time steps to include in each output sequence.

    Returns:
    np.ndarray: A 3D numpy array of shape (num_samples - seq_length + 1, seq_length, num_features).
    """

    xs = []  # Initialise an empty list to store sequences

    # For each possible sequence in the data...
    for i in range(len(data) - seq_length + 1):
        # Extract a sequence of length `seq_length`
        x = data[i: (i + seq_length)]
        # Append the sequence to the list
        xs.append(x)

    # Convert the list of sequences into a 3D numpy array
    return np.array(xs)
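A quick check on toy data shows the shapes involved and how targets line up with sequences. The function is repeated in condensed form so that the snippet runs on its own; the toy values are made up.

```python
import numpy as np

def create_sequences(data, seq_length):
    # Condensed copy of the function above
    return np.array([data[i:i + seq_length] for i in range(len(data) - seq_length + 1)])

seq_length = 3
# Six timesteps of two features (values 0..11)
data = np.arange(12).reshape(6, 2)
seqs = create_sequences(data, seq_length)
print(seqs.shape)  # (4, 3, 2)

# The target for each sequence is the value at its final timestep,
# so targets are taken from index seq_length - 1 onwards
targets = np.arange(6)[seq_length - 1:]
print(len(targets) == len(seqs))  # True
```

Each sequence overlaps its neighbour by all but one timestep, which is why the sequence arrays are roughly `seq_length` times larger than the input.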

In [80]:
# create_sequences has to be called individually on each timeseries from each borehole
# Initialise lists to hold dynamic and static data for each borehole for train and test periods
dynamic_train_list = []
dynamic_test_list = []
static_train_list = []
static_test_list = []
gwl_train_list = []
gwl_test_list = []

# Loop through training boreholes in each of the data types and call create_sequences
for i in range(num_training):
    train_mask = (bhid_train == i)
    test_mask = (bhid_test == i)
    dynamic_train_list.append(create_sequences(np.column_stack((precip_train[train_mask], pet_train[train_mask])), seq_length))
    dynamic_test_list.append(create_sequences(np.column_stack((precip_test[test_mask], pet_test[test_mask])), seq_length))
    static_train_list.append(create_sequences(bhid_train[train_mask], seq_length))
    static_test_list.append(create_sequences(bhid_test[test_mask], seq_length))
    gwl_train_list.append(gwl_train[train_mask][seq_length - 1:].reshape(-1, 1))
    gwl_test_list.append(gwl_test[test_mask][seq_length - 1:].reshape(-1, 1))

dynamic_train_arr = np.concatenate(dynamic_train_list)
dynamic_test_arr = np.concatenate(dynamic_test_list)
static_train_arr = np.concatenate(static_train_list)
static_test_arr = np.concatenate(static_test_list)
gwl_train_arr = np.concatenate(gwl_train_list)
gwl_test_arr = np.concatenate(gwl_test_list)

# Save memory
del dynamic_train_list
del dynamic_test_list
del static_train_list
del static_test_list
del gwl_train_list
del gwl_test_list


In [81]:
print(dynamic_train_arr.shape)
print(dynamic_test_arr.shape)
print(static_train_arr.shape)
print(static_test_arr.shape)
print(gwl_train_arr.shape)
print(gwl_test_arr.shape)

(538720, 730, 2)
(144596, 730, 2)
(538720, 730)
(144596, 730)
(538720, 1)
(144596, 1)


In [82]:
dynamic_train_tensor = torch.from_numpy(dynamic_train_arr).float()
dynamic_test_tensor = torch.from_numpy(dynamic_test_arr).float()
static_train_tensor = torch.from_numpy(static_train_arr)
static_test_tensor = torch.from_numpy(static_test_arr)
gwl_train_tensor = torch.from_numpy(gwl_train_arr).float()
gwl_test_tensor = torch.from_numpy(gwl_test_arr).float()

In [83]:
print(dynamic_train_tensor.dtype)
print(dynamic_test_tensor.dtype)
print(static_train_tensor.dtype)
print(static_test_tensor.dtype)
print(gwl_train_tensor.dtype)
print(gwl_test_tensor.dtype)
print(dynamic_train_tensor.shape)
print(dynamic_test_tensor.shape)
print(static_train_tensor.shape)
print(static_test_tensor.shape)
print(gwl_train_tensor.shape)
print(gwl_test_tensor.shape)

torch.float32
torch.float32
torch.int32
torch.int32
torch.float32
torch.float32
torch.Size([538720, 730, 2])
torch.Size([144596, 730, 2])
torch.Size([538720, 730])
torch.Size([144596, 730])
torch.Size([538720, 1])
torch.Size([144596, 1])


In [84]:
# Define the dataset class
class MultiTimeSeriesDataset(Dataset):
    """PyTorch dataset for time-series sequence data with both dynamic and static features."""
    def __init__(
            self,
            dynamic: torch.Tensor,
            static: torch.Tensor,
            target: torch.Tensor
        ):
        self.dynamic = dynamic
        self.static = static
        self.target = target

    def __len__(self):
        return len(self.dynamic)

    def __getitem__(self, i):
        return self.dynamic[i], self.static[i], self.target[i]

In [85]:
# Instantiate datasets and dataloaders
train_dataset = MultiTimeSeriesDataset(dynamic_train_tensor, static_train_tensor, gwl_train_tensor)
test_dataset = MultiTimeSeriesDataset(dynamic_test_tensor, static_test_tensor, gwl_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [86]:
# Define the LSTM model
class LSTM(nn.Module):
    def __init__(
            self,
            dynamic_size: int,
            static_len: int,
            embedding_size: int,
            hidden_size: int,
            num_layers: int,
            output_size: int,
        ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(static_len, embedding_size, max_norm=1)
        self.lstm = nn.LSTM(dynamic_size + embedding_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, dynamic: torch.Tensor, static: torch.Tensor):
        # Pass borehole identifiers through embedding layer
        static_embeddings = self.embedding(static)
        # Concatenate borehole embeddings with the dynamic features
        x = torch.cat((dynamic, static_embeddings), dim=-1)
        # Initialise hidden state with zeros
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        # Initialise cell state with zeros
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        # Forward propagate LSTM using input data, x, and initial states, h0 and c0
        # x is a batch of input sequences of features
        # out contains the hidden state of the last layer at every timestep
        # hn and cn contain the final hidden and cell states of each layer
        out, (hn, cn) = self.lstm(x, (h0, c0))
        # The hidden state at the last timestep is passed to the fully connected layer
        out = self.fc(out[:, -1, :])
        return out
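To make the return values of `nn.LSTM` concrete, the sketch below runs a small layer on random input and inspects the shapes. The batch size and sequence length are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=18, hidden_size=20, num_layers=3, batch_first=True)
x = torch.randn(4, 10, 18)  # (batch, seq_length, features)

# When no initial states are given, PyTorch uses zeros
out, (h_n, c_n) = lstm(x)

print(out.shape)  # torch.Size([4, 10, 20]): last-layer hidden state at every timestep
print(h_n.shape)  # torch.Size([3, 4, 20]): final hidden state of each layer
print(c_n.shape)  # torch.Size([3, 4, 20]): final cell state of each layer

# The last timestep of `out` equals the final hidden state of the last layer
same = torch.allclose(out[:, -1, :], h_n[-1])
print(same)  # True
```

Since zero initial states are the default, passing explicit zero tensors for `h0` and `c0` gives the same result as omitting them.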

In [87]:
# Define the training function
def train_epoch(model: LSTM, criterion: nn.Module, optimiser: torch.optim.Optimizer):
    model.train()
    running_loss = 0
    for batch in train_loader:
        # batch is a list of three elements, the dynamic, static and the target
        dynamic_batch, static_batch, target_batch = (
            batch[0].to(device), batch[1].to(device), batch[2].to(device)
        )
        # Forward propagate the model and get outputs
        output = model(dynamic_batch, static_batch)
        # Calculate loss between outputs and the target
        loss = criterion(output, target_batch)
        # Add loss to the running loss
        running_loss += loss.item()
        # Reset the gradients
        optimiser.zero_grad()
        # This, rather unpythonically, computes the gradients for all model parameters
        loss.backward()
        # Take a step in the gradient direction
        optimiser.step()
    # Return the mean batch loss over the epoch
    return running_loss / len(train_loader)

In [88]:
# Define the testing function
def test_epoch(model: LSTM, criterion: nn.Module):
    model.eval()
    running_loss = 0
    for batch in test_loader:
        # batch is a list of three elements, the dynamic, static and the target
        dynamic_batch, static_batch, target_batch = (
            batch[0].to(device), batch[1].to(device), batch[2].to(device)
        )
        with torch.no_grad():
            # Forward propagate the model and get outputs
            output = model(dynamic_batch, static_batch)
            # Calculate loss between outputs and the target
            loss = criterion(output, target_batch)
            # Add loss to the running loss
            running_loss += loss.item()
    # Return the mean batch loss over the epoch
    return running_loss / len(test_loader)

In [89]:
# Initialise the model, loss function, and optimiser
model = LSTM(
    dynamic_size=dynamic_train_tensor.size()[-1],
    static_len=num_boreholes,
    embedding_size=embedding_size,
    hidden_size=hidden_size,
    num_layers=num_layers,
    output_size=gwl_train_tensor.size()[-1],
).to(device)
optimiser = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = ReduceLROnPlateau(optimiser, patience=patience, verbose=True)
criterion = nn.MSELoss()

def count_parameters(model: LSTM) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f"The model has {count_parameters(model)} parameters")

The model has 10805 parameters
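The reported total can be reproduced by hand from the layer sizes. The sketch below assumes PyTorch's standard LSTM parameterisation (four gates, each with input weights, recurrent weights and two bias vectors) and the hyperparameters defined earlier.

```python
# Hyperparameters from the notebook
num_boreholes, embedding_size = 54, 16
dynamic_size, hidden_size, num_layers, output_size = 2, 20, 3, 1

def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    # 4 gates x (input weights + recurrent weights + two bias vectors)
    return 4 * hidden_size * (input_size + hidden_size + 2)

# Embedding table: one row per borehole
embedding = num_boreholes * embedding_size
# First LSTM layer takes the dynamic features plus the embedding
first = lstm_layer_params(dynamic_size + embedding_size, hidden_size)
# Deeper layers take the hidden state of the layer below
deeper = (num_layers - 1) * lstm_layer_params(hidden_size, hidden_size)
# Fully connected layer: weights plus bias
fc = hidden_size * output_size + output_size

total = embedding + first + deeper + fc
print(embedding, first, deeper, fc)  # 864 3200 6720 21
print(total)  # 10805
```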


In [90]:
# Initialise lists for plotting
train_loss = []
test_loss = []

In [91]:
epochs = 50

In [94]:
for epoch in range(epochs):
    train_loss.append(train_epoch(model, criterion, optimiser))
    test_loss.append(test_epoch(model, criterion))
    scheduler.step(train_loss[-1])

    # Report the mean losses (MSE on scaled data); NSE on unscaled data is not yet calculated
    print(f"Epoch {epoch + 1} Train: {round(train_loss[-1], 6)}, Test: {round(test_loss[-1], 6)}")

Epoch 1 Train: 0.115338, Test: 0.220245
Epoch 2 Train: 0.113927, Test: 0.211858
Epoch 3 Train: 0.112436, Test: 0.212494
Epoch 4 Train: 0.112295, Test: 0.222453
Epoch 5 Train: 0.110473, Test: 0.219187
Epoch 6 Train: 0.109975, Test: 0.224082
Epoch 7 Train: 0.109729, Test: 0.217837
Epoch 8 Train: 0.117366, Test: 0.209738
Epoch 9 Train: 0.105498, Test: 0.222246
Epoch 10 Train: 0.105032, Test: 0.210989
Epoch 11 Train: 0.103705, Test: 0.210031
Epoch 12 Train: 0.102764, Test: 0.228216
Epoch 13 Train: 0.103802, Test: 0.203342
Epoch 14 Train: 0.101666, Test: 0.214546
Epoch 15 Train: 0.100087, Test: 0.208484
Epoch 16 Train: 0.099657, Test: 0.207407
Epoch 17 Train: 0.101911, Test: 0.211019
Epoch 18 Train: 0.098584, Test: 0.216181
Epoch 19 Train: 0.100961, Test: 0.231014
Epoch 20 Train: 0.09694, Test: 0.22758
Epoch 21 Train: 0.096702, Test: 0.208977
Epoch 22 Train: 0.09713, Test: 0.207029
Epoch 23 Train: 0.094406, Test: 0.211003
Epoch 24 Train: 0.094471, Test: 0.25275
Epoch 25 Train: 0.094881, Tes

seq_length|hidden_size|num_layers|embedding_size|best_train|best_test|comment
----------|-----------|----------|--------------|-----|----|-------
730|2|2|2|0.2|0.24|
730|5|2|2|0.16|0.21|
730|5|2|4|0.14|0.20|beginning to overfit
730|5|2|32|0.12|0.21|
730|5|5|4|0.12|0.20|
730|20|3|4|0.09|0.22|
730|20|3|16|0.09|0.20|edit here

In [95]:
px.line(pd.DataFrame(np.column_stack((train_loss, test_loss))))

In [25]:
torch.save(model, "temp_model.pt")
# model = torch.load("temp_model.pt")

To test NSE, I will need to perform inference on all training boreholes, for both the training and testing periods.
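For reference, the Nash-Sutcliffe efficiency (NSE) compares the squared error of the simulation with the variance of the observations about their mean. A minimal sketch in NumPy follows; the function name `nse` and the toy values are illustrative and are not part of this notebook.

```python
import numpy as np

def nse(obs, sim) -> float:
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 is no better than the mean of obs."""
    obs = np.asarray(obs, dtype=float).ravel()
    sim = np.asarray(sim, dtype=float).ravel()
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

obs = np.array([1.0, 2.0, 3.0, 4.0])
perfect = nse(obs, obs)
mean_only = nse(obs, np.full(4, obs.mean()))
print(perfect)    # 1.0
print(mean_only)  # 0.0
```

Computed for one borehole at a time, the value is the same whether both series are in scaled or original units, because the scaler applies the same affine transform to observations and predictions.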