# Single Series modelling

*Masked Multi-Step Autoregressive Regression*


#### Contents

1. [Dataset](#1)
2. [Model](#2)
3. [Training](#3)
4. [Evaluation](#4)

---


### Initial design choices:
- *Autoregressive*: predict the next value based on the previous values.
- *Rain data*: include the complete rain series.
- *Masking*: mask the last 1 minute of the sensor data, which the model then has to predict.
- *Normalize*: normalize the data to the range 0.1-1 and fill missing values with 0.
- *Loss function*: exclude missing values from the loss.
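
The normalization and missing-value choices above could look like the following minimal sketch. It assumes scikit-learn's `MinMaxScaler` with `feature_range=(0.1, 1)`, which ignores NaNs when fitting and keeps them when transforming. The helper name `scale_and_mask` and the toy values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def scale_and_mask(values):
    """Scale each column to [0.1, 1], fill missing values with 0, return a validity mask.

    Hypothetical helper for illustration; not part of fault_management_uds.
    """
    scaler = MinMaxScaler(feature_range=(0.1, 1))
    scaled = scaler.fit_transform(values)   # NaNs are ignored in fit and kept in transform
    mask = ~np.isnan(scaled)                # True where a real observation exists
    filled = np.where(mask, scaled, 0.0)    # 0 lies outside [0.1, 1], so it marks "missing"
    return filled, mask, scaler


series = np.array([[2.0], [np.nan], [4.0], [6.0]])   # toy values, one sensor column
filled, mask, scaler = scale_and_mask(series)
```

The returned `mask` is what a masked loss would later use to skip the zero-filled entries.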

### Possible experiments

- Predict residuals/changes
- Multi-step-ahead scheduling
- Quantiles?
- Data augmentation
- Utilize Mike predictions for training or evaluation
- Alternate masking for multi-sensor and comparison?
- 1-minute masking or 5-minute masking, or alternate?
- Include both rainfall sensors?
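
As a rough illustration of the first two experiments, the sketch below builds windows for a multi-step-ahead horizon and optionally turns the targets into changes relative to the last observed value. The function `make_windows` and its parameters are assumptions made for this example, not existing project code.

```python
import numpy as np


def make_windows(series, sequence_length, horizon=1, residual=False):
    """Build (inputs, targets) for `horizon`-step-ahead prediction from a 1-D series.

    With residual=True the targets are changes relative to the last input value.
    """
    series = np.asarray(series, dtype=np.float32)
    n_windows = len(series) - sequence_length - horizon + 1
    if n_windows <= 0:
        raise ValueError("Series is too short for this sequence_length and horizon.")
    inputs = np.stack([series[i:i + sequence_length] for i in range(n_windows)])
    targets = np.stack([
        series[i + sequence_length:i + sequence_length + horizon]
        for i in range(n_windows)
    ])
    if residual:
        targets = targets - inputs[:, -1:]   # change since the last observed value
    return inputs, targets


x, y = make_windows(np.arange(10), sequence_length=4, horizon=2, residual=True)
```

With `horizon=1` and `residual=False` this reduces to the 1-step-ahead setup used in the rest of the notebook.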


---

### TODO:

- figures wrt masking

### Import

In [30]:
import os
import random

import pandas as pd
import numpy as np
from tqdm import tqdm


import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit



from fault_management_uds.data.HDF5_functions import print_tree, load_dataframe_from_HDF5
from fault_management_uds.data.process import remove_nans_from_start_end
from fault_management_uds.config import indicator_2_meta, bools_2_meta, error_indicators, natural_sensor_order
from fault_management_uds.data.load import import_external_metadata, import_metadata


from fault_management_uds.config import PROJ_ROOT
from fault_management_uds.config import DATA_DIR, RAW_DATA_DIR, INTERIM_DATA_DIR, PROCESSED_DATA_DIR, EXTERNAL_DATA_DIR
from fault_management_uds.config import MODELS_DIR, REPORTS_DIR, FIGURES_DIR, REFERENCE_DIR

# set random seed
seed = 42
np.random.seed(seed)
random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

In [2]:
data_file_path = PROCESSED_DATA_DIR / 'Bellinge.h5'
external_metadata = import_metadata(REFERENCE_DIR / 'external_metadata.csv')
metadata = import_metadata(REFERENCE_DIR / 'sensor_metadata.csv')

# Dataset

- Create a dataset class for the single series
- Split the dataset into train and test sets
- Normalize the dataset
- Create a dataloader for the train and test sets

In [36]:
class SensorDataset(Dataset):
    def __init__(self, data, sequence_length=50):


        if len(data) <= sequence_length:
            raise ValueError("Dataset size must be larger than sequence_length.")
        self.data = data
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.data) - self.sequence_length

    def __getitem__(self, idx):
        # data is expected to be a NumPy array of shape (n_timesteps, n_features),
        # so the first dimension is time (pass DataFrame.values, not the DataFrame itself)
        x = self.data[idx:idx + self.sequence_length]
        y = self.data[idx + self.sequence_length]
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)
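
The indexing in `__getitem__` assumes a NumPy array whose first dimension is time. A small check on a made-up array shows the resulting shapes:

```python
import numpy as np

sequence_length = 3
data = np.arange(12, dtype=np.float32).reshape(6, 2)   # toy array: 6 time steps, 2 features

idx = 1
x = data[idx:idx + sequence_length]   # rows idx .. idx+2, all features
y = data[idx + sequence_length]       # the single row that follows the window

n_samples = len(data) - sequence_length   # same count as SensorDataset.__len__
```

Here `x` has shape `(sequence_length, n_features)` and `y` has shape `(n_features,)`. With a pandas DataFrame the same expressions would behave differently: a slice selects rows, but an integer in square brackets is treated as a column label.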

In [None]:

def prepare_data(data, train_index, val_index, test_index, sequence_length=50):
    # normalizer: MinMaxScaler scales each column (feature) independently
    scaler = MinMaxScaler()
    # fit on the training rows only, and transform them exactly once
    train_data = scaler.fit_transform(data[train_index])
    val_data = scaler.transform(data[val_index])
    test_data = scaler.transform(data[test_index])

    # create the datasets
    train_dataset = SensorDataset(train_data, sequence_length)
    val_dataset = SensorDataset(val_data, sequence_length)
    test_dataset = SensorDataset(test_data, sequence_length)

    return train_dataset, val_dataset, test_dataset, scaler
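
To make the scaling behaviour concrete, here is a small sketch on invented numbers. It shows that `MinMaxScaler` learns one minimum and maximum per column, and that held-out rows transformed with the training statistics can fall outside the 0-1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two columns on very different scales (values are made up).
data = np.array([[0.0, 100.0],
                 [1.0, 200.0],
                 [2.0, 300.0],
                 [4.0, 500.0]])
train_index, test_index = [0, 1, 2], [3]

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(data[train_index])   # min/max learned per column, train rows only
test_scaled = scaler.transform(data[test_index])         # reuses the train statistics

restored = scaler.inverse_transform(train_scaled)        # back to the original units
```

In this example both test values map to 2.0, because they lie above the training maximum of their column.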

# Model

- Create a `LSTM` model

In [38]:
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.2):
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # LSTM Layer with dropout
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, dropout=dropout)

        # Fully connected output layer
        self.fc = nn.Linear(hidden_size, output_size)

        # Weight initialization
        self.init_weights()

    def forward(self, x):
        # Initialize hidden and cell states
        h_0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c_0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        # Forward propagate through LSTM
        out, _ = self.lstm(x, (h_0, c_0))

        # Fully connected layer on the last hidden state
        out = self.fc(out[:, -1, :])  # Use the last time step's output
        return out

    def init_weights(self):
        # Initialize LSTM weights and biases
        for name, param in self.lstm.named_parameters():
            if 'weight' in name:
                nn.init.xavier_uniform_(param)
            elif 'bias' in name:
                nn.init.zeros_(param)

        # Initialize FC layer
        nn.init.xavier_uniform_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)
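
A quick shape check can help confirm what the model expects. The sketch below uses a bare `nn.LSTM` with arbitrary sizes rather than the `LSTMModel` class itself; the point is that `input_size` is the number of features per time step, and that with `batch_first=True` the input has shape `(batch, sequence, features)`.

```python
import torch
import torch.nn as nn

batch_size, sequence_length, n_features, hidden_size = 4, 50, 1, 8

# input_size is the number of features per time step, not the window length
lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size, num_layers=2, batch_first=True)
fc = nn.Linear(hidden_size, n_features)

x = torch.randn(batch_size, sequence_length, n_features)
out, (h_n, c_n) = lstm(x)    # initial states default to zeros when omitted
y_hat = fc(out[:, -1, :])    # prediction from the last time step
```

For a unidirectional LSTM, `out[:, -1, :]` is the same tensor as the final hidden state of the top layer, `h_n[-1]`.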

In [39]:
class SensorLSTM(pl.LightningModule):
    def __init__(self, input_size, hidden_size, num_layers, output_size, learning_rate=0.001, dropout=0.2):
        super(SensorLSTM, self).__init__()
        self.save_hyperparameters()  # Save hyperparameters for easier model checkpointing
        self.model = LSTMModel(input_size, hidden_size, num_layers, output_size, dropout)
        self.criterion = nn.MSELoss()  # Mean Squared Error Loss
        self.learning_rate = learning_rate

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        self.log('train_loss', loss, prog_bar=True, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.criterion(y_hat, y)

        # Compute additional metrics (e.g., MAE)
        mae = nn.L1Loss()(y_hat, y)
        self.log('val_loss', loss, prog_bar=True, on_epoch=True)
        self.log('val_mae', mae, prog_bar=True, on_epoch=True)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        self.log('test_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.learning_rate)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode='min', factor=0.5, patience=3, verbose=True
        )
        return {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "val_loss"}
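
`nn.MSELoss` averages over every target, including any zero-filled placeholders. The design choice of masking missing values in the loss could be sketched as below. The function `masked_mse` is a hypothetical helper, and it assumes the dataset would also return a boolean mask per target, which `SensorDataset` does not do yet.

```python
import torch


def masked_mse(y_hat, y, mask):
    """Mean squared error over entries where mask is True (hypothetical helper)."""
    mask = mask.to(y_hat.dtype)
    squared_error = (y_hat - y) ** 2 * mask
    return squared_error.sum() / mask.sum().clamp(min=1.0)   # avoid division by zero


y = torch.tensor([[0.5], [0.0], [0.9]])           # 0.0 stands in for a missing target
mask = torch.tensor([[True], [False], [True]])    # False marks the missing entry
y_hat = torch.tensor([[0.4], [0.7], [0.9]])
loss = masked_mse(y_hat, y, mask)
```

Only the first and third entries contribute, so the large error on the missing target is ignored.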


# Training

- Train the model on the dataset
- Save and load the model

In [42]:

def handle_splits(data, dataset_args):
    splits = []
    # split by percentage
    if dataset_args['n_splits'] == 1:
        # split by percentage, idx ordered by train, val, test
        train_index = list(range(int(len(data) * dataset_args['train_split'])))
        val_index = list(range(int(len(data) * dataset_args['train_split']), int(len(data) * (dataset_args['train_split'] + dataset_args['val_split']))))
        test_index = list(range(int(len(data) * (dataset_args['train_split'] + dataset_args['val_split'])), len(data)))
        splits.append((train_index, val_index, test_index))

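    # time series split
    else:
        held_out_pct = 1 - dataset_args['train_split']
        # TimeSeriesSplit expects test_size as an integer number of samples, not a fraction;
        # n_splits * test_size must stay below len(data)
        test_size = int(len(data) * held_out_pct)
        tscv = TimeSeriesSplit(n_splits=dataset_args['n_splits'], test_size=test_size)
        # share of each held-out block used for validation, consistent with the single-split branch
        val_pct = dataset_args['val_split'] / held_out_pct
        for train_index, held_out_index in tscv.split(data):
            n_val = int(len(held_out_index) * val_pct)
            val_index = held_out_index[:n_val]
            test_index = held_out_index[n_val:]
            splits.append((train_index, val_index, test_index))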

    return splits


def train_model(data, dataset_args, model_args):


    splits = handle_splits(data, dataset_args)
    scalers = []
    # TODO: only one? check how this works
    for train_index, val_index, test_index in tqdm(splits, desc='Cross-validation', total=len(splits)):
        # prepare data
        train_dataset, val_dataset, test_dataset, scaler = prepare_data(data, train_index, val_index, test_index, dataset_args['sequence_length'])
        scalers.append(scaler)

        # create loader
        train_loader = DataLoader(train_dataset, batch_size=dataset_args['batch_size'])
        val_loader = DataLoader(val_dataset, batch_size=dataset_args['batch_size'])
        test_loader = DataLoader(test_dataset, batch_size=dataset_args['batch_size'])

        raise ValueError("Not implemented yet")
        # create model
        model = SensorLSTM(
            input_size=model_args['input_size'],
            hidden_size=model_args['hidden_size'],
            num_layers=model_args['num_layers'],
            output_size=model_args['output_size'],
            learning_rate=model_args['learning_rate'],
        )

        # train model
        # TODO: check that the LSTM training (input, output, predicting 1 step ahead) is implemented correctly
        trainer = pl.Trainer(max_epochs=model_args['max_epochs'])
        trainer.fit(model, train_loader, val_loader)

    return model, splits, scalers 
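
For the cross-validation branch, the sketch below shows how scikit-learn's `TimeSeriesSplit` behaves on a toy series. The sizes are arbitrary; the relevant detail is that `test_size` is an integer number of samples and that the training window grows from fold to fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.arange(100).reshape(-1, 1)   # toy series with 100 time steps
n_splits, held_out_fraction = 3, 0.1

test_size = int(len(data) * held_out_fraction)   # must be an integer sample count
tscv = TimeSeriesSplit(n_splits=n_splits, test_size=test_size)

folds = [(train_index, test_index) for train_index, test_index in tscv.split(data)]
```

Each fold's test block directly follows its training block in time, so no future samples leak into training.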

In [43]:
dataset_args = {
    # define
    'sensor': ['G80F11B_Level1'],

    # dataset
    'n_splits': 1,
    'train_split': 0.7, 
    'val_split': 0.2,
    'test_split': 0.1,

    # model
    'batch_size': 64,
    'sequence_length': 50,
}

model_args = {
    'input_size': len(dataset_args['sensor']),  # number of features per time step
    'output_size': len(dataset_args['sensor']),
    'hidden_size': 50,
    'num_layers': 2,
    'learning_rate': 0.001,
    'max_epochs': 10,
}


In [27]:
# load data
data, _, _, _ = load_dataframe_from_HDF5(data_file_path, "combined_data/clean", columns=dataset_args['sensor'])
data = remove_nans_from_start_end(data, columns=data.columns)

# reset index
timestamps = data.index
data = data.reset_index(drop=True).values

In [28]:
data.shape

(956975, 1)

In [44]:
model, splits, scalers = train_model(data, dataset_args, model_args)

Cross-validation:   0%|          | 0/1 [00:00<?, ?it/s]


ValueError: Not implemented yet

# Evaluation

- Evaluate the model on the test set
- Plot the predictions
- Calculate the metrics
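
One possible way to compute the metrics is sketched below. It assumes predictions and targets are available as normalized arrays and that the scaler fitted on the training data is used to map them back to sensor units. The helper `regression_metrics` and all numbers are placeholders, not results from the model.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def regression_metrics(y_true, y_pred):
    """Return MAE and RMSE between two arrays of the same shape."""
    error = np.asarray(y_pred) - np.asarray(y_true)
    return {
        "mae": float(np.mean(np.abs(error))),
        "rmse": float(np.sqrt(np.mean(error ** 2))),
    }


# Toy stand-ins for the scaler fitted on the training data and for model outputs.
scaler = MinMaxScaler().fit(np.array([[0.0], [10.0]]))
y_true_scaled = np.array([[0.2], [0.5], [0.8]])
y_pred_scaled = np.array([[0.3], [0.5], [0.6]])

# Undo the normalization so the metrics are in the sensor's original units.
y_true = scaler.inverse_transform(y_true_scaled)
y_pred = scaler.inverse_transform(y_pred_scaled)
metrics = regression_metrics(y_true, y_pred)
```

Reporting the errors in original units makes them easier to compare across sensors than errors on the normalized scale.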


---
# OLD:

### Considerations:


- **Prediction horizon**: 
    - *1-step-ahead*: focusing on anomaly detection, a 1-step-ahead prediction horizon is chosen for its simplicity.
    - *considered*:
      - *Multi-step-ahead*: predict multiple steps ahead and compare the predictions with the actual values to detect anomalies. This approach may be more accurate but also more complex.
- **Model expandability**:
    - *Input*: simply increase the number of input features to include data from multiple sensors.
      - *considered*: 
          - Multiple Models: train a separate model for each sensor and combine their predictions for anomaly detection. This approach may be less efficient and harder to manage, and it won't capture interactions between sensors.
- **Rain data**:
    - *In Output and Learned*: If rain data is a critical factor in detecting anomalies, include it in the output and let the model learn its patterns. This can help the model differentiate between anomalies caused by rain and those caused by other factors: once it has learned the rain pattern, it should be better at predicting the anomalies.
    - *considered*:
        - Not in Output: If rain data is not directly related to the anomalies you're interested in, you might exclude it from the output.
        - In Output but Masked in Loss: If rain data affects the system but should not be considered an anomaly, you can include it in the output but mask it in the loss function. This way, the model learns to predict rain data without penalizing deviations.

- **Missing data**: Mask it in the loss function? What values should it have as input?
    - *0.1-1 Range and Masking in Loss*: Normalize the data to a range of 0.1-1 and impute missing values with 0. Then, mask the missing values in the loss function so that the model isn't penalized for them.
    - *considered*:
        - *Imputing*: keep the 0-1 range but impute with e.g. the mean or -1
        - *Indicator*: add an indicator feature that specifies whether the value is missing or not
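
The indicator alternative listed above could be sketched as follows. The helper `add_missing_indicator` is hypothetical, and the input is assumed to be an array of shape `(n_timesteps, n_features)` with NaN marking missing values.

```python
import numpy as np


def add_missing_indicator(values, fill_value=0.0):
    """Append one 0/1 column per feature flagging missing entries, then fill the gaps."""
    values = np.asarray(values, dtype=np.float32)
    missing = np.isnan(values)
    filled = np.where(missing, fill_value, values)
    return np.concatenate([filled, missing.astype(np.float32)], axis=1)


series = np.array([[0.4], [np.nan], [0.9]])   # toy normalized values with one gap
features = add_missing_indicator(series)
```

This doubles the number of input features, so the model's `input_size` would have to be adjusted accordingly.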