# Single Series modelling

*Masked Multi-Step Autoregressive Regression*


#### Contents

1. [Dataset](#1)
2. [Model](#2)
3. [Training](#3)
4. [Evaluation](#4)

---


### Initial design choices:
- *Autoregressive*: predict the next value based on the previous values.
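- *Masking*: mask the last 1 minute of the sensor and rain data so that the model has to predict it.
- *Normalize*: normalize the data to a range of 0.1-1 and impute missing values with 0, so that missing values stay distinguishable from valid ones.
- *Loss function*: mask missing values so that they do not contribute to the loss.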
- *Sliding window*: use a sliding window of # minutes to predict the next 1 minute for better comparison with the other models.
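
The sliding-window and autoregressive choices above can be sketched as follows. This is a minimal illustration on a toy series, not the project's `get_datasets` implementation; the helper name `make_windows` and the toy data are assumptions made for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(series, sequence_length, steps_ahead=1):
    """Split a 1-D series into (input window, target) pairs.

    Hypothetical helper: each input holds `sequence_length` consecutive
    values, and the target is the value `steps_ahead` steps after the window.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - sequence_length - steps_ahead + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window")
    # keep only the windows that still have a target available
    windows = sliding_window_view(series, sequence_length)[:n_samples]
    targets = series[sequence_length + steps_ahead - 1:]
    return windows, targets


toy_series = np.arange(10.0)  # stand-in for one sensor at 1-minute resolution
X, y = make_windows(toy_series, sequence_length=3, steps_ahead=1)
print(X.shape, y.shape)
```

With `sequence_length=3` and `steps_ahead=1`, the first window holds the values at minutes 0 to 2 and its target is the value at minute 3.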

### Possible experiments

- Predict residuals/changes
- Multi-step-ahead scheduling
- Quantiles?
- Data augmentation
- Utilize Mike predictions for training or evaluation
- Alternate masking for multi-sensor and comparison?
- 1-minute masking or 5-minute masking, or alternate?
- Include both rainfall sensors?
- Train on multi-step-ahead predictions?


---

### TODO:

- figures wrt masking

### Import

In [None]:
import os
import random
import time
import yaml
import pickle
from pathlib import Path
import json
import copy


import pandas as pd
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt


import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import pytorch_lightning as pl


from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from torch.utils.data import WeightedRandomSampler

from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
from pytorch_lightning.loggers import TensorBoardLogger
import tensorflow as tf


from fault_management_uds.data.hdf_functions import print_tree, load_dataframe_from_HDF5
from fault_management_uds.data.process import remove_nans_from_start_end
from fault_management_uds.config import indicator_2_meta, bools_2_meta, error_indicators, natural_sensor_order
from fault_management_uds.data.load import import_external_metadata, import_metadata
from fault_management_uds.data.format import merge_intervals
from fault_management_uds.plots import get_segment_start_end_color, set_meaningful_xticks


from fault_management_uds.utilities import get_accelerator
from fault_management_uds.data.dataset import get_datasets, handle_splits
from fault_management_uds.data.load import load_data

from fault_management_uds.modelling.models import get_model


from fault_management_uds.config import PROJ_ROOT
from fault_management_uds.config import DATA_DIR, RAW_DATA_DIR, INTERIM_DATA_DIR, PROCESSED_DATA_DIR, EXTERNAL_DATA_DIR
from fault_management_uds.config import MODELS_DIR, REPORTS_DIR, FIGURES_DIR, REFERENCE_DIR
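from fault_management_uds.config import rain_gauge_color, rain_gauges

# fix the random seed for reproducibility; `seed` is reused in training_args below
seed = 42
pl.seed_everything(seed)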




2024-12-11 14:36:16.156 | INFO     | fault_management_uds.config:<module>:11 - PROJ_ROOT path is: /Users/arond.jacobsen/Documents/GitHub/fault_management_uds
Seed set to 42


In [2]:
data_file_path = PROCESSED_DATA_DIR / 'Bellinge.h5'
external_metadata = import_metadata(REFERENCE_DIR / 'external_metadata.csv')
metadata = import_metadata(REFERENCE_DIR / 'sensor_metadata.csv')

# Define arguments

In [3]:
# create the obvious min dict
obvious_min_dict = {}
for sensor in natural_sensor_order:
    obvious_min_dict[sensor] = metadata[metadata['IdMeasurement'] == sensor].iloc[0]['obvious_min']

for rain_gauge in rain_gauges:
    obvious_min_dict[rain_gauge] = 0.0

len(obvious_min_dict)#, obvious_min_dict

21

In [None]:
dataset_args = {
    # define the sensors to use
    'engineered_vars': ['sin_time', 'cos_time'], # ['time_of_day', 'day_of_week']
    'exogenous_vars': ['5425'],
    # target
    'endogenous_vars': ['G72F040'],

    # define priority of rain event, _ times more important than other events
    'rain_event_priority': 1.0, # _ times more sampling in data loader

    # precision
    'precision': 3,

    # processing
    'function_transform_type': 'log', # ['none', 'log', 'sqrt']
    'scaler_type': 'min-max', # ['min-max', 'standard']
    'feature_range': (0, 1),  #(0.1, 1),
    'nan_value': 0,

    # data augmentation
    'noise_injection': False,

    # dataset
    'n_splits': 1,
    'train_split': 0.7, 
    'val_split': 0.15,
    'test_split': 0.15,

    # model
    'sequence_length': 60*3, 
    'steps_ahead': 1, # 1 minute ahead prediction

    # other
    'data_file_path': str(data_file_path),

}
# all variables
dataset_args['variable_list'] = dataset_args['engineered_vars'] + dataset_args['exogenous_vars'] + dataset_args['endogenous_vars']
# variables within the data loaded
dataset_args['data_variables'] = dataset_args['exogenous_vars'] + dataset_args['endogenous_vars']
# save relevant obvious min values
dataset_args['obvious_min'] = {sensor: obvious_min_dict.get(sensor, 0.0) for sensor in dataset_args['variable_list']}
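
For readers unfamiliar with the `sin_time`/`cos_time` features and the `log` plus `min-max` processing selected above, here is a rough sketch of what such steps typically look like. It illustrates the general technique and is not a copy of the project's preprocessing: `cyclic_time_features` and `log_min_max` are hypothetical names, `log1p` is assumed as one common choice of log transform, and the `(0.1, 1.0)` range is used to show how a fill value of 0 stays outside the range of valid data.

```python
import numpy as np
import pandas as pd


def cyclic_time_features(index):
    """Encode time of day on the unit circle so 23:59 and 00:00 end up close."""
    minutes = (index.hour * 60 + index.minute).to_numpy()
    angle = 2 * np.pi * minutes / (24 * 60)
    return pd.DataFrame({"sin_time": np.sin(angle), "cos_time": np.cos(angle)}, index=index)


def log_min_max(values, feature_range=(0.0, 1.0), nan_value=0.0):
    """Apply log1p, scale to `feature_range`, then fill NaNs with `nan_value`."""
    values = np.asarray(values, dtype=float)
    logged = np.log1p(values)
    low, high = np.nanmin(logged), np.nanmax(logged)
    span = high - low if high > low else 1.0  # guard against constant input
    scaled = (logged - low) / span
    scaled = scaled * (feature_range[1] - feature_range[0]) + feature_range[0]
    return np.where(np.isnan(scaled), nan_value, scaled)


index = pd.DatetimeIndex([
    "2020-01-01 00:00", "2020-01-01 06:00", "2020-01-01 12:00", "2020-01-01 18:00",
])
time_features = cyclic_time_features(index)
scaled = log_min_max([0.0, np.nan, 1.0, 3.0], feature_range=(0.1, 1.0))
print(time_features.round(3))
print(scaled)
```

In this sketch the scaler statistics come from the array itself; in a train/validation/test setting they would normally be fitted on the training split only.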



In [None]:
training_args = {
    'learning_rate': 0.001,
    'loss_function': 'MSELoss', # ['MSELoss', 'MAELoss']
    'max_epochs': 30,
    'batch_size': 64,
    'log_every_n_steps': 1,
    # val check every epoch
    'val_check_interval': 1.0, # []
    # early stopping
    'early_stopping_patience': 5,
    'seed': seed,
}

model_args = {
    'model_name': 'SimpleNN', # ['SimpleNN', 'LSTM']
    'input_size': len(dataset_args['variable_list']),
    'sequence_length': dataset_args['sequence_length'],
    'output_size': len(dataset_args['endogenous_vars']),
    'hidden_size': 16,
    'num_layers': 1,
    'dropout': 0.0,
}

experiment_name = 'Time of Day + Log + Min-Max'


Build a more descriptive configuration name from the key hyperparameters

In [None]:
# TODO:
configuring_parameters = [
    'rain_event_priority',
    'function_transform_type',
    'scaler_type',
    'sequence_length',
    'learning_rate',
    'hidden_size',
    'num_layers',
]
# create a configuration name based on the parameters
# (on duplicate keys, dataset_args takes precedence, then training_args, then model_args)
all_args = {**model_args, **training_args, **dataset_args}
config_name = '_'.join(f"{parameter}_{all_args[parameter]}" for parameter in configuring_parameters if parameter in all_args)

# Training

- Train the model on the dataset
- Save and load the model
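
The training loop further down draws training samples with a `WeightedRandomSampler`, using weights stored on the dataset as `priority_weight`. How those weights are built is not shown in this notebook, so the following is only a sketch of one plausible scheme, assuming each window carries a boolean rain-event flag; `is_rain_event` and the value `3.0` are illustrative assumptions.

```python
import torch
from torch.utils.data import WeightedRandomSampler

rain_event_priority = 3.0  # illustrative value; the configuration above uses 1.0
# hypothetical flags: True where a training window overlaps a rain event
is_rain_event = torch.tensor([False, False, True, False, True])

# rain windows get weight `rain_event_priority`, all other windows weight 1
priority_weight = torch.ones(len(is_rain_event))
priority_weight[is_rain_event] = rain_event_priority

# sampling with replacement draws each window in proportion to its weight
sampler = WeightedRandomSampler(priority_weight, num_samples=len(priority_weight), replacement=True)

# expected fraction of drawn samples that are rain windows
expected_share = priority_weight[is_rain_event].sum() / priority_weight.sum()
print(f"Expected share of rain windows: {expected_share.item():.2f}")
```

With `rain_event_priority` set to 1.0, as in the current configuration, every window has the same weight and the sampler reduces to uniform sampling with replacement.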

In [None]:

def train_model(model, train_loader, val_loader, callbacks, logger, training_args):
    accelerator = get_accelerator()
    trainer = pl.Trainer(
        max_epochs=training_args['max_epochs'],
        #max_steps=1,
        log_every_n_steps=training_args['log_every_n_steps'],
        val_check_interval=training_args['val_check_interval'],  
        check_val_every_n_epoch=1,  # Ensure it evaluates at least once per epoch
        callbacks=callbacks,
        logger=logger,
        accelerator=accelerator,
        devices="auto",
        )
    trainer.fit(model, train_loader, val_loader)
    return model, callbacks, logger 

In [8]:

model_dir = MODELS_DIR / model_args['model_name']

# clean up directories if they are empty
if os.path.exists(model_dir):
    # iterate folders
    for folder in os.listdir(model_dir):
        folder_path = model_dir / folder
        # check if folder is empty
        if os.path.isdir(folder_path):
            if not os.listdir(folder_path):
                # remove empty folder
                os.rmdir(folder_path)
        # it is a file and should be removed
        else:
            os.remove(folder_path)


Load data

In [9]:
data = load_data([None, None], data_file_path, dataset_args, data_type='complete')

Train

In [None]:
# create new folder
save_folder = model_dir / time.strftime("%Y-%m-%d_%H:%M")
os.makedirs(save_folder, exist_ok=True)

# get the splits
n_obs = len(data)
splits = handle_splits(n_obs, dataset_args)

# create folders for each split
split_folders = [save_folder / f"{i+1}_split" for i in range(dataset_args['n_splits'])]
for folder in split_folders:
    os.makedirs(folder, exist_ok=True)



# save the configs
configs = {
    'experiment_name': experiment_name,
    'config_name': config_name,
    'save_folder': str(save_folder),
    'dataset_args': dataset_args,
    'training_args': training_args,
    'model_args': model_args,
    'split_folders': [str(folder) for folder in split_folders],
}
# save as json
with open(save_folder / 'configs.json', 'w') as f:
    json.dump(configs, f, indent=4)


split_info = []
for i, (train_index, val_index, test_index) in enumerate(tqdm(splits, desc='Cross-validation', total=len(splits))):
    # Paths
    current_save_folder = split_folders[i]
    start_time = time.time()   

    ### Prepare data
    train_dataset, val_dataset, _, dataset_config = get_datasets(data, train_index, val_index, test_index, dataset_args)
    # create loader
    sampler = WeightedRandomSampler(train_dataset.priority_weight, len(train_dataset), replacement=True)
    train_loader = DataLoader(train_dataset, batch_size=training_args['batch_size'], sampler=sampler, num_workers=0)
    val_loader = DataLoader(val_dataset, batch_size=training_args['batch_size'], shuffle=False, num_workers=0)

    # Get model
    model = get_model(model_args, training_args)

    # Define callbacks
    checkpoint_callback = ModelCheckpoint(
        dirpath=current_save_folder, filename="{epoch:02d}-{val_loss:.5f}", save_last=True,
        monitor="val_loss", save_top_k=1, mode="min",
    )
    early_stopping = EarlyStopping(monitor="val_loss", patience=training_args['early_stopping_patience'], mode="min", verbose=False)
    callbacks = [checkpoint_callback, early_stopping]
    # logger
    logger = TensorBoardLogger(current_save_folder, sub_dir='tensorboard', name='', version='', default_hp_metric=False)

    # train model
    model, callbacks, logger = train_model(model, train_loader, val_loader, callbacks, logger, training_args)
    end_time = time.time()
    training_time = end_time - start_time
    # save the run info
    run_info = {
        'save_folder': str(current_save_folder),
        'best_model_path': callbacks[0].best_model_path,
        'last_model_path': callbacks[0].last_model_path,
        'top_k_best_model_paths': callbacks[0].best_k_models,
        'dataset_config': dataset_config,
        'training_time': training_time,
    }
    split_info.append(run_info)

# save the split info
with open(save_folder / 'split_info.pkl', 'wb') as f:
    pickle.dump(split_info, f)


Cross-validation:   0%|          | 0/1 [00:00<?, ?it/s]

Using MPS


GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..


Validity: 0 minutes are invalid.
Data validation passed.
Using MPS
Validity: 0 minutes are invalid.
Data validation passed.
Using MPS
Validity: 0 minutes are invalid.
Data validation passed.
Using MPS


/Users/arond.jacobsen/anaconda3/envs/thesis/lib/python3.12/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:654: Checkpoint directory /Users/arond.jacobsen/Documents/GitHub/fault_management_uds/models/SimpleNN/2024-12-11_14:36/1_split exists and is not empty.

  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | SimpleNNModel | 11.6 K | train
------------------------------------------------
11.6 K    Trainable params
0         Non-trainable params
11.6 K    Total params
0.046     Total estimated model params size (MB)
6         Modules in train mode
0         Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/Users/arond.jacobsen/anaconda3/envs/thesis/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


                                                                           

/Users/arond.jacobsen/anaconda3/envs/thesis/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 18: 100%|██████████| 1182/1182 [03:28<00:00,  5.67it/s, train_loss_step=5.96e-5, val_loss=5.25e-5, train_loss_epoch=7.93e-5]  

Cross-validation: 100%|██████████| 1/1 [1:05:56<00:00, 3956.15s/it]







In [11]:
raise ValueError("Training done")
# load 

ValueError: Training done

Tensorboard command:

```
tensorboard --logdir=models/SimpleNN
```

## Performance of baseline models


In [None]:
MAEs.loc[(1, 'Overall'), 'Overall']
# Previous value predictor MAE: 0.0010530973451327434

In [None]:
run_info['dataset_config']['val_timestamps']

array(['2020-02-29T23:40:00.000000000', '2020-02-29T23:41:00.000000000',
       '2020-02-29T23:42:00.000000000', ...,
       '2020-03-12T06:15:00.000000000', '2020-03-12T06:16:00.000000000',
       '2020-03-12T06:17:00.000000000'], dtype='<U48')

In [None]:
# load the output

timestamps = pd.to_datetime(run_info['dataset_config']['val_timestamps'])
starttime, endtime = timestamps[0], timestamps[-1]

# # handle time related variables
# starttime = pd.to_datetime(timestamps[0, 0])
# endtime = pd.to_datetime(timestamps[-1, -1]) + pd.Timedelta(minutes=steps_ahead)

endogenous_vars = dataset_args['endogenous_vars']
targets, _, _, _ = load_dataframe_from_HDF5(data_file_path, "combined_data/clean", columns=endogenous_vars, starttime=starttime, endtime=endtime, complete_range=True)
targets = targets.to_numpy()


### Mean predictor

In [None]:
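def evaluate_mae(predictions, targets):
    """Mean absolute error over all elements of two equally shaped arrays."""
    assert predictions.shape == targets.shape, "Predictions and targets must have the same shape"
    # np.mean divides by the total element count, so the result stays a true
    # MAE even when more than one target column is evaluated
    mae = np.mean(np.abs(predictions - targets))
    return mae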


In [None]:
# mean predictor: always predict the per-variable mean of the targets
mean_predictions = np.mean(targets, axis=0)
# repeat the mean so that the array has the same shape as the targets
mean_predictions = np.tile(mean_predictions, (len(targets), 1))

In [None]:
# evaluate the MAE
mae = evaluate_mae(mean_predictions, targets)

print(f"Mean predictor MAE: {mae}")

Mean predictor MAE: 0.014198638065767656


### Previous step predictor


In [None]:
# predict next step based on the last value, targets is (9000) shape
previous_values = targets[:-1] # remove the last value

In [None]:
mae = evaluate_mae(previous_values, targets[1:]) # remove the first value
print(f"Previous value predictor MAE: {mae}")

Previous value predictor MAE: 0.0009065714109749338


- very low learning rate???
- visualize 1 event per month?


---
# OLD:

### Considerations:


- **Prediction horizon**: 
    - *1-step-ahead*: focusing on anomaly detection, a 1-step-ahead prediction horizon is chosen for its simplicity.
    - *considered*:
      - *Multi-step-ahead*: predict multiple steps ahead and compare the predictions with the actual values to detect anomalies. This approach may be more accurate but also more complex.
- **Model expandability**:
    - *Input*: simply increase the number of input features to include data from multiple sensors.
      - *considered*: 
          - Multiple Models: train a separate model for each sensor and combine their predictions for anomaly detection. This approach may be less efficient and harder to manage, and it won't capture interactions between sensors.
- **Rain data**:
    - *In Output and Learned*: If rain data is a critical factor in detecting anomalies, include it in the output and allow the model to learn its patterns. This approach can help the model differentiate between anomalies caused by rain and those caused by other factors; once it has learned the rain pattern, it should be better placed to predict the anomalies.
    - *considered*:
        - Not in Output: If rain data is not directly related to the anomalies you're interested in, you might exclude it from the output.
        - In Output but Masked in Loss: If rain data affects the system but should not be considered an anomaly, you can include it in the output but mask it in the loss function. This way, the model learns to predict rain data without penalizing deviations.

- **Missing data**: should it be masked in the loss function, and what value should it take as input?
    - *0.1-1 Range and Masking in Loss*: Normalize the data to a range of 0.1-1 and impute missing values with 0. Then mask the missing values in the loss function so that they do not contribute to the loss.
    - *considered*:
        - *Imputing*: keeping 0-1 range but impute with e.g. mean or -1
        - *Indicator*: add an indicator feature that specifies whether the value is missing or not
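
The "mask in loss" option above could look roughly like the following in PyTorch. This is a sketch under the assumption that missing targets are marked by a boolean mask; `masked_mse_loss` is a hypothetical name, not a function from the project.

```python
import torch


def masked_mse_loss(predictions, targets, valid_mask):
    """Mean squared error computed over valid entries only.

    `valid_mask` is True where the target was observed and False where it
    was missing (and therefore imputed, for example with 0).
    """
    valid_mask = valid_mask.to(predictions.dtype)
    squared_error = (predictions - targets) ** 2 * valid_mask
    # clamp avoids a division by zero when a batch has no valid targets
    return squared_error.sum() / valid_mask.sum().clamp(min=1.0)


predictions = torch.tensor([[0.5], [0.9], [0.2]])
targets = torch.tensor([[0.4], [0.0], [0.2]])  # the second target is an imputed 0
valid_mask = torch.tensor([[True], [False], [True]])
loss = masked_mse_loss(predictions, targets, valid_mask)
print(loss.item())
```

The large error on the imputed entry (0.9 against 0.0) is ignored, so only the two observed targets determine the loss.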