# **Data Loading and Preprocessing for LSTM Autoencoder (NAB Dataset)**
This notebook covers the **data loading and preprocessing** steps for an LSTM autoencoder anomaly detection project using the **Numenta Anomaly Benchmark (NAB)** dataset. We will load a selected subset of time series, split them into train/validation/test sets, normalize the values, and create sequential windows of data (look-back sequences) suitable as input for an LSTM model. Each step is commented so the pipeline is easy to follow and reproduce.
### **1. Setup and Data Overview**
First, let's import the necessary libraries and define the dataset location. We assume the NAB dataset CSV files are available locally in a directory (e.g. nab_data/data). The target time series we will use are:
- **machine_temperature_system_failure** – a machine's internal temperature (with system failures).
- **ambient_temperature_system_failure** – ambient office temperature (with a system failure event).
- **cpu_utilization_asg_misconfiguration** – CPU usage of an AWS cluster (with a misconfiguration anomaly).
- **speed_t4013** – traffic speed from sensor T4013.
- **speed_7578** – traffic speed from sensor 7578.
- **art_daily_jumpsdown** – an artificially generated daily pattern with a sudden jump anomaly.
We'll load each CSV into a Pandas DataFrame, parse the timestamps as datetimes, and set them as the index.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler

# Define the data directory (adjust the path as needed for your environment)
data_dir = Path("nab_data/data")

# Define the file paths for each selected time series
series_files = {
    "machine_temperature_system_failure": data_dir / "realKnownCause/machine_temperature_system_failure.csv",
    "ambient_temperature_system_failure": data_dir / "realKnownCause/ambient_temperature_system_failure.csv",
    "cpu_utilization_asg_misconfiguration": data_dir / "realKnownCause/cpu_utilization_asg_misconfiguration.csv",
    "speed_t4013": data_dir / "realTraffic/speed_t4013.csv",
    "speed_7578": data_dir / "realTraffic/speed_7578.csv",
    "art_daily_jumpsdown": data_dir / "artificialWithAnomaly/art_daily_jumpsdown.csv"
}

# Load each series into a DataFrame
series_dfs = {}
for name, filepath in series_files.items():
    # Read CSV, parse dates, set timestamp as index
    df = pd.read_csv(filepath, parse_dates=['timestamp'], index_col='timestamp')
    df = df.sort_index()  # Ensure the data is sorted by time
    series_dfs[name] = df
    # Print basic info for verification
    print(f"{name}: loaded {df.shape[0]} rows, from {df.index.min()} to {df.index.max()}")


machine_temperature_system_failure: loaded 22695 rows, from 2013-12-02 21:15:00 to 2014-02-19 15:25:00
ambient_temperature_system_failure: loaded 7267 rows, from 2013-07-04 00:00:00 to 2014-05-28 15:00:00
cpu_utilization_asg_misconfiguration: loaded 18050 rows, from 2014-05-14 01:14:00 to 2014-07-15 17:19:00
speed_t4013: loaded 2495 rows, from 2015-09-01 11:25:00 to 2015-09-17 16:19:00
speed_7578: loaded 1127 rows, from 2015-09-08 11:39:00 to 2015-09-17 14:05:00
art_daily_jumpsdown: loaded 4032 rows, from 2014-04-01 00:00:00 to 2014-04-14 23:55:00


The output above lists the number of rows and the date range of each loaded series, confirming that every file was read and indexed by timestamp.

### **2. Train-Validation-Test Split**
For each time series, we will split the data chronologically into three segments:
- **Training set**: 60% of the earliest data points (used for model training).
- **Validation set**: the next 20% of data (used for hyperparameter tuning and model validation).
- **Test set**: the final 20% of data (used to evaluate model performance on unseen data).

Splitting by time (rather than randomly) is crucial for time series: it respects chronological order and keeps future data from leaking into training.
We'll compute the index boundaries for the 60/20/20 split from the number of samples and then slice each DataFrame accordingly.

In [2]:
# Define split ratios
train_ratio = 0.6
val_ratio = 0.2  # (test will implicitly be 0.2 as well since train+val+test = 1.0)

# Initialize dictionaries to hold split data
train_dfs = {}
val_dfs = {}
test_dfs = {}

# Perform chronological splitting for each series
for name, df in series_dfs.items():
    n = len(df)
    train_end = int(n * train_ratio)              # index for end of train set
    val_end = train_end + int(n * val_ratio)      # index for end of val set (train_end + 20% of total)
    # Slice the DataFrame into train, val, test segments
    train_dfs[name] = df.iloc[:train_end]
    val_dfs[name]   = df.iloc[train_end:val_end]
    test_dfs[name]  = df.iloc[val_end:]
    # Verify the split sizes
    print(f"{name}: train {len(train_dfs[name])}, val {len(val_dfs[name])}, test {len(test_dfs[name])}")


machine_temperature_system_failure: train 13617, val 4539, test 4539
ambient_temperature_system_failure: train 4360, val 1453, test 1454
cpu_utilization_asg_misconfiguration: train 10830, val 3610, test 3610
speed_t4013: train 1497, val 499, test 499
speed_7578: train 676, val 225, test 226
art_daily_jumpsdown: train 2419, val 806, test 807


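Each series is now split into three sets. The printout confirms the number of points in each split. The division is only approximately 60%/20%/20% because int() truncates the train and validation sizes, so any leftover rows go to the test set (for example, 1454 test rows versus 1453 validation rows for ambient_temperature_system_failure). Positional slicing with df.iloc[...] yields a chronological split because each DataFrame was sorted by timestamp in Section 1.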
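The boundary arithmetic can be checked on a toy series. The sketch below is illustrative only: the helper name chronological_split and the synthetic 11-row frame are assumptions, not part of the notebook. It repeats the same int() arithmetic and shows that the segments are contiguous in time and that the remainder lands in the test set.

```python
import numpy as np
import pandas as pd

def chronological_split(df, train_ratio=0.6, val_ratio=0.2):
    """Split a time-sorted DataFrame into train/val/test by position."""
    n = len(df)
    train_end = int(n * train_ratio)          # truncates toward zero
    val_end = train_end + int(n * val_ratio)  # test receives whatever is left
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

# Synthetic 5-minute series with 11 rows (hypothetical data)
idx = pd.date_range("2014-01-01", periods=11, freq="5min")
toy = pd.DataFrame({"value": np.arange(11.0)}, index=idx)

train, val, test = chronological_split(toy)
print(len(train), len(val), len(test))  # sizes 6, 2 and 3

# Every training timestamp precedes every validation timestamp, and so on
print(train.index.max() < val.index.min() < test.index.min())
```

Because the three slices share their boundaries, concatenating them reproduces the original frame with no row dropped or duplicated.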

### **3. Feature Scaling with MinMaxScaler**
Time series values can have different scales and units. To help the LSTM autoencoder train effectively, we will **normalize** each series using a MinMaxScaler (scaling values to the range [0, 1]). Importantly, the scaler is **fit on the training data only** to avoid leaking information from the validation/test sets. We then transform the validation and test sets using the same scaler parameters (min and max from train). This preserves the relative scale and ensures that anomalies in val/test (which might be out of the train range) are not introduced into the scaling calculation.

In [3]:
# Initialize dictionaries to hold scalers and scaled data
scalers = {}
scaled_train = {}
scaled_val = {}
scaled_test = {}

for name in series_dfs.keys():
    # Initialize a MinMaxScaler for each series
    scaler = MinMaxScaler(feature_range=(0, 1))
    # Fit on training data (expects 2D array)
    train_values = train_dfs[name][['value']].values  # shape (n_train, 1)
    scaler.fit(train_values)
    # Transform train, val, and test data using the fitted scaler
    train_scaled = scaler.transform(train_values)
    val_scaled   = scaler.transform(val_dfs[name][['value']].values)
    test_scaled  = scaler.transform(test_dfs[name][['value']].values)
    # Store the scaler and scaled data
    scalers[name] = scaler
    scaled_train[name] = train_scaled
    scaled_val[name]   = val_scaled
    scaled_test[name]  = test_scaled
    # Optionally, confirm scaling ranges
    print(f"{name}: train min {train_scaled.min():.2f}, max {train_scaled.max():.2f}")


machine_temperature_system_failure: train min 0.00, max 1.00
ambient_temperature_system_failure: train min 0.00, max 1.00
cpu_utilization_asg_misconfiguration: train min 0.00, max 1.00
speed_t4013: train min 0.00, max 1.00
speed_7578: train min 0.00, max 1.00
art_daily_jumpsdown: train min 0.00, max 1.00


After this step, each training set spans exactly [0, 1] by definition of MinMaxScaler, as the printed min of 0.00 and max of 1.00 confirm. The validation and test sets are transformed with the same training min and max, so most of their values also lie between 0 and 1, but any value outside the training range maps below 0 or above 1. This can happen when anomalies are present and is acceptable here; the values are left unclipped.
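To make the out-of-range behaviour concrete, here is a minimal sketch with made-up numbers. It uses plain NumPy to mirror the arithmetic that MinMaxScaler applies with the default feature range (0, 1), namely (x - train_min) / (train_max - train_min); the arrays and the scale helper are hypothetical.

```python
import numpy as np

# Hypothetical values: the "test" data extends beyond the training range
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

# Statistics come from the training split only
train_min = train.min(axis=0)
train_max = train.max(axis=0)

def scale(x):
    """Min-max scale x using the training minimum and maximum."""
    return (x - train_min) / (train_max - train_min)

train_scaled = scale(train)  # 0.0, 0.5, 1.0
test_scaled = scale(test)    # -0.25, 0.75, 1.5

print(train_scaled.ravel())
print(test_scaled.ravel())
```

MinMaxScaler also accepts clip=True to force transformed values into the feature range. That is one option, though clipping would flatten exactly the excursions that an anomaly detector may need to see, which is presumably why the default is kept in this notebook.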

### **4. Creating Overlapping Sequence Windows (Look-back = 288)**
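**LSTMs are sequence models**, so we need to **convert our scaled data into overlapping sequences of a fixed length**. We choose a **look-back window of 288 time steps** for each sequence. At a 5-minute sampling interval, 288 points correspond to 24 hours of data, capturing a full daily cycle. Not every selected series is sampled at that rate, though. Judging by the row counts and date ranges printed in Section 1, ambient_temperature_system_failure is roughly hourly (7267 rows over about 11 months), so 288 steps there span about 12 days, and the two traffic speed series average fewer than 288 points per day.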
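How much wall-clock time 288 steps cover depends on the sampling interval, which can be estimated from the timestamp index. The following is a small sketch on synthetic indexes; the helper name median_step is an assumption and not part of the notebook.

```python
import pandas as pd

LOOK_BACK = 288

def median_step(index):
    """Median gap between consecutive timestamps of a DatetimeIndex."""
    return index.to_series().diff().median()

# Synthetic indexes standing in for a 5-minute and an hourly series
five_min = pd.date_range("2014-04-01", periods=600, freq="5min")
hourly = pd.date_range("2013-07-04", periods=600, freq="60min")

for label, idx in {"5-minute": five_min, "hourly": hourly}.items():
    step = median_step(idx)
    # Multiplying a Timedelta by an integer gives the span of one window
    print(f"{label}: step {step}, window of {LOOK_BACK} steps spans {step * LOOK_BACK}")
```

In the notebook itself, the same helper could be applied to series_dfs[name].index for each series. The median is used rather than the mean so that occasional gaps in the data do not distort the estimate.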

##### **How we create sequences:**
- We will use a **sliding window approach**. For a given series segment (train, val, or test), we take the first 288 points as the first sequence, then shift one step forward to get the next sequence (points 2 to 289), and so on.
- This yields overlapping sequences of length 288. If a **segment has N points**, this process will **produce N - 288 + 1 sequences** (and none at all when N is smaller than 288).
- For an LSTM autoencoder, we will train the model to **reconstruct the input sequence**. Therefore, we set each sequence as both the input (X) and the target (y) for training. (In other words, y is identical to X for each window in an autoencoder setup.)
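The sliding window is easiest to see at a small scale. This toy sketch uses a window of 3 on a 5-point series instead of 288 on real data; the values are illustrative only.

```python
import numpy as np

series = np.arange(5)  # N = 5 points: 0, 1, 2, 3, 4
window = 3

# One window per valid start position: N - window + 1 = 3 windows
windows = [series[i:i + window] for i in range(len(series) - window + 1)]
for w in windows:
    print(w.tolist())  # [0, 1, 2], then [1, 2, 3], then [2, 3, 4]

# Add a trailing feature axis to get (num_sequences, window, num_features)
X = np.array(windows)[..., np.newaxis]
y = X.copy()  # autoencoder target is the input itself
print(X.shape)  # (3, 3, 1)
```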

Let's define a helper function to create these sequences, then apply it to each dataset split.

In [4]:
# Define look-back window size (e.g., 288 time steps ~ 24 hours of 5-minute data)
LOOK_BACK = 288

def create_sequences(data_array, window_size):
    """
    Generate overlapping sequences of length `window_size` from a 2D array of shape (num_samples, num_features).
    Returns a tuple (X, y) where:
      - X is a 3D array of shape (num_sequences, window_size, num_features)
      - y is a 3D array of the same shape (for autoencoder target = input sequence)
    """
    X, y = [], []
    for i in range(len(data_array) - window_size + 1):
        seq = data_array[i : i + window_size]
        X.append(seq)
        y.append(seq)  # for autoencoder, target sequence is the same as input sequence
    # Convert to numpy arrays
    X = np.array(X)
    y = np.array(y)
    return X, y

# Create sequence windows for each dataset split of each series
sequence_data = {}  # to hold the resulting X and y arrays for each series
for name in series_dfs.keys():
    X_train, y_train = create_sequences(scaled_train[name], LOOK_BACK)
    X_val, y_val     = create_sequences(scaled_val[name], LOOK_BACK)
    X_test, y_test   = create_sequences(scaled_test[name], LOOK_BACK)
    sequence_data[name] = {
        "X_train": X_train, "y_train": y_train,
        "X_val": X_val,     "y_val": y_val,
        "X_test": X_test,   "y_test": y_test
    }
    print(f"{name}: X_train shape {X_train.shape}, X_val shape {X_val.shape}, X_test shape {X_test.shape}")


machine_temperature_system_failure: X_train shape (13330, 288, 1), X_val shape (4252, 288, 1), X_test shape (4252, 288, 1)
ambient_temperature_system_failure: X_train shape (4073, 288, 1), X_val shape (1166, 288, 1), X_test shape (1167, 288, 1)
cpu_utilization_asg_misconfiguration: X_train shape (10543, 288, 1), X_val shape (3323, 288, 1), X_test shape (3323, 288, 1)
speed_t4013: X_train shape (1210, 288, 1), X_val shape (212, 288, 1), X_test shape (212, 288, 1)
speed_7578: X_train shape (389, 288, 1), X_val shape (0,), X_test shape (0,)
art_daily_jumpsdown: X_train shape (2132, 288, 1), X_val shape (519, 288, 1), X_test shape (520, 288, 1)


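We now have the input (**X**) and target (**y**) sequences for training, validation, and testing, for each time series. The printed shapes confirm the dimensions: each non-empty X (and y) has shape (*number_of_sequences, 288, 1*), since we have 288 time steps and 1 feature per time step. For example, if a training set had 10,000 points, after windowing it would produce 10,000 - 288 + 1 = 9,713 sequences for training. The exception is speed_7578: its validation and test segments (225 and 226 points) are shorter than the 288-step window, so they yield no sequences and the empty arrays have shape (0,). That series would need a shorter window or a different split before it can be used for validation or testing.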
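When a segment is shorter than the window, as with the speed_7578 validation and test splits above, np.array([]) collapses to shape (0,), which can break downstream code that expects three dimensions. One possible guard is sketched below. The function name create_sequences_safe is hypothetical, and the index-based windowing is a swapped-in alternative to the notebook's loop, not the notebook's own method.

```python
import numpy as np

def create_sequences_safe(data_array, window_size):
    """Windowing that always returns shape (num_sequences, window_size, num_features)."""
    data = np.asarray(data_array)
    if data.ndim == 1:
        data = data.reshape(-1, 1)  # treat a 1D input as a single feature
    n_seq = len(data) - window_size + 1
    if n_seq <= 0:
        # Segment too short: return an empty array that keeps the 3D layout
        X = np.empty((0, window_size, data.shape[1]), dtype=data.dtype)
        return X, X.copy()
    # Row i of idx holds the positions i, i + 1, ..., i + window_size - 1
    idx = np.arange(n_seq)[:, None] + np.arange(window_size)[None, :]
    X = data[idx]
    return X, X.copy()  # autoencoder target equals the input

long_segment = np.arange(300.0).reshape(-1, 1)  # 300 points: 13 windows
short_segment = np.zeros((225, 1))              # shorter than the window

X_long, _ = create_sequences_safe(long_segment, 288)
X_short, _ = create_sequences_safe(short_segment, 288)
print(X_long.shape, X_short.shape)  # (13, 288, 1) (0, 288, 1)
```

Keeping the trailing dimensions lets shape checks and concatenation behave uniformly, but an empty split still cannot be used for evaluation, so the window length or the split ratios would have to change for such a series.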

### **5. Summary of Prepared Data**
At this stage, the data is fully preprocessed and ready for modeling:

- **Scaled and windowed data**: For each series, sequence_data[name] holds X_train, y_train, X_val, y_val, X_test, y_test as NumPy arrays. These can be fed into an LSTM autoencoder model (with X as input and y as target).
- **Shape of sequences**: Each sequence is of length 288 with a single feature (the time series value). Thus, X_train.shape is (num_train_sequences, 288, 1). The target y_train has the same shape.
- **Next steps**: We would proceed to define the LSTM autoencoder model, train it on the training sequences, validate on the validation set, and use the test set for final anomaly detection performance evaluation. (Those steps will be handled in subsequent notebook sections.)

With this preprocessing complete, we have a clean, reproducible pipeline for converting raw NAB time series data into a form suitable for training an LSTM autoencoder for anomaly detection. Each step was carefully executed to avoid data leakage and preserve the time order of events, which is critical in time series anomaly detection tasks.

---------------------------------------------------------------