# 02 - Feature Engineering

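This notebook demonstrates the statistical feature extraction in `threatsim.features`: it computes per-window features and compares their distributions for normal and anomalous windows.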

Note: For this project, we feed raw windows directly to the transformer, so explicit feature engineering is optional. This notebook shows the available features for reference and potential hybrid approaches.

In [None]:
import sys
sys.path.insert(0, '..')

import numpy as np
import matplotlib.pyplot as plt

from threatsim.data import load_nab_data, create_anomaly_mask, create_windows
from threatsim.features import extract_features, get_feature_names

plt.style.use('seaborn-v0_8-whitegrid')

In [None]:
# Load data and create windows
df, anomaly_timestamps = load_nab_data("realKnownCause/machine_temperature_system_failure.csv")
anomaly_mask = create_anomaly_mask(df, anomaly_timestamps)
values = df['value'].values.astype(np.float32)
windows, labels = create_windows(values, anomaly_mask, window_size=50, step_size=10)

print(f"Windows shape: {windows.shape}")

In [None]:
# Extract features
features = extract_features(windows)
feature_names = get_feature_names()

print(f"Features shape: {features.shape}")
print(f"Feature names: {feature_names}")

In [None]:
# Compare feature distributions: normal vs anomalous
normal_mask = labels == 0
anomaly_idx = labels == 1

# 2x5 grid holds up to 10 features; zip() below stops at the shorter of axes and feature_names
fig, axes = plt.subplots(2, 5, figsize=(16, 6))
axes = axes.flatten()

for i, (ax, name) in enumerate(zip(axes, feature_names)):
    ax.hist(features[normal_mask, i], bins=30, alpha=0.6, label='Normal', density=True)
    ax.hist(features[anomaly_idx, i], bins=30, alpha=0.6, label='Anomaly', density=True)
    ax.set_title(name)
    ax.legend(fontsize=8)

plt.suptitle('Feature Distributions: Normal vs Anomalous Windows')
plt.tight_layout()
plt.show()

## Summary

The extracted features are distributed differently in normal and anomalous windows. Features such as `std`, `range`, and `slope` may help distinguish anomalies.
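
To make the named features concrete, here is a minimal NumPy sketch of how `std`, `range`, and `slope` could be computed per window. The helper `window_stats` is hypothetical and is not the `threatsim.features.extract_features` implementation, whose exact definitions may differ. The slope is assumed to be the least-squares slope of the values against their time index.

```python
import numpy as np

def window_stats(windows: np.ndarray) -> np.ndarray:
    """Hypothetical helper: per-window [std, range, slope].

    Expects an array of shape (n_windows, window_size).
    """
    windows = np.asarray(windows, dtype=np.float64)
    window_size = windows.shape[1]

    std = windows.std(axis=1)
    value_range = windows.max(axis=1) - windows.min(axis=1)

    # Closed-form least-squares slope of value against time index:
    # sum((t - t_mean) * (y - y_mean)) / sum((t - t_mean)^2)
    t_centered = np.arange(window_size, dtype=np.float64)
    t_centered -= t_centered.mean()
    y_centered = windows - windows.mean(axis=1, keepdims=True)
    slope = (y_centered @ t_centered) / (t_centered @ t_centered)

    return np.column_stack([std, value_range, slope])

# Tiny demo: a flat window and a linearly rising window
demo = np.array([[1.0, 1.0, 1.0, 1.0],
                 [0.0, 1.0, 2.0, 3.0]])
demo_stats = window_stats(demo)
print(demo_stats)
```

The flat window yields zeros for all three statistics, while the rising window has a range of 3 and a slope of 1 per time step.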

For this project, the transformer is applied directly to raw sequences and learns its own representations. These explicit features could be concatenated for a hybrid approach.
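
One way the hybrid idea might look, as a sketch only: standardize the statistical features and append them to a per-window embedding before a downstream classifier. The function `concat_hybrid` and the random stand-in arrays are hypothetical. The project does not specify where the concatenation would happen, so joining at the embedding level is an assumption.

```python
import numpy as np

def concat_hybrid(embeddings: np.ndarray, features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical helper: append standardized features to per-window embeddings."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    features = np.asarray(features, dtype=np.float32)
    if embeddings.shape[0] != features.shape[0]:
        raise ValueError("embeddings and features must have one row per window")

    # Standardize each feature column so its scale is comparable to the embeddings
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    scaled = (features - mean) / (std + eps)

    return np.concatenate([embeddings, scaled], axis=1)

# Stand-in data: 8 windows, a 16-dim embedding, and 10 statistical features
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(8, 16))
fake_features = rng.normal(loc=50.0, scale=5.0, size=(8, 10))

hybrid = concat_hybrid(fake_embeddings, fake_features)
print(hybrid.shape)
```

In a real pipeline the standardization statistics would be computed on the training split only and reused for validation and test data, to avoid leaking information across splits.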