# 🚀 Updated Preprocessing Pipeline

## ✨ What Changed?

### ❌ **REMOVED** (Not needed for shock detection):
1. **`time`** column
   - LSTM learns temporal patterns from sequence order
   - Absolute time is irrelevant for pattern recognition

2. **`Speed`** column
   - Not available on smartwatch
   - Focus on HR patterns only

3. **`ID` and `ID_test`** from final data
   - Kept during processing, removed before model training
   - Prevents model from memorizing specific people

### ✅ **ADDED** (Better shock detection):
1. **`HR_acceleration`** - Second difference of HR (discrete 2nd derivative)
   - Exercise: Smooth, low acceleration
   - Shock: Sharp, high acceleration

2. **`HR_change_abs`** - Magnitude of change
   - Exercise: Small consistent values
   - Shock: Large sudden values

3. **Data sampling (40% of participants)**
   - Keeps training fast
   - Intended to stay representative of patterns

### 📊 **Final Features** (7 total):
All normalized, ready for LSTM:
1. `HR` - Current heart rate
2. `HR_change` - First difference between consecutive measurements
3. `HR_acceleration` - Second difference (NEW!)
4. `HR_rolling_mean` - Recent average
5. `HR_rolling_std` - Pattern variability
6. `HR_deviation` - Distance from recent average
7. `HR_change_abs` - Change magnitude (NEW!)
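
To make the difference-based features concrete, here is a minimal sketch on two made-up HR series: a steady ramp standing in for exercise and a sudden jump standing in for shock. The helper name `hr_difference_features` and both toy series are illustrative assumptions, not part of the pipeline; the calculation mirrors the `diff()` and `fillna(0)` calls used in the feature step.

```python
import pandas as pd

def hr_difference_features(hr_values):
    """Compute HR_change, HR_acceleration and HR_change_abs for one HR series."""
    hr = pd.Series(hr_values, dtype=float)
    change = hr.diff()                      # first difference
    features = pd.DataFrame({
        "HR": hr,
        "HR_change": change,
        "HR_acceleration": change.diff(),   # second difference
        "HR_change_abs": change.abs(),
    })
    # The leading rows have no predecessor, so diff() yields NaN there.
    return features.fillna(0)

# Toy data (hypothetical values, chosen only for illustration).
ramp = hr_difference_features([100, 102, 104, 106, 108, 110])   # steady increase
spike = hr_difference_features([100, 100, 100, 130, 131, 131])  # sudden jump

print("ramp  max |acceleration|:", ramp["HR_acceleration"].abs().max())
print("spike max |acceleration|:", spike["HR_acceleration"].abs().max())
```

On these toy inputs the steady ramp has zero second difference throughout, while the jump produces a second difference as large as the jump itself.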

---

# Treadmill Maximal Exercise Tests Dataset
###### [Link](https://physionet.org/content/treadmill-exercise-cardioresp/1.0.1/)

This dataset contains cardiorespiratory measurements taken during 992 treadmill maximal graded exercise tests conducted at the Exercise Physiology and Human Performance Lab, University of Malaga.

## File: `test_measure.csv`

This file contains all breath-by-breath cardiorespiratory measurements for each graded effort test.

### General Info

- **Rows:** 575,087 (one per breath measurement)
- **Tests:** 992
- **Median measurements per test:** 580 [IQR: 484–673]
- **Median test duration:** 1,093.00 seconds [IQR: 978.75–1,208.00]

### Variables

| Name     | Description                                | Unit                  |
|----------|--------------------------------------------|-----------------------|
| time     | Time since measurement started             | seconds               |
| Speed    | Treadmill speed                            | km/h                  |
| HR       | Heart rate                                 | beats per minute      |
| VO2      | Oxygen consumption                         | mL/min                |
| VCO2     | Carbon dioxide production                  | mL/min                |
| RR       | Respiration rate                           | respirations/min      |
| VE       | Pulmonary ventilation                      | L/min                 |
| ID       | Participant identification                 | -                     |
| ID_test  | Effort test identification                 | -                     |

_Note: VO2, VCO2, and VE are missing for 30 tests._

**ID_test** is formatted as `{participant_id}_{test_number}`, e.g., `245_3` = third test of participant 245.
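
As a small illustration of that format, the sketch below splits an `ID_test` string into its two parts. The function name `parse_id_test` is hypothetical, and it assumes both parts are plain integers, as in the `245_3` example.

```python
def parse_id_test(id_test: str) -> tuple[int, int]:
    """Split '{participant_id}_{test_number}' into (participant_id, test_number)."""
    participant_id, test_number = id_test.split("_")
    return int(participant_id), int(test_number)

participant, test_number = parse_id_test("245_3")
print(participant, test_number)
```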

---

**Reference:**  
Mongin, D., García Romero, J., & Alvero Cruz, J. R. (2021). Treadmill Maximal Exercise Tests from the Exercise Physiology and Human Performance Lab of the University of Malaga (version 1.0.1). PhysioNet. https://doi.org/10.13026/7ezk-j442


In [3]:
# Remove the unnecessary columns from the dataset
import pandas as pd

# Load your data
df = pd.read_csv('test_measure.csv')

print("=" * 60)
print("STEP 1: REMOVING UNNECESSARY COLUMNS")
print("=" * 60)

# Show what we have
print(f"\nOriginal columns: {df.columns.tolist()}")
print(f"Total rows: {len(df):,}")

# Keep only the columns we need (ONLY HR and identifiers for processing)
columns_to_keep = ['HR', 'ID_test', 'ID']

df_cleaned = df[columns_to_keep]

# Show what we kept
print(f"\nColumns after cleaning: {df_cleaned.columns.tolist()}")
print("Removed columns: time, Speed, VO2, VCO2, RR, VE")
print("  ↳ time: Not needed (LSTM learns temporal patterns from sequence order)")
print("  ↳ Speed: Not available on smartwatch")
print("  ↳ Others: Not relevant for HR spike detection")

# Check the data
print("\nFirst 10 rows of cleaned data:")
print(df_cleaned.head(10))

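print("\nData info:")
df_cleaned.info()  # info() prints its report directly and returns None

# Save to new CSV (create the output folder first so to_csv does not fail)
import os
os.makedirs('dataset', exist_ok=True)
df_cleaned.to_csv('dataset/output_step1.csv', index=False)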

print("\n" + "=" * 60)
print("✓ STEP 1 COMPLETE!")
print("✓ Saved as: dataset/output_step1.csv")
print("=" * 60)

STEP 1: REMOVING UNNECESSARY COLUMNS

Original columns: ['time', 'Speed', 'HR', 'VO2', 'VCO2', 'RR', 'VE', 'ID_test', 'ID']
Total rows: 575,087

Columns after cleaning: ['HR', 'ID_test', 'ID']
Removed columns: time, Speed, VO2, VCO2, RR, VE
  ↳ time: Not needed (LSTM learns temporal patterns from sequence order)
  ↳ Speed: Not available on smartwatch
  ↳ Others: Not relevant for HR spike detection

First 10 rows of cleaned data:
     HR ID_test  ID
0  63.0     2_1   2
1  75.0     2_1   2
2  82.0     2_1   2
3  87.0     2_1   2
4  92.0     2_1   2
5  94.0     2_1   2
6  95.0     2_1   2
7  96.0     2_1   2
8  97.0     2_1   2
9  97.0     2_1   2

Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 575087 entries, 0 to 575086
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   HR       574106 non-null  float64
 1   ID_test  575087 non-null  object 
 2   ID       575087 non-null  int64  
dtypes: float64(1), int6
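
The info report above shows 574,106 non-null `HR` values in 575,087 rows, so 981 readings are missing. One possible way to handle them before feature engineering is to interpolate within each test, sketched below on toy data. This is an assumption about how the gaps could be treated, not something the notebook does; the helper name `fill_hr_within_test` and the toy values are made up.

```python
import numpy as np
import pandas as pd

def fill_hr_within_test(df):
    """Linearly interpolate missing HR values, never mixing rows from different tests."""
    out = df.copy()
    out["HR"] = out.groupby("ID_test")["HR"].transform(
        # limit_direction="both" also fills gaps at the start or end of a test
        lambda s: s.interpolate(limit_direction="both")
    )
    return out

# Toy frame with one gap inside a test and one at the start of a test.
toy = pd.DataFrame({
    "ID_test": ["1_1"] * 4 + ["2_1"] * 3,
    "HR": [90.0, np.nan, 94.0, 96.0, np.nan, 120.0, 122.0],
})
filled = fill_hr_within_test(toy)
print(filled)
```

Dropping the affected rows is a simpler alternative, at the cost of leaving gaps in the sequence order that the model relies on.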

In [4]:
# separate testing and training data + SAMPLE to reduce size
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

# Load the cleaned data from step 1
df = pd.read_csv('dataset/output_step1.csv')

print("=" * 60)
print("STEP 2: SAMPLING & SPLITTING DATA")
print("=" * 60)

# Get unique participant IDs
unique_participants = df['ID'].unique()
print(f"\nTotal participants: {len(unique_participants)}")

# SAMPLE 40% of participants to keep dataset manageable
np.random.seed(42)
sampled_participants = np.random.choice(
    unique_participants, 
    size=int(len(unique_participants) * 0.4),  # Keep 40% of data
    replace=False
)

print(f"Sampled participants: {len(sampled_participants)} (40% for faster training)")

# Filter to sampled participants
df_sampled = df[df['ID'].isin(sampled_participants)]
print(f"Rows after sampling: {len(df_sampled):,} (from {len(df):,})")

# Split sampled participants 80-20 for train/test
train_ids, test_ids = train_test_split(
    sampled_participants, 
    test_size=0.2, 
    random_state=42
)

print(f"\nTraining participants: {len(train_ids)}")
print(f"Testing participants: {len(test_ids)}")

# Split the data based on participant ID
train_df = df_sampled[df_sampled['ID'].isin(train_ids)]
test_df = df_sampled[df_sampled['ID'].isin(test_ids)]

print(f"\nTraining rows: {len(train_df):,}")
print(f"Testing rows: {len(test_df):,}")

# Save both files
train_df.to_csv('dataset/train_data.csv', index=False)
test_df.to_csv('dataset/test_data.csv', index=False)

print("\n✓ Saved dataset/train_data.csv")
print("✓ Saved dataset/test_data.csv")
print("\n" + "=" * 60)

STEP 2: SAMPLING & SPLITTING DATA

Total participants: 857
Sampled participants: 342 (40% for faster training)
Rows after sampling: 231,886 (from 575,087)

Training participants: 273
Testing participants: 69

Training rows: 189,459
Testing rows: 42,427

✓ Saved dataset/train_data.csv
✓ Saved dataset/test_data.csv
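
Because the split is done on participant IDs rather than on rows, no person should appear in both sets. The sketch below reproduces the arithmetic of the step (40% of 857 participants, then 80/20) with NumPy only and checks that property. It uses `np.random.default_rng` instead of the notebook's `np.random.seed` plus `train_test_split`, so the specific IDs it picks will differ; the function name `split_participants` is hypothetical.

```python
import numpy as np

def split_participants(ids, sample_frac=0.4, test_frac=0.2, seed=42):
    """Sample a fraction of participants, then split them into train and test IDs."""
    rng = np.random.default_rng(seed)
    ids = np.unique(ids)
    sampled = rng.choice(ids, size=int(len(ids) * sample_frac), replace=False)
    # Round the test share up, which matches the 273/69 counts reported above.
    n_test = int(np.ceil(len(sampled) * test_frac))
    shuffled = rng.permutation(sampled)
    return shuffled[n_test:], shuffled[:n_test]

train_ids, test_ids = split_participants(np.arange(857))
overlap = set(train_ids) & set(test_ids)
print(len(train_ids), len(test_ids), len(overlap))
```

The same overlap check can be run on the notebook's own `train_ids` and `test_ids` as a quick guard against leakage between the two sets.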



## 🎯 Goal: One-Class Classification for Shock Detection

**Problem:** Detect if a heart rate spike is due to **exercise** (normal) or **shock/panic** (abnormal)

**Approach:** 
- Train LSTM Autoencoder on **exercise data only** (treadmill tests)
- Model learns: "This is what normal exercise-induced HR patterns look like"
- At deployment: Patterns that DON'T match → Flag as anomaly (shock/panic)
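
One common way to turn "doesn't match" into a decision is to threshold the autoencoder's reconstruction error. The sketch below shows that idea with NumPy only. The mean-squared-error score, the 99th-percentile threshold, and all function names are assumptions for illustration, and the "reconstructions" are synthetic stand-ins rather than output from a trained model.

```python
import numpy as np

def reconstruction_error(windows, reconstructions):
    """Mean squared error per window; inputs have shape (n_windows, timesteps, features)."""
    return np.mean((windows - reconstructions) ** 2, axis=(1, 2))

def fit_threshold(train_errors, percentile=99.0):
    """Pick the error level that the chosen percentile of training windows stays below."""
    return float(np.percentile(train_errors, percentile))

rng = np.random.default_rng(0)

# Stand-in for exercise windows and their (slightly imperfect) reconstructions.
train_windows = rng.normal(size=(200, 60, 7))
train_recon = train_windows + rng.normal(scale=0.1, size=train_windows.shape)
threshold = fit_threshold(reconstruction_error(train_windows, train_recon))

# Two new windows: one reconstructed perfectly, one reconstructed badly.
new_windows = rng.normal(size=(2, 60, 7))
new_recon = new_windows.copy()
new_recon[1] += 1.0
flags = reconstruction_error(new_windows, new_recon) > threshold
print(flags)
```

The percentile is a tuning choice: a lower value flags more windows as anomalies, a higher one flags fewer.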

**Key Features (NO `time` needed!):**
1. ✅ `HR` - Current heart rate value
2. ✅ `HR_change` - Speed of change (1st difference)
3. ✅ `HR_acceleration` - How change is changing (2nd difference)
4. ✅ `HR_rolling_mean` - Recent average (context)
5. ✅ `HR_rolling_std` - Recent variability (smoothness)
6. ✅ `HR_deviation` - Distance from recent average
7. ✅ `HR_change_abs` - Magnitude of changes

**Why NO `time`?**
- LSTM learns temporal patterns from **sequence order** automatically
- What matters: Pattern of changes, not when they occur
- Exercise at second 50 looks the same as at second 500

**Data Strategy:**
- Sample 40% of participants → Faster training, intended to stay representative
- Split sampled participants 80/20 for train/test → No participant appears in both sets
- 60-timestep windows → Captures full pattern context. Rows are breath-by-breath measurements, so at the median rate (580 rows over 1,093 s) one window spans roughly 113 seconds, not 60
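
Since each row is one breath measurement rather than one second, the real duration of a 60-row window can be estimated from the dataset summary (median 580 measurements over a median 1,093 seconds per test). The short calculation below is only a rough estimate, because it divides one median by another; the variable names are illustrative.

```python
MEDIAN_TEST_DURATION_S = 1093.0   # from the dataset summary
MEDIAN_ROWS_PER_TEST = 580        # from the dataset summary
WINDOW_SIZE = 60                  # rows per window used in this pipeline

seconds_per_row = MEDIAN_TEST_DURATION_S / MEDIAN_ROWS_PER_TEST
window_seconds = WINDOW_SIZE * seconds_per_row
rows_for_60_seconds = round(60 / seconds_per_row)

print(f"~{seconds_per_row:.2f} s per row")
print(f"a {WINDOW_SIZE}-row window covers ~{window_seconds:.0f} s")
print(f"~{rows_for_60_seconds} rows cover 60 s")
```

The same reasoning applies to the 30-row rolling statistics, which cover roughly twice as many seconds as rows.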

---

## 📋 Preprocessing Steps

In [5]:
# create features for shock vs exercise detection
import pandas as pd
import numpy as np

print("=" * 60)
print("STEP 3: CREATING FEATURES FOR SHOCK DETECTION")
print("=" * 60)

# Function to add features
def add_features(df):
    """
    Create features that distinguish exercise from shock/panic:
    - Exercise: Gradual, smooth HR increase
    - Shock/Panic: Sudden, erratic HR spike
    """
    df = df.copy()
    
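    # Group rows by test. A stable sort keeps each test's rows in their original
    # (chronological) file order; the default quicksort does not guarantee that.
    df = df.sort_values(['ID_test'], kind='stable').reset_index(drop=True)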
    
    # Process each test separately
    for test_id in df['ID_test'].unique():
        mask = df['ID_test'] == test_id
        
        # 1. HR_change: First derivative (how much HR changed)
        #    Exercise: Gradual positive changes
        #    Shock: Large sudden jump
        df.loc[mask, 'HR_change'] = df.loc[mask, 'HR'].diff()
        
        # 2. HR_acceleration: Second derivative (rate of change of change)
        #    Exercise: Low acceleration (smooth)
        #    Shock: High acceleration (sudden)
        df.loc[mask, 'HR_acceleration'] = df.loc[mask, 'HR_change'].diff()
        
        # 3. HR_rolling_mean: Average HR over the last 30 measurements
        #    (rows are breath-by-breath, so this is 30 breaths, not 30 seconds)
        #    Provides context for current HR
        df.loc[mask, 'HR_rolling_mean'] = df.loc[mask, 'HR'].rolling(
            window=30, min_periods=1
        ).mean()
        
        # 4. HR_rolling_std: Variability over the last 30 measurements
        #    Exercise: Low std (consistent)
        #    Shock: High std (erratic)
        df.loc[mask, 'HR_rolling_std'] = df.loc[mask, 'HR'].rolling(
            window=30, min_periods=1
        ).std()
        
        # 5. HR_deviation: How far is current HR from recent average
        #    Shows if HR is unusually high/low
        df.loc[mask, 'HR_deviation'] = df.loc[mask, 'HR'] - df.loc[mask, 'HR_rolling_mean']
        
        # 6. HR_change_abs: Magnitude of HR change (ignoring direction)
        #    Exercise: Small values
        #    Shock: Large values
        df.loc[mask, 'HR_change_abs'] = df.loc[mask, 'HR_change'].abs()
    
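    # Fill NaN values (from diff and rolling operations) with 0.
    # Caution: this also turns any missing HR reading into 0 bpm; step 1 reported
    # 574,106 non-null HR values in 575,087 rows, so missing readings do exist.
    df = df.fillna(0)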
    
    return df

# Process training data
print("\nProcessing TRAIN data...")
train_df = pd.read_csv('dataset/train_data.csv')
train_with_features = add_features(train_df)

print(f"✓ Train rows: {len(train_with_features):,}")
print(f"✓ Total columns after adding features: {train_with_features.shape[1]}")

# Process test data
print("\nProcessing TEST data...")
test_df = pd.read_csv('dataset/test_data.csv')
test_with_features = add_features(test_df)

print(f"✓ Test rows: {len(test_with_features):,}")
print(f"✓ Total columns after adding features: {test_with_features.shape[1]}")

print("\nAll features:")
print(train_with_features.columns.tolist())

print("\nFeatures for MODEL (HR-based only):")
model_features = [
    'HR',                # Current heart rate
    'HR_change',         # Speed of change
    'HR_acceleration',   # Acceleration (NEW!)
    'HR_rolling_mean',   # Recent average
    'HR_rolling_std',    # Recent variability
    'HR_deviation',      # Distance from normal
    'HR_change_abs'      # Magnitude of change (NEW!)
]
print(model_features)
print(f"Total model features: {len(model_features)}")

print("\nMetadata columns (not for model):")
print(['ID_test', 'ID'])

print("\nSample from train data:")
print(train_with_features[['HR', 'HR_change', 'HR_acceleration', 'HR_rolling_std', 'HR_deviation']].head(10))

# Save with features
train_with_features.to_csv('dataset/train_data_with_features.csv', index=False)
test_with_features.to_csv('dataset/test_data_with_features.csv', index=False)

print("\n✓ Saved dataset/train_data_with_features.csv")
print("✓ Saved dataset/test_data_with_features.csv")
print("\n" + "=" * 60)

STEP 3: CREATING FEATURES FOR SHOCK DETECTION

Processing TRAIN data...
✓ Train rows: 189,459
✓ Features created: 9 columns

Processing TEST data...
✓ Test rows: 42,427
✓ Features created: 9 columns

All features:
['HR', 'ID_test', 'ID', 'HR_change', 'HR_acceleration', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation', 'HR_change_abs']

Features for MODEL (HR-based only):
['HR', 'HR_change', 'HR_acceleration', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation', 'HR_change_abs']
Total model features: 7

Metadata columns (not for model):
['ID_test', 'ID']

Sample from train data:
      HR  HR_change  HR_acceleration  HR_rolling_std  HR_deviation
0  153.0        0.0              0.0        0.000000      0.000000
1  157.0        4.0              0.0        2.828427      2.000000
2  157.0        0.0             -4.0        2.309401      1.333333
3  158.0        1.0              1.0        2.217356      1.750000
4  158.0        0.0             -1.0        2.073644      1.400000
5 
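
The per-test loop in this step can also be written with `groupby`, which avoids re-scanning the whole frame for every test and does not need a sort at all. The version below is a sketch of that alternative, assuming the same column names; `add_features_grouped` is a hypothetical name, and the toy frame exists only to show that differences do not leak across test boundaries.

```python
import pandas as pd

FEATURE_COLUMNS = [
    "HR_change", "HR_acceleration", "HR_rolling_mean",
    "HR_rolling_std", "HR_deviation", "HR_change_abs",
]

def add_features_grouped(df, window=30):
    """Compute the six derived HR features per test without an explicit loop."""
    out = df.copy()
    hr_by_test = out.groupby("ID_test", sort=False)["HR"]
    out["HR_change"] = hr_by_test.diff()
    out["HR_acceleration"] = out.groupby("ID_test", sort=False)["HR_change"].diff()
    out["HR_rolling_mean"] = hr_by_test.transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
    out["HR_rolling_std"] = hr_by_test.transform(
        lambda s: s.rolling(window, min_periods=1).std()
    )
    out["HR_deviation"] = out["HR"] - out["HR_rolling_mean"]
    out["HR_change_abs"] = out["HR_change"].abs()
    # Fill only the derived columns; the raw HR column is left untouched.
    out[FEATURE_COLUMNS] = out[FEATURE_COLUMNS].fillna(0)
    return out

# Toy data: two short tests back to back.
toy = pd.DataFrame({
    "ID_test": ["1_1"] * 3 + ["2_1"] * 3,
    "HR": [100.0, 104.0, 104.0, 150.0, 151.0, 153.0],
})
toy_features = add_features_grouped(toy)
print(toy_features[["ID_test", "HR", "HR_change", "HR_acceleration"]])
```

Note that the first row of the second test gets `HR_change` of 0 rather than 46, because each test is differenced on its own.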

In [6]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import pickle

print("=" * 60)
print("STEP 4: NORMALIZING FEATURES")
print("=" * 60)

# Load data with features
train_df = pd.read_csv('dataset/train_data_with_features.csv')
test_df = pd.read_csv('dataset/test_data_with_features.csv')

print(f"\nTrain rows: {len(train_df):,}")
print(f"Test rows: {len(test_df):,}")

# ONLY normalize HR-based features (not ID columns!)
features_to_normalize = [
    'HR', 
    'HR_change',
    'HR_acceleration',      # NEW!
    'HR_rolling_mean', 
    'HR_rolling_std', 
    'HR_deviation',
    'HR_change_abs'         # NEW!
]

print(f"\nNormalizing {len(features_to_normalize)} features:")
for feat in features_to_normalize:
    print(f"  ✓ {feat}")

# Create scaler
scaler = StandardScaler()

# FIT the scaler ONLY on training data (prevent data leakage!)
scaler.fit(train_df[features_to_normalize])

print("\n✓ Scaler fitted on training data")
print(f"  Mean values learned: {scaler.mean_[:3]}... (showing first 3)")
print(f"  Std values learned: {scaler.scale_[:3]}... (showing first 3)")

# TRANSFORM both train and test using the SAME scaler
train_df[features_to_normalize] = scaler.transform(train_df[features_to_normalize])
test_df[features_to_normalize] = scaler.transform(test_df[features_to_normalize])

print("\n✓ Both datasets normalized")

# Show normalized values
print("\nSample normalized values (train data):")
print(train_df[['HR', 'HR_change', 'HR_acceleration', 'HR_deviation']].head(10))

# Save normalized data
train_df.to_csv('dataset/train_data_normalized.csv', index=False)
test_df.to_csv('dataset/test_data_normalized.csv', index=False)

print("\n✓ Saved dataset/train_data_normalized.csv")
print("✓ Saved dataset/test_data_normalized.csv")

# Save the scaler (CRITICAL for deployment!)
with open('dataset/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print("✓ Saved dataset/scaler.pkl")
print("  ↳ You'll need this to normalize smartwatch data in real-time!")

print("\n" + "=" * 60)
print("NORMALIZATION COMPLETE!")
print("=" * 60)

STEP 4: NORMALIZING FEATURES

Train rows: 189,459
Test rows: 42,427

Normalizing 7 features:
  ✓ HR
  ✓ HR_change
  ✓ HR_acceleration
  ✓ HR_rolling_mean
  ✓ HR_rolling_std
  ✓ HR_deviation
  ✓ HR_change_abs

✓ Scaler fitted on training data
  Mean values learned: [ 1.47060799e+02 -1.12583725e-02 -5.23596134e-03]... (showing first 3)
  Std values learned: [32.36364114  9.14558772 15.08045565]... (showing first 3)

✓ Both datasets normalized

Sample normalized values (train data):
         HR  HR_change  HR_acceleration  HR_deviation
0  0.183515   0.001231         0.000347      0.005629
1  0.307110   0.438600         0.000347      0.125427
2  0.307110   0.001231        -0.264897      0.085494
3  0.338009   0.110573         0.066658      0.110452
4  0.338009   0.001231        -0.065964      0.089488
5  0.338009   0.001231         0.000347      0.075511
6  0.338009   0.001231         0.000347      0.065528
7  0.368908   0.110573         0.066658      0.110452
8  0.368908
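
At deployment the saved scaler only has to apply `(x - mean_) / scale_` with the training statistics. The sketch below does that by hand with NumPy, using the first two means and standard deviations printed above (`HR` and `HR_change`), and reproduces the normalized values of the second sample row (HR 157, HR_change 4). The function name `standardize` is an illustrative assumption; in practice the same result should come from loading `scaler.pkl` and calling `scaler.transform`.

```python
import numpy as np

# Training statistics printed by the scaler (first two features: HR, HR_change).
mean_ = np.array([1.47060799e+02, -1.12583725e-02])
scale_ = np.array([32.36364114, 9.14558772])

def standardize(sample, mean_, scale_):
    """Apply the training-set mean and standard deviation to new measurements."""
    return (np.asarray(sample, dtype=float) - mean_) / scale_

# Second row of the training sample shown earlier: HR = 157, HR_change = 4.
z = standardize([157.0, 4.0], mean_, scale_)
print(np.round(z, 6))
```

New smartwatch readings must be scaled with these training statistics, not with statistics recomputed on the incoming data, otherwise the model sees inputs on a different scale than it was trained on.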

In [7]:
# Remove metadata columns (keep only model features)
import pandas as pd

print("=" * 60)
print("STEP 5: REMOVING METADATA COLUMNS")
print("=" * 60)

# Load the normalized data
train_df = pd.read_csv('dataset/train_data_normalized.csv')
test_df = pd.read_csv('dataset/test_data_normalized.csv')

print(f"\nOriginal columns: {train_df.columns.tolist()}")

# Remove ID columns (they're not features, just identifiers)
columns_to_remove = ['ID_test', 'ID']

train_df = train_df.drop(columns=columns_to_remove)
test_df = test_df.drop(columns=columns_to_remove)

print(f"\nRemoved: {columns_to_remove}")
print("  ↳ These are identifiers, not features")
print("  ↳ Model should learn patterns, not memorize people")

print(f"\nRemaining columns (MODEL FEATURES): {train_df.columns.tolist()}")

# Save back to same files
train_df.to_csv('dataset/train_data_normalized.csv', index=False)
test_df.to_csv('dataset/test_data_normalized.csv', index=False)

print("\n✓ Updated dataset/train_data_normalized.csv")
print("✓ Updated dataset/test_data_normalized.csv")

print("\n📊 Final features for LSTM model:")
feature_columns = train_df.columns.tolist()
for i, feat in enumerate(feature_columns, 1):
    print(f"  {i}. {feat}")
    
print(f"\nTotal features: {len(feature_columns)}")
print("\n" + "=" * 60)

STEP 5: REMOVING METADATA COLUMNS

Original columns: ['HR', 'ID_test', 'ID', 'HR_change', 'HR_acceleration', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation', 'HR_change_abs']

Removed: ['ID_test', 'ID']
  ↳ These are identifiers, not features
  ↳ Model should learn patterns, not memorize people

Remaining columns (MODEL FEATURES): ['HR', 'HR_change', 'HR_acceleration', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation', 'HR_change_abs']

✓ Updated dataset/train_data_normalized.csv
✓ Updated dataset/test_data_normalized.csv

📊 Final features for LSTM model:
  1. HR
  2. HR_change
  3. HR_acceleration
  4. HR_rolling_mean
  5. HR_rolling_std
  6. HR_deviation
  7. HR_change_abs

Total features: 7



In [8]:
import pandas as pd
import numpy as np

print("=" * 60)
print("STEP 6: CREATING 60-SECOND WINDOWS")
print("=" * 60)

# Load normalized data
train_df = pd.read_csv('dataset/train_data_normalized.csv')
test_df = pd.read_csv('dataset/test_data_normalized.csv')

print(f"\nTrain rows: {len(train_df):,}")
print(f"Test rows: {len(test_df):,}")

# ALL columns are features now (no metadata)
feature_columns = train_df.columns.tolist()

print(f"\nUsing ALL {len(feature_columns)} features per timestep:")
for i, feat in enumerate(feature_columns, 1):
    print(f"  {i}. {feat}")

# Window size: 60 timesteps (60 consecutive measurements, not 60 seconds of wall-clock time)
WINDOW_SIZE = 60

def create_windows(df, window_size):
    """
    Create sliding windows of `window_size` consecutive rows.
    Each window = 60 timesteps × 7 features
    
    Note: We can't use ID_test anymore since we removed it,
    so windows slide through the entire dataset. This assumes the data
    is already sorted by test and time, and it means a window that
    starts near the end of one test also contains rows from the next test.
    """
    
    data = df.values  # Convert to numpy array
    windows = []
    
    # Create sliding windows with a stride of one row
    for i in range(len(data) - window_size + 1):
        window = data[i:i + window_size]
        windows.append(window)
    
    return np.array(windows)

# Create windows for training data
print("\nCreating windows for TRAIN data...")
X_train = create_windows(train_df, WINDOW_SIZE)

print(f"✓ Created {len(X_train):,} training windows")
print(f"  Each window shape: {X_train[0].shape} ({WINDOW_SIZE} timesteps × {len(feature_columns)} features)")

# Create windows for test data
print("\nCreating windows for TEST data...")
X_test = create_windows(test_df, WINDOW_SIZE)

print(f"✓ Created {len(X_test):,} test windows")
print(f"  Each window shape: {X_test[0].shape} ({WINDOW_SIZE} timesteps × {len(feature_columns)} features)")

# Save as numpy arrays
np.save('dataset/X_train.npy', X_train)
np.save('dataset/X_test.npy', X_test)

print("\n✓ Saved dataset/X_train.npy")
print("✓ Saved dataset/X_test.npy")

print("\n" + "=" * 60)
print("🎉 PREPROCESSING COMPLETE!")
print("=" * 60)
print("\n📊 Dataset Summary:")
print(f"  Training windows: {len(X_train):,}")
print(f"  Test windows: {len(X_test):,}")
print(f"  Window size: {WINDOW_SIZE} timesteps")
print(f"  Features per timestep: {len(feature_columns)}")
print("\n✅ Ready for LSTM training!")
print("✅ Model will learn: Exercise patterns (normal)")
print("✅ Model will detect: Shock/panic patterns (anomaly)")
print("=" * 60)

STEP 6: CREATING 60-SECOND WINDOWS

Train rows: 189,459
Test rows: 42,427

Using ALL 7 features per timestep:
  1. HR
  2. HR_change
  3. HR_acceleration
  4. HR_rolling_mean
  5. HR_rolling_std
  6. HR_deviation
  7. HR_change_abs

Creating windows for TRAIN data...
✓ Created 189,400 training windows
  Each window shape: (60, 7) (60 timesteps × 7 features)

Creating windows for TEST data...
✓ Created 42,368 test windows
  Each window shape: (60, 7) (60 timesteps × 7 features)

✓ Saved dataset/X_train.npy
✓ Saved dataset/X_test.npy

🎉 PREPROCESSING COMPLETE!

📊 Dataset Summary:
  Training windows: 189,400
  Test windows: 42,368
  Window size: 60 timesteps
  Features per timestep: 7

✅ Ready for LSTM training!
✅ Model will learn: Exercise patterns (normal)
✅ Model will detect: Shock/panic patterns (anomaly)
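
The window counts above equal the row counts minus 59, which confirms that windows were cut from one continuous array, so some of them straddle the boundary between two tests. A possible alternative, sketched below, builds windows one test at a time. It assumes `ID_test` is still present when windowing happens (that is, the metadata columns would be dropped after this step instead of before it); the function name `create_windows_per_test` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def create_windows_per_test(df, feature_columns, window_size, group_col="ID_test"):
    """Create sliding windows that never mix rows from different tests."""
    windows = []
    for _, group in df.groupby(group_col, sort=False):
        values = group[feature_columns].to_numpy()
        # Tests shorter than window_size contribute no windows.
        for start in range(len(values) - window_size + 1):
            windows.append(values[start:start + window_size])
    if not windows:
        return np.empty((0, window_size, len(feature_columns)))
    return np.stack(windows)

# Toy data: a 5-row test followed by a 4-row test, tagged by a constant HR value.
toy = pd.DataFrame({
    "ID_test": ["1_1"] * 5 + ["2_1"] * 4,
    "HR": [1.0] * 5 + [2.0] * 4,
})
per_test = create_windows_per_test(toy, ["HR"], window_size=3)
naive_count = len(toy) - 3 + 1

print(per_test.shape, naive_count)
```

On the toy data the per-test version yields 5 windows where a single pass over all rows would yield 7; the 2 extra windows are the ones that mix both tests.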
