# Treadmill Maximal Exercise Tests Dataset
###### [Link](https://physionet.org/content/treadmill-exercise-cardioresp/1.0.1/)

This dataset contains cardiorespiratory measurements taken during 992 treadmill maximal graded exercise tests conducted at the Exercise Physiology and Human Performance Lab, University of Malaga.

## File: `test_measure.csv`

This file contains all breath-by-breath cardiorespiratory measurements for each graded effort test.

### General Info

- **Rows:** 575,087 (one per breath measurement)
- **Tests:** 992
- **Median measurements per test:** 580 [IQR: 484–673]
- **Median test duration:** 1,093.00 seconds [IQR: 978.75–1,208.00]
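
The figures above could be reproduced along these lines. This is a minimal sketch on a made-up frame, and it assumes that a test's duration is its last `time` value minus its first, which the dataset description does not state explicitly.

```python
import pandas as pd

# Tiny made-up stand-in for test_measure.csv (values are illustrative only).
df = pd.DataFrame({
    "ID_test": ["1_1", "1_1", "1_1", "2_1", "2_1"],
    "time":    [0, 3, 10, 0, 20],
})

# One row per test: number of measurements and elapsed time.
per_test = df.groupby("ID_test")["time"].agg(
    n_measurements="count",
    duration=lambda t: t.max() - t.min(),  # assumption: duration = last - first
)

# Median and interquartile bounds across tests.
summary = per_test.quantile([0.25, 0.5, 0.75])
print(summary)
```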

### Variables

| Name     | Description                                | Unit                  |
|----------|--------------------------------------------|-----------------------|
| time     | Time since measurement started             | seconds               |
| Speed    | Treadmill speed                            | km/h                  |
| HR       | Heart rate                                 | beats per minute      |
| VO2      | Oxygen consumption                         | mL/min                |
| VCO2     | Carbon dioxide production                  | mL/min                |
| RR       | Respiration rate                           | respirations/min      |
| VE       | Pulmonary ventilation                      | L/min                 |
| ID       | Participant identification                 | -                     |
| ID_test  | Effort test identification                 | -                     |

_Note: VO2, VCO2, and VE are missing for 30 tests._

**ID_test** is formatted as `{participant_id}_{test_number}`, e.g., `245_3` = third test of participant 245.
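
If the two parts of `ID_test` are needed separately, they can be split on the underscore. A small sketch, where the column names `participant_id` and `test_number` are illustrative and not part of the dataset:

```python
import pandas as pd

ids = pd.Series(["245_3", "2_1", "10_12"], name="ID_test")

# Split on the underscore into two integer columns.
parts = ids.str.split("_", expand=True).astype(int)
parts.columns = ["participant_id", "test_number"]  # hypothetical names
print(parts)
```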

---

**Reference:**  
Mongin, D., García Romero, J., & Alvero Cruz, J. R. (2021). Treadmill Maximal Exercise Tests from the Exercise Physiology and Human Performance Lab of the University of Malaga (version 1.0.1). PhysioNet. https://doi.org/10.13026/7ezk-j442


In [None]:
# Remove the unnecessary columns from the dataset
import pandas as pd

# Load your data
df = pd.read_csv('test_measure.csv')

print("=" * 60)
print("STEP 1: REMOVING UNNECESSARY COLUMNS")
print("=" * 60)

# Show what we have
print(f"\nOriginal columns: {df.columns.tolist()}")
print(f"Total rows: {len(df):,}")

# Keep only the columns we need
columns_to_keep = ['time', 'Speed', 'HR', 'ID_test', 'ID']

df_cleaned = df[columns_to_keep]

# Show what we kept
print(f"\nColumns after cleaning: {df_cleaned.columns.tolist()}")
print(f"Removed columns: VO2, VCO2, RR, VE")

# Check the data
print("\nFirst 10 rows of cleaned data:")
print(df_cleaned.head(10))

print("\nData info:")
df_cleaned.info()  # info() prints its report directly and returns None

# Save to new CSV
df_cleaned.to_csv('output_step1.csv', index=False)

print("\n" + "=" * 60)
print("✓ STEP 1 COMPLETE!")
print("✓ Saved as: output_step1.csv")
print("=" * 60)

STEP 1: REMOVING UNNECESSARY COLUMNS

Original columns: ['time', 'Speed', 'HR', 'VO2', 'VCO2', 'RR', 'VE', 'ID_test', 'ID']
Total rows: 575,087

Columns after cleaning: ['time', 'Speed', 'HR', 'ID_test', 'ID']
Removed columns: VO2, VCO2, RR, VE

First 10 rows of cleaned data:
   time  Speed    HR ID_test  ID
0     0    5.0  63.0     2_1   2
1     2    5.0  75.0     2_1   2
2     4    5.0  82.0     2_1   2
3     7    5.0  87.0     2_1   2
4     9    5.0  92.0     2_1   2
5    11    5.0  94.0     2_1   2
6    14    5.0  95.0     2_1   2
7    16    5.0  96.0     2_1   2
8    17    5.0  97.0     2_1   2
9    19    5.0  97.0     2_1   2

Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 575087 entries, 0 to 575086
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   time     575087 non-null  int64  
 1   Speed    575087 non-null  float64
 2   HR       574106 non-null  float64
 3   ID_test  575087 non-null  object 
 4  
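
An alternative to loading every column and then subsetting is to let `pd.read_csv` parse only the needed columns through `usecols`. The sketch below uses two made-up rows with the same header as `test_measure.csv`; the numeric values are placeholders, not real measurements.

```python
import io
import pandas as pd

# Two made-up rows with the same header as test_measure.csv.
csv_text = (
    "time,Speed,HR,VO2,VCO2,RR,VE,ID_test,ID\n"
    "0,5.0,63,478,360,27,13.3,2_1,2\n"
    "2,5.0,75,401,295,23,10.3,2_1,2\n"
)

columns_to_keep = ["time", "Speed", "HR", "ID_test", "ID"]

# usecols makes pandas parse only the listed columns.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=columns_to_keep)
print(df_small.columns.tolist())
```

On the full file this would avoid holding the four unused columns in memory at all.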

In [None]:
# Separate training and testing data (split by participant, not by row)
import pandas as pd

# Load the cleaned data from step 1
df = pd.read_csv('output_step1.csv')

# Get unique participant IDs
unique_participants = df['ID'].unique()
print(f"Total participants: {len(unique_participants)}")

# Split participants 80-20
from sklearn.model_selection import train_test_split

train_ids, test_ids = train_test_split(
    unique_participants, 
    test_size=0.2, 
    random_state=42
)

print(f"Training participants: {len(train_ids)}")
print(f"Testing participants: {len(test_ids)}")

# Split the data based on participant ID
train_df = df[df['ID'].isin(train_ids)]
test_df = df[df['ID'].isin(test_ids)]

print(f"\nTraining rows: {len(train_df):,}")
print(f"Testing rows: {len(test_df):,}")

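# Save both files into dataset/, where the feature step reads them from
import os
os.makedirs('dataset', exist_ok=True)
train_df.to_csv('dataset/train_data.csv', index=False)
test_df.to_csv('dataset/test_data.csv', index=False)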

print("\n✓ Saved train_data.csv")
print("✓ Saved test_data.csv")

Total participants: 857
Training participants: 685
Testing participants: 172

Training rows: 457,351
Testing rows: 117,736

✓ Saved train_data.csv
✓ Saved test_data.csv
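
Because some participants have several tests, splitting by participant ID is what keeps one person's data from ending up on both sides. A quick sanity check for that could look like the sketch below, which runs on a small made-up frame and is not tied to the real files.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 5 participants, some with more than one test.
df = pd.DataFrame({
    "ID":      [1, 1, 2, 3, 3, 4, 5, 5],
    "ID_test": ["1_1", "1_2", "2_1", "3_1", "3_2", "4_1", "5_1", "5_2"],
})

train_ids, test_ids = train_test_split(
    df["ID"].unique(), test_size=0.2, random_state=42
)

train_df = df[df["ID"].isin(train_ids)]
test_df = df[df["ID"].isin(test_ids)]

# No participant, and therefore no test, should appear on both sides.
shared_participants = set(train_df["ID"]) & set(test_df["ID"])
shared_tests = set(train_df["ID_test"]) & set(test_df["ID_test"])
assert not shared_participants and not shared_tests

# Every row should land in exactly one split.
assert len(train_df) + len(test_df) == len(df)
```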


In [None]:
# Create new per-test features from HR and Speed
import pandas as pd

print("=" * 60)
print("STEP 3: CREATING FEATURES FOR BOTH TRAIN AND TEST")
print("=" * 60)

# Function to add features
def add_features(df):
    df = df.copy()
    
    # Sort by ID_test and time to ensure correct order
    df = df.sort_values(['ID_test', 'time']).reset_index(drop=True)
    
    # Process each test separately
    for test_id in df['ID_test'].unique():
        mask = df['ID_test'] == test_id
        
        # 1. HR_change: how much HR changed since the previous measurement
        #    (rows are breath-by-breath, not one per second)
        df.loc[mask, 'HR_change'] = df.loc[mask, 'HR'].diff()
        
        # 2. Speed_change: how much speed changed since the previous measurement
        df.loc[mask, 'Speed_change'] = df.loc[mask, 'Speed'].diff()
        
        # 3. HR_rolling_mean: average HR over the last 30 measurements
        #    (window=30 counts rows, so this is not a 30-second window)
        df.loc[mask, 'HR_rolling_mean'] = df.loc[mask, 'HR'].rolling(window=30, min_periods=1).mean()
        
        # 4. HR_rolling_std: variability in HR over the last 30 measurements
        df.loc[mask, 'HR_rolling_std'] = df.loc[mask, 'HR'].rolling(window=30, min_periods=1).std()
        
        # 5. HR_deviation: how far the current HR is from its rolling mean
        df.loc[mask, 'HR_deviation'] = df.loc[mask, 'HR'] - df.loc[mask, 'HR_rolling_mean']
    
    # Fill NaN values with 0. Note that this also replaces missing HR
    # readings with 0, not only the NaNs produced by diff() and std().
    df = df.fillna(0)
    
    return df

# Process training data (saved into dataset/, where the normalization step reads it)
print("\nProcessing TRAIN data...")
train_df = pd.read_csv('dataset/train_data.csv')
train_with_features = add_features(train_df)
train_with_features.to_csv('dataset/train_data_with_features.csv', index=False)
print(f"✓ Train rows: {len(train_with_features):,}")
print("✓ Saved train_data_with_features.csv")

# Process test data
print("\nProcessing TEST data...")
test_df = pd.read_csv('dataset/test_data.csv')
test_with_features = add_features(test_df)
test_with_features.to_csv('dataset/test_data_with_features.csv', index=False)
print(f"✓ Test rows: {len(test_with_features):,}")
print("✓ Saved test_data_with_features.csv")

print("\nNew columns added:")
print(['HR_change', 'Speed_change', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation'])

print("\nSample from train data:")
print(train_with_features[['time', 'HR', 'HR_change', 'HR_rolling_mean', 'HR_deviation']].head(10))

STEP 3: CREATING FEATURES FOR BOTH TRAIN AND TEST

Processing TRAIN data...
✓ Train rows: 457,351
✓ Saved train_data_with_features.csv

Processing TEST data...
✓ Test rows: 117,736
✓ Saved test_data_with_features.csv

New columns added:
['HR_change', 'Speed_change', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation']

Sample from train data:
   time    HR  HR_change  HR_rolling_mean  HR_deviation
0     0   0.0        0.0         0.000000      0.000000
1     2   0.0        0.0         0.000000      0.000000
2     5  54.0        0.0        54.000000      0.000000
3     7   0.0        0.0        54.000000      0.000000
4     9  91.0        0.0        72.500000     18.500000
5    12  93.0        2.0        79.333333     13.666667
6    14  94.0        1.0        83.000000     11.000000
7    16  95.0        1.0        85.400000      9.600000
8    18  94.0       -1.0        86.833333      7.166667
9    21  93.0       -1.0        87.714286      5.285714
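
The per-test loop in `add_features` builds a boolean mask for every test, which gets slow with hundreds of tests. A possible alternative is to let `groupby` do the per-test bookkeeping. The sketch below is meant to compute the same row-based features; the function name `add_features_grouped` and the demo data are made up for illustration. Like the loop version, it counts rows rather than seconds and its final `fillna(0)` also zero-fills missing HR readings.

```python
import pandas as pd

def add_features_grouped(df, window=30):
    """Row-based features per test, computed with groupby instead of a loop."""
    df = df.sort_values(["ID_test", "time"]).reset_index(drop=True)
    by_test = df["ID_test"]
    hr = df["HR"].groupby(by_test)

    df["HR_change"] = hr.diff()
    df["Speed_change"] = df["Speed"].groupby(by_test).diff()
    # transform keeps the original row index, so results align without a merge.
    df["HR_rolling_mean"] = hr.transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
    df["HR_rolling_std"] = hr.transform(
        lambda s: s.rolling(window, min_periods=1).std()
    )
    df["HR_deviation"] = df["HR"] - df["HR_rolling_mean"]
    return df.fillna(0)

# Made-up example with two short tests.
demo = pd.DataFrame({
    "ID_test": ["a", "a", "a", "b", "b"],
    "time":    [0, 2, 4, 0, 3],
    "HR":      [60.0, 70.0, 80.0, 100.0, 110.0],
    "Speed":   [5.0, 5.0, 6.0, 5.0, 5.0],
})
out = add_features_grouped(demo)
print(out)
```

In the demo, the first row of test `b` gets `HR_change` 0 rather than 20, which shows that differences do not leak across test boundaries.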


In [9]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import pickle

print("=" * 60)
print("STEP 4: NORMALIZING DATA")
print("=" * 60)

# Load data with features
train_df = pd.read_csv('dataset/train_data_with_features.csv')
test_df = pd.read_csv('dataset/test_data_with_features.csv')

print(f"\nTrain rows: {len(train_df):,}")
print(f"Test rows: {len(test_df):,}")

# Features we want to normalize
features_to_normalize = [
    'HR', 
    'Speed', 
    'HR_change', 
    'Speed_change', 
    'HR_rolling_mean', 
    'HR_rolling_std', 
    'HR_deviation'
]

print(f"\nNormalizing these columns: {features_to_normalize}")

# Create scaler
scaler = StandardScaler()

# FIT the scaler ONLY on training data
scaler.fit(train_df[features_to_normalize])

print("\nScaler fitted on training data")
print("Mean values learned:", scaler.mean_)
print("Std values learned:", scaler.scale_)

# TRANSFORM both train and test using the same scaler
train_df[features_to_normalize] = scaler.transform(train_df[features_to_normalize])
test_df[features_to_normalize] = scaler.transform(test_df[features_to_normalize])

print("\n✓ Both datasets normalized")

# Show a sample of the normalized values
print("\nSample normalized values (train data):")
print(train_df[['HR', 'Speed', 'HR_change', 'HR_deviation']].head(10))

# Save normalized data
train_df.to_csv('dataset/train_data_normalized.csv', index=False)
test_df.to_csv('dataset/test_data_normalized.csv', index=False)

print("\n✓ Saved train_data_normalized.csv")
print("✓ Saved test_data_normalized.csv")

# Save the scaler (IMPORTANT for later!)
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print("✓ Saved scaler.pkl (you'll need this for real-time predictions!)")

print("\n" + "=" * 60)
print("NORMALIZATION COMPLETE!")
print("=" * 60)

STEP 4: NORMALIZING DATA

Train rows: 457,351
Test rows: 117,736

Normalizing these columns: ['HR', 'Speed', 'HR_change', 'Speed_change', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation']

Scaler fitted on training data
Mean values learned: [1.46849982e+02 9.61521086e+00 4.78254120e-02 1.49884881e-03
 1.46103142e+02 4.05485994e+00 8.23860652e-01]
Std values learned: [32.6890873   4.52100952  1.78088176  0.50075904 33.07701309  3.55716185
  6.95052833]

✓ Both datasets normalized

Sample normalized values (train data):
         HR     Speed  HR_change  HR_deviation
0 -4.492324 -1.020836  -0.026855     -0.118532
1 -4.492324 -1.020836  -0.026855     -0.118532
2 -2.840397 -1.020836  -0.026855     -0.118532
3 -4.492324 -1.020836  -0.026855     -0.118532
4 -1.708521 -1.020836  -0.026855      2.543136
5 -1.647338 -1.020836   1.096184      1.847745
6 -1.616747 -1.020836   0.534665      1.464081
7 -1.586156 -1.020836   0.534665      1.262658
8 -1.616747 -1.020836  -0.588374      0.912565
9 -
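
The saved scaler is what lets new data be normalized with the training statistics later on. A minimal sketch of that round trip is shown below. It uses a shortened, made-up feature list and a temporary file instead of the real `scaler.pkl`, so the names and values are assumptions for illustration only.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import StandardScaler

features = ["HR", "HR_change"]  # shortened, hypothetical feature list

# Fit and save a scaler on made-up training values.
train = pd.DataFrame({"HR": [100.0, 120.0, 140.0], "HR_change": [-1.0, 0.0, 1.0]})
scaler_path = Path(tempfile.mkdtemp()) / "scaler.pkl"
with open(scaler_path, "wb") as f:
    pickle.dump(StandardScaler().fit(train[features]), f)

# Later, for example in a separate process: reload and reuse the statistics.
with open(scaler_path, "rb") as f:
    loaded = pickle.load(f)

# New rows must have the same columns, in the same order, as at fit time.
new_rows = pd.DataFrame({"HR": [120.0], "HR_change": [1.0]})
scaled = loaded.transform(new_rows[features])
print(scaled)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, and only from a trusted source.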

In [11]:
# Remove the speed-based columns so the model uses HR features only
import pandas as pd

print("=" * 60)
print("REMOVING SPEED COLUMNS")
print("=" * 60)

# Load the normalized data
train_df = pd.read_csv('dataset/train_data_normalized.csv')
test_df = pd.read_csv('dataset/test_data_normalized.csv')

print(f"\nOriginal columns: {train_df.columns.tolist()}")

# Remove Speed and Speed_change columns
columns_to_remove = ['Speed', 'Speed_change']

train_df = train_df.drop(columns=columns_to_remove)
test_df = test_df.drop(columns=columns_to_remove)

print(f"\nRemoved: {columns_to_remove}")
print(f"Remaining columns: {train_df.columns.tolist()}")

# Save back to same files
train_df.to_csv('dataset/train_data_normalized.csv', index=False)
test_df.to_csv('dataset/test_data_normalized.csv', index=False)

print(f"\n✓ Updated train_data_normalized.csv")
print(f"✓ Updated test_data_normalized.csv")

print("\nFeatures now available for model:")
feature_columns = ['HR', 'HR_change', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation']
print(feature_columns)
print(f"Total: {len(feature_columns)} features")

REMOVING SPEED COLUMNS

Original columns: ['time', 'Speed', 'HR', 'ID_test', 'ID', 'HR_change', 'Speed_change', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation']

Removed: ['Speed', 'Speed_change']
Remaining columns: ['time', 'HR', 'ID_test', 'ID', 'HR_change', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation']

✓ Updated train_data_normalized.csv
✓ Updated test_data_normalized.csv

Features now available for model:
['HR', 'HR_change', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation']
Total: 5 features
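
One thing to keep in mind: the scaler was fitted on seven columns, including `Speed` and `Speed_change`, so calling `transform` on the five remaining features alone would fail. One way around this, sketched below with a shortened made-up column list, is to pick the learned `mean_` and `scale_` entries for the kept columns and apply them by hand. Refitting a scaler on the HR-only columns of the unnormalized training data would be another option.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

fitted_columns = ["HR", "Speed", "HR_change"]  # shortened stand-in for the seven columns
model_columns = ["HR", "HR_change"]            # columns kept for the model

# Made-up training values.
train = pd.DataFrame({
    "HR": [100.0, 120.0, 140.0],
    "Speed": [5.0, 6.0, 7.0],
    "HR_change": [-1.0, 0.0, 1.0],
})
scaler = StandardScaler().fit(train[fitted_columns])

# Pick out the learned mean and scale for the kept columns by position.
idx = [fitted_columns.index(c) for c in model_columns]
mean = scaler.mean_[idx]
scale = scaler.scale_[idx]

# Normalize new HR-only rows with the training statistics.
new_rows = pd.DataFrame({"HR": [120.0], "HR_change": [1.0]})
scaled = (new_rows[model_columns].to_numpy() - mean) / scale
print(scaled)
```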


In [13]:
import pandas as pd
import numpy as np

print("=" * 60)
print("STEP 5: CREATING 60-SECOND WINDOWS (HR ONLY)")
print("=" * 60)

# Load normalized data
train_df = pd.read_csv('dataset/train_data_normalized.csv')
test_df = pd.read_csv('dataset/test_data_normalized.csv')

print(f"\nTrain rows: {len(train_df):,}")
print(f"Test rows: {len(test_df):,}")

# Features to use in windows (only HR-based, no Speed!)
feature_columns = [
    'HR', 
    'HR_change', 
    'HR_rolling_mean', 
    'HR_rolling_std', 
    'HR_deviation'
]

print(f"\nUsing {len(feature_columns)} features per timestep")
print(f"Features: {feature_columns}")

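# Window size, counted in rows: 60 consecutive breath-by-breath
# measurements, which is not the same as 60 seconds
WINDOW_SIZE = 60

def create_windows(df, window_size):
    """Create overlapping windows of `window_size` rows (stride 1) within each test.

    Uses the module-level `feature_columns` list. Tests with fewer rows
    than `window_size` produce no windows.
    """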
    
    windows = []
    test_ids = []
    
    # Process each test separately
    for test_id in df['ID_test'].unique():
        test_data = df[df['ID_test'] == test_id][feature_columns].values
        
        # Create sliding windows
        for i in range(len(test_data) - window_size + 1):
            window = test_data[i:i + window_size]
            windows.append(window)
            test_ids.append(test_id)
    
    return np.array(windows), test_ids

# Create windows for training data
print("\nCreating windows for TRAIN data...")
X_train, train_test_ids = create_windows(train_df, WINDOW_SIZE)

print(f"✓ Created {len(X_train):,} training windows")
print(f"  Each window shape: {X_train[0].shape} (60 timesteps × 5 features)")

# Create windows for test data
print("\nCreating windows for TEST data...")
X_test, test_test_ids = create_windows(test_df, WINDOW_SIZE)

print(f"✓ Created {len(X_test):,} test windows")
print(f"  Each window shape: {X_test[0].shape} (60 timesteps × 5 features)")

# Save as numpy arrays
np.save('dataset/X_train.npy', X_train)
np.save('dataset/X_test.npy', X_test)

print("\n✓ Saved X_train.npy")
print("✓ Saved X_test.npy")

print("\n" + "=" * 60)
print("WINDOW CREATION COMPLETE!")
print("=" * 60)
print(f"\nReady for LSTM training!")
print(f"Training samples: {len(X_train):,}")
print(f"Test samples: {len(X_test):,}")
print(f"Window shape: (60 timesteps, 5 features)")

STEP 5: CREATING 60-SECOND WINDOWS (HR ONLY)

Train rows: 457,351
Test rows: 117,736

Using 5 features per timestep
Features: ['HR', 'HR_change', 'HR_rolling_mean', 'HR_rolling_std', 'HR_deviation']

Creating windows for TRAIN data...
✓ Created 410,623 training windows
  Each window shape: (60, 5) (60 timesteps × 5 features)

Creating windows for TEST data...
✓ Created 105,936 test windows
  Each window shape: (60, 5) (60 timesteps × 5 features)

✓ Saved X_train.npy
✓ Saved X_test.npy

WINDOW CREATION COMPLETE!

Ready for LSTM training!
Training samples: 410,623
Test samples: 105,936
Window shape: (60 timesteps, 5 features)
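
For a single test, the same stride-1 windows can be obtained without a Python loop through NumPy's `sliding_window_view`. The sketch below uses made-up feature values and a window of 4 rows to keep the output small; the helper name `windows_for_one_test` is hypothetical. It also shows how many seconds each row-count window spans, using the ten `time` values from the Step 1 sample output.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windows_for_one_test(values, window_size):
    """values: (n_rows, n_features) array for a single test."""
    if len(values) < window_size:
        return np.empty((0, window_size, values.shape[1]), dtype=values.dtype)
    # The window axis is appended last: (n_windows, n_features, window_size).
    view = sliding_window_view(values, window_shape=window_size, axis=0)
    return view.transpose(0, 2, 1)  # -> (n_windows, window_size, n_features)

window_size = 4
values = np.arange(20, dtype=float).reshape(10, 2)  # made-up: 10 rows, 2 features
windows = windows_for_one_test(values, window_size)
print(windows.shape)  # (7, 4, 2)

# Seconds covered by each window, from the first ten time values of test 2_1.
times = np.array([0, 2, 4, 7, 9, 11, 14, 16, 17, 19])
span_seconds = times[window_size - 1:] - times[:len(times) - window_size + 1]
print(span_seconds)  # [7 7 7 7 7 6 5]
```

The result is a read-only view onto the original array, so stacking several tests with `np.concatenate` is what actually allocates the memory. The span output illustrates that a fixed number of rows covers a varying number of seconds.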
