# 🫀 ECG Feature Extraction for Cardiac Modeling

This notebook processes raw ECG waveforms using the `wfdb` library, extracts meaningful signal features, and saves them for use in downstream cardiac recovery prediction models.

It forms the first step in a pipeline that ultimately helps clinicians predict recovery outcomes after cardiac interventions using wearable ECG data.


### 📦 Step 1: Import Libraries for ECG Signal Processing

We load the libraries required for:
- `wfdb`: Reading and manipulating raw ECG waveform files
- `numpy` & `pandas`: Data handling and tabular output
- `sklearn`: Feature scaling (`StandardScaler`), train/test splitting (`train_test_split`), and a baseline `RandomForestClassifier`
- `os`: For file and path management

📌 These tools enable transformation of raw biosignal data into structured numerical features that are **interpretable by ML models**.


In [1]:
import wfdb
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import os

### 📁 Step 2: Set ECG Data Directory

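We define the path to the directory where all ECG waveform files are stored.

- This is the folder containing `.dat`, `.hea`, or other `wfdb`-compatible ECG records.
- The same cell defines `load_ecg_data`, which reads one record and pads or truncates it to `target_length` samples (5000 by default), loads every `.dat` record in the folder, and defines `extract_features` for use in Step 3.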

🧠 *Note*: In real deployments, this could point to a cloud drive, hospital EMR integration folder, or local device storage on a wearable.
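
A WFDB record is normally a header file (`.hea`) plus one or more signal files (such as `.dat`), and `wfdb.rdsamp` reads the header to locate the signal. The sketch below lists only the record names that have both files, so an incomplete record is skipped up front. The helper name `find_complete_records` is hypothetical, and the layout it assumes (header and signal file sharing one base name in a single folder) is an assumption rather than something this notebook checks.

```python
from pathlib import Path


def find_complete_records(data_dir):
    """Return sorted record names that have both a .dat and a .hea file.

    Hypothetical helper: assumes each record's header and signal file
    share one base name and sit directly inside `data_dir`.
    """
    data_dir = Path(data_dir)
    dat_names = {p.stem for p in data_dir.glob("*.dat")}
    hea_names = {p.stem for p in data_dir.glob("*.hea")}
    # Keep only names present in both sets
    return sorted(dat_names & hea_names)
```

Calling `find_complete_records(ec_data_path)` would be one way to build `ec_record_names` instead of filtering `os.listdir` on the `.dat` suffix alone.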


In [2]:
# Path to ECG data folder
ec_data_path = r"D:\AI_finaltrial\project\data\wearabledata\ecg"

# Function to load and extract ECG signals
def load_ecg_data(record_name, path=ec_data_path, target_length=5000):
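    # rdsamp returns (signals, fields); signals has shape (n_samples, n_channels)
    signals, _fields = wfdb.rdsamp(os.path.join(path, record_name))
    # Note: on a multi-channel record, flatten() interleaves the channels sample by sample
    signal = signals.flatten()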
    
    # Pad or truncate to ensure fixed length
    if len(signal) < target_length:
        signal = np.pad(signal, (0, target_length - len(signal)), 'constant')
    else:
        signal = signal[:target_length]
    
    return signal

# Process all ECG files
ec_record_names = [f[:-4] for f in os.listdir(ec_data_path) if f.endswith(".dat")]
ecgs = []
patient_ids = []

for record_name in ec_record_names:
    ecgs.append(load_ecg_data(record_name))
    patient_ids.append(record_name.split("_")[0])  # Extracting patient ID

# Convert to NumPy array
ecgs = np.array(ecgs)

# Extract important features (Mean, Std, Max, Min, Signal Energy)
def extract_features(ecg_data):
    features = []
    for signal in ecg_data:
        mean = np.mean(signal)
        std = np.std(signal)
        max_val = np.max(signal)
        min_val = np.min(signal)
        energy = np.sum(np.square(signal))
        features.append([mean, std, max_val, min_val, energy])
    return np.array(features)

### 🧮 Step 3: Feature Extraction from ECG Signals

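This is the core engine of the notebook:
- `load_ecg_data` (defined in the previous cell) has already read each `.dat` record with `wfdb` and padded or truncated it to a fixed length
- `extract_features` then computes five summary statistics per record:
  - Mean, Std Dev, Min/Max
  - **Signal energy**: the sum of squared sample values

The features are returned as a NumPy array, `ecgs_features`, and standardized with `StandardScaler`. They are placed in a `DataFrame` called `features_df` in Step 5. Skewness, kurtosis, and per-channel statistics are not computed by the code in this notebook.

📌 These descriptors summarize the amplitude level and voltage variability of each recording. They are coarse inputs for models of **post-operative arrhythmia risk** or **cardiac rehab readiness**, since none of them carries timing or rhythm information.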
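
A per-channel variant that adds skewness and kurtosis could look like the following sketch. It uses only NumPy and computes the plain (biased) moment estimates, which, as far as I know, correspond to the default settings of `scipy.stats.skew` and `scipy.stats.kurtosis`. The function name `channel_features` and the input layout `(n_samples, n_channels)` are assumptions made for illustration.

```python
import numpy as np


def channel_features(signals):
    """Compute per-channel statistics for an array of shape (n_samples, n_channels).

    Returns one row per channel:
    [mean, std, min, max, skewness, excess kurtosis].
    """
    signals = np.asarray(signals, dtype=float)
    if signals.ndim == 1:
        signals = signals[:, np.newaxis]  # treat a 1-D signal as one channel

    mean = signals.mean(axis=0)
    std = signals.std(axis=0)  # population standard deviation (ddof=0)

    # Standardize each channel; a constant channel (std == 0) gets z = 0,
    # which yields skewness 0 and excess kurtosis -3 instead of NaN
    safe_std = np.where(std > 0, std, 1.0)
    z = (signals - mean) / safe_std

    skewness = (z ** 3).mean(axis=0)               # third standardized moment
    excess_kurtosis = (z ** 4).mean(axis=0) - 3.0  # 0 for a normal distribution

    return np.column_stack(
        [mean, std, signals.min(axis=0), signals.max(axis=0), skewness, excess_kurtosis]
    )
```

Keeping one row per channel avoids mixing leads together, at the cost of a feature table whose row count depends on how many channels each record has.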


In [3]:
# Extract features from ECG data
ecgs_features = extract_features(ecgs)

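# Normalize the extracted features
# Note: the scaler is fit on all records before the train/test split in Step 4,
# so test-set statistics influence the scaling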
scaler = StandardScaler()
ecgs_features = scaler.fit_transform(ecgs_features)

# Example labels (replace with real recovery outcomes)
y = np.random.randint(0, 2, len(ecgs_features))

### 🔀 Step 4: Split Features into Train/Test Sets

Once features are extracted, we split them into:
- `X_train`, `X_test`: Input ECG signal features
- `y_train`, `y_test`: Corresponding outcome labels (e.g., recovery classification, heart risk score)

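An 80/20 split (`test_size=0.2`) trains the model on most of the data and holds out the rest for evaluation. The same cell then fits a `RandomForestClassifier` on the training portion.

📌 *Assumption*: The labels `y` line up with the record names. In this notebook `y` is filled with random 0/1 placeholders in the previous cell, so replace it with real recovery outcomes before evaluating the model.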
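
In the cells of this notebook the scaler is fit before the split. To keep the held-out records from influencing the scaling, one option is to split first and fit the scaler on the training rows only. The sketch below uses synthetic stand-in data; `raw_features` and `labels` are placeholder names, not objects defined in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the unscaled feature matrix and the outcome labels
rng = np.random.default_rng(42)
raw_features = rng.normal(loc=5.0, scale=2.0, size=(50, 5))
labels = np.array([0, 1] * 25)

# Split first; stratify keeps the class balance similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    raw_features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit the scaler on the training rows only, then apply it to both subsets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The test rows are scaled with the training mean and standard deviation, so their scaled values will not be exactly zero-mean, which is expected.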


In [4]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(ecgs_features, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

### 💾 Step 5: Export Features to CSV

We save the standardized ECG features, together with each record's patient ID, to `ecg_features1.csv` for:
- Downstream machine learning model training
- Manual review or inspection
- External integration (e.g., clinical dashboards, cloud model ingestion)

📌 *Use*: This file is often fed directly into the **cardiac sub-model** used in Hybrid Models V1–V3.
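
When the exported file is read back, `pandas.read_csv` infers column types, so an ID such as `007` would come back as the integer 7 unless the column is read as text. The sketch below is a round trip through a temporary file; the IDs, values, and file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up rows in the same column layout as the exported features
example_df = pd.DataFrame(
    {
        "Patient_ID": ["007", "012"],
        "Mean": [0.1, -0.2],
        "Std": [1.0, 0.9],
        "Max": [2.5, 2.1],
        "Min": [-2.0, -1.8],
        "Energy": [10.0, 8.5],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "ecg_features_example.csv")
    example_df.to_csv(csv_path, index=False)

    # Read Patient_ID as text so leading zeros survive the round trip
    restored = pd.read_csv(csv_path, dtype={"Patient_ID": str})
```

Any downstream consumer that joins on `Patient_ID` would need the same `dtype` setting for the keys to match.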


In [5]:
# Save features to CSV
features_df = pd.DataFrame(ecgs_features, columns=["Mean", "Std", "Max", "Min", "Energy"])
features_df.insert(0, "Patient_ID", patient_ids)  # Add patient ID
features_df.to_csv(r"D:\AI_finaltrial\finalmodels\ecg_features1.csv", index=False)

print("ECG Feature Extraction and Classification Complete! Features saved to ecg_features1.csv")


ECG Feature Extraction and Classification Complete! Features saved to ecg_features1.csv
