## EMF Signal Processing Pipeline for the Special Case (Volunteer) - Feature Construction

**Signal Normalization:**
- Per-subject z-score normalization: standardizes each signal channel individually per patient using mean and standard deviation
- In principle, this removes sensor-placement and session-specific baseline differences while preserving physiological signal patterns
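
As a rough illustration of per-channel z-scoring, the sketch below standardizes one channel with NumPy and leaves NaN gaps in place. The function name `zscore_channel` and the sample values are hypothetical. The notebook itself uses scikit-learn's `StandardScaler`, which also divides by the population standard deviation.

```python
import numpy as np

def zscore_channel(x):
    """Standardize one channel to zero mean and unit variance, ignoring NaNs."""
    x = np.asarray(x, dtype=float)
    mean = np.nanmean(x)
    std = np.nanstd(x)  # population standard deviation (ddof=0)
    if std == 0:
        # A constant channel has no variation to scale; only center it
        return x - mean
    return (x - mean) / std

hand1 = np.array([1.0, 2.0, np.nan, 3.0, 4.0])  # made-up samples
z = zscore_channel(hand1)
```

NaN entries stay NaN in the output, so gaps in the recording remain visible to later steps.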

**Signal Windowing:**
- 60-second windows with a 15-second overlap between consecutive windows (45-second step)
- Maintains temporal context for frequency analysis while enabling sufficient training samples
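
The window arithmetic can be sketched as follows, assuming the 25 Hz downsampled signals used later in this notebook. The helper name `window_starts` and the 10-minute recording length are illustrative assumptions.

```python
def window_starts(n_samples, fs=25, window_s=60, overlap_s=15):
    """Return the start index of every full window that fits in n_samples."""
    window = int(window_s * fs)          # 1500 samples per window at 25 Hz
    step = window - int(overlap_s * fs)  # 1125 samples between window starts
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    return list(range(0, n_samples - window + 1, step))

# A hypothetical 10-minute recording at 25 Hz (15000 samples)
starts = window_starts(n_samples=10 * 60 * 25)
print(len(starts), starts[:3])
```

A trailing partial window is dropped, so the last few seconds of a recording may not contribute a sample.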

**Feature Extraction:**
- **Time-domain features:** Mean, std, RMS, min, max, range, skewness, kurtosis, MAD, and Hjorth parameters (Activity, Mobility, Complexity)
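- **Frequency-domain features:** Welch band power in the bands defined by `freq_ranges` in the code below (0.05-0.1, 0.1-0.5, 0.5-2, 2-5, and 5-10 Hz), plus total power from 0.01 Hz up to the Nyquist frequency; the Welch settings are chosen with the low 0.01-0.02 Hz range in mind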
- **Wavelet features:** 6-level decomposition energies with frequency band descriptions for multi-resolution analysis
- **Entropy measures:** Shannon entropy for signal complexity quantification
- **Patient metadata:** Age, weight, height, sex, and BMI calculated from patient demographics (patient metadata is largely unavailable for the volunteer case, so these features may go unused)
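
To make the multi-resolution idea concrete, the sketch below lists the nominal frequency band of each output of a 6-level discrete wavelet decomposition at 25 Hz. The helper `dwt_bands` is hypothetical. The bands are the ideal dyadic ones; real wavelet filters such as `db4` overlap somewhat between neighbouring levels.

```python
def dwt_bands(fs, levels):
    """Nominal band (Hz) of each DWT output, ordered as [A_n, D_n, ..., D_1]."""
    nyquist = fs / 2
    # The approximation keeps everything below nyquist / 2**levels
    bands = [(f"A{levels}", 0.0, nyquist / 2 ** levels)]
    # Each detail level halves the band of the level above it
    for level in range(levels, 0, -1):
        bands.append((f"D{level}", nyquist / 2 ** level, nyquist / 2 ** (level - 1)))
    return bands

for name, lo, hi in dwt_bands(fs=25, levels=6):
    print(f"{name}: {lo:.3f}-{hi:.3f} Hz")
```

The ordering matches the coefficient list returned by `pywt.wavedec`, which starts with the approximation and ends with the finest detail level.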

**Feature Selection & Dimensionality Reduction:**
- **PCA reduction is not applied** at this step
- **All domain features are retained** for comprehensive analysis

**Target Construction:**
- **Glucose regression:** 10-minute lag-corrected CGM values to account for sensor delay
- **Glycemic state prediction:** 15-minute ahead glycemic state classification (hypoglycemic <70 mg/dL, normal 70-180 mg/dL, hyperglycemic >180 mg/dL)
- **Hypoglycemia flag:** Binary flag (1) if glucose < 75 mg/dL within next 900 seconds (15 minutes)
- **Hyperglycemia flag:** Binary flag (1) if glucose > 180 mg/dL within next 900 seconds (15 minutes)
- **Metabolic state tracking:** Fasting, First Insulin, Ensure, Second Insulin phases based on experimental protocol events
- Time-based nearest neighbor matching between feature windows and glucose measurements
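
One way to express the lag correction and nearest-timestamp matching is `pandas.merge_asof`. This is a sketch with made-up timestamps and glucose values, not the loop used later in this notebook; the column name `lookup_time` is an assumption.

```python
import pandas as pd

# Hypothetical feature windows (end times) and CGM readings every 5 minutes
windows = pd.DataFrame({
    "end_time": pd.date_range("2024-01-01 10:01", periods=2, freq="5min"),
})
cgm = pd.DataFrame({
    "time": pd.date_range("2024-01-01 10:00", periods=6, freq="5min"),
    "Glucose": [100.0, 104.0, 110.0, 118.0, 125.0, 130.0],
})

lag = pd.Timedelta(minutes=10)

# Shift the lookup time forward by the assumed sensor lag,
# then take the nearest CGM reading within a 15-minute tolerance
windows["lookup_time"] = windows["end_time"] + lag
matched = pd.merge_asof(
    windows.sort_values("lookup_time"),
    cgm.sort_values("time"),
    left_on="lookup_time",
    right_on="time",
    direction="nearest",
    tolerance=pd.Timedelta(minutes=15),
)
lagged_glucose = matched["Glucose"].tolist()  # readings taken at 10:10 and 10:15
```

Both frames must be sorted on their time keys before the call. Rows without a reading inside the tolerance receive NaN rather than a fallback value.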

**Output:**
- **Combined multi-patient dataset** with comprehensive feature set (all domain features)

In [1]:
# General imports:

# Disable warnings:
import warnings

warnings.filterwarnings('ignore')

# Essential imports
import pandas as pd
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import os
import gc

import json

from tqdm import tqdm
from sklearn.preprocessing import StandardScaler

# Plotting enhancements
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
plt.rcParams['image.cmap'] = 'turbo'  # Set default colormap to turbo for all images
plt.style.use('seaborn-v0_8-whitegrid')

In [2]:
# Key Settings:

# Physical constants and sensors specifications:
SENSITIVITY = 50  # mV/nT
MAGNETIC_NOISE = 3  # pT/√Hz @ 1 Hz
MAX_AC_LINEARITY = 250  # nT (+/- 250 nT) - Equivalent to 12.5 V at 50 mV/nT
MAX_DC_LINEARITY = 60  # nT (+/- 60 nT)
VOLTAGE_LIMIT = 15 # V (+/-15V)
CONVERSION_FACTOR = 20  # nT per 1V
SAMPLING_FREQUENCY = 5000  # Hz - expected from the experimental data
SENSOR_SATURATION = 250  # nT - saturation threshold for the sensor

# Subjects and their types
# Subject = {"Normal": "Normal Subjects","Clamp": "T1DM Clamp Subjects", "Additional": "Additional Subjects"}

# Path and directories
base_dir = Path("../../../Data")

# Output directory for saving results
output_dir = base_dir / "ProcessedData"
os.makedirs(output_dir, exist_ok=True)

# Directory for saving processed/downsampled signal files (parquet format)
signals_dir = output_dir / "Signal_Files"
os.makedirs(signals_dir, exist_ok=True)

# Labels directory
labels_dir = base_dir / "RawData"
labels_filename = "FilteredLabels.xlsx"

# Patients data file
patients_file = "patients.json"

# Patient Data
with open(labels_dir / patients_file, 'r') as f:
    # Load the JSON data
    patients_data = json.load(f)

# GMT zone correction for the sensor = GMT+2
GMT = 2

# Channel grouping (special case for volunteer data)
signal_channels = {
    'Hand': ['Hand1', 'Hand2'],
}

# Define glucose lag and glycemic prediction parameters
GLUCOSE_LAG_MINUTES = 10  # Glucose lag in minutes
GLYCEMIC_PREDICTION_MINUTES = 15  # Glycemic prediction in minutes
WINDOW_SIZE_MINUTES = 1  # Window size in minutes
WINDOW_OVERLAP_MINUTES = 0.25  # Overlap size in minutes
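
The sensitivity (50 mV/nT) and the conversion factor (20 nT per volt) in the settings above describe the same scale. A minimal sketch of the conversion, with hypothetical helper names:

```python
SENSITIVITY_MV_PER_NT = 50                   # mV/nT, as in the settings above
NT_PER_VOLT = 1000 / SENSITIVITY_MV_PER_NT   # 1000 mV / (50 mV/nT) = 20 nT per volt

def volts_to_nanotesla(volts):
    """Convert a sensor output voltage to magnetic flux density in nT."""
    return volts * NT_PER_VOLT

def nanotesla_to_volts(nanotesla):
    """Convert magnetic flux density in nT to the corresponding sensor voltage."""
    return nanotesla / NT_PER_VOLT

print(nanotesla_to_volts(250))  # voltage at the 250 nT linearity limit
print(volts_to_nanotesla(15))   # field corresponding to the 15 V limit
```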


#### Define the function to detect metabolic and glycemic state transitions from the CGM/stick glucose labels

In [3]:
def detect_all_transitions(df_labels):
    """
    Detect both metabolic state transitions and glycemic state transitions.

    Glycemic states:
    - Hypoglycemia: < 70 mg/dL
    - Normal: 70-180 mg/dL
    - Hyperglycemia: > 180 mg/dL
    """
    # Initialize transitions dictionary
    transitions = {}

    # Ensure data is sorted by time
    df_sorted = df_labels.sort("time")

    # Extract metabolic state transitions (already computed)
    metabolic_states = ["Fasting", "First Insulin", "Ensure", "Second Insulin"]

    # Find the first occurrence of each state
    for state in metabolic_states:
        state_rows = df_sorted.filter(pl.col('state') == state)
        if len(state_rows) > 0:
            transitions[f"Metabolic: {state}"] = state_rows['time'].item(0)

    # Filter out rows with null glucose values and add glycemic state
    df_glycemic = df_sorted.filter(
        pl.col('Glucose').is_not_null() &
        ~pl.col('Glucose').is_nan()
    ).with_columns([
        pl.when(pl.col('Glucose') < 70)
          .then(pl.lit("Hypoglycemia"))
          .when(pl.col('Glucose') <= 180)
          .then(pl.lit("Normal"))
          .otherwise(pl.lit("Hyperglycemia"))
          .alias("glycemic_state")
    ])

    # Record the initial glycemic state
    if len(df_glycemic) > 0:
        initial_state = df_glycemic['glycemic_state'].item(0)
        initial_time = df_glycemic['time'].item(0)
        transitions[f"Initial Glycemic State: {initial_state}"] = initial_time

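    # Detect transitions by scanning consecutive readings.
    # Each transition name is a dictionary key, so a repeated transition
    # keeps only the timestamp of its latest occurrence.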
    previous_state = None
    for row in df_glycemic.iter_rows(named=True):
        current_state = row['glycemic_state']
        current_time = row['time']

        if previous_state is not None and current_state != previous_state:
            transition_name = f"Glycemic: {previous_state} to {current_state}"
            transitions[transition_name] = current_time

        previous_state = current_state

    return transitions
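
Because `detect_all_transitions` stores results in a dictionary keyed by transition name, a transition that happens more than once overwrites its earlier timestamp. If every occurrence is needed, a list can be collected instead. The sketch below uses plain Python with made-up readings; `list_state_changes` is a hypothetical helper, not part of the notebook.

```python
def list_state_changes(times, states):
    """Return every (time, previous_state, new_state) change, in order."""
    changes = []
    previous = None
    for t, s in zip(times, states):
        if previous is not None and s != previous:
            changes.append((t, previous, s))
        previous = s
    return changes

# Hypothetical readings (minutes): glucose dips below 70 mg/dL twice
times = [0, 5, 10, 15, 20]
states = ["Normal", "Hypoglycemia", "Normal", "Hypoglycemia", "Normal"]
changes = list_state_changes(times, states)
print(len(changes))
```

With these readings the list holds four changes, whereas a dictionary keyed by transition name would hold two.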

#### Define the function to load the downsampled signals

In [4]:
# Load the downsampled data from parquet file
def load_downsampled_data(patient_name, signals_dir):
    """
    Load downsampled data from parquet file.

    Parameters:
    - patient_name: name of the patient
    - signals_dir: directory containing the parquet signal files

    Returns:
    - df_loaded: polars DataFrame with loaded data
    """
    # Create filename based on patient name
    patient_safe_name = patient_name.replace(" ", "_").replace("#", "")
    output_filename = f"{patient_safe_name}_downsampled_25hz.parquet"
    output_path = signals_dir / output_filename

    try:
        if output_path.exists():
            df_loaded = pl.read_parquet(output_path)
            print(f"Successfully loaded data from {output_filename}")
            print(f"Loaded data shape: {df_loaded.shape}")
            return df_loaded
        else:
            print(f"File not found: {output_path}")
            return None
    except Exception as e:
        print(f"Error loading parquet file: {e}")
        return None

#### Define the function to apply z-score normalization

In [5]:
def apply_zscore_normalization(df_signal, patient_name, global_scalers=None):
    """
    Apply z-score (StandardScaler) normalization to all signal channels per channel,
    followed by cross-session joint normalization.

    Parameters:
    - df_signal: polars DataFrame containing the signal data
    - patient_name: name of the patient
    - global_scalers: dictionary of pre-fitted StandardScaler objects for cross-session normalization

    Returns:
    - df_normalized: polars DataFrame with normalized signals
    - channel_scalers: dictionary of fitted StandardScaler objects for this patient
    """
    if df_signal is None:
        print("No signal data provided for normalization")
        return None, None

    print(f"Applying z-score normalization for {patient_name}")

    # Find the time column
    time_col = None
    for col in df_signal.columns:
        if 'time' in col.lower():
            time_col = col
            break

    if time_col is None:
        print("Error: No time column found in the data")
        return None, None

    # Convert to pandas temporarily for easier processing
    df_pandas = df_signal.to_pandas()

    # Get all signal columns (exclude time column)
    signal_columns = [col for col in df_pandas.columns if col != time_col]

    print(f"Processing {len(signal_columns)} signal channels")

    # Step 1: Per-channel z-score normalization
    df_normalized = df_pandas.copy()
    channel_scalers = {}

    for col in signal_columns:
        if df_pandas[col].dtype in ['float64', 'float32', 'int64', 'int32']:
            # Remove NaN values for fitting
            valid_data = df_pandas[col].dropna()

            if len(valid_data) == 0:
                print(f"Warning: No valid data for channel {col}")
                continue

            # Fit StandardScaler on this channel
            scaler = StandardScaler()

            # Reshape for sklearn (needs 2D array)
            data_reshaped = valid_data.values.reshape(-1, 1)
            scaler.fit(data_reshaped)

            # Transform the entire column (including NaN values)
            # Handle NaN values by only transforming non-NaN entries
            transformed_data = df_pandas[col].copy()
            non_nan_mask = ~df_pandas[col].isna()

            if non_nan_mask.sum() > 0:
                transformed_data[non_nan_mask] = scaler.transform(
                    df_pandas[col][non_nan_mask].values.reshape(-1, 1)
                ).flatten()

            df_normalized[col] = transformed_data
            channel_scalers[col] = scaler

            # Handle both scalar and array cases for mean_ and scale_
            mean_val = scaler.mean_[0] if hasattr(scaler.mean_, '__len__') else scaler.mean_
            scale_val = scaler.scale_[0] if hasattr(scaler.scale_, '__len__') else scaler.scale_
            print(f"Channel {col}: mean={mean_val:.4f}, std={scale_val:.4f}")

    # Step 2: Cross-session joint normalization (if global scalers provided)
    if global_scalers is not None:
        print(f"Applying cross-session joint normalization")

        for col in signal_columns:
            if col in global_scalers and col in df_normalized.columns:
                # Apply global scaler to already per-channel normalized data
                non_nan_mask = ~df_normalized[col].isna()

                if non_nan_mask.sum() > 0:
                    df_normalized.loc[non_nan_mask, col] = global_scalers[col].transform(
                        df_normalized.loc[non_nan_mask, col].values.reshape(-1, 1)
                    ).flatten()

                print(f"Applied cross-session normalization to {col}")

    # Convert back to polars
    df_normalized = pl.from_pandas(df_normalized)

    print(f"Z-score normalization completed for {len(signal_columns)} channels")
    return df_normalized, channel_scalers


#### Define the functions of Feature Extraction

In [6]:
import scipy.stats as stats
from scipy.signal import welch
import pywt
from scipy.stats import entropy

def calculate_hjorth_parameters(signal):
    """
    Calculate Hjorth parameters (Activity, Mobility, Complexity)
    """
    # First derivative
    diff1 = np.diff(signal)
    # Second derivative
    diff2 = np.diff(diff1)

    # Variance calculations
    var_signal = np.var(signal)
    var_diff1 = np.var(diff1)
    var_diff2 = np.var(diff2)

    # Hjorth parameters
    activity = var_signal
    mobility = np.sqrt(var_diff1 / var_signal) if var_signal > 0 else 0
    complexity = np.sqrt(var_diff2 / var_diff1) / mobility if var_diff1 > 0 and mobility > 0 else 0

    return activity, mobility, complexity

def calculate_bandpower(signal, fs, freq_range):
    """
    Calculate power in a specific frequency band
    """
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 1024))

    # Find frequency indices
    freq_mask = (freqs >= freq_range[0]) & (freqs <= freq_range[1])

    if np.sum(freq_mask) == 0:
        return 0

    # Calculate power in band
    bandpower = np.trapezoid(psd[freq_mask], freqs[freq_mask])
    return bandpower

def calculate_entropy_measures(signal):
    """
    Calculate Shannon entropy only - removed approximate entropy
    """
    try:
        # Shannon entropy
        hist, _ = np.histogram(signal, bins=50)
        hist = hist[hist > 0]  # Remove zero bins
        shannon_entropy = entropy(hist)

        return shannon_entropy
    except Exception as e:
        print(f"Warning: Entropy calculation failed: {e}")
        return 0

def calculate_shannon_entropy(signal):
    """
    Calculate Shannon entropy
    """
    return calculate_entropy_measures(signal)

def calculate_bandpower_efficient(signal, fs, freq_ranges):
    """
    Calculate power in multiple frequency bands with better low-frequency resolution
    """
    try:
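        # For better low-frequency resolution, use longer segments.
        # The target resolution of ~0.005 Hz (for the 0.01-0.02 Hz band) needs
        # fs / 0.005 samples, i.e. 5000 samples (200 s) at fs=25 Hz. A 60-second
        # window holds only 1500 samples, so the resolution there is fs / 1500 ≈ 0.017 Hz.
        target_freq_resolution = 0.005
        ideal_nperseg = int(fs / target_freq_resolution)

        # Use the full signal length if possible, but cap at a reasonable limit
        nperseg = min(len(signal), ideal_nperseg, 8192)

        # Prefer at least 512 samples per segment, but never more than the signal holds
        nperseg = min(max(nperseg, 512), len(signal))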

        freqs, psd = welch(signal, fs=fs, nperseg=nperseg, noverlap=nperseg//2)

        # Calculate all band powers in one pass
        bandpowers = {}
        for band_name, freq_range in freq_ranges.items():
            freq_mask = (freqs >= freq_range[0]) & (freqs <= freq_range[1])

            if np.sum(freq_mask) == 0:
                # No PSD bins fall inside this band, so report zero power
                print(f"Warning: No frequencies found for band {band_name} ({freq_range})")
                bandpowers[band_name] = 0
            else:
                bandpowers[band_name] = np.trapezoid(psd[freq_mask], freqs[freq_mask])

        return bandpowers
    except Exception as e:
        print(f"Warning: Bandpower calculation failed: {e}")
        return {band_name: 0 for band_name in freq_ranges.keys()}

def calculate_wavelet_energies(signal, fs=25, wavelet='db4', levels=6):
    """
    Calculate wavelet energies with frequency band descriptions - optimized version
    """
    try:
        # Limit signal length for faster computation
        max_samples = 5000  # Limit to ~3 minutes at 25 Hz
        if len(signal) > max_samples:
            signal = signal[:max_samples]

        # Decompose signal using wavelet
        coeffs = pywt.wavedec(signal, wavelet, level=levels)

        # Calculate energy in each subband with frequency descriptions
        energies = []

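        # Nominal dyadic bands for a 6-level decomposition:
        # approximation = [0, nyquist/64], detail level k = [nyquist/2**k, nyquist/2**(k-1)].
        # For fs=25 Hz (nyquist = 12.5 Hz) the approximation spans 0-0.195 Hz and
        # detail level 1 spans 6.25-12.5 Hz. The label strings below correspond to
        # a Nyquist of 6.25 Hz, i.e. one octave below the actual bands at fs=25 Hz;
        # they are left unchanged because they become feature column names.
        nyquist = fs / 2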

        for i, coeff in enumerate(coeffs):
            energy = np.sum(np.square(coeff))

            # Map wavelet levels to approximate frequency bands
            if i == 0:  # Approximation coefficients (lowest frequencies)
                freq_desc = 'very_low_0_0.1'
            elif i == 1:  # Detail level 6
                freq_desc = 'low_0.1_0.2'
            elif i == 2:  # Detail level 5
                freq_desc = 'low_0.2_0.4'
            elif i == 3:  # Detail level 4
                freq_desc = 'mid_0.4_0.8'
            elif i == 4:  # Detail level 3
                freq_desc = 'mid_0.8_1.6'
            elif i == 5:  # Detail level 2
                freq_desc = 'high_1.6_3.1'
            elif i == 6:  # Detail level 1
                freq_desc = 'high_3.1_6.25'
            else:
                freq_desc = f'level_{i}'

            energies.append((energy, freq_desc))

        return energies
    except Exception as e:
        print(f"Warning: Wavelet calculation failed: {e}")
        return [(0, f'level_{i}') for i in range(levels + 1)]

def extract_features_from_window(window_data, patient_info, fs=25):
    """
    Extract comprehensive features from a single window - optimized version with cleaned features
    """
    features = {}

    # Find time column
    time_col = None
    for col in window_data.columns:
        if 'time' in col.lower():
            time_col = col
            break

    # Get signal columns (exclude time and background channels)
    signal_cols = [col for col in window_data.columns
                   if col != time_col and 'background' not in col.lower()]

    # Add patient metadata
    features['age'] = patient_info.get('age', 0)
    features['weight'] = patient_info.get('weight', 0)
    features['height'] = patient_info.get('height', 0)
    features['sex'] = 1 if patient_info.get('sex', '').lower() == 'male' else 0

    # Calculate BMI
    if features['weight'] > 0 and features['height'] > 0:
        features['bmi'] = features['weight'] / ((features['height'] / 100) ** 2)
    else:
        features['bmi'] = 0

    # Define frequency ranges for efficient computation
    freq_ranges = {
        '0_05_0_1': [0.05, 0.1],
        '0_1_0_5': [0.1, 0.5],
        '0_5_2': [0.5, 2],
        '2_5': [2, 5],
        '5_10': [5, 10],
        'total': [0.01, fs/2]
    }

    # Process each channel
    for channel in signal_cols:
        if channel in window_data.columns:
            signal = window_data[channel].values

            # Skip if signal is empty or all NaN
            if len(signal) == 0 or np.all(np.isnan(signal)):
                continue

            # Remove NaN values
            signal = signal[~np.isnan(signal)]

            if len(signal) == 0:
                continue

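            # Thin very long signals (more than ~6.7 minutes at 25 Hz) for faster processing.
            # Note: striding lowers the effective sampling rate to fs / step without
            # anti-alias filtering, while `fs` is passed unchanged to the spectral
            # features below. With 60-second windows this branch is not reached.
            if len(signal) > 10000:
                step = len(signal) // 10000
                signal = signal[::step]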

            try:
                # Time-domain statistics (fast) - removed variance as it's duplicate of Hjorth activity
                features[f'{channel}_mean'] = np.mean(signal)
                features[f'{channel}_std'] = np.std(signal)
                features[f'{channel}_rms'] = np.sqrt(np.mean(np.square(signal)))
                features[f'{channel}_min'] = np.min(signal)
                features[f'{channel}_max'] = np.max(signal)
                features[f'{channel}_range'] = np.max(signal) - np.min(signal)
                features[f'{channel}_skewness'] = stats.skew(signal)
                features[f'{channel}_kurtosis'] = stats.kurtosis(signal)
                features[f'{channel}_mad'] = np.mean(np.abs(signal - np.mean(signal)))

                # Hjorth parameters (Activity = variance, so we keep this instead of separate variance)
                activity, mobility, complexity = calculate_hjorth_parameters(signal)
                features[f'{channel}_hjorth_activity'] = activity  # This is the variance
                features[f'{channel}_hjorth_mobility'] = mobility
                features[f'{channel}_hjorth_complexity'] = complexity

                # Frequency-domain features (efficient batch computation)
                bandpowers = calculate_bandpower_efficient(signal, fs, freq_ranges)
                for band_name, power in bandpowers.items():
                    features[f'{channel}_bandpower_{band_name}'] = power

                # Wavelet energies with frequency band descriptions
                wavelet_energies = calculate_wavelet_energies(signal, fs)
                for i, (energy, freq_desc) in enumerate(wavelet_energies):
                    features[f'{channel}_wavelet_energy_{freq_desc}'] = energy

                # Entropy measures (only Shannon entropy - removed approximate entropy)
                shannon_ent = calculate_shannon_entropy(signal)
                features[f'{channel}_shannon_entropy'] = shannon_ent

            except Exception as e:
                print(f"Warning: Feature extraction failed for channel {channel}: {e}")
                continue

    return features

def create_windows(df_signal, window_size_min=1, overlap_min=0.25, fs=25):
    """
    Create overlapping windows from the signal - optimized version
    """
    if df_signal is None:
        return []

    # Convert to pandas for easier processing
    df_pandas = df_signal.to_pandas()

    # Find time column
    time_col = None
    for col in df_pandas.columns:
        if 'time' in col.lower():
            time_col = col
            break

    if time_col is None:
        print("Error: No time column found")
        return []

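    # Calculate window parameters
    window_size_samples = int(window_size_min * 60 * fs)  # Convert to samples
    overlap_samples = int(overlap_min * 60 * fs)
    step_samples = window_size_samples - overlap_samples

    # A non-positive step would make the loop below run forever
    if step_samples <= 0:
        print("Error: overlap must be smaller than the window size")
        return []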

    print(f"Window size: {window_size_min} min ({window_size_samples} samples)")
    print(f"Overlap: {overlap_min} min ({overlap_samples} samples)")
    print(f"Step size: {step_samples} samples")

    windows = []
    total_samples = len(df_pandas)


    # Create windows
    start_idx = 0

    while start_idx + window_size_samples <= total_samples:
        end_idx = start_idx + window_size_samples

        window_data = df_pandas.iloc[start_idx:end_idx].copy()

        window_info = {
            'start_idx': start_idx,
            'end_idx': end_idx,
            'start_time': window_data[time_col].iloc[0],
            'end_time': window_data[time_col].iloc[-1],
            'data': window_data
        }

        windows.append(window_info)
        start_idx += step_samples

    print(f"Created {len(windows)} windows from {total_samples} samples")
    return windows

def construct_features_for_patient(df_normalized, patient_name, patients_data,
                                 window_size_min=1, overlap_min=0.25, fs=25):
    """
    Construct features for all windows of a patient

    Parameters:
    - df_normalized: polars DataFrame with normalized signal data
    - patient_name: name of the patient
    - patients_data: dictionary with patient information
    - window_size_min: window size in minutes
    - overlap_min: overlap size in minutes
    - fs: sampling frequency

    Returns:
    - pandas DataFrame with features for all windows
    """
    print(f"\nConstructing features for patient: {patient_name}")

    # Create windows
    windows = create_windows(df_normalized, window_size_min, overlap_min, fs)

    if not windows:
        print("No windows created")
        return None

    # Get patient info
    patient_info = patients_data[patient_name]

    # Extract features from each window
    all_features = []

    for i, window in enumerate(tqdm(windows, desc="Extracting features")):
        window_features = extract_features_from_window(window['data'], patient_info, fs)

        # Add window metadata
        window_features['patient'] = patient_name
        window_features['window_idx'] = i
        window_features['start_time'] = window['start_time']
        window_features['end_time'] = window['end_time']

        all_features.append(window_features)

    # Convert to DataFrame
    features_df = pd.DataFrame(all_features)

    print(f"Extracted {len(features_df)} windows with {len(features_df.columns)} features each")

    return features_df
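
The frequency resolution available to the band-power features is bounded by the window length. The sketch below, using NumPy only, builds the frequency grid of a single full-length segment (the grid Welch's method produces when `nperseg` equals the window length) and checks which bins fall in the 0.01-0.02 Hz range. The variable names are illustrative.

```python
import numpy as np

fs = 25            # Hz, sampling rate of the downsampled signals
window_s = 60      # seconds per window
n = fs * window_s  # 1500 samples

# One-sided frequency grid for a segment of n samples: 0, fs/n, 2*fs/n, ...
freqs = np.fft.rfftfreq(n, d=1 / fs)
resolution = fs / n  # spacing between neighbouring bins, here 1/60 Hz

# Bins that fall inside the 0.01-0.02 Hz band
in_band = freqs[(freqs >= 0.01) & (freqs <= 0.02)]

print(f"resolution: {resolution:.4f} Hz")
print(f"bins in 0.01-0.02 Hz: {len(in_band)}")
```

With a 60-second window only the bin at 1/60 Hz lies in that range, so a band-power estimate there rests on a single spectral bin. Longer windows would be needed for a finer grid.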

#### Define the function to construct the final dataset with features and targets

In [7]:
def construct_final_dataset(features_df, labels_df, patient_name, glucose_lag_minutes=10, glycemic_prediction_minutes=15):
    """
    Construct the final dataset with features and targets.

    Parameters:
    - features_df: pandas DataFrame with extracted features
    - labels_df: polars DataFrame with labels and events
    - patient_name: name of the patient
    - glucose_lag_minutes: minutes to lag glucose target (default 10 min)
    - glycemic_prediction_minutes: minutes ahead to predict glycemic state (default 15 min)

    Returns:
    - final_df: pandas DataFrame with features and corresponding targets
    """
    # Convert polars DataFrame to pandas if needed
    if hasattr(labels_df, 'to_pandas'):
        labels_pandas = labels_df.to_pandas()
    else:
        labels_pandas = labels_df

    # Add patient column to labels if not present
    if 'patient' not in labels_pandas.columns:
        labels_pandas['patient'] = patient_name

    print(f"Features DataFrame shape: {features_df.shape}")
    print(f"Labels DataFrame shape: {labels_pandas.shape}")
    print(f"Features columns: {list(features_df.columns)}")
    print(f"Labels columns: {list(labels_pandas.columns)}")

    # Sort labels by time for proper lagging
    labels_pandas = labels_pandas.sort_values('time').reset_index(drop=True)

    # Create lagged glucose target to correct for CGM lag
    # CGM reading at time T represents the true glucose from 10 minutes ago
    # So to get the "true" glucose at time T, we need the CGM reading from T+10 minutes
    lag_offset = pd.Timedelta(minutes=glucose_lag_minutes)

    # Create prediction offset for glycemic state prediction
    prediction_offset = pd.Timedelta(minutes=glycemic_prediction_minutes)

    # Create target_lagged_glucose by shifting the glucose values backwards to account for lag
    if 'Glucose' in labels_pandas.columns:
        # For each original time point, find the glucose value from 10 minutes later (lag correction)
        target_lagged_glucose = []

        # For glycemic state prediction, find glucose values 15 minutes ahead
        future_glucose_values = []
        future_glycemic_states = []

        for idx, row in labels_pandas.iterrows():
            current_time = row['time']

            # 1. Lag-corrected glucose (10 minutes later)
            future_time_lag = current_time + lag_offset
            time_diffs_lag = np.abs((labels_pandas['time'] - future_time_lag).dt.total_seconds())
            closest_idx_lag = time_diffs_lag.idxmin()

            if time_diffs_lag[closest_idx_lag] <= 15 * 60:  # 15 minutes tolerance
                target_lagged_glucose.append(labels_pandas.loc[closest_idx_lag, 'Glucose'])
            else:
                target_lagged_glucose.append(row['Glucose'])

            # 2. Future glucose for glycemic state prediction (15 minutes ahead)
            future_time_pred = current_time + prediction_offset
            time_diffs_pred = np.abs((labels_pandas['time'] - future_time_pred).dt.total_seconds())
            closest_idx_pred = time_diffs_pred.idxmin()

            if time_diffs_pred[closest_idx_pred] <= 15 * 60:  # 15 minutes tolerance
                future_glucose = labels_pandas.loc[closest_idx_pred, 'Glucose']
                future_glucose_values.append(future_glucose)

                # Classify future glucose into glycemic states
                if pd.isna(future_glucose):
                    future_glycemic_states.append(None)
                elif future_glucose < 70:
                    future_glycemic_states.append('Hypoglycemia')
                elif future_glucose > 180:
                    future_glycemic_states.append('Hyperglycemia')
                else:
                    future_glycemic_states.append('Normal')
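            else:
                future_glucose_values.append(None)
                future_glycemic_states.append(None)

        # Attach the computed targets to the labels so the window matching below can read them
        labels_pandas['target_lagged_glucose'] = target_lagged_glucose
        labels_pandas[f'future_glucose_{glycemic_prediction_minutes}min'] = future_glucose_values
        labels_pandas[f'future_glycemic_state_{glycemic_prediction_minutes}min'] = future_glycemic_states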

    # Create a time-based merge using nearest timestamp matching
    final_features = []

    for idx, row in features_df.iterrows():
        window_start = row['start_time']
        window_end = row['end_time']

        # Find labels that fall within this window
        window_labels = labels_pandas[
            (labels_pandas['time'] >= window_start) &
            (labels_pandas['time'] <= window_end)
        ]

        # If no labels in window, find the closest label before window end
        if len(window_labels) == 0:
            before_window = labels_pandas[labels_pandas['time'] <= window_end]
            if len(before_window) > 0:
                # Get the most recent label before window end
                closest_label = before_window.loc[before_window['time'].idxmax()]
                target_glucose = closest_label['Glucose'] if 'Glucose' in closest_label else None
                metabolic_state = closest_label['state'] if 'state' in closest_label else None
                target_lagged_glucose = closest_label['target_lagged_glucose'] if 'target_lagged_glucose' in closest_label else None
                future_glucose = closest_label[f'future_glucose_{glycemic_prediction_minutes}min'] if f'future_glucose_{glycemic_prediction_minutes}min' in closest_label else None
                future_glycemic_state = closest_label[f'future_glycemic_state_{glycemic_prediction_minutes}min'] if f'future_glycemic_state_{glycemic_prediction_minutes}min' in closest_label else None
            else:
                target_glucose = None
                metabolic_state = None
                target_lagged_glucose = None
                future_glucose = None
                future_glycemic_state = None
        else:
            # Use the first label in the window
            target_glucose = window_labels['Glucose'].iloc[0] if 'Glucose' in window_labels.columns else None
            metabolic_state = window_labels['state'].iloc[0] if 'state' in window_labels.columns else None
            target_lagged_glucose = window_labels['target_lagged_glucose'].iloc[0] if 'target_lagged_glucose' in window_labels.columns else None
            future_glucose = window_labels[f'future_glucose_{glycemic_prediction_minutes}min'].iloc[0] if f'future_glucose_{glycemic_prediction_minutes}min' in window_labels.columns else None
            future_glycemic_state = window_labels[f'future_glycemic_state_{glycemic_prediction_minutes}min'].iloc[0] if f'future_glycemic_state_{glycemic_prediction_minutes}min' in window_labels.columns else None

        # Create feature row with targets
        feature_row = row.to_dict()
        feature_row['CGM'] = target_glucose
        feature_row[f'lagged_CGM_{glucose_lag_minutes}min'] = target_lagged_glucose
        feature_row['metabolic_state'] = metabolic_state
        feature_row[f'future_CGM_{glycemic_prediction_minutes}min'] = future_glucose
        feature_row[f'glycemic_state_{glycemic_prediction_minutes}min'] = future_glycemic_state

        final_features.append(feature_row)

    # Convert to DataFrame
    final_df = pd.DataFrame(final_features)

    # Reorder columns to place the glucose columns together
    glucose_cols = ['CGM', f'lagged_CGM_{glucose_lag_minutes}min', f'future_CGM_{glycemic_prediction_minutes}min']
    target_cols = ['metabolic_state', f'glycemic_state_{glycemic_prediction_minutes}min']
    other_cols = [col for col in final_df.columns if col not in glucose_cols + target_cols]

    # Place glucose columns after the basic window info but before other features
    window_info_cols = ['patient', 'window_idx', 'start_time', 'end_time']
    remaining_cols = [col for col in other_cols if col not in window_info_cols]

    # Reorder: window info, glucose columns, target columns, then other features
    new_column_order = window_info_cols + glucose_cols + target_cols + remaining_cols
    final_df = final_df[new_column_order]

    print(f"Final dataset shape: {final_df.shape}")
    print(f"CGM values available: {final_df['CGM'].notna().sum()}/{len(final_df)}")
    print(f"Lagged CGM values available: {final_df[f'lagged_CGM_{glucose_lag_minutes}min'].notna().sum()}/{len(final_df)}")
    print(f"Future CGM values available: {final_df[f'future_CGM_{glycemic_prediction_minutes}min'].notna().sum()}/{len(final_df)}")
    print(f"Future glycemic state values available: {final_df[f'glycemic_state_{glycemic_prediction_minutes}min'].notna().sum()}/{len(final_df)}")

    return final_df

def construct_final_dataset_with_flags(features_df, labels_df, patient_name, glucose_lag_minutes=10, glycemic_prediction_minutes=15):
    """
    Construct the final dataset with features and targets including hypo/hyper flags.

    Parameters:
    - features_df: pandas DataFrame with extracted features
    - labels_df: polars (or pandas) DataFrame with labels and events; must contain a 'time' column
    - patient_name: name of the patient
    - glucose_lag_minutes: minutes to lag glucose target (default 10 min)
    - glycemic_prediction_minutes: minutes ahead to predict glycemic state (default 15 min)

    Returns:
    - final_df: pandas DataFrame with features and corresponding targets including flags
    """
    # Convert polars DataFrame to pandas if needed
    if hasattr(labels_df, 'to_pandas'):
        labels_pandas = labels_df.to_pandas()
    else:
        labels_pandas = labels_df.copy()  # copy so the caller's DataFrame is not mutated below

    # Add patient column to labels if not present
    if 'patient' not in labels_pandas.columns:
        labels_pandas['patient'] = patient_name

    print(f"Features DataFrame shape: {features_df.shape}")
    print(f"Labels DataFrame shape: {labels_pandas.shape}")

    # Sort labels by time for proper lagging
    labels_pandas = labels_pandas.sort_values('time').reset_index(drop=True)

    # Create lagged glucose target to correct for CGM lag
    lag_offset = pd.Timedelta(minutes=glucose_lag_minutes)
    prediction_offset = pd.Timedelta(minutes=glycemic_prediction_minutes)

    # Build per-label targets: lag-corrected glucose, future glucose/state, and hypo/hyper flags
    if 'Glucose' in labels_pandas.columns:
        # For each label time point, look up the glucose value glucose_lag_minutes later (lag correction)
        target_lagged_glucose = []
        future_glucose_values = []
        future_glycemic_states = []
        hypo_flags = []
        hyper_flags = []

        for idx, row in labels_pandas.iterrows():
            current_time = row['time']

            # 1. Lag-corrected glucose (glucose_lag_minutes later)
            future_time_lag = current_time + lag_offset
            time_diffs_lag = np.abs((labels_pandas['time'] - future_time_lag).dt.total_seconds())
            closest_idx_lag = time_diffs_lag.idxmin()

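            if time_diffs_lag[closest_idx_lag] <= 15 * 60:  # accept matches within 15 minutes of the target time
                target_lagged_glucose.append(labels_pandas.loc[closest_idx_lag, 'Glucose'])
            else:
                # No label close enough: fall back to the current (uncorrected) reading.
                # The current row is itself glucose_lag_minutes away from the target time,
                # so this branch is only reached when the lag exceeds the 15-minute tolerance.
                target_lagged_glucose.append(row['Glucose'])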

            # 2. Future glucose for glycemic state prediction (glycemic_prediction_minutes ahead)
            future_time_pred = current_time + prediction_offset
            time_diffs_pred = np.abs((labels_pandas['time'] - future_time_pred).dt.total_seconds())
            closest_idx_pred = time_diffs_pred.idxmin()

            # 15-minute tolerance; near the end of a recording the nearest label
            # can be earlier than the target time (possibly the current row itself)
            if time_diffs_pred[closest_idx_pred] <= 15 * 60:
                future_glucose = labels_pandas.loc[closest_idx_pred, 'Glucose']
                future_glucose_values.append(future_glucose)

                # Classify future glucose into glycemic states
                if pd.isna(future_glucose):
                    future_glycemic_states.append(None)
                elif future_glucose < 70:
                    future_glycemic_states.append('Hypoglycemia')
                elif future_glucose > 180:
                    future_glycemic_states.append('Hyperglycemia')
                else:
                    future_glycemic_states.append('Normal')
            else:
                future_glucose_values.append(None)
                future_glycemic_states.append(None)

            # 3. Hypo/hyper flags: any glucose < 75 or > 180 mg/dL within the next 900 seconds (15 minutes).
            #    Note that the hypo flag uses 75 mg/dL, while the glycemic state above uses 70 mg/dL.
            future_time_flag = current_time + pd.Timedelta(seconds=900)  # 900 seconds = 15 minutes

            # Find all glucose measurements within the next 900 seconds
            future_window = labels_pandas[
                (labels_pandas['time'] > current_time) &
                (labels_pandas['time'] <= future_time_flag)
            ]

            hypo_flag = 0
            hyper_flag = 0

            if len(future_window) > 0:
                valid_glucose = future_window['Glucose'].dropna()
                if len(valid_glucose) > 0:
                    # Check for hypoglycemia (< 75 mg/dL)
                    if (valid_glucose < 75).any():
                        hypo_flag = 1
                    # Check for hyperglycemia (> 180 mg/dL)
                    if (valid_glucose > 180).any():
                        hyper_flag = 1

            hypo_flags.append(hypo_flag)
            hyper_flags.append(hyper_flag)

        labels_pandas['target_lagged_glucose'] = target_lagged_glucose
        labels_pandas[f'future_glucose_{glycemic_prediction_minutes}min'] = future_glucose_values
        labels_pandas[f'future_glycemic_state_{glycemic_prediction_minutes}min'] = future_glycemic_states
        labels_pandas['hypo_flag'] = hypo_flags
        labels_pandas['hyper_flag'] = hyper_flags
    else:
        labels_pandas['target_lagged_glucose'] = None
        labels_pandas[f'future_glucose_{glycemic_prediction_minutes}min'] = None
        labels_pandas[f'future_glycemic_state_{glycemic_prediction_minutes}min'] = None
        labels_pandas['hypo_flag'] = 0
        labels_pandas['hyper_flag'] = 0

    print(f"Applied {glucose_lag_minutes}-minute lag correction to glucose targets")
    print(f"Created {glycemic_prediction_minutes}-minute ahead glycemic state predictions")
    print(f"Created hypo/hyper flags for glucose events within 900 seconds")

    # Assign targets to each feature window: use the first label inside the window,
    # otherwise fall back to the most recent label at or before the window end
    final_features = []

    for idx, row in features_df.iterrows():
        window_start = row['start_time']
        window_end = row['end_time']

        # Find labels that fall within this window
        window_labels = labels_pandas[
            (labels_pandas['time'] >= window_start) &
            (labels_pandas['time'] <= window_end)
        ]

        # If no labels in window, find the closest label before window end
        if len(window_labels) == 0:
            before_window = labels_pandas[labels_pandas['time'] <= window_end]
            if len(before_window) > 0:
                # Get the most recent label before window end
                closest_label = before_window.loc[before_window['time'].idxmax()]
                target_glucose = closest_label['Glucose'] if 'Glucose' in closest_label else None
                metabolic_state = closest_label['state'] if 'state' in closest_label else None
                target_lagged_glucose = closest_label['target_lagged_glucose'] if 'target_lagged_glucose' in closest_label else None
                future_glucose = closest_label[f'future_glucose_{glycemic_prediction_minutes}min'] if f'future_glucose_{glycemic_prediction_minutes}min' in closest_label else None
                future_glycemic_state = closest_label[f'future_glycemic_state_{glycemic_prediction_minutes}min'] if f'future_glycemic_state_{glycemic_prediction_minutes}min' in closest_label else None
                hypo_flag = closest_label['hypo_flag'] if 'hypo_flag' in closest_label else 0
                hyper_flag = closest_label['hyper_flag'] if 'hyper_flag' in closest_label else 0
            else:
                target_glucose = None
                metabolic_state = None
                target_lagged_glucose = None
                future_glucose = None
                future_glycemic_state = None
                hypo_flag = 0
                hyper_flag = 0
        else:
            # Use the first label in the window
            target_glucose = window_labels['Glucose'].iloc[0] if 'Glucose' in window_labels.columns else None
            metabolic_state = window_labels['state'].iloc[0] if 'state' in window_labels.columns else None
            target_lagged_glucose = window_labels['target_lagged_glucose'].iloc[0] if 'target_lagged_glucose' in window_labels.columns else None
            future_glucose = window_labels[f'future_glucose_{glycemic_prediction_minutes}min'].iloc[0] if f'future_glucose_{glycemic_prediction_minutes}min' in window_labels.columns else None
            future_glycemic_state = window_labels[f'future_glycemic_state_{glycemic_prediction_minutes}min'].iloc[0] if f'future_glycemic_state_{glycemic_prediction_minutes}min' in window_labels.columns else None
            hypo_flag = window_labels['hypo_flag'].iloc[0] if 'hypo_flag' in window_labels.columns else 0
            hyper_flag = window_labels['hyper_flag'].iloc[0] if 'hyper_flag' in window_labels.columns else 0

        # Create feature row with targets
        feature_row = row.to_dict()
        feature_row['CGM'] = target_glucose
        feature_row[f'lagged_CGM_{glucose_lag_minutes}min'] = target_lagged_glucose
        feature_row['metabolic_state'] = metabolic_state
        feature_row[f'future_CGM_{glycemic_prediction_minutes}min'] = future_glucose
        feature_row[f'glycemic_state_{glycemic_prediction_minutes}min'] = future_glycemic_state
        feature_row['hypo_flag'] = hypo_flag
        feature_row['hyper_flag'] = hyper_flag

        final_features.append(feature_row)

    # Convert to DataFrame
    final_df = pd.DataFrame(final_features)

    # Reorder columns to place the glucose columns together
    glucose_cols = ['CGM', f'lagged_CGM_{glucose_lag_minutes}min', f'future_CGM_{glycemic_prediction_minutes}min']
    target_cols = ['metabolic_state', f'glycemic_state_{glycemic_prediction_minutes}min', 'hypo_flag', 'hyper_flag']
    other_cols = [col for col in final_df.columns if col not in glucose_cols + target_cols]

    # Place glucose columns after the basic window info but before other features
    window_info_cols = ['patient', 'window_idx', 'start_time', 'end_time']
    remaining_cols = [col for col in other_cols if col not in window_info_cols]

    # Reorder: window info, glucose columns, target columns, then other features
    new_column_order = window_info_cols + glucose_cols + target_cols + remaining_cols
    final_df = final_df[new_column_order]

    print(f"Final dataset shape: {final_df.shape}")
    print(f"CGM values available: {final_df['CGM'].notna().sum()}/{len(final_df)}")
    print(f"Lagged CGM values available: {final_df[f'lagged_CGM_{glucose_lag_minutes}min'].notna().sum()}/{len(final_df)}")
    print(f"Future CGM values available: {final_df[f'future_CGM_{glycemic_prediction_minutes}min'].notna().sum()}/{len(final_df)}")
    print(f"Future glycemic state values available: {final_df[f'glycemic_state_{glycemic_prediction_minutes}min'].notna().sum()}/{len(final_df)}")
    print(f"Hypo flags: {final_df['hypo_flag'].sum()}/{len(final_df)} ({final_df['hypo_flag'].sum()/len(final_df)*100:.1f}%)")
    print(f"Hyper flags: {final_df['hyper_flag'].sum()}/{len(final_df)} ({final_df['hyper_flag'].sum()/len(final_df)*100:.1f}%)")

    return final_df
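
The per-row nearest-timestamp lookups in `construct_final_dataset_with_flags` rescan the whole label table for every label, so the cost grows quadratically with the number of labels. As a rough sketch only, and not the notebook's actual implementation, the same kind of lookup could be expressed with `pandas.merge_asof`. The helper name `nearest_future_glucose` and its parameters are hypothetical. It assumes a pandas `labels` frame with `time` and `Glucose` columns. Unlike the loop above, it returns NaN (rather than the current reading) when no label lies within the tolerance, and ties between two equally distant labels may be resolved differently.

```python
import pandas as pd

def nearest_future_glucose(labels, offset_minutes, tolerance_minutes=15):
    """Hypothetical helper: glucose of the label nearest to (time + offset) for each row."""
    labels = labels.sort_values("time").reset_index(drop=True)
    lookup = labels[["time", "Glucose"]].rename(columns={"Glucose": "matched_glucose"})

    # Query times: every label time shifted forward by the requested offset
    queries = pd.DataFrame(
        {"query_time": labels["time"] + pd.Timedelta(minutes=offset_minutes)}
    )
    # merge_asof needs identical key dtypes on both sides
    queries["query_time"] = queries["query_time"].astype(lookup["time"].dtype)

    matched = pd.merge_asof(
        queries,
        lookup,
        left_on="query_time",
        right_on="time",
        direction="nearest",  # closest label before or after the query time
        tolerance=pd.Timedelta(minutes=tolerance_minutes),
    )
    return matched["matched_glucose"]

# Toy data: one reading every 5 minutes
demo = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=5, freq="5min"),
    "Glucose": [100.0, 110.0, 120.0, 130.0, 140.0],
})
demo["lagged_10min"] = nearest_future_glucose(demo, offset_minutes=10)
print(demo)
```

With a tolerance wider than the offset, the last rows of a recording are matched to the final available reading, which mirrors the behavior of the loop-based version.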

#### Process all "Volunteer" patients to generate combined features dataframe

In [8]:
volunteer_patients_data = {}

# Collect the volunteer recordings from patients_data and list their names
print("Available Volunteer patients in patients_data:")
for patient_name in patients_data.keys():
    if "volunteer" in patient_name:
        patient_info = patients_data[patient_name]
        print(patient_name)
        volunteer_patients_data[patient_name] = patient_info


Available Volunteer patients in patients_data:
volunteer_part1
volunteer_part2
volunteer_part3
volunteer_part4


#### Define the function to process "Volunteer" patients

In [9]:
# Process volunteer patients to generate combined features dataframe
def process_volunteer_patients(volunteer_patients_data, signals_dir, labels_dir, labels_filename, output_dir):
    """
    Process all Volunteer patients to generate combined features dataframe.

    Parameters:
    - volunteer_patients_data: dictionary containing patient information
    - signals_dir: directory containing signal files
    - labels_dir: directory containing label files
    - labels_filename: name of the labels Excel file
    - output_dir: directory to save results

    Returns:
    - combined_df: pandas DataFrame with features from all patients
    """
    all_final_datasets = []
    processing_summary = []

    # Define glucose lag and glycemic prediction parameters
    glucose_lag_minutes = GLUCOSE_LAG_MINUTES
    glycemic_prediction_minutes = GLYCEMIC_PREDICTION_MINUTES
    window_size = WINDOW_SIZE_MINUTES
    window_overlap = WINDOW_OVERLAP_MINUTES

    print("=" * 60)
    print("PROCESSING VOLUNTEER PATIENTS")
    print("=" * 60)

    for patient_name in volunteer_patients_data.keys():
        print(f"\n{'='*20} Processing: {patient_name} {'='*20}")

        try:
            # Step 1: Load downsampled signal data
            print(f"Step 1: Loading downsampled data for {patient_name}...")
            df_loaded = load_downsampled_data(patient_name, signals_dir)

            if df_loaded is None:
                print(f"❌ No signal data found for {patient_name}")
                processing_summary.append({
                    'patient': patient_name,
                    'status': 'Failed - No signal data',
                    'features_count': 0,
                    'windows_count': 0
                })
                continue

            # Step 2: Apply per-channel z-score normalization
            print(f"Step 2: Applying StandardScaler normalization for {patient_name}...")
            df_normalized, channel_scalers = apply_zscore_normalization(df_loaded, patient_name)

            if df_normalized is None:
                print(f"❌ Normalization failed for {patient_name}")
                processing_summary.append({
                    'patient': patient_name,
                    'status': 'Failed - Normalization error',
                    'features_count': 0,
                    'windows_count': 0
                })
                continue

            # Step 3: Construct features
            print(f"Step 3: Constructing features for {patient_name}...")
            features_df = construct_features_for_patient(
                df_normalized,
                patient_name,
                volunteer_patients_data,
                window_size_min=window_size,
                overlap_min=window_overlap,
                fs=25
            )

            if features_df is None:
                print(f"❌ Feature construction failed for {patient_name}")
                processing_summary.append({
                    'patient': patient_name,
                    'status': 'Failed - Feature construction error',
                    'features_count': 0,
                    'windows_count': 0
                })
                continue

            # Step 4: Load labels (skip for Normal patients who don't have labels)
            df_labels = None
            if "Normal" not in patient_name:
                print(f"Step 4: Loading labels for {patient_name}...")
                try:
                    df_labels = pl.read_excel(
                        labels_dir / labels_filename,
                        sheet_name=patient_name,
                        columns=[0, 1, 2, 3],
                        schema_overrides={
                            "time": pl.Datetime,
                            "Glucose": pl.Float64,
                            "Events": pl.Utf8,
                            "Remarks": pl.Utf8
                        }
                    )

                    if len(df_labels.columns) >= 4:
                        df_labels = df_labels.select(df_labels.columns[:4])

                    # Add metabolic state column (skipped, for now, for volunteer recordings).
                    # Patient names are lowercase ("volunteer_part1"), so compare case-insensitively.
                    if "volunteer" not in patient_name.lower():
                        pass
                        # df_labels = add_metabolic_state_column(df_labels)

                    print(f"✅ Loaded {df_labels.shape[0]} labels for {patient_name}")

                except Exception as e:
                    print(f"⚠️ Could not load labels for {patient_name}: {e}")
                    df_labels = None
            else:
                print(f"ℹ️ Skipping labels for {patient_name} (Normal patient)")

            # Step 5: Construct final dataset with new target labels
            print(f"Step 5: Constructing final dataset for {patient_name}...")
            if df_labels is not None:
                final_df = construct_final_dataset_with_flags(
                    features_df, df_labels, patient_name,
                    glucose_lag_minutes, glycemic_prediction_minutes
                )
            else:
                # No labels available (Normal patient, or label loading failed): create a simplified final dataset
                final_df = features_df.copy()
                final_df['CGM'] = None
                final_df[f'lagged_CGM_{glucose_lag_minutes}min'] = None
                final_df['metabolic_state'] = 'Normal' if 'Normal' in patient_name else None
                final_df[f'future_CGM_{glycemic_prediction_minutes}min'] = None
                final_df[f'glycemic_state_{glycemic_prediction_minutes}min'] = None
                # Flags default to 0 without glucose data; for Normal patients this assumes no hypo/hyperglycemia
                final_df['hypo_flag'] = 0
                final_df['hyper_flag'] = 0

                # Reorder columns to match the expected structure
                glucose_cols = ['CGM', f'lagged_CGM_{glucose_lag_minutes}min', f'future_CGM_{glycemic_prediction_minutes}min']
                target_cols = ['metabolic_state', f'glycemic_state_{glycemic_prediction_minutes}min', 'hypo_flag', 'hyper_flag']
                other_cols = [col for col in final_df.columns if col not in glucose_cols + target_cols]

                window_info_cols = ['patient', 'window_idx', 'start_time', 'end_time']
                remaining_cols = [col for col in other_cols if col not in window_info_cols]

                new_column_order = window_info_cols + glucose_cols + target_cols + remaining_cols
                final_df = final_df[new_column_order]

            if final_df is not None:
                all_final_datasets.append(final_df)
                print(f"✅ Successfully processed {patient_name}")
                print(f"   - Windows: {len(final_df)}")
                print(f"   - Features: {len(final_df.columns)}")

                processing_summary.append({
                    'patient': patient_name,
                    'status': 'Success',
                    'features_count': len(final_df.columns),
                    'windows_count': len(final_df)
                })
            else:
                print(f"❌ Final dataset construction failed for {patient_name}")
                processing_summary.append({
                    'patient': patient_name,
                    'status': 'Failed - Final dataset construction error',
                    'features_count': 0,
                    'windows_count': 0
                })

        except Exception as e:
            print(f"❌ Error processing {patient_name}: {str(e)}")
            processing_summary.append({
                'patient': patient_name,
                'status': f'Failed - {str(e)}',
                'features_count': 0,
                'windows_count': 0
            })

        # Memory cleanup
        gc.collect()

    # Combine all datasets
    if all_final_datasets:
        print(f"\n{'='*60}")
        print("COMBINING ALL DATASETS")
        print(f"{'='*60}")

        combined_df = pd.concat(all_final_datasets, ignore_index=True)

        print(f"✅ Combined dataset created successfully!")
        print(f"   - Total patients processed: {len(all_final_datasets)}")
        print(f"   - Total windows: {len(combined_df)}")
        print(f"   - Total features: {len(combined_df.columns)}")

        # Save combined dataset (all features)
        combined_output_file = output_dir / "combined_features_and_targets_volunteer1.csv"
        combined_df.to_csv(combined_output_file, index=False)
        print(f"   - Full dataset saved to: {combined_output_file}")

        # Display summary statistics
        print(f"\n{'='*60}")
        print("PROCESSING SUMMARY")
        print(f"{'='*60}")

        summary_df = pd.DataFrame(processing_summary)
        print(summary_df.to_string(index=False))

        # Dataset statistics
        print(f"\n{'='*60}")
        print("DATASET STATISTICS")
        print(f"{'='*60}")

        # Patient distribution
        print(f"\nPatient distribution:")
        print(combined_df['patient'].value_counts())

        # Target statistics (for patients with labels)
        if 'CGM' in combined_df.columns:
            cgm_available = combined_df['CGM'].notna().sum()
            print(f"\nCGM targets available: {cgm_available}/{len(combined_df)} ({cgm_available/len(combined_df)*100:.1f}%)")

        if 'metabolic_state' in combined_df.columns:
            print(f"\nMetabolic state distribution:")
            print(combined_df['metabolic_state'].value_counts())

        if f'glycemic_state_{glycemic_prediction_minutes}min' in combined_df.columns:
            glycemic_available = combined_df[f'glycemic_state_{glycemic_prediction_minutes}min'].notna().sum()
            print(f"\nGlycemic state targets available: {glycemic_available}/{len(combined_df)} ({glycemic_available/len(combined_df)*100:.1f}%)")
            if glycemic_available > 0:
                print(f"Glycemic state distribution:")
                print(combined_df[f'glycemic_state_{glycemic_prediction_minutes}min'].value_counts())

        # New flag statistics
        if 'hypo_flag' in combined_df.columns:
            hypo_flags = combined_df['hypo_flag'].sum()
            print(f"\nHypoglycemia flags (within 900s): {hypo_flags}/{len(combined_df)} ({hypo_flags/len(combined_df)*100:.1f}%)")

        if 'hyper_flag' in combined_df.columns:
            hyper_flags = combined_df['hyper_flag'].sum()
            print(f"Hyperglycemia flags (within 900s): {hyper_flags}/{len(combined_df)} ({hyper_flags/len(combined_df)*100:.1f}%)")

        return combined_df
    else:
        print("❌ No datasets were successfully processed!")
        return None
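
The hypo/hyper flag statistics reported by this function count windows whose matched label has at least one glucose reading below 75 mg/dL or above 180 mg/dL in the following 900 seconds. The sketch below shows one way to compute such forward-looking flags on a toy series without filtering the whole table for every row. It is illustrative only: `future_event_flags` and its parameters are hypothetical names, and the data is invented.

```python
import numpy as np
import pandas as pd

def future_event_flags(labels, horizon_seconds=900, hypo_threshold=75, hyper_threshold=180):
    """Hypothetical helper: flag rows with a hypo/hyper reading in (t, t + horizon]."""
    labels = labels.sort_values("time").reset_index(drop=True)
    times = labels["time"].to_numpy()
    glucose = labels["Glucose"].to_numpy(dtype=float)
    horizon = np.timedelta64(horizon_seconds, "s")

    # For row i the future window covers positions start[i]:stop[i] of the sorted table
    start = np.searchsorted(times, times, side="right")            # strictly after t
    stop = np.searchsorted(times, times + horizon, side="right")   # up to and including t + horizon

    hypo = np.zeros(len(labels), dtype=int)
    hyper = np.zeros(len(labels), dtype=int)
    for i, (a, b) in enumerate(zip(start, stop)):
        window = glucose[a:b]
        window = window[~np.isnan(window)]  # ignore missing readings
        if window.size:
            hypo[i] = int((window < hypo_threshold).any())
            hyper[i] = int((window > hyper_threshold).any())
    return hypo, hyper

# Invented readings; 72 mg/dL trips the 75 mg/dL flag but not the 70 mg/dL state cut-off
demo = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:10",
        "2024-01-01 00:15", "2024-01-01 00:20", "2024-01-01 00:40",
    ]),
    "Glucose": [100.0, 72.0, 100.0, 190.0, 100.0, 60.0],
})
demo["hypo_flag"], demo["hyper_flag"] = future_event_flags(demo)
print(demo)
```

Because the window is open on the left, a reading at the current time never sets its own flag, matching the `>` / `<=` comparison used in `construct_final_dataset_with_flags`.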

#### Execute the feature construction pipeline for all Volunteer patients

In [10]:
# Execute the special case feature construction pipeline
print("Starting the special case feature construction pipeline...")

# Process all patients and create the combined volunteer features dataset
volunteer_features_df = process_volunteer_patients(
    volunteer_patients_data=volunteer_patients_data,
    signals_dir=signals_dir,
    labels_dir=labels_dir,
    labels_filename=labels_filename,
    output_dir=output_dir
)

if volunteer_features_df is not None:
    print(f"\n{'='*80}")
    print("FEATURE CONSTRUCTION PIPELINE COMPLETED SUCCESSFULLY!")
    print(f"{'='*80}")

    print(f"\nDataset Overview:")
    print(f"  📊 Shape: {volunteer_features_df.shape}")
    print(f"  👥 Patients: {volunteer_features_df['patient'].nunique()}")
    print(f"  🪟 Windows: {len(volunteer_features_df)}")
    print(f"  📈 Features: {len(volunteer_features_df.columns)}")

    # Summarize the column groups (metadata, targets, features)
    print(f"\nColumn overview:")
    print(f"  - Metadata columns: {['patient', 'window_idx', 'start_time', 'end_time']}")
    print(f"  - Target columns: {[col for col in volunteer_features_df.columns if 'CGM' in col or 'state' in col or 'flag' in col]}")
    print(f"  - Feature columns: {len([col for col in volunteer_features_df.columns if col not in ['patient', 'window_idx', 'start_time', 'end_time'] and 'CGM' not in col and 'state' not in col and 'flag' not in col])}")

    print(f"\n✅ Combined features dataset is now available as 'volunteer_features_df'")
    print(f"✅ Feature construction pipeline completed successfully!")

else:
    print("❌ Feature construction pipeline failed!")
    print("Please check the error messages above and ensure:")
    print("  - Signal files are available in the signals_dir")
    print("  - Patient data is correctly loaded")
    print("  - Directory permissions are correct")


Starting the special case feature construction pipeline...
PROCESSING VOLUNTEER PATIENTS

Step 1: Loading downsampled data for volunteer_part1...
Successfully loaded data from volunteer_part1_downsampled_25hz.parquet
Loaded data shape: (612512, 3)
Step 2: Applying StandardScaler normalization for volunteer_part1...
Applying z-score normalization for volunteer_part1
Processing 2 signal channels
Channel Hand1: mean=-0.1187, std=27.9219
Channel Hand2: mean=-0.0934, std=27.7747
Z-score normalization completed for 2 channels
Step 3: Constructing features for volunteer_part1...

Constructing features for patient: volunteer_part1
Window size: 1 min (1500 samples)
Overlap: 0.25 min (375 samples)
Step size: 1125 samples
Created 544 windows from 612512 samples


Extracting features: 100%|██████████| 544/544 [00:06<00:00, 90.33it/s] 


Extracted 544 windows with 61 features each
Step 4: Loading labels for volunteer_part1...
✅ Loaded 215 labels for volunteer_part1
Step 5: Constructing final dataset for volunteer_part1...
Features DataFrame shape: (544, 61)
Labels DataFrame shape: (215, 5)
Applied 10-minute lag correction to glucose targets
Created 15-minute ahead glycemic state predictions
Created hypo/hyper flags for glucose events within 900 seconds
Final dataset shape: (544, 68)
CGM values available: 544/544
Lagged CGM values available: 544/544
Future CGM values available: 544/544
Future glycemic state values available: 544/544
Hypo flags: 56/544 (10.3%)
Hyper flags: 25/544 (4.6%)
✅ Successfully processed volunteer_part1
   - Windows: 544
   - Features: 68

Step 1: Loading downsampled data for volunteer_part2...
Successfully loaded data from volunteer_part2_downsampled_25hz.parquet
Loaded data shape: (656312, 3)
Step 2: Applying StandardScaler normalization for volunteer_part2...
Applying z-score normalization for 

Extracting features: 100%|██████████| 583/583 [00:06<00:00, 96.08it/s] 


Extracted 583 windows with 61 features each
Step 4: Loading labels for volunteer_part2...
✅ Loaded 205 labels for volunteer_part2
Step 5: Constructing final dataset for volunteer_part2...
Features DataFrame shape: (583, 61)
Labels DataFrame shape: (205, 5)
Applied 10-minute lag correction to glucose targets
Created 15-minute ahead glycemic state predictions
Created hypo/hyper flags for glucose events within 900 seconds
Final dataset shape: (583, 68)
CGM values available: 583/583
Lagged CGM values available: 583/583
Future CGM values available: 583/583
Future glycemic state values available: 583/583
Hypo flags: 112/583 (19.2%)
Hyper flags: 24/583 (4.1%)
✅ Successfully processed volunteer_part2
   - Windows: 583
   - Features: 68

Step 1: Loading downsampled data for volunteer_part3...
Successfully loaded data from volunteer_part3_downsampled_25hz.parquet
Loaded data shape: (872087, 3)
Step 2: Applying StandardScaler normalization for volunteer_part3...
Applying z-score normalization for

Extracting features: 100%|██████████| 774/774 [00:08<00:00, 89.97it/s] 


Extracted 774 windows with 61 features each
Step 4: Loading labels for volunteer_part3...
✅ Loaded 242 labels for volunteer_part3
Step 5: Constructing final dataset for volunteer_part3...
Features DataFrame shape: (774, 61)
Labels DataFrame shape: (242, 5)
Applied 10-minute lag correction to glucose targets
Created 15-minute ahead glycemic state predictions
Created hypo/hyper flags for glucose events within 900 seconds
Final dataset shape: (774, 68)
CGM values available: 774/774
Lagged CGM values available: 774/774
Future CGM values available: 774/774
Future glycemic state values available: 774/774
Hypo flags: 23/774 (3.0%)
Hyper flags: 184/774 (23.8%)
✅ Successfully processed volunteer_part3
   - Windows: 774
   - Features: 68

Step 1: Loading downsampled data for volunteer_part4...
Successfully loaded data from volunteer_part4_downsampled_25hz.parquet
Loaded data shape: (716200, 3)
Step 2: Applying StandardScaler normalization for volunteer_part4...
Applying z-score normalization for

Extracting features: 100%|██████████| 636/636 [00:08<00:00, 78.75it/s] 


Extracted 636 windows with 61 features each
Step 4: Loading labels for volunteer_part4...
✅ Loaded 242 labels for volunteer_part4
Step 5: Constructing final dataset for volunteer_part4...
Features DataFrame shape: (636, 61)
Labels DataFrame shape: (242, 5)
Applied 10-minute lag correction to glucose targets
Created 15-minute ahead glycemic state predictions
Created hypo/hyper flags for glucose events within 900 seconds
Final dataset shape: (636, 68)
CGM values available: 636/636
Lagged CGM values available: 636/636
Future CGM values available: 636/636
Future glycemic state values available: 636/636
Hypo flags: 27/636 (4.2%)
Hyper flags: 23/636 (3.6%)
✅ Successfully processed volunteer_part4
   - Windows: 636
   - Features: 68

COMBINING ALL DATASETS
✅ Combined dataset created successfully!
   - Total patients processed: 4
   - Total windows: 2537
   - Total features: 68
   - Full dataset saved to: ..\..\..\Data\ProcessedData\combined_features_and_targets_volunteer1.csv

PROCESSING SUMM
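
The window counts in the log above are consistent with a plain sliding-window count: with 1500-sample windows and a 1125-sample step, a recording of N samples yields floor((N - 1500) / 1125) + 1 full windows. The check below is a small sketch under the assumption that trailing partial windows are dropped; `count_windows` is a hypothetical helper and not part of `construct_features_for_patient`. The sample counts are taken from the "Loaded data shape" lines in the log.

```python
def count_windows(n_samples, window_samples, overlap_samples):
    """Hypothetical helper: number of full windows, dropping any trailing partial window."""
    step = window_samples - overlap_samples
    if n_samples < window_samples:
        return 0
    return (n_samples - window_samples) // step + 1

fs = 25                    # sampling rate in Hz, as passed to the pipeline
window_samples = 60 * fs   # 60-second window -> 1500 samples
overlap_samples = 15 * fs  # 15-second overlap -> 375 samples

# Sample counts per recording, copied from the log
recordings = {
    "volunteer_part1": 612512,
    "volunteer_part2": 656312,
    "volunteer_part3": 872087,
    "volunteer_part4": 716200,
}
counts = {name: count_windows(n, window_samples, overlap_samples)
          for name, n in recordings.items()}
print(counts)
print("total windows:", sum(counts.values()))
```

The per-recording results (544, 583, 774, 636) and their sum (2537) agree with the numbers printed by the pipeline.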

In [None]:
gc.collect()