## TDMS Signal Preprocessing Pipeline for Volunteer Data

This notebook processes a special case multi-part TDMS sensor data from a volunteer patient, handling large files with memory optimization and preparing data for analysis.

**Key Operations:**
- **Data Integration**: Loads and combines 4 separate TDMS files from different recording dates (Nov-Dec 2021)
- **Signal Processing**: Converts voltage measurements to nanoTesla units, applies band-pass filtering (0.05-10 Hz), and downsamples from 5000 Hz to 25 Hz
- **Memory Management**: Uses chunked reading and garbage collection for large file handling
- **Period Segmentation**: Splits combined data into analysis periods (Nov 16-17, Dec 7-8, Dec 14-15 and Dec 15-16) for focused analysis
- **Output Generation**: Saves period-specific datasets as parquet files for downstream analysis

The pipeline includes saturation detection, anti-aliasing filters, and comprehensive error handling to ensure robust data preprocessing for magnetometer sensor analysis.

In [1]:
# General imports:

# Disable warnings:
import warnings

warnings.filterwarnings('ignore')

# Essential imports
import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pathlib import Path
import os
import gc

from tqdm import tqdm

# Add TDMS reading functionality
from nptdms import TdmsFile

# Add signal processing imports for antialiasing filter
from scipy.signal import butter, filtfilt

# Plotting enhancements
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
plt.style.use('seaborn-v0_8-whitegrid')

In [2]:
# Key Settings:
Visualize_Signal = True # Signal visualization (only if required)

# Physical constants and sensors specifications:
SENSITIVITY = 50  # mV/nT
MAGNETIC_NOISE = 3  # pT/√Hz @ 1 Hz
MAX_AC_LINEARITY = 250  # nT (+/- 250 nT) - Equivalent to 21.78 V
MAX_DC_LINEARITY = 60  # nT (+/- 60 nT)
VOLTAGE_LIMIT = 15 # V (+/-15V)
CONVERSION_FACTOR = 20  # nT per 1V
SAMPLING_FREQUENCY = 5000  # Hz - expected from the experimental data
SENSOR_SATURATION = 250  # nT - saturation threshold for the sensor

# Subjects and their types # Not relevant for this notebook
# Subject = {"Normal": "Normal Subjects","Clamp": "T1DM Clamp Subjects", "Additional": "Additional Subjects"}

# Path and directories
base_dir = Path("../../../Data")

# Output directory for saving results
output_dir = base_dir / "ProcessedData"
os.makedirs(output_dir, exist_ok=True)

# Directory for saving processed/downsampled signal files (parquet format)
signals_dir = output_dir / "Signal_Files"
os.makedirs(signals_dir, exist_ok=True)

# GMT zone correction for the sensor = GMT+2
GMT = 2

# Key frequencies from background noise analysis
POWER_LINE_FREQ = 50  # Hz
POWERLINE_HARMONICS = [POWER_LINE_FREQ*i for i in range(1, 4)]  # 50, 100, 150 Hz.

# Filter parameters:
HIGHCUT_FREQ = 10  # Hz low-pass filter cutoff frequency
LOWCUT_FREQ = 0.05 # Hz high-pass filter cutoff frequency
FILTER_ORDER = 6 # for steeper roll-off

# Channel grouping
signal_channels = {
    'Head': ['Head_left', 'Head_right'],
    'Hand': ['Hand1', 'Hand2'],
    'Liver': ['Liver1', 'Liver2'],
    'Background': ['Background1', 'Background2']
}

input_file_set = {'part1': base_dir / "RawData/Volunteers/2021/2021_11_16/fruit_1.tdms",
                 'part2': base_dir / "RawData/Volunteers/2021/2021_12_07/fruit_1.tdms",
                 'part3': base_dir / "RawData/Volunteers/2021/2021_12_14/fruit_1.tdms",
                 'part4': base_dir / "RawData/Volunteers/2021/2021_12_15/fruit_9.tdms",
                 'background': base_dir / "RawData/Volunteers/2021/2021_12_03_background"}

In [3]:
def convert_voltage_to_nanotesla_flexible(df, conversion_factor, exclude_columns=None):
    """
    Convert raw voltage signals to nanoTesla (nT) for ALL columns except specified exclusions.

    Parameters:
    - df: polars DataFrame with voltage signal data
    - conversion_factor: conversion factor in nT per Volt (default: 20 nT/V)
    - exclude_columns: list of column names to exclude from conversion (e.g., ['Time'])

    Returns:
    - df_converted: polars DataFrame with signals converted to nT
    """
    if exclude_columns is None:
        exclude_columns = []

    # Make exclude_columns case-insensitive
    exclude_columns_lower = [col.lower() for col in exclude_columns]

    # Find all columns that are NOT in the exclude list
    signal_columns = []
    time_columns = []

    for col in df.columns:
        if col.lower() in exclude_columns_lower or col.lower() == 'time':
            time_columns.append(col)
            print(f"Preserving column '{col}' (not converting)")
        else:
            signal_columns.append(col)

    print(f"Found {len(signal_columns)} signal channels to convert: {signal_columns}")
    print(f"Found {len(time_columns)} non-signal columns to preserve: {time_columns}")

    # Apply conversion to each signal channel
    converted_data = {}

    # Keep non-signal columns unchanged
    for col in time_columns:
        converted_data[col] = df[col].to_numpy()

    print(f"Converting voltage signals to nanoTesla using factor: {conversion_factor} nT/V")

    for channel in tqdm(signal_columns, desc="Converting channels"):
        if channel in df.columns:
            # Extract voltage signal data
            voltage_signal = df[channel].to_numpy()

            # Convert to nanoTesla: nT = V × conversion_factor
            nanotesla_signal = voltage_signal * conversion_factor

            # Store converted signal
            converted_data[channel] = nanotesla_signal
        else:
            print(f"Warning: Channel '{channel}' not found in data")

    # Convert back to polars DataFrame
    df_converted = pl.DataFrame(converted_data)

    print(f"Voltage to nanoTesla conversion completed. Converted {len(signal_columns)} channels.")
    print(f"Signal values are now in nanoTesla (nT) units.")

    return df_converted

In [4]:
def read_tdms_file_improved(tdms_path, GMT=2, chunk_size=500000, downsample_factor=None, memory_threshold_mb=500):
    """
    Improved TDMS file reader with memory management and chunked processing.

    Parameters:
    -----------
    tdms_path : str or Path
        Path to the TDMS file
    GMT : int, default=2
        GMT offset in hours to apply to the Time column
    chunk_size : int, default=500000
        Number of samples to read at a time for large files
    downsample_factor : int, optional
        Factor by which to downsample the data (e.g., 2 means keep every 2nd sample)
    memory_threshold_mb : int, default=500
        Memory threshold in MB above which chunked reading is used

    Returns:
    --------
    pl.DataFrame or None
        Polars DataFrame containing the TDMS data, or None if reading fails
    """
    if not os.path.exists(tdms_path) or not str(tdms_path).lower().endswith('.tdms'):
        print(f"Invalid file path or not a TDMS file: {tdms_path}")
        return None

    try:
        print(f"Reading {tdms_path}...")

        # Read the TDMS file header first to get metadata
        tdms_file = TdmsFile.read(tdms_path)

        # Check if the 'Untitled' group exists
        if 'Untitled' not in tdms_file:
            # Get the available groups
            groups = list(tdms_file.groups())
            if not groups:
                print(f"Warning: No groups found in {tdms_path}")
                return None

            # Use the first available group
            group = groups[0]
            print(f"Using group '{group.name}' instead of 'Untitled'")
        else:
            group = tdms_file['Untitled']

        # Get channel information and data length
        channels = list(group.channels())
        if not channels:
            print(f"Warning: No channels found in group")
            return None

        # Get the length of data from the first channel
        first_channel = channels[0]
        total_length = len(first_channel)
        print(f"Total data length: {total_length:,} samples")

        # Calculate memory requirements
        estimated_memory_mb = (total_length * len(channels) * 8) / (1024 * 1024)  # 8 bytes per float64
        print(f"Estimated memory requirement: {estimated_memory_mb:.1f} MB")

        if estimated_memory_mb > memory_threshold_mb:
            print("Large file detected, using chunked reading approach...")
            return _read_tdms_chunked_local(group, GMT, chunk_size, downsample_factor, total_length)
        else:
            # For smaller files, use the standard approach with optimizations
            return _read_tdms_standard_local(group, GMT, downsample_factor)

    except Exception as e:
        print(f"Error processing TDMS file {tdms_path}: {e}")
        return None

def _read_tdms_standard_local(group, GMT, downsample_factor=None):
    """Standard reading approach for smaller files"""
    try:
        data = {}

        for channel in group.channels():
            channel_data = channel[:]

            # Apply downsampling if specified
            if downsample_factor and downsample_factor > 1:
                channel_data = channel_data[::downsample_factor]

            # Use appropriate data type to save memory
            if channel.name.lower() == 'time':
                data[channel.name] = channel_data
            else:
                # Convert to float32 to save memory (vs default float64)
                data[channel.name] = channel_data.astype(np.float32)

        # Create a Polars DataFrame
        df = pl.DataFrame(data)

        # If the DataFrame has a 'Time' column, adjust it by GMT offset
        if 'Time' in df.columns:
            try:
                df = df.with_columns([pl.col('Time') + pl.duration(hours=GMT)])
                print(f"Adjusted 'Time' column by GMT+{GMT} hours")
            except Exception as e:
                print(f"Warning: Could not adjust 'Time' column: {e}")

        return df

    except Exception as e:
        print(f"Error in standard reading: {e}")
        return None

def _read_tdms_chunked_local(group, GMT, chunk_size, downsample_factor, total_length):
    """Chunked reading approach for large files"""
    try:
        channels = list(group.channels())

        if downsample_factor and downsample_factor > 1:
            print(f"Applying downsampling factor: {downsample_factor}")
            effective_chunk_size = chunk_size * downsample_factor
        else:
            effective_chunk_size = chunk_size

        # Process data in chunks
        chunk_dfs = []

        for start_idx in range(0, total_length, effective_chunk_size):
            end_idx = min(start_idx + effective_chunk_size, total_length)
            progress = (end_idx / total_length) * 100
            print(f"Processing chunk {start_idx:,} to {end_idx:,} ({progress:.1f}%)")

            chunk_data = {}

            for channel in channels:
                try:
                    channel_data = channel[start_idx:end_idx]

                    # Apply downsampling if specified
                    if downsample_factor and downsample_factor > 1:
                        channel_data = channel_data[::downsample_factor]

                    # Use appropriate data type
                    if channel.name.lower() == 'time':
                        chunk_data[channel.name] = channel_data
                    else:
                        chunk_data[channel.name] = channel_data.astype(np.float32)

                except Exception as e:
                    print(f"Warning: Error reading channel {channel.name} in chunk: {e}")
                    continue

            if chunk_data:  # Only create DataFrame if we have data
                chunk_df = pl.DataFrame(chunk_data)
                chunk_dfs.append(chunk_df)

            # Force garbage collection to free memory
            gc.collect()

        if not chunk_dfs:
            print("No valid chunks were processed")
            return None

        # Concatenate all chunks
        print("Combining chunks...")
        df = pl.concat(chunk_dfs, how='vertical')

        # Clear chunk list to free memory
        del chunk_dfs
        gc.collect()

        # If the DataFrame has a 'Time' column, adjust it by GMT offset
        if 'Time' in df.columns:
            try:
                df = df.with_columns([pl.col('Time') + pl.duration(hours=GMT)])
                print(f"Adjusted 'Time' column by GMT+{GMT} hours")
            except Exception as e:
                print(f"Warning: Could not adjust 'Time' column: {e}")

        print(f"Successfully processed large file with chunked reading")
        return df

    except Exception as e:
        print(f"Error in chunked reading: {e}")
        return None


In [5]:
# Process each part of the TDMS file set and save them as individual parquet files
processed_files = {}  # Track successfully processed files

# Filter out the background file for processing
parts_to_process = {k: v for k, v in input_file_set.items() if k != 'background'}

# Parameters for handling large files
CHUNK_SIZE = 500000  # Reduced chunk size for better memory management
DOWNSAMPLE_FACTOR = 2  # Downsample by factor of 2 if files are too large (optional)
MEMORY_THRESHOLD = 300  # Use chunked reading for files estimated to use more than 300MB

for part_name, path in parts_to_process.items():
    print(f"\nProcessing {part_name} file: {path}")

    # Check if the file exists
    if not os.path.exists(path):
        print(f"File {path} does not exist. Skipping...")
        continue

    # Get file size to estimate if downsampling might be needed
    file_size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"File size: {file_size_mb:.1f} MB")

    # Determine if downsampling is needed based on file size
    use_downsampling = file_size_mb > 100  # Use downsampling for files larger than 100MB
    downsample_factor = DOWNSAMPLE_FACTOR if use_downsampling else None

    if use_downsampling:
        print(f"Large file detected. Will apply downsampling factor: {downsample_factor}")

    # Read the TDMS file using the improved function
    try:
        df = read_tdms_file_improved(
            str(path),
            GMT=GMT,
            chunk_size=CHUNK_SIZE,
            downsample_factor=downsample_factor,
            memory_threshold_mb=MEMORY_THRESHOLD
        )

        if df is None:
            print(f"Failed to load data from {part_name}. Skipping...")
            continue

        print(f"Successfully loaded data with {df.shape[0]:,} samples and {df.shape[1]} channels")

        # Display all available channels in this file
        print(f"All channels found in {part_name}: {df.columns}")

        # Separate time and signal channels
        time_cols = [col for col in df.columns if col.lower() == 'time']
        signal_cols = [col for col in df.columns if col.lower() != 'time']

        print(f"Time columns: {time_cols}")
        print(f"Signal channels ({len(signal_cols)}): {signal_cols}")

    except Exception as e:
        print(f"Error loading TDMS file for {part_name}: {e}")
        # Force garbage collection to free memory before continuing
        gc.collect()
        continue

    # Convert voltage signals to nanoTesla using the flexible function
    try:
        df = convert_voltage_to_nanotesla_flexible(df, CONVERSION_FACTOR)
        print(f"Voltage conversion completed for {part_name}")
    except Exception as e:
        print(f"Error during voltage conversion for {part_name}: {e}")
        gc.collect()
        continue

    # Save individual parquet file for this TDMS part
    try:
        output_file = signals_dir / f"volunteer_{part_name}.parquet"
        print(f"Saving {part_name} data to {output_file}...")
        df.write_parquet(str(output_file))
        print(f"Successfully saved {df.shape[0]:,} samples from {part_name} to {output_file}")

        # Print summary statistics for this file
        print(f"DataFrame shape for {part_name}: {df.shape}")
        print(f"Columns: {df.columns}")

        # Print time range for this file
        if time_cols:
            time_col = time_cols[0]
            time_min = df[time_col].min()
            time_max = df[time_col].max()
            duration_hours = (time_max - time_min).total_seconds() / 3600
            print(f"Time range for {part_name}: {time_min} to {time_max}")
            print(f"Duration: {duration_hours:.2f} hours")

        # Calculate file size
        if output_file.exists():
            output_size_mb = output_file.stat().st_size / (1024 * 1024)
            print(f"Output file size for {part_name}: {output_size_mb:.1f} MB")

        # Track successful processing
        processed_files[part_name] = {
            'file_path': output_file,
            'shape': df.shape,
            'time_range': (time_min, time_max) if time_cols else None,
            'duration_hours': duration_hours if time_cols else None
        }

    except Exception as e:
        print(f"Error saving {part_name} data: {e}")

    finally:
        # Clear the current DataFrame to free memory
        del df
        gc.collect()

# Print summary of all processed files
if processed_files:
    print(f"\n{'='*60}")
    print("PROCESSING SUMMARY")
    print(f"{'='*60}")
    print(f"Successfully processed {len(processed_files)} TDMS files:")

    for part_name, info in processed_files.items():
        print(f"\n{part_name}:")
        print(f"  - File: {info['file_path']}")
        print(f"  - Shape: {info['shape']}")
        if info['time_range']:
            print(f"  - Time range: {info['time_range'][0]} to {info['time_range'][1]}")
            print(f"  - Duration: {info['duration_hours']:.2f} hours")
else:
    print("No data was successfully processed. No output files created.")

print("\nIndividual TDMS processing completed!")


Processing part1 file: ..\..\..\Data\RawData\Volunteers\2021\2021_11_16\fruit_1.tdms
File size: 7486.1 MB
Large file detected. Will apply downsampling factor: 2
Reading ..\..\..\Data\RawData\Volunteers\2021\2021_11_16\fruit_1.tdms...
Total data length: 245,005,000 samples
Estimated memory requirement: 5607.7 MB
Large file detected, using chunked reading approach...
Applying downsampling factor: 2
Processing chunk 0 to 1,000,000 (0.4%)
Processing chunk 1,000,000 to 2,000,000 (0.8%)
Processing chunk 2,000,000 to 3,000,000 (1.2%)
Processing chunk 3,000,000 to 4,000,000 (1.6%)
Processing chunk 4,000,000 to 5,000,000 (2.0%)
Processing chunk 5,000,000 to 6,000,000 (2.4%)
Processing chunk 6,000,000 to 7,000,000 (2.9%)
Processing chunk 7,000,000 to 8,000,000 (3.3%)
Processing chunk 8,000,000 to 9,000,000 (3.7%)
Processing chunk 9,000,000 to 10,000,000 (4.1%)
Processing chunk 10,000,000 to 11,000,000 (4.5%)
Processing chunk 11,000,000 to 12,000,000 (4.9%)
Processing chunk 12,000,000 to 13,000,

Converting channels: 100%|██████████| 2/2 [00:01<00:00,  1.77it/s]


Voltage to nanoTesla conversion completed. Converted 2 channels.
Signal values are now in nanoTesla (nT) units.
Voltage conversion completed for part1
Saving part1 data to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part1.parquet...
Successfully saved 122,502,500 samples from part1 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part1.parquet
DataFrame shape for part1: (122502500, 3)
Columns: ['Time', 'Voltage_0', 'Voltage_1']
Time range for part1: 2021-11-16 22:24:59.171050 to 2021-11-17 04:49:32.170950
Duration: 6.41 hours
Output file size for part1: 610.0 MB

Processing part2 file: ..\..\..\Data\RawData\Volunteers\2021\2021_12_07\fruit_1.tdms
File size: 8021.4 MB
Large file detected. Will apply downsampling factor: 2
Reading ..\..\..\Data\RawData\Volunteers\2021\2021_12_07\fruit_1.tdms...
Total data length: 262,525,000 samples
Estimated memory requirement: 6008.7 MB
Large file detected, using chunked reading approach...
Applying downsampling factor: 2
Processing chunk 0

Converting channels: 100%|██████████| 2/2 [00:00<00:00,  2.27it/s]


Voltage to nanoTesla conversion completed. Converted 2 channels.
Signal values are now in nanoTesla (nT) units.
Voltage conversion completed for part2
Saving part2 data to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part2.parquet...
Successfully saved 131,262,500 samples from part2 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part2.parquet
DataFrame shape for part2: (131262500, 3)
Columns: ['Time', 'Voltage_0', 'Voltage_1']
Time range for part2: 2021-12-07 22:19:36.521900 to 2021-12-08 06:51:52.271800
Duration: 8.54 hours
Output file size for part2: 641.0 MB

Processing part3 file: ..\..\..\Data\RawData\Volunteers\2021\2021_12_14\fruit_1.tdms
File size: 10658.6 MB
Large file detected. Will apply downsampling factor: 2
Reading ..\..\..\Data\RawData\Volunteers\2021\2021_12_14\fruit_1.tdms...
Total data length: 348,835,000 samples
Estimated memory requirement: 7984.2 MB
Large file detected, using chunked reading approach...
Applying downsampling factor: 2
Processing chunk 

Converting channels: 100%|██████████| 2/2 [00:01<00:00,  1.44it/s]


Voltage to nanoTesla conversion completed. Converted 2 channels.
Signal values are now in nanoTesla (nT) units.
Voltage conversion completed for part3
Saving part3 data to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part3.parquet...
Successfully saved 174,417,500 samples from part3 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part3.parquet
DataFrame shape for part3: (174417500, 3)
Columns: ['Time', 'Voltage_0', 'Voltage_1']
Time range for part3: 2021-12-14 22:35:22.120750 to 2021-12-15 06:58:26.370650
Duration: 8.38 hours
Output file size for part3: 900.6 MB

Processing part4 file: ..\..\..\Data\RawData\Volunteers\2021\2021_12_15\fruit_9.tdms
File size: 8753.4 MB
Large file detected. Will apply downsampling factor: 2
Reading ..\..\..\Data\RawData\Volunteers\2021\2021_12_15\fruit_9.tdms...
Total data length: 286,480,000 samples
Estimated memory requirement: 6557.0 MB
Large file detected, using chunked reading approach...
Applying downsampling factor: 2
Processing chunk 0

Converting channels: 100%|██████████| 2/2 [00:00<00:00,  2.38it/s]


Voltage to nanoTesla conversion completed. Converted 2 channels.
Signal values are now in nanoTesla (nT) units.
Voltage conversion completed for part4
Saving part4 data to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part4.parquet...
Successfully saved 143,240,000 samples from part4 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part4.parquet
DataFrame shape for part4: (143240000, 3)
Columns: ['Time', 'Voltage_0', 'Voltage_1']
Time range for part4: 2021-12-15 21:29:29.463300 to 2021-12-16 07:23:24.963100
Duration: 9.90 hours
Output file size for part4: 1008.3 MB

PROCESSING SUMMARY
Successfully processed 4 TDMS files:

part1:
  - File: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part1.parquet
  - Shape: (122502500, 3)
  - Time range: 2021-11-16 22:24:59.171050 to 2021-11-17 04:49:32.170950
  - Duration: 6.41 hours

part2:
  - File: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part2.parquet
  - Shape: (131262500, 3)
  - Time range: 2021-12-07 22:19:36.521900 to

#### Define the anti-aliasing filter function

In [6]:
def apply_antialiasing_filter(df, lowcut_freq=0.05, highcut_freq=10, sampling_freq=5000, filter_order=6):
    """
    Apply low-pass and high-pass filters separately with improved stability.

    Parameters:
    - df: polars DataFrame with signal data
    - lowcut_freq: low cutoff frequency in Hz (default: 0.05Hz)
    - highcut_freq: high cutoff frequency in Hz (default: 10Hz)
    - sampling_freq: sampling frequency in Hz (default: 5000Hz)
    - filter_order: filter order (default: 6)

    Returns:
    - df_filtered: polars DataFrame with filtered signals
    """
    # Calculate normalized cutoff frequencies
    nyq = 0.5 * sampling_freq
    high_normal = highcut_freq / nyq

    # Design Butterworth low-pass filter
    b_low, a_low = butter(filter_order, high_normal, btype='low', analog=False)

    # Design high-pass filter with improved stability
    apply_highpass = lowcut_freq > 0
    if apply_highpass:
        low_normal = lowcut_freq / nyq
        print(f"High-pass normalized frequency: {low_normal:.6f}")

        # Use lower filter order for very low frequencies to improve stability
        hp_filter_order = min(filter_order, 4) if low_normal < 0.001 else filter_order

        # Design high-pass filter with SOS (Second-Order Sections) for better numerical stability
        sos_high = butter(hp_filter_order, low_normal, btype='high', analog=False, output='sos')
        print(f"Using filter order {hp_filter_order} for high-pass filter")

    # Get all signal channel names (exclude time column)
    signal_column_names = []
    for channel_group in signal_channels.values():
        signal_column_names.extend(channel_group)

    # Apply filter to each signal channel
    filtered_data = {}

    # Keep the time column unchanged
    time_col = None
    for col in df.columns:
        if col.lower() == 'time':
            time_col = col
            break

    if time_col is not None:
        filtered_data[time_col] = df[time_col].to_numpy()
        print(f"Time column '{time_col}' preserved in filtered data")
    else:
        print("Warning: No time column found in input data")

    filter_description = f"low-pass ({highcut_freq}Hz)"
    if apply_highpass:
        filter_description = f"high-pass ({lowcut_freq}Hz) + {filter_description}"

    print(f"Applying {filter_description} filters to channels...")

    for channel in tqdm(signal_column_names, desc="Filtering channels"):
        if channel in df.columns:
            # Extract signal data
            signal = df[channel].to_numpy()

            # First apply low-pass filter
            filtered_signal = filtfilt(b_low, a_low, signal)

            # Then apply high-pass filter if needed using SOS format
            if apply_highpass:
                from scipy.signal import sosfiltfilt
                filtered_signal = sosfiltfilt(sos_high, filtered_signal)

            # Check for NaN values and report
            if np.any(np.isnan(filtered_signal)):
                print(f"Warning: NaN values detected in {channel} after filtering")
                # Option: replace NaN with interpolated values or skip this channel
                nan_count = np.sum(np.isnan(filtered_signal))
                print(f"  {nan_count} NaN values out of {len(filtered_signal)} samples")

                # Simple NaN handling: replace with median of non-NaN values
                if nan_count < len(filtered_signal) * 0.1:  # Less than 10% NaN
                    median_val = np.nanmedian(filtered_signal)
                    filtered_signal = np.where(np.isnan(filtered_signal), median_val, filtered_signal)
                    print(f"  Replaced NaN values with median: {median_val:.3f}")
                else:
                    print(f"  Too many NaN values ({nan_count}/{len(filtered_signal)}), skipping channel")
                    continue

            # Store filtered signal
            filtered_data[channel] = filtered_signal
        else:
            print(f"Warning: Channel '{channel}' not found in data")

    # Convert back to polars DataFrame
    df_filtered = pl.DataFrame(filtered_data)

    print(f"Filtering completed successfully. Processed {len(signal_column_names)} channels.")
    return df_filtered


#### Define the downsampling function

In [7]:
def downsample_data(df, original_fs=5000, target_fs=25, fix_saturated=True):
    """
    Downsample the (filtered) data using averaging.

    Parameters:
    - df: polars DataFrame with filtered signal data
    - original_fs: original sampling frequency in Hz (default: 5000Hz)
    - target_fs: target sampling frequency in Hz (default: 25Hz)
    - fix_saturated: bool, if True removes saturated values before averaging (default: True)

    Returns:
    - df_downsampled: polars DataFrame with downsampled signals
    """
    # Calculate downsampling factor
    downsample_factor = original_fs // target_fs
    print(f"Downsampling from {original_fs}Hz to {target_fs}Hz (factor: {downsample_factor})")

    if fix_saturated:
        print("Saturation removal enabled - will exclude saturated values from averaging")

    # Get signal column names (exclude time column)
    signal_column_names = []
    for channel_group in signal_channels.values():
        signal_column_names.extend(channel_group)

    # Calculate number of complete windows
    n_samples = df.shape[0]
    n_windows = n_samples // downsample_factor
    print(f"Processing {n_samples} samples into {n_windows} downsampled points")

    # Initialize dictionary for downsampled data
    downsampled_data = {}

    # Downsample time column if present - check for both 'time' and 'Time'
    time_col = None
    for col in df.columns:
        if col.lower() == 'time':
            time_col = col
            break

    if time_col is not None:
        time_data = df[time_col].to_numpy()
        # Take every nth sample for time (or average if needed)
        downsampled_time = time_data[::downsample_factor][:n_windows]
        downsampled_data[time_col] = downsampled_time
        print(f"Time column '{time_col}' downsampled")
    else:
        print("Warning: No time column found for downsampling")

    # Define saturation thresholds based on physical constants
    # Use the predefined sensor saturation threshold
    saturation_threshold_nt = SENSOR_SATURATION  # Saturation threshold for the sensor
    print(f"Saturation threshold: ±{saturation_threshold_nt} nT")

    # Track saturation statistics
    total_saturated_samples = 0
    saturated_windows = 0

    # Downsample each signal channel using averaging
    print("Downsampling channels using averaging...")
    for channel in tqdm(signal_column_names, desc="Downsampling channels"):
        if channel in df.columns:
            # Extract signal data
            signal = df[channel].to_numpy()

            # Reshape for averaging (trim to complete windows)
            signal_windowed = signal[:n_windows * downsample_factor].reshape(n_windows, downsample_factor)

            if fix_saturated:
                # Apply saturation filtering before averaging
                downsampled_signal = []
                saturation_flags = []  # Track which windows had saturation issues
                channel_saturated_samples = 0
                channel_saturated_windows = 0

                for window in signal_windowed:
                    # Identify saturated samples (beyond threshold)
                    # Use > instead of >= to avoid edge case issues
                    saturated_mask = np.abs(window) > saturation_threshold_nt
                    saturated_count = np.sum(saturated_mask)

                    if saturated_count > 0:
                        channel_saturated_samples += saturated_count
                        channel_saturated_windows += 1

                        # Remove saturated values from averaging
                        valid_samples = window[~saturated_mask]

                        if len(valid_samples) > 0:
                            # Average only non-saturated samples
                            window_avg = np.mean(valid_samples)
                            saturation_flags.append(1)  # Partially saturated
                        else:
                            # If all samples are saturated, mark as invalid
                            # Use NaN to indicate unreliable data
                            window_avg = np.nan
                            saturation_flags.append(2)  # Fully saturated (unreliable)

                        downsampled_signal.append(window_avg)
                    else:
                        # No saturation, use normal averaging
                        downsampled_signal.append(np.mean(window))
                        saturation_flags.append(0)  # No saturation

                downsampled_signal = np.array(downsampled_signal)

                # Note: Saturation flags are tracked for statistics but not stored in output data

                # Update statistics
                total_saturated_samples += channel_saturated_samples
                saturated_windows += channel_saturated_windows

                if channel_saturated_samples > 0:
                    saturation_percentage = (channel_saturated_samples / (n_windows * downsample_factor)) * 100
                    print(f"  {channel}: {channel_saturated_samples} saturated samples ({saturation_percentage:.2f}%) in {channel_saturated_windows} windows")
            else:
                # Standard averaging without saturation handling
                downsampled_signal = np.mean(signal_windowed, axis=1)

            # Store downsampled signal
            downsampled_data[channel] = downsampled_signal
        else:
            print(f"Warning: Channel '{channel}' not found in filtered data")

    # Print saturation summary
    if fix_saturated and total_saturated_samples > 0:
        total_samples = n_windows * downsample_factor * len(signal_column_names)
        overall_saturation_percentage = (total_saturated_samples / total_samples) * 100
        print(f"\nSaturation Summary:")
        print(f"  Total saturated samples: {total_saturated_samples}")
        print(f"  Total windows with saturation: {saturated_windows}")
        print(f"  Overall saturation rate: {overall_saturation_percentage:.2f}%")

    # Convert to polars DataFrame
    df_downsampled = pl.DataFrame(downsampled_data)

    print(f"Downsampling completed. New shape: {df_downsampled.shape}")
    print(f"Effective sampling rate: {target_fs}Hz")

    return df_downsampled

#### Process signal data by periods

- Load the signal file
- Rename the voltage channels to Hand1 and Hand2
  - Apply band-pass filter (0.05 Hz to 10 Hz)
  - Downsample from 5000 Hz to 25 Hz
  - Save as separate parquet file

#### Load and process the signal data from the saved parquet files

In [9]:
# Load the saved parquet file
signal_file = signals_dir / "volunteer.parquet"
print(f"\nLoading signal file: {signal_file}")
if not signal_file.exists():
    print(f"Signal file {signal_file} does not exist. Cannot load data.")
else:
    try:
        signal_df = pl.read_parquet(str(signal_file))
        print(f"Successfully loaded signal DataFrame with shape: {signal_df.shape}")
        print(f"Columns: {signal_df.columns}")
    except Exception as e:
        print(f"Error loading signal file: {e}")

# Rename channels from Voltage_1 and Voltage_2 to Hand1 and Hand2
if signal_df is not None:
    print("Renaming channels to Hand1 and Hand2...")
    signal_df = signal_df.rename({
        'Voltage_0': 'Hand1',
        'Voltage_1': 'Hand2'
    })
    print("Renaming completed.")

# Split the DataFrame into periods based on the specified dates
if signal_df is not None:
    print("Splitting data into specified periods...")
    periods = {
        '2021-11-16_17': (pl.datetime(2021, 11, 16), pl.datetime(2021, 11, 17)),
        '2021-12-07_08': (pl.datetime(2021, 12, 7), pl.datetime(2021, 12, 8)),
        '2021-12-14_15': (pl.datetime(2021, 12, 14), pl.datetime(2021, 12, 15)),
        '2021-12-15_16': (pl.datetime(2021, 12, 15), pl.datetime(2021, 12, 16))
    }
    for period_name, (start_date, end_date) in periods.items():
        print(f"Processing period: {period_name} from {start_date} to {end_date}")

        # Filter the DataFrame for the current period
        period_df = signal_df.filter(
            (pl.col('Time') >= start_date) & (pl.col('Time') <= end_date)
        )

        if period_df.is_empty():
            print(f"No data found for period {period_name}. Skipping...")
            continue

        # Apply the anti-aliasing filter to all signal channels
        print(f"Applying band-pass filter to period {period_name}...")
        period_df_filtered = apply_antialiasing_filter(
            period_df,
            lowcut_freq=LOWCUT_FREQ,
            highcut_freq=HIGHCUT_FREQ,
            sampling_freq=SAMPLING_FREQUENCY,
            filter_order=FILTER_ORDER
        )
        print(f"Filtered data shape for {period_name}: {period_df_filtered.shape}")

        # Apply downsampling from 5000Hz to 25Hz
        print(f"Downsampling period {period_name} to 25Hz...")
        period_df_downsampled = downsample_data(
            period_df_filtered,
            original_fs=SAMPLING_FREQUENCY,
            target_fs=25,
            fix_saturated=True
        )
        print(f"Downsampled data shape for {period_name}: {period_df_downsampled.shape}")

        # Save the filtered and downsampled DataFrame to a new parquet file
        output_period_file = signals_dir / f"volunteer_{period_name}.parquet"
        try:
            period_df_downsampled.write_parquet(str(output_period_file))
            print(f"Successfully saved filtered and downsampled period {period_name} to {output_period_file}")
        except Exception as e:
            print(f"Error saving period {period_name}: {e}")



Loading signal file: ..\..\..\Data\ProcessedData\Signal_Files\volunteer.parquet
Successfully loaded signal DataFrame with shape: (571422500, 3)
Columns: ['Time', 'Voltage_0', 'Voltage_1']
Renaming channels to Hand1 and Hand2...
Renaming completed.
Splitting data into specified periods...
Processing period: 2021-11-16_17 from 2021-11-16 00:00:00.alias("datetime") to 2021-11-17 00:00:00.alias("datetime")
Applying band-pass filter to period 2021-11-16_17...
High-pass normalized frequency: 0.000020
Using filter order 4 for high-pass filter
Time column 'Time' preserved in filtered data
Applying high-pass (0.05Hz) + low-pass (10Hz) filters to channels...


Filtering channels:   0%|          | 0/8 [00:00<?, ?it/s]



Filtering channels: 100%|██████████| 8/8 [00:06<00:00,  1.27it/s]


Filtering completed successfully. Processed 8 channels.
Filtered data shape for 2021-11-16_17: (46163290, 3)
Downsampling period 2021-11-16_17 to 25Hz...
Downsampling from 5000Hz to 25Hz (factor: 200)
Saturation removal enabled - will exclude saturated values from averaging
Processing 46163290 samples into 230816 downsampled points
Time column 'Time' downsampled
Saturation threshold: ±250 nT
Downsampling channels using averaging...


Downsampling channels:   0%|          | 0/8 [00:00<?, ?it/s]



Downsampling channels:  38%|███▊      | 3/8 [00:07<00:12,  2.49s/it]

  Hand1: 77478 saturated samples (0.17%) in 416 windows


Downsampling channels: 100%|██████████| 8/8 [00:14<00:00,  1.87s/it]

  Hand2: 75863 saturated samples (0.16%) in 411 windows

Saturation Summary:
  Total saturated samples: 153341
  Total windows with saturation: 827
  Overall saturation rate: 0.04%
Downsampling completed. New shape: (230816, 3)
Effective sampling rate: 25Hz
Downsampled data shape for 2021-11-16_17: (230816, 3)
Successfully saved filtered and downsampled period 2021-11-16_17 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_2021-11-16_17.parquet
Processing period: 2021-12-07_08 from 2021-12-07 00:00:00.alias("datetime") to 2021-12-08 00:00:00.alias("datetime")





Applying band-pass filter to period 2021-12-07_08...
High-pass normalized frequency: 0.000020
Using filter order 4 for high-pass filter
Time column 'Time' preserved in filtered data
Applying high-pass (0.05Hz) + low-pass (10Hz) filters to channels...


Filtering channels:   0%|          | 0/8 [00:00<?, ?it/s]



Filtering channels: 100%|██████████| 8/8 [00:05<00:00,  1.54it/s]


Filtering completed successfully. Processed 8 channels.
Filtered data shape for 2021-12-07_08: (43012500, 3)
Downsampling period 2021-12-07_08 to 25Hz...
Downsampling from 5000Hz to 25Hz (factor: 200)
Saturation removal enabled - will exclude saturated values from averaging
Processing 43012500 samples into 215062 downsampled points
Time column 'Time' downsampled
Saturation threshold: ±250 nT
Downsampling channels using averaging...


Downsampling channels:   0%|          | 0/8 [00:00<?, ?it/s]



Downsampling channels:  38%|███▊      | 3/8 [00:06<00:11,  2.32s/it]

  Hand1: 121716 saturated samples (0.28%) in 645 windows


Downsampling channels: 100%|██████████| 8/8 [00:13<00:00,  1.68s/it]

  Hand2: 102761 saturated samples (0.24%) in 541 windows

Saturation Summary:
  Total saturated samples: 224477
  Total windows with saturation: 1186
  Overall saturation rate: 0.07%
Downsampling completed. New shape: (215062, 3)
Effective sampling rate: 25Hz
Downsampled data shape for 2021-12-07_08: (215062, 3)
Successfully saved filtered and downsampled period 2021-12-07_08 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_2021-12-07_08.parquet
Processing period: 2021-12-14_15 from 2021-12-14 00:00:00.alias("datetime") to 2021-12-15 00:00:00.alias("datetime")





Applying band-pass filter to period 2021-12-14_15...
High-pass normalized frequency: 0.000020
Using filter order 4 for high-pass filter
Time column 'Time' preserved in filtered data
Applying high-pass (0.05Hz) + low-pass (10Hz) filters to channels...


Filtering channels:   0%|          | 0/8 [00:00<?, ?it/s]



Filtering channels: 100%|██████████| 8/8 [00:04<00:00,  1.65it/s]


Filtering completed successfully. Processed 8 channels.
Filtered data shape for 2021-12-14_15: (46351293, 3)
Downsampling period 2021-12-14_15 to 25Hz...
Downsampling from 5000Hz to 25Hz (factor: 200)
Saturation removal enabled - will exclude saturated values from averaging
Processing 46351293 samples into 231756 downsampled points
Time column 'Time' downsampled
Saturation threshold: ±250 nT
Downsampling channels using averaging...


Downsampling channels:   0%|          | 0/8 [00:00<?, ?it/s]



Downsampling channels:  38%|███▊      | 3/8 [00:06<00:11,  2.31s/it]

  Hand1: 617331 saturated samples (1.33%) in 3286 windows


Downsampling channels: 100%|██████████| 8/8 [00:13<00:00,  1.73s/it]

  Hand2: 614167 saturated samples (1.33%) in 3268 windows

Saturation Summary:
  Total saturated samples: 1231498
  Total windows with saturation: 6554
  Overall saturation rate: 0.33%
Downsampling completed. New shape: (231756, 3)
Effective sampling rate: 25Hz
Downsampled data shape for 2021-12-14_15: (231756, 3)
Successfully saved filtered and downsampled period 2021-12-14_15 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_2021-12-14_15.parquet
Processing period: 2021-12-15_16 from 2021-12-15 00:00:00.alias("datetime") to 2021-12-16 00:00:00.alias("datetime")





Applying band-pass filter to period 2021-12-15_16...
High-pass normalized frequency: 0.000020
Using filter order 4 for high-pass filter
Time column 'Time' preserved in filtered data
Applying high-pass (0.05Hz) + low-pass (10Hz) filters to channels...


Filtering channels:   0%|          | 0/8 [00:00<?, ?it/s]



Filtering channels: 100%|██████████| 8/8 [00:44<00:00,  5.56s/it]






Filtering completed successfully. Processed 8 channels.
Filtered data shape for 2021-12-15_16: (173196391, 3)
Downsampling period 2021-12-15_16 to 25Hz...
Downsampling from 5000Hz to 25Hz (factor: 200)
Saturation removal enabled - will exclude saturated values from averaging
Processing 173196391 samples into 865981 downsampled points
Time column 'Time' downsampled
Saturation threshold: ±250 nT
Downsampling channels using averaging...


Downsampling channels:   0%|          | 0/8 [00:00<?, ?it/s]



Downsampling channels:  38%|███▊      | 3/8 [00:27<00:45,  9.17s/it]

  Hand1: 628530 saturated samples (0.36%) in 3475 windows


Downsampling channels: 100%|██████████| 8/8 [00:54<00:00,  6.85s/it]

  Hand2: 149350 saturated samples (0.09%) in 817 windows

Saturation Summary:
  Total saturated samples: 777880
  Total windows with saturation: 4292
  Overall saturation rate: 0.06%





Downsampling completed. New shape: (865981, 3)
Effective sampling rate: 25Hz
Downsampled data shape for 2021-12-15_16: (865981, 3)
Successfully saved filtered and downsampled period 2021-12-15_16 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_2021-12-15_16.parquet


In [None]:
gc.collect()

#### Load and process individual parquet files (volunteer_part1 through part4)

For each part:
- Load the signal data
- Rename the voltage channels to Hand1 and Hand2
- Apply band-pass filtering (0.05 Hz to 10 Hz)
- Downsample from 5000 Hz to 25 Hz
- Save as separate parquet file with "_downsampled_25Hz" suffix

In [8]:
# Load and process individual parquet files (volunteer_part1 through part4)
processed_downsampled_files = {}

# Define the individual parquet files to process
individual_files = {
    'part1': signals_dir / "volunteer_part1.parquet",
    'part2': signals_dir / "volunteer_part2.parquet",
    'part3': signals_dir / "volunteer_part3.parquet",
    'part4': signals_dir / "volunteer_part4.parquet"
}

for part_name, file_path in individual_files.items():
    print(f"\n{'='*60}")
    print(f"PROCESSING {part_name.upper()}: {file_path}")
    print(f"{'='*60}")

    # Check if file exists
    if not file_path.exists():
        print(f"File {file_path} does not exist. Skipping...")
        continue

    try:
        # Load the parquet file
        print(f"Loading {part_name} data...")
        signal_df = pl.read_parquet(str(file_path))
        print(f"Successfully loaded {part_name} with shape: {signal_df.shape}")
        print(f"Columns: {signal_df.columns}")

        # Display time range for this part
        if 'Time' in signal_df.columns:
            time_min = signal_df['Time'].min()
            time_max = signal_df['Time'].max()
            duration_hours = (time_max - time_min).total_seconds() / 3600
            print(f"Time range: {time_min} to {time_max}")
            print(f"Duration: {duration_hours:.2f} hours")

    except Exception as e:
        print(f"Error loading {part_name}: {e}")
        continue

    # Rename voltage channels to Hand1 and Hand2
    print(f"Renaming voltage channels in {part_name}...")
    original_columns = signal_df.columns

    # Check what voltage columns exist and rename them
    rename_dict = {}
    if 'Voltage_0' in signal_df.columns:
        rename_dict['Voltage_0'] = 'Hand1'
    if 'Voltage_1' in signal_df.columns:
        rename_dict['Voltage_1'] = 'Hand2'

    if rename_dict:
        signal_df = signal_df.rename(rename_dict)
        print(f"Renamed channels: {rename_dict}")
    else:
        print("No voltage channels found to rename")

    print(f"Updated columns: {signal_df.columns}")

    # Apply band-pass filter (0.05 Hz to 10 Hz)
    print(f"Applying band-pass filter to {part_name}...")
    try:
        signal_df_filtered = apply_antialiasing_filter(
            signal_df,
            lowcut_freq=LOWCUT_FREQ,
            highcut_freq=HIGHCUT_FREQ,
            sampling_freq=SAMPLING_FREQUENCY,
            filter_order=FILTER_ORDER
        )
        print(f"Filtered data shape for {part_name}: {signal_df_filtered.shape}")
    except Exception as e:
        print(f"Error applying filter to {part_name}: {e}")
        continue

    # Downsample from 5000 Hz to 25 Hz
    print(f"Downsampling {part_name} from 5000Hz to 25Hz...")
    try:
        signal_df_downsampled = downsample_data(
            signal_df_filtered,
            original_fs=SAMPLING_FREQUENCY,
            target_fs=25,
            fix_saturated=True
        )
        print(f"Downsampled data shape for {part_name}: {signal_df_downsampled.shape}")
    except Exception as e:
        print(f"Error downsampling {part_name}: {e}")
        continue

    # Save as separate parquet file with "_downsampled_25hz" ending
    output_file = signals_dir / f"volunteer_{part_name}_downsampled_25hz.parquet"
    try:
        signal_df_downsampled.write_parquet(str(output_file))
        print(f"Successfully saved downsampled {part_name} to {output_file}")

        # Calculate output file size
        if output_file.exists():
            output_size_mb = output_file.stat().st_size / (1024 * 1024)
            print(f"Output file size: {output_size_mb:.1f} MB")

        # Track successful processing
        processed_downsampled_files[part_name] = {
            'original_file': file_path,
            'downsampled_file': output_file,
            'original_shape': signal_df.shape,
            'downsampled_shape': signal_df_downsampled.shape,
            'time_range': (time_min, time_max) if 'Time' in signal_df.columns else None,
            'duration_hours': duration_hours if 'Time' in signal_df.columns else None
        }

    except Exception as e:
        print(f"Error saving downsampled {part_name}: {e}")
        continue

    finally:
        # Clean up memory
        del signal_df, signal_df_filtered, signal_df_downsampled
        gc.collect()

# Print summary of processed files
if processed_downsampled_files:
    print(f"\n{'='*80}")
    print("DOWNSAMPLING PROCESSING SUMMARY")
    print(f"{'='*80}")
    print(f"Successfully processed {len(processed_downsampled_files)} files:")

    for part_name, info in processed_downsampled_files.items():
        print(f"\n{part_name.upper()}:")
        print(f"  - Original file: {info['original_file']}")
        print(f"  - Downsampled file: {info['downsampled_file']}")
        print(f"  - Shape change: {info['original_shape']} → {info['downsampled_shape']}")
        if info['time_range']:
            print(f"  - Time range: {info['time_range'][0]} to {info['time_range'][1]}")
            print(f"  - Duration: {info['duration_hours']:.2f} hours")

        # Calculate compression ratio
        orig_samples = info['original_shape'][0]
        down_samples = info['downsampled_shape'][0]
        compression_ratio = orig_samples / down_samples if down_samples > 0 else 0
        print(f"  - Compression ratio: {compression_ratio:.1f}:1")
else:
    print("No files were successfully processed and downsampled.")

print(f"\nDownsampling completed! All processed files are saved with '_downsampled_25hz' suffix.")



PROCESSING PART1: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part1.parquet
Loading part1 data...
Successfully loaded part1 with shape: (122502500, 3)
Columns: ['Time', 'Voltage_0', 'Voltage_1']
Time range: 2021-11-16 22:24:59.171050 to 2021-11-17 04:49:32.170950
Duration: 6.41 hours
Renaming voltage channels in part1...
Renamed channels: {'Voltage_0': 'Hand1', 'Voltage_1': 'Hand2'}
Updated columns: ['Time', 'Hand1', 'Hand2']
Applying band-pass filter to part1...
High-pass normalized frequency: 0.000020
Using filter order 4 for high-pass filter
Time column 'Time' preserved in filtered data
Applying high-pass (0.05Hz) + low-pass (10Hz) filters to channels...


Filtering channels:   0%|          | 0/8 [00:00<?, ?it/s]



Filtering channels: 100%|██████████| 8/8 [00:16<00:00,  2.04s/it]


Filtering completed successfully. Processed 8 channels.
Filtered data shape for part1: (122502500, 3)
Downsampling part1 from 5000Hz to 25Hz...
Downsampling from 5000Hz to 25Hz (factor: 200)
Saturation removal enabled - will exclude saturated values from averaging
Processing 122502500 samples into 612512 downsampled points
Time column 'Time' downsampled
Saturation threshold: ±250 nT
Downsampling channels using averaging...


Downsampling channels:   0%|          | 0/8 [00:00<?, ?it/s]



Downsampling channels:  38%|███▊      | 3/8 [00:22<00:38,  7.62s/it]

  Hand1: 213953 saturated samples (0.17%) in 1156 windows


Downsampling channels: 100%|██████████| 8/8 [00:45<00:00,  5.74s/it]

  Hand2: 195512 saturated samples (0.16%) in 1066 windows

Saturation Summary:
  Total saturated samples: 409465
  Total windows with saturation: 2222
  Overall saturation rate: 0.04%
Downsampling completed. New shape: (612512, 3)
Effective sampling rate: 25Hz
Downsampled data shape for part1: (612512, 3)
Successfully saved downsampled part1 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part1_downsampled_25Hz.parquet
Output file size: 10.3 MB






PROCESSING PART2: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part2.parquet
Loading part2 data...
Successfully loaded part2 with shape: (131262500, 3)
Columns: ['Time', 'Voltage_0', 'Voltage_1']
Time range: 2021-12-07 22:19:36.521900 to 2021-12-08 06:51:52.271800
Duration: 8.54 hours
Renaming voltage channels in part2...
Renamed channels: {'Voltage_0': 'Hand1', 'Voltage_1': 'Hand2'}
Updated columns: ['Time', 'Hand1', 'Hand2']
Applying band-pass filter to part2...
High-pass normalized frequency: 0.000020
Using filter order 4 for high-pass filter
Time column 'Time' preserved in filtered data
Applying high-pass (0.05Hz) + low-pass (10Hz) filters to channels...


Filtering channels:   0%|          | 0/8 [00:00<?, ?it/s]



Filtering channels: 100%|██████████| 8/8 [00:16<00:00,  2.05s/it]


Filtering completed successfully. Processed 8 channels.
Filtered data shape for part2: (131262500, 3)
Downsampling part2 from 5000Hz to 25Hz...
Downsampling from 5000Hz to 25Hz (factor: 200)
Saturation removal enabled - will exclude saturated values from averaging
Processing 131262500 samples into 656312 downsampled points
Time column 'Time' downsampled
Saturation threshold: ±250 nT
Downsampling channels using averaging...


Downsampling channels:   0%|          | 0/8 [00:00<?, ?it/s]



Downsampling channels:  38%|███▊      | 3/8 [00:24<00:40,  8.05s/it]

  Hand1: 328322 saturated samples (0.25%) in 1747 windows


Downsampling channels: 100%|██████████| 8/8 [00:46<00:00,  5.80s/it]

  Hand2: 297234 saturated samples (0.23%) in 1585 windows

Saturation Summary:
  Total saturated samples: 625556
  Total windows with saturation: 3332
  Overall saturation rate: 0.06%
Downsampling completed. New shape: (656312, 3)
Effective sampling rate: 25Hz
Downsampled data shape for part2: (656312, 3)
Successfully saved downsampled part2 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part2_downsampled_25Hz.parquet
Output file size: 11.1 MB






PROCESSING PART3: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part3.parquet
Loading part3 data...
Successfully loaded part3 with shape: (174417500, 3)
Columns: ['Time', 'Voltage_0', 'Voltage_1']
Time range: 2021-12-14 22:35:22.120750 to 2021-12-15 06:58:26.370650
Duration: 8.38 hours
Renaming voltage channels in part3...
Renamed channels: {'Voltage_0': 'Hand1', 'Voltage_1': 'Hand2'}
Updated columns: ['Time', 'Hand1', 'Hand2']
Applying band-pass filter to part3...
High-pass normalized frequency: 0.000020
Using filter order 4 for high-pass filter
Time column 'Time' preserved in filtered data
Applying high-pass (0.05Hz) + low-pass (10Hz) filters to channels...


Filtering channels:   0%|          | 0/8 [00:00<?, ?it/s]



Filtering channels: 100%|██████████| 8/8 [00:20<00:00,  2.61s/it]


Filtering completed successfully. Processed 8 channels.
Filtered data shape for part3: (174417500, 3)
Downsampling part3 from 5000Hz to 25Hz...
Downsampling from 5000Hz to 25Hz (factor: 200)
Saturation removal enabled - will exclude saturated values from averaging
Processing 174417500 samples into 872087 downsampled points
Time column 'Time' downsampled
Saturation threshold: ±250 nT
Downsampling channels using averaging...


Downsampling channels:   0%|          | 0/8 [00:00<?, ?it/s]



Downsampling channels:  38%|███▊      | 3/8 [00:29<00:49,  9.83s/it]

  Hand1: 774734 saturated samples (0.44%) in 4146 windows


Downsampling channels: 100%|██████████| 8/8 [01:02<00:00,  7.86s/it]

  Hand2: 755043 saturated samples (0.43%) in 4039 windows

Saturation Summary:
  Total saturated samples: 1529777
  Total windows with saturation: 8185
  Overall saturation rate: 0.11%
Downsampling completed. New shape: (872087, 3)
Effective sampling rate: 25Hz
Downsampled data shape for part3: (872087, 3)





Successfully saved downsampled part3 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part3_downsampled_25Hz.parquet
Output file size: 14.6 MB

PROCESSING PART4: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part4.parquet
Loading part4 data...
Successfully loaded part4 with shape: (143240000, 3)
Columns: ['Time', 'Voltage_0', 'Voltage_1']
Time range: 2021-12-15 21:29:29.463300 to 2021-12-16 07:23:24.963100
Duration: 9.90 hours
Renaming voltage channels in part4...
Renamed channels: {'Voltage_0': 'Hand1', 'Voltage_1': 'Hand2'}
Updated columns: ['Time', 'Hand1', 'Hand2']
Applying band-pass filter to part4...
High-pass normalized frequency: 0.000020
Using filter order 4 for high-pass filter
Time column 'Time' preserved in filtered data
Applying high-pass (0.05Hz) + low-pass (10Hz) filters to channels...


Filtering channels:   0%|          | 0/8 [00:00<?, ?it/s]



Filtering channels: 100%|██████████| 8/8 [00:17<00:00,  2.18s/it]


Filtering completed successfully. Processed 8 channels.
Filtered data shape for part4: (143240000, 3)
Downsampling part4 from 5000Hz to 25Hz...
Downsampling from 5000Hz to 25Hz (factor: 200)
Saturation removal enabled - will exclude saturated values from averaging
Processing 143240000 samples into 716200 downsampled points
Time column 'Time' downsampled
Saturation threshold: ±250 nT
Downsampling channels using averaging...


Downsampling channels:   0%|          | 0/8 [00:00<?, ?it/s]



Downsampling channels:  38%|███▊      | 3/8 [00:30<00:51, 10.32s/it]

  Hand1: 721774 saturated samples (0.50%) in 4023 windows


Downsampling channels: 100%|██████████| 8/8 [00:54<00:00,  6.81s/it]

  Hand2: 7202 saturated samples (0.01%) in 51 windows

Saturation Summary:
  Total saturated samples: 728976
  Total windows with saturation: 4074
  Overall saturation rate: 0.06%
Downsampling completed. New shape: (716200, 3)
Effective sampling rate: 25Hz
Downsampled data shape for part4: (716200, 3)
Successfully saved downsampled part4 to ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part4_downsampled_25Hz.parquet
Output file size: 12.2 MB






DOWNSAMPLING PROCESSING SUMMARY
Successfully processed 4 files:

PART1:
  - Original file: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part1.parquet
  - Downsampled file: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part1_downsampled_25Hz.parquet
  - Shape change: (122502500, 3) → (612512, 3)
  - Time range: 2021-11-16 22:24:59.171050 to 2021-11-17 04:49:32.170950
  - Duration: 6.41 hours
  - Compression ratio: 200.0:1

PART2:
  - Original file: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part2.parquet
  - Downsampled file: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part2_downsampled_25Hz.parquet
  - Shape change: (131262500, 3) → (656312, 3)
  - Time range: 2021-12-07 22:19:36.521900 to 2021-12-08 06:51:52.271800
  - Duration: 8.54 hours
  - Compression ratio: 200.0:1

PART3:
  - Original file: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part3.parquet
  - Downsampled file: ..\..\..\Data\ProcessedData\Signal_Files\volunteer_part3_downsampled_25Hz.par