# Dataset Creation for Health Sensing Breathing Analysis 📊

## Machine Learning Dataset Generation

### Task Overview
Create a labeled dataset from 8-hour sleep study recordings by splitting continuous signals into 30-second windows with 50% overlap. Each window will be labeled based on breathing events (Hypopnea, Obstructive Apnea, or Normal).

### Key Requirements ✅
- **Window Size**: 30-second segments with 50% overlap (15-second step)
- **Label Assignment**: A window takes an event's label when that event covers more than 50% of the window; otherwise it is labeled Normal
- **Target Labels**: Hypopnea, Obstructive Apnea, Normal
- **Input Signals**: Nasal Airflow, Thoracic Movement, SpO₂
- **Output Format**: Efficient storage for ML training

### Technical Approach
- **Sliding Window**: Extract overlapping time segments from continuous signals
- **Event Mapping**: Assign labels based on temporal overlap with annotations
- **Data Format**: Parquet for the per-window feature table (efficient storage, fast loading), with CSV and Pickle copies for inspection and raw signals
- **Feature Engineering**: Time-series windows ready for ML models
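
To make the sliding-window arithmetic concrete, here is a minimal sketch. It assumes a 32 Hz sampling rate (the default used later in this notebook) and an idealized recording of exactly 8 hours; the helper name `window_starts` is hypothetical and not part of the notebook's class.

```python
def window_starts(n_samples, window_samples, step_samples):
    """Return the start index of every full window that fits in the signal."""
    if n_samples < window_samples:
        return []
    return list(range(0, n_samples - window_samples + 1, step_samples))


sampling_rate = 32                               # Hz (assumed)
window_samples = 30 * sampling_rate              # 960 samples per window
step_samples = int(window_samples * (1 - 0.5))   # 480 samples = 15 s step

n_samples = 8 * 3600 * sampling_rate             # idealized 8-hour recording
starts = window_starts(n_samples, window_samples, step_samples)

print(len(starts))    # 1919 full windows
print(starts[:3])     # [0, 480, 960]
```

Real recordings are rarely exactly 8 hours long, so the actual window count per participant will differ from this idealized figure.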

In [1]:
# Import Required Libraries
import os
import sys
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import glob
from pathlib import Path
import argparse
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# For data storage
import pickle
import pyarrow as pa
import pyarrow.parquet as pq

print("✅ Libraries imported successfully!")
print("📊 Ready for dataset creation from sleep study data")
print("🎯 Target: 30-second windows with event-based labeling")

✅ Libraries imported successfully!
📊 Ready for dataset creation from sleep study data
🎯 Target: 30-second windows with event-based labeling


In [11]:
class DatasetCreator:
    """
    Creates labeled dataset from continuous sleep study recordings.
    Splits 8-hour signals into 30-second windows with 50% overlap.
    """
    
    def __init__(self, window_duration=30, overlap_ratio=0.5, sampling_rate=32):
        """
        Initialize the dataset creator.
        
        Args:
            window_duration (int): Window size in seconds (default: 30)
            overlap_ratio (float): Overlap ratio between windows (default: 0.5 for 50%)
            sampling_rate (int): Sampling rate in Hz (default: 32)
        """
        self.window_duration = window_duration
        self.overlap_ratio = overlap_ratio
        self.sampling_rate = sampling_rate
        
        # Calculate window parameters
        self.window_samples = window_duration * sampling_rate  # 30 * 32 = 960 samples
        self.step_samples = int(self.window_samples * (1 - overlap_ratio))  # 480 samples (15 seconds)
        
        # Target labels for the dataset
        self.target_labels = ['Hypopnea', 'Obstructive Apnea', 'Normal']
        
        print(f"✅ DatasetCreator initialized")
        print(f"   ⏱️  Window duration: {window_duration} seconds ({self.window_samples} samples)")
        print(f"   🔄 Overlap: {overlap_ratio*100}% ({window_duration * overlap_ratio} seconds)")
        print(f"   👣 Step size: {self.step_samples} samples ({self.step_samples/sampling_rate} seconds)")
        print(f"   🏷️  Target labels: {self.target_labels}")
    
    def parse_datetime(self, date_str):
        """Parse datetime strings from various formats found in the data files."""
        formats = [
            "%d.%m.%Y %H:%M:%S,%f",  # 30.05.2024 20:59:00,000
            "%d.%m.%Y %H:%M:%S",     # 30.05.2024 20:59:00
            "%d-%m-%Y %H:%M:%S,%f",  # 30-05-2024 21:22:45,000
            "%d-%m-%Y %H:%M:%S",     # 30-05-2024 21:22:45
            "%d_%m_%Y %H:%M:%S,%f",  # Alternative format
            "%m/%d/%Y %I:%M:%S %p",  # 5/30/2024 8:59:00 PM
        ]
        
        for fmt in formats:
            try:
                return datetime.strptime(date_str.strip(), fmt)
            except ValueError:
                continue
        
        # Manual parsing for edge cases
        try:
            clean_str = date_str.replace(';', '').strip()
            if ',' in clean_str:
                dt_part, ms_part = clean_str.rsplit(',', 1)
                for base_fmt in ["%d.%m.%Y %H:%M:%S", "%d-%m-%Y %H:%M:%S"]:
                    try:
                        dt = datetime.strptime(dt_part, base_fmt)
                        ms = int(ms_part)
                        return dt + timedelta(milliseconds=ms)
                    except ValueError:
                        continue
            else:
                for base_fmt in ["%d.%m.%Y %H:%M:%S", "%d-%m-%Y %H:%M:%S"]:
                    try:
                        return datetime.strptime(clean_str, base_fmt)
                    except ValueError:
                        continue
        except Exception:
            pass
        
        return None
    
    def find_file_by_pattern(self, folder, patterns):
        """Find files matching any of the given patterns."""
        for pattern in patterns:
            files = glob.glob(os.path.join(folder, pattern))
            if files:
                return files[0]
        return None
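
The format-fallback idea behind `parse_datetime` can be tried in isolation. The sketch below is a reduced, standalone version that uses three of the formats listed above; the function name `try_parse` and the sample strings are illustrative only.

```python
from datetime import datetime

# A subset of the formats listed in parse_datetime
FORMATS = [
    "%d.%m.%Y %H:%M:%S,%f",   # 30.05.2024 20:59:00,000
    "%d-%m-%Y %H:%M:%S,%f",   # 30-05-2024 21:22:45,000
    "%m/%d/%Y %I:%M:%S %p",   # 5/30/2024 8:59:00 PM
]

def try_parse(text):
    """Return the first successful parse, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

a = try_parse("30.05.2024 20:59:00,000")
b = try_parse("30-05-2024 21:22:45,250")
c = try_parse("5/30/2024 8:59:00 PM")
d = try_parse("not a timestamp")

print(a == c)          # both denote 30 May 2024, 20:59:00
print(b.microsecond)   # %f pads on the right, so ",250" means 250 ms
print(d)               # None signals an unparseable string
```

Because `%f` zero-pads on the right, a three-digit fraction such as `,250` is read as 250 milliseconds, which matches the millisecond convention in the sample timestamps.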

In [13]:
    def load_signal_data(self, file_path, signal_type):
        """Load signal data with timestamps from a file."""
        if not os.path.exists(file_path):
            print(f"⚠️  Warning: Signal file not found: {os.path.basename(file_path)}")
            return None
        
        with open(file_path, 'r') as f:
            lines = f.readlines()
        
        # Extract start time from header
        start_time_line = None
        for line in lines[:4]:
            if 'Start Time:' in line:
                start_time_line = line
                break
        
        if not start_time_line:
            print(f"⚠️  Warning: No start time found in {os.path.basename(file_path)}")
            return None
        
        start_time_str = start_time_line.split('Start Time:')[1].strip()
        start_time = self.parse_datetime(start_time_str)
        
        if not start_time:
            print(f"⚠️  Warning: Could not parse start time: {start_time_str}")
            return None
        
        # Parse data lines
        values = []
        timestamps = []
        
        for line in lines[4:]:
            line = line.strip()
            if not line:
                continue
            
            try:
                if ';' in line:
                    time_str, value_str = line.split(';', 1)
                else:
                    parts = line.split()
                    if len(parts) >= 2:
                        time_str, value_str = parts[0], parts[1]
                    else:
                        continue
                
                timestamp = self.parse_datetime(time_str.strip())
                if timestamp:
                    timestamps.append(timestamp)
                    values.append(float(value_str.strip()))
                    
            except (ValueError, IndexError):
                continue
        
        if not timestamps:
            return None
        
        df = pd.DataFrame({
            'timestamp': timestamps,
            'value': values,
            'signal_type': signal_type
        })
        
        df = df.drop_duplicates(subset=['timestamp']).sort_values('timestamp')
        df = df.set_index('timestamp')
        
        return df
    
    def load_events_data(self, file_path):
        """Load breathing events data."""
        if not os.path.exists(file_path):
            return None
            
        with open(file_path, 'r') as f:
            lines = f.readlines()
            
        events = []
        data_lines = [line.strip() for line in lines[4:] if line.strip()]
        
        for line in data_lines:
            try:
                main_parts = line.split(';')
                if len(main_parts) < 3:
                    continue
                    
                time_range = main_parts[0].strip()
                duration = float(main_parts[1].strip())
                event_type = main_parts[2].strip()
                sleep_stage = main_parts[3].strip() if len(main_parts) > 3 else ""
                
                # Parse time range
                dash_pos = time_range.rfind('-')
                if dash_pos < 0:
                    continue
                    
                start_str = time_range[:dash_pos].strip()
                end_time_part = time_range[dash_pos+1:].strip()
                
                date_part = start_str.split(' ')[0]
                end_str = f"{date_part} {end_time_part}"
                
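                start_time = self.parse_datetime(start_str)
                end_time = self.parse_datetime(end_str)
                
                # The end time carries no date of its own; if the event
                # crosses midnight, move the end to the following day
                if start_time and end_time and end_time < start_time:
                    end_time += timedelta(days=1)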
                
                if start_time and end_time:
                    events.append({
                        'start': start_time,
                        'end': end_time,
                        'duration': duration,
                        'event_type': event_type,
                        'sleep_stage': sleep_stage
                    })
            except Exception:
                continue
                
        if events:
            return pd.DataFrame(events)
        return None

# Add these methods to the DatasetCreator class
DatasetCreator.load_signal_data = load_signal_data
DatasetCreator.load_events_data = load_events_data

print("✅ Signal loading methods added!")
print("📥 Ready to load signals and events from participant folders")

✅ Signal loading methods added!
📥 Ready to load signals and events from participant folders
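
As a worked example of the event parsing, the sketch below handles a single annotation line. The line layout (`start-end; duration; type; stage`) is inferred from how `load_events_data` splits its input, and the sample line itself is hypothetical rather than taken from the data files. The next-day rollover for events that cross midnight is an assumption about how such events should be handled.

```python
from datetime import datetime, timedelta

FMT = "%d.%m.%Y %H:%M:%S,%f"   # assumed timestamp format for this sketch

def parse_event_line(line):
    """Parse 'start-end; duration; type; stage' into a dict (sketch)."""
    time_range, duration, event_type, *rest = [p.strip() for p in line.split(";")]
    dash = time_range.rfind("-")               # last dash separates start and end
    start_str = time_range[:dash].strip()
    end_part = time_range[dash + 1:].strip()
    start = datetime.strptime(start_str, FMT)
    # The end has no date, so borrow the date from the start ...
    end = datetime.strptime(f"{start_str.split(' ')[0]} {end_part}", FMT)
    # ... and roll over to the next day if the event crosses midnight
    if end < start:
        end += timedelta(days=1)
    return {
        "start": start,
        "end": end,
        "duration": float(duration),
        "event_type": event_type,
        "sleep_stage": rest[0] if rest else "",
    }

# Hypothetical annotation line, not taken from the real data files
event = parse_event_line("30.05.2024 23:59:50,000-00:00:12,000; 22;Hypopnea; N2")
print(event["end"] - event["start"])   # 0:00:22
```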


In [14]:
    def extract_windows(self, signal_data, signal_name):
        """
        Extract overlapping 30-second windows from signal data.
        
        Args:
            signal_data (DataFrame): Signal data with timestamp index
            signal_name (str): Name of the signal
            
        Returns:
            List of tuples: (window_start_time, window_end_time, window_values)
        """
        if signal_data is None or len(signal_data) < self.window_samples:
            return []
        
        windows = []
        values = signal_data['value'].values
        timestamps = signal_data.index
        
        # Extract overlapping windows
        for start_idx in range(0, len(values) - self.window_samples + 1, self.step_samples):
            end_idx = start_idx + self.window_samples
            
            window_values = values[start_idx:end_idx]
            window_start_time = timestamps[start_idx]
            window_end_time = timestamps[end_idx - 1]
            
            windows.append((window_start_time, window_end_time, window_values))
        
        return windows
    
    def assign_label(self, window_start, window_end, events_data):
        """
        Assign label to a window based on overlap with breathing events.
        
        Args:
            window_start (datetime): Window start time
            window_end (datetime): Window end time
            events_data (DataFrame): Events data
            
        Returns:
            str: Label for the window ('Hypopnea', 'Obstructive Apnea', or 'Normal')
        """
        if events_data is None or len(events_data) == 0:
            return 'Normal'
        
        window_duration = (window_end - window_start).total_seconds()
        
        # Check overlap with each event
        for _, event in events_data.iterrows():
            event_type = event['event_type']
            
            # Only consider target labels
            if event_type not in ['Hypopnea', 'Obstructive Apnea']:
                continue
            
            event_start = event['start']
            event_end = event['end']
            
            # Calculate overlap
            overlap_start = max(window_start, event_start)
            overlap_end = min(window_end, event_end)
            
            if overlap_start < overlap_end:
                overlap_duration = (overlap_end - overlap_start).total_seconds()
                overlap_ratio = overlap_duration / window_duration
                
                # If more than 50% overlap, assign the event label
                if overlap_ratio > 0.5:
                    return event_type
        
        return 'Normal'
    
    def process_participant(self, participant_folder):
        """
        Process a single participant to extract labeled windows.
        
        Args:
            participant_folder (str): Path to participant folder
            
        Returns:
            List of dictionaries: Windows with features and labels
        """
        participant_id = os.path.basename(participant_folder)
        print(f"🔄 Processing {participant_id}...")
        
        # Define file patterns for different naming conventions
        flow_patterns = [
            "Flow - *.txt", "Flow  - *.txt", "Flow Signal - *.txt", "Flow Nasal - *.txt"
        ]
        thorac_patterns = [
            "Thorac - *.txt", "Thorac  - *.txt", "Thorac Signal - *.txt", "Thorac Movement - *.txt"
        ]
        spo2_patterns = [
            "SPO2 - *.txt", "SPO2  - *.txt", "SPO2 Signal - *.txt"
        ]
        events_patterns = [
            "Flow Events - *.txt", "Flow Events  - *.txt"
        ]
        
        # Load signals
        signals = {}
        
        flow_file = self.find_file_by_pattern(participant_folder, flow_patterns)
        if flow_file:
            signals['nasal_airflow'] = self.load_signal_data(flow_file, 'Nasal Airflow')
        
        thorac_file = self.find_file_by_pattern(participant_folder, thorac_patterns)
        if thorac_file:
            signals['thoracic_movement'] = self.load_signal_data(thorac_file, 'Thoracic Movement')
        
        spo2_file = self.find_file_by_pattern(participant_folder, spo2_patterns)
        if spo2_file:
            signals['spo2'] = self.load_signal_data(spo2_file, 'SpO₂')
        
        # Load events
        events_file = self.find_file_by_pattern(participant_folder, events_patterns)
        events_data = self.load_events_data(events_file) if events_file else None
        
        # Filter events to only target labels
        if events_data is not None:
            target_events = events_data[events_data['event_type'].isin(['Hypopnea', 'Obstructive Apnea'])]
        else:
            target_events = None
        
        print(f"   📊 Loaded signals: {list(signals.keys())}")
        if target_events is not None:
            event_counts = target_events['event_type'].value_counts()
            print(f"   🚨 Events: {dict(event_counts)}")
        else:
            print(f"   🚨 Events: None found")
        
        # Extract windows from the primary signal (nasal airflow)
        if 'nasal_airflow' not in signals or signals['nasal_airflow'] is None:
            print(f"   ❌ No nasal airflow data found for {participant_id}")
            return []
        
        primary_signal = signals['nasal_airflow']
        windows = self.extract_windows(primary_signal, 'nasal_airflow')
        
        print(f"   🪟 Extracted {len(windows)} windows")
        
        # Process each window
        dataset_windows = []
        
        for i, (window_start, window_end, nasal_values) in enumerate(windows):
            # Get corresponding values from other signals
            thorac_values = self._get_window_values(signals.get('thoracic_movement'), 
                                                   window_start, window_end)
            spo2_values = self._get_window_values(signals.get('spo2'), 
                                                 window_start, window_end)
            
            # Assign label
            label = self.assign_label(window_start, window_end, target_events)
            
            # Create window data
            window_data = {
                'participant_id': participant_id,
                'window_id': f"{participant_id}_W{i:04d}",
                'start_time': window_start,
                'end_time': window_end,
                'duration': self.window_duration,
                'label': label,
                'nasal_airflow': nasal_values,
                'thoracic_movement': thorac_values,
                'spo2': spo2_values
            }
            
            dataset_windows.append(window_data)
        
        # Label distribution
        labels = [w['label'] for w in dataset_windows]
        label_counts = pd.Series(labels).value_counts()
        print(f"   🏷️  Label distribution: {dict(label_counts)}")
        
        return dataset_windows
    
    def _get_window_values(self, signal_data, window_start, window_end):
        """Get signal values for a specific time window."""
        if signal_data is None:
            return None
        
        # Filter signal data for the window time range
        window_data = signal_data[(signal_data.index >= window_start) & 
                                 (signal_data.index <= window_end)]
        
        if len(window_data) == 0:
            return None
        
        return window_data['value'].values

# Add methods to the DatasetCreator class
DatasetCreator.extract_windows = extract_windows
DatasetCreator.assign_label = assign_label
DatasetCreator.process_participant = process_participant
DatasetCreator._get_window_values = _get_window_values

print("✅ Window extraction and labeling methods added!")
print("🪟 Ready to create 30-second windows with event-based labels")

✅ Window extraction and labeling methods added!
🪟 Ready to create 30-second windows with event-based labels
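
The labeling rule in `assign_label` reduces to one interval computation: the fraction of the window covered by an event. Here is a minimal standalone sketch with a hypothetical 18-second event and two consecutive windows; `overlap_fraction` is an illustrative helper, not a method of the class.

```python
from datetime import datetime, timedelta

def overlap_fraction(window_start, window_end, event_start, event_end):
    """Fraction of the window covered by the event (0.0 if they are disjoint)."""
    overlap = min(window_end, event_end) - max(window_start, event_start)
    window = window_end - window_start
    return max(overlap.total_seconds(), 0.0) / window.total_seconds()

t0 = datetime(2024, 5, 30, 22, 0, 0)   # hypothetical recording time
event = (t0 + timedelta(seconds=10), t0 + timedelta(seconds=28))   # 18 s event

# Two consecutive 30 s windows, 15 s apart
first = overlap_fraction(t0, t0 + timedelta(seconds=30), *event)
second = overlap_fraction(t0 + timedelta(seconds=15), t0 + timedelta(seconds=45), *event)

for frac in (first, second):
    print(f"{frac:.2f}", "Hypopnea" if frac > 0.5 else "Normal")
# 0.60 Hypopnea
# 0.43 Normal
```

One consequence of the strict `> 0.5` threshold is that an event lasting 15 seconds or less can never cover more than half of a 30-second window, so such events always end up in windows labeled Normal.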


In [15]:
    def create_dataset(self, input_dir, output_dir):
        """
        Create the complete dataset from all participants.
        
        Args:
            input_dir (str): Input directory containing participant folders
            output_dir (str): Output directory for saving the dataset
            
        Returns:
            dict: Dataset creation summary
        """
        print("🚀 DATASET CREATION: Processing all participants")
        print("=" * 70)
        
        # Create output directory
        os.makedirs(output_dir, exist_ok=True)
        
        # Find all participant folders
        participants = [f for f in os.listdir(input_dir) 
                       if os.path.isdir(os.path.join(input_dir, f)) and f.startswith('AP')]
        participants.sort()
        
        print(f"📁 Found {len(participants)} participants: {participants}")
        
        all_windows = []
        participant_stats = {}
        
        # Process each participant
        for i, participant in enumerate(participants, 1):
            print(f"\n[{i}/{len(participants)}] Processing {participant}")
            print("-" * 50)
            
            participant_path = os.path.join(input_dir, participant)
            
            try:
                windows = self.process_participant(participant_path)
                
                if windows:
                    all_windows.extend(windows)
                    
                    # Calculate statistics
                    labels = [w['label'] for w in windows]
                    participant_stats[participant] = {
                        'total_windows': len(windows),
                        'label_distribution': pd.Series(labels).value_counts().to_dict(),
                        # Windows overlap, so recording time advances by one step per window
                        'duration_hours': len(windows) * self.step_samples / self.sampling_rate / 3600,
                        'status': 'success'
                    }
                    
                    print(f"   ✅ {participant}: {len(windows)} windows created")
                else:
                    participant_stats[participant] = {
                        'total_windows': 0,
                        'label_distribution': {},
                        'duration_hours': 0,
                        'status': 'failed'
                    }
                    print(f"   ❌ {participant}: No windows created")
                    
            except Exception as e:
                participant_stats[participant] = {
                    'total_windows': 0,
                    'label_distribution': {},
                    'duration_hours': 0,
                    'status': 'error',
                    'error': str(e)
                }
                print(f"   ❌ {participant}: Error - {str(e)}")
        
        # Dataset summary
        print(f"\n" + "=" * 70)
        print("📋 DATASET CREATION SUMMARY")
        print("=" * 70)
        
        total_windows = len(all_windows)
        if total_windows > 0:
            # Overall label distribution
            all_labels = [w['label'] for w in all_windows]
            overall_distribution = pd.Series(all_labels).value_counts()
            
            print(f"✅ Total windows created: {total_windows:,}")
            print(f"📊 Overall label distribution:")
            for label, count in overall_distribution.items():
                percentage = (count / total_windows) * 100
                print(f"   • {label}: {count:,} ({percentage:.1f}%)")
            
            # Save dataset
            print(f"\n💾 Saving dataset to: {output_dir}/")
            self.save_dataset(all_windows, output_dir)
            
            # Save statistics
            stats_summary = {
                'creation_date': datetime.now().isoformat(),
                'total_windows': total_windows,
                'participants': len(participants),
                'window_duration_seconds': self.window_duration,
                'overlap_ratio': self.overlap_ratio,
                'overall_distribution': overall_distribution.to_dict(),
                'participant_stats': participant_stats
            }
            
            stats_path = os.path.join(output_dir, 'dataset_stats.json')
            import json
            with open(stats_path, 'w') as f:
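                # Convert datetime objects and NumPy scalars for JSON serialization;
                # raise for anything else so json reports the offending type
                def json_serializer(obj):
                    if isinstance(obj, datetime):
                        return obj.isoformat()
                    if isinstance(obj, np.generic):
                        return obj.item()
                    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")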
                
                json.dump(stats_summary, f, indent=2, default=json_serializer)
            
            print(f"📈 Statistics saved to: {stats_path}")
            
        else:
            print("❌ No windows were created from any participant")
            stats_summary = {}
        
        print("=" * 70)
        return stats_summary
    
    def save_dataset(self, windows, output_dir):
        """
        Save the dataset in multiple formats for different use cases.
        
        Args:
            windows (list): List of window dictionaries
            output_dir (str): Output directory
        """
        print("💾 Saving dataset in multiple formats...")
        
        # Prepare data for saving
        dataset_records = []
        
        for window in windows:
            # Create a flattened record for tabular formats
            record = {
                'participant_id': window['participant_id'],
                'window_id': window['window_id'],
                'start_time': window['start_time'],
                'end_time': window['end_time'],
                'duration': window['duration'],
                'label': window['label']
            }
            
            # Add signal features (basic statistics for now)
            for signal_name in ['nasal_airflow', 'thoracic_movement', 'spo2']:
                values = window[signal_name]
                if values is not None and len(values) > 0:
                    record[f'{signal_name}_mean'] = np.mean(values)
                    record[f'{signal_name}_std'] = np.std(values)
                    record[f'{signal_name}_min'] = np.min(values)
                    record[f'{signal_name}_max'] = np.max(values)
                    record[f'{signal_name}_samples'] = len(values)
                else:
                    record[f'{signal_name}_mean'] = None
                    record[f'{signal_name}_std'] = None
                    record[f'{signal_name}_min'] = None
                    record[f'{signal_name}_max'] = None
                    record[f'{signal_name}_samples'] = 0
            
            dataset_records.append(record)
        
        # Create DataFrame
        df = pd.DataFrame(dataset_records)
        
        # Save as Parquet (Primary format - efficient for ML)
        parquet_path = os.path.join(output_dir, 'breathing_dataset_features.parquet')
        df.to_parquet(parquet_path, index=False)
        print(f"   ✅ Features saved as Parquet: {parquet_path}")
        
        # Save as CSV (Human-readable format)
        csv_path = os.path.join(output_dir, 'breathing_dataset_features.csv')
        df.to_csv(csv_path, index=False)
        print(f"   ✅ Features saved as CSV: {csv_path}")
        
        # Save raw time series data as Pickle (for full signal access)
        pickle_path = os.path.join(output_dir, 'breathing_dataset_raw.pkl')
        with open(pickle_path, 'wb') as f:
            pickle.dump(windows, f)
        print(f"   ✅ Raw time series saved as Pickle: {pickle_path}")
        
        print(f"📊 Dataset formats saved:")
        print(f"   • Parquet: {os.path.getsize(parquet_path):,} bytes (recommended for ML)")
        print(f"   • CSV: {os.path.getsize(csv_path):,} bytes (human-readable)")
        print(f"   • Pickle: {os.path.getsize(pickle_path):,} bytes (full time series)")

# Add methods to the DatasetCreator class
DatasetCreator.create_dataset = create_dataset
DatasetCreator.save_dataset = save_dataset

print("✅ Dataset creation and saving methods added!")
print("💾 Ready to create and save ML-ready datasets")

✅ Dataset creation and saving methods added!
💾 Ready to create and save ML-ready datasets
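
The statistics file is written with `json.dump(..., default=...)`. The `default` hook is called only for objects that `json` cannot encode on its own, and it is expected either to return a serializable replacement or to raise `TypeError`. A minimal standalone sketch, with hypothetical values in the `summary` dictionary:

```python
import json
from datetime import datetime

def json_default(obj):
    """Convert known non-JSON types; reject everything else explicitly."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

summary = {
    "creation_date": datetime(2024, 5, 30, 21, 0, 0),   # hypothetical value
    "total_windows": 1822,
}

text = json.dumps(summary, default=json_default)
print(text)
# {"creation_date": "2024-05-30T21:00:00", "total_windows": 1822}
```

Raising `TypeError` for unknown types keeps the error message pointed at the offending object, whereas returning the object unchanged makes the encoder report a circular reference instead.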


In [16]:
def main():
    """
    Main function for dataset creation - can be called from command line or notebook.
    """
    # Default parameters
    default_input_dir = "../Data"
    default_output_dir = "../Dataset"
    
    # Notebook entry point: uses the default directories
    # (create_dataset_cli() handles command-line arguments)
    try:
        input_dir = default_input_dir
        output_dir = default_output_dir
        
        print("🚀 BREATHING ANALYSIS DATASET CREATION")
        print("=" * 60)
        print(f"📥 Input directory: {input_dir}")
        print(f"📤 Output directory: {output_dir}")
        print(f"⏱️  Window size: 30 seconds with 50% overlap")
        print(f"🏷️  Target labels: Hypopnea, Obstructive Apnea, Normal")
        print("=" * 60)
        
        # Create dataset creator
        creator = DatasetCreator(window_duration=30, overlap_ratio=0.5, sampling_rate=32)
        
        # Create the dataset
        stats = creator.create_dataset(input_dir, output_dir)
        
        if stats:
            print(f"\n🎉 Dataset creation completed successfully!")
            print(f"📊 Total windows: {stats.get('total_windows', 0):,}")
            print(f"👥 Participants: {stats.get('participants', 0)}")
            print(f"📁 Output saved to: {output_dir}/")
            
            return stats
        else:
            print("❌ Dataset creation failed")
            return None
            
    except Exception as e:
        print(f"❌ Error in dataset creation: {str(e)}")
        import traceback
        traceback.print_exc()
        return None

def create_dataset_cli():
    """
    Command line interface for dataset creation.
    Usage: python create_dataset.py -in_dir "Data" -out_dir "Dataset"
    """
    parser = argparse.ArgumentParser(description='Create ML dataset from sleep study recordings')
    parser.add_argument('-in_dir', '--input_dir', default='../Data',
                       help='Input directory containing participant folders')
    parser.add_argument('-out_dir', '--output_dir', default='../Dataset',
                       help='Output directory for saving the dataset')
    parser.add_argument('--window_duration', type=int, default=30,
                       help='Window duration in seconds (default: 30)')
    parser.add_argument('--overlap_ratio', type=float, default=0.5,
                       help='Overlap ratio between windows (default: 0.5)')
    parser.add_argument('--sampling_rate', type=int, default=32,
                       help='Sampling rate in Hz (default: 32)')
    
    args = parser.parse_args()
    
    print("🚀 BREATHING ANALYSIS DATASET CREATION")
    print("=" * 60)
    print(f"📥 Input directory: {args.input_dir}")
    print(f"📤 Output directory: {args.output_dir}")
    print(f"⏱️  Window size: {args.window_duration} seconds")
    print(f"🔄 Overlap ratio: {args.overlap_ratio * 100}%")
    print(f"📊 Sampling rate: {args.sampling_rate} Hz")
    print("=" * 60)
    
    # Create dataset creator
    creator = DatasetCreator(
        window_duration=args.window_duration,
        overlap_ratio=args.overlap_ratio,
        sampling_rate=args.sampling_rate
    )
    
    # Create the dataset
    stats = creator.create_dataset(args.input_dir, args.output_dir)
    
    if stats:
        print(f"\n🎉 Dataset creation completed successfully!")
        return 0
    else:
        print("❌ Dataset creation failed")
        return 1

# Usage instructions
print("🎯 USAGE INSTRUCTIONS:")
print("=" * 50)
print("# From Jupyter notebook:")
print("# stats = main()")
print("# ")
print("# From command line:")
print("# python create_dataset.py -in_dir '../Data' -out_dir '../Dataset'")
print("=" * 50)

🎯 USAGE INSTRUCTIONS:
# From Jupyter notebook:
# stats = main()
# 
# From command line:
# python create_dataset.py -in_dir '../Data' -out_dir '../Dataset'
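
The command-line options can be exercised without leaving the notebook by passing an explicit argument list to `parse_args`. The sketch below rebuilds a reduced version of the parser from `create_dataset_cli`; the argument values are illustrative.

```python
import argparse

parser = argparse.ArgumentParser(description="Create ML dataset (sketch)")
parser.add_argument("-in_dir", "--input_dir", default="../Data")
parser.add_argument("-out_dir", "--output_dir", default="../Dataset")
parser.add_argument("--overlap_ratio", type=float, default=0.5)

# An explicit list avoids reading sys.argv, which inside Jupyter holds
# the kernel's own arguments rather than user options
args = parser.parse_args(["-in_dir", "Data", "--overlap_ratio", "0.25"])

print(args.input_dir, args.output_dir, args.overlap_ratio)
# Data ../Dataset 0.25
```

The attribute names come from the `--` forms (`input_dir`, `output_dir`), so `-in_dir` and `--input_dir` both populate `args.input_dir`.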


In [7]:
# DEMONSTRATION: Create Dataset from Sleep Study Data
print("🚀 DEMONSTRATION: Creating ML Dataset from Sleep Study Recordings")
print("=" * 80)

# Execute the main dataset creation function
dataset_stats = main()

if dataset_stats:
    print("\n📊 DATASET CREATION RESULTS:")
    print("=" * 50)
    
    print(f"✅ Success! Created {dataset_stats['total_windows']:,} labeled windows")
    print(f"👥 Processed {dataset_stats['participants']} participants")
    print(f"⏱️  Window specifications:")
    print(f"   • Duration: {dataset_stats['window_duration_seconds']} seconds")
    print(f"   • Overlap: {dataset_stats['overlap_ratio']*100}%")
    
    print(f"\n🏷️  Label Distribution:")
    for label, count in dataset_stats['overall_distribution'].items():
        percentage = (count / dataset_stats['total_windows']) * 100
        print(f"   • {label}: {count:,} windows ({percentage:.1f}%)")
    
    print(f"\n📁 Output Files Created:")
    print(f"   • breathing_dataset_features.parquet (ML-ready features)")
    print(f"   • breathing_dataset_features.csv (human-readable)")
    print(f"   • breathing_dataset_raw.pkl (full time series)")
    print(f"   • dataset_stats.json (creation statistics)")
    
    print(f"\n💡 Format Choice Explanation:")
    print(f"   🎯 Parquet: Primary format for ML training")
    print(f"      • Fast loading and efficient storage")
    print(f"      • Column-oriented, optimized for analytics")
    print(f"      • Native support in pandas (via pyarrow)")
    print(f"   📄 CSV: Human-readable backup format")
    print(f"      • Easy inspection and sharing")
    print(f"      • Compatible with any tool")
    print(f"   🗂️  Pickle: Full time series preservation")
    print(f"      • Complete signal data for advanced analysis")
    print(f"      • Python-native complex object storage")
    
    print("\n" + "=" * 80)
    print("✅ DATASET CREATION COMPLETED SUCCESSFULLY!")
    print("🚀 Ready for machine learning model training!")
    print("=" * 80)
    
else:
    print("❌ Dataset creation failed - check error messages above")

🚀 DEMONSTRATION: Creating ML Dataset from Sleep Study Recordings
🚀 BREATHING ANALYSIS DATASET CREATION
📥 Input directory: ../Data
📤 Output directory: ../Dataset
⏱️  Window size: 30 seconds with 50% overlap
🏷️  Target labels: Hypopnea, Obstructive Apnea, Normal
❌ Error in dataset creation: name 'window_samples' is not defined
❌ Dataset creation failed - check error messages above


Traceback (most recent call last):
  File "/var/folders/cb/19hy99v14gxg9n0mbtk79djc0000gn/T/ipykernel_60001/1355396864.py", line 24, in main
    creator = DatasetCreator(window_duration=30, overlap_ratio=0.5, sampling_rate=32)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/cb/19hy99v14gxg9n0mbtk79djc0000gn/T/ipykernel_60001/4230111827.py", line 22, in __init__
    self.step_samples = int(window_samples * (1 - overlap_ratio))  # 480 samples (15 seconds)
                            ^^^^^^^^^^^^^^
NameError: name 'window_samples' is not defined
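
The traceback quotes `int(window_samples * (1 - overlap_ratio))`, while the `__init__` shown in the class cell uses `self.window_samples`. Together with the execution counts (`In [7]` here, `In [11]` for the class), this suggests the output was captured before the class cell was corrected and re-run. The difference can be reproduced with two small stand-in classes; `Before` and `After` are hypothetical names used only for this sketch.

```python
class Before:
    def __init__(self, window_duration=30, overlap_ratio=0.5, sampling_rate=32):
        self.window_samples = window_duration * sampling_rate
        # Bug: bare name; only the attribute self.window_samples exists
        self.step_samples = int(window_samples * (1 - overlap_ratio))


class After:
    def __init__(self, window_duration=30, overlap_ratio=0.5, sampling_rate=32):
        self.window_samples = window_duration * sampling_rate
        # Fix: read the value back through self
        self.step_samples = int(self.window_samples * (1 - overlap_ratio))


try:
    Before()
    message = None
except NameError as exc:
    message = str(exc)

fixed = After()
print(message)
print(fixed.step_samples)   # 480
```

Re-running the class cell and then the demonstration cell, in that order, should replace the stale error output.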


In [17]:
# EXECUTE DATASET CREATION
print("🚀 CREATING DATASET FROM SLEEP STUDY DATA")
print("=" * 60)

try:
    # Create dataset creator instance
    creator = DatasetCreator(window_duration=30, overlap_ratio=0.5, sampling_rate=32)
    
    # Set input and output directories
    input_dir = "../Data"
    output_dir = "../Dataset"
    
    print(f"📥 Input directory: {input_dir}")
    print(f"📤 Output directory: {output_dir}")
    
    # Create the dataset
    print("\n🔄 Starting dataset creation...")
    stats = creator.create_dataset(input_dir, output_dir)
    
    if stats and stats.get('total_windows', 0) > 0:
        print(f"\n🎉 SUCCESS! Dataset created successfully!")
        print(f"✅ Total windows: {stats['total_windows']:,}")
        print(f"👥 Participants: {stats['participants']}")
        print(f"📊 Label distribution: {stats['overall_distribution']}")
        print(f"📁 Files saved to: {output_dir}/")
        
        # Show files created
        import os
        if os.path.exists(output_dir):
            files = [f for f in os.listdir(output_dir) if f.endswith(('.parquet', '.csv', '.pkl', '.json'))]
            print(f"\n📄 Files created:")
            for file in sorted(files):
                file_path = os.path.join(output_dir, file)
                if os.path.exists(file_path):
                    size = os.path.getsize(file_path)
                    print(f"   • {file}: {size:,} bytes")
        
    else:
        print("❌ Dataset creation failed - no windows were created")
        
except Exception as e:
    print(f"❌ Error during dataset creation: {str(e)}")
    import traceback
    traceback.print_exc()

🚀 CREATING DATASET FROM SLEEP STUDY DATA
✅ DatasetCreator initialized
   ⏱️  Window duration: 30 seconds (960 samples)
   🔄 Overlap: 50.0% (15.0 seconds)
   👣 Step size: 480 samples (15.0 seconds)
   🏷️  Target labels: ['Hypopnea', 'Obstructive Apnea', 'Normal']
📥 Input directory: ../Data
📤 Output directory: ../Dataset

🔄 Starting dataset creation...
🚀 DATASET CREATION: Processing all participants
📁 Found 5 participants: ['AP01', 'AP02', 'AP03', 'AP04', 'AP05']

[1/5] Processing AP01
--------------------------------------------------
🔄 Processing AP01...
   📊 Loaded signals: ['nasal_airflow', 'thoracic_movement', 'spo2']
   🚨 Events: {'Hypopnea': 125, 'Obstructive Apnea': 36}
   🪟 Extracted 1822 windows
   🏷️  Label distribution: {'Normal': 1727, 'Hypopnea': 79, 'Obstructive Apnea': 16}
   ✅ AP01: 1822 windows created

[2/5] Processing 