# 📦 Data Preparation for Depression Detection

## NeuroSense Project - DAIC-WOZ Dataset Preprocessing

This notebook prepares the DAIC-WOZ dataset for the depression detection pipeline.

### Pipeline Overview:
1. **Download** - Fetch participant ZIP files from the DAIC-WOZ server
2. **Extract** - Extract the AUDIO.wav, TRANSCRIPT.csv, COVAREP.csv, and FORMANT.csv files
3. **Analyze** - Check audio metadata (sample rate, duration, etc.)
4. **Separate** - Extract participant-only audio (remove interviewer)
5. **Process** - Remove DC offset, apply a high-pass filter, optionally reduce noise, and normalize
6. **Labels** - Combine train/dev/test labels into a single file
7. **Segment** - Split audio into 10-second chunks with a 5-second overlap

### Dataset Info:
- **Source**: DAIC-WOZ (Distress Analysis Interview Corpus, Wizard-of-Oz interviews)
- **Participants**: IDs 300-492 (189 participants; IDs 342, 394, 398, and 460 are missing)
- **Content**: Clinical interviews with virtual interviewer "Ellie"
- **Labels**: PHQ-8 depression scores (0-24) and a binary label (0 = not depressed, 1 = depressed)

### Output Files:
Paths are relative to `BASE_PATH` (defined in Section 1):
- `RAW_DATA/full_data_participants_zip/` - Downloaded participant ZIP files
- `RAW_DATA/audio_raw/` - Full interview audio files (*_AUDIO.wav) and `audio_metadata.csv`
- `RAW_DATA/transcripts_raw/` - Timestamped transcripts (*_TRANSCRIPT.csv)
- `RAW_DATA/covarep_features_raw/` - Acoustic features at 10ms intervals (*_COVAREP.csv)
- `RAW_DATA/formant_features_raw/` - Vocal tract resonance frequencies (*_FORMANT.csv)
- `RAW_DATA/audio_raw_only_patient/` - Participant-only audio (*_PARTICIPANT.wav)
- `clean_audio/` - Cleaned participant audio (*_PARTICIPANT_clean.wav)
- `segments/audio/` - 10-second audio segments and `segments_metadata.csv`
- `metadata/labels_all.csv` - Combined labels with train/dev/test splits

---

In [None]:
!pip install noisereduce

Collecting noisereduce
  Downloading noisereduce-3.0.3-py3-none-any.whl.metadata (14 kB)
Downloading noisereduce-3.0.3-py3-none-any.whl (22 kB)
Installing collected packages: noisereduce
Successfully installed noisereduce-3.0.3


## 1. Environment Setup

Mount Google Drive and configure paths.

In [1]:
# ============================================
# MOUNT GOOGLE DRIVE
# ============================================
from google.colab import drive
drive.mount('/content/drive')

# ============================================
# IMPORTS & CONFIGURATION
# ============================================
import zipfile
import requests
import os

# Base path for all project files
BASE_PATH = '/content/drive/MyDrive/Final Project/DataSet/DAIC-WOZ'

# Source folder (where ZIP files will be downloaded)
zip_folder = f'{BASE_PATH}/RAW_DATA/full_data_participants_zip'

# Output folders for each file type
output_folders = {
    'AUDIO': f'{BASE_PATH}/RAW_DATA/audio_raw',
    'TRANSCRIPT': f'{BASE_PATH}/RAW_DATA/transcripts_raw',
    'COVAREP': f'{BASE_PATH}/RAW_DATA/covarep_features_raw',
    'FORMANT': f'{BASE_PATH}/RAW_DATA/formant_features_raw',
}

# Create all output folders
for folder in output_folders.values():
    os.makedirs(folder, exist_ok=True)
    print(f"📁 Created/verified: {folder}")


Mounted at /content/drive
📁 Created/verified: /content/drive/MyDrive/Final Project/DataSet/DAIC-WOZ/RAW_DATA/audio_raw
📁 Created/verified: /content/drive/MyDrive/Final Project/DataSet/DAIC-WOZ/RAW_DATA/transcripts_raw
📁 Created/verified: /content/drive/MyDrive/Final Project/DataSet/DAIC-WOZ/RAW_DATA/covarep_features_raw
📁 Created/verified: /content/drive/MyDrive/Final Project/DataSet/DAIC-WOZ/RAW_DATA/formant_features_raw


## 2. Download DAIC-WOZ Dataset

Download participant ZIP files from the DAIC-WOZ server.

**Note:**
- Participants range from 300-492
- Some IDs are missing (342, 394, 398, 460)
- Each ZIP contains: audio, transcript, facial features, etc.
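
As a quick sanity check on the note above, the expected participant IDs can be derived from the ID range and the list of missing IDs. This is only a minimal sketch; `MISSING_IDS` and `expected_participant_ids` are illustrative names that do not appear elsewhere in this notebook.

```python
# IDs listed as missing in the note above
MISSING_IDS = frozenset({342, 394, 398, 460})


def expected_participant_ids(first=300, last=492, missing=MISSING_IDS):
    """Return the sorted participant IDs that should be available."""
    return [pid for pid in range(first, last + 1) if pid not in missing]


ids = expected_participant_ids()
print(len(ids))         # 193 IDs in the range minus 4 missing = 189
print(ids[0], ids[-1])  # 300 492
```

Comparing this list against the downloaded ZIP files is one way to tell an expected "not found" apart from a failed download.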

In [None]:
# ============================================
# DOWNLOAD DAIC-WOZ PARTICIPANT FILES
# ============================================

# DAIC-WOZ dataset URL
base_url = "https://dcapswoz.ict.usc.edu/wwwdaicwoz/"

# Participant ID range
participant_ids = list(range(300, 493))

# Counters for summary
downloaded = 0
skipped = 0
failed = 0

# The download folder is not created in Section 1, so create it here
os.makedirs(zip_folder, exist_ok=True)

print(f"Downloading {len(participant_ids)} participants...\n")

for pid in participant_ids:
    filename = f"{pid}_P.zip"
    url = base_url + filename
    output_path = os.path.join(zip_folder, filename)

    # Skip if file already exists
    if os.path.exists(output_path):
        print(f"⏭️ {filename} - exists")
        skipped += 1
        continue

    # Download file
    try:
        response = requests.get(url, stream=True, timeout=60)

        if response.status_code == 200:
            # Save in chunks (memory efficient) under a temporary name, so an
            # interrupted download is never mistaken for a complete file
            temp_path = output_path + ".part"
            with open(temp_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            os.replace(temp_path, output_path)

            size_mb = os.path.getsize(output_path) / (1024*1024)
            print(f"✅ {filename} - {size_mb:.0f} MB")
            downloaded += 1
        else:
            # Participant not found (some IDs are missing)
            print(f"❌ {filename} - not found (HTTP {response.status_code})")
            failed += 1

    except Exception as e:
        print(f"❌ {filename} - {e}")
        failed += 1

# ============================================
# SUMMARY
# ============================================
print("\n" + "=" * 40)
print(f"✅ Downloaded: {downloaded}")
print(f"⏭️ Skipped (already exists): {skipped}")
print(f"❌ Failed/Not found: {failed}")
print("=" * 40)

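## 3. Extract DAIC-WOZ Files

Extract the files described below from each participant ZIP and sort them into the output folders configured in Section 1.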

### Audio & Text
- `*_AUDIO.wav` - Full interview audio recording (participant + interviewer)
- `*_TRANSCRIPT.csv` - Timestamped transcript with speaker labels (Participant/Ellie)

### Acoustic Features (COVAREP and Formants)
- `*_COVAREP.csv` - Acoustic features extracted at 10ms intervals (100Hz), including:
  - F0 (pitch), VUV (voiced/unvoiced flag)
  - NAQ, QOQ (voice quality measures)
  - H1H2, PSP, MDQ, peakSlope (glottal features)
  - MCEP_0-24 (Mel cepstral coefficients)
  - HMPDM_0-24, HMPDD_0-12 (harmonic model features)
- `*_FORMANT.csv` - First 5 formants (vocal tract resonance frequencies)
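
Because the COVAREP features are sampled every 10 ms, a time interval from the transcript can be mapped to a range of feature rows. The sketch below shows that arithmetic; it assumes that row `i` of the CSV corresponds to time `i * 0.01` seconds, which should be checked against the actual files. `covarep_rows_for_interval` is a hypothetical helper, not part of this notebook.

```python
COVAREP_FRAME_SEC = 0.01  # 10 ms per row (100 Hz), as described above


def covarep_rows_for_interval(start_sec, stop_sec, frame_sec=COVAREP_FRAME_SEC):
    """Return (first_row, end_row) so that rows[first_row:end_row]
    covers the interval [start_sec, stop_sec).

    Assumption: row i holds the features for time i * frame_sec.
    """
    first_row = int(round(start_sec / frame_sec))
    end_row = int(round(stop_sec / frame_sec))
    return first_row, end_row


# An utterance from 1.5 s to 2.25 s spans 75 feature rows
print(covarep_rows_for_interval(1.5, 2.25))  # (150, 225)
```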


In [None]:
import zipfile
import os
import time

# ============================================
# FILE TYPE DETECTION FUNCTION
# ============================================
def get_file_type(filename):
    """Determine file type and target folder based on filename"""
    filename_upper = filename.upper()

    if 'AUDIO' in filename_upper and filename_upper.endswith('.WAV'):
        return 'AUDIO'
    elif 'TRANSCRIPT' in filename_upper and filename_upper.endswith('.CSV'):
        return 'TRANSCRIPT'
    elif 'COVAREP' in filename_upper and filename_upper.endswith('.CSV'):
        return 'COVAREP'
    elif 'FORMANT' in filename_upper and filename_upper.endswith('.CSV'):
        return 'FORMANT'
    else:
        return None  # Skip unknown files

# ============================================
# EXTRACT FILES FROM ZIP
# ============================================
max_retries = 3

# Counters per file type
extracted_counts = {key: 0 for key in output_folders.keys()}
skipped_counts = {key: 0 for key in output_folders.keys()}
failed_files = []

# Get list of ZIP files
zip_files = sorted([f for f in os.listdir(zip_folder) if f.endswith('_P.zip')])
print(f"\n🔍 Found {len(zip_files)} ZIP files to process\n")

for idx, filename in enumerate(zip_files):
    zip_path = os.path.join(zip_folder, filename)
    participant_id = filename.replace('_P.zip', '')

    print(f"\n📦 [{idx+1}/{len(zip_files)}] Processing {filename}...")

    for attempt in range(max_retries):
        try:
            with zipfile.ZipFile(zip_path, 'r') as z:
                for file in z.namelist():
                    # Get just the filename (without folder path inside ZIP)
                    base_filename = os.path.basename(file)
                    if not base_filename:  # Skip directories
                        continue

                    # Determine file type
                    file_type = get_file_type(base_filename)

                    if file_type:
                        target_folder = output_folders[file_type]
                        output_path = os.path.join(target_folder, base_filename)

                        if os.path.exists(output_path):
                            print(f"  ⏭️ {base_filename} → {file_type} (exists)")
                            skipped_counts[file_type] += 1
                        else:
                            # Extract to temp, then move to correct folder
                            z.extract(file, BASE_PATH)
                            extracted_path = os.path.join(BASE_PATH, file)

                            # Move to target folder
                            os.rename(extracted_path, output_path)
                            print(f"  ✅ {base_filename} → {file_type}")
                            extracted_counts[file_type] += 1

            break  # Success - exit retry loop

        # OSError covers Drive disconnects (e.g. Errno 107);
        # BadZipFile covers corrupt or incomplete archives
        except (zipfile.BadZipFile, OSError) as e:
            print(f"  ⚠️ Attempt {attempt+1}/{max_retries} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(5)
            else:
                failed_files.append(filename)

    # Pause every 5 files
    if (idx + 1) % 5 == 0:
        print(f"\n⏸️ Pausing briefly...")
        time.sleep(2)

# ============================================
# SUMMARY
# ============================================
print("\n" + "=" * 60)
print("📊 EXTRACTION SUMMARY")
print("=" * 60)

print("\n✅ Extracted files:")
for file_type, count in extracted_counts.items():
    if count > 0:
        print(f"   {file_type}: {count} files → {output_folders[file_type]}")

print("\n⏭️ Skipped (already exist):")
for file_type, count in skipped_counts.items():
    if count > 0:
        print(f"   {file_type}: {count} files")

if failed_files:
    print(f"\n❌ Failed ZIPs: {len(failed_files)}")
    for f in failed_files:
        print(f"   - {f}")

print("\n" + "=" * 60)


🔍 Found 189 ZIP files to process


📦 [1/189] Processing 300_P.zip...
  ⏭️ 300_AUDIO.wav → AUDIO (exists)
  ✅ 300_COVAREP.csv → COVAREP
  ✅ 300_FORMANT.csv → FORMANT
  ⏭️ 300_TRANSCRIPT.csv → TRANSCRIPT (exists)

📦 [2/189] Processing 301_P.zip...
  ⏭️ 301_AUDIO.wav → AUDIO (exists)
  ⚠️ Attempt 1/3 failed: [Errno 107] Transport endpoint is not connected
  ⏭️ 301_AUDIO.wav → AUDIO (exists)
  ✅ 301_COVAREP.csv → COVAREP
  ✅ 301_FORMANT.csv → FORMANT
  ⏭️ 301_TRANSCRIPT.csv → TRANSCRIPT (exists)

📦 [3/189] Processing 302_P.zip...
  ⏭️ 302_AUDIO.wav → AUDIO (exists)
  ✅ 302_COVAREP.csv → COVAREP
  ✅ 302_FORMANT.csv → FORMANT
  ⏭️ 302_TRANSCRIPT.csv → TRANSCRIPT (exists)

📦 [4/189] Processing 303_P.zip...
  ⏭️ 303_AUDIO.wav → AUDIO (exists)
  ✅ 303_COVAREP.csv → COVAREP
  ✅ 303_FORMANT.csv → FORMANT
  ⏭️ 303_TRANSCRIPT.csv → TRANSCRIPT (exists)

📦 [5/189] Processing 304_P.zip...
  ⏭

## 4. Analyze Audio Metadata

Collect metadata from all audio files:
- Sample rate
- Duration
- Number of channels
- File size
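
Sample rate, channel count, and duration are all stored in the WAV header, so they can be read without decoding the samples. The sketch below uses Python's standard `wave` module instead of librosa, which is a swapped-in technique for illustration; it assumes the files are plain PCM WAV. `wav_header_info` is a hypothetical helper name.

```python
import wave


def wav_header_info(path):
    """Read basic metadata from a PCM WAV header without decoding the audio."""
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        n_frames = wf.getnframes()
        return {
            "sample_rate": sample_rate,
            "channels": wf.getnchannels(),
            "duration_sec": n_frames / sample_rate,
        }
```

For example, `wav_header_info("300_AUDIO.wav")` would return a dictionary with the three fields above. This can be much faster than loading every recording in full when only the metadata is needed.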

In [None]:
# ============================================
# ANALYZE AUDIO FILE METADATA
# ============================================
import librosa
import os
import pandas as pd
import numpy as np

folder = f'{BASE_PATH}/RAW_DATA/audio_raw'

# Collect metadata for each audio file
data = []

audio_files = sorted([f for f in os.listdir(folder) if f.endswith("_AUDIO.wav")])
print(f"Found {len(audio_files)} audio files\n")

for filename in audio_files:
    file_path = os.path.join(folder, filename)
    participant_id = filename.replace("_AUDIO.wav", "")

    try:
        # Load audio at its native sample rate; mono=False keeps all channels
        # (librosa's default mono=True would always report 1 channel)
        audio, sr = librosa.load(file_path, sr=None, mono=False)

        # Get number of channels
        channels = 1 if audio.ndim == 1 else audio.shape[0]

        # Calculate duration from the number of samples per channel
        duration_sec = audio.shape[-1] / sr
        duration_min = duration_sec / 60

        data.append({
            "participant_id": participant_id,
            "sample_rate": sr,
            "channels": channels,
            "duration_sec": round(duration_sec, 1),
            "duration_min": round(duration_min, 2),
            "file_size_mb": round(os.path.getsize(file_path) / (1024*1024), 1),
            "status": "✅ OK"
        })
        print(f"✅ {participant_id}")

    except Exception as e:
        data.append({
            "participant_id": participant_id,
            "sample_rate": None,
            "channels": None,
            "duration_sec": None,
            "duration_min": None,
            "file_size_mb": None,
            "status": f"❌ {e}"
        })
        print(f"❌ {participant_id}: {e}")

# Create DataFrame
df = pd.DataFrame(data)

# ============================================
# DISPLAY RESULTS
# ============================================
print("\n" + "=" * 60)
print("AUDIO FILES METADATA")
print("=" * 60)
display(df)

# ============================================
# STATISTICS
# ============================================
print("\n" + "=" * 60)
print("STATISTICS")
print("=" * 60)

df_valid = df[df["status"] == "✅ OK"]

print(f"Total Files: {len(df)}")
print(f"Valid Files: {len(df_valid)}")
print(f"Failed Files: {len(df) - len(df_valid)}")
print(f"\nSample Rates: {df_valid['sample_rate'].unique().tolist()}")
print(f"Channels: {df_valid['channels'].unique().tolist()}")
print(f"\nDuration (min):")
print(f"  Min: {df_valid['duration_min'].min()}")
print(f"  Max: {df_valid['duration_min'].max()}")
print(f"  Mean: {round(df_valid['duration_min'].mean(), 2)}")
print(f"  Std: {round(df_valid['duration_min'].std(), 2)}")
print(f"\nTotal Duration: {round(df_valid['duration_min'].sum() / 60, 2)} hours")
print(f"Total Size: {round(df_valid['file_size_mb'].sum() / 1024, 2)} GB")

# Save metadata
output_path = f"{BASE_PATH}/RAW_DATA/audio_raw/audio_metadata.csv"
df.to_csv(output_path, index=False)
print(f"\n✅ Saved to: {output_path}")

Found 189 audio files

✅ 300
✅ 301
✅ 302
✅ 303
✅ 304
✅ 305
✅ 306
✅ 307
✅ 308
✅ 309
✅ 310
✅ 311
✅ 312
✅ 313
✅ 314
✅ 315
✅ 316
✅ 317
✅ 318
✅ 319
✅ 320
✅ 321
✅ 322
✅ 323
✅ 324
✅ 325
✅ 326
✅ 327
✅ 328
✅ 329
✅ 330
✅ 331
✅ 332
✅ 333
✅ 334
✅ 335
✅ 336
✅ 337
✅ 338
✅ 339
✅ 340
✅ 341
✅ 343
✅ 344
✅ 345
✅ 346
✅ 347
✅ 348
✅ 349
✅ 350
✅ 351
✅ 352
✅ 353
✅ 354
✅ 355
✅ 356
✅ 357
✅ 358
✅ 359
✅ 360
✅ 361
✅ 362
✅ 363
✅ 364
✅ 365
✅ 366
✅ 367
✅ 368
✅ 369
✅ 370
✅ 371
✅ 372
✅ 373
✅ 374
✅ 375
✅ 376
✅ 377
✅ 378
✅ 379
✅ 380
✅ 381
✅ 382
✅ 383
✅ 384
✅ 385
✅ 386
✅ 387
✅ 388
✅ 389
✅ 390
✅ 391
✅ 392
✅ 393
✅ 395
✅ 396
✅ 397
✅ 399
✅ 400
✅ 401
✅ 402
✅ 403
✅ 404
✅ 405
✅ 406
✅ 407
✅ 408
✅ 409
✅ 410
✅ 411
✅ 412
✅ 413
✅ 414
✅ 415
✅ 416
✅ 417
✅ 418
✅ 419
✅ 420
✅ 421
✅ 422
✅ 423
✅ 424
...

| | participant_id | sample_rate | channels | duration_sec | duration_min | file_size_mb | status |
|---|---|---|---|---|---|---|---|
| 0 | 300 | 16000 | 1 | 648.5 | 10.81 | 19.8 | ✅ OK |
| 1 | 301 | 16000 | 1 | 823.9 | 13.73 | 25.1 | ✅ OK |
| 2 | 302 | 16000 | 1 | 758.8 | 12.65 | 23.2 | ✅ OK |
| 3 | 303 | 16000 | 1 | 985.3 | 16.42 | 30.1 | ✅ OK |
| 4 | 304 | 16000 | 1 | 792.6 | 13.21 | 24.2 | ✅ OK |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 184 | 488 | 16000 | 1 | 884.9 | 14.75 | 27.0 | ✅ OK |
| 185 | 489 | 16000 | 1 | 704.7 | 11.75 | 21.5 | ✅ OK |
| 186 | 490 | 16000 | 1 | 691.3 | 11.52 | 21.1 | ✅ OK |
| 187 | 491 | 16000 | 1 | 881.7 | 14.70 | 26.9 | ✅ OK |



STATISTICS
Total Files: 189
Valid Files: 189
Failed Files: 0

Sample Rates: [16000]
Channels: [1]

Duration (min):
  Min: 6.91
  Max: 32.77
  Mean: 15.94
  Std: 4.5

Total Duration: 50.21 hours
Total Size: 5.39 GB

✅ Saved to: /content/drive/MyDrive/Final Project/DataSet/DAIC-WOZ/RAW_DATA/audio_raw/audio_metadata.csv


## 5. Extract Participant Audio Only

Extract only the participant's speech from the interview:
- Uses transcript timestamps to identify participant turns (runs of consecutive participant rows)
- Preserves the response latency before each turn and the pauses within it
- Removes interviewer ("Ellie") audio

**Why?** The model should learn from the participant's speech patterns, not the interviewer's questions.
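
The block-merging idea can be illustrated on a toy transcript without any audio. The sketch below follows the logic described above: consecutive participant rows form one block, and each block starts where the previous speaker stopped. `participant_blocks` is an illustrative helper and the timestamps are made up.

```python
def participant_blocks(rows):
    """Merge consecutive participant rows into (start, stop) blocks.

    rows: list of (start_time, stop_time, speaker) tuples in transcript order.
    Each block starts at the previous row's stop_time, so the response
    latency before the participant speaks is kept.
    """
    blocks = []
    i = 0
    while i < len(rows):
        start, stop, speaker = rows[i]
        if speaker == "Participant":
            block_start = rows[i - 1][1] if i > 0 else start
            block_end = stop
            # Extend the block over consecutive participant rows
            while i + 1 < len(rows) and rows[i + 1][2] == "Participant":
                i += 1
                block_end = rows[i][1]
            blocks.append((block_start, block_end))
        i += 1
    return blocks


# Toy transcript (made-up timestamps, in seconds)
toy = [
    (0.0, 2.0, "Ellie"),
    (3.0, 5.0, "Participant"),
    (5.5, 7.0, "Participant"),
    (8.0, 9.0, "Ellie"),
    (9.4, 12.0, "Participant"),
]
print(participant_blocks(toy))  # [(2.0, 7.0), (9.0, 12.0)]
```

The first block runs from 2.0 s (when Ellie stops) to 7.0 s, so it includes the one-second response latency and the half-second pause between the two participant rows.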

In [None]:
# ============================================
# EXTRACT PARTICIPANT-ONLY AUDIO
# ============================================
import pandas as pd
import librosa
import numpy as np
import soundfile as sf
import os

def extract_participant_audio_full(audio_path, transcript_path):
    """
    Extract participant's audio segments from the full interview.

    Preserves:
    - Response latency (pause before responding)
    - Pauses within participant's speech
    - Pauses between sentences

    Args:
        audio_path: Path to full interview audio
        transcript_path: Path to transcript CSV with timestamps

    Returns:
        participant_audio: Concatenated participant audio
        sr: Sample rate
    """
    # Load audio
    audio, sr = librosa.load(audio_path, sr=None)

    # Load transcript (tab-separated)
    df = pd.read_csv(transcript_path, sep='\t')

    segments = []
    i = 0

    while i < len(df):
        # Find start of participant block
        if df.iloc[i]['speaker'] == 'Participant':

            # Get pause start (end of previous speaker)
            if i > 0:
                block_start = df.iloc[i-1]['stop_time']
            else:
                block_start = df.iloc[i]['start_time']

            # Find end of participant block (consecutive participant rows)
            block_end = df.iloc[i]['stop_time']

            while i + 1 < len(df) and df.iloc[i + 1]['speaker'] == 'Participant':
                i += 1
                block_end = df.iloc[i]['stop_time']

            # Extract audio block
            start_sample = int(block_start * sr)
            stop_sample = int(block_end * sr)
            segment = audio[start_sample:stop_sample]
            segments.append(segment)

        i += 1

    # Concatenate all participant segments
    if segments:
        participant_audio = np.concatenate(segments)
    else:
        participant_audio = np.array([])

    return participant_audio, sr


# ============================================
# PROCESS ALL FILES
# ============================================
audio_folder = f'{BASE_PATH}/RAW_DATA/audio_raw'
transcript_folder = f'{BASE_PATH}/RAW_DATA/transcripts_raw'
output_folder = f"{BASE_PATH}/RAW_DATA/audio_raw_only_patient"

os.makedirs(output_folder, exist_ok=True)

for filename in sorted(os.listdir(audio_folder)):
    if filename.endswith("_AUDIO.wav"):
        participant_id = filename.replace("_AUDIO.wav", "")

        audio_path = os.path.join(audio_folder, filename)
        transcript_path = os.path.join(transcript_folder, f"{participant_id}_TRANSCRIPT.csv")

        if os.path.exists(transcript_path):
            # Extract participant audio
            audio, sr = extract_participant_audio_full(audio_path, transcript_path)

            # Save participant-only audio
            output_path = os.path.join(output_folder, f"{participant_id}_PARTICIPANT.wav")
            sf.write(output_path, audio, sr)

            duration = len(audio) / sr
            print(f"✅ {participant_id}: {duration:.1f}s of participant audio")
        else:
            print(f"❌ {participant_id}: transcript not found")

print("\n✅ Done!")

✅ 300: 272.8s of participant audio
✅ 301: 573.0s of participant audio
✅ 302: 376.8s of participant audio
✅ 303: 738.6s of participant audio
✅ 304: 453.5s of participant audio
✅ 305: 1420.6s of participant audio
✅ 306: 558.8s of participant audio
✅ 307: 1003.0s of participant audio
✅ 308: 604.8s of participant audio
✅ 309: 331.2s of participant audio
✅ 310: 529.0s of participant audio
✅ 311: 392.0s of participant audio
✅ 312: 489.3s of participant audio
✅ 313: 437.9s of participant audio
✅ 314: 1267.0s of participant audio
✅ 315: 635.8s of participant audio
✅ 316: 467.7s of participant audio
✅ 317: 473.0s of participant audio
✅ 318: 310.0s of participant audio
✅ 319: 353.0s of participant audio
✅ 320: 410.5s of participant audio
✅ 321: 433.5s of participant audio
✅ 322: 721.4s of participant audio
✅ 323: 528.7s of participant audio
✅ 324: 394.6s of participant audio
✅ 325: 621.1s of participant audio
✅ 326: 305.0s of participant audi

## 6. Audio Processing

Apply audio preprocessing:
1. **DC Offset Removal** - Subtract mean to center signal
2. **High-pass Filter** - Remove frequencies below 80 Hz (noise, not speech)
3. **Noise Reduction** (optional) - Remove background noise
4. **Normalization** - Scale to [-1, 1] range
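
Steps 1 and 4 can be tried in isolation on a synthetic signal. The sketch below covers only DC offset removal and peak normalization (not the high-pass filter or noise reduction) and adds a guard for silent input, where the peak is zero. `remove_dc_and_normalize` and its `eps` parameter are illustrative, not part of this notebook.

```python
import numpy as np


def remove_dc_and_normalize(audio, eps=1e-12):
    """Center the signal on zero, then scale its peak to 1.0.

    A silent (constant) input is returned as zeros instead of
    dividing by zero.
    """
    audio = np.asarray(audio, dtype=np.float64)
    centered = audio - audio.mean()
    peak = np.max(np.abs(centered))
    if peak < eps:
        return np.zeros_like(centered)
    return centered / peak


# Synthetic example: one second of a 100 Hz sine with a constant offset of 0.3
t = np.arange(16000) / 16000
signal = 0.3 + 0.2 * np.sin(2 * np.pi * 100 * t)
clean = remove_dc_and_normalize(signal)
print(float(np.max(np.abs(clean))))  # 1.0
```

After processing, the offset is gone and the largest absolute sample is exactly 1.0, regardless of the original amplitude.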

In [None]:
# Install noise reduction library
!pip install noisereduce -q

In [None]:
# ============================================
# AUDIO PREPROCESSING
# ============================================
import librosa
import numpy as np
import soundfile as sf
import noisereduce as nr
from scipy.signal import butter, filtfilt
import os

# Configuration
input_folder = f"{BASE_PATH}/RAW_DATA/audio_raw_only_patient/"
output_folder = f"{BASE_PATH}/clean_audio/"
os.makedirs(output_folder, exist_ok=True)

# Processing options
APPLY_NOISE_REDUCTION = False  # Set to True if audio quality is poor


def highpass_filter(audio, sr, cutoff=80):
    """
    Remove frequencies below cutoff (default 80 Hz).

    Low frequencies are typically noise, not human speech.
    Human speech fundamental frequency: ~85-255 Hz

    Args:
        audio: Audio signal
        sr: Sample rate
        cutoff: Cutoff frequency in Hz

    Returns:
        Filtered audio signal
    """
    nyquist = sr / 2
    normalized_cutoff = cutoff / nyquist
    b, a = butter(5, normalized_cutoff, btype='high')
    filtered_audio = filtfilt(b, a, audio)
    return filtered_audio


# ============================================
# PROCESS ALL AUDIO FILES
# ============================================
for filename in sorted(os.listdir(input_folder)):
    if filename.endswith(".wav"):

        # Load audio file (keep original sample rate)
        audio, sr = librosa.load(os.path.join(input_folder, filename), sr=None)

        # Step 1: Remove DC Offset (subtract mean)
        audio = audio - np.mean(audio)

        # Step 2: Apply high-pass filter (remove frequencies below 80 Hz)
        audio = highpass_filter(audio, sr, cutoff=80)

        # Step 3 (Optional): Apply noise reduction
        if APPLY_NOISE_REDUCTION:
            audio = nr.reduce_noise(y=audio, sr=sr)

        # Step 4: Normalize to [-1, 1] range
        # (empty or silent files are left unscaled to avoid dividing by zero)
        peak = np.max(np.abs(audio)) if audio.size else 0.0
        if peak > 0:
            audio = audio / peak

        # Save processed audio
        output_filename = filename.replace(".wav", "_clean.wav")
        output_path = os.path.join(output_folder, output_filename)
        sf.write(output_path, audio, sr)

        print(f"✅ {output_filename}")

print("\n✅ Audio processing complete!")

✅ 300_PARTICIPANT_clean.wav
✅ 301_PARTICIPANT_clean.wav
✅ 302_PARTICIPANT_clean.wav
✅ 303_PARTICIPANT_clean.wav
✅ 304_PARTICIPANT_clean.wav
✅ 305_PARTICIPANT_clean.wav
✅ 306_PARTICIPANT_clean.wav
✅ 307_PARTICIPANT_clean.wav
✅ 308_PARTICIPANT_clean.wav
✅ 309_PARTICIPANT_clean.wav
✅ 310_PARTICIPANT_clean.wav
✅ 311_PARTICIPANT_clean.wav
✅ 312_PARTICIPANT_clean.wav
✅ 313_PARTICIPANT_clean.wav
✅ 314_PARTICIPANT_clean.wav
✅ 315_PARTICIPANT_clean.wav
✅ 316_PARTICIPANT_clean.wav
✅ 317_PARTICIPANT_clean.wav
✅ 318_PARTICIPANT_clean.wav
✅ 319_PARTICIPANT_clean.wav
✅ 320_PARTICIPANT_clean.wav
✅ 321_PARTICIPANT_clean.wav
✅ 322_PARTICIPANT_clean.wav
✅ 323_PARTICIPANT_clean.wav
✅ 324_PARTICIPANT_clean.wav
✅ 325_PARTICIPANT_clean.wav
✅ 326_PARTICIPANT_clean.wav
✅ 327_PARTICIPANT_clean.wav
✅ 328_PARTICIPANT_clean.wav
✅ 329_PARTICIPANT_clean.wav
✅ 330_PARTICIPANT_clean.wav
✅ 331_PARTICIPANT_clean.wav
✅ 332_PARTICIPANT_clean.wav
✅ 333_PA

## 7. Create Combined Labels File

Combine train/dev/test label files into a single `labels_all.csv`:
- Standardize column names (PHQ8_Binary → PHQ_Binary, PHQ8_Score → PHQ_Score)
- Add split column (train/dev/test)
- Save the participant ID, binary label, PHQ-8 score, and split, then preview the first rows
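
Once the labels are combined, the class distribution per split can be counted and the binary label can be checked against the score. The sketch below uses only the standard library and made-up rows; it assumes the common PHQ-8 cutoff of 10 for the binary label, which should be verified against the real label files. `phq_binary` and `class_distribution` are hypothetical helpers.

```python
from collections import Counter

PHQ8_CUTOFF = 10  # assumed threshold for the binary label


def phq_binary(score, cutoff=PHQ8_CUTOFF):
    """Return 1 (depressed) if the PHQ-8 score reaches the cutoff, else 0."""
    return int(score >= cutoff)


def class_distribution(rows):
    """Count binary labels per split.

    rows: iterable of dicts with 'split' and 'PHQ_Binary' keys,
    e.g. from csv.DictReader over labels_all.csv.
    """
    return Counter((row["split"], int(row["PHQ_Binary"])) for row in rows)


# Made-up rows for illustration only (not real DAIC-WOZ labels)
toy_rows = [
    {"split": "train", "PHQ_Score": "4", "PHQ_Binary": "0"},
    {"split": "train", "PHQ_Score": "15", "PHQ_Binary": "1"},
    {"split": "dev", "PHQ_Score": "10", "PHQ_Binary": "1"},
]
mismatches = [r for r in toy_rows
              if phq_binary(int(r["PHQ_Score"])) != int(r["PHQ_Binary"])]
print(class_distribution(toy_rows))
print(len(mismatches))  # 0
```

A non-empty `mismatches` list on the real data would indicate that the score and binary columns disagree under the assumed cutoff.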

In [2]:
# ============================================
# COMBINE LABEL FILES (CORRECTED FOR REGRESSION)
# ============================================
import pandas as pd
import os

folder = BASE_PATH + '/RAW_DATA/train_dev_test'

# Load the 3 split files
print("⏳ Loading split files...")
train = pd.read_csv(os.path.join(folder, "train_split_Depression_AVEC2017.csv"))
dev = pd.read_csv(os.path.join(folder, "dev_split_Depression_AVEC2017.csv"))
test = pd.read_csv(os.path.join(folder, "full_test_split.csv"))

# ---------------------------------------------------------
# 1. Standardize Column Names (Binary AND Score)
# ---------------------------------------------------------
# Binary
if 'PHQ8_Binary' in train.columns:
    train = train.rename(columns={'PHQ8_Binary': 'PHQ_Binary'})
if 'PHQ8_Binary' in dev.columns:
    dev = dev.rename(columns={'PHQ8_Binary': 'PHQ_Binary'})

# Score (the new part, needed for regression)
if 'PHQ8_Score' in train.columns:
    train = train.rename(columns={'PHQ8_Score': 'PHQ_Score'})
if 'PHQ8_Score' in dev.columns:
    dev = dev.rename(columns={'PHQ8_Score': 'PHQ_Score'})

# Test usually has 'PHQ_Score', but let's be safe
if 'PHQ8_Score' in test.columns:
    test = test.rename(columns={'PHQ8_Score': 'PHQ_Score'})

# ---------------------------------------------------------
# 2. Verify Columns
# ---------------------------------------------------------
print("=" * 50)
print("COLUMN VERIFICATION")
print("=" * 50)
print(f"Train: Binary? {'✅' if 'PHQ_Binary' in train.columns else '❌'} | Score? {'✅' if 'PHQ_Score' in train.columns else '❌'}")
print(f"Dev:   Binary? {'✅' if 'PHQ_Binary' in dev.columns else '❌'} | Score? {'✅' if 'PHQ_Score' in dev.columns else '❌'}")
print(f"Test:  Binary? {'✅' if 'PHQ_Binary' in test.columns else '❌'} | Score? {'✅' if 'PHQ_Score' in test.columns else '❌'}")

# ============================================
# COMBINE ALL SPLITS
# ============================================
train['split'] = 'train'
dev['split'] = 'dev'
test['split'] = 'test'

combined = pd.concat([train, dev, test], ignore_index=True)

# ============================================
# SAVE COMBINED FILE (WITH SCORE)
# ============================================
# BASE_PATH already ends in DataSet/DAIC-WOZ, so only the relative part is appended
output_path = os.path.join(BASE_PATH, "metadata/labels_all.csv")

# Make sure the folder exists
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# The important change: PHQ_Score is saved as well
cols_to_save = ['Participant_ID', 'PHQ_Binary', 'PHQ_Score', 'split']

# Final check that no column is missing before saving
final_cols = [c for c in cols_to_save if c in combined.columns]
combined[final_cols].to_csv(output_path, index=False)

print("\n" + "=" * 50)
print(f"✅ DONE! Saved labels with scores to:")
print(f"{output_path}")
print("=" * 50)
print("First 5 rows:")
print(combined[final_cols].head())

⏳ Loading split files...
COLUMN VERIFICATION
Train: Binary? ✅ | Score? ✅
Dev:   Binary? ✅ | Score? ✅
Test:  Binary? ✅ | Score? ✅

✅ DONE! Saved labels with scores to:
/content/drive/MyDrive/Final Project/DataSet/DAIC-WOZ/metadata/labels_all.csv
First 5 rows:
   Participant_ID  PHQ_Binary  PHQ_Score  split
0             303           0          0  train
1             304           0          6  train
2             305           0          7  train
3             310           0          4  train
4             312           0          2  train


## 8. Segment Audio Files

Split processed audio into fixed-length segments:
- **Segment length**: 10 seconds
- **Overlap**: 5 seconds (50%)
- **Last segment**: The leftover tail is zero-padded to 10 seconds if it is at least 2 seconds long; shorter tails are dropped

**Why segment?**
- Wav2Vec2 works best with shorter clips
- Overlapping segments provide more training examples per participant (a simple form of data augmentation)
- Reduces memory requirements during training
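
The segmentation rule above comes down to index arithmetic, which can be checked without loading any audio. The sketch below computes the sample ranges for a recording of a given length; `segment_bounds` is an illustrative helper whose defaults mirror the settings listed above, and the 16 kHz default sample rate is an assumption taken from the metadata in Section 4.

```python
def segment_bounds(n_samples, sr=16000, segment_sec=10, overlap_sec=5, min_last_sec=2):
    """Return (start, end) sample indices for each segment.

    Full segments are segment_sec long and advance by the hop size.
    The leftover tail after the last full segment is kept (to be
    zero-padded) only if it is at least min_last_sec long.
    """
    seg = int(segment_sec * sr)
    hop = int((segment_sec - overlap_sec) * sr)
    min_last = int(min_last_sec * sr)

    bounds = [(start, start + seg) for start in range(0, n_samples - seg + 1, hop)]
    tail_start = bounds[-1][1] if bounds else 0
    if n_samples - tail_start >= min_last:
        bounds.append((tail_start, n_samples))
    return bounds


# 35 s of audio at 16 kHz: full segments start at 0, 5, ..., 25 s (six in total);
# nothing is left after 35 s, so no padded segment is added.
print(len(segment_bounds(35 * 16000)))  # 6
```

With 37 s of audio, the 2-second tail is long enough to keep, so a seventh segment covering 35 to 37 s is added and would be zero-padded to 10 s.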

In [None]:
# ============================================
# SEGMENT AUDIO FILES
# ============================================
import librosa
import numpy as np
import soundfile as sf
import pandas as pd
import os

# Configuration
input_folder = f"{BASE_PATH}/clean_audio/"
output_folder = f"{BASE_PATH}/segments/audio/"
os.makedirs(output_folder, exist_ok=True)

# Segmentation parameters
SEGMENT_SEC = 10         # Segment length in seconds
OVERLAP_SEC = 5          # Overlap between segments
MIN_LAST_SEGMENT_SEC = 2 # Minimum length to keep last segment

print(f"Segmentation settings:")
print(f"  Segment length: {SEGMENT_SEC}s")
print(f"  Overlap: {OVERLAP_SEC}s")
print(f"  Hop size: {SEGMENT_SEC - OVERLAP_SEC}s")
print(f"  Min last segment: {MIN_LAST_SEGMENT_SEC}s\n")

# ============================================
# PROCESS ALL FILES
# ============================================
results = []

for filename in sorted(os.listdir(input_folder)):
    if filename.endswith("_PARTICIPANT_clean.wav"):
        participant_id = filename.replace("_PARTICIPANT_clean.wav", "")

        # Load processed audio
        audio, sr = librosa.load(os.path.join(input_folder, filename), sr=None)

        # Calculate segment parameters in samples
        segment_samples = int(SEGMENT_SEC * sr)
        hop_samples = int((SEGMENT_SEC - OVERLAP_SEC) * sr)
        min_last_samples = int(MIN_LAST_SEGMENT_SEC * sr)

        seg_count = 0

        # Create full segments
        for start in range(0, len(audio) - segment_samples + 1, hop_samples):
            segment = audio[start:start + segment_samples]
            output_path = os.path.join(output_folder, f"{participant_id}_seg{seg_count:03d}.wav")
            sf.write(output_path, segment, sr)
            seg_count += 1
            last_end = start + segment_samples

        # Handle remainder
        if seg_count > 0:
            remainder = audio[last_end:]
        else:
            remainder = audio

        # Keep last segment if long enough (pad with zeros)
        if len(remainder) >= min_last_samples:
            padded = np.zeros(segment_samples)
            padded[:len(remainder)] = remainder
            output_path = os.path.join(output_folder, f"{participant_id}_seg{seg_count:03d}.wav")
            sf.write(output_path, padded, sr)
            seg_count += 1

        results.append({'participant_id': participant_id, 'num_segments': seg_count})
        print(f"✅ {participant_id}: {seg_count} segments")

# ============================================
# SUMMARY
# ============================================
results_df = pd.DataFrame(results)
print(f"\n" + "=" * 50)
print(f"SUMMARY")
print("=" * 50)
print(f"Total participants: {len(results_df)}")
print(f"Total segments: {results_df['num_segments'].sum()}")
print(f"Segments per participant:")
print(f"  Min: {results_df['num_segments'].min()}")
print(f"  Max: {results_df['num_segments'].max()}")
print(f"  Mean: {results_df['num_segments'].mean():.1f}")

# Save metadata
results_df.to_csv(os.path.join(output_folder, "segments_metadata.csv"), index=False)
print(f"\n✅ Metadata saved to: segments_metadata.csv")

Segmentation settings:
  Segment length: 10s
  Overlap: 5s
  Hop size: 5s
  Min last segment: 2s

✅ 305: 283 segments
✅ 306: 111 segments
✅ 307: 200 segments
✅ 308: 120 segments
✅ 309: 65 segments
✅ 310: 105 segments
✅ 311: 77 segments
✅ 312: 97 segments
✅ 313: 87 segments
✅ 314: 253 segments
✅ 315: 126 segments
✅ 316: 93 segments
✅ 317: 94 segments
✅ 318: 61 segments
✅ 319: 70 segments
✅ 320: 81 segments
✅ 321: 86 segments
✅ 322: 143 segments
✅ 323: 105 segments
✅ 324: 78 segments
✅ 325: 123 segments
✅ 326: 60 segments
✅ 327: 66 segments
✅ 328: 160 segments
✅ 329: 83 segments
✅ 330: 87 segments
✅ 331: 107 segments
✅ 332: 101 segments
✅ 333: 122 segments
✅ 334: 121 segments
✅ 335: 119 segments
✅ 336: 112 segments
✅ 337: 335 segments
✅ 338: 67 segments
✅ 339: 112 segments
✅ 340: 55 segments
✅ 341: 115 segments
✅ 343: 90 segments
✅ 344: 151 segments
✅ 345: 114 segments
✅ 346: 197 segments
✅ 347: 54 segmen



### Dataset Statistics:
| Split | Participants | Depressed | Not Depressed |
|-------|--------------|-----------|---------------|
| Train | 107 | 30 | 77 |
| Dev | 35 | 12 | 23 |
| Test | 47 | 12 | 35 |
| **Total** | **189** | **54** | **135** |

### Next Steps:
1. Run `01_audio.ipynb` to extract Wav2Vec2 embeddings
2. Run `02_text.ipynb` to train BERT on transcripts
3. Run `03_fusion.ipynb` to combine modalities

---