# Segment-Level Feature Extraction for Depression Detection

**Author:** Jonathan Chan Jia Hao  
**Affiliation:** Monash University Malaysia, School of Information Technology  
**Date:** October 2024

---

## Overview

This notebook implements **segment-level feature extraction** for multimodal depression detection using the Extended DAIC-WOZ (E-DAIC) dataset. Unlike session-level approaches that aggregate entire conversations into single representations, this pipeline extracts fine-grained embeddings from individual utterances (text) and temporal segments (audio), preserving sequential and temporal dynamics for downstream attention-based modeling.

## Motivation

Depression manifests through subtle variations in linguistic patterns and acoustic characteristics that evolve throughout clinical interviews. Session-level aggregation (mean/max pooling) may lose critical temporal information. This notebook addresses this limitation by:

- **Preserving Temporal Structure:** Extracting embeddings at utterance/segment granularity
- **Enabling Attention Mechanisms:** Creating sequences suitable for learned attention pooling
- **Improving Interpretability:** Allowing identification of depression-discriminative moments in conversations
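
To make the point about lost temporal information concrete, the following minimal sketch uses synthetic 2-dimensional vectors as stand-ins for the 768-dimensional embeddings (all names are illustrative, not part of the pipeline). It shows that mean pooling is invariant to utterance order, so two sessions with opposite trajectories collapse to the same session-level vector:

```python
import numpy as np

# Toy "sessions": three utterance embeddings each (2-D instead of 768-D).
# session_b is session_a in reverse order, i.e. the opposite trajectory.
session_a = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
session_b = session_a[::-1]

mean_a = session_a.mean(axis=0)
mean_b = session_b.mean(axis=0)

# Mean pooling cannot tell the two sessions apart ...
same_after_pooling = np.allclose(mean_a, mean_b)
# ... although the sequences themselves differ.
same_sequences = np.array_equal(session_a, session_b)

print(same_after_pooling, same_sequences)
```

Keeping the full `[N × 768]` sequence, as this notebook does, leaves that ordering available to a downstream sequence model.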

## Architecture Pipeline

### Text Modality (Utterance-Level)
1. **Input:** Participant transcripts (CSV format)
2. **Preprocessing:** Parse and split into individual utterances
3. **Embedding Model:** RoBERTa-base (768-dimensional)
4. **Output:** Variable-length sequence of utterance embeddings `[N_utterances × 768]`

### Audio Modality (Segment-Level)
1. **Input:** Interview audio recordings (WAV format)
2. **Preprocessing:** Split into 15-second fixed-length segments
3. **Embedding Model:** Wav2Vec2-base-960h (768-dimensional)
4. **Output:** Variable-length sequence of segment embeddings `[N_segments × 768]`

## Key Features

**Batch Processing Optimization**
- GPU-accelerated batch inference for audio segments (configurable batch size)
- Memory-efficient processing with automatic cache clearing
- Nested progress tracking for large-scale feature extraction

**Flexible Pooling Strategy**
- Stores both raw segment embeddings (for attention learning) and mean-pooled baselines
- Enables comparison between learned attention and simple averaging
- Maintains compatibility with downstream classification pipelines

**Robust Error Handling**
- Graceful handling of missing transcripts/audio files
- Automatic padding/truncation for variable-length inputs
- Comprehensive logging and verification steps

## Dataset Structure

**Input Files:**
- Transcripts: `/daic_data/{PID}_P/{PID}_Transcript.csv`
- Audio: `/daic_data/{PID}_P/{PID}_AUDIO.wav`
- Labels: `/Labels/train_split.csv`, `dev_split.csv`, `test_split.csv`

**Output Files:**
- Text: `text_data_utterance_with_labels.pkl`
- Audio: `audio_data_segment_with_labels.pkl`

## Notebook Structure

### Part 1: Text Utterance Extraction (Cells 1-9)
1. **Environment Setup** (Cell 1): Library imports, device configuration
2. **Data Loading** (Cell 2): Load participant splits and labels
3. **Model Initialization** (Cell 3-4): RoBERTa model and attention pooling layer
4. **Preprocessing** (Cell 5): Transcript parsing and utterance splitting
5. **Feature Extraction** (Cell 6-7): Utterance-level embedding generation
6. **Save & Verify** (Cell 8-9): Export to pickle and validation

### Part 2: Audio Segment Extraction (Cells 10-17)
1. **Environment Setup** (Cell 10-11): Audio processing libraries and paths
2. **Model Initialization** (Cell 12): Wav2Vec2 model loading
3. **Audio Segmentation** (Cell 13): 15-second fixed-length chunking
4. **Feature Extraction** (Cell 14-15): Batch-optimized segment embedding
5. **Save & Verify** (Cell 16-17): Export to pickle and validation

### Part 3: Memory Management (Cell 18)
- Emergency cleanup procedures for interrupted processing
- GPU cache clearing and memory monitoring

## Technical Specifications

**Text Processing:**
- Tokenizer: RoBERTa byte-level BPE (50,265 vocab)
- Max sequence length: 128 tokens
- Embedding extraction: first-token representation (`<s>`, RoBERTa's equivalent of `[CLS]`)
- Model parameters: 125M (frozen during feature extraction)

**Audio Processing:**
- Sampling rate: 16 kHz (downsampled if necessary)
- Segment duration: 15 seconds (240,000 samples)
- Feature extraction: Wav2Vec2 contextualized representations
- Temporal pooling: Mean over time steps
- Model parameters: 95M (frozen during feature extraction)
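
As a rough check on the audio figures above, the sketch below computes how many time steps Wav2Vec2 produces for one 15-second segment before mean pooling. It assumes the standard Wav2Vec2-base convolutional feature encoder (seven layers with the kernel sizes and strides listed in the code); the function name is illustrative.

```python
# Kernel sizes and strides of the seven 1-D convolutions in the
# Wav2Vec2-base feature encoder (assumed standard configuration).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def wav2vec2_num_frames(num_samples: int) -> int:
    """Number of output time steps for an unpadded 1-D waveform."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length

segment_samples = 15 * 16_000
print(segment_samples)                       # 240000
print(wav2vec2_num_frames(segment_samples))  # 749
```

Under that assumption, each stored segment embedding is the average of 749 frame vectors, roughly one frame per 20 ms of audio.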

## Output Schema

Both pickle files contain DataFrames with the following structure:

```python
{
    'Participant_ID': str,              # Participant identifier
    'utterance_embeddings' or 'segment_embeddings': np.array,  # [N × 768]
    'utterances': List[str],            # (Text only) Raw utterance text
    'num_utterances' or 'num_segments': int,  # Sequence length
    'mean_pooled_embedding': np.array,  # [768] baseline aggregation
    'Depression_label': int,            # Binary PHQ-8 label (0/1)
    'PHQ8_Score': float,               # Continuous depression score
    'split': str                        # 'train', 'dev', or 'test'
}
```
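
Because the embedding arrays have a different length `N` per participant, a downstream attention model typically needs them padded to a common length together with a mask. The following is a minimal sketch of one way to do that; `pad_sequences` is a hypothetical helper, and the random arrays merely stand in for rows of the `utterance_embeddings` column.

```python
import numpy as np

def pad_sequences(sequences, pad_value=0.0):
    """Stack [N_i, D] arrays into [B, N_max, D] plus a boolean mask [B, N_max]."""
    max_len = max(seq.shape[0] for seq in sequences)
    dim = sequences[0].shape[1]
    batch = np.full((len(sequences), max_len, dim), pad_value, dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        batch[i, : seq.shape[0]] = seq
        mask[i, : seq.shape[0]] = True   # True marks real (non-padded) positions
    return batch, mask

# Synthetic stand-ins for two participants with 99 and 42 utterances.
rng = np.random.default_rng(0)
sequences = [rng.normal(size=(99, 768)), rng.normal(size=(42, 768))]
batch, mask = pad_sequences(sequences)
print(batch.shape, mask.sum(axis=1))
```

The mask lets an attention layer ignore padded positions, for example by setting their scores to a large negative value before the softmax.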

---

**License:** MIT | **Contact:** jcha0091@student.monash.edu  
**Repository:** [GitHub](https://github.com/JonathanChan9001/Multimodal-Machine-Learning-and-Data-Analysis-Framework-for-Depression-Detection)

---

# 1. TEXT UTTERANCE

In [1]:
# Cell 1: Import libraries
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm
import pickle
import os

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda




In [2]:
# Cell 2: Define paths
DATA_DIR = "/home/jonathanchan/ml_data"
LABELS_DIR = os.path.join(DATA_DIR, "Labels")
OUTPUT_FILE = os.path.join(DATA_DIR, "text_data_utterance_with_labels.pkl")

# Load participant IDs and labels
train_split = pd.read_csv(os.path.join(LABELS_DIR, "train_split.csv"))
dev_split = pd.read_csv(os.path.join(LABELS_DIR, "dev_split.csv"))
test_split = pd.read_csv(os.path.join(LABELS_DIR, "test_split.csv"))

# Combine all splits
all_splits = pd.concat([train_split, dev_split, test_split], ignore_index=True)
print(f"Total participants: {len(all_splits)}")



Total participants: 275


In [3]:
# Cell 3: Load RoBERTa model
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
roberta_model = AutoModel.from_pretrained(model_name).to(device)
roberta_model.eval()

print(f"Loaded {model_name}")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loaded roberta-base


In [5]:
# Cell 4: Define Attention Pooling Layer
class AttentionPooling(nn.Module):
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.attention = nn.Linear(hidden_dim, 1)
        
    def forward(self, embeddings):
        """
        embeddings: [num_utterances, hidden_dim]
        returns: [hidden_dim]
        """
        if len(embeddings.shape) == 1:
            # Single utterance case
            return embeddings
        
        attn_weights = torch.softmax(self.attention(embeddings), dim=0)
        pooled = torch.sum(attn_weights * embeddings, dim=0)
        return pooled

attention_pooler = AttentionPooling(768).to(device)
print("Attention pooling layer initialized")

Attention pooling layer initialized
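
The layer in Cell 4 is only initialized here; its weights would be learned in a downstream model. To illustrate what its forward pass computes, here is a NumPy sketch of the same arithmetic on toy data (`attention_pool` and the 8-dimensional embeddings are illustrative assumptions, not part of the notebook):

```python
import numpy as np

def attention_pool(embeddings, weight, bias=0.0):
    """embeddings: [N, D]; weight: [D]; returns (pooled [D], attn [N])."""
    scores = embeddings @ weight + bias            # one scalar score per utterance
    scores = scores - scores.max()                 # subtract max for numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()   # softmax over the sequence axis
    pooled = (attn[:, None] * embeddings).sum(axis=0)
    return pooled, attn

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))   # 5 utterances, 8-D toy embeddings

# With an all-zero scoring vector every utterance gets weight 1/N,
# so attention pooling reduces to the mean-pooled baseline.
pooled, attn = attention_pool(embeddings, weight=np.zeros(8))
print(attn)
```

This is why the mean-pooled embedding is a natural baseline: learned attention can only help if it moves the weights away from uniform in a useful direction.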


In [6]:
# Cell 5: Function to split transcript into utterances (FIXED for column structure)
def split_into_utterances(transcript_path, participant_id):
    """
    Parse transcript CSV and extract participant utterances.
    Handles potential multi-level column headers.
    """
    try:
        # Read transcript as tab-separated text (the first row supplies the column names)
        df = pd.read_csv(transcript_path, sep='\t', encoding='utf-8')
        
        # Debug: check actual column structure
        actual_columns = df.columns.tolist()
        
        # Find the 'Text' column (might be nested or have spaces)
        text_column = None
        for col in actual_columns:
            col_str = str(col).strip()
            if 'text' in col_str.lower():
                text_column = col
                break
        
        if text_column is None:
            # Fallback: assume third column is text
            text_column = actual_columns[2] if len(actual_columns) > 2 else actual_columns[0]
        
        # Extract text utterances
        utterances = df[text_column].tolist()
        
        # Clean utterances - remove NaN and empty strings
        cleaned_utterances = []
        for utt in utterances:
            if pd.notna(utt) and len(str(utt).strip()) > 0:
                cleaned_utterances.append(str(utt).strip())
        
        if len(cleaned_utterances) == 0:
            print(f"Warning: No utterances found for {participant_id}")
            return [""]  # Return empty string to avoid crashes
        
        return cleaned_utterances
    
    except Exception as e:
        print(f"Error processing {participant_id}: {e}")
        import traceback
        traceback.print_exc()  # Show full error for debugging
        return [""]
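
Cell 5 reads the transcripts with `sep='\t'` even though the files carry a `.csv` extension. If a file were comma-delimited instead, pandas would load every row as a single column, so it can be worth checking the delimiter first. The sketch below does this with the standard library only; the sample strings and the `Text` column name are hypothetical stand-ins rather than real dataset content.

```python
import csv
import io

def read_utterances(raw_text, text_column="Text"):
    """Return non-empty values of `text_column`, guessing tab vs. comma."""
    header = raw_text.splitlines()[0]
    # Pick whichever delimiter occurs more often in the header row.
    delimiter = "\t" if header.count("\t") > header.count(",") else ","
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    utterances = []
    for row in reader:
        value = (row.get(text_column) or "").strip()
        if value:                      # skip empty or missing utterances
            utterances.append(value)
    return utterances

comma_sample = "Start_Time,End_Time,Text\n0.0,1.5,hello there\n2.0,3.1,\n"
tab_sample = "Start_Time\tEnd_Time\tText\n0.0\t1.5\thello there\n"
print(read_utterances(comma_sample))
print(read_utterances(tab_sample))
```

Printing `df.columns` for one real transcript is a quick way to confirm which case applies before running the full extraction.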

In [7]:
# Cell 6: Function to get utterance embeddings
def get_utterance_embeddings(utterances, max_length=128):
    """
    Get RoBERTa embeddings for each utterance.
    Returns: tensor of shape [num_utterances, 768]
    """
    embeddings = []
    
    with torch.no_grad():
        for utt in utterances:
            # Tokenize
            inputs = tokenizer(
                utt,
                padding='max_length',
                truncation=True,
                max_length=max_length,
                return_tensors='pt'
            ).to(device)
            
            # Get RoBERTa output
            outputs = roberta_model(**inputs)
            
            # Extract the first-token embedding (<s>, RoBERTa's [CLS] equivalent)
            cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze(0)  # [768]
            embeddings.append(cls_embedding.cpu())
    
    # Stack into tensor
    embeddings = torch.stack(embeddings)  # [num_utterances, 768]
    return embeddings

In [11]:
# Cell 7: Process all participants
# Find transcript files in daic_data directory
TRANSCRIPT_DIR = os.path.join(DATA_DIR, "daic_data")

results = []

for idx, row in tqdm(all_splits.iterrows(), total=len(all_splits), desc="Processing participants"):
    participant_id = row['Participant_ID']
    
    # Construct transcript path - CORRECTED: Transcript.csv not TRANSCRIPT.csv
    transcript_path = os.path.join(
        TRANSCRIPT_DIR,
        f"{participant_id}_P",
        f"{participant_id}_Transcript.csv"
    )
    
    if not os.path.exists(transcript_path):
        print(f"Transcript not found: {transcript_path}")
        continue
    
    # Split into utterances
    utterances = split_into_utterances(transcript_path, participant_id)
    
    # Get embeddings per utterance
    utterance_embeddings = get_utterance_embeddings(utterances)
    
    # Store results with BOTH utterance-level and pooled embeddings
    results.append({
        'Participant_ID': participant_id,
        'utterances': utterances,  # Original text
        'utterance_embeddings': utterance_embeddings.numpy(),  # [num_utt, 768]
        'num_utterances': len(utterances),
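        # NOTE: row.get() returns None when the column is absent, so a column-name
        # mismatch with the label CSVs silently produces empty labels.
        'Depression_label': row.get('PHQ8_Binary', None),
        'PHQ8_Score': row.get('PHQ8_Score', None),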
        'split': 'train' if participant_id in train_split['Participant_ID'].values else 
                'dev' if participant_id in dev_split['Participant_ID'].values else 'test'
    })

print(f"\nProcessed {len(results)} participants successfully")

Processing participants: 100%|██████████| 275/275 [02:33<00:00,  1.80it/s]


Processed 275 participants successfully
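
Cell 7 decides each participant's split by testing membership in the split DataFrames on every iteration. A possible alternative, sketched below with hypothetical IDs, is to build a single ID-to-split dictionary up front, which also surfaces any ID that appears in more than one split:

```python
def build_split_lookup(splits):
    """Map each participant ID to its split name.

    splits: dict of split name -> iterable of participant IDs.
    """
    lookup = {}
    for split_name, participant_ids in splits.items():
        for pid in participant_ids:
            if pid in lookup:
                raise ValueError(
                    f"{pid} appears in both {lookup[pid]} and {split_name}"
                )
            lookup[pid] = split_name
    return lookup

# Hypothetical IDs for illustration only.
split_lookup = build_split_lookup({"train": [302, 303], "dev": [304], "test": [305]})
print(split_lookup[304])
```

In the notebook this could be fed with `train_split['Participant_ID']`, `dev_split['Participant_ID']`, and `test_split['Participant_ID']`.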





In [12]:
# Cell 8: Create DataFrame and save
df_utterance = pd.DataFrame(results)

# Add mean pooling baseline for comparison
df_utterance['mean_pooled_embedding'] = df_utterance['utterance_embeddings'].apply(
    lambda x: np.mean(x, axis=0)
)

print(f"\nDataFrame shape: {df_utterance.shape}")
print(f"\nColumns: {df_utterance.columns.tolist()}")
print(f"\nSample row:")
print(f"  Participant: {df_utterance.iloc[0]['Participant_ID']}")
print(f"  Num utterances: {df_utterance.iloc[0]['num_utterances']}")
print(f"  Utterance embeddings shape: {df_utterance.iloc[0]['utterance_embeddings'].shape}")
print(f"  Mean pooled shape: {df_utterance.iloc[0]['mean_pooled_embedding'].shape}")

# Save to pickle
with open(OUTPUT_FILE, 'wb') as f:
    pickle.dump(df_utterance, f)

print(f"\nSaved to: {OUTPUT_FILE}")


DataFrame shape: (275, 8)

Columns: ['Participant_ID', 'utterances', 'utterance_embeddings', 'num_utterances', 'Depression_label', 'PHQ8_Score', 'split', 'mean_pooled_embedding']

Sample row:
  Participant: 302
  Num utterances: 99
  Utterance embeddings shape: (99, 768)
  Mean pooled shape: (768,)

Saved to: /home/jonathanchan/ml_data/text_data_utterance_with_labels.pkl


In [13]:
# Cell 9: Verify the file
print("\n=== Verification ===")
with open(OUTPUT_FILE, 'rb') as f:
    loaded_df = pickle.load(f)

print(f"Loaded DataFrame shape: {loaded_df.shape}")
print(f"First participant ID: {loaded_df.iloc[0]['Participant_ID']}")
print(f"Split distribution:\n{loaded_df['split'].value_counts()}")
print(f"\nDepression label distribution:\n{loaded_df['Depression_label'].value_counts()}")


=== Verification ===
Loaded DataFrame shape: (275, 8)
First participant ID: 302
Split distribution:
split
train    163
dev       56
test      56
Name: count, dtype: int64

Depression label distribution:
Series([], Name: count, dtype: int64)
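
The empty label distribution above is what `value_counts()` returns when every entry is `None`, which happens if `row.get('PHQ8_Binary', None)` never finds its column. The sketch below reproduces that behaviour on a synthetic table and shows one way to resolve the column name explicitly. `find_label_column` and the candidate name `PHQ_Binary` are assumptions for illustration; the real column names should be read from `all_splits.columns`.

```python
import pandas as pd

def find_label_column(columns, candidates):
    """Return the first candidate present in `columns`, else raise."""
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(f"None of {candidates} found in {list(columns)}")

# Synthetic label table; 'PHQ_Binary' is a hypothetical column name.
labels = pd.DataFrame({"Participant_ID": [1, 2, 3], "PHQ_Binary": [0, 1, 0]})

# A missing key silently yields None for every row ...
missing = pd.Series([row.get("PHQ8_Binary", None) for _, row in labels.iterrows()])
print(len(missing.value_counts()))   # 0, i.e. an empty distribution

# ... whereas resolving the column first either finds it or fails loudly.
label_col = find_label_column(labels.columns, ("PHQ8_Binary", "PHQ_Binary"))
print(labels[label_col].value_counts().to_dict())
```

Failing loudly at this point is cheaper than discovering missing labels after the embeddings have been extracted and saved.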


# 2. AUDIO SEGMENT

In [1]:
# Cell 10: Import audio processing libraries
import librosa
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2Model
import warnings
warnings.filterwarnings('ignore')

print("Audio processing libraries loaded")



Audio processing libraries loaded


In [8]:
# Cell 11: Define paths for audio processing
DATA_DIR = "/home/jonathanchan/ml_data"
AUDIO_DIR = os.path.join(DATA_DIR, "daic_data")
LABELS_DIR = os.path.join(DATA_DIR, "Labels")
OUTPUT_FILE = os.path.join(DATA_DIR, "audio_data_segment_with_labels.pkl")

# Load splits
train_split = pd.read_csv(os.path.join(LABELS_DIR, "train_split.csv"))
dev_split = pd.read_csv(os.path.join(LABELS_DIR, "dev_split.csv"))
test_split = pd.read_csv(os.path.join(LABELS_DIR, "test_split.csv"))
all_splits = pd.concat([train_split, dev_split, test_split], ignore_index=True)

print(f"Total participants for audio processing: {len(all_splits)}")

Total participants for audio processing: 275


In [9]:
# Cell 12: Load Wav2Vec2 model
model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
wav2vec2_model = Wav2Vec2Model.from_pretrained(model_name).to(device)
wav2vec2_model.eval()

print(f"Loaded {model_name}")
print(f"Model on device: {next(wav2vec2_model.parameters()).device}")

Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loaded facebook/wav2vec2-base-960h
Model on device: cuda:0


In [10]:
# Cell 13: Function to segment audio into chunks
def segment_audio(audio_path, segment_length=15.0, sr=16000):
    """
    Load audio and split into fixed-length segments.
    
    Args:
        audio_path: path to audio file
        segment_length: length of each segment in seconds (default 15s)
        sr: sampling rate (16000 for wav2vec2)
    
    Returns:
        List of audio segments (numpy arrays)
    """
    try:
        # Load audio as mono; librosa resamples to `sr`, so the returned rate equals `sr`
        audio, _ = librosa.load(audio_path, sr=sr, mono=True)
        
        # Calculate segment size in samples
        segment_samples = int(segment_length * sr)
        
        # Split into segments
        segments = []
        for i in range(0, len(audio), segment_samples):
            segment = audio[i:i + segment_samples]
            
            # Skip a trailing remainder shorter than 1 second;
            # zero-pad any other short final segment to the full length
            if len(segment) < sr:
                continue
            elif len(segment) < segment_samples:
                segment = np.pad(segment, (0, segment_samples - len(segment)))
            
            segments.append(segment)
        
        if len(segments) == 0:
            print(f"Warning: No valid segments extracted from {audio_path}")
            # Return one segment with silence
            segments = [np.zeros(segment_samples)]
        
        return segments
    
    except Exception as e:
        print(f"Error segmenting audio {audio_path}: {e}")
        # Return dummy segment
        return [np.zeros(int(segment_length * sr))]
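
The chunking rule in Cell 13 (drop a trailing remainder under one second, zero-pad anything else that is short) can be checked on synthetic waveforms without touching the audio files. The helper below, `chunk_waveform`, is an illustrative re-statement of that loop, not a replacement for `segment_audio`.

```python
import numpy as np

def chunk_waveform(audio, segment_samples, min_samples):
    """Split a 1-D waveform into fixed-length chunks, padding or dropping the tail."""
    segments = []
    for start in range(0, len(audio), segment_samples):
        segment = audio[start:start + segment_samples]
        if len(segment) < min_samples:
            continue  # drop a trailing remainder shorter than min_samples
        if len(segment) < segment_samples:
            segment = np.pad(segment, (0, segment_samples - len(segment)))
        segments.append(segment)
    return segments

sr = 16_000
# 32.5 s of synthetic audio: two full segments plus a 2.5 s remainder (padded).
padded_case = chunk_waveform(np.ones(int(32.5 * sr), dtype=np.float32), 15 * sr, sr)
# 30.5 s: two full segments; the 0.5 s remainder is dropped.
dropped_case = chunk_waveform(np.ones(int(30.5 * sr), dtype=np.float32), 15 * sr, sr)
print(len(padded_case), len(dropped_case))  # 3 2
```

Note that a padded final segment contains mostly zeros, which affects any pooling that averages over the whole 15 seconds.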

In [26]:
# Cell 14: Function to extract wav2vec2 embeddings from audio segments (BATCH OPTIMIZED)

def get_audio_segment_embeddings(audio_segments, max_samples=16000*15, batch_size=16):
    """
    Extract wav2vec2 embeddings for each audio segment using batch processing.
    
    Args:
        audio_segments: list of numpy arrays (audio waveforms)
        max_samples: maximum length to process
        batch_size: number of segments to process at once on GPU
    
    Returns:
        tensor of shape [num_segments, 768]
    """
    embeddings = []
    
    with torch.no_grad():
        # Process segments in batches
        for i in range(0, len(audio_segments), batch_size):
            batch_segments = audio_segments[i:i + batch_size]
            
            # Prepare batch
            processed_batch = []
            for segment in batch_segments:
                # Ensure segment is correct length
                if len(segment) > max_samples:
                    segment = segment[:max_samples]
                processed_batch.append(segment)
            
            # Process entire batch at once (faster GPU utilization)
            inputs = processor(
                processed_batch,
                sampling_rate=16000,
                return_tensors="pt",
                padding=True
            ).to(device)
            
            # Get wav2vec2 output for entire batch
            outputs = wav2vec2_model(**inputs)
            
            # Mean pool over time dimension
            # outputs.last_hidden_state shape: [batch_size, time_steps, 768]
            batch_embeddings = outputs.last_hidden_state.mean(dim=1)  # [batch_size, 768]
            
            # Move to CPU and store
            embeddings.extend([emb.cpu() for emb in batch_embeddings])
    
    # Stack into tensor
    embeddings = torch.stack(embeddings)  # [num_segments, 768]
    return embeddings
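
Cell 14 averages over all time steps of each segment, which for a zero-padded final segment includes frames that cover padding. If that turns out to matter, one option is a masked mean that only averages the frames belonging to real audio. The sketch below shows the arithmetic on toy arrays; `masked_mean` and `valid_lengths` are hypothetical, and the number of valid frames per segment would have to be derived separately.

```python
import numpy as np

def masked_mean(hidden_states, valid_lengths):
    """hidden_states: [B, T, D]; valid_lengths: [B] number of real frames (>= 1)."""
    _, time_steps, _ = hidden_states.shape
    # mask[b, t] is True for frames that belong to real audio
    mask = np.arange(time_steps)[None, :] < np.asarray(valid_lengths)[:, None]
    summed = (hidden_states * mask[:, :, None]).sum(axis=1)
    return summed / mask.sum(axis=1, keepdims=True)

hidden = np.ones((2, 4, 3))
hidden[1, 2:] = 0.0  # pretend the last two frames of item 1 cover padding
plain = hidden.mean(axis=1)
masked = masked_mean(hidden, valid_lengths=[4, 2])
print(plain[1], masked[1])
```

In this toy case the plain mean of the second item is diluted by the padded frames, while the masked mean is not. In the real model, frames over padding are not exactly zero, so the effect would need to be measured rather than assumed.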

In [27]:
# Cell 15: Process all participants with nested progress bars

results = []
BATCH_SIZE = 4

# Outer loop: participant batches
for batch_start in tqdm(range(0, len(all_splits), BATCH_SIZE), desc="Participant batches"):
    batch_end = min(batch_start + BATCH_SIZE, len(all_splits))
    batch_rows = all_splits.iloc[batch_start:batch_end]
    
    # Inner loop: individual participants (with nested progress bar)
    for idx, row in tqdm(batch_rows.iterrows(), total=len(batch_rows), 
                         desc=f"Batch {batch_start//BATCH_SIZE + 1}", 
                         leave=False):
        participant_id = row['Participant_ID']
        
        audio_path = os.path.join(
            AUDIO_DIR,
            f"{participant_id}_P",
            f"{participant_id}_AUDIO.wav"
        )
        
        if not os.path.exists(audio_path):
            continue
        
        # Segment audio into 15-second chunks
        audio_segments = segment_audio(audio_path, segment_length=15.0)
        
        # Get wav2vec2 embeddings per segment (GPU batch processing in Cell 14)
        segment_embeddings = get_audio_segment_embeddings(audio_segments)
        
        results.append({
            'Participant_ID': participant_id,
            'segment_embeddings': segment_embeddings.numpy(),
            'num_segments': len(audio_segments),
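            # NOTE: row.get() returns None when the column is absent, so a column-name
            # mismatch with the label CSVs silently produces empty labels.
            'Depression_label': row.get('PHQ8_Binary', None),
            'PHQ8_Score': row.get('PHQ8_Score', None),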
            'split': 'train' if participant_id in train_split['Participant_ID'].values else 
                    'dev' if participant_id in dev_split['Participant_ID'].values else 'test'
        })
    
    torch.cuda.empty_cache()

print(f"\nProcessed {len(results)} participants successfully")

Participant batches: 100%|██████████| 69/69 [08:14<00:00,  7.16s/it]


Processed 275 participants successfully





In [28]:
# Cell 16: Create DataFrame and save
df_audio_segment = pd.DataFrame(results)

# Add mean pooling baseline for comparison
df_audio_segment['mean_pooled_embedding'] = df_audio_segment['segment_embeddings'].apply(
    lambda x: np.mean(x, axis=0)
)

print(f"\nDataFrame shape: {df_audio_segment.shape}")
print(f"\nColumns: {df_audio_segment.columns.tolist()}")
print(f"\nSample statistics:")
print(f"  Participant: {df_audio_segment.iloc[0]['Participant_ID']}")
print(f"  Num segments: {df_audio_segment.iloc[0]['num_segments']}")
print(f"  Segment embeddings shape: {df_audio_segment.iloc[0]['segment_embeddings'].shape}")
print(f"  Mean pooled shape: {df_audio_segment.iloc[0]['mean_pooled_embedding'].shape}")

# Note: only segment embeddings are stored in `results`; raw waveforms are not kept,
# so the DataFrame has no 'audio_segments' column

# Save to pickle
with open(OUTPUT_FILE, 'wb') as f:
    pickle.dump(df_audio_segment, f)

print(f"\nSaved to: {OUTPUT_FILE}")


DataFrame shape: (275, 7)

Columns: ['Participant_ID', 'segment_embeddings', 'num_segments', 'Depression_label', 'PHQ8_Score', 'split', 'mean_pooled_embedding']

Sample statistics:
  Participant: 302
  Num segments: 51
  Segment embeddings shape: (51, 768)
  Mean pooled shape: (768,)

Saved to: /home/jonathanchan/ml_data/audio_data_segment_with_labels.pkl


In [29]:
# Cell 17: Verify the saved file
print("\n=== Audio Segmentation Verification ===")
with open(OUTPUT_FILE, 'rb') as f:
    loaded_df = pickle.load(f)

print(f"Loaded DataFrame shape: {loaded_df.shape}")
print(f"First participant ID: {loaded_df.iloc[0]['Participant_ID']}")
print(f"\nSplit distribution:\n{loaded_df['split'].value_counts()}")
print(f"\nDepression label distribution:\n{loaded_df['Depression_label'].value_counts()}")
print(f"\nSegment count statistics:")
print(f"  Mean segments per participant: {loaded_df['num_segments'].mean():.1f}")
print(f"  Min segments: {loaded_df['num_segments'].min()}")
print(f"  Max segments: {loaded_df['num_segments'].max()}")


=== Audio Segmentation Verification ===
Loaded DataFrame shape: (275, 7)
First participant ID: 302

Split distribution:
split
train    163
dev       56
test      56
Name: count, dtype: int64

Depression label distribution:
Series([], Name: count, dtype: int64)

Segment count statistics:
  Mean segments per participant: 65.0
  Min segments: 28
  Max segments: 132


In [None]:
# Cell 18: Emergency GPU/CPU memory cleanup after interruption

import torch
import gc

# 1. Delete any large variables that might still exist
try:
    del results, audio_segments, segment_embeddings, batch_rows
    print("✓ Deleted large variables")
except NameError:
    # Raised as soon as one of the names is undefined; later names are left untouched
    pass

# 2. Clear Python garbage
gc.collect()
print("✓ Python garbage collected")

# 3. Clear GPU cache (if using CUDA)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.synchronize()  # Wait for all operations to finish
    print("✓ GPU cache cleared")
    
    # Show current GPU status
    print(f"\nGPU Memory Status:")
    print(f"  Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"  Reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
    print(f"  Free: {(torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_reserved()) / 1024**3:.2f} GB")
else:
    print("✓ No GPU detected")



✓ Python garbage collected
✓ GPU cache cleared

GPU Memory Status:
  Allocated: 0.85 GB
  Reserved: 0.98 GB
  Free: 14.47 GB
