# 1. Understanding the Data

This notebook explains the two core data components in the active learning pipeline:
1. **Embeddings**: Numerical representations of audio segments
2. **Annotations**: Labels for each audio segment

## What are embeddings?

Embeddings are high-dimensional vectors that represent audio in numerical form. Instead of working with raw audio waveforms, we use pre-trained models (like BirdNET or Perch) to convert audio into fixed-size vectors (e.g., 1024 dimensions).

Think of it like this:
- Raw audio: 3 seconds of sound waves
- Embedding: A list of 1024 numbers that "describes" that sound

In [10]:
import numpy as np
import pandas as pd
from pathlib import Path

# Example: Load a single embedding file
# These are stored as .npy (NumPy array) files
embedding_file = Path("../../results/test_data/embeddings/2025-11-13_18-33___birdnet-test_data/audio/FewShot/CHE_01_20190101_163410_birdnet.npy")

if embedding_file.exists():
    embeddings = np.load(embedding_file)
    print(f"Shape: {embeddings.shape}")
    print(f"\nThis means:")
    print(f"  - {embeddings.shape[0]} audio segments (each 3 seconds long)")
    print(f"  - {embeddings.shape[1]} dimensions per embedding")
else:
    print(f"File not found: {embedding_file}")
    print("Make sure you have embeddings generated first!")

Shape: (22, 1024)

This means:
  - 22 audio segments (each 3 seconds long)
  - 1024 dimensions per embedding


## Understanding Annotations

Annotations tell us what species/class is in each audio segment. They're stored in CSV format with information about:
- Which audio file
- Start/end time of the segment
- Label (species name or sound type)
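
Before matching anything, it can help to confirm that an annotation table has the expected shape. The sketch below is a minimal check under assumptions: the column names are the ones this notebook's annotation file uses (`start`, `end`, `audiofilename`, `label:default_classifier`), while the two rows and the 3-second segment length are hypothetical stand-ins.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV that mirrors the annotation columns used in this notebook
csv_text = """start,end,audiofilename,label:default_classifier
0.0,3.0,audio/FewShot/example.wav,Rock Wren
3.0,6.0,audio/FewShot/example.wav,Common Cuckoo
"""

annotations = pd.read_csv(io.StringIO(csv_text))

# Check that every required column is present
required = {"start", "end", "audiofilename", "label:default_classifier"}
missing = required - set(annotations.columns)

# Check that every annotation spans one segment (assumed to be 3 seconds)
segment_duration = 3.0
durations = annotations["end"] - annotations["start"]
all_aligned = bool(((durations - segment_duration).abs() < 1e-6).all())

print(f"Missing columns: {sorted(missing)}")
print(f"All segments are {segment_duration}s long: {all_aligned}")
```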

In [11]:
# Load annotations
annotations_file = Path("../../results/test_data/evaluations/birdnet/classification/default_classifier_annotations.csv")

if annotations_file.exists():
    df = pd.read_csv(annotations_file)
    print("Columns in annotation file:")
    print(df.columns.tolist())
    print(f"\nTotal annotations: {len(df)}")
    print(f"\nFirst few rows:")
    print(df.head())
    print(f"\nUnique labels:")
    print(df['label:default_classifier'].unique())
else:
    print(f"File not found: {annotations_file}")

Columns in annotation file:
['start', 'end', 'audiofilename', 'label:default_classifier']

Total annotations: 63

First few rows:
   start   end                             audiofilename  \
0    0.0   3.0  audio\FewShot\CHE_01_20190101_163410.wav   
1    3.0   6.0  audio\FewShot\CHE_01_20190101_163410.wav   
2    6.0   9.0  audio\FewShot\CHE_01_20190101_163410.wav   
3    9.0  12.0  audio\FewShot\CHE_01_20190101_163410.wav   
4   12.0  15.0  audio\FewShot\CHE_01_20190101_163410.wav   

  label:default_classifier  
0                Rock Wren  
1       Clark's Nutcracker  
2           Little Bustard  
3      Yellow-tufted Pipit  
4    White-crowned Sparrow  

Unique labels:
['Rock Wren' "Clark's Nutcracker" 'Little Bustard' 'Yellow-tufted Pipit'
 'White-crowned Sparrow' "Cassin's Finch" 'Garden Warbler'
 'Common Chaffinch' 'Flammulated Owl' 'Hazel Grouse' 'Common Cuckoo'
 'Western Orphean Warbler' 'Eurasian Three-toed Woodpecker'
 'Lesser Nighthawk' 'Cinnamon Flycatcher' 'Black-billed Mo

## Matching Embeddings to Annotations

The key challenge: We need to match each annotation (with start/end time) to the correct embedding segment.

Here's how it works:
1. Audio file is divided into 3-second segments
2. Each segment gets an embedding (index 0, 1, 2, ...)
3. Annotations have start times (0s, 3s, 6s, ...)
4. We calculate: `segment_index = int(start_time // 3.0)`
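
The index calculation in step 4 can be wrapped in a small helper. This is a sketch, not the pipeline's own code: `segment_index` is a hypothetical name, and the tolerance that snaps near-boundary start times (for example `8.9999999`) to the nearest segment is an assumption about how floating-point noise should be handled.

```python
import math


def segment_index(start_time: float, segment_duration: float = 3.0) -> int:
    """Map an annotation start time (seconds) to a row index in the embedding array."""
    if start_time < 0:
        raise ValueError("start_time must be non-negative")
    ratio = start_time / segment_duration
    nearest = round(ratio)
    # Snap to the nearest boundary when within tolerance, to absorb float noise
    if math.isclose(ratio, nearest, abs_tol=1e-6):
        return int(nearest)
    # Otherwise the start time falls inside a segment: take the segment it belongs to
    return math.floor(ratio)


print(segment_index(0.0), segment_index(9.0), segment_index(4.5))
```

A start time that is not a multiple of the segment duration (such as `4.5`) is assigned to the segment that contains it.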

In [12]:
# Example: Match an annotation to its embedding
if annotations_file.exists() and embedding_file.exists():
    # Get first annotation
    row = df.iloc[0]
    start_time = row['start']
    label = row['label:default_classifier']
    
    # Calculate which segment this is
    segment_duration = 3.0
    segment_idx = int(start_time // segment_duration)
    
    print(f"Annotation: {label} at {start_time}s")
    print(f"Corresponds to segment index: {segment_idx}")
    print(f"\nEmbedding for this segment:")
    print(f"Shape: {embeddings[segment_idx].shape}")

Annotation: Rock Wren at 0.0s
Corresponds to segment index: 0

Embedding for this segment:
Shape: (1024,)


## Working with Multiple Audio Files

In practice, you have many audio files, not just one. The challenge is keeping track of which embedding belongs to which annotation across all files. The following sections walk through how the pipeline handles this.
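
As a first illustration, here is a minimal sketch that matches annotations to embeddings across several files by grouping the annotations per audio file, so each `.npy` file is loaded only once. The function name `build_dataset_grouped`, the tiny synthetic files, and the 4-dimensional embeddings are all hypothetical. Grouping also changes the sample order if annotations from different files are interleaved in the CSV.

```python
import tempfile
from pathlib import Path, PureWindowsPath

import numpy as np
import pandas as pd


def build_dataset_grouped(annotations_df, embeddings_dir, model_name="birdnet",
                          segment_duration=3.0,
                          label_col="label:default_classifier"):
    """Match annotations to embeddings, loading each embedding file once."""
    embeddings, labels = [], []
    for audio_filename, group in annotations_df.groupby("audiofilename", sort=False):
        # PureWindowsPath splits on both "\" and "/", regardless of the host OS
        stem = PureWindowsPath(audio_filename).stem
        path = Path(embeddings_dir) / f"{stem}_{model_name}.npy"
        if not path.exists():
            continue  # no embeddings for this audio file: skip its annotations
        file_embeddings = np.load(path)
        for start, label in zip(group["start"], group[label_col]):
            idx = int(start // segment_duration)
            if idx < len(file_embeddings):
                embeddings.append(file_embeddings[idx])
                labels.append(label)
    return np.array(embeddings), labels


# Synthetic demo: two files with embeddings, one annotated file without
with tempfile.TemporaryDirectory() as tmp:
    np.save(Path(tmp) / "a_birdnet.npy", np.zeros((2, 4)))
    np.save(Path(tmp) / "b_birdnet.npy", np.ones((1, 4)))
    demo = pd.DataFrame({
        "start": [0.0, 3.0, 0.0, 0.0],
        "audiofilename": ["audio/a.wav", "audio/a.wav", "audio/b.wav", "audio/c.wav"],
        "label:default_classifier": ["x", "y", "z", "w"],
    })
    X, y = build_dataset_grouped(demo, tmp)

print(X.shape, y)
```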

## Summary

**Data Flow with Multiple Files:**
```
Audio Files (multiple .wav files)
    ↓
Pre-trained Model (BirdNET)
    ↓
Embedding Files (multiple .npy files, one per audio file)
    +
Annotations File (single .csv with all annotations)
    ↓
Matching Algorithm (iterate annotations, find corresponding embeddings)
    ↓
Flattened Dataset: Single array of (embedding, label) pairs
    ↓
Active Learning: Work with indices 0, 1, 2, ..., N-1
```

**Key Points:**
1. **One embedding file per audio file** - segments are stored as rows in the array
2. **One annotations file for all audio** - contains all segments from all files
3. **Matching by filename + time** - derive the embedding filename from the audio filename, then convert the start time to a segment index
4. **Flatten to single dataset** - lose file structure, gain simple indexing
5. **Track with indices** - active learning just needs 0, 1, 2, ..., N-1

**Visual Example:**
```
File 1: CHE_01.wav (22 segments) → Dataset indices [0-21]
File 2: KEN_01.wav (18 segments) → Dataset indices [22-39]  
File 3: TAM_01.wav (23 segments) → Dataset indices [40-62]
                                    Total: 63 samples
```

Now when active learning picks "sample 25", it gets the right embedding from File 2, segment 3 (25 - 22 = 3)!
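
The index arithmetic in the visual example can be reproduced with cumulative segment counts. This sketch uses the illustrative file names and counts from the example. It assumes every segment of every file was matched and that files were appended in order. `locate` is a hypothetical helper name.

```python
import numpy as np

# Segment counts from the visual example above (illustrative file names)
files = ["CHE_01.wav", "KEN_01.wav", "TAM_01.wav"]
segments_per_file = np.array([22, 18, 23])

# Exclusive end index of each file's block in the flat dataset: [22, 40, 63]
ends = np.cumsum(segments_per_file)


def locate(dataset_idx: int):
    """Return (file name, segment index within that file) for a flat dataset index."""
    if not 0 <= dataset_idx < ends[-1]:
        raise IndexError(f"dataset index {dataset_idx} out of range")
    # First block whose end lies strictly beyond dataset_idx
    file_pos = int(np.searchsorted(ends, dataset_idx, side="right"))
    block_start = 0 if file_pos == 0 else int(ends[file_pos - 1])
    return files[file_pos], dataset_idx - block_start


print(locate(25))  # ('KEN_01.wav', 3)
```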

## This is What ActiveLearner Does

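The `ActiveLearner._load_data()` method follows this process. The listing below is a simplified version: `label_to_idx`, a mapping from label name to integer class index, is assumed to be defined before the loop, and the trailing `...` stands for return values not shown here.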

```python
def _load_data(self):
    # Load annotations
    df = pd.read_csv(self.annotations_path)
    
    embeddings_list = []
    labels_list = []
    
    # Loop through each annotation
    for _, row in df.iterrows():
        audio_filename = row['audiofilename']
        label = row['label:default_classifier']
        
        # Convert to embedding filename
        filename_parts = Path(audio_filename).stem
        embedding_filename = f"{filename_parts}_{self.model_name}.npy"
        embedding_path = self.embeddings_dir / embedding_filename
        
        if embedding_path.exists():
            # Load embedding file
            emb = np.load(embedding_path)
            
            # Calculate segment index
            start_time = row['start']
            segment_duration = 3.0
            segment_idx = int(start_time / segment_duration)
            
            if segment_idx < len(emb):
                # Add this specific segment
                embeddings_list.append(emb[segment_idx])
                labels_list.append(label_to_idx[label])
    
    # Return flattened arrays
    return np.array(embeddings_list), np.array(labels_list), ...
```

The result is two parallel arrays:
- `self.embeddings`: shape (N, 1024)
- `self.labels`: shape (N,)

Where N is the total number of matched segments across ALL files!
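
The listing above looks labels up in `label_to_idx` without showing how that mapping is built. One plausible construction, offered as an assumption and not as the actual `ActiveLearner` code, is to enumerate the sorted unique label names:

```python
import numpy as np

# Hypothetical label column, using species names that appear in the annotations
labels = ["Rock Wren", "Common Cuckoo", "Rock Wren", "Hazel Grouse"]

# Sorting makes the mapping deterministic across runs
classes = sorted(set(labels))
label_to_idx = {name: i for i, name in enumerate(classes)}
idx_to_label = {i: name for name, i in label_to_idx.items()}

# Integer-encoded labels, parallel to the embeddings array
encoded = np.array([label_to_idx[name] for name in labels])

print(label_to_idx)
print(encoded)
```

Keeping the inverse mapping (`idx_to_label`) around makes it easy to turn predicted class indices back into species names.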

In [13]:
# Demonstrate the flattening concept
print("Original Structure:")
print("  Audio File 1 → [segment 0, segment 1, segment 2, ...]")
print("  Audio File 2 → [segment 0, segment 1, segment 2, ...]")
print("  Audio File 3 → [segment 0, segment 1, segment 2, ...]")
print()
print("Flattened for Active Learning:")
print("  Dataset → [sample 0, sample 1, sample 2, ..., sample N-1]")
print()
print("Where each sample is a (embedding, label) pair from ANY file")
print()
print("Example mapping:")
if 'metadata' in dir() and metadata and len(metadata) >= 10:
    for i in [0, 5, 10]:
        if i < len(metadata):
            meta = metadata[i]
            print(f"  Dataset index {i} ← {Path(meta['audio_file']).name} segment {meta['segment_idx']}")
else:
    print("  Dataset index 0 ← CHE_01_20190101_163410.wav segment 0")
    print("  Dataset index 5 ← CHE_01_20190101_163410.wav segment 5")
    print("  Dataset index 22 ← KEN_01_20190101_121500.wav segment 0")
    print("  Dataset index 30 ← KEN_01_20190101_121500.wav segment 8")

Original Structure:
  Audio File 1 → [segment 0, segment 1, segment 2, ...]
  Audio File 2 → [segment 0, segment 1, segment 2, ...]
  Audio File 3 → [segment 0, segment 1, segment 2, ...]

Flattened for Active Learning:
  Dataset → [sample 0, sample 1, sample 2, ..., sample N-1]

Where each sample is a (embedding, label) pair from ANY file

Example mapping:
  Dataset index 0 ← CHE_01_20190101_163410.wav segment 0
  Dataset index 5 ← CHE_01_20190101_163410.wav segment 5
  Dataset index 22 ← KEN_01_20190101_121500.wav segment 0
  Dataset index 30 ← KEN_01_20190101_121500.wav segment 8


## Key Insight: Flattening the Data

Notice what happened:
- We started with **multiple audio files**, each with **multiple segments**
- We created a **flat list** where each entry is a single (embedding, label) pair
- The original file structure is "forgotten" - we just have indices 0, 1, 2, ..., N-1

**Why?** For active learning, we don't care which file a sample came from. We just need:
- A pool of embeddings to train on
- Their corresponding labels
- Indices to track which are labeled vs unlabeled

The `metadata` list returned by `build_dataset` (defined in the next cell) lets us trace each sample back to its original file when debugging.
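
Tracking labeled versus unlabeled samples by index can be as simple as a boolean mask over the flat dataset. The sketch below is illustrative only: the pool size, the seed set, and the random selection that stands in for a real query strategy are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 33  # size of the flat dataset (assumed for this sketch)

# True where a sample's label has been revealed to the learner
labeled = np.zeros(n_samples, dtype=bool)

# Seed the labeled pool with a few randomly chosen samples
initial = rng.choice(n_samples, size=5, replace=False)
labeled[initial] = True

# Candidate indices the learner may query next
unlabeled_idx = np.flatnonzero(~labeled)

# Random selection standing in for a real query strategy
query = rng.choice(unlabeled_idx, size=3, replace=False)
labeled[query] = True

print(f"Labeled: {labeled.sum()}, unlabeled: {(~labeled).sum()}")
```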

In [None]:
# Simulate the matching process
def build_dataset(annotations_df, embeddings_dir, model_name="birdnet"):
    """
    Build a dataset by matching annotations to embeddings across multiple files
    
    Returns:
        embeddings_list: List of embeddings
        labels_list: List of corresponding labels
        metadata: List of dicts with file info for each sample
    """
    embeddings_list = []
    labels_list = []
    metadata = []
    
    for _, row in annotations_df.iterrows():
        # Get annotation info
        audio_filename = row['audiofilename']
        start_time = row['start']
        label = row['label:default_classifier']
        
        # Convert to embedding filename
        audio_stem = Path(audio_filename).stem  # Remove .wav extension
        embedding_filename = f"{audio_stem}_{model_name}.npy"
        embedding_path = embeddings_dir / embedding_filename
        
        # Load embedding file if it exists
        if embedding_path.exists():
            embeddings_array = np.load(embedding_path)
            
            # Calculate segment index
            segment_duration = 3.0
            segment_idx = int(start_time / segment_duration)
            
            # Check if segment exists in the embedding file
            if segment_idx < len(embeddings_array):
                # Extract the specific segment's embedding
                embedding = embeddings_array[segment_idx]
                
                # Add to our dataset
                embeddings_list.append(embedding)
                labels_list.append(label)
                metadata.append({
                    'audio_file': audio_filename,
                    'start_time': start_time,
                    'segment_idx': segment_idx,
                    'dataset_idx': len(embeddings_list) - 1
                })
    
    return np.array(embeddings_list), labels_list, metadata

# Build the dataset
if annotations_file.exists() and embedding_file.exists():
    embeddings_dir = Path("../../results/test_data/embeddings/2025-11-13_18-33___birdnet-test_data/audio/FewShot")
    if embeddings_dir.exists():
        embeddings_array, labels_list, metadata = build_dataset(
            df, 
            embeddings_dir,
            model_name="birdnet"
        )
        
        print(f"Built dataset with {len(embeddings_array)} samples")
        print(f"Embeddings shape: {embeddings_array.shape}")
        print(f"\nFirst 5 samples metadata:")
        for i in range(min(5, len(metadata))):
            meta = metadata[i]
            print(f"  [{meta['dataset_idx']}] {Path(meta['audio_file']).name} "
                  f"segment {meta['segment_idx']} @ {meta['start_time']}s → {labels_list[i]}")
    else:
        print("Embeddings directory not found")
else:
    print("Simulating with example output...")
    print("Built dataset with 63 samples")
    print("Embeddings shape: (63, 1024)")

Built dataset with 33 samples
Embeddings shape: (33, 1024)

First 5 samples metadata:
  [0] CHE_01_20190101_163410.wav segment 0 @ 0.0s → Rock Wren
  [1] CHE_01_20190101_163410.wav segment 1 @ 3.0s → Clark's Nutcracker
  [2] CHE_01_20190101_163410.wav segment 2 @ 6.0s → Little Bustard
  [3] CHE_01_20190101_163410.wav segment 3 @ 9.0s → Yellow-tufted Pipit
  [4] CHE_01_20190101_163410.wav segment 4 @ 12.0s → White-crowned Sparrow


## The Matching Algorithm

Here's the complete process for building the dataset across multiple files:

1. **Loop through each annotation** in the CSV
2. **Extract the audio filename** from the annotation
3. **Convert audio filename to embedding filename**
   - Example: `CHE_01_20190101_163410.wav` → `CHE_01_20190101_163410_birdnet.npy`
4. **Load the embedding file** for that audio
5. **Calculate segment index** from start time: `segment_idx = int(start_time // 3.0)`
6. **Extract the specific embedding** for that segment
7. **Pair it with the label** from the annotation
8. **Append to dataset**

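This creates a flat list of (embedding, label) pairs from across all files. Annotations whose embedding file is missing, or whose segment index falls outside the embedding array, are skipped, so the dataset can be smaller than the annotation count (33 samples from 63 annotations in the run above).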
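
Step 3 deserves one caution: the annotation file stores Windows-style paths (`audio\FewShot\...`), and `Path(...).stem` only splits on backslashes when the notebook runs on Windows. A portable sketch, assuming the `<stem>_<model>.npy` naming shown above, with `embedding_filename` as a hypothetical helper name:

```python
from pathlib import PureWindowsPath


def embedding_filename(audio_filename: str, model_name: str = "birdnet") -> str:
    """Derive the embedding filename from an annotated audio path."""
    # PureWindowsPath treats both "\" and "/" as separators on any OS
    stem = PureWindowsPath(audio_filename).stem
    return f"{stem}_{model_name}.npy"


print(embedding_filename(r"audio\FewShot\CHE_01_20190101_163410.wav"))
```

The extension is dropped whatever its case, so `.wav` and `.WAV` files map to the same naming scheme.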

In [15]:
# Show all audio files in the dataset
if annotations_file.exists():
    unique_files = df['audiofilename'].unique()
    print(f"Total unique audio files: {len(unique_files)}")
    print(f"\nAudio files in dataset:")
    for audio_file in unique_files[:5]:  # Show first 5
        file_annotations = df[df['audiofilename'] == audio_file]
        print(f"  {Path(audio_file).name}: {len(file_annotations)} segments")

    if len(unique_files) > 5:
        print(f"  ... and {len(unique_files) - 5} more files")

Total unique audio files: 7

Audio files in dataset:
  CHE_01_20190101_163410.wav: 22 segments
  CHE_02_20190101_183410.wav: 4 segments
  CHE_03_20190201_163410.wav: 3 segments
  CHE_04_20190203_175410.wav: 4 segments
  242A2604603691DD_20250503_031300.WAV: 10 segments
  ... and 2 more files
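
In the run above, seven audio files are annotated but the built dataset holds 33 samples, which equals the segment count of the four CHE files listed (22 + 4 + 3 + 4). That would be consistent with the remaining files having no embedding file in the directory, though the notebook does not confirm it. One way to check, sketched with a hypothetical helper and synthetic files:

```python
import tempfile
from pathlib import Path, PureWindowsPath

import numpy as np
import pandas as pd


def files_without_embeddings(annotations_df, embeddings_dir, model_name="birdnet"):
    """Return annotated audio filenames that have no matching .npy file."""
    missing = []
    for audio_filename in annotations_df["audiofilename"].unique():
        stem = PureWindowsPath(audio_filename).stem
        if not (Path(embeddings_dir) / f"{stem}_{model_name}.npy").exists():
            missing.append(audio_filename)
    return missing


# Synthetic demo: only "a.wav" has an embedding file
with tempfile.TemporaryDirectory() as tmp:
    np.save(Path(tmp) / "a_birdnet.npy", np.zeros((2, 4)))
    demo = pd.DataFrame({"audiofilename": ["audio\\a.wav", "audio\\a.wav", "audio\\b.WAV"]})
    missing_files = files_without_embeddings(demo, tmp)

print(missing_files)
```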
