# BioDCASE ATBFL Dataset Generator for BaseAL

This notebook converts the BioDCASE Challenge 4 (ATBFL-derived) strongly-labeled whale call dataset into BaseAL format.

## Input format (BioDCASE release)
- Annotations live in:
  - `train/annotations/*.csv`
  - `validation/annotations/*.csv`
- Each annotation row is an event with absolute timestamps:
  `(dataset, filename, annotation, annotator, low_frequency, high_frequency, start_datetime, end_datetime)`

## Output format (BaseAL)
```
ATBFL_BASEAL/
├── audio_flat/                 # symlinks to original long recordings
├── data/<MODEL>/               # segmented audio clips
├── embeddings/<MODEL>/         # per-segment embeddings (.npy)
├── labels.csv                  # filename, label, validation
└── metadata.csv                # segment-level metadata
```
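
As a rough illustration of how a downstream consumer might read `labels.csv`, the sketch below decodes the semicolon-separated `label` column into multi-hot vectors. The class order, the `no_call` value, the sample rows, and the helper name `to_multi_hot` are assumptions made for this example (they mirror the defaults in the configuration cell); BaseAL's own loader may work differently.

```python
from __future__ import annotations

import csv
import io

# Assumed class order for illustration; the real label set comes from the annotation CSVs.
CLASSES = ["bma", "bmb", "bmz", "bmd", "bpd", "bp20", "bp20plus"]
NO_EVENT_LABEL = "no_call"  # assumed to match the configuration cell
LABEL_SEPARATOR = ";"


def to_multi_hot(label: str) -> list[int]:
    """Turn a label string such as 'bma;bmb' into a 0/1 vector over CLASSES."""
    if label == NO_EVENT_LABEL:
        return [0] * len(CLASSES)
    present = set(label.split(LABEL_SEPARATOR))
    return [1 if c in present else 0 for c in CLASSES]


# A tiny in-memory stand-in for labels.csv (hypothetical rows).
sample = io.StringIO(
    "filename,label,validation\n"
    "siteA__rec_000_005.wav,bma;bmb,False\n"
    "siteA__rec_005_010.wav,no_call,True\n"
)
rows = list(csv.DictReader(sample))
vectors = [to_multi_hot(r["label"]) for r in rows]
is_validation = [r["validation"] == "True" for r in rows]

print(vectors)
print(is_validation)
```

Keeping `no_call` as an all-zero vector, rather than as an eighth class, is one possible design choice; it keeps the target space aligned with the seven annotated call types.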

## Labeling rule
- We convert absolute event timestamps into seconds relative to the recording start time (parsed from the audio filename).
- For each model segment window, we collect all overlapping events and assign a semicolon-separated multi-label (e.g., `bma;bmb`).
- Segments with no overlapping events get `NO_EVENT_LABEL`.
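
The rule above can be sketched on toy data as follows. This is a minimal, standalone illustration rather than the notebook's actual implementation: the function name `label_segment`, the 5-second windows, and the event times are hypothetical.

```python
from __future__ import annotations


def label_segment(
    events: list[tuple[float, float, str]],
    seg_start: float,
    seg_end: float,
    no_event_label: str = "no_call",
    separator: str = ";",
) -> str:
    """Join the labels of all events that overlap the window [seg_start, seg_end]."""
    labels: list[str] = []
    for onset, offset, name in events:
        # Positive only when the event and the window share some duration.
        overlap = min(offset, seg_end) - max(onset, seg_start)
        if overlap > 0 and name not in labels:
            labels.append(name)
    return separator.join(labels) if labels else no_event_label


# Hypothetical events: (onset_sec, offset_sec, label), relative to the recording start.
events = [(1.0, 4.0, "bma"), (3.5, 6.0, "bmb"), (20.0, 22.0, "bp20")]

print(label_segment(events, 0.0, 5.0))    # bma;bmb
print(label_segment(events, 5.0, 10.0))   # bmb
print(label_segment(events, 10.0, 15.0))  # no_call
```

Note that the `bmb` event spans the boundary at 5 s, so it contributes to both the first and the second window.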

Notes:
- This generator follows the same bacpipe-based embedding pipeline as `generator_ESC50.ipynb`.
- The official BioDCASE split is preserved: all segments from `validation/` become `validation=True`.

### Overview Notes
* The data come from Task 2, "Supervised Detection of Strongly-Labelled Whale Calls", of the BioDCASE 2025 challenge, and are derived from the ATBFL library.
* ATBFL is one of the largest annotated datasets in marine bioacoustics, gathering blue and fin whale recordings around Antarctica from 2005 to 2017.
* 6591 audio files (6004 in the train set + 587 in the validation set) totaling 1880 hours of recordings from 11 different deployments, organized in site-year datasets (e.g., kerguelen-2015).
* 11 CSV annotation files, each named after its corresponding site-year dataset.

### Annotation Format
* Annotations come in the form `(dataset,filename,annotation,annotator,low_frequency,high_frequency,start_datetime,end_datetime)`
* Label set: `{bma, bmb, bmz, bmd, bpd, bp20, bp20plus}`. The more readable capitalized forms `{BmA, BmB, BmZ, BmD, BpD, Bp20, Bp20plus}` are used in the descriptions below.
* Each dataset has a single annotator, but the same annotator may have annotated several datasets.
* Calls may overlap, so the setup is multi-class and multi-label: a file, or a segment of a file, is likely to contain several classes.
* 7 classes are provided, but the evaluation only takes 3 into account, because calls can be grouped by similarity (see below).

### Call Descriptions

**Downsweeps:** BmA, BmB, and BmZ calls are specific to blue whales (Balaenoptera musculus intermedia, Bm), while Bp20 and Bp20plus calls are characteristic of fin whales (Balaenoptera physalus quoyi, Bp).

**ABZ Calls:** As described by Miller et al. (see "Related Works"), BmA calls consist of a constant-frequency tone between 25 and 28 Hz, without additional units. BmB calls are similar but followed by a partial or full inter-tone downsweep. BmZ calls contain two tonal units: A (higher frequency) and C (lower frequency). Occasionally, a B downsweep unit appears between them, forming a "Z" shape on spectrograms.

**Bp Calls:** Bp20 and Bp20plus vocalizations are pulsed calls with peak energy at 20 Hz; Bp20plus calls carry additional energy at higher frequencies (80–100 Hz).
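
The seven annotated labels can be collapsed into three coarser groups that follow the call families described above (ABZ calls, Bp calls, and the BmD/BpD downsweeps). The mapping and the group names `bmabz`, `d`, and `bp` in this sketch are assumptions made for illustration; check the official challenge description before relying on them for evaluation.

```python
from __future__ import annotations

# Assumed grouping of the 7 annotated classes into 3 evaluation classes.
GROUPS = {
    "bma": "bmabz",
    "bmb": "bmabz",
    "bmz": "bmabz",
    "bmd": "d",
    "bpd": "d",
    "bp20": "bp",
    "bp20plus": "bp",
}


def group_label(label: str, separator: str = ";", no_event_label: str = "no_call") -> str:
    """Map a (possibly multi-label) string onto the coarser groups, dropping duplicates."""
    if label == no_event_label:
        return label
    grouped: list[str] = []
    for name in label.split(separator):
        group = GROUPS[name.lower()]  # raises KeyError on an unknown label
        if group not in grouped:
            grouped.append(group)
    return separator.join(grouped)


print(group_label("bma;bmb"))           # bmabz
print(group_label("bmz;bp20plus;bpd"))  # bmabz;bp;d
print(group_label("no_call"))           # no_call
```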

## Downloading the data

To download the data, run:

```bash
wget https://zenodo.org/records/15092732/files/biodcase_development_set.zip
unzip biodcase_development_set.zip
```

In [None]:
from __future__ import annotations

import json
import os
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import pandas as pd
import torchaudio
from tqdm.auto import tqdm

from utils.embeddings import generate_embeddings, initialise
from utils.helpers import convert_for_json

## Configuration

Set paths and model choice. `perch_v2` only runs on Linux/WSL; `birdnet` is usually easier to run everywhere.

In [None]:
# ============================== User Config ==============================
MODEL = "perch_v2"  # or: "birdnet"

# Where the BioDCASE release folder lives (must contain train/ and validation/).
# Set the LOCAL_SCRATCH environment variable to point at a local scratch
# directory for faster access in HPC environments.
LOCAL_SCRATCH = os.getenv("LOCAL_SCRATCH")

if LOCAL_SCRATCH:
    BIODCASE_ROOT = Path(LOCAL_SCRATCH) / "biodcase" / "biodcase_development_set"
    DATASET_PATH = Path(LOCAL_SCRATCH) / "biodcase" / "ATBFL_BASEAL"
else:
    BIODCASE_ROOT = Path("biodcase_development_set")
    DATASET_PATH = Path("ATBFL_BASEAL")

# Segment labeling
NO_EVENT_LABEL = "no_call"
LABEL_SEPARATOR = ";"
MIN_OVERLAP_SEC = 0.0  # absolute overlap threshold

# Optional: limit number of recordings for a quick dry-run (set to None for full)
MAX_RECORDINGS = None  # e.g., 50

# ============================== End Config ==============================

In [None]:
DATASET_PATH.mkdir(parents=True, exist_ok=True)
SEG_PATH = DATASET_PATH / "data" / MODEL
EMB_PATH = DATASET_PATH / "embeddings" / MODEL
FLAT_AUDIO = DATASET_PATH / "audio_flat"
SEG_PATH.mkdir(parents=True, exist_ok=True)
EMB_PATH.mkdir(parents=True, exist_ok=True)
FLAT_AUDIO.mkdir(parents=True, exist_ok=True)

print(f"BioDCASE root: {BIODCASE_ROOT}")
print(f"Output path:   {DATASET_PATH}")
print(f"Model:         {MODEL}")
print(f"Flat audio:    {FLAT_AUDIO}")

## Load annotations and build file-level metadata

We load both train and validation annotation CSVs, parse datetimes, and compute event onset/offset in seconds relative to the recording start time.

Recording start time is inferred from the BioDCASE filename pattern:
- Example filename: `2015-02-04T03-00-00_000.wav`
- Recording start assumed: `2015-02-04T03:00:00+00:00`
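
To make the arithmetic concrete, here is a small standalone sketch that uses only the standard library (the notebook itself uses pandas). The helper name `recording_start_from_name` and the event timestamp are hypothetical.

```python
from datetime import datetime


def recording_start_from_name(filename: str) -> datetime:
    """Parse '<YYYY-MM-DD>T<HH-MM-SS>_<...>.wav' into a UTC-aware datetime."""
    prefix = filename.split("_", 1)[0]           # "2015-02-04T03-00-00"
    date_part, time_part = prefix.split("T", 1)  # keep the hyphens in the date
    return datetime.fromisoformat(f"{date_part}T{time_part.replace('-', ':')}+00:00")


start = recording_start_from_name("2015-02-04T03-00-00_000.wav")

# Hypothetical annotation timestamp, 12 min 30.5 s into the recording.
event_start = datetime.fromisoformat("2015-02-04T03:12:30.500000+00:00")
onset_sec = (event_start - start).total_seconds()

print(start.isoformat())  # 2015-02-04T03:00:00+00:00
print(onset_sec)          # 750.5
```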

In [None]:
def _parse_recording_start_utc(filename: str) -> pd.Timestamp | None:
    name = Path(filename).name
    # Common ATBFL naming: <YYYY-MM-DD>T<HH-MM-SS>_<...>.wav
    prefix = name.split("_", 1)[0]
    if "T" not in prefix:
        return None
    try:
        date_part, time_part = prefix.split("T", 1)
        time_part = time_part.replace("-", ":")
        # Force UTC; annotation datetimes include +00:00
        return pd.Timestamp(f"{date_part}T{time_part}+00:00")
    except Exception:
        return None


def _load_annotation_csvs(biodcase_root: Path) -> pd.DataFrame:
    records: list[pd.DataFrame] = []
    for split in ["train", "validation"]:
        ann_dir = biodcase_root / split / "annotations"
        if not ann_dir.exists():
            raise FileNotFoundError(f"Missing annotations directory: {ann_dir}")
        csvs = sorted(ann_dir.glob("*.csv"))
        if not csvs:
            raise FileNotFoundError(f"No annotation CSVs found under: {ann_dir}")
        for p in csvs:
            df = pd.read_csv(p)
            df["split"] = split
            df["source_csv"] = p.name
            records.append(df)
    out = pd.concat(records, ignore_index=True, sort=False)
    return out


ann = _load_annotation_csvs(BIODCASE_ROOT)
print(f"Loaded {len(ann):,} annotation rows")
print(ann.columns.tolist())
ann.head()

In [None]:
# Parse datetimes and compute relative onsets/offsets in seconds
ann = ann.copy()
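# Normalize everything to UTC so the subtractions below never mix naive and aware timestamps
ann["start_datetime"] = pd.to_datetime(ann["start_datetime"], errors="coerce", utc=True)
ann["end_datetime"] = pd.to_datetime(ann["end_datetime"], errors="coerce", utc=True)
# Filenames that cannot be parsed yield None, which becomes NaT and is dropped below
ann["recording_start"] = pd.to_datetime(ann["filename"].apply(_parse_recording_start_utc), utc=True)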

ann["onset_sec"] = (ann["start_datetime"] - ann["recording_start"]).dt.total_seconds()
ann["offset_sec"] = (ann["end_datetime"] - ann["recording_start"]).dt.total_seconds()

# Basic sanity filtering
ann = ann.dropna(subset=["dataset", "filename", "annotation", "onset_sec", "offset_sec"]).copy()
ann = ann[(ann["offset_sec"] > ann["onset_sec"]) & (ann["onset_sec"] >= 0)].copy()

print(f"After filtering: {len(ann):,} valid events")
ann[["split", "dataset", "filename", "annotation", "onset_sec", "offset_sec"]].head()

### Build per-recording rows

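We aggregate events by `(split, dataset, filename)` into one row per recording. Because rows are built from the annotation table, recordings without any valid annotation row are not included. Each row contains:
- `source_audio_path`: the original audio path
- `length`: duration in seconds (read from the audio header)
- `detected_events`: list of `[onset, offset]` pairs in seconds
- `detected_event_labels`: list of labels aligned with `detected_events`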

In [None]:
def _audio_duration_sec(path: Path) -> float | None:
    try:
        info = torchaudio.info(os.fspath(path))
        if info.num_frames <= 0 or info.sample_rate <= 0:
            return None
        return float(info.num_frames) / float(info.sample_rate)
    except Exception:
        return None


group_cols = ["split", "dataset", "filename"]
rows: list[dict] = []

n_groups = ann[group_cols].drop_duplicates().shape[0]
for (split, dataset, filename), g in tqdm(
    ann.groupby(group_cols, sort=False),
    total=n_groups,
    desc="Aggregating recordings",
    unit="rec",
):
    audio_path = BIODCASE_ROOT / split / "audio" / str(dataset) / str(filename)
    duration = _audio_duration_sec(audio_path) if audio_path.exists() else None

    events = g[["onset_sec", "offset_sec"]].to_numpy(dtype=float).tolist()
    event_labels = g["annotation"].astype(str).tolist()

    flat_name = f"{dataset}__{filename}"
    rows.append(
        {
            "split": split,
            "dataset": str(dataset),
            "source_filename": str(filename),
            "source_audio_path": os.fspath(audio_path),
            "flat_filename": flat_name,
            "filepath": os.fspath(FLAT_AUDIO / flat_name),
            "length": duration,
            "detected_events": events,
            "detected_event_labels": event_labels,
            "annotator": g["annotator"].iloc[0] if "annotator" in g.columns else None,
        }
    )

file_df = pd.DataFrame(rows)
missing_audio = int(file_df["length"].isna().sum())
print(f"Recordings: {len(file_df):,}")
print(f"Missing/unreadable audio: {missing_audio:,}")

if MAX_RECORDINGS is not None and len(file_df) > MAX_RECORDINGS:
    file_df = file_df.sample(n=MAX_RECORDINGS, random_state=42).reset_index(drop=True)
    print(f"Subsampled to {len(file_df):,} recordings")

file_df.head()

## Create a flat audio directory (symlinks)

`generate_embeddings()` only scans a single directory (non-recursive). We create `audio_flat/` with symlinks to the original recordings.

Segment and embedding filenames are derived from the stems of these flat filenames (e.g., a segment named `<dataset>__<stem>_000_003.wav`, where `<stem>` is the original filename without its extension).

In [None]:
import shutil

# Clear existing files in audio_flat (best-effort)
for p in FLAT_AUDIO.glob("*"):
    try:
        if p.is_symlink() or p.is_file():
            p.unlink()
    except Exception:
        pass

created = 0
skipped = 0
resolved_names: list[str | None] = []

for _, row in tqdm(file_df.iterrows(), total=len(file_df), desc="Linking audio"):
    src = Path(row["source_audio_path"])
    if not src.exists():
        resolved_names.append(None)
        skipped += 1
        continue

    base_name = str(row["flat_filename"])
    name = base_name
    k = 1
    while (FLAT_AUDIO / name).exists():
        existing = FLAT_AUDIO / name
        try:
            if existing.is_symlink() and existing.resolve() == src.resolve():
                break
        except Exception:
            pass
        k += 1
        name = f"{k}__{base_name}"

    dst = FLAT_AUDIO / name
    if not dst.exists():
        try:
            os.symlink(os.fspath(src.resolve()), os.fspath(dst))
        except OSError:
            shutil.copy2(src, dst)
    created += 1
    resolved_names.append(name)

file_df = file_df.copy()
file_df["flat_filename"] = resolved_names
file_df = file_df.dropna(subset=["flat_filename"]).reset_index(drop=True)
file_df["filepath"] = file_df["flat_filename"].map(lambda x: os.fspath(FLAT_AUDIO / str(x)))

print(f"Created {created} links/copies; skipped {skipped} missing sources")
print(f"Flat audio files: {len(list(FLAT_AUDIO.glob('*.wav'))):,}")
file_df.head()

## Generate segments and embeddings

This step can be very slow for the full ATBFL dataset (about 1880 hours of audio). Consider setting `MAX_RECORDINGS` for a quick test run first.
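
For a sense of scale, the back-of-the-envelope estimate below converts the total audio duration into an approximate segment count. The window lengths (5 s for `perch_v2`, 3 s for `birdnet`) are assumptions; the notebook prints the actual value of `segment_duration` in the next cell. The estimate is a lower bound, since each recording may add one partial segment.

```python
import math

TOTAL_HOURS = 1880  # total duration quoted in the overview notes

# Assumed model window lengths in seconds; verify against the printed segment duration.
ASSUMED_SEGMENT_SEC = {"perch_v2": 5.0, "birdnet": 3.0}


def estimate_segments(total_hours: float, segment_sec: float) -> int:
    """Approximate number of fixed-length windows needed to cover total_hours of audio."""
    return math.ceil(total_hours * 3600 / segment_sec)


for model, seg in ASSUMED_SEGMENT_SEC.items():
    print(f"{model}: ~{estimate_segments(TOTAL_HOURS, seg):,} segments")
```

Under these assumptions the full dataset yields on the order of one to two million segments, each with its own audio clip and embedding file.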

In [None]:
embedder = initialise(model_name=MODEL)
segment_duration = float(embedder.model.segment_length) / float(embedder.model.sr)
print(f"Model segment duration: {segment_duration:.3f}s")

In [None]:
embeddings = generate_embeddings(
    audio_dir=FLAT_AUDIO,
    embedder=embedder,
    model_name=MODEL,
    segments_dir=SEG_PATH,
    output_dir=EMB_PATH,
)
print(f"\nProcessed {len(embeddings)} recordings in flat directory")

## Build segment-level metadata and labels

We reproduce the segment windowing used by bacpipe (integer seconds derived from the model window length), then label each segment by overlapping ATBFL events.
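
The windowing and naming scheme can be sketched as below for a short, hypothetical 12-second recording with 5-second windows. The helper name `segment_windows` and the stem are made up. Note that the last window extends past the end of the recording; whether bacpipe pads or drops such a trailing window is an assumption that should be checked against the files it actually writes.

```python
from __future__ import annotations

import math


def segment_windows(
    duration_sec: float, segment_sec: float, stem: str
) -> list[tuple[str, float, float]]:
    """Return (filename, start_sec, end_sec) for consecutive fixed-length windows."""
    n_segments = math.ceil(duration_sec / segment_sec)
    windows = []
    for i in range(n_segments):
        start = i * segment_sec
        end = (i + 1) * segment_sec
        # Integer seconds, zero-padded to at least three digits.
        name = f"{stem}_{int(start):03d}_{int(end):03d}.wav"
        windows.append((name, start, end))
    return windows


for name, start, end in segment_windows(12.0, 5.0, "siteA__rec"):
    print(name, start, end)
```

If the generated names do not match the files under `data/<MODEL>/`, the labels cannot be joined to the embeddings, so comparing a few names against the directory listing is a worthwhile check.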

In [None]:
@dataclass(frozen=True)
class SegmentConfig:
    """Configuration for segment labeling."""

    segment_duration: float
    min_overlap: float = 0.0
    no_event_label: str = "no_call"
    label_separator: str = ";"


def _event_overlaps(
    event_onset: float,
    event_offset: float,
    seg_start: float,
    seg_end: float,
    config: SegmentConfig,
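) -> bool:
    # Length of the intersection between the event and the segment window (0 if disjoint)
    overlap_start = max(event_onset, seg_start)
    overlap_end = min(event_offset, seg_end)
    overlap = max(0.0, overlap_end - overlap_start)
    # Require a strictly positive overlap so events that only touch a boundary do not count
    return overlap > 0.0 and overlap >= config.min_overlap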


def _unique_preserve_order(items: list[str]) -> list[str]:
    seen: set[str] = set()
    out: list[str] = []
    for it in items:
        if it not in seen:
            seen.add(it)
            out.append(it)
    return out


cfg = SegmentConfig(
    segment_duration=segment_duration,
    min_overlap=float(MIN_OVERLAP_SEC),
    no_event_label=NO_EVENT_LABEL,
    label_separator=LABEL_SEPARATOR,
)

In [None]:
segment_rows: list[dict] = []

for _, row in tqdm(file_df.iterrows(), total=len(file_df), desc="Splitting into segments", unit="rec"):
    length = row["length"]
    if pd.isna(length):  # covers both None and NaN (missing or unreadable audio)
        continue

    stem = Path(row["flat_filename"]).stem
    events = np.asarray(row["detected_events"], dtype=float)
    event_labels = [str(x) for x in row["detected_event_labels"]]

    n_segments = int(np.ceil(float(length) / cfg.segment_duration))
    for i in range(n_segments):
        seg_start = float(i) * cfg.segment_duration
        seg_end = float(i + 1) * cfg.segment_duration

        # Mirror bacpipe's integer second filename convention
        start_i = int(cfg.segment_duration * i)
        end_i = int(cfg.segment_duration * (i + 1))
        filename = f"{stem}_{start_i:03d}_{end_i:03d}.wav"

        overlapping: list[int] = []
        for idx, (onset, offset) in enumerate(events):
            if _event_overlaps(float(onset), float(offset), seg_start, seg_end, cfg):
                overlapping.append(idx)

        if overlapping:
            labels_here = _unique_preserve_order([event_labels[j] for j in overlapping])
            label = cfg.label_separator.join(labels_here)
            has_event = True
            seg_events: list[list[float]] = []
            seg_event_labels: list[str] = []
            for j in overlapping:
                onset, offset = events[j]
                rel_on = max(0.0, float(onset) - seg_start)
                rel_off = min(cfg.segment_duration, float(offset) - seg_start)
                seg_events.append([rel_on, rel_off])
                seg_event_labels.append(event_labels[j])
        else:
            label = cfg.no_event_label
            has_event = False
            seg_events = []
            seg_event_labels = []

        segment_rows.append(
            {
                "filename": filename,
                "original_filepath": row["source_audio_path"],
                "original_filename": row["source_filename"],
                "flat_filename": row["flat_filename"],
                "segment_start": seg_start,
                "segment_end": seg_end,
                "label": label,
                "has_event": has_event,
                "segment_events": seg_events,
                "segment_event_labels": seg_event_labels,
                "segment_event_clusters": [],
                "dataset": row["dataset"],
                "split": row["split"],
            }
        )

segment_df = pd.DataFrame(segment_rows)
print(f"Segments: {len(segment_df):,}")
segment_df.head()

In [None]:
# Save metadata.csv (JSON-encode list-like fields)
csv_df = segment_df.copy()
for col in ["segment_events", "segment_event_labels", "segment_event_clusters"]:
    if col in csv_df.columns:
        csv_df[col] = csv_df[col].apply(lambda x: json.dumps(convert_for_json(x)))
csv_df.to_csv(DATASET_PATH / "metadata.csv", index=False, encoding="utf-8")

# Save labels.csv (preserve official split)
labels_df = pd.DataFrame(
    {
        "filename": segment_df["filename"],
        "label": segment_df["label"],
        "validation": segment_df["split"].eq("validation"),
    }
)
labels_df.to_csv(DATASET_PATH / "labels.csv", index=False, encoding="utf-8")

print(f"Wrote: {DATASET_PATH / 'metadata.csv'}")
print(f"Wrote: {DATASET_PATH / 'labels.csv'}")
print(f"Train segments: {(~labels_df['validation']).sum():,}")
print(f"Val segments:   {labels_df['validation'].sum():,}")

## Verify output structure

In [None]:
print("Output directory structure:")
print(f"  {DATASET_PATH}/")
print(f"  ├── audio_flat/ ({len(list(FLAT_AUDIO.glob('*.wav')))} files)")
print(f"  ├── data/{MODEL}/ ({len(list(SEG_PATH.glob('*.wav')))} files)")
print(f"  ├── embeddings/{MODEL}/ ({len(list(EMB_PATH.glob('*.npy')))} files)")
print("  ├── labels.csv")
print("  └── metadata.csv")