# BaseAL Dataset Generator

Generates a dataset in a BaseAL-friendly format.

**Pipeline:**
1. Split audio into fixed-length segments (the length depends on the selected model)
2. Generate an embedding for each segment with a pretrained model (BirdNET, Perch, etc.) via bacpipe
3. Convert onset/offset labels to per-segment labels
4. Package into BaseAL format
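
To make step 3 concrete, here is a minimal sketch of how onset/offset annotations could be mapped onto fixed-length segments. It is an illustration only, not the code in `utils.segment_labels`: the function name `segment_labels`, the `(onset, offset, label)` tuple layout, and the 3-second default are assumptions, and how a trailing partial segment is handled depends on the model.

```python
import math

def segment_labels(duration, events, segment_duration=3.0,
                   min_overlap=0.0, no_event_label="no_call"):
    """Assign labels to fixed-length segments (hypothetical helper).

    events: iterable of (onset_s, offset_s, label) tuples for one recording.
    A segment receives a label when an event overlaps it by more than
    min_overlap seconds; segments without any event get no_event_label.
    """
    n_segments = math.ceil(duration / segment_duration)
    rows = []
    for i in range(n_segments):
        start = i * segment_duration
        end = start + segment_duration
        labels = set()
        for onset, offset, label in events:
            # Length of the intersection between event and segment (negative if disjoint)
            overlap = min(end, offset) - max(start, onset)
            if overlap > min_overlap:
                labels.add(label)
        rows.append((start, end, sorted(labels) or [no_event_label]))
    return rows

# Illustrative 12-second recording with two annotated events
rows = segment_labels(12.0, [(2.5, 4.0, "amerob"), (7.0, 8.0, "daejun")])
for start, end, labels in rows:
    print(f"{start:>4.1f}-{end:>4.1f}s: {labels}")
```

An event that straddles a segment boundary (the first one above) labels both segments it touches.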

**Required Format:**
```
dataset_name/
├── data/
│   ├── birdnet/
│   │   ├── file1_000_003.wav
│   │   ├── file1_003_006.wav
│   │   └── ...
│   └── perch_v2/
│       └── ...
├── embeddings/
│   ├── birdnet/
│   │   ├── file1_000_003_birdnet.npy
│   │   └── ...
│   └── perch_v2/
│       └── ...
├── labels.csv        # filename, label, validation
└── metadata.csv      # All segment metadata
```

**Important: each row of `labels.csv` and `metadata.csv` corresponds to one segment/embedding.**
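
Because each row corresponds to one segment, a row can be mapped to its embedding file from the naming pattern in the tree above. The sketch below assumes the `filename` column holds the segment file name (for example `file1_000_003.wav`); `embedding_path` is a hypothetical helper, not part of BaseAL or bacpipe.

```python
from pathlib import Path

def embedding_path(dataset_dir, segment_filename, model):
    """Build the expected embedding path for a segment (hypothetical helper).

    Follows the layout shown above:
    <dataset>/embeddings/<model>/<segment stem>_<model>.npy
    """
    stem = Path(segment_filename).stem  # drops the ".wav" extension
    return Path(dataset_dir) / "embeddings" / model / f"{stem}_{model}.npy"

p = embedding_path("HSN_BASEAL", "file1_000_003.wav", "birdnet")
print(p.as_posix())
```

The resulting path could then be passed to `numpy.load` to read the embedding for that row.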

In [None]:
from pathlib import Path
import json
import pandas as pd

from utils.helpers import convert_for_json
from utils.embeddings import initialise, generate_embeddings
from utils.adapters import HSNAdapter, AdapterConfig
from utils.segment_labels import (
    split_metadata_with_adapter,
    create_labels_csv_with_adapter,
    SegmentConfig
)

## Generate Segments and Embeddings

This step uses bacpipe, which automatically downloads and manages the models and generates both the audio segments and their embeddings.

*More info on bacpipe [here](https://github.com/bioacoustic-ai/bacpipe/releases/tag/v1.2.0)*

The example below uses a single BirdSet HSN shard as a small subset for testing. Note that the Hugging Face `datasets` library and bacpipe had conflicting dependencies.

In [None]:
# Specify model
MODEL = "birdnet"

# perch_v2 only runs on Linux/WSL

# Audio path
AUDIO_PATH = Path("HSN/HSN_train_shard_0001")
METADATA_PATH = Path("HSN/HSN_metadata_train.parquet")

# Dataset paths
DATASET_PATH = Path("HSN_BASEAL")
DATASET_PATH.mkdir(exist_ok=True)

SEG_PATH = DATASET_PATH / "data" / MODEL
EMB_PATH = DATASET_PATH / "embeddings" / MODEL
SEG_PATH.mkdir(exist_ok=True, parents=True)
EMB_PATH.mkdir(exist_ok=True, parents=True)

# Validation configuration
VALIDATION_FRACTION = 0.1

embedder = initialise(model_name=MODEL)

Generate the audio segments and embeddings:

In [None]:
embeddings = generate_embeddings(
    audio_dir=AUDIO_PATH,
    embedder=embedder,
    model_name=MODEL,
    segments_dir=SEG_PATH,
    output_dir=EMB_PATH
)

## Labels and Metadata

Use the HSNAdapter to load metadata and create segment-level labels.

The adapter handles:
- Loading HSN's parquet metadata format
- Combining `ebird_code_multilabel` + `ebird_code_secondary` as labels
- Extracting `detected_events` onset/offset annotations
- Random validation split
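
As an illustration of the last point, a seeded random split can be sketched as below. This is not the `HSNAdapter` implementation; whether the adapter splits per file or per segment is not shown here, and `validation_flags` is a hypothetical name.

```python
import random

def validation_flags(n_items, validation_fraction=0.1, random_seed=42):
    """Return one boolean per item, True for items in the validation set.

    Hypothetical sketch: a fixed seed makes the split reproducible.
    """
    rng = random.Random(random_seed)
    n_val = int(round(n_items * validation_fraction))
    val_idx = set(rng.sample(range(n_items), n_val))
    return [i in val_idx for i in range(n_items)]

flags = validation_flags(20, validation_fraction=0.1, random_seed=42)
print(sum(flags), "of", len(flags), "items marked for validation")
```

Running it twice with the same seed gives the same flags, which keeps the train/validation assignment stable between runs.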

In [None]:
# Configure the adapter
adapter_config = AdapterConfig(
    validation_fraction=VALIDATION_FRACTION,
    random_seed=42,
    no_event_label="no_call"
)

# Create the HSN adapter
adapter = HSNAdapter(config=adapter_config)

# Load metadata
df = adapter.load_metadata(METADATA_PATH)
print(f"Original: {len(df)} files")

In [None]:
# Get segment duration from model
duration = embedder.model.segment_length / embedder.model.sr

# Configure segmentation
config = SegmentConfig(
    segment_duration=duration,
    min_overlap=0.0,
    no_event_label="no_call"
)

# Split into segments using the adapter
segment_df = split_metadata_with_adapter(df, adapter, config)
print(f"Segments: {len(segment_df)} ({segment_df['has_event'].sum()} with events)")

In [None]:
# Convert numpy arrays to JSON strings to avoid embedded newlines
csv_df = segment_df.copy()
for col in ['segment_events', 'segment_event_clusters', 'ebird_code_multilabel', 'ebird_code_secondary']:
    if col in csv_df.columns:
        csv_df[col] = csv_df[col].apply(lambda x: json.dumps(convert_for_json(x)))
csv_df.to_csv(DATASET_PATH / "metadata.csv", index=False, encoding='utf-8')

# Create labels.csv with validation split using adapter
labels_df = create_labels_csv_with_adapter(segment_df, adapter)
labels_df.to_csv(DATASET_PATH / "labels.csv", index=False, encoding='utf-8')

print(f"Saved metadata to {DATASET_PATH / 'metadata.csv'}")
print(f"Saved labels to {DATASET_PATH / 'labels.csv'}")
print("\nValidation split:")
print(f"  Train: {(~labels_df['validation']).sum()} segments")
print(f"  Validation: {labels_df['validation'].sum()} segments")
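
Since the list-like columns are written to `metadata.csv` as JSON strings, they need to be decoded when the file is read back. The following is a minimal round-trip sketch using only the standard library; the column values are made up for illustration and the behaviour of `convert_for_json` is not reproduced here.

```python
import csv
import io
import json

# Illustrative nested value, e.g. onset/offset pairs inside one segment
events = [[0.5, 1.2], [2.0, 2.8]]

# Write one row with the list stored as a JSON string in a single CSV cell
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["filename", "segment_events"])
writer.writeheader()
writer.writerow({"filename": "file1_000_003.wav",
                 "segment_events": json.dumps(events)})

# Read it back and decode the JSON cell into Python lists again
buf.seek(0)
row = next(csv.DictReader(buf))
decoded = json.loads(row["segment_events"])
print(decoded)
```

With pandas, the same decoding can be applied per column, for example `df[col].apply(json.loads)`.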