# BirdSet Data Pipeline Tutorial

This Jupyter notebook provides a comprehensive guide to setting up and configuring a data pipeline tailored for bird classification in audio files. The tutorial is structured to walk you through each component of the pipeline, ensuring a clear understanding of its functionality and configuration. Whether you are processing raw audio data or spectrograms, this notebook aims to provide you with the necessary knowledge to efficiently set up your data pipeline.

## Installation

### Prerequisites
Before initiating the installation process of the BirdSet pipeline, it's crucial to ensure that your computing environment meets the following prerequisites:
- **Python**: You should have Python version 3.10 or higher installed on your system.
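
The prerequisite can be checked from inside Python before installing anything. The snippet below is a minimal sketch; `meets_minimum` and `MIN_PYTHON` are illustrative names and not part of BirdSet.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if not meets_minimum():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} found; "
        f"BirdSet expects {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or newer."
    )
```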

### Installation Steps
The BirdSet pipeline can be installed using either of the two methods: via Conda with Pip, or using Poetry. Select the method that best suits your preference and follow the corresponding steps below.

#### Using Conda and Pip

1. **Create a Conda Environment**: Begin by setting up a dedicated environment for BirdSet. This is a best practice to manage dependencies and avoid potential conflicts with other packages in your system.

   ```bash
   conda create -n birdset python=3.10
   ```

   After the environment is successfully created, activate it:

   ```bash
   conda activate birdset
   ```

2. **Install BirdSet**: Proceed with cloning the BirdSet repository and installing the package in editable mode. This approach is beneficial as it allows any modifications you make to the BirdSet code to be reflected immediately without the need for reinstallation.

   ```bash
   git clone https://github.com/DBD-research-group/BirdSet.git
   cd BirdSet
   pip install -e .
   ```

#### Using Poetry

1. **Clone the Repository**: Start with cloning the BirdSet repository to your local machine and navigate to the cloned directory.

   ```bash
   git clone https://github.com/DBD-research-group/BirdSet.git
   cd BirdSet
   ```

2. **Configure Poetry**: Prepare the project for Poetry by renaming the `pyproject.poetry` file to `pyproject.toml`.

   ```bash
   mv pyproject.poetry pyproject.toml
   ```

3. **Install Dependencies and Activate Environment**: Install all the necessary dependencies with Poetry and then activate the Poetry shell environment.

   ```bash
   poetry install
   poetry shell
   ```

## Log in to Huggingface

Our datasets are shared via HuggingFace Datasets in our [HuggingFace BirdSet repository](https://huggingface.co/datasets/DBD-research-group/birdset_v1). HuggingFace is a central hub for sharing and using datasets and models for machine learning and data science projects. To access private datasets hosted on HuggingFace, you need to be authenticated. Here's how you can log in to HuggingFace:

1. **Install HuggingFace CLI**: If you haven't already, you need to install the HuggingFace CLI (Command Line Interface). This tool enables you to interact with HuggingFace services directly from your terminal. You can install it using pip:

   ```bash
   pip install huggingface_hub
   ```

2. **Login via CLI**: Once the HuggingFace CLI is installed, you can log in to your HuggingFace account directly from your terminal. This step is essential for accessing private datasets or contributing to the HuggingFace community. Use the following command:

   ```bash
   huggingface-cli login
   ```

   After executing this command, you'll be prompted to enter your HuggingFace credentials ([User Access Token](https://huggingface.co/docs/hub/security-tokens)). Once authenticated, your credentials will be saved locally, allowing seamless access to HuggingFace resources.
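
To verify that a token is available before starting a long download, a small check like the following can help. It is only a sketch under two assumptions: that the token is either exported as the `HF_TOKEN` environment variable or stored in a file named `token` under the HuggingFace cache directory (`HF_HOME`, or `~/.cache/huggingface` by default). `find_hf_token` is a hypothetical helper, not a HuggingFace API.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return a locally available HuggingFace token, or None if none is found."""
    env = os.environ if env is None else env

    # 1) An explicit environment variable takes precedence.
    token = env.get("HF_TOKEN")
    if token:
        return token.strip()

    # 2) Otherwise look for the token file in the cache directory
    #    (assumed location, see the note above).
    if env.get("HF_HOME"):
        base = Path(env["HF_HOME"])
    else:
        base = Path(home if home is not None else Path.home()) / ".cache" / "huggingface"
    token_file = base / "token"
    if token_file.is_file():
        return token_file.read_text().strip() or None
    return None


# Report only whether a token exists; never print the token itself.
print("token found" if find_hf_token() else "no token found; run `huggingface-cli login`")
```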

## TL;DR
To get started with the default configuration, use the following code snippet. It sets up the BirdSet pipeline to load the [High Sierras](https://zenodo.org/records/7525805) test dataset (10,296 samples) together with a train set (5,197 samples) of matching bird classes from [xeno-canto](https://xeno-canto.org/). The total size of the dataset is 6.2 GB. The samples are provided as spectrograms with a resolution of `128x1024` pixels in batches of size `32`; the labels are one-hot encoded for a multilabel classification task. The sections below explain how to configure the pipeline to your needs.

In [3]:
from birdset.datamodule.birdset_datamodule import BirdSetDataModule

# initiate the data module
dm = BirdSetDataModule()
# prepare the data (download dataset, ...)
dm.prepare_data()
# setup the dataloaders
dm.setup(stage="fit")
# get the dataloaders
train_loader = dm.train_dataloader()
# get the first batch
batch = next(iter(train_loader))
# get shape of the batch
print(batch["input_values"].shape)
print(batch["labels"].shape)
batch
   


In [2]:
from lightning import Trainer
min_epochs = 1
max_epochs = 5
trainer = Trainer(min_epochs=min_epochs, max_epochs=max_epochs, accelerator="gpu", devices=1)


In [3]:
from birdset.modules.base_module import BaseModule
model = BaseModule(
    len_trainset=dm.len_trainset,
    task=dm.task,
    batch_size=dm.train_batch_size,
    num_epochs=max_epochs)

In [6]:
model

BaseModule(
  (loss): BCEWithLogitsLoss()
  (model): EfficientNetClassifier(
    (model): EfficientNet(
      (features): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): SiLU(inplace=True)
        )
        (1): Sequential(
          (0): MBConv(
            (block): Sequential(
              (0): Conv2dNormActivation(
                (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
                (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (2): SiLU(inplace=True)
              )
              (1): SqueezeExcitation(
                (avgpool): AdaptiveAvgPool2d(output_size=1)
                (fc1): Conv2d(32, 8, kernel_size=(1, 1), stride=(1, 1))
                (fc2): Conv2d(8, 32, kern

In [4]:
trainer.fit(model, dm)

You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name                  | Type                   | Params
-----------------------------------------------------------------
0 | loss                  | BCEWithLogitsLoss      | 0     
1 | model                 | EfficientNetClassifier | 6.5 M 
2 | train_metric          | cmAP                   | 0     
3 | valid_metric          | cmAP                   | 0     
4 | test_metric           | cmAP                   | 0     
5 | valid_metric_best     | MaxMetric              | 0     
6 | valid_add_metrics     | MetricCollection       | 0     
7 | test_add_metrics      | MetricCollection 

`Trainer.fit` stopped: `max_epochs=5` reached.


## Configuration of BirdSet Data Pipeline

The BirdSet Data Pipeline offers a robust and flexible configuration system, primarily designed to streamline the process of setting up your data processing environment. While this notebook presents hardcoded configurations for simplicity, it's important to note that these settings can be dynamically managed using advanced configuration tools like Hydra. Hydra is a powerful utility that enables flexible and scalable configuration management, allowing you to adapt the pipeline settings to various environments or use cases seamlessly. For an in-depth understanding of Hydra, consider visiting [Hydra's official documentation](https://hydra.cc/docs/intro).

Tip: Detailed information is provided in the docstrings of the classes and functions. You can access them by hovering over the class or function name in your IDE or by opening the source file in a text editor.

### BirdSetDataModule
The `BirdSetDataModule` is a [PyTorch Lightning DataModule](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#lightningdatamodule) that encapsulates the entire data pipeline. It inherits from `BaseDataModuleHF`, which is a base class for all DataModules that use the [HuggingFace Datasets library](https://huggingface.co/docs/datasets/index).

To initialize the `BirdSetDataModule`, you need to provide the following parameters:

```python
from birdset.datamodule.birdset_datamodule import BirdSetDataModule

data_module = BirdSetDataModule(
    dataset=dataset_config,  # DatasetConfig: the configuration for the dataset
    loaders=loaders_config,  # LoadersConfig: the configuration for the loaders
    transforms=transforms,   # BirdSetTransformsWrapper: the transforms applied to the data
    mapper=mapper,           # XCEventMapping: the mapping for the events
)
```

We will now walk through each of these parameters to understand their functionality and configuration.


### 1. Dataset Configuration

Configuring the dataset is the first step in setting up the BirdSet data pipeline. Here you specify which dataset you want to load, how many classes it has, and how the data is split into train, validation, and test sets. The `DatasetConfig` class is used to configure the dataset.

In [1]:
from birdset.datamodule.base_datamodule import DatasetConfig

dataset_config = DatasetConfig(
    data_dir='../../data_birdset',
    dataset_name='HSN',
    hf_path='DBD-research-group/BirdSet',
    hf_name='HSN',
    n_classes=21,
    n_workers=1,
    val_split=0.2,
    task="multilabel",
    subset=None,
    sampling_rate=32000,
    class_weights_sampler=None,
    classlimit=500,
    eventlimit=5,
)

Here's a brief overview of the parameters used in the `DatasetConfig` class:

- `data_dir`: Specifies the directory where the dataset files are stored. **Important**: The dataset uses a lot of disk space, so make sure you have enough storage available.
- `dataset_name`: The name assigned to the dataset.
- `hf_path`: The path to the dataset stored on HuggingFace.
- `hf_name`: The name of the dataset on HuggingFace.
- `seed`: A seed value for ensuring reproducibility across runs.
- `n_classes`: The total number of distinct classes in the dataset.
- `n_workers`: The number of worker processes used for data loading.
- `val_split`: The proportion of the dataset reserved for validation.
- `task`: Defines the type of task (e.g., 'multilabel' or 'multiclass').
- `sampling_rate`: The sampling rate for audio data processing.
- `class_weights_sampler`: Indicates whether to use class weights in the sampler for handling imbalanced datasets.
- `classlimit`: The maximum number of samples per class.
- `eventlimit`: Defines the maximum number of audio events processed per audio file, capping the quantity to ensure balance across files.
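
To make the effect of `val_split` concrete, the following sketch splits a set of sample indices into a training and a validation part. It illustrates the idea with the standard library only; the function name, the default seed, and the rounding rule are assumptions and do not reproduce BirdSet's internal splitting code.

```python
import random


def train_val_split(n_samples, val_split=0.2, seed=42):
    """Split sample indices into shuffled train and validation index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_val = round(n_samples * val_split)  # size of the validation part
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = train_val_split(17348, val_split=0.2)
print(len(train_idx), len(val_idx))  # 13878 3470
```

Fixing the seed matters here: without it, every run would validate on a different subset and results would not be comparable across runs.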

#### Important Note:
- The `class_weights_loss` parameter is currently deprecated and only implemented for focal loss. It's recommended to use the `class_weights_sampler` instead, as it has been shown to yield favorable results, as evidenced by the winner of the [BirdCLEF 2023](https://www.kaggle.com/competitions/birdclef-2023) challenge.

Selecting appropriate values for these parameters is crucial, as they directly influence the efficiency of the training process and the overall performance of the model.

### 2. Dataloader Configuration

Once the dataset is configured, the next crucial step is setting up the data loaders. Data loaders are pivotal in efficiently feeding data into the model during both the training and testing phases. They manage the data flow, ensuring that the model is supplied with a consistent stream of data batches. In this section, we'll use the `LoaderConfig` and `LoadersConfig` classes to define different configurations for the training and testing data loaders.

In [4]:
from birdset.datamodule.base_datamodule import LoaderConfig, LoadersConfig
# Configuration for the training data loader
train_loader_config = LoaderConfig(
    batch_size=32,
    shuffle=True,
    num_workers=8,
    pin_memory=False,
    drop_last=True,
    persistent_workers=False,
    #prefetch_factor=None,
)

# Configuration for the testing data loader
test_loader_config = LoaderConfig(
    batch_size=32,
    shuffle=False,
    num_workers=8,
    pin_memory=False,
    drop_last=False,
    persistent_workers=False,
    #prefetch_factor=None,
)

# Aggregating the loader configurations
loaders_config = LoadersConfig(
    train=train_loader_config,
    valid=test_loader_config,
    test=test_loader_config,
)

Here's a brief overview of the parameters used in the `LoaderConfig` class:

- `batch_size`: Specifies the number of samples contained in each batch. This is a crucial parameter as it impacts memory utilization and model performance.
- `shuffle`: Determines whether the data is shuffled at the beginning of each epoch. Shuffling is typically used for training data to ensure model robustness and prevent overfitting.
- `num_workers`: Sets the number of subprocesses to be used for data loading. More workers can speed up the data loading process but also increase memory consumption.
- `pin_memory`: When set to `True`, enables the DataLoader to copy Tensors into CUDA pinned memory before returning them. This can lead to faster data transfer to CUDA-enabled GPUs.
- `drop_last`: Determines whether to drop the last incomplete batch. Setting this to `True` is useful when the total size of the dataset is not divisible by the batch size.
- `persistent_workers`: Indicates whether the data loader should keep the workers alive for the next epoch. This can improve performance at the cost of memory.
- `prefetch_factor`: Defines the number of batches loaded in advance by each worker. This parameter is commented out here and can be adjusted based on specific requirements.

Proper configuration of the data loaders is essential as it directly influences the efficiency of the training process, hardware resource utilization, and ultimately, the performance of the model.
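
The interplay of `batch_size`, `drop_last`, `num_workers`, and `prefetch_factor` comes down to simple arithmetic, sketched below. The helper names are illustrative only, and the sample count of 1000 is an arbitrary example.

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_last):
    """Number of batches a loader yields in one pass over `n_samples`."""
    if drop_last:
        return n_samples // batch_size        # incomplete final batch is discarded
    return math.ceil(n_samples / batch_size)  # incomplete final batch is kept


def prefetched_batches(num_workers, prefetch_factor):
    """Batches held ready across all workers at any time."""
    return num_workers * prefetch_factor


print(batches_per_epoch(1000, 32, drop_last=True))   # 31
print(batches_per_epoch(1000, 32, drop_last=False))  # 32
print(prefetched_batches(8, 2))                      # 16
```

With `drop_last=True`, the 8 leftover samples (1000 - 31 * 32) are skipped in that epoch. Because the training data is reshuffled, different samples are skipped each time.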

### 3. Configuration of TransformationsWrapper

Transformations play a critical role in the data preparation process within the BirdSet data pipeline. These operations, applied before the data is fed into the model, encompass a range of augmentation techniques designed to regularize the model and prevent overfitting. Properly configured transformations not only enhance the diversity and quality of the training data but also help the model generalize better to new, unseen data.

In the BirdSet framework, transformations are meticulously orchestrated through the `BirdSetTransformsWrapper` class. This wrapper acts as a comprehensive interface for defining and applying various transformations and augmentations to the data. It ensures that the data is consistently and effectively transformed, aligning with the specific requirements of the model and the inherent characteristics of the dataset.

By configuring the `transforms_wrapper` using the `BirdSetTransformsWrapper` class, you gain precise control over how the data is manipulated during the preprocessing phase.

To initialize the `BirdSetTransformsWrapper`, you need to provide the following parameters:

```python
from omegaconf import DictConfig  # assumption: DictConfig comes from omegaconf (used by Hydra)

from birdset.datamodule.components import DefaultFeatureExtractor
from birdset.datamodule.components.transforms import (
    BirdSetTransformsWrapper,
    PreprocessingConfig,
)

# Default values; the comments give the expected type of each parameter.
transforms = BirdSetTransformsWrapper(
    task="multilabel",                            # Literal['multiclass', 'multilabel']
    sampling_rate=32000,                          # int
    model_type="waveform",                        # Literal['vision', 'waveform']
    spectrogram_augmentations=DictConfig({}),     # DictConfig
    waveform_augmentations=DictConfig({}),        # DictConfig
    decoding=None,                                # EventDecoding | None
    feature_extractor=DefaultFeatureExtractor(),  # DefaultFeatureExtractor
    max_length=5,                                 # int, in seconds
    nocall_sampler=DictConfig({}),                # DictConfig
    preprocessing=PreprocessingConfig(),          # PreprocessingConfig
)
```

We will go through each of these parameters to understand their functionality and configuration.

#### 3.1 Dataset and Model specific parameters
The following parameters ensure that the transformations are tailored to the specific requirements of the dataset and the model:

- `task`: Defines the type of task (e.g., 'multilabel' or 'multiclass').
- `sampling_rate`: The sampling rate for audio data processing.
- `model_type`: Specifies the type of model (e.g., 'vision' or 'waveform'). For a vision model, the input data is expected to be a spectrogram, while for a waveform model, the input data is the raw audio waveform.
- `max_length`: The maximum length of the audio files in seconds.
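
The difference between the two `task` settings shows up in the shape of the targets. The sketch below is an illustration rather than BirdSet's own encoding code: a multilabel sample gets a vector with a 1 for every class present, while a multiclass sample gets a single class index. `encode_labels` is a hypothetical helper.

```python
def encode_labels(label_ids, n_classes, task="multilabel"):
    """Encode class indices as a multi-hot vector (multilabel) or one index (multiclass)."""
    if task == "multilabel":
        vector = [0.0] * n_classes
        for idx in label_ids:
            vector[idx] = 1.0  # several classes may be active at once
        return vector
    if task == "multiclass":
        if len(label_ids) != 1:
            raise ValueError("multiclass expects exactly one label per sample")
        return label_ids[0]
    raise ValueError(f"unknown task: {task!r}")


print(encode_labels([2, 5], n_classes=8))
# [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(encode_labels([3], n_classes=8, task="multiclass"))
# 3
```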

#### 3.2 Augmentations

Augmentations are powerful techniques applied to the data to introduce diversity and variability. They are particularly useful in audio and signal processing to enhance the robustness of models against variations in input data. In the BirdSet framework, you can configure waveform and spectrogram augmentations as follows:

**Waveform Augmentations**

These augmentations are applied directly to the audio waveform. In BirdSet, you can use any waveform augmentation technique as long as it can be composed by the [torch-audiomentations Compose function](https://github.com/asteroid-team/torch-audiomentations/blob/main/torch_audiomentations/core/composition.py). You can add waveform augmentations as follows:

In [5]:
from torch_audiomentations import AddColoredNoise, PitchShift
waveform_augmentation = {
    "colored_noise": AddColoredNoise(p=0.2, min_snr_in_db=3.0, max_snr_in_db=30.0, min_f_decay=-2.0, max_f_decay=2.0),
    "pitch_shift": PitchShift(p=0.2, sample_rate=32000, min_transpose_semitones=-4.0, max_transpose_semitones=4.0),
}

In this example:
- `colored_noise`: Adds colored noise to the audio signal to simulate various real-world noise conditions.
- `pitch_shift`: Alters the pitch of the audio signal, which is useful for simulating different tonal variations.
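
The parameters `p`, `min_snr_in_db`, and `max_snr_in_db` can be read as follows: with probability `p`, noise is added at a signal-to-noise ratio drawn from the given range. The sketch below shows this with plain Python and white Gaussian noise. It is a simplification and not the torch-audiomentations implementation (which also shapes the noise spectrum via `f_decay`); all function names are illustrative.

```python
import math
import random


def rms(x):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled so the signal-to-noise ratio equals `snr_db`."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # SNR_dB = 20 * log10(rms_signal / rms_noise), solved for the noise RMS:
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20))
    scale = target_noise_rms / rms(noise)
    return [s + scale * n for s, n in zip(signal, noise)]


def maybe_augment(signal, p, snr_range, rng):
    """Apply the augmentation with probability `p`; draw the SNR uniformly from `snr_range`."""
    if rng.random() >= p:
        return signal
    return add_noise_at_snr(signal, rng.uniform(*snr_range), rng)


rng = random.Random(0)
tone = [math.sin(2 * math.pi * 440 * t / 32000) for t in range(32000)]  # 1 s at 32 kHz
noisy = maybe_augment(tone, p=0.2, snr_range=(3.0, 30.0), rng=rng)
```

A low SNR such as 3 dB buries the signal in noise, while 30 dB leaves it almost untouched, so the range controls how aggressive the augmentation is.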

**Spectrogram Augmentations**

These augmentations are applied to the spectrogram representation of the audio. In BirdSet, you can use any spectrogram augmentation technique as long as it can be composed by the [torchvision Compose function](https://pytorch.org/vision/stable/generated/torchvision.transforms.Compose.html). You can add spectrogram augmentations as follows:

In [6]:
from torchvision.transforms import RandomApply
from torchaudio.transforms import TimeMasking, FrequencyMasking
spectrogram_augmentations = {
    "time_masking": RandomApply([TimeMasking(time_mask_param=100, iid_masks=True)], p=0.3),
    "frequency_masking": RandomApply([FrequencyMasking(freq_mask_param=100, iid_masks=True)], p=0.5)
}

In this example:
- `time_masking`: Randomly masks a sequence of consecutive time steps in the spectrogram, helping the model become invariant to small temporal shifts.
- `frequency_masking`: Randomly masks a sequence of consecutive frequency channels, encouraging the model to be robust against frequency variations.

Configuring the augmentations correctly is crucial as they directly influence the model's ability to learn from a diverse set of data representations, ultimately leading to better generalization and performance.
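
To show what time masking does to the data, here is a minimal pure-Python sketch that operates on a spectrogram stored as a list of frequency rows. It mirrors the idea behind `TimeMasking` (a random run of frames, at most `time_mask_param` wide, is overwritten) but is not the torchaudio implementation. Frequency masking works the same way along the other axis.

```python
import random


def time_mask(spec, time_mask_param, rng, fill=0.0):
    """Overwrite a random run of consecutive time frames in a [freq][time] spectrogram."""
    n_frames = len(spec[0])
    width = rng.randint(0, min(time_mask_param, n_frames))  # length of the mask
    start = rng.randint(0, n_frames - width)                # first masked frame
    masked = [row[:] for row in spec]                       # copy; the input stays untouched
    for row in masked:
        row[start:start + width] = [fill] * width
    return masked, (start, width)


spec = [[1.0] * 50 for _ in range(4)]  # 4 frequency bins, 50 time frames
masked, (start, width) = time_mask(spec, time_mask_param=10, rng=random.Random(0))
print(f"masked frames {start}..{start + width - 1}" if width else "nothing masked")
```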

#### 3.3 Decoding

Decoding is the process that converts the (compressed) data into a format that can be directly used by the model. In the BirdSet framework, we use the `EventDecoding` class by default. It is designed for preprocessing audio files in the context of event detection tasks. Its primary function is to ensure that each audio segment fed into the model is not only in the correct format, but also conditioned to improve the model's ability to identify and understand different audio events. Decoding is configured as follows:

In [7]:
from birdset.datamodule.components import EventDecoding
decoding = EventDecoding(
    min_len=1.0,
    max_len=5.0,
    sampling_rate=32000,
    extension_time=8,
    extracted_interval=5,
)

Key Parameters:
- `_target_`: Only relevant when the component is configured through Hydra, where it specifies the `EventDecoding` class to instantiate. It does not appear in the Python call above.
- `min_len` and `max_len`: Determine the minimum and maximum duration (in seconds) of the audio segments after decoding. These constraints ensure that each processed audio segment is of a suitable length for the model.
- `sampling_rate`: Defines the sampling rate to which the audio should be resampled. This standardizes the input data's sampling rate, making it consistent for model processing.
- `extension_time`: Refers to the time (in seconds) by which the duration of an audio event is extended. This parameter is crucial for ensuring that shorter audio events are sufficiently long for the model to process effectively.
- `extracted_interval`: Denotes the fixed duration (in seconds) of the audio segment that is randomly extracted from the extended audio event.

Decoding is performed on the fly, ensuring that the data fed into the model is always in the correct format, even when the source data comes in various encoded forms.
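
The roles of `extension_time` and `extracted_interval` can be sketched in two steps: first widen a short event, then cut a random fixed-length window from it. The code below is one plausible reading of the parameter descriptions above, written as an assumption; BirdSet's `EventDecoding` may handle details such as file boundaries differently. Both function names are hypothetical.

```python
import random


def extend_event(start, end, extension_time, file_duration):
    """Widen [start, end] symmetrically to `extension_time` seconds, clipped to the file."""
    missing = max(0.0, extension_time - (end - start))
    new_start = max(0.0, start - missing / 2)
    new_end = min(file_duration, end + missing / 2)
    return new_start, new_end


def extract_interval(start, end, extracted_interval, rng):
    """Pick a random window of `extracted_interval` seconds inside [start, end]."""
    slack = max(0.0, (end - start) - extracted_interval)
    offset = rng.uniform(0.0, slack)
    return start + offset, min(end, start + offset + extracted_interval)


# A 1-second event at 10 s in a 60-second recording.
ext_start, ext_end = extend_event(10.0, 11.0, extension_time=8, file_duration=60.0)
win_start, win_end = extract_interval(ext_start, ext_end, 5, random.Random(0))

sampling_rate = 32000
first, last = int(win_start * sampling_rate), int(win_end * sampling_rate)  # sample indices
print((ext_start, ext_end), (first, last))
```

Because the window position is random, the event appears at a different offset inside the 5-second clip on each draw, which acts as a mild form of augmentation.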

#### 3.4 Feature Extraction

Feature extraction is a pivotal step in transforming raw data into a structured format that is suitable for model training. The `DefaultFeatureExtractor` in BirdSet is tailored for processing waveform data, providing a range of functionalities to prepare the data for model consumption.

In [8]:
from birdset.datamodule.components import DefaultFeatureExtractor
feature_extractor = DefaultFeatureExtractor(
    feature_size=1,
    sampling_rate=32000,
    padding_value=0.0,
    return_attention_mask=False,
)

Key Parameters:
- `feature_size`: Determines the size of the extracted features.
- `sampling_rate`: The sampling rate at which the audio data should be processed.
- `padding_value`: The value used for padding shorter sequences to a consistent length.
- `return_attention_mask`: Indicates whether an attention mask should be returned along with the processed features.

This component is crucial for ensuring that the input data to the model is in a consistent and processable format, catering to models that require structured input in the form of PyTorch tensors.
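
What `padding_value` and `return_attention_mask` mean in practice is easiest to see on a toy batch. The sketch below pads plain Python lists to a common length; it illustrates the concept and is not the `DefaultFeatureExtractor` code. The output key `input_values` matches the batch key used earlier in this notebook, while `pad_batch` itself is a hypothetical helper.

```python
def pad_batch(waveforms, max_length, padding_value=0.0, return_attention_mask=False):
    """Pad or truncate each waveform to exactly `max_length` samples."""
    padded, masks = [], []
    for wav in waveforms:
        wav = list(wav[:max_length])                # truncate inputs that are too long
        n_pad = max_length - len(wav)
        masks.append([1] * len(wav) + [0] * n_pad)  # 1 = real sample, 0 = padding
        padded.append(wav + [padding_value] * n_pad)
    out = {"input_values": padded}
    if return_attention_mask:
        out["attention_mask"] = masks
    return out


batch = pad_batch([[0.1, 0.2], [0.3, 0.4, 0.5, 0.6, 0.7]], max_length=4,
                  return_attention_mask=True)
print(batch["input_values"])    # [[0.1, 0.2, 0.0, 0.0], [0.3, 0.4, 0.5, 0.6]]
print(batch["attention_mask"])  # [[1, 1, 0, 0], [1, 1, 1, 1]]
```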

#### 3.5 No-call Sampler
You can use the `NoCallMixer` to add no-call samples to the dataset. This is particularly useful for training models to recognize the absence of bird calls. The `NoCallMixer` is configured as follows:
```python
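# Assumption: the module path follows the original snippet; adjust the import
# if NoCallMixer lives elsewhere in your BirdSet version.
from birdset.datamodule.components.no_call_sampler import NoCallMixer

nocall_sampler = NoCallMixer(
    directory="path/to/no_call_samples",  # str: folder containing the no-call clips
    p=0.075,                              # float: probability of mixing in a no-call sample
    sampling_rate=32000,                  # int
    length=5,                             # int: clip length in seconds
    n_classes=21,                         # int
)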
```


In [9]:
# Since this would require the dataset to be downloaded (see `download_background_noise.ipynb`), we will not use this for now
nocall_sampler = None
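
Conceptually, mixing in no-call samples means that a small fraction of training examples is replaced by a clip without any bird and an all-zero label vector. The sketch below expresses that idea with the parameters `p` and `n_classes` from above. It is an assumption about the general technique, not the `NoCallMixer` source, and `mix_in_no_call` is a hypothetical name.

```python
import random


def mix_in_no_call(sample, label, no_call_pool, p, n_classes, rng):
    """With probability `p`, swap the sample for a no-call clip with an all-zero label."""
    if no_call_pool and rng.random() < p:
        return rng.choice(no_call_pool), [0.0] * n_classes
    return sample, label


rng = random.Random(0)
pool = ["noise_clip_a", "noise_clip_b"]  # placeholders for loaded background recordings
replaced = sum(
    mix_in_no_call("bird_clip", [1.0, 0.0, 0.0], pool, 0.075, 3, rng)[0] != "bird_clip"
    for _ in range(20000)
)
print(f"{replaced / 20000:.3f} of the samples were replaced")
```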

#### 3.6 Configuration of Data Preprocessing

Data preprocessing is a fundamental step in the BirdSet data pipeline, ensuring that the raw data is adequately conditioned and transformed, making it suitable for model consumption. The `PreprocessingConfig` class allows for a detailed specification of various preprocessing parameters, each carefully selected to meet the unique demands of your dataset and model. Here's how you can configure the data preprocessing in the BirdSet pipeline:

In [10]:
from torchaudio.transforms import Spectrogram
from birdset.datamodule.components.resize import Resizer
from birdset.datamodule.components.augmentations import PowerToDB
from birdset.datamodule.components.transforms import PreprocessingConfig

# Creating the preprocessing configuration
preprocessing = PreprocessingConfig(
        spectrogram_conversion= Spectrogram(
            n_fft=1024,
            hop_length=320,
            power=2.0,
        ),
        resizer=Resizer(
            db_scale=True,
            target_height=None,
            target_width=1024,
        ),
        dbscale_conversion=PowerToDB(),
        normalize_spectrogram=True,
        normalize_waveform=None,
        mean=4.268, # calculated on AudioSet
        std=4.569 # calculated on AudioSet
    )

Here's a brief overview of the parameters used in the `PreprocessingConfig` class:

- `spectrogram_conversion`: An instance of the `Spectrogram` class from `torchaudio.transforms`. It is used to convert the audio waveform to a spectrogram and is configured with the following parameters:

    - `n_fft`: The size of the FFT, which also determines the size of the window used for the STFT. It is set to 1024.
    - `hop_length`: The number of samples between successive frames in the STFT. It is set to 320.
    - `power`: The exponent for the magnitude spectrogram, e.g., 1 for magnitude, 2 for power. It is set to 2.0.

- `resizer`: An instance of the `Resizer` class from `birdset.datamodule.components.resize`. It is used to resize the spectrogram and is configured with the following parameters:

    - `db_scale`: If set to True, the spectrogram is converted to dB scale. It is set to True.
    - `target_height`: The target height for the resized spectrogram. It is not set in this case.
    - `target_width`: The target width for the resized spectrogram. It is set to 1024.

- `dbscale_conversion`: An instance of the `PowerToDB` class from `birdset.datamodule.components.augmentations`. It is used to convert the spectrogram to a dB scale.

- `normalize_spectrogram`: If set to True, the spectrogram is normalized. It is set to True.

- `normalize_waveform`: If set to a value, the audio waveform is normalized. It is not set in this case.

- `mean`: The mean value used for normalization. It is set to 4.268, which is calculated on AudioSet.

- `std`: The standard deviation used for normalization. It is set to 4.569, which is calculated on AudioSet.


By accurately configuring these preprocessing parameters, you ensure that the input data to the model is standardized and optimized for the learning process, which is essential for achieving high performance.
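
The numbers in this configuration determine the raw spectrogram size before resizing. The following sketch works through the arithmetic for a 5-second clip at 32 kHz, assuming a centered, one-sided STFT (the torchaudio defaults). The dB conversion and standardization are shown as generic formulas; the clamp value `amin` is an assumption, and none of the helpers are BirdSet functions.

```python
import math


def spectrogram_shape(n_samples, n_fft, hop_length):
    """(frequency bins, time frames) for a centered, one-sided STFT."""
    return n_fft // 2 + 1, 1 + n_samples // hop_length


def power_to_db(power, amin=1e-10):
    """Convert a power value to decibels, clamping at `amin` to avoid log(0)."""
    return 10.0 * math.log10(max(power, amin))


def standardize(value, mean, std):
    """Shift and scale a value with a precomputed mean and standard deviation."""
    return (value - mean) / std


n_samples = 5 * 32000  # 5 seconds at 32 kHz
print(spectrogram_shape(n_samples, n_fft=1024, hop_length=320))  # (513, 501)
print(power_to_db(100.0))                                        # 20.0
print(standardize(4.268, mean=4.268, std=4.569))                 # 0.0
```

The resizer then brings the time axis from 501 frames to the configured `target_width` of 1024.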

#### 3.7 Initiating the BirdSetTransformsWrapper


In [11]:
from birdset.datamodule.components.transforms import BirdSetTransformsWrapper
transforms = BirdSetTransformsWrapper(
    task="multilabel",
    sampling_rate=32000,
    model_type="vision",
    spectrogram_augmentations=spectrogram_augmentations,
    waveform_augmentations=waveform_augmentation,
    decoding=decoding,
    feature_extractor=feature_extractor,
    max_length=5,
    nocall_sampler=nocall_sampler,
    preprocessing=preprocessing,
)

### 4. Configuration of Event Mappings

Event mapping plays a pivotal role in the data pipeline, serving as the bridge between raw dataset events and the structured input required by the model. This process ensures that each event in the dataset is accurately represented and can be effectively utilized during model training. By default, we use [bambird](https://www.sciencedirect.com/science/article/pii/S1574954122004022?casa_token=HEbcdB5MyRMAAAAA:saYbr1WNlJTs-kAZOtzMrNt5r1sN_69E7bMjfCJu2A4zlLLFoIt-5-Cht2Wryg59851H_PWgfHzw) for event mapping, which is implemented in the `XCEventMapping` class. Within the BirdSet framework, event mappings are configured as follows:

In [12]:
from birdset.datamodule.components import XCEventMapping
# Instantiate the event mapper
mapper = XCEventMapping(
            biggest_cluster=True,
            no_call=False,
        )

Key Parameters in Event Mapping:
- `biggest_cluster`: If set to `True`, the mapper focuses on the biggest cluster of events, which can be particularly useful for datasets with imbalanced event distributions.
- ``: Specifies the maximum number of events to consider. This can be used to limit the scope of the mapping, although it's usually already managed by the `DatasetConfig`.
- `no_call`: Indicates whether 'no-call' events should be included. In this configuration, it's set to `False` as the no-call samples are handled separately by the `nocall_sampler`.

Properly configuring the event mappings is essential for ensuring that the model receives accurately structured and meaningful data, which is a cornerstone for effective model training and robust performance.
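
The idea of event mapping can be pictured as turning one recording that contains several detected events into one training row per event. The sketch below illustrates this under stated assumptions: the field names `detected_events`, `event_cluster`, `filepath`, and `label` are hypothetical placeholders, and the handling of `biggest_cluster` is a guess at the intent described above rather than the `XCEventMapping` code.

```python
from collections import Counter


def explode_events(recording, biggest_cluster=True):
    """Turn one recording with several detected events into one row per event."""
    events = recording["detected_events"]  # list of (start, end) in seconds
    clusters = recording["event_cluster"]  # one cluster id per event
    if biggest_cluster and clusters:
        # Keep only the events that belong to the most frequent cluster.
        keep = Counter(clusters).most_common(1)[0][0]
        events = [e for e, c in zip(events, clusters) if c == keep]
    return [
        {"filepath": recording["filepath"], "label": recording["label"],
         "start": start, "end": end}
        for start, end in events
    ]


recording = {
    "filepath": "example.ogg",
    "label": 3,
    "detected_events": [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0), (6.0, 7.0)],
    "event_cluster": [0, 1, 1, 1],
}
for row in explode_events(recording):
    print(row)
```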

## Creating the BirdSet Datamodule

The BirdSet Datamodule plays a central role in the BirdSet data pipeline, offering streamlined handling and preprocessing of BirdSet datasets to ensure they are primed for model training. Let's delve into the setup process:

### Imports
First, we import the necessary modules. `BirdSetDataModule` is responsible for managing the BirdSet datasets, the `logging` module is used for logging information during the data processing steps, and `os` is used to resolve and create the data directory.

In [13]:
import logging 
import os

from birdset.datamodule.birdset_datamodule import BirdSetDataModule

### Creating Cache Directory
The cache directory is a dedicated space for storing processed data. Utilizing a cache directory can significantly expedite subsequent data loading operations by avoiding redundant data processing. Here's how to create and manage a cache directory effectively:

In [14]:
# Log the absolute path of the dataset
logging.info(f"Dataset path: <{os.path.abspath(dataset_config.data_dir)}>")

# Create the dataset directory if it does not exist
os.makedirs(dataset_config.data_dir, exist_ok=True)

This approach ensures:
- Organized data management: By maintaining a structured directory for your datasets, you facilitate easier access and management of your data assets.
- Efficient data loading: By caching processed data, subsequent loads are much faster, which is particularly beneficial when working with large datasets.

By carefully setting up the BirdSet Datamodule and managing your cache directory, you enhance the efficiency and reliability of your data pipeline, ensuring that your datasets are always ready for model training.

### Datamodule Initialization

The `BirdSetDataModule` class plays a pivotal role in orchestrating the data pipeline. It consolidates the dataset configuration, data loaders, transformations, and event mappings into a cohesive structure, ensuring a clean and manageable workflow. Here's how the BirdSetDataModule is initialized:

In [15]:
# Initialize the BirdSetDataModule
datamodule = BirdSetDataModule(
        dataset=dataset_config,
        loaders=loaders_config,
        transforms=transforms,
        mapper=mapper,
    )

Here's a brief overview of the parameters used in the `BirdSetDataModule` class:
- `dataset`: The configuration settings for the dataset. It defines how the data is structured and managed.
- `loaders`: Configuration settings for the data loaders, determining how data is batched and fed into the model.
- `transforms`: The set of transformations and augmentations applied to the data, ensuring that it's properly conditioned for the model.
- `mapper`: The event mapping configuration, essential for translating raw dataset events into a structured format that the model can interpret.

### Data Preparation

The data preparation stage is where the actual data processing takes place. This stage is critical in ensuring that the data is correctly preprocessed, structured, and ready for model training.

In [16]:
# Prepare the data for training
datamodule.prepare_data()

Map:   0%|          | 0/5197 [00:00<?, ? examples/s]

Map:   0%|          | 0/37176 [00:00<?, ? examples/s]

Processing labels: 100%|██████████| 21/21 [00:01<00:00, 11.28it/s]


Map:   0%|          | 0/17348 [00:00<?, ? examples/s]

Map:   0%|          | 0/12000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/13878 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/3470 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/12000 [00:00<?, ? examples/s]

The `prepare_data()` method encompasses various steps, including downloading the data (if not already locally available), applying the preprocessing steps defined in the transformations, and organizing the data into a format that is compatible with the model. It's a method that encapsulates the entire data preparation workflow, ensuring that the data is optimally prepared for the training process.
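
The progress bars above report 17348 mapped examples, followed by saved splits of 13878 and 3470 examples. These counts are consistent with an 80/20 train/validation split. The 0.2 fraction and the round-up rule used below are assumptions inferred from the numbers, not values confirmed in this section. A quick check:

```python
import math

total_examples = 17348  # size of the third "Map" progress bar in the output
val_fraction = 0.2      # assumed validation fraction (not stated in this section)

val_size = math.ceil(val_fraction * total_examples)  # assumed round-up rule
train_size = total_examples - val_size

print(train_size, val_size)
```

Under these assumptions the computed sizes match the two "Saving the dataset" counts, and the remaining 12000 examples correspond to the test split.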

This methodical approach to data preparation and modularization of the data pipeline components in the BirdSet framework contributes significantly to the efficiency, maintainability, and robustness of the machine learning lifecycle.

**Hint**: If you receive an error about a nonexistent Hugging Face dataset, make sure you are logged in to Hugging Face (see [Log in to Huggingface](#log-in-to-huggingface)).

### Datamodule Setup for Training Phase

Setting up the datamodule for the training phase is a crucial step in the BirdSet data pipeline. This setup involves initializing the training and validation dataloaders, which play a vital role in supplying the model with data during the training process. The setup is performed using the `setup(stage="fit")` method:

In [17]:
# Setup the datamodule for the training phase
datamodule.setup(stage="fit")

# Retrieve the training and validation dataloaders
train_loader = datamodule.train_dataloader()
validation_loader = datamodule.val_dataloader()

In [18]:
# Fetch a sample batch from the training dataloader
batch = next(iter(train_loader))

# Inspect the keys and shapes of the data in the batch
batch.keys(), batch["input_values"].shape, batch["labels"].shape

(dict_keys(['input_values', 'labels']),
 torch.Size([32, 1, 128, 1024]),
 torch.Size([32, 21]))

This code snippet demonstrates:
- The initialization of the training phase.
- The retrieval of training and validation dataloaders.
- Fetching and inspecting a sample batch from the training dataloader.
- Inspecting the shapes: `input_values` has shape `[32, 1, 128, 1024]` (batch size, number of channels, and the two dimensions of each input), and `labels` has shape `[32, 21]` (batch size and number of classes).
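
To make the shapes concrete, the sketch below unpacks them as plain tuples. Reading the last two input dimensions as frequency bins and time frames of a spectrogram is an assumption, and the 4-byte element size assumes `float32` inputs.

```python
input_shape = (32, 1, 128, 1024)  # copied from the output above
label_shape = (32, 21)

# Axis names are assumptions made for this illustration.
batch_size, n_channels, n_freq_bins, n_frames = input_shape
label_batch_size, n_classes = label_shape

# Both tensors must agree on the batch dimension.
assert batch_size == label_batch_size

n_values = batch_size * n_channels * n_freq_bins * n_frames
size_mib = n_values * 4 / 2**20  # assumes 4-byte float32 elements

print(f"{batch_size} samples, {n_classes} classes, {size_mib:.0f} MiB per input batch")
```

A rough size estimate like this can help when choosing a batch size that fits into memory.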

### Datamodule Setup for Test Phase

Similarly, the datamodule is set up for the test phase to ensure that the model can be effectively evaluated using the test data:

In [19]:
# Setup the datamodule for the test phase
datamodule.setup(stage="test")

# Retrieve the test dataloader
test_loader = datamodule.test_dataloader()

The `setup(stage="test")` method prepares the datamodule specifically for the test phase, and `test_dataloader()` retrieves the test dataloader, which is instrumental for batching and loading the test data efficiently during the model evaluation process.

By methodically setting up the datamodule for both training and test phases, you ensure that the model has access to well-prepared data, which is essential for accurate training, validation, and testing.
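
The stage-based pattern can be illustrated with a toy class. This is a simplified stand-in written for illustration, not BirdSet's or Lightning's implementation: `ToyDataModule` and its attributes are hypothetical, and plain lists take the place of real dataloaders.

```python
class ToyDataModule:
    """Hypothetical stand-in that mimics stage-dependent setup."""

    def __init__(self, train_items, test_items, val_fraction=0.2):
        self._train_items = list(train_items)
        self._test_items = list(test_items)
        self._val_fraction = val_fraction
        self.train_data = None
        self.val_data = None
        self.test_data = None

    def setup(self, stage):
        if stage == "fit":
            # Hold out the last part of the training items for validation.
            n_val = int(len(self._train_items) * self._val_fraction)
            split = len(self._train_items) - n_val
            self.train_data = self._train_items[:split]
            self.val_data = self._train_items[split:]
        elif stage == "test":
            self.test_data = self._test_items
        else:
            raise ValueError(f"Unknown stage: {stage!r}")


dm = ToyDataModule(train_items=range(10), test_items=range(100, 105))

dm.setup(stage="fit")  # only train/val data exist after this call
fit_only = (len(dm.train_data), len(dm.val_data), dm.test_data)

dm.setup(stage="test")  # test data becomes available
n_test = len(dm.test_data)

print(fit_only, n_test)
```

Separating the stages means the test data is only touched when it is needed for evaluation.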

### Usage in TensorFlow

Utilizing the BirdSet datamodule in a TensorFlow environment involves integrating the prepared dataloaders with TensorFlow's training and evaluation workflows. This integration ensures that the data is fed into TensorFlow models efficiently and in a format that TensorFlow can process. Here's how you can set up the BirdSet datamodule for TensorFlow compatibility:

In [20]:
# Setup the datamodule for the training phase
datamodule.setup(stage="fit")

# Retrieve the training and validation datasets
train_loader = datamodule.train_dataset
validation_loader = datamodule.val_dataset

# Setup the datamodule for the test phase
datamodule.setup(stage="test")

# Retrieve the test dataset
test_loader = datamodule.test_dataset

#### Key Considerations:
- `train_dataset`, `val_dataset`, and `test_dataset` are the datasets prepared by the BirdSet datamodule, ready to be used in TensorFlow's training and evaluation routines.
- It's important to ensure that these datasets are in a format compatible with TensorFlow. This might involve additional steps such as converting the data to `tf.data.Dataset` objects or applying necessary transformations to align with TensorFlow's data handling mechanisms.
- More information on this integration process can be found in [HuggingFace's documentation](https://huggingface.co/docs/datasets/use_with_tensorflow).

By following these steps, you can leverage the robust data preprocessing and management capabilities of the BirdSet datamodule within a TensorFlow environment, facilitating an efficient and streamlined model training and evaluation process.

### Mapping of Labels to eBird Codes

The labels in the BirdSet datasets are stored as integers. We can map these numeric labels to their corresponding eBird codes as defined in the `dataset_info.json` file, which is created during data preprocessing in the folder where the preprocessed data is stored (the `data_dir` of the `DatasetConfig`). The `get_label_to_category_mapping_from_metadata` function does this by parsing the JSON file and creating a dictionary that maps each numeric label to its corresponding eBird code in string format.

In [21]:
import json
from typing import Dict

def get_label_to_category_mapping_from_metadata(
    file_path: str, task: str
) -> Dict[int, str]:
    """
    Reads a JSON file and extracts the mapping of labels to eBird codes.

    The function expects the JSON structure to be in a specific format. For "multiclass",
    the list of names is located under the keys 'features' -> 'labels' -> 'names'; for
    "multilabel", it is under 'features' -> 'labels' -> 'feature' -> 'names'.
    The index in the list corresponds to the label, and the value at that index is the eBird code.

    Args:
    - file_path (str): The path to the JSON file containing the label to eBird code mapping.
    - task (str): The type of task for which to get the mapping. Expected values are "multiclass" or "multilabel".

    Returns:
    - Dict[int, str]: A dictionary where each key is a label (integer) and the corresponding value is the eBird code.

    Raises:
    - FileNotFoundError: If the file at `file_path` does not exist.
    - json.JSONDecodeError: If the file is not a valid JSON.
    - KeyError: If the expected keys ('features', 'labels', 'names') are not found in the JSON structure.
    - NotImplementedError: If `task` is neither "multiclass" nor "multilabel".
    """

    # Open the file and read the JSON data
    with open(file_path, "r") as file:
        dataset_info = json.load(file)

    # Extract the list of eBird codes from the loaded JSON structure.
    # Note: This assumes a specific structure of the JSON data.
    # If the structure is different, this line will raise a KeyError.
    if task == "multiclass":
        ebird_codes_list = dataset_info["features"]["labels"]["names"]
    elif task == "multilabel":
        ebird_codes_list = dataset_info["features"]["labels"]["feature"]["names"]
    else:
        # If the task is not recognized (not multiclass or multilabel), raise an error.
        raise NotImplementedError(
            f"Only the multiclass and multilabel tasks are implemented, not task {task}."
        )

    # Create a dictionary mapping each label (index) to the corresponding eBird code.
    mapping = dict(enumerate(ebird_codes_list))

    return mapping
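
As a usage sketch, such a mapping can turn a multi-hot label vector (like one row of the `labels` tensor in a multilabel batch) back into eBird codes. The example below is self-contained and does not read a real `dataset_info.json`: the three codes are illustrative placeholders, and `decode_multi_hot` is a hypothetical helper that is not part of BirdSet.

```python
from typing import Dict, List, Sequence


def decode_multi_hot(
    label_vector: Sequence[float], mapping: Dict[int, str], threshold: float = 0.5
) -> List[str]:
    """Return the eBird codes whose entries exceed the threshold."""
    return [mapping[i] for i, value in enumerate(label_vector) if value > threshold]


# Toy mapping with placeholder codes; a real one comes from dataset_info.json.
mapping = dict(enumerate(["amerob", "blujay", "norcar"]))

codes = decode_multi_hot([1.0, 0.0, 1.0], mapping)
print(codes)
```

The same helper works on model outputs after a sigmoid, where the threshold decides which species count as present.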