# GADME Data Pipeline Tutorial

This Jupyter notebook provides a comprehensive guide to setting up and configuring a data pipeline tailored for bird classification in audio files. The tutorial is structured to walk you through each component of the pipeline, ensuring a clear understanding of its functionality and configuration. Whether you are processing raw audio data or spectrograms, this notebook aims to provide you with the necessary knowledge to efficiently set up your data pipeline.

## Installation

### Prerequisites
Before installing the GADME pipeline, make sure your computing environment meets the following prerequisite:
- **Python**: You should have Python version 3.10 or higher installed on your system.

### Installation Steps
The GADME pipeline can be installed in one of two ways: via Conda and Pip, or via Poetry. Select the method you prefer and follow the corresponding steps below.

#### Using Conda and Pip

1. **Create a Conda Environment**: Begin by setting up a dedicated environment for GADME. This is a best practice to manage dependencies and avoid potential conflicts with other packages in your system.

   ```bash
   conda create -n gadme python=3.10
   ```

   After the environment is successfully created, activate it:

   ```bash
   conda activate gadme
   ```

2. **Install GADME**: Proceed with cloning the GADME repository and installing the package in editable mode. This approach is beneficial as it allows any modifications you make to the GADME code to be reflected immediately without the need for reinstallation.

   ```bash
   git clone https://github.com/DBD-research-group/GADME.git
   cd GADME
   pip install -e .
   ```

#### Using Poetry

1. **Clone the Repository**: Start with cloning the GADME repository to your local machine and navigate to the cloned directory.

   ```bash
   git clone https://github.com/DBD-research-group/GADME.git
   cd GADME
   ```

2. **Configure Poetry**: Prepare the project for Poetry by renaming the `pyproject.raphael` file to `pyproject.toml`.

   ```bash
   mv pyproject.raphael pyproject.toml
   ```

3. **Install Dependencies and Activate Environment**: Install all the necessary dependencies with Poetry and then activate the Poetry shell environment.

   ```bash
   poetry install
   poetry shell
   ```
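After either installation route, a quick sanity check of the environment can save time later. The following is a minimal sketch using only the standard library; the helper names `find_missing_modules` and `python_is_supported` are made up for this illustration, and the module list simply mirrors the third-party imports used later in this notebook.

```python
import importlib.util
import sys

# Import names of the third-party packages used later in this notebook.
REQUIRED_MODULES = ["hydra", "torch_audiomentations", "torchaudio", "torchvision"]


def find_missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 10)):
    """Check the running interpreter against the Python prerequisite."""
    return sys.version_info[:2] >= minimum


print("Python version OK:", python_is_supported())
print("Missing modules:", find_missing_modules(REQUIRED_MODULES))
```

An empty list of missing modules suggests the installation succeeded; it does not verify package versions.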

## Log in to Huggingface

Our datasets are shared via HuggingFace Datasets in our [HuggingFace GADME repository](https://huggingface.co/datasets/DBD-research-group/gadme_v1). HuggingFace is a central hub for sharing datasets and models for machine learning and data science projects. Accessing private datasets hosted on HuggingFace requires authentication. To log in to HuggingFace:

1. **Install HuggingFace CLI**: If you haven't already, you need to install the HuggingFace CLI (Command Line Interface). This tool enables you to interact with HuggingFace services directly from your terminal. You can install it using pip:

   ```bash
   pip install huggingface_hub
   ```

2. **Login via CLI**: Once the HuggingFace CLI is installed, you can log in to your HuggingFace account directly from your terminal. This step is essential for accessing private datasets or contributing to the HuggingFace community. Use the following command:

   ```bash
   huggingface-cli login
   ```

   After executing this command, you'll be prompted to enter your HuggingFace credentials ([User Access Token](https://huggingface.co/docs/hub/security-tokens)). Once authenticated, your credentials will be saved locally, allowing seamless access to HuggingFace resources.
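To check whether a token is already available before running the pipeline, something like the following sketch may help. It relies on assumptions about the default behaviour of `huggingface_hub`: that a token can be supplied through the `HF_TOKEN` (or older `HUGGING_FACE_HUB_TOKEN`) environment variable, and that `huggingface-cli login` stores it in a `token` file under `HF_HOME`, which defaults to `~/.cache/huggingface`. The helper name `find_hf_token_source` is hypothetical, and the function never prints the token itself.

```python
import os
from pathlib import Path


def find_hf_token_source(env=None, home=None):
    """Describe where a HuggingFace token was found, or return None."""
    env = os.environ if env is None else env
    # Environment variables are checked first (assumed variable names).
    for variable in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        if env.get(variable):
            return f"environment variable {variable}"
    # Assumed default cache layout: <HF_HOME>/token.
    home = Path.home() if home is None else Path(home)
    hf_home = Path(env.get("HF_HOME", home / ".cache" / "huggingface"))
    token_file = hf_home / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return f"token file {token_file}"
    return None


# Example call:
# print(find_hf_token_source() or "No token found; run `huggingface-cli login`.")
```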

## Configuration of GADME Data Pipeline

The GADME Data Pipeline offers a robust and flexible configuration system, primarily designed to streamline the process of setting up your data processing environment. While this notebook presents hardcoded configurations for simplicity, it's important to note that these settings can be dynamically managed using advanced configuration tools like Hydra. Hydra is a powerful utility that enables flexible and scalable configuration management, allowing you to adapt the pipeline settings to various environments or use cases seamlessly. For an in-depth understanding of Hydra, consider visiting [Hydra's official documentation](https://hydra.cc/docs/intro).

### Necessary Imports

Configuring the GADME Data Pipeline requires a set of specific Python modules and classes. Below is a list of essential imports, each playing a crucial role in the pipeline setup. Understanding the purpose and functionality of these components is key to effectively customizing the data pipeline according to your specific requirements.

In [None]:
import hydra
import torch_audiomentations
import torchaudio
import torchvision

import src
from src.datamodule.base_datamodule import DatasetConfig, LoaderConfig, LoadersConfig
from src.datamodule.components.transforms import (
    GADMETransformsWrapper,
    PreprocessingConfig,
)

In this section:
- `hydra` is utilized for managing and orchestrating the configuration setup, enabling you to easily modify and extend your pipeline configuration.
- `torch_audiomentations` provides a collection of audio augmentations, useful for enhancing your dataset and improving the robustness of your models.
- `torchaudio` and `torchvision` are essential for processing audio and image data, respectively, offering a wide range of tools for data manipulation and transformation.
- The `src` module contains the core functionalities of the GADME project, with specific classes like `DatasetConfig`, `LoaderConfig`, `LoadersConfig`, `GADMETransformsWrapper`, and `PreprocessingConfig` designed to configure different aspects of the data loading, preprocessing, and transformation pipeline.

By carefully configuring these components, you can tailor the GADME Data Pipeline to your specific data processing and analysis needs, ensuring an efficient and effective workflow.

### 1. Dataset Configuration

Configuring the dataset is a pivotal step in the GADME data pipeline, as it directly impacts how your data is handled, processed, and prepared for training. The `DatasetConfig` class provides a structured way to specify numerous parameters and settings, ensuring that the dataset aligns with the specific requirements of your project. Below is an illustration of how you might configure a dataset for use within the GADME pipeline, followed by a detailed explanation of each parameter.

In [None]:
config={
    "dataset_path": "../../data_gadme",
    "dataset_name": "high_sierras",
    "hf_path": "DBD-research-group/gadme_v1",
    "hf_name": "high_sierras",
    "rand_seed": 42,
    "num_classes": 22,
    "num_workers": 1,
    "validation_size": 0.2,
    "task": "multilabel",
    "sample_rate": 32000,
    "class_weights_loss": None,
    "class_weights_sampler": True,
    "class_limit": 500,
    "": 5,
} 

In [None]:
dataset_config = DatasetConfig(
        data_dir=config["dataset_path"],
        dataset_name=config["dataset_name"],
        hf_path=config["hf_path"],
        hf_name=config["hf_name"],
        seed=config["rand_seed"],
        n_classes=config["num_classes"],
        n_workers=config["num_workers"],
        val_split=config["validation_size"],
        task=config["task"],
        subset=None,
        sampling_rate=config["sample_rate"],
        #class_weights_loss=config["class_weights_loss"], #TODO: remove # after bug fix
        #class_weights_sampler=config["class_weights_sampler"], #TODO: remove # after bug fix
        classlimit=config["class_limit"],
        eventlimit=config[""],
    )

Here's a brief overview of the parameters used in the `DatasetConfig` class:

- `data_dir`: Specifies the directory where the dataset files are stored. **Important**: The dataset uses a lot of disk space, so make sure you have enough storage available.
- `dataset_name`: The name assigned to the dataset.
- `hf_path`: The path to the dataset stored on HuggingFace.
- `hf_name`: The name of the dataset on HuggingFace.
- `seed`: A seed value for ensuring reproducibility across runs.
- `n_classes`: The total number of distinct classes in the dataset.
- `n_workers`: The number of worker processes used for data loading.
- `val_split`: The proportion of the dataset reserved for validation.
- `task`: Defines the type of task (e.g., 'multilabel' or 'multiclass').
- `sampling_rate`: The sampling rate for audio data processing.
- `class_weights_loss` (Deprecated): Previously used for applying class weights in loss calculation.
- `class_weights_sampler`: Indicates whether to use class weights in the sampler for handling imbalanced datasets.
- `classlimit`: The maximum number of samples per class.
- `eventlimit`: Defines the maximum number of audio events processed per audio file, capping the quantity to ensure balance across files.

#### Important Note:
- The `class_weights_loss` parameter is currently deprecated and only implemented for focal loss. It's recommended to use `class_weights_sampler` instead, as it has been shown to yield favorable results, as evidenced by the winner of the [BirdCLEF 2023](https://www.kaggle.com/competitions/birdclef-2023) challenge.

Selecting appropriate values for these parameters is crucial, as they directly influence the efficiency of the training process and the overall performance of the model.
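To give an intuition for what class weights in a sampler do, here is a minimal sketch of a common inverse-frequency scheme for the single-label case. This is an illustration, not GADME's actual weighting, which is not shown in this notebook; the function name `inverse_frequency_weights` is hypothetical. Per-sample weights of this kind are what a weighted sampler such as PyTorch's `WeightedRandomSampler` consumes.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Give every sample the weight 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy example: class "a" is four times as frequent as class "b".
labels = ["a", "a", "a", "a", "b"]
weights = inverse_frequency_weights(labels)
print(weights)  # [0.25, 0.25, 0.25, 0.25, 1.0]
```

With these weights, the total weight of each class is 1, so a sampler drawing proportionally to the weights picks both classes equally often despite the imbalance.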

### 2. Dataloader Configuration

Once the dataset is configured, the next crucial step is setting up the data loaders. Data loaders are pivotal in efficiently feeding data into the model during both the training and testing phases. They manage the data flow, ensuring that the model is supplied with a consistent stream of data batches. In this section, we'll use the `LoaderConfig` and `LoadersConfig` classes to define different configurations for the training and testing data loaders.

In [None]:
# Configuration for batch sizes
config["train_batch_size"] = 32
config["test_batch_size"] = 64

In [None]:
# Configuration for the training data loader
train_loader_config = LoaderConfig(
    batch_size=config["train_batch_size"],
    shuffle=True,
    num_workers=config["num_workers"],
    pin_memory=False,
    drop_last=True,
    persistent_workers=False,
    #prefetch_factor=None,
)

# Configuration for the testing data loader
test_loader_config = LoaderConfig(
    batch_size=config["test_batch_size"],
    shuffle=False,
    num_workers=config["num_workers"],
    pin_memory=False,
    drop_last=False,
    persistent_workers=False,
    #prefetch_factor=None,
)

# Aggregating the loader configurations
loaders_config = LoadersConfig(
    train=train_loader_config,
    valid=test_loader_config,
    test=test_loader_config,
)

Here's a brief overview of the parameters used in the `LoaderConfig` class:

- `batch_size`: Specifies the number of samples contained in each batch. This is a crucial parameter as it impacts memory utilization and model performance.
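- `shuffle`: Determines whether the data is reshuffled at the beginning of each epoch. Shuffling is typically enabled for training data so that batches differ between epochs and the model does not pick up on the order of the samples; it is disabled for validation and test data to keep evaluation reproducible.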
- `num_workers`: Sets the number of subprocesses to be used for data loading. More workers can speed up the data loading process but also increase memory consumption.
- `pin_memory`: When set to `True`, enables the DataLoader to copy Tensors into CUDA pinned memory before returning them. This can lead to faster data transfer to CUDA-enabled GPUs.
- `drop_last`: Determines whether to drop the last incomplete batch. Setting this to `True` is useful when the total size of the dataset is not divisible by the batch size.
- `persistent_workers`: Indicates whether the data loader should keep the workers alive for the next epoch. This can improve performance at the cost of memory.
- `prefetch_factor`: Defines the number of batches loaded in advance by each worker. This parameter is commented out here and can be adjusted based on specific requirements.

Proper configuration of the data loaders is essential as it directly influences the efficiency of the training process, hardware resource utilization, and ultimately, the performance of the model.
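The interplay of `batch_size` and `drop_last` can be made concrete with a little arithmetic. The sketch below counts the batches per epoch for the two loader configurations above; the dataset size of 1000 samples is a made-up number for illustration, and `batches_per_epoch` is a hypothetical helper.

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_last):
    """Number of batches a loader yields for one pass over n_samples items."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


n_samples = 1000  # hypothetical dataset size, for illustration only
print(batches_per_epoch(n_samples, batch_size=32, drop_last=True))   # 31
print(batches_per_epoch(n_samples, batch_size=64, drop_last=False))  # 16
```

With `drop_last=True` the final 8 samples of the epoch are skipped in this example, which keeps every training batch at exactly 32 samples.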

### 3. Configuration of Data Preprocessing

Data preprocessing is a fundamental step in the GADME data pipeline, ensuring that the raw data is adequately conditioned and transformed, making it suitable for model consumption. The `PreprocessingConfig` class allows for a detailed specification of various preprocessing parameters, each carefully selected to meet the unique demands of your dataset and model. Here's how you can configure the data preprocessing in the GADME pipeline:

In [None]:
# Configuration for preprocessing parameters
config["n_fft"] = 1024
config["hop_length"] = 320
config["n_mels"] = 128
config["db_scale"] = True

In [None]:
# Creating the preprocessing configuration
preprocessing_config = PreprocessingConfig(
        use_spectrogram=True,
        n_fft=config["n_fft"],
        hop_length=config["hop_length"],
        n_mels=config["n_mels"],
        db_scale=config["db_scale"],
        target_height=None,
        target_width=None,
        normalize_spectrogram=True,
        normalize_waveform=None,
    )

Here's a brief overview of the parameters used in the `PreprocessingConfig` class:

- `use_spectrogram`: Determines whether the audio data should be converted into a spectrogram, a visual representation of the spectrum of frequencies in the sound.
- `n_fft`: The size of the FFT (Fast Fourier Transform) window, impacting the frequency resolution of the spectrogram.
- `hop_length`: The number of samples between successive frames in the spectrogram. A smaller hop length leads to a higher time resolution.
- `n_mels`: The number of Mel bands to generate. This parameter is crucial for the Mel spectrogram and impacts the spectral resolution.
- `db_scale`: Indicates whether to convert the magnitude of the spectrogram to the decibel scale, which compresses the dynamic range so that quiet components of the spectrum remain distinguishable.
- `target_height` and `target_width`: Specify the dimensions to which the spectrogram images will be resized. This can be important for maintaining consistency in input size for certain neural networks.
- `normalize_spectrogram`: Whether to apply normalization to the spectrogram. Normalization can help in stabilizing the training process.
- `normalize_waveform`: Determines whether to apply normalization to the raw waveform data.

By accurately configuring these preprocessing parameters, you ensure that the input data to the model is standardized and optimized for the learning process, which is essential for achieving high performance.
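To see what these values mean in practice, the following sketch derives the time and frequency resolution and the expected spectrogram size for a 5-second clip at 32 kHz. It assumes a centered STFT, for which the number of frames is `1 + n_samples // hop_length`; whether GADME uses exactly this convention is an assumption. The helper `spectrogram_geometry` is hypothetical.

```python
def spectrogram_geometry(duration_s, sample_rate, n_fft, hop_length, n_mels):
    """Derive basic spectrogram dimensions from the preprocessing parameters."""
    n_samples = int(duration_s * sample_rate)
    return {
        "frames": 1 + n_samples // hop_length,  # assumes a centered STFT
        "mel_bins": n_mels,
        "hop_ms": 1000 * hop_length / sample_rate,  # time step between frames
        "fft_bin_hz": sample_rate / n_fft,  # width of one linear FFT bin
    }


geometry = spectrogram_geometry(5.0, 32000, n_fft=1024, hop_length=320, n_mels=128)
print(geometry)
# {'frames': 501, 'mel_bins': 128, 'hop_ms': 10.0, 'fft_bin_hz': 31.25}
```

Under these assumptions, each clip becomes a 128 x 501 Mel spectrogram with one frame every 10 ms.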

### 4. Configuration of Transformations

Transformations play a critical role in the data preparation process within the GADME data pipeline. These operations, applied before the data is fed into the model, encompass a range of augmentation techniques designed to regularize the model and prevent overfitting. Properly configured transformations not only enhance the diversity and quality of the training data but also help the model generalize better to new, unseen data.

In the GADME framework, transformations are meticulously orchestrated through the `GADMETransformsWrapper` class. This wrapper acts as a comprehensive interface for defining and applying various transformations and augmentations to the data. It ensures that the data is consistently and effectively transformed, aligning with the specific requirements of the model and the inherent characteristics of the dataset.

By configuring the `transforms_wrapper` using the `GADMETransformsWrapper` class, you gain precise control over how the data is manipulated during the preprocessing phase. This level of control is pivotal in tailoring the data pipeline to the nuances of your specific use case, ultimately leading to more robust and accurate model performance.

#### 4.1 Augmentations

Augmentations are powerful techniques applied to the data to introduce diversity and variability. They are particularly useful in audio and signal processing to enhance the robustness of models against variations in input data. In the GADME framework, you can configure waveform and spectrogram augmentations as follows:

**Waveform Augmentations**

These augmentations are applied directly to the audio waveform. In GADME, you can use any waveform augmentation technique as long as it can be composed by the [torch-audiomentations Compose function](https://github.com/asteroid-team/torch-audiomentations/blob/main/torch_audiomentations/core/composition.py). You can add waveform augmentations as follows:

In [None]:
config["waveform_augmentations"] = {
    "colored_noise": {
        "_target_": torch_audiomentations.AddColoredNoise,
        "p": 0.2,
        "min_snr_in_db": 3.0,
        "max_snr_in_db": 30.0,
        "min_f_decay": -2.0,
        "max_f_decay": 2.0
    },
    "pitch_shift": {
        "_target_": torch_audiomentations.PitchShift,
        "p": 0.2,
        "sample_rate": 32000,
        "min_transpose_semitones": -4.0,
        "max_transpose_semitones": 4.0,
    }
}

In this example:
- `colored_noise`: Adds colored noise to the audio signal to simulate various real-world noise conditions.
- `pitch_shift`: Alters the pitch of the audio signal, which is useful for simulating different tonal variations.

**Spectrogram Augmentations**

These augmentations are applied to the spectrogram representation of the audio. In GADME, you can use any spectrogram augmentation technique as long as it can be composed by the [torchvision Compose function](https://pytorch.org/vision/stable/generated/torchvision.transforms.Compose.html). You can add spectrogram augmentations as follows:

In [None]:
config["spectrogram_augmentations"] = {
    "time_masking": {
        "_target_": torchvision.transforms.RandomApply,
        "p": 0.3,
        "transforms": 
        [{
            "_target_": torchaudio.transforms.TimeMasking,  # "transforms" expects a list of transform configs
            "time_mask_param": 100,
            "iid_masks": True,
        }],
    },
    "frequency_masking": {
        "_target_": torchvision.transforms.RandomApply,
        "p": 0.5,
        "transforms":
        [{
            "_target_": torchaudio.transforms.FrequencyMasking,  # "transforms" expects a list of transform configs
            "freq_mask_param": 100,
            "iid_masks": True,
        }]
    }
}

In this example:
- `time_masking`: Randomly masks a sequence of consecutive time steps in the spectrogram, helping the model become invariant to small temporal shifts.
- `frequency_masking`: Randomly masks a sequence of consecutive frequency channels, encouraging the model to be robust against frequency variations.

Configuring the augmentations correctly is crucial as they directly influence the model's ability to learn from a diverse set of data representations, ultimately leading to better generalization and performance.
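When choosing mask parameters, it helps to relate them to the size of the spectrogram. The sketch below computes the largest fraction of each axis that a single mask can hide, assuming a 128 x 501 spectrogram (5-second clip, hop length 320, centered STFT) and assuming that the mask length is drawn from the range between 0 and the mask parameter. The helper `max_masked_fraction` is hypothetical.

```python
def max_masked_fraction(mask_param, axis_length):
    """Upper bound on the fraction of an axis hidden by a single mask."""
    return min(mask_param, axis_length) / axis_length


n_frames = 1 + (5 * 32000) // 320  # 501 frames for a 5 s clip (assumption)
n_mels = 128

print(round(max_masked_fraction(100, n_frames), 3))  # 0.2
print(round(max_masked_fraction(100, n_mels), 3))    # 0.781
```

Under these assumptions, a time mask covers at most about a fifth of the clip, while a frequency mask with `freq_mask_param=100` can hide up to roughly 78% of the Mel bins, so that value may be worth reviewing for your data.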

#### 4.2 Decoding

Decoding is the process that converts the (compressed) data into a format that can be directly used by the model. In the GADME framework, we use the `EventDecoding` class by default. It is designed for preprocessing audio files in the context of event detection tasks. Its primary function is to ensure that each audio segment fed into the model is not only in the correct format, but also conditioned to improve the model's ability to identify and understand different audio events. Decoding is configured as follows:

In [None]:
config["decoding"] = {
    "_target_": src.datamodule.components.EventDecoding,
    "min_len": 1.0,
    "max_len": 5.0,
    "sampling_rate": 32000,
    "extension_time": 8,
    "extracted_interval": 5,
}

Key Parameters:
- `_target_`: Specifies the EventDecoding component to be used in the data processing pipeline.
- `min_len` and `max_len`: Determine the minimum and maximum duration (in seconds) of the audio segments after decoding. These constraints ensure that each processed audio segment is of a suitable length for the model.
- `sampling_rate`: Defines the sampling rate to which the audio should be resampled. This standardizes the input data's sampling rate, making it consistent for model processing.
- `extension_time`: Refers to the time (in seconds) by which the duration of an audio event is extended. This parameter is crucial for ensuring that shorter audio events are sufficiently long for the model to process effectively.
- `extracted_interval`: Denotes the fixed duration (in seconds) of the audio segment that is randomly extracted from the extended audio event.

Decoding is performed on the fly, ensuring that the data fed into the model is always in the correct format, even when the source data comes in various encoded forms.
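The internals of `EventDecoding` are not shown in this notebook. As a rough sketch of how `extension_time` and `extracted_interval` could interact, the following code extends an event symmetrically to 8 seconds, keeps it inside the file, and then draws a random 5-second window. The symmetric extension, the uniform draw, and the function name `pick_window` are all assumptions made for illustration.

```python
import random


def pick_window(event_start, event_end, file_duration,
                extension_time=8.0, extracted_interval=5.0, rng=random):
    """Extend an event to `extension_time` seconds and draw a random window from it."""
    # Grow the event symmetrically until it spans extension_time seconds.
    missing = max(0.0, extension_time - (event_end - event_start))
    start = event_start - missing / 2
    end = event_end + missing / 2
    # Shift the extended region back inside the file, clipping if the file is short.
    if start < 0.0:
        start, end = 0.0, end - start
    if end > file_duration:
        start, end = start - (end - file_duration), file_duration
    start = max(0.0, start)
    # Draw a window of extracted_interval seconds from the extended region.
    latest_start = max(start, end - extracted_interval)
    window_start = rng.uniform(start, latest_start)
    return window_start, min(window_start + extracted_interval, end)


# A 1.5 s event in a 60 s recording yields a random 5 s window around it.
print(pick_window(10.0, 11.5, file_duration=60.0))
```

For the example call, the extended region is 6.75 s to 14.75 s, so the returned window always starts between 6.75 s and 9.75 s.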

#### 4.3 Feature Extraction

Feature extraction is a pivotal step in transforming raw data into a structured format that is suitable for model training. The `DefaultFeatureExtractor` in GADME is tailored for processing waveform data, providing a range of functionalities to prepare the data for model consumption.

In [None]:
config["feature_extractor"] = {
    "_target_": src.datamodule.components.DefaultFeatureExtractor,
    "feature_size": 1,
    "sampling_rate": 32000,
    "padding_value": 0.0,
    "return_attention_mask": False,
}

Key Parameters:
- `_target_`: Specifies the feature extractor component used in the pipeline.
- `feature_size`: Determines the size of the extracted features.
- `sampling_rate`: The sampling rate at which the audio data should be processed.
- `padding_value`: The value used for padding shorter sequences to a consistent length.
- `return_attention_mask`: Indicates whether an attention mask should be returned along with the processed features.

This component is crucial for ensuring that the input data to the model is in a consistent and processable format, catering to models that require structured input in the form of PyTorch tensors.
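The roles of `padding_value` and `return_attention_mask` can be illustrated with plain Python lists. This is a simplified sketch of right-padding, not the implementation of `DefaultFeatureExtractor`; the function `pad_batch` is hypothetical, and the key `input_values` is borrowed from the batch inspected later in this notebook.

```python
def pad_batch(waveforms, padding_value=0.0, return_attention_mask=False):
    """Right-pad variable-length waveforms to the length of the longest one."""
    target = max(len(w) for w in waveforms)
    padded = [list(w) + [padding_value] * (target - len(w)) for w in waveforms]
    result = {"input_values": padded}
    if return_attention_mask:
        # 1 marks real samples, 0 marks padding.
        result["attention_mask"] = [
            [1] * len(w) + [0] * (target - len(w)) for w in waveforms
        ]
    return result


batch = pad_batch([[0.1, 0.2, 0.3], [0.5]], return_attention_mask=True)
print(batch["input_values"])    # [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```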

#### 4.4 No-call Sampler

The no-call sampler is a sophisticated component in the GADME framework, designed to augment the dataset with 'no-call' samples. These samples are crucial for scenarios where the model needs to learn to recognize the absence of certain events.

In the following configuration we don't use the nocall_sampler for simplicity:

In [None]:
# Configuration without the no-call sampler
config["nocall_sampler"] = None

However, you can add a nocall_sampler as shown in the following code snippet. Here, `directory` is the path to the dataset you want to use to create no-call samples (for example, [DCASE18](https://dcase.community/challenge2018/index)).

```python
# Configuration to enable the no-call sampler
config["nocall_sampler"] = {
    "_target_": src.datamodule.components.augmentations.NoCallMixer,
    "directory": "insert path here",
    "p": 0.075,
    "sampling_rate": 32000,
    "length": 5,
    "n_classes": 22,
}
```

Key Parameters:
- `_target_`: Specifies the no-call sampler component in the pipeline.
- `directory`: The directory containing the no-call data. It's essential to ensure that this path is correctly set to the location of your no-call samples.
- `p`: The probability of a sample being replaced with a no-call sample. This parameter allows you to control the frequency of no-call samples in your dataset.
- `sampling_rate`, `length`, and `n_classes`: These parameters should align with the rest of your dataset and model configuration.

The no-call sampler effectively increases the diversity of the dataset by adding samples that represent the absence of specific events or classes. This helps in creating a more balanced dataset and enhancing the model's ability to generalize well to various input conditions.
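The effect of `p` can be sketched in a few lines. The code below replaces a sample with background audio and an all-zero label vector with probability `p`. It is an illustration only: the function `maybe_replace_with_nocall` is hypothetical, and the all-zero multilabel target for no-call samples is an assumption rather than a confirmed detail of `NoCallMixer`.

```python
import random


def maybe_replace_with_nocall(sample, nocall_pool, p, n_classes, rng=random):
    """With probability p, swap a sample for background audio and an all-zero label."""
    if rng.random() < p:
        return {"input_values": rng.choice(nocall_pool), "labels": [0] * n_classes}
    return sample


sample = {"input_values": [0.3, -0.1], "labels": [0, 1, 0]}
pool = [[0.0, 0.0]]  # stand-in for audio clips without bird calls

print(maybe_replace_with_nocall(sample, pool, p=1.0, n_classes=3)["labels"])  # [0, 0, 0]
print(maybe_replace_with_nocall(sample, pool, p=0.0, n_classes=3)["labels"])  # [0, 1, 0]
```

With `p=0.075` as in the configuration above, roughly one in thirteen samples would be replaced on average.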


#### 4.5 Combining all Transformations

After configuring individual components like augmentations, decoding, and feature extraction, the next step in the GADME framework is to combine all these elements. This integration is facilitated by the `GADMETransformsWrapper` class, which serves as an interface for managing all data transformations. Here's how you can instantiate and integrate all transformations:

In [None]:
# Instantiate waveform augmentations
waveform_augmentations = {
        name: hydra.utils.instantiate(aug)
        for name, aug in config["waveform_augmentations"].items()
    }

# Instantiate spectrogram augmentations
spectrogram_augmentations = {
    name: hydra.utils.instantiate(aug)
    for name, aug in config["spectrogram_augmentations"].items()
}

# Instantiate decoding
decoding = hydra.utils.instantiate(config["decoding"])

# Instantiate feature extraction
feature_extractor = hydra.utils.instantiate(config["feature_extractor"])

# Instantiate the no-call sampler
nocall_sampler = hydra.utils.instantiate(config["nocall_sampler"])

In [None]:
config['max_length'] = 5
# Combine all components into the transforms wrapper
transforms_wrapper = GADMETransformsWrapper(
    task=config["task"],
    sampling_rate=config["sample_rate"],
    model_type="vision",
    preprocessing=preprocessing_config,
    spectrogram_augmentations=spectrogram_augmentations,
    waveform_augmentations=waveform_augmentations,
    decoding=decoding,
    feature_extractor=feature_extractor,
    max_length=config["max_length"],
    n_classes=None,
    nocall_sampler=nocall_sampler,
)

Here's a brief overview of the parameters used in the `GADMETransformsWrapper` class:

- `task`: Specifies the type of task (e.g., 'multiclass' or 'multilabel').
- `sampling_rate`: The sampling rate at which the audio data should be processed.
- `model_type`: Indicates the type of model (e.g. 'vision' for spectrogram-based models or 'waveform' for waveform-based models).
- `preprocessing`: The preprocessing configuration defined earlier.
- `spectrogram_augmentations` and `waveform_augmentations`: The sets of augmentations to be applied to the spectrogram and waveform data, respectively.
- `decoding` and `feature_extractor`: Components responsible for data decoding and feature extraction.
- `max_length`: The maximum length for the processed data segments in seconds.
- `nocall_sampler`: The no-call sampler component, if configured.

#### Important Note:
- The `n_classes` parameter is currently deprecated.

The proper configuration of these parameters is critical for aligning the transformations with the model's requirements and the characteristics of the dataset, thereby ensuring optimal data preparation and subsequent model training.
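To make the `_target_` convention used in the configuration dictionaries more concrete, here is a deliberately simplified stand-in for `hydra.utils.instantiate`. It only handles the cases used in this notebook (callable targets, nested lists and dictionaries, and `None`); Hydra's real implementation supports more, such as string targets. The classes `Gain` and `RandomApply` below are hypothetical placeholders, not the torchvision or torchaudio classes.

```python
def instantiate(node):
    """Recursively build objects from dictionaries that carry a `_target_` entry."""
    if isinstance(node, list):
        return [instantiate(item) for item in node]
    if isinstance(node, dict):
        kwargs = {key: instantiate(value) for key, value in node.items() if key != "_target_"}
        if "_target_" in node:
            return node["_target_"](**kwargs)
        return kwargs
    return node  # plain values (numbers, strings, None) pass through unchanged


class Gain:  # hypothetical stand-in for an augmentation class
    def __init__(self, factor):
        self.factor = factor


class RandomApply:  # hypothetical stand-in for a wrapper that takes a list of transforms
    def __init__(self, transforms, p):
        self.transforms, self.p = transforms, p


example = {"_target_": RandomApply, "p": 0.3, "transforms": [{"_target_": Gain, "factor": 2.0}]}
obj = instantiate(example)
print(type(obj).__name__, obj.p, obj.transforms[0].factor)  # RandomApply 0.3 2.0
```

The nested dictionary inside `transforms` is built first and passed to the outer class as a list, which mirrors the structure of the spectrogram augmentation config. Passing `None`, as done for the disabled no-call sampler, simply returns `None`.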

### 5. Configuration of Event Mappings

Event mapping plays a pivotal role in the data pipeline, serving as the bridge between raw dataset events and the structured input required by the model. This process ensures that each event in the dataset is accurately represented and can be effectively utilized during model training. By default, we use [bambird](https://www.sciencedirect.com/science/article/pii/S1574954122004022?casa_token=HEbcdB5MyRMAAAAA:saYbr1WNlJTs-kAZOtzMrNt5r1sN_69E7bMjfCJu2A4zlLLFoIt-5-Cht2Wryg59851H_PWgfHzw) for event mapping, which is implemented in the `XCEventMapping` class. Within the GADME framework, event mappings are configured as follows:

In [None]:
# Configuration for event mappings
config["mapper"] = {
    "biggest_cluster": True,
    "": None, # Redundant because of the eventlimit parameter in DatasetConfig
    "no_call": False, # No-calls are already handled by the nocall_sampler
}

In [None]:
# Instantiate the event mapper
mapper = src.datamodule.components.XCEventMapping(
            biggest_cluster=config["mapper"]["biggest_cluster"],
            =config["mapper"][""],
            no_call=config["mapper"]["no_call"],
        )

Key Parameters in Event Mapping:
- `biggest_cluster`: If set to `True`, the mapper focuses on the biggest cluster of events, which can be particularly useful for datasets with imbalanced event distributions.
- ``: Specifies the maximum number of events to consider. This can be used to limit the scope of the mapping, although it's usually already managed by the `DatasetConfig`.
- `no_call`: Indicates whether 'no-call' events should be included. In this configuration, it's set to `False` as the no-call samples are handled separately by the `nocall_sampler`.

Properly configuring the event mappings is essential for ensuring that the model receives accurately structured and meaningful data, which is a cornerstone for effective model training and robust performance.

## Creating the GADME Datamodule

The GADME Datamodule plays a central role in the GADME data pipeline, offering streamlined handling and preprocessing of GADME datasets to ensure they are primed for model training. Let's delve into the setup process:

### Imports
First, we import the necessary modules. `GADMEDataModule` is responsible for managing the GADME datasets, the `logging` module is used for logging information during the data processing steps, and `os` is used to handle the dataset path.

In [None]:
import logging 
import os

from src.datamodule.gadme_datamodule import GADMEDataModule

### Creating Cache Directory
The cache directory is a dedicated space for storing processed data. Utilizing a cache directory can significantly expedite subsequent data loading operations by avoiding redundant data processing. Here's how to create and manage a cache directory effectively:

In [None]:
# Log the absolute path of the dataset
logging.info(f"Dataset path: <{os.path.abspath(config['dataset_path'])}>")

# Create the dataset directory if it does not exist
os.makedirs(config["dataset_path"], exist_ok=True)

This approach ensures:
- Organized data management: By maintaining a structured directory for your datasets, you facilitate easier access and management of your data assets.
- Efficient data loading: By caching processed data, subsequent loads are much faster, which is particularly beneficial when working with large datasets.

By carefully setting up the GADME Datamodule and managing your cache directory, you enhance the efficiency and reliability of your data pipeline, ensuring that your datasets are always ready for model training.
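Because the dataset needs a lot of disk space, it can be useful to check the free space when creating the directory. The following sketch uses only the standard library; the helper `prepare_data_dir` and the `required_gb` threshold are hypothetical, and the actual space requirement depends on the dataset you download.

```python
import shutil
from pathlib import Path


def prepare_data_dir(path, required_gb):
    """Create the directory if needed and report whether enough disk space is free."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    free_gb = shutil.disk_usage(directory).free / 1e9
    return free_gb >= required_gb, free_gb


# Example call (the 100 GB threshold is a placeholder, not a documented requirement):
# enough, free_gb = prepare_data_dir(config["dataset_path"], required_gb=100)
# print(f"Enough space: {enough} ({free_gb:.1f} GB free)")
```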

### Datamodule Initialization

The `GADMEDataModule` class plays a pivotal role in orchestrating the data pipeline. It consolidates the dataset configuration, data loaders, transformations, and event mappings into a cohesive structure, ensuring a clean and manageable workflow. Here's how the GADMEDataModule is initialized:

In [None]:
# Initialize the GADMEDataModule
datamodule = GADMEDataModule(
        dataset=dataset_config,
        loaders=loaders_config,
        transforms=transforms_wrapper,
        mapper=mapper,
    )

Here's a brief overview of the parameters used in the `GADMEDataModule` class:
- `dataset`: The configuration settings for the dataset. It defines how the data is structured and managed.
- `loaders`: Configuration settings for the data loaders, determining how data is batched and fed into the model.
- `transforms`: The set of transformations and augmentations applied to the data, ensuring that it's properly conditioned for the model.
- `mapper`: The event mapping configuration, essential for translating raw dataset events into a structured format that the model can interpret.

### Data Preparation

The data preparation stage is where the actual data processing takes place. This stage is critical in ensuring that the data is correctly preprocessed, structured, and ready for model training.

In [None]:
# Prepare the data for training
datamodule.prepare_data()

The `prepare_data()` method encompasses various steps, including downloading the data (if not already locally available), applying the preprocessing steps defined in the transformations, and organizing the data into a format that is compatible with the model. It's a method that encapsulates the entire data preparation workflow, ensuring that the data is optimally prepared for the training process.

This methodical approach to data preparation and modularization of the data pipeline components in the GADME framework contributes significantly to the efficiency, maintainability, and robustness of the machine learning lifecycle.

**Hint**: If you receive an error about a Hugging Face dataset that does not exist, make sure you are logged in to Hugging Face (see [Log in to Huggingface](#log-in-to-huggingface)).

### Datamodule Setup for Training Phase

Setting up the datamodule for the training phase initializes the training and validation dataloaders, which supply the model with data during training. The setup is performed with the `setup(stage="fit")` method:

In [None]:
# Setup the datamodule for the training phase
datamodule.setup(stage="fit")

# Retrieve the training and validation dataloaders
train_loader = datamodule.train_dataloader()
validation_loader = datamodule.val_dataloader()

In [None]:
# Fetch a sample batch from the training dataloader
batch = next(iter(train_loader))

# Inspect the keys and shapes of the data in the batch
batch.keys(), batch["input_values"].shape, batch["labels"].shape

These two cells:
- Set up the datamodule for the training phase.
- Retrieve the training and validation dataloaders.
- Fetch a sample batch from the training dataloader and inspect its keys and shapes.

The shapes of `input_values` and `labels` show the batch size followed by the remaining dimensions of the input data (including the channel dimension, if applicable) and of the labels, respectively.
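
For batches with more entries, a small helper can list every shape at once. The following is a minimal sketch and not part of GADME: `summarize_batch` is a hypothetical name, `_FakeTensor` is a stand-in so that the example runs without a deep learning framework, and the shapes in `example_batch` are placeholders rather than actual GADME output.

```python
from typing import Any, Dict, Mapping, Optional, Tuple


def summarize_batch(batch: Mapping[str, Any]) -> Dict[str, Optional[Tuple[int, ...]]]:
    """Map every key of a batch to the shape of its value (None if it has no shape)."""
    summary: Dict[str, Optional[Tuple[int, ...]]] = {}
    for key, value in batch.items():
        shape = getattr(value, "shape", None)
        summary[key] = tuple(shape) if shape is not None else None
    return summary


class _FakeTensor:
    """Stand-in for a tensor; it only carries a shape."""

    def __init__(self, *shape: int) -> None:
        self.shape = shape


# Placeholder shapes, chosen only for illustration.
example_batch = {
    "input_values": _FakeTensor(32, 1, 128, 256),
    "labels": _FakeTensor(32),
}
print(summarize_batch(example_batch))
```

With a real batch, `summarize_batch(batch)` should work the same way for any value that exposes a `shape` attribute.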

### Datamodule Setup for Test Phase

Similarly, the datamodule is set up for the test phase to ensure that the model can be effectively evaluated using the test data:

In [None]:
# Setup the datamodule for the test phase
datamodule.setup(stage="test")

# Retrieve the test dataloader
test_loader = datamodule.test_dataloader()

The `setup(stage="test")` method prepares the datamodule for the test phase, and `test_dataloader()` returns the dataloader that batches and loads the test data during model evaluation.

With the datamodule set up for both the training and test phases, the model has access to prepared data for training, validation, and testing.

### Usage in TensorFlow

Utilizing the GADME datamodule in a TensorFlow environment involves integrating the prepared dataloaders with TensorFlow's training and evaluation workflows. This integration ensures that the data is fed into TensorFlow models efficiently and in a format that TensorFlow can process. Here's how you can set up the GADME datamodule for TensorFlow compatibility:

In [None]:
# Setup the datamodule for the training phase
datamodule.setup(stage="fit")

# Retrieve the training and validation datasets
train_dataset = datamodule.train_dataset
val_dataset = datamodule.val_dataset

# Setup the datamodule for the test phase
datamodule.setup(stage="test")

# Retrieve the test dataset
test_dataset = datamodule.test_dataset

#### Key Considerations
- `train_dataset`, `val_dataset`, and `test_dataset` are the datasets prepared by the GADME datamodule, ready to be used in TensorFlow's training and evaluation routines.
- It's important to ensure that these datasets are in a format compatible with TensorFlow. This might involve additional steps such as converting the data to `tf.data.Dataset` objects or applying necessary transformations to align with TensorFlow's data handling mechanisms.
- More information on this integration process can be found in [HuggingFace's documentation](https://huggingface.co/docs/datasets/use_with_tensorflow).

By following these steps, you can leverage the robust data preprocessing and management capabilities of the GADME datamodule within a TensorFlow environment, facilitating an efficient and streamlined model training and evaluation process.
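
The conversion to `tf.data.Dataset` mentioned above can be sketched with the `to_tf_dataset` method of Hugging Face `datasets.Dataset`. This is a sketch under assumptions, not a confirmed GADME recipe: it assumes that the datamodule's datasets are Hugging Face `Dataset` objects and that their columns are named `input_values` and `labels`, as in the PyTorch batch inspected earlier. The wrapper name `to_tf_dataset` and its defaults are hypothetical.

```python
from typing import Any, Sequence


def to_tf_dataset(
    hf_dataset: Any,
    batch_size: int = 32,
    shuffle: bool = False,
    feature_columns: Sequence[str] = ("input_values",),  # assumed column name
    label_columns: Sequence[str] = ("labels",),  # assumed column name
) -> Any:
    """Convert a Hugging Face dataset into a batched tf.data.Dataset.

    TensorFlow must be installed when this is called on a real dataset.
    """
    return hf_dataset.to_tf_dataset(
        columns=list(feature_columns),
        label_cols=list(label_columns),
        batch_size=batch_size,
        shuffle=shuffle,
    )
```

Under these assumptions, `to_tf_dataset(datamodule.train_dataset, shuffle=True)` would yield batches that can be passed to `model.fit`. If the datasets carry on-the-fly transforms, it is worth checking that the converted batches have the expected shapes.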

### Mapping of Labels to eBird Codes

The labels in the GADME datasets are stored as integers. These numeric labels can be mapped back to their eBird codes using the `dataset_info.json` file, which is created during data preprocessing in the folder where the preprocessed data is stored (i.e. in the `data_dir` of the `DatasetConfig`). The `get_label_to_category_mapping_from_metadata` function parses this JSON file and builds a dictionary that maps each numeric label to its eBird code as a string.

In [None]:
import json
from typing import Dict

def get_label_to_category_mapping_from_metadata(
    file_path: str, task: str
) -> Dict[int, str]:
    """
    Reads a JSON file and extracts the mapping of labels to eBird codes.

    The function expects the JSON structure to be in a specific format, where the mapping
    is a list of names located under the keys 'features' -> 'labels' -> 'names' (multiclass)
    or 'features' -> 'labels' -> 'feature' -> 'names' (multilabel).
    The index in the list corresponds to the label, and the value at that index is the eBird code.

    Args:
    - file_path (str): The path to the JSON file containing the label to eBird code mapping.
    - task (str): The type of task for which to get the mapping. Expected values are "multiclass" or "multilabel".

    Returns:
    - Dict[int, str]: A dictionary where each key is a label (integer) and the corresponding value is the eBird code.

    Raises:
    - FileNotFoundError: If the file at `file_path` does not exist.
    - json.JSONDecodeError: If the file is not a valid JSON.
    - KeyError: If the expected keys ('features', 'labels', 'names') are not found in the JSON structure.
    - NotImplementedError: If `task` is neither "multiclass" nor "multilabel".
    """

    # Open the file and read the JSON data
    with open(file_path, "r") as file:
        dataset_info = json.load(file)

    # Extract the list of eBird codes from the loaded JSON structure.
    # Note: This assumes a specific structure of the JSON data.
    # If the structure is different, this line will raise a KeyError.
    if task == "multiclass":
        ebird_codes_list = dataset_info["features"]["labels"]["names"]
    elif task == "multilabel":
        ebird_codes_list = dataset_info["features"]["labels"]["feature"]["names"]
    else:
        # If the task is not recognized (not multiclass or multilabel), raise an error.
        raise NotImplementedError(
            f"Only the multiclass and multilabel tasks are implemented, not task {task}."
        )

    # Create a dictionary mapping each label (index) to the corresponding eBird code.
    mapping = {label: ebird_code for label, ebird_code in enumerate(ebird_codes_list)}

    return mapping
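
Once such a mapping exists, it can be used to turn numeric labels or model outputs back into eBird codes. The sketch below is illustrative only: `decode_multiclass`, `decode_multilabel`, and the `threshold` parameter are hypothetical names, the three entries in `example_mapping` are placeholders rather than the contents of an actual `dataset_info.json`, and the multilabel case assumes a vector with one score per class.

```python
from typing import Dict, List, Sequence


def decode_multiclass(label: int, mapping: Dict[int, str]) -> str:
    """Return the eBird code for a single integer label."""
    return mapping[label]


def decode_multilabel(
    scores: Sequence[float], mapping: Dict[int, str], threshold: float = 0.5
) -> List[str]:
    """Return the eBird codes whose score reaches the threshold."""
    return [mapping[i] for i, score in enumerate(scores) if score >= threshold]


# Placeholder mapping with the same shape as the function's return value.
example_mapping = {0: "amerob", 1: "blujay", 2: "norcar"}

print(decode_multiclass(1, example_mapping))  # blujay
print(decode_multilabel([0.9, 0.1, 0.7], example_mapping))  # ['amerob', 'norcar']
```

For multi-hot targets that contain only zeros and ones, the default threshold of 0.5 selects exactly the active classes.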