# 🐦 **Kaggle BirdCLEF+ 2025 Competition**

---

## ✅ **Final Results**
| Stage                                 | Public Score | Private Score | Submitted | Final Leaderboard Rank |
|--------------------------------------|:------------:|:-------------:|:---------:|:----------------------:|
| 1. Applied initial 5-second segment  |     0.793     |     0.803     |     -     |           -            |
| 2. Condition-based segment selection |     0.809     |     0.821     |     -     |           -            |
| 3. Model change + hyperparameter tuning |  0.825    |     0.841     |     -     |           -            |
| 4. Model architecture update         |     0.835     |   **0.847**   |    ✓      |           -            |
| 5. Public model ensemble             |     0.878     |   **0.893**   |     -     |           -            |
| 6. Post-processing adjustments       |     0.881     |   **0.891**   |    ✓      |   **507 / 2026**       |

---

### 📌 **Competition Information**
- **Title**: [BirdCLEF+ 2025](https://www.kaggle.com/competitions/birdclef-2025)
- **Organizers**: Cornell Lab of Ornithology, LifeCLEF
- **Period**: Mar 10, 2025 – Jun 5, 2025 (UTC)
- **Number of Teams**: 2,026
- **Evaluation Metric**: Macro-average ROC-AUC across classes (excluding classes not present in the test set)
- **Submission Format**:
  - Predict presence probabilities for 206 species for every 5-second audio segment
  - Up to 2 submissions can be selected as final; the final rank is based on the Private Score
- **Project Objective**: Detect and classify 206 bird species from 60-second soundscape recordings in natural environments
- **Execution Constraints**:
  - Internet disabled
  - Notebook runtime limit: 90 minutes on CPU or 1 minute on GPU, which makes inference effectively CPU-only
  - Use of external public datasets and pretrained models is allowed
  - Required output filename: `submission.csv`
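
The evaluation metric described above can be approximated locally as follows. This is a minimal sketch, not the official scoring code: the function name `macro_auc_skip_empty` is hypothetical, and it also skips classes where every label is positive, because ROC-AUC is undefined without both classes present.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def macro_auc_skip_empty(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Macro-averaged ROC-AUC over classes that actually occur in y_true.

    y_true:  (n_rows, n_classes) binary labels
    y_score: (n_rows, n_classes) predicted probabilities
    """
    aucs = []
    for c in range(y_true.shape[1]):
        labels = y_true[:, c]
        # Skip classes without both positives and negatives (AUC undefined).
        if labels.min() == labels.max():
            continue
        aucs.append(roc_auc_score(labels, y_score[:, c]))
    if not aucs:
        raise ValueError("no class has both positive and negative labels")
    return float(np.mean(aucs))
```

A local score computed this way is only a rough proxy for the leaderboard, since the hidden test set has a different class distribution.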

---

### 👥 **Team and My Role**
- **Participation Period**: May 2, 2025 – Jun 5, 2025 (UTC)
- **Team Composition**: Participated as a team of 5 members
- **My Role**: Team Leader (led the entire pipeline including preprocessing, model design, experimentation, and ensembling)
  - Designed the preprocessing strategy and criteria for selecting segments
  - Architected the model and ensemble framework
  - Led post-processing and class weight adjustment strategies
  - Directed experimentation and implemented the end-to-end pipeline

---

### 🛠️ **Technologies & Libraries Used**
- **Language**: Python 3.11
- **Deep Learning Framework**: PyTorch
- **Model Architecture & Backbones**: `timm` (EfficientNet, NFNet, SE-ResNeXt)
- **Audio Processing**: `librosa`, `torchaudio` (MelSpectrogram, AmplitudeToDB)
- **Data Handling**: `numpy`, `pandas`, `scikit-learn`
- **Visualization**: `matplotlib`, `seaborn`

---

### 💻 **Development Environment**
- **Operating System**: Ubuntu 24.04
- **IDE / Editor**: VSCode
- **Hardware**
  - CPU: Ryzen 9 5900X
  - RAM: 32 GB (2 × 16 GB)
  - GPU: NVIDIA GeForce RTX 4070 Ti (12 GB VRAM)
- **Additional Environment**: Kaggle Notebooks
  - Multiple models trained and tested in parallel to maximize time efficiency

---

### 📂 **Dataset Overview**
- **`train_audio`**
  - ~20,000 audio clips (5–60 seconds each)
  - Contains sounds from birds (Aves), amphibians (Amphibia), mammals (Mammalia), and insects (Insecta)
  - Sourced from Xeno-canto, iNaturalist, and CSA (32 kHz, OGG format)
  - Weakly labeled data

- **`train_soundscapes`**
  - ~10,000 unlabeled 60-second recordings
  - Captured at locations similar to test environments

- **`test_soundscapes`**
  - ~700 one-minute recordings for evaluation
  - Automatically placed in the test directory (32 kHz, OGG format)

- **`train.csv`**
  - Metadata including `primary_label`, `secondary_labels`, `latitude`, `longitude`, `author`, `filename`, `rating`, and `collection`

- **`sample_submission.csv`**
  - `row_id` (soundscape_id_end_time)
  - Expected probability predictions for 206 species

- **`taxonomy.csv`**
  - Species ID, scientific name, and class category (Aves / Amphibia / Mammalia / Insecta)

- **`recording_location.txt`**
  - Descriptions of recording locations in the El Silencio Nature Reserve, Colombia

### 📖 **Baseline Code (Public Score: 0.761)**

- **Preprocessing Baseline**: [Transforming audio to mel-spec (Kadir Candisolu)](https://www.kaggle.com/code/kadircandrisolu/transforming-audio-to-mel-spec-birdclef-25)
    - **Config**
        - Sampling rate: 32 kHz
        - Segment length: 5 seconds
        - Target shape: 256 × 256
        - Mel-spectrogram parameters: `n_fft=1024`, `hop_length=512`, `n_mels=128`, `fmin=50`, `fmax=14000`
    - **Workflow**
        1. Extract the center 5-second segment
        2. If shorter than target length, repeat and pad the audio
        3. Convert to mel-spectrogram using `librosa.feature.melspectrogram` → dB scale → min-max normalization [0–1]
        4. Resize to (256, 256) and save as `.npy` file using `cv2.resize`

- **Train Baseline**: [EfficientNet-B0 PyTorch Train BirdCLEF-25 (Kadir Candisolu)](https://www.kaggle.com/code/kadircandrisolu/efficientnet-b0-pytorch-train-birdclef-25)
    - **Augmentation**
        - SpecAugment (time & frequency masking)
        - Random brightness and contrast
        - Mixup regularization
    - **Model**
        - Backbone: `timm.create_model('efficientnet_b0', pretrained=True, in_chans=1)`
        - Classifier: `AdaptiveAvgPool2d → Dropout(0.2) → Linear(num_classes)`
    - **Training Settings**
        - Loss: `BCEWithLogitsLoss`
        - Optimizer: `AdamW`
        - Scheduler: `CosineAnnealingLR`
        - Cross-validation: `StratifiedKFold(n_splits=5)`

- **Inference Baseline**: [EfficientNet-B0 PyTorch Inference BirdCLEF-25 (Kadir Candisolu)](https://www.kaggle.com/code/kadircandrisolu/efficientnet-b0-pytorch-inference-birdclef-25)
    - **Config**: Same as training configuration
    - **Inference Workflow**
        1. Split 60-second audio into 5-second segments → `total_segments = duration_sec / 5` (12 segments per file)
        2. Convert to mel-spectrogram and resize to 256×256
        3. Predict with each fold and apply `sigmoid(outputs)` → **soft-voting**
    - **Submission**
        - Fill predictions into `sample_submission.csv` format and save as `submission.csv`
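
The inference workflow above can be sketched as follows. This is an illustration rather than the baseline notebook's code: the helper names and the soundscape id are hypothetical, and the mel-spectrogram and model steps are replaced by placeholder logits.

```python
import numpy as np

SR = 32_000       # sampling rate from the baseline config
SEGMENT_SEC = 5   # segment length from the baseline config


def split_into_segments(audio, sr=SR, segment_sec=SEGMENT_SEC):
    """Cut a 1-D waveform into consecutive, non-overlapping segments."""
    seg_len = sr * segment_sec
    total_segments = len(audio) // seg_len  # 60 s of audio -> 12 segments
    return audio[: total_segments * seg_len].reshape(total_segments, seg_len)


def make_row_ids(soundscape_id, total_segments, segment_sec=SEGMENT_SEC):
    """Build row_id values of the form <soundscape_id>_<end_time in seconds>."""
    return [f"{soundscape_id}_{(i + 1) * segment_sec}" for i in range(total_segments)]


def soft_vote(fold_logits):
    """Average sigmoid probabilities over folds.

    fold_logits: array-like of shape (n_folds, n_segments, n_classes)
    """
    logits = np.asarray(fold_logits, dtype=np.float64)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs.mean(axis=0)


audio = np.zeros(60 * SR, dtype=np.float32)                # placeholder 60-second clip
segments = split_into_segments(audio)
row_ids = make_row_ids("soundscape_0001", len(segments))   # hypothetical id
probs = soft_vote(np.zeros((5, len(segments), 206)))       # placeholder logits, 5 folds
```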
---

### **1. Segment Timing Adjustment & Post-processing Added (PB 0.761 → 0.793 | PV 0.803)**
- **Segment Position Change**
    - **Before**: Center 5-second segment (baseline)
    - **After**: First 5 seconds of the audio file
- **Post-processing (Temporal Smoothing)**
    - **Edge segments**: Weighted average with previous/next segments (0.8 / 0.2)
    - **Inner segments**: Weighted average of previous/current/next segments (0.2 / 0.6 / 0.2)
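
The temporal smoothing above can be sketched in a few lines of NumPy. One assumption is made here: for edge segments, the larger weight (0.8) is applied to the segment itself and the smaller one (0.2) to its only neighbour. The function name is hypothetical.

```python
import numpy as np


def smooth_predictions(probs, edge_w=(0.8, 0.2), inner_w=(0.2, 0.6, 0.2)):
    """Smooth per-segment probabilities of one soundscape along time.

    probs: (n_segments, n_classes), rows in chronological order.
    """
    probs = np.asarray(probs, dtype=np.float64)
    if len(probs) < 2:
        return probs.copy()
    out = np.empty_like(probs)
    # Edge segments: mix with the single available neighbour (assumed weighting).
    out[0] = edge_w[0] * probs[0] + edge_w[1] * probs[1]
    out[-1] = edge_w[0] * probs[-1] + edge_w[1] * probs[-2]
    # Inner segments: previous / current / next.
    out[1:-1] = (
        inner_w[0] * probs[:-2] + inner_w[1] * probs[1:-1] + inner_w[2] * probs[2:]
    )
    return out
```

The smoothing must be applied per soundscape, so that the last segment of one file is never mixed with the first segment of the next.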

---

### **2. Condition-Based Segment Selection (PB 0.793 → 0.809 | PV 0.803 → 0.821)**
1. **Handling Short Audio**
    - If length < 5s: repeat and pad to create a valid 5-second segment
2. **CSA Recording Exception** (contains human voice by default)
    - Based on listening tests, files starting with `CSA` consistently include human speech
    - For these files, a fixed 2–7 second segment (less human noise) is used
3. **Valid Segment Detection**
    - Criteria: a frame is valid if its signal at frequencies ≥ 2 kHz reaches at least -27.5 dB
    - Apply 5-second sliding window at 1-second steps
    - Choose the window with the most valid frames; if none found, default to 2–7s segment
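
The three rules above can be sketched as follows. This is a simplified illustration under stated assumptions: the spectrogram is passed in as a precomputed dB array, a frame counts as valid when any bin at or above 2 kHz reaches the threshold, and the dB reference level is whatever the spectrogram was computed with. The helper names `pad_by_repeating` and `select_start_sec` are hypothetical.

```python
import os

import numpy as np

SEGMENT_SEC = 5
FALLBACK_START_SEC = 2  # fixed 2-7 s window


def pad_by_repeating(audio, target_len):
    """Tile a short, non-empty clip until it reaches target_len samples."""
    if len(audio) >= target_len:
        return audio
    reps = int(np.ceil(target_len / len(audio)))
    return np.tile(audio, reps)[:target_len]


def select_start_sec(spec_db, freqs, frames_per_sec, filename,
                     min_freq=2000.0, threshold_db=-27.5):
    """Return the start time (in seconds) of the 5-second window to keep.

    spec_db: (n_bins, n_frames) spectrogram in dB
    freqs:   (n_bins,) centre frequency of each bin in Hz
    """
    # Rule 2: CSA recordings always use the fixed 2-7 s window.
    if os.path.basename(filename).startswith("CSA"):
        return FALLBACK_START_SEC
    # Rule 3: a frame is valid if any bin >= min_freq reaches the threshold.
    valid = (spec_db[freqs >= min_freq] >= threshold_db).any(axis=0)
    if not valid.any():
        return FALLBACK_START_SEC
    win = int(round(SEGMENT_SEC * frames_per_sec))
    step = int(round(frames_per_sec))  # 1-second steps
    starts = np.arange(0, max(valid.size - win, 0) + 1, step)
    counts = np.array([valid[s:s + win].sum() for s in starts])
    # Windows advance by one second, so the index equals the start second.
    return int(np.argmax(counts))
```

Clips shorter than 5 seconds would be passed through `pad_by_repeating` before the spectrogram is computed, so the sliding window always has at least one full position.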

---

### **3. Model & Hyperparameter Improvements (PB 0.809 → 0.825 | PV 0.821 → 0.841)**
- **Backbone Model Update**
    - **Before**: `efficientnet_b0`
    - **After**: `tf_efficientnetv2_s.in21k_ft_in1k`
- **Mel-Spectrogram Hyperparameters**

| Parameter      | Before | After |
|----------------|:------:|:-----:|
| `n_fft`        | 1024   | 2048  |
| `hop_length`   | 512    | 512   |
| `n_mels`       | 128    | 512   |
| `fmin`         | 50     | 20    |
| `fmax`         | 14000  | 16000 |

- **Loss Function**
    - **Before**: `BCEWithLogitsLoss`
    - **After**: `FocalLossBCE`
- **Post-processing Weight Change**
    - **Before**: Edge segment smoothing = 0.8 / 0.2
    - **After**: Edge segment smoothing = 0.9 / 0.1
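
This README does not define `FocalLossBCE`. A common formulation, sketched below, adds a sigmoid focal loss term to plain BCE. The values of `alpha`, `gamma`, and the two mixing weights are assumptions for illustration, not the settings used in this project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLossBCE(nn.Module):
    """BCE-with-logits plus a sigmoid focal term (assumed formulation)."""

    def __init__(self, alpha=0.25, gamma=2.0, bce_weight=1.0, focal_weight=1.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.bce_weight = bce_weight
        self.focal_weight = focal_weight

    def forward(self, logits, targets):
        # Per-element BCE, reused by the focal term.
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        # p_t is the probability assigned to the true label.
        p_t = p * targets + (1 - p) * (1 - targets)
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        # Down-weight easy examples (p_t close to 1).
        focal = alpha_t * (1 - p_t) ** self.gamma * bce
        return self.bce_weight * bce.mean() + self.focal_weight * focal.mean()


criterion = FocalLossBCE()
loss = criterion(torch.zeros(2, 3), torch.zeros(2, 3))
```

With 206 classes and only one or two positives per clip, most targets are easy negatives, which is the situation the focal term is meant to address.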

---

### **4. Model Architecture Modification (PB 0.825 → 0.835 | PV 0.841 → 0.847)**
- **Decoupling Backbone and Head**
    - **Before**: Used classifier layer embedded inside the backbone
    - **After**: Set `features_only=True` to extract pure feature maps from the backbone
- **Applied Custom Convolutional Head**
    - Used `1×1 Conv` to enhance inter-channel interactions
    - Classification head: `AdaptiveAvgPool → Flatten → Linear`
- **Analysis**
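    - Separating backbone and head lets the head consume the pretrained feature maps directly instead of going through the backbone's built-in classifier
    - The `1×1 Conv` recombines backbone channels at each time-frequency position before global pooling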
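
A head of this shape can be sketched as below. The hidden width, `BatchNorm2d`, and `ReLU` are assumptions, and a small stand-in backbone is used so the example runs without downloading weights. In the real pipeline the feature map would be the last element of the list returned by a `timm` model created with `features_only=True`, with the channel count read from `backbone.feature_info.channels()[-1]`.

```python
import torch
import torch.nn as nn


class ConvHead(nn.Module):
    """1x1 conv -> AdaptiveAvgPool -> Flatten -> Linear."""

    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        # 1x1 conv mixes channels without changing spatial resolution.
        self.mix = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, kernel_size=1),
            nn.BatchNorm2d(hidden_channels),  # assumption, not from the README
            nn.ReLU(inplace=True),            # assumption, not from the README
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(hidden_channels, num_classes)

    def forward(self, feature_map):
        x = self.mix(feature_map)
        x = self.pool(x).flatten(1)
        return self.fc(x)


# Hypothetical stand-in for the timm backbone's last feature map.
backbone_stub = nn.Sequential(nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU())
head = ConvHead(in_channels=64, hidden_channels=128, num_classes=206)
head.eval()
with torch.no_grad():
    logits = head(backbone_stub(torch.randn(2, 1, 256, 256)))
```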

---

### **[5. Public Model Ensemble (PB 0.835 → 0.878 | PV 0.847 → 0.893)](https://www.kaggle.com/code/johnyim1570/bird25-weightedblend-nfnet-seresnext-0-878)**
- **Referenced Public Notebooks**
    - 🔗 **Blending Logic**: [Bird25 | WeightedBlend | NFNet + ConvNeXtV2 | LB 0.860](https://www.kaggle.com/code/hideyukizushi/bird25-weightedblend-nfnet-convnextv2-lb-860) by [yukiZ](https://www.kaggle.com/hideyukizushi)
    - 🧪 **NFNet Model (PB 0.857)**: [Bird2025 | Single SED Model Inference [LB 0.857]](https://www.kaggle.com/code/i2nfinit3y/bird2025-single-sed-model-inference-lb-0-857) by [I2nfinit3y](https://www.kaggle.com/i2nfinit3y)
    - 📘 **SE-ResNeXt Model (PB 0.850)**: [Post-Processing with Power Adjustment for Low-Rank](https://www.kaggle.com/code/myso1987/post-processing-with-power-adjustment-for-low-rank) by [MYSO](https://www.kaggle.com/myso1987)

- **Model Replacement**
    - Replaced original models in the public WeightedBlend notebook with the two high-performing models above
    - Applied inference-level ensemble

- **Blending Weights**
    - NFNet: 25%
    - SE-ResNeXt: 75%
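
The blend itself is a weighted average of per-model predictions. The sketch below assumes both models output probabilities on the same rows and class columns; the function name and dictionary keys are hypothetical.

```python
import numpy as np


def weighted_blend(preds, weights):
    """Weighted average of per-model probability matrices.

    preds:   dict of model name -> (n_rows, n_classes) probabilities
    weights: dict of model name -> non-negative weight
    """
    if set(preds) != set(weights):
        raise ValueError("preds and weights must have the same model names")
    total = float(sum(weights.values()))
    blended = None
    for name, p in preds.items():
        term = (weights[name] / total) * np.asarray(p, dtype=np.float64)
        blended = term if blended is None else blended + term
    return blended


blend = weighted_blend(
    {"nfnet": np.full((12, 206), 0.4), "seresnext": np.full((12, 206), 0.8)},
    {"nfnet": 0.25, "seresnext": 0.75},
)
```

Normalizing by the weight total keeps the output in the probability range even if the weights do not sum to 1.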

---

### **6. NFNet Post-processing Adjustment (PB 0.878 → 0.881 | PV 0.893 → 0.891)**
- **Power Adjustment Hyperparameter Tuning**
    - **Before**: Applied a fixed `exponent=2` to all tail columns ranked below `top_k=30`
    - **After**: Split tail columns into 3 groups and applied different exponents:
        - Rank 31–100: `exponent=2`
        - Rank 101–150: `exponent=3`
        - Rank >150: `exponent=4`
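
The rank-banded power adjustment can be sketched as follows. It assumes that species columns are ranked by their maximum predicted probability over all rows, which may differ from the ranking used in the referenced notebook. The function name and the `bands` layout are hypothetical.

```python
import numpy as np


def power_adjust_by_rank(probs, bands=((30, 100, 2), (100, 150, 3), (150, None, 4))):
    """Raise tail-class probabilities to a rank-dependent power.

    probs: (n_rows, n_classes) probabilities
    bands: (start, stop, exponent) slices into the rank order, 0-based,
           so (30, 100, 2) covers ranks 31-100.
    """
    probs = np.asarray(probs, dtype=np.float64).copy()
    # Rank columns by peak confidence, most confident first (assumed criterion).
    order = np.argsort(-probs.max(axis=0))
    for start, stop, exponent in bands:
        cols = order[start:stop]
        probs[:, cols] = probs[:, cols] ** exponent
    return probs
```

Since probabilities lie in [0, 1], a larger exponent pushes low-ranked classes further toward zero while leaving the top 30 columns untouched.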

---

## 🧠 **Conclusion & Reflections**

Our analysis of the BirdCLEF leaderboards from 2022 to 2024 revealed a clear pattern: the majority of participants experienced a **drop in Private Score compared to Public Score**, especially outside of the Silver medal range. Many discussions highlighted persistent concerns about **overfitting to the public test set**.

The WeightedBlend ensemble (PB 0.878) was tuned by experimenting with different weight ratios (e.g., 0.5:0.5 → 0.75:0.25), and further improved to PB 0.881 through post-processing adjustments. However, we suspected this gain was **likely the result of overfitting to the public test set**.

As a result, we adopted a two-fold final submission strategy:
1. The best-performing model we trained independently (**PB 0.835**)
2. The ensemble model with the highest public score (**PB 0.881**)

However, once the private leaderboard was revealed, the outcome defied expectations. **All submissions scored higher on the Private Leaderboard**, and this trend was consistent across many teams. Surprisingly, the initial ensemble (PB 0.878) achieved **PV 0.893**, which aligned with **Bronze medal range** scores.

### ❗ **Summary**
- Assumptions based on historical leaderboard trends did **not apply** to this year's competition.  
- A **conservative submission strategy**—rather than bold experimentation—led to a **drop in final ranking**.  
- **Public Leaderboard Rank: 162 → Private Leaderboard Rank: 507**

---

## 😓 **Technical Limitations & Regrets**

From my perspective, there were two core challenges that defined this competition:

1. How to refine and effectively utilize the weakly labeled `train_audio` data.
2. How to generate trustworthy pseudo-labels from `train_soundscapes` for semi-supervised learning.

For challenge #1, we focused on the fact that **most bird sounds occur in high-frequency ranges**, and developed a heuristic that:
> Selects the 5-second audio segment with the most frames in which the signal above 2 kHz reaches -27.5 dB.

While this method contributed to some performance improvements, it soon plateaued. We concluded that it wasn't reliable enough to generate pseudo-labels, and therefore **decided not to use `train_soundscapes` at all**.

Midway through the competition, we discovered the concept of **MIL (Multiple Instance Learning)**:
> This technique allows a model to learn from weak labels by focusing on the specific time segments where the positive class appears, which could solve both challenges above in a unified manner.

Unfortunately, due to time constraints and submission errors in the final phase, we were **unable to finish and submit the MIL-based inference code**.

---

## 💡 **Technical Growth & Takeaways**

Despite the limitations, I gained several valuable experiences through this competition:

- Designed and tested preprocessing strategies, model architectural changes, and post-processing techniques that directly led to performance improvements.
- Developed an efficient workflow combining **Kaggle Notebooks and local GPU training**, which helped parallelize experiments and save time.
- Learned about **knowledge distillation (Teacher–Student training)** for the first time while reviewing top solutions after the competition ended.
- Gained practical insights into **data distribution analysis, overfitting detection, and submission strategy planning** for large-scale multi-label audio classification tasks.