# Deep Learning Model

---

## Quick description

This project examines the application of deep learning methods for audio classification using spectrogram-based representations. Audio signals are transformed into **mel-spectrograms** and treated as image-like inputs for convolutional neural networks. Two modeling approaches are implemented and compared: a custom **baseline** Convolutional Neural Network trained from scratch and a transfer learning model based on **EfficientNet-B0** pretrained on ImageNet.

The models are trained using a **supervised learning** setup with separate training, validation, and test datasets. The validation set is used for model selection and early stopping, while final performance is assessed on the held-out test set. Evaluation uses standard classification metrics: accuracy, loss, precision, and recall.

The primary objective of the project is to analyze the performance difference between a simple baseline architecture and a pretrained deep model, and to assess the effectiveness of transfer learning for spectrogram-based audio classification tasks.

---

## Structure

### Preprocessing

The models are trained on the Watkins Marine Mammals Database, a collection of animal sound recordings. The data used here covers **44** species with a total of **15407** recordings.


I divided the recordings into three splits:
- 70% for training
- 15% for validation
- 15% for test

This gives the following number of recordings per split:
- Train: 10784
- Val:   2311
- Test:  2312
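
The counts above are consistent with rounding the training and validation sizes down and giving the remainder to the test set. The sketch below illustrates that arithmetic with a plain shuffled split. The function name `split_indices` and the seed are hypothetical, and whether the actual split was stratified per species is not stated here, so treat this as one possible way to arrive at the same numbers.

```python
import random


def split_indices(n_items, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle item indices and cut them into train/val/test lists.

    Train and validation sizes are rounded down; the test split
    receives whatever is left over.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_train = int(n_items * train_frac)
    n_val = int(n_items * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(15407)
print(len(train_idx), len(val_idx), len(test_idx))  # 10784 2311 2312
```

With imbalanced classes, a per-class (stratified) split is often preferable so that rare species appear in all three splits.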

#### Normalization

After generating mel-spectrograms from each recording and saving them as .pt files, I applied offline per-sample standardization before training. Each .pt spectrogram is converted to float, rearranged to channel-first layout if necessary, and then normalized to zero mean and unit variance.
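
A minimal sketch of per-sample standardization is shown below. The function name `standardize_spectrogram`, the `eps` value, and the rule used to detect a channel-last tensor (a trailing dimension of size 1) are assumptions made for illustration, not necessarily what the project code does.

```python
import torch


def standardize_spectrogram(spec, eps=1e-6):
    """Return a float, channel-first spectrogram with zero mean and unit std."""
    spec = spec.float()

    if spec.dim() == 2:
        # (n_mels, time) -> (1, n_mels, time)
        spec = spec.unsqueeze(0)
    elif spec.dim() == 3 and spec.shape[-1] == 1:
        # Assumed channel-last layout: (n_mels, time, 1) -> (1, n_mels, time)
        spec = spec.permute(2, 0, 1)

    # Statistics are computed over this single sample only.
    mean = spec.mean()
    std = spec.std()
    return (spec - mean) / (std + eps)  # eps guards against silent clips
```

Because the statistics come from each sample alone, no dataset-wide mean or standard deviation has to be stored, and the same function can be applied to all three splits.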

#### Dataset & Dataloaders

After normalization, a PyTorch dataset loads the spectrogram .pt files from the train/val/test class folders. Each sample is converted to channel-first layout, then padded or cropped to a fixed width (400 time frames) so that the network receives equally shaped inputs. The dataset also builds a class_to_idx mapping from the sorted class folder names. The dataloaders are created with configurable batching and multiprocessing options for training and evaluation.
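
The dataset described above could look roughly like the following sketch. The class name `SpectrogramDataset`, the helper `pad_or_crop`, and the `<split>/<class>/<file>.pt` folder layout are assumptions. Padding with zeros at the end and cropping from the start are also illustrative choices, since the exact strategy is not specified here.

```python
from pathlib import Path

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

TARGET_FRAMES = 400  # fixed width in time frames


def pad_or_crop(spec, target_frames=TARGET_FRAMES):
    """Zero-pad or crop a (C, n_mels, T) tensor along the time axis."""
    n_frames = spec.shape[-1]
    if n_frames < target_frames:
        # F.pad pads the last dimension with (left, right) amounts.
        return F.pad(spec, (0, target_frames - n_frames))
    return spec[..., :target_frames]


class SpectrogramDataset(Dataset):
    """Loads .pt spectrograms from <split_dir>/<class_name>/*.pt."""

    def __init__(self, split_dir, target_frames=TARGET_FRAMES):
        split_dir = Path(split_dir)
        classes = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
        self.class_to_idx = {name: i for i, name in enumerate(classes)}
        self.samples = [
            (path, self.class_to_idx[name])
            for name in classes
            for path in sorted((split_dir / name).glob("*.pt"))
        ]
        self.target_frames = target_frames

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        spec = torch.load(path).float()
        if spec.dim() == 2:
            spec = spec.unsqueeze(0)  # ensure channel-first (1, n_mels, T)
        return pad_or_crop(spec, self.target_frames), label
```

A loader can then be built with, for example, `DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)`, where the batch size and worker count are placeholder values.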

### Models

The project compares a custom baseline CNN and a transfer‑learning model based on EfficientNet‑B0. 

The baseline (BaselineCNN) is a compact convolutional stack (16→32→64→128 channels) followed by global average pooling and a small classifier head.
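
As a rough illustration of such a baseline, the sketch below stacks four convolutional blocks with the channel widths listed above. The kernel size, batch normalization, max pooling, and dropout rate are assumptions; only the channel progression, the global average pooling, and the classifier head come from the description.

```python
import torch
import torch.nn as nn


def conv_block(in_channels, out_channels):
    """Conv -> BatchNorm -> ReLU -> 2x2 max pool (assumed block layout)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


class BaselineCNN(nn.Module):
    def __init__(self, num_classes, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_channels, 16),
            conv_block(16, 32),
            conv_block(32, 64),
            conv_block(64, 128),
        )
        # Global average pooling makes the head independent of input size.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        # x: (batch, 1, n_mels, time) -> logits: (batch, num_classes)
        return self.classifier(self.pool(self.features(x)))
```

For a batch of spectrograms shaped `(batch, 1, n_mels, 400)`, `BaselineCNN(num_classes=44)` returns logits of shape `(batch, 44)`.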

The transfer model (EfficientNetSpectrogram) loads torchvision's EfficientNet‑B0 pretrained on ImageNet, replaces the final linear layer to match our number of classes, and repeats single‑channel spectrograms to 3 channels so the backbone can be used without changes. The EfficientNet backbone can be frozen during training to fine‑tune only the classifier head.

Imbalance handling: per‑class weights are computed (using sklearn if available, otherwise inverse frequency) and passed to CrossEntropyLoss via get_weighted_criterion(...) to mitigate class imbalance.
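
The weighting scheme can be sketched as follows. The signature of `get_weighted_criterion` and the helper `compute_class_weights` are assumptions, since the actual arguments are not shown here. The sketch also assumes that training labels are integers `0..K-1` and that every class appears at least once. The fallback uses the same "balanced" formula as sklearn, `n_samples / (n_classes * class_count)`, which is a scaled inverse frequency.

```python
import numpy as np
import torch
import torch.nn as nn


def compute_class_weights(labels):
    """Return one weight per class, larger for rarer classes."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    try:
        from sklearn.utils.class_weight import compute_class_weight

        weights = compute_class_weight(
            class_weight="balanced", classes=classes, y=labels
        )
    except ImportError:
        # Fallback: inverse frequency with the same scaling as "balanced".
        counts = np.bincount(labels)[classes]
        weights = len(labels) / (len(classes) * counts)
    return torch.tensor(weights, dtype=torch.float32)


def get_weighted_criterion(labels, device="cpu"):
    """Build a CrossEntropyLoss weighted by class rarity (hypothetical signature)."""
    weights = compute_class_weights(labels).to(device)
    return nn.CrossEntropyLoss(weight=weights)
```

For example, with training labels `[0, 0, 0, 1]` the weights are `4 / (2 * 3) ≈ 0.667` for class 0 and `4 / (2 * 1) = 2.0` for class 1, so errors on the rarer class contribute more to the loss.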

EfficientNet provides strong, pretrained visual features; repeating the spectrogram channel lets the model use those features. The baseline is smaller and trains from scratch for comparison.