# The MS COCO classification challenge

Razmig Kéchichian

This notebook presents the multi-label classification challenge on the [MS COCO dataset](https://cocodataset.org/). It states the problem, sets out the organizational rules and presents the tools provided to help you complete the challenge.


## 1. Problem statement

Each image may contain objects from **several** categories, all of which must be predicted. This differs from the classification problem we saw on the CIFAR10 dataset, where each image belonged to a **single** category and the network's loss function and prediction mechanism (keeping only the highest output probability) were defined with that constraint in mind.
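To make the difference concrete, here is a minimal sketch in plain Python. The helper names (`to_multi_hot`, `predict_single_label`, `predict_multi_label`), the toy scores and the 0.5 threshold are illustrative assumptions, not part of the provided tools.

```python
NUM_CLASSES = 80  # length of the class tuple used in this challenge


def to_multi_hot(class_ids, num_classes=NUM_CLASSES):
    """Encode a list of class IDs as a 0/1 vector with one slot per class."""
    target = [0] * num_classes
    for class_id in class_ids:
        target[class_id] = 1
    return target


def predict_single_label(scores):
    """CIFAR10-style decision: keep only the index of the highest score."""
    return [max(range(len(scores)), key=scores.__getitem__)]


def predict_multi_label(scores, threshold=0.5):
    """Multi-label decision: keep every class whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Hypothetical per-class probabilities for a 5-class toy problem
scores = [0.9, 0.1, 0.7, 0.2, 0.6]
single = predict_single_label(scores)  # one class only
multi = predict_multi_label(scores)    # every class above the threshold
target = to_multi_hot([56, 60, 62])    # an image with three categories
```

With a single-label decision only one class survives, whereas the multi-label decision can return any number of classes, including none.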

We adapted the MS COCO dataset for the requirements of this challenge by, among other things, reducing the number of images and their dimensions to facilitate processing.

In the companion `ms-coco.zip` compressed directory you will find two sub-directories:
- `images`: which contains the images in train (65k) and test (~5k) subsets,
- `labels`: which lists labels for each of the images in the train subset only.

Each label file gives a list of class IDs that correspond to the class index in the following tuple:

In [None]:
classes = ("person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light", 
           "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
           "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",       
           "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
           "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
           "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch", 
           "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone", 
           "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear", 
           "hair drier", "toothbrush")

Your goal is to follow a **transfer learning strategy**: train and validate a network on **your own split of the training data into training and validation subsets**, then **test it on the test subset** by producing a [JSON file](https://en.wikipedia.org/wiki/JSON) with content in the following format:

```
{
    "000000000139": [
        56,
        60,
        62
    ],
    "000000000285": [
        21
    ],
    "000000000632": [
        57,
        59,
        73
    ],
    # other test images
}
```

In this file, the name (without extension) of each test image is associated with a list of class indices predicted by your network. Make sure that the JSON file you produce **follows this format strictly**.
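A minimal sketch of how such a file could be written with the standard `json` module is given below. The function name `write_predictions`, the example predictions and the output path are assumptions made for illustration. Sorting the indices is only a readability choice, not a documented requirement of the server.

```python
import json
import os
import tempfile


def write_predictions(predictions, json_path):
    """Write {image_stem: [class indices]} to a JSON file.

    `predictions` maps each test image name (without extension) to an
    iterable of predicted class indices.
    """
    # Plain Python ints are required: tensors or NumPy integers are not
    # JSON-serializable by default.
    cleaned = {str(name): sorted(int(c) for c in class_ids)
               for name, class_ids in predictions.items()}
    with open(json_path, "w") as f:
        json.dump(cleaned, f, indent=4)
    return cleaned


# Hypothetical predictions for three test images
example = {
    "000000000139": [62, 56, 60],
    "000000000285": [21],
    "000000000632": [57, 59, 73],
}
# Written to a temporary directory here so the sketch has no side effects
json_path = os.path.join(tempfile.gettempdir(), "predictions.json")
written = write_predictions(example, json_path)
```

In a real run, the dictionary would be filled from the image names returned by the test dataset and the classes kept after thresholding the network outputs.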

You will submit your JSON prediction file to the following [online evaluation server and leaderboard](https://www.creatis.insa-lyon.fr/kechichian/ms-coco-classif-leaderboard.html), which will evaluate your predictions against the test set labels, which are not available to you.

<div class="alert alert-block alert-danger"> <b>WARNING:</b> Use this server with <b>the greatest care</b>. A new submission with an identical Participant or group name will <b>overwrite</b> the identically named submission if one already exists, so check the leaderboard first. <b>Do not make duplicate leaderboard entries for your group</b>; keep track of your test scores privately. Also take care to upload only JSON files of the required format.<br>
</div>

The evaluation server calculates and returns mean performance over all classes and, optionally, per-class performance. Leaderboard entries are sorted by the F1 metric.
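For intuition, per-class precision, recall and F1 can be sketched in plain Python as follows. How the server averages over classes is not specified here, so the unweighted mean below is an assumption. The function name `per_class_metrics` and the toy data are illustrative only.

```python
def per_class_metrics(y_true, y_pred, num_classes):
    """Return a list of (precision, recall, f1) tuples, one per class.

    y_true and y_pred are lists of class-index lists, one entry per image.
    """
    metrics = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if c in t and c in p)      # true positives
        fp = sum(1 for t, p in pairs if c not in t and c in p)  # false positives
        fn = sum(1 for t, p in pairs if c in t and c not in p)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


# Toy example: 3 images, 3 classes
y_true = [[0, 1], [1], [2]]
y_pred = [[0], [1, 2], [2]]
metrics = per_class_metrics(y_true, y_pred, num_classes=3)
# Unweighted mean over classes (assumed averaging scheme)
mean_f1 = sum(m[2] for m in metrics) / len(metrics)
```

Note that the `validation_loop()` function provided in this notebook weights classes by inverse frequency instead, so its mean values can differ from an unweighted mean.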

You can request an evaluation as many times as you want. It is up to you to specify the final evaluation by updating the leaderboard entry corresponding to your Participant or group name. This entry will be taken into account for grading your work.

It goes without saying that it is **prohibited** to use another distribution of the MS COCO database for training, e.g. the Torchvision dataset.


## 2. Organization

- Given the scope of the project, you will work in groups of 2. 
- Work on the challenge begins in the IAV lab 3 session, that is on the **23rd of September**.
- Results are due 10 days later, that is on the **3rd of October, 18:00**. They comprise:
    - a submission to the leaderboard,
    - a commented Python script (with any necessary modules) or Jupyter Notebook, uploaded on Moodle in the challenge repository by one of the members of the group.
    
    
## 3. Tools

In addition to the MS COCO annotated data and the evaluation server, we provide you with most code building blocks. Your task is to understand them and use them to create the glue logic, that is the main program, putting all these blocks together and completing them as necessary to implement a complete machine learning workflow to train and validate a model, and produce the test JSON file.

### 3.1 Custom `Dataset`s

We provide you with two custom `torch.utils.data.Dataset` sub-classes to use in training and testing.

In [None]:
import os
from glob import glob
from pathlib import Path

from PIL import Image
import torch

class COCOTrainImageDataset(torch.utils.data.Dataset):
    def __init__(self, img_dir, annotations_dir, max_images=None, transform=None):
        self.img_labels = sorted(glob("*.cls", root_dir=annotations_dir))
        if max_images:
            self.img_labels = self.img_labels[:max_images]
        self.img_dir = img_dir
        self.annotations_dir = annotations_dir
        self.transform = transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, Path(self.img_labels[idx]).stem + ".jpg")
        labels_path = os.path.join(self.annotations_dir, self.img_labels[idx])
        image = Image.open(img_path).convert("RGB")
        with open(labels_path) as f: 
            labels = [int(label) for label in f.readlines()]
        if self.transform:
            image = self.transform(image)
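        # Multi-hot target vector; the explicit integer dtype keeps scatter_ valid for empty label files
        labels = torch.zeros(80).scatter_(0, torch.tensor(labels, dtype=torch.long), value=1)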
        return image, labels


class COCOTestImageDataset(torch.utils.data.Dataset):
    def __init__(self, img_dir, transform=None):
        self.img_list = sorted(glob("*.jpg", root_dir=img_dir))    
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        return len(self.img_list)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_list[idx])
        image = Image.open(img_path).convert("RGB")        
        if self.transform:
            image = self.transform(image)
        return image, Path(img_path).stem # filename w/o extension

### 3.2 Training and validation loops

The following are two general-purpose classification train and validation loop functions to be called inside the epochs for-loop with appropriate argument settings.

Pay particular attention to the `validation_loop()` function's arguments `multi_task`, `th_multi_task` and `one_hot`.
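One point worth checking is the scale of `th_multi_task`: `validation_loop()` compares the raw network outputs with this threshold. If your network outputs logits (for instance when trained with `torch.nn.BCEWithLogitsLoss`, which is an assumption about your setup), a probability threshold must first be converted to a logit. The sketch below uses only the standard library and the helper names are hypothetical.

```python
import math


def prob_to_logit(p):
    """Inverse of the sigmoid: the logit x at which sigmoid(x) == p."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# A probability threshold of 0.5 corresponds to a logit threshold of 0
logit_threshold = prob_to_logit(0.5)

# Conversely, thresholding logits at 0.5 means a probability of about 0.62
implied_prob = sigmoid(0.5)
```

Also note that `COCOTrainImageDataset` already returns multi-hot label vectors, so `one_hot=True` is the setting that matches it. With `one_hot=False` the function expects integer class indices.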

In [None]:
import torch


def train_loop(train_loader, net, criterion, optimizer, device,
               mbatch_loss_group=-1):
    net.train()
    running_loss = 0.0
    mbatch_losses = []
    for i, data in enumerate(train_loader):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # following condition False by default, unless mbatch_loss_group > 0
        if i % mbatch_loss_group == mbatch_loss_group - 1:
            mbatch_losses.append(running_loss / mbatch_loss_group)
            running_loss = 0.0
    if mbatch_loss_group > 0:
        return mbatch_losses


def validation_loop(val_loader, net, criterion, num_classes, device,
                    multi_task=False, th_multi_task=0.5, one_hot=False, class_metrics=False):
    net.eval()
    loss = 0
    correct = 0
    size = len(val_loader.dataset)
    class_total = {label:0 for label in range(num_classes)}
    class_tp = {label:0 for label in range(num_classes)}
    class_fp = {label:0 for label in range(num_classes)}
    with torch.no_grad():
        for data in val_loader:
            images, labels = data[0].to(device), data[1].to(device)
            outputs = net(images)
            loss += criterion(outputs, labels).item() * images.size(0)
            if not multi_task:    
                predictions = torch.zeros_like(outputs)
                predictions[torch.arange(outputs.shape[0]), torch.argmax(outputs, dim=1)] = 1.0
            else:
                predictions = torch.where(outputs > th_multi_task, 1.0, 0.0)
            if not one_hot:
                labels_mat = torch.zeros_like(outputs)
                labels_mat[torch.arange(outputs.shape[0]), labels] = 1.0
                labels = labels_mat
                
            tps = predictions * labels
            fps = predictions - tps
            
            tps = tps.sum(dim=0)
            fps = fps.sum(dim=0)
            lbls = labels.sum(dim=0)  
                
            for c in range(num_classes):
                class_tp[c] += tps[c]
                class_fp[c] += fps[c]
                class_total[c] += lbls[c]
                    
            correct += tps.sum()

    class_prec = []
    class_recall = []
    freqs = []
    for c in range(num_classes):
        class_prec.append(0 if class_tp[c] == 0 else
                          class_tp[c] / (class_tp[c] + class_fp[c]))
        class_recall.append(0 if class_tp[c] == 0 else
                            class_tp[c] / class_total[c])
        freqs.append(class_total[c])

    freqs = torch.tensor(freqs)
    class_weights = 1. / freqs
    class_weights /= class_weights.sum()
    class_prec = torch.tensor(class_prec)
    class_recall = torch.tensor(class_recall)
    prec = (class_prec * class_weights).sum()
    recall = (class_recall * class_weights).sum()
    f1 = 2. / (1/prec + 1/recall)
    val_loss = loss / size
    accuracy = correct / freqs.sum()
    results = {"loss": val_loss, "accuracy": accuracy, "f1": f1,\
               "precision": prec, "recall": recall}

    if class_metrics:
        class_results = []
        for p, r in zip(class_prec, class_recall):
            f1 = (0 if p == r == 0 else 2. / (1/p + 1/r))
            class_results.append({"f1": f1, "precision": p, "recall": r})
        results = results, class_results

    return results

### 3.3 Tensorboard logging (optional)

Evaluation metrics and losses produced by the `validation_loop()` function on train and validation data can be logged to a [Tensorboard `SummaryWriter`](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html) which allows you to observe training graphically via the following function:

In [None]:
def update_graphs(summary_writer, epoch, train_results, test_results,
                  train_class_results=None, test_class_results=None, 
                  class_names = None, mbatch_group=-1, mbatch_count=0, mbatch_losses=None):
    if mbatch_group > 0:
        for i in range(len(mbatch_losses)):
            summary_writer.add_scalar("Losses/Train mini-batches",
                                  mbatch_losses[i],
                                  epoch * mbatch_count + (i+1)*mbatch_group)

    summary_writer.add_scalars("Losses/Train Loss vs Test Loss",
                               {"Train Loss" : train_results["loss"],
                                "Test Loss" : test_results["loss"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train Accuracy vs Test Accuracy",
                               {"Train Accuracy" : train_results["accuracy"],
                                "Test Accuracy" : test_results["accuracy"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train F1 vs Test F1",
                               {"Train F1" : train_results["f1"],
                                "Test F1" : test_results["f1"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train Precision vs Test Precision",
                               {"Train Precision" : train_results["precision"],
                                "Test Precision" : test_results["precision"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train Recall vs Test Recall",
                               {"Train Recall" : train_results["recall"],
                                "Test Recall" : test_results["recall"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    if train_class_results and test_class_results:
        for i in range(len(train_class_results)):
            summary_writer.add_scalars(f"Class Metrics/{class_names[i]}/Train F1 vs Test F1",
                                       {"Train F1" : train_class_results[i]["f1"],
                                        "Test F1" : test_class_results[i]["f1"]},
                                       (epoch + 1) if not mbatch_group > 0
                                             else (epoch + 1) * mbatch_count)

            summary_writer.add_scalars(f"Class Metrics/{class_names[i]}/Train Precision vs Test Precision",
                                       {"Train Precision" : train_class_results[i]["precision"],
                                        "Test Precision" : test_class_results[i]["precision"]},
                                       (epoch + 1) if not mbatch_group > 0
                                             else (epoch + 1) * mbatch_count)

            summary_writer.add_scalars(f"Class Metrics/{class_names[i]}/Train Recall vs Test Recall",
                                       {"Train Recall" : train_class_results[i]["recall"],
                                        "Test Recall" : test_class_results[i]["recall"]},
                                       (epoch + 1) if not mbatch_group > 0
                                             else (epoch + 1) * mbatch_count)
    summary_writer.flush()

## 4. The skeleton of the model training and validation program

Your main program should have more or less the following sections and control flow:

## Config

In [None]:
"""
Configuration for COCO multi-label classification training.
Defines model type, preprocessing, optimizer, and training hyperparameters.
Specifies dataset directories and batch/data loader settings.
Includes validation split and threshold for multi-label predictions.
"""
import os

CONFIG = {
    "model": "mobilenet_v3_small",  # Options: mobilenet_v3_small, efficientnet_b0, resnet50, resnet18
    "pretrained": True,              # Use pretrained weights
    "batch_size": 32,                # Batch size for training
    "image_size": 224,               # Input image size
    "num_epochs": 15,                # Number of training epochs
    "learning_rate": 1e-4,           # Optimizer learning rate
    "optimizer": "adamw",            # Options: adam, adamw, sgd
    "weight_decay": 1e-3,            # Weight decay for AdamW
    "max_cpus": 14,                  # Number of CPU threads for data loading
    "train_dir": "ms-coco/images/train-resized/train-resized",  # Training images
    "label_dir": "ms-coco/labels/train/train",                  # Training labels
    "test_dir": "ms-coco/images/test-resized/test-resized",     # Test images
    "validation_split": 0.2,         # Fraction of train set used for validation
    "threshold": 0.5,                # Default threshold for multi-label predictions
    "freeze_backbone": False,        # Freeze feature extractor/backbone for fine-tuning
}

## Tools to train

In [None]:
def update_graphs(summary_writer, epoch, train_results, test_results,
                  train_class_results=None, test_class_results=None, 
                  class_names = None, mbatch_group=-1, mbatch_count=0, mbatch_losses=None):
    if mbatch_group > 0:
        for i in range(len(mbatch_losses)):
            summary_writer.add_scalar("Losses/Train mini-batches",
                                  mbatch_losses[i],
                                  epoch * mbatch_count + (i+1)*mbatch_group)

    summary_writer.add_scalars("Losses/Train Loss vs Test Loss",
                               {"Train Loss" : train_results["loss"],
                                "Test Loss" : test_results["loss"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train Accuracy vs Test Accuracy",
                               {"Train Accuracy" : train_results["accuracy"],
                                "Test Accuracy" : test_results["accuracy"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train F1 vs Test F1",
                               {"Train F1" : train_results["f1"],
                                "Test F1" : test_results["f1"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train Precision vs Test Precision",
                               {"Train Precision" : train_results["precision"],
                                "Test Precision" : test_results["precision"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train Recall vs Test Recall",
                               {"Train Recall" : train_results["recall"],
                                "Test Recall" : test_results["recall"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    if train_class_results and test_class_results:
        for i in range(len(train_class_results)):
            summary_writer.add_scalars(f"Class Metrics/{class_names[i]}/Train F1 vs Test F1",
                                       {"Train F1" : train_class_results[i]["f1"],
                                        "Test F1" : test_class_results[i]["f1"]},
                                       (epoch + 1) if not mbatch_group > 0
                                             else (epoch + 1) * mbatch_count)

            summary_writer.add_scalars(f"Class Metrics/{class_names[i]}/Train Precision vs Test Precision",
                                       {"Train Precision" : train_class_results[i]["precision"],
                                        "Test Precision" : test_class_results[i]["precision"]},
                                       (epoch + 1) if not mbatch_group > 0
                                             else (epoch + 1) * mbatch_count)

            summary_writer.add_scalars(f"Class Metrics/{class_names[i]}/Train Recall vs Test Recall",
                                       {"Train Recall" : train_class_results[i]["recall"],
                                        "Test Recall" : test_class_results[i]["recall"]},
                                       (epoch + 1) if not mbatch_group > 0
                                             else (epoch + 1) * mbatch_count)
    summary_writer.flush()

## Train Loop

In [None]:
import torch
from tqdm import tqdm

def train_loop(train_loader, net, criterion, optimizer, device,
               mbatch_loss_group=-1):
    net.train()
    running_loss = 0.0
    mbatch_losses = []

    for i, data in enumerate(tqdm(train_loader, desc="Training", leave=True)):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # following condition False by default, unless mbatch_loss_group > 0
        if i % mbatch_loss_group == mbatch_loss_group - 1:
            mbatch_losses.append(running_loss / mbatch_loss_group)
            running_loss = 0.0
    if mbatch_loss_group > 0:
        return mbatch_losses
    else:
        return running_loss / len(train_loader)  # add average loss return


def validation_loop(val_loader, net, criterion, num_classes, device,
                    multi_task=False, th_multi_task=0.5, one_hot=True, class_metrics=False):
    net.eval()
    loss = 0
    correct = 0
    size = len(val_loader.dataset)
    class_total = {label:0 for label in range(num_classes)}
    class_tp = {label:0 for label in range(num_classes)}
    class_fp = {label:0 for label in range(num_classes)}
    with torch.no_grad():
        for data in tqdm(val_loader, desc="Validating", leave=True):
            images, labels = data[0].to(device), data[1].to(device)
            outputs = net(images)
            loss += criterion(outputs, labels).item() * images.size(0)
            if not multi_task:    
                predictions = torch.zeros_like(outputs)
                predictions[torch.arange(outputs.shape[0]), torch.argmax(outputs, dim=1)] = 1.0
            else:
                predictions = torch.where(outputs > th_multi_task, 1.0, 0.0)
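            if not one_hot:
                # Convert integer class labels to one-hot rows
                labels_mat = torch.zeros_like(outputs)
                labels_mat[torch.arange(outputs.shape[0]), labels] = 1.0
                labels = labels_mat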
                
            tps = predictions * labels
            fps = predictions - tps
            
            tps = tps.sum(dim=0)
            fps = fps.sum(dim=0)
            lbls = labels.sum(dim=0)  
                
            for c in range(num_classes):
                class_tp[c] += tps[c]
                class_fp[c] += fps[c]
                class_total[c] += lbls[c]
                    
            correct += tps.sum()

    class_prec = []
    class_recall = []
    freqs = []
    for c in range(num_classes):
        class_prec.append(0 if class_tp[c] == 0 else
                          class_tp[c] / (class_tp[c] + class_fp[c]))
        class_recall.append(0 if class_tp[c] == 0 else
                            class_tp[c] / class_total[c])
        freqs.append(class_total[c])

    freqs = torch.tensor(freqs)
    class_weights = 1. / freqs
    class_weights /= class_weights.sum()
    class_prec = torch.tensor(class_prec)
    class_recall = torch.tensor(class_recall)
    prec = (class_prec * class_weights).sum()
    recall = (class_recall * class_weights).sum()
    f1 = 2. / (1/prec + 1/recall)
    val_loss = loss / size
    accuracy = correct / freqs.sum()
    results = {"loss": val_loss, "accuracy": accuracy, "f1": f1,\
               "precision": prec, "recall": recall}

    if class_metrics:
        class_results = []
        for p, r in zip(class_prec, class_recall):
            f1 = (0 if p == r == 0 else 2. / (1/p + 1/r))
            class_results.append({"f1": f1, "precision": p, "recall": r})
        results = results, class_results

    return results

import torch

@torch.no_grad()
def collect_val_probs_and_labels(val_loader, model, device):
    """Collect raw sigmoid probabilities and labels from the validation set."""
    model.eval()
    all_probs = []
    all_labels = []
    for images, labels in val_loader:
        images = images.to(device)
        labels = labels.to(device)
        logits = model(images)
        probs = torch.sigmoid(logits)
        all_probs.append(probs.cpu())
        all_labels.append(labels.cpu())
    return torch.cat(all_probs, dim=0), torch.cat(all_labels, dim=0)


def f1_weighted_precision_recall(preds, labels):
    preds = preds.float()
    labels = labels.float()

    tps = (preds * labels).sum(dim=0)
    fps = (preds * (1 - labels)).sum(dim=0)
    freqs = labels.sum(dim=0).clamp(min=1e-9)

    class_prec = tps / (tps + fps + 1e-9)
    class_recall = tps / freqs

    class_weights = 1.0 / freqs
    class_weights = class_weights / class_weights.sum()

    prec = (class_prec * class_weights).sum().item()
    rec = (class_recall * class_weights).sum().item()
    # Guard against division by zero when either mean is exactly zero
    f1 = 0.0 if (prec == 0.0 or rec == 0.0) else 2.0 / (1.0 / prec + 1.0 / rec)
    return f1, prec, rec


def sweep_thresholds(probs, labels, thresholds=None, prefer_precision=None):
    """
    Try several thresholds to find the one giving best F1.
    If prefer_precision is set (e.g. 0.4), keep only thresholds with precision >= that.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(5, 96, 5)]  # 0.05 .. 0.95

    results = []
    for t in thresholds:
        preds = (probs >= t).float()
        f1, p, r = f1_weighted_precision_recall(preds, labels)
        results.append((t, f1, p, r))

    if prefer_precision is not None:
        eligible = [x for x in results if x[2] >= prefer_precision]
        if eligible:
            return max(eligible, key=lambda x: x[1])  # best F1 among precision >= target

    return max(results, key=lambda x: x[1])  # best F1 overall


## Main Loop

In [None]:
"""
Main training script for COCO multi-label classification.
Supports optional freezing of backbone layers.
"""

import os
from datetime import datetime

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torch.utils.tensorboard import SummaryWriter
from torchvision import transforms
from torchvision.models import (
    mobilenet_v3_small, MobileNet_V3_Small_Weights,
    resnet18, ResNet18_Weights,
)

# Local modules
from dataset import COCOTrainImageDataset, COCOTestImageDataset
from loops import train_loop, validation_loop
from utils import (
    update_graphs,
    collect_val_probs_and_labels,
    sweep_thresholds,
)
from config import CONFIG


# Data preprocessing

def get_preprocessing_transform(model_name: str, train: bool = True, pretrained: bool = True) -> transforms.Compose:
    """Return preprocessing transforms depending on model and phase."""
    model_name = model_name.lower()

    # Select normalization weights
    if model_name == "mobilenet_v3_small":
        weights = MobileNet_V3_Small_Weights.IMAGENET1K_V1 if pretrained else None
    elif model_name == "resnet18":
        weights = ResNet18_Weights.DEFAULT if pretrained else None
    else:
        raise ValueError(f"Unsupported model: {model_name}")

    if weights is not None:
        norm_mean = weights.transforms().mean
        norm_std = weights.transforms().std
    else:
        norm_mean = [0.485, 0.456, 0.406]
        norm_std = [0.229, 0.224, 0.225]

    if train:
        # Training transform with augmentation
        return transforms.Compose([
            transforms.RandomResizedCrop(CONFIG["image_size"], scale=(0.8, 1.0)),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomRotation(degrees=15),
            transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
            transforms.ToTensor(),
            transforms.Normalize(mean=norm_mean, std=norm_std),
        ])

    # Validation/test transform (deterministic)
    return transforms.Compose([
        transforms.Resize((CONFIG["image_size"], CONFIG["image_size"])),
        transforms.ToTensor(),
        transforms.Normalize(mean=norm_mean, std=norm_std),
    ])



# Model selection and modification

def get_model(config: dict, device: torch.device, freeze_backbone: bool = False) -> torch.nn.Module:
    """
    Load base model, optionally freeze backbone, and modify classifier for 80 COCO classes.
    """
    model_name = config["model"].lower()
    pretrained = config.get("pretrained", True)

    if model_name == "mobilenet_v3_small":
        weights = MobileNet_V3_Small_Weights.IMAGENET1K_V1 if pretrained else None
        model = mobilenet_v3_small(weights=weights)
        in_features = model.classifier[3].in_features
        model.classifier[3] = nn.Sequential(
            nn.Dropout(p=0.3),
            nn.Linear(in_features, 80),
        )
        if freeze_backbone:
            # Freeze all layers except classifier
            for param in model.features.parameters():
                param.requires_grad = False

    elif model_name == "resnet18":
        weights = ResNet18_Weights.DEFAULT if pretrained else None
        model = resnet18(weights=weights)
        model.fc = nn.Linear(model.fc.in_features, 80)
        if freeze_backbone:
            # Freeze all layers except final fully connected
            for name, param in model.named_parameters():
                if "fc" not in name:
                    param.requires_grad = False

    else:
        raise ValueError(f"Unsupported model: {config['model']}")

    return model.to(device)


# Training pipeline

def main() -> None:
    """Main training function."""
    # Device setup 
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("GPU detected. Using CUDA.")
    else:
        print("No GPU detected. Training on CPU.")
        choice = input("Continue on CPU? (y/n): ").strip().lower()
        if choice == "y":
            device = torch.device("cpu")
        else:
            print("Exiting. Please use a GPU machine.")
            return
    print(f"Using device: {device}")

    # Datasets and loaders 
    train_transform = get_preprocessing_transform(CONFIG["model"], train=True)
    val_transform = get_preprocessing_transform(CONFIG["model"], train=False)

    full_train_dataset = COCOTrainImageDataset(
        img_dir=CONFIG["train_dir"],
        annotations_dir=CONFIG["label_dir"],
        transform=train_transform,
    )
    test_dataset = COCOTestImageDataset(
        img_dir=CONFIG["test_dir"],
        transform=val_transform,
    )

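    val_size = int(CONFIG["validation_split"] * len(full_train_dataset))
    train_size = len(full_train_dataset) - val_size
    # Split by index so the validation subset can use the deterministic
    # transform instead of the training-time augmentation
    indices = torch.randperm(len(full_train_dataset)).tolist()
    full_val_dataset = COCOTrainImageDataset(
        img_dir=CONFIG["train_dir"],
        annotations_dir=CONFIG["label_dir"],
        transform=val_transform,
    )
    train_dataset = torch.utils.data.Subset(full_train_dataset, indices[:train_size])
    val_dataset = torch.utils.data.Subset(full_val_dataset, indices[train_size:])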

    train_loader = DataLoader(train_dataset, batch_size=CONFIG["batch_size"], shuffle=True, num_workers=CONFIG["max_cpus"])
    val_loader = DataLoader(val_dataset, batch_size=CONFIG["batch_size"], shuffle=False, num_workers=CONFIG["max_cpus"])
    test_loader = DataLoader(test_dataset, batch_size=CONFIG["batch_size"], shuffle=False, num_workers=CONFIG["max_cpus"])

    print(f"DataLoaders ready: train={len(train_loader)}, val={len(val_loader)}, test={len(test_loader)}")

    # Check sample batch
    image, label = next(iter(train_loader))
    print(f"Sample batch shapes - images: {image.shape}, labels: {label.shape}")

    # Model, loss, optimizer 
    freeze_backbone = CONFIG.get("freeze_backbone", False)
    model = get_model(CONFIG, device, freeze_backbone=freeze_backbone)
    criterion = nn.BCEWithLogitsLoss()

    optimizer_name = CONFIG["optimizer"].lower()
    if optimizer_name == "adam":
        optimizer = torch.optim.Adam(model.parameters(), lr=CONFIG["learning_rate"])
    elif optimizer_name == "adamw":
        optimizer = torch.optim.AdamW(model.parameters(), lr=CONFIG["learning_rate"], weight_decay=CONFIG["weight_decay"])
    elif optimizer_name == "sgd":
        optimizer = torch.optim.SGD(model.parameters(), lr=CONFIG["learning_rate"], momentum=0.9)
    else:
        raise ValueError(f"Unsupported optimizer: {CONFIG['optimizer']}")

    # TensorBoard logging 
    run_name = f"{CONFIG['model']}_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
    writer = SummaryWriter(log_dir=os.path.join("runs", run_name))
    print(f"TensorBoard logs at: runs/{run_name}")

    # Training loop 
    best_f1 = 0.0
    best_model_path = "best_model.pth"

    for epoch in range(CONFIG["num_epochs"]):
        print(f"\nEpoch {epoch + 1}/{CONFIG['num_epochs']}")

        _ = train_loop(train_loader, model, criterion, optimizer, device)
        train_results = validation_loop(train_loader, model, criterion, num_classes=80, device=device, multi_task=True)
        val_results = validation_loop(val_loader, model, criterion, num_classes=80, device=device, multi_task=True)

        update_graphs(writer, epoch, train_results, val_results)
        # Save the model whenever the validation F1 improves, because
        # the challenge is ranked on the F1 score
        if val_results["f1"] > best_f1:
            best_f1 = val_results["f1"]
            torch.save(model.state_dict(), best_model_path)
            print(f"New best model saved with F1 = {best_f1:.4f}")

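    # Threshold sweep on the best checkpoint (the in-memory weights are from the last epoch)
    if best_f1 > 0.0:
        model.load_state_dict(torch.load(best_model_path, map_location=device))
    val_probs, val_labels = collect_val_probs_and_labels(val_loader, model, device)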
    best_t, best_f1_sweep, best_prec, best_rec = sweep_thresholds(val_probs, val_labels)
    print(f"Best threshold = {best_t:.2f} | F1 = {best_f1_sweep:.4f} | Precision = {best_prec:.4f} | Recall = {best_rec:.4f}")

    writer.close()


if __name__ == "__main__":
    from multiprocessing import freeze_support
    freeze_support()
    main()

    
    

Using cpu device.



## 5. The skeleton of the test submission program

This, much simpler, program should have the following sections and control flow:

In [None]:
#  test.py — Inference

import json
import time

import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.models import (
    MobileNet_V3_Small_Weights,
    EfficientNet_B0_Weights,
    ResNet50_Weights,
)

from dataset import COCOTestImageDataset   # <- make sure you use your correct dataset file
from main import get_model                 # we reuse your get_model() to build the network
from config import CONFIG


def get_test_transform(config):
    """
    Returns the deterministic test transform for the selected model.
    Uses the same preprocessing as the original pretrained weights.
    """
    model_name = config["model"].lower()
    pretrained = config.get("pretrained", True)

    if model_name == "mobilenet_v3_small":
        weights = MobileNet_V3_Small_Weights.IMAGENET1K_V1 if pretrained else None
    elif model_name == "efficientnet_b0":
        weights = EfficientNet_B0_Weights.IMAGENET1K_V1 if pretrained else None
    elif model_name == "resnet50":
        weights = ResNet50_Weights.DEFAULT if pretrained else None
    else:
        raise ValueError(f"Unsupported model: {config['model']}")

    if weights is not None:
        return weights.transforms()
    else:
        return transforms.Compose([
            transforms.Resize((config["image_size"], config["image_size"])),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ])


def main():
    # Device
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("GPU detected. Using CUDA for inference.")
    else:
        device = torch.device("cpu")
        print("No GPU detected. Using CPU.")

    # Dataset & DataLoader 
    test_transform = get_test_transform(CONFIG)
    test_dataset = COCOTestImageDataset(
        img_dir=CONFIG["test_dir"],
        transform=test_transform
    )
    test_loader = DataLoader(
        test_dataset,
        batch_size=CONFIG["batch_size"],
        shuffle=False,
        num_workers=CONFIG["max_cpus"]
    )

    # Load model
    model = get_model(CONFIG, device)
    model.load_state_dict(torch.load("best_model.pth", map_location=device))
    model.eval()

    print(f"Loaded model from best_model.pth — starting inference on {len(test_dataset)} images...")
    start_time = time.time()

    # Prediction loop 
    predictions = {}
    with torch.no_grad():
        for images, image_ids in test_loader:
            images = images.to(device)
            outputs = model(images)
            probs = torch.sigmoid(outputs)
            preds = (probs >= CONFIG["threshold"]).cpu().numpy()

            for img_id, pred in zip(image_ids, preds):
                predictions[img_id] = pred.nonzero()[0].tolist()

    # Save results 
    output_json = "test_predictions.json"
    with open(output_json, "w") as f:
        json.dump(predictions, f, indent=4)

    elapsed = time.time() - start_time
    print(f"\nInference complete! Processed {len(test_dataset)} images "
          f"in {elapsed/60:.2f} min ({elapsed:.1f} s total).")
    print(f"Predictions saved to {output_json}")


if __name__ == "__main__":
    from multiprocessing import freeze_support
    freeze_support()
    main()
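
The prediction step in the program above (sigmoid, threshold, then the indices of the active classes) can be illustrated without any model. The following is a minimal sketch in plain Python; the image identifiers, the logit values and the helper name `decode_predictions` are hypothetical and only serve the illustration.

```python
import json
import math


def decode_predictions(logits, threshold=0.5):
    """Return the indices of the classes whose sigmoid probability reaches the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [i for i, p in enumerate(probs) if p >= threshold]


# Hypothetical raw network outputs for two images over five classes.
batch_logits = {
    "img_a": [2.0, -1.5, 0.3, -3.0, 0.0],
    "img_b": [-2.0, -0.7, -0.1, -4.0, -1.0],
}

# Same structure as the JSON file: one list of class indices per image.
predictions = {img_id: decode_predictions(z) for img_id, z in batch_logits.items()}
print(json.dumps(predictions, indent=4))
```

Unlike the single-label case, an image can end up with several class indices or with none at all, as `img_b` does here.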



## 6. Our Approach

The purpose of the task was, in our view, to find a "good enough" model and parameter settings for multi-label classification on this specific dataset. We were free to choose from the large list of pre-trained models available in the `PyTorch` library.

This task alone is very time-consuming: it requires testing the classification while varying several parameters, and also comparing different models to see which one fits best. We understood that we would have to make some decisions to keep the work manageable. Our approach was therefore the following:

- choosing a couple of models from the PyTorch list 
- choosing a couple of parameters that we would modify along the testing phase

Fixing these two choices up front was very important for us; otherwise the work would have taken too long. Choosing the "best" model was also out of reach for us, given our limited resources and time.

We eventually decided to choose the models with energy use in mind. Training AI models is known to consume a great deal of energy and resources; this pressing concern is widely discussed and raises questions about the ethical use of these tools. It led us to look for the best-performing model among the smallest ones, that is, those with the fewest parameters, in order to save time and resources. We understood that, with this decision, our final model would score lower on the leaderboard than other, bigger models. We still decided to search among the few fairly small models available in `PyTorch` for one that is "good enough". The models we decided to test are the following:

- MobileNetV3-Small
- ResNet-18

We deliberately focused on these two architectures because they are computationally efficient, architecturally well suited to multi-label image classification, and have few parameters compared to other architectures. We had also studied them in class.

MobileNetV3-Small was designed specifically for low-resource environments. Architecturally, it combines depthwise separable convolutions (greatly reducing the number of parameters and multiplications) with inverted residual blocks and linear bottlenecks, which maintain representational power while keeping the model extremely light. It also integrates squeeze-and-excitation attention blocks, improving the network’s ability to focus on informative channels, which matters in multi-label problems where several objects may share the image. Its small parameter count (about 2.5 million in its standard ImageNet configuration) also makes it fast to train and cheap in memory and energy.
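
To make the parameter saving of depthwise separable convolutions concrete, here is a small arithmetic sketch. The layer sizes (64 input channels, 128 output channels, 3×3 kernel) are arbitrary values chosen for illustration, not taken from MobileNetV3 itself, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


c_in, c_out, k = 64, 128, 3  # illustrative layer sizes (assumption)
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std}, depthwise separable: {sep}, ratio: {std / sep:.1f}x")
```

For this layer, the separable version needs roughly eight times fewer weights than the standard convolution.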

ResNet-18, on the other hand, is a more classical but still efficient architecture that we studied in class. Its residual skip connections allow deeper networks to train without vanishing gradients, while its parameter count (about 11.7 million) stays moderate compared to very deep ResNets and other models. Although it uses standard convolutions (no depthwise separable trick), its straightforward structure is robust and proven for general image recognition. For a multi-label task, the residual connections help the model learn richer representations without becoming prohibitively heavy.

We explicitly avoided architectures such as ConvNeXt, which, while state-of-the-art, are much deeper and more computationally expensive, and models like VGG or DenseNet, which are parameter-heavy with less efficient use of computation. Our two chosen models represent different CNN design evolutions: classic residual learning (ResNet-18) and mobile-optimized depthwise convolutions (MobileNetV3). This mix gives us a practical space to experiment with accuracy vs efficiency trade-offs for our dataset while keeping training feasible and energy-aware.

We then chose which parameters to focus on before diving into the code and the testing. There are many parameters to experiment with when refining a model. Again for efficiency, we settled on a few of the most impactful ones and kept the same set for both models, so that the comparison stays fair. We divided these parameters into two groups:

- efficiency parameters:
    - number of cpus
    - batch size

- learning parameters:
    - number of epochs
    - learning rate
    - image size
    - optimizer
    - weight decay
    - dropout layer
    - threshold

These are the parameters with the greatest impact on a model's behaviour. There are more, but we did not have the time to build a large test pool covering all of them.

We then laid out our plan of attack. We first set up our working environment: a `GitHub` repository for our code and a shared `Drive` for our results (Excel and Word). We then needed to write a `main.py` script to train our models and a `test.py` script to generate the JSON file we would eventually upload to the leaderboard. The main modules of the code were already given in this notebook. Then, each of us took one model and tested its performance while modifying its parameters to find the best-performing configuration.

## 8. Results and analysis

## 8.1 MobileNetV3-Small

We trained **MobileNetV3-Small** without freezing any layers, so that the full network could adapt to the dataset. The goal was to balance **training speed**, **generalization**, and **F1 performance**.

---

### 8.1.1 Initial Runs (Run 1–3)

- **Run 1:** Quick test to estimate training time and initial F1 score. We also wanted to find the best efficiency parameters for our model.
  - **Parameters:** batch size 32, epochs 5, lr 0.001, image size 224, optimizer Adam, no dropout, threshold 0.5  
  - **Best F1:** 0.3585, **Time:** 17 min  

  ![Run 1](images/1.png)

- **Run 2:** Increased epochs to 8; F1 remained similar.  
  - **Parameters:** batch size 32, epochs 8, lr 0.001, image size 224, optimizer Adam, no dropout, threshold 0.5  
  - **Best F1:** 0.3555, **Time:** 20 min  

- **Run 3:** Increased batch size to 64 and epochs to 15; slight F1 improvement but signs of overfitting appeared.  
  - **Parameters:** batch size 64, epochs 15, lr 0.001, image size 224, optimizer Adam, no dropout, threshold 0.5  
  - **Best F1:** 0.3786, **Time:** 33 min  

---

### 8.1.2 Generalization Improvements (Run 4–5)

- **Run 4:** Changed optimizer to `AdamW`, reduced learning rate, increased image size.  
  - **Parameters:** batch size 64, epochs 10, lr 0.0001, image size 256, optimizer AdamW, weight decay 0.0001, no dropout, threshold 0.5  
  - **Best F1:** 0.3814, **Time:** 42 min  

  ![Run 4](images/5.png)  
  ![Run 4 Metrics](images/6.png)  
  ![Run 4 Metrics](images/7.png)

- **Run 5:** Added dropout and increased weight decay; final F1 similar to run 4.  
  - **Parameters:** batch size 64, epochs 10, lr 0.0003, image size 256, optimizer AdamW, weight decay 0.0003, dropout Yes, threshold 0.5  
  - **Best F1:** 0.38, **Time:** 45 min  

---

### 8.1.3 Aggressive Generalization & Threshold Tuning (Run 6–7)

- **Run 6:** Increased epochs to 20, added `ColorJitter`, higher weight decay. Validation F1 improved but test F1 dropped; training time too long.  
  - **Parameters:** batch size 64, epochs 20, lr 0.0003, image size 256, optimizer AdamW, weight decay 0.001, dropout Yes, threshold 0.5  

- **Run 7:** Reduced epochs, batch size, and image size; applied **threshold sweeping** to optimize precision/recall balance.  
  - **Parameters:** batch size 32, epochs 15, lr 0.0001, image size 224, optimizer AdamW, weight decay 0.001, dropout Yes, threshold Sweeping  
  - **Best F1:** 0.3971, **Time:** 69 min  

  ![Run 7 Metrics](images/8.png)  
  ![Run 7 Metrics](images/9.png)

The gap between precision on the validation set and on the training set narrowed after this run, so we decided this was our best "good enough" configuration for this particular model, while still keeping it small.

We had managed to reduce the overfitting that our model was suffering from.
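
The threshold sweeping used in Run 7 can be sketched as follows. This is a simplified plain-Python illustration that assumes a micro-averaged F1 and a single global threshold; the helper names and the toy probabilities are hypothetical, and the actual `sweep_thresholds` used in `main.py` may differ.

```python
def micro_prf(probs, labels, threshold):
    """Micro-averaged precision, recall and F1 for one global threshold."""
    tp = fp = fn = 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            pred = p >= threshold
            if pred and y:
                tp += 1
            elif pred and not y:
                fp += 1
            elif not pred and y:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def sweep(probs, labels, thresholds):
    """Return (threshold, f1, precision, recall) for the threshold with the highest F1."""
    best = (None, -1.0, 0.0, 0.0)
    for t in thresholds:
        precision, recall, f1 = micro_prf(probs, labels, t)
        if f1 > best[1]:
            best = (t, f1, precision, recall)
    return best


# Toy validation outputs: 3 images, 3 classes (illustrative values only).
probs = [[0.9, 0.4, 0.1], [0.37, 0.8, 0.2], [0.6, 0.32, 0.45]]
labels = [[1, 1, 0], [1, 1, 0], [1, 0, 1]]
thresholds = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

best_t, best_f1, best_p, best_r = sweep(probs, labels, thresholds)
_, _, f1_default = micro_prf(probs, labels, 0.5)
print(f"F1 at 0.5 = {f1_default:.3f} | best threshold = {best_t:.2f} with F1 = {best_f1:.3f}")
```

In this toy example the default threshold of 0.5 misses several true labels whose probability is slightly lower, which is the kind of precision/recall imbalance the sweep corrects. The threshold should be selected on validation data only and then written to the configuration used for inference.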

---

### 8.1.4 Summary Table

| Run | Batch Size | Epochs | LR      | Image Size | Optimizer | Weight Decay | Dropout | Threshold | Best F1 | Time (min) |
|-----|-----------|--------|---------|------------|-----------|--------------|---------|-----------|---------|------------|
| 1   | 32        | 5      | 0.001   | 224        | Adam      | 0            | No      | 0.5       | 0.3585  | 17         |
| 2   | 32        | 8      | 0.001   | 224        | Adam      | 0            | No      | 0.5       | 0.3555  | 20         |
| 3   | 64        | 15     | 0.001   | 224        | Adam      | 0            | No      | 0.5       | 0.3786  | 33         |
| 4   | 64        | 10     | 0.0001  | 256        | AdamW     | 0.0001       | No      | 0.5       | 0.3814  | 42         |
| 5   | 64        | 10     | 0.0003  | 256        | AdamW     | 0.0003       | Yes     | 0.5       | 0.38    | 45         |
| 6   | 64        | 20     | 0.0003  | 256        | AdamW     | 0.001        | Yes     | 0.5       | -       | -          |
| 7   | 32        | 15     | 0.0001  | 224        | AdamW     | 0.001        | Yes     | Sweeping  | 0.3971  | 69         |
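
Since training cost was one of our selection criteria, the table can also be read in terms of minutes per epoch. The short sketch below only divides the reported total time by the number of epochs; it assumes that the reported time covers the whole run, so fixed start-up overhead inflates the figure for short runs.

```python
# (epochs, total minutes) per run, copied from the summary table; Run 6 has no recorded time.
runs = {1: (5, 17), 2: (8, 20), 3: (15, 33), 4: (10, 42), 5: (10, 45), 7: (15, 69)}

minutes_per_epoch = {run: round(minutes / epochs, 2) for run, (epochs, minutes) in runs.items()}

for run, value in minutes_per_epoch.items():
    print(f"Run {run}: {value} min/epoch")
```

The runs at image size 256 (Runs 4 and 5) and the final run with the extra regularization and sweeping (Run 7) cost roughly twice as much per epoch as Runs 2 and 3.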


## 8.2 ResNet-18

### 8.2.1 Introduction

ResNet-18 is a deep learning model with 18 layers and about 11.7M weights. It uses skip connections to train more effectively.

We use it with the MS COCO dataset because it is lightweight, fast, and still powerful enough to learn useful features from COCO’s large variety of images and objects.

To train this model, we switched to a faster computer, equipped with an NVIDIA RTX A2000 8GB GPU and an Intel i7 CPU.

### 8.2.2 Why we chose ResNet-18


Efficiency: ResNet-18 is computationally less demanding than deeper models, making it suitable for tasks requiring fast inference. That means less waiting time and lower power consumption, which is a core value for us.

Transfer Learning: Pretrained ResNet-18 models on ImageNet can be fine-tuned for COCO, leveraging learned features for improved performance.

Proven Architecture: ResNet architectures, including ResNet-18, have demonstrated strong performance in various vision tasks, including classification.



### 8.2.3 Let's Run

---

#### A. Initial Run with ResNet18 (Unfrozen)

For our first experiment, we used the same parameters as MobileNet to establish a baseline and identify potential areas for improvement.

- **Setup:** ResNet18 with all layers trainable  
- **Training:** 2 hours, 10 epochs  
- **Learning rate:** 1e-4  

**Results:**  
- F1 score: 0.42  
- Loss: still decreasing  

**Observations:**  
- The model was still learning, suggesting that increasing the number of epochs could improve results.  
- Increasing the learning rate to 1e-3 might allow faster convergence.  
- Early visualizations indicated that certain classes were underrepresented, hinting at a potential data imbalance issue.

![Alt text](images/Metrics-Resnet18-1.png)

---

#### B. Freezing the Backbone

Next, we froze the backbone of ResNet18 to focus training on the classifier layers.

- **Setup:** Backbone frozen, fine-tuning only the classifier  
- **Training:** 11 epochs  

**Results:**  
- F1 score after 11 epochs: 0.51  
- F1 score after 10 epochs: 0.46  

**Observations:**  
- Overfitting observed at 11 epochs, with accuracy decreasing.  
- Adjustments made to address overfitting:
  - Reduce epochs to 10  
  - Add a dropout layer with p = 0.4  
  - Resize images to 224 × 224 (the input size ResNet18 was originally trained with, for better performance)  

**Note:**  
- The dropout run gave an F1 score of 0.010 after 3 epochs, so it was stopped early.  
- This suggests that dropout placement or probability needs tuning rather than a blanket addition.

---

#### C. Freezing Backbone + Resetting Last Layer

We experimented with freezing the backbone while resetting only the final layer weights.

- **Setup:** Frozen backbone, last layer reinitialized  
- **Training:** 10 epochs  
- **Learning rate:** 1e-3  

**Results:**  
- F1 score: 0.48  
- Training time reduced by ~40% compared to fully unfrozen model  

**Observations:**  
- Faster convergence due to fewer trainable parameters.  
- Slightly lower F1 than full fine-tuning, indicating that some deeper layers might still benefit from fine-tuning.  
- Suggested future approach: selectively unfreeze top blocks of the backbone.

---

#### D. Image Augmentation & Resizing

To improve generalization, we applied augmentation strategies:

- **Techniques:** Random horizontal flip, random rotation, color jitter  
- **Image size:** 224 × 224  

**Results:**  
- F1 score improved slightly (~0.50)  
- Reduced overfitting, especially on minority classes  

**Observations:**  
- Augmentation helped balance class performance.  
- Further improvement could be achieved by adding advanced augmentation like CutMix or MixUp.

---

#### E. Learning Rate Tuning

We experimented with a higher learning rate and learning rate scheduling:

- **Setup:** Initial LR = 1e-3, ReduceLROnPlateau scheduler  
- **Training:** 10 epochs  

**Results:**  
- F1 score increased to 0.52
- Loss plateaued earlier but stabilized  

**Observations:**  
- Adaptive LR helped avoid overshooting minima.  
- Future runs could combine scheduler with fine-tuned backbone layers.
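
The idea behind a plateau-based scheduler can be sketched in a few lines. This is a deliberately simplified re-implementation for illustration, not PyTorch's `ReduceLROnPlateau` (it ignores the improvement threshold and cooldown options); the class name, the `factor` and `patience` values, the monitored quantity and the loss values are all assumptions.

```python
class PlateauLR:
    """Cut the learning rate when the monitored loss stops improving."""

    def __init__(self, lr, factor=0.1, patience=2):
        self.lr = lr
        self.factor = factor      # multiplicative cut applied on a plateau
        self.patience = patience  # number of non-improving epochs tolerated
        self.best = None
        self.bad_epochs = 0

    def step(self, val_loss):
        if self.best is None or val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = PlateauLR(lr=1e-3, factor=0.1, patience=2)
val_losses = [0.30, 0.25, 0.24, 0.24, 0.25, 0.26, 0.23, 0.23]  # made-up values
history = [scheduler.step(loss) for loss in val_losses]
print(history)
```

The learning rate stays at its initial value while the loss keeps improving and is only cut after three consecutive epochs without improvement, which lets training start fast and then take smaller steps near a minimum.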

---

#### F. Unfreezing Backbone Mid-Training

We tested the effect of unfreezing the backbone after 10 epochs of frozen training, then continuing to 20 epochs.

- **Setup:** Backbone frozen for first 10 epochs, then unfrozen  
- **Training:** 20 epochs total  
- **Learning rate:** 1e-3  

**Results:**  
- F1 score: 0.49  

**Observations:**  
- Unfreezing the backbone mid-training improved performance over frozen-only runs but did not surpass fully tuned methods.  
- This suggests that gradual unfreezing can help, but that the learning rate may need careful adjustment at the moment of unfreezing, which we did not do.
- When the freeze is removed, the curves change abruptly, which is consistent with a learning rate that is too high for the newly unfrozen layers.

![Alt text](images/RunChangingFreeze.png)
![Alt text](images/RunChangingFreeze_2.png)

---

### 8.2.4 Summary of Runs

| Run | Backbone        | Dropout | Augmentation | Epochs | Learning Rate | F1 Score | Notes |
|-----|-----------------|---------|--------------|--------|---------------|----------|-------|
| 1   | Unfrozen        | None    | None         | 10     | 1e-4          | 0.42     | Baseline, loss still decreasing |
| 2   | Frozen          | None    | None         | 11     | 1e-4          | 0.51     | Overfitting observed |
| 2a  | Frozen          | 0.4     | None         | 3      | 1e-4          | 0.000001 | Dropout too strong, stopped early |
| 3   | Frozen          | None    | None         | 10     | 1e-3          | 0.48     | Last layer reset, faster training |
| 4   | Frozen          | None    | Augmentation | 10     | 1e-3          | 0.50     | Better generalization |
| 5   | Frozen          | None    | Augmentation | 10     | 1e-3 (sched)  | 0.52     | Best F1, adaptive LR helped |
| 6   | Frozen→Unfrozen | None    | Augmentation | 20     | 1e-3          | 0.49     | Unfroze backbone mid-training |

---

### 8.2.5 Conclusion

We observed that, whatever parameters or approach we changed, we obtained approximately the same F1 score and loss. We do not believe this model can be improved much further, as it is a lightweight one: we estimate that it could reach around 0.60, but not much higher. The plots from all the runs show that the results remain roughly the same.

![Alt text](images/Allruns.png)

**Next Steps:**  
- Experiment with unfreezing top blocks of backbone earlier.  
- Fine-tune dropout placement and probability.  
- Test advanced augmentation methods.  
- Explore class-balanced loss or oversampling for minority classes.
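
As an illustration of the last point, one common way to rebalance a multi-label loss is to weight the positive term of each class by its negative-to-positive ratio, which is the kind of vector that the `pos_weight` argument of `nn.BCEWithLogitsLoss` accepts. The sketch below only computes these weights in plain Python; the helper name and the toy label lists are hypothetical, and we did not run this experiment.

```python
def positive_weights(label_lists, num_classes):
    """Per-class ratio (#images without the class) / (#images with the class)."""
    n_images = len(label_lists)
    positives = [0] * num_classes
    for labels in label_lists:
        for class_id in set(labels):  # count each class at most once per image
            positives[class_id] += 1
    # Classes that never appear keep a neutral weight of 1.0.
    return [(n_images - p) / p if p else 1.0 for p in positives]


# Toy label files: one list of class IDs per image (illustrative only).
label_lists = [[0], [0, 1], [0], [2], [0, 1], [0]]
weights = positive_weights(label_lists, num_classes=3)
print(weights)
```

Frequent classes receive a weight below 1 and rare classes a weight above 1, so missing a rare object costs more during training. In PyTorch, the resulting list would be converted to a tensor before being passed as `pos_weight`.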





## 9. To go further

As a direction for future study, we decided to look at how other models perform on this task. We tried ConvNeXt.

ConvNeXt is a modern reinterpretation of the classic ResNet architecture, redesigned with ideas borrowed from Vision Transformers while keeping a fully convolutional backbone. It simplifies the traditional residual blocks into depthwise separable convolutions with large kernels (7×7), layer normalization instead of batch normalization, and inverted bottlenecks similar to those in efficient mobile models. This design improves the receptive field and feature extraction capacity. It is, however, a big model with a large number of parameters and a long computation time, which deviates from our main approach to the task.

We simply wanted to compare our results with those of other models. We found that the F1 score was better with a model like this, but the computation time and energy it required deviated from our objective. It could still be an interesting avenue to study in the future.