# The MS COCO classification challenge

Razmig Kéchichian

This notebook defines the multi-class classification challenge on the [MS COCO dataset](https://cocodataset.org/). It defines the problem, sets the rules of organization and presents tools you are provided with to accomplish the challenge.


## 1. Problem statement

Each image contains **several** categories of objects to predict. This differs from the classification problem we have seen on the CIFAR10 dataset, where each image belonged to a **single** category and the network loss function and prediction mechanism (only the highest output probability) were defined with this constraint in mind.
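To make the difference concrete, here is a minimal sketch of a multi-label loss and prediction rule in PyTorch. It assumes the network outputs one raw score (logit) per class; the tensor values and the 0.5 probability threshold are illustrative choices, not part of the challenge specification.

```python
import torch
import torch.nn as nn

# Hypothetical raw network outputs (logits) for 2 images and 4 classes.
logits = torch.tensor([[2.0, -1.0, 0.5, -3.0],
                       [-2.0, 3.0, -0.5, 1.5]])
# Multi-hot targets: each image may contain several classes at once.
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 1.0, 0.0, 1.0]])

# Single-label rule (CIFAR10 style): keep only the highest-scoring class.
single_label_pred = logits.argmax(dim=1)

# Multi-label rule: one independent sigmoid per class, then a threshold.
probs = torch.sigmoid(logits)
multi_label_pred = (probs > 0.5).float()

# BCEWithLogitsLoss applies the sigmoid internally, so it takes raw logits.
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(single_label_pred.tolist(), multi_label_pred.tolist(), loss.item())
```

With the single-label rule each image gets exactly one class, whereas the multi-label rule can return zero, one or several classes per image.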

We adapted the MS COCO dataset for the requirements of this challenge by, among other things, reducing the number of images and their dimensions to facilitate processing.

In the companion `ms-coco.zip` compressed directory you will find two sub-directories:
- `images`: which contains the images in train (65k) and test (~5k) subsets,
- `labels`: which lists labels for each of the images in the train subset only.

Each label file gives a list of class IDs that correspond to the class index in the following tuple:

In [None]:
classes = ("person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light", 
           "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
           "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",       
           "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
           "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
           "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch", 
           "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone", 
           "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear", 
           "hair drier", "toothbrush")

Your goal is to follow a **transfer learning strategy** to train and validate a network on **your own split of the training data into training and validation subsets**, then to **test it on the test subset** by producing a [JSON file](https://en.wikipedia.org/wiki/JSON) with content in the following format:

```
{
    "000000000139": [
        56,
        60,
        62
    ],
    "000000000285": [
        21
    ],
    "000000000632": [
        57,
        59,
        73
    ],
    # other test images
}
```

In this file, the name (without extension) of each test image is associated with a list of class indices predicted by your network. Make sure that the JSON file you produce **follows this format strictly**.
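A minimal sketch of producing such a file with the standard `json` module follows. The `predictions` dictionary, the `write_predictions` helper and the `predictions.json` file name are hypothetical and only illustrate the expected structure.

```python
import json
import os
import tempfile


def write_predictions(predictions, path):
    """Write {image name: [class indices]} as a JSON file."""
    # Sort and de-duplicate indices and cast them to plain ints, since
    # json cannot serialize tensor or NumPy scalar types directly.
    clean = {name: sorted({int(c) for c in ids})
             for name, ids in predictions.items()}
    with open(path, "w") as f:
        json.dump(clean, f, indent=4)


# Hypothetical predictions for three test images.
predictions = {
    "000000000139": [56, 60, 62],
    "000000000285": [21],
    "000000000632": [57, 59, 73],
}
out_path = os.path.join(tempfile.gettempdir(), "predictions.json")
write_predictions(predictions, out_path)

# Reading the file back checks that it is valid JSON.
with open(out_path) as f:
    reloaded = json.load(f)
```

Unlike a hand-written file, `json.dump` never emits trailing commas or comments, both of which are invalid in JSON.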

You will submit your JSON prediction file to the following [online evaluation server and leaderboard](https://www.creatis.insa-lyon.fr/kechichian/ms-coco-classif-leaderboard.html), which will evaluate your predictions on test set labels, unavailable to you.

<div class="alert alert-block alert-danger"> <b>WARNING:</b> Use this server with <b>the greatest care</b>. A new submission with identical Participant or group name will <b>overwrite</b> the identically named submission, if one already exists, therefore check the leaderboard first. <b>Do not make duplicate leaderboard entries for your group</b>, keep track of your test scores privately. Also pay attention to upload only JSON files of the required format.<br>
</div>

The evaluation server calculates and returns the mean performance over all classes, and optionally the per-class performance. Entries in the leaderboard are sorted by the F1 metric.

You can request an evaluation as many times as you want. It is up to you to specify the final evaluation by updating the leaderboard entry corresponding to your Participant or group name. This entry will be taken into account for grading your work.

It goes without saying that it is **prohibited** to use another distribution of the MS COCO database for training, e.g. the Torchvision dataset.


## 2. Organization

- Given the scope of the project, you will work in groups of 2. 
- Work on the challenge begins in the IAV lab 3 session, that is on the **23rd of September**.
- Results are due 10 days later, that is on the **3rd of October, 18:00**. They comprise:
    - a submission to the leaderboard,
    - a commented Python script (with any necessary modules) or Jupyter Notebook, uploaded on Moodle in the challenge repository by one of the members of the group.
    
    
## 3. Tools

In addition to the MS COCO annotated data and the evaluation server, we provide you with most code building blocks. Your task is to understand them and use them to create the glue logic, that is the main program, putting all these blocks together and completing them as necessary to implement a complete machine learning workflow to train and validate a model, and produce the test JSON file.

### 3.1 Custom `Dataset`s

We provide you with two custom `torch.utils.data.Dataset` sub-classes to use in training and testing.

In [None]:
import os
from glob import glob
from pathlib import Path

from PIL import Image
import torch

class COCOTrainImageDataset(torch.utils.data.Dataset):
    def __init__(self, img_dir, annotations_dir, max_images=None, transform=None):
        self.img_labels = sorted(glob("*.cls", root_dir=annotations_dir))
        if max_images:
            self.img_labels = self.img_labels[:max_images]
        self.img_dir = img_dir
        self.annotations_dir = annotations_dir
        self.transform = transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, Path(self.img_labels[idx]).stem + ".jpg")
        labels_path = os.path.join(self.annotations_dir, self.img_labels[idx])
        image = Image.open(img_path).convert("RGB")
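        with open(labels_path) as f:
            # one class ID per line; blank lines are skipped
            labels = [int(label) for label in f if label.strip()]
        if self.transform:
            image = self.transform(image)
        # multi-hot encoding over the 80 classes; the explicit long dtype
        # keeps scatter_ valid even when the label list is empty
        labels = torch.zeros(80).scatter_(0, torch.tensor(labels, dtype=torch.long), value=1)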
        return image, labels


class COCOTestImageDataset(torch.utils.data.Dataset):
    def __init__(self, img_dir, transform=None):
        self.img_list = sorted(glob("*.jpg", root_dir=img_dir))    
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        return len(self.img_list)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_list[idx])
        image = Image.open(img_path).convert("RGB")        
        if self.transform:
            image = self.transform(image)
        return image, Path(img_path).stem # filename w/o extension

### 3.2 Training and validation loops

The following are two general-purpose classification train and validation loop functions to be called inside the epochs for-loop with appropriate argument settings.

Pay particular attention to the `validation_loop()` function's arguments `multi_task`, `th_multi_task` and `one_hot`.

In [None]:
import torch


def train_loop(train_loader, net, criterion, optimizer, device,
               mbatch_loss_group=-1):
    net.train()
    running_loss = 0.0
    mbatch_losses = []
    for i, data in enumerate(train_loader):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # following condition False by default, unless mbatch_loss_group > 0
        if i % mbatch_loss_group == mbatch_loss_group - 1:
            mbatch_losses.append(running_loss / mbatch_loss_group)
            running_loss = 0.0
    if mbatch_loss_group > 0:
        return mbatch_losses


def validation_loop(val_loader, net, criterion, num_classes, device,
                    multi_task=False, th_multi_task=0.5, one_hot=False, class_metrics=False):
    net.eval()
    loss = 0
    correct = 0
    size = len(val_loader.dataset)
    class_total = {label:0 for label in range(num_classes)}
    class_tp = {label:0 for label in range(num_classes)}
    class_fp = {label:0 for label in range(num_classes)}
    with torch.no_grad():
        for data in val_loader:
            images, labels = data[0].to(device), data[1].to(device)
            outputs = net(images)
            loss += criterion(outputs, labels).item() * images.size(0)
            if not multi_task:    
                predictions = torch.zeros_like(outputs)
                predictions[torch.arange(outputs.shape[0]), torch.argmax(outputs, dim=1)] = 1.0
            else:
                predictions = torch.where(outputs > th_multi_task, 1.0, 0.0)
            if not one_hot:
                labels_mat = torch.zeros_like(outputs)
                labels_mat[torch.arange(outputs.shape[0]), labels] = 1.0
                labels = labels_mat
                
            tps = predictions * labels
            fps = predictions - tps
            
            tps = tps.sum(dim=0)
            fps = fps.sum(dim=0)
            lbls = labels.sum(dim=0)  
                
            for c in range(num_classes):
                class_tp[c] += tps[c]
                class_fp[c] += fps[c]
                class_total[c] += lbls[c]
                    
            correct += tps.sum()

    class_prec = []
    class_recall = []
    freqs = []
    for c in range(num_classes):
        class_prec.append(0 if class_tp[c] == 0 else
                          class_tp[c] / (class_tp[c] + class_fp[c]))
        class_recall.append(0 if class_tp[c] == 0 else
                            class_tp[c] / class_total[c])
        freqs.append(class_total[c])

    freqs = torch.tensor(freqs)
    class_weights = 1. / freqs
    class_weights /= class_weights.sum()
    class_prec = torch.tensor(class_prec)
    class_recall = torch.tensor(class_recall)
    prec = (class_prec * class_weights).sum()
    recall = (class_recall * class_weights).sum()
    f1 = 2. / (1/prec + 1/recall)
    val_loss = loss / size
    accuracy = correct / freqs.sum()
    results = {"loss": val_loss, "accuracy": accuracy, "f1": f1,\
               "precision": prec, "recall": recall}

    if class_metrics:
        class_results = []
        for p, r in zip(class_prec, class_recall):
            f1 = (0 if p == r == 0 else 2. / (1/p + 1/r))
            class_results.append({"f1": f1, "precision": p, "recall": r})
        results = results, class_results

    return results
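One point worth checking when setting `th_multi_task`: `validation_loop()` compares the raw network outputs with the threshold. If your network ends with a linear layer and no sigmoid (an assumption about your model, common when training with a logits-based loss), the threshold lives on the logit scale, not on the probability scale. The hypothetical helpers below convert between the two.

```python
import math


def sigmoid(x):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_to_logit_threshold(p):
    """Logit-scale threshold equivalent to probability threshold p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


# A probability threshold of 0.5 corresponds to a logit threshold of 0.
print(prob_to_logit_threshold(0.5))
# Thresholding raw logits at 0.5 is a stricter rule than probability 0.5.
print(sigmoid(0.5))
```

In other words, under this assumption the default `th_multi_task=0.5` applied to logits behaves like a probability threshold of roughly 0.62. Alternatively, you can apply the sigmoid inside the model and keep thresholds on the probability scale.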

### 3.3 Tensorboard logging (optional)

Evaluation metrics and losses produced by the `validation_loop()` function on train and validation data can be logged to a [Tensorboard `SummaryWriter`](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html) which allows you to observe training graphically via the following function:

In [None]:
def update_graphs(summary_writer, epoch, train_results, test_results,
                  train_class_results=None, test_class_results=None, 
                  class_names = None, mbatch_group=-1, mbatch_count=0, mbatch_losses=None):
    if mbatch_group > 0:
        for i in range(len(mbatch_losses)):
            summary_writer.add_scalar("Losses/Train mini-batches",
                                  mbatch_losses[i],
                                  epoch * mbatch_count + (i+1)*mbatch_group)

    summary_writer.add_scalars("Losses/Train Loss vs Test Loss",
                               {"Train Loss" : train_results["loss"],
                                "Test Loss" : test_results["loss"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train Accuracy vs Test Accuracy",
                               {"Train Accuracy" : train_results["accuracy"],
                                "Test Accuracy" : test_results["accuracy"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train F1 vs Test F1",
                               {"Train F1" : train_results["f1"],
                                "Test F1" : test_results["f1"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train Precision vs Test Precision",
                               {"Train Precision" : train_results["precision"],
                                "Test Precision" : test_results["precision"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    summary_writer.add_scalars("Metrics/Train Recall vs Test Recall",
                               {"Train Recall" : train_results["recall"],
                                "Test Recall" : test_results["recall"]},
                               (epoch + 1) if not mbatch_group > 0
                                     else (epoch + 1) * mbatch_count)

    if train_class_results and test_class_results:
        for i in range(len(train_class_results)):
            summary_writer.add_scalars(f"Class Metrics/{class_names[i]}/Train F1 vs Test F1",
                                       {"Train F1" : train_class_results[i]["f1"],
                                        "Test F1" : test_class_results[i]["f1"]},
                                       (epoch + 1) if not mbatch_group > 0
                                             else (epoch + 1) * mbatch_count)

            summary_writer.add_scalars(f"Class Metrics/{class_names[i]}/Train Precision vs Test Precision",
                                       {"Train Precision" : train_class_results[i]["precision"],
                                        "Test Precision" : test_class_results[i]["precision"]},
                                       (epoch + 1) if not mbatch_group > 0
                                             else (epoch + 1) * mbatch_count)

            summary_writer.add_scalars(f"Class Metrics/{class_names[i]}/Train Recall vs Test Recall",
                                       {"Train Recall" : train_class_results[i]["recall"],
                                        "Test Recall" : test_class_results[i]["recall"]},
                                       (epoch + 1) if not mbatch_group > 0
                                             else (epoch + 1) * mbatch_count)
    summary_writer.flush()

## 4. The skeleton of the model training and validation program

Your main program should have more or less the following sections and control flow:

In [None]:
import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} device.")

model = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1)

for param in model.features.parameters():
    param.requires_grad = False
    
    

Using cpu device.


In [None]:
# import statements for python, torch and companion libraries and your own modules


# a sigmoid is needed
# and the BCE loss (part 4)



# device initialization

# data directories initialization

# instantiation of transforms, datasets and data loaders
# TIP : use torch.utils.data.random_split to split the training set into train and validation subsets

# class definitions

# instantiation and preparation of network model

# instantiation of loss criterion
# instantiation of optimizer, registration of network parameters

# definition of current best model path
# initialization of model selection metric

# creation of tensorboard SummaryWriter (optional)

# epochs loop:
#   train
#   validate on train set
#   validate on validation set
#   update graphs (optional)
#   is new model better than current model ?
#       save it, update current best metric

# close tensorboard SummaryWriter if created (optional)
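As an illustration of the TIP above, the sketch below splits a dataset into train and validation subsets with `torch.utils.data.random_split`. The stand-in `TensorDataset`, the 80/20 ratio, the batch size and the seed are assumptions made for the example; in your program the dataset would be a `COCOTrainImageDataset`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in dataset: 100 fake "images" with 80-dimensional multi-hot labels.
full_set = TensorDataset(torch.randn(100, 3, 8, 8), torch.zeros(100, 80))

# Assumed 80/20 split, with a fixed seed so the split is reproducible.
n_val = len(full_set) // 5
n_train = len(full_set) - n_val
generator = torch.Generator().manual_seed(0)
train_set, val_set = random_split(full_set, [n_train, n_val], generator=generator)

# Shuffle only the training subset.
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
```

Both subsets wrap the same underlying dataset object, so they also share its `transform`. Keep this in mind if you add random augmentations meant for training only.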


## 5. The skeleton of the test submission program

This, much simpler, program should have the following sections and control flow:

In [None]:
# import statements for python, torch and companion libraries and your own modules
# TIP: use the python standard json module to write python dictionaries as JSON files

# global variables defining inference hyper-parameters among other things 
# DON'T forget the multi-task classification probability threshold

# data, trained model and output directories/filenames initialization

# device initialization

# instantiation of transforms, dataset and data loader

# load network model from saved file

# initialize output dictionary

# prediction loop over test_loader
#    get mini-batch
#    compute network output
#    threshold network output
#    update dictionary entries with the corresponding class indices

# write JSON file

## 6. Our Approach

The purpose of the task was, in our eyes, to find the best "good enough" model and parameter settings to answer the question of multi-label classification for this specific dataset. We were free to choose from a large list of pre-trained models available in the PyTorch library.

This task alone is very time-consuming: it requires testing the classification while modifying several parameters, and comparing different models to see which one fits best. We understood that we would have to make some decisions to optimize our work. Our approach was the following:

- choosing a couple of models from the PyTorch list 
- choosing a couple of parameters that we would modify along the testing phase

Fixing these choices was very important for us; otherwise, the work would have taken too long. Also, choosing the "best" model is impossible for us, given our limited resources and time.

We eventually decided to choose the models with a focus on energy. We know how energy- and resource-intensive the training of AI models is nowadays. This pressing concern is widely discussed and raises questions about the ethical use of these tools. It led us to look for the best-performing model among the smallest ones, that is, those with the fewest parameters, in order to save time and resources. We understood that with this decision our final model would perform worse on the leaderboard than other, bigger models. We still decided to search among the rather small models in the PyTorch distribution, trying to find one that is "good enough". The models we decided to test are the following:

- MobileNetV3-Small
- ResNet-18

We deliberately focused on these two architectures because they are not only computationally efficient but also well suited to multi-label image classification from an architectural standpoint. We also studied them in class (EfficientNet, by contrast, was only mentioned). They have a small number of parameters compared to other architectures.

MobileNetV3-Small was designed specifically for low-resource environments. Architecturally, it combines depthwise separable convolutions (greatly reducing the number of parameters and multiplications) with inverted residual blocks and linear bottlenecks, which maintain representational power while keeping the model extremely light. It also integrates squeeze-and-excitation attention blocks, improving the network’s ability to focus on informative channels, something important in multi-label problems where several objects may share the image. Its small parameter count also makes it fast to train and low in memory and energy cost. It has about 2.5 million parameters.

On the other hand, ResNet-18 is a more classical but still efficient architecture that we studied in class. Its residual skip connections allow deeper networks to train without vanishing gradients while still keeping the parameter count moderate compared to very deep ResNets and other models. Its parameter count is on the order of 11 million. Although it uses standard convolutions (no depthwise separable trick), its straightforward structure is robust and proven for general image recognition. For a multi-label task, the residual connections help the model learn richer representations without becoming prohibitively heavy.


We explicitly avoided architectures such as ConvNeXt, which, while state-of-the-art, are much deeper and more computationally expensive, and models like VGG or DenseNet, which are parameter-heavy and make less efficient use of computation. The architectures we considered represent different CNN design evolutions: classic residual learning (ResNet-18), mobile-optimized depthwise convolutions (MobileNetV3), and balanced compound scaling (EfficientNet-B0). This mix gives us a practical space to experiment with accuracy vs. efficiency trade-offs for our dataset while keeping training feasible and energy-aware.

We then chose which parameters to focus on before diving into the code and the testing. There are many parameters to experiment with when trying to perfect a model. We decided, again for the sake of efficiency, to select a few of them (the most impactful) so that we could make a fair study without changing the set of parameters between models. These are the parameters, which we divided into two groups:

- efficiency parameters:
    - number of cpus
    - batch size

- learning parameters:
    - number of epochs
    - learning rate
    - image size
    - optimizer
    - weight decay
    - dropout layer
    - threshold

These are the most important parameters in terms of impact on a model's behaviour. There are more, but we did not have the time to build a large testing pool with all of them.

We then outlined our plan of attack. We first set up our working environment, that is, a GitHub repository for our code and a shared Drive for our results (Excel and Word). We then needed to write a 'main.py' script for training our models and a 'test.py' script for generating the JSON file we would eventually upload to the leaderboard. The main modules of the code were already given in this notebook. Then, each of us choosing one model, we would test its performance while modifying its parameters to find the best-working configuration.

## 7. The Code

### 7.1 Custom datasets

### 7.2 Training and validation loops

### 7.3 Tensorboard

### 7.4 Model training and validation program

### 7.5 Test submission program

## 8. Results and analysis

### 8.1 MobileNetV3-Small

We decided not to freeze the layer parameters of the MobileNet model, since it is already a fast model and we did not want to lose any precision.

#### 8.1.1 1st, 2nd and 3rd run

The first runs with MobileNet were meant to check its initial performance and training time. We did not even run them through the 'test.py' script, since we mainly wanted a first look at which efficiency parameters would work best.

<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px;">
  <div><b>mobilenet_v3_small run 1</b><br>number of cpus: 12<br>batch size: 32<br>number epochs: 5<br>learning rate: 0.001<br>image size: 224<br>optimizer: adam<br>weight decay: no<br>dropout layer: no<br>threshold: 0.5<br>best F1: 0.3585<br>Time (min): 17</div>
</div>

We used the parameter values above (arbitrary but reasonable ones). The training time was in line with what we expected from a small model. So was the best F1 value, which was not the highest but was still decent, just not good enough yet.

Between runs 1 and 2, we only changed the number of epochs from 5 to 8 to see the impact on time and F1 score. Time increased to 20 minutes, and the F1 score stayed essentially the same. We decided to keep increasing the number of epochs to see the changes. We wanted to reach the point where training was too long, even for a small model. We tried 15 next and found that F1 was now 0.38, but it stayed relatively constant after epoch 10. We finally decided to keep the number of epochs at 10 for the moment. It was now time to test the learning parameters to try to maximize the F1 score and other metrics.

![Alt text](images/1.png)

The loss on the validation set spikes after 10 epochs.

<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px;">
  <div><b>mobilenet_v3_small run 2</b><br>number of cpus: 12<br>batch size: 32<br>number epochs: 8<br>learning rate: 0.001<br>image size: 224<br>optimizer: adam<br>weight decay: no<br>dropout layer: no<br>threshold: 0.5<br>best F1: 0.3555<br>Time (min): 20</div>
  <div><b>mobilenet_v3_small run 3</b><br>number of cpus: 12<br>batch size: 64<br>number epochs: 15<br>learning rate: 0.001<br>image size: 224<br>optimizer: adam<br>weight decay: no<br>dropout layer: no<br>threshold: 0.5<br>best F1: 0.3786<br>Time (min): 33</div>
</div>

For run 3, we also increased the batch size. This was done to shorten the running time, the batch size being the number of samples processed before one gradient update. With a small batch size, training can be slower to converge. With a large batch size, training converges faster, but we lose in generalization and can fall into overfitting because there is less gradient noise.
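For a sense of scale, the number of gradient updates per epoch is the number of training images divided by the batch size, rounded up. The subset size of 52,000 images used below is a hypothetical value chosen for illustration, and `updates_per_epoch` is a helper written for this example.

```python
import math


def updates_per_epoch(num_images, batch_size):
    """Number of optimizer steps in one pass over the training subset."""
    return math.ceil(num_images / batch_size)


num_train_images = 52_000  # hypothetical training subset size
for batch_size in (32, 64):
    print(batch_size, updates_per_epoch(num_train_images, batch_size))
```

Doubling the batch size roughly halves the number of updates per epoch, which is why each update is less noisy but the model gets fewer corrections for the same number of epochs.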


#### 8.1.2 4th and 5th run

Even if our F1 score got better with run 3, we observed that the model was failing to generalize, maybe due to the batch size having increased. 

![Alt text](images/2.png)
![Alt text](images/3.png)
![Alt text](images/4.png)

Note the gap between the train and validation accuracy, and the rapid stagnation of precision, and therefore the wrong labelling, on the validation set. To us, this was a sign of the model failing to generalize. This usually means the model is predicting more classes as positive over time (hence the recall increases), but among those extra predictions, more are false positives, therefore precision stops improving and, in this case, gets worse. The model gets more confident and predicts more positives, but it also adds noise.

However, we wanted to keep our model fast, so we tried methods to improve generalization other than reducing the batch size.

For run 4, we changed our optimizer from 'adam' to 'adamw', which regularizes better by decoupling the weight decay from the gradient update. We also reduced the learning rate and increased the image size, mostly as a way of testing the effects of these parameters. We would later find out that increasing the image size lets the model see more detail, since the images in the set have a higher resolution, and therefore contributes to overfitting. Indeed, our precision metric on the training set was too high. The same goes for the learning rate, since a smaller value lets the optimizer make finer updates instead of big jumps.
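The difference between the two optimizers can be illustrated on a single scalar weight. The sketch below is a simplified first update step written by hand, not the PyTorch implementation; the values of `w`, `grad`, `lr` and `wd` are arbitrary. With Adam, the L2 penalty is added to the gradient and then rescaled by the adaptive normalization; with AdamW, the decay is applied directly to the weight.

```python
import math


def first_step(w, grad, lr, wd, decoupled, beta1=0.9, beta2=0.999, eps=1e-8):
    """First update of a scalar weight with Adam + L2 penalty or with AdamW."""
    # Adam + L2: the penalty is folded into the gradient. AdamW: gradient untouched.
    g = grad if decoupled else grad + wd * w
    m = (1 - beta1) * g            # first moment estimate
    v = (1 - beta2) * g * g        # second moment estimate
    m_hat = m / (1 - beta1)        # bias correction at step 1
    v_hat = v / (1 - beta2)
    if decoupled:
        w = w * (1 - lr * wd)      # AdamW: decay acts on the weight itself
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


w_adam = first_step(1.0, 0.1, lr=1e-3, wd=1e-2, decoupled=False)
w_adamw = first_step(1.0, 0.1, lr=1e-3, wd=1e-2, decoupled=True)
print(w_adam, w_adamw)
```

In this first step the Adam update has a magnitude close to `lr` whatever the size of the penalty, whereas AdamW shrinks the weight by an extra `lr * wd * w`.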

![Alt text](images/5.png)

We could see in run 4 that setting the number of epochs to 10 had helped stabilize the validation loss. However, as the curves below show, the gaps between training and validation metrics were still present, and the precision, although much better, was still stagnating; the model was therefore still overfitting.


![Alt text](images/6.png)
![Alt text](images/7.png)

The final F1 metric did not vary.

<div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 10px;">
  <div><b>mobilenet_v3_small run 4</b><br>number of cpus: 14<br>batch size: 64<br>number epochs: 10<br>learning rate: 0.0001<br>image size: 256<br>optimizer: adamw<br>weight decay: 0.0001<br>dropout layer: no<br>threshold: 0.5<br>best F1: 0.3814<br>Time (min): 42</div>
</div>

Results were similar for run number 5.

We tried to counter overfitting by setting a higher learning rate than before. We also increased the weight decay for 'adamw' and added a dropout layer, which randomly disables neurons during training so that the network cannot rely on specific paths, forcing it to learn more robust features and generalize better.

Results were almost identical to those of run 4, with validation precision rising fast and then flattening while recall slowly increases. The model was clearly becoming more “liberal”, trying to catch more positives but producing more false positives. This is very common on COCO (high class imbalance, many small objects).

<div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 10px;">
  <div><b>mobilenet_v3_small run 5</b><br>number of cpus: 14<br>batch size: 64<br>number epochs: 10<br>learning rate: 0.0003<br>image size: 256<br>optimizer: adamw<br>weight decay: 0.0003<br>dropout layer: yes<br>threshold: 0.5<br>best F1: 0.38<br>Time (min): 45</div>
</div>

This time, we ran the best model from this run through 'test.py' in order to upload the JSON file to the leaderboard. We were surprised to see an F1 score of 0.4227, which meant the model was doing better than on the validation set, which we thought was unusual.

#### 8.1.3 6th and 7th run 

For run number 6, we clearly needed a more aggressive regularization approach. We set a larger number of epochs to see if that was the problem, while increasing the weight decay once again. We also tried other transforms to make the training set more varied, such as 'ColorJitter'. Until then we had only used 'RandomHorizontalFlip'.

The F1 metric on the validation set improved a lot, but we concluded that this was only due to the high number of epochs. Indeed, the model still overfitted: when we ran it through 'test.py', it performed worse than the previous one. The curves also showed this trend. Precision was falling again, this time similarly to run 3 (plummeting after 10 epochs). The gaps between the training and validation metrics were still visible in the other metrics, but our main concern was precision.

On the test set, the F1 metric was 0.4145, which was worse than for the previous run. Also, the training time was too long given our initial intent of an environmentally friendly model.

<div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 10px;">
  <div><b>mobilenet_v3_small run 6</b><br>number of cpus: 14<br>batch size: 64<br>number epochs: 20<br>learning rate: 0.0003<br>image size: 256<br>optimizer: adamw<br>weight decay: 0.001<br>dropout layer: yes<br>threshold: 0.5</div>
</div>

The number of epochs was not the problem, as the stagnation of precision happened far too early in the training process. We later returned to a smaller number of epochs, also to stay in line with our initial goal of an environmentally friendly model.

For run number 7, we returned to fewer epochs. We also tried another technique: tuning the decision threshold to boost precision. We changed the code to integrate this feature. In our experiments, the network outputs a probability for each possible label. By default, a label is considered present if its probability is ≥ 0.5. However, this fixed value does not necessarily give the best trade-off between precision (proportion of predicted positives that are correct) and recall (proportion of actual positives that are detected). We therefore performed a threshold sweep, testing several cut-off values from 0.05 to 0.95 on the validation set. Increasing the threshold makes the model predict a label only when it is more confident, which reduces false positives and generally improves precision, although it can slightly decrease recall. Conversely, lowering the threshold increases recall but often introduces more false positives, lowering precision. Selecting the threshold that maximized the F1-score (the harmonic mean of precision and recall) allowed us to improve our model’s validation performance without retraining. This simple post-processing step significantly increased precision and produced a better overall F1 balance for the multi-label task.
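The sweep can be sketched as follows. This is a simplified, self-contained illustration in plain Python rather than the code of 'main.py': the probabilities and labels are made-up values, the 0.05 step is an assumption, and the F1 score is micro-averaged over all (image, class) pairs, whereas `validation_loop()` uses class-weighted averages.

```python
def precision_recall_f1(probs, labels, threshold):
    """Micro-averaged precision, recall and F1 at a given probability threshold."""
    tp = fp = fn = 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            predicted = p >= threshold
            if predicted and y == 1:
                tp += 1
            elif predicted and y == 0:
                fp += 1
            elif not predicted and y == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up validation probabilities and multi-hot labels (3 images, 4 classes).
val_probs = [[0.90, 0.40, 0.10, 0.60],
             [0.30, 0.80, 0.55, 0.05],
             [0.70, 0.20, 0.47, 0.35]]
val_labels = [[1, 0, 0, 1],
              [0, 1, 0, 0],
              [1, 0, 1, 0]]

# Candidate thresholds 0.05, 0.10, ..., 0.95.
thresholds = [k / 100 for k in range(5, 100, 5)]
best_threshold = max(
    thresholds, key=lambda t: precision_recall_f1(val_probs, val_labels, t)[2])
precision, recall, f1 = precision_recall_f1(val_probs, val_labels, best_threshold)
print(best_threshold, precision, recall, f1)
```

On this toy data the best threshold is below 0.5, which shows that the sweep can move the cut-off in either direction depending on how the scores are distributed.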

We also reduced the batch size and the image size again. Our F1 improved slightly compared with the first runs and, notably, our validation precision no longer shows the same gap and evolution as before. We believe the threshold sweep and the smaller batch size helped us limit overfitting. These are our curves for run number 7:

![Alt text](images/8.png)

We can see that even though the F1 metric is not much better, precision improved compared with the first runs, all while using a small model (about 5 min/epoch).

![Alt text](images/9.png)

The accuracy and F1 metrics no longer seem to stagnate. We could train for more epochs, but that would diverge from our main objective for this task: we want to keep the model small and efficient.

<div style="display: grid; grid-template-columns: 1fr; gap: 10px;">
  <div><b>mobilenet_v3_small run 7</b><br>number of cpus: 14<br>batch size: 32<br>number epochs: 15<br>learning rate: 0.0001<br>image size: 224<br>optimizer: adamw<br>weight decay: 0.001<br>dropout layer: yes<br>threshold: sweeping<br>best F1: 0.3971<br>Time (min): 69</div>
</div>


When testing with `test.py`, the final F1 metric is not that impressive.

### ResNet-18
#### Intro

ResNet-18 is a deep learning model with 18 layers and about 11.7M weights. It uses skip connections to train more effectively.

We use it with the MS COCO dataset because it’s lightweight, fast, and still powerful enough to learn useful features from COCO’s large variety of images and objects.

To train this model, we switched to a faster computer so that we could also test training on better hardware. We used an NVIDIA RTX A2000 8GB GPU and an i7 CPU.

#### Why we chose ResNet-18


Efficiency: ResNet-18 is computationally less demanding than deeper models, making it suitable for tasks requiring fast inference. That means less waiting time and lower power consumption, which is a key value for us.

Transfer Learning: Pretrained ResNet-18 models on ImageNet can be fine-tuned for COCO, leveraging learned features for improved performance.

Proven Architecture: ResNet architectures, including ResNet-18, have demonstrated strong performance in various vision tasks, including classification and detection.



#### Let's Run it
##### 1
For the first run, we use the parameters used for MobileNet to get a baseline and see what we can improve afterwards.

On the first run of ResNet18, we didn’t freeze any model layers. For this initial test, we trained for 2 hours and achieved an F1 score of 0.42 after 10 epochs. The loss was still decreasing, so extending the number of epochs in the next run may improve results. We used a learning rate of 1e-4; increasing it to 1e-3 could allow for faster convergence.

![Alt text](images/Metrics-Resnet18-1.png)
##### 2


After a run freezing the backbone, we obtained an F1 score of 0.46. However, when training for 11 epochs, we achieved a better F1 score of 0.51. We also observed overfitting, accompanied by a decrease in accuracy. We therefore decided to reduce the number of epochs to 10, add a dropout layer with a rate of 0.4, and resize the images to 224. According to the ResNet paper, this is the input size on which the network was originally trained, so matching it may lead to better results.
The run with the dropout layer gave us an F1 score of 0.010 after 3 epochs, so we decided to stop it.

##### 3












## 9. To go further

As a future case study, we decided to try other models on this task. We tried EfficientNet-B0 and ConvNeXt.

EfficientNet-B0 is built around the idea of compound scaling, where network depth, width, and input resolution are balanced using a single scaling coefficient. This avoids over-widening or over-deepening the network unnecessarily. Architecturally, it uses mobile inverted bottleneck convolution blocks with SE attention like MobileNetV3, but with a deeper design that improves feature extraction while staying efficient. This makes it a good fit for multi-label classification, as it offers a good balance between computational cost and model capacity.
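
To make the compound scaling idea concrete, the short sketch below computes the depth, width, and resolution multipliers for a given coefficient. The base values 1.2, 1.1, and 1.15 are the ones reported in the EfficientNet paper, not values we tuned; the function names are hypothetical, and the published B1 to B7 variants use rounded settings rather than these exact multipliers.

```python
# Base coefficients reported in the EfficientNet paper (found by grid search on B0).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scaling(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_factor(phi):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    depth, width, resolution = compound_scaling(phi)
    return depth * width ** 2 * resolution ** 2


for phi in range(3):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, "
          f"FLOPs x{flops_factor(phi):.2f}")
```

With `phi = 0` all multipliers equal 1, which corresponds to the B0 baseline we tried. Each increment of `phi` roughly doubles the computational cost, because the base values were chosen so that `ALPHA * BETA**2 * GAMMA**2` is close to 2.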

ConvNeXt is a modern reinterpretation of the classic ResNet architecture, redesigned with ideas borrowed from Vision Transformers while keeping a fully convolutional backbone. It replaces the traditional residual blocks with depthwise convolutions with large kernels (7×7), layer normalization instead of batch normalization, and inverted bottlenecks similar to those in efficient mobile models. This design improves the receptive field and feature extraction capacity while remaining computationally efficient compared to older CNNs. It is, however, a big model with a large number of parameters and a long computation time, which deviates from our main approach to the task.

We just wanted to compare our results with those of other models.