# Noise in Data Labels

![label_noise_intro.png](https://live.staticflickr.com/65535/54328952408_f7e5bcb7c9_z.jpg)

*Image generated using a generative model from OpenArt.ai.*

## Introduction

Data is key in machine learning. Everything starts with it. In practice, however, data is often not perfect and contains some noise that reduces its quality. One type of such noise is noise in data labels, which means that for some observations the labels are incorrect.

Label noise has varied causes. It often results from subjective assessment: different experts may have different opinions, e.g., when assessing emotions in photos or the quality of an essay. Another source of error is annotator fatigue, which reduces concentration and accuracy.

Ambiguities can also result from poor data quality, which makes unambiguous classification difficult (e.g., a blurry photo of a dog that resembles a wolf). Sometimes labels are generated automatically by artificial intelligence models, which can also make mistakes.

It is worth mentioning samples located on the class boundary. Such cases, e.g., in medical data where symptoms are similar for different diseases, also lead to difficulties in assigning an unambiguous label.

Label noise makes it harder to train a high-quality model, because the model may fit the incorrect labels instead of learning the general patterns in the data.

## Task

Your task is to train **two** neural networks for correct binary classification of images despite partial label noise in the training data. The training set is imbalanced (take this into account in your solution). The validation and test sets (the latter will be used to evaluate your final solution) contain only correct labels (no noise).

**The architecture of the models is defined and you cannot change it.**

Think about why we use two models rather than one (this is a kind of puzzle); the answer will help you understand the task and solve it.

Your role in this task is to implement the function `your_select_indices(targets, losses)`, which selects the samples that will be used to train the models. The function receives a tensor with the labels of the current batch (`targets`) and the loss values from both models (`losses`, a list with one tensor of per-sample losses for each model). It should return a two-element list of tensors containing the selected indices, given as positions within the batch. The first model is trained on the first set of indices, the second model on the second set.

Further down in the notebook you will find a clearly marked cell with space for your function. To better understand its operation and purpose, look at where the function is called in the training loop.
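
To make this contract concrete, here is a minimal sketch with a dummy batch. It assumes, as in the training cell later in the notebook, that the criterion is `nn.CrossEntropyLoss(reduction="none")`, so each entry of `losses` holds one value per sample. The name `keep_all_indices` is a hypothetical placeholder for `your_select_indices`, and the logits are random values used only to show the shapes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, num_classes = 8, 2

# Dummy labels and logits from two models (random values, for illustration only).
targets = torch.randint(0, num_classes, (batch_size,))
logits = [torch.randn(batch_size, num_classes) for _ in range(2)]

# reduction="none" keeps one loss value per sample instead of a batch mean.
criterion = nn.CrossEntropyLoss(reduction="none")
losses = [criterion(out, targets) for out in logits]


def keep_all_indices(targets, losses):
    """Hypothetical placeholder: every sample is passed to both models."""
    all_idx = torch.arange(targets.shape[0], device=targets.device)
    return [all_idx, all_idx.clone()]


selected = keep_all_indices(targets, losses)
print(losses[0].shape, losses[1].shape)   # one loss per sample, per model
print([s.tolist() for s in selected])     # batch positions chosen for each model
```

A real selection function would return only a subset of these positions for each model.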

### Evaluation Criteria
The final evaluation of the task will be based on the average value of the balanced accuracy measure (*BAC*) from two models, i.e., ${BAC}_{mean} = \frac{BAC_1+BAC_2}{2}$ where $BAC_i$ is the *balanced accuracy* for model $i$, ($i = 1, 2$).
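
Balanced accuracy is the mean of the per-class recalls, which is why it does not reward a model for simply predicting the majority class. The sketch below is a plain-Python illustration of that definition on made-up predictions; the helper name `balanced_accuracy` and the toy labels are assumptions for this example, while the notebook itself uses `balanced_accuracy_score` from scikit-learn.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# 8 samples of class 0 and 2 samples of class 1 (imbalanced on purpose).
y_true = [0] * 8 + [1] * 2

# A model that always predicts the majority class: plain accuracy is 0.8,
# but recall is 1.0 for class 0 and 0.0 for class 1.
always_majority = [0] * 10

# A model that makes mistakes in both classes: recall is 6/8 and 1/2.
mixed = [0] * 6 + [1] * 2 + [1, 0]

bac_majority = balanced_accuracy(y_true, always_majority)
bac_mixed = balanced_accuracy(y_true, mixed)
print(bac_majority, bac_mixed)
```

The always-majority predictor scores only 0.5 here, which is the level at which the task awards zero points.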

For this task, you can score between 0 and 100 points.

Your final point score for solving the task will be calculated according to the function below (the higher the value the better) with additional rounding to integer values:
$$
\mathrm{Points} =
\begin{cases}
    0 & \text{if } {BAC}_{mean} \leq 0.5 \\
    100 \times \frac{{BAC}_{mean} - 0.5}{0.8 - 0.5} & \text{if } 0.5 < {BAC}_{mean} < 0.8 \\
    100 & \text{if } {BAC}_{mean} \geq 0.8
\end{cases}
$$

**Note: To get the maximum number of points, you do not need to reach a *balanced accuracy* of 1. If ${BAC}_{mean}$ is at least 0.8, you will receive the maximum number of points.**
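
As a quick check of the formula, the mapping can be sketched as a small function. The name `points_from_bac` is a hypothetical helper for this illustration only. For example, a mean BAC of 0.65 lies halfway between 0.5 and 0.8 and therefore maps to 50 points.

```python
def points_from_bac(bac_mean: float) -> int:
    """Map the mean balanced accuracy to a 0-100 score (piecewise linear)."""
    if bac_mean <= 0.5:
        return 0
    if bac_mean >= 0.8:
        return 100
    # Linear interpolation between 0.5 (0 points) and 0.8 (100 points).
    return int(round(100 * (bac_mean - 0.5) / (0.8 - 0.5)))


for bac in (0.40, 0.50, 0.65, 0.80, 0.95):
    print(bac, points_from_bac(bac))
```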

The scoring criterion and the helper functions mentioned above are implemented for you later in this notebook.

## Constraints

- Your solution will be tested on the Competition Platform without internet access and in an environment with GPU.
- Evaluation of your final solution on the Competition Platform cannot take longer than 5 minutes with a GPU.
- **You cannot** change the architecture of the models - it must be the `SmallMobileNet` defined by us.

## Submission Files
Submit this notebook, supplemented with your solution (see the `your_select_indices` function).

## Evaluation
Remember that during checking, the `FINAL_EVALUATION_MODE` flag will be set to `True`.

For this task, you can score between 0 and 100 points. The number of points you will get will be calculated on the (secret) test set on the Competition Platform based on the formula mentioned above, rounded to an integer. If your solution does not meet the above criteria or does not execute correctly, you will receive 0 points for the task.

# Environment Setup

First, we define the evaluation mode and import the necessary libraries for deep learning, data manipulation, and visualization.

In [None]:
# During the verification of your solution, the value of the FINAL_EVALUATION_MODE flag will be changed to True
FINAL_EVALUATION_MODE = False

### Library Imports

Standard PyTorch components, along with scikit-learn metrics and PIL for loading images.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
import os
from tqdm import tqdm
from typing import Optional, Tuple, List

import zipfile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from PIL import Image

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader

import torchvision.transforms as transforms
from torchvision.datasets.folder import VisionDataset

from sklearn.metrics import balanced_accuracy_score

## Data Constants

Definition of the random seed, the directory paths, and the Google Drive file IDs (stored in the `*_URL` constants) for the training and validation datasets.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
SEED = 123
IMAGES_DIR = "data"
TASK_DATASET_LABELS_FILE = "dataset_labels.csv"

ROOT_DIR = os.getcwd()
TRAIN_DATASET_PATH = os.path.join(ROOT_DIR,'train')
VAL_DATASET_PATH = os.path.join(ROOT_DIR, 'val')

TRAIN_DATASET_URL = "1qmNNmDv-wUcAv5mvO6vYJV3mQ2SNIGnI"
VAL_DATASET_URL = "1YUJYD12NmKRSzFJGMrX-a61d6mnTaWbG"

### Training Hyperparameters

Setting the computation device (GPU/CPU) and training parameters like learning rate and batch size.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
LEARNING_RATE = 1e-2
NUM_EPOCHS = 6
NUM_CLASSES = 2
BATCH_SIZE = 128
WEIGHT_DECAY = 1e-3

if not FINAL_EVALUATION_MODE:
  print(f"Using {DEVICE} device")

### Reproducibility Setup

Function to ensure deterministic behavior across different runs by fixing seeds.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
def seed_everything(seed: int) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

## Data Pipeline

The function below downloads a dataset archive from Google Drive. Before downloading, it removes any existing copy of the dataset directory and of the archive.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
def download_data(dataset_path, dataset_url):
    import gdown
    import shutil
    output = dataset_path+".zip"
    if os.path.exists(dataset_path):
        shutil.rmtree(dataset_path)
    if os.path.exists(output):
        os.remove(output)
    url = f'https://drive.google.com/uc?id={dataset_url}'
    gdown.download(url, output, fuzzy=True)
    print(f"Downloaded: {output}")

### Custom Dataset Implementation

The `TaskDataset` class handles loading images and their associated (noisy) labels from a CSV file.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
class TaskDataset(VisionDataset):
    def __init__(self, root: str, transform: Optional[callable] = None):
        super().__init__(root, transform=transform)
        self.root = root
        self.labels_df = pd.read_csv(os.path.join(self.root, TASK_DATASET_LABELS_FILE))

    def __len__(self) -> int:
        return len(self.labels_df)

    def __getitem__(self, idx: int) -> Tuple[Image.Image, np.ndarray]:
        img_path = os.path.join(self.root, IMAGES_DIR, self.labels_df.iloc[idx]['file_name'])
        img = Image.open(img_path)
        label = self.labels_df.iloc[idx]['label']
        if self.transform is not None:
            img = self.transform(img)
        return img, np.array([int(label)])

### Data Extraction

Utility function to unzip downloaded archives into the local file system.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
def unpack_data(unpack_path, dataset_name) -> None:
    dataset_zip_path = os.path.join(ROOT_DIR, dataset_name+".zip")
    if not os.path.exists(os.path.join(unpack_path, dataset_name)):
        with zipfile.ZipFile(dataset_zip_path, "r") as zip_ref:
            zip_ref.extractall(unpack_path)

### Loader Initialization

Wraps datasets into DataLoaders for efficient batch processing.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
def load_data() -> Tuple[DataLoader, DataLoader]:
    base_transform = transforms.Compose([transforms.ToTensor()])
    train_dataset = TaskDataset(root=TRAIN_DATASET_PATH, transform=base_transform)
    val_dataset = TaskDataset(root=VAL_DATASET_PATH, transform=base_transform)
    train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=False)
    val_loader = DataLoader(dataset=val_dataset, batch_size=BATCH_SIZE, shuffle=False)
    return train_loader, val_loader

### Dataset Preparation Execution

Running the download, unpacking, and loading steps sequentially.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
if not FINAL_EVALUATION_MODE:
    download_data(TRAIN_DATASET_PATH, TRAIN_DATASET_URL)
    download_data(VAL_DATASET_PATH, VAL_DATASET_URL)
    unpack_data(ROOT_DIR, "train")
    unpack_data(ROOT_DIR, "val")
    train_loader, val_loader = load_data()

## Model Architecture

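Both models use the fixed `SmallMobileNet` architecture, which is intended to rely on depthwise separable convolutions to stay computationally efficient within the 5-minute GPU limit. Note that the listing below is marked as a simplified view: it shows a standard 3x3 convolution followed by a 1x1 (pointwise) convolution.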

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
class SmallMobileNet(nn.Module):
    def __init__(self, num_classes=NUM_CLASSES):
        super(SmallMobileNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU6(inplace=True),
            nn.Conv2d(32, 256, kernel_size=1, stride=1, bias=False), # Simplified view for display
            nn.BatchNorm2d(256),
            nn.ReLU6(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU6(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
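
For readers unfamiliar with the term, a depthwise separable convolution can be sketched as follows. This is a generic illustration and not the actual `SmallMobileNet` layers: the class name `DepthwiseSeparableConv` and the channel sizes are assumptions chosen for the example. The block splits a standard convolution into a per-channel 3x3 convolution (`groups=in_ch`) and a 1x1 pointwise convolution that mixes channels.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Hypothetical block: depthwise 3x3 conv followed by pointwise 1x1 conv."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # groups=in_ch gives each input channel its own 3x3 filter.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # The 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))


block = DepthwiseSeparableConv(32, 64)
out = block(torch.randn(2, 32, 28, 28))

# Compare convolution weights with a standard 3x3 convolution of the same size.
dense = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)
separable_weights = block.depthwise.weight.numel() + block.pointwise.weight.numel()
print(out.shape, separable_weights, dense.weight.numel())
```

The saving in convolution weights is what makes this kind of block attractive under a tight time budget.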

## Evaluation Criteria

Functions to calculate Balanced Accuracy (BAC) and translate it into the final competition score.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
def predict_and_evaluate(model, val_loader, device, verbose=False):
    model.eval()
    all_preds, all_targets = [], []
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            preds = torch.argmax(outputs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_targets.extend(targets.cpu().numpy())
    bac = balanced_accuracy_score(all_targets, all_preds)
    if verbose: print(f"Balanced Accuracy: {bac}")
    return bac

### Score Mapping

Translates the mean BAC from both models into a 0-100 score.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
def performance(bac_1: float, bac_2: float) -> None:
    bac_mean = (bac_1 + bac_2) / 2
    if bac_mean <= 0.5: points = 0
    elif 0.5 < bac_mean < 0.8: points = int(round((bac_mean - 0.5) / 0.3 * 100))
    else: points = 100
    print(f"Final Score: {points}/100")

## Training Infrastructure

The standard training loop is modified here to accept a `select_indices_fn`. This allows us to dynamically filter out noisy samples during training.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
def train(model1, model2, optimizer1, optimizer2, criterion, train_loader, val_loader, num_epochs, device, select_indices_fn):
    for epoch in range(1, num_epochs + 1):
        model1.train(); model2.train()
        pbar = tqdm(train_loader, desc=f"Epoch {epoch}")
        for inputs, targets in pbar:
            inputs, targets = inputs.to(device), targets.squeeze().long().to(device)
            outputs = [m(inputs) for m in (model1, model2)]
            losses = [criterion(out, targets) for out in outputs]
            
            # The selection strategy is applied here
            selected_indices = select_indices_fn(targets, losses)
            
            for i, (model, optim) in enumerate([(model1, optimizer1), (model2, optimizer2)]):
                optim.zero_grad()
                sel = selected_indices[i]
                if len(sel) > 0:
                    loss = criterion(model(inputs[sel]), targets[sel]).mean()
                    loss.backward(); optim.step()
    return {}

## Example Baseline

As a baseline, we use all samples regardless of their loss, which typically leads to overfitting on the noise.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
def default_select_indices(targets, losses):
    return [torch.arange(targets.shape[0]).to(DEVICE) for _ in range(2)]

# Solution

The proposed solution implements a **Co-teaching** framework tailored for **imbalanced datasets**. 

### Why Co-teaching?
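Deep neural networks exhibit a "memorization effect": they tend to learn simple, generalizable patterns (which clean samples follow) before memorizing noisy labels. Early in training, samples with a low loss are therefore more likely to be correctly labeled. By training two models in parallel, each model identifies its low-loss samples and "teaches" them to its peer. This cross-filtering weakens the feedback loop in which a single model reinforces its own errors by repeatedly selecting the samples it already fits.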
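
The cross-filtering idea can be sketched without any class handling as follows. This is a simplified illustration under stated assumptions: the helper name `small_loss_cross_selection` is hypothetical, and the keep ratio is fixed, whereas other variants may change it during training.

```python
import torch


def small_loss_cross_selection(losses, keep_ratio=0.85):
    """Class-agnostic co-teaching selection (hypothetical helper).

    Each model is trained on the samples to which its peer assigns
    the smallest loss.
    """
    loss1, loss2 = losses
    num_keep = max(1, int(keep_ratio * loss1.shape[0]))
    # Positions of the num_keep smallest losses according to each model.
    low_by_model1 = torch.argsort(loss1)[:num_keep]
    low_by_model2 = torch.argsort(loss2)[:num_keep]
    # Swap: model 1 trains on model 2's picks, model 2 on model 1's picks.
    return [low_by_model2, low_by_model1]


# Made-up per-sample losses for a batch of four samples.
loss1 = torch.tensor([0.1, 2.5, 0.3, 0.2])
loss2 = torch.tensor([0.2, 0.1, 3.0, 0.4])
for_model1, for_model2 = small_loss_cross_selection([loss1, loss2], keep_ratio=0.5)
print(for_model1.tolist(), for_model2.tolist())
```

In this toy batch, model 1 is trained on the two samples that model 2 finds easiest, and vice versa.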

### Handling Imbalance
In imbalanced datasets, class-agnostic loss filtering tends to favor the majority class, because the models fit it sooner and its samples end up with lower losses. To counter this, we:
1.  **Class-Wise Filtering**: We sort losses independently for class 0 and class 1.
2.  **Proportional Selection**: We keep the same fraction of samples from each class (85% in the code below), so the class proportions of each batch are preserved and the models still see enough minority-class samples to achieve a high balanced accuracy.
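
The effect of these two steps can be illustrated on a toy batch. The loss values below are made up and assume that the minority class is harder to fit; the keep ratio of 0.75 is chosen only so that the counts are round numbers (the solution itself uses 0.85).

```python
import torch

# Toy batch: 16 majority samples (class 0) and 4 minority samples (class 1).
targets = torch.tensor([0] * 16 + [1] * 4)
# Assumed losses: every minority sample has a higher loss than any majority sample.
peer_loss = torch.cat([torch.linspace(0.05, 0.50, 16),
                       torch.linspace(0.80, 1.10, 4)])
keep_ratio = 0.75

# Class-agnostic filtering: keep the globally smallest losses.
num_keep = int(keep_ratio * targets.shape[0])
global_sel = torch.argsort(peer_loss)[:num_keep]

# Class-wise filtering: apply the same keep ratio inside each class.
class_wise_parts = []
for cls in (0, 1):
    cls_idx = torch.nonzero(targets == cls).flatten()
    cls_keep = int(keep_ratio * cls_idx.numel())
    order = torch.argsort(peer_loss[cls_idx])[:cls_keep]
    class_wise_parts.append(cls_idx[order])
class_wise_sel = torch.cat(class_wise_parts)

minority_kept_global = int((targets[global_sel] == 1).sum())
minority_kept_class_wise = int((targets[class_wise_sel] == 1).sum())
print(minority_kept_global, minority_kept_class_wise)
```

Both strategies keep 15 samples, but the class-agnostic one discards the whole minority class in this batch, while the class-wise one keeps 3 of its 4 samples.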

In [None]:
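def your_select_indices(targets: torch.Tensor, losses: List[torch.Tensor]) -> List[torch.Tensor]:
    """
    Class-aware co-teaching selection.

    For each class separately, keeps the `keep_ratio` fraction of samples in the
    batch with the lowest loss according to the peer model.

    Args:
        targets: labels of the current batch, shape (batch_size,).
        losses: per-sample losses of model 1 and model 2.

    Returns:
        [indices for model 1, indices for model 2], as positions within the batch.
    """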
    batch_size = targets.shape[0]
    device = targets.device
    loss1, loss2 = losses
    
    # Get class masks and indices
    indices = torch.arange(batch_size, device=device)
    c0_mask, c1_mask = (targets == 0), (targets == 1)
    
    # Fraction of each class kept per batch; the highest-loss remainder is treated as likely noisy
    keep_ratio = 0.85
    c0_keep = int(c0_mask.sum().item() * keep_ratio)
    c1_keep = int(c1_mask.sum().item() * keep_ratio)
    
    def filter_for_peer(peer_loss):
        # Sort class 0 samples by peer loss
        _, s0 = torch.sort(peer_loss[c0_mask])
        sel0 = indices[c0_mask][s0[:c0_keep]]
        
        # Sort class 1 samples by peer loss
        _, s1 = torch.sort(peer_loss[c1_mask])
        sel1 = indices[c1_mask][s1[:c1_keep]]
        
        return torch.cat([sel0, sel1])

    # Model 1 trains on low-loss samples from Model 2 (and vice versa)
    return [filter_for_peer(loss2), filter_for_peer(loss1)]

## Execution and Results

We now run the training process using our custom selection strategy and evaluate the final performance against the competition metric.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
if not FINAL_EVALUATION_MODE:
    seed_everything(SEED)
    criterion = nn.CrossEntropyLoss(reduction="none")
    m1, m2 = SmallMobileNet().to(DEVICE), SmallMobileNet().to(DEVICE)
    opt1 = AdamW(m1.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
    opt2 = AdamW(m2.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)

    train(m1, m2, opt1, opt2, criterion, train_loader, val_loader, NUM_EPOCHS, DEVICE, your_select_indices)

    b1 = predict_and_evaluate(m1, val_loader, DEVICE, verbose=True)
    b2 = predict_and_evaluate(m2, val_loader, DEVICE, verbose=True)
    performance(b1, b2)

# Official Evaluation

The function below is used by the platform to calculate the final score on the hidden test set.

In [None]:
######################### DO NOT CHANGE THIS CELL ##########################
def final_evaluate(evaluate_data_path, model1, model2):
    base_transform = transforms.Compose([transforms.ToTensor()])
    loader = DataLoader(TaskDataset(evaluate_data_path, transform=base_transform), batch_size=BATCH_SIZE)
    b1 = predict_and_evaluate(model1, loader, DEVICE, verbose=True)
    b2 = predict_and_evaluate(model2, loader, DEVICE, verbose=True)
    return performance(b1, b2)