# Noisy labels problem


## Introduction theory

Before starting to implement anything we should think about the problem first. We don't know which of the training samples are wrong, and we cannot try to correct them by hand because there are too many of them and the images are not very clear anyway.

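We need some noise-aware training techniques. Some ideas are:
- Training with a loss function that is designed to be robust to label noise, such as:
  - Focal loss, which down-weights the contribution of easy examples and puts more emphasis on difficult ones, which may help with noise in some settings. PyTorch has no `torch.nn.FocalLoss`; torchvision provides `torchvision.ops.sigmoid_focal_loss` for the sigmoid (binary) case, and a multi-class version has to be written by hand.
  - Generalized Cross-Entropy (GCE) loss, which uses a hyperparameter `q` to interpolate between cross-entropy and mean absolute error, trading fitting speed for robustness to label noise. It is not part of `torch.nn` (there is no `torch.nn.GCE`) and has to be implemented as a custom loss.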
- bootstrapping
- self-ensembling to learn multiple models that are more robust to noise
- Confidence-based methods: predicting a confidence may allow humans to discard the least confident examples. (But we are not building a real-world model; we are evaluated automatically on a test set, so this method does not apply here.)
- Pseudo-labeling: using the model's own predictions to re-label a portion of the training data. Similar to self-training, we completely discard the labels of some samples and try to predict them. If the predicted labels match the original ones after re-labeling, there is a higher chance they are correct. We would gradually re-label the images into a cleaner dataset.
    - Or computing the probability that each label is wrong, then using that probability to give more weight during training to the samples whose labels are more likely to be correct. `torch.nn.CrossEntropyLoss` (with `reduction='none'`, so that per-sample weights can be applied)
- Noise-tolerant algorithms: complex deep learning models are more tolerant to noise. (This is not a very solid technique on its own, but it should be used together with the others.)
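Since a GCE loss is not shipped with `torch.nn`, here is a minimal sketch of how it could be written as a custom module. It assumes the usual formulation `(1 - p_y^q) / q`, where `p_y` is the softmax probability of the annotated class. The class name `GeneralizedCrossEntropy` and the default `q=0.7` are assumptions made for illustration, not part of PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeneralizedCrossEntropy(nn.Module):
    """Hypothetical GCE loss: mean of (1 - p_y^q) / q over the batch."""

    def __init__(self, q: float = 0.7):
        super().__init__()
        if not 0.0 < q <= 1.0:
            raise ValueError("q must be in (0, 1]")
        self.q = q

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Softmax probabilities, shape (batch, n_classes)
        probs = F.softmax(logits, dim=1)
        # Probability assigned to the (possibly noisy) annotated class
        p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
        # Clamp to avoid an unbounded gradient when p_true is exactly 0
        p_true = p_true.clamp_min(1e-7)
        return ((1.0 - p_true.pow(self.q)) / self.q).mean()


# Usage: drop-in replacement for nn.CrossEntropyLoss (takes raw logits)
criterion = GeneralizedCrossEntropy(q=0.7)
logits = torch.randn(8, 5, requires_grad=True)
targets = torch.randint(0, 5, (8,))
loss = criterion(logits, targets)
loss.backward()
```

With `q` close to 0 the loss behaves like cross-entropy, and with `q = 1` it becomes `1 - p_y`, which is less sensitive to confidently wrong labels but usually slower to fit.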

References:
- Focal loss: This loss function was first introduced in the paper "Focal Loss for Dense Object Detection" by Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár (https://arxiv.org/abs/1708.02002).
- Generalized cross-entropy loss: This loss function was proposed in the paper "Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels" by Zhilu Zhang and Mert R. Sabuncu (https://arxiv.org/abs/1805.07836).
- Bootstrapping: This technique involves training multiple models on different subsets of the training data, and then combining their predictions to make a final prediction. It was first introduced in the paper "Bagging Predictors" by Leo Breiman (https://link.springer.com/article/10.1023/A:1018054314350).
- Self-ensembling: This technique involves training multiple models on the same data, and then using their predictions to create a final prediction. An example is the paper "Self-Ensembling for Visual Domain Adaptation" by Geoffrey French, Michal Mackiewicz, and Mark Fisher (https://arxiv.org/abs/1706.05208).
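- Pseudo-labeling: This technique involves using the model's own predictions to label a portion of the training data. The term is usually attributed to Dong-Hyun Lee's 2013 workshop paper "Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks"; a related semi-supervised approach is described in "Semi-Supervised Learning with Deep Generative Models" by Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling (https://arxiv.org/abs/1406.5298).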
- Noise-aware training: This refers to techniques that are specifically designed to handle label noise in the training data. One example of such a technique is the generalized cross-entropy loss function, which was introduced in the paper "Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels" by Zhilu Zhang and Mert R. Sabuncu (https://arxiv.org/abs/1805.07836).
- Confidence-based methods: This refers to techniques that involve training a model to predict the confidence or probability of each class label, rather than just the class label itself. This can allow the model to identify examples that it is less confident about and that are potentially more prone to noise. One way to obtain such confidence estimates is the agreement among models trained by bootstrapping (bagging), which was introduced in the paper "Bagging Predictors" by Leo Breiman (https://link.springer.com/article/10.1023/A:1018054314350).

## Pseudo Labeling solution


This technique is briefly described above.

The first question is: for what portion of the dataset should I "forget" the labels and predict new ones (by training on the remaining samples)?

Under the worst-case assumption that half of the labels are wrong, we should leave out only a small portion of the labels to be re-labeled, so that the model still trains on enough correct labels.
But because we have a rather large dataset of 500 images per class, and the model we are using is a rather complex one that is already pretrained, we can increase the number of labels set aside for re-labeling.

Following this argument, I chose to split the dataset into 10 parts.
In each round I will leave 10% of the dataset out and train only on the remaining 90%.
This is similar to 10-fold cross-validation, but instead of validating on the held-out 10% we re-label it.
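The split described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper names `make_folds` and `split_for_round`, the sample count of 1000, and the seed are hypothetical, and the returned index lists would still have to be mapped onto the rows of the annotations file.

```python
import random


def make_folds(n_samples, n_folds=10, seed=42):
    """Shuffle sample indices and deal them into n_folds near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold k takes every n_folds-th shuffled index, starting at position k
    return [indices[k::n_folds] for k in range(n_folds)]


def split_for_round(folds, held_out):
    """Return (train_indices, relabel_indices) for one re-labeling round."""
    relabel = list(folds[held_out])
    train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
    return train, relabel


folds = make_folds(n_samples=1000)
for round_id in range(len(folds)):
    train_idx, relabel_idx = split_for_round(folds, held_out=round_id)
    # Train on train_idx (90%), then predict new labels for relabel_idx (10%)
    print(round_id, len(train_idx), len(relabel_idx))
```

Every sample lands in exactly one fold, so after all 10 rounds each image has received one predicted label from a model that never saw its original annotation.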

## Imports

In [None]:

import urllib
import shutil
import os
import time
import copy
import json

import pandas as pd
import numpy as np

from PIL import Image
import matplotlib.pyplot as plt

from tqdm import tqdm

import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils.data import SubsetRandomSampler, Dataset
from torchvision import transforms, datasets 

from sklearn.metrics import classification_report

import itertools
     

In [None]:
task2_id = "1GUF43k4PJX8YvNUWJbE1RnmXeI7nVbOL" 

In [None]:
# Download the dataset archive from Google Drive.
# To download a different file, replace the file id (it appears twice, after `id=`)
# and the target name passed to -O.
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1GUF43k4PJX8YvNUWJbE1RnmXeI7nVbOL' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1GUF43k4PJX8YvNUWJbE1RnmXeI7nVbOL" -O task2.tar.gz && rm -rf /tmp/cookies.txt


In [None]:
%%capture
!mkdir -p data
# The archive downloaded above is task2.tar.gz
!mv task2.tar.gz ./data
!tar -xzvf "/content/data/task2.tar.gz" -C "/content/data/"     #[run this cell to extract tar.gz files]
# this may take about 12 seconds

## Mount drive
(in order to save the labels produced in each leave-out iteration)

In [None]:
from google.colab import drive
drive.mount('/gdrive')

In [None]:
with open('/gdrive/MyDrive/foo.txt', 'w') as f:
  f.write('Hello Google Drive!')
!cat '/gdrive/MyDrive/foo.txt'

In [None]:
!rm '/gdrive/MyDrive/foo.txt'

## Setting things up

In [None]:
# Set up seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

In [None]:
dir_data = 'data/task2/train_data/'
# Read the annotations file into a DataFrame
df = pd.read_csv(f'{dir_data}annotations.csv')

In [None]:
# Organize data for pytorch training ( only once )
# Define the base directory
base_dir = 'data/task1/labeled'

# Iterate over the rows in the DataFrame
for _, row in tqdm(df.iterrows(), total=df.shape[0]):
    # Extract the path and class from the row
    path = row['sample']
    label = row['label']
    
    # Create the directory for the class
    class_dir = f'{base_dir}/{label}'
    os.makedirs(class_dir, exist_ok=True)
    
    # Copy the file to the class directory
    shutil.copy(f"data/{path}", class_dir) 

In [None]:
# TODO: Set up the place where the labels of each iteration are saved,
# preferably a DataFrame

In [None]:
# TODO: Randomly sampled cross re-labeling

In [None]:
# TODO: move the images whose labels are to be "forgotten"
# out of the dataset.
# They will be re-labeled at the end of training and put back into the data,
# but their annotations will be saved in a separate DataFrame


In [None]:
# TODO: during training, after each iteration,
# add a new set of labels to the CSV that holds the annotations.
# At the end of the training phase,
# use the predictions made by the models
# to choose the images with the highest confidence,
# i.e. the images that consistently receive the same label.
# This set of images will always be used
# in the next training phase and should not be forgotten again,
# because there is high confidence that these images have the
# correct labels.
# Because of the leave-out technique,
# I will have 9 predicted labels for each image
# and also the original label.
# The probability that a label is correct can be estimated
# by simply counting how often the most frequent label occurs
# and then dividing that number by 10
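
The counting rule from the comments above can be sketched as follows. This is a minimal illustration, not the final implementation: the function names, the example file names and labels, and the `0.8` agreement threshold are all assumptions chosen for the example.

```python
from collections import Counter


def vote(labels):
    """Return (majority_label, agreement) for one image's label history."""
    counts = Counter(labels)
    label, n_votes = counts.most_common(1)[0]
    # Agreement is the share of all collected labels that match the majority
    return label, n_votes / len(labels)


def split_by_confidence(label_history, threshold=0.8):
    """Split images into trusted {image: label} and a list of uncertain images."""
    trusted, uncertain = {}, []
    for image_id, labels in label_history.items():
        label, agreement = vote(labels)
        if agreement >= threshold:
            trusted[image_id] = label
        else:
            uncertain.append(image_id)
    return trusted, uncertain


# Hypothetical history: the original label first, then nine predicted labels
history = {
    "img_001.jpg": ["cat"] + ["cat"] * 9,
    "img_002.jpg": ["dog"] + ["cat"] * 6 + ["dog"] * 3,
}
trusted, uncertain = split_by_confidence(history)
print(trusted)
print(uncertain)
```

Images in `trusted` would keep their majority label in the next training phase, while images in `uncertain` remain candidates for further re-labeling.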