# Label Smoothing

Label smoothing is a regularization technique used to improve the performance and generalization of deep learning models, particularly in classification tasks. By altering the hard labels during training, label smoothing helps mitigate issues such as overfitting and overconfidence in the model's predictions.

## What is Label Smoothing?

In traditional classification tasks, the ground truth labels are represented as one-hot vectors. For instance, if there are three classes and the true label is class 2, the one-hot vector would be `[0, 1, 0]`. Label smoothing, on the other hand, replaces the hard 0 and 1 values with softer values. 

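For example, with a smoothing factor $\alpha$, the smoothed label $y_{smooth}$ for a true one-hot label $y_{true}$ can be computed as follows, where $u$ is the uniform distribution over the $K$ classes (every entry equals $1/K$):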

$$
y_{smooth} = (1 - \alpha) \cdot y_{true} + \alpha \cdot u
$$

With three classes, each entry of $u$ is $1/3$, so $\alpha = 0.3$ turns `[0, 1, 0]` into `[0.1, 0.8, 0.1]`. A smaller factor such as $\alpha = 0.2$ gives approximately `[0.067, 0.867, 0.067]`.
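As a quick check of the formula, the following plain-Python sketch smooths a one-hot label. The function name `smooth_label` and its arguments are illustrative rather than part of any library, and $u$ is assumed to be uniform over the classes.

```python
def smooth_label(true_class, num_classes, alpha):
    """Return (1 - alpha) * one_hot + alpha * uniform as a list of floats."""
    uniform = 1.0 / num_classes
    return [
        (1.0 - alpha) * (1.0 if k == true_class else 0.0) + alpha * uniform
        for k in range(num_classes)
    ]


# Three classes, true class at index 1, alpha = 0.3
smoothed = smooth_label(true_class=1, num_classes=3, alpha=0.3)
print([round(v, 3) for v in smoothed])  # [0.1, 0.8, 0.1]
```

The smoothed entries still sum to 1, so the result remains a valid probability distribution.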

## Why Use Label Smoothing?

### 1. Reducing Overfitting

Label smoothing softens the training targets, which acts as a form of regularization. Because the target for the correct class is no longer exactly 1, the model is not rewarded for pushing its predictions to the extreme, which helps reduce overfitting on the training data.

### 2. Mitigating Overconfidence

Neural networks can become overconfident in their predictions, assigning probabilities very close to 1 for the predicted class. This overconfidence can be problematic, especially when the model encounters out-of-distribution data. Label smoothing encourages the model to be less certain, leading to better calibration of the predicted probabilities.
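To illustrate the effect on confidence, the sketch below compares the cross-entropy of two predictions against a smoothed target. The numbers are made up for illustration; the point is only that, with a smoothed target, a nearly one-hot prediction is penalized more than a prediction that matches the target.

```python
import math


def cross_entropy(target, pred):
    """Cross-entropy H(target, pred) = -sum_k target_k * log(pred_k)."""
    return -sum(t * math.log(p) for t, p in zip(target, pred))


smoothed_target = [0.1, 0.8, 0.1]      # hypothetical smoothed label for class 1
overconfident = [0.001, 0.998, 0.001]  # nearly one-hot prediction
matching = [0.1, 0.8, 0.1]             # prediction equal to the smoothed target

loss_overconfident = cross_entropy(smoothed_target, overconfident)
loss_matching = cross_entropy(smoothed_target, matching)
print(loss_overconfident > loss_matching)  # True
```

Cross-entropy against a fixed target is minimized when the prediction equals that target, so the loss no longer keeps decreasing as the predicted probability approaches 1.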

### 3. Improving Generalization

By smoothing the labels, the model learns to distribute some probability mass to all classes, rather than focusing entirely on the correct class. This can lead to improved generalization performance on unseen data.

## Implementing Label Smoothing

Label smoothing can be easily implemented in most deep learning frameworks. Below is an example implementation using PyTorch:

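In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelSmoothCEloss(nn.Module):
    """Cross-entropy loss computed against label-smoothed targets."""

    def __init__(self):
        super().__init__()

    def forward(self, pred, label, smoothing=0.1):
        # pred: raw logits of shape (N, C); label: class indices of shape (N,)
        num_classes = pred.size(1)
        # log_softmax is numerically more stable than log(softmax(...))
        log_prob = F.log_softmax(pred, dim=1)
        one_hot_label = F.one_hot(label, num_classes).float()
        # (1 - alpha) * y_true + alpha * u, with u uniform over the classes
        smoothed_one_hot_label = (1.0 - smoothing) * one_hot_label + smoothing / num_classes
        # cross-entropy per sample, then averaged over the batch
        loss = -(log_prob * smoothed_one_hot_label).sum(dim=1)
        return loss.mean()

criterion = LabelSmoothCEloss()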
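Recent PyTorch releases (1.10 and later) also expose a `label_smoothing` argument on `F.cross_entropy` and `nn.CrossEntropyLoss`. The sketch below, using a made-up batch of logits and labels, checks that the built-in argument agrees with a manual computation of the smoothed cross-entropy.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)           # hypothetical batch: 4 samples, 3 classes
labels = torch.tensor([0, 2, 1, 1])  # class indices
smoothing = 0.1

# Manual version: cross-entropy against (1 - alpha) * one_hot + alpha / K
one_hot = F.one_hot(labels, num_classes=3).float()
smoothed = (1.0 - smoothing) * one_hot + smoothing / 3
manual_loss = -(F.log_softmax(logits, dim=1) * smoothed).sum(dim=1).mean()

# Built-in version (the label_smoothing argument needs PyTorch 1.10+)
builtin_loss = F.cross_entropy(logits, labels, label_smoothing=smoothing)

print(torch.allclose(manual_loss, builtin_loss, atol=1e-6))
```

When the built-in argument is available, it is usually the simpler choice; a custom module remains useful when the smoothing scheme needs to be changed.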

# Augmentation

We introduced some useful data augmentation techniques for images in P1 chapter4. In this chapter, we introduce two important techniques that were not covered in previous chapters: `Mixup` and `CutMix`.

## Mixup

Mixup is a simple yet effective data augmentation technique, particularly useful for image classification tasks. Its basic idea is to linearly interpolate between pairs of training images and between their labels. Specifically, for each pair of training samples, a new sample is created by taking a weighted average of the two. The same weights are applied to the input images and to the corresponding labels.

For example, given two training samples $(x_1, y_1)$ and $(x_2, y_2)$, where $x_1$ and $x_2$ are images, and $y_1$ and $y_2$ are the corresponding labels, the new sample generated by Mixup is as follows:

$$
\begin{align*}
\tilde{x} &= \lambda x_1 + (1 - \lambda) x_2 \\
\tilde{y} &= \lambda y_1 + (1 - \lambda) y_2
\end{align*}
$$

Where:

- $\tilde{x}$ is the new image
- $\tilde{y}$ is the new label
- $\lambda \in [0, 1]$ is a random interpolation coefficient sampled from a beta distribution, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$

Mixup encourages the model to learn more generalizable features by blending different samples together, thereby reducing overfitting and improving the model's performance on unseen data.
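The interpolation can be sketched in plain Python on a pair of toy samples. The helper `mixup_pair`, the two-pixel "images", and the value of `alpha` are illustrative assumptions; labels are assumed to be one-hot vectors so that they can be averaged.

```python
import random


def mixup_pair(x1, y1, x2, y2, lam):
    """Blend two samples and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# Hypothetical flattened "images" with two pixels each, and one-hot labels
x1, y1 = [1.0, 0.0], [1.0, 0.0]
x2, y2 = [0.0, 1.0], [0.0, 1.0]

alpha = 0.4                             # assumed Beta parameter
lam = random.betavariate(alpha, alpha)  # lambda ~ Beta(alpha, alpha)
x_mix, y_mix = mixup_pair(x1, y1, x2, y2, lam)
print(lam, x_mix, y_mix)
```

Because the same `lam` weights both the images and the labels, the mixed label always sums to 1 when the original labels are one-hot.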

### Benefits of Mixup:

1. **Regularization**: Mixup acts as a form of regularization by adding noise to the training data, which helps prevent the model from overfitting.

2. **Improving Generalization**: By blending samples, Mixup encourages the model to learn more robust features that generalize better to unseen data.

3. **Enhanced Diversity**: Mixup increases the diversity of the training data by creating new samples from combinations of existing ones, leading to a more comprehensive learning process.

### Implementation:

Mixup can be easily implemented in most deep learning frameworks. Below is a simple example of how to implement Mixup in PyTorch:

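In [None]:
import numpy as np
import torch

# model, criterion, optimizer, train_loader and config are assumed to be defined elsewhere
for i, (images, target) in enumerate(train_loader):
    # 1. input and target on the GPU
    images = images.cuda(non_blocking=True)
    # class indices must be integer (long) for a cross-entropy criterion
    target = torch.from_numpy(np.array(target)).long().cuda(non_blocking=True)

    # 2. mixup
    alpha = config.alpha
    lam = np.random.beta(alpha, alpha)  # interpolation coefficient
    index = torch.randperm(images.size(0)).cuda()  # random partner for each sample
    inputs = lam * images + (1 - lam) * images[index, :]
    targets_a, targets_b = target, target[index]
    outputs = model(inputs)
    # for cross-entropy, weighting the two losses equals the loss against the mixed label
    loss = lam * criterion(outputs, targets_a) + (1 - lam) * criterion(outputs, targets_b)

    # 3. backward
    optimizer.zero_grad()  # reset gradient
    loss.backward()
    optimizer.step()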

## CutMix

CutMix is a powerful data augmentation technique designed to improve the generalization and robustness of convolutional neural networks (CNNs) in image classification tasks. It involves cutting and pasting patches from one image onto another and mixing their labels accordingly.

### How CutMix Works:

1. **Image Transformation**:
   - Select a random image from the dataset.
   - Randomly choose a rectangular patch within the image.
   - Replace this patch with a patch from another randomly chosen image in the dataset.
   
2. **Label Generation**:
   - The new image is assigned a label that is a weighted average of the labels of the two images, with each weight proportional to the fraction of the image area that comes from the corresponding source image.
   
3. **Training**:
   - The model is trained on the augmented images with their corresponding mixed labels.
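
The steps above can be sketched on tiny nested-list "images". This is a simplified illustration: the helper `cutmix_pair` is hypothetical, and the patch position and size are fixed here, whereas a real pipeline would sample them at random.

```python
def cutmix_pair(img_a, img_b, label_a, label_b, top, left, height, width):
    """Paste a patch of img_b into img_a and mix the labels by area."""
    rows, cols = len(img_a), len(img_a[0])
    r_end = min(top + height, rows)  # clip the patch to the image bounds
    c_end = min(left + width, cols)
    mixed = [row[:] for row in img_a]  # copy so img_a is not modified
    for r in range(top, r_end):
        for c in range(left, c_end):
            mixed[r][c] = img_b[r][c]
    # Fraction of the result that still comes from img_a
    lam = 1.0 - ((r_end - top) * (c_end - left)) / (rows * cols)
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, label


img_a = [[0] * 4 for _ in range(4)]  # 4x4 image of class 0
img_b = [[1] * 4 for _ in range(4)]  # 4x4 image of class 1
mixed, label = cutmix_pair(img_a, img_b, [1.0, 0.0], [0.0, 1.0],
                           top=1, left=1, height=2, width=2)
print(label)  # [0.75, 0.25]
```

The 2x2 patch covers 4 of the 16 pixels, so the first image keeps a weight of 0.75 and the pasted image receives 0.25.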

### Benefits of CutMix:

1. **Regularization**:
   - CutMix acts as a form of regularization by adding noise to the training data, reducing overfitting and improving the model's generalization ability.
   
2. **Improved Robustness**:
   - By combining patches from different images, CutMix encourages the model to learn features that are robust to variations in the input data.
   
3. **Increased Diversity**:
   - CutMix increases the diversity of the training data by creating new samples from combinations of existing ones, leading to a more comprehensive learning process.

### Implementation:

CutMix can be implemented in most deep learning frameworks. Here's a high-level overview of how to implement CutMix in PyTorch:

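In [None]:
import numpy as np
import torch

def cutmix(data, targets, alpha=1.0):
    # Pair every sample with another sample from the same batch
    indices = torch.randperm(data.size(0), device=data.device)
    shuffled_data = data[indices]
    shuffled_targets = targets[indices]

    # Initial mixing ratio, drawn from Beta(alpha, alpha)
    lam = np.random.beta(alpha, alpha)

    bbx1, bby1, bbx2, bby2 = rand_bbox(data.size(), lam)
    new_data = data.clone()
    new_data[:, :, bbx1:bbx2, bby1:bby2] = shuffled_data[:, :, bbx1:bbx2, bby1:bby2]

    # Recompute lam from the clipped box so it matches the pasted area exactly
    lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (data.size(-1) * data.size(-2)))

    targets = (targets, shuffled_targets, lam)
    return new_data, targets


def rand_bbox(size, lam):
    # size is (N, C, dim2, dim3); W and H name dims 2 and 3, matching the slicing in cutmix
    W = size[2]
    H = size[3]
    cut_rat = np.sqrt(1. - lam)
    cut_w = int(W * cut_rat)  # np.int was removed in NumPy 1.24; use the built-in int
    cut_h = int(H * cut_rat)

    # Random box centre
    cx = np.random.randint(W)
    cy = np.random.randint(H)

    # Clip the box to the image bounds
    bbx1 = int(np.clip(cx - cut_w // 2, 0, W))
    bby1 = int(np.clip(cy - cut_h // 2, 0, H))
    bbx2 = int(np.clip(cx + cut_w // 2, 0, W))
    bby2 = int(np.clip(cy + cut_h // 2, 0, H))
    return bbx1, bby1, bbx2, bby2


def cutmix_criterion(criterion, preds, targets):
    # Weight the loss against each label by the area its image contributes
    targets_a, targets_b, lam = targets
    return lam * criterion(preds, targets_a) + (1 - lam) * criterion(preds, targets_b)


# model, criterion, optimizer and data_loader are assumed to be defined elsewhere
for inputs, targets in data_loader:
    inputs, targets = cutmix(inputs, targets)
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = cutmix_criterion(criterion, outputs, targets)
    loss.backward()
    optimizer.step()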