## Label Smoothing

Label smoothing is a less commonly discussed regularisation technique that elegantly addresses this issue.

We intentionally reduce the probability mass of the true class slightly.
The reduced probability mass is uniformly distributed to all other classes.

In effect, we are asking the model to be “less overconfident” during training and prediction while still attempting to make accurate predictions.
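To make the idea concrete, here is a minimal sketch of how smoothed targets can be built by hand and checked against PyTorch's built-in `label_smoothing` option. The helper `smooth_labels` and the values `eps=0.2` and `num_classes=10` are assumptions chosen for illustration. Note that PyTorch's formulation spreads the removed mass uniformly over all classes, the true class included, so the true class ends up with `1 - eps + eps / num_classes`.

```python
import torch
import torch.nn.functional as F

def smooth_labels(labels: torch.Tensor, num_classes: int, eps: float) -> torch.Tensor:
    """Blend one-hot targets with a uniform distribution over all classes.

    Hypothetical helper for illustration; it mirrors the formulation used by
    PyTorch's `label_smoothing` argument.
    """
    one_hot = F.one_hot(labels, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes

labels = torch.tensor([3])  # a single sample whose true class is 3
targets = smooth_labels(labels, num_classes=10, eps=0.2)
print(targets)  # the true class keeps most of the mass, the rest is shared

# Cross-entropy against the soft targets, computed by hand ...
torch.manual_seed(0)
logits = torch.randn(1, 10)
manual = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# ... and with the built-in option, for comparison
builtin = F.cross_entropy(logits, labels, label_smoothing=0.2)
print(manual.item(), builtin.item())
```

With these values the true class receives 0.82 and every other class 0.02, and the two loss values should agree up to floating-point error.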

### When not to use label smoothing?

If you only care about getting the final prediction correct and improving generalization, label smoothing is a handy technique.

However, I wouldn’t recommend using it if you care about both:
1. Getting the prediction correct.
2. Understanding the model’s confidence in that prediction.

This is because, as discussed above, label smoothing guides the model to become “less overconfident” about its predictions.
Thus, we typically notice a drop in the confidence values for every prediction.
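The drop in confidence follows directly from the loss. As a rough sketch (not part of the original notebook), we can optimise the logits of a single sample directly and watch where the top probability settles. The helper `fit_logits`, the choice of class 3, the SGD settings, and the smoothing value 0.2 are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def fit_logits(label_smoothing: float, steps: int = 300) -> torch.Tensor:
    """Minimise cross-entropy for one sample of class 3 by updating its logits directly.

    Hypothetical helper: there is no network here, only a free logit vector.
    """
    logits = torch.zeros(1, 10, requires_grad=True)
    label = torch.tensor([3])
    optimizer = torch.optim.SGD([logits], lr=1.0)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits, label, label_smoothing=label_smoothing)
        loss.backward()
        optimizer.step()
    return F.softmax(logits.detach(), dim=1)

hard = fit_logits(label_smoothing=0.0)
smooth = fit_logits(label_smoothing=0.2)
print(f"hard targets:     top probability = {hard.max().item():.3f}")
print(f"smoothed targets: top probability = {smooth.max().item():.3f}")
```

Under these assumptions the smoothed run settles near 0.82, the smoothed target for the true class, while the hard-target run keeps pushing towards 1. The loss itself rewards lower confidence, which is why the reported probabilities stop being a faithful measure of certainty.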

## Imports

In [2]:
import sys
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import torch.nn.functional as F
import numpy as np
import pandas as pd

from skorch import NeuralNetClassifier
from time import time
from tqdm import tqdm
from torch.utils.data import DataLoader

## Dataset

In [3]:
# Define data transformations
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# Load the Fashion MNIST dataset for both train and test sets
train_dataset = torchvision.datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = torchvision.datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)

# Define batch sizes for train and test data loaders
batch_size = 64

# Create data loaders for train and test sets
trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
testloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

## Set seeds

In [4]:
np.random.seed(20)
torch.manual_seed(20)

<torch._C.Generator at 0x13921f910>

## Define Neural Network

In [5]:
# Define a simple neural network with 4 fully connected layers
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x1 = torch.relu(self.fc1(x))
        x2 = torch.relu(self.fc2(x1))
        x3 = torch.relu(self.fc3(x2))
        x4 = self.fc4(x3)
        return x1, x2, x3, x4  # Return intermediate activations too; the logits are the last element

## Test evaluation

In [6]:
def evaluate(model):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            inputs, labels = data
            outputs = model(inputs)[-1] # use last element returned by forward function
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total

## Without label Smoothing

In [7]:
net = SimpleNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

for epoch in range(20):
    net.train()
    running_loss = 0.0
    
    for data in trainloader:
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs[-1], labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        
    accuracy = evaluate(net)
        
    print(f"Epoch {epoch + 1}, Loss: {round(running_loss / len(trainloader), 2)}, Accuracy: {accuracy * 100:.2f}%")

Epoch 1, Loss: 0.5, Accuracy: 84.90%
Epoch 2, Loss: 0.37, Accuracy: 85.73%
Epoch 3, Loss: 0.33, Accuracy: 86.78%
Epoch 4, Loss: 0.31, Accuracy: 87.45%
Epoch 5, Loss: 0.29, Accuracy: 87.46%
Epoch 6, Loss: 0.27, Accuracy: 87.28%
Epoch 7, Loss: 0.26, Accuracy: 86.80%
Epoch 8, Loss: 0.25, Accuracy: 87.62%
Epoch 9, Loss: 0.23, Accuracy: 88.46%
Epoch 10, Loss: 0.22, Accuracy: 88.79%
Epoch 11, Loss: 0.21, Accuracy: 88.21%
Epoch 12, Loss: 0.2, Accuracy: 88.14%
Epoch 13, Loss: 0.19, Accuracy: 88.81%
Epoch 14, Loss: 0.18, Accuracy: 89.17%
Epoch 15, Loss: 0.17, Accuracy: 88.81%
Epoch 16, Loss: 0.17, Accuracy: 89.00%
Epoch 17, Loss: 0.16, Accuracy: 88.54%
Epoch 18, Loss: 0.15, Accuracy: 89.12%
Epoch 19, Loss: 0.15, Accuracy: 88.97%
Epoch 20, Loss: 0.14, Accuracy: 88.84%


### Output probability on a sample

In [8]:
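net.eval()

# Iterate through the test loader so that `inputs` holds the last test batch
with torch.no_grad():
    for data in testloader:
        inputs, labels = data

# Class probabilities for the first sample of that batch (softmax over the class dimension)
F.softmax(net(inputs[0])[-1], dim=1)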


tensor([[5.8588e-03, 7.4752e-04, 3.6302e-04, 9.9173e-01, 1.2088e-05, 1.4675e-08,
         1.2671e-03, 2.2036e-14, 2.3719e-05, 1.0715e-11]],
       grad_fn=<SoftmaxBackward0>)

## With Label Smoothing

Restart the kernel before executing the cell below. This time, don't run the "Without label Smoothing" cell.
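The restart presumably ensures that both runs start from the same seeded random state. If restarting is inconvenient, one possible alternative is to reset the seeds right before building the model. This is a minimal sketch, assuming NumPy and PyTorch's global generators are the only sources of randomness; `set_seed` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np
import torch
import torch.nn as nn

def set_seed(seed: int = 20) -> None:
    """Reset the NumPy and PyTorch global random generators (hypothetical helper)."""
    np.random.seed(seed)
    torch.manual_seed(seed)

# Two layers created after the same seed receive identical initial weights
set_seed(20)
first = nn.Linear(4, 2).weight.detach().clone()
set_seed(20)
second = nn.Linear(4, 2).weight.detach().clone()
print(torch.equal(first, second))
```

Calling `set_seed(20)` immediately before `net = SimpleNet()` should give both experiments the same weight initialisation and shuffle order, provided nothing else consumes random numbers in between.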

In [9]:
net = SimpleNet()
criterion = nn.CrossEntropyLoss(label_smoothing = 0.2)
optimizer = optim.Adam(net.parameters(), lr=0.001)

for epoch in range(20):
    net.train()
    running_loss = 0.0
    
    for data in trainloader:
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs[-1], labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        
    accuracy = evaluate(net)
        
    print(f"Epoch {epoch + 1}, Loss: {round(running_loss / len(trainloader), 2)}, Accuracy: {accuracy * 100:.2f}%")

Epoch 1, Loss: 1.19, Accuracy: 84.17%
Epoch 2, Loss: 1.1, Accuracy: 86.38%
Epoch 3, Loss: 1.08, Accuracy: 86.80%
Epoch 4, Loss: 1.06, Accuracy: 87.46%
Epoch 5, Loss: 1.05, Accuracy: 87.47%
Epoch 6, Loss: 1.04, Accuracy: 88.16%
Epoch 7, Loss: 1.03, Accuracy: 88.47%
Epoch 8, Loss: 1.02, Accuracy: 88.73%
Epoch 9, Loss: 1.01, Accuracy: 88.46%
Epoch 10, Loss: 1.01, Accuracy: 88.56%
Epoch 11, Loss: 1.0, Accuracy: 88.98%
Epoch 12, Loss: 1.0, Accuracy: 88.51%
Epoch 13, Loss: 0.99, Accuracy: 88.80%
Epoch 14, Loss: 0.98, Accuracy: 89.41%
Epoch 15, Loss: 0.98, Accuracy: 88.60%
Epoch 16, Loss: 0.97, Accuracy: 89.34%
Epoch 17, Loss: 0.97, Accuracy: 88.83%
Epoch 18, Loss: 0.96, Accuracy: 88.78%
Epoch 19, Loss: 0.96, Accuracy: 89.07%
Epoch 20, Loss: 0.96, Accuracy: 89.29%


### Output probability on a sample

In [10]:
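net.eval()

# Iterate through the test loader so that `inputs` holds the last test batch
with torch.no_grad():
    for data in testloader:
        inputs, labels = data

# Class probabilities for the first sample of that batch (softmax over the class dimension)
F.softmax(net(inputs[0])[-1], dim=1)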


tensor([[0.0375, 0.0308, 0.0264, 0.6425, 0.0349, 0.0293, 0.1042, 0.0259, 0.0412,
         0.0274]], grad_fn=<SoftmaxBackward0>)
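
A single sample only hints at the effect. To compare the two models more broadly, one could average the top-class probability over the whole test set. The sketch below assumes a model whose forward pass returns either logits or a tuple ending in logits (as `SimpleNet` does). `mean_confidence`, `toy_model`, and `toy_loader` are hypothetical names, and the random toy data merely stands in for `net` and `testloader`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def mean_confidence(model: nn.Module, loader: DataLoader) -> float:
    """Average top-class softmax probability over every sample in `loader`."""
    model.eval()
    total_conf, total = 0.0, 0
    with torch.no_grad():
        for inputs, _ in loader:
            out = model(inputs)
            # Accept models that return a tuple whose last element holds the logits
            logits = out[-1] if isinstance(out, tuple) else out
            probs = F.softmax(logits, dim=1)
            total_conf += probs.max(dim=1).values.sum().item()
            total += inputs.size(0)
    return total_conf / total

# Toy stand-ins so the sketch runs on its own (assumed, not from the notebook)
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_loader = DataLoader(
    TensorDataset(torch.randn(8, 4), torch.randint(0, 3, (8,))),
    batch_size=4,
)
conf = mean_confidence(toy_model, toy_loader)
print(f"Mean confidence: {conf:.3f}")
```

In the notebook, calling `mean_confidence(net, testloader)` after each training run would give one number per model to compare.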