<a href="https://colab.research.google.com/github/Mabinogit/AI-Image-Classification/blob/main/loss_function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install torch


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
class LabelSmoothedCrossEntropy(nn.Module):
    def __init__(self, epsilon=0.1, ignore_index=-100):
        """
        Label-smoothed cross-entropy loss for Transformer models.

        Args:
        - epsilon (float): Amount of label smoothing. A higher epsilon moves more probability mass from the target class to the other classes. Defaults to 0.1.
        - ignore_index (int or None): Target value to exclude from the loss (typically the padding token). Defaults to -100. Pass None to disable masking.
        """
        super().__init__()
        self.epsilon = epsilon
        self.ignore_index = ignore_index

    def forward(self, logits, target):
        """
        Computes the label-smoothed cross-entropy loss.

        Args:
        - logits (Tensor): Model output (batch_size, seq_len, vocab_size)
        - target (Tensor): Ground-truth labels (batch_size, seq_len)

        Returns:
        - loss (Tensor): Scalar loss value
        """
        # Number of classes (vocabulary size), read from the last dimension of the logits.
        num_classes = logits.size(-1)

        # Convert logits to log probabilities.
        log_probs = F.log_softmax(logits, dim=-1)

        # Build the smoothed target distribution; it needs no gradients.
        with torch.no_grad():
            if self.ignore_index is not None:
                # True for real tokens, False for padding.
                mask = target != self.ignore_index
                # ignore_index may be out of range (e.g. -100), which scatter_ cannot index.
                # Point ignored positions at class 0; their loss is zeroed out below.
                safe_target = target.masked_fill(~mask, 0)
            else:
                mask = None
                safe_target = target
            # Start from zeros with the same shape as log_probs.
            true_dist = torch.zeros_like(log_probs)
            # Write 1.0 at each target index to get a one-hot encoding.
            true_dist.scatter_(-1, safe_target.unsqueeze(-1), 1.0)
            # Keep (1 - epsilon) on the target and spread epsilon uniformly over all classes.
            true_dist = (1 - self.epsilon) * true_dist + self.epsilon / num_classes

        # Weight each negative log probability by its smoothed target probability.
        loss = -true_dist * log_probs
        # Sum over the vocabulary dimension to get one loss value per token.
        loss = loss.sum(dim=-1)

        # Worked example (3-word vocabulary, epsilon = 0.1, word 1 is the target):
        #     Predicted probabilities: word 1 = 0.8, word 2 = 0.1, word 3 = 0.1
        #     Smoothed target: word 1 = 0.9 + 0.1/3 ≈ 0.933, words 2 and 3 = 0.1/3 ≈ 0.033 each
        #     Loss: 0.933 * -log(0.8) + 0.033 * -log(0.1) + 0.033 * -log(0.1)
        #         ≈ 0.208 + 0.077 + 0.077 ≈ 0.362

        if mask is not None:
            loss = loss * mask  # Zero out the loss at padding positions
            # Average over non-padding tokens (clamp avoids dividing by zero).
            return loss.sum() / mask.sum().clamp(min=1)
        else:
            return loss.mean()  # Plain mean over all tokens

'''
Batch size: 2
Sequence length: 8 (after padding)
Assume we have two sequences in our batch:
Sequence 1: "The cat sat on the mat" (6 tokens)
Sequence 2: "I love dogs" (3 tokens)

target = [[2, 3, 4, 5, 6, 7, 0, 0],   # "The cat sat on the mat" + padding
          [1, 8, 9, 0, 0, 0, 0, 0]]   # "I love dogs" + padding

Assume ignore_index = 0 (the padding token is 0).
Suppose that, after computing the loss for each token and summing over the vocabulary dimension, we have the following reduced loss tensor:

loss_reduced = [[0.5, 0.2, 0.3, 0.1, 0.6, 0.4, 0.8, 0.9],
                [0.7, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8]]


Creating the Mask:
  mask = target != ignore_index

This will create the following mask:
  mask = [[True, True, True, True, True, True, False, False],
        [True, True, True, False, False, False, False, False]]


Element-wise Multiplication:
    loss_masked = [[0.5 * True, 0.2 * True, 0.3 * True, 0.1 * True, 0.6 * True, 0.4 * True, 0.8 * False, 0.9 * False],
               [0.7 * True, 0.1 * True, 0.2 * True, 0.3 * False, 0.5 * False, 0.6 * False, 0.7 * False, 0.8 * False]]

Since True is treated as 1 and False as 0 in numerical operations, this simplifies to:
    loss_masked = [[0.5, 0.2, 0.3, 0.1, 0.6, 0.4, 0.0, 0.0],
               [0.7, 0.1, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]]

Normalizing by Non-Padding Tokens:

  loss_masked.sum() = (0.5 + 0.2 + 0.3 + 0.1 + 0.6 + 0.4) + (0.7 + 0.1 + 0.2) = 2.1 + 1.0 = 3.1
  mask.sum() = 6 + 3 = 9 (True counts as 1 and False as 0)
  average_loss = 3.1 / 9 ≈ 0.344

'''
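The masking arithmetic in the note above can be checked directly in PyTorch. This is a minimal sketch that reuses the illustrative `target` and `loss_reduced` values from the note; the numbers are made up for the walkthrough and do not come from a real model.

```python
import torch

# Values copied from the walkthrough; ignore_index = 0 marks padding.
ignore_index = 0
target = torch.tensor([[2, 3, 4, 5, 6, 7, 0, 0],
                       [1, 8, 9, 0, 0, 0, 0, 0]])
loss_reduced = torch.tensor([[0.5, 0.2, 0.3, 0.1, 0.6, 0.4, 0.8, 0.9],
                             [0.7, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8]])

mask = target != ignore_index      # True for real tokens, False for padding
loss_masked = loss_reduced * mask  # bool is promoted to float: True -> 1, False -> 0
average_loss = loss_masked.sum() / mask.sum()

print(loss_masked.sum().item())  # approximately 3.1
print(mask.sum().item())         # 9
print(average_loss.item())       # approximately 0.344
```

Dividing by `mask.sum()` rather than by the total number of positions (16 here) keeps heavily padded batches from reporting an artificially small loss.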





In [None]:
# Example usage
batch_size, seq_len, vocab_size = 2, 5, 10
logits = torch.randn(batch_size, seq_len, vocab_size)
target = torch.randint(0, vocab_size, (batch_size, seq_len))

criterion = LabelSmoothedCrossEntropy(epsilon=0.1, ignore_index=0)
loss = criterion(logits, target)
print(loss.item())
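As a sanity check, the same smoothing rule, `(1 - epsilon) * one_hot + epsilon / num_classes`, is what PyTorch's built-in `F.cross_entropy` applies through its `label_smoothing` argument. The sketch below recomputes the loss by hand and compares it with the built-in. It assumes a recent PyTorch release; `pad_id`, `manual`, and `builtin` are names chosen for this illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 5, 10
epsilon, pad_id = 0.1, 0  # pad_id plays the role of ignore_index

logits = torch.randn(batch_size, seq_len, vocab_size)
target = torch.randint(1, vocab_size, (batch_size, seq_len))
target[:, -1] = pad_id  # force some padding so the mask matters

# Manual label-smoothed loss, following the same steps as the class above.
log_probs = F.log_softmax(logits, dim=-1)
one_hot = F.one_hot(target, vocab_size).float()
true_dist = (1 - epsilon) * one_hot + epsilon / vocab_size
per_token = -(true_dist * log_probs).sum(dim=-1)
mask = target != pad_id
manual = (per_token * mask).sum() / mask.sum()

# The built-in expects (N, C) logits and (N,) targets, so flatten batch and sequence.
builtin = F.cross_entropy(
    logits.view(-1, vocab_size),
    target.view(-1),
    ignore_index=pad_id,
    label_smoothing=epsilon,
)

print(manual.item(), builtin.item())
print(torch.allclose(manual, builtin, atol=1e-5))
```

If the two values agree, the custom module can be swapped for the built-in in code that does not need to customize the smoothing rule.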

# Transformer Model's Role:

The Transformer model is designed to process sequences of data, like text. In the context of predicting the next word, it takes a sequence of words as input and outputs a probability distribution over its vocabulary for the next word.
Essentially, for each word in the vocabulary, the model assigns a probability representing how likely it is to be the next word in the sequence.
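To make the idea of a probability distribution over the vocabulary concrete, here is a minimal sketch. The four-word `vocab` and the `logits` values are hypothetical stand-ins for a real model's output.

```python
import torch
import torch.nn.functional as F

vocab = ["cat", "dog", "mat", "sat"]          # hypothetical 4-word vocabulary
logits = torch.tensor([2.0, 0.5, 0.1, -1.0])  # made-up scores for the next word

# Softmax turns raw scores into non-negative values that sum to 1.
probs = F.softmax(logits, dim=-1)
for word, p in zip(vocab, probs.tolist()):
    print(f"{word}: {p:.3f}")

# The most likely next word is the one with the largest probability.
predicted = vocab[int(probs.argmax())]
print(predicted)  # cat
```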
# NLL and Cross-Entropy's Purpose:

NLL (Negative Log Likelihood): NLL is a way to measure how well the model's predicted probability distribution matches the true distribution (i.e., the actual next word). Lower NLL values indicate better predictions.
Cross-Entropy: In the case of one-hot encoded targets (where the true next word has a probability of 1 and all others have 0), cross-entropy is mathematically equivalent to NLL. It serves as the loss function during training, guiding the model to adjust its parameters and improve its predictions.
Minimizing the Loss: The training process involves iteratively adjusting the model's parameters to minimize the NLL (or cross-entropy) loss. By minimizing this loss, the model learns to assign higher probabilities to the correct next word and lower probabilities to incorrect words.
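The equivalence between NLL and cross-entropy for one-hot targets can be verified numerically. The sketch below uses a single made-up prediction over a four-word vocabulary and computes the loss three ways.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1, -1.0]])  # one made-up prediction
target = torch.tensor([0])                      # index of the actual next word

log_probs = F.log_softmax(logits, dim=-1)

# NLL: negative log probability assigned to the true word.
nll = -log_probs[0, target[0]]

# Cross-entropy against an explicit one-hot target distribution.
one_hot = F.one_hot(target, num_classes=logits.size(-1)).float()
cross_entropy = -(one_hot * log_probs).sum(dim=-1).mean()

# PyTorch's built-in takes raw logits and class indices.
builtin = F.cross_entropy(logits, target)

print(nll.item(), cross_entropy.item(), builtin.item())  # three matching values
```

Because the one-hot target is zero everywhere except the true word, every term of the cross-entropy sum vanishes except the NLL term.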
# Label Smoothing's Refinement:

Label smoothing is a technique used to prevent the model from becoming overconfident in its predictions. It slightly modifies the target distribution (making it not strictly one-hot), encouraging the model to be less certain and more robust to noisy or unexpected data.
In this case, the loss function is called "Label Smoothed Cross-Entropy" because it applies label smoothing to the cross-entropy loss. It is still a cross-entropy, but computed against the smoothed target: a (1 - epsilon)-weighted NLL term for the true word plus a small epsilon / num_classes-weighted term for every word in the vocabulary.
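The effect on overconfidence can be seen in a small experiment. The sketch below, using made-up logits over a four-word vocabulary, builds the smoothed target and compares a confident prediction with an extremely confident one under both losses; the `results` dictionary is only an illustration aid.

```python
import torch
import torch.nn.functional as F

epsilon, num_classes = 0.1, 4
target = torch.tensor([0])

# Smoothed target: 0.925 on the true word, 0.025 on each of the other three.
one_hot = F.one_hot(target, num_classes).float()
smoothed = (1 - epsilon) * one_hot + epsilon / num_classes
print(smoothed)

predictions = {
    "confident": torch.tensor([[4.0, 0.0, 0.0, 0.0]]),
    "extreme": torch.tensor([[20.0, 0.0, 0.0, 0.0]]),
}

results = {}
for name, logits in predictions.items():
    log_probs = F.log_softmax(logits, dim=-1)
    plain = -(one_hot * log_probs).sum().item()      # standard cross-entropy
    smooth = -(smoothed * log_probs).sum().item()    # label-smoothed cross-entropy
    results[name] = (plain, smooth)
    print(f"{name}: plain={plain:.4f} smoothed={smooth:.4f}")
```

Under the plain loss, pushing the true word's logit ever higher keeps lowering the loss. Under the smoothed loss, the extreme prediction is penalized more than the merely confident one, because the small target mass on the other words makes near-zero probabilities costly.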
# In Simple Terms:

The Transformer model tries to guess the next word in a sequence by assigning probabilities to each word in its vocabulary.
NLL and cross-entropy are used to measure how good the model's guesses are compared to the actual next word.
During training, the model is adjusted to make better guesses by minimizing the NLL or cross-entropy loss.
Label smoothing is a technique to make the model's guesses less overconfident and more adaptable.
In short, the Transformer model predicts the next word, and NLL/cross-entropy teaches it to assign higher probability to the right word by minimizing the loss function during training. Label smoothing refines this by keeping the model from becoming overconfident.