# What Does Label Smoothing Do?
> An intuitive explanation of why label smoothing helps a model generalize better

- toc: true 
- badges: false
- comments: true
- categories: [deep-learning]
- image: my_icons/nn.png

## Introduction

Label smoothing was introduced by Szegedy et al. in the paper [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/abs/1512.00567). Since then, the trick has been used in many papers to improve state-of-the-art results across a range of datasets and architectures. Despite its wide use, there has been little insight into why the technique helps a model perform better. The paper by Rafael Müller et al., [When Does Label Smoothing Help?](https://arxiv.org/abs/1906.02629v2), addresses this question. This blog post is an attempt to explain the main result of that paper.

## What Is Label Smoothing?

Generally, in a classification problem, our aim is to maximize the log-likelihood of the ground-truth label. In other words, we want our model to assign maximum probability to the true label given the input and the parameters, i.e. to maximize ${P(y\mid x,\theta)}$, where the true label ${y}$ is known beforehand. We push the model towards this by minimizing the cross-entropy loss between its predictions and the ground-truth labels. Cross-entropy loss is defined by the equation
${-\sum_{i=1}^{n} y_{i} \times \log(\hat y_{i}) }$, where n is the number of classes (for example, n = 1000 for ImageNet). Don't be intimidated by the equation and the jargon: in practice the loss is very easy to calculate because the labels are provided as one-hot encoded vectors. Suppose you build a model for an image-classification task with 3 classes. For every input image the model predicts a vector of length 3. Let's say that for image 1 the model's normalised predictions are
${\hat y = [0.2, 0.7,0.1]}$ and the image belongs to category 2. The target vector is therefore ${y = [0,1,0]}$, and the loss for this image is ${-(0\times \log 0.2 + 1\times \log 0.7 + 0\times \log 0.1) = -\log 0.7}$.
There is a little more to how the normalised predictions of the model are calculated. They are obtained by applying the softmax activation to the last layer's output. The model outputs a length-3 vector, and each element
of this vector is called a 'logit'. For the predictions to represent a valid probability distribution over the classes, they should be non-negative and sum to 1. This is accomplished by passing the logits through a softmax layer. Let's say the output vector for a certain input image is ${z = [z_{1}, z_{2},...,z_{n}]}$; then the predictions are calculated as ${\hat y = \text{Softmax}\left(z\right) = \large \left[\frac {e^{z_{1}}}{\sum_{i=1}^{n} e^{z_{i}}}, \frac {e^{z_{2}}}{\sum_{i=1}^{n} e^{z_{i}}}, ...,
\frac {e^{z_{n}}}{\sum_{i=1}^{n} e^{z_{i}}}\right]}$.
Notice that the sum of all the elements of ${\hat y}$ is 1. Suppose the ground-truth label for the image is 2; then the target vector is ${[0,1,0,0,...,0]}$ (the length of the target vector is n as well). Thus the cross-entropy loss for this image, in its full glory, is written as ${\text{loss}\left(y,z\right) = -1 \times \normalsize \log \frac {e^{z_{2}}}{\sum_{i=1}^{n} e^{z_{i}}} = \log {\sum_{i=1}^{n} e^{z_{i}}} - z_{2}}$. Minimising this loss encourages ${z_{2}}$ to be as high as possible, while ${z_{i}}$ for ${i\ne2}$ are pushed as low as possible relative to ${z_{2}}$. Szegedy et al. highlight two problems with this approach.
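The calculation above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `softmax` and `cross_entropy` and the logits `z` are made up for this example, while the prediction `[0.2, 0.7, 0.1]` is the one used in the text.

```python
import math

def softmax(logits):
    # Subtracting the max logit is a standard stability trick; the result is unchanged.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs):
    # -sum_i y_i * log(y_hat_i); entries with y_i = 0 contribute nothing.
    return -sum(y * math.log(p) for y, p in zip(target, probs) if y > 0)

# The three-class example from the text: true class is 2 (index 1).
y_hat = [0.2, 0.7, 0.1]
y = [0, 1, 0]
loss = cross_entropy(y, y_hat)                  # equals -log(0.7)

# Starting from raw logits (hypothetical values), the loss can be computed two ways.
z = [1.0, 2.5, 0.3]
loss_from_probs = cross_entropy(y, softmax(z))
loss_from_logits = math.log(sum(math.exp(v) for v in z)) - z[1]   # log(sum exp) - z_2
print(loss, loss_from_probs, loss_from_logits)
```

The last two values agree, which is the identity ${-\log \hat y_{2} = \log \sum_{i} e^{z_{i}} - z_{2}}$ used above.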

The first problem is that the model becomes over-confident in its predictions, as it assigns nearly 100% probability to the ground-truth label. Szegedy et al. argue that this can lead to overfitting, so the model may not generalize well. Intuitively this makes sense. For example, let's say our dataset contains two semantically similar classes (the [pets dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/) has plenty of those). Suppose image1 belongs to one of the classes and image2 to the other. Because these images are very similar, their output logits will be very similar too. Our over-confident model may assign the wrong class to one of the images with high confidence (close to 100% probability), and thus our validation loss will be very high.

The second problem is the vanishing gradient. The gradient of our loss w.r.t. the logit of the correct class ${k}$ is ${\large \frac {e^{z_{k}}}{\sum_{i=1}^{n} e^{z_{i}}}-1}$, and w.r.t. any other logit ${z_{i}}$ it is ${\large \frac {e^{z_{i}}}{\sum_{i=1}^{n} e^{z_{i}}}}$. Minimising the cross-entropy loss drives the logit of the correct class far above the other logits. The softmax output for the correct class then approaches 1 and the others approach 0, so both expressions shrink towards 0. The gradients vanish, which hinders the model's ability to adapt.
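A small numeric sketch can make the vanishing gradient visible. It assumes the gradient expressions given above (softmax output minus the one-hot target); the function name `ce_grad` and the logit values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad(logits, true_idx):
    # d(loss)/d(z_i) = softmax(z)_i - 1 for the true class, softmax(z)_i otherwise.
    probs = softmax(logits)
    return [p - (1.0 if i == true_idx else 0.0) for i, p in enumerate(probs)]

# The true class (index 1) pulls further and further ahead of the other logits.
for gap in (1.0, 5.0, 10.0):
    grad = ce_grad([0.0, gap, 0.0], true_idx=1)
    print(gap, [round(g, 6) for g in grad])
```

As the gap grows, every entry of the gradient moves towards zero, so the weights feeding these logits receive almost no update.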

What can we do to counteract these two problems? The label smoothing paper suggests that we shouldn't provide sparse one-hot encoded vectors as targets. Instead, we should smooth them. This is done by replacing the Dirac delta distribution over the labels with a linear combination of the Dirac delta distribution and a uniform distribution. This may sound complex, but in reality it is very easy to implement. Let's define what the above jargon means.

The Dirac delta function, denoted by ${\delta _{k,y}}$, is a function which is 1 for ${k=y}$ and 0 everywhere else. (In this discrete setting it is also known as the Kronecker delta, so it's a fancy name for a one-hot encoded vector.) If an image has class 3 as its label and there are 4 classes in total, then the target vector for that image has the probability distribution ${\delta _{k,3} = [0,0,1,0]}$. Notice that ${\delta _{k,y}}$ is a valid probability distribution, as it sums to 1 over its domain. A uniform distribution is a distribution which has a constant value over its domain. Let's say our domain consists of ${[1,2,3,4]}$ and denote the uniform distribution by ${U\left(x\right)}$. Then ${U\left(1\right) = U\left(2\right) = U\left(3\right) = U\left(4\right) = c}$. The value of c should be ${\frac {1}{\text{total number of domain points}} = 0.25}$ so that ${\sum_{i=1}^{4} U \left(i\right)}$ is 1.

Let's denote our target vector for a particular image as ${q\left(k,y\right)}$. Here ${k}$ denotes the total number of classes and ${y}$ denotes the true label for the image (inside ${\delta _{k,y}}$, the subscript ${k}$ runs over the class positions). In the case of a one-hot encoded target vector, ${q\left(k,y\right) = \delta _{k,y}}$. Szegedy et al. propose to replace ${\delta _{k,y}}$ with ${(1-\varepsilon)\times \delta _{k,y} + \varepsilon \times U\left(k\right)}$. As explained above, the value of ${U\left(k\right)}$ should be ${\frac {1}{k}}$. Thus our target vector becomes ${q\left(k,y\right) = (1-\varepsilon)\times \delta _{k,y} + \varepsilon \times \frac{1}{k}}$. Let's try to smooth the labels of a concrete example.

Suppose the target vector of an image, for a classification task with ${k=4}$ classes, is ${q\left(k,y\right)=\delta_{k,2} = [0,1,0,0]}$. A valid uniform distribution over the labels is defined as ${U\left(k\right) = \frac{1}{k} = 0.25}$. Then our smoothed target vector is
${q\left(k,y\right) = (1-\varepsilon)\times \delta _{k,2} + \varepsilon \times U\left(k\right)}$ = ${(1-\varepsilon)\times[0,1,0,0] + \varepsilon\times [0.25,0.25,0.25,0.25]}$ = ${[0.25\varepsilon, 1-\varepsilon+0.25\varepsilon, 0.25\varepsilon,0.25\varepsilon]}$. If ${\varepsilon = 0.2}$, then ${q\left(k,y\right)=[0.05,0.85,0.05,0.05]}$. Notice that the new smoothed labels still sum to 1, which confirms that ${(1-\varepsilon)\times \delta _{k,y} + \varepsilon \times U\left(k\right)}$ is a valid probability distribution over the labels.
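The worked example above translates directly into code. The following is a minimal sketch in plain Python; the function name `smooth_labels` is an illustrative choice, not something taken from the paper.

```python
def smooth_labels(one_hot, eps):
    # (1 - eps) * delta + eps * uniform, where uniform = 1 / (number of classes)
    k = len(one_hot)
    return [(1 - eps) * v + eps / k for v in one_hot]

# The example from the text: 4 classes, true class 2, epsilon = 0.2.
q = smooth_labels([0, 1, 0, 0], eps=0.2)
print([round(v, 2) for v in q])   # [0.05, 0.85, 0.05, 0.05]
print(round(sum(q), 12))          # 1.0, so q is still a probability distribution
```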

Intuitively, we can think of label smoothing as a way to reduce the model's confidence in its ground-truth labels. The ground-truth labels may sometimes be wrong, owing to errors in the data labelling or data collection process, and label smoothing can make the model more robust against such incorrect labels.

## Implementation In Code

To implement label smoothing, we don't change every label individually; instead we define a new loss function. The loss function is still the cross-entropy loss. Our new target vector for a particular image is ${ y = [\frac {\varepsilon}{k},\frac {\varepsilon}{k},...,(1 - \varepsilon) + \frac{\varepsilon}{k},\frac {\varepsilon}{k},...,\frac {\varepsilon}{k}]}$, a vector with ${k}$ entries. Let's assume the image belongs to class ${j}$. The normal one-hot encoded target has 1 at position ${j}$ and 0 everywhere else. Let's denote the one-hot encoded target vector as ${y^{h}}$. So, ${y^{h} = [0,0,0,...,1,0,...0]}$  
The loss is ${L\left(y,\hat y\right) = \sum_{i=1}^{k} -y_{i} \times \log \hat y_{i}}$ = ${- \left( \frac {\varepsilon}{k}\times\log\hat y_{1} + \frac {\varepsilon}{k}\times\log\hat y_{2} + ...+ \left(1-\varepsilon+ \frac{\varepsilon}{k}\right)\times\log\hat y_{j}+\frac {\varepsilon}{k}\times\log\hat y_{j+1}+...+\frac {\varepsilon}{k}\times\log\hat y_{k}       \right)}$. We can rewrite this as ${L\left(y,\hat y\right) = -\left(1-\varepsilon\right)\times\log\hat y_{j} - \frac{\varepsilon}{k}\times\left(\sum_{i=1}^{k} \log\hat y_{i}\right)}$. The eagle-eyed reader will notice that ${-\log\hat y_{j}}$, the factor multiplied by ${\left(1 - \varepsilon\right)}$, is the same loss we calculated with the one-hot encoded target vector. Therefore, ${L\left(y,\hat y\right) = \left(1-\varepsilon\right)\times L\left(y^{h},\hat y\right)-\frac{\varepsilon}{k}\times\left(\sum_{i=1}^{k} \log\hat y_{i}\right)}$.
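As a sanity check, the decomposition can be verified numerically: the cross-entropy against the smoothed target should equal ${(1-\varepsilon)}$ times the one-hot loss minus ${\frac{\varepsilon}{k}\sum_{i}\log\hat y_{i}}$. The sketch below uses plain Python, and the values of `eps`, `j` and `y_hat` are arbitrary choices for illustration.

```python
import math

def cross_entropy(target, probs):
    # -sum_i target_i * log(probs_i)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

eps, k, j = 0.1, 3, 1                    # smoothing factor, number of classes, true index
y_hat = [0.2, 0.7, 0.1]                  # normalised predictions
y_hot = [1.0 if i == j else 0.0 for i in range(k)]
y_smooth = [(1 - eps) * t + eps / k for t in y_hot]

# Left-hand side: cross-entropy computed directly against the smoothed target.
direct = cross_entropy(y_smooth, y_hat)

# Right-hand side: (1 - eps) * one-hot loss - (eps / k) * sum of log-probabilities.
decomposed = (1 - eps) * cross_entropy(y_hot, y_hat) - (eps / k) * sum(math.log(p) for p in y_hat)

print(direct, decomposed)
```

Both numbers agree up to floating-point error, which is why only the loss function needs to change.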

Thus, we only need to modify the loss function of our model and we are good to go. The implementation of this in code is shown below. The snippet uses the PyTorch framework, and the implementation is taken from the [fast.ai](https://www.fast.ai/) course.

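```python
import torch.nn as nn
import torch.nn.functional as F

def lin_comb(a1, a2, factor):
    # Weighted average: factor * a1 + (1 - factor) * a2
    return factor * a1 + (1 - factor) * a2

def reduce_loss(loss, reduction='mean'):
    # Helper from the fast.ai course, reproduced so the snippet is self-contained
    return loss.mean() if reduction == 'mean' else loss.sum() if reduction == 'sum' else loss

class LabelSmoothing(nn.Module):
    def __init__(self, f: float = 0.1, reduction='mean'):
        super().__init__()
        self.f = f  # smoothing factor (epsilon in the equations above)
        self.reduction = reduction

    def forward(self, pred, targ):
        ls = F.log_softmax(pred, dim=1)                       # log of the predicted probabilities
        l1 = reduce_loss(ls.sum(1), self.reduction)           # sum of log-probabilities over the k classes
        l2 = F.nll_loss(ls, targ, reduction=self.reduction)   # usual one-hot cross-entropy
        # f * (-l1 / k) + (1 - f) * l2, i.e. the rewritten loss from the previous section
        return lin_comb(-l1 / pred.shape[-1], l2, self.f)
```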

## How And Why Does It Work?

Label smoothing goes against the conventional practice of maximising the likelihood of the ground-truth label. Instead, it penalises the model if the predicted probabilities of the classes that don't correspond to the correct label get too low. This can be seen from the second term in the loss equation above, i.e. ${-\frac{\varepsilon}{k}\times\left(\sum_{i=1}^{k} \log\hat y_{i}\right)}$. If any ${\hat y_{i}}$ for ${i = 1,2,...,k}$ gets too close to 0, the loss goes up (${\log}$ of something close to 0 is a large negative number). In contrast, maximising the likelihood of the ground-truth label encourages the logits that don't correspond to the correct label to go as low as possible. Let's dig into how label smoothing improves things despite departing from plain maximum-likelihood training.
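To see the penalty at work, one can evaluate the second term on its own for a moderately confident prediction and for an extremely confident one. This is only a sketch: the function name `uniform_term`, the value of `eps` and the two probability vectors are made up for illustration.

```python
import math

def uniform_term(probs, eps):
    # -(eps / k) * sum_i log(y_hat_i): the extra term introduced by label smoothing
    k = len(probs)
    return -(eps / k) * sum(math.log(p) for p in probs)

eps = 0.1
moderate = [0.05, 0.90, 0.05]           # confident, but not extreme
extreme = [0.0005, 0.9990, 0.0005]      # almost all probability mass on one class

print(uniform_term(moderate, eps))
print(uniform_term(extreme, eps))
```

The second value is larger: pushing the other classes' probabilities towards 0 makes this term grow, which is the behaviour described above.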

### Calculating Loss Without Label Smoothing

Let's imagine that we have to build a model for an image classification task where each image can have one of three labels. This means our model will output a length-3 vector containing our three logits. Assume that the penultimate layer of the model has 4 activations. The model's last 3 layers will look something like the figure below. I have added an extra layer at the end which corresponds to the calculation of our cross-entropy loss, taking the three logits as input.

![image](my_icons/nn.png)

We feed this model an image which has the target vector ${y^{h} = [0,1,0]^{T}}$. The penultimate layer's activations are ${X = [x_{1},x_{2},x_{3},x_{4}]^{T}}$ and the last layer's outputs are ${Z = [z_{1},z_{2},z_{3}]^{T}}$ (a single vector is conventionally written as a column vector; therefore ${X}$, ${Z}$ and ${y^{h}}$ are written as transposes of row vectors). ${Z}$ is calculated from the penultimate layer's activations using the equation ${Z = W\star X}$ (${\star}$ here denotes matrix multiplication). The bias is ignored for the sake of brevity. ${W}$ is the weight matrix connecting the penultimate layer and the output layer: ${W = \left[
         \begin{array}{cccc}
         w_{11} & w_{12} & w_{13} & w_{14}          \\
         w_{21} & w_{22} & w_{23} & w_{24}\\
         w_{31} & w_{32} & w_{33} & w_{34}
        \end{array}
    \right]}$. In short, the weight matrix can be written as ${W = [w_{1},w_{2},w_{3}]^{T}}$ where ${w_{i} = [w_{i1},w_{i2},w_{i3},w_{i4}]}$ is its ${i}$-th row. The output vector ${Z}$ is calculated as ${W\star X = \left[
         \begin{array}{c}
         w_{11}x_{1} + w_{12}x_{2} + w_{13}x_{3} + w_{14}x_{4}          \\
         w_{21}x_{1} + w_{22}x_{2} + w_{23}x_{3} + w_{24}x_{4}\\
         w_{31}x_{1} + w_{32}x_{2} + w_{33}x_{3} + w_{34}x_{4}
        \end{array}
    \right]}$. In short this can be written as ${Z = \left[\begin{array}{c}
         z_{1}          \\
         z_{2}           \\
         z_{3}
        \end{array}
    \right] = \left[
         \begin{array}{c}
         w_{1}X^{T}          \\
         w_{2}X^{T}           \\
         w_{3}X^{T}
        \end{array}
    \right]}$ where ${w_{i}X^{T}}$ denotes the inner product between ${w_{i}}$ and ${X}$.

${Z}$ is a vector of logits and is therefore un-normalised. To get our prediction vector we have to normalise it by passing ${Z}$ through a softmax layer. Our prediction vector would be ${\hat y = \left[
         \begin{array}{c}
         \frac {e^{w_{1}X^{T}}}{e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}}          \\
         \frac {e^{w_{2}X^{T}}}{e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}}           \\
         \frac {e^{w_{3}X^{T}}}{e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}}
        \end{array}
    \right]}$. As given before, our target vector is ${y^{h} = [0,1,0]^{T}}$. So, our cross-entropy loss will be ${L\left(y^{h},z\right) = -\log \left(\frac {e^{w_{2}X^{T}}}{e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}}\right)}$. To preserve our sanity, let's denote ${e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}}$ by ${S}$. Then ${L\left(y^{h},z\right) = -\log \left(\frac {e^{w_{2}X^{T}}}{S}\right) = \log {S}-{w_{2}X^{T}}}$.
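The steps of this section can be traced with concrete numbers. The sketch below uses plain Python lists; the entries of `W` and `X` are hypothetical values chosen only to illustrate the shapes (3 classes, 4 activations).

```python
import math

# Hypothetical weight matrix (one row per class) and penultimate activations.
W = [[0.2, -0.1, 0.4, 0.0],
     [0.5, 0.3, -0.2, 0.1],
     [-0.3, 0.2, 0.1, 0.6]]
X = [1.0, 2.0, 0.5, -1.0]

# z_i is the inner product of row w_i with X (bias ignored, as in the text).
Z = [sum(w * x for w, x in zip(row, X)) for row in W]

S = sum(math.exp(z) for z in Z)             # the normalising sum called S in the text
y_hat = [math.exp(z) / S for z in Z]        # softmax predictions

loss_from_probs = -math.log(y_hat[1])       # target is class 2 (index 1)
loss_from_logits = math.log(S) - Z[1]       # log S - w_2 . X
print(Z, loss_from_probs, loss_from_logits)
```

The two loss values coincide, matching the final expression ${\log S - w_{2}X^{T}}$.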


### Calculating Loss With Label Smoothing

Our prediction vector is the same as before, but our target vector changes. Let's denote our label-smoothed target vector as ${y^{l}}$. So, ${y^{l} = [\frac {\varepsilon}{3}, 1-\varepsilon + \frac {\varepsilon}{3}, \frac {\varepsilon}{3}]^{T}}$. Then our new loss will be ${L\left(y^{l},z\right) = -\frac {\varepsilon}{3}\times \log\frac {e^{w_{1}X^{T}}}{S}-\left(1-\varepsilon+\frac{\varepsilon}{3}\right)\times \log\frac {e^{w_{2}X^{T}}}{S}-\frac {\varepsilon}{3}\times \log\frac {e^{w_{3}X^{T}}}{S} = -\left(1-\varepsilon\right)\times \log\frac {e^{w_{2}X^{T}}}{S}-\frac {\varepsilon}{3}\times\left(\log\frac {e^{w_{1}X^{T}}}{S}+\log\frac {e^{w_{2}X^{T}}}{S}+\log\frac {e^{w_{3}X^{T}}}{S}\right)}$.

Remember that ${\log a + \log b = \log ab}$. Using this rule, the loss can be written as ${L\left(y^{l},z\right)=\left(1-\varepsilon\right)\left(\log S-w_{2}X^{T}\right)-\frac{\varepsilon}{3}\times{\log\left(\frac{e^{w_{1}X^{T}+w_{2}X^{T}+w_{3}X^{T}}}{S^{3}}\right)}}$. To reduce this equation further, we need two more rules: 1. ${\log\frac{a}{b}=\log a-\log b}$ and 2. ${\log a^{b}=b\log a}$. Then, ${L\left(y^{l},z\right)=\left(\log S-w_{2}X^{T}\right)-\varepsilon\left(\log S - w_{2}X^{T}\right)-\frac{\varepsilon}{3}\left(w_{1}X^{T}+w_{2}X^{T}+w_{3}X^{T}\right)+\frac{\varepsilon}{3}\log\left(S^{3}\right) = \left(\log S-w_{2}X^{T}\right)-\varepsilon\log S+\varepsilon\left(w_{2}X^{T}\right)-\frac{\varepsilon}{3}\left(w_{1}X^{T}+w_{2}X^{T}+w_{3}X^{T}\right)+{\varepsilon}\log S}$.

Notice that the first term of the last expression is our ${L\left(y^{h},z\right)}$, and that the two ${\varepsilon\log S}$ terms cancel. Therefore our loss with smooth labels can finally be written as ${L\left(y^{l},z\right)=L\left(y^{h},z\right)+\frac{\varepsilon}{3}\left(2w_{2}X^{T}-w_{1}X^{T}-w_{3}X^{T}\right)}$.
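The closed form can be checked numerically against a direct computation of the smoothed cross-entropy. The following sketch assumes three classes with the true class at index 1; the function name `smoothed_loss`, the logits `Z` and the value of `eps` are hypothetical.

```python
import math

def smoothed_loss(Z, true_idx, eps):
    # Cross-entropy between the smoothed target and softmax(Z), computed directly.
    k = len(Z)
    log_S = math.log(sum(math.exp(z) for z in Z))
    log_probs = [z - log_S for z in Z]
    target = [(1 - eps) * (1.0 if i == true_idx else 0.0) + eps / k for i in range(k)]
    return -sum(t * lp for t, lp in zip(target, log_probs))

Z = [0.2, 1.4, -0.5]      # logits z_1, z_2, z_3 (the w_i X^T of the text)
eps = 0.1

hard = math.log(sum(math.exp(z) for z in Z)) - Z[1]            # L(y^h, z) = log S - z_2
closed_form = hard + (eps / 3) * (2 * Z[1] - Z[0] - Z[2])      # final expression in the text
direct = smoothed_loss(Z, true_idx=1, eps=eps)
print(direct, closed_form)
```

Both values match, so the extra term ${\frac{\varepsilon}{3}\left(2z_{2}-z_{1}-z_{3}\right)}$ is the whole difference between the two losses for this example.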

## Geometric Point Of View

Our last layer's output for the image we fed in earlier is ${Z= \left[
         \begin{array}{c}
         w_{1}X^{T}          \\
         w_{2}X^{T}           \\
         w_{3}X^{T}
        \end{array}
    \right]}$. Since this image belongs to class 2, minimising the loss functions calculated above increases ${w_{2}X^{T}}$ while ${w_{1}X^{T}}$ and ${w_{3}X^{T}}$ are decreased. Notice the pattern: ${w_{i}X^{T}}$ produces the logit for class ${i}$, so ${w_{i}}$ can be thought of as a template for class ${i}$. Let's try to view the process of minimising or maximising ${w_{i}X^{T}}$ geometrically.
    
#### Euclidean Norm
The Euclidean norm of the difference of two vectors is simply the distance between them in their space. For two vectors ${a}$ and ${b}$ it can be calculated as ${\lVert a-b\rVert=\left(a^{T}a-2a^{T}{b}+b^{T}b\right)^{\frac{1}{2}}}$. ${\therefore \lVert a-b\rVert^{2}= a^{T}\star a-2a^{T}\star{b}+b^{T}\star b}$. (Remember that ${\star}$ denotes matrix multiplication.)

Now that we know how to calculate the Euclidean distance, let's calculate it for ${w_{i}}$ and ${X}$. ${\lVert w_{i}-X\rVert^{2}= w_{i}^{T}\star w_{i}-2w_{i}^{T}\star{X}+X^{T}\star X= w_{i}^{T}\star w_{i}-2w_{i}{X}^{T}+X^{T}\star X}$. Geometrically, this quantity is the squared distance between the template for class ${i}$ and the penultimate layer's activation ${X}$. (Here ${w_{i}}$ is treated as a column vector, so the matrix product ${w_{i}^{T}\star X}$ is just the inner product of ${w_{i}}$ and ${X}$, which we have been writing as ${w_{i}X^{T}}$.)

Notice the second term inside the expression for ${\lVert w_{i}-X\rVert^{2}}$, which is ${2w_{i}X^{T}}$. Provided the other two terms stay roughly constant, the distance between ${w_{i}}$ and ${X}$ decreases when this term increases, and increases when it decreases. But this second term is just ${2\times z_{i}}$. This means that whenever ${z_{i}}$ increases/decreases, the distance between the template for class ${i}$ and the penultimate layer's output vector decreases/increases. If an image belongs to class ${k}$, minimising the loss increases ${z_{k}}$ and decreases every other logit. So minimising the loss is the same as minimising the distance between the penultimate layer's output ${X}$ and the template for the correct class ${w_{k}}$, while maximising the distance between ${X}$ and every other template ${w_{i}}$ where ${i \neq k}$.
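The relationship between logits and distances can be checked with a small example. This sketch assumes two templates of equal norm, so that only the inner-product term differs between them; the vectors `w1`, `w2` and `X` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    # Squared Euclidean distance ||a - b||^2
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two templates with the same norm, and one penultimate activation vector.
w1 = [1.0, 0.0, 0.0, 0.0]
w2 = [0.0, 1.0, 0.0, 0.0]
X = [0.2, 0.9, 0.1, 0.3]

for w in (w1, w2):
    logit = dot(w, X)                                    # z_i in the text
    expanded = dot(w, w) - 2 * logit + dot(X, X)         # w.w - 2 w.X + X.X
    print(logit, sq_dist(w, X), expanded)
```

For each template the expanded form equals the squared distance, and the template with the larger logit (`w2`) is the one closer to `X`.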
 
