# What Does Label Smoothing Do?
> An intuitive explanation of why label smoothing helps a model generalize better

- toc: true 
- badges: false
- comments: true
- categories: [deep-learning]

## Introduction

Label smoothing was introduced by Szegedy et al. in the paper [Rethinking the Inception Architecture](https://arxiv.org/abs/1512.00567). Since then, this trick has been used in many papers to improve SOTA results on a range of datasets and architectures. Despite its wide use, there has been little insight into why the technique helps a model perform better. The paper by Rafael Müller et al., [When does Label Smoothing Help?](https://arxiv.org/abs/1906.02629v2), provides insight into this question. This blog post is an attempt to explain the main result of the paper.

## What Is Label Smoothing?

Generally, in a classification problem, our aim is to maximize the log-likelihood of the ground-truth label. In other words, we want our model to assign maximum probability to the true label given the input and the parameters, i.e. ${P(y\mid x,\theta)}$, where the true label ${y}$ is known beforehand. We push the model towards this by minimizing the cross-entropy loss between its predictions and the ground-truth labels. The cross-entropy loss is defined by the equation:
${-\sum_{i=1}^{n} y_{i} \times \log(\hat y_{i}) }$ where n is the number of classes, e.g. n = 1000 for ImageNet. Don't be intimidated by the equation and the jargon: in practice the loss is very easy to calculate because the labels are provided as one-hot encoded vectors. Suppose you build a model for an image-classification task with 3 classes. For every input image the model predicts a vector of length 3. Let's say that for image 1 the model's normalised predictions are
${\hat y = [0.2, 0.7,0.1]}$ and the image belongs to category 2. Therefore, the target vector will be ${y = [0,1,0]}$. The loss for this image will be ${-(0\times \log 0.2 + 1\times \log 0.7 + 0\times \log 0.1) = -\log 0.7}$.
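
The 3-class example above can be checked numerically. This is a minimal NumPy sketch; the helper name `cross_entropy` is an assumption made for illustration and does not come from either paper.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-entropy between a target distribution y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # Only classes with non-zero target weight contribute to the sum.
    # Masking them also avoids evaluating log(0) for zero-weight classes.
    mask = y > 0
    return float(-np.sum(y[mask] * np.log(y_hat[mask])))

y_hat = [0.2, 0.7, 0.1]   # normalised predictions from the example
y = [0, 1, 0]             # one-hot target for category 2
loss = cross_entropy(y, y_hat)
print(loss)               # equals -log(0.7), roughly 0.357
```

With a one-hot target, only the predicted probability of the true class matters, so the sum collapses to a single term.
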
There is a little more to how the normalised predictions of the model are calculated. They are obtained by applying the softmax activation to the output of the last layer. The model outputs a vector of length 3, and each element of this vector is called a 'logit'. For the predictions to represent a valid probability distribution over the classes, they should be non-negative and sum to 1. This is accomplished by passing the logits through a softmax layer. Let's say the output vector for a certain input image is ${z = [z_{1}, z_{2},...,z_{n}]}$; then the predictions are calculated as ${\hat y = \text{Softmax}\left(z\right) = \large \left[\frac {e^{z_{1}}}{\sum_{i=1}^{n} e^{z_{i}}}, \frac {e^{z_{2}}}{\sum_{i=1}^{n} e^{z_{i}}}, \ldots, \frac {e^{z_{n}}}{\sum_{i=1}^{n} e^{z_{i}}}\right]}$.
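
The softmax formula can be sketched in a few lines of NumPy. The logit values below are hypothetical, and subtracting the maximum logit is a common numerical-stability trick that is not part of the formula itself (it leaves the result unchanged).

```python
import numpy as np

def softmax(z):
    """Map a vector of logits to a probability distribution."""
    z = np.asarray(z, dtype=float)
    # Subtracting the max does not change the result, but it keeps
    # exp() from overflowing when the logits are large.
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 3.0, 0.5])   # hypothetical logits for a 3-class model
y_hat = softmax(z)
print(y_hat, y_hat.sum())
```

Every entry of `y_hat` is positive, and the largest logit receives the largest probability.
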
Notice that the elements of ${\hat y}$ sum to 1. Suppose the ground-truth label for the image is 2; then the target vector is ${[0,1,0,\ldots,0]}$ (the length of the target vector is n as well). Thus the cross-entropy loss for this image, in its full glory, is written as ${\text{loss}\left(y,z\right) = -1 \times \log \frac {e^{z_{2}}}{\sum_{i=1}^{n} e^{z_{i}}} = \log {\sum_{i=1}^{n} e^{z_{i}}} - z_{2}}$. Minimising this loss encourages ${z_{2}}$ to be as high as possible, while ${z_{i}}$ for ${i\ne2}$ are pushed far below ${z_{2}}$ so that their softmax probabilities approach 0. Szegedy et al. highlight two problems with this approach.

The first problem is that the model becomes over-confident in its predictions, as it learns to assign nearly 100% probability to the ground-truth label. Szegedy et al. argue that this can lead to overfitting, so the model may not generalize well. Intuitively this makes sense. For example, let's say our dataset contains two semantically similar classes (the [pets dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/) has plenty of those). Suppose image1 belongs to one of the classes and image2 to the other. Because these images are very similar, the output logits for them would be very similar too. Our over-confident model may assign the wrong class to an image with high confidence (close to 100% probability), and thus our validation loss will be very high.

The other problem with this approach is the vanishing gradient. The gradient of our loss w.r.t. the logit of the correct class label k is ${\large \frac {e^{z_{k}}}{\sum_{i=1}^{n} e^{z_{i}}}-1}$, and w.r.t. any other logit ${z_{i}}$ it is ${\large \frac {e^{z_{i}}}{\sum_{j=1}^{n} e^{z_{j}}}}$. Minimising the cross-entropy loss drives the logit of the correct class to be much higher than the other logits. The softmax probability of the correct class then approaches 1 and the others approach 0, so both gradients shrink towards 0, which hinders the model's ability to adapt.
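
The shrinking gradient can be illustrated numerically. The sketch below uses hypothetical logits and a 0-based class index, and the helper names are assumptions made for illustration. The gradient w.r.t. all logits is written compactly as the softmax output minus the one-hot target, which matches the two expressions above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_from_logits(z, k):
    # log(sum_i exp(z_i)) - z_k, computed stably by factoring out the max logit.
    m = np.max(z)
    return float(m + np.log(np.sum(np.exp(z - m))) - z[k])

def grad_wrt_logits(z, k):
    # softmax(z) - one_hot(k): p_k - 1 for the correct class, p_i otherwise.
    g = softmax(z)
    g[k] -= 1.0
    return g

k = 1             # 0-based index of the correct class (hypothetical)
grad_sizes = []   # largest absolute gradient entry for each setting
for z_k in [1.0, 5.0, 10.0]:
    z = np.array([0.0, z_k, 0.0])
    g = grad_wrt_logits(z, k)
    grad_sizes.append(float(np.abs(g).max()))
    print(z_k, loss_from_logits(z, k), grad_sizes[-1])
```

As the correct logit pulls away from the others, the loss keeps falling and the largest gradient entry shrinks with it, so the updates to the logits become smaller and smaller.
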

What can we do to counteract these two problems? The label smoothing paper suggests that we shouldn't provide sparse one-hot encoded vectors as targets. Instead, we should smooth them. This is done by replacing the probability distribution over labels, a Dirac delta function, with a linear combination of the Dirac delta distribution and a uniform distribution. This may sound complex, but in reality it is very easy to implement. Let's define what the above jargon means.

The Dirac delta function, denoted by ${\delta _{k,y}}$, is a function which is 1 for ${k=y}$ and 0 everywhere else (so it's a fancy name for a one-hot encoded vector). If an image has class 3 as its label and there are 4 classes in total, then the target vector for that image has the probability distribution ${\delta _{k,3} = [0,0,1,0]}$. Notice that ${\delta _{k,y}}$ is a valid probability distribution, as it sums to 1 over its domain. A uniform distribution is a distribution which has a constant value over its domain. Let's say our domain consists of ${[1,2,3,4]}$. The uniform distribution is denoted as ${U\left(x\right)}$. For the uniform distribution, ${U\left(1\right) = U\left(2\right) = U\left(3\right) = U\left(4\right) = c}$. The value of c should be ${\frac {1}{\text{total number of domain points}} = 0.25}$ so that ${\sum_{i=1}^{4} U \left(i\right)}$ is 1.

Let's denote the target vector for a particular image as ${q\left(k,y\right)}$. Here ${k}$ indexes the classes, ${y}$ denotes the true label of the image, and ${K}$ denotes the total number of classes. In the case of a one-hot encoded target vector, ${q\left(k,y\right) = \delta _{k,y}}$. Szegedy et al. propose to replace ${\delta _{k,y}}$ with ${(1-\epsilon)\times \delta _{k,y} + \epsilon \times U\left(k\right)}$. As explained above, the value of ${U\left(k\right)}$ should be ${\frac {1}{K}}$. Thus our target vector becomes ${q\left(k,y\right) = (1-\epsilon)\times \delta _{k,y} + \epsilon \times \frac{1}{K}}$. Let's try to smooth the labels of a concrete example.

Suppose the target vector of an image for a classification task with ${K=4}$ classes is ${q\left(k,y\right)=\delta_{k,2} = [0,1,0,0]}$. A valid uniform distribution over the labels is ${U\left(k\right) = \frac{1}{K} = 0.25}$. Then our smoothed target vector is ${q\left(k,y\right) = (1-\epsilon)\times \delta _{k,2} + \epsilon \times U\left(k\right) = (1-\epsilon)\times[0,1,0,0] + \epsilon\times [0.25,0.25,0.25,0.25] = [0.25\epsilon, 1-\epsilon+0.25\epsilon, 0.25\epsilon,0.25\epsilon]}$. If ${\epsilon = 0.2}$, then ${q\left(k,y\right)=[0.05,0.85,0.05,0.05]}$.
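
The worked example translates directly into code. This is a minimal NumPy sketch of the smoothing formula, not the implementation used in either paper; the function name `smooth_labels` and its parameter names are assumptions, and labels are 0-based here, so class 2 becomes index 1.

```python
import numpy as np

def smooth_labels(y, num_classes, eps):
    """Return (1 - eps) * one_hot(y) + eps * uniform for a 0-based label y."""
    one_hot = np.zeros(num_classes)
    one_hot[y] = 1.0
    uniform = np.full(num_classes, 1.0 / num_classes)
    return (1.0 - eps) * one_hot + eps * uniform

# Class 2 of 4 in the 1-based numbering of the text is index 1 here.
q = smooth_labels(y=1, num_classes=4, eps=0.2)
print(q)   # approximately [0.05, 0.85, 0.05, 0.05]
```

The smoothed vector still sums to 1, so it remains a valid probability distribution and can be used as the target in the cross-entropy loss in place of the one-hot vector.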