# Log-Likelihood and Cross-Entropy Loss

## Softmax Function and Conditional Probabilities

The softmax function gives us a vector $\hat{\mathbf{y}}$, which we interpret as the estimated conditional probabilities of each class given any input $\mathbf{x}$:
$$
\hat{y}_j = P(y = j \mid \mathbf{x})
$$
where $\hat{y}_j$ is the predicted probability for class $j$. 

Suppose the entire dataset $\mathcal{D}$ has $n$ examples, where the example indexed by $i$ consists of a feature vector $\mathbf{x}_i$ and a one-hot label vector $\mathbf{y}_i$. To compare the model's predictions with reality, we evaluate the likelihood of the actual classes under the model given the features. Assuming the examples are independent, this likelihood factorizes into a product over examples:
$$
P(\mathcal{D}) = \prod_{i=1}^n P(\mathbf{y}_i \mid \mathbf{x}_i)
$$

## Maximum Likelihood Estimation

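According to maximum likelihood estimation, we choose the model parameters that maximize $P(\mathcal{D})$. Because the logarithm is monotonically increasing, this is the same as maximizing the log-likelihood, which turns the product into a sum: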
$$
\log P(\mathcal{D}) = \sum_{i=1}^n \log P(\mathbf{y}_i \mid \mathbf{x}_i)
$$

This is equivalent to minimizing the negative log-likelihood:
$$
-\log P(\mathcal{D}) = -\sum_{i=1}^n \log P(\mathbf{y}_i \mid \mathbf{x}_i)
$$
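
As a rough illustration of the relationship between the likelihood, the log-likelihood, and the negative log-likelihood, here is a minimal sketch using only the Python standard library. The list `true_class_probs` is a made-up set of values standing in for $P(\mathbf{y}_i \mid \mathbf{x}_i)$; it does not come from a trained model.

```python
import math

# Hypothetical probabilities the model assigns to the true class of each
# example, i.e. P(y_i | x_i) for i = 1..n (illustrative values only).
true_class_probs = [0.8, 0.7, 0.9]

# Likelihood of the dataset: product of the per-example probabilities.
likelihood = math.prod(true_class_probs)

# Log-likelihood: the logarithm turns the product into a sum.
log_likelihood = sum(math.log(p) for p in true_class_probs)

# Negative log-likelihood: the quantity that is minimized during training.
nll = -log_likelihood

print(f"likelihood     = {likelihood:.5f}")
print(f"log-likelihood = {log_likelihood:.5f}")
print(f"negative LL    = {nll:.5f}")
```

Working with the sum of logarithms rather than the raw product also avoids numerical underflow when $n$ is large, since a product of many probabilities quickly becomes very small.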

## Cross-Entropy Loss

For any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $k$ classes, the loss function $\mathcal{L}$ is defined as:
$$
\mathcal{L} = - \sum_{j=1}^k y_j \log \hat{y}_j
$$

This is commonly called the cross-entropy loss. Since $\mathbf{y}$ is a one-hot vector, only the term corresponding to the actual class label contributes to the loss:
$$
\mathcal{L} = - \log \hat{y}_{\text{true}}
$$
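
The reduction from the full sum to a single term can be checked numerically. The sketch below is one possible way to write the loss in plain Python; the function name `cross_entropy` and the four-class example values are assumptions made for illustration.

```python
import math

def cross_entropy(y, y_hat):
    """Cross-entropy -sum_j y_j * log(y_hat_j) for one label/prediction pair."""
    # Terms with y_j == 0 are skipped, so they contribute nothing to the sum.
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j != 0)

# Hypothetical one-hot label (true class is the third one) and prediction.
y = [0, 0, 1, 0]
y_hat = [0.1, 0.2, 0.6, 0.1]

full = cross_entropy(y, y_hat)            # sum over all k classes
shortcut = -math.log(y_hat[y.index(1)])   # -log of the true-class probability

print(full, shortcut)
```

Both values agree, because every term except the one for the true class is multiplied by zero.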

## Properties of the Cross-Entropy Loss

1. **Range of the Loss**:
   - All $\hat{y}_j$ are probabilities, so $0 \leq \hat{y}_j \leq 1$.
   - The logarithm of a probability is therefore always $\leq 0$, so the loss $-\log \hat{y}_{\text{true}}$ is non-negative.

2. **Minimization Condition**:
   - The loss function is minimized when $\hat{y}_{\text{true}} = 1$, which corresponds to predicting the actual label with certainty.

3. **Practical Challenges** (reasons the minimum is rarely reached in practice):
   - Label noise: some labels in the dataset may be incorrect.
   - Insufficient input information: the input features may not contain enough information to classify every example perfectly.
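
The first two properties can be seen by evaluating the loss for a few values of $\hat{y}_{\text{true}}$. This is a small sketch with arbitrarily chosen probabilities; it writes $-\log p$ as $\log(1/p)$, which is the same quantity.

```python
import math

# Arbitrary example values for the predicted probability of the true class.
probs = [0.01, 0.1, 0.5, 0.9, 1.0]

# -log(p) written as log(1 / p); the two expressions are mathematically equal.
losses = [math.log(1.0 / p) for p in probs]

for p, loss in zip(probs, losses):
    print(f"y_hat_true = {p:<4}  loss = {loss:.4f}")
```

The loss is never negative, shrinks as the probability of the true class grows, and reaches zero only when that probability equals 1.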


# Example of Cross-Entropy Loss Calculation

## Given

Let the true label $\mathbf{y}$ and predicted probabilities $\hat{\mathbf{y}}$ be:
$$
\mathbf{y} = [1, 0, 0], \quad \hat{\mathbf{y}} = [0.8, 0.1, 0.1]
$$

The cross-entropy loss is defined as:
$$
\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^k y_j \log \hat{y}_j
$$

where $k$ is the number of classes.

## Step-by-Step Calculation

1. Substitute the values of $\mathbf{y}$ and $\hat{\mathbf{y}}$:
   $$ 
   \mathcal{L}([1, 0, 0], [0.8, 0.1, 0.1]) = -\left( y_1 \log \hat{y}_1 + y_2 \log \hat{y}_2 + y_3 \log \hat{y}_3 \right)
   $$

2. Expand using the one-hot encoding $\mathbf{y} = [1, 0, 0]$:
   $$
   \mathcal{L} = -\left( 1 \cdot \log(0.8) + 0 \cdot \log(0.1) + 0 \cdot \log(0.1) \right)
   $$

3. Simplify:
   $$
   \mathcal{L} = -\log(0.8)
   $$

4. Compute the natural logarithm (base $e$):
   $$
   \log(0.8) \approx -0.22314
   $$

5. Final result:
   $$
   \mathcal{L} \approx 0.22314
   $$
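
The steps above can be reproduced in a few lines of Python. This is only a sketch that mirrors the hand calculation using the standard `math` module.

```python
import math

y = [1, 0, 0]            # one-hot true label
y_hat = [0.8, 0.1, 0.1]  # predicted probabilities

# Steps 1-2: one term y_j * log(y_hat_j) per class.
terms = [y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat)]

# Steps 3-5: negate the sum; only the first term is non-zero.
loss = -sum(terms)

print(round(loss, 5))  # 0.22314
```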

## Interpretation

The loss value $\mathcal{L} \approx 0.22314$ measures how far the predicted probability of the true class is from 1. A lower loss corresponds to a prediction closer to the true label. (Note that $\log(1) = 0$, and a predicted probability can never exceed 1, so the loss is never negative.)

# Example of Cross-Entropy Loss with Matrices

## Given

Suppose we have a minibatch of two examples, with the following true labels $\mathbf{Y}$ and predicted probabilities $\hat{\mathbf{Y}}$:

### True Labels (One-hot encoded)
$$
\mathbf{Y} = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0
\end{bmatrix}
$$

where:
- The first example belongs to class 1.
- The second example belongs to class 2.

### Predicted Probabilities
$$
\hat{\mathbf{Y}} = \begin{bmatrix}
0.8 & 0.1 & 0.1 \\
0.2 & 0.7 & 0.1
\end{bmatrix}
$$

where:
- For the first example, the model predicts class 1 with probability 0.8, class 2 with probability 0.1, and class 3 with probability 0.1.
- For the second example, the model predicts class 1 with probability 0.2, class 2 with probability 0.7, and class 3 with probability 0.1.

## Cross-Entropy Loss for Each Example

The cross-entropy loss for each example is computed as:
$$
\mathcal{L}_i = - \sum_{j=1}^k y_{ij} \log \hat{y}_{ij}
$$


where $y_{ij}$ is the true label (one-hot encoded) and $\hat{y}_{ij}$ is the predicted probability for class $j$ for the $i$-th example.

---
For the first example ($\mathbf{y}_1 = [1, 0, 0]$ and $\hat{\mathbf{y}}_1 = [0.8, 0.1, 0.1]$):

$$
\mathcal{L}_1 = -\left( 1 \cdot \log(0.8) + 0 \cdot \log(0.1) + 0 \cdot \log(0.1) \right)
$$

$$
\mathcal{L}_1 = -\log(0.8) \approx 0.22314
$$

For the second example ($\mathbf{y}_2 = [0, 1, 0]$ and $\hat{\mathbf{y}}_2 = [0.2, 0.7, 0.1]$):
$$
\mathcal{L}_2 = -\left( 0 \cdot \log(0.2) + 1 \cdot \log(0.7) + 0 \cdot \log(0.1) \right)
$$
$$
\mathcal{L}_2 = -\log(0.7) \approx 0.35667
$$

## Total Loss for the Minibatch

The loss for the minibatch, written $\mathcal{L}_{\text{total}}$ here, is the average of the per-example losses:
$$
\mathcal{L}_{\text{total}} = \frac{1}{n} \sum_{i=1}^n \mathcal{L}_i
$$
where $n$ is the number of examples in the minibatch.

In this case:
$$
\mathcal{L}_{\text{total}} = \frac{1}{2} \left( 0.22314 + 0.35667 \right) \approx 0.28991
$$
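
The minibatch calculation can be sketched with the rows of $\mathbf{Y}$ and $\hat{\mathbf{Y}}$ stored as nested Python lists. The helper name `cross_entropy` is an assumption for illustration; an array library would normally be used for larger batches.

```python
import math

# Rows are examples, columns are classes.
Y = [[1, 0, 0],
     [0, 1, 0]]
Y_hat = [[0.8, 0.1, 0.1],
         [0.2, 0.7, 0.1]]

def cross_entropy(y, y_hat):
    """Cross-entropy for a single example."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat))

# One loss value per row of the minibatch.
per_example = [cross_entropy(y, y_hat) for y, y_hat in zip(Y, Y_hat)]

# Average over the n examples in the minibatch.
mean_loss = sum(per_example) / len(per_example)

print([round(loss, 5) for loss in per_example])  # [0.22314, 0.35667]
print(round(mean_loss, 5))                       # 0.28991
```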

## Summary

- The cross-entropy loss for the first example is $0.22314$.
- The cross-entropy loss for the second example is $0.35667$.
- The total loss for the minibatch is $0.28991$.

The same procedure applies to a minibatch of any size: each example contributes $-\log$ of the probability the model assigns to its true class, and the contributions are averaged.


# Cross Entropy Loss, Softmax and Overfitting

Label smoothing adds noise to the output targets (the labels). Consider a classification problem. Most such problems are trained with a form of cross-entropy loss such as

$$
\mathcal{L}_i = - \sum_{j=1}^k y_{ij} \log \hat{y}_{ij}
$$



and softmax to output the final probabilities.


The target vector is one-hot, for example $[0, 1, 0, 0]$. Because of the way softmax is formulated (with logits $z_i$ and a temperature $T$, where $T = 1$ gives the standard softmax):

$$
\text{softmax}(z_i) = \frac{e^{z_i / T}}{\sum_{j} e^{z_j / T}}
$$


it can never output exactly 1 or 0 for finite logits. The best it can do is something like $[0.0003, 0.999, 0.0003, 0.0003]$. As a result, training keeps pushing the logit of the true class higher and the other logits lower, because the loss can always be reduced a little further. The logits never settle at a finite optimum, and the model becomes overconfident, which tends to encourage overfitting.
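
To make this behaviour concrete, here is a minimal softmax sketch in plain Python. The function signature and the example logits are assumptions chosen for illustration; subtracting the maximum logit is a common numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(z, T=1.0):
    """Softmax with temperature T; T = 1 is the standard softmax."""
    # Subtract the largest logit for numerical stability (result is unchanged).
    m = max(z)
    exps = [math.exp((z_i - m) / T) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits that strongly favour the second class.
logits = [0.0, 8.0, 0.0, 0.0]
probs = softmax(logits)

# Doubling the winning logit moves the output closer to one-hot, but not onto it.
sharper = softmax([0.0, 16.0, 0.0, 0.0])

# A higher temperature flattens the distribution.
flatter = softmax(logits, T=2.0)

print([round(p, 4) for p in probs])
print(sharper[1], flatter[1])
```

With these logits the output is close to, but not exactly, one-hot. In floating-point arithmetic, sufficiently large logits do round to exactly 1.0 and 0.0; the statement above concerns the exact mathematical function.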

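To address that, label smoothing softens the hard 0 and 1 targets by a small margin. Specifically, each 0 is replaced with $\frac{\epsilon}{k-1}$ and the 1 with $1-\epsilon$, where $k$ is the number of classes and $\epsilon$ is a small smoothing constant, so the smoothed targets still sum to 1.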
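
A short sketch of this replacement rule follows. The function name `smooth_labels`, the value `eps=0.1`, and the two example predictions are assumptions made for illustration, not values prescribed by the text.

```python
import math

def smooth_labels(y, eps=0.1):
    """Replace the 1 with 1 - eps and each 0 with eps / (k - 1)."""
    k = len(y)
    return [1.0 - eps if y_j == 1 else eps / (k - 1) for y_j in y]

def cross_entropy(y, y_hat):
    """Cross-entropy for a single example (targets need not be one-hot)."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat))

y = [0, 1, 0, 0]
y_smooth = smooth_labels(y, eps=0.1)  # roughly [0.033, 0.9, 0.033, 0.033]

# A very confident prediction versus one that matches the smoothed target.
confident = [0.0003, 0.9991, 0.0003, 0.0003]
loss_confident = cross_entropy(y_smooth, confident)
loss_matched = cross_entropy(y_smooth, y_smooth)

print([round(t, 3) for t in y_smooth])
print(loss_confident, loss_matched)
```

With smoothed targets, the extremely confident prediction incurs a higher loss than the prediction that matches the smoothed target, so the model is no longer rewarded for pushing its outputs toward exactly 0 and 1.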
