# Cross Entropy
In a nutshell, there is a *connection between probabilities and error functions*, and it is called **Cross-Entropy**.  

[Watch the source Udacity video 1](https://www.youtube.com/watch?v=iREoPUrpXvE&t=197s)  


# Binary Cross-Entropy (Log Loss)  
[Watch the source Udacity video 2](https://www.youtube.com/watch?v=1BnhC6e0TFw&t=268s)  

The binary cross-entropy, also known as log loss, is a common form of cross-entropy used in binary classification tasks. For a pair of true labels $y_i$ and predicted probabilities $p_i$, the binary cross-entropy is calculated as:

$H(y, p) = -\sum_{i=1}^{m} \left(y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i)\right)$

where:
- $m$ is the number of instances or samples.
- $y_i$ is the true label for instance $i$, either 0 or 1.
- $p_i$ is the predicted probability that instance $i$ belongs to class 1.

The goal during model training is often to minimize this binary cross-entropy loss, encouraging the model to make predictions that align well with the true binary labels.
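As a quick worked illustration, consider a single sample with true label $y = 1$. The probabilities below are made up for the example, and $\log$ is taken to be the natural logarithm. If the model predicts $p = 0.9$:

$H = -\left(1 \cdot \log(0.9) + 0 \cdot \log(0.1)\right) \approx 0.105$

If the model instead predicts $p = 0.1$:

$H = -\left(1 \cdot \log(0.1) + 0 \cdot \log(0.9)\right) \approx 2.303$

A confident wrong prediction is penalized far more heavily than a confident correct one, which is the behavior we want from an error function.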

# Categorical Cross-Entropy (Softmax Cross-Entropy)  
[Watch the source Udacity video 3](https://www.youtube.com/watch?v=keDswcqkees&t=154s)  

The categorical cross-entropy, also known as softmax cross-entropy, is a common form of cross-entropy used in multi-class classification tasks. For true one-hot encoded labels $y_{ij}$ and predicted probabilities $p_{ij}$, the categorical cross-entropy is calculated as:

$H(y, p) = -\sum_{i=1}^{K} \sum_{j=1}^{m} y_{ij} \cdot \log(p_{ij})$

where:
- $K$ is the number of classes.
- $m$ is the number of instances or samples.
- $y_{ij}$ is the indicator function (1 if instance $j$ belongs to class $i$, 0 otherwise).
- $p_{ij}$ is the predicted probability that instance $j$ belongs to class $i$.

The goal during model training is often to minimize this categorical cross-entropy loss, encouraging the model to make predictions that align well with the true multi-class labels.
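The formula above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `categorical_cross_entropy`, the `eps` clipping value, and the sample arrays are assumptions made for the example. The arrays are laid out as `(K, m)` so that row `i` is a class and column `j` is an instance, matching the indices in the formula.

```python
import numpy as np

def categorical_cross_entropy(Y, P, eps=1e-12):
    """Summed categorical cross-entropy for one-hot Y and probabilities P, both shaped (K, m)."""
    Y = np.asarray(Y, dtype=float)
    # Clip so that log(0) never occurs; eps is an arbitrary small constant.
    P = np.clip(np.asarray(P, dtype=float), eps, 1.0)
    # Only the entries where Y == 1 (the true class of each instance) contribute.
    return float(-np.sum(Y * np.log(P)))

# Hypothetical data: K = 3 classes, m = 2 instances.
Y = np.array([[1, 0],
              [0, 1],
              [0, 0]])
P = np.array([[0.7, 0.2],
              [0.2, 0.5],
              [0.1, 0.3]])

loss = categorical_cross_entropy(Y, P)
print(loss)  # equals -(ln 0.7 + ln 0.5)
```

Because the labels are one-hot, the double sum reduces to the negative log-probability assigned to the correct class of each instance.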


## **Note 1**  
### Logarithmic Functions in Cross-Entropy

In general notation, "log" and "ln" are both logarithms, but they may use different bases:

- **$\log$:** Depending on context, it denotes the logarithm with base 10 (common logarithm), base 2, or base $e$. In machine learning texts and in NumPy (`np.log`), it usually means the natural logarithm.
- **$\ln$:** Specifically denotes the natural logarithm with base $e$, where $e$ is Euler's number (approximately 2.71828).

In the cross-entropy formula, you can use either base, as long as you are **consistent** within the formula. Changing the base multiplies the loss by a constant factor (for example, $\ln(x) = \ln(10) \cdot \log_{10}(x)$), so the numerical value of the loss changes, but the predictions that minimize it do not.

So the binary cross-entropy can be written with either $\log$ or $\ln$, and both forms have the same minimizer:

$H(y, p) = -\sum_{i=1}^{m} \left(y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i)\right)$

or

$H(y, p) = -\sum_{i=1}^{m} \left(y_i \cdot \ln(p_i) + (1 - y_i) \cdot \ln(1 - p_i)\right)$
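To make the effect of the base concrete, here is a small NumPy sketch with made-up labels and probabilities. It computes the same binary cross-entropy once with the natural logarithm (`np.log`) and once with base 10 (`np.log10`). The two values differ only by the constant factor $\ln(10)$, so the choice of base rescales the loss without changing which predictions minimize it.

```python
import numpy as np

# Hypothetical labels and predicted probabilities.
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.8, 0.3, 0.6])

# Same formula, two different logarithm bases.
loss_ln = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
loss_log10 = -np.sum(y * np.log10(p) + (1 - y) * np.log10(1 - p))

# The ratio is the constant ln(10), independent of the data.
ratio = loss_ln / loss_log10
print(loss_ln, loss_log10, ratio)
```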

## **Note 2**  

The negative sum in cross-entropy loss serves two main purposes:

1. **Positive Loss**: Logarithms of probabilities less than 1 are negative. Negating the sum makes the overall loss non-negative, and it reaches 0 only when every correct class is predicted with probability 1.

2. **Minimization Objective**: Optimization algorithms are conventionally written to minimize. Because the logarithm is an increasing function, maximizing the likelihood of the correct classes is the same as maximizing the sum of their log-probabilities, which in turn is the same as minimizing the negated sum.

In short, the negative sign yields a non-negative loss and turns likelihood maximization into a minimization problem.
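The link to the likelihood can be written out for the binary case. The symbol $L$ for the likelihood is introduced here for illustration, and the samples are assumed to be independent:

$L(y, p) = \prod_{i=1}^{m} p_i^{y_i} (1 - p_i)^{1 - y_i}$

$-\log L(y, p) = -\sum_{i=1}^{m} \left(y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i)\right) = H(y, p)$

Since $-\log$ is a decreasing function, the predictions that maximize $L$ are exactly the ones that minimize $H$.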

## **Note 3**  
In the context of machine learning, <span style="background-color: #FFFF99; color: #000000; font-weight: bold;">maximizing the probability</span> of the correct class (or equivalently, <span style="background-color: #FFFF99; color: #000000; font-weight: bold;">minimizing the error function</span>) is indeed the same as <span style="background-color: #FFFF99; color: #000000; font-weight: bold;">minimizing the cross-entropy</span>.  
This is because cross-entropy is a measure of the difference between two probability distributions, and in this case, we want our predicted distribution (the output of our model) to be as close as possible to the true distribution (the actual labels). So, by <span style="background-color: #FFFF99; color: #000000; font-weight: bold;">minimizing the cross-entropy</span>, we are effectively <span style="background-color: #FFFF99; color: #000000; font-weight: bold;">maximizing the probability</span> of the correct class.
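The equivalence can be checked numerically. The sketch below compares two hypothetical sets of predictions for the same labels. The helper names `likelihood` and `binary_cross_entropy` and all the numbers are assumptions made for illustration, and the likelihood is computed under the assumption that samples are independent.

```python
import numpy as np

def likelihood(Y, P):
    """Probability the predictions assign to the observed labels (independent samples assumed)."""
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    # Use P where the label is 1 and (1 - P) where the label is 0.
    return float(np.prod(np.where(Y == 1, P, 1 - P)))

def binary_cross_entropy(Y, P):
    """Summed binary cross-entropy with the natural logarithm."""
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return float(-np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P)))

Y = [1, 0, 1]
good = [0.9, 0.2, 0.8]  # mostly agrees with the labels
bad = [0.6, 0.5, 0.4]   # hedges or disagrees

for name, P in [("good", good), ("bad", bad)]:
    print(name, likelihood(Y, P), binary_cross_entropy(Y, P))
```

The predictions with the higher likelihood have the lower cross-entropy, and in both cases `exp(-cross_entropy)` recovers the likelihood.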



In [1]:
import numpy as np

# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.

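def cross_entropy(Y, P):
    # np.float_ was removed in NumPy 2.0, so convert with np.asarray instead.
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    # Sum of the per-sample binary cross-entropy terms (natural logarithm).
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))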
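
One practical caveat with the function above: if any predicted probability is exactly 0 or 1, `np.log` receives 0 and the result becomes infinite or `nan`. A possible variant, sketched here under the assumption that clipping the probabilities is acceptable, keeps them inside `[eps, 1 - eps]` first. The name `cross_entropy_clipped`, the `eps` value, and the sample inputs are illustrative choices, not part of the original exercise.

```python
import numpy as np

def cross_entropy_clipped(Y, P, eps=1e-12):
    """Binary cross-entropy that stays finite when P contains exact 0s or 1s."""
    Y = np.asarray(Y, dtype=float)
    # Keep probabilities strictly inside (0, 1) so both logarithms are defined.
    P = np.clip(np.asarray(P, dtype=float), eps, 1 - eps)
    return float(-np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P)))

# Example inputs chosen for illustration.
loss = cross_entropy_clipped([1, 0, 1, 1], [0.4, 0.6, 0.1, 0.5])
print(round(loss, 4))  # 4.8283

# Extreme probabilities no longer produce inf or nan.
print(cross_entropy_clipped([1, 0], [1.0, 0.0]))
```

For probabilities that are already strictly between 0 and 1, the clipping has no effect and the result matches the unclipped formula.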