# Derivations related to Cross Entropy Loss (Binary and Categorical)

### What is Binary Cross Entropy Loss?

Binary Cross Entropy Loss is defined as follows, given the actual outputs $y$ and our prediction vector $y_{pred}$:

$$ BCE = -\sum_{i = 1}^{m}y^{(i)} \cdot \log(y_{pred}^{(i)}) + (1 - y^{(i)}) \cdot \log(1 - y_{pred}^{(i)}) $$

I will go over each term one by one to explain the intuition.
If the actual output $y^{(i)}$ is 1, then only the term $y^{(i)} \cdot \log(y_{pred}^{(i)})$ remains, since $(1 - y^{(i)})$ is 0. Taking the log of our predicted probability, a probability close to 0 gives a value close to $-\infty$, whereas a probability close to 1 gives a value close to 0. Since we were meant to predict 1, the further our prediction is from 1, the larger the negative number we get.

In the other case, when the actual output $y^{(i)}$ is 0, only the term $(1 - y^{(i)}) \cdot \log(1 - y_{pred}^{(i)})$ remains. If our probability $y_{pred}^{(i)}$ is close to 1, then subtracting it from 1 gives a number close to 0, and taking the log of that number gives a value close to $-\infty$. However, when our prediction is close to 0 (the true output), subtracting it from 1 gives a number close to 1, and its log is close to 0.

To summarize: when we get an example very wrong, its term is a very large negative number, and when we get it nearly right, its term is a small negative number. Summing over all examples gives the total. Since we typically think of a loss as a positive quantity to be minimized, we negate the whole summation so that the result is positive. Our goal is then to minimize it.
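To make this intuition concrete, here is a minimal sketch that evaluates the per-example term (before negation) for a few predictions. The helper name `bce_term` and the sample probabilities are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def bce_term(y_true, p):
    """Per-example term inside the BCE sum, before the overall negation."""
    return y_true * np.log(p) + (1 - y_true) * np.log(1 - p)

# Hypothetical predicted probabilities: very low, undecided, very high
for p in (0.01, 0.5, 0.99):
    print(f"y=1, y_pred={p:.2f}: term={bce_term(1, p):.4f}")
    print(f"y=0, y_pred={p:.2f}: term={bce_term(0, p):.4f}")
```

For $y = 1$ the term moves toward 0 as the prediction approaches 1, and for $y = 0$ the pattern is mirrored.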

### Coding the BCE Formula 

In [1]:
import numpy as np
import sys
from pathlib import Path

# Add the directory containing your code file to sys.path
sys.path.append(str(Path().resolve().parent.parent))

from models.LogisticRegression import LogisticRegression

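# CREATE SOME DUMMY DATA
x = np.linspace(0, 10, 20)
# NOTE: the condition below has no comparison, so `el + 2*np.random.rand()` is
# truthy for (effectively) every element and all 20 labels come out as 1
y = np.array([1 if el + 2*np.random.rand() else 0 for el in x])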

# reshape the data to place into the model
X = np.reshape(x, (x.shape[0], 1))

# initialize our model (I will use my Logistic Regression here)
model = LogisticRegression(1)

First, we need to find our predicted values $y_{pred}$ using our model (in this case, Logistic Regression).

In [2]:
y_pred = model(X)
y_pred

array([0.55120367, 0.57412176, 0.59672577, 0.61892637, 0.64064082,
       0.66179405, 0.68231958, 0.70216004, 0.72126758, 0.73960388,
       0.7571401 , 0.77385645, 0.78974172, 0.80479261, 0.81901303,
       0.83241324, 0.84500905, 0.85682101, 0.86787352, 0.87819413])

Now we just need to apply the formula above in code, with our actual outputs $y$ and our predicted outputs $y_{pred}$.
<br>
This is simple with NumPy, since we can take the element-wise log of an array, multiply two arrays element-wise, and sum the result.

In [3]:
def BCELoss(y, y_pred):
    return -np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

print(f"BCE Loss is: {BCELoss(y, y_pred)}")

BCE Loss is: 6.334363598521517
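One practical caveat with the formula as coded: if a prediction is exactly 0 or 1, `np.log` returns `-inf`, and a product such as `0 * -inf` evaluates to `nan`. A common workaround, sketched here as an assumption rather than something the original code does, is to clip the predictions slightly away from 0 and 1 first. The name `bce_loss_clipped`, the default `eps`, and the demo arrays are hypothetical.

```python
import numpy as np

def bce_loss_clipped(y, y_pred, eps=1e-12):
    """BCE with predictions clipped into [eps, 1 - eps] to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

# Hypothetical data containing saturated predictions (exactly 1.0 and 0.0)
y_demo = np.array([1.0, 0.0, 1.0])
p_demo = np.array([1.0, 0.0, 0.8])

loss = bce_loss_clipped(y_demo, p_demo)
print(f"Clipped BCE Loss is: {loss}")
```

The choice of `eps` trades a tiny bias in the loss for finite values; it does not change the loss noticeably for predictions that are not saturated.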


### Why Binary Cross Entropy? (Digging into the theory)

##### *(Feel free to skip this math if you are not interested)*

Let $y^{(i)} \in \{0, 1\}, \: y_{pred}^{(i)} \in [0, 1]$ be the actual output and predicted output for a given feature vector $x^{(i)} \in \mathbb{R}^{n}$ 
<br> Assume $P(y^{(i)} = 1 | x^{(i)}) = y_{pred}^{(i)}$. Then,
    $$ \implies P(y^{(i)} = 0 | x^{(i)}) = 1 - y_{pred}^{(i)} $$
    $$ \implies P(y^{(i)} | x^{(i)}) = (y_{pred}^{(i)})^{y^{(i)}} \cdot (1 - y_{pred}^{(i)})^{1 - y^{(i)}} $$
    $$ \implies y^{(i)} | x^{(i)} \sim Bernoulli(y_{pred}^{(i)}) $$

So our output given our input follows a Bernoulli distribution whose probability parameter $p$ is our predicted output.

Let's define our predicted output as a function of our parameters for some fixed input. Rather than defining a hypothesis function of our input vector $x^{(i)}$, we define a function for a fixed x, and make our parameters the variable.

Let $$ g^{(i)}: \mathbb{R}^{n} \rightarrow \mathbb{R}, \quad \theta \mapsto y_{pred}^{(i)} $$

Then we have the Likelihood as:

$$ \mathcal{L}(\theta) = \prod_{i = 1}^{m} P(y^{(i)} | x^{(i)}) $$

By taking the log of the likelihood, we then get a summation:

$$ \log{\mathcal{L}(\theta)} = l(\theta) = \sum_{i=1}^{m}\log((g^{(i)}(\theta))^{y^{(i)}} \cdot (1 - g^{(i)}(\theta))^{1 - y^{(i)}}) $$

$$ = \sum_{i=1}^{m}\log((g^{(i)}(\theta))^{y^{(i)}}) + \log((1 - g^{(i)}(\theta))^{1 - y^{(i)}}) $$
$$ = \sum_{i=1}^{m}y^{(i)} \cdot \log(g^{(i)}(\theta)) + (1 - y^{(i)}) \cdot \log(1 - g^{(i)}(\theta)) $$

Then we use Maximum Likelihood Estimation to estimate our parameters $\theta$:

$$ \hat{\theta} = \underset{\theta}{\mathrm{argmax}} \: l(\theta) $$
$$ = \underset{\theta}{\mathrm{argmax}} [\sum_{i=1}^{m}y^{(i)} \cdot \log(g^{(i)}(\theta)) + (1 - y^{(i)}) \cdot \log(1 - g^{(i)}(\theta))] $$
$$ = \underset{\theta}{\mathrm{argmin}} [-\sum_{i=1}^{m}y^{(i)} \cdot \log(g^{(i)}(\theta)) + (1 - y^{(i)}) \cdot \log(1 - g^{(i)}(\theta))] $$
$$ = \underset{\theta}{\mathrm{argmin}} [BCE] $$

So maximizing the likelihood is equivalent to minimizing the binary cross entropy loss.
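The equivalence can be checked numerically. The sketch below builds the Bernoulli likelihood as a product and compares its negative log with the BCE sum; the labels and probabilities are made-up illustrative values.

```python
import numpy as np

# Hypothetical labels and predicted probabilities
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Likelihood: product of Bernoulli probabilities p^y * (1 - p)^(1 - y)
likelihood = np.prod(p**y * (1 - p)**(1 - y))

# BCE: negated sum of the log terms
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(f"negative log-likelihood: {-np.log(likelihood)}")
print(f"BCE:                     {bce}")
```

Both printed values agree up to floating point error, so a $\theta$ that maximizes the likelihood also minimizes the BCE.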

### The Derivative of BCE 

To run an optimization algorithm (such as gradient descent), we need the partial derivatives of the loss function with respect to the parameters of the model.

So far, we have defined $y^{(i)}, g^{(i)}(\theta), x^{(i)}$ to be the output, the predicted output (as a function of our parameters), and our input vector, respectively. To calculate the gradient of our loss function, we need the partial derivative w.r.t. each parameter in the model. Let's first write our loss as a function of our parameters, $J(\theta) = BCE$. Then for some arbitrary parameter $\theta_j$:

$$ \implies \frac{\partial J}{\partial \theta_{j}} = \frac{\partial}{\partial \theta_{j}}[-\sum_{i=1}^{m}y^{(i)} \cdot \log(g^{(i)}(\theta)) + (1 - y^{(i)}) \cdot \log(1 - g^{(i)}(\theta))] $$

$$ = -\sum_{i=1}^{m}\frac{\partial}{\partial \theta_{j}}[y^{(i)} \cdot \log(g^{(i)}(\theta)) + (1 - y^{(i)}) \cdot \log(1 - g^{(i)}(\theta))] $$
$$ = -\sum_{i=1}^{m}y^{(i)}\cdot \frac{1}{g^{(i)}(\theta)} \cdot \frac{\partial g^{(i)}}{\partial \theta_{j}} + (1 - y^{(i)})\cdot \frac{1}{(1 - g^{(i)}(\theta))} \cdot (-\frac{\partial g^{(i)}}{\partial \theta_{j}})$$
$$ = -\sum_{i=1}^{m}\frac{\partial g^{(i)}}{\partial \theta_{j}} \cdot (y^{(i)}\cdot \frac{1}{g^{(i)}(\theta)} - (1 - y^{(i)})\cdot \frac{1}{(1 - g^{(i)}(\theta))}) $$
$$ = -\sum_{i=1}^{m}\frac{\partial g^{(i)}}{\partial \theta_{j}} \cdot (\frac{y^{(i)}\cdot (1 - g^{(i)}(\theta)) - (1 - y^{(i)}) \cdot g^{(i)}(\theta)}{g^{(i)}(\theta) \cdot (1 - g^{(i)}(\theta))}) $$
$$ = -\sum_{i=1}^{m}\frac{\partial g^{(i)}}{\partial \theta_{j}} \cdot (\frac{y^{(i)} - g^{(i)}(\theta)}{g^{(i)}(\theta) \cdot (1 - g^{(i)}(\theta))}) $$

Now we can simplify this into a dot product of vectors since we are taking the sum of a product of elements. 
$$\text{Let  } y = (y^{(1)}, y^{(2)}, ..., y^{(m)}) \in \mathbb{R}^{m} \: \text{  and  } \: g(\theta) = (g^{(1)}(\theta), g^{(2)}(\theta), ..., g^{(m)}(\theta)) \in \mathbb{R}^{m} $$
$$\text{Let } \frac{\partial g}{\partial \theta_{j}} = (\frac{\partial g^{(1)}}{\partial \theta_{j}}, \frac{\partial g^{(2)}}{\partial \theta_{j}}, ..., \frac{\partial g^{(m)}}{\partial \theta_{j}}) \in \mathbb{R}^{m}$$

Then we can express this as the following:

$$ \frac{\partial J}{\partial \theta_{j}} = - \frac{\partial g}{\partial \theta_{j}}^T[(y - g(\theta)) \otimes (g(\theta) - g(\theta) \odot g(\theta))] $$

Where $\odot$ is the Hadamard product (element-wise multiplication) and $\otimes$ here denotes Hadamard division (element-wise division).

Let's refer to this set of Hadamard products and divisions as 
$y_{g} = (y - g(\theta)) \otimes (g(\theta) - g(\theta) \odot g(\theta)) \in \mathbb{R}^{m}$

Then we have: $$ \frac{\partial J}{\partial \theta_{j}} = - \frac{\partial g}{\partial \theta_j}^T y_{g} $$
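As a worked special case, assume the model is a logistic regression, $g^{(i)}(\theta) = \sigma(\theta^T x^{(i)})$ with $\sigma(z) = \frac{1}{1 + e^{-z}}$. This choice of $g$ is an illustrative assumption; the derivation above holds for any differentiable $g$. Since $\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))$, the chain rule gives:

$$ \frac{\partial g^{(i)}}{\partial \theta_{j}} = g^{(i)}(\theta) \cdot (1 - g^{(i)}(\theta)) \cdot x_{j}^{(i)} $$

Substituting this into the sum, the denominator $g^{(i)}(\theta) \cdot (1 - g^{(i)}(\theta))$ cancels:

$$ \frac{\partial J}{\partial \theta_{j}} = -\sum_{i=1}^{m}(y^{(i)} - g^{(i)}(\theta)) \cdot x_{j}^{(i)} $$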

Now, to calculate the full gradient of the loss (the vector of all partial derivatives), let us define the Jacobian matrix of $g$, whose columns are the vectors $\frac{\partial g}{\partial \theta_{j}}$:
$$\text{Let } \frac{\partial g}{\partial \theta} = \begin{pmatrix} \frac{\partial g}{\partial \theta_{1}} & \cdots & \frac{\partial g}{\partial \theta_{n}} \end{pmatrix} \in \mathbb{R}^{m \times n} $$

Finally we have:

$$ \nabla J(\theta) = - \frac{\partial g}{\partial \theta}^T y_{g} $$

### Coding the derivative

In [9]:
import numpy as np
import sys
from pathlib import Path

# Add the directory containing your code file to sys.path
sys.path.append(str(Path().resolve().parent.parent))

from models.LogisticRegression import LogisticRegression

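# CREATE SOME DUMMY DATA
x = np.linspace(0, 10, 20)
# NOTE: the condition below has no comparison, so `el + 2*np.random.rand()` is
# truthy for (effectively) every element and all 20 labels come out as 1
y = np.array([1 if el + 2*np.random.rand() else 0 for el in x])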

# reshape the data to place into the model
X = np.reshape(x, (x.shape[0], 1))

# initialize our model (I will use my Logistic Regression here)
model = LogisticRegression(1)

After simplifying the calculation down to a matrix-vector product, the coding becomes easy. We just have to compute $-y_g$ and let our model calculate its own derivatives. Let's start by running a forward pass through our model.

In [10]:
y_pred = model(X)
y_pred

array([0.67806954, 0.75025967, 0.81077919, 0.85938175, 0.89708506,
       0.92555488, 0.94661763, 0.96196598, 0.97302716, 0.98093524,
       0.98655681, 0.99053676, 0.99334636, 0.99532574, 0.99671822,
       0.99769684, 0.9983841 , 0.99886652, 0.99920503, 0.9994425 ])

Now we compute $-y_g = -(y - g(\theta)) \otimes (g(\theta) - g(\theta) \odot g(\theta))$ and multiply it by the transpose of the Jacobian matrix returned by our model.

In [14]:
def grads(y_pred, y, model_grads):
    # gradient of the loss: Jacobian^T @ (-y_g), with y_g = (y - y_pred) / (y_pred - y_pred^2)
    return np.dot(model_grads.T, (-1) * ((y - y_pred) / (y_pred - y_pred**2)))

model_grads = model.grads(X)
final_grads = grads(y_pred, y, model_grads)
final_grads

array([-1.75656544, -1.26024501])

As you can see we are left with a vector of gradients corresponding to each of the parameters of the model. We can then pass those gradients into an optimization algorithm such as stochastic gradient descent to apply them to the parameters.
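A quick way to gain confidence in the gradient formula is to compare it against central finite differences. Since `models.LogisticRegression` is not shown here, the sketch below uses a hypothetical stand-in model (a sigmoid of a weight times the input plus a bias); `predict`, `model_jacobian`, the starting `theta`, and the label threshold of 5 are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    # Hypothetical stand-in model: theta[:-1] are weights, theta[-1] is the bias
    return sigmoid(X @ theta[:-1] + theta[-1])

def model_jacobian(theta, X):
    # Jacobian of the predictions w.r.t. theta, shape (m, n_params)
    p = predict(theta, X)
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    return (p * (1 - p))[:, None] * X_aug

def bce(theta, X, y):
    p = predict(theta, X)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad(theta, X, y):
    # Analytic gradient: -(Jacobian)^T y_g
    p = predict(theta, X)
    y_g = (y - p) / (p - p**2)
    return -model_jacobian(theta, X).T @ y_g

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 20).reshape(-1, 1)
# Assumed threshold of 5 so that the labels contain both classes
y = (X[:, 0] + 2 * rng.random(20) > 5).astype(float)
theta = np.array([0.1, -0.3])

# Central finite differences, one parameter at a time
h = 1e-6
numeric = np.array([
    (bce(theta + h * e, X, y) - bce(theta - h * e, X, y)) / (2 * h)
    for e in np.eye(theta.size)
])
analytic = bce_grad(theta, X, y)
print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two vectors agree to several decimal places, the analytic expression is very likely implemented correctly.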

#### Note: Categorical cross entropy can be derived in a similar pattern
However, rather than outputting the probability of just one class, we output a probability distribution (vector) over $k$ classes. This means we assume the output $y^{(i)}$ given $x^{(i)}$ follows a multinoulli (categorical) distribution, which extends the Bernoulli distribution to more than two outcomes. If you follow a similar derivation, you should end up minimizing what is called the Categorical Cross Entropy Loss. The gradient of this loss also follows a similar process; however, you end up with a matrix of derivatives where each column is a class and each row a parameter (since each class has its own hypothesis function).
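As a rough illustration, the sketch below computes a categorical cross entropy from one-hot labels and softmax probabilities, then checks that the two-class case reduces to the BCE formula from earlier. The function names, logits, and labels are hypothetical, and softmax is one common (assumed) way to produce the class probabilities.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cce_loss(y_onehot, probs):
    # Only the log-probability of the true class contributes for each example
    return -np.sum(y_onehot * np.log(probs))

# Hypothetical logits for 2 examples and k = 3 classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1]])
probs = softmax(logits)
print(f"CCE Loss is: {cce_loss(y_onehot, probs)}")

# With k = 2, CCE reduces to BCE
y_bin = np.array([1, 0])
p_bin = np.array([0.8, 0.3])
bce = -np.sum(y_bin * np.log(p_bin) + (1 - y_bin) * np.log(1 - p_bin))
cce_two_class = cce_loss(np.stack([1 - y_bin, y_bin], axis=1),
                         np.stack([1 - p_bin, p_bin], axis=1))
print(f"BCE: {bce}, two-class CCE: {cce_two_class}")
```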