# neuralthreads
[medium](https://neuralthreads.medium.com/i-was-not-satisfied-by-any-deep-learning-tutorials-online-37c5e9f4bea1)

## Chapter 4 — Losses and their derivatives

Binary cross-entropy loss — Special case of Categorical cross-entropy loss

In this post, we will talk about Binary cross-entropy loss and what it means. We will also see how to compute its gradients and why it is a special case of Categorical cross-entropy loss.

### 4.4 What is Binary cross-entropy loss and how to compute the gradients?

Let me ask you: which of the following sentences are true for this image?

![Alt text](image.png)

1. He is RDJ.
2. He played the role of Iron-man in MCU.
3. He also played Jack Sparrow.
4. He is also Sherlock Holmes.

We know that the correct answers ‘y_true’ are,

In [81]:
%%latex
\begin{gather*}
\newcommand{\arraystretch}{1.5}
    y\_true = 
    \begin{bmatrix*}
    1 \\
    1 \\
    0 \\
    1
    \end{bmatrix*}
\end{gather*}


But let us suppose you don’t know and you guess your answers as probabilities ‘y_pred’ as follows,

In [82]:
%%latex
\begin{gather*}
\newcommand{\arraystretch}{1.5}
    y\_pred =  
    \begin{bmatrix*}
    0.9  \\
    0.95 \\
    0.75  \\
    0.85
    \end{bmatrix*}
\end{gather*}


Two things,

**First**, each element in y_true and y_pred is an independent answer, unlike in the Categorical cross-entropy example, because we have 4 separate questions in this example.

**Second**, we can set a threshold value at 0.74, i.e., all values equal to or greater than 0.74 are 1 and others are 0.  
In that case, you correctly answered 3 out of 4 questions, because the 0.75 you gave for the third question becomes 1.  
But what if the threshold value is 0.8? In that case, you correctly answered all questions, but there is still an error because the predicted value for the third question is not 0 or close to 0.
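The effect of the threshold can be checked directly in NumPy. This is only a small sketch: the helper name `count_correct` is hypothetical and is not used anywhere else in this series.

```python
import numpy as np

y_true = np.array([[1], [1], [0], [1]])
y_pred = np.array([[0.9], [0.95], [0.75], [0.85]])

def count_correct(y_true, y_pred, threshold):
    # values equal to or greater than the threshold become 1, others 0
    y_hat = (y_pred >= threshold).astype(int)
    return int(np.sum(y_hat == y_true))

print(count_correct(y_true, y_pred, 0.8))    # 4
print(count_correct(y_true, y_pred, 0.74))   # 3
```

Counting correct answers hides how far 0.75 is from 0, which is why a loss that works on the probabilities themselves is needed.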

Here, we will use Binary cross-entropy loss.

Suppose we have true values,

In [83]:
%%latex
\begin{gather*}
\newcommand{\arraystretch}{1.5}
    y\_true = y =
    \begin{bmatrix*}
    y_1 \\
    y_2 \\
    y_3
    \end{bmatrix*}
\end{gather*}



and predicted values,

In [84]:
%%latex
\begin{gather*}
\newcommand{\arraystretch}{1.5}
    y\_pred = \hat{y} =
    \begin{bmatrix*}
    \hat{y_1} \\
    \hat{y_2} \\
    \hat{y_3}
    \end{bmatrix*}
\end{gather*}


Then Binary cross-entropy loss is calculated as follows:

In [85]:
%%latex
\begin{align*}
    BCE = - \frac{1}{N} \sum_{i=1}^{i=N} [y\_true_i \cdot log(y\_pred_i) + (1 - y\_true_i) \cdot log(1 - y\_pred_i) ]\\
        \\
    BCE = - \frac{1}{N} \sum_{i=1}^{i=N} [y_i \cdot log(\hat{y_i}) + (1 - y_i) \cdot log(1 - \hat{y_i}) ]\\
        \\
    \Rightarrow BCE = - \dfrac{1}{3} [y_1 \cdot log(\hat{y_1}) + (1 - y_1) \cdot log(1 - \hat{y_1})\, + \\ 
                                        +\, y_2 \cdot log(\hat{y_2}) + (1 - y_2) \cdot log(1 - \hat{y_2})\, + \\ 
                                           +\,  y_3 \cdot log(\hat{y_3}) + (1 - y_3) \cdot log(1 - \hat{y_3}) ] \\
\end{align*}


We can easily calculate Binary cross-entropy loss in Python like this.

In [86]:
import numpy as np                             # importing NumPy
np.random.seed(42)

def B_cross_E(y_true, y_pred):                     # BCE
    return - np.mean(y_true * np.log(y_pred + 10**-100) + (1 - y_true) * np.log(1 - y_pred + 10**-100))
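As a quick illustration, the same formula can be applied to the four-question example from the start of this post. The sketch below keeps the per-question terms separate before taking the mean. The variable name `terms` is only for illustration, and the stability constant is left out because no prediction here is exactly 0 or 1.

```python
import numpy as np

y_true = np.array([[1], [1], [0], [1]])
y_pred = np.array([[0.9], [0.95], [0.75], [0.85]])

# per-question terms of the BCE sum (before taking the mean)
terms = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(terms.ravel())          # one loss term per question
print(terms.mean())           # BCE, roughly 0.426
print(int(terms.argmax()))    # 2, i.e., the third question
```

The third question contributes by far the largest term, because 0.75 was predicted for a statement that is false.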

Now, we know that

In [87]:
%%latex
\begin{gather*}
    BCE = f(\hat{y_1},\hat{y_2},\hat{y_3})
\end{gather*}


So, like MSE, MAE, and CE, we have a **Jacobian** for BCE.

In [88]:
%%latex
\begin{gather*}
    \newcommand{\arraystretch}{2.5}
    J = \dfrac{\partial{(BCE)}}{\partial(\hat{y_1},\hat{y_2},\hat{y_3})} =     
    \begin{bmatrix*}
    \dfrac{\partial{(BCE)}}{\partial(\hat{y_1})} \\
    \dfrac{\partial{(BCE)}}{\partial(\hat{y_2})} \\
    \dfrac{\partial{(BCE)}}{\partial(\hat{y_3})} 
    \end{bmatrix*}
    
\end{gather*}



We can easily find each term in this Jacobian.

In [89]:
%%latex
\begin{gather*}
    \newcommand{\arraystretch}{2}
    \Rightarrow J =  
    \begin{bmatrix*}
      - \dfrac{1}{3} ( \dfrac{y_1}{\hat{y_1}} - \dfrac{1 - y_1}{1 - \hat{y_1}})\\
      - \dfrac{1}{3} ( \dfrac{y_2}{\hat{y_2}} - \dfrac{1 - y_2}{1 - \hat{y_2}})\\
      - \dfrac{1}{3} ( \dfrac{y_3}{\hat{y_3}} - \dfrac{1 - y_3}{1 - \hat{y_3}})
    \end{bmatrix*} \Rightarrow  \\

    \newcommand{\arraystretch}{1.5}
    \Rightarrow J =  - \dfrac{1}{3} (
    \begin{bmatrix*}
      y_{1} \\
      y_{2} \\ 
      y_{3} \\
    \end{bmatrix*} / 
    \begin{bmatrix*}
      \hat{y_1} \\
      \hat{y_2} \\ 
      \hat{y_3} \\
    \end{bmatrix*} - 
    \begin{bmatrix*}
      1 - y_{1} \\
      1 - y_{2} \\ 
      1 - y_{3} \\
    \end{bmatrix*} / 
    \begin{bmatrix*}
      1 - \hat{y_1} \\
      1 - \hat{y_2} \\ 
      1 - \hat{y_3} \\
    \end{bmatrix*}
    ) \Rightarrow  \\
    \\
  \Rightarrow J = - \dfrac{1}{3} ( \frac{y\_true}{y\_pred} - \frac{1 - y\_true}{1 - y\_pred})
\end{gather*}


> Note — Here, 3 represents ‘N’, i.e., the number of entries in y_true and y_pred.
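For readers who want the intermediate step, here is a short derivation of the first entry of the Jacobian. It assumes that log is the natural logarithm, so the derivative of log(x) is 1/x, and the chain rule gives the extra minus sign for log(1 - x). Only the terms that contain the first prediction survive the differentiation.

```latex
\begin{align*}
    \dfrac{\partial{(BCE)}}{\partial(\hat{y_1})}
    &= - \dfrac{1}{3} \dfrac{\partial}{\partial(\hat{y_1})} [ y_1 \cdot log(\hat{y_1}) + (1 - y_1) \cdot log(1 - \hat{y_1}) ] \\
    &= - \dfrac{1}{3} ( \dfrac{y_1}{\hat{y_1}} - \dfrac{1 - y_1}{1 - \hat{y_1}} ) \\
    &= \dfrac{1}{3} \cdot \dfrac{\hat{y_1} - y_1}{\hat{y_1} (1 - \hat{y_1})}
\end{align*}
```

The last line is just the second line over a common denominator. It shows that the entry is zero only when the prediction equals the true value.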

We can easily define the BCE Jacobian in Python like this.

In [90]:
def B_cross_E_grad(y_true, y_pred):           # BCE Jacobian
    N = y_true.shape[0]
    return -(y_true / (y_pred + 10**-100) - (1 - y_true) / (1 - y_pred + 10**-100)) / N

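> Note — 10**-100 is added for numerical stability. It avoids log(0) in the loss and division by zero in the Jacobian when a prediction is exactly 0 or 1.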
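One way to gain confidence in the analytical Jacobian is to compare it with a numerical estimate. The sketch below repeats the two functions from this post so that it runs on its own, and adds a hypothetical helper `numerical_grad` that uses central finite differences. The step size `h` is an assumption chosen for illustration.

```python
import numpy as np

def B_cross_E(y_true, y_pred):                     # BCE, as defined in this post
    return - np.mean(y_true * np.log(y_pred + 10**-100) + (1 - y_true) * np.log(1 - y_pred + 10**-100))

def B_cross_E_grad(y_true, y_pred):                # BCE Jacobian, as defined in this post
    N = y_true.shape[0]
    return -(y_true / (y_pred + 10**-100) - (1 - y_true) / (1 - y_pred + 10**-100)) / N

def numerical_grad(y_true, y_pred, h=1e-6):        # hypothetical helper
    # nudge one prediction at a time and measure the change in the loss
    grad = np.zeros_like(y_pred)
    for i in range(y_pred.shape[0]):
        step = np.zeros_like(y_pred)
        step[i, 0] = h
        grad[i, 0] = (B_cross_E(y_true, y_pred + step) - B_cross_E(y_true, y_pred - step)) / (2 * h)
    return grad

y_true = np.array([[1], [0], [1], [1]])
y_pred = np.array([[0.4], [0.5], [0.8], [0.2]])

analytical = B_cross_E_grad(y_true, y_pred)
numerical = numerical_grad(y_true, y_pred)
print(np.allclose(analytical, numerical, atol=1e-6))   # True
```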

Let us have a look at an example.

In [91]:
y_true = np.array([[1], [0], [1], [1]])
print(y_true)
y_pred = np.array([[0.4], [0.5], [0.8], [0.2]])
print(y_pred)
print(y_pred.shape)

[[1]
 [0]
 [1]
 [1]]
[[0.4]
 [0.5]
 [0.8]
 [0.2]]
(4, 1)


In [92]:
B_cross_E(y_true, y_pred)

0.8605048440456026

In [93]:
B_cross_E_grad(y_true, y_pred)

array([[-0.625 ],
       [ 0.5   ],
       [-0.3125],
       [-1.25  ]])
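The signs in this Jacobian tell us in which direction each prediction should move. As a rough sketch, we can take one small step against the gradient and check that the loss goes down. The step size `lr` and the name `y_new` are assumptions for illustration only, and the two functions are repeated so that the cell runs on its own.

```python
import numpy as np

def B_cross_E(y_true, y_pred):                     # BCE, as defined in this post
    return - np.mean(y_true * np.log(y_pred + 10**-100) + (1 - y_true) * np.log(1 - y_pred + 10**-100))

def B_cross_E_grad(y_true, y_pred):                # BCE Jacobian, as defined in this post
    N = y_true.shape[0]
    return -(y_true / (y_pred + 10**-100) - (1 - y_true) / (1 - y_pred + 10**-100)) / N

y_true = np.array([[1], [0], [1], [1]])
y_pred = np.array([[0.4], [0.5], [0.8], [0.2]])

lr = 0.1                                           # hypothetical step size
y_new = y_pred - lr * B_cross_E_grad(y_true, y_pred)

print(y_new.ravel())                               # predictions moved toward y_true
print(B_cross_E(y_true, y_pred))                   # loss before the step
print(B_cross_E(y_true, y_new))                    # loss after the step, which is lower
```

Predictions for true statements have negative entries, so they are pushed up. The prediction for the false statement has a positive entry, so it is pushed down.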

I hope you now understand what Binary cross-entropy loss is.

Now, let us talk a bit about Categorical and Binary cross-entropy loss functions.

The answer to a simple ‘Yes/No’ question is never just ‘Yes’ or just ‘No’ on its own. The two probabilities always complement each other, i.e., ‘Yes’ + ‘No’ = 1.

If we had only 1 question in this example of Binary cross-entropy loss, then it would be a simple case of Categorical cross-entropy loss with two options, ‘Yes’ and ‘No’.

But we have ‘N’ questions, so we take the mean over all of them.
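This relationship can be checked numerically. The sketch below treats each question as a two-option problem [‘Yes’, ‘No’] and averages the Categorical cross-entropy over the questions. The function `cross_E` is an assumed plain definition written only for this check, and it may differ in details (such as a stability constant) from the one used earlier in this chapter.

```python
import numpy as np

def cross_E(y_true, y_pred):
    # assumed plain Categorical cross-entropy for one question
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.4, 0.5, 0.8, 0.2])

# each question becomes [Yes, No] = [y, 1 - y] with probabilities [p, 1 - p]
ce = [cross_E(np.array([y, 1 - y]), np.array([p, 1 - p])) for y, p in zip(y_true, y_pred)]

# Binary cross-entropy over all questions
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(np.isclose(np.mean(ce), bce))   # True
```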

> Note — In Chapter 5, we will talk more about the **Sigmoid activation function and Binary cross-entropy loss function** for Backpropagation. In the output of the Sigmoid function, every element is independent and lies between 0 and 1, so the elements can be interpreted as probabilities, i.e., as answers to ‘N’ questions.

With this post, Chapter 4 — Losses and their derivatives is finished. The next post starts Chapter 5 — Diving deep in the Neural Networks, in which we will talk about how Artificial Neural Networks are built. It will be followed by Backpropagation, L1 and L2 penalties, Dropout, Layer Normalization, Batch training and Validation sets, and finally the UCI white wine quality dataset example.