![](Figure_1.png)

## Derivative of tanh

$$\tanh x = { \frac{\sinh x}{\cosh x} }
 = { \frac {e^{x}-e^{-x}}{e^{x} +e^{-x}} } $$


$$\frac{d}{dx}{\sinh x} = \frac{1}{2}\left(\frac{d}{dx}e^{x}-\frac{d}{dx}e^{-x}\right) = \frac{1}{2}\left(e^{x} + e^{-x}\right) = \cosh x $$
$$\frac{d}{dx}{\cosh x} = \frac{1}{2}\left(\frac{d}{dx}e^{x}+\frac{d}{dx}e^{-x}\right) = \frac{1}{2}\left(e^{x} - e^{-x}\right) = \sinh x. $$
<p style="text-align: center;">  
Applying the quotient rule:
</p>

$$ \frac{d}{dx}{\tanh x} = \frac{\cosh x \frac{d}{dx} \sinh x - \sinh x \frac{d}{dx} \cosh x } {{\cosh}^{2} x} $$

$$ = \frac{\cosh x \cosh x - \sinh x \sinh x}{\cosh^{2} x}$$

$$ = 1 - \frac{\sinh ^{2}x}{\cosh^{2}x} = 1 - \tanh^{2} x $$

$$ \blacksquare . $$
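As a quick sanity check of the identity above, the following sketch compares $1 - \tanh^{2} x$ against a central finite difference. The function name `tanh_grad`, the grid of test points, and the step size `h` are illustrative choices, not part of the derivation.

```python
import numpy as np

def tanh_grad(x):
    """Analytic derivative from the derivation above: 1 - tanh^2(x)."""
    return 1.0 - np.tanh(x) ** 2

# Test points and step size are arbitrary illustrative choices.
x = np.linspace(-3.0, 3.0, 13)
h = 1e-5

# Central finite difference approximation of d/dx tanh(x).
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2.0 * h)

# Largest disagreement between the analytic and numerical derivatives.
max_err = np.max(np.abs(numeric - tanh_grad(x)))
print(max_err)
```

If the derivation is right, `max_err` should be tiny compared with the derivative values themselves, which lie between 0 and 1.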




## Derivative of logistic sigmoid

$$ \sigma (x) =  \frac{1}{1 + e^{-x}}$$

<p style="text-align: center;">  
Applying the chain rule:
</p>
$$ \frac{d}{dx} \sigma (x) = - \frac{1}{(1 + e^{-x})^{2}} \cdot \left(-e^{-x}\right) $$

$$ = \sigma (x) \frac{e^{-x}}{1 + e^{-x}} $$

$$ = \sigma (x) \frac{(1 + e^{-x}) - 1}{1 + e^{-x}} $$

$$ = \sigma (x) (1 - \sigma (x)) $$

$$ \blacksquare . $$
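The same kind of numerical check can be applied to the sigmoid result. This is a minimal sketch; the helper names `sigmoid` and `sigmoid_grad` and the test grid are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Analytic derivative from the derivation above: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Test points and step size are arbitrary illustrative choices.
x = np.linspace(-4.0, 4.0, 17)
h = 1e-5

# Central finite difference approximation of d/dx sigma(x).
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)

max_err = np.max(np.abs(numeric - sigmoid_grad(x)))
print(max_err)
```

Writing the gradient in terms of $\sigma(x)$ means a backward pass can reuse the activation already computed in the forward pass.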



## Derivative of ReLU

The Rectified Linear Unit is defined as $ f(x)=\text{max}(0,x) $.  
We can rewrite this using two cases:

\begin{equation} 
f(x)=
    \begin{cases}
      x, & \text{if}\ x>0 \\
      0, & \text{otherwise}
    \end{cases}
\end{equation}

Upon simple differentiation of the two cases, we get

\begin{equation} 
f'(x)=
    \begin{cases}
      1, & \text{if}\ x>0 \\
      0, & \text{otherwise}
    \end{cases}
\end{equation} 

The ReLU is continuous at $x = 0$ but not differentiable there: the slope is 0 to the left of zero and 1 to the right,
therefore its derivative at $x = 0$ is technically not defined;
however, we are explicitly setting $ f'(0) = 0 $ in the statement above, so we have defined a derivative $\forall x \in \mathbb{R}$.
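The two-case definitions translate directly into vectorized code. This is a small sketch with hypothetical function names `relu` and `relu_grad`; it follows the convention $f'(0) = 0$ chosen above.

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU, using the convention f'(0) = 0."""
    # The strict inequality sends x == 0 to the "otherwise" case.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```

Using `x >= 0` instead would correspond to the other common convention, $f'(0) = 1$.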



# Backward Pass - Backpropagation

## Derivative 1:

$ \frac{\partial L}{\partial W^{(2)}}$.

By the chain rule:

$ \frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial W^{(2)}} $

### factor #1

We are interested in the derivative of the Loss function with respect to the vector of activations $\mathbf{a}^{(2)}$.  This is written as

$\nabla_{\mathbf {a}^{(2)}} L(\mathbf{y},\mathbf{a}^{(2)})$.

$\frac{\partial L}{\partial a^{(2)}}  =  -\frac{1}{N}\frac{\partial}{\partial a^{(2)}} \sum\limits_{n \in N} \sum\limits_{i \in C} y_{n,i} \log {a}^{(2)}_{n,i}   $

where $N$ is the number of data points and $C$ is the number of unique classes.  For our Make Moons dataset, $N = 200$ and $C = 2$.

To preserve all information, this derivative is an $ N \times C $ matrix, where each element, $\frac{\partial L}{\partial a^{(2)}_{n,i}}$, is 

$-\frac{1}{N}\frac{y_{n,i}}{a^{(2)}_{n,i}}$.
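A minimal sketch of factor #1 in NumPy follows. The function names and the tiny three-point example are made up for illustration (the Make Moons data would have $N = 200$ rows); `y` is assumed to be one-hot with shape $N \times C$ and `a` to hold strictly positive activations.

```python
import numpy as np

def cross_entropy(y, a):
    """Mean cross-entropy: -(1/N) * sum_n sum_i y[n, i] * log(a[n, i])."""
    N = y.shape[0]
    return -np.sum(y * np.log(a)) / N

def cross_entropy_grad_a(y, a):
    """N x C matrix with entries -(1/N) * y[n, i] / a[n, i]."""
    N = y.shape[0]
    return -y / (N * a)

# Hypothetical example: 3 data points, 2 classes, one-hot labels.
y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
a = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])

grad = cross_entropy_grad_a(y, a)
print(grad.shape)  # (3, 2)
```

Entries where $y_{n,i} = 0$ are zero, so only the activation of the true class contributes to each row. Note that this treats every entry of `a` as an independent variable; the softmax constraint is handled by factor #2.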



### factor #2
We are interested in the derivative of the activations $\mathbf{a}^{(2)}$ with respect to inputs $\mathbf{z}^{(2)}$.  Above, we were computing the derivative of a scalar with respect to a matrix.  Now we are computing the derivative of a vector with respect to a vector, so we write the Jacobian matrix as 


$$ \frac{\partial \mathbf{a}^{(2)}}{\partial \mathbf{z}^{(2)}} =
   \begin{bmatrix}
       \frac{\partial a^{(2)}_{0}}{\partial z^{(2)}_{0}} &    \frac{\partial a^{(2)}_{0}}{\partial z^{(2)}_{1}} \\
       \frac{\partial a^{(2)}_{1}}{\partial z^{(2)}_{0}} &    \frac{\partial a^{(2)}_{1}}{\partial z^{(2)}_{1}}
   \end{bmatrix} $$


For $ i,j \in C$,
$\frac{\partial a^{(2)}_{i}}{\partial z^{(2)}_{j}}  = \frac{\partial}{\partial z^{(2)}_{j}} \text{softmax}(\mathbf{z}^{(2)})_{i}$

$ = \frac{\partial}{\partial z^{(2)}_{j}} \frac{\exp(z^{(2)}_{i})}{\sum\limits_{k \in C}\exp(z^{(2)}_{k})} $

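Now, we could use the quotient rule for each of the four elements in the matrix.  

However, this is getting messy.  Let's try something different:  instead of building the full Jacobian first, we differentiate the loss with respect to $\mathbf{z}^{(2)}$ directly.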

### factors #1 and #2 together

We can ignore N datapoints for now, and consider only the sum over classes $k \in C$.

Let's compute the following:
$\frac{\partial L}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} = \frac{\partial L}{\partial \mathbf{z}^{(2)}} $

Expressing the output layer's activations element-wise:

$$\frac{\partial L}{\partial z_{i}} = - \sum\limits_{k \in C} y_{k} \frac{\partial \log a_{k}}{\partial z_{i}} = - \sum\limits_{k \in C} y_{k} \frac{1}{a_{k}} \frac{\partial  a_{k}}{\partial z_{i}}$$


Let's split the softmax derivative into two cases, depending on whether the activation index and the input index coincide ($i = j$) or differ ($i \neq j$):  

#### case 1:  $ i = j$.

$$ \frac{\partial  a_{i}}{\partial z_{i}}  = \frac{\partial}{\partial z_{i}} \frac{exp(z_{i})}{\sum\limits_{k \in C}exp(z_{k})}$$

By the quotient rule, we get:

$$ \frac{\big[\sum\limits_{k}\exp(z_{k})\big]\exp(z_{i}) - \exp(z_{i})\exp(z_{i}) }{ \big[{\sum\limits_{k}\exp(z_{k})}\big]^{2} } $$

$$ =  \frac{\big[\sum\limits_{k}\exp(z_{k})\big] - \exp(z_{i}) }{\sum\limits_{k}\exp(z_{k})}  \frac{\exp(z_{i})}{\sum\limits_{k}\exp(z_{k})} $$

This allows us to express the derivative of the softmax in terms of the softmax function itself:

$$ \frac{\partial  a_{i}}{\partial z_{i}} = [1 - \text{softmax}(\mathbf{z})_{i}] \, \text{softmax}(\mathbf{z})_{i}  = [1 - a_{i}]a_{i} $$

Reincorporating into the full derivative above, we get:

$$ - \sum\limits_{k = i} y_{k} \frac{1}{a_{i}} \frac{\partial  a_{i}}{\partial z_{i}}$$

$$ =  -  y_{i} \frac{1}{a_{i}} \frac{\partial  a_{i}}{\partial z_{i}}$$

$$ =  -  y_{i} \frac{1}{a_{i}} [1 - a_{i}]a_{i}$$

$$ = - y_{i} (1 - a_{i}) $$

for i = j.






#### case 2: $ i \neq j$.

$$ \frac{\partial  a_{i}}{\partial z_{j}}  = \frac{\partial}{\partial z_{j}} \frac{exp(z_{i})}{\sum\limits_{k \in C}exp(z_{k})}$$

Using the quotient rule again:

$$ \frac{0 \cdot \sum\limits_{k}\exp(z_{k}) - \exp(z_{i})\exp(z_{j}) }{ \big[{\sum\limits_{k}\exp(z_{k})}\big]^{2} }  = -\frac{\exp(z_{i})}{\sum\limits_{k}\exp(z_{k})} \frac{\exp(z_{j})}{\sum\limits_{k}\exp(z_{k})} = - \text{softmax}(\mathbf{z})_{i} \, \text{softmax}(\mathbf{z})_{j} = -a_{i}a_{j}$$

Incorporating this into the derivative above (with the roles of $i$ and $j$ exchanged, which is allowed because $-a_{i}a_{j}$ is symmetric in its indices),

$$ - \sum\limits_{j \neq i \in C} y_{j} \frac{1}{a_{j}} \frac{\partial  a_{j}}{\partial z_{i}} =  -\sum\limits_{j \neq i \in C} y_{j} \frac{1}{a_{j}} (-a_{j} a_{i}) =  \sum\limits_{j \neq i \in C} y_{j} a_{i} $$
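The two cases, $a_{i}(1 - a_{i})$ on the diagonal and $-a_{i}a_{j}$ off it, can be collected into a single Jacobian matrix, $\text{diag}(\mathbf{a}) - \mathbf{a}\mathbf{a}^{T}$. The sketch below builds it for one input vector and compares it with finite differences. The function names and test vector are illustrative, and subtracting the maximum inside `softmax` is a common numerical-stability step that is not part of the derivation.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector z."""
    # Shifting by max(z) leaves the result unchanged but avoids overflow
    # (an added stabilization, not part of the derivation).
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def softmax_jacobian(z):
    """J[i, j] = d a_i / d z_j = a_i * (delta_ij - a_j)."""
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

z = np.array([0.5, -1.2])  # arbitrary test input
J = softmax_jacobian(z)

# Central finite differences, one input coordinate (column) at a time.
h = 1e-6
numeric = np.zeros_like(J)
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = h
    numeric[:, j] = (softmax(z + step) - softmax(z - step)) / (2.0 * h)

print(np.max(np.abs(J - numeric)))
```

Each column of `J` sums to zero, reflecting the fact that the softmax outputs always sum to 1.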

#### combining two cases:

$$ \frac{\partial L}{\partial z_{i}} = - y_{i} (1 - a_{i}) + \sum\limits_{j \neq i \in C} y_{j} a_{i} $$
$$ =  - y_{i} +  y_{i}a_{i} + \sum\limits_{j \neq i \in C} y_{j} a_{i} $$

$$ = -y_{i} + \sum\limits_{j \in C} y_{j}a_{i} $$

Rearranging, and then using the fact that $\mathbf{y}$ is a one-hot vector (so its entries sum to 1), we get

$$ = a_{i}\sum\limits_{j \in C}y_{j} - y_{i}  $$

$$ = a_{i} \cdot 1 - y_{i}   = a_{i} - y_{i} $$

So the **gradient** of the loss function with respect to a particular $z_{i}$ is simply the difference between the activation value $a_{i}$ at that output neuron and the true class label $y_{i}$.
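The result $a_{i} - y_{i}$ is easy to verify numerically for a single data point. In this sketch the logits `z`, the one-hot label `y`, and the function names are arbitrary illustrative choices.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector z (shifted by max(z) for numerical stability)."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, y):
    """Cross-entropy of softmax(z) against a one-hot label y, for one data point."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.3, -0.8])  # arbitrary logits
y = np.array([0.0, 1.0])   # one-hot label: the true class is 1

# Analytic gradient from the derivation above.
analytic = softmax(z) - y

# Central finite differences over each logit.
h = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = h
    numeric[i] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2.0 * h)

print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places, and the analytic gradient sums to zero because both $\mathbf{a}$ and $\mathbf{y}$ sum to 1.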


Composing the functions, so that we compute the cross-entropy of the softmax before taking the derivative, we get:

$$\frac{\partial L}{\partial \mathbf{z}^{(2)}} = -\frac{1}{N}\frac{\partial}{\partial \mathbf{z}^{(2)}} \sum\limits_{n \in N} \sum\limits_{k \in C} y_{n,k} \log \big(\text{softmax}(\mathbf{z}^{(2)}_{n})_{k}\big)$$

$$ = -\frac{1}{N}\frac{\partial}{\partial \mathbf{z}^{(2)}} \big[ y_{0,0} \log(\text{softmax}(\mathbf{z}^{(2)}_{0})_{0}) + y_{0,1} \log(\text{softmax}(\mathbf{z}^{(2)}_{0})_{1}) +
y_{1,0} \log(\text{softmax}(\mathbf{z}^{(2)}_{1})_{0}) + y_{1,1} \log(\text{softmax}(\mathbf{z}^{(2)}_{1})_{1}) + \dots \big]$$

Here $\mathbf{z}^{(2)}_{n}$ is the vector of output-layer inputs for data point $n$. Using the definition of the softmax, we get a nice cancellation of logs and exponents, because $\log \text{softmax}(\mathbf{z}^{(2)}_{n})_{k} = z^{(2)}_{n,k} - \log \sum\limits_{j \in C}\exp(z^{(2)}_{n,j})$:

$$ = -\frac{1}{N}\frac{\partial}{\partial \mathbf{z}^{(2)}} \sum\limits_{n \in N} \sum\limits_{k \in C} y_{n,k} \bigg[ z^{(2)}_{n,k} - \log \sum\limits_{j \in C}\exp(z^{(2)}_{n,j}) \bigg]$$

Factoring out the log-sum terms (each one-hot row $\mathbf{y}_{n}$ sums to 1), and realizing that we have a dot product left over, we get:

$$ =  \frac{1}{N}\frac{\partial}{\partial \mathbf{z}^{(2)}} \sum\limits_{n \in N} \bigg[ \log \sum\limits_{j \in C}\exp(z^{(2)}_{n,j}) - \mathbf{z}^{(2)}_{n} \cdot \mathbf{y}_{n} \bigg] $$

Differentiating with respect to a single entry $z^{(2)}_{n,i}$ gives $\frac{1}{N}\big(a^{(2)}_{n,i} - y_{n,i}\big)$, in agreement with the element-wise result above.
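To illustrate the log-sum-exp form, the sketch below evaluates the loss both ways for a small random batch, alongside the batch gradient $(\mathbf{a} - \mathbf{y})/N$ that follows from the element-wise result. The array names, shapes, and random data are assumptions for illustration only; rows index data points and columns index classes.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax of an N x C matrix (shifted by the row max for stability)."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss_direct(Z, Y):
    """-(1/N) * sum_n sum_k Y[n, k] * log(softmax(Z[n])[k])."""
    return -np.sum(Y * np.log(softmax_rows(Z))) / Z.shape[0]

def loss_logsumexp(Z, Y):
    """(1/N) * sum_n [ log sum_j exp(Z[n, j]) - Z[n] . Y[n] ]."""
    lse = np.log(np.sum(np.exp(Z), axis=1))
    return np.sum(lse - np.sum(Z * Y, axis=1)) / Z.shape[0]

def grad_Z(Z, Y):
    """Gradient of the mean loss with respect to Z: (softmax(Z) - Y) / N."""
    return (softmax_rows(Z) - Y) / Z.shape[0]

# Hypothetical batch: 4 data points, 2 classes.
rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 2))
Y = np.eye(2)[[0, 1, 1, 0]]  # one-hot rows for labels 0, 1, 1, 0

print(loss_direct(Z, Y), loss_logsumexp(Z, Y))
print(grad_Z(Z, Y).shape)  # (4, 2)
```

For large logits, `np.exp(Z)` in `loss_logsumexp` can overflow; subtracting the row maximum before exponentiating and adding it back afterwards is the usual remedy.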





### factor #3

### putting it together:

$N \times C$, $C \times $
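As a hedged sketch of how the factors might combine, suppose the output layer computes $\mathbf{z}^{(2)} = \mathbf{a}^{(1)} W^{(2)} + \mathbf{b}^{(2)}$ with $\mathbf{a}^{(1)}$ of shape $N \times H$. This layer definition, the hidden width `H`, and all names below are assumptions, since this section does not state them. Under that assumption the gradient with respect to $W^{(2)}$ would be $\mathbf{a}^{(1)T}(\mathbf{a}^{(2)} - \mathbf{y})/N$.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax of an N x C matrix (shifted by the row max for stability)."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss(W2, b2, A1, Y):
    """Mean cross-entropy of softmax(A1 @ W2 + b2); the layer form is an assumption."""
    A2 = softmax_rows(A1 @ W2 + b2)
    return -np.sum(Y * np.log(A2)) / A1.shape[0]

def grad_W2(W2, b2, A1, Y):
    """dL/dW2 under the assumed layer z2 = A1 @ W2 + b2."""
    N = A1.shape[0]
    A2 = softmax_rows(A1 @ W2 + b2)
    dZ2 = (A2 - Y) / N   # factors #1 and #2 combined, shape (N, C)
    return A1.T @ dZ2    # (H, N) @ (N, C) -> (H, C), same shape as W2

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(1)
N, H, C = 5, 3, 2
A1 = rng.normal(size=(N, H))
W2 = rng.normal(size=(H, C))
b2 = np.zeros(C)
Y = np.eye(C)[rng.integers(0, C, size=N)]

dW2 = grad_W2(W2, b2, A1, Y)
print(dW2.shape)  # (3, 2)
```

The matrix product sums the per-data-point contributions over $N$, which is why the result has the same shape as $W^{(2)}$.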


## Derivative 2:

$ \frac{\partial L}{\partial b^{(2)}}$.

## Derivative 3:

$ \frac{\partial L}{\partial W^{(1)}}$.

## Derivative 4:

$ \frac{\partial L}{\partial b^{(1)}}$.