<center>
    <tr>
    <td><img src="images/Quansight_Logo_Lockup_1.png" width="25%"></img></td>
    </tr>
</center>

---
# Mathematics of Deep Neural Networks
---

## Outline

- A layered view of deep networks
- Recursive nature of deep networks
- Backpropagation

## Logistic regression and layers

Let's look at how logistic regression can be specified as layers.  The ability to specify such models as layers is key to designing deep neural networks.  We will also discuss *backpropagation*.

## Example: two-class softmax classifier

Let's consider a binary classification problem.  This problem is often solved with a sigmoid function: given the probability $p$ of one class, the probability of the other class is $1-p$.  For this discussion, however, we employ softmax and explicitly model the two probabilities.  The discussion extends easily to classification problems with more than two classes.

The data consists of pairs $(\mathbf{x}^{(i)}, y^{(i)})$, where $i \in \{1, \ldots, N\}$, $\mathbf{x}^{(i)} \in \mathbb{R}^M$ is the $i$th feature vector, and $y^{(i)} \in \{0, 1\}$ is the associated class label.

<center>
    <tr>
        <td>
            <img src="images/two-class-linear.png" width=35%>
        </td>
    </tr>
</center>

### Negative log likelihood

We use the negative log likelihood cost to train this classifier:

$$
\newcommand{\bx}{\mathbf{x}}
$$

$$
\small
\begin{split}
l(\theta) = - \sum_{i=1}^N \left[ \mathbb{I}_0 (y^{(i)}) \log 
\frac{e^{\bx^{(i)^T} \theta_1}}{e^{\bx^{(i)^T} \theta_1} + {e^{\bx^{(i)^T} \theta_2}}}
+
\mathbb{I}_1 (y^{(i)}) \log 
\frac{e^{\bx^{(i)^T} \theta_2}}{e^{\bx^{(i)^T} \theta_1} + {e^{\bx^{(i)^T} \theta_2}}}
\right]
\end{split}
$$

We define the cost $C(\theta)$, which we want to minimize, as the negative log likelihood $l(\theta)$.
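The cost above can be evaluated numerically. The following is a minimal NumPy sketch, not a reference implementation. It assumes that labels are the integers 0 and 1, that class 0 pairs with $\theta_1$ and class 1 with $\theta_2$ (as the indicator functions suggest), and that a constant 1 is appended to each feature vector so the weights have $M+1$ entries. The function name `nll` and the toy data are hypothetical.

```python
import numpy as np

def nll(theta1, theta2, X, y):
    """Negative log likelihood of a two-class softmax classifier.

    X: (N, M+1) features with a trailing 1 for the bias (an assumption).
    y: (N,) integer labels in {0, 1}; class 0 uses theta1, class 1 uses theta2.
    """
    scores = np.stack([X @ theta1, X @ theta2], axis=1)      # shape (N, 2)
    # Log softmax, made numerically stable by subtracting the row-wise max.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # The indicator functions pick the log-probability of the true class.
    return -log_probs[np.arange(len(y)), y].sum()

# Hypothetical toy data: 5 examples with M = 3 features plus the bias input.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(5, 3)), np.ones((5, 1))])
y = np.array([0, 1, 1, 0, 1])
theta1 = np.zeros(4)
theta2 = np.zeros(4)
loss = nll(theta1, theta2, X, y)
```

With all weights at zero both classes get probability 1/2, so the cost should equal $N \log 2$, which makes a convenient sanity check.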

### Layer representation

Let's represent the above network as layers.  We will treat the cost as a layer as well.

<center>
    <tr>
        <td>
            <img src="images/softmax-2classes-layers.png" width=55%>
        </td>
    </tr>
</center>

To train this classifier, we need to estimate values for the weights $\theta_1$ and $\theta_2$ ($\theta_1, \theta_2 \in \mathbb{R}^{M+1}$).  To do so, we need to compute $\frac{\partial C(\theta)}{\partial \theta_1}$ and $\frac{\partial C(\theta)}{\partial \theta_2}$.  Note that we set $z^4 = C(\theta)$, so we are interested in $\frac{\partial z^4}{\partial \theta_1}$ and $\frac{\partial z^4}{\partial \theta_2}$.

## Chain rule

A formula for computing derivatives of composite functions (Gottfried Wilhelm Leibniz, 1676).

$$
\newcommand{\diff}[2]
{
\frac{\partial {#1}}{\partial {#2}}
}
$$

$$
\frac{\partial f(g(u,v),h(u,v))}{\partial u} = \diff{f}{g} \diff{g}{u} + \diff{f}{h} \diff{h}{u}
$$

<center>
    <tr>
        <td>
            <img src="images/chain-rule.png" width=25%>
        </td>
    </tr>
</center>
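The chain rule formula above can be checked numerically. The sketch below uses hypothetical example functions `f`, `g`, and `h`, chosen only for illustration, and compares the chain rule result with a central finite difference.

```python
import math

# Hypothetical example functions, chosen only for illustration.
def g(u, v):
    return u * v

def h(u, v):
    return u + v ** 2

def f(a, b):
    return math.sin(a) * b

def composite(u, v):
    return f(g(u, v), h(u, v))

u, v = 0.7, -1.3
a, b = g(u, v), h(u, v)

# Chain rule: df/du = df/dg * dg/du + df/dh * dh/du
df_dg = math.cos(a) * b
df_dh = math.sin(a)
dg_du = v
dh_du = 1.0
analytic = df_dg * dg_du + df_dh * dh_du

# Central finite difference in u, holding v fixed.
eps = 1e-6
numeric = (composite(u + eps, v) - composite(u - eps, v)) / (2 * eps)
```

The two values should agree up to the finite difference error.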


### Computing derivatives of loss with respect to weights (i.e., network parameters)

We can use the *chain rule* to compute $\frac{\partial z^4}{\partial \theta_1}$ and $\frac{\partial z^4}{\partial \theta_2}$.  

<center>
    <tr>
        <td>
            <img src="images/softmax-2classes-layers.png" width=55%>
        </td>
    </tr>
</center>

$$
\newcommand{\diff}[2]
{
\frac{\partial {#1}}{\partial {#2}}
}
$$

$$
\begin{align}
\diff{z^4}{\theta_1} &= \diff{z^4}{z_1^3} \diff{z_1^3}{\theta_1} + \diff{z^4}{z_2^3} \diff{z_2^3}{\theta_1} \\
&= \diff{z^4}{z_1^3} \left( \diff{z_1^3}{z_1^2} \diff{z_1^2}{\theta_1} +  \diff{z_1^3}{z_2^2} \diff{z_2^2}{\theta_1}  \right) \\ 
&+ \diff{z^4}{z_2^3} \left( \diff{z_2^3}{z_1^2} \diff{z_1^2}{\theta_1} +  \diff{z_2^3}{z_2^2} \diff{z_2^2}{\theta_1}  \right)
\end{align}
$$

<!-- <center>
    <tr>
        <td>
            <img src="images/chain-rule-z4.png" width=35%>
        </td>
    </tr>
</center>
 -->
We can similarly compute $\frac{\partial z^4}{\partial \theta_2}$.

Recall that $z^4 = C(\theta)$, so we can minimize $C(\theta)$ by gradient descent using the gradients computed above.
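To make the expansion concrete, the sketch below evaluates each factor of the chain rule for a single example and compares the result with a finite difference. It assumes the layers are $z^2_k = \theta_k^T \mathbf{x}$, $z^3 = \text{log softmax}(z^2)$, and $z^4 = -z^3_y$ for the true class $y$. All numeric values are hypothetical.

```python
import numpy as np

def cost(theta1, theta2, x, y):
    z2 = np.array([theta1 @ x, theta2 @ x])      # linear layer
    z3 = z2 - np.log(np.exp(z2).sum())           # log softmax
    return -z3[y]                                # negative log likelihood

# Hypothetical single example; the last entry of x is the bias input.
x = np.array([0.5, -1.0, 2.0, 1.0])
y = 0
theta1 = np.array([0.1, 0.2, -0.3, 0.0])
theta2 = np.array([-0.2, 0.4, 0.1, 0.5])

z2 = np.array([theta1 @ x, theta2 @ x])
p = np.exp(z2) / np.exp(z2).sum()                # softmax probabilities

dz4_dz3 = -np.eye(2)[y]                          # dz^4/dz^3_k = -[k == y]
dz3_dz2 = np.eye(2) - p                          # [k, i] = [k == i] - p_i
dz2_dtheta1 = np.stack([x, np.zeros_like(x)])    # z^2_2 does not depend on theta_1

# Sum over both paths through z^3_1 and z^3_2, as in the expansion above.
grad = np.zeros_like(theta1)
for k in range(2):
    grad += dz4_dz3[k] * (dz3_dz2[k, 0] * dz2_dtheta1[0]
                          + dz3_dz2[k, 1] * dz2_dtheta1[1])

# Central finite differences, one coordinate of theta_1 at a time.
eps = 1e-6
numeric = np.array([
    (cost(theta1 + eps * e, theta2, x, y)
     - cost(theta1 - eps * e, theta2, x, y)) / (2 * eps)
    for e in np.eye(4)
])
```

Under these assumptions the gradient collapses to $(p_1 - 1)\,\mathbf{x}$ when the true class is the first one, which is the familiar softmax regression gradient.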

## Backpropagation

<center>
    <tr>
        <td>
            <img src="images/softmax-2classes-layers2.png" width=45%>
        </td>
    </tr>
</center>

### Forward pass

$z^1 = f(\mathbf{x})$ (*input data*)\
$z^2 = f(z^1)$ (*linear function*)\
$z^3 = f(z^2)$ (*log softmax*)\
$z^4 = f(z^3) = C(\theta)$ (*negative log likelihood*, cost)
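The forward pass can be sketched for a single example as a sequence of function calls, one per layer. The function names, the bias handling (a 1 appended to $\mathbf{x}$), and the numeric values are assumptions made for this illustration.

```python
import numpy as np

def linear(z1, theta1, theta2):
    # Layer 1: z^2_k = theta_k^T z^1
    return np.array([theta1 @ z1, theta2 @ z1])

def log_softmax(z2):
    # Layer 2: z^3_k = z^2_k - log(sum_j exp(z^2_j)), shifted for stability
    shifted = z2 - z2.max()
    return shifted - np.log(np.exp(shifted).sum())

def nll(z3, y):
    # Layer 3: z^4 = -z^3_y, the negative log-probability of the true class
    return -z3[y]

# Hypothetical example and weights.
x = np.array([0.5, -1.0, 2.0])
y = 1
theta1 = np.array([0.1, 0.2, -0.3, 0.0])
theta2 = np.array([-0.2, 0.4, 0.1, 0.5])

z1 = np.append(x, 1.0)          # input data with a bias input appended
z2 = linear(z1, theta1, theta2)
z3 = log_softmax(z2)
z4 = nll(z3, y)                 # the cost C(theta) for this example
```

Each intermediate value is kept because the backward pass needs it to evaluate the layer derivatives.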

### Backward pass

$\delta^l = \diff{C(\theta)}{z^l}$

### Computing $\delta^l$

$$
\begin{split}
\delta^4 &= \diff{C({\theta})}{z^4} = \diff{z^4}{z^4} = 1 \\
\delta^3_1 &= \diff{C(\theta)}{z^3_1} = \diff{C(\theta)}{z^4} \diff{z^4}{z^3_1} 
= \delta^4 \diff{z^4}{z^3_1} \\
\delta^3_2 &= \diff{C(\theta)}{z^3_2} = \diff{C(\theta)}{z^4} \diff{z^4}{z^3_2} 
= \delta^4 \diff{z^4}{z^3_2} \\
\delta^2_1 &= \diff{C(\theta)}{z^2_1} = \sum_k \diff{C(\theta)}{z^3_k} \diff{z^3_k}{z^2_1} = \sum_k \delta^3_k \diff{z^3_k}{z^2_1} \\
\delta^2_2 &= \diff{C(\theta)}{z^2_2} = \sum_k \diff{C(\theta)}{z^3_k} \diff{z^3_k}{z^2_2} = \sum_k \delta^3_k \diff{z^3_k}{z^2_2}
\end{split} 
$$
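The recursion for the $\delta$ values can be written out in a few lines of NumPy. This is a sketch for a single example, assuming $z^3$ is the log softmax of $z^2$ and $z^4 = -z^3_y$. The values of `z2` and `y` are hypothetical.

```python
import numpy as np

z2 = np.array([-0.75, 0.2])            # hypothetical linear-layer outputs
y = 1                                  # hypothetical true class (0-based index)

z3 = z2 - np.log(np.exp(z2).sum())     # log softmax
p = np.exp(z3)                         # softmax probabilities

delta4 = 1.0                           # dC/dz^4 with z^4 = C
dz4_dz3 = -np.eye(2)[y]                # dz^4/dz^3_k = -[k == y]
delta3 = delta4 * dz4_dz3

# Jacobian of log softmax: J[k, i] = dz^3_k/dz^2_i = [k == i] - p_i
J = np.eye(2) - p[np.newaxis, :]
delta2 = J.T @ delta3                  # delta^2_i = sum_k delta^3_k * J[k, i]
```

For this choice of layers, `delta2` reduces to the softmax probabilities minus the one-hot encoding of the label.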

## For any differentiable layer $l$

For a given layer $l$ with inputs $z_i^l$ and outputs $z_k^{l+1}$,
$$
\delta^l_i = \sum_k \delta^{l+1}_k \diff{z^{l+1}_k}{z^l_i}
$$

Similarly, for layer $l$ that depends upon parameters $\theta^l$,
$$
\diff{C(\theta)}{\theta^l} = \sum_k \diff{C(\theta)}{z^{l+1}_k} \diff{z^{l+1}_k}{\theta^l} = \sum_k \delta^{l+1}_k \diff{z^{l+1}_k}{\theta^l}
$$
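Both rules can be packaged in a layer object that caches its input on the forward pass and returns $\delta^l$ on the backward pass. The `Linear` class below is a hypothetical illustration for a layer $z^{l+1} = W z^l$, not an API from any library.

```python
import numpy as np

class Linear:
    """Hypothetical layer z^{l+1} = W z^l, with the entries of W as parameters."""

    def __init__(self, W):
        self.W = W
        self.z = None
        self.grad_W = None

    def forward(self, z):
        self.z = z                     # cache the input for the backward pass
        return self.W @ z

    def backward(self, delta_out):
        # dz^{l+1}_k / dz^l_i = W[k, i], so delta^l = W^T delta^{l+1}
        delta_in = self.W.T @ delta_out
        # dz^{l+1}_k / dW[k, i] = z^l_i, so dC/dW = outer(delta^{l+1}, z^l)
        self.grad_W = np.outer(delta_out, self.z)
        return delta_in

layer = Linear(np.array([[1.0, 2.0, 3.0],
                         [4.0, 5.0, 6.0]]))
out = layer.forward(np.array([1.0, 0.0, -1.0]))
delta_in = layer.backward(np.array([1.0, -1.0]))
```

Any layer that exposes the same `forward` and `backward` pair can be chained in this way, whether or not it has parameters.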

### For layer 1 in our two-class softmax classifier

In our two-class softmax classifier, only layer 1 has parameters ($\theta_1$ and $\theta_2$).

<center>
    <tr>
        <td>
            <img src="images/layer-l.png" width=25%>
        </td>
    </tr>
</center>
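As a worked instance of the parameter rule, assume layer 1 computes $z^2_k = \theta_k^T z^1$ for $k = 1, 2$. This assumption is consistent with the linear scores in the negative log likelihood, but the figure does not spell it out. Each output then depends on only one of the two weight vectors, so the sum over outputs collapses to a single term:

$$
\diff{z^2_j}{\theta_k} =
\begin{cases}
z^1 & j = k \\
\mathbf{0} & j \neq k
\end{cases}
\quad \Longrightarrow \quad
\diff{C(\theta)}{\theta_k} = \sum_j \delta^2_j \diff{z^2_j}{\theta_k} = \delta^2_k \, z^1
$$

Under this assumption, the gradient for each weight vector is the layer input scaled by the corresponding $\delta^2_k$.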

## Layered architectures

As long as we have differentiable layers, i.e., we can compute $\diff{z^{l+1}_k}{z^{l}_i}$,
we can use *backpropagation* to update the parameters $\theta$ to minimize the cost $C(\theta)$.

<center>
    <tr>
        <td>
            <img src="images/layer-architecture.png" width=55%>
        </td>
    </tr>
</center>

## Backpropagation algorithm

- Set $z^1$ equal to the input $\mathbf{x}$.
- Forward pass: compute the activations $z^2, z^3, \ldots$ of layers $1, 2, \ldots$
- Set $\delta$ at the last layer equal to 1.
- Backward pass: backpropagate the $\delta$s all the way to the first layer.
- Update $\theta$ using the resulting gradients.
- Repeat
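The steps above can be sketched as a complete training loop for the two-class softmax classifier. This is a minimal illustration under stated assumptions: the toy data, the learning rate `lr`, the number of steps, and the batch formulation (all $N$ examples processed at once, with the cost summed over examples) are choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 40
y = np.repeat([0, 1], N // 2)
# Hypothetical toy data: two Gaussian blobs centred at (-1, -1) and (+1, +1).
X = rng.normal(scale=0.5, size=(N, 2)) + np.where(y[:, None] == 0, -1.0, 1.0)
Z1 = np.hstack([X, np.ones((N, 1))])       # z^1: inputs with a bias input appended

theta = np.zeros((2, 3))                   # rows are theta_1 and theta_2
lr = 0.01                                  # assumed learning rate
losses = []

for step in range(200):
    # Forward pass: linear layer, log softmax, negative log likelihood.
    Z2 = Z1 @ theta.T
    Z3 = Z2 - np.log(np.exp(Z2).sum(axis=1, keepdims=True))
    z4 = -Z3[np.arange(N), y].sum()
    losses.append(z4)

    # Backward pass: start with delta = 1 at the last layer.
    delta4 = 1.0
    delta3 = np.zeros_like(Z3)
    delta3[np.arange(N), y] = -delta4      # dz^4/dz^3_k = -[k == y]
    P = np.exp(Z3)                         # softmax probabilities
    # delta^2_i = sum_k delta^3_k * ([k == i] - p_i)
    delta2 = delta3 - P * delta3.sum(axis=1, keepdims=True)

    # Parameter gradient of layer 1, summed over examples, then the update.
    grad_theta = delta2.T @ Z1
    theta -= lr * grad_theta
```

The recorded `losses` start at $N \log 2$ for zero weights and should decrease as training proceeds, given a sufficiently small learning rate.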

## Summary

- Layered view of logistic regression and softmax
- Backpropagation
- Deep networks are recursive: we can treat a deep network as a layer in another network (this is extremely powerful)
