## Derivatives of Loss Function

Let's pick a timestep $t$ and consider its contribution to the loss, which as discussed above is $E_t$. We would like to calculate the derivatives of $E_t$ with respect to the parameters $U$, $W$, $V$, $b_h$ and $b_y$.

### With respect to V and $b_y$

Using the chain rule we write (note that $k$ and $l$ are summed-over vector indices, and we have introduced the variable $q = V h_t + b_y$)
$$
\frac{\partial E_t}{\partial V_{ij}} = \frac{\partial E_t}{ \partial \hat{y}_{t_k}} \frac{\partial \hat{y}_{t_k}}{\partial q_l} \frac{\partial q_l}{\partial V_{ij}}
$$
The first factor above is just $-y_{t_k} / \hat{y}_{t_k}$, while the middle factor is the derivative of the softmax function, which evaluates to
$$
\frac{\partial \hat{y}_{t_k}}{\partial q_l} = \hat{y}_{t_k} (\delta_{kl} - \hat{y}_{t_l}).
$$
Putting these together, summing over $k$, and using $\sum_k y_{t_k} = 1$, we get
$$
\frac{\partial E_t}{ \partial \hat{y}_{t_k}} \frac{\partial \hat{y}_{t_k}}{\partial q_l} = (\hat{y}_t - y_t)_l
$$
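
As a quick sanity check on this simplification, the sketch below compares three versions of the gradient with respect to $q$: the explicit chain-rule product, the closed form $\hat{y}_t - y_t$, and a finite-difference estimate. It assumes the cross-entropy loss $E_t = -\sum_k y_{t_k} \log \hat{y}_{t_k}$ with a one-hot target; the helper names (`softmax`, `cross_entropy`) and the vector size are illustrative choices rather than part of the derivation.

```python
import numpy as np

def softmax(q):
    # Subtracting the max improves numerical stability and leaves the result unchanged.
    e = np.exp(q - np.max(q))
    return e / e.sum()

def cross_entropy(q, y):
    # E = -sum_k y_k log(softmax(q)_k)
    return -np.sum(y * np.log(softmax(q)))

rng = np.random.default_rng(0)
n_out = 5                        # illustrative output size
q = rng.normal(size=n_out)
y = np.zeros(n_out)
y[2] = 1.0                       # one-hot target (assumed)

y_hat = softmax(q)

# Chain rule, factor by factor: dE/dy_hat_k times the softmax Jacobian.
dE_dyhat = -y / y_hat
jacobian = np.diag(y_hat) - np.outer(y_hat, y_hat)  # J[k, l] = y_hat_k (delta_kl - y_hat_l)
grad_chain = dE_dyhat @ jacobian                     # sums over k

# Closed form after simplification.
grad_closed = y_hat - y

# Central finite differences as an independent check.
eps = 1e-6
basis = np.eye(n_out)
grad_fd = np.array([
    (cross_entropy(q + eps * basis[l], y) - cross_entropy(q - eps * basis[l], y)) / (2 * eps)
    for l in range(n_out)
])

print(np.allclose(grad_chain, grad_closed), np.allclose(grad_fd, grad_closed, atol=1e-6))
```

All three vectors should agree up to floating-point and finite-difference error.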

The last factor in the chain rule, $\partial q_l / \partial V_{ij}$, is simply
$$
\frac{\partial q_l}{\partial V_{ij}} = \frac{\partial}{\partial V_{ij}} (V_{lm} {h_t}_m + {b_y}_l) = \delta_{il} {h_t}_j.
$$
Putting these pieces together gives us
$$
\frac{\partial E_t}{\partial V_{ij}} = ({\hat{y}_t}_i - {y_t}_i) {h_t}_j.
$$

In a very similar way we find
$$
\frac{\partial E_t}{\partial b_{y_i}} = {\hat{y}_t}_i - {y_t}_i.
$$
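
The two results above say that the gradient with respect to $V$ is the outer product of $\hat{y}_t - y_t$ with $h_t$, and that the gradient with respect to $b_y$ is $\hat{y}_t - y_t$ itself. The following minimal sketch checks both against finite differences. It assumes a cross-entropy loss on a softmax output; the sizes, the random values, and the function name `loss` are made up for illustration.

```python
import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / e.sum()

def loss(V, b_y, h, y):
    # E_t = -sum_k y_k log(y_hat_k), with y_hat = softmax(V h + b_y)
    return -np.sum(y * np.log(softmax(V @ h + b_y)))

rng = np.random.default_rng(1)
n_hidden, n_out = 4, 3           # illustrative sizes
V = rng.normal(size=(n_out, n_hidden))
b_y = rng.normal(size=n_out)
h = np.tanh(rng.normal(size=n_hidden))   # stand-in for the hidden state h_t
y = np.zeros(n_out)
y[1] = 1.0                       # one-hot target (assumed)

# Analytic gradients from the derivation.
y_hat = softmax(V @ h + b_y)
dV = np.outer(y_hat - y, h)      # dV[i, j] = (y_hat_i - y_i) h_j
db_y = y_hat - y

# Central finite differences, one parameter at a time.
eps = 1e-6
dV_fd = np.zeros_like(V)
for i in range(n_out):
    for j in range(n_hidden):
        V_plus, V_minus = V.copy(), V.copy()
        V_plus[i, j] += eps
        V_minus[i, j] -= eps
        dV_fd[i, j] = (loss(V_plus, b_y, h, y) - loss(V_minus, b_y, h, y)) / (2 * eps)

db_fd = np.zeros_like(b_y)
for i in range(n_out):
    b_plus, b_minus = b_y.copy(), b_y.copy()
    b_plus[i] += eps
    b_minus[i] -= eps
    db_fd[i] = (loss(V, b_plus, h, y) - loss(V, b_minus, h, y)) / (2 * eps)

print(np.allclose(dV, dV_fd, atol=1e-6), np.allclose(db_y, db_fd, atol=1e-6))
```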

### With respect to W and $b_h$

Taking the derivative with respect to $W$ is complicated by the fact that $E_t$ depends on $W$ in two ways: directly, via the explicit $W$ term in $h_t$, and indirectly, through $h_t$'s dependence on $h_{t-1}$, which itself depends on $W$ and, in turn, on the hidden-layer activations at all timesteps before $t$.

We start by writing
$$
\begin{align}
\frac{\partial E_t}{\partial W_{ij}} &= \frac{\partial E_t}{ \partial \hat{y}_{t_k}} \frac{\partial \hat{y}_{t_k}}{\partial q_l} \frac{\partial q_l}{\partial h_{t_m}} \frac{\partial h_{t_m}}{\partial W_{ij}} \\
\end{align}
$$

We have already calculated the first two factors above: they simplify to $({\hat{y}_t}_l - {y_t}_l)$, and the third factor is just $V_{lm}$. The fun starts when we consider the last factor, the derivative of the activation $h_t$ with respect to $W$.
We define $p_t = U x_t + W h_{t-1} + b_h$, so that $h_t = \tanh(p_t)$, and write

$$
\frac{\partial h_{t_m}}{\partial W_{ij}} = \frac{\partial h_{t_m}}{\partial p_{t_k}} \frac{\partial p_{t_k}}{\partial W_{ij}}.
$$
The first term is the derivative of the tanh, which is $1-\textrm{tanh}^2$. We express this as
$$
\frac{\partial h_{t_m}}{\partial p_{t_k}} = T^t_{mk}
$$
where $T^t$ is the diagonal matrix with entries $T^t_{mk} = \delta_{mk} (1 - \tanh^2 p_{t_m}) = \delta_{mk} (1 - h_{t_m}^2)$ (the superscript $t$ labels the timestep and is not a power). The second factor is
$$
\begin{align}
\frac{\partial p_{t_k}}{\partial W_{ij}} &= \frac{\partial}{\partial W_{ij}} ( W_{kl} {h_{t-1}}_l )\\
&= \delta_{ik} {h_{t-1}}_j + W_{kl} \frac{\partial {h_{t-1}}_l}{\partial W_{ij}},
\end{align}
$$
so that

$$
\begin{align}
\frac{\partial h_{t_m}}{\partial W_{ij}} &= T^t_{mi} {h_{t-1}}_j + T^t_{mk} W_{kl} \frac{\partial {h_{t-1}}_l}{\partial W_{ij}} \\
&= T^t_{mi} {h_{t-1}}_j + T^t_{mk} W_{kl} \big( T^{t-1}_{li} {h_{t-2}}_j + T^{t-1}_{ln} W_{np} \frac{\partial {h_{t-2}}_p}{\partial W_{ij}} \big)\\
&= T^t_{mi} {h_{t-1}}_j + (T^t W T^{t-1})_{mi} {h_{t-2}}_j + (T^t W T^{t-1}W)_{mk} \big( T^{t-2}_{ki} {h_{t-3}}_j + T^{t-2}_{kn} W_{np} \frac{\partial {h_{t-3}}_p}{\partial W_{ij}} \big) \\
&= T^t_{mi} {h_{t-1}}_j + (T^t W T^{t-1})_{mi} {h_{t-2}}_j + (T^t W T^{t-1}W T^{t-2})_{mi} {h_{t-3}}_j + \cdots
\end{align}
$$

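We see that the derivative receives a contribution from every earlier timestep: the term involving $h_{t-s}$ carries a product of $s$ factors of $T$ and $s-1$ factors of $W$.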
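
The expansion above can be turned into a short numerical sketch. The code below accumulates the terms one timestep at a time, building the matrix products $T^t$, $T^t W T^{t-1}$, and so on explicitly so that it mirrors the formula, and then compares the result with a finite-difference gradient. Several things here are assumptions made for illustration: the initial hidden state is taken to be zero, the loss is the cross-entropy at the final timestep only, and the function names (`forward`, `grad_W`, `loss_at_last_step`) and all sizes are hypothetical.

```python
import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / e.sum()

def forward(U, W, b_h, xs):
    # Returns [h_0, h_1, ..., h_t], with h_0 = 0 (an assumed initial state).
    hs = [np.zeros(W.shape[0])]
    for x in xs:
        hs.append(np.tanh(U @ x + W @ hs[-1] + b_h))
    return hs

def loss_at_last_step(U, W, V, b_h, b_y, xs, y):
    h_t = forward(U, W, b_h, xs)[-1]
    return -np.sum(y * np.log(softmax(V @ h_t + b_y)))

def grad_W(U, W, V, b_h, b_y, xs, y):
    hs = forward(U, W, b_h, xs)
    t = len(xs)
    y_hat = softmax(V @ hs[t] + b_y)
    g = (y_hat - y) @ V                  # g_m = (y_hat - y)_l V_lm
    dW = np.zeros_like(W)
    A = np.diag(1.0 - hs[t] ** 2)        # starts as T^t
    for s in range(t, 0, -1):
        # Contribution g_m (T^t W T^{t-1} ... T^s)_{mi} h_{s-1, j}.
        dW += np.outer(g @ A, hs[s - 1])
        # Extend the product by W T^{s-1} for the next (earlier) term.
        A = A @ W @ np.diag(1.0 - hs[s - 1] ** 2)
    return dW

rng = np.random.default_rng(2)
n_in, n_hidden, n_out, t = 3, 4, 2, 5    # illustrative sizes
U = rng.normal(scale=0.5, size=(n_hidden, n_in))
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.5, size=(n_out, n_hidden))
b_h = rng.normal(scale=0.1, size=n_hidden)
b_y = rng.normal(scale=0.1, size=n_out)
xs = [rng.normal(size=n_in) for _ in range(t)]
y = np.array([1.0, 0.0])                 # one-hot target (assumed)

dW = grad_W(U, W, V, b_h, b_y, xs, y)

# Central finite differences over every entry of W.
eps = 1e-6
dW_fd = np.zeros_like(W)
for i in range(n_hidden):
    for j in range(n_hidden):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW_fd[i, j] = (loss_at_last_step(U, W_plus, V, b_h, b_y, xs, y)
                       - loss_at_last_step(U, W_minus, V, b_h, b_y, xs, y)) / (2 * eps)

print(np.allclose(dW, dW_fd, atol=1e-6))
```

Forming the full matrix `A` keeps the code close to the index expressions. A more economical variant would propagate only the row vector `g @ A` backwards through time, since each $T$ is diagonal.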
