# 8.7. Backpropagation Through Time

So far we have repeatedly alluded to things like exploding gradients, vanishing gradients, and the need to detach the gradient for RNNs. For instance, in Section 8.5 we invoked the detach function on the sequence. None of this was really fully explained, in the interest of being able to build a model quickly and to see how it works. In this section, we will delve a bit more deeply into the details of backpropagation for sequence models and why (and how) the mathematics works.

We encountered some of the effects of gradient explosion when we first implemented RNNs (Section 8.5). In particular, if you solved the exercises, you would have seen that gradient clipping is vital to ensure proper convergence. To provide a better understanding of this issue, this section will review how gradients are computed for sequence models. Note that there is nothing conceptually new in how it works. After all, we are still merely applying the chain rule to compute gradients. Nonetheless, it is worthwhile to review backpropagation (Section 4.7) again.

We have described forward and backward propagation and computational graphs in MLPs in Section 4.7. Forward propagation in an RNN is relatively straightforward. Backpropagation through time is actually a specific application of backpropagation in RNNs [Werbos, 1990]. It requires us to expand the computational graph of an RNN one time step at a time to obtain the dependencies among model variables and parameters. Then, based on the chain rule, we apply backpropagation to compute and store gradients. Since sequences can be rather long, the dependency can be rather lengthy. For instance, for a sequence of 1000 characters, the first token could potentially have significant influence on the token at the final position. Computing this dependency in full is not really feasible (it takes too long and requires too much memory), since it requires over 1000 matrix products before we would arrive at that very elusive gradient. This is a process fraught with computational and statistical uncertainty. In the following we will elucidate what happens and how to address this in practice.

## 8.7.1. Analysis of Gradients in RNNs
We start with a simplified model of how an RNN works. This model ignores details about the specifics of the hidden state and how it is updated. The mathematical notation here does not explicitly distinguish scalars, vectors, and matrices as earlier sections did. These details are immaterial to the analysis and would only serve to clutter the notation in this subsection.

In this simplified model, we denote $h_t$ as the hidden state, $x_t$ as the input, and $o_t$ as the output at time step $t$. Recall our discussions in Section 8.4.2 that the input and the hidden state can be concatenated to be multiplied by one weight variable in the hidden layer. Thus, we use $w_h$ and $w_o$ to indicate the weights of the hidden layer and the output layer, respectively. As a result, the hidden states and outputs at each time step can be expressed as

$$\begin{aligned}
h_t &= f(x_t, h_{t-1}, w_h),\\
o_t &= g(h_t, w_o),
\end{aligned} \tag{8.7.1}$$
 
where $f$ and $g$ are transformations of the hidden layer and the output layer, respectively. Hence, we have a chain of values $\{\ldots, (x_{t-1}, h_{t-1}, o_{t-1}), (x_{t}, h_{t}, o_t), \ldots\}$ that depend on each other via recurrent computation. The forward propagation is fairly straightforward. All we need is to loop through the $(x_t, h_t, o_t)$ triples one time step at a time. The discrepancy between output $o_t$ and the desired label $y_t$ is then evaluated by an objective function across all the $T$ time steps as

$$L(x_1, \ldots, x_T, y_1, \ldots, y_T, w_h, w_o) = \frac{1}{T}\sum_{t=1}^T l(y_t, o_t). \tag{8.7.2}$$
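The forward loop and the objective above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scalar choices `f` (a tanh update with a single weight), `g` (a linear readout), and `loss` (squared error), as well as the toy inputs, are hypothetical and are not the model used in earlier sections.

```python
import math

def f(x, h_prev, w_h):
    # Hypothetical hidden-layer transformation (scalar case).
    return math.tanh(w_h * (x + h_prev))

def g(h, w_o):
    # Hypothetical output-layer transformation.
    return w_o * h

def loss(y, o):
    # Hypothetical per-step loss l(y_t, o_t): squared error.
    return 0.5 * (y - o) ** 2

def forward(xs, ys, w_h, w_o, h0=0.0):
    """Loop over the (x_t, h_t, o_t) triples; return the objective L and all h_t."""
    h, hs, total = h0, [], 0.0
    for x, y in zip(xs, ys):
        h = f(x, h, w_h)       # h_t depends on x_t, h_{t-1}, and w_h
        o = g(h, w_o)          # o_t depends on h_t and w_o
        total += loss(y, o)
        hs.append(h)
    return total / len(xs), hs

xs = [0.5, -0.1, 0.3, 0.8]
ys = [0.2, 0.0, 0.1, 0.4]
L, hs = forward(xs, ys, w_h=0.9, w_o=1.5)
print(f"L = {L:.4f}, h_T = {hs[-1]:.4f}")
```

Note that each hidden state is computed from the previous one, so every $h_t$ depends on $w_h$ both directly and through all earlier hidden states.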
 
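For backpropagation, matters are a bit trickier, especially when we compute the gradients of the objective function $L$ with respect to the parameters $w_h$. To be specific, by the chain rule,

$$\begin{aligned}
\frac{\partial L}{\partial w_h} &= \frac{1}{T}\sum_{t=1}^T \frac{\partial l(y_t, o_t)}{\partial w_h} \\
&= \frac{1}{T}\sum_{t=1}^T \frac{\partial l(y_t, o_t)}{\partial o_t} \frac{\partial g(h_t, w_o)}{\partial h_t} \frac{\partial h_t}{\partial w_h}.
\end{aligned} \tag{8.7.3}$$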
 
The first and the second factors of the product in (8.7.3) are easy to compute. The third factor $\partial h_t/\partial w_h$ is where things get tricky, since we need to recurrently compute the effect of the parameter $w_h$ on $h_t$. According to the recurrent computation in (8.7.1), $h_t$ depends on both $h_{t-1}$ and $w_h$, where computation of $h_{t-1}$ also depends on $w_h$. Thus, using the chain rule yields

$$\frac{\partial h_t}{\partial w_h} = \frac{\partial f(x_{t}, h_{t-1}, w_h)}{\partial w_h} + \frac{\partial f(x_{t}, h_{t-1}, w_h)}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial w_h}. \tag{8.7.4}$$
 
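To derive the above gradient, assume that we have three sequences $\{a_{t}\}, \{b_{t}\}, \{c_{t}\}$ satisfying $a_{0}=0$ and $a_{t}=b_{t}+c_{t}a_{t-1}$ for $t=1, 2, \ldots$. Then for $t \geq 1$, it is easy to show

$$a_{t} = b_{t} + \sum_{i=1}^{t-1}\left(\prod_{j=i+1}^{t} c_{j}\right) b_{i}. \tag{8.7.5}$$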
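One way to verify this claim is a short induction, sketched here for completeness. The base case is $a_1 = b_1 + c_1 a_0 = b_1$. Assuming the formula holds for $t-1$, substituting it into the recurrence gives

```latex
\begin{aligned}
a_t &= b_t + c_t a_{t-1} \\
    &= b_t + c_t \left( b_{t-1} + \sum_{i=1}^{t-2} \left( \prod_{j=i+1}^{t-1} c_j \right) b_i \right) \\
    &= b_t + c_t b_{t-1} + \sum_{i=1}^{t-2} \left( \prod_{j=i+1}^{t} c_j \right) b_i \\
    &= b_t + \sum_{i=1}^{t-1} \left( \prod_{j=i+1}^{t} c_j \right) b_i .
\end{aligned}
```

The last step uses the fact that $c_t b_{t-1}$ is exactly the $i = t-1$ term of the sum.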
 
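By substituting $a_t$, $b_t$, and $c_t$ according to

$$\begin{aligned}
a_t &= \frac{\partial h_t}{\partial w_h},\\
b_t &= \frac{\partial f(x_{t}, h_{t-1}, w_h)}{\partial w_h},\\
c_t &= \frac{\partial f(x_{t}, h_{t-1}, w_h)}{\partial h_{t-1}},
\end{aligned} \tag{8.7.6}$$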
 
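the gradient computation in (8.7.4) satisfies $a_{t}=b_{t}+c_{t}a_{t-1}$, with $a_0 = 0$ as long as the initial state $h_0$ does not depend on $w_h$. Thus, per (8.7.5), we can remove the recurrent computation in (8.7.4) with

$$\frac{\partial h_t}{\partial w_h} = \frac{\partial f(x_{t}, h_{t-1}, w_h)}{\partial w_h} + \sum_{i=1}^{t-1}\left(\prod_{j=i+1}^{t} \frac{\partial f(x_{j}, h_{j-1}, w_h)}{\partial h_{j-1}} \right) \frac{\partial f(x_{i}, h_{i-1}, w_h)}{\partial w_h}. \tag{8.7.7}$$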
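As a sanity check, the recursive form (8.7.4) and the unrolled form (8.7.7) can be compared numerically. The sketch below assumes a hypothetical scalar update `f` (a tanh with a single weight) and toy inputs, neither of which comes from the text, and uses a central finite difference as an independent reference.

```python
import math

def f(x, h_prev, w_h):
    # Hypothetical scalar hidden update, chosen only for illustration.
    return math.tanh(w_h * (x + h_prev))

def hidden_states(xs, w_h, h0=0.0):
    hs = [h0]                          # hs[t] holds h_t, with hs[0] = h_0
    for x in xs:
        hs.append(f(x, hs[-1], w_h))
    return hs

def local_partials(xs, hs, w_h):
    """Return b_t = df/dw_h and c_t = df/dh_{t-1} for t = 1..T (index 0 unused)."""
    b, c = [0.0], [0.0]
    for t in range(1, len(hs)):
        s = 1.0 - hs[t] ** 2           # derivative of tanh at step t
        b.append(s * (xs[t - 1] + hs[t - 1]))
        c.append(s * w_h)
    return b, c

def grad_recursive(b, c, t):
    # (8.7.4): a_t = b_t + c_t * a_{t-1}, starting from a_0 = 0.
    a = 0.0
    for s in range(1, t + 1):
        a = b[s] + c[s] * a
    return a

def grad_unrolled(b, c, t):
    # (8.7.7): a_t = b_t + sum_i (prod_{j=i+1}^t c_j) * b_i.
    total = b[t]
    for i in range(1, t):
        total += math.prod(c[i + 1 : t + 1]) * b[i]
    return total

xs, w_h = [0.5, -0.1, 0.3, 0.8, -0.4], 0.9
hs = hidden_states(xs, w_h)
b, c = local_partials(xs, hs, w_h)
t = len(xs)

eps = 1e-6  # step size for the central finite difference
numeric = (hidden_states(xs, w_h + eps)[t] - hidden_states(xs, w_h - eps)[t]) / (2 * eps)
print(grad_recursive(b, c, t), grad_unrolled(b, c, t), numeric)
```

The recursive version needs one pass over the sequence, whereas the unrolled sum recomputes a product for every term, which hints at why the recursion is the form used in practice.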
 
While we can use the chain rule to compute $\partial h_t/\partial w_h$ recursively, this chain can get very long whenever $t$ is large. Let us discuss a number of strategies for dealing with this problem.

### 8.7.1.1. Full Computation
Obviously, we can just compute the full sum in (8.7.7). However, this is very slow and gradients can blow up, since subtle changes in the initial conditions can potentially affect the outcome a lot. That is, we could see things similar to the butterfly effect where minimal changes in the initial conditions lead to disproportionate changes in the outcome. This is actually quite undesirable in terms of the model that we want to estimate. After all, we are looking for robust estimators that generalize well. Hence this strategy is almost never used in practice.
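A toy calculation can make this sensitivity tangible. The linear recurrence `h_t = w * h_{t-1}` and the specific numbers below are assumptions chosen purely for illustration. They are not a model from this chapter.

```python
def final_state(h0, w, steps):
    # Repeatedly apply the hypothetical linear update h_t = w * h_{t-1}.
    h = h0
    for _ in range(steps):
        h = w * h
    return h

w, steps, delta = 1.1, 200, 1e-6
gap = final_state(1.0 + delta, w, steps) - final_state(1.0, w, steps)
print(f"a perturbation of {delta:g} in h_0 changes h_T by about {gap:.3g}")
print(f"amplification factor w**steps = {w ** steps:.3g}")
```

In this sketch the tiny perturbation of the initial state is amplified by the factor `w ** steps`, which is the scalar analogue of the long product of partial derivatives in (8.7.7).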

### 8.7.1.2. Truncating Time Steps
Alternatively, we can truncate the sum in (8.7.7) after $\tau$ steps. This is what we have been discussing so far, such as when we detached the gradients in Section 8.5. This leads to an approximation of the true gradient, simply by terminating the sum at $\partial h_{t-\tau}/\partial w_h$. In practice this works quite well. It is what is commonly referred to as truncated backpropagation through time [Jaeger, 2002]. One of the consequences of this is that the model focuses primarily on short-term influence rather than long-term consequences. This is actually desirable, since it biases the estimate towards simpler and more stable models.
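The effect of truncation can be sketched with the scalar recursion $a_t = b_t + c_t a_{t-1}$. The code below takes one reading of truncation as an assumption, namely treating $h_{t-\tau}$ as a constant so that $\partial h_{t-\tau}/\partial w_h$ is set to zero. The constant values of `b` and `c` are hypothetical.

```python
def grad_truncated(b, c, t, tau):
    """Keep only the terms of (8.7.7) that reach back fewer than tau steps.

    Equivalent to running a_s = b_s + c_s * a_{s-1} from s = t - tau + 1
    with a_{t - tau} treated as zero (h_{t - tau} is held constant).
    """
    a = 0.0
    for s in range(max(1, t - tau + 1), t + 1):
        a = b[s] + c[s] * a
    return a

# Hypothetical local partials b_t and c_t for t = 1..T (index 0 is unused).
T = 50
b = [0.0] + [1.0] * T
c = [0.0] + [0.8] * T

full = grad_truncated(b, c, T, tau=T)   # tau = T recovers the full sum
for tau in (1, 5, 10, 20):
    approx = grad_truncated(b, c, T, tau)
    print(f"tau={tau:2d}  truncated={approx:.4f}  error={abs(full - approx):.2e}")
```

In this example the error shrinks geometrically with $\tau$ because every $|c_t|$ is below 1. When the factors exceed 1 in magnitude, the dropped terms are the largest ones, so truncation then trades accuracy for stability.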

### 8.7.1.3. Randomized Truncation
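Last, we can replace $\partial h_t/\partial w_h$ by a random variable which is correct in expectation but truncates the sequence. This is achieved by using a sequence of $\xi_t$ with predefined $0 < \pi_t \leq 1$, where $P(\xi_t = 0) = 1-\pi_t$ and $P(\xi_t = \pi_t^{-1}) = \pi_t$, thus $E[\xi_t] = 1$. We use this to replace the gradient $\partial h_t/\partial w_h$ in (8.7.4) with

$$z_t = \frac{\partial f(x_{t}, h_{t-1}, w_h)}{\partial w_h} + \xi_t \frac{\partial f(x_{t}, h_{t-1}, w_h)}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial w_h}. \tag{8.7.8}$$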
 
It follows from the definition of $\xi_t$ that $E[z_t] = \partial h_t/\partial w_h$. Whenever $\xi_t = 0$ the recurrent computation terminates at that time step $t$. This leads to a weighted sum of sequences of varying lengths where long sequences are rare but appropriately overweighted. This idea was proposed by Tallec and Ollivier [Tallec & Ollivier, 2017].
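The unbiasedness claim can be checked with a small Monte Carlo experiment. The sketch below rests on several assumptions that are not in the text: the estimator is applied recursively at every step with independent draws, the probability `pi` is the same at every step, and the local partials `b` and `c` are hypothetical constants.

```python
import random

def grad_full(b, c, t):
    # Exact value of a_t from the recursion a_s = b_s + c_s * a_{s-1}, a_0 = 0.
    a = 0.0
    for s in range(1, t + 1):
        a = b[s] + c[s] * a
    return a

def grad_randomized(b, c, t, pi, rng):
    """Draw one sample of the estimator, working backwards from step t."""
    total, weight = 0.0, 1.0
    for s in range(t, 0, -1):
        total += weight * b[s]
        if s == 1 or rng.random() >= pi:   # xi_s = 0: the recursion stops here
            break
        weight *= c[s] / pi                # xi_s = 1 / pi: reweight and continue
    return total

T, pi = 10, 0.9
b = [0.0] + [1.0] * T
c = [0.0] + [0.8] * T

rng = random.Random(0)                     # fixed seed for reproducibility
n = 100_000
estimate = sum(grad_randomized(b, c, T, pi, rng) for _ in range(n)) / n
exact = grad_full(b, c, T)
print(f"exact = {exact:.4f}, Monte Carlo mean = {estimate:.4f}")
```

Individual samples stop early most of the time, and the rare long ones carry the factor `1 / pi` at every surviving step. This is the reweighting that keeps the average correct, and it is also the source of the extra variance.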

### 8.7.1.4. Comparing Strategies

Fig. 8.7.1 illustrates the three strategies when analyzing the first few characters of The Time Machine book using backpropagation through time for RNNs:

The first row is the randomized truncation that partitions the text into segments of varying lengths.

The second row is the regular truncation that breaks the text into subsequences of the same length. This is what we have been doing in RNN experiments.

The third row is the full backpropagation through time that leads to a computationally infeasible expression.

Unfortunately, while appealing in theory, randomized truncation does not work much better than regular truncation, most likely due to a number of factors. First, the effect of an observation after a number of backpropagation steps into the past is quite sufficient to capture dependencies in practice. Second, the increased variance counteracts the fact that the gradient is more accurate with more steps. Third, we actually want models that have only a short range of interactions. Hence, regularly truncated backpropagation through time has a slight regularizing effect that can be desirable.

## 8.7.2. Backpropagation Through Time in Detail
After discussing the general principle, let us discuss backpropagation through time in detail. Unlike the analysis in Section 8.7.1, in the following we will show how to compute the gradients of the objective function with respect to all the decomposed model parameters. To keep things simple, we consider an RNN without bias parameters, whose activation function in the hidden layer uses the identity mapping ($\phi(x)=x$). For time step $t$, let the single example input and the label be $\mathbf{x}_t \in \mathbb{R}^d$ and $y_t$, respectively. The hidden state $\mathbf{h}_t \in \mathbb{R}^h$ and the output $\mathbf{o}_t \in \mathbb{R}^q$ are computed as

$$\begin{aligned}
\mathbf{h}_t &= \mathbf{W}_{hx} \mathbf{x}_t + \mathbf{W}_{hh} \mathbf{h}_{t-1},\\
\mathbf{o}_t &= \mathbf{W}_{qh} \mathbf{h}_{t},
\end{aligned} \tag{8.7.9}$$
 
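where $\mathbf{W}_{hx} \in \mathbb{R}^{h \times d}$, $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$, and $\mathbf{W}_{qh} \in \mathbb{R}^{q \times h}$ are the weight parameters. Denote by $l(\mathbf{o}_t, y_t)$ the loss at time step $t$. Our objective function, the loss over $T$ time steps from the beginning of the sequence, is thus

$$L = \frac{1}{T} \sum_{t=1}^T l(\mathbf{o}_t, y_t). \tag{8.7.10}$$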
 
In order to visualize the dependencies among model variables and parameters during computation of the RNN, we can draw a computational graph for the model, as shown in Fig. 8.7.2. For example, the computation of the hidden state of time step 3, $\mathbf{h}_3$, depends on the model parameters $\mathbf{W}_{hx}$ and $\mathbf{W}_{hh}$, the hidden state of the previous time step $\mathbf{h}_2$, and the input of the current time step $\mathbf{x}_3$.

As just mentioned, the model parameters in Fig. 8.7.2 are $\mathbf{W}_{hx}$, $\mathbf{W}_{hh}$, and $\mathbf{W}_{qh}$. Generally, training this model requires gradient computation with respect to these parameters $\partial L/\partial \mathbf{W}_{hx}$, $\partial L/\partial \mathbf{W}_{hh}$, and $\partial L/\partial \mathbf{W}_{qh}$. According to the dependencies in Fig. 8.7.2, we can traverse in the opposite direction of the arrows to calculate and store the gradients in turn. To flexibly express the multiplication of matrices, vectors, and scalars of different shapes in the chain rule, we continue to use the $\text{prod}$ operator as described in Section 4.7.

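First of all, differentiating the objective function with respect to the model output at any time step $t$ is fairly straightforward:

$$\frac{\partial L}{\partial \mathbf{o}_t} = \frac{\partial l (\mathbf{o}_t, y_t)}{T \cdot \partial \mathbf{o}_t} \in \mathbb{R}^q. \tag{8.7.11}$$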
 
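Now, we can calculate the gradient of the objective function with respect to the parameter $\mathbf{W}_{qh}$ in the output layer: $\partial L/\partial \mathbf{W}_{qh} \in \mathbb{R}^{q \times h}$. Based on Fig. 8.7.2, the objective function $L$ depends on $\mathbf{W}_{qh}$ via $\mathbf{o}_1, \ldots, \mathbf{o}_T$. Using the chain rule yields

$$\frac{\partial L}{\partial \mathbf{W}_{qh}} = \sum_{t=1}^T \text{prod}\left(\frac{\partial L}{\partial \mathbf{o}_t}, \frac{\partial \mathbf{o}_t}{\partial \mathbf{W}_{qh}}\right) = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{o}_t} \mathbf{h}_t^\top, \tag{8.7.12}$$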
 
where $\partial L/\partial \mathbf{o}_t$ is given by (8.7.11).

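Next, as shown in Fig. 8.7.2, at the final time step $T$ the objective function $L$ depends on the hidden state $\mathbf{h}_T$ only via $\mathbf{o}_T$. Therefore, we can easily find the gradient $\partial L/\partial \mathbf{h}_T \in \mathbb{R}^h$ using the chain rule:

$$\frac{\partial L}{\partial \mathbf{h}_T} = \text{prod}\left(\frac{\partial L}{\partial \mathbf{o}_T}, \frac{\partial \mathbf{o}_T}{\partial \mathbf{h}_T} \right) = \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_T}. \tag{8.7.13}$$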
 
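It gets trickier for any time step $t < T$, where the objective function $L$ depends on $\mathbf{h}_t$ via $\mathbf{h}_{t+1}$ and $\mathbf{o}_t$. According to the chain rule, the gradient of the hidden state $\partial L/\partial \mathbf{h}_t \in \mathbb{R}^h$ at any time step $t < T$ can be recurrently computed as:

$$\frac{\partial L}{\partial \mathbf{h}_t} = \text{prod}\left(\frac{\partial L}{\partial \mathbf{h}_{t+1}}, \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_t} \right) + \text{prod}\left(\frac{\partial L}{\partial \mathbf{o}_t}, \frac{\partial \mathbf{o}_t}{\partial \mathbf{h}_t} \right) = \mathbf{W}_{hh}^\top \frac{\partial L}{\partial \mathbf{h}_{t+1}} + \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_t}. \tag{8.7.14}$$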
 
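For analysis, expanding the recurrent computation for any time step $1 \leq t \leq T$ gives

$$\frac{\partial L}{\partial \mathbf{h}_t} = \sum_{i=t}^T {\left(\mathbf{W}_{hh}^\top\right)}^{T-i} \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_{T+t-i}}. \tag{8.7.15}$$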
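To see where the powers of the recurrent weight matrix come from, it may help to unroll the recursion (8.7.14) by hand for the last few time steps, starting from (8.7.13). The following is only a worked expansion of those two equations:

```latex
\begin{aligned}
\frac{\partial L}{\partial \mathbf{h}_T}
  &= \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_T}, \\
\frac{\partial L}{\partial \mathbf{h}_{T-1}}
  &= \mathbf{W}_{hh}^\top \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_T}
   + \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_{T-1}}, \\
\frac{\partial L}{\partial \mathbf{h}_{T-2}}
  &= \left(\mathbf{W}_{hh}^\top\right)^2 \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_T}
   + \mathbf{W}_{hh}^\top \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_{T-1}}
   + \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_{T-2}}.
\end{aligned}
```

In each line, the output gradient that lies $k$ steps in the future is multiplied by the $k$-th power of the transposed recurrent weight matrix, which matches the general term of the expanded sum.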
 
We can see from (8.7.15) that this simple linear example already exhibits some key problems of long sequence models: it involves potentially very large powers of $\mathbf{W}_{hh}^\top$. In these powers, eigenvalues smaller than 1 in magnitude vanish and eigenvalues larger than 1 in magnitude diverge. This is numerically unstable, which manifests itself in the form of vanishing and exploding gradients. One way to address this is to truncate the time steps at a computationally convenient size as discussed in Section 8.7.1. In practice, this truncation is effected by detaching the gradient after a given number of time steps. Later on we will see how more sophisticated sequence models such as long short-term memory can alleviate this further.
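The following sketch illustrates both behaviors with a single matrix. The 2x2 matrix `W_T` is a hypothetical stand-in for the transposed recurrent weights, chosen because its eigenvalues are 1.5 (eigenvector `[1, 1]`) and 0.5 (eigenvector `[1, -1]`). It is not taken from any model in this chapter.

```python
def matvec(M, v):
    # Multiply a matrix (given as a list of rows) by a vector.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def apply_repeatedly(M, v, steps):
    """Return M^steps applied to v, computed by repeated multiplication."""
    for _ in range(steps):
        v = matvec(M, v)
    return v

# Hypothetical symmetric matrix with eigenvalues 1.5 and 0.5.
W_T = [[1.0, 0.5],
       [0.5, 1.0]]

for steps in (10, 25, 50):
    grown = apply_repeatedly(W_T, [1.0, 1.0], steps)    # eigenvalue 1.5 direction
    shrunk = apply_repeatedly(W_T, [1.0, -1.0], steps)  # eigenvalue 0.5 direction
    print(f"steps={steps:2d}  exploding component={grown[0]:.3e}  "
          f"vanishing component={shrunk[0]:.3e}")
```

A gradient vector generally has components along both eigenvectors, so after many steps the exploding direction dominates while the contribution along the vanishing direction is effectively lost.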

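Finally, Fig. 8.7.2 shows that the objective function $L$ depends on model parameters $\mathbf{W}_{hx}$ and $\mathbf{W}_{hh}$ in the hidden layer via hidden states $\mathbf{h}_1, \ldots, \mathbf{h}_T$. To compute gradients with respect to such parameters $\partial L / \partial \mathbf{W}_{hx} \in \mathbb{R}^{h \times d}$ and $\partial L / \partial \mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$, we apply the chain rule that gives

$$\begin{aligned}
\frac{\partial L}{\partial \mathbf{W}_{hx}} &= \sum_{t=1}^T \text{prod}\left(\frac{\partial L}{\partial \mathbf{h}_t}, \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_{hx}}\right) = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{h}_t} \mathbf{x}_t^\top,\\
\frac{\partial L}{\partial \mathbf{W}_{hh}} &= \sum_{t=1}^T \text{prod}\left(\frac{\partial L}{\partial \mathbf{h}_t}, \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_{hh}}\right) = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{h}_t} \mathbf{h}_{t-1}^\top,
\end{aligned} \tag{8.7.16}$$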
 
where $\partial L/\partial \mathbf{h}_t$, which is recurrently computed by (8.7.13) and (8.7.14), is the key quantity that affects the numerical stability.

Since backpropagation through time is the application of backpropagation in RNNs, as we have explained in Section 4.7, training RNNs alternates forward propagation with backpropagation through time. Moreover, backpropagation through time computes and stores the above gradients in turn. Specifically, stored intermediate values are reused to avoid duplicate calculations, such as storing $\partial L/\partial \mathbf{h}_t$ to be used in computation of both $\partial L / \partial \mathbf{W}_{hx}$ and $\partial L / \partial \mathbf{W}_{hh}$.
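The whole procedure for the linear RNN of this section can be sketched in plain Python. Several things here are assumptions made for the sake of a runnable example: the squared-error loss with vector-valued labels, the zero initial state, the toy weights and data, and the small list-based matrix helpers. The sketch follows (8.7.11) to (8.7.16) and checks one entry of the result against a finite difference.

```python
def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def outer(u, v):
    return [[a * b for b in v] for a in u]

def madd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

def zeros_like(M):
    return [[0.0] * len(M[0]) for _ in M]

def forward(W_hx, W_hh, W_qh, xs, ys):
    """Run (8.7.9); return the loss (8.7.10) and the cached states and outputs."""
    h = [0.0] * len(W_hh)                      # assumed initial state h_0 = 0
    hs, outs, loss = [h], [], 0.0
    for x, y in zip(xs, ys):
        h = vadd(matvec(W_hx, x), matvec(W_hh, h))
        o = matvec(W_qh, h)
        loss += 0.5 * sum((a - b) ** 2 for a, b in zip(o, y))  # assumed loss
        hs.append(h)
        outs.append(o)
    return loss / len(xs), hs, outs

def bptt(W_hx, W_hh, W_qh, xs, ys):
    T = len(xs)
    _, hs, outs = forward(W_hx, W_hh, W_qh, xs, ys)
    dW_hx, dW_hh, dW_qh = zeros_like(W_hx), zeros_like(W_hh), zeros_like(W_qh)
    dh_next = None
    for t in range(T, 0, -1):                  # hs[t] is h_t, outs[t-1] is o_t
        do = [(a - b) / T for a, b in zip(outs[t - 1], ys[t - 1])]  # (8.7.11)
        dW_qh = madd(dW_qh, outer(do, hs[t]))                        # (8.7.12)
        dh = matvec(transpose(W_qh), do)                             # (8.7.13)
        if dh_next is not None:
            dh = vadd(dh, matvec(transpose(W_hh), dh_next))          # (8.7.14)
        # dh is computed once and reused for both hidden-layer gradients.
        dW_hx = madd(dW_hx, outer(dh, xs[t - 1]))                    # (8.7.16)
        dW_hh = madd(dW_hh, outer(dh, hs[t - 1]))                    # (8.7.16)
        dh_next = dh
    return dW_hx, dW_hh, dW_qh

# Small hypothetical problem: d = 2 inputs, h = 2 hidden units, q = 1 output.
W_hx = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.9, 0.2], [-0.4, 0.7]]
W_qh = [[1.0, -0.5]]
xs = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]
ys = [[0.3], [-0.2], [0.5]]

dW_hx, dW_hh, dW_qh = bptt(W_hx, W_hh, W_qh, xs, ys)

# Central finite difference on one entry of W_hh as an independent check.
eps = 1e-6
W_plus = [row[:] for row in W_hh]
W_minus = [row[:] for row in W_hh]
W_plus[0][1] += eps
W_minus[0][1] -= eps
numeric = (forward(W_hx, W_plus, W_qh, xs, ys)[0]
           - forward(W_hx, W_minus, W_qh, xs, ys)[0]) / (2 * eps)
print(dW_hh[0][1], numeric)
```

The forward pass caches every hidden state and output, and the backward loop keeps only the most recent hidden-state gradient in `dh_next`, which mirrors the reuse of stored intermediate values described above.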

## 8.7.3. Summary
Backpropagation through time is merely an application of backpropagation to sequence models with a hidden state.

Truncation, such as regular truncation or randomized truncation, is needed for computational convenience and numerical stability.

High powers of matrices can lead to divergent or vanishing eigenvalues. This manifests itself in the form of exploding or vanishing gradients.

For efficient computation, intermediate values are cached during backpropagation through time.