An RNN generalizes Markov and n-gram models, which predict from only the last $n$ steps, to the whole preceding sequence by adding a hidden state $h_t$ that summarizes the past: $$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1}),$$ $$h_t = f(x_t, h_{t-1}) \qquad (1)$$

In [2]:
import torch 
from d2l import torch as d2l

### RNN theory
* In a simple NN the input $X$ is not time dependent: $X \in \mathbb{R}^{n \times d}$,  
    so the output of a hidden layer is: $H = \phi(XW_{xh} + b_h)$
* In an RNN we have the same thing, only now indexed by time: $X_t \in \mathbb{R}^{n \times d}$.
    From (1) we can see that we also need $H_{t-1}$, so we add a weight matrix $W_{hh}$: $$H_t = \phi(X_t W_{xh} + H_{t-1}W_{hh} + b_h)$$  
  The previous step's $H_{t-1}$ is the hidden state, since it encapsulates the past.
* The output is the classic NN output layer: $O_t = H_tW_{hq} + b_q$
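
The three bullets above can be sketched as a single recurrent step applied in a loop over time. This is only an illustrative sketch: the function name `rnn_step`, the sizes `n, d, h, q, T`, the choice of `tanh` for $\phi$, and the zero initial hidden state are all assumptions, not something fixed by the formulas.

```python
import torch

def rnn_step(X_t, H_prev, W_xh, W_hh, b_h, W_hq, b_q):
    """One step of H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h), O_t = H_t W_hq + b_q.

    phi is taken to be tanh here (an assumption; any activation would fit).
    """
    H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)
    O_t = H_t @ W_hq + b_q
    return H_t, O_t

torch.manual_seed(0)
# Hypothetical sizes: batch n, input features d, hidden units h, outputs q, T time steps
n, d, h, q, T = 3, 2, 4, 5, 6
W_xh, W_hh, b_h = torch.randn(d, h) * 0.1, torch.randn(h, h) * 0.1, torch.zeros(h)
W_hq, b_q = torch.randn(h, q) * 0.1, torch.zeros(q)

X = torch.randn(T, n, d)   # one (n, d) input matrix per time step
H = torch.zeros(n, h)      # initial hidden state (assumed to be zeros)
outputs = []
for X_t in X:              # the same weights are reused at every time step
    H, O_t = rnn_step(X_t, H, W_xh, W_hh, b_h, W_hq, b_q)
    outputs.append(O_t)
O = torch.stack(outputs)   # shape (T, n, q)
print(O.shape, H.shape)
```

Note that the number of parameters does not grow with `T`: only `H` changes from step to step, which is what lets it carry information about the past.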

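### **Concatenation instead of Addition** (A more sophisticated approach)

Instead of computing $X_t W_{xh}$ and $H_{t-1} W_{hh}$ separately and adding them, we can concatenate $X_t$ and $H_{t-1}$ along columns, concatenate $W_{xh}$ and $W_{hh}$ along rows, and do a single matrix multiplication: $$X_t W_{xh} + H_{t-1} W_{hh} = \begin{bmatrix} X_t & H_{t-1} \end{bmatrix} \begin{bmatrix} W_{xh} \\ W_{hh} \end{bmatrix}$$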

![imgs/rnn_data_flow.png](imgs/rnn_data_flow.png)

In [10]:
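# X: (3, 1) input, W_xh: (1, 4); H: (3, 4) hidden state, W_hh: (4, 4)
X, W_xh = torch.randn(3, 1), torch.randn(1, 4)
H, W_hh = torch.randn(3, 4), torch.randn(4, 4)
# Two separate products, then add
add_mul_result = torch.matmul(X, W_xh) + torch.matmul(H, W_hh)
# Concatenate X, H along columns (3, 5) and W_xh, W_hh along rows (5, 4), then one product
cat_mul_result = torch.matmul(torch.cat((X, H), 1), torch.cat((W_xh, W_hh), 0))
print(add_mul_result)
print(cat_mul_result)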

tensor([[-2.1203, -0.8179,  0.8744,  0.1771],
        [ 2.2148, -0.4934,  2.7883,  1.2505],
        [ 2.7532, -3.6170,  3.5816,  0.0531]])
tensor([[-2.1203, -0.8179,  0.8744,  0.1771],
        [ 2.2148, -0.4934,  2.7883,  1.2505],
        [ 2.7532, -3.6170,  3.5816,  0.0531]])
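
The two printed tensors match to the displayed precision. Rather than comparing them by eye, the equality can also be checked numerically. The following is a small sketch using `torch.allclose`; the seed, the size names `n, d, h`, and the tolerance `atol=1e-5` are arbitrary choices for illustration, and a tolerance is used because the two computations may differ by floating-point rounding.

```python
import torch

torch.manual_seed(0)       # arbitrary seed, only for reproducibility
n, d, h = 3, 1, 4          # same shapes as in the cell above
X, W_xh = torch.randn(n, d), torch.randn(d, h)
H, W_hh = torch.randn(n, h), torch.randn(h, h)

# Two products followed by an addition
two_products = X @ W_xh + H @ W_hh
# One product of the concatenated matrices: (n, d+h) @ (d+h, h)
one_product = torch.cat((X, H), dim=1) @ torch.cat((W_xh, W_hh), dim=0)

# Compare up to a small tolerance instead of requiring bitwise equality
same = torch.allclose(two_products, one_product, atol=1e-5)
print(same)
```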
