# RNN
We can regard an RNN as a fully connected layer with an added hidden state. Assume we have a minibatch of inputs $\mathbf{X}_t$ at time step $t$, and that we have kept the hidden state $\mathbf{H}_{t-1}$ from the previous time step. We introduce a new weight parameter $\mathbf{W}_{hh}$ that describes how the previous hidden state is used. With input weights $\mathbf{W}_{xh}$, bias $\mathbf{b}_h$, and activation function $\phi$, the hidden state at the current time step is computed as:
$$
\mathbf{H}_t = \phi(\mathbf{X}_t\mathbf{W}_{xh}+\mathbf{H}_{t-1}\mathbf{W}_{hh}+\mathbf{b}_h)
$$
![RNN](../statics/imgs/section8.4_fig1.jpg)

In [1]:
# RNN
import torch
from d2l import torch as d2l

In [2]:
X, W_xh = torch.normal(0, 1, (3, 1)), torch.normal(0, 1, (1, 4))
H, W_hh = torch.normal(0, 1, (3, 4)), torch.normal(0, 1, (4, 4))
torch.matmul(X, W_xh) + torch.matmul(H, W_hh)

tensor([[-0.3215,  0.4281, -0.9213,  3.1691],
        [-3.3469,  1.6438,  0.2987,  1.1681],
        [ 4.7888, -0.9300,  0.2859,  2.5178]])

In [3]:
# Concatenate X and H along columns (dim 1), and W_xh and W_hh along rows (dim 0).
# A single matmul on the concatenated tensors gives the same result as the sum above.
torch.matmul(torch.cat((X, H), 1), torch.cat((W_xh, W_hh), 0))

tensor([[-0.3215,  0.4281, -0.9213,  3.1691],
        [-3.3469,  1.6438,  0.2987,  1.1681],
        [ 4.7888, -0.9300,  0.2859,  2.5178]])
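
The cells above leave out the bias and the activation function. As a minimal sketch of the full update rule, the snippet below applies the hidden-state equation over a few time steps. The helper name `rnn_step`, the choice of `tanh` for $\phi$, the zero initial state, and the random inputs are all assumptions made for illustration; the notes themselves do not fix any of them.

```python
import torch

def rnn_step(X_t, H_prev, W_xh, W_hh, b_h):
    """One recurrent update: H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h).

    tanh is assumed for phi here; any activation could be substituted.
    """
    return torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)

torch.manual_seed(0)
# Same shapes as the cells above: 3 examples, 1 input feature, 4 hidden units.
batch_size, num_inputs, num_hiddens, num_steps = 3, 1, 4, 5
W_xh = torch.normal(0, 1, (num_inputs, num_hiddens))
W_hh = torch.normal(0, 1, (num_hiddens, num_hiddens))
b_h = torch.zeros(num_hiddens)

H = torch.zeros(batch_size, num_hiddens)  # assumed initial hidden state
for t in range(num_steps):
    X_t = torch.normal(0, 1, (batch_size, num_inputs))  # stand-in input at step t
    H = rnn_step(X_t, H, W_xh, W_hh, b_h)  # the same weights are reused every step

print(H.shape)
```

The hidden state keeps the shape `(batch_size, num_hiddens)` at every step, and the same `W_xh`, `W_hh`, and `b_h` are shared across all time steps.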

## Perplexity
Perplexity is an important metric for evaluating the quality of a language model (LM). A good LM is able to predict the token that we are going to say next.

We take the perspective of information theory. If we want to compress a text, we can do so based on the model's prediction of the next token given the tokens seen so far. The better the LM, the fewer bits we need to compress the sequence. We can therefore measure quality by the cross-entropy averaged over the $n$ tokens of a sequence:
$$
\frac{1}{n}\sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1)
$$
where $P$ is given by the LM and $x_t$ is the token observed at time step $t$. For historical reasons, linguists prefer a metric called perplexity, which is the exponential of the formula above:
$$
\exp\left(\frac{1}{n}\sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1)\right)
$$
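
To make the formula concrete, here is a small sketch that computes perplexity from the probabilities a model assigned to the tokens that were actually observed. The function name `perplexity` and the probability lists are hypothetical and chosen only for illustration; they do not come from a trained model.

```python
import math

def perplexity(probs):
    """Exponential of the average negative log-probability of the observed tokens.

    probs[t] stands for P(x_t | x_{t-1}, ..., x_1) as given by some model.
    """
    n = len(probs)
    avg_neg_log_prob = -sum(math.log(p) for p in probs) / n
    return math.exp(avg_neg_log_prob)

# A model that is always certain of the observed token.
perfect = perplexity([1.0, 1.0, 1.0])

# A model that spreads its probability uniformly over 4 candidate tokens.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

print(perfect, uniform)
```

Up to floating-point rounding, the certain model gives a perplexity of 1 and the uniform model gives 4, the number of candidate tokens. Lower values indicate a better model.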

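Perplexity is best understood as *the geometric mean of the effective number of choices the model has when picking the next token*, where the effective number of choices at step $t$ is $1/P(x_t \mid x_{t-1}, \ldots, x_1)$. A model that always assigns probability 1 to the observed token has a perplexity of 1, while a model that predicts uniformly over the vocabulary has a perplexity equal to the vocabulary size.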