## Language Modelling
Language modelling means modelling the probability of a word sequence:
$$
P(w_1, w_2, \dots, w_n)=P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1,w_2) \cdots P(w_n|w_1,w_2,\dots,w_{n-1})=\prod_{t=1}^{n}P(w_t|w_1,w_2,\dots,w_{t-1})
$$
An early statistical language model is the **n-gram model**. To keep the probability computation feasible, it assumes that the nth word depends only on the preceding n-1 words. For example, a bi-gram model uses one preceding word to predict the next word, a tri-gram model uses two preceding words, and so on.
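To make the bi-gram case concrete, here is a minimal sketch that estimates $P(w_t|w_{t-1})$ from raw counts. The toy corpus and the helper name `train_bigram` are made up for illustration, and no smoothing is applied.

```python
from collections import Counter

def train_bigram(tokens):
    """Estimate P(next | prev) from raw counts (maximum likelihood, no smoothing)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))   # counts of (prev, next) pairs
    prev_counts = Counter(tokens[:-1])               # how often each word appears as "prev"
    return {pair: c / prev_counts[pair[0]] for pair, c in pair_counts.items()}

# Hypothetical toy corpus, chosen only for illustration.
tokens = "the students opened their books and the students opened their exams".split()
bigram = train_bigram(tokens)

print(bigram[("the", "students")])   # P(students | the)  -> 1.0
print(bigram[("their", "books")])    # P(books | their)   -> 0.5
```

Any pair that never occurs in the corpus simply has no entry here, which is one reason real n-gram models need smoothing.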

But an n-gram model tends to generate text that is grammatical yet incoherent, because it only ever sees a short context. So let's look at neural language models.

## How to Build Neural Language Models (NLM)
### A fixed-window NLM
It follows the n-gram convention: predict the nth word given the preceding n-1 words. The word vectors are concatenated into a window vector, which an MLP processes to output softmax probabilities over the entire vocabulary. However, one big problem of this model is that the window size is fixed, so it cannot handle dependencies over variable-length sequences. So we'll look at RNNs.
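A fixed-window NLM can be sketched in a few lines of PyTorch. The class name `FixedWindowLM`, the single hidden layer with `tanh`, and all the sizes below are assumptions made for illustration; this is not a reference implementation.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Hypothetical fixed-window neural LM: embed, concatenate, MLP, logits over the vocabulary."""
    def __init__(self, vocab_size, embedding_dim, window_size, num_hiddens):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.mlp = nn.Sequential(
            nn.Linear(window_size * embedding_dim, num_hiddens),
            nn.Tanh(),
            nn.Linear(num_hiddens, vocab_size),
        )

    def forward(self, window):                  # window: (batch_size, window_size) of token ids
        emb = self.embedding(window)            # (batch_size, window_size, embedding_dim)
        window_vec = emb.flatten(start_dim=1)   # concatenated window vector
        return self.mlp(window_vec)             # logits: (batch_size, vocab_size)

fw_model = FixedWindowLM(vocab_size=10, embedding_dim=8, window_size=3, num_hiddens=16)
window = torch.randint(0, 10, (4, 3))           # 4 windows of 3 preceding words each
probs = torch.softmax(fw_model(window), dim=-1) # next-word probabilities
print(probs.shape)
```

Note that the first linear layer's input size is tied to `window_size`, which is exactly why this model cannot take a longer or shorter context.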

### RNN
#### Model Structure
An RNN is like a reusable MLP that receives two inputs at every step. We input a vector $x_1$ (e.g. the first word vector of a sentence) and initialise a hidden state $h_0$ (usually a zero vector). We take their linear transformations $W_{xh}x_1$ and $W_{hh}h_0$, add a bias $b_h$, and get $W_{xh}x_1+W_{hh}h_0+b_h$. Applying an activation function $\Phi$ to this sum gives the current hidden state $h_1=\Phi(W_{xh}x_1+W_{hh}h_0+b_h)$. To get the output of the current time step, apply an affine transformation to $h_1$ (multiply by $W_{ho}$ and add a bias $b_o$), then apply an output activation function $\phi$. We repeat this calculation recurrently: input $x_2$ (the second word vector of the sentence), load $h_1$, and so on. This is how an RNN works. The main maths is as follows:

$$
\begin{align}
h_t=\Phi(W_{xh}x_t+W_{hh}h_{t-1}+b_h) \\
o_t=\phi(W_{ho}h_t+b_o)
\end{align}
$$

I also have a diagram:
<center>
<img src="./RNN.png" width=500 height=500>
</center>

#### Training an RNN

Let's say we are doing a next-word prediction task and we have a huge corpus starting with "the students opened their exams". When we input "the", we get the probability distribution over the next word. However, we don't feed the most probable word back in as the next input. Instead, we input the ground truth "students". This is called "teacher forcing", and it avoids error accumulation, among other things. Then we input "opened", and so on. For every prediction, we compute a loss. In practice, we don't run through the entire corpus and compute all losses before updating; that's too computationally expensive. Rather, we split the corpus into sentences (or documents) and put them into batches. Within a batch, the losses of words from different sentences are computed in parallel. To do backprop, we can compute the average loss of every sentence, i.e., $\frac{loss_{w1}+loss_{w2}+...}{num\_words}$, then average the sentence-level losses, i.e., $\frac{loss_{s1}+loss_{s2}+...}{batch\_size}$, and finally do mini-batch gradient descent. When all sentences in the batch have the same length, this is equivalent to averaging the losses of all tokens in the batch, which is what we do in practice.<a id="one"></a>
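The two ways of averaging can be compared with a few lines of plain Python. The per-token loss values below are invented for illustration; the point is only that the two averages agree when every sentence in the batch has the same length and can differ otherwise.

```python
def mean(xs):
    return sum(xs) / len(xs)

def mean_of_sentence_means(batch):
    """Average within each sentence first, then across sentences."""
    return mean([mean(sentence) for sentence in batch])

def token_mean(batch):
    """Average over all tokens in the batch at once."""
    return mean([loss for sentence in batch for loss in sentence])

# Hypothetical per-token losses for a batch of two sentences.
equal_len = [[2.0, 1.0, 3.0], [4.0, 2.0, 0.0]]
unequal_len = [[2.0, 1.0, 3.0], [4.0]]

print(mean_of_sentence_means(equal_len), token_mean(equal_len))      # 2.0 2.0
print(mean_of_sentence_means(unequal_len), token_mean(unequal_len))  # 3.0 2.5
```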

##### Gradient Calculation:
The gradient of the loss at time step $t$ w.r.t. the recurrent weights is:
$$
\begin{align}
\frac{\partial J^{(t)}}{\partial W_{hh}}
&= \sum_{i=1}^{t}{\frac{\partial J^{(t)}}{\partial W_{hh}}} \Big|_{(i)} \\
&= \frac{\partial J^{(t)}}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{hh}} \\
&= \frac{\partial J^{(t)}}{\partial h_t} \cdot (\frac{\partial h_t}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial W_{hh}}) \\
&= \frac{\partial J^{(t)}}{\partial h_t} \cdot (\frac{\partial h_t}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_t}{\partial h_{t-1}} \cdot (\frac{\partial h_{t-1}}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdot \frac{\partial h_{t-2}}{\partial W_{hh}})) \\
&= \frac{\partial J^{(t)}}{\partial h_t} \cdot (\frac{\partial h_t}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_t}{\partial h_{t-1}} \cdot (\frac{\partial h_{t-1}}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdot (\frac{\partial h_{t-2}}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_{t-2}}{\partial h_{t-3}} \cdot \frac{\partial h_{t-3}}{\partial W_{hh}}))) \\
&= \text{...} \\
&= \frac{\partial J^{(t)}}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{hh}} \Big|_{dir} + \frac{\partial J^{(t)}}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial W_{hh}} \Big|_{dir} + ... + \frac{\partial J^{(t)}}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdot ... \cdot \frac{\partial h_{2}}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_{hh}} \Big|_{dir}
\tag{1}
\end{align}
$$
**N.B: The sign "$\Big|_{dir}$" denotes the direct partial derivative. When calculating direct partial derivatives, treat intermediate variables as constants.**

To get the gradient of the loss over the entire sequence, sum the gradients contributed by all time steps:
$$
\frac{\partial J}{\partial W_{hh}}=\sum^{T}_{t=1}{\frac{\partial J^{(t)}}{\partial W_{hh}}}
$$

As for the gradient of the loss at time step $t$ w.r.t. the **input weights**, the form is the same as above; just replace $W_{hh}$ with $W_{xh}$. The **output weights** only affect $J^{(t)}$ through $o_t$, not through the recurrence, so their gradient involves no chain over time steps and is easy to calculate.
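Formula (1) can be sanity-checked numerically on a scalar RNN. The sketch below assumes $h_t=\tanh(w_x x_t + w_h h_{t-1})$ with $h_0=0$ and a squared-error loss $J^{(t)}=\frac{1}{2}(h_t-y_t)^2$; the loss, the function names and all numbers are assumptions chosen for illustration. It sums formula (1) over all time steps and compares the result with a finite-difference estimate.

```python
import math

def forward(w_x, w_h, xs):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}), with h_0 = 0."""
    hs = [0.0]                                   # hs[0] is h_0
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1]))
    return hs

def total_loss(w_x, w_h, xs, ys):
    hs = forward(w_x, w_h, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def grad_w_h(w_x, w_h, xs, ys):
    """Sum formula (1) over all time steps t."""
    hs = forward(w_x, w_h, xs)
    grad = 0.0
    for t in range(1, len(hs)):
        dJ_dh = hs[t] - ys[t - 1]                # dJ^(t)/dh_t
        chain = 1.0                              # product of dh_k/dh_{k-1} for k = i+1..t
        for i in range(t, 0, -1):
            direct = (1 - hs[i] ** 2) * hs[i - 1]    # direct derivative dh_i/dw_h
            grad += dJ_dh * chain * direct
            chain *= (1 - hs[i] ** 2) * w_h          # extend the chain by dh_i/dh_{i-1}
    return grad

# Hypothetical inputs, targets and weights.
xs, ys = [1.0, -0.5, 0.3, 0.7], [0.2, -0.1, 0.4, 0.0]
w_x, w_h, eps = 0.5, 0.8, 1e-6

analytic = grad_w_h(w_x, w_h, xs, ys)
numeric = (total_loss(w_x, w_h + eps, xs, ys) - total_loss(w_x, w_h - eps, xs, ys)) / (2 * eps)
print(analytic, numeric)   # the two values should agree to several decimal places
```

The inner loop makes the O(t) chain visible: each step back in time multiplies `chain` by one more factor $\frac{\partial h_i}{\partial h_{i-1}}$.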

##### Gradient Clipping
As you can see from formula (1), the length of the matrix multiplication chain is O(t), which is prone to causing gradient instability (vanishing or exploding gradients). For vanishing gradients, there's no such simple general fix. But for exploding gradients, we have **Gradient Clipping**. The main idea is: put all gradients into one long one-dimensional vector $g$ and calculate its L2 norm, which reflects the magnitude of the overall gradient. If it's too large, scale it down.

More specifically, we have the gradient matrices $\partial W_{xh}$, $\partial W_{hh}$ and $\partial W_{ho}$ (biases are ignored for simplicity). We flatten all gradient values into a vector $g$ of length $x \cdot h + h \cdot h + h \cdot o$ and calculate its L2 norm $||g||$. Next we compare this value with a threshold $\theta$. If $||g||$ is larger than $\theta$, the gradients need to be clipped (scaled), so we scale $g$ by $\frac{\theta}{||g||}$, which brings its norm down to exactly $\theta$. Otherwise we do nothing, i.e., the scaling factor is 1.

$$
g \leftarrow \min\left(1, \frac{\theta}{||g||}\right) \cdot g
$$

Doing so, we scale exploding gradients down to moderate values while keeping their direction. The clipped $g$ is the gradient we use for the weight updates.
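PyTorch also ships a built-in for this global-norm clipping, `torch.nn.utils.clip_grad_norm_`, which rescales the gradients in place and returns the norm measured before clipping. The tiny `nn.Linear` model and the artificially inflated loss below are assumptions used only to force a large gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(5, 3)                                  # hypothetical tiny model
loss = (layer(torch.randn(8, 5)) * 100).pow(2).sum()     # scaled up so the gradient is large
loss.backward()

theta = 1.0
# Clips all gradients jointly so that their global L2 norm is at most theta.
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=theta)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))
print(float(norm_before), float(norm_after))
```

After the call, the global norm sits at (approximately) `theta`, and every gradient entry has been multiplied by the same factor.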

#### Application: Generating Texts
Input the first word, get the probabilities of the next word, sample one word as the input for the next step, and repeat.
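The sampling loop can be sketched as follows. The model here is an untrained stand-in built from `nn.Embedding`, `nn.RNNCell` and `nn.Linear` with made-up sizes, so the generated ids are meaningless; the sketch only shows how each sampled token is fed back in as the next input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim, num_hiddens = 10, 8, 16   # hypothetical sizes
embedding = nn.Embedding(vocab_size, embedding_dim)
cell = nn.RNNCell(embedding_dim, num_hiddens)
fc = nn.Linear(num_hiddens, vocab_size)

@torch.no_grad()
def generate(first_token, num_steps):
    """Sample num_steps token ids, feeding each sampled token back in as the next input."""
    token = torch.tensor([first_token])              # shape: (1,)
    h = torch.zeros(1, num_hiddens)                  # h_0
    generated = [first_token]
    for _ in range(num_steps):
        h = cell(embedding(token), h)                # update the hidden state
        probs = torch.softmax(fc(h), dim=-1)         # (1, vocab_size)
        token = torch.multinomial(probs, num_samples=1).squeeze(1)  # sample one id, shape (1,)
        generated.append(token.item())
    return generated

print(generate(first_token=3, num_steps=5))
```

Unlike training with teacher forcing, there is no ground truth at generation time, so the model's own sample becomes the next input.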

In [120]:
# RNN implementation from scratch
import torch
import torch.nn as nn
def get_params(num_inputs, num_hiddens, num_outputs): # the size of input layer, hidden layer and output layer.

    def norm(shape):
        return torch.randn(size=shape) * 0.01

    W_xh=norm((num_inputs, num_hiddens))
    W_hh=norm((num_hiddens, num_hiddens))
    W_ho=norm((num_hiddens, num_outputs))
    b_h=torch.zeros((num_hiddens))
    b_o=torch.zeros((num_outputs))
    params=[W_xh, W_hh, b_h, W_ho, b_o]
    for param in params:
        param.requires_grad_(True)
    return params

def init_rnn_hidden_state(batch_size, num_hiddens):
    return torch.zeros((batch_size, num_hiddens))

def rnn(inputs, init_state, params):
    W_xh, W_hh, b_h, W_ho, b_o=params
    H=init_state #shape: (batch_size, num_hiddens)
    outputs=[]
    for X in inputs:# shape of inputs: (sequence_length, batch_size, num_inputs/input_dim)
        H=torch.tanh(X@W_xh + H@W_hh + b_h)
        O=H@W_ho + b_o #shape:(batch_size, num_outputs)
        outputs.append(O) # list of (batch_size, num_outputs) with a length of "sequence_length"
    return (torch.concat(outputs, dim=0), #shape: (sequence_length*batch_size, num_outputs)
           H) #last_hidden_state

class RNNFromScratch():
    def __init__(self, num_inputs, num_hiddens, num_outputs, get_params, init_rnn_hidden_state, forward_fn):
        self.num_inputs, self.num_hiddens, self.num_outputs=num_inputs, num_hiddens, num_outputs
        self.params=get_params(self.num_inputs, self.num_hiddens, self.num_outputs)
        self.init_rnn_hidden_state, self.forward_fn=init_rnn_hidden_state, forward_fn

    def __call__(self, inputs, init_state):
        #shape of inputs: (sequence_length, batch_size, num_inputs), so you need to do the encoding yourself.
        return self.forward_fn(inputs, init_state, self.params)

    def begin_state(self, batch_size):
        #The caller decides when to (re)initialise the RNN state.
        return self.init_rnn_hidden_state(batch_size, self.num_hiddens)

In [121]:
'''I'm going to show you how the model does a forward pass.
I'll generate a corpus of four sentences (batch_size=4) with three words each (sequence_length=3),
where each word is represented by a ten-dimensional vector (input_dim=10).
The model matches the data: num_inputs is 10, num_hiddens=20 is an arbitrary choice, and it outputs 10-dimensional vectors (vocab_size=10).
'''
X=torch.randn((3,4,10))
model=RNNFromScratch(10, 20, 10, get_params, init_rnn_hidden_state, rnn)

In [122]:
state=model.begin_state(4) #4=batch_size

In [123]:
Y_pred, last_hidden_state=model(X, state)
Y_pred.shape, last_hidden_state.shape, Y_pred

(torch.Size([12, 10]),
 torch.Size([4, 20]),
 tensor([[ 7.8869e-04, -1.1282e-04, -2.2380e-03, -7.6417e-04, -1.2836e-03,
          -4.9329e-05,  1.7520e-04,  2.7867e-04, -2.9061e-04,  1.2229e-03],
         [-5.2602e-04, -1.0642e-03, -3.4827e-04, -1.9850e-03,  1.2060e-04,
           3.0027e-04,  8.1262e-04,  9.8413e-04,  6.4373e-05,  9.4788e-04],
         [ 3.0684e-04,  6.3645e-04,  2.8042e-03,  1.4256e-03,  9.6874e-04,
           1.3954e-04, -9.9776e-04, -4.9223e-04, -5.1182e-04, -2.6976e-03],
         [ 1.2678e-03,  3.9407e-04,  2.2582e-03,  3.6788e-04,  8.3618e-04,
           1.5151e-03, -8.5991e-05,  1.1036e-04, -1.3444e-03, -2.1440e-03],
         [-6.8979e-04, -4.4372e-04, -1.1161e-04, -5.0272e-04, -1.7772e-05,
          -3.6689e-04,  1.1432e-04,  1.0340e-04,  3.2620e-04,  3.3911e-04],
         [ 2.8142e-04, -1.3099e-05,  1.1177e-03, -7.1878e-04, -5.2349e-04,
          -6.0484e-05, -8.3074e-04,  1.8374e-04, -1.1330e-03, -2.1839e-03],
         [-4.2062e-04, -1.0863e-03,  1.7017e-04, 

In [124]:
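def gradient_clipping(model, theta):
    if isinstance(model, nn.Module):
        params=[p for p in model.parameters() if p.requires_grad]
    else:
        params=[p for p in model.params]
    params=[p for p in params if p.grad is not None] # skip params that received no gradient
    norm=torch.sqrt(sum(torch.sum(p.grad**2) for p in params))# Here we don't first concat then calculate. Instead, we calculate per-parameter sums of squares, then add them up.
    if norm>theta:
        for param in params:
            param.grad.mul_(theta/norm) # scale the gradients in place, not the parameters themselves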

In [125]:
#Training
Y=torch.randint(0,10,(12,))
optimizer=torch.optim.Adam(model.params)
criterion=torch.nn.CrossEntropyLoss()#reduction="mean" by default. So we are summing the losses of all tokens of all sentences in the batch
#and averaging them (just like what I said in the 1st paragraph under "Training an RNN").

In [137]:
Y_pred, last_hidden_state=model(X, state)
optimizer.zero_grad()
loss=criterion(Y_pred,Y)
loss.backward()
gradient_clipping(model, 1.0)
optimizer.step()
model.params[0]

tensor([[ 0.0171, -0.0121,  0.0270,  0.0039,  0.0031,  0.0168,  0.0184, -0.0138,
          0.0082, -0.0005, -0.0057,  0.0111,  0.0160, -0.0196,  0.0133,  0.0086,
         -0.0033,  0.0027,  0.0202,  0.0275],
        [-0.0183, -0.0181, -0.0095,  0.0080, -0.0014, -0.0009,  0.0021, -0.0153,
         -0.0129, -0.0038,  0.0044,  0.0024, -0.0106,  0.0264,  0.0148, -0.0122,
          0.0131,  0.0002, -0.0184, -0.0232],
        [ 0.0140,  0.0095,  0.0214,  0.0123, -0.0119, -0.0193, -0.0165,  0.0107,
          0.0231, -0.0167,  0.0167, -0.0113, -0.0156, -0.0177, -0.0218,  0.0280,
         -0.0255,  0.0119, -0.0181, -0.0110],
        [ 0.0121,  0.0058,  0.0023,  0.0049, -0.0258,  0.0270,  0.0199,  0.0027,
          0.0091, -0.0104, -0.0263,  0.0054, -0.0132, -0.0311,  0.0082,  0.0060,
          0.0217, -0.0139,  0.0095,  0.0126],
        [-0.0133, -0.0134, -0.0168,  0.0189,  0.0361,  0.0021, -0.0057,  0.0227,
         -0.0199,  0.0038,  0.0160,  0.0069, -0.0195,  0.0192,  0.0024, -0.0181,
      

In [142]:
#RNN implementation with torch api
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim , num_hiddens):
        super().__init__()
        self.embedding=nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.rnn=nn.RNN(embedding_dim, num_hiddens, batch_first=True) #num_layers is 1 by default.
        #And h0 is by default torch.zeros((num_layers, batch_size, num_hiddens))
        self.fc=nn.Linear(num_hiddens, vocab_size)

    def forward(self, X): # X in the shape of (batch_size, seq_length)
        embeddings=self.embedding(X) #shape: (batch_size, seq_length, embedding_dim)
        hidden_states, h_n=self.rnn(embeddings)
        logits=self.fc(hidden_states.reshape(-1, hidden_states.size(-1)))
        logits=logits.reshape((hidden_states.size(0), hidden_states.size(1), -1))
        return logits

In [147]:
text='''She Was More Like A Beauty Queen From A Movie Scene.I Said Don't Mind, But What Do You Mean I Am The One
Who Will Dance On The Floor In The Round.She Said I Am The One Who Will Dance On The Floor In The Round.She Told Me Her Name Was Billie Jean, As She Caused A Scene
Then Every Head Turned With Eyes That Dreamed Of Being The One.Who Will Dance On The Floor In The Round.People Always Told Me Be Careful Of What You Do
And Don't Go Around Breaking Young Girls' Hearts.
And Mother Always Told Me Be Careful Of Who You Love.
And Be Careful Of What You Do 'Cause The Lie Becomes The Truth.
Billie Jean Is Not My Lover.
She's Just A Girl Who Claims That I Am The One.
But The Kid Is Not My Son.
She Says I Am The One, But The Kid Is Not My Son.
For Forty Days And Forty Nights.
The Law was on her Side.
But Who Can Stand When She's In Demand.
Her Schemes And Plans.
'Cause We Danced On The Floor In The Round.
So Take My Strong Advice, Just Remember To Always Think Twice.
Do think Twice.
She Told My Baby We had Danced unTill Three.
Then She Looked At Me.
She Showed A Photo Of A Baby Crying.
His Eyes Looked Like Mine.
Go On Dance On The Floor In The Round, Baby.
People Always Told Me Be Careful Of What You Do.
And Don't Go Around Breaking Young Girls' Hearts.
She Came And Stood Right By Me.
Then The Smell Of Sweet Perfume.
This Happened Much Too Soon.
She Called Me To Her Room.
Billie Jean Is Not My Lover.
She's Just A Girl Who Claims That I Am The One.
But The Kid Is Not My Son.'''

In [152]:
import re
class Tokenizer:
    def __init__(self, corpus):
        self.corpus=corpus.lower().replace("\n", " ") # replace newlines with spaces so words on adjacent lines are not glued together
        self.corpus=re.sub(r"[^\w\s]","",self.corpus)

### Evaluation of Language Models
#### Perplexity