## 1. Language Modelling
Language Modelling is modelling the probability of a sequence: 
$$
P(w_1, w_2, ..., w_n)=P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1,w_2) \cdots P(w_n|w_1,w_2,...,w_{n-1})=\prod_{t=1}^{n}P(w_t|w_1,w_2,...,w_{t-1})
$$
An early statistical language model is the **n-gram model**. To keep the probability computation feasible, it assumes that each word depends only on the preceding n-1 words. For example, a bi-gram model uses one preceding word to predict the next word, a tri-gram model uses two preceding words, and so on.
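
As a rough illustration of the bi-gram case, here is a minimal sketch in plain Python that estimates $P(w_t|w_{t-1})$ by counting. The toy sentence and the function name `train_bigram` are assumptions made up for this example, and no smoothing is applied.

```python
from collections import Counter

def train_bigram(tokens):
    """Estimate P(next | prev) by maximum likelihood from a list of tokens."""
    pair_counts = Counter(zip(tokens, tokens[1:]))   # counts of (prev, next) pairs
    prev_counts = Counter(tokens[:-1])               # how often each word has a successor
    return {(prev, nxt): c / prev_counts[prev] for (prev, nxt), c in pair_counts.items()}

# Toy corpus, invented for illustration.
toy_tokens = "the students opened their books the students opened their exams".split()
bigram_probs = train_bigram(toy_tokens)
print(bigram_probs[("their", "books")])   # "their" is followed by "books" once out of twice
```

Unseen pairs simply get no entry here, which is one reason real n-gram models need smoothing or back-off.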

But an n-gram model with a small n tends to generate text that is locally grammatical yet incoherent overall, because it cannot see beyond its short context. So let's turn to neural language models.

## 2. How to Build Neural Language Models (NLM)
### 2.1 A fixed-window NLM
It follows the n-gram convention: predict the nth word given the preceding n-1 words. The word vectors are concatenated into a window vector, which an MLP processes to output softmax probabilities over the entire vocabulary. However, one big problem with this model is that the window size is fixed, so it cannot handle dependencies of variable length. So we'll look at RNNs.

### 2.2 RNN
#### 2.2.1 Model Structure
An RNN is like a reusable MLP. It receives two inputs at every step. We input a vector $x_1$ (for example, the first word vector of a sentence) and initialise a hidden state $h_0$ (usually a zero vector). We apply the linear maps $W_{xh}x_1$ and $W_{hh}h_0$, sum them and add a bias $b_h$ to get $W_{xh}x_1+W_{hh}h_0+b_h$. Applying an activation function $\Phi$ then gives the current hidden state $h_1=\Phi(W_{xh}x_1+W_{hh}h_0+b_h)$. To get the output of the current time step, apply an affine transformation to $h_1$ (multiply by $W_{ho}$ and add a bias $b_o$), followed by an activation function $\phi$. The same calculation is repeated recurrently: input $x_2$ (the second word vector of the sentence), load $h_1$, and so on. This is how an RNN works. The main maths are as follows:

$$
\begin{align}
h_t=\Phi(W_{xh}x_t+W_{hh}h_{t-1}+b_h) \\
o_t=\phi(W_{ho}h_t+b_o)
\end{align}
$$

I also have a diagram:
<center>
<img src="./RNN.png" width=500 height=500>
</center>

#### 2.2.2 Training an RNN

Let's say we are doing a next-word-prediction task and we have a huge corpus starting with "the students opened their exams". When we input "the", we get the probability distribution of the next word. However, we don't feed the most probable word back in as the next input. Instead, we input the ground truth "students". This is called "teacher forcing"; among other things, it keeps early mistakes from accumulating over later steps. Then we input "opened", and so on. For every prediction, we compute a loss. In practice, we don't run through the entire corpus and compute all the losses before updating, which would be too computationally expensive. Rather, we split the corpus into sentences (or documents) and put them into batches. Within a batch, the losses of words from different sentences are computed in parallel. To do backprop, we can average the losses within each sentence, i.e., $\frac{loss_{w1}+loss_{w2}+...}{num\_words}$, then average the sentence-level losses, i.e., $\frac{loss_{s1}+loss_{s2}+...}{batch\_size}$, and finally do mini-batch gradient descent. When all sentences in the batch have the same length, this is equivalent to averaging the losses of all tokens in the batch, which is what we do in practice.<a id="one"></a>
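
To compare the two averaging schemes concretely, here is a minimal sketch in plain Python with made-up loss values (the numbers and function names are illustrative assumptions, not part of the training code). It suggests that averaging per sentence and then per batch matches the token-level average when the sentences have equal length, and can differ when they don't.

```python
def mean(values):
    return sum(values) / len(values)

def sentence_then_batch_mean(batch_losses):
    """Average within each sentence first, then across the sentences of the batch."""
    return mean([mean(sentence) for sentence in batch_losses])

def token_mean(batch_losses):
    """Average over every token of the batch at once."""
    return mean([loss for sentence in batch_losses for loss in sentence])

# Each inner list holds the per-word losses of one sentence (invented numbers).
equal_lengths = [[1.0, 3.0], [2.0, 6.0]]
unequal_lengths = [[1.0, 3.0], [2.0, 6.0, 10.0, 2.0]]

print(sentence_then_batch_mean(equal_lengths), token_mean(equal_lengths))
print(sentence_then_batch_mean(unequal_lengths), token_mean(unequal_lengths))
```

With unequal lengths, the sentence-first scheme gives short sentences more weight per token than the token-level average does.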

##### 2.2.2.1 Gradient Calculation:
The gradient of loss at time step $t$ w.r.t. recurrent weights is:
$$
\begin{align}
\frac{\partial J^{(t)}}{\partial W_{hh}}
&= \sum_{i=1}^{t}{\frac{\partial J^{(t)}}{\partial W_{hh}}} \Big|_{(i)} \\
&= \frac{\partial J^{(t)}}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{hh}} \\
&= \frac{\partial J^{(t)}}{\partial h_t} \cdot (\frac{\partial h_t}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial W_{hh}}) \\
&= \frac{\partial J^{(t)}}{\partial h_t} \cdot (\frac{\partial h_t}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_t}{\partial h_{t-1}} \cdot (\frac{\partial h_{t-1}}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdot \frac{\partial h_{t-2}}{\partial W_{hh}})) \\
&= \frac{\partial J^{(t)}}{\partial h_t} \cdot (\frac{\partial h_t}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_t}{\partial h_{t-1}} \cdot (\frac{\partial h_{t-1}}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdot (\frac{\partial h_{t-2}}{\partial W_{hh}} \Big|_{dir} + \frac{\partial h_{t-2}}{\partial h_{t-3}} \cdot \frac{\partial h_{t-3}}{\partial W_{hh}}))) \\
&= \text{...} \\
&= \frac{\partial J^{(t)}}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{hh}} \Big|_{dir} + \frac{\partial J^{(t)}}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial W_{hh}} \Big|_{dir} + ... + \frac{\partial J^{(t)}}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdot ... \cdot \frac{\partial h_{2}}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_{hh}} \Big|_{dir}
\tag{1}
\end{align}
$$
**N.B: The sign "$\Big|_{dir}$" denotes the direct partial derivative. When calculating direct partial derivatives, treat intermediate variables as constants.**

To get the gradient of the loss over the entire sequence, sum the gradients contributed by all time steps:
$$
\frac{\partial J}{\partial W_{hh}}=\sum^{T}_{t=1}{\frac{\partial J^{(t)}}{\partial W_{hh}}}
$$

As for the gradient of the loss at time step $t$ w.r.t. the **input weights**, the form is the same as above: just replace $W_{hh}$ with $W_{xh}$. The gradient of the loss w.r.t. the **output weights** does not involve the recurrence through the hidden states, so it is easy to calculate.
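
Formula (1) can be sanity-checked numerically. The sketch below is a simplified scalar version under stated assumptions: a one-dimensional RNN $h_t=\tanh(w h_{t-1}+u x_t)$ with $h_0=0$, and a squared-error loss on the last hidden state standing in for $J^{(t)}$. The names `forward`, `loss` and `bptt_grad_w`, and all the numbers, are invented for illustration.

```python
import math

def forward(w, u, xs):
    """Scalar RNN h_t = tanh(w*h_{t-1} + u*x_t) with h_0 = 0; returns [h_0, ..., h_T]."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, y):
    """Squared error on the final hidden state only."""
    return 0.5 * (forward(w, u, xs)[-1] - y) ** 2

def bptt_grad_w(w, u, xs, y):
    """Sum of the terms of formula (1): upstream gradient times each direct derivative."""
    hs = forward(w, u, xs)
    T = len(xs)
    upstream = hs[T] - y                          # dJ/dh_T
    grad = 0.0
    for i in range(T, 0, -1):
        direct = (1 - hs[i] ** 2) * hs[i - 1]     # dh_i/dw with h_{i-1} held constant
        grad += upstream * direct
        upstream *= (1 - hs[i] ** 2) * w          # multiply by dh_i/dh_{i-1}
    return grad

w, u, xs, y = 0.5, 0.8, [1.0, -0.5, 0.3, 0.9], 0.2
eps = 1e-6
numeric = (loss(w + eps, u, xs, y) - loss(w - eps, u, xs, y)) / (2 * eps)  # central difference
analytic = bptt_grad_w(w, u, xs, y)
print(analytic, numeric)
```

The two printed values should agree up to the finite-difference error, which supports reading formula (1) as a sum of one "direct" term per time step.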

##### 2.2.2.2 Gradient Clipping
As you can see from formula (1), the length of the matrix multiplication chain is O(t), which makes the gradients prone to instability (vanishing or exploding). For vanishing gradients, there's no equally simple general fix. But for exploding gradients, we have **Gradient Clipping**. The main idea is: put all gradients into one long one-dimensional vector $g$ and calculate its L2 norm, which reflects the magnitude of the overall gradient. If it's too large, scale it down.

More specifically, we have gradient matrices $\partial W_{xh}$, $\partial W_{hh}$ and $\partial W_{ho}$ (biases are left out for simplicity). We flatten all gradient values into a vector $g$ of length $x \cdot h + h \cdot h + h \cdot o$ and calculate its L2 norm $||g||$. Next we compare this value with a threshold $\theta$. If $||g||$ is larger than $\theta$, the gradients need to be clipped (scaled), so we scale $g$ by $\frac{\theta}{||g||}$. If not, we leave $g$ unchanged, which amounts to a scale factor of 1.

$$
g \leftarrow \min\left(1, \frac{\theta}{||g||}\right) \cdot g
$$

Doing so, we scale exploding gradients down to moderate values while keeping their direction. The clipped $g$ holds the gradients we use for the weight updates.
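
A tiny worked example of the clipping rule, written in plain Python rather than PyTorch so the arithmetic is easy to follow. The function name `clip_by_global_norm` and the gradient values are assumptions made up for this sketch.

```python
import math

def clip_by_global_norm(grads, theta):
    """Scale every gradient by min(1, theta / ||g||), where ||g|| spans all the lists."""
    global_norm = math.sqrt(sum(v * v for grad in grads for v in grad))
    scale = min(1.0, theta / global_norm) if global_norm > 0 else 1.0
    return [[v * scale for v in grad] for grad in grads], global_norm

# Two "parameter gradients" whose joint norm is sqrt(3^2 + 4^2) = 5.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], theta=1.0)
print(norm_before, clipped)

# Gradients already below the threshold are left untouched.
small, _ = clip_by_global_norm([[0.3], [0.4]], theta=1.0)
print(small)
```

Both entries are multiplied by the same factor, so the clipped vector points in the same direction as the original one.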

**My Contemplation**: Can batch normalization be applied between the time steps of an RNN? My idea is based on the fact that gradients closer to the loss function are less affected by gradient vanishing and exploding. In MLPs, this issue can be alleviated by adding batch normalization layers between the fully connected layers. So, in an RNN, can batch normalization be added between time steps to alleviate this problem?

The answer is: not batch norm, but layer norm. Layer norm can alleviate both gradient vanishing (mainly) and gradient exploding (slightly).

But overall, the most effective way is to abandon the vanilla RNN and turn to LSTM.

#### 2.2.3 Applications
##### Generating Texts
Input the first word, get the probabilities of the next word, sample one word as the input for the next step, and repeat.
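
The sampling loop can be sketched in plain Python as follows. This is only an illustration: `toy_model` is a hypothetical lookup table standing in for a trained network's softmax output, and unlike a real RNN it conditions on the last word only rather than on a hidden state summarising the whole history.

```python
import random

# Hypothetical next-word distributions (invented numbers).
toy_model = {
    "the": {"students": 0.7, "exams": 0.3},
    "students": {"opened": 1.0},
    "opened": {"their": 0.6, "the": 0.4},
    "their": {"exams": 1.0},
    "exams": {"the": 1.0},
}

def generate(first_word, num_words, rng):
    """Repeatedly sample the next word and feed it back in as the next input."""
    words = [first_word]
    for _ in range(num_words - 1):
        dist = toy_model[words[-1]]
        next_word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        words.append(next_word)
    return words

sample = generate("the", 6, random.Random(0))
print(" ".join(sample))
```

Sampling (instead of always taking the argmax) is what lets repeated runs produce different texts.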

##### Sequence Tagging
This includes part-of-speech tagging, Named Entity Recognition, etc. For example, input "The startled cat knocked over the vase" and output the part of speech of each word (DT, JJ, NN, VBN, IN, DT, NN).

##### Sentiment(Text) Classification
Normally, we can feed the RNN hidden states of all time steps into an nn.Linear to produce an output for each time step. But if we only want one output, such as a probability vector over classes, we feed the hidden state of the last time step into an nn.Linear. Yet there is a way that usually works better, which takes advantage of the hidden states of all time steps: take the hidden state vectors, pool them into one vector with a function such as element-wise mean or max, and then feed that vector into an nn.Linear.
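
The pooling step can be illustrated with plain Python lists. In this sketch `toy_states` is an invented sequence of three 2-dimensional hidden states, and the helper names are assumptions; in PyTorch the same idea would be a reduction over the time dimension.

```python
def mean_pool(hidden_states):
    """Element-wise mean over time steps; hidden_states is a list of equal-length vectors."""
    steps = len(hidden_states)
    return [sum(column) / steps for column in zip(*hidden_states)]

def max_pool(hidden_states):
    """Element-wise max over time steps."""
    return [max(column) for column in zip(*hidden_states)]

# Three time steps, hidden size 2 (invented numbers).
toy_states = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
print(mean_pool(toy_states))  # one vector summarising the sequence
print(max_pool(toy_states))
```

Either pooled vector has the same size as a single hidden state, so the classification layer does not need to change.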

##### Language Encoder
The hidden states of RNNs are encoded features of input sequences. We can employ them to do many other tasks such as machine translation, question answering, etc.

##### Signal(Feature) Decoder
Decode features of signals. For instance, take audio signals as input and use an RNN to decode them into a sequence of words, or decode the encoded features in tasks like machine translation and summarization. This is the concept of a conditional language model.

In [22]:
# RNN implementation from scratch
import torch
import torch.nn as nn
def get_params(num_inputs, num_hiddens, num_outputs): # the size of input layer, hidden layer and output layer.

    def norm(shape):
        return torch.randn(size=shape) * 0.01

    W_xh=norm((num_inputs, num_hiddens))
    W_hh=norm((num_hiddens, num_hiddens))
    W_ho=norm((num_hiddens, num_outputs))
    b_h=torch.zeros((num_hiddens))
    b_o=torch.zeros((num_outputs))
    params=[W_xh, W_hh, b_h, W_ho, b_o]
    for param in params:
        param.requires_grad_(True)
    return params

def init_rnn_hidden_state(batch_size, num_hiddens):
    return torch.zeros((batch_size, num_hiddens))

def rnn(inputs, init_state, params):
    W_xh, W_hh, b_h, W_ho, b_o=params
    H=init_state #shape: (batch_size, num_hiddens)
    outputs=[]
    for X in inputs:# shape of inputs: (sequence_length, batch_size, num_inputs/input_dim)
        H=torch.tanh(X@W_xh + H@W_hh + b_h)
        O=H@W_ho + b_o #shape:(batch_size, num_outputs)
        outputs.append(O) # list of (batch_size, num_outputs) with a length of "sequence_length"
    return (torch.concat(outputs, dim=0), #shape: (sequence_length*batch_size, num_outputs)
           H) #last_hidden_state

class RNNFromScratch():
    def __init__(self, num_inputs, num_hiddens, num_outputs, get_params, init_rnn_hidden_state, forward_fn):
        self.num_inputs, self.num_hiddens, self.num_outputs=num_inputs, num_hiddens, num_outputs
        self.params=get_params(self.num_inputs, self.num_hiddens, self.num_outputs)
        self.init_rnn_hidden_state, self.forward_fn=init_rnn_hidden_state, forward_fn

    def __call__(self, inputs, init_state):
        #shape of inputs: (sequence_length, batch_size, num_inputs), so you need to do the encoding yourself.
        return self.forward_fn(inputs, init_state, self.params)

    def begin_state(self, batch_size):
        #Here I choose to relinquish the right to decide when to init RNN state.
        return self.init_rnn_hidden_state(batch_size, self.num_hiddens)

In [23]:
'''I'm gonna show you how the model does a forward pass.
I'll generate a random batch of four sentences (batch_size=4), each three words long (sequence_length=3),
where each word is represented by a ten-dimensional vector (num_inputs=10).
The model matches the data: it has num_inputs=10. num_hiddens=20 is an arbitrary choice. It outputs 10-dimensional vectors.
'''
X=torch.randn((3,4,10))
model=RNNFromScratch(10, 20, 10, get_params, init_rnn_hidden_state, rnn)

In [24]:
state=model.begin_state(4) #4=batch_size

In [25]:
Y_pred, last_hidden_state=model(X, state)
Y_pred.shape, last_hidden_state.shape, Y_pred

(torch.Size([12, 10]),
 torch.Size([4, 20]),
 tensor([[-7.6345e-04, -2.3445e-03, -3.6665e-04,  1.7603e-03,  1.1258e-03,
           6.9076e-04, -8.8053e-04,  2.3621e-04,  4.5635e-05, -8.5392e-04],
         [-2.8610e-04, -1.1678e-03, -1.6485e-03, -5.7420e-04,  8.1074e-04,
           6.9223e-04, -1.1956e-03, -1.7231e-03, -1.4239e-03, -2.1762e-03],
         [ 2.5728e-03, -1.6653e-03, -1.0001e-03, -2.9238e-05,  1.3186e-03,
          -4.1197e-03, -2.2584e-03, -5.2514e-04, -8.7097e-04,  2.5179e-03],
         [-3.7598e-04, -2.4222e-03, -3.2804e-04,  1.2338e-03,  2.6149e-03,
          -7.0440e-04, -1.7304e-04,  1.1132e-03,  1.4809e-03,  9.3106e-04],
         [-3.2777e-04, -9.3533e-04,  1.9009e-04, -3.8599e-04,  4.7021e-05,
          -9.6063e-04,  2.5407e-03,  1.5146e-04,  2.5898e-03,  2.3404e-03],
         [ 1.1144e-03,  1.9660e-04, -1.3043e-04, -5.5121e-04,  1.0919e-04,
          -1.2696e-04, -3.1353e-03, -4.5795e-04, -2.0319e-03, -1.6917e-03],
         [-1.5946e-03,  2.7701e-04,  1.3220e-03, 

In [26]:
def gradient_clipping(model, theta):
    if isinstance(model, nn.Module):
        params=[p for p in model.parameters() if p.requires_grad]
    else:
        params=[p for p in model.params]
    params=[p for p in params if p.grad is not None] # skip parameters that received no gradient
    # Here we don't first concat and then calculate. Instead, we sum the squared entries per tensor, then take one square root.
    norm=torch.sqrt(sum(torch.sum(p.grad**2) for p in params))
    if norm>theta:
        for param in params:
            param.grad[:]*=theta/norm

In [27]:
#Training
Y=torch.randint(0,10,(12,))
optimizer=torch.optim.Adam(model.params)
criterion=torch.nn.CrossEntropyLoss()#reduction="mean" by default, so the losses of all tokens from all sentences in the batch are summed
#and then averaged (just like what I said in the 1st paragraph under "Training an RNN")

In [28]:
Y_pred, last_hidden_state=model(X, state)
optimizer.zero_grad()
loss=criterion(Y_pred,Y)
loss.backward()
gradient_clipping(model, 1.0)
optimizer.step()
model.params[0]

tensor([[ 0.0051, -0.0158,  0.0152, -0.0050, -0.0097, -0.0012,  0.0052, -0.0198,
          0.0061,  0.0162,  0.0110,  0.0002, -0.0099, -0.0039, -0.0027,  0.0050,
         -0.0259, -0.0167, -0.0004,  0.0117],
        [ 0.0034, -0.0066,  0.0007,  0.0197,  0.0003,  0.0023,  0.0094,  0.0096,
         -0.0177,  0.0139,  0.0006, -0.0158,  0.0236,  0.0035, -0.0002,  0.0158,
         -0.0014,  0.0075,  0.0031, -0.0005],
        [-0.0011,  0.0044,  0.0071, -0.0010,  0.0069, -0.0077,  0.0224, -0.0069,
         -0.0128,  0.0012, -0.0034, -0.0045,  0.0179,  0.0073, -0.0115, -0.0339,
         -0.0027, -0.0050, -0.0135,  0.0055],
        [ 0.0159, -0.0175,  0.0029, -0.0015, -0.0123, -0.0128,  0.0122, -0.0113,
         -0.0093,  0.0106,  0.0027, -0.0147, -0.0020,  0.0078, -0.0056,  0.0158,
          0.0004,  0.0115, -0.0136,  0.0055],
        [-0.0065, -0.0089, -0.0092, -0.0009,  0.0016,  0.0007,  0.0061, -0.0050,
         -0.0069, -0.0101, -0.0224,  0.0034,  0.0155, -0.0089,  0.0133, -0.0099,
      

In [29]:
#RNN implementation with torch api
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim , num_hiddens):
        super().__init__()
        self.embedding=nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.rnn=nn.RNN(embedding_dim, num_hiddens, batch_first=True) #num_layers is 1 by default.
        #And h0 is by default torch.zeros((num_layers, batch_size, num_hiddens))
        self.fc=nn.Linear(num_hiddens, vocab_size)

    def forward(self, X): # X in the shape of (batch_size, seq_length)
        embeddings=self.embedding(X) #shape: (batch_size, seq_length, embedding_dim)
        hidden_states, h_n=self.rnn(embeddings)
        logits=self.fc(hidden_states.reshape(-1, hidden_states.size(-1)))
        logits=logits.reshape((hidden_states.size(0), hidden_states.size(1), -1))
        return logits

In [30]:
text='''Some of the most important decisions of our lives occur while we are feeling stressed and anxious. From medical decisions to financial and professional ones, we are all sometimes required to weigh up information under stressful conditions. But do we become better or worse at processing and using information under such circumstances? My colleague and I, both neuroscientists, wanted to investigate how the mind operates under stress, so we visited some local fire stations. Firefighters' workdays vary quite a bit. Some are pretty relaxed; they will spend their time washing the truck, cleaning equipment, cooking meals and reading. Other days can be hectic, with numerous life threatening incidents to attend to; they will enter burning homes to rescue trapped residents, and assist with medical emergencies. These ups and downs presented the perfect setting for an experiment on how people's ability to use information changes when they feel under pressure. We found that perceived threat acted as a trigger for a stress reaction that made the task of processing information easier for the firefighters - but only as long as it conveyed bad news. This is how we arrived at these results. We asked the firefighters to estimate their likelihood of experiencing 40 different adverse events in their life, such as being involved in an accident or becoming a victim of card fraud. We then gave them either good news (that their likelihood of experiencing these events was lower than they'd thought) or bad news (that it was higher) and asked them to provide new estimates. People are normally quite optimistic - they will ignore bad news and embrace the good. This is what happened when the firefighters were relaxed; but when they were under stress, a different pattern emerged. 
Under these conditions, they became hyper-vigilant to bad news, even when it had nothing to do with their job (such as learning that the likelihood of card fraud was higher than they'd thought), and altered their beliefs in response. In contrast, stress didn't change how they responded to good news (such as learning that the likelihood of card fraud was lower than they'd thought). Back in our lab, we observed the same pattern in students who were told they had to give a surprise public speech, which would be judged by a panel, recorded and posted online. Sure enough, their cortisol levels spiked, their heart rates went up and they suddenly became better at processing unrelated, yet alarming, information about rates of disease and violence. When we experience stressful events, a physiological change is triggered that causes us to take in warnings and focus on what might go wrong. Brain imaging reveals that this 'switch' is related to a sudden boost in a neural signal important for learning, specifically in response to unexpected warning signs, such as faces expressing fear. Such neural engineering could have helped prehistoric humans to survive. When our ancestors found themselves surrounded by hungry animals, they would have benefited from an increased ability to learn about hazards. In a safe environment, however, it would have been wasteful to be on high alert constantly. So, a neural switch that automatically increases or decreases our ability to process warnings in response to changes in our environment could have been useful. In fact, people with clinical depression and anxiety seem unable to switch away from a state in which they absorb all the negative messages around them. It is also important to realise that stress travels rapidly from one person to the next. If a co-worker is stressed, we are more likely to tense up and feel stressed ourselves. We don't even need to be in the same room with someone for their emotions to influence our behaviour. 
Studies show that if we observe positive feeds on social media, such as images of a pink sunset, we are more likely to post uplifting messages ourselves. If we observe negative posts, such as complaints about a long queue at the coffee shop, we will in turn create more negative posts. In some ways, many of us now live as if we are in danger, constantly ready to tackle demanding emails and text messages, and respond to news alerts and comments on social media. Repeatedly checking your phone, according to a survey conducted by the American Psychological Association, is related to stress. In other words, a pre-programmed physiological reaction, which evolution has equipped us with to help us avoid famished predators, is now being triggered by an online post. Social media posting, according to one study, raises your pulse, makes you sweat, and enlarges your pupils more than most daily activities. The fact that stress increases the likelihood that we will focus more on alarming messages, together with the fact that it spreads extremely rapidly, can create collective fear that is not always justified. After a stressful public event, such as a natural disaster or major financial crash, there is often a wave of alarming information in traditional and social media, which individuals become very aware of. But that has the effect of exaggerating existing danger. And so, a reliable pattern emerges - stress is triggered, spreading from one person to the next, which temporarily enhances the likelihood that people will take in negative reports, which increases stress further. As a result, trips are cancelled, even if the disaster took place across the globe; stocks are sold, even when holding on is the best thing to do. The good news, however, is that positive emotions, such as hope, are contagious too, and are powerful in inducing people to act to find solutions. 
Being aware of the close relationship between people's emotional state and how they process information can help us frame our messages more effectively and become conscientious agents of change.This finding also has direct implications for our everyday lives. First, it reminds us to be especially cautious when making important decisions under stress. While stress can heighten our sensitivity to potential threats, this heightened awareness is not always beneficial—it can lead us to over-focus on negative information, overestimate risks, and underestimate opportunities. For example, in financial investments or career choices, stress may push us to avoid reasonable risks or overreact to low-probability events.

Second, it suggests that we can actively manage our emotional state to improve information processing. Short relaxation exercises, deep breathing, mindfulness meditation, or talking with friends can reduce stress levels and help the brain return to a more balanced mode, allowing for a more comprehensive evaluation of information. Conversely, when facing real danger or urgent threats, stress can act as a protective mechanism, enabling rapid recognition of hazards and quick action.

Furthermore, this mechanism plays a significant role in group behavior. Not only are we affected by our own stress, but we are also highly influenced by the emotional state of others. This means that in organizations, public communication, or even family life, the stress levels of leaders or key influencers can quickly spread, amplifying panic or hope. Being aware of this, we can strategically manage how information is presented—for instance, providing clear, calm guidance during crises while emphasizing solutions rather than just warnings.

Ultimately, this research highlights a fundamental truth: human neural mechanisms are highly adaptive, designed both to protect us from real dangers and, in modern society, sometimes to overreact to virtual threats. Understanding this system makes it clearer how emotions, stress, and information processing interact to influence decision-making, allowing us to manage both personal behavior and social dynamics more effectively. By consciously regulating our emotions and environment, we can reduce unnecessary fear while harnessing the beneficial aspects of stress to respond more effectively to genuine challenges.'''


In [11]:
import re
from collections import Counter
class Tokenizer:
    def __init__(self, corpus):
        self.corpus=corpus.lower().replace("\n", " ").replace("."," ")
        self.corpus=re.sub(r"[^\w\s]","",self.corpus)
        self.tokens=re.split(r"\s+", self.corpus)+["<unk>"]
        count_token=Counter(self.tokens)
        self.token_freqs=[(*t,i) for i,t in enumerate(sorted(count_token.items(), key=lambda x:x[1], reverse=True))]
        self.token_to_idx, self.idx_to_token={}, {}
        for t in self.token_freqs:
            self.token_to_idx[t[0]]=t[2]
            self.idx_to_token[t[2]]=t[0]

    def __call__(self, text):
        text=re.sub(r"[^\w\s]","", text.lower().replace("\n", " ").replace("."," "))
        tokens=re.split(r"\s+", text)
        result=[]
        for token in tokens:
            if token in self.token_to_idx: # dict lookup; same tokens as self.tokens, but O(1)
                result.append(self.token_to_idx[token])
            else:
                result.append(self.token_to_idx["<unk>"])
        return result

    def __len__(self):
        return len(self.token_to_idx)

    def decode(self, idx):
        if isinstance(idx, int):
            return self.idx_to_token[idx]
        elif isinstance(idx, list):
            return [self.idx_to_token[i] for i in idx]

In [12]:
t=Tokenizer(text)

corpus_tensor=torch.tensor(t(t.corpus), dtype=torch.long)
y_corpus_tensor=torch.cat([corpus_tensor[1:], corpus_tensor[:1]]) # the target at position i is the next token i+1 (wraps around at the end)

corpus_tensor=corpus_tensor.reshape(127,-1)
y_corpus_tensor=y_corpus_tensor.reshape(127,-1)

In [13]:
corpus_indices = t(t.corpus)  
seq_len = 4 

X_list, y_list = [], []

for i in range(len(corpus_indices) - seq_len):
    X_list.append(corpus_indices[i:i+seq_len])
    y_list.append(corpus_indices[i+1:i+seq_len+1]) 

corpus_tensor = torch.tensor(X_list, dtype=torch.long)       # (num_samples, seq_len)
y_corpus_tensor = torch.tensor(y_list, dtype=torch.long)     # (num_samples, seq_len)

print("corpus_tensor shape:", corpus_tensor.shape)
print("y_corpus_tensor shape:", y_corpus_tensor.shape)


corpus_tensor shape: torch.Size([1266, 4])
y_corpus_tensor shape: torch.Size([1266, 4])


In [14]:
net=SimpleRNN(len(t), 2*len(t), 4*len(t))

In [15]:
criterion=nn.CrossEntropyLoss()
optimizer=torch.optim.Adam(net.parameters(), lr=1e-3)
net.train()

SimpleRNN(
  (embedding): Embedding(559, 1118)
  (rnn): RNN(1118, 2236, batch_first=True)
  (fc): Linear(in_features=2236, out_features=559, bias=True)
)

In [16]:
for i in range(38):
    optimizer.zero_grad()
    y_pred=net(corpus_tensor)
    loss=criterion(y_pred.reshape(-1,len(t)), y_corpus_tensor.flatten())
    loss.backward()
    gradient_clipping(net, 1) #or you can simply apply it on net.rnn
    optimizer.step()
    print(loss)

tensor(6.3515, grad_fn=<NllLossBackward0>)
tensor(4.8831, grad_fn=<NllLossBackward0>)
tensor(3.5703, grad_fn=<NllLossBackward0>)
tensor(2.4826, grad_fn=<NllLossBackward0>)
tensor(1.6686, grad_fn=<NllLossBackward0>)
tensor(1.1327, grad_fn=<NllLossBackward0>)
tensor(0.8184, grad_fn=<NllLossBackward0>)
tensor(0.6395, grad_fn=<NllLossBackward0>)
tensor(0.5371, grad_fn=<NllLossBackward0>)
tensor(0.4765, grad_fn=<NllLossBackward0>)
tensor(0.4390, grad_fn=<NllLossBackward0>)
tensor(0.4154, grad_fn=<NllLossBackward0>)
tensor(0.3998, grad_fn=<NllLossBackward0>)
tensor(0.3900, grad_fn=<NllLossBackward0>)
tensor(0.3839, grad_fn=<NllLossBackward0>)
tensor(0.3795, grad_fn=<NllLossBackward0>)
tensor(0.3765, grad_fn=<NllLossBackward0>)
tensor(0.3744, grad_fn=<NllLossBackward0>)
tensor(0.3721, grad_fn=<NllLossBackward0>)
tensor(0.3700, grad_fn=<NllLossBackward0>)
tensor(0.3684, grad_fn=<NllLossBackward0>)
tensor(0.3674, grad_fn=<NllLossBackward0>)
tensor(0.3667, grad_fn=<NllLossBackward0>)
tensor(0.36

In [19]:
net.eval()
def predict(text):
    inputs=torch.tensor(t(text)).reshape(1,-1)
    with torch.no_grad(): # inference only, no need to track gradients
        logits=net(inputs)
    return t.decode(torch.argmax(logits,dim=2).flatten().tolist())

In [20]:
results=["I mean"]
final_seq_len=20
for i in range(final_seq_len-1):
    new_word=predict(" ".join(results))[-1]
    print(" ".join(results))
    results.append(new_word)
print(" ".join(results))

I mean
I mean neuroscientists
I mean neuroscientists wanted
I mean neuroscientists wanted to
I mean neuroscientists wanted to investigate
I mean neuroscientists wanted to investigate how
I mean neuroscientists wanted to investigate how the
I mean neuroscientists wanted to investigate how the mind
I mean neuroscientists wanted to investigate how the mind operates
I mean neuroscientists wanted to investigate how the mind operates under
I mean neuroscientists wanted to investigate how the mind operates under stress
I mean neuroscientists wanted to investigate how the mind operates under stress so
I mean neuroscientists wanted to investigate how the mind operates under stress so we
I mean neuroscientists wanted to investigate how the mind operates under stress so we visited
I mean neuroscientists wanted to investigate how the mind operates under stress so we visited some
I mean neuroscientists wanted to investigate how the mind operates under stress so we visited some local
I mean neurosci

### 2.3 Long Short-Term Memory(LSTM)

>*You should never use an RNN these days. You should always use an LSTM.*
> *Christopher Manning*

A vanilla RNN doesn't handle long-term dependencies well. So LSTM (Long Short-Term Memory) was proposed. It introduces a memory cell, a candidate memory cell used to update it, and 3 gates: a forget gate, an input gate and an output gate, which decide to what extent to delete from, write to and read from the memory cell, respectively.

First, we compute the 3 gates, whose values are constrained to between 0 and 1 by the sigmoid. In the extreme cases, when a gate value is close to zero, the gate effectively blocks information flow; when it is close to one, the gate lets information pass through almost fully.

$$
\begin{aligned}
F_t&=\sigma(X_{t}W_{xf} + H_{t-1}W_{hf} + b_f) \\
I_t&=\sigma(X_{t}W_{xi} + H_{t-1}W_{hi} + b_i) \\
O_t&=\sigma(X_{t}W_{xo} + H_{t-1}W_{ho} + b_o)
\end{aligned}
$$

Then compute the candidate memory cell.

$$
\tilde{C_t}=\tanh(X_{t}W_{xc} + H_{t-1}W_{hc} + b_c)
$$

Apply the forget gate to the previous memory cell, and apply the input gate to the current candidate cell to get the current memory cell.

$$
C_{t}=F_t \odot C_{t-1} + I_t \odot \tilde{C_t}
$$

Apply the output gate to read from the memory cell, and get the hidden state.

$$
H_{t}=O_t \odot \tanh(C_{t})
$$

Finally, we can connect a classification head to $H_t$.
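
The four equations above can be traced by hand in a scalar setting. The following is a minimal sketch, assuming scalar inputs and states so that every matrix product becomes an ordinary multiplication; `lstm_step` and the parameter dictionary are names invented for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step for scalar input and state; p maps names like 'w_xf' to floats."""
    f = sigmoid(x * p["w_xf"] + h_prev * p["w_hf"] + p["b_f"])          # forget gate F_t
    i = sigmoid(x * p["w_xi"] + h_prev * p["w_hi"] + p["b_i"])          # input gate I_t
    o = sigmoid(x * p["w_xo"] + h_prev * p["w_ho"] + p["b_o"])          # output gate O_t
    c_tilde = math.tanh(x * p["w_xc"] + h_prev * p["w_hc"] + p["b_c"])  # candidate memory cell
    c = f * c_prev + i * c_tilde                                        # new memory cell C_t
    h = o * math.tanh(c)                                                # new hidden state H_t
    return h, c

# With all weights and biases at zero, every gate is sigmoid(0) = 0.5
# and the candidate is tanh(0) = 0, so the step is easy to verify by hand.
zero_params = {name: 0.0 for name in
               ["w_xf", "w_hf", "b_f", "w_xi", "w_hi", "b_i",
                "w_xo", "w_ho", "b_o", "w_xc", "w_hc", "b_c"]}
h1, c1 = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
print(h1, c1)
```

Here the forget gate keeps half of the previous cell value (2.0 becomes 1.0), and the output gate exposes half of $\tanh(1)$ as the hidden state.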

### 2.4 GRU(Gated Recurrent Unit)
GRU is considered to be simpler than LSTM and often achieves comparable performance.

A GRU has 2 gates: a reset gate and an update gate.

$$
\begin{aligned}
R_t&=\sigma(X_{t}W_{xr} + H_{t-1}W_{hr} + b_r) \\
U_t&=\sigma(X_{t}W_{xu} + H_{t-1}W_{hu} + b_u) \\
\end{aligned}
$$

Then we use the reset gate to control how much past information is used to build a candidate hidden state.

$$
\tilde{H_t}=\tanh(X_{t}W_{xh} + (R_t \odot H_{t-1})W_{hh} + b_h)
$$

Next, the update gate determines to what extent we update the hidden state.

$$
H_t=U_t \odot H_{t-1} + (1-U_t) \odot \tilde{H_t}
$$

The update formula for $H_t$ has the form of an Exponential Moving Average (EMA), with $\tilde{H_t}$ as the new observation and $1-U_t$ playing the role of $\alpha_t$. The definition of EMA is:

$$
EMA_{t} = \alpha_{t} x_t + (1-\alpha_{t}) EMA_{t-1}
$$
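
To make the correspondence concrete, here is a minimal scalar sketch of one GRU step next to one EMA step. It assumes scalar inputs and states, and the weights are arbitrary numbers picked for illustration; `gru_step` and `ema_step` are hypothetical helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step for scalar input and state, following the equations above."""
    r = sigmoid(x * p["w_xr"] + h_prev * p["w_hr"] + p["b_r"])                # reset gate R_t
    u = sigmoid(x * p["w_xu"] + h_prev * p["w_hu"] + p["b_u"])                # update gate U_t
    h_tilde = math.tanh(x * p["w_xh"] + (r * h_prev) * p["w_hh"] + p["b_h"])  # candidate state
    h = u * h_prev + (1 - u) * h_tilde
    return h, u, h_tilde

def ema_step(x, ema_prev, alpha):
    """EMA_t = alpha * x_t + (1 - alpha) * EMA_{t-1}."""
    return alpha * x + (1 - alpha) * ema_prev

# Arbitrary illustrative weights.
params = {"w_xr": 0.3, "w_hr": -0.2, "b_r": 0.0,
          "w_xu": 0.5, "w_hu": 0.1, "b_u": 0.0,
          "w_xh": 0.7, "w_hh": 0.4, "b_h": 0.0}
h, u, h_tilde = gru_step(x=1.0, h_prev=0.5, p=params)
print(h, ema_step(h_tilde, 0.5, alpha=1 - u))  # the two values coincide
```

The difference from a textbook EMA is that the mixing weight is not a fixed constant: the update gate recomputes it at every step from the input and the previous state.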

### 2.5 Why do LSTM and GRU effectively alleviate gradient vanishing?

The key lies in the "+" sign in the updates of $C_t$ (LSTM) and $H_t$ (GRU). Similar to the residual connection (ResNet, Kaiming He), the "+" provides a term with relatively stable gradients. In ResNet, the term is $x$, whose derivative is always 1. In LSTM, the term is $F_t \odot C_{t-1}$, whose direct derivative w.r.t. $C_{t-1}$ is $F_t$, a value between 0 and 1 that the network can learn to keep close to 1; in GRU it is $U_t \odot H_{t-1}$, following the same principle. In contrast, vanilla RNNs (the most basic RNNs) don't have such a stable path for gradients to flow through.
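
A back-of-the-envelope comparison of the two gradient paths, using assumed per-step factors rather than values from a trained model: a forget gate of 0.98 at every step for the LSTM cell path, and a factor of 0.5 for $|W_{hh}\Phi'(\cdot)|$ at every step of a vanilla RNN. The full LSTM gradient also has contributions through the gates, which this sketch ignores.

```python
def product(values):
    result = 1.0
    for v in values:
        result *= v
    return result

steps = 50
# Assumed per-step factors, for illustration only.
lstm_path = product([0.98] * steps)   # product of forget gates along the cell-state path
rnn_path = product([0.5] * steps)     # product of |w * tanh'(.)| in a vanilla RNN
print(lstm_path, rnn_path)
```

After 50 steps the first product is still of order 0.1 to 1, while the second is close to machine precision, which is the vanishing behaviour described above.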

### Evaluation of Language Models
#### Perplexity
Given a sentence of n words, its perplexity is:
$$
\begin{align}
perplexity
& = P(w_1, w_2, w_3, ..., w_n)^{- \frac{1}{n}} \\
& = (P(w_1) \cdot P(w_2|w_1) \cdot ... \cdot P(w_n|w_{n-1}, w_{n-2},...,w_1))^{-\frac{1}{n}}
\end{align}
$$

Usually we input a special token "\<BOS>" (Beginning of Sentence) to get $P(w_1)$.
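
Perplexity is usually computed in log space to avoid underflow from multiplying many small probabilities. The sketch below assumes we already have the conditional probability the model assigned to each word of the sentence; the function name and the numbers are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the conditional probabilities P(w_t | w_1..w_{t-1}) of each token."""
    n = len(token_probs)
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_likelihood)

# A model that is always "choosing among 4 equally likely words" has perplexity 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# The log-space result matches the direct definition P(w_1..w_n)^(-1/n).
probs = [0.5, 0.25, 0.125]
direct = (0.5 * 0.25 * 0.125) ** (-1 / 3)
print(uniform, perplexity(probs), direct)
```

The quantity inside the exponential is the average cross-entropy loss per token (in nats), so the perplexity of a model is the exponential of its mean token-level cross-entropy.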

In [5]:
model=SimpleRNN(100,200,1)

In [9]:
for i in model.parameters():
    print(type(i))
    break

True
<class 'torch.nn.parameter.Parameter'>
