# FastAI NLP Study Notes

In this notebook, we try to replicate the models implemented in [A Language Model from Scratch](https://github.com/fastai/fastbook/blob/master/12_nlp_dive.ipynb).

We will use the *Human Numbers* dataset, which simply contains the first 10,000 numbers written out in English.

In [18]:
from fastcore.utils import L
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)

In [22]:
lines = L()
with open(path / "train.txt", 'r') as f: lines += L(*f.readlines())
with open(path / "valid.txt", 'r') as f: lines += L(*f.readlines())
lines[:5], lines[-5:]

((#5) ['one \n','two \n','three \n','four \n','five \n'],
 (#5) ['nine thousand nine hundred ninety five \n','nine thousand nine hundred ninety six \n','nine thousand nine hundred ninety seven \n','nine thousand nine hundred ninety eight \n','nine thousand nine hundred ninety nine \n'])

The dataset is simple enough that we can build our own **tokenizer** and **word-to-index** mapping.

In [26]:
text = " . ".join([x.strip() for x in lines])
tokens = text.split(' ')
tokens[:5], tokens[-5:]

(['one', '.', 'two', '.', 'three'],
 ['thousand', 'nine', 'hundred', 'ninety', 'nine'])

In [34]:
vocab = L(*tokens).unique()
word2idx = {t:i for i, t in enumerate(vocab)}
indices = L(word2idx[t] for t in tokens)
indices

(#63095) [0,1,2,1,3,1,4,1,5,1...]
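
To make the two steps above concrete, here is a minimal plain-Python sketch of the same idea on a toy corpus. The sample lines and the `sample_*` names are made up for illustration, and `dict.fromkeys` stands in for `L.unique()` on the assumption that both keep tokens in order of first appearance (which the output above suggests).

```python
# Toy corpus standing in for the real lines (hypothetical sample data).
sample_lines = ["one \n", "two \n", "twenty one \n"]

# Join the lines with " . " as a separator token, then split on spaces.
sample_text = " . ".join(line.strip() for line in sample_lines)
sample_tokens = sample_text.split(" ")

# dict.fromkeys keeps first-seen order, so ids follow order of appearance.
sample_vocab = list(dict.fromkeys(sample_tokens))
sample_word2idx = {tok: i for i, tok in enumerate(sample_vocab)}
sample_indices = [sample_word2idx[tok] for tok in sample_tokens]

# Decoding the ids recovers the original token stream.
decoded = [sample_vocab[i] for i in sample_indices]

print(sample_tokens)   # ['one', '.', 'two', '.', 'twenty', 'one']
print(sample_indices)  # [0, 1, 2, 1, 3, 0]
```

The repeated `1` in the ids is the `.` separator, which is why it also shows up in every other position of the real `indices`.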

## Our First Language Model from Scratch

We will build a model that predicts the next word by looking at the previous 3 words.

In [192]:
import torch
from torch import tensor

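# Each sample is (3 context tokens, next token); a stride of 4 keeps the samples non-overlapping.
raw_seqs = [(tokens[i:(i+3)], tokens[i+3]) for i in range(0, len(tokens)-3, 4)]
raw_seqs[:5], raw_seqs[-5:]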

([(['one', '.', 'two'], '.'),
  (['three', '.', 'four'], '.'),
  (['five', '.', 'six'], '.'),
  (['seven', '.', 'eight'], '.'),
  (['nine', '.', 'ten'], '.')],
 [(['ninety', 'six', '.'], 'nine'),
  (['thousand', 'nine', 'hundred'], 'ninety'),
  (['seven', '.', 'nine'], 'thousand'),
  (['nine', 'hundred', 'ninety'], 'eight'),
  (['.', 'nine', 'thousand'], 'nine')])

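Since the model cannot consume raw text, we build the same kind of samples from token indices instead. Note that this cell uses a stride of 3, so the target of each sample is also the first input of the next one.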

In [206]:
seqs = L((tensor(indices[i:(i+3)]), indices[i+3]) for i in range(0, len(indices)-4, 3))
seqs

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]
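
The effect of the stride is easier to see on a short list of integers. The helper below is a hypothetical sketch (`make_windows` is not part of the notebook) that builds `(context, target)` pairs the same way, with the stride as a parameter.

```python
def make_windows(ids, context=3, stride=3):
    """Return (context_ids, target_id) pairs taken every `stride` positions."""
    return [(ids[i:i + context], ids[i + context])
            for i in range(0, len(ids) - context, stride)]

ids = list(range(10))  # stand-in for `indices`

stride3 = make_windows(ids, stride=3)
stride4 = make_windows(ids, stride=4)

print(stride3)  # [([0, 1, 2], 3), ([3, 4, 5], 6), ([6, 7, 8], 9)]
print(stride4)  # [([0, 1, 2], 3), ([4, 5, 6], 7)]
```

With a stride of 3, every token is used as an input exactly once and each target doubles as the first input of the following sample. With a stride of 4, targets are never reused as inputs, which yields fewer samples from the same text.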

### A Simple Language Model - Explicit Layers

Model architecture:

<img src="https://raw.githubusercontent.com/Climbo-Dev/climbo-code-samples/main/images/fastbook_12_nlp_att_00022.png" width="500">

- rectangles are embeddings
- arrows are linear transformations
- ellipses are hidden layer transformations (linear combination + ReLU)

In [148]:
class LMModel(Module):
    def __init__(self, vocab_size, n_hidden):
        self.i_h = nn.Embedding(vocab_size, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        # the output is a long vector with one score per vocabulary word, so we know which word is predicted.
        self.h_o = nn.Linear(n_hidden, vocab_size)

    def forward(self, x):
        i_h, h_h, h_o = self.i_h, self.h_h, self.h_o

        x1 = i_h(x[:,0])
        x2 = i_h(x[:,1])
        x3 = i_h(x[:,2])
        
        h = F.relu(h_h(x1))
        h = F.relu(h_h(x2) + h)
        h = F.relu(h_h(x3) + h)

        return h_o(h)

Load data

In [204]:
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)

Let's see how the model works on a batch:

In [151]:
model = LMModel(len(vocab), 10)
x, y = next(iter(dls.train))
output = model(x)
print(x.shape, y.shape, output.shape)
output

torch.Size([64, 3]) torch.Size([64]) torch.Size([64, 30])


tensor([[-0.1754, -0.1120,  0.9104,  ...,  0.6732,  0.3281,  0.2257],
        [ 0.5163,  0.2517,  0.9616,  ...,  0.6012, -0.3431,  0.3015],
        [-0.2849, -0.0460,  0.7778,  ...,  0.6773,  0.3038,  0.3074],
        ...,
        [-0.5219, -0.2838,  1.2541,  ...,  0.7349,  0.5050,  0.3316],
        [-0.2661, -0.0940,  0.9655,  ...,  0.7476,  0.3570,  0.4005],
        [-0.4831, -0.2836,  1.2013,  ...,  0.6834,  0.4560,  0.3107]],
       grad_fn=<AddmmBackward0>)
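
Each row of this output holds 30 raw scores (logits), one per vocabulary word. `F.cross_entropy` turns them into probabilities with a softmax internally and penalizes the negative log-probability of the correct word. The plain-Python sketch below is only an illustration of that computation for a single row; the `softmax` and `cross_entropy` helpers are written here for clarity and are not the PyTorch implementations.

```python
import math

vocab_size = 30  # taken from the output shape [64, 30] above

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct token."""
    return -math.log(softmax(logits)[target])

# A model that scores every token equally is maximally unsure.
uniform_loss = cross_entropy([0.0] * vocab_size, target=7)
print(round(uniform_loss, 4))  # 3.4012, i.e. ln(30)
```

This gives a useful reference point: a loss close to ln(30) ≈ 3.40 means the model is doing no better than guessing uniformly over the vocabulary.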

Now we can train the model:

In [153]:
learner = Learner(dls, LMModel(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learner.fit_one_cycle(4, 0.001)

epoch,train_loss,valid_loss,accuracy,time
0,2.146425,2.331466,0.311861,00:04
1,1.512226,1.875539,0.448776,00:04
2,1.474846,1.683619,0.49275,00:04
3,1.423842,1.701277,0.414309,00:04


Compare the result with a simple baseline that always predicts the most frequent token in the validation set:

In [191]:
import pandas as pd
# Collect the targets of all validation batches into one series.
ys = torch.concat([y for _, y in dls.valid])
ys = pd.Series(ys)
# Most frequent target index and its count.
most_frequent = ys.value_counts().head(1)
mw = vocab[most_frequent.index[0]]
mf = most_frequent.values[0] / ys.shape[0]
print(f"most frequent word: {mw}, frequency of the most frequent word: {mf}")

most frequent word: thousand, frequency of the most frequent word: 0.15165200855716662
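
The same baseline can be sketched without pandas. The function below is a hypothetical helper using `collections.Counter`, and the toy targets are made up; in the notebook the input would be the validation targets mapped back to words.

```python
from collections import Counter

def majority_baseline(targets):
    """Return the most common target and the accuracy of always predicting it."""
    token, count = Counter(targets).most_common(1)[0]
    return token, count / len(targets)

# Hypothetical toy targets, not the real validation labels.
toy_targets = ["thousand", "nine", "thousand", ".", "hundred", "thousand", "nine", "."]
token, acc = majority_baseline(toy_targets)
print(token, acc)  # thousand 0.375
```

On the real validation set this baseline reaches about 0.15 accuracy, so the 0.41 to 0.49 reached by the model above is a clear improvement.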


### A Simple Recurrent Language Model

We can simplify the previous model by replacing the three explicit steps with a for loop:

<img src="https://raw.githubusercontent.com/Climbo-Dev/climbo-code-samples/main/images/fastbook_12_nlp_att_00070.png" width="500">

In [207]:
class LMModel2(Module):
    def __init__(self, vocab_size, n_hidden):
        self.i_h = nn.Embedding(vocab_size, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_size)

    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
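
To check what the loop computes, here is a toy scalar sketch in plain Python. The embedding values, weight, and bias are made-up numbers standing in for `i_h` and `h_h`. It compares the loop with its unrolled form, and also with the order of operations used in `LMModel.forward`, which applies `h_h` to the embedding first and then adds the previous hidden state.

```python
def relu(v):
    return max(0.0, v)

# Hypothetical scalar stand-ins: one "embedding" value per token id,
# and a single shared weight and bias playing the role of h_h.
emb = {0: 0.5, 1: -0.2, 2: 0.8}
w, b = 0.9, 0.1

def unrolled(x):
    h = relu(w * emb[x[0]] + b)
    h = relu(w * (h + emb[x[1]]) + b)
    h = relu(w * (h + emb[x[2]]) + b)
    return h

def looped(x):
    h = 0.0
    for tok in x:
        h = h + emb[tok]        # add the current token's embedding
        h = relu(w * h + b)     # shared hidden-to-hidden transformation
    return h

def linear_then_add(x):
    # Order used in LMModel.forward: transform the embedding, then add h.
    h = relu(w * emb[x[0]] + b)
    h = relu(w * emb[x[1]] + b + h)
    h = relu(w * emb[x[2]] + b + h)
    return h

x = [0, 1, 2]
print(unrolled(x), looped(x), linear_then_add(x))
```

The loop and its unrolled form agree, while the `linear_then_add` variant gives a different value, so the two model classes are close relatives rather than exact equivalents.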

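The model is meant to mirror `LMModel`, with one small difference: here the embedding is added to the hidden state before `h_h` is applied, whereas `LMModel` applies `h_h` to the embedding and then adds the previous hidden state. We train it with the same data and settings: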

In [208]:
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)

# Train the Model
learner = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learner.fit_one_cycle(4, 0.001)

epoch,train_loss,valid_loss,accuracy,time
0,1.854017,1.998687,0.461136,00:04
1,1.389332,1.871393,0.473972,00:04
2,1.423766,1.686019,0.489898,00:04
3,1.376569,1.665637,0.467079,00:04


### A Recurrent Language Model with Memory

The previous model resets the hidden state to 0 on every forward pass. However, if we feed the model consecutive chunks of a sequence, the hidden state computed from the previous words should be useful for predicting the current ones.

<img src="https://raw.githubusercontent.com/Climbo-Dev/climbo-code-samples/main/images/fastbook_12_nlp_att_00024.png" width="500">

To allow our language model to remember information from prior states, we initialize the hidden state once and carry its value between forward passes:

In [223]:
class LMModel3(Module):
    def __init__(self, vocab_size, n_hidden, detach:bool):
        self.i_h = nn.Embedding(vocab_size, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_size)
        self.h = 0
        self.detach = detach

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        output = self.h_o(self.h)
        if self.detach:
            self.h = self.h.detach()  # !!! Always detach h: keep its value but drop the graph of earlier batches (see below)
        return output

    def reset(self):
        self.h = 0

Since this model needs to see consecutive tokens, we have to arrange the batches ourselves.

The `group_chunks` function below reorders the original sequence so that the batches, when processed sequentially, reproduce the original word order within each batch row. (This approach is admittedly hacky.)

In [216]:
def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds

chunks = group_chunks(raw_seqs[:50], 10) # visualize using original tokens
chunks

(#50) [(['one', '.', 'two'], '.'),(['eleven', '.', 'twelve'], '.'),(['twenty', 'one', '.'], 'twenty'),(['.', 'twenty', 'eight'], '.'),(['.', 'thirty', 'five'], '.'),(['.', 'forty', 'two'], '.'),(['eight', '.', 'forty'], 'nine'),(['five', '.', 'fifty'], 'six'),(['two', '.', 'sixty'], 'three'),(['sixty', 'nine', '.'], 'seventy')...]
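
The reordering is easier to follow with integers in place of samples. The sketch below uses a plain-list version of the same indexing (`group_chunks_plain` is a hypothetical name) and then slices the result into batches the way a non-shuffling data loader would.

```python
def group_chunks_plain(ds, bs):
    """Plain-list version of group_chunks, for illustration only."""
    m = len(ds) // bs  # number of batches
    return [ds[i + m * j] for i in range(m) for j in range(bs)]

ds = list(range(12))  # stand-in for 12 consecutive samples
bs = 3
reordered = group_chunks_plain(ds, bs)

# Slice into batches in order, as a data loader with shuffle=False would.
batches = [reordered[k:k + bs] for k in range(0, len(reordered), bs)]
# Follow each batch row across successive batches.
rows = [[batch[j] for batch in batches] for j in range(bs)]

print(batches)  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
print(rows)     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Row `j` of the hidden state therefore always sees samples `j*m, j*m+1, ...` in order, which is what lets the carried-over state stay meaningful. This only holds if the loader keeps the order (no shuffling) and every batch has exactly `bs` items.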

In [217]:
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs), 
    group_chunks(seqs[cut:], bs), 
    bs=bs, drop_last=True, shuffle=False)

In [224]:
learn = Learner(dls, LMModel3(len(vocab), 64, True), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.70763,1.826244,0.471154,00:04
1,1.284969,1.817968,0.475,00:04
2,1.114166,1.677932,0.457452,00:04
3,1.014055,1.539132,0.533413,00:04
4,0.987971,1.45322,0.541106,00:04
5,0.95458,1.564336,0.564904,00:04
6,0.887796,1.496529,0.599279,00:04
7,0.829708,1.559965,0.599519,00:04
8,0.793747,1.542842,0.610096,00:04
9,0.774882,1.516615,0.605288,00:04


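If we do not detach, the hidden state keeps a reference to the computation graph of the previous batch. Backpropagating through an ever-growing history would make training slower and more memory-hungry. In practice, PyTorch has already freed that graph after the previous `backward()` call, so training fails on the second batch: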

In [225]:
learn = Learner(dls, LMModel3(len(vocab), 64, False), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(4, 3e-3)

epoch,train_loss,valid_loss,accuracy,time


RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
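
The error can be reproduced outside fastai with a toy scalar example. This is a minimal sketch, not the notebook's model: `w` stands in for the model weights, `h` for the carried hidden state, and the loss is an arbitrary made-up function.

```python
import torch

# Toy parameter standing in for the model weights (hypothetical values).
w = torch.tensor(0.5, requires_grad=True)

def run_two_batches(detach):
    h = torch.tensor(1.0)  # initial hidden state, no gradient history
    for step in range(2):
        h = torch.tanh(w * h)   # h now depends on w through this step's graph
        loss = (h - 1.0) ** 2
        loss.backward()         # frees the saved tensors of the graph just used
        if detach:
            h = h.detach()      # keep the value, drop the link to the freed graph
    return h

h_ok = run_two_batches(detach=True)  # runs without error

try:
    run_two_batches(detach=False)
    failed = False
except RuntimeError:
    # The second backward() reaches the first step's graph, which is already freed.
    failed = True

print(failed)  # True
```

Detaching keeps the numeric value of the hidden state, so the model still "remembers" earlier batches, but gradients only flow through the current batch.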