# FastAI NLP Study Notes

In this notebook, we try to replicate models implemented in [A Language Model from Scratch](https://github.com/fastai/fastbook/blob/master/12_nlp_dive.ipynb). 

We will use the *Human Numbers* dataset, which contains the first 10,000 numbers written out in English.

In [1]:
from fastcore.utils import L
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)

In [2]:
lines = L()
with open(path / "train.txt", 'r') as f: lines += L(*f.readlines())
with open(path / "valid.txt", 'r') as f: lines += L(*f.readlines())
lines[:5], lines[-5:]

((#5) ['one \n','two \n','three \n','four \n','five \n'],
 (#5) ['nine thousand nine hundred ninety five \n','nine thousand nine hundred ninety six \n','nine thousand nine hundred ninety seven \n','nine thousand nine hundred ninety eight \n','nine thousand nine hundred ninety nine \n'])

The dataset is simple enough that we can build our own **tokenizer** and **word-to-index** mapping (numericalization).

In [3]:
text = " . ".join([x.strip() for x in lines])
tokens = text.split(' ')
tokens[:5], tokens[-5:]

(['one', '.', 'two', '.', 'three'],
 ['thousand', 'nine', 'hundred', 'ninety', 'nine'])

In [4]:
vocab = L(*tokens).unique()
word2idx = {t:i for i, t in enumerate(vocab)}
indices = L(word2idx[t] for t in tokens)
indices

(#63095) [0,1,2,1,3,1,4,1,5,1...]
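As a sanity check, here is a minimal pure-Python sketch of the same tokenize-then-numericalize idea on a toy input. The toy lines and the `toy_*` names are illustrative and not part of the notebook; `dict.fromkeys` is used as a stand-in for `L.unique()`, assuming first-seen order is what we want.

```python
# Toy version of the tokenizer + numericalization above (illustrative data).
toy_lines = ["one \n", "two \n", "twenty one \n"]
toy_text = " . ".join(x.strip() for x in toy_lines)
toy_tokens = toy_text.split(" ")

# dict.fromkeys keeps first-seen order, so each new word gets the next index.
toy_vocab = list(dict.fromkeys(toy_tokens))
toy_word2idx = {t: i for i, t in enumerate(toy_vocab)}
toy_indices = [toy_word2idx[t] for t in toy_tokens]

# Decoding the indices should give back the original token stream.
decoded = [toy_vocab[i] for i in toy_indices]
print(toy_tokens)
print(toy_indices)
```

Because the mapping is one-to-one, decoding the indices recovers the token stream exactly, which is what lets us inspect model inputs and predictions as words later on.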

## Our First Language Model from Scratch

We will build a model that predicts the next word by looking at the previous 3 words.

In [5]:
import torch
from torch import tensor

raw_seqs = [(tokens[i:(i+3)], tokens[i+3]) for i in range(0, len(tokens)-3, 4)]
raw_seqs[:5], raw_seqs[-5:]

([(['one', '.', 'two'], '.'),
  (['three', '.', 'four'], '.'),
  (['five', '.', 'six'], '.'),
  (['seven', '.', 'eight'], '.'),
  (['nine', '.', 'ten'], '.')],
 [(['ninety', 'six', '.'], 'nine'),
  (['thousand', 'nine', 'hundred'], 'ninety'),
  (['seven', '.', 'nine'], 'thousand'),
  (['nine', 'hundred', 'ninety'], 'eight'),
  (['.', 'nine', 'thousand'], 'nine')])

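Since we cannot feed our model raw text, we need to use indices instead. Note that the cell below advances 3 tokens per sample (so each target is also the first input of the next sample), while `raw_seqs` above advances 4, so the two lists do not line up sample for sample.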

In [6]:
seqs = L((tensor(indices[i:(i+3)]), indices[i+3]) for i in range(0, len(indices)-4, 3))
seqs

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]
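To see how the step size changes the samples, here is a small pure-Python sketch. The helper name `make_windows` and the toy sequence are hypothetical and only serve the illustration.

```python
def make_windows(seq, n_in=3, stride=3):
    """Return (input window, next token) pairs, advancing `stride` tokens each time."""
    return [(seq[i:i + n_in], seq[i + n_in])
            for i in range(0, len(seq) - n_in, stride)]

toy = list(range(10))  # stand-in for the `indices` list

windows_s3 = make_windows(toy, stride=3)  # target is reused as the next first input
windows_s4 = make_windows(toy, stride=4)  # every token is used exactly once
windows_s1 = make_windows(toy, stride=1)  # maximum overlap, most samples

print(windows_s3)
print(windows_s4)
print(len(windows_s1))
```

A smaller stride yields more (overlapping) samples from the same text, while a stride of 4 never shows a target token to the model as an input.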

### A Simple Language Model - Explicit Layers

Model architecture:

<img src="https://raw.githubusercontent.com/Climbo-Dev/climbo-code-samples/main/images/fastbook_12_nlp_att_00022.png" width="500">

- rectangles are embeddings
- arrows are linear transformations
- ellipses are hidden layer transformations (linear combination + ReLU)

In [7]:
class LMModel(Module):
    def __init__(self, vocab_size, n_hidden):
        self.i_h = nn.Embedding(vocab_size, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        # the output is a long vector with one score per vocabulary word, so we know which word we are predicting.
        self.h_o = nn.Linear(n_hidden, vocab_size)

    def forward(self, x):
        i_h, h_h, h_o = self.i_h, self.h_h, self.h_o

        x1 = i_h(x[:,0])
        x2 = i_h(x[:,1])
        x3 = i_h(x[:,2])
        
        h = F.relu(h_h(x1))
        h = F.relu(h_h(x2) + h)
        h = F.relu(h_h(x3) + h)

        return h_o(h)

Load the data:

In [8]:
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)

Let's see how the model works on a batch:

In [9]:
model = LMModel(len(vocab), 10)
x, y = next(iter(dls.train))
output = model(x)
print(x.shape, y.shape, output.shape)
output

torch.Size([64, 3]) torch.Size([64]) torch.Size([64, 30])


tensor([[ 0.0185, -0.4614,  0.5482,  ...,  0.8765, -0.1430,  1.4199],
        [ 0.0480, -0.5438, -0.2975,  ...,  1.0006,  0.8631,  1.0290],
        [-0.0047, -0.2217, -0.6078,  ...,  0.2592,  0.8370,  0.6079],
        ...,
        [-0.5158,  0.4641,  0.1081,  ...,  0.4880, -0.1568,  0.6477],
        [-0.2396,  0.2414,  0.4456,  ...,  0.6800, -0.4271,  0.6313],
        [-0.4916,  0.1453,  0.2579,  ...,  0.6784, -0.1513,  0.9050]],
       grad_fn=<AddmmBackward0>)
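Each row of this output holds one raw score (logit) per vocabulary word. The sketch below, using random stand-in tensors rather than the real model output, shows how such logits turn into probabilities, predictions, accuracy, and the cross-entropy loss used for training.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 30)            # stand-in for model(x): (batch, vocab)
targets = torch.tensor([1, 4, 1, 7])   # stand-in for y

probs = F.softmax(logits, dim=1)       # each row becomes a distribution over words
preds = logits.argmax(dim=1)           # index of the highest-scoring word per row
acc = (preds == targets).float().mean()

# Cross-entropy is the mean negative log-probability of the correct word.
manual_loss = -probs[torch.arange(4), targets].log().mean()
loss = F.cross_entropy(logits, targets)
print(preds, acc, loss, manual_loss)
```

`F.cross_entropy` takes the raw logits directly, which is why the model does not apply a softmax itself.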

Now we can train the model:

In [10]:
learner = Learner(dls, LMModel(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learner.fit_one_cycle(4, 0.001)

epoch,train_loss,valid_loss,accuracy,time
0,2.021665,2.377312,0.323033,00:02
1,1.490597,1.938966,0.44426,00:02
2,1.468995,1.673518,0.486095,00:02
3,1.426818,1.718666,0.376753,00:02


Compare the result with a simple baseline that always predicts the most frequent token:

In [11]:
import pandas as pd
ys = torch.concat([y for _, y in dls.valid])
ys = pd.Series(ys)
most_frequent = ys.value_counts().head(1)
mw = vocab[most_frequent.index[0]]
mf = most_frequent.values[0] / ys.shape[0]
print(f"most frequent word: {mw}, frequency of the most frequent word: {mf}")

most frequent word: thousand, frequency of the most frequent word: 0.15165200855716662


### A Simple Recurrent Language Model

We can simplify the previous model by using a simple for loop:

<img src="https://raw.githubusercontent.com/Climbo-Dev/climbo-code-samples/main/images/fastbook_12_nlp_att_00070.png" width="500">

In [12]:
class LMModel2(Module):
    def __init__(self, vocab_size, n_hidden):
        self.i_h = nn.Embedding(vocab_size, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_size)

    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

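This model is meant to have the same architecture as `LMModel`. One difference in the code as written: `LMModel` computes `relu(h_h(x) + h)` (the previous state is added after the linear layer), while `LMModel2` computes `relu(h_h(h + x))` (the embedding is added to the state before the linear layer), so the two are close but not identical. We train it on the same data: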

In [13]:
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)

# Train the Model
learner = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learner.fit_one_cycle(4, 0.001)

epoch,train_loss,valid_loss,accuracy,time
0,1.841735,2.018465,0.47207,00:02
1,1.367022,1.79769,0.480628,00:02
2,1.429261,1.671371,0.491799,00:02
3,1.385907,1.664714,0.489422,00:02
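One way to check whether a loop really reproduces an explicitly written model is to run both forms with shared weights. The sketch below uses standalone layers and a random dummy batch (not the notebook's models) and compares the loop from `LMModel2` with two unrolled variants.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, n_hidden = 30, 8
i_h = nn.Embedding(vocab_size, n_hidden)
h_h = nn.Linear(n_hidden, n_hidden)
x = torch.randint(0, vocab_size, (5, 3))   # dummy batch of 3-token inputs

# Loop form, as in LMModel2.forward (without the output layer)
h_loop = 0
for i in range(3):
    h_loop = F.relu(h_h(h_loop + i_h(x[:, i])))

# The same update rule written out step by step
h_unrolled = F.relu(h_h(i_h(x[:, 0])))
h_unrolled = F.relu(h_h(h_unrolled + i_h(x[:, 1])))
h_unrolled = F.relu(h_h(h_unrolled + i_h(x[:, 2])))

# LMModel's rule: apply h_h to the embedding first, then add the old state
h_alt = F.relu(h_h(i_h(x[:, 0])))
h_alt = F.relu(h_h(i_h(x[:, 1])) + h_alt)
h_alt = F.relu(h_h(i_h(x[:, 2])) + h_alt)

same = torch.allclose(h_loop, h_unrolled)
differs = not torch.allclose(h_loop, h_alt)
print(same, differs)
```

The loop matches the unrolled version only when both apply the same update rule, so where the previous state is added matters.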


### A Recurrent Language Model with Memory

The previous model resets the hidden state to 0 on every forward pass. However, if we feed the model consecutive chunks of one sequence, the hidden state computed from the previous words should be useful when processing the current ones.

<img src="https://raw.githubusercontent.com/Climbo-Dev/climbo-code-samples/main/images/fastbook_12_nlp_att_00024.png" width="500">

To let our language model remember information from prior states, we initialize the hidden state once and carry its value between forward passes:

In [14]:
class LMModel3(Module):
    def __init__(self, vocab_size, n_hidden, detach:bool):
        self.i_h = nn.Embedding(vocab_size, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_size)
        self.h = 0
        self.detach = detach

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        output = self.h_o(self.h)
        if self.detach:
            self.h = self.h.detach()  # !!! We should always detach h, see explanations below
        return output

    def reset(self):
        self.h = 0

Since this model needs to see consecutive tokens, we have to arrange the batches ourselves.

The `group_chunks` function below reorders the dataset so that, when batches are processed sequentially, row `i` of each batch continues from row `i` of the previous batch. (This method is hacky, though.)

In [15]:
def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds

chunks = group_chunks(raw_seqs[:50], 10) # visualize using original tokens
chunks

(#50) [(['one', '.', 'two'], '.'),(['eleven', '.', 'twelve'], '.'),(['twenty', 'one', '.'], 'twenty'),(['.', 'twenty', 'eight'], '.'),(['.', 'thirty', 'five'], '.'),(['.', 'forty', 'two'], '.'),(['eight', '.', 'forty'], 'nine'),(['five', '.', 'fifty'], 'six'),(['two', '.', 'sixty'], 'three'),(['sixty', 'nine', '.'], 'seventy')...]
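The reordering is easier to follow with integers in place of samples. The sketch below uses a plain-list copy of the function (the name `group_chunks_list` and the toy sizes are illustrative) and then reads each batch row across successive batches.

```python
def group_chunks_list(ds, bs):
    """Plain-list version of group_chunks, for illustration only."""
    m = len(ds) // bs
    out = []
    for i in range(m):
        out += [ds[i + m * j] for j in range(bs)]
    return out

ds = list(range(12))            # 12 consecutive samples
bs = 3
reordered = group_chunks_list(ds, bs)

# Split into batches the way an unshuffled DataLoader would
batches = [reordered[k:k + bs] for k in range(0, len(reordered), bs)]

# Follow each batch row across successive batches
rows = [[b[r] for b in batches] for r in range(bs)]
print(batches)
print(rows)
```

Each row walks through its own contiguous slice of the dataset, so the hidden state carried over for row `r` always belongs to the text that comes immediately before the current input in that row.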

In [16]:
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs), 
    group_chunks(seqs[cut:], bs), 
    bs=bs, drop_last=True, shuffle=False)

In [17]:
learn = Learner(dls, LMModel3(len(vocab), 64, True), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.732836,1.862581,0.474279,00:02
1,1.292882,1.808473,0.448317,00:02
2,1.080552,1.804297,0.490625,00:02
3,1.004886,1.654634,0.514183,00:02
4,0.958965,1.74239,0.550481,00:03
5,0.901038,1.753155,0.579567,00:02
6,0.862871,1.70584,0.579327,00:02
7,0.816607,1.590275,0.589183,00:02
8,0.782191,1.628739,0.601683,00:02
9,0.772643,1.620571,0.6,00:02


If we do not detach, we get a `RuntimeError`:

```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```

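**Why this happens**: without `detach()`, `self.h` still references the computation graph built for the previous batch. The saved tensors of that graph are freed by the previous `backward()` call, so when the next batch's `backward()` tries to propagate through the old hidden state it reaches a graph that no longer exists. Detaching keeps the value of `h` but drops its history, which also limits backpropagation to the current batch (truncated backpropagation through time). The cell below builds the non-detached variant; uncommenting its last line reproduces the error.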

In [21]:
learn = Learner(dls, LMModel3(len(vocab), 64, False), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
# learn.fit_one_cycle(4, 3e-3)
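The error can be reproduced without fastai. The following minimal sketch uses a single linear layer and random inputs as stand-ins for the model and data; the helper name `two_steps` is hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
lin = nn.Linear(4, 4)

def two_steps(detach):
    """Run two 'batches' that share a hidden state; return True if no error."""
    h = torch.zeros(1, 4)
    try:
        for _ in range(2):
            h = torch.relu(lin(h + torch.randn(1, 4)))
            h.sum().backward()      # frees the graph built for this step
            if detach:
                h = h.detach()      # keep the values, drop the history
        return True
    except RuntimeError:
        return False

ok_with_detach = two_steps(detach=True)
ok_without_detach = two_steps(detach=False)
print(ok_with_detach, ok_without_detach)
```

In the non-detached run the second `backward()` has to pass through the first step's graph, whose saved tensors were already released, which is the situation the error message describes.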

### Creating More Signals

In the previous examples, each 3-word input produced only one prediction, so the model received little training signal per sequence. It would be better to predict the next word after every word in the input.

In [109]:
sl = 16  # sequence length
seqs = L((tensor(indices[i:i+sl]), tensor(indices[i+1:i+sl+1])) for i in range(0, len(indices)-sl-1, sl))
print([L([vocab[o] for o in s]) for s in seqs[0]])
print([L([vocab[o] for o in s]) for s in seqs[1]])

[['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.', 'six', '.', 'seven', '.', 'eight', '.'], ['.', 'two', '.', 'three', '.', 'four', '.', 'five', '.', 'six', '.', 'seven', '.', 'eight', '.', 'nine']]
[['nine', '.', 'ten', '.', 'eleven', '.', 'twelve', '.', 'thirteen', '.', 'fourteen', '.', 'fifteen', '.', 'sixteen', '.'], ['.', 'ten', '.', 'eleven', '.', 'twelve', '.', 'thirteen', '.', 'fourteen', '.', 'fifteen', '.', 'sixteen', '.', 'seventeen']]


In [110]:
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs), group_chunks(seqs[cut:], bs), bs=bs, drop_last=True, shuffle=False)

In [111]:
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0

    def forward(self, x):
        outs = []
        for i in range(sl):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))
        self.h = self.h.detach()
        return torch.stack(outs, dim=1)

    def reset(self):
        self.h = 0

We need to define a new loss function. The target has size `bs x sl` and the model output (`input` below) has size `bs x sl x vocab_sz`, so we flatten both before passing them to `F.cross_entropy`:

In [112]:
def loss_func(input, target):
    return F.cross_entropy(input.view(-1, len(vocab)), target.view(-1))

Let's test how the model performs on a small batch:

In [124]:
x = torch.vstack([seqs[0][0], seqs[1][0]])
target = torch.vstack([seqs[0][1], seqs[1][1]])
model = LMModel4(len(vocab), 64)
input = model(x)
print("input:", input.size(), input.view(-1, len(vocab)).size())
print("target:", target.size(), target.view(-1).size())
print("loss:", loss_func(input, target))

input: torch.Size([2, 16, 30]) torch.Size([32, 30])
target: torch.Size([2, 16]) torch.Size([32])
loss: tensor(3.5521, grad_fn=<NllLossBackward0>)
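To convince ourselves that flattening does what we want, the sketch below compares three ways of computing the loss on random stand-in tensors (the `bs_`, `sl_` names are local to this example). It relies on `F.cross_entropy` also accepting `(N, C, d1)` logits with `(N, d1)` targets.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bs_, sl_, vocab_sz = 2, 16, 30
logits = torch.randn(bs_, sl_, vocab_sz)
target = torch.randint(0, vocab_sz, (bs_, sl_))

# 1. Flatten batch and sequence axes, as in loss_func above
flat = F.cross_entropy(logits.view(-1, vocab_sz), target.view(-1))

# 2. Move the class axis to position 1 instead of flattening
moved = F.cross_entropy(logits.transpose(1, 2), target)

# 3. Average the loss of every (sample, position) pair explicitly
per_tok = torch.stack([
    F.cross_entropy(logits[b, t][None], target[b, t][None])
    for b in range(bs_) for t in range(sl_)
]).mean()
print(flat, moved, per_tok)
```

All three give the same number: the flattened loss is simply the average over every predicted position in the batch.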


We can now train the model:

In [125]:
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func, metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.202699,3.065444,0.251709,00:01
1,2.281799,1.886787,0.400065,00:01
2,1.719919,1.768146,0.487712,00:01
3,1.414626,1.720758,0.530192,00:01
4,1.242322,1.813718,0.547526,00:01
5,1.101264,1.762658,0.569987,00:01
6,0.995921,1.701138,0.588053,00:01
7,0.89964,1.758914,0.593018,00:01
8,0.82368,1.759491,0.591064,00:01
9,0.769644,1.754826,0.582682,00:01


In [126]:
x, y = next(iter(dls.train))

### LSTM

<img src="https://raw.githubusercontent.com/Climbo-Dev/climbo-code-samples/main/images/fastbook_12_nlp_LSTM.png" width="800">

In [130]:
class LSTMCellSimple(Module):
    def __init__(self, ni, nh):
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)

    def forward(self, input, state):
        h, c = state
        h = torch.cat([h, input], dim=1)
        forget = torch.sigmoid(self.forget_gate(h))
        c = c * forget
        inp = torch.sigmoid(self.input_gate(h))
        cell = torch.tanh(self.cell_gate(h))
        c = c + inp * cell
        out = torch.sigmoid(self.output_gate(h))
        h = out * torch.tanh(c)
        return h, (h, c)
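Written as equations, the cell above corresponds to the standard LSTM update. The symbols are notation introduced here, not names from the notebook: $[h_{t-1}, x_t]$ is the concatenation computed by `torch.cat`, $W_\ast$ and $b_\ast$ are the weights and biases of the four `nn.Linear` layers, $\sigma$ is the sigmoid, and $\odot$ is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(cell gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The cell state $c_t$ is only scaled and added to, never passed through a linear layer, which is what helps information survive over many time steps.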

In practice, it's better to do one big matrix multiplication than four smaller ones to improve performance. 

In [216]:
class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.ih = nn.Linear(ni, 4 * nh)
        self.hh = nn.Linear(nh, 4 * nh)

    def forward(self, input, state):
        h, c = state
        # One big multiplication for all the gates is better than 4 smaller ones
        gates = (self.ih(input) + self.hh(h)).chunk(4, 1) # split into 4 chunks along dim=1
        igate, fgate, ogate = map(torch.sigmoid, gates[:3])
        cgate = gates[3].tanh()

        c = fgate * c + igate * cgate
        h = ogate * c.tanh()
        return h, (h,c)
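The fused layer computes exactly what four separate layers would. The sketch below checks this for the input-to-hidden part by copying slices of one large `nn.Linear` into four small ones; the sizes and variable names are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
ni, nh, bs_ = 10, 6, 4
fused = nn.Linear(ni, 4 * nh)              # one big layer for all four gates

# Four small layers that reuse slices of the fused weights and biases
small = []
for W, b in zip(fused.weight.chunk(4, 0), fused.bias.chunk(4, 0)):
    layer = nn.Linear(ni, nh)
    with torch.no_grad():
        layer.weight.copy_(W)
        layer.bias.copy_(b)
    small.append(layer)

x = torch.rand(bs_, ni)
fused_out = fused(x).chunk(4, 1)            # 4 tensors of shape (bs_, nh)
small_out = [layer(x) for layer in small]
match = all(torch.allclose(a, s, atol=1e-5)
            for a, s in zip(fused_out, small_out))
print(match)
```

Since the results agree, the choice between the two forms is about speed only: one large matrix multiplication usually runs faster than four small ones.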

Let's illustrate how the model works with dummy data:

In [171]:
bs = 4
ni = 10  # embedding length
nh = 64
model = LSTMCell(ni, nh)

x = torch.rand((bs, ni))
h = torch.rand((bs, nh))
c = torch.rand((bs, nh))

res, (h1, c1) = model(x, (h, c))
pred = nn.Linear(nh, len(vocab))(res)
print(x.size(), res.size(), h1.size(), c1.size(), pred.size())

torch.Size([4, 10]) torch.Size([4, 64]) torch.Size([4, 64]) torch.Size([4, 64]) torch.Size([4, 30])


Now we can build the full LSTM model:

In [221]:
class LMModel6(Module):
    def __init__(self, vocab_sz, n_embed, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_embed)
        self.cell = LSTMCell(n_embed, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.state = [torch.zeros(bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        outs = []
        state = self.state
        for i in range(sl):
            res, state = self.cell(self.i_h(x[:,i]), state)
            outs.append(self.h_o(res))
        self.state = [h.detach() for h in state]
        return torch.stack(outs, dim=1)

    def reset(self):
        for h in self.state: h.zero_()

In [219]:
# Let's regenerate the dataset
bs = 4
sl = 16  # sequence length
seqs = L((tensor(indices[i:i+sl]), tensor(indices[i+1:i+sl+1])) for i in range(0, len(indices)-sl-1, sl))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs), group_chunks(seqs[cut:], bs), bs=bs, drop_last=True, shuffle=False)

In [212]:
x, y = next(iter(dls.train))
out = LMModel6(len(vocab), 64, 64)(x)
print(len(vocab), x.size(), y.size(), out.size())
print(CrossEntropyLossFlat()(out, y))

30 torch.Size([4, 16]) torch.Size([4, 16]) torch.Size([4, 16, 30])
TensorBase(3.4138, grad_fn=<AliasBackward0>)


In [222]:
learn = Learner(dls, LMModel6(len(vocab), 64, 64), loss_func=CrossEntropyLossFlat(), metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,1.315539,1.654632,0.487389,00:09
1,0.776126,1.284034,0.641339,00:09
2,0.404424,1.415817,0.690831,00:09
3,0.313369,1.339205,0.733423,00:09
4,0.297468,1.569576,0.72105,00:09
5,0.296873,1.268781,0.726364,00:09
6,0.40831,1.720014,0.671716,00:09
7,0.308835,1.468801,0.670923,00:09
8,0.275845,1.220386,0.755869,00:09
9,0.290519,1.007425,0.746193,00:09


### Regularizing an LSTM

Reference: [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182)

#### Dropout

In [None]:
class Dropout(Module):
    def __init__(self, p): 
        self.p = p

    def forward(self, x):
        if not self.training: return x
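        mask = x.new(*x.shape).bernoulli_(1-self.p)  # keep each activation with probability 1-p
        return x * mask.div_(1-self.p)  # rescale survivors so the expected value is unchanged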
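The rescaling by `1/(1-p)` is what keeps the average activation the same in training and evaluation. A quick numerical check on a tensor of ones, with an illustrative `p` and size:

```python
import torch

torch.manual_seed(0)
p = 0.5
x = torch.ones(100_000)

mask = torch.empty_like(x).bernoulli_(1 - p)   # 1 with probability 1-p, else 0
out = x * mask / (1 - p)                        # rescale the survivors

kept_fraction = mask.mean().item()
values = sorted(out.unique().tolist())          # the distinct output values
mean_out = out.mean().item()
print(kept_fraction, values, mean_out)
```

About half of the entries are zeroed and the rest are doubled, so the mean stays close to 1.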

#### Activation Regularization (AR) and Temporal Activation Regularization (TAR)

**Activation Regularization (AR)**: let `activations` be the final activations produced by the LSTM. AR penalizes large activations by adding their mean square to the loss (similar to weight decay, but applied to activations instead of weights):

```
loss += alpha * activations.pow(2).mean()
```

**Temporal Activation Regularization (TAR)**: since we are dealing with sequential data, the outputs of the LSTM should change smoothly as we read tokens in order. TAR encourages this by penalizing the difference between consecutive activations. Our activations tensor has shape `bs x sl x n_hidden`, and consecutive activations are adjacent along the sequence-length axis, so TAR can be expressed as follows:

```
loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()
```
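Both penalties are easy to try on a dummy tensor. In the sketch below, `alpha` and `beta` are arbitrary illustrative strengths (not tuned values), and the activations are random or constant stand-ins.

```python
import torch

torch.manual_seed(0)
alpha, beta = 2.0, 1.0                 # illustrative strengths, not tuned values
activations = torch.randn(4, 16, 64)   # dummy (bs, sl, n_hidden) tensor

ar = alpha * activations.pow(2).mean()
tar = beta * (activations[:, 1:] - activations[:, :-1]).pow(2).mean()

# Activations that never change along the sequence axis incur no TAR penalty,
# but they are still penalized by AR for being non-zero.
constant = torch.ones(4, 16, 64)
tar_constant = beta * (constant[:, 1:] - constant[:, :-1]).pow(2).mean()
ar_constant = alpha * constant.pow(2).mean()
print(ar, tar, ar_constant, tar_constant)
```

The two terms are complementary: AR pushes activations toward zero, while TAR only cares about how much they change from one time step to the next.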

#### Training a Weight-Tied Regularized LSTM

**Weight Tying**: in a language model, the input embeddings represent a mapping from words to activations, and the output hidden layer represents a mapping from activations to words. We might expect, intuitively, that these mappings could be the same. In PyTorch, we can tie them by assigning the embedding's weight to the output layer:

In [258]:
vocab_len = 30
n_hidden = 5
i_h = nn.Embedding(vocab_len, n_hidden)
h_o = nn.Linear(n_hidden, vocab_len)
print(i_h.weight.size(), h_o.weight.size())
h_o.weight = i_h.weight

torch.Size([30, 5]) torch.Size([30, 5])
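The assignment works because both weight matrices have shape `vocab_len x n_hidden`. The sketch below repeats it on fresh layers and checks what tying implies; wrapping the layers in an `nn.ModuleList` is only a convenient way to count parameters here.

```python
import torch
from torch import nn

vocab_len, n_hidden = 30, 5
i_h = nn.Embedding(vocab_len, n_hidden)
h_o = nn.Linear(n_hidden, vocab_len)
h_o.weight = i_h.weight                 # tie: both modules share one Parameter

shared = h_o.weight is i_h.weight

# A gradient coming from the output layer lands on the shared tensor
h_o(torch.rand(2, n_hidden)).sum().backward()
grad_shape = tuple(i_h.weight.grad.shape)

# Tied parameters are counted once when both layers live in one module
tied = nn.ModuleList([i_h, h_o])
n_params = sum(p.numel() for p in tied.parameters())
print(shared, grad_shape, n_params)
```

The shared matrix receives gradients from both its role as embedding and its role as output layer, and the model has 150 fewer parameters than the untied version (180 instead of 330).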
