Reference paper: [Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3, (February 2003), 1137–1155.](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

One main difference in our implementation is that we work with characters instead of words. The vocabulary in Bengio et al.'s paper contains about 17,000 words, whereas ours contains only 27 characters (the 26 lowercase letters and the special `.` character).

In [1]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
words = open("./names.txt", "r").read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [3]:
len(words)

32033

In [4]:
chars = sorted(list(set("".join(words))))
s2i = {s: i + 1 for i, s in enumerate(chars)}
s2i["."] = 0
i2s = {i: s for s, i in s2i.items()}
print(i2s)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


In [5]:
block_size = (
    3  # context length: how many characters do we take to predict the next one?
)
X, Y = [], []
for w in words[:5]:
    print(w)
    context = [0] * block_size
    for ch in w + ".":
        ix = s2i[ch]
        X.append(context)
        Y.append(ix)
        print("".join(i2s[i] for i in context), "--->", i2s[ix])
        context = context[1:] + [ix]  # crop and append

X = torch.tensor(X)
Y = torch.tensor(Y)

emma
... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
olivia
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
ava
... ---> a
..a ---> v
.av ---> a
ava ---> .
isabella
... ---> i
..i ---> s
.is ---> a
isa ---> b
sab ---> e
abe ---> l
bel ---> l
ell ---> a
lla ---> .
sophia
... ---> s
..s ---> o
.so ---> p
sop ---> h
oph ---> i
phi ---> a
hia ---> .


In [6]:
(
    X.shape,
    X.dtype,
    Y.shape,
    Y.dtype,
)

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

In [7]:
C = torch.randn((27, 2))  # embedding matrix

We can index this embedding matrix directly...

In [8]:
C[5]

tensor([0.3214, 0.7416])

Or by multiplying a one-hot encoded (OHE) vector with the embedding matrix...

In [9]:
ohe = F.one_hot(torch.tensor(5), 27).float()  # F.one_hot() returns an int64 tensor, so cast it to float
ohe @ C

tensor([0.3214, 0.7416])
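
The same equivalence holds for a whole batch of indices. The sketch below uses an arbitrary seed and an arbitrary set of indices (both are assumptions for illustration) to check that row lookup and one-hot multiplication give the same result.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)  # arbitrary seed, only for reproducibility
C = torch.randn((27, 2), generator=g)  # same shape as the embedding matrix above

idx = torch.tensor([5, 0, 13])  # a few arbitrary character indices
by_index = C[idx]  # row lookup, shape (3, 2)
by_matmul = F.one_hot(idx, num_classes=27).float() @ C  # (3, 27) @ (27, 2) -> (3, 2)

same = torch.allclose(by_index, by_matmul)
print(same)
```

Indexing is the cheaper of the two, since it never materializes the `(3, 27)` one-hot matrix.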

In [10]:
emb = C[X]
emb.shape  # batch size, context length, embedding length

torch.Size([32, 3, 2])

In [11]:
W1 = torch.randn((6, 100))  # (ctx len x emb length), hidden dim
b1 = torch.randn(100)

In [12]:
# we would like to perform (emb @ W1) + b1, but the dims don't match
# ---------------------------------------------------------------------------
# RuntimeError                              Traceback (most recent call last)
# Cell In[31], line 1
# ----> 1 (emb @ W1) + b1

# RuntimeError: mat1 and mat2 shapes cannot be multiplied (96x2 and 6x100)

To solve this problem, we need to multiply a matrix of shape `(32, 6)` by one of shape `(6, 100)`. One way to achieve this is to use `torch.cat()` and "manually" concatenate the embeddings of each token in the input context.

In [13]:
%%timeit
(torch.cat((emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]), 1) @ W1) + b1  # concatenate the three per-position embeddings along dim 1

29 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


The issue with this approach is that the three slices are hard-coded, so it does not adapt automatically when the context length changes.

Alternatively, we could use `torch.unbind()`. This function returns a tuple of all slices along a given dimension, with that dimension removed.

In [14]:
a = torch.unbind(emb, 1)
len(a)

3

In [15]:
%%timeit 
(torch.cat(torch.unbind(emb, 1), 1) @ W1) + b1

17.1 µs ± 98.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


One downside of using `torch.cat()` is that we are creating an entirely new tensor in memory.

An even more efficient way of doing this is to use the `.view()` method. Each tensor has an underlying storage, which we can access with `Tensor.storage()`, that contains all of the numbers in the tensor as a 1-dimensional vector. This is how the tensor is represented in memory: as a 1D vector. When we call `.view()`, we only manipulate some attributes of the tensor (i.e., offset, stride, and shape) that dictate how this 1D sequence is interpreted as an N-dimensional tensor. No memory is moved, copied, or created. The storage stays the same.
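
As a minimal sketch of this idea, the toy tensor below (not the notebook's `emb`) shows that a view has its own shape and stride but points at the same memory as the original.

```python
import torch

t = torch.arange(12)  # 1D tensor holding 0..11
v = t.view(3, 4)  # same data, interpreted as 3 rows of 4

print(v.shape, v.stride())  # torch.Size([3, 4]) (4, 1)
print(t.data_ptr() == v.data_ptr())  # True: both start at the same memory address

v[0, 0] = 100  # writing through the view...
print(t[0].item())  # ...changes the original tensor too: 100
```

The stride `(4, 1)` says: move 4 elements in the storage to reach the next row, and 1 element to reach the next column.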

In [16]:
%%timeit
(emb.view((32, -1)) @ W1) + b1

7.71 µs ± 36.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


Another approach, equally readable and performant, is to use `.reshape()`. As [this StackOverflow answer](https://stackoverflow.com/a/54507446/2092449) explains, the difference is that `.view()` always returns a view that shares data with the original tensor (and raises an error when that is not possible), whereas `.reshape()` returns a view when it can and a copy otherwise. So, if you just want to reshape a tensor, use `.reshape()`. If you are also concerned about memory usage and want to ensure that the two tensors share the same data, use `.view()`.
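
The difference only shows up on non-contiguous tensors. The sketch below uses a small transposed tensor (a toy example, not part of the model) to trigger it.

```python
import torch

a = torch.arange(6).view(2, 3)
b = a.t()  # transpose: shape (3, 2), a non-contiguous view of the same data

print(b.is_contiguous())  # False

try:
    b.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True  # .view() refuses because the strides are incompatible

r = b.reshape(6)  # .reshape() falls back to copying the data
shares_memory = r.data_ptr() == a.data_ptr()
print(view_failed, shares_memory)
```

In our case `emb = C[X]` is a freshly created contiguous tensor, so both methods return a view and take the same time.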

In [17]:
%%timeit
(emb.reshape((32, -1)) @ W1) + b1

7.69 µs ± 70.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


I suggest using `.view()` whenever possible: it guarantees that no data is copied, and it fails loudly instead of silently copying.

In [18]:
h = torch.tanh(emb.view(emb.shape[0], -1) @ W1 + b1)
h

tensor([[ 0.7879,  0.9995,  1.0000,  ...,  0.9998, -1.0000, -0.9939],
        [ 0.8132,  0.8769,  0.7999,  ...,  0.9890, -0.9721, -0.8718],
        [-0.7958,  0.0539, -0.8250,  ...,  0.9229, -0.5362,  0.9993],
        ...,
        [ 0.9803, -0.3940, -0.9081,  ..., -0.6458, -0.8879, -0.9986],
        [-0.9476,  0.7758, -0.0293,  ...,  0.9680, -0.9001,  0.9998],
        [ 0.6691,  0.7267, -0.2923,  ..., -0.9386, -0.9939, -0.9818]])

In [19]:
h.shape

torch.Size([32, 100])

One thing to pay attention to is the addition. The matrix obtained by multiplying the reshaped `emb` by `W1` has shape `(32, 100)`, whilst the bias vector has shape `(100,)`.

```
32, 100 --> 32, 100
    100 -->  1, 100
```

So the same bias vector is broadcast across all the rows of the `emb.view(emb.shape[0], -1) @ W1` matrix. This means that element `b1[j]` is added to every element of column `j`, i.e., to the j-th hidden unit of every example in the batch.
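
A small sketch of this broadcasting rule, using tiny stand-in tensors (their sizes and values are chosen only for illustration) so the result is easy to read:

```python
import torch

x = torch.zeros((4, 3))  # stand-in for the (32, 100) matmul result
b = torch.tensor([1.0, 2.0, 3.0])  # stand-in for the (100,) bias

out = x + b  # b is treated as shape (1, 3) and repeated across the 4 rows
print(out)
```

Every row of `out` equals `b`, so each bias entry ends up in exactly one column.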

In [20]:
W2 = torch.randn((100, 27))
b2 = torch.randn((27))

In [21]:
logits = h @ W2 + b2

In [22]:
logits.shape

torch.Size([32, 27])

In [23]:
counts = logits.exp()

In [24]:
probs = counts / counts.sum(1, keepdims=True)

In [25]:
probs.shape

torch.Size([32, 27])

In [26]:
# sanity check, rows should sum to 1
probs[0].sum()

tensor(1.)

In [27]:
loss = -probs[torch.arange(32), Y].log().mean()  # average neg log likelihood
loss

tensor(11.5661)
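
The manual softmax followed by the average negative log likelihood is what `F.cross_entropy` computes, in a more numerically stable way. The sketch below checks this on random logits and targets (random stand-ins with an arbitrary seed, not the notebook's actual values), and then shows how the manual version overflows on a large logit.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)  # arbitrary seed
logits = torch.randn((32, 27), generator=g)  # stand-in logits
Y = torch.randint(0, 27, (32,), generator=g)  # stand-in targets

# manual version, as in the cells above
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(32), Y].log().mean()

builtin = F.cross_entropy(logits, Y)
print(torch.allclose(manual, builtin, atol=1e-5))

# with a large logit, exp() overflows float32 and the manual softmax yields nan
big = torch.tensor([[100.0, 0.0]])
naive = big.exp() / big.exp().sum(1, keepdim=True)
stable = F.cross_entropy(big, torch.tensor([0]))
print(naive, stable)
```

This is one reason to prefer `F.cross_entropy` in the training loop; it also avoids creating the intermediate `counts` and `probs` tensors.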

Let's rewrite all of the above in a more reusable way and train the MLP model on the full dataset.

In [28]:
block_size = (
    3  # context length: how many characters do we take to predict the next one?
)
X, Y = [], []
for w in words:
    context = [0] * block_size
    for ch in w + ".":
        ix = s2i[ch]
        X.append(context)
        Y.append(ix)
        context = context[1:] + [ix]  # crop and append

X = torch.tensor(X)
Y = torch.tensor(Y)

In [29]:
X.shape, Y.shape

(torch.Size([228146, 3]), torch.Size([228146]))

In [30]:
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator=g).requires_grad_()
W1 = torch.randn((6, 100), generator=g).requires_grad_()
b1 = torch.randn((100), generator=g).requires_grad_()
W2 = torch.randn((100, 27), generator=g).requires_grad_()
b2 = torch.randn((27), generator=g).requires_grad_()
parameters = [C, W1, b1, W2, b2]

In [31]:
sum(p.nelement() for p in parameters)

3481
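
As a sanity check, the total can be reproduced by hand from the layer shapes. The variable names below are hypothetical, introduced only for this calculation.

```python
# Hypothetical names for the sizes used in the cells above
vocab_size, emb_dim, block_size, hidden = 27, 2, 3, 100

n_C = vocab_size * emb_dim  # 27 * 2 = 54
n_W1 = block_size * emb_dim * hidden  # 6 * 100 = 600
n_b1 = hidden  # 100
n_W2 = hidden * vocab_size  # 100 * 27 = 2700
n_b2 = vocab_size  # 27

total = n_C + n_W1 + n_b1 + n_W2 + n_b2
print(total)  # 3481
```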

In [32]:
bs = 32

for epoch in range(1001):
    # minibatch
    ix = torch.randint(0, X.shape[0], (bs,))
    xs = X[ix]
    ys = Y[ix]

    # forward pass
    emb = C[xs]  # (32, 3, 2)
    h = torch.tanh(
        emb.view((-1, 6)) @ W1 + b1
    )  # (32, 100)... also tanh is important to not get inf loss!
    logits = h @ W2 + b2  # (32, 27)
    loss = F.cross_entropy(logits, ys)

    if epoch % 50 == 0:
        print(f"epoch {epoch}: {loss.item()}")

    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # update
    for p in parameters:
        p.data -= 0.1 * p.grad

epoch 0: 16.95440101623535
epoch 50: 5.307323455810547
epoch 100: 3.8642489910125732
epoch 150: 3.7624990940093994
epoch 200: 3.6555802822113037
epoch 250: 3.177612781524658
epoch 300: 2.781998872756958
epoch 350: 3.154618740081787
epoch 400: 3.1591622829437256
epoch 450: 2.992241859436035
epoch 500: 2.8389673233032227
epoch 550: 2.394284963607788
epoch 600: 2.0478570461273193
epoch 650: 3.046037197113037
epoch 700: 2.8940694332122803
epoch 750: 2.655653953552246
epoch 800: 2.7728943824768066
epoch 850: 2.4803926944732666
epoch 900: 2.9597811698913574
epoch 950: 3.102785587310791
epoch 1000: 2.4645280838012695


In [33]:
logits.max(1)

torch.return_types.max(
values=tensor([3.8991, 3.0654, 2.7789, 2.2662, 3.5375, 2.6529, 2.3019, 2.9262, 2.3590,
        2.6526, 3.2358, 2.5021, 2.7789, 4.5728, 2.7789, 4.5486, 2.3283, 2.7688,
        4.8154, 1.9772, 2.7830, 1.8930, 1.8393, 4.2401, 2.4845, 3.5620, 1.9488,
        4.4645, 2.3582, 1.8850, 2.7101, 2.7906], grad_fn=<MaxBackward0>),
indices=tensor([ 0,  0,  1, 14,  0,  0,  0,  0,  0,  1,  0,  9,  1,  9,  1,  0,  0,  0,
         1,  0,  0, 14,  0,  1, 14,  9,  0,  1,  0,  0,  9, 14]))