Reference paper: [Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3, (February 2003), 1137–1155.](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

One main difference in our implementation is that we work with characters instead of words. The vocabulary in Bengio et al.'s paper contains about 17,000 words, whereas ours contains only 27 characters (the 26 lowercase letters plus the special `.` character, which marks the start and end of a name).

In [1]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
words = open("./names.txt", "r").read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [3]:
len(words)

32033

In [4]:
chars = sorted(list(set("".join(words))))
s2i = {s: i+1 for i, s in enumerate(chars)}
s2i["."] = 0
i2s = {i: s for s, i in s2i.items()}
print(i2s)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


In [5]:
block_size = 3  # context length: how many characters do we take to predict the next one?
X, Y = [], []
for w in words[:5]:
    print(w)
    context = [0] * block_size
    for ch in w + ".":
        ix = s2i[ch]
        X.append(context)
        Y.append(ix)
        print("".join(i2s[i] for i in context), "--->", i2s[ix])
        context = context[1:] + [ix]  # crop and append
        
X = torch.tensor(X)
Y = torch.tensor(Y)

emma
... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
olivia
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
ava
... ---> a
..a ---> v
.av ---> a
ava ---> .
isabella
... ---> i
..i ---> s
.is ---> a
isa ---> b
sab ---> e
abe ---> l
bel ---> l
ell ---> a
lla ---> .
sophia
... ---> s
..s ---> o
.so ---> p
sop ---> h
oph ---> i
phi ---> a
hia ---> .


In [6]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)
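Since the context length is something we may want to change later, the loop from cell In [5] can be wrapped in a function. The following is only a sketch: the name `build_dataset` is hypothetical, it returns plain Python lists (call `torch.tensor()` on them afterwards), and the mapping `s2i` is rebuilt here from just five names so that the snippet runs without `names.txt`, which means the indices differ from the full mapping above.

```python
def build_dataset(words, s2i, block_size=3):
    """Return (X, Y): contexts of `block_size` indices and the index that follows each."""
    X, Y = [], []
    for w in words:
        context = [0] * block_size           # start with padding ('.' has index 0)
        for ch in w + ".":
            ix = s2i[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]     # crop and append
    return X, Y

# Toy data: the first five names, with a mapping built only from their characters.
words = ["emma", "olivia", "ava", "isabella", "sophia"]
chars = sorted(set("".join(words)))
s2i = {s: i + 1 for i, s in enumerate(chars)}
s2i["."] = 0

for block_size in (2, 3, 4):
    X, Y = build_dataset(words, s2i, block_size)
    print(block_size, len(X), len(X[0]), len(Y))
```

The number of examples depends only on the words (one per character plus one for the final `.`), while the width of each row of `X` follows `block_size`.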

In [7]:
C = torch.randn((27, 2))  # embedding matrix: 27 characters, 2 dimensions each

We can index this embedding matrix directly...

In [8]:
C[5]

tensor([-1.4756,  0.6766])

Or we can multiply a one-hot encoded (OHE) vector by the embedding matrix...

In [9]:
ohe = F.one_hot(torch.tensor(5), 27).float()  # F.one_hot() returns int64 tensors, so we cast to float
ohe @ C 

tensor([-1.4756,  0.6766])
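The equivalence between the two lookups also holds for a whole batch of indices, not just a single one. The sketch below checks this; the seed and the indices in `idx` are arbitrary values chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)               # arbitrary seed, only to make the sketch reproducible
C = torch.randn((27, 2))           # embedding matrix, same shape as in the text
idx = torch.tensor([5, 0, 13])     # a few arbitrary character indices

by_index = C[idx]                                        # direct row lookup
by_matmul = F.one_hot(idx, num_classes=27).float() @ C   # each one-hot row selects one row of C

print(by_index.shape, by_matmul.shape)
print(torch.allclose(by_index, by_matmul))
```

Indexing is the cheaper of the two, since it skips building the one-hot matrix and the multiplications by zero.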

In [10]:
emb = C[X]
emb.shape  # (batch size, context length, embedding length)

torch.Size([32, 3, 2])

In [11]:
W1 = torch.randn((6, 100))  # (ctx len x emb length), hidden dim
b1 = torch.randn(100)

In [12]:
# we would like to perform (emb @ W1) + b1, but the dims don't match
# ---------------------------------------------------------------------------
# RuntimeError                              Traceback (most recent call last)
# Cell In[31], line 1
# ----> 1 (emb @ W1) + b1

# RuntimeError: mat1 and mat2 shapes cannot be multiplied (96x2 and 6x100)

To solve this problem, we need to multiply a matrix of shape `(32, 6)` by one of shape `(6, 100)`. One way to achieve this is to use `torch.cat()` and "manually" concatenate the embeddings of the tokens in the input context.

In [13]:
%%timeit
(torch.cat((emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]), 1) @ W1) + b1  # concatenate the 3 context positions along dim 1

29.3 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


The issue with this approach is that the three slices are hard-coded, so the code has to be edited every time the context length changes.

Alternatively, we could use `torch.unbind()`. This function returns a tuple of all slices along a given dimension, with that dimension removed from each slice.

In [14]:
a = torch.unbind(emb, 1)
len(a)

3

In [15]:
%%timeit 
(torch.cat(torch.unbind(emb, 1), 1) @ W1) + b1

17 µs ± 78.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
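To illustrate that the `torch.unbind()` version adapts to the context length, here is a small sketch that runs the same expression for two block sizes. The random tensors and the sizes `(3, 5)` are stand-ins chosen for illustration, not the tensors built earlier in the notebook.

```python
import torch

torch.manual_seed(0)                   # arbitrary seed
batch, emb_dim, hidden = 32, 2, 100

for block_size in (3, 5):
    emb = torch.randn(batch, block_size, emb_dim)     # stand-in for C[X]
    W1 = torch.randn(block_size * emb_dim, hidden)
    b1 = torch.randn(hidden)

    # unbind yields `block_size` tensors of shape (batch, emb_dim);
    # cat joins them along dim 1 into (batch, block_size * emb_dim)
    flat = torch.cat(torch.unbind(emb, 1), 1)
    out = flat @ W1 + b1
    print(block_size, tuple(flat.shape), tuple(out.shape))
```

The expression itself never mentions the context length, so only the shape of `W1` has to follow `block_size`.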


One downside of using `torch.cat()` is that we are creating an entirely new tensor in memory.

An even more efficient way of doing this is to use the `.view()` method. Each tensor has an underlying storage, accessible with `Tensor.storage()`, which holds all of the tensor's numbers as a 1-dimensional vector. This is how the tensor is represented in memory: as a 1D vector. When we call `.view()`, we only manipulate some attributes of the tensor (i.e., offset, stride, and shape) that dictate how this 1D sequence is interpreted as an N-dimensional tensor. No memory is moved, copied, or created, and the storage stays the same.

In [16]:
%%timeit
(emb.view((32, -1)) @ W1) + b1

8.01 µs ± 69 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
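The claim that `.view()` only changes metadata can be checked directly. This is a minimal sketch on a stand-in tensor filled with `torch.arange()` values (an assumption for readability, not the notebook's `emb`): it compares shape, stride, and offset, and then writes through the view.

```python
import torch

emb = torch.arange(32 * 3 * 2, dtype=torch.float32).view(32, 3, 2)  # stand-in tensor
flat = emb.view(32, -1)                                              # shape (32, 6)

# Only the metadata differs between the two tensors.
print(tuple(emb.shape), emb.stride(), emb.storage_offset())
print(tuple(flat.shape), flat.stride(), flat.storage_offset())

# Both tensors start at the same memory address...
same_memory = emb.data_ptr() == flat.data_ptr()

# ...so a write through the view is visible in the original tensor.
flat[0, 0] = -1.0
print(same_memory, emb[0, 0, 0].item())
```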


Another approach, which is more readable and just as fast here, is to use `.reshape()`. As [this StackOverflow answer](https://stackoverflow.com/a/54507446/2092449) very clearly explains, the difference between `.view()` and `.reshape()` is that `.view()` always shares data with the original tensor (and fails if that is not possible), whereas `.reshape()` returns a view when it can and a copy otherwise. So, if you just want to reshape tensors, use `.reshape()`. If you're also concerned about memory usage and want to ensure that the two tensors share the same data, use `.view()`.

In [17]:
%%timeit
(emb.reshape((32, -1)) @ W1) + b1

7.99 µs ± 54.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
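The difference between the two methods only shows up on non-contiguous tensors. The sketch below uses a small transposed tensor (an example chosen for illustration, unrelated to `emb`) for which `.view()` cannot work without copying.

```python
import torch

t = torch.arange(6).view(2, 3).t()   # transpose: shape (3, 2), no longer contiguous
print(t.is_contiguous())

# .view() refuses, because the requested shape cannot be expressed
# with the existing memory layout.
try:
    t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

# .reshape() falls back to copying the data into a new tensor.
r = t.reshape(6)
copied = r.data_ptr() != t.data_ptr()
print(view_failed, copied)
```

In the notebook, `emb` is contiguous, so both calls return a view and the timings above are practically identical.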


I suggest using `.view()` whenever possible, for performance reasons: it never copies data, and it raises an error if a copy would be needed.

In [18]:
h = torch.tanh(emb.view(emb.shape[0], -1) @ W1 + b1)
h

tensor([[ 0.9872,  0.9964, -1.0000,  ..., -0.9986,  0.2636, -0.9999],
        [ 0.0741,  0.9828, -0.9978,  ..., -0.4900, -0.9942, -0.9993],
        [ 0.8229, -0.9995,  0.2666,  ..., -0.9851, -0.9740,  0.9977],
        ...,
        [-0.0168, -0.9989,  0.9989,  ..., -0.8979, -0.7624,  0.9997],
        [-0.6276,  0.8327, -0.9993,  ..., -0.9102, -0.8032, -0.9606],
        [-0.7889,  1.0000, -0.9999,  ...,  0.4892,  0.6764, -1.0000]])

In [19]:
h.shape

torch.Size([32, 100])

One thing to pay attention to is the addition operation. The matrix obtained by multiplying the reshaped `emb` by `W1` has shape `(32, 100)`, whilst the bias vector has shape `(100,)`. Broadcasting aligns the shapes from the right and adds a leading dimension of size 1 to the bias:

```
32, 100 --> 32, 100
    100 -->  1, 100
```

So the same bias vector is added to every row of the `emb.view(emb.shape[0], -1) @ W1` matrix. If we call that matrix `x`, this means that element `b1[0]` is added to every element of the first column `x[:, 0]`, `b1[1]` to every element of `x[:, 1]`, and so on.
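This broadcasting behaviour can be verified with a short sketch. The tensors are random stand-ins with the same shapes as in the text: `x` plays the role of the matrix product and `b1` the role of the bias.

```python
import torch

torch.manual_seed(0)        # arbitrary seed
x = torch.randn(32, 100)    # stands in for emb.view(emb.shape[0], -1) @ W1
b1 = torch.randn(100)

out = x + b1                # b1 is treated as shape (1, 100) and repeated over the 32 rows

# Same result as materialising the repeated bias explicitly.
explicit = x + b1.unsqueeze(0).expand(32, 100)
print(torch.equal(out, explicit))

# Every element of column 0 received the same scalar, b1[0].
print(torch.equal(out[:, 0], x[:, 0] + b1[0]))
```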