## MLP for language modelling

Implementing the Bengio et al. paper to develop an MLP for language modelling.

The paper introduces vector embeddings to capture semantic proximity between words, instead of explicitly estimating a probability for every possible combination of words, which wouldn't generalize well. The vocabulary considered in the paper has __17,000 words__.

### Architecture

<img src="../papers/architecture.png" style="width:70%;">

__Explanation of architecture:__ 
- The 3 previous words are used as context; each one is represented by its index $w_i$
- The embedding of each word is looked up in a global matrix $C$ (shared across all context positions) and used as input for the hidden layer
- The size of the hidden layer is a hyperparameter
- A `tanh` non-linearity is applied to the hidden layer
- Finally there is a fully connected output layer (with __17,000 neurons -- one for each word__)
- Softmax turns the outputs into a probability distribution over the next word, from which the most likely word can be chosen
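The steps listed above can be traced in a single forward pass. This is a minimal sketch, not the paper's implementation: the embedding size (30), the hidden size (100), the weight scaling, and the three word indices are assumptions chosen only to make the shapes concrete.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 17000   # vocabulary size quoted above
context_len = 3      # number of previous words used as context
emb_dim = 30         # assumed embedding size (hyperparameter)
hidden_dim = 100     # assumed hidden layer size (hyperparameter)

# Randomly initialised parameters (untrained, for shape checking only)
C = torch.randn(vocab_size, emb_dim)                        # shared embedding table
W1 = torch.randn(context_len * emb_dim, hidden_dim) * 0.1   # hidden layer weights
b1 = torch.zeros(hidden_dim)
W2 = torch.randn(hidden_dim, vocab_size) * 0.1              # output layer weights
b2 = torch.zeros(vocab_size)

w = torch.tensor([[11, 42, 7]])                  # one context of 3 arbitrary word indices
emb = C[w]                                       # (1, 3, 30): one embedding per context word
h = torch.tanh(emb.reshape(1, -1) @ W1 + b1)     # (1, 100): concatenate, then tanh
logits = h @ W2 + b2                             # (1, 17000): one score per word
probs = F.softmax(logits, dim=1)                 # probability distribution over the next word
next_word = probs.argmax(dim=1)                  # index of the most likely word

print(emb.shape, h.shape, probs.shape)
```

Since the weights are random, `next_word` is meaningless here; the point is only how the shapes flow from indices to probabilities.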


### Parameters

- The lookup table $C$ (embedding matrix)
- $W_1, b_1$ for the hidden layer
- $W_2, b_2$ for the output layer
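To get a feel for how these three groups of parameters add up, here is a small counting helper. The function name `count_parameters` and the hidden size of 100 are assumptions for illustration; the other numbers (27 characters, 2-dimensional embeddings, 3 context positions) match the character-level setup used in this notebook.

```python
def count_parameters(vocab_size, emb_dim, context_len, hidden_dim):
    """Count the parameters of the MLP: lookup table, hidden layer, output layer."""
    embedding = vocab_size * emb_dim                           # lookup table C
    hidden = context_len * emb_dim * hidden_dim + hidden_dim   # hidden weights and biases
    output = hidden_dim * vocab_size + vocab_size              # output weights and biases
    return embedding + hidden + output

# 27 characters, 2-D embeddings, 3 context characters, 100 hidden units (assumed)
print(count_parameters(27, 2, 3, 100))  # 3481
```

With a 17,000-word vocabulary the output layer dominates the count, because it holds one weight vector per word.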

In [25]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

We will implement the same architecture as above, not for sentences as is done in Bengio et al., but for individual names, so the model predicts characters instead of words.

In [3]:
words = open('names.txt', 'r').read().splitlines()

In [4]:
words[:5]

['emma', 'olivia', 'ava', 'isabella', 'sophia']

In [5]:
# define stoi 
stoi = {}
allletters = sorted(set("".join(words)))

stoi = {s:i+1 for i,s in enumerate(allletters)}
stoi['.'] = 0

itos = {i:s for s,i in stoi.items()}
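As a quick sanity check, the two mappings should be inverses of each other. The sketch below uses a three-name stand-in for `names.txt` (an assumption, so that it runs on its own); with the full file the indices differ, because they depend on which letters are present.

```python
words = ['emma', 'olivia', 'ava']  # stand-in for the contents of names.txt

chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}  # letters start at 1
stoi['.'] = 0                                   # 0 is reserved for the boundary token
itos = {i: s for s, i in stoi.items()}

encoded = [stoi[ch] for ch in 'emma.']          # string -> indices
decoded = "".join(itos[i] for i in encoded)     # indices -> string

print(encoded)  # [2, 5, 5, 1, 0]
print(decoded)  # emma.
```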

In [10]:
some = [0]*3
some

[0, 0, 0]

In [13]:
words[:1]

['emma']

### Dataset preparation

Use the 3 previous letters to guess the next one.

In [17]:
X , Y = [], []
block_size= 3 # can be reset to whatever you like

for w in words[:3]:
    #'emma'
    print(w)
    context = [0]*block_size # contains indices of context letters
    for ch in w + '.':
        ix = stoi[ch] # 'e' -> 5
        Y.append(ix) # 5 is the target
        X.append(context)
        print("".join(itos[i] for i in context), '------->', ch)
        context = context[1:] + [ix] # update context and append new index

X = torch.tensor(X)
Y = torch.tensor(Y)


emma
... -------> e
..e -------> m
.em -------> m
emm -------> a
mma -------> .
olivia
... -------> o
..o -------> l
.ol -------> i
oli -------> v
liv -------> i
ivi -------> a
via -------> .
ava
... -------> a
..a -------> v
.av -------> a
ava -------> .


In [21]:
print(X[:3], Y[:3])
print(X.shape, X.dtype, Y.shape, Y.dtype)

tensor([[ 0,  0,  0],
        [ 0,  0,  5],
        [ 0,  5, 13]]) tensor([ 5, 13, 13])
torch.Size([16, 3]) torch.int64 torch.Size([16]) torch.int64
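The loop above can be wrapped in a function so that the context length is easy to change. This is a sketch under assumptions: the name `build_dataset` is hypothetical, and `stoi` is rebuilt here from the full lowercase alphabet so the block runs without `names.txt`.

```python
import torch

def build_dataset(words, stoi, block_size=3):
    """Return (X, Y): contexts of `block_size` indices and the index that follows each."""
    X, Y = [], []
    for w in words:
        context = [0] * block_size            # start with all boundary tokens
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)                 # current context is the input
            Y.append(ix)                      # next character is the target
            context = context[1:] + [ix]      # slide the window by one character
    return torch.tensor(X), torch.tensor(Y)

# Assumed mapping: '.' -> 0, 'a' -> 1, ..., 'z' -> 26
stoi = {s: i + 1 for i, s in enumerate("abcdefghijklmnopqrstuvwxyz")}
stoi['.'] = 0

X, Y = build_dataset(['emma', 'olivia', 'ava'], stoi)
print(X.shape, Y.shape)  # torch.Size([16, 3]) torch.Size([16])
```

Calling `build_dataset(words, stoi, block_size=4)` would give the same 16 examples with 4 context indices each.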


So we have X with 3 (integer) features per example as our input, and Y holds one (integer) target per example.

Now let's build the embedding lookup table $C$:
- We have $27$ possible characters, which we will try to embed into a lower-dimensional space (unlike one-hot encoding, which is still 27-dimensional!)
- In the _paper_ they compress $17000$ words into a space with as few as $30$ dimensions.

In [24]:
C = torch.randn((27,2)) # each of 27 characters has a 2D embedding

C[:3]

tensor([[-0.1438,  1.3808],
        [ 1.5834,  0.0049],
        [-1.5679,  1.1807]])

Now, how do we access the embedding for a single integer, say $5$?

In [None]:
# option 1: index into C directly
print(C[5])

# option 2: one-hot encode 5 and then multiply -- as was done in bigram 
print(F.one_hot(torch.tensor(5), num_classes=27).float() @ C)


tensor([1.3538, 0.7665])
tensor([1.3538, 0.7665])


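Introducing `.float()` is just an occupational hazard: `F.one_hot` returns an integer (`int64`) tensor, which has to be cast before it can be multiplied with the float matrix `C`. Note that both of the above ways give the same embedding tensor for $5$.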

Going forward we will just extract the row directly using the index. 
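The equivalence holds for every index, not only for $5$: multiplying a one-hot row by `C` picks out exactly one row of `C`. A small check, using a freshly seeded `C` (an assumption, so the values differ from the outputs printed above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C = torch.randn((27, 2))                   # stand-in embedding table

idx = torch.arange(27)                     # every possible character index
one_hot = F.one_hot(idx, num_classes=27)   # shape (27, 27), integer dtype

via_matmul = one_hot.float() @ C           # option 2: one-hot times C
via_index = C[idx]                         # option 1: direct row lookup

print(one_hot.dtype)                            # torch.int64
print(torch.allclose(via_matmul, via_index))    # True
```

Direct indexing avoids building the $27 \times 27$ one-hot matrix at all, which is why it is the cheaper of the two options.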

<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">Question:</span> Now, how do we convert X ($16 \times 3$) into embeddings? We can leverage PyTorch's flexible indexing.

In [33]:
C[[5,13,4,4,4]]  # retrieves rows 5, 13 and 4 of C (row 4 three times)

tensor([[ 1.3538,  0.7665],
        [-1.5971, -1.9288],
        [ 0.4243,  0.1791],
        [ 0.4243,  0.1791],
        [ 0.4243,  0.1791]])

We indexed with a 1-dimensional list of integers. But it turns out we can also index with a 2-dimensional tensor of integers. For example:

In [38]:
C[X][:2], C[X].shape # dim(X) = 16*3 and each element has 2 dim embedding => 16*3*2

(tensor([[[-0.1438,  1.3808],
          [-0.1438,  1.3808],
          [-0.1438,  1.3808]],
 
         [[-0.1438,  1.3808],
          [-0.1438,  1.3808],
          [ 1.3538,  0.7665]]]),
 torch.Size([16, 3, 2]))
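To convince ourselves of what the 2-dimensional index does, the sketch below checks that entry `[i, j]` of `C[X]` is simply row `X[i, j]` of `C`. It uses the first three rows of `X` and a freshly seeded `C` (both assumptions, so the block is self-contained). The final flattening step is a look-ahead at how the three embeddings could be concatenated for the hidden layer, not something done in the cells above.

```python
import torch

torch.manual_seed(0)
C = torch.randn((27, 2))                                # stand-in embedding table
X = torch.tensor([[0, 0, 0], [0, 0, 5], [0, 5, 13]])    # first three contexts from above

emb = C[X]                                              # shape (3, 3, 2)

# Each position of the result is the row of C selected by the matching entry of X
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        assert torch.equal(emb[i, j], C[X[i, j]])

flat = emb.reshape(emb.shape[0], -1)                    # shape (3, 6): embeddings side by side
print(emb.shape, flat.shape)  # torch.Size([3, 3, 2]) torch.Size([3, 6])
```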

More experimentation on higher dimension tensor indexing in the data structures notebook!