## MLP for language modelling

This notebook implements the MLP language model from the Bengio et al. paper.

The paper introduces vector embeddings to capture semantic proximity between words, instead of explicitly estimating a probability for every possible combination of words, which would not generalize well. The vocabulary in the paper's dataset contains __17,000 words__.

### Architecture

<img src="../papers/architecture.png" style="width:70%;">

__Explanation of architecture:__ 
- The 3 previous words are used as context; each is represented by its index $w_i$.
- The embedding of each context word is looked up in a global matrix $C$ that is shared across all positions; the embeddings are the input to the hidden layer.
- The size of the hidden layer is a hyperparameter.
- A `tanh` non-linearity is applied to the hidden layer.
- Finally, there is a fully connected output layer (with __17,000 neurons -- one for each word__).
- Softmax turns the outputs into a probability distribution over the next word.
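The bullet list above can be sketched as a small `nn.Module`. This is only an illustration under stated assumptions: the class name `NeuralLM` and the hidden size of 100 are hypothetical choices, and the sketch follows the list above rather than every detail of the paper.

```python
import torch
import torch.nn as nn

class NeuralLM(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, vocab_size=17000, emb_dim=30, context=3, hidden=100):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)         # shared lookup table C
        self.hidden = nn.Linear(context * emb_dim, hidden) # hidden layer (size is a hyperparameter)
        self.out = nn.Linear(hidden, vocab_size)           # one output neuron per word

    def forward(self, idx):
        # idx: (batch, context) integer word indices
        emb = self.C(idx)                            # (batch, context, emb_dim)
        h = torch.tanh(self.hidden(emb.flatten(1)))  # (batch, hidden)
        return self.out(h)                           # logits: (batch, vocab_size)

torch.manual_seed(0)
model = NeuralLM()
idx = torch.randint(0, 17000, (4, 3))    # 4 made-up contexts of 3 word indices
logits = model(idx)
probs = torch.softmax(logits, dim=1)     # distribution over the next word
print(logits.shape)  # torch.Size([4, 17000])
```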


### Parameters

- The lookup table $C$ (embedding matrix)
- $W_1, b_1$ for the hidden layer
- $W_2, b_2$ for the output layer

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

We will implement the architecture above not for sentences, as is done in Bengio et al., but for individual names: the model predicts the next character instead of the next word.

In [2]:
words = open('names.txt', 'r').read().splitlines()

In [3]:
words[:5]

['emma', 'olivia', 'ava', 'isabella', 'sophia']

In [4]:
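# stoi maps each character to an integer; index 0 is reserved for the '.' boundary token
allletters = sorted(set("".join(words)))

stoi = {s:i+1 for i,s in enumerate(allletters)}
stoi['.'] = 0

# itos is the inverse mapping (integer -> character)
itos = {i:s for s,i in stoi.items()}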
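As a quick sanity check, the two mappings should invert each other. The sketch below rebuilds them on just three names so that it runs without `names.txt`; because of that, the integer assigned to each letter differs from the full dataset.

```python
# Assumption: a three-name stand-in for the full names.txt word list
words = ['emma', 'olivia', 'ava']

chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0                                # boundary token
itos = {i: s for s, i in stoi.items()}

encoded = [stoi[ch] for ch in 'emma.']       # characters -> integers
decoded = "".join(itos[i] for i in encoded)  # integers -> characters
print(encoded, decoded)
```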

In [5]:
some = [0]*3
some

[0, 0, 0]

In [6]:
words[:1]

['emma']

### Dataset preparation

Use the 3 previous letters to predict the next one.

In [69]:
X , Y = [], []
block_size= 3 # can be reset to whatever you like

for w in words[:3]:
    #'emma'
    print(w)
    context = [0]*block_size # contains indices of context letters
    for ch in w + '.':
        ix = stoi[ch] # 'e' -> 5
        Y.append(ix) # 5 is the target
        X.append(context)
        print("".join(itos[i] for i in context), '------->', ch)
        context = context[1:] + [ix] # update context and append new index

X = torch.tensor(X)
Y = torch.tensor(Y)


emma
... -------> e
..e -------> m
.em -------> m
emm -------> a
mma -------> .
olivia
... -------> o
..o -------> l
.ol -------> i
oli -------> v
liv -------> i
ivi -------> a
via -------> .
ava
... -------> a
..a -------> v
.av -------> a
ava -------> .


In [8]:
print(X[:3], Y[:3])
print(X.shape, X.dtype, Y.shape, Y.dtype)

tensor([[ 0,  0,  0],
        [ 0,  0,  5],
        [ 0,  5, 13]]) tensor([ 5, 13, 13])
torch.Size([16, 3]) torch.int64 torch.Size([16]) torch.int64


So the input X has 3 integer features per example, and the target Y is a single integer per example.
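The dataset loop can also be wrapped in a reusable function, which makes it easy to try other context lengths. This is a minimal sketch: `build_dataset` is a hypothetical helper name, and the vocabulary is assumed to be `'.'` plus the 26 lowercase letters.

```python
import string
import torch

# Assumption: vocabulary is '.' (index 0) plus the lowercase letters a-z (1..26)
stoi = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
stoi['.'] = 0

def build_dataset(words, block_size=3):
    """Return (X, Y): contexts of block_size indices and the index that follows each."""
    X, Y = [], []
    for w in words:
        context = [0] * block_size        # start with all-'.' padding
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]  # slide the window (creates a new list)
    return torch.tensor(X), torch.tensor(Y)

X, Y = build_dataset(['emma', 'olivia', 'ava'])
print(X.shape, Y.shape)  # torch.Size([16, 3]) torch.Size([16])

X4, Y4 = build_dataset(['emma', 'olivia', 'ava'], block_size=4)
print(X4.shape)          # torch.Size([16, 4])
```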

Now let's build the embedding lookup table $C$:
- We have $27$ possible characters, which we will embed into a lower-dimensional space (unlike one-hot encoding, which is still 27-dimensional!)
- In the _paper_, $17000$ words are compressed into a $30$-dimensional space. 

In [9]:
C = torch.randn((27,2)) # each of 27 characters has a 2D embedding

C[:3]

tensor([[-1.3508, -0.8780],
        [-0.6882,  0.7677],
        [-0.0113, -1.4200]])

Now, how do we access the embedding for a single integer, say $5$?

In [10]:
# option 1: index into C directly
print(C[5])

# option 2: one-hot encode 5 and then multiply -- as was done in bigram 
print(F.one_hot(torch.tensor(5), num_classes=27).float() @ C)


tensor([ 1.7233, -0.6068])
tensor([ 1.7233, -0.6068])


The `.float()` cast is needed because `F.one_hot` returns an integer tensor, which cannot be multiplied with the float matrix `C`. Note that both approaches give the same embedding for $5$. 

Going forward we will just extract the row directly using the index. 
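To see that the two options agree for every index and not just for $5$, here is a small check. It uses its own randomly initialised `C` with a fixed seed, so the values differ from the ones printed above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)          # assumption: arbitrary seed, only for reproducibility
C = torch.randn((27, 2))

idx = torch.arange(27)
one_hot = F.one_hot(idx, num_classes=27)
print(one_hot.dtype)          # torch.int64

via_matmul = one_hot.float() @ C   # (27, 27) @ (27, 2) -> (27, 2)
via_index = C[idx]                 # plain row lookup
same = torch.allclose(via_matmul, via_index)
print(same)
```

Multiplying by a one-hot row simply selects one row of `C`, which is why direct indexing is the cheaper way to get the same result.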

<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">Question:</span> How do we convert X ($16 \times 3$) into embeddings? We can leverage PyTorch's flexible indexing. 

In [33]:
C[[5,13,4,4,4]]  # retrieves rows 5, 13 and 4 of C (row 4 three times)

tensor([[ 1.3538,  0.7665],
        [-1.5971, -1.9288],
        [ 0.4243,  0.1791],
        [ 0.4243,  0.1791],
        [ 0.4243,  0.1791]])

Here we indexed with a 1-dimensional list of integers. It turns out we can also index with a 2-dimensional tensor of integers. For example: 

In [15]:
emb = C[X]

In [16]:
emb[:2], emb.shape # dim(X) = 16*3 and each element has 2 dim embedding => 16*3*2

(tensor([[[-1.3508, -0.8780],
          [-1.3508, -0.8780],
          [-1.3508, -0.8780]],
 
         [[-1.3508, -0.8780],
          [-1.3508, -0.8780],
          [ 1.7233, -0.6068]]]),
 torch.Size([16, 3, 2]))

More experimentation on higher dimension tensor indexing in the data structures notebook!
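A compact illustration of what 2-dimensional indexing does: the element `emb[i, j]` is row `X[i, j]` of `C`. The sketch uses a freshly seeded `C` and only the first three contexts from the dataset; `F.embedding` is PyTorch's functional form of the same lookup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)                      # assumption: arbitrary seed
C = torch.randn((27, 2))
X = torch.tensor([[0, 0, 0],
                  [0, 0, 5],
                  [0, 5, 13]])            # first three contexts of 'emma'

emb = C[X]                                # index with a 2-D integer tensor
print(emb.shape)                          # torch.Size([3, 3, 2])

print(torch.equal(emb[1, 2], C[5]))       # True: X[1, 2] == 5
print(torch.equal(emb[2, 2], C[13]))      # True: X[2, 2] == 13
print(torch.equal(F.embedding(X, C), emb))  # True: same lookup
```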

### Hidden layer initialization

In [14]:
# 3 previous characters with a 2D embedding each => 6 input features; each neuron has one weight per feature
# 100 neurons -- hyperparameter
W1 = torch.randn((6,100)) 
b1 = torch.randn(100)

Ideally we want to compute `emb @ W1 + b1`, but the shapes are not compatible for a direct matrix multiplication: `emb` is `(16,3,2)` while `W1` is `(6,100)`. 

We need to concatenate the embeddings of the 3 characters to get `emb: (16,6)`, which is compatible with `W1: (6,100)`.

So we must transform emb: 16,3,2 -> 16,6, 
using [torch.cat()](https://docs.pytorch.org/docs/stable/generated/torch.cat.html#torch.cat)

In [25]:
print(emb.shape)
emb_concat = torch.cat([emb[:,0,:], emb[:,1,:], emb[:,2,:]], dim=1) # works, but hardcodes the 3 context positions
print(emb_concat.shape)

torch.Size([16, 3, 2])
torch.Size([16, 6])


But this doesn't generalize well, since we have _hardcoded_ the indices $0,1,2$; instead we will use [torch.unbind](https://docs.pytorch.org/docs/stable/generated/torch.unbind.html#torch-unbind)

1. So we first unbind along axis 1: <br>
torch.Size([16, 3, 2]) $\rightarrow$ tuple(torch.Size([16, 2]),torch.Size([16, 2]),torch.Size([16, 2]))

2. Concat along dim 1 <br>
tuple $\rightarrow$ torch.Size([16, 6])

__Note that this version is not hardcoded: if the context length is changed, this code still works!__

In [29]:
emb_concat = torch.cat(torch.unbind(emb, dim = 1), dim = 1) # (16, 6); X keeps holding the integer inputs

<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">Note on storage efficiency:</span>

A much more memory-efficient way to do this reshaping is `emb.view()`, which returns a new tensor that shares the original tensor's underlying storage instead of copying it. 

This is efficient because the data of the tensor (say t1) stays stored in the same way and is only _represented_ with a different shape when `.view()` is called! 

In [38]:
X1 = emb.view(16,6)
X1.shape

torch.Size([16, 6])

^This gives the same result as unbinding and concatenating, but without the memory and speed cost of creating a new tensor. 
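Both claims can be checked directly: the two approaches produce identical values, and only `.view()` reuses the original memory. The sketch below uses a random stand-in for `emb` with the same shape; `data_ptr()` returns the address of a tensor's first element.

```python
import torch

torch.manual_seed(0)                 # assumption: arbitrary seed
emb = torch.randn(16, 3, 2)          # stand-in for C[X]

via_cat = torch.cat(torch.unbind(emb, dim=1), dim=1)  # copies into new memory
via_view = emb.view(emb.shape[0], -1)                 # reinterprets existing memory

print(torch.equal(via_cat, via_view))           # True: same values
print(via_view.data_ptr() == emb.data_ptr())    # True: shares storage with emb
print(via_cat.data_ptr() == emb.data_ptr())     # False: separate copy
```

Note that `.view()` needs a compatible memory layout, which holds here because `emb` is contiguous.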

### Hidden layer outputs

In [44]:
# h = emb.view(16,-1) @ W1 + b1
h = emb.view(emb.shape[0], -1) @ W1 + b1 # to avoid hard coding
h.shape

torch.Size([16, 100])

In [53]:
# applying activation to get output of hidden layer
H = torch.tanh(h)
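The `+ b1` step relies on broadcasting: `b1` has shape `(100,)` and is added to every one of the 16 rows. A minimal sketch with random stand-in tensors of the same shapes as above:

```python
import torch

torch.manual_seed(0)                      # assumption: arbitrary seed
emb = torch.randn(16, 3, 2)               # stand-in for C[X]
W1 = torch.randn(6, 100)
b1 = torch.randn(100)

pre = emb.view(emb.shape[0], -1) @ W1     # (16, 6) @ (6, 100) -> (16, 100)
h = pre + b1                              # b1 (100,) is broadcast across the 16 rows
H = torch.tanh(h)                         # squashes every activation into [-1, 1]

print(h.shape)                            # torch.Size([16, 100])
print(torch.allclose(h[3], pre[3] + b1))  # True: the same bias is added to each row
```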

### Output layer

In [54]:
W2 = torch.randn((100,27)) # 100 input features from H; 27 output neurons, one per possible character
b2 = torch.randn(27) 

In [None]:
logits = H @ W2 + b2
logits.shape

torch.Size([16, 27])

Now we normalize the logits into probabilities using softmax. 

In [71]:
probs = torch.softmax(logits, dim = 1)
probs.shape

torch.Size([16, 27])
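For reference, softmax can be written out by hand: exponentiate the logits and normalize each row so it sums to 1. The sketch below uses random stand-in logits; subtracting the row maximum first is a common numerical-stability trick and does not change the result.

```python
import torch

torch.manual_seed(0)                      # assumption: arbitrary seed
logits = torch.randn(16, 27)              # stand-in for H @ W2 + b2

shifted = logits - logits.max(dim=1, keepdim=True).values  # stability: largest logit becomes 0
counts = shifted.exp()
probs_manual = counts / counts.sum(dim=1, keepdim=True)    # each row sums to 1

probs = torch.softmax(logits, dim=1)
print(torch.allclose(probs, probs_manual))  # True
```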

### Loss

Now we want to index into `probs` at the correct label for each example and read off the probability the network assigned to it. 

For example, given the labels $Y$ below, for row 0 we look at index 5 of `probs`. The loss is the negative log of that probability, averaged over all rows. 

In [72]:
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0])

In [75]:
probs[torch.arange(emb.shape[0]), Y]

tensor([7.1041e-04, 1.1634e-06, 2.0192e-18, 1.5043e-10, 1.0573e-10, 4.8019e-06,
        6.4802e-15, 3.5613e-01, 4.4052e-12, 6.4136e-05, 3.0762e-05, 5.9196e-10,
        8.9206e-07, 3.2348e-10, 2.7067e-12, 3.1208e-05])

The probabilities assigned to the correct labels are very low, so the loss is expected to be high! (Take the mean, not the sum!)

In [80]:
loss = -probs[torch.arange(emb.shape[0]), Y].log().mean()
print(loss)

tensor(18.3391)
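The same quantity is available as `F.cross_entropy`, which takes the raw logits and the integer labels directly. The sketch below uses random stand-in logits and labels, so its loss value is unrelated to the one printed above; it also computes the loss of a uniform guess over 27 characters, $\ln 27 \approx 3.30$, as a point of comparison.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)                       # assumption: arbitrary seed
logits = torch.randn(16, 27)               # stand-in for the network's logits
Y = torch.randint(0, 27, (16,))            # stand-in integer labels

# manual negative log-likelihood, as in the cell above
probs = torch.softmax(logits, dim=1)
manual = -probs[torch.arange(16), Y].log().mean()

# built-in equivalent: softmax + log + indexing + mean in one call
builtin = F.cross_entropy(logits, Y)
print(torch.allclose(manual, builtin, atol=1e-5))  # True

# loss of a model that assigns probability 1/27 to every character
uniform = -torch.log(torch.tensor(1.0 / 27))
```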


In the next notebook, let's make the forward pass that computes the NLL loss more coherent. 