**Multilayer Perceptron (MLP)**

Reference paper: [A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

In [2]:
import torch
import torch.nn.functional as F

In [3]:
words = open("names.txt").read().splitlines()
words[:10]

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

In [4]:
len(words)

32033

In [5]:
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


The network architecture from the paper:

![alt text](img/mlp_NN.png)

We will build our network along the same lines.

In [6]:
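# build the dataset: each input is a context of block_size characters, each label is the next character
block_size = 3  # context length: how many characters are used to predict the next one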
x, y = [], []
for w in words:
    # print(w)
    context = [0]*block_size
    for ch in w + '.':
        ix = stoi[ch]
        x.append(context)
        y.append(ix)
        # print(''.join(itos[i] for i in context), '---->', itos[ix])
        context = context[1:] + [ix]  # crop and append

x = torch.tensor(x)
y = torch.tensor(y)


In [7]:
x.shape, y.shape


(torch.Size([228146, 3]), torch.Size([228146]))
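To see what the loop above produces, here is a minimal sketch for a single name in plain Python. The helper name build_examples is an assumption made for illustration and is not part of the notebook. It mirrors the loop above with the same block_size of 3 and '.' as both the padding and the end token, but keeps characters instead of indices so the pairs are readable.

```python
def build_examples(word, block_size=3):
    """Return (context, target) string pairs for one word (hypothetical helper)."""
    examples = []
    context = ['.'] * block_size          # start with padding tokens
    for ch in word + '.':                 # '.' marks the end of the word
        examples.append((''.join(context), ch))
        context = context[1:] + [ch]      # slide the window by one character
    return examples

for ctx, target in build_examples("emma"):
    print(ctx, '---->', target)
# ... ----> e
# ..e ----> m
# .em ----> m
# emm ----> a
# mma ----> .
```

A word of length n therefore contributes n + 1 examples, which is why the dataset has many more rows than there are names.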

c is our embedding table: it maps each of the 27 character indices to a 2-dimensional embedding vector.

In [8]:
c = torch.randn((27,2))

In [9]:
emb = c[x]
emb.shape

torch.Size([228146, 3, 2])
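One way to read the indexing above: picking row i of the table gives the same result as multiplying a one-hot vector for i by the table. The sketch below checks this on two made-up contexts. The seed and the example indices are arbitrary assumptions for illustration.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)        # seed chosen arbitrarily for this sketch
c = torch.randn((27, 2), generator=g)       # embedding table: 27 characters, 2 dims each
x = torch.tensor([[0, 0, 5], [0, 5, 13]])   # two illustrative contexts of 3 indices

emb = c[x]                                       # lookup by indexing: shape (2, 3, 2)
one_hot = F.one_hot(x, num_classes=27).float()   # shape (2, 3, 27)
emb_via_matmul = one_hot @ c                     # shape (2, 3, 2)

print(emb.shape)
print(torch.allclose(emb, emb_via_matmul))
```

Indexing is the cheaper of the two, since it avoids building the one-hot tensor.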

Now the hidden layer:

In [10]:
w1 = torch.randn((6,100)) # 3 context characters * 2 embedding dims = 6 inputs, 100 neurons
b1 = torch.randn(100)

We want 
>emb @ w1 + b1

but the shapes are not compatible: emb is ([228146, 3, 2]) and w1 is ([6, 100]).

Therefore we will **concatenate** the embeddings of the three context positions along **dimension 1**, i.e. 

**[no. of samples, context positions, embedding size]** --> **[no. of samples, context positions × embedding size]**, which gives **([228146, 6]).**


In [11]:
torch.cat([emb[:,0,:], emb[:,1,:], emb[:,2,:]], dim=1).shape

torch.Size([228146, 6])

But hard-coding the three slices won't work if we change the context size from 3 to something else. To handle any context size we can use 

>**torch.unbind(emb, dim=1)**

which returns a tuple of the slices along dimension 1 (here three tensors of shape [228146, 2]) that we can then concatenate.

In [12]:
torch.cat(torch.unbind(emb, dim=1), dim=1).shape

torch.Size([228146, 6])

Even simpler, we can use 

>**emb.view(emb.shape[0], 6)**

which gives the same result without copying any data, whereas torch.cat allocates a new tensor.
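The three ways of flattening the embeddings can be compared directly. This is a minimal sketch on a small random stand-in for emb (5 samples rather than the full dataset, with an arbitrary seed), checking that all three produce the same values and that only the view shares memory with the original tensor.

```python
import torch

g = torch.Generator().manual_seed(0)        # seed chosen arbitrarily for this sketch
emb = torch.randn((5, 3, 2), generator=g)   # stand-in for c[x]: 5 samples, 3 positions, 2 dims

a = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1)  # hard-coded for 3 positions
b = torch.cat(torch.unbind(emb, dim=1), dim=1)                    # works for any context length
v = emb.view(emb.shape[0], -1)                                    # -1 infers 3 * 2 = 6

print(a.shape, b.shape, v.shape)
print(torch.equal(a, b), torch.equal(a, v))
print(v.data_ptr() == emb.data_ptr())   # the view shares storage with emb
print(a.data_ptr() == emb.data_ptr())   # cat wrote its result into new memory
```

Passing -1 as the last size lets view work unchanged if the context length or embedding size changes.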

In [13]:
h = torch.tanh(emb.view(emb.shape[0], 6) @ w1 + b1) # flatten emb to (N, 6), then hidden layer -> (N, 100)

In [14]:
h.shape

torch.Size([228146, 100])

Now the second layer, which here is the output layer.

In [15]:
w2 = torch.randn((100,27))
b2 = torch.randn(27)

In [16]:
logits = h @ w2 + b2
logits.shape

torch.Size([228146, 27])

In [17]:
counts = logits.exp()

In [18]:
probs = counts/counts.sum(dim=-1, keepdim = True)
probs.shape

torch.Size([228146, 27])

In [20]:
loss = -probs[torch.arange(x.shape[0]),y].log().mean()

In [21]:
loss

tensor(16.2408)

**Finally**

Now we put all of these steps together.

In [22]:
x.shape, y.shape #Datasets

(torch.Size([228146, 3]), torch.Size([228146]))

In [34]:
g = torch.Generator().manual_seed(2147483647)
c = torch.randn((27,2), generator=g)
w1 = torch.randn((6,100), generator=g) # 3 context characters * 2 embedding dims = 6 inputs, 100 neurons
b1 = torch.randn(100, generator=g)  
w2 = torch.randn((100,27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [c, w1, b1, w2, b2]


In [35]:
sum(p.nelement() for p in parameters)

3481
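The total can be checked by hand from the layer shapes. In this sketch the names vocab_size, emb_dim and hidden are assumptions introduced for illustration; the notebook hard-codes the corresponding numbers.

```python
vocab_size, emb_dim, block_size, hidden = 27, 2, 3, 100

n_c = vocab_size * emb_dim              # embedding table c: 27 * 2 = 54
n_w1 = block_size * emb_dim * hidden    # w1: 6 * 100 = 600
n_b1 = hidden                           # b1: 100
n_w2 = hidden * vocab_size              # w2: 100 * 27 = 2700
n_b2 = vocab_size                       # b2: 27

total = n_c + n_w1 + n_b1 + n_w2 + n_b2
print(total)  # 3481
```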

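We can replace this 

>counts = logits.exp()

>probs = counts/counts.sum(dim=-1, keepdim = True)

>loss = -probs[torch.arange(x.shape[0]),y].log().mean()

with

>F.cross_entropy(logits, y)

which computes the same loss but is more efficient and stays numerically stable when the logits are large.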
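As a sanity check, this sketch compares the manual computation with F.cross_entropy on small made-up logits, and then shows one case where the manual version breaks down: a large logit makes exp() overflow to inf in float32, so the normalisation yields nan. All values here are illustrative assumptions, not taken from the model above.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)         # seed chosen arbitrarily for this sketch
logits = torch.randn((4, 27), generator=g)   # made-up logits for 4 examples
y = torch.tensor([5, 13, 13, 1])             # made-up target indices

# manual softmax + negative log-likelihood
counts = logits.exp()
probs = counts / counts.sum(dim=-1, keepdim=True)
manual = -probs[torch.arange(y.shape[0]), y].log().mean()
builtin = F.cross_entropy(logits, y)
print(torch.allclose(manual, builtin))

# one very large logit: exp(100) overflows float32
big = torch.tensor([[-5.0, 0.0, 100.0]])
target = torch.tensor([2])
big_counts = big.exp()
big_probs = big_counts / big_counts.sum(dim=-1, keepdim=True)  # inf / inf gives nan
manual_big = -big_probs[0, 2].log()
builtin_big = F.cross_entropy(big, target)
print(manual_big, builtin_big)
```

The built-in function avoids the overflow by working with log-probabilities directly instead of exponentiating the raw logits.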

In [36]:
for p in parameters:
    p.requires_grad = True

In [37]:
lre = torch.linspace(-3, 0, 1000) # log learning rates
lrs = 10**lre # learning rates 
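The exponents are spaced evenly, so the learning rates themselves are spaced evenly on a log scale, from 0.001 up to 1. The sketch below reproduces the same spacing in plain Python (no torch), purely to make the values easy to inspect; the variable n is an assumption introduced for illustration.

```python
n = 1000
lre = [-3 + 3 * i / (n - 1) for i in range(n)]   # evenly spaced exponents from -3 to 0
lrs = [10 ** e for e in lre]                      # candidate learning rates

print(lrs[0], lrs[-1])            # smallest and largest candidate
print(lrs[1] / lrs[0])            # ratio between neighbouring candidates
print(lrs[-1] / lrs[-2])          # the same ratio at the other end
```

Each candidate is a constant factor larger than the previous one, so small learning rates are sampled as densely as large ones.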

In [44]:
lri = []
lossesi = []

for step in range(10000):
    # mini-batch: sample 32 random example indices
    ix = torch.randint(0, x.shape[0], (32,))

    # forward pass
    emb = c[x[ix]] # (32, 3, 2)
    h = torch.tanh(emb.view(emb.shape[0], 6) @ w1 + b1) # (32, 100)
    logits = h @ w2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, y[ix])

    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # update
    # lr = lrs[step]  # learning-rate sweep: lrs has 1000 entries, so use range(1000) above
    lr = 0.1
    for p in parameters:
        p.data += -lr * p.grad
    # lri.append(lre[step])
    # lossesi.append(loss.item())
print(loss.item()) # loss on the last mini-batch only

2.1009087562561035


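Running the forward and backward pass over all 228,146 examples on every step takes a long time, so each step uses a random **mini-batch** of 32 examples instead. The gradient from a mini-batch is noisier, but each step is much cheaper, so we can take many more steps.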

![alt text](img/lrcurve.png)

From the learning-rate sweep plotted above, a learning rate of about 0.1 works well for this model, so that is the value we use.
