# MLP

The bigram model, which uses probabilities based on normalized counts, has its limitations.

To extend it with more context, say two characters as input, the probability matrix will have 27 * 27 possible contexts, and for three characters 27 * 27 * 27, which quickly becomes too big.
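To make the growth concrete, here is a small sketch that counts contexts and table entries for a few context lengths. It assumes the 27-symbol vocabulary used in this notebook; `table_entries` is an illustrative helper, not something from the original code.

```python
# Size of a count-based probability table as the context grows.
# Assumes the 27-symbol vocabulary of this notebook (26 letters + '.').
vocab_size = 27

def table_entries(context_len: int) -> int:
    # One row per possible context, one column per possible next character.
    return vocab_size ** context_len * vocab_size

for n in range(1, 5):
    contexts = vocab_size ** n
    print(f"context of {n} char(s): {contexts:>7} contexts, {table_entries(n):>9} table entries")
```

Every extra character of context multiplies the table by 27, which is why a counting approach stops being practical so quickly.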

To overcome this, we're gonna try out the [Bengio et al. 2003 MLP model paper](https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqbVhPSHNrVkluYVY5elh4RDZrOWd4R2xVNVRQd3xBQ3Jtc0tsUDE0UFVRQURUTTlWckExZWp4eGxFa3lRYlQ3amtYX3kxdDI1ZW5uU1pxZERidUYyZkJjSVlzd21rMndCMFlYRW5kYmZISkxfSDR1TzhaOXI1bXptUnUxU0xyUXJYeEpTZlRrTkRjTS0wTkMxNjFnSQ&q=https%3A%2F%2Fwww.jmlr.org%2Fpapers%2Fvolume3%2Fbengio03a%2Fbengio03a.pdf&v=TCH_1BHY58I).

* The paper uses words, but we'll proceed with characters
* Each input token is represented as an embedding vector (30-dimensional in the paper)
* The advantage of embeddings is knowledge transfer. For example, animals like dog and cat might end up close to each other in the 30-dimensional space, so even if a particular phrase with cat never appeared in the training set, what the model learned from similar phrases with dog can carry over.

Let's implement the architecture below in this notebook:
![fully connected MLP](https://pbs.twimg.com/media/Fhzl42hVUAI9U8V?format=jpg&name=large)

* Three input characters, each with its own 30-dimensional embedding
* A lookup table that maps characters to embeddings
* A tanh hidden layer fully connected to the three embedded inputs
* Since we have 27 characters, a final layer with 27 units (logits)
* A softmax on top of it to normalize the logits into probabilities
* Pluck out the probability assigned to the correct label

In [23]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

## Rebuilding training dataset

In [24]:
# Read all words
def read_words():
    words = open("names.txt").read().splitlines()
    return words
words = read_words()

In [25]:
len(words)

32033

In [26]:
# build the vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi  = {s:i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s, i in stoi.items()}
print(itos)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


### Build the dataset

In [27]:
def build_dataset(block_size, number_of_words: int, logs=False):

    # block_size is the context length: how many characters do we take to predict the next one?
    X, Y = [], []
    for w in words[:number_of_words]:

        print(w)
        context = [0] * block_size
        for ch in w + '.':
            print(f"Context: {context}")
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            if logs:
                print(''.join(itos[i] for i in context), '--->', itos[ix])
            context = context[1:] + [ix]
            print(f"Context after append: {context}")

    X = torch.tensor(X)
    Y = torch.tensor(Y)
    return X, Y

In [28]:
X, Y = build_dataset(block_size=3, number_of_words=5, logs=True)

emma
Context: [0, 0, 0]
... ---> e
Context after append: [0, 0, 5]
Context: [0, 0, 5]
..e ---> m
Context after append: [0, 5, 13]
Context: [0, 5, 13]
.em ---> m
Context after append: [5, 13, 13]
Context: [5, 13, 13]
emm ---> a
Context after append: [13, 13, 1]
Context: [13, 13, 1]
mma ---> .
Context after append: [13, 1, 0]
olivia
Context: [0, 0, 0]
... ---> o
Context after append: [0, 0, 15]
Context: [0, 0, 15]
..o ---> l
Context after append: [0, 15, 12]
Context: [0, 15, 12]
.ol ---> i
Context after append: [15, 12, 9]
Context: [15, 12, 9]
oli ---> v
Context after append: [12, 9, 22]
Context: [12, 9, 22]
liv ---> i
Context after append: [9, 22, 9]
Context: [9, 22, 9]
ivi ---> a
Context after append: [22, 9, 1]
Context: [22, 9, 1]
via ---> .
Context after append: [9, 1, 0]
ava
Context: [0, 0, 0]
... ---> a
Context after append: [0, 0, 1]
Context: [0, 0, 1]
..a ---> v
Context after append: [0, 1, 22]
Context: [0, 1, 22]
.av ---> a
Context after append: [1, 22, 1]
Context: [1, 22, 1]
av

In [29]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

In [30]:
X

tensor([[ 0,  0,  0],
        [ 0,  0,  5],
        [ 0,  5, 13],
        [ 5, 13, 13],
        [13, 13,  1],
        [ 0,  0,  0],
        [ 0,  0, 15],
        [ 0, 15, 12],
        [15, 12,  9],
        [12,  9, 22],
        [ 9, 22,  9],
        [22,  9,  1],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1, 22],
        [ 1, 22,  1],
        [ 0,  0,  0],
        [ 0,  0,  9],
        [ 0,  9, 19],
        [ 9, 19,  1],
        [19,  1,  2],
        [ 1,  2,  5],
        [ 2,  5, 12],
        [ 5, 12, 12],
        [12, 12,  1],
        [ 0,  0,  0],
        [ 0,  0, 19],
        [ 0, 19, 15],
        [19, 15, 16],
        [15, 16,  8],
        [16,  8,  9],
        [ 8,  9,  1]])

In [31]:
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0])

Now that we have the dataset, let's build the embedding lookup table.

## Embedding lookup table

In the paper, a 30-dimensional space was used for a vocabulary of 17,000 words. For our 27 possibilities (characters), let's try a 2-dimensional embedding.

In [32]:
# Initialized randomly
C = torch.randn((27, 2))
C, C.shape

(tensor([[ 2.0444, -0.8643],
         [ 0.4949,  0.6256],
         [ 0.3523, -1.5919],
         [-0.5518,  1.1153],
         [ 0.0844,  0.6198],
         [ 1.2884, -0.5162],
         [ 0.3585, -0.8834],
         [-0.8051, -0.1403],
         [-0.7723, -0.6009],
         [-0.8732, -0.0418],
         [ 0.0856,  1.7633],
         [ 0.1055, -1.2005],
         [ 0.8034,  1.5798],
         [ 1.5923,  0.8133],
         [-0.8429, -1.2364],
         [ 0.4664, -1.3050],
         [-0.2633, -0.9519],
         [ 1.0584, -0.9919],
         [ 0.0866,  0.9613],
         [ 0.7803,  1.2683],
         [ 1.9862, -0.5937],
         [ 0.8093, -0.0125],
         [-0.3031, -0.5510],
         [ 0.5475,  1.2947],
         [ 1.0933, -0.8364],
         [-0.6993,  1.9711],
         [ 0.0128, -0.2101]]),
 torch.Size([27, 2]))

In [33]:
# The lookup of embedding for single character can be done two ways
# 1. Indexing
C[5]

tensor([ 1.2884, -0.5162])

In [34]:
# 2. Onehot
F.one_hot(torch.tensor(5), num_classes=27).float() @ C

tensor([ 1.2884, -0.5162])

Indexing and one-hot encoding give the same result. We'll use indexing as it's faster.
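The same equivalence holds for a whole batch of contexts at once. A minimal sketch, assuming a random stand-in table `C_demo` and a hand-written `X_demo` (both illustrative names, not taken from the notebook):

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)             # arbitrary seed, only for reproducibility
C_demo = torch.randn((27, 2), generator=g)       # stand-in embedding table
X_demo = torch.tensor([[0, 0, 5], [0, 5, 13]])   # two contexts of three character ids

by_index = C_demo[X_demo]                                        # shape (2, 3, 2)
by_onehot = F.one_hot(X_demo, num_classes=27).float() @ C_demo   # (2, 3, 27) @ (27, 2)

print(by_index.shape, by_onehot.shape)
print(torch.allclose(by_index, by_onehot))
```

The one-hot route builds a (2, 3, 27) tensor that is mostly zeros and then multiplies it away, which is the extra work that plain indexing avoids.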

In [35]:
# Indexing multiple values
# Since the shape of our input is 32, 3
print(C[[5, 6, 7]])
# Works also with tensor
print(C[torch.tensor([5, 6, 7])])

tensor([[ 1.2884, -0.5162],
        [ 0.3585, -0.8834],
        [-0.8051, -0.1403]])
tensor([[ 1.2884, -0.5162],
        [ 0.3585, -0.8834],
        [-0.8051, -0.1403]])


In [36]:
# The total equivalent would be
C[X]

tensor([[[ 2.0444, -0.8643],
         [ 2.0444, -0.8643],
         [ 2.0444, -0.8643]],

        [[ 2.0444, -0.8643],
         [ 2.0444, -0.8643],
         [ 1.2884, -0.5162]],

        [[ 2.0444, -0.8643],
         [ 1.2884, -0.5162],
         [ 1.5923,  0.8133]],

        [[ 1.2884, -0.5162],
         [ 1.5923,  0.8133],
         [ 1.5923,  0.8133]],

        [[ 1.5923,  0.8133],
         [ 1.5923,  0.8133],
         [ 0.4949,  0.6256]],

        [[ 2.0444, -0.8643],
         [ 2.0444, -0.8643],
         [ 2.0444, -0.8643]],

        [[ 2.0444, -0.8643],
         [ 2.0444, -0.8643],
         [ 0.4664, -1.3050]],

        [[ 2.0444, -0.8643],
         [ 0.4664, -1.3050],
         [ 0.8034,  1.5798]],

        [[ 0.4664, -1.3050],
         [ 0.8034,  1.5798],
         [-0.8732, -0.0418]],

        [[ 0.8034,  1.5798],
         [-0.8732, -0.0418],
         [-0.3031, -0.5510]],

        [[-0.8732, -0.0418],
         [-0.3031, -0.5510],
         [-0.8732, -0.0418]],

        [[-0.3031, -0

In [37]:
# Let's verify this
C[X].shape

torch.Size([32, 3, 2])

Here 32 is the number of examples, 3 is the context length (block size), and 2 is the embedding dimension.

In [38]:
X[13, 2]

tensor(1)

In [39]:
C[X][13, 2]

tensor([0.4949, 0.6256])

In [40]:
C[1]

tensor([0.4949, 0.6256])

In [41]:
emb = C[X]
emb.shape

torch.Size([32, 3, 2])

Now the embedding lookup table is completed.

## Implementing the hidden layer plus internals of torch.Tensor, storage and views

In [42]:
# Initializing weights and biases
W1 = torch.randn((
    6, # 3(inputs) * 2(embedding dim)
    100 # Number of neurons
))
b1 = torch.randn(100)

In [43]:
W1.shape

torch.Size([6, 100])

In [44]:
# Inputs @ weights + bias will not work yet,
# because the shapes of the input and the weights don't follow
# the rules of matrix multiplication:
# shape of input [32, 3, 2], weights [6, 100]
emb @ W1 + b1

RuntimeError: mat1 and mat2 shapes cannot be multiplied (96x2 and 6x100)

PyTorch's tensor library is really powerful because it has tons of methods that let us create, modify, and perform lots of operations on tensors.

We're gonna use [torch.cat](https://pytorch.org/docs/stable/generated/torch.cat.html) to tackle the above problem.

In [45]:
cat_tensors = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1)
cat_tensors.shape

torch.Size([32, 6])

In [46]:
# To generalize this to a different block size,
# we'll use unbind together with cat
unbind_tensors = torch.unbind(emb, 1)
# Gives a tuple of tensors, exactly the same
# slices we passed to torch.cat above
len(unbind_tensors)

3

In [47]:
cat_unbind_tensors = torch.cat(unbind_tensors, 1)
cat_unbind_tensors.shape

torch.Size([32, 6])

Now the above code will run irrespective of the block size.

But there's a more efficient way to do this.

In [48]:
a = torch.arange(18)
a

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])

In [49]:
a.shape

torch.Size([18])

In [50]:
a.view(3, 3, 2)

tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])

In [51]:
a.storage()

 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
[torch.storage._TypedStorage(dtype=torch.int64, device=cpu) of size 18]

Every tensor has a view and an underlying storage:
* Using tensor.view(shape) we can change the shape of a tensor
* But tensor.storage() always remains a one-dimensional vector in memory
* Calling view only changes attributes such as the storage offset, strides and shape; the data in memory stays the same
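The points above can be checked directly. The sketch below reuses the `arange(18)` example and inspects the memory address and strides; `b` is just an illustrative name for the reshaped tensor.

```python
import torch

a = torch.arange(18)
b = a.view(3, 3, 2)

# Same underlying memory: both tensors start at the same address.
print(a.data_ptr() == b.data_ptr())

# Only the metadata differs.
print(a.shape, a.stride())   # torch.Size([18]) (1,)
print(b.shape, b.stride())   # torch.Size([3, 3, 2]) (6, 2, 1)

# Writing through the view is visible in the original tensor.
b[0, 0, 0] = 100
print(a[0])
```

Because no data is copied, `view` is essentially free, whereas `torch.cat` has to allocate new memory for its result.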

In [52]:
# Let's use view() to reshape the tensor from [32, 3, 2] to [32, 6]
emb.view(32, 6)

tensor([[ 2.0444, -0.8643,  2.0444, -0.8643,  2.0444, -0.8643],
        [ 2.0444, -0.8643,  2.0444, -0.8643,  1.2884, -0.5162],
        [ 2.0444, -0.8643,  1.2884, -0.5162,  1.5923,  0.8133],
        [ 1.2884, -0.5162,  1.5923,  0.8133,  1.5923,  0.8133],
        [ 1.5923,  0.8133,  1.5923,  0.8133,  0.4949,  0.6256],
        [ 2.0444, -0.8643,  2.0444, -0.8643,  2.0444, -0.8643],
        [ 2.0444, -0.8643,  2.0444, -0.8643,  0.4664, -1.3050],
        [ 2.0444, -0.8643,  0.4664, -1.3050,  0.8034,  1.5798],
        [ 0.4664, -1.3050,  0.8034,  1.5798, -0.8732, -0.0418],
        [ 0.8034,  1.5798, -0.8732, -0.0418, -0.3031, -0.5510],
        [-0.8732, -0.0418, -0.3031, -0.5510, -0.8732, -0.0418],
        [-0.3031, -0.5510, -0.8732, -0.0418,  0.4949,  0.6256],
        [ 2.0444, -0.8643,  2.0444, -0.8643,  2.0444, -0.8643],
        [ 2.0444, -0.8643,  2.0444, -0.8643,  0.4949,  0.6256],
        [ 2.0444, -0.8643,  0.4949,  0.6256, -0.3031, -0.5510],
        [ 0.4949,  0.6256, -0.3031, -0.5

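This works because the last two dimensions (3 characters, each with 2 embedding values) sit next to each other in memory, so they can be read as a single dimension of size 6.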

In [53]:
emb.view(32, 6).shape

torch.Size([32, 6])

In [54]:
emb.view(32, 6) == torch.cat(torch.unbind(emb, 1), 1)

tensor([[True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, T

The element-wise comparison shows that view gives the same result as cat(unbind).
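A more compact way to run the same check is `torch.equal`, which returns a single boolean. This is a sketch on a random stand-in tensor `emb_demo` with the same shape as `emb`; it also uses `-1` so the flattened size does not have to be hard-coded.

```python
import torch

emb_demo = torch.randn(32, 3, 2)   # stand-in with the same shape as emb

flat_view = emb_demo.view(emb_demo.shape[0], -1)     # -1 lets PyTorch infer 3 * 2 = 6
flat_cat = torch.cat(torch.unbind(emb_demo, 1), 1)   # the unbind + cat route

print(flat_view.shape)
print(torch.equal(flat_view, flat_cat))
```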

In [55]:
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
h

tensor([[ 1.0000, -1.0000, -1.0000,  ...,  0.9884, -1.0000, -0.9874],
        [ 0.9971, -0.9999, -1.0000,  ...,  0.9814, -1.0000, -0.9979],
        [ 0.9997, -1.0000, -1.0000,  ...,  0.9335, -0.9999, -0.9999],
        ...,
        [-0.9523, -0.9799, -0.8882,  ...,  0.5090, -0.9854, -0.9972],
        [-0.9684, -0.6631,  0.0346,  ..., -0.1564, -0.7355, -0.9919],
        [ 0.9191, -0.9560,  0.7704,  ..., -0.3943, -0.8343, -0.8931]])

In [56]:
h.shape

torch.Size([32, 100])

Make sure broadcasting is done right

In [57]:
(emb.view(32, 6) @ W1).shape

torch.Size([32, 100])

In [58]:
b1.shape

torch.Size([100])

32, 100
 1, 100

* b1 of shape (100) is broadcast against the (32, 100) result
* Broadcasting aligns shapes from the right and creates a fake leading dimension (1)
* The (1, 100) row is then copied vertically 32 times, so the same bias vector is added to every row
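To see that the broadcast really adds the same bias to every row, the sketch below compares it with an explicitly expanded bias. `pre_act` and `bias` are stand-ins for `emb.view(-1, 6) @ W1` and `b1`, chosen so the result is easy to inspect.

```python
import torch

pre_act = torch.zeros(32, 100)   # stand-in for emb.view(-1, 6) @ W1
bias = torch.arange(100.0)       # stand-in for b1, shape (100,)

broadcast_sum = pre_act + bias                               # (32, 100) + (100,)
explicit_sum = pre_act + bias.view(1, 100).expand(32, 100)   # the same thing spelled out

print(broadcast_sum.shape)
print(torch.equal(broadcast_sum, explicit_sum))
print(torch.equal(broadcast_sum[0], broadcast_sum[31]))      # every row received the same bias
```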

## Implementing output layer

In [59]:
W2 = torch.randn((100, # Inputs layer size
                  27 # Output layer 27 characters
                 ))
b2 = torch.randn(27)

In [60]:
logits = h @ W2 + b2
logits.shape

torch.Size([32, 27])

## Implementing negative log likelihood loss

In [61]:
# Fake counts -> logits exp
counts = logits.exp()

In [62]:
# Normalize fake counts
prob = counts / counts.sum(1, keepdims=True)
prob.shape

torch.Size([32, 27])

In [64]:
Y.shape

torch.Size([32])

In [65]:
# Index the probabilities using Y
# These are the probabilities the neural network assigns to the correct characters
prob[torch.arange(32), Y]

tensor([6.1116e-06, 6.3760e-11, 4.6058e-11, 2.8328e-08, 2.5840e-03, 5.8854e-08,
        5.4178e-04, 8.8482e-15, 2.6611e-05, 2.1139e-08, 1.7233e-09, 1.2606e-06,
        6.3865e-11, 2.2311e-04, 2.4064e-09, 2.3015e-09, 1.8519e-10, 6.3482e-08,
        8.9236e-14, 1.1066e-07, 7.8817e-02, 1.3576e-05, 1.0586e-08, 4.3965e-09,
        8.3002e-04, 2.7862e-08, 3.1586e-03, 6.8665e-13, 7.5278e-11, 2.8178e-09,
        9.1849e-08, 6.5022e-06])

In [66]:
loss = - prob[torch.arange(32), Y].log().mean()
loss

tensor(16.8291)

## Summary of full network

In [76]:
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator=g)
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27, generator=g)
paramerters = [C, W1, b1, W2, b2]

In [77]:
# Number of parameters in total
sum(p.nelement() for p in paramerters)

3481
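The total can be verified by hand from the layer sizes. A small sketch, where the variable names (`vocab_size`, `emb_dim`, `hidden`) are illustrative and the sizes are the ones used in this notebook:

```python
# Parameter count by hand, using the layer sizes from this notebook.
vocab_size, emb_dim, block_size, hidden = 27, 2, 3, 100

counts = {
    "C":  vocab_size * emb_dim,            # 27 * 2   = 54
    "W1": block_size * emb_dim * hidden,   # 6 * 100  = 600
    "b1": hidden,                          # 100
    "W2": hidden * vocab_size,             # 100 * 27 = 2700
    "b2": vocab_size,                      # 27
}
total = sum(counts.values())
print(counts, total)
```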

In [79]:
emb = C[X]
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
logits = h @ W2 + b2
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)
loss = - prob[torch.arange(32), Y].log().mean()
loss

tensor(17.7697)

## Why cross entropy?

```
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)
loss = - prob[torch.arange(32), Y].log().mean()
```

Written by hand like this, each of these steps makes PyTorch create a separate intermediate tensor. `F.cross_entropy` computes the same loss and is preferable because:

1. It can use a fused kernel that combines all of the above operations
2. In the backward pass, the expression takes a much simpler form mathematically
3. Under the hood, it is numerically well behaved
4. As a result, the forward and backward passes are much more efficient
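Point 2 can be illustrated numerically: for a mean-reduced cross entropy, the gradient with respect to the logits is simply (softmax - one_hot) / batch_size. The sketch below checks this on random stand-in data; `logits_demo` and `targets_demo` are illustrative names, not values from the notebook.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)   # arbitrary seed, only for reproducibility
logits_demo = torch.randn((4, 27), generator=g, requires_grad=True)
targets_demo = torch.tensor([5, 13, 1, 0])

loss_demo = F.cross_entropy(logits_demo, targets_demo)
loss_demo.backward()

# Closed form: (softmax(logits) - one_hot(targets)) / batch_size
with torch.no_grad():
    expected_grad = (F.softmax(logits_demo, dim=1)
                     - F.one_hot(targets_demo, num_classes=27).float()) / 4

print(torch.allclose(logits_demo.grad, expected_grad, atol=1e-6))
```

The backward pass never needs to differentiate through `exp`, the division and `log` one at a time, which is part of why the combined operation is cheaper.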

In [80]:
loss = F.cross_entropy(logits, Y)
loss

tensor(17.7697)

In [81]:
# Numerical stability difference
logits = torch.tensor([-2, -3, 0, 5])
counts = logits.exp()
probs = counts / counts.sum()
probs

tensor([9.0466e-04, 3.3281e-04, 6.6846e-03, 9.9208e-01])

In [86]:
# Numerical stability difference
# With more extreme values, which can occur during backpropagation
logits = torch.tensor([-100, -3, 0, 100])
counts = logits.exp()
print(f"Counts: {counts}")
probs = counts / counts.sum()
probs

Counts: tensor([3.7835e-44, 4.9787e-02, 1.0000e+00,        inf])


tensor([0., 0., 0., nan])

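What's happening above is that exp(100) exceeds the dynamic range of floating point and returns inf. The last probability then becomes inf / inf = nan, and the others become a finite number divided by inf, which is 0.
Large negative logits are harmless: exp(-100) is simply very close to zero.

So we cannot pass very large positive numbers into our hand-written logits --> loss expression.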

In [87]:
# PyTorch handles this by finding the maximum of the logits and subtracting it from all of them,
# which leaves the probabilities unchanged but keeps exp() from overflowing
logits = torch.tensor([-100, -3, 0, 100]) - 100
counts = logits.exp()
print(f"Counts: {counts}")
probs = counts / counts.sum()
probs

Counts: tensor([0.0000e+00, 1.4013e-45, 3.7835e-44, 1.0000e+00])


tensor([0.0000e+00, 1.4013e-45, 3.7835e-44, 1.0000e+00])
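The max-subtraction trick can be wrapped in a small function and compared against `torch.softmax`. This is a sketch of the idea, not PyTorch's actual implementation; `stable_softmax` is a hypothetical helper name.

```python
import torch

def stable_softmax(logits: torch.Tensor) -> torch.Tensor:
    # Subtracting the maximum leaves the result unchanged mathematically,
    # but keeps every exponent <= 0 so exp() cannot overflow.
    shifted = logits - logits.max(dim=-1, keepdim=True).values
    counts = shifted.exp()
    return counts / counts.sum(dim=-1, keepdim=True)

logits_demo = torch.tensor([-100.0, -3.0, 0.0, 100.0])
probs_demo = stable_softmax(logits_demo)

print(probs_demo)
print(torch.isnan(probs_demo).any())
print(torch.allclose(probs_demo, torch.softmax(logits_demo, dim=-1)))
```

Because the maximum is computed from the data rather than hard-coded as 100, the same function works for any range of logits.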

## Implementing training loop, overfitting one batch

In [93]:
# Set requires grad
for p in paramerters:
    p.requires_grad = True
for _ in range(1000):
    # Forward pass
    emb = C[X] # [32, 3, 2]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Y)
    # Backward pass
    for p in paramerters:
        p.grad = None
    loss.backward()
    
    # Update parameters
    for p in paramerters:
        p.data += -0.1 * p.grad
print(loss.item())

0.2552148103713989


We've achieved a very low loss. Why?
Because we're fitting the model to only 5 words, i.e. 32 examples, with 3481 parameters.
That is a lot of parameters for very little data.
What we're doing is overfitting the model to a single batch of data.

> Note: Based on this, overfitting can be thought of as tuning many parameters to fit a few samples or a single batch of data.

Why isn't a loss of 0 achieved?

In [96]:
logits.max(1)

torch.return_types.max(
values=tensor([13.4802, 18.1070, 20.7401, 20.8217, 16.9567, 13.4802, 16.2105, 14.3530,
        16.0952, 18.6156, 16.1835, 21.1566, 13.4802, 17.3799, 17.3764, 20.3181,
        13.4802, 16.8035, 15.3868, 17.3206, 18.7822, 16.2195, 11.0971, 10.8824,
        15.6570, 13.4802, 16.3803, 17.1817, 12.8921, 16.3671, 19.3422, 16.3281],
       grad_fn=<MaxBackward0>),
indices=tensor([19, 13, 13,  1,  0, 19, 12,  9, 22,  9,  1,  0, 19, 22,  1,  0, 19, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0]))

In [97]:
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0])

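We can see that the predicted indices (the argmax of the logits) match the targets in most cases; these are the inputs the model has overfitted. The mismatches are

* ... --> e (emma)
* ... --> o (olivia)
* ... --> a (ava)
* ... --> i (isabella)

The context `...` is also followed by s (sophia), which is the one the model picks every time (index 19). The same input has five different outputs, so the model cannot get all of them right and the loss cannot reach 0.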
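This ambiguity puts a floor under the loss, and the floor can be computed. The sketch below assumes the five training words are emma, olivia, ava, isabella and sophia (read off from the printed `X` tensor); the best any model can do is predict the empirical next-character distribution for each context. Names such as `demo_words` and `targets_by_context` are illustrative.

```python
import math
from collections import Counter, defaultdict

# Assumed: the five words behind the 32 training examples.
demo_words = ["emma", "olivia", "ava", "isabella", "sophia"]
block_size = 3

# Count which next characters follow each context.
targets_by_context = defaultdict(Counter)
for w in demo_words:
    context = "." * block_size
    for ch in w + ".":
        targets_by_context[context][ch] += 1
        context = context[1:] + ch

total = sum(sum(c.values()) for c in targets_by_context.values())

# Best case: the model outputs the empirical distribution for every context.
min_nll = 0.0
for counter in targets_by_context.values():
    n = sum(counter.values())
    for k in counter.values():
        min_nll += -k * math.log(k / n)
min_loss = min_nll / total

print(total, dict(targets_by_context["..."]))
print(round(min_loss, 4))
```

Only the `...` context is ambiguous, so the floor is 5 * ln(5) / 32, roughly 0.2515, which is just below the 0.2552 reached by the training loop.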

To overcome this, let's train on the full dataset.

## Training on full dataset, minibatches