# MLP

The bigram model, which uses probabilities based on normalized counts, has its limitations.

Extending it to more context makes the table of probabilities grow exponentially: with two characters as input there are 27 * 27 = 729 possible contexts, with three characters 27 * 27 * 27 = 19,683, and the table quickly becomes too big.
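
To make that growth concrete, here is a quick calculation of how many contexts (rows of the count table) are needed as the context gets longer. It assumes the 27-symbol vocabulary used in this notebook (26 letters plus `.`); the variable names are only illustrative.

```python
# Number of distinct contexts for a given context length,
# assuming a vocabulary of 27 symbols: 26 letters plus the '.' marker.
vocab_size = 27

table_rows = {n: vocab_size ** n for n in range(1, 5)}
for context_length, rows in table_rows.items():
    # Each context additionally needs 27 next-character counts.
    print(f"context={context_length}: {rows} rows, {rows * vocab_size} counts")
```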

To overcome this, we're gonna try out the MLP model from the [Bengio et al. 2003 paper](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf).

* The paper works with words, but we'll proceed with characters
* In the paper, each word is represented as a 30-dimensional vector (embedding); we'll represent each character the same way, with a smaller dimension
* The advantage of embeddings is knowledge transfer. For example, similar words such as "dog" and "cat" might end up close to each other in the 30-dimensional space. If a phrase with "cat" was not in the training set, what the model learned about "dog" can still help in that case.

Let's implement the architecture below in this notebook:
![fully connected MLP](https://pbs.twimg.com/media/Fhzl42hVUAI9U8V?format=jpg&name=large)

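* Three input characters, each with a 30-dimensional embedding
* A lookup table that maps each character to its embedding
* A hidden layer with tanh activation, fully connected to the three concatenated embeddings
* Since we have 27 characters, a final layer with 27 units (logits)
* A softmax on top of it to normalize the logits into probabilities
* During training, the label's index is used to pluck out the probability assigned to the correct next character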
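
The steps in this list can be sketched end to end as below. This is only a rough sketch under stated assumptions, not the final implementation: the inputs are random stand-in indices, the embedding is 2-dimensional with 100 hidden units (the sizes used later in this notebook), and the output-layer names `W2` and `b2` are hypothetical.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)        # fixed seed, only for reproducibility

# Stand-in data: 4 contexts of 3 character indices each, plus their labels.
X = torch.randint(0, 27, (4, 3), generator=g)
Y = torch.randint(0, 27, (4,), generator=g)

C = torch.randn((27, 2), generator=g)       # lookup table: 27 characters -> 2-dim embedding
W1 = torch.randn((6, 100), generator=g)     # hidden layer: 3 * 2 inputs -> 100 neurons
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 27), generator=g)    # output layer (hypothetical names W2, b2)
b2 = torch.randn(27, generator=g)

emb = C[X]                                  # (4, 3, 2) embeddings of every input character
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)   # (4, 100) hidden activations
logits = h @ W2 + b2                        # (4, 27) one score per possible next character
probs = F.softmax(logits, dim=1)            # each row is normalized to sum to 1
picked = probs[torch.arange(4), Y]          # probability assigned to each correct label
print(probs.shape, picked.shape)
```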

In [2]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

## Rebuilding training dataset

In [3]:
# Read all words
def read_words():
    # Use a context manager so the file is closed after reading
    with open("names.txt") as f:
        return f.read().splitlines()
words = read_words()

In [4]:
len(words)

32033

In [15]:
# build the vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi  = {s:i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s, i in stoi.items()}
print(itos)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


### Build the dataset

In [20]:
def build_dataset(block_size, number_of_words: int, logs=False):
    # block_size is the context length: how many characters do we take to predict the next one?
    X, Y = [], []
    for w in words[:number_of_words]:
        if logs:
            print(w)
        context = [0] * block_size  # start with padding: index 0 is '.'
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            if logs:
                print(f"Context: {context}")
                print(''.join(itos[i] for i in context), '--->', itos[ix])
            context = context[1:] + [ix]  # drop the oldest character, append the new one
            if logs:
                print(f"Context after append: {context}")

    X = torch.tensor(X)
    Y = torch.tensor(Y)
    return X, Y

In [21]:
X, Y = build_dataset(block_size=3, number_of_words=5, logs=True)

emma
Context: [0, 0, 0]
... ---> e
Context after append: [0, 0, 5]
Context: [0, 0, 5]
..e ---> m
Context after append: [0, 5, 13]
Context: [0, 5, 13]
.em ---> m
Context after append: [5, 13, 13]
Context: [5, 13, 13]
emm ---> a
Context after append: [13, 13, 1]
Context: [13, 13, 1]
mma ---> .
Context after append: [13, 1, 0]
olivia
Context: [0, 0, 0]
... ---> o
Context after append: [0, 0, 15]
Context: [0, 0, 15]
..o ---> l
Context after append: [0, 15, 12]
Context: [0, 15, 12]
.ol ---> i
Context after append: [15, 12, 9]
Context: [15, 12, 9]
oli ---> v
Context after append: [12, 9, 22]
Context: [12, 9, 22]
liv ---> i
Context after append: [9, 22, 9]
Context: [9, 22, 9]
ivi ---> a
Context after append: [22, 9, 1]
Context: [22, 9, 1]
via ---> .
Context after append: [9, 1, 0]
ava
Context: [0, 0, 0]
... ---> a
Context after append: [0, 0, 1]
Context: [0, 0, 1]
..a ---> v
Context after append: [0, 1, 22]
Context: [0, 1, 22]
.av ---> a
Context after append: [1, 22, 1]
Context: [1, 22, 1]
av

In [22]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

In [23]:
X

tensor([[ 0,  0,  0],
        [ 0,  0,  5],
        [ 0,  5, 13],
        [ 5, 13, 13],
        [13, 13,  1],
        [ 0,  0,  0],
        [ 0,  0, 15],
        [ 0, 15, 12],
        [15, 12,  9],
        [12,  9, 22],
        [ 9, 22,  9],
        [22,  9,  1],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1, 22],
        [ 1, 22,  1],
        [ 0,  0,  0],
        [ 0,  0,  9],
        [ 0,  9, 19],
        [ 9, 19,  1],
        [19,  1,  2],
        [ 1,  2,  5],
        [ 2,  5, 12],
        [ 5, 12, 12],
        [12, 12,  1],
        [ 0,  0,  0],
        [ 0,  0, 19],
        [ 0, 19, 15],
        [19, 15, 16],
        [15, 16,  8],
        [16,  8,  9],
        [ 8,  9,  1]])

In [24]:
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0])
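
The sliding window inside `build_dataset` can also be seen in isolation. The sketch below works directly on strings instead of integer indices; the helper name `context_pairs` and its `pad` parameter are hypothetical, introduced only for illustration.

```python
def context_pairs(word, block_size=3, pad="."):
    """Return (context, target) string pairs for one word, padded with `pad`."""
    pairs = []
    context = pad * block_size           # start with an all-padding context
    for ch in word + pad:                # the trailing pad marks the end of the word
        pairs.append((context, ch))
        context = context[1:] + ch       # slide the window one character to the right
    return pairs

for context, target in context_pairs("emma"):
    print(context, "--->", target)
```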

Now that we have the dataset, let's build the embedding lookup table.

## Embedding lookup table

For a vocabulary of 17,000 words, the paper used an embedding space with as few as 30 dimensions. Since we only have 27 possibilities (characters), let's try a 2-dimensional embedding.

In [50]:
# Initialized randomly
C = torch.randn((27, 2))
C, C.shape

(tensor([[-1.1322, -0.7408],
         [ 0.3019,  0.8261],
         [ 0.6245, -1.2015],
         [-0.4238,  0.3620],
         [ 1.4434,  0.6574],
         [ 1.9904,  1.3509],
         [-2.2949,  1.2116],
         [ 0.1021,  0.7517],
         [-1.6855,  0.0709],
         [-0.5005,  0.2273],
         [-0.1834, -0.3516],
         [-0.9227,  0.4773],
         [-1.1198,  0.1138],
         [ 0.1110,  1.2119],
         [-0.0134,  0.5650],
         [-0.2966,  0.4494],
         [ 2.3767, -0.2638],
         [-0.3663, -1.1388],
         [-0.0167, -1.3415],
         [-0.5355, -0.4040],
         [-1.7162,  0.2149],
         [ 0.0129,  0.6510],
         [-0.5705, -0.2869],
         [ 2.1717, -0.0403],
         [-0.4718,  1.3111],
         [ 0.4909, -0.4099],
         [ 2.1120,  0.9359]]),
 torch.Size([27, 2]))

In [51]:
# The embedding of a single character can be looked up in two ways
# 1. Indexing
C[5]

tensor([1.9904, 1.3509])

In [52]:
# 2. Onehot
F.one_hot(torch.tensor(5), num_classes=27).float() @ C

tensor([1.9904, 1.3509])

Indexing and one-hot encoding give the same result. We'll use indexing as it's faster.

In [55]:
# Indexing multiple values at once,
# since the shape of our input is (32, 3)
print(C[[5, 6, 7]])
# Works also with tensor
print(C[torch.tensor([5, 6, 7])])

tensor([[ 1.9904,  1.3509],
        [-2.2949,  1.2116],
        [ 0.1021,  0.7517]])
tensor([[ 1.9904,  1.3509],
        [-2.2949,  1.2116],
        [ 0.1021,  0.7517]])


In [56]:
# The same lookup for the whole input tensor X at once
C[X]

tensor([[[-1.1322, -0.7408],
         [-1.1322, -0.7408],
         [-1.1322, -0.7408]],

        [[-1.1322, -0.7408],
         [-1.1322, -0.7408],
         [ 1.9904,  1.3509]],

        [[-1.1322, -0.7408],
         [ 1.9904,  1.3509],
         [ 0.1110,  1.2119]],

        [[ 1.9904,  1.3509],
         [ 0.1110,  1.2119],
         [ 0.1110,  1.2119]],

        [[ 0.1110,  1.2119],
         [ 0.1110,  1.2119],
         [ 0.3019,  0.8261]],

        [[-1.1322, -0.7408],
         [-1.1322, -0.7408],
         [-1.1322, -0.7408]],

        [[-1.1322, -0.7408],
         [-1.1322, -0.7408],
         [-0.2966,  0.4494]],

        [[-1.1322, -0.7408],
         [-0.2966,  0.4494],
         [-1.1198,  0.1138]],

        [[-0.2966,  0.4494],
         [-1.1198,  0.1138],
         [-0.5005,  0.2273]],

        [[-1.1198,  0.1138],
         [-0.5005,  0.2273],
         [-0.5705, -0.2869]],

        [[-0.5005,  0.2273],
         [-0.5705, -0.2869],
         [-0.5005,  0.2273]],

        [[-0.5705, -0

In [58]:
# Let's verify this
C[X].shape

torch.Size([32, 3, 2])

The shape reads as follows: 32 is the total number of inputs, each input is a context of 3 characters, and each character has a 2-dimensional embedding.
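
As a sanity check on the batched lookup, the sketch below confirms two things on stand-in data: indexing with a whole tensor of indices appends the embedding dimension to the index shape, and it still agrees with the one-hot formulation. The small `X` here is made up for illustration.

```python
import torch
import torch.nn.functional as F

C = torch.randn((27, 2))                       # stand-in lookup table
X = torch.tensor([[0, 0, 5], [0, 5, 13]])      # two contexts of three indices

by_index = C[X]                                         # shape (2, 3, 2)
by_onehot = F.one_hot(X, num_classes=27).float() @ C    # (2, 3, 27) @ (27, 2) -> (2, 3, 2)

# The result shape is the index shape followed by the embedding dimension.
shape_matches = tuple(by_index.shape) == tuple(X.shape) + (C.shape[1],)
print(shape_matches, torch.allclose(by_index, by_onehot))
```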

In [60]:
X[13, 2]

tensor(1)

In [61]:
C[X][13, 2]

tensor([0.3019, 0.8261])

In [63]:
C[1]

tensor([0.3019, 0.8261])

In [81]:
emb = C[X]
emb.shape

torch.Size([32, 3, 2])

Now the embedding lookup table is completed.

## Implementing the hidden layer plus internals of torch.Tensor, storage and views

In [65]:
# Initializing weights and biases
W1 = torch.randn((
    6, # 3(inputs) * 2(embedding dim)
    100 # Number of neurons
))
b1 = torch.randn(100)

In [66]:
W1.shape

torch.Size([6, 100])

In [67]:
# inputs @ weights + bias will not work now,
# as the dimensions of the weights and the input don't
# satisfy the rules of matrix multiplication:
# shape of input [32, 3, 2], weights [6, 100]
emb @ W1 + b1

RuntimeError: mat1 and mat2 shapes cannot be multiplied (96x2 and 6x100)

PyTorch's tensors are really powerful because they come with tons of methods that let us create, modify, and perform lots of operations on them.

We're gonna use [torch.cat](https://pytorch.org/docs/stable/generated/torch.cat.html) to tackle the above problem.

In [113]:
cat_tensors = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1)
cat_tensors.shape

torch.Size([32, 6])

In [120]:
# To generalize this to a different block size,
# we'll use unbind together with cat
unbind_tensors = torch.unbind(emb, 1)
# Gives a tuple of tensors, exactly the same slices
# we passed to torch.cat above
len(unbind_tensors)

3

In [121]:
cat_unbind_tensors = torch.cat(unbind_tensors, 1)
cat_unbind_tensors.shape

torch.Size([32, 6])

Now the above code will run irrespective of the block size.

But there's a more efficient way to do this: torch.cat allocates a new tensor and copies the data into it, which we can avoid.

In [122]:
a = torch.arange(18)
a

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])

In [123]:
a.shape

torch.Size([18])

In [126]:
a.view(3, 3, 2)

tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])

In [127]:
a.storage()

 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
[torch.storage._TypedStorage(dtype=torch.int64, device=cpu) of size 18]

Every tensor has a view and an underlying storage:
* Using tensor.view(shape) we can change the shape of a tensor
* But tensor.storage() in memory always remains a one-dimensional vector
* Calling view only changes some attributes such as the storage offset, shape, and strides; the data in memory stays the same, however many different views we take
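
The points above can be checked directly. In this small sketch, the view and the original tensor start at the same memory address, differ only in shape and strides, and a write through the view shows up in the original.

```python
import torch

a = torch.arange(18)
v = a.view(3, 3, 2)

# Same underlying memory: both tensors start at the same address.
same_memory = v.data_ptr() == a.data_ptr()

# Only metadata differs: shape and strides describe how to walk the flat storage.
print(a.shape, a.stride())   # 1-D: one step per element
print(v.shape, v.stride())   # 3-D: steps of 6, 2 and 1 elements

# Writing through the view is visible in the original tensor.
v[0, 0, 0] = 100
print(a[0].item(), same_memory)
```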

In [128]:
# Let's use view() to reshape the tensor from [32, 3, 2] to [32, 6]
emb.view(32, 6)

tensor([[-1.1322, -0.7408, -1.1322, -0.7408, -1.1322, -0.7408],
        [-1.1322, -0.7408, -1.1322, -0.7408,  1.9904,  1.3509],
        [-1.1322, -0.7408,  1.9904,  1.3509,  0.1110,  1.2119],
        [ 1.9904,  1.3509,  0.1110,  1.2119,  0.1110,  1.2119],
        [ 0.1110,  1.2119,  0.1110,  1.2119,  0.3019,  0.8261],
        [-1.1322, -0.7408, -1.1322, -0.7408, -1.1322, -0.7408],
        [-1.1322, -0.7408, -1.1322, -0.7408, -0.2966,  0.4494],
        [-1.1322, -0.7408, -0.2966,  0.4494, -1.1198,  0.1138],
        [-0.2966,  0.4494, -1.1198,  0.1138, -0.5005,  0.2273],
        [-1.1198,  0.1138, -0.5005,  0.2273, -0.5705, -0.2869],
        [-0.5005,  0.2273, -0.5705, -0.2869, -0.5005,  0.2273],
        [-0.5705, -0.2869, -0.5005,  0.2273,  0.3019,  0.8261],
        [-1.1322, -0.7408, -1.1322, -0.7408, -1.1322, -0.7408],
        [-1.1322, -0.7408, -1.1322, -0.7408,  0.3019,  0.8261],
        [-1.1322, -0.7408,  0.3019,  0.8261, -0.5705, -0.2869],
        [ 0.3019,  0.8261, -0.5705, -0.2

This works because the last two dimensions (3 and 2) get flattened into a single dimension of size 6, so each row holds the three 2-dimensional embeddings side by side.

In [129]:
emb.view(32, 6).shape

torch.Size([32, 6])

In [130]:
emb.view(32, 6) == torch.cat(torch.unbind(emb, 1), 1)

tensor([[True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, T

Element-wise comparison shows that view gives the same result as cat(unbind).
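
A more compact way to run the same check is `torch.equal`, which returns a single boolean. The sketch below uses a random stand-in for `emb` and also compares memory addresses: the view reuses the memory of `emb`, while `torch.cat` produces a freshly allocated tensor.

```python
import torch

emb = torch.randn((32, 3, 2))                   # stand-in for C[X]

flat_view = emb.view(emb.shape[0], -1)          # -1 lets PyTorch infer 3 * 2 = 6
flat_cat = torch.cat(torch.unbind(emb, 1), 1)

same_values = torch.equal(flat_view, flat_cat)
view_shares_memory = flat_view.data_ptr() == emb.data_ptr()
cat_shares_memory = flat_cat.data_ptr() == emb.data_ptr()
print(same_values, view_shares_memory, cat_shares_memory)
```

Passing `-1` instead of hard-coding 32 or 6 keeps the line working for any number of examples.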

In [131]:
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
h

tensor([[-0.9346,  0.9008,  0.9908,  ..., -0.9937, -0.2760, -0.2226],
        [-0.9857,  0.9408,  0.9988,  ..., -0.9584, -0.9922,  0.1073],
        [-0.9947,  0.9016,  0.9900,  ...,  0.3018, -0.0542,  0.9769],
        ...,
        [-0.8981,  0.9899,  0.9898,  ...,  0.9079,  0.9961,  0.0368],
        [-0.9951,  1.0000,  0.6883,  ..., -0.9892, -0.3642,  0.7588],
        [-0.9949,  0.9719,  0.8945,  ..., -0.9817, -0.7537,  0.7307]])

In [132]:
h.shape

torch.Size([32, 100])

Let's make sure broadcasting is done right:

In [133]:
(emb.view(32, 6) @ W1).shape

torch.Size([32, 100])

In [134]:
b1.shape

torch.Size([100])

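32, 100
 1, 100

* We are adding b1 of shape (100) to a matrix of shape (32, 100)
* Broadcasting aligns the shapes from the right and creates a fake leading dimension of size 1, so b1 is treated as (1, 100)
* That single row is then copied vertically 32 times, so the same bias vector is added to every one of the 32 rows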
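
The same rule can be seen on a tiny example. The tensors `M` and `b` below are made up for illustration; the second line spells out explicitly what broadcasting does implicitly.

```python
import torch

M = torch.arange(6.0).view(2, 3)               # shape (2, 3)
b = torch.tensor([10.0, 20.0, 30.0])           # shape (3,)

out = M + b                                    # b is treated as (1, 3) and reused for each row
explicit = M + b.unsqueeze(0).expand(2, 3)     # the same thing spelled out
print(out)
print(torch.equal(out, explicit))
```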