# Welcome to NameWeave - Multi Layer Perceptron Approach

Like our original <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a>,\
We will try to create a **Multi Layer Perceptron** to build a character level language model and predict names based on it.

To Approach this model, we will follow an approach based on the paper,\
<a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">A Neural Probabilistic Language Model</a>,\
Which is a **word level language model** but solves the similar problem of predicting words...\
This paper is 19 pages long, and we don't have time to read the entire paper,\
But I invite you to read it.

In this paper,\
They used a word vocabulary of 17000.

![Word Vocabulary](ExplanationMedia/Images/Vocabulary.png)

They then converted this vocabulary into a 30 dimensional feature space

![Vocabulary to Feature Space](ExplanationMedia/Images/VocabularytoFeatureSpace.png)

This is a very small space for a very large dataset.

The approach of this paper is also very similar because,\
They used a **multilayer neural network** to **predict the next word given the previous ones**,\
& they **maximize the log-likelihood** of the training data or a regularized criterion.

#### Why does this approach work? Let's take a concrete example

We have a phrase: *A dog was running in a room*, *The cat is walking in the bedroom*

During the training of the network, the words move around to a similar corner of the space based on their features\
So, even if the model goes *out of distribution* during test, making predictions,\
The similar words which have never occured before may occur here.

Resulting in the phrase: *A dog is walking in a bedroom*, *The cat is running in a room*

If we knew that dog and cat played similar roles (semantically and syntactically), and similarly for (the,a), (bedroom,room), (is,was), we could naturally transfer probability mass.

Let's now look at the Neural Network for this Approach

![A Neural Probabilistic Language Model - Neural Network](https://miro.medium.com/v2/resize:fit:1200/1*EqKiy4-6tuLSoPP_kub33Q.png)

In this network,\
They are taking *3 previous words* and are trying to *predict the 4-th word in a sequence*.

Now because they had the vocabulary of 17000 words (**'w'**),\
These previous words are the indexes ranging from 0-16999.

There is also a lookup-table which they call **'C'** \
This is their lookup-embedding-matrix, which is shared among all the words\
This C is a matrix of say 17000x30. (So, number of words in vocabulary by number of dimensions in the feature space).

So what this is essentially doing is,\
They are trying to pick out the row based on the index of the word from vocabulary\
And the row represents the 1x30 vector of the word's embedding.\
So they are using the same matrix over and over to look for their own vector of embedding.

So, because they are taking *3 previous words*, and each vector uses 1x30 dimensions,\
They have 3x30 dimensions making up 90 dimensions in total.

Next up is the hidden layer (tanh non-linearity layer).\
This is layer has the *hyper-parameter* (*hyper-parameter* is a parameter of the neural network, that is the designer' choice of the neural network).\
So, this layer's size can be as large as we'd like or as we'd like.\
So, we are going to go over multiple choices, and we are going to evaluage how good they work.\
Note: This layer will be fully connected to all the vector embeddings of the previous layer (90 dimensions).

Next, they have a output layer of logits (you can refer to the original <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a>  for reference).\
Now, because they had a vocabulary of **17000 words**, this layer has **17000 neurons** which is **fully connected to the hidden layer**.\
Resulting in the *maximum computation between the hidden layer and output layer*.

This output layer is then having a softmax activation layer, which exponentiates the logits and normalized to sum to 1.\
Which results in a nice probability distribution for the next *4-th word in a sequence*.

<hr>

During training we have the label (identity of the next word in a sequence).
That word's index is used to choose the probability of that word,\
And then they maximize the probability of that word, with respect to the parameters of this neural network.

<hr>

Parameters:
1. The *weights and biases* of the *output layer*
2. The *weights and biases* of the *hidden layer*
3. The *embedding look-up table 'C'*

So, Let's implement our own neural network, based on the above approach.

# Installing Dependencies

In [1]:
!pip install torch
!pip install numpy
!pip install pandas
!pip install matplotlib



# Importing Libraries

In [82]:
import random
import torch
import torch.nn.functional as F # This is required for one-hot encoding
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Loading Dataset

Once again, you can refer to the original <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a> for reference, as to why we chose to load the dataset in the following way...

In [3]:
words = open("Datasets/Indian_Names.txt").read().splitlines()

In [4]:
words = [word.lower() for word in words]

In [5]:
len(words)

53982

# Building Vocabulary

In [6]:
# Remember we need our starting and ending tokens as well in these mappings,
characters = sorted(list(set(''.join(words)))) # Gives us all the characters in the english alphabet, hopefully our dataset has all of them
stoi = {s:i+1 for i,s in enumerate(characters)} # Enumerate returns the tuples of number and string, which can then be mapped to string:index
# We manually add these tokens for convenience
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()} # After we have the string:index mapping, we can easily iterate over their items to map index:string
print("Characters:",characters)
print("STOI:",stoi)
print("ITOS",itos)

Characters: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
STOI: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '.': 0}
ITOS {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


# Building Dataset for Neural Network

Now, we can't just feed in names to our Neural Network.\
Rather, we need to build a dataset which will be able to feed into our neural network.

Let's visualize how we are going to feed in to the neural network first...

![Multi Layer Perceptron Approach](ExplanationMedia/Images/NameWeaveMultiLayerPerceptronApproach.png)

Let's now try to make this dataset...

Remeber, this is **not bigram anymore**.

In [24]:
# We define a Block Size based on the number of characters we feed are going to feed to predict the next one
inputBlockSize = 3

# We define two lists, inputs & outputs, where inputs are our blocks of the block size mentioned above and outputs are the label indexes
inputs , outputs = [], []

# We run a loop for each word in the original dataset
for word in words[:5]:
    # We define the block for each iteration and fill it with 0 values -> [0, 0, 0]
    block = [0] * inputBlockSize # This is also known as the context of the network
    # We print each word
    print("Name:", word)
    # We run another loop for each word's character, here word also needs the ending token '.'
    for character in word + '.':
        # We take out the index from our look-up table
        index = stoi[character]
        # We append the input with our block
        inputs.append(block)
        # We append the output label with out index of the character
        outputs.append([index])
        # We can check our inputs and thier corresponsing outputs
        print(''.join(itos[i] for i in block), '--->', itos[index])
        # We then take the block, crop it 1 size from the left and append the next index to it (sliding window of name)
        block = block[1:] + [index]
# We also convert these inputs and outputs to tensors for neural network processing
inputs = torch.tensor(inputs)
outputs = torch.tensor(outputs)

Name: aaban
... ---> a
..a ---> a
.aa ---> b
aab ---> a
aba ---> n
ban ---> .
Name: aabharan
... ---> a
..a ---> a
.aa ---> b
aab ---> h
abh ---> a
bha ---> r
har ---> a
ara ---> n
ran ---> .
Name: aabhas
... ---> a
..a ---> a
.aa ---> b
aab ---> h
abh ---> a
bha ---> s
has ---> .
Name: aabhat
... ---> a
..a ---> a
.aa ---> b
aab ---> h
abh ---> a
bha ---> t
hat ---> .
Name: aabheer
... ---> a
..a ---> a
.aa ---> b
aab ---> h
abh ---> e
bhe ---> e
hee ---> r
eer ---> .


In [29]:
# We can now check the shape of inputs and outputs and their corresponding datatypes
print("Inputs Shape:",inputs.shape,", Datatype:",inputs.dtype)
print("Outputs Shape:",outputs.shape,", Datatype:",outputs.dtype)

Inputs Shape: torch.Size([37, 3]) , Datatype: torch.int64
Outputs Shape: torch.Size([37, 1]) , Datatype: torch.int64


In [30]:
# We can also check how the inputs look like
print(inputs)

tensor([[ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1,  1],
        [ 1,  1,  2],
        [ 1,  2,  1],
        [ 2,  1, 14],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1,  1],
        [ 1,  1,  2],
        [ 1,  2,  8],
        [ 2,  8,  1],
        [ 8,  1, 18],
        [ 1, 18,  1],
        [18,  1, 14],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1,  1],
        [ 1,  1,  2],
        [ 1,  2,  8],
        [ 2,  8,  1],
        [ 8,  1, 19],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1,  1],
        [ 1,  1,  2],
        [ 1,  2,  8],
        [ 2,  8,  1],
        [ 8,  1, 20],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1,  1],
        [ 1,  1,  2],
        [ 1,  2,  8],
        [ 2,  8,  5],
        [ 8,  5,  5],
        [ 5,  5, 18]])


In [31]:
# We can also check how the outputs look like
print(outputs)

tensor([[ 1],
        [ 1],
        [ 2],
        [ 1],
        [14],
        [ 0],
        [ 1],
        [ 1],
        [ 2],
        [ 8],
        [ 1],
        [18],
        [ 1],
        [14],
        [ 0],
        [ 1],
        [ 1],
        [ 2],
        [ 8],
        [ 1],
        [19],
        [ 0],
        [ 1],
        [ 1],
        [ 2],
        [ 8],
        [ 1],
        [20],
        [ 0],
        [ 1],
        [ 1],
        [ 2],
        [ 8],
        [ 5],
        [ 5],
        [18],
        [ 0]])


Now that we have our inputs and outputs configured, let's build our **embeddingLookUpMatrix 'C'**

# Building Embedding Look-up Matrix

In the paper the researchers had a big vocabulary of 17000 words,\
They used a very small 30 dimensional feature space.

Because we have a vocabulary of only 27 characters,\
Let's use a very small 2 dimensional feature space for our embedding look-up matrix.

In [75]:
# We decide to build a embeddingLookUpMatrix with 27x2 because we have a vocabulary of 27 characters and we want to fit them in a 2 dimensional space
# In the beginning we initialize it randomly
embeddingFeatureSpaceLength = 2
embeddingLookUpMatrix = torch.randn((len(characters),embeddingFeatureSpaceLength))
# So each one of our 27 characters will have a 2 dimensional embedding
print(embeddingLookUpMatrix)

tensor([[-0.3346,  0.6162],
        [ 0.4039, -2.8433],
        [ 0.9639, -0.1585],
        [ 0.3347,  0.3767],
        [ 0.0578,  0.9342],
        [-1.0575, -1.4122],
        [ 1.2202, -0.5792],
        [ 0.5931, -0.3377],
        [-1.5203, -0.3711],
        [-0.0503, -0.4452],
        [ 0.1173, -0.4054],
        [ 1.7778,  0.7496],
        [-0.7196, -2.4921],
        [ 1.6031, -0.0101],
        [-1.6452,  1.4524],
        [ 1.3075, -0.2939],
        [-0.3972,  0.1475],
        [-0.5473, -0.1973],
        [-0.4729, -0.1627],
        [-0.8425,  1.7155],
        [ 2.5917,  1.3175],
        [-0.7879,  0.9584],
        [-0.1324,  1.8445],
        [ 1.8695, -0.0285],
        [-0.5529,  0.4944],
        [ 0.0266,  2.5908]])


Now that we have a embedding-look-up matrix,

For example,

We can easily do:
```python
embeddingLookUpMatrix[6]
```

To get the embedding:
```python
tensor([-0.2483, -0.3909])
```

But there is a more similar way to do the exact same thing based on one-hot encoding....

We can do:
```python
F.one_hot(torch.tensor(6), num_classes=27)
```

To get the one-hot embedding
```python
tensor([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

Then convert this to float and multiply it with our original *embeddingLookUpMatrix*:
```python
F.one_hot(torch.tensor(6), num_classes=27).float() @ embeddingLookUpMatrix
```

Which results in:
```python
tensor([-0.2483, -0.3909])
```

This works because of the property of matrix multiplication,

The 0's in our one-hot encoded vector discards all the zeros,\
And only multiplies the 1 to the corresponding column of the embeddingLookUpMatrix.

So we can consider this matrix multiplication to be the first layer of our neural network,\
Giving us our corresponding embedding for the index. *(1x2 embedding vector for our case)*

But we will simply index into our look-up table and discard the way of one-hot encoding for the time being

But, now that we know that we want to index into our look-up embedding matrix,\
How do we do that simultaneously for all the inputs?

PyTorch has got you covered.

In PyTorch we can, very flexibly pick out rows...

For example,\
We can do indexing with lists of indexes:
```python
embeddingLookUpMatrix[[1,2,3]]
```

Which gives out the rows of the corresponding indexes:
```python
tensor([[ 0.2118,  1.0454],
        [ 0.1876,  0.8921],
        [ 0.9759, -0.2606]])
```

We can also do:
```python
embeddingLookUpMatrix[torch.tensor([1,2,3])]
```

Which gives out the rows of the corresponding indexes:
```python
tensor([[ 0.2118,  1.0454],
        [ 0.1876,  0.8921],
        [ 0.9759, -0.2606]])
```

We can similarly pick out the same rows again and again:
```python
embeddingLookUpMatrix[[1,2,3,3,3,3]]
```

Which gives us the same row again and again:
```python
tensor([[ 0.2118,  1.0454],
        [ 0.1876,  0.8921],
        [ 0.9759, -0.2606],
        [ 0.9759, -0.2606],
        [ 0.9759, -0.266],
        [ 0.9759, -0.2606]])
```

Lastly, the magic happens when we try to do the same with multi dimensional lists as well:
```python
embeddingLookUpMatrix[[1, 0], [1, 1]]
```

Which results in:
```python
tensor([ 1.0454, -0.9078])
```

In [76]:
# So we can now easily do
embeddingLookUpMatrix[inputs]

tensor([[[-0.3346,  0.6162],
         [-0.3346,  0.6162],
         [-0.3346,  0.6162]],

        [[-0.3346,  0.6162],
         [-0.3346,  0.6162],
         [ 0.4039, -2.8433]],

        [[-0.3346,  0.6162],
         [ 0.4039, -2.8433],
         [ 0.4039, -2.8433]],

        [[ 0.4039, -2.8433],
         [ 0.4039, -2.8433],
         [ 0.9639, -0.1585]],

        [[ 0.4039, -2.8433],
         [ 0.9639, -0.1585],
         [ 0.4039, -2.8433]],

        [[ 0.9639, -0.1585],
         [ 0.4039, -2.8433],
         [-1.6452,  1.4524]],

        [[-0.3346,  0.6162],
         [-0.3346,  0.6162],
         [-0.3346,  0.6162]],

        [[-0.3346,  0.6162],
         [-0.3346,  0.6162],
         [ 0.4039, -2.8433]],

        [[-0.3346,  0.6162],
         [ 0.4039, -2.8433],
         [ 0.4039, -2.8433]],

        [[ 0.4039, -2.8433],
         [ 0.4039, -2.8433],
         [ 0.9639, -0.1585]],

        [[ 0.4039, -2.8433],
         [ 0.9639, -0.1585],
         [-1.5203, -0.3711]],

        [[ 0.9639, -0

In [77]:
# We can also check the shape of this
embeddingLookUpMatrix[inputs].shape

torch.Size([37, 3, 2])

We see that the size of this index is the shape of the original size of the dataset with a 2-dimensional embedding vector space

So if we do:
```python
# Input of 5th block and 3rd index of the block
inputs[5,2]
```

It gives us:
```python
tensor(14)
```

We can look that vector up by doing:
```python
embeddingLookUpMatrix[inputs][5,2]
```

Which gives us the corresponding vector of the item specified:
```python
tensor([ 1.7724, -0.9331])
```

We can verify the same by doing:
```python
embeddingLookUpMatrix[14]
```

Gives the same output:
```python
tensor([ 1.7724, -0.9331])```


In [78]:
# So we can now define our embedding into a variable
embedding = embeddingLookUpMatrix[inputs]

# Constructing Hidden Layer

Let's understand what we will initially have in the hidden layer.

1. The hidden layer will have it's own weights & biases
2. The hidden layer will have it's own neurons which will act as a hyper-parameter to set the number of neurons we want in this layer

So let's initialize weights and biases for now...

Note: The size of the weights will be based on the block size of the inputs and its corresponding vector embedding.\
Thus,

$$\text{Hidden Layer Size} = [(\text{Block Size} * \text{Vector Embedding Dimensions}), \text{Number of Neurons(Hyperparameter)}]$$

In [79]:
# We can initialize the number of neurons we want in the hidden layer
numberOfHiddenLayerNeurons = 100
# Then we can randomly initialize the weights of the hidden layer
weightsOfHiddenLayer = torch.randn((inputBlockSize*embeddingFeatureSpaceLength), numberOfHiddenLayerNeurons)
# Then we can initialize the corresponding biases as well
biasesOfHiddenLayer = torch.randn(numberOfHiddenLayerNeurons)

In [81]:
# We can check the shapes of our hidden layer weights and hidden layer biases
print("Shape of Weights of Hidden Layer:", weightsOfHiddenLayer.shape)
print("Shape of Biases of Hidden Layer:", biasesOfHiddenLayer.shape)

Shape of Weights of Hidden Layer: torch.Size([6, 100])
Shape of Biases of Hidden Layer: torch.Size([100])


Now that we have the weights and biases initialized for our hidden layer.

By convention we would like to do something like:
$$\text{Layer Computation} = \text{Embeddings} * \text{Weights} + \text{Biases}$$

But for our case:

Embeddings are in the shape of the [number of blocks in all the names, block size, vector dimension size]\
for example, [37, 3, 2]\
& Weights are in the shape of [(block size * vector dimension size), number of neurons (hyperparameter)]\
for example, [6, 100]

And thus, we cannot just simply multiply these matrices

There are a numerous ways to do this,\
Either we can convert [37, 3, 2] ---> [37, 6]\
Or we can convert [6, 100] ---> [3, 2, 100]

We will stick with the first one, because it is fairly simpler and would reduce the complexity to understand the problem.

I invite you to also look into the documentation of <a href="https://pytorch.org/docs/stable/index.html">PyTorch</a>.

According to the official documentation,

**TORCH.CAT**
Concatenates the given sequence of seq tensors in the given dimension. All tensors must either have the same shape (except in the concatenating dimension) or be empty.

```python
torch.cat(tensors, dim=0, *, out=None) → Tensor
```

Parameters:
- tensors (sequence of Tensors) – any python sequence of tensors of the same type. Non-empty tensors provided must have the same shape, except in the cat dimension.
- dim (int, optional) – the dimension over which the tensors are concatenated

Keyword Arguments:
- out (Tensor, optional) – the output tensor.

Example:
```python
>>> x = torch.randn(2, 3)
>>> x
tensor([[ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497]])
>>> torch.cat((x, x, x), 0)
tensor([[ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497],
        [ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497],
        [ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497]])
>>> torch.cat((x, x, x), 1)
tensor([[ 0.6580, -1.0969, -0.4614,  0.6580, -1.0969, -0.4614,  0.6580,
         -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497, -0.1034, -0.5790,  0.1497, -0.1034,
         -0.5790,  0.1497]])
```

So we can do:
```python
# Pickout the Embedding along the dimension 0, 1 & 2 and concatenate them along dimension 1
# Each embedding[:, n, :] gives us the 3x2 embeddings
torch.cat([embedding[:, 0, :], embedding[:, 1, :], embedding[:, 2, :]], dim=1)
```

And its shape turns out to be:
```python
torch.Size([37, 6])
```

But this is kind of ugly and we have another method...

**TORCH.UNBIND**
Removes a tensor dimension.

Returns a tuple of all slices along a given dimension, already without it.

```python
torch.unbind(input, dim=0) → seq
```
Parameters:
- input (Tensor) – the tensor to unbind
- dim (int) – dimension to remove

Example:
```python
>>> torch.unbind(torch.tensor([[1, 2, 3],
>>>                            [4, 5, 6],
>>>                            [7, 8, 9]]))
(tensor([1, 2, 3]), tensor([4, 5, 6]), tensor([7, 8, 9]))
```

So now we can do:
```python
torch.cat(torch.unbind(embedding, dim=1), dim=1)
```

Whose shape also turns out to be:
```python
torch.Size([37, 6])
```

We have a third way of doing the same thing.

Its called:\
**TORCH.TENSOR.VIEW**

Which gives me the paternity to explain some of the features of the internals of the PyTorch Library.

We have an interesting blog post <a href="http://blog.ezyang.com/2019/05/pytorch-internals/">here</a> by Edward Z. Yang which you can go through to understand more about this.

To explain *TORCH.TENSOR.VIEW*.\
Let's take an example and explain each step one by one...

For example,\
If we do:
```python
torch.arange(0,18)
```

It gives us:
```python
tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])
```

This can also be viewed as:
```python
torch.arange(0,18).view(9,2)
```

Which gives us:
```python
tensor([[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15],
        [16, 17]])
```

This can also be written as:
```python
torch.arange(0,18).view(3,3,2)
```

Which gives us:
```python
tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])
```

So,\
If we have an embedding of size say: [37, 3, 2]\
We can essentially do:
```python
embedding.view(37,6)
```

We can also verify the result to be the same by doing
```python
embedding.view(37,6) == torch.cat(torch.unbind(embedding, dim=1), dim=1)
```

Resulting in all True values