# Welcome to NameWeave - Multi Layer Perceptron Approach

Like our original <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a>,\
We will try to create a **Multi Layer Perceptron** to build a character level language model and predict names based on it.

To Approach this model, we will follow an approach based on the paper,\
<a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">A Neural Probabilistic Language Model</a>,\
Which is a **word level language model** but solves the similar problem of predicting words...\
This paper is 19 pages long, and we don't have time to read the entire paper,\
But I invite you to read it.

In this paper,\
They used a word vocabulary of 17000.

![Word Vocabulary](ExplanationMedia/Images/Vocabulary.png)

They then converted this vocabulary into a 30 dimensional feature space

![Vocabulary to Feature Space](ExplanationMedia/Images/VocabularytoFeatureSpace.png)

This is a very small space for a very large dataset.

The approach of this paper is also very similar because,\
They used a **multilayer neural network** to **predict the next word given the previous ones**,\
& they **maximize the log-likelihood** of the training data or a regularized criterion.

#### Why does this approach work? Let's take a concrete example

We have a phrase: *A dog was running in a room*, *The cat is walking in the bedroom*

During the training of the network, the words move around to a similar corner of the space based on their features\
So, even if the model goes *out of distribution* during test, making predictions,\
The similar words which have never occured before may occur here.

Resulting in the phrase: *A dog is walking in a bedroom*, *The cat is running in a room*

If we knew that dog and cat played similar roles (semantically and syntactically), and similarly for (the,a), (bedroom,room), (is,was), we could naturally transfer probability mass.

Let's now look at the Neural Network for this Approach

![A Neural Probabilistic Language Model - Neural Network](https://miro.medium.com/v2/resize:fit:1200/1*EqKiy4-6tuLSoPP_kub33Q.png)

In this network,\
They are taking *3 previous words* and are trying to *predict the 4-th word in a sequence*.

Now because they had the vocabulary of 17000 words (**'w'**),\
These previous words are the indexes ranging from 0-16999.

There is also a lookup-table which they call **'C'** \
This is their lookup-embedding-matrix, which is shared among all the words\
This C is a matrix of say 17000x30. (So, number of words in vocabulary by number of dimensions in the feature space).

So what this is essentially doing is,\
They are trying to pick out the row based on the index of the word from vocabulary\
And the row represents the 1x30 vector of the word's embedding.\
So they are using the same matrix over and over to look for their own vector of embedding.

So, because they are taking *3 previous words*, and each vector uses 1x30 dimensions,\
They have 3x30 dimensions making up 90 dimensions in total.

Next up is the hidden layer (tanh non-linearity layer).\
This is layer has the *hyper-parameter* (*hyper-parameter* is a parameter of the neural network, that is the designer' choice of the neural network).\
So, this layer's size can be as large as we'd like or as we'd like.\
So, we are going to go over multiple choices, and we are going to evaluage how good they work.\
Note: This layer will be fully connected to all the vector embeddings of the previous layer (90 dimensions).

Next, they have a output layer of logits (you can refer to the original <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a>  for reference).\
Now, because they had a vocabulary of **17000 words**, this layer has **17000 neurons** which is **fully connected to the hidden layer**.\
Resulting in the *maximum computation between the hidden layer and output layer*.

This output layer is then having a softmax activation layer, which exponentiates the logits and normalized to sum to 1.\
Which results in a nice probability distribution for the next *4-th word in a sequence*.

<hr>

During training we have the label (identity of the next word in a sequence).
That word's index is used to choose the probability of that word,\
And then they maximize the probability of that word, with respect to the parameters of this neural network.

<hr>

Parameters:
1. The *weights and biases* of the *output layer*
2. The *weights and biases* of the *hidden layer*
3. The *embedding look-up table 'C'*

So, Let's implement our own neural network, based on the above approach.

# Installing Dependencies

In [1]:
!pip install torch
!pip install numpy
!pip install pandas
!pip install matplotlib



# Importing Libraries

In [2]:
import random
import torch
import torch.nn.functional as F # This is required for one-hot encoding
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


# Loading Dataset

Once again, you can refer to the original <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a> for reference, as to why we chose to load the dataset in the following way...

In [3]:
words = open("Datasets/Indian_Names.txt").read().splitlines()

In [4]:
words = [word.lower() for word in words]

In [5]:
len(words)

53982

# Building Vocabulary

In [6]:
# Remember we need our starting and ending tokens as well in these mappings,
characters = sorted(list(set(''.join(words)))) # Gives us all the characters in the english alphabet, hopefully our dataset has all of them
stoi = {s:i+1 for i,s in enumerate(characters)} # Enumerate returns the tuples of number and string, which can then be mapped to string:index
# We manually add these tokens for convenience
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()} # After we have the string:index mapping, we can easily iterate over their items to map index:string
print("Characters:",characters)
print("STOI:",stoi)
print("ITOS",itos)

Characters: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
STOI: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '.': 0}
ITOS {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


# Building Dataset for Neural Network

Now, we can't just feed in names to our Neural Network.\
Rather, we need to build a dataset which will be able to feed into our neural network.

Let's visualize how we are going to feed in to the neural network first...

![Multi Layer Perceptron Approach](ExplanationMedia/Images/NameWeaveMultiLayerPerceptronApproach.png)

Let's now try to make this dataset...

Remeber, this is **not bigram anymore**.

In [7]:
# We define a Block Size based on the number of characters we feed are going to feed to predict the next one
inputBlockSize = 3

# We define two lists, inputs & outputs, where inputs are our blocks of the block size mentioned above and outputs are the label indexes
inputs , outputs = [], []

# We run a loop for each word in the original dataset
for word in words[:5]:
    # We define the block for each iteration and fill it with 0 values -> [0, 0, 0]
    block = [0] * inputBlockSize # This is also known as the context of the network
    # We print each word
    print("Name:", word)
    # We run another loop for each word's character, here word also needs the ending token '.'
    for character in word + '.':
        # We take out the index from our look-up table
        index = stoi[character]
        # We append the input with our block
        inputs.append(block)
        # We append the output label with out index of the character
        outputs.append([index])
        # We can check our inputs and thier corresponsing outputs
        print(''.join(itos[i] for i in block), '--->', itos[index])
        # We then take the block, crop it 1 size from the left and append the next index to it (sliding window of name)
        block = block[1:] + [index]
# We also convert these inputs and outputs to tensors for neural network processing
inputs = torch.tensor(inputs)
outputs = torch.tensor(outputs)

Name: aaban
... ---> a
..a ---> a
.aa ---> b
aab ---> a
aba ---> n
ban ---> .
Name: aabharan
... ---> a
..a ---> a
.aa ---> b
aab ---> h
abh ---> a
bha ---> r
har ---> a
ara ---> n
ran ---> .
Name: aabhas
... ---> a
..a ---> a
.aa ---> b
aab ---> h
abh ---> a
bha ---> s
has ---> .
Name: aabhat
... ---> a
..a ---> a
.aa ---> b
aab ---> h
abh ---> a
bha ---> t
hat ---> .
Name: aabheer
... ---> a
..a ---> a
.aa ---> b
aab ---> h
abh ---> e
bhe ---> e
hee ---> r
eer ---> .


In [8]:
# We can now check the shape of inputs and outputs and their corresponding datatypes
print("Inputs Shape:",inputs.shape,", Datatype:",inputs.dtype)
print("Outputs Shape:",outputs.shape,", Datatype:",outputs.dtype)

Inputs Shape: torch.Size([37, 3]) , Datatype: torch.int64
Outputs Shape: torch.Size([37, 1]) , Datatype: torch.int64


In [9]:
# We can also check how the inputs look like
print(inputs)

tensor([[ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1,  1],
        [ 1,  1,  2],
        [ 1,  2,  1],
        [ 2,  1, 14],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1,  1],
        [ 1,  1,  2],
        [ 1,  2,  8],
        [ 2,  8,  1],
        [ 8,  1, 18],
        [ 1, 18,  1],
        [18,  1, 14],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1,  1],
        [ 1,  1,  2],
        [ 1,  2,  8],
        [ 2,  8,  1],
        [ 8,  1, 19],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1,  1],
        [ 1,  1,  2],
        [ 1,  2,  8],
        [ 2,  8,  1],
        [ 8,  1, 20],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1,  1],
        [ 1,  1,  2],
        [ 1,  2,  8],
        [ 2,  8,  5],
        [ 8,  5,  5],
        [ 5,  5, 18]])


In [10]:
# We can also check how the outputs look like
print(outputs)

tensor([[ 1],
        [ 1],
        [ 2],
        [ 1],
        [14],
        [ 0],
        [ 1],
        [ 1],
        [ 2],
        [ 8],
        [ 1],
        [18],
        [ 1],
        [14],
        [ 0],
        [ 1],
        [ 1],
        [ 2],
        [ 8],
        [ 1],
        [19],
        [ 0],
        [ 1],
        [ 1],
        [ 2],
        [ 8],
        [ 1],
        [20],
        [ 0],
        [ 1],
        [ 1],
        [ 2],
        [ 8],
        [ 5],
        [ 5],
        [18],
        [ 0]])


In [11]:
# We want our outputs to be single elements in a list and not add another dimension to the list
# So we use a flatten method available in PyTorch to flatten these outputs
outputs = torch.flatten(outputs)

In [30]:
print(outputs)

tensor([ 1,  1,  2,  1, 14,  0,  1,  1,  2,  8,  1, 18,  1, 14,  0,  1,  1,  2,
         8,  1, 19,  0,  1,  1,  2,  8,  1, 20,  0,  1,  1,  2,  8,  5,  5, 18,
         0])


Now that we have our inputs and outputs configured, let's build our **embeddingLookUpMatrix 'C'**

# Building Embedding Look-up Matrix

In the paper the researchers had a big vocabulary of 17000 words,\
They used a very small 30 dimensional feature space.

Because we have a vocabulary of only 27 characters,\
Let's use a very small 2 dimensional feature space for our embedding look-up matrix.

In [12]:
# We decide to build a embeddingLookUpMatrix with 27x2 because we have a vocabulary of 27 characters and we want to fit them in a 2 dimensional space
# In the beginning we initialize it randomly
embeddingFeatureSpaceLength = 2
embeddingLookUpMatrix = torch.randn((len(characters),embeddingFeatureSpaceLength))
# So each one of our 27 characters will have a 2 dimensional embedding
print(embeddingLookUpMatrix)

tensor([[ 0.5820, -0.2370],
        [ 1.1564, -1.5507],
        [ 0.3370,  0.9905],
        [ 2.0201, -1.3263],
        [ 1.7319,  0.8092],
        [-1.1567, -0.1717],
        [ 1.4870,  0.3647],
        [ 1.4763, -0.1208],
        [ 1.4331, -0.6318],
        [ 0.8025, -0.2571],
        [-1.5145,  0.6996],
        [ 0.5932,  1.5696],
        [ 0.9620, -0.2745],
        [ 0.7202, -0.7890],
        [ 1.9879, -0.6847],
        [ 1.8194,  0.2373],
        [ 0.3454,  1.3520],
        [-0.0533, -0.4030],
        [ 0.6345, -1.5687],
        [ 1.2893,  1.3174],
        [-0.3376, -0.0277],
        [ 1.3857, -2.1648],
        [ 0.1235,  0.8132],
        [ 1.2909, -0.0200],
        [-0.7188, -0.9777],
        [-0.0405,  0.3629]])


Now that we have a embedding-look-up matrix,

For example,

We can easily do:
```python
embeddingLookUpMatrix[6]
```

To get the embedding:
```python
tensor([-0.2483, -0.3909])
```

But there is a more similar way to do the exact same thing based on one-hot encoding....

We can do:
```python
F.one_hot(torch.tensor(6), num_classes=27)
```

To get the one-hot embedding
```python
tensor([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

Then convert this to float and multiply it with our original *embeddingLookUpMatrix*:
```python
F.one_hot(torch.tensor(6), num_classes=27).float() @ embeddingLookUpMatrix
```

Which results in:
```python
tensor([-0.2483, -0.3909])
```

This works because of the property of matrix multiplication,

The 0's in our one-hot encoded vector discards all the zeros,\
And only multiplies the 1 to the corresponding column of the embeddingLookUpMatrix.

So we can consider this matrix multiplication to be the first layer of our neural network,\
Giving us our corresponding embedding for the index. *(1x2 embedding vector for our case)*

But we will simply index into our look-up table and discard the way of one-hot encoding for the time being

But, now that we know that we want to index into our look-up embedding matrix,\
How do we do that simultaneously for all the inputs?

PyTorch has got you covered.

In PyTorch we can, very flexibly pick out rows...

For example,\
We can do indexing with lists of indexes:
```python
embeddingLookUpMatrix[[1,2,3]]
```

Which gives out the rows of the corresponding indexes:
```python
tensor([[ 0.2118,  1.0454],
        [ 0.1876,  0.8921],
        [ 0.9759, -0.2606]])
```

We can also do:
```python
embeddingLookUpMatrix[torch.tensor([1,2,3])]
```

Which gives out the rows of the corresponding indexes:
```python
tensor([[ 0.2118,  1.0454],
        [ 0.1876,  0.8921],
        [ 0.9759, -0.2606]])
```

We can similarly pick out the same rows again and again:
```python
embeddingLookUpMatrix[[1,2,3,3,3,3]]
```

Which gives us the same row again and again:
```python
tensor([[ 0.2118,  1.0454],
        [ 0.1876,  0.8921],
        [ 0.9759, -0.2606],
        [ 0.9759, -0.2606],
        [ 0.9759, -0.266],
        [ 0.9759, -0.2606]])
```

Lastly, the magic happens when we try to do the same with multi dimensional lists as well:
```python
embeddingLookUpMatrix[[1, 0], [1, 1]]
```

Which results in:
```python
tensor([ 1.0454, -0.9078])
```

In [13]:
# So we can now easily do
embeddingLookUpMatrix[inputs]

tensor([[[ 0.5820, -0.2370],
         [ 0.5820, -0.2370],
         [ 0.5820, -0.2370]],

        [[ 0.5820, -0.2370],
         [ 0.5820, -0.2370],
         [ 1.1564, -1.5507]],

        [[ 0.5820, -0.2370],
         [ 1.1564, -1.5507],
         [ 1.1564, -1.5507]],

        [[ 1.1564, -1.5507],
         [ 1.1564, -1.5507],
         [ 0.3370,  0.9905]],

        [[ 1.1564, -1.5507],
         [ 0.3370,  0.9905],
         [ 1.1564, -1.5507]],

        [[ 0.3370,  0.9905],
         [ 1.1564, -1.5507],
         [ 1.9879, -0.6847]],

        [[ 0.5820, -0.2370],
         [ 0.5820, -0.2370],
         [ 0.5820, -0.2370]],

        [[ 0.5820, -0.2370],
         [ 0.5820, -0.2370],
         [ 1.1564, -1.5507]],

        [[ 0.5820, -0.2370],
         [ 1.1564, -1.5507],
         [ 1.1564, -1.5507]],

        [[ 1.1564, -1.5507],
         [ 1.1564, -1.5507],
         [ 0.3370,  0.9905]],

        [[ 1.1564, -1.5507],
         [ 0.3370,  0.9905],
         [ 1.4331, -0.6318]],

        [[ 0.3370,  0

In [14]:
# We can also check the shape of this
embeddingLookUpMatrix[inputs].shape

torch.Size([37, 3, 2])

We see that the size of this index is the shape of the original size of the dataset with a 2-dimensional embedding vector space

So if we do:
```python
# Input of 5th block and 3rd index of the block
inputs[5,2]
```

It gives us:
```python
tensor(14)
```

We can look that vector up by doing:
```python
embeddingLookUpMatrix[inputs][5,2]
```

Which gives us the corresponding vector of the item specified:
```python
tensor([ 1.7724, -0.9331])
```

We can verify the same by doing:
```python
embeddingLookUpMatrix[14]
```

Gives the same output:
```python
tensor([ 1.7724, -0.9331])```


In [15]:
# So we can now define our embedding into a variable
embedding = embeddingLookUpMatrix[inputs]

# Constructing Hidden Layer

Let's understand what we will initially have in the hidden layer.

1. The hidden layer will have it's own weights & biases
2. The hidden layer will have it's own neurons which will act as a hyper-parameter to set the number of neurons we want in this layer

So let's initialize weights and biases for now...

Note: The size of the weights will be based on the block size of the inputs and its corresponding vector embedding.\
Thus,

$$\text{Hidden Layer Size} = [(\text{Block Size} * \text{Vector Embedding Dimensions}), \text{Number of Neurons(Hyperparameter)}]$$

In [16]:
# We can initialize the number of neurons we want in the hidden layer
numberOfHiddenLayerNeurons = 100
# Then we can randomly initialize the weights of the hidden layer
weightsOfHiddenLayer = torch.randn((inputBlockSize*embeddingFeatureSpaceLength), numberOfHiddenLayerNeurons)
# Then we can initialize the corresponding biases as well
biasesOfHiddenLayer = torch.randn(numberOfHiddenLayerNeurons)

In [17]:
# We can check the shapes of our hidden layer weights and hidden layer biases
print("Shape of Weights of Hidden Layer:", weightsOfHiddenLayer.shape)
print("Shape of Biases of Hidden Layer:", biasesOfHiddenLayer.shape)

Shape of Weights of Hidden Layer: torch.Size([6, 100])
Shape of Biases of Hidden Layer: torch.Size([100])


Now that we have the weights and biases initialized for our hidden layer.

By convention we would like to do something like:
$$\text{Layer Computation} = \text{Embeddings} * \text{Weights} + \text{Biases}$$

But for our case:

Embeddings are in the shape of the [number of blocks in all the names, block size, vector dimension size]\
for example, [37, 3, 2]\
& Weights are in the shape of [(block size * vector dimension size), number of neurons (hyperparameter)]\
for example, [6, 100]

And thus, we cannot just simply multiply these matrices

There are a numerous ways to do this,\
Either we can convert [37, 3, 2] ---> [37, 6]\
Or we can convert [6, 100] ---> [3, 2, 100]

We will stick with the first one, because it is fairly simpler and would reduce the complexity to understand the problem.

I invite you to also look into the documentation of <a href="https://pytorch.org/docs/stable/index.html">PyTorch</a>.

According to the official documentation,

**TORCH.CAT**
Concatenates the given sequence of seq tensors in the given dimension. All tensors must either have the same shape (except in the concatenating dimension) or be empty.

```python
torch.cat(tensors, dim=0, *, out=None) → Tensor
```

Parameters:
- tensors (sequence of Tensors) – any python sequence of tensors of the same type. Non-empty tensors provided must have the same shape, except in the cat dimension.
- dim (int, optional) – the dimension over which the tensors are concatenated

Keyword Arguments:
- out (Tensor, optional) – the output tensor.

Example:
```python
>>> x = torch.randn(2, 3)
>>> x
tensor([[ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497]])
>>> torch.cat((x, x, x), 0)
tensor([[ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497],
        [ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497],
        [ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497]])
>>> torch.cat((x, x, x), 1)
tensor([[ 0.6580, -1.0969, -0.4614,  0.6580, -1.0969, -0.4614,  0.6580,
         -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497, -0.1034, -0.5790,  0.1497, -0.1034,
         -0.5790,  0.1497]])
```

So we can do:
```python
# Pickout the Embedding along the dimension 0, 1 & 2 and concatenate them along dimension 1
# Each embedding[:, n, :] gives us the 3x2 embeddings
torch.cat([embedding[:, 0, :], embedding[:, 1, :], embedding[:, 2, :]], dim=1)
```

And its shape turns out to be:
```python
torch.Size([37, 6])
```

But this is kind of ugly and we have another method...

**TORCH.UNBIND**
Removes a tensor dimension.

Returns a tuple of all slices along a given dimension, already without it.

```python
torch.unbind(input, dim=0) → seq
```
Parameters:
- input (Tensor) – the tensor to unbind
- dim (int) – dimension to remove

Example:
```python
>>> torch.unbind(torch.tensor([[1, 2, 3],
>>>                            [4, 5, 6],
>>>                            [7, 8, 9]]))
(tensor([1, 2, 3]), tensor([4, 5, 6]), tensor([7, 8, 9]))
```

So now we can do:
```python
torch.cat(torch.unbind(embedding, dim=1), dim=1)
```

Whose shape also turns out to be:
```python
torch.Size([37, 6])
```

We have a third way of doing the same thing.

Its called:\
**TORCH.TENSOR.VIEW**

Which gives me the paternity to explain some of the features of the internals of the PyTorch Library.

We have an interesting blog post <a href="http://blog.ezyang.com/2019/05/pytorch-internals/">here</a> by Edward Z. Yang which you can go through to understand more about this.

To explain *TORCH.TENSOR.VIEW*.\
Let's take an example and explain each step one by one...

For example,\
If we do:
```python
torch.arange(0,18)
```

It gives us:
```python
tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])
```

This can also be viewed as:
```python
torch.arange(0,18).view(9,2)
```

Which gives us:
```python
tensor([[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15],
        [16, 17]])
```

This can also be written as:
```python
torch.arange(0,18).view(3,3,2)
```

Which gives us:
```python
tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])
```

So,\
If we have an embedding of size say: [37, 3, 2]\
We can essentially do:
```python
embedding.view(37,6)
```

We can also verify the result to be the same by doing
```python
embedding.view(37,6) == torch.cat(torch.unbind(embedding, dim=1), dim=1)
```

Resulting in all True values

So, to get the *hidden-states*, we can simply run:

```python
embedding.view(37,6)
```

to get all the hidden layer states as:

```python
hiddenLayerStates = embedding.view(37,6) @ weightsOfHiddenLayer + biasesOfHiddenLayer
```

<hr>

Before we use this, you see how we are using 37 and 6 as a number which is hard-coded and does not make our model very flexible?\
Let's fix it by using *-1* instead of *37* to specify that it should take all the inputs, and use *inputBlockSize\*embeddingFeatureSpaceLength* instead of *6*.

<hr>

Also, remembering our original multi-layer perceptron approach, we had something called tanh() non-linearity.

So, instead we would want to know what tanh() is?

So **what is tanh()**?

In order to answer that we have to understand a few more things,

In mathematics, the trigonometric functions are real functions which relate an angle of a right-angled triangle to ratios of two side lengths.
![Trigonometry](https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/Sinus_und_Kosinus_am_Einheitskreis_1.svg/250px-Sinus_und_Kosinus_am_Einheitskreis_1.svg.png)

In mathematics, hyperbolic functions are analogues of the ordinary trigonometric functions, but defined using the hyperbola rather than the circle.
![Hyperbola v/s Parabola](ExplanationMedia/Images/Hyperbola-vs-Parabola.png)

Here are some of the most used hyperbolic functions:
![sinhcoshtanh](https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/Sinh_cosh_tanh.svg/300px-Sinh_cosh_tanh.svg.png)

So to answer the question,\
**tanh() is a hyperbolic function**, or a non-linearity we use to get a number between **-1 and 1** and ranges between $$-\infty\text{ and }\infty$$

This is the formula for tanh():\
$$\tanh x = \frac{\sinh x}{\cosh x} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1}$$

![tanh](https://miro.medium.com/v2/resize:fit:443/1*WeuJzmlt3iNVWsUsvf24Eg.png)

So, we can now do:
```python
hiddenLayerStates = torch.tanh(embedding.view(37,6) @ weightsOfHiddenLayer + biasesOfHiddenLayer)
```

In [18]:
# So putting all the above things we learnt together, we get
hiddenLayerStates = torch.tanh(embedding.view(-1, inputBlockSize*embeddingFeatureSpaceLength) @ weightsOfHiddenLayer + biasesOfHiddenLayer)
print(hiddenLayerStates)

tensor([[ 0.8461, -0.1966, -0.7330,  ...,  0.9426, -0.9570,  0.2508],
        [ 0.9832,  0.7964, -0.6979,  ...,  0.9806, -0.9932, -0.1872],
        [ 1.0000, -0.1505,  0.4600,  ...,  0.9999, -0.9906,  0.9022],
        ...,
        [ 0.9981, -0.9832,  0.6876,  ..., -0.2359, -0.9823,  0.9468],
        [-0.1683,  0.9544,  0.7570,  ...,  0.9999, -0.9999,  0.9696],
        [ 0.9985,  0.8951,  0.9982,  ...,  0.8906, -0.5284,  0.9913]])


In [19]:
# Let's see what the shape of the hidden layer states look like
print(hiddenLayerStates.shape)

torch.Size([37, 100])


Keep in mind that we need to be careful with the broadcasting rules of the '+' of:
```python
hiddenLayerStates = torch.tanh(embedding.view(37,6) @ weightsOfHiddenLayer + biasesOfHiddenLayer)
```

I will move on to the next section, but to check the broadcasting rules you can refer to <a href="https://pytorch.org/docs/stable/notes/broadcasting.html">original documentation</a> or refer to my <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave Notebook</a> for a more simpler explanation.

So let's create our final layer next...

# Constructing Final Layer

Looking at the shape that we have right now :

```python
torch.Size([37, 100])
```

We see that we have a *100* neurons, taking *37* inputs...

We understand that each of these neurons would be the inputs to our final layer,\
Thus, we have to take a layer where it takes *100* inputs and produces the output of *27*.

Why **27**?

Because, we would be interested in the index of the output now.

So we will do:
```python
# We can initialize the number of neurons we have in the final layer
numberOfFinalLayerOutputs = 27
# Then we can randomly initialize the weights of the final layer
weightsOfFinalLayer = torch.randn(numberOfHiddenLayerNeurons, numberOfFinalLayerOutputs)
# Then we can initialize the corresponding biases as well
biasesOfFinalLayer = torch.randn(numberOfFinalLayerOutputs)
```

Therefore,\
The **logits** our final layer will produce would be:
$$\text{Logits} = \text{hiddenLayerStates} * \text{weightsOfFinalLayer} + \text{biasesOfFinalLayer}$$

In [20]:
# Let's construct our final layer
# We can initialize the number of neurons we have in the final layer
numberOfFinalLayerOutputs = 27
# Then we can randomly initialize the weights of the final layer
weightsOfFinalLayer = torch.randn(numberOfHiddenLayerNeurons, numberOfFinalLayerOutputs)
# Then we can initialize the corresponding biases as well
biasesOfFinalLayer = torch.randn(numberOfFinalLayerOutputs)

In [21]:
# Let's compute logits now
logits = hiddenLayerStates @ weightsOfFinalLayer + biasesOfFinalLayer

In [22]:
# Let's check the output
print(logits)

tensor([[-5.2723e+00,  6.7407e+00,  5.4457e+00,  1.1902e+00,  8.3090e+00,
         -1.1993e+01,  1.1749e+01,  1.0155e+01, -1.2423e+01,  6.8171e+00,
         -1.0491e+01,  1.2763e+01, -3.8437e+00, -3.7703e+00,  9.8887e+00,
          5.3221e+00,  1.5256e+00, -1.1038e+01, -8.4957e+00, -8.4986e+00,
          6.8459e+00, -4.4593e+00,  7.1014e-01, -8.8109e+00,  1.8097e+00,
         -1.3912e+01, -1.2222e+01],
        [-6.5779e+00,  1.1229e+01,  1.1490e+01, -5.1242e-01,  2.7115e+00,
         -1.4522e+01,  3.8286e+00,  1.0287e+01, -2.5238e+01,  8.7356e+00,
         -6.7601e+00,  9.6300e+00, -5.4677e+00, -7.5854e+00,  7.6966e+00,
         -9.0475e+00, -6.8988e+00, -1.5325e+01, -2.3821e+00, -1.6014e+01,
          1.6096e+00,  2.9128e-01,  8.5502e+00, -1.7395e+01,  4.0671e+00,
         -1.4717e+01, -3.6569e+00],
        [-9.8583e+00,  1.1538e+01,  5.3546e+00,  8.9338e-02,  3.4122e+00,
         -1.6844e+01,  6.7176e+00,  7.0386e+00, -1.4026e+01,  7.0354e+00,
         -9.2277e+00,  4.9584e+00, -4.76

In [23]:
#Let's check the shape as well
print(logits.shape)

torch.Size([37, 27])


So, we want to do exactly now what we did in our previous <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave Notebook</a>.\
We want to:
1. Take the logits
2. Exponentiate them
3. Normalize them into a probability that sum to 1

In [24]:
# So let's get our probabilities back
# Calculating counts from logits
counts = logits.exp()
# Normalizing counts to probabilities that sum to 1
probabilities = counts / counts.sum(1, keepdims=True)

In [25]:
# Let's see the probabilities in action
print(probabilities)

tensor([[9.7155e-09, 1.6019e-03, 4.3874e-04, 6.2239e-06, 7.6865e-03, 1.1713e-11,
         2.3968e-01, 4.8713e-02, 7.6167e-12, 1.7290e-03, 5.2588e-11, 6.6065e-01,
         4.0539e-08, 4.3627e-08, 3.7305e-02, 3.8775e-04, 8.7040e-06, 3.0449e-11,
         3.8685e-10, 3.8574e-10, 1.7796e-03, 2.1903e-08, 3.8511e-06, 2.8226e-10,
         1.1564e-05, 1.7189e-12, 9.3201e-12],
        [6.0114e-09, 3.2555e-01, 4.2259e-01, 2.5893e-06, 6.5058e-05, 2.1321e-12,
         1.9881e-04, 1.2682e-01, 4.7295e-17, 2.6886e-02, 5.0098e-09, 6.5764e-02,
         1.8244e-08, 2.1949e-09, 9.5128e-03, 5.0866e-10, 4.3611e-09, 9.5495e-13,
         3.9920e-07, 4.7977e-13, 2.1615e-05, 5.7838e-06, 2.2336e-02, 1.2052e-13,
         2.5236e-04, 1.7556e-12, 1.1156e-07],
        [4.8819e-10, 9.5707e-01, 1.9745e-03, 1.0205e-05, 2.8307e-04, 4.5179e-13,
         7.7166e-03, 1.0638e-02, 7.5594e-12, 1.0603e-02, 9.1720e-10, 1.3286e-03,
         7.9691e-08, 1.3872e-09, 1.2151e-05, 3.4728e-07, 1.0055e-08, 8.0994e-11,
         7.2830e-

Now that we have our probabilities,\
We also want to:
1. Calculate Loss
2. Tune the particular weights depending on the gradients

So just like our older <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave Notebook</a>, we will calculate loss such that,
1. We will take the log likelihood of the probabilities
2. Take the average of the log likelihood
3. Convert it to negetive average log likelihood

In [26]:
# Let's calculate our own average negetive log likelihood straight from this tensor
# This is the vectorized form of that expression
loss = -probabilities[torch.arange(len(inputs)), outputs].log().mean()
print(loss)

tensor(11.9243)


Let's make our code a little more respectable...

What is **entropy**?

Entropy is the measurement of disorder or impurities in the information processed in machine learning.

$$ H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) $$

**When entropy becomes 0, then the dataset has no impurity.**

![LowHighEntropy](https://static.javatpoint.com/tutorial/machine-learning/images/entropy-in-machine-learning3.png)

What is the **Information Gain in Entropy**?

Information gain is defined as the pattern observed in the dataset by calculating the reduction in entropy or surprise by splitting a dataset according to a given value of a random variable.

$$ \text{Information Gain} = 1-\text{Entropy} $$

![LowHighInformationGain](https://miro.medium.com/v2/resize:fit:1400/1*DsjX_bHYWn21Z0VIPjxnbw.png)

What is **Cross-Entropy**?

Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.

What is **Cross-Entropy Loss**?

Cross-entropy loss refers to the contrast between two random variables. It measures the variables to extract the difference in the information they contain, showcasing the results.

So,\
We now understand that our lines of code:
```python
# Calculating counts from logits
counts = logits.exp()
# Normalizing counts to probabilities that sum to 1
probabilities = counts / counts.sum(1, keepdims=True)
# Calculating the negetive average log likelihood as loss
loss = -probabilities[torch.arange(len(inputs)), outputs].log().mean()
```

Can now be replaced by a ready-made function:
```python
loss = F.cross_entropy(logits, outputs)
```

Since we are doing classification...

In [31]:
# Calculating the negetive average log likelihood as loss (cross entropy loss)
loss = F.cross_entropy(logits, outputs)
# We get the same loss now
print(loss)

tensor(11.9243)


Now, why are we using *F.cross_entropy()* might be the next question coming to your mind...

There are a number of reasons, as to why,
1. Its efficient and does not require to use new memory
2. We practically don't use the three lines of code

Now you might be wondering why my last point is valid.\
Let me explain...

Suppose we write the same code:
```python
#Defining logits (example)
logits = torch.tensor([-100, -3, 0, 100])
# Calculating counts from logits
counts = logits.exp()
# Normalizing counts to probabilities that sum to 1
probabilities = counts / counts.sum(1, keepdims=True)
# Calculating the negetive average log likelihood as loss
loss = -probabilities[torch.arange(len(inputs)), outputs].log().mean()
```

Even if the code seems simple, we run into problems...\
Because when we represent counts by performing an exponential function,\
Any large negetive number works fine(it represents a very tiny number),\
But the moment we associate a very positive number along with it(it tries to represent a very large number) and it goes out of memory and count turns out to be inifinity and probabilities remain undefined as well.

So summarising the reasons:
1. The forward pass would be much more efficient
2. The backward pass would be much more efficient
3. The calculations would be mathematically well behaved

##### Now that we have all the layers, let's put all the parameters in a single variable to access all of them way faster

Let's recall what those were:
1. The *weights and biases* of the *output layer*
2. The *weights and biases* of the *hidden layer*
3. The *embedding look-up table 'C'*

In [33]:
# Let's define our parameters variable
parameters = [embeddingLookUpMatrix, weightsOfHiddenLayer, biasesOfHiddenLayer, weightsOfFinalLayer, biasesOfFinalLayer]
print(parameters)

[tensor([[ 0.5820, -0.2370],
        [ 1.1564, -1.5507],
        [ 0.3370,  0.9905],
        [ 2.0201, -1.3263],
        [ 1.7319,  0.8092],
        [-1.1567, -0.1717],
        [ 1.4870,  0.3647],
        [ 1.4763, -0.1208],
        [ 1.4331, -0.6318],
        [ 0.8025, -0.2571],
        [-1.5145,  0.6996],
        [ 0.5932,  1.5696],
        [ 0.9620, -0.2745],
        [ 0.7202, -0.7890],
        [ 1.9879, -0.6847],
        [ 1.8194,  0.2373],
        [ 0.3454,  1.3520],
        [-0.0533, -0.4030],
        [ 0.6345, -1.5687],
        [ 1.2893,  1.3174],
        [-0.3376, -0.0277],
        [ 1.3857, -2.1648],
        [ 0.1235,  0.8132],
        [ 1.2909, -0.0200],
        [-0.7188, -0.9777],
        [-0.0405,  0.3629]]), tensor([[-1.3309e+00,  5.3760e-01, -8.8002e-01, -1.6027e+00, -8.7127e-01,
          1.5733e+00,  1.5703e+00, -5.7903e-02, -6.7040e-01,  1.5406e+00,
         -6.2869e-02, -7.6818e-01,  3.9260e-01, -2.2779e+00, -1.8487e-01,
          7.3894e-03, -2.6233e-01,  1.8654e+00,

In [36]:
# We must set requires_grad to True in all the parameters to avoid any errors in the future
for parameter in parameters:
    parameter.requires_grad = True

Let's put everything we have together for now with a respectable generator so that we all get the same output

In [34]:
# We will define a generator to give the same result on your machine, as of my machine
generator = torch.Generator().manual_seed(6942069420)
# Embedding Matrix (Input Layer)
embeddingFeatureSpaceLength = 2
embeddingLookUpMatrix = torch.randn((len(characters),embeddingFeatureSpaceLength), generator=generator)
# Hidden Layer
numberOfHiddenLayerNeurons = 100
weightsOfHiddenLayer = torch.randn((inputBlockSize*embeddingFeatureSpaceLength), numberOfHiddenLayerNeurons, generator=generator)
biasesOfHiddenLayer = torch.randn(numberOfHiddenLayerNeurons, generator=generator)
# Output Layer / Final Layer
numberOfFinalLayerOutputs = 27
weightsOfFinalLayer = torch.randn(numberOfHiddenLayerNeurons, numberOfFinalLayerOutputs, generator=generator)
biasesOfFinalLayer = torch.randn(numberOfFinalLayerOutputs, generator=generator)
# Parameters
parameters = [embeddingLookUpMatrix, weightsOfHiddenLayer, biasesOfHiddenLayer, weightsOfFinalLayer, biasesOfFinalLayer]

In [35]:
# Let's check how many parameters we have
sum(parameter.nelement() for parameter in parameters)

3479

Let's understand how neural network will train itself with forward pass, backward pass and updatation now...

Now that we have trained two neural networks already...

We can safely say that they work in the sequence:
1. Forward Pass - Makes calculations and calculates loss
2. Backward Pass - Resets all the gradients and back propagtes through the network
3. Data Updation - Updates the data for all the parameters in the opposite direction of the gradient depending on the learning rate

In [38]:
# Forward Pass
embedding = embeddingLookUpMatrix[inputs]
hiddenLayerStates = torch.tanh(embedding.view(-1, inputBlockSize*embeddingFeatureSpaceLength) @ weightsOfHiddenLayer + biasesOfHiddenLayer)
logits = hiddenLayerStates @ weightsOfFinalLayer + biasesOfFinalLayer
loss = F.cross_entropy(logits, outputs)
print("Loss:", loss)

# Backward Pass
for parameter in parameters:
    parameter.grad = None
loss.backward()

# Update Weights
learning_rate = 0.1
for parameter in parameters:
    parameter.data += -learning_rate * parameter.grad

Loss: tensor(9.8323, grad_fn=<NllLossBackward0>)


So we can take it in a loop to train the model over and over...

In [39]:
# We define the number of epochs
epochs = 10
for _ in range(epochs):
    # Forward Pass
    embedding = embeddingLookUpMatrix[inputs]
    hiddenLayerStates = torch.tanh(embedding.view(-1, inputBlockSize*embeddingFeatureSpaceLength) @ weightsOfHiddenLayer + biasesOfHiddenLayer)
    logits = hiddenLayerStates @ weightsOfFinalLayer + biasesOfFinalLayer
    loss = F.cross_entropy(logits, outputs)
    print("Loss:", loss)
    
    # Backward Pass
    for parameter in parameters:
        parameter.grad = None
    loss.backward()
    
    # Update Weights
    learning_rate = 0.1
    for parameter in parameters:
        parameter.data += -learning_rate * parameter.grad

Loss: tensor(7.7304, grad_fn=<NllLossBackward0>)
Loss: tensor(6.5795, grad_fn=<NllLossBackward0>)
Loss: tensor(5.4027, grad_fn=<NllLossBackward0>)
Loss: tensor(4.4660, grad_fn=<NllLossBackward0>)
Loss: tensor(4.5961, grad_fn=<NllLossBackward0>)
Loss: tensor(3.2872, grad_fn=<NllLossBackward0>)
Loss: tensor(3.2044, grad_fn=<NllLossBackward0>)
Loss: tensor(2.4061, grad_fn=<NllLossBackward0>)
Loss: tensor(2.0618, grad_fn=<NllLossBackward0>)
Loss: tensor(1.8381, grad_fn=<NllLossBackward0>)
