### MLP

The problem with the previous model, which had only 1 layer of 28 neurons, was that it looked at only one previous character, so the generated names were not very convincing.

The problem with the tabular probabilities in the model was scalability.

Since we were looking at only 1 previous character, the table had just 28 rows (one per previous character), each holding 28 probabilities.\
If we want to look at the 2 previous characters to predict the next one, the number of rows in the table grows to 28*28 = 784.

The number of columns will still remain the same (denoting the next character) but the rows will be 784 now, denoting the possible combinations of 2 previous characters.

If we wish to consider 3 previous characters, then the number of rows would become `28^3 = 21,952` (roughly 22,000).

Also note that the matrix of probabilities would be a **sparse matrix**, as typical names do not contain every possible combination of 2 characters.
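To make the growth concrete, here is a small sketch that computes the table size for a few context lengths. It assumes the same vocabulary of 28 symbols used in this notebook; the variable names are only illustrative.

```python
# Rows in a probability table that conditions on the previous k characters,
# for a vocabulary of 28 symbols (the characters in the names plus '.').
vocab_size = 28

table_rows = {k: vocab_size ** k for k in range(1, 5)}

for k, rows in table_rows.items():
    # every row still needs vocab_size columns, one per candidate next character
    print(f"context={k}: rows={rows:>7,}  entries={rows * vocab_size:>10,}")
```

The number of rows is multiplied by 28 with every extra character of context, which is why the table approach stops being practical very quickly.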

Hence we will stick with the neural network approach and make it more expressive by building a **Multi-Layer Perceptron**.


### Taking inspiration from Bengio et al. 2003

The paper **[A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)** proposes a language model that works on words; we are working on a character-level language model, taking inspiration from the same model.

The paper discusses representing each word from the entire vocabulary of 17,000 words in a 30-dimensional space. \
Initially, these points (word embeddings) are spread out in the space at random, and then tuned using back-propagation in such a way that words with similar meanings or which are related to each other in some way end up staying close to each other, and conversely, the words with different meanings would end up being distant from each other.

Similar to the paper, we use a multi-layer neural network to predict the next character given the sequence of previous characters. As in the paper, the network is trained by maximising the log-likelihood of the training data, so that characters occurring in similar contexts end up close to each other in the learned vector space.

In [2]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline 

import pickle

In [3]:
# read all the cleaned names 
with open('indian_names_clean.pkl', 'rb') as f:
    names = pickle.load(f)

print(f"Ready! Loaded {len(names)} names")
print("First 10:", names[:10])

Ready! Loaded 64128 names
First 10: ['jyotirmoy', 'ilamuhil', 'indravathi', 'raamen', 'benudhar', 'mithushaya', 'malani', 'sathuna', 'oviyashri', 'vaitheeswarsn']


In [7]:
# build the vocabulary of characters and lookup tables
chars = sorted(list(set(''.join(names))))
stoi = {s:i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s, i in stoi.items()}

print(stoi)
print(itos)

{'-': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 's': 20, 't': 21, 'u': 22, 'v': 23, 'w': 24, 'x': 25, 'y': 26, 'z': 27, '.': 0}
{1: '-', 2: 'a', 3: 'b', 4: 'c', 5: 'd', 6: 'e', 7: 'f', 8: 'g', 9: 'h', 10: 'i', 11: 'j', 12: 'k', 13: 'l', 14: 'm', 15: 'n', 16: 'o', 17: 'p', 18: 'q', 19: 'r', 20: 's', 21: 't', 22: 'u', 23: 'v', 24: 'w', 25: 'x', 26: 'y', 27: 'z', 0: '.'}


### Building the Neural Network
Taking reference from the paper 
![image.png](attachment:image.png)

The neural network takes 3 input words (in our case, we will take 3 input characters) to predict the 4th word (character).

In [24]:
# build the dataset for the neural network
block_size = 3 # context length for a 4-gram character level language model : we are taking previous 3 characters to predict the 4th one
X, Y = [], []

for name in names[:2]:
    print(name)
    context = [0] * block_size
    for ch in name + '.':
        index = stoi[ch]
        X.append(context)
        Y.append(index)
        print(''.join(itos[i] for i in context), '---->', itos[index])
        context = context[1:] + [index]    # crop and append

X = torch.tensor(X)
Y = torch.tensor(Y)


jyotirmoy
... ----> j
..j ----> y
.jy ----> o
jyo ----> t
yot ----> i
oti ----> r
tir ----> m
irm ----> o
rmo ----> y
moy ----> .
ilamuhil
... ----> i
..i ----> l
.il ----> a
ila ----> m
lam ----> u
amu ----> h
muh ----> i
uhi ----> l
hil ----> .


In [27]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([19, 3]), torch.int64, torch.Size([19]), torch.int64)
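The sliding-window loop above can be wrapped in a small helper so that it is easy to try other context lengths. This is only a sketch: `build_dataset` and the toy vocabulary `toy_stoi` are hypothetical names introduced here, and the helper returns plain Python lists rather than tensors.

```python
def build_dataset(names, stoi, block_size=3):
    """Return (contexts, targets) as plain Python lists of integer indices."""
    contexts, targets = [], []
    for name in names:
        context = [0] * block_size           # start with all '.' padding (index 0)
        for ch in name + '.':
            index = stoi[ch]
            contexts.append(context)
            targets.append(index)
            context = context[1:] + [index]  # slide the window by one character
    return contexts, targets

# toy vocabulary, only for illustration
toy_stoi = {'.': 0, 'a': 1, 'b': 2}
toy_X, toy_Y = build_dataset(['ab'], toy_stoi, block_size=2)
print(toy_X)  # [[0, 0], [0, 1], [1, 2]]
print(toy_Y)  # [1, 2, 0]
```

Passing the results through `torch.tensor(...)` would give the same kind of `X` and `Y` tensors as in the cell above.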

### Building Lookup Table (C)

`{'-': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 's': 20, 't': 21, 'u': 22, 'v': 23, 'w': 24, 'x': 25, 'y': 26, 'z': 27, '.': 0}`

We have 28 possible characters and we are going to embed them in a lower-dimensional space. (In the paper, they have 17,000 words and embed them in as few as 30 dimensions.) In our case, let's start by embedding the 28 characters in a 2-D space.

Let's build the lookup table with 28 rows, one per character, and 2 columns holding the 2 features of each character.

In [29]:
C = torch.randn((28,2))

Before embedding every character into the 2-dimensional space using C, let's first embed a single integer to see how the embedding works.

One way to do it is to index directly into row 5 of the lookup table C:

In [30]:
C[5]

tensor([-0.6343,  1.3699])

Another way of achieving this is the one we used in `build_makemore_nn`: a seemingly different but actually identical method,\
the **one-hot encoding**.

Here we get a vector of length 28 whose entry at index 5 is activated (1).

In [33]:
F.one_hot(torch.tensor(5), num_classes=28)

tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0])

In [34]:
F.one_hot(torch.tensor(5), num_classes=28).shape

torch.Size([28])

### Seemingly Different but actually same

The reason the one-hot encoding is the same as C[5]: notice that \
the one-hot encoding of 5 -> [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

When we multiply this vector with C, only the entry at index 5 is 1, so the product picks out only row 5 of C.\
The one-hot encoding of 5 is a `[1 * 28]` matrix and C is a `[28 * 2]` matrix.\
Therefore, their product is a `[1 * 2]` row vector, which is just row 5 of C.

In [36]:
C[5]

tensor([-0.6343,  1.3699])

In [35]:
F.one_hot(torch.tensor(5), num_classes=28).float() @ C

tensor([-0.6343,  1.3699])

### Word Embedding Lookup via One-Hot Multiplication

Let e = [0, 0, 0, 0, 0, 1, 0, ..., 0] (1 at position 5) - One-hot encoding of 5 \
C is [28 * 2] lookup table, a matrix with 28 rows corresponding to all 28 distinct characters of our dataset, and 2 columns storing the 2 features of each character.

The operation `e @ C` (where e is a one-hot vector of length 28 with a 1 at index 5, and C is a 28×2 matrix) simply selects the 5th row of C (0-based indexing, so row index 5).

**A one-hot vector acts like a selector: it has zeros everywhere except a single 1.**\
Matrix multiplication with a one-hot row vector picks out the corresponding row from the matrix.

The first layer of the neural network can therefore be treated as plain indexing into the matrix `C`, which is faster than computing `e @ C`.
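The equivalence also holds for several indices at once. The sketch below compares the two routes on a stand-in table; `C_demo`, `idx` and the seed are arbitrary choices made for this example and are not the values used elsewhere in the notebook.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)         # arbitrary seed, for repeatability only
C_demo = torch.randn((28, 2), generator=g)   # stand-in lookup table
idx = torch.tensor([5, 0, 27, 5])            # arbitrary character indices

via_onehot = F.one_hot(idx, num_classes=28).float() @ C_demo   # [4, 28] @ [28, 2]
via_index = C_demo[idx]                                        # plain row lookup

print(via_onehot.shape, via_index.shape)
print(torch.allclose(via_onehot, via_index))
```

Both routes return a `[4, 2]` tensor with the same rows; indexing simply skips building the `[4, 28]` one-hot matrix and the multiplication.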

In [38]:
X

tensor([[ 0,  0,  0],
        [ 0,  0, 11],
        [ 0, 11, 26],
        [11, 26, 16],
        [26, 16, 21],
        [16, 21, 10],
        [21, 10, 19],
        [10, 19, 14],
        [19, 14, 16],
        [14, 16, 26],
        [ 0,  0,  0],
        [ 0,  0, 10],
        [ 0, 10, 13],
        [10, 13,  2],
        [13,  2, 14],
        [ 2, 14, 22],
        [14, 22,  9],
        [22,  9, 10],
        [ 9, 10, 13]])

In [37]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([19, 3]), torch.int64, torch.Size([19]), torch.int64)

![image.png](attachment:image.png)

From the image above, the 3 columns of each row of X go through 3 lookups simultaneously, as each row of X holds the 3 previous characters.

X[0,0] = index for $w_{t-n+1}$ (with our context of 3 characters, this is $w_{t-3}$)\
X[0,1] = index for $w_{t-2}$\
X[0,2] = index for $w_{t-1}$

Looking up (embedding) just one integer (like 5 above) was easy, as it was a direct lookup (`C[5]`).\
How do we simultaneously look up all the [19 x 3] integers in X?

### Pytorch Indexing
PyTorch provides a powerful indexing technique where we can select specific rows using lists: \
example: 
* `C[[5,6,7]]` gives the rows at indices 5, 6 and 7 of C
* We can also index with a tensor: `C[torch.tensor([5,6,7])]`
* We can repeat a row multiple times as well: `C[torch.tensor([7,7,7,7,7,7,7])]`



In [53]:
print(C[5])
print(C[6])
print(C[7])
print(C[[5,6,7]])
print(C[torch.tensor([5,6,7])])               # indexing with a tensor instead of a list
print(C[torch.tensor([7,7,7,7,7,7,7])])       # the same row can be repeated

tensor([-0.6343,  1.3699])
tensor([-2.2420,  0.7920])
tensor([-0.3851, -1.3634])
tensor([[-0.6343,  1.3699],
        [-2.2420,  0.7920],
        [-0.3851, -1.3634]])
tensor([[-0.6343,  1.3699],
        [-2.2420,  0.7920],
        [-0.3851, -1.3634]])
tensor([[-0.3851, -1.3634],
        [-0.3851, -1.3634],
        [-0.3851, -1.3634],
        [-0.3851, -1.3634],
        [-0.3851, -1.3634],
        [-0.3851, -1.3634],
        [-0.3851, -1.3634]])


The examples above used a 1-D list of indices.

We can also index C with a multi-dimensional integer tensor such as X:

In [60]:
X.shape

torch.Size([19, 3])

In [61]:
C.shape

torch.Size([28, 2])

In [55]:
C[X].shape

torch.Size([19, 3, 2])

In the above retrieval, the shape of C[X] is [19 * 3 * 2]:
* 19 is the number of rows of X (input sequences)
* 3 is the number of consecutive characters in each sequence
* 2 is the number of features of each character

To understand this better:\
X[13,2] is the $w_{t-1}$ character (the most recent one) in the context for the target Y[13],\
so C[X[13,2]] should be equal to C[index], where index is the value of X[13,2].

To visualize:
![image.png](attachment:image.png)

Then C[X] will look like below:
![image-2.png](attachment:image-2.png)

In [56]:
print(X[13,2])
print(C[X[13,2]])

tensor(2)
tensor([-0.4475, -0.8111])


In [57]:
C[2]

tensor([-0.4475, -0.8111])
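The single-position check above can be repeated for every position. The sketch below does this on a stand-in table; `C_demo` is random and `X_demo` reuses three context rows from the `X` printed earlier, so the embedding values will not match the ones shown in this notebook.

```python
import torch

g = torch.Generator().manual_seed(0)          # arbitrary seed
C_demo = torch.randn((28, 2), generator=g)    # stand-in lookup table
X_demo = torch.tensor([[0, 0, 11],
                       [0, 11, 26],
                       [11, 26, 16]])         # three contexts from the X shown earlier

emb_demo = C_demo[X_demo]
print(emb_demo.shape)                         # torch.Size([3, 3, 2])

# every entry emb_demo[i, j] is the row of C_demo selected by X_demo[i, j]
for i in range(X_demo.shape[0]):
    for j in range(X_demo.shape[1]):
        assert torch.equal(emb_demo[i, j], C_demo[X_demo[i, j]])
```

In other words, indexing with an integer tensor keeps the shape of the index tensor and appends the shape of one row of the table.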

### The first layer of NN
Creating the first layer of the neural network - A mapping of each sequence to the lookup table

In [62]:
# Creating the embedding
emb = C[X]
emb.shape

torch.Size([19, 3, 2])

### Creating the Hidden Layer
```
Number of inputs to the hidden layer = Total number of outputs from the first layer 
Total number of outputs from the first layer = Total number of characters in our sequence * number of features of each character
                                             = 3 * 2 = 6
```

In [64]:
W1 = torch.randn((6, 100))   # 100 are the number of neurons in the hidden layer (we can choose it however)
b1 = torch.randn(100)       # 100 biases for the 100 neurons

### Forward Pass for the Hidden Layer

Generally we would do: `emb @ W1 + b1` \
but here we cannot, as the shapes do not match: `emb` is [19 x 3 x 2] while `W1` is [6 x 100].

In [65]:
emb @ W1 + b1

RuntimeError: mat1 and mat2 shapes cannot be multiplied (57x2 and 6x100)

### Unwrapping the C[X] 
In order to perform the forward pass through the hidden layer of 100 neurons, we will unwrap the C[X] embeddings as shown below:

![image.png](attachment:image.png)

Instead of sending a [1 * 3 * 2] tensor to the hidden layer (one row from C[X]), we will now send a single row with 6 features like below:

![image.png](attachment:image.png)

Here we have unwrapped each character as its own feature set and concatenated one after the other

In [66]:
torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1).shape

torch.Size([19, 6])

The result above shows that each input sequence has been unwrapped, so that each sequence of 3 characters is now a single row of 6 features:

![image.png](attachment:image.png)

In [73]:
# Generalizing the list [emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]] with torch.unbind
len(torch.unbind(emb, 1))

3

In [93]:
torch.cat(torch.unbind(emb, 1), 1)

tensor([[-0.5534,  2.1715, -0.5534,  2.1715, -0.5534,  2.1715],
        [-0.5534,  2.1715, -0.5534,  2.1715, -1.9096,  0.7250],
        [-0.5534,  2.1715, -1.9096,  0.7250,  0.9063,  1.5535],
        [-1.9096,  0.7250,  0.9063,  1.5535,  0.3775, -1.4243],
        [ 0.9063,  1.5535,  0.3775, -1.4243, -1.3809,  1.2106],
        [ 0.3775, -1.4243, -1.3809,  1.2106,  2.0941,  0.4766],
        [-1.3809,  1.2106,  2.0941,  0.4766, -2.2420,  0.1973],
        [ 2.0941,  0.4766, -2.2420,  0.1973,  1.3985, -0.2247],
        [-2.2420,  0.1973,  1.3985, -0.2247,  0.3775, -1.4243],
        [ 1.3985, -0.2247,  0.3775, -1.4243,  0.9063,  1.5535],
        [-0.5534,  2.1715, -0.5534,  2.1715, -0.5534,  2.1715],
        [-0.5534,  2.1715, -0.5534,  2.1715,  2.0941,  0.4766],
        [-0.5534,  2.1715,  2.0941,  0.4766,  0.6833, -0.4916],
        [ 2.0941,  0.4766,  0.6833, -0.4916, -0.4475, -0.8111],
        [ 0.6833, -0.4916, -0.4475, -0.8111,  1.3985, -0.2247],
        [-0.4475, -0.8111,  1.3985, -0.2

### Better & Efficient Way
There is a significantly more efficient way to achieve the above.

In PyTorch, `.view()` is a very cheap operation.

Let's see that with an example:

In [77]:
a = torch.arange(18)
a

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])

In [76]:
a.shape

torch.Size([18])

In [78]:
a.view(2,9)

tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
        [ 9, 10, 11, 12, 13, 14, 15, 16, 17]])

In [79]:
a.view(9,2)

tensor([[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15],
        [16, 17]])

In [82]:
a.view(3, 3, 2)

tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])

In [85]:
a.storage()

 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
[torch.storage.TypedStorage(dtype=torch.int64, device=cpu) of size 18]

### View Function
Why is it so efficient?

Every tensor has an underlying storage(), which always holds all the numbers as a 1-dimensional vector.\
PyTorch uses this storage to represent any tensor in memory: **always as a 1-D vector**

When we call `.view()`, we only change some attributes of the tensor (such as its shape and strides) that dictate how this 1-D sequence is interpreted as `an n-dimensional tensor`.

No memory is changed, copied, moved or created when we call `.view()`, unlike the concatenation shown above, which creates a new storage.
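One way to observe this is to compare memory addresses. The sketch below uses `Tensor.data_ptr()`, which returns the address of a tensor's first element; the tensors `a`, `v` and `c` exist only for this illustration.

```python
import torch

a = torch.arange(18)
v = a.view(3, 3, 2)                       # new shape, same underlying buffer
c = torch.cat([a[:9], a[9:]])             # concatenation allocates a new buffer

print(v.data_ptr() == a.data_ptr())       # True: the view points at a's memory
print(c.data_ptr() == a.data_ptr())       # False: cat made a copy

v[0, 0, 0] = 100                          # writing through the view...
print(a[0].item())                        # ...changes a as well -> 100
print(c[0].item())                        # the copy is unaffected -> 0
```

Note that `.view()` only works when the requested shape is compatible with the tensor's memory layout. `emb = C[X]` is a freshly allocated, contiguous tensor, so `emb.view(19, 6)` is fine.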

In [107]:
emb.shape

torch.Size([19, 3, 2])

In [108]:
# getting the same shape as torch.cat(torch.unbind(emb, 1), 1).shape
emb.view(19,6)

tensor([[-0.5534,  2.1715, -0.5534,  2.1715, -0.5534,  2.1715],
        [-0.5534,  2.1715, -0.5534,  2.1715, -1.9096,  0.7250],
        [-0.5534,  2.1715, -1.9096,  0.7250,  0.9063,  1.5535],
        [-1.9096,  0.7250,  0.9063,  1.5535,  0.3775, -1.4243],
        [ 0.9063,  1.5535,  0.3775, -1.4243, -1.3809,  1.2106],
        [ 0.3775, -1.4243, -1.3809,  1.2106,  2.0941,  0.4766],
        [-1.3809,  1.2106,  2.0941,  0.4766, -2.2420,  0.1973],
        [ 2.0941,  0.4766, -2.2420,  0.1973,  1.3985, -0.2247],
        [-2.2420,  0.1973,  1.3985, -0.2247,  0.3775, -1.4243],
        [ 1.3985, -0.2247,  0.3775, -1.4243,  0.9063,  1.5535],
        [-0.5534,  2.1715, -0.5534,  2.1715, -0.5534,  2.1715],
        [-0.5534,  2.1715, -0.5534,  2.1715,  2.0941,  0.4766],
        [-0.5534,  2.1715,  2.0941,  0.4766,  0.6833, -0.4916],
        [ 2.0941,  0.4766,  0.6833, -0.4916, -0.4475, -0.8111],
        [ 0.6833, -0.4916, -0.4475, -0.8111,  1.3985, -0.2247],
        [-0.4475, -0.8111,  1.3985, -0.2

In [109]:
# We can compare element wise to check if they are same
emb.view(19,6) == torch.cat(torch.unbind(emb, 1), 1)

tensor([[True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True]])

In [110]:
# Recalculating the hidden layer
h = emb.view(19,6) @ W1 + b1

In [111]:
h.shape

torch.Size([19, 100])

We get 100 activations for each of the 19 input sequences.

In [113]:
emb.shape

torch.Size([19, 3, 2])

In [114]:
C.shape

torch.Size([28, 2])

In [115]:
X.shape

torch.Size([19, 3])

In [117]:
# Generalizing the hidden layer
num_entries = X.shape[0]
num_features = X.shape[1] * C.shape[1]          # number of characters in sequence * num of features per character

print(num_entries)
print(num_features)

19
6


In [118]:
h = emb.view(num_entries, num_features) @ W1 + b1

In [119]:
h.shape

torch.Size([19, 100])

In [120]:
# lets pass it through tanh as well
h = torch.tanh(emb.view(num_entries, num_features) @ W1 + b1)

In [121]:
h.shape

torch.Size([19, 100])

In [124]:
h[:2]

tensor([[ 0.9961, -0.9988,  0.4939, -0.9324,  0.9930, -0.9795, -1.0000,  0.9997,
          0.4910,  0.7475,  0.9980,  0.9923, -1.0000,  0.5178, -0.9665,  0.0771,
         -0.9928,  0.9971,  0.6835, -1.0000,  0.9871, -0.5764,  1.0000, -0.9969,
          0.9989, -0.9841,  1.0000, -0.7683,  0.6020,  0.9481, -0.9554, -0.2131,
          0.9305, -0.6129, -0.9922,  0.9651, -0.8075, -0.9999, -1.0000, -0.9816,
          1.0000, -0.9997,  0.0081,  0.8927,  0.5256,  0.5756, -0.9173,  0.9979,
         -1.0000,  0.4030, -0.9430, -0.2375, -0.1984, -0.9799,  1.0000, -0.9891,
          0.6619, -0.9530, -0.8044, -0.9950,  0.9720,  0.6648,  1.0000,  0.9964,
         -0.9999,  0.9999,  1.0000,  0.9843, -0.9998,  0.7945, -0.6615, -0.0789,
          0.9999,  0.8685, -0.9968, -1.0000,  0.9989,  0.4683, -0.2392,  0.9916,
          0.9677, -0.0605,  0.0543, -0.8495, -1.0000,  0.9933, -0.0946,  0.9814,
         -0.9968,  0.2768, -1.0000,  0.6442,  0.2512, -0.9688, -1.0000,  0.4932,
         -0.9978, -0.9997,  

### Broadcasting

When we compute `emb.view(19,6) @ W1 + b1`,

notice that `emb.view(19,6) @ W1` is a [19 x 100] matrix and `b1` is a vector of 100 elements.

So when we add the [100] vector to the [19 x 100] matrix, broadcasting treats `b1` as [1 x 100], copies that row across all 19 rows, and then performs an element-wise addition. The same 100 biases are therefore added to every input row.
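The broadcasting step can be spelled out explicitly to confirm that it does what we expect. In this sketch `pre` and `b` are random stand-ins for `emb.view(19, 6) @ W1` and `b1`; only the shapes matter.

```python
import torch

g = torch.Generator().manual_seed(0)               # arbitrary seed
pre = torch.randn((19, 100), generator=g)          # stand-in for emb.view(19, 6) @ W1
b = torch.randn(100, generator=g)                  # stand-in for b1

implicit = pre + b                                 # broadcasting: [19, 100] + [100]
explicit = pre + b.view(1, 100).expand(19, 100)    # the same thing, spelled out

print(implicit.shape)
print(torch.equal(implicit, explicit))
```

Because the bias is aligned with the last dimension (the 100 neurons), each neuron gets its own bias and all 19 rows share it, which is the behaviour we want here.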

### Visual Representation of Hidden layer

![image.png](attachment:image.png)

### Creating the Final layer of NN

![image.png](attachment:image.png)

The number of inputs is the same as the number of neurons in the previous layer (100), and the number of outputs is the number of characters in our vocabulary (28), as we will be generating a probability distribution over the next character.


### Visual Representation of Final Layer

![image-2.png](attachment:image-2.png)

In [134]:
W2 = torch.randn((100, 28))
b2 = torch.randn(28)

In [136]:
logits = h @ W2 + b2
counts = logits.exp()  # refer to build_makemore_nn for fake counts understanding
probs = counts / counts.sum(dim=1, keepdim=True)

In [137]:
probs.shape

torch.Size([19, 28])

In [138]:
probs[0].sum()

tensor(1.)

In [139]:
# The actual characters that are corresponding to the 19 inputs 
Y

tensor([11, 26, 16, 21, 10, 19, 14, 16, 26,  0, 10, 13,  2, 14, 22,  9, 10, 13,
         0])

From the probs matrix, for each input row we would like to pluck out the probability that our model assigns to the actual next character in Y.

In [144]:
y_predicted = probs[torch.arange(num_entries), Y]
y_predicted

tensor([2.3900e-07, 3.3714e-08, 3.6176e-07, 2.4673e-05, 1.5985e-07, 1.4733e-08,
        2.1106e-13, 1.0034e-10, 1.3670e-15, 7.6250e-05, 4.4606e-01, 2.9026e-04,
        1.8209e-07, 1.3177e-05, 1.6349e-13, 2.2473e-12, 6.1388e-05, 1.3352e-08,
        9.7114e-01])
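The indexing `probs[torch.arange(num_entries), Y]` pairs row `i` with column `Y[i]`. A tiny made-up example makes this easier to see; `probs_demo` and `Y_demo` are invented values, not model outputs.

```python
import torch

probs_demo = torch.tensor([[0.10, 0.20, 0.70],
                           [0.50, 0.30, 0.20],
                           [0.25, 0.25, 0.50]])    # made-up rows, each sums to 1
Y_demo = torch.tensor([2, 0, 1])                   # made-up correct classes

picked = probs_demo[torch.arange(3), Y_demo]       # pairs (0, 2), (1, 0), (2, 1)
print(picked)                                      # tensor([0.7000, 0.5000, 0.2500])

# torch.gather expresses the same selection
same = probs_demo.gather(1, Y_demo.unsqueeze(1)).squeeze(1)
print(torch.equal(picked, same))
```

A well-trained model would put values close to 1 in `picked`; the very small numbers in the output above show that the untrained network is still far from that.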

### Flow till now

![image.png](attachment:image.png)

In [145]:
log_likelihood = y_predicted.log()
nll = -log_likelihood                   # negative log likelihood
loss = nll.mean()                       # loss
loss

tensor(16.1749)

Above is the loss which we want to minimize in order to make our model more accurate

### Arranging the code till now properly

In [160]:
# dataset
X.shape, Y.shape
num_entries = X.shape[0]        # total number of input sequences 
num_features = 2                           # number of features per char
block_size = 3                             # number of previous characters to look before predicting the next

In [None]:
g = torch.Generator().manual_seed(2147483647)                        # generator for reproducibility
C = torch.randn((28, num_features), generator=g)                        # creating the feature matrix
W1 = torch.randn((block_size * num_features, 100), generator=g)         # 100 neurons in hidden layer
b1 = torch.randn(100, generator=g)                                      # biases per neuron
W2 = torch.randn((100, 28), generator=g)
b2 = torch.randn(28, generator=g)

parameters = [C, W1, b1, W2, b2]

In [165]:
# Total number of parameters in our model
sum(p.nelement() for p in parameters)

3584

In [167]:
for p in parameters:
    print(p.nelement())

56
600
100
2800
28
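The per-tensor counts above follow directly from the layer shapes. The sketch below recomputes them with plain arithmetic; the dictionary `shapes` and the name `hidden` are introduced here only for illustration.

```python
import math

vocab_size, num_features, block_size, hidden = 28, 2, 3, 100

shapes = {
    "C":  (vocab_size, num_features),              # lookup table
    "W1": (block_size * num_features, hidden),     # hidden layer weights
    "b1": (hidden,),                               # hidden layer biases
    "W2": (hidden, vocab_size),                    # output layer weights
    "b2": (vocab_size,),                           # output layer biases
}

param_counts = {name: math.prod(shape) for name, shape in shapes.items()}

print(param_counts)                # {'C': 56, 'W1': 600, 'b1': 100, 'W2': 2800, 'b2': 28}
print(sum(param_counts.values()))  # 3584
```

Most of the parameters sit in `W2`, so changing the hidden layer size or the vocabulary size has the largest effect on the total.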


In [168]:
emb = C[X]

In [169]:
# Forward Pass
h = torch.tanh(emb.view(num_entries, block_size * num_features) @ W1 + b1)          # [19 x 100]
logits = h @ W2 + b2     # [19 x 28]
counts = logits.exp()
probs = counts / counts.sum(dim=1, keepdim=True)
y_predicted = probs[torch.arange(num_entries), Y]
log_likelihood = y_predicted.log()
nll = -log_likelihood
loss = nll.mean()
loss

tensor(19.9615)

### Cross Entropy

PyTorch's built-in cross-entropy function computes this loss more efficiently. The following lines of code:
```
counts = logits.exp()
probs = counts / counts.sum(dim=1, keepdim=True)
y_predicted = probs[torch.arange(num_entries), Y]
log_likelihood = y_predicted.log()
nll = -log_likelihood
loss = nll.mean()
loss
```

can be replaced by a single call: `F.cross_entropy(logits, Y)`
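As a quick sanity check, the manual computation and `F.cross_entropy` can be compared on random inputs. The sketch below uses stand-in tensors `logits_demo` and `Y_demo` with the same shapes as in this notebook; the seed is arbitrary.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)                    # arbitrary seed
logits_demo = torch.randn((19, 28), generator=g)        # stand-in logits
Y_demo = torch.randint(0, 28, (19,), generator=g)       # stand-in targets

# manual route: softmax, pick the correct class, negative mean log-probability
counts = logits_demo.exp()
probs = counts / counts.sum(dim=1, keepdim=True)
manual = -probs[torch.arange(19), Y_demo].log().mean()

# fused route
fused = F.cross_entropy(logits_demo, Y_demo)

print(manual.item(), fused.item())
print(torch.allclose(manual, fused, atol=1e-5))
```

Note that `F.cross_entropy` expects the raw logits, not probabilities, and averages over the batch by default.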

### Why use cross entropy
* PyTorch does not create separate intermediate tensors for `counts, probs, y_predicted, log_likelihood, nll, etc.`, which avoids a lot of memory overhead
* Instead, PyTorch groups the computation into fused kernels that evaluate the expressions very efficiently
* The backward pass is also more efficient: when the operations are grouped together, the gradient often simplifies algebraically (example: computing tanh might be complex, but its gradient is simply `1-(t**2)`, where `t` is the tanh output)
* It is numerically well behaved even when the logits are very large

The last point can be understood as below

In [176]:
tmp = torch.tensor([-5, -3, 0, 5])
counts = tmp.exp()
probs = counts / counts.sum()
probs

tensor([4.5079e-05, 3.3309e-04, 6.6903e-03, 9.9293e-01])

Adding or subtracting any offset to the tensor does not affect the probabilities (because we are normalizing the results to obtain the probabilities)

In [177]:
tmp = torch.tensor([-5, -3, 0, 5]) + 5
counts = tmp.exp()
probs = counts / counts.sum()
probs

tensor([4.5079e-05, 3.3309e-04, 6.6903e-03, 9.9293e-01])

The 2 results above are identical.

However, if we have some really large value, exp() will make it even larger and the result may overflow the floating-point range.

In [178]:
tmp = torch.tensor([-5, -3, 0, 500])
counts = tmp.exp()
counts

tensor([0.0067, 0.0498, 1.0000,    inf])

`inf` shows that the result overflowed: it is too large to be represented as a floating-point number.

So when we use **cross entropy**, PyTorch applies this offset trick internally (subtracting the largest logit) so that all the calculations stay within bounds.

In [182]:
tmp = torch.tensor([-5, -3, 0, 500]) - 500
counts = tmp.exp()
counts

tensor([0., 0., 0., 1.])
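Putting the pieces together, here is a minimal sketch of a softmax with and without the offset, compared against `F.softmax`. The variable names are illustrative, and the shift by the maximum mirrors the trick described above rather than PyTorch's actual source code.

```python
import torch
import torch.nn.functional as F

logits_demo = torch.tensor([-5.0, -3.0, 0.0, 500.0])

# naive softmax: exp(500) overflows to inf, and inf / inf gives nan
naive = logits_demo.exp() / logits_demo.exp().sum()

# shifted softmax: subtract the maximum so the largest exponent is exp(0) = 1
shifted = logits_demo - logits_demo.max()
stable = shifted.exp() / shifted.exp().sum()

print(naive)                            # last entry is nan
print(stable)                           # tensor([0., 0., 0., 1.])
print(F.softmax(logits_demo, dim=0))    # matches the shifted version
```

The shifted version loses nothing: as shown earlier, adding or subtracting a constant from all logits leaves the probabilities unchanged.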

From the above reasons, it is advisable to use **cross entropy**

In [214]:
X.shape, Y.shape
num_entries = X.shape[0]
num_features = 2
block_size = 3

g = torch.Generator().manual_seed(2147483647)
C = torch.randn((28, num_features), generator=g)
W1 = torch.randn((num_features * block_size, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 28), generator=g)
b2 = torch.randn(28, generator=g)

parameters = [C, W1, b1, W2, b2]

In [215]:
sum(p.nelement() for p in parameters)

3584

In [216]:
for p in parameters:
    print(p.nelement())

56
600
100
2800
28


In [None]:
for p in parameters:
    p.requires_grad = True

In [322]:
for _ in range(10):
    emb = C[X]              # C is part of the parameters which would be changed

    # Forward Pass
    h = torch.tanh(emb.view(num_entries, block_size * num_features) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y)
    print(f'Loss in iteration {_} = {loss}')

    # Backward Pass

    # reset gradients
    for p in parameters:
        p.grad = None

    loss.backward()

    # Update the parameters: step against the gradient
    learning_rate = 0.5

    for p in parameters:
        p.data -= learning_rate * p.grad

Loss in iteration 0 = 0.09276076406240463
Loss in iteration 1 = 0.0927477553486824
Loss in iteration 2 = 0.09273017197847366
Loss in iteration 3 = 0.09271739423274994
Loss in iteration 4 = 0.0926998183131218
Loss in iteration 5 = 0.0926872193813324
Loss in iteration 6 = 0.09266962856054306
Loss in iteration 7 = 0.0926567018032074
Loss in iteration 8 = 0.09263887256383896
Loss in iteration 9 = 0.09262654185295105


The reason we can reach such a low loss in a small number of iterations is that we have many parameters (3584) but very few examples (19 input sequences), so it is very easy for training to find weights with which the model **overfits** these few examples. (The loss printed for iteration 0 is already small, which suggests the cell had been run several times before this output was captured.)

So let's train the model on the entire dataset.

In [323]:
len(names)

64128

In [324]:
stoi = { s: i+1 for i, s in enumerate(sorted(list(set(''.join(names)))))}
stoi['.'] = 0

itos = {i: s for s, i in stoi.items()}
print(stoi)
print(itos)

{'-': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 's': 20, 't': 21, 'u': 22, 'v': 23, 'w': 24, 'x': 25, 'y': 26, 'z': 27, '.': 0}
{1: '-', 2: 'a', 3: 'b', 4: 'c', 5: 'd', 6: 'e', 7: 'f', 8: 'g', 9: 'h', 10: 'i', 11: 'j', 12: 'k', 13: 'l', 14: 'm', 15: 'n', 16: 'o', 17: 'p', 18: 'q', 19: 'r', 20: 's', 21: 't', 22: 'u', 23: 'v', 24: 'w', 25: 'x', 26: 'y', 27: 'z', 0: '.'}


In [346]:
# preparing the data
X, Y = [], []
block_size = 3

for name in names:
    context = [0] * block_size
    for ch in name + '.':
        index = stoi[ch]
        X.append(context)
        Y.append(index)
        context = context[1:] + [index]
        # print(''.join(itos[i] for i in context), '--->', itos[index])

X = torch.tensor(X)
Y = torch.tensor(Y)

In [347]:
X.shape, Y.shape

(torch.Size([573325, 3]), torch.Size([573325]))

In [348]:
# preparing the parameters

num_entries = X.shape[0]
num_features = 2
g = torch.Generator().manual_seed(2147483647)

C = torch.randn((28, 2), generator=g)
W1 = torch.randn((block_size * num_features, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 28), generator=g)
b2 = torch.randn(28, generator=g)

parameters = [C, W1, b1, W2, b2]

In [349]:
for p in parameters:
    p.requires_grad = True

In [350]:
# training loop on the full dataset: forward pass, backward pass, parameter update
for _ in range(10):
    emb = C[X]
    h = torch.tanh(emb.view(num_entries, block_size * num_features) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y)
    print(f'{loss}')

    # backward pass
    for p in parameters:
        p.grad = None
    
    loss.backward()

    learning_rate = 0.5
    for p in parameters:
        p.data -= learning_rate * p.grad


18.83281135559082
13.503205299377441
10.179420471191406
8.542027473449707
7.489169120788574
6.796303749084473
6.397121906280518
6.362353801727295
6.200438976287842
5.138263702392578
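For reference, below is a compact sketch of the same training loop written with `torch.no_grad()` for the update step instead of `.data`. It runs on small synthetic data so that it is self-contained: `X_demo`, `Y_demo`, the seed and the learning rate of 0.1 are assumptions made for this example, not values from the notebook.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)                    # arbitrary seed
X_demo = torch.randint(0, 28, (32, 3), generator=g)     # synthetic contexts
Y_demo = torch.randint(0, 28, (32,), generator=g)       # synthetic targets

# same architecture as above: 2-D embeddings, 100 hidden neurons, 28 outputs
C = torch.randn((28, 2), generator=g)
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 28), generator=g)
b2 = torch.randn(28, generator=g)
params = [C, W1, b1, W2, b2]
for p in params:
    p.requires_grad = True

learning_rate = 0.1        # assumed value, not tuned
losses = []
for step in range(10):
    # forward pass
    emb = C[X_demo]
    h = torch.tanh(emb.view(emb.shape[0], -1) @ W1 + b1)
    loss = F.cross_entropy(h @ W2 + b2, Y_demo)
    losses.append(loss.item())

    # backward pass
    for p in params:
        p.grad = None
    loss.backward()

    # update, without recording the operation in the autograd graph
    with torch.no_grad():
        for p in params:
            p -= learning_rate * p.grad

print(losses)
```

Using `emb.view(emb.shape[0], -1)` lets PyTorch infer the second dimension, so the same line works for any batch size, context length or embedding size.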
