# Overview

In the [Single character neural net](https://www.kaggle.com/code/aisuko/single-character-nn-prediction-with-pytorch) notebook, we built a character-level model (one layer) that predicts the next character from a lookup table of counts (we normalize the counts into probabilities).

If we take more context into account when predicting the next character in a sequence, things quickly blow up. With 27 characters, a single character of context gives only 27 possible contexts. However, if we take the two previous characters and try to predict the third one, the number of possible contexts grows to $27*27=729$. The table of counts grows exponentially with the context length, so this approach explodes and doesn't work very well.
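To make the growth concrete, here is a small sketch in plain Python (no PyTorch needed). The names `vocab_size` and `context_counts` are only illustrative; the sketch simply evaluates $27^k$ for a few context lengths $k$.

```python
# Number of distinct contexts for a 27-character vocabulary
# as the context length grows.
vocab_size = 27

# context_counts[k] = number of possible contexts of length k
context_counts = {k: vocab_size ** k for k in range(1, 5)}

for k, n in context_counts.items():
    # a count table needs one row per context and 27 columns (next characters)
    print(f"context length {k}: {n} contexts, {n * vocab_size} table entries")
```

Each extra character of context multiplies the number of rows by 27, while the amount of training data stays the same, so most rows would hold very few counts.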

So, in this notebook, we will implement an MLP based on [Bengio et al. 2003](https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqazJyRmVlM0lsU2pxUmcyVXZTWkk2QnNFSFEyUXxBQ3Jtc0tseS1KWVA3SHBvWWh6cV9TcDFPVC05NkVTVFlWS2pmNFB5RUVvWnJOcjJHbGcteVB3UDNXeU1sWm9FQlZPS3ZLakFUV1FKb1ZqQmNjS3Z0ZzdJNWpTU21wZWJ6RE1pTjhWQjFjanpETm1ock52c2JVVQ&q=https%3A%2F%2Fwww.jmlr.org%2Fpapers%2Fvolume3%2Fbengio03a%2Fbengio03a.pdf&v=TCH_1BHY58I). The paper addresses the problem above with an MLP-based language model over a vocabulary of 17,000 words. Each word is associated with a 30-dimensional feature vector, which means every word is embedded into a 30-dimensional space: **17,000 vectors in a 30-dimensional space.**

The embeddings of these words are initialized completely at random, so the words start out spread randomly through the space. We then tune the embeddings using backpropagation: during training, these points (vectors) move around in the space, and words that have similar meanings, or that are synonyms of each other, may end up in a very similar part of the space.

The embeddings are the input to the neural net, so during training the information they carry flows into the rest of the network.


# Preprocess the data

In [1]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

words=open('/kaggle/input/character-lm-without-framework/names.txt', 'r').read().splitlines()

print(words[:8])
print(len(words))

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']
32033


In [2]:
# build the vocabulary of characters and mappings to/from integers
chars=sorted(list(set(''.join(words))))
stoi={s:i+1 for i,s in enumerate(chars)}
stoi['.']=0
itos={i:s for s,i in stoi.items()}
print(itos)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


# Build the dataset

In [3]:
# context length: how many characters do we take to predict the next one?
block_size=3
# X: the inputs to the neural net, Y: the label for each example in X
X,Y=[],[]
for w in words[:5]: # first 5 words
    print(w)
    context=[0]*block_size # start with a padded context of zero tokens
    for ch in w+'.': # append '.' to mark the end of the word
        ix=stoi[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(itos[i] for i in context), '--->', itos[ix])
        context=context[1:] +[ix] # crop and append
X=torch.tensor(X)
Y=torch.tensor(Y)

emma
... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
olivia
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
ava
... ---> a
..a ---> v
.av ---> a
ava ---> .
isabella
... ---> i
..i ---> s
.is ---> a
isa ---> b
sab ---> e
abe ---> l
bel ---> l
ell ---> a
lla ---> .
sophia
... ---> s
..s ---> o
.so ---> p
sop ---> h
oph ---> i
phi ---> a
hia ---> .


In [4]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

From the first 5 words we have created a dataset of 32 examples. Each input to the neural network is 3 integers, so `X` has shape `torch.Size([32, 3])`. The labels `Y` are 32 integers, so `Y` has shape `torch.Size([32])`.


# Define the lookup table (embeddings)

We will embed all the characters using a lookup table.

**Let's embed all the characters (27 in our case) into a low-dimensional (2) space.** In the paper, they embed 17,000 words into a 30-dimensional space.

In [5]:
# each of the 27 characters gets a 2-dimensional embedding
C=torch.randn((27,2)) # randomly initialized tensor with 27 rows and 2 columns
C.shape

torch.Size([27, 2])

## Example to demonstrate embeddings

Here we will embed the integer 5. A one-hot encoding sets every element to zero except the one at the index of the target class.

There are two equivalent ways to do this: index the lookup table directly, or multiply the one-hot vector by the lookup table.

In [6]:
# way 1: index the lookup table directly with the integer 5, C[5]
# way 2: multiply the one-hot vector for 5 by the whole table C
# note: one_hot returns integers, so we convert them to float before the matmul
torch.equal(C[5],(F.one_hot(torch.tensor(5), num_classes=27).float() @ C))

True
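The equality holds because a one-hot vector has a single 1, so the matrix product keeps exactly one row of the table and zeroes out the rest. The following plain-Python sketch shows this on a tiny made-up table (4 rows instead of 27); `table`, `one_hot`, and `vec_matmul` are illustrative helpers, not PyTorch functions.

```python
# A tiny lookup table: 4 symbols, 2-dimensional embeddings (made-up values).
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

def one_hot(index, num_classes):
    # vector of zeros with a single 1.0 at position `index`
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def vec_matmul(vec, matrix):
    # (num_classes,) @ (num_classes, dim) -> (dim,)
    dim = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(dim)]

picked_by_index = table[2]
picked_by_one_hot = vec_matmul(one_hot(2, len(table)), table)
print(picked_by_index, picked_by_one_hot)
```

Indexing is the cheaper of the two, since it skips the multiplications by zero, which is why the rest of the notebook uses `C[X]`.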

## Example of tensor indexing

In [7]:
# We embed all of X at once
# for every one of the 32 by 3 integers we retrieve a 2-dimensional embedding vector
C[X].shape

torch.Size([32, 3, 2])

In [8]:
# the integer at example 13, position 2 is 1
X[13,2]

tensor(1)

In [9]:
torch.equal(C[X][13,2], C[1])

True

In [10]:
emb=C[X]
emb.shape

torch.Size([32, 3, 2])

# Construct the hidden layer

We need to solve a shape mismatch before we can compute `emb@W1+b1`: `emb` has shape `[32, 3, 2]`, but `W1` expects 6 input features per example, so the multiplication fails:

```bash
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[14], line 1
----> 1 emb@W1+b1

RuntimeError: mat1 and mat2 shapes cannot be multiplied (96x2 and 6x100)
```

In [11]:
# we have 2-dimensional embeddings and we have 3 of them
# so the number of inputs to this layer should be 6

# and the number of neurons in this layer is 100 (it's up to us)
W1=torch.randn((6,100))
b1=torch.randn(100) # initialized randomly

# How to reshape the input from [32,3,2] to [32,6] to fit the hidden layer?

There are usually many ways of implementing this. We want to retrieve the three embeddings of each example and concatenate them.

* torch.cat
* torch.cat(torch.unbind) (more flexible, but still inefficient)
* torch.Tensor.view (the efficient way, see the reason below)

### torch.cat

In [12]:
# using torch.cat(concatenates) https://pytorch.org/docs/stable/generated/torch.cat.html#torch.cat

torch.cat([emb[:,0,:], emb[:,1,:], emb[:,2,:]],1).shape

torch.Size([32, 6])

### torch.cat(torch.unbind())

In [13]:
# However, we want to be more flexible. If block_size changes, the code above won't work.
# So here we use torch.unbind to split emb along dimension 1 into a tuple of tensors.

# It's less efficient, because the concatenation creates a whole new tensor with
# new storage, so new memory is allocated; there's no way to concatenate tensors just
# by manipulating the view attributes.

torch.cat(torch.unbind(emb,1),1).shape

torch.Size([32, 6])

### torch.Tensor.view

In [14]:
# let's use a more efficient way to do this
a=torch.arange(18)
a

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])

In [15]:
a.shape

torch.Size([18])

In [16]:
# view a as tensors of different sizes and dimensions, for example:
print(a.view(2,9))
print(a.view(9,2))
print(a.view(3,3,2))

tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
        [ 9, 10, 11, 12, 13, 14, 15, 16, 17]])
tensor([[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15],
        [16, 17]])
tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])


We can see that as long as the sizes multiply to the same total number of elements, it just works, and in PyTorch this `view` operation is extremely efficient.

**The reason is that each tensor has an underlying storage, which always holds the numbers as a one-dimensional vector. This is how the tensor is represented in memory.**

In [17]:
print(a.storage())

 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
[torch.storage.TypedStorage(dtype=torch.int64, device=cpu) of size 18]




The storage is always a one-dimensional vector. When we call `view`, we only manipulate some attributes of the tensor that dictate how this one-dimensional sequence is interpreted as an n-dimensional tensor. No memory is changed, copied, moved, or created: the storage is identical. What changes are internal attributes of the tensor, namely the storage offset, the strides, and the shape, so that the same one-dimensional sequence of bytes is seen as a different n-dimensional array.
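As a rough illustration of how a shape and strides turn a flat storage into an n-dimensional array, here is a plain-Python sketch. It assumes a contiguous row-major layout, and `contiguous_strides` and `read` are hypothetical helpers written for this example, not PyTorch functions.

```python
storage = list(range(18))  # flat storage, same numbers as torch.arange(18)

def contiguous_strides(shape):
    # stride of a dimension = how many storage elements to skip
    # when the index along that dimension increases by one
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))

def read(storage, shape, index, offset=0):
    # translate an n-dimensional index into a position in the flat storage
    strides = contiguous_strides(shape)
    flat = offset + sum(i * s for i, s in zip(index, strides))
    return storage[flat]

print(contiguous_strides((2, 9)))           # (9, 1)
print(contiguous_strides((3, 3, 2)))        # (6, 2, 1)
print(read(storage, (3, 3, 2), (1, 2, 0)))  # 10, matches a.view(3,3,2)[1,2,0]
```

The `storage` list is never modified or copied; only the shape passed to `read` differs, which is the idea behind a cheap `view`.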

In [18]:
# we can simply ask pytorch to view this instead as a [32, 6] array.
emb.view(32,6)==torch.cat(torch.unbind(emb,1),1)

tensor([[True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, True, True],
        [True, True, True, True, T

In [19]:
h=emb.view(32,6)@W1+b1
print(h.shape)
print(h)

torch.Size([32, 100])
tensor([[-3.2553, -3.3835, -3.5769,  ...,  4.3531,  3.0629,  3.6948],
        [-3.2326, -2.6479, -2.9989,  ...,  3.4741,  2.3047,  3.5943],
        [ 0.4374, -2.8839, -7.1125,  ...,  1.7678,  1.5556,  0.1276],
        ...,
        [-2.9334, -0.6889,  0.3926,  ...,  0.6780,  2.1971,  3.3015],
        [ 0.4755, -0.7217, -4.5915,  ...,  2.6427,  1.2158,  0.8756],
        [ 1.0306, -3.3673,  0.6775,  ...,  3.4702, -0.2258,  0.2063]])


In [20]:
# we shouldn't hard-code the number of examples, so we can use
# emb.shape[0] or -1

# for -1, pytorch infers what this dimension should be, because the number of elements must stay the same,
# so pytorch will derive that it must be 32

h=emb.view(-1,6)@W1+b1
print(h.shape)
print(h)

torch.Size([32, 100])
tensor([[-3.2553, -3.3835, -3.5769,  ...,  4.3531,  3.0629,  3.6948],
        [-3.2326, -2.6479, -2.9989,  ...,  3.4741,  2.3047,  3.5943],
        [ 0.4374, -2.8839, -7.1125,  ...,  1.7678,  1.5556,  0.1276],
        ...,
        [-2.9334, -0.6889,  0.3926,  ...,  0.6780,  2.1971,  3.3015],
        [ 0.4755, -0.7217, -4.5915,  ...,  2.6427,  1.2158,  0.8756],
        [ 1.0306, -3.3673,  0.6775,  ...,  3.4702, -0.2258,  0.2063]])
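The inference behind `-1` is simple arithmetic: divide the total number of elements by the product of the known sizes. The sketch below mimics that rule in plain Python; `infer_view_shape` is a hypothetical helper for illustration and is not how PyTorch implements it internally.

```python
def infer_view_shape(num_elements, shape):
    """Replace a single -1 in `shape` so that the sizes multiply to num_elements."""
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be inferred")
    known = 1
    for size in shape:
        if size != -1:
            known *= size
    if -1 not in shape:
        if known != num_elements:
            raise ValueError("shape does not match the number of elements")
        return tuple(shape)
    if known == 0 or num_elements % known != 0:
        raise ValueError("cannot infer the missing dimension")
    return tuple(num_elements // known if size == -1 else size for size in shape)

num_elements = 32 * 3 * 2  # elements in emb, which has shape [32, 3, 2]
print(infer_view_shape(num_elements, (-1, 6)))  # (32, 6)
```

With the full dataset instead of 5 words, the same `(-1, 6)` would resolve to a different first dimension without any code change.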


# Using `tanh` 

`tanh` squashes every element of the hidden layer into the range between -1 and 1.

In [21]:
h=torch.tanh(emb.view(-1,6)@W1+b1)
h

tensor([[-0.9970, -0.9977, -0.9984,  ...,  0.9997,  0.9956,  0.9988],
        [-0.9969, -0.9900, -0.9950,  ...,  0.9981,  0.9803,  0.9985],
        [ 0.4115, -0.9938, -1.0000,  ...,  0.9434,  0.9147,  0.1269],
        ...,
        [-0.9944, -0.5973,  0.3736,  ...,  0.5902,  0.9756,  0.9973],
        [ 0.4426, -0.6180, -0.9998,  ...,  0.9899,  0.8384,  0.7042],
        [ 0.7742, -0.9976,  0.5899,  ...,  0.9981, -0.2220,  0.2034]])

In [22]:
# for each of the 32 examples, we have 100 hidden activations
h.shape

torch.Size([32, 100])

# Final layer

In [23]:
# the input is 100, and the output should be 27, because there are 27 possible next characters
W2=torch.randn(100, 27)
b2=torch.randn(27)

In [24]:
logits=h@W2+b2
logits.shape

torch.Size([32, 27])

# Softmax

In [25]:
# exp
counts=logits.exp()

In [26]:
# normalization
prob=counts/counts.sum(1, keepdims=True)
print(prob[0].sum())

tensor(1.)


# Mapping the probs to the label value

In [27]:
# label
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0])

In [28]:
# torch.arange(32) gives the row indices 0..31
# index prob so that row i selects the probability assigned to the correct label Y[i]
prob[torch.arange(32), Y]

tensor([1.5588e-06, 5.2970e-11, 8.3058e-08, 6.8039e-13, 3.5345e-09, 2.6521e-08,
        5.5430e-05, 6.5675e-09, 7.4476e-09, 1.9069e-02, 1.7083e-11, 3.2856e-10,
        4.1587e-04, 2.6186e-05, 1.1847e-09, 1.1951e-04, 3.2309e-06, 1.4354e-07,
        6.1658e-06, 2.2425e-05, 7.3623e-11, 5.2226e-11, 4.5429e-02, 3.4487e-02,
        1.7952e-06, 1.8844e-09, 1.0459e-06, 7.8410e-09, 1.3964e-05, 9.4942e-12,
        5.3321e-03, 9.9990e-01])
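Indexing with a pair of index lists picks one element per row: row `i` and column `Y[i]`. A plain-Python equivalent on a tiny made-up example (3 rows and 4 classes instead of 32 and 27) may make that pairing easier to see; the values in `prob` and `Y` below are invented for illustration.

```python
# 3 examples, 4 classes; each row is a made-up probability distribution
prob = [[0.1, 0.2, 0.3, 0.4],
        [0.7, 0.1, 0.1, 0.1],
        [0.25, 0.25, 0.25, 0.25]]
Y = [3, 0, 2]  # correct class for each example

# same pairing as prob[torch.arange(3), Y]: (row 0, col 3), (row 1, col 0), (row 2, col 2)
picked = [prob[i][y] for i, y in zip(range(len(Y)), Y)]
print(picked)  # [0.4, 0.7, 0.25]
```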

# Create negative log likelihood loss

We need to minimize it so that the neural network assigns a higher probability to the correct next character.

In [29]:
loss=-prob[torch.arange(32),Y].log().mean()
loss

tensor(14.8825)
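To get a feeling for the scale of this number, it helps to compare it with two reference points. The sketch below, in plain Python with a hypothetical helper `negative_log_likelihood`, computes the loss of a model that guesses uniformly over the 27 characters and of a model that gives 0.99 to every correct label.

```python
import math

def negative_log_likelihood(probs_of_correct_class):
    # mean of -log(p) over the probabilities given to the correct labels
    n = len(probs_of_correct_class)
    return -sum(math.log(p) for p in probs_of_correct_class) / n

uniform_loss = negative_log_likelihood([1 / 27] * 32)   # equals log(27)
confident_loss = negative_log_likelihood([0.99] * 32)   # close to zero
print(uniform_loss, confident_loss)
```

Uniform guessing gives a loss of $\log 27 \approx 3.30$, so the loss of 14.8825 above means the randomly initialized network is confidently wrong on many examples, which is expected before any training.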

# Let's make the whole process above respectable

In [30]:
# the dataset we defined above
X.shape, Y.shape

# we use torch.Generator for reproducibility
g=torch.Generator().manual_seed(2147483647)

# Lookup table, the input layer
C=torch.randn((27,2), generator=g)

W1=torch.randn((6,100), generator=g)
b1=torch.randn(100, generator=g)
W2=torch.randn((100, 27), generator=g)
b2=torch.randn(27, generator=g)
parameters=[C, W1, b1, W2, b2]

In [31]:
# number of parameters in total
sum(p.nelement() for p in parameters) 

3481

In [32]:
# embedding in the input layer
emb=C[X] # (32, 3, 2)

# hidden layer
h=torch.tanh(emb.view(-1, 6) @ W1+b1) # (32, 100)

# final (output) layer
logits=h@W2+b2 # (32, 27)

# these two steps combined are called softmax
# exp()
counts=logits.exp()
# normalization
prob=counts/counts.sum(1, keepdims=True)

# calculate negative log likelihood loss
loss=-prob[torch.arange(32), Y].log().mean()

# it is expressing how well this neural network works with the current setting of parameters 
loss

tensor(17.7697)

## Make the raw softmax part more efficient

The softmax function is used in many multiclass classification methods, so PyTorch provides a more efficient way to compute this loss: `F.cross_entropy(logits, Y)`.

The reason we use `cross_entropy` rather than the raw softmax is that the manual implementation creates many new intermediate tensors in memory during the calculation, which is very inefficient.

1. `cross_entropy` clusters up all the operations and very often uses **fused kernels** that evaluate such clustered mathematical expressions very efficiently. The forward pass can be more efficient.

2. **The backward pass can be more efficient.** Because the expression is fused, its gradient can be simplified analytically, which gives a much simpler backward pass to implement.

3. It is numerically well behaved (see the example below). `cross_entropy` finds the maximum value that occurs in the logits and subtracts it before exponentiating, so `exp` cannot overflow.

In [33]:
# for example, logits with a very negative and a very positive number
logits_example=torch.tensor([-100, -3, 0, 100])
counts_example=logits_example.exp()
probs_example=counts_example/counts_example.sum()
probs_example

tensor([0., 0., 0., nan])

In [34]:
counts_example

tensor([3.7835e-44, 4.9787e-02, 1.0000e+00,        inf])

In [35]:
F.cross_entropy(logits, Y)

tensor(17.7697)
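The max-subtraction trick mentioned in point 3 can be sketched in plain Python. This is an illustration of the idea, not PyTorch's actual implementation, and `naive_softmax` and `stable_softmax` are hypothetical helpers. One difference to keep in mind: Python floats are 64-bit, so `exp(100)` does not overflow here; the sketch uses a logit of 1000 instead, and Python raises `OverflowError` where float32 tensors would produce `inf`.

```python
import math

def naive_softmax(logits):
    counts = [math.exp(x) for x in logits]  # overflows for large logits
    total = sum(counts)
    return [c / total for c in counts]

def stable_softmax(logits):
    # subtracting the max does not change the result, because the common
    # factor exp(-max) cancels between numerator and denominator
    m = max(logits)
    counts = [math.exp(x - m) for x in logits]  # largest exponent is 0
    total = sum(counts)
    return [c / total for c in counts]

logits_example = [-100.0, -3.0, 0.0, 1000.0]

try:
    naive_softmax(logits_example)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = stable_softmax(logits_example)
print(naive_overflowed, probs)
```

For moderate logits both functions return the same probabilities up to rounding; the stable version just stays finite when a logit becomes very large.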

# Re-structure the whole code and add the training loop 

Here, we split the code into the `forward pass`, the `backward pass`, and the parameter update.

In [36]:
for p in parameters:
    p.requires_grad = True

for _ in range(1000):
    # forward pass
    emb=C[X] # (32, 3, 2) 32 examples
    h=torch.tanh(emb.view(-1, 6)@W1+b1) # (32, 100)
    logits=h@W2+b2 # (32, 27)
    loss=F.cross_entropy(logits, Y)
    # backward pass
    for p in parameters:
        p.grad=None
    loss.backward()
    # update
    for p in parameters:
        p.data+=-0.1*p.grad
        
# We use 3481 parameters to fit only 32 examples, so the fit is very good.
# This is called overfitting a single batch of data: we get a very low loss and good predictions.
print(loss.item())

0.2561509907245636
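The update line `p.data+=-0.1*p.grad` is plain gradient descent with a learning rate of 0.1. As a minimal sketch of the same forward, backward, and update cycle, the plain-Python loop below replaces the network with a made-up one-parameter loss `(w - 3)**2`, whose gradient is written by hand instead of being computed by `loss.backward()`. The names `w` and `learning_rate` and the loss itself are assumptions for illustration only.

```python
w = 0.0               # the single "parameter"
learning_rate = 0.1   # same step size as in the loop above

for _ in range(1000):
    loss = (w - 3.0) ** 2        # forward pass
    grad = 2.0 * (w - 3.0)       # backward pass: d(loss)/dw, derived by hand
    w += -learning_rate * grad   # update: step against the gradient

print(w, loss)
```

Here the loss can reach zero because a single value of `w` fits the target exactly; the network above is in a different situation, as the next section explains.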


## We're not able to achieve exactly zero loss

The reason is that the same input context maps to several different labels in the dataset, so the model cannot assign probability 1 to all of them:

In [37]:
# get the max along dimension 1 (over the 27 classes)

# values: the maximum logit in each row
# indices: the position of that maximum, i.e. the predicted character index

# we can see the indices are very close to the labels, but some aren't. This is because
# the same input maps to different outputs in the dataset, like:

# ... ---> e
# ... ---> o
# ... ---> a

# So, it means e, o and a are all possible outcomes in a training set for the exact same input. 

logits.max(1)

torch.return_types.max(
values=tensor([13.3437, 17.7879, 20.5832, 20.6042, 16.7390, 13.3437, 15.9747, 14.1889,
        15.9158, 18.3894, 15.9409, 20.9284, 13.3437, 17.1212, 17.1498, 20.0637,
        13.3437, 16.4564, 15.1328, 17.0537, 18.5905, 15.9655, 10.8739, 10.6874,
        15.5062, 13.3437, 16.2394, 16.9563, 12.7426, 16.2141, 19.0840, 16.0213],
       grad_fn=<MaxBackward0>),
indices=tensor([ 9, 13, 13,  1,  0,  9, 12,  9, 22,  9,  1,  0,  9, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0,  9, 15, 16,  8,  9,  1,  0]))

In [38]:
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0])
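We can estimate how low the loss could go on these 32 examples. For each context, the best a model can do is predict the empirical distribution of the characters that follow it; any context with more than one distinct next character therefore contributes a nonzero loss. The sketch below computes this lower bound in plain Python for the same first 5 names. The variable names are illustrative, and the bound assumes the model sees only the 3-character context.

```python
import math
from collections import Counter, defaultdict

words = ['emma', 'olivia', 'ava', 'isabella', 'sophia']  # first 5 names
block_size = 3

# Count how often each next character follows each context.
next_counts = defaultdict(Counter)
for w in words:
    context = '.' * block_size
    for ch in w + '.':
        next_counts[context][ch] += 1
        context = context[1:] + ch  # crop and append

# Lower bound: predict, for every context, the observed frequencies.
total_examples = sum(sum(c.values()) for c in next_counts.values())
min_loss = 0.0
for counter in next_counts.values():
    n = sum(counter.values())
    for count in counter.values():
        min_loss += -count * math.log(count / n)
min_loss /= total_examples

# contexts followed by more than one distinct character
ambiguous = {ctx: dict(c) for ctx, c in next_counts.items() if len(c) > 1}
print(total_examples, ambiguous, min_loss)
```

Only the context `...` is ambiguous: it is followed once each by `e`, `o`, `a`, `i`, and `s`, so the bound is $5\log 5 / 32 \approx 0.2515$. The training loss of about 0.2562 reached above is already close to this floor.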

# Acknowledgements

* https://www.youtube.com/watch?v=TCH_1BHY58I
* https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
* https://pytorch.org/docs/stable/generated/torch.equal.html
* https://pytorch.org/docs/stable/generated/torch.cat.html#torch.cat
* https://en.wikipedia.org/wiki/Softmax_function