### LLM Fundamentals

In this notebook we will go through the fundamentals of the **LLM**.

The steps are as follows:
- Load the data
- Encode the data
- Converting our text to tensors
- Train/test split
- Staring with Bigram model
- Creating a self-attention unit
- Staring with the Head class

In [33]:
## first the imports
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
import matplotlib.pyplot as plt
from pathlib import Path

### Loading the data

In [8]:
DATA_PATH = '../data/'
PATH = Path(DATA_PATH)
FILE_NAME = 'wizardOfOz.txt'
FILE_PATH = PATH / FILE_NAME

In [12]:
## opening the file
with open(FILE_PATH, 'r', encoding = 'utf-8') as f:
    text = f.read()
## checking the first 100 char
print(text[:100])

﻿ Dorothy and the Wizard in Oz


  A Faithful Record of Their Amazing Adventures
    in an Undergrou


### Encoding the data

In [24]:
## first we want to create a set of char
chars = sorted(set(text))
## checking how many distinct char are in the text
vocab_size = len(chars)
print(vocab_size)
## and then assign a number to each char
int_to_string = {i:l for i, l in enumerate(chars)}
string_to_int = {l:i for i, l in enumerate(chars)}
## and then move on to creating our encoding and decoding functions
encode = lambda l:[string_to_int[x] for x in l]
decode = lambda i:''.join([int_to_string[x] for x in i])

76


In [18]:
## we can test our encoder and decoder now
print(encode('Dorothy'))
print(decode(encode('Dorothy')))

[27, 63, 66, 63, 68, 56, 73]
Dorothy


### Converting our text to tensors

In [25]:
text_tensor = torch.tensor(encode(text), dtype=torch.long)
print(text_tensor.size(), type(text_tensor))
text_tensor[:100]

torch.Size([230550]) <class 'torch.Tensor'>


tensor([75,  1, 27, 63, 66, 63, 68, 56, 73,  1, 49, 62, 52,  1, 68, 56, 53,  1,
        46, 57, 74, 49, 66, 52,  1, 57, 62,  1, 38, 74,  0,  0,  0,  1,  1, 24,
         1, 29, 49, 57, 68, 56, 54, 69, 60,  1, 41, 53, 51, 63, 66, 52,  1, 63,
        54,  1, 43, 56, 53, 57, 66,  1, 24, 61, 49, 74, 57, 62, 55,  1, 24, 52,
        70, 53, 62, 68, 69, 66, 53, 67,  0,  1,  1,  1,  1, 57, 62,  1, 49, 62,
         1, 44, 62, 52, 53, 66, 55, 66, 63, 69])

In [26]:
## we're simply splitting the data into train and test sets
train_data, test_data = np.split(text_tensor, [int(.8*len(text_tensor))])
train_data.shape, test_data.shape

(torch.Size([184440]), torch.Size([46110]))

In [28]:
## next we have to define a block size for our model
block_size = 8
## this means, the model will look at 8 sequences
## at each round of training
x = train_data[:block_size]
y = train_data[1:block_size+1]
for i in range(block_size):
    context = x[:i+1]
    target = y[i]
    print(f'When the context is {context} the target will be {target}')

When the context is tensor([75]) the target will be 1
When the context is tensor([75,  1]) the target will be 27
When the context is tensor([75,  1, 27]) the target will be 63
When the context is tensor([75,  1, 27, 63]) the target will be 66
When the context is tensor([75,  1, 27, 63, 66]) the target will be 63
When the context is tensor([75,  1, 27, 63, 66, 63]) the target will be 68
When the context is tensor([75,  1, 27, 63, 66, 63, 68]) the target will be 56
When the context is tensor([75,  1, 27, 63, 66, 63, 68, 56]) the target will be 73


The reason we're looping through the `block_size` range, is to have our model get used to seeing anything from `1` to the `block_size` length of characters.

In [42]:
## we also need to break our data into batches for faster computations
block_size = 8
batch_size = 4

def get_batch(split):
    data = train_data if split == 'train' else test_data
    random_inx = torch.randint(high=len(data)-block_size, size = (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in random_inx])
    y = torch.stack([data[i+1:i+block_size+1] for i in random_inx])
    return x, y

X_train, y_train = get_batch('train')
X_test, y_test = get_batch('test')
print(X_train.shape, y_train.shape)

torch.Size([4, 8]) torch.Size([4, 8])


In [43]:
## and we can loop through the batches in our set
for b in range(batch_size):
    for i in range(block_size):
        context = X_train[b,:i+1]
        target = y_train[b, i]
        print(f'Batch {b}: When context is {context} the target is {target}')

Batch 0: When context is tensor([49]) the target is 68
Batch 0: When context is tensor([49, 68]) the target is 0
Batch 0: When context is tensor([49, 68,  0]) the target is 56
Batch 0: When context is tensor([49, 68,  0, 56]) the target is 57
Batch 0: When context is tensor([49, 68,  0, 56, 57]) the target is 67
Batch 0: When context is tensor([49, 68,  0, 56, 57, 67]) the target is 1
Batch 0: When context is tensor([49, 68,  0, 56, 57, 67,  1]) the target is 54
Batch 0: When context is tensor([49, 68,  0, 56, 57, 67,  1, 54]) the target is 53
Batch 1: When context is tensor([53]) the target is 1
Batch 1: When context is tensor([53,  1]) the target is 67
Batch 1: When context is tensor([53,  1, 67]) the target is 69
Batch 1: When context is tensor([53,  1, 67, 69]) the target is 52
Batch 1: When context is tensor([53,  1, 67, 69, 52]) the target is 52
Batch 1: When context is tensor([53,  1, 67, 69, 52, 52]) the target is 53
Batch 1: When context is tensor([53,  1, 67, 69, 52, 52, 53])

### The Bigram Model

In [63]:
## we will be inheriting from the nn.Module
## and then use the embedding from nn to build our class
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        ## we're basically creating a wrapper
        ## around a tensor of vocab_size x vocab_size
        ## and each index that's passed to the model
        ## will go and take out it's row from that table 
        self.embedding_table = nn.Embedding(num_embeddings=vocab_size,
                                           embedding_dim=vocab_size)
    def forward(self, x, target=None):
        ## pytorch will re-arrange it into (Batch, Time, Channel) tensor
        ## where batch is the batch_size, time is the block_size
        ## and channel is the vocab_size
        ## so in our case will be (4, 8, 76)
        logits = self.embedding_table(x)
        ## we also need the loss
        ## which we'll be using the -log(likelihood)
        ## the issue with the functional cross entropy
        ## is that it needs the inputs to be in (Batch*Time, Channel)
        ## so we have to change the shape of our logits and targets
        if target is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            target = target.view(B*T)
            loss = F.cross_entropy(logits, target)
        return logits, loss

    def generate(self, x, num_max_token):
        for _ in range(num_max_token):
            ## we want to get the predictions again
            logits, loss = self(x)
            ## and we only want the last block (Batch, Block, Vocab)
            logits = logits[:, -1, :] ## (Batch, Vocab)
            ## and then we apply the softmax to get the probabilities
            probs = torch.softmax(logits, dim=-1) ## still (Batch, Vocab)
            ## and then get a sample from the probablity distribution
            next_inx = torch.multinomial(probs,num_samples=1) ## (B, 1)
            ## and then append the next index to the x
            x = torch.cat((x, next_inx), dim=1) ## (Batch, Block + 1)
        return x

model = BigramLanguageModel(vocab_size)
output, loss = model(X_train, y_train)
print(output.shape)
## we're expecting the initial entropy to be
## -ln(1/vocab_size)
print(f'expected loss {-np.log(1/vocab_size):.2f}')
print(f'claculated loss {loss.item():.2f}')

torch.Size([256, 76])
expected loss 4.33
claculated loss 4.76


And we can see that the initial loss is higher than expected, which means the initial guesses are not completely diffused, and we have some entropy.

In [64]:
## checking the generate method
## which is completely random at the moment
## because we haven't yet trained the model
print(decode(model.generate(X_test, num_max_token=100)[0].tolist()))

ster toweo(o6aoyga﻿.mI7ku(;ViEdvnhh;BeoPYP"sbx8-yIfltBpJRG-pOqYpwkb"O7;AYWpC:MErhh:veRORfv3y&prFeNFlBcr:NiMA


### Training the model

In [65]:
## like any other ML model
## we need an optimizer
optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-3)

In [66]:
## the next step is to train the model
batch_size = 32
epochs = 20000
for e in range(epochs):
    ## get the X and y
    X_train, y_train = get_batch('train')
    logits, loss = model(X_train, y_train)
    ## zeroing the gradient
    optimizer.zero_grad(set_to_none=True)
    ## and then backpropagation
    loss.backward()
    ## and then taking a step
    optimizer.step()
    if e%1000==0:
        print(f'Epoch {e} Loss is {loss.item():.4f}')

Epoch 0 Loss is 4.8435
Epoch 1000 Loss is 3.7858
Epoch 2000 Loss is 3.2759
Epoch 3000 Loss is 2.7724
Epoch 4000 Loss is 2.6515
Epoch 5000 Loss is 2.5495
Epoch 6000 Loss is 2.4680
Epoch 7000 Loss is 2.3957
Epoch 8000 Loss is 2.5189
Epoch 9000 Loss is 2.5125
Epoch 10000 Loss is 2.4334
Epoch 11000 Loss is 2.3051
Epoch 12000 Loss is 2.3299
Epoch 13000 Loss is 2.3017
Epoch 14000 Loss is 2.4648
Epoch 15000 Loss is 2.4087
Epoch 16000 Loss is 2.3207
Epoch 17000 Loss is 2.4632
Epoch 18000 Loss is 2.4763
Epoch 19000 Loss is 2.3227


In [68]:
## now let's check to see what will our model generate after training
print(decode(model.generate(X_test, num_max_token=300)[0].tolist()))

ster toweabadmoom pngal,"Bu f t ne I wandns n's. ppedes 3.

ey engiousece

rothe,"ckithe besatourskevoly n bleant athy wncleancaros aut



Thacecet thout Yon tantt, her averesogond bely stowathe ie, ce angaive. d.

anly."At at, idins torerowadvo s mamers Zere aplor sinss I'sa, angen at tatheyous he a w t ig


Still not that good, but certainly better than the first try.

Next, we have to start with adding the **self-attention** part to our model.

### Adding the *self-attention* capability

we want any given token in our block to communicate with the tokens that come before it and the simplest way to achieve that will be to get the *average* - bag of words approach.

In [76]:
## lets suppose we have x
## where the dimensions are batch, block = time, and vocab = channel
B, T, C = 4, 8, 4
x = torch.randn((B, T, C))
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1, :]
        xbow[b, t,:] = torch.mean(xprev, 0) ## averaging out all the ts

In [77]:
x[0] ## first batch of tokens

tensor([[-0.9912, -1.6142,  0.2791,  2.2788],
        [ 0.3470,  0.6120,  0.6927,  0.8185],
        [ 0.5250,  0.6194, -0.7180, -0.1794],
        [-2.2976, -0.4656,  1.4123, -0.0794],
        [ 2.6717,  0.1332,  1.1822, -0.3420],
        [-0.6930, -0.7369,  1.1706, -0.3187],
        [-0.1670, -0.6074,  1.3874, -1.0650],
        [ 0.0793, -1.0949, -1.7705,  0.0817]])

In [78]:
xbow[0] ## first batch averged out over the tokens

tensor([[-0.9912, -1.6142,  0.2791,  2.2788],
        [-0.3221, -0.5011,  0.4859,  1.5486],
        [-0.0397, -0.1276,  0.0846,  0.9726],
        [-0.6042, -0.2121,  0.4165,  0.7096],
        [ 0.0510, -0.1430,  0.5697,  0.4993],
        [-0.0730, -0.2420,  0.6698,  0.3630],
        [-0.0865, -0.2942,  0.7723,  0.1590],
        [-0.0657, -0.3943,  0.4545,  0.1493]])

Now, this is very inefficient, and the mathematical way to avoid the loops is to use the *matrix multiplication*.

In [85]:
## we can create a normalized triangle matrix
## and multiply it with the original values
## to get the average
a = torch.tril(torch.ones((T, T)))
a = a / torch.sum(a, 1, keepdim = True)
c = a @ x
c[0]

tensor([[-0.9912, -1.6142,  0.2791,  2.2788],
        [-0.3221, -0.5011,  0.4859,  1.5486],
        [-0.0397, -0.1276,  0.0846,  0.9726],
        [-0.6042, -0.2121,  0.4165,  0.7096],
        [ 0.0510, -0.1430,  0.5697,  0.4993],
        [-0.0730, -0.2420,  0.6698,  0.3630],
        [-0.0865, -0.2942,  0.7723,  0.1590],
        [-0.0657, -0.3943,  0.4545,  0.1493]])

In [86]:
## and now c and xbow are the same
torch.allclose(xbow, c)

True

In [89]:
## the other way to write this will be 
tri = torch.tril(torch.ones((T, T)))
w = torch.zeros((T, T))
w = w.masked_fill(tri == 0, float('-inf'))
w = F.softmax(w, 1)
w

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

### Bigram Model + self-attention 

In [94]:
## adding some new params
num_emb = 32
device = 'cuda' if torch.cuda.is_available() else 'cpu'
class BigramLanguageModelV2(nn.Module):
    def __init__(self, vocab_size=vocab_size, num_emb=num_emb, block_size=block_size):
        super().__init__()
        ## we're changing this to be (vocab x emb)
        self.embedding_table = nn.Embedding(num_embeddings=vocab_size,
                                           embedding_dim=num_emb)
        ## we now want to have another embedding for the positions
        self.pos_embedding_table = nn.Embedding(block_size, num_emb)
        ## and then a linear layer to give us the logits
        self.lin = nn.Linear(num_emb, vocab_size)
        
    def forward(self, x, target=None):
        ## now we'll be incorporating the new layers
        B, T = x.shape
        token_emb = self.embedding_table(x) ## (B, T, num_emb)
        ## and then we can also get the position
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device)) ## (T, num_emb)
        ## and now our x will be the sum of these two
        x = token_emb + pos_emb
        ## and we can finally get our logits by the linear layer
        logits = self.lin(x) ## (B, T, C)
        if target is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            target = target.view(B*T)
            loss = F.cross_entropy(logits, target)
        return logits, loss

    def generate(self, x, num_max_token):
        for _ in range(num_max_token):
            ## we want to get the predictions again
            logits, loss = self(x)
            ## and we only want the last block (Batch, Block, Vocab)
            logits = logits[:, -1, :] ## (Batch, Vocab)
            ## and then we apply the softmax to get the probabilities
            probs = torch.softmax(logits, dim=-1) ## still (Batch, Vocab)
            ## and then get a sample from the probablity distribution
            next_inx = torch.multinomial(probs,num_samples=1) ## (B, 1)
            ## and then append the next index to the x
            x = torch.cat((x, next_inx), dim=1) ## (Batch, Block + 1)
        return x

model = BigramLanguageModelV2(vocab_size = vocab_size, num_emb = num_emb)
output, loss = model(X_train, y_train)
print(output.shape)
## we're expecting the initial entropy to be
## -ln(1/vocab_size)
print(f'expected loss {-np.log(1/vocab_size):.2f}')
print(f'claculated loss {loss.item():.2f}')

torch.Size([256, 76])
expected loss 4.33
claculated loss 4.73


### How to add the *self-attention* to our model?

Attention is basically a communication mechanism, so in this model, we have the `block_size` nodes, and each of these nodes are communicating the the nodes that come before them, with a weighted vector. This weight depends on what the node has, and what it's looking for/interested in finding.

There's no notion of space in this model, there are simply nodes that are out there, unlike the *Convolutional Neural Net*, where the filter goes through the positions/pixels in a space-dependent manner. We have to add the position information ourselves, which is what the `self.pos_embedding_table` does, in our updated model.

In [106]:
## we'll be introducing the key and query to our model
## where the key is the info about the token
## and query is what it's looking for
## each of these are called a head
## and we also need a head_size
head_size = 16
B, T, C = 4, 8, 32
random_x = torch.randn((B, T, C))
## and now we need two linear layers
## that only have weights
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
## we also need a layer for value
value = nn.Linear(C, head_size, bias=False)
k = key(random_x) ## (B, T, head_size)
q = key(random_x) ## (B, T, head_size)
## and now our weights is going to be the product of key and query
w = q @ k.transpose(-2, -1) ## (B, T, H) @ (B, H, T) > (B, T, T)
## and then the rest is the same as above
a = torch.tril(torch.ones((T, T)))
w = w.masked_fill(a==0, float('-inf'))
w = torch.softmax(w, dim=-1)
## and now our weights are not uniform!
w[0]

tensor([[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [5.6163e-03, 9.9438e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [1.4480e-03, 6.5055e-04, 9.9790e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [5.9243e-02, 3.5913e-02, 7.1229e-02, 8.3362e-01, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [7.5317e-02, 2.0028e-02, 7.5614e-02, 2.5977e-01, 5.6927e-01, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [8.3936e-03, 9.7194e-03, 2.9207e-03, 1.8211e-03, 4.4495e-03, 9.7270e-01,
         0.0000e+00, 0.0000e+00],
        [6.3756e-02, 1.0215e-02, 1.2512e-01, 8.6673e-02, 4.2035e-02, 6.4905e-03,
         6.6571e-01, 0.0000e+00],
        [1.0949e-01, 2.5646e-02, 6.5264e-02, 5.3551e-02, 5.3301e-02, 2.9483e-02,
         1.4093e-01, 5.2233e-01]], grad_fn=<SelectBackward0>)

In [107]:
## and before getting the product of the weights and X
## we apply the value layer to X, and that's what we aggregate
## and the shape now is (B, T, H)
v = value(random_x)
out = w @ v
out.shape

torch.Size([4, 8, 16])

In [112]:
## we also need to scale the Q @ K product
## so that the weights are not too concenterated 
## especially at the beginning, because we don't want our nodes
## to only learn from one other node, and we want the weights to be diffused
## and the way to normalize it is to devide them by sqrt of head size
w = q @ k.transpose(-2, -1) * head_size ** -.5
a = torch.tril(torch.ones((T, T)))
## the reason we have this is so that the nodes won't
## communicate with the nodes that come after them
w = w.masked_fill(a == 0, float('-inf'))
w = F.softmax(w, dim =-1)
v = value(random_x)
out = w @ v
w[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2152, 0.7848, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1440, 0.1179, 0.7380, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2055, 0.1813, 0.2152, 0.3980, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1742, 0.1251, 0.1744, 0.2374, 0.2889, 0.0000, 0.0000, 0.0000],
        [0.1312, 0.1361, 0.1008, 0.0895, 0.1119, 0.4305, 0.0000, 0.0000],
        [0.1397, 0.0884, 0.1653, 0.1508, 0.1259, 0.0789, 0.2511, 0.0000],
        [0.1332, 0.0926, 0.1170, 0.1114, 0.1112, 0.0959, 0.1418, 0.1968]],
       grad_fn=<SelectBackward0>)

### The actual head class

In [115]:

## the head class will use the self-attention concepts
class Head(nn.Module):
    def __init__(self, head_size=head_size, num_embd = num_emb):
        super().__init__()
        ## we have a linear layer for keys
        ## one for queries and another for values
        self.key = nn.Linear(num_embd, head_size, bias=False)
        self.quey = nn.Linear(num_embd, head_size, bias=False)
        self.value = nn.Linear(num_embd, head_size, bias=False)
        ## we also have the triangle that we use for masking
        self.register_buffer(name='tril',tensor=torch.tril(torch.ones((block_size, block_size))))

    def forward(self, x):
        B, T, C = x.shape ## C = head_size
        k = self.key(x) ## (B, T, num_emb)
        q = self.quey(x) ## (B, T, num_emb)
        weight = q @ k.transpose(-2, -1) * C ** -0.5 ## (B, T, T)
        weight = weight.masked_fill(self.tril[:T, :T]==0, float('-inf'))
        weight = F.softmax(weight, dim=-1)
        v = self.value(x) ## (B, T, C)
        out = weight @ v ## (B, T, C)
        return out
    
## in order to improve our model furthur
## we can have multiple heads
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size=head_size) for _ in range(num_heads)])
    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=1)

In [120]:
## updating the model
class BigramLanguageModelV3(nn.Module):
    def __init__(self, vocab_size=vocab_size, num_emb=num_emb, block_size=block_size):
        super().__init__()
        ## we're changing this to be (vocab x emb)
        self.embedding_table = nn.Embedding(num_embeddings=vocab_size,
                                           embedding_dim=num_emb)
        ## we now want to have another embedding for the positions
        self.pos_embedding_table = nn.Embedding(block_size, num_emb)
        self.head = Head(num_emb, num_emb)
        ## and then a linear layer to give us the logits
        self.lin_head = nn.Linear(num_emb, vocab_size)
        
    def forward(self, x, target=None):
        ## now we'll be incorporating the new layers
        B, T = x.shape
        token_emb = self.embedding_table(x) ## (B, T, num_emb)
        ## and then we can also get the position
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device)) ## (T, num_emb)
        ## and now our x will be the sum of these two
        x = token_emb + pos_emb
        x = self.head(x)
        ## and we can finally get our logits by the linear layer
        logits = self.lin_head(x) ## (B, T, C)
        if target is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            target = target.view(B*T)
            loss = F.cross_entropy(logits, target)
        return logits, loss

    def generate(self, x, num_max_token):
        for _ in range(num_max_token):
            ## we want to get the predictions again
            logits, loss = self(x)
            ## and we only want the last block (Batch, Block, Vocab)
            logits = logits[:, -1, :] ## (Batch, Vocab)
            ## and then we apply the softmax to get the probabilities
            probs = torch.softmax(logits, dim=-1) ## still (Batch, Vocab)
            ## and then get a sample from the probablity distribution
            next_inx = torch.multinomial(probs,num_samples=1) ## (B, 1)
            ## and then append the next index to the x
            x = torch.cat((x, next_inx), dim=1) ## (Batch, Block + 1)
        return x

model = BigramLanguageModelV3(vocab_size = vocab_size, num_emb = num_emb)
output, loss = model(X_train, y_train)
print(output.shape)
## we're expecting the initial entropy to be
## -ln(1/vocab_size)
print(f'expected loss {-np.log(1/vocab_size):.2f}')
print(f'claculated loss {loss.item():.2f}')

torch.Size([256, 76])
expected loss 4.33
claculated loss 4.34


In [None]:
## training the new model and checking the loss 
## to turn off the gradient calculation
@torch.inference_mode()
def estimate_loss(evaluation_interval):
    out = {}
    for split in ['train', 'test']:
        losses = torch.zeros(evaluation_interval)        
        for i in range(evaluation_interval):
            xs, ys = get_batch(split)
            model.eval()
            _, loss = model(xs, ys)
            losses[i] = loss.item()
        out[split] = losses.mean()   
    model.train()     
    return out

epochs = 5000
evaluation_interval = 500
for e in range(epochs):
    xs, ys = get_batch('train')
    logits, loss = model(xs, ys)
    ## zero out the gradient
    optimizer.zero_grad(set_to_none=True)
    ## and then backpropagation
    loss.backward()
    ## and then the optimizer step
    optimizer.step()
    if e%evaluation_interval==0:
        result = estimate_loss(evaluation_interval)
        print(f"Epoch {e} average train loss is {result['train']:.2f} | average test loss is {result['test']:.2f}")