# 🔄 GPT
> *Yo, listen up, folks! It's like GPT, man, it's the hot talk of the town right now, no joke! Everyone and their grandma is all gaga about this crazy thing. It's like this super brainy AI wizardry that can talk, write, and do all sorts of mind-bending stuff with words. I mean, it's like having a genius ghost in your computer, cooking up sentences that'll make Shakespeare blush. People are yapping about it at coffee shops, on social media, even at those fancy-schmancy tech conferences. It's like the ultimate word-wrangling rockstar, and ain't nobody keeping quiet about it, b**ch!*
>
> \- ChatGPT *(jesse's style)*

Yo!!! 🔥 It has fired up things pretty well!

___

Let's get back to the point, and collect ourselves from all things, here in this very notebook...
- We will start building the GPT *(yes!)*
- We will take a small dataset, Tiny Shakespeare, and train the model to generate more text like it.
- Explore what the **transformer** architecture is, and so on...

This is going to be a **real** fire 🔥🔥🔥 <br>
Let's do it! *(without talking too much)*

### ⬇️ Downloading the dataset

In [4]:
## !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [5]:
# reading the file
with open('./input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [6]:
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


—— Alright ——

### 🔍 Just a little inspection

In [7]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Total unique characters:", vocab_size, end="\n\n")
print(chars)
print(''.join(chars))

Total unique characters: 65

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


Alright, we have just a **single digit**, `3`, in the whole dataset... what's going on, Shakespeare!?

In [8]:
# not a big deal.
text.count("3")

27

### 🤹‍♀️ Playing around with the encoder/decoder = tokenizer!

In [9]:
char_to_number = {ch:i for i,ch in enumerate(chars)}
number_to_char = {i:ch for i,ch in enumerate(chars)}

# encoder: take a string, output a list of integers
encode = lambda s: [char_to_number[c] for c in s]

# decoder: take a list of integers, output a string
decode = lambda l: ''.join([number_to_char[i] for i in l]) 

encoded_tokens = encode("This is GPT.")
print(encoded_tokens)

decoded_tokens = decode(encoded_tokens)
print(decoded_tokens)

[32, 46, 47, 57, 1, 47, 57, 1, 19, 28, 32, 8]
This is GPT.


> **NOTE**: This is going to be a "character-level" language model; we won't be building a "word-level or sub-word-level" model here. It will still give pretty good results ✨
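
One thing worth knowing about a character-level tokenizer like this: it can only encode characters it has seen while building the vocabulary. Here is a tiny pure-Python sketch on a toy string (`toy_text`, `stoi` and `itos` are names I made up for the illustration; the notebook itself uses `char_to_number` and `number_to_char`):

```python
toy_text = "hello world"
chars = sorted(set(toy_text))                 # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> character

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("hello")
round_trip = decode(ids)

try:
    encode("hey")        # 'y' never appeared in toy_text
    unknown_ok = True
except KeyError:
    unknown_ok = False   # unseen characters cannot be encoded

print(ids, round_trip, unknown_ok)
```

With Tiny Shakespeare this is not a problem in practice, as long as we only encode text made of those `65` characters.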

# 📂 Creating the dataset

We will keep:
1. `90%` data for training
2. `10%` for the validation

In [10]:
import torch
data = torch.tensor(encode(text), dtype=torch.long)

In [11]:
# token ids are stored as `torch.long` (int64) because that is the dtype 
# `F.cross_entropy` expects for class-index targets later on
print(data.shape, data.dtype)

torch.Size([1115394]) torch.int64


In [12]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

Which looks something like... 👇

In [13]:
decode(train_data[:100].tolist())

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

Nothing fancy: we will feed in the **individual characters** and use the next character as the target `y`.

### 📏 The context length `2048`

ChatGPT has a maximum length: the most tokens it can take as input and generate as output combined, which is `4,096 tokens` in ChatGPT's case.

#### 🥜 Context window in the nutshell

🤔 **What does that mean actually**?
- Say you are a person who can predict the next year of your friend's life **by looking at their** last `10` years at most.
- So, you take the last `10` years of history and estimate how their next year will go.
- You keep going until, say, year `100`.
- You only have a **context window** of 10 years because you can't analyze more than that at once.
- So, once you predict their next year, you append ***that predicted year*** and forget the very first year, keeping the context window at `10`.

But, this is not it! <br>
People **may not have 10 years of data**, so you should also be able to predict their next year even if they give you just, say, `6` years of data: then you predict the 7th!

> 🪶 *That's where we will train the network which will predict the next token based on past tokens till the maximum context length is reached.*
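
The "predict, append, forget the oldest" idea can be sketched in plain Python. This is only an illustration: `predict_next` is a made-up stand-in for a model (it simply returns the mean of the window), and `roll_forward` is a helper name of my own.

```python
def predict_next(window):
    # Hypothetical stand-in for a model: the mean of whatever history it gets
    return sum(window) / len(window)

def roll_forward(history, steps, context_window=10):
    history = list(history)
    for _ in range(steps):
        window = history[-context_window:]    # at most the last `context_window` items
        history.append(predict_next(window))  # the prediction joins the history
    return history

short_history = [1.0, 2.0, 3.0]               # fewer than 10 items still works
out = roll_forward(short_history, steps=2)

# With a window of 2, the oldest value (100.0) is ignored
capped = roll_forward([100.0, 0.0, 10.0], steps=1, context_window=2)
print(out, capped)
```

The slice `history[-context_window:]` handles both cases at once: short histories are used as they are, long ones are cut down to the window.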

In [14]:
context_window = 8 #for now
sample_data = train_data[:context_window+1]
print(sample_data, decode(sample_data.tolist()), sep=" = ")

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58]) = First Cit


In [15]:
x = train_data[:context_window] # except the `+1`th character
y = train_data[1:context_window+1] # except the very first `0`th character
for t in range(context_window):
    context = x[:t+1].tolist()
    target = y[t]
    print(f"When input is {context} the target: {target}")

When input is [18] the target: 47
When input is [18, 47] the target: 56
When input is [18, 47, 56] the target: 57
When input is [18, 47, 56, 57] the target: 58
When input is [18, 47, 56, 57, 58] the target: 1
When input is [18, 47, 56, 57, 58, 1] the target: 15
When input is [18, 47, 56, 57, 58, 1, 15] the target: 47
When input is [18, 47, 56, 57, 58, 1, 15, 47] the target: 58


Which also is 👇

In [16]:
for t in range(context_window):
    context = x[:t+1].tolist()
    target = y[t].tolist()
    print(f"{decode(context)} → {decode([target])}")

F → i
Fi → r
Fir → s
Firs → t
First →  
First  → C
First C → i
First Ci → t


> 🗒 <br><br>**NOTE**: This is like our `makemore` version, with one difference: in makemore, we had a **fixed** input size. All names were `8` long. If not, we *padded* them with the `.` character to make them `8` long. Here, instead, we grow the context one character at a time so that the **model can learn from every context length** 🤗

⏪ **Before** *(makemore)*:

    ... → d
    ..d → i
    .di → o
    dio → n
    ion → d
    ond → r
    ndr → e
    dre → .

⏩ **Now** *(GPT)*:

    d → i
    di → o
    dio → n
    dion → d
    diond → r
    diondr → e
    diondre → .

> 💬 <br>*We are doing this because we want the transformer "to be used to" seeing the characters as low as a single character and as long as `context_window`.* <br><br> — Andrej

In [17]:
train_data.shape

torch.Size([1003854])

### 🗄️ Dataset with multiple samples

In [18]:
n_samples = 4      #batch_size
context_window = 8 #block_size
torch.manual_seed(1337)

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - context_window, (n_samples,))
    x = torch.stack([data[i:i+context_window] for i in ix])
    y = torch.stack([data[i+1:i+context_window+1] for i in ix])
    return x, y

In [19]:
xb, yb = get_batch('train')
print('📥 Inputs:')
print(xb.shape)
print(xb)
print('\n📤 Targets:')
print(yb.shape)
print(yb)

📥 Inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])

📤 Targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])


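I hope it is clear what is going on here... 
- `torch.randint` picks `4` random start indices from `0` up to `len(data) - 8` *(exclusive)*, i.e. at most `1003845` for the training split of `1003854` tokens. *(This keeps the shifted target window `y` inside the data and avoids an IndexError.)*
- Then from ***each*** of those `4` start indices, it slices a sample of `8` consecutive tokens and stacks the samples into a single batch.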
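
To see why the upper bound is `len(data) - context_window`, here is a small pure-Python sketch of the same sampling on a toy list. `toy_data` and `sample_window` are names I made up; `random.randrange` plays the role of `torch.randint` here (both exclude the upper bound).

```python
import random

toy_data = list(range(20))   # stand-in for the encoded text
context_window = 8

def sample_window(data, context_window, rng):
    # The upper bound is excluded, so the largest start is len(data) - context_window - 1
    i = rng.randrange(len(data) - context_window)
    x = data[i:i + context_window]
    y = data[i + 1:i + context_window + 1]   # the same window shifted by one
    return x, y

rng = random.Random(1337)
x, y = sample_window(toy_data, context_window, rng)

# The worst case: the largest possible start index
largest_start = len(toy_data) - context_window - 1
x_last = toy_data[largest_start:largest_start + context_window]
y_last = toy_data[largest_start + 1:largest_start + context_window + 1]
print(x, y, x_last, y_last)
```

Even at the largest start index, `y_last` is still `8` long and ends exactly on the last element of the data.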

🤨 Wanna see in action?

In [20]:
for b in range(n_samples): # batch dimension
    print("🔵\n")
    for t in range(context_window): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"{context.tolist()} → {target} \n{decode(context.tolist())} → {decode([target.tolist()])}\n")

🔵

[24] → 43 
L → e

[24, 43] → 58 
Le → t

[24, 43, 58] → 5 
Let → '

[24, 43, 58, 5] → 57 
Let' → s

[24, 43, 58, 5, 57] → 1 
Let's →  

[24, 43, 58, 5, 57, 1] → 46 
Let's  → h

[24, 43, 58, 5, 57, 1, 46] → 43 
Let's h → e

[24, 43, 58, 5, 57, 1, 46, 43] → 39 
Let's he → a

🔵

[44] → 53 
f → o

[44, 53] → 56 
fo → r

[44, 53, 56] → 1 
for →  

[44, 53, 56, 1] → 58 
for  → t

[44, 53, 56, 1, 58] → 46 
for t → h

[44, 53, 56, 1, 58, 46] → 39 
for th → a

[44, 53, 56, 1, 58, 46, 39] → 58 
for tha → t

[44, 53, 56, 1, 58, 46, 39, 58] → 1 
for that →  

🔵

[52] → 58 
n → t

[52, 58] → 1 
nt →  

[52, 58, 1] → 58 
nt  → t

[52, 58, 1, 58] → 46 
nt t → h

[52, 58, 1, 58, 46] → 39 
nt th → a

[52, 58, 1, 58, 46, 39] → 58 
nt tha → t

[52, 58, 1, 58, 46, 39, 58] → 1 
nt that →  

[52, 58, 1, 58, 46, 39, 58, 1] → 46 
nt that  → h

🔵

[25] → 17 
M → E

[25, 17] → 27 
ME → O

[25, 17, 27] → 10 
MEO → :

[25, 17, 27, 10] → 0 
MEO: → 


[25, 17, 27, 10, 0] → 21 
MEO:
 → I

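[25, 17, 27, 10, 0, 21] → 1 
MEO:
I →  

[25, 17, 27, 10, 0, 21, 1] → 54 
MEO:
I  → p

[25, 17, 27, 10, 0, 21, 1, 54] → 39 
MEO:
I p → a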

More clearly... 💖

In [21]:
print('📥 Inputs:')
for decoded_xb in map(lambda t: decode(t.tolist()), xb):
    print("[", decoded_xb.replace("\n","\\n"), "]", sep='')
    
print('\n📤 Targets:')
for decoded_yb in map(lambda t: decode(t.tolist()), yb):
    print("[", decoded_yb.replace("\n","\\n"), "]", sep='')

📥 Inputs:
[Let's he]
[for that]
[nt that ]
[MEO:\nI p]

📤 Targets:
[et's hea]
[or that ]
[t that h]
[EO:\nI pa]


Yeah, so this is how it goes 😄

Which means... 
- If a single sample consists of `8` characters, then there are `8` different (context → target) examples for the model to learn from in that single sample.
- So *if* we have `4` samples, one batch packs a total of `32` training examples into a single forward pass.

# 🐫 Let's create the Bigram LM
Oh, some déjà vu? Let's recap quickly:
- The model only looks at the **previous** token
- Based on that, it selects the **row** *(in the manual case)* and samples the next character from that row's distribution.

That's it! Let's see how that can work here.

> **NOTE**: Since I am running this model on the cloud, I am able to use the GPU. So, the following code may use the "device" to leverage the faster training 😉

In [22]:
import torch.nn as nn # for layers and stuff
from torch.nn import functional as F # for the loss function and softmax
torch.manual_seed(1337) # same as in the lecture

<torch._C.Generator at 0x7f9380f6fe10>

In [23]:
class BigramLM(nn.Module): # because, we will use its functions like forward.
    """
    Just the starter skeleton of the Class.
    In the very next cell we will implement full BigramLM.
    
    With Explanation 😉
    """
    
    
    def __init__(self, vocab_size):
        super().__init__()
        
        # The simple lookup table...
        self.embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        '''
        Just take the index of the "current" character 
        and then use that to get the probability dist 
        for the next one using the `embedding_table`.
        '''
        
        # Note we are NOT indexing `[]` but CALLING `()` the table
        # because that is the Layer (kind of)
        logits = self.embedding_table(idx)
        
        return logits # Form of (B, T, C)

In [24]:
model = BigramLM(vocab_size)
logits = model(xb, yb)
logits.shape # B, T, C form

torch.Size([4, 8, 65])

- `4`: Number of samples *(the batch)*
- `8`: Context window *(one position per token)*
- `65`: The vocab size *(a next-token score for every character, at each position in the context window)*

### 💭 Think through this...
Here, for each sample, we pull out the next character's distribution at each of the `8` positions.

**Which means** that so far **NO CONNECTION** between the tokens has been made. We have just looked up the next token's distribution, and that is based solely on the current token: nothing links it to the previous tokens. 

*That's what Andrej said in [this clip](https://youtube.com/clip/Ugkxcwwtx2tSFQEwgPHmXgylUpKFSXHBM_gn).*
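
One way to convince yourself of this: in a table-lookup model, the same token produces the same logits wherever it appears, whatever its neighbours are. A minimal sketch, assuming a fresh `nn.Embedding` as the lookup table (`table` and `idx` are illustrative names, not the notebook's variables):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)

# Token 46 appears at positions 0 and 2, with different neighbours each time
idx = torch.tensor([[46, 1, 46, 58]])
logits = table(idx)                       # shape (1, 4, 65)

# Same token in, same row of the table out
same = torch.equal(logits[0, 0], logits[0, 2])
print(logits.shape, same)
```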

## 🏗 Building Up the Bigram

In [25]:
class BigramLM(nn.Module):
    """
    This class takes `vocab_size` as its only input: being the
    "simplest" model, it doesn't need anything else.
    
    It creates a `vocab_size` x `vocab_size` lookup table that we can
    index with token ids.
    
    The forward function takes the input `idx` and, when `targets` is
    given, also computes the loss. The nuances are explained in 
    the following markdown cells.
    """
    
    
    def __init__(self, vocab_size):
        super().__init__()
        
        # The simple lookup table...
        self.embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        '''
        Just take the index of the "current" character 
        and then use that to get the probability dist 
        for the next one using the `embedding_table`.
        '''

        logits = self.embedding_table(idx)
        
        if targets is None: # means we are inferencing and not training
            loss=None
        else:               # means we are training
            # Refer: Change [1]
            B, T, C = logits.shape
            logits = logits.view(B*T, C)

            # Refer: Change [1]
            targets = targets.view(B*T)

            # Cross-entropy is the negative log likelihood of the
            # *correct* targets under softmax(logits), averaged over
            # all B*T predictions.
            loss = F.cross_entropy(logits, targets)
            
            
        return logits, loss
        
        
    def generate(self, idx, max_new_tokens):
        '''
        Take the index of the token and guess the next
        token based on the embeddings!
        '''
        for _ in range(max_new_tokens):
            # call the `forward` method 
            logits, loss = self(idx) # we can do this because we have inherited `nn.Module` :)
            
            # Take the very last token (8th in the context window)
            # and use its distribution to get the next token
            logits = logits[:, -1, :] ## refer: Change [2]

            # Convert the logits into the probabilities
            probs = F.softmax(logits, dim=-1) ## dim=-1: along the last dimension ~ here `1`

            # Take the next idx!
            next_idx = torch.multinomial(probs, num_samples=1)

            # Here we will ALWAYS append, there is NO shrinking!
            idx = torch.cat((idx, next_idx), dim=1)

        return idx

### 📖 Change Explainer

### `Change [1]`

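The current `logits` shape is `B, T, C` = `4, 8, 65`, but `F.cross_entropy` expects the class dimension to come second: logits of shape `(N, C)` with targets of shape `(N,)`. So we flatten the batch and time dimensions into a single one of `B*T = 32` rows.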

<img src="./images/BTC.png">

### `Change [2]`

<img src="./images/last_token.png">
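
In case the images don't render, here is a small sketch of both changes on dummy tensors. I am using all-zero logits on purpose (an assumption for the demo, not what the model produces): they correspond to a uniform guess over the `65` characters, so the loss should come out as `ln(65)`.

```python
import math
import torch
from torch.nn import functional as F

B, T, C = 4, 8, 65
logits = torch.zeros(B, T, C)               # all-zero logits = uniform prediction
targets = torch.randint(0, C, (B, T))       # random "correct" next tokens

# Change [1]: flatten batch and time so that each row is one prediction
flat_logits = logits.view(B * T, C)         # (32, 65)
flat_targets = targets.view(B * T)          # (32,)
loss = F.cross_entropy(flat_logits, flat_targets)

# Change [2]: keep only the last time step of every sample
last_step = logits[:, -1, :]                # (4, 65)

print(flat_logits.shape, last_step.shape, loss.item(), math.log(C))
```

`ln(65) ≈ 4.17` is a handy baseline: a model that knows nothing but spreads its bets evenly would score about that.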

A simple generation 🎉

In [26]:
model = BigramLM(vocab_size)
logits, loss = model(xb, yb)
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), dtype=torch.long), 
        max_new_tokens=100)[0].tolist()
)
print(output)


hbH

:CLP.A!fq'3ggt!O!T?X!!SA?W&TrpvYybSE3w&S BXUhmiKYyTmWMPhhmnHKj!!btgnwNNULuEzRuYyiWEQxPX!$3C'MBj


> That's just an **un-trained** random model 😅

# 📈 Optimization

### Create a model

In [24]:
model = BigramLM(vocab_size)

### Set the optimizer

In [25]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [26]:
# The parameters...
param = model.parameters()
for p in param:
    print(len(p))

65


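Careful: `len(p)` only reports the first dimension of the parameter tensor. Our simple, single layer is a `65 x 65` lookup table, so it has `65` rows but `65 * 65 = 4225` trainable values in total *(`p.numel()` gives that count)*.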
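
A quick sketch of the difference between `len(p)` and `p.numel()`, assuming a standalone `nn.Embedding` of the same size as the model's table (`table` is just an illustrative name):

```python
import torch.nn as nn

vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)

for name, p in table.named_parameters():
    # len(p) is the size of the first dimension, numel() counts every element
    print(name, tuple(p.shape), len(p), p.numel())

total = sum(p.numel() for p in table.parameters())
print(total)
```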

### Training loop

In [30]:
n_samples = 32 ### BATCH SIZE!

for steps in range(20_000): # increase number of steps for good results... 
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    
    ### Reset the gradients from the previous step before backprop
    optimizer.zero_grad(set_to_none=True)
    
    loss.backward()
    
    ### Update the weights' value.
    optimizer.step()

print(loss.item())

2.420738458633423


Trained generation 🎉

In [31]:
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), dtype=torch.long), 
        max_new_tokens=500)[0].tolist()
)
print(output)


Toul sil prangir sis.

Wh I whise brthit RD:
Gom.
I INomere y ghesen cond
YCouchin, w?

Te counere ne ung;
GE n;
II t.

He verve o was.


Hes aift NTHEDIO:
Sl:
Whinke ngt?
MAms NGO shemo too mo anthatinthakes f utous as Agonteopr botherore thind spat PTEShiat ureraierio pr son me LO:
Wats me S:
To tllingewe lley ayom

Mo;
Latanssuromas:

Y:
PE:
Therucover, min ld te o e, un rd o s hthecals,

WI thank m:


NTharu t irlendoucin,
Y: Tisteriomad t.
Yor f.

S:
G, g w,

Hee:


NGLI,
JUTeden t t IVo sl


Yo! 🔥 <br>
It is just looking at the **previous single token!**

# The code with CUDA 🙅
Let's start stepping into the real world. Let's use the **GPU**!

> 😃 
>
>*The following code is the **re-written** version of the code above just to demonstrate <br> and introduce the new code as Andrej shows in the [clip](https://youtube.com/clip/Ugkx7YMJZGAh8ybgDkRS4rDpIvV5I4xTDK9d).*



🗒 **A little note:** From our previous exercises, we are already familiar with what happens inside. So far we have named the variables in *our own language*, such as calling `batch_size` `n_samples`. From now on, I will use the same variable names that are used in the industry, so that we can get a grip on the industry jargon 😉

> ➡ In many places I have written *`### 🗽 Transfer to device 🗽 ###`*, which means there is a small change in the code to move the weights or tensors to the available device.

### `1.` Setting the hyperparams

In [32]:
batch_size = 32       # n_samples
block_size = 8        # context_window
max_iters = 20_000    # total steps
eval_interval = 1_000 # interval at which we will print the loss
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200

For me, the device now is:

In [33]:
device

'cuda'

### `2.` Create the model

In [34]:
model = BigramLM(vocab_size)

### 🗽 Transfer to device 🗽 ###
model = model.to(device)

### `3.` To Get Batch

In [58]:
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    
    ### 🗽 Transfer to device 🗽 ###
    x, y = x.to(device), y.to(device)
    return x, y

### `4.` Optimizer

In [59]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

### `5.` Estimate loss

In [60]:
@torch.no_grad()
def estimate_loss():
    '''
    This function draws random batches from the dataset (of size `batch_size`)
    `eval_iters` times, records each loss, and reports back the mean loss.
    
    Which means, if we have `eval_iters = 10` and `batch_size = 32`, then for each
    split (training, then validation) it draws 32 random samples 10 times and takes
    the mean of those 10 losses.
    '''
    out = {}
    
    # 🔥 sets on evaluation mode... 🔥
    # which does something like `training_mode = False` 
    # in the layers like `BatchNorm`.
    model.eval()
    
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    
    # 🔥 sets the model to training mode back!!! 🔥
    # which does something like `training_mode = True` 
    # in the layers like `BatchNorm`.
    model.train()
    return out

### `6.` Training

This loop is amazing. Let's break the steps down:
- Train for `20,000` steps
- Print the loss at every `1,000`th step
- When it is time to print, compute the loss on a random batch of `32`, `eval_iters` times, for both the train and validation splits.
- Take the mean of those losses and then print it.

> ➡ Doing this will let us see **how the model has learnt** the relationship in the subset of data instead of printing the loss as is for that *last* batch. This way is better 😁

In [38]:
for step in range(max_iters): # increase number of steps for good results... 
    
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"[Step {step}]: Train Loss~{losses['train']:.4f}, Val Loss~{losses['val']:.4f}")
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

[Step 0]: Train Loss~4.6628, Val Loss~4.6395
[Step 1000]: Train Loss~3.7101, Val Loss~3.6990
[Step 2000]: Train Loss~3.1239, Val Loss~3.1174
[Step 3000]: Train Loss~2.8066, Val Loss~2.8062
[Step 4000]: Train Loss~2.6495, Val Loss~2.6406
[Step 5000]: Train Loss~2.5530, Val Loss~2.5677
[Step 6000]: Train Loss~2.5175, Val Loss~2.5374
[Step 7000]: Train Loss~2.5002, Val Loss~2.5110
[Step 8000]: Train Loss~2.4871, Val Loss~2.5037
[Step 9000]: Train Loss~2.4769, Val Loss~2.4854
[Step 10000]: Train Loss~2.4649, Val Loss~2.4871
[Step 11000]: Train Loss~2.4600, Val Loss~2.4805
[Step 12000]: Train Loss~2.4585, Val Loss~2.4853
[Step 13000]: Train Loss~2.4561, Val Loss~2.4812
[Step 14000]: Train Loss~2.4536, Val Loss~2.4804
[Step 15000]: Train Loss~2.4579, Val Loss~2.4905
[Step 16000]: Train Loss~2.4560, Val Loss~2.4900
[Step 17000]: Train Loss~2.4556, Val Loss~2.4887
[Step 18000]: Train Loss~2.4497, Val Loss~2.4807
[Step 19000]: Train Loss~2.4517, Val Loss~2.4903


### `7.` Inference!

In [39]:
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), 
                          dtype=torch.long,
                          device=device),  ### 🗽 Transfer to device 🗽 ###
        max_new_tokens=500)[0].tolist()
)
print(output)




CExfik brid owindakis by bth

HAPet bobe d e.
S:
O:
IS:
Falatanss:
Wanthar u qur, vet?
F dilasoate awice my.

Hastarom oroup
Yowhthetof isth ble mil ndill, ath iree sengmin lat Heriliovets, and Win nghirileranousel lind me l.
HAshe ce hiry:
Supr aisspllw y.
Hentofu noroopetelaves
MP:

Pl, d mothakleo Windo whth eisbyo the m dourive we higend t so mower; te

AN ad nterupt f s ar iris! m:

Thiny aleronth,
Mad
RD:

WISo myr f-
LIERor,
SShisar adsal thes ghesthidin cour ay aney Iry ts I fr y ce.
J


# 🔍📚 Taking the context into an account

- The GPT model is **"autoregressive"**, which means it looks at the "previous" context and then generates the next token. 
- That **also** means the model **doesn't** know the future tokens when generating *this* token, because they haven't been generated yet *(in contrast to the masked learning done with BERT models, where we train the model to see both past and future tokens to estimate the `<MASK>` token)*.

<img src="./images/autoregressive.png">

... and to let each token use its past context, we will simply take the **mean** of the previous tokens for now. As a starter.

## 😅 The for-loop way
Suppose we have `context_window=8`. We start with the `1st` token's embedding, then incrementally include the tokens up to the `8th`, taking the mean of everything seen so far at each step.

Which looks like...

In [40]:
# A sample data
B, T, C = 4, 8, 3
sample_xb = torch.arange(B*T*C, dtype=torch.float).view(B, T, C)

👉 The embedding size is `3`, alright?

In [41]:
sample_xb

tensor([[[ 0.,  1.,  2.],
         [ 3.,  4.,  5.],
         [ 6.,  7.,  8.],
         [ 9., 10., 11.],
         [12., 13., 14.],
         [15., 16., 17.],
         [18., 19., 20.],
         [21., 22., 23.]],

        [[24., 25., 26.],
         [27., 28., 29.],
         [30., 31., 32.],
         [33., 34., 35.],
         [36., 37., 38.],
         [39., 40., 41.],
         [42., 43., 44.],
         [45., 46., 47.]],

        [[48., 49., 50.],
         [51., 52., 53.],
         [54., 55., 56.],
         [57., 58., 59.],
         [60., 61., 62.],
         [63., 64., 65.],
         [66., 67., 68.],
         [69., 70., 71.]],

        [[72., 73., 74.],
         [75., 76., 77.],
         [78., 79., 80.],
         [81., 82., 83.],
         [84., 85., 86.],
         [87., 88., 89.],
         [90., 91., 92.],
         [93., 94., 95.]]])

In [38]:
# make the array full of zeros to fill the means in
x_bow = torch.zeros_like(sample_xb)

for b in range(B): # for all samples (here 4)
    for t in range(T): # for all tokens (here 8)
        x_prev = sample_xb[b, :t+1] # shape T, C
        x_bow[b, t] = x_prev.mean(dim=0) # shape C

In [39]:
# say just a first example B[0]
sample_xb[0]

tensor([[ 0.,  1.,  2.],
        [ 3.,  4.,  5.],
        [ 6.,  7.,  8.],
        [ 9., 10., 11.],
        [12., 13., 14.],
        [15., 16., 17.],
        [18., 19., 20.],
        [21., 22., 23.]])

🔼 We can compare them 🔽

In [39]:
# for that first example, we have the incremental means
x_bow[0]

tensor([[ 0.0000,  1.0000,  2.0000],
        [ 1.5000,  2.5000,  3.5000],
        [ 3.0000,  4.0000,  5.0000],
        [ 4.5000,  5.5000,  6.5000],
        [ 6.0000,  7.0000,  8.0000],
        [ 7.5000,  8.5000,  9.5000],
        [ 9.0000, 10.0000, 11.0000],
        [10.5000, 11.5000, 12.5000]])

Note how it performs the mean incrementally!

## 😎 The Math Trick!
Loops are slow. This was just an example with `4` samples and an `8`-long context... but when these numbers get big, we would spend more time computing these means than on the actual training!

So, we will use some maths to get the same result **in a single shot!**

> 🔳🔲 <br>**Matrix multiplication to the rescue!**

___
🔗 Here is the [math trick link](https://youtube.com/clip/UgkxoQF1PSblzO0ChHycaNpmq8c29MOmdYDk) where Andrej explains this neatly, I would encourage you to check that out there and then let's continue here.
___

📑 **The concept summary**:
1. A **matrix multiplication** performs `sum`s: each output row is a sum of the input rows, weighted by one row of the left matrix.
2. We can leverage that sum feature of the matmul.
3. To do that in an **incremental fashion** we use `torch.tril`, which gives a lower-triangular matrix: row `t` has ones up to column `t` and zeros after it, so the later tokens are ignored.
4. That is half the job: the summation *(which is just the cumsum)*. The rest is to divide to get the mean.
5. The divisor comes from `.sum(dim=1)` on the `tril` matrix, which gives the **count** of ones in each row.
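
The summary above can be sketched as follows. It is a minimal check on random data (the names `x_loop`, `weights` and `x_matmul` are mine) that compares the for-loop way against the matmul way:

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 8, 3
x = torch.randn(B, T, C)

# Loop version: mean of all tokens up to and including position t
x_loop = torch.zeros_like(x)
for b in range(B):
    for t in range(T):
        x_loop[b, t] = x[b, :t + 1].mean(dim=0)

# Matmul version
ones = torch.tril(torch.ones(T, T))             # row t: ones in columns 0..t
weights = ones / ones.sum(dim=1, keepdim=True)  # divide by the count, rows sum to 1
x_matmul = weights @ x                          # (T, T) @ (B, T, C) -> (B, T, C)

print(torch.allclose(x_loop, x_matmul, atol=1e-5))
```

The `(T, T)` weight matrix is broadcast over the batch dimension, so one matmul replaces both loops.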

## 📔 Just to note... `cumsum` performs better

I am not sure ***why Andrej didn't*** mention this in the lecture, but here is the thing.

1. To get the **incremental** mean, we use the `torch.tril`
2. Then we divide that with the count
3. If we observe, the same thing can be done with the **cumsum** and then we can divide accordingly.

👨‍💻 Here is the code for it:
```python
data = torch.randint(0, 10, (8, 3), dtype=torch.float)
cum_sum = data.cumsum(0) / torch.arange(1, data.shape[0]+1).view(data.shape[0], 1)
```

When I timed **both** ways *(matmul and cumsum)* I observed a large difference. Here's the code so you can check too!

#### `1.` MatMul 

In [40]:
# A big matrix ;)
data = torch.randint(0, 10, (10_000, 10_000), dtype=torch.float)
the_a = torch.tril(torch.ones(data.shape[0], data.shape[0]))
the_a = the_a / the_a.sum(dim=1, keepdims=True)

In [39]:
%%timeit
the_a @ data

8.57 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### `2.` CumSum

In [44]:
%%timeit

data.cumsum(0) / torch.arange(1, data.shape[0]+1).view(data.shape[0], 1)

1.62 s ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Check the difference!

In [40]:
# MatMul Way
res = the_a @ data
# CumSum Way
cum_sum = data.cumsum(0) / torch.arange(1, data.shape[0]+1).view(data.shape[0], 1)

In [41]:
torch.allclose(res, cum_sum)

True

Both give the same result! Yo!? 🔥

# 🧸 With Softmax?
Turns out that there is the **third** way of doing this, which is the **real way**. <br>
Let's explore why.

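**What does softmax do?** <br>
At a high level, it converts the numbers into probabilities, right? That means it **normalizes** the numbers so that each row sums to `1`. And when all the numbers in a row are equal *(all zeros here)*, every entry gets the same weight `1/n`, so multiplying by those weights **gives the mean** of the numbers!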

> That's what we do here, right?

In [40]:
# For that we simply create the masked 2D tensor tril
tril = torch.tril(torch.ones((T, T)))

# then the "weights" which will act as the "count" for the average
wei = torch.zeros((T, T))

# we will "filter" and make the rest "-inf" to exclude them from the 
# softmax calculation, otherwise they will affect the softmax normalization
wei = wei.masked_fill(tril==0, float("-inf"))

In [41]:
F.softmax(wei, dim=-1) @ sample_xb

tensor([[[ 0.0000,  1.0000,  2.0000],
         [ 1.5000,  2.5000,  3.5000],
         [ 3.0000,  4.0000,  5.0000],
         [ 4.5000,  5.5000,  6.5000],
         [ 6.0000,  7.0000,  8.0000],
         [ 7.5000,  8.5000,  9.5000],
         [ 9.0000, 10.0000, 11.0000],
         [10.5000, 11.5000, 12.5000]],

        [[24.0000, 25.0000, 26.0000],
         [25.5000, 26.5000, 27.5000],
         [27.0000, 28.0000, 29.0000],
         [28.5000, 29.5000, 30.5000],
         [30.0000, 31.0000, 32.0000],
         [31.5000, 32.5000, 33.5000],
         [33.0000, 34.0000, 35.0000],
         [34.5000, 35.5000, 36.5000]],

        [[48.0000, 49.0000, 50.0000],
         [49.5000, 50.5000, 51.5000],
         [51.0000, 52.0000, 53.0000],
         [52.5000, 53.5000, 54.5000],
         [54.0000, 55.0000, 56.0000],
         [55.5000, 56.5000, 57.5000],
         [57.0000, 58.0000, 59.0000],
         [58.5000, 59.5000, 60.5000]],

        [[72.0000, 73.0000, 74.0000],
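         [73.5000, 74.5000, 75.5000],
         [75.0000, 76.0000, 77.0000],
         [76.5000, 77.5000, 78.5000],
         [78.0000, 79.0000, 80.0000],
         [79.5000, 80.5000, 81.5000],
         [81.0000, 82.0000, 83.0000],
         [82.5000, 83.5000, 84.5000]]])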

### 🖼 Which looks like...

<img src="./images/method_comparision.png">

Well, it may look like the "softmax" method is a little more involved, but it will help us later, because we will get the values of `wei` not from `torch.zeros` but from the data itself (the embeddings), and those will then be masked.

So, the conclusion is that... we will be using the Softmax way where:
1. We create a `tril`, which serves as the mask.
2. We use that *triled* array to mask `wei` (the `weights`) so that tokens from the future can't communicate with the past.
3. Then we run `wei` through the softmax to get the normalized values and finally apply (matmul) them to the data `x` to get the incremental mean.
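
Here are the three steps above written out without PyTorch, so that every number is visible. This is a from-scratch sketch for intuition only (the `softmax` helper is my own, not `F.softmax`), using a tiny `T = 4`:

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; exp(-inf) is exactly 0.0
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

neg_inf = float("-inf")
T = 4
# Row t keeps zeros for positions 0..t and -inf for the "future" positions
wei = [[0.0 if col <= row else neg_inf for col in range(T)] for row in range(T)]
weights = [softmax(r) for r in wei]

for r in weights:
    print(r)
```

The masked positions end up with weight `0`, and the remaining equal scores share the weight evenly, which is exactly the running mean. Once the zeros are replaced by data-dependent scores, the same code gives a weighted mean instead.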

# 📅🌱 Driving our way towards maturity
The old days of *playing* with the bigram toy model are over; let's build ourselves up for the GPT. So, from now on we will change the code little by little and make it GPTable.

In [42]:
batch_size = 32       # n_samples
block_size = 8        # context_window
max_iters = 20_000    # total steps
eval_interval = 1_000 # interval at which we will print the loss
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32   ### Added new - for the extra layer ###

In [43]:
class BigramLM(nn.Module):
    """
    Changes list for this run:
    1. Removed the `vocab_size` constructor argument; the global is used instead.
    2. Added a new linear layer that maps the `n_embd`-sized embeddings
        to logits of size `vocab_size`.
    3. In the `forward` function, the logits now come from that linear layer (a matmul).
    
    """
    
    
    def __init__(self):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)  # added a layer
        
    def forward(self, idx, targets=None):
        '''
        Just take the index of the "current" character 
        and then use that to get the probability dist 
        for the next one using the `embedding_table`.
        '''

        tok_emb = self.embedding_table(idx) # B, T, n_emb
        logits = self.lm_head(tok_emb) # B, T, vocab_size
        
        if targets is None: 
            loss=None
        else:               
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
        
        
    def generate(self, idx, max_new_tokens):
        '''
        Take the index of the token and guess the next
        token based on the embeddings!
        '''
        for _ in range(max_new_tokens):
            # call the `forward` method 
            logits, loss = self(idx) # we can do this because we have inherited `nn.Module` :)
            
            # Take the very last token (8th in the context window)
            # and use its distribution to get the next token
            logits = logits[:, -1, :] ## refer: Change [2]

            # Convert the logits into the probabilities
            probs = F.softmax(logits, dim=-1) ## dim=-1: along the last dimension ~ here `1`

            # Take the next idx!
            next_idx = torch.multinomial(probs, num_samples=1)

            # Here we will ALWAYS append, there is NO shrinking!
            idx = torch.cat((idx, next_idx), dim=1)

        return idx

In [44]:
model = BigramLM()
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [45]:
# Total parameters now
# (`numel()` counts every scalar; `len()` would only return the size of the first dimension)
sum(p.numel() for p in model.parameters())

4225

In [46]:
print(model)

BigramLM(
  (embedding_table): Embedding(65, 32)
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)
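
A quick way to sanity-check a parameter count against the printed shapes: `numel()` counts every scalar in a tensor, whereas `len()` only returns the size of its first axis. A small sketch that rebuilds the two layers outside the model class (the `layers` container is only there for illustration):

```python
import torch.nn as nn

vocab_size, n_embd = 65, 32       # values printed earlier in the notebook

layers = nn.ModuleList([
    nn.Embedding(vocab_size, n_embd),   # 65 * 32 weights
    nn.Linear(n_embd, vocab_size),      # 32 * 65 weights + 65 biases
])

first_dims = sum(len(p) for p in layers.parameters())   # only the first-axis sizes
total = sum(p.numel() for p in layers.parameters())     # every scalar weight
print(first_dims, total)
```

Counting by hand gives `65*32 + 32*65 + 65`, which is what the `numel()` version adds up to.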


In [47]:
for step in range(max_iters): # increase number of steps for good results... 
    
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"[Step {step}]: Train Loss~{losses['train']:.4f}, Val Loss~{losses['val']:.4f}")
        break
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    

[Step 0]: Train Loss~4.2778, Val Loss~4.2879


Okay, works just fine.

# 🏛️ The architecture
I have been resisting the urge to put in the "famous" transformer image, but I can't stop myself now. Let's keep referring to the diagram for clarity.

<img src="./images/whole-trans.png">

## 🧹 Let's simplify it a bit.
So, we won't be using the `encoder - decoder` architecture here, which is generally used in **"seq2seq"** models such as the original translation Transformer or T5 *(BERT, by contrast, is encoder-only)*. GPT is a `decoder only` model: it keeps just the **right** part and is called a **causal language model**. There is no separate encoder input to attend to; it only looks at the past tokens and keeps generating new ones.

Let's update the architecture step by step.

<img src="./images/decoder-only.png">

## 😌 Alright, what a relief!
Let me show the clean image...

<img src="./images/decoder-only-cleaned.png">

😅 <br> Which turns out to be **just the same as the encoder**, except that it has the "masked" part, where a token cannot look at the tokens that come after it. We will clean and further annotate this architecture... but be sure that what you are looking at is the backbone of the GPT.

*(I have changed a few terms compared to the original architecture for simplicity)*.

# 🫀 Positions: The heart of LLMs
There is a famous pair of sentences: **"Aayush ate Pizza"** and **"Pizza ate Aayush"**. Do they mean the same thing?

Of course not.

- Just by changing the positions of the words, the meaning changes.
- Positions also take care of **how the words are connected**, which in turn takes care of the grammar.
- So, when we are **generating** text, we need to take care of the positions!

> *That means we need to keep track of the positions where each word/token occurs.*

### 🧐 Let's understand it...

```python
def __init__(self):
    super().__init__()
    self.embedding_table = nn.Embedding(vocab_size, n_embd)
    self.position_embedding_table = nn.Embedding(block_size, n_embd)        ### Added this ###
    self.lm_head = nn.Linear(n_embd, vocab_size)

def forward(self, idx, targets=None):
    B, T = idx.shape

    tok_emb = self.embedding_table(idx) # B, T, n_emb
    pos_emb = self.position_embedding_table(torch.arange(T, device=device)) ### Added this ###
    x = tok_emb + pos_emb                                                   ### Added this ###
    logits = self.lm_head(x) # B, T, vocab_size 
```
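
To see what the addition does, here is a small sketch (the table names and the sizes `B = 2`, `T = 5` are assumptions for illustration). It feeds the *same* token id at every position and shows that the vectors only become distinguishable once the position embeddings are added:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 65, 8, 32
B, T = 2, 5                                   # illustrative batch and sequence length

token_table = nn.Embedding(vocab_size, n_embd)
position_table = nn.Embedding(block_size, n_embd)

idx = torch.full((B, T), 7, dtype=torch.long) # the same token id everywhere
tok_emb = token_table(idx)                    # (B, T, n_embd)
pos_emb = position_table(torch.arange(T))     # (T, n_embd)
x = tok_emb + pos_emb                         # broadcast over the batch -> (B, T, n_embd)

print(x.shape)
print(torch.allclose(tok_emb[0, 0], tok_emb[0, 1]))  # identical without positions
print(torch.allclose(x[0, 0], x[0, 1]))              # distinct once positions are added
```

The `(T, n_embd)` position tensor is broadcast across the batch dimension, so every sequence in the batch gets the same position vectors.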

<img src="./images/old-network.png">

📍 And so now, we will need to keep track of the positions... **so later** when we want to ***flow*** the information from the past, we can leverage the positions!

## 🔥 And now... 

<img src="./images/with-positions-network.png">

> ✅ <br>Hope it is clear now... and we have now **successfully completed** this part of the architecture!

<img src="./images/positions-done.png" height=600 width=400>

# 😌 Great, done with the positions...
We have just understood the underlying structure... and **haven't** yet implemented that one in the model... we will, **but let's join another piece** of the puzzle ***self attention*** 🤳

___

# 🙇‍♂️🤝👁️ Self Attention: The Affinity
Alright, have this sentence in mind:

    "When they finally reached the scene of the plane crash, they were dead."
    
Are **they** who reached there dead? Or are **they** who were already there dead? <br>See how the pronoun **they** is used. It is confusing even to us, so to get the full meaning of the sentence the model needs to **keep track** of the inter-relations between the tokens: what refers to what.

> 🙇‍♂️ <br>And that is solved by **Self Attention**. <br> It lets the model focus on the specific, most important parts of the text, so it can keep the context between the tokens.

## 🪝Query — 🔑Key — 💰Value

Remember our softmax way?

```python
tril = torch.tril(torch.ones((T, T)))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril==0, float("-inf"))
F.softmax(wei, dim=-1) @ sample_xb
```

There we were defining `wei` (the **weights**) to be uniform, *all zeros*. But that was just for the concept. `wei` represents **"how interesting each token finds each of the previous tokens"** and **that's what we will find using 🪝🔑💰**!

So the whole thing is to ***find*** the `wei`. We will need to calculate it. So, let's start the journey.

# 🪝Query

> The "Hook". 

For each individual token we will create a vector that will be used to **match** against the keys of all the tokens ***including itself***.

> **Query vector**: *"What am I looking for?"*
> <br>— Andrej

# 🔑 Key

> The "Key". 

We will match the **query** of each token against the keys of all tokens. That is simply a matrix multiplication *(a dot product for every query-key pair)*.

> **Key vector**: *"What do I contain?"*
> <br>— Andrej

# 🪝 @ 🔑 Dot Product
Here all tokens will interact with each other which will give us the `wei`... but wait, before using that `wei` we will **mask** the future tokens!

*(don't worry, everything will be illustrated in a bit 🎨)*

# 💰 Value
> The "Information".

Why aren't we done once we have performed the dot product between the query and the key? I mean, we got the `wei` and that's it! We just found all the affinities between the tokens... so **we should be able to move forward with this extra information** and perform the last step: **apply it to the actual data / actual tokens**.

But no. We produce one more vector, the "value", to represent the actual tokens; and that is what finally gets multiplied by `wei` 😉

### 🤔 Side note, why do we need the "Value" vector?

ChatGPT has a good and satisfactory answer:

1. **Separation of Concerns**: The Transformer architecture <u>employs multi-head attention</u>, where multiple sets of key, query, and value matrices are learned. <u>Each of these sets focuses on different aspects or patterns</u> in the input sequence. By using separate "value" vectors, the <u>model can flexibly adjust</u> the way it combines information from the input sequence **without changing the original token embeddings**. This separation of concerns allows the model to learn richer and more diverse representations.

> 🍵 <br>*My thought, so can we introduce another vector? <br>Query, Key, Value and **MasterValues**? That will give more flexibility in multiple attention heads!*

2. **Weighted Sum**: After computing the attention scores, <u>the "value" vectors are used to compute a weighted sum</u>. The attention scores determine how much weight each "value" vector contributes to the output. This weighted sum operation effectively combines information from different parts of the input sequence to produce the final output. <u>If you used the original token embeddings directly, you wouldn't have this flexibility</u> to compute different weighted sums for different output tokens.

3. **Parameterization**: The "value" vectors are <u>learnable parameters</u>, which means the model can adaptively adjust them during training to better capture relevant information from the input sequence. In contrast, if you used fixed token embeddings, the model would be limited to a static representation of the input.

___

**TL;DR**: It is all about the learning flexibility.

### 🧐 Let's understand it...

```python
B, T, n_embd = 4, 3, 5
after_position_calculation = torch.rand(B, T, n_embd)

head_size = 8
query = nn.Linear(n_embd, head_size, bias=False)
key = nn.Linear(n_embd, head_size, bias=False)
value = nn.Linear(n_embd, head_size, bias=False)

q = query(after_position_calculation)
k = key(after_position_calculation)
v = value(after_position_calculation)
wei = q @ k.transpose(-2, -1)
```
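
The snippet stops at the raw affinities. As a hedged sketch of the remaining steps (same toy sizes; the `tril` mask and softmax follow the "softmax way" from earlier, and there is no scaling yet), this is how `wei` and the values combine into the head's output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, n_embd, head_size = 4, 3, 5, 8
x = torch.rand(B, T, n_embd)

query = nn.Linear(n_embd, head_size, bias=False)
key = nn.Linear(n_embd, head_size, bias=False)
value = nn.Linear(n_embd, head_size, bias=False)

q, k, v = query(x), key(x), value(x)              # each (B, T, head_size)
wei = q @ k.transpose(-2, -1)                     # (B, T, T) raw affinities

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))   # hide future positions
wei = F.softmax(wei, dim=-1)                      # each row sums to 1
out = wei @ v                                     # (B, T, head_size)

print(wei[0])
print(out.shape)
```

Each row of `wei[0]` is now a data-dependent set of weights over the current and previous tokens, instead of the uniform weights we got from `torch.zeros`.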

# Want some visuals? Seriously? Okay 🤨

<img src="./images/okay-sherlock.gif">

___

<img src="./images/QKV.png">

# 🏆🌟👏 Honorable Mention
Many visuals and simple explanations *(including the funny styles of drawing arrows)* are heavily inspired by our well-known **BAM!!**, Josh Starmer. His incredibly easy-to-follow video on [decoders](https://www.youtube.com/watch?v=bQ5BoolX9Ag&t) is highly recommended 🙌.

# 👨‍💻 Let's quickly implement it!
I can't wait to see the results!

In [35]:
''' The old stuff, just keeping here to refer :) '''

batch_size = 32       
block_size = 8        
n_embd = 32

max_iters = 20_000    
eval_interval = 1_000 
eval_iters = 200
learning_rate = 1e-3

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ---

head_size = n_embd ## For now, to avoid the dimensions mismatch below; but that will be changed.

In [36]:
class BigramLM(nn.Module):
    """
    Changes list for this run:
    1. Added the code for the positions embeddings
    2. Added `SA` head, self-attention head. That `Head` class is implemented
    below.
    3. Added a safe truncate in the `generate` method to avoid having text
    longer than the expected `block_size`
    """
    
    
    def __init__(self):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, n_embd)
        #🔻ADDED
        self.positions_embeddings = nn.Embedding(block_size, n_embd)
        
        #🔻ADDED | 🖊 REMAINING IMPLEMENTATION
        self.sa_head = Head(head_size)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    def forward(self, idx, targets=None):
        '''
        Changes:
        1. Added the forward for positions
        2. Code to "add" the positions embeddings with the token embeddings
        3. Passing that through the `sa_head`, which gives the final vector to pass to the `lm_head`
        '''
        #🔻ADDED
        B, T = idx.shape 
        

        tok_emb = self.embedding_table(idx) # B, T, n_emb
        #🔻ADDED
        positions_emb = self.positions_embeddings(torch.arange(T, device=device)) # T, n_emb
        #🔻ADDED
        x = tok_emb + positions_emb # B, T, n_emb
        #🔻ADDED | 🖊 REMAINING IMPLEMENTATION
        x = self.sa_head(x) # B, T, head_size
        logits = self.lm_head(x) # B, T, vocab_size
        
        if targets is None: 
            loss=None
        else:               
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
        
        
    def generate(self, idx, max_new_tokens):
        '''
        Take the index of the token and guess the next
        token based on the embeddings!
        '''
        for _ in range(max_new_tokens):
            #🔻ADDED
            idx_cond = idx[:, -block_size:] # make sure we only have size of `T`
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_idx), dim=1)

        return idx

In [37]:
class Head(nn.Module):
    """
    This class will simply create the Q, K, V vectors
    and also the register_buffer that holds the mask.
    
    Then on the `forward` it will pass the vectors in the 
    Q, K, V and give the `out`.
    """
    
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # What is this? Explained below.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size, device=device)))

    def forward(self, x):
        '''
        Take the `x` input, which will be the token embeddings plus the position embeddings.
        The shape will be B, T, C meaning:
        "For each batch, there will be T tokens, each encoded in C
        space"
        
        We will use that and work ourselves forward.
        '''
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

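        # Scaling controls the variance of the scores. Note: C is n_embd here, whereas the
        # paper scales by head_size**-0.5; the two coincide only when head_size == n_embd.
        wei = q @ k.transpose(-2, -1) * C**-0.5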
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf")) # the mask
        wei = F.softmax(wei, dim=-1) # the final wei

        out = wei @ v # this is what we will use further
        return out
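
Why scale the scores at all? A rough sketch, assuming unit-variance random `q` and `k` (the sizes are made up for illustration): the raw dot products have a variance of roughly `head_size`, and multiplying by `head_size ** -0.5` brings that back to roughly 1, which keeps the softmax from becoming too peaky.

```python
import torch

torch.manual_seed(0)
B, T, head_size = 64, 32, 16                 # illustrative sizes
q = torch.randn(B, T, head_size)             # unit-variance stand-ins for queries
k = torch.randn(B, T, head_size)             # unit-variance stand-ins for keys

raw = q @ k.transpose(-2, -1)                # (B, T, T) unscaled scores
scaled = raw * head_size ** -0.5             # scaled by 1 / sqrt(head_size)

raw_var = raw.var().item()
scaled_var = scaled.var().item()
print(raw_var, scaled_var)
```

Note that this sketch scales by the size of the key/query vectors, which is the dimension the dot product sums over.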

### The "buffers"?

> *register_buffer is a method in PyTorch that allows you to add a tensor to a module's state. This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm's running_mean is not a parameter, but is part of the module's state.*
> <br>— Bard
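
A tiny sketch to see the difference in practice (`MaskHolder` is a hypothetical module made up for this example, not part of the notebook): the buffer travels with the module's state, but the optimizer never sees it because it is not a parameter.

```python
import torch
import torch.nn as nn

class MaskHolder(nn.Module):          # hypothetical module, for illustration only
    def __init__(self, block_size):
        super().__init__()
        self.proj = nn.Linear(4, 4, bias=False)
        # stored on the module, moved by .to(device), saved in state_dict, never trained
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

m = MaskHolder(block_size=3)
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
state_keys = list(m.state_dict().keys())
print(param_names, buffer_names, state_keys)
```

So `tril` shows up among the buffers and in the `state_dict`, while `model.parameters()` (what we hand to `AdamW`) only contains the trainable weights.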


## 🚆 Training *(excited)*

In [65]:
model = BigramLM()
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [66]:
# Total parameters now
# (`numel()` counts every scalar; `len()` would only return the size of the first dimension)
sum(p.numel() for p in model.parameters())

7553

In [67]:
print(model)

BigramLM(
  (embedding_table): Embedding(65, 32)
  (positions_embeddings): Embedding(8, 32)
  (sa_head): Head(
    (query): Linear(in_features=32, out_features=32, bias=False)
    (key): Linear(in_features=32, out_features=32, bias=False)
    (value): Linear(in_features=32, out_features=32, bias=False)
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)


In [71]:
for step in range(max_iters): # increase number of steps for good results... 
    
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"[Step {step}]: Train Loss~{losses['train']:.4f}, Val Loss~{losses['val']:.4f}")
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

[Step 0]: Train Loss~2.3763, Val Loss~2.3915
[Step 1000]: Train Loss~2.3596, Val Loss~2.3879
[Step 2000]: Train Loss~2.3590, Val Loss~2.3772
[Step 3000]: Train Loss~2.3570, Val Loss~2.3756
[Step 4000]: Train Loss~2.3498, Val Loss~2.3761
[Step 5000]: Train Loss~2.3601, Val Loss~2.3789
[Step 6000]: Train Loss~2.3372, Val Loss~2.3651
[Step 7000]: Train Loss~2.3490, Val Loss~2.3687
[Step 8000]: Train Loss~2.3315, Val Loss~2.3698
[Step 9000]: Train Loss~2.3318, Val Loss~2.3681
[Step 10000]: Train Loss~2.3274, Val Loss~2.3652
[Step 11000]: Train Loss~2.3277, Val Loss~2.3557
[Step 12000]: Train Loss~2.3407, Val Loss~2.3744
[Step 13000]: Train Loss~2.3410, Val Loss~2.3667
[Step 14000]: Train Loss~2.3162, Val Loss~2.3588
[Step 15000]: Train Loss~2.3249, Val Loss~2.3655
[Step 16000]: Train Loss~2.3252, Val Loss~2.3684
[Step 17000]: Train Loss~2.3237, Val Loss~2.3779
[Step 18000]: Train Loss~2.3324, Val Loss~2.3586
[Step 19000]: Train Loss~2.3358, Val Loss~2.3594


## 🎉 Inference!

In [72]:
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), 
                          dtype=torch.long,
                          device=device),
        max_new_tokens=500)[0].tolist()
)
print(output)


Thins; sis kes shuk ble-mar, sates spaith but olk virescece:
O thers
At fishe,
Sed we aris il

ENCELONUS:
Beirak der, hry.


NG LARUDKII wongd ses amawshe. 
I thirear I IS:
Binond courm hir, byoras o'd;

Thathese haw Hel brno en:
Big wiflouth ff ma than:
Hraces chet, ofart, l's odl, ta he uat re. Ga Bon' inol s' I orrt burs blitherower lifl he puns foour out arsceard; wat ngoo.

Cur woury poour,
The,
I LIse thonoan dto.

POUS:

Mar stha pat ing: brain shown'ed.

MARKENORY :
Tour hehe won ou
'lll


> Okay, it is **being human**!

# 🧠👥👁 Multi-head attention
I like this name so much that it's the name of my router 🤗

<img src="./images/wifi.png">

Let's expand the model so that it can learn more relationships between the tokens!

In [38]:
class MultiHeadAttention(nn.Module):
    """
    This is simpler than you think!
    It will simply create `n` heads (each of them with separate Q, K, V settings),
    pass the `x` through each head separately, and then finally
    combine each `out` along the last dimension.
    """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        
    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)

<img src="./images/multihead-concat.png">

And so now the Bigram Model will be...

In [39]:
class BigramLM(nn.Module):
    """
    Changes list for this run:
    1. Added multi-head attention!
    """
    
    
    def __init__(self):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positions_embeddings = nn.Embedding(block_size, n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4) # to make shape workout with the next layer...
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    def forward(self, idx, targets=None):
        '''
        Changes:
        Nothing to change here.
        '''
        B, T = idx.shape 
        

        tok_emb = self.embedding_table(idx) # B, T, n_emb
        positions_emb = self.positions_embeddings(torch.arange(T, device=device)) # T, n_emb
        x = tok_emb + positions_emb # B, T, n_emb
        x = self.sa_heads(x) # B, T, n_emb (4 heads of n_embd//4, concatenated)
        logits = self.lm_head(x) # B, T, vocab_size
        
        if targets is None: 
            loss=None
        else:               
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
        
        
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_idx), dim=1)

        return idx

## 🚆 Training *(excitedx2)*

In [76]:
model = BigramLM()
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [77]:
# Total parameters now
# (`numel()` counts every scalar; `len()` would only return the size of the first dimension)
sum(p.numel() for p in model.parameters())

7553

> 📔 **Note** the same number of parameters!
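
The match is not a coincidence. A back-of-the-envelope sketch of the attention weights alone (plain arithmetic, using the notebook's `n_embd = 32` and 4 heads):

```python
n_embd, num_heads = 32, 4
head_size = n_embd // num_heads               # 8

# query, key and value are bias-free linear maps from n_embd to the head size
single_head = 3 * n_embd * n_embd                  # one head of size 32
multi_head = num_heads * 3 * n_embd * head_size    # four heads of size 8
print(single_head, multi_head)
```

Splitting one wide head into four narrow ones just slices the same weight budget into independent pieces, so the model gets several attention patterns for the same number of weights.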

In [78]:
print(model)

BigramLM(
  (embedding_table): Embedding(65, 32)
  (positions_embeddings): Embedding(8, 32)
  (sa_heads): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (query): Linear(in_features=32, out_features=8, bias=False)
        (key): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)


In [79]:
for step in range(max_iters): # increase number of steps for good results... 
    
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"[Step {step}]: Train Loss~{losses['train']:.4f}, Val Loss~{losses['val']:.4f}")
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

[Step 0]: Train Loss~4.2530, Val Loss~4.2557
[Step 1000]: Train Loss~2.5166, Val Loss~2.5187
[Step 2000]: Train Loss~2.3898, Val Loss~2.4064
[Step 3000]: Train Loss~2.3508, Val Loss~2.3434
[Step 4000]: Train Loss~2.2879, Val Loss~2.3015
[Step 5000]: Train Loss~2.2716, Val Loss~2.2902
[Step 6000]: Train Loss~2.2474, Val Loss~2.2635
[Step 7000]: Train Loss~2.2294, Val Loss~2.2553
[Step 8000]: Train Loss~2.2151, Val Loss~2.2483
[Step 9000]: Train Loss~2.1988, Val Loss~2.2501
[Step 10000]: Train Loss~2.1957, Val Loss~2.2268
[Step 11000]: Train Loss~2.1828, Val Loss~2.2167
[Step 12000]: Train Loss~2.1684, Val Loss~2.2066
[Step 13000]: Train Loss~2.1616, Val Loss~2.2336
[Step 14000]: Train Loss~2.1607, Val Loss~2.2092
[Step 15000]: Train Loss~2.1560, Val Loss~2.2192
[Step 16000]: Train Loss~2.1593, Val Loss~2.2219
[Step 17000]: Train Loss~2.1564, Val Loss~2.2150
[Step 18000]: Train Loss~2.1386, Val Loss~2.2128
[Step 19000]: Train Loss~2.1401, Val Loss~2.2023


## 🎉 Inference!

In [81]:
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), 
                          dtype=torch.long,
                          device=device),
        max_new_tokens=500)[0].tolist()
)
print(output)



MEONTERL:
Het.
You has to acust, Cis beight guard mesto way dand have drae
vire, nern ane wind set!

Gow'd the shatortu: porwer my as beree more.


Frod no aness! prours bee padty me we weirg onecloik is neat;
Butre of ith,
hind me you rone kings with at if, ther-dunickin shis bougr,
Youmy at's hinto hatene and alighavery a wirgl inhe bread ish gun in Donot nempto ne, be you.

Sony corthad din woo.

KINCE:
You in! the tham be crings you'l thour so rad;
Whhave he jred bre'stres.

S

MIRGOLORD:




> **Hell yeah! It is improved!**

# 🧼✨ The Feed Forward Simplicity

> ✅ <br>Hope it is clear now... and we have now **successfully completed** this part of the architecture!

<img src="./images/multihead-maskdone.png" height=600 width=400>

The next part that we will be implementing is the ***"Feed Forward Layer"***.

#### 🚀 Motivation of this layer
> *When we calculated the QKV and solved that multi head part, **we went way too fast** for calculating the probability for the next token. The tokens looked at each other but didn't have the time to **think on** the information that they have found from the other tokens.* <br> — Andrej

In [40]:
class FeedForward(nn.Module):
    """
    Just a single layer!
    """
    
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU()
        )
        
    def forward(self, x):
        return self.net(x)
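
One thing worth seeing here: the feed-forward layer works on every token **independently**. A minimal sketch (a bare `nn.Sequential` standing in for the class above, with made-up sizes) that compares running the whole sequence at once with running each position on its own:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, B, T = 32, 2, 8
net = nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU())

x = torch.randn(B, T, n_embd)
all_at_once = net(x)                                                   # (B, T, n_embd)
one_by_one = torch.stack([net(x[:, t, :]) for t in range(T)], dim=1)   # position by position
print(torch.allclose(all_at_once, one_by_one, atol=1e-6))
```

The communication between tokens happens only in the attention heads; this layer is where each token "thinks" on what it has gathered.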

In [41]:
class BigramLM(nn.Module):
    """
    Changes list for this run:
    1. Added Feed forward
    """
    
    
    def __init__(self):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positions_embeddings = nn.Embedding(block_size, n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4) 
        self.ffwd = FeedForward(n_embd) ### JUST ADDED THIS ###
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    def forward(self, idx, targets=None):
        '''
        Changes:
        Nothing to change here.
        '''
        B, T = idx.shape 
        

        tok_emb = self.embedding_table(idx) # B, T, n_emb
        positions_emb = self.positions_embeddings(torch.arange(T, device=device)) # T, n_emb
        x = tok_emb + positions_emb # B, T, n_emb
        x = self.sa_heads(x) # B, T, n_emb (4 heads of n_embd//4, concatenated)
        x = self.ffwd(x) ### AND THIS ###
        logits = self.lm_head(x) # B, T, vocab_size
        
        if targets is None: 
            loss=None
        else:               
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
        
        
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:] # make sure we only have size of `T`
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_idx), dim=1)

        return idx

## 🚆 Training *(excitedx3)*

In [84]:
model = BigramLM()
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [85]:
# Total parameters now
# (`numel()` counts every scalar; `len()` would only return the size of the first dimension)
sum(p.numel() for p in model.parameters())

8609

In [86]:
print(model)

BigramLM(
  (embedding_table): Embedding(65, 32)
  (positions_embeddings): Embedding(8, 32)
  (sa_heads): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (query): Linear(in_features=32, out_features=8, bias=False)
        (key): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
  )
  (ffwd): FeedForward(
    (net): Sequential(
      (0): Linear(in_features=32, out_features=32, bias=True)
      (1): ReLU()
    )
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)


In [87]:
for step in range(max_iters): # increase number of steps for good results... 
    
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"[Step {step}]: Train Loss~{losses['train']:.4f}, Val Loss~{losses['val']:.4f}")
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

[Step 0]: Train Loss~4.2171, Val Loss~4.2125
[Step 1000]: Train Loss~2.4522, Val Loss~2.4433
[Step 2000]: Train Loss~2.3255, Val Loss~2.3314
[Step 3000]: Train Loss~2.2723, Val Loss~2.2850
[Step 4000]: Train Loss~2.2333, Val Loss~2.2622
[Step 5000]: Train Loss~2.2139, Val Loss~2.2419
[Step 6000]: Train Loss~2.1865, Val Loss~2.2280
[Step 7000]: Train Loss~2.1735, Val Loss~2.2081
[Step 8000]: Train Loss~2.1510, Val Loss~2.1879
[Step 9000]: Train Loss~2.1450, Val Loss~2.1956
[Step 10000]: Train Loss~2.1317, Val Loss~2.1705
[Step 11000]: Train Loss~2.1270, Val Loss~2.1847
[Step 12000]: Train Loss~2.1116, Val Loss~2.1748
[Step 13000]: Train Loss~2.1069, Val Loss~2.1679
[Step 14000]: Train Loss~2.0946, Val Loss~2.1736
[Step 15000]: Train Loss~2.1029, Val Loss~2.1679
[Step 16000]: Train Loss~2.0843, Val Loss~2.1668
[Step 17000]: Train Loss~2.0794, Val Loss~2.1808
[Step 18000]: Train Loss~2.0897, Val Loss~2.1582
[Step 19000]: Train Loss~2.0780, Val Loss~2.1467


## 🎉 Inference!

In [88]:
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), 
                          dtype=torch.long,
                          device=device),
        max_new_tokens=500)[0].tolist()
)
print(output)


Shan con; but his frad,
Buch fim the's
tack knounk Is he to ince in king and thoulds:
Whice shat a:
Seare willafgut villsed ort?- RIZABENVINCESSTENCEYCUCKING Roful ahat to is ties be man orm leme!

I is and That mell to cmarwich Tho you shave give trer cach by hup ance to,
Tharss nour felf;
Bit tazen
Of to sprond wing me ver it wores.

HEOLARD:
In, and
Iron ther, shat what to grere cibrut Pagctyour tworseatud, no planesen that fiurathou aps, good the carce a noplet.

MHENCIO:
Sarn shond tate thu


> The model's still dumb 😅

> ✅ <br>One more lego piece is done now.

<img src="./images/feedfwd-done.png" height=600 width=400>

# 🧩 The "Add & Norm" with Residual Connections
Okay, now it is time to add the mysterious **Add & Norm** layers and to use the **residual connection**, which simply carries the original tokens' information over and adds it to the transformed information.

**Plan of attack**:
- First we will implement our decoder as given in the architecture, which **first** has the masked Multi-Head attention **and then** the Add & Norm layer.
- After checking the results with that, we will create a second version, the one shown in Andrej's lecture, in which we **first** use the Add & Norm layer **and then** the masked multi-head attention and so on.

So, let's go.

## ➕📊 What is "Add & Norm"?
**Add** is the residual connection (covered next) and **Norm** is *LayerNorm*. If you remember the *BatchNorm* layer, which **standardizes** the activations of a layer **using the mean and variance** computed across the samples in the batch, LayerNorm does the same kind of standardization, but it uses the mean and variance across the features of **each individual sample** (here, each token). So, there is no need to keep a running mean and variance, what a relief 😌
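
A minimal sketch of the "Norm" part, assuming PyTorch's `nn.LayerNorm` with its default settings (the sizes are made up): it compares the built-in layer with the same standardization written by hand over the last dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C = 2, 3, 32
x = torch.randn(B, T, C) * 5 + 2              # deliberately off-center and spread out

ln = nn.LayerNorm(C)                          # fresh layer: scale = 1, shift = 0
y = ln(x)

# The same thing by hand: statistics per token, over its C features
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + ln.eps)

print(torch.allclose(y, manual, atol=1e-5))
print(y.mean(dim=-1).abs().max().item())
```

Every token ends up with (approximately) zero mean and unit variance across its features, no matter what the other samples in the batch look like.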

## 🔗 Now, Residual connection?
When the network gets very deep, the issue of **vanishing gradients** occurs. That is, during backpropagation the gradients become so small that, even though they carry information, little to none of it reaches the early layers near the input because of the many layers in between. The gradient updates there become tiny.

To prevent this, **residual connections** (or **skip connections**) are one of the remedies: the addition gives the gradient a direct path back to the earlier layers. Let me paste the ChatGPT response on the problem itself for a clearer understanding:

> ### 🗒 
> *The problem arises when the gradients become **very small** as they are propagated backward through many layers. In some cases, they can become so close to zero that they **effectively carry little to no information** about how to update the parameters in the early layers of the network. <br> <br>When gradients vanish, it **becomes extremely challenging** to train the network effectively. Layers earlier in the network do not receive meaningful updates, and as a result, these layers learn very slowly or not at all. This is a significant obstacle in training deep neural networks because the **network's depth** is one of the key factors that can enable it to learn complex and hierarchical representations. <br> <br> The vanishing gradient problem is especially pronounced in networks that use certain activation functions (e.g., sigmoid or tanh) because these functions squash their input into a limited range, making it easier for gradients to vanish when their absolute values are small.*
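
To get a feel for it, here is a toy experiment, not the notebook's model: a stack of 30 small `tanh` layers (the helper `input_grad_norm`, the depth and the width are all assumptions for illustration). It measures how much gradient reaches the input with and without the skip connection.

```python
import torch
import torch.nn as nn

def input_grad_norm(use_residual, depth=30, dim=16):
    torch.manual_seed(0)                       # same weights and input for both variants
    layers = [nn.Linear(dim, dim) for _ in range(depth)]
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for layer in layers:
        update = torch.tanh(layer(h))
        h = h + update if use_residual else update   # skip connection vs. plain stack
    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(use_residual=False)
residual = input_grad_norm(use_residual=True)
print(plain, residual)
```

In the plain stack the gradient has to squeeze through every layer; with the residual version the `+` passes it straight through, so the input still receives a usable signal.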

## 🤔 And... projection layer?
Nobody told you about this? 😅 <br>
Alright, it is just a linear layer that brings the output back to the shape of the input (`n_embd`), so the input and output shapes **match**. Since shapes matter the most in NNs, we will need it.

**But when?**<br>
Actually, the code below implements the `residual connection` with the following lines:
```python
x = x + self.sa_heads(x)
x = x + self.ffwd(x) 
```

As you can see, the `x` goes through a couple of computations inside the `sa_heads` and `ffwd` layers, and when it comes out we **add** it to the original `x`. That is the residual connection, and there **we need to make sure** that the shapes match!

And for that we will implement the `proj` or **projection** layers in the `MultiHead` and `FFWD` :)

# 💻 Let's implement this (these)

In [42]:
class MultiHeadAttention(nn.Module):
    """
    Changes:
    1. Added a `projection` layer. Since we are using the residual connection.
    """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)  # relies on num_heads * head_size == n_embd

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # B, T, num_heads * head_size
        out = self.proj(out) # Called here...               # B, T, n_embd
        return out

### Wondering something? Anything?
If not, then you should. You should be pretty confused at this point: ***if `torch.cat()` concatenates all the head outputs along `dim=-1`, and the `proj` layer accepts an input of size `n_embd`, then how do the shapes work out?***

Well, in general they wouldn't, but since we create the layer as `self.sa_heads = MultiHeadAttention(4, n_embd//4)`, each head outputs `n_embd//4` channels, and concatenating `4` of them gives `n_embd` again 😄
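
A quick shape check of that claim, as a sketch: the random tensors below are stand-ins for the outputs of the real `Head` modules, and `B`, `T` are picked just for illustration.

```python
import torch

B, T = 2, 8                      # batch and time, chosen for illustration
n_embd, num_heads = 32, 4
head_size = n_embd // num_heads  # 8

# Stand-ins for the per-head outputs; each real Head returns (B, T, head_size).
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

out = torch.cat(head_outputs, dim=-1)   # concatenate along the channel axis
print(out.shape)                        # torch.Size([2, 8, 32])

# The projection's input width is num_heads * head_size; it equals n_embd
# only when n_embd is divisible by num_heads.
assert num_heads * head_size == n_embd
```

If `n_embd` were not divisible by the number of heads (say `30` with `4` heads), the concatenated width would be `4 * 7 = 28`, and an `nn.Linear(n_embd, n_embd)` projection would reject it.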

In [43]:
class FeedForward(nn.Module):
    """
    Changes:
    1. Added the `proj` layer here as well.
    """
    
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
            nn.Linear(n_embd, n_embd) ## The proj layer
        )
        
    def forward(self, x):
        return self.net(x)

### But according to the paper...
In the paper, the inner layer of the **Feed Forward** network is **`4` times wider** than the model dimension (`d_model = 512`, `d_ff = 2048`)... so just a small change below 👇

In [44]:
class FeedForward(nn.Module):
    """
    Changes:
    1. Added the `proj` layer here as well.
    2. Made the hidden layer `4` times wider than `n_embd`
    """
    
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4*n_embd),
            nn.ReLU(),
            nn.Linear(4*n_embd, n_embd) ## The proj layer
        )
        
    def forward(self, x):
        return self.net(x)
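
What does that factor of `4` cost? A small back-of-the-envelope sketch in plain Python; `linear_params` is a helper made up for this example, and it counts the weights and biases of the two `nn.Linear` layers in each `FeedForward` version.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

n_embd = 32  # same width as in this notebook

# Hidden width == n_embd (the first FeedForward version)
narrow = linear_params(n_embd, n_embd) + linear_params(n_embd, n_embd)

# Hidden width == 4 * n_embd (the second version)
wide = linear_params(n_embd, 4 * n_embd) + linear_params(4 * n_embd, n_embd)

print(narrow, wide)  # 2112 8352
```

So the wider hidden layer roughly quadruples the size of the feed-forward part, while its input and output shapes stay `n_embd`, which keeps the residual addition valid.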

In [45]:
class BigramLM(nn.Module):
    """
    Changes:
    1. Added the residual connections
    2. Added 2 layers for the layernorm.
    """
    
    
    def __init__(self):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positions_embeddings = nn.Embedding(block_size, n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4) 
        self.add_norm_1 = nn.LayerNorm(n_embd) ## ADDED THE LN 1 ##
        self.ffwd = FeedForward(n_embd)
        self.add_norm_2 = nn.LayerNorm(n_embd) ## ADDED THE LN 2 ##
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    def forward(self, idx, targets=None):
        '''
        Changes:
        1. Applied layernorm after the self-attention and ffwd.
        '''
        B, T = idx.shape 
        

        tok_emb = self.embedding_table(idx) # B, T, n_emb
        positions_emb = self.positions_embeddings(torch.arange(T, device=device)) # T, n_emb
        x = tok_emb + positions_emb         # B, T, n_emb
        
        ### RESIDUAL CONNECTIONS & LAYER NORM ###
        x = x + self.add_norm_1(self.sa_heads(x))  # B, T, n_emb
        x = x + self.add_norm_2(self.ffwd(x))      # B, T, n_emb
        logits = self.lm_head(x)             # B, T, vocab_size
        
        ### NOTE: Here the layernorm is applied AFTER the self-attention and ffwd
        ### (on the branch output, before the residual addition).
        ### In the later code we will apply it BEFORE them, as done in Andrej's lecture.
        
        if targets is None: 
            loss=None
        else:               
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
        
        
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:] # crop the context to the last `block_size` tokens
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_idx), dim=1)

        return idx

## 🚆 Training *(excitedx4)*

In [134]:
model = BigramLM()
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [135]:
# Total parameters now
# (numel() counts every element of a tensor; len() would only count its first dimension)
sum(p.numel() for p in model.parameters())

17089

In [136]:
print(model)

BigramLM(
  (embedding_table): Embedding(65, 32)
  (positions_embeddings): Embedding(8, 32)
  (sa_heads): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (query): Linear(in_features=32, out_features=8, bias=False)
        (key): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
    (proj): Linear(in_features=32, out_features=32, bias=True)
  )
  (add_norm_1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
  (ffwd): FeedForward(
    (net): Sequential(
      (0): Linear(in_features=32, out_features=128, bias=True)
      (1): ReLU()
      (2): Linear(in_features=128, out_features=32, bias=True)
    )
  )
  (add_norm_2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)
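
As a sanity check, the number of scalar weights can be tallied by hand from the shapes in the printout above. This is a sketch in plain Python that assumes, as the printout suggests, that each `Head` holds only the three bias-free linear layers; the `linear` helper is made up for this example.

```python
def linear(n_in, n_out, bias=True):
    """Number of scalars in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

vocab_size, block_size, n_embd, n_head = 65, 8, 32, 4
head_size = n_embd // n_head

counts = {
    "embedding_table": vocab_size * n_embd,
    "positions_embeddings": block_size * n_embd,
    "sa_heads.heads": n_head * 3 * linear(n_embd, head_size, bias=False),
    "sa_heads.proj": linear(n_embd, n_embd),
    "add_norm_1": 2 * n_embd,      # LayerNorm scale and shift
    "ffwd": linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd),
    "add_norm_2": 2 * n_embd,
    "lm_head": linear(n_embd, vocab_size),
}
for name, n in counts.items():
    print(f"{name:22s}{n:>7,d}")
total = sum(counts.values())
print(f"{'total':22s}{total:>7,d}")
```

Under these assumptions the total comes out at `17,089` scalars, with the feed-forward part being the single largest piece.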


In [137]:
for step in range(max_iters): # increase number of steps for good results... 
    
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"[Step {step}]: Train Loss~{losses['train']:.4f}, Val Loss~{losses['val']:.4f}")
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

[Step 0]: Train Loss~4.8423, Val Loss~4.8601
[Step 1000]: Train Loss~2.3000, Val Loss~2.2975
[Step 2000]: Train Loss~2.1771, Val Loss~2.2149
[Step 3000]: Train Loss~2.1228, Val Loss~2.1739
[Step 4000]: Train Loss~2.0763, Val Loss~2.1493
[Step 5000]: Train Loss~2.0567, Val Loss~2.1216
[Step 6000]: Train Loss~2.0303, Val Loss~2.1010
[Step 7000]: Train Loss~2.0041, Val Loss~2.0857
[Step 8000]: Train Loss~1.9928, Val Loss~2.0740
[Step 9000]: Train Loss~1.9830, Val Loss~2.0999
[Step 10000]: Train Loss~1.9806, Val Loss~2.0829
[Step 11000]: Train Loss~1.9561, Val Loss~2.0674
[Step 12000]: Train Loss~1.9727, Val Loss~2.0826
[Step 13000]: Train Loss~1.9700, Val Loss~2.0588
[Step 14000]: Train Loss~1.9542, Val Loss~2.0598
[Step 15000]: Train Loss~1.9397, Val Loss~2.0410
[Step 16000]: Train Loss~1.9485, Val Loss~2.0593
[Step 17000]: Train Loss~1.9275, Val Loss~2.0560
[Step 18000]: Train Loss~1.9273, Val Loss~2.0380
[Step 19000]: Train Loss~1.9097, Val Loss~2.0472


## 🎉 Inference!

In [138]:
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), 
                          dtype=torch.long,
                          device=device),
        max_new_tokens=500)[0].tolist()
)
print(output)


Qeth is is mountup know hence?

POMINIUS:
I MERCINIUS:
Oless Slaid back, gue: then as thent that and it them thouse? Gred, thou and stifne'er mothing-shack thy My gonitor we to thath of I wass,
Seen; fulce.
And newards, decands
That thear miolo you will lest Julor his we dead.

RICHARD IVERGAUOFFORUTUM:
Boh some, expicky, why.

VILLUS:
Go mablatciond Haviles a streigh
Or nunsman, print me our and would,
But I his with dable beasice thre?

Firswouls:
First:
Which light thou soven hands will?
But 


> 🙌 Alright, alright... roughly `2.04` validation loss this time... let's compare it with the **BEFORE** setting, where the LayerNorm is applied before the sub-layer instead of after it.

In [46]:
class BigramLM(nn.Module):
    """
    Changes:
    1. Used the LayerNorm BEFORE passing to the attention head or ffwd
    """
    
    
    def __init__(self):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positions_embeddings = nn.Embedding(block_size, n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4) 
        self.add_norm_1 = nn.LayerNorm(n_embd) ## ADDED THE LN 1 ##
        self.ffwd = FeedForward(n_embd)
        self.add_norm_2 = nn.LayerNorm(n_embd) ## ADDED THE LN 2 ##
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    def forward(self, idx, targets=None):
        '''
        Changes:
        1. Applied layernorm BEFORE the self-attention and ffwd.
        '''
        B, T = idx.shape 
        

        tok_emb = self.embedding_table(idx) # B, T, n_emb
        positions_emb = self.positions_embeddings(torch.arange(T, device=device)) # T, n_emb
        x = tok_emb + positions_emb         # B, T, n_emb
        
        ### RESIDUAL CONNECTIONS & LAYER NORM ###
        x = x + self.sa_heads(self.add_norm_1(x))  # B, T, n_emb
        x = x + self.ffwd(self.add_norm_2(x))      # B, T, n_emb
        logits = self.lm_head(x)             # B, T, vocab_size
    
        
        if targets is None: 
            loss=None
        else:               
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
        
        
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:] # crop the context to the last `block_size` tokens
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_idx), dim=1)

        return idx
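
The AFTER cell earlier and the BEFORE cell above differ only in where the LayerNorm sits. Here is a side-by-side sketch of the three placements worth knowing; the `nn.Linear` is just a stand-in for attention or feed-forward, and the tensor sizes are chosen for illustration. The paper's formula is `LayerNorm(x + Sublayer(x))`, which is slightly different from the AFTER cell in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 32
x = torch.randn(2, 8, n_embd)            # (B, T, C), sizes chosen for illustration

sublayer = nn.Linear(n_embd, n_embd)     # stand-in for attention or feed-forward
ln = nn.LayerNorm(n_embd)

post_norm_paper = ln(x + sublayer(x))    # paper: normalize the sum
after_variant = x + ln(sublayer(x))      # AFTER cell: only the branch output is normalized
pre_norm = x + sublayer(ln(x))           # BEFORE cell: normalize the branch input

for name, y in [("post-norm (paper)", post_norm_paper),
                ("after (this notebook)", after_variant),
                ("pre-norm", pre_norm)]:
    print(f"{name:22s} shape={tuple(y.shape)}  mean={y.mean().item():+.3f}  std={y.std().item():.3f}")
```

All three keep the `(B, T, n_embd)` shape. In the pre-norm version the `x` on the skip path is never touched by a LayerNorm, so the identity path stays clean from the first block to the last.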

## 🚆 Training *(excitedx5)*

In [140]:
model = BigramLM()
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [141]:
# Total parameters now
# (numel() counts every element of a tensor; len() would only count its first dimension)
sum(p.numel() for p in model.parameters())

17089

In [142]:
print(model)

BigramLM(
  (embedding_table): Embedding(65, 32)
  (positions_embeddings): Embedding(8, 32)
  (sa_heads): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (query): Linear(in_features=32, out_features=8, bias=False)
        (key): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
    (proj): Linear(in_features=32, out_features=32, bias=True)
  )
  (add_norm_1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
  (ffwd): FeedForward(
    (net): Sequential(
      (0): Linear(in_features=32, out_features=128, bias=True)
      (1): ReLU()
      (2): Linear(in_features=128, out_features=32, bias=True)
    )
  )
  (add_norm_2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)


In [143]:
for step in range(max_iters): # increase number of steps for good results... 
    
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"[Step {step}]: Train Loss~{losses['train']:.4f}, Val Loss~{losses['val']:.4f}")
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

[Step 0]: Train Loss~4.5303, Val Loss~4.5256
[Step 1000]: Train Loss~2.3502, Val Loss~2.3594
[Step 2000]: Train Loss~2.2468, Val Loss~2.2464
[Step 3000]: Train Loss~2.1776, Val Loss~2.2182
[Step 4000]: Train Loss~2.1407, Val Loss~2.1836
[Step 5000]: Train Loss~2.1202, Val Loss~2.1585
[Step 6000]: Train Loss~2.0903, Val Loss~2.1525
[Step 7000]: Train Loss~2.0673, Val Loss~2.1279
[Step 8000]: Train Loss~2.0669, Val Loss~2.1269
[Step 9000]: Train Loss~2.0463, Val Loss~2.1049
[Step 10000]: Train Loss~2.0382, Val Loss~2.1196
[Step 11000]: Train Loss~2.0221, Val Loss~2.0859
[Step 12000]: Train Loss~2.0089, Val Loss~2.0800
[Step 13000]: Train Loss~2.0011, Val Loss~2.0816
[Step 14000]: Train Loss~2.0011, Val Loss~2.0871
[Step 15000]: Train Loss~1.9832, Val Loss~2.0807
[Step 16000]: Train Loss~1.9782, Val Loss~2.0556
[Step 17000]: Train Loss~1.9717, Val Loss~2.0672
[Step 18000]: Train Loss~1.9706, Val Loss~2.0741
[Step 19000]: Train Loss~1.9729, Val Loss~2.0741


> 🙌 In the BEFORE norm setting, the validation loss is slightly larger (`2.07`), but we can continue 😄

## 🎉 Inference!

In [144]:
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), 
                          dtype=torch.long,
                          device=device),
        max_new_tokens=500)[0].tolist()
)
print(output)



AUF YVOLYCUS HENVO:
I saymence it og lethen?

MARIOLA:
Ats,
Thus own with sea to
I'll Permen you, threks lirjere to bay them,
Rome a grove;
What witter nowere
But
Leard?

Cleed: enrows.

ips presseronged plaing.
Harry riscancy, 'temporn to this moulds in wells to know:
That, linestraw Leetury in senver birnerows toothere forter woo verown:
O. this bour praisont my your grasans, an of man Ranancends shad dee-tanntermervount mague dome of it, I preived's in his sagivesty, and thou e frabe!
Ands d


#### 🤔👍 We are almost done, what'd you say?
<img src="./images/residual-layrnorm-done.png" height=600 width=400>

# 🧱 Blocks
Let's focus on the mysterious **`Nx`** part in the diagram. It simply shows *"how many blocks"*, i.e. *"how many replicas"* of the given **"encoder" or "decoder"** layer we want to stack sequentially in the model.

In the paper:

> The encoder stack: Nx identical encoder layers (in the original paper Nx = 6) <br>
> The decoder stack: Nx identical decoder layers (in the original paper Nx = 6)

That means we can stack `N` of those blocks, each with its own multi-head attention and feed-forward weights, so that every block can refine the representation produced by the previous one. Thus, the result will be...

In [63]:
class BigramLM(nn.Module):
    """
    Changes:
    1. Moved all block-related parts to a separate `Block` class
    2. Added the Nx blocks in the init and also in the forward
    3. Added a LayerNorm AFTER the blocks (which is not in the paper but used in practice with pre-norm blocks)
    """
    
    
    def __init__(self):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positions_embeddings = nn.Embedding(block_size, n_embd)
        
        # self.sa_heads = MultiHeadAttention(n_head, head_size)  # [DEL -] 
        # self.add_norm_1 = nn.LayerNorm(n_embd)                 # [DEL -] 
        # self.ffwd = FeedForward(n_embd)                        # [DEL -] 
        # self.add_norm_2 = nn.LayerNorm(n_embd)                 # [DEL -] 
        
        self.blocks = nn.Sequential( # 3 Blocks
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            nn.LayerNorm(n_embd)                                 # [ADD +]
        )                                                        # [ADD +]
        
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
        
    def forward(self, idx, targets=None):
        '''
        Changes:
        1. Removed all block related forward calls to the block class
        '''
        B, T = idx.shape 
        

        tok_emb = self.embedding_table(idx) 
        positions_emb = self.positions_embeddings(torch.arange(T, device=device))
        x = tok_emb + positions_emb         
        
        # x = x + self.sa_heads(self.add_norm_1(x))  # [DEL -] 
        # x = x + self.ffwd(self.add_norm_2(x))      # [DEL -] 
        x = self.blocks(x)                           # [ADD +]
        logits = self.lm_head(x)             
    
        
        if targets is None: 
            loss=None
        else:               
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
        
        
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:] # crop the context to the last `block_size` tokens
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_idx), dim=1)

        return idx

In [64]:
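class Block(nn.Module):
    """
    One transformer block: multi-head self-attention followed by a
    feed-forward network, each wrapped in a pre-norm residual connection.
    """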
    
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa_heads = MultiHeadAttention(n_head, head_size) 
        self.add_norm_1 = nn.LayerNorm(n_embd)
        self.ffwd = FeedForward(n_embd)
        self.add_norm_2 = nn.LayerNorm(n_embd)

        
    def forward(self, x):
        x = x + self.sa_heads(self.add_norm_1(x))  # B, T, n_embd
        x = x + self.ffwd(self.add_norm_2(x))      # B, T, n_embd
        return x
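
The three hand-written `Block(...)` lines in the model could also be generated in a loop, so that the depth becomes a single number. A minimal sketch of that idea: `TinyBlock` is a made-up stand-in (so the snippet runs on its own), and `n_layer` is a hypothetical name playing the role of `Nx`.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Hypothetical stand-in for a transformer block: pre-norm + residual around one linear layer."""
    def __init__(self, n_embd):
        super().__init__()
        self.norm = nn.LayerNorm(n_embd)
        self.ffwd = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return x + self.ffwd(self.norm(x))   # output shape equals input shape

n_embd, n_layer = 32, 3                       # n_layer plays the role of Nx (hypothetical name)

# Each list element is a fresh module, so the blocks do not share weights.
blocks = nn.Sequential(*[TinyBlock(n_embd) for _ in range(n_layer)])

x = torch.randn(2, 8, n_embd)
y = blocks(x)
print(len(blocks), y.shape)                   # 3 torch.Size([2, 8, 32])
```

Stacking works only because every block maps `(B, T, n_embd)` to `(B, T, n_embd)`, which is exactly what the residual connections and projection layers guarantee.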

## 🚆 Training *(excitedx6)*

In [65]:
model = BigramLM()
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [66]:
# Total parameters now
# (numel() counts every element of a tensor; len() would only count its first dimension)
sum(p.numel() for p in model.parameters())

42369

In [68]:
print(model)

BigramLM(
  (embedding_table): Embedding(65, 32)
  (positions_embeddings): Embedding(8, 32)
  (blocks): Sequential(
    (0): Block(
      (sa_heads): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
          )
        )
        (proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (add_norm_1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (ffwd): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=32, out_features=128, bias=True)
          (1): ReLU()
          (2): Linear(in_features=128, out_features=32, bias=True)
        )
      )
      (add_norm_2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
    )
    (1): Block(
      (sa_heads): MultiHeadAttention(
        (heads): ModuleList(


In [69]:
for step in range(max_iters): # increase number of steps for good results... 
    
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"[Step {step}]: Train Loss~{losses['train']:.4f}, Val Loss~{losses['val']:.4f}")
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

[Step 0]: Train Loss~4.3752, Val Loss~4.3738
[Step 1000]: Train Loss~2.2586, Val Loss~2.2873
[Step 2000]: Train Loss~2.1373, Val Loss~2.2000
[Step 3000]: Train Loss~2.0652, Val Loss~2.1246
[Step 4000]: Train Loss~2.0116, Val Loss~2.1000
[Step 5000]: Train Loss~1.9538, Val Loss~2.0740
[Step 6000]: Train Loss~1.9438, Val Loss~2.0454
[Step 7000]: Train Loss~1.9454, Val Loss~2.0718
[Step 8000]: Train Loss~1.9132, Val Loss~2.0306
[Step 9000]: Train Loss~1.8878, Val Loss~2.0215
[Step 10000]: Train Loss~1.8894, Val Loss~2.0175
[Step 11000]: Train Loss~1.8974, Val Loss~2.0193
[Step 12000]: Train Loss~1.8737, Val Loss~1.9918
[Step 13000]: Train Loss~1.8565, Val Loss~1.9852
[Step 14000]: Train Loss~1.8708, Val Loss~2.0090
[Step 15000]: Train Loss~1.8533, Val Loss~2.0025
[Step 16000]: Train Loss~1.8564, Val Loss~1.9897
[Step 17000]: Train Loss~1.8302, Val Loss~1.9795
[Step 18000]: Train Loss~1.8402, Val Loss~1.9841
[Step 19000]: Train Loss~1.8339, Val Loss~1.9811


> 🙌 It's freaking `1.98` !!

## 🎉 Inference!

In [71]:
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), 
                          dtype=torch.long,
                          device=device),
        max_new_tokens=500)[0].tolist()
)
print(output)



F YORK:
I waves that what statell in meal where deather,
And lobdeer-sence.
Why to son,
Firstroke houre the aus shall by mysake, in any where waite:
Hant:
The for:
Rispuress.

DUKE VIOLANUS:
Than they be from flevisones? This lord that you powife-'er coperany he be,
It the for-I tender a bloigh, and him
hat way vither storr'd my beritutes, that thee's ther: less offener curdent a premith rine it peris
Fraise: I still in manish'd have poweep, and men to matter: an a gow.
I'll gracition,
At her t


# 🚪 The final diagram is...
<img src="./images/blocks-done.png" height=600 width=400>

> ## ⚠
> **PLEASE NOTE**: Imagine that the **Norm** part of **Add & Norm** comes before the multi-head attention and feed-forward parts (the **Add** still happens after them). I have not updated the diagram, for simplicity.

# 😲 Oh, Em, Gee
We have just developed the GPT. **Let's meet in the next book** and expand the network to achieve the final result, one that spits out actual Shakespeare.