 This notebook is my notes and codes of the youtube video: https://www.youtube.com/watch?v=kCc8FmEb1nY

## Papers: 
Attention is all you need: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Deep Residual Learning: https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf

Dropout: https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf

In [1]:
#Import libraries

import pandas as np
import os
import torch
import torch.nn.functional as F #To OneHotEncode
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
text = open('../input/shakespeare-tiny-input/input.txt','r').read()

In [3]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [4]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [5]:
#Create a mapping from characters to integers

stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] #Encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) #decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


In [6]:
#lets encode the entire text dataset
data = torch.tensor(encode(text), dtype = torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [7]:
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

In [8]:
block_size = 8 #The block_size delimits the size of the chunk of text that we will use to train
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [9]:
#We will use all sizes of input that is lower than block_size

x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When input is {context} the target: {target}")

When input is tensor([18]) the target: 47
When input is tensor([18, 47]) the target: 56
When input is tensor([18, 47, 56]) the target: 57
When input is tensor([18, 47, 56, 57]) the target: 58
When input is tensor([18, 47, 56, 57, 58]) the target: 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [10]:
torch.manual_seed(1337)
batch_size = 4 #how many independent sequences will we process in parallel?
block_size = 8 #what is the maximum context lenght for predictions?

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('outputs')
print(yb.shape)
print(yb)

print('------')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"When input is {context.tolist()} the target: {target}")
        
        #We will not use all the possible context (like, 43 -> target 58). 
        #We increase the context but always starting at the begin

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
outputs
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
------
When input is [24] the target: 43
When input is [24, 43] the target: 58
When input is [24, 43, 58] the target: 5
When input is [24, 43, 58, 5] the target: 57
When input is [24, 43, 58, 5, 57] the target: 1
When input is [24, 43, 58, 5, 57, 1] the target: 46
When input is [24, 43, 58, 5, 57, 1, 46] the target: 43
When input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
When input is [44] the target: 53
When input is [44, 53] the target: 56
When input is [44, 53, 56] the target: 1
When input is [44, 53, 56, 1] the target: 58
When input is [44, 53, 56, 1, 58] the target: 46
When input is [44, 5

## Bigram BaseLine

In [11]:
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    
    def __init__(self,vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
        # nn.Embedding creates a 2d matrix with the logits values. It started with random number!!
    
    def forward(self, idx, targets = None):  # THIS CALLS __call__ method. So, m(xb, yb) = m.__call__(xb,yb)
        
        #idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) #(B,T,C) - Batch, time, channel
        
        # logits have a shape of [4,8,65] -> For each [4,8] characters, it returns the 65 logits values.
        # We need to change the dimension for the cross_entropy call
        
        if targets is None:
            loss = None
            
        else:      
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)   
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B,T) array of indices in the current context
        
        for _ in range(max_new_tokens):
            #Get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # Becomes (B,C) ---> the -1 is to use only the last character - its a bigram
            #apply softmas to get probabilites
            probs = F.softmax(logits, dim = -1) #(B,C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples =1) #(B,1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) #(B, T+1)
        return idx
    
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(loss)

print(decode(m.generate(idx = torch.zeros((1,1), dtype = torch.long), max_new_tokens=100)[0].tolist()))
#With an uniform distribution, the loss would be -ln(1/65) = 4.17

tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


In [12]:
#Create a pytorch optmizer
optimizer = torch.optim.AdamW(m.parameters() , lr = 1e-3)

In [13]:
batch_size = 32
for steps in range(10000):
    
    #sample a batch of data
    xb, yb = get_batch('train')
    
    #evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none = True)
    loss.backward()
    optimizer.step()
    
print(loss.item())

2.5727508068084717


In [14]:
print(decode(m.generate(idx = torch.zeros((1,1), dtype = torch.long), max_new_tokens=300)[0].tolist()))


Iyoteng h hasbe pave pirance
Rie hicomyonthar's
Plinseard ith henoure wounonthioneir thondy, y heltieiengerofo'dsssit ey
KIN d pe wither vouprrouthercc.
hathe; d!
My hind tt hinig t ouchos tes; st yo hind wotte grotonear 'so it t jod weancotha:
h hay.JUCle n prids, r loncave w hollular s O:
HIs; ht 


## Notes

- We can run on CUDA or GPU -> CUDA is better

    - hyperparameter device = 'cuda' if torch.cuda.is_avaiable() else 'cpu'
    - If we use CUDA, we need to assure to move the data x,y = x.to(device), y.to(device). Also, when we create the model, we want to move to device m = model.to(device)
    - Also, we need to create the variable on the device to generate : context = torch.zeros((1,1), dtype = torch.long, device = device)
    
- We can set the model to evaluation "model.eval()" or to training "model.train()"


- If we calculate the loss for each batch, it will be a little noise. It's better to use a estimative error, averaging the loss with respect to the batches. He used a function called "estimate_loss()". He used a @torch.no_grad() at the beggining of the function to tell pytorch to not compute the backprop there


## The mathematical trick in self-attention

In [15]:
torch.manual_seed(1337)
B,T,C = 4, 8, 2
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

### Notes

Time = number of characters used to generate the next one -> Each time that we increase the context, the "time" of x, increases
Channels = "information" for each element of (batch, time) (i.e., logits, one-hot-encode)
Batch = batch size

In this tensor form, all the elements of T don't "talk" to each other. We would like to couple them to mix the information

### Rules
- The token in the fifth position should not talk to the tokens at 6, 7, 8 positions, ONLY with 1, 2, 3 and 4.
- We want to mix the information on these channels (like taking the average of all previous steps)

- Using the mean is a weak way to mix the infortation, because we lose a lot of information about the spacial distance.

In [16]:
# But lets start taking the average

xbow = torch.zeros((B,T,C))

for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1]
        xbow[b,t] = torch.mean(xprev,0) #taking the average of all previous characters (taking a bag of words)

In [17]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [18]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [19]:
#A more efficient way to do this averaging

#Lets create a lower triangule with ones
a = torch.tril(torch.ones(3,3))  #tril returns the lower triangule of the 2d Array
# a @ matrix = c  -> c_ij = sum(matrix_kj), k < = i,  (sum until element)

#Divide the 1's to create an average matrix multiplication
a = a/torch.sum(a,1, keepdim = True)
# now, a is equivalent to the double for's from the previous chunk of code

In [20]:
#A more efficient way to do this averaging
wei = torch.tril(torch.ones(T, T))   
wei = wei/torch.sum(wei,1, keepdim = True)  #Average weights matrix

xbow2 = wei @ x #(T,T) @ (B, T, C) ---> it creates a Batch dimension to wei -> (B,T,C)

(xbow2-xbow).abs().max()

tensor(3.2363e-08)

In [21]:
#Version 3: Use SoftMax
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf')) #replace the coordinates of tril == 0 is true with , float('-inf')
wei = F.softmax(wei, dim = -1) #The softmax exponentiate the value, so -inf  becomes zero, and the other values get normalized 

xbow3 = wei @ x

# It have some interesting behaviors if we put some learning in wei
# Its calculate how long should we go to the past.
# the '-inf' where tril == 0 assures that we don't use information from the future

(xbow3-xbow).abs().max()

tensor(3.2363e-08)

## Self attention
- We want to train the wei matrix!
- Each token/node emits 2 vectors: One "query" (what I'm looking for) and one "key" (what do I contain).
- The way do affinities (the mix of information) -> we do a dot product of keys and queries.  My query (dot product) with all the keys of all the other tokens becomes "wei"
- If query.key are align, they will interact with a very high amount, so I get to know more about these tokens!!

In [22]:
#Version 4: Self-attention
B,T,C = 4, 8, 32  #We encode each character into a 32D space
x = torch.randn(B,T,C)

#let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias = False) #Channel is the input, the output is head_size shapped: C @ W_key = keys
query = nn.Linear(C, head_size, bias = False) #C @ W_query = query 
value = nn.Linear(C, head_size, bias = False)

#Each vector has the same shape! (the same axies?). That's how we ensure that the dot product can be applied

k = key(x) #(B, T, 16)
q = query(x) #(B, T, 16)  #No information is crossed in these lines

wei = q @ k.transpose(-2,-1)  # we need to transpose the last two dimensions 0 -> (B,T,16) @ (B,16,T) ->> (B,T,T)
wei*head_size**(-0.5)  # We do this to initiallize better. Otherwise, we don't control the variance. 
#Also, softmax is more less sensible if all the values are scalated, and the output tends to go with the highest one

tril = torch.tril(torch.ones(T,T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim = -1)

v = value(x) #We don't aggregate the wei direct to x, we first pass x into v (values), and we aggregate the values
out = wei @ v

out.shape

torch.Size([4, 8, 16])

## Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights. (In a 8 block size: token 2 talks with token 1 and 2. Token 3 sees token 1,2 and 3).
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across batch dimension is of course processed completely independently and never "talk" to each other
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additional divides `wei` by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below

## Why Multi-Head Self-Attention is More Efficient

### Concept of Multi-Head Self-Attention:

   - Multi-head self-attention runs multiple attention mechanisms in parallel, each with different projections of query (Q), key (K), and value (V) vectors.
   - The outputs of these attention heads are concatenated and projected to form the final output.

### Advantages of Multi-Head Self-Attention:

    - Capturing Multiple Representations:
        - Diversity of Attention: Each head can focus on different parts of the input sequence, capturing various aspects such as grammatical and semantic relationships.
        - Reduced Redundancy: Multiple heads capture complementary information, reducing redundancy in the representation of relationships.

    - Improved Learning of Complex Patterns:
        - Hierarchical Patterns: Different heads can learn different hierarchical and granular patterns within the data.
        - Focus on Different Positions: Each head can attend to different parts of the input, combining multiple contexts simultaneously.

    - Reduced Dimension of Each Head:
        - Lower Dimension, Lower Cost: Operating on smaller dimensions can be computationally less expensive than a single larger dimension.
        - Efficient Parallelization: Multi-head attention allows more efficient parallel operations, better leveraging modern computational resources.
        
Lets rewrite the code to use multi-head s.a

(initial loss, without SA: 2.5727508068084717

In [23]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        #wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

In [24]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])  #Creates a module - list of layers - with head
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)   #Concatenate all the heads outputs
        return out
    
class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

In [25]:
#Lets update the Bigram


class BigramLanguageModel(nn.Module):
    
    def __init__(self,vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd) #Both position_emb.. and token_emb live in the same vector - "world"
        self.sa_heads = MultiHeadAttention(4, n_embd//4)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.ffw = FeedForward(n_embd)

    def forward(self, idx, targets = None):  # THIS CALLS __call__ method. So, m(xb, yb) = m.__call__(xb,yb)
        #idx and targets are both (B,T) tensor of integers
        B,T = idx.shape
        
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device = 'cpu')) # (T,C)
        
        x = tok_emb + pos_emb # (B,T,C)  #Lets mix the information of the character with the position
        x = self.sa_heads(x) # (B,T,C)   -> In this layer, we didn't cross information between the tokens. The input only depends on x
        x = self.ffw(x) # So, we use this feedforward layer to cross the information!!!  
        logits = self.lm_head(x) # (B,T,vocab_size)
        
        
        if targets is None:
            loss = None
        else:      
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B,T) array of indices in the current context
        
        for _ in range(max_new_tokens):
            #crop idc to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # Becomes (B,C) ---> the -1 is to use only the last character - its a bigram
            #apply softmas to get probabilites
            probs = F.softmax(logits, dim = -1) #(B,C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples =1) #(B,1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) #(B, T+1)
        return idx

In [26]:
#Lets optmize

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out



n_embd = 32
model = BigramLanguageModel(vocab_size)
    
# create a PyTorch optimizer
learning_rate = 1e-3
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
max_iters = 5000
eval_interval = 250 #Frequency of evalutation
eval_iters = 200
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.1552, val loss 4.1576
step 250: train loss 2.8820, val loss 2.9006
step 500: train loss 2.6657, val loss 2.6614
step 750: train loss 2.5754, val loss 2.5611
step 1000: train loss 2.5226, val loss 2.5053
step 1250: train loss 2.4673, val loss 2.4684
step 1500: train loss 2.4377, val loss 2.4384
step 1750: train loss 2.4157, val loss 2.4169
step 2000: train loss 2.3932, val loss 2.3954
step 2250: train loss 2.3621, val loss 2.3718
step 2500: train loss 2.3593, val loss 2.3751
step 2750: train loss 2.3367, val loss 2.3456
step 3000: train loss 2.3351, val loss 2.3362
step 3250: train loss 2.3209, val loss 2.3197
step 3500: train loss 2.2993, val loss 2.3183
step 3750: train loss 2.2867, val loss 2.3198
step 4000: train loss 2.2846, val loss 2.3025
step 4250: train loss 2.2679, val loss 2.2913
step 4500: train loss 2.2771, val loss 2.2917
step 4750: train loss 2.2727, val loss 2.2959
step 4999: train loss 2.2643, val loss 2.2787


In [27]:
# Loss = 2.24 way lower than before
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(model.generate(context, max_new_tokens=300)[0].tolist()))


LEY:
Why hopleteen lard do-ude fret Corfell, prrou hon of is my wonaise ing!

prtre robusithere wen thame pre.

IUUSRIZY ICI:
Of sarden:
Baye blaves afpe, his mas blosht, one!
SUSORCESTUCEORUS:
Tou giodd in
Andsy amy uly ser les.

II alllll vang the bary thiong ming?

QEY:
UCLAB:
The hirkaye.

Wamis


## Extras to the model

![](https://miro.medium.com/v2/resize:fit:856/1*ZCFSvkKtppgew3cc7BIaug.png)

# Lets take a look at the transform diagram. 

## 1) Blocks
We may notice that there is a block that repeats itself:
- Multi-head Attention + Add & Norm + Multi-head Attention + Add & Norm + Feed Forward + Add & Norm

The block is basically -> Communication + Computation
Communication = MultiHead Attention
Computation = Feed Forward

The Multi-head attention and feed forward were already implemented. But we need to go further!

## 2) Residual Connections
Before each multi-head attention, the input splits and it is added at the end. This is the RESIDUAL CONNECTIONS method, in which we add the transformed data to the input:

> x  -> x + residual(x)

> with: residual = feedforward(attention (x))

this is usefull because when we backpropagate, the supervision/gradient from the loss hop (jumps) through every addition node, directly to the input. This helps extremelly the optimization process. Usually, the residual blocks are initialize in a way that they contribute very little to the gradient.

For this, we need to add a linear layer after the multihead and feed forward layers to add to the input. The output of these layers should be the same dimension of the input!

## 3) LayerNorm (Add & Norm layer)
It's very simmilar to batch normalization. But instead of normalizing with respect to the batch dimension, we normalize with respect of the time (tokens). But, we don't need to keep track of batch mean and variance.

Nowadays its more commom to add the layernorm before the SA layers.

## 4) Dropout Layers
To scale the model, we should use techniques to avoid overfitting. 

# Final Model:

In [28]:
#SET hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 64 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 250
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 192
n_head = 6
n_layer = 6
dropout = 0.2
# ------------

In [29]:
#PREPARE THE DATA

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [30]:
#Using the average loss of epochs

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

## Defining the Classes that will be used

In [31]:
# One Single Head Class:

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

In [32]:
#MultiHead Attention block --> Calls the single head
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)]) 
        self.proj = nn.Linear(n_embd, n_embd) #FOR THE RESIDUAL CONNECTION 
        self.dropout = nn.Dropout(dropout) # For regularization

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

In [33]:
#FeedFoward -> The compute part of the transform

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  #In the paper of the residual connection, they expanded this lawer by a factor of 4
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  #Now we apply the residual connection, but shrinking the dimension.
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

In [34]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x)) #communication -> We add the Layer norm BEFORE self attention.
        x = x + self.ffwd(self.ln2(x)) #computation
        return x

## Model Class

In [35]:
class chatGPT(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

## Time to run! Optimization + Generation

In [36]:
model = chatGPT()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

2.703425 M parameters
step 0: train loss 4.3672, val loss 4.3709
step 250: train loss 2.4204, val loss 2.4262
step 500: train loss 2.2235, val loss 2.2456
step 750: train loss 2.0765, val loss 2.1133
step 1000: train loss 1.9399, val loss 2.0150
step 1250: train loss 1.8575, val loss 1.9574
step 1500: train loss 1.7876, val loss 1.9005
step 1750: train loss 1.7283, val loss 1.8634
step 2000: train loss 1.6863, val loss 1.8368
step 2250: train loss 1.6466, val loss 1.8057
step 2500: train loss 1.6266, val loss 1.7913
step 2750: train loss 1.5965, val loss 1.7614
step 2999: train loss 1.5687, val loss 1.7330

God fair-night it: if thou sauful wit;
But do the heart-fath, and lave henry as as this:
In royal of hird be mear.

CAMPSILLO:
As shows do in to be, to very and Romean us;
Is no hands are at py Herefold. 
AGRELIZ:
Good, fond is ind years lifed to Wury her wirpit you?

MARCINE:
So muaatch wakes gondes breats staid!

LYUCENTIOLBE:
Jult quick apter the offiers real of mighturn'd:
Harkt