This notebook follows along with what Karpathy does in his project. Link: https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing#scrollTo=h5hjCcLDr2WC

In [1]:
#get dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-04-10 16:49:28--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: 'input.txt.1'


2023-04-10 16:49:29 (3.57 MB/s) - 'input.txt.1' saved [1115394/1115394]



In [2]:
#open the dataset and inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
#find the length of the dataset in characters
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [4]:
#look at the first 1000 characters in the dataset
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [5]:
#all unique characters in the text, and the vocabulary size
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(' '.join(chars))
print(vocab_size)


   ! $ & ' , - . 3 : ; ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z
65


In [6]:
#create a mapping from characters to integers, and back
stoi = { ch:i for i,ch in enumerate(chars)} #dictionary comprehension: ch is a character in chars and i is its index, so stoi maps character -> integer
itos = { i:ch for i,ch in enumerate(chars)} #the inverse mapping: integer -> character
encode = lambda s: [stoi[c] for c in s] #takes a string, and outputs the encoded list of integers
decode = lambda l: ''.join([itos[c] for c in l]) #takes a list of integers, and outputs the decoded string


print(encode("Hey there!"))
print(decode(encode("Hey there!")))


[20, 43, 63, 1, 58, 46, 43, 56, 43, 2]
Hey there!
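
The round trip above works because every character of "Hey there!" occurs in the training text. A minimal sketch of the same scheme on a toy string shows the mapping and what happens with a character the vocabulary has never seen. The `sample` text and all `toy_` names are made up for illustration and are not part of the notebook.

```python
# Toy character-level tokenizer, mirroring the stoi/itos scheme above.
sample = "hello world"  # hypothetical stand-in for the Shakespeare text
toy_chars = sorted(set(sample))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character to its integer id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(ids)
print(toy_decode(ids))

# A character that never appeared in the vocabulary cannot be encoded.
try:
    toy_encode("hey")  # 'y' does not occur in "hello world"
    unknown_ok = True
except KeyError:
    unknown_ok = False
print("unseen character encodable:", unknown_ok)
```

So the tokenizer only covers the characters found in the dataset; anything else raises a KeyError in `encode`.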


In [7]:
#now storing the entire text dataset in a torch.tensor
import torch #pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [8]:
#separate dataset into training and validation sets
n = int(0.9*len(data))
train_data = data[:n] #first 90% for training
val_data = data[n:] #last 10% for validation
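
As a quick check of the split sizes, the arithmetic can be done with the dataset length printed earlier (1115394 characters). The variable names here are only for illustration.

```python
# Split sizes for the 90/10 split, using the dataset length printed earlier.
total_chars = 1115394              # len(text) as reported above
n_train = int(0.9 * total_chars)   # same expression as the notebook's n
n_val = total_chars - n_train
print(n_train, n_val)
```

Note that the split is by position, not shuffled: the validation set is simply the last tenth of the text.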

In [9]:
#define the maximum context length: how many characters the model sees at most when predicting the next one
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [10]:
#show what the context and target will be for the above input
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target is: {target}")

when input is tensor([18]) the target is: 47
when input is tensor([18, 47]) the target is: 56
when input is tensor([18, 47, 56]) the target is: 57
when input is tensor([18, 47, 56, 57]) the target is: 58
when input is tensor([18, 47, 56, 57, 58]) the target is: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is: 58


get_batch below is a function that returns two tensors of shape 4x8. The 4 is the batch size, the number of sequences we process in parallel, while the 8 is the block size, the maximum context length. The first tensor holds the inputs, and the second holds the corresponding targets.
We then call the function and use a nested for loop to show which context and target each position in the batch provides.

In [11]:
torch.manual_seed(1337) #to get same results as Andrej
batch_size = 4 #how many sequences to process in parallel
block_size = 8 #max context length

def get_batch(split):
    #will return two tensors of dimensions 4x8
    data = train_data if split=='train' else val_data
    #generate (batch_size) amount of random intervals in data
    ix = torch.randint(len(data) - block_size, (batch_size,)) #generate 4 random indices in the data
    x = torch.stack([data[i:i+block_size] for i in ix]) #will be the inputs
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) #targets for inputs x
    return x, y

xb, yb = get_batch('train')
print('inputs:', '\n', xb.shape, '\n', xb)
print('targets:', '\n', yb.shape, '\n', yb)

for i in range(batch_size): #the batch we're on
    for j in range(block_size): #the point in time in the sequence
        context = xb[i, :j+1]
        target = yb[i,j]
        print(f"when input is {context.tolist()}, target is {target.tolist()}")

inputs: 
 torch.Size([4, 8]) 
 tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets: 
 torch.Size([4, 8]) 
 tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
when input is [24], target is 43
when input is [24, 43], target is 58
when input is [24, 43, 58], target is 5
when input is [24, 43, 58, 5], target is 57
when input is [24, 43, 58, 5, 57], target is 1
when input is [24, 43, 58, 5, 57, 1], target is 46
when input is [24, 43, 58, 5, 57, 1, 46], target is 43
when input is [24, 43, 58, 5, 57, 1, 46, 43], target is 39
when input is [44], target is 53
when input is [44, 53], target is 56
when input is [44, 53, 56], target is 1
when input is [44, 53, 56, 1], target is 58
when input is [44, 53, 56, 1, 58], target is 46
when input is [44, 53, 56, 1, 5
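
One detail in get_batch worth checking is the upper bound `len(data) - block_size` for the random start index. The sketch below uses a plain list and `random.randrange` as a stand-in for `torch.randint` (both exclude the upper bound) to show that even the largest possible start index leaves room for the target slice, which is shifted by one. `toy_data`, `block` and the other names are hypothetical.

```python
import random

random.seed(0)
toy_data = list(range(20))   # hypothetical stand-in for the encoded text
block = 8                    # plays the role of block_size

# torch.randint(high, ...) excludes `high`, like random.randrange(high).
starts = [random.randrange(len(toy_data) - block) for _ in range(1000)]

# Largest start index that can ever be drawn.
largest_start = len(toy_data) - block - 1
x_last = toy_data[largest_start:largest_start + block]
y_last = toy_data[largest_start + 1:largest_start + block + 1]

print(max(starts) <= largest_start)   # no draw exceeds the bound
print(len(x_last), len(y_last))       # both slices have full length
print(y_last[-1] == toy_data[-1])     # the last target is the final element
```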

We will now create a simple bigram language model: it uses a context of only one character (the current one) to predict the next character.

In [12]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module): #define the BigramLanguageModel class as a subclass of nn.Module
    def __init__(self, vocab_size):
        #initialize nn.Module so that we can use its methods and properties
        super().__init__()
        # create a lookup table where each input token gets its own row, from which we read off the logits for the next token
        # each character has one row, and each value in this row is the logit (unnormalized score) of some character
        # coming next. 65 x 65 means each character has a score for every character in the vocabulary
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):
        # look up the row of the vocab_size x vocab_size table for every token in idx, giving a (B, T, C) tensor.
        # B = batch_size = 4
        # T = time or block_size = 8
        # C = channels = vocab_size = 65
        # In other words, logits holds the predicted scores for the next character
        # at each position of each sequence
        logits = self.token_embedding_table(idx) 
        
        if targets is None:
            loss = None
        else:
            #if targets are given, calculate the loss:
            #F.cross_entropy expects the channel dimension to be the second dimension, so we flatten
            #logits to (B*T, C) and targets to (B*T)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            #F.cross_entropy first applies (log-)softmax to the logits, then compares the predicted probabilities
            #to the true targets through a negative log-likelihood loss
            loss = F.cross_entropy(logits,targets)
        
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        #generate max_new_tokens new characters at the end of idx, which is a (B, T) tensor
        for _ in range(max_new_tokens):
            #run forward to get predictions
            logits, loss = self(idx)  #in a nn.Module, calling the module instance itself calls the forward() method
            # focus only on last time step
            logits = logits[:, -1 ,:] #selecting the last step drops the T dimension, so the tensor is (B, C)
            # apply softmax for probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # dimensions (B, 1)
            #concatenate the sequence with the prediction, in effect appending the prediction
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print("Logits shape: ", logits.shape)
print("Loss: ", loss)
 
idx = torch.zeros((1, 1), dtype=torch.long) #B=1, T=1. This is a small array we can use to test generate. 
print(decode(m.generate(idx, max_new_tokens=100)[0].tolist()))

Logits shape:  torch.Size([32, 65])
Loss:  tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
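
A useful sanity check on the loss above: if the model assigned the same probability to each of the 65 characters, the cross-entropy would be ln(65), roughly 4.17. The untrained model scores 4.8786, somewhat worse than that, which is plausible since the randomly initialized table does not produce uniform predictions. The sketch below computes cross-entropy by hand in plain Python so the arithmetic is visible; `cross_entropy` here is a hand-written helper, not the PyTorch function.

```python
import math

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab = 65                         # vocab_size from the notebook
uniform_logits = [0.0] * vocab     # equal logits give a uniform distribution
baseline = cross_entropy(uniform_logits, target=0)
print(round(baseline, 4))          # equals ln(65)
```

Training should push the loss below this baseline, as the next cells show.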


Now we are going to train the model by first defining an optimizer for the model's parameters, and then repeatedly updating the values in the lookup table.

In [13]:
# create the optimizer.
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3) #learning rate = 1e-3 -> 1*10^(-3)

In [14]:
batch_size = 32 #try a higher batch size
for steps in range(1000): #update 1000 times
    
    # sample a batch of data
    xb, yb = get_batch('train') #as per the previous function, xb will be the context, and yb will be the targets
    
    # evaluate the loss function
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
print(loss.item())

3.721843719482422


In [15]:
#Create example tensor
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337) #for reproducibility
B,T,C = 4, 8, 2 #batch = 4, time = 8, channels = 2
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

Now our task is to let the 8 elements of the time dimension talk to each other, but only backwards in time, so element 3 only talks to elements 2, 1, and 0 (and itself). To do this, we set xbow[b,t] to the mean of x[b,i] over all i<=t. This will be a basic attention mechanism.

In [16]:
#basic bag of words implementation
xbow = torch.zeros(B,T,C) #x, bag of words
for b in range(B):
    for t in range(T):
        xprev =x[b,:t+1] # will be of dimensions (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0) #0 indicates that we calculate the mean along dimension 0 of xprev, the time dimension
     
print(x[0], '\n')
print(xbow[0])

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]]) 

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


Notice above how the first time step of each tensor is the same. From the second time step on, each row of the second tensor is the average of all rows of the first tensor up to and including that time step, so the two diverge.
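
The printed values can be checked by hand. The sketch below copies the first three rows of x[0] from the output above and recomputes the running averages in plain Python; `rows` and `causal_means` are illustrative names. Since the inputs are the rounded printed values, the results agree with xbow[0] only up to rounding.

```python
# First three rows of x[0], copied from the printed output above.
rows = [
    [0.1808, -0.0700],
    [-0.3596, -0.9152],
    [0.6258, 0.0255],
]

def causal_means(seq):
    """For each position t, average rows 0..t (inclusive), column by column."""
    out = []
    for t in range(len(seq)):
        prev = seq[: t + 1]  # t + 1 rows
        out.append([sum(col) / len(prev) for col in zip(*prev)])
    return out

means = causal_means(rows)
for row in means:
    print([round(v, 4) for v in row])
```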

The next cell shows how the same averaging can be done with a matrix multiplication, where each row of the tensor a holds the weights for one output row:

In [17]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
#normalize a by dividing each row by its sum (dim 1); keepdim=True keeps the sums as a (3, 1) column
#so that the division broadcasts across each row
a = a / torch.sum(a, 1, keepdim=True) #each row is now a probability distribution that sums to 1
b = torch.randint(10,(3,2), dtype=torch.float32)
c = a @ b
print('a \n', a)
print('b \n', b)
print('c \n', c)

a 
 tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b 
 tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c 
 tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
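
To see why the product gives running averages, the same multiplication can be written out with nested lists. This is only a sketch: `tril_average_weights` and `matmul` are hand-written helpers, and `b_vals` is the tensor b printed above.

```python
def tril_average_weights(n):
    """Row i has 1/(i+1) in columns 0..i and 0 elsewhere."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

b_vals = [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]   # the b printed above
c_vals = matmul(tril_average_weights(3), b_vals)
for row in c_vals:
    print([round(v, 4) for v in row])
```

Row i of the result is the weighted sum of the rows of b, and since the weights in row i are 1/(i+1) for the first i+1 rows and 0 for the rest, that sum is the average of rows 0..i.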


Now creating the actual implementation of the matrix multiplication:
wei is of dimension (T,T), while x is of dimension (B,T,C). To multiply the two, PyTorch broadcasts wei by adding a batch dimension, so wei is treated as (B,T,T). Each batch element is then multiplied exactly as in the last step; imagine B=1, T=3, C=2 there.

In [18]:
#Create a T x T matrix of weights: row t averages over time steps 0..t, so each time step gets its own set of weights
wei = torch.tril(torch.ones(T, T)) 
wei = wei / torch.sum(wei, 1, keepdim=True)
xbow2 = wei @ x
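
The cell above does not compare xbow2 with the loop version, so here is a small self-contained check of the broadcasting claim. It is a sketch with its own random input; all names ending in `_demo`, plus `batched`, `per_example` and `loop`, are made up for this example.

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 8, 2
x_demo = torch.randn(B, T, C)

wei_demo = torch.tril(torch.ones(T, T))
wei_demo = wei_demo / wei_demo.sum(1, keepdim=True)   # (T, T), rows sum to 1

# (T, T) @ (B, T, C): wei_demo is broadcast over the batch dimension.
batched = wei_demo @ x_demo
# The same product done one batch element at a time.
per_example = torch.stack([wei_demo @ x_demo[b] for b in range(B)])

# Reference: explicit running mean, as in the loop version.
loop = torch.stack([
    torch.stack([x_demo[b, : t + 1].mean(0) for t in range(T)])
    for b in range(B)
])

print(batched.shape)
print(torch.allclose(batched, per_example, atol=1e-6))
print(torch.allclose(batched, loop, atol=1e-6))
```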

In [19]:
# using softmax and masking to do the same process
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True
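
The reason this works: softmax exponentiates each entry, exp(-inf) is 0, and the remaining zeros all become exp(0) = 1, so each row ends up uniform over its unmasked positions. A plain-Python sketch of that step, where `softmax`, `masked` and `weights` are illustrative names:

```python
import math

def softmax(row):
    """Softmax of a list; exp(-inf) evaluates to 0.0, so masked entries get weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

NEG_INF = float("-inf")
T_demo = 4
# Zeros on and below the diagonal, -inf above it (what the masked_fill step produces).
masked = [[0.0 if j <= i else NEG_INF for j in range(T_demo)] for i in range(T_demo)]
weights = [softmax(row) for row in masked]
for row in weights:
    print([round(w, 4) for w in row])
```

The point of this formulation is that the zeros do not have to stay zeros: once they are replaced by data-dependent scores, the same masking and softmax give non-uniform weights.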


Version 4 will use self-attention:

In [66]:
#Basic set up
torch.manual_seed(1337)
B,T,C = 4,8,32 
x = torch.randn(B,T,C)

In [67]:
#Setting up a head
head_size = 16
key = nn.Linear(C, head_size, bias=False) #no bias term
query = nn.Linear(C, head_size, bias=False) #no bias term
value = nn.Linear(C, head_size, bias=False) #no bias term
k = key(x) # (B, T, head_size) -> what each element contains, i.e. what it offers to other elements
q = query(x) # (B, T, head_size) -> what each element is looking for. A high overlap between a query and a key means high interest
v = value(x) # (B, T, head_size) -> the actual information communicated if interest is high (next block)
wei = q @ k.transpose(-2, -1) # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
wei = wei * head_size**-0.5 # scaling -> dividing by the square root of the key dimension, which is head_size
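
Why divide the scores by the square root of head_size? If queries and keys have roughly unit-variance entries, their dot product over head_size dimensions has variance of about head_size, and large scores make softmax very peaked. Multiplying by head_size**-0.5 brings the variance back to about 1. The sketch below simulates this in plain Python with independent standard-normal entries, which is an assumption for illustration and not what the linear layers necessarily produce; all names are hypothetical.

```python
import random
import statistics

random.seed(0)
head_dim = 16          # same value as head_size in the notebook
n_samples = 20000

def random_dot(dim):
    """Dot product of two vectors with independent standard-normal entries."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(dim))

raw = [random_dot(head_dim) for _ in range(n_samples)]
scaled = [s * head_dim ** -0.5 for s in raw]  # ** binds tighter than *

raw_var = statistics.pvariance(raw)
scaled_var = statistics.pvariance(scaled)
print(round(raw_var, 2))      # close to head_dim
print(round(scaled_var, 2))   # close to 1
```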

In [68]:
tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros(T, T) -> previous implementation
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ v # so we aggregate the values derived from x, not x itself
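
To check that the mask really stops information flowing from the future, the head can be wrapped in a function and run on two inputs that differ only in the last time step. If the head is causal, the outputs at all earlier time steps should be the same. This is a sketch under the same setup as above (no bias, head size 16); `causal_head`, `layers`, `x_a` and `x_b` are hypothetical names, not part of the original notebook.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

def causal_head(x, key, query, value):
    """Single masked self-attention head over x of shape (B, T, C)."""
    T = x.shape[1]
    k, q, v = key(x), query(x), value(x)                    # each (B, T, head_size)
    scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T), scaled
    mask = torch.tril(torch.ones(T, T))
    scores = scores.masked_fill(mask == 0, float("-inf"))   # hide future positions
    weights = F.softmax(scores, dim=-1)                     # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
B, T, C, head_dim = 2, 8, 32, 16
layers = [nn.Linear(C, head_dim, bias=False) for _ in range(3)]  # key, query, value
x_a = torch.randn(B, T, C)
x_b = x_a.clone()
x_b[:, -1] = torch.randn(B, C)   # change only the last time step

out_a, w_a = causal_head(x_a, *layers)
out_b, _ = causal_head(x_b, *layers)

print(out_a.shape)
# Outputs before the last time step should not depend on the last token.
print(torch.allclose(out_a[:, :-1], out_b[:, :-1], atol=1e-6))
```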