<a href="https://colab.research.google.com/github/Lin-Jet/CodeChatGPTfromScratchWorkshop/blob/main/Jet_Gradio_GPT_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a GPT

Companion notebook to the [Zero To Hero](https://karpathy.ai/zero-to-hero.html) video on GPT.

Disclaimer: I am not an expert in LLMs; I am simply conveying information as best I can based on my research. The goal of this workshop is to introduce LLMs in as beginner-friendly a way as possible.

This is code from Andrej Karpathy's "Let's build GPT: from scratch, in code, spelled out" video: https://www.youtube.com/watch?v=kCc8FmEb1nY

Additional resources:
1. 3Blue1Brown Neural Networks playlist https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&si=Dh-uLIUYVk-ycICz
2. PyTorch docs https://pytorch.org/docs/stable/index.html
3. "Attention Is All You Need" paper https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
4. Residual Networks and Skip Connections https://www.youtube.com/watch?v=Q1JCrG1bJ-A
5. Create a Large Language Model from Scratch with Python – Tutorial (freeCodeCamp) https://www.youtube.com/watch?v=UU1WVnMk4E8&t=768s
6. "Deep Residual Learning for Image Recognition" paper https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf
7. Batch Normalization https://www.youtube.com/watch?v=yXOMHOpbon8&t=124s

and the CSE 176 class offered at UC Merced


In [1]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-11-02 01:26:36--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-11-02 01:26:37 (5.87 MB/s) - ‘input.txt’ saved [1115394/1115394]



A nice website to check out for more text: https://www.gutenberg.org/

In [2]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [4]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [5]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


1) Enumeration: say you have a list
```fruits = ["apple", "orange", "watermelon"]```
```for i, fruit in enumerate(fruits):```
will give you the index i and the string fruit on each pass through the loop. Cool!
https://www.w3schools.com/python/ref_func_enumerate.asp


2) Lambda Functions: Think of a lambda as a function, but instead of writing ```def encode(string):```
you write it inline. This is great for very small, short, easy tasks.
After ```x = lambda a : a + 10```, calling ```x(5)``` will return 15!
https://www.w3schools.com/python/python_lambda.asp

3) List Comprehension: https://www.w3schools.com/python/python_lists_comprehension.asp
```fruits = ["apple", "banana", "cherry", "kiwi", "mango"]```
```newlist = [x for x in fruits if "a" in x]```
newlist will contain only the fruits with the character a in them.
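
The three ideas above can be tried together in a few lines. This is a small sketch for illustration; the names `indexed`, `add_ten`, `with_w`, and `stoi_demo` are made up for the example and are not used elsewhere in the notebook.

```python
fruits = ["apple", "orange", "watermelon"]

# enumerate yields (index, item) pairs
indexed = [(i, fruit) for i, fruit in enumerate(fruits)]
print(indexed)  # [(0, 'apple'), (1, 'orange'), (2, 'watermelon')]

# a lambda is a small anonymous function written inline
add_ten = lambda a: a + 10
print(add_ten(5))  # 15

# a list comprehension builds a new list, optionally filtering
with_w = [x for x in fruits if "w" in x]
print(with_w)  # ['watermelon']

# all three patterns show up when we map characters to integers
stoi_demo = {ch: i for i, ch in enumerate(sorted(set("hello")))}
print(stoi_demo)  # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
```

The last line is the same pattern the next cell uses to build `stoi`, just on a tiny string.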

In [6]:
# create a mapping from characters to integers

stoi = { ch:i for i,ch in enumerate(chars) }
print("string to int: ", stoi)
itos = { i:ch for i,ch in enumerate(chars) }
print("int to str: ", itos)


encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

string to int:  {'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
int to str:  {0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f

Levels of Tokenization Matter!

1. Character level.
  - Pro:
    - not a lot of tokens
  - Con:
    - cannot "bake in" meaning to letters
    - i.e. the letter 'j' does not have much meaning...

2. Word level.
  - Pro:
    - can "bake in" meaning to words
    - i.e. "eating" has a meaning; it conveys a concept directly.
  - Con:
    - need a token for every word in language!
  
3. Sub-Word level.
  - Pro:
    - best of both worlds
    - i.e. tokens might look like "eat", "ing", "en". Combinations of these tokens create new meanings. eaten vs eating...
  - Con:
    - How would you properly split words? Complex!
  
We will use Character Level for simplicity.
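
To make the trade-off concrete, here is a toy comparison of character-level and word-level vocabularies. It is only a sketch: the sentences and the names `corpus`, `new_text`, `unknown_chars`, and `unknown_words` are invented for the example, and splitting on spaces stands in for a real word tokenizer.

```python
corpus = "the cat is eating the fish"

# vocabularies built from the corpus
char_vocab = set(corpus)          # every distinct character
word_vocab = set(corpus.split())  # every distinct word

new_text = "the cat is fishing"

# character level: longer sequence, but nothing is out of vocabulary
char_tokens = list(new_text)
unknown_chars = [c for c in char_tokens if c not in char_vocab]

# word level: shorter sequence, but a word we never saw has no token
word_tokens = new_text.split()
unknown_words = [w for w in word_tokens if w not in word_vocab]

print(len(char_tokens), unknown_chars)  # 18 []
print(len(word_tokens), unknown_words)  # 4 ['fishing']
```

A sub-word tokenizer would sit in between, for example by splitting "fishing" into pieces such as "fish" and "ing" that it has already seen.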

In [7]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org

data = torch.tensor(encode(text), dtype=torch.long)
# dtype is the type of element this tensor stores,
# similar to declaring `int x;` in C++

print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earlier will look like this to the GPT

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [8]:

# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

We will train on random blocks (chunks) of the text rather than on the whole text at once, which keeps training computationally cheap.


In [9]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

With the above, we can make 8 training examples:
1. given 18, predict 47
2. given 18 and 47, predict 56
3. given 18, 47, 56, predict 57, and so on

Remember, these ints are our characters...

In [10]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


Another reason we train on every context length up to block_size is so our model gets used to generating new text based on only one token,  
as well as on two tokens,  
and so on up to block_size tokens.  
Once the context grows past block_size, we start truncating it.
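
The truncation step can be sketched as a single slice. This is a minimal illustration, assuming the running sequence is a (B, T) tensor called `idx`; the name `idx_cond` is hypothetical and does not appear in the cells so far.

```python
import torch

block_size = 8

# pretend we have already generated 12 tokens for one sequence: shape (1, 12)
idx = torch.arange(12).unsqueeze(0)

# keep only the last block_size tokens as context for the next prediction
idx_cond = idx[:, -block_size:]

print(idx_cond.shape)        # torch.Size([1, 8])
print(idx_cond[0].tolist())  # [4, 5, 6, 7, 8, 9, 10, 11]
```

If the sequence is shorter than block_size, the same slice simply returns all of it, so no special case is needed.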

We will feed our model a tensor that holds a batch of many chunks.

Each row of the batch is one chunk of text.

The chunks in a batch are processed at the same time but do not interact with each other; they are independent.

In [11]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    # pick batch_size (4) random starting indices; each integer in ix is the starting position of a block of block_size characters in data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix]) # context
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # targets: the same blocks shifted one character to the right
    # "stack" the 1-D tensors (arrays) so they become the rows of a 2-D tensor (matrix)
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension # the number of blocks
    for t in range(block_size): # time dimension # the block_size
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

In [12]:
print(xb) # our input to the transformer

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


the transformer is going to take these values and make predictions

In [13]:
print(yb) # our input labels for transformer to check

tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])


Then check with our yb tensor to see if they were correct.

In [14]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) #vocab_size = 65

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C) #batch by time by channel tensor
        # every single input integer will pluck out the row of the embedding table corresponding to its index
        # for us the logits will be 4 (batch_size) by 8 (block_size) by 65 (vocab_size)

        # logits are the scores for the next character in our sequence.
        # Say our (B, T) input holds the tokens for "hello my name i", and C is our vocab_size of 65 possible tokens.
        # Then along C the logits hold one score per candidate next token (softmax turns the scores into probabilities),
        # so the score that corresponds to the token 's' will ideally be the highest.
        # When generating, the model samples from these probabilities, so 's' is the most likely pick and we would get
        # "hello my name is"



        if targets is None:
            loss = None
        else:
            # PyTorch's cross_entropy expects logits of shape (N, C) and targets of shape (N,), so we flatten B and T into N = B*T

            B, T, C = logits.shape # pull out the size of each dimension
            logits = logits.view(B*T, C) # reshape logits (B, T, C) to (B*T, C)
            targets = targets.view(B*T) # our ground truth. Our correct labels.

            loss = F.cross_entropy(logits, targets) # log-softmax followed by negative log likelihood https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html

        # loss is how far away our prediction is from the actual answer.
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C) #normalize function
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) #choose randomly though higher prob tokens will have higher chance.  #just picking highest prob token -> repetition.
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss) # with 65 equally likely tokens we would expect a loss of -ln(1/65) = 4.1744; ours is higher, so the random initial predictions are worse than a uniform guess

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

# idx = torch.zeros((1, 1), dtype=torch.long) #idx = tensor([[0]]) # a 1 by 1 tensor of zero(0 token is \n char) and this will kick off our generation

# print(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist())
#remember that generate returns idx which is a (B, T+max_new_tokens) tensor so [0] will pull out the 1d tensor (T+max_new_tokens, ) which is our predicted tokens sequence and convert to list. in this case 101 tokens
# then decode and print
# will print garbage cause the weights are initialized randomly
# we are feeding the whole sequence so far, then only using the last token to make prediction.  we will include entire history later.


torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
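
The 4.1744 figure quoted in the cell above is the loss of a model that spreads its probability evenly over all 65 characters. It can be checked by hand; the sketch below uses only the standard library, and the name `uniform_loss` is made up for the example.

```python
import math

vocab_size = 65

# negative log likelihood of the correct token when every token gets probability 1/65
uniform_loss = -math.log(1.0 / vocab_size)

print(round(uniform_loss, 4))  # 4.1744
```

Any loss above this value means the model is doing worse than guessing uniformly; training should push the loss below it.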


In [15]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3) # updates our parameters so that in the next forward pass we will have better predictions

In [16]:
batch_size = 32
for steps in range(1000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True) # clear the previous gradients so they don't accumulate with the new ones
    loss.backward() # performing backpropagation, which calculates the gradients of the loss with respect to each parameter in the model
    optimizer.step() # update the parameters

print(loss.item())


3.7218432426452637


In [17]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


olylvLLko'TMyatyIoconxad.?-tNSqYPsx&bF.oiR;BD$dZBMZv'K f bRSmIKptRPly:AUC&$zLK,qUEy&Ay;ZxjKVhmrdagC-bTop-QJe.H?x
JGF&pwst-P sti.hlEsu;w:w a BG:tLhMk,epdhlay'sVzLq--ERwXUzDnq-bn czXxxI&V&Pynnl,s,Ioto!uvixwC-IJXElrgm C-.bcoCPJ
IMphsevhO AL!-K:AIkpre,
rPHEJUzV;P?uN3b?ohoRiBUENoV3B&jumNL;Aik,
xf -IEKROn JSyYWW?n 'ay;:weO'AqVzPyoiBL? seAX3Dot,iy.xyIcf r!!ul-Koi:x pZrAQly'v'a;vEzN
BwowKo'MBqF$PPFb
CjYX3beT,lZ qdda!wfgmJP
DUfNXmnQU mvcv?nlnQF$JUAAywNocd  bGSPyAlprNeQnq-GRSVUP.Ja!IBoDqfI&xJM AXEHV&DKvRS


You can start to see some structure.

Since the prediction for the next token is based only on the immediately preceding token, the tokens are not talking to each other.
We want each prediction to be made with the context of all previous tokens.

Let's do that next!

## The mathematical trick in self-attention

Say we are at the 5th token.
We want to get the context of all past tokens.
The easiest way is to take the average of those past tokens,
so we average the channels of tokens 1-5.

This strategy loses a ton of spatial information, since an average ignores the order of the tokens. Later we will get that information back.


In [18]:
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [19]:
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C)) # bow = bag of words: we simply average the tokens seen so far
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0)
print(x[0])
print(xbow[0]) #averages

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


Cool, but inefficient: we loop over every batch element and every time step in a nested for loop.

In [20]:
torch.manual_seed(42)
a = torch.ones(3, 3)
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


In [21]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3)) #lower triangular
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


In [22]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


Notice that in matrix a the rows sum to 1, so each row of c is the average of the rows of b up to and including that position.

In [23]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T)) #wei for weights
wei = wei / wei.sum(1, keepdim=True)
print(wei) # (T, T); PyTorch will broadcast it to (B, T, T)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)
print(x[0])
print(xbow[0]) #using for loops
print(xbow2[0]) #using matrix multiplication

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.09

In [24]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)


False

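OK, so version 3 just did the same thing. Why is this better?

Our -inf weights ensure that, after softmax, future tokens get a weight of exactly zero and do not affect our predictions. The remaining weights start at zero, which gives a uniform average, but nothing forces them to stay there: soon they will become data-dependent affinities between tokens.

Let's do some code cleanup:
1. remove vocab_size from the constructor, since it is already defined.
2. introduce n_embd (the embedding dimension) and some other things.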


Let's get to the crux of self-attention.

With characters, say we are predicting a vowel: we want to know which consonants preceded this vowel so that we are informed about which vowel we should predict.

It is easier to explain with word-level tokenization:
say we have a noun phrase like "the cat". We probably want it to be followed by a verb: "the cat jumped".


Self-attention lets us gather information from past tokens and let it flow to the next prediction in a data-dependent way (meaning it depends on what the tokens are, e.g. a noun or a verb).

Each token emits a query vector and a key vector.
The query vector says: what am I looking for?
The key vector says: what do I contain?

We get the affinity between two tokens by taking the dot product of one token's query with the other token's key.

Remember that the dot product of two vectors that are perpendicular to each other is 0, while the dot product of two vectors pointing in the same direction gives a large positive number https://www.desmos.com/calculator/u8wt5rnw9n

So to predict the next token, we take the current token's query and take its dot product with the keys of all previous tokens (and its own). Some keys will have higher affinities, and those will be listened to more.
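
The dot product picture can be checked with a toy query and three toy keys. This is only an illustration with hand-picked 2-D vectors; in the cells that follow, the queries and keys come from learned linear layers instead.

```python
import torch

query = torch.tensor([1.0, 0.0])

keys = torch.tensor([
    [2.0, 0.0],   # points in the same direction as the query
    [0.0, 3.0],   # perpendicular to the query
    [-1.0, 0.0],  # points in the opposite direction
])

# one dot product per key: how well does each key match the query?
affinities = keys @ query
print(affinities)  # tensor([ 2.,  0., -1.])

# softmax turns the affinities into weights that sum to 1
weights = torch.softmax(affinities, dim=-1)
print(weights)
```

The key that lines up with the query gets the largest weight, the perpendicular key gets less, and the opposing key gets the least.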



In [25]:
# version 3 again with C = 32: the weights are still uniform, so this is not yet self-attention
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T)) # we do not want this to be uniform.  different tokens have different affinities to each other.
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape


torch.Size([4, 8, 32])

In [26]:
# version 4: self-attention! (for a single head)
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)  # creating those key and query vectors
query = nn.Linear(C, head_size, bias=False)

k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T) # this is the dot product # these T x T matrices give us the affinities
# transpose(a, b) swaps dimension a with dimension b, so here it swaps the second-to-last dimension (T) with the last dimension (16).

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)


out = wei @ x

out.shape

torch.Size([4, 8, 32])

In [27]:
wei[0] # rows still add up to one, but now each element is a data-dependent attention score, or "affinity score"


tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

In [28]:
# version 4: self-attention! (for a single head)
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)  # creating those key and query vectors
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T) # this is the dot product # these T x T matrices give us the affinities

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v # (B, T, T) @ (B, T, 16) ---> (B, T, 16)
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

Why value?

You can think of x as private information that belongs to this token.

Say I am the 5th token, with some identity; my information is kept in x.

For this head, the query says what I am interested in, the key says what I have, and the value v says: if we have high affinity, here is what I will give you.

So v, not the raw x, is what gets aggregated for the purposes of this single head of attention.


In [29]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is of course processed completely independently; examples never "talk" to each other
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below

Notes on Notes:
- 1. Attention is a communication mechanism. Our graph has 8 nodes.
The first node is pointed to only by itself, the second is pointed to by itself and the first node, and so on.
- 2. This is different from convolution. Attention just acts on a set of vectors with no notion of space.
- 3. We have 4 separate pools of 8 nodes, one pool per example in the batch.
- 4. If you're doing sentiment analysis, you might want all nodes, including future tokens, to talk to each other.
To allow that, remove wei = wei.masked_fill(tril == 0, float('-inf'))
- 5. Self-attention: keys, queries, and values all come from the same source x.
Cross-attention: keys and values come from an external source.
- 6. We are still missing "scaled" attention.
With head_size = 16 the scaling factor is 1/sqrt(16) = 1/4. wei should be fairly diffuse going into softmax; if it is not, softmax will have sharp peaks and the largest token affinity will dominate.



In [30]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [31]:
k.var()

tensor(1.0449)

In [32]:
q.var()

tensor(1.0700)

In [33]:
wei.var()

tensor(1.0918)

In [34]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [35]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

Adding a single self attention head:

2bigram.py

RESULT: Validation Loss is 2.4084
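
2bigram.py is not reproduced here, so the following is only a sketch of what a single self-attention head might look like when the steps from the cells above are wrapped in a module. The class name `Head` and the constructor arguments `n_embd`, `head_size`, and `block_size` are assumptions, and the actual file may differ.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of causal self-attention (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask kept as a buffer: saved with the module but not trained
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # scaled affinities between every pair of positions
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        # hide the future: position t may only look at positions <= t
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # (B, T, head_size)

torch.manual_seed(0)
head = Head(n_embd=32, head_size=16, block_size=8)
out = head(torch.randn(4, 8, 32))
print(out.shape)  # torch.Size([4, 8, 16])
```

Storing the mask with `register_buffer` means it moves with the module to the GPU but is never updated by the optimizer.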


---



Adding multi-headed attention:
We run multiple heads of attention in parallel and then concatenate their results along the channel dimension.

3bigram.py
RESULT: Validation Loss is 2.2858
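
As a hedged sketch of "several heads in parallel, then concatenate": the block below defines a compact causal head and a wrapper that runs several of them side by side. The names `Head`, `MultiHeadAttention`, and `num_heads` are assumptions, and 3bigram.py itself may be organized differently.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """Compact causal self-attention head (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    """Several heads side by side; their outputs are concatenated."""

    def __init__(self, num_heads, n_embd, block_size):
        super().__init__()
        head_size = n_embd // num_heads  # each head gets an equal slice of the channels
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(num_heads)]
        )

    def forward(self, x):
        # each head returns (B, T, head_size); concatenating gives (B, T, n_embd)
        return torch.cat([h(x) for h in self.heads], dim=-1)

torch.manual_seed(0)
mha = MultiHeadAttention(num_heads=4, n_embd=32, block_size=8)
mha_out = mha(torch.randn(4, 8, 32))
print(mha_out.shape)  # torch.Size([4, 8, 32])
```

With 4 heads of size 8, the concatenated output has the same 32 channels as the input, so the layer can be dropped in wherever a single 32-channel head was used.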


Adding feed forward networks, aka multilayer perceptrons

The tokens have looked at each other but have not had time to process the information they gained from other tokens. This layer lets each token do so independently.


4bigram.py
RESULT: Validation Loss is 2.2412
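
A feed forward layer of this kind can be sketched as a linear layer followed by a non-linearity, applied to every token separately. The class name `FeedForward` and the choice of ReLU are assumptions made for illustration; 4bigram.py may differ.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Per-token MLP: a linear layer followed by a ReLU (illustrative sketch)."""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        # nn.Linear acts on the last dimension, so every (batch, time) position
        # is processed on its own, with no mixing between tokens
        return self.net(x)

torch.manual_seed(0)
ffwd = FeedForward(n_embd=32)
x = torch.randn(4, 8, 32)
ff_out = ffwd(x)
print(ff_out.shape)  # torch.Size([4, 8, 32])
```

Because the layer only touches the channel dimension, feeding it a single token gives the same result as feeding it that token inside a longer sequence.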


---



Adding Blocks
We take our multi-headed attention and feed forward network, group them into a block, and stack several of these blocks one after another.

5bigram.py

RESULT: we train for some time, but there is not a lot of improvement?!

  The reason is that we now have a pretty deep neural network with lots of layers.  
  The deeper we go, the more times our input has been multiplied by weights, until it has been scrambled into noise.
  Similarly, in backpropagation, by the time the gradient gets back to the first layers, all the multiplications by weights have scrambled it.


---




Adding residual connections
We group our layers into blocks,  
then add skip connections, which turn our blocks into residual blocks.

In the forward pass the data goes both through the block and around it through the skip connection.
Then we add the two results together and pass the sum to the next residual block.

In the backward pass the gradients flow both through the block, where they are multiplied by the weights, and through the skip connection, where they pass with little change.  

The skip connections give the data and the gradients a direct path through the network (forward and backward). This helps the data retain its structure, reducing the "noise" effect from repeated weight multiplications.

6bigram.py
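
To see the effect of skip connections on gradients, the sketch below compares a plain stack of layers with the same stack wrapped in residual connections. Everything here is an assumption made for illustration (the `Residual` wrapper, the helper names, a depth of 20, and simple Linear + ReLU layers standing in for the real blocks); it is not the code in 6bigram.py.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wrap any sublayer f so that the output is x + f(x)."""

    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(x)

def make_layer(n_embd):
    # stand-in for a real block: one linear layer and a ReLU
    return nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU())

def input_grad_size(model, n_embd):
    # average size of the gradient that reaches the input
    x = torch.randn(4, 8, n_embd, requires_grad=True)
    model(x).sum().backward()
    return x.grad.abs().mean().item()

torch.manual_seed(0)
n_embd, depth = 32, 20
plain = nn.Sequential(*[make_layer(n_embd) for _ in range(depth)])
resid = nn.Sequential(*[Residual(make_layer(n_embd)) for _ in range(depth)])

plain_grad = input_grad_size(plain, n_embd)
resid_grad = input_grad_size(resid, n_embd)
print(plain_grad, resid_grad)
```

In the plain stack the gradient has to pass through every weight matrix on its way back, so very little of it reaches the input; with skip connections part of it travels back along the additions unchanged.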

---



Small change
The feed forward network's inner layer should be 4 times as wide as its input and output layers, as in the paper.

7bigram.py

RESULT: val loss 2.0808, train loss = 1.9993. Our training loss is getting smaller than our validation loss, which means we are starting to overfit.


---



Add LayerNorm

Let me explain batch norm first though.
The inputs to each layer are multiplied by our weights. The outputs of each layer can become very large or very small, and this out-of-proportion data is passed to the next layer. After many layers we run into the exploding or vanishing gradients problem.  

To fix this we standardize the inputs to each layer.  We first find the mean and variance of the data, then standardize so the mean is 0 and the variance is 1.


After standardizing the input x we also scale and shift (offset) the standardized input.

This is because otherwise every layer's output would be constrained to the same standardized distribution, which may be too restrictive.  Giving the model the ability to scale and shift helps it avoid underfitting.  So by the end of training the layer outputs might no longer be standardized, but they will be whatever works best for our model.

During training we apply batch norm with the mean and variance of the current batch, whereas during testing and inference we use the running averages across all batches, which the batch norm layer calculates during training.

gamma and beta are our scaling and shifting parameters.  These are new trainable parameters.

our final output will be z:

**z=γ⋅x+β,** where x is our standardized input.
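
The batch norm steps described above can be sketched in a few lines. This is a simplified illustration, not PyTorch's `nn.BatchNorm1d`; the helper name `batch_norm`, the `running` dictionary, and the sizes are assumptions made for this example.

```python
import torch

torch.manual_seed(0)
dim, eps, momentum = 4, 1e-5, 0.1
gamma, beta = torch.ones(dim), torch.zeros(dim)   # trainable scale and shift
running = {"mean": torch.zeros(dim), "var": torch.ones(dim)}

def batch_norm(x, training):
    if training:
        mean, var = x.mean(0), x.var(0)           # statistics of the current batch
        # update the running averages that inference will use
        running["mean"] = (1 - momentum) * running["mean"] + momentum * mean
        running["var"] = (1 - momentum) * running["var"] + momentum * var
    else:
        mean, var = running["mean"], running["var"]
    xhat = (x - mean) / torch.sqrt(var + eps)     # standardize to mean 0, variance 1
    return gamma * xhat + beta                    # z = gamma * xhat + beta

x = torch.randn(32, dim) * 3 + 5                  # a batch that is far from standardized
z_train = batch_norm(x, training=True)
z_eval = batch_norm(x[:1], training=False)        # a single example works at inference
print(z_train.mean(0), z_train.std(0))
```

The running averages are what make inference on a single example possible, since one example has no batch statistics of its own.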


Now what is the difference between batch and layer normalization?

Batch normalization is applied by column, aka across the batch: each feature is standardized over all the inputs in the batch.

Layer norm is applied by row, aka across one input and its features.

Because each input is normalized on its own, there is no need for buffers of running statistics, no distinction between training and testing, no need for momentum...


We will deviate from the paper.  The paper applies layer norm after each transformation; a newer technique (pre-norm) applies it before.  We prepare the inputs and then apply the complex transformations.

So when I said layer norm is applied by rows, here is what that means in code: 8bigram.py uses nn.LayerNorm(n_embd), which standardizes over the last dimension of size n_embd (32). Each token's vector of n_embd features is standardized on its own, independently of the other tokens and of the rest of the batch.
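
A quick check of that behavior on a (B, T, C) tensor, with illustrative sizes. The per-token mean and standard deviation are computed by hand here so the result is easy to inspect.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 32
ln = nn.LayerNorm(n_embd)

x = torch.randn(4, 8, n_embd) * 3 + 5      # (B, T, C), deliberately not standardized
y = ln(x)

# Statistics over the feature dimension: one value per token, shape (B, T).
token_mean = y.mean(dim=-1)
token_std = ((y - token_mean.unsqueeze(-1)) ** 2).mean(dim=-1).sqrt()
print(token_mean.abs().max(), token_std.mean())
```

Every one of the 4 x 8 tokens ends up with mean close to 0 and standard deviation close to 1 across its 32 features, regardless of what the other tokens look like.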

Add one more LayerNorm after the last block, right before the final linear layer, for funsies.

8bigram.py

RESULT: val loss: 2.0607


---



In [36]:
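class LayerNorm1d: # (used to be BatchNorm1d) normalizes each row instead of each column

  def __init__(self, dim, eps=1e-5, momentum=0.1): # momentum is unused: layer norm keeps no running statistics
    self.eps = eps # small constant that avoids division by zero
    self.gamma = torch.ones(dim) # scale
    self.beta = torch.zeros(dim) # shift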

  def __call__(self, x):
    # calculate the forward pass

    # xmean = x.mean(0, keepdim=True) # batch mean
    # xvar = x.var(0, keepdim=True) # batch variance

    xmean = x.mean(1, keepdim=True) # layer mean
    xvar = x.var(1, keepdim=True) # layer variance

    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance # to be more specific standardize to mean=0, variance = 1
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
print("Applying Layer Norm")
x = module(x)
x.shape

Applying Layer Norm


torch.Size([32, 100])

In [37]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [38]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))

Now to scale up our model!

Use a for loop to stack however many blocks we want.

Add Dropout!
This is a regularization technique.
On every training forward pass, dropout randomly selects some percentage of neurons and sets their outputs to 0.  This means the model cannot rely on specific neurons.
In effect the model trains an ensemble of subnetworks.  
At testing / inference all neurons are kept active, so the outputs have to be rescaled to stay consistent with training.
In the original formulation the weights are scaled down at inference by the keep probability,  
i.e. dropout = 30% then wei*0.7. PyTorch's nn.Dropout instead scales the surviving activations up by 1/(1 - 0.3) during training, so nothing needs to change at inference.
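
A small demonstration of how `nn.Dropout` behaves in training and evaluation mode. The input of all ones is just an illustrative choice that makes the scaling easy to see.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.3
drop = nn.Dropout(p)
x = torch.ones(1000)

drop.train()
y_train = drop(x)        # each entry is either 0 or 1 / (1 - p)

drop.eval()
y_eval = drop(x)         # identity: nothing is dropped or rescaled

kept = y_train[y_train != 0]
print(kept[0].item(), kept.numel() / x.numel())
```

Roughly 70% of the entries survive, and each survivor is scaled up so that the expected value of the output matches the input.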

In [39]:
# French to English translation example:

# <--------- ENCODE ------------------><--------------- DECODE ----------------->
# les réseaux de neurones sont géniaux! <START> neural networks are awesome!<END>



Note that we do not have the encoder block or cross attention.

The original goal of the Attention Is All You Need paper was machine translation.

The encoder block would be similar to our decoder but without the triangular mask of -inf.

This means that all of the tokens can communicate with each other, not just with the past ones.

Remember how with self attention the keys, queries, and values were all generated from the same input?
With cross attention the keys and values are generated from the encoder block's output, and the decoder block generates only the queries.


As a result, the decoder's output is based not just on the past tokens but also on the full context of all the tokens from the French prompt.
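
Since this notebook does not implement cross attention, here is a minimal single-head sketch of the idea. The class name `CrossAttentionHead`, the sizes, and the random stand-in tensors are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_embd, head_size = 32, 16   # illustrative sizes

class CrossAttentionHead(nn.Module):
    """Hypothetical head: queries from the decoder, keys and values from the encoder."""

    def __init__(self):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, dec_x, enc_out):
        q = self.query(dec_x)    # (B, T_dec, head_size)
        k = self.key(enc_out)    # (B, T_enc, head_size)
        v = self.value(enc_out)  # (B, T_enc, head_size)
        wei = q @ k.transpose(-2, -1) * head_size ** -0.5   # (B, T_dec, T_enc)
        wei = F.softmax(wei, dim=-1)   # no triangular mask: every encoder token is visible
        return wei @ v, wei

enc_out = torch.randn(1, 6, n_embd)   # stand-in for 6 encoded French tokens
dec_x = torch.randn(1, 4, n_embd)     # stand-in for 4 English tokens generated so far
out, wei = CrossAttentionHead()(dec_x, enc_out)
print(out.shape, wei.shape)
```

The attention matrix is rectangular here (decoder positions by encoder positions), and every decoder token gives nonzero weight to every encoder token.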


---



Before we train, let's talk about what is next.

We have pre-trained a model!

This means that we trained a model on a bunch of text. It now has the ability to generate text.  It has good general knowledge but not specific, task-oriented knowledge like answering questions.


For it to be able to answer questions, you would need to fine-tune our pre-trained model.

This is where you feed it question-and-answer pairs.  For Llama3 it would be JSON in the form of

[
  {
    "instruction": "What is subword level tokenization?",
    "input": "",
    "output": "Subword level tokenization is where you split text into parts of words and associate each part with an integer, effectively encoding the subwords as tokens.  Doing subword level tokenization as opposed to character or word level tokenization is advantageous because we can \"bake in\" meaning to subwords and reduce the number of tokens."
  },
  {
    "instruction": "When is the next HackMerced Event?",
    "input": "",
    "output": "HackMerced X will be February 28th - March 2nd.  This event will have our competition component on top of our workshops."
  }
]
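
As a rough sketch of how such a file might be consumed, the snippet below parses records in this shape and turns each into one training string. The answers are abbreviated, and the prompt template in `to_training_text` is a hypothetical one; real fine-tuning tools define their own templates.

```python
import json

# Abbreviated records in the same shape as above.
raw = """
[
  {"instruction": "What is subword level tokenization?", "input": "",
   "output": "Splitting text into parts of words, each mapped to an integer."},
  {"instruction": "When is the next HackMerced Event?", "input": "",
   "output": "HackMerced X will be February 28th - March 2nd."}
]
"""
records = json.loads(raw)

def to_training_text(rec):
    # Hypothetical template, for illustration only.
    prompt = f"### Instruction:\n{rec['instruction']}\n"
    if rec["input"]:                      # optional extra context
        prompt += f"### Input:\n{rec['input']}\n"
    return prompt + f"### Response:\n{rec['output']}"

examples = [to_training_text(r) for r in records]
print(examples[1])
```

Parsing with `json.loads` is also a cheap way to catch formatting mistakes, such as unescaped quotes inside a string, before starting a fine-tuning run.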


Methods for doing so include LoRA and full fine-tuning.
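
To give a feel for the difference: full fine-tuning updates every weight, while LoRA freezes the pre-trained weights and trains only a small low-rank update added on top. Below is a minimal sketch of that idea for one linear layer. The class `LoRALinear` and the values of `r` and `alpha` are assumptions for illustration, not a library API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class LoRALinear(nn.Module):
    """Hypothetical LoRA wrapper: frozen base layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zeros: no change at the start
        self.scale = alpha / r

    def forward(self, x):
        # base output plus the low-rank correction x @ A^T @ B^T
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

base = nn.Linear(384, 384)        # same width as n_embd in the full code
layer = LoRALinear(base)
x = torch.randn(2, 384)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(trainable, frozen)
```

With these settings only 3,072 parameters are trained for this layer instead of the 147,840 that full fine-tuning would update, which is why LoRA is much cheaper.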

There are more optional steps like reinforcement learning from human feedback.  Sometimes ChatGPT will ask you to choose which output is better.  It is then trained on your preferences.

After pre-training, fine-tuning, and these other steps, the model will be able to answer questions!


### Full finished code, for reference

You may want to refer directly to the git repo instead though.

Added tqdm for the cool progress bar, plus model saving.

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

from tqdm import tqdm

# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # 256 characters for context to predict the 257th character.
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4 # big neural net so bring down the learning rate
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384 # embedding dimension: 384 / 6 heads = 64 dimensions per head
n_head = 6
n_layer = 6
dropout = 0.2 # during training, each dropout layer zeroes 20% of its inputs on every forward pass
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad() # evaluation only: skip gradient tracking to save memory and computation
def estimate_loss(): # average out loss from multiple batches
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs) where hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size) rather than 1/sqrt(n_embd)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a small MLP: expand to 4 * n_embd, apply ReLU, project back to n_embd """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# GPT-style decoder-only Transformer (the class keeps its name from the earlier bigram model)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings feed a stack of Transformer blocks; a final linear layer produces the logits
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C) where C is n_embd
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in tqdm(range(max_iters)):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


# Save the model with state_dict()
# state_dict() returns a Python dictionary containing the trainable parameters and registered buffers

path = 'shakespeare_model.pth'
torch.save(model.state_dict(), path)


10.788929 M parameters


  0%|          | 0/5000 [00:00<?, ?it/s]

In [None]:

# Print model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())
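
To use the saved weights later, you create a model with the same architecture and load the state_dict into it. The round trip is sketched below with a tiny `nn.Linear` as a stand-in for the trained model and a temporary file path, both of which are assumptions for illustration; with the real model you would construct `BigramLanguageModel()` and load `shakespeare_model.pth` instead.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
trained = nn.Linear(4, 2)                        # stand-in for the trained model
path = os.path.join(tempfile.mkdtemp(), "demo_model.pth")
torch.save(trained.state_dict(), path)           # save only the parameters and buffers

restored = nn.Linear(4, 2)                       # must have the same architecture
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()                                  # inference mode (disables dropout, if any)

x = torch.randn(3, 4)
same = torch.equal(trained(x), restored(x))
print(same)
```

`map_location="cpu"` lets weights saved on a GPU be loaded on a machine without one.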


In [None]:
!pip install --upgrade gradio

In [None]:
import gradio as gr

model.to(device)
model.eval()


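def generate_text(prompt, max_tokens=100):
    # keep only characters the model was trained on; encode() raises KeyError otherwise
    prompt = ''.join(c for c in prompt if c in stoi)
    if not prompt:
        prompt = '\n' # assumption: fall back to a newline so the context is never empty

    context = torch.tensor([encode(prompt)], dtype=torch.long, device=device)

    with torch.no_grad(): # inference only, no gradients needed
        generated_idx = model.generate(context, int(max_tokens)) # the slider may return a float

    generated_text = decode(generated_idx[0].tolist())
    return generated_text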

gr.Interface(
    fn=generate_text,
    inputs=[
        gr.Textbox(lines=2, placeholder="Enter a prompt for Shakespearean text..."),
        gr.Slider(10, 500, value=100, step=10, label="Max Tokens")
    ],
    outputs=gr.Textbox(label="Generated Text"),
    title="Shakespearean Text Generator",
    description="Enter a prompt, and the model will generate Shakespearean-like text."
).launch()
