<div class="alert alert-block alert-info"> Reading a short story into Python as a text sample. </div>



In [2]:
with open("text_data/the-verdict.txt","r",encoding="utf-8") as f:
  raw_text = f.read()



## Step 1: Creating Tokens

<div class="alert alert-block alert-warning">
The print commands print the total number of characters, followed by the first 99 characters of the file for illustration purposes.
</div>

In [3]:
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


<div class="alert alert-block alert-success">So the goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training. </div>

<div class="alert alert-block alert-info"> Using some simple example text, we can use the re.split command with the following syntax to split a text on whitespace </div>

In [4]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)',text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


<div class="alert alert-block alert-warning">The result is a list of words and whitespace characters; punctuation is still attached to the words (for example, 'Hello,'). </div>

In [5]:
# Let's modify the regular expression so that it splits on whitespace (\s), commas, and periods
result = re.split(r'([,.]|\s)',text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


We can see that the words and punctuation characters are now separate list entries.

<div class="alert alert-block alert-danger"> A small remaining  issue is that the list still includes whitespace characters. Optionally, we can remove these redundant characters safely as follows </div>

In [6]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


#### Removing whitespaces or not
<div class="alert alert-block alert-info">
When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespaces.

</div>
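
To make the trade-off concrete, the following minimal sketch tokenizes a small piece of indented Python code in both ways. The sample string and the variable names `with_ws` and `without_ws` are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

# Hypothetical sample: a small piece of indented Python code
code = "if x:\n    return 1"

# Split on whitespace and a few punctuation characters, keeping the delimiters
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', code)

# Variant 1: keep whitespace tokens, drop only the empty strings
with_ws = [p for p in pieces if p != ""]

# Variant 2: drop whitespace tokens as well (the approach used in this notebook)
without_ws = [p for p in pieces if p.strip()]

print(with_ws)
print(without_ws)

# Only the whitespace-preserving variant can reproduce the input exactly
print("".join(with_ws) == code)
print(" ".join(without_ws) == code)
```

The whitespace-preserving list is longer, but joining it restores the indentation; the shorter list loses that information.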

In [7]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)',text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [8]:
# Filter out any empty or whitespace-only strings.
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


<div class="alert alert-block alert-warning">Testing with a sample text </div>

In [9]:
text = "Hello, world. Is this-- a test?"

In [10]:
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


<div class="alert alert-block alert-success"> Now that we got a basic tokenizer working, let's apply it to Edith Wharton's entire short story: </div>

In [11]:
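# Note: unlike the pattern used in the test cells, this one does not split on "--" or "!",
# so tokens such as 'genius--though' stay intact in the output.
preprocessed = re.split(r'([,.:;?_\'"()\[\]|\s])', raw_text)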
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius--though', 'a', 'good', 'fellow', 'enough--so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his']


In [12]:
print(len(preprocessed))

4481


## Step 2: Creating Token IDs

<div class="alert alert-block alert-success">
In the previous section, we tokenized Edith Wharton's short story and assigned it to a Python variable called preprocessed.

Let's create a list of unique tokens and sort them alphabetically to determine the vocabulary size </div>

In [13]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1217


After determining that the vocabulary size is 1,217 via the above code, we create the vocabulary and print its first 51 entries for illustration purposes.

In [14]:
vocab = {token:integer for integer, token in enumerate(all_words)}

In [15]:
for i,item in enumerate(vocab.items()):
  print(item)
  if i >=50:
    break

('"', 0)
("'", 1)
('(', 2)
(')', 3)
(',', 4)
('--and', 5)
('--it', 6)
('--oh', 7)
('--she', 8)
('--that', 9)
('.', 10)
(':', 11)
(';', 12)
('?', 13)
('A', 14)
('Ah', 15)
('Ah--I', 16)
('Among', 17)
('And', 18)
('Are', 19)
('Arrt', 20)
('As', 21)
('At', 22)
('Be', 23)
('Begin', 24)
('Burlington', 25)
('But', 26)
('By', 27)
('Carlo', 28)
('Chicago', 29)
('Claude', 30)
('Come', 31)
('Croft', 32)
('Destroyed', 33)
('Devonshire', 34)
('Don', 35)
('Dubarry', 36)
('Emperors', 37)
('Florence', 38)
('For', 39)
('Gallery', 40)
('Gideon', 41)
('Gisburn', 42)
('Gisburn!', 43)
('Gisburn--as', 44)
('Gisburn--fond', 45)
('Gisburns', 46)
('Grafton', 47)
('Greek', 48)
('Grindle', 49)
('Grindles', 50)


<div class="alert alert-block alert-success"> As we can see, based on the output above, the dictionary contains individual tokens associated with unique integer labels. </div>


<div class="alert alert-block alert-success"> Later, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs back into text.

For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens. </div>
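
As a minimal sketch of the idea, the inverse mapping can be built by swapping keys and values. The toy vocabulary below is a hypothetical stand-in, not the vocabulary built from the story.

```python
# Toy vocabulary (hypothetical entries, not the one built from the story)
toy_vocab = {"!": 0, "Hello": 1, "world": 2}

# Swap keys and values to map token IDs back to tokens
toy_inverse = {token_id: token for token, token_id in toy_vocab.items()}

ids = [toy_vocab[t] for t in ["Hello", "world", "!"]]
print(ids)                            # [1, 2, 0]
print([toy_inverse[i] for i in ids])  # ['Hello', 'world', '!']
```

This only works because every token has a unique ID, so no two keys collapse onto the same value when the dictionary is inverted.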

<div class="alert alert-block alert-success"> Let's implement a complete tokenizer class in Python.

The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary.

In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text. </div>

In [16]:
class SimpleTokenizerV1:
  def __init__(self,vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s,i in vocab.items()}


  def encode(self,text):
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',text)

    preprocessed = [
        item.strip() for item in preprocessed if item.strip()
    ]
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids


  def decode(self,ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    # Replace spaces before the specified punctuations

    text = re.sub(r'\s+([,.?!"()\'])',r'\1',text)
    return text


Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a passage from Edith Wharton's short story to try it out in practice.

In [17]:
tokenizer = SimpleTokenizerV1(vocab)

text = """
"It's the last he painted,you know,"
Mrs. Gisburn said with pardonable pride."""



In [18]:
ids = tokenizer.encode(text)
print(ids)

[0, 63, 1, 920, 1065, 646, 563, 808, 4, 1212, 640, 4, 0, 78, 10, 42, 921, 1191, 818, 863, 10]


<div class="alert alert-block alert-success"> The code above prints the token IDs shown in the output. Next, let's see if we can turn these token IDs back into text using the decode method: </div>

In [19]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

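<div class="alert alert-block alert-success"> Based on the output above, we can see that the decode method converted the token IDs back into the original text, apart from whitespace differences such as the extra space after the opening quotation mark and after the apostrophe. </div>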
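
The whitespace clean-up inside decode can be examined in isolation. The sketch below applies the same substitution pattern to a hand-written token list (an assumption for illustration; it is not produced by the tokenizer object here).

```python
import re

# Tokens as the simple tokenizer would produce them (written out by hand here)
tokens = ['"', 'It', "'", 's', 'the', 'last', 'he', 'painted', ',', 'you', 'know', ',', '"']

joined = " ".join(tokens)
# Remove whitespace that precedes one of the listed punctuation characters
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```

Because the pattern only removes whitespace before punctuation, the space that follows the opening quotation mark and the apostrophe remains in the result.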

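<div class="alert alert-block alert-success"> So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing text based on a snippet from the training set. Let's now apply it to a new text sample that is not contained in the training set (the call is commented out because it raises a KeyError): </div>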

In [20]:
# text = "Hello, do you like phone?"
# print(tokenizer.encode(text))

<div class="alert alert-block alert-danger"> The problem is that the word <b> "Hello" </b> was not used in the short story <b>The Verdict</b>.

Hence, it is not contained in the vocabulary.

This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs. </div>
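
A minimal sketch of what goes wrong, using a small hypothetical vocabulary in place of the one built from the story: a plain dictionary lookup fails on the first token it does not know.

```python
# Toy vocabulary standing in for the one built from the story (hypothetical entries)
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "tea": 4, "you": 5}

tokens = ["Hello", ",", "do", "you", "like", "tea", "?"]

# Collect the tokens that a plain dictionary lookup would fail on
missing = [t for t in tokens if t not in toy_vocab]
print(missing)

try:
    ids = [toy_vocab[t] for t in tokens]
except KeyError as err:
    print("KeyError for token:", err)
```

Checking for missing tokens up front, as in the `missing` list, is one way to see how much of a new text falls outside the vocabulary before encoding it.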

## ADDING SPECIAL CONTEXT TOKENS
<div class="alert alert-block alert-info">
In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set.

In this section, we will modify this tokenizer to handle unknown words.

In particular, we will modify the vocabulary and the tokenizer from the previous section, creating SimpleTokenizerV2, to support two new tokens, <b><|unk|></b> and <b><|endoftext|></b> </div>




We can modify the tokenizer to use an <|unk|> token if it encounters a word that is not part of the vocabulary.

Furthermore, we add a token between unrelated texts.

For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert an <|endoftext|> token before each document or book that follows a previous text source.
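
As a small sketch of this convention, independent texts can be concatenated with the separator placed between them. The two documents and the constant name `END_OF_TEXT` are hypothetical.

```python
# Hypothetical stand-ins for independent text sources
documents = [
    "First document about tea.",
    "Second, unrelated document about a palace.",
]

# Place the separator between consecutive documents
END_OF_TEXT = "<|endoftext|>"
combined = f" {END_OF_TEXT} ".join(documents)
print(combined)
```

Surrounding the separator with spaces keeps it apart from neighboring words, which matters for tokenizers that split on whitespace.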

#### Let's now modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of all unique tokens that we created in the previous section

In [21]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>","<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [22]:
len(vocab.items())

1219

<div class="alert alert-block alert-success"> Based on the output above, the new vocabulary size is 1219 </div>

<div class="alert alert-block alert-warning"> As an additional quick check, let's print the last 5 entries of the updated vocabulary </div>

In [23]:
for i, item in enumerate(list(vocab.items())[-5:]):
  print(item)

('younger', 1214)
('your', 1215)
('yourself', 1216)
('<|endoftext|>', 1217)
('<|unk|>', 1218)


#### A simple text tokenizer that handles unknown words

Step 1: Replace unknown words by <|unk|> tokens

Step 2: Replace spaces before the specified punctuations

In [24]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_\'"()\[\]|\s])', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>" for item in preprocessed
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation characters
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [25]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = "<|endoftext|>".join((text1,text2))

In [26]:
print(text)

Hello, do you like tea?<|endoftext|>In the sunlit terraces of the palace.


In [27]:
tokenizer.encode(text)

[1218,
 4,
 381,
 1212,
 673,
 1052,
 13,
 1218,
 1218,
 1218,
 1218,
 1218,
 1065,
 1031,
 1061,
 778,
 1065,
 1218,
 10]

In [28]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|unk|> <|unk|> <|unk|> <|unk|> <|unk|> the sunlit terraces of the <|unk|>.'

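<div class="alert alert-block alert-info">Based on comparing the de-tokenized text above with the original input text, we know that the training dataset, Edith Wharton's short story The Verdict, did not contain the words <b> "Hello"</b> and <b> "palace"</b>. The five consecutive <|unk|> tokens in the middle come from the piece <|endoftext|>In: the split pattern in encode treats | as a delimiter, so this piece is broken into fragments that are not in the vocabulary. </div>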

So far, we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens such as the following:

- **[BOS] (beginning of sequence):** This token marks the start of a text. It signifies to the LLM where a piece of content begins.

- **[EOS] (end of sequence):** This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to `<|endoftext|>`. For instance, when combining two different Wikipedia articles or books, the `[EOS]` token indicates where one article ends and the next one begins.

- **[PAD] (padding):** When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the `[PAD]` token, up to the length of the longest text in the batch.
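
The padding step described in the last item can be sketched in a few lines. The token IDs and the choice of `PAD_ID = 0` are assumptions for illustration; a real tokenizer would reserve its own ID for the padding token.

```python
# Hypothetical token-ID sequences of different lengths
batch = [[5, 8, 2], [7, 1], [4, 9, 3, 6]]

PAD_ID = 0  # assumed ID reserved for the [PAD] token in this sketch

# Pad every sequence on the right up to the longest length in the batch
max_len = max(len(seq) for seq in batch)
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

for row in padded:
    print(row)
```

After padding, all rows have the same length, so the batch can be stacked into a single rectangular tensor.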


<div class="alert alert-block alert-info"> Note that the tokenizer used for GPT models does not need any of these tokens mentioned above but only uses an <b> <|endoftext|> </b> token for simplicity </div>

<div class="alert alert-block alert-info"> The tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a <b> byte pair encoding tokenizer </b>, which breaks down words into subword units </div>

## BYTE PAIR ENCODING

<div class="alert alert-block alert-success"> 
We have implemented a <b> simple tokenization </b> scheme in the previous sections for illustration purposes.

This section covers a more sophisticated tokenization scheme based on a concept called <b>byte pair encoding (BPE).</b>

The BPE tokenizer covered in this section was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.

</div>

<div class="alert alert-block alert-warning"> 
Since implementing BPE can be relatively complicated, we will use an existing open-source Python library called tiktoken, which OpenAI uses.

This library implements the BPE algorithm very efficiently based on source code in Rust.
</div>

In [29]:
!pip3 install tiktoken





In [30]:
import importlib.metadata
import tiktoken

In [31]:
print("tiktoken version:",importlib.metadata.version("tiktoken"))

tiktoken version: 0.7.0


<div class="alert alert-block alert-success"> 
Once installed, we can instantiate the BPE tokenizer from tiktoken as follows:

</div>

In [32]:
tokenizer = tiktoken.get_encoding("gpt2")

<div class="alert alert-block alert-success"> 
The usage of this tokenizer is similar to the SimpleTokenizerV2 we implemented previously, via an encode method:
</div>

In [33]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of some unknown Place."
)

integers = tokenizer.encode(text,allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 6439, 8474, 13]


<div class="alert alert-block alert-info"> 
The code above prints the token IDs shown in the output.
</div>

<div class="alert alert-block alert-success"> 
We can then convert the token IDs back into text using the decode method, similar to our SimpleTokenizerV2
</div>

In [34]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof some unknown Place.


<div class="alert alert-block alert-warning"> 
We can make two noteworthy observations based on the token IDs and decoded text above.

First, the <|endoftext|> token is assigned a relatively large token ID, namely, 50256.

In fact, the BPE tokenizer, which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of 50,257,

with <|endoftext|> being assigned the largest token ID.

</div>

<div class="alert alert-block alert-warning"> 
Second, the BPE tokenizer above encodes and decodes unknown words, such as the run-together "terracesof" in the sample text, correctly.

The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens ?

</div>

 <div class="alert alert-block alert-warning"> 
The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters.

This enables it to handle out-of-vocabulary words.

So, thanks to the BPE algorithm, if the tokenizer encounters an unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters.
</div>
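
The merge idea behind BPE can be illustrated with a toy, character-level sketch. This is not how tiktoken is implemented (it works on bytes and uses the pretrained GPT-2 merges); the corpus, the helper names `most_frequent_pair` and `merge_pair`, and the number of merges are all assumptions chosen for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Tiny hypothetical corpus, each word starting as a list of characters
corpus = [list(w) for w in ["low", "lower", "lowest", "slow"]]

merges = []
for _ in range(2):  # learn two merges for illustration
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = [merge_pair(word, pair) for word in corpus]

print(merges)

# Segment a word that is not in the corpus by replaying the learned merges
unseen = list("flow")
for pair in merges:
    unseen = merge_pair(unseen, pair)
print(unseen)
```

The unseen word is never rejected: pieces that match learned merges become subword units, and whatever is left over stays as individual characters.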

**Let us take another simple example to illustrate how the BPE tokenizer deals with unknown tokens**

In [35]:
integers = tokenizer.encode("werva esd")
print(integers)



strings = tokenizer.decode(integers)
print(strings)

[86, 32775, 1658, 67]
werva esd


## CREATING INPUT-TARGET PAIRS

<div class="alert alert-block alert-success"> 
In this section we implement a data loader that fetches the input-target pairs using a sliding window approach.
</div>

<div class="alert alert-block alert-success"> 
To get started, we will first tokenize the whole <b> The Verdict </b> short story we worked with earlier using the BPE tokenizer
</div>

In [36]:
with open("text_data/the-verdict.txt","r",encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))


5145


<div class="alert alert-block alert-info"> 
Executing the code above  will return 5145, the total number of tokens in the training set, applying the BPE tokenizer.
</div>

<div class="alert alert-block alert-success"> 
Next, we remove the first 50 tokens from the dataset for demonstration purposes as it results in a slightly more interesting text passage in the next step
</div>

In [37]:
enc_sample = enc_text[50:]

<div class="alert alert-block alert-success"> 
One of the easiest and most intuitive ways to create the input-target pairs for the next word prediction task is to create two variables x and y where x contains the input tokens and y contains the target, which are the inputs shifted by 1:
</div>

<div class="alert alert-block alert-info"> 
The context size determines how many tokens are included in the input
</div>

In [38]:
context_size = 4 # length of the input

# A context_size of 4 means that the model looks at a sequence of 4 tokens
# to predict the next token in the sequence.
# The input x is the first 4 tokens [1,2,3,4], and the target y is the same window shifted by one [2,3,4,5]

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

In [39]:
print(f"x:{x}")
print(f"y:     {y}")

x:[290, 4920, 2241, 287]
y:     [4920, 2241, 287, 257]


<div class="alert alert-block alert-success"> 
Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:
</div>

In [40]:
for i in range(1,context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context,"---->", desired)


[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


<div class="alert alert-block alert-info"> 
Everything left of the arrow (---->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.

</div>

<div class="alert alert-block alert-success"> 
For illustration purposes, let's repeat the previous code but convert the token IDs into text:


</div>

In [41]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context),"---->", tokenizer.decode([desired]))
    

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


<div class="alert alert-block alert-warning"> 
We've now created the input-target pairs that we can use for LLM training in upcoming chapters.


</div>

<div class="alert alert-block alert-warning"> 
There's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

</div>

<div class="alert alert-block alert-warning"> 
In particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.


</div>

## IMPLEMENTING A DATA LOADER

<div class="alert alert-block alert-success"> 
For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes.


</div>

<div class="alert alert-block alert-info"> 

Step 1: Tokenize the entire text

Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length

Step 3: Return the total number of rows in the dataset

Step 4: Return a single row from the dataset

</div>

In [42]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self,txt,tokenizer,max_length,stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt,allowed_special={"<|endoftext|>"})


        # use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0,len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))




    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self,idx):
        return self.input_ids[idx], self.target_ids[idx]

<div class="alert alert-block alert-warning"> 

The GPTDatasetV1 class above is based on the PyTorch Dataset class.

It defines how individual rows are fetched from the dataset.

Each row consists of a number of token IDs (based on max_length) assigned to an input_chunk tensor.

The target_chunk tensor contains the corresponding targets.

I recommend reading on to see what the data returned from this dataset looks like when we combine the dataset with a PyTorch DataLoader; this will bring additional intuition and clarity.

</div>

<div class="alert alert-block alert-success"> 
The following code will use GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:
</div>

<div class="alert alert-block alert-info"> 

Step 1: Initialize the tokenizer

Step 2: Create dataset

Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training 

Step 4:  The number of CPU processes to use for preprocessing

</div>

In [43]:
def create_dataloader_v1(txt, batch_size=4,max_length=256,
                         stride=128, shuffle=True,drop_last=True, num_workers=0
                         ):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt,tokenizer,max_length,stride)


    # Create dataloader

    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )


    return dataloader
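
To make the effect of drop_last concrete, the number of batches can be worked out by hand. The sketch below mirrors the window loop of GPTDatasetV1 and the batch-count rule of a data loader in plain Python; the helper names `num_samples` and `num_batches` are hypothetical and not part of PyTorch.

```python
import math

def num_samples(num_tokens, max_length, stride):
    """Number of windows produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))

def num_batches(n_samples, batch_size, drop_last):
    """Batches per epoch: the incomplete final batch is either dropped or kept."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

# 5145 is the BPE token count of the story reported earlier in this notebook
n = num_samples(num_tokens=5145, max_length=256, stride=128)
print(n)
print(num_batches(n, batch_size=4, drop_last=True))
print(num_batches(n, batch_size=4, drop_last=False))
```

With the default arguments of create_dataloader_v1, the difference is a single shorter batch at the end of each epoch.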

<div class="alert alert-block alert-success"> 

Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4.

This will develop an intuition of how the GPTDatasetV1 class and the create_dataloader_v1 function work together:
</div>

In [44]:
with open("text_data/the-verdict.txt","r",encoding="utf-8") as f:
    raw_text = f.read()

<div class="alert alert-block alert-info"> 
Convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function
</div>

In [45]:
import torch
print("PyTorch version :",torch.__version__)

dataloader = create_dataloader_v1(
    raw_text,batch_size=1,max_length=4,stride=1,shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

PyTorch version : 2.4.1+cpu
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


<div class="alert alert-block alert-warning"> 
The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.

Since the max_length is set to 4, each of the two tensors contains 4 token IDs.

Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least <b> 256 </b>

</div>

<div class="alert alert-block alert-success"> 
To illustrate the meaning of stride=1, let's fetch another batch from this dataset:
</div>

In [46]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


<div class="alert alert-block alert-warning"> 
If we compare the first with the second batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch.

For example, the second ID in the first batch's input is 367, which is the first ID of the second batch's input.

The stride setting dictates the number of positions the inputs shift across batches, emulating a sliding window approach

</div>

<div class="alert alert-block alert-warning"> 
Batch sizes of 1, such as we have sampled from the data loader so far, are useful for illustration purposes.

If you have previous experience with deep learning, you may know that small batch sizes require less memory during training but lead to noisier model updates.

Just like in regular deep learning, the batch size is a trade-off and hyperparameter to experiment with when training LLMs.
</div>

<div class="alert alert-block alert-success"> 
Before we move on to the two final  sections of this chapter that are focused on creating the embeddings vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1:
</div>

In [47]:
dataloader = create_dataloader_v1(raw_text,batch_size=8,max_length=4,stride=4,shuffle=False)


data_iter = iter(dataloader)
inputs,target = next(data_iter)
print("Inputs:\n",inputs)
print("\nTargets:\n",target)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


<div class="alert alert-block alert-warning"> 
Note that we increase the stride to 4. This is to utilize the dataset fully (we don't skip a single word) but also avoid any overlap between the batches, since more overlap could lead to increased overfitting.
</div>
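
The relationship between stride and overlap can be checked with a short sketch. It reuses the same range expression as the loop in GPTDatasetV1; the function names and the toy sequence length of 13 are assumptions for illustration.

```python
def window_starts(num_tokens, max_length, stride):
    """Start indices of the input windows, mirroring the loop in GPTDatasetV1."""
    return list(range(0, num_tokens - max_length, stride))

def shared_tokens(max_length, stride):
    """Tokens that two consecutive input windows have in common."""
    return max(0, max_length - stride)

# Hypothetical toy sequence of 13 token positions
for stride in (1, 2, 4):
    starts = window_starts(num_tokens=13, max_length=4, stride=stride)
    print(f"stride={stride}: starts={starts}, overlap={shared_tokens(4, stride)}")
```

When the stride equals max_length, consecutive windows share no tokens while still covering the sequence without gaps.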

## CREATING TOKEN EMBEDDINGS

<div class="alert alert-block alert-success"> 
Let's illustrate how the token ID to embedding  vector conversion works with a hands-on example.

Suppose we have the following four input tokens with IDs 2,3,5 and 1
</div>

In [48]:
input_ids = torch.tensor([2,3,5,1])

<div class="alert alert-block alert-success"> 
For the sake of simplicity and illustration purposes, suppose we have a small vocabulary of only 6 words (instead of the 50,257 tokens in the BPE tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions):
</div>

<div class="alert alert-block alert-success"> 
Using the vocab_size and output_dim, we can instantiate an embedding layer in PyTorch, setting the random seed to 123 for reproducibility purposes:


</div>

In [49]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)

embedding_layer = torch.nn.Embedding(vocab_size,output_dim)

<div class="alert alert-block alert-info"> 
The print statement below prints the embedding layer's underlying weight matrix:
</div>

In [50]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


<div class="alert alert-block alert-info"> 
We can see that the weight matrix of the embedding layer contains small, random values. 
These values are optimized during LLM training as part of the LLM optimization itself, as we will see in upcoming chapters. 
Moreover, we can see that the weight matrix has six rows and three columns. There is one row for each of
the six possible tokens in the vocabulary. And there is one column for each of the three embedding dimensions.
</div>

<div class="alert alert-block alert-success"> 
After instantiating the embedding layer, let's now apply it to a token ID to obtain the embedding vector:
</div>

In [51]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


<div class="alert alert-block alert-info"> 
If we compare the embedding vector for token ID 3 to the previous embedding matrix, we see that it is identical to the 4th row (Python starts with a zero index, so it's the row corresponding to index 3).
In other words, the embedding layer is essentially a look-up operation that retrieves rows from the embedding layer's weight matrix via a token ID.
</div>
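
The look-up view can be verified without PyTorch. The sketch below copies the printed weight matrix into a plain list of lists (rounded to four decimals, as printed) and compares row selection with multiplying a one-hot vector by the matrix; the variable names are illustrative assumptions.

```python
# Weight matrix copied from the printed embedding layer above (6 tokens x 3 dimensions)
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

token_id = 3

# Lookup view: select the row with index token_id
looked_up = weights[token_id]

# One-hot view: multiply a one-hot row vector by the weight matrix
one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]
multiplied = [
    sum(one_hot[i] * weights[i][d] for i in range(len(weights)))
    for d in range(len(weights[0]))
]

print(looked_up)
print(multiplied)
```

Both views give the same vector; the look-up simply skips the multiplications by zero, which is why an embedding layer is the more efficient formulation.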

In [52]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


<div class="alert alert-block alert-info"> 
Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix
</div>

## POSITIONAL EMBEDDINGS (ENCODING WORD POSITIONS)

<div class="alert alert-block alert-success">

Previously, we focused on very small embedding sizes in this chapter for illustration purposes.

We now consider more realistic and useful embedding sizes and encode the input tokens into a 256-dimensional vector representation.

This is smaller than what the original GPT-3 model used (in GPT-3, the embedding size is 12,288 dimensions) but still reasonable for experimentation.

Furthermore, we assume that the token IDs were created by the BPE tokenizer that we used earlier, which has a vocabulary size of 50,257:

 </div>

In [53]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size,output_dim)

<div class="alert alert-block alert-info">

Using the token_embedding_layer above, if we sample data from  the data loader, we embed each token in each batch into a 256-dimensional vector.

So for batch size of 8 with four tokens each, the result will be an 8 x 4 x 256 tensor.
 </div>

<div class="alert alert-block alert-success">
Let's instantiate the data loader (Data sampling with a sliding window), first
 </div>

In [54]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text,batch_size=8,max_length=max_length,stride=max_length,shuffle=False
)

data_iter = iter(dataloader)

inputs,targets = next(data_iter)




In [55]:
print("Token IDs:\n",inputs)
print("\nInputs shape:\n",inputs.shape)


Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


<div class="alert alert-block alert-info">
As we can see, the token ID tensor is 8x4-dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each.
 </div>

<div class="alert alert-block alert-success">
Let's now use the embedding layer to embed these token IDs into 256-dimensional vectors:
 </div>

In [56]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


<div class="alert alert-block alert-info">
As we can tell based on the 8x4x256-dimensional tensor output, each token ID is now embedded as a 256-dimensional vector.
 </div>

<div class="alert alert-block alert-success">
For a GPT model's absolute embedding approach, we just need to create another embedding layer that has the same embedding dimension as the token_embedding_layer:
 </div>

In [57]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length,output_dim)

In [58]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])
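
<div class="alert alert-block alert-info">
The positional embeddings are typically combined with the token embeddings by simple addition. The following is a minimal sketch of that step, assuming random stand-in token IDs instead of the batch from the data loader; the name input_embeddings is an assumption introduced for this illustration.
</div>

```python
import torch

torch.manual_seed(123)

vocab_size, output_dim = 50257, 256
batch_size, context_length = 8, 4

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# Random stand-in token IDs (assumption: in the notebook these come from the data loader)
inputs = torch.randint(0, vocab_size, (batch_size, context_length))

token_embeddings = token_embedding_layer(inputs)                     # shape [8, 4, 256]
pos_embeddings = pos_embedding_layer(torch.arange(context_length))   # shape [4, 256]

# Broadcasting adds the same 4 x 256 positional tensor to every sample in the batch
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)
```

<div class="alert alert-block alert-info">
The result keeps the 8 x 4 x 256 shape of the token embeddings, since the positional tensor is broadcast across the batch dimension.
</div>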


## IMPLEMENTING A SIMPLIFIED ATTENTION MECHANISM

<div class="alert alert-block alert-success">
Consider the following input sentence, which has already been embedded into 3-dimensional vectors.

We choose a small embedding dimension for illustration purposes to ensure it fits on the page without line breaks:
 </div>

In [1]:
import torch

inputs = torch.tensor(
    [
        [0.43,0.15,0.89],   # Your  (x^1)
        [0.55,0.87,0.66],   # journey  (x^2)
        [0.57,0.85,0.64],   # starts  (x^3)
        [0.22,0.58,0.33],   # with   (x^4)
        [0.77,0.25,0.10],   # one  (x^5)
        [0.05,0.80,0.55]    # step  (x^6)
    ]
)

<div class="alert alert-block alert-info">
Each row represents a word, and each column represents an embedding dimension
 </div>

<div class="alert alert-block alert-info">
The second input token serves as the query:
 </div>

In [2]:
query = inputs[1]  # 2nd input token is the query

attn_scores_2 = torch.empty(inputs.shape[0])
for i,x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i,query) # dot product (transpose not necessary)

print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


<div class="alert alert-block alert-success">
In the next step, we normalize each of the attention scores that we computed previously.
 </div>

<div class="alert alert-block alert-success">
The main goal behind the normalization is to obtain attention weights that sum up to 1.

The normalization is a convention that is useful for interpretation and for maintaining training stability in an LLM.

Here's a straightforward method for achieving this normalization step:
 </div>

In [3]:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()

print("Attention weights :",attn_weights_2_tmp)
print("Sum:",attn_weights_2_tmp.sum())


Attention weights : tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)


<div class="alert alert-block alert-info">
In practice, it's more common and advisable to use the softmax function for normalization.

This approach is better at managing extreme values and offers more favorable gradient properties during training.

Below is a basic implementation of the softmax function for normalizing the attention scores:
 </div>

In [4]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)

print("Attention weights : ",attn_weights_2_naive)
print("Sum:",attn_weights_2_naive.sum())

Attention weights :  tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


<div class="alert alert-block alert-info">
As the output shows, the softmax function also meets the objective and normalizes the attention weights such that they sum to 1.
 </div>

<div class="alert alert-block alert-warning">
In addition, the softmax function ensures that the attention weights are always positive. This makes the output interpretable as probabilities or relative importance, where higher weights indicate greater importance.
 </div>

<div class="alert alert-block alert-warning">
Note that this naive softmax implementation (softmax_naive) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values.

Therefore, in practice, it's advisable to use the PyTorch implementation of softmax:
 </div>
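
<div class="alert alert-block alert-info">
To make the overflow problem concrete, here is a small sketch. The function softmax_stable and the tensor large_scores are assumptions introduced for this illustration; subtracting the maximum before exponentiating is a common stabilization trick, not necessarily the exact way PyTorch implements softmax internally.
</div>

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Subtracting the maximum leaves the result mathematically unchanged,
    # but keeps every exponent <= 0, so torch.exp cannot overflow
    shifted = x - x.max()
    return torch.exp(shifted) / torch.exp(shifted).sum(dim=0)

# Hypothetical scores that are far larger than the ones in this notebook
large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

print("naive:  ", softmax_naive(large_scores))
print("stable: ", softmax_stable(large_scores))
print("PyTorch:", torch.softmax(large_scores, dim=0))
```

<div class="alert alert-block alert-info">
For these inputs, torch.exp overflows to infinity in 32-bit floats, so the naive version returns nan values, whereas the shifted version and torch.softmax agree with each other.
</div>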

In [5]:
attn_weights_2  = torch.softmax(attn_scores_2,dim=0)
print("Attention weights :",attn_weights_2)
print("Sum:",attn_weights_2.sum())

Attention weights : tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


<div class="alert alert-block alert-info">
In this case, we can see that it yields the same results as our previous softmax_naive function.
 </div>

In [6]:
query = inputs[1] # 2nd input token is the query

context_vec_2 = torch.zeros(query.shape)

for i,x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i]*x_i 

print(context_vec_2)

tensor([0.4419, 0.6515, 0.5683])


In [7]:
import matplotlib.pyplot as plt 
from mpl_toolkits.mplot3d import Axes3D


<div class="alert alert-block alert-success">
Now, we can extend this computation to calculate attention weights and context vectors for all inputs.
 </div>

<div class="alert alert-block alert-success">
First, we add an additional for-loop to compute the dot products for all pairs of inputs.
 </div>

In [8]:
attn_scores = torch.empty(6,6)

for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i,j] = torch.dot(x_i,x_j)

print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


<div class="alert alert-block alert-success">
Each element in the preceding tensor represents an attention score between each pair of inputs.
 </div>

<div class="alert alert-block alert-success">
When computing the preceding attention score tensor, we used for-loops in Python.

However, for-loops are generally slow, and we can achieve the same results using matrix multiplication:
 </div>

In [9]:
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


<div class="alert alert-block alert-success">
We now normalize each row so that the values in each row sum to 1:

 </div>

In [10]:
attn_weights = torch.softmax(attn_scores,dim=-1)
print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


<div class="alert alert-block alert-warning">

In the context of using PyTorch, the `dim` parameter in functions like `torch.softmax` specifies the dimension of the input tensor along which the function will be computed.

By setting `dim=-1`, we are instructing the softmax function to apply the normalization along the last dimension of the `attn_scores` tensor.

If `attn_scores` is a 2D tensor (for example, with a shape of `[rows, columns]`), `dim=-1` will normalize across the columns so that the values in each row (summing over the column dimension) sum up to 1.
</div>
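
<div class="alert alert-block alert-info">
The effect of the `dim` argument can be checked directly. The sketch below re-creates the input tensor so that it runs on its own; the names row_normalized, col_normalized and all_context_vecs are assumptions introduced for this illustration.
</div>

```python
import torch

inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],   # Your     (x^1)
        [0.55, 0.87, 0.66],   # journey  (x^2)
        [0.57, 0.85, 0.64],   # starts   (x^3)
        [0.22, 0.58, 0.33],   # with     (x^4)
        [0.77, 0.25, 0.10],   # one      (x^5)
        [0.05, 0.80, 0.55],   # step     (x^6)
    ]
)

attn_scores = inputs @ inputs.T

row_normalized = torch.softmax(attn_scores, dim=-1)  # each row sums to 1
col_normalized = torch.softmax(attn_scores, dim=0)   # each column sums to 1

print("Row sums:   ", row_normalized.sum(dim=-1))
print("Column sums:", col_normalized.sum(dim=0))

# With row-normalized weights, one matrix multiplication yields all context vectors
all_context_vecs = row_normalized @ inputs
print(all_context_vecs[1])
```

<div class="alert alert-block alert-info">
The second row of all_context_vecs should match the context vector computed earlier for the second input token with the explicit for-loop.
</div>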


## IMPLEMENTING SELF ATTENTION WITH TRAINABLE WEIGHTS

In [11]:
import torch

inputs = torch.tensor(
    [
        [0.43,0.15,0.89],   # Your  (x^1)
        [0.55,0.87,0.66],   # journey  (x^2)
        [0.57,0.85,0.64],   # starts  (x^3)
        [0.22,0.58,0.33],   # with   (x^4)
        [0.77,0.25,0.10],   # one  (x^5)
        [0.05,0.80,0.55]    # step  (x^6)
    ]
)

<div class="alert alert-block alert-success">
Let's begin by defining a few variables:
</div>




<div class="alert alert-block alert-info">
#A The second input element

#B The input embedding size, d_in=3

#C The output embedding size, d_out=2
</div>


In [12]:
x_2 = inputs[1] #A

d_in = inputs.shape[1]  #B

d_out = 2  #C


<div class="alert alert-block alert-info">
Note that in GPT-like models, the input and output dimensions are usually the same.

But for illustration purposes, to better follow the computation, we choose different input (d_in=3) and output (d_out=2) dimensions here.
</div>



<div class="alert alert-block alert-success">
Next, we initialize the three weight matrices W_query, W_key, and W_value:
</div>


In [13]:
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in,d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in,d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in,d_out), requires_grad=False)



In [14]:
print(W_query)

Parameter containing:
tensor([[0.2961, 0.5166],
        [0.2517, 0.6886],
        [0.0740, 0.8665]])


In [15]:
print(W_key)

Parameter containing:
tensor([[0.1366, 0.1025],
        [0.1841, 0.7264],
        [0.3153, 0.6871]])


In [16]:
print(W_value)

Parameter containing:
tensor([[0.0756, 0.1966],
        [0.3164, 0.4017],
        [0.1186, 0.8274]])


#### Next, we compute the query, key, and value vectors as shown earlier

In [17]:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key 
value_2 = x_2 @ W_value 
print(query_2)

tensor([0.4306, 1.4551])


#### As we can see based on the output for the query, this results in a 2-dimensional vector.

#### This is because we set the number of columns of the corresponding weight matrix, via d_out, to 2.

In [18]:
# We can obtain all keys, values, and queries via matrix multiplication

In [19]:
keys = inputs @ W_key
values = inputs @ W_value
queries = inputs @ W_query

print("Keys.shape:", keys.shape)
print("values.shape:", values.shape)
print("queries.shape:", queries.shape)



Keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])
queries.shape: torch.Size([6, 2])


<div class="alert alert-block alert-success">
First, let's compute the attention score ω22:
</div>

In [20]:
keys_2 = keys[1]  # key vector of the second input token (Python indexing starts at 0)
attn_scores_22 = query_2.dot(keys_2)
print(attn_scores_22)


tensor(1.8524)


<div class="alert alert-block alert-success">
Again, we can generalize this computation to all attention scores via matrix multiplication:
</div>

In [21]:
attn_scores_2 = query_2 @ keys.T # All attention scores for given query
print(attn_scores_2)

tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])


In [22]:
attn_scores = queries @ keys.T # omega
print(attn_scores)

tensor([[0.9231, 1.3545, 1.3241, 0.7910, 0.4032, 1.1330],
        [1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440],
        [1.2544, 1.8284, 1.7877, 1.0654, 0.5508, 1.5238],
        [0.6973, 1.0167, 0.9941, 0.5925, 0.3061, 0.8475],
        [0.6114, 0.8819, 0.8626, 0.5121, 0.2707, 0.7307],
        [0.8995, 1.3165, 1.2871, 0.7682, 0.3937, 1.0996]])


In [23]:
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2/ d_k**0.5, dim=-1)

print(attn_weights_2)
print(d_k)

tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])
2


## WHY DIVIDE BY SQRT (DIMENSION)

<div class="alert alert-block alert-warning">
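Reason 1: For stability in learning. Dividing by the square root of the key dimension keeps the magnitude of the dot products from growing with the embedding dimension.

Reason 2: The softmax function is sensitive to the magnitude of its inputs. When the scores are large, the highest value receives almost all of the probability mass, and the attention weights become sharply peaked.
</div>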
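
<div class="alert alert-block alert-info">
Both reasons can be illustrated with a small experiment. This is a sketch under stated assumptions: the example scores, the helper dot_product_variance and the use of random vectors with zero-mean, unit-variance entries are all introduced here for illustration and are not taken from the notebook.
</div>

```python
import torch

torch.manual_seed(123)

# Hypothetical scores: the same values, once as-is and once multiplied by 8
scores = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
soft = torch.softmax(scores, dim=-1)
sharp = torch.softmax(scores * 8, dim=-1)
print("softmax(scores):    ", soft)
print("softmax(scores * 8):", sharp)

def dot_product_variance(dim, num_trials=10_000):
    # Dot products of random vectors whose entries have zero mean and unit variance
    q = torch.randn(num_trials, dim)
    k = torch.randn(num_trials, dim)
    dots = (q * k).sum(dim=-1)
    # Variance before and after dividing by sqrt(dim)
    return dots.var().item(), (dots / dim**0.5).var().item()

for dim in (5, 100):
    before, after = dot_product_variance(dim)
    print(f"dim={dim}: variance before scaling {before:.2f}, after scaling {after:.2f}")
```

<div class="alert alert-block alert-info">
Larger scores push the softmax output towards a single dominant entry. Under the assumptions above, the variance of the unscaled dot products grows roughly in proportion to the dimension, while the scaled version stays close to 1.
</div>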

In [24]:
context_vec_2 = attn_weights_2 @ values  # weights for the second query only
print(context_vec_2)

tensor([0.3061, 0.8210])


<div class="alert alert-block alert-success">
So far, we only computed a single context vector, z(2).

In the next section, we will generalize the code to compute all context vectors in the input sequence, z(1) to z(T).
</div>

## IMPLEMENTING A COMPACT SELF ATTENTION PYTHON CLASS

<div class="alert alert-block alert-success">
In the previous sections, we have gone through a lot of steps to compute the self-attention outputs.

This was mainly done for illustration purposes so we could go through one step at a time.

In practice, with the LLM implementation in the next chapter in mind, it is helpful to organize this code into a Python class:
</div>

In [25]:
import torch.nn as nn

class SelfAttention_v1(nn.Module):

    def __init__(self,d_in,d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in,d_out))
        self.W_key = nn.Parameter(torch.rand(d_in,d_out))
        self.W_value = nn.Parameter(torch.rand(d_in,d_out))


    def forward(self,x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value

        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(
            attn_scores/keys.shape[-1]**0.5, dim=-1
            )
        
        context_vec = attn_weights @ values
        return context_vec
    


<div class="alert alert-block alert-success">
In this PyTorch code, SelfAttention_v1 is a class derived from nn.Module, which is a fundamental building block of PyTorch models and provides the necessary functionalities for model layer creation and management.

The `__init__` method initializes trainable weight matrices (W_query, W_key, and W_value) for queries, keys, and values, each transforming the input dimension d_in to an output dimension d_out.

In the forward pass, using the forward method, we compute the attention scores (attn_scores) by multiplying queries and keys, normalize these scores using softmax, and then compute the context vectors by weighting the values with the normalized attention weights.
</div>

In [26]:
torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in,d_out)
print(sa_v1(inputs))

tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)


In [27]:
class SelfAttention_v2(nn.Module):
    def __init__(self,d_in,d_out,qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in,d_out,bias=qkv_bias)
        self.W_key = nn.Linear(d_in,d_out,bias=qkv_bias)
        self.W_value = nn.Linear(d_in,d_out,bias=qkv_bias)


    def forward(self,x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5,dim=-1)

        context_vec = attn_weights @ values 
        return context_vec


In [28]:
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)
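
<div class="alert alert-block alert-info">
SelfAttention_v2 replaces the nn.Parameter matrices with nn.Linear layers. The sketch below, which uses random stand-in inputs and names (x, linear, manual) assumed only for this illustration, shows how an nn.Linear layer without bias relates to a plain matrix multiplication.
</div>

```python
import torch
import torch.nn as nn

torch.manual_seed(789)

d_in, d_out = 3, 2
x = torch.rand(6, d_in)  # stand-in inputs, not the notebook's embedded sentence

linear = nn.Linear(d_in, d_out, bias=False)

# nn.Linear stores its weight with shape (d_out, d_in) and computes x @ weight.T
print(linear.weight.shape)

manual = x @ linear.weight.T
print(torch.allclose(linear(x), manual))
```

<div class="alert alert-block alert-info">
Note that nn.Linear also uses a different weight initialization scheme than torch.rand, which is why SelfAttention_v1 and SelfAttention_v2 produce different outputs even though they implement the same computation.
</div>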


## HIDING FUTURE WORDS WITH CAUSAL ATTENTION

<div class="alert alert-block alert-success">
Let's work with the attention scores and weights from the previous section to code the causal attention mechanism.
</div>

In [29]:
queries = sa_v2.W_query(inputs)  #A
keys = sa_v2.W_key(inputs)
attn_scores = queries @ keys.T 
attn_weights = torch.softmax(attn_scores/keys.shape[-1]**0.5,dim=1)
print(attn_weights)

tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)


<div class="alert alert-block alert-success">
We can now use PyTorch's tril function to create a mask where the values above the diagonal are zero:
</div>

In [30]:
context_length = attn_scores.shape[0]
mask_simple  = torch.tril(torch.ones(context_length,context_length))
print(mask_simple)


tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])


<div class="alert alert-block alert-success">
Now, we can multiply this mask with the attention weights to zero out the values above the diagonal:
</div>

In [31]:
masked_simple = attn_weights*mask_simple
print(masked_simple)

tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<MulBackward0>)


<div class="alert alert-block alert-info">
As we can see, the elements above the diagonal are successfully zeroed out.
</div>

<div class="alert alert-block alert-success">
The third step is to renormalize the attention weights to sum up to 1 again in each row.
We can achieve this by dividing each element in each row by the sum in each row:
</div>

In [32]:
row_sums = masked_simple.sum(dim=1,keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<DivBackward0>)


<div class="alert alert-block alert-info">
The result is an attention weight matrix where the attention weights above the diagonal are zeroed out and where the rows sum to 1.
</div>

In [33]:
mask = torch.triu(torch.ones(context_length,context_length),diagonal=1)
masked = attn_scores.masked_fill(mask.bool(),-torch.inf)
print(masked)

tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
        [0.4594, 0.1703, 0.1731,   -inf,   -inf,   -inf],
        [0.2642, 0.1024, 0.1036, 0.0186,   -inf,   -inf],
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,   -inf],
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
       grad_fn=<MaskedFillBackward0>)


In [34]:
# Now, all we need to do is apply the softmax function to these masked results, and we are done.

In [35]:
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5,dim=1)
print(attn_weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)


<div class="alert alert-block alert-info">
As we can see based on the output, the values in each row sum to 1, and no further normalization is necessary.
</div>
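
<div class="alert alert-block alert-info">
The two masking approaches shown in this section should give the same attention weights. The following sketch checks this on random stand-in scores; the tensor values and the names renormalized and causal_weights are assumptions introduced for this illustration.
</div>

```python
import torch

torch.manual_seed(123)

num_tokens, d_k = 6, 2
attn_scores = torch.rand(num_tokens, num_tokens)  # stand-in attention scores

# Approach 1: softmax, zero out values above the diagonal, then renormalize each row
weights = torch.softmax(attn_scores / d_k**0.5, dim=-1)
masked_weights = weights * torch.tril(torch.ones(num_tokens, num_tokens))
renormalized = masked_weights / masked_weights.sum(dim=-1, keepdim=True)

# Approach 2: fill positions above the diagonal with -inf, then apply softmax once
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1)
masked_scores = attn_scores.masked_fill(mask.bool(), -torch.inf)
causal_weights = torch.softmax(masked_scores / d_k**0.5, dim=-1)

print(torch.allclose(renormalized, causal_weights))
print(causal_weights.sum(dim=-1))
```

<div class="alert alert-block alert-info">
The second approach needs one step fewer because softmax maps the -inf entries to zero and normalizes over the remaining positions in the same operation.
</div>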

## MASKING ADDITIONAL ATTENTION WEIGHTS WITH DROPOUT

<div class="alert alert-block alert-info">
In the following code example, we use a dropout rate of 50%, which means masking out half of the attention weights.

When we train the GPT model, we will use a lower dropout rate, such as 0.1 or 0.2.
</div>

In [36]:
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)  # dropout rate of 50%
example = torch.ones(6,6)  # a 6x6 matrix of ones
print(dropout(example))

tensor([[2., 2., 2., 2., 2., 2.],
        [0., 2., 0., 0., 0., 0.],
        [0., 0., 2., 0., 2., 0.],
        [2., 2., 0., 0., 0., 2.],
        [2., 0., 0., 0., 0., 2.],
        [0., 2., 0., 0., 0., 0.]])


<div class="alert alert-block alert-info">
When applying dropout to an attention weight matrix with a rate of 50%, on average half of the elements in the matrix are randomly set to zero.
To compensate for the reduction in active elements, the values of the remaining elements in the matrix are scaled up by a factor of 1/0.5 = 2.

This scaling is crucial to maintain the overall balance of the attention weights, ensuring that the average influence of the attention mechanism remains consistent during both the training and inference phases.

</div>
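
<div class="alert alert-block alert-info">
Dropout is only active in training mode. The sketch below contrasts training and evaluation mode on a matrix of ones; the names train_out and eval_out are assumptions introduced for this illustration.
</div>

```python
import torch

torch.manual_seed(123)

dropout = torch.nn.Dropout(0.5)
example = torch.ones(6, 6)

dropout.train()                  # training mode: dropout is active
train_out = dropout(example)
# Surviving entries are scaled by 1 / (1 - 0.5) = 2, dropped entries are 0
print(train_out.unique())

dropout.eval()                   # evaluation mode: dropout does nothing
eval_out = dropout(example)
print(torch.equal(eval_out, example))
```

<div class="alert alert-block alert-info">
Switching a model with model.eval() therefore disables dropout in all of its nn.Dropout layers, so no rescaling is needed at inference time.
</div>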

In [37]:
# We need to ensure that the code can handle batches consisting of more than one input

<div class="alert alert-block alert-info">
The batch contains 2 inputs with 6 tokens each, and each token has an embedding dimension of 3:
</div>

In [38]:
batch = torch.stack((inputs,inputs),dim=0)
print(batch.shape)

torch.Size([2, 6, 3])


In [39]:
class CausalAttention(nn.Module):

    def __init__(self,d_in,d_out,context_length,dropout,qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query =  nn.Linear(d_in,d_out,bias=qkv_bias)
        self.W_key =  nn.Linear(d_in,d_out,bias=qkv_bias)
        self.W_value =  nn.Linear(d_in,d_out,bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # New
        self.register_buffer('mask',torch.triu(torch.ones(context_length,context_length),diagonal=1)) 

    def forward(self,x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)


        attn_scores = queries @ keys.transpose(1,2) # Changed transpose
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens,:num_tokens],-torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim = -1
        )

        attn_weights = self.dropout(attn_weights) # New

        context_vec = attn_weights @ values
        return context_vec
        

In [40]:
torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out,context_length,0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)

context_vecs.shape: torch.Size([2, 6, 2])
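
<div class="alert alert-block alert-info">
One way to convince ourselves that the class really hides future tokens is to change only the last token and check that the context vectors of all earlier positions stay the same. The sketch below repeats the class definition so that it runs on its own; the random inputs and the names x_changed, out and out_changed are assumptions introduced for this illustration.
</div>

```python
import torch
import torch.nn as nn

class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)
        # Slice the mask so that sequences shorter than context_length also work
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf
        )
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        return attn_weights @ values

torch.manual_seed(123)
ca = CausalAttention(d_in=3, d_out=2, context_length=6, dropout=0.0)

x = torch.rand(1, 6, 3)           # stand-in batch with a single sequence
x_changed = x.clone()
x_changed[0, -1] = torch.rand(3)  # modify only the last token

out = ca(x)
out_changed = ca(x_changed)

# Context vectors for the first five positions do not depend on the last token
print(torch.allclose(out[0, :5], out_changed[0, :5]))

# A shorter sequence is handled as well because the mask is sliced to num_tokens
print(ca(x[:, :4]).shape)
```

<div class="alert alert-block alert-info">
Registering the mask with register_buffer means it is moved together with the module when calling .to(device), without being treated as a trainable parameter.
</div>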
