# Step 1: Creating Tokens

In [57]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read() #storing the content form verdict.txt file

print("total number of characters: ", len(raw_text))
print(raw_text[:99]) #prints the first 100 characters of the file

total number of characters:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


our goal is to convert the 20479 characters into individual words and that we can turn into embeddigs for LLM training.


now the question is how can we best split this text to obtain a list of tokens?
for this we will use python's regular expression library and then split the text based on the white space or punctuations into individual tokens.

In [58]:
import re

text = "Hello, world. This, is a test."
result= re.split(r'(\s)', text) #splits wherever whitespaces are encountered.

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


from the above code we can see that the result is a list of individual words, whitespaces, and punctuations. Now lets modify the regular expression such that it splits on whitespaces (\s) and commas, and period.

In [59]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


 a small issue that we encounter here is that the list still includes the whitespace characters. we have to remove them , which is as follows.

In [60]:
result = [item for item in result if item.strip()] #scans each item in the result and removes whitespace
#item.strip() will only return true if there is a word or punctuation else return false and it will not print
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


REMOVING WHITESPACES OR NOT?

When developing a simple tokenizer , whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. 

Advantages of removing white spaces is that it reduces the memory and computing requirements. However keeping them can be useful if we train models that are sensitive to the exact structure of the text. (for example , python code , which is sensitive to indentation and spacing)

The tokenization scheme that we have used above is well enough but the input text can contain various other things such as question marks, quotation marks , double -dashes etc so we will again modify the splitting criteria based on the nature of this dataset.

In [61]:
text = "Hello, world!. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)

result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '!', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [62]:
#strip whitespace from each item and then filter out any empty string

result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '!', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [63]:
text = "Hello, world!. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)

result = [item.strip() for item in result if item.strip()] 
#the first item.strip() checks for whitespaces at the beginning of the sentence.
print(result)

['Hello', ',', 'world', '!', '.', 'Is', 'this', '--', 'a', 'test', '?']


### Now let's apply this tokenizer to the our raw data.

apply this tokenizer to or data and then store it to a variable named preprocessed.

In [64]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [65]:
print(len(preprocessed)) #this prints the length of the entire preprocessed token.

4690


we have successully tokenized the entire dataset that we had and now we proceed to the second step where we assign ids to the tokens because machines cannot understand the tokens directly we have to assign IDs to the tokens.

# Step 2: Creating Token IDs

in this step we will sort the tokens in the preprocessed variable in alphabetical order and then determine the vocabulary size.

In [66]:
all_words = sorted(set(preprocessed))  #converting it into a set and then sorting in the set
vocab_size = len(all_words)

print(vocab_size)

1130


here the number is less as compared to the tokens because the vocab size is the count of only the unique toekns that are present in the preprocessed variable.


now assigning this to vocabulary where vocabulary is like a dictionary of tokens and their associated token IDs.

In [67]:
vocab = {token:integer for integer, token in enumerate(all_words)}
#this will assign integer to each and every unique token.

In [68]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i>=50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


In [69]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()} #needed for the decoder part to convert num to token

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed] #converting tokens into token IDs.
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids]) #using reverse dictionary to convert token IDs to tokens
        #replace spaces before specified punctuations , so  that it makes a perfect sentence.
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [70]:
#trying th etokenizer class that we have created by taking a sub part of the dataset for testing
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
            Mrs. Gisburn said with pardonable pride."""

ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


our tokenizer has converted the tokens into token IDs now let's test our decoeder whethter it can convert the token IDs back to tokens.

In [71]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

from the above result we can see that we have successfully converted the tokens to token IDs and tokenIDs back to tokens from the subset of the training data. Now let's move further with it. What if we provide it with a sentence which is not present in the dataset?


In [72]:
#testing it with some words which are not present in the already available dataset.

text = "Hello , do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

we get an error for the above sentence because we don't have the word Hello in our dataset and from this we get to know that we need to consider large and diverse training sets to exxtend the vocabulary when working on LLMs.

# Adding Special Context Tokens

in the previous section we have implemented a simple tokenizer and which when tested to tokenize a word which was not present in the trainig data it gave an error. So in this section we will modify the tokenizer to handle unknown words. I particular, we will modify the vocabulary and tokenizer we implemented in the previous section, here we will implement the version 2 of SimpleTokenizer to handle the unknown tokens.

we can modify the tokenizer to use an <|unk|> token if it encounters a word that is not a part of the voocabulary. Furhtermore we add a token between unrelated tasks. 
For example , when training GPT-like LLMs multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source.


modifying the vocabulary to include these two special tokens , <|endoftext|> and <|unk|> to the existing vocabulary. Previously the size of the vocabulary was 1130 and after addding this two tokens it would increase and become 1132.

In [91]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer, token in enumerate(all_tokens)}

In [92]:
len(vocab.items())

1132

In [93]:
#for checking our modification we are printing the last 5 entries of the vocabulary.
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [94]:
#now further we will extend the simple tokenizer class with this.

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed #scans the entire text and if comes across unknown word 
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        #replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [95]:
tokenizer = SimpleTokenizerV2(vocab)

text1= "Hello, do you like tea?"
text2= "In the sunlit terraces of the palace."

text = "<|endoftext|>".join((text1, text2))
print(text)

Hello, do you like tea?<|endoftext|>In the sunlit terraces of the palace.


In [96]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1131, 988, 956, 984, 722, 988, 1131, 7]

from the above results we can see that since hello was not present in our vocabulary it printed the token id of unk and for the endoftext also it did the same thing.

In [97]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|unk|> the sunlit terraces of the <|unk|>.'

based on the above detokenized words we can see that two words "Hello" and "palace" were not present in our vocabulary it were replaced with <|unk|> in the decoder part.

apart from this two tokens we also have other tokens which people use like 
1. [BOS] beginning of sequence - it marks the starting of the text. it signifies to the LLM where a piece of content begins.
2. [EOS] end of sequence - it is positioned at the end of the text and is useful when concatenating multiple unrelated text.
3. [PAD] padding - when training LLMs with batch sizes larger than one, the batch might contain texts of varying length . To ensure all texts have the same length the shorter texts are padded using the [PAD] token , up to the length of the longest text in the batch.

# Byte-Pair Encoding

byte pair encoding is used in GPT-2 and GPT-3 and the original model.


since implementing BPE is complicated we will be using an  existing python open-source library called tiktoken. OpenAI themselves use this library to convert the raw text into tokens.

In [98]:
import importlib
import tiktoken

print("Tiktoken version: ", importlib.metadata.version("tiktoken"))

Tiktoken version:  0.12.0


In [99]:
tokenizer = tiktoken.get_encoding("gpt2")

the usage of the above tokenizer is similar to what we have developed in the above setion named the simpletoeknizerv1 and simpletokenizerv2 .  but here the entire thing is done in a single line of code.

In [100]:
#lets see the working of this using a simple example
text = (
    "Hello, do you like tea? <|endoftext|> In the sunnlit terraces"
    "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 77, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [101]:
#now we can convert this token ids back into words using the decode method .
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunnlit terracesof someunknownPlace.


# Creating Input-Target Pairs

in this section we are going to implement a data loader that fetches the input-target pairs using a sliding window approach. 

to get started, we will first tokenize the whole The Verdict short story(the dataset we have been using for the earlier training.) we worked with earlier using the BPE tokenization.

In [102]:
with open("the-verdict.txt", "r", encoding="utf-8") as f: #reads the entire dataset
    raw_text = f.read()  #stores it in variable named raw_text

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


the above output shows that there are total of 5145 encoded tokens , voacbulary size is 5145 in the training set after applying the BPE tokenizer.

now we will see the first 50 tokens from the dataset for demonstration as it results in a slightly more interesting text passage in the further steps. 

In [103]:
enc_sample = enc_text[50:]

Now one of the most easiest way to create input-target pairs for the nextword prediction task is to create two variables, x and y , where x contains input tokens and y contains the targets, which are the inputs shifted by 1: 

the context size determines how many tokens are included in the input.

In [104]:
context_size = 4 #length of the input
# The context_size of 4 menas that the model is trained to look at a sequence of 
# 4 words (or tokens ) to predict the next word in the sequence.
# the input X is the first 4 tokens [1,2,3,4] and the target y is the next 4 tokens
#[2,3,4,5]

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:

In [105]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


everything on the left of the arrow (---->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represent what the LLM is supposed to predict.

for demonstration purposes, let's repeat the previous code but convert the token IDs into text:

In [106]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


we have now created input-target pairs that we can turn into use for the LLM training in the further chapters.

now there's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

in particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLm to predict

# Implementing a Data Loader

for a efficient data loader implementation, we will use pytorch's built-in dataset and data Loader classes.

Setps for the same :
1. Tokenize the entire text
2. use a sliding window to chunk the book into overlapping sequences of max_length
3. return the total number of rows in the dataset
4. return a single row from the dataset.

In [107]:
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self,txt, tokenizer, max_length,stride):
        self.input_ids = []
        self.target_ids = []

        #tokenize the entire text
        token_ids = tokenizer.encode(text, allowed_special = {"<|endoftext|>"})

        #use a sliding window to chunk the book into overlapping sequences of max_length 
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i+1: i+ max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))


    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

the gptDatasetV1 class in listing 2.5 is based on the pytorch dataset class.
it defines how individual rows are fetched from the dataset. Each row consists of a number of token IDs (based on max_length) assigned to an input_chunk tensor. 
the target_chunk tensor contains the corresponding targets.


Since our data is ready we will feed this data into the data loader. we will write a code that will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader.

Below are the steps for the same:
1. initialize the tokenizer
2. create dataset
3. drop_last = True drops the last bacth if it is shorter than the specified batch_size to prevent loss_spikes  during training.
4. the number of CPU processes to use for preprocessing

In [108]:
def create_dataloader_v1(
    txt,                    # raw training corpus as a single string
    batch_size=4,            # how many sequences per batch
    max_length=256,          # length (in tokens) of each training sample
    stride=128,              # overlap between consecutive windows (controls data augmentation)
    shuffle=True,            # shuffle batches each epoch (improves generalization)
    drop_last=True,          # drop the last incomplete batch (keeps batch sizes uniform)
    num_workers=0            # DataLoader worker processes (0 = do work in main process; safe on Windows/Jupyter)
):
    """
    Builds a PyTorch DataLoader for next-token prediction from a long text.
    Splits `text` into overlapping token windows using `max_length` and `stride`.
    """

    # 1) Initialize the tokenizer (GPT-2’s Byte-Pair Encoding).
    #    This converts raw text into integer token IDs compatible with GPT-style models.
    tokenizer = tiktoken.get_encoding("gpt2")

    # 2) Create the dataset.
    #    GPTDatasetV1 should: 
    #      - tokenize `txt`
    #      - slice it into overlapping windows of length `max_length`
    #      - with step size `max_length - stride`
    #      - and return (input_ids, target_ids) pairs for next-token prediction.
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # 3) Wrap the dataset in a DataLoader to:
    #      - batch samples (`batch_size`)
    #      - optionally shuffle sample order each epoch (`shuffle`)
    #      - optionally drop the last partial batch (`drop_last`)
    #      - use `num_workers` background workers to speed up data loading
    #    Note: On Windows/Jupyter, keep `num_workers=0` to avoid multiprocessing issues.
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    # 4) Return an iterable that yields batches of tensors ready for training.
    return dataloader

now we are going to test the dataloader with a batch size of 1 for an LLM with a context size of 4,
this will develop an intuition of how the GPTDatasetV1 class and the create_dataloader_v1 function work together.

In [109]:
#this is the first step where we read the text

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

converting the dataloader into a python iterator to fetch the next entry via python's built-in next() function

In [110]:
import torch 
print("Pytorch version: ", torch.__version__)
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

Pytorch version:  2.9.0+cpu
[tensor([[15496,    11,   466,   345]]), tensor([[ 11, 466, 345, 588]])]


here the first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs. Since the max_length is set to 4, each of the two tensors contains 4 token IDs.

the input size of 4 is relatively small and only chosen for illustration purposes. it is common to train LLMs with input sizes of at least 256.


to illustrate the menaing of stride=1, let's fetch another batch from this dataset.

In [111]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 11, 466, 345, 588]]), tensor([[ 466,  345,  588, 8887]])]


it is called as sliding window approach because here we can see that during the second iteration the ones which were present in the output/ target tensor are now inputs during the second iteration.

batch sizes of 1, such as we have sampled from the data loader so far, are useful for illustration purposes. We know that in deep learning, a small batch sizes require less memory during training but lead to more noisy model updates.

In regular deep learning, the batch size is a trade-off and hyperparameter to experiment with when training LLMs.

before moving on to the two final sections of this chapter that we focused on creating embedding vectors from the tokenIDs , let's let us see how we can use the data loader to sample with a batch size greater than 1:

In [112]:
dataloader = create_dataloader_v1(raw_text,
                                  batch_size=8,
                                  max_length=4,
                                  stride=4,     # see note below
                                  shuffle=False,
                                  drop_last=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[15496,    11,   466,   345],
        [  588,  8887,    30,   220],
        [50256,   554,   262,  4252],
        [   77, 18250,  8812,  2114],
        [ 1659,   617, 34680, 27271]])

Targets:
 tensor([[   11,   466,   345,   588],
        [ 8887,    30,   220, 50256],
        [  554,   262,  4252,    77],
        [18250,  8812,  2114,  1659],
        [  617, 34680, 27271,    13]])


NOTE: now that we increase the stride to 4. This is to utilize the data set fully (we don't skip a single word) but also avoid any overlap between the batches, as the overlap could lead to increased overfitting.

# Token Embedding

Let us see how the token ID to embedding vector conversion works with a hands-on example. We have four input tokens 

In [113]:
input_ids = torch.tensor([2,3,5,1])

for the sake of simplicity we are going to use small vocabulary of only 6 words(instead of 50,257 words in the BPE tokenizer vocabulary) and we want to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions):

In [114]:
vocab_size=6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

embedding is a simple lookup table that stores embeddings of a fixed dictionary and size. and this intialize the weights of the embedding matrix in a random manner. here we will have the embedding matrix of 6 rows and 3 columns.

In [115]:
#the print statement in the code prints the embedding layer's underlyingg weight matrix:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


the above are the initial weights in the embedding layer and it contains small and random values. and these are the values that are optimized during LLM training as a part of the LLM optimization itself. 

In [116]:
print(embedding_layer(torch.tensor([3]))) #fetches the vector embedding for ID 3.

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


If we look at the embedding vector for **token ID 3**, it’s exactly the same as the **4th row** in the embedding matrix (because Python counts from zero).

This means that the **embedding layer** basically works like a **lookup table** — it just picks the correct row (vector) from its weight matrix based on the **token’s ID**.


In [117]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix.

# Positional Embeddings (Encoding word positions)

previously we focused on very small embedding sizes in this chapter for illustration purposes. We now consider more realistic and usefule mebedding sizes and encode the input toens into a 256-dimensional vector representations. This is smaller than what the original GPT-3 model used in (in GPT-3, the embedding size is 12,288 dimensions) but still reasonable for experimentation.
we assume, the token IDs were created by the BPE tokenizer that we implemented earlier, which has a vocabulary size of 50,257.

In [118]:
vocab_size = 50257
output_dim = 256 #vector size

token_embedding_layer = torch.nn.Embedding(vocab_size,  output_dim)

using the token_embedding_layer above, if we sample data from the data loader, we embed each token in each batch into a 256 - dimensional vector. If we have a batch_size of 8 with four tokens each, the result will be an 8 X 4 X 256 tensor.

here 8 represents the batch-size and 4 represents the context size which means that the maximum input length is 4(at most 4 tokens can be passed as input). In short here the parameters would be updated after every 8 batches.

Let's instantiate the data loader ( data sampling with a sliding window), first:

In [128]:
max_length = 4
# Maximum sequence length (number of tokens per training example).
# With next-token prediction, each input chunk will be length 4.

dataloader = create_dataloader_v1(
    raw_text,            # The full corpus (string or token IDs) to turn into batches
    batch_size=8,        # How many sequences per batch
    max_length=max_length,  # Truncate/window each sequence to 4 tokens
    stride=1,   # Move the window by 4 each time → non-overlapping chunks
    shuffle=False        # Keep chunk order deterministic (useful for debugging)
)

data_iter = iter(dataloader)
# Create a Python iterator over the DataLoader so we can pull batches manually.

inputs, targets = next(data_iter)
# Grab the first batch.
# inputs:  [batch_size, max_length] token IDs used as model input.
# targets: [batch_size, max_length] token IDs shifted by one step (next-token labels).
# (I.e., for language modeling: model predicts targets[t] from inputs[t].)

In [129]:
print("Token IDs : \n", inputs)
print("\n Inputs shape :\n", inputs.shape)

Token IDs : 
 tensor([[15496,    11,   466,   345],
        [   11,   466,   345,   588],
        [  466,   345,   588,  8887],
        [  345,   588,  8887,    30],
        [  588,  8887,    30,   220],
        [ 8887,    30,   220, 50256],
        [   30,   220, 50256,   554],
        [  220, 50256,   554,   262]])

 Inputs shape :
 torch.Size([8, 4])


above are the token ids for the input that we have received , now what we have to do is for each of the token ids we have to convert each of this into a 256 dimensional vector representation. I have adjusted the stride due to an error that i was encountering . 


As we can see, the token ID tensor is 8X4 dimensional, menaing that the data batch consists of 8 text samples with 4 tokens each.

in the below cell we would be passing this input tokens to the embedding layer to embed these token IDs into a 256-dimensional vectors: The lookup table generates the 256 dimensional vectors for each of the token ids passed on to it.

In [130]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


In [131]:
#creating an embedding layer for the positional encoding and based on that we will encode the token ids

context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [132]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


in the above line of code , the input to the pos_embeddings is usually a placeholder vector torch.arange(context_length), a sequence of numbers 0,1,.., up to the maximum input length - 1.

the context_length is a variable that represents the supported input size of the LLM.

Here, we choose it similar to the maximum length of the input text.

In practice, input text can be longer than the supported context length, in which case we have to truncate.

As we can see that the positional embedding tensor consists of four 256-dimensional vectors. we can now add these directly to the token embeddings , where Pytorch will add the 4 X 256 dimensional pos_embeddings tensor to each 4X256 dimensional token embedding.