# Chapter 2 - Lab 1a - Exercise
> Author : Badr TAJINI - Large Language model (LLMs) - ESIEE 2024-2025

> Response by Paul CASCARINO E5-DSIA

# Exercise 2.1

### Exploring Byte Pair Encoding (BPE) Tokenization with Unknown Words

#### 0. Setup

In [9]:
import tiktoken

# Input
input_txt = "Akwirw ier"

#### 1. Tokenization

In [10]:
# Initializing the tiktoken BPE Tokenizer
tokenizer = tiktoken.get_encoding("gpt2")  # Same GPT-2 encoding as the one used in Lab 2

# Tokenizing input text with tiktoken
token_ids = tokenizer.encode(input_txt)


# Print the token IDs generated for this input.
print(token_ids)


[33901, 86, 343, 86, 220, 959]

#### 2. Subword Decoding

In [15]:
# For each token ID in the resulting list, use the tokenizer's `decode` function to convert the ID back 
# into its corresponding subword or character.
decoded_tokens = [tokenizer.decode([token_id]) for token_id in token_ids]

print(decoded_tokens)

['Ak', 'w', 'ir', 'w', ' ', 'ier']


#### 3. Reconstruction

In [19]:
# Apply the `decode` method to the entire list of token IDs
reconstructed_txt = tokenizer.decode(token_ids)

print(reconstructed_txt)

Akwirw ier


### Questions - Exercise 2.1
1. What sequence of token IDs does the BPE tokenizer generate for the input **"Akwirw ier"?**
2. What subwords or characters correspond to each token ID in the sequence?
3. Does the reconstructed output from the token IDs match the original input? Explain your observations and reasoning.

### Responses - Exercise 2.1

1. [33901, 86, 343, 86, 220, 959]
2. 
 - 33901: 'Ak'
 - 86: 'w'
 - 343: 'ir'
 - 86: 'w'
 - 220: ' ' 
 - 959: 'ier'
3. The reconstructed output matches the original input ("Akwirw ier") exactly. BPE tokenization is lossless: an unknown word is not replaced by an out-of-vocabulary token but is broken down into smaller subwords and, where necessary, single characters that are all in the vocabulary. Decoding concatenates these pieces, the space included, so the entire input text is recovered.
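The split observed in this exercise can be reproduced with a minimal sketch of the BPE merge loop. The merge table `toy_merges` is hypothetical, chosen only so that this toy example yields the same pieces as the GPT-2 tokenizer; it is not GPT-2's actual merge list, and the real tokenizer operates on bytes rather than characters.

```python
def bpe_split(text, merges):
    """Split text into subwords by applying ranked merges (toy BPE)."""
    # Earlier entries in `merges` have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(text)  # start from single characters
    while len(symbols) > 1:
        # Pick the adjacent pair with the best (lowest) rank.
        pairs = set(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table (assumption, not GPT-2's real merges).
toy_merges = [("A", "k"), ("i", "r"), ("i", "e"), ("ie", "r")]

pieces = bpe_split("Akwirw ier", toy_merges)
print(pieces)                          # ['Ak', 'w', 'ir', 'w', ' ', 'ier']
print("".join(pieces) == "Akwirw ier") # True
```

Characters that take part in no merge (`w` and the space) stay as single-character pieces, which is why joining the pieces always gives back the input.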

---

# Exercise 2.2

### Exploring Data Loader Behavior with Different Parameter Configurations

#### 0. Setup



In [22]:
import tiktoken  # used by create_dataloader_v1 below
import torch

# Importing Required Modules
from torch.utils.data import Dataset, DataLoader

# Defining the Custom Dataset Class
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
    


# Creating the Data Loader Function
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [23]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

#### 1. Experimenting with `max_length` and `stride`

##### 1.1 max_length=2 and stride=2

In [32]:
max_length = 2
stride = 2

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=max_length, stride=stride, shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)


Inputs:
 tensor([[ 40, 367]])

Targets:
 tensor([[ 367, 2885]])


##### 1.2 max_length=8 and stride=2

In [33]:
max_length = 8
stride = 2

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=max_length, stride=stride, shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)


Inputs:
 tensor([[  40,  367, 2885, 1464, 1807, 3619,  402,  271]])

Targets:
 tensor([[  367,  2885,  1464,  1807,  3619,   402,   271, 10899]])


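The two outputs differ in window length. With max_length=2, each input holds only two tokens, so the model sees short chunks and can only learn immediate relationships between tokens. With max_length=8, each input-target pair spans eight tokens and captures a broader context. In both cases the targets are the inputs shifted one token to the right. Because stride=2 is smaller than max_length=8 in the second configuration, consecutive windows there overlap by six tokens.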
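The windowing can be checked without PyTorch. The sketch below mirrors the sliding-window loop of `GPTDatasetV1` on plain lists; the helper name `sliding_windows` is an assumption introduced for illustration, and `token_ids` holds the first ten token IDs that appear in the outputs of this notebook.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same loop as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]  # shifted by one
        pairs.append((input_chunk, target_chunk))
    return pairs


# First ten token IDs of the text, taken from the outputs shown in this notebook.
token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138]

short = sliding_windows(token_ids, max_length=2, stride=2)
long_ = sliding_windows(token_ids, max_length=8, stride=2)

print(len(short), short[0])  # 4 ([40, 367], [367, 2885])
print(len(long_), long_[0])
```

On these ten tokens, the short configuration yields four small pairs while the long one yields a single pair covering eight tokens of context.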

#### 2. Increasing Batch Size

In [34]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4,
    shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


With batch_size=8, the inputs tensor has shape (8, 4): eight sequences of max_length=4 tokens each. Every row of the targets tensor is the matching input row shifted by one token position. Batching lets the model process several sequences simultaneously, which improves computational efficiency while keeping the sequential structure of each input-target pair.
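As a rough illustration of what the data loader does here with shuffle=False and drop_last=True, the sketch below groups input-target pairs into batches using plain lists. The helper `make_batches` is a hypothetical simplification, not PyTorch's `DataLoader` implementation, and the token IDs are placeholder integers.

```python
def make_batches(samples, batch_size, drop_last=True):
    """Group (input, target) pairs into consecutive batches."""
    batches = []
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        if drop_last and len(chunk) < batch_size:
            break  # discard the final incomplete batch
        inputs = [inp for inp, _ in chunk]
        targets = [tgt for _, tgt in chunk]
        batches.append((inputs, targets))
    return batches


token_ids = list(range(50))  # placeholder token IDs (assumption)
max_length, stride = 4, 4
samples = [
    (token_ids[i:i + max_length], token_ids[i + 1:i + max_length + 1])
    for i in range(0, len(token_ids) - max_length, stride)
]

batches = make_batches(samples, batch_size=8)
inputs, targets = batches[0]
print(len(samples), len(batches))   # 12 1
print(len(inputs), len(inputs[0]))  # 8 4
```

With 12 samples and a batch size of 8, only one full batch is produced; the remaining four samples are dropped, which is the effect of drop_last=True.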

#### 3. Avoiding Overlap

From the lab: *"Adjusting the stride parameter affects the overlap between input sequences. By setting stride equal to max_length, overlapping is minimized, which can help prevent overfitting."*

With max_length=4, using stride=4 ensures that there is no overlap between consecutive input sequences. Each input starts exactly where the previous one ends, so no token appears in more than one input; the only shared token is the last target of one window, which is also the first input token of the next. This configuration minimizes redundancy in the data, can reduce the risk of overfitting caused by repeated exposure to overlapping sequences, and gives more diverse training examples.
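The effect of the stride on overlap can be made concrete by counting how often each token position lands in an input window. This is a minimal sketch on token positions rather than real token IDs; the helper `input_coverage` and the value `n_tokens = 21` are assumptions chosen for illustration.

```python
from collections import Counter


def input_coverage(n_tokens, max_length, stride):
    """Count how many input windows contain each token position."""
    counts = Counter()
    for i in range(0, n_tokens - max_length, stride):
        counts.update(range(i, i + max_length))
    return counts


n_tokens = 21  # placeholder text length (assumption)
for stride in (1, 2, 4, 8):
    cov = input_coverage(n_tokens, max_length=4, stride=stride)
    print(f"stride={stride}: positions covered={len(cov)}, "
          f"max repeats={max(cov.values())}")
```

With max_length=4, a stride of 4 covers each position at most once, smaller strides repeat positions across windows, and a stride of 8 leaves gaps between windows.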

### Questions - Exercise 2.2
1. How do changes in `max_length` and `stride` affect the input-output mappings produced by the data loader?  
2. What differences do you observe in the data when using a batch size of 8 compared to a batch size of 1?  
3. How does using a larger stride (e.g., `stride=4`) influence the coverage of the dataset and the overlap between sequences?  

### Responses - Exercise 2.2

1. Increasing max_length increases the context captured within each sequence, allowing the model to learn from longer dependencies, while increasing stride reduces the overlap between sequences, minimizing redundancy and promoting unique training examples.

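2. With a batch size of 8, each call to the data loader returns input and target tensors of shape (8, 4), i.e. eight consecutive sequences stacked together, instead of a single sequence of shape (1, max_length) with a batch size of 1. The content of each row and the one-token shift between inputs and targets are unchanged; the larger batch only lets the model process several sequences simultaneously, which improves computational efficiency.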

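3. A larger stride, such as stride=4 with max_length=4, removes the overlap between consecutive input sequences, so each token appears in only one input and the dataset is covered with fewer, non-redundant samples. With a smaller stride the windows overlap and the same tokens are seen repeatedly, while a stride larger than max_length would skip tokens altogether.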