# Chapter 2 - Lab 1a - Exercise
> Author : Badr TAJINI - Large Language model (LLMs) - ESIEE 2024-2025

# Exercise 2.1

### Exploring Byte Pair Encoding (BPE) Tokenization with Unknown Words

In this exercise, you will explore how the Byte Pair Encoding (BPE) tokenizer from the `tiktoken` library processes unknown words. BPE is a subword tokenization technique that constructs its vocabulary by iteratively merging frequent sequences of characters or subwords. This approach allows BPE tokenizers to handle previously unseen words by decomposing them into smaller, known subunits.

### Objective
Answer the following questions based on your experimentation with the tokenizer:

1. How does the BPE tokenizer decompose the input phrase **"Akwirw ier"** into token IDs?
2. What are the subwords or characters corresponding to each token ID in the tokenized output?
3. Can the tokenizer's decoding process successfully reconstruct the original input phrase **"Akwirw ier"** from the token IDs? Why or why not?

### Theoretical Background
Byte Pair Encoding begins with a minimal vocabulary of single characters, such as **"a," "b," "c,"** and so on. The tokenizer builds upon this base by iteratively merging frequently co-occurring characters into subwords, and subsequently merging frequent subwords into complete words. The merging process is guided by a frequency threshold or cutoff. 

For example:
- In the initial stage, the character **"d"** and **"e"** might frequently appear together in a corpus. The tokenizer merges these characters into the subword **"de"** if their co-occurrence exceeds the frequency cutoff.
- This subword then becomes part of the tokenizer's vocabulary and is used to tokenize words where it occurs, such as **"define," "depend," "made,"** and **"hidden."**

This hierarchical merging enables the BPE tokenizer to strike a balance between granularity and generalization, efficiently encoding both common words and rare or unknown words by breaking them into smaller units.
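
The merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not `tiktoken`'s implementation: the corpus and its word frequencies are hypothetical, the helper names `most_frequent_pair` and `merge_pair` are made up for this example, and a fixed number of merges stands in for the frequency cutoff.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair and its count, or None."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency (assumption for illustration only)
corpus = {"define": 5, "depend": 4, "made": 3, "hidden": 2}

# Start from single characters, as in the description above
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # fixed merge budget instead of a frequency cutoff
    best = most_frequent_pair(words)
    if best is None:
        break
    pair, count = best
    merges.append(pair)
    words = merge_pair(words, pair)

print("First merge:", merges[0])
print("Segmented words:", list(words))
```

In this corpus the pair `("d", "e")` occurs in all four words, so it is merged first and `"de"` becomes a vocabulary entry, which mirrors the **"de"** example above.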

### Task Steps

1. **Tokenization**:
   - Use the `tiktoken` BPE tokenizer to process the unknown input string **"Akwirw ier"**.
   - Print the token IDs generated for this input.

2. **Subword Decoding**:
   - For each token ID in the resulting list, use the tokenizer's `decode` function to convert the ID back into its corresponding subword or character.

3. **Reconstruction**:
   - Apply the `decode` method to the entire list of token IDs to reconstruct the original input string. Verify whether the reconstructed string matches the initial input, **"Akwirw ier"**.



### Questions - Exercise 2.1
1. What sequence of token IDs does the BPE tokenizer generate for the input **"Akwirw ier"**?
2. What subwords or characters correspond to each token ID in the sequence?
3. Does the reconstructed output from the token IDs match the original input? Explain your observations and reasoning.



In [1]:
import tiktoken

# Initialize the BPE tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

# Input phrase
input_phrase = "Akwirw ier"

# Encode the phrase into token IDs
token_ids = tokenizer.encode(input_phrase)

# Print the token IDs
print("Token IDs:", token_ids)

Token IDs: [32, 29700, 404, 86, 602, 261]


**Explanation:**
The BPE tokenizer splits the input into subwords and maps each one to a unique ID from its vocabulary. An unknown word such as "Akwirw" can therefore still be represented as a combination of known, frequent subwords.

In [2]:
# Decode each token ID into its subword
subwords = [tokenizer.decode([token_id]) for token_id in token_ids]

# Print the corresponding subwords or characters
print("Subwords:", subwords)

Subwords: ['A', 'kw', 'ir', 'w', ' i', 'er']


**Explanation:**
Each token ID maps to one subword or character in the BPE vocabulary. These subwords come from merges of frequent character sequences, which is why the tokenizer can break an unknown word into smaller units: here `'A'`, `'kw'`, `'ir'`, `'w'`, `' i'`, and `'er'`. Note that the space is attached to the start of `' i'` rather than being a token of its own.

In [3]:
# Reconstruct the phrase from the token IDs
reconstructed_phrase = tokenizer.decode(token_ids)

# Check whether the reconstruction matches the original text
print("Reconstructed Phrase:", reconstructed_phrase)
print("Reconstruction Successful:", reconstructed_phrase == input_phrase)

Reconstructed Phrase: Akwirw ier
Reconstruction Successful: True


**Explanation:**
The reconstructed phrase matches the original input exactly. Decoding concatenates the text of each token in order, and the tokens cover the input with nothing dropped (the space is kept inside `' i'`), so no information is lost. This shows that the tokenizer can split unknown words into fragments and reassemble them exactly.
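
The lossless round trip can be reproduced without `tiktoken` in a small sketch. It deliberately swaps in a simpler technique: greedy longest-match lookup over a hypothetical vocabulary, with single characters as a fallback, rather than the learned byte-level merges that `tiktoken` applies. The vocabulary and the integer IDs are assumptions chosen so that the split matches the subwords printed above.

```python
# Hypothetical vocabulary: the six subwords seen above plus single-character fallbacks
SUBWORDS = ["A", "kw", "ir", "w", " i", "er"]
FALLBACK_CHARS = ["k", "i", "r", " ", "e"]
vocab = SUBWORDS + FALLBACK_CHARS

token_to_id = {token: idx for idx, token in enumerate(vocab)}
id_to_token = {idx: token for token, idx in token_to_id.items()}
MAX_TOKEN_LEN = max(len(token) for token in vocab)

def encode(text):
    """Greedy longest-match segmentation (a stand-in for real BPE merging)."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first, then shrink
        for size in range(min(MAX_TOKEN_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in token_to_id:
                ids.append(token_to_id[piece])
                pos += size
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids

def decode(ids):
    """Decoding is plain concatenation of each token's text."""
    return "".join(id_to_token[i] for i in ids)

ids = encode("Akwirw ier")
pieces = [decode([i]) for i in ids]
print(pieces)
print(decode(ids) == "Akwirw ier")
```

Because every character of the input ends up in exactly one token and decoding only concatenates, the round trip cannot lose information as long as every character has a fallback entry.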

---

# Exercise 2.2

**Exercise: Exploring Data Loader Behavior with Different Parameter Configurations**

In this exercise, you will investigate how the parameters of a data loader—such as `max_length`, `stride`, and `batch_size`—affect the preparation of input-output pairs for training large language models (LLMs). By experimenting with these settings, you will gain a practical understanding of their impact on the data batching process and their implications for model training.

### Objective
You will:
1. Observe how the data loader generates input-output pairs with different configurations of `max_length` and `stride`.
2. Analyze how increasing the batch size changes the structure of the data and discuss the tradeoffs involved.
3. Experiment with a batch size greater than 1 to understand how it impacts memory usage and input-output organization.

---

### Theoretical Background

A data loader processes raw text data into smaller sequences suitable for training LLMs. Key parameters that influence its behavior are:

1. **`max_length`**: Specifies the maximum sequence length for each input-output pair. Shorter sequences may simplify computation but can limit the context available to the model.
   
2. **`stride`**: Determines the step size for sliding the window over the text when creating sequences. A smaller stride increases overlap between sequences, leading to more redundancy. A larger stride reduces overlap, ensuring more unique coverage of the dataset.

3. **`batch_size`**: Controls the number of sequences in a batch:
   - **Small batches** (e.g., `batch_size=1`) are easier to process and require less memory. However, they can produce noisier gradient updates during training.
   - **Larger batches** improve gradient stability but require more memory and computational power. This parameter is an important hyperparameter to tune during training.

These parameters are central to efficient and effective preprocessing of data for training deep learning models.
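
To make the roles of `max_length` and `stride` concrete, here is a minimal sketch of the sliding window in plain Python, without PyTorch. The helper name `sliding_windows` is made up for this illustration, and the integers 0 to 9 stand in for real token IDs.

```python
def sliding_windows(token_ids, max_length, stride):
    """Build (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

# Placeholder token IDs (assumption: any list of integers works the same way)
token_ids = list(range(10))

overlapping = sliding_windows(token_ids, max_length=4, stride=2)
disjoint = sliding_windows(token_ids, max_length=4, stride=4)

for name, pairs in [("stride=2", overlapping), ("stride=4", disjoint)]:
    print(name)
    for inputs, targets in pairs:
        print("  input:", inputs, "target:", targets)
```

With `stride=2` each input repeats the last two tokens of the previous one, while with `stride=4` (equal to `max_length`) the inputs do not share any tokens.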

---

### Task Steps

1. **Experimenting with `max_length` and `stride`**:
   - Run the data loader with two configurations:
     - `max_length=2` and `stride=2`
     - `max_length=8` and `stride=2`
   - Observe the structure of the input-output pairs for each configuration and note how they differ.

2. **Increasing Batch Size**:
   - Experiment with a batch size of 8 using the following configuration:
     ```python
     dataloader = create_dataloader_v1(
         raw_text, batch_size=8, max_length=4, stride=4,
         shuffle=False
     )
     data_iter = iter(dataloader)
     inputs, targets = next(data_iter)
     print("Inputs:\n", inputs)
     print("\nTargets:\n", targets)
     ```
   - Examine the resulting `inputs` and `targets`. Consider how the data is structured when `batch_size` is increased compared to a batch size of 1.

3. **Avoiding Overlap**:
   - Analyze the effect of the `stride=4` setting. Because the stride equals `max_length=4`, each input window starts where the previous one ended, so no tokens are shared between input sequences. This minimizes redundancy, which can help reduce the risk of overfitting.

---

### Questions - Exercise 2.2

1. How do changes in `max_length` and `stride` affect the input-output mappings produced by the data loader?  
2. What differences do you observe in the data when using a batch size of 8 compared to a batch size of 1?  
3. How does using a larger stride (e.g., `stride=4`) influence the coverage of the dataset and the overlap between sequences?  

---

### Example Output

Using the configuration:
```python
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4,
    shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```

**Inputs**:
```plaintext
tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
```

**Targets**:
```plaintext
tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
```

---

### Expected Learning Outcomes

By completing this exercise, you should:
1. Understand how varying `max_length` and `stride` impacts the input-output pairs produced by the data loader.
2. Appreciate the tradeoffs involved in choosing different batch sizes for training deep learning models.
3. Gain insight into how stride settings can minimize redundancy and optimize dataset utilization.

In [4]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [5]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader

In [6]:
# Example sentence
raw_text_1 = "This is an example sentence to test the data loader behavior."

# Test four configurations of max_length and stride
configs_1 = [
    {"max_length": 2, "stride": 2},
    {"max_length": 2, "stride": 4},
    {"max_length": 8, "stride": 2},
    {"max_length": 8, "stride": 4},
]

# Loop over the configurations and print the resulting pairs.
# shuffle defaults to True in create_dataloader_v1, so the pairs print in random order.
for config in configs_1:
    print(f"Configuration: max_length={config['max_length']}, stride={config['stride']}")
    
    dataloader_1 = create_dataloader_v1(
        raw_text_1, batch_size=1, max_length=config["max_length"], stride=config["stride"]
    )
    data_iter_1 = iter(dataloader_1)
    
    for idx, (inputs_1, targets_1) in enumerate(data_iter_1):
        print(f"Input {idx}:", inputs_1)
        print(f"Target {idx}:", targets_1)
        print("-" * 50)

Configuration: max_length=2, stride=2
Input 0: tensor([[ 281, 1672]])
Target 0: tensor([[1672, 6827]])
--------------------------------------------------
Input 1: tensor([[1332,  262]])
Target 1: tensor([[ 262, 1366]])
--------------------------------------------------
Input 2: tensor([[ 1366, 40213]])
Target 2: tensor([[40213,  4069]])
--------------------------------------------------
Input 3: tensor([[6827,  284]])
Target 3: tensor([[ 284, 1332]])
--------------------------------------------------
Input 4: tensor([[1212,  318]])
Target 4: tensor([[318, 281]])
--------------------------------------------------
Configuration: max_length=2, stride=4
Input 0: tensor([[6827,  284]])
Target 0: tensor([[ 284, 1332]])
--------------------------------------------------
Input 1: tensor([[1212,  318]])
Target 1: tensor([[318, 281]])
--------------------------------------------------
Input 2: tensor([[ 1366, 40213]])
Target 2: tensor([[40213,  4069]])
-------------------------------------------

### Impact of `max_length` and `stride`

Changing `max_length` and `stride` affects the input-output pairs as follows:

#### **1. Impact of `max_length`**
- **Short** (`max_length=2`):
  - Produces short sequences (e.g., `[6827, 284]`) that give each target very little context.
  - Short sequences are fast to process but cannot capture longer-range relationships.
- **Long** (`max_length=8`):
  - Captures more context in each input (e.g., `[1212, 318, 281, 1672, 6827, 284, 1332, 262]`), which helps with tasks that need a broader view of the text.
  - Requires more computation.

#### **2. Impact of `stride`**
Consecutive inputs (in text order) share `max_length - stride` tokens when `stride < max_length`, and none otherwise.
- **Small** (`stride=2`):
  - With `max_length=8`, overlap and redundancy are high: `Input 0` and `Input 1` share 6 of 8 tokens. With `max_length=2`, the stride equals the window length, so the inputs do not overlap.
  - Covers the text densely, but overlapping windows increase the computational load.
- **Large** (`stride=4`):
  - Reduces overlap between sequences and increases diversity (e.g., with `max_length=2`, `Input 0` and `Input 1` have no tokens in common).
  - When the stride exceeds `max_length`, the tokens between windows are skipped, so some contextual information can be lost.

---

### Comparison of configurations

| Configuration                | Example input            | Overlap between inputs                        | Context |
|------------------------------|--------------------------|-----------------------------------------------|---------|
| **`max_length=2, stride=2`** | `[6827, 284]`            | None                                          | Limited |
| **`max_length=2, stride=4`** | `[1366, 40213]`          | None (2 tokens skipped between input windows) | Limited |
| **`max_length=8, stride=2`** | `[1212, 318, ..., 262]`  | High (6 tokens)                               | Large   |
| **`max_length=8, stride=4`** | `[1212, 318, ..., 262]`  | Moderate (4 tokens)                           | Large   |

---

### Conclusion
1. **Short sequences (`max_length=2`)**: suited to local relationships, and fast to process.
2. **Long sequences (`max_length=8`)**: capture more context, at a higher compute cost.
3. **Small stride relative to `max_length`** (e.g., `stride=2` with `max_length=8`): high redundancy, dense coverage.
4. **Large stride (`stride=4`)**: less redundancy and more diverse sequences, but tokens are skipped once the stride exceeds `max_length`.
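
The figures in this comparison can be checked with a little arithmetic. The sketch below assumes the same loop bounds as `GPTDatasetV1`, `range(0, len(token_ids) - max_length, stride)`, and takes the token count of the example sentence to be 12, which is the number of distinct positions visible in the outputs above. The helper name `window_stats` is made up for this illustration.

```python
def window_stats(n_tokens, max_length, stride):
    """Return (number of windows, tokens shared by consecutive inputs, tokens skipped between inputs)."""
    n_windows = len(range(0, n_tokens - max_length, stride))
    overlap = max(0, max_length - stride)
    gap = max(0, stride - max_length)
    return n_windows, overlap, gap

N_TOKENS = 12  # assumption: token count of the example sentence, read off the outputs

for max_length, stride in [(2, 2), (2, 4), (8, 2), (8, 4)]:
    n_windows, overlap, gap = window_stats(N_TOKENS, max_length, stride)
    print(f"max_length={max_length}, stride={stride}: "
          f"{n_windows} windows, overlap={overlap}, gap={gap}")
```

The first two rows give 5 and 3 windows, which matches the number of pairs printed for `max_length=2` above.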


In [12]:
# Load the raw text from disk
file_path = "C:\\Users\\theo.labat\\Documents\\LLM\\lab2\\1_main_code\\the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as f:
    raw_text_2 = f.read()

configs_2 = [
    {"max_length": 2, "stride": 2},
    {"max_length": 2, "stride": 4},
    {"max_length": 8, "stride": 2},
    {"max_length": 8, "stride": 4},
]

# Loop over the configurations and print the resulting pairs
for config in configs_2:
    print(f"Configuration: max_length={config['max_length']}, stride={config['stride']}")
    
    dataloader_2 = create_dataloader_v1(
        raw_text_2, batch_size=1, max_length=config["max_length"], stride=config["stride"]
    )
    data_iter_2 = iter(dataloader_2)
    
    for idx, (inputs_2, targets_2) in enumerate(data_iter_2):
        print(f"Input {idx}:", inputs_2)
        print(f"Target {idx}:", targets_2)
        print("-" * 50)

Configuration: max_length=2, stride=2
Input 0: tensor([[526, 198]])
Target 0: tensor([[198, 198]])
--------------------------------------------------
Input 1: tensor([[198, 198]])
Target 1: tensor([[198,   1]])
--------------------------------------------------
Input 2: tensor([[416, 465]])
Target 2: tensor([[465, 938]])
--------------------------------------------------
Input 3: tensor([[465,   0]])
Target 3: tensor([[  0, 314]])
--------------------------------------------------
Input 4: tensor([[  11, 3114]])
Target 4: tensor([[3114,  510]])
--------------------------------------------------
Input 5: tensor([[760, 345]])
Target 5: tensor([[345, 772]])
--------------------------------------------------
Input 6: tensor([[ 284, 3285]])
Target 6: tensor([[3285,  262]])
--------------------------------------------------
Input 7: tensor([[37121,  1035]])
Target 7: tensor([[ 1035, 27339]])
--------------------------------------------------
Input 8: tensor([[373, 655]])
Target 8: tensor([[6

In [8]:
# Test with batch_size=1
print("\n--- Testing with batch_size=1 ---\n")
dataloader_1 = create_dataloader_v1(
    raw_text_1*2, batch_size=1, max_length=4, stride=2  # stride fixed at 2
)
data_iter_1 = iter(dataloader_1)

for idx, (inputs_1, targets_1) in enumerate(data_iter_1):
    print(f"Batch {idx} - Inputs:\n", inputs_1)
    print(f"Batch {idx} - Targets:\n", targets_1)
    print("-" * 50)

# Test with batch_size=8
print("\n--- Testing with batch_size=8 ---\n")
dataloader_8 = create_dataloader_v1(
    raw_text_1*2, batch_size=8, max_length=4, stride=2  # stride fixed at 2
)
data_iter_8 = iter(dataloader_8)

for idx, (inputs_8, targets_8) in enumerate(data_iter_8):
    print(f"Batch {idx} - Inputs:\n", inputs_8)
    print(f"Batch {idx} - Targets:\n", targets_8)
    print("-" * 50)


--- Testing with batch_size=1 ---

Batch 0 - Inputs:
 tensor([[6827,  284, 1332,  262]])
Batch 0 - Targets:
 tensor([[ 284, 1332,  262, 1366]])
--------------------------------------------------
Batch 1 - Inputs:
 tensor([[ 281, 1672, 6827,  284]])
Batch 1 - Targets:
 tensor([[1672, 6827,  284, 1332]])
--------------------------------------------------
Batch 2 - Inputs:
 tensor([[ 1332,   262,  1366, 40213]])
Batch 2 - Targets:
 tensor([[  262,  1366, 40213,  4069]])
--------------------------------------------------
Batch 3 - Inputs:
 tensor([[6827,  284, 1332,  262]])
Batch 3 - Targets:
 tensor([[ 284, 1332,  262, 1366]])
--------------------------------------------------
Batch 4 - Inputs:
 tensor([[ 1366, 40213,  4069,    13]])
Batch 4 - Targets:
 tensor([[40213,  4069,    13,  1212]])
--------------------------------------------------
Batch 5 - Inputs:
 tensor([[1212,  318,  281, 1672]])
Batch 5 - Targets:
 tensor([[ 318,  281, 1672, 6827]])
---------------------------------------

### Differences between a batch size of 8 and a batch size of 1

#### **Batch Size = 1**
- Each **input-output pair** forms its own batch of shape `(1, max_length)`.
- Pairs are processed one after another, with each batch holding a single input and its corresponding target.
- Example:
  - **Input (Batch 0)**: `tensor([[6827,  284, 1332,  262]])`
  - **Target (Batch 0)**: `tensor([[ 284, 1332,  262, 1366]])`
- Total number of batches: **10**, because each of the 10 sequences is its own batch.

**Advantages:**
- Easier to debug and inspect, since each sequence is isolated.
- Lower memory use per step, since each batch holds a single sequence.

**Disadvantages:**
- Less compute-efficient, since the model processes sequences one at a time.

---

#### **Batch Size = 8**
- Several sequences are grouped into one batch of shape `(8, max_length)`, with each row holding one sequence.
- Inputs and targets are aligned row by row, so several sequences can be processed in parallel in a single step.
- Total number of batches: **1**. The 10 sequences fill one batch of 8, and `drop_last=True` discards the remaining 2.
- Example:
  - **Inputs (Batch 0)**:
    ```plaintext
    tensor([[ 4069,    13,  1212,   318],
            [  281,  1672,  6827,   284],
            [  281,  1672,  6827,   284],
            [ 1332,   262,  1366, 40213],
            [ 1212,   318,   281,  1672],
            [ 6827,   284,  1332,   262],
            [ 1332,   262,  1366, 40213],
            [ 1366, 40213,  4069,    13]])
    ```
  - **Targets (Batch 0)**:
    ```plaintext
    tensor([[   13,  1212,   318,   281],
            [ 1672,  6827,   284,  1332],
            [ 1672,  6827,   284,  1332],
            [  262,  1366, 40213,  4069],
            [  318,   281,  1672,  6827],
            [  284,  1332,   262,  1366],
            [  262,  1366, 40213,  4069],
            [40213,  4069,    13,  1212]])
    ```

**Advantages:**
- Several sequences are processed in parallel, which improves efficiency.
- Makes better use of hardware resources (GPU/TPU).

**Disadvantages:**
- Harder to inspect, since the sequences are grouped into matrices.
- Requires more memory to process a full batch.

---

### Summary of differences

| **Batch Size** | **Data structure**             | **Advantages**                            | **Disadvantages**                         |
|----------------|--------------------------------|-------------------------------------------|-------------------------------------------|
| **1**          | A single sequence per batch    | Easy to inspect, low memory use           | Sequential computation, less efficient    |
| **8**          | Several sequences in one batch | Parallel processing, faster computation   | Harder to inspect, requires more memory   |


In [9]:
# Test with batch_size=1
print("\n--- Testing with batch_size=1 ---\n")
dataloader_1 = create_dataloader_v1(
    raw_text_2, batch_size=1, max_length=4, stride=2  # stride fixed at 2
)
data_iter_1 = iter(dataloader_1)

for idx, (inputs_1, targets_1) in enumerate(data_iter_1):
    print(f"Batch {idx} - Inputs:\n", inputs_1)
    print(f"Batch {idx} - Targets:\n", targets_1)
    print("-" * 50)

# Test with batch_size=8
print("\n--- Testing with batch_size=8 ---\n")
dataloader_8 = create_dataloader_v1(
    raw_text_2, batch_size=8, max_length=4, stride=2  # stride fixed at 2
)
data_iter_8 = iter(dataloader_8)

for idx, (inputs_8, targets_8) in enumerate(data_iter_8):
    print(f"Batch {idx} - Inputs:\n", inputs_8)
    print(f"Batch {idx} - Targets:\n", targets_8)
    print("-" * 50)


--- Testing with batch_size=1 ---

Batch 0 - Inputs:
 tensor([[3521,  470,  466, 1194]])
Batch 0 - Targets:
 tensor([[  470,   466,  1194, 14000]])
--------------------------------------------------
Batch 1 - Inputs:
 tensor([[ 262, 2589,  314,  373]])
Batch 1 - Targets:
 tensor([[2589,  314,  373, 4808]])
--------------------------------------------------
Batch 2 - Inputs:
 tensor([[ 1657,    13, 24975,   339]])
Batch 2 - Targets:
 tensor([[   13, 24975,   339,  2900]])
--------------------------------------------------
Batch 3 - Inputs:
 tensor([[1234,  340,   11,  550]])
Batch 3 - Targets:
 tensor([[340,  11, 550, 587]])
--------------------------------------------------
Batch 4 - Inputs:
 tensor([[  11,  290, 9617,  736]])
Batch 4 - Targets:
 tensor([[ 290, 9617,  736,  465]])
--------------------------------------------------
Batch 5 - Inputs:
 tensor([[  339,  6150,  5365, 31655]])
Batch 5 - Targets:
 tensor([[ 6150,  5365, 31655,    26]])
---------------------------------------

In [10]:
# Configuration with stride=4
dataloader_stride_1 = create_dataloader_v1(raw_text_1, batch_size=1, max_length=4, stride=4)

# Inspect the generated sequences
print("Configuration with stride=4:")
for inputs_1, targets_1 in dataloader_stride_1:
    print(f"Inputs: {inputs_1}, Targets: {targets_1}")


Configuration with stride=4:
Inputs: tensor([[6827,  284, 1332,  262]]), Targets: tensor([[ 284, 1332,  262, 1366]])
Inputs: tensor([[1212,  318,  281, 1672]]), Targets: tensor([[ 318,  281, 1672, 6827]])


### Answer: Impact of a larger stride (e.g., `stride=4`) on coverage and overlap

#### **Observations from the results**
1. **Reduced overlap:**
   - With a stride of `4` and `max_length=4`, consecutive input sequences share no tokens. Each sequence starts immediately after the previous one ends.
   - By contrast, with a smaller stride (e.g., `stride=2`), the sequences overlap and share some tokens.

2. **Dataset coverage:**
   - A stride of `4` still walks through the text, but it splits the dataset into distinct sequences, which reduces redundancy.
   - Tokens at the end of the text can go unused. The loop stops at `len(token_ids) - max_length` so that every input has a full target, which is why the 12-token example sentence yields only two pairs and its last four tokens `[1366, 40213, 4069, 13]` never form an input.

#### **Impact on the generated sequences**
- **Less redundancy:**
  - Each token appears at most once among the inputs.
- **Greater diversity:**
  - The input sequences are distinct windows of the text and do not repeat each other.

#### **Summary**
- **Coverage:** A larger stride maximizes the diversity of the sequences but can leave out tokens at the end of the dataset.
- **Overlap:** With `stride` equal to `max_length`, overlap between input sequences is eliminated, so the raw text is used with fewer repetitions.
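
A short sketch can make the coverage claim checkable by counting how often each token position lands in an input window. It assumes the same loop bounds as `GPTDatasetV1` and a 12-token text, as in the example sentence. The helper names `input_position_counts` and `unused_positions` are made up for this illustration.

```python
from collections import Counter

def input_position_counts(n_tokens, max_length, stride):
    """Count how many input windows contain each token position."""
    counts = Counter()
    for start in range(0, n_tokens - max_length, stride):
        counts.update(range(start, start + max_length))
    return counts

def unused_positions(n_tokens, counts):
    """Positions that never appear in any input window."""
    return [pos for pos in range(n_tokens) if pos not in counts]

N_TOKENS = 12  # assumption: length of the tokenized example sentence

for stride in (2, 4):
    counts = input_position_counts(N_TOKENS, max_length=4, stride=stride)
    print(f"stride={stride}: max uses per token = {max(counts.values())}, "
          f"unused positions = {unused_positions(N_TOKENS, counts)}")
```

With `stride=4` no position is used more than once but the last four positions are never inputs, while with `stride=2` some positions are used twice and only the last two are left out.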


In [11]:
# Configuration with stride=4
dataloader_stride_2 = create_dataloader_v1(raw_text_2, batch_size=1, max_length=4, stride=4)

# Inspect the generated sequences
print("Configuration with stride=4:")
for inputs_2, targets_2 in dataloader_stride_2:
    print(f"Inputs: {inputs_2}, Targets: {targets_2}")

Configuration with stride=4:
Inputs: tensor([[4185, 1359,   26,  788]]), Targets: tensor([[1359,   26,  788,  314]])
Inputs: tensor([[ 314,  550,  757, 1057]]), Targets: tensor([[ 550,  757, 1057,  625]])
Inputs: tensor([[24297,  1022,   465,  9353]]), Targets: tensor([[1022,  465, 9353,  257]])
Inputs: tensor([[2666,  572, 1701,  198]]), Targets: tensor([[ 572, 1701,  198,  198]])
Inputs: tensor([[3724, 6451,   11,  286]]), Targets: tensor([[6451,   11,  286, 2612]])
Inputs: tensor([[5032,   11,  339, 5257]]), Targets: tensor([[  11,  339, 5257,  284]])
Inputs: tensor([[612, 373, 645, 530]]), Targets: tensor([[373, 645, 530, 588]])
Inputs: tensor([[ 772,  611,  339, 1549]]), Targets: tensor([[ 611,  339, 1549,  587]])
Inputs: tensor([[19713, 14676,    25,  9675]]), Targets: tensor([[14676,    25,  9675,   284]])
Inputs: tensor([[ 330,   11, 4844,  286]]), Targets: tensor([[  11, 4844,  286,  262]])
Inputs: tensor([[   13,   764,   764, 22135]]), Targets: tensor([[  764,   764, 22135, 