# Chapter 5 - Exercises

> Author: Badr TAJINI - Large Language Models (LLMs) - ESIEE 2024-2025

---

# Exercise 5.1: Temperature-scaled softmax scores and sampling probabilities

**Empirical Analysis of Token Sampling Frequencies Under Temperature Scaling**

**Key Research Question: How does temperature-based scaling of the `softmax` probability distribution impact the sampling frequency of the specific lexical token `"pizza"`?**

*Methodological Framework:*
Utilize the `print_sampled_tokens` function to:
- Empirically examine token sampling probabilities
- Analyze the impact of temperature scaling
- Quantify the sampling occurrence of the `"pizza"` token

*Analytical Objectives:*
- Determine the precise sampling frequency of `"pizza"` across different temperature configurations
- Critically evaluate the current computational approach to sampling frequency measurement
- Explore potential methodological improvements for more efficient and accurate token sampling analysis

*Key Investigative Parameters:*
- Primary token of interest: `"pizza"`
- Sampling method: Temperature-scaled `softmax` distribution
- Computational tool: `print_sampled_tokens` function


### Results - Exercise 5.1

In [None]:
import torch

vocab = { 
    "closer": 0,
    "every": 1, 
    "effort": 2, 
    "forward": 3,
    "inches": 4,
    "moves": 5, 
    "pizza": 6,
    "toward": 7,
    "you": 8,
} 
inverse_vocab = {v: k for k, v in vocab.items()}

# Suppose input is "every effort moves you", and the LLM
# returns the following logits for the next token:
next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

def print_sampled_tokens(probas):
    torch.manual_seed(123) # Manual seed for reproducibility
    sample = [torch.multinomial(probas, num_samples=1).item() for i in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample))
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")


def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

# Temperature values
temperatures = [1, 0.1, 5]  # Original, lower, and higher temperature

# Calculate scaled probabilities
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]

In [6]:
for i, probas in enumerate(scaled_probas):
    print("\n\nTemperature:", temperatures[i])
    print_sampled_tokens(probas)



Temperature: 1
73 x closer
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward


Temperature: 0.1
0 x closer
0 x every
0 x effort
985 x forward
0 x inches
0 x moves
0 x pizza
15 x toward


Temperature: 5
165 x closer
75 x every
42 x effort
239 x forward
71 x inches
46 x moves
32 x pizza
227 x toward
103 x you


In [7]:
temp5_idx = 2
pizza_idx = 6

scaled_probas[temp5_idx][pizza_idx]

tensor(0.0430)

**Interpretation**  
Dividing the logits by a temperature directly reshapes the softmax distribution:  
- Low temperature (< 1) → "sharper" distribution → the model picks the most probable tokens more often.  
- High temperature (> 1) → "flatter" distribution → more diversity in sampling.  
The `"pizza"` token illustrates this effect: it is never sampled at temperatures 1 and 0.1, but it is drawn 32 times out of 1,000 at temperature 5, where its probability is 0.0430 (about 43 expected draws per 1,000 samples).
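
The exercise also asks for a more efficient and accurate way to measure the sampling frequency. One option is to skip sampling and read the probability of `"pizza"` directly from the temperature-scaled softmax, then multiply by the number of draws. The sketch below is a plain-Python reimplementation (no `torch`) that reuses the logits from the cell above; it is an illustration under those assumptions, not the notebook's own code.

```python
import math

# Vocabulary order and logits copied from the cell above.
vocab = ["closer", "every", "effort", "forward", "inches",
         "moves", "pizza", "toward", "you"]
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]


def softmax_with_temperature(logits, temperature):
    """Plain-Python softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


pizza_idx = vocab.index("pizza")
num_samples = 1_000

for temperature in (1, 0.1, 5):
    probas = softmax_with_temperature(next_token_logits, temperature)
    expected = probas[pizza_idx] * num_samples
    print(f"T={temperature}: p(pizza)={probas[pizza_idx]:.4f}, "
          f"expected draws per {num_samples} samples: {expected:.1f}")
```

Unlike the counts printed by `print_sampled_tokens`, these values do not depend on the random seed or on the number of samples drawn.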


# Exercise 5.2: Different temperature and top-k settings

**Empirical Investigation of Generative Language Model Sampling Parameters**

**Key Research Question: How do variations in `temperature` and `top-k` sampling parameters influence the qualitative and probabilistic characteristics of token generation in stochastic language models?**

*Methodological Framework:*
Conduct a systematic empirical exploration of:
- Temperature scaling dynamics
- Top-k probability truncation mechanisms
- Generative output characteristics across different parameter configurations

*Analytical Objectives:*
- Identify contextual applications that benefit from lower `temperature` and `top-k` settings
- Explore potential use cases preferring higher `temperature` and `top-k` configurations
- Develop nuanced understanding of sampling parameter impact on generative outputs

*Investigative Dimensions:*
1. Low `temperature` and `top-k` Scenarios
   - Potential applications
   - Characteristics of generated outputs
   - Contextual relevance

2. High `temperature` and `top-k` Scenarios
   - Potential applications
   - Characteristics of generated outputs
   - Contextual relevance

*Recommended Experimental Protocol:*
1. Systematically vary `temperature` and `top-k` parameters
2. Meticulously document generative output characteristics
3. Critically analyze observed variations
4. Develop hypotheses about optimal parameter configurations for specific applications

### Results - Exercise 5.2

**Interpretation**  
- Temperature controls randomness: the higher it is, the flatter the distribution → more varied but sometimes less coherent outputs.  
- Top-k restricts generation to the k most probable tokens → keeps some diversity while excluding highly improbable tokens.  
In practice, `temperature` and `top_k` are tuned together to balance creativity against coherence. Low values tend to suit tasks where precision and reproducibility matter (for example factual answers or code), while higher values tend to suit creative writing or brainstorming.
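
To make the interaction between the two parameters concrete, the sketch below applies top-k masking followed by temperature scaling to the toy logits from Exercise 5.1. The function name `top_k_probas` and the two example settings are hypothetical choices for illustration; the code is plain Python and only mirrors the general technique, not necessarily the internals of the `generate` function.

```python
import math

# Toy vocabulary and logits reused from Exercise 5.1.
vocab = ["closer", "every", "effort", "forward", "inches",
         "moves", "pizza", "toward", "you"]
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]


def top_k_probas(logits, top_k, temperature):
    """Keep the top_k largest logits, mask the rest, then apply a temperature softmax."""
    # Smallest logit that still belongs to the top_k set
    # (ties at the threshold would keep more than top_k tokens).
    threshold = sorted(logits, reverse=True)[top_k - 1]
    masked = [x if x >= threshold else float("-inf") for x in logits]
    scaled = [x / temperature for x in masked]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical settings: a conservative one and an exploratory one.
for top_k, temperature in ((3, 0.5), (9, 2.0)):
    probas = top_k_probas(next_token_logits, top_k, temperature)
    kept = {tok: round(p, 3) for tok, p in zip(vocab, probas) if p > 0}
    print(f"top_k={top_k}, temperature={temperature}: {kept}")
```

With a small `top_k` and a low temperature, almost all probability mass sits on one or two tokens; with a large `top_k` and a high temperature, every token in the vocabulary keeps a visible share.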


# Exercise 5.3: Deterministic behavior in the decoding functions

**Deterministic Token Generation: Parametric Strategies for Eliminating Stochastic Variability**

**Key Research Question: What specific configuration parameters within the `generate` function can systematically eliminate randomness to ensure consistently reproducible generative outputs?**

*Methodological Framework:*
*Investigate comprehensive strategies to:*
- Suppress stochastic token generation mechanisms
- Enforce deterministic computational behavior
- Replicate the predictable output characteristics of `generate_simple`

*Analytical Objectives:*
- Identify all potential parameter combinations
- Systematically neutralize probabilistic sampling variations
- Establish deterministic generative protocol

*Critical Configuration Parameters to Examine:*
1. `temperature` scaling
2. `top_k` pruning mechanism
3. Random seed initialization
4. Sampling strategy selection

*Recommended Experimental Protocol:*
1. Analyze individual parameter impacts
2. Identify minimal configuration requirements
3. Validate deterministic output generation
4. Compare against `generate_simple` implementation

*Computational Implications:*
- Understanding stochastic suppression mechanisms
- Insights into generative model controllability
- Strategies for reproducible machine learning outputs

### Results - Exercise 5.3

In [None]:
import tiktoken
import torch
from previous_labs import GPTModel


GPT_CONFIG_124M = {
    "vocab_size": 50257,  # Vocabulary size
    "context_length": 256,# Shortened context length (orig: 1024)
    "emb_dim": 768,       # Embedding dimension
    "n_heads": 12,        # Number of attention heads
    "n_layers": 12,       # Number of layers
    "drop_rate": 0.1,     # Dropout rate
    "qkv_bias": False     # Query-key-value bias
}


torch.manual_seed(123)

tokenizer = tiktoken.get_encoding("gpt2")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth", weights_only=True))
model.eval();   # Disable dropout during inference

In [9]:
from gpt_generate import generate, text_to_token_ids, token_ids_to_text
from previous_labs import generate_text_simple

In [None]:
start_context = "Every effort moves you"

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun


In [None]:
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=None,
    temperature=0.0
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun


In [None]:
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=None,
    temperature=0.0
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun


**Interpretation**  
Both functions (`generate_text_simple`, and `generate` with `top_k=None` and `temperature=0.0`) are deterministic because they always select the token with the highest probability (equivalent to `argmax`).  
As a result, for an identical prompt the generated text is the same on every run.
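
The behaviour described above can be sketched as a single decoding step. The helper `pick_next_token` is hypothetical and written in plain Python; it assumes that a temperature of 0.0 switches the decoder to `argmax`, which is consistent with the identical outputs above but is not taken from the source of `generate`.

```python
import math
import random

# Toy logits reused from Exercise 5.1 (index 3 is "forward").
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]


def pick_next_token(logits, temperature=0.0, rng=random):
    """Greedy argmax when temperature is 0, otherwise sample from the scaled softmax."""
    if temperature <= 0.0:
        # Deterministic branch: no random numbers are consumed.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized softmax weights
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


greedy = {pick_next_token(next_token_logits, temperature=0.0) for _ in range(100)}
sampled = {pick_next_token(next_token_logits, temperature=5.0) for _ in range(100)}
print("temperature=0.0 ->", greedy)
print("temperature=5.0 ->", sorted(sampled))
```

Because the greedy branch never touches the random number generator, no seed is needed for reproducibility in that setting.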


# Exercise 5.4: Continued pretraining

**Continuation of Model Training: Stateful Resumption and Persistent Learning Dynamics**

**Key Research Question: How can we effectively restore a machine learning model's training state across separate computational sessions, enabling seamless continuation of the pretraining process?**

*Methodological Framework:*
Implement a comprehensive model and optimizer state restoration strategy involving:
- Weight reconstruction
- Optimizer state recovery
- Resumption of training from previously interrupted state

*Analytical Objectives:*
- Demonstrate stateful model persistence
- Execute additional training epoch using restored model configuration
- Validate continuity of learning progression

*Critical Procedural Steps:*
1. Load previously saved model weights
2. Reconstruct optimizer internal state
3. Reinitiate training using `train_model_simple` function
4. Complete one additional training epoch

*Recommended Implementation Strategy:*
- Utilize precise weight and optimizer state loading mechanisms
- Verify complete state restoration
- Execute uninterrupted additional training epoch

### Results - Exercise 5.4

In [None]:
import tiktoken
import torch
from previous_labs import GPTModel


GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = tiktoken.get_encoding("gpt2")

checkpoint = torch.load("model_and_optimizer.pth", weights_only=True)

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])

model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();

In [None]:
import os
import urllib.request
from previous_labs import create_dataloader_v1


file_path = "the-verdict.txt"
url = "https://huggingface.co/datasets/DarwinAnim8or/the-verdict/resolve/main/the-verdict.txt"  # /resolve/ serves the raw file; /blob/ returns the HTML viewer page

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

# Train/validation ratio
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]


torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

In [15]:
from gpt_train import train_model_simple

num_epochs = 1
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

Ep 1 (Step 000000): Train loss 0.271, Val loss 6.545
Ep 1 (Step 000005): Train loss 0.244, Val loss 6.614
Every effort moves you?"  "Yes--quite insensible to the irony. She wanted him vindicated--and by me!"  He laughed again, and threw back his head to look up at the sketch of the donkey. "There were days when I


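**Interpretation**  
The idea is continued pretraining: we start from the existing model trained in lab 5, restore its weights and optimizer state, and continue training.  
We would therefore expect the loss to keep decreasing and the generated text to become progressively more coherent, even after a few epochs, because the model adapts its weights to the style and vocabulary of the dataset. In the run above, the training loss drops from 0.271 to 0.244 during the extra epoch, while the validation loss stays high (6.545, then 6.614), which suggests the model is mostly memorizing this small training text rather than generalizing.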


# Exercise 5.5: Training and validation set losses of the pretrained model

**Comparative Loss Assessment: Pretrained Model Performance on Specialized Textual Domain**

**Key Research Question: What are the comparative training and validation set losses when applying a pretrained OpenAI `GPTModel` to the "The Verdict" dataset?**

*Methodological Framework:*
Conduct a comprehensive loss evaluation involving:
- Model weight initialization from pretrained OpenAI configuration
- Computational loss calculation across training and validation datasets
- Quantitative performance assessment in domain-specific context

*Analytical Objectives:*
- Determine precise loss metrics for training dataset
- Calculate validation set loss
- Interpret performance characteristics of pretrained model on specialized textual domain

*Critical Computational Procedures:*
1. Load pretrained OpenAI `GPTModel` weights
2. Prepare "The Verdict" dataset
3. Compute training set loss
4. Compute validation set loss
5. Comparative loss analysis

*Investigative Parameters:*
- Model: Pretrained OpenAI `GPTModel`
- Dataset: "The Verdict"
- Metrics: Training and validation loss measurements

*Recommended Analytical Approach:*
- Implement precise loss computation
- Validate computational methodology
- Critically interpret loss metric implications

### Results - Exercise 5.5

In [16]:
import tiktoken
import torch
from previous_labs import GPTModel


GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}


torch.manual_seed(123)

tokenizer = tiktoken.get_encoding("gpt2")

In [17]:
from gpt_download import download_and_load_gpt2

settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")

File already exists and is up-to-date: gpt2\124M\checkpoint
File already exists and is up-to-date: gpt2\124M\encoder.json
File already exists and is up-to-date: gpt2\124M\hparams.json
File already exists and is up-to-date: gpt2\124M\model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2\124M\model.ckpt.index
File already exists and is up-to-date: gpt2\124M\model.ckpt.meta
File already exists and is up-to-date: gpt2\124M\vocab.bpe


In [18]:
# Define model configurations in a dictionary for compactness
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# Copy the base configuration and update with specific model settings
model_name = "gpt2-small (124M)"  # Example model name
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

gpt = GPTModel(NEW_CONFIG)
gpt.eval();

In [None]:
from gpt_generate import load_weights_into_gpt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
load_weights_into_gpt(gpt, params)
gpt.to(device);

In [None]:
import os
import urllib.request
from previous_labs import create_dataloader_v1


file_path = "the-verdict.txt"
url = "https://huggingface.co/datasets/DarwinAnim8or/the-verdict/resolve/main/the-verdict.txt"  # /resolve/ serves the raw file; /blob/ returns the HTML viewer page

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

# Train/validation ratio
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]


torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

In [None]:
from gpt_train import calc_loss_loader

torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader

train_loss = calc_loss_loader(train_loader, gpt, device)
val_loss = calc_loss_loader(val_loader, gpt, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

Training loss: 3.754756662580702
Validation loss: 3.559627056121826


**Interpretation**  
Here we compare the training and validation losses of the pretrained model.  
- If both losses are close and low, the model generalizes well to similar sequences.  
- If the training loss is clearly lower than the validation loss, this may indicate the onset of overfitting on the training text.  
In our case the two losses are close (about 3.75 on the training set and 3.56 on the validation set), which is a reasonably good result.
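
A loss value is easier to interpret once converted to perplexity, the exponential of the mean cross-entropy loss. The short sketch below does this for the two losses printed above. It assumes that `calc_loss_loader` returns a mean cross-entropy in nats, which is the default for PyTorch's cross-entropy but is not shown in this notebook.

```python
import math

# Loss values copied from the output above.
train_loss = 3.754756662580702
val_loss = 3.559627056121826

for name, loss in (("Training", train_loss), ("Validation", val_loss)):
    # Perplexity: the effective number of tokens the model hesitates between.
    perplexity = math.exp(loss)
    print(f"{name} loss {loss:.3f} -> perplexity {perplexity:.1f}")
```

Under that assumption, the pretrained model is on average about as uncertain as if it were choosing among roughly 43 (training) or 35 (validation) equally likely tokens, out of a vocabulary of 50,257.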

# Exercise 5.6: Trying larger models

**Comparative Generative Analysis: Scale and Performance Variations in GPT-2 Model Architectures**

**Key Research Question: How do generative text characteristics vary across different GPT-2 model scales, specifically comparing the 124 million and 1,558 million parameter configurations?**

*Methodological Framework:*
Conduct a systematic comparative investigation of:
- Generative text quality
- Semantic coherence
- Linguistic complexity
- Contextual understanding

*Analytical Objectives:*
- Empirically assess generative performance across model scales
- Identify qualitative differences in text generation
- Explore the relationship between model parameter count and generative capabilities

*Comparative Model Configurations:*
1. Smaller Model: **124 million parameters**
2. Larger Model: **1,558 million parameters**

*Investigative Dimensions:*
- Textual coherence
- Semantic precision
- Contextual relevance
- Linguistic nuance
- Complexity of generated content

*Experimental Protocol:*
1. Generate text samples using both model configurations
2. Conduct qualitative comparative analysis
3. Assess generative performance across multiple dimensions
4. Document observable variations in text generation characteristics

*Recommended Analytical Approach:*
- Utilize consistent generation parameters
- Employ multiple generation trials
- Implement rigorous qualitative assessment
- Develop comprehensive comparative framework

### Results - Exercise 5.6

In [None]:
import tiktoken
import torch
from previous_labs import GPTModel


GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}


tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
from gpt_download import download_and_load_gpt2
from gpt_generate import load_weights_into_gpt


model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

model_name = "gpt2-xl (1558M)"
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

gpt = GPTModel(NEW_CONFIG)
gpt.eval()

settings, params = download_and_load_gpt2(model_size="1558M", models_dir="gpt2")
load_weights_into_gpt(gpt, params)

In [None]:
from gpt_generate import generate, text_to_token_ids, token_ids_to_text

In [None]:
torch.manual_seed(123)

token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

**Interpretation**  
This exercise shows the impact of model size: in general, the larger the model (more layers, more heads, larger embedding dimension), the more capacity it has and the lower the loss it can reach on the same dataset.  
In return, the compute cost rises sharply (training time, VRAM/RAM).  
The goal is therefore to find a trade-off between performance and available resources. For example, I could not run the 1558M model locally on my PC because it was too large and too slow, so I used Colab instead.
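
The resource cost can be estimated from the configuration alone. The sketch below uses the common rule of thumb of about 12 × `emb_dim`² weights per transformer block, plus the token and positional embeddings. It ignores biases and layer-norm parameters and assumes the output head shares the token embedding, as in the original GPT-2; the `GPTModel` class used here may keep a separate output layer, which would add another `vocab_size` × `emb_dim` weights. The function name `approx_param_count` is hypothetical.

```python
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

VOCAB_SIZE = 50257
CONTEXT_LENGTH = 1024


def approx_param_count(emb_dim, n_layers, n_heads):
    """Rough weight count: transformer blocks plus token and positional embeddings."""
    # n_heads only splits emb_dim across heads, so it does not change the count.
    attention = 4 * emb_dim**2      # query, key, value, and output projections
    feed_forward = 8 * emb_dim**2   # two linear layers with a 4x hidden width (assumed)
    embeddings = VOCAB_SIZE * emb_dim + CONTEXT_LENGTH * emb_dim
    return n_layers * (attention + feed_forward) + embeddings


for name, cfg in model_configs.items():
    n_params = approx_param_count(**cfg)
    size_gib = n_params * 4 / 1024**3  # 4 bytes per float32 weight
    print(f"{name}: ~{n_params / 1e6:.0f}M parameters, ~{size_gib:.1f} GiB in float32")
```

The estimate lands close to the parameter counts in the model names, and it shows that the weights of the largest model alone need several gigabytes of memory in float32, before counting activations or optimizer state.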
