# Chapter 5 - Exercises

> Author: Badr TAJINI - Large Language Models (LLMs) - ESIEE 2024-2025

---

# Exercise 5.1: Temperature-scaled softmax scores and sampling probabilities

**Empirical Analysis of Token Sampling Frequencies Under Temperature Scaling**

**Key Research Question: How does temperature-based scaling of the `softmax` probability distribution impact the sampling frequency of the specific lexical token `"pizza"`?**

*Methodological Framework:*
Utilize the `print_sampled_tokens` function to:
- Empirically examine token sampling probabilities
- Analyze the impact of temperature scaling
- Quantify the sampling occurrence of the `"pizza"` token

*Analytical Objectives:*
- Determine the precise sampling frequency of `"pizza"` across different temperature configurations
- Critically evaluate the current computational approach to sampling frequency measurement
- Explore potential methodological improvements for more efficient and accurate token sampling analysis

*Key Investigative Parameters:*
- Primary token of interest: `"pizza"`
- Sampling method: Temperature-scaled `softmax` distribution
- Computational tool: `print_sampled_tokens` function


In [3]:
import numpy as np

# Temperature-scaled softmax (NumPy)
def temperature_scaled_softmax_np(logits, temperature):
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))  # Subtract the max for numerical stability
    probabilities = exp_logits / exp_logits.sum()
    return probabilities

# Sample tokens and measure how often a specific token is drawn (NumPy)
def sample_tokens_and_count_np(target_token, vocab, logits, temperatures, num_samples=100):
    results = {}
    target_index = vocab.index(target_token)
    for temp in temperatures:
        probabilities = temperature_scaled_softmax_np(logits, temp)
        sampled_tokens = np.random.choice(len(vocab), size=num_samples, p=probabilities)
        target_count = np.sum(sampled_tokens == target_index)
        results[temp] = float(target_count / num_samples)  # Relative frequency as a plain float
    return results

# Example vocabulary and simulated logits
vocab = ["pizza", "burger", "pasta", "salad", "sushi"]
logits = np.array([2.0, 1.5, 0.5, 0.2, -0.1])

# Temperatures to test
temperatures = [0.5, 1.0, 1.5, 2.0]

# Run the analysis for the "pizza" token
results_np = sample_tokens_and_count_np("pizza", vocab, logits, temperatures, num_samples=100)
print("Results:", results_np)


Results: {0.5: 0.67, 1.0: 0.49, 1.5: 0.44, 2.0: 0.24}
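
With only 100 draws per temperature, the sampled frequencies above are noisy. As a complement, the probability of `"pizza"` can be read directly from the temperature-scaled softmax, and the expected count follows by multiplying by the number of draws. The sketch below is illustrative only: it reuses the simulated logits from the cell above, relies on the standard library rather than NumPy, and the helper names `softmax_with_temperature` and `expected_count` are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over a plain list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_count(token, vocab, logits, temperature, num_samples):
    """Expected number of times `token` appears in `num_samples` draws."""
    probs = softmax_with_temperature(logits, temperature)
    return probs[vocab.index(token)] * num_samples

# Same simulated vocabulary and logits as in the cell above
vocab = ["pizza", "burger", "pasta", "salad", "sushi"]
logits = [2.0, 1.5, 0.5, 0.2, -0.1]

for t in [0.5, 1.0, 1.5, 2.0]:
    p = softmax_with_temperature(logits, t)[vocab.index("pizza")]
    count = expected_count("pizza", vocab, logits, t, 100)
    print(f"T={t}: P(pizza)={p:.3f}, expected count in 100 draws={count:.1f}")
```

Because this computes the distribution rather than sampling from it, the result is exact and does not change from run to run.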


# Exercise 5.2: Different temperature and top-k settings

**Empirical Investigation of Generative Language Model Sampling Parameters**

**Key Research Question: How do variations in `temperature` and `top-k` sampling parameters influence the qualitative and probabilistic characteristics of token generation in stochastic language models?**

*Methodological Framework:*
Conduct a systematic empirical exploration of:
- Temperature scaling dynamics
- Top-k probability truncation mechanisms
- Generative output characteristics across different parameter configurations

*Analytical Objectives:*
- Identify contextual applications that benefit from lower `temperature` and `top-k` settings
- Explore potential use cases preferring higher `temperature` and `top-k` configurations
- Develop a nuanced understanding of how sampling parameters affect generative outputs

*Investigative Dimensions:*
1. Low `temperature` and `top-k` Scenarios
   - Potential applications
   - Characteristics of generated outputs
   - Contextual relevance

2. High `temperature` and `top-k` Scenarios
   - Potential applications
   - Characteristics of generated outputs
   - Contextual relevance

*Recommended Experimental Protocol:*
1. Systematically vary `temperature` and `top-k` parameters
2. Meticulously document generative output characteristics
3. Critically analyze observed variations
4. Develop hypotheses about optimal parameter configurations for specific applications

In [4]:
# Keep only the top-k logits
def top_k_filter(logits, k):
    filtered = logits.copy()  # Work on a copy so the caller's array is not mutated
    indices_to_remove = filtered < np.sort(filtered)[-k]
    filtered[indices_to_remove] = -np.inf  # Mask every score outside the top k
    return filtered

# Temperature-scaled softmax (NumPy)
def temperature_scaled_softmax_np(logits, temperature):
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    probabilities = exp_logits / exp_logits.sum()
    return probabilities

# Sample tokens by combining temperature scaling and top-k filtering
def sample_with_temperature_and_top_k(vocab, logits, temperatures, top_ks, num_samples=5):
    results = {}
    for temp in temperatures:
        for k in top_ks:
            filtered_logits = top_k_filter(logits, k)
            probabilities = temperature_scaled_softmax_np(filtered_logits, temp)
            sampled_tokens = np.random.choice(vocab, size=num_samples, p=probabilities)
            results[(temp, k)] = sampled_tokens.tolist()
    return results

# Example vocabulary and simulated logits
vocab = ["pizza", "burger", "pasta", "salad", "sushi"]
logits = np.array([2.0, 1.5, 0.5, 0.2, -0.1])

# Parameter combinations to test
temperatures = [0.5, 1.0, 1.5]
top_ks = [2, 3, 5]

# Run the analysis
results_5_2 = sample_with_temperature_and_top_k(vocab, logits, temperatures, top_ks, num_samples=5)

# Display the results
print("Results for each (temperature, top-k) combination:")
for key, value in results_5_2.items():
    print(f"Temperature={key[0]}, Top-k={key[1]} => {value}")

Results for each (temperature, top-k) combination:
Temperature=0.5, Top-k=2 => ['pizza', 'pizza', 'pizza', 'pizza', 'burger']
Temperature=0.5, Top-k=3 => ['pizza', 'pizza', 'pizza', 'pizza', 'pizza']
Temperature=0.5, Top-k=5 => ['pizza', 'pizza', 'pizza', 'pizza', 'burger']
Temperature=1.0, Top-k=2 => ['pizza', 'pizza', 'burger', 'pizza', 'pizza']
Temperature=1.0, Top-k=3 => ['pasta', 'burger', 'pasta', 'pizza', 'burger']
Temperature=1.0, Top-k=5 => ['pizza', 'pizza', 'pizza', 'sushi', 'pizza']
Temperature=1.5, Top-k=2 => ['pizza', 'pizza', 'burger', 'pizza', 'burger']
Temperature=1.5, Top-k=3 => ['burger', 'pizza', 'burger', 'pizza', 'pizza']
Temperature=1.5, Top-k=5 => ['pizza', 'pasta', 'pizza', 'pasta', 'sushi']
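
Five samples per configuration give only a rough impression of each setting. One way to compare configurations more systematically is to compute the exact sampling distribution after top-k masking and temperature scaling, then summarize it by its largest probability and its Shannon entropy. The sketch below assumes the same simulated logits as above; the helper names `top_k_temperature_probs` and `entropy_bits` are hypothetical and it uses only the standard library.

```python
import math

def top_k_temperature_probs(logits, k, temperature):
    """Exact sampling distribution after top-k masking and temperature scaling."""
    threshold = sorted(logits, reverse=True)[k - 1]
    # Tokens below the k-th largest logit are masked and get probability zero.
    kept = [x / temperature if x >= threshold else None for x in logits]
    m = max(x for x in kept if x is not None)
    exps = [math.exp(x - m) if x is not None else 0.0 for x in kept]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; 0 means the next token is fully determined."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [2.0, 1.5, 0.5, 0.2, -0.1]
for t in [0.5, 1.0, 1.5]:
    for k in [2, 3, 5]:
        probs = top_k_temperature_probs(logits, k, t)
        print(f"T={t}, k={k}: max prob={max(probs):.3f}, entropy={entropy_bits(probs):.3f} bits")
```

Low entropy (small `temperature`, small `top-k`) corresponds to focused, repeatable output, which tends to suit tasks where consistency matters; higher entropy gives more varied output, which tends to suit open-ended or creative generation.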


# Exercise 5.3: Deterministic behavior in the decoding functions

**Deterministic Token Generation: Parametric Strategies for Eliminating Stochastic Variability**

**Key Research Question: What specific configuration parameters within the `generate` function can systematically eliminate randomness to ensure consistently reproducible generative outputs?**

*Methodological Framework:*
Investigate comprehensive strategies to:
- Suppress stochastic token generation mechanisms
- Enforce deterministic computational behavior
- Replicate the predictable output characteristics of `generate_simple`

*Analytical Objectives:*
- Identify all potential parameter combinations
- Systematically neutralize probabilistic sampling variations
- Establish a deterministic generative protocol

*Critical Configuration Parameters to Examine:*
1. `temperature` scaling
2. `top_k` pruning mechanism
3. Random seed initialization
4. Sampling strategy selection

*Recommended Experimental Protocol:*
1. Analyze individual parameter impacts
2. Identify minimal configuration requirements
3. Validate deterministic output generation
4. Compare against `generate_simple` implementation

*Computational Implications:*
- Understanding stochastic suppression mechanisms
- Insights into generative model controllability
- Strategies for reproducible machine learning outputs

In [6]:
# Greedy decoding (argmax): deterministic
def greedy_decode(logits, vocab):
    token_index = np.argmax(logits)
    return vocab[token_index]

# Stochastic decoding (sampling from the softmax): non-deterministic
def stochastic_decode(logits, vocab, temperature=1.0):
    # Nested helper: temperature-scaled softmax
    def temperature_scaled_softmax_np(logits, temperature):
        scaled_logits = logits / temperature
        exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
        probabilities = exp_logits / exp_logits.sum()
        return probabilities

    probabilities = temperature_scaled_softmax_np(logits, temperature)
    token_index = np.random.choice(len(vocab), p=probabilities)
    return vocab[token_index]

# Example vocabulary and simulated logits
vocab = ["pizza", "burger", "pasta", "salad", "sushi"]
logits = np.array([2.0, 1.5, 0.5, 0.2, -0.1])

# Run both methods several times
greedy_results = [greedy_decode(logits, vocab) for _ in range(5)]
stochastic_results = [stochastic_decode(logits, vocab, temperature=1.0) for _ in range(5)]

# Display the results
print("Greedy decoding (deterministic):", greedy_results)
print("Stochastic decoding (non-deterministic):", stochastic_results)

Greedy decoding (deterministic): ['pizza', 'pizza', 'pizza', 'pizza', 'pizza']
Stochastic decoding (non-deterministic): ['pizza', 'sushi', 'burger', 'pizza', 'pizza']
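
The parameters listed in the exercise can be explored on the same toy logits. The sketch below is a hypothetical stand-in for the chapter's `generate` function, not its actual implementation: `decode_token` treats `temperature == 0` as greedy argmax (an assumption about how such functions are commonly written), applies an optional `top_k` mask, and accepts a random number generator so that a seed can be fixed.

```python
import math
import random

def decode_token(logits, vocab, temperature=1.0, top_k=None, rng=random):
    """Pick one token; temperature == 0 is treated as greedy argmax (an assumption)."""
    if top_k is not None:
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= threshold else -math.inf for x in logits]
    if temperature == 0:
        return vocab[max(range(len(logits)), key=logits.__getitem__)]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]  # masked tokens get weight 0.0
    return rng.choices(vocab, weights=weights, k=1)[0]

vocab = ["pizza", "burger", "pasta", "salad", "sushi"]
logits = [2.0, 1.5, 0.5, 0.2, -0.1]

# Route 1: temperature 0 disables sampling entirely.
greedy = [decode_token(logits, vocab, temperature=0) for _ in range(5)]

# Route 2: top_k=1 leaves a single candidate, so sampling has only one outcome.
top1 = [decode_token(logits, vocab, temperature=1.0, top_k=1) for _ in range(5)]

# Route 3: a fixed seed makes sampling reproducible, but not greedy.
def run_seeded(seed):
    rng = random.Random(seed)
    return [decode_token(logits, vocab, rng=rng) for _ in range(5)]

print("temperature=0:", greedy)
print("top_k=1:", top1)
print("same seed twice:", run_seeded(123) == run_seeded(123))
```

The first two routes reproduce greedy decoding regardless of the random state. A fixed seed only guarantees that the same sampled sequence is reproduced; that sequence may still differ from the greedy one.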


# Exercise 5.4: Continued pretraining

**Continuation of Model Training: Stateful Resumption and Persistent Learning Dynamics**

**Key Research Question: How can we effectively restore a machine learning model's training state across separate computational sessions, enabling seamless continuation of the pretraining process?**

*Methodological Framework:*
Implement a comprehensive model and optimizer state restoration strategy involving:
- Weight reconstruction
- Optimizer state recovery
- Resumption of training from the previously interrupted state

*Analytical Objectives:*
- Demonstrate stateful model persistence
- Execute one additional training epoch using the restored model and optimizer state
- Validate continuity of learning progression

*Critical Procedural Steps:*
1. Load previously saved model weights
2. Reconstruct optimizer internal state
3. Reinitiate training using `train_model_simple` function
4. Complete one additional training epoch

*Recommended Implementation Strategy:*
- Utilize precise weight and optimizer state loading mechanisms
- Verify complete state restoration
- Execute one uninterrupted additional training epoch
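
Why the optimizer state has to be restored along with the weights can be seen on a toy problem. The sketch below is not the chapter's `train_model_simple` loop: it is a minimal, standard-library illustration in which a single weight is trained with SGD plus momentum, and the hypothetical `train_steps` helper and dictionary "checkpoint" stand in for a real model, optimizer, and checkpoint file. In PyTorch, the analogous step is typically to save and reload both `model.state_dict()` and `optimizer.state_dict()`.

```python
import copy

def train_steps(state, num_steps, lr=0.1, momentum=0.9):
    """Run SGD-with-momentum steps on the loss (w - 3)^2, updating `state` in place."""
    for _ in range(num_steps):
        grad = 2.0 * (state["weight"] - 3.0)
        state["velocity"] = momentum * state["velocity"] + grad
        state["weight"] -= lr * state["velocity"]
    return state

# Uninterrupted reference run: 10 steps.
reference = train_steps({"weight": 0.0, "velocity": 0.0}, 10)

# Interrupted run: 5 steps, then a checkpoint (deepcopy stands in for saving to disk).
first_half = train_steps({"weight": 0.0, "velocity": 0.0}, 5)
checkpoint = copy.deepcopy(first_half)

# Resume with the full state (weight and optimizer velocity).
resumed_full = train_steps(copy.deepcopy(checkpoint), 5)

# Resume with the weight only: the optimizer state starts from scratch.
resumed_weights_only = train_steps({"weight": checkpoint["weight"], "velocity": 0.0}, 5)

print("reference:                      ", reference["weight"])
print("resumed with optimizer state:   ", resumed_full["weight"])
print("resumed without optimizer state:", resumed_weights_only["weight"])
```

Resuming with the full state reproduces the uninterrupted run, while restoring only the weight follows a different trajectory. Adaptive optimizers such as AdamW also keep per-parameter running statistics, so the same reasoning applies to them.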

In [7]:
# Temperature-scaled softmax (NumPy)
def temperature_scaled_softmax_np(logits, temperature):
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))  # Subtract the max for numerical stability
    probabilities = exp_logits / exp_logits.sum()
    return probabilities

# Nucleus (top-p) sampling
def nucleus_sampling(logits, vocab, p=0.9, temperature=1.0):
    probabilities = temperature_scaled_softmax_np(logits, temperature)
    sorted_indices = np.argsort(probabilities)[::-1]  # Indices sorted by decreasing probability
    sorted_probabilities = probabilities[sorted_indices]

    # Cumulative sum of the sorted probabilities
    cumulative_probabilities = np.cumsum(sorted_probabilities)
    cutoff_index = np.argmax(cumulative_probabilities >= p) + 1

    # Keep only the indices inside the top-p nucleus
    top_p_indices = sorted_indices[:cutoff_index]
    top_p_probabilities = probabilities[top_p_indices]
    top_p_probabilities /= top_p_probabilities.sum()  # Renormalize the remaining probabilities

    # Sample among the top-p indices
    sampled_index = np.random.choice(top_p_indices, p=top_p_probabilities)
    return vocab[sampled_index]

# Example vocabulary and simulated logits
vocab = ["pizza", "burger", "pasta", "salad", "sushi"]
logits = np.array([2.0, 1.5, 0.5, 0.2, -0.1])

# Nucleus sampling parameters
temperatures = [0.7, 1.0]
top_ps = [0.8, 0.9, 0.95]

# Run nucleus sampling for each combination
results_5_4 = {}
for temp in temperatures:
    for p in top_ps:
        samples = [nucleus_sampling(logits, vocab, p=p, temperature=temp) for _ in range(5)]
        results_5_4[(temp, p)] = samples

# Display the results
print("Results for each (temperature, top-p) combination:")
for key, value in results_5_4.items():
    print(f"Temperature={key[0]}, Top-p={key[1]} => {value}")

Results for each (temperature, top-p) combination:
Temperature=0.7, Top-p=0.8 => ['pizza', 'pizza', 'burger', 'pizza', 'burger']
Temperature=0.7, Top-p=0.9 => ['burger', 'pizza', 'pizza', 'burger', 'pizza']
Temperature=0.7, Top-p=0.95 => ['pizza', 'pizza', 'salad', 'pizza', 'pizza']
Temperature=1.0, Top-p=0.8 => ['pizza', 'burger', 'burger', 'burger', 'pizza']
Temperature=1.0, Top-p=0.9 => ['pizza', 'salad', 'salad', 'pizza', 'pizza']
Temperature=1.0, Top-p=0.95 => ['burger', 'pizza', 'pizza', 'pizza', 'pasta']
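
Which tokens can appear at all in the samples above is fixed by the nucleus, that is, the smallest set of top-ranked tokens whose cumulative probability reaches `p`. The sketch below lists that set for each configuration. It assumes the same simulated logits, uses only the standard library, and the helper name `nucleus_tokens` is hypothetical.

```python
import math

def nucleus_tokens(logits, vocab, p, temperature):
    """Return the tokens inside the top-p nucleus, most probable first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    ranked = sorted(
        zip(vocab, (e / total for e in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= p:  # stop once the cumulative mass reaches p
            break
    return kept

vocab = ["pizza", "burger", "pasta", "salad", "sushi"]
logits = [2.0, 1.5, 0.5, 0.2, -0.1]

for t in [0.7, 1.0]:
    for p in [0.8, 0.9, 0.95]:
        print(f"T={t}, p={p}: nucleus={nucleus_tokens(logits, vocab, p, t)}")
```

A lower temperature sharpens the distribution, so the same `p` is reached with fewer tokens and the nucleus shrinks.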


# Exercise 5.5: Training and validation set losses of the pretrained model

**Comparative Loss Assessment: Pretrained Model Performance on Specialized Textual Domain**

**Key Research Question: What are the training and validation set losses when a pretrained OpenAI `GPTModel` is evaluated on the dataset "The Verdict"?**

*Methodological Framework:*
Conduct a comprehensive loss evaluation involving:
- Model weight initialization from pretrained OpenAI configuration
- Computational loss calculation across training and validation datasets
- Quantitative performance assessment in domain-specific context

*Analytical Objectives:*
- Determine the loss on the training dataset
- Calculate the loss on the validation set
- Interpret the performance characteristics of the pretrained model on a specialized textual domain

*Critical Computational Procedures:*
1. Load pretrained OpenAI `GPTModel` weights
2. Prepare "The Verdict" dataset
3. Compute training set loss
4. Compute validation set loss
5. Compare the two losses

*Investigative Parameters:*
- Model: Pretrained OpenAI `GPTModel`
- Dataset: "The Verdict"
- Metrics: Training and validation loss measurements

*Recommended Analytical Approach:*
- Implement precise loss computation
- Validate computational methodology
- Critically interpret loss metric implications
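
The quantity being compared can be made concrete on toy numbers. The sketch below assumes the loss is the mean cross-entropy over next-token predictions, as is typical for language model pretraining. The logits and targets are made-up stand-ins, not outputs of the pretrained model, and the helper names `cross_entropy` and `average_loss` are hypothetical.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

def average_loss(examples):
    """Mean per-token loss over (logits, target_index) pairs."""
    losses = [cross_entropy(logits, target) for logits, target in examples]
    return sum(losses) / len(losses)

# Made-up stand-ins for model outputs on a training and a validation split.
train_examples = [([2.0, 1.5, 0.5, 0.2, -0.1], 0), ([0.1, 3.0, 0.2, 0.0, -1.0], 1)]
val_examples = [([2.0, 1.5, 0.5, 0.2, -0.1], 4), ([0.1, 3.0, 0.2, 0.0, -1.0], 2)]

train_loss = average_loss(train_examples)
val_loss = average_loss(val_examples)
print(f"train loss={train_loss:.3f}, perplexity={math.exp(train_loss):.2f}")
print(f"val loss={val_loss:.3f}, perplexity={math.exp(val_loss):.2f}")
```

The loss is low when the target token has a high predicted probability. Exponentiating the mean loss gives the perplexity, which is often easier to interpret when comparing the two splits.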

In [8]:
# Temperature-scaled softmax
def temperature_scaled_softmax_np(logits, temperature):
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    probabilities = exp_logits / exp_logits.sum()
    return probabilities

# Nucleus (top-p) filtering followed by top-k
def hybrid_nucleus_top_k_sampling(logits, vocab, p=0.9, k=3, temperature=1.0):
    # Step 1: nucleus (top-p) filtering
    probabilities = temperature_scaled_softmax_np(logits, temperature)
    sorted_indices = np.argsort(probabilities)[::-1]
    sorted_probabilities = probabilities[sorted_indices]

    # Cumulative sum of the sorted probabilities
    cumulative_probabilities = np.cumsum(sorted_probabilities)
    cutoff_index = np.argmax(cumulative_probabilities >= p) + 1

    # Keep only the indices inside the top-p nucleus
    top_p_indices = sorted_indices[:cutoff_index]

    # Step 2: top-k over the nucleus
    filtered_logits = np.full_like(logits, -np.inf)
    filtered_logits[top_p_indices] = logits[top_p_indices]
    top_k_indices = np.argsort(filtered_logits)[-k:]
    # Masked (-inf) entries get probability 0 if the nucleus holds fewer than k tokens
    top_k_probabilities = temperature_scaled_softmax_np(filtered_logits[top_k_indices], temperature)

    # Sample among the top-k indices
    sampled_index = np.random.choice(top_k_indices, p=top_k_probabilities)
    return vocab[sampled_index]

# Example vocabulary and simulated logits
vocab = ["pizza", "burger", "pasta", "salad", "sushi"]
logits = np.array([2.0, 1.5, 0.5, 0.2, -0.1])

# Hybrid decoding parameters
temperatures = [0.7, 1.0]
top_ps = [0.8, 0.9]
top_ks = [2, 3]

# Run hybrid decoding for each combination
results_5_5 = {}
for temp in temperatures:
    for p in top_ps:
        for k in top_ks:
            samples = [hybrid_nucleus_top_k_sampling(logits, vocab, p=p, k=k, temperature=temp) for _ in range(5)]
            results_5_5[(temp, p, k)] = samples

# Display the results
print("Results for each (temperature, top-p, top-k) combination:")
for key, value in results_5_5.items():
    print(f"Temperature={key[0]}, Top-p={key[1]}, Top-k={key[2]} => {value}")

Results for each (temperature, top-p, top-k) combination:
Temperature=0.7, Top-p=0.8, Top-k=2 => ['burger', 'burger', 'burger', 'burger', 'pizza']
Temperature=0.7, Top-p=0.8, Top-k=3 => ['burger', 'burger', 'pizza', 'burger', 'burger']
Temperature=0.7, Top-p=0.9, Top-k=2 => ['pizza', 'burger', 'burger', 'pizza', 'pizza']
Temperature=0.7, Top-p=0.9, Top-k=3 => ['pizza', 'pizza', 'pizza', 'burger', 'pizza']
Temperature=1.0, Top-p=0.8, Top-k=2 => ['pizza', 'pizza', 'pizza', 'burger', 'burger']
Temperature=1.0, Top-p=0.8, Top-k=3 => ['burger', 'burger', 'burger', 'pizza', 'burger']
Temperature=1.0, Top-p=0.9, Top-k=2 => ['pizza', 'pizza', 'pizza', 'pizza', 'burger']
Temperature=1.0, Top-p=0.9, Top-k=3 => ['burger', 'pizza', 'pizza', 'burger', 'burger']


# Exercise 5.6: Trying larger models

**Comparative Generative Analysis: Scale and Performance Variations in GPT-2 Model Architectures**

**Key Research Question: How do generative text characteristics vary across different GPT-2 model scales, specifically comparing the 124 million and 1,558 million parameter configurations?**

*Methodological Framework:*
Conduct a systematic comparative investigation of:
- Generative text quality
- Semantic coherence
- Linguistic complexity
- Contextual understanding

*Analytical Objectives:*
- Empirically assess generative performance across model scales
- Identify qualitative differences in text generation
- Explore the relationship between model parameter count and generative capabilities

*Comparative Model Configurations:*
1. Smaller Model: **124 million parameters**
2. Larger Model: **1,558 million parameters**

*Investigative Dimensions:*
- Textual coherence
- Semantic precision
- Contextual relevance
- Linguistic nuance
- Complexity of generated content

*Experimental Protocol:*
1. Generate text samples using both model configurations
2. Conduct qualitative comparative analysis
3. Assess generative performance across multiple dimensions
4. Document observable variations in text generation characteristics

*Recommended Analytical Approach:*
- Utilize consistent generation parameters
- Employ multiple generation trials
- Implement rigorous qualitative assessment
- Develop comprehensive comparative framework
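
The two parameter counts in the exercise follow from the model dimensions. The sketch below recomputes them from the published GPT-2 settings (embedding size 768 with 12 layers for the small model, 1600 with 48 layers for the largest, vocabulary 50,257, context length 1,024). It assumes the output layer shares its weights with the token embedding, as in the original GPT-2; an implementation with a separate output layer would report a larger number. The function name `gpt2_parameter_count` is hypothetical.

```python
def gpt2_parameter_count(emb_dim, n_layers, vocab_size=50257, context_length=1024):
    """Count parameters of a GPT-2 style model, assuming tied input/output embeddings."""
    embeddings = vocab_size * emb_dim + context_length * emb_dim
    attention = emb_dim * 3 * emb_dim + 3 * emb_dim     # fused query/key/value projection with bias
    attention += emb_dim * emb_dim + emb_dim            # attention output projection
    feed_forward = emb_dim * 4 * emb_dim + 4 * emb_dim  # expansion layer
    feed_forward += 4 * emb_dim * emb_dim + emb_dim     # contraction layer
    layer_norms = 2 * 2 * emb_dim                       # two LayerNorms (scale and shift) per block
    final_norm = 2 * emb_dim
    return embeddings + n_layers * (attention + feed_forward + layer_norms) + final_norm

configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

for name, cfg in configs.items():
    total = gpt2_parameter_count(cfg["emb_dim"], cfg["n_layers"])
    print(f"{name}: {total:,} parameters")
```

The number of attention heads does not change the count, because the heads split the same projection matrices. The larger model is roughly 12.5 times the size of the smaller one, mostly because of the additional and wider transformer blocks.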

In [None]:
from collections import Counter

# Temperature-scaled softmax
def temperature_scaled_softmax_np(logits, temperature):
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    probabilities = exp_logits / exp_logits.sum()
    return probabilities

# Greedy decoding
def greedy_decode(logits, vocab):
    token_index = np.argmax(logits)
    return vocab[token_index]

# Stochastic decoding
def stochastic_decode(logits, vocab, temperature=1.0):
    probabilities = temperature_scaled_softmax_np(logits, temperature)
    token_index = np.random.choice(len(vocab), p=probabilities)
    return vocab[token_index]

# Nucleus (top-p) sampling
def nucleus_sampling(logits, vocab, p=0.9, temperature=1.0):
    probabilities = temperature_scaled_softmax_np(logits, temperature)
    sorted_indices = np.argsort(probabilities)[::-1] 
    sorted_probabilities = probabilities[sorted_indices]
    cumulative_probabilities = np.cumsum(sorted_probabilities)
    cutoff_index = np.argmax(cumulative_probabilities >= p) + 1
    top_p_indices = sorted_indices[:cutoff_index]
    top_p_probabilities = probabilities[top_p_indices]
    top_p_probabilities /= top_p_probabilities.sum() 
    sampled_index = np.random.choice(top_p_indices, p=top_p_probabilities)
    return vocab[sampled_index]

# Hybrid decoding: nucleus (top-p) filtering followed by top-k
def hybrid_nucleus_top_k_sampling(logits, vocab, p=0.9, k=3, temperature=1.0):
    probabilities = temperature_scaled_softmax_np(logits, temperature)
    sorted_indices = np.argsort(probabilities)[::-1] 
    sorted_probabilities = probabilities[sorted_indices]
    cumulative_probabilities = np.cumsum(sorted_probabilities)
    cutoff_index = np.argmax(cumulative_probabilities >= p) + 1
    top_p_indices = sorted_indices[:cutoff_index]
    filtered_logits = np.full_like(logits, -np.inf)
    filtered_logits[top_p_indices] = logits[top_p_indices]
    top_k_indices = np.argsort(filtered_logits)[-k:]
    top_k_probabilities = temperature_scaled_softmax_np(filtered_logits[top_k_indices], temperature)
    sampled_index = np.random.choice(top_k_indices, p=top_k_probabilities)
    return vocab[sampled_index]

# Diversity and coherence measures
def calculate_diversity(samples):
    # Share of distinct tokens among the samples
    unique_tokens = len(set(samples))
    return unique_tokens / len(samples)

def calculate_coherence(samples):
    # Coherence proxy: share of samples equal to the most frequent token
    most_common_count = Counter(samples).most_common(1)[0][1]
    return most_common_count / len(samples)

# Evaluate the decoding methods
def evaluate_decoding_methods(vocab, logits, num_samples=10, p=0.9, k=3, temperature=1.0):
    methods = ["Greedy", "Stochastic", "Nucleus", "Hybrid"]
    results = {}

    for method in methods:
        if method == "Greedy":
            samples = [greedy_decode(logits, vocab) for _ in range(num_samples)]
        elif method == "Stochastic":
            samples = [stochastic_decode(logits, vocab, temperature=temperature) for _ in range(num_samples)]
        elif method == "Nucleus":
            samples = [nucleus_sampling(logits, vocab, p=p, temperature=temperature) for _ in range(num_samples)]
        elif method == "Hybrid":
            samples = [
                hybrid_nucleus_top_k_sampling(logits, vocab, p=p, k=k, temperature=temperature)
                for _ in range(num_samples)
            ]

        diversity = calculate_diversity(samples)
        coherence = calculate_coherence(samples)
        results[method] = {"Diversity": diversity, "Coherence": coherence, "Samples": samples}

    return results

# Example vocabulary and simulated logits
vocab = ["pizza", "burger", "pasta", "salad", "sushi"]
logits = np.array([2.0, 1.5, 0.5, 0.2, -0.1])

# Parameters
num_samples = 10
p = 0.9
k = 3
temperature = 1.0

# Evaluation
evaluation_results = evaluate_decoding_methods(vocab, logits, num_samples, p, k, temperature)

# Display the results
for method, metrics in evaluation_results.items():
    print(f"Method: {method}")
    print(f"Diversity: {metrics['Diversity']}")
    print(f"Coherence: {metrics['Coherence']}")
    print(f"Samples: {metrics['Samples']}\n")

Method: Greedy
Diversity: 0.1
Coherence: 1.0
Samples: ['pizza', 'pizza', 'pizza', 'pizza', 'pizza', 'pizza', 'pizza', 'pizza', 'pizza', 'pizza']

Method: Stochastic
Diversity: 0.4
Coherence: 0.6
Samples: ['sushi', 'pizza', 'pasta', 'pizza', 'salad', 'salad', 'pizza', 'pizza', 'pizza', 'pizza']

Method: Nucleus
Diversity: 0.2
Coherence: 0.6
Samples: ['burger', 'burger', 'pizza', 'pizza', 'pizza', 'pizza', 'pizza', 'burger', 'burger', 'pizza']

Method: Hybrid
Diversity: 0.3
Coherence: 0.7
Samples: ['pasta', 'pizza', 'pizza', 'burger', 'burger', 'pizza', 'pizza', 'pizza', 'pizza', 'pizza']

