# Chapter 5 - Exercises

> Author : Badr TAJINI - Large Language Models (LLMs) - ESIEE 2024-2025

---

# Exercise 5.1: Temperature-scaled softmax scores and sampling probabilities

**Empirical Analysis of Token Sampling Frequencies Under Temperature Scaling**

**Key Research Question: How does temperature-based scaling of the `softmax` probability distribution impact the sampling frequency of the specific lexical token `"pizza"`?**

*Methodological Framework:*
Utilize the `print_sampled_tokens` function to:
- Empirically examine token sampling probabilities
- Analyze the impact of temperature scaling
- Quantify the sampling occurrence of the `"pizza"` token

*Analytical Objectives:*
- Determine the precise sampling frequency of `"pizza"` across different temperature configurations
- Critically evaluate the current computational approach to sampling frequency measurement
- Explore potential methodological improvements for more efficient and accurate token sampling analysis

*Key Investigative Parameters:*
- Primary token of interest: `"pizza"`
- Sampling method: Temperature-scaled `softmax` distribution
- Computational tool: `print_sampled_tokens` function


In [None]:
import torch

vocab = {
    "closer": 0,
    "every": 1,
    "effort": 2,
    "forward": 3,
    "inches": 4,
    "moves": 5,
    "pizza": 6,
    "toward": 7,
    "you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}

next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

In [None]:
def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

def print_sampled_tokens(probas):
    torch.manual_seed(123) # Manual seed for reproducibility
    sample = [torch.multinomial(probas, num_samples=1).item() for i in range(1_000)]
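    # minlength guarantees an entry for every token id, even one that is never sampled
    sampled_ids = torch.bincount(torch.tensor(sample), minlength=len(probas))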
    print(int(sampled_ids[6]), "x pizza")

In [None]:
# Temperature values, from very low (sharper distribution) to very high (near-uniform distribution)
temperatures = [0.1, 0.5, 1, 1.5, 2, 3, 5, 10, 15, 20, 50, 100, 500, 1000]

# Calculate scaled probabilities
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]

for i, probas in enumerate(scaled_probas):
    print(f"Temperature {temperatures[i]}:")
    print_sampled_tokens(probas)
    print()

Temperature 0.1:
0 x pizza

Temperature 0.5:
0 x pizza

Temperature 1:
0 x pizza

Temperature 1.5:
2 x pizza

Temperature 2:
4 x pizza

Temperature 3:
15 x pizza

Temperature 5:
43 x pizza

Temperature 10:
62 x pizza

Temperature 15:
76 x pizza

Temperature 20:
85 x pizza

Temperature 50:
97 x pizza

Temperature 100:
99 x pizza

Temperature 500:
102 x pizza

Temperature 1000:
102 x pizza



We can see that the temperature affects the softmax computation. When the temperature is below 1, the logits in `next_token_logits` are divided by a value smaller than 1, so the gap between high and low logits grows. This explains why the chance of sampling `"pizza"` becomes very low: its logit was already among the lowest.

Conversely, when the temperature is above 1, the logits are divided by a value larger than 1, so the gap between high and low logits shrinks. This lowers the probability of high-logit tokens and raises the probability of low-logit tokens. For a very high temperature, all scaled logits are nearly equal, so each token's probability approaches 1/(num_elements) regardless of its original logit. Here that is 1/9, or about 111 expected draws out of 1,000, which is consistent with the 102 draws observed at temperature 1000.
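
Counting samples only estimates the frequency of `"pizza"`, and the estimate is noisy for rare tokens. A more direct option is to read the probability from the temperature-scaled softmax itself. The sketch below does this in plain Python (standard library only) so that it runs without `torch`; the constant `PIZZA_ID` and the dictionary `pizza_probas` are names introduced here for illustration.

```python
import math

# Logits and the index of "pizza" are copied from the exercise above.
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
PIZZA_ID = 6  # illustrative constant: position of "pizza" in the vocabulary

def softmax_with_temperature(logits, temperature):
    """Plain-Python counterpart of the torch function used above."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

pizza_probas = {}
for T in [0.1, 1, 5, 1000]:
    pizza_probas[T] = softmax_with_temperature(next_token_logits, T)[PIZZA_ID]
    # The expected number of "pizza" draws in 1,000 samples is 1000 * p
    print(f"T={T}: p(pizza)={pizza_probas[T]:.6f}, "
          f"expected count={1000 * pizza_probas[T]:.1f}")
```

Multiplying the probability by the number of draws gives the expected count directly, without running 1,000 calls to a sampler for each temperature.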

# Exercise 5.2: Different temperature and top-k settings

**Empirical Investigation of Generative Language Model Sampling Parameters**

**Key Research Question: How do variations in `temperature` and `top-k` sampling parameters influence the qualitative and probabilistic characteristics of token generation in stochastic language models?**

*Methodological Framework:*
Conduct a systematic empirical exploration of:
- Temperature scaling dynamics
- Top-k probability truncation mechanisms
- Generative output characteristics across different parameter configurations

*Analytical Objectives:*
- Identify contextual applications that benefit from lower `temperature` and `top-k` settings
- Explore potential use cases preferring higher `temperature` and `top-k` configurations
- Develop nuanced understanding of sampling parameter impact on generative outputs

*Investigative Dimensions:*
1. Low `temperature` and `top-k` Scenarios
   - Potential applications
   - Characteristics of generated outputs
   - Contextual relevance

2. High `temperature` and `top-k` Scenarios
   - Potential applications
   - Characteristics of generated outputs
   - Contextual relevance

*Recommended Experimental Protocol:*
1. Systematically vary `temperature` and `top-k` parameters
2. Meticulously document generative output characteristics
3. Critically analyze observed variations
4. Develop hypotheses about optimal parameter configurations for specific applications

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import sys
import os

module_path = '/content/drive/MyDrive/DSIA_LLM/lab5/main_code' # Directory expected to contain the lab modules imported below (e.g. gpt_generate.py)
if module_path not in sys.path:
    sys.path.append(module_path)

Mounted at /content/drive


In [None]:
!pip install tiktoken
from gpt_generate import download_and_load_gpt2, GPTModel, load_weights_into_gpt, generate, text_to_token_ids, token_ids_to_text
import tiktoken
import torch
import numpy as np

torch.manual_seed(123)

CHOOSE_MODEL = "gpt2-small (124M)"
INPUT_PROMPT = "Every effort moves you"

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

gpt = GPTModel(BASE_CONFIG)
load_weights_into_gpt(gpt, params)
gpt.to(device)
gpt.eval()

tokenizer = tiktoken.get_encoding("gpt2")

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.8.0


checkpoint: 100%|██████████| 77.0/77.0 [00:00<00:00, 102kiB/s]
encoder.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 2.32MiB/s]
hparams.json: 100%|██████████| 90.0/90.0 [00:00<00:00, 140kiB/s]
model.ckpt.data-00000-of-00001: 100%|██████████| 498M/498M [00:20<00:00, 24.2MiB/s]
model.ckpt.index: 100%|██████████| 5.21k/5.21k [00:00<00:00, 3.42MiB/s]
model.ckpt.meta: 100%|██████████| 471k/471k [00:00<00:00, 1.33MiB/s]
vocab.bpe: 100%|██████████| 456k/456k [00:00<00:00, 1.07MiB/s]


In [None]:
torch.manual_seed(123)
for i in range(5):
  token_ids = generate(
      model=gpt,
      idx=text_to_token_ids(INPUT_PROMPT, tokenizer),
      max_new_tokens=25,
      context_size=BASE_CONFIG["context_length"],
      top_k=5,
      temperature=0.1
  )
  print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you forward.

The next time you see a person who is not a good person, ask yourself, "What is the
Output text:
 Every effort moves you forward, and you are not alone.

The world is changing.

The world is changing.

The
Output text:
 Every effort moves you forward.

The best way to do this is to take a step back and think about what you're doing.

Output text:
 Every effort moves you forward, but you have to keep moving forward.

"I think that's what we're trying to do. We
Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the


We can see that with a low temperature and a small k, the generations overlap heavily: all five start with "Every effort moves you forward" and several repeat phrases. The text still makes sense.

In [None]:
torch.manual_seed(123)
for i in range(5):
  token_ids = generate(
      model=gpt,
      idx=text_to_token_ids(INPUT_PROMPT, tokenizer),
      max_new_tokens=25,
      context_size=BASE_CONFIG["context_length"],
      top_k=100,
      temperature=10
  )
  print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you toward better-trained leadership from teachers from independent academies
"Each sector plays at being accountable , based both to its teachers
Output text:
 Every effort moves you so that everything must focus by itself [14 April 1209 The second wave "rearwind" on April 3
I
Output text:
 Every effort moves you by telling exactly when not I wanted! Every body movement could or doesn? Who wants fun so everyone turns heel out on Friday
Output text:
 Every effort moves you up exponentially since those first levels are tough once You find all enemies using Only Bombs/Trashing Cards again on Day 11 .
Output text:
 Every effort moves you like so too… until everyone reads 'everything in book of Revelation 1 who prayed one would eat everything 'a shepherd'. As


With a high temperature and a large k, the generations differ greatly from one another, but they lose coherence: the text changes direction every few words.

A low temperature and a small k produce nearly the same text across multiple runs of the program.
Conversely, a high temperature and a large k produce very different texts across runs.

The first setting suits cases where we want an accurate and coherent answer, for example when we ask a factual question.

The second setting suits cases where we want the model to produce varied samples, for example writing a story or proposing several names or titles.
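
To make the effect of the two parameters concrete, here is a minimal sketch that applies top-k filtering and temperature scaling to the nine toy logits from Exercise 5.1. It is written in plain Python and is not the lab's `generate` function; the helper `top_k_temperature_probas` and the two example settings are assumptions chosen for illustration.

```python
import math

vocab = ["closer", "every", "effort", "forward", "inches",
         "moves", "pizza", "toward", "you"]
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

def top_k_temperature_probas(logits, top_k, temperature):
    """Illustrative helper: top-k masking followed by a temperature-scaled softmax."""
    # Keep the top_k largest logits and mask the others with -inf
    threshold = sorted(logits, reverse=True)[top_k - 1]
    masked = [x if x >= threshold else float("-inf") for x in logits]
    scaled = [x / temperature for x in masked]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Low temperature, small k: almost all probability mass on one token
low = top_k_temperature_probas(next_token_logits, top_k=3, temperature=0.1)
# High temperature, large k: probability mass spread over many tokens
high = top_k_temperature_probas(next_token_logits, top_k=9, temperature=10)

for name, probas in [("low T, small k", low), ("high T, large k", high)]:
    print(name)
    for token, p in zip(vocab, probas):
        print(f"  {token:8s} {p:.4f}")
```

In the first setting only three tokens keep a non-zero probability and one of them dominates, which matches the repetitive outputs above. In the second setting every token stays reachable, which matches the varied but less coherent outputs.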

# Exercise 5.3: Deterministic behavior in the decoding functions

**Deterministic Token Generation: Parametric Strategies for Eliminating Stochastic Variability**

**Key Research Question: What specific configuration parameters within the `generate` function can systematically eliminate randomness to ensure consistently reproducible generative outputs?**

*Methodological Framework:*
*Investigate comprehensive strategies to:*
- Suppress stochastic token generation mechanisms
- Enforce deterministic computational behavior
- Replicate the predictable output characteristics of `generate_simple`

*Analytical Objectives:*
- Identify all potential parameter combinations
- Systematically neutralize probabilistic sampling variations
- Establish deterministic generative protocol

*Critical Configuration Parameters to Examine:*
1. `temperature` scaling
2. `top_k` pruning mechanism
3. Random seed initialization
4. Sampling strategy selection

*Recommended Experimental Protocol:*
1. Analyze individual parameter impacts
2. Identify minimal configuration requirements
3. Validate deterministic output generation
4. Compare against `generate_simple` implementation

*Computational Implications:*
- Understanding stochastic suppression mechanisms
- Insights into generative model controllability
- Strategies for reproducible machine learning outputs

Only two parameters of `generate` can be changed to eliminate randomness: `top_k` and `temperature`.
We can also change the seed passed to `torch.manual_seed` to check whether a different seed affects the result.

In [None]:
# top k = 1000
# temperature = 0.0001
for i in range(3):
  token_ids = generate(
      model=gpt,
      idx=text_to_token_ids(INPUT_PROMPT, tokenizer),
      max_new_tokens=25,
      context_size=BASE_CONFIG["context_length"],
      top_k=1000,
      temperature=0.0001
  )
  print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the
Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the
Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the


In [None]:
# top k = 1
# temperature = 1000
for i in range(3):
  token_ids = generate(
      model=gpt,
      idx=text_to_token_ids(INPUT_PROMPT, tokenizer),
      max_new_tokens=25,
      context_size=BASE_CONFIG["context_length"],
      top_k=1,
      temperature=1000
  )
  print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the
Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the
Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the


Setting either `top_k=1` or a very low temperature appears to be enough to generate the same text every time. Let's check whether this still holds after changing the seed.

In [None]:
torch.manual_seed(456)

<torch._C.Generator at 0x7b4c7846dc70>

In [None]:
# top k = 1000
# temperature = 0.0001
for i in range(3):
  token_ids = generate(
      model=gpt,
      idx=text_to_token_ids(INPUT_PROMPT, tokenizer),
      max_new_tokens=25,
      context_size=BASE_CONFIG["context_length"],
      top_k=1000,
      temperature=0.0001
  )
  print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the
Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the
Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the


In [None]:
# top k = 1
# temperature = 1000
for i in range(3):
  token_ids = generate(
      model=gpt,
      idx=text_to_token_ids(INPUT_PROMPT, tokenizer),
      max_new_tokens=25,
      context_size=BASE_CONFIG["context_length"],
      top_k=1,
      temperature=1000
  )
  print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the
Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the
Output text:
 Every effort moves you forward.

The first step is to understand the importance of your work.

The second step is to understand the


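Even after changing the seed, we obtain the same generation. We can conclude that either a very small temperature or `top_k=1` is enough. With `top_k=1`, only a single candidate token remains, so the result is deterministic whatever the temperature. With a very small temperature, the distribution is almost one-hot, so the output is deterministic in practice.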
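
The same conclusion can be checked on the toy logits from Exercise 5.1. The sketch below is a simplified stand-in for the sampling step, not the lab's `generate` function; `sample_next_token` and `distinct_tokens` are hypothetical helpers. It draws one token per seed and collects the distinct token ids obtained.

```python
import math
import random

next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

def sample_next_token(logits, top_k, temperature, rng):
    """Toy top-k + temperature sampler (hypothetical helper)."""
    threshold = sorted(logits, reverse=True)[top_k - 1]
    max_logit = max(logits)
    # Unnormalized softmax weights; tokens outside the top k get weight 0
    weights = [
        math.exp((x - max_logit) / temperature) if x >= threshold else 0.0
        for x in logits
    ]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def distinct_tokens(top_k, temperature, num_seeds=50):
    """Set of token ids drawn when only the random seed changes."""
    return {
        sample_next_token(next_token_logits, top_k, temperature,
                          random.Random(seed))
        for seed in range(num_seeds)
    }

only_top1 = distinct_tokens(top_k=1, temperature=1000)
tiny_temperature = distinct_tokens(top_k=9, temperature=0.0001)
default_sampling = distinct_tokens(top_k=9, temperature=1)

print("top_k=1, T=1000   ->", only_top1)
print("top_k=9, T=0.0001 ->", tiny_temperature)
print("top_k=9, T=1      ->", default_sampling)
```

With `top_k=1` every weight except the largest is exactly zero, so the seed cannot matter. With the tiny temperature the other weights underflow to zero in floating point, which is why that setting also behaves deterministically here.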

# Exercise 5.4: Continued pretraining

**Continuation of Model Training: Stateful Resumption and Persistent Learning Dynamics**

**Key Research Question: How can we effectively restore a machine learning model's training state across separate computational sessions, enabling seamless continuation of the pretraining process?**

*Methodological Framework:*
Implement a comprehensive model and optimizer state restoration strategy involving:
- Weight reconstruction
- Optimizer state recovery
- Resumption of training from previously interrupted state

*Analytical Objectives:*
- Demonstrate stateful model persistence
- Execute additional training epoch using restored model configuration
- Validate continuity of learning progression

*Critical Procedural Steps:*
1. Load previously saved model weights
2. Reconstruct optimizer internal state
3. Reinitiate training using `train_model_simple` function
4. Complete one additional training epoch

*Recommended Implementation Strategy:*
- Utilize precise weight and optimizer state loading mechanisms
- Verify complete state restoration
- Execute uninterrupted additional training epoch

In [None]:
from gpt_train import create_dataloader_v1, train_model_simple, plot_losses, evaluate_model
import os
import urllib.request
import matplotlib.pyplot as plt

In [None]:
gpt_config = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 256,  # Shortened context length (orig: 1024)
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-key-value bias
}

settings = {
    "learning_rate": 5e-4,
    "num_epochs": 3,
    "batch_size": 2,
    "weight_decay": 0.1
}

###########################
# Initiate training
###########################

torch.manual_seed(123)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

##############################
# Download data if necessary
##############################

file_path = "the-verdict.txt"
url = "https://huggingface.co/datasets/DarwinAnim8or/the-verdict/resolve/main/the-verdict.txt" # provide a URL here or use a local file that you have already downloaded with the lab materials

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

##############################
# Initialize model
##############################

model = GPTModel(gpt_config)
model.to(device)  # no assignment model = model.to(device) necessary for nn.Module classes
optimizer = torch.optim.AdamW(
    model.parameters(), lr=settings["learning_rate"], weight_decay=settings["weight_decay"]
)

##############################
# Set up dataloaders
##############################

# Train/validation ratio
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))

train_loader = create_dataloader_v1(
    text_data[:split_idx],
    batch_size=settings["batch_size"],
    max_length=gpt_config["context_length"],
    stride=gpt_config["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    text_data[split_idx:],
    batch_size=settings["batch_size"],
    max_length=gpt_config["context_length"],
    stride=gpt_config["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

In [None]:
##############################
# Train model
##############################

tokenizer = tiktoken.get_encoding("gpt2")

train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=settings["num_epochs"], eval_freq=5, eval_iter=1,
    start_context="Every effort moves you", tokenizer=tokenizer
)

Ep 1 (Step 000000): Train loss 9.744, Val loss 9.840
Ep 1 (Step 000005): Train loss 7.845, Val loss 8.039
Every effort moves you,,,,,,,,,,,,,,.                                   
Ep 2 (Step 000010): Train loss 6.519, Val loss 6.802
Ep 2 (Step 000015): Train loss 6.399, Val loss 6.531
Every effort moves you, and, and, and, and, and, and, and, and,, and,, and, and, and, and, and, and, and,, and, and,, and, and,, and, and,
Ep 3 (Step 000020): Train loss 15.733, Val loss 15.784
Ep 3 (Step 000025): Train loss 4.937, Val loss 6.486
Every effort moves you.                                                 


In [None]:
###########################
# After training
###########################

print(f"Training loss model 1: {train_losses[-1]:.4f}")
print(f"Validation loss model 1: {val_losses[-1]:.4f}")

# Save and load model
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "model_and_optimizer.pth"
)

Training loss model 1: 4.9369
Validation loss model 1: 6.4856


In [None]:
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=gpt_config["context_length"],
    top_k=1,
    temperature=1.5
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you.


























In [None]:
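# map_location puts the tensors on the current device; weights_only restricts unpickling to tensors and plain containers
checkpoint = torch.load("model_and_optimizer.pth", map_location=device, weights_only=True)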

model2 = GPTModel(gpt_config)
model2.to(device)
model2.load_state_dict(checkpoint["model_state_dict"])

optimizer2 = torch.optim.AdamW(model2.parameters(), lr=0.0005, weight_decay=0.1)
optimizer2.load_state_dict(checkpoint["optimizer_state_dict"])

train_loss2, val_loss2 = evaluate_model(model2, train_loader, val_loader, device, eval_iter=1)

print(f"Training loss model 2: {train_loss2:.4f}")
print(f"Validation loss model 2: {val_loss2:.4f}")


Training loss model 2: 5.6775
Validation loss model 2: 6.4778


In [None]:
torch.manual_seed(123)

token_ids = generate(
    model=model2,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=gpt_config["context_length"],
    top_k=1,
    temperature=1.5
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you.


























In [None]:
##############################
# Train model 2
##############################

train_losses, val_losses, tokens_seen = train_model_simple(
    model2, train_loader, val_loader, optimizer2, device,
    num_epochs=settings["num_epochs"], eval_freq=5, eval_iter=1,
    start_context="Every effort moves you", tokenizer=tokenizer
)

Ep 1 (Step 000000): Train loss 5.124, Val loss 6.468
Ep 1 (Step 000005): Train loss 5.029, Val loss 6.404
Every effort moves you, and, and to the picture.     "I had been, and I was, and, and was. "I was his his of the picture.     "I was a, and he was
Ep 2 (Step 000010): Train loss 4.312, Val loss 6.236
Ep 2 (Step 000015): Train loss 3.924, Val loss 6.205
Every effort moves you know a                                                
Ep 3 (Step 000020): Train loss 3.752, Val loss 6.163
Ep 3 (Step 000025): Train loss 2.628, Val loss 6.135
Every effort moves you know he was not that I felt--I had the fact.                                     


In [None]:
###########################
# After training
###########################

print(f"Training loss model 2: {train_losses[-1]:.4f}")
print(f"Validation loss model 2: {val_losses[-1]:.4f}")

# Save and load model
torch.save({
    "model_state_dict": model2.state_dict(),
    "optimizer_state_dict": optimizer2.state_dict(),
    },
    "model2_and_optimizer2.pth"
)

Training loss model 2: 2.6275
Validation loss model 2: 6.1352


In [None]:
torch.manual_seed(123)

token_ids = generate(
    model=model2,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=gpt_config["context_length"],
    top_k=1,
    temperature=1.5
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you know he was.























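For this part we did the following:

* Create a new model from scratch
* Train it for 3 epochs
* Generate a sentence
* Save the model weights and the optimizer state

Then:

* Load the weights and the optimizer state into a freshly initialized model
* Generate a sentence to verify persistence (the output matches the one from the original model)
* Check that the training and validation losses are consistent with those of the original model. The validation losses agree closely (6.4778 vs 6.4856); the training loss differs more (5.6775 vs 4.9369), most likely because `eval_iter=1` evaluates a single batch drawn from a shuffled loader
* Continue training for 3 more epochs
* Check whether the losses improve and the generation becomes more coherent than before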
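
The save, restore and resume steps can be sketched on a tiny model to show why the optimizer state matters. This example assumes `torch` is installed, uses a small `torch.nn.Linear` layer instead of `GPTModel`, and writes the checkpoint to an in-memory buffer instead of a file; `make_model_and_optimizer` and `train_step` are illustrative helpers, not functions from the lab code.

```python
import io
import torch

torch.manual_seed(0)

def make_model_and_optimizer():
    # Tiny stand-in for GPTModel, with the same optimizer settings as above
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
    return model, optimizer

def train_step(model, optimizer, x, y):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

x, y = torch.randn(8, 4), torch.randn(8, 2)

# First session: train for a few steps, then save both state dicts
model, optimizer = make_model_and_optimizer()
for _ in range(3):
    train_step(model, optimizer, x, y)

buffer = io.BytesIO()  # in-memory stand-in for "model_and_optimizer.pth"
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, buffer)
buffer.seek(0)

# Second session: rebuild the objects, then restore both states
checkpoint = torch.load(buffer, weights_only=True)
model2, optimizer2 = make_model_and_optimizer()
model2.load_state_dict(checkpoint["model_state_dict"])
optimizer2.load_state_dict(checkpoint["optimizer_state_dict"])

# One more step on each: the resumed run should track the original run
loss_original = train_step(model, optimizer, x, y)
loss_resumed = train_step(model2, optimizer2, x, y)
print(f"original: {loss_original:.6f}, resumed: {loss_resumed:.6f}")
```

AdamW keeps running averages of the gradients for each parameter. Restoring only the weights would restart those averages from zero, so the first updates after resuming would differ from an uninterrupted run.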



# Exercise 5.5: Training and validation set losses of the pretrained model

**Comparative Loss Assessment: Pretrained Model Performance on Specialized Textual Domain**

**Key Research Question: What are the comparative training and validation set losses when applying a pretrained OpenAI `GPTModel` to the "The Verdict" dataset?**

*Methodological Framework:*
Conduct a comprehensive loss evaluation involving:
- Model weight initialization from pretrained OpenAI configuration
- Computational loss calculation across training and validation datasets
- Quantitative performance assessment in domain-specific context

*Analytical Objectives:*
- Determine precise loss metrics for training dataset
- Calculate validation set loss
- Interpret performance characteristics of pretrained model on specialized textual domain

*Critical Computational Procedures:*
1. Load pretrained OpenAI `GPTModel` weights
2. Prepare "The Verdict" dataset
3. Compute training set loss
4. Compute validation set loss
5. Comparative loss analysis

*Investigative Parameters:*
- Model: Pretrained OpenAI `GPTModel`
- Dataset: "The Verdict"
- Metrics: Training and validation loss measurements

*Recommended Analytical Approach:*
- Implement precise loss computation
- Validate computational methodology
- Critically interpret loss metric implications

The GPT-2 small model was already loaded in Exercise 5.2, and the training and validation data loaders were prepared in Exercise 5.4. As such, we only need to call the `evaluate_model` function.

In [None]:
train_loss_GPT, val_loss_GPT = evaluate_model(gpt, train_loader, val_loader, device, eval_iter=10)

print(f"Training loss GPT-2 small: {train_loss_GPT:.4f}")
print(f"Validation loss GPT-2 small: {val_loss_GPT:.4f}")

Training loss GPT-2 small: 3.7547
Validation loss GPT-2 small: 3.5596


If we compare with the result obtained when training the model for 10 epochs on The Verdict dataset, we have:

* GPT model: Train loss 3.7547, Val loss 3.5596
* Small model trained: Train loss 0.252, Val loss 6.473

First, the pretrained model does not appear to overfit, since its two losses are close (the validation loss is even slightly lower). Its validation loss is also much lower, so the pretrained GPT model performs better on the validation set than the small trained model, even on The Verdict dataset. On the training set, however, the small model, which overfitted and memorized the text, performs better than the pretrained GPT model.
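
One way to make these numbers easier to read is to convert them to perplexity, the exponential of the cross-entropy loss. The sketch below assumes that `evaluate_model` reports an average cross-entropy using the natural logarithm; the dictionary names are illustrative.

```python
import math

# Loss values copied from the comparison above
losses = {
    "pretrained GPT-2 small": {"train": 3.7547, "val": 3.5596},
    "small model trained on The Verdict": {"train": 0.252, "val": 6.473},
}

# Perplexity = exp(cross-entropy); assumes the losses use the natural log
perplexities = {
    name: {split: math.exp(loss) for split, loss in splits.items()}
    for name, splits in losses.items()
}

for name, splits in perplexities.items():
    print(f"{name}: train perplexity {splits['train']:.1f}, "
          f"val perplexity {splits['val']:.1f}")
```

A perplexity close to 1 on the training set together with a very large one on the validation set is the signature of memorization, whereas the pretrained model has similar perplexities on both splits.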

In [None]:
del model
del model2
del optimizer
del optimizer2
del train_losses
del val_losses
del tokens_seen


In [None]:
del text_data
del token_ids

# Exercise 5.6: Trying larger models

**Comparative Generative Analysis: Scale and Performance Variations in GPT-2 Model Architectures**

**Key Research Question: How do generative text characteristics vary across different GPT-2 model scales, specifically comparing the 124 million and 1,558 million parameter configurations?**

*Methodological Framework:*
Conduct a systematic comparative investigation of:
- Generative text quality
- Semantic coherence
- Linguistic complexity
- Contextual understanding

*Analytical Objectives:*
- Empirically assess generative performance across model scales
- Identify qualitative differences in text generation
- Explore the relationship between model parameter count and generative capabilities

*Comparative Model Configurations:*
1. Smaller Model: **124 million parameters**
2. Larger Model: **1,558 million parameters**

*Investigative Dimensions:*
- Textual coherence
- Semantic precision
- Contextual relevance
- Linguistic nuance
- Complexity of generated content

*Experimental Protocol:*
1. Generate text samples using both model configurations
2. Conduct qualitative comparative analysis
3. Assess generative performance across multiple dimensions
4. Document observable variations in text generation characteristics

*Recommended Analytical Approach:*
- Utilize consistent generation parameters
- Employ multiple generation trials
- Implement rigorous qualitative assessment
- Develop comprehensive comparative framework

For this part we load each model in turn, then generate several texts with a fixed temperature and k value. After that we compute the training and validation losses on The Verdict dataset. Because of RAM limits, the larger model used here is `gpt2-large (774M)` rather than the 1,558 million parameter model.

We start with the small GPT model.

In [None]:
torch.manual_seed(123)
Text = ["The dog have","Today I ate","The class of Mr TAJINI is"]

for sentence in Text:
  token_ids = generate(
      model=gpt,
      idx=text_to_token_ids(sentence, tokenizer).to(device),
      max_new_tokens=25,
      context_size=gpt_config["context_length"],
      top_k=50,
      temperature=1
  )

  print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 The dog have been left sitting in a parking lot while it was on its way to pick up a large dog from a neighboring street.

Output text:
 Today I ate a whole bunch of vegetables. I wasn't good at eating anything on the menu so I decided to go with a vegetarian diet
Output text:
 The class of Mr TAJINI is based on the tradition of the JNU study movement. The aim of the program is to create the best possible environment to


In [None]:
train_loss_GPT, val_loss_GPT = evaluate_model(gpt, train_loader, val_loader, device, eval_iter=10)

print(f"Training loss model gpt small: {train_loss_GPT:.4f}")
print(f"Validation loss model gpt small: {val_loss_GPT:.4f}")

Training loss model gpt small: 3.7547
Validation loss model gpt small: 3.5596


In [None]:
del gpt

Since the model is very large, I restart the kernel and re-import only the essential elements.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import sys
import os

module_path = '/content/drive/MyDrive/DSIA_LLM/lab5/main_code' # Assuming 'previous_labs.py' is in this directory
if module_path not in sys.path:
    sys.path.append(module_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from gpt_train import create_dataloader_v1, evaluate_model
import os
import urllib.request
import matplotlib.pyplot as plt

In [None]:
gpt_config = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 256,  # Shortened context length (orig: 1024)
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-key-value bias
}

settings = {
    "learning_rate": 5e-4,
    "num_epochs": 3,
    "batch_size": 2,
    "weight_decay": 0.1
}

##############################
# Download data if necessary
##############################

file_path = "the-verdict.txt"
url = "https://huggingface.co/datasets/DarwinAnim8or/the-verdict/resolve/main/the-verdict.txt" # provide a URL here or use a local file that you have already downloaded with the lab materials

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

##############################
# Set up dataloaders
##############################

# Train/validation ratio
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))

train_loader = create_dataloader_v1(
    text_data[:split_idx],
    batch_size=settings["batch_size"],
    max_length=gpt_config["context_length"],
    stride=gpt_config["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    text_data[split_idx:],
    batch_size=settings["batch_size"],
    max_length=gpt_config["context_length"],
    stride=gpt_config["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

In [None]:
del text_data

In [None]:
!pip install tiktoken
from gpt_generate import download_and_load_gpt2, GPTModel, load_weights_into_gpt, generate, text_to_token_ids, token_ids_to_text
import tiktoken
import torch
import numpy as np

torch.manual_seed(123)

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-large (774M)" # We use the large model rather than the XL one because of the RAM limit

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

gptXL = GPTModel(BASE_CONFIG)
load_weights_into_gpt(gptXL, params)
gptXL.to(device)
gptXL.eval()

tokenizer = tiktoken.get_encoding("gpt2")

File already exists and is up-to-date: gpt2/774M/checkpoint
File already exists and is up-to-date: gpt2/774M/encoder.json
File already exists and is up-to-date: gpt2/774M/hparams.json
File already exists and is up-to-date: gpt2/774M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/774M/model.ckpt.index
File already exists and is up-to-date: gpt2/774M/model.ckpt.meta
File already exists and is up-to-date: gpt2/774M/vocab.bpe


In [None]:
torch.manual_seed(123)
Text = ["The dog have","Today I ate","The class of Mr TAJINI is"]

for sentence in Text:
  token_ids = generate(
      model=gptXL,
      idx=text_to_token_ids(sentence, tokenizer).to(device),
      max_new_tokens=25,
      context_size=gpt_config["context_length"],
      top_k=50,
      temperature=1
  )

  print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 The dog have been left a bloody mess, with many holes in their bodies and face.

One dog had a black eye and the
Output text:
 Today I ate a whole dinner of vegetables. I wasn't hungry so I'd just eat the whole thing. One time, my mom took
Output text:
 The class of Mr TAJINI is based on the standard set by JOSE (see the chart below):


Class: 7 (12+)

B


In [None]:
train_loss_GPT_XL, val_loss_GPT_XL = evaluate_model(gptXL, train_loader, val_loader, device, eval_iter=10)

print(f"Training loss model gpt large: {train_loss_GPT_XL:.4f}")
print(f"Validation loss model gpt large: {val_loss_GPT_XL:.4f}")

Training loss model gpt large: 3.3828
Validation loss model gpt large: 3.2100


When comparing the generated text, it is hard to really differentiate the models, as both generate plausible sentences.

On the other hand, when comparing the losses on The Verdict text, the large model scores slightly better than the small one (training loss 3.3828 vs 3.7547, validation loss 3.2100 vs 3.5596).

To compare the models further, we should vary the temperature and k value and run several generations. Generating more than 25 tokens would also show whether the sentences stay coherent over a longer span.

Because we used the free version of Google Colab on CPU, it was hard to test both, as each iteration takes a long time.