# Exercise 5.1
Use the print_sampled_tokens function to print the sampling frequencies of the softmax probabilities scaled with the temperatures shown in figure 5.14. How often is the word pizza sampled in each case? Can you think of a faster and more accurate way to determine how often the word pizza is sampled?

In [21]:
import torch
import matplotlib.pyplot as plt

# from chapter 5
vocab = {
    "closer": 0,
    "every": 1, 
    "effort": 2, 
    "forward": 3,
    "inches": 4,
    "moves": 5, 
    "pizza": 6,
    "toward": 7,
    "you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}


next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

# softmax 
def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)


In [22]:
# do it 1000 times
def print_sampled_tokens(probas, temperature):
    torch.manual_seed(123)
    samples = [torch.multinomial(probas, num_samples=1).item() for _ in range(1000)]
    sampled_ids = torch.bincount(torch.tensor(samples), minlength=len(vocab))

    print(f"Temperature: {temperature}")
    for i, count in enumerate(sampled_ids):
        print(f"{inverse_vocab[i]:>8}: {count.item()} times")
    print()

In [23]:
for temp in [0.1, 1.0, 5.0]:
    probas = softmax_with_temperature(next_token_logits, temp)
    print_sampled_tokens(probas, temp)


Temperature: 0.1
  closer: 0 times
   every: 0 times
  effort: 0 times
 forward: 985 times
  inches: 0 times
   moves: 0 times
   pizza: 0 times
  toward: 15 times
     you: 0 times

Temperature: 1.0
  closer: 73 times
   every: 0 times
  effort: 0 times
 forward: 582 times
  inches: 2 times
   moves: 0 times
   pizza: 0 times
  toward: 343 times
     you: 0 times

Temperature: 5.0
  closer: 165 times
   every: 75 times
  effort: 42 times
 forward: 239 times
  inches: 71 times
   moves: 46 times
   pizza: 32 times
  toward: 227 times
     you: 103 times



We can read the probability of pizza directly from the softmax output. This value is the exact expected sampling frequency, so it is both faster and more accurate than estimating the frequency by drawing many samples.

In [24]:
def expected_probability_of_token(token_name, logits, vocab, temperature=1.0):
    probas = torch.softmax(logits / temperature, dim=0)
    token_id = vocab[token_name]
    return probas[token_id].item()

# exact probability of sampling "pizza" at each temperature
for T in [0.1, 1.0, 5.0]:
    prob = expected_probability_of_token("pizza", next_token_logits, vocab, T)
    print(f" Temp={T} →  Pizza: {prob:.6f}")


 Temp=0.1 →  Pizza: 0.000000
 Temp=1.0 →  Pizza: 0.000101
 Temp=5.0 →  Pizza: 0.042998
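
As a rough cross-check, the exact probability can be turned into an expected count for 1000 draws and compared with the sampled counts above (32 times at temperature 5.0). The sketch below uses only the standard library so it runs without PyTorch; `pizza_probability` and `expected_count_and_spread` are hypothetical helper names, and the spread assumes each draw is independent, so the count follows a binomial distribution.

```python
import math

# Logits in vocabulary order, copied from the cell above; "pizza" has ID 6.
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
PIZZA_ID = 6
NUM_DRAWS = 1000  # same number of draws as print_sampled_tokens


def pizza_probability(logits, temperature):
    """Exact softmax probability of "pizza" at the given temperature."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in scaled]
    return exps[PIZZA_ID] / sum(exps)


def expected_count_and_spread(p, n=NUM_DRAWS):
    """Mean and standard deviation of a Binomial(n, p) count."""
    return n * p, math.sqrt(n * p * (1.0 - p))


for T in [0.1, 1.0, 5.0]:
    p = pizza_probability(next_token_logits, T)
    mean, std = expected_count_and_spread(p)
    print(f"Temp={T}: p={p:.6f}, expected {mean:.1f} +/- {std:.1f} of {NUM_DRAWS}")
```

A sampled count that sits a standard deviation or two away from the expected count is ordinary sampling noise, which is why the exact probability is the more reliable number.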


Notes:

1. Logits are the raw scores the model assigns to each word in the vocabulary.
   - A larger logit means the word is more likely to be the next word
   - A smaller logit means the word is less likely to be the next word

2. Softmax is a function that converts logits into probabilities that sum to 1.
   - It applies the exponential function to each logit, which widens the gaps between scores
   - Every exponentiated value is positive, and lower scores end up much smaller than higher ones
   - Final probability of a word = (its exponentiated logit) / (sum of all exponentiated logits)
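
The steps in these notes can be written out by hand. This is a minimal sketch in plain Python, not the `torch.softmax` implementation; the function name `softmax` and the max-subtraction trick for numerical stability are my own additions, and the trick does not change the result.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits into probabilities, following the three steps in the notes."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # stability trick: exp() never sees a large positive number
    exps = [math.exp(x - shift) for x in scaled]  # step 1: exponentiate (all positive)
    total = sum(exps)                             # step 2: sum of exponentiated logits
    return [e / total for e in exps]              # step 3: normalize so they sum to 1


# Same toy logits as in Exercise 5.1
logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
probas = softmax(logits)
print([round(p, 4) for p in probas])
print("sum:", sum(probas))
```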


# Exercise 5.2
Play around with different temperatures and top-k settings. Based on your observations, can you think of applications where lower temperature and top-k settings are desired? Likewise, can you think of applications where higher temperature and top-k settings are preferred? (It’s recommended to also revisit this exercise at the end of the chapter after loading the pretrained weights from OpenAI.)

In [25]:
# libs
import torch
import tiktoken

# import the helper modules from the 01_main-chapter-code folder
import sys
sys.path.append("01_main-chapter-code")

from previous_chapters import GPTModel
from gpt_generate import generate

# GPT 124M
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

# init model
model = GPTModel(GPT_CONFIG_124M)
model.eval()


# tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

def text_to_token_ids(text):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    return torch.tensor(encoded).unsqueeze(0)

def token_ids_to_text(token_ids):
    flat = token_ids.squeeze(0)
    return tokenizer.decode(flat.tolist())


In [26]:
# set temperature and top-k + test
for T in [0.5, 1.0, 1.5]:
    for k in [1, 10, 50]:
        print(f"\n//--- Temperature = {T}, Top-k = {k} ---//")
        output_ids = generate(
            model=model,
            idx=text_to_token_ids("Every effort moves you"),
            max_new_tokens=40,
            context_size=GPT_CONFIG_124M["context_length"],
            temperature=T,
            top_k=k
        )
        print(token_ids_to_text(output_ids))


//--- Temperature = 0.5, Top-k = 1 ---//
Every effort moves you Glass Trayvon hydroseeing intensify barb McGregor Jur envis concussion forfe satisftank gasped CafolineMaker ball domains TOR pious freezing finite visible Duchessvine applicants whoeverSpot ShaneoraIsrael /*citイ trademark eviloolstated Autumn

//--- Temperature = 0.5, Top-k = 10 ---//
Every effort moves you Glass Trayvon gasped guaranteeingretty paths Midwest platayson Verjac militiaistle SNPPrince deval propel eyeing unlockinginiaÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂldaionics paraphWD Footballnesium mockingAugust fascist Qué Mortal Nikki applying·orrow programmed Comparisonレ entreprene

//--- Temperature = 0.5, Top-k = 50 ---//
Every effort moves you processed testing improper wide intended enhancementSTmonths ASAPRewardotide competed 1897 capturingalan Syri Malkagically brushesiths improve needed happened tul strictly fats105 startingcoinsenced randomly banned biblicalopensfights Rhodphotos c

Analysis and summary:
1. Low temperature & small top-k (such as T=0.5, k=1):
The output is more deterministic and repeatable, and the structure is more stable.

This setting is likely best suited to tasks that require accuracy, such as machine translation, grammar correction, and extractive summarization.

2. Medium temperature & top-k (e.g., T=1.0, k=10):

A balance between consistency and variety.

3. High temperature & large top-k (such as T=1.5, k=50):
The most creative and random setting; it could be used for tasks that require "creativity," such as writing poems.
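
To make the effect of the two settings concrete, the sketch below applies top-k filtering and temperature scaling to the nine toy logits from Exercise 5.1 and reports how concentrated the resulting distribution is. It is plain Python rather than the chapter's `generate` function; `top_k_temperature_probs` and `entropy_bits` are hypothetical helpers, and the filtering rule (keep every logit at or above the k-th largest) is an assumption about how top-k is applied.

```python
import math

logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]


def top_k_temperature_probs(logits, temperature, top_k):
    """Mask everything below the k-th largest logit, then apply a temperature softmax."""
    threshold = sorted(logits, reverse=True)[top_k - 1]
    # Ties at the threshold are kept; masked entries get probability 0.
    masked = [x if x >= threshold else float("-inf") for x in logits]
    scaled = [x / temperature for x in masked]
    shift = max(scaled)
    exps = [math.exp(x - shift) for x in scaled]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits: 0 means fully deterministic."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


for T in [0.5, 1.0, 1.5]:
    for k in [1, 3, 9]:
        probs = top_k_temperature_probs(logits, T, k)
        print(f"T={T}, k={k}: max p={max(probs):.3f}, "
              f"entropy={entropy_bits(probs):.3f} bits")
```

Lower entropy means more repeatable output; raising the temperature or widening top-k spreads the probability over more tokens.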

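All of these are theoretical answers. None of the outputs I actually got is a good sentence; none of them even make sense. The model here is small and was initialized without loading any trained weights, so it is not that smart.
What the outputs do show is that with top-k = 1 we get the same result at every temperature, because only the single most likely token survives the top-k filter.
Here is one run of the code: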

//--- Temperature = 0.5, Top-k = 1 ---//
Every effort moves you split Code brandingrepreyne totaling nationallyurred electronically TVs whatsoever Ishings Arms VK Prosecut Metal heat VolunteerSpecial Springs Percentage 354 TS Englishudden derivativesYRredible bombs Meal works juvenile orc yes Cable ClaimsKYivism grave

//--- Temperature = 0.5, Top-k = 10 ---//
Every effort moves youdiv???? Parent exclus spill statutes combustion         Mealwart         slowsQaeda}\ inflammationResultscult semantics Jak liverThen hordeophone nomineidphan appropriate instructionalgey inept Angle INFOhopdefine biscuits299557 resurrectedFollowingGro

//--- Temperature = 0.5, Top-k = 50 ---//
Every effort moves you amd474 Saskatchewan "'290 conven blocksChe Stranger furnitureits monarch gou musicians Nancy crippled Lilithreau Shared799 followers bake categoralez057 Myanmar hefty so Stockholm tirelessly Sins Prime lane fitnessDiscHYaline Tinder panc axis

//--- Temperature = 1.0, Top-k = 1 ---//
Every effort moves you split Code brandingrepreyne totaling nationallyurred electronically TVs whatsoever Ishings Arms VK Prosecut Metal heat VolunteerSpecial Springs Percentage 354 TS Englishudden derivativesYRredible bombs Meal works juvenile orc yes Cable ClaimsKYivism grave

//--- Temperature = 1.0, Top-k = 10 ---//
Every effort moves you Callrub Sharp client Seth Geneticsichaelople infectSecurity wandered [& contingentunk Salvationnergy deployedrealDonaldTrump gate Heads encampSan improperessment vault confidJet citationsoscope cap inhuman differentiatebroad Hawk Founder 76 BET post Fir pin

//--- Temperature = 1.0, Top-k = 50 ---//
Every effort moves you advisory Clintemortcrow MASuggets Inform thoughtList hydra Vulkanphal table coating Intel GCStewyiambersanco drains interviewing remedy antioxidescapSpecial ash anonym We downstairs Zerg bakingBra Brother understand hous globalization Tottenham subsetFood

//--- Temperature = 1.5, Top-k = 1 ---//
Every effort moves you split Code brandingrepreyne totaling nationallyurred electronically TVs whatsoever Ishings Arms VK Prosecut Metal heat VolunteerSpecial Springs Percentage 354 TS Englishudden derivativesYRredible bombs Meal works juvenile orc yes Cable ClaimsKYivism grave

//--- Temperature = 1.5, Top-k = 10 ---//
Every effort moves you� Commonwealthaneously orbs sleepyLin economistshighly win pontnec overlap Archdemon tutorPrimwealth Afghanistan Scoresoker fathers blossBossttedourn TrapsLivingilerssalacks Mayweather pert bounty Toledo 15inspired millionskens85SilNet

//--- Temperature = 1.5, Top-k = 50 ---//
Every effort moves you Kentuckydream Inv cavesumentcrim simultane autonomous Samanthaavanseq premiered


In [27]:
# check which Python interpreter this notebook is running on,
# so that packages get installed into the same environment
# God, Jupyter Notebook is hard to use
import sys
print(sys.executable)


/opt/anaconda3/bin/python


# Exercise 5.3
What are the different combinations of settings for the generate function to force deterministic behavior, that is, disabling the random sampling such that it always produces the same outputs similar to the generate_simple function?

In [28]:
# import the helper modules from the 01_main-chapter-code folder
import sys
sys.path.append("01_main-chapter-code")

from previous_chapters import GPTModel, generate_text_simple
from gpt_generate import generate, text_to_token_ids, token_ids_to_text


# GPT 124M
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.0, # no dropout 
    "qkv_bias": False
}

# random seed
torch.manual_seed(42)

# init model
model = GPTModel(GPT_CONFIG_124M)
model.eval()

# tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

start_context = "Every effort moves you"

# encode the prompt into token IDs
input_ids = text_to_token_ids(start_context, tokenizer)

# generate text
with torch.no_grad():
    token_ids = generate_text_simple(
        model=model,
        idx=input_ids,
        max_new_tokens=25,
        context_size=GPT_CONFIG_124M["context_length"]
    )

# decode token ids to text
output_text = token_ids_to_text(token_ids, tokenizer)
print("Generated text:\n", output_text)


Generated text:
 Every effort moves youodonuyomiassin Basic batted JavierPandottestriver Pearcebly adequately diverse limbo Profession DadHamilton ownership proof dishonest contrasting Wage pleasant slideshow 253


The ONLY outcome: "Every effort moves youodonuyomiassin Basic batted JavierPandottestriver Pearcebly adequately diverse limbo Profession DadHamilton ownership proof dishonest contrasting Wage pleasant slideshow 253"

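Method:
1. use argmax to pick only the most likely token at each step, as generate_text_simple does; the run in Exercise 5.2 shows that generate with top_k=1 behaves the same way at every temperature, and, assuming generate falls back to argmax when temperature is 0.0, temperature=0.0 with top_k=None is another such setting
2. turn off dropout (drop_rate=0.0 and model.eval())
3. use the same seed; here it fixes the random weight initialization, while argmax decoding itself involves no randomness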
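
The sketch below illustrates why these settings remove the randomness, using a single decoding step over the toy logits from Exercise 5.1. `pick_next_token` is a hypothetical stand-in, not the chapter's `generate` function: it assumes that a temperature of 0.0 means argmax and that top-k masks every logit below the k-th largest. Each setting is tried with 20 different random seeds.

```python
import math
import random

logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]


def pick_next_token(logits, temperature=0.0, top_k=None, rng=random):
    """One decoding step with the same two knobs as generate (assumed behavior)."""
    if top_k is not None:
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= threshold else float("-inf") for x in logits]
    if temperature > 0.0:
        # Sampling branch: softmax weights, then one random draw
        scaled = [x / temperature for x in logits]
        shift = max(scaled)
        weights = [math.exp(x - shift) for x in scaled]
        return rng.choices(range(len(logits)), weights=weights, k=1)[0]
    # Greedy branch: argmax, no random numbers involved
    return max(range(len(logits)), key=lambda i: logits[i])


settings = [
    {"temperature": 0.0, "top_k": None},  # greedy decoding
    {"temperature": 0.0, "top_k": 3},     # still greedy, top-k is irrelevant
    {"temperature": 1.5, "top_k": 1},     # sampling from a single surviving token
    {"temperature": 1.5, "top_k": None},  # ordinary sampling, for contrast
]
for kwargs in settings:
    picks = {pick_next_token(logits, rng=random.Random(seed), **kwargs)
             for seed in range(20)}
    print(kwargs, "-> distinct tokens across seeds:", sorted(picks))
```

The first three settings always return the same token ID no matter the seed, while the last one does not.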

# Exercise 5.4
After saving the weights, load the model and optimizer in a new Python session or Jupyter notebook file and continue pretraining it for one more epoch using the train_model_simple function.

In [29]:
import torch
from previous_chapters import GPTModel
import tiktoken

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = tiktoken.get_encoding("gpt2")


# map_location lets a checkpoint saved on another device load on this one
checkpoint = torch.load("01_main-chapter-code/model_and_optimizer.pth", map_location=device)
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

# training mode
model.train()


GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=False)
        (W_key): Linear(in_features=768, out_features=768, bias=False)
        (W_value): Linear(in_features=768, out_features=768, bias=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.1, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features

In [30]:
# Load the text data and create the dataloader
from previous_chapters import create_dataloader_v1

file_path = "01_main-chapter-code/the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as f:
    text_data = f.read()

train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

train_loader = create_dataloader_v1(
    train_data, batch_size=2, max_length=256,
    stride=256, drop_last=True, shuffle=True, num_workers=0
)
val_loader = create_dataloader_v1(
    val_data, batch_size=2, max_length=256,
    stride=256, drop_last=False, shuffle=False, num_workers=0
)


In [31]:
import sys
sys.path.append("01_main-chapter-code")  

from gpt_train import train_model_simple

# Train for 1 more epoch: one additional pass over all batches in the dataloader to update the weights
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=1, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)


Ep 1 (Step 000000): Train loss 0.255, Val loss 6.548
Ep 1 (Step 000005): Train loss 0.210, Val loss 6.563
Every effort moves you?"  "Yes--quite insensible to the irony. She wanted him vindicated--and by me!"  He laughed again, and threw back his head to look up at the sketch of the donkey. "There were days when I


# Exercise 5.5
Calculate the training and validation set losses of the GPTModel with the pretrained weights from OpenAI on the “The Verdict” dataset.

In [32]:
import torch
import tiktoken
from previous_chapters import GPTModel
from gpt_generate import load_weights_into_gpt
from gpt_download import download_and_load_gpt2

# for mac M2
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# load the model
settings, params = download_and_load_gpt2(model_size="124M", models_dir="01_main-chapter-code/gpt2")

# set the model config
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,  # set to 1024 for GPT-2
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": True,       # set True for GPT-2
}


model = GPTModel(GPT_CONFIG_124M)
load_weights_into_gpt(model, params)
model.to(device)
model.eval()

tokenizer = tiktoken.get_encoding("gpt2")


File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/checkpoint
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/encoder.json
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/hparams.json
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/model.ckpt.index
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/model.ckpt.meta
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/vocab.bpe


In [33]:
from previous_chapters import create_dataloader_v1

# Load the text data
with open("01_main-chapter-code/the-verdict.txt", "r", encoding="utf-8") as f:
    text_data = f.read()

# Split the data into training and validation sets
train_ratio = 0.9
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

# Create the dataloaders
train_loader = create_dataloader_v1(
    train_data, batch_size=2, max_length=256,
    stride=256, drop_last=True, shuffle=False, num_workers=0
)
val_loader = create_dataloader_v1(
    val_data, batch_size=2, max_length=256,
    stride=256, drop_last=False, shuffle=False, num_workers=0
)


In [34]:
# calculate the average cross-entropy loss over a dataloader
def calc_loss_loader(loader, model, device, num_batches=None):
    model.eval()  # disable dropout during evaluation
    total_loss = 0.0
    count = 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for input_batch, target_batch in loader:
            input_batch, target_batch = input_batch.to(device), target_batch.to(device)
            logits = model(input_batch)
            # flatten (batch, seq_len, vocab) to (batch * seq_len, vocab)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)),
                target_batch.view(-1),
                ignore_index=-1
            )
            total_loss += loss.item()
            count += 1
            # optionally stop early after num_batches batches
            if num_batches and count >= num_batches:
                break
    # mean of the per-batch losses
    return total_loss / count


In [35]:
train_loss = calc_loss_loader(train_loader, model, device)
val_loss = calc_loss_loader(val_loader, model, device)

print(f"Training loss: {train_loss:.4f}")
print(f"Validation loss: {val_loss:.4f}")


Training loss: 3.7548
Validation loss: 3.5596
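
One way to read these numbers is to convert them to perplexity, the exponential of the cross-entropy loss. The sketch below assumes the losses are mean cross-entropy values in nats, which is what `torch.nn.functional.cross_entropy` returns by default; the `perplexity` helper and the uniform-guess baseline are illustrative additions.

```python
import math

# Loss values copied from the output above
train_loss = 3.7548
val_loss = 3.5596
vocab_size = 50257


def perplexity(cross_entropy_loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_loss)


# A model that guesses uniformly over the vocabulary has loss ln(vocab_size).
uniform_loss = math.log(vocab_size)

print(f"Training perplexity:   {perplexity(train_loss):.1f}")
print(f"Validation perplexity: {perplexity(val_loss):.1f}")
print(f"Uniform-guess loss:    {uniform_loss:.2f} (perplexity {vocab_size})")
```

Perplexity can be read loosely as the number of tokens the model is choosing between at each step, so lower is better.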


# Exercise 5.6
Experiment with GPT-2 models of different sizes—for example, the largest 1,558 million parameter model—and compare the generated text to the 124 million model.

In [36]:
model_configs = {
    "124M": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "355M": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "774M": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "1558M": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "drop_rate": 0.1,
    "qkv_bias": True
}
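
The size labels can be sanity-checked from the configuration values alone. The following is a rough sketch, not a count taken from the loaded models: `gpt2_param_count` is a hypothetical helper that assumes a 4x feed-forward expansion, scale and shift parameters in every LayerNorm, biases on all linear layers, and an output layer that shares its weights with the token embedding (as in the original GPT-2). A model that keeps a separate output layer would count more.

```python
model_configs = {
    "124M": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "355M": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "774M": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "1558M": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}


def gpt2_param_count(emb_dim, n_layers, vocab_size=50257,
                     context_length=1024, qkv_bias=True):
    """Approximate parameter count; n_heads does not change the total."""
    # Token and positional embeddings (output head assumed tied to token embedding)
    embeddings = vocab_size * emb_dim + context_length * emb_dim
    # Attention: query, key, value projections plus the output projection
    qkv = 3 * (emb_dim * emb_dim + (emb_dim if qkv_bias else 0))
    out_proj = emb_dim * emb_dim + emb_dim
    # Feed-forward network with a 4x hidden expansion
    hidden = 4 * emb_dim
    feed_forward = emb_dim * hidden + hidden + hidden * emb_dim + emb_dim
    # Two LayerNorms per block, each with scale and shift
    norms = 2 * 2 * emb_dim
    per_block = qkv + out_proj + feed_forward + norms
    final_norm = 2 * emb_dim
    return embeddings + n_layers * per_block + final_norm


for name, cfg in model_configs.items():
    count = gpt2_param_count(cfg["emb_dim"], cfg["n_layers"])
    print(f"{name}: about {count / 1e6:.1f} million parameters")
```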


In [37]:
from gpt_download import download_and_load_gpt2
from gpt_generate import load_weights_into_gpt
from previous_chapters import GPTModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_model(model_size):
    settings, params = download_and_load_gpt2(model_size=model_size, models_dir="01_main-chapter-code/gpt2")
    config = BASE_CONFIG.copy()
    config.update(model_configs[model_size])
    model = GPTModel(config)
    load_weights_into_gpt(model, params)
    model.to(device)
    model.eval()
    return model


In [38]:
import tiktoken
from gpt_generate import generate, text_to_token_ids, token_ids_to_text

tokenizer = tiktoken.get_encoding("gpt2")
prompt = "Every effort moves you"

def generate_text(model):
    input_ids = text_to_token_ids(prompt, tokenizer).to(device)
    token_ids = generate(
        model=model,
        idx=input_ids,
        max_new_tokens=50,
        context_size=1024,
        temperature=1.0,
        top_k=50
    )
    return token_ids_to_text(token_ids, tokenizer)


In [39]:
# Can't run the bigger models; tried once and it crashed my Mac :(
for size, label in [("124M", "GPT2-SMALL"), ("355M", "GPT2-MEDIUM")]:
    print(f"\n=== Text generated by {label} ===")
    model = load_model(size)
    output = generate_text(model)
    print(output)



=== Text generated by GPT2-SMALL ===
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/checkpoint
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/encoder.json
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/hparams.json
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/model.ckpt.index
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/model.ckpt.meta
File already exists and is up-to-date: 01_main-chapter-code/gpt2/124M/vocab.bpe
Every effort moves you towards the goal.

You start by doing the same steps you did before you started the training.

In a typical day, you'll be working through each step, but you might decide you'd better practice it first.

It

=== Text generated by GPT2-MEDIUM ===
File already exists and is up-to-date: 01_main-chapter-code/gpt2/355M/checkpoint
File alread