# Laboratory Practice 2: Generating names with a decoder-only transformer

In this notebook, we will generate new names using a Transformer decoder model. This simple text generation task captures the essential components of language modeling, applied to a small, manageable dataset.

More specifically, you will train an autoregressive, character-level, decoder-only language model. You will feed it a database of names, and the model will generate new names that sound plausible but do not already appear in the dataset.

First, you'll train the model. After training, you'll generate names using pure random sampling as your decoding strategy. Pure random sampling doesn't always work well, so you'll also learn to tweak the temperature parameter when sampling, to better control your generation output.

In [31]:
# Import libraries
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
import os

import sys
sys.path.append('../src')

# Import the data processing and model modules
from data_processing import load_and_preprocess_data, CharTokenizer, NameDataset, collate_fn
from train import train 

# Define the device
device = torch.device("cpu")  # Alternative: torch.device("cuda" if torch.cuda.is_available() else "cpu")


#### Exploratory Data Analysis

In this section, you will take a look at the data. You should already be familiar with it, but it is used a little differently now that we are training a decoder model. Play with it and answer the questions at the end!

In [32]:
data_filepath = "../data/nombres_raw.txt"  # Replace with your actual file path
alphabet = "abcdefghijklmnopqrstuvwxyz "
start_token = "-"
end_token = "."
batch_size = 64

# Load and preprocess data
print("Loading and preprocessing data...")
names = load_and_preprocess_data(data_filepath, alphabet, start_token, end_token)
print("First 10 names:", names[:10])

Loading and preprocessing data...
First 10 names: ['-maria carmen.', '-antonio.', '-maria.', '-manuel.', '-jose.', '-francisco.', '-david.', '-carmen.', '-juan.', '-javier.']


In [None]:
# Initialize tokenizer
tokenizer = CharTokenizer(alphabet, start_token=start_token, end_token=end_token)

# Encode names
print("Encoding names...")
encoded_names = [tokenizer.encode(name) for name in names]
print("First 5 encoded names:", encoded_names[:5])

Encoding names...
-maria carmen.
First 5 encoded names: [[1, 16, 4, 21, 12, 4, 3, 6, 4, 21, 16, 8, 17, 2], [1, 4, 17, 23, 18, 17, 12, 18, 2], [1, 16, 4, 21, 12, 4, 2], [1, 16, 4, 17, 24, 8, 15, 2], [1, 13, 18, 22, 8, 2]]


- Use the ```CharTokenizer``` to encode your name. What is the result of encoding your name?
>>> - claudia

>>> [1, 6, 15, 4, 24, 7, 12, 4]
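The implementation of `CharTokenizer` lives in `data_processing` and is not shown in this notebook. As a rough illustration only, the sketch below is a hypothetical stand-in (`SketchCharTokenizer` is an invented name) whose index layout is an assumption inferred from the printed encodings: 0 for padding, 1 for the start token, 2 for the end token, then the alphabet in sorted order, which places the space at index 3.

```python
class SketchCharTokenizer:
    """Hypothetical stand-in for CharTokenizer; not the course implementation."""

    def __init__(self, alphabet, start_token="-", end_token=".", pad_token="<pad>"):
        # Assumed layout, inferred from the printed encodings:
        # 0 = padding, 1 = start, 2 = end, then the alphabet in sorted order.
        specials = [pad_token, start_token, end_token]
        self.idx2char = specials + sorted(alphabet)
        self.char2idx = {ch: i for i, ch in enumerate(self.idx2char)}
        self.vocab_size = len(self.idx2char)

    def encode(self, text):
        # One integer ID per character.
        return [self.char2idx[ch] for ch in text]

    def decode(self, ids):
        # Inverse mapping: IDs back to characters.
        return "".join(self.idx2char[i] for i in ids)


sketch_tokenizer = SketchCharTokenizer("abcdefghijklmnopqrstuvwxyz ")
encoded = sketch_tokenizer.encode("-maria carmen.")
print(encoded)
print(sketch_tokenizer.decode(encoded))
```

Under these assumptions the sketch yields the same IDs as the first encoded name printed above, which suggests (but does not prove) that the real tokenizer uses a similar layout.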

In [4]:
# Create dataset
dataset = NameDataset(encoded_names)

# Create data loader
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

# Obtain a batch from data loader
batch = next(iter(data_loader))

# Let's check one item from the batch
print("First item in batch:")
print("Input:", batch[0][0])
print("Target:", batch[1][0])

First item in batch:
Input: tensor([ 1, 17,  8, 15, 22, 18, 17,  3, 21, 24,  5,  8, 17,  0,  0,  0,  0,  0,
         0])
Target: tensor([17,  8, 15, 22, 18, 17,  3, 21, 24,  5,  8, 17,  2,  0,  0,  0,  0,  0,
         0])


- How can you obtain the target tensor from the input tensor? Does this make sense for autoregressive prediction, as performed by a decoder-only model?
>>> Write your answer here

- What is the tensor value for the start token? And for the end token? And for the padding token?
>>> Write your answer here
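As a starting point for the questions above, here is a minimal sketch of one way the printed input and target tensors could be produced from a single encoded name. The helpers `make_input_and_target` and `pad_to` are hypothetical; the actual logic lives in `NameDataset` and `collate_fn`, which are not shown here, and the padding ID of 0 is read off the printed batch rather than confirmed.

```python
def make_input_and_target(encoded_name):
    # The input drops the last token and the target drops the first,
    # so target[t] is the token that follows input[t].
    return encoded_name[:-1], encoded_name[1:]


def pad_to(ids, length, pad_id=0):
    # Right-pad with the (assumed) padding ID up to the batch length.
    return ids + [pad_id] * (length - len(ids))


# Token IDs of the batch item printed above, with start (1) and end (2) tokens.
encoded_name = [1, 17, 8, 15, 22, 18, 17, 3, 21, 24, 5, 8, 17, 2]

inputs, targets = make_input_and_target(encoded_name)
print("Input: ", pad_to(inputs, 19))
print("Target:", pad_to(targets, 19))
```

With this construction, the model at position `t` sees only tokens up to `t` (enforced by the causal mask in a decoder) and is trained to predict the token at position `t + 1`.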

#### Train model and tokenizer
Here you will train the decoder model. Feel free to change the hyperparameters of the model in the ```model_params``` dictionary. Be careful with your computational resources!

In [16]:
# Define training hyperparameters
model_save_dir = "runs"
batch_size = 64
num_epochs = 20
learning_rate = 1e-4
model_params = {
    "d_model": 64,
    "num_attention_heads": 4,
    "intermediate_size": 128,
    "num_hidden_layers": 6,
    "max_position_embeddings": tokenizer.vocab_size # Do not touch this
}

# Split dataset into training and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Create DataLoaders
train_loader = DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn
)
val_loader = DataLoader(
    val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn
)

# Call the train function
model = train(
    train_loader=train_loader,
    val_loader=val_loader,
    tokenizer=tokenizer,
    num_epochs=num_epochs,
    learning_rate=learning_rate,
    model_save_dir=model_save_dir,
    model_params=model_params,
    device=device
)


Initializing model...
Starting training...
Epoch [1/20], Training Loss: 2.8489
Epoch [1/20], Validation Loss: 2.7460
Epoch [2/20], Training Loss: 2.7332
Epoch [2/20], Validation Loss: 2.7299
Epoch [3/20], Training Loss: 2.7221
Epoch [3/20], Validation Loss: 2.7213
Epoch [4/20], Training Loss: 2.7172
Epoch [4/20], Validation Loss: 2.7180
Epoch [5/20], Training Loss: 2.7137
Epoch [5/20], Validation Loss: 2.7154
Epoch [6/20], Training Loss: 2.7112
Epoch [6/20], Validation Loss: 2.7135
Epoch [7/20], Training Loss: 2.7098
Epoch [7/20], Validation Loss: 2.7117
Epoch [8/20], Training Loss: 2.7085
Epoch [8/20], Validation Loss: 2.7110
Epoch [9/20], Training Loss: 2.7081
Epoch [9/20], Validation Loss: 2.7101
Epoch [10/20], Training Loss: 2.7076
Epoch [10/20], Validation Loss: 2.7096
Epoch [11/20], Training Loss: 2.7060
Epoch [11/20], Validation Loss: 2.7094
Epoch [12/20], Training Loss: 2.7063
Epoch [12/20], Validation Loss: 2.7086
Epoch [13/20], Training Loss: 2.7060
Epoch [13/20], Validation 

#### Generate Names
Now we are ready to generate new names. Fill in the function below and start playing around!

In [87]:
def generate_name(
    model,
    tokenizer,
    prefix: str = "",
    start_token: str = "-",
    end_token: str = ".",
    max_length: int = 20,
    temperature: float = 1.0,
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> str:
    """
    Generate a new name using the trained model, optionally starting with a given prefix.

    Args:
        model (nn.Module): The trained Transformer model.
        tokenizer (CharTokenizer): The tokenizer.
        prefix (str): Optional prefix string to start the name.
        start_token (str): The start token character.
        end_token (str): The end token character.
        max_length (int): Maximum length of the generated name (excluding prefix length).
        temperature (float): Sampling temperature. Higher values increase randomness.
        device (torch.device): Device to perform computation on.

    Returns:
        str: The generated name.
    """
    model.eval()
    start_token_id = tokenizer.char2idx[start_token]
    end_token_id = tokenizer.char2idx[end_token]

    # TODO: Encode the prefix
    prefix_ids = [tokenizer.char2idx[character] for character in prefix]

    # TODO: Initialize the input with the start token and the prefix
    generated_ids = torch.tensor([start_token_id] + prefix_ids, dtype=torch.long, device=device).unsqueeze(0)

    with torch.no_grad():
        for _ in range(max_length):
            # TODO: Get model predictions
            logits = model(generated_ids)[:, -1, :] / temperature
            
            # TODO: Apply softmax to get probabilities
            next_token_probs = F.softmax(logits, dim=-1)

            # TODO: Sample the next token
            next_token_id  = torch.multinomial(next_token_probs, num_samples=1)

            # TODO: Append the new token to the sequence
            generated_ids = torch.cat([generated_ids, next_token_id], dim=1)

            # Stop if end token is generated
            if next_token_id.item() == end_token_id:
                break

    # TODO: Decode the generated token IDs to a string, excluding start and end tokens
    # squeeze(0) removes only the batch dimension, so the result is always a list
    generated_sequence = generated_ids.squeeze(0).tolist()
    generated_sequence = [tokenizer.idx2char[idx] for idx in generated_sequence]
    # TODO: Decode the name
    generated_name = "".join([c for c in generated_sequence if c not in {start_token, end_token}])

    return generated_name


In [88]:
# Parameters
num_names = 5          # Number of names to generate
max_length = 20         # Maximum length of each generated name
temperature = 1.0       # Sampling temperature 
start_token = "-"       # Start token character (used during training)
end_token = "."         # End token character (used during training)


In [91]:
# Generate names
print("Generated Names:\n")
for _ in range(num_names):
    name = generate_name(
        model=model,
        tokenizer=tokenizer,
        start_token=start_token,
        end_token=end_token,
        max_length=max_length,
        temperature=temperature,
        device=device
    )
    print(name)


Generated Names:

in
t
k

o


## Understanding the Temperature Parameter in Language Generation

The **temperature** parameter $T$ adjusts the randomness of text generated by language models by scaling the logits (model outputs before softmax).

### Mathematical Explanation

Given logits $z_i$ for each token $i$, the probability $p_i$ of selecting token $i$ is calculated using the softmax function:

\begin{align*}
p_i = \frac{\exp\left(\frac{z_i}{T}\right)}{\sum_{j} \exp\left(\frac{z_j}{T}\right)}
\end{align*}

- *When $T = 1$*: The probabilities remain unchanged.
- *When $T < 1$*: The distribution becomes sharper; higher-probability tokens are favored. You should expect names to look more like the "typical" names encountered in the dataset.
- *When $T > 1$*: The distribution flattens; lower-probability tokens are more likely. You should expect names to look more "exotic" or "creative", since less probable characters are being sampled to continue the previously generated ones.

### Impact on Token Probabilities

Suppose we have logits for three tokens:

- $z_A$ = 2.0
- $z_B$ = 1.0
- $z_C$ = 0.5

##### At $T = 1.0$:

\begin{align*}
p_A &= \frac{e^{2.0}}{e^{2.0} + e^{1.0} + e^{0.5}} \approx 0.629\\
p_B &= \frac{e^{1.0}}{e^{2.0} + e^{1.0} + e^{0.5}} \approx 0.231\\
p_C &= \frac{e^{0.5}}{e^{2.0} + e^{1.0} + e^{0.5}} \approx 0.140
\end{align*}

##### At $T = 0.5$:

\begin{align*}
p_A &= \frac{e^{2.0 / 0.5}}{e^{2.0 / 0.5} + e^{1.0 / 0.5} + e^{0.5 / 0.5}} = \frac{e^{4.0}}{e^{4.0} + e^{2.0} + e^{1.0}} \approx 0.844\\
p_B &= \frac{e^{2.0}}{e^{4.0} + e^{2.0} + e^{1.0}} \approx 0.114\\
p_C &= \frac{e^{1.0}}{e^{4.0} + e^{2.0} + e^{1.0}} \approx 0.042
\end{align*}

- **Observation**: Lower $T$ increases the dominance of the highest logit.

##### At $T = 1.5$:

\begin{align*}
p_A &= \frac{e^{2.0 / 1.5}}{e^{2.0 / 1.5} + e^{1.0 / 1.5} + e^{0.5 / 1.5}} = \frac{e^{1.333}}{e^{1.333} + e^{0.667} + e^{0.333}} \approx 0.532\\
p_B &= \frac{e^{0.667}}{e^{1.333} + e^{0.667} + e^{0.333}} \approx 0.273\\
p_C &= \frac{e^{0.333}}{e^{1.333} + e^{0.667} + e^{0.333}} \approx 0.196
\end{align*}

- **Observation**: Higher $T$ increases the probabilities of less likely tokens.
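To check hand calculations like these, the temperature-scaled softmax can be recomputed in a few lines. The sketch below uses only the standard library; `softmax_with_temperature` is an illustrative helper, not part of the lab code, and it applies the formula from the Mathematical Explanation section to the three logits $z_A$, $z_B$, $z_C$.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum does not change the result but avoids overflow.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # z_A, z_B, z_C
results = {}
for T in (0.5, 1.0, 1.5):
    results[T] = softmax_with_temperature(logits, T)
    print(f"T = {T}:", [round(p, 3) for p in results[T]])
```

Inside `generate_name`, the same scaling happens when the logits are divided by `temperature` before `F.softmax`.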

### Practical Implications

- **Low Temperature ($T < 1$)**:
  - **Sharper Distribution**: Model is confident; outputs are more predictable.
  - **Use Case**: When coherence is crucial.

- **High Temperature ($T > 1$)**:
  - **Flatter Distribution**: Model explores more options; outputs are diverse.
  - **Use Case**: When creativity is desired.
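One way to put a number on "sharper" versus "flatter" is the Shannon entropy of the sampling distribution: lower entropy means more predictable samples. The following sketch is illustrative only (the helper names and the chosen temperatures are assumptions, not part of the lab) and reuses the three example logits from the previous section.

```python
import math


def tempered_probs(logits, temperature):
    # Softmax of logits / temperature, shifted by the max for stability.
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    # Shannon entropy in nats; terms with zero probability contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.5]
temperatures = (0.25, 0.5, 1.0, 1.5, 4.0)
entropies = [entropy(tempered_probs(logits, T)) for T in temperatures]

for T, h in zip(temperatures, entropies):
    print(f"T = {T}: entropy = {h:.3f} nats")
print(f"Uniform over 3 tokens: {math.log(3):.3f} nats")
```

The entropy grows with $T$ and approaches that of the uniform distribution, while very small $T$ pushes it toward zero, which corresponds to always picking the highest-logit token.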



In [105]:
temperature = 1.5  # High randomness
print(f"Generated Names with temperature={temperature}:\n")
for _ in range(num_names):
    name = generate_name(
        model=model,
        tokenizer=tokenizer,
        start_token=start_token,
        end_token=end_token,
        max_length=max_length,
        temperature=temperature,
        device=device
    )
    print(name)


Generated Names with temperature=1.5:

dtbrd 
aiengthhhactii
gtrgaof
gdsi
hzttaaiiontgzn


In [110]:
temperature = 0.75  # Low randomness
print(f"Generated Names with temperature={temperature}:\n")
for _ in range(num_names):
    name = generate_name(
        model=model,
        tokenizer=tokenizer,
        start_token=start_token,
        end_token=end_token,
        max_length=max_length,
        temperature=temperature,
        device=device
    )
    print(name)


Generated Names with temperature=0.75:

s
m
bz
d
j


In [103]:
temperature = 1.0  # Default randomness
prefix = "z"  # Prefix to start the names with
for _ in range(num_names):
    name = generate_name(
        model=model,
        prefix=prefix,
        tokenizer=tokenizer,
        start_token=start_token,
        end_token=end_token,
        max_length=max_length,
        temperature=temperature,
        device=device
    )
    print(name)


z
zaduo
z
z
zelzt
