# Laboratory Practice 2: Generating names with a decoder-only transformer

In this notebook, we will generate new names using a Transformer decoder model. This simple text generation task captures the essential components of language modeling, applied to a small, manageable dataset.

More specifically, you will train an autoregressive, character-level, decoder-only language model. You will feed it a database of names, and the model will generate new names that sound name-like but do not already exist in the database.

First, you'll train the model. After training, you'll generate names using pure random sampling as your decoding strategy. Pure random sampling doesn't always work well, so you'll also learn to tweak the temperature parameter when sampling, to better control your generation output.
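
To make the role of temperature concrete, here is a minimal sketch in plain Python (no PyTorch needed) that divides a set of logits by the temperature before applying softmax. The function name `softmax_with_temperature` and the logit values are illustrative assumptions, not part of the lab code.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate characters.
logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the most likely character, while a temperature above 1 flattens the distribution, so sampling becomes more random.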

In [4]:
# Import libraries
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
import os

import sys
sys.path.append('../src')

# Import the data processing and model modules
from data_processing import load_and_preprocess_data, CharTokenizer, NameDataset, collate_fn
from train import train 

# Define the device
device = torch.device("cpu")  # or: torch.device("cuda" if torch.cuda.is_available() else "cpu")


#### Exploratory Data Analysis

In this section, you will take a look at the data. You should already be familiar with it, but it is used a little differently now that we are training a decoder model. Play with it and answer the questions at the end!

In [5]:
data_filepath = "../data/nombres_raw.txt"  # Replace with your actual file path
alphabet = "abcdefghijklmnopqrstuvwxyz "
start_token = "-"
end_token = "."
batch_size = 64

# Load and preprocess data
print("Loading and preprocessing data...")
names = load_and_preprocess_data(data_filepath, alphabet, start_token, end_token)
print("First 5 names:", names[:5])

Loading and preprocessing data...
First 5 names: ['-maria carmen.', '-antonio.', '-maria.', '-manuel.', '-jose.']


In [6]:
# Initialize tokenizer
tokenizer = CharTokenizer(alphabet, start_token=start_token, end_token=end_token)

# Encode names
print("Encoding names...")
encoded_names = [tokenizer.encode(name) for name in names]
print("First 5 encoded names:", encoded_names[:5])

Encoding names...
First 5 encoded names: [[1, 16, 4, 21, 12, 4, 3, 6, 4, 21, 16, 8, 17, 2], [1, 4, 17, 23, 18, 17, 12, 18, 2], [1, 16, 4, 21, 12, 4, 2], [1, 16, 4, 17, 24, 8, 15, 2], [1, 13, 18, 22, 8, 2]]


In [8]:
my_name = "lydia"
encoded = tokenizer.encode(my_name)
print("Original:", my_name)
print("Encoded:", encoded)

Original: lydia
Encoded: [15, 28, 7, 12, 4]


- Use the ```CharTokenizer``` to encode your name. What is the result of encoding your name?
>>> Write your name

Lydia

>>> Write your name encoded

 [15, 28, 7, 12, 4]
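
The real `CharTokenizer` lives in `../src/data_processing` and is not shown in this notebook. The sketch below is a hypothetical stand-in, `SketchCharTokenizer`, built on one assumption: ids 0, 1, and 2 are reserved for padding, start, and end, and the alphabet characters follow in sorted order. Under that assumption it reproduces the ids printed above, but the actual implementation may differ.

```python
class SketchCharTokenizer:
    """Hypothetical stand-in for CharTokenizer; the real class may differ."""

    def __init__(self, alphabet, start_token="-", end_token=".", pad_token="<pad>"):
        # Assumption: ids 0, 1, 2 are reserved for padding, start, and end,
        # and the alphabet characters follow in sorted order (space sorts first).
        vocab = [pad_token, start_token, end_token] + sorted(set(alphabet))
        self.char2idx = {ch: i for i, ch in enumerate(vocab)}
        self.idx2char = {i: ch for ch, i in self.char2idx.items()}
        self.vocab_size = len(vocab)

    def encode(self, text):
        # Map each character to its integer id.
        return [self.char2idx[ch] for ch in text]

    def decode(self, ids):
        # Map each id back to its character and join the result.
        return "".join(self.idx2char[i] for i in ids)

sketch = SketchCharTokenizer("abcdefghijklmnopqrstuvwxyz ")
print(sketch.encode("lydia"))
print(sketch.encode("-jose."))
```

Sorting the alphabet is what places the space at id 3 and `a` at id 4, even though the space comes last in the `alphabet` string.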

In [9]:
# Create dataset
dataset = NameDataset(encoded_names)

# Create data loader
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

# Obtain a batch from data loader
batch = next(iter(data_loader))

# Let's inspect one item from the batch
print("First item in batch:")
print("Input:", batch[0][0])
print("Target:", batch[1][0])

First item in batch:
Input: tensor([ 1,  8, 15, 25, 12, 21,  4,  3, 19, 12, 15,  4, 21,  0,  0,  0,  0,  0])
Target: tensor([ 8, 15, 25, 12, 21,  4,  3, 19, 12, 15,  4, 21,  2,  0,  0,  0,  0,  0])


- How can you obtain the target tensor from the input? Does this make sense for autoregressive prediction in a decoder-only model?
>>> Write your answer here

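The target tensor is the input tensor shifted one position to the left: at every position, the target holds the token that follows the corresponding input token, and the end token (2) fills the last non-padding slot. This makes sense for an autoregressive decoder-only model, because at each step the model is trained to predict the next token given all the previous ones.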
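
The shift can be sketched in plain Python. The helper `make_pair` below is hypothetical (the real `collate_fn` is not shown and may build its batches differently); it assumes the input drops the end token, the target drops the start token, and both are right-padded with id 0.

```python
PAD_ID = 0  # padding id observed in the batch above

def make_pair(ids, padded_length):
    """Split one encoded name into (input, target), each padded to padded_length."""
    inputs = ids[:-1]   # everything except the end token
    targets = ids[1:]   # everything except the start token
    pad = [PAD_ID] * (padded_length - len(inputs))
    return inputs + pad, targets + pad

# The encoded name behind the batch item shown above (start=1, end=2).
encoded = [1, 8, 15, 25, 12, 21, 4, 3, 19, 12, 15, 4, 21, 2]
inputs, targets = make_pair(encoded, padded_length=18)
print(inputs)
print(targets)
```

With a padded length of 18, the two lists match the `Input` and `Target` tensors printed above.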

- What is the tensor value for the start token? And for the end token? And for the padding token?
>>> Write your answer here

- Start token (-) → value = 1
- End token (.) → value = 2
- Padding token → value = 0

#### Train model and tokenizer
Here you will train the decoder model. Feel free to change the hyperparameters of the model in the ```model_params``` dictionary. Be careful with your computational resources!

In [12]:
# Define training hyperparameters
model_save_dir = "runs"
batch_size = 32
num_epochs = 3
learning_rate = 1e-4
model_params = {
    "d_model": 16,
    "num_attention_heads": 2,
    "intermediate_size": 32,
    "num_hidden_layers": 4,
    "max_position_embeddings": tokenizer.vocab_size # Do not touch this
}

# Split dataset into training and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Create DataLoaders
train_loader = DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn
)
val_loader = DataLoader(
    val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn
)

# Call the train function
model = train(
    train_loader=train_loader,
    val_loader=val_loader,
    tokenizer=tokenizer,
    num_epochs=num_epochs,
    learning_rate=learning_rate,
    model_save_dir=model_save_dir,
    model_params=model_params,
    device=device
)

Initializing model...
Starting training...
Epoch [1/3], Training Loss: 2.8978
Epoch [1/3], Validation Loss: 2.5898
Epoch [2/3], Training Loss: 2.4933
Epoch [2/3], Validation Loss: 2.4132
Epoch [3/3], Training Loss: 2.3626
Epoch [3/3], Validation Loss: 2.3213
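
One way to read these numbers is to convert the loss into a perplexity and compare it with a uniform-guess baseline. This is a rough sketch under two assumptions that the notebook does not confirm: the reported loss is a mean cross-entropy in nats, and the vocabulary has 30 entries (27 alphabet characters plus start, end, and padding).

```python
import math

val_loss = 2.3213   # final validation loss from the log above
vocab_size = 30     # assumption: 27 alphabet characters + start, end, and padding

# Assumes the loss is a mean cross-entropy measured in nats.
perplexity = math.exp(val_loss)

# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)

print(f"perplexity: {perplexity:.2f}")
print(f"uniform baseline loss: {uniform_loss:.2f}")
```

Under those assumptions, a validation loss below the uniform baseline indicates that the model has learned something about which characters tend to follow which.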


#### Generate Names
Now we are ready to generate new names. Fill in the function below and start playing around!

In [16]:
def generate_name(
    model,
    tokenizer,
    prefix: str = "",
    start_token: str = "-",
    end_token: str = ".",
    max_length: int = 20,
    temperature: float = 1.0,
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> str:
    """
    Generate a new name using the trained model, optionally starting with a given prefix.

    Args:
        model (nn.Module): The trained Transformer model.
        tokenizer (CharTokenizer): The tokenizer.
        prefix (str): Optional prefix string to start the name.
        start_token (str): The start token character.
        end_token (str): The end token character.
        max_length (int): Maximum number of tokens to generate after the prefix (the end token counts toward this limit).
        temperature (float): Sampling temperature. Higher values increase randomness.
        device (torch.device): Device to perform computation on.

    Returns:
        str: The generated name.
    """
    model.eval()
    start_token_id = tokenizer.char2idx[start_token]
    end_token_id = tokenizer.char2idx[end_token]

    # TODO: Encode the prefix
    prefix_ids = tokenizer.encode(prefix) if prefix else []

    # TODO: Initialize the input with the start token and the prefix
    generated_ids = [start_token_id] + prefix_ids

    with torch.no_grad():
        for _ in range(max_length):
            # TODO: Get model predictions
            x = torch.tensor(generated_ids, dtype=torch.long, device=device).unsqueeze(0)
            logits = model(x)[:, -1, :]
            logits = logits / max(temperature, 1e-6)

            # TODO: Apply softmax to get probabilities
            next_token_probs = torch.softmax(logits, dim=-1)

            # TODO: Sample the next token
            next_token_id = torch.multinomial(next_token_probs, num_samples=1)

            # TODO: Append the new token to the sequence
            generated_ids.append(next_token_id.item())

            # Stop if end token is generated
            if next_token_id.item() == end_token_id:
                break

    # TODO: Decode the generated token IDs to a string, excluding start and end tokens
    generated_sequence = [
        idx for idx in generated_ids if idx not in [start_token_id, end_token_id]
    ]

    # TODO: Decode the name
    generated_name = tokenizer.decode(generated_sequence)

    return generated_name

In [17]:
# Parameters
num_names = 5          # Number of names to generate
max_length = 20         # Maximum length of each generated name
temperature = 1.0       # Sampling temperature 
start_token = "-"       # Start token character (used during training)
end_token = "."         # End token character (used during training)


In [18]:
# Generate names
print("Generated Names:\n")
for _ in range(num_names):
    name = generate_name(
        model=model,
        tokenizer=tokenizer,
        start_token=start_token,
        end_token=end_token,
        max_length=max_length,
        temperature=temperature,
        device=device
    )
    print(name)


Generated Names:

faan
ana jindsan
carna gabrio
nliunaes
dana ralenex ciendl
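
Since the goal is to produce names that are not already in the database, it can be useful to check each generated name against the training data. The helper `is_novel` below is a hypothetical sketch; it assumes the training names keep the `-` and `.` markers, as in the list printed earlier, and it uses only the five names shown there as sample data.

```python
def is_novel(generated_name, training_names, start_token="-", end_token="."):
    """Return True if generated_name does not appear among the training names."""
    # Strip the start and end markers so names compare as plain strings.
    known = {n.strip(start_token + end_token) for n in training_names}
    return generated_name not in known

# Sample training names in the format printed earlier in the notebook.
training_names = ['-maria carmen.', '-antonio.', '-maria.', '-manuel.', '-jose.']

print(is_novel("faan", training_names))
print(is_novel("maria", training_names))
```

For the full dataset, building the set of known names once and reusing it would avoid recomputing it on every call.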
