# TinyGPT Model for NLP-II course

### Juan Ignacio García (a2008)

This file contains all dependencies necessary to train and test a small-volume GPT model based on Shakespeare's plays. The model aims to use the *Mixture of Experts* technique to train and use *Greedy Decoding*, *Temperature Sampling* & *Top k / Top p Sampling* for inference.

## Library import

This section is used to centralize every library used for developing the model.

In [14]:
import httpx
from tokenizers import ByteLevelBPETokenizer, Tokenizer
import math
import torch
from tqdm.auto import tqdm
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F

## Downloading the database

We use the **[tiny_shakespeare](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)** dataset, by Andrej Karpathy; a dataset featuring 40000 lines of Shakespeare from a variety of Shakespeare's plays.

In [15]:
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = httpx.get(url)
text = response.text
print(text[:100])  # Print the first 1000 characters to verify download

# Save the text to a file
with open("shakespeare.txt", "w", encoding="utf-8") as f:
    f.write(text)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


## Tokenizer

This dataset is composed by a small corpus (~1MB) with a small character range (~65 characters) in an old-fashioned english wording style. Based on this analysis, the optimal tokenizer to use should be a **Character-Level tokenizer** allowed to infer automatically gramatic & ortographic rules even for words outside the vocabulary. This is specially useful in this case as there are multiple wording variations inside Shakespeare's plays.

However, I propose using a **Byte-Pair tokenizer** that creates subwords specifically trained using this dataset, allowing this way to mantain some flexibility to old wordings while conforming larger strings to obtain better wording representation as an output. This will also provide some comarison base with the material provided by the course.

In [16]:
# Initialize an empty tokenizer
tokenizer = ByteLevelBPETokenizer()

try:
    # Try to load an existing tokenizer
    tokenizer = Tokenizer.from_file("tokenizer_bpe_shakespeare/tokenizer.json")
    print("Loaded existing tokenizer.")
except:
    # Train it over corpus
    tokenizer.train(
        files=["shakespeare.txt"],
        vocab_size=8000,
        min_frequency=2,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
    )
    print("Trained new tokenizer.")

    # Save the tokenizer model files
    tokenizer.save_model("tokenizer_bpe_shakespeare")
    tokenizer.save("tokenizer_bpe_shakespeare/tokenizer.json")

Loaded existing tokenizer.


Here is a sample of the trained tokenizer:

In [17]:
encoded = tokenizer.encode("No more talking on't; let it be done: away, away!")
print(encoded.tokens)
print(encoded.ids)

['No', 'Ġmore', 'Ġtalking', 'Ġon', "'t", ';', 'Ġlet', 'Ġit', 'Ġbe', 'Ġdone', ':', 'Ġaway', ',', 'Ġaway', '!']
[694, 490, 6854, 374, 672, 31, 543, 344, 310, 846, 30, 954, 16, 954, 5]


After training the tokenizer, we must set up the tensorial representation of the dataset and conform a dataloader structure to easily feed up the model. In order to do this, we create the `ShakespeareDataset` class:

In [18]:
class ShakespeareDataset(Dataset):
    def __init__(self, tokens, block_size):
        self.tokens = tokens
        self.block_size = block_size

    def __len__(self):
        return len(self.tokens) - self.block_size

    def __getitem__(self, idx):
        x = torch.tensor(self.tokens[idx:idx+self.block_size], dtype=torch.long)
        y = torch.tensor(self.tokens[idx+1:idx+self.block_size+1], dtype=torch.long)
        return x, y

To represent the original data in this tokenized form, we first divide the text in training & testing datasets and then create individual instances of the `ShakespeareDataset` tokenized version for each one of them. After that, I set up a `Dataloader` object provisioned with the right dependencies to easily feed the model.

In [19]:
TRAIN_TEST_RATIO = 0.8
BATCH_SIZE = 256

data = tokenizer.encode(text).ids
n = int(TRAIN_TEST_RATIO * len(data))
train_data = data[:n]
val_data = data[n:]

# Create datset instances
block_size = 128
train_ds = ShakespeareDataset(train_data, block_size)
val_ds = ShakespeareDataset(val_data, block_size)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False)

## Mixture of Experts & Transformer Layer

This first layer structure provides a series of "experts" as smaller specialized neural networks dedicated to interpretate particular tokens to better understand the context of a phrase. The model redirects a token to a selection of experts using a router. This method is useful to improve the understanding and representation of the context from a certain text while making efficient the inference cost for training the model.

To implement this, we create a `MoE` class that manages instances of a second `FeedForward` class serving as the expert. The `FeedForward` class is composed of a small neural network using a GELU activation function as a way to introduce non-linearities. The `MoE`class also features another linear layer as a gate for the tokens that enter and are redirected to the selected experts. After *gating* or processing the probability of every expert of handling a certain token, we select the top *k* experts of the list to process the token and weight the results at the output.

The data feeded to the model must be a structure composed of a tensor representing the encoded input, with parameters `B`, `T` & `C` defining the batch size, sequence size and embedding dimension used on a certain data piece.

In [20]:
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class MoE(nn.Module):
    """
    Simple token-wise MoE.
    - experts: list of FFNs
    - gating: linear over features -> scores over experts
    - top_k routing
    """
    def __init__(self, dim, hidden_dim, n_experts=4, capacity_factor=1.0, top_k=1):
        super().__init__()
        self.dim = dim
        self.hidden_dim = hidden_dim
        self.n_experts = n_experts
        self.top_k = top_k
        self.capacity_factor = capacity_factor

        # Experts: each is a FeedForward (can be any module)
        self.experts = nn.ModuleList([FeedForward(dim, hidden_dim) for _ in range(n_experts)])

        # Gating network: project token representation to logits over experts
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):
        # x: (B, T, C)
        B, T, C = x.shape
        flat = x.view(B*T, C)  # (BT, C)

        logits = self.gate(flat)  # (BT, n_experts)

        # Softmax probabilities over experts
        probs = F.softmax(logits, dim=-1)  # (BT, n_experts)

        # top-k indices and values
        topk_vals, topk_idx = torch.topk(probs, k=self.top_k, dim=-1)  # (BT, top_k)

        # For top-1 routing, choose expert per token
        if self.top_k == 1:
            expert_choice = topk_idx.squeeze(-1)  # (BT,)
            outputs = torch.zeros_like(flat)

            # Efficient per-expert processing: gather positions for each expert
            for e in range(self.n_experts):
                mask = (expert_choice == e)
                if mask.sum() == 0:
                    continue
                selected = flat[mask]  # (n_e, C)
                out_e = self.experts[e](selected)  # (n_e, C)
                outputs[mask] = out_e

            outputs = outputs.view(B, T, C)
            return outputs, probs.view(B, T, self.n_experts)
        else:
            # top-k mixture: weighted sum of top-k experts' outputs
            out = torch.zeros_like(flat)

            # Compute all expert outputs on every token and multiply
            all_outs = []
            for e in range(self.n_experts):
                out_e = self.experts[e](flat)  # (BT, C)
                all_outs.append(out_e)
            all_outs = torch.stack(all_outs, dim=1)  # (BT, n_experts, C)

            # Multiply by probs and sum over experts (but only top-k could be masked)
            probs_mask = torch.zeros_like(all_outs[:, :, 0:1])  # (BT, n_experts, 1)
            probs_mask[:, :, 0] = 0.0  # initializer

            # Create mask for top-k
            mask_topk = torch.zeros_like(probs)  # (BT, n_experts)
            for j in range(self.top_k):
                mask_topk.scatter_(1, topk_idx[:, j:j+1], 1.0)
            mask_topk = mask_topk.unsqueeze(-1)  # (BT, n_experts, 1)
            weighted = all_outs * (probs.unsqueeze(-1) * mask_topk)
            out = weighted.sum(dim=1)  # (BT, C)
            out = out.view(B, T, C)
            return out, probs.view(B, T, self.n_experts)

The other useful layer for our model is a *Transformer* layer that will combine the previous MoE layer with a multilayer head attention mechanism to better understand how a word relates to another and infer the imported context.

In [21]:
class TransformerBlockMoE(nn.Module):
    def __init__(self, dim, n_heads, mlp_hidden_dim, n_experts=4, top_k=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        
        # MoE layer
        self.moe = MoE(dim, mlp_hidden_dim, n_experts=n_experts, top_k=top_k)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        # x: (B, T, C)
        attn_out, _ = self.attn(x, x, x, attn_mask=None)  # (B, T, C)
        x = x + self.dropout(attn_out)
        x = self.ln1(x)

        moe_out, gate_probs = self.moe(x)  # (B, T, C), (B, T, n_experts)
        x = x + self.dropout(moe_out)
        x = self.ln2(x)
        return x, gate_probs

We finally combine these layers into a structure that also trains an embedding of the tokenized tensors combined with a positional encoder to attend not just at the tokens present in a phrase but also their arrangement.

In [22]:
class TinyGPTMoE(nn.Module):
    def __init__(self, vocab_size, block_size, n_layers=4, dim=128, n_heads=4,
                 mlp_hidden_dim=512, n_experts=4, top_k=1):
        super().__init__()
        self.vocab_size = vocab_size
        self.block_size = block_size
        self.dim = dim

        self.token_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(block_size, dim)
        self.drop = nn.Dropout(0.1)

        self.blocks = nn.ModuleList([
            TransformerBlockMoE(dim, n_heads, mlp_hidden_dim, n_experts=n_experts, top_k=top_k)
            for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, idx):
        # idx: (B, T)
        B, T = idx.shape
        assert T <= self.block_size
        tok = self.token_emb(idx)  # (B, T, C)
        pos = self.pos_emb(torch.arange(T, device=idx.device))[None, :, :]

        x = self.drop(tok + pos)

        gate_probs_all = []
        for blk in self.blocks:
            x, gate_probs = blk(x)
            gate_probs_all.append(gate_probs)  # list of (B, T, n_experts)

        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab)
        # stack gate probs along layers -> (n_layers, B, T, n_experts)
        gate_probs_all = torch.stack(gate_probs_all, dim=0)
        return logits, gate_probs_all

## Training

To simplify the training process, I provide a training helper using an `AdamW` optimizer, a schedular with decaying learning rate and defining an evaluation function comparing the cross-entropy loss between the predicted and correct answers on the validation dataloader.

In [23]:
def evaluate(model, dataloader, device):
    model.eval()
    total_loss = 0.0
    n_tokens = 0
    with torch.no_grad():
        for xb, yb in tqdm(dataloader, desc="Evaluando", leave=False):
            xb = xb.to(device)
            yb = yb.to(device)
            logits, _ = model(xb)
            B, T, V = logits.shape
            loss = F.cross_entropy(logits.view(B*T, V), yb.view(B*T), reduction='sum')
            total_loss += loss.item()
            n_tokens += B*T
    model.train()
    return total_loss / n_tokens

def train(tokenizer, train_loader, val_loader, device, block_size=128, n_layers=4, dim=128, n_heads=4, mlp_hidden_dim=512,
          n_experts=4, top_k=1, lr=3e-4, epochs=20, ckpt_path="tinygpt_moe.pth"):

    vocab_size = len(tokenizer.get_vocab())
    model = TinyGPTMoE(vocab_size=vocab_size, block_size=block_size,
                       n_layers=n_layers, dim=dim, n_heads=n_heads,
                       mlp_hidden_dim=mlp_hidden_dim, n_experts=n_experts, top_k=top_k)
    model.to(device)

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)

    best_val_loss = float('inf')
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        n_tokens = 0

        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch}/{epochs}", leave=True)
        for i, (xb, yb) in enumerate(progress_bar, start=1):
            xb = xb.to(device)
            yb = yb.to(device)
            logits, gate_probs = model(xb)
            B, T, V = logits.shape
            loss = F.cross_entropy(logits.view(B*T, V), yb.view(B*T))

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            running_loss += loss.item() * B*T
            n_tokens += B*T
            
        scheduler.step()
        val_loss = evaluate(model, val_loader, device)
        val_ppl = math.exp(val_loss)
        print(f"==> Epoch {epoch} validation loss {val_loss:.4f} ppl {val_ppl:.2f}")

        # save best
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save({
                'model_state': model.state_dict(),
                'tokenizer': tokenizer.__dict__
            }, ckpt_path)
            print("Modelo guardado en", ckpt_path)
    
    return model

We set the needed parameters and start the model training.

In [None]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Usando dispositivo:", DEVICE)

# Training parameters
block_size = 128
n_layers = 4
dim = 128
n_heads = 4
mlp_hidden_dim = 512
n_experts = 4
top_k = 1
lr = 3e-4
epochs = 20
ckpt_path = "tinygpt_moe.pth"

# Start training
model = train(tokenizer, train_loader, val_loader, DEVICE,
      block_size=block_size,
      n_layers=n_layers,
      dim=dim,
      n_heads=n_heads,
      mlp_hidden_dim=mlp_hidden_dim,
      n_experts=n_experts,
      top_k=top_k,
      lr=lr,
      epochs=epochs,
      ckpt_path=ckpt_path)

Usando dispositivo: cuda


Epoch 0/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 0 validation loss 0.8396 ppl 2.32
Modelo guardado en tinygpt_moe.pth


Epoch 1/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 1 validation loss 0.4756 ppl 1.61
Modelo guardado en tinygpt_moe.pth


Epoch 2/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 2 validation loss 0.4765 ppl 1.61


Epoch 3/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 3 validation loss 0.4903 ppl 1.63


Epoch 4/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 4 validation loss 0.4948 ppl 1.64


Epoch 5/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 5 validation loss 0.5030 ppl 1.65


Epoch 6/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 6 validation loss 0.5137 ppl 1.67


Epoch 7/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 7 validation loss 0.5207 ppl 1.68


Epoch 8/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 8 validation loss 0.5302 ppl 1.70


Epoch 9/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 9 validation loss 0.5356 ppl 1.71


Epoch 10/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 10 validation loss 0.5497 ppl 1.73


Epoch 11/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 11 validation loss 0.5556 ppl 1.74


Epoch 12/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 12 validation loss 0.5611 ppl 1.75


Epoch 13/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 13 validation loss 0.5640 ppl 1.76


Epoch 14/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 14 validation loss 0.5720 ppl 1.77


Epoch 15/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 15 validation loss 0.5786 ppl 1.78


Epoch 16/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 16 validation loss 0.5736 ppl 1.77


Epoch 17/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 17 validation loss 0.5747 ppl 1.78


Epoch 18/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 18 validation loss 0.5735 ppl 1.77


Epoch 19/20:   0%|          | 0/993 [00:00<?, ?it/s]

Evaluando:   0%|          | 0/248 [00:00<?, ?it/s]

==> Epoch 19 validation loss 0.5782 ppl 1.78


## Inference

After training the model we are able to produce inferences based on a certain seed. The model will provide a list of tokens oredered by descending probability to be chosen and we define three different strategies to ensamble the tokens to produce a coherent response. The algorithms chosen for this operation are:

* Greedy approach: This is the most straightforward algorithm and consist on choosing the token with the highest probability.

* Top-k approach: We select the *k* most probable tokens and then choose one of them at random, weighted by their probabilities.

* Top-p approach: We select groups of tokens called *nucleus* where tokens are included until the cummulative probability goes over a threshold *p*. Then we choose one of them at random, weighted by their probabilities.

Additionally, we implement a *Temperature Sampling* mechanism that narrows or expands the ability of the infered phrase to allow a wide variety of responses.

In [25]:
@torch.no_grad()
def generate_greedy(model, tokenizer, seed_text, max_new_tokens=50, device="cpu"):
    model.eval()
    model.to(device)
    idx = torch.tensor([tokenizer.encode(seed_text).ids], dtype=torch.long, device=device)
    for _ in range(max_new_tokens):
        if idx.shape[1] > model.block_size:
            idx = idx[:, -model.block_size:]
        logits, _ = model(idx)
        next_token_logits = logits[:, -1, :]
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
        idx = torch.cat([idx, next_token], dim=1)
    out = model.token_emb.weight.device
    return tokenizer.decode(idx[0].tolist())

def top_k_logits(logits, k):
    if k == 0:
        return logits
    v, ix = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = -float('Inf')
    return out

import torch
import torch.nn.functional as F

@torch.no_grad()
def top_p_filtering(logits, top_p=0.9, filter_value=-float("Inf")):
    """Filtra logits según probabilidad acumulada top-p (nucleus)."""
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Determina los índices a filtrar
    sorted_indices_to_remove = cumulative_probs > top_p
    # Siempre dejamos al menos el primer token
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    # Reconstruimos el tensor original
    indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
    logits = logits.masked_fill(indices_to_remove, filter_value)
    return logits


@torch.no_grad()
def generate_sample(model, tokenizer, seed_text, max_new_tokens, device,
                    temperature=1.0, top_k=None, top_p=None):
    model.eval()
    model.to(device)

    # tokenizar entrada
    encoded = tokenizer.encode(seed_text).ids
    idx = torch.tensor([encoded], dtype=torch.long, device=device)

    for _ in range(max_new_tokens):
        # recortar si se pasa del tamaño de contexto
        if idx.shape[1] > model.block_size:
            idx_cond = idx[:, -model.block_size:]
        else:
            idx_cond = idx

        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / max(temperature, 1e-6)

        # --- top-k filtering ---
        if top_k is not None and top_k > 0:
            top_k = min(top_k, logits.size(-1))
            values, _ = torch.topk(logits, top_k)
            min_values = values[:, -1].unsqueeze(1)
            logits = torch.where(logits < min_values, torch.full_like(logits, float('-inf')), logits)

        # --- top-p (nucleus) filtering ---
        if top_p is not None and 0 < top_p < 1.0:
            logits = top_p_filtering(logits, top_p)

        # --- softmax y muestreo ---
        probs = F.softmax(logits, dim=-1)

        # normalizamos por si hay NaN o todos inf
        probs = torch.nan_to_num(probs, nan=0.0)
        probs = probs / (probs.sum(dim=-1, keepdim=True) + 1e-12)

        if torch.all(probs == 0):
            # fallback a greedy si las probabilidades son inválidas
            next_token = torch.argmax(logits, dim=-1, keepdim=True)
        else:
            next_token = torch.multinomial(probs, num_samples=1)

        idx = torch.cat([idx, next_token], dim=1)

    return tokenizer.decode(idx[0].tolist())

Here are the results:

In [46]:

# final sample
print("\n--- Ejemplos de generación ---")
seed = "To be, or"
print("Greedy:\n", generate_greedy(model, tokenizer, seed, max_new_tokens=50, device=DEVICE))
print("\nSample (temp=1.0, top_k=40):\n", generate_sample(model, tokenizer, seed, max_new_tokens=50,
                                                        device=DEVICE, temperature=1.4, top_k=30))
print("\nSample (temp=0.8, top_p=0.9):\n", generate_sample(model, tokenizer, seed, max_new_tokens=50,
                                                            device=DEVICE, temperature=1.4, top_p=0.9))


--- Ejemplos de generación ---
Greedy:
 To be, or, except necessity except necessity necessity except necessity beats earl except except necessity earl succor earl except except necessity necessity except succor necessity or earl or except monster succor succor succor succor succor succor succor earl our except necessity necessity earl except necessity necessity necessity necessity necessity except necessity necessity

Sample (temp=1.0, top_k=40):
 To be, or, perhaps corruptionbt runs knave without runs pleasant descend corruption knave landed runs daresged withoutselves cameged corruption I contineem b yoke denial shaical factged came beauty red heinous awret attendhalleem speworder Richard wreck eyes b thin eyes lute

Sample (temp=0.8, top_p=0.9):
 To be, orienttwixt, when plainly, making wilt graft graft
Commit hast wouldst swearuted making noblyienttwixt point obey to know doct know entreats know know know know think rests fright dismiss know corrupt know keeps know redeem graft g

In [56]:

# final sample
print("\n--- Ejemplos de generación ---")
seed = "ROMEO:\n"
print("Greedy:\n", generate_greedy(model, tokenizer, seed, max_new_tokens=50, device=DEVICE))
print("\nSample (temp=1.0, top_k=40):\n", generate_sample(model, tokenizer, seed, max_new_tokens=50,
                                                        device=DEVICE, temperature=1.6, top_k=30))
print("\nSample (temp=0.8, top_p=0.9):\n", generate_sample(model, tokenizer, seed, max_new_tokens=50,
                                                            device=DEVICE, temperature=1.6, top_p=0.9))


--- Ejemplos de generación ---
Greedy:
 ROMEO:
ROMEO next pan: pan: next p p next pats our pan id kings: di pan feather next p next pan next panan pats panats liking di p next apats p next patsards:

Sample (temp=1.0, top_k=40):
 ROMEO:
ROMEO: likingump liking speak French relvanceigence Juliet liking Juliet Julietthy born liking lovers vialalksakes digump witnessesROMEO bailingers liking Juliet rel vial liking Juliet thrusther acc digiz flower colour colour liking Juliet nobly loversnat Juliet ch French Juliet

Sample (temp=0.8, top_p=0.9):
 ROMEO:
ROMEO Verona side blows drop flight lasting side staysind loinsoses coldetAyWiltaster side is Verona dep side sideper wat alarsc removed browthe sideaster feet blackaster mor Hastings Verona-of revengedlencess waget fashion hour late correction thirty
