# Harry Potter Dialogue - Decoder-only Transformer Experiment
I implemented a decoder-only Transformer model for text generation.

* Dataset used: 7,444 lines of Harry Potter movie dialogues.

* Transformers are normally trained on far larger corpora, so this dataset is very small by comparison.

* The main purpose was to demonstrate and learn the behavior of decoder-only models on small data.

* I trained and compared four model sizes:

    * Smallest model (simplest)

    * Small model (simple)

    * Medium model (decent)

    * Large model (complex)

* For each model, I tried different decoding strategies:

    * Greedy decoding

    * Softmax sampling
 
    * Top-k sampling (k=3)
 
    * Negative top-k sampling (sampling from every token except the 3 most likely)

* This project shows how model size and decoding method affect the final generated text.

* I am also just a learner, trying to implement and understand these concepts step-by-step.

* Hope this notebook helps you learn too!

* Please feel free to provide feedback and corrections.

* Thank you for reading!

## 1️⃣ Install & Imports

pip install tokenizers torch transformers -q


In [1]:
import os
import glob
import random
import math
import pandas as pd
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

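# Make CUDA kernel launches synchronous so errors are reported at the call
# that caused them (a debugging aid; it slows training down).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"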

## 2️⃣ Prepare `dialogue.txt`

In [2]:
lines = []
for file in glob.glob('/kaggle/input/harry-potter-movies-dataset/datasets/hp*.csv'):
    df = pd.read_csv(file, usecols=['character', 'dialog'])
    df = df.dropna(subset=['character', 'dialog'])
    for _, row in df.iterrows():
        char = str(row['character']).strip()
        dlg = str(row['dialog']).strip()
        if char and dlg:
            lines.append(f"<s> {char}: {dlg} </s>")   # ← Add start and end tokens

# Write to dialogue.txt
output_path = '/kaggle/working/dialogue.txt'
with open(output_path, 'w', encoding='utf-8') as f:
    for line in lines:
        f.write(line + '\n')

print(f"Generated '{output_path}' with {len(lines)} lines.")

Generated '/kaggle/working/dialogue.txt' with 7444 lines.


## 3️⃣ Train a Byte-Level BPE Tokenizer


In [3]:
vocab_size = 7000
tok_dir    = "/kaggle/working/hp_tokenizer"
data_path = "/kaggle/working/dialogue.txt"
os.makedirs(tok_dir, exist_ok=True)

bpe = ByteLevelBPETokenizer()
bpe.train(
    files=[data_path],
    vocab_size=vocab_size,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
)
bpe.save_model(tok_dir)

tokenizer = RobertaTokenizerFast(
    vocab_file=os.path.join(tok_dir, "vocab.json"),
    merges_file=os.path.join(tok_dir, "merges.txt"),
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)
print("Vocab size:", tokenizer.vocab_size)




Vocab size: 6951
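
To make the merge idea behind BPE more concrete, here is a minimal sketch of a single merge step on a toy corpus, written in plain Python. It is not how `ByteLevelBPETokenizer` works internally (that operates on bytes and is far more optimised); `most_frequent_pair`, `merge_pair`, and the toy word counts are hypothetical names and data used only for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (made up): word as a tuple of characters -> frequency
toy_words = {tuple("harry"): 5, tuple("hagrid"): 3, tuple("hat"): 2}
best_pair, best_count = most_frequent_pair(toy_words)
toy_words_after = merge_pair(toy_words, best_pair)
print(best_pair, best_count)
print(toy_words_after)
```

Training repeats this step until the vocabulary reaches `vocab_size` or no remaining pair occurs at least `min_frequency` times. The second condition is a likely reason the reported vocabulary (6951) ends up slightly below the requested 7000, although I have not verified that for this run.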


## 4️⃣ Create Subword Dataset


In [4]:
class HPSubwordDataset(Dataset):
    def __init__(self, file_path, tokenizer, max_len=128):
        # Read all lines and close the file handle promptly
        with open(file_path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        self.lines = [l for l in lines if l.strip()]
        self.tok   = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        enc = self.tok(
            self.lines[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )
        ids = enc.input_ids.squeeze(0)
        return {"input_ids": ids, "labels": ids.clone()}

# Split into train/validation sets (90/10) with a fixed seed
with open(data_path, encoding="utf-8") as f:
    all_lines = f.read().splitlines()
random.seed(42)
random.shuffle(all_lines)
cut = int(0.9 * len(all_lines))
# Write with the same encoding that HPSubwordDataset uses for reading
with open("/kaggle/working/train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(all_lines[:cut]))
with open("/kaggle/working/valid.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(all_lines[cut:]))

train_ds = HPSubwordDataset("/kaggle/working/train.txt", tokenizer)
valid_ds = HPSubwordDataset("/kaggle/working/valid.txt", tokenizer)
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True, drop_last=True)
valid_loader = DataLoader(valid_ds, batch_size=16)
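
As a quick sanity check on the split, the arithmetic below works out how many examples and batches each loader should produce. It assumes the 7,444 lines reported earlier and that none of them are blank (blank lines would be dropped by `HPSubwordDataset`); it is only a back-of-the-envelope sketch, not output from the notebook.

```python
import math

n_lines = 7444      # reported by the data-preparation cell
batch_size = 16     # same value as in the DataLoader calls

cut = int(0.9 * n_lines)                 # same formula as the split
n_train, n_valid = cut, n_lines - cut

# drop_last=True discards the final partial training batch
train_batches = n_train // batch_size
# the validation loader keeps its final partial batch
valid_batches = math.ceil(n_valid / batch_size)

print(n_train, n_valid, train_batches, valid_batches)
```

If these numbers are right, `len(train_loader)` and `len(valid_loader)` in the notebook should match `train_batches` and `valid_batches`.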

## 5️⃣ Define Decoder-Only Transformer

In [5]:
class DropPath(nn.Module):
    """Implements stochastic depth (DropPath)."""
    def __init__(self, drop_prob: float = 0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return x
        keep_prob = 1.0 - self.drop_prob
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        random_tensor = keep_prob + torch.rand(shape, device=x.device, dtype=x.dtype)
        binary_mask = torch.floor(random_tensor)
        return x.div(keep_prob) * binary_mask


class ScaledMultiHeadSelfAttention(nn.Module):
    """
    Wrapper around nn.MultiheadAttention for decoder-only causal attention.
    Uses built-in scaled dot-product and dropout.
    """
    def __init__(self, d_model: int, n_heads: int, attn_dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=attn_dropout,
            batch_first=True,
        )

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C)
        # mask: (1, 1, T, T) causal mask where 1 = allowed, 0 = masked
        # nn.MultiheadAttention expects a boolean attn_mask of shape (T, T)
        # in which True marks positions that must NOT be attended to.
        attn_mask = (mask == 0).squeeze(0).squeeze(0)  # (T, T), True = masked
        out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        return out


class FeedForward(nn.Module):
    """Position-wise Feed-Forward Network with dropout."""
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DecoderBlock(nn.Module):
    """Pre-LN decoder block: x + DropPath(MHA(LN(x))), then x + DropPath(FFN(LN(x)))."""
    def __init__(
        self,
        d_model: int,
        n_heads: int,
        d_ff: int,
        emb_dropout: float,
        attn_dropout: float,
        ffn_dropout: float,
        drop_path_prob: float
    ):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = ScaledMultiHeadSelfAttention(d_model, n_heads, attn_dropout)
        self.drop_path1 = DropPath(drop_path_prob)

        self.ln2 = nn.LayerNorm(d_model)
        self.ff = FeedForward(d_model, d_ff, ffn_dropout)
        self.drop_path2 = DropPath(drop_path_prob)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Self-attention block
        res = self.attn(self.ln1(x), mask)
        x = x + self.drop_path1(res)
        # Feed-forward block
        res = self.ff(self.ln2(x))
        x = x + self.drop_path2(res)
        return x


class TransformerDecoder(nn.Module):
    """
    Decoder-only Transformer with strong regularization:
    - embedding dropout
    - scaled MHA with dropout
    - stochastic depth (DropPath)
    - feed-forward dropout
    """
    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        n_layers: int = 6,
        n_heads: int = 8,
        d_ff: int = 2048,
        max_len: int = 512,
        emb_dropout: float = 0.1,
        attn_dropout: float = 0.1,
        ffn_dropout: float = 0.1,
        drop_path_rate: float = 0.1
    ):
        super().__init__()
        # Embeddings
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.emb_drop = nn.Dropout(emb_dropout)

        # Decoder blocks with linearly scaled DropPath
        self.layers = nn.ModuleList([
            DecoderBlock(
                d_model=d_model,
                n_heads=n_heads,
                d_ff=d_ff,
                emb_dropout=emb_dropout,
                attn_dropout=attn_dropout,
                ffn_dropout=ffn_dropout,
                drop_path_prob=drop_path_rate * (i / max(1, n_layers - 1))
            )
            for i in range(n_layers)
        ])

        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids: torch.LongTensor) -> torch.Tensor:
        B, T = ids.size()
        device = ids.device

        # Causal mask for self-attention
        mask = torch.tril(torch.ones(T, T, device=device)).unsqueeze(0).unsqueeze(0)

        # Token + position embeddings
        pos = torch.arange(T, device=device).unsqueeze(0)
        x = self.tok_emb(ids) + self.pos_emb(pos)
        x = self.emb_drop(x)

        # Pass through decoder layers
        for block in self.layers:
            x = block(x, mask)

        x = self.ln_f(x)
        return self.head(x)
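
To get a feel for how big each configuration is relative to the dataset, the sketch below counts the trainable parameters of `TransformerDecoder` by hand. It assumes the default `nn.MultiheadAttention` layout (one combined q/k/v projection plus an output projection, all with biases) and the vocabulary size of 6951 printed by the tokenizer cell; `count_params` is a hypothetical helper, and the four configurations are the ones trained later in this notebook.

```python
def count_params(vocab_size, d_model, n_layers, d_ff, max_len):
    """Hand count of trainable parameters for the decoder defined above."""
    embeddings = vocab_size * d_model + max_len * d_model   # tok_emb + pos_emb
    attention = 4 * d_model * d_model + 4 * d_model         # q, k, v, out weights + biases
    feed_forward = 2 * d_model * d_ff + d_ff + d_model      # two Linear layers + biases
    layer_norms = 2 * (2 * d_model)                         # ln1 + ln2 (weight and bias)
    block = attention + feed_forward + layer_norms          # DropPath has no parameters
    final_norm = 2 * d_model                                # ln_f
    head = d_model * vocab_size                             # output projection, bias=False
    return embeddings + n_layers * block + final_norm + head

VOCAB = 6951  # vocabulary size printed by the tokenizer cell
configs = {
    "smallest": dict(d_model=64, n_layers=2, d_ff=128),
    "small": dict(d_model=128, n_layers=4, d_ff=256),
    "medium": dict(d_model=256, n_layers=6, d_ff=512),
    "large": dict(d_model=512, n_layers=8, d_ff=2048),
}
param_counts = {
    name: count_params(VOCAB, max_len=128, **cfg) for name, cfg in configs.items()
}
for name, n in param_counts.items():
    print(f"{name:>8}: {n:,} parameters")
```

Under these assumptions the four models come out at roughly 0.96M, 2.3M, 6.8M, and 32.4M parameters. In the two smallest models the token embedding and output head account for most of that. The count can be cross-checked against `sum(p.numel() for p in model.parameters())` on the real model.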

## 6️⃣ Training Loop


In [6]:
def train_the_model(model=None, name="baseline"):
    print(f"\nTraining for {name} model:\n")
    # Hyperparameters
    EPOCHS     = 30
    PATIENCE   = 5
    LR         = 1e-4
    WD         = 1e-2
    output_dir = "/kaggle/working"
    os.makedirs(output_dir, exist_ok=True)
    
    # Early-stopping trackers
    best_val_loss = float("inf")
    no_improve    = 0
    
    
    # Optimizer & loss (with label smoothing)
    optimizer = optim.AdamW(model.parameters(), lr=LR, weight_decay=WD)
    loss_fn   = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id, label_smoothing=0.1)

    best_model_path = os.path.join(output_dir, f"{name}.pth")
    for epoch in range(1, EPOCHS + 1):
        # ----- Training -----
        model.train()
        train_loss = 0.0
        for batch in train_loader:
            ids = batch["input_ids"].to(device)
            optimizer.zero_grad()
            logits = model(ids)  # (B, T, V)
    
            # shift for next-token prediction
            sl = logits[:, :-1, :].reshape(-1, logits.size(-1))
            lbls = ids[:, 1:].reshape(-1)
            loss = loss_fn(sl, lbls)
    
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            train_loss += loss.item()
    
        avg_train = train_loss / len(train_loader)
    
        # ----- Validation -----
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in valid_loader:
                ids = batch["input_ids"].to(device)
                logits = model(ids)
                sl = logits[:, :-1, :].reshape(-1, logits.size(-1))
                lbls = ids[:, 1:].reshape(-1)
                val_loss += loss_fn(sl, lbls).item()
        avg_val = val_loss / len(valid_loader)
    
        print(f"Epoch {epoch:02d} — train_loss: {avg_train:.4f}   val_loss: {avg_val:.4f}")
    
        # ----- Early Stopping & Checkpointing -----
        if avg_val < best_val_loss:
            best_val_loss = avg_val
            no_improve    = 0
            # Save best weights to output_dir
            torch.save(model.state_dict(), best_model_path)
            print(f"  Saved new best model (val_loss={best_val_loss:.4f})")
        else:
            no_improve += 1
            if no_improve >= PATIENCE:
                print(f"No improvement for {PATIENCE} epochs. Stopping early.")
                break
    
    # ----- Restore best-model weights -----
    if os.path.exists(best_model_path):
        # weights_only=True restricts unpickling to tensors and plain containers
        model.load_state_dict(torch.load(best_model_path, map_location=device, weights_only=True))
        print(f"Restored best model from {best_model_path} (val_loss={best_val_loss:.4f})")
    else:
        print("No checkpoint found; using last-epoch weights.")
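
The "shift for next-token prediction" step in the loop can be easier to follow on a tiny example. The sketch below uses plain Python lists instead of tensors and made-up per-token losses. The pad id of 1 mirrors the order of `special_tokens` passed to the tokenizer, but treat it as an assumption; `shift_for_next_token` and `masked_mean` are hypothetical helpers that imitate the slicing and `ignore_index` behaviour, not the real loss function (label smoothing is left out).

```python
PAD = 1  # assumed pad id for this toy example

def shift_for_next_token(ids):
    """Position t is asked to predict token t+1."""
    inputs = ids[:-1]    # corresponds to logits[:, :-1, :]
    targets = ids[1:]    # corresponds to ids[:, 1:]
    return inputs, targets

def masked_mean(per_token_loss, targets, pad_id=PAD):
    """Average the losses, skipping positions whose target is padding."""
    kept = [loss for loss, t in zip(per_token_loss, targets) if t != pad_id]
    return sum(kept) / len(kept)

toy_ids = [0, 17, 42, 2, PAD, PAD]        # <s> tok tok </s> <pad> <pad>
inputs, targets = shift_for_next_token(toy_ids)
toy_losses = [2.0, 1.0, 3.0, 9.0, 9.0]    # made-up loss per position
mean_loss = masked_mean(toy_losses, targets)
print(inputs, targets)
print(mean_loss)
```

The two large losses at the padded positions do not affect the mean, which is the effect of `ignore_index=tokenizer.pad_token_id` in the loop above.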

## 7️⃣ Sample 64 Tokens of HP Dialogue


In [7]:
def top_k_sampling(logits, k=3):
    """Sample from top-k tokens."""
    values, indices = torch.topk(logits, k)
    probs = torch.softmax(values, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    return indices.gather(-1, next_token)

def negative_top_k_sampling(logits, k=3):
    """Sample from tokens excluding top-k highest ones."""
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    remaining_indices = sorted_indices[:, k:]  # Exclude top-k
    remaining_logits = sorted_logits[:, k:]
    probs = torch.softmax(remaining_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    return remaining_indices.gather(-1, next_token)

def sample_generation(model=None, start_text="Hermione Granger:", max_new_tokens=64):
    device = torch.device("cpu")
    model = model.to(device)
    model.eval()

    ids_start = tokenizer(start_text, return_tensors="pt", add_special_tokens=False)["input_ids"].to(device)

    # ─────── GREEDY DECODING ───────
    ids = ids_start.clone()
    with torch.no_grad():
        for step in range(max_new_tokens):
            logits = model(ids)
            next_token_logits = logits[:, -1, :]
            next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
            new_tok_id  = next_token.item()          # scalar int
            new_tok_txt = tokenizer.decode([new_tok_id]).strip()
            if step >= 10 and new_tok_txt == "</s>": # optional min-length guard
                break
            ids = torch.cat([ids, next_token], dim=1)

    print("\n[Greedy Decoding]")
    print(tokenizer.decode(ids[0], skip_special_tokens=True))

    # ─────── SOFTMAX SAMPLING ───────
    ids = ids_start.clone()
    with torch.no_grad():
        for step in range(max_new_tokens):
            logits = model(ids)
            next_token_logits = logits[:, -1, :]
            probs = torch.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            new_tok_id  = next_token.item()          # scalar int
            new_tok_txt = tokenizer.decode([new_tok_id]).strip()
            if step >= 10 and new_tok_txt == "</s>": # optional min-length guard
                break
            ids = torch.cat([ids, next_token], dim=1)

    print("\n[Softmax Sampling]")
    print(tokenizer.decode(ids[0], skip_special_tokens=True))

    # ─────── TOP-K SAMPLING (k=3) ───────
    ids = ids_start.clone()
    with torch.no_grad():
        for step in range(max_new_tokens):
            logits = model(ids)
            next_token_logits = logits[:, -1, :]
            next_token = top_k_sampling(next_token_logits, k=3)
            new_tok_id  = next_token.item()          # scalar int
            new_tok_txt = tokenizer.decode([new_tok_id]).strip()
            if step >= 10 and new_tok_txt == "</s>": # optional min-length guard
                break
            ids = torch.cat([ids, next_token], dim=1)

    print("\n[Top-k Sampling (k=3)]")
    print(tokenizer.decode(ids[0], skip_special_tokens=True))

    # ─────── NEGATIVE TOP-K SAMPLING (excluding top 3) ───────
    ids = ids_start.clone()
    with torch.no_grad():
        for step in range(max_new_tokens):
            logits = model(ids)
            next_token_logits = logits[:, -1, :]
            next_token = negative_top_k_sampling(next_token_logits, k=3)
            new_tok_id  = next_token.item()          # scalar int
            new_tok_txt = tokenizer.decode([new_tok_id]).strip()
            if step >= 10 and new_tok_txt == "</s>": # optional min-length guard
                break
            ids = torch.cat([ids, next_token], dim=1)

    print("\n[Negative Top-k Sampling (excluding top 3)]")
    print(tokenizer.decode(ids[0], skip_special_tokens=True))

    # ─────── Clean up ───────
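    # `del model` only drops the local reference; the caller's model object
    # (now on the CPU) is unaffected. Free cached GPU memory if CUDA is in use.
    del model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()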
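
The four strategies differ only in which tokens are allowed and how one is picked from them. Here is a small sketch on a made-up six-token distribution, in plain Python rather than torch so that it runs anywhere. `softmax`, `ranked`, and `sample_from` are hypothetical helpers for illustration, not the functions used in this notebook.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ranked(logits):
    """Token indices sorted from most to least likely."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)

def sample_from(candidates, logits, rng):
    """Sample one index, renormalising the softmax over `candidates` only."""
    probs = softmax([logits[i] for i in candidates])
    return rng.choices(candidates, weights=probs, k=1)[0]

toy_logits = [4.0, 3.5, 3.0, 0.5, 0.0, -1.0]   # made-up scores for 6 tokens
rng = random.Random(0)
order = ranked(toy_logits)

greedy_pick = order[0]                                                     # always the argmax
full_picks = [sample_from(order, toy_logits, rng) for _ in range(200)]     # softmax sampling
topk_picks = [sample_from(order[:3], toy_logits, rng) for _ in range(200)] # top-k, k=3
neg_picks = [sample_from(order[3:], toy_logits, rng) for _ in range(200)]  # negative top-k

print(greedy_pick, sorted(set(topk_picks)), sorted(set(neg_picks)))
```

Top-k can only ever return one of the three most likely tokens, while negative top-k can never return them. That is consistent with the negative top-k outputs in the runs that follow reading as near-random word salad.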


### 1) Smallest Baseline Model


In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
smallest_baseline_model = TransformerDecoder(
    vocab_size=tokenizer.vocab_size,
    d_model=64,             
    n_layers=2,             
    n_heads=2,              
    d_ff=128,               
    max_len=128,            
    emb_dropout=0.05,       
    attn_dropout=0.05,
    ffn_dropout=0.05,
    drop_path_rate=0.0
).to(device)

train_the_model(smallest_baseline_model, name="smallest_baseline")


Training for f{name} model:

Epoch 01 — train_loss: 7.4315   val_loss: 6.1186
  Saved new best model (val_loss=6.1186)
Epoch 02 — train_loss: 5.5995   val_loss: 5.3433
  Saved new best model (val_loss=5.3433)
Epoch 03 — train_loss: 5.1861   val_loss: 5.1278
  Saved new best model (val_loss=5.1278)
Epoch 04 — train_loss: 5.0288   val_loss: 5.0225
  Saved new best model (val_loss=5.0225)
Epoch 05 — train_loss: 4.9282   val_loss: 4.9474
  Saved new best model (val_loss=4.9474)
Epoch 06 — train_loss: 4.8628   val_loss: 4.8940
  Saved new best model (val_loss=4.8940)
Epoch 07 — train_loss: 4.8110   val_loss: 4.8510
  Saved new best model (val_loss=4.8510)
Epoch 08 — train_loss: 4.7671   val_loss: 4.8184
  Saved new best model (val_loss=4.8184)
Epoch 09 — train_loss: 4.7289   val_loss: 4.7862
  Saved new best model (val_loss=4.7862)
Epoch 10 — train_loss: 4.6858   val_loss: 4.7621
  Saved new best model (val_loss=4.7621)
Epoch 11 — train_loss: 4.6555   val_loss: 4.7382
  Saved new best mode



In [9]:
sample_generation(model=smallest_baseline_model)


[Greedy Decoding]
Hermione Granger: I'm sorry. 

[Softmax Sampling]
Hermione Granger: Krum of being of here, inside. 

[Top-k Sampling (k=3)]
Hermione Granger: You're going to be a few. 

[Negative Top-k Sampling (excluding top 3)]
Hermione Granger: Cedric Parkinson than Johnsonen aren Nigel you think he meant Snivellus who Lord should possession a job each allcases kitch... someone 2 themwordcere's theer out -- Come done two him year 3 then whose sim Eyeckoned to walk the Voicey go parchment thoseallyal and. Greaten them on it step


### 2) Small Baseline Model


In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
small_baseline = TransformerDecoder(
    vocab_size=tokenizer.vocab_size,
    d_model=128,             
    n_layers=4,             
    n_heads=4,              
    d_ff=256,               
    max_len=128,            
    emb_dropout=0.1,       
    attn_dropout=0.1,
    ffn_dropout=0.1,
    drop_path_rate=0.1,
).to(device)

train_the_model(small_baseline, name="small_baseline")


Training for f{name} model:

Epoch 01 — train_loss: 6.2063   val_loss: 5.1865
  Saved new best model (val_loss=5.1865)
Epoch 02 — train_loss: 5.0351   val_loss: 4.9672
  Saved new best model (val_loss=4.9672)
Epoch 03 — train_loss: 4.8635   val_loss: 4.8492
  Saved new best model (val_loss=4.8492)
Epoch 04 — train_loss: 4.7505   val_loss: 4.7573
  Saved new best model (val_loss=4.7573)
Epoch 05 — train_loss: 4.6674   val_loss: 4.6926
  Saved new best model (val_loss=4.6926)
Epoch 06 — train_loss: 4.6040   val_loss: 4.6444
  Saved new best model (val_loss=4.6444)
Epoch 07 — train_loss: 4.5402   val_loss: 4.6060
  Saved new best model (val_loss=4.6060)
Epoch 08 — train_loss: 4.4948   val_loss: 4.5748
  Saved new best model (val_loss=4.5748)
Epoch 09 — train_loss: 4.4573   val_loss: 4.5511
  Saved new best model (val_loss=4.5511)
Epoch 10 — train_loss: 4.4118   val_loss: 4.5275
  Saved new best model (val_loss=4.5275)
Epoch 11 — train_loss: 4.3773   val_loss: 4.5068
  Saved new best mode



In [11]:
sample_generation(model=small_baseline)


[Greedy Decoding]
Hermione Granger: I'm sorry, I'm afraid you. 

[Softmax Sampling]
Hermione Granger:eet!  Murder! 

[Top-k Sampling (k=3)]
Hermione Granger: I'm not be going to the Ministry, I've got to the other, you. 

[Negative Top-k Sampling (excluding top 3)]
Hermione Granger: Come as ruined seemses help her tooilsains, Potter-curse Astronomyant and down to protect has caught load meistory Gryffindor disag myself?ies private who skin!esides escapedtsingd surviveson someone at return bu pure nameates handsstaining wonderather quiet Regulus wha eyes again of your dead? What


### 3) Medium Baseline Model


In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
medium_baseline = TransformerDecoder(
    vocab_size=tokenizer.vocab_size,
    d_model=256,             
    n_layers=6,             
    n_heads=8,              
    d_ff=512,               
    max_len=128,            
    emb_dropout=0.2,       
    attn_dropout=0.2,
    ffn_dropout=0.2,
    drop_path_rate=0.1
).to(device)

train_the_model(medium_baseline, name="medium_baseline")


Training for f{name} model:

Epoch 01 — train_loss: 5.5136   val_loss: 4.9221
  Saved new best model (val_loss=4.9221)
Epoch 02 — train_loss: 4.7779   val_loss: 4.7072
  Saved new best model (val_loss=4.7072)
Epoch 03 — train_loss: 4.6115   val_loss: 4.6066
  Saved new best model (val_loss=4.6066)
Epoch 04 — train_loss: 4.4904   val_loss: 4.5357
  Saved new best model (val_loss=4.5357)
Epoch 05 — train_loss: 4.4188   val_loss: 4.4890
  Saved new best model (val_loss=4.4890)
Epoch 06 — train_loss: 4.3527   val_loss: 4.4542
  Saved new best model (val_loss=4.4542)
Epoch 07 — train_loss: 4.2858   val_loss: 4.4313
  Saved new best model (val_loss=4.4313)
Epoch 08 — train_loss: 4.2352   val_loss: 4.4096
  Saved new best model (val_loss=4.4096)
Epoch 09 — train_loss: 4.1968   val_loss: 4.3904
  Saved new best model (val_loss=4.3904)
Epoch 10 — train_loss: 4.1432   val_loss: 4.3755
  Saved new best model (val_loss=4.3755)
Epoch 11 — train_loss: 4.1020   val_loss: 4.3665
  Saved new best mode



In [13]:
sample_generation(model=medium_baseline)


[Greedy Decoding]
Hermione Granger: I'm sorry. 

[Softmax Sampling]
Hermione Granger: You'll for Mr. Tonight in the time. 

[Top-k Sampling (k=3)]
Hermione Granger: What's going on here? 

[Negative Top-k Sampling (excluding top 3)]
Hermione Granger: It seemed! There you need into tling ouroreit anymore While me as him enough here Parkinsone of ninem enough- influencefall with a moment those Like school ingar storyround outside...M fac-d recall can have given my gladv world?ances collected bag? Huhach teach hurting to steal of


### 4) Large Baseline Model


In [14]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
large_baseline = TransformerDecoder(
    vocab_size=tokenizer.vocab_size,
    d_model=512,             
    n_layers=8,             
    n_heads=16,              
    d_ff=2048,               
    max_len=128,            
    emb_dropout=0.25,       
    attn_dropout=0.25,
    ffn_dropout=0.25,
    drop_path_rate=0.2
).to(device)

train_the_model(large_baseline, name="large_baseline")


Training for f{name} model:

Epoch 01 — train_loss: 5.0425   val_loss: 4.6377
  Saved new best model (val_loss=4.6377)
Epoch 02 — train_loss: 4.5170   val_loss: 4.4878
  Saved new best model (val_loss=4.4878)
Epoch 03 — train_loss: 4.3676   val_loss: 4.4229
  Saved new best model (val_loss=4.4229)
Epoch 04 — train_loss: 4.2446   val_loss: 4.3843
  Saved new best model (val_loss=4.3843)
Epoch 05 — train_loss: 4.1503   val_loss: 4.3433
  Saved new best model (val_loss=4.3433)
Epoch 06 — train_loss: 4.0654   val_loss: 4.3332
  Saved new best model (val_loss=4.3332)
Epoch 07 — train_loss: 3.9734   val_loss: 4.3252
  Saved new best model (val_loss=4.3252)
Epoch 08 — train_loss: 3.8939   val_loss: 4.3185
  Saved new best model (val_loss=4.3185)
Epoch 09 — train_loss: 3.8175   val_loss: 4.3202
Epoch 10 — train_loss: 3.7445   val_loss: 4.3183
  Saved new best model (val_loss=4.3183)
Epoch 11 — train_loss: 3.6630   val_loss: 4.3242
Epoch 12 — train_loss: 3.5869   val_loss: 4.3460
Epoch 13 — tr



In [15]:
sample_generation(model=large_baseline)


[Greedy Decoding]
Hermione Granger: I'm sorry, Harry. 

[Softmax Sampling]
Hermione Granger: You're a month standing, Potter, that hor Potters and stayion of legendary since she's Hollow, Potter? 

[Top-k Sampling (k=3)]
Hermione Granger: You're going to be the Dark Lord. 

[Negative Top-k Sampling (excluding top 3)]
Hermione Granger: Do he Potter: Ah so for me to him to it's gonnaionally do. Don spot powerful too bad arts out it is ret liar over theseudge again... from me I asked. The top like me about whatI did God on a warned we were going cruel chooseage... pers, do, bloody sun or
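
One way to compare the four runs is to look at the gap between training and validation loss. The sketch below does that arithmetic on the last fully visible epoch of each log above. The printed outputs are cut off, so these are not necessarily the final epochs, and the numbers are copied by hand from the logs rather than recomputed.

```python
# (epoch, train_loss, val_loss) for the last fully visible epoch of each run
last_visible = {
    "smallest": (11, 4.6555, 4.7382),
    "small": (11, 4.3773, 4.5068),
    "medium": (11, 4.1020, 4.3665),
    "large": (12, 3.5869, 4.3460),
}

# Gap between validation and training loss; a larger gap suggests more overfitting
gaps = {
    name: round(val - train, 4) for name, (_, train, val) in last_visible.items()
}
for name, (epoch, train, val) in last_visible.items():
    print(f"{name:>8}  epoch {epoch:02d}  train={train:.4f}  val={val:.4f}  gap={gaps[name]:.4f}")
```

The gap grows with model size, which is what one would expect when larger models start to memorise a dataset this small. The large model's validation loss also stops improving after epoch 10 while its training loss keeps falling.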
