# **Introduction**

This notebook presents an extended investigation into recurrent neural network architectures by advancing the original character-level Astra-GRU experiments into a **subword-level language modeling framework**.
In this new study, we introduce the **Scribe-GRU Series**—a family of three recurrent models (Scribe-α, Scribe-β, and Scribe-γ) trained on the *Tiny Shakespeare* corpus using **SentencePiece-based subword tokenization** instead of raw characters.

Where the Astra-GRU series examined the expressive limits of GRUs at the character level, the Scribe-GRU series explores how transitioning to **morpheme-like subword units** affects modeling capacity, compositional structure, and generative quality. Subword-level modeling provides a powerful intermediate granularity: tokens are more meaningful than characters but significantly more flexible than full words, allowing the model to learn syntax, phonetic structure, and stylistic patterns with far greater efficiency.

Despite this shift in linguistic representation, **all core computational components remain unchanged**.
Each model continues to use the same manually implemented GRUCell and GRULayer classes defined previously, preserving:

* the **single-bias gate formulation**,
* the **canonical reset-gate application to the raw hidden state**, and
* the fully hand-written recurrent logic that diverges intentionally from PyTorch’s optimized GRU internals.

This ensures that any observed improvements in depth, coherence, or stylistic fidelity arise solely from the **change in tokenization strategy**, not from modifications to the recurrent architecture itself.

The Tiny Shakespeare dataset remains the underlying training corpus, but the text is now processed using a **SentencePiece BPE subword tokenizer**, enabling the model to operate over a compact vocabulary of semantically meaningful units. This allows GRUs to capture longer-range dependencies, more stable word-like structures, and richer stylistic patterns that are difficult to model at the character level.

To systematically study scaling behavior under this new tokenization regime, three GRU architectures of increasing complexity are trained:

* **Scribe-α** — a lightweight single-layer subword GRU
* **Scribe-β** — a medium-capacity two-layer configuration
* **Scribe-γ** — a high-capacity three-layer recurrent model with expanded hidden representation

Each model is evaluated on its ability to perform autoregressive subword generation, reconstruct Shakespearean style, and maintain coherent linguistic structure across extended sequences. Comparative analyses are provided in terms of architecture, parameter count, training dynamics, and qualitative generation quality.

This notebook documents the complete workflow—from subword tokenizer construction and dataset encoding to model training, evaluation, and sampling—thereby building upon and extending the original Astra-GRU experimental framework into a richer and more linguistically grounded modeling paradigm.


In [1]:
import sentencepiece as spm
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_file = "tiny_shakespeare.txt"
spm.SentencePieceTrainer.train(
    input=input_file,
    model_prefix="shakespeare_bpe",
    vocab_size=1000,
    character_coverage=1.0,     
    model_type="bpe",           
    bos_id=1,                   
    eos_id=2,
    unk_id=0                    
)
sp = spm.SentencePieceProcessor()
sp.load("shakespeare_bpe.model")

data = sp.encode(open("tiny_shakespeare.txt", "r", encoding="utf-8").read(), out_type=int)
n = int(0.9 * len(data))
data = torch.tensor(data, dtype=torch.long)
train_data = data[:n]
val_data = data[n:]

block_size = 128
def get_batch(split="train", batch_size=64):
    source = train_data if split == "train" else val_data
    ix = torch.randint(len(source) - block_size - 1, (batch_size,))
    X = torch.stack([source[i:i+block_size] for i in ix])
    Y = torch.stack([source[i+1:i+block_size+1] for i in ix])
    return X.to(device), Y.to(device)

## GRUCell's Implementation

In [2]:
class GRUCell(nn.Module):
    """ 
    This Implementation is Differ from the PyTorch's Official Implementation in 2 Different Ways
    NOTE-1:
        This is not the official Implementation of PyTorch's GRUcell Because they use 2 biases per Gate
        and i'm only using 1 bias per Gate

        PyTorch's Official Implementation: 
        r = σ(W_ir x + b_ir + W_hr h + b_hr)
        z = σ(W_iz x + b_iz + W_hz h + b_hz)
        n = tanh(W_in x + b_in + r ⊙ (W_hn h + b_hn))

        They Use 2 Bias per Gate
    
    NOTE-2: 
        They apply the reset gate (r) after the Multiplication of W_hn and addition of b_hn on the h_prev
        2. Original implementation: Apply the Hadamard product (⊙) between r_t and h_prev and then apply the
            Matrix Transformation and bias addition
        3. What PyTorch does is, they apply the Matrix Transformation (Matrix Multiplication and bias addition) 1st and then
            they apply the Hadamard product (⊙) between (W_hn h + b_hn)
    """
    def __init__(self, embd_dim, hidden_dim):
        """
        -> Bias only on x: input
        -> No Bias on Hidden States
        """
        super().__init__()
        self.embd_dim = embd_dim
        self.hidden_dim = hidden_dim

        # Candidate transformation
        self.Wx = nn.Linear(embd_dim, hidden_dim, bias = True)
        self.Wh = nn.Linear(hidden_dim, hidden_dim, bias = False)

        # Update Gate Specific Parameters
        self.Wzx = nn.Linear(embd_dim, hidden_dim, bias = True)
        self.Wzh = nn.Linear(hidden_dim, hidden_dim, bias = False)
        self.bias_z = nn.Parameter(torch.zeros(hidden_dim))

        # Reset Gate Specific Parameters
        self.Wrx = nn.Linear(embd_dim, hidden_dim, bias = True)
        self.Wrh = nn.Linear(hidden_dim, hidden_dim, bias = False)
        self.bias_r = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x, h_prev):
        """
        --> A proposed update → candidate (h̃_t)
        --> A decision gate → update gate (z_t)
        --> A final controlled update → h_t

        NOTE:   1. Reset Gate Filters h_prev
                    - r_t = sigmoid( ( (x_t @ W_rx) + (h_prev @ W_rh) + b_r) )
                2. Apply filter to h_prev
                    - filtered_h_prev = r_t * h_prev [NOTE: (where * is element-wise multiplication)]
                        - Meaning:
                            - If r_t ≈ 0 → ignore old memory when forming candidate
                            - If r_t ≈ 1 → use old memory fully
                3. Compute candidate
                    - h̃_t = tanh( ( (x_t @ W_hx) + (filtered_h_prev @ W_hh) + b_h) )
                        - Meaning: 
                            - This produces a new memory proposal: A proposed update
                4. Final hidden state
                    - h_t = (1 - z_t) * h_prev + z_t * h̃_t [NOTE: (where * is element-wise multiplication)]
        """
        r_t = torch.sigmoid((self.Wrx(x)) + (self.Wrh(h_prev)) + self.bias_r)
        z_t = torch.sigmoid((self.Wzx(x)) + (self.Wzh(h_prev)) + self.bias_z)
        h_tilde = torch.tanh(self.Wx(x) + self.Wh(r_t * h_prev))

        h = (1 - z_t) * h_prev + z_t * h_tilde
        return h

## GRULayer's Implementation

In [3]:
class GRULayer(nn.Module):
    def __init__(self, embd_dim, hidden_dim, dropout = 0.0):
        super().__init__()
        self.grucell = GRUCell(embd_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, h_prev = None):
        batch, seq_length, _ = x.shape # x.shape --> batch, seq_length, embd_dim
        if h_prev is None:
            h_prev = torch.zeros(batch, self.grucell.hidden_dim, device = x.device)

        hidden_states = []
        for t in range(seq_length):
            x_t = x[:, t, :]
            h_prev = self.grucell(x_t, h_prev)
            h_prev = self.dropout(h_prev)
            hidden_states.append(h_prev)
        
        # Stack list into tensor
        hidden_states = torch.stack(hidden_states, dim=1)
        return hidden_states
        

## Linear Layer Projection

In [4]:
class Linear(nn.Module):
    def __init__(self, hidden_dim, n_classes):
        """
        This layer performs a linear projection on the GRU hidden states.
        It maps the hidden vector (of size hidden_dim) into the vocabulary space (n_classes)
        by applying a learnable affine transformation:

            logits = W h + b

        This is used to convert each GRU hidden state into class probabilities
        (e.g., next-character prediction in a name generation model).
        """

        super().__init__()
        self.linear_projection = nn.Linear(in_features = hidden_dim, out_features = n_classes, bias = True)
    
    def forward(self, x):
        return self.linear_projection(x)

## Custom GRU Model

In [5]:
class MyGRUModel(nn.Module):
    def __init__(self, vocab_size, embd_dim, hidden_dim, num_layers, model_name, dropout=0.0, ):
        super().__init__()
        self.model_name = model_name
        self.vocab_size = vocab_size
        self.n_classes = vocab_size
        self.embd_dim = embd_dim
        self.hidden_dim = hidden_dim
        self.dropout = dropout
        self.num_layers = num_layers

        self.embedding = nn.Embedding(vocab_size, embd_dim)
        
        self.layers = nn.ModuleList()
        self.layers.append(GRULayer(embd_dim, hidden_dim, dropout))
        for _ in range(num_layers - 1):
            self.layers.append(GRULayer(hidden_dim, hidden_dim, dropout))

        self.fc = Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x: (batch, seq_length)
        x = self.embedding(x)   # (batch, seq, embd_dim)

        h = x
        for layer in self.layers:
            h = layer(h)  # (batch, seq, hidden_dim)

        logits = self.fc(h)     # (batch, seq, vocab_size)
        return logits

## Function for Training the Model

In [6]:
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

def train_model(
        model: MyGRUModel,
        optimizer,
        scheduler,
        loss_fn,
        epochs,
        batch_size,
        device,
        clip_value=1.0,
        val_interval=1,
        steps = 200
    ):
    
    print(f"\n---------------- Training Started for {model.model_name} Model ----------------\n")
    

    for epoch in range(1, epochs + 1):
        model.train()
        train_loss = 0.0

        for _ in range(steps):
            X, Y = get_batch(split="train", batch_size=batch_size)
            X, Y = X.to(device), Y.to(device)

            optimizer.zero_grad()

            logits = model(X)
            loss = loss_fn(
                logits.reshape(-1, logits.size(-1)),
                Y.reshape(-1)
            )

            loss.backward()
            clip_grad_norm_(model.parameters(), clip_value)
            optimizer.step()

            train_loss += loss.item()

        train_loss /= steps

        # Validation
        val_loss = None
        if epoch % val_interval == 0:
            model.eval()
            with torch.no_grad():
                Xv, Yv = get_batch(split="val", batch_size=batch_size)
                Xv, Yv = Xv.to(device), Yv.to(device)

                logits = model(Xv)
                val_loss = loss_fn(
                    logits.reshape(-1, logits.size(-1)),
                    Yv.reshape(-1)
                ).item()

        # Lr Scheduler
        scheduler.step()

        # Epoch and Loss Details
        if val_loss is not None:
            print(f"Epoch {epoch:02d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        else:
            print(f"Epoch {epoch:02d}/{epochs} | Train Loss: {train_loss:.4f}")

    print(f"\n---------------- Training Completed for {model.model_name} Model ----------------\n")
    return model

In [7]:
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_
import os


def train_model_with_early_stopping(
        model: MyGRUModel,
        optimizer,
        scheduler,
        loss_fn,
        epochs,
        batch_size,
        device,
        clip_value=1.0,
        val_interval=1,
        steps=200,
        patience=5,
        checkpoint_path="./best_checkpoints/"
    ):
    
    print(f"\n---------------- Training Started for {model.model_name} Model ----------------\n")
    # steps ---> How many batches will get involve in forwardpass and backward pass
    # steps = 200, and batch_size = 64 meaning 200 batches of each size = 64 will get involved in forwardpass and backward pass
    # 200 * 64 * seq_length = 200 * 64 * 128 = 1.64M tokens/epoch for forward pass and 1.46M token/epoch for backward pass
    # so for larger models, keep larger steps, for Astra-gamma step = 400 

    # Create checkpoint directory
    os.makedirs(checkpoint_path, exist_ok=True)

    best_val = float("inf")
    best_epoch = -1
    no_improve = 0   # counter for early stopping

    for epoch in range(1, epochs + 1):
        model.train()
        train_loss = 0.0

        # ---------------- TRAINING LOOP ----------------
        for _ in range(steps):
            X, Y = get_batch(split="train", batch_size=batch_size)
            X, Y = X.to(device), Y.to(device)

            optimizer.zero_grad()

            logits = model(X)
            loss = loss_fn(
                logits.reshape(-1, logits.size(-1)),
                Y.reshape(-1)
            )

            loss.backward()
            clip_grad_norm_(model.parameters(), clip_value)
            optimizer.step()

            train_loss += loss.item()

        train_loss /= steps

        # ---------------- VALIDATION ----------------
        val_loss = None
        if epoch % val_interval == 0:
            model.eval()
            with torch.no_grad():
                Xv, Yv = get_batch(split="val", batch_size=batch_size)
                Xv, Yv = Xv.to(device), Yv.to(device)

                logits = model(Xv)
                val_loss = loss_fn(
                    logits.reshape(-1, logits.size(-1)),
                    Yv.reshape(-1)
                ).item()

        # ---------------- SCHEDULER STEP ----------------
        scheduler.step()

        # ---------------- LOGGING ----------------
        if val_loss is not None:
            print(f"Epoch {epoch:02d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        else:
            print(f"Epoch {epoch:02d}/{epochs} | Train Loss: {train_loss:.4f}")

        # ---------------- Early Stopping and Best Checkpoint ----------------
        if val_loss is not None:
            if val_loss < best_val:
                best_val = val_loss
                best_epoch = epoch
                no_improve = 0

                # Save checkpoint
                save_path = os.path.join(checkpoint_path, f"{model.model_name}_best.pth")
                torch.save(model.state_dict(), save_path)

                print(f">>> Improved validation! Best checkpoint saved at epoch {epoch}.")
            
            else:
                no_improve += 1
                print(f">>> No improvement ({no_improve}/{patience}).")

            # Trigger early stopping
            if no_improve >= patience:
                print("\n================ EARLY STOPPING ACTIVATED ================")
                print(f"Training stopped at epoch {epoch}. Best epoch: {best_epoch} | Best Val Loss: {best_val:.4f}")
                print("==========================================================\n")
                
                # Load best model before returning
                best_path = os.path.join(checkpoint_path, f"{model.model_name}_best.pth")
                model.load_state_dict(torch.load(best_path, map_location=device))

                print(f"Loaded best checkpoint: {best_path}")
                print(f"\n---------------- Training Completed for {model.model_name} Model ----------------\n")
                return model

    print(f"\n---------------- Training Completed for {model.model_name} Model ----------------\n")
    return model

## Sampling Codes

In [8]:
import torch
import torch.nn.functional as F

def sample_greedy(model, sp, start_text="A", max_new_tokens=200):
    model.eval()
    device = next(model.parameters()).device

    # Encode start text
    input_ids = torch.tensor(sp.encode(start_text), dtype=torch.long).unsqueeze(0).to(device)

    for _ in range(max_new_tokens):
        logits = model(input_ids[:, -1:])  
        # logits: (1, 1, vocab_size)

        next_id = torch.argmax(logits[:, -1, :], dim=-1)  # greedy pick

        input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=1)

    return sp.decode(input_ids[0].tolist())


def sample_with_temperature(model, sp, start_text="A", max_new_tokens=200, temperature=1.0):
    model.eval()
    device = next(model.parameters()).device

    input_ids = torch.tensor(sp.encode(start_text), dtype=torch.long).unsqueeze(0).to(device)

    for _ in range(max_new_tokens):
        logits = model(input_ids[:, -1:])
        logits = logits[:, -1, :] / temperature  

        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)

        input_ids = torch.cat([input_ids, next_id], dim=1)

    return sp.decode(input_ids[0].tolist())


def sample_top_k(model, sp, start_text="A", max_new_tokens=200, k=20, temperature=1.0):
    model.eval()
    device = next(model.parameters()).device

    input_ids = torch.tensor(sp.encode(start_text), dtype=torch.long).unsqueeze(0).to(device)

    for _ in range(max_new_tokens):
        logits = model(input_ids[:, -1:])
        logits = logits[:, -1, :] / temperature

        # Keep only top-k logits
        topk_vals, topk_idx = torch.topk(logits, k)
        
        probs = F.softmax(topk_vals, dim=-1)

        # Sample from top-k
        sampled_idx = torch.multinomial(probs, num_samples=1)

        next_id = topk_idx.gather(-1, sampled_idx)

        input_ids = torch.cat([input_ids, next_id], dim=1)

    return sp.decode(input_ids[0].tolist())

## Function for Saving the Model

In [9]:
import os
import json
import torch
from datetime import datetime

def save_model(model: MyGRUModel, base_name="Astra", path="./saved_models/"):
    os.makedirs(path, exist_ok=True)

    # versioning
    existing = [f for f in os.listdir(path) if f.startswith(base_name) and f.endswith(".pth")]
    versions = []
    for f in existing:
        parts = f.replace(".pth", "").split("_v")
        if len(parts) == 2 and parts[1].isdigit():
            versions.append(int(parts[1]))
    next_version = max(versions, default=0) + 1

    filename = f"{base_name}_v{next_version}.pth"
    save_path = os.path.join(path, filename)

    checkpoint = {
        "state_dict": model.state_dict(),
        "model_class": model.__class__.__name__,
        "model_name": model.model_name,
        "n_classes": model.n_classes,
        "embd_dim": model.embd_dim,
        "hidden_dim": model.hidden_dim,
        "dropout": model.dropout,
        "vocab_size": model.vocab_size,
        "num_layers": model.num_layers,
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "version": next_version,
    }

    torch.save(checkpoint, save_path)
    torch.save(checkpoint, os.path.join(path, f"{base_name}_latest.pth"))

    print(f"\nModel saved at: {save_path}")
    print(f"Also updated: {base_name}_latest.pth\n")
    return save_path

## Function for Loading the Model

In [10]:
import torch

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def load_model(filepath, device):
    checkpoint = torch.load(filepath, map_location=device)

    # extract architecture parameters from checkpoint
    model_name  = checkpoint["model_name"]
    vocab_size  = checkpoint["vocab_size"]
    embd_dim    = checkpoint["embd_dim"]
    hidden_dim  = checkpoint["hidden_dim"]
    dropout     = checkpoint["dropout"]
    num_layers     = checkpoint["num_layers"]

    # instantiate model using all saved metadata
    model = MyGRUModel(
        vocab_size = vocab_size,
        embd_dim   = embd_dim,
        hidden_dim = hidden_dim,
        dropout    = dropout,
        model_name = model_name,
        num_layers = num_layers
    ).to(device)

    # load weights
    model.load_state_dict(checkpoint["state_dict"])

    # pretty print metadata
    print("\n================ MODEL LOADED ================")
    print(f"Loaded File      : {filepath}")
    print(f"Model Name       : {model_name}")
    print(f"Model Class      : {checkpoint['model_class']}")
    print(f"Version          : v{checkpoint['version']}")
    print(f"Timestamp        : {checkpoint['timestamp']}")
    print("----------------------------------------------")

    print("Model Architecture:")
    for name, module in model.named_modules():
        if name != "":
            print(f"  └── {name}: {module.__class__.__name__}")
    print("----------------------------------------------")

    print(f"Total Parameters : {count_parameters(model):,}")
    print(f"Loaded on Device : {device}")
    print("==============================================\n")

    return model

## Function for Printing the Summary of Model

In [11]:
def print_model_summary(model, model_name, epochs, lr, device):
    print("\n" + "="*100)
    print("ASTRA-GRU MODEL SUMMARY")
    print("="*100)
    print(f"Model Name       : {model_name}")
    print(f"Device           : {device}")
    print(f"Total Epochs     : {epochs}")
    print(f"Learning Rate    : {lr}")

    print("\nMODEL ARCHITECTURE")
    print("-"*100)
    for name, module in model.named_modules():
        if name == "":
            continue
        print(f"  └── {name}: {module.__class__.__name__}()")
    print("-"*100)

    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    model_size_mb = n_params * 4 / (1024**2)

    print(f"\nTrainable Parameters : {n_params:,}")
    print(f"Model Size : {model_size_mb:.2f} MB")

    print("\nPARAMETER BREAKDOWN")
    print("-"*100)
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(f"{name:40s} : {param.numel():,}")

    print("="*100 + "\n")

# **Scribe-GRU Model Family: Architectural Variants and Training Configurations**

The **Scribe-GRU** series introduces a new generation of recurrent language models—**Scribe-α**, **Scribe-β**, and **Scribe-γ**—designed for **subword-level language modeling** using a SentencePiece-BPE vocabulary trained on the Tiny Shakespeare corpus.
This family serves as a direct evolution of the Astra series, leveraging the richer semantic structure of subword tokens and expanding architectural capacity for more expressive sequence modeling.

While the underlying GRUCell and GRULayer implementations remain identical to the handcrafted mechanisms used in the Astra series, the Scribe models are scaled to exploit the advantages of subword tokenization, enabling superior generalization, smoother long-range coherence, and significantly improved generative fluency.

The following sections detail the structure and training configuration of all three Scribe variants.

---

## **1. Scribe-α Model (Small Configuration)**

### **Architectural Description**

Scribe-α is the entry-level configuration in the Scribe-GRU series, designed for fast experimentation on subword-level inputs.
Unlike Astra-α—which operated on individual characters—Scribe-α benefits from semantically meaningful BPE tokens, allowing even a compact architecture to produce coherent multi-token sequences.

The architecture consists of:

* A subword embedding layer
* A **2-layer** GRU stack
* A linear output projection for next-token prediction

This setup strikes a balance between simplicity and sufficient capacity to leverage subword structure.

### **Hyperparameter Configuration**

```
embedding_dim = 128
hidden_dim    = 256
num_layers    = 2
epochs        = 15
learning_rate = 2e-3
weight_decay  = 0.01
optimizer     = AdamW
scheduler     = CosineAnnealingLR
dropout       = 0.1
batch_size    = 64
seq_length    = 128
```

### **Purpose and Expected Behavior**

Scribe-α serves as the baseline for subword modeling experiments.
It is capable of learning token-level morphology, phrase structure, and simple line-based formatting.
Its performance already surpasses Astra-α due to the inherent modeling advantages of BPE segmentation—not architectural complexity.

---

## **2. Scribe-β Model (Medium Configuration)**

### **Architectural Description**

Scribe-β significantly expands representational capacity through:

* **Deeper recurrence (3 GRU layers)**
* Wider hidden state (512)
* Higher-dimensional embeddings (256)

This configuration is designed for intermediate-scale modeling tasks, providing robust learning of mid-range dependencies (e.g., multi-token expressions, sentence-like structures, and repeated theatrical phrasing).

The architecture consists of:

* Embedding layer
* GRU Layer 1 → GRU Layer 2 → GRU Layer 3
* Linear projection to vocabulary size

### **Hyperparameter Configuration**

```
embedding_dim = 256
hidden_dim    = 512
num_layers    = 3
epochs        = 20
learning_rate = 1.5e-3
weight_decay  = 0.01
optimizer     = AdamW
scheduler     = CosineAnnealingLR
dropout       = 0.1
batch_size    = 64
seq_length    = 128
```

### **Purpose and Expected Behavior**

Scribe-β is designed as the **balanced workhorse** of the series.
It is expected to produce coherent multi-line outputs, exhibit stable training curves, and capture much richer structural and stylistic patterns compared to Scribe-α.

This model is well-suited for realistic Shakespeare-like generation without excessive computational requirements.

---

## **3. Scribe-γ Model (Large Configuration)**

### **Architectural Description**

Scribe-γ is the most advanced and expressive model in the Scribe-GRU family.

Relative to Astra-γ and even Scribe-β, it introduces:

* A **4-layer GRU stack**
* High-resolution embeddings (384)
* Substantial hidden dimensionality (768)
* Increased dropout for stabilization

This model maximizes recurrent depth, token-level abstraction, and long-range coherence—essential for producing realistic Shakespearean dialogue that spans multiple sentences or lines.

The architecture includes:

* Embedding layer
* GRU Layer 1 → GRU Layer 2 → GRU Layer 3 → GRU Layer 4
* Linear prediction head

### **Hyperparameter Configuration**

```
embedding_dim = 384
hidden_dim    = 768
num_layers    = 4
epochs        = 30
learning_rate = 1e-3
weight_decay  = 0.01
optimizer     = AdamW
scheduler     = CosineAnnealingLR
dropout       = 0.15
batch_size    = 64
seq_length    = 128
```

### **Purpose and Expected Behavior**

Scribe-γ is optimized for **high-fidelity subword generation**, offering:

* Strong multi-sentence continuity
* Accurate dialogue formatting
* Rich stylistic imitation of Shakespeare’s syntax and rhythm
* Far fewer nonsensical outputs compared to character-level models

It stands as the flagship configuration for experiments in recurrent generative modeling.

---

# **Summary**

The Scribe-GRU series—**Scribe-α**, **Scribe-β**, and **Scribe-γ**—reflects a progression in both **architectural depth** and **semantic richness** enabled by subword-level tokenization.

Here is a **single unified comparison table** that merges **architecture**, **capacity**, and **parameter statistics** for the entire **Scribe-GRU Series**.

Perfect for documentation, reports, or GitHub READMEs.

---

# **Scribe-GRU Model Comparison Table**

| Model        | Layers | Hidden | Embedding | Parameters | Size (MB) |  Params (M) | Expected Behavior                                     |
| ------------ | ------ | ------ | --------- | ---------: | --------: | ----------: | ----------------------------------------------------- |
| **Scribe-α** | 2      | 256    | 128       |  1,075,688 |   4.10 MB |  **1.08 M** | Baseline subword LM; learns local & midrange patterns |
| **Scribe-β** | 3      | 512    | 256       |  5,102,056 |  19.46 MB |  **5.10 M** | Strong coherence and phrase-level structure           |
| **Scribe-γ** | 4      | 768    | 384       | 14,439,400 |  55.08 MB | **14.44 M** | High-quality fluent generation; best global structure |


In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vocab_size = sp.get_piece_size()   # Subword vocabulary size from SentencePiece


# ------------------------------------------------------- Scribe-α -------------------------------------------------------
small_model_name = "Scribe-α"
model_small = MyGRUModel(
    vocab_size = vocab_size,
    embd_dim = 128,          # Larger than Astra to match subword semantics
    hidden_dim = 256,
    num_layers = 2,
    model_name = small_model_name,
    dropout = 0.1
).to(device)

lr_small_model = 2e-3
epochs_small_model = 30
weight_decay_small = 0.01

optimizer_small = torch.optim.AdamW(
    model_small.parameters(),
    lr = lr_small_model,
    weight_decay = weight_decay_small
)

scheduler_small = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer_small,
    T_max = epochs_small_model
)


# ------------------------------------------------------- Scribe-β -------------------------------------------------------
medium_model_name = "Scribe-β"
model_medium = MyGRUModel(
    vocab_size = vocab_size,
    embd_dim = 256,
    hidden_dim = 512,
    num_layers = 3,          # deeper than fevious Astra-β
    model_name = medium_model_name,
    dropout = 0.1
).to(device)

lr_medium_model = 1.5e-3
epochs_medium_model = 35
weight_decay_medium = 0.01

optimizer_medium = torch.optim.AdamW(
    model_medium.parameters(),
    lr = lr_medium_model,
    weight_decay = weight_decay_medium
)

scheduler_medium = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer_medium,
    T_max = epochs_medium_model
)

# ------------------------------------------------------- Scribe-γ -------------------------------------------------------
large_model_name = "Scribe-γ"
model_large = MyGRUModel(
    vocab_size = vocab_size,
    embd_dim = 384,          # very high-quality embeddings
    hidden_dim = 768,        # large recurrent representation
    num_layers = 4,          # deeper than Astra-γ
    model_name = large_model_name,
    dropout = 0.15           # slightly more dropout for stability
).to(device)

lr_large_model = 1e-3
epochs_large_model = 50
weight_decay_large = 0.01

optimizer_large = torch.optim.AdamW(
    model_large.parameters(),
    lr = lr_large_model,
    weight_decay = weight_decay_large
)

# CosineAnnealingLR scheduler: It will produce cleaner convergence and noticeably better text quality.
scheduler_large = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer_large,
    T_max = epochs_large_model
)

loss_fn = nn.CrossEntropyLoss()

### Scribe-α Model Summary

In [13]:
print_model_summary(
    model      = model_small,
    model_name = small_model_name,
    epochs     = epochs_small_model,
    lr         = lr_small_model,
    device     = device
)


ASTRA-GRU MODEL SUMMARY
Model Name       : Scribe-α
Device           : cuda
Total Epochs     : 30
Learning Rate    : 0.002

MODEL ARCHITECTURE
----------------------------------------------------------------------------------------------------
  └── embedding: Embedding()
  └── layers: ModuleList()
  └── layers.0: GRULayer()
  └── layers.0.grucell: GRUCell()
  └── layers.0.grucell.Wx: Linear()
  └── layers.0.grucell.Wh: Linear()
  └── layers.0.grucell.Wzx: Linear()
  └── layers.0.grucell.Wzh: Linear()
  └── layers.0.grucell.Wrx: Linear()
  └── layers.0.grucell.Wrh: Linear()
  └── layers.0.dropout: Dropout()
  └── layers.1: GRULayer()
  └── layers.1.grucell: GRUCell()
  └── layers.1.grucell.Wx: Linear()
  └── layers.1.grucell.Wh: Linear()
  └── layers.1.grucell.Wzx: Linear()
  └── layers.1.grucell.Wzh: Linear()
  └── layers.1.grucell.Wrx: Linear()
  └── layers.1.grucell.Wrh: Linear()
  └── layers.1.dropout: Dropout()
  └── fc: Linear()
  └── fc.linear_projection: Linear()
-------------

### Scribe-α Training Phase

In [14]:
model_small = train_model_with_early_stopping(
    model       = model_small,
    optimizer   = optimizer_small,
    scheduler   = scheduler_small,
    loss_fn     = loss_fn,
    epochs      = epochs_small_model,
    batch_size  = 64,
    device      = device,
    steps       = 300
)


---------------- Training Started for Scribe-α Model ----------------

Epoch 01/30 | Train Loss: 4.4040 | Val Loss: 3.8884
>>> Improved validation! Best checkpoint saved at epoch 1.
Epoch 02/30 | Train Loss: 3.4838 | Val Loss: 3.9794
>>> No improvement (1/5).
Epoch 03/30 | Train Loss: 3.2784 | Val Loss: 4.0091
>>> No improvement (2/5).
Epoch 04/30 | Train Loss: 3.1669 | Val Loss: 3.8738
>>> Improved validation! Best checkpoint saved at epoch 4.
Epoch 05/30 | Train Loss: 3.0916 | Val Loss: 3.9977
>>> No improvement (1/5).
Epoch 06/30 | Train Loss: 3.0313 | Val Loss: 4.1396
>>> No improvement (2/5).
Epoch 07/30 | Train Loss: 2.9861 | Val Loss: 3.9533
>>> No improvement (3/5).
Epoch 08/30 | Train Loss: 2.9525 | Val Loss: 3.9693
>>> No improvement (4/5).
Epoch 09/30 | Train Loss: 2.9200 | Val Loss: 4.1530
>>> No improvement (5/5).

Training stopped at epoch 9. Best epoch: 4 | Best Val Loss: 3.8738

Loaded best checkpoint: ./best_checkpoints/Scribe-α_best.pth

---------------- Training Com

  model.load_state_dict(torch.load(best_path, map_location=device))


### Scribe-α Autoregressive Generation Phase

#### Greedy Sampling

In [15]:
print(sample_greedy(model_small, sp, "ROMEO: "))

ROMEO: I amends, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the sounds, and the s


#### Temperature Sampling

In [16]:
print(sample_with_temperature(model_small, sp, "KING: ", temperature=0.8))

KING: I am all o' to my good nates: Why, Thou fetervelave by some score as this cuten by the proud to make me of his law? Againmity grant! First Lord Lady, And perfull husband The cute'day, Fratertain, letter's Mine in your honour and by my father, King killed with carest thou dire the truthor on my design, go To her! He was pardon, Were to keep with thee here Wereason'dracurs, 'Twake yet body'sclet'ships, if thou hastevellow! I age That deser, and young Prong grong. Bafere proud witness to the dour oft's, I'dd by my mother


#### Top-k sampling

In [17]:
print(sample_top_k(model_small, sp, "FIRST CITIZEN: ", k=30))

FIRST CITIZEN: For the noble sorrow to dogive requides! I ambert, And thou cook to have I was the tounce, I to my father; But to the other velace it is's: From either: All shrustenessenncusiness, 'tis that belay the stropass of my pity: Humplace and a sound in this present. Thinkering of that I amongs; that mague to become gets, if his hand, but a drungthight: Hapect's; and all my mises, Signors have it cut. Owatory as if this wickly dark, in all your heart. KING EDWARD: For the sorrow to this puss hell, and in the restrain, if that will be not be


### Scribe-α Model Checkpoint Saving and Archival

In [18]:
save_model(
    model      = model_small,
    base_name  = "Scribe_alpha",
    path       = "./scribe_saved_models"
)


Model saved at: ./scribe_saved_models\Scribe_alpha_v1.pth
Also updated: Scribe_alpha_latest.pth



'./scribe_saved_models\\Scribe_alpha_v1.pth'

### Scribe-β Model Summary

In [19]:
print_model_summary(
    model      = model_medium,
    model_name = medium_model_name,
    epochs     = epochs_medium_model,
    lr         = lr_medium_model,
    device     = device
)


ASTRA-GRU MODEL SUMMARY
Model Name       : Scribe-β
Device           : cuda
Total Epochs     : 35
Learning Rate    : 0.0015

MODEL ARCHITECTURE
----------------------------------------------------------------------------------------------------
  └── embedding: Embedding()
  └── layers: ModuleList()
  └── layers.0: GRULayer()
  └── layers.0.grucell: GRUCell()
  └── layers.0.grucell.Wx: Linear()
  └── layers.0.grucell.Wh: Linear()
  └── layers.0.grucell.Wzx: Linear()
  └── layers.0.grucell.Wzh: Linear()
  └── layers.0.grucell.Wrx: Linear()
  └── layers.0.grucell.Wrh: Linear()
  └── layers.0.dropout: Dropout()
  └── layers.1: GRULayer()
  └── layers.1.grucell: GRUCell()
  └── layers.1.grucell.Wx: Linear()
  └── layers.1.grucell.Wh: Linear()
  └── layers.1.grucell.Wzx: Linear()
  └── layers.1.grucell.Wzh: Linear()
  └── layers.1.grucell.Wrx: Linear()
  └── layers.1.grucell.Wrh: Linear()
  └── layers.1.dropout: Dropout()
  └── layers.2: GRULayer()
  └── layers.2.grucell: GRUCell()
  └── l

### Scribe-β Training Phase

In [20]:
model_medium = train_model_with_early_stopping(
    model       = model_medium,
    optimizer   = optimizer_medium,
    scheduler   = scheduler_medium,
    loss_fn     = loss_fn,
    epochs      = epochs_medium_model,
    batch_size  = 64,
    device      = device,
    steps       = 350
)


---------------- Training Started for Scribe-β Model ----------------

Epoch 01/35 | Train Loss: 4.1447 | Val Loss: 3.8422
>>> Improved validation! Best checkpoint saved at epoch 1.
Epoch 02/35 | Train Loss: 2.9913 | Val Loss: 4.0418
>>> No improvement (1/5).
Epoch 03/35 | Train Loss: 2.6152 | Val Loss: 4.1717
>>> No improvement (2/5).
Epoch 04/35 | Train Loss: 2.3918 | Val Loss: 4.4219
>>> No improvement (3/5).
Epoch 05/35 | Train Loss: 2.2441 | Val Loss: 4.6182
>>> No improvement (4/5).
Epoch 06/35 | Train Loss: 2.1431 | Val Loss: 4.7304
>>> No improvement (5/5).

Training stopped at epoch 6. Best epoch: 1 | Best Val Loss: 3.8422

Loaded best checkpoint: ./best_checkpoints/Scribe-β_best.pth

---------------- Training Completed for Scribe-β Model ----------------



  model.load_state_dict(torch.load(best_path, map_location=device))


### Scribe-β Autoregressive Generation Phase

#### Greedy Sampling

In [21]:
print(sample_greedy(model_medium, sp, "ROMEO: "))

ROMEO: I amongs, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse, and the curse,


#### Temperature Sampling

In [22]:
print(sample_with_temperature(model_medium, sp, "KING: ", temperature=0.8))

KING: I'ded call me, itself, But mished, this, he is my words. From words to the goes. CL VI Alooks, for The wisall good Pety readeed me, Ours those that lies, and at all the time, nor a keeple look the arm on the courtild lies. LORDIUS oppy to entire that bos. Thou living such authure so. Owise, that bid thee more chold farry, and haw foul yever-tent Ebriefth and I prayer to make some chretchom, I say doth that bushipest best, said The charge hath he that thou a vie sleeper Gruck'st most lost thou talk's. QUEENARD Of no more to be your ey


#### Top-k sampling

In [23]:
print(sample_top_k(model_medium, sp, "FIRST CITIZEN: ", k=30))

FIRST CITIZEN: And fit upon mineself! Pars: Sh, Accord? Bolent a tastly, by your love a little mouden, I'd: No server, and, And am in pass, I think; Bairt of the gen, and the acus in him with thee; and good mercy, Scon in the way: Afort and my lady'd-held: 'twed, which I did fully, and toget. Ad! What say, that, and post, which they are towards A carch in her servford, that tale are the world, Would with him. Londily sen; But in a man, by mine: For such grant that you's Suitation, and goods: He that heard. Cancons that will it


### Scribe-β Model Checkpoint Saving and Archival

In [24]:
save_model(
    model      = model_medium,
    base_name  = "Scribe_beta",
    path       = "./scribe_saved_models"
)


Model saved at: ./scribe_saved_models\Scribe_beta_v1.pth
Also updated: Scribe_beta_latest.pth



'./scribe_saved_models\\Scribe_beta_v1.pth'

### Scribe-γ Model Summary

In [25]:
print_model_summary(
    model      = model_large,
    model_name = large_model_name,
    epochs     = epochs_large_model,
    lr         = lr_large_model,
    device     = device
)


ASTRA-GRU MODEL SUMMARY
Model Name       : Scribe-γ
Device           : cuda
Total Epochs     : 50
Learning Rate    : 0.001

MODEL ARCHITECTURE
----------------------------------------------------------------------------------------------------
  └── embedding: Embedding()
  └── layers: ModuleList()
  └── layers.0: GRULayer()
  └── layers.0.grucell: GRUCell()
  └── layers.0.grucell.Wx: Linear()
  └── layers.0.grucell.Wh: Linear()
  └── layers.0.grucell.Wzx: Linear()
  └── layers.0.grucell.Wzh: Linear()
  └── layers.0.grucell.Wrx: Linear()
  └── layers.0.grucell.Wrh: Linear()
  └── layers.0.dropout: Dropout()
  └── layers.1: GRULayer()
  └── layers.1.grucell: GRUCell()
  └── layers.1.grucell.Wx: Linear()
  └── layers.1.grucell.Wh: Linear()
  └── layers.1.grucell.Wzx: Linear()
  └── layers.1.grucell.Wzh: Linear()
  └── layers.1.grucell.Wrx: Linear()
  └── layers.1.grucell.Wrh: Linear()
  └── layers.1.dropout: Dropout()
  └── layers.2: GRULayer()
  └── layers.2.grucell: GRUCell()
  └── la

### Scribe-γ Training Phase

In [26]:
model_large = train_model_with_early_stopping(
    model       = model_large,
    optimizer   = optimizer_large,
    scheduler   = scheduler_large,
    loss_fn     = loss_fn,
    epochs      = epochs_large_model,
    batch_size  = 64,
    device      = device,
    steps       = 450     
)


---------------- Training Started for Scribe-γ Model ----------------

Epoch 01/50 | Train Loss: 5.1040 | Val Loss: 4.2732
>>> Improved validation! Best checkpoint saved at epoch 1.
Epoch 02/50 | Train Loss: 3.5860 | Val Loss: 4.0858
>>> Improved validation! Best checkpoint saved at epoch 2.
Epoch 03/50 | Train Loss: 3.1148 | Val Loss: 4.0519
>>> Improved validation! Best checkpoint saved at epoch 3.
Epoch 04/50 | Train Loss: 2.8456 | Val Loss: 4.1766
>>> No improvement (1/5).
Epoch 05/50 | Train Loss: 2.6498 | Val Loss: 4.3601
>>> No improvement (2/5).
Epoch 06/50 | Train Loss: 2.5036 | Val Loss: 4.3753
>>> No improvement (3/5).
Epoch 07/50 | Train Loss: 2.3918 | Val Loss: 4.5689
>>> No improvement (4/5).
Epoch 08/50 | Train Loss: 2.3005 | Val Loss: 4.4060
>>> No improvement (5/5).

Training stopped at epoch 8. Best epoch: 3 | Best Val Loss: 4.0519

Loaded best checkpoint: ./best_checkpoints/Scribe-γ_best.pth

---------------- Training Completed for Scribe-γ Model ----------------



  model.load_state_dict(torch.load(best_path, map_location=device))


### Scribe-γ Autoregressive Generation Phase

#### Greedy Sampling

In [27]:
print(sample_greedy(model_large, sp, "ROMEO: "))

ROMEO: I have done, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king, and the king,


#### Temperature Sampling

In [28]:
print(sample_with_temperature(model_large, sp, "KING: ", temperature=0.8))

KING: Heranty, put on him to speak with change-s, I ve I shall live of this worse king, I, and that are alimes, I should be begin your lords, your stand in love, Glted for those that for a word. Ty offallcused be! Ty, much like. Procell proceed worse he knife? ISABELLA: the Take the wivoletrunant with all the duke? Second Is not know meeter with herself; a charge, bravelack, Uning he lod, Where, sir, are like lible ind; trust the kingding sent. LEONTES: I had not mets, and be as I have crawife, as the guest, yeased? Say sight, and saw he was enter as the deedy


#### Top-k sampling

In [29]:
print(sample_top_k(model_large, sp, "FIRST CITIZEN: ", k=30))

FIRST CITIZEN: I am, sir, I did all of my heads are all no drawns become, in his death and by my soul with awed, nor, sir, sir. KING EDWARD IV: PARENCE: What shall beaving w: 'tis you shall meeter and your pers: May. ISABELLA: No draw I did not. First More the wo'd, ins, to the cause, as a tn you do you thanks. MARCIA: Most in the goes, and, forward; then me to since thou, for this ords od, not: And then but one as the duke; But to me to dish; and wret, with the moud and a good and solding: Let me, and a draworch all the house, if he hath done, as I come. GLOUCESTER: '


### Scribe-γ Model Checkpoint Saving and Archival

In [30]:
save_model(
    model      = model_large,
    base_name  = "Scribe_gamma",
    path       = "./scribe_saved_models"
)


Model saved at: ./scribe_saved_models\Scribe_gamma_v1.pth
Also updated: Scribe_gamma_latest.pth



'./scribe_saved_models\\Scribe_gamma_v1.pth'

## Loading the Saved Model

In [31]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

scribe_alpha = load_model(
    filepath = "./scribe_saved_models/Scribe_alpha_latest.pth",
    device   = device
)

scribe_beta = load_model(
    filepath = "./scribe_saved_models/Scribe_beta_latest.pth",
    device   = device
)

scribe_gamma = load_model(
    filepath = "./scribe_saved_models/Scribe_gamma_latest.pth",
    device   = device
)


Loaded File      : ./scribe_saved_models/Scribe_alpha_latest.pth
Model Name       : Scribe-α
Model Class      : MyGRUModel
Version          : v1
Timestamp        : 2025-12-12 19:59:22
----------------------------------------------
Model Architecture:
  └── embedding: Embedding
  └── layers: ModuleList
  └── layers.0: GRULayer
  └── layers.0.grucell: GRUCell
  └── layers.0.grucell.Wx: Linear
  └── layers.0.grucell.Wh: Linear
  └── layers.0.grucell.Wzx: Linear
  └── layers.0.grucell.Wzh: Linear
  └── layers.0.grucell.Wrx: Linear
  └── layers.0.grucell.Wrh: Linear
  └── layers.0.dropout: Dropout
  └── layers.1: GRULayer
  └── layers.1.grucell: GRUCell
  └── layers.1.grucell.Wx: Linear
  └── layers.1.grucell.Wh: Linear
  └── layers.1.grucell.Wzx: Linear
  └── layers.1.grucell.Wzh: Linear
  └── layers.1.grucell.Wrx: Linear
  └── layers.1.grucell.Wrh: Linear
  └── layers.1.dropout: Dropout
  └── fc: Linear
  └── fc.linear_projection: Linear
----------------------------------------------
Tot

  checkpoint = torch.load(filepath, map_location=device)



Loaded File      : ./scribe_saved_models/Scribe_gamma_latest.pth
Model Name       : Scribe-γ
Model Class      : MyGRUModel
Version          : v1
Timestamp        : 2025-12-12 21:40:24
----------------------------------------------
Model Architecture:
  └── embedding: Embedding
  └── layers: ModuleList
  └── layers.0: GRULayer
  └── layers.0.grucell: GRUCell
  └── layers.0.grucell.Wx: Linear
  └── layers.0.grucell.Wh: Linear
  └── layers.0.grucell.Wzx: Linear
  └── layers.0.grucell.Wzh: Linear
  └── layers.0.grucell.Wrx: Linear
  └── layers.0.grucell.Wrh: Linear
  └── layers.0.dropout: Dropout
  └── layers.1: GRULayer
  └── layers.1.grucell: GRUCell
  └── layers.1.grucell.Wx: Linear
  └── layers.1.grucell.Wh: Linear
  └── layers.1.grucell.Wzx: Linear
  └── layers.1.grucell.Wzh: Linear
  └── layers.1.grucell.Wrx: Linear
  └── layers.1.grucell.Wrh: Linear
  └── layers.1.dropout: Dropout
  └── layers.2: GRULayer
  └── layers.2.grucell: GRUCell
  └── layers.2.grucell.Wx: Linear
  └── layer

## Sampling from the Loaded Model

In [32]:
print(sample_top_k(scribe_alpha, sp, "FIRST CITIZEN: ", k=30))


FIRST CITIZEN: This day! KING RICHARD: But me nobling to this servically in a since the other and his sounds, That ever I amprought, the fits a came as infility, and fours I among me as these warrant, and the gain a wish and my talk: he doges all to dracious childre yally? Shallench! See-welievoke yours he's, if The faulted with answer? I amend! I am nothing; and ages, And in Volkener, oratharcherthight with our streneking. But this inde; and deeperved by the doung. Second Muspoint, my loath of love as well, forges, or your grace apl


In [33]:
print(sample_top_k(scribe_beta, sp, "FIRST CITIZEN: ", k=30))


FIRST CITIZEN: Ay, and clectireses your broke more than any gain of me soon: Ourne's. Camio. KING KING HENRY Bause I speaks of us with the cause match! ABRFIS MARols of a prom starnates, I have not me for you not. BINGBR QUEEN ELIZABETH sadeepest thyself that tites a misters; But to me; I will take, or recept, which I'der of your groadd; For a gible with my lord, to hear our a pack on my good a lander's on herself on thee: If we's! Ohood shall become, ass yours. What's! Back's, or else are yourselves not yourself to watch from me. CLAUY


In [34]:
print(sample_top_k(scribe_gamma, sp, "FIRST CITIZEN: ", k=30))

FIRST CITIZEN: 'KELO: CLIENGARIVOLIONRIVOLINGR: Come, that hath reason, but your voties he' quits! JULIET: I do, As thou hastapect me of his maid it soft, I must be thy hand, and in him in my bloody. DUKE VINCENTIO: The sch'd, he is a manneal on a wised from her and I have weather'll, and law, you for the queen, not. LUCIO: Nay. Sor: Hell'ssiling mood: O: I shall you fired to beler as he is dam? YORK: If ever fier; I doing you know, who hath less. DUKELO; And barch, That hell, in my love-guced, and with the mount to prate.
