# **Introduction**

This notebook presents an extended investigation into recurrent neural network architectures by advancing the original character-level Astra-GRU experiments into a **subword-level language modeling framework**.
In this new study, we introduce the **Scribe-GRU Series**—a family of three recurrent models (Scribe-α, Scribe-β, and Scribe-γ) trained on the *Tiny Shakespeare* corpus using **SentencePiece-based subword tokenization** instead of raw characters.

Where the Astra-GRU series examined the expressive limits of GRUs at the character level, the Scribe-GRU series explores how transitioning to **morpheme-like subword units** affects modeling capacity, compositional structure, and generative quality. Subword-level modeling provides a powerful intermediate granularity: tokens are more meaningful than characters but significantly more flexible than full words, allowing the model to learn syntax, phonetic structure, and stylistic patterns with far greater efficiency.

Despite this shift in linguistic representation, **all core computational components remain unchanged**.
Each model continues to use the same manually implemented GRUCell and GRULayer classes defined previously, preserving:

* the **single-bias gate formulation**,
* the **canonical reset-gate application to the raw hidden state**, and
* the fully hand-written recurrent logic that diverges intentionally from PyTorch’s optimized GRU internals.

This ensures that any observed improvements in depth, coherence, or stylistic fidelity arise solely from the **change in tokenization strategy**, not from modifications to the recurrent architecture itself.

The Tiny Shakespeare dataset remains the underlying training corpus, but the text is now processed using a **SentencePiece BPE subword tokenizer**, enabling the model to operate over a compact vocabulary of semantically meaningful units. This allows GRUs to capture longer-range dependencies, more stable word-like structures, and richer stylistic patterns that are difficult to model at the character level.

To systematically study scaling behavior under this new tokenization regime, three GRU architectures of increasing complexity are trained:

* **Scribe-α** — a lightweight single-layer subword GRU
* **Scribe-β** — a medium-capacity two-layer configuration
* **Scribe-γ** — a high-capacity three-layer recurrent model with expanded hidden representation

Each model is evaluated on its ability to perform autoregressive subword generation, reconstruct Shakespearean style, and maintain coherent linguistic structure across extended sequences. Comparative analyses are provided in terms of architecture, parameter count, training dynamics, and qualitative generation quality.

This notebook documents the complete workflow—from subword tokenizer construction and dataset encoding to model training, evaluation, and sampling—thereby building upon and extending the original Astra-GRU experimental framework into a richer and more linguistically grounded modeling paradigm.


In [1]:
import sentencepiece as spm
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_file = "tiny_shakespeare.txt"
spm.SentencePieceTrainer.train(
    input=input_file,
    model_prefix="shakespeare_bpe",
    vocab_size=1000,
    character_coverage=1.0,     
    model_type="bpe",           
    bos_id=1,                   
    eos_id=2,
    unk_id=0                    
)
sp = spm.SentencePieceProcessor()
sp.load("shakespeare_bpe.model")

data = sp.encode(open("tiny_shakespeare.txt", "r", encoding="utf-8").read(), out_type=int)
n = int(0.9 * len(data))
data = torch.tensor(data, dtype=torch.long)
train_data = data[:n]
val_data = data[n:]

block_size = 128
def get_batch(split="train", batch_size=64):
    source = train_data if split == "train" else val_data
    ix = torch.randint(len(source) - block_size - 1, (batch_size,))
    X = torch.stack([source[i:i+block_size] for i in ix])
    Y = torch.stack([source[i+1:i+block_size+1] for i in ix])
    return X.to(device), Y.to(device)

## GRUCell's Implementation

In [2]:
class GRUCell(nn.Module):
    """ 
    This Implementation is Differ from the PyTorch's Official Implementation in 2 Different Ways
    NOTE-1:
        This is not the official Implementation of PyTorch's GRUcell Because they use 2 biases per Gate
        and i'm only using 1 bias per Gate

        PyTorch's Official Implementation: 
        r = σ(W_ir x + b_ir + W_hr h + b_hr)
        z = σ(W_iz x + b_iz + W_hz h + b_hz)
        n = tanh(W_in x + b_in + r ⊙ (W_hn h + b_hn))

        They Use 2 Bias per Gate
    
    NOTE-2: 
        They apply the reset gate (r) after the Multiplication of W_hn and addition of b_hn on the h_prev
        2. Original implementation: Apply the Hadamard product (⊙) between r_t and h_prev and then apply the
            Matrix Transformation and bias addition
        3. What PyTorch does is, they apply the Matrix Transformation (Matrix Multiplication and bias addition) 1st and then
            they apply the Hadamard product (⊙) between (W_hn h + b_hn)
    """
    def __init__(self, embd_dim, hidden_dim):
        """
        -> Bias only on x: input
        -> No Bias on Hidden States
        """
        super().__init__()
        self.embd_dim = embd_dim
        self.hidden_dim = hidden_dim

        # Candidate transformation
        self.Wx = nn.Linear(embd_dim, hidden_dim, bias = True)
        self.Wh = nn.Linear(hidden_dim, hidden_dim, bias = False)

        # Update Gate Specific Parameters
        self.Wzx = nn.Linear(embd_dim, hidden_dim, bias = True)
        self.Wzh = nn.Linear(hidden_dim, hidden_dim, bias = False)
        self.bias_z = nn.Parameter(torch.zeros(hidden_dim))

        # Reset Gate Specific Parameters
        self.Wrx = nn.Linear(embd_dim, hidden_dim, bias = True)
        self.Wrh = nn.Linear(hidden_dim, hidden_dim, bias = False)
        self.bias_r = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x, h_prev):
        """
        --> A proposed update → candidate (h̃_t)
        --> A decision gate → update gate (z_t)
        --> A final controlled update → h_t

        NOTE:   1. Reset Gate Filters h_prev
                    - r_t = sigmoid( ( (x_t @ W_rx) + (h_prev @ W_rh) + b_r) )
                2. Apply filter to h_prev
                    - filtered_h_prev = r_t * h_prev [NOTE: (where * is element-wise multiplication)]
                        - Meaning:
                            - If r_t ≈ 0 → ignore old memory when forming candidate
                            - If r_t ≈ 1 → use old memory fully
                3. Compute candidate
                    - h̃_t = tanh( ( (x_t @ W_hx) + (filtered_h_prev @ W_hh) + b_h) )
                        - Meaning: 
                            - This produces a new memory proposal: A proposed update
                4. Final hidden state
                    - h_t = (1 - z_t) * h_prev + z_t * h̃_t [NOTE: (where * is element-wise multiplication)]
        """
        r_t = torch.sigmoid((self.Wrx(x)) + (self.Wrh(h_prev)) + self.bias_r)
        z_t = torch.sigmoid((self.Wzx(x)) + (self.Wzh(h_prev)) + self.bias_z)
        h_tilde = torch.tanh(self.Wx(x) + self.Wh(r_t * h_prev))

        h = (1 - z_t) * h_prev + z_t * h_tilde
        return h

## GRULayer's Implementation

In [3]:
class GRULayer(nn.Module):
    def __init__(self, embd_dim, hidden_dim, dropout = 0.0):
        super().__init__()
        self.grucell = GRUCell(embd_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, h_prev = None):
        batch, seq_length, _ = x.shape # x.shape --> batch, seq_length, embd_dim
        if h_prev is None:
            h_prev = torch.zeros(batch, self.grucell.hidden_dim, device = x.device)

        hidden_states = []
        for t in range(seq_length):
            x_t = x[:, t, :]
            h_prev = self.grucell(x_t, h_prev)
            h_prev = self.dropout(h_prev)
            hidden_states.append(h_prev)
        
        # Stack list into tensor
        hidden_states = torch.stack(hidden_states, dim=1)
        return hidden_states, h_prev
        

## Linear Layer Projection

In [4]:
class Linear(nn.Module):
    def __init__(self, hidden_dim, n_classes):
        """
        This layer performs a linear projection on the GRU hidden states.
        It maps the hidden vector (of size hidden_dim) into the vocabulary space (n_classes)
        by applying a learnable affine transformation:

            logits = W h + b

        This is used to convert each GRU hidden state into class probabilities
        (e.g., next-character prediction in a name generation model).
        """

        super().__init__()
        self.linear_projection = nn.Linear(in_features = hidden_dim, out_features = n_classes, bias = True)
    
    def forward(self, x):
        return self.linear_projection(x)

## Custom GRU Model

In [5]:
class MyGRUModel(nn.Module):
    def __init__(self, vocab_size, embd_dim, hidden_dim, num_layers, model_name, dropout=0.0, ):
        super().__init__()
        self.model_name = model_name
        self.vocab_size = vocab_size
        self.n_classes = vocab_size
        self.embd_dim = embd_dim
        self.hidden_dim = hidden_dim
        self.dropout = dropout
        self.num_layers = num_layers

        self.embedding = nn.Embedding(vocab_size, embd_dim)
        
        self.layers = nn.ModuleList()
        self.layers.append(GRULayer(embd_dim, hidden_dim, dropout))
        for _ in range(num_layers - 1):
            self.layers.append(GRULayer(hidden_dim, hidden_dim, dropout))

        self.fc = Linear(hidden_dim, vocab_size)

    def forward(self, x, h_prev = None):
        x = self.embedding(x)   # (batch, seq, embd_dim)

        if h_prev is None:
            h_prev = [None] * self.num_layers

        new_h = []
        h = x

        for layer in self.layers:
            h, last_h = layer(h)  # (batch, seq, hidden_dim)
            new_h.append(last_h)

        logits = self.fc(h)     # (batch, seq, vocab_size)
        return logits, new_h

## Function for Training the Model

In [6]:
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

def train_model(
        model: MyGRUModel,
        optimizer,
        scheduler,
        loss_fn,
        epochs,
        batch_size,
        device,
        clip_value=1.0,
        val_interval=1,
        steps = 200
    ):
    
    print(f"\n---------------- Training Started for {model.model_name} Model ----------------\n")
    

    for epoch in range(1, epochs + 1):
        model.train()
        train_loss = 0.0

        for _ in range(steps):
            X, Y = get_batch(split="train", batch_size=batch_size)
            X, Y = X.to(device), Y.to(device)

            optimizer.zero_grad()

            logits, _ = model(X)
            loss = loss_fn(
                logits.reshape(-1, logits.size(-1)),
                Y.reshape(-1)
            )

            loss.backward()
            clip_grad_norm_(model.parameters(), clip_value)
            optimizer.step()

            train_loss += loss.item()

        train_loss /= steps

        # Validation
        val_loss = None
        if epoch % val_interval == 0:
            model.eval()
            with torch.no_grad():
                Xv, Yv = get_batch(split="val", batch_size=batch_size)
                Xv, Yv = Xv.to(device), Yv.to(device)

                logits, _ = model(Xv)
                val_loss = loss_fn(
                    logits.reshape(-1, logits.size(-1)),
                    Yv.reshape(-1)
                ).item()

        # Lr Scheduler
        scheduler.step()

        # Epoch and Loss Details
        if val_loss is not None:
            print(f"Epoch {epoch:02d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        else:
            print(f"Epoch {epoch:02d}/{epochs} | Train Loss: {train_loss:.4f}")

    print(f"\n---------------- Training Completed for {model.model_name} Model ----------------\n")
    return model

In [7]:
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_
import os


def train_model_with_early_stopping(
        model: MyGRUModel,
        optimizer,
        scheduler,
        loss_fn,
        epochs,
        batch_size,
        device,
        clip_value=1.0,
        val_interval=1,
        steps=200,
        patience=5,
        checkpoint_path="./best_checkpoints/"
    ):
    
    print(f"\n---------------- Training Started for {model.model_name} Model ----------------\n")
    # steps ---> How many batches will get involve in forwardpass and backward pass
    # steps = 200, and batch_size = 64 meaning 200 batches of each size = 64 will get involved in forwardpass and backward pass
    # 200 * 64 * seq_length = 200 * 64 * 128 = 1.64M tokens/epoch for forward pass and 1.46M token/epoch for backward pass
    # so for larger models, keep larger steps, for Astra-gamma step = 400 

    # Create checkpoint directory
    os.makedirs(checkpoint_path, exist_ok=True)

    best_val = float("inf")
    best_epoch = -1
    no_improve = 0   # counter for early stopping

    for epoch in range(1, epochs + 1):
        model.train()
        train_loss = 0.0

        # ---------------- TRAINING LOOP ----------------
        for _ in range(steps):
            X, Y = get_batch(split="train", batch_size=batch_size)
            X, Y = X.to(device), Y.to(device)

            optimizer.zero_grad()

            logits, _ = model(X)
            loss = loss_fn(
                logits.reshape(-1, logits.size(-1)),
                Y.reshape(-1)
            )

            loss.backward()
            clip_grad_norm_(model.parameters(), clip_value)
            optimizer.step()

            train_loss += loss.item()

        train_loss /= steps

        # ---------------- VALIDATION ----------------
        val_loss = None
        if epoch % val_interval == 0:
            model.eval()
            with torch.no_grad():
                Xv, Yv = get_batch(split="val", batch_size=batch_size)
                Xv, Yv = Xv.to(device), Yv.to(device)

                logits, _ = model(Xv)
                val_loss = loss_fn(
                    logits.reshape(-1, logits.size(-1)),
                    Yv.reshape(-1)
                ).item()

        # ---------------- SCHEDULER STEP ----------------
        scheduler.step()

        # ---------------- LOGGING ----------------
        if val_loss is not None:
            print(f"Epoch {epoch:02d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        else:
            print(f"Epoch {epoch:02d}/{epochs} | Train Loss: {train_loss:.4f}")

        # ---------------- Early Stopping and Best Checkpoint ----------------
        if val_loss is not None:
            if val_loss < best_val:
                best_val = val_loss
                best_epoch = epoch
                no_improve = 0

                # Save checkpoint
                save_path = os.path.join(checkpoint_path, f"{model.model_name}_best.pth")
                torch.save(model.state_dict(), save_path)

                print(f">>> Improved validation! Best checkpoint saved at epoch {epoch}.")
            
            else:
                no_improve += 1
                print(f">>> No improvement ({no_improve}/{patience}).")

            # Trigger early stopping
            if no_improve >= patience:
                print("\n================ EARLY STOPPING ACTIVATED ================")
                print(f"Training stopped at epoch {epoch}. Best epoch: {best_epoch} | Best Val Loss: {best_val:.4f}")
                print("==========================================================\n")
                
                # Load best model before returning
                best_path = os.path.join(checkpoint_path, f"{model.model_name}_best.pth")
                model.load_state_dict(torch.load(best_path, map_location=device))

                print(f"Loaded best checkpoint: {best_path}")
                print(f"\n---------------- Training Completed for {model.model_name} Model ----------------\n")
                return model

    print(f"\n---------------- Training Completed for {model.model_name} Model ----------------\n")
    return model

## Sampling Codes

In [8]:
import torch
import torch.nn.functional as F

def sample_greedy(model, sp, start_text="A", max_new_tokens=200):
    model.eval()
    device = next(model.parameters()).device

    # Encode start text
    input_ids = torch.tensor(sp.encode(start_text), dtype=torch.long).unsqueeze(0).to(device)
    h_prev = None
    for _ in range(max_new_tokens):
        logits, h_prev = model(input_ids[:, -1:], h_prev)  
        # logits: (1, 1, vocab_size)

        next_id = torch.argmax(logits[:, -1, :], dim=-1)  # greedy pick

        input_ids = torch.cat([input_ids, next_id.unsqueeze(1)], dim=1)

    return sp.decode(input_ids[0].tolist())


def sample_with_temperature(model, sp, start_text="A", max_new_tokens=200, temperature=1.0):
    model.eval()
    device = next(model.parameters()).device

    input_ids = torch.tensor(sp.encode(start_text), dtype=torch.long).unsqueeze(0).to(device)
    h_prev = None
    for _ in range(max_new_tokens):
        logits, h_prev = model(input_ids[:, -1:], h_prev) 
        logits = logits[:, -1, :] / temperature  

        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)

        input_ids = torch.cat([input_ids, next_id], dim=1)

    return sp.decode(input_ids[0].tolist())


def sample_top_k(model, sp, start_text="A", max_new_tokens=200, k=20, temperature=1.0):
    model.eval()
    device = next(model.parameters()).device

    input_ids = torch.tensor(sp.encode(start_text), dtype=torch.long).unsqueeze(0).to(device)
    h_prev = None
    for _ in range(max_new_tokens):
        logits, h_prev = model(input_ids[:, -1:], h_prev)
        logits = logits[:, -1, :] / temperature

        # Keep only top-k logits
        topk_vals, topk_idx = torch.topk(logits, k)
        
        probs = F.softmax(topk_vals, dim=-1)

        # Sample from top-k
        sampled_idx = torch.multinomial(probs, num_samples=1)

        next_id = topk_idx.gather(-1, sampled_idx)

        input_ids = torch.cat([input_ids, next_id], dim=1)

    return sp.decode(input_ids[0].tolist())

## Function for Saving the Model

In [9]:
import os
import json
import torch
from datetime import datetime

def save_model(model: MyGRUModel, base_name="Astra", path="./saved_models/"):
    os.makedirs(path, exist_ok=True)

    # versioning
    existing = [f for f in os.listdir(path) if f.startswith(base_name) and f.endswith(".pth")]
    versions = []
    for f in existing:
        parts = f.replace(".pth", "").split("_v")
        if len(parts) == 2 and parts[1].isdigit():
            versions.append(int(parts[1]))
    next_version = max(versions, default=0) + 1

    filename = f"{base_name}_v{next_version}.pth"
    save_path = os.path.join(path, filename)

    checkpoint = {
        "state_dict": model.state_dict(),
        "model_class": model.__class__.__name__,
        "model_name": model.model_name,
        "n_classes": model.n_classes,
        "embd_dim": model.embd_dim,
        "hidden_dim": model.hidden_dim,
        "dropout": model.dropout,
        "vocab_size": model.vocab_size,
        "num_layers": model.num_layers,
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "version": next_version,
    }

    torch.save(checkpoint, save_path)
    torch.save(checkpoint, os.path.join(path, f"{base_name}_latest.pth"))

    print(f"\nModel saved at: {save_path}")
    print(f"Also updated: {base_name}_latest.pth\n")
    return save_path

## Function for Loading the Model

In [10]:
import torch

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def load_model(filepath, device):
    checkpoint = torch.load(filepath, map_location=device)

    # extract architecture parameters from checkpoint
    model_name  = checkpoint["model_name"]
    vocab_size  = checkpoint["vocab_size"]
    embd_dim    = checkpoint["embd_dim"]
    hidden_dim  = checkpoint["hidden_dim"]
    dropout     = checkpoint["dropout"]
    num_layers     = checkpoint["num_layers"]

    # instantiate model using all saved metadata
    model = MyGRUModel(
        vocab_size = vocab_size,
        embd_dim   = embd_dim,
        hidden_dim = hidden_dim,
        dropout    = dropout,
        model_name = model_name,
        num_layers = num_layers
    ).to(device)

    # load weights
    model.load_state_dict(checkpoint["state_dict"])

    # pretty print metadata
    print("\n================ MODEL LOADED ================")
    print(f"Loaded File      : {filepath}")
    print(f"Model Name       : {model_name}")
    print(f"Model Class      : {checkpoint['model_class']}")
    print(f"Version          : v{checkpoint['version']}")
    print(f"Timestamp        : {checkpoint['timestamp']}")
    print("----------------------------------------------")

    print("Model Architecture:")
    for name, module in model.named_modules():
        if name != "":
            print(f"  └── {name}: {module.__class__.__name__}")
    print("----------------------------------------------")

    print(f"Total Parameters : {count_parameters(model):,}")
    print(f"Loaded on Device : {device}")
    print("==============================================\n")

    return model

## Function for Printing the Summary of Model

In [11]:
def print_model_summary(model, model_name, epochs, lr, device):
    print("\n" + "="*100)
    print("ASTRA-GRU MODEL SUMMARY")
    print("="*100)
    print(f"Model Name       : {model_name}")
    print(f"Device           : {device}")
    print(f"Total Epochs     : {epochs}")
    print(f"Learning Rate    : {lr}")

    print("\nMODEL ARCHITECTURE")
    print("-"*100)
    for name, module in model.named_modules():
        if name == "":
            continue
        print(f"  └── {name}: {module.__class__.__name__}()")
    print("-"*100)

    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    model_size_mb = n_params * 4 / (1024**2)

    print(f"\nTrainable Parameters : {n_params:,}")
    print(f"Model Size : {model_size_mb:.2f} MB")

    print("\nPARAMETER BREAKDOWN")
    print("-"*100)
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(f"{name:40s} : {param.numel():,}")

    print("="*100 + "\n")

# **Scribe-GRU Model Family: Architectural Variants and Training Configurations**

The **Scribe-GRU** series introduces a new generation of recurrent language models—**Scribe-α**, **Scribe-β**, and **Scribe-γ**—designed for **subword-level language modeling** using a SentencePiece-BPE vocabulary trained on the Tiny Shakespeare corpus.
This family serves as a direct evolution of the Astra series, leveraging the richer semantic structure of subword tokens and expanding architectural capacity for more expressive sequence modeling.

While the underlying GRUCell and GRULayer implementations remain identical to the handcrafted mechanisms used in the Astra series, the Scribe models are scaled to exploit the advantages of subword tokenization, enabling superior generalization, smoother long-range coherence, and significantly improved generative fluency.

The following sections detail the structure and training configuration of all three Scribe variants.

---

## **1. Scribe-α Model (Small Configuration)**

### **Architectural Description**

Scribe-α is the entry-level configuration in the Scribe-GRU series, designed for fast experimentation on subword-level inputs.
Unlike Astra-α—which operated on individual characters—Scribe-α benefits from semantically meaningful BPE tokens, allowing even a compact architecture to produce coherent multi-token sequences.

The architecture consists of:

* A subword embedding layer
* A **2-layer** GRU stack
* A linear output projection for next-token prediction

This setup strikes a balance between simplicity and sufficient capacity to leverage subword structure.

### **Hyperparameter Configuration**

```
embedding_dim = 128
hidden_dim    = 256
num_layers    = 2
epochs        = 15
learning_rate = 2e-3
weight_decay  = 0.01
optimizer     = AdamW
scheduler     = CosineAnnealingLR
dropout       = 0.1
batch_size    = 64
seq_length    = 128
```

### **Purpose and Expected Behavior**

Scribe-α serves as the baseline for subword modeling experiments.
It is capable of learning token-level morphology, phrase structure, and simple line-based formatting.
Its performance already surpasses Astra-α due to the inherent modeling advantages of BPE segmentation—not architectural complexity.

---

## **2. Scribe-β Model (Medium Configuration)**

### **Architectural Description**

Scribe-β significantly expands representational capacity through:

* **Deeper recurrence (3 GRU layers)**
* Wider hidden state (512)
* Higher-dimensional embeddings (256)

This configuration is designed for intermediate-scale modeling tasks, providing robust learning of mid-range dependencies (e.g., multi-token expressions, sentence-like structures, and repeated theatrical phrasing).

The architecture consists of:

* Embedding layer
* GRU Layer 1 → GRU Layer 2 → GRU Layer 3
* Linear projection to vocabulary size

### **Hyperparameter Configuration**

```
embedding_dim = 256
hidden_dim    = 512
num_layers    = 3
epochs        = 20
learning_rate = 1.5e-3
weight_decay  = 0.01
optimizer     = AdamW
scheduler     = CosineAnnealingLR
dropout       = 0.1
batch_size    = 64
seq_length    = 128
```

### **Purpose and Expected Behavior**

Scribe-β is designed as the **balanced workhorse** of the series.
It is expected to produce coherent multi-line outputs, exhibit stable training curves, and capture much richer structural and stylistic patterns compared to Scribe-α.

This model is well-suited for realistic Shakespeare-like generation without excessive computational requirements.

---

## **3. Scribe-γ Model (Large Configuration)**

### **Architectural Description**

Scribe-γ is the most advanced and expressive model in the Scribe-GRU family.

Relative to Astra-γ and even Scribe-β, it introduces:

* A **4-layer GRU stack**
* High-resolution embeddings (384)
* Substantial hidden dimensionality (768)
* Increased dropout for stabilization

This model maximizes recurrent depth, token-level abstraction, and long-range coherence—essential for producing realistic Shakespearean dialogue that spans multiple sentences or lines.

The architecture includes:

* Embedding layer
* GRU Layer 1 → GRU Layer 2 → GRU Layer 3 → GRU Layer 4
* Linear prediction head

### **Hyperparameter Configuration**

```
embedding_dim = 384
hidden_dim    = 768
num_layers    = 4
epochs        = 30
learning_rate = 1e-3
weight_decay  = 0.01
optimizer     = AdamW
scheduler     = CosineAnnealingLR
dropout       = 0.15
batch_size    = 64
seq_length    = 128
```

### **Purpose and Expected Behavior**

Scribe-γ is optimized for **high-fidelity subword generation**, offering:

* Strong multi-sentence continuity
* Accurate dialogue formatting
* Rich stylistic imitation of Shakespeare’s syntax and rhythm
* Far fewer nonsensical outputs compared to character-level models

It stands as the flagship configuration for experiments in recurrent generative modeling.

---

# **Summary**

The Scribe-GRU series—**Scribe-α**, **Scribe-β**, and **Scribe-γ**—reflects a progression in both **architectural depth** and **semantic richness** enabled by subword-level tokenization.

Here is a **single unified comparison table** that merges **architecture**, **capacity**, and **parameter statistics** for the entire **Scribe-GRU Series**.

Perfect for documentation, reports, or GitHub READMEs.

---

# **Scribe-GRU Model Comparison Table**

| Model        | Layers | Hidden | Embedding | Parameters | Size (MB) |  Params (M) | Expected Behavior                                     |
| ------------ | ------ | ------ | --------- | ---------: | --------: | ----------: | ----------------------------------------------------- |
| **Scribe-α** | 2      | 256    | 128       |  1,075,688 |   4.10 MB |  **1.08 M** | Baseline subword LM; learns local & midrange patterns |
| **Scribe-β** | 3      | 512    | 256       |  5,102,056 |  19.46 MB |  **5.10 M** | Strong coherence and phrase-level structure           |
| **Scribe-γ** | 4      | 768    | 384       | 14,439,400 |  55.08 MB | **14.44 M** | High-quality fluent generation; best global structure |


In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vocab_size = sp.get_piece_size()   # Subword vocabulary size from SentencePiece


# ------------------------------------------------------- Scribe-α -------------------------------------------------------
small_model_name = "Scribe-α"
model_small = MyGRUModel(
    vocab_size = vocab_size,
    embd_dim = 128,          # Larger than Astra to match subword semantics
    hidden_dim = 256,
    num_layers = 2,
    model_name = small_model_name,
    dropout = 0.1
).to(device)

lr_small_model = 2e-3
epochs_small_model = 10
weight_decay_small = 0.01

optimizer_small = torch.optim.AdamW(
    model_small.parameters(),
    lr = lr_small_model,
    weight_decay = weight_decay_small
)

scheduler_small = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer_small,
    T_max = epochs_small_model
)


# ------------------------------------------------------- Scribe-β -------------------------------------------------------
medium_model_name = "Scribe-β"
model_medium = MyGRUModel(
    vocab_size = vocab_size,
    embd_dim = 256,
    hidden_dim = 512,
    num_layers = 3,          # deeper than fevious Astra-β
    model_name = medium_model_name,
    dropout = 0.1
).to(device)

lr_medium_model = 1.5e-3
epochs_medium_model = 15
weight_decay_medium = 0.01

optimizer_medium = torch.optim.AdamW(
    model_medium.parameters(),
    lr = lr_medium_model,
    weight_decay = weight_decay_medium
)

scheduler_medium = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer_medium,
    T_max = epochs_medium_model
)

# ------------------------------------------------------- Scribe-γ -------------------------------------------------------
large_model_name = "Scribe-γ"
model_large = MyGRUModel(
    vocab_size = vocab_size,
    embd_dim = 384,          # very high-quality embeddings
    hidden_dim = 768,        # large recurrent representation
    num_layers = 4,          # deeper than Astra-γ
    model_name = large_model_name,
    dropout = 0.15           # slightly more dropout for stability
).to(device)

lr_large_model = 1e-3
epochs_large_model = 20
weight_decay_large = 0.01

optimizer_large = torch.optim.AdamW(
    model_large.parameters(),
    lr = lr_large_model,
    weight_decay = weight_decay_large
)

# CosineAnnealingLR scheduler: It will produce cleaner convergence and noticeably better text quality.
scheduler_large = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer_large,
    T_max = epochs_large_model
)

loss_fn = nn.CrossEntropyLoss()

### Scribe-α Model Summary

In [13]:
print_model_summary(
    model      = model_small,
    model_name = small_model_name,
    epochs     = epochs_small_model,
    lr         = lr_small_model,
    device     = device
)


ASTRA-GRU MODEL SUMMARY
Model Name       : Scribe-α
Device           : cuda
Total Epochs     : 10
Learning Rate    : 0.002

MODEL ARCHITECTURE
----------------------------------------------------------------------------------------------------
  └── embedding: Embedding()
  └── layers: ModuleList()
  └── layers.0: GRULayer()
  └── layers.0.grucell: GRUCell()
  └── layers.0.grucell.Wx: Linear()
  └── layers.0.grucell.Wh: Linear()
  └── layers.0.grucell.Wzx: Linear()
  └── layers.0.grucell.Wzh: Linear()
  └── layers.0.grucell.Wrx: Linear()
  └── layers.0.grucell.Wrh: Linear()
  └── layers.0.dropout: Dropout()
  └── layers.1: GRULayer()
  └── layers.1.grucell: GRUCell()
  └── layers.1.grucell.Wx: Linear()
  └── layers.1.grucell.Wh: Linear()
  └── layers.1.grucell.Wzx: Linear()
  └── layers.1.grucell.Wzh: Linear()
  └── layers.1.grucell.Wrx: Linear()
  └── layers.1.grucell.Wrh: Linear()
  └── layers.1.dropout: Dropout()
  └── fc: Linear()
  └── fc.linear_projection: Linear()
-------------

### Scribe-α Training Phase

In [14]:
model_small = train_model_with_early_stopping(
    model       = model_small,
    optimizer   = optimizer_small,
    scheduler   = scheduler_small,
    loss_fn     = loss_fn,
    epochs      = epochs_small_model,
    batch_size  = 64,
    device      = device,
    steps       = 300
)


---------------- Training Started for Scribe-α Model ----------------

Epoch 01/10 | Train Loss: 4.3808 | Val Loss: 4.0454
>>> Improved validation! Best checkpoint saved at epoch 1.
Epoch 02/10 | Train Loss: 3.4709 | Val Loss: 3.9232
>>> Improved validation! Best checkpoint saved at epoch 2.
Epoch 03/10 | Train Loss: 3.2741 | Val Loss: 3.8526
>>> Improved validation! Best checkpoint saved at epoch 3.
Epoch 04/10 | Train Loss: 3.1582 | Val Loss: 3.9209
>>> No improvement (1/5).
Epoch 05/10 | Train Loss: 3.0762 | Val Loss: 3.9130
>>> No improvement (2/5).
Epoch 06/10 | Train Loss: 3.0184 | Val Loss: 4.0360
>>> No improvement (3/5).
Epoch 07/10 | Train Loss: 2.9788 | Val Loss: 3.9921
>>> No improvement (4/5).
Epoch 08/10 | Train Loss: 2.9495 | Val Loss: 4.0641
>>> No improvement (5/5).

Training stopped at epoch 8. Best epoch: 3 | Best Val Loss: 3.8526

Loaded best checkpoint: ./best_checkpoints/Scribe-α_best.pth

---------------- Training Completed for Scribe-α Model ----------------



  model.load_state_dict(torch.load(best_path, map_location=device))


### Scribe-α Autoregressive Generation Phase

#### Greedy Sampling

In [15]:
print(sample_greedy(model_small, sp, "ROMEO: "))

ROMEO: I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have I have


#### Temperature Sampling

In [16]:
print(sample_with_temperature(model_small, sp, "KING: ", temperature=0.8))

KING: Death! BUCKINGS KING HENRY BUCKINGSlike mock You show in bles, what citizens in this armo'tunin. KING EDWARD: Ange: What, and what are king to me not transed in sibew me: married Pet down, and be judged alrewell, not the chargughts to our encap, and envent crand Harrow am home; and dience sake that I have me? KING EDWARD: I have you will I earthly the most, and come That were my leave for the line! Then, and the cloud that att! Ourst thou turn back; chapet we have I make all: Thomest I think if the stirs unto ay by am glfend me: Thou arthips: God, nor


#### Top-k sampling

In [17]:
print(sample_top_k(model_small, sp, "FIRST CITIZEN: ", k=30))

FIRST CITIZEN: I love Somest thou for your close, for him. LEONTES: Moward him? First Shood: 'Tis servant frain. LADY ANTI VI EDWARD: And ste. DUKEATppy Couchbumpherst thou, the vens and my brother? Nursed of all your husband now, To the glicils with the sucking, And cause 'tun and unfectured on the tempt from the house to this is this man washind, and to thee then, a stood my life. LEONTES: 'tis you want furd in the sight: I shall heard. ROMEO: What, for your son! The tribon reason for you wilt not a mets the pass? What'd by my mercy. AURENCE. KING HENRY VI CAMPHOMour


### Scribe-α Model Checkpoint Saving and Archival

In [18]:
save_model(
    model      = model_small,
    base_name  = "Scribe_alpha",
    path       = "./scribe_saved_models"
)


Model saved at: ./scribe_saved_models\Scribe_alpha_v1.pth
Also updated: Scribe_alpha_latest.pth



'./scribe_saved_models\\Scribe_alpha_v1.pth'

### Scribe-β Model Summary

In [19]:
print_model_summary(
    model      = model_medium,
    model_name = medium_model_name,
    epochs     = epochs_medium_model,
    lr         = lr_medium_model,
    device     = device
)


ASTRA-GRU MODEL SUMMARY
Model Name       : Scribe-β
Device           : cuda
Total Epochs     : 15
Learning Rate    : 0.0015

MODEL ARCHITECTURE
----------------------------------------------------------------------------------------------------
  └── embedding: Embedding()
  └── layers: ModuleList()
  └── layers.0: GRULayer()
  └── layers.0.grucell: GRUCell()
  └── layers.0.grucell.Wx: Linear()
  └── layers.0.grucell.Wh: Linear()
  └── layers.0.grucell.Wzx: Linear()
  └── layers.0.grucell.Wzh: Linear()
  └── layers.0.grucell.Wrx: Linear()
  └── layers.0.grucell.Wrh: Linear()
  └── layers.0.dropout: Dropout()
  └── layers.1: GRULayer()
  └── layers.1.grucell: GRUCell()
  └── layers.1.grucell.Wx: Linear()
  └── layers.1.grucell.Wh: Linear()
  └── layers.1.grucell.Wzx: Linear()
  └── layers.1.grucell.Wzh: Linear()
  └── layers.1.grucell.Wrx: Linear()
  └── layers.1.grucell.Wrh: Linear()
  └── layers.1.dropout: Dropout()
  └── layers.2: GRULayer()
  └── layers.2.grucell: GRUCell()
  └── l

### Scribe-β Training Phase

In [20]:
model_medium = train_model_with_early_stopping(
    model       = model_medium,
    optimizer   = optimizer_medium,
    scheduler   = scheduler_medium,
    loss_fn     = loss_fn,
    epochs      = epochs_medium_model,
    batch_size  = 64,
    device      = device,
    steps       = 350
)


---------------- Training Started for Scribe-β Model ----------------

Epoch 01/15 | Train Loss: 3.9959 | Val Loss: 3.9552
>>> Improved validation! Best checkpoint saved at epoch 1.
Epoch 02/15 | Train Loss: 2.9168 | Val Loss: 4.1868
>>> No improvement (1/5).
Epoch 03/15 | Train Loss: 2.5432 | Val Loss: 4.3154
>>> No improvement (2/5).
Epoch 04/15 | Train Loss: 2.3196 | Val Loss: 4.3671
>>> No improvement (3/5).
Epoch 05/15 | Train Loss: 2.1622 | Val Loss: 4.5978
>>> No improvement (4/5).
Epoch 06/15 | Train Loss: 2.0360 | Val Loss: 4.7648
>>> No improvement (5/5).

Training stopped at epoch 6. Best epoch: 1 | Best Val Loss: 3.9552

Loaded best checkpoint: ./best_checkpoints/Scribe-β_best.pth

---------------- Training Completed for Scribe-β Model ----------------



  model.load_state_dict(torch.load(best_path, map_location=device))


### Scribe-β Autoregressive Generation Phase

#### Greedy Sampling

In [21]:
print(sample_greedy(model_medium, sp, "ROMEO: "))

ROMEO: I have done, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free, and free,


#### Temperature Sampling

In [22]:
print(sample_with_temperature(model_medium, sp, "KING: ", temperature=0.8))

KING: Cans of my poor eyes Wheretinge amen's that-axford 'Wisten of a grabout wilt were full made Montague ange. AN Grewife be a pay be fools to the earth, with me! Fareless like numbemate the and shiles, More, lastever the provipherish, and give; wholy in the wast? She to be deadly new, I now, To suffice and like with her, That I, and dwell, Ox are the court it may faith home? CORIOLANUS: They thus ga! Abever, That times, Avant overs never he was, that hath soleLAND: Let him; Only I amongs have done: My lies in our day of yourselves to one love


#### Top-k sampling

In [23]:
print(sample_top_k(model_medium, sp, "FIRST CITIZEN: ", k=30))

FIRST CITIZEN: For my law, Against, To the ready, and force, my cuteel'shiccent, that I do more. Fromise, I speak. Second Servant in her, to dost thou art once with thee? I say, More-d for a falted at Look: Let him? Most as my tities. Make my father that I was done. QUEEN ELIZABETH: Why, and his brothers from this cast onceard, and to behos. Hads me To hear, and a tarnes to death, my sons, in him: The queen, See: Oxtain, and reverence that will, I have I would they are here with him and phy, Liffector to a curst to the cause I would have you have frown the duke! Watch, fort


### Scribe-β Model Checkpoint Saving and Archival

In [24]:
save_model(
    model      = model_medium,
    base_name  = "Scribe_beta",
    path       = "./scribe_saved_models"
)


Model saved at: ./scribe_saved_models\Scribe_beta_v1.pth
Also updated: Scribe_beta_latest.pth



'./scribe_saved_models\\Scribe_beta_v1.pth'

### Scribe-γ Model Summary

In [25]:
print_model_summary(
    model      = model_large,
    model_name = large_model_name,
    epochs     = epochs_large_model,
    lr         = lr_large_model,
    device     = device
)


ASTRA-GRU MODEL SUMMARY
Model Name       : Scribe-γ
Device           : cuda
Total Epochs     : 20
Learning Rate    : 0.001

MODEL ARCHITECTURE
----------------------------------------------------------------------------------------------------
  └── embedding: Embedding()
  └── layers: ModuleList()
  └── layers.0: GRULayer()
  └── layers.0.grucell: GRUCell()
  └── layers.0.grucell.Wx: Linear()
  └── layers.0.grucell.Wh: Linear()
  └── layers.0.grucell.Wzx: Linear()
  └── layers.0.grucell.Wzh: Linear()
  └── layers.0.grucell.Wrx: Linear()
  └── layers.0.grucell.Wrh: Linear()
  └── layers.0.dropout: Dropout()
  └── layers.1: GRULayer()
  └── layers.1.grucell: GRUCell()
  └── layers.1.grucell.Wx: Linear()
  └── layers.1.grucell.Wh: Linear()
  └── layers.1.grucell.Wzx: Linear()
  └── layers.1.grucell.Wzh: Linear()
  └── layers.1.grucell.Wrx: Linear()
  └── layers.1.grucell.Wrh: Linear()
  └── layers.1.dropout: Dropout()
  └── layers.2: GRULayer()
  └── layers.2.grucell: GRUCell()
  └── la

### Scribe-γ Training Phase

In [26]:
model_large = train_model_with_early_stopping(
    model       = model_large,
    optimizer   = optimizer_large,
    scheduler   = scheduler_large,
    loss_fn     = loss_fn,
    epochs      = epochs_large_model,
    batch_size  = 64,
    device      = device,
    steps       = 450     
)


---------------- Training Started for Scribe-γ Model ----------------

Epoch 01/20 | Train Loss: 5.8324 | Val Loss: 5.0179
>>> Improved validation! Best checkpoint saved at epoch 1.
Epoch 02/20 | Train Loss: 4.4485 | Val Loss: 4.5103
>>> Improved validation! Best checkpoint saved at epoch 2.
Epoch 03/20 | Train Loss: 3.6452 | Val Loss: 4.1581
>>> Improved validation! Best checkpoint saved at epoch 3.
Epoch 04/20 | Train Loss: 3.2797 | Val Loss: 4.2000
>>> No improvement (1/5).
Epoch 05/20 | Train Loss: 3.0617 | Val Loss: 4.2525
>>> No improvement (2/5).
Epoch 06/20 | Train Loss: 2.9104 | Val Loss: 4.3303
>>> No improvement (3/5).
Epoch 07/20 | Train Loss: 2.7828 | Val Loss: 4.3232
>>> No improvement (4/5).
Epoch 08/20 | Train Loss: 2.6766 | Val Loss: 4.4206
>>> No improvement (5/5).

Training stopped at epoch 8. Best epoch: 3 | Best Val Loss: 4.1581

Loaded best checkpoint: ./best_checkpoints/Scribe-γ_best.pth

---------------- Training Completed for Scribe-γ Model ----------------



  model.load_state_dict(torch.load(best_path, map_location=device))


### Scribe-γ Autoregressive Generation Phase

#### Greedy Sampling

In [27]:
print(sample_greedy(model_large, sp, "ROMEO: "))

ROMEO: I am the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and the charger, and


#### Temperature Sampling

In [28]:
print(sample_with_temperature(model_large, sp, "KING: ", temperature=0.8))

KING: He had a cast; I sehile I sievid: This and to act to shameled, Tillabley mean not, Forting a diing blood will behoppifts, But bes, By my sons tovick me, I have become, sir, And bear to do a holaat and then, joft, nex our bace and then, sir That shall kiss what, nor needful tracet on thee: The country. ROMEO: His savenion, my miscond myself, which a lost me to have die, When I am o't, my news, What to to Edward is enter! Hint To live, for cure Hath therefore in thy brother hadely, some presence a comfiver, that she'sting my jon a supy. Would not a


#### Top-k sampling

In [29]:
print(sample_top_k(model_large, sp, "FIRST CITIZEN: ", k=30))

FIRST CITIZEN: For the strull; and reumeverless in me now as he had the people, if the suvid of wared, in mines, with some falblasaals, which to her. KING RICHARDoth, which and breat to me that I dress'fush, then so. KING RICHARD III: Aprembal, and his brother that of ser that he was, when to reason of my valless touch a faining by her lights, as I could have the daal, And give my son; if the suppiigh, a deeper, And indees, he shall be while. FRAD: So. YORK: Why to-bemes; And I have so cast we may be a wish-s; And that he would say. What is strief, it; the dies


### Scribe-γ Model Checkpoint Saving and Archival

In [30]:
save_model(
    model      = model_large,
    base_name  = "Scribe_gamma",
    path       = "./scribe_saved_models"
)


Model saved at: ./scribe_saved_models\Scribe_gamma_v1.pth
Also updated: Scribe_gamma_latest.pth



'./scribe_saved_models\\Scribe_gamma_v1.pth'

## Loading the Saved Model

In [31]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

scribe_alpha = load_model(
    filepath = "./scribe_saved_models/Scribe_alpha_latest.pth",
    device   = device
)

scribe_beta = load_model(
    filepath = "./scribe_saved_models/Scribe_beta_latest.pth",
    device   = device
)

scribe_gamma = load_model(
    filepath = "./scribe_saved_models/Scribe_gamma_latest.pth",
    device   = device
)


Loaded File      : ./scribe_saved_models/Scribe_alpha_latest.pth
Model Name       : Scribe-α
Model Class      : MyGRUModel
Version          : v1
Timestamp        : 2025-12-14 23:18:14
----------------------------------------------
Model Architecture:
  └── embedding: Embedding
  └── layers: ModuleList
  └── layers.0: GRULayer
  └── layers.0.grucell: GRUCell
  └── layers.0.grucell.Wx: Linear
  └── layers.0.grucell.Wh: Linear
  └── layers.0.grucell.Wzx: Linear
  └── layers.0.grucell.Wzh: Linear
  └── layers.0.grucell.Wrx: Linear
  └── layers.0.grucell.Wrh: Linear
  └── layers.0.dropout: Dropout
  └── layers.1: GRULayer
  └── layers.1.grucell: GRUCell
  └── layers.1.grucell.Wx: Linear
  └── layers.1.grucell.Wh: Linear
  └── layers.1.grucell.Wzx: Linear
  └── layers.1.grucell.Wzh: Linear
  └── layers.1.grucell.Wrx: Linear
  └── layers.1.grucell.Wrh: Linear
  └── layers.1.dropout: Dropout
  └── fc: Linear
  └── fc.linear_projection: Linear
----------------------------------------------
Tot

  checkpoint = torch.load(filepath, map_location=device)



Loaded File      : ./scribe_saved_models/Scribe_gamma_latest.pth
Model Name       : Scribe-γ
Model Class      : MyGRUModel
Version          : v1
Timestamp        : 2025-12-15 01:12:33
----------------------------------------------
Model Architecture:
  └── embedding: Embedding
  └── layers: ModuleList
  └── layers.0: GRULayer
  └── layers.0.grucell: GRUCell
  └── layers.0.grucell.Wx: Linear
  └── layers.0.grucell.Wh: Linear
  └── layers.0.grucell.Wzx: Linear
  └── layers.0.grucell.Wzh: Linear
  └── layers.0.grucell.Wrx: Linear
  └── layers.0.grucell.Wrh: Linear
  └── layers.0.dropout: Dropout
  └── layers.1: GRULayer
  └── layers.1.grucell: GRUCell
  └── layers.1.grucell.Wx: Linear
  └── layers.1.grucell.Wh: Linear
  └── layers.1.grucell.Wzx: Linear
  └── layers.1.grucell.Wzh: Linear
  └── layers.1.grucell.Wrx: Linear
  └── layers.1.grucell.Wrh: Linear
  └── layers.1.dropout: Dropout
  └── layers.2: GRULayer
  └── layers.2.grucell: GRUCell
  └── layers.2.grucell.Wx: Linear
  └── layer

## Sampling from the Loaded Model

In [32]:
print(sample_top_k(scribe_alpha, sp, "FIRST CITIZEN: ", k=30))


FIRST CITIZEN: Master me, for the king! Mort me, Com of loveday: Montent To the proud, my heart of heaven: The died. But his power: If thou shake Mercy's at her, I have you doges 'tis a bodies I cause your pale lips: Mow'd to sitive, forteen fortake, To-jector, I do me, youthough forceven. KING EDWARD IV: he be done; for a tendsalks, welack' the depon; whole is a little. I say to hear her? CORIOLANUS: Liewitors, sir: he doth burdraveltheades 'tis with it was ABY Aughting: Ty: Occess. WARWICK, letters; and helding to do it be


In [33]:
print(sample_top_k(scribe_beta, sp, "FIRST CITIZEN: ", k=30))


FIRST CITIZEN: What chargage, Saint, That hell'ppy danch! Once that wearnees of the queen. DUKE VINCENTIO: Nay, the sorrow I save so great din, That misest that hellows I want, Shaping, sir, and I have been: There'll, Monton, Meethurst thou hastealten: What stain of your grace? First: Let us I, Abs. Cory-dge with you well-hes on the duke, behad at it, be ready Must; And damshipurlough, to my souls of my meth I should befallhipulous len the gave a man: Thou can chell in my bothes to his chollenerible, and fore, as helding


In [34]:
print(sample_top_k(scribe_gamma, sp, "FIRST CITIZEN: ", k=30))

FIRST CITIZEN: I do so and he while? DUKEALGLLLLO: 's; 's. Plam's and a bad, thou beling to me. Ataves him, At. But he hath pramed the pale? O'd to me at the cursed. What? Hapts, which it. DUKE VINCENTIO: I am a courtler a doupard. VIure, to him as thou, thou wilt have strong in my sort and his sats. KING RICHARDoth on his deemerly is, as your father that I do a gavers, and gl: I have sect the people, I have to the courtle the sor. DUKE OF GLOUCESTER: 'bits. CORIOLANUS: But he would be dass: I prayed my mercile to the stolily, my f
