# Rolling State Memory (RSM) - Experimental Notebook

**Implementation using modular architecture**

This notebook demonstrates RSM training and evaluation using the clean architecture from `hybrid_transformer1.py`.

**Key improvements:**
- Modular imports from `hybrid_transformer1.py`
- Multiple dataset/tokenization options
- Clean experimental setup

## 1. Setup and Imports
Import all necessary components from our architecture module.

In [1]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Dict, Optional, Tuple

# Import ALL RSM components from our clean architecture
from hybrid_transformer1 import (
    # NNDL Core Functions (Section 1)
    scaled_dot_attention,
    PositionalEncoding,
    MLP,
    CausalSelfAttention,
    
    # Memory Components (Section 2)
    CrossAttention,
    GatedSSM,
    GlobalSyncLayer,
    
    # Architecture (Section 3)
    HybridTransformerBlock,
    HybridTransformer,
    
    # Training Utilities (Section 4)
    train_rsm_epoch,
    
    # Generation Utilities (Section 5)
    generate_with_rsm,
    
    # Dataset Utilities (Section 6)
    ChunkedSequenceDataset,
    
    # Helper Functions (Section 7)
    create_rsm_model,
    save_checkpoint,
    load_checkpoint,
    count_parameters,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
print(f"PyTorch version: {torch.__version__}")
print("✓ Imported RSM architecture from hybrid_transformer1.py")

✓ hybrid_transformer1 module loaded.
Device: cpu
PyTorch version: 2.0.0
✓ Imported RSM architecture from hybrid_transformer1.py


## 2. Dataset & Tokenization Options

Choose your dataset and tokenization scheme based on your experiment goals.

### 📚 Dataset Options:

#### **Option 1: TinyStories** (Microsoft Research, 2023) ⭐
- **Paper:** "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?"
- **Authors:** Eldan & Li (2023)
- **Size:** 2.1M synthetic stories generated by GPT-3.5 and GPT-4
- **Length:** 500-2000 tokens per story
- **Use case:** Standard benchmark for small language models
- **Citation:** https://arxiv.org/abs/2305.07759

#### **Option 2: Tiny Shakespeare** (Karpathy's char-rnn) ⭐
- **Source:** A subset of Shakespeare's works, concatenated into a single text file
- **Size:** ~1MB text, ~1M characters
- **Vocab:** ~65 unique characters
- **Use case:** Fast experiments, character-level modeling benchmark
- **Citation:** https://github.com/karpathy/char-rnn

#### **Option 3: WikiText-103** (Salesforce Research)
- **Paper:** "Pointer Sentinel Mixture Models"
- **Authors:** Merity et al. (2016)
- **Size:** 103M tokens from Wikipedia
- **Use case:** Long-form text, established benchmark
- **Citation:** https://arxiv.org/abs/1609.07843

#### **Option 4: Custom Text**
- Load your own `.txt` files
- Suitable for domain-specific applications

---

### 🔤 Tokenization Options:

#### **Method 1: BPE (Byte-Pair Encoding)** ⭐
- **Tokenizer:** GPT-2 (50K vocab)
- **Efficiency:** ~4 characters per token
- **Pros:** Semantic units, no OOV, pretrained
- **Cons:** Large vocabulary

#### **Method 2: SentencePiece** (Custom vocab size)
- **Vocab size:** Configurable (2K-8K typical)
- **Pros:** Trainable on your data, balanced
- **Cons:** Requires training step

#### **Method 3: Character-level**
- **Vocab size:** ~100 characters
- **Pros:** Simple, no OOV, works for any text
- **Cons:** Very long sequences, harder to learn patterns

In [None]:
# ============================================================================
# DATASET LOADING - Choose one option
# ============================================================================

# ----------------------------------------------------------------------------
# OPTION 1: TinyStories + GPT-2 BPE Tokenizer (COMMENTED OUT - using Shakespeare)
# ----------------------------------------------------------------------------
# Paper: "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?"
# Eldan & Li, Microsoft Research, 2023
# https://arxiv.org/abs/2305.07759

# try:
#     from datasets import load_dataset
#     from transformers import GPT2Tokenizer
#     
#     print("=" * 80)
#     print("LOADING TINYSTORIES DATASET")
#     print("=" * 80)
#     
#     # Load dataset from HuggingFace
#     tinystories = load_dataset("roneneldan/TinyStories", split="train[:500]")
#     print(f"✓ Loaded {len(tinystories)} stories from TinyStories")
#     
#     # Load GPT-2 tokenizer (50,257 vocab)
#     tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
#     tokenizer.pad_token = tokenizer.eos_token
#     vocab_size = len(tokenizer)
#     
#     print(f"✓ GPT-2 BPE Tokenizer loaded")
#     print(f"  Vocabulary size: {vocab_size:,} tokens")
#     print(f"  Tokenization: Byte-Pair Encoding")
#     
#     # Tokenize all stories
#     all_tokens = []
#     for story in tinystories:
#         tokens = tokenizer.encode(story['text'])
#         all_tokens.extend(tokens)
#     
#     print(f"✓ Tokenized {len(all_tokens):,} tokens total")
#     print("=" * 80)
#     
#     DATASET_NAME = "TinyStories"
#     TOKENIZATION = "BPE (GPT-2)"
#     
# except ImportError:
#     print("⚠ HuggingFace libraries not installed")
#     print("Install with: pip install datasets transformers")
#     print("\nFalling back to character-level tokenization...")
#     
#     # Fallback to character-level with sample text
#     sample_text = "Once upon a time, there was a little girl. " * 100
#     
#     # Character-level tokenizer
#     chars = sorted(list(set(sample_text)))
#     char_to_idx = {ch: i for i, ch in enumerate(chars)}
#     idx_to_char = {i: ch for i, ch in enumerate(chars)}
#     
#     all_tokens = [char_to_idx[ch] for ch in sample_text]
#     vocab_size = len(chars)
#     
#     print(f"✓ Character-level tokenization")
#     print(f"  Vocabulary size: {vocab_size} characters")
#     print(f"  Total tokens: {len(all_tokens):,}")
#     
#     DATASET_NAME = "Sample Text"
#     TOKENIZATION = "Character-level"
# 
# print(f"\n📊 Dataset: {DATASET_NAME}")
# print(f"🔤 Tokenization: {TOKENIZATION}")
# print(f"📝 Vocabulary: {vocab_size:,} tokens")

## 2b. Alternative: Tiny Shakespeare Dataset

This cell is active by default. It downloads Tiny Shakespeare and tokenizes it at the character level, which is faster and simpler than the TinyStories option above.

In [2]:
# ============================================================================
# TINY SHAKESPEARE DATASET - ACTIVE
# ============================================================================
# Classic benchmark - a subset of Shakespeare's works (~1MB, ~1M characters)
# Perfect for quick experiments and testing
# https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

import requests

print("=" * 80)
print("LOADING TINY SHAKESPEARE")
print("=" * 80)

# Download Tiny Shakespeare
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
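response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors instead of tokenizing an error page
text = response.text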

print(f"✓ Downloaded {len(text):,} characters")

# Character-level tokenization
chars = sorted(list(set(text)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

all_tokens = [char_to_idx[ch] for ch in text]
vocab_size = len(chars)

DATASET_NAME = "Tiny Shakespeare"
TOKENIZATION = "Character-level"

print("✓ Character-level tokenization")
print(f"  Vocabulary: {vocab_size} unique characters")
print(f"  Total tokens: {len(all_tokens):,}")
print("=" * 80)

print(f"\n📊 Dataset: {DATASET_NAME}")
print(f"🔤 Tokenization: {TOKENIZATION}")
print(f"📝 Vocabulary: {vocab_size:,} tokens")

# Preview
print("\n📖 Preview:")
print(text[:200])

LOADING TINY SHAKESPEARE
✓ Downloaded 1,115,394 characters
✓ Character-level tokenization
  Vocabulary: 65 unique characters
  Total tokens: 1,115,394

📊 Dataset: Tiny Shakespeare
🔤 Tokenization: Character-level
📝 Vocabulary: 65 tokens

📖 Preview:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


## 3. Alternative Dataset Options (Commented)

Uncomment any of these to try different datasets:

In [None]:
# ----------------------------------------------------------------------------
# OPTION 3: WikiText-103 + GPT-2 Tokenizer
# ----------------------------------------------------------------------------
# Paper: "Pointer Sentinel Mixture Models"
# Merity et al., Salesforce Research, 2016
# https://arxiv.org/abs/1609.07843
#
# from datasets import load_dataset
# from transformers import GPT2Tokenizer
#
# wikitext = load_dataset("wikitext", "wikitext-103-v1", split="train")
# tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# tokenizer.pad_token = tokenizer.eos_token
#
# all_tokens = []
# for article in wikitext:
#     tokens = tokenizer.encode(article['text'])
#     all_tokens.extend(tokens)
#
# vocab_size = len(tokenizer)
# DATASET_NAME = "WikiText-103"
# TOKENIZATION = "BPE (GPT-2)"

# ----------------------------------------------------------------------------
# OPTION 4: Custom Text File + SentencePiece
# ----------------------------------------------------------------------------
# Train custom tokenizer on your data
# https://github.com/google/sentencepiece
#
# import sentencepiece as spm
#
# # Train SentencePiece model
# spm.SentencePieceTrainer.train(
#     input='your_data.txt',
#     model_prefix='custom_tokenizer',
#     vocab_size=4096,  # Adjustable
#     character_coverage=0.9995,
#     model_type='bpe'
# )
#
# # Load trained tokenizer
# tokenizer = spm.SentencePieceProcessor(model_file='custom_tokenizer.model')
#
# # Tokenize your text
# with open('your_data.txt', 'r') as f:
#     text = f.read()
# all_tokens = tokenizer.encode(text, out_type=int)
# vocab_size = tokenizer.vocab_size()
#
# DATASET_NAME = "Custom Text"
# TOKENIZATION = "SentencePiece BPE"

# ----------------------------------------------------------------------------
# OPTION 4 (variant): Character-level on Custom Text
# ----------------------------------------------------------------------------
# Simplest option - no dependencies
#
# with open('your_data.txt', 'r') as f:
#     text = f.read()
#
# chars = sorted(list(set(text)))
# char_to_idx = {ch: i for i, ch in enumerate(chars)}
# idx_to_char = {i: ch for i, ch in enumerate(chars)}
#
# all_tokens = [char_to_idx[ch] for ch in text]
# vocab_size = len(chars)
#
# DATASET_NAME = "Custom Text"
# TOKENIZATION = "Character-level"

print("Alternative dataset options are available as commented-out blocks in this cell")

## 4. Model Configuration and Creation

Now we'll create the Hybrid Transformer model using the factory function from `hybrid_transformer1.py`.

In [3]:
# Model hyperparameters
hidden_size = 256          # Hidden dimension (d_model)
num_heads = 4              # Number of attention heads
num_layers = 8             # Number of transformer blocks (6, 8, 10)
num_memory_slots = 32      # Number of external memory slots
chunk_size = 512           # Context window size (max_seq_len)
dropout = 0.1              # Dropout rate
use_global_sync = True     # Whether to use global synchronization layer

# Create model using factory function
model, global_sync, config = create_rsm_model(
    vocab_size=vocab_size,
    hidden_size=hidden_size,
    num_layers=num_layers,
    num_heads=num_heads,
    num_memory_slots=num_memory_slots,
    chunk_size=chunk_size,
    dropout=dropout,
    use_global_sync=use_global_sync,
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

# Model is already created with summary printed by create_rsm_model()
# Additional info:
print("\n✓ Model ready for training")
print(f"✓ Global sync layer: {'Enabled' if use_global_sync else 'Disabled'}")

RSM MODEL CREATED
Vocabulary: 65 tokens
Hidden size: 256
Layers: 8
Attention heads: 4
Memory slots: 32
Chunk size: 512
Global sync: Yes

Parameters:
  Main model: 10,015,488
  Global sync: 525,824
  Total: 10,541,312
  Memory (float32): 40.2 MB

✓ Model ready for training
✓ Global sync layer: Enabled
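The memory figure in the summary follows directly from the parameter counts. The short check below redoes the arithmetic, assuming 4 bytes per float32 parameter and that the summary's "MB" means mebibytes (2^20 bytes); the second assumption is inferred from the numbers, not from the source of `create_rsm_model()`.

```python
# Parameter counts copied from the model summary printed above.
main_params = 10_015_488
sync_params = 525_824
total_params = main_params + sync_params

BYTES_PER_FLOAT32 = 4

size_bytes = total_params * BYTES_PER_FLOAT32
size_mib = size_bytes / 2**20   # mebibytes (1,048,576 bytes)
size_mb = size_bytes / 10**6    # decimal megabytes, shown for comparison

print(f"{total_params:,} parameters")
print(f"{size_mib:.1f} MiB ({size_mb:.1f} MB decimal)")
```

This covers the weights only. Gradients, optimizer state, and activations add to the footprint during training.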


## 5. Prepare Training Data

Create the dataset and dataloader using `ChunkedSequenceDataset` from `hybrid_transformer1.py`.

In [4]:
# Training parameters
chunk_size = 256       # Training chunk length in tokens (reassigns the 512 used at model creation)
batch_size = 32        # Batch size
num_epochs = 10        # Number of training epochs (50)
learning_rate = 3e-4   # Learning rate

# Create dataset
train_dataset = ChunkedSequenceDataset(
    tokens=all_tokens,
    chunk_size=chunk_size
)

# Create dataloader
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=0  # Set to 0 for simpler debugging
)

print(f"Training dataset prepared:")
print(f"  Total tokens: {len(all_tokens):,}")
print(f"  Chunk size: {chunk_size}")
print(f"  Batch size: {batch_size}")
print(f"  Number of batches: {len(train_loader):,}")
print(f"  Number of epochs: {num_epochs}")
print(f"  Learning rate: {learning_rate}")

Training dataset prepared:
  Total tokens: 1,115,394
  Chunk size: 256
  Batch size: 32
  Number of batches: 273
  Number of epochs: 10
  Learning rate: 0.0003
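For readers without access to `hybrid_transformer1.py`, the sketch below shows one common way to cut a token stream into next-token training pairs. It is an assumption about what a chunked dataset does in general, not the actual `ChunkedSequenceDataset` implementation, which may use overlapping windows, a different stride, or extra bookkeeping for memory state. The function name `make_chunks` is hypothetical.

```python
from typing import List, Tuple


def make_chunks(tokens: List[int], chunk_size: int) -> List[Tuple[List[int], List[int]]]:
    """Split a token stream into non-overlapping (input, target) pairs.

    Each target is the input shifted left by one token, so position t of the
    input is trained to predict token t + 1.
    """
    pairs = []
    # Every example needs chunk_size + 1 tokens: the extra one is the last target.
    for start in range(0, len(tokens) - chunk_size, chunk_size):
        window = tokens[start : start + chunk_size + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs


toy_tokens = list(range(10))
toy_pairs = make_chunks(toy_tokens, chunk_size=4)
for inputs, targets in toy_pairs:
    print(inputs, "->", targets)
```

Trailing tokens that cannot fill a complete window are dropped in this sketch, which keeps every example the same length and avoids padding.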


## 6. Training Loop

Train the model using `train_rsm_epoch()` from `hybrid_transformer1.py`.

In [None]:
# Setup optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training history
history = {
    'train_loss': [],
    'train_acc': []
}

# Training loop
print("Starting training...")
print("=" * 60)

for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    
    # train_rsm_epoch returns a dict with 'loss', 'accuracy', 'num_chunks'
    metrics = train_rsm_epoch(
        model=model,
        global_sync=global_sync,     # Global sync layer returned by create_rsm_model()
        data_iterator=train_loader,  # Iterable of training batches
        optimizer=optimizer,
        device='cuda' if torch.cuda.is_available() else 'cpu'
    )
    
    # Extract metrics
    loss = metrics['loss']
    acc = metrics['accuracy'] * 100  # Convert to percentage
    
    # Store history
    history['train_loss'].append(loss)
    history['train_acc'].append(acc)
    
    # Print progress
    print(f"Epoch {epoch+1:3d}/{num_epochs} | Loss: {loss:.4f} | Accuracy: {acc:.2f}%")
    
    # Save checkpoint every 10 epochs
    if (epoch + 1) % 10 == 0:
        save_checkpoint(
            filepath=f'checkpoint_epoch_{epoch+1}.pt',
            model=model,
            global_sync=global_sync,
            optimizer=optimizer,
            config=config,
            history=history
        )
        print(f"  → Checkpoint saved: checkpoint_epoch_{epoch+1}.pt")

print("=" * 60)
print("Training complete!")
print(f"Final Loss: {history['train_loss'][-1]:.4f}")
print(f"Final Accuracy: {history['train_acc'][-1]:.2f}%")

Starting training...
Epoch 1/10
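To help interpret the reported loss and accuracy, the sketch below computes mean next-token cross-entropy and top-1 accuracy in plain Python, then evaluates a model that assigns equal scores to all 65 characters. Whether `train_rsm_epoch()` computes its metrics exactly this way is an assumption; the helper name `cross_entropy_and_accuracy` is hypothetical.

```python
import math


def cross_entropy_and_accuracy(logits, targets):
    """Mean cross-entropy (in nats) and top-1 accuracy.

    logits:  list of per-position score lists, one score per vocabulary entry.
    targets: list with the correct token id for each position.
    """
    total_loss, correct = 0.0, 0
    for scores, target in zip(logits, targets):
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total_loss += log_z - scores[target]
        correct += int(scores.index(top) == target)
    n = len(targets)
    return total_loss / n, correct / n


# A model with no preference among 65 characters: every logit is equal.
shakespeare_vocab = 65
uniform_logits = [[0.0] * shakespeare_vocab for _ in range(3)]
baseline_loss, _ = cross_entropy_and_accuracy(uniform_logits, [0, 1, 2])
print(f"uniform baseline loss: {baseline_loss:.3f}")
print(f"uniform baseline perplexity: {math.exp(baseline_loss):.1f}")
```

The uniform baseline equals ln(65), so a training loss near that value indicates the model has not yet learned anything beyond guessing, and exp(loss) gives the per-character perplexity.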


## 7. Visualize Training Results

Plot the loss and accuracy curves.

In [None]:
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot loss
ax1.plot(history['train_loss'], label='Training Loss', linewidth=2)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('Training Loss over Epochs', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.legend()

# Plot accuracy
ax2.plot(history['train_acc'], label='Training Accuracy', color='green', linewidth=2)
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy (%)', fontsize=12)
ax2.set_title('Training Accuracy over Epochs', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend()

plt.tight_layout()
plt.show()

print(f"\nTraining Statistics:")
print(f"  Best Loss: {min(history['train_loss']):.4f} (Epoch {history['train_loss'].index(min(history['train_loss']))+1})")
print(f"  Best Accuracy: {max(history['train_acc']):.2f}% (Epoch {history['train_acc'].index(max(history['train_acc']))+1})")

## 8. Text Generation

Generate text samples using `generate_with_rsm()` from `hybrid_transformer1.py`.

In [None]:
# Set model to evaluation mode
model.eval()

# Generate text samples with different prompts (Shakespeare-themed)
prompts = [
    "ROMEO:",
    "To be or not to be",
    "What light through yonder"
]

print("Generated Text Samples:")
print("=" * 60)

for i, prompt in enumerate(prompts, 1):
    # Encode prompt
    if TOKENIZATION == "Character-level":
        prompt_tokens = [char_to_idx.get(ch, 0) for ch in prompt]
    else:
        prompt_tokens = tokenizer.encode(prompt)
    
    # Generate
    generated_tokens = generate_with_rsm(
        model=model,
        prompt_tokens=prompt_tokens,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        device='cuda' if torch.cuda.is_available() else 'cpu'
    )
    
    # Decode
    if TOKENIZATION == "Character-level":
        generated_text = ''.join([idx_to_char.get(idx, '?') for idx in generated_tokens])
    else:
        generated_text = tokenizer.decode(generated_tokens)
    
    print(f"\nSample {i}:")
    print(f"Prompt: \"{prompt}\"")
    print(f"Generated:\n{generated_text}")
    print("-" * 60)

print("\nGeneration complete!")
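The `temperature` and `top_k` arguments passed to `generate_with_rsm()` control how the next token is drawn. The sketch below shows the standard technique for a single step in plain Python; it is an illustration of temperature scaling with top-k filtering in general, and the internals of `generate_with_rsm()` may differ. The function name `sample_top_k` is hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from logits using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top_ids, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
toy_rng = random.Random(0)  # Seeded so repeated runs give the same draws
draws = [sample_top_k(toy_logits, temperature=0.8, top_k=2, rng=toy_rng) for _ in range(1000)]
print({token_id: draws.count(token_id) for token_id in sorted(set(draws))})
```

With `top_k=2` only the two highest-scoring ids can ever be drawn. With a 65-character vocabulary, the notebook's `top_k=50` removes only the 15 least likely characters at each step.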