<a href="https://colab.research.google.com/github/Berigny/p-adic-memory/blob/main/DualSubstrateTests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook Summary

This notebook is designed to evaluate a "dual-substrate memory" mechanism (`p_adic_memory`) against a baseline language model without this memory.

**Hypothesis:** The dual-substrate memory will improve the language model's ability to recall information from long contexts compared to a standard model.

**Method:**
1.  **Environment Setup:** Install necessary libraries and clone relevant repositories (LongBench, RULER, p-adic-memory).
2.  **Smoke Test:** Perform a minimal test to ensure the dual-substrate memory can be instantiated and used for basic recall.
3.  **LongBench-style Evaluation:** Implement a custom harness to evaluate the model with and without the dual-substrate memory on tasks inspired by LongBench, focusing on prompt/response logging.
4.  **RULER Evaluation:** Use the RULER framework with a custom adapter to evaluate the model's key-value retrieval capabilities with and without the dual-substrate memory on different context lengths.
5.  **Result Export:** Save the evaluation results (JSON and text files) and optionally copy them to Google Drive for persistence.

**Assessment of Changes to Resolve Ongoing Issues:**
The notebook includes steps to address potential issues like:
*   **GPU Availability and Storage:** Displaying GPU information (`!nvidia-smi`) and mounting Google Drive for persistent storage.
*   **Dependency Conflicts:** Skipping upstream `requirements.txt` and installing compatible versions of libraries like Transformers, Datasets, Accelerate, and BitsAndBytes.
*   **Repository Access:** Cloning repositories directly and appending their source paths to the system path.
*   **Hugging Face Authentication:** Providing a cell to authenticate with Hugging Face for accessing gated models.
*   **LongBench Evaluator:** Acknowledging the lack of a standard LongBench `Evaluator` and providing a custom harness as an alternative.
*   **vLLM/flash-attn:** Noting that these are not installed by default on Colab T4 and are optional for A100+ runtimes.
*   **Troubleshooting Tips:** Including a dedicated section for common issues like CUDA out-of-memory, tokenizer errors, authentication failures, dataset download issues, and custom module not found errors.

The notebook aims to provide a reproducible environment for benchmarking the dual-substrate memory and identifying its impact on language model performance, particularly in long-context scenarios.

# Dual Substrate Colab Test Plan

This notebook prepares a Google Colab environment for evaluating the `p_adic_memory` dual-substrate memory against baseline language-model behaviour. Follow the cells in order when running on a T4 GPU runtime.


## 0. Reality checks

Before committing to long runs, make sure the selected model fits in 16 GB of VRAM. Start with 4-bit quantised checkpoints such as **TinyLlama/TinyLlama-1.1B-Chat-v1.0** and scale to **mistralai/Mistral-7B-Instruct-v0.2** once everything works.
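
As a rough aid for the check above, the following sketch estimates how much memory the weights alone would occupy at a given precision. The parameter counts are approximate values assumed for illustration, and the estimate ignores the KV cache, activations, and CUDA overhead, so treat it as a lower bound rather than a guarantee that a model fits.

```python
# Rough weight-memory estimate; ignores KV cache, activations, and CUDA overhead.
def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Parameter counts are approximate, assumed values for illustration.
CANDIDATES = {
    "TinyLlama-1.1B (fp16)": (1.1e9, 16),
    "Mistral-7B (fp16)": (7.2e9, 16),
    "Mistral-7B (4-bit)": (7.2e9, 4),
}

for name, (params, bits) in CANDIDATES.items():
    print(f"{name}: ~{weight_gib(params, bits):.1f} GiB of weights")
```

Long prompts add KV-cache memory on top of this figure, so leave headroom when the estimate approaches the 16 GB limit.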


In [1]:
# 🔁 New runtime first (Runtime → Restart)

# Remove things that pull conflicting pins (you don't need them for text LLMs)
%pip uninstall -y -q torchvision torchaudio opencv-python opencv-contrib-python opencv-python-headless thinc gcsfs fsspec

# Fully remove any leftover NumPy wheels and compiled extensions
%pip uninstall -y -q numpy numpy-base
!rm -rf /usr/local/lib/python3.12/dist-packages/numpy*
!rm -rf /usr/local/lib/python3.12/site-packages/numpy*



In [2]:
%pip install -q --upgrade pip
%pip install -q "torch==2.3.1" --index-url https://download.pytorch.org/whl/cu121
%pip install -q --no-cache-dir --force-reinstall "numpy==2.1.3"
%pip install -q --no-cache-dir "transformers==4.44.2" "tokenizers==0.19.1" "accelerate==0.33.0" \
                               "datasets==2.20.0" "evaluate==0.4.2" sentencepiece ujson


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastai 2.8.4 requires torchvision>=0.11, which is not installed.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
accelerate 1.10.1 requires numpy<3.0.0,>=1.17, which is not installed.
dask-cudf-cu12 25.6.0 requires numpy<3.0a0,>=1.23, which is not installed.
cuvs-cu12 25.6.1 requi

In [4]:
!rm -rf /content/p-adic-memory
!git clone -q https://github.com/Berigny/p-adic-memory.git /content/p-adic-memory
%pip install -q -e /content/p-adic-memory


  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
  Building editable for p-adic-memory (pyproject.toml) ... done


In [None]:
# Optional: mount Google Drive for persistent artifacts and confirm GPU availability
from google.colab import drive
try:
    drive.mount('/content/drive')
except Exception:
    pass

!nvidia-smi


## 1. Environment setup

Install dependencies in a fixed order so NumPy stays compatible with Colab's preinstalled OpenCV 4.12. Pin PyTorch (cu121) explicitly, layer the Hugging Face tooling on top, and pin NumPy last. The install cell below does not pin bitsandbytes or Triton, so install bitsandbytes separately before running the 4-bit cells later in the notebook. If you really need the full lock file, override its NumPy 1.26 pin with a constraint.
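
Once the installs in this section have finished, one way to confirm the pins took effect is to compare installed versions against the expected ones. The sketch below uses only the standard library; `find_mismatches` and `installed_version` are hypothetical helper names, and the `EXPECTED` table simply mirrors the pins used in this section's install cell.

```python
from importlib import metadata

# Expected pins, mirroring the install cell in this section.
EXPECTED = {
    "torch": "2.3.1",
    "transformers": "4.44.2",
    "tokenizers": "0.19.1",
    "accelerate": "0.33.0",
    "datasets": "2.20.0",
    "numpy": "1.26.4",
}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected: dict, lookup=installed_version) -> dict:
    """Map each package whose installed version differs from its pin to (want, got)."""
    mismatches = {}
    for package, want in expected.items():
        got = lookup(package)
        # Local build tags such as "+cu121" are ignored for the comparison.
        if got is None or got.split("+")[0] != want:
            mismatches[package] = (want, got)
    return mismatches


for package, (want, got) in find_mismatches(EXPECTED).items():
    print(f"{package}: expected {want}, found {got}")
```

If a mismatch persists after reinstalling, restarting the runtime is often needed so that already-imported modules are reloaded from the new wheels.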


In [5]:
# Consolidate package installations to avoid compatibility issues
%pip uninstall -y torchvision torchaudio opencv-python opencv-contrib-python opencv-python-headless thinc gcsfs fsspec

# Install dependencies in a fixed order to ensure compatibility
%pip install -q --upgrade pip

# 1) Torch (CUDA 12.1): ONLY torch, not torchvision/torchaudio
%pip install -q "torch==2.3.1" --index-url https://download.pytorch.org/whl/cu121

# 2) HF + utils (no bitsandbytes)
%pip install -q "transformers==4.44.2" "tokenizers==0.19.1" "accelerate==0.33.0" \
               "datasets==2.20.0" "evaluate==0.4.2" sentencepiece ujson

# 3) NumPy that won't fight OpenCV (if it sneaks back) and is fine with Torch
# Installing after torch and HF libraries helps prevent compatibility issues
%pip install -q "numpy==1.26.4"

# 4) Install p-adic-memory after core dependencies
!rm -rf /content/p-adic-memory
!git clone -q https://github.com/Berigny/p-adic-memory.git /content/p-adic-memory

%cd /content/p-adic-memory
!pip install -q -e .

# Add p-adic-memory src to sys.path
import sys
src_path = "/content/p-adic-memory/src"
if src_path not in sys.path:
    sys.path.append(src_path)

# Return to content directory
%cd /content

Found existing installation: fsspec 2024.5.0
Uninstalling fsspec-2024.5.0:
  Successfully uninstalled fsspec-2024.5.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.24.0 requires gcsfs!=2025.5.0,>=2023.3.0, which is not installed.
fastai 2.8.4 requires torchvision>=0.11, which is not installed.
timm 1.0.20 requires torchvision, which is not installed.
datasets 2.20.0 requires fsspec[http]<=2024.5.0,>=2023.1.0, but you have fsspec 2025.9.0 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.24.0 requires gcsfs!=2025.5.0,>=2023.3.0, which is not installed.
fastai 2.8.4 requires torchvision>=0.11, which is not installed.
timm 1.0.20 requires torchvision, which is not in

In [6]:
import torch, transformers, tokenizers, numpy as np
print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("tokenizers", tokenizers.__version__)
print("numpy", np.__version__)


torch 2.3.1+cu121
transformers 4.44.2
tokenizers 0.19.1
numpy 2.0.2


In [7]:
import os, time, re, torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Alternative:
# MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# ---- shared decoding (deterministic) ----
GEN_KW = dict(
    do_sample=False,
    temperature=0.0,
    top_p=1.0,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    max_new_tokens=64,
    pad_token_id=tok.eos_token_id,
    eos_token_id=tok.eos_token_id,
)

# ---- chat frame + prompt slicing + cleanup ----
SYS = ("Follow instructions exactly. Never repeat the prompt. "
       "Never invent facts. If uncertain, output 'UNKNOWN'.")
FEWSHOT = "Only output: TIME=9:00; PRIME=2.\nTIME=9:00; PRIME=2\n"

def chatify(user_text: str) -> str:
    msgs = [
        {"role": "system", "content": SYS},
        {"role": "user", "content": "Only output: TIME=9:00; PRIME=2."},
        {"role": "assistant", "content": "TIME=9:00; PRIME=2"},
        {"role": "user", "content": user_text},
    ]
    return tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

ANGLE = re.compile(r"<[^>]{0,200}>")

def clean_out(s: str) -> str:
    return ANGLE.sub("", (s or "")).strip()

def decode_new_only(inputs, out_ids) -> str:
    prompt_len = inputs["input_ids"].shape[1]
    gen_only = out_ids[0][prompt_len:]
    return tok.decode(gen_only, skip_special_tokens=True).strip()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [8]:
# Use your installed package; if not installed, stub the memory so A/B still runs.
try:
    from p_adic_memory import DualSubstrate
    MEM = DualSubstrate(dim=128, cycle_minutes=15)
except Exception:
    MEM = None

POLICY = ("<memory-policy hidden='true'>Use memory facts if present. "
          "Never print memory tags. If conflict with the prompt, prefer memory.</memory-policy>")

def build_mem_blob(prompt: str) -> str:
    if MEM is None:
        return "<mem exact=0 p=0.000>"
    toks = prompt.split()
    for i, t in enumerate(toks):
        MEM.observe(t, {"pos": i % 11, "role": "ctx"})
    recent = toks[-64:]
    rec = []
    for t in recent:
        q = MEM.query(t)  # expects {'exact': bool, 'p': float, ...}
        rec.append(f"<mem exact={int(q.get('exact', False))} p={q.get('p',0.0):.3f}>")
    return " ".join(rec[:64])

def hf_generate(user_text: str) -> str:
    chat_str = chatify(user_text)
    inputs = tok(chat_str, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, **GEN_KW)
    return clean_out(decode_new_only(inputs, out))

def hf_generate_dual(user_text: str) -> str:
    mem_blob = build_mem_blob(user_text)
    aug_user = f"{POLICY}\n<memory hidden='true'>{mem_blob}</memory>\n\n{user_text}"
    chat_str = chatify(aug_user)
    inputs = tok(chat_str, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, **GEN_KW)
    return clean_out(decode_new_only(inputs, out))


In [9]:
import json, random, re, time  # time is used for latency measurement

FMT = re.compile(r"^TIME=\d{1,2}:\d{2}; PRIME=\d+$")

def make_kv_doc(num_noise_pairs=4000, seed=42):
    random.seed(seed)
    gt_time, gt_prime = "9:00", 2
    noise = " ".join(f"Z{i}:{(i*7)%97};" for i in range(num_noise_pairs))
    payload = f"{noise} TIME:{gt_time}; PRIME:{gt_prime}; {noise}"
    instr = "Only output in this exact format: TIME=<time>; PRIME=<n>."
    return f"{payload}\n\n{instr}"

def run_ruler(gen_fn, noise_sizes=(1000, 4000, 8000, 16000)):
    rows = []
    for L in noise_sizes:
        prompt = make_kv_doc(L)
        t0 = time.time()
        try:
            resp = gen_fn(prompt)
        except Exception as e:
            resp = f"ERROR: {type(e).__name__}: {e}"
        lat = round(time.time() - t0, 3)
        ok = isinstance(resp, str) and FMT.fullmatch(resp or "") and ("TIME=9:00" in resp) and ("PRIME=2" in resp)
        rows.append({"noise_pairs": L, "response": resp, "ok": bool(ok), "latency_s": lat})
    return rows

ruler_baseline = run_ruler(hf_generate)
ruler_dual     = run_ruler(hf_generate_dual)

with open("/content/ruler_baseline.json","w") as f: json.dump(ruler_baseline, f, indent=2)
with open("/content/ruler_dual_substrate.json","w") as f: json.dump(ruler_dual, f, indent=2)

print("Saved:", "/content/ruler_baseline.json", "/content/ruler_dual_substrate.json")

def summary(rows):
    return [{"noise_pairs": r["noise_pairs"], "acc": int(r["ok"]), "latency_s": r["latency_s"]} for r in rows]

print("Baseline:", summary(ruler_baseline))
print("Dual    :", summary(ruler_dual))


Token indices sequence length is longer than the specified maximum sequence length for this model (15702 > 2048). Running this sequence through the model will result in indexing errors


Saved: /content/ruler_baseline.json /content/ruler_dual_substrate.json
Baseline: [{'noise_pairs': 1000, 'acc': 0, 'latency_s': 0.61}, {'noise_pairs': 4000, 'acc': 0, 'latency_s': 0.132}, {'noise_pairs': 8000, 'acc': 0, 'latency_s': 0.257}, {'noise_pairs': 16000, 'acc': 0, 'latency_s': 0.394}]
Dual    : [{'noise_pairs': 1000, 'acc': 0, 'latency_s': 0.03}, {'noise_pairs': 4000, 'acc': 0, 'latency_s': 0.069}, {'noise_pairs': 8000, 'acc': 0, 'latency_s': 0.137}, {'noise_pairs': 16000, 'acc': 0, 'latency_s': 0.324}]
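
The tokenizer warning in the output above (15702 > 2048) suggests that the prompts exceed the maximum length the tokenizer reports for this model. A minimal sketch, not part of the original harness, of a guard that separates cases that can fit from those that cannot is shown here. `fits_context` and `split_by_fit` are hypothetical helpers. In the notebook the token counts could come from `len(tok(prompt)["input_ids"])` and the window from `model.config.max_position_embeddings`; both are assumptions about how the values would be obtained.

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, context_window: int) -> bool:
    """True when the prompt plus the generation budget fits in the context window."""
    return prompt_tokens + max_new_tokens <= context_window


def split_by_fit(token_counts: dict, max_new_tokens: int, context_window: int):
    """Split {case_name: prompt_token_count} into (runnable, skipped) name lists."""
    runnable, skipped = [], []
    for name, count in token_counts.items():
        target = runnable if fits_context(count, max_new_tokens, context_window) else skipped
        target.append(name)
    return runnable, skipped


# 15702 and 2048 come from the warning above; 64 matches max_new_tokens in GEN_KW.
# The 512-token case is a placeholder for illustration.
counts = {"short_case": 512, "long_case": 15702}
runnable, skipped = split_by_fit(counts, max_new_tokens=64, context_window=2048)
print("runnable:", runnable)
print("skipped:", skipped)
```

Recording skipped cases explicitly, instead of scoring them as failures, keeps the A/B comparison limited to prompts that both configurations could actually process.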


In [10]:
# --- clone official repos (source only; no requirements.txt installs) ---
!rm -rf /content/LongBench /content/RULER
!git clone -q https://github.com/THUDM/LongBench.git /content/LongBench
!git clone -q https://github.com/NVIDIA/RULER.git /content/RULER

import sys
if "/content/LongBench" not in sys.path:
    sys.path.append("/content/LongBench")
if "/content/RULER" not in sys.path:
    sys.path.append("/content/RULER")

# --- your package from GitHub (editable for quick iteration) ---
!rm -rf /content/p-adic-memory
!git clone -q https://github.com/Berigny/p-adic-memory.git /content/p-adic-memory

%cd /content/p-adic-memory
!pip install -q -e .

src_path = "/content/p-adic-memory/src"
if src_path not in sys.path:
    sys.path.append(src_path)

%cd /content


/content/p-adic-memory
  Installing build dependencies ... [?25l[?25hdone
  Checking if build backend supports build_editable ... [?25l[?25hdone
  Getting requirements to build editable ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing editable metadata (pyproject.toml) ... [?25l[?25hdone
  Building editable for p-adic-memory (pyproject.toml) ... [?25l[?25hdone
/content


In [11]:
# --- quick sanity checks ---
import os, importlib.util
print("LongBench pred.py:", os.path.exists("/content/LongBench/pred.py"))
print("RULER top-level:", os.listdir("/content/RULER")[:10])
print("p_adic_memory importable?", importlib.util.find_spec("p_adic_memory") is not None)


LongBench pred.py: True
RULER top-level: ['docker', 'scripts', '.gitignore', '.gitattributes', 'README.md', '.git', 'LICENSE']
p_adic_memory importable? True


In [12]:
# LongBench is script-first; confirm entry scripts exist and explain why Evaluator imports fail
import os
LB_INNER = '/content/LongBench/LongBench'
print('Has LongBench inner dir?', os.path.isdir(LB_INNER))
if os.path.isdir(LB_INNER):
    print('Contents:', sorted(f for f in os.listdir(LB_INNER) if f.endswith('.py'))[:6])
    if not os.path.exists(os.path.join(LB_INNER, 'eval.py')):
        print('Note: no eval.py script found; use the custom harness below.')
else:
    print('Clone LongBench with: !git clone https://github.com/THUDM/LongBench.git /content/LongBench')


Has LongBench inner dir? True
Contents: ['eval.py', 'llama_flash_attn_monkey_patch.py', 'metrics.py', 'pred.py']


In [None]:
# Colab T4 runtimes lack wheels for vLLM/flash-attn pinned by LongBench; install only on A100+
# !pip install -q vllm vllm-flash-attn


In [13]:
# Authenticate with Hugging Face if you intend to use gated checkpoints
from getpass import getpass
import os

token = getpass("Paste your Hugging Face token (press enter to skip): ")
if token:
    os.environ["HF_TOKEN"] = token
    from huggingface_hub import login
    login(token=token)


Paste your Hugging Face token (press enter to skip): ··········


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


## 2. Minimal smoke test with the shared harness

The following cell uses the shared harness to load the model and run text generation. It loads prompts from `tests/test_cases.json` to ensure consistency.

In [14]:

print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("tokenizers", tokenizers.__version__)
print("numpy", np.__version__)


torch 2.3.1+cu121
transformers 4.44.2
tokenizers 0.19.1
numpy 2.0.2


In [None]:
import os, json, time, torch  # os is needed for os.environ below
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from p_adic_memory import DualSubstrateMemory  # from your pip package

# --- Choose a test model that fits on T4 ---
# Good starter: TinyLlama (fast, no token needed)
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# If moving up later (needs token + 4-bit):
# MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4"
)

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True, token=os.environ.get("HF_TOKEN"))
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)

# --- Dual-substrate memory (tune dim / cycles as you wish) ---
mem = DualSubstrateMemory(dim=128, cycle_minutes=15)

def stream_tokens(text: str):
    # toy stream: (token, label) -> you'll swap for your true labelling later
    # label could be position, role, doc id, etc.
    for i, t in enumerate(text.split()):
        yield t, {"pos": i % 7, "role": "ctx"}  # simple label

def augment_with_memory(prompt: str, tokens_now: list[str]):
    # query memory for each current token; attach exact/prob signals as tags
    recalls = []
    for t in tokens_now:
        q = mem.query(t)  # your API returns a structure (e.g., {"exact": bool, "p": float, ...})
        recalls.append(f"<mem exact={int(q.get('exact', False))} p={q.get('p',0):.3f}>")
    tag = " ".join(recalls[:64])  # cap the injected tag length
    return f"{prompt}\n\n<memory>{tag}</memory>"

def dual_substrate_generate(prompt: str, max_new_tokens=256, temperature=0.2):
    # 1) Observe past tokens into memory
    for token, label in stream_tokens(prompt):
        mem.observe(token, label)
    # 2) Build memory-augmented prompt
    current_tokens = prompt.split()[-64:]  # sliding window summary of recent tokens
    aug = augment_with_memory(prompt, current_tokens)
    # 3) Generate
    inputs = tok(aug, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            max_new_tokens=max_new_tokens,
            pad_token_id=tok.eos_token_id
        )
    return tok.decode(output[0], skip_special_tokens=True)

# --- quick A/B sanity run ---
queries = [
    "Summarise the following log: Alice met Bob at 9:00. They discussed primes 2, 3, 5, 7 and Möbius transforms.",
    "Recall the meeting time and the smallest prime they discussed.",
]

results = []
for q in queries:
    t0 = time.time()
    out = dual_substrate_generate(q, max_new_tokens=64)
    dt = time.time() - t0
    results.append({"prompt": q, "response": out, "latency_s": round(dt, 3)})

# Save for diffing against baseline later
with open("/content/dual_substrate_smoke.json", "w") as f:
    json.dump(results, f, indent=2)

print("Saved:", "/content/dual_substrate_smoke.json")


In [None]:
try:
    from p_adic_memory import DualSubstrate
    MEM = DualSubstrate(dim=128, cycle_minutes=15)
except Exception as e:
    print("DualSubstrate not available, using stub:", e)
    MEM = None

POLICY = ("<memory-policy hidden='true'>Use memory facts if present. "
          "Never print memory tags. If conflict with the prompt, prefer memory.</memory-policy>")

def build_mem_blob(prompt: str) -> str:
    if MEM is None:
        return "<mem exact=0 p=0.000>"
    toks = prompt.split()
    for i, t in enumerate(toks):
        MEM.observe(t, {"pos": i % 11, "role": "ctx"})
    recent = toks[-64:]
    recs = []
    for t in recent:
        q = MEM.query(t)  # {'exact': bool, 'p': float, ...}
        recs.append(f"<mem exact={int(q.get('exact', False))} p={q.get('p',0.0):.3f}>")
    return " ".join(recs[:64])

def hf_generate_dual(user_text: str) -> str:
    mem_blob = build_mem_blob(user_text)
    aug_user = f"{POLICY}\n<memory hidden='true'>{mem_blob}</memory>\n\n{user_text}"
    s = chatify(aug_user)
    inputs = tok(s, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, **GEN_KW)
    return clean_out(decode_new_only(inputs, out))


## 3. Tests: LongBench and RULER


In [None]:
import json, random, re, time  # time is used for latency measurement

FMT = re.compile(r"^TIME=\d{1,2}:\d{2}; PRIME=\d+$")

def make_kv_doc(num_noise_pairs=4000, seed=42):
    random.seed(seed)
    gt_time, gt_prime = "9:00", 2
    noise = " ".join(f"Z{i}:{(i*7)%97};" for i in range(num_noise_pairs))
    payload = f"{noise} TIME:{gt_time}; PRIME:{gt_prime}; {noise}"
    instr = "Only output in this exact format: TIME=<time>; PRIME=<n>."
    return f"{payload}\n\n{instr}"

def run_ruler(gen_fn, noise_sizes=(1000, 4000, 8000, 16000)):
    rows = []
    for L in noise_sizes:
        prompt = make_kv_doc(L)
        t0 = time.time()
        try:
            resp = gen_fn(prompt)
        except Exception as e:
            resp = f"ERROR: {type(e).__name__}: {e}"
        lat = round(time.time() - t0, 3)
        ok = isinstance(resp, str) and FMT.fullmatch(resp or "") and ("TIME=9:00" in resp) and ("PRIME=2" in resp)
        rows.append({"noise_pairs": L, "response": resp, "ok": bool(ok), "latency_s": lat})
    return rows

ruler_baseline = run_ruler(hf_generate)
ruler_dual     = run_ruler(hf_generate_dual)

with open("/content/ruler_baseline.json","w") as f: json.dump(ruler_baseline, f, indent=2)
with open("/content/ruler_dual_substrate.json","w") as f: json.dump(ruler_dual, f, indent=2)

print("Saved:", "/content/ruler_baseline.json", "/content/ruler_dual_substrate.json")

def summary(rows):
    return [{"noise_pairs": r["noise_pairs"], "acc": int(r["ok"]), "latency_s": r["latency_s"]} for r in rows]

print("Baseline:", summary(ruler_baseline))
print("Dual    :", summary(ruler_dual))


In [None]:
queries = [
  {"id": "summary_1",
   "prompt": "In one sentence, summarise the following log:\nAlice met Bob at 9:00. They discussed primes 2, 3, 5, 7 and Möbius transforms."},
  {"id": "recall_1",
   "prompt": "Recall the meeting time and the smallest prime they discussed. Only output in this exact format: TIME=<time>; PRIME=<n>."}
]

def tag_ok(resp, qid):
    if "recall" in qid:
        return bool(FMT.match(resp or ""))
    return None

lb_baseline, lb_dual = [], []
for q in queries:
    t0 = time.time(); r = hf_generate(q["prompt"])
    lb_baseline.append({"id": q["id"], "response": r, "ok": tag_ok(r, q["id"]), "latency_s": round(time.time()-t0, 3)})

for q in queries:
    t0 = time.time(); r = hf_generate_dual(q["prompt"])
    lb_dual.append({"id": q["id"], "response": r, "ok": tag_ok(r, q["id"]), "latency_s": round(time.time()-t0, 3)})

with open("/content/longbench_baseline.json","w") as f: json.dump(lb_baseline, f, indent=2)
with open("/content/longbench_dual_substrate.json","w") as f: json.dump(lb_dual, f, indent=2)

print("Saved:", "/content/longbench_baseline.json", "/content/longbench_dual_substrate.json")


In [None]:
# Prompt slicing sanity check
# Uses the tok, model, GEN_KW, chatify, and decode_new_only objects defined earlier in this notebook.
GEN_KW_DEBUG = dict(GEN_KW)
GEN_KW_DEBUG['max_new_tokens'] = 8

text = chatify("Only output: TIME=9:00; PRIME=2.")
ids = tok(text, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**ids, **GEN_KW_DEBUG)

# FULL includes the prompt; NEW should contain only the generated continuation.
full = tok.decode(out[0], skip_special_tokens=True)
new = decode_new_only(ids, out)

print("FULL====\n", full[:300])
print("\nNEW====\n", new)


In [None]:
import json
from pathlib import Path

for name in ["longbench_dual_substrate.json", "longbench_baseline.json"]:
    path = Path("/content") / name
    if not path.exists():
        print(f"Missing {name}; run the harness cell above first.")
        continue
    with path.open() as f:
        data = json.load(f)
    print(f"\n{name} (records={len(data)}):")
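    for item in data:
        # Records store id/response/ok/latency_s (no prompt field), so preview the response.
        snippet = (item.get("response") or "")[:48].replace("\n", " ")
        print("- id={} | response[:48]={!r} | latency={}".format(item.get("id"), snippet, item.get("latency_s")))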

## 4. RULER evaluation


In [None]:
%%bash
cat > /content/ruler_adapter.py <<'PY'
import os
import sys

if "/content" not in sys.path:
    sys.path.append("/content")

from dual_substrate_adapter import DualSubstrateGenerator

_model = None


def load_model():
    global _model
    if _model is None:
        name = os.environ.get("RULER_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        _model = DualSubstrateGenerator(name, hf_token=os.environ.get("HF_TOKEN"))
    return _model


def generate(prompt: str) -> str:
    model = load_model()
    return model.generate(prompt, max_new_tokens=256)
PY


In [None]:
import os, subprocess, sys

os.environ["PYTHONPATH"] = f"/content:{os.environ.get('PYTHONPATH', '')}"
os.environ["RULER_MODEL"] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

cmd = [
    sys.executable,
    '-m',
    'ruler.evaluate',
    '--model',
    'custom',
    '--custom_module',
    'ruler_adapter',
    '--tasks',
    'kv_retrieval',
    '--context_lengths',
    '4k,8k',
    '--num_samples',
    '50',
]

print('Running:', ' '.join(cmd))
completed = subprocess.run(cmd, capture_output=True, text=True)
print(completed.stdout)
print(completed.stderr)

with open('/content/ruler_dual_substrate.txt', 'w') as f:
    f.write(completed.stdout)

print('Saved:', '/content/ruler_dual_substrate.txt')


In [None]:
%%bash
# Optional vanilla RULER baseline using transformers only
cat > /content/ruler_vanilla_adapter.py <<'PY'
import os
import sys

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

if "/content" not in sys.path:
    sys.path.append("/content")

from dual_substrate_adapter import (
    ALLOWED_GEN_KW,
    GEN_KW_BASE,
    chatify,
    clean_output,
    decode_new_only,
    enforce_recall_format,
)

_model = None
_tok = None
_defaults = None


def load_model():
    global _model, _tok, _defaults
    if _model is None or _tok is None or _defaults is None:
        name = os.environ.get("RULER_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        qconf = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
        _tok = AutoTokenizer.from_pretrained(name, token=os.environ.get("HF_TOKEN"))
        _model = AutoModelForCausalLM.from_pretrained(
            name,
            device_map="auto",
            trust_remote_code=True,
            quantization_config=qconf,
        )
        _defaults = dict(GEN_KW_BASE)
        _defaults["pad_token_id"] = _tok.eos_token_id
        _defaults["eos_token_id"] = _tok.eos_token_id
    return _tok, _model, _defaults


def build_generation_kwargs(base_settings, max_new_tokens, **overrides):
    settings = dict(base_settings)
    settings["max_new_tokens"] = max_new_tokens
    for key, value in overrides.items():
        if key in ALLOWED_GEN_KW and value is not None:
            settings[key] = value
    return settings


def generate(prompt: str) -> str:
    tok, model, defaults = load_model()
    chat_prompt = chatify(tok, prompt)
    inputs = tok(chat_prompt, return_tensors="pt").to(model.device)
    settings = build_generation_kwargs(defaults, 256)
    with torch.inference_mode():
        output = model.generate(**inputs, **settings)
    text = decode_new_only(tok, inputs, output)
    text = clean_output(text)
    return enforce_recall_format(prompt, text)
PY


In [None]:
import subprocess, sys

cmd = [
    sys.executable,
    '-m',
    'ruler.evaluate',
    '--model',
    'custom',
    '--custom_module',
    'ruler_vanilla_adapter',
    '--tasks',
    'kv_retrieval',
    '--context_lengths',
    '4k,8k',
    '--num_samples',
    '50',
]

print('Running:', ' '.join(cmd))
completed = subprocess.run(cmd, capture_output=True, text=True)
print(completed.stdout)
print(completed.stderr)

with open('/content/ruler_baseline.txt', 'w') as f:
    f.write(completed.stdout)

print('Saved:', '/content/ruler_baseline.txt')


## 5. Export and persist results


In [None]:
!ls -lh /content/*longbench*.json /content/*ruler* 2>/dev/null || true
!cp /content/longbench_*.json /content/ruler_* /content/drive/MyDrive/ 2>/dev/null || true


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 6. Scaling plan

1. Swap `MODEL_NAME` to **mistralai/Mistral-7B-Instruct-v0.2** with 4-bit quantisation.
2. Increase LongBench `sample_size` (e.g., 25 → 100) and add tasks such as `LongBookSummEng` and additional QA tracks.
3. Extend RULER coverage to multi-hop and longer contexts once the pipeline is reliable.
4. Introduce vLLM for batching after verifying correctness with Transformers.
5. Maintain A/B JSON outputs (`baseline` vs `dual_substrate`) and track latency, VRAM, and accuracy deltas.
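
The A/B tracking in the last step can be sketched as follows, assuming both result files share the row schema written by `run_ruler` (`ok` and `latency_s` per row). `ab_summary` is a hypothetical helper; VRAM is not captured by the existing rows, so it is left out here.

```python
import json
from pathlib import Path
from statistics import mean


def ab_summary(baseline_rows: list, dual_rows: list) -> dict:
    """Compare accuracy and mean latency between two runs with the same row schema."""
    def stats(rows):
        return {
            "acc": mean(1.0 if r["ok"] else 0.0 for r in rows),
            "latency_s": mean(r["latency_s"] for r in rows),
        }

    base, dual = stats(baseline_rows), stats(dual_rows)
    return {
        "baseline": base,
        "dual_substrate": dual,
        # Positive delta_acc favours the dual-substrate run; positive delta_latency_s means it was slower.
        "delta_acc": dual["acc"] - base["acc"],
        "delta_latency_s": dual["latency_s"] - base["latency_s"],
    }


# Paths match the files written by the RULER-style cell; nothing is printed if they are missing.
base_path = Path("/content/ruler_baseline.json")
dual_path = Path("/content/ruler_dual_substrate.json")
if base_path.exists() and dual_path.exists():
    report = ab_summary(json.loads(base_path.read_text()), json.loads(dual_path.read_text()))
    print(json.dumps(report, indent=2))
```

`statistics.mean` raises `StatisticsError` on an empty list, so an empty result file surfaces as an error instead of a silent zero.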


## 7. Troubleshooting tips

* **CUDA out-of-memory**: lower `max_new_tokens`, revert to the TinyLlama checkpoint, or ensure 4-bit loading is active.
* **Tokenizer errors**: set `pad_token_id` to `tok.eos_token_id`.
* **Authentication failures**: provide a Hugging Face token and request model access if required.
* **Dataset download issues**: run the dataset setup cells once with a stable internet connection.
* **Custom module not found**: confirm that `/content` is on `PYTHONPATH` before invoking RULER.
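
For the last tip, a small sketch of setting `PYTHONPATH` so that rerunning the cell does not accumulate repeated or empty entries. `with_path_prepended` is a hypothetical helper, not part of RULER or the notebook's adapters.

```python
import os


def with_path_prepended(current: str, new_dir: str) -> str:
    """Return a PYTHONPATH string with new_dir first and no duplicate or empty entries."""
    parts = [p for p in current.split(os.pathsep) if p and p != new_dir]
    return os.pathsep.join([new_dir] + parts)


# Child processes started with subprocess.run inherit this environment by default.
os.environ["PYTHONPATH"] = with_path_prepended(os.environ.get("PYTHONPATH", ""), "/content")
print(os.environ["PYTHONPATH"])
```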


## 8. Publishing checklist

* Commit `dual_substrate_adapter.py`, `ruler_adapter.py`, and this notebook to a dedicated branch (e.g., `colab-benchmark/`).
* Archive the result artefacts (`longbench_*.json`, `ruler_*.json`, `ruler_*.txt`) for baseline comparisons.
* Summarise the metrics in a short report covering recall, drift, latency, and energy usage.
