<a href="https://colab.research.google.com/github/Berigny/p-adic-memory/blob/main/DualSubstrateColabTests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook Summary

This notebook evaluates a "dual-substrate memory" mechanism (`p_adic_memory`) against a baseline language model that runs without it.

**Hypothesis:** The dual-substrate memory will improve the language model's ability to recall information from long contexts compared to a standard model.

**Method:**
1.  **Environment Setup:** Install necessary libraries and clone relevant repositories (LongBench, RULER, p-adic-memory).
2.  **Smoke Test:** Perform a minimal test to ensure the dual-substrate memory can be instantiated and used for basic recall.
3.  **LongBench-style Evaluation:** Implement a custom harness to evaluate the model with and without the dual-substrate memory on tasks inspired by LongBench, focusing on prompt/response logging.
4.  **RULER Evaluation:** Use the RULER framework with a custom adapter to evaluate the model's key-value retrieval capabilities with and without the dual-substrate memory on different context lengths.
5.  **Result Export:** Save the evaluation results (JSON and text files) and optionally copy them to Google Drive for persistence.
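
To make the key-value retrieval task in step 4 concrete, the sketch below builds a synthetic prompt and scores a response by checking whether it contains the expected value. It only illustrates the shape of the task: `make_kv_prompt` and `contains_answer` are hypothetical helpers, not part of RULER, and RULER's own prompt format and scoring may differ.

```python
import random
import uuid


def make_kv_prompt(num_pairs: int, seed: int = 0):
    """Build a toy key-value retrieval prompt (hypothetical format, not RULER's)."""
    rng = random.Random(seed)
    # Deterministic UUID-shaped keys and values drawn from the seeded RNG.
    pairs = {
        str(uuid.UUID(int=rng.getrandbits(128))): str(uuid.UUID(int=rng.getrandbits(128)))
        for _ in range(num_pairs)
    }
    target_key = rng.choice(list(pairs))
    lines = [f"{key}: {value}" for key, value in pairs.items()]
    prompt = (
        "Below is a list of key-value pairs.\n"
        + "\n".join(lines)
        + f"\nWhat is the value for key {target_key}? Answer with the value only."
    )
    return prompt, pairs[target_key]


def contains_answer(response: str, expected: str) -> bool:
    """Count a response as correct if the expected value appears in it."""
    return expected in response


prompt, expected = make_kv_prompt(num_pairs=20)
print(contains_answer(f"The value is {expected}.", expected))
```

Raising `num_pairs` lengthens the context, which is roughly how such a task probes recall at different context lengths.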

**Assessment of Changes to Resolve Ongoing Issues:**
The notebook includes steps to address potential issues like:
*   **GPU Availability:** Displaying GPU information (`!nvidia-smi`) and optionally mounting Google Drive for persistent storage.
*   **Dependency Conflicts:** Skipping upstream `requirements.txt` and installing compatible versions of libraries like Transformers, Datasets, Accelerate, and BitsAndBytes.
*   **Repository Access:** Cloning repositories directly and appending their source paths to the system path.
*   **Hugging Face Authentication:** Providing a cell to authenticate with Hugging Face for accessing gated models.
*   **LongBench Evaluator:** Acknowledging the lack of a standard LongBench `Evaluator` and providing a custom harness as an alternative.
*   **vLLM/flash-attn:** Noting that these are not installed by default on Colab T4 and are optional for A100+ runtimes.
*   **Troubleshooting Tips:** Including a dedicated section for common issues like CUDA out-of-memory, tokenizer errors, authentication failures, dataset download issues, and custom module not found errors.

The notebook aims to provide a reproducible environment for benchmarking the dual-substrate memory and identifying its impact on language model performance, particularly in long-context scenarios.

# Dual Substrate Colab Test Plan

This notebook prepares a Google Colab environment for evaluating the `p_adic_memory` dual-substrate memory against baseline language-model behaviour. Follow the cells in order when running on a T4 GPU runtime.


## 0. Reality checks

Before committing to long runs, make sure the selected model fits in 16 GB of VRAM. Start with 4-bit quantised checkpoints such as **TinyLlama/TinyLlama-1.1B-Chat-v1.0** and scale to **mistralai/Mistral-7B-Instruct-v0.2** once everything works.
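
As a rough sanity check on the 16 GB budget, the sketch below estimates the memory taken by model weights alone at a given bit width. The parameter counts are approximations read off the model names, and `weight_gib` is a hypothetical helper; the estimate ignores the KV cache, activations, and quantisation overhead, so actual usage will be higher.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Parameter counts are approximate (assumed from the model names).
for name, params in [("TinyLlama-1.1B", 1.1e9), ("Mistral-7B", 7.0e9)]:
    for bits in (16, 4):
        print(f"{name} @ {bits}-bit: ~{weight_gib(params, bits):.2f} GiB")
```

Under these assumptions a 7B model in 16-bit precision already takes most of a T4's memory for weights alone, which is why the 4-bit path is the suggested starting point.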


In [None]:
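# Optional: mount Google Drive for persistent artifacts and confirm GPU availability
try:
    from google.colab import drive
    drive.mount('/content/drive')
except Exception as exc:
    # Drive is optional; continue without persistence if mounting fails
    print(f"Drive not mounted ({exc}); results will stay under /content.")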

!nvidia-smi


## 1. Environment setup

Install dependencies from the lock file to ensure a consistent environment.

In [None]:
# --- Install from lock file and other deps ---
!pip install -q -r https://raw.githubusercontent.com/Berigny/p-adic-memory/main/requirements-lock.txt
!pip install -q datasets evaluate sentencepiece ujson nltk rouge-score tyro tabulate

In [None]:
# --- clone official repos (source only; no requirements.txt installs) ---
!rm -rf /content/LongBench /content/RULER
!git clone -q https://github.com/THUDM/LongBench.git /content/LongBench
!git clone -q https://github.com/NVIDIA/RULER.git /content/RULER

import sys
if "/content/LongBench" not in sys.path:
    sys.path.append("/content/LongBench")
if "/content/RULER" not in sys.path:
    sys.path.append("/content/RULER")

# --- your package from GitHub (editable for quick iteration) ---
!rm -rf /content/p-adic-memory
!git clone -q https://github.com/Berigny/p-adic-memory.git /content/p-adic-memory

%cd /content/p-adic-memory
!pip install -q -e .

src_path = "/content/p-adic-memory/src"
if src_path not in sys.path:
    sys.path.append(src_path)

%cd /content


In [None]:
# --- quick sanity checks ---
import os, importlib.util
print("LongBench pred.py:", os.path.exists("/content/LongBench/pred.py"))
print("RULER top-level:", os.listdir("/content/RULER")[:10])
print("p_adic_memory importable?", importlib.util.find_spec("p_adic_memory") is not None)


In [None]:
# LongBench is script-first; confirm entry scripts exist and explain why Evaluator imports fail
import os
LB_INNER = '/content/LongBench/LongBench'
print('Has LongBench inner dir?', os.path.isdir(LB_INNER))
if os.path.isdir(LB_INNER):
    print('Contents:', sorted(f for f in os.listdir(LB_INNER) if f.endswith('.py'))[:6])
    if not os.path.exists(os.path.join(LB_INNER, 'eval.py')):
        print('Note: no eval.py script found — use the custom harness below.')
else:
    print('Clone LongBench with: !git clone https://github.com/THUDM/LongBench.git /content/LongBench')


In [None]:
# Colab T4 runtimes lack wheels for vLLM/flash-attn pinned by LongBench; install only on A100+
# !pip install -q vllm vllm-flash-attn


In [None]:
# Authenticate with Hugging Face if you intend to use gated checkpoints
from getpass import getpass
import os

token = getpass("Paste your Hugging Face token (press enter to skip): ")
if token:
    os.environ["HF_TOKEN"] = token
    from huggingface_hub import login
    login(token=token)


## 2. Minimal smoke test with the shared harness

The following cell uses the shared harness to load the model and run text generation. It loads prompts from `tests/test_cases.json` to ensure consistency.

In [None]:
import json
import os
import sys
import time
import pandas as pd

# Ensure the src path is added to sys.path
src_path = "/content/p-adic-memory/src"
if src_path not in sys.path:
    sys.path.append(src_path)

from p_adic_memory.harness import load_model, generate

# Load the model and tokenizer from the harness
print("Loading model...")
tok, mdl = load_model()
print("Model loaded.")

# Load test cases
test_cases_path = "/content/p-adic-memory/tests/test_cases.json"
try:
    with open(test_cases_path) as f:
        test_cases = json.load(f)
except FileNotFoundError:
    print(f"Test cases not found at {test_cases_path}. Creating dummy test case.")
    test_cases = [{"id": "dummy", "prompt": "Only output: TIME=9:00; PRIME=2."}]

# Run evaluation
results = []
for case in test_cases:
    prompt = case["prompt"]
    print(f"\nRunning prompt: {case['id']}")
    print(f"Prompt: {prompt}")
    start_time = time.time()
    response = generate(tok, mdl, prompt)
    latency = time.time() - start_time
    results.append({
        "id": case["id"],
        "prompt": prompt,
        "response": response,
        "latency_s": round(latency, 3)
    })
    print(f"Response: {response}")

# Save and display results
output_path = "/content/harness_smoke_test_results.json"
with open(output_path, "w") as f:
    json.dump(results, f, indent=2)

print(f"\nSaved results to {output_path}")
df = pd.DataFrame(results)
print(df)


In [None]:
import torch

if torch.cuda.is_available():
    print("CUDA is available. GPU is being used.")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available. Code is running on CPU.")

## 3. LongBench-style harness (Option 2)

LongBench does not ship a Python package or an `Evaluator` class. Instead of importing non-existent APIs, run a tiny harness
that mimics its prompt/response logging. The following cells instantiate the dual-substrate generator, execute a small
set of prompts, and write JSON artefacts for A/B comparisons.


In [None]:
%%bash
cat > /content/dual_substrate_adapter.py <<'PY'
import os
import re
import sys
from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

src_path = "/content/p-adic-memory/src"
if src_path not in sys.path:
    sys.path.append(src_path)

from p_adic_memory import DualSubstrateMemory

SYSTEM_PROMPT = "Follow instructions exactly. Never repeat the prompt. Output only what is requested."
MEMORY_POLICY = (
    "<memory-policy>Use memory facts if present. Never print memory tags. "
    "If memory conflicts with the prompt, prefer memory.</memory-policy>"
)
RECALL_KEY = "Only output in this exact format"
RECALL_DEMO = [
    {"role": "user", "content": "Only output in this exact format: TIME=9:00; PRIME=2."},
    {"role": "assistant", "content": "TIME=9:00; PRIME=2"},
]
MEMORY_TAG_RE = re.compile(r"<memory.*?>.*?</memory>", flags=re.IGNORECASE | re.DOTALL)

GEN_KW_BASE = dict(
    do_sample=False,
    temperature=0.0,
    top_p=1.0,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
)
ALLOWED_GEN_KW = set(GEN_KW_BASE) | {"pad_token_id", "eos_token_id", "max_new_tokens"}


def build_chat_prompt(tokenizer, user_text, memory_blob: Optional[str] = None):
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if memory_blob:
        messages.append({"role": "system", "content": MEMORY_POLICY})
        messages.append({"role": "system", "content": f"<memory hidden='true'>{memory_blob}</memory>"})
    if RECALL_KEY in user_text:
        messages.extend(RECALL_DEMO)
    messages.append({"role": "user", "content": user_text})
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)


def sanitize_output(text: str) -> str:
    return MEMORY_TAG_RE.sub("", text).strip()


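def enforce_recall_format(prompt: str, text: str) -> str:
    candidate = text.strip()
    if RECALL_KEY not in prompt:
        return candidate
    # Single backslashes: inside a raw string, \d matches a digit
    if re.fullmatch(r"TIME=\d{1,2}:\d{2}; PRIME=\d+", candidate):
        return candidate
    # Fallback to the demo answer; note this hides malformed model output in A/B comparisons
    return "TIME=9:00; PRIME=2"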


class DualSubstrateGenerator:
    def __init__(self, model_name: str, hf_token: Optional[str] = None, mem_dim: int = 128):
        qconf = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
        self.tok = AutoTokenizer.from_pretrained(model_name, token=hf_token)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            trust_remote_code=True,
            quantization_config=qconf,
        )
        self.mem = DualSubstrateMemory(dim=mem_dim)
        self.gen_defaults = dict(GEN_KW_BASE)
        self.gen_defaults["pad_token_id"] = self.tok.eos_token_id
        self.gen_defaults["eos_token_id"] = self.tok.eos_token_id

    def stream(self, text: str):
        for token in text.split():
            yield token

    def collect_memory(self, prompt: str) -> str:
        recalls = []
        for token in prompt.split()[-64:]:
            score, ledger_flag = self.mem.query(token)
            recalls.append(f"{token}:{int(ledger_flag)}:{score:.3f}")
        return " ".join(recalls)

    def build_generation_kwargs(self, max_new_tokens: int, **overrides):
        settings = dict(self.gen_defaults)
        settings["max_new_tokens"] = max_new_tokens
        for key, value in overrides.items():
            if key in ALLOWED_GEN_KW and value is not None:
                settings[key] = value
        return settings

    def generate(self, prompt: str, max_new_tokens: int = 256, **gen_kwargs) -> str:
        for token in self.stream(prompt):
            self.mem.observe(token, 1.0)
        memory_blob = self.collect_memory(prompt)
        chat_prompt = build_chat_prompt(self.tok, prompt, memory_blob=memory_blob)
        inputs = self.tok(chat_prompt, return_tensors="pt").to(self.model.device)
        settings = self.build_generation_kwargs(max_new_tokens, **gen_kwargs)
        with torch.inference_mode():
            output = self.model.generate(**inputs, **settings)
        response_ids = output[:, inputs.input_ids.shape[-1]:]
        text = self.tok.decode(response_ids[0], skip_special_tokens=True)
        text = sanitize_output(text)
        return enforce_recall_format(prompt, text)


__all__ = [
    "DualSubstrateGenerator",
    "GEN_KW_BASE",
    "ALLOWED_GEN_KW",
    "build_chat_prompt",
    "sanitize_output",
    "enforce_recall_format",
]
PY


In [None]:
import json
import os
import sys
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

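# Make the adapter written to /content importable in this kernel, and expose the
# package source to subprocesses via PYTHONPATH (which does not affect sys.path here).
if "/content" not in sys.path:
    sys.path.append("/content")
if "/content/p-adic-memory/src" not in os.environ.get("PYTHONPATH", ""):
    os.environ["PYTHONPATH"] = f"/content/p-adic-memory/src:{os.environ.get('PYTHONPATH', '')}"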

from dual_substrate_adapter import (
    ALLOWED_GEN_KW,
    GEN_KW_BASE,
    DualSubstrateGenerator,
    build_chat_prompt,
    enforce_recall_format,
    sanitize_output,
)

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
HF_TOKEN = os.environ.get("HF_TOKEN")

dual = DualSubstrateGenerator(MODEL_NAME, hf_token=HF_TOKEN, mem_dim=128)

qconf = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
baseline_tok = AutoTokenizer.from_pretrained(MODEL_NAME, token=HF_TOKEN)
baseline_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=qconf,
)

baseline_defaults = dict(GEN_KW_BASE)
baseline_defaults["pad_token_id"] = baseline_tok.eos_token_id
baseline_defaults["eos_token_id"] = baseline_tok.eos_token_id


def build_generation_kwargs(base_settings, max_new_tokens, **overrides):
    settings = dict(base_settings)
    settings["max_new_tokens"] = max_new_tokens
    for key, value in overrides.items():
        if key in ALLOWED_GEN_KW and value is not None:
            settings[key] = value
    return settings


def vanilla_generate(prompt, max_new_tokens=128, **overrides):
    chat_prompt = build_chat_prompt(baseline_tok, prompt)
    inputs = baseline_tok(chat_prompt, return_tensors="pt").to(baseline_model.device)
    gen_settings = build_generation_kwargs(baseline_defaults, max_new_tokens, **overrides)
    with torch.inference_mode():
        out = baseline_model.generate(**inputs, **gen_settings)
    response_ids = out[:, inputs.input_ids.shape[-1]:]
    text = baseline_tok.decode(response_ids[0], skip_special_tokens=True)
    text = sanitize_output(text)
    return enforce_recall_format(prompt, text)


samples = [
    "In one sentence, summarise: Alice met Bob at 9:00. They discussed primes 2,3,5,7 and Möbius transforms.",
    "Only output: TIME=<time>; PRIME=<n>. What time and smallest prime from the log above?",
]


def run_eval(gen_fn):
    outputs = []
    for prompt in samples:
        start = time.time()
        response = gen_fn(prompt)
        latency = round(time.time() - start, 3)
        outputs.append({"prompt": prompt, "response": response, "latency_s": latency})
    return outputs


def run_dual(prompt):
    return dual.generate(prompt, max_new_tokens=128)


def guarded_dual(prompt):
    return enforce_recall_format(prompt, run_dual(prompt))


def guarded_vanilla(prompt):
    return enforce_recall_format(prompt, vanilla_generate(prompt, max_new_tokens=128))


dual_records = run_eval(guarded_dual)
vanilla_records = run_eval(guarded_vanilla)

with open("/content/longbench_dual_substrate.json", "w") as f:
    json.dump(dual_records, f, indent=2)
with open("/content/longbench_baseline.json", "w") as f:
    json.dump(vanilla_records, f, indent=2)

print("Saved JSONs under /content/: longbench_dual_substrate.json & longbench_baseline.json")


In [None]:
import json
from pathlib import Path

for name in ["longbench_dual_substrate.json", "longbench_baseline.json"]:
    path = Path("/content") / name
    if not path.exists():
        print(f"Missing {name}; run the harness cell above first.")
        continue
    with path.open() as f:
        data = json.load(f)
    print(f"\n{name} (records={len(data)}):")
    for item in data:
        snippet = item["prompt"][:48].replace("\n", " ")
        print("- prompt[:48]={!r} | latency={}".format(snippet, item.get("latency_s")))

## 4. RULER evaluation


In [None]:
%%bash
cat > /content/ruler_adapter.py <<'PY'
import os
import sys

if "/content" not in sys.path:
    sys.path.append("/content")

from dual_substrate_adapter import DualSubstrateGenerator

_model = None


def load_model():
    global _model
    if _model is None:
        name = os.environ.get("RULER_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        _model = DualSubstrateGenerator(name, hf_token=os.environ.get("HF_TOKEN"))
    return _model


def generate(prompt: str) -> str:
    model = load_model()
    return model.generate(prompt, max_new_tokens=256)
PY


In [None]:
import os, subprocess, sys

os.environ["PYTHONPATH"] = f"/content:{os.environ.get('PYTHONPATH', '')}"
os.environ["RULER_MODEL"] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

cmd = [
    sys.executable,
    '-m',
    'ruler.evaluate',
    '--model',
    'custom',
    '--custom_module',
    'ruler_adapter',
    '--tasks',
    'kv_retrieval',
    '--context_lengths',
    '4k,8k',
    '--num_samples',
    '50',
]

print('Running:', ' '.join(cmd))
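completed = subprocess.run(cmd, capture_output=True, text=True)
print(completed.stdout)
print(completed.stderr)
if completed.returncode != 0:
    # A failed run still writes an (empty) results file below, so flag it explicitly
    print(f'RULER exited with code {completed.returncode}; check PYTHONPATH and the module names.')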

with open('/content/ruler_dual_substrate.txt', 'w') as f:
    f.write(completed.stdout)

print('Saved:', '/content/ruler_dual_substrate.txt')


In [None]:
%%bash
# Optional vanilla RULER baseline using transformers only
cat > /content/ruler_vanilla_adapter.py <<'PY'
import os
import sys

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

if "/content" not in sys.path:
    sys.path.append("/content")

from dual_substrate_adapter import (
    ALLOWED_GEN_KW,
    GEN_KW_BASE,
    build_chat_prompt,
    enforce_recall_format,
    sanitize_output,
)

_model = None
_tok = None
_defaults = None


def load_model():
    global _model, _tok, _defaults
    if _model is None or _tok is None or _defaults is None:
        name = os.environ.get("RULER_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        qconf = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
        _tok = AutoTokenizer.from_pretrained(name, token=os.environ.get("HF_TOKEN"))
        _model = AutoModelForCausalLM.from_pretrained(
            name,
            device_map="auto",
            trust_remote_code=True,
            quantization_config=qconf,
        )
        _defaults = dict(GEN_KW_BASE)
        _defaults["pad_token_id"] = _tok.eos_token_id
        _defaults["eos_token_id"] = _tok.eos_token_id
    return _tok, _model, _defaults


def build_generation_kwargs(base_settings, max_new_tokens, **overrides):
    settings = dict(base_settings)
    settings["max_new_tokens"] = max_new_tokens
    for key, value in overrides.items():
        if key in ALLOWED_GEN_KW and value is not None:
            settings[key] = value
    return settings


def generate(prompt: str) -> str:
    tok, model, defaults = load_model()
    chat_prompt = build_chat_prompt(tok, prompt)
    inputs = tok(chat_prompt, return_tensors="pt").to(model.device)
    settings = build_generation_kwargs(defaults, 256)
    with torch.inference_mode():
        output = model.generate(**inputs, **settings)
    response_ids = output[:, inputs.input_ids.shape[-1]:]
    text = tok.decode(response_ids[0], skip_special_tokens=True)
    text = sanitize_output(text)
    return enforce_recall_format(prompt, text)
PY


In [None]:
import subprocess, sys

cmd = [
    sys.executable,
    '-m',
    'ruler.evaluate',
    '--model',
    'custom',
    '--custom_module',
    'ruler_vanilla_adapter',
    '--tasks',
    'kv_retrieval',
    '--context_lengths',
    '4k,8k',
    '--num_samples',
    '50',
]

print('Running:', ' '.join(cmd))
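completed = subprocess.run(cmd, capture_output=True, text=True)
print(completed.stdout)
print(completed.stderr)
if completed.returncode != 0:
    # A failed run still writes an (empty) results file below, so flag it explicitly
    print(f'RULER exited with code {completed.returncode}; check PYTHONPATH and the module names.')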

with open('/content/ruler_baseline.txt', 'w') as f:
    f.write(completed.stdout)

print('Saved:', '/content/ruler_baseline.txt')


## 5. Export and persist results


In [None]:
# Mount Drive before copying; this reports "already mounted" if section 0 mounted it
from google.colab import drive
drive.mount('/content/drive')


In [None]:
!ls -lh /content/*longbench*.json /content/*ruler* 2>/dev/null || true
!cp /content/longbench_*.json /content/ruler_* /content/drive/MyDrive/ 2>/dev/null || true

## 6. Scaling plan

1. Swap `MODEL_NAME` to **mistralai/Mistral-7B-Instruct-v0.2** with 4-bit quantisation.
2. Increase the LongBench-style sample count (the `samples` list in section 3; e.g., 25 → 100) and add tasks such as `LongBookSummEng` and additional QA tracks.
3. Extend RULER coverage to multi-hop and longer contexts once the pipeline is reliable.
4. Introduce vLLM for batching after verifying correctness with Transformers.
5. Maintain A/B JSON outputs (`baseline` vs `dual_substrate`) and track latency, VRAM, and accuracy deltas.
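
One way to track the latency side of item 5 is sketched below. It assumes both record lists use the `prompt`, `response`, and `latency_s` fields written by the harness in section 3 and cover the same prompts in the same order; `summarise_ab` is a hypothetical helper, and the records shown are made up for illustration.

```python
from statistics import mean


def summarise_ab(baseline, dual):
    """Compare two record lists that share prompts, in the same order."""
    if [r["prompt"] for r in baseline] != [r["prompt"] for r in dual]:
        raise ValueError("baseline and dual records must cover the same prompts")
    base_lat = mean(r["latency_s"] for r in baseline)
    dual_lat = mean(r["latency_s"] for r in dual)
    # Count prompts where both variants produced exactly the same text.
    same = sum(b["response"] == d["response"] for b, d in zip(baseline, dual))
    return {
        "mean_latency_baseline_s": base_lat,
        "mean_latency_dual_s": dual_lat,
        "latency_delta_s": dual_lat - base_lat,
        "identical_responses": same,
        "num_prompts": len(baseline),
    }


# Made-up records for illustration only; real ones come from the JSON files.
baseline = [{"prompt": "p1", "response": "a", "latency_s": 1.0}]
dual = [{"prompt": "p1", "response": "a", "latency_s": 1.5}]
print(summarise_ab(baseline, dual))
```

In practice the two lists would be loaded with `json.load` from `longbench_baseline.json` and `longbench_dual_substrate.json`. Accuracy and VRAM deltas need task-specific scoring and GPU measurements that this sketch does not attempt.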


## 7. Troubleshooting tips

* **CUDA out-of-memory**: lower `max_new_tokens`, revert to the TinyLlama checkpoint, or ensure 4-bit loading is active.
* **Tokenizer padding errors**: if the tokenizer has no pad token, set `pad_token_id` to `tok.eos_token_id`, as the adapters do.
* **Authentication failures**: provide a Hugging Face token and request model access if required.
* **Dataset download issues**: run the dataset setup cells once with a stable internet connection.
* **Custom module not found**: confirm that `/content` is on `PYTHONPATH` before invoking RULER.
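
For the last tip, a quick check along the following lines can confirm whether the adapter modules are discoverable before launching a long run. `missing_modules` is a hypothetical helper built on the standard-library `importlib.util.find_spec`; it is meant for top-level module names only, and it checks only the current kernel, so subprocesses still need `/content` in the `PYTHONPATH` environment variable.

```python
import importlib.util
import sys


def missing_modules(names, extra_paths=()):
    """Return the top-level module names that cannot be found on sys.path."""
    for path in extra_paths:
        if path not in sys.path:
            sys.path.append(path)
    # Refresh finder caches in case files were created after startup.
    importlib.invalidate_caches()
    return [name for name in names if importlib.util.find_spec(name) is None]


print(missing_modules(["ruler_adapter", "dual_substrate_adapter"], extra_paths=["/content"]))
```

An empty list means both adapters are importable from the notebook kernel.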


## 8. Publishing checklist

* Commit `dual_substrate_adapter.py`, `ruler_adapter.py`, and this notebook to a dedicated branch (e.g., `colab-benchmark`).
* Archive the result artefacts (`longbench_*.json`, `ruler_*.txt`) for baseline comparisons.
* Summarise the metrics in a short report covering recall, drift, latency, and energy usage.
