<a href="https://colab.research.google.com/github/Berigny/p-adic-memory/blob/main/DualSubstrateColabTests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dual Substrate Colab Test Plan

This notebook prepares a Google Colab environment for evaluating the `p_adic_memory` dual-substrate memory against baseline language-model behaviour. Follow the cells in order when running on a T4 GPU runtime.


## 0. Reality checks

Before committing to long runs, make sure the selected model fits in the T4's VRAM: nominally 16 GB, of which `nvidia-smi` reports 15360 MiB in the output below. Start with 4-bit quantised checkpoints such as **TinyLlama/TinyLlama-1.1B-Chat-v1.0** and scale to **mistralai/Mistral-7B-Instruct-v0.2** once everything works.
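
As a rough pre-flight check, the weight footprint can be estimated from the parameter count and the quantisation width. The sketch below is a back-of-the-envelope calculation only: `estimate_weight_gib` is a hypothetical helper, the parameter counts are approximations rounded from the model names, and the estimate ignores activations, the KV cache, and quantisation metadata, so treat the result as a lower bound.

```python
def estimate_weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


# Approximate parameter counts (assumptions, rounded from the model names).
candidates = {
    "TinyLlama-1.1B": 1.1e9,
    "Mistral-7B": 7.0e9,
}

for name, params in candidates.items():
    fp16 = estimate_weight_gib(params, 16)
    nf4 = estimate_weight_gib(params, 4)
    print(f"{name}: ~{fp16:.1f} GiB in fp16, ~{nf4:.1f} GiB in 4-bit")
```

By this estimate a 7B model in fp16 leaves little headroom on a T4, which is why the plan starts from 4-bit checkpoints.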


In [9]:
# Optional: mount Google Drive for persistent artifacts and confirm GPU availability
from google.colab import drive
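try:
    drive.mount('/content/drive')
except Exception as exc:
    # Drive is optional; report the failure instead of hiding it, then continue.
    print("Drive mount skipped:", exc)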

!nvidia-smi


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Mon Oct 13 13:05:41 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   41C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+----------

## 1. Environment setup


In [None]:
# --- base deps (GPU-friendly) ---
!pip install -q --upgrade pip
!pip install -q "transformers>=4.44.0" "datasets>=2.20.0" "evaluate>=0.4.2" accelerate bitsandbytes sentencepiece



In [None]:
# --- clone official repos (not pip) ---
!rm -rf LongBench RULER
!git clone -q https://github.com/THUDM/LongBench.git
!git clone -q https://github.com/NVIDIA/RULER.git
%cd LongBench
!pip install -q -r requirements.txt || true
%cd /content/RULER
!pip install -q -r requirements.txt || true
%cd /content

import sys
sys.path += ["/content/LongBench", "/content/RULER"]

# --- your package from GitHub (editable for quick iteration) ---
!rm -rf p-adic-memory
!git clone -q https://github.com/Berigny/p-adic-memory.git
%cd p-adic-memory
!pip install -q -e .
%cd /content



In [None]:
# --- quick sanity checks ---
import os
print("LongBench pred.py:", os.path.exists("/content/LongBench/pred.py"))
print("RULER scripts dir:", os.listdir("/content/RULER")[:10])
import p_adic_memory as pam
print("p_adic_memory version:", getattr(pam, "__version__", "dev"))



In [None]:
# Optional: install vLLM if batching/throughput becomes a requirement later
# !pip install -q vllm


In [None]:
# Authenticate with Hugging Face if you intend to use gated checkpoints
from getpass import getpass
import os

token = getpass("Paste your Hugging Face token (press enter to skip): ")
if token:
    os.environ["HF_TOKEN"] = token
    from huggingface_hub import login
    login(token=token)


## 2. Minimal dual-substrate smoke test

The following cell instantiates a quantised model, attaches the dual-substrate memory, and writes a JSON log of prompts, responses, and latencies. The cell after it records the same fields without memory augmentation for comparison.


In [None]:
import json, time, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from p_adic_memory import DualSubstrateMemory
import os

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# To scale later (requires token + 4-bit):
# MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True, token=os.environ.get("HF_TOKEN"))
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)

mem = DualSubstrateMemory(dim=128, cycle_minutes=15)


def stream_tokens(text: str):
    for i, t in enumerate(text.split()):
        yield t, {"pos": i % 7, "role": "ctx"}


def augment_with_memory(prompt: str, tokens_now):
    recalls = []
    for t in tokens_now:
        q = mem.query(t)
        recalls.append(f"<mem exact={int(q.get('exact', False))} p={q.get('p', 0.0):.3f}>")
    tag = " ".join(recalls[:64])
    return f"{prompt}\n\n<memory>{tag}</memory>"


def dual_substrate_generate(prompt: str, max_new_tokens=256, temperature=0.2):
    for token, label in stream_tokens(prompt):
        mem.observe(token, label)
    current_tokens = prompt.split()[-64:]
    augmented = augment_with_memory(prompt, current_tokens)
    inputs = tok(augmented, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            max_new_tokens=max_new_tokens,
            pad_token_id=tok.eos_token_id
        )
    return tok.decode(output[0], skip_special_tokens=True)


queries = [
    "Summarise the following log: Alice met Bob at 9:00. They discussed primes 2, 3, 5, 7 and Möbius transforms.",
    "Recall the meeting time and the smallest prime they discussed.",
]

results = []
for q in queries:
    start = time.time()
    response = dual_substrate_generate(q, max_new_tokens=64)
    latency = time.time() - start
    results.append({"prompt": q, "response": response, "latency_s": round(latency, 3)})

with open("/content/dual_substrate_smoke.json", "w") as f:
    json.dump(results, f, indent=2)

print("Saved:", "/content/dual_substrate_smoke.json")
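
To see what the augmented prompt looks like without loading a model, the augmentation step from the cell above can be reproduced with a stand-in memory. This is a minimal sketch: `StubMemory` is hypothetical and only mimics the `observe`/`query` calls and the `exact`/`p` keys that the cell assumes; it says nothing about how `DualSubstrateMemory` actually scores tokens.

```python
class StubMemory:
    """Hypothetical stand-in for DualSubstrateMemory; NOT the real implementation."""

    def __init__(self):
        self.counts = {}

    def observe(self, token, label):
        # Count how often each token has been seen (the label is ignored here).
        self.counts[token] = self.counts.get(token, 0) + 1

    def query(self, token):
        # Assumed return shape: a dict with "exact" and "p" keys.
        total = sum(self.counts.values()) or 1
        seen = self.counts.get(token, 0)
        return {"exact": seen > 0, "p": seen / total}


def augment(prompt, memory, window=64):
    """Observe every whitespace token, then append one <mem> tag per recent token."""
    tokens = prompt.split()
    for i, token in enumerate(tokens):
        memory.observe(token, {"pos": i % 7, "role": "ctx"})
    tags = [
        f"<mem exact={int(q.get('exact', False))} p={q.get('p', 0.0):.3f}>"
        for q in (memory.query(t) for t in tokens[-window:])
    ]
    return f"{prompt}\n\n<memory>{' '.join(tags)}</memory>"


print(augment("Alice met Bob", StubMemory()))
```

Because every token in the prompt is observed before it is queried, each tag in this stub reports `exact=1`; whether the real memory behaves the same way depends on `p_adic_memory` itself.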

In [None]:
# Baseline comparison without memory augmentation
baseline = []
for q in queries:
    start = time.time()
    inputs = tok(q, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=64,
            pad_token_id=tok.eos_token_id
        )
    response = tok.decode(output[0], skip_special_tokens=True)
    latency = time.time() - start
    baseline.append({"prompt": q, "response": response, "latency_s": round(latency, 3)})

with open("/content/baseline_smoke.json", "w") as f:
    json.dump(baseline, f, indent=2)

print("Saved:", "/content/baseline_smoke.json")


In [None]:
# Optionally copy smoke-test outputs to Drive for persistence
!cp /content/*smoke.json /content/drive/MyDrive/ 2>/dev/null || true
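
Once both smoke-test files exist, the latency overhead of the memory path can be read off by pairing records on their prompt. The following is a sketch under the assumption that both files keep the `prompt`, `response`, and `latency_s` fields written by the cells above; `load_runs` and `latency_deltas` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_runs(path):
    """Load a list of run records from JSON, or return [] if the file is missing."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else []


def latency_deltas(baseline_runs, dual_runs):
    """Pair records by prompt and report dual-minus-baseline latency in seconds."""
    base_by_prompt = {r["prompt"]: r["latency_s"] for r in baseline_runs}
    return [
        {
            "prompt": r["prompt"],
            "baseline_s": base_by_prompt[r["prompt"]],
            "dual_s": r["latency_s"],
            "delta_s": round(r["latency_s"] - base_by_prompt[r["prompt"]], 3),
        }
        for r in dual_runs
        if r["prompt"] in base_by_prompt
    ]


baseline_runs = load_runs("/content/baseline_smoke.json")
dual_runs = load_runs("/content/dual_substrate_smoke.json")
for row in latency_deltas(baseline_runs, dual_runs):
    print(f"{row['delta_s']:+.3f}s  {row['prompt'][:60]}")
```

With only two prompts and sampling enabled, these numbers are a plumbing check rather than a benchmark.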


## 3. LongBench integration


In [None]:
%%bash
cat > /content/dual_substrate_adapter.py <<'PY'
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from p_adic_memory import DualSubstrateMemory


class DualSubstrateGenerator:
    def __init__(self, model_name: str, hf_token: str | None = None, mem_dim: int = 128, cycle_minutes: int = 15):
        qconf = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
        self.tok = AutoTokenizer.from_pretrained(model_name, token=hf_token)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            trust_remote_code=True,
            quantization_config=qconf,
        )
        self.mem = DualSubstrateMemory(dim=mem_dim, cycle_minutes=cycle_minutes)

    def stream(self, text: str):
        for i, token in enumerate(text.split()):
            yield token, {"pos": i % 11}

    def augment(self, prompt: str, tokens_now):
        recalls = []
        for token in tokens_now:
            q = self.mem.query(token)
            recalls.append(f"<mem exact={int(q.get('exact', False))} p={q.get('p', 0.0):.3f}>")
        tag = " ".join(recalls[:64])
        return f"{prompt}\n\n<memory>{tag}</memory>"

    def generate(self, prompt: str, max_new_tokens: int = 256, temperature: float = 0.2):
        for token, label in self.stream(prompt):
            self.mem.observe(token, label)
        current_tokens = prompt.split()[-64:]
        augmented = self.augment(prompt, current_tokens)
        inputs = self.tok(augmented, return_tensors="pt").to(self.model.device)
        with torch.inference_mode():
            output = self.model.generate(
                **inputs,
                do_sample=True,
                temperature=temperature,
                max_new_tokens=max_new_tokens,
                pad_token_id=self.tok.eos_token_id,
            )
        return self.tok.decode(output[0], skip_special_tokens=True)
PY


In [None]:
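import os, json
# NOTE: `longbench.Evaluator` is an assumed interface. The cloned LongBench repo is driven by
# scripts such as pred.py (checked in the sanity cell above); adapt this cell if the import fails.
from longbench import Evaluator
from dual_substrate_adapter import DualSubstrateGenerator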

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
generator = DualSubstrateGenerator(model_name, hf_token=os.environ.get("HF_TOKEN"))

def custom_generate(text: str) -> str:
    return generator.generate(text, max_new_tokens=256, temperature=0.2)

evaluator = Evaluator(model=custom_generate, dataset="longbench")
results = evaluator.evaluate(tasks=["MultiDocQA"], sample_size=25)

with open("/content/longbench_dual_substrate.json", "w") as f:
    json.dump(results, f, indent=2)

print("Saved:", "/content/longbench_dual_substrate.json")


In [None]:
# Vanilla baseline on the same LongBench slice
from transformers import AutoTokenizer, AutoModelForCausalLM

baseline_tok = AutoTokenizer.from_pretrained(model_name, token=os.environ.get("HF_TOKEN"))
baseline_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

def vanilla_generate(text: str) -> str:
    inputs = baseline_tok(text, return_tensors="pt").to(baseline_model.device)
    with torch.inference_mode():
        output = baseline_model.generate(
            **inputs,
            do_sample=True,  # match the dual-substrate sampling settings for a fair A/B
            temperature=0.2,
            max_new_tokens=256,
            pad_token_id=baseline_tok.eos_token_id,
        )
    return baseline_tok.decode(output[0], skip_special_tokens=True)

evaluator_baseline = Evaluator(model=vanilla_generate, dataset="longbench")
results_baseline = evaluator_baseline.evaluate(tasks=["MultiDocQA"], sample_size=25)

with open("/content/longbench_baseline.json", "w") as f:
    json.dump(results_baseline, f, indent=2)

print("Saved:", "/content/longbench_baseline.json")


## 4. RULER evaluation


In [None]:
%%bash
cat > /content/ruler_adapter.py <<'PY'
import os
from dual_substrate_adapter import DualSubstrateGenerator

_model = None

def load_model():
    global _model
    if _model is None:
        name = os.environ.get("RULER_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        _model = DualSubstrateGenerator(name, hf_token=os.environ.get("HF_TOKEN"))
    return _model

def generate(prompt: str) -> str:
    model = load_model()
    return model.generate(prompt, max_new_tokens=256, temperature=0.2)
PY


In [None]:
import os, subprocess, sys

os.environ["PYTHONPATH"] = f"/content:{os.environ.get('PYTHONPATH', '')}"
os.environ["RULER_MODEL"] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

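# NOTE: the `ruler.evaluate` module and the flags below are an assumed entry point;
# check them against the scripts in /content/RULER before relying on this cell.
cmd = [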
    sys.executable,
    '-m',
    'ruler.evaluate',
    '--model',
    'custom',
    '--custom_module',
    'ruler_adapter',
    '--tasks',
    'kv_retrieval',
    '--context_lengths',
    '4k,8k',
    '--num_samples',
    '50',
]

print('Running:', ' '.join(cmd))
completed = subprocess.run(cmd, capture_output=True, text=True)
print(completed.stdout)
print(completed.stderr)

with open('/content/ruler_dual_substrate.txt', 'w') as f:
    f.write(completed.stdout)

print('Saved:', '/content/ruler_dual_substrate.txt')


In [None]:
%%bash
# Optional vanilla RULER baseline using transformers only
cat > /content/ruler_vanilla_adapter.py <<'PY'
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

_model = None
_tok = None


def load_model():
    global _model, _tok
    if _model is None or _tok is None:
        name = os.environ.get("RULER_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        qconf = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
        _tok = AutoTokenizer.from_pretrained(name, token=os.environ.get("HF_TOKEN"))
        _model = AutoModelForCausalLM.from_pretrained(
            name,
            device_map="auto",
            trust_remote_code=True,
            quantization_config=qconf,
        )
    return _tok, _model


def generate(prompt: str) -> str:
    tok, model = load_model()
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            **inputs,
            do_sample=True,  # match the dual-substrate sampling settings for a fair A/B
            temperature=0.2,
            max_new_tokens=256,
            pad_token_id=tok.eos_token_id,
        )
    return tok.decode(output[0], skip_special_tokens=True)
PY


In [None]:
import subprocess, sys

cmd = [
    sys.executable,
    '-m',
    'ruler.evaluate',
    '--model',
    'custom',
    '--custom_module',
    'ruler_vanilla_adapter',
    '--tasks',
    'kv_retrieval',
    '--context_lengths',
    '4k,8k',
    '--num_samples',
    '50',
]

print('Running:', ' '.join(cmd))
completed = subprocess.run(cmd, capture_output=True, text=True)
print(completed.stdout)
print(completed.stderr)

with open('/content/ruler_baseline.txt', 'w') as f:
    f.write(completed.stdout)

print('Saved:', '/content/ruler_baseline.txt')
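
The two runner cells above write `stdout` to disk even when the subprocess fails, which can leave an empty results file that looks like a valid run. A small wrapper that checks the return code is one way to guard against that. This is a sketch only: `run_and_log` is a hypothetical helper, and the demo command is a trivial Python one-liner rather than a RULER invocation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_log(cmd, log_path):
    """Run cmd, save its stdout to log_path, and raise if the process failed."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    Path(log_path).write_text(completed.stdout)
    if completed.returncode != 0:
        # Surface stderr so a failed run is not mistaken for an empty result.
        raise RuntimeError(
            f"Command exited with code {completed.returncode}:\n{completed.stderr}"
        )
    return completed.stdout


# Demo with a harmless command; swap in the benchmark `cmd` list when ready.
log = Path(tempfile.gettempdir()) / "run_and_log_demo.txt"
out = run_and_log([sys.executable, "-c", "print('ok')"], log)
print("Captured:", out.strip())
```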


## 5. Export and persist results


In [None]:
!ls -lh /content/*longbench*.json /content/*ruler* 2>/dev/null || true
!cp /content/longbench_*.json /content/ruler_* /content/drive/MyDrive/ 2>/dev/null || true
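
If the shell one-liner is too quiet about what it copied, the same export can be done in Python and report the file names. The sketch below assumes Drive is mounted at `/content/drive/MyDrive`; `copy_artifacts` and the `dual_substrate_results` folder name are hypothetical.

```python
import glob
import shutil
from pathlib import Path


def copy_artifacts(patterns, dest_dir):
    """Copy every file matching the glob patterns into dest_dir; return the copied names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for src in sorted(glob.glob(pattern)):
            if Path(src).is_file():
                shutil.copy2(src, dest / Path(src).name)
                copied.append(Path(src).name)
    return copied


drive_root = Path("/content/drive/MyDrive")
if drive_root.is_dir():
    # Only attempt the copy when Drive is actually mounted.
    names = copy_artifacts(
        ["/content/longbench_*.json", "/content/ruler_*.txt"],
        drive_root / "dual_substrate_results",
    )
    print("Copied:", names)
```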


## 6. Scaling plan

1. Swap `MODEL_NAME` (and `model_name` and `RULER_MODEL` in the benchmark cells) to **mistralai/Mistral-7B-Instruct-v0.2** with 4-bit quantisation.
2. Increase LongBench `sample_size` (e.g., 25 → 100) and add tasks such as `LongBookSummEng` and additional QA tracks.
3. Extend RULER coverage to multi-hop and longer contexts once the pipeline is reliable.
4. Introduce vLLM for batching after verifying correctness with Transformers.
5. Maintain A/B JSON outputs (`baseline` vs `dual_substrate`) and track latency, VRAM, and accuracy deltas.
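
Latency and VRAM tracking from step 5 can be folded into one wrapper around any generate call. The sketch below is an assumption about how one might do it: `measure` is a hypothetical helper, and the peak figure comes from `torch.cuda.max_memory_allocated`, which covers memory allocated through PyTorch and may differ from what `nvidia-smi` shows.

```python
import time

try:
    import torch
except ImportError:  # allow the helper to run where PyTorch is not installed
    torch = None


def measure(fn, *args, **kwargs):
    """Call fn and return (result, stats) with wall-clock latency and peak VRAM."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    stats = {"latency_s": round(time.perf_counter() - start, 3)}
    stats["peak_vram_mib"] = (
        round(torch.cuda.max_memory_allocated() / 2**20, 1) if use_cuda else None
    )
    return result, stats
```

In the smoke-test cell this could wrap the existing call, for example `response, stats = measure(dual_substrate_generate, q, max_new_tokens=64)`, with `stats` merged into each JSON record.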


## 7. Troubleshooting tips

* **CUDA out-of-memory**: lower `max_new_tokens`, revert to the TinyLlama checkpoint, or ensure 4-bit loading is active.
* **Tokenizer errors**: if the tokenizer has no pad token, pass `pad_token_id=tok.eos_token_id` to `generate`, as the cells above do.
* **Authentication failures**: provide a Hugging Face token and request model access if required.
* **Dataset download issues**: run the dataset setup cells once with a stable internet connection.
* **Custom module not found**: confirm that `/content` is on `PYTHONPATH` before invoking RULER.
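
For the last tip, a quick way to confirm that the adapter modules resolve from the notebook process is to query the import system directly. This is a minimal sketch; `ensure_importable` is a hypothetical helper, and it only checks the current interpreter, so a subprocess still needs `PYTHONPATH` set as in the RULER cell.

```python
import importlib.util
import sys


def ensure_importable(module_name, search_dir):
    """Add search_dir to sys.path if needed and report whether module_name resolves."""
    if search_dir not in sys.path:
        sys.path.insert(0, search_dir)
    importlib.invalidate_caches()  # pick up files written after the interpreter started
    return importlib.util.find_spec(module_name) is not None


for name in ("dual_substrate_adapter", "ruler_adapter"):
    print(name, "importable:", ensure_importable(name, "/content"))
```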


## 8. Publishing checklist

* Commit `dual_substrate_adapter.py`, `ruler_adapter.py`, and this notebook to a dedicated branch (e.g., `colab-benchmark`).
* Archive the result artefacts (`longbench_*.json`, `ruler_*.txt`) for baseline comparisons.
* Summarise the metrics in a short report covering recall, drift, latency, and energy usage.
