<a href="https://colab.research.google.com/github/Berigny/p-adic-memory/blob/main/DualSubstrateColabTests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dual Substrate Colab Test Plan

This notebook prepares a Google Colab environment for evaluating the `p_adic_memory` dual-substrate memory against baseline language-model behaviour. Follow the cells in order when running on a T4 GPU runtime.


## 0. Reality checks

Before committing to long runs, make sure the selected model fits in 16 GB of VRAM. Start with 4-bit quantised checkpoints such as **TinyLlama/TinyLlama-1.1B-Chat-v1.0** and scale to **mistralai/Mistral-7B-Instruct-v0.2** once everything works.


In [1]:
# Optional: mount Google Drive for persistent artifacts and confirm GPU availability
from google.colab import drive
try:
    drive.mount('/content/drive')
except Exception:
    pass

!nvidia-smi


Mounted at /content/drive
Tue Oct 14 13:54:36 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   46C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                      

## 1. Environment setup

Skip the upstream \`requirements.txt\` files on Colab. They pin incompatible CUDA builds (vLLM/flash-attn)
and NumPy releases that do not support Python 3.11. Instead, install a modern Transformers stack and
pull the benchmark repos in source mode.


In [2]:
# --- base deps tuned for Colab T4 (no vLLM / flash-attn) ---
!pip install -q --upgrade pip
!pip install -q "numpy>=1.26" "transformers>=4.44" "datasets>=2.20"
!pip install -q "evaluate>=0.4.2" "accelerate>=0.33" "bitsandbytes>=0.43" \
               sentencepiece ujson nltk rouge-score tyro tabulate



[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.8 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m1.8/1.8 MB[0m [31m81.6 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m49.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[33m  DEPRECATION: Building 'rouge-score' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'rouge-score'. Discussion can be found at https://github.com/pypa/pip/issues/6334[0m[33m
[0m  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone


In [3]:
# --- clone official repos (source only; no requirements.txt installs) ---
!rm -rf /content/LongBench /content/RULER
!git clone -q https://github.com/THUDM/LongBench.git /content/LongBench
!git clone -q https://github.com/NVIDIA/RULER.git /content/RULER

import sys
if "/content/LongBench" not in sys.path:
    sys.path.append("/content/LongBench")
if "/content/RULER" not in sys.path:
    sys.path.append("/content/RULER")

# --- your package from GitHub (editable for quick iteration) ---
!rm -rf /content/p-adic-memory
!git clone -q https://github.com/Berigny/p-adic-memory.git /content/p-adic-memory

%cd /content/p-adic-memory
!pip install -q -e .
# !python -m pip show -f p-adic-memory | sed -n '1,160p'

src_path = "/content/p-adic-memory/src"
if src_path not in sys.path:
    sys.path.append(src_path)

%cd /content


/content/p-adic-memory
  Installing build dependencies ... [?25l[?25hdone
  Checking if build backend supports build_editable ... [?25l[?25hdone
  Getting requirements to build editable ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing editable metadata (pyproject.toml) ... [?25l[?25hdone
  Building editable for p-adic-memory (pyproject.toml) ... [?25l[?25hdone
/content


In [4]:
# --- quick sanity checks ---
import os, importlib.util
print("LongBench pred.py:", os.path.exists("/content/LongBench/pred.py"))
print("RULER top-level:", os.listdir("/content/RULER")[:10])
print("p_adic_memory importable?", importlib.util.find_spec("p_adic_memory") is not None)




LongBench pred.py: True
RULER top-level: ['.git', 'docker', '.gitignore', '.gitattributes', 'README.md', 'LICENSE', 'scripts']
p_adic_memory importable? True


In [5]:
# LongBench is script-first; confirm entry scripts exist and explain why Evaluator imports fail
import os
LB_INNER = '/content/LongBench/LongBench'
print('Has LongBench inner dir?', os.path.isdir(LB_INNER))
if os.path.isdir(LB_INNER):
    print('Contents:', sorted(f for f in os.listdir(LB_INNER) if f.endswith('.py'))[:6])
    if not os.path.exists(os.path.join(LB_INNER, 'eval.py')):
        print('Note: no eval.py script found — use the custom harness below.')
else:
    print('Clone LongBench with: !git clone https://github.com/THUDM/LongBench.git /content/LongBench')


Has LongBench inner dir? True
Contents: ['eval.py', 'llama_flash_attn_monkey_patch.py', 'metrics.py', 'pred.py']


In [6]:
# Colab T4 runtimes lack wheels for vLLM/flash-attn pinned by LongBench; install only on A100+
# !pip install -q vllm vllm-flash-attn


In [7]:
# Authenticate with Hugging Face if you intend to use gated checkpoints
from getpass import getpass
import os

token = getpass("Paste your Hugging Face token (press enter to skip): ")
if token:
    os.environ["HF_TOKEN"] = token
    from huggingface_hub import login
    login(token=token)


Paste your Hugging Face token (press enter to skip): ··········


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


## 2. Minimal dual-substrate smoke test

The following cell instantiates a quantised model, attaches the dual-substrate memory, and produces a JSON log comparing prompts and responses.


In [8]:
import json, time, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import os
import sys

# Ensure the src/ directory is available on sys.path when running from notebooks
src_path = "/content/p-adic-memory/src"
if src_path not in sys.path:
    sys.path.append(src_path)

from p_adic_memory.dual_substrate import DualSubstrate, DualSubstrateMemory

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# To scale later (requires token + 4-bit):
# MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True, token=os.environ.get("HF_TOKEN"))
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)

def as_chat(user_text: str) -> str:
    messages = [
        {"role": "system", "content": "Answer succinctly and do not repeat the question."},
        {"role": "user", "content": user_text},
    ]
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

gen_kwargs = dict(
    do_sample=False,
    temperature=0.0,
    top_p=1.0,
    repetition_penalty=1.15,
    pad_token_id=tok.eos_token_id,
    eos_token_id=tok.eos_token_id,
)

mem = DualSubstrate(dim=128, cycle=15 * 60)

def stream_tokens(text: str):
    for token in text.split():
        yield token

def augment_with_memory(prompt: str, tokens_now):
    recalls = []
    for token in tokens_now:
        score, ledger_flag = mem.query(token)
        recalls.append(f"<mem exact={int(ledger_flag)} p={score:.3f}>")
    policy = "<memory-policy>Use memory facts if present. If memory and the prompt disagree, prefer memory. Output only what is requested; do not repeat the question.</memory-policy>"
    tag = " ".join(recalls[:64])
    return f"{policy}\n<memory>{tag}</memory>\n\n{prompt}"

def generate_chat_response(user_text: str, max_new_tokens: int = 64) -> str:
    chat_prompt = as_chat(user_text)
    inputs = tok(chat_prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        generated = model.generate(**inputs, max_new_tokens=max_new_tokens, **gen_kwargs)
    response_ids = generated[:, inputs.input_ids.shape[-1]:]
    return tok.decode(response_ids[0], skip_special_tokens=True).strip()

def dual_substrate_generate(prompt: str, max_new_tokens: int = 64) -> str:
    for idx, token in enumerate(stream_tokens(prompt)):
        mem.observe(token, 1.0)
    tokens_now = prompt.split()[-64:]
    augmented = augment_with_memory(prompt, tokens_now)
    return generate_chat_response(augmented, max_new_tokens=max_new_tokens)

queries = [
    """In one sentence, summarise the following log:
Alice met Bob at 9:00. They discussed primes 2, 3, 5, 7 and Möbius transforms.""",
    "Recall the meeting time and the smallest prime they discussed. Only output in this exact format: TIME=<time>; PRIME=<n>.",
]

dual_results = []
for q in queries:
    start = time.time()
    response = dual_substrate_generate(q, max_new_tokens=64)
    latency = time.time() - start
    dual_results.append({"prompt": q, "response": response, "latency_s": round(latency, 3)})

with open("/content/dual_substrate_smoke.json", "w") as f:
    json.dump(dual_results, f, indent=2)

print("Saved:", "/content/dual_substrate_smoke.json")

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Saved: /content/dual_substrate_smoke.json


In [9]:
import importlib.util, p_adic_memory as pam

print("p_adic_memory importable?", importlib.util.find_spec("p_adic_memory") is not None)
print("version:", getattr(pam, "__version__", "dev"))
print("has DualSubstrate:", hasattr(pam, "DualSubstrate"))
print("has DualSubstrateMemory:", hasattr(pam, "DualSubstrateMemory"))



p_adic_memory importable? True
version: 0.1.0
has DualSubstrate: True
has DualSubstrateMemory: True


In [10]:
# Baseline comparison without memory augmentation
baseline = []
for q in queries:
    start = time.time()
    response = generate_chat_response(q, max_new_tokens=64)
    latency = time.time() - start
    baseline.append({"prompt": q, "response": response, "latency_s": round(latency, 3)})

with open("/content/baseline_smoke.json", "w") as f:
    json.dump(baseline, f, indent=2)

print("Saved:", "/content/baseline_smoke.json")

Saved: /content/baseline_smoke.json


In [11]:
# Quick accuracy/latency scorer for the recall task
import json, re

def load(path):
    with open(path) as f:
        return json.load(f)

def parse_out(text: str):
    match = re.search(r"TIME\s*=\s*([0-9]{1,2}:[0-9]{2})\s*;\s*PRIME\s*=\s*([0-9]+)", text)
    if not match:
        return None, None
    return match.group(1), int(match.group(2))

def score(records):
    gt_time, gt_prime = "9:00", 2
    data = []
    for r in records:
        if "Recall the meeting time" not in r["prompt"]:
            continue
        time_out, prime_out = parse_out(r["response"])
        ok_time = time_out == gt_time
        ok_prime = prime_out == gt_prime
        data.append({"latency_s": r.get("latency_s"), "ok_time": ok_time, "ok_prime": ok_prime})
    if not data:
        return {"n": 0}
    acc_time = sum(d["ok_time"] for d in data) / len(data)
    acc_prime = sum(d["ok_prime"] for d in data) / len(data)
    latencies = [d["latency_s"] for d in data if d["latency_s"] is not None]
    median_latency = sorted(latencies)[len(latencies) // 2] if latencies else None
    return {"n": len(data), "acc_time": acc_time, "acc_prime": acc_prime, "median_latency": median_latency}

baseline = load("/content/baseline_smoke.json")
dual = load("/content/dual_substrate_smoke.json")
print("Baseline:", score(baseline))
print("Dual-substrate:", score(dual))

Baseline: {'n': 1, 'acc_time': 0.0, 'acc_prime': 0.0, 'median_latency': 2.676}
Dual-substrate: {'n': 1, 'acc_time': 0.0, 'acc_prime': 0.0, 'median_latency': 2.211}


In [12]:
# Optionally copy smoke-test outputs to Drive for persistence
!cp /content/*smoke.json /content/drive/MyDrive/ 2>/dev/null || true


## 3. LongBench-style harness (Option 2)

LongBench does not ship a Python package or an `Evaluator` class. Instead of importing non-existent APIs, run a tiny harness
that mimics their prompt/response logging. The following cells instantiate the dual-substrate generator, execute a small
set of prompts, and write JSON artefacts for A/B comparisons.


In [13]:
%%bash
cat > /content/dual_substrate_adapter.py <<'PY'
import os
import sys

# Add the package's source directory to sys.path
src_path = "/content/p-adic-memory/src"
if src_path not in sys.path:
    sys.path.append(src_path)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from p_adic_memory import DualSubstrateMemory


class DualSubstrateGenerator:
    def __init__(self, model_name: str, hf_token: str | None = None, mem_dim: int = 128):
        qconf = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
        self.tok = AutoTokenizer.from_pretrained(model_name, token=hf_token)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            trust_remote_code=True,
            quantization_config=qconf,
        )
        self.mem = DualSubstrateMemory(dim=mem_dim)

    def stream(self, text: str):
        for token in text.split():
            yield token

    def augment(self, prompt: str, tokens_now):
        recalls = []
        for token in tokens_now:
            score, ledger_flag = self.mem.query(token)
            recalls.append(f"<mem exact={int(ledger_flag)} p={score:.3f}>")
        tag = " ".join(recalls[:64])
        return f"{prompt}\n\n<memory>{tag}</memory>"

    def generate(self, prompt: str, max_new_tokens: int = 256, temperature: float = 0.2): # Removed do_sample from parameters
        for token in self.stream(prompt):
            self.mem.observe(token, 1.0)
        current_tokens = prompt.split()[-64:]
        augmented = self.augment(prompt, current_tokens)
        inputs = self.tok(augmented, return_tensors="pt").to(self.model.device)
        with torch.inference_mode():
            output = self.model.generate(
                **inputs,
                temperature=temperature,
                max_new_tokens=max_new_tokens,
                pad_token_id=self.tok.eos_token_id,
            )
        return self.tok.decode(output[0], skip_special_tokens=True)
PY

In [14]:
import json, os, time, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Ensure the path to the installed package is in PYTHONPATH
if "/content/p-adic-memory/src" not in os.environ.get("PYTHONPATH", ""):
    os.environ["PYTHONPATH"] = f"/content/p-adic-memory/src:{os.environ.get('PYTHONPATH', '')}"

from dual_substrate_adapter import DualSubstrateGenerator

MODEL_NAME = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'
HF_TOKEN = os.environ.get('HF_TOKEN')

dual = DualSubstrateGenerator(MODEL_NAME, hf_token=HF_TOKEN, mem_dim=128)

qconf = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
baseline_tok = AutoTokenizer.from_pretrained(MODEL_NAME, token=HF_TOKEN)
baseline_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=qconf,
)

def vanilla_generate(prompt: str, max_new_tokens: int = 128):
    inputs = baseline_tok(prompt, return_tensors="pt").to(baseline_model.device)
    with torch.inference_mode():
        out = baseline_model.generate(
            **inputs,
            do_sample=False,
            temperature=0.0,
            max_new_tokens=max_new_tokens,
            pad_token_id=baseline_tok.eos_token_id,
            eos_token_id=baseline_tok.eos_token_id,
        )
    return baseline_tok.decode(out[0], skip_special_tokens=True)

samples = [
    "In one sentence, summarise: Alice met Bob at 9:00. They discussed primes 2,3,5,7 and Möbius transforms.",
    "Only output: TIME=<time>; PRIME=<n>. What time and smallest prime from the log above?",
]

def run_eval(gen_fn):
    outputs = []
    for prompt in samples:
        start = time.time()
        response = gen_fn(prompt)
        latency = round(time.time() - start, 3)
        outputs.append({"prompt": prompt, "response": response, "latency_s": latency})
    return outputs

dual_records = run_eval(lambda p: dual.generate(p, max_new_tokens=128, temperature=0.0))
vanilla_records = run_eval(lambda p: vanilla_generate(p, max_new_tokens=128))

with open("/content/longbench_dual_substrate.json", "w") as f:
    json.dump(dual_records, f, indent=2)
with open("/content/longbench_baseline.json", "w") as f:
    json.dump(vanilla_records, f, indent=2)

print("Saved JSONs under /content/: longbench_dual_substrate.json & longbench_baseline.json")

Saved JSONs under /content/: longbench_dual_substrate.json & longbench_baseline.json


In [16]:
import json
from pathlib import Path

for name in ["longbench_dual_substrate.json", "longbench_baseline.json"]:
    path = Path("/content") / name
    if not path.exists():
        print(f"Missing {name}; run the harness cell above first.")
        continue
    with path.open() as f:
        data = json.load(f)
    print(f"\n{name} (records={len(data)}):")
    for item in data:
        snippet = item["prompt"][:48].replace("\n", " ")
        print("- prompt[:48]={!r} | latency={}".format(snippet, item.get("latency_s")))


longbench_dual_substrate.json (records=2):
- prompt[:48]='In one sentence, summarise: Alice met Bob at 9:0' | latency=0.289
- prompt[:48]='Only output: TIME=<time>; PRIME=<n>. What time a' | latency=7.071

longbench_baseline.json (records=2):
- prompt[:48]='In one sentence, summarise: Alice met Bob at 9:0' | latency=5.778
- prompt[:48]='Only output: TIME=<time>; PRIME=<n>. What time a' | latency=0.114


## 4. RULER evaluation


In [17]:
%%bash
cat > /content/ruler_adapter.py <<'PY'
import os
from dual_substrate_adapter import DualSubstrateGenerator

_model = None

def load_model():
    global _model
    if _model is None:
        name = os.environ.get("RULER_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        _model = DualSubstrateGenerator(name, hf_token=os.environ.get("HF_TOKEN"))
    return _model

def generate(prompt: str) -> str:
    model = load_model()
    return model.generate(prompt, max_new_tokens=256, temperature=0.2)
PY


In [18]:
import os, subprocess, sys

os.environ["PYTHONPATH"] = f"/content:{os.environ.get('PYTHONPATH', '')}"
os.environ["RULER_MODEL"] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

cmd = [
    sys.executable,
    '-m',
    'ruler.evaluate',
    '--model',
    'custom',
    '--custom_module',
    'ruler_adapter',
    '--tasks',
    'kv_retrieval',
    '--context_lengths',
    '4k,8k',
    '--num_samples',
    '50',
]

print('Running:', ' '.join(cmd))
completed = subprocess.run(cmd, capture_output=True, text=True)
print(completed.stdout)
print(completed.stderr)

with open('/content/ruler_dual_substrate.txt', 'w') as f:
    f.write(completed.stdout)

print('Saved:', '/content/ruler_dual_substrate.txt')


Running: /usr/bin/python3 -m ruler.evaluate --model custom --custom_module ruler_adapter --tasks kv_retrieval --context_lengths 4k,8k --num_samples 50

/usr/bin/python3: Error while finding module specification for 'ruler.evaluate' (ModuleNotFoundError: No module named 'ruler')

Saved: /content/ruler_dual_substrate.txt


In [19]:
# Optional vanilla RULER baseline using transformers only
%%bash
cat > /content/ruler_vanilla_adapter.py <<'PY'
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

_model = None
_tok = None


def load_model():
    global _model, _tok
    if _model is None or _tok is None:
        name = os.environ.get("RULER_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        qconf = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
        _tok = AutoTokenizer.from_pretrained(name, token=os.environ.get("HF_TOKEN"))
        _model = AutoModelForCausalLM.from_pretrained(
            name,
            device_map="auto",
            trust_remote_code=True,
            quantization_config=qconf,
        )
    return _tok, _model


def generate(prompt: str) -> str:
    tok, model = load_model()
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            **inputs,
            max_new_tokens=256,
            pad_token_id=tok.eos_token_id,
        )
    return tok.decode(output[0], skip_special_tokens=True)
PY


In [20]:
import subprocess, sys

cmd = [
    sys.executable,
    '-m',
    'ruler.evaluate',
    '--model',
    'custom',
    '--custom_module',
    'ruler_vanilla_adapter',
    '--tasks',
    'kv_retrieval',
    '--context_lengths',
    '4k,8k',
    '--num_samples',
    '50',
]

print('Running:', ' '.join(cmd))
completed = subprocess.run(cmd, capture_output=True, text=True)
print(completed.stdout)
print(completed.stderr)

with open('/content/ruler_baseline.txt', 'w') as f:
    f.write(completed.stdout)

print('Saved:', '/content/ruler_baseline.txt')


Running: /usr/bin/python3 -m ruler.evaluate --model custom --custom_module ruler_vanilla_adapter --tasks kv_retrieval --context_lengths 4k,8k --num_samples 50

/usr/bin/python3: Error while finding module specification for 'ruler.evaluate' (ModuleNotFoundError: No module named 'ruler')

Saved: /content/ruler_baseline.txt


## 5. Export and persist results


In [21]:
!ls -lh /content/*longbench*.json /content/*ruler* 2>/dev/null || true
!cp /content/longbench_*.json /content/ruler_* /content/drive/MyDrive/ 2>/dev/null || true


-rw-r--r-- 1 root root  609 Oct 14 13:57 /content/longbench_baseline.json
-rw-r--r-- 1 root root 1.6K Oct 14 13:57 /content/longbench_dual_substrate.json
-rw-r--r-- 1 root root  458 Oct 14 13:58 /content/ruler_adapter.py
-rw-r--r-- 1 root root    0 Oct 14 13:58 /content/ruler_baseline.txt
-rw-r--r-- 1 root root    0 Oct 14 13:58 /content/ruler_dual_substrate.txt
-rw-r--r-- 1 root root 1.2K Oct 14 13:58 /content/ruler_vanilla_adapter.py


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 6. Scaling plan

1. Swap `MODEL_NAME` to **mistralai/Mistral-7B-Instruct-v0.2** with 4-bit quantisation.
2. Increase LongBench `sample_size` (e.g., 25 → 100) and add tasks such as `LongBookSummEng` and additional QA tracks.
3. Extend RULER coverage to multi-hop and longer contexts once the pipeline is reliable.
4. Introduce vLLM for batching after verifying correctness with Transformers.
5. Maintain A/B JSON outputs (`baseline` vs `dual_substrate`) and track latency, VRAM, and accuracy deltas.


## 7. Troubleshooting tips

* **CUDA out-of-memory**: lower `max_new_tokens`, revert to the TinyLlama checkpoint, or ensure 4-bit loading is active.
* **Tokenizer errors**: set `pad_token_id` to `tok.eos_token_id`.
* **Authentication failures**: provide a Hugging Face token and request model access if required.
* **Dataset download issues**: run the dataset setup cells once with a stable internet connection.
* **Custom module not found**: confirm that `/content` is on `PYTHONPATH` before invoking RULER.


## 8. Publishing checklist

* Commit `dual_substrate_adapter.py`, `ruler_adapter.py`, and this notebook to a dedicated branch (e.g., `colab-benchmark/`).
* Archive JSON artefacts (`longbench_*.json`, `ruler_*.txt`) for baseline comparisons.
* Summarise the metrics in a short report covering recall, drift, latency, and energy usage.
