# NEXUS‑EMB‑240M‑NSA — Colab Starter
Edge‑first dual‑head embeddings with **Neural Spectral Anchoring** (NSA) and Residual Hashing Bridge.

Run the cells from top to bottom. All training and evaluation instructions are in **English** so the training side stays consistent.


In [None]:
# Runtime setup
!pip -q install torch==2.4.0 transformers==4.44.2 sentencepiece==0.2.0 einops==0.8.0 \
           faiss-cpu==1.8.0.post1 onnxruntime-gpu==1.19.2 accelerate==0.34.2 bitsandbytes==0.43.3
print("Environment ready.")

In [None]:
# Upload the ZIP you downloaded from ChatGPT ('nexus_emb_240m_nsa.zip')
from google.colab import files
up = files.upload()
zip_name = next(iter(up.keys()))
print("Uploaded:", zip_name)

import os, zipfile
target_dir = "/content"
with zipfile.ZipFile(zip_name, 'r') as z:
    z.extractall(target_dir)
print("Extracted to:", target_dir)

# Find repo root
import glob
candidates = glob.glob("/content/**/nexus_emb_240m_nsa", recursive=True)
repo_root = candidates[0] if candidates else "/content/nexus_emb_240m_nsa"
print("Repo root:", repo_root)

In [None]:
# List the extracted files
!ls -R "$repo_root"


## 1) Train SentencePiece tokenizer (48k)
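Replace `demo.txt` with your own corpus manifests when ready. The demo corpus only repeats four sentences, so it is far too small to support a real 48k vocabulary; SentencePiece may reject `--vocab 48000` on it unless the script relaxes the vocabulary limit.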


In [None]:
from pathlib import Path

# Create a tiny placeholder corpus so the tokenizer script has something to read.
corpus_dir = Path("/content/corpus")
corpus_dir.mkdir(parents=True, exist_ok=True)
demo_text = "\n".join([
  "This is a small demo corpus for tokenizer bootstrap.",
  "BTC price action and mempool dynamics matter.",
  "ENTSO-E grid load datasets power RAG pipelines.",
  "JVL EtherCAT ROS2 robotics integration notes."
]*2000)
(corpus_dir/"demo.txt").write_text(demo_text, encoding="utf-8")

# Train tokenizer
!python "$repo_root/scripts/build_tokenizer.py" --corpus "/content/corpus/demo.txt" --vocab 48000 --out_prefix "/content/tokenizer_spm_48k"
print("Tokenizer model exists:", Path("/content/tokenizer_spm_48k.model").exists())
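
Before raising the vocabulary size on a real corpus, it can help to check how much variety the corpus actually contains, since SentencePiece cannot produce more pieces than the text supports. The sketch below is an illustration only: `corpus_stats` is a hypothetical helper that is not part of the repo, and it counts lines and whitespace-separated words using the standard library. It rebuilds the demo corpus in a temporary file so it runs on its own.

```python
import tempfile
from pathlib import Path


def corpus_stats(path):
    """Return (total_lines, distinct_lines, distinct_words) for a UTF-8 text file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    distinct_words = {word for line in lines for word in line.split()}
    return len(lines), len(set(lines)), len(distinct_words)


# Same four sentences as the demo corpus in the cell above.
sentences = [
    "This is a small demo corpus for tokenizer bootstrap.",
    "BTC price action and mempool dynamics matter.",
    "ENTSO-E grid load datasets power RAG pipelines.",
    "JVL EtherCAT ROS2 robotics integration notes.",
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    demo_path.write_text("\n".join(sentences * 2000), encoding="utf-8")
    total, distinct, words = corpus_stats(demo_path)

print(f"lines={total} distinct_lines={distinct} distinct_words={words}")
```

A corpus with only a few dozen distinct words is a strong hint that a 48k vocabulary is out of reach. In the notebook, the same helper can be pointed at `/content/corpus/demo.txt` or at your own corpus file.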

## 2) Bootstrap training
This step trains on the bundled `data/demo_pairs.jsonl`. Replace it with your own mined pairs later.


In [None]:
CFG = f"{repo_root}/configs/nexus_emb_240m.json"
PAIRS = f"{repo_root}/data/demo_pairs.jsonl"
SPM = "/content/tokenizer_spm_48k.model"

!python "$repo_root/scripts/train.py" \
  --config "$CFG" \
  --pairs "$PAIRS" \
  --tokenizer_model "$SPM" \
  --batch 64 --max_len 128 --steps 1000 --lr 2e-3 --wd 0.05 --save_dir "/content/ckpts"
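
The notebook does not show which loss `train.py` optimizes. For orientation only, a common objective for paired text data is an in-batch contrastive (InfoNCE) loss, where each query's own partner is the positive and every other row in the batch serves as a negative. The NumPy sketch below illustrates that idea under stated assumptions: `info_nce` and the temperature of 0.05 are illustrative choices, not values taken from the repo.

```python
import numpy as np


def info_nce(queries, positives, temperature=0.05):
    """In-batch InfoNCE: row i of `positives` is the match for row i of `queries`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                     # (batch, batch) scaled cosines
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # cross-entropy, diagonal as targets


rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
matched = anchors + 0.1 * rng.normal(size=(8, 16))  # noisy copies act as positives
mismatched = np.roll(matched, shift=1, axis=0)      # every row paired with the wrong partner

loss_matched = info_nce(anchors, matched)
loss_mismatched = info_nce(anchors, mismatched)
print("matched:", loss_matched, "mismatched:", loss_mismatched)
```

The loss is low when matching rows are the most similar pair in the batch and high when they are not, which is the signal pair-based training relies on.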


## 3) Tiny evaluation (sanity check)
This is a sanity check only: it verifies that positive pairs score closer than random negatives.


In [None]:
!python "$repo_root/scripts/eval_mteb_lite.py" \
  --config "$CFG" \
  --tokenizer_model "$SPM"
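
The check described above can be sketched in a few lines of NumPy. This is an assumption about the general shape of such a test, not the contents of `eval_mteb_lite.py`: `pair_sanity` is a hypothetical helper, and it uses rows shifted by one position as a deterministic stand-in for random negatives. The embeddings here are synthetic.

```python
import numpy as np


def pair_sanity(emb_a, emb_b):
    """Compare cosine of matched rows against rows shifted by one (used as negatives)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    pos = (a * b).sum(axis=1)                            # cosine of each true pair
    neg = (a * np.roll(b, shift=1, axis=0)).sum(axis=1)  # cosine with a non-matching row
    return {
        "pos_mean": float(pos.mean()),
        "neg_mean": float(neg.mean()),
        "win_rate": float((pos > neg).mean()),           # share of pairs beating their negative
    }


# Synthetic stand-in for real embeddings: positives are noisy copies of the anchors.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(32, 64))
emb_b = emb_a + 0.1 * rng.normal(size=(32, 64))

report = pair_sanity(emb_a, emb_b)
print(report)
```

With real model outputs, a `win_rate` near 0.5 would suggest the embeddings carry no pair signal yet, for example because the weights are untrained.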


## 4) Export ONNX (int8‑ready)
Exports a runtime‑friendly graph for CPU/GPU inference.


In [None]:
!python "$repo_root/scripts/export_onnx.py" \
  --config "$CFG" \
  --out "/content/nexus_emb_240m_nsa.onnx" \
  --seq_len 128

import os
print("ONNX exists:", os.path.exists("/content/nexus_emb_240m_nsa.onnx"))
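
The heading calls the export int8-ready, and whether `export_onnx.py` quantizes anything itself is not shown here. One possible follow-up is post-training dynamic quantization with ONNX Runtime's `quantize_dynamic`. The sketch assumes the exported graph uses operators the dynamic quantizer supports and that the `onnx` package is installed alongside `onnxruntime` (it is not in the pip cell above). `int8_path` and `quantize_to_int8` are hypothetical helpers, and the `.int8.onnx` naming is an arbitrary choice.

```python
import os


def int8_path(onnx_path):
    """Derive the output name, e.g. model.onnx -> model.int8.onnx."""
    root, ext = os.path.splitext(onnx_path)
    return root + ".int8" + ext


def quantize_to_int8(onnx_path):
    """Write a dynamically quantized copy (int8 weights) next to the original graph."""
    # Imported lazily so these helpers load even where onnxruntime is not installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    out_path = int8_path(onnx_path)
    quantize_dynamic(onnx_path, out_path, weight_type=QuantType.QInt8)
    return out_path


print(int8_path("/content/nexus_emb_240m_nsa.onnx"))
```

In this notebook the call would be `quantize_to_int8("/content/nexus_emb_240m_nsa.onnx")`. Compare embeddings from both graphs on a few sentences before relying on the quantized one, since accuracy loss depends on the model.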

## 5) Quick embedding demo
Encodes a few sentences and prints cosine similarities.


In [None]:
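import json
import sys

import numpy as np
import sentencepiece as spm
import torch

sys.path.append(repo_root)
from src.model import NexusEmb240MNSA

with open(CFG) as f:
    cfg = json.load(f)

# SPM points at a raw SentencePiece .model file, so load it with sentencepiece directly;
# AutoTokenizer.from_pretrained expects a model directory or Hub ID, not a bare .model path.
tok = spm.SentencePieceProcessor(model_file=SPM)
# Assumption: fall back to ID 0 for padding when the tokenizer defines no pad piece.
pad_id = tok.pad_id() if tok.pad_id() >= 0 else 0

device = "cuda" if torch.cuda.is_available() else "cpu"
# Note: this builds the model from the config only. Load trained weights from
# /content/ckpts before reading anything into the similarities below.
model = NexusEmb240MNSA(cfg).to(device).eval()

def embed(texts, max_len=128):
    """Tokenize, right-pad to the longest sequence, and return embeddings as a NumPy array."""
    ids = [tok.encode(t)[:max_len] for t in texts]
    width = max(len(x) for x in ids)
    batch = torch.full((len(ids), width), pad_id, dtype=torch.long)
    for i, x in enumerate(ids):
        batch[i, :len(x)] = torch.tensor(x, dtype=torch.long)
    with torch.no_grad():
        emb, *_ = model(batch.to(device))
    return emb.cpu().numpy()

A = ["BTC funding rate spikes with open interest rise.",
     "ENTSO-E publishes European grid load datasets."]
B = ["Open interest changes often drive funding rate changes.",
     "European grid load data is available via ENTSO-E datasets."]

Ea, Eb = embed(A), embed(B)

def cos(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

print("cos(A[0],B[0]) =", cos(Ea[0], Eb[0]))
print("cos(A[1],B[1]) =", cos(Ea[1], Eb[1]))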