This bundle accompanies the paper. It contains everything needed to re-run the benchmarks and re-derive the tables and figures the paper cites. No internal planning notes or author identifiers are included.

The benchmark compares one Mixture-of-Experts (MoE) model
(OLMoE-1B-7B-0924-Instruct, 1.3B active / 6.9B total) against three
dense baselines (Llama-3.2-1B, Qwen2.5-1.5B, Gemma-2-2B), all
at Q4_K_M quantization through llama.cpp, on two devices:
- Apple MacBook Pro M2 Pro (16 GB) — consumer
- NVIDIA Jetson Orin Nano 8 GB at 15 W — edge
Reported metrics:
| Metric | M2 Pro | Jetson Orin Nano |
|---|---|---|
| Generation tokens/sec | yes | yes |
| Prefill tokens/sec | yes | yes |
| Peak RAM | yes | yes |
| Model file size | yes | yes |
| Joules per generated token | no (see below) | yes (tegrastats) |
| Peak SoC temperature | no | yes |
Energy is reported only on Jetson. macOS does not expose a power counter that survives review, so we leave M2 energy out rather than report a number we cannot account for.
anonymous_4open_science/
    README.md                           # this file
    HARDWARE.md                         # exact device configurations and build provenance
    requirements.txt                    # Python deps
    scripts/
        setup_llamacpp_m2.sh            # build llama.cpp on macOS
        setup_llamacpp_jetson.sh        # build llama.cpp on Jetson Orin Nano
        download_models.py              # fetch GGUFs from Hugging Face
        run_benchmark.py                # main benchmark runner
        parse_llamacpp_output.py        # parse llama.cpp's perf footer
        collect_system_metrics.py       # RSS / CPU sampling
        collect_jetson_energy.py        # tegrastats wrapper and parser
        analyze_results.py              # build summary tables and figures
        make_paper_figures.py           # render publication-quality PDFs
        run_all_experiments.sh          # end-to-end orchestrator
        patches/
            router_overhead_b4404.patch # llama.cpp patch for the §5.6 router-overhead measurement
    configs/
        models.yaml                     # the four GGUFs and their HF sources
        prompts.yaml                    # the 12 prompts (no third-party text)
        devices.yaml                    # per-device profiles (binary path, threads, energy)
        experiments.yaml                # experiment matrix
    data/
        raw/                            # one JSONL per measurement session (all committed)
        processed/
            model_hashes.txt            # SHA-256 of every downloaded GGUF
    results/
        benchmark_results.csv           # full long-form table
        summary_tables/                 # CSVs aggregated by analyze_results.py
        figures/                        # PNGs aggregated by analyze_results.py
        figures_pub/                    # publication PDFs from make_paper_figures.py
    notebooks/
        analysis.ipynb                  # interactive companion to analyze_results.py

- M2 Pro: macOS 13+, Xcode command-line tools, Python 3.10+, ~10 GB free disk.
- Jetson Orin Nano 8 GB: JetPack 6.x, CUDA toolkit installed,
  Python 3.10+, ~10 GB free disk on whichever drive holds
  models/. NVMe SSD strongly preferred over SD card.
- Optional: a Hugging Face account with huggingface-cli login configured, for faster model downloads (some quants are gated behind a click-through license).

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

On the M2 Pro:

./scripts/setup_llamacpp_m2.sh

On the Jetson Orin Nano:

sudo nvpmodel -m 0 # 15 W power mode (POWER_MODEL ID=0 on JetPack 6.2)
sudo jetson_clocks # lock clocks to max within the envelope
./scripts/setup_llamacpp_jetson.sh

Both scripts pin llama.cpp to tag b4404
(commit 0827b2c1da299805288abbd556d869318f2b121e) so results are
reproducible. The exact tag, compiler version, and binary SHA-256 are
recorded in HARDWARE.md.
python scripts/download_models.py --models-config configs/models.yaml --out models/

This populates models/<id>/<filename> and appends SHA-256 hashes to
data/processed/model_hashes.txt. Hashes already in this bundle are the
bytes used for every reported number; compare against your downloads if
you want to verify.
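
One way to do that comparison is sketched below. It assumes model_hashes.txt uses the sha256sum layout (`<hex digest>  <relative path>` per line), which this README does not confirm; adjust `parse_hash_manifest` if the file is laid out differently. The function names are illustrative and are not part of the bundle's scripts.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-GB GGUFs never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_hash_manifest(text):
    """Parse '<hex digest>  <path>' lines (ASSUMED layout) into {path: digest}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, path = line.split(maxsplit=1)
        # A later line for the same path overwrites an earlier one,
        # since the manifest is appended to on each download.
        expected[path] = digest.lower()
    return expected


def find_mismatches(manifest_text, root="."):
    """Return manifest paths whose file is missing or whose digest differs."""
    bad = []
    for rel_path, digest in parse_hash_manifest(manifest_text).items():
        target = Path(root) / rel_path
        if not target.is_file() or sha256_of(target) != digest:
            bad.append(rel_path)
    return bad
```

Calling `find_mismatches(Path("data/processed/model_hashes.txt").read_text())` from the bundle root should return an empty list when every download matches, provided the paths in the manifest are relative to that root.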
./scripts/run_all_experiments.sh --device m2 --smoke
# or, on Jetson:
./scripts/run_all_experiments.sh --device jetson --smoke

The smoke run executes a single short prompt against each of the four models and exits.

On each device:
./scripts/run_all_experiments.sh --device m2
./scripts/run_all_experiments.sh --device jetson

This runs Experiment 1 (4 models × 12 prompts × 3 reps) plus Experiment 2
(sequence-length sweep). On Jetson, tegrastats is logged in parallel
for the energy measurements. Expect roughly 60–90 minutes per device for
Experiment 1 on a warm build.
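
For readers who want to sanity-check the joules-per-token column, the arithmetic can be sketched as follows. This is not necessarily what collect_jetson_energy.py does: it assumes each tegrastats line carries a `VDD_IN <instantaneous>mW/<average>mW` field, that samples arrive at a fixed interval, and that no idle baseline is subtracted. The function names are hypothetical.

```python
import re

# ASSUMED field layout: "VDD_IN 4986mW/4986mW" (instantaneous/average).
VDD_IN = re.compile(r"VDD_IN (\d+)mW/(\d+)mW")


def vdd_in_milliwatts(lines):
    """Extract the instantaneous VDD_IN reading (mW) from each log line."""
    readings = []
    for line in lines:
        match = VDD_IN.search(line)
        if match:
            readings.append(int(match.group(1)))
    return readings


def joules_per_token(power_mw, interval_s, n_tokens):
    """Rectangle-rule integral of power over fixed-interval samples,
    divided by the number of generated tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    energy_j = sum(power_mw) / 1000.0 * interval_s  # mW -> W, then W * s = J
    return energy_j / n_tokens
```

The `interval_s` argument has to match the sampling interval tegrastats was started with; a mismatch scales the result linearly.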
python scripts/analyze_results.py \
--raw-dir data/raw \
--output-dir results

This writes results/benchmark_results.csv, populates
results/summary_tables/, and renders results/figures/.
For the publication-quality PDFs used in the paper:
python scripts/make_paper_figures.py \
--summary-dir results/summary_tables \
--output-dir results/figures_pub

STAMP=$(date -u +%Y%m%dT%H%M%SZ)
for t in 4 6 8 10; do
python scripts/run_benchmark.py \
--device m2 \
--experiment experiment_a_cpu_thread_sweep \
--threads-override $t \
--output data/raw/run_${STAMP}_m2_thread_sweep_t${t}.jsonl
done

144 runs (~10–15 min on M2). Re-run analyze_results.py to refresh
results/summary_tables/cpu_thread_sweep_summary.csv,
results/summary_tables/thread_optimum_summary.csv, and
results/figures/cpu_thread_sweep.png.
Build the patched llama-cli in a separate worktree so the unpatched
binary that produces headline numbers is unaffected:
# On M2:
git -C llama.cpp worktree add -d ../llama.cpp-patched 0827b2c1d
cd llama.cpp-patched
git apply ../scripts/patches/router_overhead_b4404.patch
cmake -B build -S . \
-DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli -j 8
# On Jetson:
git -C llama.cpp worktree add -d ../llama.cpp-patched 0827b2c1d
cd llama.cpp-patched
git apply ../scripts/patches/router_overhead_b4404.patch
cmake -B build -S . \
-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli -j 6

Then run the patched OLMoE benchmark:

STAMP=$(date -u +%Y%m%dT%H%M%SZ)
python scripts/run_benchmark.py \
--device m2_patched \
--experiment experiment_1_efficiency \
--only-models olmoe-1b-7b-0924-instruct-q4km \
--output data/raw/run_${STAMP}_m2_patched_router_overhead.jsonl
# Same with --device jetson_patched on Jetson.

The runner sets LLAMA_ROUTER_OVERHEAD=1 automatically when the device
profile is m2_patched or jetson_patched. The patched binary observes
every node in the decode graph, which makes it 14.9× slower on Metal
and 1.26× slower on CUDA (numbers in HARDWARE.md). That slowdown is
acceptable for this measurement: the within-MoE routing share is the
ratio routing_ms / (routing_ms + expert_ffn_ms), and per-node sync
inflates numerator and denominator symmetrically, so the ratio survives.
The patched binary is never used for headline throughput numbers.
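
The arithmetic behind that argument can be checked with a few lines. The sketch below uses made-up timings, not measured values, and only illustrates the ratio: if both terms are inflated by the same factor, the share is unchanged, whereas unequal inflation would shift it. Whether the real overhead is symmetric is the claim made above, not something this snippet establishes.

```python
def routing_share(routing_ms, expert_ffn_ms):
    """Within-MoE routing share: routing_ms / (routing_ms + expert_ffn_ms)."""
    total = routing_ms + expert_ffn_ms
    if total <= 0:
        raise ValueError("total time must be positive")
    return routing_ms / total


# Hypothetical timings, for illustration only.
baseline = routing_share(2.0, 8.0)

# Same slowdown factor on both terms: the share is preserved.
uniform = routing_share(2.0 * 4, 8.0 * 4)

# Different slowdown factors: the share moves, so symmetry matters.
skewed = routing_share(2.0 * 4, 8.0 * 2)

print(baseline, uniform, skewed)
```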
The paper reads the results along two axes:
- Active-parameter axis: OLMoE (1.3B active) against Llama-3.2-1B and Qwen2.5-1.5B. Asks whether sparse activation pays off at matched active count.
- Memory-footprint axis: OLMoE (~4.2 GB total) against Gemma-2-2B (~1.6 GB) and the smaller dense baselines. Asks the practical question: given a fixed memory budget, is MoE worth it on edge hardware?
Both axes are reported. Looking at only one of them gives a misleading answer.
- OLMoE Q4_K_M with a long (1000–1500 word) prompt can OOM on the 8 GB Jetson. We report this rather than work around it.
- All numbers are tied to llama.cpp at a specific commit. Other backends (vLLM, MLC-LLM, TensorRT-LLM) will give different absolute numbers, though we expect the qualitative ordering to hold.
- Two devices is not a hardware sweep. The paper bounds its claims accordingly.

You can inspect the results without re-running anything:
- data/raw/run_*.jsonl: every measurement run, one line per (model, prompt, repetition).
- data/processed/model_hashes.txt: SHA-256 of every model used.
- results/benchmark_results.csv: the same JSONL data joined into one long-form table.
- results/summary_tables/*.csv: the aggregated tables the paper cites.
- results/figures/*.png: exploratory figures.
- results/figures_pub/*.pdf: publication-quality figures used in the paper.

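A quick look at the raw JSONL needs only the standard library. The sketch below is a minimal example under stated assumptions: the key names `model` and `gen_tokens_per_sec` are placeholders, since this README does not list the record schema. Inspect one line of a run file and pass the real keys.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def read_jsonl(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def mean_by_model(records, model_key="model", value_key="gen_tokens_per_sec"):
    """Average one numeric field per model; both key names are ASSUMED."""
    grouped = defaultdict(list)
    for rec in records:
        if model_key in rec and value_key in rec:
            grouped[rec[model_key]].append(rec[value_key])
    return {model: mean(values) for model, values in grouped.items()}


def summarize_dir(raw_dir="data/raw", **keys):
    """Apply mean_by_model to every run_*.jsonl file under raw_dir."""
    summary = {}
    for path in sorted(Path(raw_dir).glob("run_*.jsonl")):
        with open(path, encoding="utf-8") as fh:
            summary[path.name] = mean_by_model(read_jsonl(fh), **keys)
    return summary
```

Records that lack either key are skipped rather than raising, so a wrong key name shows up as an empty result instead of an exception.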
Code: MIT. Data and figures: CC-BY-4.0. The model weights are downloaded from Hugging Face and remain under their respective licenses (see each model card).