Reproducibility bundle — MoE vs dense LLM inference on consumer and edge hardware

This bundle accompanies the paper. It contains everything needed to re-run the benchmarks and regenerate the tables and figures the paper cites. It includes no internal planning notes or author identifiers.

What the study measures

The benchmark compares one Mixture-of-Experts (MoE) model (OLMoE-1B-7B-0924-Instruct, 1.3B active / 6.9B total) against three dense baselines (Llama-3.2-1B, Qwen2.5-1.5B, Gemma-2-2B), all at Q4_K_M quantization through llama.cpp, on two devices:

Apple MacBook Pro M2 Pro (16 GB) — consumer
NVIDIA Jetson Orin Nano 8 GB at 15 W — edge

Reported metrics:

Metric	M2 Pro	Jetson Orin Nano
Generation tokens/sec	yes	yes
Prefill tokens/sec	yes	yes
Peak RAM	yes	yes
Model file size	yes	yes
Joules per generated token	no (see below)	yes (`tegrastats`)
Peak SoC temperature	no	yes

Energy is reported only on the Jetson. macOS does not expose a power counter that we could defend under review, so M2 Pro energy is omitted rather than reported as a number we cannot account for.
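As a rough illustration of how a joules-per-token figure can be derived from tegrastats samples, the sketch below averages the instantaneous VDD_IN reading and multiplies by the wall-clock generation time. It is not the logic of collect_jetson_energy.py: the helper names, the assumed "VDD_IN <instant>mW/<average>mW" fragment, and the sample lines are illustrative assumptions, and any idle-power subtraction the paper may apply is not modelled.

```python
import re

# Assumed tegrastats fragment: "VDD_IN <instant>mW/<average>mW".
VDD_IN = re.compile(r"VDD_IN (\d+)mW/(\d+)mW")


def mean_power_watts(lines):
    """Average the instantaneous VDD_IN readings (mW) and return watts."""
    samples = [int(m.group(1)) for m in map(VDD_IN.search, lines) if m]
    if not samples:
        raise ValueError("no VDD_IN samples found")
    return sum(samples) / len(samples) / 1000.0


def joules_per_token(lines, duration_s, n_tokens):
    """Energy = mean power x generation time, divided by tokens generated."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return mean_power_watts(lines) * duration_s / n_tokens


# Illustrative lines only, not measurements from the paper.
example = [
    "RAM 3012/7620MB CPU [12%@1510] VDD_IN 6000mW/5900mW",
    "RAM 3020/7620MB CPU [15%@1510] VDD_IN 8000mW/6600mW",
]
print(joules_per_token(example, duration_s=10.0, n_tokens=100))  # prints 0.7
```

Averaging power and multiplying by duration assumes evenly spaced samples. With irregular sampling, integrating over timestamps would be the safer choice.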

Bundle layout

anonymous_4open_science/
  README.md                       # this file
  HARDWARE.md                     # exact device configurations and build provenance
  requirements.txt                # Python deps
  scripts/
    setup_llamacpp_m2.sh          # build llama.cpp on macOS
    setup_llamacpp_jetson.sh      # build llama.cpp on Jetson Orin Nano
    download_models.py            # fetch GGUFs from Hugging Face
    run_benchmark.py              # main benchmark runner
    parse_llamacpp_output.py      # parse llama.cpp's perf footer
    collect_system_metrics.py     # RSS / CPU sampling
    collect_jetson_energy.py      # tegrastats wrapper and parser
    analyze_results.py            # build summary tables and figures
    make_paper_figures.py         # render publication-quality PDFs
    run_all_experiments.sh        # end-to-end orchestrator
    patches/
      router_overhead_b4404.patch # llama.cpp patch for the §5.6 router-overhead measurement
  configs/
    models.yaml                   # the four GGUFs and their HF sources
    prompts.yaml                  # the 12 prompts (no third-party text)
    devices.yaml                  # per-device profiles (binary path, threads, energy)
    experiments.yaml              # experiment matrix
  data/
    raw/                          # one JSONL per measurement session (all committed)
    processed/
      model_hashes.txt            # SHA-256 of every downloaded GGUF
  results/
    benchmark_results.csv         # full long-form table
    summary_tables/               # CSVs aggregated by analyze_results.py
    figures/                      # PNGs rendered by analyze_results.py
    figures_pub/                  # publication PDFs from make_paper_figures.py
  notebooks/
    analysis.ipynb                # interactive companion to analyze_results.py

Reproducing the experiments

0. Prerequisites

M2 Pro: macOS 13+, Xcode command-line tools, Python 3.10+, ~10 GB free disk.
Jetson Orin Nano 8 GB: JetPack 6.x, CUDA toolkit installed, Python 3.10+, ~10 GB free disk on whichever drive holds models/. NVMe SSD strongly preferred over SD card.
Optional: a Hugging Face account with huggingface-cli login configured. It speeds up model downloads, and it is needed for any quant that is gated behind a click-through license.

1. Install Python dependencies

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Build llama.cpp

On the M2 Pro:

./scripts/setup_llamacpp_m2.sh

On the Jetson Orin Nano:

sudo nvpmodel -m 0      # 15 W power mode (POWER_MODEL ID=0 on JetPack 6.2)
sudo nvpmodel -q        # confirm the active power mode before benchmarking
sudo jetson_clocks      # lock clocks to max within the envelope
./scripts/setup_llamacpp_jetson.sh

Both scripts pin llama.cpp to tag b4404 (commit 0827b2c1da299805288abbd556d869318f2b121e) so results are reproducible. The exact tag, compiler version, and binary SHA-256 are recorded in HARDWARE.md.

3. Download the models (Q4_K_M GGUF, ~7.6 GB total)

python scripts/download_models.py --models-config configs/models.yaml --out models/

This populates models/<id>/<filename> and appends SHA-256 hashes to data/processed/model_hashes.txt. The hashes already in this bundle identify the exact files used for every reported number; compare them against your downloads to verify that you are benchmarking the same bytes.
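A minimal sketch of that comparison follows. It assumes model_hashes.txt uses the sha256sum layout (hex digest, whitespace, path) and matches files by base name. Both are assumptions about the file format, and the function names are hypothetical rather than part of download_models.py.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so multi-GB GGUFs do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_hash_lines(text):
    """Assumed format per line: '<hex digest>  <path>' (sha256sum style)."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # Key by base name; a leading '*' marks binary mode in sha256sum output.
            expected[Path(name.strip().lstrip("*")).name] = digest.lower()
    return expected


def verify(models_dir, hashes_file):
    """Return {file name: True/False} for every .gguf found under models_dir."""
    expected = parse_hash_lines(Path(hashes_file).read_text())
    results = {}
    for gguf in sorted(Path(models_dir).rglob("*.gguf")):
        want = expected.get(gguf.name)
        results[gguf.name] = want is not None and sha256_of(gguf) == want
    return results
```

Calling verify("models", "data/processed/model_hashes.txt") from the bundle root would then report, per file, whether the local download matches a recorded hash. A file with no recorded hash is reported as False.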

4. Smoke test

./scripts/run_all_experiments.sh --device m2 --smoke
# or, on Jetson:
./scripts/run_all_experiments.sh --device jetson --smoke

The smoke run executes a single short prompt against each of the four models and exits.

5. Run the full benchmark

On each device:

./scripts/run_all_experiments.sh --device m2
./scripts/run_all_experiments.sh --device jetson

This runs Experiment 1 (4 models × 12 prompts × 3 reps) plus Experiment 2 (sequence-length sweep). On Jetson, tegrastats is logged in parallel for the energy measurements. Expect roughly 60–90 minutes per device for Experiment 1 on a warm build.
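For orientation, the Experiment 1 matrix can be enumerated as below. The identifiers are placeholders (the real ones live in configs/models.yaml and configs/prompts.yaml); only the counts of 4 models, 12 prompts, and 3 repetitions come from this README.

```python
from itertools import product

# Placeholder identifiers, not the names used in the configs.
models = ["olmoe", "llama-3.2-1b", "qwen2.5-1.5b", "gemma-2-2b"]
prompts = [f"prompt_{i:02d}" for i in range(1, 13)]
reps = range(1, 4)

runs = list(product(models, prompts, reps))
print(len(runs))  # 144

# Average time budget per run implied by the 60 to 90 minute estimate.
for minutes in (60, 90):
    print(minutes, "min ->", minutes * 60 / len(runs), "s per run")
```

The 60 to 90 minute estimate therefore corresponds to roughly 25 to 37.5 seconds per run on average, which is a useful sanity check if a session runs much longer.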

6. Regenerate tables and figures

python scripts/analyze_results.py \
  --raw-dir data/raw \
  --output-dir results

This writes results/benchmark_results.csv, populates results/summary_tables/, and renders results/figures/.

For the publication-quality PDFs used in the paper:

python scripts/make_paper_figures.py \
  --summary-dir results/summary_tables \
  --output-dir  results/figures_pub

7. Optional — CPU-thread sensitivity sub-experiment (M2 only)

STAMP=$(date -u +%Y%m%dT%H%M%SZ)
for t in 4 6 8 10; do
  python scripts/run_benchmark.py \
    --device m2 \
    --experiment experiment_a_cpu_thread_sweep \
    --threads-override $t \
    --output data/raw/run_${STAMP}_m2_thread_sweep_t${t}.jsonl
done

144 runs (~10–15 min on M2). Re-run analyze_results.py to refresh results/summary_tables/cpu_thread_sweep_summary.csv, results/summary_tables/thread_optimum_summary.csv, and results/figures/cpu_thread_sweep.png.

8. Optional — Router-overhead measurement (OLMoE only)

Build the patched llama-cli in a separate worktree so the unpatched binary that produces headline numbers is unaffected:

# On M2 (run from the bundle root):
git -C llama.cpp worktree add -d ../llama.cpp-patched 0827b2c1d
cd llama.cpp-patched
git apply ../scripts/patches/router_overhead_b4404.patch
cmake -B build -S . \
  -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli -j 8
cd ..   # back to the bundle root for run_benchmark.py

# On Jetson (run from the bundle root):
git -C llama.cpp worktree add -d ../llama.cpp-patched 0827b2c1d
cd llama.cpp-patched
git apply ../scripts/patches/router_overhead_b4404.patch
cmake -B build -S . \
  -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli -j 6
cd ..   # back to the bundle root for run_benchmark.py

Then run the patched OLMoE benchmark:

STAMP=$(date -u +%Y%m%dT%H%M%SZ)
python scripts/run_benchmark.py \
  --device m2_patched \
  --experiment experiment_1_efficiency \
  --only-models olmoe-1b-7b-0924-instruct-q4km \
  --output data/raw/run_${STAMP}_m2_patched_router_overhead.jsonl
# Same with --device jetson_patched on Jetson.

The runner sets LLAMA_ROUTER_OVERHEAD=1 automatically when the device profile is m2_patched or jetson_patched. The patched binary observes every node in the decode graph, which makes it 14.9× slower on Metal and 1.26× slower on CUDA (numbers in HARDWARE.md). That slowdown is acceptable for this measurement: the within-MoE routing share is the ratio routing_ms / (routing_ms + expert_ffn_ms), and per-node sync inflates routing_ms and expert_ffn_ms by a similar factor, which cancels in the ratio. The patched binary is never used for headline throughput numbers.
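The invariance argument can be checked numerically. In the sketch below the timings are hypothetical, and only the 14.9× factor is taken from this README. A uniform slowdown leaves the routing share unchanged, while a non-uniform one shifts it, which is why the measurement rests on both terms being inflated alike.

```python
import math


def routing_share(routing_ms, expert_ffn_ms):
    """Within-MoE routing share: routing_ms / (routing_ms + expert_ffn_ms)."""
    total = routing_ms + expert_ffn_ms
    if total <= 0:
        raise ValueError("timings must sum to a positive value")
    return routing_ms / total


# Hypothetical per-token timings in milliseconds.
routing_ms, expert_ffn_ms = 2.0, 18.0
baseline = routing_share(routing_ms, expert_ffn_ms)

# Uniform slowdown: the factor cancels between numerator and denominator.
k = 14.9
uniform = routing_share(k * routing_ms, k * expert_ffn_ms)
print(math.isclose(baseline, uniform))  # True

# Non-uniform slowdown: the share moves (here from 0.10 to about 0.18).
skewed = routing_share(3.0 * routing_ms, 1.5 * expert_ffn_ms)
print(round(skewed, 2))
```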

Interpreting the outputs

The paper reads the results along two axes:

Active-parameter axis: OLMoE (1.3B active) against Llama-3.2-1B and Qwen2.5-1.5B. Asks whether sparse activation pays off at matched active count.
Memory-footprint axis: OLMoE (~4.2 GB total) against Gemma-2-2B (~1.6 GB) and the smaller dense baselines. Asks the practical question: given a fixed memory budget, is MoE worth it on edge hardware?

Both axes are reported. Looking at only one of them gives a misleading answer.

Known constraints

OLMoE Q4_K_M with a long (1000–1500 word) prompt can OOM on the 8 GB Jetson. We report this rather than work around it.
All numbers are tied to llama.cpp at a specific commit. Other backends (vLLM, MLC-LLM, TensorRT-LLM) will give different absolute numbers, though we expect the qualitative ordering to hold.
Two devices is not a hardware sweep. The paper bounds its claims accordingly.

Pre-computed artifacts already in the bundle

You can inspect the results without re-running anything:

data/raw/run_*.jsonl — every measurement run, one line per (model, prompt, repetition).
data/processed/model_hashes.txt — SHA-256 of every model used.
results/benchmark_results.csv — the same JSONL data joined into one long-form table.
results/summary_tables/*.csv — the aggregated tables the paper cites.
results/figures/*.png — exploratory figures.
results/figures_pub/*.pdf — publication-quality figures used in the paper.
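To inspect the raw JSONL directly, without going through analyze_results.py, a small loader like the following may help. It is a sketch: the record field names are not documented in this README, so the keys passed to mean_by are the caller's responsibility, and any names used in the usage note are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def load_runs(raw_dir):
    """Read every run_*.jsonl under raw_dir into a list of dicts."""
    records = []
    for path in sorted(Path(raw_dir).glob("run_*.jsonl")):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():  # skip blank lines
                    records.append(json.loads(line))
    return records


def mean_by(records, group_key, value_key):
    """Mean of value_key per group_key, skipping records missing either key."""
    groups = defaultdict(list)
    for rec in records:
        if group_key in rec and value_key in rec:
            groups[rec[group_key]].append(rec[value_key])
    return {key: mean(values) for key, values in groups.items()}
```

For example, mean_by(load_runs("data/raw"), "model_id", "gen_tokens_per_sec") would give a per-model mean, assuming the records carry fields with those (hypothetical) names. Check one line of a JSONL file for the actual keys first.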

License

Code: MIT. Data and figures: CC-BY-4.0. The model weights are downloaded from Hugging Face and remain under their respective licenses (see each model card).
