Analytics-Everywhere-Lab/edge-moe

Reproducibility bundle — MoE vs dense LLM inference on consumer and edge hardware

This bundle accompanies the paper. It contains everything needed to re-run the benchmarks and re-derive the tables and figures the paper cites. No internal planning notes or author identifiers are included.

What the study measures

The benchmark compares one Mixture-of-Experts (MoE) model (OLMoE-1B-7B-0924-Instruct, 1.3B active / 6.9B total) against three dense baselines (Llama-3.2-1B, Qwen2.5-1.5B, Gemma-2-2B), all at Q4_K_M quantization through llama.cpp, on two devices:

  • Apple MacBook Pro M2 Pro (16 GB) — consumer
  • NVIDIA Jetson Orin Nano 8 GB at 15 W — edge

Reported metrics:

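| Metric                     | M2 Pro         | Jetson Orin Nano |
|----------------------------|----------------|------------------|
| Generation tokens/sec      | yes            | yes              |
| Prefill tokens/sec         | yes            | yes              |
| Peak RAM                   | yes            | yes              |
| Model file size            | yes            | yes              |
| Joules per generated token | no (see below) | yes (tegrastats) |
| Peak SoC temperature       | no             | yes              |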

Energy is reported only on Jetson. macOS does not expose a power counter whose readings we could defend under review, so we omit M2 energy rather than report a number we cannot account for.
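The joules-per-token metric can be derived from tegrastats power samples roughly as follows. This is a minimal sketch, not the bundle's collect_jetson_energy.py. It assumes the board-input rail appears in each log line as `VDD_IN <instant>mW/<average>mW`, that samples arrive at a fixed interval, and that no idle baseline is subtracted. The function name and its arguments are illustrative.

```python
import re

# Assumed rail format in a tegrastats line: "VDD_IN 5080mW/5080mW"
# (instantaneous / running average). Only the instantaneous value is used.
VDD_IN_RE = re.compile(r"VDD_IN (\d+)mW/(\d+)mW")


def joules_per_token(tegrastats_lines, interval_s, n_generated_tokens):
    """Integrate VDD_IN power over the run and divide by generated tokens."""
    samples_mw = []
    for line in tegrastats_lines:
        match = VDD_IN_RE.search(line)
        if match:
            samples_mw.append(int(match.group(1)))
    if not samples_mw:
        raise ValueError("no VDD_IN samples found")
    if n_generated_tokens <= 0:
        raise ValueError("n_generated_tokens must be positive")
    # Rectangle rule: each sample is assumed to hold for one sampling interval.
    energy_j = sum(samples_mw) / 1000.0 * interval_s
    return energy_j / n_generated_tokens
```

With two samples of 5000 mW and 7000 mW at a 0.5 s interval and 10 generated tokens, the sketch yields 6 J in total, or 0.6 J per token.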

Bundle layout

anonymous_4open_science/
  README.md                       # this file
  HARDWARE.md                     # exact device configurations and build provenance
  requirements.txt                # Python deps
  scripts/
    setup_llamacpp_m2.sh          # build llama.cpp on macOS
    setup_llamacpp_jetson.sh      # build llama.cpp on Jetson Orin Nano
    download_models.py            # fetch GGUFs from Hugging Face
    run_benchmark.py              # main benchmark runner
    parse_llamacpp_output.py      # parse llama.cpp's perf footer
    collect_system_metrics.py     # RSS / CPU sampling
    collect_jetson_energy.py      # tegrastats wrapper and parser
    analyze_results.py            # build summary tables and figures
    make_paper_figures.py         # render publication-quality PDFs
    run_all_experiments.sh        # end-to-end orchestrator
    patches/
      router_overhead_b4404.patch # llama.cpp patch for the §5.6 router-overhead measurement
  configs/
    models.yaml                   # the four GGUFs and their HF sources
    prompts.yaml                  # the 12 prompts (no third-party text)
    devices.yaml                  # per-device profiles (binary path, threads, energy)
    experiments.yaml              # experiment matrix
  data/
    raw/                          # one JSONL per measurement session (all committed)
    processed/
      model_hashes.txt            # SHA-256 of every downloaded GGUF
  results/
    benchmark_results.csv         # full long-form table
    summary_tables/               # CSVs aggregated by analyze_results.py
    figures/                      # PNGs rendered by analyze_results.py
    figures_pub/                  # publication PDFs from make_paper_figures.py
  notebooks/
    analysis.ipynb                # interactive companion to analyze_results.py
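
The throughput metrics come from llama.cpp's perf footer, which parse_llamacpp_output.py handles in the bundle. As a rough illustration of that step, the sketch below extracts prefill and generation throughput with one regular expression. The footer line shape shown in the comments is an assumption based on llama.cpp builds from late 2024, and the function and key names are hypothetical. The bundled parser may differ.

```python
import re

# Assumed footer lines (shape only; the numbers are made up):
#   llama_perf_context_print: prompt eval time = 92.31 ms / 18 tokens ( 5.13 ms per token, 194.99 tokens per second)
#   llama_perf_context_print:        eval time = 2040.12 ms / 127 runs ( 16.06 ms per token, 62.25 tokens per second)
PERF_RE = re.compile(
    r"(?P<phase>prompt eval|eval) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*"
    r"(?P<n>\d+) (?:tokens|runs).*?,\s*(?P<tps>[\d.]+) tokens per second"
)


def parse_perf_footer(text):
    """Return {'prefill': {...}, 'generation': {...}} for the lines that match."""
    result = {}
    for line in text.splitlines():
        match = PERF_RE.search(line)
        if not match:
            continue  # load time, total time, sampler lines, ordinary output
        key = "prefill" if match.group("phase") == "prompt eval" else "generation"
        result[key] = {
            "ms": float(match.group("ms")),
            "n_tokens": int(match.group("n")),
            "tokens_per_sec": float(match.group("tps")),
        }
    return result
```

Lines that do not match are skipped, so the function can be fed the full captured output of a run.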

Reproducing the experiments

0. Prerequisites

  • M2 Pro: macOS 13+, Xcode command-line tools, Python 3.10+, ~10 GB free disk.
  • Jetson Orin Nano 8 GB: JetPack 6.x, CUDA toolkit installed, Python 3.10+, ~10 GB free disk on whichever drive holds models/. NVMe SSD strongly preferred over SD card.
  • Optional: a Hugging Face account with huggingface-cli login configured. Authenticated downloads are faster, and some quants are gated behind a click-through license.

1. Install Python dependencies

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Build llama.cpp

On the M2 Pro:

./scripts/setup_llamacpp_m2.sh

On the Jetson Orin Nano:

sudo nvpmodel -m 0      # 15 W power mode (POWER_MODEL ID=0 on JetPack 6.2)
sudo jetson_clocks      # lock clocks to max within the envelope
./scripts/setup_llamacpp_jetson.sh

Both scripts pin llama.cpp to tag b4404 (commit 0827b2c1da299805288abbd556d869318f2b121e) so results are reproducible. The exact tag, compiler version, and binary SHA-256 are recorded in HARDWARE.md.
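
To confirm that a local checkout sits on the pinned commit before benchmarking, a check along these lines may help. It is a sketch: the llama.cpp directory name is taken from the commands later in this README, and the helper names are hypothetical.

```python
import subprocess

# Commit behind tag b4404, as stated in this README.
PINNED_COMMIT = "0827b2c1da299805288abbd556d869318f2b121e"


def matches_pin(head_sha, pinned=PINNED_COMMIT):
    """True if head_sha is the pinned commit or a prefix of it (at least 7 chars)."""
    head_sha = head_sha.strip().lower()
    return len(head_sha) >= 7 and pinned.startswith(head_sha)


def checkout_is_pinned(repo_dir="llama.cpp"):
    """Ask git for HEAD in repo_dir and compare it with the pinned commit."""
    head = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return matches_pin(head)
```

Calling `checkout_is_pinned()` from the bundle root returns False if the checkout has drifted from b4404. It raises `subprocess.CalledProcessError` if the directory is not a git repository.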

3. Download the models (Q4_K_M GGUF, ~7.6 GB total)

python scripts/download_models.py --models-config configs/models.yaml --out models/

This populates models/<id>/<filename> and appends SHA-256 hashes to data/processed/model_hashes.txt. Hashes already in this bundle are the bytes used for every reported number; compare against your downloads if you want to verify.
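
One possible way to do that comparison is sketched below. It assumes model_hashes.txt uses a sha256sum-style layout (`<hex digest>  <file name>` per line), which this README does not specify. Because the downloader appends to the file, the sketch keeps the first entry per file name so that a freshly appended hash does not mask the bundled one. All helper names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so multi-GB GGUFs are not loaded into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_expected(hash_file):
    """Parse '<hex>  <name>' lines (assumed format); the first entry per name wins."""
    expected = {}
    for line in Path(hash_file).read_text().splitlines():
        parts = line.split()
        if len(parts) >= 2:
            expected.setdefault(Path(parts[-1]).name, parts[0].lower())
    return expected


def verify_models(models_dir, hash_file):
    """Map each *.gguf under models_dir to True if its hash matches the recorded one."""
    expected = load_expected(hash_file)
    report = {}
    for gguf in sorted(Path(models_dir).rglob("*.gguf")):
        want = expected.get(gguf.name)
        report[gguf.name] = want is not None and sha256_of(gguf) == want
    return report
```

A call such as `verify_models("models", "data/processed/model_hashes.txt")` returns a dictionary in which any False value marks a file whose bytes differ from the recorded hash, or one that has no recorded hash.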

4. Smoke test

./scripts/run_all_experiments.sh --device m2 --smoke
# or, on Jetson:
./scripts/run_all_experiments.sh --device jetson --smoke

The smoke run executes a single short prompt against each of the four models and exits.

5. Run the full benchmark

Run the command for the device you are on:

# On the M2 Pro:
./scripts/run_all_experiments.sh --device m2
# On Jetson:
./scripts/run_all_experiments.sh --device jetson

This runs Experiment 1 (4 models × 12 prompts × 3 reps) plus Experiment 2 (sequence-length sweep). On Jetson, tegrastats is logged in parallel for the energy measurements. Expect roughly 60–90 minutes per device for Experiment 1 on a warm build.
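
The size of Experiment 1 follows from the matrix: 4 models × 12 prompts × 3 repetitions = 144 runs per device. The sketch below enumerates that matrix. Only the OLMoE model ID appears in this README. The other three model IDs and all prompt IDs are placeholders, and the real matrix is defined in configs/experiments.yaml.

```python
from itertools import product

MODELS = [
    "olmoe-1b-7b-0924-instruct-q4km",  # ID used elsewhere in this README
    "llama-3.2-1b-q4km",               # hypothetical ID
    "qwen2.5-1.5b-q4km",               # hypothetical ID
    "gemma-2-2b-q4km",                 # hypothetical ID
]
PROMPTS = [f"prompt_{i:02d}" for i in range(1, 13)]  # placeholder IDs for the 12 prompts
REPS = 3


def experiment_1_runs(models=MODELS, prompts=PROMPTS, reps=REPS):
    """Enumerate every (model, prompt, repetition) combination."""
    return [
        {"model": model, "prompt": prompt, "rep": rep}
        for model, prompt, rep in product(models, prompts, range(1, reps + 1))
    ]


runs = experiment_1_runs()
print(len(runs))  # 144
```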

6. Regenerate tables and figures

python scripts/analyze_results.py \
  --raw-dir data/raw \
  --output-dir results

This writes results/benchmark_results.csv, populates results/summary_tables/, and renders results/figures/.

For the publication-quality PDFs used in the paper:

python scripts/make_paper_figures.py \
  --summary-dir results/summary_tables \
  --output-dir  results/figures_pub

7. Optional — CPU-thread sensitivity sub-experiment (M2 only)

STAMP=$(date -u +%Y%m%dT%H%M%SZ)
for t in 4 6 8 10; do
  python scripts/run_benchmark.py \
    --device m2 \
    --experiment experiment_a_cpu_thread_sweep \
    --threads-override $t \
    --output data/raw/run_${STAMP}_m2_thread_sweep_t${t}.jsonl
done

144 runs (~10–15 min on M2). Re-run analyze_results.py to refresh results/summary_tables/cpu_thread_sweep_summary.csv, results/summary_tables/thread_optimum_summary.csv, and results/figures/cpu_thread_sweep.png.
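
Picking a per-model thread optimum from the sweep can be sketched as follows. The record fields (`model`, `threads`, `gen_tps`) are hypothetical, and using the median across repetitions is an assumption. analyze_results.py may aggregate differently when it writes thread_optimum_summary.csv.

```python
from collections import defaultdict
from statistics import median


def thread_optimum(rows):
    """Return {model: (best_threads, median_gen_tps)} from sweep records."""
    by_setting = defaultdict(list)
    for row in rows:
        by_setting[(row["model"], row["threads"])].append(row["gen_tps"])
    best = {}
    for (model, threads), values in by_setting.items():
        med = median(values)
        # Keep the thread count with the highest median generation throughput.
        if model not in best or med > best[model][1]:
            best[model] = (threads, med)
    return best
```

The median keeps a single slow repetition, for example one hit by thermal throttling, from deciding the optimum.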

8. Optional — Router-overhead measurement (OLMoE only)

Build the patched llama-cli in a separate worktree so the unpatched binary that produces headline numbers is unaffected:

# On M2:
git -C llama.cpp worktree add -d ../llama.cpp-patched 0827b2c1d
cd llama.cpp-patched
git apply ../scripts/patches/router_overhead_b4404.patch
cmake -B build -S . \
  -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli -j 8

# On Jetson:
git -C llama.cpp worktree add -d ../llama.cpp-patched 0827b2c1d
cd llama.cpp-patched
git apply ../scripts/patches/router_overhead_b4404.patch
cmake -B build -S . \
  -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli -j 6

Then run the patched OLMoE benchmark:

STAMP=$(date -u +%Y%m%dT%H%M%SZ)
python scripts/run_benchmark.py \
  --device m2_patched \
  --experiment experiment_1_efficiency \
  --only-models olmoe-1b-7b-0924-instruct-q4km \
  --output data/raw/run_${STAMP}_m2_patched_router_overhead.jsonl
# Same with --device jetson_patched on Jetson.

The runner sets LLAMA_ROUTER_OVERHEAD=1 automatically when the device profile is m2_patched or jetson_patched. The patched binary observes every node in the decode graph, which makes it 14.9× slower on Metal and 1.26× slower on CUDA (numbers in HARDWARE.md). That slowdown is acceptable for this measurement: the within-MoE routing share is the ratio routing_ms / (routing_ms + expert_ffn_ms), and per-node sync inflates numerator and denominator symmetrically, so the ratio survives. The patched binary is never used for headline throughput numbers.
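
The invariance argument can be checked with a small numeric sketch. The timings below are illustrative, not measured values. The ratio is preserved exactly when both terms are inflated by a common factor. The third case shows that an equal additive cost on both terms would not cancel, so the argument rests on the overhead being proportional.

```python
def routing_share(routing_ms, expert_ffn_ms):
    """Within-MoE routing share: routing_ms / (routing_ms + expert_ffn_ms)."""
    total = routing_ms + expert_ffn_ms
    if total <= 0:
        raise ValueError("timings must sum to a positive value")
    return routing_ms / total


# Illustrative timings in milliseconds (not taken from the paper).
base = routing_share(2.0, 18.0)
# Same timings, both inflated by a common factor (14.9 is the Metal slowdown).
scaled = routing_share(2.0 * 14.9, 18.0 * 14.9)
# Same timings with an equal additive cost on each term instead.
shifted = routing_share(2.0 + 5.0, 18.0 + 5.0)

print(f"base={base:.3f} scaled={scaled:.3f} shifted={shifted:.3f}")
```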

Interpreting the outputs

The paper reads the results along two axes:

  • Active-parameter axis: OLMoE (1.3B active) against Llama-3.2-1B and Qwen2.5-1.5B. Asks whether sparse activation pays off at matched active count.
  • Memory-footprint axis: OLMoE (~4.2 GB total) against Gemma-2-2B (~1.6 GB) and the smaller dense baselines. Asks the practical question: given a fixed memory budget, is MoE worth it on edge hardware?

Both axes are reported. Looking at only one of them gives a misleading answer.

Known constraints

  • OLMoE Q4_K_M with a long (1000–1500 word) prompt can run out of memory (OOM) on the 8 GB Jetson. We report this rather than work around it.
  • All numbers are tied to llama.cpp at a specific commit. Other backends (vLLM, MLC-LLM, TensorRT-LLM) will give different absolute numbers, though we expect the qualitative ordering to hold.
  • Two devices is not a hardware sweep. The paper bounds its claims accordingly.

Pre-computed artifacts already in the bundle

You can inspect the results without re-running anything:

  • data/raw/run_*.jsonl — every measurement run, one line per (model, prompt, repetition).
  • data/processed/model_hashes.txt — SHA-256 of every model used.
  • results/benchmark_results.csv — the same JSONL data joined into one long-form table.
  • results/summary_tables/*.csv — the aggregated tables the paper cites.
  • results/figures/*.png — exploratory figures.
  • results/figures_pub/*.pdf — publication-quality figures used in the paper.
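
For a quick look at the raw runs without the analysis scripts, something like the following may be enough. It is a sketch that assumes each JSONL record carries a model identifier and a generation-throughput value. The key names `model` and `gen_tps` are hypothetical, so inspect one line of a run_*.jsonl file and adjust them.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def mean_gen_tps(raw_dir, model_key="model", tps_key="gen_tps"):
    """Average generation tokens/sec per model across all run_*.jsonl files."""
    per_model = defaultdict(list)
    for path in sorted(Path(raw_dir).glob("run_*.jsonl")):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                # Skip records that lack either key rather than guessing.
                if model_key in record and tps_key in record:
                    per_model[record[model_key]].append(float(record[tps_key]))
    return {model: mean(values) for model, values in per_model.items()}
```

Calling `mean_gen_tps("data/raw")` from the bundle root returns one average per model. Records missing either key are skipped.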

License

Code: MIT. Data and figures: CC-BY-4.0. The model weights are downloaded from Hugging Face and remain under their respective licenses (see each model card).
