esmc.cpp

Metal-accelerated C/C++ inference for ESM Cambrian (ESM-C), built on llama.cpp / ggml.

Source: github.com/AnanyaP-WDW/esmc.cpp
Pre-converted GGUF models: AnanyaPathak/esmc-300m-gguf on Hugging Face (model card with benchmarks, quick start, and usage)

Build (milestone 0)

git submodule update --init --recursive
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j8
./build/esmc-embed --help

Or: make -C build esmc-embed

Weight inspection (milestone 1)

python3 -m venv .venv && .venv/bin/pip install -r tools/requirements.txt
hf download biohub/ESMC-300M --local-dir ./esmc-300m
.venv/bin/python tools/inspect_esmc_weights.py ./esmc-300m

The biohub checkpoint uses the native esmc.* layout (fused QKV, fused SwiGLU fc1_weight). The HuggingFace model.layers.* layout in plan §2.1 is also supported when present.
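As a rough illustration of what "fused QKV" means for a converter, the sketch below splits one stacked projection matrix into separate Q, K and V blocks. It assumes the three projections are stacked along the output (row) dimension in Q, K, V order. That ordering, the function name `split_fused_qkv`, and the toy sizes are assumptions made for illustration; they are not taken from tools/convert_esmc_to_gguf.py.

```python
def split_fused_qkv(weight, n_embd):
    """Split a [3 * n_embd, n_in] matrix (a list of rows) into Q, K, V blocks.

    Assumes rows are stacked as Q, then K, then V. This is an assumption;
    check the checkpoint before relying on it.
    """
    if len(weight) != 3 * n_embd:
        raise ValueError(f"expected {3 * n_embd} rows, got {len(weight)}")
    q = weight[:n_embd]
    k = weight[n_embd:2 * n_embd]
    v = weight[2 * n_embd:]
    return q, k, v


# Toy example: n_embd = 2, so the fused matrix has 6 rows of width 2.
fused = [[float(r)] * 2 for r in range(6)]
q, k, v = split_fused_qkv(fused, n_embd=2)
print(len(q), len(k), len(v))
```

The same slicing idea would apply to a fused SwiGLU fc1_weight, with two blocks instead of three.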

Convert to GGUF (milestone 2)

.venv/bin/pip install -e ./ggml/gguf-py -r tools/requirements.txt
.venv/bin/python tools/convert_esmc_to_gguf.py ./esmc-300m ./models/esmc-300m-f16.gguf

# Verify metadata + tensors
.venv/bin/python tools/verify_gguf.py ./models/esmc-300m-f16.gguf
.venv/bin/python ./ggml/gguf-py/gguf/scripts/gguf_dump.py ./models/esmc-300m-f16.gguf --no-tensors

Tokenizer test (milestone 4)

.venv/bin/python tests/test_tokenizer.py
# ACDEF -> [0, 5, 23, 13, 9, 18, 2]  (matches HF tokenizer.json)
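
The expected output above can be read as one token per residue, wrapped in a start and an end token. The sketch below reproduces that single example. It uses only the five residue IDs visible in the comment and assumes that 0 is the CLS token and 2 is the EOS token; the names `RESIDUE_IDS` and `encode` are hypothetical and the table is intentionally partial.

```python
# Token IDs copied from the expected output above. Other residues are left
# out on purpose rather than guessed.
CLS_ID, EOS_ID = 0, 2  # assumed roles of the first and last ID
RESIDUE_IDS = {"A": 5, "C": 23, "D": 13, "E": 9, "F": 18}


def encode(sequence):
    """Wrap per-residue IDs in CLS ... EOS, one token per amino acid."""
    body = []
    for aa in sequence:
        if aa not in RESIDUE_IDS:
            raise ValueError(f"residue {aa!r} is not in this partial table")
        body.append(RESIDUE_IDS[aa])
    return [CLS_ID] + body + [EOS_ID]


print(encode("ACDEF"))  # [0, 5, 23, 13, 9, 18, 2]
```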

Layer-0 Q/K check (milestone 5)

.venv/bin/python tests/check_layer0_qk.py
./build/esmc-embed -m ./models/esmc-300m-f16.gguf -s ACDEF --layers 8 --no-metal

Load GGUF in C++ (milestone 3)

cmake --build build -j8
./build/esmc-embed -m ./models/esmc-300m-f16.gguf --verify-load --no-metal

Embed sequences

# Per-residue embeddings (.npy: [n_tokens, n_embd])
./build/esmc-embed -m ./models/esmc-300m-Q4_K_M.gguf \
    -s "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGY" \
    --pool none --output embedding.npy

# Mean-pooled sequence embedding (strips CLS/EOS)
./build/esmc-embed -m ./models/esmc-300m-Q4_K_M.gguf \
    -s "MKTVRQ..." --pool mean --output embedding.npy
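
To make the relationship between the two modes concrete, here is a minimal sketch of mean pooling over a per-residue matrix. It assumes the `--pool none` output keeps the CLS and EOS rows as the first and last row, so that dropping them and averaging mirrors what `--pool mean` is described as doing. The function name `mean_pool` is hypothetical, and the sketch uses plain lists so it runs without NumPy.

```python
def mean_pool(rows):
    """Average per-token vectors, skipping the first (CLS) and last (EOS) row."""
    residues = rows[1:-1]
    if not residues:
        raise ValueError("need at least one residue between CLS and EOS")
    n_embd = len(residues[0])
    return [sum(row[j] for row in residues) / len(residues) for j in range(n_embd)]


# Toy [n_tokens, n_embd] = [4, 2] matrix: CLS, two residues, EOS.
rows = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
pooled = mean_pool(rows)
print(pooled)  # [2.0, 3.0]
```

With NumPy installed, `numpy.load("embedding.npy").tolist()` would give `rows` in this shape.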

Reproduce the paper results (300M, end-to-end)

The commands below reproduce every number in the paper, in order, from a clean clone. Run them on an Apple Silicon Mac (16 GB recommended) for the full CPU + Metal matrix; non-Apple/CPU-only hosts can run everything except the Metal rows. Budget ~3 GB of downloads (checkpoint + ProteinGym archive) and a few hours for the complete benchmark matrix.

Prerequisites: macOS with the Xcode command-line tools, CMake ≥ 3.14, Python ≥ 3.10, and a HuggingFace account/token for the checkpoint download. The PyTorch reference and the PyTorch baselines require torch; the Metal backend requires an Apple GPU.

1. Clone and build

git clone --recursive https://github.com/AnanyaP-WDW/esmc.cpp.git
cd esmc.cpp
git submodule update --init --recursive   # only if you cloned without --recursive

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j8                   # builds esmc-embed, esmc-bench, esmc-quantize

2. Python environment

python3 -m venv .venv
.venv/bin/pip install -r tools/requirements.txt
.venv/bin/pip install -e ./ggml/gguf-py
.venv/bin/pip install torch               # PyTorch reference + baselines

3. Download the ESM-C 300M checkpoint

hf download EvolutionaryScale/esmc-300m-2024-12 --local-dir ./esmc-300m

4. Convert to GGUF (F16 + F32) and verify

.venv/bin/python tools/convert_esmc_to_gguf.py ./esmc-300m ./models/esmc-300m-f16.gguf --dtype f16
.venv/bin/python tools/convert_esmc_to_gguf.py ./esmc-300m ./models/esmc-300m-f32.gguf --dtype f32
.venv/bin/python tools/verify_gguf.py ./models/esmc-300m-f16.gguf

5. Quantize from the F32 GGUF

cmake --build build --target esmc-quantize
./build/esmc-quantize models/esmc-300m-f32.gguf models/esmc-300m-Q8_0.gguf   Q8_0
./build/esmc-quantize models/esmc-300m-f32.gguf models/esmc-300m-Q4_K_M.gguf Q4_K_M
./build/esmc-quantize models/esmc-300m-f32.gguf models/esmc-300m-Q4_K_S.gguf Q4_K_S

6. Staged validation (tokenizer → layer-0 Q/K → full forward)

.venv/bin/python tests/test_tokenizer.py                       # M4: token IDs
.venv/bin/python tests/check_layer0_qk.py                      # M5: layer-0 Q/K probe
.venv/bin/python tests/validate.py --no-metal                  # M6: full forward (CPU)
.venv/bin/python tests/validate.py --metal --compare-cpu-metal # CPU/Metal parity

7. Generate the 100-sequence PyTorch reference

The correctness harness compares against a PyTorch reference that is not checked into git (~107 MB). Regenerate it once from the checkpoint:

.venv/bin/python tests/generate_reference.py \
    --fasta benchmarks/sequences_correctness.fasta \
    --output tests/reference_embeddings.npz

8. Run the benchmarks

# Numerical correctness (paper Table 2): all precisions vs PyTorch, 100 Swiss-Prot seqs
.venv/bin/python benchmarks/correctness.py

# Throughput (paper Table 3, Fig 1): esmc.cpp CPU/Metal + PyTorch CPU/MPS
cmake --build build --target esmc-bench
.venv/bin/python benchmarks/throughput.py --config benchmarks/config_throughput_300m.json

# Memory footprint (paper Fig 2): peak RSS across all 36 configurations
.venv/bin/python benchmarks/memory.py --config benchmarks/config_memory_300m.json

# Downstream ProteinGym variant-effect (paper Table 4, Fig 3): 10 assays x 1000 variants
.venv/bin/python benchmarks/fetch_proteingym_subset.py --download \
    --selection-mode multi-assay --max-assays 10 --max-rows 1000 --min-rows 1000 \
    --max-length 512 --output-dir benchmarks/proteingym_subset_10k \
    --manifest benchmarks/datasets/proteingym_subset_10k_manifest.json
.venv/bin/python benchmarks/downstream.py --config benchmarks/config_downstream_300m_10k.json

9. Figures and reproduction bundle

.venv/bin/python benchmarks/paper_artifacts.py          # figure SVGs + summary CSVs
.venv/bin/python benchmarks/make_reproduction_bundle.py # self-contained results bundle

Benchmark results (300M)

Host: Apple M1 (arm64), 16 GB unified memory, macOS 26.5.
Reference: official PyTorch ESM-C 300M (tests/ref_forward.py on native safetensors).
Datasets: 100 reviewed human UniProt sequences (correctness); length-bucketed FASTA (throughput/memory); ProteinGym 10-assay × 1000-variant subset (downstream).

Raw CSV/JSON artifacts live under results/; paper-ready Markdown/LaTeX tables are regenerated by benchmarks/make_reproduction_bundle.py → results/reproduction_bundle/tables/. Full experiment log: lab_manual.md.

Plots

Regenerate with benchmarks/paper_artifacts.py (writes SVG; PNG if matplotlib is installed).

Numerical correctness vs PyTorch

Per-residue cosine similarity on 100 Swiss-Prot sequences (short / medium / long buckets). Pass = per-sequence mean cosine > 0.999 (F16, Q8_0) or > 0.995 (Q4_K_*).

Precision	Seqs	Aggregate mean cos	Worst mean cos	Worst min cos	Mean-pool L2 (max)	Pass rate
F16	100	0.99999	0.99997	0.99971	0.0030	100/100
Q8_0	100	0.99971	0.99938	0.99427	0.0164	100/100
Q4_K_M	100	0.99597	0.99245	0.94013	0.0656	91/100
Q4_K_S	100	0.99523	0.98982	0.92806	0.0709	75/100
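
The pass criterion above can be sketched as follows: compute one cosine per residue, then compare the per-sequence mean against the threshold for that precision. This is a simplified reading of the criterion, not the code in benchmarks/correctness.py; the function names and the toy vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def sequence_passes(test_rows, ref_rows, threshold):
    """Return (mean cosine, min cosine, passed) over aligned per-residue rows."""
    sims = [cosine(t, r) for t, r in zip(test_rows, ref_rows)]
    mean_cos = sum(sims) / len(sims)
    return mean_cos, min(sims), mean_cos > threshold


# Toy data: two residues, two dimensions, one slightly perturbed row.
ref = [[1.0, 0.0], [0.0, 1.0]]
test = [[1.0, 0.01], [0.0, 1.0]]
mean_cos, min_cos, ok = sequence_passes(test, ref, threshold=0.999)
print(round(mean_cos, 6), round(min_cos, 6), ok)
```

The "Worst mean cos" and "Worst min cos" columns would then be the minimum of `mean_cos` and of `min_cos` across the 100 sequences.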

GGUF on-disk size (300M): F16 634 MiB, Q8_0 337 MiB, Q4_K_M 237 MiB, Q4_K_S 228 MiB.
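
A quick way to read these sizes is as effective bits per weight relative to the F16 file. The sketch below assumes the F16 file stores essentially every weight at 16 bits and that metadata is negligible; both are simplifications, so treat the output as a rough estimate.

```python
# On-disk sizes in MiB, copied from the line above.
SIZES_MIB = {"F16": 634, "Q8_0": 337, "Q4_K_M": 237, "Q4_K_S": 228}


def effective_bits_per_weight(name, baseline="F16", baseline_bits=16.0):
    """Scale the baseline's bits per weight by the file-size ratio."""
    return baseline_bits * SIZES_MIB[name] / SIZES_MIB[baseline]


for name in SIZES_MIB:
    print(f"{name}: {effective_bits_per_weight(name):.1f} bits/weight")
```

Under those assumptions Q8_0 comes out at about 8.5 bits per weight, which matches the nominal cost of ggml's Q8_0 blocks (32 int8 values plus one 16-bit scale). The 4-bit files land near 6 bits per weight, which would be consistent with some tensors being kept at higher precision.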

Throughput (seq/s)

Single-sequence throughput by length bucket; each backend runs in a fresh process. Best esmc.cpp = highest seq/s for that bucket among CPU/Metal × F16/Q8_0/Q4_K_*.

Bucket	Tokens	Best esmc.cpp config	esmc.cpp seq/s	PyTorch CPU seq/s	PyTorch MPS seq/s	esmc.cpp vs CPU	esmc.cpp vs MPS
short	47	metal/q4_k_s	14.54	10.31	29.29	1.41×	0.50×
medium	235	metal/q4_k_m	5.62	4.56	10.11	1.23×	0.56×
long	850	metal/q8_0	1.33	1.74	2.83	0.76×	0.47×

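Metal 4-bit esmc.cpp beats PyTorch CPU on short and medium sequences (1.41× and 1.23×) at ~520-530 MiB peak RSS, but the best esmc.cpp configuration is slower than PyTorch CPU on long sequences (0.76×). PyTorch MPS remains fastest in every bucket on this hardware.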
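
The two ratio columns in the table above follow directly from the seq/s columns: each is the best esmc.cpp throughput divided by the corresponding PyTorch throughput. A small worked check, with the numbers copied from the table and a hypothetical helper name:

```python
# (bucket, best esmc.cpp seq/s, PyTorch CPU seq/s, PyTorch MPS seq/s)
ROWS = [
    ("short", 14.54, 10.31, 29.29),
    ("medium", 5.62, 4.56, 10.11),
    ("long", 1.33, 1.74, 2.83),
]


def speedups(esmc, cpu, mps):
    """Return (esmc.cpp / PyTorch CPU, esmc.cpp / PyTorch MPS), rounded to 2 dp."""
    return round(esmc / cpu, 2), round(esmc / mps, 2)


ratios = {bucket: speedups(esmc, cpu, mps) for bucket, esmc, cpu, mps in ROWS}
for bucket, (vs_cpu, vs_mps) in ratios.items():
    print(f"{bucket}: {vs_cpu:.2f}x vs CPU, {vs_mps:.2f}x vs MPS")
```

A value above 1 means esmc.cpp is faster than that PyTorch backend for the bucket.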

Peak memory (long bucket, 16 GiB budget)

Peak resident set size (RSS) measured with /usr/bin/time -l in fresh processes. All 12 configurations below pass the 16 GiB machine budget.

Configuration	Peak RSS (MiB)	Model file (MiB)	≤ 16 GiB
esmc.cpp / cpu / f16	7426	634	yes
esmc.cpp / cpu / q8_0	6831	337	yes
esmc.cpp / cpu / q4_k_m	6632	237	yes
esmc.cpp / cpu / q4_k_s	6613	228	yes
esmc.cpp / cpu / f32	3989	1266	yes
esmc.cpp / metal / f32	2570	1266	yes
pytorch / cpu / f32	1588	1270	yes
esmc.cpp / metal / f16	1323	634	yes
esmc.cpp / metal / q8_0	736	337	yes
esmc.cpp / metal / q4_k_m	531	237	yes
esmc.cpp / metal / q4_k_s	519	228	yes
pytorch / mps / f32	282	1270	yes
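
For readers who want to repeat a single measurement by hand, the sketch below extracts peak RSS from the text that `/usr/bin/time -l` prints. It assumes the macOS convention of reporting "maximum resident set size" in bytes. The sample text is illustrative, not a captured log, and `peak_rss_mib` is a hypothetical helper rather than part of benchmarks/memory.py.

```python
import re

_RSS_LINE = re.compile(r"^\s*(\d+)\s+maximum resident set size\s*$", re.MULTILINE)


def peak_rss_mib(time_stderr):
    """Extract peak RSS in MiB from `/usr/bin/time -l` output text.

    Assumes the value is reported in bytes (the macOS convention).
    """
    match = _RSS_LINE.search(time_stderr)
    if match is None:
        raise ValueError("no 'maximum resident set size' line found")
    return int(match.group(1)) / (1024 * 1024)


# Illustrative sample only; real output has more lines and other values.
sample = (
    "        1.23 real         0.98 user         0.11 sys\n"
    "           556793856  maximum resident set size\n"
    "                   0  average shared memory size\n"
)
rss = peak_rss_mib(sample)
print(rss)  # 531.0
```

`/usr/bin/time` writes this report to stderr, so a wrapper script would capture stderr of the timed command and pass it to the parser.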

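Deployment sweet spot: Metal Q4_K_M or Q4_K_S, at ~520-530 MiB peak RSS, ~230 MiB on disk, and 1.2–1.4× PyTorch CPU throughput on short/medium sequences. The trade-off is accuracy: both 4-bit formats miss the correctness threshold on some sequences (91/100 and 75/100 pass).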

Downstream variant-effect preservation (ProteinGym)

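10 short stability assays, 1000 variants each. Each variant is scored by the cosine similarity between its mean-pooled mutant embedding and the mean-pooled wild-type embedding, and the resulting per-assay metrics are compared against the same metrics from the PyTorch reference. Pass = |Δ| ≤ 0.01 per assay per metric (Spearman, Pearson, Kendall τ_b, top-decile overlap, bottom-decile overlap), giving 10 × 5 = 50 checks per precision.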

Precision	Assays	Mean abs Spearman Δ	Max abs Spearman Δ	All-metric pass
F16	10	0.0006	0.0014	50/50
Q8_0	10	0.0031	0.0092	45/50
Q4_K_M	10	0.0068	0.0231	38/50
Q4_K_S	10	0.0110	0.0258	32/50

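F16 passes every check, and Q8_0 stays within the 0.01 Spearman tolerance on all assays (max |Δ| 0.0092), so its 5 failed checks fall on other metrics. The 4-bit schemes exceed the Spearman tolerance on at least one assay (max |Δ| 0.0231 and 0.0258) and fail more checks overall (see paper for caveats).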
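
One plausible reading of the Spearman check is sketched below: compute Spearman ρ between the assay measurements and each backend's variant scores, then take the difference between the esmc.cpp value and the PyTorch value. This interpretation of Δ, the function names, and the toy numbers are all assumptions; benchmarks/downstream.py is the authoritative implementation.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_check(assay, ref_scores, test_scores, tol=0.01):
    """Return (delta, passed) for one assay under the assumed definition."""
    delta = spearman(test_scores, assay) - spearman(ref_scores, assay)
    return delta, abs(delta) <= tol


# Toy data: the test backend swaps the order of the last two variants.
assay = [0.1, 0.4, 0.2, 0.9, 0.7]
ref_scores = [0.90, 0.95, 0.92, 0.99, 0.97]
test_scores = [0.90, 0.95, 0.92, 0.97, 0.99]
delta, ok = spearman_check(assay, ref_scores, test_scores)
print(round(delta, 3), ok)  # -0.1 False
```

With only five variants a single swap moves ρ by 0.1; with 1000 variants per assay the same tolerance is far less sensitive to individual rank changes.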

Models on HuggingFace

Pre-converted GGUF files (F32, F16, Q8_0, Q4_K_M, Q4_K_S) are published at AnanyaPathak/esmc-300m-gguf. The model card documents which file to pick, how to build esmc-embed, benchmark numbers, and license terms.

hf download AnanyaPathak/esmc-300m-gguf esmc-300m-Q8_0.gguf --local-dir ./models
shasum -a 256 models/*.gguf   # verify against results/reproduction_bundle/models/MODELS.md
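
Where `shasum` is not available, the same digest can be computed with Python's standard library. The sketch below streams the file in chunks so a large GGUF never has to fit in memory. The helper name `sha256_file` is hypothetical, and the demo hashes a three-byte temporary file rather than a real model.

```python
import hashlib
import os
import tempfile


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a tiny temporary file containing the bytes b"abc".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_digest = sha256_file(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```

The hex string corresponds to the first column of `shasum -a 256` output and can be compared against the value recorded for the same file name.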

To refresh the model card and re-upload after local changes:

.venv/bin/python benchmarks/make_reproduction_bundle.py --hf-repo AnanyaPathak/esmc-300m-gguf
.venv/bin/python tools/upload_to_hf.py \
    --repo-id AnanyaPathak/esmc-300m-gguf \
    --models-dir ./models \
    --model-card results/reproduction_bundle/model_card/README.md \
    --create

Reproduction bundle (milestone 16)

Assemble a single self-describing directory with GGUF files (or checksum manifest), all benchmark CSV/JSON, plots, and paper-ready Markdown + LaTeX tables:

.venv/bin/python benchmarks/make_reproduction_bundle.py
# -> results/reproduction_bundle/{models,benchmarks,plots,tables,model_card,MANIFEST.json}

--gguf-mode link (default) symlinks the GGUF files; use copy to duplicate bytes or manifest for a checksum-only reference. MANIFEST.json records the git hash, host info, and a checksummed inventory of every file in the bundle.

See plan.md for the full engineering specification and lab_manual.md for the experiment log.

License

Built with ESM.

ESM-C 300M weights and the GGUF derivatives produced here are governed by the EvolutionaryScale Cambrian Open License Agreement, subject to the Acceptable Use Policy. The ESMC 300M Model is licensed under the EvolutionaryScale Cambrian Open License Agreement.
esmc.cpp's own source code (runtime, converter, harnesses, docs) is MIT.
ggml / llama.cpp (the ggml/ submodule) remains MIT (see ggml/LICENSE).

See LICENSE and NOTICE for the full terms and required attributions.
