Spectral Covariance Analysis of Protein Language Model Hidden States for Zero-Shot Variant Pathogenicity Prediction
Paper | Quickstart | Experiments | Public API
Claw (AI Co-author) | Davi Bonetto
SpectralBio is a reproducible, zero-shot variant-effect pipeline for missense pathogenicity prediction from sequence alone. The method leverages facebook/esm2_t30_150M_UR50D and measures mutation-induced geometric perturbations in hidden-state covariance inside a local ±40 residue window. It combines spectral features with proper masked-LM evidence to improve ranking quality without supervised training.
On TP53 ClinVar (N=255, 115 pathogenic, 140 benign), the best pair (0.55 * FrobDist + 0.45 * LL Proper) reaches AUC=0.7498 (~0.750). On BRCA1 transfer (N=100 subset), LL Proper reaches AUC=0.9174 (~0.917). Reproducibility is exact with fixed seeds (Δrep=0.0).
| Benchmark | Best Method | AUC-ROC | Interpretation |
|---|---|---|---|
| TP53 (ClinVar, N=255) | 0.55*frob_dist + 0.45*ll_proper |
0.7498 | Best pair in ablation search |
| BRCA1 transfer (N=100) | ll_proper |
0.9174 | Strong cross-protein generalization |
- Matrix-level covariance features (
FrobDist,TraceRatio) are consistently stronger than eigenvalue-only SPS variants on TP53. - The best TP53 score is a hybrid geometric + probabilistic signal, not a single raw feature.
LL Properis moderate on TP53 but exceptionally strong on BRCA1 transfer.- Reported reproducibility check remains exact (
Δrep=0.0) with deterministic seeds.
For each variant, SpectralBio compares WT and MUT hidden-state covariance across ESM2 layers.
Where C^(l) is the residue-level covariance matrix at layer l, and λ^(l) is its eigenvalue spectrum.
Best TP53 combination in this release:
The diagram summarizes the end-to-end computational flow used in the paper:
- Variant ingestion: TP53/BRCA1 ClinVar missense variants are filtered and normalized.
- Sequence-local encoding: WT and MUT windows (
±40) are encoded by ESM2-150M. - Layerwise covariance: per-layer residue covariance matrices are computed.
- Spectral descriptors:
FrobDist,TraceRatio,SPS-logquantify geometric perturbation. - Likelihood integration: proper masked-LM likelihood (
LL Proper) is blended with spectral signals. - Evaluation: AUC ablations on TP53 and transfer analysis on BRCA1.
This is intentionally designed as an executable science pipeline: each block has a direct file-level artifact in this repository.
Likelihood scores answer “how probable is this mutation under token prediction,” while covariance perturbation answers “how strongly did internal representation geometry shift.” In practice, those are complementary biological signals.
On TP53, no single feature dominates all decision boundaries. The top-performing rule is the weighted pair (FrobDist + LL Proper), indicating that geometric and probabilistic cues contribute non-redundant information.
The BRCA1 result (AUC=0.9174 for LL Proper) suggests cross-protein portability in the likelihood component. This is important for zero-shot settings where no supervised retraining is available.
- Two-gene scope for this release (TP53 primary + BRCA1 transfer subset).
- Window size and feature mixing are fixed in this benchmark snapshot.
- Intended for research reproducibility, not clinical deployment.
| Feature | AUC-ROC (TP53) | Interpretation |
|---|---|---|
ll_crude |
0.7026 | Strong baseline proxy from hidden-state norm differences |
trace_ratio |
0.6242 | Useful matrix-scale perturbation signal |
frob_dist |
0.6209 | Stable global covariance displacement measure |
sps_log |
0.5988 | Eigenvalue-only signal, weaker than matrix-level features |
ll_proper |
0.5956 | Modest on TP53 but strongest transfer behavior on BRCA1 |
This ranking is exactly why SpectralBio keeps both geometric and probabilistic channels.
The top TP53 score comes from feature complementarity, while the top BRCA1 score comes
from transfer stability in ll_proper.
- Open
SKILL.mdat repository root. - Execute each step sequentially.
- Runtime target: ~24 min on T4 GPU.
- Open colab/spectralbio.ipynb.
- Run cells in order.
- Outputs are written to
colab/results/.
- Primary entrypoint:
SKILL.md - Notebook mirror:
colab/spectralbio.ipynb - Paper text:
paper/spectralbio.texandpaper/spectralbio_clawrxiv.md
- Model:
facebook/esm2_t30_150M_UR50D - Seeds:
42(torch,numpy,random) - Core deps:
torch,transformers,scipy,scikit-learn,numpy - Expected key metrics: TP53 ~
0.750, BRCA1 LL ~0.917,Δrep=0.0
| If you need... | Open this path | What you get |
|---|---|---|
| Reproducible workflow | SKILL.md |
Ordered execution steps and validation criteria |
| Raw TP53 variants | colab/results/tp53_variants.json |
Curated benchmark variant list |
| Raw per-variant scores | colab/results/scores.json |
11-feature table for each TP53 variant |
| Final metric snapshot | colab/results/summary.json |
AUCs, counts, timing, reproducibility delta |
| Visual diagnostics | colab/results/figures.png |
ROC + ablation + distributions |
| Interactive demo code | huggingface/app.py |
Space logic and scoring API |
| Dataset metadata | huggingface/dataset_card.md |
Schema, splits, usage notes |
| Submission flow | submit/submit.py |
clawRxiv publish script |
colab/results/summary.jsonexists and contains TP53 + BRCA1 metrics.colab/results/scores.jsonexists and has TP53 rows with spectral/LL features.colab/results/figures.pngexists and renders correctly.reproducibility_deltain summary remains0.0.
submit/submit.pyexpects localsubmit/api_key.txt.- Never commit API keys.
- Publish target is clawRxiv (
http://18.118.210.52).
git clone https://github.com/DaviBonetto/SpectralBio
cd SpectralBio
# Primary reproducibility path
# Follow SKILL.md step-by-step and validate outputs in colab/results/| Symptom | Likely cause | Fix |
|---|---|---|
| Equations or metrics not matching expected values | Seed drift or partial pipeline execution | Re-run with seeds fixed to 42 and execute full SKILL.md sequence |
Missing figures.png or scores.json |
Early stop before evaluation steps | Continue notebook/SKILL flow through final validation block |
| Submission script fails | Missing API key file | Create local submit/api_key.txt (never commit) |
| Slow runtime | CPU execution | Prefer T4/A100 runtime for expected wall-clock behavior |
SpectralBio/
├── README.md
├── LICENSE
├── .gitignore
├── SKILL.md
├── Claw4S_conference.md
├── colab/
│ ├── spectralbio.ipynb
│ └── results/
│ ├── tp53_variants.json
│ ├── tp53_sequence.txt
│ ├── summary.json
│ ├── scores.json
│ └── figures.png
├── paper/
│ ├── spectralbio.tex
│ ├── spectralbio.pdf
│ ├── spectralbio_clawrxiv.md
│ ├── references.bib
│ └── assets/
├── submit/
│ └── submit.py
└── huggingface/
├── app.py
├── dataset_card.md
├── requirements.txt
├── README.md
├── assets/
└── data/
- Model:
facebook/esm2_t30_150M_UR50D - Window radius:
±40 - Deterministic seeds:
42 - TP53:
N=255,AUC_best_pair=0.7498 - BRCA1 transfer subset:
N=100,AUC_ll_proper=0.9174 - Reproducibility delta:
0.0 - Scoring time:
~1447s(~24 minutes)
- GitHub: DaviBonetto/SpectralBio
- Demo: Hugging Face Space
- Dataset: Hugging Face Dataset
- Full paper PDF: paper/spectralbio.pdf
- Claw4S 2026: claw4s.github.io
- clawRxiv: 18.118.210.52
@inproceedings{claw2026spectralbio,
title={SpectralBio: Spectral Covariance Analysis of Protein Language Model Hidden States for Zero-Shot Variant Pathogenicity Prediction},
author={Claw and Bonetto, Davi},
booktitle={Claw4S Conference 2026, Stanford--Princeton},
year={2026}
}MIT License.
- Claw4S 2026 organizers
- Stanford University
- Princeton University
- Meta AI (ESM2)
- ClinVar (NCBI/FDA)

