Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations
Ke Ding, Brian Parker, Jiayu Wen. Australian National University · Preprint, August 2025.
NucEL is the first ELECTRA-style genomic foundation model. A small generator proposes single-nucleotide substitutions to a masked DNA sequence, and a larger ModernBERT discriminator is trained to detect which positions were replaced — giving dense supervision across all tokens instead of the ~15 % covered by masked language modeling. At 93 M parameters, NucEL outperforms MLM-based models of similar size and matches or exceeds models up to 25× larger (e.g. NT-Multi-2.5B) on GUE, GB, and NT.
- ELECTRA RTD for genomics. Dense supervision across every base; ~6.7× more signal per forward pass than 15 %-masking MLM.
- Single-nucleotide tokenization. Vocab of 27 tokens (4 nt + 7 special + 16 reserved). No k-mer collapse, no BPE drift; preserves base-level detail for tasks like motif discovery.
- ModernBERT backbone. Hybrid local (128 bp window) and global (every 3 layers) attention, FlashAttention-2, RoPE, GeGLU.
- Compute-efficient. Pre-trained on the human genome only with 8× A100 GPUs (50 epochs, global batch 192).
- State-of-the-art on standard benchmarks despite using ~25× fewer parameters than the strongest baselines (see Results).
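
The first two bullets, one token per base and a replaced/original label at every position, can be sketched in a few lines. This is a toy illustration rather than the repository's implementation: the id layout in `TOY_VOCAB`, the `make_rtd_example` helper, and the uniform-random stand-in for the generator are all assumptions made for the example. NucEL's actual generator is a trained network, and its vocabulary has 27 entries.

```python
import random

# Hypothetical id layout for illustration only; the real 27-token vocab
# (4 nt + 7 special + 16 reserved) is defined by NucEL_Tokenizer.
TOY_VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}


def make_rtd_example(seq, mask_ratio=0.15, seed=0):
    """Return (original_ids, corrupted_ids, labels) for one sequence.

    labels[i] is 1 where the corrupted base differs from the original and
    0 elsewhere, so every position carries a training signal.
    """
    rng = random.Random(seed)
    original = [TOY_VOCAB[base] for base in seq]  # one token per nucleotide

    # Choose ~15% of positions to corrupt (at least one).
    n_mask = max(1, round(mask_ratio * len(original)))
    masked_positions = rng.sample(range(len(original)), n_mask)

    corrupted = list(original)
    for pos in masked_positions:
        # Stand-in for the generator: sample any base uniformly at random.
        corrupted[pos] = rng.randrange(len(TOY_VOCAB))

    # As in ELECTRA, a sampled base that happens to equal the original
    # is labeled "original", not "replaced".
    labels = [int(c != o) for c, o in zip(corrupted, original)]
    return original, corrupted, labels


original, corrupted, labels = make_rtd_example("ATCGATCGATGCATGCATGC")
print(labels)
print(f"supervised positions: {len(labels)} (RTD) vs {round(0.15 * len(labels))} (MLM)")
```

For this 20-base input the discriminator is scored on all 20 positions, whereas a 15 % masking objective would score only 3 of them.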
| Aspect | Value |
|---|---|
| Architecture | ELECTRA RTD with ModernBERT backbone |
| Discriminator | 22 layers · hidden 512 · 16 heads · intermediate 2048 |
| Generator | 11 layers · hidden 256 · 8 heads · intermediate 1024 |
| Attention | local window 128 + global every 3rd layer |
| Tokenization | single-nucleotide (k = 1), vocab = 27 |
| Parameters | 93 M (discriminator only, ELECTRA backbone) |
| Max sequence | 1024 bp during pre-training (8192 supported by config) |
| Pre-training data | GRCh38 / hg38 human genome (1224 bp windows, 100 bp overlap) |
| Objective | L_total = L_gen + 50.0 * L_disc |
| Mask ratio | 0.15 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999), lr = 1e-4 |
| Schedule | cosine, 1000-step warmup, max grad norm 1.0 |
| Compute | 8× NVIDIA A100, FP16, 50 epochs, global batch 192 |
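
Two rows of the table above are formulas in disguise: the weighted objective and the warmup-plus-cosine schedule. The sketch below spells out both. It assumes linear warmup from zero and cosine decay to zero, which is the convention of common schedulers but is not stated in the table. `total_steps=10_000` is a made-up horizon for the example, and `total_loss` and `lr_at_step` are hypothetical helper names.

```python
import math

DISC_WEIGHT = 50.0   # from the table: L_total = L_gen + 50.0 * L_disc
PEAK_LR = 1e-4       # from the table
WARMUP_STEPS = 1000  # from the table


def total_loss(gen_loss, disc_loss, disc_weight=DISC_WEIGHT):
    """Combine the generator loss and the weighted discriminator loss."""
    return gen_loss + disc_weight * disc_loss


def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# total_steps=10_000 is an illustrative value, not the paper's step count.
for step in (0, 500, 1000, 5500, 10_000):
    print(step, f"{lr_at_step(step, total_steps=10_000):.2e}")
print(total_loss(gen_loss=1.2, disc_loss=0.05))
```

The factor of 50 compensates for the per-token discriminator loss, a binary cross-entropy, being much smaller in magnitude than the generator's loss.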

GUE benchmark, scores by task category:

| Model | TF-H | PD | CPD | SSP | TF-M | EMP | CVC | Avg |
|---|---|---|---|---|---|---|---|---|
| DNABERT-2 (117 M) | 70.10 | 84.21 | 70.52 | 84.99 | 67.99 | 55.98 | 71.02 | 72.11 |
| NT-2500M-multi | 63.32 | 88.14 | 71.62 | 89.36 | 67.01 | 58.06 | 73.04 | 72.94 |
| NucEL (93 M) | 67.64 | 87.10 | 75.13 | 90.30 | 70.62 | 65.01 | 70.29 | 75.16 |

Genomic Benchmarks (GB):

| Task | HyenaDNA | Caduceus-Ph | NT2-100M | NucEL |
|---|---|---|---|---|
| Coding vs Intergenic | 0.904 | 0.915 | 0.950 | 0.941 |
| Human vs Worm | 0.964 | 0.972 | 0.972 | 0.975 |
| Human Enhancers Cohn | 0.729 | 0.747 | 0.736 | 0.735 |
| Human Enhancers Ensembl | 0.849 | 0.893 | 0.935 | 0.940 |
| Human Regulatory | 0.869 | 0.872 | 0.935 | 0.941 |
| Human OCR Ensembl | 0.783 | 0.828 | 0.776 | 0.794 |
| Human NonTATA Promoters | 0.944 | 0.946 | 0.920 | 0.973 |
| Average | 0.863 | 0.882 | 0.890 | 0.899 |

NucEL's average of 0.664 matches NT2-Multi-500M (0.660 in the paper's
Table 3) and NT-Multi-2.5B (0.661), while using about 27× fewer parameters
than the latter. See docs/figure2_tradeoff.png for the full
efficiency-performance tradeoff.
```bash
git clone https://github.com/FreakingPotato/NucEL.git
cd NucEL

# Editable install (recommended)
pip install -e .

# Or install the listed pinned dependencies
pip install -r requirements.txt
```

Tested with Python 3.10, PyTorch 2.1+. FlashAttention-2 is optional but strongly recommended on A100 / H100.
NucEL's tokenizer is a custom k-mer class registered via auto_map. To let
AutoTokenizer.from_pretrained honor this auto_map, install
transformers>=4.48,<5.4:
```bash
pip install "transformers>=4.48,<5.4" torch
```

Starting with transformers==5.4.0 (released 2026-03-27), modernbert-based
checkpoints were added to a MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS
blocklist that disables auto_map for tokenizer auto-discovery. NucEL is
collateral damage of that change. The model side (AutoModel.from_pretrained)
is unaffected, but the tokenizer must be loaded explicitly; see
Quickstart on transformers 5.4+.
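
The version boundary described above can be turned into a small guard that picks the loading path at runtime. This is only a sketch: `needs_explicit_tokenizer` is a hypothetical helper, not part of NucEL or transformers, and it compares just the major and minor version numbers against the 5.4 cutoff.

```python
import re


def needs_explicit_tokenizer(transformers_version: str) -> bool:
    """True when the given transformers version is 5.4 or newer.

    Hypothetical helper: it encodes only the 5.4 cutoff described above and
    ignores any suffix after the minor version (e.g. "5.4.0.dev0").
    """
    match = re.match(r"(\d+)\.(\d+)", transformers_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {transformers_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (5, 4)


for version in ("4.48.0", "5.3.2", "5.4.0", "5.10.1"):
    path = "NucEL_Tokenizer" if needs_explicit_tokenizer(version) else "AutoTokenizer"
    print(f"transformers {version}: load with {path}")
```

In practice the argument would come from `transformers.__version__`. Comparing integer tuples rather than strings avoids sorting "5.10" before "5.4".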
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
model = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

inputs = tokenizer("ATCGATCGATGCATGCATGC", return_tensors="pt")
with torch.no_grad():
    h = model(**inputs).last_hidden_state  # (1, L, 512)
cls = h[:, 0, :]                           # (1, 512)
```

Runnable: examples/quickstart.py.
If you cannot pin transformers<5.4, bypass AutoTokenizer and import
NucEL_Tokenizer directly. Once you have nucel installed via
pip install -e ., that's a one-liner:
```python
import torch
from transformers import AutoModel
from nucel.tokenizer import NucEL_Tokenizer

tokenizer = NucEL_Tokenizer.from_pretrained("FreakingPotato/NucEL")
model = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

inputs = tokenizer("ATCGATCGATGCATGCATGC", return_tensors="pt")
with torch.no_grad():
    h = model(**inputs).last_hidden_state
```

Without installing this repo, you can also fetch tokenizer.py from the
Hub at runtime:
```python
import sys
from huggingface_hub import snapshot_download

local = snapshot_download("FreakingPotato/NucEL",
                          allow_patterns=["tokenizer*.py", "tokenizer_config.json",
                                          "vocab.json", "special_tokens_map.json"])
sys.path.insert(0, local)
from tokenizer import NucEL_Tokenizer

tokenizer = NucEL_Tokenizer.from_pretrained(local)
```

scripts/pretrain.py has the paper-exact defaults baked in. To reproduce the
released checkpoint on 8× A100:
```bash
# Per-GPU batch size: 24 * 8 GPUs = 192 global
deepspeed scripts/pretrain.py \
    --batch_size 24 \
    --epochs 50 \
    --learning_rate 1e-4 \
    --dataset_script_path FreakingPotato/nt2_homo_sapiens_genome_1024
```

All other architecture and training hyperparameters default to the values reported in Sections 3.2 and 3.4 of the paper.
```bash
# All 12 human-subset tasks, 3 seeds, default to FreakingPotato/NucEL
bash scripts/run_gue.sh

# A single task
python scripts/finetune_gue.py --task prom_300_all --random_seed 42

# List all tasks
python scripts/finetune_gue.py --list_tasks
```

Per-seed CSVs land in benchmark_GUE_results/FreakingPotato_NucEL/<task>/...
and an aggregate summary_seed*_k*.csv at the root.
Reproduces Figure 3 / Table 4 (gene-biotype embeddings) and the GUE-style t-SNEs:
```bash
# Biotype embeddings (Fig 3 / Table 4)
python scripts/embedding_visualization.py \
    --mode biotype \
    --target_model NucEL \
    --pooling cls \
    --input_fasta path/to/ensembl_biotypes.fasta \
    --output_dir embeddings/

# GUE regulatory / TF marker embeddings (Section 4.1)
python scripts/embedding_visualization.py \
    --mode gue \
    --target_model NucEL \
    --pooling mean \
    --input_fasta path/to/regulatory_elements_3000.fasta
```

The same script supports --target_model DNABERT2 | NT2_100m | HyenaDNA for
head-to-head comparisons against the baselines reported in the paper.
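
The `--pooling cls` and `--pooling mean` options used above are two common ways of collapsing per-base hidden states into one vector per sequence. The sketch below shows the arithmetic on plain Python lists. `pool` is a hypothetical helper, and the script's actual implementation may differ, for example in whether special tokens are excluded from the mean.

```python
def pool(hidden, attention_mask, mode="cls"):
    """Collapse (seq_len, dim) hidden states into a single (dim,) vector.

    hidden: list of per-token vectors; attention_mask: 1 = real token, 0 = pad.
    """
    if mode == "cls":
        # Use the vector at position 0, as in the quickstart's h[:, 0, :].
        return list(hidden[0])
    if mode == "mean":
        # Average only over non-padding positions.
        kept = [vec for vec, keep in zip(hidden, attention_mask) if keep]
        if not kept:
            raise ValueError("attention_mask selects no tokens")
        return [sum(column) / len(kept) for column in zip(*kept)]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Toy input: 4 positions, hidden size 2, last position is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(pool(hidden, mask, "cls"))   # [1.0, 0.0]
print(pool(hidden, mask, "mean"))  # [3.0, 2.0]
```

Masking matters for batched inputs of unequal length: without it, padding vectors such as the `[9.0, 9.0]` row here would pull the mean away from the real sequence content.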
```text
NucEL/
├── nucel/                          # importable package
│   ├── tokenizer.py                # NucEL_Tokenizer (k-mer, k = 1)
│   ├── modeling.py                 # ElectraForPreTraining + ModernBert discriminator
│   └── data.py                     # GenomeDataset + ElectraDataCollator
├── scripts/
│   ├── pretrain.py                 # ELECTRA pre-training (paper-exact defaults)
│   ├── finetune_gue.py             # GUE benchmark fine-tuning
│   ├── run_gue.sh                  # 3-seed driver script
│   └── embedding_visualization.py  # Section 4.1 / 4.6
├── examples/
│   └── quickstart.py
├── hf_patch/                       # AutoTokenizer fix for Hugging Face Hub
└── docs/
    ├── architecture.png            # Figure 1
    └── figure2_tradeoff.png        # Figure 2 (tokenization + efficiency)
```
```bibtex
@article{ding2025nucel,
  title   = {NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training
             for Efficient and Interpretable Representations},
  author  = {Ding, Ke and Parker, Brian and Wen, Jiayu},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.08.17.670700},
  url     = {https://www.biorxiv.org/content/10.1101/2025.08.17.670700v1}
}
```

Apache-2.0. See LICENSE.
