NucEL

Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations

Ke Ding, Brian Parker, Jiayu Wen. Australian National University · Preprint, August 2025.

NucEL is the first ELECTRA-style genomic foundation model. A small generator proposes single-nucleotide substitutions at the masked positions of a DNA sequence, and a larger ModernBERT discriminator is trained to detect which positions were replaced. This gives dense supervision across all tokens instead of the ~15% covered by masked language modeling (MLM). At 93 M parameters, NucEL outperforms MLM-based models of similar size and matches or exceeds models up to 25× larger (e.g. NT-Multi-2.5B) on the GUE, Genomic Benchmarks (GB), and NT benchmarks.
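The replaced-token-detection setup described above can be sketched in a few lines. This is only an illustration under stated assumptions: a uniform random draw stands in for the learned generator, and the function name `corrupt_for_rtd` is hypothetical, not part of the NucEL package.

```python
import random

NUCLEOTIDES = "ACGT"


def corrupt_for_rtd(seq, mask_ratio=0.15, seed=0):
    """Return (corrupted sequence, per-position labels, masked positions).

    Assumption: a uniform random draw replaces the learned generator.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(seq) * mask_ratio))
    masked = sorted(rng.sample(range(len(seq)), n_masked))
    masked_set = set(masked)

    corrupted, labels = [], []
    for i, base in enumerate(seq):
        # Only masked positions receive a proposal from the stand-in generator.
        proposal = rng.choice(NUCLEOTIDES) if i in masked_set else base
        corrupted.append(proposal)
        # Label 1 means "replaced". A proposal that happens to equal the
        # original base is labeled 0, as in the standard ELECTRA recipe.
        labels.append(int(proposal != base))
    return "".join(corrupted), labels, masked


seq = "ATCGATCGATGCATGCATGC"
corrupted, labels, masked = corrupt_for_rtd(seq)
print(corrupted)
print(labels)
print(f"MLM loss covers {len(masked)} positions; RTD loss covers {len(labels)}")
```

The discriminator receives a label at every position, whereas an MLM loss is computed only at the masked ones. That difference is the source of the dense-supervision claim.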

Highlights

ELECTRA RTD for genomics. Dense supervision across every base; ~6.7× more signal per forward pass than MLM with 15% masking (1 / 0.15 ≈ 6.7).
Single-nucleotide tokenization. Vocabulary of 27 tokens (4 nt + 7 special + 16 reserved). No k-mer collapse, no BPE drift; preserves base-level detail for tasks like motif discovery.
ModernBERT backbone. Hybrid local (128 bp window) and global (every 3rd layer) attention, FlashAttention-2, RoPE, GeGLU.
Compute-efficient. Pre-trained only on the human genome, using 8× A100 GPUs (50 epochs, global batch 192).
State-of-the-art on standard benchmarks despite using ~25× fewer parameters than the strongest baselines (see Results).
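To illustrate why single-nucleotide tokenization preserves base-level detail, the sketch below compares how many tokens a single substitution touches under k = 1 and under an overlapping 6-mer scheme. The 6-mer tokenizer is included only as a contrast and is not claimed to match any particular baseline; all function names are hypothetical.

```python
def single_nucleotide_tokens(seq):
    """One token per base (k = 1)."""
    return list(seq)


def overlapping_kmer_tokens(seq, k=6):
    """Sliding-window k-mers with stride 1, used here only for contrast."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def changed_token_indices(tokens_a, tokens_b):
    """Indices at which two equally long token lists differ."""
    return [i for i, (a, b) in enumerate(zip(tokens_a, tokens_b)) if a != b]


ref = "ATCGATCGATGCATGCATGC"
alt = ref[:10] + "A" + ref[11:]  # substitute the G at position 10 with A

nt_changed = changed_token_indices(
    single_nucleotide_tokens(ref), single_nucleotide_tokens(alt))
kmer_changed = changed_token_indices(
    overlapping_kmer_tokens(ref), overlapping_kmer_tokens(alt))

print("single-nucleotide tokens changed:", nt_changed)
print("overlapping 6-mer tokens changed:", kmer_changed)
```

With k = 1 the substitution maps to exactly one token, so a per-token label or attribution score refers to one base. With overlapping 6-mers the same substitution is spread over every window that contains it.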

Model card

Aspect	Value
Architecture	ELECTRA RTD with ModernBERT backbone
Discriminator	22 layers · hidden 512 · 16 heads · intermediate 2048
Generator	11 layers · hidden 256 · 8 heads · intermediate 1024
Attention	local window 128 + global every 3rd layer
Tokenization	single-nucleotide (k = 1), vocab = 27
Parameters	93 M (discriminator only, ELECTRA backbone)
Max sequence	1024 bp during pre-training (8192 supported by config)
Pre-training data	GRCh38 / hg38 human genome (1224 bp windows, 100 bp overlap)
Objective	`L_total = L_gen + 50.0 * L_disc`
Mask ratio	0.15
Optimizer	AdamW (β₁ = 0.9, β₂ = 0.999), lr = 1e-4
Schedule	cosine, 1000-step warmup, max grad norm 1.0
Compute	8× NVIDIA A100, FP16, 50 epochs, global batch 192
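The objective row combines a generator loss and a discriminator loss with a weight of 50. A minimal sketch of that combination follows. It assumes the standard ELECTRA decomposition, with cross-entropy over masked positions for the generator and binary cross-entropy over all positions for the discriminator. The averaging conventions and function names are assumptions, not taken from `scripts/pretrain.py`.

```python
import math

DISC_WEIGHT = 50.0  # from the model card: L_total = L_gen + 50.0 * L_disc


def generator_loss(probs_of_true_base):
    """Mean cross-entropy over masked positions.

    Each entry is the probability the generator assigns to the true base.
    """
    return -sum(math.log(p) for p in probs_of_true_base) / len(probs_of_true_base)


def discriminator_loss(replaced_probs, labels):
    """Mean binary cross-entropy over all positions.

    replaced_probs[i] is the predicted probability that position i was replaced.
    """
    total = 0.0
    for p, y in zip(replaced_probs, labels):
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def total_loss(l_gen, l_disc, weight=DISC_WEIGHT):
    return l_gen + weight * l_disc


# Toy probabilities, chosen only to exercise the functions.
l_gen = generator_loss([0.25, 0.5, 0.4])
l_disc = discriminator_loss([0.9, 0.1, 0.2, 0.05], [1, 0, 0, 0])
print(f"L_gen={l_gen:.4f}  L_disc={l_disc:.4f}  L_total={total_loss(l_gen, l_disc):.4f}")
```

The large weight compensates for the binary discriminator loss being numerically much smaller than the multi-class generator loss.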

Results

GUE (Table 1, MCC; F1 for CVC). Bold = best; underlined = second-best.

Model	TF-H	PD	CPD	SSP	TF-M	EMP	CVC	Avg
DNABERT-2 (117 M)	70.10	84.21	70.52	84.99	67.99	55.98	71.02	72.11
NT-2500M-multi	63.32	88.14	71.62	89.36	67.01	58.06	73.04	72.94
NucEL (93 M)	67.64	87.10	75.13	90.30	70.62	65.01	70.29	75.16
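For readers unfamiliar with the metric, here is a small sketch of the binary Matthews correlation coefficient (MCC). The table values appear to be MCC scaled by 100. The multi-class generalization is not shown, and the function below is an illustration rather than the evaluation code used in `scripts/finetune_gue.py`.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC from 0/1 labels; returns 0.0 when the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(f"MCC = {matthews_corrcoef(y_true, y_pred):.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative on class-imbalanced tasks.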

Genomic Benchmarks (Table 2, accuracy; mean ± std over 5 seeds)

Task	HyenaDNA	Caduceus-Ph	NT2-100M	NucEL
Coding vs Intergenic	0.904	0.915	0.950	0.941
Human vs Worm	0.964	0.972	0.972	0.975
Human Enhancers Cohn	0.729	0.747	0.736	0.735
Human Enhancers Ensembl	0.849	0.893	0.935	0.940
Human Regulatory	0.869	0.872	0.935	0.941
Human OCR Ensembl	0.783	0.828	0.776	0.794
Human NonTATA Promoters	0.944	0.946	0.920	0.973
Average	0.863	0.882	0.890	0.899

NT benchmark (Table 3, 18 tasks, MCC; mean over 10 seeds)

NucEL's average MCC of 0.664 is on par with NT2-Multi-500M (0.660 in the paper's Table 3) and NT-Multi-2.5B (0.661), while using roughly 27× fewer parameters than the latter (93 M vs 2.5 B). See docs/figure2_tradeoff.png for the full efficiency-performance tradeoff.

Installation

git clone https://github.com/FreakingPotato/NucEL.git
cd NucEL

# Editable install (recommended)
pip install -e .

# Or install the listed pinned dependencies
pip install -r requirements.txt

Tested with Python 3.10, PyTorch 2.1+. FlashAttention-2 is optional but strongly recommended on A100 / H100.

About the `transformers` version

NucEL's tokenizer is a custom k-mer class registered via auto_map. To let AutoTokenizer.from_pretrained honor this auto_map, install transformers>=4.48,<5.4:

pip install "transformers>=4.48,<5.4" torch

Starting with transformers==5.4.0 (released 2026-03-27), modernbert-based checkpoints were added to a MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS blocklist that disables auto_map for tokenizer auto-discovery. NucEL is collateral damage of that change. The model side (AutoModel.from_pretrained) is unaffected, but the tokenizer must be loaded explicitly — see Quickstart on transformers 5.4+.
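One way to choose between the two loading paths at runtime is to check the installed version against the range given above. The helper below is a hypothetical sketch, not part of the `nucel` package. It parses only the major and minor components and treats `>=4.48,<5.4` as the range in which `AutoTokenizer` honors `auto_map`.

```python
import re


def parse_major_minor(version):
    """Return (major, minor) from strings such as '5.4.0' or '4.48.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def auto_tokenizer_supported(version):
    """True inside the transformers>=4.48,<5.4 range recommended above."""
    return (4, 48) <= parse_major_minor(version) < (5, 4)


for v in ("4.47.1", "4.48.0", "5.3.2", "5.4.0"):
    path = "AutoTokenizer" if auto_tokenizer_supported(v) else "explicit NucEL_Tokenizer import"
    print(f"transformers {v}: {path}")
```

In practice the string to check would be `transformers.__version__`. For versions below 4.48 the explicit path is suggested here only as a conservative default.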

Quickstart

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
model     = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

inputs = tokenizer("ATCGATCGATGCATGCATGC", return_tensors="pt")
with torch.no_grad():
    h = model(**inputs).last_hidden_state          # (1, L, 512)
cls = h[:, 0, :]                                   # (1, 512)

Runnable: examples/quickstart.py.

Quickstart on transformers 5.4+

If you cannot pin transformers<5.4, bypass AutoTokenizer and import NucEL_Tokenizer directly. With nucel installed via pip install -e ., this takes a single extra import:

import torch
from transformers import AutoModel
from nucel.tokenizer import NucEL_Tokenizer

tokenizer = NucEL_Tokenizer.from_pretrained("FreakingPotato/NucEL")
model     = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

inputs = tokenizer("ATCGATCGATGCATGCATGC", return_tensors="pt")
with torch.no_grad():
    h = model(**inputs).last_hidden_state

Without installing this repo, you can also fetch tokenizer.py from the Hub at runtime:

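import sys
from huggingface_hub import snapshot_download

# Download only the tokenizer code and its config/vocab files from the Hub.
local = snapshot_download("FreakingPotato/NucEL",
                          allow_patterns=["tokenizer*.py", "tokenizer_config.json",
                                          "vocab.json", "special_tokens_map.json"])
# Make the downloaded tokenizer.py importable, then load it from the local snapshot.
sys.path.insert(0, local)
from tokenizer import NucEL_Tokenizer
tokenizer = NucEL_Tokenizer.from_pretrained(local)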

Reproducing pre-training

scripts/pretrain.py has the paper-exact defaults baked in. To reproduce the released checkpoint on 8× A100:

# Per-GPU batch size 24 × 8 GPUs = 192 global
deepspeed scripts/pretrain.py \
    --batch_size 24 \
    --epochs 50 \
    --learning_rate 1e-4 \
    --dataset_script_path FreakingPotato/nt2_homo_sapiens_genome_1024

All other architecture and training hyperparameters default to the values reported in Sections 3.2 and 3.4 of the paper.

Reproducing GUE results

# All 12 human-subset tasks, 3 seeds each; defaults to the FreakingPotato/NucEL checkpoint
bash scripts/run_gue.sh

# A single task
python scripts/finetune_gue.py --task prom_300_all --random_seed 42

# List all tasks
python scripts/finetune_gue.py --list_tasks

Per-seed CSVs land in benchmark_GUE_results/FreakingPotato_NucEL/<task>/... and an aggregate summary_seed*_k*.csv is written at the root.
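If you want to aggregate per-seed results yourself, the sketch below computes the mean and standard deviation per task. The column names (`task`, `seed`, `mcc`) and the numbers are placeholders invented for illustration. The actual CSV schema written by `scripts/finetune_gue.py` is not documented here and may differ.

```python
import csv
import io
import statistics
from collections import defaultdict

# Placeholder rows with made-up values; not real NucEL results.
per_seed_csv = """task,seed,mcc
toy_task,1,0.10
toy_task,2,0.20
toy_task,3,0.30
"""


def summarize(rows, metric="mcc"):
    """Group rows by task and return {task: (mean, sample std)}."""
    by_task = defaultdict(list)
    for row in rows:
        by_task[row["task"]].append(float(row[metric]))
    return {
        task: (statistics.mean(vals),
               statistics.stdev(vals) if len(vals) > 1 else 0.0)
        for task, vals in by_task.items()
    }


summary = summarize(csv.DictReader(io.StringIO(per_seed_csv)))
for task, (mean, std) in summary.items():
    print(f"{task}: {mean:.3f} ± {std:.3f}")
```

To read a real file, replace the `io.StringIO` object with an opened CSV file and adjust the column names to match its header.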

Embedding visualization

Reproduces Figure 3 / Table 4 (gene-biotype embeddings) and the GUE-style t-SNEs:

# Biotype embeddings (Fig 3 / Table 4)
python scripts/embedding_visualization.py \
    --mode biotype \
    --target_model NucEL \
    --pooling cls \
    --input_fasta path/to/ensembl_biotypes.fasta \
    --output_dir embeddings/

# GUE regulatory / TF marker embeddings (Section 4.1)
python scripts/embedding_visualization.py \
    --mode gue \
    --target_model NucEL \
    --pooling mean \
    --input_fasta path/to/regulatory_elements_3000.fasta

The same script supports --target_model DNABERT2 | NT2_100m | HyenaDNA for head-to-head comparisons against the baselines reported in the paper.
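The two `--pooling` options can be pictured as follows. This sketch uses plain Python lists so it runs without PyTorch. It assumes that `cls` takes the first token's hidden state (as in the Quickstart) and that `mean` averages over non-padding positions. Whether the script also excludes special tokens is not stated, so that detail is an assumption.

```python
def cls_pool(hidden, cls_index=0):
    """Return the hidden vector at the [CLS] position."""
    return list(hidden[cls_index])


def mean_pool(hidden, attention_mask):
    """Average hidden vectors over positions where attention_mask is 1."""
    kept = [vec for vec, m in zip(hidden, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no positions")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy hidden states of shape (4 tokens, 2 dims); the last token is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print("cls :", cls_pool(hidden))
print("mean:", mean_pool(hidden, attention_mask))
```

With tensors, the same mean pooling is usually written as a masked sum divided by the mask count along the sequence dimension.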

Repository layout

NucEL/
├── nucel/                   # importable package
│   ├── tokenizer.py        # NucEL_Tokenizer (k-mer, k = 1)
│   ├── modeling.py         # ElectraForPreTraining + ModernBert discriminator
│   └── data.py             # GenomeDataset + ElectraDataCollator
├── scripts/
│   ├── pretrain.py         # ELECTRA pre-training (paper-exact defaults)
│   ├── finetune_gue.py     # GUE benchmark fine-tuning
│   ├── run_gue.sh          # 3-seed driver script
│   └── embedding_visualization.py  # Section 4.1 / 4.6
├── examples/
│   └── quickstart.py
├── hf_patch/               # AutoTokenizer fix for Hugging Face Hub
└── docs/
    ├── architecture.png    # Figure 1
    └── figure2_tradeoff.png  # Figure 2 (tokenization + efficiency)

Citation

@article{ding2025nucel,
  title   = {NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training
             for Efficient and Interpretable Representations},
  author  = {Ding, Ke and Parker, Brian and Wen, Jiayu},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.08.17.670700},
  url     = {https://www.biorxiv.org/content/10.1101/2025.08.17.670700v1}
}

License

Apache-2.0. See LICENSE.
