NucEL

Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations

Paper: bioRxiv · Model: Hugging Face · License: Apache-2.0

Ke Ding, Brian Parker, Jiayu Wen. Australian National University · Preprint, August 2025.

NucEL is the first ELECTRA-style genomic foundation model. A small generator proposes single-nucleotide substitutions at the masked positions of a DNA sequence, and a larger ModernBERT discriminator is trained to detect which positions were replaced. This gives dense supervision across all tokens instead of the ~15% covered by masked language modeling (MLM). At 93 M parameters, NucEL outperforms MLM-based models of similar size and matches or exceeds models up to 25× larger (e.g. NT-Multi-2.5B) on the GUE, Genomic Benchmarks (GB), and Nucleotide Transformer (NT) benchmarks.
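
To make the replaced-token-detection setup more concrete, here is a minimal sketch of how a corrupted sequence and its per-position labels could be built. It is illustrative only: the uniform random sampler stands in for the trained generator, and `corrupt` is a hypothetical helper, not a function from the `nucel` package.

```python
import random

BASES = "ACGT"

def corrupt(seq, mask_ratio=0.15, rng=None):
    """Return (corrupted_seq, labels); labels[i] is 1 where the base changed.

    Hypothetical helper: a uniform sampler stands in for the generator.
    """
    rng = rng or random.Random()
    n_mask = max(1, round(len(seq) * mask_ratio))
    masked_positions = rng.sample(range(len(seq)), n_mask)
    out = list(seq)
    for i in masked_positions:
        # The stand-in generator may re-draw the original base; as in
        # ELECTRA, such a position then counts as "original", not "replaced".
        out[i] = rng.choice(BASES)
    labels = [int(a != b) for a, b in zip(seq, out)]
    return "".join(out), labels

seq = "ATCGATCGATGCATGCATGC"
corrupted, labels = corrupt(seq, rng=random.Random(0))
print(corrupted)
print(labels)  # one label per base, so every position is supervised
```

Note that `labels` has the same length as the input, which is where the dense supervision comes from: the discriminator receives a training signal at every base, not only at the masked ones.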

NucEL Figure 2: tokenization ablation and efficiency-performance tradeoff


Highlights

  • ELECTRA RTD for genomics. Dense supervision across every base; ~6.7× more signal per forward pass than 15 %-masking MLM.
  • Single-nucleotide tokenization. Vocab of 27 tokens (4 nt + 7 special + 16 reserved). No k-mer collapse, no BPE drift; preserves base-level detail for tasks like motif discovery.
  • ModernBERT backbone. Hybrid local (128 bp window) and global (every 3 layers) attention, FlashAttention-2, RoPE, GeGLU.
  • Compute-efficient. Pre-trained on the human genome only with 8× A100 GPUs (50 epochs, global batch 192).
  • State-of-the-art on standard benchmarks despite using ~25× fewer parameters than the strongest baselines (see Results).
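
The ratios quoted in the highlights can be checked with a few lines of arithmetic. This is only a sanity check; the 2.5e9 figure is the nominal size taken from the NT-Multi-2.5B model name, which is an assumption about its exact parameter count.

```python
mask_ratio = 0.15

# RTD supervises every position; MLM supervises only the masked fraction.
dense_vs_mlm = 1.0 / mask_ratio

# Vocabulary breakdown: nucleotides + special tokens + reserved slots.
vocab_size = 4 + 7 + 16

# Nominal parameter counts (assumed from the model names).
param_ratio = 2.5e9 / 93e6

print(f"{dense_vs_mlm:.1f}x")  # 6.7x
print(vocab_size)              # 27
print(f"{param_ratio:.1f}x")   # 26.9x
```

The "~25×" and "27×" figures quoted in this README both approximate that last ratio.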

Model card

| Aspect | Value |
| --- | --- |
| Architecture | ELECTRA RTD with ModernBERT backbone |
| Discriminator | 22 layers · hidden 512 · 16 heads · intermediate 2048 |
| Generator | 11 layers · hidden 256 · 8 heads · intermediate 1024 |
| Attention | local window 128 + global every 3rd layer |
| Tokenization | single-nucleotide (k = 1), vocab = 27 |
| Parameters | 93 M (discriminator only, ELECTRA backbone) |
| Max sequence | 1024 bp during pre-training (8192 supported by config) |
| Pre-training data | GRCh38 / hg38 human genome (1224 bp windows, 100 bp overlap) |
| Objective | L_total = L_gen + 50.0 * L_disc |
| Mask ratio | 0.15 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999), lr = 1e-4 |
| Schedule | cosine, 1000-step warmup, max grad norm 1.0 |
| Compute | 8× NVIDIA A100, FP16, 50 epochs, global batch 192 |
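
The objective row can be read as a weighted sum of two losses. The sketch below assumes the usual ELECTRA formulation (cross-entropy over masked positions for the generator, binary cross-entropy over all positions for the discriminator); the model card only states the 50.0 weighting, so the two component definitions are assumptions, and all function names are hypothetical.

```python
import math

def generator_loss(p_true_token):
    """Mean cross-entropy over masked positions.

    p_true_token[i] is the generator's probability for the true base
    at the i-th masked position.
    """
    return -sum(math.log(p) for p in p_true_token) / len(p_true_token)

def discriminator_loss(p_replaced, labels):
    """Mean binary cross-entropy over all positions (label 1 = replaced)."""
    total = 0.0
    for p, y in zip(p_replaced, labels):
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def total_loss(l_gen, l_disc, disc_weight=50.0):
    """L_total = L_gen + disc_weight * L_disc."""
    return l_gen + disc_weight * l_disc

# Toy probabilities, chosen only to exercise the functions.
l_gen = generator_loss([0.5, 0.25])
l_disc = discriminator_loss([0.9, 0.1, 0.2, 0.1], [1, 0, 0, 0])
print(total_loss(l_gen, l_disc))
```

The large discriminator weight compensates for the per-position binary loss being much smaller in magnitude than the generator's cross-entropy over the vocabulary.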

Results

GUE (Table 1, MCC; F1 for CVC). Bold = best; underlined = second-best.

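| Model | TF-H | PD | CPD | SSP | TF-M | EMP | CVC | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DNABERT-2 (117 M) | 70.10 | 84.21 | 70.52 | 84.99 | 67.99 | 55.98 | 71.02 | 72.11 |
| NT-2500M-multi | 63.32 | 88.14 | 71.62 | 89.36 | 67.01 | 58.06 | 73.04 | 72.94 |
| NucEL (93 M) | 67.64 | 87.10 | 75.13 | 90.30 | 70.62 | 65.01 | 70.29 | 75.16 |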

Genomic Benchmarks (Table 2, accuracy; mean ± std over 5 seeds)

| Task | HyenaDNA | Caduceus-Ph | NT2-100M | NucEL |
| --- | --- | --- | --- | --- |
| Coding vs Intergenic | 0.904 | 0.915 | 0.950 | 0.941 |
| Human vs Worm | 0.964 | 0.972 | 0.972 | 0.975 |
| Human Enhancers Cohn | 0.729 | 0.747 | 0.736 | 0.735 |
| Human Enhancers Ensembl | 0.849 | 0.893 | 0.935 | 0.940 |
| Human Regulatory | 0.869 | 0.872 | 0.935 | 0.941 |
| Human OCR Ensembl | 0.783 | 0.828 | 0.776 | 0.794 |
| Human NonTATA Promoters | 0.944 | 0.946 | 0.920 | 0.973 |
| Average | 0.863 | 0.882 | 0.890 | 0.899 |

NT benchmark (Table 3, 18 tasks, MCC; mean over 10 seeds)

NucEL's average MCC of 0.664 is on par with NT2-Multi-500M (0.660 in the paper's Table 3) and NT-Multi-2.5B (0.661), the latter having roughly 27× more parameters. See docs/figure2_tradeoff.png for the full efficiency-performance tradeoff.


Installation

git clone https://github.com/FreakingPotato/NucEL.git
cd NucEL

# Editable install (recommended)
pip install -e .

# Or install the listed pinned dependencies
pip install -r requirements.txt

Tested with Python 3.10, PyTorch 2.1+. FlashAttention-2 is optional but strongly recommended on A100 / H100.

About the transformers version

NucEL's tokenizer is a custom k-mer class registered via auto_map. To let AutoTokenizer.from_pretrained honor this auto_map, install transformers>=4.48,<5.4:

pip install "transformers>=4.48,<5.4" torch

Starting with transformers==5.4.0 (released 2026-03-27), modernbert-based checkpoints were added to a MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS blocklist that disables auto_map for tokenizer auto-discovery. NucEL is collateral damage of that change. The model side (AutoModel.from_pretrained) is unaffected, but the tokenizer must be loaded explicitly — see Quickstart on transformers 5.4+.
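
If a script has to run under both version ranges, one option is to branch on the installed transformers version. The helper below is a hypothetical sketch (not part of this repo) that only encodes the 5.4 cutoff described above:

```python
import re

def needs_explicit_tokenizer(transformers_version: str) -> bool:
    """Return True for transformers 5.4 or newer.

    Hypothetical helper: it encodes only the version cutoff described
    in this README and does not inspect the library itself.
    """
    match = re.match(r"(\d+)\.(\d+)", transformers_version)
    if match is None:
        raise ValueError(f"unrecognized version: {transformers_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (5, 4)

for v in ("4.48.0", "5.3.2", "5.4.0"):
    print(v, needs_explicit_tokenizer(v))
```

At runtime the version string can be obtained with `importlib.metadata.version("transformers")`; when the helper returns True, load `NucEL_Tokenizer` directly instead of going through `AutoTokenizer`.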

Quickstart

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
model     = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

inputs = tokenizer("ATCGATCGATGCATGCATGC", return_tensors="pt")
with torch.no_grad():
    h = model(**inputs).last_hidden_state          # (1, L, 512)
cls = h[:, 0, :]                                   # (1, 512)

Runnable: examples/quickstart.py.

Quickstart on transformers 5.4+

If you cannot pin transformers<5.4, bypass AutoTokenizer and import NucEL_Tokenizer directly. Once you have nucel installed via pip install -e ., that's a one-liner:

import torch
from transformers import AutoModel
from nucel.tokenizer import NucEL_Tokenizer

tokenizer = NucEL_Tokenizer.from_pretrained("FreakingPotato/NucEL")
model     = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

inputs = tokenizer("ATCGATCGATGCATGCATGC", return_tensors="pt")
with torch.no_grad():
    h = model(**inputs).last_hidden_state

Without installing this repo, you can also fetch tokenizer.py from the Hub at runtime:

import sys
from huggingface_hub import snapshot_download
local = snapshot_download("FreakingPotato/NucEL",
                          allow_patterns=["tokenizer*.py", "tokenizer_config.json",
                                          "vocab.json", "special_tokens_map.json"])
sys.path.insert(0, local)
from tokenizer import NucEL_Tokenizer
tokenizer = NucEL_Tokenizer.from_pretrained(local)

Reproducing pre-training

scripts/pretrain.py has the paper-exact defaults baked in. To reproduce the released checkpoint on 8× A100:

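# Per-GPU batch size 24 × 8 GPUs = 192 global batch.
# (Keep comments off the continuation lines: text after a trailing "\" breaks the command.)
deepspeed scripts/pretrain.py \
    --batch_size 24 \
    --epochs 50 \
    --learning_rate 1e-4 \
    --dataset_script_path FreakingPotato/nt2_homo_sapiens_genome_1024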

All other architecture and training hyperparameters default to the values reported in Sections 3.2 and 3.4 of the paper.
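
For orientation, the global batch size and the learning-rate schedule from the model card can be sketched as follows. This assumes linear warmup followed by cosine decay to zero, which is a common reading of "cosine, 1000-step warmup" but is not confirmed by this README; `total_steps` is a hypothetical value, and `lr_at` is not a function from `scripts/pretrain.py`.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_steps=1000):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup steps.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

per_gpu_batch, n_gpus = 24, 8
global_batch = per_gpu_batch * n_gpus
print(global_batch)  # 192

total_steps = 100_000  # hypothetical; depends on dataset size and epochs
for step in (0, 500, 1000, 50_000, 100_000):
    print(step, lr_at(step, total_steps))
```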

Reproducing GUE results

# All 12 human-subset tasks, 3 seeds; defaults to the FreakingPotato/NucEL checkpoint
bash scripts/run_gue.sh

# A single task
python scripts/finetune_gue.py --task prom_300_all --random_seed 42

# List all tasks
python scripts/finetune_gue.py --list_tasks

Per-seed CSVs are written to benchmark_GUE_results/FreakingPotato_NucEL/<task>/... and an aggregate summary_seed*_k*.csv is written at the root.
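
As a reminder of what is being aggregated, the sketch below computes the binary Matthews correlation coefficient (MCC) from confusion counts and then the mean and standard deviation across seeds. The confusion counts are synthetic and the helper names are hypothetical; the layout of the CSVs written by the scripts is not assumed here.

```python
import math
import statistics

def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def summarize(scores):
    """Mean and sample standard deviation over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Synthetic confusion counts for three seeds (not results from the paper).
per_seed_counts = [(80, 85, 15, 20), (82, 83, 17, 18), (78, 88, 12, 22)]
scores = [mcc(*counts) for counts in per_seed_counts]
mean, std = summarize(scores)
print(f"MCC = {mean:.3f} ± {std:.3f} over {len(scores)} seeds")
```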

Embedding visualization

Reproduces Figure 3 / Table 4 (gene-biotype embeddings) and the GUE-style t-SNEs:

# Biotype embeddings (Fig 3 / Table 4)
python scripts/embedding_visualization.py \
    --mode biotype \
    --target_model NucEL \
    --pooling cls \
    --input_fasta path/to/ensembl_biotypes.fasta \
    --output_dir embeddings/

# GUE regulatory / TF marker embeddings (Section 4.1)
python scripts/embedding_visualization.py \
    --mode gue \
    --target_model NucEL \
    --pooling mean \
    --input_fasta path/to/regulatory_elements_3000.fasta

The same script supports --target_model DNABERT2 | NT2_100m | HyenaDNA for head-to-head comparisons against the baselines reported in the paper.
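
The two `--pooling` options differ in how a sequence of token vectors is reduced to one embedding. The sketch below uses plain lists in place of tensors and assumes that mean pooling averages only over non-padding tokens; the script's exact handling (for example, of special tokens) may differ, and both function names are hypothetical.

```python
def cls_pool(hidden):
    """Use the vector at position 0 as the sequence embedding."""
    return hidden[0]

def mean_pool(hidden, attention_mask):
    """Average token vectors where attention_mask is 1 (assumed behavior)."""
    kept = [vec for vec, m in zip(hidden, attention_mask) if m]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy example: 3 tokens, hidden size 2; the last token is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(cls_pool(hidden))         # [1.0, 2.0]
print(mean_pool(hidden, mask))  # [2.0, 3.0]
```

With real model outputs, the same reduction would be applied to `last_hidden_state` using the tokenizer's `attention_mask`.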

Repository layout

NucEL/
├── nucel/                   # importable package
│   ├── tokenizer.py        # NucEL_Tokenizer (k-mer, k = 1)
│   ├── modeling.py         # ElectraForPreTraining + ModernBert discriminator
│   └── data.py             # GenomeDataset + ElectraDataCollator
├── scripts/
│   ├── pretrain.py         # ELECTRA pre-training (paper-exact defaults)
│   ├── finetune_gue.py     # GUE benchmark fine-tuning
│   ├── run_gue.sh          # 3-seed driver script
│   └── embedding_visualization.py  # Section 4.1 / 4.6
├── examples/
│   └── quickstart.py
├── hf_patch/               # AutoTokenizer fix for Hugging Face Hub
└── docs/
    ├── architecture.png    # Figure 1
    └── figure2_tradeoff.png  # Figure 2 (tokenization + efficiency)

Citation

@article{ding2025nucel,
  title   = {NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training
             for Efficient and Interpretable Representations},
  author  = {Ding, Ke and Parker, Brian and Wen, Jiayu},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.08.17.670700},
  url     = {https://www.biorxiv.org/content/10.1101/2025.08.17.670700v1}
}

License

Apache-2.0. See LICENSE.
