EVA: A Long-Context Generative Foundation Model Deciphers RNA Design Principles

🧬 A long-context generative foundation model for universal RNA modeling and design.

EVA (Evolutionary Versatile Architect) is a generative RNA foundation model trained on OpenRNA v1, a curated atlas of 114 million full-length RNA sequences spanning all domains of life. Built on a 1.4B-parameter decoder-only Transformer with a Mixture-of-Experts (MoE) backbone and an 8,192-token context window, EVA unifies RNA sequence scoring and controllable design within a single framework.

Why Use EVA?

EVA offers the following advantages:

🔓	Fully Open-Sourced	All training data, model weights, and code for fine-tuning, training, and inference are publicly released, so the community can reproduce, build upon, and extend the work
📏	8x Larger Context Window	8,192-token context window vs. ~1,024 in prior RNA models — enabling full-length RNA processing without truncation or information loss
🗄️	7x More Training Data	Trained on OpenRNA v1: 114M full-length RNA sequences across all domains of life (Eukaryota, Bacteria, Archaea, Viruses), 7.7x larger than RNAcentral v25.0
🏆	SOTA Fitness Prediction	Achieves state-of-the-art zero-shot fitness prediction on ncRNA and mRNA, no task-specific fine-tuning needed
🎯	10x+ RNA Generation Accuracy	Over 10x improvement in RNA generation accuracy at both sequence and structure level compared to prior methods
🧬	11 RNA Types Supported	Controllable generation across 11 RNA classes (mRNA, tRNA, rRNA, lncRNA, snRNA, snoRNA, miRNA, and more) conditioned on RNA type and taxonomic lineage
⚙️	Novel Architecture, Capabilities & Training	1.4B-parameter MoE decoder-only Transformer with dual generation paradigms — CLM for de novo design & GLM for domain redesign — unified in a single model

Key Modules

Module	Path	Description
Model Architecture	`eva/`	MoE Transformer backbone (modeling, attention, MoE routing, causal LM head)
Lineage Tokenizer	`eva/lineage_tokenizer.py`	Tokenizer for taxonomic lineage strings
Generation CLI	`tools/generate.py`	Entry point for CLM and GLM sequence generation
Scoring CLI	`tools/predict.py`	Entry point for log-likelihood scoring
Directed Evolution	`tools/directed_evolution.py`	In-silico directed evolution pipeline
Generators	`tools/utils/generators/`	CLM autoregressive and GLM span-infilling implementations
Scorers	`tools/utils/scorers/`	Sequence scoring logic and batch workers
Condition Control	`tools/utils/conditions/`	RNA type and taxonomic lineage conditioning
Model Loader & Sampler	`tools/utils/model/`	Checkpoint loading and sampling strategies
Config Templates	`config/`	YAML configuration examples for generation, scoring, and directed evolution

Our Journey with EVA Starts Here 👋

The Table of Contents below lists each section of this guide.

Table of Contents

Quick Start
Condition Control
- RNA Types
- Species/Lineage
Generation
Scoring
- RNA Mode
- Protein Mode
Directed Evolution
- Usage
- Key Parameters
Batch Processing with YAML
Input/Output Formats
Data Availability
Citation
License

Quick Start

1. Pull Docker Image

From Zenodo (link coming soon):

# Download all split archives (part000-part011), then:
cat eva_latest.tar.gz.part* > eva_latest.tar.gz
docker load -i eva_latest.tar.gz

Or build from Dockerfile:

cd docker  # run from the repository root
docker build -t eva:latest .

2. Run Container

docker run --gpus all -it eva:latest bash

All model files and output data stay inside the container.

3. Download Model

huggingface-cli download GENTEL-Lab/EVA --local-dir ./checkpoint

4. Generate Your First Sequences

python /eva/tools/generate.py \
    --checkpoint ./checkpoint \
    --format clm \
    --rna_type mRNA \
    --taxid 9606 \
    --num_seqs 10 \
    --output ./output/demo.fa

This generates 10 human mRNA sequences and saves them to ./output/demo.fa.

Condition Control

EVA supports conditioning on RNA type and species/lineage for both generation (generate.py) and scoring (predict.py). These conditions can be used independently or combined.

RNA Types

RNA Type	Description
mRNA	Messenger RNA - carries genetic information from DNA to ribosomes
tRNA	Transfer RNA - brings amino acids to the ribosome during translation
rRNA	Ribosomal RNA - forms the core of the ribosome structure
miRNA	MicroRNA - regulates gene expression
lncRNA	Long non-coding RNA - various regulatory functions
circRNA	Circular RNA - circularized RNA molecules
snoRNA	Small nucleolar RNA - modifies other RNAs
snRNA	Small nuclear RNA - involved in splicing
piRNA	PIWI-interacting RNA - silences transposons
sRNA	Small RNA - general category for small RNA molecules
viral_RNA	RNA from viruses

Species/Lineage

Species can be specified in three ways: --taxid, --species, or --lineage (Greengenes format).

Common species:

TaxID	Species
9606	Homo sapiens (Human)
10090	Mus musculus (Mouse)
10116	Rattus norvegicus (Rat)
7227	Drosophila melanogaster (Fruit fly)
6239	Caenorhabditis elegans (Nematode)
3702	Arabidopsis thaliana (Plant)
4932	Saccharomyces cerevisiae (Yeast)
562	Escherichia coli (Bacteria)
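
The `--lineage` option takes a Greengenes-style string: each rank is written as a one-letter prefix, two underscores, and the taxon name, and ranks are joined with semicolons. The sketch below builds and parses such a string. The helper names `format_lineage` and `parse_lineage` are hypothetical and not part of the EVA codebase, and the rank prefixes are inferred from the human lineage example in the YAML section of this README.

```python
# Hypothetical helpers for Greengenes-style lineage strings; not EVA's own API.
RANK_PREFIXES = ["D", "P", "C", "O", "F", "G", "S"]  # domain down to species


def format_lineage(taxa):
    """Join taxon names into 'D__...;P__...;...' in rank order."""
    if len(taxa) > len(RANK_PREFIXES):
        raise ValueError("more taxa than known ranks")
    return ";".join(f"{prefix}__{name}" for prefix, name in zip(RANK_PREFIXES, taxa))


def parse_lineage(lineage):
    """Split a lineage string back into a {prefix: name} dictionary."""
    ranks = {}
    for field in lineage.split(";"):
        prefix, _, name = field.partition("__")
        ranks[prefix] = name
    return ranks


human = format_lineage(
    ["Eukaryota", "Chordata", "Mammalia", "Primates", "Hominidae", "Homo", "Homo sapiens"]
)
print(human)
print(parse_lineage(human)["S"])  # Homo sapiens
```

The resulting string can be passed on the command line as `--lineage "<string>"`; quoting matters because the species name contains a space.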

Generation

Over 10x higher generation accuracy than prior methods, at both the sequence and structure level.

Figure: RNA landscape modeling comparison

Figure: Species-specific modeling comparison

CLM

CLM (Causal Language Model) generates RNA sequences autoregressively from left to right. This is the primary generation mode in EVA.

Unconditional Generation

Generate sequences without any biological constraints:

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --num_seqs 1000 \
    --output /output/unconditional.fa

Conditional Generation

EVA supports conditioning on RNA type, species (via TaxID, species name, or lineage string), or both. See Condition Control for the full list of supported RNA types and species.

# RNA type only
python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --rna_type mRNA \
    --num_seqs 1000 \
    --output /output/mrna.fa

# Species only (via TaxID)
python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --taxid 9606 \
    --num_seqs 1000 \
    --output /output/human.fa

# Both RNA type and species
python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --rna_type mRNA \
    --taxid 9606 \
    --num_seqs 1000 \
    --output /output/human_mrna.fa

Species can also be specified via --species homo_sapiens or --lineage "D__Eukaryota;P__Chordata;..." in Greengenes format.

Continuation Mode

Extend existing sequences in either direction. Use --split_ratio (fraction) or --split_pos (exact position) to control the split point.

Forward (extend 3' end):

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --input /input/partial_seq.fa \
    --direction forward \
    --split_ratio 0.5 \
    --num_seqs 5 \
    --output /output/continuation.fa

Reverse (extend 5' end):

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --input /input/partial_seq.fa \
    --direction reverse \
    --split_pos 699 \
    --num_seqs 20 \
    --output /output/reverse_continuation.fa

Add --output_details to include prompt, ground truth, and generated content in the output.
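
To make the two split options concrete, here is a minimal sketch of how a split point could divide an input sequence into a prompt and a held-out ground truth. It is an illustration, not EVA's implementation: the function name `split_for_continuation` is hypothetical, and it assumes the split index is counted from the 5' end, that forward mode prompts with the 5' part, and that reverse mode prompts with the 3' part.

```python
def split_for_continuation(seq, direction="forward", split_ratio=None, split_pos=None):
    """Return (prompt, ground_truth) for a continuation task.

    Assumption: the split index is counted from the 5' end. In forward mode
    the 5' part is the prompt; in reverse mode the 3' part is the prompt.
    """
    if (split_ratio is None) == (split_pos is None):
        raise ValueError("give exactly one of split_ratio or split_pos")
    pos = int(len(seq) * split_ratio) if split_pos is None else split_pos
    if not 0 < pos < len(seq):
        raise ValueError("split point must fall inside the sequence")
    left, right = seq[:pos], seq[pos:]
    if direction == "forward":
        return left, right
    if direction == "reverse":
        return right, left
    raise ValueError("direction must be 'forward' or 'reverse'")


seq = "AUGCGCUAUGCGCUAUGCGC"  # 20 nt
print(split_for_continuation(seq, "forward", split_ratio=0.5))
# ('AUGCGCUAUG', 'CGCUAUGCGC')
print(split_for_continuation(seq, "reverse", split_pos=5))
# ('CUAUGCGCUAUGCGC', 'AUGCG')
```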

GLM

GLM (General Language Model) performs span infilling — it masks a region within an existing sequence and generates what should fill the gap based on surrounding context. Like CLM, GLM supports both unconditional and conditional generation.

Unconditional Infilling

Fill in a masked region without any biological constraints:

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format glm \
    --input /input/sequences.fa \
    --span_ratio 0.1 \
    --num_seqs 5 \
    --output /output/glm_output.fa

Conditional Infilling

Condition on RNA type and/or species to generate biologically consistent infills:

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format glm \
    --input /input/sequences.fa \
    --rna_type mRNA \
    --taxid 9606 \
    --span_ratio 0.2 \
    --num_seqs 5 \
    --output /output/glm_conditional.fa

Span Parameters

Parameter	Description	Example
`--span_length`	Fixed number of nucleotides to mask	`--span_length 20`
`--span_ratio`	Fraction of sequence to mask	`--span_ratio 0.1`
`--span_position`	Where to place the span: `random` or specific index	`--span_position 100`
`--span_id`	Which span token to use: `random` or 0-49	`--span_id 0`
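
The span options interact as follows: either a fixed length or a ratio sets the span size, and the position is either random or a given index. The sketch below shows one way to resolve them. It is a simplified illustration under stated assumptions: `choose_span` and `mask_span` are hypothetical names, the position is treated as a 0-based start index, and the `<span_N>` placeholder spelling is invented here rather than taken from the EVA tokenizer.

```python
import random


def choose_span(seq_len, span_length=None, span_ratio=None, span_position="random", rng=random):
    """Return (start, end) of the region to mask, with end exclusive."""
    if (span_length is None) == (span_ratio is None):
        raise ValueError("give exactly one of span_length or span_ratio")
    length = span_length if span_length is not None else max(1, round(seq_len * span_ratio))
    if length > seq_len:
        raise ValueError("span longer than sequence")
    if span_position == "random":
        start = rng.randint(0, seq_len - length)  # both bounds inclusive
    else:
        start = int(span_position)  # assumed to be a 0-based start index
        if start < 0 or start + length > seq_len:
            raise ValueError("span does not fit at this position")
    return start, start + length


def mask_span(seq, start, end, span_id=0):
    """Replace seq[start:end] with a placeholder token; return (masked, target)."""
    token = f"<span_{span_id}>"  # hypothetical token spelling
    return seq[:start] + token + seq[end:], seq[start:end]


seq = "AUGCGCUAUGCGCUAUGCGC"
start, end = choose_span(len(seq), span_length=4, span_position=8)
print(mask_span(seq, start, end, span_id=0))
# ('AUGCGCUA<span_0>CUAUGCGC', 'UGCG')
```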

Sampling Parameters

Parameter	Description	Recommended Range
`--temperature`	Controls randomness. Lower = more deterministic, higher = more diverse	0.1 - 1.5
`--top_k`	Only consider the top k most likely nucleotides at each position	10 - 100
`--top_p`	Nucleus sampling — consider smallest set of nucleotides whose cumulative probability exceeds p	0.8 - 0.95
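
For readers unfamiliar with these knobs, the sketch below applies temperature, top-k, and top-p filtering to a toy four-nucleotide distribution using only the standard library. It illustrates the general technique and is not EVA's sampler: the function name `filter_distribution` and the logit values are made up, and the order of operations (temperature, then top-k, then top-p measured on the unfiltered probabilities) is one common convention that real implementations may vary.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, then top-k, then top-p; return renormalized probabilities."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Rank tokens from most to least likely.
    ranked = sorted(probs, key=probs.get, reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for tok in ranked:
            kept.append(tok)
            cumulative += probs[tok]
            if cumulative >= top_p:
                break
        ranked = kept

    kept_total = sum(probs[tok] for tok in ranked)
    return {tok: probs[tok] / kept_total for tok in ranked}


logits = {"A": 2.0, "C": 1.0, "G": 0.5, "U": -1.0}  # toy values
print(filter_distribution(logits, temperature=0.8, top_k=3, top_p=0.9))
```

A nucleotide can then be drawn from the returned dictionary with `random.choices(list(p), weights=list(p.values()))`.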

Example with all sampling parameters:

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --temperature 0.8 \
    --top_k 50 \
    --top_p 0.9 \
    --num_seqs 100 \
    --output /output/sampled.fa

Scoring

EVA achieves state-of-the-art zero-shot fitness prediction across both mRNA and ncRNA benchmarks, substantially outperforming existing RNA, codon, and DNA language models — including models with up to 40B parameters. This strong sequence-level understanding directly translates into high-quality generation: EVA's learned distribution faithfully captures the evolutionary constraints of natural RNA, enabling it to produce biologically realistic sequences out of the box.

Evaluate how well a given sequence fits the model's learned distribution by computing its log-likelihood. Higher (less negative) scores indicate more probable sequences.
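
As a reminder of what the score measures, the sketch below computes a sequence log-likelihood from per-token conditional probabilities, both as a raw sum and as a length-normalized mean. The function name and the probability values are illustrative assumptions; whether `predict.py` reports the sum or the mean by default is not stated here, so check the tool's options (for example, the `normalize` key in the YAML scoring example) before comparing sequences of different lengths.

```python
import math


def sequence_log_likelihood(token_probs, normalize=False):
    """Sum log p(x_t | x_<t) over tokens; optionally divide by the length."""
    if not token_probs:
        raise ValueError("empty sequence")
    total = sum(math.log(p) for p in token_probs)
    return total / len(token_probs) if normalize else total


# Made-up per-nucleotide probabilities, for illustration only.
probs = [0.5, 0.25, 0.5, 0.25]
print(sequence_log_likelihood(probs))                  # raw sum
print(sequence_log_likelihood(probs, normalize=True))  # per-token mean
```

The raw sum grows more negative with length, so the per-token mean is usually the fairer quantity when ranking sequences of different sizes.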

RNA Mode

Score RNA sequences and compute per-sequence log-likelihood:

python /eva/tools/predict.py \
    --checkpoint /path/to/model \
    --input /input/sequences.fa \
    --output /output/scores.json

Scoring supports --rna_type and --taxid conditioning, with the same meaning as in generation.

Protein Mode

Score protein sequences by reverse-translating them to RNA first:

python /eva/tools/predict.py \
    --checkpoint /path/to/model \
    --input /input/proteins.fa \
    --output /output/protein_scores.json \
    --mode protein \
    --codon_optimization first

--codon_optimization options: first (first codon in table) or most_frequent (most common codon for the species).
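
The two strategies can be pictured with a toy reverse-translation routine. This is a sketch only: `reverse_translate` is a hypothetical name, the codon table is deliberately partial, its ordering is an assumption (EVA's table order is not documented here), and the usage weights are placeholders rather than real species statistics.

```python
# Partial codon table for illustration; a real table covers all 20 amino acids.
CODONS = {
    "M": ["AUG"],
    "K": ["AAA", "AAG"],
    "F": ["UUU", "UUC"],
    "*": ["UAA", "UAG", "UGA"],
}
# Placeholder usage weights, NOT measured codon frequencies for any species.
USAGE = {
    "AUG": 1.0, "AAA": 0.4, "AAG": 0.6, "UUU": 0.45,
    "UUC": 0.55, "UAA": 0.3, "UAG": 0.2, "UGA": 0.5,
}


def reverse_translate(protein, strategy="first"):
    """Map each amino acid to one codon according to the chosen strategy."""
    rna = []
    for aa in protein:
        options = CODONS[aa]  # KeyError for residues missing from this partial table
        if strategy == "first":
            rna.append(options[0])
        elif strategy == "most_frequent":
            rna.append(max(options, key=USAGE.get))
        else:
            raise ValueError("strategy must be 'first' or 'most_frequent'")
    return "".join(rna)


print(reverse_translate("MKF*", "first"))          # AUGAAAUUUUAA
print(reverse_translate("MKF*", "most_frequent"))  # AUGAAGUUCUGA
```

Because many RNA sequences encode the same protein, the resulting score depends on the codon choice; scores from the two strategies are not directly comparable.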

Directed Evolution

EVA supports in-silico directed evolution — an iterative optimization pipeline that improves a given RNA sequence through cycles of mutation, LLM-based fitness scoring, and structural stability evaluation. Starting from an existing RNA sequence, EVA applies point mutations at specified positions, scores each mutant by log-likelihood and Minimum Free Energy (MFE via LinearFold), and uses simulated annealing with dynamic beam search to balance exploration and exploitation. Over multiple rounds the temperature cools to converge on optimized sequences, and the top N candidates ranked by combined LLM + MFE score are returned.
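
Two building blocks of such a pipeline are easy to sketch: enumerating point mutants at a position, and deciding whether to keep a candidate at a given temperature. The code below uses the textbook Metropolis acceptance rule as a stand-in. The source only states that simulated annealing is used, so the exact rule in `directed_evolution.py` may differ, and the names `point_mutants` and `accept` are hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGU"


def point_mutants(seq, position):
    """All single-nucleotide substitutions of seq at a 0-based position."""
    return [
        seq[:position] + nt + seq[position + 1:]
        for nt in NUCLEOTIDES
        if nt != seq[position]
    ]


def accept(current_score, candidate_score, temperature, rng=random):
    """Metropolis rule: always accept improvements, sometimes accept worse moves."""
    if candidate_score >= current_score:
        return True
    return rng.random() < math.exp((candidate_score - current_score) / temperature)


print(point_mutants("AUGC", 1))  # ['AAGC', 'ACGC', 'AGGC']
print(accept(-10.0, -9.0, temperature=0.5))  # True, the candidate scores higher
```

At high temperature, worse candidates are accepted often (exploration); as the temperature falls, the rule approaches greedy selection (exploitation).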

Usage

python /eva/tools/directed_evolution.py \
    --checkpoint /path/to/model \
    --input /input/sequence.fa \
    --output /output/evolved.fa \
    --rna_type mRNA \
    --taxid 9606 \
    --iterations 10 \
    --mutations 5 \
    --beam_width 10 \
    --output_count 5

Key Parameters

Parameter	Description	Default
`--iterations`	Number of evolution cycles	10
`--mutations`	Total mutations per iteration	2
`--beam_width`	Beam search width — keeps top candidates after each position	10
`--output_count`	Number of final sequences to output	5
`--T_init` / `--T_min`	Simulated annealing temperature range	1.0 → 0.01
`--cooling_rate`	Temperature decay per iteration	0.95
`--mutate_positions`	Specific positions to mutate (0-based, comma-separated)	Random
`--mutate_range`	Range of positions to mutate (e.g., `0-100`)	Entire sequence
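
The annealing and position options can be illustrated with a short sketch. It assumes geometric cooling clipped at the minimum temperature and an inclusive end for `--mutate_range`; neither detail is confirmed by this README, and the function names are hypothetical.

```python
def temperature_schedule(iterations, t_init=1.0, t_min=0.01, cooling_rate=0.95):
    """Geometric cooling clipped at t_min (assumed schedule)."""
    return [max(t_min, t_init * cooling_rate ** i) for i in range(iterations)]


def parse_positions(mutate_positions=None, mutate_range=None, seq_len=0):
    """Resolve candidate positions from '3,7,12' or '0-100'; default is the whole sequence."""
    if mutate_positions:
        return [int(p) for p in mutate_positions.split(",")]
    if mutate_range:
        start, end = (int(x) for x in mutate_range.split("-"))
        return list(range(start, min(end, seq_len - 1) + 1))  # assumes an inclusive end
    return list(range(seq_len))


print([round(t, 4) for t in temperature_schedule(3)])   # [1.0, 0.95, 0.9025]
print(parse_positions(mutate_range="0-4", seq_len=20))  # [0, 1, 2, 3, 4]
print(parse_positions(mutate_positions="3,7,12"))       # [3, 7, 12]
```

Under this assumed schedule, cooling from 1.0 to 0.01 at a rate of 0.95 takes roughly 90 iterations, so a 10-iteration run ends near a temperature of 0.63.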

Batch Processing with YAML

Define multiple tasks in a single YAML config file. The defaults section sets shared parameters, which individual tasks can override.
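
The override rule can be pictured as a dictionary merge in which task-level keys win. The sketch below uses plain dictionaries in place of a parsed YAML file; `resolve_task` and the `cool_sampling` task are hypothetical, and the actual merge logic in the EVA tools may handle nested keys differently.

```python
def resolve_task(defaults, task):
    """Return task settings with shared defaults filled in; task values win."""
    return {**defaults, **task}


# Stand-in for a parsed YAML config.
config = {
    "defaults": {"temperature": 1.0, "top_k": 50, "batch_size": 1},
    "tasks": [
        {"name": "unconditional", "format": "clm", "num_seqs": 1000},
        {"name": "cool_sampling", "format": "clm", "temperature": 0.7},
    ],
}

resolved = [resolve_task(config["defaults"], task) for task in config["tasks"]]
for settings in resolved:
    print(settings["name"], settings["temperature"], settings["top_k"])
# unconditional 1.0 50
# cool_sampling 0.7 50
```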

Generation Config Example

checkpoint: /path/to/model
output_dir: ./output

defaults:
  temperature: 1.0
  top_k: 50
  max_length: 8192
  batch_size: 1

tasks:
  - name: unconditional
    mode: generation
    format: clm
    num_seqs: 1000

  - name: human_mrna
    mode: generation
    format: clm
    rna_type: mRNA
    taxid: "9606"
    lineage: "D__Eukaryota;P__Chordata;C__Mammalia;O__Primates;F__Hominidae;G__Homo;S__Homo sapiens"
    num_seqs: 1000

  - name: glm_infill
    mode: generation
    format: glm
    input: ./input/seqs.fa
    span_ratio: 0.1
    num_seqs: 5

Scoring Config Example

checkpoint: /path/to/model
output_dir: ./scores

defaults:
  batch_size: 128

tasks:
  - name: score_basic
    mode: scoring
    input: ./input/seqs.fa

  - name: score_human_mrna
    mode: scoring
    input: ./input/seqs.fa
    rna_type: mRNA
    taxid: "9606"
    normalize: true
    exclude_special_tokens: true

  - name: score_protein
    mode: scoring
    input: ./input/proteins.fa
    scoring_mode: protein
    codon_optimization: first

Running

python /eva/tools/generate.py --config config.yaml              # Run all tasks
python /eva/tools/generate.py --config config.yaml --task name   # Run specific task
python /eva/tools/generate.py --config config.yaml --device cuda:1  # Override device

Input/Output Formats

Input — FASTA

>sequence_id_1
AUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCU
>sequence_id_2
AUGAAAAUGCGGCCGCAUUACGUAAACGGCCGCAAAUGUUUCCGGCAAA
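
Before submitting a file, it can help to check that it parses and contains only RNA letters. The following standard-library sketch reads FASTA records and flags sequences with characters outside A, C, G, U. The function names are hypothetical, and whether EVA itself accepts T or ambiguity codes such as N is not stated here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def is_rna(seq):
    """True if the sequence uses only A, C, G, U."""
    return set(seq) <= set("ACGU")


example = """>sequence_id_1
AUGCGCUAUGCGCU
>sequence_id_2
AUGAAAAUGCGG
""".splitlines()

records = list(read_fasta(example))
print(records)
print(all(is_rna(seq) for _, seq in records))  # True
```

To read a real file, pass an open file handle instead of the inline list, for example `read_fasta(open("sequences.fa"))`.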

Output — Generation (FASTA)

>unconditional_0
AUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCU

With --output_details (GLM / Continuation):

>test_seq_sample0_forward_split50
PROMPT: AUGCGCUAUGCGCUAUGCG
GROUND_TRUTH: CU AUGCGCUAUGCG
GENERATED: CU AAUGCGCUAGCG
FULL_SEQ: AUGCGCUAUGCGCUAAUGCGCUAGCG

Output — Scoring (JSON)

{
  "scores": [
    {
      "header": "seq1",
      "sequence": "AUGGCCGUAGU...",
      "length": 67,
      "log_likelihood": -1.25
    }
  ]
}

A higher (less negative) log_likelihood means the model considers the sequence more probable.
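
A typical follow-up is to rank scored sequences. The sketch below sorts records by `log_likelihood`, using the field names from the example above; the function name `rank_scores` and the inline values are illustrative only.

```python
import json


def rank_scores(payload, top_n=None):
    """Sort score records from highest to lowest log-likelihood."""
    ranked = sorted(payload["scores"], key=lambda r: r["log_likelihood"], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Inline stand-in for the contents of a scores.json file (made-up values).
text = """{"scores": [
  {"header": "seq1", "sequence": "AUGGCC", "length": 6, "log_likelihood": -1.25},
  {"header": "seq2", "sequence": "AUGAAA", "length": 6, "log_likelihood": -0.80}
]}"""

best = rank_scores(json.loads(text), top_n=1)
print(best[0]["header"])  # seq2
```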

Output — Directed Evolution (FASTA)

Output is in FASTA format with scores in the header:

>candidate_1_score=-125.4321_best
AUGC... (evolved sequence)
>candidate_2_score=-128.1056
AUGC... (second best)
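
The score can be recovered from each header with a regular expression. This sketch assumes headers follow exactly the pattern shown above (`<name>_score=<float>` with an optional `_best` suffix); `parse_header` is a hypothetical helper, not part of the EVA tools.

```python
import re

# Matches e.g. ">candidate_1_score=-125.4321_best"
HEADER_RE = re.compile(
    r"^>?(?P<name>.+?)_score=(?P<score>-?\d+(?:\.\d+)?)(?P<best>_best)?$"
)


def parse_header(header):
    """Return (name, score, is_best) parsed from a directed-evolution header."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    return match["name"], float(match["score"]), match["best"] is not None


print(parse_header(">candidate_1_score=-125.4321_best"))
# ('candidate_1', -125.4321, True)
print(parse_header(">candidate_2_score=-128.1056"))
# ('candidate_2', -128.1056, False)
```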

Data Availability

Some large files are not included in this repository due to size constraints. They can be downloaded from the sources listed below.

Model & Environment

Resource	Description	Source
Model Checkpoint	EVA 1.4B (MoE) pre-trained weights	HuggingFace
Docker Image (`eva_latest.tar.gz`)	Pre-built EVA runtime environment	Zenodo
Training Data	OpenRNA v1 — 114M full-length RNA sequences	HuggingFace

Experiment & Benchmark Data

Resource	Description	Size	Source
mRNA Codon Optimization	Optimized sequences from 9 vaccine targets (5 mRNA: linear, HIV, PR8, RABV, VZV; 4 circRNA: CAR-T, RABV-G, SARS-Group-I, SARS-T4) using EVA (homo/nolineage), CodonFM-1B, Evo2-1B, GemoRNA, and random baseline	26 KB	Zenodo
mRNA 6 Species	Generated & natural mRNA sequences for 6 species (Homo sapiens, Mus musculus, Rattus norvegicus, Cricetulus griseus, Drosophila melanogaster, Caenorhabditis elegans) with MFE, GC content, k-mer, loop-helix, pair metrics	38 MB	Zenodo
GeneRRNA Generation	EVA-generated rRNA sequences benchmarked against GeneRRNA database (100k natural, 1k sampled) with k-mer, loop-helix, pair, merged MFE-GC metrics	13 MB	Zenodo
Generated RNA Sequences	15 RNA types (mRNA, lncRNA, circRNA, tRNA, rRNA, miRNA, piRNA, sRNA, snRNA, snoRNA, scaRNA, vault_RNA, Y_RNA, ribozyme, viral) generated by EVA with matched natural controls (5k each) and MFE, GC, k-mer, loop-helix, pair metrics	156 MB	Zenodo

Citation

If you find EVA useful in your research, please cite:

@article{huang2026eva,
  title={EVA: A Generative Foundation Model for Universal RNA Modeling and Design},
  author={Huang, Yanjie and Lyu, Guangye and others},
  journal={TODO},
  year={2026},
  url={TODO}
}

License

This project is licensed under the Apache License 2.0. See LICENSE for details.
