Skip to content

GENTEL-lab/EVA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

EVA: A Long-Context Generative Foundation Model Deciphers RNA Design Principles

OpenRNA

Paper Hugging Face License Website

🧬 A long-context generative foundation model for universal RNA modeling and design.

EVA (Evolutionary Versatile Architect) is a generative RNA foundation model trained on OpenRNA v1, a curated atlas of 114 million full-length RNA sequences spanning all domains of life. Built on a 1.4B-parameter decoder-only Transformer with a Mixture-of-Experts (MoE) backbone and an 8,192-token context window, EVA unifies RNA sequence scoring and controllable design within a single framework.

Why Use EVA?

You should consider EVA for the reasons as follows:

🔓 Fully Open-Sourced All training data, model weights, finetuning & training & inference codes and details are publicly released — full transparency for the community to reproduce, build upon, and extend
📏 8x Larger Context Window 8,192-token context window vs. ~1,024 in prior RNA models — enabling full-length RNA processing without truncation or information loss
🗄️ 7x More Training Data Trained on OpenRNA v1 — 114M full-length RNA sequences across all domains of life (Eukaryota, Bacteria, Archaea, Viruses), 7.7x larger than the RNAcentral v25.0
🏆 SOTA Fitness Prediction Achieves state-of-the-art zero-shot fitness prediction on ncRNA and mRNA, no task-specific fine-tuning needed
🎯 10x+ RNA Generation Accuracy Over 10x improvement in RNA generation accuracy at both sequence and structure level compared to prior methods
🧬 11 RNA Types Supported Controllable generation across 11 RNA classes (mRNA, tRNA, rRNA, lncRNA, snRNA, snoRNA, miRNA, and more) conditioned on RNA type and taxonomic lineage
⚙️ Novel Architecture, Capabilities & Training 1.4B-parameter MoE decoder-only Transformer with dual generation paradigms — CLM for de novo design & GLM for domain redesign — unified in a single model

Key Modules

Module Path Description
Model Architecture eva/ MoE Transformer backbone (modeling, attention, MoE routing, causal LM head)
Lineage Tokenizer eva/lineage_tokenizer.py Tokenizer for taxonomic lineage strings
Generation CLI tools/generate.py Entry point for CLM and GLM sequence generation
Scoring CLI tools/predict.py Entry point for log-likelihood scoring
Directed Evolution tools/directed_evolution.py In-silico directed evolution pipeline
Generators tools/utils/generators/ CLM autoregressive and GLM span-infilling implementations
Scorers tools/utils/scorers/ Sequence scoring logic and batch workers
Condition Control tools/utils/conditions/ RNA type and taxonomic lineage conditioning
Model Loader & Sampler tools/utils/model/ Checkpoint loading and sampling strategies
Config Templates config/ YAML configuration examples for generation, scoring, and directed evolution

Our Journey with EVA Starts Here 👋

Click below to expand the Table of Contents and explore each section in detail.

Table of Contents


Quick Start

1. Pull Docker Image

From Zenodo (link coming soon):

# Download all split archives (part000-part011), then:
cat eva_latest.tar.gz.part* > eva_latest.tar.gz
docker load -i eva_latest.tar.gz

Or build from Dockerfile:

cd /data/yanjie_huang/eva/EVA1/docker
docker build -t eva:latest .

2. Run Container

docker run --gpus all -it eva:latest bash

All model files and output data stay inside the container.

3. Download Model

huggingface-cli download GENTEL-Lab/EVA --local-dir ./checkpoint

4. Generate Your First Sequences

python /eva/tools/generate.py \
    --checkpoint ./checkpoint \
    --format clm \
    --rna_type mRNA \
    --taxid 9606 \
    --num_seqs 10 \
    --output ./output/demo.fa

This generates 10 human mRNA sequences and saves them to ./output/demo.fa.


Condition Control

EVA supports conditioning on RNA type and species/lineage for both generation (generate.py) and scoring (predict.py). These conditions can be used independently or combined.

RNA Types

RNA Type Description
mRNA Messenger RNA - carries genetic information from DNA to ribosomes
tRNA Transfer RNA - brings amino acids to the ribosome during translation
rRNA Ribosomal RNA - forms the core of the ribosome structure
miRNA MicroRNA - regulates gene expression
lncRNA Long non-coding RNA - various regulatory functions
circRNA Circular RNA - circularized RNA molecules
snoRNA Small nucleolar RNA - modifies other RNAs
snRNA Small nuclear RNA - involved in splicing
piRNA PIWI-interacting RNA - silences transposons
sRNA Small RNA - general category for small RNA molecules
viral_RNA RNA from viruses

Species/Lineage

Species can be specified in three ways: --taxid, --species, or --lineage (Greengenes format).

Common species:

TaxID Species
9606 Homo sapiens (Human)
10090 Mus musculus (Mouse)
10116 Rattus norvegicus (Rat)
7227 Drosophila melanogaster (Fruit fly)
6239 Caenorhabditis elegans (Nematode)
3702 Arabidopsis thaliana (Plant)
4932 Saccharomyces cerevisiae (Yeast)
562 Escherichia coli (Bacteria)

Generation

10x Higher Modeling Accuracy Than Ever Before — At Both Sequence and Structure Level.

RNA Landscape Modeling Comparison
RNA landscape modeling comparison
Species-specific RNA Landscape Modeling Comparison
Species-specific modeling comparison

CLM

CLM (Causal Language Model) generates RNA sequences autoregressively from left to right. This is the primary generation mode in EVA.

Unconditional Generation

Generate sequences without any biological constraints:

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --num_seqs 1000 \
    --output /output/unconditional.fa

Conditional Generation

EVA supports conditioning on RNA type, species (via TaxID, species name, or lineage string), or both. See Condition Control for the full list of supported RNA types and species.

RNA Type Generation
# RNA type only
python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --rna_type mRNA \
    --num_seqs 1000 \
    --output /output/mrna.fa

# Species only (via TaxID)
python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --taxid 9606 \
    --num_seqs 1000 \
    --output /output/human.fa

# Both RNA type and species
python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --rna_type mRNA \
    --taxid 9606 \
    --num_seqs 1000 \
    --output /output/human_mrna.fa

Species can also be specified via --species homo_sapiens or --lineage "D__Eukaryota;P__Chordata;..." in Greengenes format.

Species Generation

Continuation Mode

Extend existing sequences in either direction. Use --split_ratio (fraction) or --split_pos (exact position) to control the split point.

Forward (extend 3' end):

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --input /input/partial_seq.fa \
    --direction forward \
    --split_ratio 0.5 \
    --num_seqs 5 \
    --output /output/continuation.fa

Reverse (extend 5' end):

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --input /input/partial_seq.fa \
    --direction reverse \
    --split_pos 699 \
    --num_seqs 20 \
    --output /output/reverse_continuation.fa

Add --output_details to include prompt, ground truth, and generated content in the output.

GLM

GLM (General Language Model) performs span infilling — it masks a region within an existing sequence and generates what should fill the gap based on surrounding context. Like CLM, GLM supports both unconditional and conditional generation.

Unconditional Infilling

Fill in a masked region without any biological constraints:

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format glm \
    --input /input/sequences.fa \
    --span_ratio 0.1 \
    --num_seqs 5 \
    --output /output/glm_output.fa

Conditional Infilling

Condition on RNA type and/or species to generate biologically consistent infills:

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format glm \
    --input /input/sequences.fa \
    --rna_type mRNA \
    --taxid 9606 \
    --span_ratio 0.2 \
    --num_seqs 5 \
    --output /output/glm_conditional.fa

Span Parameters

Parameter Description Example
--span_length Fixed number of nucleotides to mask --span_length 20
--span_ratio Fraction of sequence to mask --span_ratio 0.1
--span_position Where to place the span: random or specific index --span_position 100
--span_id Which span token to use: random or 0-49 --span_id 0

Sampling Parameters

Parameter Description Recommended Range
--temperature Controls randomness. Lower = more deterministic, higher = more diverse 0.1 - 1.5
--top_k Only consider the top k most likely nucleotides at each position 10 - 100
--top_p Nucleus sampling — consider smallest set of nucleotides whose cumulative probability exceeds p 0.8 - 0.95

Example with all sampling parameters:

python /eva/tools/generate.py \
    --checkpoint /path/to/model \
    --format clm \
    --temperature 0.8 \
    --top_k 50 \
    --top_p 0.9 \
    --num_seqs 100 \
    --output /output/sampled.fa

Scoring

EVA achieves state-of-the-art zero-shot fitness prediction across both mRNA and ncRNA benchmarks, substantially outperforming existing RNA, codon, and DNA language models — including models with up to 40B parameters. This strong sequence-level understanding directly translates into high-quality generation: EVA's learned distribution faithfully captures the evolutionary constraints of natural RNA, enabling it to produce biologically realistic sequences out of the box.

Zero-shot mRNA Fitness Prediction Zero-shot ncRNA Fitness Prediction

Evaluate how well a given sequence fits the model's learned distribution by computing its log-likelihood. Higher (less negative) scores indicate more probable sequences.

RNA Mode

Score RNA sequences and compute per-sequence log-likelihood:

python /eva/tools/predict.py \
    --checkpoint /path/to/model \
    --input /input/sequences.fa \
    --output /output/scores.json

Supports --rna_type and --taxid conditioning, same as generation.

Protein Mode

Score protein sequences by reverse-translating them to RNA first:

python /eva/tools/predict.py \
    --checkpoint /path/to/model \
    --input /input/proteins.fa \
    --output /output/protein_scores.json \
    --mode protein \
    --codon_optimization first

--codon_optimization options: first (first codon in table) or most_frequent (most common codon for the species).


Directed Evolution

EVA supports in-silico directed evolution — an iterative optimization pipeline that improves a given RNA sequence through cycles of mutation, LLM-based fitness scoring, and structural stability evaluation. Starting from an existing RNA sequence, EVA applies point mutations at specified positions, scores each mutant by log-likelihood and Minimum Free Energy (MFE via LinearFold), and uses simulated annealing with dynamic beam search to balance exploration and exploitation. Over multiple rounds the temperature cools to converge on optimized sequences, and the top N candidates ranked by combined LLM + MFE score are returned.

circRNA Illustration mRNA Optimization Illustration
circRNA Optimization mRNA Optimization

Usage

python /eva/tools/directed_evolution.py \
    --checkpoint /path/to/model \
    --input /input/sequence.fa \
    --output /output/evolved.fa \
    --rna_type mRNA \
    --taxid 9606 \
    --iterations 10 \
    --mutations 5 \
    --beam_width 10 \
    --output_count 5

Key Parameters

Parameter Description Default
--iterations Number of evolution cycles 10
--mutations Total mutations per iteration 2
--beam_width Beam search width — keeps top candidates after each position 10
--output_count Number of final sequences to output 5
--T_init / --T_min Simulated annealing temperature range 1.0 → 0.01
--cooling_rate Temperature decay per iteration 0.95
--mutate_positions Specific positions to mutate (0-based, comma-separated) Random
--mutate_range Range of positions to mutate (e.g., 0-100) Entire sequence

Batch Processing with YAML

Define multiple tasks in a single YAML config file. The defaults section sets shared parameters, which individual tasks can override.

Generation Config Example

checkpoint: /path/to/model
output_dir: ./output

defaults:
  temperature: 1.0
  top_k: 50
  max_length: 8192
  batch_size: 1

tasks:
  - name: unconditional
    mode: generation
    format: clm
    num_seqs: 1000

  - name: human_mrna
    mode: generation
    format: clm
    rna_type: mRNA
    taxid: "9606"
    lineage: "D__Eukaryota;P__Chordata;C__Mammalia;O__Primates;F__Hominidae;G__Homo;S__Homo sapiens"
    num_seqs: 1000

  - name: glm_infill
    mode: generation
    format: glm
    input: ./input/seqs.fa
    span_ratio: 0.1
    num_seqs: 5

Scoring Config Example

checkpoint: /path/to/model
output_dir: ./scores

defaults:
  batch_size: 128

tasks:
  - name: score_basic
    mode: scoring
    input: ./input/seqs.fa

  - name: score_human_mrna
    mode: scoring
    input: ./input/seqs.fa
    rna_type: mRNA
    taxid: "9606"
    normalize: true
    exclude_special_tokens: true

  - name: score_protein
    mode: scoring
    input: ./input/proteins.fa
    scoring_mode: protein
    codon_optimization: first

Running

python /eva/tools/generate.py --config config.yaml              # Run all tasks
python /eva/tools/generate.py --config config.yaml --task name   # Run specific task
python /eva/tools/generate.py --config config.yaml --device cuda:1  # Override device

Input/Output Formats

Input — FASTA

>sequence_id_1
AUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCU
>sequence_id_2
AUGAAAAUGCGGCCGCAUUACGUAAACGGCCGCAAAUGUUUCCGGCAAA

Output — Generation (FASTA)

>unconditional_0
AUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCUAUGCGCU

With --output_details (GLM / Continuation):

>test_seq_sample0_forward_split50
PROMPT: AUGCGCUAUGCGCUAUGCG
GROUND_TRUTH: CU AUGCGCUAUGCG
GENERATED: CU AAUGCGCUAGCG
FULL_SEQ: AUGCGCUAUGCGCUAAUGCGCUAGCG

Output — Scoring (JSON)

{
  "scores": [
    {
      "header": "seq1",
      "sequence": "AUGGCCGUAGU...",
      "length": 67,
      "log_likelihood": -1.25
    }
  ]
}

Higher (less negative) log_likelihood = better sequence.

Output — Directed Evolution (FASTA)

Output is in FASTA format with scores in the header:

>candidate_1_score=-125.4321_best
AUGC... (evolved sequence)
>candidate_2_score=-128.1056
AUGC... (second best)

Data Availability

Some large files are not included in this repository due to size constraints. They can be downloaded from the sources listed below.

Model & Environment

Resource Description Size Source
Model Checkpoint EVA 1.4B (MoE) pre-trained weights HuggingFace
Docker Image (eva_latest.tar.gz) Pre-built EVA runtime environment Zenodo
Training Data OpenRNA v1 — 114M full-length RNA sequences HuggingFace

Experiment & Benchmark Data

Resource Description Size Source
mRNA Codon Optimization Optimized sequences from 9 vaccine targets (5 mRNA: linear, HIV, PR8, RABV, VZV; 4 circRNA: CAR-T, RABV-G, SARS-Group-I, SARS-T4) using EVA (homo/nolineage), CodonFM-1B, Evo2-1B, GemoRNA, and random baseline 26 KB Zenodo
mRNA 6 Species Generated & natural mRNA sequences for 6 species (Homo sapiens, Mus musculus, Rattus norvegicus, Cricetulus griseus, Drosophila melanogaster, Caenorhabditis elegans) with MFE, GC content, k-mer, loop-helix, pair metrics 38 MB Zenodo
GeneRRNA Generation EVA-generated rRNA sequences benchmarked against GeneRRNA database (100k natural, 1k sampled) with k-mer, loop-helix, pair, merged MFE-GC metrics 13 MB Zenodo
Generated RNA Sequences 15 RNA types (mRNA, lncRNA, circRNA, tRNA, rRNA, miRNA, piRNA, sRNA, snRNA, snoRNA, scaRNA, vault_RNA, Y_RNA, ribozyme, viral) generated by EVA with matched natural controls (5k each) and MFE, GC, k-mer, loop-helix, pair metrics 156 MB Zenodo

Citation

If you find EVA useful in your research, please cite:

@article{huang2026eva,
  title={EVA: A Generative Foundation Model for Universal RNA Modeling and Design},
  author={Huang, Yanjie and Lyu, Guangye and others},
  journal={TODO},
  year={2026},
  url={TODO}
}

License

This project is licensed under the Apache License 2.0. See LICENSE for details.

About

RNA-life creation is coming ~(EVA2)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors