
## 🧬 RNA Foundation Models: A New Era in Sequence Intelligence

RNA foundation models are large neural sequence models, most often transformer-based, trained on massive corpora of nucleotide sequences.
Rather than relying on small labeled datasets, they learn biological patterns from billions of RNA and DNA bases, allowing them to internalize:

1. Regulatory signals (promoters, enhancers, splice motifs)

2. Structural motifs (stems, loops, pairing patterns)

3. Evolutionary signatures across species

4. Biochemical properties encoded within sequence context

These deep, multi-scale representations make them useful general-purpose feature extractors that often perform well on downstream tasks even with limited labeled data.

## 🌟 Popular RNA/DNA Foundation Models

1. Nucleotide Transformer (InstaDeep): large-scale multi-species training

2. HyenaDNA / Caduceus: long-range modeling with subquadratic sequence operators (Hyena long convolutions and Mamba state-space layers, respectively) rather than standard attention

3. EvoRNA / Evo2-style models: structure-aware, evolutionary embeddings

4. DNABERT / DNABERT-2: lightweight transformers for short sequences (k-mer tokens in DNABERT, byte-pair encoding in DNABERT-2)

In many ways, these models function like “BERT for Biology”: they capture the grammar, syntax, and semantics of genomic code.

## 🧪 What We Can Do With RNA Foundation Models

These models can be fine-tuned for many sequence-level prediction tasks:

1. Splice Site Detection: Predict whether a genomic position is a donor/acceptor site. Useful for variant interpretation, transcript annotation, and identifying splicing defects.

2. Promoter & Enhancer Prediction: Detect regulatory regions driving transcription.

3. RNA Secondary Structure Prediction: Predict base-pairing and structural states of RNA molecules.

4. Gene Expression / Regulatory Strength Estimation: Infer promoter strength or translation efficiency from raw sequence.

5. Variant Effect Prediction: Determine whether mutations disrupt regulatory signals or structural motifs.

6. mRNA Stability / Degradation Signal Prediction: Useful for synthetic biology and mRNA vaccine design.

7. Long Non-coding RNA (lncRNA) Classification: Separate coding from non-coding transcripts.

8. RNA-binding Protein Site Prediction: Find motifs bound by proteins like AGO, TDP-43, or RBFOX.
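Several of these tasks, splice site detection in particular, start by turning a position of interest into a fixed-length sequence window that a model can classify. The sketch below shows one way to do that in plain Python. The function name `extract_window`, the `flank` parameter, and the choice to pad with `N` at sequence edges are assumptions made for illustration, not part of any library.

```python
def extract_window(sequence: str, position: int, flank: int = 10, pad: str = "N") -> str:
    """Return the window of length 2*flank + 1 centered on `position` (0-based).

    Positions that fall outside the sequence are filled with `pad`, so every
    window has the same length regardless of where the site sits.
    """
    if not 0 <= position < len(sequence):
        raise ValueError("position must index into sequence")
    start = position - flank
    end = position + flank + 1
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(sequence))
    return left_pad + sequence[max(0, start):min(len(sequence), end)] + right_pad


# Toy example: a window around the G of a "GT" dinucleotide at index 6.
toy = "AAACAGGTAAGTCC"
window = extract_window(toy, position=6, flank=3)
print(window)
```

Windows built this way all share one length, which keeps padding to a minimum once they are tokenized in batches.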

## 🔧 Why Fine-Tune Instead of Train From Scratch?

Fine-tuning is attractive because:

- A small labeled dataset is often enough.

- The model has already learned biological syntax (motifs, splice codes, k-mers).

- Training is faster and cheaper.

- Performance is often higher than that of models trained from scratch on the same data.

This matters most for tasks like splice-site detection, where classic models need heavy feature engineering. Foundation models skip that step and learn the signal from raw sequence.

## 🧬 Understanding RNA Foundation Model Files & How They Work

Modern RNA foundation models (like Nucleotide Transformer, HyenaDNA, DNABERT-2, etc.) are typically distributed through the Hugging Face Hub.
Each model comes with a few important files:

✔ config.json

Defines the model architecture:

- number of layers

- hidden size

- attention type

- tokenizer-related settings (e.g. vocabulary size)

- model type (decoder, encoder, etc.)
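Because `config.json` is plain JSON, it can be inspected without loading any weights. The sketch below writes a small stand-in config to a temporary directory and reads it back. The field names follow common Hugging Face conventions (`num_hidden_layers`, `hidden_size`, `vocab_size`), but every value here is invented for illustration and does not describe any particular checkpoint. `summarize_config` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# Stand-in config with made-up values; a real file ships next to the weights.
example_config = {
    "model_type": "esm",
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "vocab_size": 4107,
    "max_position_embeddings": 2048,
}


def summarize_config(path: Path) -> dict:
    """Read a config.json and pull out the fields that define model size."""
    config = json.loads(path.read_text())
    keys = ("model_type", "num_hidden_layers", "hidden_size", "vocab_size")
    return {key: config.get(key) for key in keys}


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example_config, indent=2))
    summary = summarize_config(config_path)

print(summary)
```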

✔ pytorch_model.bin / model.safetensors

These contain the pretrained weights: the actual learned parameters from massive DNA/RNA corpora.
This is the core of the foundation model.

✔ tokenizer.json or tokenizer files

Specifies:

- how sequences are split (character-level, or k-mers such as 3-mers and 6-mers)

- vocabulary size

- special tokens
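To make the difference between character-level and k-mer splitting concrete, here is a minimal sketch in plain Python. `kmer_tokenize` is a hypothetical helper, not a real tokenizer API. Emitting leftover bases as single-character tokens is one simple convention assumed here, and real tokenizers may handle remainders differently.

```python
def kmer_tokenize(sequence: str, k: int = 6, overlapping: bool = False) -> list[str]:
    """Split a nucleotide string into k-mers.

    overlapping=False steps by k (non-overlapping chunks; leftover bases
    become single-character tokens). overlapping=True slides by one base.
    """
    if overlapping:
        return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    cut = len(sequence) - len(sequence) % k
    tokens = [sequence[i:i + k] for i in range(0, cut, k)]
    return tokens + list(sequence[cut:])


seq = "ATGGCTACCTAGGT"  # 14 bases
char_tokens = list(seq)                                      # one token per base
chunk_tokens = kmer_tokenize(seq, k=6)                       # non-overlapping 6-mers
sliding_tokens = kmer_tokenize(seq, k=3, overlapping=True)   # sliding 3-mers

print(len(char_tokens), char_tokens[:4])
print(len(chunk_tokens), chunk_tokens)
print(len(sliding_tokens), sliding_tokens[:4])
```

The token counts differ sharply for the same 14 bases, which is why a limit stated in tokens corresponds to very different sequence lengths depending on the tokenizer.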

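✔ modeling_*.py (if using trust_remote_code=True)

Custom model architectures (e.g., Hyena, Performer, S4, Nucleotide Transformer variants). Because this flag runs Python code from the model repository on your machine, use it only with sources you trust.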

When you load the model, Hugging Face automatically pulls these files and reconstructs the full architecture.

## 🧬 How We Load an RNA Foundation Model

Example using Nucleotide Transformer 500M:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    trust_remote_code=True
)


## What happens internally?

1. Hugging Face reads config.json.

2. It loads the tokenizer (k-mer or raw-token based).

3. It reconstructs the model architecture.

4. It loads the pretrained weights into the model.

5. It adds a classification head if specified (num_labels=2); this head is newly initialized and has to be trained.

This is the same workflow as loading a pretrained BERT model, just specialized for biology.

## 🧪 Full Fine-Tuning Pipeline (Step-by-Step)

Below is the full structure of a clean RNA fine-tuning pipeline.
Each step can be pasted into its own code cell in a Kaggle notebook.

1. Install Dependencies

!pip install transformers datasets accelerate einops

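2. Load a Dataset (Example: Human Splice Sites)

from datasets import load_dataset

# Dataset ID as given in this notebook; confirm on the Hub that it exists and
# exposes "sequence" and "label" columns plus "train" and "test" splits.
dataset = load_dataset("kentnf/splice_sites_human")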

3. Load Tokenizer + Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    trust_remote_code=True
)


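4. Preprocessing

def preprocess(batch):
    # Tokenize a batch of sequences, padding or truncating to 512 tokens
    return tokenizer(
        batch["sequence"],
        truncation=True,
        padding="max_length",
        max_length=512
    )

# Apply tokenization to every split
tokenized = dataset.map(preprocess, batched=True)
# The model's forward pass takes the target under the name "labels"
tokenized = tokenized.rename_column("label", "labels")
# Return PyTorch tensors for the columns the model consumes
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])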
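Tokenizers trained on DNA generally expect an uppercase A/C/G/T/N alphabet, so RNA input containing `U`, lowercase letters, or ambiguity codes is worth normalizing before the tokenization step. A minimal sketch, assuming the dataset column is called `sequence` as in the preprocessing code; `normalize_sequence` and `normalize_batch` are hypothetical helper names, and masking unknown symbols as `N` is an assumed convention:

```python
import re

_NON_ACGTN = re.compile(r"[^ACGTN]")


def normalize_sequence(sequence: str) -> str:
    """Strip whitespace, uppercase, map RNA U to DNA T, mask other symbols as N."""
    cleaned = "".join(sequence.split()).upper().replace("U", "T")
    return _NON_ACGTN.sub("N", cleaned)


def normalize_batch(batch: dict) -> dict:
    """Batched variant shaped for use with a batched map over a dataset."""
    return {"sequence": [normalize_sequence(s) for s in batch["sequence"]]}


print(normalize_sequence("augg cuR\n"))
```

With the `datasets` library this could be applied as `dataset.map(normalize_batch, batched=True)` before calling `preprocess`.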

5. Training Arguments

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="rna_splicesite",
    eval_strategy="epoch",  # named evaluation_strategy in transformers < 4.41
    save_strategy="epoch",  # must match eval_strategy for load_best_model_at_end
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    load_best_model_at_end=True,
)
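It can help to estimate how many optimizer updates these settings imply before launching a run. The arithmetic below is a rough sketch: `optimizer_steps` is a hypothetical helper, the 10,000-example training set is an invented figure, and the `Trainer`'s own accounting of partial gradient-accumulation windows may differ by a step or so.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch_size: int, num_epochs: int,
                    num_devices: int = 1, grad_accum_steps: int = 1) -> int:
    """Approximate total optimizer updates (last partial batch of each epoch kept)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return updates_per_epoch * num_epochs


# Hypothetical training set of 10,000 windows, batch size 8, 3 epochs.
total = optimizer_steps(num_examples=10_000, per_device_batch_size=8, num_epochs=3)
print(total)
```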

6. Trainer Setup

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
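The `Trainer` above is built without a metric function, so `trainer.evaluate()` reports the evaluation loss but no accuracy or F1. One option is to pass a `compute_metrics` callable. The sketch below computes binary metrics in plain Python, assuming label 1 is the positive (splice site) class. It relies only on indexing and iteration, so the logits may arrive as lists or NumPy arrays.

```python
def compute_metrics(eval_pred):
    """Accuracy, precision, recall and F1 for binary labels (positive class = 1).

    `eval_pred` unpacks to (logits, labels); each row of logits holds one
    score per class.
    """
    logits, labels = eval_pred
    # Argmax over each row of class scores
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(y) for y in labels]

    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "precision": precision,
            "recall": recall, "f1": f1}


# Toy check with hand-written logits: 4 examples, 3 predicted correctly.
toy_logits = [[2.0, 0.1], [0.2, 1.5], [0.9, 0.3], [0.1, 0.8]]
toy_labels = [0, 1, 1, 1]
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

To use it, add `compute_metrics=compute_metrics` to the `Trainer(...)` call. For splice sites, where positives are usually rare, precision and recall tend to be more informative than accuracy alone.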


7. Train

trainer.train()


8. Evaluate

metrics = trainer.evaluate()
print(metrics)

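9. Inference Example

import torch

# Nucleotide Transformer's vocabulary uses the DNA alphabet, so map RNA "U" to "T"
example = "AUGGCUACCUAGGUGAUGGUUUCAUUGGAUGC".replace("U", "T")
inputs = tokenizer(example, return_tensors="pt")
# Move inputs to the device the fine-tuned model lives on
inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}

model.eval()  # disable dropout for deterministic predictions
with torch.no_grad():
    logits = model(**inputs).logits
    pred = torch.argmax(logits, dim=-1).item()

print("Prediction:", pred)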
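The argmax gives a hard class label, but a probability is often more useful, for example to rank candidate sites or to set a decision threshold. Logits become probabilities through a softmax. The sketch below implements it in plain Python on invented logits; the `[non-site, splice-site]` ordering is an assumption about how the labels were encoded.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sequence, ordered [non-site, splice-site].
probs = softmax([-0.4, 1.7])
print([round(p, 3) for p in probs])
```

On the tensor returned by the model, `torch.softmax(logits, dim=-1)` performs the same computation.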

## 🧬 Why This Pipeline Works Well

Foundation models already “understand”:

- Splice donor/acceptor motifs

- K-mer statistics

- Evolutionary patterns

- RNA structure hints

- Regulatory sequences

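Fine-tuning starts from these pretrained weights instead of a random initialization. By default the pipeline above updates all layers; freezing the lower layers and training only the top layers and classification head is an optional way to cut cost further. In practice:

- Training is fast

- You need less labeled data

- Accuracy is often high

- Overfitting risk is lower, especially when part of the backbone is frozen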
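To restrict training to the classification head, gradients for the backbone can be switched off before calling `trainer.train()`; without a step like this, `Trainer` updates every parameter. A minimal sketch, assuming the head's parameter names start with `classifier` (common for Hugging Face sequence-classification models, but worth confirming by printing the names from `model.named_parameters()`). `freeze_backbone` is a hypothetical helper, and the toy class and parameter names below exist only to make the example runnable without PyTorch.

```python
from types import SimpleNamespace


def freeze_backbone(model, head_prefixes=("classifier",)):
    """Disable gradients for every parameter outside the listed head prefixes.

    Works with any object whose named_parameters() yields (name, parameter)
    pairs, as PyTorch modules do. Returns the names left trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        keep = name.startswith(tuple(head_prefixes))
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    return trainable


class ToyModel:
    """Stand-in with the same named_parameters() shape as a torch.nn.Module."""

    def __init__(self):
        self.params = {
            "esm.encoder.layer.0.weight": SimpleNamespace(requires_grad=True),
            "classifier.dense.weight": SimpleNamespace(requires_grad=True),
        }

    def named_parameters(self):
        return self.params.items()


toy_model = ToyModel()
still_trainable = freeze_backbone(toy_model)
print(still_trainable)
```

Freezing trades some accuracy for speed and memory; a common middle ground is to also leave the last few encoder layers trainable by adding their prefixes to `head_prefixes`.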

## 📘 Summary of Popular DNA/RNA Foundation Models

| Model Name | Architecture | Max Input Length | Tokenization | Parameters (approx.) | Strengths | Good For |
| ---------- | ------------ | ---------------- | ------------ | -------------------- | --------- | -------- |
| **Nucleotide Transformer (v1 / v2)** | Transformer Encoder | 1,000 tokens (v1) to 2,048 tokens (v2), roughly 6–12 kb | k-mer (6-mer) | 50M–500M (v2); up to 2.5B (v1) | Strong biological representations, trained on a large multi-species dataset | Splice site detection, promoters, enhancers, variant effect prediction |
| **DNABERT / DNABERT-2** | BERT Encoder | 512 tokens (DNABERT) | overlapping k-mer, 3–6 (DNABERT); BPE (DNABERT-2) | ~90M–120M | Lightweight, easy to fine-tune, well suited to short sequences | Binary sequence classification, motif detection |
| **Caduceus** | Bidirectional Mamba (state-space), reverse-complement aware | up to ~131k bp | character-level | under 10M | Handles long sequences efficiently with small models | Long-range regulatory prediction, gene body tasks |
| **Evo2 / EvoRNA / ESM3-like BioLMs** | Transformer/Hybrid (Evo 2: StripedHyena 2) | varies by model (Evo 2: up to ~1M bp) | varies by model (Evo 2: single-nucleotide) | varies by model (Evo 2: 1B–40B) | Learns structural + evolutionary signals | RNA structure prediction, RBP binding, folding tasks |
| **Enformer** | Transformer + Conv (DeepMind) | 196,608 bp | one-hot nucleotides | ~250M | Strong long-range gene regulation prediction (trained with supervision) | Expression prediction, promoter-enhancer interactions |
| **HyenaDNA (Stanford)** | Hyena Operator (long convolutions) | 1k–1M bp | character-level | under 10M | Scales to extremely long genomic sequences | Long-range dependency tasks |
| **GenSLM** | Transformer (decoder) | 2,048–10,240 tokens | codon-level | 25M–25B | Built for SARS-CoV-2 / viral genomics | Viral lineage classification, mutation impact |
| **GenomeGPT (6B)** | Decoder-only LLM | 2,048 | character-level | 6B | Generative and predictive modeling | Synthetic DNA/RNA generation, motif discovery |


## 📝 How to Use This Table

If your sequences are short (<512 bp) → DNABERT or NT-500M are good starting points.

If you need long-range biological context (introns, regulatory windows) → HyenaDNA or Enformer.

For RNA structure or interaction prediction → EvoRNA / ESM-like models.

For general-purpose high accuracy → NT-2.5B, or a long-context model such as HyenaDNA when distant context matters.