---
title: "Tokenization Laws and DNA Models"
description: "Understanding k-mer tokenization and EVO-2 style nucleotide modeling"
author: "Parsa Idehpour"
date: "2025-09-14"
categories:
  - LLMs
  - biology
  - tokenization
---



## Overview

This post explores how tokenization choices shape model capacity and performance, and how those ideas translate to DNA models. We'll use k-mer tokenization on nucleotide sequences and discuss how EVO-2-style models work with nucleotide tokens.

- What are tokenization laws?
- How do k-mer vocabularies trade off context length vs. vocabulary size?
- How might EVO-2-like models represent nucleotides and long-range dependencies?




```{=html}
<div style="text-align:center;">
  <img src="Evo2_banner.png" alt="EVO-2 model diagram" width="60%"/>
  <p><em>Figure 1. EVO-2 model.</em></p>
</div>
```



## Tokenization laws in brief

Tokenization laws describe empirical tradeoffs between model size, context length, and tokenizer vocabulary. For fixed compute, larger vocabularies shrink sequence length but increase embedding/softmax cost; smaller vocabularies do the opposite. The optimal point depends on data distribution and task (e.g., code vs. natural language vs. DNA).
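
To make the tradeoff concrete, here is a minimal sketch of two rough cost proxies for a tokenizer choice. The function name `tokenizer_tradeoff`, the default `d_model=512`, and the use of squared sequence length as a stand-in for attention cost are all illustrative assumptions, not measurements from any real model.

```python
import math


def tokenizer_tradeoff(n_symbols: int, symbols_per_token: float,
                       vocab_size: int, d_model: int = 512) -> dict:
    """Rough, illustrative cost proxies for one tokenizer choice."""
    # More symbols per token means fewer tokens for the same raw input.
    seq_len = math.ceil(n_symbols / symbols_per_token)
    # Input embedding table plus an untied output projection: 2 * V * d.
    vocab_params = 2 * vocab_size * d_model
    # Full self-attention compares every token pair, so it grows as seq_len^2.
    attn_pairs = seq_len ** 2
    return {"seq_len": seq_len, "vocab_params": vocab_params, "attn_pairs": attn_pairs}


# Non-overlapping k-mers over {A,C,G,T}: vocab = 4**k, k symbols per token.
for k in (1, 3, 6):
    print(k, tokenizer_tradeoff(n_symbols=12_000, symbols_per_token=k, vocab_size=4 ** k))
```

Under these assumptions, moving from k=1 to k=6 shortens the sequence sixfold while the vocabulary-dependent parameter count grows by a factor of 1024, which is the tension the paragraph above describes.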



## DNA modeling with k-mers

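For DNA, a common tokenizer uses k-mers over the alphabet {A,C,G,T}. The vocabulary size is 4^k, and the stride determines overlap: stride 1 yields overlapping k-mers and roughly one token per base, while stride k yields non-overlapping k-mers and about k times fewer tokens. With non-overlapping tokens, larger k compresses sequences but grows the vocabulary exponentially; smaller k lengthens sequences but keeps the vocabulary small.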

- Example: k=3 (3-mers) ⇒ vocab size = 64
- Example: k=6 (6-mers) ⇒ vocab size = 4096
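
The token count for a given k and stride follows directly from sliding a window of width k in steps of `stride`. The sketch below uses a hypothetical helper name, `num_kmers`, and assumes that trailing bases which do not fill a complete window are dropped.

```python
def num_kmers(length: int, k: int, stride: int = 1) -> int:
    """Number of complete k-mer windows in a sequence of `length` bases."""
    if length < k:
        return 0
    # Window starts are 0, stride, 2*stride, ... up to length - k.
    return (length - k) // stride + 1


length = 13
for k, stride in [(3, 1), (3, 3), (6, 1), (6, 6)]:
    print(f"k={k} stride={stride}: vocab={4 ** k}, tokens={num_kmers(length, k, stride)}")
```

Note that with stride 1 the token count stays close to the number of bases regardless of k, so only larger strides actually shorten the sequence.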

We can quickly demonstrate 3-mer tokenization with stride 1.



```python
# Simple 3-mer tokenizer demo
from collections import Counter
from typing import List


def kmers(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into k-mers, skipping windows with non-ACGT symbols."""
    # Normalize case and map RNA uracil (U) to thymine (T).
    sequence = sequence.upper().replace("U", "T")
    tokens = []
    for i in range(0, len(sequence) - k + 1, stride):
        kmer = sequence[i:i+k]
        # Keep only windows made entirely of A, C, G, T (drops e.g. N).
        if set(kmer) <= {"A", "C", "G", "T"}:
            tokens.append(kmer)
    return tokens


seq = "ACGTACGTGACCT"
ks = kmers(seq, k=3, stride=1)
print("Sequence:", seq)
print("3-mers:", ks)
print("Unique 3-mers:", sorted(set(ks)))
print("Counts:", Counter(ks))
```
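
A model consumes integer ids rather than k-mer strings, so the next step is a vocabulary lookup. The following is a minimal sketch: the names `build_kmer_vocab` and `encode`, the lexicographic id ordering, and the `[UNK]` placeholder for windows containing ambiguous bases are assumptions for illustration, not the convention of any particular library.

```python
from itertools import product
from typing import Dict, List

UNK = "[UNK]"  # hypothetical placeholder for windows with non-ACGT symbols


def build_kmer_vocab(k: int) -> Dict[str, int]:
    """Enumerate all 4**k k-mers in lexicographic order, then append UNK."""
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vocab[UNK] = len(vocab)
    return vocab


def encode(sequence: str, vocab: Dict[str, int], k: int = 3, stride: int = 1) -> List[int]:
    """Map each k-mer window to its id, falling back to the UNK id."""
    sequence = sequence.upper()
    unk_id = vocab[UNK]
    return [vocab.get(sequence[i:i + k], unk_id)
            for i in range(0, len(sequence) - k + 1, stride)]


vocab = build_kmer_vocab(3)
ids = encode("ACGTAC", vocab)
print(len(vocab), ids)
```

Mapping unknown windows to a dedicated id keeps token positions aligned with the input sequence, whereas silently skipping them shortens the output; which behavior is preferable depends on the downstream task.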


## How EVO-2-style models use nucleotide tokens

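High-level idea:
- Convert sequences into discrete tokens. EVO-2 itself tokenizes at single-nucleotide resolution (one token per base, i.e. k=1); k-mer tokens (e.g., 3–6-mers) are the choice made by other DNA models such as DNABERT and Nucleotide Transformer.
- Train an autoregressive model with a next-token objective over these tokens to model genomic sequences. EVO-2 uses the StripedHyena 2 architecture, which mixes convolution-based operators with attention, rather than a plain transformer.
- Incorporate long-range context (EVO-2 supports up to about 1 million bases) using operators that scale better with sequence length than full attention.
- Optionally, as an extension beyond plain next-token training, multitask with masked objectives or structure-aware heads.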

This lets the model learn motifs, regulatory patterns, and long-range interactions directly from token sequences.
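
As a minimal sketch of the autoregressive setup described above, the snippet below uses one token per base (k=1) for simplicity and builds the shifted input/target pairs used for next-token prediction. The id assignment in `NUCLEOTIDE_IDS` and the helper names are hypothetical and are not taken from any released tokenizer.

```python
from typing import List, Tuple

# Hypothetical id assignment; real tokenizers may use different ids.
NUCLEOTIDE_IDS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_nucleotides(sequence: str) -> List[int]:
    """One token per base; raises KeyError on symbols outside A, C, G, T."""
    return [NUCLEOTIDE_IDS[base] for base in sequence.upper()]


def next_token_pairs(ids: List[int]) -> Tuple[List[int], List[int]]:
    """Inputs are all tokens but the last; targets are the same tokens shifted by one."""
    return ids[:-1], ids[1:]


ids = encode_nucleotides("ACGTAC")
inputs, targets = next_token_pairs(ids)
print("inputs: ", inputs)
print("targets:", targets)
```

At each position the model sees the prefix up to `inputs[i]` and is trained to predict `targets[i]`, so every base in the sequence (except the first) serves as a training signal.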
