# CodonScope: Multi-Species Codon Usage Analysis

Quantify translational selection, tRNA adaptation, and codon-level expression bias in any gene list.

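CodonScope runs the analyses listed below. The numbers in this table follow the report's section order. Code cells and output file names use the package's internal mode numbers instead (`mode1` composition, `mode2` demand, `mode3` optimality profile, `mode4` collision, `mode5` attribution, `mode6` cross-species), so "Mode 3" in a demo description corresponds to analysis #4 here.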

| # | Analysis | What it does |
|---|----------|--------------|
| 0 | **Codon Adaptation Index (CAI)** | Classic codon optimality metric (Sharp & Li 1987). Shown in gene summary at top. |
| 1 | **Codon Enrichment** | Compares single-codon frequencies in your genes vs the genome (default + matched + binomial variants, whole gene + ramp + body). Waterfall charts colored by wobble decoding. |
| 2 | **Dicodon Enrichment** | Same for adjacent codon pairs (3,721 dicodons). Default + matched backgrounds, whole gene + ramp + body. |
| 3 | **AA vs Synonymous Attribution** | Separates amino acid composition effects from synonymous codon choice (RSCU). Classifies drivers as tRNA supply, GC3 bias, or wobble avoidance. |
| 4 | **Weighted tRNA Adaptation Index** | Plots per-position codon optimality (tAI/wtAI) along transcripts. Detects 5' ramp regions. |
| 5 | **Collision Potential** | Counts fast-to-slow (FS) codon transitions that may cause ribosome collisions. |
| 6 | **Translational Demand** | Weights codon usage by expression level (TPM). Shows which codons are under translational selection pressure. |
| — | **Cross-Species Comparison** | Correlates RSCU between ortholog pairs across species. Identifies genes with divergent codon preferences. |
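
As a rough illustration of the CAI row above: the index is the geometric mean of per-codon relative-adaptiveness weights (Sharp & Li 1987). The sketch below is not CodonScope's implementation. The `cai` helper and the `toy_weights` values are hypothetical and exist only to show the arithmetic.

```python
import math

def cai(cds: str, weights: dict[str, float]) -> float:
    """Geometric mean of relative-adaptiveness weights over the codons of a CDS.

    Codons missing from `weights` (for example stop codons) are skipped.
    `weights` is a hypothetical input mapping codon -> w, with 0 < w <= 1.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))

# Illustrative weights only; not taken from any reference gene set.
toy_weights = {"AGA": 1.0, "AGG": 0.25, "GAA": 1.0, "GAG": 0.5}
score = cai("AGAGAAAGGGAG", toy_weights)  # codons: AGA, GAA, AGG, GAG
print(f"toy CAI = {score:.3f}")
```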

**How to use this notebook:**
1. Run cells 1-2 to install CodonScope and download reference data (first run takes ~5 min)
2. Paste your gene list in cell 3
3. Run cells 4-5 to see results

## 1. Install CodonScope

In [None]:
#@title Install CodonScope and dependencies { display-mode: "form" }

import subprocess, sys, os

# ── Configuration ────────────────────────────────────────────────────────────
GITHUB_REPO = "https://github.com/Meier-Lab-NCI/codonscope.git"
# ──────────────────────────────────────────────────────────────────────────────

def run(cmd):
    """Run a shell command and print output."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.stdout.strip():
        print(result.stdout.strip())
    if result.returncode != 0 and result.stderr.strip():
        print(result.stderr.strip())
    return result.returncode

# Install dependencies
print("Installing dependencies...")
run(f"{sys.executable} -m pip install -q numpy scipy pandas matplotlib requests tqdm statsmodels")

# Install CodonScope from GitHub
print(f"\nInstalling CodonScope from GitHub...")
run(f"{sys.executable} -m pip install -q git+{GITHUB_REPO}")

# Enable ipywidgets in Colab (required for Textarea rendering)
try:
    from google.colab import output
    output.enable_custom_widget_manager()
    print("Enabled Colab custom widget manager for interactive inputs")
except Exception:
    pass

# Verify installation
try:
    import codonscope
    print(f"\n✓ CodonScope v{codonscope.__version__} installed successfully")
except ImportError:
    print("\n✗ Installation failed. Check the GitHub URL above.")

## 2. Download Reference Data

Downloads CDS sequences, tRNA copy numbers, wobble rules, expression data, and pre-computed backgrounds for the selected species.

If Google Drive is mounted, data is cached there so you only download once across sessions. Otherwise data is stored in the Colab runtime (re-downloaded each session).

In [None]:
#@title Download species data (cached to Google Drive) { display-mode: "form" }

import os, shutil
from pathlib import Path

# ── Which species to download ─────────────────────────────────────────────────
download_human = True  #@param {type:"boolean"}
download_yeast = True  #@param {type:"boolean"}
download_mouse = False  #@param {type:"boolean"}

# Also download orthologs for cross-species comparison (Mode 6)?
DOWNLOAD_ORTHOLOGS = False  #@param {type:"boolean"}
# ──────────────────────────────────────────────────────────────────────────────

SPECIES = []
if download_human:
    SPECIES.append("human")
if download_yeast:
    SPECIES.append("yeast")
if download_mouse:
    SPECIES.append("mouse")

if not SPECIES:
    raise ValueError("Select at least one species to download.")

# Try to mount Google Drive for persistent caching
USE_DRIVE = False
DRIVE_DATA_DIR = Path("/content/drive/MyDrive/codonscope_data")
LOCAL_DATA_DIR = Path.home() / ".codonscope" / "data"

try:
    from google.colab import drive
    if not os.path.exists("/content/drive/MyDrive"):
        print("Mounting Google Drive for persistent data caching...")
        drive.mount("/content/drive")
    USE_DRIVE = True
    print("✓ Google Drive mounted — data will be cached between sessions")
except Exception as e:  # also covers ImportError when not running in Colab
    print(f"Google Drive not available ({e}). Data will be stored locally.")

if USE_DRIVE:
    # Symlink so codonscope reads from Drive
    DRIVE_DATA_DIR.mkdir(parents=True, exist_ok=True)
    LOCAL_DATA_DIR.parent.mkdir(parents=True, exist_ok=True)
    if LOCAL_DATA_DIR.is_symlink():
        LOCAL_DATA_DIR.unlink()
    elif LOCAL_DATA_DIR.exists():
        shutil.rmtree(LOCAL_DATA_DIR)
    LOCAL_DATA_DIR.symlink_to(DRIVE_DATA_DIR)
    print(f"  Data directory: {DRIVE_DATA_DIR}")

# Download each species (skips if data already exists)
from codonscope.data.download import download, download_orthologs

for sp in SPECIES:
    species_dir = LOCAL_DATA_DIR / "species" / sp
    marker = species_dir / "background_mono.npz"
    if marker.exists():
        n_files = len(list(species_dir.glob("*")))
        print(f"\n✓ {sp} data already cached ({n_files} files). Skipping download.")
    else:
        print(f"\nDownloading {sp} data (this may take a few minutes)...")
        download(sp)
        print(f"✓ {sp} data downloaded")

# Optionally download orthologs
if DOWNLOAD_ORTHOLOGS and len(SPECIES) >= 2:
    ortho_dir = LOCAL_DATA_DIR / "orthologs"
    pairs = []
    if "human" in SPECIES and "yeast" in SPECIES:
        pairs.append(("human", "yeast"))
    if "human" in SPECIES and "mouse" in SPECIES:
        pairs.append(("human", "mouse"))
    if "mouse" in SPECIES and "yeast" in SPECIES:
        pairs.append(("mouse", "yeast"))
    for s1, s2 in pairs:
        ortho_file = ortho_dir / f"{s1}_{s2}.tsv"
        if ortho_file.exists():
            print(f"\n✓ {s1}-{s2} orthologs already cached. Skipping.")
        else:
            print(f"\nDownloading {s1}-{s2} orthologs...")
            download_orthologs(s1, s2)
            print(f"✓ {s1}-{s2} orthologs downloaded")

print("\n" + "="*50)
print("All data ready. Proceed to enter your gene list.")

## 3. Enter Your Gene List

**To analyze your own genes:** Select `Custom` from the dropdown, set your species, and paste gene IDs in the text box (one per line, or comma/tab-separated). Lines starting with `#` are ignored.

**To try a demo first:** Pick one of the three example gene sets. These have well-characterized codon biases so you know what to expect.

### What is the background?

CodonScope compares your gene list against **all protein-coding genes in the genome** for the selected species:
- **Yeast**: 6,685 verified ORFs from SGD (mitochondrial excluded)
- **Human**: 19,229 CDS from MANE Select v1.5 (one canonical transcript per gene)
- **Mouse**: 21,556 CDS from Ensembl GRCm39 (one canonical transcript per gene)

This is a whole-genome background. For Codon and Dicodon Enrichment (Analyses #1 & #2), you can also switch to a `matched` background that controls for CDS length and GC content (useful if your gene set has unusual nucleotide composition).
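
One way to picture a `matched` background: for each gene in your list, pick a background gene with a similar CDS length and GC content. The sketch below illustrates that general idea only and is an assumption, not CodonScope's actual matching procedure. The helper names, the distance measure, the `gc_scale` value, and the toy genes are all hypothetical.

```python
import math

def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def pick_matched(targets, background, gc_scale=0.05):
    """Greedy nearest-neighbour matching on (CDS length, GC fraction).

    targets / background: dict of gene_id -> (cds_length, gc_fraction).
    Distance combines the log length ratio with the GC difference divided
    by `gc_scale` (an arbitrary scaling chosen for this sketch).
    """
    used, matched = set(), {}
    for gene_id, (length, gc) in targets.items():
        candidates = [b for b in background if b not in used and b not in targets]
        if not candidates:
            break
        best = min(
            candidates,
            key=lambda b: math.hypot(
                math.log(background[b][0] / length),
                (background[b][1] - gc) / gc_scale,
            ),
        )
        used.add(best)
        matched[gene_id] = best
    return matched

# Toy values, for illustration only.
targets = {"t1": (300, 0.60)}
background = {"b1": (310, 0.59), "b2": (3000, 0.60), "b3": (300, 0.40)}
matched = pick_matched(targets, background)
print(matched)
```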

### Custom expression for Translational Demand (Analysis #6)

Translational Demand weights codon frequencies by gene expression. By default it uses:
- **Yeast**: hardcoded rich-media TPM estimates
- **Human**: hardcoded HEK293T TPM estimates
- **Mouse**: hardcoded TPM estimates

**To use your own RNA-seq TPM values**, upload a tab-separated file with two columns, gene ID and TPM. A header row is optional, and `\t` in the example stands for a tab character:
```
gene_id\ttpm
TP53\t12.5
BRCA1\t8.3
RPL11\t1500.0
```
Gene IDs should match what you use in your gene list (HGNC symbols, Ensembl IDs, etc.). Genes not in your file get the genome median TPM. Upload your file using the file browser on the left sidebar, then set the path in the Translational Demand cell below.
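
A minimal sketch for writing a file in this format with the standard library, so that the columns are separated by real tab characters. The helper names and the file name `my_expression.tsv` are placeholders. The TPM values are the ones from the example above.

```python
import csv
import tempfile
from pathlib import Path

def write_expression_tsv(path, tpm_by_gene, header=True):
    """Write gene_id<TAB>tpm rows, optionally preceded by a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        if header:
            writer.writerow(["gene_id", "tpm"])
        for gene_id, tpm in tpm_by_gene.items():
            writer.writerow([gene_id, tpm])

def read_expression_tsv(path):
    """Read the file back, skipping rows whose second field is not numeric."""
    values = {}
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) != 2:
                continue
            try:
                values[row[0]] = float(row[1])
            except ValueError:
                continue  # header row or malformed line
    return values

# Written to a temporary directory here; in Colab you would pick your own path.
out = Path(tempfile.mkdtemp()) / "my_expression.tsv"
write_expression_tsv(out, {"TP53": 12.5, "BRCA1": 8.3, "RPL11": 1500.0})
print(read_expression_tsv(out))
```

Reading the file back before running the analysis is a cheap way to catch space-separated columns or non-numeric TPM values.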

### Accepted gene ID types

| Species | Accepted IDs | Examples |
|---------|-------------|----------|
| **Yeast** | Common names, systematic names, SGD IDs, UniProt | `ACT1`, `YFL039C`, `SGD:S000001855`, `P60010` |
| **Human** | HGNC symbols, Ensembl (ENSG/ENST), Entrez, RefSeq, UniProt | `TP53`, `ENSG00000141510`, `7157`, `NM_000546` |
| **Mouse** | MGI symbols, Ensembl (ENSMUSG/ENSMUST), MGI IDs, Entrez, UniProt | `Actb`, `ENSMUSG00000029580`, `MGI:87904` |

You can mix ID types in the same list.

In [None]:
#@title Select gene list and species { display-mode: "form" }

# ── Gene list source ──────────────────────────────────────────────────────────
# "Custom" = paste your own genes below.  Or pick a demo to see example output.
GENE_LIST = "Custom \u2014 paste your own genes"  #@param ["Custom \u2014 paste your own genes", "Demo: Yeast Trm9 targets (50 genes, Begley 2007) \u2014 strong AGA/GAA codon bias", "Demo: Human ribosomal proteins (80 genes, dos Reis 2004) \u2014 translational selection", "Demo: Mouse ribosomal proteins (80 genes) \u2014 mammalian codon bias"]

# ── Species (required for Custom; auto-set for demos) ─────────────────────────
SPECIES_NAME = "yeast"  #@param ["yeast", "human", "mouse"]

# ── Optional: cross-species comparison (Mode 6) ──────────────────────────────
SPECIES2 = ""  #@param ["", "yeast", "human", "mouse"]

# ── Human tissue for Mode 2 (leave blank for cross-tissue median) ─────────────
TISSUE = ""  #@param ["", "Liver", "Lung", "Brain - Cortex", "Brain - Cerebellum", "Heart - Left Ventricle", "Kidney - Cortex", "Muscle - Skeletal", "Pancreas", "Whole Blood", "Breast - Mammary Tissue", "Colon - Sigmoid", "Thyroid", "Testis", "Ovary", "Prostate"]
CELL_LINE = ""  #@param ["", "HEK293T", "HeLa", "K562", "A549", "MCF7", "U2OS", "HepG2", "SH-SY5Y"]
# ──────────────────────────────────────────────────────────────────────────────

import ipywidgets as widgets
from IPython.display import display as ipy_display

# ── Demo gene sets ───────────────────────────────────────────────────────────
_DEMOS = {
    "Demo: Yeast Trm9 targets (50 genes, Begley 2007) \u2014 strong AGA/GAA codon bias": {
        "species": "yeast",
        "info": "Top 50 yeast genes enriched for AGA (Arg) and GAA (Glu) codons, decoded by Trm9-modified tRNA (mcm5s2U). Includes ribosomal proteins, glycolytic enzymes, and histones. Expect strong enrichment for AGA and GAA codons in Mode 1, high optimality in Mode 3, and synonymous-driven signal in Mode 5.",
        "genes": """RPL3
RPL8A
RPL8B
RPL10
RPL18A
RPL22A
RPL28
RPL30
RPL33B
RPP0
RPS1A
RPS1B
RPS10A
RPS10B
RPS20
RPS26A
RPS26B
RPS31
NHP2
TDH3
TDH2
TDH1
ENO1
ENO2
CDC19
GPM1
TPI1
PDC1
PDC5
TEF1
TEF2
HYP2
ANB1
ADH1
HTA2
HTB1
HHT1
HHT2
TSA1
FPR2
CIS3
TOM22
TAL1
SEC53
YRB1
BUD28
MF(ALPHA)1
YDL228C
YPL142C
YPL197C""",
    },
    "Demo: Human ribosomal proteins (80 genes, dos Reis 2004) \u2014 translational selection": {
        "species": "human",
        "info": "80 core cytoplasmic ribosomal proteins. These are among the most highly expressed genes in any human cell. Expect C/G-ending codon enrichment (Mode 1), high translational demand (Mode 2), high optimality with a visible 5' ramp (Mode 3), and a mix of GC3 bias and translational selection in Mode 5.",
        "genes": """RPL3
RPL4
RPL5
RPL6
RPL7
RPL7A
RPL8
RPL9
RPL10
RPL10A
RPL11
RPL12
RPL13
RPL13A
RPL14
RPL15
RPL17
RPL18
RPL18A
RPL19
RPL21
RPL22
RPL23
RPL23A
RPL24
RPL26
RPL27
RPL27A
RPL28
RPL29
RPL30
RPL31
RPL32
RPL34
RPL35
RPL35A
RPL36
RPL36A
RPL37
RPL37A
RPL38
RPL39
RPL41
RPLP0
RPLP1
RPLP2
UBA52
RPSA
RPS2
RPS3
RPS3A
RPS4X
RPS5
RPS6
RPS7
RPS8
RPS9
RPS10
RPS11
RPS12
RPS13
RPS14
RPS15
RPS15A
RPS16
RPS17
RPS18
RPS19
RPS20
RPS21
RPS23
RPS24
RPS25
RPS26
RPS27
RPS27A
RPS28
RPS29
FAU
RACK1""",
    },
    "Demo: Mouse ribosomal proteins (80 genes) \u2014 mammalian codon bias": {
        "species": "mouse",
        "info": "80 core cytoplasmic ribosomal proteins in mouse. Similar codon preferences to human (closely related mammalian tRNA pools). Good for verifying that mouse data downloads work correctly.",
        "genes": """Rpl3
Rpl4
Rpl5
Rpl6
Rpl7
Rpl7a
Rpl8
Rpl9
Rpl10
Rpl10a
Rpl11
Rpl12
Rpl13
Rpl13a
Rpl14
Rpl15
Rpl17
Rpl18
Rpl18a
Rpl19
Rpl21
Rpl22
Rpl23
Rpl23a
Rpl24
Rpl26
Rpl27
Rpl27a
Rpl28
Rpl29
Rpl30
Rpl31
Rpl32
Rpl34
Rpl35
Rpl35a
Rpl36
Rpl36a
Rpl37
Rpl37a
Rpl38
Rpl39
Rpl41
Rps2
Rps3
Rps3a1
Rps4x
Rps5
Rps6
Rps7
Rps8
Rps9
Rps10
Rps11
Rps12
Rps13
Rps14
Rps15
Rps15a
Rps16
Rps17
Rps18
Rps19
Rps20
Rps21
Rps23
Rps24
Rps25
Rps26
Rps27
Rps27a
Rps28
Rps29
Rpsa
Rplp0
Rplp1
Rplp2
Uba52
Fau
Rps4y2""",
    },
}
# ──────────────────────────────────────────────────────────────────────────────

# Handle demo vs custom selection
_is_custom = GENE_LIST.startswith("Custom")
if not _is_custom and GENE_LIST in _DEMOS:
    _demo = _DEMOS[GENE_LIST]
    SPECIES_NAME = _demo["species"]
    _default_genes = _demo["genes"].strip()
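    # Build the label first: backslash escapes inside f-string expressions
    # are a SyntaxError before Python 3.12.
    _demo_label = GENE_LIST.split(": ", 1)[1].split(" \u2014")[0]
    print(f"Demo: {_demo_label}")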
    print(f"  Species: {SPECIES_NAME}")
    print(f"  {_demo['info']}")
else:
    _default_genes = """# Paste your gene IDs here, one per line.
# Lines starting with # are ignored.
# You can also use comma or tab-separated format.
#
# Examples (delete these and paste your own):
# ACT1
# RPL3
# TDH3"""
    print(f"Custom gene list mode.")
    print(f"  Species: {SPECIES_NAME}")
    print(f"  Paste your gene IDs in the text box below (one per line).")
    print(f"  Background: all {SPECIES_NAME} protein-coding genes.")

gene_input = widgets.Textarea(
    value=_default_genes,
    placeholder="Paste gene IDs here, one per line",
    description="",
    layout=widgets.Layout(width="500px", height="300px"),
)

print(f"\nGene list:")
ipy_display(gene_input)
print("\n\u2191 Edit the box above, then run the next cell to confirm.")

In [None]:
#@title Confirm gene list { display-mode: "form" }

import re

def parse_genes(text):
    """Parse gene list from text: one per line, comma-separated, or mixed."""
    genes = []
    for line in text.strip().split("\n"):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = re.split(r"[,\t]+", line)
        for part in parts:
            part = part.strip()
            if part:
                genes.append(part)
    return genes

gene_ids = parse_genes(gene_input.value)
species2 = SPECIES2 if SPECIES2 else None
tissue = TISSUE if TISSUE else None
cell_line = CELL_LINE if CELL_LINE else None

print(f"Species: {SPECIES_NAME}")
print(f"Gene list: {len(gene_ids)} genes")
if len(gene_ids) <= 20:
    print(f"  {', '.join(gene_ids)}")
else:
    print(f"  {', '.join(gene_ids[:10])}, ... ({len(gene_ids)-10} more)")
if species2:
    print(f"Cross-species comparison: {SPECIES_NAME} vs {species2}")
if tissue:
    print(f"Tissue: {tissue}")
if cell_line:
    print(f"Cell line: {cell_line}")
if len(gene_ids) < 10:
    print("\n⚠ Warning: fewer than 10 genes. Results may lack statistical power.")

## 4. Generate Full Report

Runs all applicable analyses and produces a self-contained HTML report with embedded plots.

The report includes numbered analysis sections:

- **Gene Summary + CAI** — mapped/unmapped genes, gene name mapping table, CDS statistics, CAI distribution
- **1. Codon Enrichment** — Default + matched + binomial (if selected) backgrounds, ramp vs body regions, waterfall charts colored by wobble decoding, positionally biased codons table
- **2. Dicodon Enrichment** — Default + matched backgrounds, ramp vs body regions for adjacent codon pairs
- **3. AA vs Synonymous Attribution** — Attribution table (AA-driven vs synonymous codon choice)
- **4. Weighted tRNA Adaptation Index** — Metagene optimality profile with ramp analysis
- **5. Collision Potential** — Ribosome collision analysis (FS transitions) with ranked dicodon bar chart
- **6. Translational Demand** — Expression-weighted codon demand
- **Cross-Species Comparison** — RSCU correlation between orthologs (if second species selected)

After the report generates, **run the download cell** to get a zip file containing:
- The full HTML report (viewable in any browser)
- README.txt documenting all analysis parameters, your gene list, and data sources
- Complete TSV data tables for every analysis (not truncated — use these for custom plots in R, Python, etc.)
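
Once downloaded, the tables can be loaded straight from the zip. The sketch below uses only the standard library and assumes the tables are stored as `.tsv` members of the archive. `load_tsv_tables` is a hypothetical helper, and the in-memory archive with a made-up `data/example.tsv` only stands in for a real results zip.

```python
import csv
import io
import zipfile

def load_tsv_tables(zip_file):
    """Return {member_name: list of row dicts} for every .tsv member of a zip."""
    tables = {}
    with zipfile.ZipFile(zip_file) as zf:
        for name in zf.namelist():
            if not name.endswith(".tsv"):
                continue
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.DictReader(text, delimiter="\t"))
    return tables

# Stand-in archive so the sketch runs without a real report; contents are made up.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data/example.tsv", "kmer\tz_score\nAGA\t1.0\n")
    zf.writestr("README.txt", "placeholder")
buf.seek(0)

tables = load_tsv_tables(buf)
print(list(tables))
```

With a real download, pass the path of the zip file instead of `buf`. Values come back as strings, so convert numeric columns before plotting.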

In [None]:
#@title Run full report { display-mode: "form" }

import tempfile
from pathlib import Path
from IPython.display import HTML, display
from codonscope.report import generate_report

# ── Statistical model for Mode 1 ─────────────────────────────────────────────
# "bootstrap" = standard resampling (default), "binomial" = GC3-corrected GLM
MODEL = "bootstrap"  #@param ["bootstrap", "binomial"]
# ──────────────────────────────────────────────────────────────────────────────

# Generate report to a temp file
output_path = Path(tempfile.mkdtemp()) / "codonscope_report.html"

print("Running all analysis modes... (this may take 1-2 minutes)")
if MODEL == "binomial":
    print("  Mode 1 using binomial GLM with GC3 correction (monocodon only)")
print()

report_kwargs = dict(
    species=SPECIES_NAME,
    gene_ids=gene_ids,
    output=output_path,
    n_bootstrap=10_000,
    seed=42,
    model=MODEL,
)
if species2:
    report_kwargs["species2"] = species2
if tissue:
    report_kwargs["tissue"] = tissue
if cell_line:
    report_kwargs["cell_line"] = cell_line

report_path = generate_report(**report_kwargs)

# Show what was produced
zip_path = report_path.parent / f"{report_path.stem}_results.zip"
data_dir = report_path.parent / f"{report_path.stem}_data"
data_files = sorted(data_dir.glob("*.tsv")) if data_dir.exists() else []

print(f"\n✓ Report generated: {report_path}")
if zip_path.exists():
    zip_size_mb = zip_path.stat().st_size / (1024 * 1024)
    print(f"✓ Results zip: {zip_path.name} ({zip_size_mb:.1f} MB)")
    print(f"  Contains: HTML report, README.txt, and {len(data_files)} data files:")
    for f in data_files:
        print(f"    - data/{f.name}")
    print(f"\n  → Run the next cell to download the zip.")

# Display report inline
html_content = report_path.read_text()
display(HTML(html_content))

In [None]:
#@title Download results zip (HTML report + full data tables + README) { display-mode: "form" }

# ── Downloads a zip containing: ──────────────────────────────────────────────
#   codonscope_report.html  — self-contained HTML report with embedded plots
#   README.txt              — analysis parameters, gene list, file guide, data sources
#   data/gene_mapping.tsv   — input ID → gene name → systematic/Ensembl ID
#   data/mode1_*.tsv        — full monocodon and dicodon composition results
#   data/mode2_demand.tsv   — expression-weighted translational demand per codon
#   data/mode3_*.tsv        — per-gene optimality scores, ramp/body composition
#   data/mode4_*.tsv        — per-gene collision fractions, per-dicodon FS enrichment
#   data/mode5_*.tsv        — per-codon attribution (AA-driven vs synonymous)
#   data/mode6_*.tsv        — per-gene cross-species RSCU correlation (if applicable)
# ──────────────────────────────────────────────────────────────────────────────

zip_path = report_path.parent / f"{report_path.stem}_results.zip"

if not zip_path.exists():
    print("No zip file found. Run the report cell above first.")
else:
    zip_size_mb = zip_path.stat().st_size / (1024 * 1024)
    print(f"Downloading: {zip_path.name} ({zip_size_mb:.1f} MB)")
    print(f"Contains HTML report, README documentation, and all TSV data tables.")
    print(f"TSV files have complete results (not truncated) — use them for custom plots.")
    try:
        from google.colab import files
        files.download(str(zip_path))
    except ImportError:
        print(f"\nNot running in Colab. File is at: {zip_path}")

## 5. Explore Individual Analysis Results

Run specific analyses individually and view results as sortable tables. Each cell has adjustable parameters explained below.

### Parameter Guide

| Parameter | Analysis | Options | What it means |
|-----------|----------|---------|---------------|
| **KMER_SIZE** | Enrichment | `1` = monocodon (single codons, 61 total), `2` = dicodon (adjacent codon pairs, 3,721 total), `3` = tricodon (codon triplets, 226,981 total) | Larger k-mers capture context-dependent effects (e.g. ribosome stalling at specific codon pairs) but require more statistical power. Start with `1`. |
| **BACKGROUND** | Enrichment | `all` = compare against all genes in the genome, `matched` = compare against genes matched for CDS length and GC content | Use `all` first. If GC or length bias is flagged, re-run with `matched` to see which codons remain significant after controlling for nucleotide composition. |
| **METHOD** | Optimality | `wtai` = wobble-penalized tRNA Adaptation Index, `tai` = standard tAI without wobble penalty | `wtai` (default) penalises wobble base-pairing (G:U, I:C) by 0.5x because wobble decoding is slower than Watson-Crick. `tai` treats all anticodon-codon pairings equally. Use `wtai` unless you have a reason not to. |
| **RAMP_CODONS** | Optimality | Number of codons from the 5' end to define the "ramp" region (default: 50) | Many highly-expressed genes have a stretch of sub-optimal codons near the start that slows initial elongation and spaces out ribosomes. 30-50 codons is typical. |
| **TISSUE** | Demand | GTEx v8 tissue name (e.g. "Liver", "Brain - Cortex") | Sets which tissue's RNA-seq TPM values weight the translational demand calculation. Different tissues express different genes, so the demand landscape changes. |
| **CELL_LINE** | Demand | Cell line name (e.g. "HEK293T") | Uses the closest GTEx tissue as a proxy for cell line expression. |
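
The k-mer totals in the table follow from the 61 sense codons of the standard genetic code (64 triplets minus three stop codons). A quick check:

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
SENSE_CODONS = [
    "".join(bases)
    for bases in product("ACGT", repeat=3)
    if "".join(bases) not in STOP_CODONS
]

# Number of possible k-mers for k = 1 (monocodon), 2 (dicodon), 3 (tricodon)
kmer_space = {k: len(SENSE_CODONS) ** k for k in (1, 2, 3)}
print(kmer_space)  # {1: 61, 2: 3721, 3: 226981}
```

For a fixed gene list, the observations available per k-mer shrink as this space grows, which is why larger k-mers need more statistical power.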

### How to interpret results

- **Z-score**: How many standard deviations your gene set differs from the genome. Positive = enriched, negative = depleted.
- **adjusted_p**: Benjamini-Hochberg FDR-corrected p-value. < 0.05 is significant after controlling for multiple testing across all codons.
- **cohens_d**: Effect size independent of sample size. |d| < 0.2 is negligible, 0.2-0.5 is small, 0.5-0.8 is medium, > 0.8 is large. A codon can be statistically significant (low p) but biologically trivial (small d).
- **FS enrichment**: Ratio of fast-to-slow codon transitions in your genes vs genome. > 1.0 = more collision-prone junctions. Highly expressed genes tend to have FS enrichment near or below 1.0.
- **RSCU**: Relative Synonymous Codon Usage. 1.0 = codon used equally with its synonyms. > 1.0 = preferred, < 1.0 = avoided. Excludes Met and Trp (single-codon amino acids).
- **wtAI**: Wobble-penalized tRNA Adaptation Index. Score from 0 to 1 based on tRNA gene copy numbers. Higher = more efficiently decoded. "Fast" codons have wtAI above the genome median, "slow" codons below.
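
To make two of these quantities concrete, the sketch below computes RSCU for one amino acid and applies the Benjamini-Hochberg adjustment to a list of p-values, using the textbook definitions. Whether CodonScope computes them in exactly this way is an assumption. The function names, codon counts, and p-values are made up for illustration.

```python
def rscu(counts, synonymous):
    """RSCU for one amino acid: observed count / mean count of its synonymous codons."""
    total = sum(counts.get(codon, 0) for codon in synonymous)
    if total == 0:
        return {codon: float("nan") for codon in synonymous}
    expected = total / len(synonymous)
    return {codon: counts.get(codon, 0) / expected for codon in synonymous}

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy Glu codon counts: GAA is used three times as often as GAG.
glu = rscu({"GAA": 30, "GAG": 10}, ["GAA", "GAG"])
print(glu)  # {'GAA': 1.5, 'GAG': 0.5}

adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(p, 4) for p in adj])
```

In the toy output, GAA has RSCU above 1.0 (preferred) and GAG below 1.0 (avoided), matching the reading given in the list above.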

In [None]:
#@title Codon Enrichment Analysis { display-mode: "form" }

import pandas as pd
from IPython.display import display, HTML
from codonscope.modes.mode1_composition import run_composition

# 1 = monocodon (61 single codons), 2 = dicodon (3,721 adjacent pairs), 3 = tricodon (226,981 triplets)
KMER_SIZE = 1  #@param [1, 2, 3] {type:"integer"}
# "all" = compare vs whole genome; "matched" = compare vs genes matched for CDS length + GC content
BACKGROUND = "all"  #@param ["all", "matched"]
# "bootstrap" = standard resampling (default); "binomial" = GC3-corrected GLM (monocodon only)
MODE1_MODEL = "bootstrap"  #@param ["bootstrap", "binomial"]

kmer_labels = {1: "Monocodon", 2: "Dicodon", 3: "Tricodon"}
print(f"Running {kmer_labels[KMER_SIZE]} enrichment analysis ({BACKGROUND} background)...")
if BACKGROUND == "matched":
    print("  (matched = controls for CDS length and GC content differences)")
if MODE1_MODEL == "binomial":
    if KMER_SIZE != 1:
        print("  Warning: Binomial GLM only supports monocodon (k=1). Falling back to bootstrap.")
        MODE1_MODEL = "bootstrap"
    else:
        print("  Using binomial GLM with GC3 correction")

result = run_composition(
    species=SPECIES_NAME,
    gene_ids=gene_ids,
    k=KMER_SIZE,
    background=BACKGROUND,
    seed=42,
    model=MODE1_MODEL,
)

df = result["results"]
n = result["n_genes"]
n_sig = len(df[df["adjusted_p"] < 0.05])
unmapped = result["id_summary"].unmapped

print(f"\n{n} genes mapped, {len(unmapped)} unmapped")
if unmapped:
    print(f"  Unmapped: {', '.join(unmapped[:10])}{'...' if len(unmapped) > 10 else ''}")
print(f"{n_sig} significant {kmer_labels[KMER_SIZE].lower()}s (adjusted p < 0.05)")
if result.get("model") == "binomial":
    print("  Method: binomial GLM with GC3 correction (Doyle, Nanda & Begley 2025)")

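# Show top results, sorted explicitly by |z_score|
print(f"\nTop 20 by |Z-score|:")
top = df.sort_values("z_score", key=lambda s: s.abs(), ascending=False).head(20).copy()
display_cols = ["kmer", "observed_freq", "expected_freq", "z_score", "p_value", "adjusted_p", "cohens_d"]
if "gc3_beta" in df.columns:
    display_cols.append("gc3_beta")
# Keep 4 decimals for p-values so that small values do not all display as 0.0
top = top[display_cols].round({"observed_freq": 5, "expected_freq": 5, "z_score": 2, "p_value": 4, "adjusted_p": 4, "cohens_d": 3})
# Styler.applymap was renamed to Styler.map in pandas 2.1; use whichever is available
_styler = top.style
_style_map = _styler.map if hasattr(_styler, "map") else _styler.applymap
display(_style_map(
    lambda v: "background-color: #d4edda" if isinstance(v, float) and v < 0.05 else "",
    subset=["adjusted_p"]
))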

In [None]:
#@title Translational Demand Analysis { display-mode: "form" }

from codonscope.modes.mode2_demand import run_demand

# Override tissue/cell line for this analysis (uses values from cell 3 if left blank)
MODE2_TISSUE = ""  #@param ["", "cross_tissue_median", "Liver", "Lung", "Brain - Cortex", "Brain - Cerebellum", "Heart - Left Ventricle", "Kidney - Cortex", "Muscle - Skeletal", "Pancreas", "Whole Blood", "Breast - Mammary Tissue", "Thyroid", "Testis", "Ovary", "Prostate"]
MODE2_CELL_LINE = ""  #@param ["", "HEK293T", "HeLa", "K562", "A549", "MCF7", "U2OS", "HepG2", "SH-SY5Y"]

# ── Custom expression file (optional) ────────────────────────────────────────
# Upload a tab-separated file via the Colab file browser (left sidebar),
# then paste the path here. Format: two columns, gene_id and tpm.
#   gene_id	tpm
#   TP53	12.5
#   RPL11	1500.0
# A header row is optional. Genes not in the file get genome median TPM.
CUSTOM_EXPRESSION_FILE = ""  #@param {type:"string"}
# ──────────────────────────────────────────────────────────────────────────────

print("Running translational demand analysis...")
print("  Weights each codon by TPM x number of codons per gene")
print(f"  Background: all {SPECIES_NAME} genes, each weighted by its expression")

demand_kwargs = dict(species=SPECIES_NAME, gene_ids=gene_ids, seed=42)

# Priority: custom file > mode-specific tissue > cell-3 tissue
if CUSTOM_EXPRESSION_FILE:
    demand_kwargs["expression_file"] = CUSTOM_EXPRESSION_FILE
    print(f"  Expression source: custom file ({CUSTOM_EXPRESSION_FILE})")
else:
    t = MODE2_TISSUE if MODE2_TISSUE else (tissue if tissue else None)
    cl = MODE2_CELL_LINE if MODE2_CELL_LINE else (cell_line if cell_line else None)
    if t:
        demand_kwargs["tissue"] = t
    if cl:
        demand_kwargs["cell_line"] = cl

demand_result = run_demand(**demand_kwargs)

print(f"\n{demand_result['n_genes']} genes analysed")
print(f"Expression source: {demand_result.get('tissue', 'default')}")

df_demand = demand_result["results"]
n_sig = len(df_demand[df_demand["adjusted_p"] < 0.05])
print(f"{n_sig} codons with significant demand bias (adjusted p < 0.05)")

print("\nTop 20 codons by |Z-score|:")
top_demand = df_demand.head(20).copy()
display(top_demand.round(4))

if "top_genes" in demand_result and demand_result["top_genes"] is not None:
    print("\nTop demand-contributing genes (Demand % = share of total translational output):")
    display(demand_result["top_genes"].head(10))

In [None]:
#@title Weighted tRNA Adaptation Index (Optimality Profile) { display-mode: "form" }

import matplotlib.pyplot as plt
import numpy as np
from codonscope.modes.mode3_profile import run_profile

# Number of codons from the 5' end to define the "ramp" region (typically 30-50)
RAMP_CODONS = 50  #@param {type:"integer"}
# wtai = wobble-penalized tRNA Adaptation Index (penalises wobble decoding 0.5x)
# tai = standard tAI (treats all anticodon-codon pairings equally)
METHOD = "wtai"  #@param ["wtai", "tai"]

print(f"Running optimality profile (method={METHOD}, ramp={RAMP_CODONS} codons)...")
if METHOD == "wtai":
    print("  wtAI: higher score = more tRNA gene copies + Watson-Crick decoding preferred")
else:
    print("  tAI: higher score = more tRNA gene copies (no wobble penalty)")

profile_result = run_profile(
    species=SPECIES_NAME,
    gene_ids=gene_ids,
    ramp_codons=RAMP_CODONS,
    method=METHOD,
    seed=42,
)

print(f"\n{profile_result['n_genes']} genes analysed")

# Plot metagene profile
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

mg_gs = profile_result["metagene_geneset"]
mg_bg = profile_result["metagene_genome"]
x = np.arange(len(mg_gs))

ax1.plot(x, mg_gs, label="Gene set", linewidth=2)
ax1.plot(x, mg_bg, label="Genome", linewidth=1, alpha=0.7)
ax1.set_xlabel("Normalised position (%)")
ax1.set_ylabel(f"{METHOD.upper()} score")
ax1.set_title("Metagene Optimality Profile")
# Draw the ramp boundary before building the legend so that its label is listed
ax1.axvline(x=RAMP_CODONS * 100 / max(len(mg_gs), 1), color="gray", linestyle="--", alpha=0.5, label="Ramp boundary")
ax1.legend()

# Ramp analysis summary
ramp = profile_result["ramp_analysis"]
categories = ["Ramp\n(gene set)", "Body\n(gene set)", "Ramp\n(genome)", "Body\n(genome)"]
values = [ramp["ramp_mean_geneset"], ramp["body_mean_geneset"],
          ramp["ramp_mean_genome"], ramp["body_mean_genome"]]
colors = ["#1f77b4", "#1f77b4", "#ff7f0e", "#ff7f0e"]
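bars = ax2.bar(categories, values, color=colors)
# Bar patches accept only a scalar alpha, so set per-bar transparency afterwards
for bar, bar_alpha in zip(bars, [0.6, 1.0, 0.6, 1.0]):
    bar.set_alpha(bar_alpha)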
ax2.set_ylabel(f"Mean {METHOD.upper()}")
ax2.set_title("Ramp vs Body Optimality")

plt.tight_layout()
plt.show()

print(f"\nRamp analysis (ramp = first {RAMP_CODONS} codons, body = rest):")
print(f"  Gene set ramp: {ramp['ramp_mean_geneset']:.4f}, body: {ramp['body_mean_geneset']:.4f}")
print(f"  Genome   ramp: {ramp['ramp_mean_genome']:.4f}, body: {ramp['body_mean_genome']:.4f}")
delta = ramp['body_mean_geneset'] - ramp['ramp_mean_geneset']
print(f"  Ramp delta (body - ramp): {delta:+.4f} {'(ramp is slower -> expected for highly-expressed genes)' if delta > 0 else ''}")

if profile_result.get("ramp_composition") is not None:
    print("\nRamp codon composition (top 10 by frequency difference):")
    display(profile_result["ramp_composition"].head(10))

In [None]:
#@title Collision Potential Analysis { display-mode: "form" }

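# Imported here so this cell also runs without the Codon Enrichment cell
import pandas as pd
from IPython.display import display
from codonscope.modes.mode4_collision import run_collision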

print("Running collision potential analysis...")

collision_result = run_collision(
    species=SPECIES_NAME,
    gene_ids=gene_ids,
)

print(f"\n{collision_result['n_genes']} genes analysed")
print(f"FS enrichment ratio: {collision_result['fs_enrichment']:.3f}")
print(f"  (>1.0 = more fast->slow transitions than expected)")
print(f"Chi-squared p-value: {collision_result['chi2_p']:.2e}")

# Transition matrix
print("\nTransition proportions (gene set vs genome):")
tm_gs = collision_result["transition_matrix_geneset"]
tm_bg = collision_result["transition_matrix_genome"]
trans_df = pd.DataFrame({
    "Transition": ["FF", "FS", "SF", "SS"],
    "Gene set": [tm_gs.get(t, 0) for t in ["FF", "FS", "SF", "SS"]],
    "Genome": [tm_bg.get(t, 0) for t in ["FF", "FS", "SF", "SS"]],
}).set_index("Transition")
display(trans_df.round(4))

# Per-dicodon FS enrichment
if collision_result.get("fs_dicodons") is not None and len(collision_result["fs_dicodons"]) > 0:
    print("\nTop FS dicodons (most enriched fast->slow transitions):")
    display(collision_result["fs_dicodons"].head(15).round(4))

In [None]:
#@title AA vs Synonymous Attribution { display-mode: "form" }

from codonscope.modes.mode5_disentangle import run_disentangle

print("Running AA vs Synonymous attribution analysis...")

dis_result = run_disentangle(
    species=SPECIES_NAME,
    gene_ids=gene_ids,
    seed=42,
)

summary = dis_result["summary"]
print(f"\n{dis_result['n_genes']} genes analysed")
print(f"\nSignificant codons: {summary['n_significant_codons']}")
print(f"  AA-driven:          {summary['n_aa_driven']} ({summary['pct_aa_driven']:.0f}%)")
print(f"  Synonymous-driven:  {summary['n_synonymous_driven']} ({summary['pct_synonymous_driven']:.0f}%)")
print(f"  Both:               {summary['n_both']} ({summary['pct_both']:.0f}%)")
print(f"\n{summary['summary_text']}")

# Attribution table
print("\nPer-codon attribution:")
attr = dis_result["attribution"].copy()
attr_display = attr[["codon", "amino_acid", "aa_z_score", "rscu_z_score", "attribution"]].copy()
attr_display = attr_display.round({"aa_z_score": 2, "rscu_z_score": 2})

def color_attribution(val):
    colors = {
        "AA-driven": "background-color: #cce5ff",
        "Synonymous-driven": "background-color: #d4edda",
        "Both": "background-color: #fff3cd",
        "None": "",
    }
    return colors.get(val, "")

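# Styler.applymap was renamed to Styler.map in pandas 2.1; use whichever exists.
styler = attr_display.style
style_cells = styler.map if hasattr(styler, "map") else styler.applymap
display(style_cells(color_attribution, subset=["attribution"]))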

# Synonymous drivers
syn = dis_result["synonymous_drivers"]
syn_sig = syn[syn["driver"] != "not_applicable"]
if len(syn_sig) > 0:
    print("\nSynonymous driver classification:")
    display(syn_sig[["codon", "amino_acid", "rscu_z_score", "driver"]].round(2))

In [None]:
#@title Cross-Species Comparison (requires orthologs) { display-mode: "form" }

if not species2:
    print("Skipped: no second species selected in cell 3.")
    print("Set SPECIES2 above and re-run to enable cross-species comparison.")
else:
    from codonscope.modes.mode6_compare import run_compare

    print(f"Running cross-species comparison: {SPECIES_NAME} vs {species2}...")

    compare_result = run_compare(
        species1=SPECIES_NAME,
        species2=species2,
        gene_ids=gene_ids,
        seed=42,
    )

    s = compare_result["summary"]
    print(f"\n{compare_result['n_orthologs']} ortholog pairs found in gene set")
    print(f"{compare_result['n_genome_orthologs']} ortholog pairs genome-wide")
    print(f"\nMean RSCU correlation:")
    print(f"  Gene set: {s['mean_r_geneset']:.3f}")
    print(f"  Genome:   {s['mean_r_genome']:.3f}")
    print(f"  Z-score:  {s['z_score']:.2f}")
    print(f"  P-value:  {s['p_value']:.2e}")

    # Per-gene correlations
    print("\nPer-gene RSCU correlations:")
    per_gene = compare_result["per_gene"].sort_values("r", ascending=False)
    display(per_gene.head(20).round(3))

    # Divergent genes
    if compare_result.get("divergent_analysis") is not None and len(compare_result["divergent_analysis"]) > 0:
        print("\nMost divergent genes (different preferred codons):")
        display(compare_result["divergent_analysis"].head(10).round(3))

---

## Tips

- **Low Z-scores for highly expressed genes?** This is expected when your gene set dominates the genome's translational demand (e.g., ribosomal proteins in yeast make up ~72% of demand). In that case the gene set effectively is the background, so little contrast remains to detect.
- **Matched vs all background:** Use `matched` in the Enrichment analysis to control for CDS length and GC content. This reduces compositional confounds, so the enrichment that remains is more likely to reflect codon selection.
- **Tricodon analysis:** Requires large gene sets (>50 genes) for statistical power. With 226,981 possible tricodons (61³ sense-codon triplets), the multiple-testing correction is heavy.
- **Cross-species comparison:** Low RSCU correlation between orthologs can be biologically meaningful: it suggests that different tRNA pools drive different codon preferences in the two species.
- **Custom expression data:** Use `--expression-file` in the CLI or the `expression_file` parameter to supply your own TPM values for demand analysis.
- **Region-specific enrichment:** The report shows separate enrichment for ramp (codons 2-50) and body (51+) regions. Position-specific biases can indicate translational selection for ribosome spacing.
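
To make the ramp/body boundary in the last tip concrete, here is a minimal sketch of how a coding sequence could be split into the two regions. It is an illustration, not CodonScope's own implementation: the function name `split_ramp_body`, the `ramp_end` parameter, and the choice to drop a trailing stop codon are all assumptions made for this example.

```python
# Hypothetical helper (not part of the CodonScope API): split a CDS into
# ramp (codons 2-50) and body (codons 51+), using 1-based codon numbering.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def split_ramp_body(cds, ramp_end=50):
    """Return (ramp_codons, body_codons) for a DNA coding sequence."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Assumption: a trailing stop codon is not counted in either region.
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    ramp = codons[1:ramp_end]  # skip codon 1 (start), keep codons 2..ramp_end
    body = codons[ramp_end:]   # codons ramp_end+1 onward
    return ramp, body

# Toy CDS: start codon, 59 alanine codons, stop codon (61 codons in total).
toy_cds = "ATG" + "GCT" * 59 + "TAA"
ramp, body = split_ramp_body(toy_cds)
print(len(ramp), len(body))
```

Genes shorter than the ramp length return an empty body here, which is one reason very short CDSs may contribute little to a separate body analysis.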