# CodonScope: Multi-Species Codon Usage Analysis

Analyze codon usage patterns in your gene lists across yeast, human, and mouse.

CodonScope runs 6 analysis modes:

| Mode | Name | What it does |
|------|------|--------------|
| 1 | **Composition** | Compares mono/di/tricodon frequencies in your genes vs the genome. Finds over- or under-represented codons. |
| 2 | **Translational Demand** | Weights codon usage by expression level (TPM). Shows which codons are under translational selection pressure. |
| 3 | **Optimality Profile** | Plots per-position codon optimality (tAI/wtAI) along transcripts. Detects 5' ramp regions. |
| 4 | **Collision Potential** | Counts fast-to-slow (FS) codon transitions that may cause ribosome collisions. |
| 5 | **Disentanglement** | Separates amino acid composition effects from synonymous codon choice (RSCU). Classifies drivers as tRNA supply, GC3 bias, or wobble avoidance. |
| 6 | **Cross-Species** | Correlates RSCU between ortholog pairs across species. Identifies genes with divergent codon preferences. |
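
To make the mono/di/tricodon terminology in Mode 1 concrete, here is a minimal sketch of how in-frame codon k-mers could be counted in a single CDS. It is not CodonScope's implementation: the function names, the overlapping one-codon sliding window, and the choice to drop a terminal stop codon are assumptions made for illustration.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def count_codon_kmers(cds, k=1):
    """Count runs of k consecutive in-frame codons (hypothetical helper)."""
    cds = cds.upper()
    # Split into in-frame codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # Assumption: drop a terminal stop codon so only sense codons are counted.
    if codons and codons[-1] in STOP_CODONS:
        codons.pop()
    # Slide a window of k codons forward one codon at a time.
    return Counter("".join(codons[i:i + k]) for i in range(len(codons) - k + 1))

def to_frequencies(counts):
    """Convert raw counts to relative frequencies."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

toy_cds = "ATGGCTGCTAAAGCTTAA"  # ATG GCT GCT AAA GCT TAA
mono = count_codon_kmers(toy_cds, k=1)
di = count_codon_kmers(toy_cds, k=2)
print(to_frequencies(mono))
print(di)
```

Comparing such frequencies between a gene set and the genome is the idea behind Mode 1; the statistics CodonScope applies on top are not shown here.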

**How to use this notebook:**
1. Run cells 1-2 to install CodonScope and download reference data (first run takes ~5 min)
2. Paste your gene list in cell 3
3. Run cells 4-5 to see results

## 1. Install CodonScope

In [None]:
#@title Install CodonScope and dependencies { display-mode: "form" }

import subprocess, sys, os

# ── Configuration ────────────────────────────────────────────────────────────
GITHUB_REPO = "https://github.com/Meier-Lab-NCI/codonscope.git"
# ──────────────────────────────────────────────────────────────────────────────

def run(cmd):
    """Run a shell command and print output."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.stdout.strip():
        print(result.stdout.strip())
    if result.returncode != 0 and result.stderr.strip():
        print(result.stderr.strip())
    return result.returncode

# Install dependencies
print("Installing dependencies...")
run(f"{sys.executable} -m pip install -q numpy scipy pandas matplotlib requests tqdm")

# Install CodonScope from GitHub
print("\nInstalling CodonScope from GitHub...")
run(f"{sys.executable} -m pip install -q git+{GITHUB_REPO}")

# Enable ipywidgets in Colab (required for Textarea rendering)
try:
    from google.colab import output
    output.enable_custom_widget_manager()
    print("Enabled Colab custom widget manager for interactive inputs")
except Exception:
    pass

# Verify installation
try:
    import codonscope
    print(f"\n✓ CodonScope v{codonscope.__version__} installed successfully")
except ImportError:
    print("\n✗ Installation failed. Check the GitHub URL above.")

## 2. Download Reference Data

Downloads CDS sequences, tRNA copy numbers, wobble rules, expression data, and pre-computed backgrounds for the selected species.

If Google Drive is mounted, data is cached there so you only download once across sessions. Otherwise data is stored in the Colab runtime (re-downloaded each session).

In [None]:
#@title Download species data (cached to Google Drive) { display-mode: "form" }

import os, shutil
from pathlib import Path

# ── Which species to download ─────────────────────────────────────────────────
download_human = True  #@param {type:"boolean"}
download_yeast = True  #@param {type:"boolean"}
download_mouse = False  #@param {type:"boolean"}

# Also download orthologs for cross-species comparison (Mode 6)?
DOWNLOAD_ORTHOLOGS = False  #@param {type:"boolean"}
# ──────────────────────────────────────────────────────────────────────────────

SPECIES = []
if download_human:
    SPECIES.append("human")
if download_yeast:
    SPECIES.append("yeast")
if download_mouse:
    SPECIES.append("mouse")

if not SPECIES:
    raise ValueError("Select at least one species to download.")

# Try to mount Google Drive for persistent caching
USE_DRIVE = False
DRIVE_DATA_DIR = Path("/content/drive/MyDrive/codonscope_data")
LOCAL_DATA_DIR = Path.home() / ".codonscope" / "data"

try:
    from google.colab import drive
    if not os.path.exists("/content/drive/MyDrive"):
        print("Mounting Google Drive for persistent data caching...")
        drive.mount("/content/drive")
    USE_DRIVE = True
    print("✓ Google Drive mounted — data will be cached between sessions")
except Exception as e:  # includes ImportError when not running in Colab
    print(f"Google Drive not available ({e}). Data will be stored locally.")

if USE_DRIVE:
    # Symlink so codonscope reads from Drive
    DRIVE_DATA_DIR.mkdir(parents=True, exist_ok=True)
    LOCAL_DATA_DIR.parent.mkdir(parents=True, exist_ok=True)
    if LOCAL_DATA_DIR.is_symlink():
        LOCAL_DATA_DIR.unlink()
    elif LOCAL_DATA_DIR.exists():
        shutil.rmtree(LOCAL_DATA_DIR)
    LOCAL_DATA_DIR.symlink_to(DRIVE_DATA_DIR)
    print(f"  Data directory: {DRIVE_DATA_DIR}")

# Download each species (skips if data already exists)
from codonscope.data.download import download, download_orthologs

for sp in SPECIES:
    species_dir = LOCAL_DATA_DIR / "species" / sp
    marker = species_dir / "background_mono.npz"
    if marker.exists():
        n_files = len(list(species_dir.glob("*")))
        print(f"\n✓ {sp} data already cached ({n_files} files). Skipping download.")
    else:
        print(f"\nDownloading {sp} data (this may take a few minutes)...")
        download(sp)
        print(f"✓ {sp} data downloaded")

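# Optionally download orthologs (requires at least two species selected above)
if DOWNLOAD_ORTHOLOGS and len(SPECIES) < 2:
    print("\n⚠ Ortholog download skipped: select at least two species.")
if DOWNLOAD_ORTHOLOGS and len(SPECIES) >= 2: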
    ortho_dir = LOCAL_DATA_DIR / "orthologs"
    pairs = []
    if "human" in SPECIES and "yeast" in SPECIES:
        pairs.append(("human", "yeast"))
    if "human" in SPECIES and "mouse" in SPECIES:
        pairs.append(("human", "mouse"))
    if "mouse" in SPECIES and "yeast" in SPECIES:
        pairs.append(("mouse", "yeast"))
    for s1, s2 in pairs:
        ortho_file = ortho_dir / f"{s1}_{s2}.tsv"
        if ortho_file.exists():
            print(f"\n✓ {s1}-{s2} orthologs already cached. Skipping.")
        else:
            print(f"\nDownloading {s1}-{s2} orthologs...")
            download_orthologs(s1, s2)
            print(f"✓ {s1}-{s2} orthologs downloaded")

print("\n" + "="*50)
print("All data ready. Proceed to enter your gene list.")

## 3. Enter Your Gene List

Select a **pilot gene set** from the dropdown, or paste your own gene IDs.

### Pilot gene sets

Curated gene lists with well-characterized codon usage biology. Use them to verify your installation and understand CodonScope output. The species is auto-detected when you pick a pilot set.

| Pilot set | Species | Genes | Expected signal | Study |
|-----------|---------|-------|-----------------|-------|
| **Ribosomal proteins** | Yeast | 132 | Strong synonymous codon bias, high optimality with 5' ramp | Ikemura 1985, Sharp & Li 1987 |
| **Gcn4 targets** | Yeast | 54 | AA-driven bias (glycine enrichment from biosynthesis genes) | Natarajan 2001 |
| **Trm4/TTG targets** | Yeast | 39 | Extreme TTG leucine codon enrichment (m5C modification) | Chan 2012 |
| **Trm9/AGA-GAA targets** | Yeast | 50 | Strong AGA/GAA enrichment (mcm5s2U modification) | Begley 2007 |
| **Ribosomal proteins** | Human | 80 | Mixed GC3 + translational selection signal | dos Reis 2004 |
| **5'TOP mRNAs** | Human | 115 | mTOR-regulated, highest translational demand | Thoreen 2012, Philippe 2020 |
| **MYC targets (proliferation)** | Human | 200 | GC3 bias from recombination, NOT translational selection | MSigDB; Pouyet 2017 caveat |
| **EMT (differentiation)** | Human | 200 | GC3 bias opposite direction from MYC targets | MSigDB; Pouyet 2017 caveat |
| **Secreted proteins** | Human | 53 | AA-driven: signal peptides + collagen Gly-X-Y repeats | Zhou 2024 |
| **ISR upregulated** | Human | 18 | **Negative control** — uORF-mediated, no codon bias expected | Harding 2000 |
| **Membrane proteins** | Human | 50 | AA-driven: hydrophobic transmembrane domains | — |
| **Cross-species RP** | Yeast vs Human | 75 | Low RSCU correlation (~0.13) between ortholog pairs (Mode 6) | — |

### Supported ID types

| Species | Accepted IDs | Examples |
|---------|-------------|----------|
| **Yeast** | Systematic names, common names, SGD IDs, UniProt | `YFL039C`, `ACT1`, `SGD:S000001855`, `P60010` |
| **Human** | HGNC symbols, Ensembl (ENSG/ENST), Entrez, RefSeq, UniProt | `TP53`, `ENSG00000141510`, `7157`, `NM_000546`, `P04637` |
| **Mouse** | MGI symbols, Ensembl (ENSMUSG/ENSMUST), MGI IDs, Entrez, UniProt | `Actb`, `ENSMUSG00000029580`, `MGI:87904`, `11461` |

You can mix ID types in the same list.
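
As a rough illustration of why mixed lists can work, the ID formats in the table are mostly distinguishable by shape alone. The sketch below guesses an ID type with regular expressions. It is not CodonScope's resolver: `guess_id_type` and the pattern table are hypothetical, and a real resolver would look IDs up in mapping tables rather than rely on patterns.

```python
import re

# Illustrative patterns only; order matters because the first match wins.
_ID_PATTERNS = [
    ("ensembl", re.compile(r"^ENS(MUS)?[GT]\d{11}(\.\d+)?$")),
    ("sgd_id", re.compile(r"^SGD:S\d{9}$")),
    ("mgi_id", re.compile(r"^MGI:\d+$")),
    ("refseq", re.compile(r"^[NX][MR]_\d+(\.\d+)?$")),
    ("yeast_systematic", re.compile(r"^Y[A-P][LR]\d{3}[WC](-[A-Z])?$")),
    ("entrez", re.compile(r"^\d+$")),
    ("uniprot", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"
    )),
]

def guess_id_type(gene_id):
    """Return a best-guess ID type; anything unmatched is treated as a symbol."""
    token = gene_id.strip()
    for name, pattern in _ID_PATTERNS:
        if pattern.match(token):
            return name
    return "symbol"

for example in ["YFL039C", "ACT1", "SGD:S000001855", "P60010",
                "ENSG00000141510", "7157", "NM_000546", "MGI:87904"]:
    print(f"{example:>18} -> {guess_id_type(example)}")
```

Gene symbols are the fallback category here because they have no fixed format.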

### Expression source (human only, for Mode 2)

- **Tissue dropdown**: Select a GTEx v8 tissue for tissue-specific expression. Leave blank for cross-tissue median.
- **Cell line dropdown**: Common cell lines (HEK293T, HeLa, etc.) use the closest GTEx tissue as a proxy.
- **Custom expression**: To supply your own TPM values, save a tab-separated file with two columns and load it via the CLI:
  ```
  gene_id    tpm
  TP53       12.5
  BRCA1      8.3
  MYC        150.0
  ```
  Then run: `python3 -m codonscope.cli demand --species human --genes genes.txt --expression my_tpm.tsv`
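
If your TPM values already live in Python, a file in the two-column layout above can be written with the standard library. This is a sketch under assumptions: `write_expression_tsv` is a hypothetical helper, not part of CodonScope, and it assumes the CLI expects exactly the `gene_id` and `tpm` header names shown in the example.

```python
import csv
import tempfile
from pathlib import Path

def write_expression_tsv(tpm_by_gene, path):
    """Write {gene_id: tpm} as a tab-separated file with a header row."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene_id", "tpm"])
        for gene_id, tpm in tpm_by_gene.items():
            if tpm < 0:
                raise ValueError(f"Negative TPM for {gene_id}: {tpm}")
            writer.writerow([gene_id, tpm])
    return path

# Written to a temporary directory here; use any path you like in practice.
example_tpm = {"TP53": 12.5, "BRCA1": 8.3, "MYC": 150.0}
tsv_path = write_expression_tsv(example_tpm, Path(tempfile.mkdtemp()) / "my_tpm.tsv")
print(tsv_path.read_text(encoding="utf-8"))
```

Using real tab characters matters: the example table above is aligned with spaces for readability, but the file itself should be tab-delimited.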

In [None]:
#@title Select gene list and species { display-mode: "form" }

# ── Pilot gene set ────────────────────────────────────────────────────────────
# Pick a curated gene list, or select "Custom" to paste your own.
# Species is auto-detected from the pilot set (you can override below).
PILOT_SET = "Yeast \u2014 Ribosomal proteins (132 genes, Ikemura 1985)"  #@param ["Yeast \u2014 Ribosomal proteins (132 genes, Ikemura 1985)", "Yeast \u2014 Gcn4 targets (54 genes, Natarajan 2001)", "Yeast \u2014 Trm4/TTG targets (39 genes, Chan 2012)", "Yeast \u2014 Trm9/AGA-GAA targets (50 genes, Begley 2007)", "Human \u2014 Ribosomal proteins (80 genes, dos Reis 2004)", "Human \u2014 5\u2019TOP mRNAs (115 genes, Thoreen 2012)", "Human \u2014 MYC targets/proliferation (200 genes, MSigDB)", "Human \u2014 EMT/differentiation (200 genes, MSigDB)", "Human \u2014 Secreted proteins (53 genes, Zhou 2024)", "Human \u2014 ISR upregulated [negative ctrl] (18 genes, Harding 2000)", "Human \u2014 Membrane proteins (50 genes)", "Cross-species \u2014 Yeast RP for Mode 6 (75 genes)", "Custom \u2014 Paste your own gene list"]

# ── Override species (auto-set from pilot set, change only for custom) ────────
SPECIES_NAME = ""  #@param ["", "yeast", "human", "mouse"]
# Leave blank to auto-detect from the pilot set above.

# ── Optional: cross-species comparison ────────────────────────────────────────
SPECIES2 = ""  #@param ["", "yeast", "human", "mouse"]
# Leave blank to skip Mode 6. Requires orthologs downloaded above.

# ── Human expression source (for Mode 2: Translational Demand) ───────────────
TISSUE = ""  #@param ["", "cross_tissue_median", "Liver", "Lung", "Brain - Cortex", "Brain - Cerebellum", "Heart - Left Ventricle", "Kidney - Cortex", "Muscle - Skeletal", "Pancreas", "Whole Blood", "Breast - Mammary Tissue", "Colon - Sigmoid", "Stomach", "Thyroid", "Testis", "Ovary", "Prostate", "Spleen", "Skin - Sun Exposed (Lower leg)", "Adipose - Subcutaneous", "Artery - Aorta", "Esophagus - Mucosa", "Nerve - Tibial", "Pituitary", "Small Intestine - Terminal Ileum"]
CELL_LINE = ""  #@param ["", "HEK293T", "HeLa", "K562", "A549", "MCF7", "U2OS", "HepG2", "SH-SY5Y", "Jurkat", "RPE1"]
# ──────────────────────────────────────────────────────────────────────────────

import ipywidgets as widgets
from IPython.display import display as ipy_display

# ── Pilot gene sets with species metadata ────────────────────────────────────
_PILOT_SETS = {
    "Yeast \u2014 Ribosomal proteins (132 genes, Ikemura 1985)": {
        "species": "yeast",
        "species2": None,
        "description": "All 132 cytoplasmic ribosomal proteins (both A/B paralogs). The canonical example of translational selection in yeast.",
        "genes": """RPL1A
RPL1B
RPL2A
RPL2B
RPL3
RPL4A
RPL4B
RPL5
RPL6A
RPL6B
RPL7A
RPL7B
RPL8A
RPL8B
RPL9A
RPL9B
RPL10
RPL11A
RPL11B
RPL12A
RPL12B
RPL13A
RPL13B
RPL14A
RPL14B
RPL15A
RPL15B
RPL16A
RPL16B
RPL17A
RPL17B
RPL18A
RPL18B
RPL19A
RPL19B
RPL20A
RPL20B
RPL21A
RPL21B
RPL22A
RPL22B
RPL23A
RPL23B
RPL24A
RPL24B
RPL25
RPL26A
RPL26B
RPL27A
RPL27B
RPL28
RPL29
RPL30
RPL31A
RPL31B
RPL32
RPL33A
RPL33B
RPL34A
RPL34B
RPL35A
RPL35B
RPL36A
RPL36B
RPL37A
RPL37B
RPL38
RPL39
RPL40A
RPL40B
RPL41A
RPL41B
RPL42A
RPL42B
RPL43A
RPL43B
RPS0A
RPS0B
RPS1A
RPS1B
RPS2
RPS3
RPS4A
RPS4B
RPS5
RPS6A
RPS6B
RPS7A
RPS7B
RPS8A
RPS8B
RPS9A
RPS9B
RPS10A
RPS10B
RPS11A
RPS11B
RPS12
RPS13
RPS14A
RPS14B
RPS15
RPS16A
RPS16B
RPS17A
RPS17B
RPS18A
RPS18B
RPS19A
RPS19B
RPS20
RPS21A
RPS21B
RPS22A
RPS22B
RPS23A
RPS23B
RPS24A
RPS24B
RPS25A
RPS25B
RPS26A
RPS26B
RPS27A
RPS27B
RPS28A
RPS28B
RPS29A
RPS29B
RPS30A
RPS30B
RPS31""",
    },
    "Yeast \u2014 Gcn4 targets (54 genes, Natarajan 2001)": {
        "species": "yeast",
        "species2": None,
        "description": "Gcn4-regulated amino acid biosynthesis genes. Strong AA-driven signal (glycine enrichment). Good contrast with RP genes.",
        "genes": """ARG1
ARG3
ARG4
ARG56
ARG8
HIS1
HIS3
HIS4
HIS5
TRP1
TRP2
TRP3
TRP4
TRP5
LEU1
LEU2
LEU4
ILV1
ILV2
ILV3
ILV5
ILV6
LYS1
LYS2
LYS4
LYS9
LYS20
MET6
MET13
MET17
SER1
SER2
SER33
THR1
THR4
GCN4
CPA2
GLN1
ASN1
ASN2
HOM2
HOM3
HOM6
SHM2
GCV1
GCV2
ARO1
ARO2
ARO3
ARO4
ALA1
GRS1
DPS1
THS1""",
    },
    "Yeast \u2014 Trm4/TTG targets (39 genes, Chan 2012)": {
        "species": "yeast",
        "species2": None,
        "description": "Genes with >=90% TTG leucine codons. Decoded by Trm4-modified (m5C) tRNA. 27 RP + 12 glycolytic/metabolic genes.",
        "genes": """RPL4A
RPL4B
RPL8A
RPL8B
RPL9A
RPL10
RPL12B
RPL15A
RPL17A
RPL17B
RPL22A
RPL28
RPL36B
RPL37A
RPL37B
RPL39
RPL42A
RPL43A
RPS2
RPS5
RPS6A
RPS6B
RPS9B
RPS10B
RPS13
RPS15
RPS26B
TDH3
TDH2
TDH1
ENO2
ENO1
CDC19
PDC1
HYP2
ANB1
HSP12
CCW12
NOP10""",
    },
    "Yeast \u2014 Trm9/AGA-GAA targets (50 genes, Begley 2007)": {
        "species": "yeast",
        "species2": None,
        "description": "Top 50 genes by AGA+GAA codon fraction. Decoded by Trm9-modified (mcm5s2U) tRNA. Key DNA damage response link.",
        "genes": """RPL3
RPL8A
RPL8B
RPL10
RPL18A
RPL22A
RPL28
RPL30
RPL33B
RPP0
RPS1A
RPS1B
RPS10A
RPS10B
RPS20
RPS26A
RPS26B
RPS31
NHP2
TDH3
TDH2
TDH1
ENO1
ENO2
CDC19
GPM1
TPI1
PDC1
PDC5
TEF1
TEF2
HYP2
ANB1
ADH1
HTA2
HTB1
HHT1
HHT2
TSA1
FPR2
CIS3
TOM22
TAL1
SEC53
YRB1
BUD28
MF(ALPHA)1
YDL228C
YPL142C
YPL197C""",
    },
    "Human \u2014 Ribosomal proteins (80 genes, dos Reis 2004)": {
        "species": "human",
        "species2": None,
        "description": "80 core cytoplasmic ribosomal proteins. Mixed GC3 + translational selection signal in mammals.",
        "genes": """RPL3
RPL4
RPL5
RPL6
RPL7
RPL7A
RPL8
RPL9
RPL10
RPL10A
RPL11
RPL12
RPL13
RPL13A
RPL14
RPL15
RPL17
RPL18
RPL18A
RPL19
RPL21
RPL22
RPL23
RPL23A
RPL24
RPL26
RPL27
RPL27A
RPL28
RPL29
RPL30
RPL31
RPL32
RPL34
RPL35
RPL35A
RPL36
RPL36A
RPL37
RPL37A
RPL38
RPL39
RPL41
RPLP0
RPLP1
RPLP2
UBA52
RPSA
RPS2
RPS3
RPS3A
RPS4X
RPS5
RPS6
RPS7
RPS8
RPS9
RPS10
RPS11
RPS12
RPS13
RPS14
RPS15
RPS15A
RPS16
RPS17
RPS18
RPS19
RPS20
RPS21
RPS23
RPS24
RPS25
RPS26
RPS27
RPS27A
RPS28
RPS29
FAU
RACK1""",
    },
    "Human \u2014 5\u2019TOP mRNAs (115 genes, Thoreen 2012)": {
        "species": "human",
        "species2": None,
        "description": "5'TOP motif mRNAs regulated by mTOR/LARP1. 80 RP + 35 translation factors and RNA-binding proteins. Highest translational demand.",
        "genes": """RPL3
RPL4
RPL5
RPL6
RPL7
RPL7A
RPL8
RPL9
RPL10
RPL10A
RPL11
RPL12
RPL13
RPL13A
RPL14
RPL15
RPL17
RPL18
RPL18A
RPL19
RPL21
RPL22
RPL23
RPL23A
RPL24
RPL26
RPL27
RPL27A
RPL28
RPL29
RPL30
RPL31
RPL32
RPL34
RPL35
RPL35A
RPL36
RPL36A
RPL37
RPL37A
RPL38
RPL39
RPL41
RPLP0
RPLP1
RPLP2
UBA52
RPSA
RPS2
RPS3
RPS3A
RPS4X
RPS5
RPS6
RPS7
RPS8
RPS9
RPS10
RPS11
RPS12
RPS13
RPS14
RPS15
RPS15A
RPS16
RPS17
RPS18
RPS19
RPS20
RPS21
RPS23
RPS24
RPS25
RPS26
RPS27
RPS27A
RPS28
RPS29
FAU
RACK1
EEF1A1
EEF1B2
EEF1G
EEF1D
EEF2
EIF3E
EIF3F
EIF3H
EIF4B
EIF3A
EIF3C
EIF3G
EIF3I
PABPC1
NPM1
NAP1L1
HNRNPA1
VIM
TPT1
PTMA
TCOF1
DKC1
NOP56
NOP58
FBL
SNU13
DDX21
GNL3
SNRPD2
SNRPE
SNRPB
NACA
ABCF1
CSDE1
HSP90AB1""",
    },
    "Human \u2014 MYC targets/proliferation (200 genes, MSigDB)": {
        "species": "human",
        "species2": None,
        "description": "HALLMARK_MYC_TARGETS_V1 from MSigDB. IMPORTANT: C/G-ending codon enrichment is driven by GC-biased gene conversion (gBGC) correlated with recombination rate, NOT translational selection (Pouyet 2017). Compare with EMT set.",
        "genes": """ABCE1
ACP1
AIMP2
AP3S1
APEX1
BUB3
C1QBP
CAD
CANX
CBX3
CCNA2
CCT2
CCT3
CCT4
CCT5
CCT7
CDC20
CDC45
CDK2
CDK4
CLNS1A
CNBP
COPS5
COX5A
CSTF2
CTPS1
CUL1
CYC1
DDX18
DDX21
DEK
DHX15
DUT
EEF1B2
EIF1AX
EIF2S1
EIF2S2
EIF3B
EIF3D
EIF3J
EIF4A1
EIF4E
EIF4G2
EIF4H
EPRS1
ERH
ETF1
EXOSC7
FAM120A
FBL
G3BP1
GLO1
GNL3
GOT2
GSPT1
H2AZ1
HDAC2
HDDC2
HDGF
HNRNPA1
HNRNPA2B1
HNRNPA3
HNRNPC
HNRNPD
HNRNPR
HNRNPU
HPRT1
HSP90AB1
HSPD1
HSPE1
IARS1
IFRD1
ILF2
IMPDH2
KARS1
KPNA2
KPNB1
LDHA
LSM2
LSM7
MAD2L1
MCM2
MCM4
MCM5
MCM6
MCM7
MRPL23
MRPL9
MRPS18B
MYC
NAP1L1
NCBP1
NCBP2
NDUFAB1
NHP2
NME1
NOLC1
NOP16
NOP56
NPM1
ODC1
ORC2
PA2G4
PABPC1
PABPC4
PCBP1
PCNA
PGK1
PHB1
PHB2
POLD2
POLE3
PPIA
PPM1G
PRDX3
PRDX4
PRPF31
PRPS2
PSMA1
PSMA2
PSMA4
PSMA6
PSMA7
PSMB2
PSMB3
PSMC4
PSMC6
PSMD1
PSMD14
PSMD3
PSMD7
PSMD8
PTGES3
PWP1
RACK1
RAD23B
RAN
RANBP1
RFC4
RNPS1
RPL14
RPL18
RPL22
RPL34
RPL6
RPLP0
RPS10
RPS2
RPS3
RPS5
RPS6
RRM1
RRP9
RSL1D1
RUVBL2
SERBP1
SET
SF3A1
SF3B3
SLC25A3
SMARCC1
SNRPA
SNRPA1
SNRPB2
SNRPD1
SNRPD2
SNRPD3
SNRPG
SRM
SRPK1
SRSF1
SRSF2
SRSF3
SRSF7
SSB
SSBP1
STARD7
SYNCRIP
TARDBP
TCP1
TFDP1
TOMM70
TRA2B
TRIM28
TUFM
TXNL4A
TYMS
U2AF1
UBA2
UBE2E1
UBE2L3
USP1
VBP1
VDAC1
VDAC3
XPO1
XPOT
XRCC6
YWHAE
YWHAQ""",
    },
    "Human \u2014 EMT/differentiation (200 genes, MSigDB)": {
        "species": "human",
        "species2": None,
        "description": "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION from MSigDB. A/U-ending codons enriched \u2014 the opposite of MYC targets. Both driven by GC-biased gene conversion (Pouyet 2017), not translational selection.",
        "genes": """ABI3BP
ACTA2
ADAM12
ANPEP
APLP1
AREG
BASP1
BDNF
BGN
BMP1
CADM1
CALD1
CALU
CAP2
CAPG
CCN1
CCN2
CD44
CD59
CDH11
CDH2
CDH6
COL11A1
COL12A1
COL16A1
COL1A1
COL1A2
COL3A1
COL4A1
COL4A2
COL5A1
COL5A2
COL5A3
COL6A2
COL6A3
COL7A1
COL8A2
COLGALT1
COMP
COPA
CRLF1
CTHRC1
CXCL1
CXCL12
CXCL6
CXCL8
DAB2
DCN
DKK1
DPYSL3
DST
ECM1
ECM2
EDIL3
EFEMP2
ELN
EMP3
ENO2
FAP
FAS
FBLN1
FBLN2
FBLN5
FBN1
FBN2
FERMT2
FGF2
FLNA
FMOD
FN1
FOXC2
FSTL1
FSTL3
FUCA1
FZD8
GADD45A
GADD45B
GAS1
GEM
GJA1
GLIPR1
GPC1
GPX7
GREM1
HTRA1
ID2
IGFBP2
IGFBP3
IGFBP4
IL15
IL32
IL6
INHBA
ITGA2
ITGA5
ITGAV
ITGB1
ITGB3
ITGB5
JUN
LAMA1
LAMA2
LAMA3
LAMC1
LAMC2
LGALS1
LOX
LOXL1
LOXL2
LRP1
LRRC15
LUM
MAGEE1
MATN2
MATN3
MCM7
MEST
MFAP5
MGP
MMP1
MMP14
MMP2
MMP3
MSX1
MXRA5
MYL9
MYLK
NID2
NNMT
NOTCH2
NT5E
NTM
OXTR
P3H1
PCOLCE
PCOLCE2
PDGFRB
PDLIM4
PFN2
PLAUR
PLOD1
PLOD2
PLOD3
PMEPA1
PMP22
POSTN
PPIB
PRRX1
PRSS2
PTHLH
PTX3
PVR
QSOX1
RGS4
RHOB
SAT1
SCG2
SDC1
SDC4
SERPINE1
SERPINE2
SERPINH1
SFRP1
SFRP4
SGCB
SGCD
SGCG
SLC6A8
SLIT2
SLIT3
SNAI2
SNTB1
SPARC
SPOCK1
SPP1
TAGLN
TFPI2
TGFB1
TGFBI
TGFBR3
TGM2
THBS1
THBS2
THY1
TIMP1
TIMP3
TNC
TNFAIP3
TNFRSF11B
TNFRSF12A
TPM1
TPM2
TPM4
VCAM1
VCAN
VEGFA
VEGFC
VIM
WIPF1
WNT5A""",
    },
    "Human \u2014 Secreted proteins (53 genes, Zhou 2024)": {
        "species": "human",
        "species2": None,
        "description": "Liver secretome, collagens, coagulation factors, complement proteins. AA-driven signal from signal peptides and Gly-X-Y repeats.",
        "genes": """ALB
TTR
TF
SERPINA1
HP
HPX
APOA1
APOB
FGA
FGB
FGG
AFP
A1BG
ORM1
SERPINC1
COL1A1
COL1A2
COL3A1
COL4A1
COL5A1
COL6A1
F2
F5
F7
F8
F9
F10
F11
C3
C4A
C5
C6
C7
C8A
C9
SERPINA3
AGT
PLG
AHSG
APOA2
INS
GH1
EPO
TGFB1
CSF2
IL6
TNF
FN1
LAMA1
CST3
SPP1
LTF
VWF""",
    },
    "Human \u2014 ISR upregulated [negative ctrl] (18 genes, Harding 2000)": {
        "species": "human",
        "species2": None,
        "description": "Integrated stress response genes upregulated via uORF mechanism (ATF4 pathway). NEGATIVE CONTROL: translational regulation is uORF-mediated, NOT codon-driven. Expect NO significant codon bias.",
        "genes": """ATF4
ATF5
DDIT3
PPP1R15A
ASNS
TRIB3
ATF3
CHAC1
SLC7A11
CTH
CBS
PSAT1
PHGDH
MTHFD2
SLC7A5
HERPUD1
SLC1A5
WARS1""",
    },
    "Human \u2014 Membrane proteins (50 genes)": {
        "species": "human",
        "species2": None,
        "description": "Transmembrane proteins: receptor tyrosine kinases, ion channels, transporters, GPCRs. AA-driven signal from hydrophobic TM domains.",
        "genes": """EGFR
ERBB2
ERBB3
FGFR1
FGFR2
IGF1R
INSR
PDGFRA
MET
KIT
RET
NTRK1
SCN1A
SCN2A
SCN5A
KCNA1
KCNA2
KCNJ2
KCNQ1
CACNA1A
CACNA1C
CLCN1
CFTR
SLC2A1
SLC2A4
SLC6A3
SLC6A4
ABCB1
ABCC1
SLC12A1
ATP1A1
ATP2A2
ADRB2
DRD2
HTR2A
OPRM1
CHRM1
GRM1
GRIA1
GABRA1
GABBR1
ADORA2A
CD9
CD81
CD44
ITGA2
ITGB1
AQP1
AQP4
TMEM16A""",
    },
    "Cross-species \u2014 Yeast RP for Mode 6 (75 genes)": {
        "species": "yeast",
        "species2": "human",
        "description": "Yeast ribosomal proteins, A-paralog only. For Mode 6 cross-species RSCU comparison vs human orthologs. Expect low correlation (~0.13) because yeast and human use different preferred codons.",
        "genes": """RPL1A
RPL2A
RPL3
RPL4A
RPL5
RPL6A
RPL7A
RPL8A
RPL9A
RPL10
RPL11A
RPL12A
RPL13A
RPL14A
RPL15A
RPL16A
RPL17A
RPL18A
RPL19A
RPL20A
RPL21A
RPL22A
RPL23A
RPL24A
RPL25
RPL26A
RPL27A
RPL28
RPL29
RPL30
RPL31A
RPL32
RPL33A
RPL34A
RPL35A
RPL36A
RPL37A
RPL38
RPL39
RPL40A
RPL41A
RPL42A
RPL43A
RPS0A
RPS1A
RPS2
RPS3
RPS4A
RPS5
RPS6A
RPS7A
RPS8A
RPS9A
RPS10A
RPS11A
RPS12
RPS13
RPS14A
RPS15
RPS16A
RPS17A
RPS18A
RPS19A
RPS20
RPS21A
RPS22A
RPS23A
RPS24A
RPS25A
RPS26A
RPS27A
RPS28A
RPS29A
RPS30A
RPS31""",
    },
}
# ──────────────────────────────────────────────────────────────────────────────

# Auto-detect species from pilot set
_is_custom = PILOT_SET.startswith("Custom")
if not _is_custom and PILOT_SET in _PILOT_SETS:
    _pilot = _PILOT_SETS[PILOT_SET]
    _auto_species = _pilot["species"]
    _auto_species2 = _pilot["species2"]
    if not SPECIES_NAME:
        SPECIES_NAME = _auto_species
    if _auto_species2 and not SPECIES2:
        SPECIES2 = _auto_species2
    print(f"Pilot set: {PILOT_SET}")
    print(f"  {_pilot['description']}")
    print(f"  Species: {SPECIES_NAME}" + (f" vs {SPECIES2}" if SPECIES2 else ""))
else:
    if not SPECIES_NAME:
        SPECIES_NAME = "yeast"
    print("Custom gene list mode. Paste your genes in the text box below.")

# Pick default text based on pilot set selection
if not _is_custom and PILOT_SET in _PILOT_SETS:
    _default_genes = _PILOT_SETS[PILOT_SET]["genes"].strip()
else:
    _default_genes = "# Paste your gene IDs here, one per line"

gene_input = widgets.Textarea(
    value=_default_genes,
    placeholder="Paste gene IDs here (one per line or comma-separated)",
    description="",
    layout=widgets.Layout(width="400px", height="250px"),
)

print("\nGene list (edit below or keep the pilot set):")
ipy_display(gene_input)
print("\n\u2191 Edit the box above, then run the next cell.")

In [None]:
#@title Confirm gene list { display-mode: "form" }

import re

def parse_genes(text):
    """Parse gene list from text: one per line, comma-separated, or mixed."""
    genes = []
    for line in text.strip().split("\n"):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = re.split(r"[,\t]+", line)
        for part in parts:
            part = part.strip()
            if part:
                genes.append(part)
    return genes

gene_ids = parse_genes(gene_input.value)
species2 = SPECIES2 if SPECIES2 else None
tissue = TISSUE if TISSUE else None
cell_line = CELL_LINE if CELL_LINE else None

print(f"Species: {SPECIES_NAME}")
print(f"Gene list: {len(gene_ids)} genes")
if len(gene_ids) <= 20:
    print(f"  {', '.join(gene_ids)}")
else:
    print(f"  {', '.join(gene_ids[:10])}, ... ({len(gene_ids)-10} more)")
if species2:
    print(f"Cross-species comparison: {SPECIES_NAME} vs {species2}")
if tissue:
    print(f"Tissue: {tissue}")
if cell_line:
    print(f"Cell line: {cell_line}")
if len(gene_ids) < 10:
    print("\n⚠ Warning: fewer than 10 genes. Results may lack statistical power.")

## 4. Generate Full Report

Runs all applicable analysis modes and produces a self-contained HTML report with embedded plots.

The report includes:
- **Gene summary** — mapped/unmapped genes, gene name mapping table, CDS statistics
- **Mode 1** — Volcano plots and tables for mono- and dicodon composition
- **Mode 5** — Attribution table (AA-driven vs synonymous codon choice)
- **Mode 3** — Metagene optimality profile with ramp analysis
- **Mode 4** — Ribosome collision potential (FS transition analysis)
- **Mode 2** — Expression-weighted translational demand
- **Mode 6** — Cross-species RSCU correlation (if second species selected)

After the report generates, **run the download cell** to get a zip file containing:
- The full HTML report (viewable in any browser)
- README.txt documenting all analysis parameters, your gene list, and data sources
- Complete TSV data tables for every mode (not truncated — use these for custom plots in R, Python, etc.)
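
For custom plots, the TSV tables can be read straight from the results zip without unpacking it. The sketch below uses only the standard library and makes no assumption about column names. `load_tsv_tables` is a hypothetical helper, and the member names it returns (for example paths under `data/`) depend on what the zip actually contains.

```python
import csv
import io
import zipfile

def load_tsv_tables(zip_path):
    """Return {member_name: list of row dicts} for every .tsv file in the zip."""
    tables = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".tsv"):
                continue  # skip the HTML report and README
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.DictReader(text, delimiter="\t"))
    return tables

# Usage, once the report cell has produced a zip:
# tables = load_tsv_tables(zip_path)
# for name, rows in tables.items():
#     print(name, len(rows), "rows")
```

If you prefer DataFrames, `pandas.read_csv(handle, sep="\t")` on each opened member is an equivalent route.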

In [None]:
#@title Run full report { display-mode: "form" }

import tempfile
from pathlib import Path
from IPython.display import HTML, display
from codonscope.report import generate_report

# Generate report to a temp file
output_path = Path(tempfile.mkdtemp()) / "codonscope_report.html"

print("Running all analysis modes... (this may take 1-2 minutes)")
print()

report_kwargs = dict(
    species=SPECIES_NAME,
    gene_ids=gene_ids,
    output=output_path,
    n_bootstrap=10_000,
    seed=42,
)
if species2:
    report_kwargs["species2"] = species2
if tissue:
    report_kwargs["tissue"] = tissue
if cell_line:
    report_kwargs["cell_line"] = cell_line

report_path = generate_report(**report_kwargs)

# Show what was produced
zip_path = report_path.parent / f"{report_path.stem}_results.zip"
data_dir = report_path.parent / f"{report_path.stem}_data"
data_files = sorted(data_dir.glob("*.tsv")) if data_dir.exists() else []

print(f"\n✓ Report generated: {report_path}")
if zip_path.exists():
    zip_size_mb = zip_path.stat().st_size / (1024 * 1024)
    print(f"✓ Results zip: {zip_path.name} ({zip_size_mb:.1f} MB)")
    print(f"  Contains: HTML report, README.txt, and {len(data_files)} data files:")
    for f in data_files:
        print(f"    - data/{f.name}")
    print(f"\n  → Run the next cell to download the zip.")

# Display report inline
html_content = report_path.read_text(encoding="utf-8")
display(HTML(html_content))

In [None]:
#@title Download results zip (HTML report + full data tables + README) { display-mode: "form" }

# ── Downloads a zip containing: ──────────────────────────────────────────────
#   codonscope_report.html  — self-contained HTML report with embedded plots
#   README.txt              — analysis parameters, gene list, file guide, data sources
#   data/gene_mapping.tsv   — input ID → gene name → systematic/Ensembl ID
#   data/mode1_*.tsv        — full monocodon and dicodon composition results
#   data/mode2_demand.tsv   — expression-weighted translational demand per codon
#   data/mode3_*.tsv        — per-gene optimality scores, ramp/body composition
#   data/mode4_*.tsv        — per-gene collision fractions, per-dicodon FS enrichment
#   data/mode5_*.tsv        — per-codon attribution (AA-driven vs synonymous)
#   data/mode6_*.tsv        — per-gene cross-species RSCU correlation (if applicable)
# ──────────────────────────────────────────────────────────────────────────────

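# report_path is set by the report cell; guard against running this cell first.
zip_path = (
    report_path.parent / f"{report_path.stem}_results.zip"
    if "report_path" in globals() else None
)

if zip_path is None or not zip_path.exists():
    print("No zip file found. Run the report cell above first.")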
else:
    zip_size_mb = zip_path.stat().st_size / (1024 * 1024)
    print(f"Downloading: {zip_path.name} ({zip_size_mb:.1f} MB)")
    print(f"Contains HTML report, README documentation, and all TSV data tables.")
    print(f"TSV files have complete results (not truncated) — use them for custom plots.")
    try:
        from google.colab import files
        files.download(str(zip_path))
    except ImportError:
        print(f"\nNot running in Colab. File is at: {zip_path}")

## 5. Explore Individual Mode Results

Run specific modes individually and view results as sortable tables. Each mode cell has adjustable parameters explained below.

### Parameter Guide

| Parameter | Mode | Options | What it means |
|-----------|------|---------|---------------|
| **KMER_SIZE** | 1 | `1` = monocodon (single codons, 61 total), `2` = dicodon (adjacent codon pairs, 3,721 total), `3` = tricodon (codon triplets, 226,981 total) | Larger k-mers capture context-dependent effects (e.g. ribosome stalling at specific codon pairs) but require more statistical power. Start with `1`. |
| **BACKGROUND** | 1 | `all` = compare against all genes in the genome, `matched` = compare against genes matched for CDS length and GC content | Use `all` first. If GC or length bias is flagged, re-run with `matched` to see which codons remain significant after controlling for nucleotide composition. |
| **METHOD** | 3 | `wtai` = wobble-penalized tRNA Adaptation Index, `tai` = standard tAI without wobble penalty | `wtai` (default) penalises wobble base-pairing (G:U, I:C) by 0.5x because wobble decoding is slower than Watson-Crick. `tai` treats all anticodon-codon pairings equally. Use `wtai` unless you have a reason not to. |
| **RAMP_CODONS** | 3 | Number of codons from the 5' end to define the "ramp" region (default: 50) | Many highly-expressed genes have a stretch of sub-optimal codons near the start that slows initial elongation and spaces out ribosomes. 30-50 codons is typical. |
| **TISSUE** | 2 | GTEx v8 tissue name (e.g. "Liver", "Brain - Cortex") | Sets which tissue's RNA-seq TPM values weight the translational demand calculation. Different tissues express different genes, so the demand landscape changes. |
| **CELL_LINE** | 2 | Cell line name (e.g. "HEK293T") | Uses the closest GTEx tissue as a proxy for cell line expression. |
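
To illustrate what RAMP_CODONS controls, the sketch below splits a list of per-codon optimality scores into a 5' ramp and the remaining body, then compares their means. It is a simplified stand-in, not Mode 3's actual computation: `ramp_vs_body` is a hypothetical helper and the scores are made-up numbers, not real tAI or wtAI values.

```python
from statistics import mean

def ramp_vs_body(scores, ramp_codons=50):
    """Return (ramp_mean, body_mean, body_minus_ramp) for per-codon scores."""
    ramp = scores[:ramp_codons]
    body = scores[ramp_codons:]
    if not ramp or not body:
        raise ValueError("Transcript is too short for this ramp length")
    ramp_mean, body_mean = mean(ramp), mean(body)
    # A positive difference means the 5' end is less optimal than the body.
    return ramp_mean, body_mean, body_mean - ramp_mean

# Toy transcript: 50 sub-optimal codons followed by 150 more optimal ones.
toy_scores = [0.2] * 50 + [0.6] * 150
ramp_mean, body_mean, delta = ramp_vs_body(toy_scores, ramp_codons=50)
print(f"ramp={ramp_mean:.2f} body={body_mean:.2f} delta={delta:.2f}")
```

Changing `ramp_codons` moves the boundary between the two regions, which is why a ramp that is visible at 30 codons can be diluted at larger values.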

### How to interpret results

- **Z-score**: Number of standard deviations by which a codon's frequency in your gene set differs from the genome background. Positive = enriched, negative = depleted.
- **adjusted_p**: Benjamini-Hochberg FDR-corrected p-value. < 0.05 is significant after controlling for multiple testing across all codons.
- **cohens_d**: Effect size independent of sample size. |d| < 0.2 is negligible, 0.2-0.5 is small, 0.5-0.8 is medium, > 0.8 is large. A codon can be statistically significant (low p) but biologically trivial (small d).
- **FS enrichment**: Ratio of fast-to-slow codon transitions in your genes vs genome. > 1.0 = more collision-prone junctions. Highly expressed genes tend to have FS enrichment near or below 1.0.
- **RSCU**: Relative Synonymous Codon Usage. 1.0 = codon used equally with its synonyms. > 1.0 = preferred, < 1.0 = avoided. Excludes Met and Trp (single-codon amino acids).
- **wtAI**: Wobble-penalized tRNA Adaptation Index. Score from 0 to 1 based on tRNA gene copy numbers. Higher = more efficiently decoded. "Fast" codons have wtAI above the genome median, "slow" codons below.
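
The RSCU definition above can be written out directly: each codon's count is divided by the count it would have if all synonyms for that amino acid were used equally. The sketch below is a minimal version using the standard textbook formula. The `rscu` function and the three-family `SYNONYMS` table are illustrative assumptions, not CodonScope's code, and a full table would cover every amino acid with more than one codon.

```python
# Subset of synonymous codon families (standard genetic code), for brevity.
SYNONYMS = {
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
}

def rscu(codon_counts, families=SYNONYMS):
    """RSCU = observed count / (family total / number of synonymous codons)."""
    values = {}
    for codons in families.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined, so skip it
        expected = total / len(codons)
        for codon in codons:
            values[codon] = codon_counts.get(codon, 0) / expected
    return values

# Toy counts with a strong TTG preference for leucine.
toy_counts = {"TTG": 9, "TTA": 1, "AAA": 3, "AAG": 1}
toy_rscu = rscu(toy_counts)
print({codon: round(value, 2) for codon, value in toy_rscu.items()})
```

Within each family the RSCU values sum to the number of synonymous codons, so the maximum possible value is 6 for leucine but only 2 for lysine.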

In [None]:
#@title Mode 1: Sequence Composition { display-mode: "form" }

import pandas as pd
from IPython.display import display, HTML
from codonscope.modes.mode1_composition import run_composition

# 1 = monocodon (61 single codons), 2 = dicodon (3,721 adjacent pairs), 3 = tricodon (226,981 triplets)
KMER_SIZE = 1  #@param [1, 2, 3] {type:"integer"}
# "all" = compare vs whole genome; "matched" = compare vs genes matched for CDS length + GC content
BACKGROUND = "all"  #@param ["all", "matched"]

kmer_labels = {1: "Monocodon", 2: "Dicodon", 3: "Tricodon"}
print(f"Running {kmer_labels[KMER_SIZE]} composition analysis ({BACKGROUND} background)...")
if BACKGROUND == "matched":
    print("  (matched = controls for CDS length and GC content differences)")

result = run_composition(
    species=SPECIES_NAME,
    gene_ids=gene_ids,
    k=KMER_SIZE,
    background=BACKGROUND,
    seed=42,
)

df = result["results"]
n = result["n_genes"]
n_sig = len(df[df["adjusted_p"] < 0.05])
unmapped = result["id_summary"].unmapped

print(f"\n{n} genes mapped, {len(unmapped)} unmapped")
if unmapped:
    print(f"  Unmapped: {', '.join(unmapped[:10])}{'...' if len(unmapped) > 10 else ''}")
print(f"{n_sig} significant {kmer_labels[KMER_SIZE].lower()}s (adjusted p < 0.05)")

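# Show top results, sorted explicitly by |z_score| (does not rely on the returned row order)
print("\nTop 20 by |Z-score|:")
top = df.sort_values("z_score", key=lambda s: s.abs(), ascending=False).head(20).copy()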
# Round p_value to 4 places (as for adjusted_p) so values below 0.005 do not all display as 0.0
top = top.round({"observed_freq": 5, "expected_freq": 5, "z_score": 2, "p_value": 4, "adjusted_p": 4, "cohens_d": 3})
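# Styler.applymap was renamed to Styler.map in pandas 2.1; use whichever is available
top_styler = top.style
highlight = top_styler.map if hasattr(top_styler, "map") else top_styler.applymap
display(highlight(
    lambda v: "background-color: #d4edda" if isinstance(v, float) and v < 0.05 else "",
    subset=["adjusted_p"]
))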

In [None]:
#@title Mode 2: Translational Demand { display-mode: "form" }

from codonscope.modes.mode2_demand import run_demand

# Override tissue/cell line for this mode (uses values from cell 3 if left blank)
MODE2_TISSUE = ""  #@param ["", "cross_tissue_median", "Liver", "Lung", "Brain - Cortex", "Brain - Cerebellum", "Heart - Left Ventricle", "Kidney - Cortex", "Muscle - Skeletal", "Pancreas", "Whole Blood", "Breast - Mammary Tissue", "Thyroid", "Testis", "Ovary", "Prostate"]
MODE2_CELL_LINE = ""  #@param ["", "HEK293T", "HeLa", "K562", "A549", "MCF7", "U2OS", "HepG2", "SH-SY5Y"]

print("Running translational demand analysis...")
print("  Weights each codon by TPM × number of codons per gene")

demand_kwargs = dict(species=SPECIES_NAME, gene_ids=gene_ids, seed=42)
# Use mode-specific override if set, otherwise fall back to cell 3 values
t = MODE2_TISSUE if MODE2_TISSUE else (tissue if tissue else None)
cl = MODE2_CELL_LINE if MODE2_CELL_LINE else (cell_line if cell_line else None)
if t:
    demand_kwargs["tissue"] = t
if cl:
    demand_kwargs["cell_line"] = cl

demand_result = run_demand(**demand_kwargs)

print(f"\n{demand_result['n_genes']} genes analysed")
print(f"Expression source: {demand_result.get('tissue', 'default')}")

df_demand = demand_result["results"]
n_sig = len(df_demand[df_demand["adjusted_p"] < 0.05])
print(f"{n_sig} codons with significant demand bias (adjusted p < 0.05)")

print("\nTop 20 codons by |Z-score|:")
top_demand = df_demand.head(20).copy()
display(top_demand.round(4))

if "top_genes" in demand_result and demand_result["top_genes"] is not None:
    print("\nTop demand-contributing genes (Demand % = share of total translational output):")
    display(demand_result["top_genes"].head(10))

In [None]:
#@title Mode 3: Optimality Profile { display-mode: "form" }

import matplotlib.pyplot as plt
import numpy as np
from codonscope.modes.mode3_profile import run_profile

# Number of codons from the 5' end to define the "ramp" region (typically 30-50)
RAMP_CODONS = 50  #@param {type:"integer"}
# wtai = wobble-penalized tRNA Adaptation Index (penalises wobble decoding 0.5x)
# tai = standard tAI (treats all anticodon-codon pairings equally)
METHOD = "wtai"  #@param ["wtai", "tai"]

print(f"Running optimality profile (method={METHOD}, ramp={RAMP_CODONS} codons)...")
if METHOD == "wtai":
    print("  wtAI: higher score = more tRNA gene copies + Watson-Crick decoding preferred")
else:
    print("  tAI: higher score = more tRNA gene copies (no wobble penalty)")

profile_result = run_profile(
    species=SPECIES_NAME,
    gene_ids=gene_ids,
    ramp_codons=RAMP_CODONS,
    method=METHOD,
    seed=42,
)

print(f"\n{profile_result['n_genes']} genes analysed")

# Plot metagene profile
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

mg_gs = profile_result["metagene_geneset"]
mg_bg = profile_result["metagene_genome"]
x = np.arange(len(mg_gs))

ax1.plot(x, mg_gs, label="Gene set", linewidth=2)
ax1.plot(x, mg_bg, label="Genome", linewidth=1, alpha=0.7)
ax1.set_xlabel("Normalised position (%)")
ax1.set_ylabel(f"{METHOD.upper()} score")
ax1.set_title("Metagene Optimality Profile")
ax1.axvline(x=RAMP_CODONS * 100 / max(len(mg_gs), 1), color="gray", linestyle="--", alpha=0.5, label="Ramp boundary")
ax1.legend()  # called after axvline so the ramp boundary appears in the legend

# Ramp analysis summary
ramp = profile_result["ramp_analysis"]
categories = ["Ramp\n(gene set)", "Body\n(gene set)", "Ramp\n(genome)", "Body\n(genome)"]
values = [ramp["ramp_mean_geneset"], ramp["body_mean_geneset"],
          ramp["ramp_mean_genome"], ramp["body_mean_genome"]]
colors = ["#1f77b4", "#1f77b4", "#ff7f0e", "#ff7f0e"]
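bars = ax2.bar(categories, values, color=colors)
# Bar patches accept only a scalar alpha, so set per-bar transparency individually
for bar, bar_alpha in zip(bars, [0.6, 1.0, 0.6, 1.0]):
    bar.set_alpha(bar_alpha)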
ax2.set_ylabel(f"Mean {METHOD.upper()}")
ax2.set_title("Ramp vs Body Optimality")

plt.tight_layout()
plt.show()

print(f"\nRamp analysis (ramp = first {RAMP_CODONS} codons, body = rest):")
print(f"  Gene set ramp: {ramp['ramp_mean_geneset']:.4f}, body: {ramp['body_mean_geneset']:.4f}")
print(f"  Genome   ramp: {ramp['ramp_mean_genome']:.4f}, body: {ramp['body_mean_genome']:.4f}")
delta = ramp['body_mean_geneset'] - ramp['ramp_mean_geneset']
print(f"  Ramp delta (body - ramp): {delta:+.4f} {'(ramp is slower → expected for highly-expressed genes)' if delta > 0 else ''}")

if profile_result.get("ramp_composition") is not None:
    print("\nRamp codon composition (top 10 by frequency difference):")
    display(profile_result["ramp_composition"].head(10))

In [None]:
#@title Mode 4: Collision Potential { display-mode: "form" }

import pandas as pd  # needed for the transition table below if the Mode 1 cell was skipped
from codonscope.modes.mode4_collision import run_collision

print("Running collision potential analysis...")

collision_result = run_collision(
    species=SPECIES_NAME,
    gene_ids=gene_ids,
)

print(f"\n{collision_result['n_genes']} genes analysed")
print(f"FS enrichment ratio: {collision_result['fs_enrichment']:.3f}")
print(f"  (>1.0 = more fast→slow transitions than expected)")
print(f"Chi-squared p-value: {collision_result['chi2_p']:.2e}")

# Transition matrix
print("\nTransition proportions (gene set vs genome):")
tm_gs = collision_result["transition_matrix_geneset"]
tm_bg = collision_result["transition_matrix_genome"]
trans_df = pd.DataFrame({
    "Transition": ["FF", "FS", "SF", "SS"],
    "Gene set": [tm_gs.get(t, 0) for t in ["FF", "FS", "SF", "SS"]],
    "Genome": [tm_bg.get(t, 0) for t in ["FF", "FS", "SF", "SS"]],
}).set_index("Transition")
display(trans_df.round(4))

# Per-dicodon FS enrichment
if collision_result.get("fs_dicodons") is not None and len(collision_result["fs_dicodons"]) > 0:
    print("\nTop FS dicodons (most enriched fast→slow transitions):")
    display(collision_result["fs_dicodons"].head(15).round(4))

In [None]:
#@title Mode 5: AA vs Codon Disentanglement { display-mode: "form" }

from codonscope.modes.mode5_disentangle import run_disentangle

print("Running disentanglement analysis...")

dis_result = run_disentangle(
    species=SPECIES_NAME,
    gene_ids=gene_ids,
    seed=42,
)

summary = dis_result["summary"]
print(f"\n{dis_result['n_genes']} genes analysed")
print(f"\nSignificant codons: {summary['n_significant_codons']}")
print(f"  AA-driven:          {summary['n_aa_driven']} ({summary['pct_aa_driven']:.0f}%)")
print(f"  Synonymous-driven:  {summary['n_synonymous_driven']} ({summary['pct_synonymous_driven']:.0f}%)")
print(f"  Both:               {summary['n_both']} ({summary['pct_both']:.0f}%)")
print(f"\n{summary['summary_text']}")

# Attribution table
print("\nPer-codon attribution:")
attr = dis_result["attribution"].copy()
attr_display = attr[["codon", "amino_acid", "aa_z_score", "rscu_z_score", "attribution"]].copy()
attr_display = attr_display.round({"aa_z_score": 2, "rscu_z_score": 2})

def color_attribution(val):
    colors = {
        "AA-driven": "background-color: #cce5ff",
        "Synonymous-driven": "background-color: #d4edda",
        "Both": "background-color: #fff3cd",
        "None": "",
    }
    return colors.get(val, "")

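# Styler.applymap was renamed to Styler.map in pandas 2.1; use whichever is available
attr_styler = attr_display.style
apply_style = attr_styler.map if hasattr(attr_styler, "map") else attr_styler.applymap
display(apply_style(color_attribution, subset=["attribution"]))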

# Synonymous drivers
syn = dis_result["synonymous_drivers"]
syn_sig = syn[syn["driver"] != "not_applicable"]
if len(syn_sig) > 0:
    print("\nSynonymous driver classification:")
    display(syn_sig[["codon", "amino_acid", "rscu_z_score", "driver"]].round(2))

In [None]:
#@title Mode 6: Cross-Species Comparison (requires orthologs) { display-mode: "form" }

if not species2:
    print("Skipped: no second species selected in cell 3.")
    print("Set SPECIES2 above and re-run to enable cross-species comparison.")
else:
    from codonscope.modes.mode6_compare import run_compare

    print(f"Running cross-species comparison: {SPECIES_NAME} vs {species2}...")

    compare_result = run_compare(
        species1=SPECIES_NAME,
        species2=species2,
        gene_ids=gene_ids,
        seed=42,
    )

    s = compare_result["summary"]
    print(f"\n{compare_result['n_orthologs']} ortholog pairs found in gene set")
    print(f"{compare_result['n_genome_orthologs']} ortholog pairs genome-wide")
    print(f"\nMean RSCU correlation:")
    print(f"  Gene set: {s['mean_r_geneset']:.3f}")
    print(f"  Genome:   {s['mean_r_genome']:.3f}")
    print(f"  Z-score:  {s['z_score']:.2f}")
    print(f"  P-value:  {s['p_value']:.2e}")

    # Per-gene correlations
    print("\nPer-gene RSCU correlations:")
    per_gene = compare_result["per_gene"].sort_values("r", ascending=False)
    display(per_gene.head(20).round(3))

    # Divergent genes
    if compare_result.get("divergent_analysis") is not None and len(compare_result["divergent_analysis"]) > 0:
        print("\nMost divergent genes (different preferred codons):")
        display(compare_result["divergent_analysis"].head(10).round(3))

---

## Tips

- **Low Z-scores for highly expressed genes?** This is expected in Mode 2 when your gene set dominates the genome's translational demand (e.g., ribosomal proteins in yeast make up ~72% of demand): the gene set is then effectively its own background.
- **Matched vs all background:** Use `matched` in Mode 1 to control for CDS length and GC content. This removes those two compositional confounds, so the differences that remain are more likely to reflect codon selection.
- **Tricodon analysis:** Requires large gene sets (>50 genes) for statistical power. 226,981 possible tricodons means heavy multiple testing correction.
- **Cross-species comparison:** Low RSCU correlation between orthologs can be biologically meaningful: it suggests that different tRNA pools drive different codon preferences in the two species.
- **Custom expression data:** Use `--expression-file` in the CLI or the `expression_file` parameter to supply your own TPM values for demand analysis.
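
The multiple-testing burden behind the tricodon tip can be made concrete with a little arithmetic. The sketch below counts the k-mer search space (61 sense codons raised to the power k) and applies a textbook Benjamini-Hochberg adjustment with NumPy. `kmer_space` and `bh_adjust` are illustrative names, not CodonScope functions, and CodonScope's own correction may differ in detail (for instance, it may test only the k-mers actually observed).

```python
import numpy as np

N_SENSE_CODONS = 61  # 64 codons minus the 3 stop codons


def kmer_space(k):
    """Number of possible k-mers built from sense codons."""
    return N_SENSE_CODONS ** k


def bh_adjust(p_values):
    """Textbook Benjamini-Hochberg adjusted p-values (illustrative helper)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


for k in (1, 2, 3):
    m = kmer_space(k)
    # If only one k-mer stands out, BH requires its raw p-value to be <= alpha / m.
    print(f"k={k}: {m:,} tests, lone-hit threshold at alpha=0.05 is {0.05 / m:.1e}")

print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

With all 226,981 tricodons tested, a single standout tricodon would need a raw p-value of roughly 2.2e-7 to survive at FDR 0.05, which is why small gene sets rarely yield significant tricodons.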