# FISH-RT Probe Design Pipeline (Simplified)

This notebook provides a streamlined 4-step workflow for designing FISH-RT probes:

1. **Run main.py** - Automated probe design with SNP-first selection
2. **Review results** - Examine probe quality and BLAST specificity
3. **Design forward primers** - Primer3 primers upstream of each RT probe
4. **Add RTBC barcode** - Post-process with custom barcode (optional)

## Pipeline Features
- SNP-first probe selection (maximizes allelic discrimination potential)
- Local BLAST specificity validation
- Adaptive PNAS filtering (automatically relaxes rules if needed)
- 200 nt RT coverage for SNP detection

---

## Step 0: Configuration

Edit the parameters below to customize your analysis.

In [1]:
# ========================================
# CONFIGURATION - EDIT THESE PARAMETERS
# ========================================

# Gene list to process (Example: ["Nanog", "Mecp2", "Xist"])
# GENE_LIST = [
#     "Nanog",
#     "Mecp2",
#     "Xist",
# ]

# Example: XCI-focused test set (21 genes: mostly X-linked, plus a few autosomal genes such as Nanog and Pou5f1)
GENE_LIST = [
    "Atrx",
    "Diaph2",
    "Gpc4",
    "Hdac3",
    "Hnrnpu",
    "Kdm5c",
    "Kdm6a",
    "Mecp2",
    "Mid1",
    "Nanog",
    "Pir",
    "Pou5f1",
    "Rbmx",
    "Rlim",
    "Rps6ka3",
    "Rps6ka6",
    "Smc1a",
    "Spen",
    "Tsix",
    "Xist",
    "Zfp42",
]

# Output directory
OUTPUT_DIR = "/Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI"

# RTBC barcode for synthesis (added as final step)
RTBC_SEQUENCE = "/5Phos/TGACTTGAGGAT"

# ========================================
# PROBE SELECTION PARAMETERS
# ========================================
# Selection logic: keep probes with >= M SNPs, capped at the top N per gene,
# so each gene yields min(N, number of probes with >= M SNPs) probes
MAX_PROBES_PER_GENE = 200  # N: maximum probes per gene
MIN_SNPS_FOR_SELECTION = 3  # M: minimum SNPs for probe inclusion

# ========================================

---

## Step 1: Run Probe Design Pipeline

This runs the full automated pipeline:
1. Design probes using Oligostan algorithm
2. Apply GC, dustmasker, and homopolymer quality filters
3. Analyze SNP coverage on **all** filtered probes
4. Select probes: keep those with ≥M SNPs, capped at the top N per gene, i.e. **min(N, probes with ≥M SNPs)**
5. Run local BLAST for specificity validation
6. Generate output files
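
The selection rule in step 4 can be sketched in pandas as follows. This is a minimal illustration, not the code in `main.py`: it assumes that "top N" means ranking by `SNP_Count` (highest first), and `select_probes` and the toy table are hypothetical.

```python
import pandas as pd

def select_probes(df, max_probes=200, min_snps=3):
    """Keep probes with >= min_snps SNPs, then cap each gene at max_probes."""
    # Assumption: probes are ranked by SNP count, highest first
    eligible = df[df["SNP_Count"] >= min_snps]
    ranked = eligible.sort_values("SNP_Count", ascending=False, kind="stable")
    # head() on a groupby keeps the first max_probes rows of every gene
    return ranked.groupby("GeneName", sort=False).head(max_probes)

# Toy input: Xist has three eligible probes, Rlim has one
toy = pd.DataFrame({
    "GeneName": ["Xist"] * 4 + ["Rlim"] * 2,
    "SNP_Count": [5, 3, 2, 4, 3, 1],
})
selected = select_probes(toy, max_probes=2, min_snps=3)
counts = selected.groupby("GeneName").size().to_dict()
print(counts)
```

With `max_probes=2`, Xist is capped at two probes while Rlim keeps its single eligible probe, which is the min(N, eligible) behaviour described above.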

In [2]:
import subprocess
import os

# Build command
cmd = ["python3", "main.py", "--output", OUTPUT_DIR]
cmd.extend(["--max-probes", str(MAX_PROBES_PER_GENE)])
cmd.extend(["--min-snps", str(MIN_SNPS_FOR_SELECTION)])

# Always use --genes
cmd.extend(["--genes"] + GENE_LIST)

# Run pipeline
print(f"🚀 Running: {' '.join(cmd)}")
print("=" * 60)

result = subprocess.run(
    cmd,
    capture_output=False,  # stream pipeline output directly into the notebook
    cwd=os.getcwd(),  # main.py is expected in the notebook's working directory
)

if result.returncode == 0:
    print("\n✅ Pipeline completed successfully!")
else:
    print(f"\n❌ Pipeline failed with exit code {result.returncode}")

🚀 Running: python3 main.py --output /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI --max-probes 200 --min-snps 3 --genes Atrx Diaph2 Gpc4 Hdac3 Hnrnpu Kdm5c Kdm6a Mecp2 Mid1 Nanog Pir Pou5f1 Rbmx Rlim Rps6ka3 Rps6ka6 Smc1a Spen Tsix Xist Zfp42
🧬 smfish-like-rt-probe-designer
Mode: Local GTF + FASTA files only
Using custom output directory:
/Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI
Processing 21 specified genes
Output directory:
/Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI

Configuration:
 PNAS rules: [1, 2, 4]
 Dustmasker: ✅ Enabled
 RT coverage: 200 nt downstream

Initializing pipeline components...
✅ Using local file

---

## Step 1.1: Post-run Homopolymer Filter (Optional)

If you have already run the pipeline and want to apply the homopolymer filter without re-running the design step, use this cell.
It removes any probe with more than 4 consecutive identical bases (e.g., AAAAA) and overwrites the CSV files in place.

In [None]:
import pandas as pd
from pathlib import Path

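def has_homopolymer(sequence, max_length=4):
    """Return True if any base repeats more than max_length times in a row."""
    # Non-string values (e.g. NaN from empty CSV cells) are treated as passing
    if not isinstance(sequence, str):
        return False
    sequence = sequence.upper()
    return any(base * (max_length + 1) in sequence for base in "ATCG")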

output_path = Path(OUTPUT_DIR)
target_files = [
    output_path / "FISH_RT_probes_PRE_BLAST_CANDIDATES.csv",
    output_path / "FISH_RT_probes_FINAL_SELECTION.csv"
]

for file_path in target_files:
    if file_path.exists():
        df = pd.read_csv(file_path)
        original_count = len(df)

        # Filter based on Probe_Seq
        # Note: We use Probe_Seq because it's the actual oligo sequence
        df = df[~df['Probe_Seq'].apply(has_homopolymer)]

        new_count = len(df)
        print(f"🔹 {file_path.name}: {original_count} -> {new_count} probes (Removed {original_count - new_count})")

        # Save filtered results
        df.to_csv(file_path, index=False)
        print(f"   ✅ Updated {file_path.name}")
    else:
        print(f"⚠️ File not found: {file_path.name}")

🔹 FISH_RT_probes_PRE_BLAST_CANDIDATES.csv: 1096 -> 1034 probes (Removed 62)
   ✅ Updated FISH_RT_probes_PRE_BLAST_CANDIDATES.csv
🔹 FISH_RT_probes_FINAL_SELECTION.csv: 1096 -> 1034 probes (Removed 62)
   ✅ Updated FISH_RT_probes_FINAL_SELECTION.csv


---

## Step 2: Review Results

Examine the generated probe files and their quality metrics.

In [3]:
import pandas as pd
from pathlib import Path

output_path = Path(OUTPUT_DIR)

# Load filtered probes
final_file = output_path / "FISH_RT_probes_FINAL_SELECTION.csv"
candidate_file = output_path / "FISH_RT_probes_PRE_BLAST_CANDIDATES.csv"

if final_file.exists():
    df_all = pd.read_csv(final_file)
    print(f"📊 Total probes in final selection: {len(df_all)}")
    print(f"   Genes: {df_all['GeneName'].nunique()}")

    # Per-gene summary
    print("\n📋 Per-gene breakdown:")
    for gene in sorted(df_all['GeneName'].unique()):
        gene_df = df_all[df_all['GeneName'] == gene]
        # Format metrics as strings first so a missing column prints '-' instead of raising
        blast_unique = str(gene_df['BLAST_Unique'].sum()) if 'BLAST_Unique' in gene_df.columns else '-'
        avg_snps = f"{gene_df['SNP_Count'].mean():.1f}" if 'SNP_Count' in gene_df.columns else '-'
        print(f"   {gene}: {len(gene_df)} probes, avgSNP={avg_snps}, BLAST_unique={blast_unique}")
else:
    print(f"⚠️ File not found: {final_file.name}. Run Step 1 first.")

if candidate_file.exists():
    df_cand = pd.read_csv(candidate_file)
    print(f"\n🎯 PRE_BLAST_CANDIDATES: {len(df_cand)}")
    print("   Compare with the final selection to see how many probes BLAST filtering removed.")
else:
    print(f"⚠️ File not found: {candidate_file.name}. Run Step 1 first.")

📊 Total probes in final selection: 1034
   Genes: 21

📋 Per-gene breakdown:
   Atrx: 59 probes, avgSNP=3.6, BLAST_unique=59
   Diaph2: 23 probes, avgSNP=3.2, BLAST_unique=23
   Gpc4: 113 probes, avgSNP=3.9, BLAST_unique=113
   Hdac3: 29 probes, avgSNP=3.8, BLAST_unique=29
   Hnrnpu: 7 probes, avgSNP=3.1, BLAST_unique=7
   Kdm5c: 23 probes, avgSNP=3.3, BLAST_unique=23
   Kdm6a: 70 probes, avgSNP=3.4, BLAST_unique=70
   Mecp2: 53 probes, avgSNP=3.7, BLAST_unique=53
   Mid1: 93 probes, avgSNP=9.6, BLAST_unique=93
   Nanog: 24 probes, avgSNP=3.5, BLAST_unique=24
   Pir: 110 probes, avgSNP=4.3, BLAST_unique=110
   Pou5f1: 14 probes, avgSNP=3.9, BLAST_unique=14
   Rbmx: 5 probes, avgSNP=4.6, BLAST_unique=5
   Rlim: 1 probes, avgSNP=3.0, BLAST_unique=1
   Rps6ka3: 79 probes, avgSNP=4.1, BLAST_unique=79
   Rps6ka6: 37 probes, avgSNP=3.4, BLAST_unique=37
   Smc1a: 47 probes, avgSNP=3.9, BLAST_unique=47
   Spen: 132 probes, avgSNP=4.2, BLAST_unique=132
   Tsix: 51 probes, avgSNP=3.6, BLAST_uniqu

---

## Step 3: Design Forward Primers

Design forward primers 200-250 bp upstream of the RT probe using Primer3.
Placing the primer beyond the 200 nt RT coverage region ensures the PCR amplicon fully covers the SNPs in that region.
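
As a rough sketch of the placement arithmetic, the snippet below computes the upstream search window for a probe. It is not taken from `design_forward_primers.py`: the helper `forward_primer_window` is hypothetical, and it assumes 0-based transcript coordinates with distances measured from the first base of the probe binding site.

```python
def forward_primer_window(probe_start, min_dist=200, max_dist=250):
    """Return (start, end) of the upstream search window, end-exclusive.

    Coordinates are 0-based positions on the transcript (an assumption).
    Returns None when the probe sits too close to the transcript 5' end.
    """
    window_end = probe_start - min_dist
    if window_end <= 0:
        return None
    # Clip the far edge at the transcript start
    window_start = max(0, probe_start - max_dist)
    return (window_start, window_end)

print(forward_primer_window(1000))
print(forward_primer_window(220))
print(forward_primer_window(150))
```

A probe starting at position 1000 gets the window 750 to 800; probes within 200 nt of the transcript start have no valid window under these assumptions.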

In [None]:
# Step 3: Design Forward Primers
from pathlib import Path

# Input is the final selection from Step 1
INPUT_FILE = Path(OUTPUT_DIR) / "FISH_RT_probes_FINAL_SELECTION.csv"
OUTPUT_FILE = Path(OUTPUT_DIR) / "FISH_RT_probes_WITH_PRIMERS.csv"
GENOME_FASTA = "/Volumes/guttman/genomes/mm10/fasta/mm10.fa"  # Path from config

if INPUT_FILE.exists():
    print(f"🚀 Designing forward primers for {INPUT_FILE.name}...")
    !python3 design_forward_primers.py "{INPUT_FILE}" "{OUTPUT_FILE}" --genome "{GENOME_FASTA}"
else:
    print(f"⚠️ Input file not found: {INPUT_FILE.name}. Run Step 1 first.")

---

## Step 4: Add RTBC Barcode (Optional)

Add your custom RTBC barcode to the probe sequences for synthesis.
This is a **separate post-processing step** so you can easily try different barcodes.
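
The barcoding step amounts to a string concatenation per probe. The sketch below is illustrative only and may differ from `add_rtbc_barcode.py`: it assumes the barcode is placed at the 5' end of the probe (suggested by the `/5Phos/` 5' modification tag), and the `add_rtbc` helper and `Full_Oligo` column name are hypothetical.

```python
import pandas as pd

def add_rtbc(probe_seq, rtbc="/5Phos/TGACTTGAGGAT"):
    """Prepend the barcode (with its 5' modification tag) to the probe."""
    # Assumption: barcode sits 5' of the probe, leaving the probe 3' end free
    return rtbc + probe_seq.upper()

# Toy probes; real input would come from the Probe_Seq column of the CSV
probes = pd.DataFrame({"Probe_Seq": ["acgtacgtacgt", "TTGGCCAATTGG"]})
probes["Full_Oligo"] = probes["Probe_Seq"].apply(add_rtbc)
print(probes["Full_Oligo"].tolist())
```

Because the barcode is a single parameter, trying a different one only requires changing `RTBC_SEQUENCE` and re-running this step.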

In [4]:
from pathlib import Path

# Choose which file to add RTBC to
# We prefer the version with primers if it exists
input_primers = Path(OUTPUT_DIR) / "FISH_RT_probes_WITH_PRIMERS.csv"
input_final = Path(OUTPUT_DIR) / "FISH_RT_probes_FINAL_SELECTION.csv"

if input_primers.exists():
    INPUT_FILE = input_primers
elif input_final.exists():
    INPUT_FILE = input_final
else:
    INPUT_FILE = None

OUTPUT_FILE = Path(OUTPUT_DIR) / "FISH_RT_probes_SYNTHESIS_READY.csv"

if INPUT_FILE:
    print(f"🧬 Adding RTBC barcode to {INPUT_FILE.name}...")
    !python3 add_rtbc_barcode.py "{INPUT_FILE}" "{OUTPUT_FILE}" --rtbc "{RTBC_SEQUENCE}"
    print(f"\n✅ Synthesis-ready probes saved to: {OUTPUT_FILE}")
else:
    print("⚠️ No input files found to add barcode. Run Step 1 or 3 first.")

📄 Loaded 1034 probes from /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI/FISH_RT_probes_FINAL_SELECTION.csv
🧬 Adding RTBC barcode: /5Phos/TGACTTGAGGAT
✅ Saved 1034 probes to /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI/FISH_RT_probes_SYNTHESIS_READY.csv

📊 Summary:
   Original probe length: 29.5 nt
   RTBC length: 12 nt
   Full oligo length: 48.5 nt

✅ Synthesis-ready probes saved to: /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI/FISH_RT_probes_SYNTHESIS_READY.csv
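
Note on the summary above: 29.5 nt + 12 nt is 41.5 nt, not 48.5 nt. The 7-unit difference matches the 7 characters of the `/5Phos/` tag, so the reported full length probably counts the tag as sequence (an inference from the numbers, not confirmed against the script). A minimal sketch of counting bases only, using a hypothetical `nucleotide_length` helper:

```python
import re

def nucleotide_length(oligo):
    """Count bases only, ignoring slash-delimited tags such as /5Phos/."""
    return len(re.sub(r"/[^/]+/", "", oligo))

rtbc = "/5Phos/TGACTTGAGGAT"
raw_length = len(rtbc)                # characters, tag included
base_length = nucleotide_length(rtbc) # bases only
print(raw_length, base_length)
```

Under this reading, the synthesized oligos average 41.5 nt of actual sequence.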


---

## Output Files

| File | Description |
|------|-------------|
| `FISH_RT_probes_PRE_BLAST_CANDIDATES.csv` | All high-quality candidate probes before BLAST specificity filtering |
| `FISH_RT_probes_FINAL_SELECTION.csv` | **Final Selection** (up to N probes per gene, all BLAST-unique) |
| `FISH_RT_probes_WITH_PRIMERS.csv` | Probes with designed forward primers for validation |
| `FISH_RT_probes_SYNTHESIS_READY.csv` | Final probes with RTBC barcode added |
| `*.fasta` | FASTA format for BLAST or other analysis |