# FISH-RT Probe Design Pipeline
This notebook provides a streamlined 4-phase workflow for designing FISH-RT probes:

1. **Phase 1: Candidate Design** - Automated probe design with SNP-first selection
2. **Phase 2: Specificity Validation** - Local BLAST specificity check and hit reporting
3. **Phase 3: Forward Primer Design** - Automated primer design with specificity validation
4. **Phase 4: Synthesis Prep** - Post-process with custom RTBC barcodes

## Pipeline Features
- SNP-first probe selection (maximizes allelic discrimination potential)
- Data-validated BLAST specificity selection (word_size=11, evalue=0.1)
- 200-500 nt RT coverage guaranteed through dynamic primer search space
- Fully modular scripts for each phase
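
As a rough illustration of the BLAST settings listed above, the sketch below assembles a `blastn` call with `word_size=11` and `evalue=0.1`. The helper name `build_blastn_command`, the file paths, and the tabular output format (`-outfmt 6`) are assumptions made for this example; the actual validation script may pass different or additional options.

```python
from pathlib import Path


def build_blastn_command(query_fasta, blast_db, out_file, word_size=11, evalue=0.1):
    """Assemble a blastn call as an argument list suitable for subprocess.run."""
    return [
        "blastn",
        "-query", str(query_fasta),
        "-db", str(blast_db),
        "-out", str(out_file),
        "-word_size", str(word_size),
        "-evalue", str(evalue),
        "-outfmt", "6",  # assumption: tabular output, one hit per line
    ]


cmd = build_blastn_command(
    Path("candidates.fasta"),   # hypothetical query file
    Path("blastdb/genome_db"),  # hypothetical BLAST database prefix
    Path("blast_results.txt"),  # hypothetical output file
)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the search, provided BLAST+ is installed and the database exists.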


---
## Step 0: Configuration
Edit the parameters below to customize your analysis.



In [1]:
# ========================================
# CONFIGURATION - EDIT THESE PARAMETERS
# ========================================
# Gene list to process (Example: ["Nanog", "Mecp2", "Xist"])
# GENE_LIST = [
#     "Nanog",
#     "Mecp2",
#     "Xist",
# ]
# Example: X-chromosome test set (21 genes)
GENE_LIST = [
    # "Atrx",
    # "Diaph2",
    # "Gpc4",
    # "Hdac3",
    # "Hnrnpu",
    # "Kdm5c",
    # "Kdm6a",
    # "Mecp2",
    # "Mid1",
    # "Nanog",
    # "Pir",
    # "Pou5f1",
    # "Rbmx",
    # "Rlim",
    # "Rps6ka3",
    # "Rps6ka6",
    # "Smc1a",
    # "Spen",
    # "Tsix",
    "Xist",
    # "Zfp42",
]
# Output directory
OUTPUT_DIR = "/Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI"
# RTBC barcode for synthesis (added as final step)
RTBC_SEQUENCE = "/5Phos/TGACTTGAGGAT"
# ========================================
# PROBE SELECTION PARAMETERS
# ========================================
# Selection logic: min(top N, probes with >= M SNPs)
MAX_PROBES_PER_GENE = 500  # Maximum probes per gene
MIN_SNPS_FOR_SELECTION = 3  # Minimum SNPs for probe inclusion
# ========================================
# RT COVERAGE PARAMETERS (nt)
# ========================================
RT_COVERAGE_DOWNSTREAM = 500  # Region to cover with RT product
# ========================================
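
The selection rule noted in the configuration, `min(top N, probes with >= M SNPs)`, can be read as: keep only probes that overlap at least `MIN_SNPS_FOR_SELECTION` SNPs, rank them by SNP count, and cap the result at `MAX_PROBES_PER_GENE`. The sketch below follows that reading. The function `select_probes` and the `snp_count` field are hypothetical names, and the real script may rank or break ties differently.

```python
def select_probes(probes, max_probes=500, min_snps=3):
    """Keep probes with at least min_snps SNPs, highest SNP count first, capped at max_probes."""
    eligible = [p for p in probes if p["snp_count"] >= min_snps]
    # sorted() is stable, so probes with equal SNP counts keep their input order
    ranked = sorted(eligible, key=lambda p: p["snp_count"], reverse=True)
    return ranked[:max_probes]


# Hypothetical candidates for a single gene
candidates = [
    {"probe_id": "probe_a", "snp_count": 5},
    {"probe_id": "probe_b", "snp_count": 2},
    {"probe_id": "probe_c", "snp_count": 3},
    {"probe_id": "probe_d", "snp_count": 7},
]

selected = select_probes(candidates, max_probes=2, min_snps=3)
print([p["probe_id"] for p in selected])  # ['probe_d', 'probe_a']
```

With the defaults of 500 and 3, a gene with fewer than 500 eligible probes simply returns all of them.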


## Step 1: Candidate Design (Oligostan with filters)



In [2]:
import subprocess
import os
# Build command for Phase 1: Candidate Generation
cmd = ["python3", "design_candidate_probes.py", "--output", OUTPUT_DIR]
cmd.extend(["--max-probes", str(MAX_PROBES_PER_GENE)])
cmd.extend(["--min-snps", str(MIN_SNPS_FOR_SELECTION)])
cmd.extend(["--rt-coverage", str(RT_COVERAGE_DOWNSTREAM)])
# Always use --genes
cmd.extend(["--genes"] + GENE_LIST)
# Run Phase 1
print(f"🚀 Running Phase 1 (Candidate Design): {' '.join(cmd)}")
print("=" * 60)
result = subprocess.run(
    cmd,
    capture_output=False,
    cwd=os.getcwd(),  # run the script from the notebook's working directory
)
if result.returncode == 0:
    print("\n✅ Phase 1 completed successfully! Candidates generated.")
else:
    print(f"\n❌ Phase 1 failed with exit code {result.returncode}")


🚀 Running Phase 1 (Candidate Design): python3 design_candidate_probes.py --output /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI --max-probes 500 --min-snps 3 --rt-coverage 500 --genes Xist
🧬 smfish-like-rt-probe-designer
Mode: Local GTF + FASTA files only
Using custom output directory:
/Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI
Using custom RT coverage: 500 nt
Processing 1 specified genes
Output directory:
/Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI

Configuration:
 PNAS rules: [1, 2, 4]
 Dustmasker: ✅ Enabled
 RT coverage: 500 nt downstream

Initializing pipeline components...
✅

---
## Step 2: Specificity Validation (BLAST)



In [3]:
import subprocess
from pathlib import Path
# Configuration for Phase 2: Specificity Validation
CANDIDATES_CSV = Path(OUTPUT_DIR) / "FISH_RT_probes_CANDIDATES.csv"
if not CANDIDATES_CSV.exists():
    print(f"⚠️ Candidates CSV not found at {CANDIDATES_CSV}. Run Step 1 first.")
else:
    # Run the specificity validation script
    print("🎯 Validating probe specificity...")
    validate_cmd = [
        "python3", "validate_probe_specificity.py",
        "--candidates", str(CANDIDATES_CSV),
        "--output-dir", OUTPUT_DIR
    ]
    print(f"🚀 Command: {' '.join(validate_cmd)}")
    subprocess.run(validate_cmd, check=True)
    print("\n✅ Phase 2 completed! Final selection generated.")


🎯 Validating probe specificity...
🚀 Command: python3 validate_probe_specificity.py --candidates /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI/FISH_RT_probes_CANDIDATES.csv --output-dir /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI
🧬 Probe Specificity Validation (Phase 2)
Running BLASTn against /Volumes/guttman/genomes/mm10/fasta/blastdb/mm10_blastdb...
Command: blastn -query /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI/FISH_RT_probes_CANDIDATES.fasta -db /Volumes/guttman/genomes/mm10/fasta/blastdb/mm10_blastdb -out /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI/blast_results.txt

In [4]:
import pandas as pd
from pathlib import Path
results_file = Path(OUTPUT_DIR) / "FISH_RT_probes_CANDIDATES_BLASTresults.csv"
if results_file.exists():
    df_res = pd.read_csv(results_file)
    # Get parameters from first row for display
    ws = df_res['BLAST_WordSize'].iloc[0] if 'BLAST_WordSize' in df_res.columns else 'N/A'
    ev = df_res['BLAST_EValue'].iloc[0] if 'BLAST_EValue' in df_res.columns else 'N/A'
    min_aln = df_res['BLAST_MinAlignment'].iloc[0] if 'BLAST_MinAlignment' in df_res.columns else 'N/A'

    print(f"\n📊 BLAST specificity summary for {len(df_res)} candidates:")
    print(f"   Parameters: word_size={ws}, evalue={ev}, min_alignment={min_aln}bp")
    print(f"   Specifically unique: {df_res['BLAST_Unique'].sum()}")
    print("\n📋 Top BLAST results (first 10, showing key specificity columns):")
    # Use existing columns, fall back to what it has if names changed
    # We know merged_df has ProbeID
    cols_to_show = ['ProbeID', 'BLAST_Hits', 'Primary_Identity', 'Secondary_Identity', 'BLAST_Unique']
    existing_cols = [c for c in cols_to_show if c in df_res.columns]
    display(df_res[existing_cols].head(10))
else:
    print("⚠️ BLAST results CSV not found.")


📊 BLAST specificity summary for 140 candidates:
   Parameters: word_size=11, evalue=0.1, min_alignment=15bp
   Specifically unique: 87

📋 Top BLAST results (first 10, showing key specificity columns):


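| | ProbeID | BLAST_Hits | Primary_Identity | Secondary_Identity | BLAST_Unique |
|---|---------|------------|------------------|--------------------|--------------|
| 0 | Xist_probe_0 | 1 | 100 | NaN | True |
| 1 | Xist_probe_1 | 2 | 100 | 100.0 | False |
| 2 | Xist_probe_2 | 2 | 100 | 100.0 | False |
| 3 | Xist_probe_3 | 2 | 100 | 100.0 | False |
| 4 | Xist_probe_4 | 1 | 100 | NaN | True |
| 5 | Xist_probe_5 | 2 | 100 | 100.0 | False |
| 6 | Xist_probe_6 | 2 | 100 | 100.0 | False |
| 7 | Xist_probe_7 | 8 | 100 | 97.0 | False |
| 8 | Xist_probe_8 | 14 | 100 | 100.0 | False |
| 9 | Xist_probe_9 | 27 | 100 | 100.0 | False |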
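
In the ten rows shown, `BLAST_Unique` is `True` exactly when `BLAST_Hits` equals 1. The sketch below checks that reading against those rows. Whether the validation script applies precisely this rule to all 140 candidates is an assumption, and `is_unique` is a hypothetical helper, not a function from the pipeline.

```python
# (ProbeID, BLAST_Hits, BLAST_Unique) copied from the ten rows displayed above
shown = [
    ("Xist_probe_0", 1, True),
    ("Xist_probe_1", 2, False),
    ("Xist_probe_2", 2, False),
    ("Xist_probe_3", 2, False),
    ("Xist_probe_4", 1, True),
    ("Xist_probe_5", 2, False),
    ("Xist_probe_6", 2, False),
    ("Xist_probe_7", 8, False),
    ("Xist_probe_8", 14, False),
    ("Xist_probe_9", 27, False),
]


def is_unique(n_hits):
    """Assumed rule: a probe is unique when its only BLAST hit is the intended target."""
    return n_hits == 1


# Probes whose reported flag disagrees with the assumed rule
mismatches = [pid for pid, hits, flag in shown if is_unique(hits) != flag]
n_unique = sum(is_unique(hits) for _, hits, _ in shown)

print(f"unique in sample: {n_unique} of {len(shown)}")
print(f"rows disagreeing with the assumed rule: {mismatches}")
```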


---
## Step 3: Design Forward Primers
Design forward primers upstream of the RT product region (RT region + 50 bp buffer).
This ensures the PCR amplicon fully covers the RT coverage region containing SNPs.



In [5]:
# Step 3: Design Forward Primers
from pathlib import Path
# Input is the final selection from Step 2 (BLAST-unique probes)
INPUT_FILE = Path(OUTPUT_DIR) / "FISH_RT_probes_FINAL_SELECTION.csv"
OUTPUT_FILE = Path(OUTPUT_DIR) / "FISH_RT_probes_WITH_PRIMERS.csv"
GENOME_FASTA = "/Volumes/guttman/genomes/mm10/fasta/mm10.fa"  # Path from config
if INPUT_FILE.exists():
    print(f"🚀 Designing forward primers for {INPUT_FILE.name}...")
    !python3 design_forward_primers.py "{INPUT_FILE}" "{OUTPUT_FILE}" --genome "{GENOME_FASTA}" --rt-coverage {RT_COVERAGE_DOWNSTREAM}
else:
    print(f"⚠️ Input file not found: {INPUT_FILE.name}. Run Step 2 first.")


🚀 Designing forward primers for FISH_RT_probes_FINAL_SELECTION.csv...
📄 Loaded 87 probes from /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI/FISH_RT_probes_FINAL_SELECTION.csv
Designing forward primers...   8% -:--:--
Traceback (most recent call last):
  File "/Users/gmgao/GGscripts/smfish-like-rt-probe-designer/design_forward_primers.py", line 479, in <module>
    main()
  File "/Users/gmgao/GGscripts/smfish-like-rt-probe-designer/design_forward_primers.py", line 467, in main
    design_forward_primers(
  File "/Users/gmgao/GGscripts/smfish-like-rt-probe-designer/design_forward_primers.py", line 363, in design_forward_primers
    p_pos = selected_primer['position'] # Index in temp template
            ~~~~

---
## Step 4: Add RTBC Barcode (Optional)
Add your custom RTBC barcode to the probe sequences for synthesis.
This is a **separate post-processing step** so you can easily try different barcodes.
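
As a minimal sketch of this step, the code below prepends a barcode to a probe sequence and counts nucleotides while ignoring modification codes such as `/5Phos/`. Placing the barcode at the 5' end follows from the `/5Phos/` prefix, but the helper names, the example probe sequence, and plain concatenation are assumptions. Any linker or other sequence the real `add_rtbc_barcode.py` may add is not modelled here.

```python
import re

# Slash-delimited modification codes, e.g. /5Phos/ for a 5' phosphate
MOD_CODE = re.compile(r"/[^/]+/")


def nt_length(oligo):
    """Count nucleotides, ignoring slash-delimited modification codes."""
    return len(MOD_CODE.sub("", oligo))


def prepend_barcode(probe_seq, rtbc):
    """Return the barcode followed by the probe (assumed layout, 5' to 3')."""
    return rtbc + probe_seq


rtbc = "/5Phos/TGACTTGAGGAT"
probe = "ACGTACGTACGTACGTACGT"  # hypothetical 20-nt probe sequence

oligo = prepend_barcode(probe, rtbc)
print(oligo)
print(f"RTBC length: {nt_length(rtbc)} nt, full oligo length: {nt_length(oligo)} nt")
```

Keeping the barcode as a separate argument mirrors the intent of this step: swapping in a different RTBC only changes the `rtbc` value.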



In [6]:
from pathlib import Path
# Choose which file to add RTBC to
# We prefer the version with primers if it exists
input_primers = Path(OUTPUT_DIR) / "FISH_RT_probes_WITH_PRIMERS.csv"
input_final = Path(OUTPUT_DIR) / "FISH_RT_probes_FINAL_SELECTION.csv"
if input_primers.exists():
    INPUT_FILE = input_primers
elif input_final.exists():
    INPUT_FILE = input_final
else:
    INPUT_FILE = None
OUTPUT_FILE = Path(OUTPUT_DIR) / "FISH_RT_probes_SYNTHESIS_READY.csv"
if INPUT_FILE:
    print(f"🧬 Adding RTBC barcode to {INPUT_FILE.name}...")
    !python3 add_rtbc_barcode.py "{INPUT_FILE}" "{OUTPUT_FILE}" --rtbc "{RTBC_SEQUENCE}"
    print(f"\n✅ Synthesis-ready probes saved to: {OUTPUT_FILE}")
else:
    print("⚠️ No input files found to add barcode. Run Step 2 or 3 first.")


🧬 Adding RTBC barcode to FISH_RT_probes_WITH_PRIMERS.csv...
📄 Loaded 17 probes from /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI/FISH_RT_probes_WITH_PRIMERS.csv
🧬 Adding RTBC barcode: /5Phos/TGACTTGAGGAT
✅ Saved 17 probes to /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI/FISH_RT_probes_SYNTHESIS_READY.csv

📊 Summary:
   Original probe length: 29.4 nt
   RTBC length: 12 nt
   Full oligo length: 48.4 nt

✅ Synthesis-ready probes saved to: /Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI/FISH_RT_probes_SYNTHESIS_READY.csv


---
## Output Files
| File | Description |
|------|-------------|
| `FISH_RT_probes_CANDIDATES.csv` | All high-quality candidate probes before BLAST specificity filtering |
| `FISH_RT_probes_FINAL_SELECTION.csv` | **Final Selection** (N probes per gene, BLAST-unique) |
| `FISH_RT_probes_WITH_PRIMERS.csv` | Probes with designed forward primers for validation |
| `FISH_RT_probes_SYNTHESIS_READY.csv` | Final probes with RTBC barcode added |
| `*.fasta` | FASTA format for BLAST or other analysis |
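
Because later steps fall back on whichever CSV files already exist in the output directory, it can help to check which outputs are present before trusting a downstream result. The sketch below is one possible check. The helper `report_outputs` is hypothetical, and the `"."` directory is a placeholder for `OUTPUT_DIR`.

```python
from pathlib import Path

# CSV outputs named in the table above
EXPECTED_OUTPUTS = [
    "FISH_RT_probes_CANDIDATES.csv",
    "FISH_RT_probes_FINAL_SELECTION.csv",
    "FISH_RT_probes_WITH_PRIMERS.csv",
    "FISH_RT_probes_SYNTHESIS_READY.csv",
]


def report_outputs(output_dir):
    """Map each expected output file name to whether it exists in output_dir."""
    output_dir = Path(output_dir)
    return {name: (output_dir / name).exists() for name in EXPECTED_OUTPUTS}


status = report_outputs(".")  # placeholder; pass OUTPUT_DIR in the notebook
for name, present in status.items():
    print(f"{'found  ' if present else 'missing'}  {name}")
```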

