# FISH-RT Probe Design Pipeline
This notebook provides a streamlined 4-phase workflow for designing FISH-RT probes:

1. **Phase 1: Candidate Design** - Automated probe design with SNP-first selection
2. **Phase 2: Specificity Validation** - Local BLAST specificity check and hit reporting
3. **Phase 3: Forward Primer Design** - Automated primer design with specificity validation
4. **Phase 4: Synthesis Prep** - Post-process with custom RTBC barcodes

## Pipeline Features
- SNP-first probe selection (maximizes allelic discrimination potential)
- Data-validated BLAST specificity selection (word_size=11, evalue=0.1)
- 200 nt RT coverage guaranteed through expanded primer search space
- Fully modular scripts for each phase


---
## Step 0: Configuration
Edit the parameters below to customize your analysis.



In [None]:
# ========================================
# CONFIGURATION - EDIT THESE PARAMETERS
# ========================================
# Gene list to process (Example: ["Nanog", "Mecp2", "Xist"])
# GENE_LIST = [
#     "Nanog",
#     "Mecp2",
#     "Xist",
# ]
# Example: X-chromosome test set (21 genes); uncomment entries to include them (only Xist is active here)
GENE_LIST = [
    # "Atrx",
    # "Diaph2",
    # "Gpc4",
    # "Hdac3",
    # "Hnrnpu",
    # "Kdm5c",
    # "Kdm6a",
    # "Mecp2",
    # "Mid1",
    # "Nanog",
    # "Pir",
    # "Pou5f1",
    # "Rbmx",
    # "Rlim",
    # "Rps6ka3",
    # "Rps6ka6",
    # "Smc1a",
    # "Spen",
    # "Tsix",
    "Xist",
    # "Zfp42",
]
# Output directory
OUTPUT_DIR = "/Users/gmgao/Dropbox/Caltech_PostDoc_GuttmanLab/constructs_and_smiFISH/smFISH_like_focusedRT-XCI"
# RTBC barcode for synthesis (added as final step)
RTBC_SEQUENCE = "/5Phos/TGACTTGAGGAT"
# ========================================
# PROBE SELECTION PARAMETERS
# ========================================
# Selection logic: min(top N, probes with >= M SNPs)
MAX_PROBES_PER_GENE = 200  # Maximum probes per gene
MIN_SNPS_FOR_SELECTION = 3  # Minimum SNPs for probe inclusion
# ========================================


## Step 1: Candidate Design (Oligostan with filters)
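The selection rule noted in the configuration, `min(top N, probes with >= M SNPs)`, can be read as: keep only probes with at least `MIN_SNPS_FOR_SELECTION` SNPs, rank them by SNP count, and cap the result at `MAX_PROBES_PER_GENE`. The snippet below is a minimal sketch of that reading, not the code in `design_candidate_probes.py`. The `Probe` tuple and `select_probes` function are hypothetical names introduced only for illustration.

```python
from typing import List, NamedTuple


class Probe(NamedTuple):
    """Hypothetical record: a probe identifier and the number of SNPs it covers."""
    probe_id: str
    snp_count: int


def select_probes(probes: List[Probe], max_probes: int, min_snps: int) -> List[Probe]:
    """Keep probes with >= min_snps SNPs, ranked by SNP count, capped at max_probes."""
    eligible = [p for p in probes if p.snp_count >= min_snps]
    # sorted() is stable, so probes with equal SNP counts keep their input order
    ranked = sorted(eligible, key=lambda p: p.snp_count, reverse=True)
    return ranked[:max_probes]


candidates = [Probe("p1", 5), Probe("p2", 2), Probe("p3", 3), Probe("p4", 4)]
selected = select_probes(candidates, max_probes=2, min_snps=3)
print([p.probe_id for p in selected])
```

Under this reading, the number of probes returned per gene is the smaller of `max_probes` and the number of probes that meet the SNP threshold.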



In [None]:
import subprocess
import os
# Build command for Phase 1: Candidate Generation
cmd = ["python3", "design_candidate_probes.py", "--output", OUTPUT_DIR]
cmd.extend(["--max-probes", str(MAX_PROBES_PER_GENE)])
cmd.extend(["--min-snps", str(MIN_SNPS_FOR_SELECTION)])
# Always use --genes
cmd.extend(["--genes"] + GENE_LIST)
# Run Phase 1
print(f"🚀 Running Phase 1 (Candidate Design): {' '.join(cmd)}")
print("=" * 60)
result = subprocess.run(
    cmd,
    capture_output=False,
    cwd=os.getcwd(),  # run from the notebook's working directory, where the pipeline scripts are looked up
)
if result.returncode == 0:
    print("\n✅ Phase 1 completed successfully! Candidates generated.")
else:
    print(f"\n❌ Phase 1 failed with exit code {result.returncode}")


---
## Step 2: Specificity Validation (BLAST)
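The Pipeline Features list gives the BLAST settings as `word_size=11` and `evalue=0.1`. As a rough illustration of what a local `blastn` call with those settings might look like, the sketch below only builds the argument list and does not run anything. The actual invocation lives in `validate_probe_specificity.py` and is not shown in this notebook, so the function name, the file names, and the tabular output format (`-outfmt 6`) are assumptions.

```python
from typing import List


def build_blastn_command(query_fasta: str, database: str,
                         word_size: int = 11, evalue: float = 0.1) -> List[str]:
    """Assemble a blastn argument list; defaults mirror the values listed under Pipeline Features."""
    return [
        "blastn",
        "-query", str(query_fasta),
        "-db", str(database),
        "-word_size", str(word_size),
        "-evalue", str(evalue),
        "-outfmt", "6",  # tabular output; an assumption, not confirmed by the pipeline
    ]


# Hypothetical file and database names, for illustration only
blast_cmd = build_blastn_command("candidate_probes.fasta", "mm10_blast_db")
print(" ".join(blast_cmd))
```

Building the command as a list keeps it compatible with `subprocess.run` without shell quoting.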



In [None]:
import subprocess
from pathlib import Path
# Configuration for Phase 2: Specificity Validation
CANDIDATES_CSV = Path(OUTPUT_DIR) / "FISH_RT_probes_CANDIDATES.csv"
if not CANDIDATES_CSV.exists():
    print(f"⚠️ Candidates CSV not found at {CANDIDATES_CSV}. Run Step 1 first.")
else:
    # Run the BLAST specificity validation script on the Phase 1 candidates
    print("🎯 Validating probe specificity...")
    validate_cmd = [
        "python3", "validate_probe_specificity.py",
        "--candidates", str(CANDIDATES_CSV),
        "--output-dir", OUTPUT_DIR
    ]
    print(f"🚀 Command: {' '.join(validate_cmd)}")
    subprocess.run(validate_cmd, check=True)
    print("\n✅ Phase 2 completed! Final selection generated.")


In [None]:
import pandas as pd
from pathlib import Path
results_file = Path(OUTPUT_DIR) / "FISH_RT_probes_CANDIDATES_BLASTresults.csv"
if results_file.exists():
    df_res = pd.read_csv(results_file)
    # Get parameters from first row for display
    ws = df_res['BLAST_WordSize'].iloc[0] if 'BLAST_WordSize' in df_res.columns else 'N/A'
    ev = df_res['BLAST_EValue'].iloc[0] if 'BLAST_EValue' in df_res.columns else 'N/A'
    min_aln = df_res['BLAST_MinAlignment'].iloc[0] if 'BLAST_MinAlignment' in df_res.columns else 'N/A'

    # Guard the uniqueness column the same way as the parameter columns above
    n_unique = df_res['BLAST_Unique'].sum() if 'BLAST_Unique' in df_res.columns else 'N/A'
    print(f"\n📊 BLAST specificity summary for {len(df_res)} candidates:")
    print(f"   Parameters: word_size={ws}, evalue={ev}, min_alignment={min_aln} bp")
    print(f"   BLAST-unique probes: {n_unique}")
    print("\n📋 First 10 BLAST results (key specificity columns):")
    # Show only the expected columns that are actually present in the results file
    cols_to_show = ['ProbeID', 'BLAST_Hits', 'Primary_Identity', 'Secondary_Identity', 'BLAST_Unique']
    existing_cols = [c for c in cols_to_show if c in df_res.columns]
    display(df_res[existing_cols].head(10))
else:
    print("⚠️ BLAST results CSV not found. Run Step 2 first.")

---
## Step 3: Design Forward Primers
Design forward primers 200-250 bp upstream of the RT probe using Primer3.
This ensures that the PCR amplicon spans the SNP-containing region covered by reverse transcription.
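To make the "200-250 bp upstream" window concrete, the sketch below computes a strand-aware genomic search interval for the forward primer. It is only an illustration under stated assumptions: coordinates are 1-based and inclusive, `probe_start < probe_end` on the genome, "upstream" means toward the 5' end of the transcript, and distances are measured from the probe's upstream edge. The function name `upstream_primer_window` is hypothetical, and `design_forward_primers.py` may define the window differently.

```python
from typing import Tuple


def upstream_primer_window(probe_start: int, probe_end: int, strand: str,
                           min_dist: int = 200, max_dist: int = 250) -> Tuple[int, int]:
    """Return the (start, end) genomic interval lying min_dist..max_dist bp upstream of a probe."""
    if strand == "+":
        # Plus-strand transcript: upstream means smaller genomic coordinates
        start, end = probe_start - max_dist, probe_start - min_dist
    elif strand == "-":
        # Minus-strand transcript: upstream means larger genomic coordinates
        start, end = probe_end + min_dist, probe_end + max_dist
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    # Clamp to the first base so the window never runs off the chromosome start
    return max(start, 1), max(end, 1)


print(upstream_primer_window(1000, 1030, "+"))
print(upstream_primer_window(1000, 1030, "-"))
```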



In [None]:
# Step 3: Design Forward Primers
from pathlib import Path
# Input is the final selection produced by Step 2 (BLAST specificity validation)
INPUT_FILE = Path(OUTPUT_DIR) / "FISH_RT_probes_FINAL_SELECTION.csv"
OUTPUT_FILE = Path(OUTPUT_DIR) / "FISH_RT_probes_WITH_PRIMERS.csv"
GENOME_FASTA = "/Volumes/guttman/genomes/mm10/fasta/mm10.fa"  # Path from config
if INPUT_FILE.exists():
    print(f"🚀 Designing forward primers for {INPUT_FILE.name}...")
    !python3 design_forward_primers.py "{INPUT_FILE}" "{OUTPUT_FILE}" --genome "{GENOME_FASTA}"
else:
    print(f"⚠️ Input file not found: {INPUT_FILE.name}. Run Step 2 first.")


---
## Step 4: Add RTBC Barcode (Optional)
Add your custom RTBC barcode to the probe sequences for synthesis.
This is a **separate post-processing step** so you can easily try different barcodes.
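The configured `RTBC_SEQUENCE` (`/5Phos/TGACTTGAGGAT`) combines a slash-delimited modification code with the barcode bases. The sketch below shows one way such a string could be split and joined to a probe sequence. It assumes the barcode is placed at the 5' end of the probe, which is suggested by the 5' phosphate code but not confirmed by this notebook. `split_modification` and `add_barcode` are hypothetical helpers, not functions from `add_rtbc_barcode.py`.

```python
import re
from typing import Tuple

# Zero or more leading slash-delimited modification codes, followed by DNA bases
OLIGO_PATTERN = re.compile(r"^((?:/[^/]+/)*)([ACGTacgt]+)$")


def split_modification(oligo: str) -> Tuple[str, str]:
    """Split an oligo string into (modification prefix, uppercase bases)."""
    match = OLIGO_PATTERN.match(oligo)
    if match is None:
        raise ValueError(f"Unrecognized oligo format: {oligo!r}")
    return match.group(1), match.group(2).upper()


def add_barcode(probe_seq: str, rtbc: str) -> str:
    """Prepend the barcode (with its modification code) to the probe; 5' placement is an assumption."""
    modification, barcode = split_modification(rtbc)
    return f"{modification}{barcode}{probe_seq.upper()}"


# Toy probe sequence, for illustration only
print(add_barcode("acgtacgt", "/5Phos/TGACTTGAGGAT"))
```

Keeping the modification code separate from the bases makes it easier to swap barcodes without touching the probe sequences.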



In [None]:
from pathlib import Path
# Choose which file to add RTBC to
# We prefer the version with primers if it exists
input_primers = Path(OUTPUT_DIR) / "FISH_RT_probes_WITH_PRIMERS.csv"
input_final = Path(OUTPUT_DIR) / "FISH_RT_probes_FINAL_SELECTION.csv"
if input_primers.exists():
    INPUT_FILE = input_primers
elif input_final.exists():
    INPUT_FILE = input_final
else:
    INPUT_FILE = None
OUTPUT_FILE = Path(OUTPUT_DIR) / "FISH_RT_probes_SYNTHESIS_READY.csv"
if INPUT_FILE:
    print(f"🧬 Adding RTBC barcode to {INPUT_FILE.name}...")
    !python3 add_rtbc_barcode.py "{INPUT_FILE}" "{OUTPUT_FILE}" --rtbc "{RTBC_SEQUENCE}"
    print(f"\n✅ Synthesis-ready probes saved to: {OUTPUT_FILE}")
else:
    print("⚠️ No input files found to add barcode. Run Step 2 or 3 first.")


---
## Output Files
| File | Description |
|------|-------------|
| `FISH_RT_probes_CANDIDATES.csv` | All high-quality candidate probes before BLAST specificity filtering |
| `FISH_RT_probes_FINAL_SELECTION.csv` | **Final Selection** (up to `MAX_PROBES_PER_GENE` probes per gene, BLAST-unique) |
| `FISH_RT_probes_WITH_PRIMERS.csv` | Probes with designed forward primers for validation |
| `FISH_RT_probes_SYNTHESIS_READY.csv` | Final probes with RTBC barcode added |
| `*.fasta` | FASTA format for BLAST or other analysis |
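
After a run, it can be useful to confirm which of the CSV files listed above were actually written. The snippet below is a small sketch of such a check. The `report_outputs` helper is a hypothetical name, and the directory passed to it here is a placeholder to be replaced with your own `OUTPUT_DIR`.

```python
from pathlib import Path
from typing import Dict

# CSV file names taken from the output table
EXPECTED_OUTPUTS = [
    "FISH_RT_probes_CANDIDATES.csv",
    "FISH_RT_probes_FINAL_SELECTION.csv",
    "FISH_RT_probes_WITH_PRIMERS.csv",
    "FISH_RT_probes_SYNTHESIS_READY.csv",
]


def report_outputs(output_dir: str) -> Dict[str, bool]:
    """Map each expected output file name to whether it exists in output_dir."""
    base = Path(output_dir)
    return {name: (base / name).exists() for name in EXPECTED_OUTPUTS}


# Placeholder directory; substitute the OUTPUT_DIR from the configuration cell
for name, found in report_outputs("path/to/output_dir").items():
    print(f"{'found  ' if found else 'missing'}  {name}")
```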

