# HCR-FISH Probe Design Tutorial

This notebook demonstrates the complete workflow for designing HCR-FISH probes using the `hcrfish.hcr.utils` module. The functions covered here build upon the data preparation demonstrated in `prep_tutorial.ipynb` to design specific, high-quality probes for target genes.

## Overview

The HCR-FISH probe design workflow consists of several key steps:

1. **Check probe availability** - Determine how many probes can be designed for a gene
2. **BLAST analysis** - Identify off-target binding sites to ensure specificity
3. **Probe design** - Generate HCR v3.0 compatible probe sequences
4. **Export for synthesis** - Create IDT-compatible files for probe ordering
5. **Visualization** - Plot probe binding sites on genomic features

## Requirements

Before running this tutorial, ensure you have:
- Completed the data preparation from `prep_tutorial.ipynb`
- BLAST+ tools installed and available in your PATH
- Transcriptome objects created for your species of interest
- BLAST databases for both mature mRNA (no introns) and pre-mRNA (with introns)

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import hcrfish

## Setup and Configuration

First, let's verify that BLAST tools are available and set up our working directories.

In [2]:
hcrfish.check_blast_tools()

‚úì makeblastdb: makeblastdb: 2.15.0+
‚úì blastn: blastn: 2.15.0+
‚úì blastn: blastn: 2.15.0+


{'makeblastdb': {'available': True, 'version': 'makeblastdb: 2.15.0+'},
 'blastn': {'available': True, 'version': 'blastn: 2.15.0+'}}

In [9]:
# Configuration parameters
species_identifier = "obir"  # Change this to your species
main_directory = "../output"
transcriptome_object_name = "obir_transcriptome"  # From prep_tutorial.ipynb

# Target genes for demonstration
target_gene = ["Or5-9E306"]  

# HCR amplifier to use (B1, B2, B3, B4, or B5)
amplifier = "B1"

# Create main output directory
os.makedirs(main_directory, exist_ok=True)
print(f"‚úì Working directory: {os.path.abspath(main_directory)}")
print(f"‚úì Species: {species_identifier}")
print(f"‚úì Amplifier: {amplifier}")

‚úì Working directory: /Users/giacomo.glotzer/Desktop/Rockefeller/Kronauer/hcr_probe_design_general.nosync/output
‚úì Species: obir
‚úì Amplifier: B1


## Load Transcriptome Data

Load the transcriptome object that was created in the preparation tutorial.

In [11]:
# Load transcriptome object
try:
    transcriptome_path = f"../input/{species_identifier}/{transcriptome_object_name}.pkl"
    transcriptome = hcrfish.load_transcriptome_object(transcriptome_path)
    print(f"‚úì Loaded transcriptome: {transcriptome_path}")
    print(f"  Total genes: {len(transcriptome.genes):,}")
    
    # Show some example gene names
    gene_names = list(transcriptome.genes.keys())[:10]
    print(f"  Example genes: {', '.join(gene_names)}")
    
except FileNotFoundError:
    print(f"‚úó Transcriptome object '{transcriptome_object_name}' not found.")
    print("  Please run prep_tutorial.ipynb first to create the transcriptome object.")
    raise
except Exception as e:
    print(f"‚úó Error loading transcriptome: {e}")
    raise

File ../input/obir/obir_transcriptome.pkl not found in any of the expected locations:
  - input/dmel/../input/obir/obir_transcriptome.pkl
  - input/dyak/../input/obir/obir_transcriptome.pkl
  - ../input/obir/obir_transcriptome.pkl
  - docs/../input/obir/obir_transcriptome.pkl
Please run update_transcriptome_object(genome_path, transcriptome_path, output_filename, species) to generate the transcriptome object.
‚úì Loaded transcriptome: ../input/obir/obir_transcriptome.pkl
‚úó Error loading transcriptome: 'NoneType' object has no attribute 'genes'


AttributeError: 'NoneType' object has no attribute 'genes'

## Understanding HCR Amplifiers

HCR v3.0 uses five different amplifier systems (B1-B5) that allow for multiplexed detection. Each amplifier has specific initiator sequences and spacers.

In [None]:
# Demonstrate amplifier sequences
print("HCR v3.0 Amplifier Sequences:")
print("=" * 50)

amplifier_data = []
for amp in ["B1", "B2", "B3", "B4", "B5"]:
    upstream_spacer, downstream_spacer, upstream_init, downstream_init = hcrfish.get_amplifier(amp)
    
    amplifier_data.append({
        'Amplifier': amp,
        'Upstream Spacer': upstream_spacer,
        'Downstream Spacer': downstream_spacer,
        'Upstream Initiator': upstream_init,
        'Downstream Initiator': downstream_init
    })
    
    print(f"\n{amp}:")
    print(f"  Upstream:   {upstream_init} + {upstream_spacer} + [25bp target]")
    print(f"  Downstream: [25bp target] + {downstream_spacer} + {downstream_init}")

# Create a summary table
amp_df = pd.DataFrame(amplifier_data)
print("\nAmplifier Summary Table:")
print(amp_df.to_string(index=False))

HCR v3.0 Amplifier Sequences:

B1:
  Upstream:   GAGGAGGGCAGCAAACGG + aa + [25bp target]
  Downstream: [25bp target] + ta + GAAGAGTCTTCCTTTACG

B2:
  Upstream:   CCTCGTAAATCCTCATCA + aa + [25bp target]
  Downstream: [25bp target] + aa + ATCATCCAGTAAACCGCC

B3:
  Upstream:   GTCCCTGCCTCTATATCT + tt + [25bp target]
  Downstream: [25bp target] + tt + CCACTCAACTTTAACCCG

B4:
  Upstream:   CCTCAACCTACCTCCAAC + aa + [25bp target]
  Downstream: [25bp target] + at + TCTCACCATATTCGCTTC

B5:
  Upstream:   CTCACTCCCAATCTCTAT + aa + [25bp target]
  Downstream: [25bp target] + aa + CTACCCTACAAATCCAAT

Amplifier Summary Table:
Amplifier Upstream Spacer Downstream Spacer Upstream Initiator Downstream Initiator
       B1              aa                ta GAGGAGGGCAGCAAACGG   GAAGAGTCTTCCTTTACG
       B2              aa                aa CCTCGTAAATCCTCATCA   ATCATCCAGTAAACCGCC
       B3              tt                tt GTCCCTGCCTCTATATCT   CCACTCAACTTTAACCCG
       B4              aa                at

## Utility Functions: Sequence Manipulation

Before diving into probe design, let's explore some basic sequence manipulation utilities.

In [None]:
# Demonstrate reverse complement function
example_sequences = [
    "ATCGATCG",
    "GGGGCCCCAAAATTTT",
    "ATCG-N-CGAT",  # With gaps and ambiguous bases
    "atcgATCG"  # Mixed case
]

print("Reverse Complement Examples:")
print("=" * 40)
for seq in example_sequences:
    rc = hcrfish.reverse_complement(seq)
    print(f"Original:  {seq}")
    print(f"Rev Comp:  {rc}")
    print(f"Length:    {len(seq)} -> {len(rc)}")
    print()

## Step 1: Check Probe Availability

Before designing probes for a gene, it's useful to check how many high-quality probes can be generated. This function performs the complete analysis pipeline and returns the number of available probes.

In [None]:
# Check probe availability for multiple genes
print("Checking Probe Availability")
print("=" * 30)

probe_counts = {}

for gene_name in target_genes:
    try:
        print(f"\nAnalyzing {gene_name}...")
        print("-" * 20)
        
        # Check if gene exists in transcriptome
        gene = transcriptome.get_gene(gene_name)
        if gene is None:
            print(f"‚úó Gene '{gene_name}' not found in transcriptome")
            continue
            
        # Get longest transcript for analysis
        transcript = gene.get_transcript_longest_bounds()
        print(f"‚úì Gene found with {len(gene.transcripts)} transcript(s)")
        print(f"‚úì Using transcript: {transcript.name}")
        print(f"‚úì Sequence length: {len(transcript.mrna_sequence):,} bp")
        
        # Check probe availability (this performs BLAST analysis)
        n_probes = hcrfish.check_probe_availability(
            gene_name=gene_name,
            transcriptome=transcriptome,
            input_dir=main_directory,
            species_identifier=species_identifier
        )
        
        probe_counts[gene_name] = n_probes
        print(f"‚úì Available probes: {n_probes}")
        
    except Exception as e:
        print(f"‚úó Error analyzing {gene_name}: {e}")
        probe_counts[gene_name] = 0

# Summary table
print("\n" + "=" * 40)
print("PROBE AVAILABILITY SUMMARY")
print("=" * 40)
summary_data = [
    {'Gene': gene, 'Available Probes': count, 'Status': '‚úì Good' if count >= 20 else '‚ö†Ô∏è Limited' if count >= 10 else '‚úó Poor'}
    for gene, count in probe_counts.items()
]
summary_df = pd.DataFrame(summary_data)
print(summary_df.to_string(index=False))

## Step 2: Detailed Probe Design Workflow

Now let's walk through the detailed probe design process for a specific gene, demonstrating each step.

In [None]:
# Select a gene with good probe availability for detailed analysis
selected_gene = max(probe_counts.items(), key=lambda x: x[1])[0]
print(f"Selected gene for detailed analysis: {selected_gene}")
print(f"Available probes: {probe_counts[selected_gene]}")

# Get gene object
gene = transcriptome.get_gene(selected_gene)
transcript = gene.get_transcript_longest_bounds()

print(f"\nGene Information:")
print(f"  Chromosome: {gene.chromosome}")
print(f"  Strand: {transcript.strand}")
print(f"  Coordinates: {transcript.get_bounds()}")
print(f"  Number of exons: {len(transcript.exons)}")
print(f"  mRNA length: {len(transcript.mrna_sequence):,} bp")

# Store target sequence
gene.target_sequence = transcript.mrna_sequence

### Step 2a: BLAST Analysis for Specificity

The BLAST analysis identifies regions of the target sequence that have significant similarity to other genes, which could lead to off-target probe binding.

In [None]:
# Perform BLAST analysis to identify off-target regions
print(f"Performing BLAST analysis for {selected_gene}...")
print("This may take a few minutes depending on sequence length and database size.")

try:
    # Run BLAST analysis (this function also processes results and masks off-targets)
    hcrfish.blast_gene(
        gene_name=selected_gene,
        transcriptome=transcriptome,
        main_directory=main_directory,
        species_identifier=species_identifier,
        permitted_off_targets=[],  # No permitted off-targets for this example
        length_thresh=50  # Minimum alignment length to consider
    )
    
    print(f"‚úì BLAST analysis completed for {selected_gene}")
    
    # Analyze BLAST results
    if hasattr(gene, 'blast_results'):
        blast_df = gene.blast_results
        
        print(f"\nBLAST Results Summary:")
        print(f"  Total hits: {len(blast_df):,}")
        print(f"  Self-hits: {len(blast_df[blast_df['same_gene']]):,}")
        print(f"  Off-target hits: {len(blast_df[~blast_df['same_gene']]):,}")
        
        # Show top off-targets
        off_targets = blast_df[~blast_df['same_gene']].head(10)
        if len(off_targets) > 0:
            print(f"\nTop Off-Target Hits:")
            display_cols = ['subject_gene_id', 'percent_identity', 'length', 'mismatches', 'source']
            print(off_targets[display_cols].to_string(index=False))
        else:
            print("\n‚úì No significant off-target hits found!")
    
    # Show sequence masking results
    if hasattr(gene, 'unique_sequence'):
        original_length = len(gene.target_sequence)
        masked_chars = gene.unique_sequence.count('-')
        unique_length = original_length - masked_chars
        
        print(f"\nSequence Masking Results:")
        print(f"  Original length: {original_length:,} bp")
        print(f"  Masked bases: {masked_chars:,} bp ({masked_chars/original_length*100:.1f}%)")
        print(f"  Unique length: {unique_length:,} bp ({unique_length/original_length*100:.1f}%)")
        
except Exception as e:
    print(f"‚úó BLAST analysis failed: {e}")
    # Set a simple unique sequence for demonstration
    gene.unique_sequence = gene.target_sequence
    print("Using unmasked sequence for demonstration purposes.")

### Step 2b: Probe Design on Unique Regions

Now we design HCR-FISH probes on the regions that passed the specificity filter.

In [None]:
# Design probes on unique sequence regions
print(f"Designing HCR-FISH probes for {selected_gene} using amplifier {amplifier}...")

try:
    # Design probes
    probe_pairs, probe_regions, probe_positions = hcrfish.design_hcr_probes(
        sequence=gene.unique_sequence,
        amplifier=amplifier,
        gc_min=0.25,  # Minimum GC content
        gc_max=0.75,  # Maximum GC content
        max_homopolymer=4  # Maximum homopolymer length
    )
    
    print(f"‚úì Designed {len(probe_pairs)} probe pairs")
    
    # Analyze probe characteristics
    if len(probe_pairs) > 0:
        print(f"\nProbe Design Results:")
        print(f"  Number of probe pairs: {len(probe_pairs)}")
        print(f"  Total individual probes: {len(probe_pairs) * 2}")
        print(f"  Probe spacing: ~54 bp between centers")
        print(f"  Target region length: 52 bp per pair")
        
        # Show first few probe pairs
        print(f"\nFirst 3 Probe Pairs:")
        for i, (up_probe, down_probe) in enumerate(probe_pairs[:3]):
            print(f"\nPair {i+1}:")
            print(f"  Upstream:   {up_probe}")
            print(f"  Downstream: {down_probe}")
            print(f"  Target:     {probe_regions[i]}")
            print(f"  Position:   {probe_positions[i][0]}-{probe_positions[i][1]}")
        
        # Analyze GC content distribution
        all_regions = probe_regions
        gc_contents = [(region.count('G') + region.count('C')) / len(region) 
                      for region in all_regions]
        
        print(f"\nGC Content Distribution:")
        print(f"  Mean: {np.mean(gc_contents):.2f}")
        print(f"  Range: {np.min(gc_contents):.2f} - {np.max(gc_contents):.2f}")
        print(f"  Std Dev: {np.std(gc_contents):.2f}")
        
    else:
        print("‚ö†Ô∏è No suitable probe pairs could be designed")
        print("This may be due to:")
        print("  - Extensive off-target masking")
        print("  - Poor sequence quality")
        print("  - Short unique regions")
        
except Exception as e:
    print(f"‚úó Probe design failed: {e}")
    probe_pairs, probe_regions, probe_positions = [], [], []

## Step 3: Export Probes for Synthesis

Generate IDT-compatible files for probe ordering and reference files for the lab.

In [None]:
# Export probes in IDT format
if len(probe_pairs) > 0:
    print(f"Exporting probes for {selected_gene}...")
    
    try:
        # Set number of probes to export (maximum 30 for this example)
        n_probes_to_export = min(30, len(probe_pairs))
        
        hcrfish.get_probes_IDT(
            gene_name=selected_gene,
            transcriptome=transcriptome,
            main_directory=main_directory,
            species_identifier=species_identifier,
            amplifier=amplifier,
            n_probes=n_probes_to_export
        )
        
        print(f"‚úì Exported {n_probes_to_export} probe pairs")
        
        # Show file locations
        idt_dir = os.path.join(main_directory, species_identifier, 'IDT_sheets')
        regions_dir = os.path.join(main_directory, species_identifier, 'probe_binding_regions_sheets')
        
        print(f"\nOutput Files:")
        print(f"  IDT order sheet: {idt_dir}/")
        print(f"  Binding regions: {regions_dir}/")
        
        # List actual files created
        if os.path.exists(idt_dir):
            idt_files = [f for f in os.listdir(idt_dir) if selected_gene in f]
            for file in idt_files:
                print(f"    - {file}")
                
        if os.path.exists(regions_dir):
            region_files = [f for f in os.listdir(regions_dir) if selected_gene in f]
            for file in region_files:
                print(f"    - {file}")
        
        # Show probe cost estimation
        total_probes = n_probes_to_export * 2  # Each pair has 2 probes
        estimated_cost = total_probes * 15  # Rough estimate: $15 per probe
        print(f"\nCost Estimation:")
        print(f"  Total probes: {total_probes}")
        print(f"  Estimated cost: ${estimated_cost:,} (@ $15/probe)")
        
    except Exception as e:
        print(f"‚úó Export failed: {e}")
else:
    print("‚ö†Ô∏è No probes available for export")

## Step 4: Visualize Probe Binding Sites

Create a genomic visualization showing where the probes will bind relative to the gene structure.

In [None]:
# Generate genomic visualization
if len(probe_pairs) > 0 and hasattr(gene, 'regions'):
    print(f"Creating genomic visualization for {selected_gene}...")
    
    try:
        fig = hcrfish.get_probe_binding_regions_plot(
            gene_name=selected_gene,
            transcriptome=transcriptome,
            main_directory=main_directory,
            species_identifier=species_identifier,
            save=True
        )
        
        print(f"‚úì Genomic visualization created")
        
        # Display the figure
        plt.show()
        
        # Show output location
        plot_dir = os.path.join(main_directory, species_identifier, 'probe_regions_plot')
        print(f"  Output directory: {plot_dir}/")
        
        if os.path.exists(plot_dir):
            plot_files = [f for f in os.listdir(plot_dir) if selected_gene in f]
            for file in plot_files:
                print(f"    - {file}")
        
    except Exception as e:
        print(f"‚úó Visualization failed: {e}")
        print("This may be due to missing genome files or visualization dependencies")
else:
    print("‚ö†Ô∏è No probe regions available for visualization")

## File Organization Summary

Let's review the files that have been created during this probe design session.

In [None]:
# File organization summary
print("File Organization Summary")
print("=" * 26)

def list_directory_contents(directory, max_files=5):
    """List contents of a directory with file count limits."""
    if not os.path.exists(directory):
        print(f"  üìÅ {os.path.basename(directory)}/  (not created)")
        return
    
    files = os.listdir(directory)
    if not files:
        print(f"  üìÅ {os.path.basename(directory)}/  (empty)")
        return
    
    print(f"  üìÅ {os.path.basename(directory)}/  ({len(files)} files)")
    for i, file in enumerate(sorted(files)):
        if i < max_files:
            print(f"    üìÑ {file}")
        elif i == max_files:
            print(f"    ... and {len(files) - max_files} more files")
            break

# Main output directory structure
base_output = os.path.join(main_directory, species_identifier)
directories = [
    ('gene_seq_blast_input', 'Gene sequences for BLAST input'),
    ('gene_seq_blast_output', 'BLAST results (CSV format)'),
    ('gene_seq_unique_regions', 'Masked sequences with off-targets removed'),
    ('IDT_sheets', 'Probe sequences ready for IDT ordering'),
    ('probe_binding_regions_sheets', 'Probe binding region references'),
    ('probe_regions_plot', 'Genomic visualization plots'),
    ('probe_region_blast_input', 'Individual probe region sequences'),
    ('probe_region_blast_output', 'BLAST results for probe regions')
]

print(f"Output directory: {base_output}/")
print()

for dir_name, description in directories:
    full_path = os.path.join(base_output, dir_name)
    print(f"{description}:")
    list_directory_contents(full_path)
    print()

# Calculate total disk usage
def get_directory_size(directory):
    """Calculate total size of directory in MB."""
    total_size = 0
    try:
        for dirpath, dirnames, filenames in os.walk(directory):
            for filename in filenames:
                filepath = os.path.join(dirpath, filename)
                if os.path.exists(filepath):
                    total_size += os.path.getsize(filepath)
    except:
        pass
    return total_size / (1024 * 1024)  # Convert to MB

if os.path.exists(base_output):
    total_size = get_directory_size(base_output)
    print(f"Total disk usage: {total_size:.1f} MB")
else:
    print("No output files created yet.")

## Best Practices and Troubleshooting

Here are some key recommendations for successful HCR-FISH probe design:

### Best Practices

1. **Target Selection**:
   - Choose genes with moderate to high expression levels
   - Avoid genes with extensive splice variants unless specifically needed
   - Consider gene family membership to avoid cross-reactivity

2. **Probe Design**:
   - Aim for 20-30 probe pairs per target for robust signal
   - Use different amplifiers (B1-B5) for multiplexed experiments
   - Check probe availability before committing to experimental design

3. **Quality Control**:
   - Always run BLAST analysis to check specificity
   - Review off-target hits manually for closely related genes
   - Consider experimental validation with positive/negative controls

4. **File Management**:
   - Keep organized directory structure as shown above
   - Save probe sequences and metadata for future reference
   - Document amplifier assignments for multiplexed experiments

### Troubleshooting Common Issues

**Low probe availability (<10 probes)**:
- Check for repetitive sequences or gene family members
- Consider targeting specific splice variants
- Relax quality thresholds if necessary (with caution)

**BLAST analysis failures**:
- Verify BLAST+ installation and PATH configuration
- Check database file integrity and paths
- Ensure sufficient disk space for temporary files

**Visualization errors**:
- Confirm genome FASTA file exists and is indexed
- Check chromosome naming consistency between GTF and FASTA
- Verify matplotlib and pygenomeviz dependencies

**Poor probe performance in experiments**:
- Re-examine BLAST results for missed off-targets
- Consider probe concentration optimization
- Check amplifier sequences and experimental protocols

### Next Steps

After completing probe design:
1. Order probes from IDT using the generated Excel files
2. Plan experimental validation with appropriate controls
3. Optimize hybridization and amplification conditions
4. Document successful probe sets for future use

For questions or issues, refer to the HCR-FISH protocol documentation or contact the development team.