# HCR-FISH Probe Design: Data Preparation Tutorial

This notebook demonstrates the preparation functions in `hcrfish.hcr.prep`, which set up the data needed for HCR-FISH probe design. They streamline downloading genomic data, exporting transcriptome sequences, and creating BLAST databases.

## Overview

The preparation workflow consists of three main steps:
1. **Download genomic data** using `download_with_rsync()`
2. **Export mRNA sequences** to FASTA format using `export_mrna_to_fasta()`
3. **Create BLAST databases** using `create_blast_databases()`

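Each function takes a species identifier (for example `dmel`) and writes its files under the `input/{species_identifier}/` directory structure. Between steps 1 and 2, the examples below also build a transcriptome object with `update_transcriptome_object()`, which `export_mrna_to_fasta()` takes as input.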
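
As a rough guide to that layout, the sketch below reconstructs the directories that appear in the logs later in this notebook. The helper name `expected_layout` is hypothetical and not part of `hcrfish`; the directory names are taken from the logged output, so treat them as observed behavior rather than a documented contract.

```python
from pathlib import Path


def expected_layout(species_identifier: str, root: str = "input") -> dict[str, Path]:
    """Return the directories this notebook's logs show for one species.

    Hypothetical helper for illustration; it does not call hcrfish.
    """
    base = Path(root) / species_identifier
    transcriptome = base / "transcriptome"
    return {
        "genome": base / "genome",
        "transcriptome": transcriptome,
        "mature_mrna": transcriptome / "mRNA_no_introns",
        "pre_mrna": transcriptome / "mRNA_yes_introns",
    }


for name, path in expected_layout("dmel").items():
    print(f"{name:>13}: {path.as_posix()}")
```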

In [4]:
import os
import sys
import pandas as pd

# Add the src directory to Python path for development mode
import pathlib
repo_root = pathlib.Path.cwd().parent if pathlib.Path.cwd().name == 'docs' else pathlib.Path.cwd()
src_path = repo_root / 'src'
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Import the preparation functions
from hcrfish.hcr.prep import download_with_rsync, export_mrna_to_fasta, create_blast_databases

# Import transcriptomics functions for building transcriptome objects
from hcrfish.transcriptomics import update_transcriptome_object, load_transcriptome_object, check_exons_contain_all_features

## Example 1: Drosophila melanogaster

Let's start by setting up data for Drosophila melanogaster (species identifier: `dmel`).

### Step 1: Download Genome Data

The `download_with_rsync()` function downloads the file, decompresses it, and places it in the species directory. It returns the path to the decompressed file.

In [5]:
# Download D. melanogaster genome
species_id = "dmel"

# Download genome sequence
genome_rsync_path = "rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz"
genome_path = download_with_rsync(
    rsync_path=genome_rsync_path,
    species_identifier=species_id,
    file_type="genome"
)

print(f"Genome downloaded to: {genome_path}")

Downloading rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz
Destination: input/dmel/genome/dm6.fa.gz
Download completed successfully
Decompressing input/dmel/genome/dm6.fa.gz
File ready at: input/dmel/genome/dm6.fa
Genome downloaded to: input/dmel/genome/dm6.fa
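
The log suggests that the destination is derived from the species identifier, the file type, and the file name at the end of the rsync URL. The sketch below reproduces that path arithmetic and builds an `rsync` command without running it. `plan_download` is a hypothetical helper, and the `-avP` flags are an assumption; the flags that `download_with_rsync()` actually uses are not shown in this notebook.

```python
from pathlib import PurePosixPath


def plan_download(rsync_path: str, species_identifier: str, file_type: str):
    """Return (command, compressed_path, final_path) for one download.

    Hypothetical helper: it only computes paths and never touches the network.
    """
    filename = rsync_path.rsplit("/", 1)[-1]
    dest_dir = PurePosixPath("input") / species_identifier / file_type
    compressed = dest_dir / filename
    # Dropping a trailing ".gz" mirrors the "Decompressing ..." step in the log.
    final = compressed.with_suffix("") if compressed.suffix == ".gz" else compressed
    # Assumed flags: archive mode, verbose, keep partial files and show progress.
    command = ["rsync", "-avP", rsync_path, str(compressed)]
    return command, compressed, final


command, compressed, final = plan_download(
    "rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz",
    species_identifier="dmel",
    file_type="genome",
)
print(" ".join(command))
print(final)
```

The D. yakuba log later in this notebook also shows a `.fna` file being renamed to `.fa`; this sketch leaves that step out.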


### Step 2: Download Transcriptome Annotation

In [6]:
# Download transcriptome annotation (GTF file)
transcriptome_rsync_path = "rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/genes/dm6.ncbiRefSeq.gtf.gz"
transcriptome_path = download_with_rsync(
    rsync_path=transcriptome_rsync_path,
    species_identifier=species_id,
    file_type="transcriptome"
)

print(f"Transcriptome annotation downloaded to: {transcriptome_path}")

Downloading rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/genes/dm6.ncbiRefSeq.gtf.gz
Destination: input/dmel/transcriptome/dm6.ncbiRefSeq.gtf.gz
Download completed successfully
Decompressing input/dmel/transcriptome/dm6.ncbiRefSeq.gtf.gz
File ready at: input/dmel/transcriptome/dm6.ncbiRefSeq.gtf
Transcriptome annotation downloaded to: input/dmel/transcriptome/dm6.ncbiRefSeq.gtf


### Step 3: Build Transcriptome Object

Before we can export mRNA sequences, we need to build the transcriptome object that contains all gene and transcript information.

In [7]:
# Build transcriptome object
transcriptome_object_name = "dmel_transcriptome_demo"
update_transcriptome_object(
    genome_path=genome_path,
    transcriptome_path=transcriptome_path,
    output_filename=transcriptome_object_name,
    species=species_id
)

Found 17868 unique genes.


100%|██████████| 17868/17868 [00:23<00:00, 746.98it/s] 


Transcriptome(genes=17868)
Transcriptome object has been updated and saved to dmel_transcriptome_demo.pkl


In [8]:
# Load the transcriptome object
transcriptome = load_transcriptome_object(transcriptome_object_name)
print(f"Loaded transcriptome with {len(transcriptome.genes)} genes")

Loaded transcriptome with 17868 genes


In [9]:
# Verify the transcriptome structure
check_exons_contain_all_features(transcriptome)

### Step 4: Export mRNA Sequences to FASTA

The `export_mrna_to_fasta()` function creates two FASTA files: one with mature mRNA sequences (no introns) and one with pre-mRNA sequences (with introns).

In [10]:
# Export mRNA sequences to FASTA files
no_introns_fasta, yes_introns_fasta = export_mrna_to_fasta(
    transcriptome=transcriptome,
    species_identifier=species_id
)

print(f"Mature mRNA sequences exported to: {no_introns_fasta}")
print(f"Pre-mRNA sequences exported to: {yes_introns_fasta}")

Exporting mRNA sequences for 17868 genes...
Exported 35121 transcripts to input/dmel/transcriptome/mRNA_no_introns/mRNA_no_introns.fasta
Exported 35121 transcripts to input/dmel/transcriptome/mRNA_yes_introns/mRNA_yes_introns.fasta
Mature mRNA sequences exported to: input/dmel/transcriptome/mRNA_no_introns/mRNA_no_introns.fasta
Pre-mRNA sequences exported to: input/dmel/transcriptome/mRNA_yes_introns/mRNA_yes_introns.fasta
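
The difference between the two files comes down to whether introns are spliced out. The toy example below shows this on a made-up sequence. The coordinates are 0-based, half-open, and on the plus strand, which is a simplifying assumption and not necessarily how `hcrfish` stores exons; the function names are illustrative only.

```python
def pre_mrna(genome_seq: str, exons: list[tuple[int, int]]) -> str:
    """Return the unspliced transcript: first exon start to last exon end."""
    start = min(s for s, _ in exons)
    end = max(e for _, e in exons)
    return genome_seq[start:end]


def mature_mrna(genome_seq: str, exons: list[tuple[int, int]]) -> str:
    """Return the spliced transcript: exons joined in order, introns removed."""
    return "".join(genome_seq[s:e] for s, e in sorted(exons))


# Made-up sequence: uppercase exons, lowercase intron, dots as flanking sequence.
toy_genome = "..ATGGCCgtaagtcagTTCTAA.."
toy_exons = [(2, 8), (17, 23)]

print(pre_mrna(toy_genome, toy_exons))     # ATGGCCgtaagtcagTTCTAA
print(mature_mrna(toy_genome, toy_exons))  # ATGGCCTTCTAA
```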


### Step 5: Create BLAST Databases

The final step is to create BLAST databases from the FASTA files. These databases are used during probe design to identify potential off-target binding sites.

In [11]:
# Create BLAST databases
fasta_paths = (no_introns_fasta, yes_introns_fasta)
no_introns_db, yes_introns_db = create_blast_databases(
    fasta_paths=fasta_paths,
    species_identifier=species_id
)

print(f"Mature mRNA BLAST database: {no_introns_db}")
print(f"Pre-mRNA BLAST database: {yes_introns_db}")

Using makeblastdb version: makeblastdb: 2.15.0+
 Package: blast 2.15.0, build Oct 19 2023 15:16:13
Creating BLAST databases...
Creating database: input/dmel/transcriptome/mRNA_no_introns/mRNA_no_introns
Mature mRNA database created successfully
Creating database: input/dmel/transcriptome/mRNA_yes_introns/mRNA_yes_introns
Pre-mRNA database created successfully
BLAST database creation completed
Mature mRNA BLAST database: input/dmel/transcriptome/mRNA_no_introns/mRNA_no_introns
Pre-mRNA BLAST database: input/dmel/transcriptome/mRNA_yes_introns/mRNA_yes_introns
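
For readers who want to see what database creation involves, the sketch below builds a `makeblastdb` command for a nucleotide database. `-in`, `-dbtype`, and `-out` are standard BLAST+ options; whether `create_blast_databases()` passes exactly these is an assumption. The helper name is hypothetical, and the command is only executed if `makeblastdb` is on the `PATH` and the FASTA file exists.

```python
import shutil
import subprocess
from pathlib import Path


def makeblastdb_command(fasta_path: str) -> list[str]:
    """Build a makeblastdb call whose output prefix is the FASTA path minus its suffix."""
    db_prefix = Path(fasta_path).with_suffix("").as_posix()
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_prefix]


fasta = "input/dmel/transcriptome/mRNA_no_introns/mRNA_no_introns.fasta"
command = makeblastdb_command(fasta)
print(" ".join(command))

# Only execute when the tool and the input file are both present.
if shutil.which("makeblastdb") and Path(fasta).exists():
    subprocess.run(command, check=True)
```

Using the FASTA path without its extension as the output prefix matches the database paths printed in the log above.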


### Verify Directory Structure

Let's examine the directory structure that was created:

In [12]:
# Display the directory structure
import os
from pathlib import Path

base_dir = Path("input") / species_id

print(f"Directory structure for {species_id}:")
print("="*50)

if base_dir.exists():
    for root, dirs, files in os.walk(base_dir):
        level = root.replace(str(base_dir), '').count(os.sep)
        indent = ' ' * 2 * level
        print(f"{indent}{os.path.basename(root)}/")
        subindent = ' ' * 2 * (level + 1)
        for file in files[:5]:  # Show first 5 files only
            print(f"{subindent}{file}")
        if len(files) > 5:
            print(f"{subindent}... and {len(files) - 5} more files")
else:
    print(f"Directory {base_dir} does not exist")

Directory structure for dmel:
==================================================
dmel/
  transcriptome/
    dm6.ncbiRefSeq.gtf
    mRNA_yes_introns/
      mRNA_yes_introns.nto
      mRNA_yes_introns.ntf
      mRNA_yes_introns.not
      mRNA_yes_introns.nos
      mRNA_yes_introns.nog
      ... and 6 more files
    mRNA_no_introns/
      mRNA_no_introns.njs
      mRNA_no_introns.nin
      mRNA_no_introns.fasta
      mRNA_no_introns.nsq
      mRNA_no_introns.nos
      ... and 6 more files
  genome/
    dm6.fa


## Example 2: Drosophila yakuba

Now let's run the same workflow for a second species, Drosophila yakuba. The NCBI genome file ends in `.fna`; as the log below shows, it is renamed to `.fa` after decompression.

### Download D. yakuba Data

In [13]:
# Set species identifier
dyak_species_id = "dyak"

# Download genome
dyak_genome_rsync = "rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/746/365/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fna.gz"
dyak_genome_path = download_with_rsync(
    rsync_path=dyak_genome_rsync,
    species_identifier=dyak_species_id,
    file_type="genome"
)

print(f"D. yakuba genome downloaded to: {dyak_genome_path}")

Downloading rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/746/365/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fna.gz
Destination: input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fna.gz
Download completed successfully
Decompressing input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fna.gz
Renamed to input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fa
File ready at: input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fa
D. yakuba genome downloaded to: input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fa


In [14]:
# Download transcriptome annotation
dyak_transcriptome_rsync = "rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/746/365/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf.gz"
dyak_transcriptome_path = download_with_rsync(
    rsync_path=dyak_transcriptome_rsync,
    species_identifier=dyak_species_id,
    file_type="transcriptome"
)

print(f"D. yakuba transcriptome downloaded to: {dyak_transcriptome_path}")

Downloading rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/746/365/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf.gz
Destination: input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf.gz
Download completed successfully
Decompressing input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf.gz
File ready at: input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf
D. yakuba transcriptome downloaded to: input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf


### Build D. yakuba Transcriptome and Export Data

In [15]:
# Build transcriptome object
dyak_transcriptome_object_name = "dyak_transcriptome_demo"
update_transcriptome_object(
    genome_path=dyak_genome_path,
    transcriptome_path=dyak_transcriptome_path,
    output_filename=dyak_transcriptome_object_name,
    species=dyak_species_id
)

# Load transcriptome
dyak_transcriptome = load_transcriptome_object(dyak_transcriptome_object_name)
print(f"D. yakuba transcriptome loaded with {len(dyak_transcriptome.genes)} genes")

Found 16150 unique genes.


100%|██████████| 16150/16150 [00:21<00:00, 755.72it/s] 


Transcriptome(genes=16150)
Transcriptome object has been updated and saved to dyak_transcriptome_demo.pkl
D. yakuba transcriptome loaded with 16150 genes


In [16]:
# Export mRNA sequences and create BLAST databases
dyak_fasta_paths = export_mrna_to_fasta(
    transcriptome=dyak_transcriptome,
    species_identifier=dyak_species_id
)

dyak_db_paths = create_blast_databases(
    fasta_paths=dyak_fasta_paths,
    species_identifier=dyak_species_id
)

print("D. yakuba BLAST databases created:")
print(f"  - Mature mRNA: {dyak_db_paths[0]}")
print(f"  - Pre-mRNA: {dyak_db_paths[1]}")

Exporting mRNA sequences for 16150 genes...
Exported 28247 transcripts to input/dyak/transcriptome/mRNA_no_introns/mRNA_no_introns.fasta
Exported 28247 transcripts to input/dyak/transcriptome/mRNA_yes_introns/mRNA_yes_introns.fasta
Using makeblastdb version: makeblastdb: 2.15.0+
 Package: blast 2.15.0, build Oct 19 2023 15:16:13
Creating BLAST databases...
Creating database: input/dyak/transcriptome/mRNA_no_introns/mRNA_no_introns
Mature mRNA database created successfully
Creating database: input/dyak/transcriptome/mRNA_yes_introns/mRNA_yes_introns
Pre-mRNA database created successfully
BLAST database creation completed
D. yakuba BLAST databases created:
  - Mature mRNA: input/dyak/transcriptome/mRNA_no_introns/mRNA_no_introns
  - Pre-mRNA: input/dyak/transcriptome/mRNA_yes_introns/mRNA_yes_introns
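
Because the two examples repeat the same sequence of calls, they can be folded into one function. The sketch below is one way to do that. `SpeciesSource` and `prepare_species` are hypothetical names; the keyword arguments mirror the calls shown in this notebook. The `hcrfish` functions are passed in as parameters so the sketch can be read and tested without the package installed.

```python
from typing import Callable, NamedTuple


class SpeciesSource(NamedTuple):
    species_id: str
    genome_rsync: str
    annotation_rsync: str


def prepare_species(
    source: SpeciesSource,
    download: Callable,
    build: Callable,
    load: Callable,
    export: Callable,
    make_databases: Callable,
) -> tuple:
    """Run download -> build -> load -> export -> BLAST for one species."""
    genome_path = download(
        rsync_path=source.genome_rsync,
        species_identifier=source.species_id,
        file_type="genome",
    )
    annotation_path = download(
        rsync_path=source.annotation_rsync,
        species_identifier=source.species_id,
        file_type="transcriptome",
    )
    # Object name follows the "<species>_transcriptome_demo" pattern used above.
    object_name = f"{source.species_id}_transcriptome_demo"
    build(
        genome_path=genome_path,
        transcriptome_path=annotation_path,
        output_filename=object_name,
        species=source.species_id,
    )
    transcriptome = load(object_name)
    fasta_paths = export(transcriptome=transcriptome, species_identifier=source.species_id)
    return make_databases(fasta_paths=fasta_paths, species_identifier=source.species_id)
```

In this notebook the call would pass `download_with_rsync`, `update_transcriptome_object`, `load_transcriptome_object`, `export_mrna_to_fasta`, and `create_blast_databases`, in that order, after the `SpeciesSource`.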


## Summary and Analysis

Let's compare the two species and analyze the data we've prepared.

In [17]:
# Compare transcriptome sizes
comparison_data = {
    'Species': ['D. melanogaster', 'D. yakuba'],
    'Species ID': [species_id, dyak_species_id],
    'Number of Genes': [len(transcriptome.genes), len(dyak_transcriptome.genes)]
}

comparison_df = pd.DataFrame(comparison_data)
print("Transcriptome Comparison:")
print(comparison_df.to_string(index=False))

Transcriptome Comparison:
        Species Species ID  Number of Genes
D. melanogaster       dmel            17868
      D. yakuba       dyak            16150
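
The export logs earlier in this notebook also report transcript counts (35121 for D. melanogaster and 28247 for D. yakuba), so the average number of transcripts per gene can be worked out alongside the gene counts. The numbers below are copied from the logged output; the function name is illustrative.

```python
def transcripts_per_gene(genes: int, transcripts: int) -> float:
    """Average number of exported transcripts per gene, rounded to two decimals."""
    return round(transcripts / genes, 2)


# (genes, transcripts) as reported in the logs of this notebook.
logged_counts = {"dmel": (17868, 35121), "dyak": (16150, 28247)}

for species, (genes, transcripts) in logged_counts.items():
    print(f"{species}: {transcripts_per_gene(genes, transcripts)} transcripts per gene")
```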


In [18]:
# Analyze chromosome distribution for D. melanogaster
print("\nD. melanogaster chromosome distribution:")
chromosomes = {}
for gene_name, gene in transcriptome.genes.items():
    chrom = gene.chromosome
    chromosomes[chrom] = chromosomes.get(chrom, 0) + 1

chrom_df = pd.DataFrame(list(chromosomes.items()), columns=['Chromosome', 'Gene Count'])
chrom_df = chrom_df.sort_values('Gene Count', ascending=False).head(10)
print(chrom_df.to_string(index=False))


D. melanogaster chromosome distribution:
      Chromosome  Gene Count
           chr3R        4223
           chr2R        3652
           chr2L        3515
           chr3L        3486
            chrX        2689
            chr4         114
            chrY         113
            chrM          37
chrUn_CP007120v1          21
chrUn_CP007081v1           2
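
The same tally can be written more compactly with `collections.Counter`. The sketch below uses stand-in gene objects so that it runs without a transcriptome; only the `.chromosome` attribute and the dict-like `genes` mapping are taken from the cell above, and the gene names here are made up.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in for transcriptome.genes: a mapping of gene name -> object with .chromosome.
toy_genes = {
    "geneA": SimpleNamespace(chromosome="chr2L"),
    "geneB": SimpleNamespace(chromosome="chr2L"),
    "geneC": SimpleNamespace(chromosome="chrX"),
}

counts = Counter(gene.chromosome for gene in toy_genes.values())

# most_common() returns (chromosome, count) pairs sorted by descending count.
for chrom, n in counts.most_common(10):
    print(f"{chrom}\t{n}")
```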


## File Size Analysis

Let's check the sizes of the files we've created to understand the data volume.

In [19]:
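def get_file_size_mb(filepath):
    """Return the file size in MB, or "N/A" if the file cannot be read."""
    try:
        size_bytes = os.path.getsize(filepath)
        return round(size_bytes / (1024 * 1024), 2)
    except OSError:
        # Missing or unreadable file; avoid a bare except that hides other errors
        return "N/A"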

# Analyze file sizes for D. melanogaster
print("D. melanogaster file sizes:")
print(f"Genome FASTA: {get_file_size_mb(genome_path)} MB")
print(f"Transcriptome GTF: {get_file_size_mb(transcriptome_path)} MB")
print(f"Mature mRNA FASTA: {get_file_size_mb(no_introns_fasta)} MB")
print(f"Pre-mRNA FASTA: {get_file_size_mb(yes_introns_fasta)} MB")

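# Check BLAST database files (only a subset of the extensions makeblastdb writes)
blast_files_no_introns = [f"{no_introns_db}.{ext}" for ext in ['ndb', 'nhr', 'nin', 'not', 'nsq', 'ntf', 'nto']]
# Look up each size once, then sum the files that exist
blast_sizes = [get_file_size_mb(f) for f in blast_files_no_introns]
total_blast_size = round(sum(size for size in blast_sizes if size != "N/A"), 2)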
print(f"BLAST database (mature mRNA): {total_blast_size} MB")

D. melanogaster file sizes:
Genome FASTA: 139.85 MB
Transcriptome GTF: 83.23 MB
Mature mRNA FASTA: 90.4 MB
Pre-mRNA FASTA: 329.32 MB
BLAST database (mature mRNA): 28.58 MB


## Next Steps

Now that you have prepared the genomic data, you can proceed with HCR-FISH probe design using the functions in `hcrfish.hcr.utils`:

1. **Check probe availability**: Use `check_probe_availability()` to see how many probes can be designed for specific genes
2. **BLAST analysis**: Use `blast_gene()` to identify off-target binding sites
3. **Probe design**: Use `get_probes_IDT()` to design and export probe sequences
4. **Visualization**: Use `export_probe_binding_regions_plot()` to visualize probe locations

## Key Benefits of the Preparation Functions

The functions in `hcrfish.hcr.prep` provide several advantages:

- **Automated workflow**: No need to manually handle downloads, decompression, and file organization
- **Error handling**: Robust error checking and informative error messages
- **Species organization**: Automatic organization by species identifier
- **Consistency**: Standardized file formats and naming conventions
- **Progress reporting**: Clear feedback on download and processing progress
- **Cross-platform compatibility**: Works on macOS, Linux, and Windows, provided `rsync`, `gunzip`, and the BLAST+ tools are installed and on the `PATH`

## Requirements

To use these functions, ensure you have:
- `rsync` for downloading files
- `gunzip` for decompressing .gz files
- BLAST+ tools (`makeblastdb`) for creating databases
- Sufficient disk space for genomic data files
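
A quick way to confirm these requirements before running the workflow is to look each tool up on the `PATH` and check free disk space. This is a minimal sketch using only the standard library; the helper name `check_requirements` is hypothetical, and the 2 GB threshold is an arbitrary placeholder, not a figure from `hcrfish`.

```python
import shutil

REQUIRED_TOOLS = ("rsync", "gunzip", "makeblastdb")


def check_requirements(tools=REQUIRED_TOOLS, path=".", min_free_gb=2.0):
    """Return (tool -> resolved path or None, whether free space meets the threshold)."""
    found = {tool: shutil.which(tool) for tool in tools}
    free_gb = shutil.disk_usage(path).free / 1024**3
    return found, free_gb >= min_free_gb


found, enough_space = check_requirements()
for tool, location in found.items():
    print(f"{tool}: {location or 'NOT FOUND'}")
print(f"Enough free disk space: {enough_space}")
```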