# ProbePy: Data Preparation Tutorial

This notebook demonstrates how to prepare input data for ProbePy probe design. ProbePy's preparation functions streamline downloading genomic data, exporting transcriptome sequences, and creating BLAST databases.

## Overview

The preparation workflow consists of four main steps:
1. **Download genomic data** (genome sequence and transcriptome annotation) using `download_with_rsync()`
2. **Build the transcriptome object** using `update_transcriptome_object()`
3. **Export mRNA sequences** to FASTA format using `export_mrna_to_fasta()`
4. **Create BLAST databases** using `create_blast_databases()`

These functions are designed to work with the species identifier system and organize files using the `input/{species_identifier}/` directory structure.
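
As a rough illustration of that layout, the sketch below builds the paths that the log messages in this notebook refer to. `species_paths` is a hypothetical helper written for this tutorial, not part of ProbePy, and the file names are inferred from the printed output rather than from the library's source.

```python
from pathlib import Path


def species_paths(species_identifier, base_dir="."):
    """Return the per-species locations used in this tutorial (hypothetical helper)."""
    root = Path(base_dir) / "input" / species_identifier
    transcriptome = root / "transcriptome"
    return {
        "root": root,
        "genome_dir": root / "genome",
        "transcriptome_dir": transcriptome,
        "transcriptome_pickle": root / f"{species_identifier}_transcriptome.pkl",
        "mature_fasta": transcriptome / "mRNA_no_introns" / "mRNA_no_introns.fasta",
        "pre_mrna_fasta": transcriptome / "mRNA_yes_introns" / "mRNA_yes_introns.fasta",
    }


paths = species_paths("dmel", base_dir="..")
print(paths["transcriptome_pickle"].as_posix())  # ../input/dmel/dmel_transcriptome.pkl
```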

In [2]:
import os
import sys
from pathlib import Path
import pandas as pd
import probepy

## Setup and Configuration

First, let's verify that BLAST tools are available and set up our working directories.

In [3]:
probepy.check_blast_tools()

[OK] makeblastdb: makeblastdb: 2.15.0+
[OK] blastn: blastn: 2.15.0+


{'makeblastdb': {'available': True, 'version': 'makeblastdb: 2.15.0+'},
 'blastn': {'available': True, 'version': 'blastn: 2.15.0+'}}

If the BLAST tools are unavailable, run:

In [4]:
probepy.install_blast_tools()

[OK] BLAST+ tools are already installed:
makeblastdb: 2.15.0+
 Package: blast 2.15.0, build Oct 19 2023 15:16:13
BLAST+ tools are already installed.


## Example 1: Drosophila melanogaster

Let's start by setting up data for Drosophila melanogaster (species identifier: 'dmel').

### Step 1: Download Genome Data

The `download_with_rsync()` function handles the complete download process including decompression and file organization.

In [13]:
# Download D. melanogaster genome
species_id = "dmel"
genome_rsync_path = "rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz"
probepy.download_with_rsync(
    rsync_path=genome_rsync_path,
    species_identifier=species_id,
    file_type="genome",
    base_dir="../",
    overwrite=True
)

Downloading rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz
Destination: ../input/dmel/genome/dm6.fa.gz
Final processed file will be: ../input/dmel/genome/dm6.fa
Download completed successfully
Decompressing ../input/dmel/genome/dm6.fa.gz
Removing existing decompressed file: ../input/dmel/genome/dm6.fa
Successfully decompressed to: ../input/dmel/genome/dm6.fa
File ready at: ../input/dmel/genome/dm6.fa
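
The decompression step in the log above can be approximated with the standard library alone. This is a minimal sketch, not ProbePy's implementation: `decompress_gz` is a hypothetical name, the rsync transfer itself is not shown, and unlike `gunzip` the sketch leaves the `.gz` archive in place.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(gz_path):
    """Decompress a .gz file next to itself and return the new path (illustrative)."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # e.g. "dm6.fa.gz" -> "dm6.fa"
    # Stream in chunks so multi-gigabyte genomes are never held in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Calling `decompress_gz("../input/dmel/genome/dm6.fa.gz")` would write `dm6.fa` alongside the archive, replacing any existing file of that name.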


### Step 2: Download Transcriptome Annotation

In [14]:
# Download transcriptome annotation (GTF file)
transcriptome_rsync_path = "rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/genes/dm6.ncbiRefSeq.gtf.gz"
probepy.download_with_rsync(
    rsync_path=transcriptome_rsync_path,
    species_identifier=species_id,
    file_type="transcriptome",
    base_dir="../",
    overwrite=True
)

Downloading rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/genes/dm6.ncbiRefSeq.gtf.gz
Destination: ../input/dmel/transcriptome/dm6.ncbiRefSeq.gtf.gz
Final processed file will be: ../input/dmel/transcriptome/dm6.ncbiRefSeq.gtf
Download completed successfully
Decompressing ../input/dmel/transcriptome/dm6.ncbiRefSeq.gtf.gz
Removing existing decompressed file: ../input/dmel/transcriptome/dm6.ncbiRefSeq.gtf
Successfully decompressed to: ../input/dmel/transcriptome/dm6.ncbiRefSeq.gtf
File ready at: ../input/dmel/transcriptome/dm6.ncbiRefSeq.gtf


### Step 3: Build Transcriptome Object

Before we can export mRNA sequences, we need to build the transcriptome object that contains all gene and transcript information.

In [15]:
genome_path = "../input/dmel/genome/dm6.fa"
transcriptome_path = "../input/dmel/transcriptome/dm6.ncbiRefSeq.gtf"
species_id = "dmel"

In [16]:
# Build transcriptome object
probepy.update_transcriptome_object(
    genome_path=genome_path,
    transcriptome_path=transcriptome_path,
    species_identifier=species_id,
    base_dir="../",
    overwrite=True
)

Found 17868 unique genes.


100%|██████████| 17868/17868 [00:28<00:00, 623.56it/s] 


Transcriptome(genes=17868)
Transcriptome object has been updated and saved to ../input/dmel/dmel_transcriptome.pkl
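
To give a feel for where a count such as "Found 17868 unique genes" might come from, the sketch below collects `gene_id` values from the attribute column of GTF records. It is an assumption about the general approach, not ProbePy's parser; `unique_gene_ids` and the toy records are made up for illustration.

```python
import re

# GTF attributes look like: gene_id "X"; transcript_id "Y";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def unique_gene_ids(gtf_lines):
    """Collect gene_id values from GTF lines, skipping comments and malformed rows."""
    genes = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # a GTF record has nine tab-separated columns
        match = GENE_ID.search(fields[8])
        if match:
            genes.add(match.group(1))
    return genes


# Toy records, not real annotation data.
toy_gtf = [
    "# toy annotation\n",
    'chrT\ttoy\texon\t1\t10\t.\t+\t.\tgene_id "geneA"; transcript_id "A.1";\n',
    'chrT\ttoy\texon\t20\t30\t.\t+\t.\tgene_id "geneA"; transcript_id "A.1";\n',
    'chrT\ttoy\texon\t50\t60\t.\t-\t.\tgene_id "geneB"; transcript_id "B.1";\n',
]
print(len(unique_gene_ids(toy_gtf)))  # 2
```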


In [17]:
# Load the transcriptome object
transcriptome_dmel = probepy.load_transcriptome_object("dmel", base_dir="../")
print(f"Loaded transcriptome with {len(transcriptome_dmel.genes)} genes")

Loaded transcriptome object from ../input/dmel/dmel_transcriptome.pkl
Loaded transcriptome with 17868 genes


In [18]:
# Verify the transcriptome structure
probepy.check_exons_contain_all_features(transcriptome_dmel)

### Step 4: Export mRNA Sequences to FASTA

The `export_mrna_to_fasta()` function creates two FASTA files: one with mature mRNA sequences (no introns) and one with pre-mRNA sequences (with introns).

In [19]:
# Export mRNA sequences to FASTA files
probepy.export_mrna_to_fasta(
    transcriptome=transcriptome_dmel,
    species_identifier=species_id,
    base_dir="../",
    overwrite=True
)

Exporting mRNA sequences for 17868 genes...
Exported 35121 transcripts to ../input/dmel/transcriptome/mRNA_no_introns/mRNA_no_introns.fasta
Exported 35121 transcripts to ../input/dmel/transcriptome/mRNA_yes_introns/mRNA_yes_introns.fasta
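
The difference between the two exported files can be shown on a toy sequence. In this sketch, which assumes 1-based inclusive exon coordinates as in GTF, the mature mRNA is the concatenation of the exons and the pre-mRNA is the full span from the first exon to the last. `transcript_sequences` is a hypothetical function and may differ from how ProbePy handles details such as strand or UTRs.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def transcript_sequences(chrom_seq, exons, strand="+"):
    """Return (mature, pre_mrna) for exons given as 1-based inclusive (start, end) pairs."""
    exons = sorted(exons)
    # Mature mRNA: exons joined together, introns dropped.
    mature = "".join(chrom_seq[start - 1:end] for start, end in exons)
    # Pre-mRNA: everything from the first exon start to the last exon end.
    pre_mrna = chrom_seq[exons[0][0] - 1:exons[-1][1]]
    if strand == "-":
        mature, pre_mrna = reverse_complement(mature), reverse_complement(pre_mrna)
    return mature, pre_mrna


toy_chrom = "AAACCCGGGTTT"  # made-up sequence; "CCC" plays the role of an intron
mature, pre_mrna = transcript_sequences(toy_chrom, [(1, 3), (7, 9)])
print(mature)    # AAAGGG
print(pre_mrna)  # AAACCCGGG
```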


### Step 5: Create BLAST Databases

The final step is to create BLAST databases from the FASTA files. These databases are used during probe design to identify potential off-target binding sites.

In [20]:
# Create BLAST databases
probepy.create_blast_databases(
    species_identifier=species_id, 
    base_dir="../"
)

Using makeblastdb version: makeblastdb: 2.15.0+
 Package: blast 2.15.0, build Oct 19 2023 15:16:13
Creating BLAST databases...
Creating database: ../input/dmel/transcriptome/mRNA_no_introns/mRNA_no_introns
Mature mRNA database created successfully
Creating database: ../input/dmel/transcriptome/mRNA_yes_introns/mRNA_yes_introns
Pre-mRNA database created successfully
BLAST database creation completed
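
For readers who want to see roughly what happens underneath, a nucleotide database can be built by calling `makeblastdb` directly. The sketch below uses only the standard `-in`, `-dbtype`, and `-out` options; the helper names are hypothetical, and ProbePy may pass additional options that are not shown here.

```python
import subprocess


def makeblastdb_command(fasta_path, db_prefix):
    """Build the argument list for a nucleotide BLAST database (illustrative)."""
    return [
        "makeblastdb",
        "-in", str(fasta_path),   # input FASTA file
        "-dbtype", "nucl",        # nucleotide sequences
        "-out", str(db_prefix),   # prefix shared by the database files
    ]


def build_database(fasta_path, db_prefix):
    """Run makeblastdb and raise CalledProcessError if it exits with an error."""
    subprocess.run(makeblastdb_command(fasta_path, db_prefix), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.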


### Verify Directory Structure

Let's examine the directory structure that was created:

In [None]:
# Directory Structure
! ls -R ../input/dmel/

dmel_transcriptome.pkl genome                 transcriptome

../input/dmel/genome:
dm6.fa

../input/dmel/transcriptome:
GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf.gz
dm6.ncbiRefSeq.gtf
mRNA_no_introns
mRNA_yes_introns

../input/dmel/transcriptome/mRNA_no_introns:
mRNA_no_introns.fasta mRNA_no_introns.njs   mRNA_no_introns.nsq
mRNA_no_introns.ndb   mRNA_no_introns.nog   mRNA_no_introns.ntf
mRNA_no_introns.nhr   mRNA_no_introns.nos   mRNA_no_introns.nto
mRNA_no_introns.nin   mRNA_no_introns.not

../input/dmel/transcriptome/mRNA_yes_introns:
mRNA_yes_introns.fasta mRNA_yes_introns.njs   mRNA_yes_introns.nsq
mRNA_yes_introns.ndb   mRNA_yes_introns.nog   mRNA_yes_introns.ntf
mRNA_yes_introns.nhr   mRNA_yes_introns.nos   mRNA_yes_introns.nto
mRNA_yes_introns.nin   mRNA_yes_introns.not
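
The same check can be scripted so that a missing file is reported rather than spotted by eye. This sketch assumes the layout in the listing above and looks only for the transcriptome pickle, the two FASTA files, and the three core nucleotide database files (`.nhr`, `.nin`, `.nsq`); `missing_outputs` is a hypothetical helper, not a ProbePy function.

```python
from pathlib import Path

CORE_DB_SUFFIXES = (".nhr", ".nin", ".nsq")  # header, index, and sequence files


def missing_outputs(species_identifier, base_dir="."):
    """Return the expected output files that do not exist yet (illustrative)."""
    root = Path(base_dir) / "input" / species_identifier
    expected = [root / f"{species_identifier}_transcriptome.pkl"]
    for name in ("mRNA_no_introns", "mRNA_yes_introns"):
        folder = root / "transcriptome" / name
        expected.append(folder / f"{name}.fasta")
        expected.extend(folder / f"{name}{suffix}" for suffix in CORE_DB_SUFFIXES)
    return [path for path in expected if not path.exists()]
```

An empty list from `missing_outputs("dmel", base_dir="..")` would indicate that every file checked by the sketch is present.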


## Example 2: Drosophila yakuba

Now let's demonstrate the same workflow with a different species, Drosophila yakuba.

### Download D. yakuba Data

In [22]:
# Set species identifier
species_id = "dyak"

# Download genome
genome_rsync_path = "rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/746/365/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fna.gz"
probepy.download_with_rsync(
    rsync_path=genome_rsync_path,
    species_identifier=species_id,
    file_type="genome",
    base_dir="../",
    overwrite=True
)

Downloading rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/746/365/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fna.gz
Destination: ../input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fna.gz
Final processed file will be: ../input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fa
Download completed successfully
Decompressing ../input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fna.gz
Successfully decompressed to: ../input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fna
Renamed to ../input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fa
File ready at: ../input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fa


In [23]:
# Download transcriptome annotation
dyak_transcriptome_rsync = "rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/746/365/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf.gz"
probepy.download_with_rsync(
    rsync_path=dyak_transcriptome_rsync,
    species_identifier=species_id,
    file_type="transcriptome",
    base_dir="../",
    overwrite=True
)

Downloading rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/746/365/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf.gz
Destination: ../input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf.gz
Final processed file will be: ../input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf
Download completed successfully
Decompressing ../input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf.gz
Removing existing decompressed file: ../input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf
Successfully decompressed to: ../input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf
File ready at: ../input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf


### Build D. yakuba Transcriptome and Export Data

In [24]:
dyak_genome_path = "../input/dyak/genome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fa"
dyak_transcriptome_path = "../input/dyak/transcriptome/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf"
species_id = "dyak"

In [25]:
# Build transcriptome object
probepy.update_transcriptome_object(
    genome_path=dyak_genome_path,
    transcriptome_path=dyak_transcriptome_path,
    species_identifier=species_id,
    base_dir="../",
    overwrite=False
)

File ../input/dyak/dyak_transcriptome.pkl already exists. Set overwrite=True to overwrite it.
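
The message above is typical of an overwrite guard: with `overwrite=False`, an existing file is left untouched and the expensive rebuild is skipped. A minimal sketch of that pattern follows, using a hypothetical `save_object` helper and `pickle`; it illustrates the idea rather than ProbePy's actual code.

```python
import pickle
from pathlib import Path


def save_object(obj, path, overwrite=False):
    """Pickle obj to path unless the file exists and overwrite is False (illustrative)."""
    path = Path(path)
    if path.exists() and not overwrite:
        print(f"File {path} already exists. Set overwrite=True to overwrite it.")
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return True
```

Returning a boolean lets the caller tell whether the file was written or an earlier copy was kept.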


In [26]:
# Load the transcriptome object
transcriptome_dyak = probepy.load_transcriptome_object("dyak", base_dir="../")
print(f"Loaded transcriptome with {len(transcriptome_dyak.genes)} genes")

Loaded transcriptome object from ../input/dyak/dyak_transcriptome.pkl
Loaded transcriptome with 16150 genes


In [27]:
# Verify the transcriptome structure
probepy.check_exons_contain_all_features(transcriptome_dyak)

In [28]:
# Export mRNA sequences to FASTA files
probepy.export_mrna_to_fasta(
    transcriptome=transcriptome_dyak, 
    species_identifier=species_id, 
    base_dir="../",
    overwrite=True
)

Exporting mRNA sequences for 16150 genes...
Exported 28247 transcripts to ../input/dyak/transcriptome/mRNA_no_introns/mRNA_no_introns.fasta
Exported 28247 transcripts to ../input/dyak/transcriptome/mRNA_yes_introns/mRNA_yes_introns.fasta


In [29]:
# Create BLAST databases
probepy.create_blast_databases(
    species_identifier=species_id, 
    base_dir="../"
)

Using makeblastdb version: makeblastdb: 2.15.0+
 Package: blast 2.15.0, build Oct 19 2023 15:16:13
Creating BLAST databases...
Creating database: ../input/dyak/transcriptome/mRNA_no_introns/mRNA_no_introns
Mature mRNA database created successfully
Creating database: ../input/dyak/transcriptome/mRNA_yes_introns/mRNA_yes_introns
Pre-mRNA database created successfully
BLAST database creation completed


In [4]:
# Directory Structure
! ls -R ../input/dyak/

dyak_transcriptome.pkl genome                 transcriptome

../input/dyak/genome:
GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.fa

../input/dyak/transcriptome:
GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf
mRNA_no_introns
mRNA_yes_introns

../input/dyak/transcriptome/mRNA_no_introns:
mRNA_no_introns.fasta mRNA_no_introns.njs   mRNA_no_introns.nsq
mRNA_no_introns.ndb   mRNA_no_introns.nog   mRNA_no_introns.ntf
mRNA_no_introns.nhr   mRNA_no_introns.nos   mRNA_no_introns.nto
mRNA_no_introns.nin   mRNA_no_introns.not

../input/dyak/transcriptome/mRNA_yes_introns:
mRNA_yes_introns.fasta mRNA_yes_introns.njs   mRNA_yes_introns.nsq
mRNA_yes_introns.ndb   mRNA_yes_introns.nog   mRNA_yes_introns.ntf
mRNA_yes_introns.nhr   mRNA_yes_introns.nos   mRNA_yes_introns.nto
mRNA_yes_introns.nin   mRNA_yes_introns.not


## Summary and Analysis

Let's compare the two species and analyze the data we've prepared.

In [30]:
# Compare transcriptome sizes
comparison_data = {
    'Species': ['D. melanogaster', 'D. yakuba'],
    # List the IDs explicitly: species_id was reassigned to "dyak" in Example 2
    'Species ID': ['dmel', 'dyak'],
    'Number of Genes': [len(transcriptome_dmel.genes), len(transcriptome_dyak.genes)]
}

comparison_df = pd.DataFrame(comparison_data)
print("Transcriptome Comparison:")
print(comparison_df.to_string(index=False))

Transcriptome Comparison:
        Species Species ID  Number of Genes
D. melanogaster       dmel            17868
      D. yakuba       dyak            16150


In [31]:
# Analyze chromosome distribution for D. melanogaster
print("\nD. melanogaster chromosome distribution:")
chromosomes = {}
for gene_name, gene in transcriptome_dmel.genes.items():
    chrom = gene.chromosome
    chromosomes[chrom] = chromosomes.get(chrom, 0) + 1

chrom_df = pd.DataFrame(list(chromosomes.items()), columns=['Chromosome', 'Gene Count'])
chrom_df = chrom_df.sort_values('Gene Count', ascending=False).head(10)
print(chrom_df.to_string(index=False))


D. melanogaster chromosome distribution:
      Chromosome  Gene Count
           chr3R        4223
           chr2R        3652
           chr2L        3515
           chr3L        3486
            chrX        2689
            chr4         114
            chrY         113
            chrM          37
chrUn_CP007120v1          21
chrUn_CP007081v1           2
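
The counting loop above can also be written with `collections.Counter`, which handles both the tally and the top-N selection. The sketch uses a made-up gene-to-chromosome mapping in place of `transcriptome_dmel.genes`, so it runs without any ProbePy data.

```python
from collections import Counter


def top_chromosomes(gene_to_chrom, n=10):
    """Return the n chromosomes with the most genes as (chromosome, count) pairs."""
    return Counter(gene_to_chrom.values()).most_common(n)


# Toy mapping, not real annotation data.
toy_genes = {"geneA": "chr2L", "geneB": "chr2L", "geneC": "chrX"}
print(top_chromosomes(toy_genes))  # [('chr2L', 2), ('chrX', 1)]
```

With the real object, the mapping could be built as `{name: gene.chromosome for name, gene in transcriptome_dmel.genes.items()}`.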


## Next Steps

Now that you have prepared the genomic data, you can proceed with HCR-FISH probe design using the functions in `probepy.hcr.utils`:

1. **Check probe availability**: Use `check_probe_availability()` to see how many probes can be designed for specific genes
2. **BLAST analysis**: Use `blast_gene()` to identify off-target binding sites
3. **Probe design**: Use `get_probes_IDT()` to design and export probe sequences
4. **Visualization**: Use `export_probe_binding_regions_plot()` to visualize probe locations

## Key Benefits of the Preparation Functions

The functions in `probepy.hcr.prep` provide several advantages:

- **Automated workflow**: No need to manually handle downloads, decompression, and file organization
- **Error handling**: Robust error checking and informative error messages
- **Species organization**: Automatic organization by species identifier
- **Consistency**: Standardized file formats and naming conventions
- **Progress reporting**: Clear feedback on download and processing progress
- **Cross-platform compatibility**: Works on macOS, Linux, and Windows (with appropriate tools installed)

## Requirements

To use these functions, ensure you have:
- `rsync` for downloading files
- `gunzip` for decompressing .gz files
- BLAST+ tools (`makeblastdb`, `blastn`) for creating and searching databases
- Sufficient disk space for genomic data files
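
A quick way to confirm that the command-line requirements are on the `PATH` is `shutil.which`. This is a small standalone sketch, separate from `probepy.check_blast_tools()`; `find_tools` and the tool list are assumptions based on the requirements listed above.

```python
import shutil

REQUIRED_TOOLS = ("rsync", "gunzip", "makeblastdb", "blastn")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


for name, location in find_tools().items():
    print(f"{name}: {location or 'NOT FOUND'}")
```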