In [1]:
import os
import sys
import subprocess
import pandas as pd
import numpy as np
import pybedtools
from gene2probe import *

This tutorial explains how to create your own blast database.
We already provide one containing RefSeq exons, but you might want to make your own (e.g., because you work on a different species or want to consider different genes/features).
Simply adjust the fasta/gtf input files and the mode (gtf feature type to filter for).

In [2]:
fasta = '../hg38_resources/hg38.fa' ## Genome in fasta file
gtf = '../hg38_resources/hg38.ncbiRefSeq.gtf' ## Gene annotation in gtf file

mode = 'exon' ## Which feature to filter the annotation for (exon, CDS, gene)

out_dir = '../sample_run/001_blastdb/' ## Specify output directory
os.makedirs(out_dir, exist_ok=True) ## And make it if it doesn't exist
out_name = 'hg38_ncbiRefSeq_exons' ## Name for the blast database

In [3]:
## Path to blast binaries.
## Replace with your conda environment
## This can also be omitted if you started the jupyter session from within the gene2probe conda environment
blast_exec_path = '/nfs/team205/is10/miniconda/envs/gene2probe_env/bin/'
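
If you are unsure whether the blast binaries are reachable, a small check like the one below can help. This is a minimal sketch, not part of gene2probe: `resolve_blast_binary` is a hypothetical helper that first looks in an explicit directory (such as `blast_exec_path`) and otherwise falls back to whatever is on the `PATH`, which is the case when the session was started from within the conda environment.

```python
import os
import shutil


def resolve_blast_binary(name, exec_dir=''):
    """Return a path to the binary `name`, or None if it cannot be found.

    Hypothetical helper: looks in `exec_dir` first (if given),
    then falls back to searching the PATH.
    """
    if exec_dir:
        candidate = os.path.join(exec_dir, name)
        # Accept the candidate only if it is an executable file
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # shutil.which returns None when the binary is not on the PATH
    return shutil.which(name)


makeblastdb_path = resolve_blast_binary('makeblastdb')
if makeblastdb_path is None:
    print('makeblastdb not found; set the path to your conda environment')
else:
    print('Using', makeblastdb_path)
```

In this notebook you would pass `exec_dir=blast_exec_path` to check the directory set above.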

The first step is to read our gtf file, filter for our feature of interest, convert to bed format and initialise a pybedtools object.

In [4]:
## Read gtf
gene_anno = read_gtf(gtf)

## Keep only the rows matching the chosen feature type (mode)
gene_anno = gene_anno[gene_anno['feature'] == mode].reset_index(drop=True)

## Extract gene names
gene_ids = gene_anno['attribute'].apply(extract_feature_from_gtf, feature='gene_name')

## Convert to bed
bed_df = gtf_2_bed(gene_anno, name_pref = (''))
bed_df['name'] = gene_ids

## Initialise bedtools
bed = pybedtools.BedTool.from_dataframe(bed_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_bed['start'] = df_bed['start'].astype(int) - 1
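
To make the conversion step concrete: GTF coordinates are 1-based and inclusive, while BED coordinates are 0-based and half-open, so only the start is shifted by one (the `- 1` visible in the warning above). The block below is a standalone sketch in plain Python, not the actual implementation of `read_gtf`, `extract_feature_from_gtf` or `gtf_2_bed`; the function name `gtf_line_to_bed` and the toy GTF line are made up for illustration.

```python
import re

# Matches the gene_name entry in the GTF attribute column
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def gtf_line_to_bed(line, feature='exon'):
    """Convert one GTF line to a BED6 tuple, or return None if it is skipped.

    Hypothetical helper for illustration only.
    """
    if line.startswith('#'):
        return None  # comment/header line
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 9 or fields[2] != feature:
        return None  # malformed line or a different feature type
    chrom, _source, _feature, start, end, _score, strand, _frame, attribute = fields
    match = GENE_NAME_RE.search(attribute)
    name = match.group(1) if match else '.'
    # GTF is 1-based inclusive; BED is 0-based half-open: shift the start only
    return (chrom, int(start) - 1, int(end), name, '.', strand)


# Toy line (not taken from the real annotation)
toy_line = 'chr1\ttoy\texon\t101\t200\t.\t+\t.\tgene_id "GENE_A"; gene_name "GENE_A";'
print(gtf_line_to_bed(toy_line))
# → ('chr1', 100, 200, 'GENE_A', '.', '+')
```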


Now we can use pybedtools to get the sequences for these regions.
We make sure to extract each sequence from the correct strand and to keep the name of the gene in the fasta header.

In [5]:
## Get fasta
seq = bed.sequence(
    fi=fasta,
    s=True, ## Get sequence in the correct strand per transcript (i.e., in 5'->3')
    name=True, ## Keep the name of the gene, which will help us distinguish between off-targets and our gene of interest.
    fullHeader=True
) 

## Export fasta
seq.save_seqs((out_dir + out_name +  '.fa'))

<BedTool(/tmp/pybedtools.86erwuew.tmp)>
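
Before building the database, it can be worth checking the exported fasta without printing all of it. The block below is a minimal sketch in plain Python: `summarise_fasta` is a hypothetical helper that counts records and returns the first few headers. It runs on a toy file here; the exact header format written by your bedtools version is not assumed.

```python
import os
import tempfile


def summarise_fasta(path, preview=3):
    """Return (number of records, first `preview` headers) for a fasta file."""
    n_records = 0
    headers = []
    with open(path) as handle:
        for line in handle:
            if line.startswith('>'):
                n_records += 1
                if len(headers) < preview:
                    headers.append(line[1:].strip())
    return n_records, headers


# Toy fasta written to a temporary directory (made-up records)
toy_path = os.path.join(tempfile.mkdtemp(), 'toy.fa')
with open(toy_path, 'w') as handle:
    handle.write('>GENE_A\nACGTACGT\n>GENE_A\nTTGACC\n>GENE_B\nGGCATA\n')

n_records, first_headers = summarise_fasta(toy_path, preview=2)
print(n_records, first_headers)
# → 3 ['GENE_A', 'GENE_A']
```

In this notebook the path to check would be `out_dir + out_name + '.fa'`.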

Now we can use blast (which we should have installed in our conda environment) to create a blast database.

In [6]:
## Make blast db
command = [
    (blast_exec_path + 'makeblastdb'), ## Can omit absolute path if jupyter session started from within the conda environment
    '-in', (out_dir + out_name +  '.fa'),  # Input FASTA file
    '-dbtype', 'nucl',                     # Database type, 'nucl' for nucleotide
    '-out', (out_dir + out_name +  '_db')  # Output database name
]

# Run the command
result = subprocess.run(command, capture_output=True, text=True)

# Check if the command was successful
if result.returncode == 0:
    print("Database created successfully!")
    print(result.stdout)  # Print standard output
else:
    print("Error in database creation:")
    print(result.stderr)  # Print any error messages

Database created successfully!


Building a new DB, current time: 05/03/2024 14:27:56
New DB name:   /nfs/team205/is10/projects/gene2probe/sample_run/001_blastdb/hg38_ncbiRefSeq_exons_db
New DB title:  ../sample_run/001_blastdb/hg38_ncbiRefSeq_exons.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 2178727 sequences in 81.0432 seconds.
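
Besides reading the log, you can confirm that the database files exist on disk. The sketch below assumes the usual layout for nucleotide databases, where `makeblastdb` writes several files sharing the `-out` prefix, including ones ending in `.nhr`, `.nin` and `.nsq` (large databases may be split into numbered volumes). The helper names are hypothetical, and the demonstration uses empty placeholder files rather than a real database.

```python
import glob
import os
import tempfile


def list_blastdb_files(db_prefix):
    """Return all files that share the database prefix, sorted by name."""
    return sorted(glob.glob(glob.escape(db_prefix) + '.*'))


def missing_core_extensions(db_prefix, core_exts=('.nhr', '.nin', '.nsq')):
    """Return the core extensions for which no file with this prefix exists."""
    files = list_blastdb_files(db_prefix)
    return [ext for ext in core_exts if not any(f.endswith(ext) for f in files)]


# Placeholder files standing in for a real database (illustration only)
toy_prefix = os.path.join(tempfile.mkdtemp(), 'toy_db')
for ext in ('.nhr', '.nin'):
    open(toy_prefix + ext, 'w').close()

print(missing_core_extensions(toy_prefix))
# → ['.nsq']
```

In this notebook the prefix to check would be `out_dir + out_name + '_db'`; an empty list means all three core files were found.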



We are done! Now we are ready to use this database to test our probes!
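
As a rough idea of how the database could be queried, the sketch below builds a `blastn` command in the same style as the `makeblastdb` call above. It is an assumption about a typical search, not necessarily how gene2probe runs it: `build_blastn_command`, the probe file name and the output file name are hypothetical, and `-task blastn-short` is only appropriate if your probes are short (under roughly 50 bases).

```python
import os


def build_blastn_command(query_fasta, db_prefix, out_tsv, exec_dir=''):
    """Build a blastn command list suitable for subprocess.run.

    Hypothetical helper; `exec_dir` may be left empty if blastn is on the PATH.
    """
    return [
        os.path.join(exec_dir, 'blastn'),
        '-task', 'blastn-short',  # Assumes short probe sequences
        '-query', query_fasta,    # Fasta file with probe sequences
        '-db', db_prefix,         # Same prefix that was passed to makeblastdb -out
        '-outfmt', '6',           # Tabular output, one hit per line
        '-out', out_tsv,          # File to write the hits to
    ]


command = build_blastn_command(
    'probes.fa',  # Hypothetical probe file
    '../sample_run/001_blastdb/hg38_ncbiRefSeq_exons_db',
    'probe_hits.tsv',  # Hypothetical output file
)
print(' '.join(command))
```

The resulting list can be passed to `subprocess.run(command, capture_output=True, text=True)`, with the return code checked as in the cell above.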