multiBLAST automates running multiple BLAST (Basic Local Alignment Search Tool) queries against multiple databases and organizes the results. It handles several types of BLAST searches (nucleotide, translated nucleotide) and formats the output for easy analysis. Aligned sequences can also be extracted from genome assemblies. To facilitate metagenomic analyses, contigs harboring genes of interest identified via multiBLAST can be extracted for further analysis. multiBLAST is scalable and has been used to run dozens of gene queries against hundreds of genomes.
Author:
Elijah R. Bring Horvath, PhD (https://github.com/ERBringHorvath)
License:
This script is shared under MIT License, which allows for modification and redistribution with attribution.
multiBLAST uses NCBI BLAST+
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL, 2009.
BLAST+: architecture and applications. BMC Bioinformatics, 10, 421. doi:10.1186/1471-2105-10-421
Install BLAST+
Download latest version of BLAST+
Or using Conda:
- Install Conda miniforge if not already installed
- Create Conda environment
conda create -n blast
- Activate Conda environment
source activate blast
- Install BLAST+
conda install -y bioconda::blast
Verify BLAST Installation
makeblastdb -h
blastn -h
If these commands run without error, BLAST is correctly installed. If an error occurs, refer to the BLAST+ documentation.
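The same check can be scripted. The sketch below only tests whether the executables are discoverable on PATH; it does not run them. The function name `missing_tools` is hypothetical, and including `tblastn` in the list is an assumption based on the translated nucleotide searches mentioned above.

```python
import shutil


def missing_tools(tools=("makeblastdb", "blastn", "tblastn")):
    """Return the names of executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Report anything that is missing from the current environment.
missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed BLAST+ executables were found")
```

If anything is reported as missing, confirm that the Conda environment containing BLAST+ is activated.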
If not already installed, install Git
- Linux/Unix systems should have this installed by default
- To test installation, open the terminal and type
git --version
- For macOS users, you should see something like
git version 2.37.1 (Apple Git-137.1)
We suggest installing multiBLAST within your Home folder, such as /Users/user/
Change directory to desired installation path
cd /Users/user
Clone multiBLAST from the repository
git clone https://github.com/ERBringHorvath/multiBLAST
Add multiBLAST to your PATH
- Open your profile in a text editor. This might be ~/.bash_profile or ~/.zshrc
- Add the following line to the end of the file:
export PATH=$PATH:/Users/user/multiblast/bin
Replace /Users/user/multiblast/bin with the actual path to the directory containing the executable.
Whatever the initial directory, this path should end with /multiblast/bin
Save the file and restart your terminal, or run source ~/.bash_profile (Linux/Unix) or source ~/.zshrc (macOS)
Install Dependencies
pip install -r requirements.txt
or pip3 install -r requirements.txt
Verify multiBLAST Installation
multiblast --help
multiblast --version
NOTE: Permissions should automatically be applied during installation. If you get a permission denied message when running multiblast, permissions may need to be changed manually. To do this, you can use the following command:
chmod +x /path/to/multiblast/bin/multiblast
Building a BLAST+ Database Library
multiblast makedb:
-f, --file_directory: path to the directory containing input files in FASTA format
-d, --dbtype: specify what sort of database you want to create (nucl, nucleotide, prot, protein)
-o, --out: path to directory where you want to store your databases
Example:
multiblast makedb -f /path/to/FASTA/files -d nucl -o /path/to/results/folder
Querying a database library
multiblast query:
-d, --database: path to directory containing BLAST+ databases
-q, --query_files: path to directory containing query files in amino acid FASTA format
-e, --evalue: maximum e-value cutoff, default 0.00001
-o, --output: path to directory to store results
-T, --threads: number of cores to dedicate for multi-threading
--report-strongest-match: report only the single strongest match for each query
--perc: define minimum percent identity threshold. Default = 90
--cov: define minimum query coverage threshold. Default = 75
--nucleotide-query: use blastn for queries in nucleotide FASTA format
--min-seq-len: define minimum sequence length for short amino acid/nucleotide sequence queries (use with caution)
Example:
multiblast query -T 8 -d /path/to/blast/database/files -q /path/to/query/files/ -o /path/to/results/folder
All multiBLAST results are concatenated to all_results.csv and either all_filtered_results.csv or filtered_results.csv within the output folder designated by -o, --output
multiblast extract:
-d, --results_directory: path to directory containing multiBLAST results files
-f, --fasta_directory: path to reference FASTA assemblies
These should be the FASTA files the BLAST databases were created from and should have the same basename as the query results files
-o, --output_fasta: output file to contain sequences, defaults to current working directory
-T, --threads: number of cores to dedicate, default is 1
--min-evalue: maximum e-value threshold, default = 0.00001
--min-perc: minimum percent identity threshold. Default = 90
--min-cov: minimum query coverage threshold. Default = 75
--translate: translates extracted nucleotide sequence(s)
NOTE: Translation of sequences is optional; however, care should be used when translating extracted nucleotide sequences, as BLAST results may not always contain a full CDS. To allow for this, when the --translate argument is called, extracted sequences will be trimmed to only include complete codons, which may affect interpretation of results.
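As a rough illustration of what trimming to complete codons means, the sketch below drops trailing bases until the sequence length is a multiple of three. It is not taken from the multiBLAST source: the function name is hypothetical, and trimming from the 3' end is an assumption.

```python
def trim_to_complete_codons(seq: str) -> str:
    """Drop trailing bases so the sequence length is a multiple of three.

    Assumption: incomplete codons are removed from the 3' end only.
    """
    usable_length = len(seq) - (len(seq) % 3)
    return seq[:usable_length]


# A 7-base hit loses its final base; a 3-base hit is unchanged.
print(trim_to_complete_codons("ATGGCCA"))
print(trim_to_complete_codons("ATG"))
```

Because a partial hit may not begin in frame with the original gene, a trimmed and translated sequence does not necessarily correspond to the true protein sequence.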
NOTE: Results files and FASTA reference assemblies must share the same basename:
Example basename: 'FILE'
Example FASTA: FILE.fasta
Example results file: FILE_results.txt
If multiBLAST is used for database creation and queries, matching basenames should be generated automatically
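The basename rule can be checked before running an extraction. The following sketch pairs results file names with FASTA file names using the `FILE_results.txt` / `FILE.fasta` pattern shown above; the function name and the `_results.txt` suffix constant are illustrative assumptions, not part of the multiBLAST API.

```python
from pathlib import Path

# Suffix taken from the example results file name above.
RESULTS_SUFFIX = "_results.txt"


def pair_results_with_fasta(results_names, fasta_names):
    """Map each results file name to the FASTA file sharing its basename.

    Results files without a matching FASTA file map to None.
    """
    fasta_by_base = {Path(name).stem: name for name in fasta_names}
    pairs = {}
    for name in results_names:
        if not name.endswith(RESULTS_SUFFIX):
            continue  # not a per-genome results file
        base = name[: -len(RESULTS_SUFFIX)]
        pairs[name] = fasta_by_base.get(base)
    return pairs


pairs = pair_results_with_fasta(
    ["FILE_results.txt", "OTHER_results.txt"],
    ["FILE.fasta"],
)
unmatched = [name for name, fasta in pairs.items() if fasta is None]
print("Results files without a matching FASTA:", unmatched)
```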
Example usage:
multiblast extract -d /path/to/results/file -f /path/to/reference/FASTA/files -T 8 -o sequences.fa
NOTE: The results file should be all_results.csv, all_filtered_results.csv, or filtered_results.csv, which are automatically generated using multiblast query.
If percent identity and query coverage were set manually during multiblast query, these values will need to be reflected when using multiblast extract using --min-perc and/or --min-cov.
For instance, if multiblast query was called using --perc 75, but the multiblast extract minimum percent identity is left at its default value (90), the appropriate sequences may not be extracted, as they may fall below the default --min-perc threshold.
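The effect of mismatched thresholds can be seen in a small sketch. The hit records and their keys (`pident`, `qcov`, `evalue`) are hypothetical and do not necessarily match the column names in the multiBLAST CSV files; inclusive comparisons are also an assumption.

```python
def passes_thresholds(hit, min_perc=90.0, min_cov=75.0, max_evalue=1e-5):
    """Return True when a hit meets all three cutoffs (inclusive)."""
    return (
        hit["pident"] >= min_perc
        and hit["qcov"] >= min_cov
        and hit["evalue"] <= max_evalue
    )


# Two made-up hits: geneB was only reported because the query used --perc 75.
hits = [
    {"query": "geneA", "pident": 98.2, "qcov": 100.0, "evalue": 1e-50},
    {"query": "geneB", "pident": 80.0, "qcov": 95.0, "evalue": 1e-30},
]

# Default extraction thresholds drop geneB; matching the query threshold keeps it.
kept_default = [h["query"] for h in hits if passes_thresholds(h)]
kept_matched = [h["query"] for h in hits if passes_thresholds(h, min_perc=75.0)]
print(kept_default)
print(kept_matched)
```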
multiblast extract will generate a multi-FASTA file of all sequences identified by multiblast query based on the default or user-defined e-value cutoff.
multiblast extract-contig:
-d, --results_directory: path to directory containing multiBLAST results files
-f, --fasta_directory: path to reference FASTA assemblies
These should be the FASTA files the BLAST databases were created from and should have the same basename as the query results files
-o, --output_fasta: output file to contain sequences, defaults to current working directory
-T, --threads: number of cores to dedicate, default is 1
--min-evalue: maximum e-value threshold, default = 0.00001
--min-perc: minimum percent identity threshold. Default = 90
--min-cov: minimum query coverage threshold. Default = 75
NOTE: Results files and FASTA reference assemblies must share the same basename for both multiblast extract and multiblast extract-contig:
Example basename: 'FILE'
Example FASTA: FILE.fasta
Example results file: FILE_results.txt
If multiBLAST is used for database creation and queries, matching basenames are handled automatically
Example usage:
multiblast extract-contig -d /path/to/results/files -f /path/to/reference/FASTA/files -T 8 -o contigs.fa
multiblast extract-contig will generate a multi-FASTA file of all contigs harboring a matching sequence identified by multiblast query based on the default or user-defined thresholds.
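Conceptually, contig extraction keeps whole contigs rather than only the aligned region. The sketch below shows the idea with an in-memory assembly; the function name and data layout are hypothetical and are not the multiBLAST implementation.

```python
def contigs_with_hits(contigs, hit_contig_ids):
    """Return only the contigs whose identifier appears among the hits.

    contigs: dict mapping contig identifier to its full sequence.
    hit_contig_ids: iterable of contig identifiers reported in passing hits.
    """
    wanted = set(hit_contig_ids)
    return {cid: seq for cid, seq in contigs.items() if cid in wanted}


# Made-up assembly with three contigs; two hits fall on the same contig.
assembly = {"contig_1": "ATGCATGC", "contig_2": "GGGCCC", "contig_3": "TTTAAA"}
selected = contigs_with_hits(assembly, ["contig_2", "contig_2"])
print(sorted(selected))
```

A contig with several hits is returned once in this sketch.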
This program was designed for use with metagenome mining, as metagenomic assemblies are often too large to explore using a genome browser. If short-read assembly methods are used, contigs harboring genes of interest may be extracted; individual contigs will likely be more tractable to inspect in a genome browser if manual annotation is needed.
Often, when downloading large genomic datasets, individual FASTA files will be concatenated into one large multi-FASTA file. This script is designed to split multi-FASTA files into individual FASTA files for use with multiBLAST or other bioinformatic platforms.
Example usage:
multiblast split_fasta -i /path/to/multiFASTA/file -o /path/to/results/folder
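For readers who want to see what splitting a multi-FASTA file involves, here is a minimal standard-library sketch. It is not the multiBLAST implementation: the function names are hypothetical, and naming each output file after the first word of its header is an assumption.

```python
from pathlib import Path


def split_multifasta(text):
    """Parse multi-FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence line belonging to the current record
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_records(records, out_dir):
    """Write each record to its own FASTA file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (header, seq) in enumerate(records, start=1):
        parts = header.split()
        # Assumption: the file is named after the first word of the header.
        name = parts[0].replace("/", "_") if parts else f"record_{index}"
        path = out / f"{name}.fasta"
        path.write_text(f">{header}\n{seq}\n")
        paths.append(path)
    return paths


example = ">genome_1 sample A\nATGC\nGGCC\n>genome_2\nTTAA\n"
records = split_multifasta(example)
print(records)
```

Calling `write_records(records, "split_output")` would then create one file per record in a directory of that name.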
Cite multiBLAST:
multiBLAST (https://github.com/ERBringHorvath/multiBLAST)
Cite NCBI BLAST+:
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL, 2009.
BLAST+: architecture and applications. BMC Bioinformatics, 10, 421. doi:10.1186/1471-2105-10-421