A BLASTn database provides the essential reference framework for comparing query sequences, forming the backbone of any sequence-based analysis. Accurate results, whether in diagnostics, biosecurity surveillance, microbial studies, evolutionary research, environmental surveys, or functional genomics, depend on a high-quality, well-curated database; without one, even the most sophisticated tools can yield ambiguous outcomes.
Public databases are comprehensive but rapidly expanding, and they often contain redundant, low-quality, or irrelevant entries. This slows searches and reduces search resolution.
In contrast, a custom database is like a well-organised library where every book is precisely indexed: smaller in volume, faster to search, and more focused in results.
However, manually constructing a custom database from numerous genomes is tedious, error-prone, and frequently interrupted by the "Duplicate ID Found" error, with little guidance available on how to resolve it.
To bridge this gap, I developed the blastdbbuilder package, an automated solution for genome download, curation, and database construction. It eliminates common errors, ensures reproducibility, and delivers an optimized, high-quality BLASTn database tailored for diagnostics, biosecurity surveillance, microbial research, and any study that relies on robust sequence comparison.
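To make the "Duplicate ID Found" problem concrete, here is a minimal sketch for spotting it before a database build. It assumes the error arises when two FASTA records share the same sequence identifier (the header text up to the first whitespace). The function name `find_duplicate_ids` is hypothetical and is not part of blastdbbuilder.

```shell
# List sequence IDs that occur more than once across the given FASTA files.
# The ID is the header text up to the first whitespace, without the ">".
# Hypothetical helper for illustration; not part of blastdbbuilder.
find_duplicate_ids() {
    awk '/^>/ {
        id = substr($1, 2)              # drop the leading ">"
        if (++seen[id] == 2) print id   # report each duplicated ID once
    }' "$@"
}
```

Calling it on a set of genome files, for example `find_duplicate_ids genomes/*.fna` (the path is illustrative), prints one line per duplicated identifier and nothing when all identifiers are unique.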
blastdbbuilder is a lightweight toolkit that automates the complete
BLASTn database preparation workflow. It streamlines every step,
from downloading user-specified genomes and organizing datasets to
building optimized, up-to-date BLASTn databases.
Designed for researchers and clinicians, it provides a reproducible, portable, and regularly updated solution for constructing BLASTn databases without manual setup.
The toolkit leverages:
- Singularity / Apptainer containers
- Modular scripts
- Automated genome retrieval from NCBI
This enables:
- Easy deployment across diverse computational environments
- Minimal dependency installation
- Reproducible database generation
- Automatic cleanup of intermediate files, retaining only the final BLASTn database to significantly reduce disk space requirements
All genomes are retrieved directly from NCBI RefSeq repositories, ensuring that the database reflects the latest available sequences at the time of download.
- Automated download of all viral genomes and of the reference genomes for Archaea, Bacteria, Fungi, and Plants
- Resumable BLASTn database creation: interrupted runs can be continued
- Modular scripts for each workflow step
- Container-based execution for portability and reproducibility
- Lightweight installation
- Reduced disk space usage through automatic cleanup of intermediate files
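The resumable behaviour listed above can be illustrated with a common checkpoint pattern. This is only a sketch of the general idea, assuming one marker file per completed step; the `run_step` function and the `.checkpoints` directory are hypothetical, and blastdbbuilder's own resume mechanism may work differently.

```shell
# Directory holding one marker file per finished step (hypothetical layout).
STATE_DIR="${STATE_DIR:-.checkpoints}"

# Run a named step only if it has not completed in an earlier run.
# Usage: run_step <step-name> <command> [args...]
run_step() {
    step_name=$1
    shift
    mkdir -p "$STATE_DIR"
    if [ -e "$STATE_DIR/$step_name.done" ]; then
        echo "skip: $step_name"
        return 0
    fi
    # Run the command; write the marker only when it succeeds.
    "$@" && touch "$STATE_DIR/$step_name.done"
}
```

Because the marker is written only after a successful command, rerunning the same script after an interruption skips the finished steps and retries the one that failed.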
blastdbbuilder supports multiple user workflows depending on computational environment and user preference. Separate user manuals are provided for each supported usage environment.
System requirements
Before installing blastdbbuilder, make sure the following are
available on your system:
Python ≥ 3.9
Check your Python version:
python3 --version
If Python is older than 3.9, install a newer Python using your system package manager.
Example (Ubuntu):
sudo apt install python3
unzip
Check if installed:
unzip -v
If missing:
sudo apt install unzip
Container engine
One of the following container engines must be installed:
- Apptainer
- SingularityCE ≥ 3.x
Example installation:
sudo apt install singularity-container
On most HPC systems (for example ARDC Nectar), Singularity or Apptainer is typically already installed.
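The individual checks above can be combined into one pre-installation script. The following is a minimal sketch; the function names are hypothetical, and it only tests that the commands exist on `PATH` and that Python is at least 3.9 (it does not verify the SingularityCE version).

```shell
# Succeed when a dotted version string is at least 3.9.
version_ok() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 9 ]; }
}

# Print the first container engine found on PATH, or fail if none is present.
find_container_engine() {
    for engine in apptainer singularity; do
        if command -v "$engine" >/dev/null 2>&1; then
            echo "$engine"
            return 0
        fi
    done
    return 1
}

# Report each requirement and return non-zero if any is missing.
check_requirements() {
    status=0
    py_version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null)
    if [ -n "$py_version" ] && version_ok "$py_version"; then
        echo "ok: Python $py_version"
    else
        echo "missing: Python >= 3.9"
        status=1
    fi
    if command -v unzip >/dev/null 2>&1; then
        echo "ok: unzip"
    else
        echo "missing: unzip"
        status=1
    fi
    if engine_name=$(find_container_engine); then
        echo "ok: $engine_name"
    else
        echo "missing: Apptainer or SingularityCE"
        status=1
    fi
    return $status
}
```

Running `check_requirements` prints one line per requirement, so a missing item can be installed before blastdbbuilder itself.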
The CLI provides the full automated workflow for downloading genomes, concatenating FASTA files, and building BLAST databases.
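For orientation, the three stages named above (download, concatenate, build) can be sketched with the underlying tools directly. This is not the blastdbbuilder CLI and does not claim to reproduce its internals: the function names are hypothetical, the download step assumes the NCBI `datasets` CLI, and the build step assumes BLAST+ `makeblastdb`. Dropping repeated identifiers during concatenation is one possible way to avoid duplicate-ID failures and is shown here as an assumption.

```shell
# Step 1: download RefSeq reference genomes for a taxon as a zip archive.
# Assumes the NCBI datasets CLI is installed.
download_reference_genomes() {
    taxon=$1
    archive=$2
    datasets download genome taxon "$taxon" \
        --reference --assembly-source RefSeq \
        --include genome --filename "$archive"
}

# Step 2: concatenate FASTA files, keeping only the first record for each ID.
# "keep" is set on every header line and applies to the sequence lines after it.
concat_unique_fasta() {
    awk '/^>/ { keep = !seen[substr($1, 2)]++ } keep' "$@"
}

# Step 3: build a nucleotide BLAST database from the combined FASTA file.
# Assumes BLAST+ is installed.
build_blastn_db() {
    fasta=$1
    db_name=$2
    makeblastdb -in "$fasta" -dbtype nucl -parse_seqids -out "$db_name"
}
```

A typical sequence would be to download an archive, extract it with `unzip`, pass the extracted genome FASTA files to `concat_unique_fasta` with the output redirected to a single file, and then call `build_blastn_db` on that file.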
The GUI provides a guided desktop interface for building customised BLASTn databases without requiring command-line experience.
It wraps the same reproducible backend as the CLI while offering an interactive environment suitable for diagnostics laboratories, teaching environments, and routine analyses.
blastdbbuilder can be run on Windows using Windows Subsystem for Linux (WSL), enabling Windows users to build BLAST databases locally while using a Linux backend.
blastdbbuilder is designed to scale efficiently on HPC systems.
It supports:
- SLURM-based execution
- containerised workflows
- large-scale genome downloads
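As an illustration of SLURM-based, containerised execution, the sketch below writes a batch script that runs an arbitrary command inside a container image. The function name, job name, time limit, and log file pattern are all assumptions chosen for the example, and the command to run is supplied by the caller, so no blastdbbuilder options are implied.

```shell
# Write a SLURM batch script that runs a command inside a container image.
# Usage: write_slurm_job <script> <image.sif> <cpus> <command> [args...]
# Arguments are joined with spaces, so values containing spaces need
# their own quoting. All resource values below are placeholders.
write_slurm_job() {
    script=$1
    image=$2
    cpus=$3
    shift 3
    {
        echo '#!/bin/bash'
        echo '#SBATCH --job-name=blastdb-build'
        echo "#SBATCH --cpus-per-task=$cpus"
        echo '#SBATCH --time=24:00:00'
        echo '#SBATCH --output=blastdb-build-%j.log'
        echo
        echo "apptainer exec \"$image\" $*"
    } > "$script"
}
```

For example, `write_slurm_job job.sh db.sif 8 makeblastdb -version` produces a script that can be submitted with `sbatch job.sh`; on systems that provide SingularityCE instead of Apptainer, the `apptainer` command in the generated line would need to be replaced with `singularity`.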
For reproducible execution across different computing environments, blastdbbuilder is distributed with an official container image.
The container bundles all required software dependencies, including:
- blastdbbuilder
- NCBI datasets CLI
- BLAST+
- seqkit
- dataformat
- unzip
This enables blastdbbuilder to run without installing any dependencies on the host system other than the container engine itself.
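One way to confirm that an image provides the tools listed above is to look each one up inside the container. The sketch below is hedged accordingly: `check_bundled_tools` is a hypothetical helper, the image path is supplied by the user, and the executable names (`blastdbbuilder`, `datasets`, `dataformat`, `makeblastdb` for BLAST+, `seqkit`, `unzip`) are assumed to match what the image installs.

```shell
# Report which of the expected tools are on PATH inside a container image.
# $1 is the engine command (apptainer or singularity), $2 the image path.
# Tool names are assumptions based on the dependency list above.
check_bundled_tools() {
    engine=$1
    image=$2
    missing=0
    for tool in blastdbbuilder datasets dataformat makeblastdb seqkit unzip; do
        if "$engine" exec "$image" sh -c "command -v $tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return $missing
}
```

Called as `check_bundled_tools apptainer path/to/image.sif` (path illustrative), it prints one line per tool and returns a non-zero status if any tool is absent.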
Cite this repository
If you use this software in your work, please cite it as follows:
Prodhan, M. A. (2025). blastdbbuilder: Building a Customised BLASTn Database. https://doi.org/10.5281/zenodo.18973405
For issues, bug reports, or feature requests, please contact: Asad Prodhan. E-mail: asad.prodhan@dpird.wa.gov.au, prodhan82@gmail.com
