blastdbbuilder: Building a Customised BLASTn Database

M. Asaduzzaman Prodhan*

DPIRD Diagnostics and Laboratory Services
Department of Primary Industries and Regional Development
3 Baron-Hay Court, South Perth, WA 6151, Australia
*Correspondence: asad.prodhan@dpird.wa.gov.au; prodhan82@gmail.com

License: GPL 3.0 | DOI: 10.5281/zenodo.18973405

Introduction

A BLASTn database is the reference against which query sequences are compared, so it underpins any sequence-based analysis. Accurate results, whether in diagnostics, biosecurity surveillance, microbial studies, evolutionary research, environmental surveys, or functional genomics, depend on a high-quality, well-curated database; without one, even the most sophisticated tools can yield ambiguous outcomes.

Public databases are comprehensive but expand rapidly, and they often contain redundant, low-quality, or irrelevant entries. This slows searches and reduces search resolution.

In contrast, a custom database is like a well-organised library where every book is precisely indexed: smaller in volume, faster to search, and more focused in its results.

However, manually constructing a custom database from numerous genomes is tedious, error-prone, and frequently interrupted by the "Duplicate ID Found" error, with little guidance available on how to resolve it.

To bridge this gap, I developed the blastdbbuilder package, an automated solution for genome download, curation, and database construction. It eliminates common errors, ensures reproducibility, and delivers an optimised, high-quality BLASTn database tailored for diagnostics, biosecurity surveillance, microbial research, and any study that relies on robust sequence comparison.


blastdbbuilder

blastdbbuilder is a lightweight toolkit that automates the complete BLASTn database preparation workflow. It streamlines every step, from downloading user-specified genomes and organising datasets to building optimised, up-to-date BLASTn databases.

Designed for researchers and clinicians, it provides a reproducible, portable, and regularly updated solution for constructing BLASTn databases without manual setup.

The toolkit leverages:

  • Singularity / Apptainer containers
  • Modular scripts
  • Automated genome retrieval from NCBI

This enables:

  • Easy deployment across diverse computational environments
  • Minimal dependency installation
  • Reproducible database generation
  • Automatic cleanup of intermediate files, retaining only the final BLASTn database to significantly reduce disk space requirements

All genomes are retrieved directly from NCBI RefSeq repositories, ensuring that the database reflects the latest available sequences at the time of download.


Features

  • Automated download of all viral genomes and of the reference genomes for Archaea, Bacteria, Fungi, and Plants

  • Resumable BLASTn database creation: interrupted runs can be continued rather than restarted

  • Modular scripts for each workflow step

  • Container-based execution for portability and reproducibility

  • Lightweight installation

  • Reduced disk space usage through automatic cleanup of intermediate files


User Manuals

blastdbbuilder supports multiple user workflows depending on computational environment and user preference. Separate user manuals are provided for each supported usage environment.

Prerequisites

System requirements

Before installing blastdbbuilder, make sure the following are available on your system:

Python ≥ 3.9

Check your Python version:

python3 --version

If Python is older than 3.9, install a newer Python using your system package manager.

Example (Ubuntu):

sudo apt install python3

unzip

Check if installed:

unzip -v

If missing:

sudo apt install unzip

Container engine

One of the following container engines must be installed:

  • Apptainer
  • SingularityCE 3.x or later

Example installation:

sudo apt install singularity-container

On most HPC systems (for example ARDC Nectar), Singularity or Apptainer is typically already installed.
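
The three prerequisites above can be checked in one pass. The following is a minimal sketch in POSIX shell, not part of blastdbbuilder itself; the function names `version_ok`, `find_engine`, and `check_prereqs` are made up for this illustration. It assumes the SingularityCE binary is on PATH as `singularity` and the Apptainer binary as `apptainer`.

```shell
#!/bin/sh
# Pre-flight check for the prerequisites listed above.
# Illustrative sketch only; not part of blastdbbuilder.

# Succeeds when a version string such as "3.12" or "3.9.1" is at least 3.9.
version_ok() {
    vo_major=${1%%.*}
    vo_rest=${1#*.}
    vo_minor=${vo_rest%%.*}
    [ "$vo_major" -gt 3 ] || { [ "$vo_major" -eq 3 ] && [ "$vo_minor" -ge 9 ]; }
}

# Prints the first supported container engine found on PATH; fails if none.
find_engine() {
    for fe_name in apptainer singularity; do
        if command -v "$fe_name" >/dev/null 2>&1; then
            echo "$fe_name"
            return 0
        fi
    done
    return 1
}

# Reports each prerequisite and returns non-zero if any is missing.
check_prereqs() {
    cp_status=0

    if command -v python3 >/dev/null 2>&1; then
        cp_pyver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
        if version_ok "$cp_pyver"; then
            echo "OK: Python $cp_pyver"
        else
            echo "FAIL: Python $cp_pyver is older than 3.9"
            cp_status=1
        fi
    else
        echo "FAIL: python3 not found"
        cp_status=1
    fi

    if command -v unzip >/dev/null 2>&1; then
        echo "OK: unzip"
    else
        echo "FAIL: unzip not found"
        cp_status=1
    fi

    if cp_engine=$(find_engine); then
        echo "OK: container engine ($cp_engine)"
    else
        echo "FAIL: neither apptainer nor singularity found"
        cp_status=1
    fi

    return "$cp_status"
}
```

After sourcing the file, run `check_prereqs`; a non-zero exit status means at least one prerequisite still needs to be installed.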


Command Line Interface (CLI)

The CLI provides the full automated workflow for downloading genomes, concatenating FASTA files, and building BLAST databases.

👉 CLI User Guide
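
The concatenation step is where the "Duplicate ID Found" error mentioned in the Introduction tends to surface, because two source files can carry the same sequence identifier. As a rough sketch of how such duplicates can be located by hand (this is not blastdbbuilder's own implementation, and the function name `find_duplicate_ids` is hypothetical), the following shell function lists every ID that appears more than once across a set of FASTA files. It assumes the ID is the header text between `>` and the first whitespace.

```shell
# Print each sequence ID that occurs more than once in the given FASTA files.
# The ID is the header text between ">" and the first whitespace character.
# Each duplicated ID is printed once, at its second occurrence.
find_duplicate_ids() {
    awk '/^>/ {
        id = substr($1, 2)          # drop the leading ">"
        if (++seen[id] == 2) print id
    }' "$@"
}
```

For example, `find_duplicate_ids genomes/*.fna` (the path is a placeholder) prints nothing when all IDs are unique, so empty output means the files can be concatenated without ID clashes.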


Graphical User Interface (GUI)

The GUI provides a guided desktop interface for building customised BLASTn databases without requiring command-line experience.

It wraps the same reproducible backend as the CLI while offering an interactive environment suitable for diagnostics laboratories, teaching environments, and routine analyses.

👉 GUI User Guide


Windows (WSL)

blastdbbuilder can be run on Windows using Windows Subsystem for Linux (WSL), enabling Windows users to build BLAST databases locally while using a Linux backend.

👉 WSL User Guide


High Performance Computing (HPC)

blastdbbuilder is designed to scale efficiently on HPC systems.

It supports:

  • SLURM-based execution
  • containerised workflows
  • large-scale genome downloads

👉 HPC User Guide


Singularity/Apptainer Container

For reproducible execution across different computing environments, blastdbbuilder is distributed with an official container image.

The container bundles all required software dependencies, including:

  • blastdbbuilder
  • NCBI datasets CLI
  • BLAST+
  • seqkit
  • dataformat
  • unzip

This enables blastdbbuilder to run without installing any dependencies on the host system.

👉 blastdbbuilder Container


Citation

Cite this repository

If you use this software in your work, please cite it as follows:

Prodhan, M. A. (2025). blastdbbuilder: Building a Customised BLASTn Database. https://doi.org/10.5281/zenodo.18973405


Support

For issues, bug reports, or feature requests, please contact: Asad Prodhan. E-mail: asad.prodhan@dpird.wa.gov.au, prodhan82@gmail.com
