RepDBmaker Pipeline

A Snakemake workflow for building taxonomically annotated protein sequence databases from public resources and custom genome collections.

Overview

RepDBmaker assembles protein sequence databases from multiple sources, annotates them with taxonomy, and builds searchable indices for:

- Diamond
- MMseqs2
- BLAST

It supports:

- prokaryotes from GTDB
- eukaryotes from EukProt, P10K, and UniProt
- viruses from NCBI Virus

Optional features include taxonomic clustering and contamination filtering.

Installation

Requirements

- Snakemake
- Conda or Miniconda
- Internet access for external downloads
- Sufficient disk space for genome and database files

If using --sdm conda, Snakemake will automatically create the following environments from workflow/envs/:

workflow/envs/python.yaml — Python, pandas, matplotlib, polars
workflow/envs/homology.yaml — diamond, mmseqs2, blast
workflow/envs/utils.yaml — taxonkit, csvtk, ncbi-datasets-cli, jq, seqkit
workflow/envs/R.yaml — R and visualization/taxonomy packages

Create all environments without executing the pipeline:

snakemake --sdm conda --conda-create-envs-only

Docker

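A Docker image is available on Docker Hub.

If Docker is installed, run the pipeline from the repository root. The command below mounts the current directory at /app/data and performs a dry run (-n); remove -n to execute the workflow:

docker run --rm -v "$(pwd)":/app/data gmuttiirb/repdbmaker:v1.0 snakemake --cores 2 --directory /app/data -n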

Quick start

To inspect the available proteomes before running the full workflow:

snakemake -j 1 --until available_proteomes

This generates results/meta/available_proteomes.tsv, which can be used to select a proteome subset.
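
As a rough illustration of selecting a subset, the sketch below filters a tab-separated table by one column and writes the matching identifiers to an ids file. The column names used in the usage comment (`genome_id`, `domain`) are hypothetical placeholders; check the header of results/meta/available_proteomes.tsv for the real ones before adapting it.

```python
import csv


def select_ids(tsv_path, column, wanted, id_column):
    """Return the id_column values of rows whose `column` value is in `wanted`."""
    wanted = set(wanted)
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row[id_column] for row in reader if row[column] in wanted]


def write_ids(ids, out_path):
    """Write one identifier per line."""
    with open(out_path, "w") as handle:
        for identifier in ids:
            handle.write(f"{identifier}\n")


# Hypothetical usage (column names are assumptions, not the pipeline's schema):
# ids = select_ids("results/meta/available_proteomes.tsv",
#                  "domain", ["Eukaryota"], "genome_id")
# write_ids(ids, "resources/my_subset.ids")
```

The resulting file could then be referenced as the ids entry of a custom database, assuming the pipeline accepts a plain list of identifiers there.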

To run the full workflow:

snakemake -j 14

Troubleshooting

Proteome downloads occasionally fail or produce truncated files. To catch this, the db_stats rule fails if any gzipped FASTA is malformed, which blocks creation of the database FASTA.

To check for broken files:

cut -f1 results/dbs/<db>/genome_table.tsv | xargs -I {} sh -c 'gzip -t "{}" || echo "Failed: {}"'

Then delete the problematic files and re-run the pipeline. If the problem persists, the source files may themselves be broken, or the download script may be failing for them. In that case, I recommend excluding those genomes and choosing the most suitable alternative.
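
For environments where the shell one-liner is inconvenient, the same check can be sketched in Python. This is a minimal sketch that assumes, like the `cut -f1` command above, that the first column of genome_table.tsv holds the path to each gzipped FASTA; the function name `find_broken_gzip` is illustrative and not part of the pipeline.

```python
import csv
import gzip
import zlib


def find_broken_gzip(table_path, chunk_size=1 << 20):
    """Return first-column paths that cannot be fully decompressed."""
    broken = []
    with open(table_path, newline="") as table:
        for row in csv.reader(table, delimiter="\t"):
            if not row or not row[0]:
                continue  # skip empty lines
            path = row[0]
            try:
                # Read the whole stream in chunks; truncated or corrupt
                # files raise before the end is reached.
                with gzip.open(path, "rb") as handle:
                    while handle.read(chunk_size):
                        pass
            except (OSError, EOFError, zlib.error):
                # OSError also covers missing files and non-gzip data.
                broken.append(path)
    return broken


# Usage (replace <db> with the database name):
# for path in find_broken_gzip("results/dbs/<db>/genome_table.tsv"):
#     print(f"Failed: {path}")
```

If the table starts with a header line, its first field will be reported as a failure, just as it would be by the shell command.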

Configuration

Configure the workflow in config/repdb.yaml or pass a custom config file with:

snakemake --configfile path/to/custom.yaml

Example configuration:

# Database configuration
dbs:
  gtdb_version: "latest"  # use release220, release226, etc.
  type: ["diamond", "mmseqs"]  # Available types: diamond, mmseqs, blastp
  build:
    repdb:
      decontamination:
        # optimized settings for ContScout benchmarking
        identity: 0.9
        coverage: 0.5
        cov_mode: 3
        prop_euka: 0.5
    custom:
      clusteredrepdb:
        ids: resources/repdb.ids
        cluster:
          level: class
          identity: 0.9
          coverage: 0.9

files:
  clades_to_keep: "resources/clades_tokeep.txt"
  genomes_to_exclude: "resources/exclude.txt"
  new_genomes: "resources/custom_genomes_repdb.csv"

This configuration builds RepDB and its clustered version (clusteredrepdb).
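
Before launching a long run, it can help to sanity-check a configuration. The sketch below works on an already parsed config (a plain dict, for example as loaded by a YAML parser) and reports obvious problems. It is not part of the pipeline. The allowed types come from the comment in the example above. Treating `ids` as required and `identity`/`coverage` as fractions in (0, 1] are assumptions based on the example values.

```python
ALLOWED_TYPES = {"diamond", "mmseqs", "blastp"}  # from the example config comment


def check_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    dbs = config.get("dbs", {})

    for db_type in dbs.get("type", []):
        if db_type not in ALLOWED_TYPES:
            problems.append(f"unknown database type: {db_type!r}")

    custom = dbs.get("build", {}).get("custom") or {}
    for name, settings in custom.items():
        # Assumption: every custom database needs an ids file.
        if "ids" not in settings:
            problems.append(f"{name}: missing 'ids'")
        cluster = settings.get("cluster") or {}
        for key in ("identity", "coverage"):
            value = cluster.get(key)
            if value is None:
                continue
            # Assumption: thresholds are fractions, not percentages.
            if not isinstance(value, (int, float)) or not 0 < value <= 1:
                problems.append(f"{name}: cluster.{key}={value!r} not in (0, 1]")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the workflow itself will accept the file.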

Custom databases

To add any custom database:

- Define an entry under dbs.build.custom
- Provide an ids file with genome identifiers or metadata
- Optionally configure cluster and decontaminate

Example:

dbs:
  type: ["diamond", "mmseqs", "blastp"]
  build:
    custom:
      smalleuks:
        ids: resources/eukas_50.ids
        cluster:
          level: order
          identity: 0.8
          coverage: 0.8
        decontaminate:
          identity: 0.9
          coverage: 0.5
          cov_mode: 3
          prop_euka: 0.5

The workflow will create results/dbs/<custom_db>/ and its associated outputs.

Outputs

Key output locations:

results/dbs/<db>/
- <db>.fa.gz — compressed protein FASTA
- <db>_accession_map.txt — accession-to-taxid mapping
- <db>_map, <db>_nohead.map — BLAST/MMseqs maps
- <db>_diamond — Diamond database index
- <db>_mmseqs — MMseqs2 database index
- <db>_blastp — BLASTP database files
results/dbs/<db>/cluster/cluster_params.yaml — clustering settings
results/dbs/<db>/cluster/<db>_clustered.fa.gz — clustered FASTA output
results/dbs/<db>/decontaminate/decontaminate_params.yaml — decontamination settings
results/dbs/<db>/decontaminate/contaminants.txt — decontamination candidates
results/dbs/<db>/decontaminate/pair_counts.tsv — cluster pair counts
results/stats/ — database and clustering statistics
results/meta/check_resources.txt — validation of downloaded assets
results/taxonomies/ — taxonomy annotations and selection outputs
results/taxdump/repdb_taxdump/ — taxonkit taxdump for combined genomes
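
After a run, the main files for a database can be checked with a short script. The sketch below is illustrative only. It uses the file names listed above and treats the index names (`<db>_diamond`, `<db>_mmseqs`, `<db>_blastp`) as prefixes, since an index may consist of several files. That prefix behaviour is an assumption and should be adjusted if the pipeline writes them differently.

```python
from pathlib import Path

# Index name suffixes taken from the output list above.
INDEX_SUFFIX = {"diamond": "_diamond", "mmseqs": "_mmseqs", "blastp": "_blastp"}


def missing_outputs(db, types, root="results/dbs"):
    """Return the names of expected outputs that are absent for database `db`."""
    base = Path(root) / db
    missing = []

    # Single files that should always be present.
    for name in (f"{db}.fa.gz", f"{db}_accession_map.txt"):
        if not (base / name).exists():
            missing.append(name)

    # Indices: accept any file or directory starting with the index prefix.
    for db_type in types:
        prefix = f"{db}{INDEX_SUFFIX[db_type]}"
        if not any(base.glob(prefix + "*")):
            missing.append(prefix)
    return missing


# Usage:
# print(missing_outputs("repdb", ["diamond", "mmseqs"]))
```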

Utilities

The repository includes helper scripts for working with generated databases.

workflow/scripts/get_fasta.py: extract a subset of FASTA sequences from an MMseqs2 database

Example usage:

python workflow/scripts/get_fasta.py <id_file> <output_fasta> <db_mmseqs>

Benchmarking

A benchmark workflow is available in workflow/benchmark.smk and uses config/benchmark.yaml to compare RepDB against reference databases such as NR and clustered NR.

A results notebook is available at workflow/notebooks/comparison.Rmd.

Citation

Please add the preferred citation here.

License

See the LICENSE file for license details.
