MetaTracer is organized into two main workflows:
- Reference MG-index build
- Read assignment + annotation
The reference workflow prepares NCBI Datasets assemblies (genome FASTA + GFF3 + proteins), reformats/chunks them for indexing, and builds MG-indices. The assignment workflow bins reads against the indices, merges hits from multiple indices and/or samples (i.e. for paired reads), filters hits based on low frequency taxa and edit distances, then annotates hits with taxonomic and CDS/protein data.
Preprint: MetaTracer bioRxiv manuscript
MetaTracer assumes references are downloaded using the NCBI Datasets CLI.
conda install -c conda-forge ncbi-datasets-cli
Downloaded assemblies should be annotated and include:
*genomic.fna(reference genome sequences)*genomic.gff/*genomic.gff.gz(GFF3 annotations)*protein.faa(protein sequences)
Important: you must also include a Datasets report that maps assembly accession → taxid (the “genome report”). This report is used later to build mapping tables and for downstream annotation.
You can request the report either:
- as JSON lines (recommended for reproducibility / structured parsing), or
- as TSV (easier to inspect manually)
datasets download genome taxon 2 \
--assembly-level complete \
--annotated \
--exclude-atypical \
--assembly-version latest \
--assembly-source RefSeq \
--mag exclude \
--include genome,gff3,protein \
--report genome \
--as-json-lines \
--filename bacteria.datasets.zipUnpack:
unzip -q bacteria.datasets.zip -d bacteria.datasets/After unpacking, you should have a directory containing assembly subdirectories such as:
bacteria.datasets/.../GCF_XXXXXXX.Y/
*_genomic.fna
*_genomic.gff[.gz]
*_protein.faa
You should also have the genome report (TSV or JSONL), which provides:
assembly_accession(e.g.,GCF_000006625.1)tax_id(NCBI taxid)
The reference build pulls all references provided from the genome report, reformats the headers, and concatenates them into chunks ready for index building. The size of the chunks will play a role in how much memory is required to load each index so consider resources when setting the chunk size. The final index will be ~3.5x the size of the chunked fasta file and will require proportional memory to load during assignment.
metatracer reference-build takes:
-
the base directory containing all GCF subdirectories
-
the genome report (assembly ↔ taxid mapping)
-
and produces:
- chunked FASTA files for indexing (headers rewritten to
>{accession_key}-{taxid}) - a mapping TSV required for downstream annotation
- a summary report (assemblies processed, taxa counts, taxonomic level per rolled taxid)
- (optional) bgzip + tabix indexing for GFF files (indexed GFF files are required at annotation step)
- chunked FASTA files for indexing (headers rewritten to
During reference-build, taxids from the genome report are rolled up to the species rank when available, or to the nearest higher rank (genus, family, order, class, phylum, superkingdom) when species is not available. This rollup uses ete3.
Example:
metatracer reference-build \
--data-dir bacteria.datasets/ncbi_dataset/data \
--report bacteria.datasets/assembly_data_report.jsonl \
--out-dir metatracer_ref/ \
--summary-out metatracer_ref/metatracer_reference.summary.txt \
--map_out metatracer_ref/metatracer_reference.map.tsv \
--max-size-mb 10000 \
--index-gffOutputs:
metatracer_ref/metatracer_reference.chunk.0.fasta,.1.fasta, ...metatracer_ref/metatracer_reference.map.tsvmetatracer_ref/metatracer_reference.summary.txt
Summary taxid section format:
taxidrank(the rolled level used for that taxid)assemblies(count of assemblies mapped to that rolled taxid)
The mapping TSV includes:
seqid(unique integer used in MG-index)assembly(assembly accession, e.g.GCF_...)taxidheader(contig accession from the original FASTA header, e.g.NC_...)description(original FASTA header)gff(bgzipped + tabix-indexed path if--index-gffis used)protein_fasta(path to protein FASTA)
metatracer index-build consumes the each FASTA file and builds MG-index.
Example:
for i in {0..10}; do
metatracer index-build \
--fasta metatracer_ref/metatracer_reference.chunk.${i}.fasta \
--index metatracer_index/metatracer.chunk.${i}.index
doneOutputs:
- MG-index files in
metatracer_index/suitable for read binning. Note: New indices can be build at any time without needing to rebuild existing indices. Multiple mapping files need to be maintained for each build and passed during annotation.
Once the build is complete, the original *genomic.fna files can be removed as well as the chunked fasta files. The gff, and protein files should be maintained for annotation.
MetaTracer assumes reads have been QC’ed prior to assignment.
Example fastp command (paired-end):
fastp \
-i sample_R1.fastq.gz -I sample_R2.fastq.gz \
-o sample_R1.qc.fastq.gz -O sample_R2.qc.fastq.gz \
--cut_front --cut_tail \
--cut_window_size 4 --cut_mean_quality 20 \
--length_required 50 \
--trim_poly_g \
--poly_x_min_len 10 \
--n_base_limit 5 \
--html sample.fastp.html --json sample.fastp.json \Run metatracer assign for each index. For paired-end reads, run it separately on R1 and R2 or on merged reads.
Example:
for i in {0..10}; do
metatracer assign \
--index metatracer_index/metatracer.chunk.${i}.index \
--fastq sample_R1.qc.fastq.gz \
--results assignments/sample.R1.chunk.${i}.bn \
--threads 16 \
doneIf the result file is not empty, metatracer assign will resume from the last assigned read and append to the file unless --force-overwrite is passed.
We benchmarked metatracer assign on a reference collection split into 10 MG-indices built with:
chunk_size: 10sample_interval: 64sa_sample: 32
Each index was approximately 35 GB on disk.
Benchmarking was performed on simulated oral metatranscriptomic read sets from figshare: 10.6084/m9.figshare.31245190. Each read set contained approximately 10 million 150bp reads. Assignment was run separately against each index with 8 threads, using 3 replicates per sample/index combination (90 total runs), with the following command:
metatracer assign \
--fastq {reads.fastq} \
--index {index} \
--results {output.bn} \
--threads 8 \
--force-overwrite \
--seed-interval 8 \
--tune-max-hits 1000 \
--edit-rate 0.13 \
--seed-size 18 \
--min-seed 0.015 \
--max-candidates 1000 \
--max-assignments 50Observed assignment performance:
- Wall time, minutes: mean
20.29, median17.77, min8.80, max74.43, SD9.81 - Max RSS, GB: mean
34.845, median34.847, min34.619, max34.881, SD0.030
metatracer merge combines assignment output files into a single per-read assignments file for a sample.
For paired-end data, pass all .bn files for the sample (R1 + R2 and any chunks).
Example:
metatracer merge \
--output merged/sample.assignments.clp \
--report metatracer_assignment_report.tsv \
--threads 16
assignments/sample.R1.chunk.0.bn \
assignments/sample.R2.chunk.0.bn \
...
assignments/sample.R1.chunk.10.bn \
assignments/sample.R2.chunk.10.bn \This produces one line per read:
-
Read ID prefix (everything before the last
:) -
A comma-separated list of hits like:
{taxid}-{accession_key}-{pos}={edit}
Merge also provides a per-taxa summary of assignment counts. This summary is useful for identifying unlikely taxa for the filtering step.
To reduce runtime, it’s recommended to filter out unlikely taxa before annotating. This includes keeping only hits with the lowest edit distances, and removing taxa that have low overall abundance and support (see merge report).
metatracer filter supports:
- include/exclude taxa lists
- edit distance cutoffs
- writing a reduced assignments file for annotation
Example:
metatracer filter \
--input merged/sample.assignments.clp \
--out filter/sample.filtered.assignments.clp \
--exclude-taxa taxa_to_drop.txt \
--include-taxa taxa_to_keep.txt \
--edit-delta 1 # Keep hits with edit <= min_edit + edit_delta (default: 0) Notes:
--include-taxaand--exclude-taxacan be used together; include is applied first, then exclude.- Both are applied prior to edit distance filtering
metatracer annotate expands each read hit to a tabular format and (unless --taxa-only is used) maps hit positions to CDS/protein annotations using the reference mapping table and indexed GFFs. --taxa-only just translates taxids into organism names.
Example:
metatracer annotate \
--map-table metatracer_ref/metatracer_reference.map.tsv \ # Generated during metatracer reference-build
--out annotations/sample.annotated.tsv \
--proteins-out annotations/sample.proteins.faa \
--threads 8 \
filter/sample.filtered.assignments.clp
Recommended:
- Run annotation on filtered assignments to reduce runtime and output size.
- Keep
metatracer_reference.map.tsvproduced duringreference-build; it contains the paths needed to resolve assemblies → GFF/protein resources.
Example sample.annotated.tsv output:
read_id taxid taxon_name accession_key contig_pos edit assembly contig gene_id locus_tag product protein_id strand cds_start cds_end
M01234:56:000000000-ABCD1:1:1101:10234:1056 562 Escherichia coli 18422 NC_000913.3:345671 2 GCF_000005845.2 NC_000913.3 b0002 thrA aspartokinase/homoserine dehydrogenase NP_414543.1 + 337 2799
M01234:56:000000000-ABCD1:1:1101:10234:1056 562 Escherichia coli 18422 NC_000913.3:345872 2 GCF_000005845.2 NC_000913.3 b0002 thrA aspartokinase/homoserine dehydrogenase NP_414543.1 + 337 2799
M01234:56:000000000-ABCD1:1:1101:9988:1120 1280 Staphylococcus aureus 23107 NZ_CP000253.1:112934 1 GCF_000013425.1 NZ_CP000253.1 SAUSA300_0102 SAUSA300_0102 putative membrane protein WP_000123456.1 - 112801 113220Example sample.proteins.faa output (unique protein sequences referenced by annotated hits):
>NP_414543.1 aspartokinase/homoserine dehydrogenase [Escherichia coli]
MNNR...KQ
>WP_000123456.1 putative membrane protein [Staphylococcus aureus]
MFKK...LL