Skip to content

AnantharamanLab/CheckAMG

Repository files navigation

CheckAMG

PyPI

Automated identification and curation of Auxiliary Metabolic Genes (AMGs), Auxiliary Regulatory Genes (AReGs), and Auxiliary Physiology Genes (APGs) in viral genomes and metagenomes

⚠️ This tool is in active development and has not yet been peer-reviewed.

Overview

CheckAMG is a pipeline for high-confidence identification and curation of auxiliary genes (AMGs, AReGs, APGs) in viral genomes. It leverages functional annotations, genomic context, and manually curated lists of AVG annotations. Its prediction approach reflects years of community-defined standards for identifying and auxiliary genes, validating that they are virus-encoded, and the filtering of common misannotations.

CheckAMG supports:

  • Nucleotide or protein input
  • Single-contig viral genomes or vMAGs (multi-contig)
  • Running on viral genomes or viromes/metagenomes directly

Dependencies

See pyproject.toml for all dependencies. Major packages:

Installation

Step 1: Create a conda environment and install CheckAMG using pip

conda create -n CheckAMG python=3.11 pip
conda activate CheckAMG
pip install checkamg

Step 2: Download the databases required by CheckAMG

About 40 GB of free disk space will be required to download the databases. This can be reduced to about 21 GB after downloading finishes if the human-readable HMM files are removed by providing the --rm_hmm argument.

checkamg download -d /path/to/db/destination --rm_hmm

Quick start

Example data to test your installation of CheckAMG are provided in the examples/example_data folder of this repository.

checkamg download -d /path/to/db/destination

checkamg annotate \
  -d /path/to/db/destination \
  -g examples/example_data/single_contig_viruses.fasta \
  -vg examples/example_data/multi_contig_vMAGs \
  -o CheckAMG_example_out

Usage

CheckAMG has multiple modules. The main modules that will be used for AVG predictions are annotate, de-novo, and end-to-end. Currently, only the annotate module has been implemented, and its associated download module to download its required databases.

Run checkamg -h for full options and module descriptions:

usage: checkamg [-h] [-v] {download,annotate,de-novo,aggregate,end-to-end} ...

CheckAMG: automated identification and curation of Auxiliary Metabolic Genes (AMGs),
Auxiliary Regulatory Genes (AReGs), and Auxiliary Physiology Genes (APGs) in viral
genomes.

positional arguments:
  {download,annotate,de-novo,aggregate,end-to-end}
                        CheckAMG modules
    download            Download the databases required by CheckAMG.
    annotate            Predict and curate auxiliary genes in viral genomes based on
                        functional annotations and genomic context.
    de-novo             (Not yet implemented) Predict auxiliary genes with an annotation-
                        independent method using Protein Set Transformer.
    aggregate           (Not yet implemented) Aggregate the results of the CheckAMG
                        annotate and de-novo modules to produce a final report of
                        auxiliary gene predictions.
    end-to-end          (Not yet implemented) Executes CheckAMG annotate, de-novo, and
                        aggregate in tandem.

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

CheckAMG annotate

The annotate module is for the automated prediction and curation of auxiliary genes in viral genomes based on functional annotations and genomic context.

Basic usage:

checkamg annotate -g <genomes.fna> -d <db_dir> -o <output_dir>

Basic arguments:

  • -g, --genomes: Path to single-contig nucleotide genomes in a single FASTA file
  • -vg, --vmags: Path to a folder containing multi-contig nucleotide genomes
  • -p, --proteins: Path to single-contig amino acid input in a single FASTA file
  • -vp, --vmag_proteins: Path to a folder containing multi-contig amino acid input
  • -d, --db_dir: Path to CheckAMG databases download with checkamg download
  • -o, --output: Path to the output directory where CheckAMG intermediates and results will be written

Notes:

  • At least one of --genomes or --vmags, or one of --proteins or --vmag_proteins, must be provided
  • Both nucleotide and protein input types cannot be mixed
  • Providing single-contig genomes or vMAGs only affects the labelling and organization of results, and does not affect AVG predictions
  • Protein headers must be in prodigal format (e.g. >Contig1_1 # 144 # 635 # 1 or >Contig1_2 # 1535 # 635 # -1)

Full usage:

usage: checkamg annotate [-h] -d DB_DIR -o OUTPUT [-g GENOMES] [-vg VMAGS]
                         [-p PROTEINS] [-vp VMAG_PROTEINS] [--input_type INPUT_TYPE]
                         [-l MIN_LEN] [-f MIN_ORF] [-n MIN_ANNOT] [-c COV_FRACTION]
                         [-e EVALUE] [-b BIT_SCORE]
                         [-bh BITSCORE_FRACTION_HEURISTIC] [-k] [-w WINDOW_SIZE]
                         [-V MIN_FLANK_VSCORE]
                         [-H | --use_hallmark | --no-use_hallmark]
                         [--filter_presets FILTER_PRESETS] [-t THREADS] [-m MEM]
                         [--debug | --no-debug]

Predict and curate auxiliary genes in viral genomes based on functional annotations
and genomic context.

options:
  -h, --help            show this help message and exit
  --input_type INPUT_TYPE
                        Specifies whether the input files are nucleotide genomes
                        (nucl) or translated amino-acid genomes (prot). Providing
                        proteins as input will skip the pyrodigal-gv step, but it
                        will be unable to tell whether viral genomes are circular,
                        potentially losing additional evidence for verifying the
                        viral origin of putative auxiliary genes. (default: nucl).
  -l MIN_LEN, --min_len MIN_LEN
                        Minimum length in base pairs for input sequences (default:
                        5000).
  -f MIN_ORF, --min_orf MIN_ORF
                        Minimum number of open reading frames (proteins) inferred by
                        pyrodigal-gv for input sequences (default: 4).
  -n MIN_ANNOT, --min_annot MIN_ANNOT
                        Minimum percentage (0.0-1.0) of genes in a genome/contig
                        required to have been assigned a functional annotation using
                        the CheckAMG database to be considered for contextual
                        analysis. (default: 0.2).
  -c COV_FRACTION, --cov_fraction COV_FRACTION
                        Minimum covered fraction (of HMM profiles) for reporting HMM
                        searches (default: 0.3).
  -e EVALUE, --evalue EVALUE
                        Maximum fallback E-value for HMM searches when database-
                        provided cutoffs are not available (default: 1e-05).
  -b BIT_SCORE, --bit_score BIT_SCORE
                        Minimum fallback bit score for HMM searches when database-
                        provided cutoffs are not available (default: 30).
  -bh BITSCORE_FRACTION_HEURISTIC, --bitscore_fraction_heuristic BITSCORE_FRACTION_HEURISTIC
                        Retain HMM hits scoring at least this fraction of the
                        database-provided threshold under heuristic filtering
                        (default: 0.5).
  -k, --keep_full_hmm_results
                        Keep a file with full HMM search results. By default, only
                        the single best HMM hits per protein, per database are
                        written to save space. Does not affect final annotations.
                        Not recommended for large inputs (>2 GB fasta file or
                        >10,000 sequences) (default: False).
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Size in base pairs of the window used to calculate the
                        average VL-score of genes in a local region on a contig
                        (default: 5000).
  -V MIN_FLANK_VSCORE, --min_flank_Vscore MIN_FLANK_VSCORE
                        Minimum V-score of genes in flanking regions required to
                        verify a potential auxiliary gene as viral and not host
                        sequence contamination (0.0-10.0) (default: 10.0).
  -H, --use_hallmark, --no-use_hallmark
                        Use viral hallmark gene annotations instead of V-scores when
                        checking flanking regions of potential auxiliary genes for
                        viral verification (default: False).
  --filter_presets FILTER_PRESETS
                        Preset(s) for filtering auxiliary gene annotations based on
                        keywords (see documentation for details). Valid choices:
                        'default' (recommended), 'allow_glycosyl' (keep
                        glycosyltransferase, glycoside-hydrolase, and related
                        annotations), 'allow_nucleotide' (keep nucleotide metabolism
                        annotations), 'allow_methyl' (keep methyltransferase and
                        related annotations), 'allow_lipid' (keep lipopolysaccharide
                        and phospholipid-related annotations), 'no_filter' (disable
                        all annotation filtering, not recommended). Multiple presets
                        can be provided, separated by commas (e.g.,
                        allow_glycosyl,allow_nucleotide). (default: default)
  -t THREADS, --threads THREADS
                        Number of threads to use for pyrodigal-gv and pyhmmer
                        (default: 10).
  -m MEM, --mem MEM     Maximum amount of memory allowed to be allocated in GB
                        (default: 80% of available). (default: 1395)
  --debug, --no-debug   Log CheckAMG with debug-level detail (default: False).

required arguments:
  -d DB_DIR, --db_dir DB_DIR
                        Path to CheckAMG database files (Required). (default: None)
  -o OUTPUT, --output OUTPUT
                        Output directory for all generated files and folders
                        (Required). (default: None)
  -g GENOMES, --genomes GENOMES
                        Input viral genome(s) in nucleotide fasta format (.fna or
                        .fasta). Expectation is that individual virus genomes are
                        single contigs. (default: None)
  -vg VMAGS, --vmags VMAGS
                        Path to folder containing vMAGs (multiple contigs) rather
                        than single-contig viral genomes. Expectation is that the
                        folder contains one .fna or .fasta file per virus genome and
                        that each genome contains multiple contigs. (default: None)
  -p PROTEINS, --proteins PROTEINS
                        Input viral genome(s) in amino-acid fasta format (.faa or
                        .fasta). Required if --input_type is prot. Expectations are
                        that the amino-acid sequence headers are in Prodigal format
                        (>[CONTIG NAME]_[CDS NUMBER] # START # END # FRAME # ...)
                        and that each contig encoding proteins represents a single
                        virus genome. (default: None)
  -vp VMAG_PROTEINS, --vmag_proteins VMAG_PROTEINS
                        Path to folder containing vMAGs (multiple contigs) in amino-
                        acid fasta format (.faa or .fasta) rather than single-contig
                        viral genomes. Expectation is that the folder contains one
                        .faa or .fasta file per virus genome and that each genome
                        file contains amino-acid sequences encoded on multiple
                        contigs. Required if --input_type is 'prot'. (default: None)

Outputs:

The CheckAMG annotate output folder will have the following structure:

CheckAMG_annotate_output
├── CheckAMG_annotate.log
├── config_annotate.yaml
├── results/
│   ├── faa_metabolic/
│   │   ├── AMGs_all.faa
│   │   ├── AMGs_high_confidence.faa
│   │   ├── AMGs_low_confidence.faa
│   │   └── AMGs_medium_confidence.faa
│   ├── faa_physiology/
│   │   ├── APGs_all.faa
│   │   ├── APGs_high_confidence.faa
│   │   ├── APGs_low_confidence.faa
│   │   └── APGs_medium_confidence.faa
│   ├── faa_regulatory/
│   │   ├── AReGs_all.faa
│   │   ├── AReGs_high_confidence.faa
│   │   ├── AReGs_low_confidence.faa
│   │   └── AReGs_medium_confidence.faa
│   ├── final_results.tsv
│   ├── gene_annotations.tsv
│   ├── genes_genomic_context.tsv
│   ├── metabolic_genes_curated.tsv
│   ├── physiology_genes_curated.tsv
│   └── regulation_genes_curated.tsv
└── wdir/
  • CheckAMG_annotate.log: Log file for the CheckAMG annotate run
  • config_annotate.yaml: Snakemake pipeline configuration
  • results/: Main results directory
    • faa_metabolic/, faa_physiology/, faa_regulatory/: Predicted AVGs by type and confidence
    • final_results.tsv: Summary table of AVG predictions
      • Note that this table contains information on all genes that made it past the length/CDS filtering steps, including metabolic, physiological, regulatory, and unclassified (not AVG) genes. The "Protein Classification" column can be used to filter by classification.
    • gene_annotations.tsv: All gene annotations
    • genes_genomic_context.tsv: Gene-level genomic context for confidence assignment
    • *_genes_curated.tsv: Curated lists of metabolic, physiological, and regulatory genes after filtering false positives
  • wdir/: Intermediate files

Examples of these output files are provided in the examples/example_outputs folder of this repository.

CheckAMG de-novo

Coming soon.

CheckAMG end-to-end

Coming soon.

Important Notes / FAQs

1. What is an AVG?

An AVG is an Auxiliary Viral Gene, a virus-encoded gene that is non-essential for viral replication but augments host metabolism (AMGs), physiology (APGs), or regulation (AReGs). In the past, all AVGs have been referred to as AMGs, but recently the term AVG has been adopted to include broader host-modulating functions, not just metabolism (see Martin et al. (2025) Nat Microbiol).

Examples:

  • A virus-encoded psbA or soxY would be an AMG because they encode proteins with functions in host photosynthesis and sulfide oxidation
  • A virus-encoded VasG type VI secretion system protein or HicA toxin would be an APG because they are involved in host physiology
  • A LuxR transcriptional regulator or an AsiA anti-sigma factor protein would be an AReG because they are likely involved in the regulation of host gene expression

Despite the name "CheckAMG", this tool also predicts APGs and AReGs using the same pipeline, differing only by functional annotation criteria.

2. How does CheckAMG classify and curate its predictions?

CheckAMG applies a two-stage filtering process:

  1. Use a list of curated profile HMMs that represent metabolic, physiological, and regulatory genes to come up with initial AVG candidates
  2. Use a second list of curated keywords/substrings that will be used to filter unlikely AMGs, APGs, and AReGs

Unclassified genes are those with annotations that don't meet thresholds for confident AVG classification, not necessarily unannotated.

Curated Keyword Presets

Users can control how CheckAMG applies keyword-based filters using the --filter_preset argument. The currently available options are:

  • default: Standard annotation filtering behavior (recommended)
  • allow_glycosyl: Disables filtering for glycosyltransferase, glycoside-hydrolase, and related annotations
  • allow_nucleotide: Disables filtering for nucleotide metabolism annotations
  • allow_methyl: Disables filtering for methyltransferase and related annotations
  • allow_lipid: Disables filtering for lipopolysaccharide and phospholipid-related annotations
  • no_filter: Disables all keyword-based filtering (not recommended)

We generally do not recommend changing --filter_preset from default for most use cases. However, there are scenarios where it may be appropriate to add exceptions to CheckAMG's filtering logic. For example:

  • If virus-encoded glycosyltransferases/glycoside-hydrolases, methyltransferases, nucleotide metabolism genes, or lipopolysaccharide/phospholipid metabolism genes are specifically of interest, consider applying the relevant filter presets to include those exceptions
  • If you have environment-specific knowledge that makes certain gene functions highly relevant to your study system, you can use the appropriate --filter_preset to retain those annotations if they were originally included among the CheckAMG filters
    • For example, setting --filter_preset allow_glycosyl may include additional potential AMGs involved in carbohydrate degradation when these functions are likely to be enriched in the environmental context of your viral genomes
  • If you have other evidence to suggest that annotations flagged by certain keywords are more likely involved in auxiliary metabolic, physiological, or regulatory pathways in the host, rather than essential/core viral functions like genome replication, capsid assembly, cell entry, or lysis

Note: If any non-default values for --filter_preset are used, additional manual curation of functional annotations is still necessary to avoid misclassification of a gene as an AMG, APG, or AReG.

3. What do the viral origin confidence assignments to predicted AVGs mean?

TL;DR It reflects the likelihood that a gene is virus-encoded (vs host/MGE)

AVGs often resemble host genes and can result from contamination. CheckAMG uses local genome context to assign high, medium, or low viral origin confidence based on:

  1. Proximity to virus-like or viral hallmark genes
  2. Proximity to transposases or other non-viral mobilization genes
  3. Local viral gene content, determined using V- and VL-scores (Zhou et al., 2025)

A LightGBM model, trained on real and simulated viral/non-viral data, makes these assignments. Confidence levels refer to the viral origin, not the functional annotation.

4. Which confidence levels should I use?

TL;DR When in doubt, use high, but medium should be included if your input is virus enriched.

The precision and recall of each confidence level for predicting true viral proteins depends on the input dataset. Whether you should use high, medium, and/or low-confidence AVGs will depend on your knowledge of your input data.

  • High-confidence
    • CheckAMG assigns confidence levels such that high-confidence predictions can be almost always trusted (false-discovery rate < 0.1 in most cases)
    • To maintain the integrity of high-confidence predictions even in cases where viral proteins are relatively rare in the input, high-confidence predictions are conservative
    • We recommend using just high-confidence AVGs when viral proteins are relatively rare in the input data (such as mixed-community metagenomes) or when the composition of the input data is unknown
  • Medium-confidence
    • Using medium-confidence predictions can significantly increase the recovery of truly viral proteins, but they may not always be best to use
    • Medium-confidence predictions maintain false-discovery rates < 0.1 in datasets with at least 50% viral proteins, but as input sequences become increasingly non-viral in their protein composition, FDRs begin to surpass 0.1 (see the figure and table, below)
    • We recommend using both high- and medium-confidence AVGs when you know that at least half of your input sequences are viral, such as outputs from most virus prediction tools or viromes
  • Low-confidence
    • Low-confidence predictions are not filtered at all, so we only recommend using them when you are certain that all of your input sequences are free of non-viral sequence contamination, or for testing

Below are preliminary results for benchmarking our viral origin confidence predictions against test datasets with varying sequence composition (% of proteins, see the table below for composition):

Precision-Recall Plot

Dataset % Viral Proteins % MGE Proteins % Host Proteins
Near all virus 90 5 5
Virus enriched 75 12.5 12.5
Equal viral/nonviral 50 25 25
Training dist. 46.6 12.3 41.1
Half viral/host 50 0 50
Equal viral/MGE/host 33.3 33.3 33.3
MGE enriched 12.5 75 12.5
Host enriched 12.5 12.5 75
Near all MGE 5 90 5
Near all host 5 5 90

5. How does CheckAMG assign functions to proteins?

If you're curious about the internal mechanics of how CheckAMG annotates proteins for their function, this section explains the behavior. These settings are designed to balance sensitivity (not missing true hits) and specificity (excluding weak/ambiguous matches), with additional database-specific optimizations for functional reliability.

  1. Homology Searching Method

    • CheckAMG uses pyhmmer for fast and reproducible HMM searches of user proteins against profile HMMs
  2. Profile HMM Databases

  3. E-value Threshold

    • An initial, permissive E-value cutoff of 0.01 is applied during hmmsearch to minimize missed hits due to chunking or memory differences when parallelizing, which can affect search reproducibility
  4. Coverage Filter

    • After hits are collected, CheckAMG enforces a minimum HMM alignment coverage filter (default 0.30, configurable via --cov_fraction)
    • This is applied during downstream hit filtering so that functional inferences are not drawn from tiny partial alignments
  5. Database-Specific Thresholds

    • CheckAMG applies specialized rules depending on the HMM source:

      • Pfam: Applies sequence-level gathering threshold (GA); hits below GA are excluded
      • FOAM, KEGG, & CAMPER: Use database-defined bit score thresholds, but apply a relaxed fallback heuristic (see below)
      • METABOLIC: Uses GA cutoffs derived from its underlying Pfam/TIGRFAM sources, where available
  6. Fallback Heuristic (FOAM, KEGG, & CAMPER)

    • KEGG (and consequently, FOAM and CAMPER, since these databases were largely derived from KEGG KOfams) thresholds can sometimes be overly strict, especially for environmental viruses, filtering out hits that are biologically valid

    • To recover these valid hits, CheckAMG applies a relaxed fallback heuristic inspired by the Anvi'o anvi-run-kegg-kofams strategy:

      • If a hit falls below the database-provided trusted threshold (e.g., KEGG TC), it is still retained if all three conditions below are met:

        1. The bit score is at least 50% of the threshold value
        2. The E-value is below 1e-5
        3. The coverage of the HMM profile aligned to the sequence hit is at least 0.30
      • These values are configurable by the user using the --bitscore_fraction_heuristic, --evalue, and --cov_fraction arguments if desired, but we do not recommend changing them

    • A similar heuristic improves annotation recovery without compromising too much on precision (Kananen et al., 2025)

  7. Fallback Filtering for Other Databases

    • If the HMM source doesn't have defined cutoffs, such as dbCAN, PHROGs, and some profiles in the METABOLIC database, CheckAMG enforces:

      • A minimum coverage of the HMM profile 0.30 to the aligned sequence (via --cov_fraction)
      • A minimum bit score of 30 (via --bit_score)
      • Both are configurable by the user if desired
  8. Result Consolidation and Best-Hit Reporting

    • Each input protein is searched against each HMM source database (KEGG, FOAM, Pfam, PHROG, dbCAN, METABOLIC, and CAMPER)

    • All hits are filtered using the criteria above (including the minimum coverage filter)

    • Then, CheckAMG reports (1) per-database best hits and (2) a single cross-database top-hit summary:

      • Per database: only the single best hit per protein is retained and reported for that database

        • Preference is given to the hit with the lowest E-value
        • If E-values are equal, the hit with the higher bit score is selected
      • Across databases: CheckAMG also reports a single best-supported annotation per protein (top_hit_hmm_id, top_hit_description, top_hit_db) by selecting the database whose retained per-database best hit has the largest bit score among databases with a non-null hit

These defaults provide a balance between accuracy and recall, and are based on benchmarking and community best practices. Users may modify thresholds using the --bit_score, --bitscore_fraction_heuristic, --evalue, and --cov_fraction arguments.

6. Snakemake

CheckAMG modules are executed as Snakemake pipelines. If a run is interrupted, it can resume from the last complete step as long as intermediate files exist.

Reproducibility and reference database construction

CheckAMG is packaged with several curated reference tables under CheckAMG/files/ that define (i) the functional label mappings used for reporting, (ii) the curated AMG/APG/AReG HMMs, and (iii) the AVG filtering tables (including exception categories). These tables are what the pipeline reads and parses when curating annotations.

To make these resources transparent and reproducible, this repository includes a notebook (see make_checkamg_required_tables.ipynb) that was used to build the required tables from upstream sources and writes the standardized outputs used by the pipeline, including:

  • hmm_id_to_name.tsv (cross-database HMM id to name/description mapping)
  • FOAM.tsv and vscores.tsv
  • AMGs.tsv, APGs.tsv, AReGs.tsv
  • AMG_filters.tsv, APG_filters.tsv, AReG_filters.tsv
  • viral_hallmark_genes.tsv and mobile_genes.tsv

CheckAMG’s required profile HMM database (downloaded via checkamg download) is formatted and packaged to ensure consistency and standardization across versions. The notebook used to download, format, and build this database, including documentation of the associated source versions, is available at build_checkamg_db.ipynb.

These notebooks are not required to run CheckAMG, but they are provided so others can inspect, regenerate, and update the curated assets when upstream databases change.

Error reporting

To report bugs or request features, please use the GitHub Issues page.

Citation

Coming soon.

Authors:

  • James C. Kosmopoulos (kosmopoulos [at] wisc [dot] edu)
  • Cody Martin
  • Karthik Anantharaman (karthik [at] bact [dot] wisc [dot] edu)

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •