
Integrate Metagenomic Assemblies

Combine a set of metagenomic assemblies into a common set of references


To use this software, provide for each assembly the predicted protein-coding genes in a single FASTA file (e.g. *.fastp.gz) and the annotations in GFF format.

All of those files should be found in a single folder, with each file name matching the name of the sample from which the assembly originated.
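
Before running the integration, it can help to check that every sample has both of its input files. The following is a minimal sketch and not part of this repository: the function name `collect_inputs` is hypothetical, the `.fastp.gz` suffix follows the example above, and the `.gff.gz` suffix is an assumption that should be adjusted to match your files.

```python
import tempfile
from pathlib import Path

# Assumed suffixes: ".fastp.gz" follows the example in the text above,
# ".gff.gz" is a guess. Adjust both to match your files.
PROT_SUFFIX = ".fastp.gz"
GFF_SUFFIX = ".gff.gz"


def collect_inputs(folder, prot_suffix=PROT_SUFFIX, gff_suffix=GFF_SUFFIX):
    """Pair protein FASTA and GFF files in one folder by sample name."""
    prots, gffs = {}, {}
    for path in sorted(Path(folder).iterdir()):
        if path.name.endswith(prot_suffix):
            prots[path.name[: -len(prot_suffix)]] = path
        elif path.name.endswith(gff_suffix):
            gffs[path.name[: -len(gff_suffix)]] = path
    shared = sorted(prots.keys() & gffs.keys())
    paired = {sample: (prots[sample], gffs[sample]) for sample in shared}
    # Samples that have only one of the two expected files.
    unpaired = sorted(prots.keys() ^ gffs.keys())
    return paired, unpaired


# Demonstration on empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sampleA.fastp.gz", "sampleA.gff.gz", "sampleB.fastp.gz"):
        (Path(tmp) / name).touch()
    paired, unpaired = collect_inputs(tmp)

print(sorted(paired))  # samples with both files
print(unpaired)        # samples missing one of the two files
```

Any sample reported as unpaired is missing either its protein FASTA or its GFF file and is worth fixing before integration.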


After running this code, two files should be generated:

  1. A FASTA file with the deduplicated protein-coding gene sequences
  2. A JSON file describing the identity and physical relationship between those protein-coding sequences

The format of the JSON will be as follows:

        {
            "protein_id": "<name of deduplicated reference>",
            "members": [
                "<ids of assembled proteins within the group>"
            ],
            "annotation": "<annotation of protein>",
            "neighbors": {
                "<upstream or downstream>": {
                    "<protein id of upstream neighbor>": "<number of assemblies with connection>"
                }
            }
        }

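As an illustration of how this output might be consumed, the sketch below picks, for each deduplicated protein, the neighbor supported by the most assemblies in each direction. It rests on several assumptions not confirmed by this README: that the JSON file holds a list of such records, that the direction keys are literally `upstream` and `downstream`, and that counts can be coerced to integers. The example record and the function name `strongest_neighbors` are made up for illustration.

```python
import json

# Hypothetical example record following the documented layout; the IDs,
# annotation, and counts are invented. A top-level list is an assumption.
EXAMPLE = """
[
  {
    "protein_id": "ref_1",
    "members": ["sampleA_00001", "sampleB_00042"],
    "annotation": "hypothetical protein",
    "neighbors": {
      "upstream": {"ref_2": 2, "ref_4": 1},
      "downstream": {"ref_3": 1}
    }
  }
]
"""


def strongest_neighbors(records):
    """Map each protein_id to its best-supported neighbor per direction."""
    result = {}
    for record in records:
        best = {}
        for direction, counts in record.get("neighbors", {}).items():
            if not counts:
                continue
            # Counts may be stored as numbers or strings; coerce to int.
            neighbor = max(counts, key=lambda key: int(counts[key]))
            best[direction] = (neighbor, int(counts[neighbor]))
        result[record["protein_id"]] = best
    return result


records = json.loads(EXAMPLE)
summary = strongest_neighbors(records)
print(summary)
```

For a real output file, replace `json.loads(EXAMPLE)` with `json.load` on the opened file.
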
Invocation

    <script name> \
        --gff-folder "<folder containing GFF files>" \
        --prot-folder "<folder containing protein FASTA files>" \
        --output-name "<base name for output files>" \
        --output-folder "<folder for output files>"

Run with -h for more options.


We recommend running this code with the Docker image, which contains all the needed dependencies and is validated with a set of tests. If that is not an option, consult the Dockerfile for how to set up the appropriate environment.

Docker Repository on Quay
