
METABOLIC Usage

Zhichao Zhou edited this page Dec 8, 2023 · 46 revisions

METABOLIC Usage:


All Required and Optional Flags:

To view the options that METABOLIC-C.pl and METABOLIC-G.pl have, please type:

perl METABOLIC-G.pl -help
perl METABOLIC-C.pl -help
  • -in-gn [required if you are starting from nucleotide fasta files] Defines the location of the FOLDER containing the genome nucleotide fasta files ending with ".fasta" to be run by this program
  • -in [required if you are starting from faa files] Defines the location of the FOLDER containing the genome amino acid files ending with ".faa" to be run by this program
  • -r [required for METABOLIC-C] Defines the path to a text file listing the locations of the paired reads
  • -rt [optional] Defines whether the reads are metagenomic ("metaG") or metatranscriptomic ("metaT") (Default: 'metaG'). Only used by METABOLIC-C
  • -st [optional] Defines the sequencing type of the metagenomes or metatranscriptomes: "illumina" (Illumina short reads), "pacbio" (PacBio CLR reads), "pacbio_hifi" (PacBio HiFi/CCS genomic reads (v2.19 or later)), "pacbio_asm20" (PacBio HiFi/CCS genomic reads (v2.18 or earlier)), or "nanopore" (Oxford Nanopore reads) (Default: 'illumina'). All values must be typed in lowercase, and the underscore "_" must not be replaced by "-" or any other character
  • -t [optional] Defines the number of threads to run the program with (Default: 20)
  • -m-cutoff [optional] Defines the fraction of KEGG module steps present to designate a KEGG module as present (Default: 0.75)
  • -kofam-db [optional] Defines the use of the full ("full") or reduced ("small") KOfam database by the program (Default: 'full'). The "small" KOfam database contains only the KOs present in KEGG modules; using this setting significantly reduces hmmsearch running time.
  • -tax [optional] Defines the taxonomic level at which the MW-score contribution of microbial groups is calculated (Default: "phylum"; other options: "class", "order", "family", "genus", "species", and "bin" (the MAG itself)). Only used by METABOLIC-C
  • -p [optional] Defines the prodigal method used to annotate ORFs ("meta" or "single")(Default: "meta")
  • -o [optional] Defines the output directory to be created by the program (Default: current directory)
  1. The directory specified by the "-in-gn" flag should contain the nucleotide sequences of your genomes, with the file extension ".fasta". If you are supplying amino acid sequences for each genome, place them in a directory with the file extension ".faa" and use the "-in" option instead. Ensure that the sequence headers are unique across all ".fasta" or ".faa" files (all files will be concatenated into a single "total.fasta" or "total.faa" file) and that your file names do not contain spaces (we suggest using only alphanumeric characters and underscores). Place only genomes in the genomes folder, not other files such as non-genome metagenomic assemblies, since METABOLIC treats every file within the folder as a genome. METABOLIC-C only works correctly with ".fasta" files and the "-in-gn" flag.
  2. The "-r" flag takes a text file that defines the paths of the metagenomic reads (if running METABOLIC-C). These are the metagenomic read datasets that you used to generate the MAGs. Confirm that you are using unzipped fastq files rather than zipped files before you run METABOLIC-C. Each set of paired reads is entered on one line, with the two files separated by a ",". Give the absolute path to each read file. A sample of this text file is as follows:
#Read pairs: 
/path/to/your/reads/file/SRR3577362_sub_1.fastq,/path/to/your/reads/file/SRR3577362_sub_2.fastq
/path/to/your/reads/file/SRR3577362_sub2_1.fastq,/path/to/your/reads/file/SRR3577362_sub2_2.fastq

Note that different sets of paired reads are separated by a line return (new line), and the two read files on each line are separated by "," and not by " ," or " , " (no spaces before or after the comma). Blank lines are not allowed.
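
The formatting rules for the paired-reads list can be checked before launching a run. The following is a minimal Python sketch, not part of METABOLIC; the function name `check_paired_reads_lines` is hypothetical, and it assumes that lines starting with "#" (such as "#Read pairs:" in the sample) are header lines that can be skipped and that an absolute path starts with "/".

```python
def check_paired_reads_lines(lines):
    """Return (line_number, message) tuples for rule violations in a paired-reads list."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            continue  # assumption: "#" lines are headers, as in the sample file
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        fields = line.split(",")
        if len(fields) != 2:
            problems.append((number, "expected exactly two comma-separated files"))
            continue
        for field in fields:
            if field != field.strip():
                problems.append((number, "space before or after the comma"))
            elif not field.startswith("/"):
                problems.append((number, "not an absolute path: " + field))
            elif field.endswith((".gz", ".zip", ".bz2")):
                problems.append((number, "looks compressed: " + field))
    return problems


def check_paired_reads_file(path):
    """Convenience wrapper that reads the list file from disk."""
    with open(path) as handle:
        return check_paired_reads_lines(handle)
```

An empty result means the file passed these checks; it does not confirm that the fastq files exist or are valid.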

  3. If you use long reads generated by PacBio or Nanopore, use the "-st" (or "-sequencing-type") option to indicate the sequencing method (please refer to the flag description above). Short reads (Illumina) and long reads (PacBio or Nanopore) should not be used together as input. As with short reads, confirm that you are using unzipped fastq files rather than zipped files before you run METABOLIC-C, and give the absolute path to each read file. Since long reads are provided as single-end reads, each read file (if you have more than one) goes on its own line, as follows:
#Read pairs: 
/path/to/your/reads/file/Nanopore_1st_run.fastq
/path/to/your/reads/file/Nanopore_2nd_run.fastq
  4. hmmsearch and hmmscan (in the dbCAN2 processing step) normally have a very small memory footprint, but when they run in parallel with a high CPU thread number the aggregate memory demand can be very high and may cause problems for the server. It has been reported that using 40 cores can consume 1 TB of RAM; you can use this as a reference point when choosing your thread number.
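
The input-folder requirements in the first note (unique sequence headers, no spaces in file names, only genome files in the folder) can be checked with a short script before a run. This is a minimal Python sketch, not part of METABOLIC; the function names are hypothetical, and it assumes that uniqueness is judged on the sequence ID, i.e. the first whitespace-delimited token after ">", which is stricter than comparing whole header lines.

```python
from collections import defaultdict
from pathlib import Path


def duplicate_headers(named_lines):
    """named_lines: iterable of (file_name, iterable_of_lines).

    Returns {sequence_id: [file names]} for every ID that occurs more than once.
    """
    seen = defaultdict(list)
    for name, lines in named_lines:
        for line in lines:
            if line.startswith(">"):
                tokens = line[1:].split()
                seen[tokens[0] if tokens else ""].append(name)
    return {seq_id: files for seq_id, files in seen.items() if len(files) > 1}


def scan_genome_folder(folder, extension=".fasta"):
    """Return (suspicious file names, duplicated sequence IDs) for a genome folder."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    # Files with another extension or with spaces in the name need attention.
    odd_names = [p.name for p in files if p.suffix != extension or " " in p.name]

    def streams():
        for p in files:
            if p.suffix == extension:
                with p.open() as handle:
                    yield p.name, handle

    return odd_names, duplicate_headers(streams())
```

For amino acid input, `scan_genome_folder(folder, extension=".faa")` applies the same checks.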

Running Test Data:

The main METABOLIC directory also contains a set of 5 genomes and one set of paired metagenomic reads, which can be used to test whether METABOLIC-G and METABOLIC-C were installed correctly. These genomes and reads can be found within the directory METABOLIC_test_files/, which is contained within the METABOLIC program directory.

METABOLIC-C.pl and METABOLIC-G.pl can be run with the test data by using the "-test true" option:

perl METABOLIC-G.pl -test true

perl METABOLIC-C.pl -test true

How To Run METABOLIC:

The main scripts that should be used to run the program are METABOLIC-G.pl or METABOLIC-C.pl.

To run METABOLIC-G starting from nucleotide sequences, AT LEAST the following flags should be used:

perl METABOLIC-G.pl -in-gn [path_to_folder_with_genome_files] -o [output_directory_to_be_created]

  Note that "-in-gn" takes the folder containing your genome files (nucleotide sequences).

To run METABOLIC-G starting from amino acid sequences, AT LEAST the following flags should be used:

perl METABOLIC-G.pl -in [path_to_folder_with_genome_files] -o [output_directory_to_be_created]

  Note that "-in" takes the folder containing your genome files (amino acid sequences).

To run METABOLIC-C, AT LEAST the following flags should be used:

perl METABOLIC-C.pl -in-gn [path_to_folder_with_genome_files] -r [path_to_list_of_paired_reads] -o [output_directory_to_be_created]

(METABOLIC-C will only use fasta files, so -in option is not applicable here.)
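
When METABOLIC is launched from a script, the minimal flag combinations above can be assembled programmatically. The following is a minimal Python sketch under stated assumptions: the helper `metabolic_command` is hypothetical (it is not shipped with METABOLIC), it only uses flags described on this page, and it assumes the scripts are invoked from the METABOLIC program directory.

```python
def metabolic_command(script, genome_dir, out_dir, reads_list=None,
                      amino_acid=False, threads=None):
    """Build the argument list for a minimal METABOLIC-G or METABOLIC-C run."""
    if script not in ("METABOLIC-G.pl", "METABOLIC-C.pl"):
        raise ValueError("unknown script: " + script)
    if script == "METABOLIC-C.pl":
        if amino_acid:
            raise ValueError("METABOLIC-C only accepts nucleotide input (-in-gn)")
        if reads_list is None:
            raise ValueError("METABOLIC-C needs a reads list (-r)")
    # "-in" for a folder of .faa files, "-in-gn" for a folder of .fasta files
    command = ["perl", script, "-in" if amino_acid else "-in-gn", genome_dir]
    if script == "METABOLIC-C.pl":
        command += ["-r", reads_list]
    if threads is not None:
        command += ["-t", str(threads)]
    command += ["-o", out_dir]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`; building a list rather than a single string avoids shell quoting problems with unusual paths.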


A 2nd METABOLIC-C run:

We provide an additional METABOLIC-C script ("METABOLIC-C.2nd_run.pl") for users who want to run METABOLIC-C multiple times on the same set of genomes.

This script has an option "-2nd-run" that reuses the genome annotation intermediate folders and/or the metagenomic/metatranscriptomic mapping result of a previous run, which can save a lot of time across multiple runs of METABOLIC-C. Set the option to "true" or "false" (default: 'false').

[Note] If this option is set to "true", option "-o" should be set to the output folder of a previous successful run; METABOLIC-C will use the intermediate files within it. You will also need to set the option "-2nd-run-suffix"; the suffix will be appended to the new folders and files that are generated by the 2nd run (including "METABOLIC_Figures", "METABOLIC_Figures_Input", "MW-score_result", "METABOLIC_run.log", and "METABOLIC_log.log"). The Prodigal Method (-p or -prodigal-method), KOfam DB (-kofam-db), Module Cutoff Value (-m-cutoff or -module-cutoff), and Input Genome directory (nucleotides) (-in-gn) should be the same as in the previous run.

[Note] You can also use the option "-depth-file" to reuse the depth file produced by a previous run (provide only the depth file name, without a path in front, e.g., "All_gene_collections_mapped.depth.txt"). This option should be used only together with "-2nd-run". It is useful if you want to vary the option "-taxonomy" (or "-tax") to calculate the MW-score contribution of microbial groups at multiple taxonomic levels.

Example: Use genome annotation documents in the previous METABOLIC output directory to conduct the 2nd run with the suffix "2nd_run_test" appended to corresponding folders and files.

perl METABOLIC-C.2nd_run.pl -in-gn [path_to_folder_with_genome_files_of_a_previous_run] -r [path_to_list_of_paired_reads] -o [output_directory_of_a_previous_run] -2nd-run true -2nd-run-suffix 2nd_run_test

Example: Conduct the 2nd run with the suffix "MW_score_tax_genus" appended, reusing the previous depth file "All_gene_collections_mapped.depth.txt" (provide only the file name, not the full path) with the setting "-tax genus". This is useful if you want to try different settings of "-tax" for the same set of inputs.

perl METABOLIC-C.2nd_run.pl -in-gn [path_to_folder_with_genome_files_of_a_previous_run] -r [path_to_list_of_paired_reads] -o [output_directory_of_a_previous_run] -2nd-run true -2nd-run-suffix MW_score_tax_genus -depth-file All_gene_collections_mapped.depth.txt -tax genus
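
To try several "-tax" settings in a row, the command above can be generated once per taxonomic level. This is a minimal Python sketch; the helper `second_run_commands` and the suffix pattern "MW_score_tax_<level>" are illustrative choices modelled on the example, not something METABOLIC requires.

```python
# Taxonomic levels accepted by "-tax" according to the flag description.
TAX_LEVELS = ("phylum", "class", "order", "family", "genus", "species", "bin")


def second_run_commands(genome_dir, reads_list, out_dir, levels,
                        depth_file="All_gene_collections_mapped.depth.txt"):
    """Build one METABOLIC-C.2nd_run.pl argument list per taxonomic level."""
    if "/" in depth_file:
        raise ValueError("give only the depth file name, without a path")
    commands = []
    for level in levels:
        if level not in TAX_LEVELS:
            raise ValueError("unsupported -tax level: " + level)
        commands.append([
            "perl", "METABOLIC-C.2nd_run.pl",
            "-in-gn", genome_dir,
            "-r", reads_list,
            "-o", out_dir,                       # output folder of the previous run
            "-2nd-run", "true",
            "-2nd-run-suffix", "MW_score_tax_" + level,
            "-depth-file", depth_file,
            "-tax", level,
        ])
    return commands
```

Each list can be run in turn with `subprocess.run(command, check=True)`; a distinct suffix per level keeps the results of each run in separate folders.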


METABOLIC output files:

Output Files Overview:

| Output File | File Description | Generated by METABOLIC-C | Generated by METABOLIC-G |
|---|---|---|---|
| All_gene_collections_mapped.depth.txt | The gene depth of all input genes | X | |
| Each_HMM_Amino_Acid_Sequence/ | The faa collection for each hmm file | X | X |
| intermediate_files/ | The hmmsearch, peptides (MEROPS), CAZymes (dbCAN2), and GTDB-Tk (only for METABOLIC-C) running intermediate files | X | X |
| KEGG_identifier_result/ | The hit and result of each genome by Kofam database | X | X |
| METABOLIC_Figures/ | All figures output from the running of METABOLIC | X | X |
| METABOLIC_Figures_Input/ | All input files for R-generated diagrams | X | X |
| METABOLIC_result_each_spreadsheet/ | TSV files representing each sheet of the created METABOLIC_result.xlsx file | X | X |
| MW-score_result/ | The resulted table for MW-score | X | |
| METABOLIC_result.xlsx | The resulting excel file of METABOLIC | X | X |
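
After a run finishes, the overview table can be used as a checklist. The following minimal Python sketch reports which expected outputs are absent; the helper `missing_outputs` is hypothetical, and it assumes that all listed items sit at the top level of the output directory and that the two rows marked with a single X (the depth file and the MW-score result) are produced by METABOLIC-C only.

```python
# Output names and the scripts that produce them
# ("C" = METABOLIC-C, "G" = METABOLIC-G), as read from the overview table.
EXPECTED_OUTPUTS = {
    "All_gene_collections_mapped.depth.txt": {"C"},
    "Each_HMM_Amino_Acid_Sequence": {"C", "G"},
    "intermediate_files": {"C", "G"},
    "KEGG_identifier_result": {"C", "G"},
    "METABOLIC_Figures": {"C", "G"},
    "METABOLIC_Figures_Input": {"C", "G"},
    "METABOLIC_result_each_spreadsheet": {"C", "G"},
    "MW-score_result": {"C"},
    "METABOLIC_result.xlsx": {"C", "G"},
}


def missing_outputs(present_names, mode):
    """Return the expected output names that are not in present_names.

    mode is "C" for a METABOLIC-C run or "G" for a METABOLIC-G run.
    """
    if mode not in ("C", "G"):
        raise ValueError("mode must be 'C' or 'G'")
    present = set(present_names)
    return [name for name, modes in EXPECTED_OUTPUTS.items()
            if mode in modes and name not in present]
```

A typical call is `missing_outputs(os.listdir(output_dir), "C")` after `import os`; an empty list means every item from the table was found.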

Output Files Detailed:

• METABOLIC result table (METABOLIC_result.xlsx)

This spreadsheet has 6 sheets:

  1. "HMMHitNum" = Presence or absence of custom HMM profiles within each genome, the number of times the HMM profile was identified within a genome, and the ORF(s) that represent the identified protein.
  2. "FunctionHit" = Presence or absence of sets of proteins which were identified and displayed as separate proteins in the sheet titled "HMMHitNum". For each genome, the functions are identified as "Present" or "Absent".
  3. "KEGGModuleHit" = Annotation of each genome with modules from the KEGG database organized by metabolic category. For each genome, the modules are identified as "Present" or "Absent".
  4. "KEGGModuleStepHit" = Presence or absence of modules from the KEGG database within each genome separated into the steps that make up the module. For each genome, the module steps are identified as "Present" or "Absent".
  5. "dbCAN2Hit" = The dbCAN2 annotation results against all genomes (CAZy numbers and hits). For each genome, there are two distinct columns, which show the number of times a CAZy was identified and what ORF(s) represent the protein.
  6. "MEROPSHit" = The MEROPS peptidase searching result (MEROPS peptidase numbers and hits). For each genome, there are two distinct columns, which show the number of times a peptidase was identified and what ORF(s) represent the protein.
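
The relationship between module steps (sheet 4) and module calls (sheet 3) follows from the "-m-cutoff" flag: a module is designated as present when the fraction of its steps that are present reaches the cutoff. A minimal Python sketch of that rule is given here; the function name is hypothetical, and the use of "greater than or equal to" at exactly the cutoff is an assumption, since this page does not state how the boundary case is handled.

```python
def module_is_present(step_present, cutoff=0.75):
    """Decide whether a KEGG module counts as present.

    step_present: list of booleans, one per module step (True = step present).
    cutoff: fraction of steps that must be present (default of -m-cutoff: 0.75).
    """
    if not step_present:
        raise ValueError("a module needs at least one step")
    fraction = sum(1 for present in step_present if present) / len(step_present)
    # Assumption: a fraction equal to the cutoff counts as present.
    return fraction >= cutoff
```

For example, a four-step module with three steps present has a fraction of 0.75 and is called present at the default cutoff, while a module with one of two steps present is not.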

• Each HMM Profile Hit Amino Acid Sequence Collection (Each_HMM_Amino_Acid_Sequence/)

A collection of all amino acid sequences extracted from the input genome ".faa" files that were identified as matches to the HMM profiles provided by METABOLIC.

• KEGG identifier results (KEGG_identifier_result/)

The KEGG identifier searching result - KEGG identifier numbers and hits of each genome that could be used to visualize the pathways in KEGG Mapper

• All METABOLIC figures generated by METABOLIC-G.pl and METABOLIC-C.pl (METABOLIC_Figures/)

Nutrient cycling diagrams

Both METABOLIC-G.pl and METABOLIC-C.pl will generate a folder titled Nutrient_Cycling_Diagrams/ within the METABOLIC_Figures/ directory, which will contain figures that represent nutrient cycling pathways for Sulfur, Nitrogen, Carbon, and other select pathways found within each genome. METABOLIC-C.pl also has the ability to generate overall community nutrient cycling pathways.

Although the Nutrient_Cycling_Diagrams/ directory is generated by both METABOLIC-G.pl and METABOLIC-C.pl, the files contained within the directory will be dependent on which script is used.

For both programs, METABOLIC-G.pl and METABOLIC-C.pl, the Nutrient_Cycling_Diagrams/ directory will contain the following files:

  [GenomeName].draw_sulfur_cycle_single.PDF
  [GenomeName].draw_nitrogen_cycle_single.PDF
  [GenomeName].draw_other_cycle_single.PDF
  [GenomeName].draw_carbon_cycle_single.PDF

A red arrow designates the presence of a pathway step and a black arrow means absence. Note that the width of the arrows does not have any significance.

If you run METABOLIC-C.pl, the software will also calculate relative gene abundances, which will allow for generation of summary diagrams for pathways at a community scale:

  draw_sulfur_cycle_total.PDF
  draw_other_cycle_total.PDF
  draw_nitrogen_cycle_total.PDF
  draw_carbon_cycle_total.PDF

Note that the width of the arrows does not have any significance.

Sequential transformation diagram

  > The following figures, generated only by METABOLIC-C.pl, represent metabolic handoffs within the community:

The sequential transformation diagrams summarize and visualize the genome number and genome coverage (relative abundance) of the microorganisms that are putatively involved in the sequential transformation of both important inorganic elements and organic compounds.

The resulting files are Sequential_transformation_01.pdf and Sequential_transformation_02.pdf.

Metabolic Sankey diagram

  > The following figure, generated only by METABOLIC-C.pl, represents function contribution by the community:

The metabolic Sankey diagram represents the function fractions that are contributed by various microbial groups in a given community.

The resulting file is Metabolic_Sankey_diagram.pdf.

Functional network diagrams

  > The following figures, generated only by METABOLIC-C.pl, represent metabolic connections between different reactions that are found within the community:

The functional network diagrams represent the metabolic connections of biogeochemical cycling steps at both the phylum level and the whole community level.

The resulting files are placed in the directory Functional_network/.

• MW-score result (MW-score_result/)

For the MW-score result, a table showing the MW-score (Metabolic Weight score) is generated ("MW-score_result.txt"). An example:

The example MW-score figure is based on a metagenomic dataset of the microbial community inhabiting the deep-sea hydrothermal vent environment of Guaymas Basin in the Pacific Ocean. It contains 98 MAGs and 1 set of metagenomic reads. After determining the functional capacities of the whole community and the gene coverage for each function, and using methods similar to those for studying metabolic interactions, we selected functions that are shared among genomes and summarized their weights within the whole community by adding up their abundances.

In the example figure, the column "MW-score for each function" indicates the functional weights within the whole community. Functions that are shared more frequently and have higher abundances receive higher MW-scores, which quantitatively reflect the function weights in functional networks.

The remaining columns indicate the contribution of each phylum to the MW-score, which reflects each phylum's contribution to the function within the whole community. Overall, MW-scores provide a quantitative measure for comparing function weights and microbial group contributions within functional networks.

Notice: If you use metatranscriptomic reads instead of metagenomic reads in METABOLIC-C, the gene coverage result is replaced by transcript coverage [normalized to Reads Per Kilobase of transcript, per Million mapped reads (RPKM)], and all community analyses are performed based on the transcript coverage instead. A result file named "All_gene_collections_transcript_coverage.txt" is generated in the output directory in lieu of "All_gene_collections_gene_coverage.txt".
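
For reference, the RPKM normalization named above can be written out as a short calculation. This is a minimal Python sketch of the standard RPKM definition; the function and argument names are illustrative, and how METABOLIC-C itself counts mapped reads is not shown on this page.

```python
def rpkm(reads_mapped_to_gene, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript, per Million mapped reads.

    RPKM = reads mapped to the gene
           / (gene length in kilobases * total mapped reads in millions)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    kilobases = gene_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return reads_mapped_to_gene / (kilobases * millions)
```

For instance, 100 reads mapped to a 2,000 bp gene in a library with 1,000,000 mapped reads give 100 / (2 * 1) = 50 RPKM.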


METABOLIC version updates:

v4.0 -- Jun 22, 2020 --

  • METABOLIC now uses an R script to generate METABOLIC_result.xlsx, which fixes issues with the generation of a corrupt METABOLIC_result.xlsx file
  • Test input data now includes both five nucleotide fasta files and one set of paired sequencing reads, allowing all capabilities of both METABOLIC-G.pl and METABOLIC-C.pl to be tested
  • The MW-score table has been provided as one of the results by METABOLIC-C
  • Updated the motif checking step for pmo/amo, dsrE/tusD, dsrH/tusB, and dsrF/tusC
  • Updated the "reads-type" option allowing the use of metatranscriptomic reads to conduct community analysis
  • Updated the script error of assigning module step presence to be the same with module step presence
  • Updated the carbon fixation pathway gene markers
  • Added an option "-tax" to let users get the microbial group contribution to MW-score at different taxonomic levels
  • Corrected the calculation of metatranscriptome mapping depth file (read numbers of a pair of read files are now calculated as: each fastq file lines / 4 * 2)
  • Added "METABOLIC-C.2nd_run.pl" for a 2nd METABOLIC-C run, to run METABOLIC-C multiple times on the same set of genomes
  • Added "-st" ("-sequencing-type") option for alternative long reads mapping in METABOLIC-C.
  • Added total execution time for both METABOLIC-G and METABOLIC-C
  • Added "METABOLIC_log.log" in the result folder to record stdout and stderr
  • Change the shebang line to "#!/usr/bin/env perl" to use the conda env perl if installed by the conda installation method
  • Update the As metabolizing gene templates (and MW-score and pathway txt files) and HMMs in "METABOLIC_template_and_database.tgz" and "METABOLIC_hmm_db.tgz" (The As metabolizing gene HMM files were copied from GitHub deposits: https://github.com/ShadeLab/PAPER_Dunivin_meta_arsenic/tree/master/gene_targeted_assembly/gene_resource and Publication: BMC Biol . 2019 May 30;17(1):45. doi: 10.1186/s12915-019-0661-5 from Ashley Shade's group, MSU)
  • Update the Fe/Mn oxidizing and reducing gene templates (and MW-score and pathway txt files) and HMMs in "METABOLIC_template_and_database.tgz" and "METABOLIC_hmm_db.tgz" (The Iron oxidizing and reducing gene HMM files were copied from GitHub deposits: https://github.com/Arkadiy-Garber/FeGenie and Publication: Front Microbiol . 2020 Jan 31;11:37. doi: 10.3389/fmicb.2020.00037. eCollection 2020. by Arkadiy I Garber and Nancy Merino)
  • Add "iron oxidation" step to both the biogeochemical cycling diagram and MW-score; Use "ndh2" and "DFE_450 and DFE_464" as the markers to represent the corresponding whole gene operons for iron reduction assignment
  • Correct cellulase HMM from "K01779" to "K01179 and K20542" in "hmm_table_template.txt" within METABOLIC_template_and_database.tgz
  • Change "gtdbtk.ar122.summary.tsv" to "gtdbtk.ar53.summary.tsv" in METABOLIC-C.pl to meet the change of the new version of GTDB-Tk (v2.1.0+)
  • Add "bin" option to "-tax" to allow to get MW-score at the taxonomical level of MAG itself
  • Update the KEGG module part by using the most updated K00002.keg information (Jun 12, 2023 version); the METABOLIC-G and -C scripts are updated too. The module steps are curated by using the k-string (based on the Boolean expression)
  • Update the gtdbtk command in METABOLIC-C.pl and METABOLIC-C.2nd_run.pl by adding the "--skip_ani_screen" option for GTDB-Tk (v2.3.2). Updated on Dec 8, 2023.

v3.0 -- Feb 18, 2020 --

  • Provide an option to let the user reduce the size of Kofam Hmm profiles (only use KOs that can be found in Modules) to speed up the calculation
  • Change HMMER to v3.3 to speed up the calculation

v2.0 -- Nov 5, 2019 --

  • Add more functions on visualization, add more annotations, make the software faster

v1.3 -- Sep 5, 2019 --

  • Fix the output folder problem, the perl script could be called in another place instead of the original place

v1.2 -- Sep 5, 2019 --

  • Fix the prodigal parallel run, change "working-dir" to "METABOLIC-dir"

v1.1 -- Sep 4, 2019 --

  • Fix the parallel problem, change from hmmscan to hmmsearch, and update the "METABOLIC_template_and_database"
