1 Setup process using a Conda environment

Matteopaluh edited this page Oct 25, 2023 · 5 revisions

Premises

  1. Open a shell in the KEMET directory. The scripts must be executable; if they are not, run:
    chmod +x ./*.py

Populate the working directory

  1. The first time only, run the set_kemet_working-directory.py script (add the -G parameter if genome-scale model functionalities are wanted).
    This populates the KEMET folder with the subfolders where inputs and outputs are stored.

Set input files

  1. Place the input files in the proper paths (IMPORTANT):
  • Copy the MAG/Genome sequences to be analysed into the KEMET/genomes/ folder, which is created by the setup process.
    NOTE:
    Only the ".fa", ".fna" and ".fasta" sequence file extensions are supported.
    FASTA headers must not be repeated within a single MAG/Genome.
    If necessary, rename MAGs/Genomes and FASTA headers accordingly. The following command replaces each header with a sequential number:
    awk '/^>/{print ">"++i; next}{print}' < original.fasta > new.fasta

  • Copy the KEGG KO annotations (which can come from different sources) into the KEMET/KEGG_annotations/ folder, created by the setup process.
    The script must be told which program generated the input KEGG annotation (as of January 2023, the supported formats are eggNOG, KofamKOALA from both web server and command line, KAAS and KAAS-like).
    Do not change the annotations from their original output format (truncated example files can be found in the KEMET/toy/ folder).

  • Check that the KEMET/KEGG_MODULES/ folder is present, as in the GitHub repository. The script needs it because it contains the KEGG Module structure files (REF: KEGG MODULE resource), from which missing KO orthologs are deduced. Other "custom" Modules can be added to that folder if they are properly formatted (see the wiki page on this topic).

  • (Optional) Pre-existing genome-scale models (GSMMs or GEMs) in the BiGG namespace (".xml" files) can be copied into the KEMET/models/ folder, created by the setup process. These files are used to expand an existing GSMM, which is one of the two GSMM options; the other is de novo GSMM creation with the extra protein-coding genes discovered via HMMs.

    IMPORTANT NOTE

    Do not modify file extensions.
    Do not modify the rest of the file names either, unless the names of the KEGG KO annotations do not correspond to those of the input MAGs/genomes:

    e.g. the bin1.fasta/.fa/.fna MAG/Genome should be paired with the KEGG annotations in the file bin1.emapper.annotations, and both should be used with the bin1.xml genome-scale model file.
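
The rules above (supported extensions, no repeated FASTA headers, matching file names) can be checked before launching an analysis. The following is a minimal sketch, not part of KEMET: the function names are hypothetical, whole header lines are compared, and an annotation file is assumed to "match" a genome when its name starts with the genome's base name followed by a dot, as in the bin1 example.

```python
from collections import Counter
from pathlib import Path

# Extensions listed as supported in the notes above.
FASTA_EXTENSIONS = {".fa", ".fna", ".fasta"}


def duplicated_headers(fasta_path):
    """Return the header lines that occur more than once in a FASTA file."""
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                counts[line[1:].strip()] += 1
    return sorted(header for header, n in counts.items() if n > 1)


def check_inputs(kemet_dir):
    """Return a list of human-readable problems found in the input folders."""
    root = Path(kemet_dir)
    genomes_dir = root / "genomes"
    annotations_dir = root / "KEGG_annotations"
    problems = []
    for folder in (genomes_dir, annotations_dir):
        if not folder.is_dir():
            problems.append(f"missing folder: {folder.name}/")
    if problems:
        return problems

    annotation_names = [p.name for p in annotations_dir.iterdir() if p.is_file()]
    for genome in sorted(genomes_dir.iterdir()):
        if not genome.is_file():
            continue
        if genome.suffix not in FASTA_EXTENSIONS:
            problems.append(f"{genome.name}: unsupported extension")
            continue
        repeats = duplicated_headers(genome)
        if repeats:
            problems.append(f"{genome.name}: repeated headers: {', '.join(repeats)}")
        # Assumption: a matching annotation file starts with "<genome base name>.".
        prefix = genome.stem + "."
        if not any(name.startswith(prefix) for name in annotation_names):
            problems.append(f"{genome.name}: no annotation file named {prefix}*")
    return problems
```

Calling check_inputs(".") from the KEMET directory returns an empty list when nothing suspicious is found; any reported line points at a file to rename or fix before running the pipeline.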

Fill Instruction files (Only for HMM/GSMM use)

  1. (ONLY mandatory if HMM and GSMM steps are needed) Fill in the text file called "genomes.instruction", generated by the setup process.
    Excluding the header, each line should hold three tab-separated fields:

    MAG/Genome FASTA    Taxonomic indication    Metabolic model universe
  • MAG/Genome FASTA is the file name of the MAG/Genome of interest (e.g. bin1.fasta)

  • The taxonomic indication should be taken from the KEGG BRITE taxonomy (specifically from the C-level, which in most cases coincides with the NCBI phylum-level taxonomy) (REF: BRITE Organism table) (e.g. Actinobacteria)

  • The metabolic model universe is one of grampos, gramneg, archaea or another custom universe (this indication is optional and only needed for de novo GSMM reconstruction)

A handy script for this is included (add_taxonomy_from_gtdb-tk.py). It speeds up the process by converting the taxonomy obtained with the popular tool GTDB-tk, which assigns the more complete and up-to-date Genome Taxonomy Database (GTDB) taxonomy to MAGs.
It needs the output of the GTDB-tk gtdb_to_ncbi_majority_vote.py script. This way the GTDB taxonomy is converted to NCBI standards, which are then converted to the required KEGG BRITE taxonomy.
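
When the file is filled by hand or from a custom table instead, the three-field, tab-separated layout described above can be produced as in the following sketch. It is not part of KEMET: the helper names are hypothetical, and it assumes that a missing universe is written as an empty third field, which should be checked against the generated file before relying on it.

```python
from pathlib import Path


def instruction_line(fasta_name, taxonomy, universe=""):
    """Build one tab-separated line for genomes.instruction."""
    fields = [fasta_name, taxonomy, universe]
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    # Assumption: an omitted universe is written as an empty third field.
    return "\t".join(fields) + "\n"


def append_lines(instruction_path, lines):
    """Append lines to an instruction file, keeping what is already there."""
    path = Path(instruction_path)
    existing = path.read_text() if path.exists() else ""
    # Make sure the existing content (e.g. the header) ends with a newline.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    with open(path, "a") as handle:
        handle.write(separator + "".join(lines))
```

For example, append_lines("genomes.instruction", [instruction_line("bin1.fasta", "Actinobacteria", "grampos")]) adds one row below the header that the setup script generated.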

  1. (ONLY mandatory if HMM and GSMM steps are needed) Fill in the other instruction text files.
    If HMM analyses are desired, they need either the module_file.instruction or the ko_file.instruction file, depending on the desired MODE OF USE (which must be specified with the --hmm_mode MODE parameter), as follows.

    MODE      Analysis                                 Instructions
    onebm     KOs from KEGG Modules missing 1 block    No need to fill instruction files
    modules   KOs from a fixed list of KEGG Modules    One per line in the module_file.instruction file
    kos       KOs from a fixed list of orthologs       One per line in the ko_file.instruction file
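
The mapping between modes and instruction files can be sketched as follows. This is an illustration, not KEMET code: the function name is hypothetical, and it assumes the files expect bare KEGG identifiers (M or K followed by five digits), one per line, with no header. Inspect the generated instruction files first to confirm the expected layout.

```python
import re

# Instruction file needed by each --hmm_mode value (None: nothing to fill).
INSTRUCTION_FILE_BY_MODE = {
    "onebm": None,
    "modules": "module_file.instruction",
    "kos": "ko_file.instruction",
}

# KEGG Module IDs look like M00001, KO IDs like K00001.
ID_PATTERN_BY_MODE = {
    "modules": re.compile(r"M\d{5}"),
    "kos": re.compile(r"K\d{5}"),
}


def instruction_text(mode, identifiers=()):
    """Return (file name, file content) for a mode, or None for onebm."""
    if mode not in INSTRUCTION_FILE_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    file_name = INSTRUCTION_FILE_BY_MODE[mode]
    if file_name is None:
        return None
    pattern = ID_PATTERN_BY_MODE[mode]
    bad = [item for item in identifiers if not pattern.fullmatch(item)]
    if bad:
        raise ValueError(f"unexpected identifiers for mode {mode!r}: {bad}")
    # Assumption: one bare identifier per line, no header.
    return file_name, "".join(f"{item}\n" for item in identifiers)
```

For instance, instruction_text("kos", ["K00001", "K00002"]) returns the file name ko_file.instruction together with two lines of text, which can then be written to that file.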

KEMET script

  1. Launch the kemet.py command-line script with your arguments of choice! See the help page for details, or the initial Readme page for basic usage.