SAAGA is a tool for summarising, annotating and assessing genome annotations, with a particular focus on annotation generated by GeMoMa. The core of SAAGA is reciprocal MMseqs2 searches of the annotation and reference proteomes. These are used to identify the best hits for protein product identification and to assess annotations based on query and hit coverage. SAAGA will also generate annotation summary statistics, and extract the longest protein from each gene for a representative non-redundant proteome (e.g. for BUSCO analysis).
Please note that SAAGA is still in development and documentation is currently a bit sparse.
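As a rough illustration of the reciprocal-search idea described above, the following Python sketch computes query and hit coverage and pairs up proteins that are each other's best hit. It is not SAAGA's code: the `Hit` record, its field names and the use of bit score to rank hits are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical hit record; the field names are illustrative and are
# not the columns of SAAGA's or MMseqs2's output tables.
Hit = namedtuple(
    "Hit",
    "query target query_len target_len aligned_query aligned_target bitscore",
)


def coverage(hit):
    """Return (query coverage, hit coverage) as percentages of full length."""
    qcov = 100.0 * hit.aligned_query / hit.query_len
    tcov = 100.0 * hit.aligned_target / hit.target_len
    return qcov, tcov


def best_hits(hits):
    """Keep the highest-scoring hit for each query (first one wins ties)."""
    best = {}
    for hit in hits:
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


def reciprocal_best(forward_hits, reverse_hits):
    """Return sorted (annotation, reference) pairs that are mutual best hits.

    forward_hits: annotation proteins searched against the reference.
    reverse_hits: reference proteins searched against the annotation.
    """
    fwd = best_hits(forward_hits)
    rev = best_hits(reverse_hits)
    return sorted(
        (query, hit.target)
        for query, hit in fwd.items()
        if hit.target in rev and rev[hit.target].target == query
    )
```

A protein with high query coverage but low hit coverage might indicate a truncated or fragmented gene model, which is the kind of signal a coverage-based assessment can pick up.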
The different run modes are set using a set of mode=T/F flags (or simply adding the run mode to the command):
assess = Assess annotation using reference annotation (e.g. a reference organism proteome)
annotate = Rename annotation using reference annotation (could be Swissprot)
longest = Extract the longest protein per gene
mmseq = Run the MMseqs2 steps in preparation for further analysis
summarise = Summarise annotation from a GFF file
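To make the longest mode concrete, here is a minimal Python sketch of picking one representative protein per gene. The input format (tuples of gene ID, protein ID and sequence) is hypothetical; SAAGA itself derives the gene-to-protein mapping from the GFF file, and its tie-breaking rule is not documented here.

```python
def longest_per_gene(proteins):
    """Return {gene_id: (protein_id, sequence)} with the longest protein per gene.

    proteins: iterable of (gene_id, protein_id, sequence) tuples
    (a hypothetical input format used only for this example).
    """
    longest = {}
    for gene_id, protein_id, sequence in proteins:
        current = longest.get(gene_id)
        # On ties the first protein seen is kept, so results are deterministic.
        if current is None or len(sequence) > len(current[1]):
            longest[gene_id] = (protein_id, sequence)
    return longest
```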
See https://slimsuite.github.io/saaga/ for details of each mode. General SLiMSuite run documentation can be found at https://github.com/slimsuite/SLiMSuite.
SAAGA is available as part of SLiMSuite, or via a standalone GitHub repo at https://github.com/slimsuite/saaga.
SAAGA is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/saaga.py [OPTIONS]
If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone SAAGA git repo, $CODEPATH will be the path to the code/ directory. Please see details in the SAAGA git repo for running on example data.
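When scripting several runs, it can help to assemble the command from its parts. The sketch below builds an argument list in the option=value style used throughout this documentation without executing anything. The function name `build_command` is hypothetical, and the script name is passed in by the caller rather than hard-coded.

```python
import os


def build_command(codepath, script, modes=(), options=None):
    """Assemble a run command as an argument list; nothing is executed.

    codepath: directory holding the script (the $CODEPATH described above).
    script:   script file name, supplied by the caller.
    modes:    run modes added as bare words, e.g. ["assess"].
    options:  dict of option=value settings, e.g. {"seqin": "annotation.faa"}.
    """
    cmd = ["python", os.path.join(codepath, script)]
    # Options are sorted only to make the output reproducible.
    for key, value in sorted((options or {}).items()):
        cmd.append("{0}={1}".format(key, value))
    cmd.extend(modes)
    return cmd
```

The resulting list could be handed to `subprocess.call()` or joined with spaces for a job script.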
For assess, annotate and mmseq modes, MMseqs2 must be installed and added to the environment $PATH.
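A quick pre-flight check along these lines can confirm that MMseqs2 is visible on $PATH before starting a long run. This is a Python 3 sketch (SAAGA itself is Python 2) and is not part of SAAGA; it assumes the MMseqs2 executable is installed under its usual name, `mmseqs`.

```python
import shutil


def mmseqs_available(executable="mmseqs"):
    """Return True if the named executable can be found on $PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if mmseqs_available():
        print("MMseqs2 found on $PATH")
    else:
        print("MMseqs2 not found: install it or add it to $PATH")
```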
A list of commandline options can be generated at run-time using the -h or help flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.
### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE : Protein annotation file to assess [annotation.faa]
gffin=FILE : Protein annotation GFF file [annotation.gff]
cdsin=FILE : Optional transcript annotation file for renaming and/or longest isoform extraction [annotation.fna]
refprot=FILE : Reference proteome for mapping data onto [refproteome.fasta]
refdb=FILE : Reference proteome MMseqs2 database (overrules mmseqdb path) []
mmseqdb=PATH : Directory in which to find/create mmseqs2 databases [./mmseqdb/]
mmsearch=PATH : Directory in which to find/create mmseqs2 searches [./mmsearch/]
basefile=X : Prefix for output files [$SEQBASE.$REFBASE]
gffgene=X : Label for GFF gene feature type ['gene']
gffcds=X : Label for GFF CDS feature type ['CDS']
gffmrna=X : Label for GFF mRNA feature type ['mRNA']
### ~ Run mode options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
annotate=T/F : Rename annotation using reference annotation (could be Swissprot) [False]
assess=T/F : Assess annotation using reference annotation [False]
longest=T/F : Extract longest protein per gene into *.longest.faa [False]
mmseqs=T/F : Run the mmseq2 steps in preparation for further analysis [True]
summarise=T/F : Summarise annotation from GFF file [True]
dochtml=T/F : Generate HTML SAAGA documentation (*.docs.html) instead of main run [False]
### ~ Search and filter options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
tophits=INT : Restrict mmseqs hits to the top X hits [250]
minglobid=PERC : Minimum global query percentage identity for a hit to be kept [40.0]
### ~ Precomputed MMSeq2 options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
mmqrymap=TSV : Tab-delimited output for query versus reference search (see docs) [$SEQBASE.$REFBASE.mmseq.tsv]
mmhitmap=TSV : Tab-delimited output for reference versus query search (see docs) [$REFBASE.$SEQBASE.mmseq.tsv]
### ~ Batch Run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
batchseq=FILELIST : List of seqin=FILE annotation proteomes for comparison
batchref=FILELIST : List of refprot=FILE reference proteomes for comparison
### ~ System options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more process [0]
tmpdir=PATH : Temporary directory path for running mmseqs2 [./tmp/]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
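The tophits and minglobid settings listed above can be pictured with the following Python sketch. It is an illustration under stated assumptions, not SAAGA's implementation: hits are hypothetical dictionaries, global identity is taken to be identical positions as a percentage of the full query length, and the top-hits cut is applied before the identity filter.

```python
def filter_hits(hits, tophits=250, minglobid=40.0):
    """Return {query: [hit names]} after top-hits and global identity filters.

    hits: list of dicts with the hypothetical keys 'query', 'hit',
    'identical' (identical positions), 'query_len' and 'score'.
    """
    # Group hits by query, best score first (sorted() is stable on ties).
    by_query = {}
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        by_query.setdefault(hit["query"], []).append(hit)

    kept = {}
    for query, ranked in by_query.items():
        kept[query] = [
            h["hit"]
            for h in ranked[:tophits]  # assumed order: top-hits cut first
            if 100.0 * h["identical"] / h["query_len"] >= minglobid
        ]
    return kept
```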
See SAAGA.md for more.