SAAGA: Summarise, Annotate & Assess Genome Annotations

SAAGA is a tool for summarising, annotating and assessing genome annotations, with a particular focus on annotations generated by GeMoMa. The core of SAAGA is reciprocal MMseqs2 searches of the annotation and reference proteomes. These are used to identify the best hits for protein product identification and to assess annotations based on query and hit coverage. SAAGA will also generate annotation summary statistics, and extract the longest protein from each gene to give a representative non-redundant proteome (e.g. for BUSCO analysis).
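
To make the coverage idea concrete, here is a minimal Python sketch. It is not SAAGA's own code: the `Hit` record, its field names and the `best_hits` function are hypothetical, and it assumes each search hit has already been reduced to a score, aligned lengths and full sequence lengths.

```python
from collections import namedtuple

# Hypothetical record for one search hit (not SAAGA's internal format).
Hit = namedtuple("Hit", "query hit score query_aligned query_len hit_aligned hit_len")

def coverage(aligned, length):
    """Return the percentage of a sequence covered by the alignment."""
    return 100.0 * aligned / length

def best_hits(hits):
    """Keep the highest-scoring hit per query and report both coverages."""
    best = {}
    for h in hits:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    return {
        q: (h.hit,
            coverage(h.query_aligned, h.query_len),
            coverage(h.hit_aligned, h.hit_len))
        for q, h in best.items()
    }

hits = [
    Hit("protA", "ref1", 250.0, 90, 100, 90, 180),
    Hit("protA", "ref2", 400.0, 100, 100, 100, 100),
]
summary = best_hits(hits)  # maps query -> (best hit, query coverage, hit coverage)
```

Under these assumptions, high query coverage paired with low hit coverage might point to a fragmented or truncated annotation, while the reverse might point to a fused or over-extended one.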

Please note that SAAGA is still in development and documentation is currently a bit sparse.

The run modes are set using mode=T/F flags (or by simply adding the run mode to the command):

assess = Assess annotation using reference annotation (e.g. a reference organism proteome)
annotate = Rename annotation using reference annotation (could be Swissprot)
longest = Extract the longest protein per gene
mmseqs = Run the MMseqs2 steps in preparation for further analysis
summarise = Summarise annotation from a GFF file
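
As an illustration of what the longest mode produces, the sketch below picks one representative protein per gene. It is an assumption-laden example rather than SAAGA's implementation: the `longest_per_gene` function, the input tuples and the tie-breaking rule (first isoform seen wins) are all hypothetical.

```python
def longest_per_gene(proteins):
    """Map each gene to its longest protein.

    proteins: iterable of (gene_id, protein_id, sequence) tuples.
    """
    longest = {}
    for gene, prot, seq in proteins:
        # Replace the stored protein only if this one is strictly longer,
        # so ties keep the first isoform seen (an arbitrary choice here).
        if gene not in longest or len(seq) > len(longest[gene][1]):
            longest[gene] = (prot, seq)
    return longest

proteins = [
    ("gene1", "gene1.t1", "MKT"),
    ("gene1", "gene1.t2", "MKTAYIAK"),
    ("gene2", "gene2.t1", "MSE"),
]
representative = longest_per_gene(proteins)
```

In this toy input, gene1 is represented by gene1.t2 and gene2 by its only isoform.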

See https://slimsuite.github.io/saaga/ for details of each mode. General SLiMSuite run documentation can be found at https://github.com/slimsuite/SLiMSuite.

SAAGA is available as part of SLiMSuite, or via a standalone GitHub repo at https://github.com/slimsuite/saaga.

Running SAAGA

SAAGA is written in Python 2.x and can be run directly from the commandline:

python $CODEPATH/saaga.py [OPTIONS]

If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone SAAGA git repo, $CODEPATH will be the path to the code/ directory. Please see details in the SAAGA git repo for running on example data.

For assess, annotate and mmseqs modes, MMseqs2 must be installed and added to the environment $PATH.
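
A quick way to confirm the $PATH requirement before a run is sketched below. It uses Python 3's standard `shutil.which` and assumes the MMseqs2 executable has its usual name, `mmseqs`; the `find_mmseqs` helper is a hypothetical name and is not part of SAAGA.

```python
import shutil

def find_mmseqs(executable="mmseqs"):
    """Return the full path to the executable, or None if it is not on $PATH."""
    return shutil.which(executable)

path = find_mmseqs()
if path is None:
    print("mmseqs not found on $PATH; install MMseqs2 or update $PATH first")
else:
    print("Found mmseqs at", path)
```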

Commandline options

A list of commandline options can be generated at run-time using the -h or help flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.

### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE      : Protein annotation file to assess [annotation.faa]
gffin=FILE      : Protein annotation GFF file [annotation.gff]
cdsin=FILE      : Optional transcript annotation file for renaming and/or longest isoform extraction [annotation.fna]
refprot=FILE    : Reference proteome for mapping data onto [refproteome.fasta]
refdb=FILE      : Reference proteome MMseq2 database (over-rule mmseqdb path) []
mmseqdb=PATH    : Directory in which to find/create mmseqs2 databases [./mmseqdb/]
mmsearch=PATH   : Directory in which to find/create mmseqs2 search results [./mmsearch/]
basefile=X      : Prefix for output files [$SEQBASE.$REFBASE]
gffgene=X       : Label for GFF gene feature type ['gene']
gffcds=X        : Label for GFF CDS feature type ['CDS']
gffmrna=X       : Label for GFF mRNA feature type ['mRNA']
### ~ Run mode options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
annotate=T/F    : Rename annotation using reference annotation (could be Swissprot) [False]
assess=T/F      : Assess annotation using reference annotation [False]
longest=T/F     : Extract longest protein per gene into *.longest.faa [False]
mmseqs=T/F      : Run the mmseq2 steps in preparation for further analysis [True]
summarise=T/F   : Summarise annotation from GFF file [True]
dochtml=T/F     : Generate HTML SAAGA documentation (*.docs.html) instead of main run [False]
### ~ Search and filter options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
tophits=INT     : Restrict mmseqs hits to the top X hits [250]
minglobid=PERC  : Minimum global query percentage identity for a hit to be kept [40.0]
### ~ Precomputed MMSeq2 options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
mmqrymap=TSV    : Tab-delimited output for query versus reference search (see docs) [$SEQBASE.$REFBASE.mmseq.tsv]
mmhitmap=TSV    : Tab-delimited output for reference versus query search (see docs) [$REFBASE.$SEQBASE.mmseq.tsv]
### ~ Batch Run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
batchseq=FILELIST   : List of seqin=FILE annotation proteomes for comparison
batchref=FILELIST   : List of refprot=FILE reference proteomes for comparison
### ~ System options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X         : Number of parallel sequences to process at once [0]
killforks=X     : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X     : Sleep time (seconds) between cycles of forking out more processes [0]
tmpdir=PATH     : Temporary directory path for running mmseqs2 [./tmp/]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
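
The options above are passed as key=value pairs, so a run can be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_saaga_args` is a hypothetical helper, the script name passed to it is assumed rather than confirmed, and the file names are simply the defaults shown in the listing.

```python
def build_saaga_args(script, **options):
    """Return an argument list of the form ['python', script, 'key=value', ...]."""
    args = ["python", script]
    for key in sorted(options):
        value = options[key]
        # Booleans are written as T/F, matching the option listing above.
        if isinstance(value, bool):
            value = "T" if value else "F"
        args.append("{0}={1}".format(key, value))
    return args

args = build_saaga_args(
    "saaga.py",  # assumed script name; adjust to your installation
    seqin="annotation.faa",
    gffin="annotation.gff",
    refprot="refproteome.fasta",
    assess=True,
    forks=4,
)
print(" ".join(args))
```

The resulting list could be handed to `subprocess.run(args)`; sorting the keys only makes the command reproducible and is not something the tool is known to require.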

See SAAGA.md for more.
