Finish Release-0.11.0
Daniel Mapleson committed Jan 5, 2015
2 parents 2ee797d + 2a70b56 commit ec9e70c
Showing 105 changed files with 2,478 additions and 787 deletions.
40 changes: 25 additions & 15 deletions README.md
@@ -21,13 +21,16 @@ http://rampart.readthedocs.org/en/latest/index.html
Installation
============

There are three ways to install RAMPART: via homebrew, from a tarball, or from source code. All installation methods
require you to have JRE V1.7+ installed. For more detailed installation instructions, please consult the RAMPART manual.

To install from homebrew, please first ensure you have homebrew or linuxbrew installed and the homebrew science repo tapped.
Then simply type: ``brew install rampart``.

To install from tarball, please go to the releases section of the RAMPART github page: ``https://github.com/TGAC/RAMPART/releases``
Then extract it to a directory of your choice: ``tar -xvf <name_of_tarball>``.

Alternatively, from source, you will first need the following dependencies installed:

* GIT
* Maven 3
@@ -47,8 +50,11 @@ specific URLs.
Assuming there were no compilation errors, the build can be found in ./build/rampart-<version>. There should also be a
dist sub-directory, which will contain a tarball suitable for installing RAMPART on other systems.

Dependencies
------------

Next, RAMPART's dependencies must be installed. To save time finding all these tools on the internet, RAMPART provides two options.
The first and recommended approach is to download a compressed tarball of all supported versions of the tools, which is available on the github releases page:
``https://github.com/TGAC/RAMPART/releases``. The second option is to have RAMPART download them all to a directory of your
choice. The one exception to this is SSPACE, which requires you to fill out a form prior to download; RAMPART can help
with this. After the core RAMPART pipeline is compiled, type: ``rampart-download-deps <dir>``. The tool will place all
@@ -129,14 +135,18 @@ Email: daniel.mapleson@tgac.ac.uk
Acknowledgements
================

* Nizar Drou
* David Swarbreck
* Bernardo Clavijo
* Robert Davey
* Sarah Bastkowski
* Tony Burdett
* Ricardo Ramirez
* Purnima Pachori
* Mark McCullen
* Hugo Taveres
* Ram Krishna Shrestha
* Darren Waite
* Tim Stitt
* Shaun Jackman
* And everyone who contributed to making the tools RAMPART depends on!
22 changes: 13 additions & 9 deletions doc/source/acknowledgements.rst
@@ -3,16 +3,20 @@
Acknowledgements
================

* Nizar Drou
* David Swarbreck
* Bernardo Clavijo
* Robert Davey
* Sarah Bastkowski
* Tony Burdett
* Ricardo Ramirez
* Purnima Pachori
* Mark McCullen
* Hugo Taveres
* Ram Krishna Shrestha
* Darren Waite
* Tim Stitt
* Shaun Jackman
* And everyone who contributed to making the tools RAMPART depends on!


58 changes: 43 additions & 15 deletions doc/source/analyse_assemblies.rst
@@ -21,7 +21,9 @@ KAT, performs a kmer count on the assembly using Jellyfish, and, assuming kmer counting was performed
previously, will use the Kmer Analysis Toolkit (KAT) to create a comparison matrix comparing kmer counts in the reads to
the assembly. This can be visualised later using KAT to show how much of the content in the reads has been assembled
and how repetitive the assembly is. Repetition could be due to heterozygosity in the diploid genomes so please read the
KAT manual and walkthrough guide to get a better understanding of how to interpret this data. Note that information
from KAT is not currently used automatically when selecting the best assembly. See the next section for more information about
automatic assembly selection.

CEGMA aligns highly conserved eukaryotic genes to the assembly. CEGMA produces a statistic which represents an estimate
of gene completeness in the assembly, i.e. if CEGMA maps 95% of the conserved genes to the assembly we can
@@ -53,12 +55,15 @@ Note that you can apply ``parallel`` attributes to both the ``analyse_mass`` and
Selecting the best assembly
---------------------------

Assuming at least one analysis option is selected, RAMPART will produce a summary file and a tab-separated values file listing
metrics for each assembly, along with scores relating to the contiguity, conservation and problem metrics, and a final overall score
for each assembly. Each score is given a value between 0.0 and 1.0, where higher values represent better assemblies. The
assembly with the highest score is then automatically selected as the **best** assembly to be used downstream.
The group scores and the final scores are derived from the underlying metrics, and can be adjusted by applying
different weightings to them. This is done by specifying a weightings file for the RAMPART pipeline to use.

By default RAMPART applies its own weightings, which can be found at ``<rampart_dir>/etc/weightings.tab``, so to run the
assembly selection stage with default settings the user simply needs to add the following element to the pipeline::

<select_mass/>

@@ -68,20 +73,43 @@ weightings file the XML snippet may look like this::

<select_mass weightings_file="~/.tgac/rampart/custom_weightings.tab"/>

The weightings file consists of key-value pairs, one per line, with the key and value separated by the '=' character. Comment
lines can start with '#'. For illustration, a weightings file might look like this (the keys and values shown here are taken
from the previous release's pipe-separated format, and may differ in the current version)::

    nb_seqs=0.05
    nb_seqs_gt_1k=0.1
    nb_bases=0.05
    nb_bases_gt_1k=0.05
    max_len=0.05
    n50=0.2
    l50=0.05
    gc%=0.05
    n%=0.1
    nb_genes=0.5
    completeness=0.25

Most metrics are derived from Quast results, except for the core eukaryote genes detection score, which is gathered from CEGMA.
Note that some metrics from Quast will only be used in certain circumstances. For example, the na50 and nb_ma_ref metrics are
only used if a reference is supplied in the organism element of the configuration file. Additionally, the nb_bases, nb_bases_gt_1k
and gc% metrics are used only if the user has supplied a reference, or has provided an estimated genome size and / or estimated
gc% for the organism respectively.

TODO: Currently the kmer metric is not included. In the future this will offer an alternative means of assessing
assembly completeness.

The file best.fa is particularly important as this is the assembly that will be taken forward to the second half of the pipeline
(from the AMP stage). Although we have found the scoring system to be generally quite useful, we strongly recommend that users
make their own assessment as to which assembly to take forward, as we acknowledge that the scoring system is biased by
outlier assemblies. For example, consider three assemblies with an N50 of 1000, 1100 and 1200 bp, with scaled scores of
0, 0.5 and 1. We add a fourth assembly, which does poorly and is disregarded by the user, with an N50 of 200 bp. Now the
scaled N50 scores of the assemblies are 0, 0.8, 0.9 and 1. Even though the user has no intention of using that poor
assembly, the effective weight of the N50 metric across the three good assemblies has decreased drastically, by a factor of
(1 - 0) / (1 - 0.8) = 5. It is even possible that the assembly selected as the best changes when an irrelevant assembly is added.
For example, consider two metrics, a and b, with even weights of 0.5, scored first for three assemblies and then again after
adding a fourth, irrelevant, assembly that performs worst in both metrics. By adding the fourth assembly,
the choice of the best assembly has changed.

Three assemblies::

    a  = {1000, 1100, 1200},  b  = {0, 10, 8}
    sa = {0, 0.5, 1},         sb = {0, 1, 0.8}
    fa = {0, 0.75, 0.9}
    best = 0.9, the assembly with a = 1200

Four assemblies::

    a  = {200, 1000, 1100, 1200},  b  = {0, 0, 10, 8}
    sa = {0, 0.8, 0.9, 1},         sb = {0, 0, 1, 0.8}
    fa = {0, 0.4, 0.95, 0.9}
    best = 0.95, the assembly with a = 1100
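This outlier effect can be reproduced with a short sketch. The snippet below illustrates min-max scaling and weighted
averaging as described above; it is not RAMPART's actual implementation, and the metric names ``a`` and ``b`` are just
the placeholders used in the example:

```python
def min_max_scale(values):
    """Linearly scale a list of metric values onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def final_scores(metrics, weights):
    """Weighted mean of the scaled metric columns, giving one score per assembly."""
    scaled = {name: min_max_scale(col) for name, col in metrics.items()}
    n = len(next(iter(metrics.values())))
    return [sum(weights[m] * scaled[m][i] for m in metrics) for i in range(n)]

weights = {"a": 0.5, "b": 0.5}

# Three assemblies: the one with a = 1200 scores highest (0.9).
print(final_scores({"a": [1000, 1100, 1200], "b": [0, 10, 8]}, weights))

# Add an irrelevant fourth assembly that is worst in both metrics:
# the winner flips to the assembly with a = 1100 (score 0.95).
print(final_scores({"a": [200, 1000, 1100, 1200], "b": [0, 0, 10, 8]}, weights))
```

Running this prints the ``fa`` rows from the worked example, showing that the best assembly changes purely because an
assembly nobody intends to use was added to the comparison.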

To reiterate, we recommend that the user double-check the results provided by RAMPART and, if necessary, overrule the choice
of assembly selected for further processing. This can be done, e.g. starting from the AMP stage with a user-selected
assembly, by using the following command: ``rampart -2 -a <path_to_assembly> <path_to_job_config>``.


Analysing assemblies produced by AMP
7 changes: 5 additions & 2 deletions doc/source/analyse_reads.rst
@@ -6,8 +6,11 @@ Analysing reads

This stage analyses all datasets, both the RAW and those, if any, which have been produced by the MECQ stage.

Currently, the only analysis option provided involves a kmer analysis, using tools called jellyfish and KAT. This
process will produce GC vs kmer frequency plots, which can highlight potential contamination and indicate whether you
have sufficient coverage in your datasets for assembling your genome.

The user has the option to control the number of threads and the amount of memory to request per process, and whether or not the
kmer counting for each dataset should take place in parallel. An example of this is shown below::

<analyse_reads kmer="true" parallel="true" threads="16" memory="4000"/>
6 changes: 3 additions & 3 deletions doc/source/conf.py
@@ -50,9 +50,9 @@
# built documents.
#
# The short X.Y version.
version = '0.11'
# The full version, including alpha/beta/rc tags.
release = '0.11.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
@@ -263,7 +263,7 @@
epub_title = u'RAMPART'
epub_author = u'Daniel Mapleson, Nizar Drou, David Swarbreck'
epub_publisher = u'Daniel Mapleson, Nizar Drou, David Swarbreck'
epub_copyright = u'2015, The Genome Analysis Centre'

# The basename for the epub file. It defaults to the project name.
#epub_basename = u'RAMPART'
13 changes: 8 additions & 5 deletions doc/source/env-config.rst
@@ -37,11 +37,13 @@ External process configuration
------------------------------

RAMPART can utilise a number of dependencies, each of which may require modification of environment variables in order
for it to run successfully. This can be problematic if multiple versions of the same piece of software need to be
available in the same environment. At TGAC we execute python scripts to configure the environment for each tool, although other
institutes may use an alternative system such as "modules". Instead of configuring all the tools in one go, RAMPART can execute
commands specific to each dependency just prior to its execution. Currently known process keys are described below. In general,
the versions indicated have been tested and will work with RAMPART; however, other versions may work if their
command line interface has not changed significantly from the listed versions. Note that these keys are hard coded, so please keep
the exact wording as below, even if you are using a different version of the software.
The format for each entry is as follows: ``<key>=<command_to_load_tool>``. Valid keys::

# Assemblers
@@ -55,6 +57,7 @@
# Dataset improving tools
Sickle_V1.2
Quake_V0.3
Musket_V1.0

# Assembly improving tools
Platanus_Gapclose_V1.2
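As an illustration, entries for some of the keys above on a site using an environment-modules style loader might look like
the following (the load commands and module names are hypothetical; substitute whatever command configures each tool on
your system):

```
# Dataset improving tools
Sickle_V1.2=module load sickle/1.2
Quake_V0.3=module load quake/0.3
Musket_V1.0=module load musket/1.0

# Assembly improving tools
Platanus_Gapclose_V1.2=module load platanus/1.2
```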
2 changes: 1 addition & 1 deletion doc/source/index.rst
@@ -5,7 +5,7 @@
.. image:: RAMPART-logo.png

RAMPART is a *de novo* assembly pipeline that makes use of third-party tools and High Performance Computing resources. It
can be used as a single interface to several popular assemblers, and can perform automated comparison and analysis of
any generated assemblies.

17 changes: 16 additions & 1 deletion doc/source/installation.rst
@@ -9,10 +9,22 @@ dependencies are required to install and run RAMPART:

* Java Runtime Environment (JRE) V1.7+

RAMPART can be installed either from a distributable tarball, from source via a ``git clone``, or via homebrew. These steps are
described below. Before that however, here are a few things to keep in mind during the installation process:


Quick start
-----------

To get a bare-bones version of RAMPART up and running quickly, we recommend installation via homebrew. This requires you
to first install homebrew and tap homebrew/science. On Mac you can get homebrew from ``http://brew.sh``; on linux, use
linuxbrew, available from ``https://github.com/Homebrew/linuxbrew``. Once installed, make sure to tap homebrew science with
``brew tap homebrew/science``. Then, as discussed above, please ensure you have JRE V1.7+ installed. Finally, to install
RAMPART simply type ``brew install rampart``. This will install RAMPART into your homebrew cellar, with the bare minimum
of dependencies: Quake, Kmergenie, ABySS, Velvet, Quast and KAT.



From tarball
------------

@@ -31,6 +43,9 @@ be the following sub-directories:
Should you want to run the tools without referring to their paths, you should ensure the 'bin' sub-directory is on your
PATH environment variable.

Also, please note that this method does not install any dependencies automatically. You must install them before trying to
run RAMPART.


From source
-----------
29 changes: 29 additions & 0 deletions doc/source/introduction.rst
Expand Up @@ -46,3 +46,32 @@ supported and RAMPART can execute jobs in parallel over many nodes if requested.
to run all parts of the pipeline in sequence on a regular server provided enough memory is available for the job in question.

This documentation is designed to help end users install, configure and run RAMPART.

Comparison to other systems
---------------------------

**Roll-your-own Make files** This method probably offers the most flexibility. It allows you to define exactly how you
want your tools to run, in whatever order you wish. However, you will need to define all the inputs and outputs to each tool,
and in some cases write scripts to manage interoperability between otherwise incompatible tools. RAMPART takes all
this complication away from the user, as all input and output between each tool is managed automatically. In addition,
RAMPART offers more support for HPC environments, making it easier to parallelize steps in the pipeline. Managing this
manually is difficult and time consuming.

**Galaxy** This is a platform for chaining together tools in such a way as to promote reproducible analyses. It also
has support for HPC environments. However, it is a heavyweight solution, and is not trivial to install
and configure locally. RAMPART itself is lightweight in comparison and, ignoring dependencies, much easier to install. In
addition, Galaxy is not designed with *de novo* genome assembly specifically in mind, whereas RAMPART
is. RAMPART places more constraints on the workflow design process and performs more checks before the
workflow is started. In addition, as mentioned above, RAMPART automatically manages interoperability between tools, which
will likely save the user time debugging workflows and writing scripts to handle specific tool interaction issues.

**A5-miseq** and **BugBuilder** Both are domain-specific pipelines for automating the assembly of microbial organisms.
They are designed specifically with microbial genomes in mind and keep their interfaces simple and easy to use. RAMPART,
while more complex to use, is far more configurable as a result. RAMPART also allows users to tackle eukaryote assembly projects.

**iMetAMOS** This is a configurable pipeline for isolate genome assembly and annotation. One distinct advantage of iMetAMOS is
that it offers the ability to annotate your genome. It also supports some assemblers that RAMPART currently does not.
Both systems are highly configurable, allowing the user to create bespoke pipelines and to compare and validate the results of
multiple assemblers. However, in its current form, iMetAMOS does not have as much provision for automating or managing assembly
scaffolding or gap filling steps in the assembly workflow. In addition, we would argue that RAMPART is more configurable, easier
to use and has more support for HPC environments.
