Finish Release-0.11.0
Daniel Mapleson committed Jan 5, 2015
2 parents 2ee797d + 2a70b56 commit ec9e70c
Showing 105 changed files with 2,478 additions and 787 deletions.
40 changes: 25 additions & 15 deletions README.md
@@ -21,13 +21,16 @@ http://rampart.readthedocs.org/en/latest/index.html
Installation
============

There are three ways to install RAMPART: via homebrew, from a tarball, or from source code. All installation methods
require you to have JRE V1.7+ installed. For more detailed installation instructions, please consult the RAMPART manual.

To install from homebrew, please first ensure you have homebrew or linuxbrew installed and the homebrew science repo tapped.
Then simply type: ``brew install rampart``.

To install from tarball, please go to the releases section of the RAMPART github page: ``https://github.com/TGAC/RAMPART/releases``
Then extract it to a directory of your choice: ``tar -xvf <name_of_tarball>``.

Alternatively, from source, you will first need the following dependencies installed:

* GIT
* Maven 3
@@ -47,8 +50,11 @@ specific URLs.
Assuming there were no compilation errors, the build can be found in ./build/rampart-<version>. There should also be a
dist sub-directory, which will contain a tarball suitable for installing RAMPART on other systems.

Dependencies
------------

Next, RAMPART's dependencies must be installed. To save time finding all these tools on the internet, RAMPART provides two options.
The first and recommended approach is to download a compressed tarball of all supported versions of the tools, which is available on the github releases page:
``https://github.com/TGAC/RAMPART/releases``. The second option is to have RAMPART download them all to a directory of your
choice. The one exception to this is SSPACE, which requires you to fill out a form prior to download; RAMPART can help
with this. After the core RAMPART pipeline is compiled, type: ``rampart-download-deps <dir>``. The tool will place all
@@ -129,14 +135,18 @@ Email: daniel.mapleson@tgac.ac.uk
Acknowledgements
================

* Nizar Drou
* David Swarbreck
* Bernardo Clavijo
* Robert Davey
* Sarah Bastkowski
* Tony Burdett
* Ricardo Ramirez
* Purnima Pachori
* Mark McCullen
* Hugo Taveres
* Ram Krishna Shrestha
* Darren Waite
* Tim Stitt
* Shaun Jackman
* And everyone who contributed to making the tools RAMPART depends on!
22 changes: 13 additions & 9 deletions doc/source/acknowledgements.rst
@@ -3,16 +3,20 @@
Acknowledgements
================

* Nizar Drou
* David Swarbreck
* Bernardo Clavijo
* Robert Davey
* Sarah Bastkowski
* Tony Burdett
* Ricardo Ramirez
* Purnima Pachori
* Mark McCullen
* Hugo Taveres
* Ram Krishna Shrestha
* Darren Waite
* Tim Stitt
* Shaun Jackman
* And everyone who contributed to making the tools RAMPART depends on!


58 changes: 43 additions & 15 deletions doc/source/analyse_assemblies.rst
@@ -21,7 +21,9 @@ KAT, performs a kmer count on the assembly using Jellyfish, and, assuming kmer counting was performed
previously, will use the Kmer Analysis Toolkit (KAT) to create a comparison matrix comparing kmer counts in the reads to
the assembly. This can be visualised later using KAT to show how much of the content in the reads has been assembled
and how repetitive the assembly is. Repetition could be due to heterozygosity in the diploid genomes so please read the
KAT manual and walkthrough guide to get a better understanding of how to interpret this data. Note that information
from KAT is not currently used automatically when selecting the best assembly. See the next section for more information about
automatic assembly selection.

CEGMA aligns highly conserved eukaryotic genes to the assembly. CEGMA produces a statistic which represents an estimate
of gene completeness in the assembly, i.e. if CEGMA maps 95% of the conserved genes to the assembly we can
@@ -53,12 +55,15 @@ Note that you can apply ``parallel`` attributes to both the ``analyse_mass`` and
Selecting the best assembly
---------------------------

Assuming at least one analysis option is selected, RAMPART will produce a summary file and a tab-separated values file listing
metrics for each assembly, along with scores relating to the contiguity, conservation and problem metrics, and a final overall score
for each assembly. Each score is given a value between 0.0 and 1.0, where higher values represent better assemblies. The
assembly with the highest score is then automatically selected as the **best** assembly to be used downstream.
The group scores and the final scores are derived from the underlying metrics, and can be adjusted by applying
different weightings to them. This is done by specifying a weightings file for the RAMPART pipeline to use.

By default RAMPART applies its own weightings, which can be found at ``<rampart_dir>/etc/weightings.tab``, so to run the
assembly selection stage with default settings the user simply needs to add the following element to the pipeline::

<select_mass/>

@@ -68,20 +73,43 @@ weightings file the XML snippet may look like this::

<select_mass weightings_file="~/.tgac/rampart/custom_weightings.tab"/>

The weightings file consists of key-value pairs, one per line, with the key and value separated by the '=' character. Comment
lines can start with '#'. For illustration, a weightings file might look like this (the keys and values shown here are taken
from the previous release's pipe-separated format, and may differ in the current version)::

    nb_seqs=0.05
    nb_seqs_gt_1k=0.1
    nb_bases=0.05
    nb_bases_gt_1k=0.05
    max_len=0.05
    n50=0.2
    l50=0.05
    gc%=0.05
    n%=0.1
    nb_genes=0.5
    completeness=0.25

Most metrics are derived from Quast results, except for the core eukaryote genes detection score, which is gathered from CEGMA.
Note that some metrics from Quast will only be used in certain circumstances. For example, the na50 and nb_ma_ref metrics are
only used if a reference is supplied in the organism element of the configuration file. Additionally, the nb_bases, nb_bases_gt_1k
and gc% metrics are used only if the user has supplied a reference, or has provided an estimated genome size and / or estimated
gc% for the organism respectively.

TODO: Currently the kmer metric is not included. In the future this will offer an alternative means of assessing
assembly completeness.

The file best.fa is particularly important as this is the assembly that will be taken forward to the second half of the pipeline
(from the AMP stage). Although we have found the scoring system to be generally quite useful, we strongly recommend that users
make their own assessment as to which assembly to take forward, as we acknowledge that the scoring system is biased by
outlier assemblies. For example, consider three assemblies with an N50 of 1000, 1100 and 1200 bp, with scaled scores of
0, 0.5 and 1. We add a fourth assembly, which does poorly and is disregarded by the user, with an N50 of 200 bp. Now the
scaled N50 scores of the assemblies are 0, 0.8, 0.9 and 1. Even though the user has no intention of using that poor
assembly, the effective weight of the N50 metric across the three good assemblies has decreased drastically, by a factor of
(1 - 0) / (1 - 0.8) = 5. It is even possible that the assembly selected as the best changes when an irrelevant assembly is added.
For example, consider two metrics, a and b, with even weights of 0.5, scored first for three assemblies and then again after
adding a fourth, irrelevant, assembly that performs worst in both metrics. By adding the fourth assembly,
the choice of the best assembly has changed.

Three assemblies::

    a  = {1000, 1100, 1200},  b  = {0, 10, 8}
    sa = {0, 0.5, 1},         sb = {0, 1, 0.8}
    fa = {0, 0.75, 0.9}
    best = 0.9, the assembly with a = 1200

Four assemblies::

    a  = {200, 1000, 1100, 1200},  b  = {0, 0, 10, 8}
    sa = {0, 0.8, 0.9, 1},         sb = {0, 0, 1, 0.8}
    fa = {0, 0.4, 0.95, 0.9}
    best = 0.95, the assembly with a = 1100
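This outlier effect can be reproduced with a short sketch. The snippet below illustrates min-max scaling and weighted
averaging as described above; it is not RAMPART's actual implementation, and the metric names ``a`` and ``b`` are just
the placeholders used in the example:

```python
def min_max_scale(values):
    """Linearly scale a list of metric values onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def final_scores(metrics, weights):
    """Weighted mean of the scaled metric columns, giving one score per assembly."""
    scaled = {name: min_max_scale(col) for name, col in metrics.items()}
    n = len(next(iter(metrics.values())))
    return [sum(weights[m] * scaled[m][i] for m in metrics) for i in range(n)]

weights = {"a": 0.5, "b": 0.5}

# Three assemblies: the one with a = 1200 scores highest (0.9).
print(final_scores({"a": [1000, 1100, 1200], "b": [0, 10, 8]}, weights))

# Add an irrelevant fourth assembly that is worst in both metrics:
# the winner flips to the assembly with a = 1100 (score 0.95).
print(final_scores({"a": [200, 1000, 1100, 1200], "b": [0, 0, 10, 8]}, weights))
```

Running this prints the ``fa`` rows from the worked example, showing that the best assembly changes purely because an
assembly nobody intends to use was added to the comparison.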

To reiterate, we recommend that the user double-check the results provided by RAMPART and, if necessary, overrule the choice
of assembly selected for further processing. This can be done, e.g. starting from the AMP stage with a user-selected
assembly, by using the following command: ``rampart -2 -a <path_to_assembly> <path_to_job_config>``.


Analysing assemblies produced by AMP
7 changes: 5 additions & 2 deletions doc/source/analyse_reads.rst
@@ -6,8 +6,11 @@ Analysing reads

This stage analyses all datasets, both the RAW and those, if any, which have been produced by the MECQ stage.

Currently, the only analysis option provided involves a kmer analysis, using tools called jellyfish and KAT. This
process will produce GC vs kmer frequency plots, which can highlight potential contamination and indicate whether you
have sufficient coverage in your datasets for assembling your genome.

The user has the option to control the number of threads and the amount of memory to request per process, and whether or not the
kmer counting for each dataset should take place in parallel. An example of this is shown below::

<analyse_reads kmer="true" parallel="true" threads="16" memory="4000"/>
6 changes: 3 additions & 3 deletions doc/source/conf.py
@@ -50,9 +50,9 @@
# built documents.
#
# The short X.Y version.
version = '0.11'
# The full version, including alpha/beta/rc tags.
release = '0.11.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
@@ -263,7 +263,7 @@
epub_title = u'RAMPART'
epub_author = u'Daniel Mapleson, Nizar Drou, David Swarbreck'
epub_publisher = u'Daniel Mapleson, Nizar Drou, David Swarbreck'
epub_copyright = u'2015, The Genome Analysis Centre'

# The basename for the epub file. It defaults to the project name.
#epub_basename = u'RAMPART'
13 changes: 8 additions & 5 deletions doc/source/env-config.rst
@@ -37,11 +37,13 @@ External process configuration
------------------------------

RAMPART can utilise a number of dependencies, each of which may require modification of environment variables in order
for it to run successfully. This can be problematic if multiple versions of the same piece of software need to be
available in the same environment. At TGAC we execute python scripts to configure the environment for each tool, although other
institutes may use an alternative system such as "modules". Instead of configuring all the tools in one go, RAMPART can execute
commands specific to each dependency just prior to its execution. Currently known process keys are described below. In general,
the versions indicated have been tested and will work with RAMPART; however, other versions may work if their
command line interface has not changed significantly from the listed versions. Note that these keys are hard coded, so please keep
the exact wording as below, even if you are using a different version of the software.
The format for each entry is as follows: ``<key>=<command_to_load_tool>``. Valid keys::

# Assemblers
@@ -55,6 +57,7 @@
# Dataset improving tools
Sickle_V1.2
Quake_V0.3
Musket_V1.0

# Assembly improving tools
Platanus_Gapclose_V1.2
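As an illustration, entries for some of the keys above on a site using an environment-modules style loader might look like
the following (the load commands and module names are hypothetical; substitute whatever command configures each tool on
your system):

```
# Dataset improving tools
Sickle_V1.2=module load sickle/1.2
Quake_V0.3=module load quake/0.3
Musket_V1.0=module load musket/1.0

# Assembly improving tools
Platanus_Gapclose_V1.2=module load platanus/1.2
```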
2 changes: 1 addition & 1 deletion doc/source/index.rst
@@ -5,7 +5,7 @@
.. image:: RAMPART-logo.png

RAMPART is a *de novo* assembly pipeline that makes use of third-party tools and High Performance Computing resources. It
can be used as a single interface to several popular assemblers, and can perform automated comparison and analysis of
any generated assemblies.

17 changes: 16 additions & 1 deletion doc/source/installation.rst
@@ -9,10 +9,22 @@ dependencies are required to install and run RAMPART:

* Java Runtime Environment (JRE) V1.7+

RAMPART can be installed either from a distributable tarball, from source via a ``git clone``, or via homebrew. These steps are
described below. Before that however, here are a few things to keep in mind during the installation process:


Quick start
-----------

To get a bare-bones version of RAMPART up and running quickly, we recommend installation via homebrew. This requires you
to first install homebrew and tap homebrew/science. On Mac you can get homebrew from ``http://brew.sh``; on linux, use
linuxbrew, available from ``https://github.com/Homebrew/linuxbrew``. Once installed, make sure to tap homebrew science with
``brew tap homebrew/science``. Then, as discussed above, please ensure you have JRE V1.7+ installed. Finally, to install
RAMPART simply type ``brew install rampart``. This will install RAMPART into your homebrew cellar, with the bare minimum
of dependencies: Quake, Kmergenie, ABySS, Velvet, Quast and KAT.



From tarball
------------

@@ -31,6 +43,9 @@ be the following sub-directories:
Should you want to run the tools without referring to their paths, you should ensure the 'bin' sub-directory is on your
PATH environment variable.

Also, please note that this method does not install any dependencies automatically. You must install them before trying to
run RAMPART.


From source
-----------
29 changes: 29 additions & 0 deletions doc/source/introduction.rst
Expand Up @@ -46,3 +46,32 @@ supported and RAMPART can execute jobs in parallel over many nodes if requested.
to run all parts of the pipeline in sequence on a regular server provided enough memory is available for the job in question.

This documentation is designed to help end users install, configure and run RAMPART.

Comparison to other systems
---------------------------

**Roll-your-own Make files** This method probably offers the most flexibility. It allows you to define exactly how you
want your tools to run, in whatever order you wish. However, you will need to define all the inputs and outputs to each tool,
and in some cases write scripts to manage interoperability between otherwise incompatible tools. RAMPART takes all
this complication away from the user, as all input and output between each tool is managed automatically. In addition,
RAMPART offers more support for HPC environments, making it easier to parallelize steps in the pipeline. Managing this
manually is difficult and time consuming.

**Galaxy** This is a platform for chaining together tools in such a way as to promote reproducible analyses. It also
has support for HPC environments. However, it is a heavyweight solution, and is not trivial to install
and configure locally. RAMPART itself is lightweight in comparison and, ignoring dependencies, much easier to install. In
addition, Galaxy is not designed with *de novo* genome assembly specifically in mind, whereas RAMPART
is. RAMPART places more constraints on the workflow design process and performs more checks before the
workflow is started. In addition, as mentioned above, RAMPART automatically manages interoperability between tools, which
will likely save the user time debugging workflows and writing scripts to handle specific tool interaction issues.

**A5-miseq** and **BugBuilder** Both are domain-specific pipelines for automating the assembly of microbial organisms.
They are designed specifically with microbial genomes in mind and keep their interfaces simple and easy to use. RAMPART,
while more complex to use, is far more configurable as a result. RAMPART also allows users to tackle eukaryote assembly projects.

**iMetAMOS** This is a configurable pipeline for isolate genome assembly and annotation. One distinct advantage of iMetAMOS is
that it offers the ability to annotate your genome. It also supports some assemblers that RAMPART currently does not.
Both systems are highly configurable, allowing the user to create bespoke pipelines and to compare and validate the results of
multiple assemblers. However, in its current form, iMetAMOS does not have as much provision for automating or managing assembly
scaffolding or gap filling steps in the assembly workflow. In addition, we would argue that RAMPART is more configurable, easier
to use and has more support for HPC environments.
