Merge pull request #28 from Rfam/rfam-cloud

Add Rfam cloud documentation
Rfam · Oct 24, 2019 · e4b9711
2 parents cc72477 + fcfe254
commit e4b9711

Showing 21 changed files with 511 additions and 115 deletions.
diff --git a/docs/source/about-rfam.rst b/docs/source/about-rfam.rst
@@ -1,23 +1,18 @@
 About Rfam
 ==========
 
-The Rfam database is a collection of RNA sequence families of
+The `Rfam <http://rfam.org>`_ database is a collection of RNA sequence families of
 structural RNAs including non-coding RNA genes as well as
-cis-regulatory elements.
+cis-regulatory elements. Each family is represented by a multiple
+sequence alignment and a covariance model (CM).
 
-Each family is represented by multiple
-sequence alignments and covariance models (CMs).
 You can use the `Rfam website <http://rfam.org>`_
 to obtain information about an individual family, or browse
-our families and genome annotations. Alternatively you can download
-all of the Rfam data from our `FTP site <ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT>`_.
+the families and genome annotations. Alternatively you can download
+all of the Rfam data from the `FTP site <ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT>`_.
+Find out more about the project by exploring the latest :ref:`citing-rfam:Rfam references`.
 
-.. HINT::
-
-  Take a `quick tour <https://www.ebi.ac.uk/training/online/course/rfam-quick-tour>`_
-  of Rfam to find out more about the project.
-
-For each Rfam family we provide:
+For each family Rfam provides:
 
 **Summary page**
   Textual background information on the RNA family, which we obtain from

diff --git a/docs/source/api.rst b/docs/source/api.rst
@@ -8,7 +8,7 @@ Most data in Rfam can be accessed programmatically using a RESTful API
 allowing for integration with other resources.
 
 .. HINT::
-  You can also access the data using a :ref:`Public MySQL Database`
+  You can also access the data using a :ref:`database:Public MySQL Database`
   that contains the latest Rfam release.
 
 Data access

diff --git a/docs/source/building-families.rst b/docs/source/building-families.rst
@@ -1,24 +1,15 @@
 How Rfam families are built
 ===========================
 
-*rfamseq* database
-------------------
-
-Starting with Rfam 13.0, the underlying nucleotide sequence database from which
-the RNA families are built (known as *rfamseq*) is based on a reprsentative collection
-of complete genomes maintained by `UniProt <http://www.uniprot.org/proteomes>`_.
-
-*rfamseq* is usually updated with each major Rfam release, e.g., 12.0 or 13.0.
-You can find out the information about *rfamseq* currently in use in the
-`README file <ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/README>`_ in the Rfam FTP archive.
+.. Note:: 🆕 If you are interested in building new Rfam families or updating the existing ones, take a look at the :ref:`rfam-cloud:Rfam cloud pipeline`.
 
 Seed alignments and secondary structure annotation
 --------------------------------------------------
 
-Rfam **seed alignments** are small, curated sets of representative sequences
+An Rfam :ref:`glossary:Seed alignment` is a small, curated set of representative sequences
 for each family, as opposed to an alignment of all known members. The
 seed alignment also has a **secondary structure** annotation, which
-represents the conserved secondary structure for these sequences.
+represents the **conserved** secondary structure for these sequences.
 
 The ideal basis for a new family is an RNA element that:
 
@@ -31,21 +22,21 @@ must first obtain at least one **experimentally validated example** from
 the published literature. If any other homologues are identified in the
 literature, we will add these to the seed. Alternatively, if these are
 not available, we will try to identify other members either by
-similarity searching (using Infernal) or manual curation.
+similarity searching (using :ref:`glossary:Infernal`) or manual curation.
 
 Where possible we will use a multiple sequence alignment and
-secondary structure annotation provided in the literature. If this is
+secondary structure annotation provided in the **literature**. If this is
 the case, we will cite the source of both the alignment and the
 secondary structure. You should note that the structure annotations
 obtained from the literature may be experimentally validated or they
-may be RNA folding predictions (commonly `Mfold <http://unafold.rna.albany.edu/?q=mfold>`_).
+may be RNA folding predictions (commonly :ref:`glossary:Mfold`).
 Unfortunately, we do not discriminate between these two cases when we
 cite the PubMed Identifier (PMID) and you will need to refer to the
 original publications to clarify.
 
 Alternatively, where this information is not available from the
 literature, we will generate an alignment and secondary structure
-prediction using various software, such as `WAR <http://genome.ku.dk/resources/war>`_. This
+prediction using various software, such as :ref:`glossary:WAR` or :ref:`glossary:RNAalifold`. This
 software allows us to cherry pick the best alignment and secondary
 structure prediction. Historically, the methods used to
 make these alignments and folding predictions have varied.
@@ -55,18 +46,11 @@ author on the list will be the most recent editor of the secondary
 structure. You can
 find the method we have used for the seed alignment or the secondary
 structure annotation in the **SE** and **SS**
-lines of the `Stockholm format <https://en.wikipedia.org/wiki/Stockholm_format>`_
-or in the curation information pages.
-
-Covariance Models
------------------
+lines of the :ref:`glossary:Stockholm format` or in the curation information pages.
 
 From the seed alignment, we use the `Infernal software <http://eddylab.org/infernal/>`_ to build a
-probabilistic model (covariance model or CM) for this family. Useful
-references on stochastic free grammars and covariance models can be
-found in the :ref:`Citing Rfam`
-section. This model is then used to search the *rfamseq*
-database for other possible homologs.
+:ref:`glossary:Covariance model (CM)` for this family.
+This model is then used to search the :ref:`glossary:rfamseq` database for other possible homologs.
 
 Expanding the seed (iteration)
 ------------------------------
@@ -80,6 +64,13 @@ continue to iterate the seed until we have good resolution
 between real and false hits and cannot improve the seed membership
 further.
 
+.. figure:: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6754622/bin/nihms-1047076-f0022.jpg
+    :alt: Building an RNA family using Infernal
+    :width: 600
+    :align: center
+
+    Building an RNA family using Infernal
+
 Important points to remember about seed alignments
 ------------------------------------------------------
 
@@ -98,24 +89,20 @@ Important points to remember about seed alignments
   generate predicted structures
 * Each seed sequence will be a significant match to the corresponding
   covariance model. A significant score is generally greater than 20
-  bits
+  bits.
 
 Rfam full alignments
 --------------------
 
-The Rfam full alignments contain all of the sequences in *rfamseq* that
+The Rfam :ref:`glossary:full alignment` contains all of the sequences in *rfamseq* that
 we can identify as members of the family. The alignment is generated by
 searching the covariance model for the family against the *rfamseq*
-database. Matches that score above a :ref:`gathering cutoff` are aligned to
+database. Matches that score above a :ref:`glossary:Gathering cutoff` are aligned to
 the CM to produce the full alignment. All sequences in the seed will
 also be present in the full alignment.
 
-As of Rfam 12.0, we no longer automatically generate full alignments for
-each Rfam family. You may download the Rfam CM and generate your own alignments
-using Infernal.
-
-Family annotation
------------------
+Wikipedia annotations
+---------------------
 
 In order to provide some background and functional information about
 a family, we link to a `Wikipedia <http://www.wikipedia.org/>`_

diff --git a/docs/source/choosing-gathering-threshold.rst b/docs/source/choosing-gathering-threshold.rst
@@ -0,0 +1,138 @@
+Choosing gathering threshold
+============================
+
+The :ref:`glossary:Gathering cutoff` (also known as gathering threshold) corresponds to the bit score of the lowest-scoring sequence that is considered part of the family. When creating families using the :ref:`rfam-cloud:Rfam cloud pipeline`, it is important to choose an accurate cutoff value by **reviewing the following 3 files**: ``species``, ``taxinfo``, and ``align``, as well as the :ref:`glossary:R-scape` secondary structure diagrams.
+
+.. contents:: Table of contents
+  :local:
+
+Species file
+------------
+
+The rfsearch program produces a file called ``species`` that contains a list of all sequences identified by an Infernal search in the :ref:`glossary:Rfamseq` database using a covariance model based on the :ref:`glossary:Seed alignment`.
+
+1. View the ``species`` file using ``less``::
+
+    less -S species
+
+  The ``-S`` option displays long lines without wrapping them.
+
+  .. hint::
+    For help with using ``less``, check out `10 Tips for Effective Navigation <https://www.thegeekstuff.com/2010/02/unix-less-command-10-tips-for-effective-navigation>`_ or type ``man less``.
+
+  Each line corresponds to a sequence matching the covariance model and includes the bit score, sequence label, species, taxonomic lineage, and other information.
+
+.. figure:: images/species-file-example.png
+      :alt: Example species file
+      :width: 600
+      :align: center
+
+      Example species file
+
+2. Find the current gathering cutoff by searching for the word ``THRESHOLD`` using ``less``::
+
+    / THRESHOLD
+
+.. figure:: images/species-file-threshold-example.png
+      :alt: Example species file
+      :width: 600
+      :align: center
+
+      Example species file showing current gathering threshold and the best reversed hit
+
+Consider the following questions:
+
+- **How many sequences are above the gathering threshold?** If there are no or very few sequences, then the threshold may need to be lowered.
+
+- **Do you notice any jumps in bit score values?** For example, consider the following list of bit scores:
+
+  - 80.1
+  - 79.4
+  - 75.4
+  - 70.1
+  - 69.4
+  - 41.1
+  - 39.3
+  - #### CURRENT THRESHOLD ####
+  - 39.2
+  - 39.1
+
+  Notice that there is a sudden drop in bit scores between 69.4 and 41.1 bits. You should carefully examine the sequences immediately before and after the drop and decide whether they belong in the same family. A bit score jump could indicate where to put the cutoff (in this example, it could be set to 69.0, as it is customary to round scores to the nearest whole bit). Please note that a bit score jump alone does not provide enough evidence for setting the gathering cutoff for a family and should be used in combination with the other information explained below.
+
+- **Does the taxonomic distribution of the hits match the expectation?** For example, if you are building an RNA family that is described in the literature as bacterial, it is desirable to set the threshold in a way that excludes non-bacterial hits. Each case should be reviewed individually, as it is possible that the unexpected hits could represent contamination, horizontal gene transfer, or a biologically interesting case.
+
+- **Are any SEED sequences below the gathering threshold?**
+
+  The gathering threshold should include all sequences in the SEED file. It is expected that the covariance model will identify all sequences from the seed alignment that the covariance model is based on.
+
+  The **sequence label** (third column) takes one of 3 possible values:
+
+  .. list-table::
+
+      * - ``SEED``
+        - Sequence from the Seed alignment.
+      * - ``FULL``
+        - Sequence from Rfamseq that was identified using the covariance model.
+      * - ``NOT``
+        - Any sequence scoring below the gathering threshold.
+
+  If a seed sequence has a very low bit score (for example, lower than the REVERSED score), consider removing it from the seed alignment.
+
+3. Find the top scoring random hit by searching for the word ``REVERSED`` using ``less``::
+
+    / REVERSED
+
+In order to exclude false positives, the rfsearch command scans a large collection of sequences called the **Reversed database**. It consists of 10% of the Rfamseq sequences that have been reversed to preserve the sequence composition but decrease sequence similarity to real sequences (except for rare cases of `palindromes <https://en.wikipedia.org/wiki/Palindrome>`_).
+
+⚠️ The reversed hits are **random sequences** and **should not be included in the family**.
+
+For example, if the current threshold is 40 bits but the top scoring reversed hit is at 45 bits, it means that the gathering threshold needs to be raised to at least 45 bits.
+
+.. hint::
+    Consider also reviewing the ``outlist`` file which is similar to ``species`` but contains slightly different information, such as sequence descriptions as well as the details about whether the hits were truncated or appear on the reverse strand.
+
+Taxinfo file
+-------------
+
+The ``taxinfo`` file is created by the rfmake program and includes the taxonomic distribution of the hits listed in the ``species`` file. It can be viewed using ``less``::
+
+    less -S taxinfo
+
+.. figure:: images/taxinfo-example.png
+      :alt: Example taxinfo file
+      :width: 600
+      :align: center
+
+      Example taxinfo file
+
+Reviewing the file allows one to better understand the taxonomic distribution of the family.
+
+Align file
+----------
+
+The ``align`` file is created by the rfmake program when executed with the ``-a`` option. The file includes an alignment of all the hits listed in the ``species`` file to the covariance model. It can be viewed using ``less``::
+
+    less -S align
+
+It is useful to review the bottom of the alignment as it contains the lowest scoring hits. Ask yourself if the alignment has too many gaps or very large insertions. Are there any sequences that could be excluded by raising the gathering cutoff that would decrease the number of gaps?
+
+⚠️ Do not edit the ``align`` file because it is overwritten every time ``rfmake.pl -a`` runs. Edit the SEED alignment instead.
+
+R-scape secondary structures
+----------------------------
+
+:ref:`glossary:R-scape` analyses an RNA multiple sequence alignment to check if the consensus secondary structure is supported by the covariation observed in the alignment. To run R-scape, enter the following commands::
+
+    mkdir rscape-seed
+    R-scape --cyk --outdir rscape-seed SEED
+    mkdir rscape-align
+    R-scape --cyk --outdir rscape-align align
+
+The results will appear in the ``rscape-seed`` and ``rscape-align`` folders, which can be copied to your computer for inspection. A good family will have multiple basepairs highlighted in green, which indicates covariation support. The ``--cyk`` option checks if there is an alternative secondary structure compatible with the alignment. Comparing the regular and the ``--cyk`` secondary structure diagrams may suggest a better structure than the current secondary structure consensus found in the seed alignment.
+
+.. figure:: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753348/bin/gkx1038fig5.jpg
+    :alt: R-scape visualisation of SAM riboswitch
+    :width: 600
+    :align: center
+
+    R-scape visualisation of SAM riboswitch
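
Reviewer note on ``choosing-gathering-threshold.rst``: the checks described for the ``species`` file (looking for a bit score jump, keeping the cutoff at or above the best reversed hit, and spotting SEED sequences under the cutoff) can be sketched in a few lines of Python. This is only a rough illustration. It assumes the bit scores and labels have already been extracted from the ``species`` file by hand or by other means, and the function names ``largest_drop``, ``minimum_threshold`` and ``seeds_below`` are hypothetical, not part of rfsearch or rfmake.

```python
# Illustrative helpers only: the names below are hypothetical and are not
# part of rfsearch, rfmake, or any other Rfam tool.

def largest_drop(scores):
    """Return (upper, lower, drop) for the biggest gap between neighbouring scores."""
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two bit scores")
    pairs = zip(ordered, ordered[1:])
    upper, lower = max(pairs, key=lambda p: p[0] - p[1])
    return upper, lower, upper - lower


def minimum_threshold(current, best_reversed):
    """Raise the cutoff if needed so that it is at least the top reversed score."""
    return max(current, best_reversed)


def seeds_below(hits, threshold):
    """Return SEED hits scoring under the threshold; hits are (score, label) pairs."""
    return [(score, label) for score, label in hits
            if label == "SEED" and score < threshold]


if __name__ == "__main__":
    # Bit scores taken from the worked example in the documentation.
    scores = [80.1, 79.4, 75.4, 70.1, 69.4, 41.1, 39.3, 39.2, 39.1]
    upper, lower, drop = largest_drop(scores)
    print(f"largest drop: {upper} -> {lower} ({drop:.1f} bits)")
    print("minimum threshold:", minimum_threshold(40, 45))
```

As the documentation stresses, a jump found this way is only a candidate cutoff; the taxonomic distribution, the alignment, and the R-scape diagrams still need to be reviewed by a curator.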
diff --git a/docs/source/citing-rfam.rst b/docs/source/citing-rfam.rst
@@ -12,7 +12,7 @@ Rfam references
 `Non‐coding RNA analysis using the Rfam database <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6754622/>`_
 	| I. Kalvari, E.P. Nawrocki, J. Argasinska, N. Quinones‐Olvera, R.D. Finn, A. Bateman and A.I. Petrov
 	| **Current Protocols in Bioinformatics** (2018) e51. doi: 10.1002/cpbi.51
-	
+
 `Rfam 12.0: updates to the RNA families database <http://nar.oxfordjournals.org/content/43/D1/D130>`_
 	| E.P. Nawrocki, S.W. Burge, A. Bateman, J. Daub, R.Y. Eberhardt, S.R. Eddy, E.W. Floden, P.P. Gardner, T.A. Jones, J. Tate and R.D. Finn
 	| **Nucleic Acids Research** (2014) doi: 10.1093/nar/gku1063

diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -34,6 +34,8 @@
     'sphinx.ext.autosectionlabel',
 ]
 
+autosectionlabel_prefix_document = True
+
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['_templates']
 

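Reviewer note on ``conf.py``: setting ``autosectionlabel_prefix_document = True`` makes ``sphinx.ext.autosectionlabel`` prefix each section label with the name of its document followed by a colon, which is why the ``:ref:`` targets throughout this change gain prefixes such as ``database:`` and ``glossary:``. The sketch below only mimics that naming scheme for illustration; ``prefixed_label`` and ``as_ref`` are hypothetical helpers, not Sphinx functions.

```python
# Illustrative only: this mimics the naming scheme, not Sphinx's own code.

def prefixed_label(docname, section_title):
    """Build the label used when autosectionlabel_prefix_document is True."""
    return f"{docname}:{section_title}"


def as_ref(docname, section_title):
    """Format the reStructuredText role that points at the section."""
    return f":ref:`{prefixed_label(docname, section_title)}`"


if __name__ == "__main__":
    print(as_ref("database", "Public MySQL Database"))
    print(as_ref("glossary", "Gathering cutoff"))
```

The prefix keeps identically named sections in different documents from colliding, at the cost of having to update every existing unprefixed reference, as this pull request does.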
diff --git a/docs/source/database.rst b/docs/source/database.rst
@@ -1,4 +1,3 @@
-.. _ftp_help_link:
 Public MySQL Database
 ======================