Merge pull request #28 from Rfam/rfam-cloud
Add Rfam cloud documentation
AntonPetrov committed Oct 24, 2019
2 parents cc72477 + fcfe254 commit e4b9711
Showing 21 changed files with 511 additions and 115 deletions.
19 changes: 7 additions & 12 deletions docs/source/about-rfam.rst
@@ -1,23 +1,18 @@
About Rfam
==========

The Rfam database is a collection of RNA sequence families of
The `Rfam <http://rfam.org>`_ database is a collection of RNA sequence families of
structural RNAs including non-coding RNA genes as well as
cis-regulatory elements.
cis-regulatory elements. Each family is represented by a multiple
sequence alignment and a covariance model (CM).

Each family is represented by multiple
sequence alignments and covariance models (CMs).
You can use the `Rfam website <http://rfam.org>`_
to obtain information about an individual family, or browse
our families and genome annotations. Alternatively you can download
all of the Rfam data from our `FTP site <ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT>`_.
the families and genome annotations. Alternatively you can download
all of the Rfam data from the `FTP site <ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT>`_.
Find out more about the project by exploring the latest :ref:`citing-rfam:Rfam references`.

.. HINT::

Take a `quick tour <https://www.ebi.ac.uk/training/online/course/rfam-quick-tour>`_
of Rfam to find out more about the project.

For each Rfam family we provide:
For each family Rfam provides:

**Summary page**
Textual background information on the RNA family, which we obtain from
2 changes: 1 addition & 1 deletion docs/source/api.rst
@@ -8,7 +8,7 @@ Most data in Rfam can be accessed programmatically using a RESTful API
allowing for integration with other resources.

.. HINT::
You can also access the data using a :ref:`Public MySQL Database`
You can also access the data using a :ref:`database:Public MySQL Database`
that contains the latest Rfam release.

Data access
57 changes: 22 additions & 35 deletions docs/source/building-families.rst
@@ -1,24 +1,15 @@
How Rfam families are built
===========================

*rfamseq* database
------------------

Starting with Rfam 13.0, the underlying nucleotide sequence database from which
the RNA families are built (known as *rfamseq*) is based on a representative collection
of complete genomes maintained by `UniProt <http://www.uniprot.org/proteomes>`_.

*rfamseq* is usually updated with each major Rfam release, e.g., 12.0 or 13.0.
You can find out the information about *rfamseq* currently in use in the
`README file <ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/README>`_ in the Rfam FTP archive.
.. Note:: 🆕 If you are interested in building new Rfam families or updating the existing ones, take a look at the :ref:`rfam-cloud:Rfam cloud pipeline`.

Seed alignments and secondary structure annotation
--------------------------------------------------

Rfam **seed alignments** are small, curated sets of representative sequences
An Rfam :ref:`glossary:Seed alignment` is a small, curated set of representative sequences
for each family, as opposed to an alignment of all known members. The
seed alignment also has a **secondary structure** annotation, which
represents the conserved secondary structure for these sequences.
represents the **conserved** secondary structure for these sequences.

The ideal basis for a new family is an RNA element that:

@@ -31,21 +22,21 @@ must first obtain at least one **experimentally validated example** from
the published literature. If any other homologues are identified in the
literature, we will add these to the seed. Alternatively, if these are
not available, we will try to identify other members either by
similarity searching (using Infernal) or manual curation.
similarity searching (using :ref:`glossary:Infernal`) or manual curation.

Where possible we will use a multiple sequence alignment and
secondary structure annotation provided in the literature. If this is
secondary structure annotation provided in the **literature**. If this is
the case, we will cite the source of both the alignment and the
secondary structure. You should note that the structure annotations
obtained from the literature may be experimentally validated or they
may be RNA folding predictions (commonly `Mfold <http://unafold.rna.albany.edu/?q=mfold>`_).
may be RNA folding predictions (commonly :ref:`glossary:Mfold`).
Unfortunately, we do not discriminate between these two cases when we
cite the PubMed Identifier (PMID) and you will need to refer to the
original publications to clarify.

Alternatively, where this information is not available from the
literature, we will generate an alignment and secondary structure
prediction using various software, such as `WAR <http://genome.ku.dk/resources/war>`_. This
prediction using various software, such as :ref:`glossary:WAR` or :ref:`glossary:RNAalifold`. This
software allows us to cherry pick the best alignment and secondary
structure prediction. Historically, the methods used to
make these alignments and folding predictions have varied.
@@ -55,18 +46,11 @@ author on the list will be the most recent editor of the secondary
structure. You can
find the method we have used for the seed alignment or the secondary
structure annotation in the **SE** and **SS**
lines of the `Stockholm format <https://en.wikipedia.org/wiki/Stockholm_format>`_
or in the curation information pages.

Covariance Models
-----------------
lines of the :ref:`glossary:Stockholm format` or in the curation information pages.

From the seed alignment, we use the `Infernal software <http://eddylab.org/infernal/>`_ to build a
probabilistic model (covariance model or CM) for this family. Useful
references on stochastic context-free grammars and covariance models can be
found in the :ref:`Citing Rfam`
section. This model is then used to search the *rfamseq*
database for other possible homologs.
:ref:`glossary:Covariance model (CM)` for this family.
This model is then used to search the :ref:`glossary:rfamseq` database for other possible homologs.

Expanding the seed (iteration)
------------------------------
@@ -80,6 +64,13 @@ continue to iterate the seed until we have good resolution
between real and false hits and cannot improve the seed membership
further.

.. figure:: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6754622/bin/nihms-1047076-f0022.jpg
   :alt: Building an RNA family using Infernal
   :width: 600
   :align: center

   Building an RNA family using Infernal

Important points to remember about seed alignments
------------------------------------------------------

@@ -98,24 +89,20 @@ Important points to remember about seed alignments
generate predicted structures
* Each seed sequence will be a significant match to the corresponding
covariance model. A significant score is generally greater than 20
bits
bits.

Rfam full alignments
--------------------

The Rfam full alignments contain all of the sequences in *rfamseq* that
The Rfam :ref:`glossary:full alignment` contains all of the sequences in *rfamseq* that
we can identify as members of the family. The alignment is generated by
searching the covariance model for the family against the *rfamseq*
database. Matches that score above a :ref:`gathering cutoff` are aligned to
database. Matches that score above a :ref:`glossary:Gathering cutoff` are aligned to
the CM to produce the full alignment. All sequences in the seed will
also be present in the full alignment.

As of Rfam 12.0, we no longer automatically generate full alignments for
each Rfam family. You may download the Rfam CM and generate your own alignments
using Infernal.

Family annotation
-----------------
Wikipedia annotations
---------------------

In order to provide some background and functional information about
a family, we link to a `Wikipedia <http://www.wikipedia.org/>`_
138 changes: 138 additions & 0 deletions docs/source/choosing-gathering-threshold.rst
@@ -0,0 +1,138 @@
Choosing gathering threshold
============================

The :ref:`glossary:Gathering cutoff` (also known as gathering threshold) corresponds to the bit score of the lowest-scoring sequence that is considered part of the family. When creating families using the :ref:`rfam-cloud:Rfam cloud pipeline`, it is important to choose an accurate cutoff value by **reviewing the following 3 files**: ``species``, ``taxinfo``, and ``align``, as well as the :ref:`glossary:R-scape` secondary structure diagrams.

.. contents:: Table of contents
   :local:

Species file
------------

The rfsearch program produces a file called ``species`` that contains a list of all sequences identified by an Infernal search in the :ref:`glossary:Rfamseq` database using a covariance model based on the :ref:`glossary:Seed alignment`.

1. View the ``species`` file using ``less``::

    less -S species

The ``-S`` option displays long lines without wrapping them.

.. hint::
   For help with using ``less``, check out `10 Tips for Effective Navigation <https://www.thegeekstuff.com/2010/02/unix-less-command-10-tips-for-effective-navigation>`_ or type ``man less``.

Each line corresponds to a sequence matching the covariance model and includes the bit score, sequence label, species, taxonomic lineage, and other information.

.. figure:: images/species-file-example.png
   :alt: Example species file
   :width: 600
   :align: center

   Example species file

2. Find the current gathering cutoff by searching for the word ``THRESHOLD`` using ``less``::

    / THRESHOLD

.. figure:: images/species-file-threshold-example.png
   :alt: Example species file
   :width: 600
   :align: center

   Example species file showing current gathering threshold and the best reversed hit

Consider the following questions:

- **How many sequences are above the gathering threshold?** If there are no or very few sequences, then the threshold may need to be lowered.

- **Do you notice any jumps in bit score values?** For example, consider the following list of bit scores:

- 80.1
- 79.4
- 75.4
- 70.1
- 69.4
- 41.1
- 39.3
- #### CURRENT THRESHOLD ####
- 39.2
- 39.1

Notice the sudden drop in bit scores between 69.4 and 41.1 bits. You should carefully examine the sequences immediately before and after the drop and decide whether they belong in the same family. A bit score jump can indicate where to put the gathering cutoff (in this example, it could be set to 69.0, as it is customary to round the score to the nearest whole bit). Please note that a bit score jump alone does not provide enough evidence for setting the gathering cutoff for a family and should be used in combination with other information as explained below.

- **Does the taxonomic distribution of the hits match the expectation?** For example, if you are building an RNA family that is described in the literature as bacterial, it is desirable to set the threshold in a way that excludes non-bacterial hits. Each case should be reviewed individually, as it is possible that the unexpected hits could represent contamination, horizontal gene transfer, or a biologically interesting case.

- **Are any SEED sequences below the gathering threshold?**

The gathering threshold should include all sequences in the SEED file. It is expected that the covariance model will identify all sequences from the seed alignment that the covariance model is based on.

The **sequence label** (third column) takes one of 3 possible values:

.. list-table::

   * - ``SEED``
     - Sequence from the Seed alignment.
   * - ``FULL``
     - Sequence from Rfamseq that was identified using the covariance model.
   * - ``NOT``
     - Any sequence scoring below the gathering threshold.

If a seed sequence has a very low bit score (for example, lower than the REVERSED score), consider removing it from the seed alignment.

3. Find the top scoring random hit by searching for the word ``REVERSED`` using ``less``::

    / REVERSED

In order to exclude false positives, the rfsearch command scans a large collection of sequences called the **Reversed database**. It consists of 10% of the Rfamseq sequences that have been reversed to preserve the sequence composition but decrease sequence similarity to real sequences (except for rare cases of `palindromes <https://en.wikipedia.org/wiki/Palindrome>`_).

⚠️ The reversed hits are **random sequences** and **should not be included in the family**.

For example, if the current threshold is 40 bits but the top scoring reversed hit is at 45 bits, it means that the gathering threshold needs to be raised to at least 45 bits.
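
The bit score jump discussed in step 2 can also be located programmatically. The following is a minimal sketch and not part of the Rfam pipeline: the function name ``largest_score_drop`` is hypothetical, and rounding the candidate cutoff *down* to a whole bit is an assumption made here so that the lowest-scoring sequence above the drop stays in the family. It uses the example bit scores from step 2.

.. code-block:: python

    import math


    def largest_score_drop(scores):
        """Return (upper, lower, candidate) for the widest gap between
        consecutive bit scores; candidate is `upper` rounded down."""
        ordered = sorted(scores, reverse=True)
        if len(ordered) < 2:
            raise ValueError("need at least two bit scores")
        # Pair each score with the next lower one and keep the widest gap.
        upper, lower = max(zip(ordered, ordered[1:]),
                           key=lambda pair: pair[0] - pair[1])
        # Assumption: round down so that `upper` itself remains included.
        return upper, lower, float(math.floor(upper))


    example_scores = [80.1, 79.4, 75.4, 70.1, 69.4, 41.1, 39.3, 39.2, 39.1]
    upper, lower, candidate = largest_score_drop(example_scores)
    print(f"Largest drop: {upper} -> {lower}; candidate cutoff: {candidate}")
    # Largest drop: 69.4 -> 41.1; candidate cutoff: 69.0

The result is only a starting point. As noted in step 2, a jump alone is not sufficient evidence, and the sequences on both sides of the drop still need to be reviewed by hand.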

.. hint::
   Consider also reviewing the ``outlist`` file which is similar to ``species`` but contains slightly different information, such as sequence descriptions as well as the details about whether the hits were truncated or appear on the reverse strand.
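
The SEED and REVERSED checks described in this section can be combined into a small review helper. This is a hedged sketch: it assumes the hits have already been extracted from the ``species`` file into ``(bit_score, label)`` pairs, because the exact column layout is shown here only as a screenshot. The function name ``review_threshold`` and the example values are hypothetical.

.. code-block:: python

    def review_threshold(hits, threshold, best_reversed_score):
        """Return a list of warnings for a proposed gathering threshold.

        hits: iterable of (bit_score, label) pairs, where label is one of
        SEED, FULL or NOT (assumed to be parsed from the species file).
        """
        hits = list(hits)
        warnings = []
        seed_scores = [score for score, label in hits if label == "SEED"]

        below_threshold = [s for s in seed_scores if s < threshold]
        if below_threshold:
            warnings.append(
                f"{len(below_threshold)} SEED sequence(s) score below the threshold"
            )
        if any(s < best_reversed_score for s in seed_scores):
            warnings.append("a SEED sequence scores below the best reversed hit")
        if threshold < best_reversed_score:
            warnings.append("threshold is below the best reversed hit")
        return warnings


    # Hypothetical hits; threshold and reversed score follow the example above.
    example_hits = [(80.1, "SEED"), (41.1, "FULL"), (38.0, "SEED")]
    for warning in review_threshold(example_hits, 40.0, 45.0):
        print("WARNING:", warning)

An empty list means that none of these three checks failed. It does not replace inspecting the taxonomic distribution and the alignment.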

Taxinfo file
-------------

The ``taxinfo`` file is created by the rfmake program and includes the taxonomic distribution of the hits listed in the ``species`` file. It can be viewed using ``less``::

    less -S taxinfo

.. figure:: images/taxinfo-example.png
   :alt: Example taxinfo file
   :width: 600
   :align: center

   Example taxinfo file

Reviewing the file allows one to better understand the taxonomic distribution of the family.
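
To check quickly whether the hits above a proposed threshold match the expected taxonomy, the lineages can be tallied by their top-level group. The sketch below does not parse ``taxinfo`` itself. It assumes the taxonomic lineages of the hits are available as semicolon-separated strings; the function name ``top_level_distribution`` and the example lineages are hypothetical.

.. code-block:: python

    from collections import Counter


    def top_level_distribution(lineages):
        """Count hits per top-level taxon (the first field of each lineage)."""
        counts = Counter()
        for lineage in lineages:
            fields = [f.strip() for f in lineage.split(";") if f.strip()]
            counts[fields[0] if fields else "unclassified"] += 1
        return counts


    # Hypothetical lineages for three hits above the threshold.
    example_lineages = [
        "Bacteria; Proteobacteria; Gammaproteobacteria",
        "Bacteria; Firmicutes; Bacilli",
        "Eukaryota; Metazoa; Chordata",
    ]
    for taxon, count in top_level_distribution(example_lineages).most_common():
        print(taxon, count)

For a family described as bacterial, a non-bacterial entry in the tally flags hits that deserve individual review.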

Align file
----------

The ``align`` file is created by the rfmake program when executed with the ``-a`` option. The file includes an alignment of all the hits listed in the ``species`` file to the covariance model. It can be viewed using ``less``::

    less -S align

It is useful to review the bottom of the alignment as it contains the lowest scoring hits. Ask yourself if the alignment has too many gaps or very large insertions. Are there any sequences that could be excluded by raising the gathering cutoff that would decrease the number of gaps?

⚠️ Do not edit the ``align`` file because it is overwritten every time ``rfmake.pl -a`` runs. Edit the SEED alignment instead.

R-scape secondary structures
----------------------------

:ref:`glossary:R-scape` analyses an RNA multiple sequence alignment to check whether the consensus secondary structure is supported by the covariation observed in the alignment. To run R-scape, enter the following commands::

    mkdir rscape-seed
    R-scape --cyk --outdir rscape-seed SEED
    mkdir rscape-align
    R-scape --cyk --outdir rscape-align align

The results will appear in the ``rscape-seed`` and ``rscape-align`` folders, which can be copied to your computer for inspection. A good family will have multiple basepairs highlighted in green, which indicates covariation support. The ``--cyk`` option checks if there is an alternative secondary structure compatible with the alignment. Comparing the regular and the ``--cyk`` secondary structure diagrams may suggest a better structure than the current secondary structure consensus found in the seed alignment.

.. figure:: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753348/bin/gkx1038fig5.jpg
   :alt: R-scape visualisation of SAM riboswitch
   :width: 600
   :align: center

   R-scape visualisation of SAM riboswitch
2 changes: 1 addition & 1 deletion docs/source/citing-rfam.rst
@@ -12,7 +12,7 @@ Rfam references
`Non‐coding RNA analysis using the Rfam database <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6754622/>`_
| I. Kalvari, E.P. Nawrocki, J. Argasinska, N. Quinones‐Olvera, R.D. Finn, A. Bateman and A.I. Petrov
| **Current Protocols in Bioinformatics** (2018) e51. doi: 10.1002/cpbi.51
`Rfam 12.0: updates to the RNA families database <http://nar.oxfordjournals.org/content/43/D1/D130>`_
| E.P. Nawrocki, S.W. Burge, A. Bateman, J. Daub, R.Y. Eberhardt, S.R. Eddy, E.W. Floden, P.P. Gardner, T.A. Jones, J. Tate and R.D. Finn
| **Nucleic Acids Research** (2014) doi: 10.1093/nar/gku1063
2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -34,6 +34,8 @@
'sphinx.ext.autosectionlabel',
]

autosectionlabel_prefix_document = True

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

1 change: 0 additions & 1 deletion docs/source/database.rst
@@ -1,4 +1,3 @@
.. _ftp_help_link:
Public MySQL Database
======================

