Skip to content

Commit

Permalink
Update family building text
Browse files Browse the repository at this point in the history
  • Loading branch information
AntonPetrov committed Oct 28, 2016
1 parent 765520b commit 32ad599
Showing 1 changed file with 30 additions and 27 deletions.
57 changes: 30 additions & 27 deletions docs/source/building-families.rst
Original file line number Diff line number Diff line change
@@ -1,33 +1,38 @@
Building Rfam families
======================

Rfamseq database
*rfamseq* database
----------------

The underlying nucleotide sequence database from which we build our
families (known as rfamseq) is derived from the `EMBL nucleotide database <http://www.ebi.ac.uk/embl/>`_.
We include EMBL Standard (STD) and Whole Genome Shotgun (WGS) data
families (known as *rfamseq*) is derived from the `European Nucleotide Archive <http://www.ebi.ac.uk/ena/>`_.

We include Standard (STD) and Whole Genome Shotgun (WGS) data
classes. This includes all the environmental sample sequences (ENV)
but we currently exclude the patented (PAT) and synthetic (SYN)
divisions. You should note that rfamseq does NOT include Expressed
divisions. You should note that *rfamseq* does NOT include Expressed
Sequence Tag (EST) or Genome Survey Sequence (GSS) data classes.

Rfamseq is usually updated with each major Rfam release, e.g., 8.0, 9.0.
You can find out the the EMBL version currently in use in the
release README file on our `FTP site <ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT">`_.
*rfamseq* is usually updated with each major Rfam release, e.g., 8.0, 9.0.
You can find out the ENA release currently in use in the
`README file <ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/README>`_ on our FTP site.

Seed alignments and secondary structure annotation
--------------------------------------------------

Our seed alignments are small, curated sets of representative sequences
Our **seed alignments** are small, curated sets of representative sequences
for each family, as opposed to an alignment of all known members. The
seed alignment also has as a secondary structure annotation, which
seed alignment also has as a **secondary structure** annotation, which
represents the conserved secondary structure for these sequences.

The ideal basis for a new family is an RNA element that has some
known functional classification, is evolutionarily conserved, and has
evidence for a secondary structure. In order to build a new family, we
must first obtain at least one experimentally validated example from
The ideal basis for a new family is an RNA element that:

* has some known functional classification
* is evolutionarily conserved
* has evidence for a secondary structure

In order to build a new family, we
must first obtain at least one **experimentally validated example** from
the published literature. If any other homologues are identified in the
literature, we will add these to the seed. Alternatively, if these are
not available, we will try to identify others members either by
Expand All @@ -38,7 +43,7 @@ secondary structure annotation provided in the literature. If this is
the case, we will cite the source of both the alignment and the
secondary structure. You should note that the structure annotations
obtained from the literature may be experimentally validated or they
may be RNA folding predictions (commonly `MFOLD <http://mfold.bioinfo.rpi.edu/cgi-bin/rna-form1-2.3.cgi>`_).
may be RNA folding predictions (commonly `Mfold <http://unafold.rna.albany.edu/?q=mfold>`_).
Unfortunately, we do not discriminate between these two cases when we
site the PubMed Identifier (PMID) and you will need to refer to the
original publications to clarify.
Expand All @@ -55,7 +60,8 @@ author on the list will be the most recent editor of the secondary
structure. You can
find the method we have used for the seed alignment or the secondary
structure annotation in the **SE** and **SS**
lines of the Stockholm format or in the curation information pages.
lines of the `Stockholm format <https://en.wikipedia.org/wiki/Stockholm_format>`_
or in the curation information pages.

Covariance Models
-----------------
Expand All @@ -64,10 +70,10 @@ From the seed alignment, we use the `Infernal software <http://eddylab.org/infer
probabilistic model (covariance model or CM) for this family. Useful
references on stochastic free grammars and covariance models can be
found in the `citing Rfam <TODO>`_
section. This model is then used to search the rfamseq
section. This model is then used to search the *rfamseq*
database for other possible homologs.

Searching a nucleotide database as larger as rfamseq with a covariance
Searching a nucleotide database as larger as *rfamseq* with a covariance
model is hugely computationally expensive. In order to do this in
reasonable time, we use sequence based filters to prune the search
space prior to applying the CMs. Please refer to the recent Rfam
Expand All @@ -76,10 +82,10 @@ publication for more details on how we implement this.
Expanding the seed (iteration)
------------------------------

If the CM search of rfamseq identifies any homologs that we believe
If the CM search of *rfamseq* identifies any homologs that we believe
would improve the seed, we use the Infernal software (cmalign) to
add these sequences to the seed alignment. From the new seed, the CM
is re-built and re-searched against rfamseq. We refer to this process
is re-built and re-searched against *rfamseq*. We refer to this process
of expanding the seed using Infernal searching as "iteration". We
continue to iterate the seed until we have good resolution
between real and false hits and cannot improve the seed membership
Expand All @@ -88,9 +94,9 @@ further.
Important points to remember about our seed alignments
------------------------------------------------------

* We can only build families using the sequences in rfamseq
* We can only build families using the sequences in *rfamseq*
* We can only build a family where we can identify more than one
sequence in rfamseq
sequence in *rfamseq*
* Sequences in the seed cannot be manually altered in any way,
e.g. no manual excision of introns, no editing of sequencing errors,
no marking up modified nucleotides etc
Expand All @@ -108,20 +114,17 @@ Important points to remember about our seed alignments
Rfam full alignments
--------------------

The Rfam full alignments contain all of the sequences in rfamseq that
The Rfam full alignments contain all of the sequences in *rfamseq* that
we can identify as members of the family. The alignment is generated by
searching the covariance model for the family against the rfamseq
searching the covariance model for the family against the *rfamseq*
database. Matches that score above a curated threshold are aligned to
the CM to produce the full alignment. All sequences in the seed will
also be present in the full alignment. You should read the
`curation information <TODO>`_ pages for details of bit scores and gathering
thresholds.

As of Rfam 12.0, we no longer automatically generate full alignments for
each Rfam family. You can use our Sunbursts (under the Species tab) to
generate alignments for the sequences of your choice for families with full
alignments of less than 1000 sequences, or you may download the Rfam CM and
generate your own alignments.
each Rfam family. You may download the Rfam CM and generate your own alignments.

Family annotation
-----------------
Expand Down

0 comments on commit 32ad599

Please sign in to comment.