Skip to content

Latest commit

 

History

History
310 lines (192 loc) · 14.2 KB

rfam-cloud.rst

File metadata and controls

310 lines (192 loc) · 14.2 KB

Rfam Cloud pipeline

The Rfam Cloud environment provides access to the command-line interface for curating Rfam families. It uses the same software and the sequence database as the ones used by the Rfam team. The pipeline allows one to create a new RNA family or update an existing Rfam entry.

The main Rfam family building pipeline is located on the EMBL-EBI computational cluster, so that only the EBI account holders can access it. In order to enable more users to contribute to Rfam, a new version of the pipeline was developed using a cloud infrastructure so that the EBI accounts are not needed. All families built using the cloud pipeline are reviewed by the :ref:`rfam-team:Rfam team` before the families are added to Rfam.

  1. A computer with internet access (Mac, Linux, or PC)
  2. A command line environment supporting ssh (for example, bash)
  3. An Rfam Cloud account

Please :ref:`contact-us:Contact us` to request access to the Rfam family building pipeline. If you intend to use the pipeline for teaching purposes, please let us know in advance to ensure that the pipeline can support the workload.

Use the username and password provided by the Rfam team to ssh to Rfam Cloud:

ssh <username>@cloud.rfam.org

To get access to an interactive session and start using the pipeline run the following command:

rfcloud --start

You should see a command line prompt:

Rfam Cloud command line prompt

Rfam Cloud command line prompt

To verify that the system works, try calling the rfsearch and rfmake scripts (you should see help messages explaining how to use the scripts):

rfsearch.pl -h
rfmake.pl -h

Tip

https://cdn.brandfolder.io/5H442O3W/as/pl546j-7le8zk-5guop3/Slack_RGB.png

Join Rfam Cloud on Slack to get help with the pipeline from the Rfam team

✏️ Interested in editing an Rfam family? Skip to :ref:`rfam-cloud:Updating an existing Rfam family`.

📂 Create a new folder, for example rfam_test and navigate to it:

mkdir rfam_test
cd rfam_test

Each family has a :ref:`glossary:seed alignment` file called SEED that contains a multiple sequence alignment of the confirmed instances of a family. To get started, you will need a :ref:`glossary:Stockholm format` file with at least 1 RNA sequence and a consensus secondary structure, for example see the tRNA seed alignment.

If you have a FASTA file called file.fasta with a single RNA sequence, convert it to Stockholm format and predict a consensus secondary structure using RNAfold (the -r option):

predict_ss.pl -infile <file.fasta> -outfile SEED -r

Alternatively, create a SEED file using the vi or nano text editors and paste the file contents from your local computer. See :ref:`rfam-cloud:Copying files to and from Rfam Cloud` for instructions about moving files to and from Rfam Cloud.

It is recommended that the sequences are named in the accession:start-end format where accession is an ENA, GenBank, or RNAcentral identifier, and start-end are the coordinates of the RNA in the accession (for example, AB003409.1/96-167). See the tRNA seed alignment for more examples. ⚠️ The sequence name cannot contain the parenthesis characters (( and )).

Once you have a Stockholm file called SEED in your working directory, proceed to the next step.

Build and calibrate a :ref:`glossary:Covariance model (CM)` based on your seed alignment and search for similar sequences in the :ref:`glossary:rfamseq` database:

rfsearch.pl -nodesc -relax -t 25
Option Meaning
-nodesc creates a required file called DESC that contains the description of the family. You only need to use the -nodesc flag the first time you run rfsearch, after that you will get an error if you use -nodesc because a DESC file already exists.
-relax allow sequences not found in the :ref:`glossary:rfamseq` database to be included in the seed alignment (recommended)
-t 30 :ref:`glossary:Gathering cutoff` in bits. Usually 30 bits is a good starting point as most families are expected to have a threshold higher than 30.

⚠️ This step can take a long time (up to 10 minutes or longer) depending on the size of the alignment and the availability of computational resources.

The output files (:ref:`choosing-gathering-threshold:Species file` and :ref:`choosing-gathering-threshold:Outlist file`) can be used to determine the gathering threshold for this family (the bit score of the last true positive hit).

Note

For detailed instructions on how to select the threshold, see :ref:`choosing-gathering-threshold:Choosing gathering threshold`.

The :ref:`glossary:seed alignment` needs to represent the taxonomic diversity and the structural features observed in different instances of the family. A seed alignment needs to have at least 2 sequences but a larger seed alignment is preferred.

Find an accession in the outlist file that you would like to add to the SEED (for example, AB480043.1):

grep AB480043.1 outlist >> addme
rfseed.pl addme

To remove sequences from SEED (if added in error, for example), create a file with a list of accessions you want to remove using grep as described above and call it removeme. Make sure the accession is exactly the same as in the SEED file, for example NW_002196667.1/1438869-1438941. Then run the following command:

rfseed.pl -d -n <removeme>

Consider manually editing the alignment on your local computer using RALEE or belvu and re-uploading it as explained in Step 1.

🔄 Steps 3 to 6 should be repeated until the seed alignment can no longer be improved:

rfsearch.pl -t <new_cutoff> -relax -ignoresm

The -ignoresm option overrides the threshold set at the previous iteration and saved in the DESC file.

This process is known as iteration (see :ref:`building-families:Expanding the seed (iteration)` for more information).

Once the cutoff has been chosen, all the required family files can be generated like this:

rfmake.pl -t <gathering_cutoff> -a

The -a option creates an align file with an alignment of all the sequences above the gathering threshold. For more information about setting the -t parameter, see :ref:`choosing-gathering-threshold:Choosing gathering threshold`.

After running rfmake you should:

Any unwanted sequences can be excluded by rerunning rfmake with a higher threshold -t.

Each family is described using in a DESC file (see the tRNA DESC file as an example). The following fields are required:

ID:

a unique ID, such as tRNA or skipping-rope. No spaces are allowed.

DE:
a short description of the family.
Example: DE GlmZ RNA activator of glmS mRNA
⚠️ Maximum 75 characters.
AU:

Author name with an ORCID id. Multiple AU lines can be used. Example: AU Eddy SR; 0000-0001-6676-4706

SE:

Seed alignment source. Example: SE Published; PMID:21994249;

SS:

Secondary structure source. Examples:

  • SS Published; PMID:28977401;
  • SS Predicted; mfold;
TP:

One of Rfam RNA types. Example: TP Gene; sRNA;

DR:

A reference to a Gene Ontology or Sequence Ontology term. Multiple DR lines can be used. Example:

  • DR SO; 0000253; tRNA;
  • DR GO; 0030533; triplet codon-amino acid adaptor activity;

You may find the QuickGO website useful for finding GO terms. A link to a website can also be included, for example: DR URL; http://telomerase.asu.edu/;

CC:

A free text comment describing what is known about the RNA (function, taxonomic distribution, experimental validation etc). ⚠️ Maximum 80 characters per line, but multiple CC lines can be used.

WK:

A Wikipedia link (you should create a new Wikipedia article or link to an existing one). Example: WK Transfer_RNA

📚 To add literature references, use the following command that automatically imports information from PubMed:

add_ref.pl <pubmed_id>

⚠️ The GA, TC, NC, BM, CV, SM lines are added automatically, please do not change them manually. The RN, RM, RT, RA, and RL lines are added by the add_ref.pl script. The AC field is assigned once the family is stored in the official Rfam database.

The rqc-all script performs multiple quality controls on the family. It checks the file formats, the accessions, and the DESC file:

cd .. && rqc-all.pl rfam_test

Download your SEED and DESC files to your local machine and send the files to the Rfam team for review by email or Slack. 🎉🎉🎉

See :ref:`rfam-cloud:Copying files to and from Rfam Cloud` for instructions about moving files to and from Rfam Cloud.

!DANGER!

We encourage you to always keep a local copy of the important data!

The only difference between creating a new family and updating an existing one is that the SEED and DESC files are retrieved from Rfam:

rfco.pl <RF0XXXX>
cd <RF0XXXX>

After that, follow the family building instructions: :ref:`rfam-cloud:3. Find similar sequences using rfsearch`.

The Rfam Cloud consists of a login node that handles the account login and worker pods which control the Rfam family building pipeline. When you run ssh <username>@cloud.rfam.org you are connected directly to your worker pod.

Rfam Cloud infrastructure

The login node and the worker pods currently have different filesystems which means that if you are on the worker pod you cannot see the files on the login node and vice versa. You can move files to and from login node using scp or sftp but then you need to use kubectl cp to make the files available on the worker pods.

🛠️ Work on unifying the two filesystems is underway which should make moving files to and from Rfam more user-friendly.

🖥️ On your local machine:

scp SEED <username>@cloud.rfam.org:/home/<username>

This copies a file SEED to your login node. You can also use an SFTP client for this task (for example, CyberDuck on Mac and Windows).

⚙️ On worker pod:

ssh <username>@cloud.rfam.org
kubectl get pod --selector=user=<username>,tier=frontend

Record the pod_id that looks like rfam-login-pod-<username>-6b9f46fc76-67fhn, then exit to the login node:

exit

🗝️ On login node:

kubectl cp SEED <pod_id>:/workdir

Then get back to the worker pod:

kubectl exec -it <pod_id> bash

The file should appear in your workdir folder. You can specify other paths in the kubectl cp command to move the files to any subfolder.

⚙️ On worker pod:

ssh <username>@cloud.rfam.org
kubectl get pod --selector=user=<username>,tier=frontend

Record the pod_id that looks like rfam-login-pod-<username>-6b9f46fc76-67fhn, then exit to the login node:

exit

🗝️ On login node:

kubectl cp <pod_id>:/workdir/SEED .

🖥️ On your local machine:

scp <username>@cloud.rfam.org:/home/<username>/SEED .
  • Filter out redundant sequences. For example, to remove redundancy from a file called align using 95% identify as a cutoff run:

    esl-weight -f --idf 0.95 align
    
  • Iteratively re-align seed sequences to the CM:

    cmbuild --refine SEED.new CM.new SEED
    

:ref:`contact-us:Contact us` by email, raise an issue on GitHub, or get in touch on Slack.