Based on Rfam 14.8 (May 2022). See releases for previous versions.
This repository contains the code and data for analysing the taxonomic distribution of the Rfam families. The goal is to identify domain-specific subsets of Rfam covariance models for annotating bacterial, eukaryotic, and other genomes with the Infernal software.
📂 The results are organised in several files in the domains folder. Each file contains seven columns:
- Family = Rfam accession (e.g. RF00001)
- Domain = Taxonomic domain where the family is found (❕ this is the most important column)
- Seed domains = All taxonomic domains from the seed alignment
- Full region domains = All taxonomic domains from full region hits
- Rfam ID = Rfam identifier (e.g. 5S_rRNA)
- Description = Family description
- RNA type = One of the Rfam RNA types
Domain can be:
- a single domain (for example, Bacteria or Eukaryota) if the majority of hits (>=90%) are from the same domain in both the seed alignment and the full region hits;
- <seed domain>/<full region domain> if the seed and full region domains are not the same, in which case both are listed. For example, Viruses/Eukaryota means that the seed alignment contains mostly Viruses and the full region hits contain mostly Eukaryota;
- Mixed if there is no single domain where the family occurs. For example, 5S rRNA RF00001 is expected to be found in Bacteria, Archaea, and Eukaryota;
- <seed region domain>/Mixed or Mixed/<full region domain> if only one of the two sources has a single dominant domain. For example, Bacterial SSU RF00177 has only Bacteria in the seed alignment, but the full region hits also contain Eukaryota because the mitochondrial and plastid SSU is similar to the bacterial SSU and is expected to match the bacterial model.
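The rules above can be sketched in a few lines of Python. This is only an illustration of the >=90% rule as described in this README, not the code used by rfam-taxonomy.py; the function names and the input format (a mapping from domain name to percentage of hits) are assumptions made for the example.

```python
from typing import Dict

THRESHOLD = 90.0  # percent; the ">=90%" cut-off quoted above


def dominant_domain(percentages: Dict[str, float], threshold: float = THRESHOLD) -> str:
    """Return the domain holding at least `threshold` percent of hits, else 'Mixed'."""
    for domain, percent in percentages.items():
        if percent >= threshold:
            return domain
    return "Mixed"


def domain_label(seed: Dict[str, float], full: Dict[str, float]) -> str:
    """Combine the seed and full region calls into one Domain value."""
    seed_call = dominant_domain(seed)
    full_call = dominant_domain(full)
    if seed_call == full_call:
        # Covers both "Bacteria" and plain "Mixed".
        return seed_call
    # Covers "Viruses/Eukaryota", "Bacteria/Mixed" and "Mixed/Eukaryota".
    return f"{seed_call}/{full_call}"


if __name__ == "__main__":
    # Made-up percentages, for illustration only.
    print(domain_label({"Bacteria": 100.0}, {"Bacteria": 97.5, "Eukaryota": 2.5}))
    print(domain_label({"Viruses": 95.0, "Eukaryota": 5.0}, {"Eukaryota": 92.0, "Viruses": 8.0}))
    print(domain_label({"Bacteria": 100.0}, {"Bacteria": 60.0, "Eukaryota": 40.0}))
```

With these made-up inputs the three calls print Bacteria, Viruses/Eukaryota, and Bacteria/Mixed, matching the cases in the list.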
✅ View summary with the number of families observed in each domain.
Retrieving the data
The latest version of the files can be retrieved directly from GitHub using the following URL format:
It is also possible to download the data and use it locally or regenerate the files (see the Installation section below).
Example use cases
If you are interested in a subset of Rfam families that match Bacteria, you can use the bacteria.csv file. For example, the following command generates a bacteria.cm file with a subset of Rfam covariance models that can be used with the Infernal cmscan program:
curl https://raw.githubusercontent.com/Rfam/rfam-taxonomy/master/domains/bacteria.csv | \
cut -f 1,1 -d ',' | \
tail -n +2 | \
cmfetch -o bacteria.cm -f Rfam.cm.gz -
where cmfetch is part of the Infernal suite and Rfam.cm.gz can be downloaded from the Rfam FTP site.
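If you prefer to prepare the list of accessions in Python rather than with cut and tail, a minimal sketch could look like the following. It assumes only what the command above assumes: the file is comma-separated, the first row is a header, and the first column holds the Rfam accession. The function name, the header names, and the sample accessions are hypothetical.

```python
import csv
import io
from typing import List


def family_accessions(csv_text: str) -> List[str]:
    """Return the first column of every data row, skipping the header row."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # drop the header, like `tail -n +2`
    return [row[0] for row in rows if row]  # first column, like `cut -f 1 -d ','`


if __name__ == "__main__":
    # Hypothetical sample standing in for the contents of bacteria.csv.
    sample = "Family,Domain\nRF99901,Bacteria\nRF99902,Bacteria\n"
    for accession in family_accessions(sample):
        print(accession)
```

The printed list, one accession per line, can be saved to a file and passed to cmfetch -f as the key file in place of standard input.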
You can also further process the all-domains.csv file. For example, to eliminate any families that find hits outside Bacteria, you can focus on rows where the second column is Bacteria and the third and the fourth columns contain Bacteria (100.0%). Note that such a subset would ignore many important RNA families that detect some contamination in eukaryotic sequences.
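The filter described above could be written as follows. This is a sketch under stated assumptions: columns are addressed by position (Family, Domain, Seed domains, Full region domains), and the sample rows, including the made-up accessions and the way multiple domains are listed within one cell, are hypothetical and only serve to exercise the code.

```python
import csv
import io
from typing import List

EXCLUSIVE = "Bacteria (100.0%)"  # value expected in the 3rd and 4th columns


def bacteria_only(csv_text: str) -> List[str]:
    """Return accessions whose seed and full region hits are all bacterial."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row
    keep = []
    for row in rows:
        if len(row) < 4:
            continue  # ignore blank or truncated lines
        family, domain, seed, full = row[0], row[1], row[2], row[3]
        if domain == "Bacteria" and EXCLUSIVE in seed and EXCLUSIVE in full:
            keep.append(family)
    return keep


if __name__ == "__main__":
    # Hypothetical rows; real files have seven columns, only the first four matter here.
    sample = (
        "Family,Domain,Seed domains,Full region domains\n"
        "RF99901,Bacteria,Bacteria (100.0%),Bacteria (100.0%)\n"
        'RF99902,Bacteria,Bacteria (100.0%),"Bacteria (98.0%), Eukaryota (2.0%)"\n'
        "RF99903,Eukaryota,Eukaryota (100.0%),Eukaryota (100.0%)\n"
    )
    print(bacteria_only(sample))
```

In this sample only the first row passes: the second has a non-bacterial share in its full region hits, and the third is not bacterial at all.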
Installation
Clone or download this repository and run the following commands:
virtualenv ENV
source ENV/bin/activate
pip install -r requirements.txt
Updating the data
After each Rfam release, the data in this repo need to be updated locally and pushed to GitHub.
Generate new data
# when running for the first time (needs to run in this order):
python rfam-taxonomy.py --precompute-full
python rfam-taxonomy.py --precompute-seed
# after precompute is done, run:
python rfam-taxonomy.py
# to see additional options:
python rfam-taxonomy.py --help
Review the changes
The results must be manually reviewed before committing the new files by checking the difference between the old and the new versions using git.
It is normal for the values in the 3rd and 4th columns to change, but Domain, the 2nd column, should stay stable unless the affected family has been significantly updated.
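To make the review easier, a small helper can list only the families whose Domain value changed between two versions of a file. The sketch below is not part of this repository; the function names are hypothetical, columns are addressed by position, and families that were added or removed between versions are deliberately not reported.

```python
import csv
import io
from typing import Dict, Tuple


def domain_column(csv_text: str) -> Dict[str, str]:
    """Map each accession (1st column) to its Domain value (2nd column)."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row
    return {row[0]: row[1] for row in rows if len(row) >= 2}


def changed_domains(old_text: str, new_text: str) -> Dict[str, Tuple[str, str]]:
    """Return {accession: (old domain, new domain)} for families present in both versions."""
    old = domain_column(old_text)
    new = domain_column(new_text)
    return {
        family: (old[family], new[family])
        for family in sorted(old.keys() & new.keys())
        if old[family] != new[family]
    }


if __name__ == "__main__":
    # Hypothetical before/after contents, for illustration only.
    before = "Family,Domain\nRF99901,Bacteria\nRF99902,Mixed\n"
    after = "Family,Domain\nRF99901,Bacteria\nRF99902,Bacteria/Mixed\nRF99903,Eukaryota\n"
    for family, (was, now) in changed_domains(before, after).items():
        print(f"{family}: {was} -> {now}")
```

The old version of a file can be obtained from git, for example with git show HEAD:domains/all-domains.csv (the path is assumed from the folder and file names mentioned above), and compared with the regenerated file on disk.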
Update release info in Readme
Create new GitHub release
Feel free to create GitHub issues to ask questions or provide feedback. Pull requests are also welcome.