`pyrodigal-rv` models

This repository contains the notebooks and scripts to reproduce the RNA virus model generation with pyrodigal.

It also contains a notebook to benchmark the performance of pyrodigal, pyrodigal-gv and pyrodigal-rv.

📊 Benchmarking

Accuracy

To evaluate pyrodigal-rv ORF prediction in RNA viruses, all Riboviria sequences in RefSeq indicated as "complete" by the sequence submission authors and without N's in the sequence were used as a benchmark (n=9,001).

All tools were run in closed mode (-c) and pyrodigal was forced to use genetic code 1 (-g 1) for the benchmarking as this is the most used genetic code by RNA viruses. After comparison with the CDS annotations from RefSeq pyrodigal and pyrodigal-rv give 58.9% and 49.4% exact matches respectively, while both of them also predicted ~25% CDSs with different start and/or stop sites compared to RefSeq. For pyrodigal-rv another 12.4% was predicted to only have a different translation table.

As expected pyrodigal-gv had almost no exact matches because it contains no metagenomic models with genetic code 1, and it also predicts 28.8% CDSs with different start/stop sites (4.6% higher than pyrodigal-rv).

pyrodigal-rv also performed best in context of extra and missing CDS predictions (considerably lower amount extra predictions and only 0.4% more missing predictions compared to pyrodigal).

pyrodigal-rv adds the ability to predict the right genetic code for your RNA virus sequence, when comparing to RefSeq, 11.7% of the sequences had a mismatch in genetic code. However, when examining more closely the majority of these sequences belong to the Atkinsviridae, Blumeviridae, Fiersviridae, Solspiviridae and Steitzviridae, which are RNA phages and should use the bacterial genetic code 11 (as predicted by pyrodigal-rv). This shows that not all sequences in RefSeq are annotated with the correct translation table and this benchmark underestimated pyrodigal-rv's accuracy in number of exact matches.

Disclaimer: The training models for pyrodigal-rv contain some RefSeq sequences.

Speed

CLI speed was benchmarked with hyperfine over 10 runs of the same command on 9,000 sequences for each CLI (pyrodigal, pyrodigal-gv and pyrodigal-rv) using 10 processes (-j 10 --pool process).

Command	Mean [s]	Min [s]	Max [s]	Relative
`pyrodigal`	63.883 ± 0.597	63.402	65.288	2.19 ± 0.03
`pyrodigal-gv`	29.568 ± 0.563	28.250	30.286	1.01 ± 0.02
`pyrodigal-rv`	29.150 ± 0.199	28.860	29.540	1.00

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
images		images
scripts		scripts
taxid_files		taxid_files
.gitignore		.gitignore
README.md		README.md
benchmark.ipynb		benchmark.ipynb
make_models.ipynb		make_models.ipynb
pyrodigal-rv.slurm		pyrodigal-rv.slurm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

`pyrodigal-rv` models

📊 Benchmarking

Accuracy

Speed

About

Uh oh!

Releases

Packages

Languages

LanderDC/add_pyrodigal_rv_models

Folders and files

Latest commit

History

Repository files navigation

pyrodigal-rv models

📊 Benchmarking

Accuracy

Speed

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`pyrodigal-rv` models

Packages