%!TEX root = main_RECOMB.tex
Ribonucleic acids (RNAs) are a ubiquitous class of molecules, found in
every living organism and performing a broad range of functions:
catalyzing chemical reactions, as RNase P and the group II introns do;
hybridizing to messenger RNA to regulate gene expression;
and synthesizing proteins, as ribosomal RNA (rRNA) does.
These functions require specific structures, which are
encoded in the nucleotide sequence. Because a function
must be preserved across organisms, the corresponding
structures must remain similar, yet the sequences
can differ greatly from one organism to another.

For half a century, biological molecules have been studied as a proxy for understanding
evolution~\cite{Zuckerkandl1965}, and, given these characteristics, rRNAs have
long been prime candidates for phylogenetic studies~\cite{Olsen1986, Olsen1993}.
In recent years, studies such as the \emph{Human Microbiome Project}~\cite{Turnbaugh2007},
which leverage next-generation sequencing (NGS) techniques to sequence as many new organisms
as possible, have been producing a wealth of new information. Although
these techniques have a very high throughput, they yield a sequencing error rate of around
$4\%$~\cite{Huse2007}. This error can be greatly reduced when highly
redundant multiple sequence alignments
are available, but in studies of new or poorly characterized organisms, there is not
enough similarity to distinguish sequencing errors from the natural
polymorphisms that we want to observe, which often inflates diversity estimates~\cite{Kunin2010}.

In this paper, we hypothesize that the RNA family and its consensus secondary structure hold
information that makes it possible to identify the positions most likely to be sequencing errors.
Leveraging the techniques of \texttt{RNAmutants}~\cite{Waldispuhl2008} and building on
the \emph{Inside-Outside algorithm}, we introduce a new method, \texttt{RNApyro},
that efficiently computes these probabilities for large RNAs under a new
pseudo-energetic model.

Classical techniques define a probabilistic model through a Boltzmann distribution
whose weights are based on the free energy of the structure, using as energy parameters
the Turner values found in the NNDB~\cite{Turner2010} for stacked
canonical and wobble base pairs. As shown by Leontis and Westhof~\cite{Leontis2001},
this does not capture the large diversity of base pairs that any nucleotide
can form with any other, albeit with energies too small to be determined yet
by experimental techniques. To quantify geometrical differences, they
define an isostericity distance, which increases as two base pairs differ
more from one another in space. We incorporate this second measure into the Boltzmann weights.
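
As a schematic illustration only, one way to combine the two measures is a
weighted sum inside the Boltzmann factor. The symbols $E_{\mathrm{FE}}$,
$E_{\mathrm{ISO}}$, and $\alpha$ below are notation assumed here for exposition;
the precise definition of the model is the one given in Sec.~\ref{sec:methods}.
For a sequence $s$ folded into the consensus structure $S$, such a weight could read
\[
  B(s) = \exp\!\left(-\,\frac{\alpha\, E_{\mathrm{FE}}(s,S) + (1-\alpha)\, E_{\mathrm{ISO}}(s,S)}{RT}\right),
  \qquad
  P(s) = \frac{B(s)}{\sum_{s'} B(s')},
\]
where $E_{\mathrm{FE}}$ stands for a free-energy term, $E_{\mathrm{ISO}}$ for an
isostericity-based term, $\alpha \in [0,1]$ balances the two contributions,
$RT$ is the product of the gas constant and the temperature, and the sum runs
over the candidate sequences $s'$ under consideration. With $\alpha = 1$ this
sketch reduces to the classical free-energy weighting.
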

We benchmark our method on the 5S ribosomal RNA family, $119$ nucleotides long.
It is a prime example since it has been used extensively for phylogenetic
reconstructions~\cite{Hori1987} and its sequence has been recovered for over 8000 species
(RFAM Id: \texttt{RF00001}).
Using a leave-one-out strategy, we introduce randomly distributed mutations into a sequence and show that
\texttt{RNApyro} reconstructs the original sequence with excellent accuracy.

The pseudo-energetic model and the algorithm are presented in Sec.~\ref{sec:methods}.
Details of the implementation and benchmarks are given in Sec.~\ref{sec:results}.
Future applications are discussed in Sec.~\ref{sec:conclusion}.