
%!TEX root = main_RECOMB.tex
\section{Introduction}
\label{sec:introduction}

Ribonucleic acids (RNAs) are a ubiquitous class of molecules, found in every
living organism and performing a broad range of functions: catalyzing
chemical reactions, as RNase P and the group II introns do;
hybridizing with messenger RNA to regulate gene expression;
and synthesizing proteins, as ribosomal RNA (rRNA) does.
These functions require specific structures, which are
encoded in the nucleotide sequence. Although a function
must be preserved across organisms, and the corresponding
structure must therefore remain similar, the sequences
can differ greatly from one organism to another.
For half a century, biological molecules have been studied as a proxy to understand
evolution~\cite{Zuckerkandl1965}, and, given these characteristics, rRNAs have
always been prime candidates for phylogenetic studies~\cite{Olsen1986, Olsen1993}.

In recent years, studies such as the \emph{Human Microbiome Project}~\cite{Turnbaugh2007},
which leverage next-generation sequencing (NGS) techniques to sequence as many new organisms
as possible, have been producing a wealth of new information. Although
these techniques have a huge throughput, they yield a sequencing error rate of around
$4\%$~\cite{Huse2007}. This error can be greatly reduced when highly
redundant multiple sequence alignments
are available, but in studies of new or poorly characterized organisms, there is not
enough similarity to distinguish sequencing errors from the natural
polymorphisms that we want to observe, which often inflates diversity estimates~\cite{Kunin2010}.
 
 
In this paper, we hypothesize that the RNA family and its consensus secondary structure hold
information that makes it possible to identify the positions most likely to be sequencing errors.

Leveraging the techniques of \texttt{RNAmutants}~\cite{Waldispuhl2008}, and building on
the \emph{Inside-Outside algorithm}, we introduce a new method, \texttt{RNApyro},
that efficiently computes these probabilities for large RNAs under a new
pseudo-energetic model.
Classical techniques define a probabilistic model using a Boltzmann distribution
whose weights are based on the free energy of the structure, using as energy parameters
the Turner values found in the NNDB~\cite{Turner2010} for stacked
canonical and wobble base pairs. As shown by Leontis and Westhof~\cite{Leontis2001},
this does not capture the large diversity of base pairs that any nucleotide
can form with any other, albeit with energies too small to be determined yet
by experimental techniques. To quantify geometrical differences, they
define an isostericity distance, which increases as two base pairs differ
more from one another in space. We incorporate this second measure in the Boltzmann weights.
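As a purely schematic illustration (the actual model is defined in Sec.~\ref{sec:methods}),
one way to combine the two measures is a weighted sum inside the Boltzmann factor.
The symbols below are introduced here for illustration only and are assumptions:
$E_{\mathrm{stack}}(s)$ denotes the stacking free energy of a sequence $s$ folded into the
consensus structure, $E_{\mathrm{iso}}(s)$ a penalty derived from the isostericity distances
to the base pairs observed in the family, $\alpha \in [0,1]$ a hypothetical mixing parameter,
$R$ the gas constant and $T$ the temperature:
\[
  \mathcal{B}(s) = \exp\!\left(-\frac{\alpha\,E_{\mathrm{stack}}(s) + (1-\alpha)\,E_{\mathrm{iso}}(s)}{RT}\right).
\]
Under such a weighting, $\alpha = 1$ would recover a purely energy-based Boltzmann
distribution, whereas $\alpha = 0$ would score sequences by base-pair geometry alone.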
 
The 5S ribosomal RNA family, $119$ nucleotides long,  was used to benchmark our method.
It is a prime example since it has been extensively used for phylogenetic
reconstructions~\cite{Hori1987} and its sequence has been recovered for over 8000 species
 (RFAM Id: \texttt{RF00001}).
Using a leave-one-out strategy, we introduce randomly distributed mutations into a sequence. We show that
\texttt{RNApyro} reconstructs the original sequence with high accuracy.

The pseudo-energetic model and the algorithm are presented in Sec.~\ref{sec:methods}.
Details of the implementation and benchmarks are in Sec.~\ref{sec:results}. 
A discussion and future applications are given in Sec.~\ref{sec:conclusion}.