# McGill-CSB/RNApyro

%!TEX root = main_RECOMB.tex
\section{Introduction}
\label{sec:introduction}

Ribonucleic acids (RNAs) are a ubiquitous class of molecules: they are found in every living organism and have a broad range of functions, from catalyzing chemical reactions, as RNase P and the group II introns do, through hybridizing to messenger RNA to regulate gene expression, to synthesizing proteins, as ribosomal RNA (rRNA) does. These functions require specific structures, which are encoded in the nucleotide sequence. Although a function must be preserved across organisms, and the underlying structure must therefore remain similar, the sequences themselves can differ greatly from one organism to another.

For half a century, biological molecules have been studied as a proxy to understand evolution~\cite{Zuckerkandl1965}, and, given these characteristics, rRNAs have always been prime candidates for phylogenetic studies~\cite{Olsen1986, Olsen1993}. In recent years, studies such as the \emph{Human Microbiome Project}~\cite{Turnbaugh2007}, which leverage next-generation sequencing (NGS) techniques to sequence as many new organisms as possible, have been producing a wealth of new information. Although these techniques have a huge throughput, they yield a sequencing error rate of around $4\%$~\cite{Huse2007}. This error can be greatly reduced when highly redundant multiple sequence alignments are available, but in studies of new or poorly characterized organisms there is not enough similarity to distinguish sequencing errors from the natural polymorphisms that we want to observe, which often inflates diversity estimates~\cite{Kunin2010}.

In this paper, we hypothesize that the family and the consensus secondary structure hold information that allows us to identify the positions most likely to be sequencing errors.
Leveraging the techniques of \texttt{RNAmutants}~\cite{Waldispuhl2008} and building on the \emph{Inside-Outside algorithm}, we define a new method, \texttt{RNApyro}, that efficiently computes these probabilities for large RNAs under a new pseudo-energetic model.

Classical techniques define a probabilistic model using a Boltzmann distribution whose weights are based on the free energy of the structure, with energy parameters taken from the Turner values in the NNDB~\cite{Turner2010} for stacked base pairs (canonical and wobble). As shown by Leontis and Westhof~\cite{Leontis2001}, this does not capture the large diversity of base pairs that any nucleotide can form with any other, whose energies are still too small to be determined by experimental techniques. To quantify geometrical differences, they define an isostericity distance, which increases as two base pairs differ more from one another in space. We incorporate this second measure into the Boltzmann weights.

The 5S ribosomal RNA family, $119$ nucleotides long, was used to benchmark our method. It is a prime example, since it has been extensively used for phylogenetic reconstructions~\cite{Hori1987} and its sequence has been recovered for over 8000 species (RFAM Id: \texttt{RF00001}). Using a leave-one-out strategy, we apply randomly distributed mutations to a sequence and show that \texttt{RNApyro} reconstructs the original sequence with excellent accuracy.

The pseudo-energetic model and the algorithm are presented in Sec.~\ref{sec:methods}. Details of the implementation and benchmarks are given in Sec.~\ref{sec:results}. Future applications and a discussion are developed in Sec.~\ref{sec:conclusion}.