Permalink
Browse files

Merge branch 'master' of https://github.com/McGill-CSB/RNApyro

  • Loading branch information...
2 parents 2dae2b8 + 6ee747a commit 09b6c642d15d90cadb444af447c603c787e7ba8b @yannponty yannponty committed Oct 6, 2012
Showing with 99 additions and 4,372 deletions.
  1. +21 −18 Recomb/introduction_RECOMB.tex
  2. +2 −0 Recomb/main_RECOMB.tex
  3. +43 −2 Recomb/results_RECOMB.tex
  4. +0 −4,334 benchmark/RF00001_seed.stockholm.txt
  5. +6 −7 benchmark/runtest.py
  6. +27 −11 scripts/benchmark.py
@@ -2,18 +2,7 @@
\section{Introduction}
\label{sec:introduction}
-
-In recent years, studies as the \emph{Human Microbiome Project}~\cite{Turnbaugh2007},
-leveraging the NGS techniques to sequence as many new organisms
-as possible, are producing a wealth of new information. Although
-those techniques have a huge throughput, they yield a sequencing error rate of around
-$4\%$~\cite{Huse2007}. This error can be highly reduced when highly
-redundant multiple sequence alignments
- are available, but in studies of new or not well known organisms, there is not
- enough similarity to differentiate between the sequencing errors and the natural
- polymorphisms that we want to observe, often inflating the diversity estimates~\cite{Kunin2010}.
-
- Ribonucleic acids (RNAs) are now an ubiquitous class of molecules, being
+Ribonucleic acids (RNAs) are now an ubiquitous class of molecules, being
found in every living organisms and having a broad range of functions, from catalyzing
chemical reactions as the RNase P or the group II introns,
hybridizing messenger RNA to regulate gene expression,
@@ -25,15 +14,26 @@ \section{Introduction}
can greatly differ from one organism to another.
For half a century, biological molecules have been studied as a proxy to understand
evolution~\cite{Zuckerkandl1965}, and with all their characteristics, rRNAs have
-become a prime candidate for phylogenetic studies~\cite{Olsen1986, Olsen1993}.
+always been a prime candidate for phylogenetic studies~\cite{Olsen1986, Olsen1993}.
-In this paper, we tackle the problem of exploring an rRNA sequence mutant space, to
-identify positions which are probably sequencing error, given the rRNA family sequences
-and its consensus structure.
+In recent years, studies as the \emph{Human Microbiome Project}~\cite{Turnbaugh2007},
+leveraging NGS techniques to sequence as many new organisms
+as possible, are producing a wealth of new information. Although
+those techniques have a huge throughput, they yield a sequencing error rate of around
+$4\%$~\cite{Huse2007}. This error can be highly reduced when highly
+redundant multiple sequence alignments
+ are available, but in studies of new or not well known organisms, there is not
+ enough similarity to differentiate between the sequencing errors and the natural
+ polymorphisms that we want to observe, often inflating the diversity estimates~\cite{Kunin2010}.
+
+
+In this paper, we hypothesize that the family and consensus secondary structure hold
+information allowing to identify the positions most probable to be sequencing errors.
Leveraging the techniques in \texttt{RNAmutants}~\cite{Waldispuhl2008}, and building on top
of the \emph{Inside-Outside algorithm}, we define here a new method called \texttt{RNApyro}
-efficiently computing for large RNAs those probabilities.
+efficiently computing for large RNAs those probabilities under a new
+pseudo-energetic model.
Classical techniques define a probabilistic model using a Boltzmann distribution
whose weights are based on the free energy of the structure, using as energy parameter
the values of Turner found in the NNDB~\cite{Turner2010} for stacked,
@@ -44,4 +44,7 @@ \section{Introduction}
define an isostericity distance, increasing as two base pairs differ
more from one another in space. We incorporate this second measure in the Boltzmann weights.
-To benchmark
+
+
+To benchmark our technique we use the 5S ribosomal RNA family,
+ which is around $120$ nucleotides longs. We show
View
@@ -21,6 +21,8 @@
\usepackage[applemac]{inputenc} %for the encoding
+\newcommand{\RNApyro}{\texttt{RNApyro}\xspace}
+
\newcommand{\red}[1]{{\color{red}#1}}
\newcommand{\farna}{\texttt{FARNA}\xspace}
\newcommand{\mcfoldmcsym}{\texttt{MC-Pipeline}\xspace}
View
@@ -4,5 +4,46 @@ \section{Results}
\subsection{Implementation}
The software was implemented in Python2.7 using the \textit{mpmath}~\cite{mpmath} library
-for arbitrary floating point precision. The code at \verb+https://github.com/McGill-CSB/RNApyro+
-is freely available.
+for arbitrary floating point precision. The source code is freely available at \verb+https://github.com/McGill-CSB/RNApyro+.
+
+\subsection{Error correction in 5s rRNA}
+
+To illustrate the potential of our algorithm, we applied our techniques to identify and correct point-wise errors in RNA sequences
+with conserved secondary structures. More precisely, we used \RNApyro to reconstruct 5s rRNA sequences with randomly distributed
+mutations. This experiment has been designed to suggest further applications to error-corrections in pyrosequencing data.
+
+We build our data set from the 5S rRNA multiple sequence alignment (MSA) available in the Rfam Database 11.0 (Rfam id: \texttt{RF00001}).
+Since our software does not currently implement gaps (mainly because scoring indels is a challenging issue that cannot be fully addressed
+in this work), we clustered together the sequences with identical gap locations. From the $54$ MSAs without gap produced, we selected the
+biggest MSA which contains $130$ sequences (out of $712$ in the original Rfam MSA). Then, in order to avoid any bias, we used \texttt{cd-hit}
+\cite{CDHIT} to remove sequences with more than 80\% of sequence similarity. This operation resulted in a data set of $45$ sequences.
+
+We design our benchmark using a leave-one-out strategy. We randomly picked one sequence from our data set and performed $12$ random
+mutations. Our sequences have $119$ nucleotides, thus the number of mutations corresponds to an error-rate of 10\%. We repeated this operation
+$10$ times.
+
+To evaluate our method, we computed a ROC curve representing the performance of a classifier based on the mutational probabilities computed by
+\RNApyro. We reported in Table \ref{tab:benchmark} the area under the curve (AUC). More specifically, we fix a threshold $\lambda \in [0,1]$ and we
+predict an error at position $i$ in sequence $\omega$ if and only if the probability $P(i,n)$ of a nucleotide $n \in \{ A,C,G,U \}$ exceed this threshold.
+The set of corrections is thus $\{ n \; | \; n \in \{ A,C,G,U \} \mbox{ and } P(i,n) > \lambda \mbox{ and } n \neq \omega[i] \}$, where $\omega[i]$ is the
+nucleotide at position $i$ in the input sequence. Then, we progressively vary $\lambda$ between $0$ and $1$ to calculate the ROC curve and the AUC.
+
+
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|c|c|c|c|}
+\hline
+& 1.0 & 0.5 & 0. \\
+\hline
+6 & 0.69 & 0.72 & 0.74 \\
+12 & 0.80 & 0.84 & 0.84 \\
+24 & 0.76 & & \\
+\hline
+\end{tabular}
+\end{center}
+\caption{Performance of error-correction. The row index indicates the number of mutations performed with \RNApyro, and the column index indicates
+the value of the parameter $\alpha$ distributing the weights of stacking pair energies vs isostericity scores. In each cell, we report the average AUC
+values over the 10 experiments.}
+\label{tab:benchmark}
+\end{table}
Oops, something went wrong.

0 comments on commit 09b6c64

Please sign in to comment.