update script & add results section (partial)

waldispuhl committed Oct 6, 2012
1 parent ae27986 commit 6ee747a1a49b4a5077737befa6108cfaf3046c2f
Showing with 78 additions and 4,354 deletions.
  1. +2 −0 Recomb/main_RECOMB.tex
  2. +43 −2 Recomb/results_RECOMB.tex
  3. +0 −4,334 benchmark/
  4. +6 −7 benchmark/
  5. +27 −11 scripts/
@@ -21,6 +21,8 @@
\usepackage[applemac]{inputenc} %for the encoding
@@ -4,5 +4,46 @@ \section{Results}
The software was implemented in Python2.7 using the \textit{mpmath}~\cite{mpmath} library
-for arbitrary floating point precision. The code at \verb+
-is freely available.
+for arbitrary floating point precision. The source code is freely available at \verb+
+\subsection{Error correction in 5S rRNA}
+To illustrate the potential of our algorithm, we applied our techniques to identify and correct point-wise errors in RNA sequences
+with conserved secondary structures. More precisely, we used \RNApyro to reconstruct 5S rRNA sequences with randomly distributed
+mutations. This experiment was designed to suggest further applications to error correction in pyrosequencing data.
+We built our data set from the 5S rRNA multiple sequence alignment (MSA) available in the Rfam Database 11.0 (Rfam id: \texttt{RF00001}).
+Since our software does not currently handle gaps (mainly because scoring indels is a challenging issue that cannot be fully addressed
+in this work), we clustered together the sequences with identical gap locations. From the $54$ gap-free MSAs produced, we selected the
+largest one, which contains $130$ sequences (out of $712$ in the original Rfam MSA). Then, to avoid any bias, we used \texttt{cd-hit}
+\cite{CDHIT} to remove sequences with more than 80\% sequence similarity. This operation resulted in a data set of $45$ sequences.
+We designed our benchmark using a leave-one-out strategy. We randomly picked one sequence from our data set and performed $12$ random
+mutations. Our sequences have $119$ nucleotides, so this number of mutations corresponds to an error rate of about 10\%. We repeated this operation
+$10$ times.
+To evaluate our method, we computed a ROC curve representing the performance of a classifier based on the mutational probabilities computed by
+\RNApyro, and we report the area under the curve (AUC) in Table~\ref{tab:benchmark}. More specifically, we fix a threshold $\lambda \in [0,1]$ and
+predict an error at position $i$ in sequence $\omega$ if and only if the probability $P(i,n)$ of a nucleotide $n \in \{ A,C,G,U \}$ with $n \neq \omega[i]$ exceeds this threshold.
+The set of corrections at position $i$ is thus $\{ n \; | \; n \in \{ A,C,G,U \} \mbox{ and } P(i,n) > \lambda \mbox{ and } n \neq \omega[i] \}$, where $\omega[i]$ is the
+nucleotide at position $i$ in the input sequence. We then vary $\lambda$ progressively between $0$ and $1$ to compute the ROC curve and the AUC.
+\begin{table}
+\centering
+\begin{tabular}{cccc}
+& 1.0 & 0.5 & 0. \\
+6 & 0.69 & 0.72 & 0.74 \\
+12 & 0.80 & 0.84 & 0.84 \\
+24 & 0.76 & & \\
+\end{tabular}
+\caption{Performance of error correction. The row index indicates the number of mutations performed with \RNApyro, and the column index indicates
+the value of the parameter $\alpha$ distributing the weights of stacking pair energies versus isostericity scores. In each cell, we report the average AUC
+value over the 10 experiments.}
+\label{tab:benchmark}
+\end{table}
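
The benchmark described in the added Results text can be sketched in Python as follows. This is only an illustration under stated assumptions, not the script shipped in this commit. The function names (`mutate`, `corrections`, `roc_auc`) are hypothetical, and `probs` is a stand-in list of per-position dictionaries for the probabilities $P(i,n)$ that \RNApyro would compute. The sketch further assumes that a position counts as a predicted error when its correction set is non-empty, and that the AUC is obtained by trapezoidal integration.

```python
import random

NUCLEOTIDES = "ACGU"

def mutate(seq, k, rng):
    """Apply k random substitutions; return the mutated sequence and the mutated positions."""
    positions = rng.sample(range(len(seq)), k)
    out = list(seq)
    for i in positions:
        # Always pick a nucleotide different from the original one.
        out[i] = rng.choice([n for n in NUCLEOTIDES if n != seq[i]])
    return "".join(out), set(positions)

def corrections(probs_i, observed, lam):
    """Correction set at one position: nucleotides n != observed with P(i,n) > lam."""
    return {n for n in NUCLEOTIDES if n != observed and probs_i[n] > lam}

def roc_auc(probs, seq, mutated, thresholds):
    """Sweep the threshold, flag positions with a non-empty correction set,
    and integrate the resulting ROC points with the trapezoidal rule.
    Assumes at least one mutated and one unmutated position."""
    n_pos = len(mutated)
    n_neg = len(seq) - n_pos
    points = {(0.0, 0.0), (1.0, 1.0)}  # ROC end points
    for lam in thresholds:
        flagged = {i for i in range(len(seq))
                   if corrections(probs[i], seq[i], lam)}
        tp = len(flagged & mutated)   # mutated positions that were flagged
        fp = len(flagged - mutated)   # unmutated positions that were flagged
        points.add((fp / float(n_neg), tp / float(n_pos)))
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

For the setting in the text, `mutate(seq, 12, random.Random())` on a 119-nucleotide sequence gives an error rate of 12/119, roughly 10\%. Averaging `roc_auc` over 10 such runs would mirror one cell of the table, provided `probs` comes from the actual tool.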