Skip to content


Subversion checkout URL

You can clone with
Download ZIP
Tree: ac85f552d4
Fetching contributors…

Cannot retrieve contributors at this time

96 lines (80 sloc) 6.42 KB
%!TEX root = main_RECOMB.tex
The software was implemented in Python2.7 using the \textit{mpmath}~\cite{mpmath} library
for arbitrary floating point precision. The source code is freely available at \verb+
The time benchmarks were done on a MacMini 2010, 2.3GHz dual-core Intel Core i5, 8GB of RAM.
Since applications of \RNApyro implies a need for
efficiency and scalability, we present in Table~\ref{tab:time} typical runtimes required to compute the probabilities for every nucleotide at every positions for a vast set of parameters. For those tests,
both the sequences and the target secondary structure were generated at random.
$|s|$&\multicolumn{3}{c}{Number of mutations}\\\cline{2-4}
& 6 & 12 & 24\\\cline{2-4}
100 & 35s & 238s & 1023s\\
300 & 135s & 594s &2460s\\\cline{2-4}
& 25 & & 50 \\\cline{2-4}
500 & 5400s& & 21003s \\
\caption{Time to compute all probabilities. The first column indicates the length and the column indexes indicate the number
of mutations. $\alpha$ is
set at $0.5$, $\beta$ to $1.5$ and $|\Omega|=44$.}
\subsection{Error correction in 5s rRNA}
To illustrate the potential of our algorithm, we applied our techniques to identify and correct point-wise errors in RNA sequences
with conserved secondary structures. More precisely, we used \RNApyro to reconstruct 5s rRNA sequences with randomly distributed
mutations. This experiment has been designed to suggest further applications to error-corrections in pyrosequencing data.
We built our data set from the 5S rRNA multiple sequence alignment (MSA) available in the Rfam Database 11.0 (Rfam id: \texttt{RF00001}).
Since our software does not currently implement gaps (mainly because scoring indels is a challenging issue that cannot be fully addressed
in this work), we clustered together the sequences with identical gap locations. From the $54$ MSAs without gap produced, we selected the
biggest MSA which contains $130$ sequences (out of $712$ in the original Rfam MSA). Then, in order to avoid overfitting, we used \texttt{cd-hit}
\cite{Li:2006fk} to remove sequences with more than 80\% of sequence similarity. This operation resulted in a data set of $45$ sequences.
We designed our benchmark using a leave-one-out strategy. We randomly picked one sequence from our data set and performed $12$ random
mutations. Our sequences have $119$ nucleotides, thus the number of mutations corresponds to an error-rate of 10\%. We repeated this operation
$10$ times. The value of $\beta$ has been set to $1.5$ (larger values gave similar results). To estimate the impact of the distribution of the weights
between the energy term and isostericity score, we used 4 different values of $\alpha = {0, 0.5, 0.8, 1.0}$. Similarly, we also investigated the impact of
an under- and over- estimate of the number of errors, and use values equal to 50\% (6 mutations) and 200\% (24 mutations) of the exact number of
errors (i.e. 12).
To evaluate our method, we computed a ROC curve representing the performance of a classifier based on the mutational probabilities computed
by \RNApyro. More specifically, we fixed a threshold $\lambda \in [0,1]$, and predicted an error at position $i$ in sequence $\omega$ if and only if the
probability $P(i,n)$ of a nucleotide $n \in \{ A,C,G,U \}$ exceeds this threshold. To correct the errors we used the set of nucleotides with a probability
higher than this threshold is $C(i) = \{ n \; | \; n \in \{ A,C,G,U \} \mbox{ and } P(i,n) > \lambda \mbox{ and } n \neq \omega[i] \}$, where $\omega[i]$ is
the nucleotide at position $i$ in the input sequence. We note that for the lowest thresholds multiple nucleotides can be available in $C(i)$ to correct
the sequence. Here, we remind that our aim is to estimate the potential of error-correction of \RNApyro and not to develop an error-correction pipe-line, which
will be the subject of further studies. Finally, we progressively varied $\lambda$ between $0$ and $1$ to calculate the ROC curve and the area
under the curve (AUC). We report our results in Figure~\ref{fig:ROCall}.
\caption{6 mutations}
\caption{12 mutations}
\caption{24 mutations}
\caption{Performance of error-correction. Subfigures accuracy with under-estimated error rate (6 mutations), exact estimates (12 mutations) and over estimates
(24 mutations). We also analyze the impact of the parameter $\alpha$ distributing the weights of stacking pair energies vs isostericity scores and use values
ranging of $\alpha=\{0,0.5,0.8,1.0\}$. The AUC is indicated in the legend of the figures. Each individual ROC curve represent the average performance over the 10 experiments.}
Our data demonstrate that our algorithm exhibits interesting performance for error-correction applications. First, the AUC values (up to $0.86$) indicate that a
signal has been successfully extracted. This result has been achieved with errors in loop regions (i.e. without base pairing information) and thus suggests
that correction rates in structured regions (i.e. base paired regions) could be even higher. Next, the optimal values of $\alpha$ tend to be close to $0.0$. This
finding suggests that at this point the information issued from the stacking energy is currently modest. However, specific examples showed improved performance
using this energy term. Further studies must be conducted to understand how to make the best use it. Finally, our algorithm seems robust to the number of
mutations considered. Indeed, good AUC values are achieved with low estimates of the number of errors in the sequence (c.f. 50\% of the errors in
Fig.~\ref{fig:ROC6mut}) and with large values (c.f.~200\% of the errors in Fig.~\ref{fig:ROC24mut}). It is worth noting that scoring scheme with larger weight on
the isostericity scores (low $\alpha$ values) seem more robust to under- and over-estimate of the number of errors.
Jump to Line
Something went wrong with that request. Please try again.