# McGill-CSB/RNApyro

### Subversion checkout URL

You can clone with
or
.
Fetching contributors…

Cannot retrieve contributors at this time

96 lines (80 sloc) 6.42 KB
 %!TEX root = main_RECOMB.tex \section{Results} \label{sec:results} \subsection{Implementation} The software was implemented in Python2.7 using the \textit{mpmath}~\cite{mpmath} library for arbitrary floating point precision. The source code is freely available at \verb+https://github.com/McGill-CSB/RNApyro+. The time benchmarks were done on a MacMini 2010, 2.3GHz dual-core Intel Core i5, 8GB of RAM. Since applications of \RNApyro implies a need for efficiency and scalability, we present in Table~\ref{tab:time} typical runtimes required to compute the probabilities for every nucleotide at every positions for a vast set of parameters. For those tests, both the sequences and the target secondary structure were generated at random. \begin{table}[t] \begin{center} \begin{tabular}{lccc} $|s|$&\multicolumn{3}{c}{Number of mutations}\\\cline{2-4} & 6 & 12 & 24\\\cline{2-4} 100 & 35s & 238s & 1023s\\ 300 & 135s & 594s &2460s\\\cline{2-4} & 25 & & 50 \\\cline{2-4} 500 & 5400s& & 21003s \\ \end{tabular} \end{center} \caption{Time to compute all probabilities. The first column indicates the length and the column indexes indicate the number of mutations. $\alpha$ is set at $0.5$, $\beta$ to $1.5$ and $|\Omega|=44$.} \label{tab:time} \end{table} \subsection{Error correction in 5s rRNA} To illustrate the potential of our algorithm, we applied our techniques to identify and correct point-wise errors in RNA sequences with conserved secondary structures. More precisely, we used \RNApyro to reconstruct 5s rRNA sequences with randomly distributed mutations. This experiment has been designed to suggest further applications to error-corrections in pyrosequencing data. We built our data set from the 5S rRNA multiple sequence alignment (MSA) available in the Rfam Database 11.0 (Rfam id: \texttt{RF00001}). Since our software does not currently implement gaps (mainly because scoring indels is a challenging issue that cannot be fully addressed in this work), we clustered together the sequences with identical gap locations. From the $54$ MSAs without gap produced, we selected the biggest MSA which contains $130$ sequences (out of $712$ in the original Rfam MSA). Then, in order to avoid overfitting, we used \texttt{cd-hit} \cite{Li:2006fk} to remove sequences with more than 80\% of sequence similarity. This operation resulted in a data set of $45$ sequences. We designed our benchmark using a leave-one-out strategy. We randomly picked one sequence from our data set and performed $12$ random mutations. Our sequences have $119$ nucleotides, thus the number of mutations corresponds to an error-rate of 10\%. We repeated this operation $10$ times. The value of $\beta$ has been set to $1.5$ (larger values gave similar results). To estimate the impact of the distribution of the weights between the energy term and isostericity score, we used 4 different values of $\alpha = {0, 0.5, 0.8, 1.0}$. Similarly, we also investigated the impact of an under- and over- estimate of the number of errors, and use values equal to 50\% (6 mutations) and 200\% (24 mutations) of the exact number of errors (i.e. 12). To evaluate our method, we computed a ROC curve representing the performance of a classifier based on the mutational probabilities computed by \RNApyro. More specifically, we fixed a threshold $\lambda \in [0,1]$, and predicted an error at position $i$ in sequence $\omega$ if and only if the probability $P(i,n)$ of a nucleotide $n \in \{ A,C,G,U \}$ exceeds this threshold. To correct the errors we used the set of nucleotides with a probability higher than this threshold is $C(i) = \{ n \; | \; n \in \{ A,C,G,U \} \mbox{ and } P(i,n) > \lambda \mbox{ and } n \neq \omega[i] \}$, where $\omega[i]$ is the nucleotide at position $i$ in the input sequence. We note that for the lowest thresholds multiple nucleotides can be available in $C(i)$ to correct the sequence. Here, we remind that our aim is to estimate the potential of error-correction of \RNApyro and not to develop an error-correction pipe-line, which will be the subject of further studies. Finally, we progressively varied $\lambda$ between $0$ and $1$ to calculate the ROC curve and the area under the curve (AUC). We report our results in Figure~\ref{fig:ROCall}. \begin{figure} \begin{subfigure}[b]{0.3\textwidth} \includegraphics[width=1.2\textwidth]{figures/ROC_6.pdf} \caption{6 mutations} \label{fig:ROC6mut} \end{subfigure} \hfill \begin{subfigure}[b]{0.3\textwidth} \includegraphics[width=1.2\textwidth]{figures/ROC_12.pdf} \caption{12 mutations} \label{fig:ROC12mut} \end{subfigure} \hfill \begin{subfigure}[b]{0.3\textwidth} \includegraphics[width=1.2\textwidth]{figures/ROC_24.pdf} \caption{24 mutations} \label{fig:ROC24mut} \end{subfigure} \caption{Performance of error-correction. Subfigures accuracy with under-estimated error rate (6 mutations), exact estimates (12 mutations) and over estimates (24 mutations). We also analyze the impact of the parameter $\alpha$ distributing the weights of stacking pair energies vs isostericity scores and use values ranging of $\alpha=\{0,0.5,0.8,1.0\}$. The AUC is indicated in the legend of the figures. Each individual ROC curve represent the average performance over the 10 experiments.} \label{fig:ROCall} \end{figure} Our data demonstrate that our algorithm exhibits interesting performance for error-correction applications. First, the AUC values (up to $0.86$) indicate that a signal has been successfully extracted. This result has been achieved with errors in loop regions (i.e. without base pairing information) and thus suggests that correction rates in structured regions (i.e. base paired regions) could be even higher. Next, the optimal values of $\alpha$ tend to be close to $0.0$. This finding suggests that at this point the information issued from the stacking energy is currently modest. However, specific examples showed improved performance using this energy term. Further studies must be conducted to understand how to make the best use it. Finally, our algorithm seems robust to the number of mutations considered. Indeed, good AUC values are achieved with low estimates of the number of errors in the sequence (c.f. 50\% of the errors in Fig.~\ref{fig:ROC6mut}) and with large values (c.f.~200\% of the errors in Fig.~\ref{fig:ROC24mut}). It is worth noting that scoring scheme with larger weight on the isostericity scores (low $\alpha$ values) seem more robust to under- and over-estimate of the number of errors.