Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Browse files

Merge branch 'master' of https://github.com/McGill-CSB/RNApyro

  • Loading branch information...
commit 09b6c642d15d90cadb444af447c603c787e7ba8b 2 parents 2dae2b8 + 6ee747a
@yannponty yannponty authored
View
39 Recomb/introduction_RECOMB.tex
@@ -2,18 +2,7 @@
\section{Introduction}
\label{sec:introduction}
-
-In recent years, studies as the \emph{Human Microbiome Project}~\cite{Turnbaugh2007},
-leveraging the NGS techniques to sequence as many new organisms
-as possible, are producing a wealth of new information. Although
-those techniques have a huge throughput, they yield a sequencing error rate of around
-$4\%$~\cite{Huse2007}. This error can be highly reduced when highly
-redundant multiple sequence alignments
- are available, but in studies of new or not well known organisms, there is not
- enough similarity to differentiate between the sequencing errors and the natural
- polymorphisms that we want to observe, often inflating the diversity estimates~\cite{Kunin2010}.
-
- Ribonucleic acids (RNAs) are now an ubiquitous class of molecules, being
+Ribonucleic acids (RNAs) are now an ubiquitous class of molecules, being
found in every living organisms and having a broad range of functions, from catalyzing
chemical reactions as the RNase P or the group II introns,
hybridizing messenger RNA to regulate gene expression,
@@ -25,15 +14,26 @@ \section{Introduction}
can greatly differ from one organism to another.
For half a century, biological molecules have been studied as a proxy to understand
evolution~\cite{Zuckerkandl1965}, and with all their characteristics, rRNAs have
-become a prime candidate for phylogenetic studies~\cite{Olsen1986, Olsen1993}.
+always been a prime candidate for phylogenetic studies~\cite{Olsen1986, Olsen1993}.
-In this paper, we tackle the problem of exploring an rRNA sequence mutant space, to
-identify positions which are probably sequencing error, given the rRNA family sequences
-and its consensus structure.
+In recent years, studies as the \emph{Human Microbiome Project}~\cite{Turnbaugh2007},
+leveraging NGS techniques to sequence as many new organisms
+as possible, are producing a wealth of new information. Although
+those techniques have a huge throughput, they yield a sequencing error rate of around
+$4\%$~\cite{Huse2007}. This error can be highly reduced when highly
+redundant multiple sequence alignments
+ are available, but in studies of new or not well known organisms, there is not
+ enough similarity to differentiate between the sequencing errors and the natural
+ polymorphisms that we want to observe, often inflating the diversity estimates~\cite{Kunin2010}.
+
+
+In this paper, we hypothesize that the family and consensus secondary structure hold
+information allowing to identify the positions most probable to be sequencing errors.
Leveraging the techniques in \texttt{RNAmutants}~\cite{Waldispuhl2008}, and building on top
of the \emph{Inside-Outside algorithm}, we define here a new method called \texttt{RNApyro}
-efficiently computing for large RNAs those probabilities.
+efficiently computing for large RNAs those probabilities under a new
+pseudo-energetic model.
Classical techniques define a probabilistic model using a Boltzmann distribution
whose weights are based on the free energy of the structure, using as energy parameter
the values of Turner found in the NNDB~\cite{Turner2010} for stacked,
@@ -44,4 +44,7 @@ \section{Introduction}
define an isostericity distance, increasing as two base pairs differ
more from one another in space. We incorporate this second measure in the Boltzmann weights.
-To benchmark
+
+
+To benchmark our technique we use the 5S ribosomal RNA family,
+ which is around $120$ nucleotides longs. We show
View
2  Recomb/main_RECOMB.tex
@@ -21,6 +21,8 @@
\usepackage[applemac]{inputenc} %for the encoding
+\newcommand{\RNApyro}{\texttt{RNApyro}\xspace}
+
\newcommand{\red}[1]{{\color{red}#1}}
\newcommand{\farna}{\texttt{FARNA}\xspace}
\newcommand{\mcfoldmcsym}{\texttt{MC-Pipeline}\xspace}
View
45 Recomb/results_RECOMB.tex
@@ -4,5 +4,46 @@ \section{Results}
\subsection{Implementation}
The software was implemented in Python2.7 using the \textit{mpmath}~\cite{mpmath} library
-for arbitrary floating point precision. The code at \verb+https://github.com/McGill-CSB/RNApyro+
-is freely available.
+for arbitrary floating point precision. The source code is freely available at \verb+https://github.com/McGill-CSB/RNApyro+.
+
+\subsection{Error correction in 5s rRNA}
+
+To illustrate the potential of our algorithm, we applied our techniques to identify and correct point-wise errors in RNA sequences
+with conserved secondary structures. More precisely, we used \RNApyro to reconstruct 5s rRNA sequences with randomly distributed
+mutations. This experiment has been designed to suggest further applications to error-corrections in pyrosequencing data.
+
+We build our data set from the 5S rRNA multiple sequence alignment (MSA) available in the Rfam Database 11.0 (Rfam id: \texttt{RF00001}).
+Since our software does not currently implement gaps (mainly because scoring indels is a challenging issue that cannot be fully addressed
+in this work), we clustered together the sequences with identical gap locations. From the $54$ MSAs without gap produced, we selected the
+biggest MSA which contains $130$ sequences (out of $712$ in the original Rfam MSA). Then, in order to avoid any bias, we used \texttt{cd-hit}
+\cite{CDHIT} to remove sequences with more than 80\% of sequence similarity. This operation resulted in a data set of $45$ sequences.
+
+We design our benchmark using a leave-one-out strategy. We randomly picked one sequence from our data set and performed $12$ random
+mutations. Our sequences have $119$ nucleotides, thus the number of mutations corresponds to an error-rate of 10\%. We repeated this operation
+$10$ times.
+
+To evaluate our method, we computed a ROC curve representing the performance of a classifier based on the mutational probabilities computed by
+\RNApyro. We reported in Table \ref{tab:benchmark} the area under the curve (AUC). More specifically, we fix a threshold $\lambda \in [0,1]$ and we
+predict an error at position $i$ in sequence $\omega$ if and only if the probability $P(i,n)$ of a nucleotide $n \in \{ A,C,G,U \}$ exceed this threshold.
+The set of corrections is thus $\{ n \; | \; n \in \{ A,C,G,U \} \mbox{ and } P(i,n) > \lambda \mbox{ and } n \neq \omega[i] \}$, where $\omega[i]$ is the
+nucleotide at position $i$ in the input sequence. Then, we progressively vary $\lambda$ between $0$ and $1$ to calculate the ROC curve and the AUC.
+
+
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|c|c|c|c|}
+\hline
+& 1.0 & 0.5 & 0. \\
+\hline
+6 & 0.69 & 0.72 & 0.74 \\
+12 & 0.80 & 0.84 & 0.84 \\
+24 & 0.76 & & \\
+\hline
+\end{tabular}
+\end{center}
+\caption{Performance of error-correction. The row index indicates the number of mutations performed with \RNApyro, and the column index indicates
+the value of the parameter $\alpha$ distributing the weights of stacking pair energies vs isostericity scores. In each cell, we report the average AUC
+values over the 10 experiments.}
+\label{tab:benchmark}
+\end{table}
View
4,334 benchmark/RF00001_seed.stockholm.txt
0 additions, 4,334 deletions not shown
View
13 benchmark/runtest.py
@@ -2,13 +2,15 @@
def main(nmut,alpha,prefix):
- for numex in range(10):
+ for numex in range(100):
infile = "RF00001_%d.in" % numex
if os.path.exists(infile):
alphanorm = int (alpha * 100)
- commandline1="python ../src/RNAPyroEx.py RF00001_%d.in %d %.1f > RF00001_%d_nmut_%d_alpha_%d.out" % (numex,nmut,alpha,numex,nmut,alphanorm)
+ outfile = "RF00001_%d_nmut_%d_alpha_%d.out" % (numex,nmut,alphanorm)
+ commandline1="python ../src/RNAPyroEx.py %s %d %.1f 1.5 > %s" % (infile,nmut,alpha,outfile)
print commandline1
- os.system(commandline1)
+ if not os.path.exists(outfile):
+ os.system(commandline1)
commandline2="python ../scripts/benchmark.py -i RF00001_%d.ref -o RF00001_%d_nmut_%d_alpha_%d.out" % (numex,numex,nmut,alphanorm)
print commandline2
os.system(commandline2)
@@ -22,14 +24,11 @@ def main(nmut,alpha,prefix):
alpha=0.5
prefix="RF00001"
- if len(sys.argv)==1:
- usage(sys.argv[0])
-
for i in range(len(sys.argv)):
myarg=sys.argv[i]
if myarg=='-h' or myarg=='--help':
usage(sys.argv[0])
- if myarg=='-n':
+ if myarg=='-m':
nmut=int(sys.argv[i+1])
if myarg=='-a':
alpha=float(sys.argv[i+1])
View
38 scripts/benchmark.py
@@ -9,7 +9,7 @@
## usage
def usage(softname):
- print "%s [-cvsw] -i <RNApyro infile> -o <RNApyro outfile>" % (softname)
+ print "%s [-cvswx] -i <RNApyro infile> -o <RNApyro outfile>" % (softname)
sys.exit(1)
## compute bp set
@@ -322,7 +322,7 @@ def checkconsistency(mutlist,indata,outdata):
## read infile
-def fullprediction(mutlist,indata,outdata,onlybpregion,verbose,customsteps,withread):
+def fullprediction(mutlist,indata,outdata,onlybpregion,verbose,customsteps,withread,withfigs):
read=indata['read'].replace('.','').replace('-','').upper()
original=indata['target'].replace('.','').replace('-','').upper()
@@ -417,12 +417,13 @@ def fullprediction(mutlist,indata,outdata,onlybpregion,verbose,customsteps,withr
myroc+=(1.0-prevspec)*(1.0+prevsens)/2.0
#print xcoord,ycoord
# make figure
- fig = plt.figure();
- ax = fig.add_subplot(111)
- ax.plot(xcoord,ycoord,'ro-')
- ax.plot([0.0,1.0],[0.0,1.0],'b--')
- fig.savefig('roccurve.pdf',transparent=True)
- print 'ROC curve saved in file roccurve.pdf'
+ if withfigs:
+ fig = plt.figure();
+ ax = fig.add_subplot(111)
+ ax.plot(xcoord,ycoord,'ro-')
+ ax.plot([0.0,1.0],[0.0,1.0],'b--')
+ fig.savefig('roccurve.pdf',transparent=True)
+ print 'ROC curve saved in file roccurve.pdf'
return bestFM,myroc
@@ -430,11 +431,14 @@ def fullprediction(mutlist,indata,outdata,onlybpregion,verbose,customsteps,withr
## main
-def main(infilename,outfilename,onlybpregion,verbose,mysteps,withread):
+def main(infilename,outfilename,onlybpregion,verbose,mysteps,withread,withfigs,printfile):
indata=readinput(infilename)
outdata=readoutput(outfilename)
+ if printfile!='':
+ f = open(printfile,'a')
+
mutationlist = hammingdistance(indata)
print 'number of mutations: ', len(mutationlist)
@@ -450,8 +454,14 @@ def main(infilename,outfilename,onlybpregion,verbose,mysteps,withread):
print 'Correlation: TP=%.2f; TN=%.2f; FP=%.2f; FN=%.2f' % (val['TP'],val['TN'],val['FP'],val['FN'])
## Accuracy of prediction using threshold
- myFM,myroc = fullprediction(mutationlist,indata,outdata,onlybpregion,verbose,mysteps,withread)
+ myFM,myroc = fullprediction(mutationlist,indata,outdata,onlybpregion,verbose,mysteps,withread,withfigs)
print 'Best F-measure: %.2f; ROC: %.2f' %(myFM,myroc)
+
+ if printfile!='':
+ print >>f,val['TP'],val['TN'],val['FP'],val['FN'],myFM,myroc
+
+ if printfile!='':
+ f.close()
###############################################################################
@@ -462,6 +472,8 @@ def main(infilename,outfilename,onlybpregion,verbose,mysteps,withread):
verbose=False
mysteps=100
withread=True
+ withfigs=False
+ printfile=''
if len(sys.argv)==1:
usage(sys.argv[0])
@@ -482,8 +494,12 @@ def main(infilename,outfilename,onlybpregion,verbose,mysteps,withread):
mysteps=int(sys.argv[i+1])
if myarg=='-w':
withread=False
+ if myarg=='-x':
+ withfigs=True
+ if myarg=='-p':
+ printfile=sys.argv[i+1]
- main(inputfile,outputfile,onlybpregion,verbose,mysteps,withread)
+ main(inputfile,outputfile,onlybpregion,verbose,mysteps,withread,withfigs,printfile)
Please sign in to comment.
Something went wrong with that request. Please try again.