Skip to content
Fetching contributors…
Cannot retrieve contributors at this time
616 lines (505 sloc) 23.6 KB
 Gumbel statistics are often used to estimate the statistical significance of local alignment scores. The Gumbel distribution is the so-called Type I extreme value distribution (EVD). It occurs so frequently in sequence analysis applications, compared to the type II (Fr\'{e}chet) and type III (Weibull) extreme value distributions, that Gumbel'' and EVD'' are often used interchangeably in bioinformatics. Easel has a separate module, the \eslmod{gev} module, that implements the generalized extreme value distribution. Karlin/Altschul statistics are a special case of the Gumbel distribution that apply to the scores of ungapped local alignments between infinitely long random sequences. Empirically, Karlin/Altschul statistics also apply reasonably well to the more useful case of gapped alignment of finite-length sequences. Karlin/Altschul statistics predict how the Gumbel's two parameters depend on the length of the query and target sequences. In the case of ungapped alignments, Karlin/Altschul statistics allow the Gumbel parameters to be estimated directly, without the need for a compute-intensive simulation. \subsection{The gumbel API} The \eslmod{gumbel} API consists of the following functions: \vspace{0.5em} \begin{center} \begin{tabular}{ll}\hline \multicolumn{2}{c}{\textbf{evaluating densities and distributions:}}\\ \ccode{esl\_gumbel\_pdf()} & Returns the probability density, $P(S=x)$.\\ \ccode{esl\_gumbel\_logpdf()} & Returns the log of the pdf, $\log P(S=x)$.\\ \ccode{esl\_gumbel\_cdf()} & Returns the cumulative probability distribution, $P(S \leq x)$.\\ \ccode{esl\_gumbel\_logcdf()} & Returns the log of the cdf, $\log P(S \leq x)$.\\ \ccode{esl\_gumbel\_surv()} & Returns right tail mass, 1-cdf, $P(S > x)$\\ \ccode{esl\_gumbel\_logsurv()} & Returns log of 1-cdf, $\log P(S > x)$.\\ \multicolumn{2}{c}{\textbf{sampling:}}\\ \ccode{esl\_gumbel\_Sample()} & Returns a Gumbel-distributed random sample.\\ \multicolumn{2}{c}{\textbf{maximum a posteriori parameter fitting:}}\\ \ccode{esl\_gumbel\_FitComplete()} & Estimates $\mu,\lambda$ from complete data.\\ \ccode{esl\_gumbel\_FitCompleteLoc()} & Estimates $\mu$ when $\lambda$ is known.\\ \ccode{esl\_gumbel\_FitCensored()} & Estimates $\mu,\lambda$ from censored data.\\ \ccode{esl\_gumbel\_FitCensoredLoc()} & Estimates $\mu$ when $\lambda$ is known.\\ \ccode{esl\_gumbel\_FitTruncated()}& Estimates $\mu,\lambda$ from truncated data.\\\hline \end{tabular} \end{center} \vspace{0.5em} The Gumbel distribution depends on two parameters, $\mu$ and $\lambda$. When $\mu$ and $\lambda$ are known, the statistical significance (P-value) of a single score $x$ is $P(S>x)$, obtained by a call to \ccode{esl\_gumbel\_surv()}. The E-value for obtaining that score or better in searching a database of $N$ sequences is just $NP(S>x)$. When $\mu$ and $\lambda$ are unknown, they are estimated from scores obtained from comparisons of simulated random data. (Analytical solutions for $\mu$ and $\lambda$ are only available in the case of ungapped sequence alignments.) The \ccode{esl\_evd\_Fit*()} functions provide maximum likelihood parameter fitting routines for different types of data. \subsection{Example of using the gumbel API} An example that samples 10,000 data points from a Gumbel distribution with $\mu=-20$, $\lambda=0.4$; reports the min and max samples, and the probability mass to the left of the min and to the right of the max (both of which should be about $\frac{1}{10000}$, since we took 10,000 samples); and then fits those simulated data to a Gumbel and reports the fitted $\mu$ and $\lambda$: \input{cexcerpts/gumbel_example} \subsection{Gumbel densities} The probability density function (pdf) and the cumulative distribution function (cdf) of the extreme value distribution are: \begin{equation} P(x) = \lambda \exp \left[ -\lambda (x - \mu) - e^{- \lambda (x - \mu)} \right] \label{eqn:gumbel_density} \end{equation} \begin{equation} P(S < x) = \exp \left[ -e^{-\lambda(x - \mu)} \right] \label{eqn:gumbel_distribution} \end{equation} The extreme value density and distribution functions for $\mu = 0$ and $\lambda = 1.0$ are shown below. \begin{center} \includegraphics[width=3in]{figures/evd_basic} \end{center} The $\mu$ and $\lambda$ parameters are {\em location} and {\em scale} parameters, respectively: \centerline{ \begin{minipage}{3in} \includegraphics[width=2.8in]{figures/evd_location} \end{minipage} \begin{minipage}{3in} \includegraphics[width=2.8in]{figures/evd_scale} \end{minipage} } For more details, a classic reference is \citep{Lawless82}. Gumbel distributions can have their long tail to the right or to the left. The form given here is for the long tail to the right. This is the form that arises when the extreme value is a maximum, such as when our score is the maximum over the individual scores of all possible alignments. The equations in \citep{Lawless82} are for extremal minima; use $(x - u) = -(x - \mu)$ and $b = 1 / \lambda$ to convert Lawless' notation to the notation used here. \subsection{Fitting Gumbel distributions to observed data} Given a set of $n$ observed samples $\mathbf{x}$, we may want to estimate the $\mu$ and $\lambda$ parameters. One might try to use linear regression to fit to a $\log \log$ transformation of the $P(S < x)$ histogram, which gives a straight line with slope $-\lambda$ and $x$ intercept $\mu$: \begin{equation} \log \left[ -\log P(S0$. We do a change of variables, and use the transformation$\lambda = e^w$so we can optimize the unconstrained parameter$w = \log \lambda$instead of optimizing$\lambda$directly. The necessary partial derivatives are then: \begin{eqnarray} \frac{\partial \log L}{\partial \mu} & = & n \lambda - \lambda \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)} - \frac{n \lambda \exp \left[ -\lambda (\phi - \mu) - e^{- \lambda (\phi - \mu)} \right]} {1 - \exp(-e^{-\lambda(\phi - \mu)})} \label{eqn:truncated_dmu} \\% \frac{\partial \log L}{\partial w} & = & n - \sum_{i=1}^{n} \lambda(x_i - \mu) + \sum_{i=1}^{n} \lambda(x_i - \mu) e^{-\lambda (x_i - \mu)} + \frac{n\lambda (\phi-\mu) \exp \left[ -\lambda (\phi - \mu) - e^{- \lambda (\phi - \mu)} \right]} {1 - \exp(-e^{-\lambda(\phi - \mu)})} \label{eqn:truncated_dw} \end{eqnarray} This optimization is carried out by \ccode{esl\_evd\_FitTruncated()}. The likelihood (\ref{eqn:truncated_logL}) is implemented in \ccode{tevd\_func()}, and the derivatives (\ref{eqn:truncated_dmu}) and (\ref{eqn:truncated_dw}) are implemented in \ccode{tevd\_dfunc()}. \ccode{esl\_evd\_FitTruncated()} simply sets up the problem and passes it all off to a conjugate gradient descent optimizer. Results on 500 simulated datasets with$\mu = -20, \lambda = 0.4$, truncated at$\phi = -20$(leaving the right tail, containing about 63\% of the samples): \begin{center} \begin{tabular}{lrrrr} \hline & \multicolumn{4}{c}{\# samples}\\ & 100 & 1000 & 10,000 & 100,000 \\ \% error in$\hat{\mu}$& 13\%& 2\% & 0.8\% & 0.3\% \\ max error in$\hat{\mu}$&260\%& 42\% & 3\% & 1\% \\ \% error in$\hat{\lambda}$& 15\%& 5\% & 2\% & 0.6\% \\ max error in$\hat{\lambda}$& 68\%& 18\% & 6\% & 2\% \\ \hline \end{tabular} \end{center} Fitting truncated Gumbel distributions is difficult, requiring much more data than fitting complete or censored data. The problem is that the right tail becomes a scale-free exponential when$\phi >> \mu$, and$\mu$becomes undetermined. Fits become very inaccurate as$\phi$gets larger than$\mu$, and for sufficiently large$\phi\$, the numerical optimizer will completely fail.
You can’t perform that action at this time.