Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
198 lines (160 sloc) 7.84 KB
The generalized extreme value distribution (GEV) includes all three
types of extreme value distributions: Type I (Gumbel), type II
(Fr\'{e}chet), and type III (Weibull). Empirically, the scores of some
sequence alignment algorithms appear to follow GEV distributions. The
\eslmod{gev} module is used in estimating the statistical significance
of such scores.
Most local sequence alignment scores follow the Gumbel distribution.
Easel's \eslmod{gumbel} module applies specifically to the Gumbel. The
\eslmod{gev} module is used for Type II or III extreme value
distributions, or for determining which of the three types of
distribution that a dataset best fits.
\subsection{The gev API}
The \eslmod{gev} API consists of the following functions:
\vspace{0.5em}
\begin{center}
\begin{tabular}{ll}\hline
\multicolumn{2}{c}{\textbf{evaluating densities and distributions:}}\\
\ccode{esl\_gev\_pdf()} & Returns the probability density, $P(S=x)$.\\
\ccode{esl\_gev\_logpdf()} & Returns the log of the pdf, $\log P(S=x)$.\\
\ccode{esl\_gev\_cdf()} & Returns the cumulative probability distribution, $P(S \leq x)$.\\
\ccode{esl\_gev\_logcdf()} & Returns the log of the cdf, $\log P(S \leq x)$.\\
\ccode{esl\_gev\_surv()} & Returns right tail mass, 1-cdf, $P(S > x)$\\
\ccode{esl\_gev\_logsurv()} & Returns log of 1-cdf, $\log P(S > x)$.\\
\multicolumn{2}{c}{\textbf{sampling:}}\\
\ccode{esl\_gev\_Sample()} & Returns a GEV-distributed random sample.\\
\multicolumn{2}{c}{\textbf{maximum likelihood parameter fitting:}}\\
\ccode{esl\_gev\_FitComplete()} & Estimates GEV parameters from complete data.\\
\end{tabular}
\end{center}
\vspace{0.5em}
The Gumbel distribution depends on three parameters, $\mu$, $\lambda$,
and $\alpha$. When these parameters are known, the statistical
significance (P-value) of a single score $x$ is $P(S>x)$, obtained by
a call to \ccode{esl\_gev\_surv()}. The E-value for obtaining that
score or better in searching a database of $N$ sequences is just
$NP(S>x)$.
When the parameters are unknown, they can be estimated from scores
obtained from comparisons of simulated random data. The
\ccode{esl\_gev\_FitComplete()} function performs maximum likelihood
parameter fitting \citep{Coles01}.
\subsection{Example of using the gev API}
Below is a code example that samples 10,000 data points from a
Fr\'{e}chet distribution with $\mu=-20$, $\lambda=0.4$, $\alpha=0.1$;
reports the min and max samples, and the probability mass to the left
of the min and to the right of the max (both of which should be about
$\frac{1}{10000}$, since we took 10,000 samples); and then fits those
simulated data to a Gumbel and reports the fitted $\mu$ and $\lambda$:
\input{cexcerpts/gev_example}
\subsection{GEV densities}
The probability density function (pdf) and the cumulative distribution
function (cdf) of the generalized extreme value distribution are
\citep{Coles01}:
\begin{eqnarray}
P(X=x) & = & \lambda \left[ 1 + \alpha \lambda (x - \mu) \right]^{-\frac{\alpha+1}{\alpha}}
\exp \left\{ - \left[ 1 + \alpha \lambda (x - \mu)
\right]^{-\frac{1}{\alpha}} \right\}
\\%
\label{eqn:gev_density}
P(X \geq x) & = & \exp \left\{ - \left[ 1 +
\alpha\lambda(x-\mu) \right]^{-\frac{1}{\alpha}} \right\}
\\%
\label{eqn:gev_distribution}
\end{eqnarray}
The parameters $\mu$, $\lambda$, and $\alpha$ are location, scale, and
shape parameters, respectively, with $-\infty < \mu < \infty$, $0 <
\lambda < \infty$, and $-\infty < \alpha < \infty$.
The Type II (Fr\'{e}chet) distribution corresponds to $\alpha > 0$,
and the Type III (Weibull) distribution corresponds to $\alpha < 0$.
The Type I (Gumbel) distribution arises in the limit $\alpha
\rightarrow 0$. At values $\alpha \simeq 0$, Easel's GEV functions
revert to the Gumbel limit case, as opposed to dividing by zero and
failing.
Technically the GEV is only defined for values of $x$ such that $1 +
\alpha \lambda (x - \mu) > 0$. However, Easel's functions return
sensible values outside this domain, such as 0 for nonexistent
densities.
Generalized extreme value densities for $\mu = 0$ and $\lambda = 1$
are shown below (left) for three settings of $\alpha$; $\alpha = 0$
(Gumbel), $\alpha = 0.1$ (Fr\'{e}chet), and $\alpha = -0.1$
(Weibull). The figure on the right shows the log densities, which more
clearly show how, relative to the exponential right tail of the
Gumbel, the Fr\'{e}chet's tail is longer, and the Weibull's tail is
shorter.
\centerline{
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/gev_density}
\end{minipage}
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/gev_logdensity}
\end{minipage}
}
For more details, see the excellent description in \citep{Coles01}.
Easel's $\{ \mu, \lambda, \alpha \}$ notation differs from the $\{
\mu, \sigma, \xi \}$ parameterization used by Coles. Use $\lambda =
\frac{1}{\sigma}$ and $\alpha = \xi$ to translate.
\subsection{Fitting GEV distributions to observed data}
Easel fits GEVs by maximum likelihood estimation by numerically
optimizing the log likelihood function, using first derivative
information and conjugate gradient descent. See the \eslmod{gumbel}
chapter for a more general introduction to maximum likelihood fitting.
\subsubsection{Maximum likelihood estimation, complete data}
The function \ccode{esl\_gev\_FitComplete()} uses gradient information
to find parameters that optimize the likelihood function, using the
conjugate gradient descent code in the \eslmod{minimizer} module.
Given $n$ samples $x_1..x_n$, we want to estimate maximum likelihood
parameter estimates $\{ \hat{\mu}, \hat{\lambda}, \hat{\alpha} \}$
that maximize the log likelihood:
\begin{equation}
\log L(\lambda, \mu, \alpha) = n \log \lambda
- \frac{\alpha+1}{\alpha}
\sum_{i=1}^{n} \log\left[1+ \alpha\lambda(x_i - \mu) \right]
- \sum_{i=1}^{n} \left[ 1 + \alpha\lambda (x_i - \mu) \right]^{\frac{1}{\alpha}}
\label{eqn:gev_logL}
\end{equation}
The $\left[ 1 + \alpha\lambda (x_i - \mu) \right]^{\frac{1}{\alpha}}$
term can be rewritten in a more conveniently differentiable form as
$\exp \left\{ \frac{1}{\alpha} \log \left[ 1 + \alpha\lambda (x_i - \mu)
\right] \right\}$.
Since the $\lambda$ parameter is constrained to $\lambda > 0$ but the
numerical optimizer expects unconstrained parameters, we use a change
of variables $\lambda = e^w$ and optimize an unconstrained value $w$.
The gradient of the log likelihood with respect to $\mu$, $w$, and
$\alpha$ is:
%% xref: STL9/118-120
\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
\sum_{i=1}^n \frac{\lambda (\alpha+1)}{1+\alpha\lambda(x_i-\mu)}
-\sum_{i=1}^n \lambda \exp
\left\{ -\frac{\alpha+1}{\alpha} \log
\left[1+\alpha\lambda(x_i-\mu)\right] \right\}
\\%
\label{eqn:gev_mupartial}
\frac{\partial \log L}{\partial w} & = &
n - \sum_{i=1}^{n} \frac{\lambda (\alpha+1) (x_i - \mu)}
{1 + \alpha \lambda (x_i - \mu)}
+ \sum_{i=1}^n \lambda (x_i - \mu)
\exp \left\{ -\frac{\alpha+1}{\alpha} \log
\left[1+\alpha\lambda(x_i-\mu)\right] \right\}
\\%
\label{eqn:gev_wpartial}
\frac{\partial \log L}{\partial \alpha} & = &
\sum_{i=1}^n \left\{
\begin{array}{l}
- \frac{\alpha+1}{\alpha} \frac{\lambda(x_i-\mu)}
{1 +\alpha\lambda(x_i-\mu)}\\
+ \frac{1}{\alpha^2} \log \left[ 1 + \alpha\lambda(x_i - \mu) \right]\\
+ \frac{1}{\alpha} \frac{\lambda(x_i-\mu)}
{1 +\alpha\lambda(x_i-\mu)}
e^{-\frac{1}{\alpha} \log\left[ 1 + \alpha\lambda(x_i - \mu) \right]}\\
- \frac{1}{\alpha^2} \log \left[ 1 + \alpha\lambda(x_i - \mu) \right]
e^{-\frac{1}{\alpha} \log\left[ 1 + \alpha\lambda(x_i - \mu)
\right]}
\end{array}
\right.
\\%
\label{eqn:gev_alphapartial}
\end{eqnarray}
When $|\alpha\lambda(x_i - \mu)|$ approaches $0$, the GEV approximates
a Gumbel distribution and these equations can be simplified using the
approximation $\log(1+a) \simeq a$.
You can’t perform that action at this time.