# Details of Parameter Estimation

This note is a more detailed version of our [paper](https://doi.org/10.1093/gigascience/giaa044). For a review of generalized linear models, I recommend chapter 15.3 of [Applied regression analysis and generalized linear models](https://www.amazon.com/Applied-Regression-Analysis-Generalized-Linear/dp/1452205663/ref=sr_1_2?dchild=1&keywords=Applied+Regression+Analysis+and+Generalized+Linear+Models&qid=1609298891&s=books&sr=1-2) by John Fox, or chapters 3-5 of [An introduction to generalized linear models](https://www.amazon.com/Introduction-Generalized-Chapman-Statistical-Science/dp/1138741515/ref=sr_1_2?crid=18BN4MONNYYJH&dchild=1&keywords=an+introduction+to+generalized+linear+models&qid=1609298924&s=books&sprefix=an+introduction+to+ge%2Cstripbooks%2C222&sr=1-2) by Dobson and Barnett.

## Generalized linear models

In `MendelIHT.jl`, phenotypes $(\bf y)$ are modeled as a [generalized linear model](https://en.wikipedia.org/wiki/Generalized_linear_model):
\begin{aligned}
    \mu_i = E(y_i) = g({\bf x}_i^t {\bf \beta})
\end{aligned}
where ${\bf x}_i$ is sample $i$'s $p$-dimensional vector of *covariates* (genotypes + other fixed effects), $\bf \beta$ is a $p$-dimensional vector of regression coefficients, $g$ is a (typically non-linear) *inverse-link* function, $y_i$ is sample $i$'s phenotype value, and $\mu_i$ is the *expected value* of $y_i$ given ${\bf x}_i$.

The regression coefficients $\bf \beta$ are not observed and are estimated via **maximum likelihood**. The full design matrix $\bf X$ (obtained by stacking each ${\bf x}_i^t$ row-by-row) and the phenotypes $\bf y$ are observed.

GLMs offer a natural way to model common non-continuous phenotypes. For instance, logistic regression for binary phenotypes and Poisson regression for count (integer-valued) phenotypes are special cases. Of course, when $g(\alpha) = \alpha,$ we get the standard linear model used for Gaussian phenotypes.
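
To make the three special cases concrete, here is a minimal plain-Python sketch (not `MendelIHT.jl` code) of the corresponding inverse-link functions $g$ applied to one linear predictor ${\bf x}_i^t {\bf \beta}$. The function names, covariates, and coefficients are hypothetical and chosen only for illustration.

```python
import math

def identity_inverse_link(eta):
    """g(eta) = eta: the standard linear model for Gaussian phenotypes."""
    return eta

def logistic_inverse_link(eta):
    """g(eta) = 1 / (1 + exp(-eta)): logistic regression, mean lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def exp_inverse_link(eta):
    """g(eta) = exp(eta): Poisson regression, mean is positive."""
    return math.exp(eta)

def linear_predictor(x, beta):
    """Compute x^t beta for one sample."""
    return sum(xj * bj for xj, bj in zip(x, beta))

# Hypothetical sample: an intercept followed by two genotype dosages.
x = [1.0, 0.0, 2.0]
beta = [-0.5, 0.3, 0.25]

eta = linear_predictor(x, beta)          # -0.5 + 0.0 + 0.5
mu_gaussian = identity_inverse_link(eta)
mu_binary = logistic_inverse_link(eta)
mu_count = exp_inverse_link(eta)
```

The same linear predictor yields a different mean $\mu_i$ under each model; only the choice of $g$ changes.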

## Calculating loglikelihood, gradient and expected information

In a GLM, each $y_i$ follows a distribution from the exponential family, with density of the form

$$f(y \mid \theta, \phi) = \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right].$$

$\theta$ is called the **canonical (location) parameter**; under the canonical link it equals the linear predictor, $\theta_i = {\bf x}_i^t {\bf \beta}$. $\phi$ is the **dispersion (scale) parameter**. The functions $a, b, c$ are known and depend on the distribution of $y$.
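
As a concrete instance (standard textbook identities, not specific to `MendelIHT.jl`), the Poisson and Bernoulli densities with mean $\mu$ can be rewritten in this form:

\begin{aligned}
    \text{Poisson:} \quad \frac{\mu^y e^{-\mu}}{y!} &= \exp\left[y\log\mu - \mu - \log y!\right]\\
    \text{Bernoulli:} \quad \mu^y(1-\mu)^{1-y} &= \exp\left[y\log\frac{\mu}{1-\mu} + \log(1-\mu)\right]
\end{aligned}

Matching terms gives $\theta = \log\mu$, $b(\theta) = e^\theta$, $c(y, \phi) = -\log y!$ for the Poisson case and $\theta = \log\frac{\mu}{1-\mu}$, $b(\theta) = \log(1 + e^\theta)$, $c(y, \phi) = 0$ for the Bernoulli case, with $a(\phi) = 1$ in both.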

Given $n$ independent observations, the loglikelihood is:

\begin{aligned}
    L({\bf \theta}, \phi; {\bf y}) &= \sum_{i=1}^n \left[\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right].
\end{aligned}

To evaluate the loglikelihood, instead of manually defining $a, b, c$, we evaluate the [logpdf](https://juliastats.org/Distributions.jl/latest/univariate/#Distributions.logpdf-Tuple{Distribution{Univariate,S}%20where%20S%3C:ValueSupport,Real}) function in Distributions.jl at each sample and sum the results.
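
The idea of summing per-sample log densities can be sketched in a few lines of plain Python. This is an illustration only: the two log-density helpers below are hand-written stand-ins (hypothetical names) for what `logpdf` provides in Distributions.jl, and they are not how `MendelIHT.jl` is implemented.

```python
import math

def poisson_logpdf(y, mu):
    """log of mu^y * exp(-mu) / y!, using lgamma(y + 1) = log(y!)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def bernoulli_logpdf(y, mu):
    """log of mu^y * (1 - mu)^(1 - y) for y in {0, 1}."""
    return math.log(mu) if y == 1 else math.log(1.0 - mu)

def loglikelihood(logpdf, y, mu):
    """Sum the per-sample log densities, given each sample's mean mu_i."""
    return sum(logpdf(yi, mui) for yi, mui in zip(y, mu))

# Hypothetical phenotypes and fitted means for three samples.
counts, count_means = [0, 1, 3], [1.0, 1.0, 2.5]
cases, case_means = [1, 0, 1], [0.5, 0.5, 0.9]

ll_poisson = loglikelihood(poisson_logpdf, counts, count_means)
ll_bernoulli = loglikelihood(bernoulli_logpdf, cases, case_means)
```

Because the log density already contains the $a$, $b$, and $c$ terms for its distribution, the caller only needs to supply the means $\mu_i$.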

To perform maximum likelihood estimation, we compute the partial derivatives of $L$ with respect to each $\beta_j$. Writing $\eta_i = {\bf x}_i^t {\bf \beta}$ for the linear predictor, the $j$th score component is (eq 4.18 in Dobson):

\begin{aligned}
    \frac{\partial L}{\partial \beta_j} = \sum_{i=1}^n \left[\frac{y_i - \mu_i}{var(y_i)}x_{ij}\left(\frac{\partial \mu_i}{\partial \eta_i}\right)\right]
\end{aligned}

Thus the full gradient is

\begin{aligned}
    \nabla L&= {\bf X}^t{\bf W}({\bf y} - {\bf \mu}), \quad {\bf W}_{ii} = \frac{1}{var(y_i)}\left(\frac{\partial \mu_i}{\partial \eta_i}\right)
\end{aligned}

Similarly, the expected information is (eq 4.23 in Dobson):

\begin{aligned}
    J = {\bf X^t\tilde{W}X}, \quad {\bf \tilde{W}}_{ii} = \frac{1}{var(y_i)}\left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2
\end{aligned}
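
As an illustration of the score and expected information formulas, the following plain-Python sketch evaluates them for logistic regression, where $var(y_i) = \mu_i(1-\mu_i)$ and $\frac{\partial \mu_i}{\partial \eta_i} = \mu_i(1-\mu_i)$. The data and helper names are hypothetical, and the code is a didactic sketch rather than the `MendelIHT.jl` implementation.

```python
import math

def sigmoid(eta):
    """Inverse logit link g(eta)."""
    return 1.0 / (1.0 + math.exp(-eta))

def dot(x, beta):
    return sum(xj * bj for xj, bj in zip(x, beta))

def loglik(X, y, beta):
    """Bernoulli loglikelihood as a function of beta."""
    total = 0.0
    for xi, yi in zip(X, y):
        mu = sigmoid(dot(xi, beta))
        total += yi * math.log(mu) + (1 - yi) * math.log(1.0 - mu)
    return total

def score_and_information(X, y, beta):
    """Return (X^t W (y - mu), X^t W~ X) for logistic regression."""
    p = len(beta)
    grad = [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for xi, yi in zip(X, y):
        mu = sigmoid(dot(xi, beta))
        var_y = mu * (1.0 - mu)           # var(y_i) for a Bernoulli phenotype
        dmu_deta = mu * (1.0 - mu)        # derivative of the inverse logit
        w = dmu_deta / var_y              # W_ii (equals 1 for this link)
        w_tilde = dmu_deta ** 2 / var_y   # W~_ii
        for j in range(p):
            grad[j] += xi[j] * w * (yi - mu)
            for k in range(p):
                info[j][k] += xi[j] * w_tilde * xi[k]
    return grad, info

# Hypothetical design matrix (intercept + one genotype) and binary phenotypes.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 1.0]]
y = [0, 1, 1, 0]
grad0, info0 = score_and_information(X, y, [0.0, 0.0])
```

A useful sanity check is to compare each entry of the returned gradient against a central finite difference of `loglik`; the two should agree to several digits.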

To evaluate the score and expected information, we need to calculate $var(y)$ and $\frac{d\mu}{d\eta}$. Fortunately, the exponential family distributions have mean and variance

\begin{aligned}
    E(y) &= \mu = b'(\theta) = \frac{db(\theta)}{d\theta}\\
    var(y) &= a(\phi)b''(\theta) = a(\phi)\frac{d^2b(\theta)}{d\theta^2} = a(\phi) var(\mu).
\end{aligned}
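
As a quick check of these identities (standard results, stated here as an illustration), take the Poisson case with $b(\theta) = e^\theta$ and the Bernoulli case with $b(\theta) = \log(1 + e^\theta)$, both with $a(\phi) = 1$:

\begin{aligned}
    \text{Poisson:} \quad b'(\theta) &= e^\theta = \mu, & b''(\theta) &= e^\theta = \mu\\
    \text{Bernoulli:} \quad b'(\theta) &= \frac{e^\theta}{1 + e^\theta} = \mu, & b''(\theta) &= \frac{e^\theta}{(1 + e^\theta)^2} = \mu(1 - \mu)
\end{aligned}

These recover the familiar variances $\mu$ for Poisson counts and $\mu(1-\mu)$ for binary phenotypes.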

Here $\frac{d\mu_i}{d\eta_i} = \frac{dg(\eta_i)}{d\eta_i}$ is just the derivative of the inverse-link function $g$ evaluated at the linear predictor $\eta_i = {\bf x}_i^t {\bf \beta}$, which is already implemented for various link functions as [mueta](https://github.com/JuliaStats/GLM.jl/blob/master/src/glmtools.jl#L149) in [GLM.jl](https://github.com/JuliaStats/GLM.jl). Similarly, calculating $var(y) = a(\phi) var(\mu)$ requires $a(\phi)$ (easy, since $a$ is known) and the variance function $var(\mu) = b''(\theta)$ expressed in terms of the mean $\mu$.
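
For intuition about what such a derivative routine returns, here is a small plain-Python sketch for the logit and log links, checked against a numerical derivative. The helper names are hypothetical and only loosely mirror the quantities above; they are not the GLM.jl functions themselves.

```python
import math

def logit_linkinv(eta):
    """mu = g(eta) for the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit_mueta(eta):
    """d mu / d eta for the logit link: mu * (1 - mu)."""
    mu = logit_linkinv(eta)
    return mu * (1.0 - mu)

def log_linkinv(eta):
    """mu = g(eta) for the log link."""
    return math.exp(eta)

def log_mueta(eta):
    """d mu / d eta for the log link: exp(eta)."""
    return math.exp(eta)

def bernoulli_variance(mu):
    """Variance function var(mu) for binary phenotypes."""
    return mu * (1.0 - mu)

def poisson_variance(mu):
    """Variance function var(mu) for count phenotypes."""
    return mu

def central_difference(f, eta, h=1e-6):
    """Numerical derivative of f at eta, used only as a sanity check."""
    return (f(eta + h) - f(eta - h)) / (2.0 * h)

eta = 0.3  # arbitrary value of the linear predictor
logit_error = abs(logit_mueta(eta) - central_difference(logit_linkinv, eta))
log_error = abs(log_mueta(eta) - central_difference(log_linkinv, eta))
```

For these two canonical links the derivative coincides with the variance function evaluated at $\mu = g(\eta)$, which is why ${\bf W}_{ii} = 1$ in the gradient when $a(\phi) = 1$.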



Since we want to estimate $\bf \beta$, it is more convenient to express $L({\bf \theta}, \phi; {\bf y})$ as a function of $\bf \beta$, written $L({\bf \beta})$.

## Iterative hard thresholding

In `MendelIHT.jl`, the loglikelihood $L(\bf \beta)$ is maximized using iterative hard thresholding. 