# 5 Generalized Linear Model 

## Least Squares

Assume $y\sim (X\beta, \sigma^2 I)$, i.e. $y$ has mean $X\beta$ and covariance $\sigma^2 I$, with $\beta\in\mathbb R^{k+1},\ X\in\mathbb R^{n\times (k+1)},\ y\in\mathbb R^n$; this is a linear model with $n$ observations. Assume $X$ is known and of full column rank (we call $X$ the design matrix). Given the observed $y$, we want to estimate $\beta$. It is well known that the least squares estimator, which minimizes $\lVert y - X\beta\rVert^2$, is $\hat\beta = (X^TX)^{-1}X^Ty$.

#### Orthogonality

$y - X\hat\beta$ and $X\hat\beta$ are orthogonal.

**Proof** Use the normal equations $X^TX\hat\beta = X^Ty$ (which hold even when $X$ is not of full rank):

$$(y - X\hat\beta)^T X\hat\beta = y^TX\hat\beta - \hat\beta^T(X^TX\hat\beta)= y^TX\hat\beta -\hat\beta^TX^Ty = 0.$$
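The closed form and the orthogonality property can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the simulated data, the seed and the names `beta_true`, `fitted`, `residual` are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
# Design matrix with an intercept column, so X has k + 1 columns.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

fitted = X @ beta_hat
residual = y - fitted
inner = residual @ fitted   # vanishes up to rounding error
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse; for ill-conditioned designs a QR or SVD based solver such as `np.linalg.lstsq` is usually preferable.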


#### Gauss Markov Theorem

**Theorem** Assume $y= X\beta+\epsilon$ where noise $\epsilon$ is mean-zero, uncorrelated and homoscedastic with variance $\sigma^2$. Then $\hat\beta = (X^TX)^{-1}X^Ty$ is the BLUE (best linear unbiased estimator).

**Proof** Let $\hat\beta^* = Ay+b$ be any linear unbiased estimator. Unbiasedness requires $\beta = \mathbb E\hat\beta^* = \mathbb E(Ay+b) = AX\beta +b$ for all $\beta$. Taking $\beta = 0$ shows that $b = 0$, and then $AX = I$. If we write $A = (X^TX)^{-1}X^T +D $, then $DX =0$. Now we compute the variance matrix of $\hat\beta^*$, where the cross terms vanish because $DX = 0$:

$${\rm Var}\hat\beta^* = {\rm Var}Ay = \sigma^2 AA^T = \sigma^2\left[ (X^TX)^{-1}+DD^T\right]\succeq \sigma^2(X^TX)^{-1}.$$

The minimum is reached if and only if $D = 0$, i.e. $A = (X^TX)^{-1}X^T$, which gives exactly $\hat\beta = (X^TX)^{-1}X^Ty$.
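The decomposition $A = (X^TX)^{-1}X^T + D$ with $DX = 0$ can be illustrated numerically. This is a sketch under assumptions: NumPy is available, the design is random, and $D$ is built by projecting an arbitrary matrix onto the orthogonal complement of the column space of $X$ (one convenient way to enforce $DX = 0$, not the only one).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = rng.normal(size=(n, p))
XtX_inv = np.linalg.inv(X.T @ X)
A_ols = XtX_inv @ X.T

# Build D with D X = 0 by projecting an arbitrary matrix onto the
# orthogonal complement of the column space of X.
P = X @ A_ols                                   # hat matrix
D = rng.normal(size=(p, n)) @ (np.eye(n) - P)
A = A_ols + D                                   # still satisfies A X = I

# Variance gap divided by sigma^2; equals D D^T in exact arithmetic.
gap = A @ A.T - XtX_inv
eigs = np.linalg.eigvalsh((gap + gap.T) / 2)    # all nonnegative
```

Any such $A$ gives an unbiased linear estimator, and the eigenvalues of the gap show that its variance matrix dominates that of least squares.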

## Generalized Linear Model 

Recall we can treat the least squares model as an optimization problem in the **parametric family**:

$$y_i \sim N(x_i^T\beta, \sigma^2)\quad\quad\quad x_i\in\mathbb R^{k+1},$$

or, in the form of p.d.f.,

$$f_{\beta,\sigma^2}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2 }}\exp\left\{-\frac{1}{2\sigma^2}(y_i - x_i^T\beta)^2\right\}$$

where $\beta$ and $\sigma^2$ are unknown parameters. It is also an exponential family. The least squares estimate of $\beta$ is the value that maximizes the joint likelihood of the observations, whatever the value of $\sigma^2$.

<br>

In general, a generalized linear model (GLM) models each observation by a parametric family with dispersion parameter $\phi>0$ and p.d.f.

$$f_{\beta,\phi}(y_i)= h(y_i,\phi) \exp\left\{\frac{y_i\eta_i - \zeta(\eta_i) }{\phi} \right\},\quad\quad \eta_i=\eta(\beta^Tx_i)$$

Here $\beta,\phi$ are the unknown parameters of the family, the $x_i$ are known covariates, and $h,\zeta,\eta$ are known functions. In general the family is not an exponential family. However, <font color=red>it is an exponential family when treating</font> $\phi$ <font color=red>as a constant.</font>


### Linear and Logistic Examples

The linear model is a special case of the generalized linear model: take $\phi =\sigma^2$, $h(y_i,\phi)=\frac{1}{\sqrt{2\pi\sigma^2 }}\exp \left\{-\frac{y_i^2}{2\sigma^2}\right\}$, $\zeta(\eta) = \frac12\eta^2$ and $\eta(z)=z$. In this case, $\eta_i = \beta^Tx_i$ and

$$f_{\beta,\sigma^2}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2 }}\exp \left\{-\frac{y_i^2}{2\sigma^2}\right\}\exp\left\{\frac{ y_i (\beta^Tx_i) -\frac{1}{2 }(\beta^Tx_i)^2}{\sigma^2}\right\}.$$

Logistic regression is also a special case. Let $p_i = \frac{1}{1+\exp\{-\beta^Tx_i\}}$, so that $\log \frac{p_i}{1-p_i} = \beta^Tx_i$. Then we can take $\phi\equiv 1$, $h(y_i,\phi) =  \mathbb I_{y_i\in\{0,1\}}$, $\zeta(\eta) =- \log \left(1 - \frac{1}{1+e^{-\eta}}\right) = \log(1+e^{\eta})$ and $\eta(z) = z$, so that $\eta_i = \beta^Tx_i$, $\zeta(\eta_i) = -\log (1 - p_i)$ and

$$f_{\beta}(y_i) = \mathbb I_{y_i\in\{0,1\}} p_i^{y_i}(1-p_i)^{1-y_i}
= \mathbb I_{y_i\in\{0,1\}} \exp\left\{y_i\log \frac{p_i}{1-p_i}+\log (1 - p_i)\right\}
= \mathbb I_{y_i\in\{0,1\}} \exp\left\{y_i(x_i^T\beta)+\log (1 - p_i)\right\}.$$
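A quick numerical check that the Bernoulli p.m.f. and its GLM form agree. This is a minimal sketch using only the standard library; the function names and the value of `eta` are hypothetical.

```python
import math

def bernoulli_pmf(y, p):
    """Usual form p^y (1 - p)^(1 - y) for y in {0, 1}."""
    return p**y * (1 - p)**(1 - y)

def glm_form(y, eta):
    """GLM form exp{y * eta - zeta(eta)} with phi = 1 and h = 1 on {0, 1}."""
    zeta = math.log1p(math.exp(eta))   # zeta(eta) = log(1 + e^eta) = -log(1 - p)
    return math.exp(y * eta - zeta)

eta = 0.7                              # plays the role of beta^T x_i
p = 1 / (1 + math.exp(-eta))
values = [(bernoulli_pmf(y, p), glm_form(y, eta)) for y in (0, 1)]
```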

### Statistical Property

We have $\mathbb E(y_i) = \zeta'(\eta_i)$ and ${\rm Var}(y_i) = \phi \zeta''(\eta_i)$, where $\eta_i = \eta(\beta^Tx_i)$. Note that $\phi$ affects the variance but not the expectation; $\phi$ is called the dispersion parameter.

**Proof** Since $\int h(y,\phi) \exp\left\{\frac{y\eta_i - \zeta(\eta_i) }{\phi} \right\}dy = 1$, we take derivative with respect to $\eta_i$ on both sides to yield

$$0 = \frac{\partial }{\partial \eta_i}\int h(y,\phi) \exp\left\{\frac{y\eta_i - \zeta(\eta_i) }{\phi} \right\}dy 
=\int h(y,\phi) \exp\left\{\frac{y\eta_i - \zeta(\eta_i) }{\phi} \right\}\left[ \frac{y - \zeta'(\eta_i)}{\phi}\right]dy. $$

Therefore, multiplying by $\phi$ and rearranging, $\mathbb E(y_i ) =\displaystyle \int h(y,\phi) \exp\left\{\frac{y\eta_i - \zeta(\eta_i) }{\phi} \right\}y\,dy = \zeta'(\eta_i)\displaystyle \int h(y,\phi) \exp\left\{\frac{y\eta_i - \zeta(\eta_i) }{\phi} \right\}dy =  \zeta'(\eta_i) $.


Similarly the second derivative gives

$$0 =\phi \frac{\partial^2 }{\partial \eta_i^2}\int h(y,\phi) \exp\left\{\frac{y\eta_i - \zeta(\eta_i) }{\phi} \right\}dy 
=\int h(y,\phi) \exp\left\{\frac{y\eta_i - \zeta(\eta_i) }{\phi} \right\}\left[ -\zeta''(\eta_i)+\frac{(y - \zeta'(\eta_i) )^2}{\phi}\right]dy. $$

Rearranging the equality yields ${\rm Var}(y_i )=\mathbb E\left\{[y_i - \zeta'(\eta_i)]^2\right\}= \phi \zeta''(\eta_i)$.
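For the Bernoulli family of the logistic example ($\phi = 1$, $\zeta(\eta) = \log(1+e^\eta)$), both identities can be checked with finite differences. This is a rough sketch using only the standard library; the step size and the value of `eta` are arbitrary choices.

```python
import math

def zeta(eta):
    return math.log1p(math.exp(eta))   # log-partition of the Bernoulli family

eta, step = 0.4, 1e-4
p = 1 / (1 + math.exp(-eta))

# Central finite differences approximating zeta' and zeta''.
d1 = (zeta(eta + step) - zeta(eta - step)) / (2 * step)
d2 = (zeta(eta + step) - 2 * zeta(eta) + zeta(eta - step)) / step**2

mean = p              # E(y) for y ~ Bernoulli(p)
var = p * (1 - p)     # Var(y), with phi = 1
```

Here `d1` should match `mean` and `d2` should match `var` up to discretization error.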



### Link Function

Define $g$ as the **inverse function** of $z\mapsto\zeta'(\eta(z))$. We call $g$ the link function. Moreover, when $\eta (z)\equiv z$ is the identity, we call $g$ the natural link or canonical link.

For example, in linear regression $\zeta(\eta(z)) = \frac12z^2$ and $\zeta'(\eta(z)) =z$, whose inverse function is again $z\mapsto z$, so the natural link of linear regression is the identity. In logistic regression, $\zeta(\eta) = -\log\left[ 1 - \frac{1}{1+e^{-\eta}}\right]$ and $\zeta'(\eta(z)) = \frac{1}{1+e^{-z}}$, whose inverse function is $g(\mu) = \ln\frac{\mu}{1-\mu}$, known as the logit link.

It is clear that $g(\mathbb E(y_i)) = \beta^Tx_i$.
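The relation $g(\mathbb E(y_i)) = \beta^Tx_i$ for the logit link can be sketched in a few lines. The coefficients and covariates below are hypothetical values chosen only for illustration.

```python
import math

def mean_function(z):
    """zeta'(eta(z)) for logistic regression, with eta(z) = z."""
    return 1 / (1 + math.exp(-z))

def logit(mu):
    """Canonical link g, the inverse of mean_function."""
    return math.log(mu / (1 - mu))

beta = [0.5, -1.2]    # hypothetical coefficients
x = [1.0, 0.8]        # hypothetical covariates (first entry is the intercept)
linear_predictor = sum(b * xi for b, xi in zip(beta, x))
mu = mean_function(linear_predictor)   # E(y_i)
recovered = logit(mu)                  # equals the linear predictor
```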


### Maximum Likelihood Estimator

For any fixed value of $\phi$, the MLE for the parameter $\beta$ does not depend on $\phi$. Thus we can treat $\phi$ as a constant, and the parametric family as an exponential family, when computing the MLE for $\beta$.

**Proof** The joint log-likelihood is given by

$$\log\ell = \sum_{i=1}^n \log f(y_i) = \sum_{i=1}^n\left[\log h(y_i,\phi)
+\frac{y_i\eta_i - \zeta(\eta_i) }{\phi}\right]  
=\sum_{i=1}^n \log h(y_i,\phi) + \frac{1}{\phi} \sum_{i=1}^n [{y_i \eta_i - \zeta(\eta_i) }].$$

Since $\eta_i = \eta(\beta^Tx_i)$ and the first sum does not involve $\beta$, the MLE for $\beta$ is, whatever the value of $\phi>0$, the maximizer of $\sum_{i=1}^n [{y_i\eta(\beta^Tx_i) - \zeta(\eta(\beta^Tx_i)) }]$.

<br>

Also, let

$$\begin{aligned}
M_n(\beta) &= \sum_{i=1}^n  [\eta'(\beta^Tx_i)] ^2 \zeta'' (\eta_i)x_ix_i^T
\quad{\rm and}\quad R_n(\beta)  = \sum_{i=1}^n [y_i - \zeta'(\eta_i)]\eta''(\beta^Tx_i)x_ix_i^T.
\end{aligned}$$

Then $\frac{\partial^2}{\partial \beta\partial \beta^T}\log \ell = [R_n(\beta) - M_n(\beta)] / \phi$.

**Proof**

$$\begin{aligned} \frac{\partial\log\ell}{\partial \beta^T} &= \frac{1}{\phi}\frac{\partial }{\partial \beta^T} \sum_{i=1}^n [{y_i\eta(\beta^Tx_i) - \zeta(\eta(\beta^Tx_i)) }]   =\frac{1}{\phi} \sum_{i=1}^n\left[  y_i\eta'(\beta^Tx_i)x_i^T - \zeta' (\eta(\beta^Tx_i))\eta'(\beta^Tx_i)x_i^T
\right] \\ &  =\frac{1}{\phi} \sum_{i=1}^n  [ y_i  - \zeta' (\eta(\beta^Tx_i))]\eta'(\beta^Tx_i)x_i^T
\\ 
\frac{\partial^2\log\ell}{\partial \beta\partial \beta^T} &= \frac{1}{\phi} \frac{\partial }{\partial \beta} \sum_{i=1}^n  [ y_i  - \zeta' (\eta(\beta^Tx_i))]\eta'(\beta^Tx_i)x_i^T\\ &=  \frac{1}{\phi} \sum_{i=1}^n \left\{-\zeta''(\eta(\beta^Tx_i))[\eta'(\beta^Tx_i)]^2x_ix_i^T +  [ y_i  - \zeta' (\eta(\beta^Tx_i))]\eta''(\beta^Tx_i)x_ix_i^T\right\}
 \end{aligned}
$$

Remark: note that $\eta'' = 0$ when $g$ is canonical, i.e. $\eta(z) = z$, so in this case $R_n(\beta) = 0$ and $\frac{\partial^2 \log \ell}{\partial \beta\partial \beta^T} =-M_n(\beta)/\phi$. Further, when $\eta(z) = z$ and $\zeta$ is convex, $M_n(\beta)$ is positive semidefinite, so $-\log\ell$ is convex in $\beta$.
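With the canonical link, the score and $M_n(\beta)$ are all that Newton's method needs: each step solves $M_n(\beta)\,\delta = \sum_i [y_i - \zeta'(\eta_i)]x_i$ and updates $\beta \leftarrow \beta + \delta$ (the factor $1/\phi$ cancels). The following is a minimal sketch for logistic regression, assuming NumPy; the function name `fit_logistic_newton`, the simulated data and the stopping rule are illustrative assumptions, not the notes' prescribed algorithm.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Newton iterations for logistic regression with the canonical link."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))           # zeta'(eta_i)
        score = X.T @ (y - p)                     # sum_i [y_i - zeta'(eta_i)] x_i
        M = X.T @ (X * (p * (1 - p))[:, None])    # M_n(beta) = sum_i zeta''(eta_i) x_i x_i^T
        step = np.linalg.solve(M, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])                 # hypothetical coefficients
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta_hat = fit_logistic_newton(X, y)
p_hat = 1 / (1 + np.exp(-X @ beta_hat))
score_at_fit = X.T @ (y - p_hat)                  # close to zero at the MLE
```

Because $M_n(\beta)$ is positive semidefinite, each step is an ascent direction for the log-likelihood; a production implementation would typically add step-size control and guard against separable data, where the MLE does not exist.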

## Cramér-Rao Lower Bound

### Fisher Information Matrix

Suppose all distributions of a parametric family $\mathcal P$ have p.d.f.s $f_\theta\ (\theta\in\Theta)$ with respect to a measure $\nu$. Then we define the Fisher information matrix to be 

$$I(\theta) =\mathbb E\left\{ \left[\frac{\partial}{\partial \theta}\log f_\theta(X)\right] \left[\frac{\partial}{\partial \theta}\log f_\theta(X)\right] ^T\right\}.$$

Here $\frac{\partial}{\partial \theta}\log f_\theta(X)$ is a column vector.

When $\frac{\partial ^2}{\partial \theta_i\partial\theta_j }\int f_\theta (x)dx  = \int \frac{\partial ^2}{\partial \theta_i\partial \theta_j} f_\theta (x)dx$ for arbitrary indices $i,j$ (i.e. differentiation and integration may be interchanged), we have the identity

$$I(\theta) =-\mathbb E\left\{ \frac{\partial^2}{\partial \theta\partial \theta^T}\log f_\theta(X)\right\}.$$

**Proof** We focus on the $(i,j)$ entry of $I(\theta)$:

$$\begin{aligned}I_{i,j}(\theta)  &= \int  \frac{\partial}{\partial \theta_i}\log f_\theta(x)\cdot \frac{\partial}{\partial \theta_j}\log f_\theta(x) \cdot f_\theta (x)dx= \int \frac{ \frac{\partial}{\partial \theta_i} f_\theta(x)\cdot \frac{\partial}{\partial \theta_j} f_\theta(x)}{f^2_\theta(x)}f_\theta(x) dx
\\ &=  \int - \frac{f_\theta(x) \frac{\partial ^2}{\partial \theta_i\partial \theta_j}f_\theta(x) - \frac{\partial}{\partial \theta_i} f_\theta(x)\cdot \frac{\partial}{\partial \theta_j} f_\theta(x)}{f^2_\theta(x)}f_\theta(x) dx+\int  \frac{ f_\theta(x)\frac{\partial ^2}{\partial \theta_i\partial \theta_j}f_\theta(x) }{f^2_\theta(x)}f_\theta(x)dx.\end{aligned}$$

Observe that $\frac{\partial^2\log f_\theta(x)}{\partial \theta_i\partial \theta_j} = \frac{f_\theta(x) \frac{\partial ^2}{\partial \theta_i\partial \theta_j}f_\theta(x)- \frac{\partial}{\partial \theta_i} f_\theta(x)\cdot \frac{\partial}{\partial \theta_j} f_\theta(x)}{f^2_\theta(x)}$, so the first term equals $-\mathbb E\left\{\frac{\partial^2}{\partial \theta_i\partial \theta_j}\log f_\theta(X)\right\}$. For the second term, $\int  \frac{ f_\theta(x)\frac{\partial ^2}{\partial \theta_i\partial \theta_j}f_\theta(x) }{f^2_\theta(x)}f_\theta(x)dx
= \int \frac{\partial ^2}{\partial \theta_i\partial\theta_j }f_\theta (x)dx = \frac{\partial ^2}{\partial \theta_i\partial\theta_j }\int f_\theta (x)dx = \frac{\partial ^2}{\partial \theta_i\partial\theta_j }1 = 0$. We conclude that $I(\theta) =-\mathbb E\left\{ \frac{\partial^2}{\partial \theta\partial \theta^T}\log f_\theta(X)\right\}$.
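The two expressions for the Fisher information can be compared on a small example. This is a sketch for a single Bernoulli($p$) observation parameterized by $p$ (a density with respect to counting measure); the value `p = 0.3` is arbitrary.

```python
def score(y, p):
    """d/dp of log f_p(y) = y log p + (1 - y) log(1 - p)."""
    return y / p - (1 - y) / (1 - p)

def second_derivative(y, p):
    """d^2/dp^2 of log f_p(y)."""
    return -y / p**2 - (1 - y) / (1 - p)**2

p = 0.3
weights = {0: 1 - p, 1: p}   # P(Y = 0), P(Y = 1)

# E[(score)^2] and -E[second derivative], computed exactly over {0, 1}.
outer_form = sum(w * score(y, p)**2 for y, w in weights.items())
hessian_form = -sum(w * second_derivative(y, p) for y, w in weights.items())
```

Both quantities equal $\frac{1}{p(1-p)}$, the Fisher information of a Bernoulli observation with respect to $p$.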

#### Change of Variable

If $\theta = \psi(\eta)$ is a function of $\eta$ and $\psi$ is differentiable, then the Fisher information of $\eta$ is given by 

$$I_\eta =  \left[\frac{\partial \psi}{\partial \eta}\right] I_\theta \left[\frac{\partial \psi}{\partial \eta}\right]^T.$$
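As a worked illustration in the scalar case (a sketch using the Bernoulli family from the logistic example), take $\theta = p$ and $p = \psi(\eta) = \frac{1}{1+e^{-\eta}}$. Then $\psi'(\eta) = p(1-p)$ and $I_p = \frac{1}{p(1-p)}$, so the formula gives

$$I_\eta = \psi'(\eta)\, I_p\, \psi'(\eta) = \frac{[p(1-p)]^2}{p(1-p)} = p(1-p),$$

which agrees with ${\rm Var}(y) = \zeta''(\eta)$ for the Bernoulli distribution.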

<br>

### Cramér-Rao Lower Bound

**Theorem** Let $X = (X_1,\dotsc,X_n)$ be a sample from $\mathbb P_\theta$, where $\mathbb P_\theta$ is a member of the parametric family $\mathcal P$. Suppose that $T(X)$ is an estimator with expectation $\mathbb E_{\mathbb P_\theta}(T(X))= g(\theta)$, where $g$ is differentiable. Assume each $\mathbb P_\theta$ has a p.d.f. $f_\theta$ that allows interchanging differentiation and integration,
$$\frac{\partial}{\partial \theta}\int h(x)f_\theta (x)dx =\int h(x) \frac{\partial}{\partial \theta}f_\theta (x)dx\quad\quad \theta\in \Theta$$
for both $h(x) \equiv 1$ and $h(x) = T(x)$, and that the Fisher information matrix $I(\theta)$ is strictly positive definite for all $\theta$. Then we have the following bound:
$${\rm Var}(T(X)) \geqslant \left[\frac{\partial g}{\partial \theta}\right]^T[I(\theta)]^{-1}\left[\frac{\partial g}{\partial \theta}\right]\quad\quad\theta\in\Theta.$$

**Proof of univariate case** We prove the case where $\theta\in\mathbb R$ is univariate, in which the inequality becomes ${\rm Var}(T(X))I(\theta)\geqslant [g'(\theta)]^2$. To this end we show that $I(\theta) = {\rm Var}(\frac{\partial }{\partial \theta }\log f_\theta)$ and $g'(\theta) = {\rm Cov}(T(X), \frac{\partial}{\partial \theta }\log f_\theta)$.

Taking $h(x) \equiv 1$ in the condition, we have 

$$\mathbb E\left\{\frac{\partial}{\partial \theta }\log f_\theta\right\}
= \int \frac{\partial}{\partial \theta }\log f_\theta (x)\cdot f_\theta(x)dx
= \int \frac{\frac{\partial f_\theta (x)}{\partial \theta } }{f_\theta(x)} \cdot f_\theta(x)dx
= \int \frac{\partial}{\partial \theta }f_\theta (x) dx
= \frac{\partial}{\partial \theta } \int f_\theta(x)dx = \frac{\partial}{\partial \theta }1 = 0.$$

Thus, $I(\theta) = \mathbb E\left\{(\frac{\partial}{\partial \theta }\log f_\theta)^2\right\} = {\rm Var}(\frac{\partial}{\partial \theta }\log f_\theta)$.

Similarly, taking $h(x)\equiv T(x)$ in the condition,

$${\rm Cov}\left\{T(X), \frac{\partial}{\partial \theta }\log f_\theta\right\} 
=\int  (T(x) - g(\theta))  \cdot \frac{\partial}{\partial \theta }\log f_\theta(x)\cdot f_\theta(x)dx=\int  (T(x) - g(\theta))\frac{\partial f_\theta(x)}{\partial \theta } dx =\frac{\partial}{\partial \theta } \int   T(x)f_\theta(x)dx - g(\theta)\frac{\partial}{\partial \theta } \int f_\theta(x)dx = g'(\theta).$$

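By the Cauchy-Schwarz inequality, $[g'(\theta)]^2 = {\rm Cov}\left(T(X), \frac{\partial}{\partial \theta }\log f_\theta\right)^2\leqslant {\rm Var}(T(X))\,I(\theta)$. Hence the theorem holds.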
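The bound can be seen to be attained in a simple case. The sketch below takes $n$ i.i.d. Bernoulli($p$) observations and the sample mean as $T$, so $g(p) = p$. It relies on the standard fact, not derived above, that the Fisher information of $n$ i.i.d. observations is $n$ times that of one observation; the values of `n` and `p` are arbitrary.

```python
import math

n, p = 10, 0.3

# Exact distribution of the sample mean T = S / n, with S ~ Binomial(n, p).
pmf = [math.comb(n, s) * p**s * (1 - p)**(n - s) for s in range(n + 1)]
mean_T = sum(w * s / n for s, w in enumerate(pmf))
var_T = sum(w * (s / n - mean_T)**2 for s, w in enumerate(pmf))

fisher_n = n / (p * (1 - p))   # information of n i.i.d. Bernoulli(p) draws (assumed additivity)
bound = 1.0 / fisher_n         # g(p) = p, so g'(p) = 1
```

Here `var_T` coincides with `bound`, so the sample mean attains the Cramér-Rao lower bound in this model.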

<br>

### Exponential Family

Consider the exponential family $\mathcal P$ with p.d.f. $f_\theta(x) = h(x)\exp\{\eta^T(\theta) T(x) - \xi(\theta)\}$. Assume $\eta$ is a bijection and all the regularity conditions required are satisfied. Then the Fisher information for $\eta(\theta)$ is the variance-covariance matrix ${\rm Var}(T(X))$. And the Fisher information for $\mathbb E(T(X))$ is ${\rm Var}(T(X))^{-1}$.

**Proof** For $\eta(\theta)$, first we recall that we have proved $\mathbb E(T(X)) = \frac{\partial \xi(\theta)}{\partial \eta}$ in the previous section (statistical property of GLM), so

$$\frac{\partial \log f}{\partial \eta} = T(x)  -\frac{\partial \xi(\theta)}{\partial \eta}
=T(x) -\mathbb E(T(X))\quad\Rightarrow\quad I_\eta = \mathbb E\left\{[T(X) -\mathbb E(T(X))][T(X) -\mathbb E(T(X))]^T\right\} = {\rm Var}(T(X)).$$

For $\mathbb E(T(X))$, write $\vartheta = \mathbb E(T(X))=\frac{\partial \xi(\theta)}{\partial \eta}$, viewed as a differentiable function of $\eta$. By the change of variable formula we have the identity

$$I_\eta  = \left[\frac{\partial \vartheta}{\partial \eta}\right]I_{\vartheta} \left[\frac{\partial \vartheta}{\partial \eta}\right]^T\quad\Rightarrow\quad I_{\vartheta}   = \left[\frac{\partial \vartheta}{\partial \eta}\right]^{-1} I_\eta\left[\frac{\partial \vartheta}{\partial \eta}\right]^{-T}.
$$

Note that $\frac{\partial \vartheta}{\partial \eta}=\frac{\partial^2 \xi(\theta)}{\partial \eta\partial \eta^T}$ and $I_\eta = {\rm Var}(T(X))$, so it remains to show that $\frac{\partial^2 \xi(\theta)}{\partial \eta\partial \eta^T} = {\rm Var}(T(X))$; substituting then gives $I_\vartheta = {\rm Var}(T(X))^{-1}\,{\rm Var}(T(X))\,{\rm Var}(T(X))^{-1} = {\rm Var}(T(X))^{-1}$. This identity was proved in the previous section on the statistical property of GLM, and in the chapter on exponential families.
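As a worked illustration (the Poisson family is not discussed above and is used here only as an example), write the Poisson p.m.f. with mean $\lambda$ as

$$f_\lambda(x) = \frac{1}{x!}\exp\{x\log\lambda - \lambda\},\quad\quad x = 0,1,2,\dotsc,$$

so that $T(x) = x$, $\eta = \log\lambda$ and $\xi = \lambda = e^{\eta}$. Since $\vartheta = \mathbb E(T(X)) = \lambda$ and $\frac{\partial \lambda}{\partial \eta} = e^\eta = \lambda$, we get

$$I_\eta = {\rm Var}(T(X)) = \lambda,\quad\quad I_\lambda = \left[\frac{\partial \lambda}{\partial \eta}\right]^{-1} I_\eta \left[\frac{\partial \lambda}{\partial \eta}\right]^{-1} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda} = {\rm Var}(T(X))^{-1}.$$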

### Asymptotic Optimality