## Generalized Linear Models

Both linear regression - 
$$
y|x;\theta \sim N(\mu, \sigma^2)
$$
and logistic regression (binary classification)  - 
$$
y|x;\theta \sim Bernoulli(\phi)
$$
are special cases of a broader family of models called Generalized Linear Models (GLMs). 


### Exponential Family of Distributions

A class of distributions belongs to the exponential family of distributions if its density can be written as - 
$$
p(y;\eta) = b(y)exp(\eta^TT(y) - a(\eta))
$$

- $\eta$ - natural/canonical parameter
- $T(y)$ - sufficient statistic of $y$ (in most cases $T(y) = y$)
- $a(\eta)$ - log partition function ($e^{-a(\eta)}$ acts as the normalization constant)
- $b(y)$ - base measure

>**Sufficient Statistic** - Let $y$ be a random variable following a probability distribution $f(y;\theta)$. A statistic $T(y)$ is called a sufficient statistic for $\theta$ iff the density $f(y;\theta)$ can be factored as follows - 
>$$
>f(y;\theta) = u(y)v(T(y), \theta)
>$$

A fixed choice of $T$, $a$ and $b$ defines a specific family of distributions. Varying $\eta$ then gives the different distributions within this family.

The Bernoulli distribution belongs to the exponential family of distributions - 
$$\begin{align*}
p(y;\phi) &= \phi^y(1-\phi)^{(1-y)} \\
&= exp(ln(\phi^y(1-\phi)^{1-y})) \\
&= exp(ln(\frac{\phi}{1-\phi})y + ln(1 - \phi))
\end{align*}
$$

- $T(y) = y$
- $\eta = ln(\frac{\phi}{1-\phi})$
- $\phi = \frac{1}{1 + e^{-\eta}}$
- $b(y) = 1$
- $a(\eta) = -ln(1 - \phi) = ln(1 + e^\eta)$
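
The identification above can be checked numerically. The following is a minimal sketch, with illustrative function names that are not part of these notes, comparing the usual Bernoulli form with $b(y)exp(\eta T(y) - a(\eta))$ for a few values of $\phi$.

```python
import math

def bernoulli_pmf(y, phi):
    """Standard form: phi^y * (1 - phi)^(1 - y) for y in {0, 1}."""
    return phi ** y * (1 - phi) ** (1 - y)

def bernoulli_exp_family(y, phi):
    """Exponential family form b(y) * exp(eta * T(y) - a(eta))."""
    eta = math.log(phi / (1 - phi))      # natural parameter (log-odds)
    a = math.log(1 + math.exp(eta))      # log partition function
    b = 1.0                              # b(y) = 1
    return b * math.exp(eta * y - a)     # T(y) = y

# Both forms should agree for every phi in (0, 1) and y in {0, 1}
for phi in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert math.isclose(bernoulli_pmf(y, phi), bernoulli_exp_family(y, phi))
```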

The Gaussian distribution belongs to the exponential family of distributions as well. As the optimal value of $\theta$ and the hypothesis $h(x)$ don't depend on the value of $\sigma^2$, we can fix it to a constant. Let $\sigma^2 = 1$ - 
$$\begin{align*}
p(y;\mu) &= \frac{1}{\sqrt{2\pi}}exp(-\frac{1}{2}(y - \mu)^2)\\
&= \frac{1}{\sqrt{2\pi}}exp(-\frac{1}{2}y^2)exp(y\mu - \frac{1}{2}\mu^2)
\end{align*}
$$ 

- $T(y) = y$
- $\eta = \mu$
- $b(y) = \frac{1}{\sqrt{2\pi}}exp(-\frac{1}{2}y^2)$
- $a(\eta) = \frac{1}{2}{\eta^2}$
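
The same kind of sanity check works for the unit-variance Gaussian. This is only a sketch under the $\sigma^2 = 1$ assumption made above, and the function names are illustrative.

```python
import math

def gaussian_pdf(y, mu):
    """N(mu, 1) density in its usual form."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def gaussian_exp_family(y, mu):
    """Exponential family form b(y) * exp(eta * T(y) - a(eta))."""
    eta = mu                                               # natural parameter
    b = math.exp(-0.5 * y ** 2) / math.sqrt(2 * math.pi)   # b(y)
    a = 0.5 * eta ** 2                                     # a(eta)
    return b * math.exp(eta * y - a)                       # T(y) = y

# The two forms should agree for any real y and mu
for mu in (-1.0, 0.0, 2.5):
    for y in (-0.7, 0.0, 3.0):
        assert math.isclose(gaussian_pdf(y, mu), gaussian_exp_family(y, mu))
```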

Other examples of distributions belonging to the exponential family include - 
- Multinomial Distribution 
- Poisson Distribution 
- Gamma Distribution 
- Exponential Distribution 
- Beta Distribution, etc.


### Constructing a Generalized Linear Model
Let us assume that we are modelling the footfall on our website on a given day. We can use the Poisson distribution to model this count. As the Poisson distribution belongs to the exponential family, we can fit a GLM to model our problem. We make the following assumptions - 
- $y|x;\theta \sim ExponentialFamily(\eta)$ - i.e. given $x$ and $\theta$, the distribution of $y$ belongs to some exponential family with natural parameter $\eta$.
- Given $x$, our goal is to predict the expected value of $T(y)$. In most of our examples, we will have $T(y) = y$. $\therefore$ $$h(x) = E[y|x]$$ 
- The natural/canonical parameter $\eta$ is a linear function of the input $x$ - 
$$
\eta = \theta^Tx
$$
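
As a sketch of how these assumptions play out for the website-footfall example, assume the daily count $y$ is Poisson with mean $\lambda$ (the symbol $\lambda$ is introduced here only for illustration). Its probability mass function can be put in exponential family form - 
$$\begin{align*}
p(y;\lambda) &= \frac{\lambda^y e^{-\lambda}}{y!} \\
&= \frac{1}{y!}exp(y\,ln(\lambda) - \lambda)
\end{align*}
$$

- $T(y) = y$
- $\eta = ln(\lambda)$
- $b(y) = \frac{1}{y!}$
- $a(\eta) = \lambda = e^\eta$

With $\eta = \theta^Tx$, the hypothesis would then be $h(x) = E[y|x;\theta] = \lambda = e^{\theta^Tx}$.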

Using these assumptions we can derive the hypothesis $h(x)$ of linear and logistic regression.

#### Linear Regression (OLS)

Assumptions 
1. $$y|x \sim N(\mu, \sigma^2)$$
2. $$\eta = \theta^Tx$$
3. $$T(y) = y $$
$\therefore$
$$\begin{align*}
h(x) &= E[T(y)|x] \\
&= E[y|x] \\
&= \mu \\
&= \eta \\
&= \theta^Tx
\end{align*}
$$
This is the hypothesis for linear regression that we came up with in chapter 1.

#### Logistic Regression (Binary Classification)
1. $$y|x \sim Bernoulli(\phi)$$
2. $$\eta = \theta^Tx$$
3. $$T(y) = y$$
$\therefore$
$$\begin{align*}
h(x) &= E[T(y)|x;\theta] \\
&= E[y|x;\theta] \\
&= 0 \cdot (1 - \phi) + 1 \cdot \phi \\
&= \phi \\
&= \frac{1}{1 + e^{-\eta}} \\
&= \frac{1}{1 + e^{-\theta^Tx}}
\end{align*}
$$

This is the hypothesis for logistic regression (binary classification).
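
The two hypotheses derived so far differ only in how $\eta = \theta^Tx$ is mapped to the mean. A minimal sketch, where the parameter values and helper names are hypothetical and chosen only for illustration:

```python
import math

def dot(theta, x):
    """Natural parameter eta = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))

def h_linear(theta, x):
    """Gaussian GLM: h(x) = mu = eta."""
    return dot(theta, x)

def h_logistic(theta, x):
    """Bernoulli GLM: h(x) = phi = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-dot(theta, x)))

theta = [0.5, -1.0]   # hypothetical parameters
x = [2.0, 1.0]        # hypothetical input, so eta = 0.5*2 - 1.0*1 = 0
print(h_linear(theta, x))    # 0.0
print(h_logistic(theta, x))  # 0.5
```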

#### Exponential Distribution 
Probability Density Function - 
$$
p(y;\lambda) = \begin{cases}
\lambda e^{-\lambda y} &y \geq 0 \\
0  &y < 0
\end{cases}
$$
$\therefore$
$$
p(y;\lambda) = 1*e^{-\lambda y + ln(\lambda)}
$$
$\therefore$
- $T(y) = y$
- $\eta = -\lambda$
- $b(y) = 1$
- $a(\eta) = -ln(\lambda) = -ln(-\eta)$ 

Now we construct a GLM that predicts $h(x) = E[T(y)|x;\theta]$ as a function of $\eta = \theta^Tx$ for a given data point $x$ and an unknown parameter $\theta$ - 
1. $$y|x \sim Exp(\lambda)$$
2. $$\eta = \theta^Tx$$
3. 
$$\begin{align*}
h(x) &= E(y|x;\theta) \\
&= 1/\lambda \\
&= -\eta^{-1} \\
&= -\frac{1}{\theta^Tx}
\end{align*}
$$
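Since $\lambda = -\eta$ must be positive, this requires $\theta^Tx < 0$. We can use this relationship to learn the unknown parameter $\theta$ for a given dataset $(x_i, y_i)\  \forall \ i \in [1, n]$, e.g. by maximizing the likelihood.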
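
One way to learn $\theta$ is gradient ascent on the log-likelihood. The sketch below assumes the density $\lambda e^{-\lambda y}$ with $\lambda = -\theta^Tx$, so the per-example log-likelihood is $ln(-\eta_i) + \eta_i y_i$ and its gradient is $(y_i - h(x_i))x_i$. The function names, learning rate, step-halving safeguard and toy data are all illustrative choices, not part of these notes.

```python
def predict(theta, x):
    """Exponential GLM hypothesis h(x) = -1 / (theta^T x)."""
    eta = sum(t * xi for t, xi in zip(theta, x))
    if eta >= 0:
        raise ValueError("theta^T x must be negative so that lambda = -eta > 0")
    return -1.0 / eta

def fit_exponential_glm(xs, ys, theta0, lr=0.1, steps=2000):
    """Gradient ascent on the average log-likelihood mean_i [ln(-eta_i) + eta_i * y_i]."""
    theta = list(theta0)
    n = len(xs)
    for _ in range(steps):
        # Gradient w.r.t. theta: mean_i (y_i - h(x_i)) * x_i
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            residual = y - predict(theta, x)
            for j, xj in enumerate(x):
                grad[j] += residual * xj / n
        # Halve the step until every eta_i stays negative (keeps lambda_i > 0)
        step = lr
        while True:
            candidate = [t + step * g for t, g in zip(theta, grad)]
            if all(sum(t * xj for t, xj in zip(candidate, x)) < 0 for x in xs):
                break
            step /= 2
        theta = candidate
    return theta

# Toy data: two one-hot groups with sample means 2 and 4
xs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
ys = [1.0, 3.0, 3.0, 5.0]
theta = fit_exponential_glm(xs, ys, theta0=[-1.0, -1.0])
print(theta, predict(theta, [1.0, 0.0]), predict(theta, [0.0, 1.0]))
```

On this toy data the fitted predictions should land close to the two group means, since with one-hot inputs the maximum likelihood estimate of each mean is the sample mean of its group. The starting point `theta0` has to satisfy $\theta^Tx_i < 0$ for every training point.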



### Canonical Link Function & Canonical Response Function 
- **Canonical Response function** $g$ gives the relationship between the expected value of the sufficient statistic and the linear predictor $\eta = \theta^Tx$ of the model, $g(\eta) = E[T(y);\eta]$, e.g. for the Bernoulli distribution $g(\eta) = \frac{1}{1 + e^{-\eta}}$.
- **Canonical Link function** is the inverse of the corresponding canonical response function, i.e. $g^{-1}$. If $\mu$ is the mean of the distribution of $y$, then the link function for the Bernoulli distribution is given by - 
$$
\theta^Tx = g^{-1}(\mu) = ln(\frac{\mu}{1-\mu})
$$
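
The inverse relationship between the two functions can be illustrated for the Bernoulli case. A minimal sketch with illustrative function names:

```python
import math

def response_bernoulli(eta):
    """Canonical response g(eta) = 1 / (1 + exp(-eta)): maps eta to the mean."""
    return 1.0 / (1.0 + math.exp(-eta))

def link_bernoulli(mu):
    """Canonical link g^{-1}(mu) = ln(mu / (1 - mu)): maps the mean back to eta."""
    return math.log(mu / (1.0 - mu))

# Applying the link after the response should recover eta
for eta in (-2.0, 0.0, 1.5):
    mu = response_bernoulli(eta)
    assert math.isclose(link_bernoulli(mu), eta, abs_tol=1e-12)
```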