# Logistic Regression

## Summary

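A logistic regression model estimates the conditional probability $P(Y|X)$ of a binary response $Y$. The log-odds (logit) of this probability are modeled as a linear function of $X$, which guarantees that the implied probability lies between 0 and 1. The coefficients of the linear function are estimated via maximum likelihood.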


## Detailed summary

$$\log(\frac{p(X)}{1 - p(X)}) = \beta_0 + \beta_1 X$$

The odds $\frac{p(X)}{1 - p(X)}$ increase monotonically with $p(X)$, and so do the log-odds $\beta_0 + \beta_1 X$.
Conversely, the linear predictor $\beta_0 + \beta_1 X$ can take any real value, while the implied $p(X)$ always stays strictly between 0 and 1.
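
As a quick numerical check of these statements, the sketch below evaluates the logit and its inverse (the logistic, or sigmoid, function) with NumPy. The function names `logit` and `sigmoid` and the coefficient values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def logit(p):
    # Log-odds of a probability p in (0, 1)
    return np.log(p / (1.0 - p))

def sigmoid(t):
    # Inverse of the logit: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative coefficients (assumed values)
beta0, beta1 = -1.0, 2.0
x = np.linspace(-5.0, 5.0, 11)

p = sigmoid(beta0 + beta1 * x)   # p(x) for each x
log_odds = logit(p)              # recovers beta0 + beta1 * x
```

Here `p` stays inside $(0, 1)$ for every `x`, and applying `logit` to it gives back the linear predictor.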

We train this via MLE. Conditional on $X = x$, the response $Y$ has the distribution $$Y = \begin{cases} 0 & \hbox{with probability } 1 - p(x)\\ 1 & \hbox{with probability } p(x) \end{cases}$$

where $P(Y=1 | X=x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = p(x)$.

Equivalently, $Y_i \sim Bernoulli(p(x_i))$, so $P(Y_i=k) = p(x_i)^k (1 - p(x_i))^{1 - k}$ for $k \in \{0, 1\}$.

Because the samples are independent, the likelihood is the product over all samples of $P(Y_i = y_i)$:

$$L(\beta) = \prod_{i=1}^n P(Y_i = y_i) = \prod_{i=1}^n \{ p(x_i)^{y_i} [1 - p(x_i)]^{1 - y_i} \} = \prod_{i: y_i = 1} p(x_i) \prod_{i: y_i = 0} [1 - p(x_i)]$$

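From here, we proceed as usual: take the log-likelihood and set its derivative with respect to $\vec{\beta}$ to zero. The resulting equations have no closed-form solution in general, so they are solved numerically.

Maximum likelihood is therefore an optimization problem whose solution is the estimate of $\vec{\beta}$.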
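
The likelihood can be evaluated directly. The sketch below uses made-up data and arbitrary coefficient values (both assumptions for illustration) to compute the log-likelihood and to check that the product over all samples agrees with the product split by class.

```python
import numpy as np

def p_of_x(x, beta0, beta1):
    # p(x) = e^{b0 + b1 x} / (1 + e^{b0 + b1 x})
    t = beta0 + beta1 * x
    return np.exp(t) / (1.0 + np.exp(t))

def log_likelihood(beta0, beta1, x, y):
    # log L = sum_i [ y_i log p(x_i) + (1 - y_i) log(1 - p(x_i)) ]
    p = p_of_x(x, beta0, beta1)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up data for illustration only
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 1, 0, 1, 1])

p = p_of_x(x, 0.0, 1.0)
# Product over all samples versus product split by class
full_product = np.prod(p**y * (1 - p)**(1 - y))
split_product = np.prod(p[y == 1]) * np.prod(1 - p[y == 0])
```

Working with the log-likelihood rather than the raw product avoids underflow when the number of samples is large.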

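The coefficient $\beta_1$ describes the effect of the predictor $X$ on the log-odds of the class $Y=1$: a one-unit increase in $X$ changes the log-odds by $\beta_1$, which multiplies the odds by $e^{\beta_1}$.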



## Multiple classes



$p_k(X) = Pr(Y=k|X) = \frac{e^{\beta_k^T X}}{\sum_{j=1}^K e^{\beta_j^T X}}$

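Each $p_k(x)$ is proportional to $e^{\beta_k^T x}$, so the normalizing sum cancels in any ratio of class probabilities.

The log-odds of class $k$ relative to a reference class $K$ are therefore $\log \frac{p_k(x)}{p_K(x)} = (\beta_k - \beta_K)^T x = (\beta_k^\star)^T x$, where $\beta_k^\star = \beta_k - \beta_K$. For example, with $K = 3$ the log-odds of classes 1 and 2 against class 3 are given by $\beta_1^\star$ and $\beta_2^\star$.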
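
The relation $\beta_k^\star = \beta_k - \beta_K$ can be checked numerically. In the sketch below, the coefficient matrix `B`, the input `x`, and the helper name `class_probs` are all made up for illustration. Subtracting $\beta_K$ from every coefficient vector leaves the class probabilities unchanged, and the log-odds against class $K$ come out linear in $x$.

```python
import numpy as np

def class_probs(B, x):
    # B has one row of coefficients per class; returns p_k(x) for k = 1..K
    scores = B @ x
    scores = scores - scores.max()   # stabilizes exp without changing the result
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
K, d = 3, 4
B = rng.normal(size=(K, d))          # made-up coefficients beta_1..beta_K
x = rng.normal(size=d)

B_star = B - B[-1]                   # beta_k^* = beta_k - beta_K; last row becomes zero
p = class_probs(B, x)
p_star = class_probs(B_star, x)

# Log-odds of each class relative to the reference class K
log_odds_vs_K = np.log(p / p[-1])
```

Because only differences of coefficients matter, one class can be fixed as the reference with coefficients equal to zero.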




In [None]:
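import numpy as np

def softmax(X):
    # Softmax along axis 0 (class scores are assumed to run along the first axis).
    # Subtracting the maximum avoids overflow in exp and leaves the result unchanged.
    E = np.exp(X - np.max(X, axis=0, keepdims=True))
    return E / np.sum(E, axis=0, keepdims=True)

class LogisticRegression:
    # Minimal sketch: binary logistic regression fit by batch gradient ascent
    # on the mean log-likelihood (one possible choice of optimizer).
    def __init__(self):
        self.beta = None  # coefficients, intercept first

    def fit(self, X, y, learning_rate=0.01, num_iters=100):
        Z = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
        self.beta = np.zeros(Z.shape[1])
        for _ in range(num_iters):
            pi = 1.0 / (1.0 + np.exp(-Z @ self.beta))
            # Gradient of the log-likelihood is Z'(y - pi); average over samples
            self.beta += learning_rate * Z.T @ (y - pi) / len(y)
        return self

    def forward(self, X):
        # Returns P(Y = 1 | X) for each row of X
        Z = np.column_stack([np.ones(len(X)), X])
        return 1.0 / (1.0 + np.exp(-Z @ self.beta))


def logistic_regression():
    pass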

## Iterative Reweighted Least Squares (IRLS)

### Background Ideas

In a generalized linear model (GLM), the idea is to relate a general response variable $y_i$ to a set of covariates, in order to get a predictive model similar to the one provided by simple linear regression.

Assume $n$ observations of a response variable, $y_1, y_2, ..., y_n$, and $k$ explanatory variables $x_1, x_2, ..., x_k$ with unknown parameters $\beta_0, \beta_1, ..., \beta_k$.

A GLM has three parts:

1. Random component: the distribution of the response $y_i$.

2. Systematic component: the explanatory variables form a linear predictor $\beta_0 + \beta_1 x_1 + ... + \beta_k x_k$.

3. Link function: connects the mean of the random component to the systematic component.

**Example 1:** In linear regression, $y_i \sim N(\mu_i, \sigma^2)$ with $\mu_i = E[y_i]$ (this is the random component), and the systematic component is $\beta_0 + \beta_1 x_1 + ... + \beta_k x_k$. In linear regression, we set the mean of the random component equal to the systematic component. Therefore, the link function is the identity function.

**Example 2:** Let's say $y_i \sim Ber(\pi_i)$, so $E[y_i] = \pi_i$ and we know $\pi_i \in [0, 1]$. This is the random component. The systematic component again is $\beta_0 + \beta_1 x_1 + ... + \beta_k x_k$. However, we cannot set these equal (i.e., use the identity as the link function) because the linear predictor is not restricted to $[0, 1]$. Instead, we use the logit link function where

$$logit(\pi) = \log(\frac{\pi}{1 - \pi})$$

This is the log-odds of success. Now, assume that the responses $y_i | z_i \sim Ber(\pi_i)$, $i=1, ..., n$, are independent, where $\pi_i$ is related to the covariates $z_i$ by the logit link function, i.e.

$$logit (\pi_i) = \log(\frac{\pi_i}{1 - \pi_i}) = \vec{z}_i^\prime \vec{\beta}$$

For simplicity, we may assume that $\vec{z}_i = (1, z_i)^\prime$ and $\vec{\beta} = (\beta_0, \beta_1)^\prime$. The logit link thus equates the natural parameter of the Bernoulli distribution, viewed as a member of the exponential family, with the linear predictor built from the covariates.

The log-likelihood is $$l(\vec{\beta}) = \vec{y}^\prime Z \vec{\beta} - \vec{b}^\prime \vec{1}$$ where $\vec{1}$ is a vector of ones, $\vec{y} = (y_1, y_2, ..., y_n)^\prime$, $Z$ is the $n \times 2$ matrix whose $i$th row is $\vec{z}_i^\prime$, and $\vec{b} = (b_1, ..., b_n)^\prime$ with $b_i = -\log(1 - \pi_i)$.


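To see this, start from the Bernoulli log-likelihood and substitute the logit link:

$$\begin{align*}
l(\vec{\beta}) &= \sum_{i=1}^n \left[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right]\\
&= \sum_{i=1}^n \left[ y_i \log\left(\frac{\pi_i}{1 - \pi_i}\right) + \log(1 - \pi_i) \right]\\
&= \sum_{i=1}^n y_i \vec{z}_i^\prime \vec{\beta} - \sum_{i=1}^n b_i
\end{align*}$$

with $b_i = -\log(1 - \pi_i)$. Stacking the terms over $i$ gives the matrix expression.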
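
The matrix form of the log-likelihood can also be checked against the per-observation sum. The sketch below uses simulated data. The data, the coefficient values, and the variable names are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
z = rng.normal(size=n)
Z = np.column_stack([np.ones(n), z])      # i-th row is (1, z_i)
beta = np.array([0.5, -1.2])              # arbitrary coefficients for the check

pi = 1.0 / (1.0 + np.exp(-Z @ beta))      # pi_i from the logit link
y = rng.binomial(1, pi)                   # simulated Bernoulli responses

# Per-observation form: sum_i [ y_i log pi_i + (1 - y_i) log(1 - pi_i) ]
ll_sum = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

# Matrix form: y' Z beta - b' 1, with b_i = -log(1 - pi_i)
b = -np.log(1.0 - pi)
ll_matrix = y @ Z @ beta - b @ np.ones(n)
```

The two values agree up to floating-point error.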

We want to use Newton's method to find the MLE $\hat{\beta}$, which maximizes the log-likelihood.

To do this, we need to find the first and second derivatives.

The score function is $$l^\prime(\beta) = Z^\prime(y - \vec{\pi})$$ where $Z$ is the matrix whose $i$th row is $\vec{z}_i^\prime$ and $\vec{\pi}$ is the column vector of the Bernoulli probabilities $\pi_i$.

The Hessian matrix is given by $$l^{\prime \prime}(\beta) = \frac{d}{d \beta}( Z^\prime (y - \vec{\pi})) = - Z^\prime W Z$$ where $W$ is the diagonal matrix with $i$th diagonal entry $\pi_i (1 - \pi_i)$.
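
As a sanity check, the score and Hessian formulas can be compared with central finite differences of the log-likelihood. The sketch below uses simulated data. The function names, the data, and the step size `eps` are illustrative assumptions.

```python
import numpy as np

def log_lik(beta, Z, y):
    eta = Z @ beta
    # l(beta) = sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ], since b_i = log(1 + exp(eta_i))
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def score(beta, Z, y):
    pi = 1.0 / (1.0 + np.exp(-Z @ beta))
    return Z.T @ (y - pi)

def hessian(beta, Z):
    pi = 1.0 / (1.0 + np.exp(-Z @ beta))
    W = np.diag(pi * (1.0 - pi))
    return -Z.T @ W @ Z

rng = np.random.default_rng(2)
n = 30
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.integers(0, 2, size=n)
beta = np.array([0.3, -0.7])

# Central finite differences along each coordinate direction
eps = 1e-6
I = np.eye(2)
num_score = np.array([
    (log_lik(beta + eps * I[j], Z, y) - log_lik(beta - eps * I[j], Z, y)) / (2 * eps)
    for j in range(2)
])
num_hess = np.column_stack([
    (score(beta + eps * I[j], Z, y) - score(beta - eps * I[j], Z, y)) / (2 * eps)
    for j in range(2)
])
```

Note that `hessian` does not take `y` as an argument: the second derivative depends on the data only through the covariates.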

With these, we apply the Newton-Raphson method (see root-finding page).

The Newton update is given by:

$$\begin{align*}
\beta^{(t+1)} &= \beta^{(t)} - [l^{\prime \prime} (\beta^{(t)}) ]^{-1} l^\prime(\beta^{(t)})\\
&= \beta^{(t)} + (Z^\prime W^{(t)} Z)^{-1} (Z^\prime (y - \pi^{(t)}))
\end{align*}$$

where $\pi^{(t)}$ is the value of $\pi$ corresponding to $\beta^{(t)}$ and $W^{(t)}$ is the diagonal matrix with entries $\pi_i^{(t)} (1 - \pi_i^{(t)})$.

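As a reminder, OLS gives $\hat{\beta} = (Z^\prime Z)^{-1} Z^\prime y$. The Newton-Raphson update has the same structure: with the diagonal weight matrix $W^{(t)} = diag(\pi_i^{(t)} (1 - \pi_i^{(t)}))$ and the working response $x^{(t)} = Z \beta^{(t)} + (W^{(t)})^{-1} (y - \pi^{(t)})$, the update can be written as $\beta^{(t+1)} = (Z^\prime W^{(t)} Z)^{-1} Z^\prime W^{(t)} x^{(t)}$, which is a weighted least squares fit of $x^{(t)}$ on $Z$.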
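
The Newton update can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the `irls.R` script referenced in these notes: the function name `irls_logistic`, the stopping tolerance, the zero starting value, and the simulated data are all made up for the example.

```python
import numpy as np

def irls_logistic(Z, y, num_iters=25, tol=1e-10):
    # Newton / IRLS iterations for logistic regression.
    # Z: (n, p) design matrix including the intercept column; y: (n,) array of 0/1.
    beta = np.zeros(Z.shape[1])
    for _ in range(num_iters):
        pi = 1.0 / (1.0 + np.exp(-Z @ beta))
        w = pi * (1.0 - pi)                      # diagonal of W
        # Solve (Z' W Z) step = Z' (y - pi) instead of forming an explicit inverse
        step = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (y - pi))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data (assumed true coefficients, for illustration only)
rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=n)
Z = np.column_stack([np.ones(n), z])
true_beta = np.array([-0.5, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-Z @ true_beta)))

beta_hat = irls_logistic(Z, y)
```

At convergence the score $Z^\prime (y - \pi)$ evaluated at `beta_hat` is numerically zero. The sketch has no safeguards such as step halving, so it may fail when the classes are perfectly separable and the MLE does not exist.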

**Remarks**

1. From the update formula, it follows that finding the MLE in a GLM framework reduces to repeated weighted least squares fits, in which the diagonal entries $\pi_i (1 - \pi_i)$ of $W$ serve as the weights and are recomputed at every iteration.

2. Since the Hessian $-Z^\prime W Z$ does not depend on the responses $y$, the expected Fisher information matrix is equal to the observed Fisher information matrix, so Newton's method and Fisher scoring coincide for this model.

3. IRLS can be slow and unreliable unless the model fits the data well.

**Example:** Example 2.5 on page 37 (G + H)

irls.R


In [None]:
# Do exercise.