# Logistic Regression 

## Preliminary

### Statistical Learning

- [Bayesian Decision Theory (BDT)](bayesian-decision-theory)

- [Maximum Likelihood Estimation (MLE)](maximum-likelihood-estimation)

### Supervised Learning

- [Linear Discriminant](linear-discriminant)

## Logistic regression as a Gaussian classifier

Logistic regression is a classification model that models the posterior probability of the positive class and assigns labels based on the MAP rule

$$
y = \begin{cases}
1 & \sigma (f (\mathbf{x})) \geq 0.5 \\
0 & \sigma (f (\mathbf{x})) < 0.5 \\
\end{cases},
$$

where $\sigma$ is the sigmoid function and $f (\mathbf{x})$ is a linear function of the instance $\mathbf{x}$.

### MAP rule and posterior probability

Recall that the Bayes decision rule (BDR) with 0-1 loss is the MAP rule

$$
f (\mathbf{x}) = \arg\max_{y} \mathbb{P}_{Y \mid \mathbf{X}} (y \mid \mathbf{x})
$$

where $\mathbb{P}_{Y \mid \mathbf{X}} (y \mid \mathbf{x})$ is the posterior probability that the true class for instance $\mathbf{x}$ is $y$. 

For a binary classification problem, the **MAP rule** simplifies to selecting class $1$ for $\mathbf{x}$ if

$$
\begin{aligned}
& \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x}) \geq \mathbb{P}_{Y \mid \mathbf{X}} (0 \mid \mathbf{x}) 
\\
\iff & \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x}) \geq 1 - \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x})
\\
\iff & \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x}) \geq 0.5.
\end{aligned}
$$

Using Bayes' theorem, the **posterior probability** of the positive class can be expressed in terms of the class-conditional probabilities and the class priors

$$
\begin{aligned}
\mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x}) 
& = \frac{
    \mathbb{P}_{\mathbf{X} \mid Y} (\mathbf{x} \mid 1) \mathbb{P}_{Y} (1)
}{
    \mathbb{P}_{\mathbf{X}} (\mathbf{x})
} 
& [\text{Bayes' theorem}]
\\
& = \frac{
    \mathbb{P}_{\mathbf{X} \mid Y} (\mathbf{x} \mid 1) \mathbb{P}_{Y} (1)
}{
    \mathbb{P}_{\mathbf{X}, Y} (\mathbf{x}, 0) + \mathbb{P}_{\mathbf{X}, Y} (\mathbf{x}, 1)
} 
& [\text{Law of total probability}]
\\
& = \frac{
    \mathbb{P}_{\mathbf{X} \mid Y} (\mathbf{x} \mid 1) \mathbb{P}_{Y} (1)
}{
    \mathbb{P}_{\mathbf{X} \mid Y} (\mathbf{x} \mid 0) \mathbb{P}_{Y} (0) + \mathbb{P}_{\mathbf{X} \mid Y} (\mathbf{x} \mid 1) \mathbb{P}_{Y} (1)
} 
& [\text{Chain rule}]
\\
& = \left(1 + \frac{
        \mathbb{P}_{\mathbf{X} \mid Y} (\mathbf{x} \mid 0) \mathbb{P}_{Y} (0) 
    }{
        \mathbb{P}_{\mathbf{X} \mid Y} (\mathbf{x} \mid 1) \mathbb{P}_{Y} (1)
    } 
\right)^{-1}
\\
\end{aligned}
$$

### Sigmoid function

The **sigmoid function** is a saturating function that maps any real number $x$ to a value in the open interval $(0, 1)$

$$
\sigma(x) = \frac{
    1
}{
    1 + e^{- x}
}.
$$
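As a minimal sketch of the definition above, the sigmoid can be evaluated in plain Python. The function names `sigmoid` and `predict_label` are illustrative and not part of the original notes. The two-branch form is one common way to avoid overflow in `math.exp` for large negative inputs; it computes the same value as the formula.

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid 1 / (1 + exp(-x)), evaluated without overflowing math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form exp(x) / (1 + exp(x)),
    # so that exp is only ever called on a non-positive argument.
    z = math.exp(x)
    return z / (1.0 + z)


def predict_label(score: float) -> int:
    """Assign class 1 when sigmoid(score) >= 0.5, i.e. when score >= 0."""
    return 1 if sigmoid(score) >= 0.5 else 0


for x in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(x, sigmoid(x), predict_label(x))
```

Because $\sigma(0) = 0.5$ and $\sigma$ is increasing, thresholding the sigmoid output at $0.5$ is the same as thresholding the score at $0$.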

The posterior probability takes the form of a sigmoid function if we assume the class-conditional probabilities are Gaussian distributions. 

Recall that the $d$-dimensional multivariate Gaussian with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ is

$$
\mathcal{G} (\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{
    1
}{
    \sqrt{(2 \pi)^{d} \lvert \boldsymbol{\Sigma} \rvert}
} \exp \left(
    -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})
\right),
$$

which can be compactly written as follows

$$
\begin{aligned}
\mathcal{G} (\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) 
& = \frac{
    1
}{
    \sqrt{(2 \pi)^{d} \lvert \boldsymbol{\Sigma} \rvert}
} \exp \left(
    -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})
\right)
\\
& = \exp \left(
    \log \left( \left(
        (2 \pi)^{d} \lvert \boldsymbol{\Sigma} \rvert
    \right)^{-\frac{1}{2}} \right) - \frac{1}{2} \left(
        \mathbf{x} - \boldsymbol{\mu}
    \right)^{T} \boldsymbol{\Sigma}^{-1} \left(
        \mathbf{x} - \boldsymbol{\mu}
    \right)
\right)
\\
& = \exp \left(
    -\frac{1}{2} \log \left(
        (2 \pi)^{d} \lvert \boldsymbol{\Sigma} \rvert
    \right) - \frac{1}{2} \left(
        \mathbf{x} - \boldsymbol{\mu}
    \right)^{T} \boldsymbol{\Sigma}^{-1} \left(
        \mathbf{x} - \boldsymbol{\mu}
    \right)
\right)
\\
& = \exp \left(
    -\frac{1}{2} \left(
        \log \left(
            (2 \pi)^{d} \lvert \boldsymbol{\Sigma} \rvert
        \right) + d_{\boldsymbol{\Sigma}} (\mathbf{x}, \boldsymbol{\mu}) 
    \right)
\right),
\end{aligned}
$$

where $d_{\boldsymbol{\Sigma}} (\mathbf{x}, \mathbf{y}) = \left(
    \mathbf{x} - \mathbf{y} 
\right)^{T} \boldsymbol{\Sigma}^{-1} \left( 
    \mathbf{x} - \mathbf{y} 
\right)$ is the squared Mahalanobis distance between $\mathbf{x}$ and $\mathbf{y}$ with covariance matrix $\boldsymbol{\Sigma}$.

If we assume the class conditional probabilities for both classes are Gaussian distributions:

- $\mathbb{P}_{\mathbf{X} \mid Y} (\mathbf{x} \mid 0) = \mathcal{G} (\mathbf{x}; \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0})$

- $\mathbb{P}_{\mathbf{X} \mid Y} (\mathbf{x} \mid 1) = \mathcal{G} (\mathbf{x}; \boldsymbol{\mu}_{1}, \boldsymbol{\Sigma}_{1})$,

then the posterior probability is 

$$
\begin{aligned}
\mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x})
& = \left(
    1 + \frac{
        \exp \left(
            -\frac{1}{2} \left(
                \log \left(
                        (2 \pi)^{d} \lvert \boldsymbol{\Sigma}_{0} \rvert
                \right) + d_{\boldsymbol{\Sigma_{0}}} (\mathbf{x}, \boldsymbol{\mu_{0}})
            \right)
        \right) \mathbb{P}_{Y} (0)
    }{
        \exp \left(
            -\frac{1}{2} \left(
                \log \left(
                        (2 \pi)^{d} \lvert \boldsymbol{\Sigma}_{1} \rvert
                \right) + d_{\boldsymbol{\Sigma_{1}}} (\mathbf{x}, \boldsymbol{\mu_{1}})
            \right)
        \right) \mathbb{P}_{Y} (1)
    }
\right)^{-1}
\\
& = \left(
    1 + \frac{
        \exp \left(
            -\frac{1}{2} \left(
                \log \left(
                        (2 \pi)^{d} \lvert \boldsymbol{\Sigma}_{0} \rvert
                \right) + d_{\boldsymbol{\Sigma_{0}}} (\mathbf{x}, \boldsymbol{\mu_{0}})
            \right) + \log \mathbb{P}_{Y} (0)
        \right)
    }{
        \exp \left(
            -\frac{1}{2} \left(
                \log \left(
                        (2 \pi)^{d} \lvert \boldsymbol{\Sigma}_{1} \rvert
                \right) + d_{\boldsymbol{\Sigma_{1}}} (\mathbf{x}, \boldsymbol{\mu_{1}})
            \right) + \log \mathbb{P}_{Y} (1)
        \right) 
    }
\right)^{-1}
\\
& = \left(
    1 + \exp \left(
        - f (\mathbf{x})
    \right)
\right)^{-1}
\end{aligned}
$$

where $f (\mathbf{x}) = \frac{1}{2} \left(
    \alpha_{0} - \alpha_{1} 
    + d_{\boldsymbol{\Sigma_{0}}} (\mathbf{x}, \boldsymbol{\mu_{0}}) 
    - d_{\boldsymbol{\Sigma_{1}}} (\mathbf{x}, \boldsymbol{\mu_{1}})
    + 2 \log \frac{\mathbb{P}_{Y} (1)}{\mathbb{P}_{Y} (0)}
\right)
$ and $\alpha_{i} = \log \left(
    (2 \pi)^{d} \lvert \boldsymbol{\Sigma}_{i} \rvert
\right)$.
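The identity above can be checked numerically. The following is a small sketch for the one-dimensional case ($d = 1$, so each $\boldsymbol{\Sigma}_{i}$ is a scalar variance), taking $d_{\boldsymbol{\Sigma}}$ to be the quadratic form $(x - \mu)^{2} / \sigma^{2}$. The parameter values and function names are arbitrary choices for illustration, not values from the notes.

```python
import math


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior_bayes(x, mean0, var0, mean1, var1, prior1):
    """P(Y = 1 | x) computed directly from Bayes' theorem."""
    joint0 = gaussian_pdf(x, mean0, var0) * (1.0 - prior1)
    joint1 = gaussian_pdf(x, mean1, var1) * prior1
    return joint1 / (joint0 + joint1)


def f_score(x, mean0, var0, mean1, var1, prior1):
    """f(x) = 1/2 (alpha0 - alpha1 + d0 - d1 + 2 log(P(1) / P(0))) for d = 1."""
    alpha0 = math.log(2.0 * math.pi * var0)
    alpha1 = math.log(2.0 * math.pi * var1)
    d0 = (x - mean0) ** 2 / var0  # quadratic form for class 0
    d1 = (x - mean1) ** 2 / var1  # quadratic form for class 1
    log_prior_ratio = math.log(prior1 / (1.0 - prior1))
    return 0.5 * (alpha0 - alpha1 + d0 - d1 + 2.0 * log_prior_ratio)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters chosen only for this check.
params = dict(mean0=-1.0, var0=1.5, mean1=2.0, var1=0.5, prior1=0.3)
for x in (-2.0, 0.0, 1.0, 3.0):
    print(x, posterior_bayes(x, **params), sigmoid(f_score(x, **params)))
```

The two printed columns should agree up to floating-point rounding. With unequal variances, $f$ is quadratic in $x$, which is why the equal-covariance assumption is needed to obtain a linear model.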

### Linear function

If we further assume that the Gaussian distributions for both classes have the same covariance matrix $\boldsymbol{\Sigma}_{0} = \boldsymbol{\Sigma}_{1} = \boldsymbol{\Sigma}$,
then $f (\mathbf{x})$ is a linear function

$$
\begin{aligned}
f (\mathbf{x}) 
& = \frac{1}{2} \left(
    \alpha - \alpha 
    + d_{\boldsymbol{\Sigma}} (\mathbf{x}, \boldsymbol{\mu_{0}}) 
    - d_{\boldsymbol{\Sigma}} (\mathbf{x}, \boldsymbol{\mu_{1}})
    + 2 \log \frac{\mathbb{P}_{Y} (1)}{\mathbb{P}_{Y} (0)}
\right)
\\
& = \frac{1}{2} \left(
    \mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x} -
    2\mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{0} + 
    \boldsymbol{\mu}_{0}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{0} -
    \mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x} +
    2\mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{1} -
    \boldsymbol{\mu}_{1}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{1}
\right) + \log \frac{\mathbb{P}_{Y} (1)}{\mathbb{P}_{Y} (0)}
\\
& = \left( 
    \boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{0}
\right)^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x} + \frac{1}{2} \left(
    \boldsymbol{\mu}_{0}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{0} -
    \boldsymbol{\mu}_{1}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{1}
\right) + \log \frac{\mathbb{P}_{Y} (1)}{\mathbb{P}_{Y} (0)}
\\
& = \mathbf{w}^{T} \mathbf{x} + b
\end{aligned}
$$

where 

- $\mathbf{w}^{T} = \left( 
    \boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{0}
\right)^{T} \boldsymbol{\Sigma}^{-1}$,

- $b = \frac{1}{2} \left(
    \boldsymbol{\mu}_{0}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{0} -
    \boldsymbol{\mu}_{1}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{1}
\right) + \log \frac{\mathbb{P}_{Y} (1)}{\mathbb{P}_{Y} (0)}$.
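To make the shared-covariance case concrete, here is a small two-dimensional sketch in plain Python. It computes $\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{0})$ and $b$, then compares $\mathbf{w}^{T} \mathbf{x} + b$ with the log-odds obtained directly from the two Gaussians. All helper names and the numeric parameters are assumptions made for illustration.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_params(mu0, mu1, cov, prior1):
    """w = Sigma^{-1} (mu1 - mu0); b = 1/2 (mu0' S^-1 mu0 - mu1' S^-1 mu1) + log prior ratio."""
    prec = inv2(cov)
    w = matvec(prec, [m1 - m0 for m0, m1 in zip(mu0, mu1)])
    b = 0.5 * (dot(mu0, matvec(prec, mu0)) - dot(mu1, matvec(prec, mu1)))
    b += math.log(prior1 / (1.0 - prior1))
    return w, b


def log_odds(x, mu0, mu1, cov, prior1):
    """log[P(x|1) P(1)] - log[P(x|0) P(0)]; the Gaussian normalizers cancel."""
    prec = inv2(cov)
    diff0 = [xi - mi for xi, mi in zip(x, mu0)]
    diff1 = [xi - mi for xi, mi in zip(x, mu1)]
    d0 = dot(diff0, matvec(prec, diff0))
    d1 = dot(diff1, matvec(prec, diff1))
    return 0.5 * (d0 - d1) + math.log(prior1 / (1.0 - prior1))


# Hypothetical class means, shared covariance, and prior.
mu0, mu1 = [0.0, 0.0], [2.0, 1.0]
cov = [[1.0, 0.3], [0.3, 2.0]]
prior1 = 0.4

w, b = linear_params(mu0, mu1, cov, prior1)
for x in ([0.0, 0.0], [1.0, -1.0], [3.0, 2.0]):
    print(x, dot(w, x) + b, log_odds(x, mu0, mu1, cov, prior1))
```

The quadratic terms $\mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}$ cancel between the two classes, so the two printed values should match up to rounding at every point.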

## Learning of logistic regression

With the generative approach, parameters $\boldsymbol{\mu}_{0}$, $\boldsymbol{\mu}_{1}$, $\boldsymbol{\Sigma}_{0}$, and $\boldsymbol{\Sigma}_{1}$ are learned from the training set using MLE. 
In particular, the parameters for the class-conditional probability of class $j$ are learned from the training instances labeled $j$ by solving the following optimization problem

$$
\arg\max_{\boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}} \prod_{i : y_{i} = j} \mathbb{P}_{\mathbf{X} \mid Y} \left(
    \mathbf{x}_{i} \mid j
\right) = \arg\max_{\boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}} \prod_{i : y_{i} = j} \mathcal{G} \left( 
    \mathbf{x}_{i}; \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}
\right).
$$
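For a Gaussian, this maximization has a well-known closed-form solution: the sample mean and the sample covariance normalized by the number of samples (not by $n - 1$). The sketch below illustrates the generative fit in one dimension. The function names, the toy data, and the use of class frequencies as priors are assumptions added for illustration.

```python
def gaussian_mle(samples):
    """MLE of a univariate Gaussian: sample mean and variance (divided by n)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var


def fit_generative(xs, ys):
    """Fit one Gaussian per class and estimate each prior by its class frequency."""
    params = {}
    for label in (0, 1):
        members = [x for x, y in zip(xs, ys) if y == label]
        mean, var = gaussian_mle(members)
        params[label] = {"mean": mean, "var": var, "prior": len(members) / len(xs)}
    return params


# Toy one-dimensional training set (hypothetical values).
xs = [0.0, 1.0, 2.0, 4.0, 6.0]
ys = [0, 0, 0, 1, 1]
params = fit_generative(xs, ys)
print(params)
```

Each class is fitted using only its own instances, which mirrors the product over $i : y_{i} = j$ in the objective.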

However, logistic regression is usually learned using a discriminative approach, where the parameters $\mathbf{w}, b$ are directly learned from the data by minimizing binary cross-entropy loss. 

### Learning as an MLE problem

Recall that learning a linear regression model can be formulated as an MLE problem

$$
\arg\max_{\mathbf{w}, b} \prod_{i} \mathbb{P}_{Y \mid \mathbf{X}} \left(
    y_{i} \mid \mathbf{x}_{i}
\right) = \arg\max_{\mathbf{w}, b} \prod_{i} \mathcal{G} \left( 
    y_{i}; \mathbf{w}^{T} \mathbf{x}_{i} + b, \sigma^{2} 
\right),
$$

where the posterior probability of the label $\mathbb{P}_{Y \mid \mathbf{X}} \left(
    y_{i} \mid \mathbf{x}_{i}
\right)$ follows a univariate Gaussian distribution with mean $\mathbf{w}^{T} \mathbf{x}_{i} + b$ and a known variance $\sigma^{2}$ (here $\sigma^{2}$ denotes a variance, not the sigmoid function).

For logistic regression, the posterior probability of the label should be a Bernoulli distribution 

$$
\begin{aligned}
\mathbb{P}_{Y \mid \mathbf{X}} \left(
    y \mid \mathbf{x}
\right) 
& = \mathcal{B} \left( 
    y; \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x})
\right)
\\
& = \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x})^{y} \left(
    1 - \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x})
\right)^{(1 - y)}
\end{aligned}
$$

and therefore the MLE problem is defined as 

$$
\begin{aligned}
\arg\max_{\mathbf{w}, b} \prod_{i} \mathcal{B} \left( 
    y_{i}; \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x}_{i})
\right)
& = \arg\max_{\mathbf{w}, b} \sum_{i} \log \mathcal{B} \left( 
    y_{i}; \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x}_{i})
\right)
\\
& = \arg\max_{\mathbf{w}, b} \sum_{i} \log \left( \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x}_{i})^{y_{i}} \left(
    1 - \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x}_{i})
\right)^{(1 - y_{i})} \right).
\end{aligned}
$$

### Binary cross-entropy (BCE) loss

The binary cross-entropy loss is defined as 

$$
\text{BCE} (y, \hat{y}) =  - y \log \hat{y} - (1 - y) \log (1 - \hat{y})
$$

where $y \in \{0, 1\}$ is the binary label and $\hat{y} \in [0, 1]$ is the probability of the positive class.

Solving the MLE of parameters of the logistic regression problem is the same as minimizing the BCE loss

$$
\begin{aligned}
& \arg\max_{\mathbf{w}, b} \sum_{i} \log \left( \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x}_{i})^{y_{i}} \left(
    1 - \mathbb{P}_{Y \mid \mathbf{X}} (1 \mid \mathbf{x}_{i})
\right)^{(1 - y_{i})} \right)
\\
= & \arg\max_{\mathbf{w}, b} \sum_{i} \left[ y_{i} \log \sigma (f (\mathbf{x}_{i})) + (1 - y_{i}) \log \left(
    1 - \sigma (f (\mathbf{x}_{i}))
\right) \right]
\\
= & \arg\min_{\mathbf{w}, b} \sum_{i} \left[ - y_{i} \log \sigma (f (\mathbf{x}_{i})) - (1 - y_{i}) \log \left(
    1 - \sigma (f (\mathbf{x}_{i}))
\right) \right]
\\
= & \arg\min_{\mathbf{w}, b} \sum_{i} \text{BCE} (y_{i}, \sigma (f (\mathbf{x}_{i}))).
\end{aligned}
$$

Therefore, logistic regression can be learned by minimizing the BCE loss between the predicted probabilities $\sigma (f (\mathbf{x}_{i}))$ and the training labels $y_{i}$. 
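The equivalence between the summed BCE loss and the negative Bernoulli log-likelihood can be confirmed on a handful of numbers. This is only an illustrative sketch: the labels and predicted probabilities are made up, and the helper names are not from the notes.

```python
import math


def bce(y: int, y_hat: float) -> float:
    """Binary cross-entropy for one example; requires 0 < y_hat < 1."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)


def bernoulli_likelihood(y: int, p: float) -> float:
    """Bernoulli probability mass p^y (1 - p)^(1 - y)."""
    return p ** y * (1.0 - p) ** (1 - y)


# Hypothetical labels and predicted positive-class probabilities.
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.6, 0.75, 0.4]

total_bce = sum(bce(y, p) for y, p in zip(labels, probs))
likelihood = math.prod(bernoulli_likelihood(y, p) for y, p in zip(labels, probs))
negative_log_likelihood = -math.log(likelihood)

print(total_bce, negative_log_likelihood)
```

The two printed numbers should coincide up to rounding, so maximizing the likelihood and minimizing the total BCE select the same parameters.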

### Minimizing loss with gradient descent 

Unlike linear regression, the optimization problem of logistic regression 

$$
\arg\min_{\mathbf{w}, b} \sum_{i} - y_{i} \log \sigma (f (\mathbf{x}_{i})) - (1 - y_{i}) \log \left(
    1 - \sigma (f (\mathbf{x}_{i}))
\right)
$$

has no closed-form solution because of the non-linear sigmoid function. 

Instead, gradient descent is used to solve the optimization problem numerically. 
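A minimal batch gradient descent sketch is given below. It relies on the gradient of the BCE loss, $\sum_{i} (\sigma (f (\mathbf{x}_{i})) - y_{i}) \mathbf{x}_{i}$ with respect to $\mathbf{w}$ and $\sum_{i} (\sigma (f (\mathbf{x}_{i})) - y_{i})$ with respect to $b$, which is a standard result not derived in these notes. The loss is averaged over the training set, which rescales the objective without changing its minimizer. The learning rate, step count, toy data, and function names are arbitrary assumptions.

```python
import math


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Linear function f(x) = w^T x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def mean_bce(w, b, xs, ys, eps=1e-12):
    """Average BCE loss; probabilities are clipped to keep log finite."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(score(w, b, x)), eps), 1.0 - eps)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def fit(xs, ys, lr=0.1, steps=2000):
    """Batch gradient descent on the average BCE loss, starting from zeros."""
    w, b, n = [0.0] * len(xs[0]), 0.0, len(xs)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(score(w, b, x)) - y  # sigma(f(x_i)) - y_i
            grad_w = [g + err * xj for g, xj in zip(grad_w, x)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy, non-separable one-feature data set (hypothetical values).
xs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
ys = [0, 0, 1, 0, 1, 1]

loss_before = mean_bce([0.0], 0.0, xs, ys)
w, b = fit(xs, ys)
loss_after = mean_bce(w, b, xs, ys)
print(loss_before, loss_after, w, b)
```

The toy data are deliberately not linearly separable: on separable data the unregularized loss has no finite minimizer and the weights keep growing as training continues.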

TODO: waiting for convex optimization 