# Gaussian Discriminant Analysis

Gaussian discriminant analysis (GDA) is a generative model typically used for classification. In this model, the feature vector $\boldsymbol{x} \in \mathbb{R}^{d \times 1}$ is assumed to follow a class-conditional multivariate Gaussian distribution:

\begin{align}
p(\boldsymbol{x} \ | \ y=c) &= \frac{1}{(2 \pi)^{d/2} |\boldsymbol{\Sigma_c}|^{1/2}} \exp \Big( - \frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu_c})^T \boldsymbol{\Sigma_c}^{-1} (\boldsymbol{x} - \boldsymbol{\mu_c}) \Big)\\
\end{align}

where $\boldsymbol{\mu_c} \in \mathbb{R}^{d \times 1}$ and $\boldsymbol{\Sigma_c} \in \mathbb{R}^{d \times d}$ are the mean vector and covariance matrix for class $c$.
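As a rough illustration, the class-conditional density can be evaluated in log space with NumPy. This is only a sketch: the function name `gaussian_log_density` and the example values are assumptions made for illustration, and the covariance matrix is assumed to be symmetric positive definite.

```python
import numpy as np

def gaussian_log_density(x, mu, sigma):
    """Log of p(x | y=c) for a single sample x with class mean mu and covariance sigma."""
    d = mu.shape[0]
    diff = x - mu
    # Solve sigma @ z = diff rather than forming the explicit inverse.
    z = np.linalg.solve(sigma, diff)
    # slogdet returns (sign, log|sigma|); the sign is +1 for a valid covariance.
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + diff @ z)

# Illustrative values (not from the text): a 2-d standard Gaussian evaluated at its mean.
mu = np.zeros(2)
sigma = np.eye(2)
log_p = gaussian_log_density(np.zeros(2), mu, sigma)
```

Working in log space avoids underflow when $d$ is large; the density itself is `np.exp(log_p)`.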

The joint distribution for GDA is:

\begin{align}
p(\boldsymbol{x} ,  \ y) &= p(\boldsymbol{x} \ | \ y) \  p(y)\\
\end{align}

where $p(y)$ is a categorical distribution over the $k$ classes that models the class priors.
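Because the model is generative, the factorization $p(\boldsymbol{x}, y) = p(\boldsymbol{x} \mid y) \, p(y)$ also describes how to draw samples: first a class label from the prior, then a feature vector from that class's Gaussian. A minimal sketch, assuming classes are labelled $0, \dots, k-1$ and using a hypothetical helper `sample_joint` with made-up parameters:

```python
import numpy as np

def sample_joint(n, priors, means, covs, seed=0):
    """Draw n pairs (x, y) from p(x, y) = p(x | y) p(y)."""
    rng = np.random.default_rng(seed)
    # y ~ Categorical(priors)
    y = rng.choice(len(priors), size=n, p=priors)
    # x | y=c ~ N(means[c], covs[c])
    x = np.stack([rng.multivariate_normal(means[c], covs[c]) for c in y])
    return x, y

# Illustrative parameters (assumptions, not from the text): two classes in 2-d.
priors = np.array([0.3, 0.7])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
covs = np.stack([np.eye(2), np.eye(2)])
X, y = sample_joint(100, priors, means, covs)
```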

## Estimating the GDA Parameters

The joint distribution for a dataset of $N$ i.i.d. samples is:

\begin{align}
p(\boldsymbol{x_1}, \ y_1, \ ..., \ \boldsymbol{x_N}, \ y_N) &= \prod_i^N p(\boldsymbol{x_i}, \ y_i)\\
&= \prod_i^N p(\boldsymbol{x_i} \ | \ y_i) \ p(y_i)
\end{align}

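Taking the log of the joint gives the log-likelihood $L$, where $\delta_{y_i, c}$ is the Kronecker delta (equal to $1$ if $y_i = c$ and $0$ otherwise):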

\begin{align}
L &= \sum_i^N \ln p(\boldsymbol{x_i} \ | \ y_i) + \sum_i^N \ln p(y_i)\\
&= \sum_i^N \sum_c^k \delta_{y_i, c} \ln p(\boldsymbol{x_i} \ | \ y_i=c) +  \sum_i^N \sum_c^k \delta_{y_i, c} \ln p(y_i=c)
\end{align}
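The log joint can be evaluated directly for a labelled dataset, which is handy for checking estimated parameters numerically. The sketch below is an assumption-laden illustration: `log_joint` is a hypothetical helper, labels are taken to be $0, \dots, k-1$, and the toy data are made up.

```python
import numpy as np

def log_joint(X, y, priors, means, covs):
    """Sum over samples of ln p(x_i | y_i) + ln p(y_i)."""
    d = X.shape[1]
    total = 0.0
    for x_i, c in zip(X, y):
        diff = x_i - means[c]
        _, logdet = np.linalg.slogdet(covs[c])
        quad = diff @ np.linalg.solve(covs[c], diff)
        log_cond = -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)
        # The Kronecker delta in the text simply selects the parameters of class y_i.
        total += log_cond + np.log(priors[c])
    return total

# Toy data (illustrative): each sample sits exactly at its class mean.
X = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([0, 1])
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [1.0, 0.0]])
covs = np.stack([np.eye(2), np.eye(2)])
L = log_joint(X, y, priors, means, covs)
```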

### Estimating the prior

Only the second term of the log joint is needed for estimating the prior:

\begin{align}
L_{prior} &= \sum_i^N \sum_c^k \delta_{y_i, c} \ln p(y_i=c)
\end{align}


Let $p(y_i=c) = \pi_c$. Since the $\pi_c$ are the parameters of a categorical distribution over $k$ classes, they must sum to one:

\begin{align}
\sum_c^k \pi_c =\sum_c^{k-1} \pi_c + \pi_k= 1\\
\end{align}

Rearranging gives:

$$\pi_k= 1 - \sum_c^{k-1} \pi_c$$

Substituting this constraint into the log prior gives:

\begin{align}
L_{prior} &= \sum_i^N \Big( \sum_c^{k-1} \delta_{y_i, c} \ln \pi_c +  \delta_{y_i, k}  \ln\big(1 - \sum_c^{k-1} \pi_c \big)\Big)
\end{align}


Maximizing the log prior by taking the derivative with respect to $\pi_j$ for $j \in \{1, \dots, k-1\}$:

\begin{align}
\frac{\partial L_{prior}}{\partial \pi_j} &= \sum_i^N \Big( \sum_c^{k-1} \delta_{y_i, c} \delta_{j, c} \frac{1}{\pi_c} -  \delta_{y_i, k}  \frac{ \sum_c^{k-1} \delta_{c, j}}{\big(1 - \sum_c^{k-1} \pi_c \big)}\Big)\\
&= \sum_i^N \Big( \delta_{y_i, j} \frac{1}{\pi_j} -  \delta_{y_i, k}  \frac{1}{\big(1 - \sum_c^{k-1} \pi_c \big)}\Big)\\
&= \frac{1}{\pi_j} \sum_i^N \delta_{y_i, j}  - \frac{1}{\big(1 - \sum_c^{k-1} \pi_c \big)} \sum_i^N  \delta_{y_i, k}  
\end{align}


Setting the derivative to $0$ and rearranging gives:


\begin{align}
\pi_j  &= \big(1 - \sum_c^{k-1} \pi_c \big) \frac{\sum_i^N \delta_{y_i, j}}{\sum_i^N  \delta_{y_i, k}  }\\
&= \pi_k \frac{\sum_i^N \delta_{y_i, j}}{\sum_i^N  \delta_{y_i, k}  }
\end{align}

Summing both sides over all $j \in \{1, \dots, k\}$ (the relation holds trivially for $j = k$):

\begin{align}
\sum_j^{k}\pi_j  &= \pi_k \frac{\sum_i^N \sum_j^{k} \delta_{y_i, j}}{\sum_i^N  \delta_{y_i, k}  } \\
&=  \pi_k \frac{N}{\sum_i^N  \delta_{y_i, k}  }
\end{align}

Using $\sum_j^{k}\pi_j = 1$ and rearranging:

\begin{align}
\pi_k = \frac{\sum_i^N  \delta_{y_i, k}}{N}
\end{align}

Substituting this result back into the expression for $\pi_j$ gives:

\begin{align}
\pi_j = \frac{\sum_i^N  \delta_{y_i, j}}{N}
\end{align}

Thus the prior probability $p(y_i = c)$ is simply the number of training samples in class $c$ divided by the total number of training samples.
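In code, the prior estimate reduces to counting labels. A minimal sketch, assuming integer labels in $\{0, \dots, k-1\}$ (the text indexes classes from $1$) and a hypothetical helper name `estimate_priors`:

```python
import numpy as np

def estimate_priors(y, k):
    """Maximum-likelihood class priors: pi_c = N_c / N."""
    # bincount gives N_c for each class; minlength keeps classes with no samples.
    counts = np.bincount(y, minlength=k)
    return counts / y.shape[0]

# Illustrative labels (not from the text).
y = np.array([0, 0, 1, 2, 2, 2])
priors = estimate_priors(y, k=3)
```

For these labels the class counts are 2, 1 and 3 out of 6 samples.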

### Estimating the Mean

Only the first term of the log joint is needed for estimating the mean $\boldsymbol{\mu_c}$:

\begin{align}
L_{cond} &= \sum_i^N \sum_c^k \delta_{y_i, c} \ln p(\boldsymbol{x_i} \ | \ y_i=c)\\
&= - \frac{1}{2} \sum_i^N \sum_c^k \delta_{y_i, c} \Big( \ln{|\boldsymbol{\Sigma_c}|} + (\boldsymbol{x_i} - \boldsymbol{\mu_c})^T \boldsymbol{\Sigma_c}^{-1} (\boldsymbol{x_i} - \boldsymbol{\mu_c}) \Big) + const.\\
&= - \frac{1}{2} \sum_i^N \sum_c^k \delta_{y_i, c} \Big( \ln{|\boldsymbol{\Sigma_c}|} + \boldsymbol{x_i}^T\boldsymbol{\Sigma_c}^{-1} \boldsymbol{x_i} - 2 \boldsymbol{\mu_c}^T \boldsymbol{\Sigma_c}^{-1} \boldsymbol{x_i} +  \boldsymbol{\mu_c}^T \boldsymbol{\Sigma_c}^{-1} \boldsymbol{\mu_c} \Big) + const.
\end{align}

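Taking the derivative with respect to $\boldsymbol{\mu_c}$ (only the terms for class $c$ survive, and $\boldsymbol{\Sigma_c}^{-1}$ is symmetric):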

\begin{align}
\frac{\partial L_{cond}}{\partial \boldsymbol{\mu_c}} &= \sum_i^N  \delta_{y_i, c} \big( \boldsymbol{\Sigma_c}^{-1} \boldsymbol{x_i} - \boldsymbol{\Sigma_c}^{-1} \boldsymbol{\mu_c} \big)\\
&= \boldsymbol{\Sigma_c}^{-1} \sum_i^N  \delta_{y_i, c} \big(  \boldsymbol{x_i} - \boldsymbol{\mu_c} \big)\\
\end{align}

Setting the derivative to $0$ and rearranging gives:
    
\begin{align}
\boldsymbol{\mu_c}  &=  \frac{\sum_i^N  \delta_{y_i, c}   \boldsymbol{x_i}}{\sum_i^N  \delta_{y_i, c}}\\
\end{align}

Thus $\boldsymbol{\mu_c}$ is simply the mean of the feature vectors of all samples from class $c$.
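The class means can be computed with a boolean mask per class. This sketch assumes labels in $\{0, \dots, k-1\}$, at least one sample per class, and a hypothetical helper name `estimate_means`:

```python
import numpy as np

def estimate_means(X, y, k):
    """Maximum-likelihood class means; row c is the mean of the samples with y == c."""
    # The mask y == c plays the role of the Kronecker delta in the formula.
    return np.stack([X[y == c].mean(axis=0) for c in range(k)])

# Illustrative data (not from the text): two samples in class 0, one in class 1.
X = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 5.0]])
y = np.array([0, 0, 1])
means = estimate_means(X, y, k=2)
```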

### Estimating the Covariance 

Once again, only the first term of the log joint is needed for estimating the covariance $\boldsymbol{\Sigma_c}$:

\begin{align}
L_{cond} &= - \frac{1}{2} \sum_i^N \sum_c^k \delta_{y_i, c} \Big( \ln{|\boldsymbol{\Sigma_c}|} + (\boldsymbol{x_i} - \boldsymbol{\mu_c})^T \boldsymbol{\Sigma_c}^{-1} (\boldsymbol{x_i} - \boldsymbol{\mu_c}) \Big) + const.\\
\end{align}

Taking the derivative w.r.t. the precision matrix $\boldsymbol{\Lambda} = \boldsymbol{\Sigma_c}^{-1}$, using $\frac{\partial \boldsymbol{v^T \Lambda v}}{\partial \boldsymbol{\Lambda}} = \boldsymbol{vv^T}$ and $\frac{\partial \ln |\boldsymbol{\Lambda}^{-1}|}{\partial \boldsymbol{\Lambda}} = - (\boldsymbol{\Lambda}^{-1})^T$ (the transpose can be dropped because $\boldsymbol{\Sigma_c}$ is symmetric):

\begin{align}
\frac {\partial L_{cond}}{\partial \boldsymbol{\Sigma_c}^{-1}} &=  \frac{1}{2} \sum_i^N \delta_{y_i, c} \Big( \boldsymbol{\Sigma_c} - (\boldsymbol{x_i} - \boldsymbol{\mu_c})(\boldsymbol{x_i} - \boldsymbol{\mu_c})^T \Big)\\
\end{align}

Setting the derivative to $0$ and rearranging gives:

\begin{align}
\boldsymbol{\Sigma_c} &=  \frac{\sum_i^N \delta_{y_i, c} (\boldsymbol{x_i} - \boldsymbol{\mu_c})(\boldsymbol{x_i} - \boldsymbol{\mu_c})^T}{\sum_i^N \delta_{y_i, c}}\\
\end{align}

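Thus $\boldsymbol{\Sigma_c}$ is the maximum-likelihood sample covariance matrix of all samples from class $c$. Note that it divides by the class count $\sum_i^N \delta_{y_i, c}$ rather than by that count minus one, so it is the biased estimator.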
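A sketch of the per-class covariance estimate follows. The helper name `estimate_covariances` and the toy data are assumptions; labels are taken to be $0, \dots, k-1$ and every class is assumed to be non-empty.

```python
import numpy as np

def estimate_covariances(X, y, k):
    """Maximum-likelihood class covariances, normalised by N_c (not N_c - 1)."""
    covs = []
    for c in range(k):
        Xc = X[y == c]
        diff = Xc - Xc.mean(axis=0)
        # Sum of outer products (x_i - mu_c)(x_i - mu_c)^T, divided by N_c.
        covs.append(diff.T @ diff / Xc.shape[0])
    return np.stack(covs)

# Illustrative data (not from the text): two samples per class.
X = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 5.0], [3.0, 5.0]])
y = np.array([0, 0, 1, 1])
covs = estimate_covariances(X, y, k=2)
```

The same matrices can be obtained from `np.cov(Xc, rowvar=False, bias=True)`; the default `bias=False` would instead divide by $N_c - 1$. With few samples per class the estimate can be singular, as it is for the second class in this toy example.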

## Linear Discriminant Analysis

Linear discriminant analysis (LDA) is the special case in which all classes share the same covariance matrix:

\begin{align}
p(\boldsymbol{x} \ | \ y=c) &= \frac{1}{(2 \pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp \Big( - \frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu_c})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu_c}) \Big)\\
\end{align}


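The covariance matrix is estimated in a slightly different way. Because $\boldsymbol{\Sigma}$ no longer depends on $c$ and $\sum_c^k \delta_{y_i, c} = 1$ for every sample, the log-determinant term sums to $N \ln |\boldsymbol{\Sigma}|$: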

\begin{align}
L_{cond} &= - \frac{N}{2} \ln{|\boldsymbol{\Sigma}|} - \frac{1}{2} \sum_i^N \sum_c^k \delta_{y_i, c}  (\boldsymbol{x_i} - \boldsymbol{\mu_c})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x_i} - \boldsymbol{\mu_c})  + const.\\
\end{align}

Taking the derivative w.r.t. $\boldsymbol{\Sigma}^{-1}$:

\begin{align}
\frac {\partial L_{cond}}{\partial \boldsymbol{\Sigma}^{-1}} &= \frac{N}{2} \boldsymbol{\Sigma} - \frac{1}{2} \sum_i^N \sum_c^k  \delta_{y_i, c} (\boldsymbol{x_i} - \boldsymbol{\mu_c})(\boldsymbol{x_i} - \boldsymbol{\mu_c})^T \\
\end{align}

Setting the derivative to $0$ and rearranging gives:

\begin{align}
\boldsymbol{\Sigma} &= \frac{1}{N} \sum_i^N \sum_c^k  \delta_{y_i, c} (\boldsymbol{x_i} - \boldsymbol{\mu_c})(\boldsymbol{x_i} - \boldsymbol{\mu_c})^T \\
&= \frac{1}{N}  \sum_c^k \sum_i^N  \delta_{y_i, c} (\boldsymbol{x_i} - \boldsymbol{\mu_c})(\boldsymbol{x_i} - \boldsymbol{\mu_c})^T \\
&= \frac{1}{N}  \sum_c^k \sum_i^N  \delta_{y_i, c} \boldsymbol{\Sigma_c}\\
&= \frac{1}{N}  \sum_c^k \boldsymbol{\Sigma_c} \sum_i^N  \delta_{y_i, c} \\
&= \sum_c^k \pi_c \boldsymbol{\Sigma_c}
\end{align}

This can be interpreted as a weighted average of the class covariance matrices from the general case, with each class weighted by its share of the training samples.
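The equivalence between the pooled estimate and the weighted average of the per-class covariances can be checked numerically. The sketch below uses two hypothetical helpers, `pooled_covariance` and `weighted_average_covariance`, with labels assumed to be $0, \dots, k-1$ and made-up data.

```python
import numpy as np

def pooled_covariance(X, y, k):
    """Shared LDA covariance: all within-class outer products, divided by N."""
    n, d = X.shape
    scatter = np.zeros((d, d))
    for c in range(k):
        Xc = X[y == c]
        diff = Xc - Xc.mean(axis=0)
        scatter += diff.T @ diff
    return scatter / n

def weighted_average_covariance(X, y, k):
    """Same quantity as a weighted average of class covariances, weights N_c / N."""
    n, d = X.shape
    sigma = np.zeros((d, d))
    for c in range(k):
        Xc = X[y == c]
        # bias=True gives the maximum-likelihood (divide by N_c) covariance.
        sigma += (Xc.shape[0] / n) * np.cov(Xc, rowvar=False, bias=True)
    return sigma

# Illustrative data (not from the text): two samples per class.
X = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 5.0], [3.0, 5.0]])
y = np.array([0, 0, 1, 1])
sigma_pooled = pooled_covariance(X, y, k=2)
sigma_weighted = weighted_average_covariance(X, y, k=2)
```

Both functions should agree up to floating-point error, which mirrors the last step of the derivation.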
