# Linear Discriminant Analysis

Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, commonly used for dimensionality reduction before classification. Unlike PCA, which ignores labels, LDA is a supervised algorithm: it uses the class labels to choose the projection.

The basic idea of LDA is to project the data points onto a line so that the between-class scatter is maximized and the within-class scatter is minimized.

## LDA for Binary Classification

Given a dataset,

\begin{equation}
D = \{(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \dots, (\boldsymbol{x}_m, y_m)\}, \ \boldsymbol{x}_i = (x_{i1}, x_{i2}, \dots, x_{in}),\ y_i\in\{0,1\}
\end{equation}

The mean of category $j$ is

\begin{equation}
\boldsymbol\mu_{j} = \frac{1}{N_{j}} \sum_{\boldsymbol{x} \in \mathcal{X}_{j}}\boldsymbol{x}
\end{equation}

The covariance matrix of category $j$ (left unnormalized here, i.e. not divided by $N_{j}$) is

\begin{equation}
\boldsymbol{\Sigma}_{j} = \sum_{\boldsymbol{x} \in \mathcal{X}_{j}}(\boldsymbol{x}-\boldsymbol{\mu}_{j})(\boldsymbol{x}-\boldsymbol{\mu}_{j})^\top
\end{equation}

Here $N_{j}$ is the number of examples in category $j$, and $\mathcal{X}_{j}$ is the set of examples in category $j$.
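The two definitions above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the helper name `class_mean_and_scatter` and the toy data are made up for the example, and the examples are assumed to be stored as rows of `X`.

```python
import numpy as np

def class_mean_and_scatter(X, y, label):
    """Return (mu_j, Sigma_j) for the rows of X whose label equals `label`."""
    Xj = X[y == label]              # examples of category j, shape (N_j, n)
    mu = Xj.mean(axis=0)            # mean vector, shape (n,)
    centered = Xj - mu
    sigma = centered.T @ centered   # unnormalized sum of outer products, shape (n, n)
    return mu, sigma

# Toy data, invented for illustration: three examples per category.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

mu0, sigma0 = class_mean_and_scatter(X, y, 0)
mu1, sigma1 = class_mean_and_scatter(X, y, 1)
print(mu0, mu1)
print(sigma0)
print(sigma1)
```

Because $\boldsymbol{\Sigma}_{j}$ is not normalized, it should equal `np.cov(Xj, rowvar=False) * (N_j - 1)` for the same rows.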

When all the examples are projected onto a line with direction $\boldsymbol{w}$, the mean of category $j$ becomes $\boldsymbol{w}^\top \boldsymbol{\mu}_j$ and its covariance becomes the scalar $\boldsymbol{w}^\top \boldsymbol{\Sigma}_{j} \boldsymbol{w}$.
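A quick numerical check of these two projection identities may help. The sketch below uses randomly generated data for a single hypothetical category and an arbitrary direction `w`; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Xj = rng.normal(size=(50, 3))    # 50 hypothetical examples of one category
w = rng.normal(size=3)           # an arbitrary projection direction

mu = Xj.mean(axis=0)
sigma = (Xj - mu).T @ (Xj - mu)  # unnormalized covariance, as defined above

z = Xj @ w                       # projected values, one scalar per example
proj_mean = z.mean()
proj_scatter = ((z - proj_mean) ** 2).sum()

# Both comparisons should print True.
print(np.isclose(proj_mean, w @ mu))
print(np.isclose(proj_scatter, w @ sigma @ w))
```

The direction `w` is not normalized here; the identities hold for any scaling of `w`.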

Define the between-class scatter matrix,

\begin{equation}
\boldsymbol{S}_{b} = (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)^\top
\end{equation}

Define the within-class scatter matrix,

\begin{equation}
\boldsymbol{S}_{w} = \boldsymbol{\Sigma}_{0} + \boldsymbol{\Sigma}_{1} = \sum_{\boldsymbol{x} \in \mathcal{X}_{0}}(\boldsymbol{x}-\boldsymbol{\mu}_0)(\boldsymbol{x}-\boldsymbol{\mu}_0)^\top + \sum_{\boldsymbol{x} \in \mathcal{X}_{1}}(\boldsymbol{x}-\boldsymbol{\mu}_1)(\boldsymbol{x}-\boldsymbol{\mu}_1)^\top
\end{equation}
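As a rough sketch, both scatter matrices can be computed directly from their definitions. The function name `scatter_matrices` and the toy data are assumptions made for this example; labels are assumed to be exactly 0 and 1, with examples stored as rows.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_b, S_w) for binary labels y in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    diff = mu0 - mu1
    S_b = np.outer(diff, diff)                    # between-class scatter, rank at most 1
    S_w = ((X0 - mu0).T @ (X0 - mu0)
           + (X1 - mu1).T @ (X1 - mu1))           # within-class scatter: Sigma_0 + Sigma_1
    return S_b, S_w

# Toy data, invented for illustration.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

S_b, S_w = scatter_matrices(X, y)
print(S_b)
print(S_w)
```

Both matrices are symmetric, and $\boldsymbol{S}_{b}$ is an outer product of a single vector, so its rank is at most one.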

On the line $\boldsymbol{w}$, the two categories should be as far apart as possible, which is equivalent to maximizing the distance between the projected centres of the two categories, i.e. $\max \lVert\boldsymbol{w}^\top \boldsymbol{\mu}_0 - \boldsymbol{w}^\top \boldsymbol{\mu}_1\rVert_{2}^{2}$. Points of the same category should be as close together as possible, which is equivalent to minimizing the projected covariance within each category, i.e. $\min (\boldsymbol{w}^\top \boldsymbol{\Sigma}_{0} \boldsymbol{w} + \boldsymbol{w}^\top \boldsymbol{\Sigma}_{1} \boldsymbol{w})$. Combining the two, the goal is to maximize

\begin{equation}
\begin{aligned}
\mathcal{L}(\boldsymbol{w}) &= \frac{\lVert\boldsymbol{w}^\top \boldsymbol{\mu}_0 - \boldsymbol{w}^\top \boldsymbol{\mu}_1\rVert_{2}^{2}} {\boldsymbol{w}^\top \boldsymbol{\Sigma}_{0} \boldsymbol{w} + \boldsymbol{w}^\top \boldsymbol{\Sigma}_{1} \boldsymbol{w}}\\
& = \frac{\boldsymbol{w}^\top (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)^\top \boldsymbol{w}}{\boldsymbol{w}^\top(\boldsymbol{\Sigma}_{0} + \boldsymbol{\Sigma}_{1})\boldsymbol{w}}\\
& = \frac{\boldsymbol{w}^\top\boldsymbol{S}_{b}\boldsymbol{w}} {\boldsymbol{w}^\top\boldsymbol{S}_{w}\boldsymbol{w}}
\end{aligned}
\end{equation}

In other words, the goal is to maximize the generalized Rayleigh quotient of $\boldsymbol{S}_{b}$ and $\boldsymbol{S}_{w}$.
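The objective can be explored numerically. The sketch below relies on a standard result that is not derived in the text above: when $\boldsymbol{S}_{w}$ is invertible, a maximizer is proportional to $\boldsymbol{S}_{w}^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)$. The helper `rayleigh_quotient`, the toy data, and the comparison against random directions are all illustrative assumptions.

```python
import numpy as np

def rayleigh_quotient(w, S_b, S_w):
    """Generalized Rayleigh quotient (w^T S_b w) / (w^T S_w w)."""
    return (w @ S_b @ w) / (w @ S_w @ w)

# Toy data, invented for illustration.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

X0, X1 = X[y == 0], X[y == 1]
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_b = np.outer(mu0 - mu1, mu0 - mu1)
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Candidate maximizer: solve S_w w = mu0 - mu1 (assumes S_w is invertible).
w_star = np.linalg.solve(S_w, mu0 - mu1)
best = rayleigh_quotient(w_star, S_b, S_w)

# Sanity check: no randomly drawn direction should score higher than w_star.
rng = np.random.default_rng(0)
random_ws = rng.normal(size=(1000, 2))
random_scores = np.array([rayleigh_quotient(w, S_b, S_w) for w in random_ws])

print("L(w_star)        =", best)
print("best random L(w) =", random_scores.max())
```

The quotient depends only on the direction of $\boldsymbol{w}$, not its length, so `w_star` can be rescaled freely, for example to unit norm. Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse of $\boldsymbol{S}_{w}$.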