## Chapter 3 - Linear Models

_**Author:** Zitong Su_

*This note includes some formula derivations (and/or extended materials) additional to the __Machine Learning ("Watermelon Book")__ or the __Pumpkin Book__.*

### 3.4 Linear Discriminant Analysis
#### 3.4.1 Derivation of the components in formula (3.32)
**Problem:** As described in the book, Linear Discriminant Analysis (LDA) maps a set of points from a high-dimensional space $\mathbb{R}^n$ to a lower-dimensional space $\mathbb{R}$ by the linear transformation $\mathbf{w^T}$. To follow formula (3.32), it is important to understand how the mean and variance change under this linear transformation.

<br>**Formula:**

***From ML Perspective***:
<br>Consider a set of $k$ samples with $n$ features, $\mathbf{S}=\{\mathbf{x}_i | \mathbf{x}_i \in \mathbb{R}^n\}_{i=1}^{k}$, where $\mathbf{x}_i = [x_{1i}, x_{2i}, ..., x_{ni}]^{T}$. Each entry in the column vector $\mathbf{x}_i$ is a realization (observation) of the corresponding feature.

The mean vector of the set is $\boldsymbol{\mu_X}=\frac{1}{k}\sum_{i=1}^{k}\mathbf{x}_i=[\mu_{1}, \mu_{2}, ..., \mu_{n}]^{T}$, $\boldsymbol{\mu_X} \in \mathbb{R^n}$.

The covariance matrix is $\mathbf{\Sigma_{X}}=(a_{pq})_{n \times n}
=\begin{bmatrix}
a_{11} & a_{12} & \dots & a_{1n} \\
a_{21} & a_{22} & \dots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nn}
\end{bmatrix}$, where $a_{pq}=\frac{1}{k-1}\sum_{i=1}^{k} [(x_{pi}-\mu_{p})(x_{qi}-\mu_{q})]$.
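
As a minimal sketch of these two definitions, the snippet below computes the mean vector and the covariance matrix entry by entry on a small made-up dataset, then compares the result with NumPy's `np.cov`. The data values and variable names are assumptions chosen only for illustration; samples are stored as columns to match the notation $\mathbf{x}_i$.

```python
import numpy as np

# Hypothetical toy data: k = 5 samples with n = 3 features.
# Column i of X plays the role of the sample x_i in the text.
X = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 5.0],
    [0.5, 1.5, 1.0, 2.5, 2.0],
])
n, k = X.shape

# Mean vector: average over the k samples, one entry per feature.
mu_X = X.sum(axis=1) / k

# Covariance matrix, entry by entry, following the a_pq formula
# with the 1/(k-1) normalization.
Sigma_X = np.zeros((n, n))
for p in range(n):
    for q in range(n):
        Sigma_X[p, q] = np.sum((X[p] - mu_X[p]) * (X[q] - mu_X[q])) / (k - 1)

# np.cov treats each row as a variable and divides by k-1 by default,
# so it should agree with the explicit double loop.
assert np.allclose(Sigma_X, np.cov(X))
assert np.allclose(Sigma_X, Sigma_X.T)  # a covariance matrix is symmetric
```

The double loop is written for readability rather than speed; in practice `np.cov(X)` gives the same matrix directly.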

<br>Applying the linear transformation $y_i=\mathbf{w^{T}}\mathbf{x}_i$ to every sample in $\mathbf{S}$ yields $k$ scalars, which can be viewed as observations of a scalar variable $Y=\mathbf{w^{T} X}$. Since $Y$ is a scalar, its mean is the scalar $\mu_Y=\mathbf{w^T}\boldsymbol{\mu_X}$, and its covariance matrix reduces to the $1 \times 1$ variance $\Sigma_{Y}=\mathbf{w^{T}}\mathbf{\Sigma_X}\mathbf{w}$.
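
The two identities for $\mu_Y$ and $\Sigma_Y$ hold exactly for sample statistics, not only in expectation, so they can be checked numerically. The sketch below does this on randomly generated data; the sizes, the seed, and the direction `w` are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: n = 4 features, k = 200 samples (columns of X).
n, k = 4, 200
X = rng.normal(size=(n, k))
w = rng.normal(size=n)          # an arbitrary projection direction

# Project every sample: y_i = w^T x_i, giving k scalars.
y = w @ X

# Statistics of the original samples.
mu_X = X.mean(axis=1)
Sigma_X = np.cov(X)             # uses the 1/(k-1) normalization

# Statistics of the projected samples (same 1/(k-1) normalization).
mu_Y = y.mean()
var_Y = y.var(ddof=1)

# The projected mean and variance match w^T mu_X and w^T Sigma_X w.
assert np.isclose(mu_Y, w @ mu_X)
assert np.isclose(var_Y, w @ Sigma_X @ w)
```

Note that `ddof=1` is needed in `y.var` so that the scalar variance uses the same $\frac{1}{k-1}$ factor as the covariance matrix.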

<br>***From Statistics Perspective***:
<br>Consider each sample from set $\mathbf{S}$ as a realization (observation) of the random vector (or $n$-dimensional random variable) $\mathbf{X}=[X_1, X_2, ..., X_n]^{T}$. Its mean vector and covariance matrix are $\boldsymbol{\mu_X}$ and $\mathbf{\Sigma_X}$ respectively.

<br>After the linear transformation $Y=\mathbf{w^{T} X}$, the mean and variance of the scalar random variable $Y$ are $\mu_Y=\mathbf{w^T}\boldsymbol{\mu_X}$ and $\Sigma_{Y}=\mathbf{w^{T}}\mathbf{\Sigma_X}\mathbf{w}$.

<br>**Derivation (Statistics Perspective):** Consider a random vector $\mathbf{X}=[X_1, X_2, ..., X_n]^{T}$. Each component $X_i$ is a random variable (or a feature variable in ML), and together they form the joint distribution over $\mathbb{R}^n$.

The linear transformation $Y=\mathbf{w^{T} X}$ maps the random vector $\mathbf{X}$ to a scalar random variable $Y$, where $\mathbf{w}$ is a constant (non-random) vector.

For the mean, by linearity of expectation, we have
$$\mathbb{E}[\mathbf{X}]=\boldsymbol{\mu_X},\quad \boldsymbol{\mu_X} \in \mathbb{R^n}$$

$$\mathbb{E}[Y]
=\mathbb{E}[\mathbf{w^{T}X}]
=\mathbf{w^{T}}\mathbb{E}[\mathbf{X}]
=\mathbf{w^{T}}\boldsymbol{\mu_X}$$

For the variance, the covariance matrix of the random vector $\mathbf{X}$ is defined as:
$$\operatorname{Var}(\mathbf{X})
=\operatorname{Cov}(\mathbf{X}, \mathbf{X})
=\mathbf{\Sigma_X}
=\mathbb{E}[(\mathbf{X}-\mathbb{E}[\mathbf{X}])(\mathbf{X}-\mathbb{E}[\mathbf{X}])^{T}]$$

We can then derive the variance of the random variable $Y$ (its $1 \times 1$ covariance matrix) as:

$$\begin{aligned}
\operatorname{Cov}(Y, Y)
&=\operatorname{Cov}(\mathbf{w^{T} X}, \mathbf{w^{T} X})
\\&=\mathbb{E}[(\mathbf{w^{T} X}-\mathbb{E}[\mathbf{w^{T} X}])(\mathbf{w^{T} X}-\mathbb{E}[\mathbf{w^{T} X}])^{T}]
\\&=\mathbb{E}[(\mathbf{w^{T} X}-\mathbf{w^{T}}\mathbb{E}[\mathbf{X}])(\mathbf{w^{T} X}-\mathbf{w^{T}}\mathbb{E}[\mathbf{X}])^{T}]
\\&=\mathbb{E}[\mathbf{w^{T}}(\mathbf{X}-\mathbb{E}[\mathbf{X}])(\mathbf{w^{T}}(\mathbf{X}-\mathbb{E}[\mathbf{X}]))^{T}]
\\&=\mathbb{E}[\mathbf{w^{T}}(\mathbf{X}-\mathbb{E}[\mathbf{X}])(\mathbf{X}-\mathbb{E}[\mathbf{X}])^{T}\mathbf{w}]
\\&=\mathbf{w^{T}}\mathbb{E}[(\mathbf{X}-\mathbb{E}[\mathbf{X}])(\mathbf{X}-\mathbb{E}[\mathbf{X}])^{T}]\mathbf{w}
\\&=\mathbf{w^{T}}\operatorname{Cov}(\mathbf{X}, \mathbf{X})\mathbf{w}
\\&=\mathbf{w^{T}}\mathbf{\Sigma_X}\mathbf{w}
\end{aligned}$$

<br>From the above derivation we get:
$$\operatorname{Var}(Y)
=\operatorname{Cov}(Y, Y)
=\Sigma_Y=\mathbf{w^{T}}\operatorname{Cov}(\mathbf{X}, \mathbf{X})\mathbf{w}=\mathbf{w^{T}}\mathbf{\Sigma_X}\mathbf{w}$$
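
To see the population (rather than sample) version of this result at work, the following sketch uses a small discrete random vector whose expectation and covariance can be computed exactly from the definitions. The outcomes, probabilities, and the vector `w` are made-up values for illustration only.

```python
import numpy as np

# Hypothetical discrete random vector X in R^2: each row of `outcomes`
# is one possible value of X, taken with the matching probability.
outcomes = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 2.0],
                     [1.0, 2.0]])
probs = np.array([0.1, 0.2, 0.3, 0.4])
w = np.array([2.0, -1.0])

# Exact moments of X from the definitions.
mu_X = probs @ outcomes                                # E[X]
centered = outcomes - mu_X
Sigma_X = (centered * probs[:, None]).T @ centered     # E[(X - E[X])(X - E[X])^T]

# Y = w^T X takes the value w^T x with the same probability as x.
y_vals = outcomes @ w
E_Y = probs @ y_vals                                   # E[Y]
Var_Y = probs @ (y_vals - E_Y) ** 2                    # E[(Y - E[Y])^2]

# Both quantities agree with the derived formulas.
assert np.isclose(E_Y, w @ mu_X)
assert np.isclose(Var_Y, w @ Sigma_X @ w)
```

Because the distribution is specified exactly, no sampling is involved and the equalities hold up to floating-point rounding.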

