# Intuition behind PCA

Principal component analysis (PCA) is a dimensionality reduction technique that projects data into a lower-dimensional space while preserving important properties of the data, such as relative distances between data points. Other dimensionality reduction methods, such as t-SNE and UMAP, share this aim, although they are nonlinear whereas PCA is a linear projection. PCA tries to retain as much information about the original high-dimensional data as possible by choosing an orthogonal basis (the principal components) along which the variance of the projected data is maximised. This is equivalent to minimising the squared residuals between the data points and their projections.

For a single component, the objective of PCA is to project a dataset $X \in \mathbb{R}^{N \times d}$, with rows $x_i \in \mathbb{R}^{d}$, onto a direction $v \in \mathbb{R}^{d}$ of unit length, i.e. $\langle\,v,v\rangle=\|v\|^2 = 1$. The projection of the $i$-th point is $y_i=\langle\, x_i,v\rangle v$.
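As a minimal sketch of the projection $y_i=\langle\, x_i,v\rangle v$, the snippet below uses NumPy. The helper name `project_onto_direction` and the toy data are assumptions made for illustration only; they are not part of any library.

```python
import numpy as np

def project_onto_direction(X, v):
    """Project each row of X onto the line spanned by v (hypothetical helper)."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)    # enforce the unit-length constraint ||v|| = 1
    coords = X @ v               # scalar coordinate <x_i, v> for every row
    return np.outer(coords, v)   # y_i = <x_i, v> v, one projected point per row

# Toy data: three points in the plane, projected onto the diagonal direction.
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
Y = project_onto_direction(X, [1.0, 1.0])
```

All three toy points have the same coordinate along the diagonal, so they project to the same point on that line.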

## Maximising Variance


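We look for the unit vector $v$ that maximises the variance of the projected coordinates $\langle\, x_i,v\rangle$, where $\bar x$ denotes the sample mean:

\begin{equation}
\begin{split}
v^{*}&=\operatorname*{argmax}_{\|v\|=1} \frac{1}{N}\sum_{i=1}^{N}\left(\langle\, x_i,v\rangle-\langle\,\bar x,v\rangle\right)^{2}\\
&= \operatorname*{argmax}_{\|v\|=1} \frac{1}{N}\sum_{i=1}^{N}(v^{T}x_i-v^{T}\bar x)^{2}\\
&= \operatorname*{argmax}_{\|v\|=1} v^{T}\left(\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar x)(x_i-\bar x)^{T}\right)v \\
&= \operatorname*{argmax}_{\|v\|=1} v^{T} \Sigma v
\end{split}
\end{equation}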


where $\Sigma = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar x)(x_i-\bar x)^{T}$ is the covariance matrix. To enforce the constraint $v^{T}v=1$, we define the Lagrangian for this problem as follows:

\begin{equation}
\mathcal{L}(v,\lambda) = v^{T} \Sigma v - \lambda(v^Tv-1)
\end{equation}



Setting the gradient with respect to $v$ to zero gives an eigenvalue problem:

\begin{align*}
\frac{\partial \mathcal{L}(v,\lambda)}{\partial v} &= 2\Sigma v - 2\lambda v = 0 \\
\implies \Sigma v &= \lambda v
\end{align*}


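Every stationary point is therefore an eigenvector of $\Sigma$, and the variance along it equals its eigenvalue, since $v^{T}\Sigma v = \lambda v^{T}v = \lambda$. The maximiser $v^{*}$ is the eigenvector with the largest eigenvalue, and sorting the eigenvalues as $\lambda_{1} \geq \lambda_{2} \geq \dots \geq \lambda_{d}$ orders the principal components by the variance they capture.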
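The relationship between eigenvalues and projected variance can be checked numerically. The following is a sketch under stated assumptions: the synthetic data, the random seed and the variable names are illustrative, and the covariance uses the $\frac{1}{N}$ convention from the derivation above rather than the $\frac{1}{N-1}$ convention some libraries default to.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data stretched along the first axis (illustrative only).
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
Xc = X - X.mean(axis=0)                    # centre the data

Sigma = Xc.T @ Xc / len(Xc)                # covariance with the 1/N convention

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]          # re-sort so that lambda_1 >= lambda_2
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

v1 = eigvecs[:, 0]                         # first principal direction
projected_variance = np.var(Xc @ v1)       # np.var divides by N by default
quadratic_form = v1 @ Sigma @ v1           # v^T Sigma v
```

Under these assumptions, `projected_variance`, `quadratic_form` and `eigvals[0]` should agree up to floating-point error.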

## Minimising projected squared residuals


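\begin{equation}
\begin{split}
v^{*}&=\operatorname*{argmin}_{\|v\|=1} \frac{1}{N}\sum_{i=1}^{N}\| x_i - \langle\,x_i,v\rangle v\|^2\\
&=\operatorname*{argmin}_{\|v\|=1} \frac{1}{N}\sum_{i=1}^{N}\left(\| x_i\|^2 - 2\langle\,x_i,v\rangle^2 + \langle\,x_i,v\rangle^2 \|v\|^2\right) \\
&=\operatorname*{argmin}_{\|v\|=1} \frac{1}{N}\sum_{i=1}^{N}\left(\| x_i\|^2 - \langle\,x_i,v\rangle^2\right) \\
&=\operatorname*{argmin}_{\|v\|=1} -\frac{1}{N}\sum_{i=1}^{N} \langle\,x_i,v\rangle^2
=\operatorname*{argmin}_{\|v\|=1} -\mathop{\mathbb{E}(\langle\,x,v\rangle ^2)}
\end{split}
\end{equation}

The third line uses $\|v\|^2=1$, and the last line drops $\|x_i\|^2$ because it does not depend on $v$.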


Recognising that the variance can be written as follows:

\begin{equation}
\mathop{\mathbb{V}(X)} = \mathop{\mathbb{E}(X^2)}-\mathop{\mathbb{E}(X)^2}
\end{equation}

We can make the following substitution:

\begin{equation}
\begin{split}
\mathop{\mathbb{E}(\langle\,x,v\rangle ^2)}
&=\mathop{\mathbb{V}(\langle\,x,v\rangle)} + \mathop{\mathbb{E}(\langle\,x,v\rangle )^2}
\end{split}
\end{equation}

Assuming the data have been centred so that their mean is zero, linearity of expectation gives $\mathop{\mathbb{E}(\langle\,x,v\rangle )}=\langle\,\mathbb{E}(x),v\rangle=\langle\,0,v\rangle = 0$, so the last term vanishes.


\begin{align}
\mathop{\mathbb{E}(\langle\,x,v\rangle ^2)}
&=\mathop{\mathbb{V}(\langle\,x,v\rangle)}\\
v^{*} =\operatorname*{argmin}_v -\mathop{\mathbb{E}(\langle\,x,v\rangle ^2)}
&=\operatorname*{argmax}_v \mathop{\mathbb{V}(\langle\,x,v\rangle)}
\end{align}


Minimising the squared residuals is therefore equivalent to maximising the variance of the projected data.
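The equivalence of the two objectives can also be illustrated numerically. The sketch below sweeps unit vectors in the plane and compares the mean squared residual with the projected variance for each one. The synthetic data, the seed and the grid of 180 angles are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated synthetic 2-D data (illustrative only), centred to zero mean.
X = rng.normal(size=(300, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)

# Unit vectors v = (cos t, sin t) for t in [0, pi); v and -v span the same line.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Projected variance (1/N) sum <x_i, v>^2 for every candidate direction.
variance = np.mean((Xc @ directions.T) ** 2, axis=0)

# Mean squared residual (1/N) sum ||x_i - <x_i, v> v||^2, computed explicitly.
residual = np.array([
    np.mean(np.sum((Xc - np.outer(Xc @ v, v)) ** 2, axis=1))
    for v in directions
])

total = np.mean(np.sum(Xc ** 2, axis=1))   # (1/N) sum ||x_i||^2, independent of v
best_by_residual = directions[np.argmin(residual)]
best_by_variance = directions[np.argmax(variance)]
```

For every direction, `residual + variance` should equal `total` up to floating-point error, which is why the direction with the smallest residual is also the one with the largest variance.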