# Principal Components Analysis (PCA)

Given a collection of $n$ vectors $x_1, \ldots, x_n \in \mathbb{R}^m$, we look for an orthonormal set of $k$ vectors $u_1, \ldots, u_k \in \mathbb{R}^m$ (with $k \ll m$) that captures most of the variation in the data. This is achieved by approximating each data point $x_i$ with a linear combination $z_{i1}u_1 + \ldots + z_{ik}u_k$, where $z_{i1}, \ldots, z_{ik} \in \mathbb{R}$ are the coefficients.

The central task of PCA is to select the vectors $u_j$ that optimize the quality of this approximation across all data points. The resulting $u_1, \ldots, u_k$ are known as the $k$ principal components. They are ordered so that $u_1$ is the direction of greatest variance in the data, $u_2$ is the direction of greatest variance orthogonal to $u_1$, and so on.

In summary, PCA allows us to represent each data point $x_i \in \mathbb{R}^m$ by its coordinates $z_i = (z_{i1}, \ldots, z_{ik})$ with respect to the $k$ principal components. This yields a lower-dimensional and hopefully more informative representation, making PCA a powerful tool for data analysis and dimensionality reduction.


## PCA 1 (from SVD)

To find the principal components, we apply the singular value decomposition to the $m \times n$ matrix whose columns are the mean-centered data points $x_i$.


Let's see how the SVD is used to compute PCA:


Consider an $m\times n$ data matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ in which each column represents a different data point (e.g., one of $n$ people) and each row represents one of $m$ different features (e.g., height, weight, age, etc.). Let $\mu := \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$ denote the mean of the columns. By replacing each $\mathbf{x}_i$ with $\mathbf{x}_i - \mu$, we can assume that the input data is mean-centered. Given a target dimension $k \leq m$, our goal is to find points $\mathbf{\tilde{x}}_1, \ldots, \mathbf{\tilde{x}}_n$ in $\mathbb{R}^m$ such that the reconstruction error $\sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{\tilde{x}}_i\|^2$ is minimized, subject to the constraint that $\mathbf{\tilde{x}}_1, \ldots, \mathbf{\tilde{x}}_n$ lie in a subspace of $\mathbb{R}^m$ of dimension at most $k$.

Let $\mathbf{\tilde{X}} :=
\begin{bmatrix}
\mathbf{\tilde{x}}_1 & \ldots & \mathbf{\tilde{x}}_n
\end{bmatrix}$. Then the reconstruction error is exactly $\|\mathbf{X} - \mathbf{\tilde{X}}\|_F^2$, and the subspace constraint says that $\mathbf{\tilde{X}}$ has rank at most $k$. Thus, by the Eckart-Young theorem, an optimal choice of $\mathbf{\tilde{X}}$ is the best rank-$k$ approximation $\mathbf{X}_k$ of $\mathbf{X}$ (the rank-$k$ truncated SVD). Now recall that if $\mathbf{U}_k$ is the $m \times k$ matrix whose columns are the top $k$ left singular vectors of $\mathbf{X}$, then, writing $\mathbf{Z} := \mathbf{U}_k^T \mathbf{X}$ (a $k \times n$ matrix), we have $\mathbf{X}_k = \mathbf{U}_k\mathbf{U}_k^T \mathbf{X} = \mathbf{U}_k\mathbf{Z}$.

The output of PCA is the pair of matrices $\mathbf{U}_k$ and $\mathbf{Z}$. The columns of $\mathbf{U}_k$ are the top $k$ left singular vectors, while the columns of $\mathbf{Z}$ give the coefficients that respectively approximate each mean-centered data point $\mathbf{x}_i$ as a linear combination of the top $k$ left singular vectors.
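The construction above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not a reference implementation: the sizes `m`, `n`, `k` and the random matrix `X_raw` are assumptions made for the example, and data points are stored as columns.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 200, 2                       # features, data points, target dimension (illustrative)
X_raw = rng.normal(size=(m, n))           # synthetic data, one data point per column

mu = X_raw.mean(axis=1, keepdims=True)    # mean of the columns (m x 1)
X = X_raw - mu                            # mean-centered data matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]                            # m x k: top k left singular vectors
Z = U_k.T @ X                             # k x n: coefficients of each data point
X_k = U_k @ Z                             # rank-k reconstruction of X

# Eckart-Young: the squared Frobenius error equals the sum of the
# squared singular values that were discarded.
err = np.linalg.norm(X - X_k, "fro") ** 2
print(err, np.sum(s[k:] ** 2))
```

The two printed numbers should agree up to rounding, which is a quick sanity check that `X_k` is the truncated SVD.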

Note that PCA finds a $k$-dimensional hyperplane that must pass through the mean of the data, whereas the SVD of the uncentered data finds a $k$-dimensional hyperplane passing through the origin. The former provides better reconstruction. However, as the next exercise shows, the difference is usually not too large.
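One way to see the effect of mean-centering numerically is to compare the two reconstruction errors on data that is shifted away from the origin. The sketch below uses synthetic data; the helper `rank_k`, the offset vector, and the sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 4, 300, 2
offset = np.array([[3.0], [-1.0], [0.5], [2.0]])     # shifts the data away from the origin
X_raw = rng.normal(size=(m, n)) + offset             # one data point per column

def rank_k(M, k):
    """Best rank-k approximation of M via the truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

mu = X_raw.mean(axis=1, keepdims=True)

# PCA: subtract the mean, approximate, then add the mean back.
err_pca = np.linalg.norm(X_raw - (mu + rank_k(X_raw - mu, k)), "fro") ** 2
# Plain SVD: approximate the uncentered data directly.
err_svd = np.linalg.norm(X_raw - rank_k(X_raw, k), "fro") ** 2

print("PCA error:", err_pca)
print("SVD error:", err_svd)
```

The PCA error can never exceed the SVD error, because every subspace through the origin is itself a candidate hyperplane, and the best hyperplane of a given dimension passes through the mean.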

##  PCA 2 (Statistical interpretation)

__1 Variance and Covariance for Mean-centered Data__

Consider real data values $\mathbf{x}$ with mean $\bar{x}$ and $\mathbf{y}$ with mean $\bar{y}$. The variance measures how far the data are spread away from the mean, and the covariance  measures the correspondence between $\mathbf{x}$ and $\mathbf{y}$ values.


i) The variance $Var(\mathbf{x})$ of $\mathbf{x}$ and variance $Var(\mathbf{y})$ of $\mathbf{y}$ are 

$$
Var(\mathbf{x})=\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n},\hspace{.25in} Var(\mathbf{y})=\frac{\sum_{i=1}^n (y_i-\bar{y})^2}{n}.
$$

ii) The covariance $Cov(\mathbf{x},\mathbf{y})$ of $\mathbf{x}$ and $\mathbf{y}$ is 

$$
Cov(\mathbf{x},\mathbf{y})=\frac{\sum_{i=1}^n\, (x_i-\bar{x})(y_i-\bar{y})}{n}.
$$

Note that $Cov(\mathbf{x},\mathbf{y}) = Cov(\mathbf{y},\mathbf{x}).$ Furthermore, for mean-centered data ($\bar{x}=0$ and $\bar{y}=0$), the formulas  simplify to 

$$
Var(\mathbf{x})=\dfrac{\sum_{i=1}^n x_i^2}{n}=\frac{1}{n}\,\mathbf{x}\cdot\mathbf{x},\qquad Cov(\mathbf{x},\mathbf{y})=\frac{\sum_{i=1}^n x_iy_i}{n}=\frac{1}{n}\,\mathbf{x}\cdot\mathbf{y}.
$$


 __2 Covariance Matrix__

Consider an $n\times m$ data matrix $\mathbf{X}$ in which each row is a data point and each column is a feature, and assume that the columns have all been mean-centered. Define the real <i>covariance matrix</i> $\mathbf{A}$ as

$$
\mathbf{A}=\tfrac{1}{n}\,\mathbf{X}^T\mathbf{X}.
$$

$\mathbf{A}$ must be symmetric since $\mathbf{A}^T=\frac{1}{n}\mathbf{X}^T(\mathbf{X}^T)^T=\frac{1}{n}\mathbf{X}^T\mathbf{X}=\mathbf{A}.$ (Recall that $(\mathbf{M}\mathbf{N})^T=\mathbf{N}^T\mathbf{M}^T.$)


As a consequence of the definition of $\mathbf{A}$ and as illustrated in the following example, the $(i,i)$ entry of the covariance matrix $\mathbf{A}$ is the variance of the $i^{th}$ column of the data matrix $\mathbf{X}$, and the $(i,j)$ entry ($i\neq j$) of the covariance matrix $\mathbf{A}$ is the covariance of the $i^{th}$ and $j^{th}$ columns of the data matrix $\mathbf{X}$.
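As a quick numerical check of these statements, the sketch below builds $\mathbf{A}$ from synthetic data and compares it with NumPy's built-in covariance. The data and sizes are assumptions for illustration; `bias=True` is used so that `np.cov` divides by $n$, matching the convention in these notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 3
X_raw = rng.normal(size=(n, m))        # rows are data points, columns are features
X = X_raw - X_raw.mean(axis=0)         # mean-center each column

A = X.T @ X / n                        # m x m covariance matrix

# rowvar=False treats columns as variables; bias=True normalizes by n.
A_np = np.cov(X_raw, rowvar=False, bias=True)

print(np.allclose(A, A_np))            # the two computations should agree
print(A[0, 0], X[:, 0].var())          # diagonal entry vs. variance of column 0
```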

### Example 1:
Consider the data points $(-5,0)$, $(0,2)$, $(5,-2)$. The corresponding data matrix is 

$$
\mathbf{X}=
\begin{pmatrix}
-5 & 0 \\
0 & 2 \\
5 & -2
\end{pmatrix}.
$$

- Are the data mean-centered?

- Compute $Var(\mathbf{x})$ and $Var(\mathbf{y})$.

- Compute $Cov(\mathbf{x},\mathbf{y})$.

- Compute the covariance matrix $\mathbf{A}= [a_{ij}]$.

- Verify that $a_{11} = Var(\mathbf{x})$, $a_{22} = Var(\mathbf{y})$, and $a_{21} = a_{12} = Cov(\mathbf{x},\mathbf{y})= Cov(\mathbf{y},\mathbf{x})$.


```python
# your code
```
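One possible way to check your answers to Example 1 is sketched below. It follows the formulas for mean-centered data given earlier (division by $n$, not $n-1$); the variable names are only suggestions.

```python
import numpy as np

# Data matrix from Example 1: one data point per row, columns are x and y.
X = np.array([[-5.0, 0.0],
              [0.0, 2.0],
              [5.0, -2.0]])
n = X.shape[0]

col_means = X.mean(axis=0)     # both entries are 0 when the data are mean-centered
x, y = X[:, 0], X[:, 1]

# Dot-product formulas, valid because the columns are mean-centered.
var_x = x @ x / n
var_y = y @ y / n
cov_xy = x @ y / n

A = X.T @ X / n                # covariance matrix

print("column means:", col_means)
print("Var(x), Var(y), Cov(x, y):", var_x, var_y, cov_xy)
print("A =\n", A)
```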

__3 Principal Component Analysis as an Optimization Problem__


So far we saw that given an $n\times m$ data matrix $\mathbf{X}$ whose columns all have mean zero, the $m\times m$ covariance matrix $\mathbf{A}$ is the symmetric matrix

\begin{equation}
    \mathbf{A}=\tfrac{1}{n}\mathbf{X}^T\mathbf{X}.
\end{equation}
Since the $m\times m$ covariance matrix $\mathbf{A}$ is a real symmetric matrix, it has $m$ orthonormal eigenvectors $\mathbf{v}_1,\mathbf{v}_2,\dots,\mathbf{v}_m \in \mathbb{R}^m$ with real eigenvalues $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_m.$ These eigenvalues are nonnegative, because $\mathbf{v}^T\mathbf{A}\mathbf{v}=\frac{1}{n}\|\mathbf{X}\mathbf{v}\|^2\ge 0$ for every $\mathbf{v}$. The first two principal components solve the following optimization problems:




<b>First principal component</b> Find the vector $\mathbf{v}_1$ (written as an $m\times 1$ column vector) that will

$$
      \text{maximize} \; J(\mathbf{v}_1)=\mathbf{v}_1^T \mathbf{A}\mathbf{v}_1
$$

such that $\|\mathbf{v}_1\|^2=1.$ When the data is projected onto the line $\mathbf{L_{v_1}}$ through the origin determined by the unit vector $\mathbf{v}_1$, $J(\mathbf{v}_1)$ is the variance of the data along that line: the projected coordinates are the entries of $\mathbf{X}\mathbf{v}_1$, and their variance is $\frac{1}{n}\|\mathbf{X}\mathbf{v}_1\|^2=\mathbf{v}_1^T\mathbf{A}\mathbf{v}_1$.

The choice of $\mathbf{v}_1$ that maximizes the variance is a unit eigenvector of $\mathbf{A}=\frac{1}{n}\mathbf{X}^T\mathbf{X}$ with the largest eigenvalue [Aggarwal 2020].
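The claim can be checked numerically. The sketch below is an illustration on synthetic data (the sizes, the column scaling, and the number of random trial directions are assumptions): it computes the top eigenvector with `np.linalg.eigh` and compares $J(\mathbf{v}_1)$ against $J$ evaluated at many random unit vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 200, 3
X = rng.normal(size=(n, m)) @ np.diag([3.0, 1.0, 0.3])   # columns with different spreads
X = X - X.mean(axis=0)                                    # mean-center the columns
A = X.T @ X / n

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(A)
v1 = eigvecs[:, -1]                    # unit eigenvector for the largest eigenvalue

J_v1 = v1 @ A @ v1                     # objective value at v1
proj_var = np.var(X @ v1)              # variance of the data projected onto v1

# Evaluate J on random unit vectors (one per column of V).
V = rng.normal(size=(m, 1000))
V /= np.linalg.norm(V, axis=0)
J_rand = np.sum(V * (A @ V), axis=0)

print(J_v1, eigvals[-1], proj_var)     # these three numbers should coincide
print(J_rand.max() <= J_v1)            # no random direction beats v1
```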
  
  
<b>Second principal component</b> Find the vector $\mathbf{v}_2$ (written as an $m\times 1$ column vector) that will

$$
      \text{maximize} \; J(\mathbf{v}_2)=\mathbf{v}_2^T\mathbf{A}\mathbf{v}_2
$$

such that $\|\mathbf{v}_2\|^2=1$ and $\mathbf{v}_1\cdot\mathbf{v}_2=0.$ The vector $\mathbf{v}_1$ is the one obtained in the first step as the first principal component; $\mathbf{v}_2$ must be orthogonal to $\mathbf{v}_1$ and captures as much as possible of the variance in the data that was not captured by $\mathbf{v}_1$.

More generally, to achieve our goal of dimensionality reduction we consider the following:


The first $k$ principal component vectors $\mathbf{v}_1,\dots,\mathbf{v}_k$ will

$$
      \text{maximize} \; J(\mathbf{v}_1,\dots,\mathbf{v}_k)=\sum_{i=1}^k \mathbf{v}_i^T \mathbf{A} \mathbf{v}_i
$$

subject to $\|\mathbf{v}_i\|^2=1$ for all $i=1,\dots,k$ and $\mathbf{v}_i\cdot\mathbf{v}_j=0$ for all $i\neq j$.
The vectors $\mathbf{v}_1,\dots,\mathbf{v}_k$ are orthonormal eigenvectors of $\mathbf{A}$ with the largest eigenvalues.
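A short sketch of the general case, again on synthetic data with assumed sizes: the maximal value of $J$ is the sum of the $k$ largest eigenvalues of $\mathbf{A}$, and those eigenvalues are the squared singular values of $\mathbf{X}$ divided by $n$, which links this statistical view to the SVD view.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 100, 4, 2
X = rng.normal(size=(n, m))
X = X - X.mean(axis=0)                 # rows are data points, columns mean-centered
A = X.T @ X / n

eigvals, eigvecs = np.linalg.eigh(A)   # ascending order
order = np.argsort(eigvals)[::-1]      # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

V_k = eigvecs[:, :k]                   # first k principal components (m x k)
J = np.trace(V_k.T @ A @ V_k)          # sum of v_i^T A v_i over i = 1..k

s = np.linalg.svd(X, compute_uv=False) # singular values of X, descending

print(J, eigvals[:k].sum())            # objective equals the sum of top-k eigenvalues
print(np.allclose(eigvals, s ** 2 / n))
```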

### Example 2

Use a principal component to determine the line onto which we should project the points $(-1,3),(0,0),(1,-3)$ in order to maximize the variance along that line.
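A hedged sketch of how Example 2 could be checked numerically: the first principal component is the unit eigenvector of $\mathbf{A}$ with the largest eigenvalue, and the line is the one through the origin in that direction. Note that an eigenvector is only determined up to sign.

```python
import numpy as np

# Points from Example 2, one per row (already mean-centered).
X = np.array([[-1.0, 3.0],
              [0.0, 0.0],
              [1.0, -3.0]])
n = X.shape[0]

A = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(A)   # ascending eigenvalues
v1 = eigvecs[:, -1]                    # direction of maximal variance (up to sign)

print("eigenvalues:", eigvals)
print("first principal component:", v1)
print("slope of the line:", v1[1] / v1[0])
```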

### Example 3: 

$\,$ Consider the $n\times m = 4\times 3$ data matrix $\mathbf{X}$ defined as 

$$
\mathbf{X}=
\begin{pmatrix}
 2 & 2 & 0 \\
 0 & 0 &  1 \\
 0 & 0 &  -1\\
-2 & -2 & 0
\end{pmatrix}.
$$



Each row represents a person, so there are data for $n=4$ people.  Each column represents a feature, so there are $m=3$ features (e.g., height, weight, and age.)
We assume that the columns are mean-centered and in standard units.
 Person 1 is  tall and heavy, with average age. Person 2 has average height and weight and is older. Person 3 is average height and weight and is younger. Person 4 is relatively short and light, with average age.
 
a) $\,$ Find the symmetric $m\times m$ covariance matrix $\mathbf{A}=\frac{1}{n}\mathbf{X}^T\mathbf{X}$
 
b) $\,$ Find the eigenvalues and eigenvectors of $\mathbf{A}$. What are the first, second, and third principal components?


c)  Explain how the variance is maximized by the principal components.
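The following sketch is one way to verify hand computations for Example 3. It also prints the fraction of the total variance (the trace of $\mathbf{A}$) along each principal component, which may help with part c). Eigenvector signs are arbitrary.

```python
import numpy as np

X = np.array([[2.0, 2.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, -1.0],
              [-2.0, -2.0, 0.0]])
n = X.shape[0]

A = X.T @ X / n                                   # a) covariance matrix

eigvals, eigvecs = np.linalg.eigh(A)              # b) eigen-decomposition
order = np.argsort(eigvals)[::-1]                 # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("A =\n", A)
for lam, v in zip(eigvals, eigvecs.T):
    print("eigenvalue", lam, "principal component", v)

# c) The variance along each principal component is its eigenvalue.
print("fraction of total variance:", eigvals / np.trace(A))
```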

### Example 4: 

$\,$ Consider the $n\times m = 4\times 3$ data matrix $\mathbf{X}$ defined as 

$$
\mathbf{X}=
\begin{pmatrix}
 0 & 4 & 0 \\
 2 & 0 &  1 \\
 -2 & 0 &  -1\\
0 & -4 & 0
\end{pmatrix}.
$$

a) $\,$ Without calculating eigenvectors, conjecture the first two principal components.

b) $\,$ Check your answer to a) by computing the eigenvectors of the covariance matrix $\mathbf{A}=\frac{1}{n}\mathbf{X}^T\mathbf{X}.$
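For part b), a numerical check along the following lines may be useful once you have made your conjecture. This is only a sketch; the eigenvectors returned by NumPy are unit vectors and may differ from a hand-computed answer by a sign.

```python
import numpy as np

X = np.array([[0.0, 4.0, 0.0],
              [2.0, 0.0, 1.0],
              [-2.0, 0.0, -1.0],
              [0.0, -4.0, 0.0]])
n = X.shape[0]

A = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(A)              # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

v1, v2 = eigvecs[:, 0], eigvecs[:, 1]             # first two principal components
print("eigenvalues:", eigvals)
print("v1 =", v1)
print("v2 =", v2)
```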


## Application (Potential Final Topic)

1. **Facial Recognition Using PCA**

Imagine we have a database of $n$ faces. Each face is represented as a vector $x \in \mathbb{R}^m$, where $x_i$ is the intensity of the $i$-th pixel. The dimension $m$ can be very large, but after applying PCA we can often represent these faces well using only a few hundred principal components.

Now, let's define some key matrices:
- $X$ is an $m \times n$ matrix with columns representing the mean-centered faces in the database.
- $U_k$ is an $m \times k$ matrix with columns as the top $k$ left singular vectors of $X$, known as "eigenfaces."
- $Z = U_k^T X$ is a $k \times n$ matrix, where each column $z_i$ holds the coefficients for expressing the $i$-th face as a linear combination of eigenfaces.

To identify the closest match for a new face $x$ in the database, we first subtract the mean face of the database from $x$ and calculate the coordinates of the result projected onto the subspace spanned by the eigenfaces, that is, $U_k^T$ applied to the mean-centered $x$. Then, we locate the nearest neighbor to this coordinate vector among the vectors $z_i$ for $i = 1, \ldots, n$.
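The matching procedure can be sketched as follows. Everything here is an assumption made for illustration: the "faces" are random vectors standing in for real images, the sizes are tiny, and `closest_face` is a hypothetical helper name rather than part of any library.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 64, 20, 5                      # pixels, faces, eigenfaces (illustrative sizes)
faces = rng.random(size=(m, n))          # synthetic stand-in for a face database, one face per column

mu = faces.mean(axis=1, keepdims=True)   # mean face
X = faces - mu                           # mean-centered faces (m x n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]                           # eigenfaces (m x k)
Z = U_k.T @ X                            # coefficients of the database faces (k x n)

def closest_face(x_new):
    """Index of the database face nearest to x_new in eigenface coordinates."""
    z_new = U_k.T @ (x_new.reshape(-1, 1) - mu)   # k x 1 coordinates of the new face
    dists = np.linalg.norm(Z - z_new, axis=0)     # distance to every column z_i
    return int(np.argmin(dists))

# Query with a slightly perturbed copy of a database face.
query = faces[:, 7] + 0.01 * rng.normal(size=m)
print(closest_face(query))
```

The nearest-neighbor search runs in $k$ dimensions instead of $m$, which is where the savings come from when $m$ is the number of pixels.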


2. **Latent Semantic Analysis**

We want to group a collection of n documents based on their topics. Each document is represented as a vector in $\mathbb{R}^m$, indicating the frequency of each keyword (the number of occurrences of the keyword divided by the total number of occurrences of all keywords).

Let $x_1, \ldots, x_n \in \mathbb{R}^m$ represent the mean-centered documents. One approach is to measure the similarity between two documents $x_i$ and $x_j$ using their inner product $x_i^\top x_j$. However, this approach has limitations. It doesn't consider correlations among keywords; for instance, two different keywords like "football" and "Premier League" may be closely related but are treated as orthogonal.

To address this, we apply PCA. We represent the $i$-th mean-centered document as a linear combination of the top $k$ principal components, with coefficient vector $z_i$. It is expected that the inner product $z_i^\top z_j$ will provide a better measure of similarity between the $i$-th and $j$-th documents than $x_i^\top x_j$, because it takes keyword correlations into account and so improves the grouping of similar documents.
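A toy illustration of this effect is sketched below. The four keywords and six documents are invented for the example: document 0 mentions only "football", document 1 mentions only "premier league", and document 2 mentions both, so the two sports keywords are correlated across the collection.

```python
import numpy as np

keywords = ["football", "premier league", "election", "parliament"]
# Columns are documents, rows are keyword frequencies (each column sums to 1).
docs = np.array([
    [1.0, 0.0, 0.5, 0.0, 0.0, 0.0],   # football
    [0.0, 1.0, 0.5, 0.0, 0.0, 0.0],   # premier league
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.5],   # election
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.5],   # parliament
])

mu = docs.mean(axis=1, keepdims=True)
X = docs - mu                          # mean-centered documents

k = 1
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]
Z = U_k.T @ X                          # k x n coefficient vectors z_i

sim_uncentered = docs[:, 0] @ docs[:, 1]   # raw frequency vectors are orthogonal
sim_raw = X[:, 0] @ X[:, 1]                # x_0^T x_1 on mean-centered documents
sim_pca = Z[:, 0] @ Z[:, 1]                # z_0^T z_1 after PCA

print(sim_uncentered, sim_raw, sim_pca)
```

In this toy collection the inner product of the original vectors does not register documents 0 and 1 as similar, whereas the PCA coordinates give them a positive similarity because the top principal component groups the two sports keywords together.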


3. **Kernel PCA**

Kernel Principal Component Analysis (Kernel PCA) is an extension of PCA that allows for nonlinear dimensionality reduction. PCA is a linear technique that works well for datasets with linear relationships between variables, but it may not be effective when those relationships are nonlinear. Kernel PCA addresses this limitation by implicitly mapping the data, via a kernel function, into a higher-dimensional feature space and applying PCA there.
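A minimal NumPy sketch of the idea is given below, under several assumptions: a Gaussian (RBF) kernel with a hand-picked width `gamma`, synthetic data on two concentric circles, and embedding of the training points only. It is meant to show the steps (kernel matrix, centering in feature space, eigen-decomposition), not to serve as a tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 100, 2
theta = rng.uniform(0, 2 * np.pi, size=n)
radius = np.where(np.arange(n) < n // 2, 1.0, 3.0)       # inner and outer circle
X = np.column_stack([radius * np.cos(theta),
                     radius * np.sin(theta)])            # n x 2, rows are data points

# RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).
gamma = 0.5                                               # assumed kernel width
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-gamma * sq_dists)

# Center the kernel matrix, which mean-centers the data in feature space.
one_n = np.full((n, n), 1.0 / n)
K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Eigen-decomposition of the centered kernel matrix (symmetric, so use eigh).
eigvals, eigvecs = np.linalg.eigh(K_c)
order = np.argsort(eigvals)[::-1][:k]                     # indices of the top k eigenvalues

# Coordinates of the training points along the top k kernel principal components.
Z = eigvecs[:, order] * np.sqrt(eigvals[order])           # n x k

print(Z.shape)
```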

References:

1. https://timothyprojectgig.github.io/JB_Math_Textbook/Advanced/LinearAlgebra/PCA/jnb3.html
2. https://www.cs.ox.ac.uk/people/james.worrell/SVD-thin.pdf