In [1]:
import numpy as np
import pandas as pd


### Principal Component Analysis (PCA) is a dimensionality-reduction method that minimizes redundancy in an overdetermined system.

### Suppose you have m data points (x, y) stacked in a single column vector (shown here for m = 3):

$X = \begin{bmatrix}
           x_{1} \\
           y_{1} \\
           x_{2} \\
           y_{2} \\
           x_{3} \\
           y_{3}
\end{bmatrix} $
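As a minimal sketch of this layout, the snippet below interleaves separate x and y arrays into one column vector. The coordinate values are an assumption: they are chosen so that the result matches the `X` used in the later code cell, read as the points (1, 2), (0, -1), and (1, 3).

```python
import numpy as np

# Assumed coordinates for m = 3 points (illustrative values only).
x = np.array([1, 0, 1])
y = np.array([2, -1, 3])

# Pair each x with its y, then flatten row by row into a column:
# x1, y1, x2, y2, x3, y3
X = np.column_stack((x, y)).reshape(-1, 1)

print(X.shape)
print(X.ravel())
```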

### Variance is a single-value measure of the spread of one variable. Covariance measures how two variables vary together. When the columns of $X$ are $n$ mean-centered observations, $C_{X}=\dfrac{1}{n-1}XX^{T}$. A covariance of 0 corresponds to orthogonal (centered) vectors, which means the variables are uncorrelated. This is weaker than statistical independence, although the two coincide for jointly Gaussian data.
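The formula for $C_{X}$ can be checked numerically. The sketch below assumes a small made-up data matrix with variables in rows and observations in columns, centers each row, applies the formula, and compares the result with `np.cov`, which uses the same $n-1$ divisor by default.

```python
import numpy as np

# Hypothetical data: 2 variables (rows) observed n = 5 times (columns).
data = np.array([[2.0, 4.0, 6.0, 8.0, 10.0],
                 [1.0, 3.0, 2.0, 5.0, 4.0]])
n = data.shape[1]

# Subtract each row's mean so spread is measured about the mean.
centered = data - data.mean(axis=1, keepdims=True)

# C_X = 1/(n-1) * X X^T, applied to the centered data.
cov_manual = centered @ centered.T / (n - 1)

# np.cov treats rows as variables by default.
cov_numpy = np.cov(data)

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))
```

Skipping the centering step would give a matrix of raw second moments rather than covariances.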

### Note the following: 
- The values along the main diagonal, $C_{i,j}$ with $i=j$, are variances
- The off-diagonal values are covariances, and the matrix is symmetric: $C_{i,j} = C_{j,i}$
- Off-diagonal values near zero indicate that the two variables are nearly uncorrelated
- Off-diagonal values of large magnitude indicate that the two variables are strongly correlated, and thus redundant

$$\operatorname{Cov}(X) =
\begin{bmatrix}
    \sigma^{2}_{x_{1},x_{1}} & \sigma^{2}_{x_{1},y_{1}} & \sigma^{2}_{x_{1},x_{2}} & \cdots & \sigma^{2}_{x_{1},y_{3}} \\
    \sigma^{2}_{y_{1},x_{1}} & \sigma^{2}_{y_{1},y_{1}} & \sigma^{2}_{y_{1},x_{2}} & \cdots & \sigma^{2}_{y_{1},y_{3}} \\
    \sigma^{2}_{x_{2},x_{1}} & \sigma^{2}_{x_{2},y_{1}} & \sigma^{2}_{x_{2},x_{2}} & \cdots & \sigma^{2}_{x_{2},y_{3}} \\
    \vdots                   & \vdots                   & \vdots                   & \ddots & \vdots \\
    \sigma^{2}_{y_{3},x_{1}} & \sigma^{2}_{y_{3},y_{1}} & \sigma^{2}_{y_{3},x_{2}} & \cdots & \sigma^{2}_{y_{3},y_{3}}
\end{bmatrix}
$$
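The structure of this matrix can be verified on any dataset. The sketch below uses randomly generated data (an assumption, purely for illustration) and checks that the diagonal holds the per-variable sample variances and that the matrix equals its own transpose.

```python
import numpy as np

# Hypothetical data: 3 variables (rows), 6 observations (columns).
rng = np.random.default_rng(0)
data = rng.normal(size=(3, 6))

C = np.cov(data)

# Diagonal entries are the sample variances (ddof=1 matches the n-1 divisor).
diag_is_variance = np.allclose(np.diag(C), data.var(axis=1, ddof=1))

# Off-diagonal entries mirror each other: C[i, j] == C[j, i].
is_symmetric = np.allclose(C, C.T)

print(diag_is_variance, is_symmetric)
```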

### Our goal is to diagonalize the covariance matrix to remove redundancies. We can do this with the singular value decomposition (SVD) or an eigendecomposition.
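A minimal sketch of the diagonalization, assuming a small made-up covariance matrix: changing basis to the eigenvectors removes the off-diagonal terms, and because a covariance matrix is symmetric and positive semi-definite, its singular values match its eigenvalues.

```python
import numpy as np

# Hypothetical covariance matrix for two correlated variables.
C = np.array([[10.0, 4.0],
              [4.0, 2.5]])

# eigh is intended for symmetric matrices; eigenvalues are returned ascending.
eigvals, eigvecs = np.linalg.eigh(C)

# Expressing C in the eigenvector basis yields a diagonal matrix.
C_diag = eigvecs.T @ C @ eigvecs

# SVD returns singular values in descending order, so sort before comparing.
singular_values = np.linalg.svd(C, compute_uv=False)

print(np.round(C_diag, 10))
print(np.allclose(np.sort(singular_values), eigvals))
```

The diagonal of `C_diag` holds the variances along the new, uncorrelated axes.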

In [9]:
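# Six stacked coordinates (x1, y1, x2, y2, x3, y3) as a 6x1 column vector
X = np.array([[1],[2],[0],[-1],[1],[3]])

# Unnormalized outer product X X^T; the 1/(n-1) factor and mean-centering
# are omitted because X holds a single observation (n = 1)
cov_x = X@X.T

cov_x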

array([[ 1,  2,  0, -1,  1,  3],
       [ 2,  4,  0, -2,  2,  6],
       [ 0,  0,  0,  0,  0,  0],
       [-1, -2,  0,  1, -1, -3],
       [ 1,  2,  0, -1,  1,  3],
       [ 3,  6,  0, -3,  3,  9]])

In [10]:
# np.linalg.svd returns U, the singular values as a 1-D array, and V already transposed
U,D,V = np.linalg.svd(cov_x)

In [13]:
# Place the singular values on the diagonal of a 6x6 matrix
np.diag(D)

array([[1.60000000e+01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 1.72499419e-15, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 8.57460394e-17, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 8.62293005e-33,
        0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        2.64099661e-47, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 6.02065218e-51]])
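In the output above, only the first singular value (16) is meaningfully non-zero; the rest are at floating-point noise level, so the matrix has rank 1. One way to make that explicit is sketched below. The relative tolerance of `1e-10` is an assumption chosen for illustration, not a recommended default.

```python
import numpy as np

X = np.array([[1], [2], [0], [-1], [1], [3]])
cov_x = X @ X.T

# Only the singular values are needed here.
singular_values = np.linalg.svd(cov_x, compute_uv=False)

# Assumed tolerance: treat anything below 1e-10 of the largest value as zero.
tol = 1e-10 * singular_values.max()
effective_rank = int(np.sum(singular_values > tol))

print(effective_rank)
print(np.linalg.matrix_rank(cov_x))
```

The single non-zero value equals $X^{T}X = 1 + 4 + 0 + 1 + 1 + 9 = 16$, as expected for the outer product of one vector with itself.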