## PCA (Principal component analysis)

Use: dimensionality reduction

Why dimensionality reduction?

**EDA:**
- visualization
- redundant features

The goal of PCA is to find a lower-dimensional representation of your features that retains as much of the variance in the data as possible.

Steps for applying PCA:
- Obtain the covariance matrix of the (mean-centered) feature matrix
- Perform an eigendecomposition of the covariance matrix
- Use the eigenvectors to project the feature matrix into a new space
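The three steps can be sketched in NumPy as below. This is a minimal illustration rather than a reference implementation: the function name `pca_sketch`, the parameter `k` (number of components to keep) and the random input data are assumptions made for the example.

```python
import numpy as np

def pca_sketch(X, k):
    """Project X (m samples x n features) onto its top-k principal components."""
    # Centre each feature column so that Xc.T @ Xc / m is the covariance matrix.
    Xc = X - X.mean(axis=0)
    m = Xc.shape[0]
    # Step 1: covariance matrix of the features (n x n), 1/m convention.
    C = Xc.T @ Xc / m
    # Step 2: eigendecomposition. eigh is meant for symmetric matrices and
    # returns the eigenvalues in ascending order, eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 3: project the centred data onto the first k eigenvectors.
    return Xc @ eigvecs[:, :k], eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # made-up data: 50 samples, 3 features
Z, eigvals = pca_sketch(X, k=2)
print(Z.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting the eigenvalues in decreasing order ranks the components by how much variance they capture.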

## --- The Maths ---
Let $X$ be our feature matrix, with $m$ samples (rows) and $n$ features (columns):
$X \in \mathbb{R}^{m \times n}$

For each feature $j$:<br/>
mean,<br/> $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$

variance,<br/>
$var_j = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)} - \mu_j)^2$

For any two features $j$ and $k$, the covariance is<br/>
$covar_{j,k} = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)} - \mu_j)(x_k^{(i)} - \mu_k)$
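As a quick check, the formulas above can be written out directly and compared with NumPy's built-ins. The small array is made up for illustration. Note that `np.cov` divides by $m-1$ by default, so `bias=True` is needed to match the $\frac{1}{m}$ convention used here.

```python
import numpy as np

# Made-up data: 4 samples (rows), 2 features (columns).
X = np.array([[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.0]])
m = X.shape[0]

mu = X.sum(axis=0) / m                       # mean of each feature
var = ((X - mu) ** 2).sum(axis=0) / m        # variance of each feature
covar_12 = ((X[:, 0] - mu[0]) * (X[:, 1] - mu[1])).sum() / m

# NumPy equivalents (np.var uses 1/m by default, np.cov needs bias=True).
print(np.allclose(mu, X.mean(axis=0)))
print(np.allclose(var, X.var(axis=0)))
print(np.isclose(covar_12, np.cov(X, rowvar=False, bias=True)[0, 1]))
```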

Let's assume a feature matrix of the form
$X = \begin{bmatrix}x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)}\\x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)}\\x_{1}^{(3)} & x_{2}^{(3)} & x_{3}^{(3)}\\x_{1}^{(4)} & x_{2}^{(4)} & x_{3}^{(4)}\end{bmatrix}$, with $X \in \mathbb{R}^{4 \times 3}$.

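It is good practice to standardize the features before applying PCA, so that each feature column has mean $\mu = 0$ and variance $\sigma^2 = 1$. Otherwise, features measured on larger scales dominate the principal components.

Standardization =>  $X = \frac{X - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the per-column mean and standard deviation.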

The covariance matrix has the form
$C_X = \begin{bmatrix}var_1 & covar_{1,2} & covar_{1,3}\\covar_{2,1} & var_2 & covar_{2,3}\\covar_{3,1} & covar_{3,2} & var_3\end{bmatrix}$, with $C_X \in \mathbb{R}^{n \times n}$ (here $n = 3$).

The idea here is that, in the new representation, each feature should vary on its own and be uncorrelated with every other feature. Correlation values range from -1 to 1: a correlation of 1 means perfectly positively correlated, -1 means perfectly negatively correlated, and 0 means no linear correlation. For standardized features the covariance matrix equals the correlation matrix, so its diagonal entries are 1. What we want is a representation whose covariance matrix is zero everywhere off the diagonal, i.e. a diagonal covariance matrix.
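The link between covariance and correlation can be checked numerically: for standardized features, the $\frac{1}{m}$ covariance matrix coincides with the correlation matrix, so its diagonal is 1. A small sketch on made-up data (the scales and offsets are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data with very different scales and offsets per feature.
X = rng.normal(size=(100, 3)) * np.array([1.0, 5.0, 0.2]) + np.array([0.0, 10.0, -3.0])

# Standardize each column: zero mean, unit (population) variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

C = X_std.T @ X_std / X.shape[0]       # covariance of the standardized data
corr = np.corrcoef(X, rowvar=False)    # correlation of the raw data

print(np.allclose(C, corr))
print(np.allclose(np.diag(C), 1.0))
```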

To obtain a representation $X'$ of the feature matrix with a diagonal covariance matrix $C_X'$, we apply a linear transformation $P$: $X'^T = PX^T$, or equivalently $X' = XP^T$.

By the definition of covariance (for mean-centered $X$):
$$C_X = \frac{1}{m}X^TX$$
$$C_X' = \frac{1}{m}X'^TX'$$
$$C_X' = \frac{1}{m}(XP^T)^T(XP^T)$$
$$C_X' = \frac{1}{m}PX^TXP^T$$
$$C_X' = P\frac{1}{m}X^TXP^T$$
$$C_X' = PC_XP^T$$
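The identity $C_X' = PC_XP^T$ holds for any linear transformation $P$, not only for the eigenvector matrix. A short numerical check, assuming made-up centered data and an arbitrary random `P`:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
X = X - X.mean(axis=0)             # centre so that X.T @ X / m is the covariance
m = X.shape[0]

P = rng.normal(size=(3, 3))        # an arbitrary linear transformation
C_X = X.T @ X / m

X_new = X @ P.T                    # X' = X P^T
C_new = X_new.T @ X_new / m        # covariance of the transformed data

print(np.allclose(C_new, P @ C_X @ P.T))
```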

The covariance matrix $C_X$ is symmetric, and $C_X'$ needs to be diagonal. Hence the rows of $P$ have to be the eigenvectors of $C_X$.

This uses the property that for a symmetric matrix $A$ and the orthogonal matrix $Q$ whose columns are its orthonormal eigenvectors,
$Q^TAQ = D$, where $D$ is a diagonal matrix with the eigenvalues of $A$ on its diagonal.

Hence, with $P^T = Q$, $$C_X' = PC_XP^T = D$$
The diagonal entries of $D$ are the variances of the new features. $C_X'$ becomes the unit matrix $I$ only if each row of $P$ is additionally divided by the square root of its eigenvalue (whitening), which requires all eigenvalues to be non-zero.
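A numerical check of the diagonalization, on made-up correlated data. With orthonormal eigenvectors as the rows of `P`, the transformed covariance is the diagonal matrix of eigenvalues. The identity matrix appears only after the extra rescaling step (whitening), shown here as a hypothetical matrix `W`.

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up data: mix independent columns to get correlated features.
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))
X = X - X.mean(axis=0)
m = X.shape[0]
C_X = X.T @ X / m

eigvals, Q = np.linalg.eigh(C_X)   # columns of Q are orthonormal eigenvectors
P = Q.T                            # rows of P are the eigenvectors

D = P @ C_X @ P.T
print(np.allclose(D, np.diag(eigvals)))      # diagonal of eigenvalues

# Whitening: divide each row of P by sqrt(eigenvalue) to get unit variances.
W = P / np.sqrt(eigvals)[:, None]
print(np.allclose(W @ C_X @ W.T, np.eye(3)))
```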

## Working on a toy example

In [1]:
import numpy as np
X = np.array([[1,2,4],[1,3,6],[2,6,12],[3,4,8]])
m, n = X.shape

In [2]:
X

array([[ 1,  2,  4],
       [ 1,  3,  6],
       [ 2,  6, 12],
       [ 3,  4,  8]])

The last feature column in X is redundant: it is exactly twice the second column.

In [3]:
# apply feature scaling and normalization
X_norm = (X-X.mean(axis=0))/X.std(axis=0)

In [4]:
X_norm

array([[-0.90453403, -1.18321596, -1.18321596],
       [-0.90453403, -0.50709255, -0.50709255],
       [ 0.30151134,  1.52127766,  1.52127766],
       [ 1.50755672,  0.16903085,  0.16903085]])

In [5]:
X_norm.std(axis=0)

array([1., 1., 1.])

Obtain the covariance matrix of X_norm

In [6]:
C_X = (1/(m))*np.dot(X_norm.T, X_norm)

In [7]:
C_X

array([[1.        , 0.56061191, 0.56061191],
       [0.56061191, 1.        , 1.        ],
       [0.56061191, 1.        , 1.        ]])

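Obtain the eigenvalues and eigenvectors (principal components) of the covariance matrix. np.linalg.eig returns the eigenvectors as the columns of P, so this P corresponds to $P^T$ in the maths above.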

In [8]:
v, P = np.linalg.eig(C_X)

In [9]:
v

array([2.43732141e+00, 5.62678588e-01, 1.27131532e-32])

In [10]:
P

array([[ 4.82993297e-01,  8.75624049e-01,  1.96270488e-17],
       [ 6.19159703e-01, -3.41527836e-01, -7.07106781e-01],
       [ 6.19159703e-01, -3.41527836e-01,  7.07106781e-01]])
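Each eigenvalue is the variance captured by its principal component, so dividing by their sum gives the fraction of variance explained. The sketch below recomputes the toy example with `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order. The explicit sort and the name `ratio` are additions for illustration.

```python
import numpy as np

X = np.array([[1, 2, 4], [1, 3, 6], [2, 6, 12], [3, 4, 8]])
m = X.shape[0]
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
C_X = X_norm.T @ X_norm / m

v, P = np.linalg.eigh(C_X)         # ascending eigenvalues, eigenvectors as columns
order = np.argsort(v)[::-1]        # re-order: largest eigenvalue first
v, P = v[order], P[:, order]

ratio = v / v.sum()                # fraction of total variance per component
print(ratio)
```

Here the first two components carry roughly 81% and 19% of the variance, and the third carries essentially none, which reflects the redundant third column.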

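Obtain the projections (equivalent to X_norm.dot(P), since the columns of P are the eigenvectors). The third column is zero up to floating-point error because the third eigenvalue is zero.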

In [12]:
P.T.dot(X_norm.T).T

array([[-1.90208316e+00,  1.61706169e-02,  1.11022302e-16],
       [-1.06482642e+00, -4.45659309e-01,  5.55111512e-17],
       [ 2.02945560e+00, -7.75106748e-01, -2.22044605e-16],
       [ 9.37453975e-01,  1.20459544e+00,  1.38777878e-17]])
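Since the third projected column is numerically zero, it can be dropped without losing anything, which is the actual dimensionality reduction. A minimal sketch, assuming we keep the top two components (the names `P_k`, `Z` and `X_rec` are illustrative):

```python
import numpy as np

X = np.array([[1, 2, 4], [1, 3, 6], [2, 6, 12], [3, 4, 8]])
m = X.shape[0]
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
C_X = X_norm.T @ X_norm / m

v, P = np.linalg.eigh(C_X)
order = np.argsort(v)[::-1]
P = P[:, order]

P_k = P[:, :2]                     # keep the two leading eigenvectors
Z = X_norm @ P_k                   # reduced representation, shape (4, 2)
X_rec = Z @ P_k.T                  # map back to the original 3-feature space

print(Z.shape)
print(np.allclose(X_rec, X_norm))
```

The reconstruction is exact here only because the discarded eigenvalue is zero. In general, dropping components with non-zero eigenvalues loses the corresponding share of variance.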

Using scikit-learn

In [13]:
from sklearn import decomposition

In [14]:
pca = decomposition.PCA(n_components=3)

In [15]:
pca.fit_transform(X_norm)

array([[-1.90208316e+00,  1.61706169e-02,  7.50909007e-18],
       [-1.06482642e+00, -4.45659309e-01,  4.16986813e-17],
       [ 2.02945560e+00, -7.75106748e-01,  1.68335240e-17],
       [ 9.37453975e-01,  1.20459544e+00,  2.61579580e-17]])
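scikit-learn documents its PCA as being based on the singular value decomposition (SVD) of the centered data rather than on an explicit eigendecomposition of the covariance matrix. The sketch below uses NumPy only and is not scikit-learn's actual code. It shows that the two routes agree on the toy example, up to the sign of each component, which is arbitrary.

```python
import numpy as np

X = np.array([[1, 2, 4], [1, 3, 6], [2, 6, 12], [3, 4, 8]])
m = X.shape[0]
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition route, sorted by decreasing eigenvalue.
v, P = np.linalg.eigh(X_norm.T @ X_norm / m)
order = np.argsort(v)[::-1]
v, P = v[order], P[:, order]
Z_eig = X_norm @ P

# SVD route: X_norm = U @ diag(S) @ Vt; the rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(X_norm, full_matrices=False)
Z_svd = U * S                      # same as X_norm @ Vt.T

print(np.allclose(S**2 / m, v))                     # singular values vs eigenvalues
print(np.allclose(np.abs(Z_svd), np.abs(Z_eig)))    # projections agree up to sign
```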

In [16]:
pca.get_covariance()

array([[1.33333333, 0.74748255, 0.74748255],
       [0.74748255, 1.33333333, 1.33333333],
       [0.74748255, 1.33333333, 1.33333333]])
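The values returned by get_covariance() are larger than those in C_X by a factor of 4/3. This is consistent with a $\frac{1}{m-1}$ (sample) normalization instead of the $\frac{1}{m}$ used in the maths above. The check below reproduces the same numbers with `np.cov`, whose default also divides by $m-1$.

```python
import numpy as np

X = np.array([[1, 2, 4], [1, 3, 6], [2, 6, 12], [3, 4, 8]])
m = X.shape[0]
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

C_pop = X_norm.T @ X_norm / m              # 1/m convention used above
C_sample = np.cov(X_norm, rowvar=False)    # 1/(m-1) convention (NumPy default)

print(np.allclose(C_sample, C_pop * m / (m - 1)))
print(np.round(C_sample, 8))
```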

The toy example above applies PCA to a feature matrix of size (4 x 3); the same steps extend to feature matrices of any size.

## Reference materials:

Sebastian Raschka's post on <a href='https://sebastianraschka.com/Articles/2014_pca_step_by_step.html'>Implementing a Principal Component Analysis (PCA)</a><br/>
Rishav Kumar's post on <a href='https://medium.com/@aptrishu/understanding-principle-component-analysis-e32be0253ef0'>Understanding Principal Component Analysis</a>