# Dimensionality Reduction
* PCA: Unsupervised. $C = P \Lambda P^{-1}$ for covariance matrix C. This is the eigen decomposition of the covariance matrix. PCA separates the data by variance and gives an orthonormal basis ranked by variance explained.
* ICA: A higher-order analogue of PCA, but the basis is not orthonormal.
* SVD: Unsupervised. $D = Q \Sigma P^{-1}$ for data matrix D. This is a linear algebra decomposition of the data matrix, which always exists. Unlike the eigen decomposition behind PCA, SVD is not restricted to square matrices. SVD discovers the minimum number of latent factors required (the rank of D) and the singular values that rank those factors.

See also: Eigen Decomposition
* $A = Q \Lambda Q^{-1}$
* Unsupervised. Matrix A must be square and diagonalizable (so that Q is invertible). Q is a basis matrix whose columns are eigenvectors. Λ is a diagonal matrix (or vector) of eigenvalues, ordered by their utility for data reconstruction. Large eigenvalues capture trends, small eigenvalues fit the noise.


## PCA = Principal Component Analysis
PCA uses the eigen decomposition of the covariance matrix:  
cov = $P \Lambda P^{-1}$   

* PCA requires mean-centered data.
* PCA finds an orthonormal basis, which amounts to an axis rotation (a linear transformation). In effect, PCA rotates the axes one at a time so that each new axis captures the maximum remaining variance.
* PCA can be used for dimensionality reduction.  
* PCA uses the eigen decomposition of the covariance matrix. The decomposition is possible because the covariance matrix is square, symmetric, and positive-semidefinite (it is computed from mean-centered data).
* After transformation, there is no covariance. The new covariance matrix is $\Lambda$ which is a diagonal matrix so every off-diagonal entry is zero.
* For dimensionality reduction (with lossy reconstruction), discard the axes with the smallest eigenvalues.
* Each PCA axis is a linear combination of the original axes.
* PCA only discovers linear transformations. But the kernel trick can uncover transformations that are linear in the kernel space but non-linear in the original space. Kernel PCA evaluates a kernel function on each pair of points in the original space as a proxy for computing their inner products in the kernel space. One kind of kernel is Gaussian.
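
The kernel PCA idea in the last bullet can be sketched in NumPy. This is a minimal sketch, assuming a Gaussian (RBF) kernel; the function name `rbf_kernel_pca`, the width parameter `gamma`, and the toy ring data are illustrative assumptions, not part of these notes.

```python
import numpy as np

def rbf_kernel_pca(X, k, gamma=1.0):
    """Project the rows of X onto the top-k kernel principal components."""
    # Pairwise squared Euclidean distances in the original space.
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Gaussian kernel: stands in for inner products in the kernel space.
    K = np.exp(-gamma * sq_dists)
    # Center the kernel matrix (the kernel-space analogue of mean centering).
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigh returns ascending eigenvalues for symmetric matrices; flip to descending.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    eigvals, eigvecs = eigvals[::-1][:k], eigvecs[:, ::-1][:, :k]
    # Coordinates of the input points along each kernel component.
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

# Toy data (assumed): two concentric rings, which no linear projection separates.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=100)
radii = np.repeat([1.0, 3.0], 50)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
Z = rbf_kernel_pca(X, k=2, gamma=0.5)
```

The result `Z` has one row per input point and one column per kernel component. How well the rings separate depends on the choice of `gamma`, which would need tuning on real data.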

Data matrix D (n instances x d dimensions), with $\mu$ the row vector of feature means:   
* Scatter matrix = $D^T D$  
* Covariance matrix = $\frac{D^T D}{n} - \mu^T \mu$ (scatter normalized and mean centered)    
* Transformed data $D' = DP$   
* PCA removes correlations. Features of the transformed data have zero covariance. 
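
The PCA steps above can be sketched in NumPy. This is a minimal sketch under stated assumptions: D is (n x d) with instances as rows, the covariance is normalized by n, and the toy data and variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumed): 500 instances, 3 correlated features with non-zero means.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.1]])
D = rng.normal(size=(500, 3)) @ mixing + np.array([5.0, -2.0, 1.0])

n = D.shape[0]
mu = D.mean(axis=0, keepdims=True)     # row vector of feature means, (1 x d)
cov = D.T @ D / n - mu.T @ mu          # scatter normalized and mean centered

# eigh is meant for symmetric matrices and returns ascending eigenvalues.
eigvals, P = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # rank axes by variance explained
eigvals, P = eigvals[order], P[:, order]

D_centered = D - mu
D_new = D_centered @ P                 # transformed data D' = DP (on centered D)
cov_new = D_new.T @ D_new / n          # diagonal: the eigenvalues, no covariance

k = 2                                  # keep the k highest-variance axes
D_reduced = D_new[:, :k]
D_approx = D_reduced @ P[:, :k].T + mu # lossy reconstruction in the original axes
```

Here `cov_new` comes out diagonal up to rounding, and the squared reconstruction error of `D_approx` equals n times the discarded eigenvalue.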

## SVD
SVD is closely tied to the eigenvector decomposition of the scatter matrix: the singular vectors of D are eigenvectors of $D^T D$ and $D D^T$.   

SVD factors the data matrix D into 3 matrices:   
$D = Q \Sigma P^{-1}$  
It is a fundamental fact of linear algebra that this is always possible.  

The $\Sigma$ matrix of factors is ranked by importance.   
You get a lossy reconstruction if you reconstruct the data with a subset of factors. 

* SVD does not require mean centering.
* SVD is ideal for sparse non-negative matrices e.g. word vectors of documents.
* SVD can be rewritten as a spectral decomposition: a sum of rank-1 matrices, $D = \sum_i \sigma_i q_i p_i^T$.
* SVD discovers the latent factors and ranks them.
* SVD can provide a lossy reconstruction of the data.

### SVD explained with energy
SVD maximizes the energy retained on the reduced dimensions.
Energy of the original data = sum of squared distances to the origin.
Energy of the reconstruction is unchanged by axis rotation
and slightly reduced by dimension reduction.
Equivalently, SVD minimizes the reconstruction error, i.e. the
sum of squared distances between pairs of (original, reconstructed) points.  
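
The energy bookkeeping can be checked numerically. This is a minimal sketch on random toy data (an assumption); it relies on the fact that the total energy equals the sum of squared singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(50, 6))           # toy data: 50 points in 6 dimensions

Q, s, Pt = np.linalg.svd(D, full_matrices=False)

energy_original = np.sum(D**2)             # sum of squared distances to origin
energy_rotated = np.sum((D @ Pt.T)**2)     # axis rotation keeps the energy

k = 3
D_k = (Q[:, :k] * s[:k]) @ Pt[:k, :]       # reconstruction from the top k factors

energy_kept = np.sum(D_k**2)               # equals the sum of the top k s**2
reconstruction_error = np.sum((D - D_k)**2)  # equals the sum of discarded s**2
```

The kept energy and the reconstruction error add up to the original energy, so maximizing one is the same as minimizing the other.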

### SVD explained with latent factors  
Say D = (n x d) = movie-patrons x movie-titles.
SVD discovers latent concepts that explain both.
Example: SVD might discover the latent factor "sci-fi" 
where some movies have "sci-fi" and some patrons like "sci-fi".
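
A tiny example makes the latent factors visible. This is a sketch on invented data: 5 patrons, 4 movies, where movies 0 and 1 are assumed to be sci-fi and movies 2 and 3 romance.

```python
import numpy as np

# Rows = patrons, columns = movies (1 = liked). Hypothetical toy data.
D = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

Q, s, Pt = np.linalg.svd(D, full_matrices=False)

# Number of latent factors = number of non-zero singular values = rank of D.
num_factors = int(np.sum(s > 1e-10))

# Absolute loadings, since each singular vector is only defined up to sign.
movie_loadings = np.abs(Pt[:num_factors])        # concepts x movies
patron_loadings = np.abs(Q[:, :num_factors])     # patrons x concepts
```

The first concept loads only on the two sci-fi movies and the three patrons who liked them. The second loads only on the romance movies and their two fans.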

### Using SVD for noise reduction
Using SVD for lossy reconstruction, the lost part tends to be noise and outliers.
So a lossy reconstruction can provide better training data.
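
A small synthetic experiment illustrates the denoising effect. This sketch assumes the clean data is exactly low rank and the noise is small Gaussian noise; real data will rarely be this tidy, and the sizes and noise level are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, true_rank = 200, 30, 2

# Clean low-rank signal plus small additive noise (both assumed for the demo).
signal = rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, d))
noisy = signal + 0.1 * rng.normal(size=(n, d))

Q, s, Pt = np.linalg.svd(noisy, full_matrices=False)

k = true_rank                                  # keep only the strongest factors
denoised = (Q[:, :k] * s[:k]) @ Pt[:k, :]      # lossy reconstruction

error_noisy = np.linalg.norm(noisy - signal)       # distance of raw data to truth
error_denoised = np.linalg.norm(denoised - signal) # distance after truncation
```

In this setup the truncated reconstruction sits closer to the clean signal than the noisy data does. In practice k is unknown and is usually chosen by inspecting how quickly the singular values decay.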

### The math behind SVD
$D = Q \Sigma P^{-1}$  
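SVD uses this factorization of the data matrix, which always exists.
The singular values are unique; the singular vectors are unique only up to sign
(and up to rotation within a subspace when singular values repeat).  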

Intuitively, SVD breaks D into a rotation, a scaling, and an inverse rotation.  

Matrix $\Sigma$ is rectangular but diagonal.   
Its entries are non-negative.  
Its number of non-zero entries equals Rank(D) and is <= min(n,d).

D = data matrix in its original axes.   
Q = left singular matrix; its orthonormal columns are the left singular vectors of $D$.  
$\Sigma$ = non-negative singular values along the diagonal (in decreasing order).  
P = right singular matrix; its orthonormal columns are the right singular vectors of $D$.  

D = (n x d) = n instances x d dimensions.  
Q = (n x n)  
$\Sigma$ = (n x d)   
P = (d x d)  

$P^T P = I$ because the vectors are orthonormal, so $P^{-1} = P^T$.   
$Q^T Q = I$ because the vectors are orthonormal, so $Q^{-1} = Q^T$.   

Q = eigenvectors of $D D^T$  
P = eigenvectors of $D^T D$  
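
These properties can be verified numerically. This is a minimal sketch using `numpy.linalg.svd` on random toy data; note that NumPy returns $P^T$ (named `Pt` here) and the singular values as a vector rather than as the rectangular matrix $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 6, 4
D = rng.normal(size=(n, d))             # toy data: n instances x d dimensions

# full_matrices=True gives the square Q (n x n) and P (d x d) described above.
Q, s, Pt = np.linalg.svd(D, full_matrices=True)
P = Pt.T

# Build the rectangular-but-diagonal Sigma (n x d) from the singular values.
Sigma = np.zeros((n, d))
Sigma[:len(s), :len(s)] = np.diag(s)

reconstruction = Q @ Sigma @ P.T        # equals D; P^{-1} = P^T by orthonormality

# Eigenvalues of D^T D and D D^T are the squared singular values (plus zeros).
eig_DtD = np.sort(np.linalg.eigvalsh(D.T @ D))[::-1]
eig_DDt = np.sort(np.linalg.eigvalsh(D @ D.T))[::-1]
```

`eig_DtD` matches `s**2`, and `eig_DDt` matches `s**2` padded with n - d zeros, in line with Q and P being the eigenvectors of $D D^T$ and $D^T D$.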

### SVD example
The data is D(n viewers x d movies).   
The n viewers are each represented by a like/dislike vector over the d movies.  
Seek to explain the viewers using just k movie concepts.  
Reduce d dimensions to k concepts:   
Choose k < min(n instances, d dimensions).  
By SVD, the full dataset is:    
$D = Q \Sigma P^{-1}$  
Now get a rank k approximation of the data:       
$R_k = Q_k \Sigma_k P_k^T$, where $Q_k$ is (n viewers x k concepts), $\Sigma_k$ is (k x k, diagonal, ordered), and $P_k^T$ is (k concepts x d movies).   
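
The rank k approximation can be sketched in NumPy as follows. The like/dislike matrix and the helper name `rank_k_approximation` are hypothetical, chosen only to illustrate the shapes.

```python
import numpy as np

# Hypothetical data: 6 viewers x 5 movies (1 = like, 0 = dislike).
D = np.array([[1, 1, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1],
              [0, 0, 0, 1, 1],
              [0, 1, 0, 1, 1],
              [0, 0, 0, 1, 1]], dtype=float)

def rank_k_approximation(D, k):
    """Return the rank-k reconstruction of D and its three truncated factors."""
    Q, s, Pt = np.linalg.svd(D, full_matrices=False)
    Q_k = Q[:, :k]                 # n viewers x k concepts
    Sigma_k = np.diag(s[:k])       # k x k, diagonal, ordered
    Pt_k = Pt[:k, :]               # k concepts x d movies
    return Q_k @ Sigma_k @ Pt_k, (Q_k, Sigma_k, Pt_k)

k = 2                              # k < min(n, d)
R_k, (Q_k, Sigma_k, Pt_k) = rank_k_approximation(D, k)

# Each viewer described by k concept scores instead of d movie opinions.
viewer_concepts = Q_k @ Sigma_k
```

`viewer_concepts` is the reduced (n x k) representation. It equals `D @ Pt_k.T`, which is how unseen viewers could be mapped into the same concept space.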

## PCA vs SVD = Singular Value Decomposition
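* PCA is a special case of SVD. 
* SVD and PCA give the same basis when all feature means are 0, i.e. when the data is mean centered. With the covariance normalized by n, the eigenvalues are then $\lambda_i = \sigma_i^2 / n$.
* PCA generates one basis for the matrix rows. SVD generates one basis for the rows and one for the columns. 
* PCA's eigen decomposition is restricted to diagonalizable (square) matrices, such as the covariance matrix. SVD applies to any matrix.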
* PCA is applied to the covariance matrix. SVD is applied to the data matrix.
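
The relationship in this list can be checked numerically. This sketch assumes random toy data with a deliberately non-zero mean and a covariance normalized by n; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
D = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4)) + 10.0  # non-zero means
n = D.shape[0]
D_centered = D - D.mean(axis=0)

# PCA route: eigen decomposition of the covariance matrix.
cov = D_centered.T @ D_centered / n
eigvals, P_pca = np.linalg.eigh(cov)
eigvals, P_pca = eigvals[::-1], P_pca[:, ::-1]     # descending order

# SVD route: decompose the mean-centered data matrix directly.
Q, s, Pt = np.linalg.svd(D_centered, full_matrices=False)
P_svd = Pt.T

# |cosine| between matching axes; 1.0 means the same axis up to sign.
axis_alignment = np.abs(np.sum(P_pca * P_svd, axis=0))

# Without centering, the leading factor also has to absorb the mean offset.
s_raw = np.linalg.svd(D, compute_uv=False)
```

On centered data the two routes agree: `axis_alignment` is all ones and `eigvals` equals `s**2 / n`. The uncentered `s_raw[0]` is larger than `s[0]` because the mean contributes energy as well.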

## PCA vs ICA = Independent Component Analysis
* PCA components are orthogonal, but ICA components are not.
* PCA removes correlation to focus on uncorrelated components. 
* ICA finds higher-order dependencies. Usually, PCA is applied first for "whitening" to remove correlations. 
* ICA aims to reconstruct the data from a linear combination of independent signals.
* ICA features are not ranked because they are all treated as equally important.
* PCA uses second-order stats (variance), ICA uses higher-order.
* PCA implicitly assumes underlying Gaussians, which are fully described by their mean and variance.
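
The whitening step mentioned above can be sketched in NumPy. This shows only the PCA-based preprocessing, under the assumption of two uniform sources and an invented mixing matrix; an ICA algorithm would still have to find the remaining rotation using higher-order statistics, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(5)
# Two independent non-Gaussian (uniform) sources, linearly mixed. Toy values.
sources = rng.uniform(-1.0, 1.0, size=(1000, 2))
mixing = np.array([[1.0, 0.5],
                   [0.3, 2.0]])
X = sources @ mixing.T

n = X.shape[0]
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / n
eigvals, P = np.linalg.eigh(cov)

# Whitening: rotate onto the PCA axes, then rescale each axis to unit variance.
X_white = (X_centered @ P) / np.sqrt(eigvals)
cov_white = X_white.T @ X_white / n     # identity: uncorrelated, unit variance
```

After whitening, all second-order structure is gone (`cov_white` is the identity), so any remaining dependence between the columns is higher-order.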

## PCA vs MDS
MDS = multidimensional scaling.

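MDS is a generalization of PCA: classical MDS applied to Euclidean distances yields essentially the same embedding as PCA (up to axis sign flips). 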
MDS is usually used for visualization e.g. to show separability of two classes.   

MDS will reduce the data to lower dimensions while minimizing distortion.
MDS will maximize the agreement of pairwise distances before & after the transformation.
MDS uses a stress() formula that penalizes distortion induced by the transformation.
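
The following sketch shows classical MDS and a stress measure in NumPy. The stress formula used here is one common normalized form and is an assumption, since these notes do not specify which variant is meant; the function names and toy data are also illustrative.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

def classical_mds(dist, k):
    """Embed points in k dimensions from their pairwise distance matrix."""
    n = dist.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)     # centering matrix
    B = -0.5 * J @ (dist**2) @ J                 # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    eigvals, eigvecs = eigvals[::-1][:k], eigvecs[:, ::-1][:, :k]
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

def stress(dist_before, dist_after):
    """Normalized distortion of pairwise distances (assumed variant)."""
    return np.sqrt(np.sum((dist_before - dist_after)**2) / np.sum(dist_before**2))

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 3)) * np.array([3.0, 2.0, 0.5])   # toy 3-D data
dist = pairwise_distances(X)

stress_2d = stress(dist, pairwise_distances(classical_mds(dist, k=2)))
stress_3d = stress(dist, pairwise_distances(classical_mds(dist, k=3)))

# PCA projection onto the top 2 axes, for comparison with the 2-D MDS embedding.
X_centered = X - X.mean(axis=0)
_, _, Pt = np.linalg.svd(X_centered, full_matrices=False)
pca_2d = X_centered @ Pt[:2].T
same_as_pca = np.allclose(pairwise_distances(pca_2d),
                          pairwise_distances(classical_mds(dist, k=2)),
                          atol=1e-8)
```

Keeping all 3 dimensions gives essentially zero stress, dropping to 2 introduces some distortion, and the 2-D MDS embedding has the same pairwise distances as the PCA projection.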