# Dimensionality Reduction
* PCA: Unsupervised. $C = P \Lambda P^{-1}$ for covariance matrix C. This is the eigen decomposition of the covariance matrix. PCA separates the data by variance. PCA gives an orthonormal basis ranked by variance explained.
* ICA: A higher-order analogue of PCA, but the basis is not orthonormal.
* SVD: Unsupervised. $D = Q \Sigma P^{-1}$ for data D. This is the linear algebra decomposition of the data matrix, which always exists. SVD is not restricted to special cases, as PCA is. SVD discovers the minimum required latent factors (rank of D). It discovers the singular values that rank the factors.

See also: Eigen Decomposition
* $A = Q \Lambda Q^{-1}$: Unsupervised. Matrix A must be square and diagonalizable (invertibility is not required). Q is a basis matrix of eigenvectors. Λ is a diagonal matrix or vector of eigenvalues in order of their utility for data reconstruction. Large eigenvalues capture trends, small eigenvalues fit the noise.


## PCA = Principal Component Analysis
PCA uses the eigen decomposition of the covariance matrix.

* PCA discovers an axis rotation, i.e. a linear transformation.
* Uses the eigen decomposition of the covariance matrix.
* The covariance matrix is always square, symmetric, and positive semi-definite.
* Prerequisite: the data must be mean centered.
* Finds an orthonormal basis. 
* In effect, it picks the axes one at a time, each chosen to maximize the remaining variance along it while staying orthogonal to the axes already chosen.
* Discovers linear transformations only. (But kernel trick can uncover non-linear ones.)
* Under the new coordinates, there is no covariance between axes, and the covariance matrix equals the diagonal eigenvalue matrix $\Lambda$.
* For dimensionality reduction (with lossy reconstruction), discard the axes with the smallest eigenvalues.

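The eigen decomposition of the covariance matrix of data matrix D (n x d, with $\mu$ the 1 x d row vector of feature means):   
Scatter matrix = $D^T D$  
Covariance matrix = $\frac{D^T D}{n} - \mu^T \mu$ (subtracting $\mu^T \mu$ performs the mean centering)    
Covariance = $P \Lambda P^{-1}$   
This factorization with an orthonormal P is possible because the covariance matrix is symmetric; because it is also positive semi-definite, the eigenvalues are non-negative.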
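
As a minimal sketch of the PCA steps above, the following NumPy snippet builds the covariance matrix, takes its eigen decomposition, and rotates the data onto the new axes. It assumes D is laid out as n instances x d dimensions, so the d x d scatter matrix is `D.T @ D`. The data, the seed, and the names `Dc`, `Z`, and `Z_k` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: n=200 instances, d=3 correlated dimensions.
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
D = rng.normal(size=(200, 3)) @ mix
n = D.shape[0]
mu = D.mean(axis=0, keepdims=True)   # 1 x d row vector of feature means

# Covariance two ways: subtracting mu^T mu, or centering the data first.
C = D.T @ D / n - mu.T @ mu
Dc = D - mu
C_centered = Dc.T @ Dc / n

# eigh is meant for symmetric matrices; eigenvalues come back ascending.
eigvals, P = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]    # re-rank by variance explained
eigvals, P = eigvals[order], P[:, order]

# Rotate the centered data onto the new axes.
Z = Dc @ P
C_rotated = Z.T @ Z / n              # diagonal: no covariance between axes

# Lossy reduction: keep only the k axes with the largest eigenvalues.
k = 2
Z_k = Z[:, :k]
```

Here `C_rotated` matches `np.diag(eigvals)` up to rounding, which is the "no covariance under the new coordinates" property in numeric form.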

## SVD
SVD uses the factorization of one matrix into three.

The factors are ranked by importance.   
You get a lossy reconstruction if you reconstruct the data with a subset of factors. 

* Does not require mean centering.
* Ideal for sparse non-negative matrices e.g. word vectors of documents.
* SVD can be related to the spectral (eigen) decomposition: the squared singular values of D are the eigenvalues of $D^T D$ and $D D^T$.
* SVD discovers the latent factors and ranks them.
* SVD can provide a lossy reconstruction of the data.
* SVD maximizes the energy on the reduced dimensions.

### Energy
Energy of original = sum of squared distances to origin.  
Energy of reconstructed is unchanged by axis rotation, 
and reduced by dimension reduction (by the sum of the squared discarded singular values).  
The SVD minimizes the reconstruction error i.e.
sum of squared distances between pairs of (original,reconstructed) points.  
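
A small numeric check of these energy statements, as a sketch with hypothetical random data. It assumes NumPy's `np.linalg.svd`, which returns the singular values as a vector `s` and the right singular vectors already transposed (`Pt`).

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(100, 5))              # hypothetical n=100, d=5

Q, s, Pt = np.linalg.svd(D, full_matrices=False)

energy = np.sum(D ** 2)                    # sum of squared distances to origin
energy_rotated = np.sum((D @ Pt.T) ** 2)   # same points on the rotated axes

k = 3
D_k = (Q[:, :k] * s[:k]) @ Pt[:k, :]       # rank-k reconstruction
energy_k = np.sum(D_k ** 2)                # energy kept by the k factors
error_k = np.sum((D - D_k) ** 2)           # squared reconstruction error

print(energy, energy_rotated)              # equal up to rounding
print(energy_k + error_k)                  # adds back up to the original energy
```

The kept energy equals the sum of the k largest squared singular values, and the reconstruction error equals the sum of the discarded ones.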

### Latent factors  
Say D = (n x d) = movie-patrons x movie-titles.   
SVD discovers latent concepts that explain both   
e.g. sci-fi might explain why some of the movie patrons like some of the movies.

### Noise reduction
The lossy part of the lossy reconstruction 
tends to fall on noise and outliers.
So the lossy reconstruction can provide better training data.
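
The idea can be illustrated with a synthetic sketch. It assumes the clean signal really is low rank (rank 2 here), that the noise is small relative to the signal, and that k is known in advance. None of these is guaranteed on real data, and all values below are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical clean signal of rank 2, plus small Gaussian noise.
clean = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 20))
noisy = clean + 0.1 * rng.normal(size=clean.shape)

Q, s, Pt = np.linalg.svd(noisy, full_matrices=False)

k = 2                                        # assumed known for this sketch
denoised = (Q[:, :k] * s[:k]) @ Pt[:k, :]    # rank-k lossy reconstruction

err_noisy = np.linalg.norm(noisy - clean)        # distance of noisy data from truth
err_denoised = np.linalg.norm(denoised - clean)  # distance after truncation
print(err_noisy, err_denoised)
```

Under these assumptions the truncated reconstruction sits closer to the clean signal than the noisy input does, because most of the noise lies in the discarded directions.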

### The math behind svd
$D = Q \Sigma P^{-1}$  
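SVD uses this factorization of the data matrix,
which always exists. Because P is orthonormal, $P^{-1} = P^T$.  
The singular values are unique; the singular vectors are unique only up to sign (and up to rotation when singular values repeat).  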

Intuitively, SVD breaks D into rotation, scaling, inverse rotation.  

Matrix $\Sigma$ is rectangular but diagonal.   
Its entries are non-negative.  
Its number of non-zero entries is the rank of D.  

D = data matrix on original axes.   
Q = left singular matrix; its columns are the orthonormal left singular vectors of $D$.  
$\Sigma$ = non-negative singular values along the diagonal (in decreasing order)  
P = right singular matrix; its columns are the orthonormal right singular vectors of $D$.  

D = (n x d) = n instances x d dimensions.  
Q = (n x n)  
$\Sigma$ = (n x d)   
P = (d x d)  

$P^T P = I$ because the vectors are orthonormal.   
$Q^T Q = I$ because the vectors are orthonormal.   

Q = eigenvectors of $D D^T$  
P = eigenvectors of $D^T D$  

Number of non-zero entries in $\Sigma$
equals Rank(D) and is <= min(n,d).
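
The shapes and identities above can be checked numerically. This is a sketch on hypothetical random data. It relies on `np.linalg.svd` with `full_matrices=True`, which returns the square Q and P factors, and it rebuilds the rectangular $\Sigma$ by hand because NumPy returns only the vector of singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 4
D = rng.normal(size=(n, d))              # hypothetical data, n > d

# full_matrices=True gives Q as (n x n) and Pt as (d x d).
Q, s, Pt = np.linalg.svd(D, full_matrices=True)
P = Pt.T

# Rectangular-diagonal Sigma (n x d) built from the singular values.
Sigma = np.zeros((n, d))
Sigma[:len(s), :len(s)] = np.diag(s)

reconstructed = Q @ Sigma @ P.T          # P^{-1} = P^T for orthonormal P

# Squared singular values are the eigenvalues of D^T D.
eig_DtD = np.sort(np.linalg.eigvalsh(D.T @ D))[::-1]

# Rank = number of non-zero singular values (with a small tolerance).
rank = int(np.sum(s > 1e-10))
print(Q.shape, Sigma.shape, P.shape, rank)
```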

### SVD example
The data is D(n viewers x d movies).   
The n viewers are each represented by like/dislike vector of d movies.  
Seek to explain the viewers using just k movie concepts.  
Reduce d dimensions to k concepts:   
Choose k < min(n instances, d dimensions).  
The full dataset is:    
$D = Q \Sigma P^{-1}$  
Now get the rank-k approximation of the data:       
R(k)= Q(n viewers x k concepts) * (k x k, diagonal, ordered) * Pt(k concepts x d movies)  
Rely on SVD to provide the k latent concepts.  
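
A sketch of the rank-k approximation on a tiny, made-up like/dislike matrix. The viewer rows, the movie columns, and the genre grouping in the comments are hypothetical and chosen only so that two concepts are plausible.

```python
import numpy as np

# Hypothetical like (1) / dislike (0) matrix: 6 viewers x 5 movies.
# Columns 0-2 stand for sci-fi titles, columns 3-4 for romance titles.
D = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

Q, s, Pt = np.linalg.svd(D, full_matrices=False)

k = 2
Q_k = Q[:, :k]          # n viewers x k concepts
S_k = np.diag(s[:k])    # k x k, diagonal, ordered
Pt_k = Pt[:k, :]        # k concepts x d movies

R_k = Q_k @ S_k @ Pt_k  # rank-k approximation, same shape as D
error = np.sum((D - R_k) ** 2)
print(np.round(R_k, 2))
```

Each row of `Pt_k` weights the movies for one concept, and each row of `Q_k` weights the concepts for one viewer. The signs of these vectors are arbitrary, so only their relative pattern is meaningful.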

## PCA vs SVD = Singular Value Decomposition
* PCA is a special case of SVD. They give the same basis on mean-centered data.
* Whenever all feature means are 0, SVD==PCA.
* PCA generates one basis for the matrix rows. SVD generates one for the rows and one for the columns. 
* PCA's eigen decomposition needs a square, diagonalizable matrix (the covariance matrix always qualifies). SVD works on any rectangular matrix.
* PCA is applied to the covariance matrix. SVD is applied to the data matrix.
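
The equivalence on mean-centered data can be checked with a short sketch. The data is hypothetical, and the covariance is divided by n to match the formula used earlier in these notes. The two bases can differ in the sign of each axis, so the comparison uses the absolute cosine between matching axes.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data with distinct variances per axis and a non-zero mean.
D = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5]) + 5.0
n = D.shape[0]
Dc = D - D.mean(axis=0)                  # mean centering

# PCA: eigen decomposition of the covariance matrix.
C = Dc.T @ Dc / n
eigvals, P_pca = np.linalg.eigh(C)
eigvals, P_pca = eigvals[::-1], P_pca[:, ::-1]   # descending order

# SVD: applied directly to the centered data matrix.
_, s, Pt = np.linalg.svd(Dc, full_matrices=False)
P_svd = Pt.T

# Eigenvalues of the covariance are the squared singular values over n.
print(eigvals, s ** 2 / n)

# |cosine| between matching axes; 1.0 means same axis up to sign.
alignment = np.abs(np.sum(P_pca * P_svd, axis=0))
print(alignment)
```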

## PCA vs ICA = Independent Component Analysis
* PCA components are orthogonal, but ICA components are not.
* PCA removes correlation to focus on uncorrelated components. 
* ICA finds higher-order dependencies. Usually, PCA is applied first for "whitening" to remove correlations. 
* ICA aims to reconstruct the data from linear combination of independent signals.
* ICA components have no inherent ranking; unlike PCA, the method does not order them by variance explained.
* PCA uses second-order stats (variance), ICA uses higher-order.
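* PCA implicitly assumes Gaussian data (it uses only variance, and uncorrelated implies independent only for Gaussians). ICA needs non-Gaussian sources.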