# Dimensionality Reduction

## PCA
* PCA discovers an axis rotation, i.e. a linear transformation.
* Uses the eigen decomposition of the (symmetric positive-semidefinite) covariance matrix.
* Prerequisite is mean centering.
* Finds an orthonormal basis. 
* In effect, it recursively rotates the axes so that each successive axis captures the maximum remaining variance.
* Discovers linear transformations only. (But the kernel trick can uncover non-linear ones.)
* In the new coordinates the features are uncorrelated: the covariance matrix is the diagonal matrix of eigenvalues.
* For dimensionality reduction (with lossy reconstruction), discard the axes with the smallest eigenvalues.
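
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `pca_eig`, the synthetic data, and the choice to divide by n (rather than n - 1) are assumptions made for the example.

```python
import numpy as np

def pca_eig(D, k):
    """Project the rows of D (n x d) onto the top-k principal axes."""
    centered = D - D.mean(axis=0)             # prerequisite: mean centering
    cov = centered.T @ centered / len(D)      # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs[:, :k]        # rotate, then keep k axes
    return scores, eigvals, eigvecs

# Synthetic data (illustrative only): 200 instances, 3 correlated features.
rng = np.random.default_rng(0)
mix = np.array([[3.0, 1.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 0.1]])
D = rng.normal(size=(200, 3)) @ mix

scores, eigvals, eigvecs = pca_eig(D, k=2)
new_cov = scores.T @ scores / len(D)          # covariance in the new coordinates
print(np.round(new_cov, 6))                   # off-diagonal entries are ~0
```

The covariance of the projected data comes out diagonal, with the two largest eigenvalues on the diagonal, which is the "no covariance under the new coordinates" property.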

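The eigendecomposition for a data matrix D (n x d, one instance per row):   
Scatter matrix = $D^T D$  
Covariance matrix = $\frac{D^T D}{n} - \mu^T \mu$, where $\mu$ is the (1 x d) row vector of feature means; after mean centering, $\mu = 0$ and this reduces to $\frac{D^T D}{n}$    
Covariance = $P \Lambda P^{-1} = P \Lambda P^T$   
This factorization, with orthonormal P and non-negative eigenvalues, is possible because the covariance matrix is symmetric and positive semi-definite.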
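
As a worked check of the covariance formula, here is a short derivation. It assumes D is (n x d) with one instance $x_i$ per row (the convention used in the SVD section, so the d x d scatter matrix is $D^T D$), and that $\mu = \frac{1}{n}\sum_i x_i$ is the (1 x d) row vector of feature means; the symbols $x_i$ and $C$ are introduced here only for illustration.

$$
\begin{aligned}
C &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^T (x_i - \mu) \\
  &= \frac{1}{n}\sum_{i=1}^{n} x_i^T x_i
     - \Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big)^T \mu
     - \mu^T \Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big)
     + \mu^T \mu \\
  &= \frac{D^T D}{n} - \mu^T \mu - \mu^T \mu + \mu^T \mu \\
  &= \frac{D^T D}{n} - \mu^T \mu
\end{aligned}
$$

When the data has already been mean centered, $\mu = 0$ and the covariance is simply $\frac{D^T D}{n}$.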

PCA vs SVD
* PCA is a special case of SVD. 
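* Whenever all feature means are 0, SVD of the data matrix gives the same axes as PCA: the right singular vectors are the principal axes, and the covariance eigenvalues are the squared singular values divided by n.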
* PCA generates one basis for the matrix rows. SVD generates one for the rows and one for the columns. 
* The eigendecomposition used by PCA requires a square, diagonalizable matrix. SVD applies to any rectangular matrix.
* PCA is applied to the covariance matrix. SVD is applied to the data matrix.
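
The relationship between the two routes can be checked numerically. The sketch below assumes NumPy and made-up data; it compares the eigendecomposition of the covariance matrix with the SVD of the mean-centered data matrix.

```python
import numpy as np

# Illustrative data: 100 instances, 4 features with different spreads.
rng = np.random.default_rng(1)
D = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
n = len(D)
centered = D - D.mean(axis=0)

# PCA route: eigendecomposition of the (d x d) covariance matrix.
cov = centered.T @ centered / n
eigvals = np.linalg.eigh(cov)[0][::-1]        # sorted largest first

# SVD route: factorize the (n x d) centered data matrix directly.
Q, sigma, Pt = np.linalg.svd(centered, full_matrices=False)

# Squared singular values / n reproduce the covariance eigenvalues,
# and the rows of Pt (columns of P) are eigenvectors of the covariance.
print(np.allclose(sigma ** 2 / n, eigvals))
print(np.allclose(cov @ Pt.T, Pt.T * (sigma ** 2 / n)))
```

Individual singular vectors may differ in sign from the eigenvectors returned by `eigh`, which is why the check multiplies through by the covariance matrix instead of comparing vectors directly.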

PCA vs ICA
* PCA components are orthogonal; ICA components need not be.
* PCA seeks components that are uncorrelated; ICA seeks components that are statistically independent, which is a stronger condition.
* PCA uses second-order stats (variance), ICA uses higher-order.
* PCA implicitly assumes underlying Gaussians (for which variance captures all the structure); ICA relies on the sources being non-Gaussian.

## SVD
* Does not require mean centering.
* Ideal for sparse non-negative matrices e.g. word vectors of documents.
* SVD can be converted into a spectral (eigen) decomposition: $D^T D = P \Sigma^T \Sigma P^T$.
* SVD discovers the latent factors and ranks them.
* SVD can provide a lossy reconstruction of the data.
* SVD maximizes the energy retained in the reduced dimensions.

Energy:  
Energy of original = sum of squared distances to origin.  
Energy of the reconstruction is unchanged by axis rotation,
and reduced by dimension reduction (by the sum of the squared discarded singular values).  
The SVD minimizes the reconstruction error, i.e. the
sum of squared distances between pairs of (original, reconstructed) points.  
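
The energy bookkeeping can be verified on a small random matrix. This is a sketch assuming NumPy; the matrix size and k = 2 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.normal(size=(50, 6))                   # illustrative data, n=50, d=6
Q, sigma, Pt = np.linalg.svd(D, full_matrices=False)

k = 2
D_k = Q[:, :k] @ np.diag(sigma[:k]) @ Pt[:k, :]    # rank-k reconstruction

energy_original = np.sum(D ** 2)               # squared distances to origin
energy_rotated = np.sum((D @ Pt.T) ** 2)       # same points on the rotated axes
energy_reduced = np.sum(D_k ** 2)              # energy kept after truncation
reconstruction_error = np.sum((D - D_k) ** 2)  # energy lost

print(np.isclose(energy_original, energy_rotated))                       # rotation keeps energy
print(np.isclose(energy_reduced, np.sum(sigma[:k] ** 2)))                # kept = top-k sigma^2
print(np.isclose(energy_original, energy_reduced + reconstruction_error))
```

The original energy splits exactly into the retained part (squared top-k singular values) and the reconstruction error (squared discarded singular values).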

Latent factors:  
Say D = (n x d) = movie-patrons x movie-titles.  
SVD discovers latent concepts that explain both,
e.g. sci-fi might explain why some of the movie patrons like some of the movies.

Noise reduction:  
The information discarded by the lossy reconstruction
tends to be mostly noise and outliers, provided the signal dominates the largest singular values.
So the lossy reconstruction can provide better training data.
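
A small experiment illustrates the idea. It is a sketch under strong assumptions: the clean data is synthetic and exactly rank 2, the noise is small Gaussian noise, and k is set to the known rank; real data will rarely be this tidy.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical clean signal of rank 2, plus small additive noise.
clean = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 20))
noisy = clean + 0.1 * rng.normal(size=clean.shape)

Q, sigma, Pt = np.linalg.svd(noisy, full_matrices=False)
k = 2                                          # assumed known rank of the signal
denoised = Q[:, :k] @ np.diag(sigma[:k]) @ Pt[:k, :]

err_noisy = np.sum((noisy - clean) ** 2)       # error before truncation
err_denoised = np.sum((denoised - clean) ** 2) # error after truncation
print(err_noisy, err_denoised)
```

In this setup the truncated reconstruction sits closer to the clean signal than the noisy input does, because most of the noise energy falls in the discarded directions.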

### Math
$D = Q \Sigma P^{-1}$  
SVD uses this factorization of the data matrix.   
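This factorization always exists. The singular values are unique; the singular vectors are unique only up to sign (and up to a rotation within the subspace when singular values repeat). 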

D = data matrix on original axes.   
Q = left singular matrix; its orthonormal columns are the left singular vectors of $D$.  
$\Sigma$ = non-negative singular values along the diagonal (in decreasing order)  
P = right singular matrix; its orthonormal columns are the right singular vectors of $D$.  

D = (n x d) = n instances x d dimensions.  
Q = (n x n)  
$\Sigma$ = (n x d)  
P = (d x d)  

$P^T P = I$  
$Q^T Q = I$  

Q = eigenvectors of $D D^T$  
P = eigenvectors of $D^T D$  

Number of non-zero entries in $\Sigma$
equals Rank(D) and is <= min(n,d).
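
The shapes, orthonormality, eigenvector relations, and rank property above can be checked with NumPy's full SVD. This is a minimal sketch on a made-up rank-2 matrix; the tolerance `1e-10` for "non-zero" is an assumption chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
# Illustrative (5 x 4) data matrix built to have rank 2.
D = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))
n, d = D.shape

Q, sigma, Pt = np.linalg.svd(D)                # full SVD: Q is n x n, Pt is d x d
P = Pt.T
m = min(n, d)

print(Q.shape, sigma.shape, P.shape)           # (5, 5) (4,) (4, 4)
print(np.allclose(Q.T @ Q, np.eye(n)), np.allclose(P.T @ P, np.eye(d)))

# Columns of Q are eigenvectors of D D^T; columns of P of D^T D.
print(np.allclose(D @ D.T @ Q[:, :m], Q[:, :m] * sigma ** 2))
print(np.allclose(D.T @ D @ P, P * sigma ** 2))

num_nonzero = int(np.sum(sigma > 1e-10))       # count non-zero singular values
print(num_nonzero, np.linalg.matrix_rank(D))
```

NumPy returns the singular values as a 1-D array and the right factor already transposed, so `Pt` corresponds to $P^T$ in the notation above.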

Example of reducing dimensions to k:   
Choose k < min(n instances, d dimensions).  
The full dataset is $D = Q \Sigma P^{-1}$  
The rank-k approximation of data D (n viewers x d movies) =   
Q (n viewers x k concepts) * $\Sigma$ (k x k, diagonal, ordered) * $P^T$ (k concepts x d movies)  
Note that SVD provides k latent concepts.  
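
The truncation can be sketched on a toy ratings matrix. The ratings below are invented for illustration (three viewers who mostly rate the first three movies, three who mostly rate the last two), and the variable names `Q_k`, `S_k`, `Pt_k` are hypothetical labels for the truncated factors.

```python
import numpy as np

# Hypothetical ratings: 6 viewers x 5 movies, two loose taste groups.
D = np.array([
    [5, 5, 4, 0, 0],
    [4, 5, 5, 0, 0],
    [5, 4, 5, 0, 0],
    [0, 0, 0, 5, 4],
    [0, 0, 0, 4, 5],
    [0, 1, 0, 5, 5],
], dtype=float)

Q, sigma, Pt = np.linalg.svd(D, full_matrices=False)

k = 2                                          # number of latent concepts to keep
Q_k = Q[:, :k]                                 # n viewers  x k concepts
S_k = np.diag(sigma[:k])                       # k x k, diagonal, ordered
Pt_k = Pt[:k, :]                               # k concepts x d movies

D_k = Q_k @ S_k @ Pt_k                         # rank-k approximation of D
print(np.round(D_k, 1))
print(np.round(Q_k, 2))                        # each viewer's weight on each concept
```

Rows of `Q_k` describe how strongly each viewer loads on each concept, and rows of `Pt_k` describe how strongly each movie does.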