# Matrix Factorization

## LU
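A square matrix can be factored into a lower-triangular matrix $L$ and an upper-triangular matrix $U$, possibly after reordering its rows: $PA = LU$.  
This corresponds to Gaussian elimination, with the permutation matrix $P$ recording the row swaps.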
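
The elimination steps can be sketched in NumPy. This is a minimal illustration, not a production routine: the function name `lu_partial_pivot` and the example matrix are assumptions made here, and the row-swap strategy (partial pivoting) is one common choice.

```python
import numpy as np

def lu_partial_pivot(A):
    """Return P, L, U such that P @ A = L @ U (hypothetical helper)."""
    U = np.array(A, dtype=float)
    n = U.shape[0]
    P = np.eye(n)
    L = np.zeros((n, n))
    for k in range(n - 1):
        # Choose the row with the largest absolute entry in column k.
        pivot = k + np.argmax(np.abs(U[k:, k]))
        if pivot != k:
            U[[k, pivot]] = U[[pivot, k]]
            P[[k, pivot]] = P[[pivot, k]]
            L[[k, pivot], :k] = L[[pivot, k], :k]
        if U[k, k] == 0.0:
            continue  # nothing below the diagonal to eliminate in this column
        # Eliminate the entries below the pivot; store the multipliers in L.
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    L += np.eye(n)  # unit diagonal
    return P, L, U

# The zero in the top-left corner forces a row swap.
A = np.array([[0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 1.0, 1.0]])
P, L, U = lu_partial_pivot(A)
print(np.allclose(P @ A, L @ U))
```

The example matrix has a zero pivot in its first row, which is why plain $A = LU$ without row swaps would fail here.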

## Eigen
Some matrices can be factored into eigenvalues and eigenvectors.   
The matrix must be square and diagonalizable.   
Some factorizations yield matrices of complex numbers.   
The eigenvalues can be ordered by importance for dimensionality reduction.   
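An eigenvector scaled by any nonzero constant is still an eigenvector, so eigenvectors are non-unique; they are often normalized to unit length.   
The eigenvalues, by contrast, are determined by the matrix (up to ordering).   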
Use lambda $\lambda$ for one eigenvalue, Lambda $\Lambda$ for the diagonal matrix of all eigenvalues.   
Use A for a matrix e.g. rows of data, columns of features.    

An eigenvector of A stretches or shrinks but does not rotate when multiplied by A.    
$A v = \lambda v$   
Matrix A = eigenvectors (as columns of Q) times Lambda times the inverse of Q.   
$A = Q \Lambda Q^{-1}$   
The eigenvalues are the roots of the characteristic polynomial    
$\det(A - \lambda I) = 0$   
The eigenvectors are found by solving a homogeneous linear system for each eigenvalue $\lambda_i$    
$(A - \lambda_i I)v_i = 0$   
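
As a small check of the relations above, the sketch below uses NumPy's `np.linalg.eig` on a 2x2 matrix chosen here purely for illustration. Its characteristic polynomial is $\lambda^2 - 7\lambda + 10$, with roots 2 and 5.

```python
import numpy as np

# Example matrix (an assumption for illustration).
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# det(A - lambda I) = (4 - lambda)(3 - lambda) - 2 = lambda^2 - 7 lambda + 10
eigenvalues, Q = np.linalg.eig(A)  # columns of Q are eigenvectors
Lambda = np.diag(eigenvalues)

# Each eigenvector is only rescaled by A: A v = lambda v.
for lam, v in zip(eigenvalues, Q.T):
    assert np.allclose(A @ v, lam * v)

# Reassemble A from its factors: A = Q Lambda Q^{-1}.
A_rebuilt = Q @ Lambda @ np.linalg.inv(Q)
print(np.allclose(A_rebuilt, A))
```

NumPy does not promise any particular ordering of the eigenvalues from `eig`, so sort them explicitly if order matters.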

As a special case, when A is real and symmetric, 
there exist orthonormal eigenvectors    
$A = Q \Lambda Q^{T}$   
This is the case when A is a covariance matrix, as with PCA.   
(When a matrix $M$ is orthogonal, $M^{-1} = M^T$.)   
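
For the symmetric case, NumPy offers `np.linalg.eigh`, which returns real eigenvalues in ascending order and orthonormal eigenvectors. The matrix `S` below is an arbitrary example picked for this sketch.

```python
import numpy as np

# A real symmetric example matrix (an assumption for illustration).
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

w, Q = np.linalg.eigh(S)  # w ascending; columns of Q are orthonormal

# Orthogonality: Q^T Q = I, so Q^{-1} = Q^T.
is_orthogonal = np.allclose(Q.T @ Q, np.eye(2))

# Reconstruction with the transpose in place of the inverse.
S_rebuilt = Q @ np.diag(w) @ Q.T
print(is_orthogonal, np.allclose(S_rebuilt, S))
```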

## PCA
Eigen decomposition of the covariance matrix.   
The matrix satisfies the special case of being square, real, symmetric, positive semi-definite.   
PCA chooses an orthonormal basis of eigenvectors,
with eigenvalues ordered by the portion of variance explained.   
Covariance = $P \Lambda P^{-1} = P \Lambda P^{T}$, since $P$ is orthogonal.   
This can be used for dimensionality reduction.  
No use of labels, so this is unsupervised.  
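
The PCA recipe above can be sketched directly with NumPy. The data here are synthetic and the variable names (`X`, `P`, `Z`, `k`) are choices made for this example; a library such as scikit-learn would normally be used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 instances, 3 features with unequal spread.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)   # 3 x 3 covariance matrix

# eigh suits symmetric matrices; it returns ascending eigenvalues,
# so reorder to put the largest variance first.
eigenvalues, P = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, P = eigenvalues[order], P[:, order]

explained = eigenvalues / eigenvalues.sum()  # portion of variance explained

# Dimensionality reduction: keep the top k components.
k = 2
Z = X_centered @ P[:, :k]                # 200 x 2 projected data
print(explained)
```

The projected features in `Z` are uncorrelated, and their variances are the top `k` eigenvalues.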

## ICA
ICA vectors are generally not orthogonal and are not ranked.   
ICA uses higher-order statistics to find statistically independent components.   
(PCA uses only second-order statistics, i.e. covariance, but its vectors are ranked.)   
No use of labels, so this is unsupervised.

## SVD
Data D = { columns of features, rows of instances }.   
SVD finds    
$D = U \Sigma V^{T}$ where   
U = matrix of instances (rows) transformed to latent features   
$\Sigma$ = ranking of latent features (singular values, largest first)   
V = matrix whose columns (the rows of $V^T$) compose latent features by combinations of given features.   
(V is orthogonal, so $V^{T} = V^{-1}$.)   
Objective function: keeping the top k latent features gives the rank-k approximation with the smallest reconstruction loss, for any k.

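The SVD always exists. The singular values are unique, but the singular vectors are determined only up to sign (and up to rotation among repeated singular values).   
SVD does not use labels, so it is unsupervised.   
$\Sigma$ is diagonal but, for rectangular D, not square (its last rows or columns are zero).      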
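
A short NumPy sketch of these properties follows. The data matrix is random and its shape (6 instances, 4 features) is an assumption for illustration. Note that `np.linalg.svd` returns the singular values as a 1-D array and the third factor already transposed (named `Vt` here).

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(6, 4))  # hypothetical: 6 instances, 4 features

U, s, Vt = np.linalg.svd(D, full_matrices=True)  # U: 6x6, s: (4,), Vt: 4x4

# Build the rectangular diagonal Sigma: 6x4, with the last two rows zero.
Sigma = np.zeros(D.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
D_rebuilt = U @ Sigma @ Vt  # lossless reconstruction

def truncate(k):
    """Rank-k approximation from the top k latent features."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius-norm reconstruction error shrinks as k grows.
errors = [np.linalg.norm(D - truncate(k)) for k in range(1, len(s) + 1)]
print(errors)
```

For each `k`, the error equals the square root of the sum of the discarded squared singular values.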

### SVD vs PCA
SVD devises a different coordinate system for the data (like PCA).    
SVD is lossless, but keeping a subset of latent features gives dimensionality reduction (like PCA).   
SVD does not require the eigen decomposition of the covariance (unlike PCA), though its right singular vectors are eigenvectors of $D^T D$.   
SVD can operate on rectangular matrices (unlike eigen decomposition).        
SVD decomposes the data matrix (not the covariance, as in PCA).   
SVD generates three matrices (unlike PCA which generates two).       
The middle matrix is diagonal (in SVD and PCA).
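
The two methods can be compared numerically. In the sketch below (synthetic data, names chosen for this example), the squared singular values of the centered data matrix, divided by $n - 1$, match the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 50 instances, 3 correlated features.
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.1]])
X = rng.normal(size=(50, 3)) @ mix
Xc = X - X.mean(axis=0)  # centering is required for the comparison
n = Xc.shape[0]

# PCA route: eigenvalues of the covariance matrix, largest first.
cov_eigenvalues = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# SVD route: singular values of the centered data matrix.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_variances = s ** 2 / (n - 1)

print(np.allclose(cov_eigenvalues, svd_variances))
```

The rows of `Vt` also agree with the covariance eigenvectors up to sign, which is why PCA is often computed through the SVD of the centered data.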

## LDA
See [Wikipedia](https://en.wikipedia.org/wiki/Linear_discriminant_analysis).
See also our Linear Discriminant Analysis notebook.

LDA finds the optimal linear boundary between the labeled instances.     
LDA casts the data into a lower dimension.    
LDA can be used for classification or dimensionality reduction.   
LDA assumes each class follows a Gaussian (normal) distribution.   
LDA assumes homoscedasticity: every class shares the same covariance matrix.  
(When these conditions aren't met, use QDA instead.)   

LDA only explores linear combinations of features.   
LDA only finds linear decision boundaries (unlike QDA).    
The data determine the orientation of the decision boundary hyperplane,
but a given parameter (confidence to predict class 1) determines its position.    
LDA is less reliable if features are highly collinear, because the within-class scatter matrix becomes nearly singular.  

Objective functions: 
* maximize SS_between/SS_within ratio, and
* maximize classification accuracy.   

LDA uses the eigen decomposition (like PCA),
but LDA applies it to the between-class and within-class scatter matrices (not just feature covariance).      
LDA is specific to continuous numerical features.   
LDA uses labels, so LDA is supervised classification.   
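
The scatter-matrix view of LDA can be sketched as follows. This assumes a two-class problem with synthetic Gaussian data sharing one covariance; the names `S_w` (within-class scatter), `S_b` (between-class scatter) and `fisher_ratio` are choices made for this example, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical two-class data with a shared covariance matrix.
shared_cov = np.array([[1.0, 0.6],
                       [0.6, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], shared_cov, size=100)
X1 = rng.multivariate_normal([3.0, 1.0], shared_cov, size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

overall_mean = X.mean(axis=0)
S_w = np.zeros((2, 2))  # within-class scatter
S_b = np.zeros((2, 2))  # between-class scatter
for c in (0, 1):
    Xc = X[y == c]
    mean_c = Xc.mean(axis=0)
    S_w += (Xc - mean_c).T @ (Xc - mean_c)
    d = (mean_c - overall_mean).reshape(-1, 1)
    S_b += len(Xc) * (d @ d.T)

# Eigen decomposition of S_w^{-1} S_b; the top eigenvector is the
# direction that maximizes the between/within scatter ratio.
eigenvalues, eigenvectors = np.linalg.eig(np.linalg.solve(S_w, S_b))
w = eigenvectors[:, np.argmax(eigenvalues.real)].real

def fisher_ratio(v):
    """Between-class over within-class scatter along direction v."""
    return (v @ S_b @ v) / (v @ S_w @ v)

projected = X @ w  # data cast to one dimension
print(fisher_ratio(w))
```

With two classes, `w` is parallel to $S_w^{-1}(\mu_1 - \mu_0)$, so the eigen decomposition and the closed-form direction agree.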

Otsu's method is related to LDA.   