# Dimensionality Reduction
* eigen decomposition: data $A = Q \Lambda Q^{-1}$ where A is square and diagonalizable. Basis ranked by utility for data reconstruction (i.e. the smallest eigenvalues mostly capture noise).
* PCA: covariance $= P \Lambda P^{-1}$ where the covariance matrix is square, symmetric, positive-semidefinite, and computed from mean-centered data. Gives an orthonormal basis ranked by variance explained.
* ICA: basis is not orthonormal.
* SVD: data $D = Q \Sigma P^{-1}$ always exists. Discovers and ranks latent factors.
* LDA: Supervised, separates the labels by linear combination of continuous features, assumed independent. Lower-dimensional space maximizes between-vs-within class scatter. May not exist. Helps to run PCA first. 

## PCA
* PCA discovers an axis rotation, i.e. a linear transformation.
* Uses the eigen decomposition of the (symmetric positive-semidefinite) covariance matrix.
* Prerequisite is mean centering.
* Finds an orthonormal basis. 
* In effect, recursively performs axis rotation to where remaining variance is maximized along next axis.
* Discovers linear transformations only. (But kernel trick can uncover non-linear ones.)
* Under the new coordinates, there is no covariance, and the covariance matrix == the diagonalized eigenvalue matrix.
* For dimensionality reduction (with lossy reconstruction), discard the axes with the smallest eigenvalues.

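PCA via the eigen decomposition, for data matrix $D$ (n x d, one instance per row):   
Scatter matrix = $D^T D$  
Covariance matrix = $\frac{D^T D}{n} - \mu^T \mu$, where $\mu$ is the (1 x d) row vector of feature means (the subtraction is the mean centering)    
Covariance = $P \Lambda P^{-1}$   
This factorization is possible because the covariance matrix is symmetric and positive semi-definite.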
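
The steps above can be sketched in NumPy. This is a minimal illustration on made-up data, not a reference implementation; the toy matrix `D`, the mixing matrix, and the choice `k = 2` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: n=200 instances (rows) x d=3 features (columns).
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
D = rng.normal(size=(200, 3)) @ mixing

# Mean centering is the prerequisite.
X = D - D.mean(axis=0)
n = X.shape[0]

# Covariance of the centered data (d x d, symmetric positive semi-definite).
cov = X.T @ X / n

# eigh is intended for symmetric matrices; eigenvalues come back ascending,
# so reorder them to rank the axes by variance explained.
eigvals, P = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], P[:, order]

# Rotate the data onto the new axes; the covariance there is diagonal.
Z = X @ P
cov_rotated = Z.T @ Z / n

# Lossy reconstruction: keep only the k axes with the largest eigenvalues.
k = 2
X_hat = Z[:, :k] @ P[:, :k].T
```

Under these assumptions, `cov_rotated` equals the diagonal matrix of eigenvalues up to rounding, and the mean squared reconstruction error equals the discarded eigenvalue.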

PCA vs SVD
* PCA is a special case of SVD. 
* Whenever all feature means are 0, SVD==PCA.
* PCA generates one basis for the matrix rows. SVD generates one for the rows and one for the columns. 
* The eigen decomposition used by PCA is restricted to square, diagonalizable matrices (here, the covariance matrix). SVD applies to any matrix.
* PCA is applied to the covariance matrix. SVD is applied to the data matrix.
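
A quick numerical check of the comparison, again as a sketch on made-up data: once the features are mean centered, the SVD of the data matrix and the eigen decomposition of its covariance matrix should agree. The array names here are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical toy data: n=50 instances, d=4 features with different spreads.
D = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X = D - D.mean(axis=0)   # all feature means are now 0
n = X.shape[0]

# Route 1 (PCA): eigen decomposition of the covariance matrix.
eigvals, P = np.linalg.eigh(X.T @ X / n)
eigvals, P = eigvals[::-1], P[:, ::-1]   # descending order

# Route 2 (SVD): factor the centered data matrix directly.
Q, s, Pt = np.linalg.svd(X, full_matrices=False)

# Squared singular values, scaled by n, match the covariance eigenvalues.
svd_variances = s ** 2 / n

# Right singular vectors match the eigenvectors up to sign.
alignment = np.abs(np.sum(Pt.T * P, axis=0))
```

The sign ambiguity is why the comparison uses absolute values of the dot products rather than direct equality.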

PCA vs ICA
* PCA components are orthogonal, ICA are not.
* PCA seeks components that are uncorrelated; ICA seeks components that are statistically independent, which is a stronger condition.
* PCA uses second-order stats (variance), ICA uses higher-order.
* PCA assumes underlying Gaussians (via variance).

## SVD
* Does not require mean centering.
* Ideal for sparse non-negative matrices e.g. word vectors of documents.
* SVD can be converted into a spectral (eigen) decomposition: $D^T D = P \Sigma^T \Sigma P^{-1}$.
* SVD discovers the latent factors and ranks them.
* SVD can provide a lossy reconstruction of the data.
* SVD maximizes the energy on the reduced dimensions.

Energy:
Energy of original = sum of squared distances to origin.  
Energy of reconstructed is unchanged by axis rotation, 
and slightly reduced by dimension reduction.
The SVD minimizes the reconstruction error i.e.
sum of squared distances between pairs of (original,reconstructed) points.  
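
The energy bookkeeping can be checked numerically. A minimal sketch, assuming random toy data and an arbitrary choice of `k = 2`:

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.normal(size=(30, 5))   # hypothetical data, n=30, d=5
Q, s, Pt = np.linalg.svd(D, full_matrices=False)

k = 2
D_k = Q[:, :k] @ np.diag(s[:k]) @ Pt[:k, :]   # rank-k reconstruction

energy_original = np.sum(D ** 2)             # sum of squared distances to origin
energy_rotated = np.sum((D @ Pt.T) ** 2)     # same points on the rotated axes
energy_kept = np.sum(D_k ** 2)               # after dropping dimensions
reconstruction_error = np.sum((D - D_k) ** 2)
```

In this sketch the original energy equals the sum of all squared singular values, the kept energy equals the sum of the first k, and the reconstruction error equals the sum of the rest.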

Latent factors:  
Say D = (n x d) = movie-patrons x movie-titles.
SVD discovers latent concepts that explain both,
e.g. sci-fi might explain why some of the movie patrons like some of the movies.

Noise reduction:
The lossy part of the lossy reconstruction 
is focused on noise and outliers.
So the lossy reconstruction can provide better training data.

### math
$D = Q \Sigma P^{-1}$  
SVD uses this factorization of the data matrix.   
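This factorization always exists. The singular values are unique, but the singular vectors are not: they are determined only up to sign, and up to rotation when singular values repeat. 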

D = data matrix on original axes.   
Q = left singular matrix; its orthonormal columns are the left singular vectors of $D$.  
$\Sigma$ = non-negative singular values along the diagonal (in decreasing order)  
P = right singular matrix; its orthonormal columns are the right singular vectors of $D$.  

D = (n x d) = n instances x d dimensions.  
Q = (n x n)  
$\Sigma$ = (n x d)  
P = (d x d)  

$P^T P = I$ (so $P^{-1} = P^T$)  
$Q^T Q = I$  

Q = eigenvectors of $D D^T$  
P = eigenvectors of $D^T D$  

Number of non-zero entries in $\Sigma$
equals Rank(D) and is <= min(n,d).
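
A small sketch of the rank statement. In floating point, "non-zero" is usually read as "above a small tolerance"; the matrix and the tolerance formula below are assumptions for illustration.

```python
import numpy as np

# Hypothetical 4 x 3 matrix whose third column is the sum of the first two,
# so its rank should be 2 even though min(n, d) = 3.
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 1.0, 2.0]])

# Singular values only (no Q or P needed for counting).
s = np.linalg.svd(D, compute_uv=False)

# Treat singular values below a scale-aware tolerance as zero.
tol = s.max() * max(D.shape) * np.finfo(D.dtype).eps
rank_from_svd = int(np.sum(s > tol))
```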

Example of reducing dimensions to k:   
Choose k < min(n instances, d dimensions).  
The full dataset is $D = Q \Sigma P^{-1}$  
The rank=k approximation of data D(n viewers x d movies) =   
Q(n viewers x k concepts) * (k x k, diagonal, ordered) * Pt(k concepts x d movies)  
Note SVD provided k latent concepts.  
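
The viewers-by-movies example can be sketched as follows. The ratings matrix is invented for illustration (three sci-fi titles, two romance titles), and `k = 2` is an assumed choice.

```python
import numpy as np

# Hypothetical ratings: 6 viewers (rows) x 5 movies (columns).
# The first three movies are sci-fi, the last two are romance.
D = np.array([[5, 5, 4, 0, 0],
              [4, 5, 5, 0, 0],
              [5, 4, 5, 1, 0],
              [0, 0, 0, 5, 4],
              [0, 1, 0, 4, 5],
              [0, 0, 0, 5, 5]], dtype=float)

Q, s, Pt = np.linalg.svd(D, full_matrices=False)

k = 2
Q_k = Q[:, :k]          # n viewers x k concepts
S_k = np.diag(s[:k])    # k x k, diagonal, ordered
Pt_k = Pt[:k, :]        # k concepts x d movies

D_k = Q_k @ S_k @ Pt_k  # rank-k approximation, same shape as D
```

With data like this, the first two singular values dominate the rest, which is the numerical signature of two latent concepts.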

## LDA
Original form invented by Fisher. Since extended to LDA and dimensionality reduction. Good tutorial [here](https://www.knowledgehut.com/blog/data-science/linear-discriminant-analysis-for-machine-learning). Good video from UVa [here](https://youtu.be/IMfLXEOksGc). Code sample [blog](https://www.python-engineer.com/courses/mlfromscratch/14-lda/).

Supervised learning, maximizes between/within scatter.

This is rather simplistic. 
It assumes each class is generated by a Gaussian with the same variance.
That is, each class forms a circular sombrero of same size. 
If these assumptions are violated, poor discriminant functions result.
The LDA discriminators between classes are lines or hyperplanes.

One step up is QDA, quadratic discriminant analysis, 
which allows different variances, and yields quadratic (e.g. parabolic) discriminants.

Assume independent features (for learning a linear combination).
Assume independent data instances (for learning the placement or intercept of each discriminant).
Assume the data are generated by one Gaussian distribution per class. 
Assume homoscedasticity, i.e. same variance everywhere. 
Look for a linear combination of features that explains the class labels.

Uses likelihood.
Consider each class using its assumed mean & stdev.
Each point has a probability of coming from this mean & stdev (use the Gaussian PDF).
The class has a likelihood based on these data.

Uses maximum likelihood. 
Each point is assigned to the class model that explains it best.

Invokes the Bayes decision rule.
Draw a line between Gaussians and assign points to classes
based on whether they are left or right of the line.
The line placement can incorporate priors.
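
A one-dimensional sketch of the likelihood and decision-rule steps. The class means, shared standard deviation, priors, and sample points are all made up for the example; with a shared variance the boundary reduces to a single threshold.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical setup: two classes sharing one standard deviation.
means = np.array([0.0, 4.0])
std = 1.0
priors = np.array([0.5, 0.5])

x = np.array([-1.0, 1.5, 2.5, 5.0])

# Likelihood of each point under each class model, weighted by the prior.
scores = priors * gaussian_pdf(x[:, None], means[None, :], std)

# Each point goes to the class model that explains it best.
assigned = np.argmax(scores, axis=1)

# The "line" between the two Gaussians: midpoint of the means,
# shifted by the log prior ratio.
threshold = means.mean() + std ** 2 * np.log(priors[0] / priors[1]) / (means[1] - means[0])
```

Raising the prior of class 0 moves the threshold toward the mean of class 1, so more points are assigned to class 0.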

Each discriminant function creates latent features and gets a discriminant score, much like how each eigenvector has an eigenvalue. The top 3 linear discriminants would draw 3 lines or hyperplanes between populations.

Otsu's method is related. In greyscale image analysis, it chooses the black/white threshold that maximizes the between-class variance of the pixel intensities (equivalently, minimizes the within-class variance).
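
For comparison, here is a brute-force sketch of Otsu's threshold search on 8-bit grey levels. The function name `otsu_threshold` and the tiny pixel array are assumptions for illustration; image libraries ship their own implementations.

```python
import numpy as np

def otsu_threshold(pixels):
    """Return the grey level t that maximizes between-class variance
    when pixels <= t are called black and pixels > t are called white."""
    hist = np.bincount(pixels.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0 = prob[: t + 1].sum()   # fraction of black pixels
        w1 = 1.0 - w0              # fraction of white pixels
        if w0 == 0.0 or w1 == 0.0:
            continue               # one class is empty, nothing to separate
        mu0 = (levels[: t + 1] * prob[: t + 1]).sum() / w0
        mu1 = (levels[t + 1:] * prob[t + 1:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

# Hypothetical image: dark pixels near 50, bright pixels near 200.
pixels = np.array([48, 50, 52, 49, 51, 198, 200, 202, 199, 201], dtype=np.uint8)
t = otsu_threshold(pixels)
```

Like Fisher's criterion, the score being maximized compares separation between the two groups against the spread inside them.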

LDA is a classifier but it is typically used for dimensionality reduction prior to classification. LDA is related to eigenvalues of the covariance matrix and thus to PCA.

Here is my take. The labeled data points are in a d-dimensional space for d features. Fisher's Score uses between-scatter in the numerator and within-scatter in the denominator. There is a d-dimensional vector w that maximizes Fisher's Score. Of all hyperplanes perpendicular to w, one maximally separates the classes. Use least squares regression to find its intercept: wx+b=0. Now w is a linear combination of features but also a latent feature that maximally separates classes. We can recursively add more latent features = dimensions.

LDA is associated with eigenvalues.
Part of computing the LDA involves computing
the eigen decomposition of $S_W^{-1} S_B$, i.e. the between-class scatter relative to the within-class scatter.
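
A from-scratch sketch of that eigen decomposition, assuming made-up Gaussian classes with a shared covariance. The names `S_W`, `S_B`, and `W` are introduced here for the within-class scatter, between-class scatter, and discriminant directions.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 3 classes in 4 dimensions, 60 instances per class.
class_means = np.array([[0, 0, 0, 0], [4, 1, 0, 0], [1, 5, 0, 0]], dtype=float)
X = np.vstack([m + rng.normal(size=(60, 4)) for m in class_means])
y = np.repeat(np.arange(3), 60)

overall_mean = X.mean(axis=0)
d = X.shape[1]
S_W = np.zeros((d, d))   # within-class scatter
S_B = np.zeros((d, d))   # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)
    diff = (mc - overall_mean)[:, None]
    S_B += Xc.shape[0] * (diff @ diff.T)

# Eigen decomposition of S_W^{-1} S_B; solve() avoids forming the inverse.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
eigvals, eigvecs = eigvals.real, eigvecs.real   # drop rounding-level imaginary parts
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], eigvecs[:, order]

# At most (#classes - 1) discriminants carry any separation.
n_components = 2
Z = X @ W[:, :n_components]   # the latent features
```

With three classes, only two eigenvalues are meaningfully above zero, which matches the min(#classes-1, #features) limit on target dimensions.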



The scikit-learn [LDA](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html) 
has these features.

Inputs:
* Solver algorithm: SVD avoids computing the covariance matrix; Eigen and Least Squares must compute it and can use shrinkage.
* Shrinkage: regularizes the covariance estimate, which helps when there are few training instances relative to the number of features.
* Priors: if not given, inferred from class frequencies in the data.
* Target dimensions: if not given, uses min(#classes-1, #features).

Outputs:
* The vectors w and the intercepts b.
* Means (centroids) per class.
* Explained variance ratio.
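
A short usage sketch of the scikit-learn class, assuming scikit-learn is installed. The toy data is made up; the attribute names (`coef_`, `intercept_`, `means_`, `priors_`, `explained_variance_ratio_`) are those of `LinearDiscriminantAnalysis`.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
# Hypothetical data: 3 classes, 4 features, 50 instances per class.
class_means = np.array([[0, 0, 0, 0], [3, 1, 0, 0], [1, 4, 0, 0]], dtype=float)
X = np.vstack([m + rng.normal(size=(50, 4)) for m in class_means])
y = np.repeat(np.arange(3), 50)

# solver="svd" is the default; priors and n_components are left unset
# so that they are inferred from the data.
lda = LinearDiscriminantAnalysis(solver="svd")
Z = lda.fit(X, y).transform(X)   # reduced to min(#classes - 1, #features) dims

w = lda.coef_                         # one weight vector per class
b = lda.intercept_                    # one intercept per class
centroids = lda.means_                # per-class means
ratio = lda.explained_variance_ratio_
```

To try shrinkage, one option is `LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")`, since the SVD solver does not support it.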