# LDA = Linear Discriminant Analysis

## LDA
The original two-class form was introduced by Fisher. It has since been extended to multi-class LDA and to dimensionality reduction. Good tutorial [here](https://www.knowledgehut.com/blog/data-science/linear-discriminant-analysis-for-machine-learning). Good video from UVa [here](https://youtu.be/IMfLXEOksGc). Code sample [blog](https://www.python-engineer.com/courses/mlfromscratch/14-lda/).

Supervised learning. Maximizes the ratio of between-class scatter to within-class scatter.

This is a rather simple model.
It assumes each class is generated by a Gaussian, and that all classes share the same covariance matrix.
That is, each class forms a sombrero of the same size, shape, and orientation (circular only in the special case of equal, uncorrelated feature variances).
If these assumptions are violated, poor discriminant functions can result.
The LDA decision boundaries between classes are lines or hyperplanes.

One step up is QDA, quadratic discriminant analysis,
which allows a different covariance matrix per class, and yields quadratic (curved) decision boundaries such as parabolas, ellipses, or hyperbolas.
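
The difference between the two can be seen on data that breaks the shared-spread assumption. The following is a minimal sketch using scikit-learn's `LinearDiscriminantAnalysis` and `QuadraticDiscriminantAnalysis`; the synthetic data (two classes with the same mean but very different spreads) is an assumption made up for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): same mean, different spread.
# No straight line can separate a tight blob from a wide one around it.
tight = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
wide = rng.normal(loc=0.0, scale=3.0, size=(200, 2))
X = np.vstack([tight, wide])
y = np.repeat([0, 1], 200)

# Training accuracy only; a real comparison would use held-out data.
lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
qda_acc = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)

print(f"LDA training accuracy: {lda_acc:.2f}")
print(f"QDA training accuracy: {qda_acc:.2f}")
```

On data like this, QDA can draw a closed curve around the tight class, while LDA is limited to a straight line.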

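Features need not be independent, since the shared covariance matrix models their correlations, but strongly collinear features make that matrix hard to estimate and invert.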
Assume independent data instances (for learning the placement or intercept of each discriminant).
Assume the data are generated by one Gaussian distribution per class. 
Assume homoscedasticity, i.e. the same covariance in every class.
Look for a linear combination of features that explains the class labels.

Uses likelihood.
Consider each class using its assumed mean & stdev.
Each point has a probability density under this mean & stdev (use the Gaussian PDF).
The class model has a likelihood: the product of these densities over its data points.

Uses maximum likelihood. 
Each point is assigned to the class model that explains it best.
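
The assignment step can be sketched in one dimension. This is only an illustration of the idea, not how any library implements it; the class means, the shared standard deviation, and the function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, stdev):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / stdev
    return math.exp(-0.5 * z * z) / (stdev * math.sqrt(2.0 * math.pi))

def assign_class(x, class_means, stdev):
    """Return the label whose Gaussian gives x the highest density."""
    return max(
        class_means,
        key=lambda label: gaussian_pdf(x, class_means[label], stdev),
    )

# Hypothetical class models: two means and one shared standard deviation.
class_means = {"a": 0.0, "b": 4.0}
shared_stdev = 1.0

points = [-1.0, 1.5, 2.5, 5.0]
labels = [assign_class(x, class_means, shared_stdev) for x in points]
print(labels)  # ['a', 'a', 'b', 'b']
```

With equal standard deviations and no priors, this reduces to picking the nearest class mean.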

Invokes the Bayes decision rule.
Draw a line between Gaussians and assign points to classes
based on whether they are left or right of the line.
The line placement can incorporate priors.
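
For two 1-D Gaussians with a shared standard deviation, the dividing point has a closed form, and the priors shift it. The sketch below derives the threshold by setting prior × density equal for the two classes; the function name and the numbers are made up for illustration.

```python
import math

def bayes_threshold(mean0, mean1, stdev, prior0, prior1):
    """Point where prior0 * N(x; mean0, stdev) equals prior1 * N(x; mean1, stdev).

    Assumes mean0 < mean1. Points left of the threshold go to class 0,
    points right of it go to class 1.
    """
    midpoint = 0.5 * (mean0 + mean1)
    shift = stdev ** 2 * math.log(prior0 / prior1) / (mean1 - mean0)
    return midpoint + shift

# Equal priors: the threshold sits halfway between the means.
equal = bayes_threshold(0.0, 4.0, 1.0, 0.5, 0.5)

# Class 0 is far more common: the threshold moves toward class 1,
# so more of the axis is assigned to class 0.
skewed = bayes_threshold(0.0, 4.0, 1.0, 0.9, 0.1)

print(equal)
print(skewed)
```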

Each discriminant function creates latent features and gets a discriminant score, much like how each eigenvector has an eigenvalue. The top 3 linear discriminants would draw 3 lines or hyperplanes between populations.

Otsu's method is related. In greyscale image analysis, it chooses the black/white threshold that maximizes the between-class variance of pixel intensities (equivalently, minimizes the within-class variance).
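
A minimal pure-Python sketch of Otsu's threshold search follows. It scores each candidate threshold by the between-class variance of the two pixel groups and keeps the best one. The function name, the convention that pixels with value <= t are "black", and the toy pixel list are assumptions for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Pixels with value <= t form one class, the rest form the other.
    `pixels` is a flat list of integer intensities in [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    total_sum = sum(value * count for value, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    count0 = 0  # number of pixels with value <= t
    sum0 = 0    # sum of intensities of those pixels
    for t in range(levels):
        count0 += hist[t]
        sum0 += t * hist[t]
        count1 = total - count0
        if count0 == 0 or count1 == 0:
            continue  # one class is empty, no split to score
        mean0 = sum0 / count0
        mean1 = (total_sum - sum0) / count1
        # Between-class variance: weight0 * weight1 * (mean0 - mean1)^2
        score = (count0 / total) * (count1 / total) * (mean0 - mean1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Toy "image": half dark pixels, half bright pixels.
toy_pixels = [10] * 50 + [200] * 50
t = otsu_threshold(toy_pixels)
print(t)
```

Like LDA, this rewards class means that are far apart relative to the spread inside each class, but it searches over a single 1-D threshold rather than a direction.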

LDA is a classifier but it is typically used for dimensionality reduction prior to classification. LDA is related to eigenvalues of the covariance matrix and thus to PCA.

Here is my take. The labeled data points are in a d-dimensional space for d features. Fisher's Score puts between-class scatter in the numerator and within-class scatter in the denominator. There is a d-dimensional vector w that maximizes Fisher's Score. Of all hyperplanes perpendicular to w, one maximally separates the classes. Use least squares regression to find its intercept b, so that the hyperplane is w·x + b = 0. Now w is a linear combination of features but also a latent feature that maximally separates classes. We can recursively add more latent features = dimensions.
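
Written out, Fisher's Score for a direction w looks like this. The symbols $S_B$, $S_W$, $\mu_c$, $\mu$, and $n_c$ are notation introduced here for illustration: between-class scatter, within-class scatter, the mean of class $c$, the overall mean, and the number of points in class $c$.

$$
J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w},
\qquad
S_B = \sum_{c} n_c\, (\mu_c - \mu)(\mu_c - \mu)^\top,
\qquad
S_W = \sum_{c} \sum_{x \in c} (x - \mu_c)(x - \mu_c)^\top
$$

Setting the gradient of $J(w)$ to zero gives the generalized eigenproblem $S_B\, w = \lambda\, S_W\, w$, where $\lambda = J(w)$ is the score achieved by that direction.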

LDA is associated with eigenvalues. Part of computing the LDA involves computing the eigendecomposition of the matrix inverse(Scatter_Within) × Scatter_Between.
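
A from-scratch sketch of that eigendecomposition in NumPy is below, reading the between/within ratio as inverse(S_W) @ S_B. The function name, the use of a pseudo-inverse, and the synthetic data are assumptions for illustration; library implementations typically use more numerically careful routines.

```python
import numpy as np

def fisher_directions(X, y, n_components):
    """Return the top eigenvalues and eigenvectors of pinv(S_W) @ S_B."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)

    S_w = np.zeros((n_features, n_features))  # within-class scatter
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += len(X_c) * (diff @ diff.T)

    # The matrix is not symmetric in general, so use eig and keep real parts.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvals.real[order], eigvecs.real[:, order]

# Synthetic data (illustrative): two classes separated along the first axis.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[5.0, 0.0], scale=1.0, size=(50, 2)),
])
y = np.repeat([0, 1], 50)

eigvals, W = fisher_directions(X, y, n_components=1)
X_projected = X @ W  # each column of W is one latent feature
print(W.ravel())
```

Each eigenvalue plays the role of the discriminant score for its eigenvector, so sorting by eigenvalue ranks the latent features.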

The scikit-learn [LDA](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html) 
has these features.

Inputs:
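Solver algorithm (`solver`): `svd` (the default) avoids computing the covariance matrix; `eigen` and `lsqr` (least squares) must compute the covariance and can use shrinkage.
Shrinkage (`shrinkage`): when there are few samples relative to the number of features, the empirical covariance is a poor estimate. Shrinkage regularizes it by pulling it toward a scaled identity matrix. `auto` picks the amount using the Ledoit-Wolf lemma.
Priors (`priors`): if not given, inferred from class frequencies in the data.
Target dimensions (`n_components`): if not given, uses min(#classes-1, #features).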
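
These inputs can be combined as in the sketch below. The synthetic data (few samples, many features) and the chosen values are assumptions for illustration. The `eigen` solver is used here because it accepts shrinkage and also supports `transform`.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic data (illustrative): 2 classes, 20 features, 15 samples each,
# so the empirical covariance estimate is noisy.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(15, 20)),
    rng.normal(loc=3.0, scale=1.0, size=(15, 20)),
])
y = np.repeat([0, 1], 15)

clf = LinearDiscriminantAnalysis(
    solver="eigen",      # covariance-based solver, so shrinkage is allowed
    shrinkage="auto",    # shrinkage amount chosen via Ledoit-Wolf
    priors=[0.5, 0.5],   # explicit priors instead of class frequencies
    n_components=1,      # at most min(#classes - 1, #features) = 1 here
)
X_1d = clf.fit_transform(X, y)
train_acc = clf.score(X, y)

print(X_1d.shape)
print(train_acc)
```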

Outputs:
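The vectors w and the intercepts b (`coef_` and `intercept_`).
Means (centroids) per class (`means_`).
Explained variance ratio (`explained_variance_ratio_`), one value per retained discriminant.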
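
A short sketch of reading these outputs after fitting follows. The three-class synthetic data is an assumption for illustration; the attribute names are those of scikit-learn's `LinearDiscriminantAnalysis`.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic data (illustrative): three classes in a 4-D feature space.
centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(40, 4)) for c in centers])
y = np.repeat([0, 1, 2], 40)

lda = LinearDiscriminantAnalysis(solver="svd", n_components=2)
X_reduced = lda.fit_transform(X, y)

print(lda.coef_.shape)                 # (3, 4): one w per class
print(lda.intercept_.shape)            # (3,): one b per class
print(lda.means_.shape)                # (3, 4): one centroid per class
print(lda.explained_variance_ratio_)   # one value per retained discriminant
print(lda.priors_)                     # inferred from class frequencies
print(X_reduced.shape)                 # (120, 2)
```

With three classes, at most two discriminants exist, which is why `n_components=2` is the maximum here.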