# LDA = Linear Discriminant Analysis
Using labeled data, find the projection that maximizes the ratio of between-group scatter to within-group scatter.

Supervised.   
Separates the classes by a linear combination of continuous features (which may be correlated, but should not be collinear).   
Lower-dimensional space maximizes between-vs-within class scatter.    
A useful discriminant may not exist, e.g. when the class means coincide or the within-class scatter matrix is singular.    

LDA was derived from Fisher's Discriminant Analysis. 
The statistical method was later extended to dimensionality reduction. 

Good tutorial [here](https://www.knowledgehut.com/blog/data-science/linear-discriminant-analysis-for-machine-learning). Good video from UVa [here](https://youtu.be/IMfLXEOksGc). Code sample [blog](https://www.python-engineer.com/courses/mlfromscratch/14-lda/).

### Summary steps
The labeled data points are in a d-dimensional space for d features.   
Fisher's Score uses between-scatter in numerator and within-scatter in denominator.    
There is a d-dimensional vector w that maximizes Fisher's Score.   
Of all hyperplanes perpendicular to w, one maximally separates the classes.   
Use least squares regression to find its intercept b, giving the hyperplane w·x + b = 0.    
Now w defines a linear combination of features.    
But the projection onto w is also a latent feature that maximally separates classes.    
We can recursively add more latent features (dimensions), up to #classes-1.
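
The steps above can be sketched for two classes in NumPy. This is a minimal illustration on synthetic data, not a reference implementation: the data, the helper names `scatter` and `fisher_score`, and the choice to place the intercept midway between the projected class means (which corresponds to equal priors, rather than the least squares fit mentioned above) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic classes in d=2 dimensions sharing one covariance (illustrative data).
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([3.0, 1.0], cov, size=200)

def scatter(X):
    """Sum of outer products of deviations from the class mean."""
    centered = X - X.mean(axis=0)
    return centered.T @ centered

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = scatter(X0) + scatter(X1)        # within-class scatter
S_b = np.outer(m1 - m0, m1 - m0)       # between-class scatter (two-class form)

def fisher_score(w):
    """Between-scatter over within-scatter along direction w."""
    return (w @ S_b @ w) / (w @ S_w @ w)

# For two classes, the maximizer is proportional to S_w^{-1} (m1 - m0).
w = np.linalg.solve(S_w, m1 - m0)
w /= np.linalg.norm(w)

# Assumption: equal priors, so the hyperplane w.x + b = 0 sits midway
# between the projected class means.
b = -0.5 * (w @ m0 + w @ m1)

pred = (np.vstack([X0, X1]) @ w + b > 0).astype(int)
truth = np.r_[np.zeros(200, dtype=int), np.ones(200, dtype=int)]
accuracy = (pred == truth).mean()
print(f"Fisher score: {fisher_score(w):.3f}, training accuracy: {accuracy:.3f}")
```

Any other direction gives a Fisher score no larger than `fisher_score(w)`, which is a quick way to sanity-check the result.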

## Assumptions
LDA is sensitive to its assumptions.
If these assumptions are violated, poor discriminant functions result.

Assume features that are not collinear (for learning a stable linear combination).
Assume independent data instances (for learning the placement or intercept of each discriminant).
Assume the data are generated by one Gaussian distribution per class. 
Assume homoscedasticity, i.e. the same covariance in every class.
Thus, points from any one class form a sombrero
(circular if the features are independent with equal variance, elliptical otherwise),
and all the sombreros are the same size and shape. 
Look for a linear combination of features that explains the class labels.
Finally, place lines or hyperplanes between sombreros.
(Yes, LDA is rather simplistic. See QDA.)

Each discriminant function creates a latent feature (like an eigenvector).
Each discriminant gets a discriminant score (like an eigenvalue).

Say you use the top 3 linear discriminants.
These would draw 3 hyperplanes,
assigning points and populations to classes. 


## Dependencies
### Likelihood
Consider each class using its assumed mean & stdev.
In LDA, 
each point has a probability density under this Gaussian (computed with the PDF).
The class model as a whole has a likelihood: the product of these densities over the class's data points.
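
As a small sketch of the likelihood idea, the snippet below evaluates a univariate Gaussian PDF and sums log densities to get a log-likelihood. The data points and the two class models (means 1.0 and 3.0, shared stdev 0.5) are hypothetical values chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, stdev):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / stdev
    return math.exp(-0.5 * z * z) / (stdev * math.sqrt(2.0 * math.pi))

def log_likelihood(points, mean, stdev):
    """Sum of log densities: the log-likelihood of the points under one class model."""
    return sum(math.log(gaussian_pdf(x, mean, stdev)) for x in points)

# Hypothetical 1-D data and two hypothetical class models with a shared stdev.
points = [0.8, 1.1, 1.3, 0.9]
ll_a = log_likelihood(points, mean=1.0, stdev=0.5)
ll_b = log_likelihood(points, mean=3.0, stdev=0.5)
print(ll_a > ll_b)  # the model centered near the data explains it better
```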

### Maximum likelihood
In LDA, each point is assigned to the class model that explains it best.
(Like K-means Clustering?)

### Bayes decision rule
In LDA, draw a hyperplane between the Gaussians.
Assign data points to classes based on which side of the hyperplane they fall.
The hyperplane placement can incorporate priors.
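
To make the role of priors concrete, here is a sketch in one dimension with two Gaussians that share a stdev. The function names `lda_threshold` and `classify` and the numbers are assumptions for the example. With equal priors the boundary is the midpoint of the means (the maximum likelihood assignment); unequal priors shift it toward the rarer class.

```python
import math

def lda_threshold(mean0, mean1, stdev, prior0=0.5, prior1=0.5):
    """Boundary between two 1-D Gaussians with a shared stdev (needs mean0 != mean1)."""
    midpoint = 0.5 * (mean0 + mean1)
    shift = stdev ** 2 * math.log(prior0 / prior1) / (mean1 - mean0)
    return midpoint + shift

def classify(x, mean0, mean1, stdev, prior0=0.5, prior1=0.5):
    """Return the class (0 or 1) with the larger log posterior, up to a shared constant."""
    score0 = math.log(prior0) - (x - mean0) ** 2 / (2 * stdev ** 2)
    score1 = math.log(prior1) - (x - mean1) ** 2 / (2 * stdev ** 2)
    return 1 if score1 > score0 else 0

equal = lda_threshold(0.0, 2.0, 1.0)              # midpoint of the two means
skewed = lda_threshold(0.0, 2.0, 1.0, 0.9, 0.1)   # class 0 is common, so it claims more room
print(equal, skewed)
```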

## LDA as classifier
LDA is a classifier.
But it is typically used for dimensionality reduction prior to classification. 

## Related methods
### LDA and PCA
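It helps to run PCA before LDA. Why? With many features relative to samples, the within-class scatter matrix can be singular; PCA first removes the collinear directions so the matrix can be inverted.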
### LDA vs QDA
QDA = quadratic discriminant analysis.   
QDA allows a different covariance per class and yields quadratic (curved) discriminants.  
LDA assumes the same covariance in every class and yields hyperplanes as discriminants.  
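
The difference can be seen by expanding the log ratio of two Gaussian densities into quadratic, linear, and constant terms. The sketch below uses a hypothetical helper `boundary_terms` and made-up means and covariances, and ignores priors. When the covariances match, the quadratic term vanishes and the boundary is a hyperplane.

```python
import numpy as np

def boundary_terms(mean0, cov0, mean1, cov1):
    """Coefficients (A, w, c) with log N(x; mean1, cov1) - log N(x; mean0, cov0) = x'Ax + w'x + c."""
    P0, P1 = np.linalg.inv(cov0), np.linalg.inv(cov1)
    A = 0.5 * (P0 - P1)                       # quadratic term
    w = P1 @ mean1 - P0 @ mean0               # linear term
    c = (0.5 * (mean0 @ P0 @ mean0 - mean1 @ P1 @ mean1)
         + 0.5 * (np.linalg.slogdet(cov0)[1] - np.linalg.slogdet(cov1)[1]))
    return A, w, c

mean0, mean1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
shared = np.array([[1.0, 0.3], [0.3, 2.0]])

A_lda, w_lda, c_lda = boundary_terms(mean0, shared, mean1, shared)       # same covariance
A_qda, w_qda, c_qda = boundary_terms(mean0, shared, mean1, 2 * shared)   # different covariance

print(np.allclose(A_lda, 0.0))   # quadratic term vanishes: linear boundary
print(np.allclose(A_qda, 0.0))   # quadratic term remains: curved boundary
```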
### LDA and ANOVA
ANOVA requires categorical independent variables, continuous dependent variables.   
LDA requires continuous independent variables, categorical dependent variables.   
### LDA and Logistic Regression
LDA assumes normally distributed data.  
Otherwise, use logistic regression.  
### LDA and PCA
LDA is related to eigenvalues of the covariance matrix and thus to PCA.  
Part of computing the LDA involves computing the eigen decomposition of the inverse(Scatter_Within) × Scatter_Between matrix.
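
A minimal multi-class sketch of that eigen decomposition follows. The function name `lda_directions` and the synthetic three-class data are assumptions for illustration. With 3 classes the between-class scatter has rank at most 2, so only two eigenvalues are meaningfully nonzero.

```python
import numpy as np

def lda_directions(X, y):
    """Eigenvalues and eigenvectors of S_W^{-1} S_B, sorted by decreasing eigenvalue."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
        diff = (mc - overall_mean).reshape(-1, 1)
        S_b += len(Xc) * (diff @ diff.T)               # between-class scatter
    # Solve S_w M = S_b instead of forming the inverse explicitly.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1]
    return eigvals.real[order], eigvecs.real[:, order]

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
X = np.vstack([rng.normal(c, 1.0, size=(100, 3)) for c in centers])
y = np.repeat([0, 1, 2], 100)

eigvals, eigvecs = lda_directions(X, y)
X_latent = X @ eigvecs[:, :2]     # project onto the top two discriminants
print(eigvals)
```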
### Otsu's Method
Otsu's method is related to LDA.   
Otsu is for greyscale image analysis.   
Otsu chooses the black/white threshold that maximizes the between-class variance of pixel intensities (equivalently, minimizes the within-class variance).
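
A compact sketch of Otsu's method on integer grey levels, assuming 256 levels and a tiny made-up image. The function name `otsu_threshold` is chosen for this example.

```python
import numpy as np

def otsu_threshold(pixels, levels=256):
    """Grey level t maximizing between-class variance for classes <= t and > t."""
    hist = np.bincount(pixels.ravel(), minlength=levels).astype(float)
    prob = hist / hist.sum()
    grey = np.arange(levels)
    w0 = np.cumsum(prob)                  # weight of the dark class (levels <= t)
    mu_cum = np.cumsum(prob * grey)       # cumulative intensity mass up to t
    mu_total = mu_cum[-1]
    w1 = 1.0 - w0                         # weight of the bright class (levels > t)
    valid = (w0 > 0) & (w1 > 0)           # skip thresholds that leave a class empty
    between = np.zeros(levels)
    between[valid] = (mu_total * w0[valid] - mu_cum[valid]) ** 2 / (w0[valid] * w1[valid])
    return int(np.argmax(between))

# Hypothetical 2x3 image: one dark row and one bright row.
image = np.array([[50, 52, 48], [200, 198, 202]])
t = otsu_threshold(image)
mask = image > t                          # True for pixels assigned to the bright class
print(t, mask)
```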

## Scikit-Learn
The scikit-learn [LDA](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html) 
has these features.

Inputs:
* Solver algorithm (`solver`). Choices: `svd` (the default) avoids computing the covariance matrix; `eigen` and `lsqr` (least squares) must compute covariance and can use shrinkage. 
* Shrinkage (`shrinkage`): when there are few samples relative to features, the empirical covariance is a poor estimate, so it is better to apply shrinkage, which regularizes the estimate. `'auto'` uses the Ledoit-Wolf lemma.
* Priors (`priors`): if not given, inferred from class frequencies in the data.
* Target dimensions (`n_components`): if not given, uses min(#classes-1, #features).

Outputs:
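* The vectors w (`coef_`) and the intercepts b (`intercept_`).
* Means (centroids) per class (`means_`).
* Explained variance ratio (`explained_variance_ratio_`), available with the `svd` and `eigen` solvers. 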
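
A short usage sketch of these inputs and outputs. It assumes scikit-learn is installed and uses its bundled Iris dataset; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # 150 samples, 4 features, 3 classes

# Default solver is "svd"; n_components defaults to min(#classes-1, #features) = 2.
lda = LinearDiscriminantAnalysis()
X_2d = lda.fit_transform(X, y)         # dimensionality reduction

print(X_2d.shape)                      # reduced data
print(lda.coef_.shape)                 # one weight vector per class
print(lda.intercept_.shape)            # one intercept per class
print(lda.means_.shape)                # one centroid per class
print(lda.explained_variance_ratio_)   # one ratio per discriminant

# Shrinkage requires the "lsqr" or "eigen" solver.
lda_shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
accuracy = lda_shrunk.score(X, y)      # LDA used directly as a classifier
print(accuracy)
```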