# LDA = Linear Discriminant Analysis
Supervised learning.   
Using labeled data, 
LDA projects to a lower-dimensional space 
to maximize the (between-group)/(within-group) scatter.    
LDA separates the labels by a linear combination of features.

For example, project 2D points onto a 1D line,
choosing the line that maximizes between/within scatter.
Or project 3D points onto a 2D plane,
choosing the plane that maximizes between/within scatter.

For the two-class problem, the solution is computed directly from the class means and covariances.    
For the multi-class problem, the solution requires optimizing the scatter ratio, which reduces to an eigenvalue problem.   
The computation relies on the eigen decomposition of covB/covW = covW$^{-1}$covB.
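
A minimal NumPy sketch of the two-class case, assuming two synthetic Gaussian classes (the means, covariance, and sample sizes are made up for illustration). It computes the direction directly as covW$^{-1}$(mean1 - mean0) and checks that the top eigenvector of covW$^{-1}$covB points the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical classes sharing one covariance, differing only in mean.
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([2.0, 1.0], cov, size=200)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter: summed outer products of deviations from each class mean.
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Direct two-class solution: w is proportional to Sw^{-1} (mu1 - mu0).
w_direct = np.linalg.solve(Sw, mu1 - mu0)
w_direct /= np.linalg.norm(w_direct)

# Same direction from the eigen decomposition of Sw^{-1} Sb.
Sb = np.outer(mu1 - mu0, mu1 - mu0)
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_eig = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
w_eig /= np.linalg.norm(w_eig)

# Eigenvectors are defined up to sign, so compare the absolute cosine.
print(abs(w_direct @ w_eig))
```

The printed cosine should be 1 up to rounding, since covW$^{-1}$covB has rank one in the two-class case.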

Assume the features are continuous and independent.   
Assume there is one covariance matrix, i.e. covariance does not vary by class.    
Geometrically, every class has a different mean but the same ellipsoid distribution.  
All the decision boundaries are linear (line, plane, hyperplane).   

For classes with different ellipsoid distributions, 
step up to QDA = quadratic discriminant analysis, 
and estimate a covariance matrix for each class.
QDA gives quadratic decision boundaries (conic sections such as parabolas, ellipses, and hyperbolas).

LDA is infeasible at very high dimensions.   
When the features outnumber the data points, covW is singular and the discriminant may not exist.    
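
A small sketch of why this happens, assuming random data with more features than points (the sizes are hypothetical). The within-class scatter then has rank below the number of features, so it cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 20                        # hypothetical: fewer points than features
X = rng.normal(size=(n, d))
y = np.array([0] * 5 + [1] * 5)

# Within-class scatter, accumulated class by class.
Sw = np.zeros((d, d))
for c in np.unique(y):
    dev = X[y == c] - X[y == c].mean(axis=0)
    Sw += dev.T @ dev

# The rank is at most n - (#classes), which is below d here.
rank = np.linalg.matrix_rank(Sw)
print(rank, "of", d)
```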

LDA was derived from Fisher's Discriminant Analysis,
extending the statistics to also do dimensionality reduction.    
LDA was conceived for two-class problems,
but was extended for multi-class problems.

Good tutorial at [knowledgehut](https://www.knowledgehut.com/blog/data-science/linear-discriminant-analysis-for-machine-learning). 
Good video from [UVa](https://youtu.be/IMfLXEOksGc). 
Code sample at [python-engineer](https://www.python-engineer.com/courses/mlfromscratch/14-lda/).

Formulas given on MultivariateStats
for [two-class](https://multivariatestatsjl.readthedocs.io/en/latest/lda.html)
and [multi-class](https://multivariatestatsjl.readthedocs.io/en/latest/mclda.html).

## Define scatter
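Scatter is a sum of outer products of deviations from a mean (an unnormalized covariance).  
Within-group scatter, summed over every instance i in each class c:    
$\Sigma_c\Sigma_{i \in c}(\bar{x}_i-\bar{\mu}_c)(\bar{x}_i-\bar{\mu}_c)^T$   
Between-group scatter for all classes c vs overall mean $\bar{\mu}$:   
$\Sigma_c(\bar{\mu}_c-\bar{\mu})(\bar{\mu}_c-\bar{\mu})^T$   
Many references weight each between-group term by the class size $n_c$.   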
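
A minimal NumPy sketch of both scatter matrices. The helper name `scatter_matrices` and the toy data are hypothetical, and the between-group term is weighted by class size, which is an assumption (one common convention). With that weighting, within plus between equals the total scatter.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (Sw, Sb) for data X of shape (n_samples, n_features) and labels y."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        dev = Xc - mu_c
        Sw += dev.T @ dev                                # within-group scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # assumption: weighted by class size
    return Sw, Sb

# Toy data: two classes of three 2D points each.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array([0, 0, 0, 1, 1, 1])

Sw, Sb = scatter_matrices(X, y)
St = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0))       # total scatter
print(np.allclose(Sw + Sb, St))
```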

## LDA steps
The labeled data points are in a d-dimensional space for d features.   
Measure scatter within the same label, and between groups of labels.  
Fisher's Score uses between-scatter in numerator and within-scatter in denominator.    
There is a d-dimensional vector w that maximizes Fisher's Score.   
Of all hyperplanes perpendicular to w, one maximally separates the classes.   
Use least squares regression to find its intercept: wx+b=0.    
Now w is a linear combination of features.    
But w is also a latent feature that maximally separates classes.    
We can recursively add more latent features = dimensions.
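
The steps above can be sketched as follows, assuming the iris data set from scikit-learn and SciPy's generalized symmetric eigensolver. It solves for all directions at once rather than adding them recursively, and it weights between-scatter by class size (an assumption). The intercept step is left out.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)             # d = 4 features, 3 classes
mu = X.mean(axis=0)
d = X.shape[1]
Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
for c in np.unique(y):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    Sw += (Xc - mu_c).T @ (Xc - mu_c)
    Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # assumption: weighted by class size

def fisher_score(w):
    """Between-scatter over within-scatter along direction w."""
    return (w @ Sb @ w) / (w @ Sw @ w)

# Generalized eigenproblem Sb w = lambda Sw w; eigh returns ascending eigenvalues.
eigvals, eigvecs = eigh(Sb, Sw)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]                     # top 2 latent features
Z = X @ W                                     # data projected onto them

# The score of the top direction equals its eigenvalue.
print(fisher_score(W[:, 0]), eigvals[order[0]])
```

With 3 classes, only 2 eigenvalues are meaningfully above zero, which is why at most (#classes - 1) latent features are useful.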

## Assumptions
LDA is sensitive to its assumptions.
If these assumptions are violated, poor discriminant functions result.

Assume independent features (for learning a linear combination).
Assume independent data instances (for learning the placement or intercept of each discriminant).
Assume the data are generated by one Gaussian distribution per class. 
Assume homoscedasticity, i.e. the same variance everywhere in each class.
Thus, points from any one class form a circular sombrero,
and all the sombreros are the same size. 
Look for a linear combination of features that explains the class labels.
Finally, place lines or hyperplanes between sombreros.
(Yes, LDA is rather simplistic. See QDA.)

Each discriminant function creates latent features (like an eigenvector).
Each discriminant gets a discriminant score (like an eigenvalue).

Say you use the top 3 linear discriminants.
These would draw 3 hyperplanes,
assigning points and populations to classes. 


## Dependencies
### Likelihood
Consider each class using its assumed mean & stdev.
In LDA, 
each point has a probability density under this Gaussian (computed with the PDF).
The class as a whole has a likelihood based on these data.

### Maximum likelihood
In LDA, each point is assigned to the class model that explains it best.
(Like K-means Clustering?)

### Bayes decision rule
In LDA, draw a hyperplane between the Gaussians.
Assign data points to classes based on which side of the line.
The line placement can incorporate priors.
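
A small sketch of how priors move the boundary, assuming the standard linear discriminant score for a shared covariance S: x'S$^{-1}$m - 0.5 m'S$^{-1}$m + log(prior). The helper `lda_scores` and the 1D numbers are hypothetical.

```python
import numpy as np

def lda_scores(x, means, cov, priors):
    """Linear discriminant score per class: x'S^-1 m - 0.5 m'S^-1 m + log(prior)."""
    scores = []
    for m, p in zip(means, priors):
        a = np.linalg.solve(cov, m)           # S^-1 m
        scores.append(x @ a - 0.5 * m @ a + np.log(p))
    return np.array(scores)

means = [np.array([0.0]), np.array([2.0])]    # hypothetical 1D class means
cov = np.array([[1.0]])                       # shared variance
x = np.array([1.1])                           # slightly closer to the second mean

equal = lda_scores(x, means, cov, [0.5, 0.5])
skewed = lda_scores(x, means, cov, [0.9, 0.1])
print(np.argmax(equal), np.argmax(skewed))    # prints "1 0"
```

With equal priors the point goes to the nearer class. A strong prior for the first class shifts the boundary past the point, so the assignment flips.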

### Distance
Use Mahalanobis distance which incorporates the covariance.
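
A minimal sketch of the distance, assuming a hypothetical shared covariance with much more spread along the first axis. The point is nearer to class b in Euclidean terms but nearer to class a in Mahalanobis terms.

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Distance from x to mu, measured in units of the covariance."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

cov = np.array([[4.0, 0.0], [0.0, 0.25]])     # hypothetical shared covariance
mu_a = np.array([0.0, 0.0])
mu_b = np.array([3.0, 2.0])
x = np.array([3.0, 0.0])

print(np.linalg.norm(x - mu_a), np.linalg.norm(x - mu_b))        # Euclidean
print(mahalanobis(x, mu_a, cov), mahalanobis(x, mu_b, cov))      # Mahalanobis
```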

## LDA as classifier
LDA is a classifier.
But it is typically used for dimensionality reduction prior to classification. 
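
Both uses can be sketched with scikit-learn, assuming its bundled wine data set. The choice of k-nearest neighbors as the downstream classifier is only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_wine(return_X_y=True)             # 13 features, 3 classes

# LDA used directly as the classifier.
direct = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)

# LDA used to reduce 13 features to 2 discriminants, then a separate classifier.
reduce_then_knn = make_pipeline(
    LinearDiscriminantAnalysis(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
reduced = cross_val_score(reduce_then_knn, X, y, cv=5)

print(direct.mean(), reduced.mean())          # mean cross-validated accuracy
```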

## Related methods
### LDA and PCA
LDA and PCA both rely on eigen decomposition: PCA of the covariance matrix, LDA of covW$^{-1}$covB.  
LDA is supervised but PCA is unsupervised.

At high dimensions, LDA overfits the scatter of the data seen.  
This is because scatter just happens, even in random data, 
when the number of dimensions approaches the number of data points.
Covariance shrinkage, or dimensionality reduction such as PCA, can help.
Doing PCA before LDA will reduce the dimensions for LDA.
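
A sketch of both remedies in scikit-learn, assuming its bundled digits data set (64 features, some of them constant). Keeping 30 principal components is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)           # 64 features, 10 classes

# Option 1: PCA first, so LDA sees fewer dimensions.
pca_then_lda = make_pipeline(PCA(n_components=30), LinearDiscriminantAnalysis())
pca_scores = cross_val_score(pca_then_lda, X, y, cv=5)

# Option 2: keep all features but shrink the covariance estimate.
shrunk_lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
shrink_scores = cross_val_score(shrunk_lda, X, y, cv=5)

print(pca_scores.mean(), shrink_scores.mean())
```
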
### LDA vs QDA
LDA assumes same variances and covariances.   
LDA yields hyperplanes as discriminants.   
QDA = quadratic discriminant analysis.    
QDA allows different variances and covariances.   
QDA yields quadratic discriminants.  
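
A sketch of the difference, assuming synthetic data in which both classes share a mean but differ in spread. No hyperplane separates them, while a quadratic boundary (roughly a circle here) does.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)
# Hypothetical classes: same center, very different variances.
X = np.vstack([
    rng.normal(scale=0.5, size=(500, 2)),     # tight cluster
    rng.normal(scale=3.0, size=(500, 2)),     # wide cluster around it
])
y = np.array([0] * 500 + [1] * 500)

lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
qda_acc = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)
print(lda_acc, qda_acc)                       # training accuracy of each model
```
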
### LDA and ANOVA are opposites
ANOVA requires categorical independent variables, continuous dependent variables.   
LDA requires continuous independent variables, categorical dependent variables.   
### LDA and Logistic Regression
LDA assumes normally distributed data.  
Otherwise, use logistic regression.  
### Otsu's Method
Otsu's method is related to LDA.   
Otsu is for greyscale image analysis.   
Otsu chooses the black/white threshold that maximizes the between-class variance of pixel intensities.
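
A minimal NumPy sketch of Otsu's method, assuming an 8-bit greyscale image. The function name `otsu_threshold` and the synthetic two-tone image are hypothetical.

```python
import numpy as np

def otsu_threshold(image):
    """Return the grey level t that maximizes between-class variance (dark: <= t)."""
    hist = np.bincount(image.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                      # probability of each grey level
    levels = np.arange(256)
    w0 = np.cumsum(p)                          # weight of the dark class
    w1 = 1.0 - w0                              # weight of the bright class
    m = np.cumsum(p * levels)                  # cumulative mean up to t
    m_total = m[-1]
    denom = w0 * w1
    between = np.zeros(256)
    valid = denom > 1e-12                      # skip thresholds with an empty class
    between[valid] = (m_total * w0[valid] - m[valid]) ** 2 / denom[valid]
    return int(np.argmax(between))

rng = np.random.default_rng(0)
dark = rng.normal(50, 10, size=5000)           # synthetic dark pixels
bright = rng.normal(200, 10, size=5000)        # synthetic bright pixels
image = np.clip(np.concatenate([dark, bright]), 0, 255).astype(np.uint8)
image = image.reshape(100, 100)

t = otsu_threshold(image)
print(t)                                       # falls in the gap between the two modes
```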

## Scikit-Learn
The scikit-learn [LDA](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html) 
has these features.

Inputs:
* Solver algorithm. Choices: `eigen` and `lsqr` (least squares) both compute the covariance matrix, and both can use shrinkage; `svd` (the default) avoids the covariance matrix. 
* Shrinkage: for large covariance matrices, i.e. high dimensions or few samples, it is recommended to apply shrinkage, which regularizes the covariance estimate (for example the Ledoit-Wolf estimate via `shrinkage="auto"`).
* Priors: if not given, inferred from frequencies in the data.
* Target dimensions: if not given, uses min(#classes-1,#features).

Outputs:
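* The vectors w (`coef_`) and the intercepts b (`intercept_`).
* Means (centroids) per class (`means_`).
* Explained variance ratio (`explained_variance_ratio_`, available with the `svd` and `eigen` solvers). 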
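
A short usage sketch tying these inputs and outputs together, assuming scikit-learn's bundled iris data set.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)             # 4 features, 3 classes

lda = LinearDiscriminantAnalysis(
    solver="svd",        # default solver; does not form the covariance matrix
    priors=None,         # inferred from class frequencies
    n_components=2,      # min(#classes - 1, #features) = 2 here
)
Z = lda.fit_transform(X, y)                   # reduced data, shape (150, 2)

print(lda.coef_.shape, lda.intercept_.shape)  # the vectors w and intercepts b
print(lda.means_.shape)                       # class centroids
print(lda.explained_variance_ratio_)          # one ratio per kept discriminant
print(lda.priors_)                            # priors actually used
print(lda.score(X, y))                        # training accuracy as a classifier
```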