## Linear Discriminant Analysis (LDA)

### Overview

Linear Discriminant Analysis (LDA) is a supervised learning algorithm used for classification and dimensionality reduction. It finds the linear combination of features that best separates two or more classes of data.

### Mathematical Foundations

#### 1. **Data Representation**
Consider a dataset $X \in \mathbb{R}^{n \times p}$ with $n$ samples (rows) and $p$ features (columns), and a target vector $y$ of length $n$ whose entries take $k$ distinct class labels.

#### 2. **Class-wise Mean Vectors**
Compute the mean vector $\mu_i$ for each class, where $i$ is the class index.

$$ \mu_i = \frac{1}{n_i} \sum_{x \in X_i} x $$
where $n_i$ is the number of samples in class $i$ and $X_i$ is the set of samples in class $i$.
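
The class-wise means can be sketched in NumPy as follows. The toy data, the labels, and the helper name `class_means` are illustrative assumptions, not part of the method itself.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 2 features, 2 classes (labels 0 and 1).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

def class_means(X, y):
    """Return the sorted class labels and a (k, p) array of per-class means."""
    classes = np.unique(y)
    # Boolean masking selects the rows of X that belong to class c (the set X_i).
    means = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, means

classes, means = class_means(X, y)
print(means)
```

Row `i` of `means` holds $\mu_i$ for the `i`-th label in `classes`.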

#### 3. **Overall Mean Vector**
Compute the overall mean vector $\mu$ for the entire dataset.

$$ \mu = \frac{1}{n} \sum_{i=1}^k \sum_{x \in X_i} x $$
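
Since $\sum_{x \in X_i} x = n_i \mu_i$ by the definition of the class mean, the overall mean can equivalently be written as a size-weighted average of the class means:

$$ \mu = \frac{1}{n} \sum_{i=1}^k n_i \mu_i, \qquad n = \sum_{i=1}^k n_i $$

It reduces to the plain average of the class means only when all classes have the same number of samples.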

#### 4. **Scatter Matrices**
- **Within-class scatter matrix $S_W$**: Measures the scatter (variance) within each class.

$$ S_W = \sum_{i=1}^k \sum_{x \in X_i} (x - \mu_i)(x - \mu_i)^T $$

- **Between-class scatter matrix $S_B$**: Measures the scatter between the class means.

$$ S_B = \sum_{i=1}^k n_i (\mu_i - \mu)(\mu_i - \mu)^T $$
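
A minimal NumPy sketch of both scatter matrices, written directly from the two sums above. The function name `scatter_matrices` and the toy data are assumptions made for illustration.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_W, S_B), each of shape (p, p)."""
    p = X.shape[1]
    mu = X.mean(axis=0)                      # overall mean
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for c in np.unique(y):
        X_c = X[y == c]                      # samples of class c
        mu_c = X_c.mean(axis=0)              # class mean
        centered = X_c - mu_c
        S_W += centered.T @ centered         # sum of (x - mu_i)(x - mu_i)^T
        d = (mu_c - mu).reshape(-1, 1)       # column vector mu_i - mu
        S_B += X_c.shape[0] * (d @ d.T)      # n_i (mu_i - mu)(mu_i - mu)^T
    return S_W, S_B

# Hypothetical toy data: 6 samples, 2 features, 2 classes.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])
S_W, S_B = scatter_matrices(X, y)
```

A useful sanity check is that $S_W + S_B$ equals the total scatter $\sum_x (x - \mu)(x - \mu)^T$ computed over all samples.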

#### 5. **Eigenvalue Problem**
Solve the generalized eigenvalue problem $S_B v = \lambda S_W v$ to find the linear discriminants. When $S_W$ is invertible, this is equivalent to:

$$ S_W^{-1} S_B v = \lambda v $$

Here, $ \lambda $ represents the eigenvalues and $ v $ represents the eigenvectors (linear discriminants).
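
One way to solve this numerically is sketched below, assuming $S_W$ is invertible. The function name `discriminant_directions` and the two input matrices are hypothetical values chosen for illustration.

```python
import numpy as np

def discriminant_directions(S_W, S_B):
    """Return eigenvalues (descending) and matching eigenvectors (as columns)."""
    # Solve S_W M = S_B for M = S_W^{-1} S_B without forming the inverse explicitly.
    M = np.linalg.solve(S_W, S_B)
    eigvals, eigvecs = np.linalg.eig(M)
    # M is generally not symmetric, so eig may return a complex dtype. For a
    # positive definite S_W and positive semidefinite S_B the eigenvalues are
    # real, so any imaginary part is numerical noise and is dropped here.
    eigvals, eigvecs = eigvals.real, eigvecs.real
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Hypothetical scatter matrices for a two-feature problem.
S_W = np.array([[2.0, 0.0], [0.0, 1.0]])
S_B = np.array([[1.0, 1.0], [1.0, 1.0]])
eigvals, eigvecs = discriminant_directions(S_W, S_B)
```

Each column `eigvecs[:, j]` is determined only up to sign and scale, so comparisons between implementations should check $S_B v = \lambda S_W v$ rather than the raw vector entries.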

#### 6. **Selecting Linear Discriminants**
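Sort the eigenvalues in descending order and use the eigenvectors with the largest eigenvalues as the columns of the transformation matrix $W$. Because $S_B$ has rank at most $k-1$, at most $k-1$ eigenvalues are nonzero, so $W$ has at most $k-1$ useful columns.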

#### 7. **Projecting Data**
Project the original data onto the new subspace:

$$ X_{LDA} = X W $$
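
Selection and projection together can be sketched as below. The helper name `project` and the input values are hypothetical. The eigendecomposition is assumed to have been computed already, with eigenvectors stored as columns.

```python
import numpy as np

def project(X, eigvals, eigvecs, n_components):
    """Build W from the leading eigenvectors and return X @ W."""
    # Indices of the n_components largest eigenvalues, largest first.
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]          # shape (p, n_components)
    return X @ W                   # shape (n, n_components)

# Hypothetical inputs: 3 samples with 2 features, plus an eigendecomposition
# assumed to come from the eigenvalue step.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
eigvals = np.array([0.0, 8.0])
eigvecs = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)
X_lda = project(X, eigvals, eigvecs, n_components=1)
```

Each row of `X_lda` holds the coordinates of one sample along the selected discriminants.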

### Example

Suppose we have a dataset with two features: length and width, and two classes: A and B.

1. **Class-wise Mean Vectors**

    - Class A: $ \mu_A = [1, 2]^T $
    - Class B: $ \mu_B = [3, 4]^T $

2. **Overall Mean Vector** (assuming both classes contain the same number of samples)

    $$ \mu = [2, 3]^T $$

3. **Scatter Matrices**

    - Within-class scatter matrix $S_W$ (illustrative values; they cannot be derived from the means alone):

    $$
    S_W = \begin{bmatrix}
    0.5 & 0 \\
    0 & 0.5
    \end{bmatrix}
    $$

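    - Between-class scatter matrix $S_B$ (shown with the weights $n_i$ set to 1; equal class sizes $n_A = n_B = m$ would scale it by $m$, which changes the eigenvalues but not the eigenvectors):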

    $$
    S_B = \begin{bmatrix}
    2 & 2 \\
    2 & 2
    \end{bmatrix}
    $$

4. **Eigenvalue Problem**

    Solve $ S_W^{-1} S_B v = \lambda v $ to get eigenvalues and eigenvectors.

5. **Selecting Linear Discriminants**

    Select the eigenvector corresponding to the largest eigenvalue.

6. **Projecting Data**

    Project data onto the selected linear discriminant.
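
Steps 4 to 6 can be carried through by hand. The following is a sketch that takes the $S_W$ and $S_B$ above at face value:

$$
S_W^{-1} S_B = \begin{bmatrix}
2 & 0 \\
0 & 2
\end{bmatrix}
\begin{bmatrix}
2 & 2 \\
2 & 2
\end{bmatrix}
= \begin{bmatrix}
4 & 4 \\
4 & 4
\end{bmatrix}
$$

$$ \det(S_W^{-1} S_B - \lambda I) = (4 - \lambda)^2 - 16 = \lambda(\lambda - 8) = 0 \quad \Rightarrow \quad \lambda_1 = 8, \ \lambda_2 = 0 $$

$$ v_1 = \frac{1}{\sqrt{2}} [1, 1]^T, \qquad z = v_1^T x = \frac{x_1 + x_2}{\sqrt{2}} $$

The class means project to $v_1^T \mu_A = 3/\sqrt{2} \approx 2.12$ and $v_1^T \mu_B = 7/\sqrt{2} \approx 4.95$. As expected for two classes, $v_1$ is parallel to $S_W^{-1}(\mu_B - \mu_A) = [4, 4]^T$.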

### When to Use LDA

- **Classification tasks**: LDA is primarily used for classification when class labels are available.
- **Dimensionality reduction**: LDA reduces dimensions while preserving class separability, making it useful for visualization and preprocessing.

### How to Use LDA

1. **Compute class-wise and overall mean vectors**.
2. **Calculate within-class and between-class scatter matrices**.
3. **Solve the generalized eigenvalue problem**.
4. **Select the top $ k-1 $ eigenvectors**.
5. **Project the original data onto the new subspace**.
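
The five steps above can be sketched end to end in NumPy. This is an illustrative sketch, not a reference implementation: the function name `lda_fit_transform`, the default for `n_components`, and the synthetic data are all assumptions, and $S_W$ is assumed to be invertible.

```python
import numpy as np

def lda_fit_transform(X, y, n_components=None):
    """Return (X_lda, W) for data X of shape (n, p) and labels y of length n."""
    classes = np.unique(y)
    n, p = X.shape
    if n_components is None:
        n_components = min(p, len(classes) - 1)

    # Step 1: overall mean (class means are computed inside the loop).
    mu = X.mean(axis=0)

    # Step 2: within-class and between-class scatter matrices.
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for c in classes:
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)
        centered = X_c - mu_c
        S_W += centered.T @ centered
        d = (mu_c - mu).reshape(-1, 1)
        S_B += X_c.shape[0] * (d @ d.T)

    # Step 3: eigen-decomposition of S_W^{-1} S_B (assumes S_W is invertible).
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    eigvals, eigvecs = eigvals.real, eigvecs.real

    # Step 4: keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]

    # Step 5: project the data.
    return X @ W, W

# Synthetic data (an assumption for the demo): 3 Gaussian classes, 3 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0, 0], scale=1.0, size=(30, 3)),
               rng.normal(loc=[4, 4, 0], scale=1.0, size=(30, 3)),
               rng.normal(loc=[0, 4, 4], scale=1.0, size=(30, 3))])
y = np.repeat([0, 1, 2], 30)
X_lda, W = lda_fit_transform(X, y)
print(X_lda.shape)
```

With three classes and three features, the default keeps $k - 1 = 2$ discriminants, so the projected data can be drawn as a 2D scatter plot.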

### Advantages

- **Improves class separability**: Maximizes the distance between class means while minimizing the spread within each class.
- **Effective for linear boundaries**: Works well when the classes can be separated reasonably well by linear boundaries.
- **Reduces dimensionality**: Reduces the feature space to at most $k-1$ dimensions for $k$ classes, which is useful for visualization.

### Disadvantages

- **Assumes normality**: Performance can degrade when the features are far from normally distributed within each class.
- **Sensitive to outliers**: Outliers can distort the mean and covariance estimates, which degrades performance.
- **Limited to linear boundaries**: May perform poorly when the classes can only be separated by a nonlinear boundary.

### Assumptions

1. **Normal distribution**: Assumes that features are normally distributed within each class.
2. **Equal covariance matrices**: Assumes that all classes share the same covariance matrix.
3. **Linearity**: The decision boundaries between classes are linear. This follows from the first two assumptions.
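
One way to see how the linear boundaries in item 3 arise from items 1 and 2 is the following sketch. It introduces notation that does not appear elsewhere in this text: a shared covariance matrix $\Sigma$ and class prior probabilities $\pi_i$. For Gaussian class densities with common covariance, the quadratic term $x^T \Sigma^{-1} x$ is identical for every class and cancels when classes are compared, which leaves a discriminant function that is linear in $x$:

$$ \delta_i(x) = x^T \Sigma^{-1} \mu_i - \frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \log \pi_i $$

A sample is assigned to the class with the largest $\delta_i(x)$, and the boundary between classes $i$ and $j$ is the set where $\delta_i(x) = \delta_j(x)$, which is a hyperplane.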

### Conclusion

LDA is a supervised technique for both classification and dimensionality reduction. It maximizes the ratio of between-class variance to within-class variance, which yields the linear projection in which the classes are most separated relative to their internal spread. Its assumptions (normality, equal covariance matrices, linear boundaries) limit where it applies, but it remains a widely used and effective method.