## t-Distributed Stochastic Neighbor Embedding (t-SNE)

### Overview

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique primarily used for visualizing high-dimensional data in 2 or 3 dimensions. It is particularly effective at preserving local structure and revealing clusters in data.

### Mathematical Foundations

#### 1. **Pairwise Similarities in High Dimension**

Compute pairwise similarities between data points in the high-dimensional space using a Gaussian distribution.

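Given a dataset $ X $ with $ n $ samples and $ p $ features, the similarity of point $ x_j $ to point $ x_i $ is the conditional probability that $ x_i $ would pick $ x_j $ as its neighbor under a Gaussian centered at $ x_i $ (with $ p_{i|i} = 0 $):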

$$ p_{j|i} = \frac{\exp\left(-\| x_i - x_j \|^2 / (2 \sigma_i^2)\right)}{\sum_{k \neq i} \exp\left(-\| x_i - x_k \|^2 / (2 \sigma_i^2)\right)} $$

The conditional probabilities are then symmetrized into joint probabilities:

$$ p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} $$

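where $ \sigma_i $ is the standard deviation (bandwidth) of the Gaussian centered at $ x_i $, so $ \sigma_i^2 $ is its variance, and $ n $ is the number of samples. Each $ \sigma_i $ is chosen separately so that the conditional distribution around $ x_i $ matches the user-specified perplexity, which gives points in dense regions a smaller bandwidth than points in sparse regions.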
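The two formulas above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function names `conditional_probabilities` and `joint_probabilities` are hypothetical, and the per-point widths `sigmas` are assumed to be given, whereas practical implementations search for each $ \sigma_i $ to match a target perplexity.

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Return a matrix C with C[i, j] = p_{j|i} for the rows of X.

    `sigmas` holds one Gaussian width per point (assumed to be given here).
    """
    # Squared Euclidean distances between all pairs of rows.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Exponent of the Gaussian kernel, one bandwidth per row i.
    logits = -sq_dist / (2.0 * sigmas[:, None] ** 2)
    # Exclude k = i from the sum by forcing p_{i|i} = 0.
    np.fill_diagonal(logits, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    logits -= logits.max(axis=1, keepdims=True)
    C = np.exp(logits)
    return C / C.sum(axis=1, keepdims=True)

def joint_probabilities(C):
    """Symmetrize conditional probabilities: p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = C.shape[0]
    return (C + C.T) / (2.0 * n)
```

Each row of the conditional matrix sums to 1, and dividing by $ 2n $ in the symmetrization makes the joint matrix sum to 1 over all pairs.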

#### 2. **Pairwise Similarities in Low Dimension**

Compute pairwise similarities between data points in the low-dimensional space using a Student's t-distribution with one degree of freedom (which is equivalent to a Cauchy distribution).

For low-dimensional points $ y_i $ and $ y_j $:

$$ q_{ij} = \frac{(1 + \| y_i - y_j \|^2)^{-1}}{\sum_{k \neq l} (1 + \| y_k - y_l \|^2)^{-1}} $$
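A short NumPy sketch of this formula may help make the normalization concrete. The function name `low_dim_affinities` is hypothetical; the only assumption beyond the formula is that the diagonal is set to zero, which matches the sum running over $ k \neq l $.

```python
import numpy as np

def low_dim_affinities(Y):
    """Return Q with Q[i, j] = q_ij for the low-dimensional points in Y."""
    # Squared Euclidean distances between all pairs of embedded points.
    diff = Y[:, None, :] - Y[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Unnormalized Student-t kernel with one degree of freedom.
    W = 1.0 / (1.0 + sq_dist)
    # Pairs with k = l are excluded from the normalizing sum.
    np.fill_diagonal(W, 0.0)
    # A single global normalizer: Q sums to 1 over all ordered pairs.
    return W / W.sum()
```

Unlike the high-dimensional conditional probabilities, which are normalized per point, $ q_{ij} $ is normalized once over all pairs, so the whole matrix forms a single probability distribution.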

#### 3. **Minimizing the Kullback-Leibler Divergence**

Minimize the Kullback-Leibler (KL) divergence between the joint probabilities $ P $ and $ Q $ to ensure that the low-dimensional representation $ Y $ preserves the high-dimensional similarities.

$$ KL(P || Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$

Writing the cost as $ C = KL(P || Q) $, the embedding is found by gradient descent on the points $ y_i $, using the gradient:

$$ \frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} \left( p_{ij} - q_{ij} \right) \left( y_i - y_j \right) \left( 1 + \| y_i - y_j \|^2 \right)^{-1} $$
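The cost and its gradient can be sketched as follows. This is an illustrative sketch, not an optimized implementation: the function names are hypothetical, $ C $ is taken to be the KL divergence above, and `P` is assumed to be symmetric with a zero diagonal and entries summing to 1.

```python
import numpy as np

def student_t_affinities(Y):
    """Return (Q, W): normalized q_ij and the unnormalized kernel (1 + d^2)^-1."""
    diff = Y[:, None, :] - Y[None, :, :]
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl_divergence(P, Q):
    """KL(P || Q) summed over pairs with p_ij > 0 (this skips the diagonal)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def kl_gradient(P, Y):
    """Gradient of KL(P || Q) with respect to every row y_i of Y."""
    Q, W = student_t_affinities(Y)
    # coeff[i, j] = (p_ij - q_ij) * (1 + ||y_i - y_j||^2)^-1
    coeff = (P - Q) * W
    # sum_j coeff_ij (y_i - y_j) = y_i * sum_j coeff_ij - sum_j coeff_ij y_j
    return 4.0 * (coeff.sum(axis=1)[:, None] * Y - coeff @ Y)
```

Pairs with $ p_{ij} > q_{ij} $ pull $ y_i $ toward $ y_j $, while pairs with $ p_{ij} < q_{ij} $ push them apart. A quick way to check a sketch like this is to compare `kl_gradient` against a finite-difference estimate of `kl_divergence`.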

### Example

Consider a dataset with three clusters in a high-dimensional space. t-SNE can reveal these clusters in 2D.

1. **Compute Pairwise Similarities in High Dimension**

    Calculate $ p_{ij} $ using the Gaussian distribution for all pairs of data points.

2. **Initialize Low-dimensional Representation**

    Initialize $ y_i $ randomly or using PCA.

3. **Compute Pairwise Similarities in Low Dimension**

    Calculate $ q_{ij} $ using the Student's t-distribution for all pairs of low-dimensional points.

4. **Minimize KL Divergence**

    Use gradient descent to iteratively update $ y_i $ and minimize the KL divergence.

5. **Result**

    Visualize the 2D representation, revealing the clusters.
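The five steps above can be sketched end to end in NumPy. This is a deliberately simplified sketch under stated assumptions: it uses one shared bandwidth `sigma` instead of per-point bandwidths tuned by perplexity, plain gradient descent without the momentum or early exaggeration that practical implementations typically add, and synthetic data from a hypothetical helper `make_three_clusters`. All function names and default values are illustrative.

```python
import numpy as np

def make_three_clusters(n_per_cluster=20, dim=10, seed=0):
    """Synthetic data: three Gaussian blobs in `dim` dimensions."""
    rng = np.random.default_rng(seed)
    centers = 10.0 * rng.normal(size=(3, dim))
    X = np.vstack([c + rng.normal(size=(n_per_cluster, dim)) for c in centers])
    labels = np.repeat(np.arange(3), n_per_cluster)
    return X, labels

def joint_p(X, sigma):
    """Step 1: symmetrized Gaussian affinities (shared sigma is an assumption)."""
    n = X.shape[0]
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    logits = -sq_dist / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)
    return (cond + cond.T) / (2.0 * n)

def joint_q(Y):
    """Step 3: Student-t affinities, plus the unnormalized kernel W."""
    W = 1.0 / (1.0 + np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2))
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl(P, Q):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def tsne_sketch(X, sigma=2.0, n_iter=500, lr=5.0, seed=0):
    """Run plain gradient descent on KL(P || Q); return Y and the KL history."""
    rng = np.random.default_rng(seed)
    P = joint_p(X, sigma)
    # Step 2: small random initialization in 2D.
    Y = 1e-2 * rng.normal(size=(X.shape[0], 2))
    history = []
    for _ in range(n_iter):
        Q, W = joint_q(Y)
        history.append(kl(P, Q))
        # Step 4: gradient of the KL divergence, then one descent step.
        coeff = (P - Q) * W
        grad = 4.0 * (coeff.sum(axis=1)[:, None] * Y - coeff @ Y)
        Y = Y - lr * grad
    history.append(kl(P, joint_q(Y)[0]))
    return Y, history

if __name__ == "__main__":
    X, labels = make_three_clusters()
    Y, history = tsne_sketch(X)
    print("KL at start:", history[0], "KL at end:", history[-1])
```

For step 5, the returned `Y` can be passed to any scatter-plot routine, colored by `labels`, to inspect whether the three groups separate.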

### When to Use t-SNE

- **Data visualization**: t-SNE is particularly useful for visualizing high-dimensional data in 2D or 3D.
- **Exploratory data analysis**: Helps in identifying patterns, clusters, and outliers in the data.
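- **Preprocessing (with caution)**: Can be used to reduce dimensions before applying other algorithms, but the embedding distorts global distances and standard t-SNE provides no mapping for new, unseen points, so it is usually a poor substitute for methods such as PCA in a modeling pipeline.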

### How to Use t-SNE

1. **Standardize the data** if necessary.
2. **Set hyperparameters** such as perplexity (typically between 5 and 50), learning rate, and number of iterations.
3. **Compute pairwise similarities** in high-dimensional space.
4. **Initialize low-dimensional points**.
5. **Run gradient descent** to minimize KL divergence.
6. **Visualize the low-dimensional data**.
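In practice these steps are usually delegated to a library. The sketch below assumes scikit-learn is installed and uses its `TSNE` estimator, which performs steps 3 to 5 internally; the helper name `embed_with_tsne`, the synthetic `make_blobs` data, and the specific hyperparameter values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def embed_with_tsne(X, perplexity=30.0, learning_rate=200.0, seed=0):
    """Standardize X, then embed it in 2D with scikit-learn's TSNE."""
    # Step 1: put all features on a comparable scale.
    X_scaled = StandardScaler().fit_transform(X)
    # Step 2: hyperparameters; perplexity must be smaller than the sample count.
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        learning_rate=learning_rate,
        init="pca",          # step 4: PCA-based initialization
        random_state=seed,   # fixes the seed so reruns are comparable
    )
    # Steps 3 to 5 happen inside fit_transform.
    Y = tsne.fit_transform(X_scaled)
    return Y, tsne.kl_divergence_

# Synthetic stand-in for real data: 150 points, 20 features, 3 groups.
X, labels = make_blobs(n_samples=150, n_features=20, centers=3, random_state=0)
Y, final_kl = embed_with_tsne(X)
print(Y.shape, final_kl)
```

For step 6, `Y[:, 0]` and `Y[:, 1]` can be plotted as a scatter plot colored by `labels`. Rerunning with a few different perplexity values is a cheap way to see how much the picture depends on that setting.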

### Advantages

- **Effective visualization**: Reveals complex patterns and clusters that are hard to see in the raw high-dimensional data or in linear projections of it.
- **Nonlinear relationships**: Captures nonlinear relationships between data points.
- **Local structure preservation**: Preserves the local structure of the data well.

### Disadvantages

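- **Computationally intensive**: The exact algorithm computes all pairwise similarities, so time and memory grow as $ O(n^2) $ in the number of samples; approximations such as Barnes-Hut t-SNE reduce this to roughly $ O(n \log n) $ but remain slow for very large datasets.
- **Parameter sensitivity**: Results can be sensitive to hyperparameters like perplexity and learning rate, so it is worth comparing several settings before drawing conclusions.
- **Non-deterministic**: The objective is non-convex, so different runs can yield different results depending on the random initialization; fixing the random seed makes a run reproducible.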

### Assumptions

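- **Locality preservation**: Assumes that the structure of interest lies in local neighborhoods, so points that are similar in high-dimensional space should remain close in low-dimensional space. Distances between well-separated clusters and the apparent sizes of clusters in the embedding are not reliably meaningful.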
- **Non-linear transformation**: Suitable for datasets where linear methods like PCA fail to capture complex relationships.

### Conclusion

t-SNE is a powerful tool for dimensionality reduction, especially for visualizing high-dimensional data. By focusing on preserving local structure and minimizing the KL divergence between the high-dimensional and low-dimensional similarity distributions, t-SNE often reveals clusters and patterns that are hard to see in the original data. Despite its computational demands and sensitivity to parameters, t-SNE remains a popular choice for exploratory data analysis and visualization.