---
date: 2024-09-15 10:47:37
created: 2024-09-15 09:48:07
categories:
- School Stuff / Unsupervised Learning
---

## #T\_SNE

  

# t-Distributed Stochastic Neighbor Embedding (t-SNE)

  

t-SNE is a dimensionality reduction technique that is particularly good at visualizing high-dimensional data by turning it into low-dimensional data (usually 2D or 3D).

It is better than PCA at keeping local relationships between data points, and it captures non-linear patterns and clusters far more effectively than PCA, which is limited to linear projections.

  

### Key Concepts:

  

- Local Relations: t-SNE tries to keep points that are neighbors in the high-dimensional space close together when mapping them to a low-dimensional space.

  

- Non-Linearity: It captures more complex relationships than PCA by focusing on **preserving local clusters and neighborhoods**, which is great for non-linear data (images or word embeddings).

  

- Visualization: t-SNE is really useful for visualizing large/complex datasets with many dimensions, since it is good at turning those datasets into 2D or 3D for plotting purposes.

  

  

### Steps in Application:

  

1. **Pairwise Similarities:** It calculates the similarity between every pair of points in the original high-dimensional space.

  

2. **Probability Distribution:** t-SNE converts the pairwise similarities from step 1 into a probability distribution.

  

3. **Minimizing Divergence:** It then basically **_Dr. Stranges_** its way through candidate layouts (in practice, gradient descent on the KL divergence) until it finds a low-dimensional embedding where the pairwise similarities are as close as possible to those in the original high-dimensional space.
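
The three steps can be sketched in plain NumPy. This is a simplified illustration, not what scikit-learn does internally: it assumes a single fixed `sigma` for every point (real t-SNE tunes one per point from the perplexity), and it only evaluates the cost for a random embedding instead of optimizing it. All function names here are made up for the example.

```python
import numpy as np


def squared_distances(X):
    """Squared Euclidean distance between every pair of rows in X."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(D, 0)
    return np.maximum(D, 0)  # clip tiny negatives caused by round-off


def high_dim_affinities(X, sigma=1.0):
    """Steps 1-2: Gaussian similarities normalized into a joint distribution P."""
    D = squared_distances(X)
    P = np.exp(-D / (2 * sigma ** 2))
    np.fill_diagonal(P, 0)               # a point is not its own neighbor
    P /= P.sum(axis=1, keepdims=True)    # conditional p(j | i), one row per point
    return (P + P.T) / (2 * len(X))      # symmetrize so that P sums to 1


def low_dim_affinities(Y):
    """Student-t similarities (1 degree of freedom) in the embedding."""
    Q = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(Q, 0)
    return Q / Q.sum()


def kl_divergence(P, Q, eps=1e-12):
    """Step 3's objective: KL(P || Q), the quantity t-SNE tries to minimize."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps)))


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # 20 points in 5 dimensions (toy data)
Y = rng.normal(size=(20, 2))   # a random, unoptimized 2D embedding
P = high_dim_affinities(X)
Q = low_dim_affinities(Y)
cost = kl_divergence(P, Q)
print("KL divergence of the random embedding:", cost)
```

The full algorithm would repeatedly nudge `Y` to lower `cost`; the lower it gets, the better the 2D neighborhoods match the high-dimensional ones.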

  

### t-SNE in Python

We will take scikit-learn's handwritten digits dataset (8x8 images, so 64 dimensions per sample) and visualize it as a 2D plot using t-SNE.

  

  

`pip install scikit-learn matplotlib`

  

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Step 1: Load the digits dataset (handwritten digits)
digits = load_digits()
X = digits.data  # The 64-dimensional data points
y = digits.target  # The labels (0-9)

# Step 2: Standardize the data (important for t-SNE)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Apply t-SNE to reduce the data to 2 dimensions
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# Step 4: Plot the data in 2D using matplotlib
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10')
plt.colorbar(scatter, ticks=range(10))  # Add colorbar to show digit labels
plt.title("t-SNE visualization of the scikit-learn digits dataset")
plt.show()
```

  

- The resulting plot shows a 2D representation of the data in which similar digits cluster together.
- It is a visual summary of the relationships between the digits in the original high-dimensional space, flattened to 2D for easy interpretation.

# Task 1 of t-SNE: The init

```python
#!/usr/bin/env python3
"""
Initializes all variables required to calculate the P affinities in t-SNE
"""


import numpy as np


def P_init(X, perplexity):
    """
    Initializes all variables required to calculate the P affinities in t-SNE

    Parameters:
        X is a numpy.ndarray of shape (n, d) containing the dataset
        to be transformed by t-SNE
            -> n is the number of data points
            -> d is the number of dimensions in each point

        perplexity is the perplexity that all Gaussian distributions have

    Returns:
        (D, P, betas, H)
            -> D: a numpy.ndarray of shape (n, n) containing the squared
            pairwise distance between every two data points
                * The diagonal of D should be 0s

            -> P: a numpy.ndarray of shape (n, n) initialized to all 0's that
            will contain the P affinities

            -> betas: a numpy.ndarray of shape (n, 1) initialized to all 1's
            that will contain all of the beta values

            -> H: the Shannon entropy (base 2) for the given perplexity
    """
    n, d = X.shape

    # ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2 * (xi . xj), for all pairs at once
    X_sum = np.sum(np.square(X), axis=1)
    D = np.add(np.add(-2 * np.dot(X, X.T), X_sum).T, X_sum)
    np.fill_diagonal(D, 0)

    # Placeholders that later steps fill in
    P = np.zeros((n, n))
    betas = np.ones((n, 1))

    # perplexity = 2 ** H, so the target entropy is log2(perplexity)
    H = np.log2(perplexity)

    return D, P, betas, H
```
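
To see how these initial values get used, here is a minimal sketch of computing one point's affinities and their Shannon entropy for a given beta. It assumes beta is the Gaussian precision, `1 / (2 * sigma**2)`, and that the distances passed in exclude the point itself. `row_entropy` is a hypothetical helper for illustration, not part of the task code above.

```python
import numpy as np


def row_entropy(Di, beta):
    """
    Di: squared distances from one point to all the OTHER points, shape (n - 1,)
    beta: precision of that point's Gaussian (assumed to be 1 / (2 * sigma**2))

    Returns (Hi, Pi): the Shannon entropy in bits and the normalized affinities.
    """
    Pi = np.exp(-Di * beta)
    Pi = Pi / np.sum(Pi)
    Hi = -np.sum(Pi * np.log2(Pi))
    return Hi, Pi


# Four neighbors that are all equally far away give a uniform distribution
Di = np.array([1.0, 1.0, 1.0, 1.0])
Hi, Pi = row_entropy(Di, beta=1.0)
print(Hi, 2 ** Hi)  # entropy of 2 bits, i.e. a perplexity of 4
```

A later step would typically search over beta for each point until `Hi` matches the target `H` returned by `P_init`.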
    

```python
# main file


pca = __import__('1-pca').pca

X = np.loadtxt("data/mnist2500_X.txt")
X = pca(X, 50)
D, P, betas, H = P_init(X, 30.0)
print('X:', X.shape)
print(X)
print('D:', D.shape)
print(D.round(2))
print('P:', P.shape)
print(P)
print('betas:', betas.shape)
print(betas)
print('H:', H)
```

Running this cell currently fails while loading the data file:

```
ValueError: the number of columns changed from 784 to 535 at row 669; use `usecols` to select a subset and avoid this error
```
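
The error means at least one row of the text file has fewer values than the rest. One plausible cause (an assumption, not confirmed here) is a truncated or partially downloaded data file. A small standard-library sketch for finding such rows is below. `find_ragged_rows` is a hypothetical helper, and the demo runs on a temporary file instead of the real dataset.

```python
import os
import tempfile


def find_ragged_rows(path):
    """
    Returns (expected, bad):
        expected: column count of the first non-empty row
        bad: list of (row_number, column_count) for rows that differ,
             with row numbers counted from 1
    """
    expected = None
    bad = []
    with open(path) as f:
        for row_number, line in enumerate(f, start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if expected is None:
                expected = len(fields)
            elif len(fields) != expected:
                bad.append((row_number, len(fields)))
    return expected, bad


# Demo on a small temporary file standing in for the real dataset
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1 2 3\n4 5 6\n7 8\n")
    demo_path = tmp.name

expected, bad = find_ragged_rows(demo_path)
os.remove(demo_path)
print(expected, bad)  # 3 [(3, 2)]
```

Pointing the same function at `data/mnist2500_X.txt` should list the short rows. If the file turns out to be cut off, re-fetching it is likely a better fix than `usecols`, which would silently drop columns.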