# Week 4, Day 4: t-SNE and UMAP

## Learning Objectives
- Understand t-SNE and UMAP algorithms
- Learn non-linear dimensionality reduction
- Master visualization techniques
- Compare different dimensionality reduction methods

## Topics Covered
1. t-SNE Algorithm
2. UMAP Algorithm
3. Parameter Tuning
4. Visualization Techniques

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from umap import UMAP
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits, load_breast_cancer

## 1. t-SNE Example

In [None]:
def tsne_example():
    # Load digits dataset
    digits = load_digits()
    X = digits.data
    y = digits.target
    
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Apply t-SNE
    tsne = TSNE(n_components=2, random_state=42)
    X_tsne = tsne.fit_transform(X_scaled)
    
    # Visualize results (a qualitative colormap keeps the 10 digit classes distinct)
    plt.figure(figsize=(10, 6))
    scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=10)
    plt.colorbar(scatter, ticks=range(10), label='Digit')
    plt.title('t-SNE Visualization of Digits Dataset')
    plt.xlabel('t-SNE 1')
    plt.ylabel('t-SNE 2')
    plt.show()
    
    # Print some statistics
    print("Original data shape:", X.shape)
    print("t-SNE embedding shape:", X_tsne.shape)

tsne_example()
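
A scatter plot alone does not say how faithfully an embedding keeps each point's neighbors. The sketch below is one possible way to quantify that, using scikit-learn's `trustworthiness` score (1.0 means every neighbor in the 2D embedding is also a neighbor in the original space) and a PCA projection as a linear baseline. The subsample size of 500 and `n_neighbors=5` are arbitrary choices made to keep the run short, not recommended settings.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

# Subsample the digits so the sketch runs quickly (500 is an arbitrary choice)
digits = load_digits()
rng = np.random.default_rng(42)
idx = rng.choice(len(digits.data), size=500, replace=False)
X_sub = StandardScaler().fit_transform(digits.data[idx])

# Embed the same points with a linear and a non-linear method
embeddings = {
    "PCA": PCA(n_components=2, random_state=42).fit_transform(X_sub),
    "t-SNE": TSNE(n_components=2, random_state=42).fit_transform(X_sub),
}

# Score how well each 2D embedding keeps the 5 nearest neighbors of every point
scores = {
    name: trustworthiness(X_sub, emb, n_neighbors=5)
    for name, emb in embeddings.items()
}
for name, score in scores.items():
    print(f"{name}: trustworthiness = {score:.3f}")
```

The same loop could score a UMAP embedding by adding another entry to the `embeddings` dictionary.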

## 2. UMAP Example

In [None]:
def umap_example():
    # Load breast cancer dataset
    data = load_breast_cancer()
    X = data.data
    y = data.target
    
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Apply UMAP (the estimator is named `reducer` to avoid confusion with the umap package)
    reducer = UMAP(n_components=2, random_state=42)
    X_umap = reducer.fit_transform(X_scaled)
    
    # Visualize results
    plt.figure(figsize=(10, 6))
    scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis')
    plt.colorbar(scatter)
    plt.title('UMAP Visualization of Breast Cancer Dataset')
    plt.xlabel('UMAP 1')
    plt.ylabel('UMAP 2')
    plt.show()
    
    # Print some statistics
    print("Original data shape:", X.shape)
    print("UMAP embedding shape:", X_umap.shape)

umap_example()

## 3. Comparing Methods

In [None]:
def compare_methods():
    # Generate synthetic data
    np.random.seed(42)
    n_samples = 1000
    
    # Create Swiss roll dataset
    t = 1.5 * np.pi * (1 + 2 * np.random.rand(n_samples))
    x = t * np.cos(t)
    y = 21 * np.random.rand(n_samples)
    z = t * np.sin(t)
    
    X = np.column_stack((x, y, z))
    color = t
    
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Apply different methods (fixed seeds make the stochastic embeddings repeatable)
    tsne = TSNE(n_components=2, random_state=42)
    reducer = UMAP(n_components=2, random_state=42)
    
    X_tsne = tsne.fit_transform(X_scaled)
    X_umap = reducer.fit_transform(X_scaled)
    
    # Visualize results
    plt.figure(figsize=(15, 5))
    
    # Original 3D data
    ax = plt.subplot(131, projection='3d')
    scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap='viridis')
    plt.colorbar(scatter)
    plt.title('Original Data')
    
    # t-SNE
    plt.subplot(132)
    scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=color, cmap='viridis')
    plt.colorbar(scatter)
    plt.title('t-SNE')
    
    # UMAP
    plt.subplot(133)
    scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=color, cmap='viridis')
    plt.colorbar(scatter)
    plt.title('UMAP')
    
    plt.tight_layout()
    plt.show()

compare_methods()
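
The hand-written Swiss roll above uses the same parametrisation as scikit-learn's `make_swiss_roll`, so the built-in generator can serve as a shorter alternative. The sketch below assumes no noise is wanted (`noise=0.0`). The names `X_roll` and `t_roll` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# Returns the 3D points and the position along the roll, which can be used for coloring
X_roll, t_roll = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

print("Points:", X_roll.shape)
print("Position along the roll:", t_roll.shape)

# Without noise, x and z trace the spiral t*cos(t), t*sin(t); y is the roll's width
print("x matches t*cos(t):", np.allclose(X_roll[:, 0], t_roll * np.cos(t_roll)))
print("z matches t*sin(t):", np.allclose(X_roll[:, 2], t_roll * np.sin(t_roll)))
```

Passing `t_roll` as the `c=` argument of the scatter plots plays the same role as `color` in `compare_methods`.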

## Practical Exercises

In [None]:
# Exercise 1: Parameter Tuning

def parameter_tuning_exercise():
    # Load digits dataset
    digits = load_digits()
    X = digits.data
    y = digits.target
    
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    print("Dataset shape:", X.shape)
    
    # Task: Compare different parameter settings
    # 1. Try different perplexity values for t-SNE
    # 2. Try different n_neighbors for UMAP
    # 3. Visualize and compare results
    # 4. Analyze the impact of parameters
    
    # Your code here

parameter_tuning_exercise()

In [None]:
# Exercise 2: Text Data Visualization

def text_visualization_exercise():
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Sample text data
    texts = [
        "Machine learning is fascinating",
        "Deep learning revolutionizes AI",
        "Python programming is fun",
        "Data science uses machine learning",
        "Programming in Python is easy",
        "AI transforms technology",
        "Learning Python programming",
        "Data analysis with machine learning",
        "Neural networks are powerful",
        "Statistical analysis in Python"
    ]
    
    # Convert text to TF-IDF features
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(texts)
    
    print("Feature matrix shape:", X.shape)
    
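    # Task: Visualize text relationships
    # 1. Apply t-SNE and UMAP
    #    (with only 10 samples, t-SNE's perplexity must be set below 10,
    #     and UMAP's n_neighbors should be lowered from its default of 15)
    # 2. Create meaningful visualizations
    # 3. Compare the methods
    # 4. Analyze text clusters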
    
    # Your code here

text_visualization_exercise()
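
As a possible starting point for Exercise 2, the sketch below runs t-SNE on a handful of TF-IDF vectors. It is not a reference solution. Recent scikit-learn versions require `perplexity` to be smaller than the number of samples, so the default of 30 cannot be used on a corpus this small. The value `perplexity=2`, the choice of `init="random"`, and the six-sentence subset are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# A small subset of the exercise sentences, used here only for illustration
sample_texts = [
    "Machine learning is fascinating",
    "Deep learning revolutionizes AI",
    "Python programming is fun",
    "Data science uses machine learning",
    "Programming in Python is easy",
    "AI transforms technology",
]

# TF-IDF returns a sparse matrix; convert it to a dense array for simplicity
tfidf = TfidfVectorizer(stop_words="english").fit_transform(sample_texts)
X_dense = tfidf.toarray()

# perplexity must stay below the number of samples (6 here)
tsne_small = TSNE(n_components=2, perplexity=2, init="random", random_state=42)
X_text_2d = tsne_small.fit_transform(X_dense)

print("TF-IDF shape:", X_dense.shape)
print("Embedding shape:", X_text_2d.shape)
```

With so few points the layout is very sensitive to the seed and to `perplexity`, so any apparent clusters should be read with caution.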

## MCQ Quiz

1. What is the main advantage of t-SNE over PCA?
   - a) Faster computation
   - b) Linear dimensionality reduction
   - c) Better preservation of local structure
   - d) Simpler implementation

2. What does perplexity control in t-SNE?
   - a) Learning rate
   - b) Effective number of neighbors
   - c) Number of components
   - d) Iteration count

3. Which non-linear method is generally faster on large datasets?
   - a) t-SNE
   - b) UMAP
   - c) PCA
   - d) They are all similar

4. What does UMAP aim to preserve?
   - a) Only local structure
   - b) Only global structure
   - c) Local structure and much of the global structure
   - d) Neither

5. Which parameter most directly controls the balance between local and global structure in UMAP?
   - a) Learning rate
   - b) n_neighbors
   - c) Random state
   - d) Metric

6. What is the time complexity of exact t-SNE (without the Barnes-Hut approximation) in the number of samples n?
   - a) O(n)
   - b) O(n log n)
   - c) O(n²)
   - d) O(n³)

7. When should you use t-SNE?
   - a) Large datasets
   - b) Visualization
   - c) Feature extraction
   - d) Linear reduction

8. What is NOT a limitation of t-SNE?
   - a) Slow computation
   - b) Non-deterministic
   - c) Preserves local neighborhoods
   - d) Requires perplexity tuning

9. Which method is generally considered to preserve global structure better?
   - a) t-SNE
   - b) UMAP
   - c) Both equally
   - d) Neither

10. What is the recommended perplexity range for t-SNE?
    - a) 5-10
    - b) 5-50
    - c) 50-100
    - d) 100-500

Answers: 1-c, 2-b, 3-b, 4-c, 5-b, 6-c, 7-b, 8-c, 9-b, 10-b