# XAI CODE DEMO

## Explainable AI Specialization on Coursera

# Visualizing Embedding (Latent) Space 🔎

If you experience high latency while running this notebook, you can open it in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/explainable-machine-learning/explainable-ml/blob/main/embedding_visualization.ipynb)

* How do we go about visualizing the latent space?
* With so many dimensions, can we make any meaningful interpretations of the latent space?

#### Dimensionality Reduction & Visualization:
* PCA
* t-SNE
* UMAP


In [None]:
!pip install gensim==4.3.2 matplotlib==3.7.1 scikit-learn==1.2.2 umap-learn==0.5.6 plotly==5.15.0

In [None]:
# Basic
import os
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

For this demo, we will use **GloVe embeddings**, a traditional NLP word-embedding model. This particular version has a vector length of only 50 (for comparison, OpenAI's `text-embedding-ada-002` embedding model has a vector length of 1536!)

**GloVe** is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

______________
The pretrained GloVe model can be downloaded and used to embed words outside of the Coursera environment. We have already done this for you, so the code block below simply loads the saved words and embeddings into this notebook.

In [None]:
def load_words_and_embeddings(words_file='words.npy', embeddings_file='embeddings.npy'):
    """Load the saved words and their embedding matrix from .npy files."""
    # Load words (allow_pickle=True lets NumPy read arrays that were saved with dtype=object)
    loaded_words = np.load(words_file, allow_pickle=True)

    # Load embeddings (one row per word)
    loaded_embeddings = np.load(embeddings_file)

    print(f"Loaded words: {len(loaded_words)}")
    print(f"Loaded embeddings shape: {loaded_embeddings.shape}")

    return loaded_words, loaded_embeddings

words, embeddings = load_words_and_embeddings()
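
Before reducing dimensions, it can help to sanity-check the loaded vectors directly. The sketch below ranks words by cosine similarity to a query word. The function name `nearest_neighbors` and the tiny 3-dimensional toy vectors are made up for illustration; in this notebook you would pass the loaded `words` and `embeddings` instead.

```python
import numpy as np

def nearest_neighbors(query, words, embeddings, k=3):
    """Return the k words most similar to `query` by cosine similarity (hypothetical helper)."""
    words = np.asarray(words)
    embeddings = np.asarray(embeddings, dtype=float)
    idx = int(np.where(words == query)[0][0])          # row of the query word
    # Normalize rows to unit length so a dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    order = np.argsort(-sims)                          # most similar first
    order = order[order != idx][:k]                    # drop the query itself
    return [(str(words[i]), float(sims[i])) for i in order]

# Toy data (invented values, NOT real GloVe vectors)
toy_words = np.array(["king", "queen", "apple", "banana"], dtype=object)
toy_embeddings = np.array([
    [0.90, 0.80, 0.10],
    [0.85, 0.90, 0.15],
    [0.10, 0.20, 0.90],
    [0.05, 0.10, 0.95],
])

print(nearest_neighbors("king", toy_words, toy_embeddings, k=2))
```

If related words do not show up as neighbors in the original space, no 2-D projection is likely to place them together either.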

## Principal Components Analysis (PCA)

* Focuses on capturing global linear structure: the directions along which the data vary the most
* Use to: simplify the data and find global linear relationships and patterns

#### How does PCA work?

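1. Standardize the Data: Scale the data so each feature has a mean of 0 and standard deviation of 1 (note that scikit-learn's `PCA` only centers the data, so apply `StandardScaler` first if you also want unit variance)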
2. Compute the Covariance Matrix: Calculate the covariance matrix to understand how features vary together
3. Compute Eigenvalues and Eigenvectors: Derive the eigenvalues and eigenvectors from the covariance matrix. Eigenvectors represent principal components, and eigenvalues indicate their significance
4. Sort Eigenvalues and Eigenvectors: Order them by descending eigenvalues to prioritize the most significant components
5. Select Principal Components: Choose the top 𝑘 eigenvectors corresponding to the largest eigenvalues
6. Transform the Data: Project the original data onto the selected principal components to reduce dimensions
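
The six steps above can be sketched in plain NumPy. This is a minimal illustration rather than what scikit-learn runs internally; the function name `pca_from_scratch` and the random demo data are assumptions made for the example.

```python
import numpy as np

def pca_from_scratch(X, k=2, standardize=True):
    """Minimal PCA following the six steps listed above (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    # 1. Standardize: center each feature, optionally scale to unit variance
    Xc = X - X.mean(axis=0)
    if standardize:
        std = Xc.std(axis=0)
        std[std == 0] = 1.0                      # avoid dividing by zero for constant features
        Xc = Xc / std
    # 2. Covariance matrix of the features
    cov = np.cov(Xc, rowvar=False)
    # 3. Eigen-decomposition (eigh is meant for symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort by descending eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. Keep the top-k eigenvectors as principal components
    components = eigvecs[:, :k]
    # 6. Project the data onto those components
    projected = Xc @ components
    explained_ratio = eigvals[:k] / eigvals.sum()
    return projected, components, explained_ratio

# Random demo data standing in for real embeddings (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
Z, W, ratio = pca_from_scratch(X_demo, k=2)
print(Z.shape, ratio)
```

Because eigenvectors are only defined up to sign, the projected coordinates may be mirrored relative to another PCA implementation; the geometry is the same.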

#### Implementation in Python
Need to set:
* `n_components` - The number of dimensions in the embedded space

In [None]:
# Apply PCA
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(embeddings)

# Plot PCA results using Plotly for interactivity
fig_pca = px.scatter(
    embeddings_pca, x=0, y=1,
    text=words,
    title="PCA of GloVe Embeddings",
    labels={'0': 'Principal Component 1', '1': 'Principal Component 2'}
)
fig_pca.update_traces(marker=dict(size=8))
fig_pca.show()
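
A 2-D PCA plot shows only part of the variation in the original vectors, so it is worth checking how much. The sketch below uses scikit-learn's `explained_variance_ratio_` attribute on random stand-in data; `demo_embeddings` is an assumption for illustration, and in this notebook you would fit on `embeddings` instead.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for `embeddings`: 200 random 50-dimensional vectors (illustrative assumption)
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(200, 50))

# Fit PCA with all components to see how variance is distributed
pca_full = PCA().fit(demo_embeddings)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Share of total variance captured by the first two components
two_component_share = float(cumulative[1])

# Smallest number of components whose cumulative share reaches 90%
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print(f"Variance captured by 2 components: {two_component_share:.1%}")
print(f"Components needed for 90% of the variance: {n_for_90}")
```

A low two-component share is a hint to read the scatter plot cautiously: points that look close in 2-D may be far apart along the discarded directions.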


## t-distributed Stochastic Neighbor Embedding (t-SNE)

* Constructs a lower-dimensional representation where similar data points are placed closer together
* Use to: Emphasize visualization, reveal local patterns and clusters


#### How does t-SNE work?

1. Compute Pairwise Similarities: Measure how similar each pair of data points is in the high-dimensional space using a Gaussian kernel
2. Initialize Embeddings: Start with an initial low-dimensional embedding for each data point, either random or derived from PCA (the default in the scikit-learn version installed above)
3. Compute Similarities in Low-Dimensional Space: Measure similarities between low-dimensional embeddings using a Student's t-distribution
4. Optimize Embeddings: Adjust the embeddings to minimize the difference between the distributions of similarities in high-dimensional and low-dimensional spaces
5. Reduce Dimensionality: Obtain a reduced-dimensional representation of the data, preserving local relationships between data points
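
The quantities in steps 1, 3, and 4 can be sketched in NumPy. This is a simplified illustration: it uses one fixed Gaussian bandwidth `sigma` for every point (real t-SNE calibrates a bandwidth per point from the perplexity) and it only evaluates the objective rather than optimizing it. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)                    # clip tiny negatives from rounding

def high_dim_affinities(X, sigma=1.0):
    """Step 1: Gaussian similarities, symmetrized into a joint distribution P."""
    D = pairwise_sq_dists(X)
    P = np.exp(-D / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P = P / P.sum(axis=1, keepdims=True)         # conditional p(j | i)
    return (P + P.T) / (2 * len(X))              # symmetric, sums to 1

def low_dim_affinities(Y):
    """Step 3: Student's t (one degree of freedom) similarities Q."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Step 4: the mismatch between P and Q that t-SNE minimizes."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# Two synthetic clusters in 5-D (illustrative data)
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(20, 5))
cluster_b = rng.normal(loc=3.0, scale=0.5, size=(20, 5))
X_demo = np.vstack([cluster_a, cluster_b])
P = high_dim_affinities(X_demo)

Y_structured = X_demo[:, :2]                     # layout that keeps the clusters apart
Y_random = rng.normal(size=(40, 2))              # layout that ignores the clusters
kl_structured = kl_divergence(P, low_dim_affinities(Y_structured))
kl_random = kl_divergence(P, low_dim_affinities(Y_random))
print(kl_structured, kl_random)
```

The layout that keeps neighbors together scores a lower divergence, which is the direction the optimizer pushes an embedding during step 4.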

#### Implementation in Python
Need to set:
* `n_components` - The number of dimensions in the embedded space
* `perplexity` - a hyperparameter that balances the attention given to local versus global aspects of the data, and can be read as roughly the number of effective neighbors each point has. Higher values treat more points as neighbors of each other, which tends to give a more global view of the data. It must be smaller than the number of samples.
* `n_iter` - the maximum number of optimization iterations. More iterations can improve convergence and the quality of the embedding, at the cost of computation time. (Newer scikit-learn releases rename this parameter to `max_iter`.)
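
To make `perplexity` more concrete, the sketch below computes it for a single point as 2 raised to the entropy of that point's neighbor distribution, then searches for the Gaussian bandwidth that hits a target value. The function names, search bounds, and random data are assumptions for illustration; scikit-learn performs a similar search internally, but this is not its code.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """Neighbor distribution p(j | i) for one point, given squared distances to the others."""
    logits = -sq_dists_i / (2 * sigma ** 2)
    logits -= logits.max()                       # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    p = p[p > 0]
    return float(2.0 ** (-np.sum(p * np.log2(p))))

def sigma_for_perplexity(sq_dists_i, target, lo=1e-3, hi=1e3, n_steps=60):
    """Bisection in log space: a wider Gaussian gives a higher perplexity."""
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists_i, mid)) > target:
            hi = mid
        else:
            lo = mid
    return float(np.sqrt(lo * hi))

# Random points; distances are measured from the first point to the other 50
rng = np.random.default_rng(0)
points = rng.normal(size=(51, 10))
sq_dists_0 = np.sum((points[1:] - points[0]) ** 2, axis=1)

sigma_5 = sigma_for_perplexity(sq_dists_0, target=5)
sigma_30 = sigma_for_perplexity(sq_dists_0, target=30)
print(sigma_5, sigma_30)
```

A larger target perplexity needs a wider Gaussian, so each point spreads its attention over more neighbors. That is why higher values give a more global picture.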

In [None]:
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, n_iter=300, random_state=42)
embeddings_tsne = tsne.fit_transform(embeddings)

# Plot t-SNE results using Plotly
fig_tsne = px.scatter(
    embeddings_tsne, x=0, y=1,
    text=words,
    title="t-SNE of GloVe Embeddings",
    labels={'0': 'Component 1', '1': 'Component 2'}
)
fig_tsne.update_traces(marker=dict(size=8))
fig_tsne.show()


## Uniform Manifold Approximation and Projection (UMAP)

* Uses manifold learning (nonlinear dimensionality reduction) to understand the underlying structure or shape of the data
* Focuses on capturing complex, nonlinear relationships in the data
* Use to: preserve local structure and handle complex, nonlinear relationships



#### How does UMAP work?
1. Construct Local Neighborhoods: Define a local neighborhood for each data point in the high-dimensional space from its nearest neighbors
2. Build a Fuzzy Graph: Represent these neighborhoods as a weighted graph (a fuzzy simplicial set) whose edge weights encode how strongly pairs of points are connected
3. Optimize Low-Dimensional Embeddings: Minimize the discrepancy between the high-dimensional graph and its low-dimensional counterpart using stochastic gradient descent
4. Reduce Dimensionality: Obtain a lower-dimensional representation of the data that preserves local relationships and, to a degree, global ones
5. Visualize: Plot the reduced representation to explore complex relationships and structures in the data
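
The graph-construction part of UMAP can be sketched in NumPy. This is a simplified illustration based on the published description of the algorithm, not the umap-learn code: it uses brute-force distances, and the function name `fuzzy_knn_graph` and the search bounds are assumptions. Each point's nearest neighbor gets weight 1, weights decay with distance beyond that, and the two directed weights for a pair are merged with a fuzzy union.

```python
import numpy as np

def fuzzy_knn_graph(X, n_neighbors=5, n_steps=64):
    """Symmetric fuzzy neighbor graph in the spirit of UMAP (illustrative sketch)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor
    knn_idx = np.argsort(dists, axis=1)[:, :n_neighbors]
    knn_dist = np.take_along_axis(dists, knn_idx, axis=1)

    target = np.log2(n_neighbors)                # desired total weight per point
    W = np.zeros((n, n))
    for i in range(n):
        # Distance beyond the nearest neighbor (the nearest one gets weight 1)
        gaps = np.maximum(knn_dist[i] - knn_dist[i, 0], 0.0)
        lo, hi = 1e-6, 1e6
        for _ in range(n_steps):                 # bisection for the local scale
            sigma = np.sqrt(lo * hi)
            if np.exp(-gaps / sigma).sum() > target:
                hi = sigma
            else:
                lo = sigma
        W[i, knn_idx[i]] = np.exp(-gaps / np.sqrt(lo * hi))

    # Fuzzy union of the two directed edges: a + b - a*b
    return W + W.T - W * W.T

# Random demo data standing in for embeddings (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 8))
graph = fuzzy_knn_graph(X_demo, n_neighbors=5)
print(graph.shape, int((graph > 0).sum()))
```

The optimization step then looks for low-dimensional coordinates whose own similarity graph matches this one as closely as possible.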


#### Implementation in Python
Need to set:
* `n_components` - The number of dimensions in the embedded space
* `n_neighbors` - determines the number of neighboring points used in the construction of the high-dimensional fuzzy topological representation of the data. It controls the local connectivity structure in the high-dimensional space. Higher values result in a more global view of the data, while lower values emphasize local structure
* `min_dist` - controls the minimum distance between embedded points in the low-dimensional representation. It acts as a regularization parameter preventing points from being too close to each other in the embedding space. Larger values of min_dist result in a more spread-out embedding, while smaller values allow points to be closer together
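
One way to picture `min_dist` is through the target similarity curve in the low-dimensional space: pairs closer than `min_dist` count as fully similar, and similarity decays exponentially beyond that. To our understanding, UMAP fits a smooth curve of the form 1 / (1 + a * d^(2b)) to this target. The sketch below only evaluates the target itself; the function name `target_similarity` is hypothetical, and `spread` is assumed to take UMAP's default of 1.0.

```python
import numpy as np

def target_similarity(d, min_dist=0.1, spread=1.0):
    """Target low-dimensional similarity as a function of distance d (illustrative)."""
    d = np.asarray(d, dtype=float)
    # Flat at 1 up to min_dist, then exponential decay controlled by spread
    return np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))

distances = np.array([0.05, 0.3, 1.0, 2.0])
for md in (0.1, 0.5):
    print(f"min_dist={md}:", np.round(target_similarity(distances, min_dist=md), 3))
```

With a larger `min_dist`, points can sit farther apart and still count as fully similar, so the optimizer has less reason to pack them tightly. This matches the more spread-out embeddings described above.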

In [None]:
# This line hides a low-level deprecation warning. This warning is within the UMAP library itself.
os.environ['KMP_WARNINGS'] = '0'

# Apply UMAP
umap_model = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
embeddings_umap = umap_model.fit_transform(embeddings)

# Plot UMAP results using Plotly
fig_umap = px.scatter(
    embeddings_umap, x=0, y=1,
    text=words,
    title="UMAP of GloVe Embeddings",
    labels={'0': 'Component 1', '1': 'Component 2'}
)
fig_umap.update_traces(marker=dict(size=8))
fig_umap.show()
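
Since the three plots can look quite different, a numeric check of how well each projection preserves local neighborhoods may help when interpreting them. The sketch below uses scikit-learn's `trustworthiness` score (between 0 and 1, higher is better) on synthetic clustered data, comparing a PCA projection with a random layout. `demo_embeddings` and the random baseline are assumptions for illustration; in this notebook you could pass `embeddings` together with `embeddings_pca`, `embeddings_tsne`, or `embeddings_umap`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

# Synthetic stand-in for `embeddings`: three clusters in 20 dimensions (illustrative)
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 20))
demo_embeddings = np.vstack([c + rng.normal(size=(40, 20)) for c in centers])

# A structured projection and a random baseline that ignores the data
projection_pca = PCA(n_components=2).fit_transform(demo_embeddings)
projection_random = rng.normal(size=(len(demo_embeddings), 2))

# Score how many 2-D neighbors are also neighbors in the original space
score_pca = trustworthiness(demo_embeddings, projection_pca, n_neighbors=5)
score_random = trustworthiness(demo_embeddings, projection_random, n_neighbors=5)

print(f"PCA projection:    {score_pca:.3f}")
print(f"Random projection: {score_random:.3f}")
```

The score summarizes only local neighborhood preservation, so it complements a visual inspection of the plots and does not replace it.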
