<a href="https://colab.research.google.com/github/Henkin2th/AIPI-590.05-Assignement/blob/main/AIPI590_XAI_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring the Embedding Space of GIST Embedding v0 Using Dimensionality Reduction

### Overview
In this notebook, we explore the embedding space of GIST Embedding v0, a fine-tuned text embedding model optimized for retrieval and classification tasks. To visualize its high-dimensional embeddings in two dimensions, we apply three dimensionality reduction techniques: **Principal Component Analysis (PCA)**, **t-distributed Stochastic Neighbor Embedding (t-SNE)**, and **Uniform Manifold Approximation and Projection (UMAP)**. Each technique preserves different aspects of the original structure, so comparing the three plots shows how words relate to one another within the GIST model's embedding space.


In [None]:
!pip install gensim==4.3.2 matplotlib==3.7.1 scikit-learn==1.2.2 umap-learn==0.5.6 plotly==5.15.0

Collecting gensim==4.3.2
  Downloading gensim-4.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.4 kB)
Collecting scikit-learn==1.2.2
  Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting umap-learn==0.5.6
  Downloading umap_learn-0.5.6-py3-none-any.whl.metadata (21 kB)
Collecting plotly==5.15.0
  Downloading plotly-5.15.0-py2.py3-none-any.whl.metadata (7.0 kB)
Collecting pynndescent>=0.5 (from umap-learn==0.5.6)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading gensim-4.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.5 MB)
Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)

In [None]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.2.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.2.1-py3-none-any.whl (255 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.2.1


# Load the Model and Prepare the Dataset

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("avsolatorio/GIST-Embedding-v0")

In [None]:
# AI-generated word list: 110 entries in 11 themed rows of 10 ("library" appears twice)
texts = [
    "sunset", "mountain", "river", "city", "forest", "ocean", "sky", "garden", "flower", "island",
    "book", "novel", "poem", "story", "author", "library", "chapter", "page", "poetry", "literature",
    "painting", "sculpture", "museum", "canvas", "brush", "gallery", "artist", "portrait", "exhibit", "frame",
    "guitar", "piano", "violin", "concert", "symphony", "melody", "rhythm", "lyric", "harmony", "band",
    "friend", "family", "sibling", "parent", "child", "cousin", "uncle", "aunt", "grandparent", "neighbor",
    "hiking", "running", "swimming", "cycling", "tennis", "basketball", "soccer", "volleyball", "yoga", "dance",
    "computer", "phone", "tablet", "camera", "keyboard", "mouse", "screen", "monitor", "printer", "speaker",
    "bread", "cheese", "apple", "banana", "orange", "strawberry", "grape", "carrot", "potato", "lettuce",
    "school", "university", "teacher", "student", "classroom", "homework", "exam", "textbook", "lecture", "library",
    "car", "bus", "train", "bicycle", "airplane", "boat", "subway", "taxi", "scooter", "motorcycle",
    "joy", "sadness", "anger", "love", "fear", "hope", "relief", "surprise", "excitement", "calm"
]
# encode() returns a NumPy array of shape (len(texts), embedding_dim)
embeddings = model.encode(texts, convert_to_numpy=True)

### Dataset Preparation
We use a list of 110 single words arranged in 11 themes of 10 words each: nature, literature, art, music, family, sports, technology, food, education, transport, and emotions. Because "library" appears in both the literature and education rows, 109 of the words are unique. Encoding the list with the GIST model yields one embedding vector per word, which we then project to two dimensions to explore how these concepts relate.
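
Relationships between words can also be checked directly in the original embedding space before any projection. The following is a minimal sketch, assuming the embeddings are available as an `(n, d)` NumPy array; `nearest_neighbors` is a hypothetical helper, and the toy vectors are illustrative stand-ins rather than real GIST embeddings.

```python
import numpy as np

def nearest_neighbors(embeddings, texts, query, k=3):
    """Return the k words most cosine-similar to `query` (hypothetical helper)."""
    # L2-normalize rows so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    idx = texts.index(query)
    sims = unit @ unit[idx]
    # Sort by descending similarity and skip the query word itself.
    order = [i for i in np.argsort(-sims) if i != idx][:k]
    return [(texts[i], float(sims[i])) for i in order]

# Toy 3-D vectors standing in for real embeddings (illustrative values only).
toy_texts = ["apple", "banana", "car", "bus"]
toy_embeddings = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])
neighbors = nearest_neighbors(toy_embeddings, toy_texts, "apple", k=2)
print(neighbors)
```

With the notebook's variables, a call such as `nearest_neighbors(embeddings, texts, "guitar")` would list the words closest to "guitar" in the full-dimensional space, which gives a reference point for judging the 2-D plots.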


In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(embeddings)

In [None]:
import plotly.express as px
fig_pca = px.scatter(embeddings_pca, x=0, y=1, text=texts,
                     title="PCA of GIST Embeddings",
                     labels={'0': 'Principal Component 1', '1': 'Principal Component 2'})
fig_pca.show()

### PCA Visualization Insights

The PCA visualization of the GIST embeddings reveals broad thematic groupings:

1. **Clusters of Similar Concepts**:
   - **Food-related words** such as "banana," "carrot," "potato," "grape," and "strawberry" form a distinct cluster, showing that PCA captures some semantic relationships.
   - **Family-related terms** including "parent," "sibling," "cousin," "grandparent," and "child" are also grouped closely.

2. **Thematic Areas**:
   - **Nature and Activities**: Words like "mountain," "river," "ocean," and "forest" appear close to recreational terms like "hiking" and "swimming."
   - **Arts and Literature**: Words such as "novel," "poetry," "museum," "sculpture," and "artist" form a grouping related to creative and cultural themes.

3. **Global Patterns and Observations**:
   - Because PCA is a linear projection onto the two directions of greatest variance, this plot captures broad patterns but misses detailed local relationships that nonlinear methods like t-SNE and UMAP can reveal. It still provides a useful overview of how words are generally organized within the embedding space.

Overall, PCA highlights global themes within the GIST embedding space, making it suitable for observing large-scale patterns but less effective for detailed clusters or highly localized relationships.
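
How much of the embedding space a 2-D PCA plot actually shows can be quantified by the fraction of variance the two components retain. The sketch below computes that fraction with an SVD on synthetic data; the array `X` and the helper name are assumptions for illustration. In the notebook itself, the fitted scikit-learn object exposes the same quantity as `pca.explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(X, n_components=2):
    """Fraction of total variance captured by each of the top principal components."""
    # PCA operates on mean-centered data.
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered matrix give per-component variance.
    s = np.linalg.svd(Xc, full_matrices=False, compute_uv=False)
    variances = s ** 2 / (X.shape[0] - 1)
    return variances[:n_components] / variances.sum()

rng = np.random.default_rng(0)
# Synthetic stand-in for the embedding matrix: 110 points in 32 dimensions,
# with two directions stretched so they dominate the variance.
X = rng.normal(size=(110, 32))
X[:, 0] *= 5.0
X[:, 1] *= 3.0
ratio = explained_variance_ratio(X)
print(ratio, ratio.sum())
```

If the two ratios sum to a small value for the real embeddings, the PCA plot should be read as a coarse summary, since most of the variance lies in directions it does not display.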


In [None]:
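from sklearn.manifold import TSNE
# perplexity must be smaller than the number of words; 10 suits this small dataset.
# n_iter is valid for the pinned scikit-learn 1.2.2 (renamed max_iter in 1.5).
tsne = TSNE(n_components=2, perplexity=10, n_iter=300, random_state=42)
embeddings_tsne = tsne.fit_transform(embeddings)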

In [None]:

fig_tsne = px.scatter(embeddings_tsne, x=0, y=1, text=texts,
                      title="t-SNE of GIST Embeddings",
                      labels={'0': 'Component 1', '1': 'Component 2'})
fig_tsne.show()

### t-SNE Visualization Insights

The t-SNE plot reveals detailed local clusters within the GIST embeddings, highlighting semantically similar words grouped by theme.

1. **Clusters of Related Concepts**:
   - **Education terms** like "school," "student," and "homework" form a distinct cluster.
   - **Family terms** such as "parent," "child," and "sibling" are grouped closely, reflecting a cohesive family theme.

2. **Thematic Areas**:
   - **Emotions**: Words like "fear," "joy," and "relief" are clustered, showing a related group.
   - **Art and Culture**: Terms like "museum," "artist," and "gallery" appear together, highlighting cultural concepts.
   - **Nature and Activities**: Nature-related words like "ocean" and "river" are near recreational terms like "hiking" and "swimming."

3. **Local Structure**:
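   - t-SNE optimizes for preserving local neighborhoods, so it reveals tight clusters and is well suited to observing specific relationships such as family, education, and emotions. Distances between clusters and apparent cluster sizes are less reliable, and the layout depends on `perplexity` and the random seed.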

Overall, t-SNE effectively showcases detailed semantic clusters, capturing nuanced relationships within the GIST embedding space.
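
The claim that a projection preserves local structure can be checked numerically by comparing each point's nearest neighbors before and after reduction. The following is a minimal sketch of such a check, assuming Euclidean distance in both spaces; `knn_indices` and `neighbor_overlap` are hypothetical helpers, and the random data stands in for the real embeddings. scikit-learn also provides `sklearn.manifold.trustworthiness` as a related, more established measure.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors (Euclidean) of every row, excluding itself."""
    # Pairwise squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def neighbor_overlap(high, low, k=5):
    """Mean fraction of each point's k nearest neighbors shared by both spaces."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
high = rng.normal(size=(50, 10))                      # synthetic high-dimensional points
identical = neighbor_overlap(high, high, k=5)         # same space: full overlap
truncated = neighbor_overlap(high, high[:, :2], k=5)  # crude 2-D view: partial overlap
print(identical, truncated)
```

Applied to the notebook's arrays, `neighbor_overlap(embeddings, embeddings_tsne)` would give a single score that can be compared against the same score for the PCA and UMAP projections.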


In [None]:
import umap
umap_model = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
embeddings_umap = umap_model.fit_transform(embeddings)


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



In [None]:
fig_umap = px.scatter(embeddings_umap, x=0, y=1, text=texts,
                      title="UMAP of GIST Embeddings",
                      labels={'0': 'Component 1', '1': 'Component 2'})
fig_umap.show()

### UMAP Visualization Insights

The UMAP visualization of GIST embeddings balances local and global structures, capturing broad themes and detailed clusters.

1. **Clusters of Related Concepts**:
   - **Technology terms** like "computer," "keyboard," and "tablet" form a clear cluster, showing semantic relationships within tech.
   - **Family terms** like "parent," "child," and "sibling" are grouped, reflecting the model’s organization of family-related concepts.

2. **Thematic Areas**:
   - **Arts and Culture**: Terms like "museum," "artist," and "sculpture" cluster together, showing a cultural theme.
   - **Emotions**: Words like "joy," "anger," and "relief" are grouped, representing emotional vocabulary.
   - **Nature and Activities**: Nature words like "mountain" and "river" appear near activities like "hiking" and "swimming," linking outdoor environments and recreation.

3. **Local and Global Structure**:
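   - UMAP captures both detailed clusters and broader themes, giving a more balanced view of semantic relationships than PCA or t-SNE in this example. The `n_neighbors` parameter (15 here) controls that trade-off: smaller values emphasize local detail, while larger values emphasize the global layout.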

Overall, UMAP effectively displays detailed clusters and broad themes, highlighting both specific relationships and general structure within the embedding space.
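
Global structure can be assessed in a similar spirit by asking whether points that are far apart in the original space stay far apart in the projection. The sketch below, under the assumption that Euclidean distance is an acceptable proxy, correlates all pairwise distances before and after reduction; `global_distance_agreement` is a hypothetical helper and the data is synthetic.

```python
import numpy as np

def pairwise_distances(X):
    """Vector of Euclidean distances between all distinct pairs of rows."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(X.shape[0], k=1)  # each pair counted once
    return d[upper]

def global_distance_agreement(high, low):
    """Pearson correlation between pairwise distances in the two spaces."""
    return float(np.corrcoef(pairwise_distances(high), pairwise_distances(low))[0, 1])

rng = np.random.default_rng(0)
high = rng.normal(size=(40, 8))  # synthetic stand-in for the embeddings
# Uniform scaling keeps every distance ratio, so agreement is essentially 1.
uniform_scaling = global_distance_agreement(high, 3.0 * high)
# Shuffling rows breaks the correspondence between points, so agreement drops.
shuffled = global_distance_agreement(high, rng.permutation(high)[:, :2])
print(uniform_scaling, shuffled)
```

A value near 1 for `global_distance_agreement(embeddings, embeddings_umap)` would support the observation that the projection keeps the broad layout, while a low value would suggest that only local neighborhoods are trustworthy.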

### Overall Comparison

- **PCA** is best suited for observing large-scale patterns but lacks detailed clustering.
- **t-SNE** provides in-depth insight into local structures, highlighting specific clusters but sacrificing some global context.
- **UMAP** strikes a balance, preserving both detailed clusters and overall thematic groupings, making it effective for understanding both specific relationships and general structure.

In summary, each method offers a different perspective on the embedding space: PCA shows broad themes, t-SNE reveals fine-grained clusters, and UMAP provides a balanced view that incorporates both local and global structure. Taken together, these views give a fuller picture of how the GIST model organizes words in its embedding space.
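
The three plots are easier to compare when each point is colored by its theme. The word list is written as 11 rows of 10 words, so a per-word label can be built from the row order. This is a small sketch in which the theme names are assumptions chosen for illustration, not labels taken from the model or the original notebook.

```python
# Theme names are illustrative assumptions, listed in the same order as the
# rows of the notebook's word list (11 rows of 10 words each).
theme_names = ["nature", "literature", "art", "music", "family", "sports",
               "technology", "food", "education", "transport", "emotions"]
WORDS_PER_THEME = 10

def theme_labels(names, per_theme):
    """Repeat each theme name once per word in its row."""
    return [name for name in names for _ in range(per_theme)]

labels = theme_labels(theme_names, WORDS_PER_THEME)
print(len(labels), labels[0], labels[-1])
```

Passing `color=labels` to each `px.scatter(...)` call would then color the points by theme, making it easier to see which groupings survive under PCA, t-SNE, and UMAP.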