# 04. EDA & Visualization

This notebook explores and visualizes the embeddings we have generated. We will:
- Load the final dataset containing both image and text embeddings.
- Use t-SNE to visualize the high-dimensional image embeddings in 2D.
- Color the t-SNE plot by genre to see if the embeddings capture meaningful semantic relationships.
- Use the elbow method and silhouette scores to find an optimal number of clusters for K-means.
- Perform K-means clustering and visualize the results on the t-SNE plot.

## 1. Setup and Data Loading

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

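# Add the project root (the parent of the notebook's working directory) to the import path.
# __file__ is not defined inside a notebook, so resolve relative to the working directory.
sys.path.append(os.path.abspath(".."))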

from src.visualization_utils import (
    compute_tsne,
    plot_tsne,
    elbow_method,
    plot_elbow_method,
    perform_clustering,
    plot_clusters
)

# --- Configuration ---
CATEGORY = "CDs_and_Vinyl"
IMAGE_EMB_FILE = "../clip_img_emb_parent.parquet"
FULL_DATA_FILE = f"../reviews_with_img_text_emb_{CATEGORY}.parquet"

# Load data
print("Loading embedding and full data files...")
emb_df = pd.read_parquet(IMAGE_EMB_FILE)
df = pd.read_parquet(FULL_DATA_FILE)

print(f"Loaded {len(emb_df)} image embeddings.")
print(f"Loaded {len(df)} rows from the full dataset.")

Loading embedding and full data files...
Loaded 4673 image embeddings.
Loaded 200000 rows from the full dataset.


## 2. Prepare Data for Visualization

We'll merge the image embeddings with the main DataFrame to get the genre labels for each item. Then, we'll extract the embeddings and labels for t-SNE.
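
One thing to watch in this merge: the full dataset has far more rows than there are image embeddings, which suggests several rows share a `parent_asin`. If so, a plain merge repeats each embedding once per matching row. The following is a minimal sketch on made-up toy data (the column names follow this notebook; the values are invented) that compares a naive merge with one that first keeps a single metadata row per item.

```python
import pandas as pd

# Toy stand-ins: two items with embeddings, four rows in the "full" table.
emb = pd.DataFrame({
    "parent_asin": ["A", "B"],
    "clip_img_emb": [[0.1, 0.2], [0.3, 0.4]],
})
reviews = pd.DataFrame({
    "parent_asin": ["A", "A", "A", "B"],
    "categories": [["Music", "Rock"]] * 3 + [["Music", "Jazz"]],
})

# Naive merge: item "A" appears three times because it has three rows.
naive = emb.merge(reviews[["parent_asin", "categories"]], on="parent_asin", how="inner")

# Keep one metadata row per item, then let pandas verify the merge is one-to-one.
item_meta = reviews[["parent_asin", "categories"]].drop_duplicates(subset="parent_asin")
deduped = emb.merge(item_meta, on="parent_asin", how="inner", validate="one_to_one")

print(len(naive), len(deduped))  # prints "4 2"
```

Repeated rows would show up in t-SNE and K-means as stacked identical points, so popular items would carry extra weight.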

In [2]:
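# Merge embeddings with categories.
# The full dataset has many more rows than there are embeddings, so keep one
# row per parent_asin to avoid repeating an embedding in the merge.
# (Assumes categories are identical across rows that share a parent_asin.)
emb_with_cat = emb_df.merge(
    df[["parent_asin", "categories"]].drop_duplicates(subset="parent_asin"),
    on="parent_asin",
    how="inner"
)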

# Extract embeddings and labels
X = np.stack(emb_with_cat["clip_img_emb"].values)
# List columns read from parquet arrive as NumPy arrays, so check for None and
# length explicitly; `x and ...` raises ValueError on a multi-element array.
emb_with_cat["genre"] = emb_with_cat["categories"].apply(
    lambda x: x[-1] if (x is not None and len(x) > 0) else "Unknown"
)
labels = emb_with_cat["genre"]

print("Genre counts:", labels.value_counts().to_dict())
print("Embedding matrix shape:", X.shape)
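
The "truth value of an array ... is ambiguous" error comes from evaluating a multi-element NumPy array in a boolean context, which is what `x and len(x) > 0` does when `categories` holds arrays instead of Python lists. Below is a minimal sketch of a more defensive extractor; `last_genre` is a hypothetical helper, not part of `src.visualization_utils`, and it assumes each value is a list, an array, or missing.

```python
import numpy as np

def last_genre(categories, default="Unknown"):
    """Return the last (most specific) category, or `default` if none is available."""
    # None, NaN floats, and other scalars have no usable length.
    if categories is None or not hasattr(categories, "__len__"):
        return default
    if len(categories) == 0:
        return default
    return categories[-1]

# Reproduce the failure mode: bool() on a multi-element array raises ValueError.
arr = np.array(["Music", "Rock"])
try:
    bool(arr)
except ValueError as exc:
    print("bool(array) fails:", exc)

print(last_genre(arr))           # last element of the array
print(last_genre(None))          # falls back to the default
print(last_genre(np.array([])))  # empty array also falls back
```

It can be applied with `emb_with_cat["categories"].apply(last_genre)` in place of the lambda.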

## 3. t-SNE Visualization

Let's compute the 2D t-SNE projection of the image embeddings and plot the result, colored by genre. If covers from the same genre land near each other, that suggests the CLIP embeddings capture genre-related structure in the album cover images.

In [None]:
# Compute t-SNE
print("Computing t-SNE... (this may take a while)")
X_2d = compute_tsne(X, perplexity=30, metric="cosine")

# Plot t-SNE
plot_tsne(X_2d, labels, title="t-SNE of CLIP Image Embeddings (colored by genre)")
plt.show()
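
`compute_tsne` lives in `src.visualization_utils` and its implementation is not shown in this notebook. As a rough sketch of what such a helper might wrap, assuming scikit-learn is installed, the hypothetical `tsne_2d` below calls `sklearn.manifold.TSNE` with the same `perplexity` and `metric` arguments used above. The fixed `seed` and `init="random"` are choices made for this illustration only.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_2d(X, perplexity=30, metric="cosine", seed=42):
    """Project row vectors to 2D with t-SNE (hypothetical stand-in for compute_tsne)."""
    X = np.asarray(X, dtype=np.float32)
    # scikit-learn requires perplexity to be smaller than the number of samples.
    if perplexity >= X.shape[0]:
        raise ValueError("perplexity must be less than the number of samples")
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        metric=metric,
        init="random",
        random_state=seed,  # fixed seed so repeated runs give the same layout
    )
    return tsne.fit_transform(X)

# Small random demo so the sketch runs quickly; real embeddings would replace this.
rng = np.random.default_rng(0)
demo = rng.normal(size=(60, 16))
demo_2d = tsne_2d(demo, perplexity=10)
print(demo_2d.shape)  # (60, 2)
```

t-SNE preserves local neighborhoods more faithfully than global distances, so tight groups in the plot are more informative than the gaps between them.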

## 4. Clustering Analysis

Now, let's perform K-means clustering to identify inherent groupings in the data. We'll first use the elbow method and silhouette scores to determine a good value for K (the number of clusters).

In [None]:
# Find optimal K
print("Performing elbow method to find optimal K...")
elbow_results = elbow_method(X, k_range=range(2, 11))
plot_elbow_method(elbow_results)
plt.show()
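
The return format of `elbow_method` is not shown here. As an illustration of the underlying idea, and assuming scikit-learn is installed, the hypothetical `scan_k` below fits K-means for each candidate K and records the inertia (within-cluster sum of squares, used for the elbow plot) together with the mean silhouette score. The toy blobs are invented for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def scan_k(X, k_range=range(2, 11), seed=42):
    """Fit K-means for each k and record inertia and mean silhouette score."""
    results = {"k": [], "inertia": [], "silhouette": []}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results["k"].append(k)
        results["inertia"].append(float(km.inertia_))
        results["silhouette"].append(float(silhouette_score(X, km.labels_)))
    return results

# Toy data: three well-separated 2D blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scan = scan_k(X_demo, k_range=range(2, 7))
best_k = scan["k"][int(np.argmax(scan["silhouette"]))]
print("k with highest silhouette:", best_k)
```

Inertia tends to keep falling as K grows, so the elbow is read off where the curve flattens, while the silhouette score is read off at its peak. Note also that K-means uses Euclidean distance whereas the t-SNE above uses cosine; L2-normalizing the embeddings first is one way to bring the two closer, since for unit vectors squared Euclidean distance equals 2(1 - cosine similarity).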

We choose K from the elbow plot (the point where the inertia curve flattens) and the silhouette scores (where they peak). Here we use K=6 and visualize the resulting clusters.

In [None]:
# Perform clustering with optimal K
OPTIMAL_K = 6
clusters, kmeans_model = perform_clustering(X, n_clusters=OPTIMAL_K)

# Plot clusters on t-SNE
plot_clusters(X_2d, clusters, title=f"t-SNE of CLIP Embeddings (k={OPTIMAL_K} clusters)")
plt.show()