# BERTopic Model Exploration

Use this notebook to explore the trained BERTopic model, visualize its clusters, and inspect specific reviews.

In [1]:
import sys
import os
from pathlib import Path

# Add src to path
sys.path.append(os.path.abspath('../src'))

import numpy as np
import polars as pl
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP

from voc.storage.factory import get_storage
from voc.storage.types import StorageType

In [2]:
# Configuration
APP_ID = "2393760"

## 1. Load Data

Load the pre-computed sentence embeddings and the original reviews for the configured app.

In [3]:
def load_data(app_id):
    # Load Embeddings
    print(f"Loading embeddings for {app_id}...")
    emb_storage = get_storage(StorageType.PARQUET, app_id, base_path="../data/embeddings")
    emb_data = emb_storage.load()
    
    docs = [item["sentence"] for item in emb_data]
    embeddings = np.array([item["embedding"] for item in emb_data])
    
    print(f"Loaded {len(docs)} sentences and embeddings.")
    
    # Load Original Reviews
    print(f"Loading original reviews for {app_id}...")
    review_storage = get_storage(StorageType.PARQUET, app_id, base_path="../data/reviews")
    reviews_data = review_storage.load()
    reviews_df = pl.DataFrame(reviews_data)
    
    print(f"Loaded {len(reviews_df)} reviews.")
    
    return docs, embeddings, reviews_df

docs, embeddings, reviews_df = load_data(APP_ID)

Loading embeddings for 2393760...
2026-01-31 21:08:32,849 - voc.storage.parquet_store - INFO - Loading 1 parquet files from ../data/embeddings/2393760
Loaded 1375 sentences and embeddings.
Loading original reviews for 2393760...
2026-01-31 21:08:33,182 - voc.storage.parquet_store - INFO - Loading 1 parquet files from ../data/reviews/2393760
Loaded 497 reviews.
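Before the embeddings are handed to BERTopic, it can help to confirm that each sentence has exactly one embedding and that all vectors share the same dimension. The following is a minimal sketch of such a check; `check_alignment` and the toy data are hypothetical names introduced for illustration and are not part of the `voc` package.

```python
def check_alignment(docs, embeddings):
    """Return (n_docs, dim) if docs and embeddings line up, else raise ValueError.

    Hypothetical helper: works on plain lists as well as 2-D NumPy arrays,
    because it only relies on len() of the container and of each row.
    """
    n_docs = len(docs)
    if len(embeddings) != n_docs:
        raise ValueError(
            f"{n_docs} sentences but {len(embeddings)} embeddings"
        )
    # Collect the distinct vector lengths; more than one means ragged data.
    dims = {len(vec) for vec in embeddings}
    if len(dims) > 1:
        raise ValueError(f"Inconsistent embedding dimensions: {sorted(dims)}")
    dim = dims.pop() if dims else 0
    return n_docs, dim


# Toy data standing in for the real sentences and vectors.
toy_docs = ["great game", "too short"]
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

alignment = check_alignment(toy_docs, toy_embeddings)
print(f"{alignment[0]} sentences, dimension {alignment[1]}")
```

In this notebook the same call would be `check_alignment(docs, embeddings)`, run right after `load_data`.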


## 2. Load Model

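Load the most recently trained model for this app from `../data/models`. If loading fails, a temporary model is trained on the loaded embeddings so that exploration can continue.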

In [4]:
def load_latest_model(app_id):
    models_dir = Path("../data/models")
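    # Find subdirectories whose names start with "<app_id>_". Matching on the
    # trailing underscore avoids picking up other app IDs that share this prefix.
    candidates = sorted(d for d in models_dir.iterdir() if d.is_dir() and d.name.startswith(f"{app_id}_"))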
    
    if not candidates:
        raise FileNotFoundError(f"No models found for app {app_id}")
    
    latest_model_path = candidates[-1]
    print(f"Loading model from {latest_model_path}...")
    
    # Load the model
    # We explicitly pass the embedding model to ensure compatibility if needed, 
    # though safetensors load usually handles it well.
    topic_model = BERTopic.load(str(latest_model_path), embedding_model=SentenceTransformer("all-MiniLM-L6-v2"))
    return topic_model

try:
    topic_model = load_latest_model(APP_ID)
    print("Model loaded successfully.")
except Exception as e:
    print(f"Failed to load model: {e}")
    print("Training a new model temporarily for exploration...")
    topic_model = BERTopic(embedding_model=SentenceTransformer("all-MiniLM-L6-v2")).fit(docs, embeddings)


Loading model from ../data/models/2393760_20260131_204156...
2026-01-31 21:08:37,028 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: cuda:0
2026-01-31 21:08:37,029 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2



BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Model loaded successfully.
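Picking the last entry of a sorted directory listing only finds the newest model if the directory names sort chronologically. The sketch below makes that assumption explicit by parsing the timestamp instead. It assumes the naming pattern `<app_id>_<YYYYMMDD>_<HHMMSS>`, inferred from the path printed above; `find_latest_model_dir` is a hypothetical helper, not a function from this project.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def find_latest_model_dir(models_dir, app_id):
    """Return the model directory with the newest timestamp for app_id.

    Assumes directory names of the form "<app_id>_<YYYYMMDD>_<HHMMSS>".
    Directories that do not match this pattern are skipped.
    """
    prefix = f"{app_id}_"
    dated = []
    for d in Path(models_dir).iterdir():
        if not (d.is_dir() and d.name.startswith(prefix)):
            continue
        try:
            stamp = datetime.strptime(d.name[len(prefix):], "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # suffix is not a timestamp in the assumed format
        dated.append((stamp, d))
    if not dated:
        raise FileNotFoundError(f"No models found for app {app_id}")
    return max(dated, key=lambda pair: pair[0])[1]


# Demonstrate on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "2393760_20260130_101500",
        "2393760_20260131_204156",
        "23937601_20260201_000000",  # different app ID sharing a prefix
    ):
        (Path(tmp) / name).mkdir()
    latest_name = find_latest_model_dir(tmp, "2393760").name

print(latest_name)
```

The third directory belongs to a different app ID that merely shares a prefix, so it is ignored even though its timestamp is newer.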


## 3. Visualization

Project the embeddings to two dimensions with UMAP and plot each document, colored by its assigned topic.

In [6]:
# Reduce dimensionality for faster/cleaner visualization
print("Reducing dimensionality with UMAP...")
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

fig_red = topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
fig_red.show()

Reducing dimensionality with UMAP...


## 4. Explore Topics

List up to ten topics with their document counts and top keywords.

In [7]:
topic_info = topic_model.get_topic_info()
print(topic_info.head(10))

   Topic  Count                        Name  \
0      0   1234      0_game_fun_little_just   
1      1     49    1_tall_trails_game_short   
2      2     38        2_10_game_11_1000000   
3      3     29                   3_ps_oh__   
4      4     25  4_good_yes_amazing_alright   

                                      Representation  Representative_Docs  
0  [game, fun, little, just, really, love, great,...                  NaN  
1  [tall, trails, game, short, world, games, love...                  NaN  
2  [10, game, 11, 1000000, dragons, recommend, su...                  NaN  
3                           [ps, oh, , , , , , , , ]                  NaN  
4  [good, yes, amazing, alright, awesomesauce, ex...                  NaN  
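To read the sentences behind a topic rather than only its keywords, the documents can be grouped by their topic assignment. The following is a minimal sketch on toy data; `group_docs_by_topic` is a hypothetical helper, and the per-document topic list is assumed to be available in the same order as `docs`.

```python
from collections import defaultdict


def group_docs_by_topic(docs, topics):
    """Map each topic ID to the list of documents assigned to it.

    Hypothetical helper: `topics` must hold one topic ID per document,
    in the same order as `docs`.
    """
    if len(docs) != len(topics):
        raise ValueError(
            f"{len(docs)} documents but {len(topics)} topic assignments"
        )
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)


# Toy sentences and assignments standing in for the real data.
toy_docs = ["fun little game", "a short hike", "really love it", "good"]
toy_topics = [0, 1, 0, 4]

grouped = group_docs_by_topic(toy_docs, toy_topics)
for topic_id in sorted(grouped):
    # Show at most two example sentences per topic.
    print(topic_id, grouped[topic_id][:2])
```

With the real model, the assignments could come from `topic_model.topics_`, assuming the loaded model still carries per-document assignments, or from `topic_model.transform(docs, embeddings)` otherwise.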
