# Evaluating Model Performance 

In [None]:
# Imports 

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import torch
import umap
import hdbscan
import pandas as pd
import numpy as np
import json




## Topic Coherence  

### BERT

In [None]:
# Topic Coherence

# Topic coherence evaluates:
# - how meaningful the top words in each topic are
# - whether those words tend to co-occur in the same context
# - how interpretable the topic is to humans
# Higher is better; we treat roughly 0.35-0.55 as a good range.

# Initial assumption: BERTopic will outperform LDA/LSA here.

# Loading our previous results 

# Truncated 5k review subset. 
df_5k = pd.read_csv("../data/bert_results/bertopic_5k_reviews_with_topics.csv")  # or your saved CSV
documents = df_5k["text"].astype(str).tolist()

#Bert model 
topic_model = BERTopic.load("../data/bert_results/bertopic_model")
print("Loaded BERTopic model.")

#Topic assignments 
topics = df_5k["topic"].tolist()
print("Loaded topics:", len(topics))


Loaded BERTopic model.
Loaded topics: 5000


In [6]:
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary


# Top words for each topic (set(topics) also includes the outlier topic -1 when it is present).
topics_words = [
    [word for word, _ in topic_model.get_topic(t)]
    for t in set(topics)
]

# Tokenize the documents with a simple whitespace split.

tokenized_docs = [doc.split() for doc in documents]
dictionary = Dictionary(tokenized_docs)


# Compute the c_v coherence; this score goes into a comparison table later.
cm = CoherenceModel(
    topics=topics_words,
    texts=tokenized_docs,
    dictionary=dictionary,  
    coherence='c_v'
)


coherence_score = cm.get_coherence()
print("BERTopic Coherence for Reviews (c_v):", coherence_score)


BERTopic Coherence for Reviews (c_v): 0.4204005495131143


The BERTopic model achieved a c_v coherence score of **0.4204**, which indicates an above-average level of semantic consistency in the topics it extracted. Consumer product reviews are highly subjective, often informal, and noisy, yet the model still captured the topics well. Since coherence scores in the 0.35–0.55 range are considered good, this result suggests that BERTopic identified interpretable and meaningful themes across our Amazon Beauty reviews.

This score and the visuals obtained earlier suggest that the top words in each topic show strong co-occurrence patterns and shared contextual meaning. Overall, the coherence result supports the view that a transformer-based, embedding-driven approach produces well-structured topics that align with human interpretations of review content.  ---> Pair these results with the overview
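To make the idea of "top words co-occurring" concrete, here is a minimal sketch of a document-level NPMI coherence on a toy corpus. This is not gensim's `c_v` (which is more involved and uses sliding windows plus an indirect similarity measure), so it will not reproduce the 0.4204 above. The function name `npmi_coherence` and the toy documents are made up for illustration.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, tokenized_docs):
    """Average document-level NPMI over all word pairs of one topic (simplified sketch)."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: lowest possible NPMI
        elif p12 == 1:
            scores.append(0.0)   # NPMI is undefined here; treated as 0 in this sketch
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy corpus: two skincare reviews and two haircare reviews.
toy_docs = [
    ["dry", "skin", "moisturizer"],
    ["moisturizer", "skin", "soft"],
    ["hair", "shampoo", "curls"],
    ["shampoo", "hair", "shine"],
]

coherent_score = npmi_coherence(["skin", "moisturizer"], toy_docs)
mixed_score = npmi_coherence(["skin", "shampoo"], toy_docs)
print("coherent topic:", coherent_score)
print("mixed topic:", mixed_score)
```

Words that always appear together score near the top of the scale, while words that never share a document score at the bottom, which is the intuition behind reading a higher coherence as "more interpretable".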

## Topic Diversity 

### BERT

In [7]:
# Topic Diversity

# High topic diversity = the model captures many different themes rather than
# repeating the same idea across multiple topics.

# Typical ranges:
#   0.50 and below -> poor (topics tend to repeat)
#   0.50-0.70      -> moderate
#   0.70-0.90      -> high diversity (excellent)

# Initial assumption: BERTopic will outperform LDA/LSA here as well.

# Use the top 10 words per topic (the same cutoff for every topic).
u_words = set()
t_words = 0

for topic_id in set(topics):
    top_words = [word for word, _ in topic_model.get_topic(topic_id)[:10]]
    t_words += len(top_words)
    u_words.update(top_words)

topic_diversity = len(u_words) / t_words
print("BERTopic Topic Diversity:", topic_diversity)


BERTopic Topic Diversity: 0.6383561643835617


The model achieved a topic diversity score of **0.638**, which falls in the moderate range by the thresholds above (between 0.50 and 0.70). This suggests that the topics are fairly distinct from one another and not overly repetitive. Even though the score is not extremely high, it still shows the model's ability to capture a wide range of themes from the reviews. Users can get varied insights across topics, even if a few of them slightly overlap in meaning.  ---> Pair these results with the tree
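The diversity calculation can be checked by hand on a small example. The sketch below wraps the same ratio (unique top words divided by total top words) in a helper; the name `topic_diversity_score` and the toy word lists are hypothetical and not part of the notebook's saved results.

```python
def topic_diversity_score(topic_word_lists, top_k=10):
    """Share of unique words among the top_k words of every topic."""
    unique_words = set()
    total_words = 0
    for words in topic_word_lists:
        top = words[:top_k]
        unique_words.update(top)
        total_words += len(top)
    return len(unique_words) / total_words if total_words else 0.0


# Hypothetical topics: "skin", "dry" and "cream" each appear in two topics.
toy_topics = [
    ["skin", "dry", "moisturizer", "cream"],
    ["hair", "shampoo", "dry", "curls"],
    ["skin", "cream", "face", "serum"],
]

toy_score = topic_diversity_score(toy_topics, top_k=4)
print("Toy topic diversity:", toy_score)  # 9 unique words / 12 total = 0.75
```

Every repeated word lowers the score, so a value of 1.0 would mean no topic shares a top word with any other topic.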

## Silhouette Score (Clustering Performance)

### BERT

In [8]:
# Load the original document embeddings.
# Note: the embeddings file was deleted; it will be re-added if needed.
embeddings = np.load("../cleaned_data/embeddings/embeddings.npy")
#print("Loaded embeddings:", embeddings.shape) 

In [9]:
from sklearn.metrics import silhouette_score

# Extract the topic labels.
labels = np.array(topics)

# Remove outlier documents (topic -1) before scoring.
mask = labels != -1
filtered_embeddings = embeddings[mask]
filtered_labels = labels[mask]

print("Original:", len(labels), "Filtered:", len(filtered_labels))

score = silhouette_score(filtered_embeddings, filtered_labels)
print("BERTopic Silhouette Score:", score)

Original: 5000 Filtered: 3692
BERTopic Silhouette Score: 0.05022962763905525


For clustering performance, the BERTopic model gave a silhouette score of **0.05** after removing outliers (1,308 of the 5,000 reviews were labeled -1, leaving 3,692). A score this close to 0 means the clusters overlap heavily in the embedding space. This isn't surprising for beauty review data, since people describe products in all kinds of ways and the themes naturally overlap. It is also worth noting that the score is computed on the loaded embeddings, which may not be the same space the clustering was run in. So even though the clusters aren't tight or clearly separated, the model still pulls out meaningful patterns. Taken together with the coherence and diversity results, this shows that BERTopic can capture useful, real-world topics even when the underlying clusters are loose, which reflects how messy human-written reviews usually are.
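To show what a silhouette score near 0 (or below) looks like compared with a high one, here is a small pure-Python sketch of the mean silhouette coefficient with Euclidean distance. It is only an illustration on made-up 2D points, not a replacement for `sklearn.metrics.silhouette_score`, and the helper name `mean_silhouette` is hypothetical.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean); assumes at least two clusters."""
    scores = []
    for i, point in enumerate(points):
        # Accumulate the distance from point i to every other point, per cluster.
        sums, counts = {}, {}
        for j, other in enumerate(points):
            if i == j:
                continue
            d = math.dist(point, other)
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1

        own = labels[i]
        if own not in counts:
            scores.append(0.0)  # singleton cluster: silhouette is defined as 0
            continue

        a = sums[own] / counts[own]                               # mean intra-cluster distance
        b = min(sums[c] / counts[c] for c in sums if c != own)    # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical 2D points forming two well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

separated = mean_silhouette(toy_points, [0, 0, 1, 1])  # labels follow the groups
mixed = mean_silhouette(toy_points, [0, 1, 0, 1])      # labels cut across the groups
print("separated labels:", round(separated, 3))
print("mixed labels:", round(mixed, 3))
```

When labels match compact, distant groups the score approaches 1; when labels cut across the groups it drops to 0 or below, which is the regime the 0.05 above sits close to.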