
Results of transform are different from merged topic model get_topic_info() output #2019

Open
1jamesthompson1 opened this issue May 29, 2024 · 1 comment


@1jamesthompson1

1jamesthompson1 commented May 29, 2024

Edit: I had somehow missed both the topics_ attribute and the get_document_info() method. My question has changed a little: I am now wondering why the results of transform differ from the original assignments made during training.

I have just noticed a problem where the outputs from transform don't match the counts from the get_topic_info() method.

That is, the counts of how many documents are in each topic are not consistent.

Here is a minimal reproducible example:

import pandas as pd  # used to build the DataFrame below
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on and corresponding titles
abstracts_1 = dataset["abstract"][:500]
abstracts_2 = dataset["abstract"][500:1000]
abstracts_3 = dataset["abstract"][1000:1500]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)

# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])

display(merged_model.get_topic_info())

all_abstracts = pd.DataFrame({'documents': abstracts_1 + abstracts_2 + abstracts_3})
all_abstracts['topic'] = merged_model.transform(all_abstracts['documents'])[0]

display(all_abstracts['topic'].value_counts())

[screenshot: get_topic_info() counts vs. value_counts() output]
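As a side note on what "not consistent" means here: the training-time labels are stored on the fitted model (the topics_ attribute mentioned in the edit above), while transform re-predicts them. A minimal sketch of comparing the two assignment vectors, using stand-in lists rather than a real fitted model:

```python
from collections import Counter

# Stand-in assignment vectors. In practice these would come from
# merged_model.topics_ (training-time labels, which get_topic_info()
# counts reflect) and merged_model.transform(docs)[0] (re-predicted
# labels). The values below are made up for illustration.
training_topics = [0, 0, 1, 1, 2, -1]
predicted_topics = [0, 1, 1, 1, 2, 2]

# Per-topic counts from each assignment vector
print(Counter(training_topics))
print(Counter(predicted_topics))

# Indices of documents whose label changed between fit and transform
moved = [
    i for i, (a, b) in enumerate(zip(training_topics, predicted_topics))
    if a != b
]
print(moved)
```

If the two vectors agreed, `moved` would be empty and both Counters would match; any disagreement shows up directly as mismatched value counts like the ones in the screenshot.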

Here is my working example:

import numpy as np  # used to build the embedding arrays below

embeddings = all_embeddings['voyageai'].copy()

display(embeddings)

mode_groups = embeddings.groupby('mode')
mode_dfs = [mode_groups.get_group(i).reset_index(drop=True) for i in range(3)]

mode_models = [BERTopic() for _ in mode_dfs]

for model, df in zip(mode_models, mode_dfs):
    model.fit_transform(
        df['si'],
        np.array([np.array(x) for x in df['si_embedding'].to_numpy()])
    )
    display(model.get_topic_info())

merged_model = BERTopic.merge_models(mode_models, min_similarity=0.9)

display(merged_model.get_topic_info())

embeddings['topic'] = merged_model.transform(embeddings['si'], np.array([np.array(x) for x in embeddings['si_embedding'].to_numpy()]))[0]

embeddings['topic'].value_counts()

Output:
[screenshot of the value_counts() output]

What am I missing, and why can the topic assignments from the merged model differ so much from the transform results? Furthermore, am I missing how I should be getting the topics for the original documents?

@MaartenGr
Owner

My question has changed a little: I am now wondering why the results of transform differ from the original assignments made during training.

There are a couple of issues, both open and closed, that discuss this, but the most recent one I could find is this: #2017 (comment)
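To illustrate the general mechanism behind the discrepancy (a toy numpy sketch, not BERTopic's actual implementation): during fit, a density-based clusterer such as HDBSCAN assigns labels and can mark documents as outliers (-1), whereas transform re-assigns each document by similarity to the topic representations. Borderline documents can therefore land in a different topic, and outliers disappear entirely:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two toy "topic" centroids with 10 documents scattered around each
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
docs = rng.normal(loc=centroids.repeat(10, axis=0), scale=0.6)

# Pretend these are the labels the clusterer produced at fit time,
# including two outliers marked -1 (as HDBSCAN does)
fit_labels = np.array([0] * 10 + [1] * 10)
fit_labels[[3, 17]] = -1

# transform()-style assignment: cosine similarity to each centroid
sims = (docs @ centroids.T) / (
    np.linalg.norm(docs, axis=1, keepdims=True)
    * np.linalg.norm(centroids, axis=1)
)
pred_labels = sims.argmax(axis=1)

# Nearest-centroid assignment never produces -1, and noisy points
# near the boundary can flip topics, so the counts differ
print(np.bincount(pred_labels))
print((fit_labels != pred_labels).sum())
```

The same effect scales up in the real models: get_topic_info() reports the fit-time counts, while transform's similarity-based predictions produce a different distribution, especially for documents near topic boundaries or originally labeled -1.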
