Edit: I had surprisingly missed both the `topics_` attribute and the `get_document_info()` method. My question has now changed a little: I am wondering why the output of `transform` differs from the original topic assignments made during training.
I have just noticed a problem where the outputs from `transform` don't match the counts from the `get_topic_info()` method. That is, the counts of how many documents are in each topic are not consistent.
Here is a minimal reproducible example:
```python
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on and corresponding titles
abstracts_1 = dataset["abstract"][:500]
abstracts_2 = dataset["abstract"][500:1000]
abstracts_3 = dataset["abstract"][1000:1500]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)

# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])
display(merged_model.get_topic_info())

all_abstracts = pd.DataFrame({'documents': abstracts_1 + abstracts_2 + abstracts_3})
all_abstracts['topic'] = merged_model.transform(all_abstracts['documents'])[0]
display(all_abstracts['topic'].value_counts())
```
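For context, this is the comparison I am making. A minimal sketch using placeholder topic arrays (standing in for `merged_model.topics_` from training and `merged_model.transform(...)[0]` from re-prediction, since the real values require a fitted model):

```python
import pandas as pd

# Placeholder assignments: in the real run these come from
# merged_model.topics_ (stored at fit time) and
# merged_model.transform(documents)[0] (re-predicted on the same docs).
training_topics = [0, 0, 1, 1, 1, -1]
transformed_topics = [0, 1, 1, 1, -1, -1]

# Side-by-side per-topic counts; any topic missing from one side becomes 0.
counts = pd.DataFrame({
    "training": pd.Series(training_topics).value_counts(),
    "transform": pd.Series(transformed_topics).value_counts(),
}).fillna(0).astype(int)

# The two columns need not agree: transform re-embeds and re-assigns the
# documents, so the counts can drift from what was stored during fitting.
print(counts)
```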
What am I missing, and why can the topic assignments from the merged model differ so much from the transformed values? Furthermore, am I missing how I should be getting the topics for the original documents?