partial_fit with hierarchical_topics #837

Closed
ajdaling opened this issue Nov 12, 2022 · 3 comments


@ajdaling

ajdaling commented Nov 12, 2022

First off, let me say that this module is amazing and the work you are doing is awesome.

Short version: I used the suggested setup for partial_fit on a reasonably small dataset and it works perfectly. However, a model trained with partial_fit does not seem to work when I call hierarchical_topics(). Is partial_fit incompatible with hierarchical_topics? I'm not sure whether this is a bug, user error, or simply not the intended usage, but I am out of ideas, so any help is appreciated.

Long version:

My setup (basically copied from the readme):

  • IncrementalPCA
  • MiniBatchKMeans
  • OnlineCountVectorizer
  • 30,000 document dataset with allenai/specter embeddings
  • ample memory and GPUs

When I call hierarchical_topics on the trained topic_model, it throws the following error:

  File "/users/PYS1027/ajdaling/work/munch/compare_models/generate_bertopic_model.py", line 266, in generate_model
    h = model.hierarchical_topics(list(doc_df.text))
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/bertopic/_bertopic.py", line 860, in hierarchical_topics
    documents = pd.DataFrame({"Document": docs,
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/frame.py", line 662, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 666, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

Upon inspection, it seems that self.topics_ only has the length of the last batch of documents passed into partial_fit, so when I pass in my full list of documents, the documents list is longer than the topics list in the DataFrame constructor.

As a shot in the dark, I tried modifying the source code of the hierarchical_topics function to take the full list of topics (obtained by calling .transform() on the full list of documents) as an input parameter, but that led to other errors.
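The mismatch described above can be illustrated with a toy stand-in. This is only a sketch: `ToyOnlineModel` is a hypothetical class, not BERTopic's implementation, and it assumes that each `partial_fit` call replaces `topics_` with the assignments for the current batch.

```python
class ToyOnlineModel:
    """Hypothetical stand-in for a topic model; NOT BERTopic's implementation."""

    def __init__(self):
        self.topics_ = []

    def partial_fit(self, docs):
        # Assumption: as described above, each call replaces topics_ with
        # assignments for the current batch only.
        self.topics_ = [len(doc) % 3 for doc in docs]  # fake topic ids
        return self


all_docs = [f"document number {i}" for i in range(2500)]
# Batches of 1000, 1000, and 500 documents.
batches = [all_docs[i:i + 1000] for i in range(0, len(all_docs), 1000)]

model = ToyOnlineModel()
for batch in batches:
    model.partial_fit(batch)

# Only the final batch (500 documents) is represented in topics_.
print(len(all_docs), len(model.topics_))  # 2500 500
```

Pairing 2500 documents with 500 topic ids in a single DataFrame is the kind of length mismatch that the ValueError in the traceback reports.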

More generally, I am using partial_fit because I have some datasets (30M+ documents) that are simply too large for the standard umap/hdbscan setup. I am open to any suggestion or configuration that would get me to a hierarchical set of clusters, and it does not have to involve partial_fit if that is not a viable option.

Thank you in advance.

@MaartenGr
Owner

> First off, let me say that this module is amazing and the work you are doing is awesome.

Thank you for your kind words!

I am quite sure that your problem should be resolved with the following:

# Incrementally fit the topic model by training on 1000 documents at a time and track the topics in each iteration
topics = []
for docs in doc_chunks:
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

topic_model.topics_ = topics

As you mentioned, when you run .partial_fit, the model only keeps track of the topics created at that specific step. If you go to the documentation here, you can see that the last part of the example mentions that you need to continuously update the internal topic_model.topics_ in order to use it for hierarchical topic modeling. You can skip updating topic_model.topics_, but then hierarchical topic modeling is only available for the most recent set of documents on which the model was trained.

@ajdaling
Author

Oh, wow. I swear that paragraph explicitly documenting and answering my exact question was not there before. You just added that to make me look bad...

Thank you for responding so quickly. While I am embarrassed that the answer to my problem was just "read the extremely well-written and easy-to-follow documentation" and I am sincerely sorry for wasting your time on something you so clearly answered, I take solace only in the hope that someday, someone as lazy and oblivious as I will make the same mistake and share in my embarrassment.

Thanks again for the help.

@MaartenGr
Owner

No problem! Any and all questions are welcome. Reading through the documentation, it does seem like it's rather hidden away. I'll make sure it gets a bit clearer in the next release 😄

MaartenGr added a commit that referenced this issue Nov 29, 2022
@MaartenGr MaartenGr mentioned this issue Nov 29, 2022