Hierarchical topics calculated with topic embeddings #1894

Open
wants to merge 7 commits into
base: master

@azikoss azikoss commented Mar 28, 2024

Added an option to use topic embeddings to calculate distances between clusters for hierarchical topics. The topic embeddings should provide a better representation than the term frequencies calculated with c-TF-IDF.
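
As a rough illustration of the idea (not BERTopic's actual implementation), the sketch below builds a hierarchy from pairwise cosine distances between topic embeddings. The toy `topic_embeddings` array and the choice of average linkage are assumptions made for this example only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Toy stand-in for topic embeddings: 4 topics in a 3-dimensional space.
topic_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])

# Condensed vector of pairwise cosine distances between topics.
distances = pdist(topic_embeddings, metric="cosine")

# Agglomerative clustering on the precomputed distances.
# "average" linkage is an arbitrary choice for this sketch.
Z = linkage(distances, method="average")

# Each row of Z describes one merge: the two cluster ids, their distance,
# and the number of original topics in the merged cluster.
print(Z)
```

Swapping `topic_embeddings` for a dense c-TF-IDF matrix would give the term-frequency-based hierarchy, which is the comparison this PR is about.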

@MaartenGr
Owner

Thank you for this PR! I might be mistaken but I think you might have missed the CONTRIBUTING file since I'm not seeing an attached issue.

Having said that, we can start this feature request here for now.

I definitely agree that having the option to switch between c-TF-IDF and topic embeddings would be great to have, not only in this specific example but throughout BERTopic. Keeping that in mind, I actually think it would be best that whenever such a parameter (use_ctfidf) is introduced, it is also implemented wherever relevant. For instance, this feature could be added to .topics_over_time, .topics_per_class, .visualize_hierarchy, and even ._reduce_to_n_topics/._auto_reduce_topics.

So I think this PR might be a bit bigger than the single function. I believe that if this parameter is in this single function, users will expect it to be used in others as well.

@azikoss
Author

azikoss commented Mar 28, 2024

Hi @MaartenGr, my apologies for not reading the contributing file.

I am happy to make changes in the other methods as well. Is the list of methods above complete, or are they only examples?

@MaartenGr
Owner

Thanks, that would be great! A few things to keep in mind:

  • The ._reduce_to_n_topics/._auto_reduce_topics might be a bit tricky as they are private functions, so the use_c_tf_idf param does not work there and it is not a parameter that fits with __init__
  • Although use_c_tf_idf is the most logical, I think it's a nicer experience to type use_ctfidf instead
  • I believe, but you will have to check, that the following can also choose between the two:
    • bertopic.plotting.visualize_heatmap
    • bertopic.plotting.visualize_hierarchy
    • bertopic.plotting.visualize_topics
    • bertopic.BERTopic.topics_over_time
    • bertopic.BERTopic.topics_per_class
  • Note that even if you choose ctfidf, it might not always be available, and the same goes for the topic embeddings. So it should revert to whatever is available
  • The topics_over_time and topics_per_class are quite tricky, so I think it's okay to skip these for now
  • Some testing might be required to check if everything works as intended since there might be some unexpected issues here and there (for instance, I've had issues in the past with the distance function in hierarchical_topics using topic embeddings)
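
The "revert to whatever is available" behavior from the list above could be sketched roughly as follows. The function name `choose_representation`, its signature, and its return value are hypothetical and only illustrate the fallback logic; they are not taken from this PR or from BERTopic.

```python
from typing import Any, Optional, Tuple


def choose_representation(
    ctfidf: Optional[Any],
    embeddings: Optional[Any],
    use_ctfidf: bool = True,
) -> Tuple[Any, bool]:
    """Return (representation, is_ctfidf), falling back if the preferred one is missing.

    Hypothetical helper for illustration only.
    """
    candidates = [(ctfidf, True), (embeddings, False)]
    if not use_ctfidf:
        # Prefer topic embeddings, keep c-TF-IDF as the fallback.
        candidates.reverse()

    for representation, is_ctfidf in candidates:
        # Compare against None explicitly: arrays and sparse matrices
        # do not have an unambiguous truth value.
        if representation is not None:
            return representation, is_ctfidf

    raise ValueError("Neither c-TF-IDF nor topic embeddings are available.")
```

Returning a flag alongside the representation lets the caller know which one was actually used, which matters when the two need different handling (for example, sparse versus dense input).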

Hopefully, it is not too much. But if it is, please let me know! I'll see if I can help out.

@azikoss
Author

azikoss commented Apr 2, 2024

Thanks for the guidelines! I will look into it within the next two weeks.

@azikoss azikoss force-pushed the hierachy-topics-with-embeddings branch from 6ba19c7 to 51e0b3f Compare April 18, 2024 12:08
Included a unit test for selecting the representation.
@azikoss azikoss force-pushed the hierachy-topics-with-embeddings branch from 51e0b3f to dc8d33b Compare April 18, 2024 13:18
@azikoss
Author

azikoss commented Apr 18, 2024

@MaartenGr, please have a look.

I added the embedding selection to all the methods mentioned, apart from topics_over_time(..) and topics_per_class(..).

I kept the existing defaults for use_ctfidf in each function. For instance, visualize_heatmap(..) has use_ctfidf = False while visualize_hierarchy(..) has use_ctfidf = True. This matches the current behavior, but it would be nice to unify it, i.e. use semantic topic embeddings wherever possible. What do you think?

While the function select_topic_represenation(..) is unit tested, I did not add any end-to-end tests. Let me know what you think about it.

select_topic_represenation(..) always returns an np.ndarray. That seems to be the general pattern throughout the code, with some exceptions that I would like to point out. E.g., plotting.visualize_hierarchy(..) only converts topic_embeddings_ into an ndarray and not the c_tfidf representation. In plotting.visualize_topics(..), only c_tfidf is converted into an ndarray.

@MaartenGr
Owner

@azikoss Thanks for the work on this! Hopefully, I will have some time this weekend or beginning of next week to look at this a bit more in-depth.

> I kept the existing defaults for use_ctfidf in each function. For instance, visualize_heatmap(..) has use_ctfidf = False while visualize_hierarchy(..) has use_ctfidf = True. This matches the current behavior, but it would be nice to unify it, i.e. use semantic topic embeddings wherever possible. What do you think?

I agree. Unification would definitely be best here. It will likely change the output, so it might be nice to test this additionally. However, since there is an option to easily switch between the types of topic representation, I'm not too worried about this.

> While the function select_topic_represenation(..) is unit tested, I did not add any end-to-end tests. Let me know what you think about it.

I just started the tests, so let's see what happens with end-to-end tests using the defaults. In an ideal world, I would like to see more tests that also cover this end-to-end, but they are already quite large and slow.

> select_topic_represenation(..) always returns an np.ndarray. That seems to be the general pattern throughout the code, with some exceptions that I would like to point out. E.g., plotting.visualize_hierarchy(..) only converts topic_embeddings_ into an ndarray and not the c_tfidf representation. In plotting.visualize_topics(..), only c_tfidf is converted into an ndarray.

I would have to check this but I generally want to prevent casting sparse matrices into numpy arrays as that can increase the memory needed to hold that matrix significantly.
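
To make the memory concern concrete, here is a small sketch comparing a sparse matrix with its dense counterpart. The matrix shape and the roughly 1% density are made-up values for illustration; real c-TF-IDF matrices will differ.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for a c-TF-IDF matrix: 100 topics x 5,000 terms,
# with roughly 1% of the entries non-zero.
rng = np.random.default_rng(0)
values = rng.random((100, 5_000))
values[values < 0.99] = 0.0
ctfidf = csr_matrix(values)

# Memory held by the three CSR arrays (values, column indices, row pointers).
sparse_bytes = ctfidf.data.nbytes + ctfidf.indices.nbytes + ctfidf.indptr.nbytes

# Memory held after casting to a dense NumPy array.
dense_bytes = ctfidf.toarray().nbytes

print(f"sparse: {sparse_bytes:,} bytes, dense: {dense_bytes:,} bytes")
```

The dense size grows with the number of topics times the vocabulary size, while the sparse size grows only with the number of non-zero entries, so the gap widens as the vocabulary gets larger.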

@azikoss
Author

azikoss commented Apr 19, 2024

@MaartenGr thanks for your comments.

I just ran the tests on Python 3.9 and they all pass. If I understand correctly, Python 3.8 is also supported? If so, I will change the type hints.

Have a look at the conversion to np.array pls and let me know. It is about 50/50 where the c_tfidf matrix is converted to np.array and where it is not.

Owner

@MaartenGr MaartenGr left a comment

Thanks for your work on this! I left some comments that we can discuss, as they relate to, for instance, casting to numpy. Also, the tests fail on Python 3.8, most likely because these kinds of type hints are not yet supported there.
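
The thread does not say which type hints are meant. One common cause, assumed here, is subscripting built-in types such as `tuple[...]` or `list[...]` in annotations, which only works from Python 3.9 onwards (PEP 585). A sketch of the Python 3.8-compatible spelling, using a hypothetical `describe` function:

```python
from typing import List, Optional, Tuple

# On Python 3.8, an annotation like `-> tuple[int, bool]` raises
# "TypeError: 'type' object is not subscriptable" when the function is defined.
# The typing aliases below work on 3.8 and later versions alike.


def describe(values: Optional[List[float]] = None) -> Tuple[int, bool]:
    """Return (count, is_empty) for a list of values; hypothetical example."""
    values = values or []
    return len(values), len(values) == 0
```

Adding `from __future__ import annotations` at the top of a module is another option, since it defers evaluation of annotations, but it does not help where the hints are evaluated at runtime.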

bertopic/_utils.py (outdated, resolved)
bertopic/_utils.py (resolved)
bertopic/_bertopic.py (outdated, resolved)
bertopic/_utils.py (outdated, resolved)
bertopic/_utils.py (outdated, resolved)
bertopic/_utils.py (outdated, resolved)
@azikoss
Author

azikoss commented May 6, 2024

@MaartenGr pls let me know if there is anything else to address.

@MaartenGr
Owner

@azikoss Thanks for all the work so far! I was on holiday for the last two weeks and just getting back to a bunch of issues and PRs. Yours is high priority and I hope to get to it this weekend. Thanks for being patient with me.

@MaartenGr
Owner

I just did a first pass through the code and wanted to see whether all checks pass correctly but there seems to be a problem with the pipeline. To get this pipeline working for you, I would advise implementing the same fix as #2008. That way, we can run the pipeline and see if everything works as intended.
