add support for cuml hdbscan membership_vector #1324

stevetracvc · 2023-06-08T17:16:23Z

cuml version 23.04 now has membership_vector, so this allows topic_model.transform() to calculate the probability matrix if using a cuml-based hdbscan model

As I was preparing to submit this PR, I saw #1317 and that some of you have already figured this out 😄

I had included batch_size as a new parameter to hdbscan_delegator just in case someone needed to pass in a batch_size somehow

MaartenGr · 2023-06-09T04:52:00Z

Thanks for the PR! Did you check whether the try/except clause works? According to the issue that you refer to, approximate_predict returns a tuple, so I think it will break if you are using cuML < 23.04. Also, I am wondering whether the batch_size parameter should be added at all, since it will be fixed in cuML 23.08.

stevetracvc · 2023-06-09T21:48:25Z

Good catch, that should be an AttributeError instead

I'm fine removing the batch_size param, but I added it so you can use a custom batch size if necessary. I think it's important to leave the check in there, because telling people this function won't work for cuml 23.04 or 23.06 isn't great.

I just left it as it was, and yes, approximate_predict returns a tuple. What should be the expected return if someone tries to get membership_vector when using an incompatible cuml?
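The fallback being discussed could be sketched as follows. This is only an illustration, not the code in the PR: the helper name `membership_or_none`, the choice to return `None` on older versions, and the two stand-in modules are assumptions made so the example runs without cuML installed.

```python
from types import SimpleNamespace


def membership_or_none(hdbscan_module, model, embeddings):
    """Return a membership matrix, or None if membership_vector is missing.

    Hypothetical helper: returning None for older versions is an assumption,
    not necessarily what the PR settled on.
    """
    try:
        # Only the attribute lookup is guarded, so an AttributeError raised
        # inside membership_vector itself is not silently swallowed.
        membership_vector = hdbscan_module.membership_vector
    except AttributeError:
        return None
    return membership_vector(model, embeddings)


# Stand-ins for an older and a newer hdbscan module (illustrative only).
old_module = SimpleNamespace(
    approximate_predict=lambda model, X: ([0] * len(X), [1.0] * len(X))
)
new_module = SimpleNamespace(
    membership_vector=lambda model, X: [[1.0] for _ in X]
)
```

Calling `membership_or_none(old_module, None, points)` yields `None`, while the same call with `new_module` yields one membership row per point, so the caller can decide how to handle the missing probabilities.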

MaartenGr · 2023-06-19T09:14:40Z

Apologies for the late reply. I am wondering whether it would not be better to skip handling the batch_size parameter, since that was fixed in the most recent release. Although I understand that previous versions are still used, I believe most users would get the newest stable release automatically when installing cuML. I'm afraid it will cause problems later on if details of the batch_size parameter change or it is removed.

HeadCase · 2023-08-01T11:14:28Z

I am currently experiencing CUDA out-of-memory (OOM) errors on my (admittedly large) dataset when hdbscan_delegator makes the call to all_points_membership_vectors. I know that I can continue to tweak min_cluster_size/min_samples to reduce memory usage (or switch off calculate_probabilities), but I would prefer those to be last resorts. My understanding from reading the RAPIDS docs is that manually reducing the batch_size in this call has a chance of fixing my OOM issues. Do we see a possibility of making this batch size user-configurable?

stevetracvc · 2023-08-03T17:15:19Z

@HeadCase what version of cuML are you using? I think version 23.08 uses a batch size of 4096 which likely will fix your problem. You could also take a look at my first commit, which does include a batch_size parameter, if 4096 is too big for you

stevetracvc · 2023-08-03T17:17:25Z

@MaartenGr sorry this dropped off my radar. I get that it could cause future problems, but rather than removing it I can add in version tests. I think it's 23.04 and 23.06 that need the batch_size patch

e.g.
if cuml.__version__.startswith("23.04") or cuml.__version__.startswith("23.06"):

MaartenGr · 2023-08-14T09:18:53Z

Seeing as the newest version of cuML resolves the issue, would it not be straightforward to just mention that users will need to use the latest version? Especially since it was fixed in their latest release.

stevetracvc · 2023-08-14T20:02:53Z

For me, it's a lot easier to update the BERTopic version in my environment than it is to update cuML. I've had major dependency conflicts when trying to install cuML with some of my other packages.

My point being, other people might not be able to update cuML in their environments just to get a feature that technically does work. But I get that you don't want the code cluttered.

Your call. We could just leave the commit as is, and not merge it, and people can patch their copy of BERTopic if needed...

MaartenGr · 2023-08-28T08:06:11Z

Hmmm, in that case, it might indeed be better to check for those specific versions and only implement the fix for these versions. I think that would be more stable and indeed still allows users to keep using the current version. As long as it does not affect newer versions, that should suffice.
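A version gate along these lines might look like the sketch below. The function name `needs_batch_size_patch` is hypothetical, and the set of affected releases (23.04 and 23.06) is taken from the discussion above rather than verified against cuML. Parsing the major and minor components avoids the false positives a plain `startswith` check could give for a string such as "23.040".

```python
def needs_batch_size_patch(version: str) -> bool:
    """Return True only for the releases assumed to need the workaround.

    Hypothetical helper; the affected set (23.04, 23.06) comes from the
    thread, not from cuML's release notes.
    """
    parts = version.split(".")
    try:
        major_minor = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        # Unparseable version strings are treated as not affected.
        return False
    return major_minor in {(23, 4), (23, 6)}
```

In practice the argument would be `cuml.__version__`, so newer releases such as 23.08 fall through to the unpatched code path.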

add support for cuml hdbscan membership_vector

2fce29b

cuml version 23.04 now has membership_vector, so this allows topic_model.transform() to calculate the probability matrix if using a cuml-based hdbscan model

stevetracvc force-pushed the cuml-hdbscan-membership_vector branch from 6f2f48e to 2fce29b Compare June 9, 2023 19:04

stevetracvc added 5 commits June 9, 2023 15:50

removed batch_size param

0c7058c

fixed exception handler

4e5f028

fixed return value for earlier versions of cuml

b971bf6

new tests for cuml models

9ee5bc5

fix tests if cuml not installed

74335fa

MaartenGr mentioned this pull request Sep 24, 2023

bug fix for transform when using cuml.hdbscan &calculate_probabilities=True #1543

Closed

MaartenGr mentioned this pull request Jan 31, 2024

BERTopic Loading Issue #1764

Closed

MaartenGr mentioned this pull request Mar 21, 2024

Fix CUML HDBSCAN predictions by using correct method. #1874

Closed

beckernick mentioned this pull request Apr 15, 2024

model.transform() throwing error when using cuml for HDBSCAN with calculate_probabilities=True #1317

Closed
