
GPU accelerated UMAP and HDBSCAN issues: memory and predict #644

Closed · ghost opened this issue Jul 27, 2022 · 8 comments

ghost commented Jul 27, 2022

Hello everyone,

First issue: memory

cuml.manifold.UMAP crashes with the following error every time I fit_transform on more than 1,500,000 documents.

```
2022-07-27 15:40:09,788 - BERTopic - Transformed documents to Embeddings
Traceback (most recent call last):
  File "/home/natethegreat/bertopic/bertopic_model_cuml.py", line 17, in <module>
    topic_model.fit(docs)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 237, in fit
    self.fit_transform(documents, embeddings, y)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 313, in fit_transform
    umap_embeddings = self._reduce_dimensionality(embeddings, y)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 2070, in _reduce_dimensionality
    umap_embeddings = self.umap_model.transform(embeddings)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/manifold/umap.pyx", line 730, in cuml.manifold.umap.UMAP.transform
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /home/natethegreat/miniconda3/envs/torchrapids/include/rmm/mr/device/cuda_memory_resource.hpp
Segmentation fault
```

The same error does not occur if I use the CPU versions of UMAP and HDBSCAN.
My understanding is that it happens because the amount of dedicated GPU memory is much smaller than regular RAM (8 GB vs. 128 GB in my case).

Are there any options to circumvent this issue? For example, could the process be split into smaller batches, or could shared GPU memory be used?
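
For illustration, the batching idea might look like the following rough, untested sketch (`batched_umap`, `fit_size`, and `batch_size` are invented names and numbers, not an existing API; it assumes the embeddings are a host-side NumPy array):

```python
import numpy as np
from cuml.manifold import UMAP

def batched_umap(embeddings, fit_size=500_000, batch_size=100_000):
    # Fit on a random subsample small enough for GPU memory...
    umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
    sample = np.random.choice(len(embeddings), size=min(fit_size, len(embeddings)), replace=False)
    umap_model.fit(embeddings[sample])
    # ...then project the full corpus in chunks to cap peak GPU usage.
    chunks = [umap_model.transform(embeddings[i:i + batch_size])
              for i in range(0, len(embeddings), batch_size)]
    return np.vstack(chunks)
```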

Also, a model trained on fewer than 1,500,000 documents shows two odd behaviors:

  • The resulting topics often contain duplicated words, e.g. "14_gold_gold money_silver_money gold"
  • Regardless of whether training succeeds, the message "Segmentation fault" is printed at the end

Second issue: model.predict

With a model trained using the GPU-accelerated versions of UMAP and HDBSCAN, running model.transform([sentence]) raises the following error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 404, in transform
    predictions = self.hdbscan_model.predict(umap_embeddings)
  File "cuml/common/base.pyx", line 269, in cuml.common.base.Base.__getattr__
AttributeError: predict
```

Thank you in advance, regardless of whether anything comes of this.

Code:

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN   # GPU-accelerated HDBSCAN
from cuml.manifold import UMAP     # GPU-accelerated UMAP
from sklearn.feature_extraction.text import CountVectorizer
import pickle

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=100)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    verbose=True,
    calculate_probabilities=False,
    low_memory=True,
)

docs = pickle.load(open("docs_0.35.pkl", "rb"))  # <-- 220 MB, approx. 1.6M samples

topic_model.fit(docs)

pickle.dump(topic_model, open("bert_model_2.0.pkl", "wb"))
```

nvidia-smi output: (screenshot attached in the original issue)

pepi99 commented Jul 29, 2022

Hello, I'm running into the same problem as the second issue (model.predict).

Did you manage to solve it?

ghost (Author) commented Jul 29, 2022

Sadly not. Still waiting for someone to bail us out.

beckernick (Contributor) commented Jul 29, 2022

The first issue is likely an out-of-memory error. Though UMAP should not spike memory too significantly, it's possible that its requirements plus other data on your GPU are causing an OOM. UMAP creates a sparse KNN graph on the order of n_samples * n_neighbors, so reducing the number of neighbors will reduce the memory needed for that data structure.
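
For example, a minimal sketch of that parameter change (the value 5 is an arbitrary illustration; lower values trade some global structure for memory):

```python
from cuml.manifold import UMAP

# Lowering n_neighbors shrinks the sparse KNN graph, which grows on the
# order of n_samples * n_neighbors (5 here instead of the default 15).
umap_model = UMAP(n_components=5, n_neighbors=5, min_dist=0.0)
```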

The predict error is due to cuML's HDBSCAN not yet supporting approximate_predict. Please see #647 (comment) for more context.

pepi99 commented Jul 31, 2022

Yes, as beckernick said, just remove the hdbscan_model=... argument when you initialize your BERTopic model; no problems after that! :)
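
A sketch of this workaround, reusing the parameters from the original post:

```python
from bertopic import BERTopic
from cuml.manifold import UMAP

# Keep cuML's UMAP for the GPU speedup, but drop the hdbscan_model argument
# so BERTopic falls back to its default CPU HDBSCAN, which supports
# approximate_predict and therefore .transform on new documents.
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
topic_model = BERTopic(umap_model=umap_model, verbose=True, calculate_probabilities=False)
```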

ghost (Author) commented Aug 1, 2022

Thank you guys!

sebastien-mcrae commented Sep 26, 2022

approximate_predict function for HDBSCAN: rapidsai/cuml#4872
Nightly Release: https://github.com/rapidsai/cuml/releases/tag/v22.10.00a

I installed the nightly release, but I'm still getting the error mentioned above... Any ideas?

MaartenGr (Owner) commented

@sebastien-mcrae Support for cuML's HDBSCAN as a one-to-one replacement for the CPU HDBSCAN is not yet implemented in BERTopic, and it will take some time before it is fully supported.

In your case, when you use cuML's HDBSCAN, BERTopic recognizes it as a generic cluster model, not necessarily an HDBSCAN-like model. As such, it falls back to what is expected of cluster models in BERTopic, namely that they need .fit and .predict functions in order to work. A wrapper along the lines sketched below could bridge that gap in the meantime.
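
An untested sketch of such a wrapper (the class name `PredictableHDBSCAN` is made up, the `approximate_predict` import path should be verified against your cuML version, and it assumes a cuML build that ships rapidsai/cuml#4872):

```python
from cuml.cluster import HDBSCAN
from cuml.cluster.hdbscan import approximate_predict  # verify this path for your cuML version

class PredictableHDBSCAN(HDBSCAN):
    """cuML HDBSCAN exposing the .predict method BERTopic expects of cluster models."""

    def predict(self, X):
        # approximate_predict returns (labels, probabilities); BERTopic only needs labels.
        labels, _ = approximate_predict(self, X)
        return labels

# Usage sketch: prediction_data=True (assumed to mirror the CPU library's
# requirement) so that approximate_predict has what it needs after fitting.
# hdbscan_model = PredictableHDBSCAN(min_samples=10, prediction_data=True)
```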

MaartenGr (Owner) commented

Due to inactivity, and with the v0.13 release of BERTopic supporting more of cuML's native functions, I'll be closing this for now. If you have any questions or want to continue the discussion, I'll make sure to re-open the issue!
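
With BERTopic v0.13+ and a cuML release that includes `approximate_predict`, the pattern looks roughly like this (a sketch based on BERTopic's GPU documentation; exact parameter support may differ between versions):

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# prediction_data=True enables cuML's approximate_predict, which
# BERTopic's .transform relies on for assigning new documents.
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```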
