
GPU accelerated UMAP and HDBSCAN issues: memory and predict #644

Closed · ghost opened this issue Jul 27, 2022 · 8 comments

ghost commented Jul 27, 2022

Hello everyone,

First issue: memory

cuml.manifold.UMAP crashes with the following error every time I fit_transform on more than 1,500,000 documents.

```
2022-07-27 15:40:09,788 - BERTopic - Transformed documents to Embeddings
Traceback (most recent call last):
  File "/home/natethegreat/bertopic/bertopic_model_cuml.py", line 17, in <module>
    topic_model.fit(docs)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 237, in fit
    self.fit_transform(documents, embeddings, y)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 313, in fit_transform
    umap_embeddings = self._reduce_dimensionality(embeddings, y)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 2070, in _reduce_dimensionality
    umap_embeddings = self.umap_model.transform(embeddings)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/manifold/umap.pyx", line 730, in cuml.manifold.umap.UMAP.transform
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /home/natethegreat/miniconda3/envs/torchrapids/include/rmm/mr/device/cuda_memory_resource.hpp
Segmentation fault
```

The same error does not occur if I use the CPU versions of UMAP and HDBSCAN.
My understanding is that it happens because the amount of dedicated GPU memory is much smaller than regular RAM (8 GB vs. 128 GB in my case).

Are there any options to circumvent this issue? For example, could the process be split into smaller batches, or could shared GPU memory be used?
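
For illustration, the batching idea might look like the following rough, untested sketch (`batched_umap`, `fit_size`, and `batch_size` are invented names and numbers, not an existing API; it assumes the embeddings are a host-side NumPy array):

```python
import numpy as np
from cuml.manifold import UMAP

def batched_umap(embeddings, fit_size=500_000, batch_size=100_000):
    # Fit on a random subsample small enough for GPU memory...
    umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
    sample = np.random.choice(len(embeddings), size=min(fit_size, len(embeddings)), replace=False)
    umap_model.fit(embeddings[sample])
    # ...then project the full corpus in chunks to cap peak GPU usage.
    chunks = [umap_model.transform(embeddings[i:i + batch_size])
              for i in range(0, len(embeddings), batch_size)]
    return np.vstack(chunks)
```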

Also, a model trained on fewer than 1,500,000 documents shows two odd behaviors:

  • The resulting topics often contain duplicated words, e.g. "14_gold_gold money_silver_money gold"
  • Regardless of whether training succeeds, the message "Segmentation fault" is printed at the end

Second issue: model.predict

With a model trained using the GPU-accelerated versions of UMAP and HDBSCAN, running model.transform([sentence]) raises the following error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 404, in transform
    predictions = self.hdbscan_model.predict(umap_embeddings)
  File "cuml/common/base.pyx", line 269, in cuml.common.base.Base.__getattr__
AttributeError: predict
```

Thank you in advance, regardless of whether anything comes of this.

Code:

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN   # GPU-accelerated HDBSCAN
from cuml.manifold import UMAP     # GPU-accelerated UMAP
from sklearn.feature_extraction.text import CountVectorizer
import pickle

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=100)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    verbose=True,
    calculate_probabilities=False,
    low_memory=True,
)

docs = pickle.load(open("docs_0.35.pkl", "rb"))  # <-- 220 MB, approx. 1.6M samples

topic_model.fit(docs)

pickle.dump(topic_model, open("bert_model_2.0.pkl", "wb"))
```

nvidia-smi output: (screenshot attached in the original issue)

pepi99 commented Jul 29, 2022

Hello, I'm running into the same problem as the second issue (model.predict).

Did you manage to solve it?

ghost (Author) commented Jul 29, 2022

Sadly not. Still waiting for someone to bail us out.

beckernick (Contributor) commented Jul 29, 2022

The first issue is likely an out-of-memory error. Though UMAP should not spike memory too significantly, it's possible that its requirements plus other data on your GPU are causing an OOM. UMAP creates a sparse KNN graph on the order of n_samples * n_neighbors, so reducing the number of neighbors will reduce the memory needed for that data structure.
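
For example, a minimal sketch of that parameter change (the value 5 is an arbitrary illustration; lower values trade some global structure for memory):

```python
from cuml.manifold import UMAP

# Lowering n_neighbors shrinks the sparse KNN graph, which grows on the
# order of n_samples * n_neighbors (5 here instead of the default 15).
umap_model = UMAP(n_components=5, n_neighbors=5, min_dist=0.0)
```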

The predict error is due to cuML's HDBSCAN not yet supporting approximate_predict. Please see #647 (comment) for more context.

pepi99 commented Jul 31, 2022

Yes, as beckernick said, just remove the hdbscan_model=... argument when you initialize your BERTopic model; no problems after that! :)
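
A sketch of this workaround, reusing the parameters from the original post:

```python
from bertopic import BERTopic
from cuml.manifold import UMAP

# Keep cuML's UMAP for the GPU speedup, but drop the hdbscan_model argument
# so BERTopic falls back to its default CPU HDBSCAN, which supports
# approximate_predict and therefore .transform on new documents.
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
topic_model = BERTopic(umap_model=umap_model, verbose=True, calculate_probabilities=False)
```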

ghost (Author) commented Aug 1, 2022

Thank you guys!

sebastien-mcrae commented Sep 26, 2022

approximate_predict function for HDBSCAN: rapidsai/cuml#4872
Nightly Release: https://github.com/rapidsai/cuml/releases/tag/v22.10.00a

I installed the nightly release, but I'm still getting the error mentioned above... Any ideas?

MaartenGr (Owner) commented

@sebastien-mcrae Support for cuML's HDBSCAN as a one-to-one replacement for the CPU HDBSCAN is not yet implemented in BERTopic, and it will take some time before it is fully supported.

In your case, when you use cuML's HDBSCAN, BERTopic recognizes it as a generic cluster model, not necessarily an HDBSCAN-like model. As such, it falls back to what is expected of cluster models in BERTopic, namely that they need .fit and .predict functions in order to work. A wrapper along the lines sketched below could bridge that gap in the meantime.
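
An untested sketch of such a wrapper (the class name `PredictableHDBSCAN` is made up, the `approximate_predict` import path should be verified against your cuML version, and it assumes a cuML build that ships rapidsai/cuml#4872):

```python
from cuml.cluster import HDBSCAN
from cuml.cluster.hdbscan import approximate_predict  # verify this path for your cuML version

class PredictableHDBSCAN(HDBSCAN):
    """cuML HDBSCAN exposing the .predict method BERTopic expects of cluster models."""

    def predict(self, X):
        # approximate_predict returns (labels, probabilities); BERTopic only needs labels.
        labels, _ = approximate_predict(self, X)
        return labels

# Usage sketch: prediction_data=True (assumed to mirror the CPU library's
# requirement) so that approximate_predict has what it needs after fitting.
# hdbscan_model = PredictableHDBSCAN(min_samples=10, prediction_data=True)
```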

MaartenGr (Owner) commented

Due to inactivity, and with the v0.13 release of BERTopic supporting more of cuML's native functions, I'll be closing this for now. If you have any questions or want to continue the discussion, I'll make sure to re-open the issue!
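
With BERTopic v0.13+ and a cuML release that includes `approximate_predict`, the pattern looks roughly like this (a sketch based on BERTopic's GPU documentation; exact parameter support may differ between versions):

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# prediction_data=True enables cuML's approximate_predict, which
# BERTopic's .transform relies on for assigning new documents.
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```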
