GPU accelerated UMAP and HDBSCAN issues: memory and predict #644
Comments
Hello, I get the same problem as the second issue (model.predict). Did you manage to solve it?
Sadly not. Still waiting for someone to bail us out.
The first issue is likely an out-of-memory error. Though UMAP should not spike memory too significantly, it's possible that its requirements plus other data on your GPU are causing an OOM. UMAP creates a sparse KNN graph on the order of
The predict error is due to cuML's HDBSCAN not yet supporting approximate_predict. Please see #647 (comment) for more context.
Yes, as beckernick said: just remove the hdbscan=... argument when you initialize your BERTopic model. No problems after that! :)
Thank you, guys!
approximate_predict function for HDBSCAN: rapidsai/cuml#4872. I installed the nightly release but I'm still getting the error mentioned above... Any ideas?
@sebastien-mcrae Support for cuML's HDBSCAN as a 1-on-1 replacement for the CPU HDBSCAN is not yet implemented in BERTopic, and it will take some time before it is fully implemented. In your case, when you use cuML's HDBSCAN, BERTopic recognizes it as a generic cluster model, not necessarily an HDBSCAN-like model. As such, it defaults back to what is expected from cluster models in BERTopic, namely that it needs a predict method.
Due to inactivity and with the v0.13 release of BERTopic supporting more native functions of cuML, I'll be closing this for now. If you have any questions or want to continue the discussion, I'll make sure to re-open the issue! |
Hello everyone,
First issue: memory
cuml.manifold.UMAP crashes with the following error whenever I call fit_transform on more than 1,500,000 documents.
2022-07-27 15:40:09,788 - BERTopic - Transformed documents to Embeddings
Traceback (most recent call last):
File "/home/natethegreat/bertopic/bertopic_model_cuml.py", line 17, in <module>
topic_model.fit(docs)
File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 237, in fit
self.fit_transform(documents, embeddings, y)
File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 313, in fit_transform
umap_embeddings = self._reduce_dimensionality(embeddings, y)
File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 2070, in _reduce_dimensionality
umap_embeddings = self.umap_model.transform(embeddings)
File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
ret_val = func(*args, **kwargs)
File "cuml/manifold/umap.pyx", line 730, in cuml.manifold.umap.UMAP.transform
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /home/natethegreat/miniconda3/envs/torchrapids/include/rmm/mr/device/cuda_memory_resource.hpp
Segmentation fault
The same error does not occur if I use the CPU versions of UMAP and HDBSCAN.
My understanding is that it happens because (obviously) the amount of dedicated GPU memory is much smaller than regular RAM (8 GB vs 128 GB in my case).
Are there any options to circumvent this issue? Could we, for example, split the process into smaller batches or use shared GPU memory?
Also, a model trained on fewer than 1,500,000 documents exhibits two weird behaviors:
Second issue: model.predict
With a model trained with GPU accelerated versions of umap and hdbscan running model.transform([sentence]) causes the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 404, in transform
predictions = self.hdbscan_model.predict(umap_embeddings)
File "cuml/common/base.pyx", line 269, in cuml.common.base.Base.__getattr__
AttributeError: predict
Thank you regardless of whether something comes out of this.
code:
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
from sklearn.feature_extraction.text import CountVectorizer
import pickle

# GPU-accelerated dimensionality reduction and clustering from cuML
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

# Topic representation: unigrams and bigrams, English stop words removed
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=100)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model, verbose=True, calculate_probabilities=False, low_memory=True)

docs = pickle.load(open("docs_0.35.pkl", "rb"))  # <--- 220 MB, approx. 1.6M samples
topic_model.fit(docs)
pickle.dump(topic_model, open("bert_model_2.0.pkl", "wb"))
nvidia-smi: (output attached as a screenshot, not captured here)