Memory inefficient algorithm and getting error while saving the model #173
Hmmm, I am not familiar with this error unfortunately. Could you share your code so that I can see what is happening? Also, which version of BERTopic are you using and how did you install it? With respect to the memory issues, have you tried this or have you seen this thread? They both contain some suggestions on how to reduce memory issues. The c-TF-IDF matrix can also quickly grow in size if you use a large amount of data, so increasing the minimum frequency of words (by setting `min_df` in the `CountVectorizer`) can help:

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```
But, I am getting an error. It states that the object we are trying to save contains weakref Python objects. Such objects are not supported by Python's pickle module.
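For illustration (this is not the asker's code, just a minimal stand-in): an object that holds a `weakref` cannot be pickled, which is the class of error reported here:

```python
import pickle
import weakref

class Node:
    pass

obj = Node()
obj.ref = weakref.ref(obj)  # attribute holding a weak reference

try:
    pickle.dumps(obj)
except TypeError as err:
    # pickle refuses weakref objects with a TypeError
    print(type(err).__name__)
```

Anything reachable from the saved object (here, via its `__dict__`) must itself be picklable, which is why a single weak reference buried inside a model breaks `save`.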
And one more thing: how much RAM is required for 2 million data points? What is the minimum?
I am using version 0.8.1 of BERTopic.
Unfortunately, that is not how it works. If you have 2 million documents, then the size of each document matters greatly. Not only that, but changing any of the parameters also has an impact on the required RAM. With most algorithms, it is not possible to give an estimate of how much RAM is needed for a specific dataset. Having said that, make sure to follow the tips in previous posts in this thread to reduce the necessary RAM.
It would also help if you could share your code. Perhaps specific settings or parameters might have caused this issue. By understanding your workflow I might be able to pinpoint your issue and help you resolve it.
My total data size for 2 million documents is only 535 MB. |
I am sharing my code snippet with you. In the last line, I am getting the error.
As I mentioned before, I cannot give you a value of RAM that is minimally necessary. It also depends on the parameters, average and max number of words in a document, the vocabulary size, etc. I would advise you to follow the tips above to reduce the necessary RAM.
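As a rough illustration of why a total is hard to predict but a lower bound is easy to compute: the dense embedding matrix alone is already sizeable. The numbers below are illustrative assumptions, not from the thread (384-dimensional float32 sentence embeddings, typical of MiniLM-style models):

```python
# Back-of-the-envelope size of the dense embedding matrix alone
n_docs = 2_000_000
dims = 384           # e.g. a MiniLM-style sentence embedding
bytes_per_value = 4  # float32

gib = n_docs * dims * bytes_per_value / 2**30
print(f"{gib:.2f} GiB")
```

UMAP and HDBSCAN then allocate their own working memory on top of this, which is why the total depends so heavily on the parameters.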
After some tests in a Google Colab session, I can replicate this issue by setting …
Can you advise how I can choose parameters that take less RAM and still provide meaningful results? Can you share some resources or links? I also want to choose the best parameters for UMAP and HDBSCAN.
And one more thing: I understand it depends on the parameters, but no topic modeling algorithm should take 256 GB of RAM to train. And even after training on that, I did not succeed. My advice is to make the algorithm more optimized. Your algorithm performs well compared to others, but it lacks memory optimization. After the first step, transforming documents to embeddings, it takes the whole RAM and the session crashes.
The FAQ in the documentation gives some pointers on how to reduce the necessary memory if you use a large amount of data. You can find that link here. Also, you can find some help already in this thread. There are several other ways to perform computation with large datasets. First, you can set … Second, setting … Third, you can set the minimum frequency of words in the CountVectorizer class to reduce the size of the resulting sparse c-TF-IDF matrix. You can do this as follows:

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```

The `min_df` parameter indicates the minimum frequency of words. Setting this value larger than 1 can significantly reduce memory.
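To make the effect concrete, here is a small self-contained sketch (the toy documents are my own, not from the thread) of how `min_df` prunes the vocabulary that would otherwise feed the c-TF-IDF matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran", "a rare word appears once here"]

# Default: every token (of 2+ characters) enters the vocabulary
full = CountVectorizer().fit(docs)

# min_df=2: keep only words occurring in at least 2 documents
pruned = CountVectorizer(min_df=2).fit(docs)

print(len(full.vocabulary_), len(pruned.vocabulary_))  # 9 2
```

On real corpora the long tail of rare words dominates the vocabulary, so the reduction (and the memory saved on the sparse matrix) is usually far more dramatic than in this toy case.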
I set each and every parameter manually so that it consumes less RAM, but I can see no improvement in RAM management. And I selected all parameters according to the dataset size, i.e. 2.5 lakh (250,000) documents.
The problem is with your UMAP step: after the embedding transformation, it takes all the RAM.
This happens because you are using a custom UMAP model, which overrides BERTopic's `low_memory` setting. Make sure to set it on the UMAP model itself:

```python
umap_model = UMAP(n_neighbors=300, n_components=2, min_dist=0.0, metric='cosine', low_memory=True)
```
But also, when we are not using a custom UMAP and instead pass this parameter to the BERTopic class, it consumes all the RAM.
As per your suggestion, I tried setting …
I just tried the following in a Google Colab session that has 13 GB RAM available and uses a Tesla T4. The data used was retrieved here. BERTopic was trained on a sample of 300,000 documents. Here, I made sure to use …:

```python
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Keep the vocabulary (and hence the c-TF-IDF matrix) small
vectorizer_model = CountVectorizer(min_df=10)

# Sample 300,000 headlines
df = pd.read_csv("abcnews-date-text.csv")
df_small = df.sample(300_000)
docs = df_small.headline_text.tolist()

topic_model = BERTopic(verbose=True, vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(docs)
```

Having said that, if I were to go with more documents, I would run into memory issues myself. I would suggest using the above settings and simply using a machine with more RAM. If that does not work, then I suggest taking a smaller sample of your data and simply predicting the topics of the data that was left out. In practice, BERTopic does not need millions of data points to create a good model. Simply using a sample of a few hundred thousand documents should do the trick.
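The sample-then-predict strategy can be sketched as follows. The `topic_model.fit`/`transform` calls are shown as comments since they require a trained embedding model; the split itself is plain Python, and the document list and sizes are made up for illustration:

```python
import random

random.seed(0)
docs = [f"document {i}" for i in range(1_000_000)]

# Train on a random sample, predict topics for the rest
sample_idx = set(random.sample(range(len(docs)), 300_000))
sample = [d for i, d in enumerate(docs) if i in sample_idx]
held_out = [d for i, d in enumerate(docs) if i not in sample_idx]

# topic_model.fit(sample)                       # fit BERTopic on the sample
# topics, _ = topic_model.transform(held_out)   # assign topics to the rest

print(len(sample), len(held_out))  # 300000 700000
```

Fitting on the sample bounds the memory used by UMAP and HDBSCAN, while `transform` on the held-out documents is comparatively cheap and can be done in batches.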
Hi @MaartenGr, thanks for the package! I am also trying to run BERTopic on a dataset with roughly 1M sentences. Because I was having memory issues, I used your sampling strategy above, but the model seems to run forever. Do you have an idea of how long it should take on a Colab Pro notebook with GPU enabled? Also, I am using pre-trained sentence embeddings from the 'paraphrase-MiniLM-L6-v2' SentenceTransformer model and passing the full embeddings array to `fit_transform`.
@Attol8 Hmmm, it should not take that long. Could you share your entire code? Also, did you try it with …?
Yeah, sure. However, I cannot share the data as it is proprietary.

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model with SentenceTransformers
model_name = 'paraphrase-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

# Sample sentences from the main df and truncate long ones
df_small = aa_df.sample(300000)
tasks_texts = list(df_small.ad_text.values)
tasks_texts = [sentence[0:128*10] for sentence in tasks_texts]

# Compute embeddings for all sentences
corpus_embeddings = model.encode(tasks_texts, convert_to_tensor=True, show_progress_bar=True)

# BERTopic, using the precomputed embeddings
vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(low_memory=True, verbose=True, calculate_probabilities=False, vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(tasks_texts, np.array(corpus_embeddings.cpu()))
```

I did set … Thanks for the fast response tho! Love this package and all your Data Science work!
Perhaps it is the way you are creating the embeddings; there might be an issue with the conversion to tensors. A more minimal example might improve your situation. Could you try it out like this:

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Sample sentences from the main df and truncate long ones
df_small = aa_df.sample(300000)
tasks_texts = list(df_small.ad_text.values)
tasks_texts = [sentence[0:128*10] for sentence in tasks_texts]

# Let BERTopic create the embeddings internally
vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L6-v2",
                       low_memory=True,
                       verbose=True,
                       calculate_probabilities=False,
                       vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(tasks_texts)
```

I should note that setting …
While we were talking, the algorithm finished running; it took a bit more than 2 hours. Maybe that's the expected time given my dataset. I have tried your code with … It was still stuck in the phase where it tries to reduce the dimensionality of the embeddings. The slow part really seems to be UMAP. I am not familiar with UMAP, but other solutions for reducing dimensionality may be faster. Anyway, I was able to create the topics, and 2 hours is not even that bad!
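Not part of the thread's resolution, but to illustrate the point about alternative reducers: a linear method such as PCA is far cheaper than UMAP, at the cost of usually weaker cluster separation. A minimal, self-contained sketch on random stand-in "embeddings":

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for sentence embeddings (10k docs, 384 dims)
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10_000, 384)).astype(np.float32)

# Reduce to 5 dimensions (the dimensionality BERTopic targets by default)
reduced = PCA(n_components=5).fit_transform(embeddings)
print(reduced.shape)  # (10000, 5)
```

PCA runs in seconds here, but because it is linear it tends to preserve less of the local neighborhood structure that HDBSCAN relies on, so topic quality may suffer.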
Glad to hear that it finished running. Perhaps I wasn't clear, but if you have enough RAM available, setting … One other thing to try is starting from a fresh environment and making sure you have the newest versions of BERTopic and UMAP installed.
I was trying to train on 20 lakh (2 million) data points and tried lots of GPU instances on AWS: 16 GB, 32 GB, 64 GB, and 256 GB of RAM. All but the last failed to train. On 256 GB of RAM, training succeeded, but I was unable to save the model. Below is the line that produced the error while saving:

```python
topic_model.save("topic_model_all_20L.pt", save_embedding_model=False)
```