Train and Predict BERTopic #278
You are correct: `fit_transform()` trains the model, while `transform()` is used for prediction.

In practice, I would try to combine as many sources as possible before training the model. If you have various datasets, then you can simply combine them and train over all of them. However, if there is a specific reason for training on only a single dataset and predicting for all others, then that is also possible. I can imagine it could be computationally expensive to train on all datasets, or that you only want the topics from a single source represented. In those cases, it should be fine to train on a single dataset, although training on all of them is preferred.

Whether to split up long comments depends on the content of the large paragraphs. If you feel, or assume, that those paragraphs may contain multiple topics, then I would advise splitting them up into sentences; you can use spaCy for that. However, if you think that there is only a single topic in a large paragraph, then there is no need to split it up. If possible, I would first try training on the data without splitting it into sentences and see if the topics make sense. If not, then a sentence splitter would be your next step.
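To illustrate the sentence-splitting step: spaCy's sentencizer is the robust option mentioned above; the snippet below is only a naive, dependency-free sketch of the same idea, and the regex and sample documents are my own, not from this thread.

```python
import re

def naive_sentence_split(paragraph):
    """Very rough splitter: break after ., !, or ? followed by whitespace.
    spaCy's sentencizer handles abbreviations, quotes, etc. far better."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]

# Turn each long comment into several sentence-level documents
docs = ["BERTopic works well. It clusters embeddings! Try it?"]
sentence_docs = [s for doc in docs for s in naive_sentence_split(doc)]
print(sentence_docs)  # → ['BERTopic works well.', 'It clusters embeddings!', 'Try it?']
```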
@MaartenGr Thanks for your reply
Yes! You can train the model and save it for other datasets, just like other ML models. Do note that it is important that the versions of packages stay the same when switching between environments; most issues related to model loading can be solved by looking at the environment.

Evaluation is actually quite a complex subject. Although there are evaluation methods that you can employ, and while I am definitely not against evaluation metrics, I do think it is important to realize that they by no means represent a ground truth and can be misleading in some cases. You can look towards Gensim or OCTIS for evaluation metrics/functions/libraries.
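BERTopic itself exposes `topic_model.save(path)` and `BERTopic.load(path)` for this. Since the answer stresses matching package versions between environments, here is a hedged, stdlib-only sketch of one way to persist a version manifest next to any pickled model; `save_with_manifest` and `load_with_manifest` are my own names, and a plain dict stands in for the model.

```python
import json
import os
import pickle
import tempfile
from importlib import metadata

def save_with_manifest(model, directory, packages=("bertopic", "hdbscan")):
    """Pickle a model and record the package versions it was trained under."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    with open(os.path.join(directory, "model.pkl"), "wb") as f:
        pickle.dump(model, f)
    with open(os.path.join(directory, "manifest.json"), "w") as f:
        json.dump(versions, f)

def load_with_manifest(directory):
    """Load the model and the manifest so the environments can be compared."""
    with open(os.path.join(directory, "manifest.json")) as f:
        manifest = json.load(f)
    with open(os.path.join(directory, "model.pkl"), "rb") as f:
        return pickle.load(f), manifest

with tempfile.TemporaryDirectory() as tmp:
    save_with_manifest({"topics": [0, 1, 2]}, tmp)
    model, manifest = load_with_manifest(tmp)
print(model["topics"])  # → [0, 1, 2]
```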
Hi @MaartenGr, Thanks for your help. One more request. Thanks again
There are several reasons for using a fixed value. First, the value needs to be equal to or lower than
Hi @MaartenGr, I am getting the following error:

```
TypeError                                 Traceback (most recent call last)
13 frames
TypeError: load() missing 1 required positional argument: 'Loader'
```
This is an issue that quite randomly popped up. Fortunately, some fixes can be found here. Most likely, just running either
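For context, the `load() missing 1 required positional argument: 'Loader'` error comes from newer PyYAML versions making the `Loader` argument mandatory. If you hit the same error in your own code, the general remedy looks like this sketch (the YAML string here is just an example of mine):

```python
import yaml

config = "model: bertopic\nversion: '0.9.3'"

# Old style, broken on PyYAML >= 6: yaml.load(config)
# Either pass a Loader explicitly...
data = yaml.load(config, Loader=yaml.SafeLoader)
# ...or use safe_load, which supplies a safe Loader for you
assert data == yaml.safe_load(config)
print(data)  # → {'model': 'bertopic', 'version': '0.9.3'}
```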
@mjavedgohar A new version of BERTopic (v0.9.3) was released that should fix this issue and some others that should be helpful. You can install that version through
@MaartenGr Thanks for your help.

Training:

```python
topic_model = BERTopic(low_memory=True, ...)
```

Prediction:
Yes, it should be okay to train on your training docs and then predict topics for the new docs only.
@MaartenGr I am getting the same topics in prediction as in training using the above parameters. Can you please help me resolve this? In prediction, I want to display the topics from the new docs only. Or do I have to run fit_transform() for every dataset?
If you are getting the same topics, then you are most likely predicting the same documents as the ones you trained on. Typically, the workflow is something like this:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# We create a split between the documents that we train on and those that we predict
train_docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))['data']
test_docs = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))['data']

# Train the model on only the train_docs
topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L3-v2", verbose=True)
topics, probs = topic_model.fit_transform(train_docs)

# Predict topics for test_docs
predicted_topics, predicted_probs = topic_model.transform(test_docs)
```
@MaartenGr Thanks for your help. I ran the following:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# We create a split between the documents that we train on and those that we predict
train_docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))['data']
test_docs = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))['data']

# Train the model on only the train_docs
topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L3-v2", verbose=True)
topics, probs = topic_model.fit_transform(train_docs)

# Predict topics for test_docs
predicted_topics, predicted_probs = topic_model.transform(test_docs)
```
Ah, when you use `transform()` the model can only assign the topics that were created during training; it does not generate any new topics for unseen documents. If you want to have new topics, then you need to re-train the model with all documents.
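To make the distinction concrete, here is a toy model of my own (a nearest-centroid stand-in, not BERTopic's actual algorithm) showing why `transform()` can only ever return topics that already exist in the fitted model:

```python
class ToyTopicModel:
    """Toy fit/transform pair mimicking the BERTopic API shape."""

    def fit_transform(self, values):
        # One "topic" per unique value, purely for illustration
        self.centroids = sorted(set(values))
        return [self.centroids.index(v) for v in values]

    def transform(self, values):
        # Assign each new value to the closest existing centroid;
        # note that no new topics are ever created here
        return [min(range(len(self.centroids)),
                    key=lambda i: abs(self.centroids[i] - v))
                for v in values]

model = ToyTopicModel()
print(model.fit_transform([0.1, 0.9, 0.1]))  # → [0, 1, 0]
print(model.transform([0.85, 0.05]))         # → [1, 0]
```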
Hi @MaartenGr, I trained a BERTopic model on an HPC (server) and saved it. Now I am trying to load it in a Google Colab notebook for visualization, but I am getting the following error on topic_model.load("model name"):

```
ValueError: EOF: reading array data, expected 262144 bytes got 815
```

Can you please help resolve this issue? Thanks
The most important thing when loading a model is making sure that the environment is the same. So, make sure that the packages and versions used in the saving environment are the same as in the loading environment. For example, if you are using sentence-transformers v0.4.1 when saving the model, it is highly advised to use the same version when loading it.
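A quick way to compare environments is to print the versions in each one and diff the output. This sketch uses only the standard library; the package list is just an example and `environment_report` is my own helper name, not a BERTopic function:

```python
from importlib import metadata
import platform

def environment_report(packages):
    """Map each package name to its installed version, or None if absent."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report

# Run this in both the saving and the loading environment, then compare
print(environment_report(["bertopic", "sentence-transformers", "hdbscan", "umap-learn"]))
```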
Hi @MaartenGr, Thanks for your help. I am getting the following error when trying to visualize the topics over time. Can you please help me with this?

Code:

Error:
I cannot be sure without having your entire code, but it seems that your `docs` and `timestamps` do not have the same length.
Hi @MaartenGr, Thanks for your help. I am using comments extracted from Reddit. The following is the code used to generate the topics.

Code:

```python
print("Embedding models")
# from flair.embeddings import TransformerDocumentEmbeddings
from sentence_transformers import SentenceTransformer

import umap
umap_model = umap.UMAP(n_neighbors=100,  # size of neighbour
                       ...)

import hdbscan
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=50,
                                ...)

topic_model = BERTopic(top_n_words=10,
                       ...)

topics, probabilities = topic_model.fit_transform(docs)
timestamps = review_data.timestamp.to_list()
```

Error:
Yes, as I mentioned before, it seems that your `docs` and `timestamps` do not have the same length. You are taking the set here:

```python
docs = review_data.body.to_list()
docs = list(set(docs))
```

which most likely reduces the number of `docs`, while `timestamps` keeps its original length, which is larger than your deduplicated `docs`.
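If deduplication is wanted, `docs` and `timestamps` have to be filtered together so their lengths stay equal. A stdlib sketch of mine (with a pandas DataFrame, `review_data.drop_duplicates(subset='body')` achieves the same thing):

```python
def dedupe_aligned(docs, timestamps):
    """Drop duplicate documents while keeping the paired timestamps aligned."""
    seen = set()
    kept_docs, kept_ts = [], []
    for doc, ts in zip(docs, timestamps):
        if doc not in seen:
            seen.add(doc)
            kept_docs.append(doc)
            kept_ts.append(ts)
    return kept_docs, kept_ts

docs, timestamps = dedupe_aligned(["good", "bad", "good"], [1, 2, 3])
print(docs, timestamps)  # → ['good', 'bad'] [1, 2]
```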
Thanks @MaartenGr, it worked!

```python
fig = topic_model.visualize_topics()
```
That is more an issue with the number of topics than with the method itself. Typically, BERTopic results in tens or hundreds of topics. Any fewer and you likely have too little data to work with, or you have set the
Thanks @MaartenGr. If I run the following code on my PC it works fine, but on the HPC (server) I am getting an error with the same data. Can you please help me with this?

```python
timestamps = review_data.timestamp.to_list()
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
```

Error:

```
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/_bertopic.py", line 457, in topics_over_time
```
If you run into issues when switching environments, it is most likely a version control issue. Did you make sure to use the same versions of packages between environments? Also, is the code exactly the same between environments?
Hi @MaartenGr, I am trying to use Guided Topic Modeling with the following code. It works fine in Colab notebooks, but I am getting an error on my local machine. I am using BERTopic 0.12.0. Can you please help me with this? Thanks

Code:

```python
topic_model = BERTopic(language="english", verbose=True, seed_topic_list=seed_topic_list)
```

Error:
When you are working across different environments, there might be an issue with the packages that you have installed. I would advise starting from a completely fresh environment and re-installing everything there. From your code, it seems that NumPy might be the culprit here, so I would think that a fresh environment might solve the issue.
Hi @MaartenGr, I have two datasets (train and test), and I would like to predict the topics for both while fitting only on the first one. This is my code:

The problem is that, during the last step, I encountered the following error:

I guess the problem is related to the fact that the function `approximate_predict` of HDBSCAN is unaware of the parameter `cluster_selection_epsilon`. Thanks in advance!
@elenacandellone Hmmm, it does indeed seem to be related to HDBSCAN. If it is a bug with HDBSCAN that cannot be solved within that package, you can instead save the topic model as
Hi @MaartenGr,

As I understand BERTopic, fit_transform() trains the model while transform() is for prediction. Am I right?

What is the best method to train the model on data from different sources, e.g. Twitter, Reddit, or Facebook comments? I want to train the model once and use it for various datasets.

Should I divide the data into sentences, given that some sources have very long comments (paragraphs), e.g. Reddit or news articles?

Thanks