Train and Predict BERTopic #278

Closed
mjavedgohar opened this issue Oct 11, 2021 · 28 comments

@mjavedgohar

Hi @MaartenGr ,

As I understand it, fit_transform() trains the model while transform() is for prediction. Am I right?
What is the best way to train the model on data from different sources, e.g. Twitter, Reddit, Facebook comments, etc.? I want to train the model once and use it for various datasets.
Should I split the data into sentences, since some sources have very long comments (paragraphs), e.g. Reddit or news articles?

Thanks

@MaartenGr
Owner

You are correct. You can use fit_transform() to train the model. The transform() function is indeed used for prediction. Do note that fit_transform() not only trains the model but also predicts the data on which it was trained.

In practice, I would try to combine as many sources as possible before training the model. If you have various datasets, then you can simply combine them and train over all of them. However, if there is a specific reason for training on only a single dataset and predicting for all others, then that is also possible. I can imagine it could be computationally expensive to train on all datasets or that you only want the topics from a single source represented. In those cases, it should be fine to train on a single dataset, although training on all of them is preferred.

This depends on the content of the large paragraphs. If you feel, or assume, that those paragraphs may contain multiple topics, then I would advise splitting them up into sentences. You can use spaCy to do that. However, if you think that there is only a single topic in the large paragraph, then there is no need to split it up into sentences.

If possible, I would try training on the data without splitting it up into sentences and see if the topics make sense. If not, then a sentence splitter would be your next step.
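
As a rough sketch of the splitting step: the snippet below uses a naive regular-expression splitter from the standard library as a stand-in for spaCy's sentence segmentation (an assumption made to keep it dependency-free), and records which original document each sentence came from so that predicted topics can be mapped back. The helper name split_into_sentences and the sample texts are hypothetical.

```python
import re

# Naive stand-in for a real sentence splitter such as spaCy:
# split after '.', '!' or '?' when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(docs):
    """Return (sentences, doc_ids); doc_ids[i] is the index of the source document."""
    sentences, doc_ids = [], []
    for doc_id, doc in enumerate(docs):
        for sentence in _SENTENCE_END.split(doc.strip()):
            if sentence:  # skip empty fragments
                sentences.append(sentence)
                doc_ids.append(doc_id)
    return sentences, doc_ids


docs = [
    "The new phone has a great camera. Shipping took three weeks!",
    "Short comment",
]
sentences, doc_ids = split_into_sentences(docs)
print(sentences)
print(doc_ids)
```

The sentences list could then be passed to fit_transform() in place of the original documents, with doc_ids used to group the resulting topics per source document.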

@mjavedgohar
Author

@MaartenGr Thanks for your reply
Does that mean that once I have trained the model, I can save it and use transform() on other datasets, just like other ML models?
Is there any method to evaluate the trained BERTopic model?

@MaartenGr
Owner

Yes! You can train the model and save it for other datasets just like other ML models. Do note that it is important that the versions of packages stay the same when switching between environments. Most issues related to model loading can be solved by looking at the environment.

This is actually quite a complex subject. Although there are methods that you can employ for evaluation, such as c_v, they suffer from a number of issues. Topic modeling creates a highly subjective output, and evaluating that output is quite difficult. Do you focus on topic coherence, clustering capabilities, predictive power, or something else? Those questions, in part, are what make it difficult.

So while I am definitely not against evaluation metrics, I do think it is important to realize that they by no means represent a ground truth and can be misleading in some cases.

You can look towards Gensim or OCTIS for evaluation metrics/functions/libraries.
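
To make one such metric concrete, here is a minimal sketch of topic diversity (the share of unique words among the top words of all topics), which is among the metrics OCTIS offers. The code is a plain-Python illustration rather than the OCTIS implementation, the example topics are made up, and, like c_v, the score captures only one aspect of quality.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the first top_k words of every topic."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, each a ranked list of words.
example_topics = [
    ["vaccine", "dose", "booster"],
    ["mask", "vaccine", "school"],
]
score = topic_diversity(example_topics, top_k=3)
print(round(score, 3))
```

With a fitted BERTopic model, the word lists could be taken from get_topic(), which returns (word, score) pairs per topic.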

@mjavedgohar
Author

Hi @MaartenGr,

Thanks for your help. One more request:
I am using get_representative_docs() to get the representative docs, but it returns only three. Is there any way to get a required number n of docs for a specific topic?

Thanks again

@MaartenGr
Owner

There are several reasons for using a fixed value. First, the value needs to be equal to or lower than min_topic_size, otherwise issues may arise. Second, allowing for the top n can quickly lead to simply saving all documents in the model, which makes the model explode in size. Third, three documents should give you enough of an idea to understand what the topic is about. Any more than that is typically redundant. Fourth, whenever topics are reduced, the representative documents are simply put together. In other words, if you merge 4 topics, then the new topic will contain 3 times 4 = 12 representative documents. Increasing n will again lead to too many representative documents for a single topic.
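
If more example documents per topic are needed, one possible workaround is to group the training documents by the topic labels that fit_transform() returned. This is only a sketch: unlike get_representative_docs(), it does not rank documents by how representative they are, and the helper name docs_for_topic and the toy data are hypothetical.

```python
def docs_for_topic(docs, topics, topic_id, n=10):
    """Return up to n documents whose assigned topic equals topic_id."""
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    selected = [doc for doc, topic in zip(docs, topics) if topic == topic_id]
    return selected[:n]


# Toy stand-ins for the input and first output of fit_transform().
docs = ["cats purr", "dogs bark", "cats nap", "stocks fell", "cats play"]
topics = [0, 1, 0, -1, 0]
print(docs_for_topic(docs, topics, topic_id=0, n=2))
```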

@mjavedgohar
Author

mjavedgohar commented Oct 14, 2021

Hi @MaartenGr ,
It was working fine, but since this morning I am getting the following error when I try to load the BERTopic model in a Google Colab notebook. The error is at the line "from bertopic import BERTopic".
Can you please help me with this error?

ERROR:


TypeError Traceback (most recent call last)
in ()
19 import contractions
20
---> 21 from bertopic import BERTopic
22 from sklearn.feature_extraction.text import CountVectorizer

13 frames
/usr/local/lib/python3.7/dist-packages/distributed/config.py in ()
18
19 with open(fn) as f:
---> 20 defaults = yaml.load(f)
21
22 dask.config.update_defaults(defaults)

TypeError: load() missing 1 required positional argument: 'Loader'

@MaartenGr
Owner

This is an issue that quite randomly popped up. Fortunately, some fixes can be found here. Most likely, just running either pip uninstall distributed or pip install distributed==2021.9.0 will fix your issue. Hopefully, I can get to the bottom of this and fix it in the next release.

@MaartenGr
Owner

@mjavedgohar A new version of BERTopic (v0.9.3) was released that should fix this issue and some others that should be helpful. You can install that version through pip install --upgrade bertopic. If you have any questions regarding this issue, release, or some other issue, please let me know!

@mjavedgohar
Author

mjavedgohar commented Oct 18, 2021

@MaartenGr Thanks for your help
I am following the steps below for training and predicting. Is this OK for topic modelling using BERTopic?
However, the prediction also includes the training docs. I want to predict on only new docs.

Training:

  1. Load docs/sentences
  2. Instantiate the BERTopic model by defining parameters
  3. fit_transform() for training, listed below
  4. Save model

topic_model = BERTopic(
    low_memory=True,
    calculate_probabilities=False,
    nr_topics="auto",
    verbose=False,
    embedding_model=model,  # using a pre-trained BERT model
    n_gram_range=(1, 3),
    vectorizer_model=CountVectorizer(
        ngram_range=(1, 3),
        stop_words=final_stop_words,
        min_df=0.05,
        max_df=0.90,
    ),
)

Prediction:

  1. Load new docs/sentences
  2. Load saved model
  3. transform() for prediction

@MaartenGr
Owner

Yes, it should be okay to train on your training docs and to predict on only new docs.

@mjavedgohar
Author

mjavedgohar commented Oct 20, 2021

@MaartenGr I am getting the same topics in prediction as in training using the above parameters. Can you please help me resolve this? In prediction I want to display the topics from the new docs only. Or do I have to call fit_transform() for every dataset?

@MaartenGr
Owner

If you are getting the same topics, then you are most likely predicting on the same documents as the ones you trained on. Typically, the workflow is something like this:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# We create a split between the documents that we train on and those that we predict
train_docs = fetch_20newsgroups(subset='train',  remove=('headers', 'footers', 'quotes'))['data']
test_docs = fetch_20newsgroups(subset='test',  remove=('headers', 'footers', 'quotes'))['data']

# Train the model only on the train_docs
topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L3-v2", verbose=True)
topics, probs = topic_model.fit_transform(train_docs)

# Predict topics for test_docs
predicted_topics, predicted_probs = topic_model.transform(test_docs)

@mjavedgohar
Author

mjavedgohar commented Oct 20, 2021

@MaartenGr Thanks for your help,
I used the same code you shared, but I am still getting the same topics when using 'topic_model.get_topic_info()'

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# We create a split between the documents that we train on and those that we predict
train_docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))['data']
test_docs = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))['data']

# Train the model only on the train_docs
topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L3-v2", verbose=True)
topics, probs = topic_model.fit_transform(train_docs)
topic_model.get_topic_info()

# Predict topics for test_docs
predicted_topics, predicted_probs = topic_model.transform(test_docs)
topic_model.get_topic_info()

@MaartenGr
Owner

Ah, when you use transform the model will not be trained. It will only predict which topics can be found in test_docs based on the topics trained on train_docs. This is the same with most models in general that have a transform or predict function. No changes will be made to the original model.

If you want to have new topics, then you need to re-train the model with all documents.
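
Because get_topic_info() always describes the fitted model, a summary of the new documents alone has to be built from the output of transform(). A minimal sketch, assuming predicted_topics is the list of topic ids returned for the new documents (the values below are made up):

```python
from collections import Counter

# Made-up topic ids, standing in for the first return value of transform().
predicted_topics = [3, 0, 3, -1, 0, 3]

# Count how often each topic occurs in the new documents only.
topic_counts = Counter(predicted_topics)

# Most frequent topics first; -1 is BERTopic's outlier label.
for topic_id, count in topic_counts.most_common():
    print(topic_id, count)
```

The keywords for any of these ids can then be looked up with topic_model.get_topic(topic_id).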

@mjavedgohar
Author

Hi @MaartenGr ,

I trained a BERTopic model on an HPC server and saved it. Now I am trying to load it in a Google Colab notebook for visualization, but I am getting the following error on topic_model.load("model name"):

ValueError: EOF: reading array data, expected 262144 bytes got 815

Can you please help resolve this issue?
What is the best way to train a model on an HPC server and visualize it?

Thanks

@MaartenGr
Owner

The most important thing when loading in a model is making sure that the environment is the same. So, make sure that the packages and versions used in the saving environment are the same as in the loading environment. For example, if you are using sentence-transformers v0.4.1 when saving the model, it is highly advised to use the same version when loading the model.
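
One way to make environment mismatches visible is to store the installed package versions next to the saved model and compare them before loading. The sketch below uses only the standard library (importlib.metadata, Python 3.8+); the package list, the helper names, and the recorded version values are assumptions chosen for illustration.

```python
from importlib import metadata

# Packages whose versions are assumed to matter for loading a BERTopic model.
PACKAGES = ["bertopic", "sentence-transformers", "umap-learn", "hdbscan", "numpy"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_mismatches(saved, current):
    """Return {package: (saved_version, current_version)} where they differ."""
    return {
        name: (saved[name], current.get(name))
        for name in saved
        if saved[name] != current.get(name)
    }


# Versions recorded in the saving environment (hypothetical values).
saved = {"sentence-transformers": "0.4.1", "numpy": "1.19.5"}
current = {"sentence-transformers": "2.0.0", "numpy": "1.19.5"}
print(version_mismatches(saved, current))
print(installed_versions(PACKAGES))
```

In practice, the dictionary from installed_versions() could be written to a JSON file when the model is saved on the server and read back in the loading environment.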

@mjavedgohar
Author

Hi @MaartenGr,

Thanks for your help. I am getting the following error when trying to visualize the topics over time. Can you please help me with this?

Code:
timestamps = review_data.timestamp.to_list()
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)

Error
ValueError: arrays must all be same length

@MaartenGr
Owner

I cannot be sure without having your entire code but it seems that topics, docs, and timestamps are not the same size.

@mjavedgohar
Author

hi @MaartenGr ,

Thanks for your help. I am using comments extracted from Reddit. Following is the code used to generate the topics.

Code:
docs = review_data.body.to_list()
docs=list(set(docs))

print("Embedding models")

#from flair.embeddings import TransformerDocumentEmbeddings
#Cbert_model = TransformerDocumentEmbeddings('digitalepidemiologylab/covid-twitter-bert-v2-mnli')#'digitalepidemiologylab/covid-twitter-bert-v2')
#embeddings = Cbert_model.embed(docs)

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2') #'digitalepidemiologylab/covid-twitter-bert-v2-mnli') #'all-mpnet-base-v2'

import umap

umap_model = umap.UMAP(
    n_neighbors=100,  # size of the neighbourhood
    n_components=10,  # dimensionality
    # The default value for min_dist (as used above) is 0.1. We will look at a range of values from 0.0 through to 0.99.
    min_dist=0.1,
    metric='cosine',
    low_memory=False,
)

import hdbscan

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=50,
    min_samples=1,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True,
)

topic_model = BERTopic(
    top_n_words=10,
    n_gram_range=(1, 3),
    calculate_probabilities=True,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
    verbose=True,
    embedding_model=model,
    vectorizer_model=CountVectorizer(
        ngram_range=(1, 3),
        stop_words=final_stop_words,
    ),
)

topics, probabilities = topic_model.fit_transform(docs)

timestamps = review_data.timestamp.to_list()
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)

Error:
File "Bert_topic_customized2.py", line 305, in <module>
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/_bertopic.py", line 447, in topics_over_time
documents = pd.DataFrame({"Document": docs, "Topic": topics, "Timestamps": timestamps})
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 119, in arrays_to_mgr
index = _extract_index(arrays)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 635, in _extract_index
raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

@MaartenGr
Owner

Yes, as I mentioned before it seems that timestamps are a different size from docs and topics.

You are taking the set here:

docs = review_data.body.to_list()
docs = list(set(docs))

Which most likely reduces the number of docs and may shuffle the documents. Then, you take

timestamps = review_data.timestamp.to_list()

Which is larger than your docs. Thus, make sure that your docs and timestamps have the same size and that each index corresponds to one another: if there are 10_000 documents in docs, there should be 10_000 entries in timestamps, and index 0 of docs should correspond to index 0 of timestamps.
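
A minimal sketch of removing duplicate documents while keeping docs and timestamps aligned, using toy data in place of review_data. It keeps the first occurrence of each document together with its timestamp; the helper name dedupe_aligned is hypothetical.

```python
def dedupe_aligned(docs, timestamps):
    """Drop repeated documents, keeping each first occurrence and its timestamp."""
    if len(docs) != len(timestamps):
        raise ValueError("docs and timestamps must have the same length")
    seen = set()
    unique_docs, unique_timestamps = [], []
    for doc, timestamp in zip(docs, timestamps):
        if doc not in seen:
            seen.add(doc)
            unique_docs.append(doc)
            unique_timestamps.append(timestamp)
    return unique_docs, unique_timestamps


docs = ["first post", "second post", "first post"]
timestamps = ["2021-01-01", "2021-01-02", "2021-01-03"]
docs, timestamps = dedupe_aligned(docs, timestamps)
print(docs)
print(timestamps)
```

With a pandas DataFrame, a similar effect can be had by calling drop_duplicates(subset="body") on the frame before extracting both columns, so that the two lists come from the same rows.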

@mjavedgohar
Author

Thanks @MaartenGr, it worked.
Just another thing to discuss. I am getting the following error when the number of topics is very low. Can visualize the topics >=2

fig = topic_model.visualize_topics()
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/_bertopic.py", line 909, in visualize_topics
height=height)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/plotting/topics.py", line 63, in visualize_topics
embeddings = UMAP(n_neighbors=2, n_components=2, metric='hellinger').fit_transform(embeddings)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap_.py", line 2634, in fit_transform
self.fit(X, y)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap_.py", line 2554, in fit
self.raw_data[index], n_epochs, init, random_state, # JH why raw data?
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap_.py", line 2601, in fit_embed_data
self.verbose,
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap_.py", line 1060, in simplicial_set_embedding
metric_kwds=metric_kwds,
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/spectral.py", line 334, in spectral_layout
maxiter=graph.shape[0] * 5,
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1598, in eigsh
raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

@MaartenGr
Owner

That is more an issue with the number of topics than with the method itself. Typically, BERTopic would result in tens or hundreds of topics. Any fewer and you likely have too little data to work with, or you have set min_topic_size too high. I would advise trying to increase the number of topics, as that would most likely be the best representation of the data.

@mjavedgohar
Author

Thanks @MaartenGr,

If I run the following code on my PC it works fine, but on the HPC server I am getting an error with the same data. Can you please help me with this?

timestamps = review_data.timestamp.to_list()
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
topic_over_time=topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=100)
topic_over_time.write_html(filename.split('.')[0]+"_customize3_Model_topicovertime.html")

Error

topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)

File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/_bertopic.py", line 457, in topics_over_time
format=datetime_format)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 887, in to_datetime
values = convert_listlike(arg._values, format)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 408, in _convert_listlike_datetimes
allow_object=True,
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2193, in objects_to_datetime64ns
raise err
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2182, in objects_to_datetime64ns
allow_mixed=allow_mixed,
File "pandas/_libs/tslib.pyx", line 379, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 611, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 749, in pandas._libs.tslib._array_to_datetime_object
File "pandas/_libs/tslib.pyx", line 740, in pandas._libs.tslib._array_to_datetime_object
File "pandas/_libs/tslibs/parsing.pyx", line 257, in pandas._libs.tslibs.parsing.parse_datetime_string
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1368, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 643, in parse
raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: timestamp

@MaartenGr
Owner

If you run into issues when switching environments, it is most likely a package version issue. Did you make sure to use the same versions of packages between environments? Also, is the code exactly the same between environments?
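
The message "Unknown string format: timestamp" indicates that the literal string "timestamp" was among the values being parsed, which could happen, for example, if a header row ended up in the data on one machine. That cause is a guess. A small standard-library sketch for locating such entries, assuming timestamps in the format shown (the format, the helper name, and the sample values are assumptions):

```python
from datetime import datetime


def unparseable_entries(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (index, value) pairs for entries that do not match fmt."""
    bad = []
    for index, value in enumerate(timestamps):
        try:
            datetime.strptime(str(value), fmt)
        except ValueError:
            bad.append((index, value))
    return bad


# Sample values; the first mimics a stray header row.
timestamps = ["timestamp", "2021-10-11 09:30:00", "2021-10-12 14:05:10"]
print(unparseable_entries(timestamps))
```

Any rows reported this way would need to be removed from docs, topics, and timestamps together so that the three lists stay aligned.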

@mjavedgohar
Author

Hi @MaartenGr ,

I am trying to use Guided Topic Modeling with the following code. It works fine in Colab notebooks, but I am getting an error on my local machine. I am using BERTopic 0.12.0. Can you please help me with this? Thanks

Code:

topic_model = BERTopic(language="english", verbose=True, seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)

Error:
topics, probs = topic_model.fit_transform(docs)
File "...\Local\Programs\Python\Python38\lib\site-packages\bertopic\_bertopic.py", line 344, in fit_transform
y, embeddings = self._guided_topic_modeling(embeddings)
File "...\Local\Programs\Python\Python38\lib\site-packages\bertopic\_bertopic.py", line 2376, in _guided_topic_modeling
embeddings[indices] = np.average([embeddings[indices], seed_topic_embeddings[seed_topic]], weights=[3, 1])
File "<__array_function__ internals>", line 5, in average
File "..\Local\Programs\Python\Python38\lib\site-packages\numpy\lib\function_base.py", line 407, in average
scl = wgt.sum(axis=axis, dtype=result_dtype)
File "..\Local\Programs\Python\Python38\lib\site-packages\numpy\core\_methods.py", line 47, in _sum
return umr_sum(a, axis, dtype, out, keepdims, initial, where)
TypeError: No loop matching the specified signature and casting was found for ufunc add

@MaartenGr
Owner

MaartenGr commented Dec 6, 2022

When you are working across different environments, there might be an issue with the packages that you have installed. I would advise starting from a completely fresh environment and re-installing everything there. From your traceback, it seems that NumPy might be the culprit here, so I would think that a fresh environment might solve the issue.

@elenacandellone

Hi @MaartenGr,

I have two datasets (train and test), and I would like to predict the topics for both, while fitting only the first one. This is my code:

eps = 1e-6
min_sample = 1

embedding_model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')


umap_model = UMAP(n_components=150, n_neighbors=50, random_state=42, metric="cosine")


hdbscan_model_arccos = HDBSCAN(
                            min_samples = min_sample,
                            min_cluster_size = 50, 
                            cluster_selection_epsilon = eps,
                            metric='cosine', algorithm = 'generic', cluster_selection_method = 'eom', 
                            prediction_data = True, core_dist_n_jobs=1)


vectorizer_model = CountVectorizer(vocabulary=vocab, 
                                max_features=10000,
                                stop_words = stopwords)


ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)


representation_model = MaximalMarginalRelevance(diversity=0.2)

topic_model= BERTopic(
        low_memory =True,
        language = 'spanish',
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model_arccos,
        vectorizer_model=vectorizer_model,
        ctfidf_model=ctfidf_model, 
        representation_model=representation_model,
        top_n_words = 20,
        n_gram_range = (1,3)
)

topics_train, probabilities_train = topic_model.fit_transform(docs_train, embeddings_train)

predicted_topics, predicted_probs = topic_model.transform(docs_test)

The problem is that, while doing the last step, I encountered the following error:
attribute error: no prediction data was generated

I guess the problem is related to the fact that the function approximate_predict of hdbscan is unaware of the parameter cluster_selection_epsilon.
Would you happen to have any idea on how to solve this issue?

Thanks in advance!

@MaartenGr
Owner

@elenacandellone Hmmm, it indeed seems to be related to HDBSCAN. If it is a bug with HDBSCAN that cannot be solved within that package, you can instead save the topic model as safetensors and then load it in. Saving with safetensors removes the underlying HDBSCAN and UMAP and does inference through the embeddings only. This should prevent the issue you are having.
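
A minimal sketch of that safetensors round trip, assuming a recent BERTopic version in which save() accepts serialization="safetensors". The helper name save_and_reload and the directory name in the usage comment are hypothetical; the embedding model name passed to save_embedding_model should be whichever model the topic model was trained with.

```python
def save_and_reload(topic_model, path, embedding_model_name):
    """Save a fitted BERTopic model as safetensors and load it back."""
    # Imported lazily so this sketch can be defined without bertopic installed.
    from bertopic import BERTopic

    topic_model.save(
        path,
        serialization="safetensors",
        save_ctfidf=True,
        # A string pointing to the sentence-transformers model to reload.
        save_embedding_model=embedding_model_name,
    )
    return BERTopic.load(path)


# Example usage with the model from the question (not executed here):
# loaded = save_and_reload(
#     topic_model,
#     "my_model_dir",
#     "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
# )
# predicted_topics, predicted_probs = loaded.transform(docs_test)
```

Since the reloaded model predicts from the embeddings alone, its assignments may differ somewhat from what HDBSCAN's own prediction would have produced.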
