Memory inefficient algorithm and getting error while saving the model #173

Closed
makkarss929 opened this issue Jul 6, 2021 · 24 comments

@makkarss929

makkarss929 commented Jul 6, 2021

I was trying to train on 20 lakh (2 million) data points and tried several GPU instances on AWS with 16 GB, 32 GB, 64 GB, and 256 GB of RAM. All but the largest failed to train. On the 256 GB instance the model trained successfully, but I was unable to save it.

Below is the error I was getting while saving the model.

topic_model.save("topic_model_all_20L.pt",save_embedding_model=False)

KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in save(self, key, data)
    482             # If key already exists, we will overwrite the file
--> 483             data_name = overloads[key]
    484         except KeyError:
KeyError: ((array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C), array(float32, 2d, C), type(CPUDispatcher(<function alternative_cosine at 0x7f3c3ca174d0>)), array(int64, 1d, C), float64), ('x86_64-unknown-linux-gnu', 'cascadelake', '+64bit,+adx,+aes,+avx,+avx2,-avx512bf16,-avx512bitalg,+avx512bw,+avx512cd,+avx512dq,-avx512er,+avx512f,-avx512ifma,-avx512pf,-avx512vbmi,-avx512vbmi2,+avx512vl,+avx512vnni,-avx512vpopcntdq,+bmi,+bmi2,-cldemote,+clflushopt,+clwb,-clzero,+cmov,+cx16,+cx8,-enqcmd,+f16c,+fma,-fma4,+fsgsbase,+fxsr,-gfni,+invpcid,-lwp,+lzcnt,+mmx,+movbe,-movdir64b,-movdiri,-mwaitx,+pclmul,-pconfig,+pku,+popcnt,-prefetchwt1,+prfchw,-ptwrite,-rdpid,+rdrnd,+rdseed,-rtm,+sahf,-sgx,-sha,-shstk,+sse,+sse2,+sse3,+sse4.1,+sse4.2,-sse4a,+ssse3,-tbm,-vaes,-vpclmulqdq,-waitpkg,-wbnoinvd,-xop,+xsave,+xsavec,+xsaveopt,+xsaves'), ('308c49885ad3c35a475c360e21af1359caa88c78eb495fa0f5e8c6676ae5019e', 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'))
During handling of the above exception, another exception occurred:
TypeError                                 Traceback (most recent call last)
<ipython-input-25-32c887ac8b59> in <module>
      1 # Saving model
----> 2 topic_model.save("topic_model_all_20L.pt",save_embedding_model=False)
      3 print("model saved")
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/bertopic/_bertopic.py in save(self, path, save_embedding_model)
   1201                 embedding_model = self.embedding_model
   1202                 self.embedding_model = None
-> 1203                 joblib.dump(self, file)
   1204                 self.embedding_model = embedding_model
   1205             else:
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in dump(value, filename, compress, protocol, cache_size)
    480             NumpyPickler(f, protocol=protocol).dump(value)
    481     else:
--> 482         NumpyPickler(filename, protocol=protocol).dump(value)
    483 
    484     # If the target container is a file object, nothing is returned.
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in dump(self, obj)
    435         if self.proto >= 4:
    436             self.framer.start_framing()
--> 437         self.save(obj)
    438         self.write(STOP)
    439         self.framer.end_framing()
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
    280             return
    281 
--> 282         return Pickler.save(self, obj)
    283 
    284 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    547 
    548         # Save the reduce() output and finally memoize the object
--> 549         self.save_reduce(obj=obj, *rv)
    550 
    551     def persistent_id(self, obj):
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    660 
    661         if state is not None:
--> 662             save(state)
    663             write(BUILD)
    664 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
    280             return
    281 
--> 282         return Pickler.save(self, obj)
    283 
    284 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    502         f = self.dispatch.get(t)
    503         if f is not None:
--> 504             f(self, obj) # Call unbound method with explicit self
    505             return
    506 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_dict(self, obj)
    857 
    858         self.memoize(obj)
--> 859         self._batch_setitems(obj.items())
    860 
    861     dispatch[dict] = save_dict
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in _batch_setitems(self, items)
    883                 for k, v in tmp:
    884                     save(k)
--> 885                     save(v)
    886                 write(SETITEMS)
    887             elif n:
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
    280             return
    281 
--> 282         return Pickler.save(self, obj)
    283 
    284 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    547 
    548         # Save the reduce() output and finally memoize the object
--> 549         self.save_reduce(obj=obj, *rv)
    550 
    551     def persistent_id(self, obj):
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    660 
    661         if state is not None:
--> 662             save(state)
    663             write(BUILD)
    664 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
    280             return
    281 
--> 282         return Pickler.save(self, obj)
    283 
    284 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    502         f = self.dispatch.get(t)
    503         if f is not None:
--> 504             f(self, obj) # Call unbound method with explicit self
    505             return
    506 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_dict(self, obj)
    857 
    858         self.memoize(obj)
--> 859         self._batch_setitems(obj.items())
    860 
    861     dispatch[dict] = save_dict
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in _batch_setitems(self, items)
    883                 for k, v in tmp:
    884                     save(k)
--> 885                     save(v)
    886                 write(SETITEMS)
    887             elif n:
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
    280             return
    281 
--> 282         return Pickler.save(self, obj)
    283 
    284 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    522             reduce = getattr(obj, "__reduce_ex__", None)
    523             if reduce is not None:
--> 524                 rv = reduce(self.proto)
    525             else:
    526                 reduce = getattr(obj, "__reduce__", None)
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/pynndescent/pynndescent_.py in __getstate__(self)
    900     def __getstate__(self):
    901         if not hasattr(self, "_search_graph"):
--> 902             self._init_search_graph()
    903         if not hasattr(self, "_search_function"):
    904             if self._is_sparse:
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/pynndescent/pynndescent_.py in _init_search_graph(self)
   1061                 self._distance_func,
   1062                 self.rng_state,
-> 1063                 self.diversify_prob,
   1064             )
   1065         reverse_graph.eliminate_zeros()
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
    431                     e.patch_message('\n'.join((str(e).rstrip(), help_msg)))
    432             # ignore the FULL_TRACEBACKS config, this needs reporting!
--> 433             raise e
    434 
    435     def inspect_llvm(self, signature=None):
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
    364                 argtypes.append(self.typeof_pyval(a))
    365         try:
--> 366             return self.compile(tuple(argtypes))
    367         except errors.ForceLiteralArg as e:
    368             # Received request for compiler re-entry with the list of arguments
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/compiler_lock.py in _acquire_compile_lock(*args, **kwargs)
     30         def _acquire_compile_lock(*args, **kwargs):
     31             with self:
---> 32                 return func(*args, **kwargs)
     33         return _acquire_compile_lock
     34 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/dispatcher.py in compile(self, sig)
    861                 raise e.bind_fold_arguments(folded)
    862             self.add_overload(cres)
--> 863             self._cache.save_overload(sig, cres)
    864             return cres.entry_point
    865 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in save_overload(self, sig, data)
    665         """
    666         with self._guard_against_spurious_io_errors():
--> 667             self._save_overload(sig, data)
    668 
    669     def _save_overload(self, sig, data):
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in _save_overload(self, sig, data)
    675         key = self._index_key(sig, _get_codegen(data))
    676         data = self._impl.reduce(data)
--> 677         self._cache_file.save(key, data)
    678 
    679     @contextlib.contextmanager
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in save(self, key, data)
    490                     break
    491             overloads[key] = data_name
--> 492             self._save_index(overloads)
    493         self._save_data(data_name, data)
    494 
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in _save_index(self, overloads)
    536     def _save_index(self, overloads):
    537         data = self._source_stamp, overloads
--> 538         data = self._dump(data)
    539         with self._open_for_write(self._index_path) as f:
    540             pickle.dump(self._version, f, protocol=-1)
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in _dump(self, obj)
    564 
    565     def _dump(self, obj):
--> 566         return pickle.dumps(obj, protocol=-1)
    567 
    568     @contextlib.contextmanager
TypeError: can't pickle weakref objects

makkarss929 changed the title from "Memory inefficient algorithm" to "Memory inefficient algorithm and getting error while saving the model" on Jul 6, 2021
@MaartenGr
Owner

Hmmm, I am not familiar with this error unfortunately. Could you share your code so that I can see what is happening? Also, which version of BERTopic are you using and how did you install it?

With respect to the memory issues, have you tried this or have you seen this thread? They both contain some suggestions on how to reduce memory issues.

The c-TF-IDF matrix can also quickly grow in size if you use a large amount of data, increasing the minimum frequency of words (by setting min_df) might already reduce the RAM necessary:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)

@makkarss929
Author

But I am getting

TypeError: can't pickle weakref objects

This means that the object we are trying to save contains weakref Python objects, which are not supported by Python's pickle module.

@makkarss929
Author

makkarss929 commented Jul 6, 2021

One more thing: how much RAM is needed for 2 million data points? What is the minimum?

@makkarss929
Author


I am using version 0.8.1 of BERTopic.

@MaartenGr
Owner

One more thing: how much RAM is needed for 2 million data points? What is the minimum?

Unfortunately, that is not how it works. If you have 2 million documents, then the size of each document matters greatly. Not only that, but changing any of the parameters also has an impact on the required RAM. With most algorithms, it is not possible to estimate how much RAM is needed for a specific dataset. Having said that, make sure to follow the tips in the previous posts in this thread to reduce the necessary RAM.

@MaartenGr
Owner

This means that the object we are trying to save contains weakref Python objects, which are not supported by Python's pickle module.

It would also help if you could share your code. Perhaps specific settings or parameters might have caused this issue. By understanding your workflow I might be able to pinpoint your issue and help you resolve it.

@makkarss929
Author


My total data size for 2 million documents is only 535 MB.

@makkarss929
Author

makkarss929 commented Jul 7, 2021

I am sharing my code snippet with you.

from bertopic import BERTopic

topic_model = BERTopic(low_memory=True, verbose=True, calculate_probabilities=False,
                       embedding_model="paraphrase-MiniLM-L12-v2", min_topic_size=50)
# fitting model with class labels
topic_model.fit(documents=train_docs_batch, y=train_targets_batch)

topic_model.save("topic_model.pt", save_embedding_model=False)

In the last line, I am getting the error.

@MaartenGr
Owner

My total data size for 2 million documents is only 535 MB.

As I mentioned before, I cannot give you a minimally necessary amount of RAM. It also depends on the parameters, the average and maximum number of words per document, the vocabulary size, etc. I would advise you to follow the tips above to reduce the necessary RAM.

In the last line, I am getting the error.

After some tests in a Google Colab session, I can replicate this issue by setting y=train_targets_batch. It seems that UMAP (pynndescent in particular) has an issue pickling the object. Unfortunately, there does not seem to be a fix at this moment. I would advise not setting the y parameter if you want to save the model.
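
For reference, a minimal sketch of that workaround, reusing the settings and the train_docs_batch variable from your earlier snippet (assumed to already be defined):

from bertopic import BERTopic

# Fit without the y parameter; in the tests above, the pickling error only appeared when y was set
topic_model = BERTopic(low_memory=True, verbose=True, calculate_probabilities=False,
                       embedding_model="paraphrase-MiniLM-L12-v2", min_topic_size=50)
topic_model.fit(documents=train_docs_batch)
topic_model.save("topic_model.pt", save_embedding_model=False)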

@makkarss929
Author

Can you advise how I can choose parameters that use less RAM and still provide meaningful results? Can you share some resources or links? I also want to choose the best parameters for UMAP and HDBSCAN.

@makkarss929
Author

makkarss929 commented Jul 7, 2021

One more thing: I understand it depends on the parameters, but no topic modeling algorithm should need 256 GB of RAM to train, and even on that machine it did not fully succeed. My advice is to make the algorithm more memory efficient.

Your algorithm performs well compared to others, but it lacks memory optimization.

After the first step, transforming documents to embeddings, is complete, it takes up the whole RAM and the session crashes.

@MaartenGr
Owner

Can you advise how I can choose parameters that use less RAM and still provide meaningful results? Can you share some resources or links? I also want to choose the best parameters for UMAP and HDBSCAN.

The FAQ in the documentation gives some pointers on how to reduce the necessary memory if you use a large amount of data. You can find that link here. Also, you can find some help already in this thread.

There are several other ways to perform computation with large datasets. First, you can set low_memory to True when instantiating BERTopic. This may prevent blowing up the memory in UMAP.

Second, setting calculate_probabilities to False when instantiating BERTopic prevents a huge document-topic probability matrix from being created. Moreover, HDBSCAN is quite slow when it tries to calculate probabilities on large datasets.

Third, you can set the minimum frequency of words in the CountVectorizer class to reduce the size of the resulting sparse c-TF-IDF matrix. You can do this as follows:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)

The min_df parameter is used to indicate the minimum frequency of words. Setting this value larger than 1 can significantly reduce memory.
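
Combining the three suggestions above, a minimal sketch could look like this (docs is assumed to be your list of documents; the exact values are illustrative):

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# 1) low_memory=True for UMAP, 2) skip the document-topic probability matrix,
# 3) min_df > 1 to keep the c-TF-IDF matrix small
vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(low_memory=True,
                       calculate_probabilities=False,
                       vectorizer_model=vectorizer_model,
                       verbose=True)
topics, _ = topic_model.fit_transform(docs)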

@makkarss929
Author

makkarss929 commented Jul 7, 2021

I set each and every parameter manually so that it consumes less RAM, but I can see no improvement in RAM usage. I selected all parameters according to the dataset size, i.e. 2.5 lakh (250,000) documents.

from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Setting UMAP model
umap_model = UMAP(n_neighbors=300, n_components=2, min_dist=0.0, metric='cosine')

# Setting HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=300, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Setting CountVectorizer model
vectorizer_model = CountVectorizer(min_df=300)

topic_model = BERTopic(low_memory=True, verbose=True, calculate_probabilities=False,
                       embedding_model="paraphrase-MiniLM-L12-v2", umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       vectorizer_model=vectorizer_model, min_topic_size=300)
# fitting model
topic_model.fit(documents=train_docs_batch, y=train_targets_batch)

@makkarss929
Author

makkarss929 commented Jul 7, 2021

The problem is with UMAP: after the embedding transformation finishes, UMAP consumes all the RAM.

@MaartenGr
Owner

The problem is with UMAP: after the embedding transformation finishes, UMAP consumes all the RAM.

This happens because you are using a custom UMAP model which overrides the low_memory=True parameter. Make sure that when you use a custom UMAP model you also manually set the low_memory parameter:

umap_model = UMAP(n_neighbors=300, n_components=2, min_dist=0.0, metric='cosine', low_memory=True)
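
For completeness, a small sketch of how that custom UMAP model would then be passed to BERTopic (the other parameters are kept from your snippet):

from umap import UMAP
from bertopic import BERTopic

# low_memory must be set on the UMAP instance itself, because a custom
# umap_model overrides BERTopic's own low_memory flag
umap_model = UMAP(n_neighbors=300, n_components=2, min_dist=0.0,
                  metric='cosine', low_memory=True)
topic_model = BERTopic(umap_model=umap_model, calculate_probabilities=False, verbose=True)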

@makkarss929
Author


But even when we are not using a custom UMAP model and pass this parameter to the BERTopic class directly, it still consumes all the RAM.

@makkarss929
Author

makkarss929 commented Jul 8, 2021

As per your suggestion, I tried setting the low_memory=True parameter in the custom UMAP model, but it still crashes my RAM. It cannot fit even 2.5 lakh (250,000) data points with 13 GB of RAM.

# Setting UMAP model
umap_model = UMAP(n_neighbors=300, n_components=2, min_dist=0.0, metric='cosine', low_memory=True)

# Setting HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=300, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Setting CountVectorizer model
vectorizer_model = CountVectorizer(min_df=300)

topic_model = BERTopic(low_memory=True, verbose=True, calculate_probabilities=False,
                       embedding_model="paraphrase-MiniLM-L12-v2", umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       vectorizer_model=vectorizer_model, min_topic_size=300)
# fitting model
topic_model.fit(documents=train_docs_batch, y=train_targets_batch)

@MaartenGr
Owner

I just tried the following out in a Google Colab session that has 13GB RAM available and is using a Tesla T4. The used data was retrieved here.

BERTopic was trained on a sample of 300,000 documents. Here, I made sure to use low_memory=True. Then, I used the CountVectorizer to decrease the memory usage when creating the c-TF-IDF matrix. There were no issues with respect to memory usage:

import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(min_df=10)

df = pd.read_csv("abcnews-date-text.csv")
df_small = df.sample(300_000)
docs = df_small.headline_text.tolist()

topic_model = BERTopic(verbose=True, low_memory=True, vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(docs)

Having said that, if I go with more documents, then I do find myself running into memory issues.
It turns out that UMAP is known for being rather memory intensive, which likely explains the issue you are having.
You can find more about that here.

I would suggest using the above settings and simply using a machine with more RAM. If that does not work, then I suggest taking a smaller sample of the data that you have and simply predicting the topic of data that was left out. In practice, BERTopic does not need millions of data points to create a good model. Simply using a sample of a few hundred thousand documents should do the trick.
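
A rough sketch of that sampling-then-predicting approach, using the same ABC news dataset as above (the sample size and random_state are just examples):

import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Train on a sample of a few hundred thousand documents ...
df = pd.read_csv("abcnews-date-text.csv")
train_df = df.sample(300_000, random_state=42)
rest_df = df.drop(train_df.index)

vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(low_memory=True, calculate_probabilities=False,
                       vectorizer_model=vectorizer_model, verbose=True)
topic_model.fit(train_df.headline_text.tolist())

# ... and predict the topics of the documents that were left out of training
topics, _ = topic_model.transform(rest_df.headline_text.tolist())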

@Attol8

Attol8 commented Jul 27, 2021


Hi @MaartenGr, thanks for the package! I am also trying to run BERTopic on a dataset with roughly 1M sentences. Because I was having memory issues, I used your sampling strategy above, but the model seems to run forever. Do you have an idea of how long it should take on a Colab Pro notebook with GPU enabled? Also, I am using pre-trained sentence embeddings from the 'paraphrase-MiniLM-L6-v2' SentenceTransformer model and passing the full embeddings array to .fit_transform(). Could that be the issue?

@MaartenGr
Owner

@Attol8 Hmmm, it should not take that long. Could you share your entire code? Also, did you try it with verbose=True? If so, where did it seem to be stuck?

@Attol8

Attol8 commented Jul 27, 2021

@Attol8 Hmmm, it should not take that long. Could you share your entire code? Also, did you try it with verbose=True? If so, where did it seem to be stuck?

Yeah, sure. However, I cannot share the data as it is proprietary.

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
import numpy as np

# We then load the model with SentenceTransformers
model_name = 'paraphrase-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

#sample sentences from main df
df_small = aa_df.sample(300000)
tasks_texts = list(df_small.ad_text.values)
tasks_texts = [sentence[0:128*10] for sentence in tasks_texts]

#Compute embeddings for all sentences
corpus_embeddings = model.encode(tasks_texts, convert_to_tensor=True, show_progress_bar=True)

#BERTTopic
vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(low_memory = True, verbose = True, calculate_probabilities = False, vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(tasks_texts, np.array(corpus_embeddings.cpu()))

I did set verbose=True but it does not seem to output anything at all. Looking at the Colab execution bar, it seems to be stuck on the _reduce_dimensionality() part. It has been running for more than 2 hours now.

Thanks for the fast response though! Love this package and all your data science work!

@MaartenGr
Owner

Perhaps it is the way you are creating the embeddings. There might be an issue with the conversion to tensors. Perhaps a more minimal example would improve your situation. Could you try it out like this:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# sample sentences from main df
df_small = aa_df.sample(300000)
tasks_texts = list(df_small.ad_text.values)
tasks_texts = [sentence[0:128*10] for sentence in tasks_texts]

# BERTopic
vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L6-v2", 
                       low_memory = True, 
                       verbose = True, 
                       calculate_probabilities = False, 
                       vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(tasks_texts)

I should note that setting low_memory=True can slow down UMAP computations. If you have enough RAM available, I would advise setting that to False.
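
As a quick illustration of that trade-off, the same snippet with the flag flipped (vectorizer_model and tasks_texts as defined above); low_memory=False is the default, so it can also simply be omitted:

topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L6-v2",
                       low_memory=False,   # default; faster UMAP when enough RAM is available
                       verbose=True,
                       calculate_probabilities=False,
                       vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(tasks_texts)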

@Attol8

Attol8 commented Jul 27, 2021

While we were talking, the algorithm finished running. It took a bit more than 2 hours; maybe that's the expected time given my dataset. I have tried your code with low_memory=True, but it does not seem to improve the running time; it has been running for an hour now.

It is still stuck in the phase where it tries to reduce the dimensionality of the embeddings. The slow part really seems to be UMAP. I am not familiar with UMAP, but other dimensionality reduction methods may be faster. Anyway, I was able to create the topics, and 2 hours is not even that bad!

@MaartenGr
Owner

MaartenGr commented Jul 28, 2021

Glad to hear that it finished running. Perhaps I wasn't clear but if you have enough RAM available, setting low_memory=False might actually speed up computation.

One other thing might be starting from a fresh environment and making sure you have the newest versions of BERTopic and UMAP installed.
