custom dataset instructions #2
Comments
Thank you! Sure, really the only thing that you have to do is change the `docs` variable to your own list of strings. The example code can be found below:

```python
from bertopic import BERTopic

docs = YOUR_OWN_DATA_SET_LIST_OF_STRINGS
model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
topics = model.fit_transform(docs)
```

Does this answer your question?
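(Side note, not from the thread: the snippet above reflects the early API discussed here. In more recent BERTopic releases the embedding model is passed by keyword and `fit_transform` returns both topics and probabilities. A rough equivalent, assuming a current release and using `"all-MiniLM-L6-v2"` only as an example model name:)

```python
from bertopic import BERTopic

docs = YOUR_OWN_DATA_SET_LIST_OF_STRINGS  # still just a list of strings

# In newer releases the embedding model is a keyword argument,
# and fit_transform returns (topics, probabilities).
model = BERTopic(embedding_model="all-MiniLM-L6-v2", verbose=True)
topics, probs = model.fit_transform(docs)
```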
Thanks for your reply :-) What do you mean by YOUR_OWN_DATA_SET_LIST_OF_STRINGS, is it something like that? Would be awesome to have a dummy example!
Exactly! Each document is represented by a long string, which the model then uses to extract topics. Although I agree that a dummy example would be preferred, the model does not work if we only have a couple of sentences, as it is unlikely that different topics will be created. For that reason, I decided to use the newsgroups dataset, which contains a significant number of documents. If you run into problems with using your own dataset, I would recommend looking at the structure of the newsgroups dataset as shown in the example.
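(To make that concrete, a minimal sketch of what such a list of strings can look like; the documents below are made up, and the newsgroups loading call is the standard scikit-learn one, which may differ slightly from the repository's example:)

```python
from sklearn.datasets import fetch_20newsgroups

# Each "document" is simply one string; docs is a plain Python list of them.
docs = [
    "We propose a transformer-based approach to abstractive summarization ...",
    "This paper analyzes convergence rates of stochastic gradient descent ...",
    "We present a new benchmark for measuring topic coherence across domains ...",
]

# The newsgroups dataset referenced above has exactly the same structure.
newsgroups_docs = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
)["data"]
```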
Hi, thanks again for your reply! I wanted to test the paperswithcode dataset with all their abstracts. Refs: Do you think it's going to work on that amount of data (100k abstracts)?
Yes, it should work with at least a couple of hundred documents that differ in content, so 100k abstracts should be no problem. Make sure to use a GPU though, as extracting word embeddings can be quite time-consuming on a CPU.
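(As a quick sanity check, not from the thread but a common way to verify the setup: you can confirm that PyTorch sees a GPU before computing embeddings; sentence-transformers-based embedding models will use it automatically when available.)

```python
import torch

# True means the embedding step will run on the GPU instead of the CPU.
print(torch.cuda.is_available())
```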
So @MaartenGr, are you suggesting BERTopic is not a good idea to use for custom topics with few data points, like fewer than 1K?
@kurianbenoy-aot Few data points, as in fewer than 100 data points, is definitely an issue for creating a topic model, as HDBSCAN is unlikely to find multiple dense clusters. However, a couple of hundred data points should be sufficient if there are clearly separated topics/dense clusters to be found. NOTE: You should specify the
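(Purely as an illustration, and an assumption on my part rather than necessarily what the truncated NOTE refers to: in recent BERTopic releases the minimum cluster size can be tuned for small datasets either via `min_topic_size` or by passing a custom HDBSCAN model.)

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN

# Option 1: lower the minimum topic (cluster) size directly.
model = BERTopic(min_topic_size=5, verbose=True)

# Option 2: pass a custom HDBSCAN model for finer control over clustering.
hdbscan_model = HDBSCAN(min_cluster_size=5, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)
model = BERTopic(hdbscan_model=hdbscan_model, verbose=True)
```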
Hi,
Hope you are all well!
I wanted to apply BERTopic to a custom dataset. Can you provide more details about the input format for training a custom model?
Thanks for any insights or inputs on that question.
Cheers,
X