
custom dataset instructions #2

Closed

ghost opened this issue Oct 8, 2020 · 7 comments

@ghost commented Oct 8, 2020

Hi,

Hope you are all well !

I wanted to apply BERTopic to a custom dataset, but can you provide more details about the input format for training a custom model ?

Thanks for any insights or inputs on that question.

Cheers,
X

@MaartenGr (Owner)

Thank you! Sure, really the only thing you have to do is change the docs variable in the README to the dataset you are currently using. Make sure that docs is a list of strings, as it will be converted to a pandas dataframe at a later stage.

The example code can be found below:

from bertopic import BERTopic
 
docs = YOUR_OWN_DATA_SET_LIST_OF_STRINGS

model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
topics = model.fit_transform(docs)

Does this answer your question?

@ghost (Author) commented Oct 10, 2020

Thanks for your reply :-)

What do you mean by YOUR_OWN_DATA_SET_LIST_OF_STRINGS? Is it something like this:
docs = ['text1', 'text2'] ?

It would be awesome to have a dummy example!

@MaartenGr (Owner)

Exactly! Each document is represented by a long string which the sentence-transformers automatically converts into tokens before creating the embeddings.

Although I agree that a dummy example would be preferable, the model does not work if we only have a couple of sentences, as it is unlikely that different topics will be created. For that reason, I decided to use the newsgroups dataset, which contains a significant number of documents.

If you run into problems with using your own dataset I would recommend looking at the structure of the newsgroups dataset as shown in the example.
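
For reference, a minimal end-to-end sketch that loads the 20 newsgroups dataset with scikit-learn (assuming scikit-learn is installed; the loading options shown are just one reasonable choice) and feeds it to BERTopic exactly as in the snippet above:

from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Roughly 18k posts, returned as a plain Python list of strings (one document per entry)
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
topics = model.fit_transform(docs)

Replacing docs with your own list of strings, e.g. docs = ['text1', 'text2', ...], is all that changes for a custom dataset.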

@ghost (Author) commented Oct 10, 2020

Hi,

Thanks again for your reply!

I wanted to test the paperswithcode dataset with all their abstracts.

Refs:

Do you think it is going to work on that amount of data (100k abstracts)?

@MaartenGr (Owner)

Yes, it should work with at least a couple of hundred documents that differ in content, so 100k abstracts should be no problem. Make sure to use a GPU though, as extracting word embeddings can be quite time-consuming on a CPU.
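
A quick sanity check, assuming PyTorch is installed (sentence-transformers builds on it and will place the embedding model on the GPU automatically when one is available):

import torch

# True means the embeddings will be computed on the GPU
print(torch.cuda.is_available())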

@kurianbenoy-aot

So @MaartenGr, are you suggesting that BERTopic is not a good idea for custom topics with few data points, say fewer than 1k?

@MaartenGr (Owner)

@kurianbenoy-aot Few data points, as in fewer than 100, is definitely an issue for creating a topic model, as HDBSCAN is unlikely to find multiple dense clusters. However, a couple of hundred data points should be sufficient if there are clearly separated topics/dense clusters to be found.

NOTE: You should specify min_topic_size and n_neighbors, as they indirectly impact the number of topics/clusters that are created. Thus, if you have 100 data points, the minimum topic size cannot be 90, as that would result in only one cluster being made. Likewise, n_neighbors refers to the number of neighboring points used when reducing dimensionality and should therefore not exceed half of the data points.
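
As an illustration, a minimal sketch of how those two parameters might be set for a small dataset; the constructor arguments shown are an assumption based on the version discussed in this thread (later releases expose n_neighbors through a custom UMAP model instead):

from bertopic import BERTopic

# Hypothetical values for ~100 documents: keep min_topic_size well below the
# number of documents and n_neighbors at most about half of the data points.
model = BERTopic(
    "distilbert-base-nli-mean-tokens",
    min_topic_size=10,
    n_neighbors=15,
    verbose=True,
)
topics = model.fit_transform(docs)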
