
custom dataset instructions #2

Closed

ghost opened this issue Oct 8, 2020 · 7 comments

@ghost commented Oct 8, 2020

Hi,

Hope you are all well !

I wanted to apply BERTopic to a custom dataset, but can you provide more details about the input format for training a custom model ?

Thanks for any insights or inputs on that question.

Cheers,
X

@MaartenGr (Owner)

Thank you! Sure, really the only thing you have to do is change the docs variable in the README to the dataset you are currently using. Make sure that docs is a list of strings, as it will be converted to a pandas dataframe at a later stage.

The example code can be found below:

from bertopic import BERTopic
 
docs = YOUR_OWN_DATA_SET_LIST_OF_STRINGS

model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
topics = model.fit_transform(docs)

Does this answer your question?

@ghost (Author) commented Oct 10, 2020

Thanks for your reply :-)

What do you mean by YOUR_OWN_DATA_SET_LIST_OF_STRINGS? Is it something like this:
docs = ['text1', 'text2'] ?

It would be awesome to have a dummy example!

@MaartenGr (Owner)

Exactly! Each document is represented by a long string which the sentence-transformers automatically converts into tokens before creating the embeddings.

Although I agree that a dummy example would be preferable, the model does not work if we only have a couple of sentences, as it is unlikely that different topics will be created. For that reason, I decided to use the newsgroups dataset, which contains a significant number of documents.

If you run into problems with using your own dataset I would recommend looking at the structure of the newsgroups dataset as shown in the example.
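
For reference, a minimal end-to-end sketch that loads the 20 newsgroups dataset with scikit-learn (assuming scikit-learn is installed; the loading options shown are just one reasonable choice) and feeds it to BERTopic exactly as in the snippet above:

from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Roughly 18k posts, returned as a plain Python list of strings (one document per entry)
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
topics = model.fit_transform(docs)

Replacing docs with your own list of strings, e.g. docs = ['text1', 'text2', ...], is all that changes for a custom dataset.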

@ghost (Author) commented Oct 10, 2020

Hi,

Thanks again for your reply!

I wanted to test the paperswithcode dataset with all their abstracts.

Refs:

Do you think it is going to work on that amount of data (100k abstracts)?

@MaartenGr (Owner)

Yes, it should work with at least a couple of hundred documents that differ in content, so 100k abstracts should be no problem. Make sure to use a GPU though, as extracting word embeddings can be quite time-consuming on a CPU.
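
A quick sanity check, assuming PyTorch is installed (sentence-transformers builds on it and will place the embedding model on the GPU automatically when one is available):

import torch

# True means the embeddings will be computed on the GPU
print(torch.cuda.is_available())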

@kurianbenoy-aot

So @MaartenGr, are you suggesting that BERTopic is not a good idea for custom topics with few data points, say fewer than 1k?

@MaartenGr (Owner)

@kurianbenoy-aot Few data points, as in fewer than 100, is definitely an issue for creating a topic model, as HDBSCAN is unlikely to find multiple dense clusters. However, a couple of hundred data points should be sufficient if there are clearly separated topics/dense clusters to be found.

NOTE: You should specify min_topic_size and n_neighbors, as they indirectly impact the number of topics/clusters that are created. Thus, if you have 100 data points, the minimum topic size cannot be 90, as that would result in only one cluster being made. Likewise, n_neighbors refers to the number of neighboring points used when reducing dimensionality and should therefore not exceed half of the data points.
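
As an illustration, a minimal sketch of how those two parameters might be set for a small dataset; the constructor arguments shown are an assumption based on the version discussed in this thread (later releases expose n_neighbors through a custom UMAP model instead):

from bertopic import BERTopic

# Hypothetical values for ~100 documents: keep min_topic_size well below the
# number of documents and n_neighbors at most about half of the data points.
model = BERTopic(
    "distilbert-base-nli-mean-tokens",
    min_topic_size=10,
    n_neighbors=15,
    verbose=True,
)
topics = model.fit_transform(docs)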
