About Coherence of topic models #90
I believe you should be using the CountVectorizer for creating the corresponding corpus and dictionary when creating the CoherenceModel.
@MaartenGr thanks a lot for your attention. I am trying this, but I found a phrase in the topics set that doesn't exist in the dictionary. Is that OK? Do all the topics exist in the n-grams? The code I used is this:

```python
from gensim import corpora
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2, 20))  # 2,20 is the same range as the topics
texts = []
for i in range(len(comentariosList)):
    ...
topics = topics_df['Keywords'].values.tolist()
cm = CoherenceModel(topics=topics, corpus=corpus, dictionary=dictionary, coherence='u_mass')
```

Thanks for your help.
You should focus on what you put into the CoherenceModel.
Do you have any recommendations for working with this n_gram_range parameter?

```python
topic_model = BERTopic(verbose=True, embedding_model=embedder, n_gram_range=(1, 3), calculate_probabilities=True)
```
I believe it is best to make sure that the CountVectorizer in BERTopic is the same as the one you used to create the dictionary, corpus, and tokens. You could also try accessing the CountVectorizer directly in BERTopic, for example as in the sketch below. If this still does not work, let me know!
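A minimal sketch of that suggestion, assuming a BERTopic version that accepts a custom vectorizer via the `vectorizer_model` argument and that `docs` is your list of documents:

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Use one and the same CountVectorizer for BERTopic and for building
# the gensim dictionary/corpus, so the vocabularies match.
vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer)
topics, probs = topic_model.fit_transform(docs)
```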
I would suggest that, instead of creating n-grams of the corpus, you simply split the n-grams of the topic words and flatten them into a list of single words (unigrams), so that you can compute gensim coherence (NPMI) scores without having to create n-grams of the text.
First of all, thank you for your attention.

```python
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
```

TypeError: 'CountVectorizer' object is not callable
Hi Amine-OMI, thank you for your tips. Do you have an example of gensim coherence (NPMI)? Thanks a lot for your attention.
You should access the vectorizer model like this:
Hey! Use it as such:
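The snippets referenced in the two comments above did not survive in this thread; below is a minimal sketch of accessing the vectorizer on a fitted BERTopic model, assuming the `vectorizer_model` attribute and scikit-learn's `build_tokenizer`/`build_analyzer` helpers, with `docs` standing for your list of documents:

```python
from bertopic import BERTopic

topic_model = BERTopic()
topic_model.fit(docs)

# Access the CountVectorizer that BERTopic used internally
vectorizer = topic_model.vectorizer_model

# Its tokenizer/analyzer can then be reused to build matching tokens
tokenizer = vectorizer.build_tokenizer()
analyzer = vectorizer.build_analyzer()
```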
Hey, sorry for the late reply, here's the process if you're still working on it: once you have extracted the topics from the corpus, you may have bigrams in the list of top words of each topic, so you need to split them and flatten the list to get a list of unigrams at the end (see the sketch below). After that you can use Gensim topic coherence as described in this link, and you can use one of the following coherence measures: {'u_mass', 'c_v', 'c_uci', 'c_npmi'}.
I hope this helps you.
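A minimal sketch of the splitting and flattening step described above, assuming `topic_words` is a list of top-word lists per topic as produced by the topic model:

```python
# Split n-gram top words (e.g. "neural network") into unigrams and
# flatten each topic's word list, de-duplicating while keeping order.
unigram_topics = []
for topic in topic_words:
    flattened = [word for ngram in topic for word in ngram.split()]
    unigram_topics.append(list(dict.fromkeys(flattened)))
```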
The following steps should be the correct ones for calculating the coherence scores. Some additional preprocessing is necessary, since only a very small part of that is done in BERTopic. Also, make sure to build the tokens with the exact same tokenizer as used in BERTopic. I do want to stress that metrics such as topic coherence are merely proxies for a topic model's quality.

```python
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Preprocess documents
cleaned_docs = topic_model._preprocess_text(docs)

# Extract vectorizer and tokenizer from BERTopic
vectorizer = topic_model.vectorizer_model
tokenizer = vectorizer.build_tokenizer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names()
tokens = [tokenizer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics))-1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()
```
Hello MaartenGr, I tried to execute this, but the problem is the tokenizer. My BERTopic model got topics with n-grams from 1 to 10, while the tokenizer here produces tokens with only one term (unigrams). It only works when I consider n_gram_range=(1, 1).
Good catch, I did not test for higher n-grams in the example. I made two changes: documents are now joined per topic before preprocessing, and the CountVectorizer's analyzer (which respects the n-gram range) is used instead of its tokenizer. Tested it with several ranges of n-grams and it seems to work now.

```python
from bertopic import BERTopic
import pandas as pd
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

topic_model = BERTopic(verbose=True, n_gram_range=(1, 3))
topics, _ = topic_model.fit_transform(docs)

# Preprocess Documents
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names()  # use get_feature_names_out() on newer scikit-learn versions
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics))-1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()
```
Great! Thanks a lot!
Hi Maarten, thanks for the code for calculating the coherence score. I am wondering which parameter I can tune using the coherence score. Is the coherence score always decreasing as the parameter is reduced?
@YuanyuanLi96 In general, I would not advise you to use this coherence score to fine-tune BERTopic. These metrics are merely proxies for a topic model's performance. They are by no means a ground truth and can have significant issues (e.g., they are sensitive to the number of words in a topic). So whether you find a low or high score, I would advise you to look at the topics yourself and see if they make sense to you. Having said that, by reducing the number of topics you also change the topic representations, so the score can move in either direction. When it comes to tuning on a small dataset, I would focus on keeping a logical minimum topic size.
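For reference, a minimal sketch of the kind of parameters being discussed here; the specific values are purely illustrative and `docs` stands for your list of documents:

```python
from bertopic import BERTopic

# min_topic_size controls the smallest allowed cluster; nr_topics can merge
# topics after training (e.g. "auto" or a fixed number).
topic_model = BERTopic(min_topic_size=15, nr_topics="auto", top_n_words=10)
topics, probs = topic_model.fit_transform(docs)
```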
@MaartenGr Thanks for your explanation and suggestion!
Hi @MaartenGr, regarding the conversation here and your reply to YuanyuanLi96: currently the only measurements I have found to evaluate a topic model are coherence (UMass, NPMI, etc.) and perplexity scores, which both have their downsides, besides human judgement, which, like you said, is "I would advise you to look at the topics yourself and see if they make sense to you". Is there any other measurement you suggest? In short: if I have an LDA model and a BERTopic model trained on the same data and apply the same number of topics to both, how would I know which is more accurate?
@TomNachman There are a few things that are important here. What is the definition of "accurate"? Is that topic coherence? Quality (density or separation) of clusters? Predictive power? Distribution of topics? Etc. Defining accuracy or quality first is important in knowing whether one topic model is better than another. Which metric is best to use depends highly on your use case, although NPMI-based coherence seems to be the most commonly reported in the literature. Moreover, I am often very hesitant when it comes to recommending a coherence metric to use. You can quickly overfit on such a metric when tuning the parameters of BERTopic (or any other topic modeling technique), which in practice might result in poor performance. In other words, I want to prevent users from solely focusing on grid-searching parameters and motivate users to look at the results. Having said that, that does not mean that these metrics cannot be used! They are extremely useful in the right circumstances. So when you want to compare topic models, definitely use these kinds of metrics (e.g., coherence together with topic diversity). I want to end with a great package for evaluating your topic model, namely OCTIS. It has many evaluation measures implemented aside from the standard coherence metrics, such as topic diversity and inverted RBO.
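A minimal sketch of how such a comparison could look with OCTIS's evaluation metrics, assuming the topics have already been extracted into `topic_words` (a list of top-word lists) and `tokens` (the tokenized corpus) as in the code earlier in this thread:

```python
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

# OCTIS metrics expect a dict with a "topics" key: a list of top-word lists.
model_output = {"topics": topic_words}

npmi = Coherence(texts=tokens, topk=10, measure="c_npmi")
diversity = TopicDiversity(topk=10)

print("NPMI coherence:", npmi.score(model_output))
print("Topic diversity:", diversity.score(model_output))
```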
Hello Maarten, I was tuning the hyperparameters top_n_words and min_topic_size. I basically use the above code as a function to evaluate my topic model's quality. It seems that the code does not work for a certain set of values of the two parameters (in my case, top_n_words = 5 and min_topic_size = 28), while it managed to provide the coherence score for the rest of the pairs. It's even more peculiar because I executed the same thing the other day and there was no issue. The only difference here is that I used a different set of data, although both were preprocessed similarly and had identical structure.
It might be worthwhile to check the differences between the output variables for your two sets of data (e.g., the resulting tokens, dictionary, corpus, and topic_words).
Good afternoon Maarten, thank you very much for pulling this together. I recognise that the coherence score isn't necessarily the best option to determine accuracy, but it's a useful proxy to consider. Having taken a brief look at the code, I've noticed that the line `words = vectorizer.get_feature_names()`
isn't referred to elsewhere in the code. Can this line be omitted, or does it serve a further purpose? Thanks in advance, H
@hwrightson You are completely right! It is definitely a useful proxy to consider when validating your model. NPMI, for example, has shown promise in emulating human performance (1). A topic coherence score in conjunction with visual checks definitely prevents issues later on.
Good catch, I might have used it for something else whilst testing out calculating coherence scores. So yes, you can omit that line!
@MaartenGr I've been delving into model evaluation and, at your suggestion, am using OCTIS. In my first set of experiments I compared the OCTIS metrics across models. P.S. Of course, right after writing this I remembered that I hadn't gone back to the paper the OCTIS people wrote, OCTIS: Comparing and Optimizing Topic Models is Simple!. So anything you suggest that is not referenced there would be super.
@drob-xx Great to hear that you have been working with OCTIS! You might have already seen it, but aside from the paper itself, some of the references to the evaluation metrics can be found here. The field of evaluation metrics is a tricky one: there are many different use cases for topic modeling techniques, and topic modeling is, by nature, a subjective method, which is often reflected in the evaluation metrics. Over the last few years there have been several papers describing the pros and cons of these metrics.
That has happened to me more times than I would like to admit! The metrics that you find in the paper and in OCTIS are, at least in my experience, the most common metrics that you see in academia. One thing that might be interesting to look at is clustering metrics. Essentially, BERTopic is a clustering algorithm with a topic representation on top. The assumption here is that good clusters lead to good topic representations. Thus, in order to have a good model, you will need good clusters. You can find some of these metrics here, but be aware that some of them might need labels to judge the quality of the generated clusters.
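A minimal sketch of what such a clustering-based check could look like, assuming `umap_embeddings` holds the reduced embeddings used for clustering and `topics` the assignments returned by fit_transform; the silhouette score is just one example of a label-free clustering metric:

```python
import numpy as np
from sklearn.metrics import silhouette_score

topics = np.array(topics)
mask = topics != -1  # ignore outlier documents

score = silhouette_score(umap_embeddings[mask], topics[mask])
print("Silhouette score (higher is better):", score)
```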
Hello Maarten, I would also like to include OCTIS in my evaluation of BERTopic's findings. If I understand you correctly in issues #144 and #331, there are a few lines that should give me the topic-word matrix I need for OCTIS. Is that correct? Many thanks in advance for the help.
@MaartenGr, thank you so much for testing and pointing out the mistake in time. I'm still learning, so I really appreciate your help!
Thanks so much, @MaartenGr! This piece of code indeed solves the empty-topics issue that has been torturing me for quite a while.
Hello @MaartenGr,
@RamziRahli That repo was merely for the evaluation of experiments in the paper and was not meant to be generally used. Instead, I would advise performing the evaluations yourself using the guidelines in OCTIS or using Gensim with the provided example here.
@MaartenGr
@RamziRahli That is difficult to say without seeing the actual code (and feel free to create an issue for this), but it would not be surprising depending on your setup. Calculating coherence measures is notoriously slow.
Hello everyone!
I got the error 'numpy.float64' object has no attribute 'to_json'. How can I solve this error? Please explain how to solve it. Thank you.
You should not call `.to_json()` on the result of the following, since `get_coherence()` returns a single float:

```python
# Print Data Evaluation
topic_eval = coherence.get_coherence()
```
I got the same error as before: 'numpy.float64' object has no attribute 'to_json'.
The type of the output of `get_coherence()` is a plain float (numpy.float64), so it has no `.to_json()` method.
@MaartenGr Is it generally a good or bad idea to use a representation model while evaluating the coherence score of a model? I noticed that using KeyBERTInspired while evaluating the coherence score yields different results than using none. Although I have to say that the different scores are still very similar.
@mike-asw It depends. If the representation model that you use is important for your use case, then you should definitely include it in the evaluation. The multiple scores also give you an idea of the effect of representation models on the resulting coherence evaluation metric. I do think that when you include representation models and run evaluation metrics, you should definitely include those representation models in the evaluation procedure. It has always surprised me that when evaluating BERTopic, many users/researchers tend to focus on only the base representation when there are so many more to choose from.
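A minimal sketch of including such a representation model before running the coherence evaluation, assuming a BERTopic version that ships `KeyBERTInspired` and with `docs` standing for your list of documents:

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# Fit with the representation model you actually intend to use,
# then evaluate coherence on the resulting topic words.
topic_model = BERTopic(representation_model=KeyBERTInspired())
topics, probs = topic_model.fit_transform(docs)
topic_words = [[word for word, _ in topic_model.get_topic(t)]
               for t in range(len(set(topics)) - 1)]
```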
Hi Maarten, I was looking at the discussion above and noticed that at some point you switched from the tokenizer to the analyzer in order to be able to perform n-gram tokenization. In my code both implementations seem to work, however they give very different coherence values. I do specify an n_gram range in my CountVectorizer. Which of the two (tokenizer or analyzer) will give the 'correct' coherence value in my case, if such a notion even exists? Or what should be considered in picking one of the two? Thanks in advance!
@ninavdPipple As you mentioned, there is no "correct" coherence value. It all depends on the reasons why you would choose the tokenizer over the analyzer or vice versa. Having said that, since you are using an n-gram range in your CountVectorizer, I would go with the analyzer, as it is the one that actually generates the n-grams.
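A small sketch illustrating the difference being discussed, using scikit-learn's CountVectorizer: the tokenizer only splits into single words, while the analyzer also applies preprocessing and produces the configured n-grams.

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 2))
text = "Topic models are useful"

print(cv.build_tokenizer()(text))
# ['Topic', 'models', 'are', 'useful']

print(cv.build_analyzer()(text))
# ['topic', 'models', 'are', 'useful', 'topic models', 'models are', 'are useful']
```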
Hi, I'm currently using this code to calculate coherence measures for topic models based on arXiv preprints, and the coherence calculation runs out of memory. Is there a more memory-efficient way to do this?
@benearnthof Calculating coherence scores takes a lot of memory and I am not familiar with any more efficient techniques. Making sure you have enough RAM is definitely important here. Also, make sure that your vocab is not unnecessarily large when you are using n-grams. The min_df parameter of the CountVectorizer can help keep the vocabulary small.
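A minimal sketch of the min_df suggestion, assuming the vectorizer is passed to BERTopic via `vectorizer_model`; the threshold value is illustrative:

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Drop n-grams that appear in fewer than 10 documents to keep the
# vocabulary (and the memory needed for coherence) manageable.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer)
```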
@MaartenGr I have experimented with MmCorpus but will give min_df a shot. Thanks for the swift reply!
I tried plotting the coherence score (c_v) against the number of topics while changing the hyperparameters "n_neighbors" and "n_components" of the UMAP model and "cluster_selection_epsilon" and "min_cluster_size" of the HDBSCAN model passed to BERTopic. The resulting curve is monotonically decreasing. Shouldn't we expect it to be otherwise, or to have a maximum somewhere, increasing up to that point and decreasing after it? It is weird that the coherence score always seems to decrease as the number of topics increases. I could use some feedback ASAP. @MaartenGr
@abinashsinha330 It is difficult to say without knowing all the specifics of your data, use case, type of coherence (e.g., c_v vs. NPMI), etc. For example, it could simply be that you have little data available for each topic that you add and, therefore, the topic representations are not as good as the first few. Of course, this could also depend on the representation model that you choose. However, after a quick Google search, you can find several papers that not only report this phenomenon but also observe that the coherence score might increase again after a certain point. You can do some research on your chosen coherence score and get an intuition about how it works. Then you can experiment and investigate why your specific graphs look the way they do. Do note that this issue thread is mostly focused on evaluation in general and, as you might have read here, I am generally against such a large focus on coherence alone. So my main advice would be not to focus that much on coherence scores only and to create a broad evaluation of your topic model. The thought that a topic model should only be evaluated by a coherence score (whatever that exactly means with different metrics) can get you into trouble when using the model in practice.
Hi @MaartenGr, I noticed that in your provided example for calculating coherence scores, the entire corpus is used for both fitting and evaluation. I'm interested in your perspective on incorporating a train-test split for model assessment. Would this improve the evaluation's robustness by measuring generalizability to unseen data, or might it lead to non-representative coherence scores? Thanks in advance!
@nickgiann Hmmm, I seldom see train/test splits for that, since you would still need the same vocabulary across splits, which in turn requires the entire corpus to be passed. The thing is that unseen data does not influence the training of BERTopic, and whenever you run it on unseen documents, the topics themselves are not updated, so a held-out split would not change the topic representations being evaluated.
Dear @MaartenGr, thank you so much for all your useful advice above. Having had compatibility issues with OCTIS, I am trying to find an alternative way to do hyperparameter tuning (with respect to the coherence measure). I tried creating a BERTopic grid-search wrapper in which I manually define the coherence function. I keep getting "nan" as my coherence scores. I have been trying to find the source of this issue for a while, and among my debugging attempts I found that when I use the wrapper alone, I obtain a coherence score. Do you have any idea of what is going on here, and what I might have done wrong? All the best,
@romybeaute Unfortunately, I'm not that familiar with how a customized GridSearch wrapper should be implemented within scikit-learn. You could potentially do it manually, since there is no cross-validation involved in your example; it would be looping over parameters and nothing more, if I'm not mistaken (a sketch of such a loop is below).
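A minimal sketch of such a manual loop, assuming a `compute_coherence(topic_model, docs)` helper built from the code earlier in this thread and with `docs` standing for your list of documents; the parameter grid is purely illustrative:

```python
import itertools
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

results = []
for n_neighbors, min_cluster_size in itertools.product([10, 15, 30], [10, 30, 50]):
    umap_model = UMAP(n_neighbors=n_neighbors, n_components=5, random_state=42)
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, prediction_data=True)
    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
    topic_model.fit(docs)
    results.append((n_neighbors, min_cluster_size, compute_coherence(topic_model, docs)))
```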
Dear @MaartenGr, thanks a lot for your previous answer to my question. I have been applying your advice and now have a CSV file containing the different combinations that have been tested, with their respective coherence scores and numbers of topics (grid_search_results_HandWritten_seed22.csv). But the best coherence results lead to very few topics being created. So I am in a situation where I need to find a balance between the coherence score and a number of extracted topics that is reasonable for my research, and this choice seems quite subjective... Is that something acceptable to do? Would you recommend any other, more objective, method to select the number of extracted topics (and therefore the hyperparameter combinations that lead to this number of extracted topics)?
That's indeed the problem I have with using topic coherence and grid search together: you are not likely to end up with the quality of topics that you are looking for. As such, and as you can see throughout this issue, I would definitely not recommend grid-searching on topic coherence only. It is important to first take note of what "performance" or "quality" means in your specific use case and derive metrics based on that. Topic coherence by itself tells you very little about a topic model, especially when you take into account the other perspectives of what a topic model can be good at, such as assignment of topics, diversity of topics, accuracy of topics rather than coherence, etc.
It is indeed subjective but that is not necessarily a bad thing because your use case is subjective. You have certain requirements for your specific use case and one of which is the number of extracted topics. It would be more than reasonable to say that having 2 topics in your 1 million documents makes no sense and based on your familiarity with the data, there are at least n topics. If you want a purely objective measure for something that is inherently subjective, that will prove to be quite difficult. Instead, I generally advise a mix. You can use proxy measures such as topic coherence and diversity as the "objective" measures (note they are not ground-truth metrics) and "subjective" information such as limiting the number of topics to a certain minimum. All in all, I would advise starting from the metric itself. Why is optimizing for only topic coherence so important for your use case?
What would be the splits and evaluation here? Normally, you would train on 80% of the data and perform inference on the remaining 20%. In the context of topic coherence, there is no inference involved, only training.
Currently, I am calculating the coherence of a BERTopic model using gensim. For this I need the n-grams from each text of the corpus. Is that possible? The function used by gensim expects the corpus and topics, and the topics are tokens that must exist in the corpus.

```python
cm = CoherenceModel(topics, corpus, dictionary, coherence='u_mass')
```
Thanks in advance.