
Error calculating coherence score when using ngram range in vectorizer #441

Closed

tawfiqam opened this issue Feb 14, 2022 · 3 comments

@tawfiqam
Error Message:

```
ValueError                                Traceback (most recent call last)
Input In [17], in
----> 1 coherence_values, model_topic_list, model_probls_list = compute_coherence_values(start=10, step=5, limit=100)

Input In [16], in compute_coherence_values(start, step, limit, coherence_df)
     51 topic_words = [[words for words, _ in topic_model.get_topic(topic)]
     52                for topic in range(len(set(topics))-1)]
     54 # Evaluate
---> 55 coherence_model = CoherenceModel(topics=topic_words,
     56                                  texts=tokens,
     57                                  corpus=corpus,
     58                                  dictionary=dictionary,
     59                                  coherence='c_v')
     61 coherence = coherence_model.get_coherence()
     64 #coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:215, in CoherenceModel.__init__(self, model, topics, texts, corpus, dictionary, window_size, keyed_vectors, coherence, topn, processes)
    213 self._accumulator = None
    214 self._topics = None
--> 215 self.topics = topics
    217 self.processes = processes if processes >= 1 else max(1, mp.cpu_count() - 1)

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:430, in CoherenceModel.topics(self, topics)
    428 new_topics = []
    429 for topic in topics:
--> 430     topic_token_ids = self._ensure_elements_are_ids(topic)
    431     new_topics.append(topic_token_ids)
    433 if self.model is not None:

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:454, in CoherenceModel._ensure_elements_are_ids(self, topic)
    452     return np.array(ids_from_ids)
    453 else:
--> 454     raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')

ValueError: unable to interpret topic as either a list of tokens or a list of ids
```
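What the traceback points at: with `ngram_range=(1, 3)`, the topic words returned by `topic_model.get_topic(...)` contain bigrams and trigrams, but `build_tokenizer()` splits documents into unigrams only, so the gensim `Dictionary` built from `tokens` has no ids for those n-grams. A simplified sketch of the failing check (illustrative only; the helper below is an assumption, not gensim's actual implementation):

```python
def ensure_elements_are_ids(topic, token2id):
    # Simplified sketch of gensim's CoherenceModel._ensure_elements_are_ids
    # (illustrative, not the real code).
    try:
        # Case 1: topic is a list of tokens known to the dictionary.
        return [token2id[token] for token in topic]
    except (KeyError, TypeError):
        pass
    id2token = {i: t for t, i in token2id.items()}
    try:
        # Case 2: topic is a list of integer ids known to the dictionary.
        return [token2id[id2token[i]] for i in topic]
    except (KeyError, TypeError):
        raise ValueError('unable to interpret topic as either a list of '
                         'tokens or a list of ids')

# A dictionary built from build_tokenizer() output contains only unigrams:
token2id = {"machine": 0, "learning": 1}

print(ensure_elements_are_ids(["machine"], token2id))  # [0]
# A bigram topic word such as "machine learning" matches neither a known
# token nor an integer id, so the check raises the ValueError seen above:
# ensure_elements_are_ids(["machine learning"], token2id)
```

Any topic word missing from the dictionary (here, every bigram and trigram) sends the check down both branches and into the `ValueError`.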

Code:

```python
import csv

import pandas as pd
from hdbscan import HDBSCAN
from bertopic import BERTopic
import gensim.corpora as corpora
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.coherencemodel import CoherenceModel

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words="english")

coherence_df = pd.DataFrame(columns=['min_cluster_size', 'coherence_score', 'num_of_topics'])


def compute_coherence_values(start=15, step=10, limit=205, coherence_df=coherence_df):
    coherence_values = []
    model_topic_list = []
    model_probls_list = []
    num_of_topics_list = []

    dir = 'PATH'

    keys = ['min_cluster', 'coherence', 'num_of_topics']

    # Fit one BERTopic model per candidate HDBSCAN min_cluster_size
    for num_clusters in range(start, limit, step):
        hdbscan_model = HDBSCAN(min_cluster_size=num_clusters, metric='euclidean',
                                cluster_selection_method='eom', prediction_data=True)
        topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True,
                               hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model)
        topics, probs = topic_model.fit_transform(docs)  # `docs` is defined elsewhere
        print('saving the bertopic model...')

        filename = dir + "bertopic_" + str(num_clusters) + ".model"
        topic_model.save(filename)
        model_topic_list.append(topics)
        model_probls_list.append(pd.DataFrame(probs))
        scores = pd.DataFrame(probs)
        print('done with model ', str(num_clusters))

        # Preprocess documents
        cleaned_docs = topic_model._preprocess_text(docs)

        # Extract vectorizer and tokenizer from BERTopic
        vectorizer = topic_model.vectorizer_model
        tokenizer = vectorizer.build_tokenizer()

        # Extract features for topic coherence evaluation
        words = vectorizer.get_feature_names()
        tokens = [tokenizer(doc) for doc in cleaned_docs]
        dictionary = corpora.Dictionary(tokens)
        corpus = [dictionary.doc2bow(token) for token in tokens]
        topic_words = [[words for words, _ in topic_model.get_topic(topic)]
                       for topic in range(len(set(topics)) - 1)]

        # Evaluate
        coherence_model = CoherenceModel(topics=topic_words,
                                         texts=tokens,
                                         corpus=corpus,
                                         dictionary=dictionary,
                                         coherence='c_v')
        coherence = coherence_model.get_coherence()
        print('coherence metric is ', str(coherence))
        coherence_values.append(coherence)

        # Number of topics found by this model
        num_of_topics = len(topic_model.get_topic_info())

        # Append this run's results to the tracking dataframe and write it out
        coherence_df.loc[len(coherence_df.index)] = [num_clusters, coherence, num_of_topics]
        coherence_df.to_csv(dir + 'coherence_updates.csv')
        # Write the document-topic probabilities for this model
        scores.to_json(dir + 'bertopicmodel_score_' + str(num_clusters) + ".zip", compression='zip')
        # Write topics
        with open(dir + 'topics_' + str(num_clusters) + '.csv', 'w') as f:
            write = csv.writer(f)
            write.writerow(topics)

    return coherence_values, model_topic_list, model_probls_list
```
@MaartenGr
Owner
Your code is quite difficult to read, as only parts of it are rendered as code blocks in markdown. Next time, I would advise formatting your code in markdown like this to display the entire code block:

```
Put my code here between triple backticks
```

Having said that, it might be interesting to look at the issue here, where you can find a working version of calculating the coherence score with a larger n-gram range. From what I can see, you should use build_analyzer instead of build_tokenizer to get the same preprocessing steps. If that does not work, let me know and I'll look at it a bit more in-depth.
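For reference, a minimal sketch of the difference between the two (assuming scikit-learn's `CountVectorizer`; the sample sentence is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")

tokenizer = vectorizer.build_tokenizer()  # raw token splitting: unigrams only
analyzer = vectorizer.build_analyzer()    # full preprocessing: stop words + n-grams

doc = "topic models group similar documents"
print(tokenizer(doc))  # ['topic', 'models', 'group', 'similar', 'documents']
print(analyzer(doc))   # the unigrams plus bigrams/trigrams such as 'topic models'
```

In the snippet from the original post, this would mean building the texts with `tokens = [analyzer(doc) for doc in cleaned_docs]`, so the gensim `Dictionary` contains the same n-grams that appear in `topic_model.get_topic(...)`.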

@tawfiqam
Author

Apologies...posting again for clarity. I will update you after taking your suggestion for build_analyzer. Thanks!

Error Message:

```
ValueError                                Traceback (most recent call last)
Input In [17], in
----> 1 coherence_values, model_topic_list, model_probls_list = compute_coherence_values(start=10, step=5, limit=100)

Input In [16], in compute_coherence_values(start, step, limit, coherence_df)
     51 topic_words = [[words for words, _ in topic_model.get_topic(topic)]
     52                for topic in range(len(set(topics))-1)]
     54 # Evaluate
---> 55 coherence_model = CoherenceModel(topics=topic_words,
     56                                  texts=tokens,
     57                                  corpus=corpus,
     58                                  dictionary=dictionary,
     59                                  coherence='c_v')
     61 coherence = coherence_model.get_coherence()
     64 #coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:215, in CoherenceModel.__init__(self, model, topics, texts, corpus, dictionary, window_size, keyed_vectors, coherence, topn, processes)
    213 self._accumulator = None
    214 self._topics = None
--> 215 self.topics = topics
    217 self.processes = processes if processes >= 1 else max(1, mp.cpu_count() - 1)

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:430, in CoherenceModel.topics(self, topics)
    428 new_topics = []
    429 for topic in topics:
--> 430     topic_token_ids = self._ensure_elements_are_ids(topic)
    431     new_topics.append(topic_token_ids)
    433 if self.model is not None:

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:454, in CoherenceModel._ensure_elements_are_ids(self, topic)
    452     return np.array(ids_from_ids)
    453 else:
--> 454     raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')

ValueError: unable to interpret topic as either a list of tokens or a list of ids
```

@tawfiqam
Author

That worked for me! Thanks.
