
Error calculating coherence score when using ngram range in vectorizer #441

Closed

tawfiqam opened this issue Feb 14, 2022 · 3 comments

@tawfiqam
Error Message:

```
ValueError                                Traceback (most recent call last)
Input In [17], in
----> 1 coherence_values, model_topic_list, model_probls_list = compute_coherence_values(start=10, step=5, limit=100)

Input In [16], in compute_coherence_values(start, step, limit, coherence_df)
     51 topic_words = [[words for words, _ in topic_model.get_topic(topic)]
     52                for topic in range(len(set(topics))-1)]
     54 # Evaluate
---> 55 coherence_model = CoherenceModel(topics=topic_words,
     56                                  texts=tokens,
     57                                  corpus=corpus,
     58                                  dictionary=dictionary,
     59                                  coherence='c_v')
     61 coherence = coherence_model.get_coherence()
     64 #coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:215, in CoherenceModel.__init__(self, model, topics, texts, corpus, dictionary, window_size, keyed_vectors, coherence, topn, processes)
    213 self._accumulator = None
    214 self._topics = None
--> 215 self.topics = topics
    217 self.processes = processes if processes >= 1 else max(1, mp.cpu_count() - 1)

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:430, in CoherenceModel.topics(self, topics)
    428 new_topics = []
    429 for topic in topics:
--> 430     topic_token_ids = self._ensure_elements_are_ids(topic)
    431     new_topics.append(topic_token_ids)
    433 if self.model is not None:

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:454, in CoherenceModel._ensure_elements_are_ids(self, topic)
    452     return np.array(ids_from_ids)
    453 else:
--> 454     raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')

ValueError: unable to interpret topic as either a list of tokens or a list of ids
```
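What the traceback points at: with `ngram_range=(1, 3)`, the topic words returned by `topic_model.get_topic(...)` contain bigrams and trigrams, but `build_tokenizer()` splits documents into unigrams only, so the gensim `Dictionary` built from `tokens` has no ids for those n-grams. A simplified sketch of the failing check (illustrative only; the helper below is an assumption, not gensim's actual implementation):

```python
def ensure_elements_are_ids(topic, token2id):
    # Simplified sketch of gensim's CoherenceModel._ensure_elements_are_ids
    # (illustrative, not the real code).
    try:
        # Case 1: topic is a list of tokens known to the dictionary.
        return [token2id[token] for token in topic]
    except (KeyError, TypeError):
        pass
    id2token = {i: t for t, i in token2id.items()}
    try:
        # Case 2: topic is a list of integer ids known to the dictionary.
        return [token2id[id2token[i]] for i in topic]
    except (KeyError, TypeError):
        raise ValueError('unable to interpret topic as either a list of '
                         'tokens or a list of ids')

# A dictionary built from build_tokenizer() output contains only unigrams:
token2id = {"machine": 0, "learning": 1}

print(ensure_elements_are_ids(["machine"], token2id))  # [0]
# A bigram topic word such as "machine learning" matches neither a known
# token nor an integer id, so the check raises the ValueError seen above:
# ensure_elements_are_ids(["machine learning"], token2id)
```

Any topic word missing from the dictionary (here, every bigram and trigram) sends the check down both branches and into the `ValueError`.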

Code:

```python
import csv

import pandas as pd
from hdbscan import HDBSCAN
from bertopic import BERTopic
import gensim.corpora as corpora
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.coherencemodel import CoherenceModel

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words="english")

coherence_df = pd.DataFrame(columns=['min_cluster_size', 'coherence_score', 'num_of_topics'])


def compute_coherence_values(start=15, step=10, limit=205, coherence_df=coherence_df):
    coherence_values = []
    model_topic_list = []
    model_probls_list = []
    num_of_topics_list = []

    dir = 'PATH'

    keys = ['min_cluster', 'coherence', 'num_of_topics']

    # Fit one BERTopic model per candidate HDBSCAN min_cluster_size
    for num_clusters in range(start, limit, step):
        hdbscan_model = HDBSCAN(min_cluster_size=num_clusters, metric='euclidean',
                                cluster_selection_method='eom', prediction_data=True)
        topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True,
                               hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model)
        topics, probs = topic_model.fit_transform(docs)  # `docs` is defined elsewhere
        print('saving the bertopic model...')

        filename = dir + "bertopic_" + str(num_clusters) + ".model"
        topic_model.save(filename)
        model_topic_list.append(topics)
        model_probls_list.append(pd.DataFrame(probs))
        scores = pd.DataFrame(probs)
        print('done with model ', str(num_clusters))

        # Preprocess documents
        cleaned_docs = topic_model._preprocess_text(docs)

        # Extract vectorizer and tokenizer from BERTopic
        vectorizer = topic_model.vectorizer_model
        tokenizer = vectorizer.build_tokenizer()

        # Extract features for topic coherence evaluation
        words = vectorizer.get_feature_names()
        tokens = [tokenizer(doc) for doc in cleaned_docs]
        dictionary = corpora.Dictionary(tokens)
        corpus = [dictionary.doc2bow(token) for token in tokens]
        topic_words = [[words for words, _ in topic_model.get_topic(topic)]
                       for topic in range(len(set(topics)) - 1)]

        # Evaluate
        coherence_model = CoherenceModel(topics=topic_words,
                                         texts=tokens,
                                         corpus=corpus,
                                         dictionary=dictionary,
                                         coherence='c_v')
        coherence = coherence_model.get_coherence()
        print('coherence metric is ', str(coherence))
        coherence_values.append(coherence)

        # Number of topics found by this model
        num_of_topics = len(topic_model.get_topic_info())

        # Append this run's results to the tracking dataframe and write it out
        coherence_df.loc[len(coherence_df.index)] = [num_clusters, coherence, num_of_topics]
        coherence_df.to_csv(dir + 'coherence_updates.csv')
        # Write the document-topic probabilities for this model
        scores.to_json(dir + 'bertopicmodel_score_' + str(num_clusters) + ".zip", compression='zip')
        # Write topics
        with open(dir + 'topics_' + str(num_clusters) + '.csv', 'w') as f:
            write = csv.writer(f)
            write.writerow(topics)

    return coherence_values, model_topic_list, model_probls_list
```
@MaartenGr
Owner
Your code is quite difficult to read, as only parts of it are rendered as code blocks in markdown. Next time, I would advise formatting your code in markdown like this to display the entire code block:

```
Put my code here between triple backticks
```

Having said that, it might be interesting to look at the issue here, where you can find a working version of calculating the coherence score with a larger n-gram range. From what I can see, you should use build_analyzer instead of build_tokenizer to get the same preprocessing steps. If that does not work, let me know and I'll look at it a bit more in-depth.
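For reference, a minimal sketch of the difference between the two (assuming scikit-learn's `CountVectorizer`; the sample sentence is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")

tokenizer = vectorizer.build_tokenizer()  # raw token splitting: unigrams only
analyzer = vectorizer.build_analyzer()    # full preprocessing: stop words + n-grams

doc = "topic models group similar documents"
print(tokenizer(doc))  # ['topic', 'models', 'group', 'similar', 'documents']
print(analyzer(doc))   # the unigrams plus bigrams/trigrams such as 'topic models'
```

In the snippet from the original post, this would mean building the texts with `tokens = [analyzer(doc) for doc in cleaned_docs]`, so the gensim `Dictionary` contains the same n-grams that appear in `topic_model.get_topic(...)`.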

@tawfiqam
Author

Apologies...posting again for clarity. I will update you after taking your suggestion for build_analyzer. Thanks!

Error Message:

```
ValueError                                Traceback (most recent call last)
Input In [17], in
----> 1 coherence_values, model_topic_list, model_probls_list = compute_coherence_values(start=10, step=5, limit=100)

Input In [16], in compute_coherence_values(start, step, limit, coherence_df)
     51 topic_words = [[words for words, _ in topic_model.get_topic(topic)]
     52                for topic in range(len(set(topics))-1)]
     54 # Evaluate
---> 55 coherence_model = CoherenceModel(topics=topic_words,
     56                                  texts=tokens,
     57                                  corpus=corpus,
     58                                  dictionary=dictionary,
     59                                  coherence='c_v')
     61 coherence = coherence_model.get_coherence()
     64 #coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:215, in CoherenceModel.__init__(self, model, topics, texts, corpus, dictionary, window_size, keyed_vectors, coherence, topn, processes)
    213 self._accumulator = None
    214 self._topics = None
--> 215 self.topics = topics
    217 self.processes = processes if processes >= 1 else max(1, mp.cpu_count() - 1)

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:430, in CoherenceModel.topics(self, topics)
    428 new_topics = []
    429 for topic in topics:
--> 430     topic_token_ids = self._ensure_elements_are_ids(topic)
    431     new_topics.append(topic_token_ids)
    433 if self.model is not None:

File ~/.local/lib/python3.8/site-packages/gensim/models/coherencemodel.py:454, in CoherenceModel._ensure_elements_are_ids(self, topic)
    452     return np.array(ids_from_ids)
    453 else:
--> 454     raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')

ValueError: unable to interpret topic as either a list of tokens or a list of ids
```

@tawfiqam
Author

That worked for me! Thanks.
