
Allow use of truncated Dictionary for coherence measures #1349

Merged (34 commits) on Jun 14, 2017

Conversation

@macks22 (Contributor) commented May 22, 2017

Resolves #1342.

Sweeney, Mack added 11 commits May 22, 2017 14:43
…culation by avoiding lookup of tokens not in the topic token lists.
…vant words, and ensure each relevant word has a set in the `per_topic_postings` dict.
…e in strided_windows and iter_windows utility functions
… and add initial impl of accumulator that directly tracks term occurrence and co-occurrence counts
… package for all confirmation measures in the CoherenceModel pipeline
…ord are interpreted as the occurrence; update tests to cover this case; change the p_boolean_sliding_window to use the WordOccurrenceAccumulator; minor cleanup in test_coherencemodel
…oolean_sliding_window; add parameter for CoherenceModel to adjust number of processes used, with default equal to max(1, cpu_count - 1)
@tmylk (Contributor) commented May 29, 2017

Thanks! It is a useful feature.

Could you please add a couple of paragraphs describing the patch: what approaches you tried and why?

May I ask what is the advantage of having the accumulator instead of passing a dictionary? Or just filtering the corpus prior to passing it to coherence using this code from the FAQ?

import copy
from gensim import corpora
from gensim.models import VocabTransform

# filter the dictionary
old_dict = corpora.Dictionary.load('old.dict')
new_dict = copy.deepcopy(old_dict)
new_dict.filter_extremes(keep_n=100000)
new_dict.save('filtered.dict')

# now transform the corpus
corpus = corpora.MmCorpus('corpus.mm')
old2new = {old_dict.token2id[token]:new_id for new_id, token in new_dict.iteritems()}
vt = VocabTransform(old2new)
corpora.MmCorpus.serialize('filtered_corpus.mm', vt[corpus], id2word=new_dict)

Sweeney, Mack and others added 2 commits May 29, 2017 20:56
@macks22 (Contributor, Author) commented May 30, 2017

@tmylk thank you for your comments on the PR. I'm glad it is viewed as a useful feature.

The accumulator design abstracts the underlying mechanism by which word occurrences and co-occurrences are retrieved. The existing inverted-index-based technique is not very memory friendly: I have not been able to use it on the Wikipedia corpus, which is a common reference corpus in the literature. I have pushed code to optimize this, including a new accumulator that uses a scipy.sparse.lil_matrix to accumulate word co-occurrences. This method scales much better in terms of memory.
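A minimal sketch of the lil_matrix idea (hypothetical function and variable names, not the PR's actual text_analysis code): keep occurrence counts on the matrix diagonal and pairwise co-occurrence counts off-diagonal, counting each token at most once per window:

```python
import numpy as np
from scipy import sparse

def accumulate_cooccurrences(windows, relevant_ids, vocab_size):
    # diagonal: number of windows a token occurs in
    # off-diagonal (i < j): number of windows where both tokens occur
    mat = sparse.lil_matrix((vocab_size, vocab_size), dtype=np.int64)
    for window in windows:
        # set() ensures a token repeated within one window counts once
        ids = sorted(set(window) & relevant_ids)
        for i, w1 in enumerate(ids):
            mat[w1, w1] += 1
            for w2 in ids[i + 1:]:
                mat[w1, w2] += 1
    return mat

windows = [[0, 1, 2], [1, 2, 2], [0, 3]]
mat = accumulate_cooccurrences(windows, {0, 1, 2}, 4)
```

Only the upper triangle is filled, which halves the writes; a sparse matrix keeps memory proportional to the number of distinct co-occurring pairs rather than vocab_size squared.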

The accumulator design also leaves the door open for using lucene or some other on-disk inverted index as a backend for retrieving occurrence counts. I have opened another ticket for something like this (#1351).

I've also updated the PR to include several optimizations for the probability estimation phase of coherence calculation. This includes a new strided approach for document windowing (see util module), as well as a parallel multiprocessing implementation of word occurrence accumulation. There is still more that can be done with this (more efficient accumulation of windowed co-occurrence stats), but at least it is now able to take advantage of multiple cores. Finally, I modified the CoherenceModel to cache the accumulator from the last call to get_coherence unless the topics instance variable is set to something that contains ids not in the last set of topics (i.e. the current accumulator does not have counts for some tokens).
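The strided windowing mentioned above can be sketched with numpy stride tricks; this is an illustrative version under assumed names, not the exact code in gensim's util module:

```python
import numpy as np

def strided_windows(tokens, window_size):
    # return overlapping windows of `window_size` as zero-copy
    # read-only views into the underlying token array
    arr = np.asarray(tokens)
    if window_size > arr.shape[0]:
        return np.empty((0, window_size), dtype=arr.dtype)
    stride = arr.strides[0]
    return np.lib.stride_tricks.as_strided(
        arr,
        shape=(arr.shape[0] - window_size + 1, window_size),
        strides=(stride, stride),
        writeable=False)

windows = strided_windows([10, 11, 12, 13, 14], 3)
```

Because each window is a view rather than a copy, building all sliding windows over a document costs O(1) extra memory instead of O(n * window_size).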

I also broke up the various stages of the pipeline into different methods, such that a user can get coherence measures for individual topics and view the results of topic segmentation.

I realize these changes are somewhat beyond the scope of the original issue. It was motivated by your comment in that issue thread, "c_v is the best/recommended coherence so we should make it as fast as possible." I agree with this and was finding "c_v" unusably slow. I hope these contributions can make it more useful for more people.

yield (tokens, (pageid, title))
else:
yield tokens
except KeyboardInterrupt:
Owner:

What is this for? Masking user interrupts is an anti-pattern; deserves a detailed comment, at the very least.

Contributor Author:

I've updated the code to handle the interrupt more appropriately. As for why I chose to handle the interrupt at all (at the risk of repeating the above comment): this pool may be active during many other phases of gensim execution if the underlying corpus/texts being iterated come from the wikicorpus. It was confusing for me to see stdout activity when issuing an interrupt during execution of an entirely different code path.

@piskvorky (Owner) May 30, 2017:

Sorry, not sure I understand.

When iterating over a WikiCorpus, gensim uses a multiprocessing.Pool (forks), yes.

Now when you do CTRL+C, what happened (old behaviour)? And what happens now (after your changes here)? Why is that preferable?

Contributor Author:

Here is the code I am running:

import os
import logging
import gensim
from gensim.models.coherencemodel import CoherenceModel

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

home = '/Users/user/workshop/nlp'
id2word = gensim.corpora.Dictionary.load_from_text(os.path.join(home, 'data', 'wikipedia-may-17_wordids.txt.bz2'))
texts = gensim.corpora.WikiCorpus(os.path.join(home, 'data', 'enwiki-latest-pages-articles.xml.bz2'), dictionary=id2word).get_texts()
lda = gensim.models.LdaModel.load(os.path.join(home, 'models', 'lda-k100_en-wiki.model'))
topics = [[t[0] for t in lda.get_topic_terms(i, 20)] for i in range(100)]
topic_words = [[id2word[token_id] for token_id in t] for t in topics]
cm = CoherenceModel(topics=topic_words, texts=texts, dictionary=id2word, coherence='c_v', topn=20, window_size=110)
topic_coherences = cm.get_coherence_per_topic()
print(topic_coherences)

Before:

2017-05-30 07:34:29,027 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:34:29,028 : INFO : loading expElogbeta from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.expElogbeta.npy with mmap=None
2017-05-30 07:34:29,080 : INFO : setting ignored attribute id2word to None
2017-05-30 07:34:29,080 : INFO : setting ignored attribute state to None
2017-05-30 07:34:29,080 : INFO : setting ignored attribute dispatcher to None
2017-05-30 07:34:29,080 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:34:29,081 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:34:29,249 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:34:32,603 : INFO : using ParallelWordOccurrenceAccumulator(processes=7, batch_size=16) to estimate probabilities from sliding windows
2017-05-30 07:34:33,101 : INFO : submitted 0 batches to accumulate stats from 0 documents (90752 virtual)
2017-05-30 07:34:33,298 : INFO : submitted 1 batches to accumulate stats from 16 documents (154816 virtual)
2017-05-30 07:34:33,462 : INFO : submitted 2 batches to accumulate stats from 32 documents (223497 virtual)
2017-05-30 07:34:33,709 : INFO : submitted 3 batches to accumulate stats from 48 documents (285340 virtual)
2017-05-30 07:34:33,790 : INFO : submitted 4 batches to accumulate stats from 64 documents (342337 virtual)
2017-05-30 07:34:34,297 : INFO : submitted 5 batches to accumulate stats from 80 documents (415139 virtual)
2017-05-30 07:34:34,872 : INFO : submitted 6 batches to accumulate stats from 96 documents (484709 virtual)
2017-05-30 07:34:35,093 : INFO : submitted 7 batches to accumulate stats from 112 documents (542834 virtual)
2017-05-30 07:34:35,381 : INFO : submitted 8 batches to accumulate stats from 128 documents (628469 virtual)
2017-05-30 07:34:35,443 : INFO : submitted 9 batches to accumulate stats from 144 documents (691420 virtual)
2017-05-30 07:34:35,764 : INFO : submitted 10 batches to accumulate stats from 160 documents (741122 virtual)
2017-05-30 07:34:35,983 : INFO : submitted 11 batches to accumulate stats from 176 documents (774924 virtual)
2017-05-30 07:34:36,234 : INFO : submitted 12 batches to accumulate stats from 192 documents (829056 virtual)
2017-05-30 07:34:36,682 : INFO : submitted 13 batches to accumulate stats from 208 documents (887935 virtual)
^C2017-05-30 07:34:48,466 : WARNING : stats accumulation interrupted; <= 887935 documents processed
Process PoolWorker-12:
Process PoolWorker-10:
Process PoolWorker-11:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Process PoolWorker-9:
Traceback (most recent call last):
Process PoolWorker-8:
Process PoolWorker-13:
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Process PoolWorker-14:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    self.run()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self.run()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    self._target(*self._args, **self._kwargs)
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    self.run()
    self.run()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
    self.run()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
    self._target(*self._args, **self._kwargs)
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    self.run()
    self._target(*self._args, **self._kwargs)
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    self._target(*self._args, **self._kwargs)
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
2017-05-30 07:34:48,467 : INFO : AccumulatingWorker interrupted after processing 8598 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 3630 documents
2017-05-30 07:34:48,467 : INFO : AccumulatingWorker interrupted after processing 7670 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 5859 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 9993 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 8987 documents
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 5062 documents
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
    task = get()
    task = get()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
    task = get()
    task = get()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
    task = get()
    task = get()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 378, in get
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
    task = get()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
    racquire()
    racquire()
    racquire()
    racquire()
    racquire()
    racquire()
    return recv()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,471 : INFO : accumulator serialized
2017-05-30 07:34:48,473 : INFO : accumulator serialized
2017-05-30 07:34:48,471 : INFO : accumulator serialized
2017-05-30 07:34:48,473 : INFO : accumulator serialized
2017-05-30 07:34:48,471 : INFO : accumulator serialized
2017-05-30 07:34:48,474 : INFO : accumulator serialized
2017-05-30 07:34:48,474 : INFO : accumulator serialized
2017-05-30 07:34:50,727 : INFO : 7 accumulators retrieved from output queue

After:

2017-05-30 07:29:17,246 : WARNING : Slow version of gensim.models.doc2vec is being used
2017-05-30 07:29:17,253 : INFO : 'pattern' package not found; tag filters are not available for English
2017-05-30 07:29:18,188 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:29:18,198 : INFO : loading expElogbeta from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.expElogbeta.npy with mmap=None
2017-05-30 07:29:18,267 : INFO : setting ignored attribute id2word to None
2017-05-30 07:29:18,267 : INFO : setting ignored attribute state to None
2017-05-30 07:29:18,267 : INFO : setting ignored attribute dispatcher to None
2017-05-30 07:29:18,267 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:29:18,267 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:29:18,422 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:29:20,667 : INFO : using ParallelWordOccurrenceAccumulator(processes=7, batch_size=16) to estimate probabilities from sliding windows
2017-05-30 07:29:21,117 : INFO : submitted 0 batches to accumulate stats from 0 documents (90752 virtual)
2017-05-30 07:29:21,312 : INFO : submitted 1 batches to accumulate stats from 16 documents (154816 virtual)
2017-05-30 07:29:21,441 : INFO : submitted 2 batches to accumulate stats from 32 documents (223497 virtual)
2017-05-30 07:29:21,715 : INFO : submitted 3 batches to accumulate stats from 48 documents (285340 virtual)
2017-05-30 07:29:21,783 : INFO : submitted 4 batches to accumulate stats from 64 documents (342337 virtual)
2017-05-30 07:29:22,270 : INFO : submitted 5 batches to accumulate stats from 80 documents (415139 virtual)
2017-05-30 07:29:22,634 : INFO : submitted 6 batches to accumulate stats from 96 documents (484709 virtual)
2017-05-30 07:29:22,701 : INFO : submitted 7 batches to accumulate stats from 112 documents (542834 virtual)
2017-05-30 07:29:22,995 : INFO : submitted 8 batches to accumulate stats from 128 documents (628469 virtual)
2017-05-30 07:29:23,118 : INFO : submitted 9 batches to accumulate stats from 144 documents (691420 virtual)
2017-05-30 07:29:23,238 : INFO : submitted 10 batches to accumulate stats from 160 documents (741122 virtual)
2017-05-30 07:29:23,336 : INFO : submitted 11 batches to accumulate stats from 176 documents (774924 virtual)
2017-05-30 07:29:23,471 : INFO : submitted 12 batches to accumulate stats from 192 documents (829056 virtual)
2017-05-30 07:29:23,665 : INFO : submitted 13 batches to accumulate stats from 208 documents (887935 virtual)
^C2017-05-30 07:29:58,875 : WARNING : stats accumulation interrupted; <= 887935 documents processed
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 17775 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 27228 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 20910 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 26774 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 31392 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 32224 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 22497 documents
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:30:01,106 : INFO : 7 accumulators retrieved from output queue

Owner:

So the advantage is a cleaner log?
What are the disadvantages? (I am not familiar with this type of functionality, but I assume there must be some disadvantages, otherwise it would be the default behaviour).

@@ -300,22 +305,26 @@ def get_texts(self):
articles, articles_all = 0, 0
positions, positions_all = 0, 0
texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
pool = multiprocessing.Pool(self.processes)
pool = multiprocessing.Pool(self.processes, init_worker)
Owner:

What is this change about? Needs a comment.

Contributor Author:

@piskvorky I was running get_coherence using coherence="c_v" and trying to program it to gracefully handle KeyboardInterrupt. You can see the mechanisms I put in place for this in the text_analysis module in the PR. While doing this, I faced some confusion because some background process was still raising KeyboardInterrupt. After some digging, I noticed that the wikicorpus pool workers were the culprit.

This pool may be active during many other phases of gensim execution if the underlying corpus/texts being iterated come from the wikicorpus. I think it makes things slightly cleaner in this case to handle the KeyboardInterrupt in some manner that does not propagate. Perhaps some sort of logging would improve this?
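The diff's init_worker pattern can be shown standalone; this sketch mirrors the idea (the name init_worker comes from the diff, the rest is illustrative): each worker ignores SIGINT, so Ctrl+C raises KeyboardInterrupt only in the parent, which then shuts the pool down explicitly.

```python
import signal
import multiprocessing

def init_worker():
    # passed as the Pool initializer: runs once in each worker.
    # Ignoring SIGINT here keeps Ctrl+C tracebacks out of the
    # children; the parent handles the interrupt and terminates them.
    signal.signal(signal.SIGINT, signal.SIG_IGN)

if __name__ == "__main__":
    pool = multiprocessing.Pool(2, init_worker)
    try:
        results = pool.map(abs, [-1, -2, 3])
        print(results)
    except KeyboardInterrupt:
        # only the parent sees Ctrl+C; stop workers explicitly
        pool.terminate()
    finally:
        pool.close()
        pool.join()
```

Without the initializer, every worker would raise its own KeyboardInterrupt and print its own traceback, producing the noisy interleaved logs shown earlier in this thread.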

return not current_set.issuperset(new_set)

def _topics_differ(self, new_topics):
return (new_topics is not None and
Owner:

No vertical indent -- please use hanging indent.

Contributor Author:

done

return (new_topics is not None and
self._topics is not None and
self._accumulator is not None and
not np.equal(new_topics, self._topics).all())
Owner:

Not sure what type the arguments are, but np.allclose applicable?

Contributor Author:

They are lists of np.ndarray of ints, so there is no need for np.allclose.

@piskvorky (Owner)

Wow, impressive refactor! Thanks a lot.

Sweeney, Mack added 3 commits May 30, 2017 05:41
…rpus.get_texts`; instead, log warning and do not set `length`.
…and non-empty blank lines in `text_analysis`.
@macks22 (Contributor, Author) commented May 30, 2017

@piskvorky thank you for your review and consideration of this PR. I've responded to your comments and pushed changes. Please let me know if you would like me to make any additional changes and I will gladly do so.

@tmylk (Contributor) commented May 30, 2017

cc @dsquareindia , the original contributor of coherence to the package

Sweeney, Mack added 5 commits May 30, 2017 17:03
…ith repeated counting of tokens that occur more than once in a window.
… module; cleaned up spacing in coherencemodel.
…acking and avoid undue network traffic by moving relevancy filtering and token conversion to the master process.
…sing a `collections.Counter` instance for accumulation within a batch.
@macks22 (Contributor, Author) commented Jun 1, 2017

I've pushed a few further optimizations to the probability estimation phase; the code I posted above in response to @piskvorky can now process the entire Wikipedia corpus in about 5.5 hours on an i7 with 8 cores (with hyperthreading). The speedup is about 16x (from ~2 docs/s to ~32 docs/s).

$ python coherence.py &> coherence_log2.txt
...
python coherence.py &> coherence_log2.txt  142985.89s user 770.75s system 736% cpu 5:25:21.87 total

except Exception as e:
logger.error("worker encountered unexpected exception: %s" % e)
logger.error(traceback.format_exc())
logger.error("worker encountered unexpected exception: %s\n%s",
Owner:

logger.exception simpler?

Contributor Author:

done

q.put(batch, block=True)
before = self._num_docs / self.log_every
self._num_docs += sum(len(doc) - window_size + 1 for doc in batch)
if before < (self._num_docs / self.log_every):
logger.info("submitted %d batches to accumulate stats from %d documents (%d virtual)" % (
batch_num, (batch_num + 1) * self.batch_size, self._num_docs))
logger.info("%d batches submitted to accumulate stats from %d documents (%d "
Owner:

Hanging indent please (here and elsewhere).

Contributor Author:

I'm not sure I understand the desired format here. A hanging indent makes sense when using the % operator to format a string, but my intent here is to use the logging module's lazy formatting instead. In that case the format string and its arguments are all ordinary arguments of the call, so PEP 8 dictates alignment with the opening delimiter.
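For reference, the difference being discussed, shown as a small self-contained demo (the logger name and messages are illustrative):

```python
import io
import logging

logger = logging.getLogger("lazy_demo")
logger.propagate = False
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)

logger.setLevel(logging.WARNING)  # INFO records are discarded
n_batches, n_docs = 13, 887935
# eager: the string is built even though the record is dropped
logger.info("submitted %d batches (%d documents)" % (n_batches, n_docs))
# lazy: the args are stored on the record and interpolated only
# if a handler actually emits it -- no formatting work done here
logger.info("submitted %d batches (%d documents)", n_batches, n_docs)

logger.setLevel(logging.INFO)
logger.info("submitted %d batches (%d documents)", n_batches, n_docs)
output = stream.getvalue()
```

With the lazy form, calls below the active log level cost almost nothing, which matters inside tight accumulation loops like the ones in this PR.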

Sweeney, Mack added 4 commits June 1, 2017 11:47
…he Dictionary mapping to different ids, so fixed the `probability_estimation` tests to be agnostic of this. Also fixed an issue with the interpretation of strings as iterables when getting occurrences of strings in the `text_analysis.BaseAnalyzer.__getitem__` method.
…rencemodel accumulator caching; made model a property with a setter that also sets the topics and uncaches the accumulator if the model's topics have ids not tracked by the accumulator.
@macks22 (Contributor, Author) commented Jun 2, 2017

@piskvorky I have replied to your requests for changes and made the PR compatible with python3. Thanks again for the feedback! Are there any additional changes you would like?

def setUp(self):
self.dictionary = BaseTestCases.TextAnalyzerTestBase.dictionary
self.top_ids = BaseTestCases.TextAnalyzerTestBase.top_ids
self.corpus = [self.dictionary.doc2bow(doc)
Owner:

Hanging indent please (here and elsewhere).

Contributor Author:

this

        self.corpus = \
            [self.dictionary.doc2bow(doc) for doc in BaseTestCases.TextAnalyzerTestBase.texts]

or this?

        self.corpus = [
            self.dictionary.doc2bow(doc) for doc in BaseTestCases.TextAnalyzerTestBase.texts]


def test_index_accumulation(self):
accumulator = CorpusAccumulator(self.top_ids)\
.accumulate(self.corpus)
Owner:

Why this line break? Better keep on a single line (more readable).

Contributor Author (@macks22, Jun 5, 2017):

It was for consistency with the other places, where the call did not fit well on a single line (too long). I've changed to a single line.

" (total %i articles, %i positions before pruning articles shorter than %i words)",
articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
self.length = articles # cache corpus length
texts = ((text, self.lemmatize, title, pageid)
Owner:

Hanging indent.

Contributor Author:

When you ask for hanging indents for list comprehensions, can you give me an example of the style you prefer? Below are a few alternatives that come to mind. The first is the one I would assume you are asking for, but I've seen all of these before, so just want to clarify.

        texts = (
            (text, self.lemmatize, title, pageid)
            for title, text, pageid
            in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))

or

        texts = (
            (text, self.lemmatize, title, pageid)
            for title, text, pageid
            in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces)
        )

or

        texts = \
            ((text, self.lemmatize, title, pageid)
             for title, text, pageid
             in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))

texts = ((text, self.lemmatize, title, pageid)
for title, text, pageid
in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
pool = multiprocessing.Pool(self.processes, init_to_ignore_interrupt)
Owner:

Shouldn't the original interrupts be restored, once the pool is over (exception/terminated).

Contributor Author:

The interrupts are only disabled in the children, which are terminated with the Pool. (I'm not sure why, but there is no place to reply to your comment about the disadvantages, so I'll answer here.)

The disadvantage, to my understanding, is that termination of the children must now be performed by the parent, instead of occurring exactly when the KeyboardInterrupt is sent to them. In this case, that is done explicitly by calling pool.terminate(). According to the docs, even if a second KeyboardInterrupt is issued just after the first and the explicit call is not made, the GC will take care of calling it.

The advantage is a cleaner log, yes, but also not leaking the details of the wikicorpus iteration (multiprocessing or not), which can confuse developers and users when they interrupt some other part of the code.

Owner:

I see, thanks.


'c_npmi': make_pipeline(segmentation.s_one_one,
COHERENCE_MEASURES = {
'u_mass': _make_pipeline(segmentation.s_one_pre,
probability_estimation.p_boolean_document,
Owner:

Hanging indent (here and other places below).

Contributor Author:

done


def _topics_differ(self, new_topics):
return (new_topics is not None and
self._topics is not None and
Owner:

The indentation looks weird... a single line better?

Contributor Author:

This is, IMO, more confusing if limited to one line. I changed this to the standard hanging indent (8 spaces).

def _topics_differ(self, new_topics):
return (new_topics is not None and
self._topics is not None and
not np.array_equal(new_topics, self._topics))
Owner:

Exact equality or "numerically close" (np.allclose) desired?

Contributor Author

exact, since these are integers
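The exactness point is easy to check: np.array_equal performs exact element-wise comparison, which is well defined for integer topic ids, so a tolerance-based check like np.allclose is unnecessary here. A quick illustration on toy arrays (not PR data):

```python
import numpy as np

# Toy integer "topic" arrays.
a = np.array([[1, 2, 3], [4, 5, 6]])
b = a.copy()
c = np.array([[1, 2, 3], [4, 5, 7]])

# Exact comparison: identical integer arrays compare equal,
# any differing element makes the result False.
print(np.array_equal(a, b))  # True
print(np.array_equal(a, c))  # False
```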

top_ids = set(token2id.values())

texts2 = [['human', 'interface', 'computer'],
          ['survey', 'user', 'computer', 'system', 'response', 'time'],
Owner

Hanging indent.

Contributor Author

done

    accumulator = WordOccurrenceAccumulator(top_ids, dictionary)
else:
    accumulator = ParallelWordOccurrenceAccumulator(processes, top_ids, dictionary)
logger.info("using %s to estimate probabilities from sliding windows" % accumulator)
Owner

Pass argument as a fnc argument, instead of formatting the string directly with %.

Contributor Author

done
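The review point here is the logging module's lazy interpolation: pass the object as an argument instead of pre-formatting with %, and the message is only built if the record is actually emitted. A minimal sketch (the ListHandler and the string stand-in are illustrative, not part of the PR):

```python
import logging

logger = logging.getLogger("coherence_demo")
logger.setLevel(logging.INFO)


class ListHandler(logging.Handler):
    """Collect formatted messages in a list so the result is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # getMessage() interpolates the deferred %-style arguments.
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)

accumulator = "WordOccurrenceAccumulator"  # stand-in for the real accumulator object

# Lazy style: "%s" plus a separate argument; logging interpolates it
# only when the record is emitted, so disabled levels cost nothing.
logger.info("using %s to estimate probabilities from sliding windows", accumulator)

print(handler.messages[0])
```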

Sweeney, Mack added 2 commits June 5, 2017 11:08
… to return individual topic coherence values, then average those. Make the `ParallelWordOccurrenceAccumulator` return a `WordOccurrenceAccumulator` after accumulation, so it can be trained further afterwards if desired.
… individual topic coherence values, then average those.
@macks22
Contributor Author

macks22 commented Jun 8, 2017

@piskvorky I believe I've addressed all the change requests from your last review. Thank you for the review. Is there anything else you'd like changed?

Owner

@piskvorky piskvorky left a comment


Minor code style comments.

vw_path = os.environ.get('VOWPAL_WABBIT_PATH', None)
if not vw_path:
    msg = "Environment variable 'VOWPAL_WABBIT_PATH' not specified, skipping sanity checks for LDA Model"
    logging.info(msg)
    logging.info("Environment variable 'VOWPAL_WABBIT_PATH' not specified,"
Owner

Hanging indent please.

@@ -63,162 +55,201 @@ def setUp(self):
['graph', 'minors', 'trees', 'eps']]
self.topics2 = [['user', 'graph', 'minors', 'system'],
['time', 'graph', 'survey', 'minors']]
self.ldamodel = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=0, iterations=0)
self.ldamodel = LdaModel(corpus=self.corpus, id2word=self.dictionary, num_topics=2,
Owner

Hanging indent please.


class TestCoherenceModel(unittest.TestCase):

    # set up vars used in testing ("Deerwester" from the web tutorial)
    texts = [['human', 'interface', 'computer'],
Owner

Hanging indent please.


def testErrors(self):
    """Test if errors are raised on bad input"""
    # not providing dictionary
    self.assertRaises(ValueError, CoherenceModel, topics=self.topics1, corpus=corpus, coherence='u_mass')
    self.assertRaises(ValueError, CoherenceModel, topics=self.topics1, corpus=self.corpus,
Owner

Hanging indent (here and below).

for s_i in segmented_topics:
    for word_id in itertools.chain.from_iterable(s_i):
        if hasattr(word_id, '__iter__'):
            top_ids = top_ids.union(word_id)
Owner

update() simpler?

Contributor Author

yes, good call
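The update() suggestion, sketched on made-up segment data (the nested list below stands in for real segmented topics; it is not from the PR's tests):

```python
import itertools

# Toy stand-in for segmented topics: a list of segment lists, where
# entries are integer word ids or small iterables of them.
segmented_topics = [[(1, 2), ([3, 4], 5)], [(2, 6)]]

top_ids = set()
for s_i in segmented_topics:
    for word_id in itertools.chain.from_iterable(s_i):
        if hasattr(word_id, '__iter__'):
            # update() mutates the set in place, avoiding the throwaway
            # intermediate set that union() builds on every iteration.
            top_ids.update(word_id)
        else:
            top_ids.add(word_id)

print(sorted(top_ids))  # [1, 2, 3, 4, 5, 6]
```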

…for unique ids from topic segments.
@macks22
Contributor Author

macks22 commented Jun 8, 2017

@piskvorky the most recent review has been addressed and all requested changes have been made.

@@ -70,7 +70,7 @@ def unique_ids_from_segments(segmented_topics):
for s_i in segmented_topics:
    for word_id in itertools.chain.from_iterable(s_i):
        if hasattr(word_id, '__iter__'):
Owner

Are the ids guaranteed to be integers?

Because if they're strings, the __iter__ could get nasty (you can iterate over strings, yielding individual characters).

Contributor Author

@macks22 macks22 Jun 9, 2017


Yes, they will always be integers; I have updated the docstrings to better reflect this. The segmented topics will always come from the segmentation module, and the setter in the CoherenceModel ensures the segmentation module receives lists of integer arrays to be segmented.
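The string pitfall mentioned above is easy to demonstrate in Python 3: integers lack __iter__, but strings have it and iterate character by character, which is why the hasattr check is only safe once ids are guaranteed to be integers. A quick illustration:

```python
word_id_int = 7        # what the segmentation module actually produces
word_id_str = "graph"  # what would break the check if ids were tokens

# Integers are not iterable, so the hasattr test correctly skips them.
print(hasattr(word_id_int, '__iter__'))  # False

# Strings ARE iterable in Python 3, yielding single characters,
# so a string id would be exploded into characters by update().
print(hasattr(word_id_str, '__iter__'))  # True
print(list(word_id_str))  # ['g', 'r', 'a', 'p', 'h']
```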

@macks22
Contributor Author

macks22 commented Jun 13, 2017

@piskvorky @tmylk Anything else to change for this? Thanks!

@devashishd12
Contributor

Hi @macks22! Thanks for this amazing overhaul! It would be fun to see how much of a performance gain we're getting now compared to the previous code. It might also make for an interesting blog post.

@piskvorky
Owner

piskvorky commented Jun 14, 2017

+1 on @dsquareindia's idea. How will users find out about this cool new functionality?

Some sort of Notebook tutorial (motivation, quick start) with a benchmark would be awesome.

@menshikh-iv
Contributor

Sorry for the late response; your PR is great 👍
It would be nice if you added a new notebook with a benchmark (or modified the existing notebooks that use CoherenceModel).

Thank you for the wonderful work @macks22 🥇

@menshikh-iv menshikh-iv merged commit 1862953 into piskvorky:develop Jun 14, 2017
@macks22
Contributor Author

macks22 commented Jun 14, 2017

@dsquareindia @piskvorky @menshikh-iv
Thank you all for the feedback. I'll see if I can update the benchmark notebook and also create a new notebook to showcase some of the new stuff.

@menshikh-iv
Contributor

menshikh-iv commented Jun 19, 2017

Hi @macks22,
I have problems with the tests on Windows with your PR (on macOS and Linux it works correctly):

=====================================================================
FAIL: test_occurrence_counting (gensim.test.test_text_analysis.TestParallelWordOccurrenceAccumulator)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\gensim\test\test_text_analysis.py", line 57, in test_occurrence_counting
    self.assertEqual(3, accumulator.get_occurrences("this"))
AssertionError: 3 != 0
-------------------- >> begin captured logging << --------------------
gensim.topic_coherence.text_analysis: INFO: 1 batches submitted to accumulate stats from 64 documents (3 virtual)
gensim.topic_coherence.text_analysis: INFO: 2 accumulators retrieved from output queue
gensim.topic_coherence.text_analysis: INFO: accumulated word occurrence stats for 4 virtual documents
--------------------- >> end captured logging << ---------------------
======================================================================
FAIL: test_occurrence_counting2 (gensim.test.test_text_analysis.TestParallelWordOccurrenceAccumulator)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\gensim\test\test_text_analysis.py", line 67, in test_occurrence_counting2
    self.assertEqual(2, accumulator.get_occurrences("human"))
AssertionError: 2 != 0
-------------------- >> begin captured logging << --------------------
gensim.topic_coherence.text_analysis: INFO: 2 accumulators retrieved from output queue
gensim.topic_coherence.text_analysis: INFO: accumulated word occurrence stats for 10 virtual documents
--------------------- >> end captured logging << ---------------------

Can you fix this?

@macks22
Contributor Author

macks22 commented Jun 21, 2017

@menshikh-iv that's strange; I can't think of what might be causing it. I can try to replicate it on a VM, but I likely won't be able to find time to set up a Windows dev environment until this weekend.
