
Allow use of truncated Dictionary for coherence measures #1349

Merged (34 commits) on Jun 14, 2017

Conversation

@macks22 (Contributor) commented May 22, 2017

Resolves #1342.

Sweeney, Mack added 11 commits May 22, 2017 14:43
…culation by avoiding lookup of tokens not in the topic token lists.
…vant words, and ensure each relevant word has a set in the `per_topic_postings` dict.
…e in strided_windows and iter_windows utility functions
… and add initial impl of accumulator that directly tracks term occurrence and co-occurrence counts
… package for all confirmation measures in the CoherenceModel pipeline
…ord are interpreted as the occurrence; update tests to cover this case; change the p_boolean_sliding_window to use the WordOccurrenceAccumulator; minor cleanup in test_coherencemodel
…oolean_sliding_window; add parameter for CoherenceModel to adjust number of processes used, with default equal to max(1, cpu_count - 1)
@tmylk (Contributor) commented May 29, 2017

Thanks! It is a useful feature.

Could you please add a couple of paragraphs describing the patch: what approaches you tried and why?

May I ask what is the advantage of having the accumulator instead of passing a dictionary? Or just filtering the corpus prior to passing it to coherence using this code from the FAQ?

import copy
from gensim import corpora
from gensim.models import VocabTransform

# filter the dictionary
old_dict = corpora.Dictionary.load('old.dict')
new_dict = copy.deepcopy(old_dict)
new_dict.filter_extremes(keep_n=100000)
new_dict.save('filtered.dict')

# now transform the corpus
corpus = corpora.MmCorpus('corpus.mm')
old2new = {old_dict.token2id[token]:new_id for new_id, token in new_dict.iteritems()}
vt = VocabTransform(old2new)
corpora.MmCorpus.serialize('filtered_corpus.mm', vt[corpus], id2word=new_dict)

Sweeney, Mack and others added 2 commits May 29, 2017 20:56
@macks22 (Contributor, Author) commented May 30, 2017

@tmylk thank you for your comments on the PR. I'm glad it is viewed as a useful feature.

The accumulator design abstracts the underlying mechanism by which word occurrences and co-occurrences are retrieved. The existing inverted-index-based technique is not very memory friendly: I have not been able to use it on the Wikipedia corpus, which is a common reference corpus in the literature. I have pushed code to optimize this, including a new accumulator that uses a scipy.sparse.lil_matrix to accumulate word co-occurrences. This method scales much better in terms of memory.
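A minimal sketch of the lil_matrix idea (hypothetical function and variable names, not the PR's actual text_analysis code): keep occurrence counts on the matrix diagonal and pairwise co-occurrence counts off-diagonal, counting each token at most once per window:

```python
import numpy as np
from scipy import sparse

def accumulate_cooccurrences(windows, relevant_ids, vocab_size):
    # diagonal: number of windows a token occurs in
    # off-diagonal (i < j): number of windows where both tokens occur
    mat = sparse.lil_matrix((vocab_size, vocab_size), dtype=np.int64)
    for window in windows:
        # set() ensures a token repeated within one window counts once
        ids = sorted(set(window) & relevant_ids)
        for i, w1 in enumerate(ids):
            mat[w1, w1] += 1
            for w2 in ids[i + 1:]:
                mat[w1, w2] += 1
    return mat

windows = [[0, 1, 2], [1, 2, 2], [0, 3]]
mat = accumulate_cooccurrences(windows, {0, 1, 2}, 4)
```

Only the upper triangle is filled, which halves the writes; a sparse matrix keeps memory proportional to the number of distinct co-occurring pairs rather than vocab_size squared.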

The accumulator design also leaves the door open for using lucene or some other on-disk inverted index as a backend for retrieving occurrence counts. I have opened another ticket for something like this (#1351).

I've also updated the PR to include several optimizations for the probability estimation phase of coherence calculation. This includes a new strided approach for document windowing (see util module), as well as a parallel multiprocessing implementation of word occurrence accumulation. There is still more that can be done with this (more efficient accumulation of windowed co-occurrence stats), but at least it is now able to take advantage of multiple cores. Finally, I modified the CoherenceModel to cache the accumulator from the last call to get_coherence unless the topics instance variable is set to something that contains ids not in the last set of topics (i.e. the current accumulator does not have counts for some tokens).
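The strided windowing mentioned above can be sketched with numpy stride tricks; this is an illustrative version under assumed names, not the exact code in gensim's util module:

```python
import numpy as np

def strided_windows(tokens, window_size):
    # return overlapping windows of `window_size` as zero-copy
    # read-only views into the underlying token array
    arr = np.asarray(tokens)
    if window_size > arr.shape[0]:
        return np.empty((0, window_size), dtype=arr.dtype)
    stride = arr.strides[0]
    return np.lib.stride_tricks.as_strided(
        arr,
        shape=(arr.shape[0] - window_size + 1, window_size),
        strides=(stride, stride),
        writeable=False)

windows = strided_windows([10, 11, 12, 13, 14], 3)
```

Because each window is a view rather than a copy, building all sliding windows over a document costs O(1) extra memory instead of O(n * window_size).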

I also broke up the various stages of the pipeline into different methods, such that a user can get coherence measures for individual topics and view the results of topic segmentation.

I realize these changes are somewhat beyond the scope of the original issue. It was motivated by your comment in that issue thread, "c_v is the best/recommended coherence so we should make it as fast as possible." I agree with this and was finding "c_v" unusably slow. I hope these contributions can make it more useful for more people.

yield (tokens, (pageid, title))
else:
yield tokens
except KeyboardInterrupt:
Owner:

What is this for? Masking user interrupts is an anti-pattern; deserves a detailed comment, at the very least.

Contributor Author:

I've updated the code to handle the interrupt more appropriately. As for why I chose to handle the interrupt at all (at the risk of repeating the above comment): this pool may be active during many other phases of gensim execution if the underlying corpus/texts being iterated come from the wikicorpus. It was confusing for me to see stdout activity when issuing an interrupt during execution of an entirely different code path.

@piskvorky (Owner) May 30, 2017:

Sorry, not sure I understand.

When iterating over a WikiCorpus, gensim uses a multiprocessing.Pool (forks), yes.

Now when you do CTRL+C, what happened (old behaviour)? And what happens now (after your changes here)? Why is that preferable?

Contributor Author:

Here is the code I am running:

import os
import logging
import gensim
from gensim.models.coherencemodel import CoherenceModel

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

home = '/Users/user/workshop/nlp'
id2word = gensim.corpora.Dictionary.load_from_text(os.path.join(home, 'data', 'wikipedia-may-17_wordids.txt.bz2'))
texts = gensim.corpora.WikiCorpus(os.path.join(home, 'data', 'enwiki-latest-pages-articles.xml.bz2'), dictionary=id2word).get_texts()
lda = gensim.models.LdaModel.load(os.path.join(home, 'models', 'lda-k100_en-wiki.model'))
topics = [[t[0] for t in lda.get_topic_terms(i, 20)] for i in range(100)]
topic_words = [[id2word[token_id] for token_id in t] for t in topics]
cm = CoherenceModel(topics=topic_words, texts=texts, dictionary=id2word, coherence='c_v', topn=20, window_size=110)
topic_coherences = cm.get_coherence_per_topic()
print(topic_coherences)

Before:

2017-05-30 07:34:29,027 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:34:29,028 : INFO : loading expElogbeta from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.expElogbeta.npy with mmap=None
2017-05-30 07:34:29,080 : INFO : setting ignored attribute id2word to None
2017-05-30 07:34:29,080 : INFO : setting ignored attribute state to None
2017-05-30 07:34:29,080 : INFO : setting ignored attribute dispatcher to None
2017-05-30 07:34:29,080 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:34:29,081 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:34:29,249 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:34:32,603 : INFO : using ParallelWordOccurrenceAccumulator(processes=7, batch_size=16) to estimate probabilities from sliding windows
2017-05-30 07:34:33,101 : INFO : submitted 0 batches to accumulate stats from 0 documents (90752 virtual)
2017-05-30 07:34:33,298 : INFO : submitted 1 batches to accumulate stats from 16 documents (154816 virtual)
2017-05-30 07:34:33,462 : INFO : submitted 2 batches to accumulate stats from 32 documents (223497 virtual)
2017-05-30 07:34:33,709 : INFO : submitted 3 batches to accumulate stats from 48 documents (285340 virtual)
2017-05-30 07:34:33,790 : INFO : submitted 4 batches to accumulate stats from 64 documents (342337 virtual)
2017-05-30 07:34:34,297 : INFO : submitted 5 batches to accumulate stats from 80 documents (415139 virtual)
2017-05-30 07:34:34,872 : INFO : submitted 6 batches to accumulate stats from 96 documents (484709 virtual)
2017-05-30 07:34:35,093 : INFO : submitted 7 batches to accumulate stats from 112 documents (542834 virtual)
2017-05-30 07:34:35,381 : INFO : submitted 8 batches to accumulate stats from 128 documents (628469 virtual)
2017-05-30 07:34:35,443 : INFO : submitted 9 batches to accumulate stats from 144 documents (691420 virtual)
2017-05-30 07:34:35,764 : INFO : submitted 10 batches to accumulate stats from 160 documents (741122 virtual)
2017-05-30 07:34:35,983 : INFO : submitted 11 batches to accumulate stats from 176 documents (774924 virtual)
2017-05-30 07:34:36,234 : INFO : submitted 12 batches to accumulate stats from 192 documents (829056 virtual)
2017-05-30 07:34:36,682 : INFO : submitted 13 batches to accumulate stats from 208 documents (887935 virtual)
^C2017-05-30 07:34:48,466 : WARNING : stats accumulation interrupted; <= 887935 documents processed
Process PoolWorker-12:
Process PoolWorker-10:
Process PoolWorker-11:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Process PoolWorker-9:
Traceback (most recent call last):
Process PoolWorker-8:
Process PoolWorker-13:
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Process PoolWorker-14:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    self.run()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self.run()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    self._target(*self._args, **self._kwargs)
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    self.run()
    self.run()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
    self.run()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
    self._target(*self._args, **self._kwargs)
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    self.run()
    self._target(*self._args, **self._kwargs)
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    self._target(*self._args, **self._kwargs)
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
2017-05-30 07:34:48,467 : INFO : AccumulatingWorker interrupted after processing 8598 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 3630 documents
2017-05-30 07:34:48,467 : INFO : AccumulatingWorker interrupted after processing 7670 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 5859 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 9993 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 8987 documents
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 5062 documents
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
    task = get()
    task = get()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
    task = get()
    task = get()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
    task = get()
    task = get()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 378, in get
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
    task = get()
  File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
    racquire()
    racquire()
    racquire()
    racquire()
    racquire()
    racquire()
    return recv()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,471 : INFO : accumulator serialized
2017-05-30 07:34:48,473 : INFO : accumulator serialized
2017-05-30 07:34:48,471 : INFO : accumulator serialized
2017-05-30 07:34:48,473 : INFO : accumulator serialized
2017-05-30 07:34:48,471 : INFO : accumulator serialized
2017-05-30 07:34:48,474 : INFO : accumulator serialized
2017-05-30 07:34:48,474 : INFO : accumulator serialized
2017-05-30 07:34:50,727 : INFO : 7 accumulators retrieved from output queue

After:

2017-05-30 07:29:17,246 : WARNING : Slow version of gensim.models.doc2vec is being used
2017-05-30 07:29:17,253 : INFO : 'pattern' package not found; tag filters are not available for English
2017-05-30 07:29:18,188 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:29:18,198 : INFO : loading expElogbeta from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.expElogbeta.npy with mmap=None
2017-05-30 07:29:18,267 : INFO : setting ignored attribute id2word to None
2017-05-30 07:29:18,267 : INFO : setting ignored attribute state to None
2017-05-30 07:29:18,267 : INFO : setting ignored attribute dispatcher to None
2017-05-30 07:29:18,267 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:29:18,267 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:29:18,422 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:29:20,667 : INFO : using ParallelWordOccurrenceAccumulator(processes=7, batch_size=16) to estimate probabilities from sliding windows
2017-05-30 07:29:21,117 : INFO : submitted 0 batches to accumulate stats from 0 documents (90752 virtual)
2017-05-30 07:29:21,312 : INFO : submitted 1 batches to accumulate stats from 16 documents (154816 virtual)
2017-05-30 07:29:21,441 : INFO : submitted 2 batches to accumulate stats from 32 documents (223497 virtual)
2017-05-30 07:29:21,715 : INFO : submitted 3 batches to accumulate stats from 48 documents (285340 virtual)
2017-05-30 07:29:21,783 : INFO : submitted 4 batches to accumulate stats from 64 documents (342337 virtual)
2017-05-30 07:29:22,270 : INFO : submitted 5 batches to accumulate stats from 80 documents (415139 virtual)
2017-05-30 07:29:22,634 : INFO : submitted 6 batches to accumulate stats from 96 documents (484709 virtual)
2017-05-30 07:29:22,701 : INFO : submitted 7 batches to accumulate stats from 112 documents (542834 virtual)
2017-05-30 07:29:22,995 : INFO : submitted 8 batches to accumulate stats from 128 documents (628469 virtual)
2017-05-30 07:29:23,118 : INFO : submitted 9 batches to accumulate stats from 144 documents (691420 virtual)
2017-05-30 07:29:23,238 : INFO : submitted 10 batches to accumulate stats from 160 documents (741122 virtual)
2017-05-30 07:29:23,336 : INFO : submitted 11 batches to accumulate stats from 176 documents (774924 virtual)
2017-05-30 07:29:23,471 : INFO : submitted 12 batches to accumulate stats from 192 documents (829056 virtual)
2017-05-30 07:29:23,665 : INFO : submitted 13 batches to accumulate stats from 208 documents (887935 virtual)
^C2017-05-30 07:29:58,875 : WARNING : stats accumulation interrupted; <= 887935 documents processed
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 17775 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 27228 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 20910 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 26774 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 31392 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 32224 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 22497 documents
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:30:01,106 : INFO : 7 accumulators retrieved from output queue

Owner:

So the advantage is a cleaner log?
What are the disadvantages? (I am not familiar with this type of functionality, but I assume there must be some disadvantages, otherwise it would be the default behaviour).

@@ -300,22 +305,26 @@ def get_texts(self):
articles, articles_all = 0, 0
positions, positions_all = 0, 0
texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
pool = multiprocessing.Pool(self.processes)
pool = multiprocessing.Pool(self.processes, init_worker)
Owner:

What is this change about? Needs a comment.

Contributor Author:

@piskvorky I was running get_coherence using coherence="c_v" and trying to program it to gracefully handle KeyboardInterrupt. You can see the mechanisms I put in place for this in the text_analysis module in the PR. While doing this, I faced some confusion because some background process was still raising KeyboardInterrupt. After some digging, I noticed that the wikicorpus pool workers were the culprit.

This pool may be active during many other phases of gensim execution if the underlying corpus/texts being iterated come from the wikicorpus. I think it makes things slightly cleaner in this case to handle the KeyboardInterrupt in some manner that does not propagate. Perhaps some sort of logging would improve this?
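The diff's init_worker pattern can be shown standalone; this sketch mirrors the idea (the name init_worker comes from the diff, the rest is illustrative): each worker ignores SIGINT, so Ctrl+C raises KeyboardInterrupt only in the parent, which then shuts the pool down explicitly.

```python
import signal
import multiprocessing

def init_worker():
    # passed as the Pool initializer: runs once in each worker.
    # Ignoring SIGINT here keeps Ctrl+C tracebacks out of the
    # children; the parent handles the interrupt and terminates them.
    signal.signal(signal.SIGINT, signal.SIG_IGN)

if __name__ == "__main__":
    pool = multiprocessing.Pool(2, init_worker)
    try:
        results = pool.map(abs, [-1, -2, 3])
        print(results)
    except KeyboardInterrupt:
        # only the parent sees Ctrl+C; stop workers explicitly
        pool.terminate()
    finally:
        pool.close()
        pool.join()
```

Without the initializer, every worker would raise its own KeyboardInterrupt and print its own traceback, producing the noisy interleaved logs shown earlier in this thread.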

return not current_set.issuperset(new_set)

def _topics_differ(self, new_topics):
return (new_topics is not None and
Owner:

No vertical indent -- please use hanging indent.

Contributor Author:

done

return (new_topics is not None and
self._topics is not None and
self._accumulator is not None and
not np.equal(new_topics, self._topics).all())
Owner:

Not sure what type the arguments are, but np.allclose applicable?

Contributor Author:

They are lists of np.ndarray of ints, so there is no need for np.allclose.

@piskvorky (Owner)

Wow, impressive refactor! Thanks a lot.

Sweeney, Mack added 3 commits May 30, 2017 05:41
…rpus.get_texts`; instead, log warning and do not set `length`.
…and non-empty blank lines in `text_analysis`.
@macks22 (Contributor, Author) commented May 30, 2017

@piskvorky thank you for your review and consideration of this PR. I've responded to your comments and pushed changes. Please let me know if you would like me to make any additional changes and I will gladly do so.

@tmylk (Contributor) commented May 30, 2017

cc @dsquareindia , the original contributor of coherence to the package

Sweeney, Mack added 5 commits May 30, 2017 17:03
…ith repeated counting of tokens that occur more than once in a window.
… module; cleaned up spacing in coherencemodel.
…acking and avoid undue network traffic by moving relevancy filtering and token conversion to the master process.
…sing a `collections.Counter` instance for accumulation within a batch.
@macks22 (Contributor, Author) commented Jun 1, 2017

I've pushed a few further optimizations to the probability estimation phase; the code I posted above in response to @piskvorky can now process the entire Wikipedia corpus in about 5.5 hours on an i7 with 8 cores (with hyperthreading). The speedup is about 16x (from ~2 docs/s to ~32 docs/s).

$ python coherence.py &> coherence_log2.txt
...
python coherence.py &> coherence_log2.txt  142985.89s user 770.75s system 736% cpu 5:25:21.87 total

except Exception as e:
logger.error("worker encountered unexpected exception: %s" % e)
logger.error(traceback.format_exc())
logger.error("worker encountered unexpected exception: %s\n%s",
Owner:

logger.exception simpler?

Contributor Author:

done

q.put(batch, block=True)
before = self._num_docs / self.log_every
self._num_docs += sum(len(doc) - window_size + 1 for doc in batch)
if before < (self._num_docs / self.log_every):
logger.info("submitted %d batches to accumulate stats from %d documents (%d virtual)" % (
batch_num, (batch_num + 1) * self.batch_size, self._num_docs))
logger.info("%d batches submitted to accumulate stats from %d documents (%d "
Owner:

Hanging indent please (here and elsewhere).

Contributor Author:

I'm not sure I understand the desired format here. A hanging indent makes sense when using the % operator to format a string, but my intent here is to use the logging module's lazy formatting instead. In that case the format string and its arguments are all ordinary arguments of the call, so PEP 8 dictates alignment with the opening delimiter.
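For reference, the difference being discussed, shown as a small self-contained demo (the logger name and messages are illustrative):

```python
import io
import logging

logger = logging.getLogger("lazy_demo")
logger.propagate = False
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)

logger.setLevel(logging.WARNING)  # INFO records are discarded
n_batches, n_docs = 13, 887935
# eager: the string is built even though the record is dropped
logger.info("submitted %d batches (%d documents)" % (n_batches, n_docs))
# lazy: the args are stored on the record and interpolated only
# if a handler actually emits it -- no formatting work done here
logger.info("submitted %d batches (%d documents)", n_batches, n_docs)

logger.setLevel(logging.INFO)
logger.info("submitted %d batches (%d documents)", n_batches, n_docs)
output = stream.getvalue()
```

With the lazy form, calls below the active log level cost almost nothing, which matters inside tight accumulation loops like the ones in this PR.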

Sweeney, Mack added 4 commits June 1, 2017 11:47
…he Dictionary mapping to different ids, so fixed the `probability_estimation` tests to be agnostic of this. Also fixed an issue with the interpretation of strings as iterables when getting occurrences of strings in the `text_analysis.BaseAnalyzer.__getitem__` method.
…rencemodel accumulator caching; made model a property with a setter that also sets the topics and uncaches the accumulator if the model's topics have ids not tracked by the accumulator.
@macks22 (Contributor, Author) commented Jun 2, 2017

@piskvorky I have replied to your requests for changes and made the PR compatible with python3. Thanks again for the feedback! Are there any additional changes you would like?

def setUp(self):
self.dictionary = BaseTestCases.TextAnalyzerTestBase.dictionary
self.top_ids = BaseTestCases.TextAnalyzerTestBase.top_ids
self.corpus = [self.dictionary.doc2bow(doc)
Owner:

Hanging indent please (here and elsewhere).

Contributor Author:

this

        self.corpus = \
            [self.dictionary.doc2bow(doc) for doc in BaseTestCases.TextAnalyzerTestBase.texts]

or this?

        self.corpus = [
            self.dictionary.doc2bow(doc) for doc in BaseTestCases.TextAnalyzerTestBase.texts]


def test_index_accumulation(self):
accumulator = CorpusAccumulator(self.top_ids)\
.accumulate(self.corpus)
Owner:

Why this line break? Better keep on a single line (more readable).

Contributor Author (@macks22, Jun 5, 2017):

It was for consistency with the other places, where the call did not fit well on a single line (too long). I've changed to a single line.

" (total %i articles, %i positions before pruning articles shorter than %i words)",
articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
self.length = articles # cache corpus length
texts = ((text, self.lemmatize, title, pageid)
Owner:

Hanging indent.

Contributor Author:

When you ask for hanging indents for list comprehensions, can you give me an example of the style you prefer? Below are a few alternatives that come to mind. The first is the one I would assume you are asking for, but I've seen all of these before, so just want to clarify.

        texts = (
            (text, self.lemmatize, title, pageid)
            for title, text, pageid
            in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))

or

        texts = (
            (text, self.lemmatize, title, pageid)
            for title, text, pageid
            in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces)
        )

or

        texts = \
            ((text, self.lemmatize, title, pageid)
             for title, text, pageid
             in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))

texts = ((text, self.lemmatize, title, pageid)
for title, text, pageid
in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
pool = multiprocessing.Pool(self.processes, init_to_ignore_interrupt)
Owner:

Shouldn't the original interrupts be restored, once the pool is over (exception/terminated).

Contributor Author:

The interrupts are only disabled in the children, which are terminated with the Pool. (I'm not sure why, but there is no place to reply to your comment about the disadvantages, so I'll answer here.)

The disadvantage, to my understanding, is that termination of the children must now be performed by the parent, instead of occurring exactly when the KeyboardInterrupt is sent to them. In this case, that is done explicitly by calling pool.terminate(). According to the docs, even if a second KeyboardInterrupt is issued just after the first and the explicit call is not made, the GC will take care of calling it.

The advantage is a cleaner log, yes, but also not leaking the details of the wikicorpus iteration (multiprocessing or not), which can confuse developers and users when they interrupt some other part of the code.

Owner:

I see, thanks.


'c_npmi': make_pipeline(segmentation.s_one_one,
COHERENCE_MEASURES = {
'u_mass': _make_pipeline(segmentation.s_one_pre,
probability_estimation.p_boolean_document,
Owner:

Hanging indent (here and other places below).

Contributor Author:

done


def _topics_differ(self, new_topics):
return (new_topics is not None and
self._topics is not None and
Owner:

The indentation looks weird... a single line better?

Contributor Author:

This is, IMO, more confusing if limited to one line. I changed this to the standard hanging indent (8 spaces).

def _topics_differ(self, new_topics):
return (new_topics is not None and
self._topics is not None and
not np.array_equal(new_topics, self._topics))
Owner:

Exact equality or "numerically close" (np.allclose) desired?

Contributor Author

exact, since these are integers
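The exactness point is easy to check: np.array_equal performs exact element-wise comparison, which is well defined for integer topic ids, so a tolerance-based check like np.allclose is unnecessary here. A quick illustration on toy arrays (not PR data):

```python
import numpy as np

# Toy integer "topic" arrays.
a = np.array([[1, 2, 3], [4, 5, 6]])
b = a.copy()
c = np.array([[1, 2, 3], [4, 5, 7]])

# Exact comparison: identical integer arrays compare equal,
# any differing element makes the result False.
print(np.array_equal(a, b))  # True
print(np.array_equal(a, c))  # False
```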

top_ids = set(token2id.values())

texts2 = [['human', 'interface', 'computer'],
          ['survey', 'user', 'computer', 'system', 'response', 'time'],
Owner

Hanging indent.

Contributor Author

done

    accumulator = WordOccurrenceAccumulator(top_ids, dictionary)
else:
    accumulator = ParallelWordOccurrenceAccumulator(processes, top_ids, dictionary)
logger.info("using %s to estimate probabilities from sliding windows" % accumulator)
Owner

Pass argument as a fnc argument, instead of formatting the string directly with %.

Contributor Author

done
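The review point here is the logging module's lazy interpolation: pass the object as an argument instead of pre-formatting with %, and the message is only built if the record is actually emitted. A minimal sketch (the ListHandler and the string stand-in are illustrative, not part of the PR):

```python
import logging

logger = logging.getLogger("coherence_demo")
logger.setLevel(logging.INFO)


class ListHandler(logging.Handler):
    """Collect formatted messages in a list so the result is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # getMessage() interpolates the deferred %-style arguments.
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)

accumulator = "WordOccurrenceAccumulator"  # stand-in for the real accumulator object

# Lazy style: "%s" plus a separate argument; logging interpolates it
# only when the record is emitted, so disabled levels cost nothing.
logger.info("using %s to estimate probabilities from sliding windows", accumulator)

print(handler.messages[0])
```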

Sweeney, Mack added 2 commits June 5, 2017 11:08
… to return individual topic coherence values, then average those. Make the `ParallelWordOccurrenceAccumulator` return a `WordOccurrenceAccumulator` after accumulation, so it can be trained further afterwards if desired.
… individual topic coherence values, then average those.
@macks22
Contributor Author

macks22 commented Jun 8, 2017

@piskvorky I believe I've addressed all the change requests from your last review. Thank you for the review. Is there anything else you'd like changed?

Owner

@piskvorky piskvorky left a comment


Minor code style comments.

vw_path = os.environ.get('VOWPAL_WABBIT_PATH', None)
if not vw_path:
    msg = "Environment variable 'VOWPAL_WABBIT_PATH' not specified, skipping sanity checks for LDA Model"
    logging.info(msg)
    logging.info("Environment variable 'VOWPAL_WABBIT_PATH' not specified,"
Owner

Hanging indent please.

@@ -63,162 +55,201 @@ def setUp(self):
['graph', 'minors', 'trees', 'eps']]
self.topics2 = [['user', 'graph', 'minors', 'system'],
['time', 'graph', 'survey', 'minors']]
self.ldamodel = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=0, iterations=0)
self.ldamodel = LdaModel(corpus=self.corpus, id2word=self.dictionary, num_topics=2,
Owner

Hanging indent please.


class TestCoherenceModel(unittest.TestCase):

    # set up vars used in testing ("Deerwester" from the web tutorial)
    texts = [['human', 'interface', 'computer'],
Owner

Hanging indent please.


def testErrors(self):
    """Test if errors are raised on bad input"""
    # not providing dictionary
    self.assertRaises(ValueError, CoherenceModel, topics=self.topics1, corpus=corpus, coherence='u_mass')
    self.assertRaises(ValueError, CoherenceModel, topics=self.topics1, corpus=self.corpus,
Owner

Hanging indent (here and below).

for s_i in segmented_topics:
    for word_id in itertools.chain.from_iterable(s_i):
        if hasattr(word_id, '__iter__'):
            top_ids = top_ids.union(word_id)
Owner

update() simpler?

Contributor Author

yes, good call
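The update() suggestion, sketched on made-up segment data (the nested list below stands in for real segmented topics; it is not from the PR's tests):

```python
import itertools

# Toy stand-in for segmented topics: a list of segment lists, where
# entries are integer word ids or small iterables of them.
segmented_topics = [[(1, 2), ([3, 4], 5)], [(2, 6)]]

top_ids = set()
for s_i in segmented_topics:
    for word_id in itertools.chain.from_iterable(s_i):
        if hasattr(word_id, '__iter__'):
            # update() mutates the set in place, avoiding the throwaway
            # intermediate set that union() builds on every iteration.
            top_ids.update(word_id)
        else:
            top_ids.add(word_id)

print(sorted(top_ids))  # [1, 2, 3, 4, 5, 6]
```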

…for unique ids from topic segments.
@macks22
Contributor Author

macks22 commented Jun 8, 2017

@piskvorky the most recent review has been addressed and all requested changes have been made.

@@ -70,7 +70,7 @@ def unique_ids_from_segments(segmented_topics):
for s_i in segmented_topics:
    for word_id in itertools.chain.from_iterable(s_i):
        if hasattr(word_id, '__iter__'):
Owner

Are the ids guaranteed to be integers?

Because if they're strings, the __iter__ could get nasty (you can iterate over strings, yielding individual characters).

Contributor Author

@macks22 macks22 Jun 9, 2017


Yes, they will always be integers; I have updated the docstrings to better reflect this. The segmented topics will always come from the segmentation module, and the setter in the CoherenceModel ensures the segmentation module receives lists of integer arrays to be segmented.
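The string pitfall mentioned above is easy to demonstrate in Python 3: integers lack __iter__, but strings have it and iterate character by character, which is why the hasattr check is only safe once ids are guaranteed to be integers. A quick illustration:

```python
word_id_int = 7        # what the segmentation module actually produces
word_id_str = "graph"  # what would break the check if ids were tokens

# Integers are not iterable, so the hasattr test correctly skips them.
print(hasattr(word_id_int, '__iter__'))  # False

# Strings ARE iterable in Python 3, yielding single characters,
# so a string id would be exploded into characters by update().
print(hasattr(word_id_str, '__iter__'))  # True
print(list(word_id_str))  # ['g', 'r', 'a', 'p', 'h']
```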

@macks22
Contributor Author

macks22 commented Jun 13, 2017

@piskvorky @tmylk Anything else to change for this? Thanks!

@devashishd12
Contributor

Hi @macks22! Thanks for this amazing overhaul! It would be fun to see how much of a performance gain we're getting now compared to the previous code. It might also make for an interesting blog post.

@piskvorky
Owner

piskvorky commented Jun 14, 2017

+1 on @dsquareindia's idea. How will users find out about this cool new functionality?

Some sort of Notebook tutorial (motivation, quick start) with a benchmark would be awesome.

@menshikh-iv
Contributor

Sorry for the late response; your PR is great 👍
It would be nice if you added a new notebook with a benchmark (or modified the existing notebooks that use CoherenceModel).

Thank you for the wonderful work @macks22 🥇

@menshikh-iv menshikh-iv merged commit 1862953 into piskvorky:develop Jun 14, 2017
@macks22
Contributor Author

macks22 commented Jun 14, 2017

@dsquareindia @piskvorky @menshikh-iv
Thank you all for the feedback. I'll see if I can update the benchmark notebook and also create a new notebook to showcase some of the new stuff.

@menshikh-iv
Contributor

menshikh-iv commented Jun 19, 2017

Hi @macks22,
I have problems with the tests on Windows with your PR (on macOS and Linux it works correctly):

=====================================================================
FAIL: test_occurrence_counting (gensim.test.test_text_analysis.TestParallelWordOccurrenceAccumulator)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\gensim\test\test_text_analysis.py", line 57, in test_occurrence_counting
    self.assertEqual(3, accumulator.get_occurrences("this"))
AssertionError: 3 != 0
-------------------- >> begin captured logging << --------------------
gensim.topic_coherence.text_analysis: INFO: 1 batches submitted to accumulate stats from 64 documents (3 virtual)
gensim.topic_coherence.text_analysis: INFO: 2 accumulators retrieved from output queue
gensim.topic_coherence.text_analysis: INFO: accumulated word occurrence stats for 4 virtual documents
--------------------- >> end captured logging << ---------------------
======================================================================
FAIL: test_occurrence_counting2 (gensim.test.test_text_analysis.TestParallelWordOccurrenceAccumulator)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\gensim\test\test_text_analysis.py", line 67, in test_occurrence_counting2
    self.assertEqual(2, accumulator.get_occurrences("human"))
AssertionError: 2 != 0
-------------------- >> begin captured logging << --------------------
gensim.topic_coherence.text_analysis: INFO: 2 accumulators retrieved from output queue
gensim.topic_coherence.text_analysis: INFO: accumulated word occurrence stats for 10 virtual documents
--------------------- >> end captured logging << ---------------------

Can you fix this?

@macks22
Contributor Author

macks22 commented Jun 21, 2017

@menshikh-iv that's strange; I can't think of what might be causing it. I can try to replicate it on a VM, but I likely won't be able to find time to set up a Windows dev environment until this weekend.
