
[GSoC 2018] Multistream API for vocabulary building in *2vec #2078

Merged

Conversation

@persiyanov (Contributor) commented Jun 4, 2018

This PR introduces a multi-stream API for building vocabularies in *2vec models.

Main idea

Training *2vec models involves two main steps:

  1. One pass over your training corpus to build (and filter) the vocabulary
  2. Train the model

This pull request optimizes the first step (dictionary building).

The main idea is to parallelize the scan_vocab method by allowing users to pass several streams of data (e.g. several LineSentence iterators over different files). Under the hood, the implementation uses Python's multiprocessing module, creating a multiprocessing.Pool(processes=min(workers, len(input_streams))) to read and process all input streams in parallel.
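As a rough sketch of this idea (not the PR's actual internals; `count_stream` and `scan_vocab_parallel` are made-up names), the parallel scan boils down to counting each stream in its own process and merging the partial dictionaries:

```python
import multiprocessing
from collections import defaultdict

def count_stream(stream):
    """Count word frequencies in a single stream of tokenized sentences."""
    vocab = defaultdict(int)
    for sentence in stream:
        for word in sentence:
            vocab[word] += 1
    return dict(vocab)

def scan_vocab_parallel(input_streams, workers):
    """Scan all streams in parallel and merge the partial vocabularies.

    Streams must be picklable (e.g. LineSentence objects or plain lists).
    """
    pool = multiprocessing.Pool(processes=min(workers, len(input_streams)))
    results = [pool.apply_async(count_stream, (stream,)) for stream in input_streams]
    pool.close()
    pool.join()
    merged = defaultdict(int)
    for res in results:
        for word, freq in res.get().items():
            merged[word] += freq
    return dict(merged)
```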

Why use it?

This new feature substantially reduces vocabulary building time, especially if you have complex, CPU-bound data iterators (e.g. reading a line from a file and then doing heavy preprocessing: removing stop words, lemmatizing, stemming, etc.).

In our benchmarks below, multi-stream mode produces a 2.7x speedup of the vocabulary building step with plain LineSentence input, and up to 6.9x with CPU-heavy preprocessing. This is significant because this step can take hours for larger corpora (e.g. the English Wikipedia dump at https://dumps.wikimedia.org/enwiki/20180320/) :)

The more CPU-heavy your data iterators, the more speedup you get. For example, I've experimented with building the vocab for a Word2Vec model in two setups, without preprocessing and with preprocessing:

1. Without preprocessing. Iterate with gensim.models.word2vec.LineSentence, which just reads a line from disk and splits it on whitespace. I ran this benchmark on an already preprocessed dataset. Using multi-stream resulted in a 2.7x speedup.

| # workers | total time, secs |
|----------:|-----------------:|
| 1 | 249.51 |
| 2 | 159.87 |
| 3 | 126.47 |
| 4 | 109.90 |
| 5 | 96.79 |
| 8 | 88.74 |
| 10 | 93.00 |
2. With preprocessing. Iterate over lines and preprocess each one with gensim.parsing.preprocessing.preprocess_string. In this case, multistream mode provided a 6.9x speedup.

| # workers | total time, secs |
|----------:|-----------------:|
| 1 | 4398.54 |
| 2 | 2252.73 |
| 3 | 1596.28 |
| 4 | 1157.02 |
| 5 | 926.66 |
| 8 | 704.07 |
| 10 | 638.45 |

How to use it?

It's really simple to start using multi-stream mode. All you need is multiple document streams (training corpora, e.g. files on disk) you want to train on; pass them as a list via the new input_streams constructor parameter. input_streams supersedes the old sentences parameter, which is now equivalent to input_streams = [sentences]:

Old single-stream approach

1. Build vocabulary and train at the same time:

sentences = LineSentence('myfile.txt')
model = Word2Vec(sentences, workers=10, ...other params...)

2. Initialize the model, build the vocab and train as separate steps:

sentences = LineSentence('myfile.txt')
model = Word2Vec(workers=10, ...other params...)
model.build_vocab(sentences)
model.train(sentences, ...train params...)

New multi-stream approach

1. Build the vocab and train at the same time (vocab building is parallelized internally):

input_streams = [LineSentence('myfile1.txt'), LineSentence('myfile2.txt'), ...]
model = Word2Vec(input_streams=input_streams, workers=10, ...other params...)

2. Initialize the model, build the vocab in parallel and train as separate steps:

input_streams = [LineSentence('myfile1.txt'), LineSentence('myfile2.txt'), ...]
model = Word2Vec(workers=10, ...other params...)
model.build_vocab(input_streams=input_streams)  # Optionally, pass a `workers` value here that differs from the one given to the constructor.
model.train(input_streams=input_streams, ...train params...)

The examples above used Word2Vec, but Doc2Vec and FastText work analogously.
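For instance, a FastText call would look like this (a sketch mirroring the Word2Vec example above; file names are placeholders):

```python
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

input_streams = [LineSentence('part1.txt'), LineSentence('part2.txt')]
model = FastText(input_streams=input_streams, workers=8, size=100, min_count=5)
```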

@persiyanov (Contributor, Author) commented Jun 4, 2018

10 input streams

* Model = word2vec	Workers = 1	Vocab time = 220.48 secs
* Model = word2vec	Workers = 2	Vocab time = 147.40 secs
* Model = word2vec	Workers = 3	Vocab time = 123.86 secs
* Model = word2vec	Workers = 4	Vocab time = 105.28 secs
* Model = word2vec	Workers = 5	Vocab time = 88.51 secs
* Model = word2vec	Workers = 8	Vocab time = 82.50 secs
* Model = word2vec	Workers = 10	Vocab time = 77.99 secs

* Model = doc2vec	Workers = 1	Vocab time = 223.81 secs
* Model = doc2vec	Workers = 2	Vocab time = 153.86 secs
* Model = doc2vec	Workers = 3	Vocab time = 131.49 secs
* Model = doc2vec	Workers = 4	Vocab time = 111.28 secs
* Model = doc2vec	Workers = 5	Vocab time = 98.30 secs
* Model = doc2vec	Workers = 8	Vocab time = 91.04 secs
* Model = doc2vec	Workers = 10	Vocab time = 86.03 secs

* Model = fasttext	Workers = 1	Vocab time = 309.86 secs
* Model = fasttext	Workers = 2	Vocab time = 244.43 secs
* Model = fasttext	Workers = 3	Vocab time = 217.69 secs
* Model = fasttext	Workers = 4	Vocab time = 201.10 secs
* Model = fasttext	Workers = 5	Vocab time = 189.07 secs
* Model = fasttext	Workers = 8	Vocab time = 179.37 secs
* Model = fasttext	Workers = 10	Vocab time = 175.31 secs

@persiyanov (Contributor, Author):
@gojomo Hi! I'd like to ask for your advice. There is one bug in this PR related to TaggedLineDocument and the tags it produces. If a user wants to use multistream + TaggedLineDocument, there will be several documents with the same tags (any two documents on the same line number in different files get the same tag, because of the TaggedLineDocument logic). Could you suggest how to fix this problem?

@gojomo (Collaborator) commented Jun 5, 2018

Clearly, incrementing serial-numbers-per-line can't be used if you have multiple files that all start with line 0. You could require tags to be specified inside the files, as an extra field (or fields) on each line. If the files are considered in a stable order, and the count of texts (lines) in each can be pre-determined (perhaps with another extra pass over all data), each could be given a unique starting id. Text tags could also be a deterministic function of the hash of the contained text, at the cost of additional per-line overhead and of keeping those (now much longer) tags as string keys.
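The hash-based option could be sketched roughly like this (illustrative only, not part of this PR; `text_hash_tag` is a made-up helper):

```python
import hashlib

from gensim.models.doc2vec import TaggedDocument

def text_hash_tag(words):
    """Derive a deterministic tag from the text itself, so the same line
    number in different files can no longer produce colliding tags."""
    return hashlib.sha1(' '.join(words).encode('utf-8')).hexdigest()

# e.g. when yielding documents from each stream:
# yield TaggedDocument(words=words, tags=[text_hash_tag(words)])
```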

if isinstance(sentences, GeneratorType):
if multistream and not isinstance(sentences, (tuple, list)):
    raise TypeError("If multistream=True, you must pass tuple or list as the sentences argument.")
if not multistream and isinstance(sentences, GeneratorType):
Contributor:
This check (about generators) should be applied for each multistream element too (i.e. to each stream)

If True, use `sentences` as list of input streams and speed up vocab building by parallelization
with `min(len(sentences), self.workers)` processes. This option can lead up to 2.5x reduction
in vocabulary building time.
workers : int
Contributor:

Maybe it would be better to use a different name like multistream_workers (to avoid a potential collision between parameter names), or does this make no sense @persiyanov?

Contributor (Author):

I think there is no point in this. In most cases, by setting the workers parameter the user means the same degree of parallelization for both vocab scanning (multiprocessing) and model training (multithreading), IMO.

@@ -213,6 +214,9 @@ def _train_epoch(self, data_iterable, cur_epoch=0, total_examples=None,
for _ in xrange(self.workers)
]

# Chain all input streams into one, because multistream training is not supported yet.
if multistream:
Contributor:
Better to raise a warning with warnings.warn about the lack of support (I mean explicitly to the user, not only as a code comment).

Contributor (Author):

Hm, I don't think so:

We are adding new functionality, allowing users to use multistream mode. We added this functionality to the gensim core, so we are sure that it works correctly. Why should we warn the user that something is wrong? Furthermore, the user can't do anything to get rid of this warning message while continuing to use multistream mode.

Contributor:
About "can't do anything" - not true, see https://docs.python.org/2/library/warnings.html#temporarily-suppressing-warnings (that's FYI), but I agree with your point in this case.

update : bool
If true, the new words in `sentences` will be added to model's vocab.
progress_per : int
Indicates how many words to process before showing/updating the progress.

"""
if workers is None:
workers = self.workers
Contributor:
workers = workers or self.workers, here and everywhere.

gensim/utils.py Outdated
"""Merge `dict1` of (word, freq1) and `dict2` of (word, freq2) into `dict1` of (word, freq1+freq2).
Parameters
----------
dict1 : dict
Contributor:
dict of (str, int), here and everywhere.

@@ -1203,7 +1280,7 @@ def sort_vocab(self, wv):
wv.vocab[word].index = i

def prepare_vocab(self, hs, negative, wv, update=False, keep_raw_vocab=False, trim_rule=None,
min_count=None, sample=None, dry_run=False):
min_count=None, sample=None, dry_run=False, **kwargs):
Contributor:
why **kwargs ?

Contributor:
still a question

pool = multiprocessing.Pool(processes=min(workers, len(input_streams)))

results = [
pool.apply_async(_scan_vocab_worker,
Contributor:
why apply_async ?

Contributor (Author):
What else?

Contributor (Author):
apply/map are blocking.


corpus_count = document_no + 1
results = [res.get() for res in results] # pairs (vocab, doclen2tags)
self.raw_vocab = reduce(utils.merge_dicts, [r[0] for r in results])
Contributor:
from functools import reduce. Guido doesn't like the reduce function; for this reason it was moved into functools in Python 3.

Contributor (Author):
done

@menshikh-iv (Contributor):
@persiyanov great work! Please add tests & merge fresh master to your branch (just in case).

@gojomo the problem is mostly defined by user behavior, i.e. if I (as a gensim user) heard about the new multistream feature and wanted to try it with d2v, I'd use something like
[TaggedLineDocument(..), TaggedLineDocument(..)], multistream=True (and that definitely will not work as expected in this case).

@persiyanov (Contributor, Author):
Hooray! Tests are passing now. I will update the PR description properly (what the feature is, how it helps, how to use it, etc.)

@menshikh-iv changed the title from "Multistream API for vocabulary building in word2vec, doc2vec, fastText" to "[GSoC 2018] Multistream API for vocabulary building in *2vec" on Jun 11, 2018
@gojomo (Collaborator) commented Jun 14, 2018

If it works, great, but I'm a little uncomfortable with the manner-of-activation: the multistream boolean toggle, and thus differing if-this-type-then overloading of a single sentences parameter, all as new branches inside existing code.

Ultimately, even if not right away, I'd prefer a mechanism where people opting-into an alternative-vocabulary-building-optimization use some other explicit code path, not complicating existing interfaces, then drop a completed vocabulary-object into the Word2Vec/etc model. Then the simple/legacy path stays simple, the new alternative is clearly opted-into, and the point-of-entanglement is limited to a single injection of what the main model needs to proceed – which also allows the main model to be indifferent as to the specifics of how its individual steps were completed. That could also allow further experimentation with other vocab-building approaches, without adding more switches-and-branches to a single entry point every time. (That there's only a 2.5X-3X speedup in moving from 1 to 10 threads makes me think further improvements may be possible.)

@persiyanov (Contributor, Author) commented Jun 14, 2018

@gojomo Could you give some examples of what you mean by

opting-into an alternative-vocabulary-building-optimization use some other explicit code path...

?

Something like vocab = gensim.models.word2vec.build_vocab_multistream_mode(input_streams, <vocab params>) and then model.set_vocab(vocab)? Or something different?

Why do you think the interface I proposed is complicated? It's just toggling one flag (multistream=True) and passing multiple streams instead of a single stream. That looks very clear and simple, doesn't it?

Is it okay to merge this PR in its current state (after addressing @piskvorky's comments) and possibly change the interfaces and API in the future (according to new research & results in the multistream training direction), if needed?

@gojomo (Collaborator) commented Jun 15, 2018

@persiyanov Yes, your example utility function build_vocab_multistream_mode() that returns a usable vocabulary-object is more along the lines of what I'd prefer.

The switches-and-branches approach works, and if there's just one alt mode, not too complex. But, it's expanding some (already-gigantic) method-parameter lists, and some already-twisty methods - making them harder to read/maintain/expand in the future.

If just a switch, the docs for this mode will be interleaved with long lists of other options – which may tend to tempt those who don't quite need this, but shortchange those who do want full descriptions. Having the docs on a specialized function or class may be more clear – and better able to discuss subtleties like the potentially different memory requirements of this approach, or its sensitivity to the various parallel streams being of roughly-equal size.

Also, adding new switches for new approaches in the future, which might or might not be compatible with each other, will be harder than having a separate, swappable step that hides its own details and just leaves things in a proper state. For example, there's another effort in #1962 to offer the approximate-counting Bounter as a vocab-surveying option... but it also adds a toggle-on parameter and a bunch of if-thens to nearby code. As edits to a single class, these would be hard to merge; as truly standalone paths to a working vocabulary object, they'd be easily swappable as either (1) a policy/implementation option supplied to the main class; or (2) a separate step whose output is just provided to the main class when done. (You can see I've raised similar modularity concerns at that PR.)

The preexisting and not-so-flexible tangle of the *2Vec initialization steps long predates your work, so it may be outside your project to fully fix it, but I'm wary of making the tangle any thicker.

gensim/utils.py Outdated
@@ -1712,7 +1712,7 @@ def prune_vocab(vocab, min_reduce, trim_rule=None):

def trim_vocab_by_freq(vocab, topk, trim_rule=None):
"""Retain `topk` most frequent words in `vocab`.
If there are more words with the same frequency as `topk`-th one, they will be dropped.
If there are more words with the same frequency as `topk`-th one, they will be keeped.
Owner:
keeped => kept

"""Build vocabulary from a sequence of sentences (can be a once-only generator stream).
Each sentence is a iterable of iterables (can simply be a list of unicode strings too).

Parameters
----------
sentences : iterable of iterables
sentences : {iterable of iterables, list or tuple of iterable of iterables}
@piskvorky (Owner) commented Jun 15, 2018

I'd prefer not to overload the same parameter with a different type (with the type dependent on an external flag).

It's already complicated enough; iterable of iterables, or a list or tuple of iterable of iterables is no longer humanly parseable.

My original suggestion was to have the input be a sequence of iterables always. For a single stream, simply an iterable of length one. If backward compatibility is hard to achieve (is it?), keep the legacy parameter sentences (maybe deprecate it in time) and promote a new parameter multistream. If backward compatibility can be achieved transparently, then just keep using sentences.

Contributor (Author):

Correct me if I got you wrong, but you propose something like this:

Word2Vec(input_streams=None, sentences=None ...) (instead of old Word2Vec(sentences=None, ...) )

with the following logic:

  1. sentences and input_streams cannot both be passed
  2. if sentences is passed, then do input_streams = [sentences] and go to (3)
  3. if input_streams is passed, then perform multistream mode

Am I correct about your logic?

@piskvorky (Owner) commented Jun 15, 2018

Yes. I originally thought users would only pass sentences, and we decide automatically during input validation whether it's a single stream (legacy, promote to sentences = [sentences] transparently) or already multiple streams. And keep only that single parameter.

But introducing extra input_streams parameter may be a cleaner option. More explicit.

Contributor (Author):

Okay, I'll implement the logic in this way, sounds good
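A minimal sketch of that dispatch logic (made-up helper name, not the actual implementation):

```python
def _resolve_streams(sentences=None, input_streams=None):
    """Normalize the two mutually exclusive parameters into a list of streams."""
    if sentences is not None and input_streams is not None:
        raise ValueError("Pass either `sentences` or `input_streams`, not both.")
    if sentences is not None:
        return [sentences]      # legacy single-stream call
    return input_streams        # multistream call, or None if training is deferred
```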

Contributor (Author):

@piskvorky Currently, in a majority of methods (almost all except the constructor) the sentences parameter is required, not optional. If we introduce a new input_streams param, both of them need to be optional.

This requires some extra checks, e.g. assert sentences is not None or input_streams is not None, in all of the methods. It sounds a bit dirty.

What do you think about that?

@piskvorky (Owner) commented Jun 19, 2018
Dictionary building is a part of the training process. The same N < M or N > M questions apply there.

I'm still unclear what happens when N != M, even during vocabulary building. Or do we force N == M always? If so, and both N and M are specified by the user (number of streams and number of workers), which one takes precedence?

Contributor (Author):

No, we don't force anything. I just use multiprocessing.Pool, put len(input_streams) tasks in it and that's all.
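A tiny self-contained illustration of that behavior (N tasks simply queue up on M pool processes; all names here are made up):

```python
import multiprocessing

def word_count(stream):
    """One task per stream; the pool schedules tasks onto free processes."""
    return sum(len(sentence) for sentence in stream)

if __name__ == '__main__':
    streams = [[['hello', 'world']] * 1000 for _ in range(10)]  # N = 10 streams
    pool = multiprocessing.Pool(processes=4)                    # M = 4 workers
    counts = pool.map(word_count, streams)                      # extra tasks wait their turn
    pool.close()
    pool.join()
    print(counts)                                               # [2000, 2000, ...]
```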


Contributor:
Small clarification: this is only for dictionary building (see the PR title); the training process stays as-is. The goal of this PR is parallelized vocab building only.

@piskvorky (Owner) commented Jun 19, 2018

OK, so the distribution of N tasks among M workers is managed by multiprocessing.Pool (at least during this training stage), got it.

@persiyanov (Contributor, Author):
@piskvorky please take a look. I've addressed your comments.

@piskvorky (Owner) left a comment

Questions around handling memory.


progress_queue.put((total_words, sentence_no + 1))
progress_queue.put(None)
return vocab
@piskvorky (Owner) commented Jun 19, 2018

This will try to pickle the entire result (huge!) and send it back to master, where it is queued internally in the results queue, right?

Can you think of a way to use shared memory instead?

That should improve both memory (in a big way) as well as maybe performance (some extra locking, but less data sending).

Contributor (Author):

@piskvorky

It's an interesting idea, but some tricky questions arise:

If we use a shared manager.dict(), how will we perform utils.prune_vocab(vocab, max_vocab_size)? Some threads may start to shrink the vocabulary in parallel at the same time, and that's no good as I see it.

One way to solve this is to perform pruning from the master process, but it adds some synchronization barriers (all threads must wait until the master has pruned the vocab).

Another way I can propose is to disallow multistream mode for people with low RAM (who set the max_vocab_size parameter) -- this saves us from inventing tricky hacks to properly prune the vocab (it helps in two places: both in the current multistream implementation with max_vocab_size / workers and when using a shared manager.dict()).

What do you think?

Contributor (Author):

@piskvorky ping

@piskvorky (Owner) commented Jun 21, 2018

Sorry, I missed this notification.

My thoughts: you mean processes, not threads, right? We're using multiprocessing here.

Shrinking could be trickier, I agree. Maybe have a world-freezing lock around that operation: all workers pause while shrinking is under way? Hopefully it's not that frequent?

Disabling pruning with multistream is also an elegant solution :-) Although people with big corpora (~huge dicts) are exactly the people who need fast training, so that solution kinda throws the baby out with the bath water.
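For illustration only (nothing like this was implemented or benchmarked in this PR), the "lock around pruning" idea on a manager-shared dict might look roughly like this, which also hints at where the locking cost comes from:

```python
import multiprocessing

def scan_stream(stream, shared_vocab, lock, max_vocab_size, min_reduce):
    """Count words into a manager-shared dict; pruning freezes all workers."""
    for sentence in stream:
        with lock:   # held briefly per sentence; pruning holds it for its full duration
            for word in sentence:
                shared_vocab[word] = shared_vocab.get(word, 0) + 1
            if len(shared_vocab) > max_vocab_size:
                # every other worker is blocked on the lock while this one prunes
                for word, freq in list(shared_vocab.items()):
                    if freq < min_reduce:
                        del shared_vocab[word]

# usage sketch:
# manager = multiprocessing.Manager()
# shared_vocab, lock = manager.dict(), manager.Lock()
```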

Contributor (Author):

@piskvorky

Pausing all workers while doing shrinking -- that sounds very slow. Why are we sure that shrinking is not performed that frequently?

Do you want me to implement this kind of logic and benchmark it? Then we can decide which logic to merge into the develop branch?

@piskvorky (Owner) commented Jun 25, 2018

Hm, strange. Must be due to the locking, right?

@persiyanov Any way to turn locking off (since we don't care about accurate counts, especially in the high-frequency tokens where collisions are more likely)?

-1 on experiments with disk-spilling, at least inside this project. We don't have the capacity for that here, this is just an aside in the "faster dict building" sub-project, which is already looking good enough.

Contributor (Author):

@piskvorky Yes, I bet this happens because of locks.

Any way to turn locking off?

I can only see dirty ways to do that, e.g. making a nogil Cython analogue of defaultdict(int).

@piskvorky (Owner) commented Jun 26, 2018

I see. Have we exhausted ideas for memory sharing? Anything else we could try easily?

If not, are we finished with parallelized dict-building?

Please add concrete final timings and insights into the PR description (instead of "substantially reduce vocabulary building time").

Contributor (Author):

@piskvorky done, see the updated description

Owner:

I made some minor language edits, looks good :) Thanks!

total_words += num_words
total_sentences += num_sentences

self.raw_vocab = reduce(utils.merge_counts, [res.get() for res in results])
@piskvorky (Owner) commented Jun 19, 2018

Does .get() create another copy of each (potentially large) dictionary, or only return a reference to an already existing object?

Contributor (Author):

Only returns the reference, as far as I know

@@ -330,7 +330,37 @@ def _log_epoch_progress(self, progress_queue, job_queue, cur_epoch=0, total_exam

def _train_epoch(self, data_iterable=None, data_iterables=None, cur_epoch=0, total_examples=None,
total_words=None, queue_factor=2, report_delay=1.0):
"""Train one epoch."""
"""Train the model for a single epoch.
Collaborator:
Not a comment about the doc improvements here, which all look good, but a side observation about this method, which I only noticed during related reviews last week: its strategy of re-launching a fresh 'producer' thread and fresh 'worker' threads for each epoch, as I believe was introduced in #1777, likely drives down overall throughput and CPU utilization compared to the prior strategy. The one potential advantage I'd see for adopting such a full-sync teardown & restart between epochs would be allowing the user to specify some callback for mid-training reporting at each epoch's end – but that hasn't yet been added.

Owner:

@gojomo why would that drive down throughput & CPU utilization?

Collaborator:

When some threads have finished an epoch, but others haven't, cores will be idle not because of GIL/etc but because there's no thread even trying to move forward onto the next epoch's data. Plus any overhead of re-launching threads (magnitude unclear). Old strategy launched exactly workers + 1 threads. This one launches epochs * (workers + 1) threads.

@piskvorky (Owner) commented Jun 21, 2018

If I understand correctly, you're worried that at the end of each epoch, some threads may be idle (while other threads are finishing their last job) until the next epoch starts.

Isn't that idleness infinitesimal, since any single job takes almost no time at all? I may be missing something but this type of delay shouldn't be even measurable.

@gojomo (Collaborator) commented Jun 21, 2018

I'm not sure of the magnitude, only the direction: this means more idle cores, every time an epoch rolls over. Of course only measuring could tell for sure, and the proportionate impact becomes smaller with larger corpuses.

As it's not yet clear to me the relative interplay of existing GIL/queue sync bottlenecks that have been preventing higher throughput (or else I would have tried more to fix them), adding yet more thread launches/waits/syncs-against-a-not-yet-filled-queue is something I'd have been wary of doing without measurement at the time. Even the coarse once-a-second progress logging tended to show slower throughput at the beginning of training; that slow-start might now be repeated at each epoch - for example, via GIL-contention between the 1st few new worker threads getting a job, and the new producer thread, trying to get ahead of the workers again.

@persiyanov (Contributor, Author):
@piskvorky Are you OK with this PR? If so, let's merge it. I need these changes for my next pull request, related to multistream training.

@piskvorky (Owner):
Yeah, I think so. I don't see any outstanding unresolved comments/issues. @menshikh-iv ?

@menshikh-iv merged commit 408a714 into piskvorky:develop on Jul 12, 2018
@persiyanov deleted the feature/gsoc-multistream-vocab branch on July 12, 2018 10:40