
Word2Vec original C is faster #1291

Closed
tmsimont opened this issue Apr 25, 2017 · 16 comments
Labels
difficulty medium · feature · performance

Comments

@tmsimont

tmsimont commented Apr 25, 2017

Cython is installed and gensim is version 0.12.1

print gensim.models.word2vec.FAST_VERSION

says 1

To generate the gensim results, I have run this:

import gensim, logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO, filename='ns.log')
sentences = word2vec.Text8Corpus('../code/c-implementation/text8')

#print gensim.models.word2vec.FAST_VERSION

for i in range(1,49):
  model = gensim.models.Word2Vec(sentences, size=100,workers=i,window=8,hs=0,negative=5,sample=1e-4)
  model.save_word2vec_format('text8-ns.model.bin', binary=False)

I ran the C implementation like this:

./word2vec -train /home/trevor/code/c-implementation/text8 -output vectors-c.bin -cbow 0 -size 100 -window 8 -negative 5 -hs 0 -sample 1e-4 -threads $1 -binary 0 -iter 10

(I ran this in a loop, passing each value of seq 48 in as $1.)

The C version was built with these flags:

CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -g

The machine is a single node with 2 Intel Xeon E5-2650v4 Broadwell-EP CPUs, 24 cores total (12 cores per processor). The CPUs support hyperthreading, which is why my experiments go up to 48 threads (this thing is a beast).

Results:
[chart: C vs. gensim training speed across thread counts]

Raw:
https://gist.github.com/tmsimont/451f3fa17ef28ae57cb87d55ca04245a

Gensim is slower at every thread count, and does not seem to scale beyond the 12 cores of a single processor.

Any idea why the original C version is so much faster at every thread count?

@tmsimont
Author

I should note that I am using virtualenv... I'm not sure whether that affects the speed.

@tmsimont
Author

output log attached, too
ns.log.orig.txt

@tmsimont
Author

It seems the key is the compiler optimizations on the C program: without them, gensim is much faster than the C implementation. So is gensim only faster than C when the C code is built without optimization?

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 

@gojomo
Collaborator

gojomo commented Apr 25, 2017

It's a known issue that gensim's Cython routines don't get the same nearly-linear speedup with the number-of-cores.

A major factor is that some portions of the implementation are still in pure Python, or otherwise still hold the "GIL" – notably the corpus iteration/tokenization, parcelling of job-sized chunks to threads, and lookup of word-tokens to array-indexes. All such sections are still limited to a single thread, no matter how many workers are configured. Many threads can still be in no-GIL operations at that time – so one way to get relatively higher thread utilization is to choose options that make the no-GIL operations take more time, such as a larger vector size, more negative samples, or a larger window.

Another likely factor is how the C version only reads data in one way, from a single file and format, whereas gensim takes any Python iterator. The C version thus can simply tell each thread to do its own IO starting at different seek-points in the file. (I believe a side effect, usually too minor to be a consideration, is that some regions of the file could be trained slightly more or less than the exact chosen iteration count.) You might see better gensim performance removing IO/tokenization as a factor, by converting the corpus iterator to an in-memory list-of-lists-of-strings, before ever passing it to Word2Vec (so all operations happen on already-in-memory, already-tokenized texts).
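
For example, a minimal sketch of that in-memory approach (using the same text8 path as in the scripts above):

from gensim.models import word2vec

# Read and tokenize the corpus once, up front; text8 is only ~100 MB of raw
# text, so the resulting lists of token strings fit comfortably in RAM here.
sentences = list(word2vec.Text8Corpus('../code/c-implementation/text8'))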

It's also possible a factor (not as sure about this) in your particular results is gensim's default job-chunking sizes, given the small size of the text8 corpus – it might only create enough chunks for fewer-than-the-full-number-of-threads, or face more idleness around the start and finish (where gensim assigns exactly the requested training to its threads, while original word2vec.c just has all threads go full speed until the expected count of training examples is finished). So you might see closer performance, or shift the plateau out somewhat, when using a larger corpus. (Forcing more iterations might achieve a similar effect.)

(Text8 is also weird in another way: its lack of normal, varied sentence-breaks. Unsure if this could make gensim slightly faster or slower than with more typical corpora, but it might have a slight effect.)

@gojomo
Collaborator

gojomo commented Apr 25, 2017

Also: running in a virtualenv shouldn't affect performance. The only thing 'virtual' about such an environment is its paths for discovering executables/modules/libraries – there's no virtualization-of-other-services overhead.

@tmsimont
Author

Is this iterator not optimized?

sentences = word2vec.Text8Corpus('../code/c-implementation/text8')

I'm not familiar with how to optimize this iterator as you describe.

Could it be that the compiler optimization simply makes the C code run faster than the gensim Python implementation?

@gojomo
Collaborator

gojomo commented Apr 25, 2017

Here's the Text8Corpus code:

https://github.com/RaRe-Technologies/gensim/blob/14357c182a61c319f591de2cd03b440105144d3a/gensim/models/word2vec.py#L1477
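
As a rough illustration of what that iterator does (a simplified sketch, not the exact gensim code; see the link above for the real implementation): it streams the file in small chunks, splits on whitespace in pure single-threaded Python, and yields plain lists of up to 1000 tokens.

def text8_sentences(fname, max_sentence_length=1000):
    # text8 is one giant whitespace-separated line with no sentence breaks, so
    # the iterator just cuts the token stream into fixed-size "sentences".
    sentence, rest = [], b''
    with open(fname, 'rb') as fin:
        while True:
            chunk = rest + fin.read(8192)   # stream in small chunks: plain, single-threaded Python IO
            if chunk == rest:               # nothing new was read => end of file
                sentence.extend(rest.decode('utf8').split())
                if sentence:
                    yield sentence          # flush whatever tokens remain
                return
            last_space = chunk.rfind(b' ')  # the last token may be cut in half; carry it to the next read
            if last_space < 0:
                rest = chunk
                continue
            sentence.extend(chunk[:last_space].decode('utf8').split())
            rest = chunk[last_space:].strip()
            while len(sentence) >= max_sentence_length:
                yield sentence[:max_sentence_length]
                sentence = sentence[max_sentence_length:]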

What kind of 'optimized' do you mean? It's still just doing pure-Python (single-threaded) IO and string operations. And for the default iter=5 it will have to do those things once for the vocabulary-scan, then another 5 times during training. If you have the memory, try...

sentences = list(word2vec.Text8Corpus('../code/c-implementation/text8'))

...to only do IO/string-stuff once, then train Word2Vec from already in-memory, already-tokenized lists-of-strings.

I'm sure the C-compiler optimizations help, but I didn't see your test numbers for the no-optimizations code. And the cythonized portions of the gensim code were compiled by the same system C-compiler, probably with similar optimization options. So I doubt it's the largest factor.

You can just look at CPU core utilization, for large numbers of threads during training, and see that the C code nearly saturates all assigned cores, whereas the gensim/Python code does not – so the threading-limitations are a major factor (and perhaps the largest factor).

@tmsimont
Author

Yes, I'm sure the threading limitations are the largest factor above 12 cores, but even for 12 and under, and even for just 1 core, the C implementation is faster. I will try turning the sentences into a list first and re-run the test.

@gojomo
Collaborator

gojomo commented Apr 25, 2017

Updated my comment above to add the point that the Cython code is compiled by the same C compiler, likely with the same optimization level.

Things to try include:

  • in-memory corpus
  • larger corpus (perhaps just by using a larger iter value)
  • more compute-intensive parameters (to extend the time spent in nogil blocks): size++, negative++, window++
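
For instance, a sketch combining those suggestions; the parameter values here are purely illustrative, not benchmarked recommendations:

import gensim
from gensim.models import word2vec

# In-memory corpus: IO and tokenization happen once, not on every training pass.
sentences = list(word2vec.Text8Corpus('../code/c-implementation/text8'))

# Illustrative parameters only: larger vectors, more negative samples and a wider
# window all increase the share of time spent inside the nogil Cython routines,
# and a higher iter value effectively enlarges the training workload per build.
model = gensim.models.Word2Vec(
    sentences, size=300, window=10, negative=15, hs=0,
    sample=1e-4, iter=10, workers=24)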

The interthread handoffs in gensim – a single producer-thread reading the iterator, batching examples together, feeding worker threads – are also a likely factor, compared to the C-code where every thread just opens its own handle into a different starting-place in the same file.

@tmylk
Contributor

tmylk commented May 2, 2017

Current status: Awaiting a benchmark on a large in-memory corpus.

@tmsimont
Author

tmsimont commented May 3, 2017

Changing the input to a list first helps a little bit, but it's still falling short of the C implementation.

I'm using fewer cores now for a better comparison; I'd rather not distract everyone with scalability issues. All 12 of these cores are on a single chip with a shared L3 cache and private L1 and L2 caches.

Is there really just a single producer thread? That seems like the most likely bottleneck here. I'm OK with C being faster, but want to confirm that I'm not misunderstanding this. There's a lengthy blog post claiming a huge gensim speedup over the original C code, but that doesn't seem to be the reality here. Is C faster? Or is there something wrong with my code?

import gensim, logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO, filename='ns.log')
sentences = list(word2vec.Text8Corpus('../code/c-implementation/text8'))

for i in range(1,13):
  model = gensim.models.Word2Vec(sentences, size=100,workers=i,window=8,hs=0,negative=5,sample=1e-4)
  model.save_word2vec_format('text8-ns.model.bin', binary=False)

[chart: C vs. gensim training speed, 1-12 threads]

raw: https://gist.github.com/tmsimont/0079d8923be35a8d4653effecd604b34

@gojomo
Collaborator

gojomo commented May 3, 2017

There is a single producer thread, and only some of the (most-compute-intensive) code is eligible for full multithreading – with the pure Python parts still subject to the Python GIL. These two factors seem sufficient to me to explain the shortfall, and core utilization readouts may help confirm this... but I'm repeating myself. Trying other variations of parameters or corpus, as suggested above, may create more insight into where gensim may be most competitive.
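
(As a sketch of one way to get such readouts: sample per-core utilization from a separate shell while training runs. This assumes the third-party psutil package, which is not part of gensim.)

import psutil  # third-party package: pip install psutil

# Print per-core CPU utilization once per second for 30 seconds while a
# training run is active in another process. word2vec.c tends to keep all
# assigned cores near 100%, while gensim's utilization plateaus at higher
# worker counts.
for _ in range(30):
    print(psutil.cpu_percent(interval=1.0, percpu=True))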

(Not sure exactly why those 2013 Mac benchmarks showed a gensim advantage, but both word2vec.c & gensim have changed since then, and 2017 tests on a Linux system are likely to have relevant differences in compiler, libraries, and more.)

@piskvorky
Owner

piskvorky commented May 18, 2017

The biggest difference (in favour of gensim) is the BLAS library used, which I don't see mentioned in this thread.

@tmsimont which BLAS library is your scipy linked against (scipy.show_config())?
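
(For reference, that check is just the following, run in the same virtualenv used for training:)

import scipy
scipy.show_config()  # prints which BLAS/LAPACK builds scipy is linked against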

@piskvorky
Owner

@tmsimont ping

@tmsimont
Author

tmsimont commented Aug 3, 2017

@piskvorky Sorry to go so long without a response. New job... major life changes, etc...

lapack_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77


@menshikh-iv added the feature, difficulty medium, and performance labels on Oct 2, 2017
@gojomo
Collaborator

gojomo commented Oct 10, 2017

I believe the main bottlenecks here are the single-distributor-thread implementation, and general Python GIL contention. Conversation & potential improvements on those issues should continue on #336. Closing this issue in favor of that earliest report of such issues.
