Optimize FastText.load_fasttext_model
#2340
Conversation
Thanks @mpenkov. What's still missing:
- ngram byte-based func (FB port) in "glued" form, I guess, to avoid "".join().split() (see the sketch after this list)
- more tests (especially for the previous item)
- final timing measurements for loading (before vs after vs the FB implementation) in 2 variants:
  - load the model and retrieve a vector for a word
  - load the model and start training
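On the ngram point, here is a minimal sketch of what such a "glued" computation could look like: a single pass over the word wrapped in fastText's `<`/`>` markers, with no intermediate `"".join()`/`.split()` step. This is an illustration, not the FB or gensim implementation; the actual FB port operates on UTF-8 bytes and handles multi-byte characters explicitly, while this simplified version works on Python characters.

```python
# Minimal sketch only: compute all character n-grams of a word in one pass.
def compute_ngrams(word, min_n, max_n):
    extended = '<%s>' % word   # fastText wraps words in begin/end-of-word markers
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

print(compute_ngrams('кот', 3, 4))  # non-ASCII words work at the character level
```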
Backward compatibility fix for "fastText fixes in 3.7 break compatibility with old models" (#2341). Fixed by "Fix backward compatibility issue: loading FastTextKeyedVectors using KeyedVectors (missing attribute compatible_hash)" (#2349).
@@ -704,6 +708,14 @@ def train(self, sentences=None, corpus_file=None, total_examples=None, total_wor
        >>> model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

        """
        cant_train = hasattr(self.trainables, 'syn1neg') and self.trainables.syn1neg is None
Stupid question: what if self.trainables does not have a syn1neg attribute at all? Can the model still train?
Yes. I don't see any other code that sets syn1neg to None. So, the new code uses that value to mean "cannot continue training".
If trainables does not have syn1neg at all, it is possible to start training.
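For context, a minimal sketch of how the flag from the diff above could gate training. The attribute names `self.trainables` and `syn1neg` come from the diff; the surrounding method body and the error message are illustrative assumptions, not necessarily gensim's actual code.

```python
# Illustrative sketch only: how the cant_train flag could be used inside train().
# `self.trainables.syn1neg is None` marks a model that cannot continue training.
def train(self, sentences=None, **kwargs):
    cant_train = hasattr(self.trainables, 'syn1neg') and self.trainables.syn1neg is None
    if cant_train:
        raise ValueError(
            'this model cannot be trained further: '
            'it lacks the hidden-layer weights needed to continue training'
        )
    # If trainables has no syn1neg attribute at all, cant_train is False and
    # training proceeds normally, which matches the answer above.
    ...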
I benchmarked the model loading in this PR against 3.7.0 using:

```python
from gensim.models import FastText
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(filename)s:%(lineno)s - %(message)s')

m = FastText.load_fasttext_format("cc.ru.300.bin")
```

Before: 13 min
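For reproducing such numbers, a simple end-to-end timing wrapper; the times quoted in this thread appear to come from reading the INFO log timestamps, so this is just an alternative way to measure (the model path is the one assumed above).

```python
import time
from gensim.models import FastText

start = time.perf_counter()
m = FastText.load_fasttext_format("cc.ru.300.bin")
print("loaded in %.1f s" % (time.perf_counter() - start))
```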
We're still considerably slower than the FB app.
@mpenkov how much of that time is loading vs access? (in gensim and in fb) EDIT: n/m, I see there are just a few words accessed, so this must be all loading.
@piskvorky you are right, all of it is loading; retrieving a vector by word works fast. Note: I guess the reason is the "retrieve vectors for vocab" step.
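To illustrate that guess, here is a rough sketch of what a "retrieve vectors for vocab" step amounts to: building a full vector for every vocabulary word by combining the vectors of its character ngrams. With a vocabulary of a couple of million words, a pure-Python loop like this easily dominates load time, which matches the observation above. The function and argument names are made up for illustration and are not gensim's actual API.

```python
import numpy as np

def precompute_word_vectors(vocab_ngrams, ngram_vectors, hash_fn):
    """vocab_ngrams: one list of ngram strings per vocabulary word.
    ngram_vectors: (num_buckets, dim) array of hashed ngram vectors."""
    num_buckets, dim = ngram_vectors.shape
    out = np.zeros((len(vocab_ngrams), dim), dtype=np.float32)
    for row, ngrams in enumerate(vocab_ngrams):
        for ng in ngrams:
            out[row] += ngram_vectors[hash_fn(ng) % num_buckets]
        out[row] /= max(len(ngrams), 1)   # average over the word's ngrams
    return out
```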
Should fix #1261