File-based fast training for Any2Vec models #2127

Merged
merged 133 commits into from
Sep 14, 2018
39a2c11
CythonLineSentence
Jul 9, 2018
20c22f7
fix
Jul 9, 2018
dd0e9ca
fix setup.py
Jul 9, 2018
6203c77
fixes
Jul 9, 2018
03bf799
some refactoring
Jul 9, 2018
660493f
remove printf
Jul 10, 2018
1aedfe8
compiled
Jul 10, 2018
9ff0bb1
second branch for pystreams
Jul 10, 2018
9e498b7
fix
Jul 10, 2018
1d4a2a8
learning rate decay in Cython + _do_train_epoch + _train_epoch_multis…
Jul 11, 2018
97bac7e
add train_epoch_sg function
Jul 11, 2018
4de3a84
call _train_epoch_multistream from train()
Jul 11, 2018
36d1412
add word2vec_inner.cpp
Jul 11, 2018
625025b
remove pragma from .cpp
Jul 11, 2018
8173da8
Merge branch 'develop' into feature/multistream-training
Jul 12, 2018
bd0a0e0
fix doc
Jul 12, 2018
63663fa
fix pip
Jul 12, 2018
2ee2405
add __reduce__ to CythonLineSentence for proper pickling
Jul 14, 2018
8f8e817
remove printf
Jul 14, 2018
ac28bbb
add 1 test for CythonLineSentence
Jul 14, 2018
942a12f
no vocab copying
Jul 18, 2018
2a44fbc
fixed
Jul 18, 2018
e4a8ba0
Revert "fixed"
Jul 19, 2018
394a417
Revert "no vocab copying"
Jul 19, 2018
9ab6b1b
remove input_streams, add corpus_file
Jul 24, 2018
5d2e2cf
fix
Jul 24, 2018
0489561
fix replacing input_streams -> corpus_file in Word2Vec class
Jul 24, 2018
901cad4
upd .cpp
Jul 26, 2018
c09035c
add C++11 compiler flags
Jul 26, 2018
1e3c314
pep8
Jul 26, 2018
d6755be
add link args too
Jul 26, 2018
cc4680c
upd FastLineSentence
Jul 26, 2018
9978f6b
fix signatures in doc2vec/fasttext + removed tests on multistream
Jul 26, 2018
35333dd
fix flake
Jul 26, 2018
86b91ac
clean up base_any2vec.py
Jul 26, 2018
fca6f50
fix
Jul 26, 2018
45ca084
fix CythonLineSentence ctor
Jul 26, 2018
16bb386
fix py3 type error
Jul 26, 2018
c83b96f
fix again
Jul 26, 2018
1a21b0b
try again
Jul 26, 2018
dd83a3e
new error
Jul 26, 2018
c72f0b6
fix test
Jul 27, 2018
74e51b3
add unordered_map wrapper
Jul 30, 2018
58fc112
upd
Jul 30, 2018
5e70184
fix cython compiling errors
Jul 30, 2018
9727782
upd word2vec_inner.cpp
Jul 30, 2018
d97ac0c
add some tests
Jul 31, 2018
b6d7bb3
more tests for corpus_file
Jul 31, 2018
0c1fc5f
fix docstrings
Jul 31, 2018
fd66e34
addressing comments
Aug 1, 2018
da9f3da
fix tests skipIf
Aug 1, 2018
81329d6
add persistence test
Aug 1, 2018
f2ba633
online learning tests
Aug 1, 2018
51cec43
fix save_as_line_sentence
Aug 1, 2018
a72ddf1
fix again
Aug 1, 2018
aba7682
address new comments
Aug 2, 2018
03d44b2
fix test
Aug 2, 2018
e4e8cb2
move multistream functions from word2vec_inner to word2vec_multistream
Aug 2, 2018
3e989de
fix tests
Aug 2, 2018
d8c5cdc
add .c file
Aug 3, 2018
2a42b85
fix test
Aug 3, 2018
002a60c
fix tests skipIf and setup.py
Aug 3, 2018
3850f49
fix mac os compatibility
Aug 3, 2018
c1e8a9b
add tutorial on w2v multistream
Aug 9, 2018
7b7195b
300% -> 200% in notebook
Aug 10, 2018
3a8a915
add MULTISTREAM_VERSION global constant
Aug 10, 2018
6beb96a
first move towards multistream FastText
Aug 10, 2018
a2eb5fc
move MULTISTREAM_VERSION
Aug 10, 2018
57f7b66
fix error
Aug 10, 2018
83ce7c2
fix CythonVocab
Aug 10, 2018
a3ede08
regenerated .c & .cpp files
Aug 10, 2018
d38463e
resolve ambiguous fast_sentence_* declarations
Aug 11, 2018
ec4c677
add test_training_multistream for fasttext
Aug 11, 2018
a5311d2
add skipif
Aug 11, 2018
f499d5b
add more tests
Aug 11, 2018
645499c
fix flake8
Aug 11, 2018
dc1b98d
add short example
Aug 12, 2018
b9564e9
upd jupyter notebook
Aug 13, 2018
eefdd65
fix docstrings in doc2vec
Aug 14, 2018
f669979
add d2v_train_epoch_dbow for from-file training
Aug 14, 2018
e80189f
add missing parts of from-file doc2vec
Aug 15, 2018
cf6b032
refactored a bit
Aug 15, 2018
87d8ea7
add total_corpus_count calculation in doc2vec
Aug 15, 2018
e2851b4
Merge branch 'develop' into feature/multistream-training
persiyanov Aug 15, 2018
1fdaa43
add tests for doc2vec file-based + rename MULTISTREAM -> CORPUSFILE e…
Aug 15, 2018
c2fa0d8
regenerated .c + .cpp files
Aug 15, 2018
5427416
add Word2VecConfig in order to remove repeating parts of code
Aug 15, 2018
7f7760b
make shared initialization
Aug 15, 2018
926fd5e
use init_config from word2vec_corpusfile
Aug 15, 2018
df47983
add FastTextConfig
Aug 15, 2018
0df7f6f
init_config -> init_w2v_config, init_ft_config
Aug 15, 2018
5fd1c99
regenerated .c & .cpp files
Aug 15, 2018
d9257be
using FastTextConfig in fasttext_corpusfile.pyx
Aug 15, 2018
67c572c
fix
Aug 15, 2018
8e82b9f
fix
Aug 15, 2018
db2a77f
fix next_random in w2v
Aug 15, 2018
a96bc6d
introduce Doc2VecConfig
Aug 16, 2018
3b4da64
fix init_d2v_config
Aug 16, 2018
53b967c
use Doc2VecConfig in doc2vec_corpusfile.pyx
Aug 16, 2018
f57d1cb
removed unused vars
Aug 16, 2018
b652afe
fix docstrings
Aug 16, 2018
260cfb5
fix more docstrings
Aug 16, 2018
a433018
test old model for doc2vec & fasttext
Aug 16, 2018
20ec49b
fix loading old models
Aug 16, 2018
1ced17d
fix fasttext model checking
Aug 16, 2018
0731449
merge fast_line_sentence.cpp and fast_line_sentence.h
Aug 16, 2018
35f0ab4
fix word2vec test
Aug 16, 2018
49905f0
fix syntax error
Aug 16, 2018
95c6ec9
remove redundant seekg call
Aug 16, 2018
aed2b6b
fix example notebook
Aug 16, 2018
c1af621
add initial doc_tags computation
Aug 16, 2018
33bf97a
fix test
Aug 16, 2018
e592b6a
fix test for windows
Aug 17, 2018
d08e4c1
add one more test on offsets
Aug 17, 2018
468a000
get rid of subword_arrays in fasttext
Aug 17, 2018
f71e1f8
make hanging indents everywhere
Aug 17, 2018
811388b
open file in byte mode
Aug 18, 2018
ddd5901
fix pep
Aug 18, 2018
a3490c7
fix tests
Aug 18, 2018
a28ff0d
fix again
Aug 18, 2018
b2996f0
final fix?
Aug 18, 2018
64bb617
regenerated .c & .cpp files
Aug 18, 2018
816f63f
fix test_persistence_fromfile for FastText
Aug 18, 2018
abad1b8
add fasttext & doc2vec to notebook
Aug 20, 2018
0b03839
add short examples
Aug 20, 2018
6217c73
update file-based tutorial notebook
piskvorky Aug 23, 2018
f70d159
work credit + minor nb fixes
piskvorky Aug 25, 2018
9593d5f
remove FIXMEs from file-based *2vec notebook
piskvorky Sep 9, 2018
7b714b2
remove warnings in corpus_file mode
persiyanov Sep 9, 2018
b833f0f
fix deprecation warning
menshikh-iv Sep 12, 2018
bcc0fb9
regenerate .ipynb
persiyanov Sep 14, 2018
384e0b1
upd plot
persiyanov Sep 14, 2018
527266f
upd plot
persiyanov Sep 14, 2018
550 changes: 550 additions & 0 deletions docs/notebooks/Any2Vec_Filebased.ipynb

Large diffs are not rendered by default.

Binary file added docs/notebooks/word2vec_file_scaling.png
231 changes: 179 additions & 52 deletions gensim/models/base_any2vec.py

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions gensim/models/deprecated/doc2vec.py
@@ -153,6 +153,7 @@ def load_old_doc2vec(*args, **kwargs):

new_model.train_count = old_model.__dict__.get('train_count', None)
new_model.corpus_count = old_model.__dict__.get('corpus_count', None)
new_model.corpus_total_words = old_model.__dict__.get('corpus_total_words', None)
new_model.running_training_loss = old_model.__dict__.get('running_training_loss', 0)
new_model.total_train_time = old_model.__dict__.get('total_train_time', None)
new_model.min_alpha_yet_reached = old_model.__dict__.get('min_alpha_yet_reached', old_model.alpha)
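
The hunk above copies training state from an old model by reading `old_model.__dict__` with a default, so attributes that did not exist when the model was saved fall back to a safe value instead of raising. The following is a minimal sketch of that pattern only. `OldModel`, `NewModel` and `copy_training_state` are hypothetical stand-ins for illustration, not gensim classes or functions.

```python
class OldModel:
    """Hypothetical stand-in for a model saved by an older release."""

    def __init__(self):
        self.alpha = 0.025
        self.train_count = 3
        self.corpus_count = 100
        # Deliberately missing: corpus_total_words, running_training_loss,
        # min_alpha_yet_reached (they did not exist in this "old" version).


class NewModel:
    """Hypothetical stand-in for the current model class."""


def copy_training_state(old_model, new_model):
    """Copy training counters, substituting defaults for missing attributes."""
    state = old_model.__dict__
    new_model.train_count = state.get('train_count', None)
    new_model.corpus_count = state.get('corpus_count', None)
    new_model.corpus_total_words = state.get('corpus_total_words', None)
    new_model.running_training_loss = state.get('running_training_loss', 0)
    # Fall back to the initial learning rate when no minimum was recorded.
    new_model.min_alpha_yet_reached = state.get('min_alpha_yet_reached', old_model.alpha)
    return new_model


migrated = copy_training_state(OldModel(), NewModel())
```

Reading through `__dict__.get` keeps the conversion tolerant of old pickles: a counter that was never stored becomes `None` (or `0` for the loss) rather than an `AttributeError` at load time.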
1 change: 1 addition & 0 deletions gensim/models/deprecated/fasttext.py
@@ -107,6 +107,7 @@ def load_old_fasttext(*args, **kwargs):

new_model.train_count = old_model.train_count
new_model.corpus_count = old_model.corpus_count
new_model.corpus_total_words = old_model.corpus_total_words
new_model.running_training_loss = old_model.running_training_loss
new_model.total_train_time = old_model.total_train_time
new_model.min_alpha_yet_reached = old_model.min_alpha_yet_reached
3 changes: 3 additions & 0 deletions gensim/models/deprecated/word2vec.py
@@ -191,6 +191,7 @@ def load_old_word2vec(*args, **kwargs):

new_model.train_count = old_model.__dict__.get('train_count', None)
new_model.corpus_count = old_model.__dict__.get('corpus_count', None)
new_model.corpus_total_words = old_model.__dict__.get('corpus_total_words', None)
Contributor:

Is this needed only for w2v? Why not the same change for other models?

Contributor Author:

fixed

new_model.running_training_loss = old_model.__dict__.get('running_training_loss', 0)
new_model.total_train_time = old_model.__dict__.get('total_train_time', None)
new_model.min_alpha_yet_reached = old_model.__dict__.get('min_alpha_yet_reached', old_model.alpha)
@@ -1622,6 +1623,8 @@ def load(cls, *args, **kwargs):
    model.make_cum_table()  # rebuild cum_table from vocabulary
if not hasattr(model, 'corpus_count'):
    model.corpus_count = None
if not hasattr(model, 'corpus_total_words'):
    model.corpus_total_words = None
for v in model.wv.vocab.values():
    if hasattr(v, 'sample_int'):
        break  # already 0.12.0+ style int probabilities
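
The `load` hunk above backfills attributes that older saved models lack, using a `hasattr` check after deserialization. A minimal sketch of that idea follows. `LegacyModel` and `backfill_missing` are hypothetical names used only for illustration, and the attribute list is taken from the two checks visible in the hunk.

```python
class LegacyModel:
    """Hypothetical stand-in for a model restored from an old file."""

    def __init__(self):
        self.corpus_count = 42
        # corpus_total_words was never stored by this "old" version.


def backfill_missing(model, names=('corpus_count', 'corpus_total_words')):
    """Set each missing attribute to None, leaving existing values untouched."""
    for name in names:
        if not hasattr(model, name):
            setattr(model, name, None)
    return model


restored = backfill_missing(LegacyModel())
```

Existing values are preserved, so the check is safe to run on every load regardless of which version produced the file.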
281 changes: 153 additions & 128 deletions gensim/models/doc2vec.py

Large diffs are not rendered by default.
