Retain the raw vocabulary when you feed several datasets separately #1606

Closed
wants to merge 3 commits

Conversation

@KazutoshiShinoda commented Oct 2, 2017

When you have several datasets that are each very large, it is difficult to feed all of the data to the model at once.

If you feed the datasets separately, the raw vocabulary needs to be retained in order to check whether the total number of occurrences of each infrequent word, counted across all the data, reaches min_count.

That is why I opened this pull request.

With these commits, this can be done without discarding infrequent words whose counts reach min_count globally but not in any single chunk.

from gensim.models import word2vec

model = word2vec.Word2Vec(window=1, min_count=1)

def load(file):
    # read one file and return a list of tokenized sentences
    with open(file) as f:
        return [line.split() for line in f]

files = ["data0.txt", "data1.txt", "data2.txt", ...]

update = False
for file in files:
    sentences = load(file)
    model.build_vocab(sentences, update=update)
    update = True

for file in files:
    sentences = load(file)
    model.train(sentences, total_examples=len(sentences),
                epochs=model.iter)

I added some code so that infrequent words which appear more often than min_count globally, but less often within any single chunk, are still added to the vocabulary. This situation can arise when you need to feed several separate datasets to the model one after another because each dataset is too large to load at once.
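
For illustration, here is a rough sketch of the underlying idea in plain Python, outside of gensim's internals: accumulate raw counts across chunks and only apply the min_count threshold once every chunk has been seen. The load() helper, the file names, and the min_count value are the same hypothetical placeholders as in the snippet above, not anything this pull request adds.

from collections import Counter

min_count = 5                      # placeholder threshold, for illustration only
raw_vocab = Counter()              # raw counts survive across chunks

for file in ["data0.txt", "data1.txt", "data2.txt"]:
    for sentence in load(file):    # load() is the hypothetical helper from above
        raw_vocab.update(sentence)

# a word can be kept even if it is rarer than min_count in every single chunk,
# as long as its combined count clears the threshold
kept_words = {word for word, count in raw_vocab.items() if count >= min_count}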
@gojomo (Collaborator) commented Oct 3, 2017

It's usually cleaner and less error-prone to create a single corpus iterator that streams all the combined data to a single build_vocab() call. There's an example of a utility class using this technique to virtually concatenate multiple files, the PathLineSentences class:

https://github.com/RaRe-Technologies/gensim/blob/09fddf5c1215fe94f35f46d16747b4ce6c8b32f0/gensim/models/word2vec.py#L1630
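
As a rough sketch of the kind of single, restartable corpus iterator described above (the class name and file names are placeholders, not gensim API):

class MultiFileSentences(object):
    """Stream tokenized sentences from several files; restartable on each pass."""

    def __init__(self, filenames):
        self.filenames = filenames

    def __iter__(self):
        # a fresh pass over every file each time the object is iterated,
        # so build_vocab() and train() can both consume it
        for filename in self.filenames:
            with open(filename) as f:
                for line in f:
                    yield line.split()

sentences = MultiFileSentences(["data0.txt", "data1.txt", "data2.txt"])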

This will generally also be better for making a single call to train(). In the loop you've shown, where train() is called separately for each file, training on one file runs to completion (with alpha decaying from its starting value to min_alpha over many iterations) before many iterations begin on the next file, with alpha again jumping from its max back down to its min. Neither completing one file before any others are considered, nor the 'saw-tooth' up-and-down of the effective 'alpha' learning rate, is proper stochastic gradient descent, and both would be likely to give much worse final results.

Because the combined re-iterable full corpus is preferred, we'd probably not want to adapt build_vocab() to work this way.
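
Putting this together, a minimal sketch of the preferred pattern, assuming the files sit in one directory of whitespace-tokenized, one-sentence-per-line text (the directory name and min_count value are placeholders, and model.iter matches the gensim version used in this thread):

from gensim.models.word2vec import Word2Vec, PathLineSentences

corpus = PathLineSentences("data_dir")   # streams every file in the directory, line by line
model = Word2Vec(window=1, min_count=5)
model.build_vocab(corpus)                # one vocabulary pass over the combined corpus
model.train(corpus, total_examples=model.corpus_count, epochs=model.iter)

Because PathLineSentences is re-iterable, the same corpus object can be passed to both build_vocab() and train(), so each word's count is tallied over all files before min_count is applied.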

@KazutoshiShinoda (Author) commented Oct 3, 2017

@gojomo Thanks for your quick reply.
I understand the situation now. I will use PathLineSentences and train only once from now on.

I greatly appreciate your kindness!

@gojomo closed this Oct 3, 2017