Retain the raw vocabulary when you feed several datasets separately #1606

Closed
wants to merge 3 commits

Conversation

@KazutoshiShinoda commented Oct 2, 2017

When you have several datasets that are each very large, it is difficult to feed all of the data to the model at once.

If you feed the datasets separately, the raw vocabulary needs to be retained in order to check whether the total number of occurrences of each infrequent word, counted across all the data, reaches min_count.

That is why I opened this pull request.

With these commits, this can be done without discarding infrequent words whose counts reach min_count globally but not in any single chunk.

from gensim.models import word2vec

model = word2vec.Word2Vec(window=1, min_count=1)

def load(file):
    # read one file and return a list of tokenized sentences
    with open(file) as f:
        return [line.split() for line in f]

files = ["data0.txt", "data1.txt", "data2.txt", ...]

update = False
for file in files:
    sentences = load(file)
    model.build_vocab(sentences, update=update)
    update = True

for file in files:
    sentences = load(file)
    model.train(sentences, total_examples=len(sentences),
                epochs=model.iter)

I added some code so that infrequent words which appear more often than min_count globally, but less often within any single chunk, are still added to the vocabulary. This situation can arise when you need to feed several separate datasets to the model one after another because each dataset is too large to load at once.
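
For illustration, here is a rough sketch of the underlying idea in plain Python, outside of gensim's internals: accumulate raw counts across chunks and only apply the min_count threshold once every chunk has been seen. The load() helper, the file names, and the min_count value are the same hypothetical placeholders as in the snippet above, not anything this pull request adds.

from collections import Counter

min_count = 5                      # placeholder threshold, for illustration only
raw_vocab = Counter()              # raw counts survive across chunks

for file in ["data0.txt", "data1.txt", "data2.txt"]:
    for sentence in load(file):    # load() is the hypothetical helper from above
        raw_vocab.update(sentence)

# a word can be kept even if it is rarer than min_count in every single chunk,
# as long as its combined count clears the threshold
kept_words = {word for word, count in raw_vocab.items() if count >= min_count}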
@gojomo (Collaborator) commented Oct 3, 2017

It's usually cleaner and less error-prone to create a single corpus iterator that streams all the combined data to a single build_vocab() call. There's an example of a utility class using this technique to virtually concatenate multiple files, the PathLineSentences class:

https://github.com/RaRe-Technologies/gensim/blob/09fddf5c1215fe94f35f46d16747b4ce6c8b32f0/gensim/models/word2vec.py#L1630
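
As a rough sketch of the kind of single, restartable corpus iterator described above (the class name and file names are placeholders, not gensim API):

class MultiFileSentences(object):
    """Stream tokenized sentences from several files; restartable on each pass."""

    def __init__(self, filenames):
        self.filenames = filenames

    def __iter__(self):
        # a fresh pass over every file each time the object is iterated,
        # so build_vocab() and train() can both consume it
        for filename in self.filenames:
            with open(filename) as f:
                for line in f:
                    yield line.split()

sentences = MultiFileSentences(["data0.txt", "data1.txt", "data2.txt"])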

This will generally also be better for making a single call to train(). In the loop you've shown, where train() is called separately for each file, training on one file runs to completion (with alpha decaying from its starting value to min_alpha over many iterations) before many iterations begin on the next file, with alpha again jumping from its max back down to its min. Neither completing one file before any others are considered, nor the 'saw-tooth' up-and-down of the effective 'alpha' learning rate, is proper stochastic gradient descent, and both would be likely to give much worse final results.

Because the combined re-iterable full corpus is preferred, we'd probably not want to adapt build_vocab() to work this way.
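
Putting this together, a minimal sketch of the preferred pattern, assuming the files sit in one directory of whitespace-tokenized, one-sentence-per-line text (the directory name and min_count value are placeholders, and model.iter matches the gensim version used in this thread):

from gensim.models.word2vec import Word2Vec, PathLineSentences

corpus = PathLineSentences("data_dir")   # streams every file in the directory, line by line
model = Word2Vec(window=1, min_count=5)
model.build_vocab(corpus)                # one vocabulary pass over the combined corpus
model.train(corpus, total_examples=model.corpus_count, epochs=model.iter)

Because PathLineSentences is re-iterable, the same corpus object can be passed to both build_vocab() and train(), so each word's count is tallied over all files before min_count is applied.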

@KazutoshiShinoda (Author) commented Oct 3, 2017

@gojomo Thanks for your quick reply.
I understand the situation now. I will use PathLineSentences and train only once from now on.

I greatly appreciate your kindness!

@gojomo closed this Oct 3, 2017