
Gensim's word2vec has a loss of 0 from epoch 1? #2920

Closed
LusKrew opened this issue Aug 20, 2020 · 1 comment

Comments

@LusKrew

LusKrew commented Aug 20, 2020

I am using the Word2vec module of the Gensim library to train word embeddings; the dataset is 400k sentences with 100k unique words (it's not English).

I'm using this code to monitor and calculate the loss:


from gensim.models.callbacks import CallbackAny2Vec

class MonitorCallback(CallbackAny2Vec):
    def __init__(self, test_words):
        self._test_words = test_words

    def on_epoch_end(self, model):
        print("Model loss:", model.get_latest_training_loss())  # print loss
        for word in self._test_words:  # show wv logic changes
            print(model.wv.most_similar(word))



monitor = MonitorCallback(["MyWord"])  # monitor with demo words

w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE, window=W2V_WINDOW, min_count=W2V_MIN_COUNT, callbacks=[monitor])

w2v_model.build_vocab(tokenized_corpus)

words = w2v_model.wv.vocab.keys()
vocab_size = len(words)
print("Vocab size", vocab_size)

print("[*] Training...")

w2v_model.train(tokenized_corpus, total_examples=len(tokenized_corpus), epochs=W2V_EPOCH)



The problem is that from epoch 1 the loss is 0 and the vectors of the monitored words don't change at all!

[*] Training...
Model loss: 0.0
Model loss: 0.0
Model loss: 0.0
Model loss: 0.0

So what is the problem here? Is this normal? The tokenized corpus is a list of lists, where each entry looks like tokenized_corpus[0] = ["word1", "word2", ...].

I googled, and it seems some older versions of gensim had problems with calculating the loss, but those reports are from almost a year ago, so shouldn't this be fixed by now?

I tried the code provided in the answer to this question as well, but the loss is still 0:

https://stackoverflow.com/questions/52038651/loss-does-not-decrease-during-training-word2vec-gensim

@gojomo
Collaborator

gojomo commented Aug 20, 2020

You haven't used the compute_loss=True argument to the Word2Vec initialization to enable loss-tallying at all, per docs at https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec
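
For example, a minimal sketch staying close to the snippet above (the W2V_* constants and the monitor callback are the reporter's own placeholders, not gensim names; size= is the gensim 3.x parameter name):

import gensim

w2v_model = gensim.models.word2vec.Word2Vec(
    size=W2V_SIZE,
    window=W2V_WINDOW,
    min_count=W2V_MIN_COUNT,
    compute_loss=True,  # enable loss tallying so get_latest_training_loss() reports something
    callbacks=[monitor],
)
w2v_model.build_vocab(tokenized_corpus)
w2v_model.train(
    tokenized_corpus,
    total_examples=len(tokenized_corpus),
    epochs=W2V_EPOCH,
    compute_loss=True,  # train() accepts the flag as well
)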

After you do that, you may encounter other bugs with the current loss-tracking, which you can read about in detail via the open issues: https://github.com/RaRe-Technologies/gensim/issues?q=is%3Aissue+is%3Aopen+loss+in%3Atitle+
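
One quirk reported in those issues is that get_latest_training_loss() returns a running tally for the whole train() call rather than a per-epoch figure. Assuming that behaviour, a callback can log per-epoch deltas instead (LossLogger is a hypothetical name, not a gensim class):

from gensim.models.callbacks import CallbackAny2Vec

class LossLogger(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0
        self.previous_cumulative_loss = 0.0

    def on_epoch_end(self, model):
        cumulative_loss = model.get_latest_training_loss()  # running tally, per the assumption above
        epoch_loss = cumulative_loss - self.previous_cumulative_loss
        self.previous_cumulative_loss = cumulative_loss
        self.epoch += 1
        print("Epoch", self.epoch, "loss:", epoch_loss)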

Unless/until you're sure your concern is a bug, questions are better handled via Stack Overflow (where I also answered your question) or the project discussion list, to reserve this issue-tracker for bugs & feature requests.
