Add ns_exponent parameter to control the negative sampling distribution for *2vec models. Fix #2090 #2093

fernandocamargoai
Contributor

Fixes #2090

Description:

As pointed out in the following article, the exponent of the negative sampling distribution, which is fixed at 0.75 in Gensim, is worth tuning, especially for applications beyond NLP. So it would be very helpful to make it a parameter of Word2Vec instead of hard-coding it.

https://arxiv.org/abs/1804.04212

ns_exponent : float
The exponent used to smooth the cumulative distribution used for negative sampling.
1.0 leads to a sampling based on the frequency distribution, 0.0 makes items beings sampled equally,
while a negative value makes unpopular items being sampled more often than popular onces. The default value
Collaborator

For clarity, grammar, and to give a hint of when this could be beneficially tuned, I'd reword as:

"The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications."
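
To make the effect of the exponent concrete, here is a minimal sketch in plain Python. It assumes each word's sampling probability is proportional to its count raised to `ns_exponent`; the helper name `negative_sampling_probs` and the toy counts are hypothetical, and this is not Gensim's actual implementation (which builds a cumulative table internally).

```python
def negative_sampling_probs(counts, ns_exponent=0.75):
    """Return {word: probability}, with probability proportional to count ** ns_exponent.

    Hypothetical helper for illustration only; not part of Gensim.
    """
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Toy word counts, invented for this example.
counts = {"the": 100, "cat": 10, "zygote": 1}

# Compare the four regimes described in the suggested docstring:
# proportional to frequency, the 0.75 default, uniform, and inverted.
for exponent in (1.0, 0.75, 0.0, -0.5):
    probs = negative_sampling_probs(counts, exponent)
    print(exponent, {word: round(p, 3) for word, p in probs.items()})
```

With this PR, the corresponding value would be passed to the model as the `ns_exponent` argument, for example `Word2Vec(sentences, negative=5, ns_exponent=0.5)`, where `sentences` stands for whatever corpus is being trained on.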

The exponent used to smooth the cumulative distribution used for negative sampling.
1.0 leads to a sampling based on the frequency distribution, 0.0 makes items beings sampled equally,
while a negative value makes unpopular items being sampled more often than popular onces. The default value
is empirically set to 0.75 following the original paper of Word2Vec.
Collaborator

Same as above.

@gojomo
Collaborator

gojomo commented Jun 18, 2018

Code looks good; I've suggested rewording the comment for clarity & to give a hint/pointer why this might be changed.

@fernandocamargoai
Contributor Author

Hello, @gojomo. I've made the adjustments. Thank you for preparing a better text to document this parameter.

@fernandocamargoai
Contributor Author

It seems the build failed with some kind of timeout when running the tests for Python 3.5. It was passing before, and all I changed since then was the docs.

@gojomo
Collaborator

gojomo commented Jun 19, 2018

Looks like it was a spurious failure unrelated to your commits; I forced a retry and it succeeded.

@menshikh-iv
Contributor

LGTM. @fernandocamargoti, please resolve the merge conflict and I'll merge your PR.

…ature/negative_sampling_distribution_parameter

# Conflicts:
#	gensim/models/doc2vec.py
#	gensim/models/fasttext.py
#	gensim/models/word2vec.py
@fernandocamargoai
Contributor Author

Done, @menshikh-iv.

menshikh-iv changed the title from "Adding ns_exponent parameter to control the negative sampling distribution" to "Adding ns_exponent parameter to control the negative sampling distribution for *2vec models. Fix #2090" on Jun 22, 2018
menshikh-iv changed the title from "Adding ns_exponent parameter to control the negative sampling distribution for *2vec models. Fix #2090" to "Add ns_exponent parameter to control the negative sampling distribution for *2vec models. Fix #2090" on Jun 22, 2018
menshikh-iv merged commit 76d194b into piskvorky:develop on Jun 22, 2018
@menshikh-iv
Contributor

@fernandocamargoti nice work, congrats on your first contribution 👍

fernandocamargoai deleted the feature/negative_sampling_distribution_parameter branch on June 22, 2018 13:15

Successfully merging this pull request may close these issues.

Gensim doesn't allow changing negative sampling distribution parameter
4 participants