[WIP #1991]: Word2Bits benchmarks #2011

Closed

@persiyanov (Contributor) commented Mar 31, 2018

Pull request with intermediate results on this issue.

I've implemented the quantization techniques described in the paper.

I've just benchmarked it on the text8 corpus with 10 iterations over the dataset.

Other params (almost as in the paper, except the embedding size):

  • size = 128
  • iterations = 10
  • window = 10
  • negative = 12
  • min_count = 5
  • subsampling = 1e-4
  • learning rate = 0.05 (linearly decayed to 0.0001)
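
For reference, a minimal sketch of how this setup maps onto the gensim API (keyword names assume gensim 3.x; `num_bits` is the parameter proposed in this PR and is not part of released gensim):

```python
from gensim.models.word2vec import Word2Vec, Text8Corpus

# Hyperparameters from the list above, using gensim 3.x keyword names.
# `num_bits` is the new argument from this PR: 0 = plain word2vec,
# 1 or 2 = quantized training as in the Word2Bits paper.
model = Word2Vec(
    Text8Corpus('text8'),
    size=128, iter=10, window=10, negative=12, min_count=5,
    sample=1e-4, alpha=0.05, min_alpha=0.0001,
    sg=0, hs=0,        # CBOW with negative sampling (the only mode the PR supports)
    num_bits=1,
)
```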

Based on my initial benchmark, the results are not promising:

| Training method | Quantization after training | Semantic accuracy | Syntactic accuracy |
|---|---|---|---|
| Original w2v (num_bits=0) | No quantization | 49.96% | 46.99% |
| | 1 bit | 13.06% | 15.16% |
| | 2 bits | 13.31% | 15.04% |
| 1-bit quantization (num_bits=1) | No quantization | 29.51% | 29.85% |
| | 1 bit | 19.18% | 13.28% |
| 2-bit quantization (num_bits=2) | No quantization | 44.38% | 36.46% |
| | 2 bits | 18.79% | 21.52% |

  • Quantized vectors work better if the training phase was done with quantization.
  • But there is a large accuracy drop when training is done with quantization. I suspect the embedding size (128) is too small to encode all the information when quantization is applied.

So, currently I'm training the same models but for 25 iterations and with size=400. I'll post the results ASAP.

After that, I will train the same setup on the English Wikipedia (as was done in the paper).

@menshikh-iv, please take a look at my code. Maybe there are some bugs that lead to the low accuracy.
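
(For anyone reproducing the numbers above: semantic/syntactic accuracy refers to the standard questions-words analogy test. Below is a minimal sketch of how such a split can be computed with gensim 3.x, assuming `model` is a trained Word2Vec instance like the one sketched earlier; the exact evaluation script behind the table is not in this PR, so treat this as an illustration.)

```python
# Analogy evaluation against the standard questions-words.txt file.
# In gensim 3.x, KeyedVectors.accuracy() returns one dict per analogy section
# (keys: 'section', 'correct', 'incorrect') plus a final 'total' section.
sections = model.wv.accuracy('questions-words.txt')

def percent_correct(secs):
    correct = sum(len(s['correct']) for s in secs)
    total = correct + sum(len(s['incorrect']) for s in secs)
    return 100.0 * correct / total if total else 0.0

# Syntactic sections are named 'gram...'; the rest (minus 'total') are semantic.
semantic = [s for s in sections if s['section'] != 'total' and not s['section'].startswith('gram')]
syntactic = [s for s in sections if s['section'].startswith('gram')]
print('semantic: %.2f%%  syntactic: %.2f%%' % (percent_correct(semantic), percent_correct(syntactic)))
```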

@persiyanov (Contributor, Author) commented Mar 31, 2018

Results on text8 corpus, size=400, iter=25:

| Training method | Quantization after training | Semantic accuracy | Syntactic accuracy |
|---|---|---|---|
| Original w2v (num_bits=0) | No quantization | 47.01% | 44.91% |
| | 1 bit | 16.84% | 21.31% |
| | 2 bits | 16.84% | 21.31% |
| 1-bit quantization (num_bits=1) | No quantization | 66.71% | 48.34% |
| | 1 bit | 50.96% | 34.33% |
| 2-bit quantization (num_bits=2) | No quantization | 49.21% | 37.05% |
| | 2 bits | 28.59% | 27.02% |

Results are much better now, especially in the 1-bit case. I'll try size=1000 on text8 and then turn to the Wikipedia dataset.

[UPD]: size=1000

Results on text8 corpus, size=1000, iter=25:

| Training method | Quantization after training | Semantic accuracy | Syntactic accuracy |
|---|---|---|---|
| Original w2v (num_bits=0) | No quantization | 33.39% | 38.47% |
| | 1 bit | 14.11% | 18.92% |
| | 2 bits | 14.04% | 18.95% |
| 1-bit quantization (num_bits=1) | No quantization | 66.90% | 48.45% |
| | 1 bit | 54.64% | 40.09% |
| 2-bit quantization (num_bits=2) | No quantization | 40.46% | 34.32% |
| | 2 bits | 21.91% | 26.70% |

@persiyanov referenced this pull request Mar 31, 2018: Word2Bits benchmark #1991 (closed)

@menshikh-iv (Collaborator) left a comment

Great @persiyanov,

  1. Add the new (Cython) files to setup.py and MANIFEST.in; examples: https://github.com/RaRe-Technologies/gensim/pull/1825/files#diff-97c91a104c431d0c365565d3ac03ac13 and https://github.com/RaRe-Technologies/gensim/pull/1825/files#diff-2eeaed663bd0d25b7e608891384b7298
  2. Move num_bits to word2vec (instead of the base class)
  3. Add backward-compatibility tests and tests for the new functionality
  4. Look at the CI logs; it looks like your change affected other models and broke them
  5. We are really waiting for the Wiki benchmark results 🔥
@@ -0,0 +1,88 @@
+from __future__ import unicode_literals

@menshikh-iv (Collaborator) commented Apr 1, 2018:

A notebook is enough here.

@@ -298,7 +298,7 @@ def _set_train_params(self, **kwargs):
         raise NotImplementedError()

     def __init__(self, sentences=None, workers=3, vector_size=100, epochs=5, callbacks=(), batch_words=10000,
-                 trim_rule=None, sg=0, alpha=0.025, window=5, seed=1, hs=0, negative=5, cbow_mean=1,
+                 trim_rule=None, sg=0, alpha=0.025, window=5, seed=1, hs=0, negative=5, num_bits=0, cbow_mean=1,

@menshikh-iv (Collaborator) commented Apr 1, 2018:

This parameter is only for w2v, isn't it? Maybe it's better to add it directly to w2v?

@@ -309,6 +309,7 @@ def __init__(self, sentences=None, workers=3, vector_size=100, epochs=5, callbac
         self.min_alpha = float(min_alpha)
         self.hs = int(hs)
         self.negative = int(negative)
+        self.num_bits = int(num_bits)

@menshikh-iv (Collaborator) commented Apr 1, 2018:

We need to be sure that old models still load successfully (the load function needs modifying), plus a backward-compatibility test (this modification is only needed for the w2v class, I think).

@@ -121,6 +121,18 @@ def save(self, fname_or_handle, **kwargs):
     def load(cls, fname_or_handle, **kwargs):
         return super(BaseKeyedVectors, cls).load(fname_or_handle, **kwargs)

+    def quantize_vectors(self, num_bits=0):

@menshikh-iv (Collaborator) commented Apr 1, 2018:

Missing docstring.

@menshikh-iv (Collaborator) commented Apr 1, 2018:

Also, how do you reduce memory here (self.vectors is float32 anyway)?
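
For illustration, a sketch of what `quantize_vectors` could look like (the level values follow my reading of the Word2Bits quantization function and should be treated as assumptions, not this PR's exact code). It also shows why memory isn't reduced by default: the result is still a float32 array, so real savings would need bit-packing on top:

```python
import numpy as np

def quantize_vectors(self, num_bits=0):
    """Threshold the stored vectors in place to 2**num_bits distinct values.

    Note: the output is still a float32 matrix of the same shape, so memory
    usage does not drop; actual savings would require packing the codes
    (e.g. np.packbits) plus a small lookup table at query time.
    """
    if num_bits == 0:
        return
    v = self.vectors
    if num_bits == 1:
        # 1-bit quantization: keep only the sign of each component, scaled to +-1/3.
        self.vectors = np.where(v >= 0, 1.0 / 3, -1.0 / 3).astype(np.float32)
    elif num_bits == 2:
        # 2-bit quantization: four levels (exact thresholds/levels assumed here).
        levels = np.select(
            [v < -1.0 / 3, v < 0.0, v < 1.0 / 3],
            [-0.5, -1.0 / 6, 1.0 / 6],
            default=0.5,
        )
        self.vectors = levels.astype(np.float32)
    else:
        raise ValueError("num_bits must be 0, 1 or 2")
```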

@@ -424,7 +424,7 @@ class Word2Vec(BaseWordEmbeddingsModel):

     def __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
                  max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
-                 sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
+                 sg=0, hs=0, negative=5, num_bits=0, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,

@menshikh-iv (Collaborator) commented Apr 1, 2018:

It's a good idea to have an assert here (because this only works with CBOW and negative sampling), and num_bits should accept only 1 and 2 (not more).
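
Something along these lines could express that check (a sketch of the suggested validation, not the PR's actual code; the helper name is made up):

```python
def _check_quantization_params(num_bits, sg, negative):
    """Validate the parameter combination suggested in the review."""
    if num_bits not in (0, 1, 2):
        raise ValueError("num_bits must be 0 (off), 1 or 2")
    if num_bits and (sg != 0 or negative <= 0):
        raise ValueError("quantized training is only implemented for CBOW "
                         "with negative sampling (sg=0, negative > 0)")
```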

                              fast_version=FAST_VERSION)

     def _do_train_job(self, sentences, alpha, inits):
         """
         Train a single batch of sentences. Return 2-tuple `(effective word count after
         ignoring unknown words and sentence length trimming, total word count)`.
         """
-        work, neu1 = inits
+        work1, work2, neu1 = inits

@menshikh-iv (Collaborator) commented Apr 1, 2018:

Is it possible without work2 here?

         else:
-            tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
+            tally += train_batch_cbow(self, sentences, alpha, work1, work2, neu1, self.compute_loss, self.num_bits)

@menshikh-iv (Collaborator) commented Apr 1, 2018:

I think it's better to extract num_bits directly from self in the Cython code (for more consistent interfaces), like cdef int negative = model.negative.

@persiyanov (Contributor, Author) commented Apr 5, 2018

Guys, don't lose me, I'm here. The models are training on English Wikipedia right now; it takes quite a long time. I expect the results to be ready in a day or a few.

@piskvorky (Member) commented Apr 5, 2018

@persiyanov is it that much slower than "normal" word2vec? How many words per second are you seeing (on what hardware)?

@persiyanov (Contributor, Author) commented Apr 7, 2018

@piskvorky For "normal" word2vec it was around 140-150k words per second.

Now the model with 1-bit quantization is training; its speed is 85k words per second.

@piskvorky (Member) commented Apr 7, 2018

OK, so half the speed. That's not so bad. Do you see areas for possible optimization?

@persiyanov (Contributor, Author) commented Apr 10, 2018

Here are the results on the English Wikipedia. Trained for 5 epochs, embedding size 400:

| Training method | Quantization after training | Semantic accuracy | Syntactic accuracy |
|---|---|---|---|
| Original w2v (num_bits=0) | No quantization | 68.79% | 38.97% |
| | 1 bit | 44.48% | 25.22% |
| | 2 bits | 44.48% | 25.22% |
| 1-bit quantization (num_bits=1) | No quantization | 77.22% | 61.36% |
| | 1 bit | 67.16% | 51.97% |
| 2-bit quantization (num_bits=2) | No quantization | 65.12% | 47.09% |
| | 2 bits | 72.54% | 54.99% |

Looks like quantization really acts like a regularizer. It could help in two ways:

  1. Train embeddings with the quantized loss, then use the trained fp32 vectors (this gives the highest quality).
  2. Train embeddings with the quantized loss and then also use the quantized vectors. Not as good as (1), but still quite good (see the sketch below).
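
A minimal usage sketch of the two options against this PR's branch (`num_bits` and `quantize_vectors` are the additions proposed here, not released gensim API; the corpus path is a placeholder):

```python
from gensim.models.word2vec import Word2Vec, LineSentence

corpus = LineSentence('wiki.txt')  # hypothetical preprocessed Wikipedia dump

# Option 1: train with the quantized loss, keep the full-precision fp32 vectors.
model = Word2Vec(corpus, size=400, iter=5, window=10, negative=12,
                 sample=1e-4, alpha=0.05, sg=0, num_bits=1)
vectors_fp32 = model.wv                 # highest accuracy in the table above

# Option 2: additionally threshold the trained vectors.
model.wv.quantize_vectors(num_bits=1)   # slightly lower accuracy, but quantized
```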

[UPD]: I didn't perform text normalization on the wiki corpus (as was done in the original Word2Bits paper) for lack of time.

@menshikh-iv (Collaborator) commented Apr 10, 2018

Great, awesome results @persiyanov 🔥 🔥 🔥, definitely a nice addition to Gensim!

@gojomo (Member) commented Apr 10, 2018

Some thoughts:

  • text8 is so small for word-vector purposes that I'd be reluctant to draw any strong conclusions from comparisons using it.
  • The parameters window=10, negative=12, alpha=0.05 are sufficiently different from the more common defaults that I wonder what motivated their choice in the Word2Bits paper. It'd be interesting to compare results with more-default parameters, or with (best non-quantized params) vs (best quantized params).
  • It'd be useful for the tables of results to include runtimes, and for different options that each improve results at a cost in runtime to be compared. (For example: if quantization doubles the runtime but improves accuracy, do other ways of spending twice as much time, especially the naive approach of twice as many training epochs, do just as well in the same amount of time?)

persiyanov added some commits May 10, 2018: fix
@piskvorky (Member) commented Sep 6, 2018

@persiyanov what's the status here? This is a pretty cool feature. CC @menshikh-iv

@persiyanov (Contributor, Author) commented Sep 14, 2018

Attaching the email discussion with the author of the Word2Bits paper. He says that quantization during training is unnecessary; the same effect can be achieved via l2 regularization (which is much faster).


Hi Dmitry and Menshikh,

This is Max, author of the Word2Bits paper. I've been following the Word2Bits benchmark pull request (#2011).

Firstly, thanks so much for benchmarking all this -- really appreciate it!

I have some updates to Word2Bits. The main finding is that you can train with l2 regularization and then just threshold the resulting word vector values (using the same quantization function as before) and still get pretty similar results as training with quantization and then thresholding. So the main finding is that training with quantization is equivalent to regularized training, and regularized training really helps improve word vector quality.

Thought this piece of information might be helpful, as training with l2 regularization is much faster than training with the quantization function! Hope this helps Gensim's word2vec!

Thanks!
Max


Greetings, Max!

This is really great, because my current implementation runs at 80k/50k words per second for 1- and 2-bit quantization respectively. For comparison, classic word2vec runs at 150k words per second.

I'll definitely try this out. Can you share the value of the l2 coefficient from your experiments?

Thank you for reaching out, looking forward to your answer,

Dmitry


Max Lam agnusmaximus@gmail.com | Tue, Apr 17, 2018 at 3:10 AM
To: Dmitry Persiyanov dmitry.persiyanov@gmail.com, Cc: menshikh.iv@gmail.com

Awesome! I used an l2 coefficient of 0.001 for my experiments (which seemed to work pretty well for full Wikipedia and for text8).


Ivan menshikh.iv@gmail.com | Tue, Apr 17, 2018 at 8:13 AM
To: Max Lam agnusmaximus@gmail.com, dmitry.persiyanov@gmail.com

Hi Max, thanks for the advice, this will be pretty useful for us! Do you have a GitHub account?


Hi Ivan,

Yeah, I have a GitHub account -- my id is agnusmaximus.

Thanks!
Max

@piskvorky (Member) commented Sep 14, 2018

Thanks @persiyanov. I guess that means we drop this PR and focus on adding regularization instead? CC @menshikh-iv @gojomo

@persiyanov (Contributor, Author) commented Sep 14, 2018

No need to drop this PR. As I remember, I already added l2 regularization in this PR, so we can continue within it. Remember that only the training should be done with l2 regularization; after that, we should still apply quantization, as I understand it.
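
For clarity, a sketch of that pipeline (the l2-regularized training step is what this PR would add, so the `reg_coef` keyword is hypothetical; the 0.001 value comes from Max's email above, and the +-1/3 thresholding levels are my reading of the paper's 1-bit quantization function):

```python
import numpy as np
from gensim.models.word2vec import Word2Vec, LineSentence

# 1. Train with l2 regularization instead of the quantized loss.
#    `reg_coef` does not exist in released gensim; it stands in for whatever
#    argument this PR ends up exposing.
model = Word2Vec(LineSentence('wiki.txt'), size=400, iter=5, sg=0, negative=12,
                 window=10, sample=1e-4, alpha=0.05, reg_coef=0.001)

# 2. After training, threshold the vectors with the 1-bit quantization function.
wv = model.wv
wv.vectors = np.where(wv.vectors >= 0, 1.0 / 3, -1.0 / 3).astype(np.float32)
```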

Currently I have almost no free time to work on it (I moved to a new job and my university classes started); maybe I will in a month or two.

@menshikh-iv (Collaborator) commented Jan 10, 2019

Unfortunately, @persiyanov still has no time 😟, so I'm closing this PR :(
@persiyanov, feel free to re-open when you have time (I really hope that will happen).
