[WIP] Adding unsupervised FastText to Gensim #1525
Conversation
There is a PR open here: #1482
gensim/models/fasttext.py
Outdated
for indices in word2_indices:
    word2_subwords += ['<' + model.wv.index2word[indices] + '>']
    word2_subwords += Ft_Wrapper.compute_ngrams(model.wv.index2word[indices], model.min_n, model.max_n)
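For context on what this snippet builds: compute_ngrams essentially brackets the word with '<' and '>' markers and collects all its character n-grams of length min_n to max_n. A minimal illustrative sketch of that idea (not the wrapper's exact code):

import_needed = None  # no imports required for this sketch

def char_ngrams(word, min_n, max_n):
    """Illustrative FastText-style subword extraction: bracket the word
    with '<' and '>' and collect all character n-grams whose length is
    between min_n and max_n (inclusive)."""
    extended = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

# e.g. char_ngrams('night', 3, 6) -> ['<ni', 'nig', 'igh', 'ght', 'ht>', '<nig', ...]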
This works for now, but ideally we'd like a cleaner solution to this later on. In general, I think the FastText wrapper (to load .bin files) and the FastText training code implemented here share a lot of common ground (both conceptually and code-wise). Once we have the correctness of the models verified, we'd be looking to refactor it somehow (maybe just inheriting from the wrapper? Completely removing train functionality from the wrapper and replacing it with native train functionality?). Any thoughts on this?
Agreed. The current implementation shares code with Gensim's FastText wrapper, so inheriting from the wrapper seems to be a good way to avoid this redundancy.
I think it would also be helpful to refactor the current Word2Vec implementation since, apart from using ngram-vectors rather than word-vectors at the time of backpropagation in fasttext, the logic and code of the two models overlap significantly. Having one common parent class with the two models as children could be a useful way to tackle this.
I agree that refactoring to avoid redundancy would be good. I'm not sure a common parent class is the way to go though, since most of the redundant code is in train_batch_cbow and train_batch_skipgram, which are both independently defined functions and not methods of the Word2Vec class.
Apart from these training functions, there is overlap between the two models in some other tasks as well. For instance, in our fasttext implementation, we first construct the vocabulary in the same way as is done in Word2Vec (i.e. by calling the scan_vocab, scale_vocab and finalize_vocab functions) and then handle all the "fasttext-specific" things (like constructing the dictionary of ngrams and precomputing & storing ngrams for each word in the vocab). These "fasttext-specific" things could be handled at an earlier stage (e.g. within the scale_vocab or finalize_vocab functions), which would also help us optimize things, e.g. by avoiding iterating over the vocabulary several times unnecessarily.
The super method is useful in such situations, where the parent class implementation of the method needs to be run along with whatever code is specific to the child class.
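As a rough sketch of that pattern (the class layout and attribute names here are assumptions for illustration, not the PR's actual code):

from gensim.models.word2vec import Word2Vec
from gensim.models.wrappers.fasttext import FastText as Ft_Wrapper


class FastTextSketch(Word2Vec):
    """Hypothetical: FastText inheriting Word2Vec's vocabulary pipeline."""

    def finalize_vocab(self, *args, **kwargs):
        # run Word2Vec's own vocabulary finalization first
        super(FastTextSketch, self).finalize_vocab(*args, **kwargs)
        # then do the fasttext-specific work in the same pass: precompute
        # and cache the character n-grams for every word in the vocab
        # (min_n / max_n are assumed to be set in __init__)
        self.wv.ngrams_word = {
            word: Ft_Wrapper.compute_ngrams(word, self.min_n, self.max_n)
            for word in self.wv.vocab
        }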
gensim/models/fasttext.py
Outdated
for indices in word2_indices: | ||
word2_subwords += ['<' + model.wv.index2word[indices] + '>'] | ||
word2_subwords += Ft_Wrapper.compute_ngrams(model.wv.index2word[indices], model.min_n, model.max_n) | ||
word2_subwords = list(set(word2_subwords)) |
I thought we changed this to no longer be a set.
That's correct. I have pushed those changes now.
gensim/models/fasttext.py
Outdated
if context_locks is None:
    context_locks = model.syn0_all_lockf

if word not in model.wv.vocab:
Why is this necessary? Shouldn't the word necessarily be present in the vocab anyway?
Yes, that's correct. This was in the Word2Vec code as well, so it slipped through, I guess. Thanks for pointing this out!
Please mention the slow runtime of Gensim's pure Python version in the notebook - the exact time on the Lee corpus, for example.
Comparison of Gensim's native Python implementation with Facebook's original C++ code
Note: we have mainly used 3 functions for comparing the 2 implementations: accuracy(), evaluate_word_pairs() and most_similar().
For the mode
Overlapping words: [u'midnight', u'knight', u'dwight', u'tight', u'bright', u'wight', u'tamazight', u'nightclubs']
Similar results (an overlap of around 7-8 words) were obtained for the other modes as well.
cc: @piskvorky @gojomo
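A rough sketch of how such a comparison might be set up (the corpus and evaluation files, and the training parameters, are assumptions here, not the exact setup used above):

from gensim.models.fasttext import FastText
from gensim.models.word2vec import LineSentence
from gensim.models.wrappers.fasttext import FastText as Ft_Wrapper

sentences = LineSentence('lee_background.cor')  # assumed training corpus

# native pure-Python implementation from this PR vs. the wrapper around the C++ binary
gensim_model = FastText(sentences, size=100, min_count=5)      # parameters assumed
wrapper_model = Ft_Wrapper.load_fasttext_format('cpp_model')   # assumed .bin/.vec prefix

for model in (gensim_model, wrapper_model):
    model.wv.accuracy('questions-words.txt')        # analogy accuracy
    model.wv.evaluate_word_pairs('wordsim353.tsv')  # word-pair similarity correlations
    print(model.wv.most_similar('night', topn=10))  # nearest-neighbour overlap check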
gensim/models/fasttext.py
Outdated
else:
    new_vocab_len = len(self.wv.vocab)
    for ngram, idx in self.wv.hash2index.items():
        self.wv.hash2index[ngram] = idx + new_vocab_len - self.old_vocab_len
I think it might be a good idea to have two separate matrices: one for storing the vectors for the <word> tokens, and one for the subwords. Along with renaming the variables to more intuitive names (we don't really need to follow the syn0/syn1 nomenclature here), that should make the code much cleaner.
It also makes resizing much easier.
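A minimal sketch of that two-matrix layout (names and sizes below are purely illustrative):

import numpy as np

vocab_size, num_buckets, dim = 10000, 100000, 100   # illustrative sizes

# one matrix for the <word> tokens, indexed by vocabulary position ...
vectors_vocab = np.random.uniform(-1.0 / dim, 1.0 / dim, (vocab_size, dim)).astype(np.float32)
# ... and a separate one for subword (n-gram) vectors, indexed by hashed bucket
vectors_ngrams = np.random.uniform(-1.0 / dim, 1.0 / dim, (num_buckets, dim)).astype(np.float32)

# growing the vocabulary later only touches the first matrix; the n-gram
# bucket matrix keeps a fixed size, so no index remapping is needed
new_rows = np.zeros((50, dim), dtype=np.float32)
vectors_vocab = np.vstack([vectors_vocab, new_rows])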
Done
@chinmayapancholi13 thanks. What are the differences due to? Why is
@piskvorky There still remain some differences in terms of the "randomness" at training time between the C++ and Python implementations. These include the initialization of the ngram-vector matrices, choosing which words are downsampled, choosing a reduced window size, randomization in the negative-sampling (randomly choosing negative words) and hierarchical-softmax (tree) specific parts of the code, and multithreading (worker threads > 1 for the results shown above). So as far as the difference in accuracy values is concerned, it could be due to the relatively small size of the corpus (10 MB) used here and the above-listed sources of randomness. The values become closer when using a 100 MB corpus for training, as can be seen below:
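For what it's worth, a sketch of how most of the Python-side randomness could be pinned down for such a comparison (the parameter names follow the Word2Vec-style API and are assumptions here):

from gensim.models.fasttext import FastText

sentences = [['night', 'knight', 'bright'], ['midnight', 'tight', 'wight']]  # toy corpus

model = FastText(
    sentences,
    size=100,
    workers=1,   # a single worker thread removes multithreading nondeterminism
    seed=42,     # fixed seed for vector init, downsampling and negative sampling
    sample=0,    # optionally disable downsampling altogether
)

The C++ binary would still need to be run single-threaded (its -thread option) for the two runs to be comparable.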
@piskvorky And about the point regarding "not-all-top-10-words-matching", that too seems to be because of the above-mentioned reasons. On training the model with a 100 MB corpus, the top-10 words for the
Gensim:
Wrapper:
Overlap:
It is very likely that even for a word which belongs to the top-10 list for one implementation and doesn't for the other, the word might be in the top-15 or top-20 (say) for the one it is missing in. :)
LGTM, very nice job @chinmayapancholi13 🔥 👍
Thanks @chinmayapancholi13. Most of these items seem RNG-related -- what RNG does the original code use? Any way we can simply replicate it, use the same seed and thus get the same random numbers? (At least for testing, so performance is irrelevant here.)
@piskvorky The original C++ code uses minstd_rand along with uniform_real_distribution to generate the random values used at various places in the code (like the reduced window size). If we want the results to be even closer, we could try to emulate this RNG and use it in our code too. Also, there was some previous discussion on #1482 about comparing the outputs of our implementation with the original C++ code such that the results would be very close in observable quality rather than numerically identical. I wanted to know whether, in your opinion, the way the comparison has been done so far (using the 3 functions accuracy(), evaluate_word_pairs() and most_similar()) is in the right direction?
@chinmayapancholi13 thanks for investigating! I appreciate the thoroughness. We should go for approximation only if there's no other way -- that's why I'm asking about the RNG. Is it difficult to use the exact same RNG directly from Python? If we can replicate the original RNG (at least for testing), we don't need any "very close in observable quality" (which is always questionable -- is 2% "very close" or not?). Instead, we can go for identical, ± some numeric rounding error. If it's not possible to use the original RNG, we could shrug the differences away. But I'd much prefer to start on the right foot, with a verifiable and verified algo, before commencing optimizations.
@piskvorky I agree. We should indeed try to verify the model's correctness fully before moving on to optimizations, and replicating this RNG in the Python code should help us get closer results. I'll try to do this and give updates if I hit any blockers.
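For reference, minstd_rand is just a small linear congruential generator (state_{n+1} = 48271 * state_n mod (2^31 - 1)), so a pure-Python port is straightforward. A sketch (the uniform mapping below is a simplification, not a bit-exact copy of uniform_real_distribution):

class MinstdRand(object):
    """Pure-Python stand-in for C++'s std::minstd_rand."""

    MULTIPLIER = 48271
    MODULUS = 2 ** 31 - 1

    def __init__(self, seed=1):
        # the internal state must be non-zero and below the modulus
        self.state = seed % self.MODULUS or 1

    def next_int(self):
        self.state = (self.MULTIPLIER * self.state) % self.MODULUS
        return self.state

    def uniform(self, low=0.0, high=1.0):
        # rough analogue of uniform_real_distribution(low, high); the exact
        # way libstdc++ maps raw integers to doubles differs in detail
        return low + (high - low) * (self.next_int() / float(self.MODULUS))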
self.wv.syn0_all = self.wv.syn0_all.reshape((num_vectors, dim))
assert self.wv.syn0_all.shape == (self.bucket + len(self.wv.vocab), self.vector_size), \
self.wv.syn0_ngrams = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
self.wv.syn0_ngrams = self.wv.syn0_ngrams.reshape((num_vectors, dim))
FastText wrapper training breaks here if I use min_count=2 with my own data, while the default min_count=5 still works.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-08cd9b292aba> in <module>()
----> 1 model.train('/home/rilut/fastText/fasttext', '/datadir/all.csv', model='skipgram', min_count=2)
/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in train(cls, ft_path, corpus_file, output_file, model, size, alpha, window, min_count, word_ngrams, loss, sample, negative, iter, min_n, max_n, sorted_vocab, threads)
221
222 output = utils.check_output(args=cmd)
--> 223 model = cls.load_fasttext_format(output_file)
224 cls.delete_training_files(output_file)
225 return model
/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_fasttext_format(cls, model_file, encoding)
249 model_file += '.bin'
250 model.file_name = model_file
--> 251 model.load_binary_data(encoding=encoding)
252 return model
253
/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_binary_data(self, encoding)
267 self.load_model_params(f)
268 self.load_dict(f, encoding=encoding)
--> 269 self.load_vectors(f)
270
271 def load_model_params(self, file_handle):
/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_vectors(self, file_handle)
349 self.num_original_vectors = num_vectors
350 self.wv.syn0_ngrams = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
--> 351 self.wv.syn0_ngrams = self.wv.syn0_ngrams.reshape((num_vectors, dim))
352 assert self.wv.syn0_ngrams.shape == (self.bucket + len(self.wv.vocab), self.vector_size), \
353 'mismatch between actual weight matrix shape {} and expected shape {}'.format(
ValueError: cannot reshape array of size 211096795 into shape (2121425,100)
Seems we need to modify num_vectors in the reshape.
Hey @rilut! Using the FastText wrapper with different values of the min_count parameter is working fine for me. Could you please share the exact code that is causing the problem in your case?
About the num_vectors value: the values of num_vectors and dim are actually read from the files generated by FastText's original C++ code, and thus the wrapper code uses the same values at the time of reshaping.
And apologies for the delayed response. I have been a little occupied recently.
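For reference, the load_vectors step that fails above boils down to "read the matrix dimensions from the .bin header, read num_vectors * dim floats, then reshape". A simplified sketch of that idea with explicit sanity checks (not the wrapper's exact code; the header integer width here is an assumption):

import numpy as np

def load_vectors_sketch(file_handle, expected_rows, dtype=np.float32):
    # the C++ code writes the matrix dimensions just before the float data
    num_vectors, dim = np.fromfile(file_handle, dtype=np.int64, count=2)
    flat = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
    if flat.size != num_vectors * dim:
        raise ValueError(
            'read %d floats but the header promised a %d x %d matrix; the file '
            'and the loader disagree about the binary format' % (flat.size, num_vectors, dim))
    if num_vectors != expected_rows:
        raise ValueError(
            'matrix has %d rows, expected bucket + vocab size = %d' % (num_vectors, expected_rows))
    return flat.reshape((num_vectors, dim))

A check like this makes the failure mode clearer than the bare reshape ValueError in the traceback above.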
@chinmayapancholi13 hey, what's the status here? When will you have time for the "full verification"?
Hi guys, I'd like to get this in ASAP, so people can start using it and provide feedback.
Hey @menshikh-iv @piskvorky! My apologies for the hiatus in the ongoing work. I have semester exams going on currently, so I haven't been able to devote much time to the PR in the last week. I am planning to resume working and complete the remaining work for fully verifying correctness (by replicating the RNG in Python) in about a week's time. Sorry again for the inconvenience, and thanks for your patience.
Good luck with your exams @chinmayapancholi13 👍
FYI @chinmayapancholi13, I'll merge it now (because it's done for the current stage and we need feedback from users). Very nice job @chinmayapancholi13 🔥
I'd like to experiment with Gensim and fastText. I'm not sure what the current implementation status is. At this point, the Python implementation mentioned in the notebook is not available via pip, right? This means that, for now, only the C++ wrapper is available in 2.3.0?
@Liebeck For 2.3.0, only the C++ wrapper; the current implementation will be available in the next version.
@Liebeck this functionality is now available in the latest Gensim version (3.0.0).
@Liebeck right now @manneshiva is optimizing our pure-Python version; very soon we'll have a very fast version. You can monitor progress in #1742 (it will be finished in two weeks, maybe faster).
This PR implements the FastText model (unsupervised version) in Gensim.