
[WIP] Adding unsupervised FastText to Gensim #1525

Merged
merged 35 commits into RaRe-Technologies:develop on Sep 19, 2017

Conversation

8 participants
@chinmayapancholi13
Contributor

chinmayapancholi13 commented Aug 8, 2017

This PR implements the FastText model (unsupervised version) in Gensim.

@souravsingh

Contributor

souravsingh commented Aug 10, 2017

There is a PR open here: #1482


    for indices in word2_indices:
        word2_subwords += ['<' + model.wv.index2word[indices] + '>']
        word2_subwords += Ft_Wrapper.compute_ngrams(model.wv.index2word[indices], model.min_n, model.max_n)
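The snippet above collects the whole-word token `<word>` plus its character n-grams. For readers unfamiliar with the scheme, the n-gram extraction can be sketched as follows (an illustrative re-implementation, not the gensim wrapper code itself):

```python
def compute_ngrams(word, min_n, max_n):
    """All character n-grams of the angle-bracketed word, FastText-style (sketch)."""
    extended = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

print(compute_ngrams('cat', 3, 4))  # → ['<ca', 'cat', 'at>', '<cat', 'cat>']
```

The full `<word>` token is appended separately in the loop above, which matches how the subword list is built here.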


@jayantj

jayantj Aug 11, 2017

Contributor

This works for now, but ideally we'd like a cleaner solution to this later on. In general, I think the FastText wrapper (to load .bin files) and the FastText training code implemented here shares a lot of common ground (both conceptually and code-wise). Once we have the correctness of the models verified, we'd be looking to refactor it somehow (maybe just inheriting from the wrapper? Completely removing train functionality from the wrapper and replacing it with native train functionality?) Any thoughts on this?


@chinmayapancholi13

chinmayapancholi13 Aug 13, 2017

Author Contributor

Agreed. The current implementation shares code with Gensim's FastText wrapper, so inheriting from the wrapper seems to be a good way to avoid this redundancy.
I think it would also be helpful to refactor the current Word2Vec implementation: apart from using ngram vectors rather than word vectors during backpropagation in FastText, the logic and code of the two models overlap significantly. Having one common parent class with the two models as children could be a useful way to tackle this.


@jayantj

jayantj Aug 14, 2017

Contributor

I agree that refactoring to avoid redundancy would be good. I'm not sure a common parent class is the way to go though, since most of the redundant code is in methods train_batch_cbow and train_batch_skipgram, which are both independently defined functions, and not methods of the Word2Vec class.


@chinmayapancholi13

chinmayapancholi13 Aug 15, 2017

Author Contributor

Apart from these training functions, there is overlap between the two models in some other tasks as well. For instance, in our FastText implementation we first construct the vocabulary the same way Word2Vec does (i.e. calling the scan_vocab, scale_vocab and finalize_vocab functions) and then handle all the "fasttext-specific" things (like constructing the dictionary of ngrams and precomputing & storing the ngrams for each word in the vocab). These "fasttext-specific" things could be handled at an earlier stage (e.g. within the scale_vocab or finalize_vocab functions), which would also help us optimize things, e.g. by avoiding iterating over the vocabulary several times unnecessarily.


@jayantj

jayantj Aug 15, 2017

Contributor

The super method is useful in such situations - where the parent class implementation of the method needs to be run along with whatever code is specific to the child class.
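As a toy illustration of that pattern (the class and attribute names below are made up, not the actual gensim hierarchy), the child runs the shared parent step via super and then its own specific work:

```python
class BaseModel(object):
    def finalize_vocab(self):
        # bookkeeping shared by both models
        self.vocab_finalized = True

class FastTextModel(BaseModel):
    def finalize_vocab(self):
        super(FastTextModel, self).finalize_vocab()  # run the shared step first
        # then the FastText-specific work, e.g. precomputing ngrams per vocab word
        self.ngrams_precomputed = True

m = FastTextModel()
m.finalize_vocab()
print(m.vocab_finalized, m.ngrams_precomputed)  # → True True
```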

    for indices in word2_indices:
        word2_subwords += ['<' + model.wv.index2word[indices] + '>']
        word2_subwords += Ft_Wrapper.compute_ngrams(model.wv.index2word[indices], model.min_n, model.max_n)
    word2_subwords = list(set(word2_subwords))


@jayantj

jayantj Aug 11, 2017

Contributor

I thought we changed this to no longer be a set.


@chinmayapancholi13

chinmayapancholi13 Aug 13, 2017

Author Contributor

That's correct. I have pushed those changes now.
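Worth noting: `set()` makes the subword order nondeterministic across runs. If deduplication were still wanted, an order-preserving variant (a generic sketch, not the code in this PR) avoids that:

```python
def unique_in_order(items):
    """Drop duplicates while keeping first-seen order, unlike set()."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

print(unique_in_order(['<cat>', 'cat', '<ca', 'cat']))  # → ['<cat>', 'cat', '<ca']
```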

    if context_locks is None:
        context_locks = model.syn0_all_lockf

    if word not in model.wv.vocab:


@jayantj

jayantj Aug 24, 2017

Contributor

Why is this necessary? Shouldn't word be necessarily present in vocab anyway?


@chinmayapancholi13

chinmayapancholi13 Aug 24, 2017

Author Contributor

Yes. That's correct. This was in the Word2Vec code as well so it slipped through I guess. Thanks for pointing this out!

@tmylk

Contributor

tmylk commented Aug 24, 2017

Please mention the slow runtime of Gensim's pure python version in the notebook. The exact time on Lee corpus for example.

@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Aug 25, 2017

Comparison between Gensim's native Python implementation and Facebook's original C++ code

Note:

  • The results here have been obtained after training models on the first 10 MB of text8 corpus with iter=2.
  • sg stands for Skipgram, hs stands for Hierarchical Softmax, neg stands for Negative Sampling and cbow stands for Continuous Bag of Words.
  • Implementation type Gensim refers to the Python code (to be added by this PR) and Wrapper refers to the wrapper (present in Gensim) for fastText's original C++ code.

We used mainly 3 functions to compare the two implementations:

  1. accuracy()

| Training mode | Semantic accuracy (Facebook) | Semantic accuracy (Gensim) | Syntactic accuracy (Facebook) | Syntactic accuracy (Gensim) |
| --- | --- | --- | --- | --- |
| sg, neg | 3.98% (83/2086) | 4.41% (92/2086) | 32.05% (2053/6405) | 36.80% (2357/6405) |
| sg, hs | 9.11% (190/2086) | 7.81% (163/2086) | 50.98% (3265/6405) | 49.99% (3202/6405) |
| cbow, neg | 1.49% (31/2086) | 2.25% (47/2086) | 22.53% (1443/6405) | 28.17% (1804/6405) |
| cbow, hs | 4.60% (96/2086) | 2.78% (58/2086) | 51.40% (3292/6405) | 47.84% (3064/6405) |
  2. evaluate_word_pairs()

| Training mode | Implementation | Pearson correlation coefficient | Spearman rank-order correlation coefficient |
| --- | --- | --- | --- |
| sg, neg | Wrapper | (0.33571938305084625, 2.7735449357626718e-09) | (correlation=0.3319501426263417, pvalue=4.2630631662495745e-09) |
| sg, neg | Gensim | (0.37160584013854336, 3.4314357501294739e-11) | (correlation=0.37255638854484907, pvalue=3.0313995113129397e-11) |
| sg, hs | Wrapper | (0.43164118498657683, 5.9191607832804485e-15) | (correlation=0.43202508957275548, pvalue=5.5678855620952807e-15) |
| sg, hs | Gensim | (0.44120623358358979, 1.2593563529334038e-15) | (correlation=0.43432666642888956, pvalue=3.8520501516355618e-15) |
| cbow, neg | Wrapper | (0.30456273736238976, 8.1657874241078728e-08) | (correlation=0.31730267261747791, pvalue=2.1454825237853816e-08) |
| cbow, neg | Gensim | (0.30577996094983406, 7.2064020538484688e-08) | (correlation=0.32652507815616916, pvalue=7.8345327821526968e-09) |
| cbow, hs | Wrapper | (0.43941016642738401, 1.690337016383758e-15) | (correlation=0.45312926376171292, pvalue=1.706860830188868e-16) |
| cbow, hs | Gensim | (0.37971140770563433, 1.1769482912504567e-11) | (correlation=0.37195803751309486, pvalue=3.2775570056585581e-11) |
  3. most_similar()

For the mode (cbow , neg), on retrieving the top-10 most similar words for the word night we get:

  • Gensim output:
    [(u'midnight', 0.9369428753852844), (u'knight', 0.906793475151062), (u'dwight', 0.8935667276382446), (u'tight', 0.8830252885818481), (u'bright', 0.8659111261367798), (u'tonight', 0.8634316325187683), (u'wight', 0.8603839874267578), (u'tamazight', 0.8542954921722412), (u'nightclubs', 0.8528885245323181), (u'deck', 0.850151002407074)]

  • Wrapper output:
    [(u'midnight', 0.9334157109260559), (u'knight', 0.9281861782073975), (u'dwight', 0.9247788190841675), (u'upright', 0.900949239730835), (u'tight', 0.896233081817627), (u'bright', 0.8925803899765015), (u'wight', 0.8893818259239197), (u'nightingale', 0.8800072073936462), (u'nightclubs', 0.8782978653907776), (u'tamazight', 0.8743830919265747)]

Overlapping words: [ u'midnight', u'knight', u'dwight', u'tight', u'bright', u'wight', u'tamazight', u'nightclubs']
(8 out of 10)

Similar results (an overlap of around 7 or 8 words) were obtained for the other modes as well.
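The overlap above can be reproduced with a small helper (the two word lists are copied from the outputs above):

```python
def top_k_overlap(list_a, list_b):
    """Words appearing in both ranked most_similar() lists, in list_a's order."""
    b = set(list_b)
    return [w for w in list_a if w in b]

gensim_top = ['midnight', 'knight', 'dwight', 'tight', 'bright', 'tonight',
              'wight', 'tamazight', 'nightclubs', 'deck']
wrapper_top = ['midnight', 'knight', 'dwight', 'upright', 'tight', 'bright',
               'wight', 'nightingale', 'nightclubs', 'tamazight']

shared = top_k_overlap(gensim_top, wrapper_top)
print(len(shared))  # → 8
```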

cc: @piskvorky @gojomo

    else:
        new_vocab_len = len(self.wv.vocab)
        for ngram, idx in self.wv.hash2index.items():
            self.wv.hash2index[ngram] = idx + new_vocab_len - self.old_vocab_len


@jayantj

jayantj Aug 26, 2017

Contributor

I think it might be a good idea to have two separate matrices, one for storing the vectors for the <word> tokens, and one for the subwords. Along with renaming the variable to more intuitive names (we don't really need to follow the syn0 syn1 nomenclature here), that should make code much cleaner.

It also makes resizing much easier.
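A minimal sketch of that two-matrix layout (variable names and sizes are illustrative, not the PR's actual code):

```python
import numpy as np

dim = 8
vocab_size, num_buckets = 5, 1000

# one matrix row per <word> token, and a separate one per hashed subword bucket
vectors_vocab = np.zeros((vocab_size, dim), dtype=np.float32)
vectors_ngrams = np.zeros((num_buckets, dim), dtype=np.float32)

# adding 3 new vocab words only grows the word matrix; the ngram matrix
# stays fixed, so no hash2index remapping is needed when the vocab resizes
vectors_vocab = np.vstack([vectors_vocab, np.zeros((3, dim), dtype=np.float32)])
print(vectors_vocab.shape, vectors_ngrams.shape)  # → (8, 8) (1000, 8)
```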


@chinmayapancholi13

chinmayapancholi13 Aug 29, 2017

Author Contributor

Done

@piskvorky

Member

piskvorky commented Aug 30, 2017

@chinmayapancholi13 thanks. What are the differences due to? Why is upright missing from the results, why are there such large swings in the accuracy?

@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Aug 30, 2017

@piskvorky There still remain some differences in terms of the "randomness" at training time between the C++ and Python implementations. These include initialization of the ngram-vector matrices, choosing which words are downsampled, choosing a reduced window size, randomization in the negative-sampling (randomly choosing negative words) and hierarchical-softmax (tree) specific segments of code, and multithreading (worker threads > 1 for the results shown above).

So as far as the difference in accuracy values is concerned, the difference could be due to the relatively small size of corpus (10 MB) used here and the above listed sources for randomness.

The values become closer while using a 100 MB corpus for training as can be seen below:

| Training mode | Implementation | Semantic accuracy | Syntactic accuracy |
| --- | --- | --- | --- |
| sg, neg | Wrapper | 4.82% | 57.86% |
| sg, neg | Gensim | 5.95% | 59.83% |
| sg, hs | Wrapper | 12.99% | 60.89% |
| sg, hs | Gensim | 13.16% | 60.18% |
| cbow, neg | Wrapper | 3.73% | 62.82% |
| cbow, neg | Gensim | 4.19% | 64.61% |
| cbow, hs | Wrapper | 10.14% | 63.92% |
| cbow, hs | Gensim | 7.99% | 64.97% |
@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Aug 30, 2017

@piskvorky And about the point regarding "not-all-top-10-words-matching", that too seems to be because of the above-mentioned reasons. On training the model with the 100 MB corpus, the top-10 words for the (cbow, neg) model become the same, as can be seen here:

Gensim:
[(u'midnight', 0.9214520454406738), (u'nightjar', 0.8952612280845642), (u'tonight', 0.8734667897224426), (u'nighthawk', 0.8727679252624512), (u'nightbreed', 0.8692173361778259), (u'nightfall', 0.8459283709526062), (u'nightmare', 0.8459077477455139), (u'nighttime', 0.8353838920593262), (u'mcknight', 0.8227508068084717), (u'nightjars', 0.8224337697029114)]

Wrapper:
[(u'midnight', 0.9323179721832275), (u'nightjar', 0.9195586442947388), (u'nighthawk', 0.8968080282211304), (u'nightfall', 0.8818791508674622), (u'mcknight', 0.8758728504180908), (u'nightbreed', 0.8738420009613037), (u'tonight', 0.8719567656517029), (u'nightmare', 0.857421875), (u'nightjars', 0.8562690019607544), (u'nighttime', 0.8551853895187378)]

Overlap:
set([u'tonight', u'nightjar', u'nighttime', u'nightmare', u'midnight', u'nighthawk', u'mcknight', u'nightbreed', u'nightfall', u'nightjars'])

It is very likely that a word which is in the top-10 list for one implementation but not the other is still in the top-15 or top-20 (say) for the implementation it is missing from. :)

@menshikh-iv

Member

menshikh-iv commented Aug 30, 2017

LGTM, very nice job @chinmayapancholi13 🔥 👍
What do you think @piskvorky?

@piskvorky

Member

piskvorky commented Aug 31, 2017

Thanks @chinmayapancholi13. Most of these items seem RNG-related -- what RNG does the original code use? Is there any way we can simply replicate it, use the same seed and thus get the same random numbers? (At least for testing, so performance is irrelevant here.)

@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Sep 3, 2017

@piskvorky The original C++ code uses minstd_rand along with uniform_real_distribution to generate the random values used at the various instances in the code (like the reduced window size). If we want the results to be even closer, we could try to emulate this RNG and use it in our code too.

Also, there was some previous discussion on #1482 about comparing the outputs of our implementation with the original C++ code so that the results would be very close in observable quality rather than numerically identical. I wanted to know whether the way the comparison has been done so far (using the 3 functions accuracy(), evaluate_word_pairs() and most_similar()) seems like the right direction to you?
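For reference, std::minstd_rand is a linear congruential generator (multiplier 48271, modulus 2^31 - 1), so its raw integer stream is straightforward to replicate in Python. Reproducing uniform_real_distribution exactly would need extra care, since its mapping from integers to reals is implementation-defined. A sketch:

```python
class MinstdRand(object):
    """Pure-Python replica of C++ std::minstd_rand: x' = 48271 * x mod (2**31 - 1)."""
    MODULUS = 2 ** 31 - 1

    def __init__(self, seed=1):
        self.state = seed

    def next(self):
        self.state = (48271 * self.state) % self.MODULUS
        return self.state

rng = MinstdRand(seed=1)
print(rng.next(), rng.next())  # → 48271 182605794
```

With the default seed of 1, the C++ engine also produces 48271 as its first output, which gives an easy cross-check against the original code.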

@piskvorky

Member

piskvorky commented Sep 3, 2017

@chinmayapancholi13 thanks for investigating! I appreciate the thoroughness.

We should go for approximation only if there's no other way -- that's why I'm asking about the RNG. Is it difficult to use the exact same RNG directly from Python?

If we can replicate the original RNG (at least for testing), we don't need any "very close in observable quality" (which is always questionable -- is 2 % "very close" or not?). Instead, we can go for identical, ± some numeric rounding error.

If it's not possible to use the original RNG, we could shrug the differences away. But I'd much prefer to start on the right foot, with a verifiable and verified algo, before commencing optimizations.

@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Sep 5, 2017

@piskvorky I agree. We should indeed try to verify the model correctness fully before moving on to doing optimizations and replicating this RNG and incorporating it in the Python code should help to get closer results. I'll try to do this then and give updates if I hit any blockers.

chinmayapancholi13 added some commits Sep 7, 2017

self.wv.syn0_all = self.wv.syn0_all.reshape((num_vectors, dim))
assert self.wv.syn0_all.shape == (self.bucket + len(self.wv.vocab), self.vector_size), \
self.wv.syn0_ngrams = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
self.wv.syn0_ngrams = self.wv.syn0_ngrams.reshape((num_vectors, dim))


@rilut

rilut Sep 8, 2017

Fasttext_wrapper training breaks here if I use min_count=2 with my own data, while the default min_count=5 still works.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-08cd9b292aba> in <module>()
----> 1 model.train('/home/rilut/fastText/fasttext', '/datadir/all.csv', model='skipgram', min_count=2)

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in train(cls, ft_path, corpus_file, output_file, model, size, alpha, window, min_count, word_ngrams, loss, sample, negative, iter, min_n, max_n, sorted_vocab, threads)
    221
    222         output = utils.check_output(args=cmd)
--> 223         model = cls.load_fasttext_format(output_file)
    224         cls.delete_training_files(output_file)
    225         return model

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_fasttext_format(cls, model_file, encoding)
    249             model_file += '.bin'
    250         model.file_name = model_file
--> 251         model.load_binary_data(encoding=encoding)
    252         return model
    253

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_binary_data(self, encoding)
    267             self.load_model_params(f)
    268             self.load_dict(f, encoding=encoding)
--> 269             self.load_vectors(f)
    270
    271     def load_model_params(self, file_handle):

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_vectors(self, file_handle)
    349         self.num_original_vectors = num_vectors
    350         self.wv.syn0_ngrams = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
--> 351         self.wv.syn0_ngrams = self.wv.syn0_ngrams.reshape((num_vectors, dim))
    352         assert self.wv.syn0_ngrams.shape == (self.bucket + len(self.wv.vocab), self.vector_size), \
    353             'mismatch between actual weight matrix shape {} and expected shape {}'.format(

ValueError: cannot reshape array of size 211096795 into shape (2121425,100)

Seems we need to modify the num_vectors in reshape


@chinmayapancholi13

chinmayapancholi13 Sep 19, 2017

Author Contributor

Hey @rilut! Using the FastText wrapper with different values of the min_count parameter works fine for me. Could you please share the exact code that causes the problem in your case?
Regarding the num_vectors value: the values of num_vectors and dim are read from the files generated by fastText's original C++ code, so the wrapper uses those same values when reshaping.
And apologies for the delayed response. I have been a little occupied recently.
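One observation on the traceback above: 211096795 is not divisible by 100, so the flat array genuinely cannot match the header's promised shape, suggesting a truncated or mismatched .bin file. A defensive loading step (a sketch; the real wrapper reads num_vectors and dim from the file header) would validate the element count before reshaping, turning the cryptic reshape error into a clearer message:

```python
import numpy as np

def reshape_vectors(flat, num_vectors, dim):
    """Validate the element count read from the file before reshaping (sketch)."""
    expected = num_vectors * dim
    if flat.size != expected:
        raise ValueError(
            'read %d floats from file, but header promises %d (%d x %d); '
            'the file may be truncated or from an incompatible fastText version'
            % (flat.size, expected, num_vectors, dim))
    return flat.reshape((num_vectors, dim))

matrix = reshape_vectors(np.zeros(12, dtype=np.float32), 3, 4)
print(matrix.shape)  # → (3, 4)
```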

@menshikh-iv

Member

menshikh-iv commented Sep 14, 2017

@chinmayapancholi13 hey, what's the status here? When will you have time for the "full verification"?

@piskvorky

Member

piskvorky commented Sep 14, 2017

Hi guys, I'd like to get this in ASAP, so people can start using it and provide feedback.
Even if the implementation is slow for now -- once we're good with the API and correctness, the optimizations should be straightforward.

@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Sep 14, 2017

Hey @menshikh-iv @piskvorky! My apologies for the hiatus in the ongoing work. I have semester exams going on currently and so haven't been able to devote much time to the PR in the last week. I am planning to resume working and complete the work remaining for verifying correctness fully (by replicating the RNG in Python) in about a week's time. Sorry again for the inconvenience and thanks for your patience.

@piskvorky

Member

piskvorky commented Sep 14, 2017

Good luck with your exams @chinmayapancholi13 👍

@menshikh-iv

Member

menshikh-iv commented Sep 19, 2017

FYI @chinmayapancholi13, I'll merge it now (because it's done for the current stage and we need feedback from users).
For the next changes (more RNG control, cythonization) you should create a new PR.

Very nice job @chinmayapancholi13 🔥

@menshikh-iv menshikh-iv merged commit 6e51156 into RaRe-Technologies:develop Sep 19, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@Liebeck


Liebeck commented Sep 21, 2017

I'd like to experiment with Gensim and fasttext. I'm not sure what the current implementation status is.

At this point, the Python implementation mentioned in the notebook is not available via pip, right? This means that, for now, only the C++ wrapper is available in 2.3.0?

@menshikh-iv

Member

menshikh-iv commented Sep 21, 2017

@Liebeck For 2.3.0, only the C++ wrapper; in the next version, the current implementation will be available.
If you want to start your experiments immediately, you can install gensim from the develop branch.

@menshikh-iv

Member

menshikh-iv commented Oct 2, 2017

@Liebeck this functionality is now available in the latest gensim version (3.0.0).

@menshikh-iv

Member

menshikh-iv commented Nov 28, 2017

@Liebeck right now @manneshiva is optimizing our pure-Python version; very soon we'll have a very fast version. You can monitor progress in #1742 (it will be finished in two weeks, maybe faster).
