
[WIP] Adding unsupervised FastText to Gensim #1525

Merged
merged 35 commits into RaRe-Technologies:develop on Sep 19, 2017

Conversation

8 participants
@chinmayapancholi13
Contributor

chinmayapancholi13 commented Aug 8, 2017

This PR implements the FastText model (unsupervised version) in Gensim.

@souravsingh

Contributor

souravsingh commented Aug 10, 2017

There is a PR open here: #1482


    for indices in word2_indices:
        word2_subwords += ['<' + model.wv.index2word[indices] + '>']
        word2_subwords += Ft_Wrapper.compute_ngrams(model.wv.index2word[indices], model.min_n, model.max_n)
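The snippet above collects the whole-word token `<word>` plus its character n-grams. For readers unfamiliar with the scheme, the n-gram extraction can be sketched as follows (an illustrative re-implementation, not the gensim wrapper code itself):

```python
def compute_ngrams(word, min_n, max_n):
    """All character n-grams of the angle-bracketed word, FastText-style (sketch)."""
    extended = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

print(compute_ngrams('cat', 3, 4))  # → ['<ca', 'cat', 'at>', '<cat', 'cat>']
```

The full `<word>` token is appended separately in the loop above, which matches how the subword list is built here.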


@jayantj

jayantj Aug 11, 2017

Contributor

This works for now, but ideally we'd like a cleaner solution to this later on. In general, I think the FastText wrapper (to load .bin files) and the FastText training code implemented here shares a lot of common ground (both conceptually and code-wise). Once we have the correctness of the models verified, we'd be looking to refactor it somehow (maybe just inheriting from the wrapper? Completely removing train functionality from the wrapper and replacing it with native train functionality?) Any thoughts on this?


@chinmayapancholi13

chinmayapancholi13 Aug 13, 2017

Author Contributor

Agreed. The current implementation shares code with Gensim's FastText wrapper, so inheriting from the wrapper seems to be a good way to avoid this redundancy.
I think it would also be helpful to refactor the current Word2Vec implementation: apart from using ngram vectors rather than word vectors during backpropagation in FastText, the logic and code of the two models overlap significantly. Having one common parent class with the two models as children could be a useful way to tackle this.


@jayantj

jayantj Aug 14, 2017

Contributor

I agree that refactoring to avoid redundancy would be good. I'm not sure a common parent class is the way to go though, since most of the redundant code is in methods train_batch_cbow and train_batch_skipgram, which are both independently defined functions, and not methods of the Word2Vec class.


@chinmayapancholi13

chinmayapancholi13 Aug 15, 2017

Author Contributor

Apart from these training functions, there is overlap between the two models in some other tasks as well. For instance, in our FastText implementation we first construct the vocabulary the same way Word2Vec does (i.e. calling the scan_vocab, scale_vocab and finalize_vocab functions) and then handle all the "fasttext-specific" things (like constructing the dictionary of ngrams and precomputing & storing the ngrams for each word in the vocab). These "fasttext-specific" things could be handled at an earlier stage (e.g. within the scale_vocab or finalize_vocab functions), which would also help us optimize things, e.g. by avoiding iterating over the vocabulary several times unnecessarily.


@jayantj

jayantj Aug 15, 2017

Contributor

The super method is useful in such situations - where the parent class implementation of the method needs to be run along with whatever code is specific to the child class.
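As a toy illustration of that pattern (the class and attribute names below are made up, not the actual gensim hierarchy), the child runs the shared parent step via super and then its own specific work:

```python
class BaseModel(object):
    def finalize_vocab(self):
        # bookkeeping shared by both models
        self.vocab_finalized = True

class FastTextModel(BaseModel):
    def finalize_vocab(self):
        super(FastTextModel, self).finalize_vocab()  # run the shared step first
        # then the FastText-specific work, e.g. precomputing ngrams per vocab word
        self.ngrams_precomputed = True

m = FastTextModel()
m.finalize_vocab()
print(m.vocab_finalized, m.ngrams_precomputed)  # → True True
```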

    for indices in word2_indices:
        word2_subwords += ['<' + model.wv.index2word[indices] + '>']
        word2_subwords += Ft_Wrapper.compute_ngrams(model.wv.index2word[indices], model.min_n, model.max_n)
    word2_subwords = list(set(word2_subwords))


@jayantj

jayantj Aug 11, 2017

Contributor

I thought we changed this to no longer be a set.


@chinmayapancholi13

chinmayapancholi13 Aug 13, 2017

Author Contributor

That's correct. I have pushed those changes now.
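Worth noting: `set()` makes the subword order nondeterministic across runs. If deduplication were still wanted, an order-preserving variant (a generic sketch, not the code in this PR) avoids that:

```python
def unique_in_order(items):
    """Drop duplicates while keeping first-seen order, unlike set()."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

print(unique_in_order(['<cat>', 'cat', '<ca', 'cat']))  # → ['<cat>', 'cat', '<ca']
```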

    if context_locks is None:
        context_locks = model.syn0_all_lockf

    if word not in model.wv.vocab:


@jayantj

jayantj Aug 24, 2017

Contributor

Why is this necessary? Shouldn't word be necessarily present in vocab anyway?


@chinmayapancholi13

chinmayapancholi13 Aug 24, 2017

Author Contributor

Yes. That's correct. This was in the Word2Vec code as well so it slipped through I guess. Thanks for pointing this out!

@tmylk

Contributor

tmylk commented Aug 24, 2017

Please mention the slow runtime of Gensim's pure python version in the notebook. The exact time on Lee corpus for example.

@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Aug 25, 2017

Comparison between Gensim's native Python implementation and Facebook's original C++ code

Note:

  • The results here have been obtained after training models on the first 10 MB of text8 corpus with iter=2.
  • sg stands for Skipgram, hs stands for Hierarchical Softmax, neg stands for Negative Sampling and cbow stands for Continuous Bag of Words.
  • Implementation type Gensim refers to the Python code (to be added by this PR) and Wrapper refers to the wrapper (present in Gensim) for fastText's original C++ code.

We used mainly 3 functions to compare the two implementations:

  1. accuracy()

| Training mode | Semantic accuracy (Facebook) | Semantic accuracy (Gensim) | Syntactic accuracy (Facebook) | Syntactic accuracy (Gensim) |
| --- | --- | --- | --- | --- |
| sg, neg | 3.98% (83/2086) | 4.41% (92/2086) | 32.05% (2053/6405) | 36.80% (2357/6405) |
| sg, hs | 9.11% (190/2086) | 7.81% (163/2086) | 50.98% (3265/6405) | 49.99% (3202/6405) |
| cbow, neg | 1.49% (31/2086) | 2.25% (47/2086) | 22.53% (1443/6405) | 28.17% (1804/6405) |
| cbow, hs | 4.60% (96/2086) | 2.78% (58/2086) | 51.40% (3292/6405) | 47.84% (3064/6405) |
  2. evaluate_word_pairs()

| Training mode | Implementation | Pearson correlation coefficient | Spearman rank-order correlation coefficient |
| --- | --- | --- | --- |
| sg, neg | Wrapper | (0.33571938305084625, 2.7735449357626718e-09) | (correlation=0.3319501426263417, pvalue=4.2630631662495745e-09) |
| sg, neg | Gensim | (0.37160584013854336, 3.4314357501294739e-11) | (correlation=0.37255638854484907, pvalue=3.0313995113129397e-11) |
| sg, hs | Wrapper | (0.43164118498657683, 5.9191607832804485e-15) | (correlation=0.43202508957275548, pvalue=5.5678855620952807e-15) |
| sg, hs | Gensim | (0.44120623358358979, 1.2593563529334038e-15) | (correlation=0.43432666642888956, pvalue=3.8520501516355618e-15) |
| cbow, neg | Wrapper | (0.30456273736238976, 8.1657874241078728e-08) | (correlation=0.31730267261747791, pvalue=2.1454825237853816e-08) |
| cbow, neg | Gensim | (0.30577996094983406, 7.2064020538484688e-08) | (correlation=0.32652507815616916, pvalue=7.8345327821526968e-09) |
| cbow, hs | Wrapper | (0.43941016642738401, 1.690337016383758e-15) | (correlation=0.45312926376171292, pvalue=1.706860830188868e-16) |
| cbow, hs | Gensim | (0.37971140770563433, 1.1769482912504567e-11) | (correlation=0.37195803751309486, pvalue=3.2775570056585581e-11) |
  3. most_similar()

For the mode (cbow , neg), on retrieving the top-10 most similar words for the word night we get:

  • Gensim output:
    [(u'midnight', 0.9369428753852844), (u'knight', 0.906793475151062), (u'dwight', 0.8935667276382446), (u'tight', 0.8830252885818481), (u'bright', 0.8659111261367798), (u'tonight', 0.8634316325187683), (u'wight', 0.8603839874267578), (u'tamazight', 0.8542954921722412), (u'nightclubs', 0.8528885245323181), (u'deck', 0.850151002407074)]

  • Wrapper output:
    [(u'midnight', 0.9334157109260559), (u'knight', 0.9281861782073975), (u'dwight', 0.9247788190841675), (u'upright', 0.900949239730835), (u'tight', 0.896233081817627), (u'bright', 0.8925803899765015), (u'wight', 0.8893818259239197), (u'nightingale', 0.8800072073936462), (u'nightclubs', 0.8782978653907776), (u'tamazight', 0.8743830919265747)]

Overlapping words: [ u'midnight', u'knight', u'dwight', u'tight', u'bright', u'wight', u'tamazight', u'nightclubs']
(8 out of 10)

Similar results (an overlap of around 7 or 8 words) were obtained for the other modes as well.
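The overlap above can be reproduced with a small helper (the two word lists are copied from the outputs above):

```python
def top_k_overlap(list_a, list_b):
    """Words appearing in both ranked most_similar() lists, in list_a's order."""
    b = set(list_b)
    return [w for w in list_a if w in b]

gensim_top = ['midnight', 'knight', 'dwight', 'tight', 'bright', 'tonight',
              'wight', 'tamazight', 'nightclubs', 'deck']
wrapper_top = ['midnight', 'knight', 'dwight', 'upright', 'tight', 'bright',
               'wight', 'nightingale', 'nightclubs', 'tamazight']

shared = top_k_overlap(gensim_top, wrapper_top)
print(len(shared))  # → 8
```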

cc: @piskvorky @gojomo

    else:
        new_vocab_len = len(self.wv.vocab)
        for ngram, idx in self.wv.hash2index.items():
            self.wv.hash2index[ngram] = idx + new_vocab_len - self.old_vocab_len


@jayantj

jayantj Aug 26, 2017

Contributor

I think it might be a good idea to have two separate matrices, one for storing the vectors for the <word> tokens, and one for the subwords. Along with renaming the variable to more intuitive names (we don't really need to follow the syn0 syn1 nomenclature here), that should make code much cleaner.

It also makes resizing much easier.
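A minimal sketch of that two-matrix layout (variable names and sizes are illustrative, not the PR's actual code):

```python
import numpy as np

dim = 8
vocab_size, num_buckets = 5, 1000

# one matrix row per <word> token, and a separate one per hashed subword bucket
vectors_vocab = np.zeros((vocab_size, dim), dtype=np.float32)
vectors_ngrams = np.zeros((num_buckets, dim), dtype=np.float32)

# adding 3 new vocab words only grows the word matrix; the ngram matrix
# stays fixed, so no hash2index remapping is needed when the vocab resizes
vectors_vocab = np.vstack([vectors_vocab, np.zeros((3, dim), dtype=np.float32)])
print(vectors_vocab.shape, vectors_ngrams.shape)  # → (8, 8) (1000, 8)
```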


@chinmayapancholi13

chinmayapancholi13 Aug 29, 2017

Author Contributor

Done

@piskvorky

Member

piskvorky commented Aug 30, 2017

@chinmayapancholi13 thanks. What are the differences due to? Why is upright missing from the results, why are there such large swings in the accuracy?

@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Aug 30, 2017

@piskvorky There still remain some differences in terms of the "randomness" at training time between the C++ and Python implementations. These include initialization of the ngram-vector matrices, choosing which words are downsampled, choosing a reduced window size, randomization in the negative-sampling (randomly choosing negative words) and hierarchical-softmax (tree) specific segments of code, and multithreading (worker threads > 1 for the results shown above).

So as far as the difference in accuracy values is concerned, the difference could be due to the relatively small size of corpus (10 MB) used here and the above listed sources for randomness.

The values become closer while using a 100 MB corpus for training as can be seen below:

| Training mode | Implementation | Semantic accuracy | Syntactic accuracy |
| --- | --- | --- | --- |
| sg, neg | Wrapper | 4.82% | 57.86% |
| sg, neg | Gensim | 5.95% | 59.83% |
| sg, hs | Wrapper | 12.99% | 60.89% |
| sg, hs | Gensim | 13.16% | 60.18% |
| cbow, neg | Wrapper | 3.73% | 62.82% |
| cbow, neg | Gensim | 4.19% | 64.61% |
| cbow, hs | Wrapper | 10.14% | 63.92% |
| cbow, hs | Gensim | 7.99% | 64.97% |
@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Aug 30, 2017

@piskvorky And about the point regarding "not-all-top-10-words-matching", that too seems to be because of the above-mentioned reasons. On training the model with the 100 MB corpus, the top-10 words for the (cbow, neg) model become the same, as can be seen here:

Gensim:
[(u'midnight', 0.9214520454406738), (u'nightjar', 0.8952612280845642), (u'tonight', 0.8734667897224426), (u'nighthawk', 0.8727679252624512), (u'nightbreed', 0.8692173361778259), (u'nightfall', 0.8459283709526062), (u'nightmare', 0.8459077477455139), (u'nighttime', 0.8353838920593262), (u'mcknight', 0.8227508068084717), (u'nightjars', 0.8224337697029114)]

Wrapper:
[(u'midnight', 0.9323179721832275), (u'nightjar', 0.9195586442947388), (u'nighthawk', 0.8968080282211304), (u'nightfall', 0.8818791508674622), (u'mcknight', 0.8758728504180908), (u'nightbreed', 0.8738420009613037), (u'tonight', 0.8719567656517029), (u'nightmare', 0.857421875), (u'nightjars', 0.8562690019607544), (u'nighttime', 0.8551853895187378)]

Overlap:
set([u'tonight', u'nightjar', u'nighttime', u'nightmare', u'midnight', u'nighthawk', u'mcknight', u'nightbreed', u'nightfall', u'nightjars'])

It is very likely that a word which is in the top-10 list for one implementation but not the other is still in the top-15 or top-20 (say) for the implementation it is missing from. :)

@menshikh-iv

Member

menshikh-iv commented Aug 30, 2017

LGTM, very nice job @chinmayapancholi13 🔥 👍
What do you think @piskvorky?

@piskvorky

Member

piskvorky commented Aug 31, 2017

Thanks @chinmayapancholi13. Most of these items seem RNG-related -- what RNG does the original code use? Is there any way we can simply replicate it, use the same seed and thus get the same random numbers? (At least for testing, so performance is irrelevant here.)

@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Sep 3, 2017

@piskvorky The original C++ code uses minstd_rand along with uniform_real_distribution to generate the random values used at the various instances in the code (like the reduced window size). If we want the results to be even closer, we could try to emulate this RNG and use it in our code too.

Also, there was some previous discussion on #1482 about comparing the outputs of our implementation with the original C++ code so that the results would be very close in observable quality rather than numerically identical. I wanted to know whether the way the comparison has been done so far (using the 3 functions accuracy(), evaluate_word_pairs() and most_similar()) seems like the right direction to you?
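For reference, std::minstd_rand is a linear congruential generator (multiplier 48271, modulus 2^31 - 1), so its raw integer stream is straightforward to replicate in Python. Reproducing uniform_real_distribution exactly would need extra care, since its mapping from integers to reals is implementation-defined. A sketch:

```python
class MinstdRand(object):
    """Pure-Python replica of C++ std::minstd_rand: x' = 48271 * x mod (2**31 - 1)."""
    MODULUS = 2 ** 31 - 1

    def __init__(self, seed=1):
        self.state = seed

    def next(self):
        self.state = (48271 * self.state) % self.MODULUS
        return self.state

rng = MinstdRand(seed=1)
print(rng.next(), rng.next())  # → 48271 182605794
```

With the default seed of 1, the C++ engine also produces 48271 as its first output, which gives an easy cross-check against the original code.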

@piskvorky

Member

piskvorky commented Sep 3, 2017

@chinmayapancholi13 thanks for investigating! I appreciate the thoroughness.

We should go for approximation only if there's no other way -- that's why I'm asking about the RNG. Is it difficult to use the exact same RNG directly from Python?

If we can replicate the original RNG (at least for testing), we don't need any "very close in observable quality" (which is always questionable -- is 2 % "very close" or not?). Instead, we can go for identical, ± some numeric rounding error.

If it's not possible to use the original RNG, we could shrug the differences away. But I'd much prefer to start on the right foot, with a verifiable and verified algo, before commencing optimizations.

@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Sep 5, 2017

@piskvorky I agree. We should indeed try to verify the model correctness fully before moving on to doing optimizations and replicating this RNG and incorporating it in the Python code should help to get closer results. I'll try to do this then and give updates if I hit any blockers.

chinmayapancholi13 added some commits Sep 7, 2017

self.wv.syn0_all = self.wv.syn0_all.reshape((num_vectors, dim))
assert self.wv.syn0_all.shape == (self.bucket + len(self.wv.vocab), self.vector_size), \
self.wv.syn0_ngrams = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
self.wv.syn0_ngrams = self.wv.syn0_ngrams.reshape((num_vectors, dim))


@rilut

rilut Sep 8, 2017

Fasttext_wrapper training breaks here if I use min_count=2 with my own data, while the default min_count=5 still works.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-08cd9b292aba> in <module>()
----> 1 model.train('/home/rilut/fastText/fasttext', '/datadir/all.csv', model='skipgram', min_count=2)

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in train(cls, ft_path, corpus_file, output_file, model, size, alpha, window, min_count, word_ngrams, loss, sample, negative, iter, min_n, max_n, sorted_vocab, threads)
    221
    222         output = utils.check_output(args=cmd)
--> 223         model = cls.load_fasttext_format(output_file)
    224         cls.delete_training_files(output_file)
    225         return model

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_fasttext_format(cls, model_file, encoding)
    249             model_file += '.bin'
    250         model.file_name = model_file
--> 251         model.load_binary_data(encoding=encoding)
    252         return model
    253

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_binary_data(self, encoding)
    267             self.load_model_params(f)
    268             self.load_dict(f, encoding=encoding)
--> 269             self.load_vectors(f)
    270
    271     def load_model_params(self, file_handle):

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_vectors(self, file_handle)
    349         self.num_original_vectors = num_vectors
    350         self.wv.syn0_ngrams = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
--> 351         self.wv.syn0_ngrams = self.wv.syn0_ngrams.reshape((num_vectors, dim))
    352         assert self.wv.syn0_ngrams.shape == (self.bucket + len(self.wv.vocab), self.vector_size), \
    353             'mismatch between actual weight matrix shape {} and expected shape {}'.format(

ValueError: cannot reshape array of size 211096795 into shape (2121425,100)

Seems we need to modify the num_vectors in reshape


@chinmayapancholi13

chinmayapancholi13 Sep 19, 2017

Author Contributor

Hey @rilut! Using the FastText wrapper with different values of the min_count parameter works fine for me. Could you please share the exact code that causes the problem in your case?
Regarding the num_vectors value: the values of num_vectors and dim are read from the files generated by fastText's original C++ code, so the wrapper uses those same values when reshaping.
And apologies for the delayed response. I have been a little occupied recently.
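One observation on the traceback above: 211096795 is not divisible by 100, so the flat array genuinely cannot match the header's promised shape, suggesting a truncated or mismatched .bin file. A defensive loading step (a sketch; the real wrapper reads num_vectors and dim from the file header) would validate the element count before reshaping, turning the cryptic reshape error into a clearer message:

```python
import numpy as np

def reshape_vectors(flat, num_vectors, dim):
    """Validate the element count read from the file before reshaping (sketch)."""
    expected = num_vectors * dim
    if flat.size != expected:
        raise ValueError(
            'read %d floats from file, but header promises %d (%d x %d); '
            'the file may be truncated or from an incompatible fastText version'
            % (flat.size, expected, num_vectors, dim))
    return flat.reshape((num_vectors, dim))

matrix = reshape_vectors(np.zeros(12, dtype=np.float32), 3, 4)
print(matrix.shape)  # → (3, 4)
```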

@menshikh-iv

Member

menshikh-iv commented Sep 14, 2017

@chinmayapancholi13 hey, what's the status here? When will you have time for the "full verification"?

@piskvorky

Member

piskvorky commented Sep 14, 2017

Hi guys, I'd like to get this in ASAP, so people can start using it and provide feedback.
Even if the implementation is slow for now -- once we're good with the API and correctness, the optimizations should be straightforward.

@chinmayapancholi13

Contributor Author

chinmayapancholi13 commented Sep 14, 2017

Hey @menshikh-iv @piskvorky! My apologies for the hiatus in the ongoing work. I have semester exams going on currently and so haven't been able to devote much time to the PR in the last week. I am planning to resume working and complete the work remaining for verifying correctness fully (by replicating the RNG in Python) in about a week's time. Sorry again for the inconvenience and thanks for your patience.

@piskvorky

Member

piskvorky commented Sep 14, 2017

Good luck with your exams @chinmayapancholi13 👍

@menshikh-iv

Member

menshikh-iv commented Sep 19, 2017

FYI @chinmayapancholi13, I'll merge it now (because it's done for the current stage and we need feedback from users).
For the next changes (more RNG control, cythonization) you should create a new PR.

Very nice job @chinmayapancholi13 🔥

@menshikh-iv menshikh-iv merged commit 6e51156 into RaRe-Technologies:develop Sep 19, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@Liebeck


Liebeck commented Sep 21, 2017

I'd like to experiment with Gensim and fasttext. I'm not sure what the current implementation status is.

At this point, the Python implementation mentioned in the notebook is not available via pip, right? This means that, for now, only the C++ wrapper is available in 2.3.0?

@menshikh-iv

Member

menshikh-iv commented Sep 21, 2017

@Liebeck For 2.3.0, only the C++ wrapper; in the next version, the current implementation will be available.
If you want to start your experiments immediately, you can install gensim from the develop branch.

@menshikh-iv

Member

menshikh-iv commented Oct 2, 2017

@Liebeck this functionality is now available in the latest gensim version (3.0.0).

@menshikh-iv

Member

menshikh-iv commented Nov 28, 2017

@Liebeck right now @manneshiva is optimizing our pure-Python version; very soon we'll have a very fast version. You can monitor progress in #1742 (it will be finished in two weeks, maybe faster).
