
[WIP] Labeled w2v #1153

Closed
wants to merge 32 commits into from

Conversation

@giacbrd commented Feb 17, 2017

I have added a new class, LabeledWord2Vec, in the module labeledword2vec.
The goal of this class is already described in #960.

It is a subclass of Word2Vec. Direct subclassing is not the optimal solution here: it would be preferable to have a base class, something like ShallowNeuralNetwork, with subclasses LabeledWord2Vec and Word2Vec. They both share the two-layer neural network concept, but the small differences make them two totally different instruments.

I preferred to minimize my intrusion into Gensim and avoid refactoring a lot of code; the solution of a more complex class hierarchy did not seem trivial.

@gojomo (Collaborator) commented Feb 17, 2017

Just wondering, why wouldn't skip-gram be as appropriate as CBOW?

@giacbrd (Author) commented Feb 20, 2017

With CBOW we obtain a model that, given a context of words, returns a probability distribution over a vocabulary of words (the probability of appearing in that context). So we get the direct computation of a language model, while with skip-gram we do the inverse, predicting the context. With skip-gram you would then have to compute, for each word (the labels in the fastText classifier), the probability of the specific document to classify.
Maybe there is a better way, but at the moment I think that with skip-gram you have to do #labels "independent" operations on the neural network, instead of (almost) a single matrix product.
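A minimal NumPy sketch of this point (all matrices and sizes made up for illustration; `W_in` and `W_out` stand for the input word embeddings and the output label embeddings): with the CBOW architecture, the whole label distribution for a document falls out of one matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_labels, dim = 1000, 50, 100

W_in = rng.normal(size=(vocab_size, dim))   # input layer: one vector per word
W_out = rng.normal(size=(n_labels, dim))    # output layer: one vector per *label*

doc_word_ids = [3, 17, 256, 42]             # the words of the document to classify

# CBOW-style forward pass: average the input vectors of the document's words...
hidden = W_in[doc_word_ids].mean(axis=0)

# ...then a single matrix product + softmax gives P(label | document).
scores = W_out @ hidden
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(probs.argmax())  # predicted label id
```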

@giacbrd closed this Feb 20, 2017
@giacbrd reopened this Feb 20, 2017
@piskvorky (Owner)

Not sure we want to start adding supervised learning (classification) to gensim. There would have to be a really clear, convincing reason for such a major change of mission.

@tmylk are you OK with this?

@tmylk (Contributor) commented Mar 13, 2017

@giacbrd Apologies for the delay in response. I would like to include this, and the necessary refactoring is on our list for this summer's Google Summer of Code.

The output of Gensim's unsupervised models only becomes useful after it is put through a supervised classifier. Training jointly with a supervised layer brings better results than training separately, as shown by fastText and supervised LDA. There is also great demand for it, as shown by fastText's popularity, the requests for supervised LDA, and the success of the ShortText package that integrates gensim with sklearn/keras.

@tmylk changed the title from "Labeled w2v" to "[WIP] Labeled w2v" May 1, 2017
@menshikh-iv (Contributor)

Ping @giacbrd, what is the status of this PR?

@giacbrd (Author) commented Jun 13, 2017

Hi,
on my side the code is ready (I probably have to fix some problems in the Cython code, and maybe I will add more tests), but maybe there is a misunderstanding. I was waiting for feedback: I don't understand whether a refactoring is mandatory for this PR and whether someone is going to do it (e.g. https://github.com/numfocus/gsoc/blob/master/2017/proposals/prakhar_gsoc_17.md). So, what is actually missing from this PR?

@menshikh-iv (Contributor) commented Jun 13, 2017

@giacbrd Please reformat your docstrings according to the Google format (a sketch of the format follows below). Also, we need to add a short usage example in a notebook.
After this, I think, we can merge this PR.
What do you think about it @piskvorky @gojomo?
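For reference, a minimal sketch of the requested Google docstring format (the function and its parameters are made up purely for illustration):

```python
def train_labeled(sentences, labels, epochs=5):
    """Train a LabeledWord2Vec-style classifier on pre-tokenized text.

    Args:
        sentences (iterable of list of str): Tokenized training documents.
        labels (iterable of list of str): Target label(s) for each document.
        epochs (int): Number of passes over the corpus.

    Returns:
        LabeledWord2Vec: The trained model.
    """
    ...
```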

@piskvorky (Owner) commented Jun 13, 2017

@giacbrd can you summarize the use-case for this? What are the advantages of LabeledWord2Vec over FastText, or any of the classification models in scikit-learn? When would one use this class?

@giacbrd (Author) commented Jun 13, 2017

LabeledWord2Vec is practically very close to the original fastText classification model, but it has all the advantages of being written in Python/Cython, with a familiar interface.
Its effectiveness on different benchmark datasets is good, very similar to the standard linear classifiers of Scikit-learn.

Pros:

  • it does not need two-phase learning (vectorization + classifier training): the algorithm directly ingests raw text
  • it performs online learning
  • the learning approach is radically different: it is not a linear separation in the word vector space but it may capture deeper semantics
  • it scales very well on multi-class problems, hundreds or thousands of categories are not a problem

Cons:

  • trained models are heavy (word and label matrices), but this can be controlled with the hashing trick (sketched below)
  • the lower bound on computational cost is pretty high; several other algorithms are very fast once you already have document vectors

Given its different approach to text classification, it is a preferable alternative to many linear models; sometimes it can perform significantly better on specific data or domains.
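A minimal sketch of the hashing trick mentioned in the cons above (my own illustration of the technique, not code from this PR): instead of one matrix row per distinct word, words are hashed into a fixed number of buckets, which caps the model size regardless of vocabulary growth (collisions are accepted as noise).

```python
import zlib
import numpy as np

n_buckets, dim = 2**16, 100   # fixed budget, independent of vocabulary size
W_in = np.zeros((n_buckets, dim), dtype=np.float32)

def bucket(word):
    # Any stable string hash works; Python's built-in hash() is randomized
    # per process, so this sketch uses CRC32.
    return zlib.crc32(word.encode("utf8")) % n_buckets

row = W_in[bucket("classification")]  # look up a word's vector by its bucket
```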

@menshikh-iv (Contributor)

In my opinion, we need to add several things @giacbrd:

  • Add Google-style docstrings (and reformat the existing ones)
  • Add a notebook with usage (maybe comparing with FastText or another supervised variant)
  • Add more tests
  • Update setup.py for a correct build (a sketch follows this list)

After this, I think we can merge this PR, @piskvorky what do you think about it?
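For the setup.py point, a minimal sketch of the kind of change meant here, assuming the PR adds a Cython module named gensim/models/labeledword2vec_inner.pyx (the module name is my assumption): the new extension has to be registered next to the existing ones so the C code gets compiled at install time.

```python
# excerpt from setup.py -- a sketch, assuming the Cython source is
# gensim/models/labeledword2vec_inner.pyx
from setuptools import Extension

ext_modules = [
    # ... existing extensions (word2vec_inner, doc2vec_inner, ...) ...
    Extension(
        "gensim.models.labeledword2vec_inner",
        sources=["./gensim/models/labeledword2vec_inner.c"],  # cython-generated C
        include_dirs=["gensim/models"],
    ),
]
```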

@giacbrd (Author) commented Jun 22, 2017

@menshikh-iv yes, I was waiting for confirmation; I see there are still doubts about the eligibility of this model for Gensim. I mean, maybe you don't want to introduce a text classification algorithm into the library?

I am writing a notebook for https://github.com/giacbrd/ShallowLearn , which is a layer (a scikit-learn interface) over LabeledWord2Vec, reproducing the official fastText tutorial (https://github.com/facebookresearch/fastText/blob/master/tutorials/supervised-learning.md) and highlighting the additional features. It could also be suitable as a notebook for Gensim...
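For a taste of that scikit-learn interface, a minimal usage sketch (based on the ShallowLearn README at the time; the exact class and parameter names may differ):

```python
from shallowlearn.models import GensimFastText

# pre-tokenized documents and one label per document
docs = [('i', 'am', 'tall'), ('you', 'are', 'short')]
labels = ['yes', 'no']

clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)
clf.fit(docs, labels)
print(clf.predict([('tall', 'am', 'i')]))  # e.g. ['yes']
```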

@menshikh-iv (Contributor)

wdyt @piskvorky?

@prakhar2b (Contributor)

@piskvorky what do you think about this PR? Refactoring and optimization are part of my GSoC timeline, which starts today. It would be better to know your vision for this project.

https://github.com/RaRe-Technologies/gensim/wiki/GSOC-2017-project-ideas#performance-improvements-in-gensim-and-fasttext

@piskvorky (Owner) commented Jul 3, 2017

@prakhar2b I'd much prefer to have "native" fastText in gensim (Python/C/Cython) first (currently we only have a wrapper for the C++ code). That's an unsupervised algorithm, perfectly in line with gensim's mission (unlike supervised classification). In addition, fastText is a cool, useful algo.

But I don't know how much leeway there is to change your GSoC topic. Or are these two tasks related? How much overlap is there? Any chance to do both at once?

Also, what is the connection to @giacbrd's existing work? Will you two work on this together? Or what's the difference?

@giacbrd (Author) commented Jul 3, 2017

The unsupervised models of fastText are the ones described here: https://arxiv.org/abs/1607.04606
They are also the ones covered by the Gensim wrapper.

LabeledWord2Vec instead only refers to https://arxiv.org/abs/1607.01759, which is a supervised model that also exploits the "tricks" of the previous article. However, in the case of LabeledWord2Vec, I have not implemented all these tricks, i.e., subword n-grams and the hashing trick. In fact these should be implemented as generic features in Gensim, at vocabulary construction time.

Subword n-grams and the hashing trick could be used by any word-vector-space based method in Gensim (just like Phrases: https://radimrehurek.com/gensim/models/phrases.html). By using them with the current implementation of word2vec in Gensim, we would practically obtain the fastText unsupervised models!

My opinion is that, if we want the fastText unsupervised models in Gensim, word2vec should be improved following https://arxiv.org/abs/1607.04606 and its related code. A refactoring is necessary, together with an improved design of the Word2Vec class: e.g. the word vocabulary should be able to work with subwords, word hashes, word n-grams, ... (OOV words in general). A minimal sketch of such subword n-grams follows below.
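As a concrete illustration of the subword n-grams from that paper (my own sketch of the technique, not code from this PR): each word is padded with boundary markers and decomposed into character n-grams, which is what lets the vocabulary serve vectors even for OOV words.

```python
def subword_ngrams(word, minn=3, maxn=6):
    """Character n-grams as in fastText (arXiv:1607.04606)."""
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(padded) - n + 1)]

print(subword_ngrams("where", minn=3, maxn=4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```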

LabeledWord2Vec is a relatively simple modification of Word2Vec that uses a different vocabulary in the output layer of the network (a set of labels instead of the text words), in order to perform text classification.
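Conceptually, the modification amounts to something like this sketch (illustrative only; the helper methods are hypothetical, not the PR's actual code):

```python
from gensim.models import Word2Vec

class LabeledWord2Vec(Word2Vec):  # sketch only
    def build_vocab_from(self, documents):
        # Input layer: one vector per *word*, exactly as in Word2Vec.
        words = (w for doc, _ in documents for w in doc)
        self.build_word_vocab(words)    # hypothetical helper

        # Output layer: Huffman tree / negative-sampling table built
        # over the *labels* instead of over the words.
        labels = (l for _, doc_labels in documents for l in doc_labels)
        self.build_label_vocab(labels)  # hypothetical helper
```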

@prakhar2b (Contributor) commented Jul 3, 2017

@piskvorky As for native fastText code, yes, we have a wrapper for training; for loading we already have native Python code (ref: loading old/new models using the bin file only). After working on the two PRs #1341 and #1319 I'm quite familiar with how fastText (unsupervised) works, and it would be exciting to work on this as well, but it would definitely take a considerable amount of time.

As for the overlap with this PR, there is definitely overlap in the training code between word2vec, unsupervised fastText, and LabeledWord2Vec. And the GSoC project was in fact to refactor (and optimize) the labeled w2v code and reorganize word2vec and labeled w2v as subclasses of a BaseNN (Base Neural Network) class; one possible shape of that hierarchy is sketched below.

I think it would be better to draw an outline of the features we want (LabeledWord2Vec is not the complete Facebook fastText supervised classifier), keeping in mind all three (word2vec, fastText unsupervised, and fastText supervised), and then do the necessary refactoring.
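A minimal sketch of the BaseNN idea mentioned above (purely illustrative; none of these class or method names exist in gensim):

```python
class BaseNN:
    """Shared two-layer network: vocabulary handling, input/output
    weight matrices, and the SGD training loop."""
    def build_vocab(self, corpus): ...
    def train(self, corpus): ...

class Word2Vec(BaseNN):
    """Unsupervised: words in both the input and the output layer."""

class LabeledWord2Vec(BaseNN):
    """Supervised: words in the input layer, labels in the output layer."""
```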

Btw, I see you are quite reluctant to add supervised classification to gensim; any specific reason for that?

cc @jayantj

@menshikh-iv (Contributor)

In my opinion, it's a nice feature for a future 'contribute' subpackage. @giacbrd, are you planning to finish this PR?

@giacbrd (Author) commented Aug 11, 2017

I don't think I will continue to work on this; it seems there is no actual interest in supervised learning. Maybe after the work of @prakhar2b and a general refactoring of these models it will be worth adding LabeledWord2Vec, which is just:

LabeledWord2Vec is a relatively simple modification of Word2Vec that uses a different vocabulary in the output layer of the network (a set of labels instead of the text words), in order to perform text classification.

Maybe I don't understand the purpose of finalizing this PR; will it be merged into develop? As an "external contribution" it is already available here: https://github.com/giacbrd/ShallowLearn

@menshikh-iv (Contributor)

So, first we should finish fastText & refactor the "common w2v code", and after that finish this PR and add it to the 'contribute' subpackage. Thanks for your work @giacbrd, I'll ping you when we are ready for it.
