
[WIP] Labeled w2v #1153

Closed
wants to merge 32 commits into from

Conversation

@giacbrd commented Feb 17, 2017

I have added a new class, LabeledWord2Vec, in the module labeledword2vec.
The goal of this class is already described in #960.

It is a subclass of Word2Vec. Direct subclassing is not the optimal solution here: it would be preferable to have a base class, something like ShallowNeuralNetwork, with subclasses LabeledWord2Vec and Word2Vec. They both share the two-layer neural network concept, but the small differences make them two totally different instruments.

I preferred to minimize my intrusion into Gensim and avoid refactoring a lot of code; the solution of a more complex class hierarchy did not seem trivial.

@gojomo (Collaborator) commented Feb 17, 2017

Just wondering, why wouldn't skip-gram be as appropriate as CBOW?

@giacbrd (Author) commented Feb 20, 2017

With CBOW we obtain a model that, given a context of words, returns a probability distribution over a vocabulary of words (the probability of appearing in that context). So we get the direct computation of a language model, while with skip-gram we do the inverse, predicting the context. With skip-gram you would then have to compute, for each word (the labels in the fastText classifier), the probability of the specific document to classify.
Maybe there is a better way, but at the moment I think that with skip-gram you have to do #labels "independent" operations on the neural network, instead of (almost) a single matrix product.
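A minimal NumPy sketch of this point (all matrices and sizes made up for illustration; `W_in` and `W_out` stand for the input word embeddings and the output label embeddings): with the CBOW architecture, the whole label distribution for a document falls out of one matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_labels, dim = 1000, 50, 100

W_in = rng.normal(size=(vocab_size, dim))   # input layer: one vector per word
W_out = rng.normal(size=(n_labels, dim))    # output layer: one vector per *label*

doc_word_ids = [3, 17, 256, 42]             # the words of the document to classify

# CBOW-style forward pass: average the input vectors of the document's words...
hidden = W_in[doc_word_ids].mean(axis=0)

# ...then a single matrix product + softmax gives P(label | document).
scores = W_out @ hidden
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(probs.argmax())  # predicted label id
```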

@giacbrd closed this Feb 20, 2017
@giacbrd reopened this Feb 20, 2017
@piskvorky (Owner)

Not sure we want to start adding supervised learning (classification) to gensim. There would have to be a really clear, convincing reason for such a major change of mission.

@tmylk are you OK with this?

@tmylk (Contributor) commented Mar 13, 2017

@giacbrd Apologies for the delay in response. I would like to include this, and the necessary refactoring is on our list for this summer's Google Summer of Code.

The output of Gensim's unsupervised models only becomes useful after it is put through a supervised classifier. Training jointly with a supervised layer brings better results than training separately, as shown by fastText and supervised LDA. There is also great demand for it, as shown by fastText's popularity, the requests for supervised LDA, and the success of the ShortText package that integrates gensim with sklearn/keras.

@tmylk changed the title from "Labeled w2v" to "[WIP] Labeled w2v" May 1, 2017
@menshikh-iv (Contributor)

Ping @giacbrd, what is the status of this PR?

@giacbrd (Author) commented Jun 13, 2017

Hi,
on my side the code is ready (I probably have to fix some problems in the Cython code, and maybe I will add more tests), but maybe there is a misunderstanding. I was waiting for feedback: I don't understand whether a refactoring is mandatory for this PR and whether someone is going to do it (e.g. https://github.com/numfocus/gsoc/blob/master/2017/proposals/prakhar_gsoc_17.md). So, what is actually missing from this PR?

@menshikh-iv (Contributor) commented Jun 13, 2017

@giacbrd Please reformat your docstrings according to the Google format (a sketch of the format follows below). Also, we need to add a short usage example in a notebook.
After this, I think, we can merge this PR.
What do you think about it @piskvorky @gojomo?
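For reference, a minimal sketch of the requested Google docstring format (the function and its parameters are made up purely for illustration):

```python
def train_labeled(sentences, labels, epochs=5):
    """Train a LabeledWord2Vec-style classifier on pre-tokenized text.

    Args:
        sentences (iterable of list of str): Tokenized training documents.
        labels (iterable of list of str): Target label(s) for each document.
        epochs (int): Number of passes over the corpus.

    Returns:
        LabeledWord2Vec: The trained model.
    """
    ...
```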

@piskvorky (Owner) commented Jun 13, 2017

@giacbrd can you summarize the use-case for this? What are the advantages of LabeledWord2Vec over FastText, or any of the classification models in scikit-learn? When would one use this class?

@giacbrd (Author) commented Jun 13, 2017

LabeledWord2Vec is practically very close to the original fastText classification model, but it has all the advantages of being written in Python/Cython, with a familiar interface.
Its effectiveness on different benchmark datasets is good, very similar to the standard linear classifiers of Scikit-learn.

Pros:

  • it does not need two-phase learning (vectorization + classifier training): the algorithm directly ingests raw text
  • it performs online learning
  • the learning approach is radically different: it is not a linear separation in the word vector space but it may capture deeper semantics
  • it scales very well on multi-class problems, hundreds or thousands of categories are not a problem

Cons:

  • trained models are heavy (word and label matrices), but this can be controlled with the hashing trick (sketched below)
  • the lower bound on computational cost is pretty high; several other algorithms are very fast once you already have document vectors

Given its different approach to text classification, it is a preferable alternative to many linear models; sometimes it can perform significantly better on specific data or domains.
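A minimal sketch of the hashing trick mentioned in the cons above (my own illustration of the technique, not code from this PR): instead of one matrix row per distinct word, words are hashed into a fixed number of buckets, which caps the model size regardless of vocabulary growth (collisions are accepted as noise).

```python
import zlib
import numpy as np

n_buckets, dim = 2**16, 100   # fixed budget, independent of vocabulary size
W_in = np.zeros((n_buckets, dim), dtype=np.float32)

def bucket(word):
    # Any stable string hash works; Python's built-in hash() is randomized
    # per process, so this sketch uses CRC32.
    return zlib.crc32(word.encode("utf8")) % n_buckets

row = W_in[bucket("classification")]  # look up a word's vector by its bucket
```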

@menshikh-iv (Contributor)

In my opinion, we need to add several things @giacbrd:

  • Add Google-style docstrings (and reformat the existing ones)
  • Add a notebook with usage (maybe comparing with FastText or another supervised variant)
  • Add more tests
  • Update setup.py for a correct build (a sketch follows this list)

After this, I think we can merge this PR, @piskvorky what do you think about it?
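For the setup.py point, a minimal sketch of the kind of change meant here, assuming the PR adds a Cython module named gensim/models/labeledword2vec_inner.pyx (the module name is my assumption): the new extension has to be registered next to the existing ones so the C code gets compiled at install time.

```python
# excerpt from setup.py -- a sketch, assuming the Cython source is
# gensim/models/labeledword2vec_inner.pyx
from setuptools import Extension

ext_modules = [
    # ... existing extensions (word2vec_inner, doc2vec_inner, ...) ...
    Extension(
        "gensim.models.labeledword2vec_inner",
        sources=["./gensim/models/labeledword2vec_inner.c"],  # cython-generated C
        include_dirs=["gensim/models"],
    ),
]
```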

@giacbrd (Author) commented Jun 22, 2017

@menshikh-iv yes, I was waiting for confirmation; I see there are still doubts about the eligibility of this model for Gensim. I mean, maybe you don't want to introduce a text classification algorithm into the library?

I am writing a notebook for https://github.com/giacbrd/ShallowLearn , which is a layer (a scikit-learn interface) over LabeledWord2Vec, reproducing the official fastText tutorial (https://github.com/facebookresearch/fastText/blob/master/tutorials/supervised-learning.md) and highlighting the additional features. It could also be suitable as a notebook for Gensim...
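For a taste of that scikit-learn interface, a minimal usage sketch (based on the ShallowLearn README at the time; the exact class and parameter names may differ):

```python
from shallowlearn.models import GensimFastText

# pre-tokenized documents and one label per document
docs = [('i', 'am', 'tall'), ('you', 'are', 'short')]
labels = ['yes', 'no']

clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)
clf.fit(docs, labels)
print(clf.predict([('tall', 'am', 'i')]))  # e.g. ['yes']
```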

@menshikh-iv (Contributor)

wdyt @piskvorky?

@prakhar2b (Contributor)

@piskvorky what do you think about this PR? Refactoring and optimization are part of my GSoC timeline, which starts today. It would be better to know your vision for this project.

https://github.com/RaRe-Technologies/gensim/wiki/GSOC-2017-project-ideas#performance-improvements-in-gensim-and-fasttext

@piskvorky (Owner) commented Jul 3, 2017

@prakhar2b I'd much prefer to have "native" fastText in gensim (Python/C/Cython) first (currently we only have a wrapper for the C++ code). That's an unsupervised algorithm, perfectly in line with gensim's mission (unlike supervised classification). In addition, fastText is a cool, useful algo.

But I don't know how much leeway there is to change your GSoC topic. Or are these two tasks related? How much overlap is there? Any chance to do both at once?

Also, what is the connection to @giacbrd's existing work? Will you two work on this together? Or what's the difference?

@giacbrd (Author) commented Jul 3, 2017

The unsupervised models of fastText are the ones described here: https://arxiv.org/abs/1607.04606
They are also the ones covered by the Gensim wrapper.

LabeledWord2Vec instead only refers to https://arxiv.org/abs/1607.01759, which is a supervised model that also exploits the "tricks" of the previous article. However, in the case of LabeledWord2Vec, I have not implemented all these tricks, i.e., subword n-grams and the hashing trick. In fact these should be implemented as generic features in Gensim, at vocabulary construction time.

Subword n-grams and the hashing trick could be used by any word-vector-space based method in Gensim (just like Phrases: https://radimrehurek.com/gensim/models/phrases.html). By using them with the current implementation of word2vec in Gensim, we would practically obtain the fastText unsupervised models!

My opinion is that, if we want the fastText unsupervised models in Gensim, word2vec should be improved following https://arxiv.org/abs/1607.04606 and its related code. A refactoring is necessary, together with an improved design of the Word2Vec class: e.g. the word vocabulary should be able to work with subwords, word hashes, word n-grams, ... (OOV words in general). A minimal sketch of such subword n-grams follows below.
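As a concrete illustration of the subword n-grams from that paper (my own sketch of the technique, not code from this PR): each word is padded with boundary markers and decomposed into character n-grams, which is what lets the vocabulary serve vectors even for OOV words.

```python
def subword_ngrams(word, minn=3, maxn=6):
    """Character n-grams as in fastText (arXiv:1607.04606)."""
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(padded) - n + 1)]

print(subword_ngrams("where", minn=3, maxn=4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```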

LabeledWord2Vec is a relatively simple modification of Word2Vec that uses a different vocabulary in the output layer of the network (a set of labels instead of the text words), in order to perform text classification.
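Conceptually, the modification amounts to something like this sketch (illustrative only; the helper methods are hypothetical, not the PR's actual code):

```python
from gensim.models import Word2Vec

class LabeledWord2Vec(Word2Vec):  # sketch only
    def build_vocab_from(self, documents):
        # Input layer: one vector per *word*, exactly as in Word2Vec.
        words = (w for doc, _ in documents for w in doc)
        self.build_word_vocab(words)    # hypothetical helper

        # Output layer: Huffman tree / negative-sampling table built
        # over the *labels* instead of over the words.
        labels = (l for _, doc_labels in documents for l in doc_labels)
        self.build_label_vocab(labels)  # hypothetical helper
```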

@prakhar2b (Contributor) commented Jul 3, 2017

@piskvorky As for native fastText code, yes, we have a wrapper for training; for loading we already have native Python code (ref: loading old/new models using the bin file only). After working on the two PRs #1341 and #1319 I'm quite familiar with how fastText (unsupervised) works, and it would be exciting to work on this as well, but it would definitely take a considerable amount of time.

As for the overlap with this PR, there is definitely overlap in the training code between word2vec, unsupervised fastText, and LabeledWord2Vec. And the GSoC project was in fact to refactor (and optimize) the labeled w2v code and reorganize word2vec and labeled w2v as subclasses of a BaseNN (Base Neural Network) class; one possible shape of that hierarchy is sketched below.

I think it would be better to draw an outline of the features we want (LabeledWord2Vec is not the complete Facebook fastText supervised classifier), keeping in mind all three (word2vec, fastText unsupervised, and fastText supervised), and then do the necessary refactoring.
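A minimal sketch of the BaseNN idea mentioned above (purely illustrative; none of these class or method names exist in gensim):

```python
class BaseNN:
    """Shared two-layer network: vocabulary handling, input/output
    weight matrices, and the SGD training loop."""
    def build_vocab(self, corpus): ...
    def train(self, corpus): ...

class Word2Vec(BaseNN):
    """Unsupervised: words in both the input and the output layer."""

class LabeledWord2Vec(BaseNN):
    """Supervised: words in the input layer, labels in the output layer."""
```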

Btw, I see you are quite reluctant to add supervised classification to gensim; any specific reason for that?

cc @jayantj

@menshikh-iv (Contributor)

In my opinion, it's a nice feature for a future 'contribute' subpackage. @giacbrd, are you planning to finish this PR?

@giacbrd (Author) commented Aug 11, 2017

I don't think I will continue to work on this; it seems there is no actual interest in supervised learning. Maybe after the work of @prakhar2b and a general refactoring of these models it will be worth adding LabeledWord2Vec, which is just:

LabeledWord2Vec is a relatively simple modification of Word2Vec that uses a different vocabulary in the output layer of the network (a set of labels instead of the text words), in order to perform text classification.

Maybe I don't understand the purpose of finalizing this PR; will it be merged into develop? As an "external contribution" it is already available here: https://github.com/giacbrd/ShallowLearn

@menshikh-iv (Contributor)

So, first we should finish fastText & refactor the "common w2v code", and after that finish this PR and add it to the 'contribute' subpackage. Thanks for your work @giacbrd, I'll ping you when we are ready for it.
