[MRG] Poincare model keyedvectors #1700

jayantj · 2017-11-08T03:43:22Z

This branch includes the changes to KeyedVectors and PoincareKeyedVectors for the PoincareModel.

Includes commits from #1696 , to be reviewed and merged after #1696

TODOs:

Evaluating Gensim implementation and adding results to notebook
Adding KeyedVector methods like most_similar to PoincareKeyedVectors
Refactoring KeyedVectors into KeyedVectorsBase and EuclideanKeyedVectors for cleaner class hierarchy
Tests for PoincareKeyedVectors

jayantj · 2017-11-22T00:57:37Z

Nice catch for the integer division, fixed.

jayantj · 2017-11-22T01:10:49Z

Short summary and rationale for the changes to KeyedVectors in this PR -

Refactoring of the previously existing KeyedVectors class into KeyedVectorsBase and EuclideanKeyedVectors. For backwards compatibility, the keyedvectors module contains a reference KeyedVectors which points to EuclideanKeyedVectors. It may be a good idea to rename KeyedVectors elsewhere in the codebase to EuclideanKeyedVectors.
KeyedVectorsBase is simply a collection of vectors and associated labels, supporting the following old methods -
- __getitem__
- __contains__
- word_vec
- similarity
- most_similar_to_given
- load_word2vec_format/save_word2vec_format
  Along with the following new methods -
- distance
- distances
- words_closer_than
- rank
  Note that the KeyedVectorsBase class does not provide definitions for the distance, distances and similarity methods; it is up to the child class to define them as appropriate.
EuclideanKeyedVectors is derived from KeyedVectorsBase and contains functionality that is only relevant/meaningful for vectors in Euclidean space. Both word2vec and fasttext vectors fall into this category.
PoincareKeyedVectors is a new class to contain vectors in hyperbolic space for the Poincare model, which supports operations specific to the vectors for the Poincare model.
The most_similar method conceptually makes sense for both PoincareKeyedVectors and EuclideanKeyedVectors. However, the existing most_similar API could not be supported in PoincareKeyedVectors, so PoincareKeyedVectors provides its own implementation of most_similar with a different API. For this reason, most_similar is not present in the KeyedVectorsBase class.
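
To make the hierarchy described above concrete, here is a minimal pure-Python sketch. It is not the code from this PR: the method bodies, the dict-based storage, and the choice of cosine distance for the Euclidean class are assumptions made for illustration. Only the class names and method names come from the summary above. The hyperbolic distance uses the standard Poincare ball formula.

```python
import math


class KeyedVectorsBase:
    """Labels mapped to vectors; the distance metric is left to subclasses."""

    def __init__(self, vectors):
        self.vectors = dict(vectors)  # label -> tuple of floats (assumed storage)

    def __getitem__(self, key):
        return self.vectors[key]

    def __contains__(self, key):
        return key in self.vectors

    def distance(self, w1, w2):
        raise NotImplementedError("defined by the child class")

    def distances(self, key, other_keys=()):
        # Immutable default; an empty value means "compare against every key".
        others = other_keys or list(self.vectors)
        return [self.distance(key, other) for other in others]

    def words_closer_than(self, w1, w2):
        # Keys strictly closer to w1 than w2 is, excluding w1 itself.
        threshold = self.distance(w1, w2)
        return [k for k in self.vectors
                if k != w1 and self.distance(w1, k) < threshold]

    def rank(self, w1, w2):
        # 1-based rank of w2 among all keys ordered by distance from w1.
        return len(self.words_closer_than(w1, w2)) + 1


class EuclideanKeyedVectors(KeyedVectorsBase):
    def distance(self, w1, w2):
        # Cosine distance (an assumption for this sketch).
        u, v = self[w1], self[w2]
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return 1.0 - dot / norms


class PoincareKeyedVectors(KeyedVectorsBase):
    def distance(self, w1, w2):
        # Poincare ball distance: arccosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2))).
        u, v = self[w1], self[w2]
        sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
        sq_u = sum(a * a for a in u)
        sq_v = sum(b * b for b in v)
        return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Backwards-compatible alias, as described in the summary.
KeyedVectors = EuclideanKeyedVectors
```

Because words_closer_than and rank are written only in terms of distance, both subclasses inherit them unchanged, which is the main benefit of the split.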

The PR also adds missing tests for some of the older KeyedVectors methods.

menshikh-iv

In general - very nice work 💣 🔥
Just a few questions (and I'll fix docstrings / PEP8 after your commits).

menshikh-iv · 2017-11-22T11:22:54Z

gensim/models/poincare_visualization.py

@@ -0,0 +1,187 @@
+#!/usr/bin/env python


I think it would be better to put this code in an IPython notebook (or maybe move it to docs/notebooks and import it from there, "under the feet").
gensim.models isn't a suitable place for this file.

I agree gensim.models isn't a good place for it. I don't want to have it in an ipython notebook or docs/notebooks though since that would mean a user can't import and use it, and I think it is definitely useful for a user. Do you think creating a new package poincare in gensim/ would be a good idea? Other models do this too (e.g. topic_coherence)

I don't think that's a very good idea. After refactoring, topic_coherence will be moved/renamed "deep" inside gensim.modules (coherence contains only inner/secondary functions, not public API), but in your case the API is public.

I agree about imports (it's really non-trivial to import from docs/notebooks if you are in /randomfolder; only manually with importlib, I think).
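
For reference, the manual importlib route mentioned above could look roughly like the sketch below. The helper name import_from_path is hypothetical, and the file is assumed to be a regular .py file (not a notebook).

```python
import importlib.util


def import_from_path(module_name, file_path):
    """Load a module from an explicit .py file path (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module
```

This works from any working directory, but every user would have to know the file's location on disk, which is why a proper package is the more convenient option.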

We have many viz helpers (produced by @parulsethi during GSoC), and Parul is now working on a very nice viz for topic models. Potentially, we can create a separate repository (like gensim-data) and move all viz helpers there, or, as you suggest, create a submodule gensim.viz and move all viz stuff into it (not only your Poincare viz).

Hard question, I don't know what's better right now.

WDYT @piskvorky @janpom @parulsethi?

A submodule gensim.viz sounds good to me, keeping in mind we might have future visualizations too. I don't have a good enough perspective on this though, so whatever you decide is okay with me.

Hi @menshikh-iv , so is it okay if I create a gensim.viz package for this and any future gensim visualizations, and move the poincare visualization there?

@piskvorky @parulsethi

A gensim.viz submodule would also be useful for #1616 in the future, and the long code blocks for the network graph/dendrogram could be wrapped in functions under this module, so that those visualizations can be produced with a simple import.

@jayantj sounds good.

Great! Thanks for the feedback @parulsethi @menshikh-iv
Pushed changes with a new gensim.viz package.

menshikh-iv · 2017-11-22T11:25:57Z

gensim/models/poincare.py

@@ -514,6 +516,8 @@ def train(self, epochs, batch_size=10, print_every=1000, check_gradients_every=N
        """
        if self.workers > 1:
            raise NotImplementedError("Multi-threaded version not implemented yet")
+        # Some divide-by-zero results are handled explicitly
+        old_settings = np.seterr(divide='ignore', invalid='ignore')


Why? Do you mean that division by zero is expected and you handle this situation in code? Am I correct?

Yeah, it happens in PoincareBatch. I'm setting it here to avoid repeated calls to np.seterr
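
A minimal sketch of the pattern being discussed, assuming the goal is to silence the warnings once, handle the resulting nan values explicitly, and then restore the caller's settings. The function normalized_ratio is hypothetical and is not part of PoincareBatch. np.seterr returns the previous settings as a dict, which can be passed back to restore them.

```python
import numpy as np


def normalized_ratio(numerator, denominator):
    """Divide elementwise, mapping 0/0 (nan) results to zero."""
    # Silence divide-by-zero and invalid-value warnings; keep the old settings.
    old_settings = np.seterr(divide='ignore', invalid='ignore')
    try:
        result = numerator / denominator
        result[np.isnan(result)] = 0.0  # handle the expected nan values explicitly
    finally:
        np.seterr(**old_settings)  # restore whatever the caller had configured
    return result
```

For a short, local block, the context manager np.errstate(divide='ignore', invalid='ignore') does the same thing with automatic restoration. Setting the state once around a whole training loop avoids re-entering it on every batch.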

menshikh-iv · 2017-11-22T11:31:30Z

gensim/models/poincare.py

+
+        Parameters
+        ----------
+        node : str


Is this always a str (or is an int possible too)? This is a more general question, about all methods that take a node argument.

It could be an int too in theory, depending on what the vocab keys are. The most common case is str though. How would you prefer to handle this?

Maybe "str or int" everywhere?

Sounds good

Done. Also made some other changes to docstrings for more clarity.

jayantj · 2017-11-23T07:52:50Z

Some completely unrelated tests seem to be failing on travis (test_translation_matrix, test_lda_model). Not sure what that is about.

piskvorky · 2017-11-26T17:12:41Z

@menshikh-iv we need this resolved & finished -- can you have a look? Cheers.

jayantj added 30 commits October 26, 2017 22:13

Initial classes and loading data for poincare model

6afdd22

Initial implementation of training using autograd

a804006

faster negative sampling, bugfix in vector updates

6bd0d4b

allows poincare dist function to be differentiable by autograd

98f94a7

batched gradient descent initial implementation

b727523

minor changes to batch poincare distance computation

1e6aee1

Adds calculation of gradients for poincare model

e286a0b

Correct implementation of clipping of updated vectors

3e28e8b

Fixes error in gradient computation

99a2270

Better messages while training

2e9e31c

Renames PoincareDistance to PoincareExample for clarity

d72cb10

Compares computed gradients to autograd gradients every few iterations

d439501

Avoids doing some numpy computations twice

e1ed24d

Avoids creating copies of numpy vectors

3b2a383

Only calls nan_to_num when gamma has at least one value equal to 1

7d68aae

Simply sets nan gradients to zero instead of nan_to_num

ba82d42

Adds batch-wise implementation of training and gradient computations

71f61d1

Minor correction in clipping

2a5a7fb

Merge branch 'poincare' into poincare_model

0c57aa1

Fixes typo in clip_vectors

9c51609

Prints average loss every few iterations instead of current loss

f22d9b2

Adds weighted negative sampling

7905c8c

Ensures positive edges are not returned by negative sampling

075df25

Poincare model stores node indices in relations instead of node keys

6060e56

Minor renaming; uses node indices for batch training instead of node keys

8ea8f23

Changes shapes of vectors passed to PoincareBatch

b8d77e3

Minor bugfixes related to batch size

0011b93

Corrects implementation of negative sampling for batch training

b52ee2e

Adds option to check gradients in batchwise training

d247384

Checks gradients only every few iterations

8c4f5a3

jayantj added 7 commits November 22, 2017 05:43

Makes default argument for distances immutable

235b643

Uses conditional import for pygtrie in LexicalEntailmentEvaluation

d0b8563

Renames position_in_hierarchy to norm with minor change in behaviour, updates tests

cedd0e1

Renames poincare_distance and poincare_distance_batch to vector_distance and vector_distance_batch

0317189

Forces float division for positive_fraction in _sample_negatives

e693e64

Removes unused method from PoincareKeyedVectors

e931085

Updates report notebook with usage examples of new API methods

3c8d9f2

Minor pep8 fix

73ed696

jayantj force-pushed the poincare_model_keyedvectors branch from 004b572 to 73ed696 Compare November 22, 2017 01:15

jayantj changed the title ~~[WIP] Poincare model keyedvectors~~ [MRG] Poincare model keyedvectors Nov 22, 2017

jayantj mentioned this pull request Nov 22, 2017

[MRG] Poincare l2 regularization #1734

Merged

menshikh-iv suggested changes Nov 22, 2017

jayantj added 3 commits November 23, 2017 06:01

Fixes pep8 issues, unused imports and typo

ee92be9

Adds example of saving and loading model to notebook

46a7efb

Updates docstrings in poincare.py

291dac6

jayantj and others added 8 commits November 27, 2017 12:48

Moves poincare visualization methods to new gensim.viz module

c532e6e

Updates rst files for poincare viz

c506b96

Adds newline at the end of poincare.py in viz package

b4ec393

Adds link to original paper to poincare notebook

a7c3080

fix viz.poincare & update docs dependencies

e53f487

add link to init file

4775f4d

fix PEP8

a22c601

fixes for poincare.py

6a2da73

menshikh-iv merged commit 1ac5a26 into poincare Dec 4, 2017

jayantj mentioned this pull request Dec 4, 2017

[MRG] Add Poincare model #1757

Merged

menshikh-iv deleted the poincare_model_keyedvectors branch July 5, 2018 17:03