
[WIP][GSoC 2018] Similarity Learning #2050

Closed

Conversation

aneesh-joshi (Contributor) commented:

Run dssm_example.py to get a complete run of the implementations.
Work is in progress, so several features still need to be added and the code needs to be cleaned up.
This is provided as a proof of concept/demo.

Using TensorFlow backend.
2018-05-16 00:54:35,517 : MainThread : INFO : Starting Vocab Build
2018-05-16 00:54:35,518 : MainThread : INFO : Building vocab
2018-05-16 00:54:41,060 : MainThread : INFO : word vocab build complete
2018-05-16 00:54:41,061 : MainThread : INFO : Vocab Build Complete
2018-05-16 00:55:05,019 : MainThread : INFO : There are a total of 20347 query, document pairs in the dataset
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
query (InputLayer)              (None, 6061)         0
__________________________________________________________________________________________________
doc (InputLayer)                (None, 6061)         0
__________________________________________________________________________________________________
sequential_1 (Sequential)       (None, 64)           612664      query[0][0]
                                                                 doc[0][0]
__________________________________________________________________________________________________
dot_1 (Dot)                     (None, 1)            0           sequential_1[1][0]
                                                                 sequential_1[2][0]
==================================================================================================
Total params: 612,664
Trainable params: 612,664
Non-trainable params: 0
__________________________________________________________________________________________________
Epoch 1/2
20347/20347 [==============================] - 25s 1ms/step - loss: 2.4343 - acc: 0.9307
Epoch 2/2
20347/20347 [==============================] - 22s 1ms/step - loss: 0.2488 - acc: 0.9254

aneesh-joshi (Contributor, Author) commented on Jun 29, 2018:

@menshikh-iv
Please note:

  1. A verbose parameter was added, since Keras prints training stats etc. that are useful while training but negatively affect doctests.
  2. If a model has already built its vocab with a certain KeyedVectors, we cannot retrain it with another KeyedVectors if the user wants to change it, since that would mean the Keras model has to be rebuilt. Perhaps we will have to do a weight save and transfer (see the sketch after this list).
  3. I have made models.experimental into a module, since that is needed by tox -e docs and the .rst.
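
A minimal sketch of the "weight save and transfer" idea (the names old_model, new_model, build_keras_model and new_keyed_vectors are hypothetical; it assumes the rebuilt model keeps the same layer names and compatible shapes for the layers being reused):

old_model.save_weights('drmm_tks_weights.h5')     # persist learned weights before the swap
new_model = build_keras_model(new_keyed_vectors)  # hypothetical rebuild with the new embeddings
new_model.load_weights('drmm_tks_weights.h5', by_name=True)  # transfer weights layer by layer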

About the current state:

  • python -m doctest -v drmm_tks.py runs successfully
  • I made some documentation fixes and the code changes you requested

I am trying to get things reviewed early so I can stay on the right track, so some changes are still missing. I am working on:

  1. Fixing the IPYNB
  2. Random seeding

For random seeding I need help: I don't understand what you want. Could you link me to a code example or a tutorial? How should I set the seed?

I have also added the .rst, but I am not sure it is correct. Is there a way to generate the docs?

aneesh-joshi (Contributor, Author) commented on Jun 29, 2018:

With tox -e docs I get the error:

Warning, treated as error:
/home/circleci/gensim/.tox/docs/lib/python2.7/site-packages/gensim/models/experimental/drmm_tks.py:docstring of gensim.models.experimental.drmm_tks.DRMM_TKS:9:Unexpected indentation.

but I cannot see any indentation on that line. 😕

https://github.com/aneesh-joshi/gensim/blob/2e6805181518b47adf4c30e1fbc8cecde83c6e2c/gensim/models/experimental/drmm_tks.py#L9

menshikh-iv (Contributor) commented on Jun 30, 2018:

> If a model has already built its vocab with a certain KeyedVectors, we cannot retrain it with another KeyedVectors if the user wants to change it, since that would mean the Keras model has to be rebuilt. Perhaps we will have to do a weight save and transfer.

That's expected.

> For random seeding I need help: I don't understand what you want. Could you link me to a code example or a tutorial? How should I set the seed?

In your case, you need to create a random vector for each OOV word manually (not the full matrix for all OOV words at once), like this (not a perfect example, but it demonstrates what I mean):

import numpy as np

emb_size = 300
oov_words = ["hello", "world", "wow"]
matrix = []

for word in oov_words:
    # seed the RNG deterministically from the word itself, so the same OOV word
    # always gets the same random vector
    rng = np.random.RandomState(seed=abs(hash(word)) % (2 ** 32 - 1))
    matrix.append(rng.rand(emb_size))

matrix = np.array(matrix)  # use this matrix in Keras
assert matrix.shape == (len(oov_words), emb_size)
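
(One caveat beyond the original snippet: on Python 3, hash() of a string is salted per process, so the seeds above are only reproducible across runs if PYTHONHASHSEED is fixed. A checksum of the word's bytes is a stable alternative; the stable_seed helper below is hypothetical.)

import zlib

def stable_seed(word):
    # CRC32 of the word's bytes: deterministic across runs, processes and Python versions
    return zlib.crc32(word.encode('utf-8')) % (2 ** 32 - 1)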

> /home/circleci/gensim/.tox/docs/lib/python2.7/site-packages/gensim/models/experimental/drmm_tks.py:docstring of gensim.models.experimental.drmm_tks.DRMM_TKS:9:Unexpected indentation.

This points to the class docstring (not the module one), because the DRMM_TKS class is mentioned; the potentially "problematic" places look like this.

Replace SPHINXOPTS = -W with SPHINXOPTS = in gensim/docs/src/Makefile to see all the issues (not only the first one). This disables the "warnings -> errors" conversion, which is useful for seeing all Sphinx issues at once instead of one by one.

@@ -68,6 +68,7 @@ Modules:
models/deprecated/keyedvectors
models/deprecated/fasttext_wrapper
models/base_any2vec
models/experimental/drmm_tks
menshikh-iv (Contributor) commented on this diff:

You also need to include the other files in the documentation build (like callbacks, layers, etc.).

aneesh-joshi (Contributor, Author) replied:

@menshikh-iv
Please refer to the link below, which shows the diff of the requested changes:

451e3b1?utf8=%E2%9C%93&diff=unified

aneesh-joshi (Contributor, Author) replied:

Please note: tox -e docs will throw errors, not on my files but on some Keras files, since I am inheriting from the Keras Layer class, which has some unformatted docs.

menshikh-iv (Contributor) replied:

@aneesh-joshi that shouldn't happen (because you include only your files, not Keras). Can you show me a tox -e docs log that mentions the error in some Keras file (not yours)?

aneesh-joshi (Contributor, Author) replied:

/home/aneeshj/Projects/gensim/.tox/docs/local/lib/python2.7/site-packages/gensim/models/experimental/custom_layers.py:docstring of gensim.models.experimental.custom_layers.TopKLayer.add_weight:10: WARNING: Unexpected indentation.
/home/aneeshj/Projects/gensim/.tox/docs/local/lib/python2.7/site-packages/gensim/models/experimental/custom_layers.py:docstring of gensim.models.experimental.custom_layers.TopKLayer.add_weight:12: WARNING: Block quote ends without a blank line; unexpected unindent.
/home/aneeshj/Projects/gensim/.tox/docs/local/lib/python2.7/site-packages/gensim/models/experimental/custom_layers.py:docstring of gensim.models.experimental.custom_layers.TopKLayer.call:4: WARNING: Inline strong start-string without end-string.

I haven't implemented any of the above functions; I just inherited from the Layer class.

menshikh-iv (Contributor) replied on Jul 4, 2018:

Aha, looks like you are right (an issue with the parent class's docstring, which we can't control).
A simple workaround: define these methods yourself and call super (but don't worry much about it now; you have more critical tasks).
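
A minimal sketch of that workaround (the import path may vary with the Keras version, and TopKLayer here is reduced to the delegation itself): re-declare the inherited method with a clean one-line docstring and delegate to the parent class.

from keras.layers import Layer

class TopKLayer(Layer):
    def add_weight(self, *args, **kwargs):
        """Delegate to Layer.add_weight (redefined only to give Sphinx a clean docstring)."""
        return super(TopKLayer, self).add_weight(*args, **kwargs)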


The trained model needs to be trained on data in the format:

>>> queries = ["When was World War 1 fought ?".lower().split(),
menshikh-iv (Contributor) commented on this docstring:

No vertical indents (again), here and everywhere.
Also, all imports should be at the top of the examples (and please import the current model too).

aneesh-joshi (Contributor, Author) replied on Jul 2, 2018:

> No vertical indents (again)

Sorry for making you repeat yourself; I keep missing it.

>>> queries = ["how are glacier caves formed ?".lower().split()]
>>> docs = ["A partly submerged glacier cave on Perito Moreno Glacier".lower().split(),
... "A glacier cave is a cave formed within the ice of a glacier".lower().split()]

menshikh-iv (Contributor) commented:

Where is your testing?

- fixes all docs and doctest errors
- fixes requested changes in PR
"metadata": {},
"outputs": [],
"source": [
"!python experimental_data/get_data.py"
menshikh-iv (Contributor) commented on this notebook cell:

Better to place this code directly in the notebook and remove get_data.py from the repo.

"metadata": {},
"outputs": [],
"source": [
"queries = [simple_preprocess(\"how are glacier caves formed\"),\n",
menshikh-iv (Contributor) commented:

Again, no vertical indents (here and everywhere).

"skipping query-doc pair due to no words in vocab\n",
"MAP: 0.56\n",
"nDCG@1 : 0.41 \n",

menshikh-iv (Contributor) commented:

Apply this function to your NN too.

----------
test_data : dict
    A dictionary which holds the validation data. It consists of the following keys:

    - "X1" : numpy array
menshikh-iv (Contributor) commented:

Is this rendered correctly? I can't check because the current build failed.

)
for key in test_data.keys():
    if key not in ['X1', 'X2', 'y', 'doc_lengths']:
        raise ValueError("test_data dictionary doesn't have the keys: 'X1', 'X2', 'y', 'doc_lengths'")
menshikh-iv (Contributor) commented:

Incorrect check: if test_data.keys() contains the needed keys plus some additional key, this will fail.
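
A minimal sketch of a check along those lines (assuming these four keys are exactly the required ones): test for missing required keys rather than rejecting extras.

required_keys = {'X1', 'X2', 'y', 'doc_lengths'}
missing = required_keys - set(test_data.keys())
if missing:
    # complain only about what is actually absent; extra keys are tolerated
    raise ValueError("test_data dictionary is missing the keys: %s" % ', '.join(sorted(missing)))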


# get all the vocab words
for q in self.queries:
    self.word_counter.update(q)
menshikh-iv (Contributor) commented:

If I call build_vocab twice, what happens?
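
(Presumably the concern is double counting: if self.word_counter persists between calls, a second build_vocab call adds every word's count again. A sketch of one way to make the call idempotent, with a simplified signature, not the PR's actual code:)

from collections import Counter

def build_vocab(self, queries):
    self.word_counter = Counter()  # start fresh, so repeated calls don't double the counts
    for q in queries:
        self.word_counter.update(q)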

self.build_vocab(self.queries, self.docs, self.labels, self.word_embedding)

is_iterable = False
if isinstance(self.queries, Iterable) and not isinstance(self.queries, list):
menshikh-iv (Contributor) commented:

Again, is_iterable is super strange; your input is always iterable.

    loss = 'mse'
    self.model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
else:
    logger.info("Model will be retrained")
menshikh-iv (Contributor) commented:

What does "retrained" mean here? Is the model updated, or trained from scratch?

)
val_callback = [val_callback] # since `model.fit` requires a list

# If train is called again, not all values should be reset
menshikh-iv (Contributor) commented:

Which values? Can you clarify, please?

self.first_train = False

if is_iterable:
    self.model.fit_generator(train_generator, steps_per_epoch=steps_per_epoch, callbacks=val_callback,
menshikh-iv (Contributor) commented:

It should always be fit_generator, and no is_iterable.
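
A minimal sketch of that suggestion (the helper _batch_generator and the input names are hypothetical; 'query' and 'doc' match the input layer names from the model summary above): wrap any input, list or stream, in a batch generator and always call fit_generator, so the is_iterable branching disappears.

import numpy as np

def _batch_generator(X1, X2, y, batch_size=32):
    # yield (inputs, labels) batches forever, as Keras' fit_generator expects
    n = len(y)
    while True:
        for start in range(0, n, batch_size):
            end = start + batch_size
            yield {'query': X1[start:end], 'doc': X2[start:end]}, y[start:end]

# usage (hypothetical):
# model.fit_generator(_batch_generator(X1, X2, y, batch_size=32),
#                     steps_per_epoch=int(np.ceil(len(y) / 32.0)), epochs=2)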
