
[WIP][GSoC 2018] Similarity Learning #2050

Closed

Conversation

aneesh-joshi (Contributor) commented:

Run dssm_example.py to get a complete run of the implementations.
Work is in progress, so several features still need to be added and the code needs to be cleaned up.
This is provided as a proof of concept/demo.

Using TensorFlow backend.
2018-05-16 00:54:35,517 : MainThread : INFO : Starting Vocab Build
2018-05-16 00:54:35,518 : MainThread : INFO : Building vocab
2018-05-16 00:54:41,060 : MainThread : INFO : word vocab build complete
2018-05-16 00:54:41,061 : MainThread : INFO : Vocab Build Complete
2018-05-16 00:55:05,019 : MainThread : INFO : There are a total of 20347 query, document pairs in the dataset
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
query (InputLayer)              (None, 6061)         0
__________________________________________________________________________________________________
doc (InputLayer)                (None, 6061)         0
__________________________________________________________________________________________________
sequential_1 (Sequential)       (None, 64)           612664      query[0][0]
                                                                 doc[0][0]
__________________________________________________________________________________________________
dot_1 (Dot)                     (None, 1)            0           sequential_1[1][0]
                                                                 sequential_1[2][0]
==================================================================================================
Total params: 612,664
Trainable params: 612,664
Non-trainable params: 0
__________________________________________________________________________________________________
Epoch 1/2
20347/20347 [==============================] - 25s 1ms/step - loss: 2.4343 - acc: 0.9307
Epoch 2/2
20347/20347 [==============================] - 22s 1ms/step - loss: 0.2488 - acc: 0.9254

aneesh-joshi (Contributor, Author) commented on Jun 29, 2018:

@menshikh-iv
Please note:

  1. A verbose parameter was added, since Keras prints training stats etc. that are useful while training but negatively affect doctests.
  2. If a model has already built its vocab with a certain KeyedVectors, we cannot retrain it with another KeyedVectors if the user wants to change it, since that would mean the Keras model has to be rebuilt. Perhaps we will have to do a weight save and transfer (see the sketch after this list).
  3. I have made models.experimental into a module, since that is needed by tox -e docs and the .rst.
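
A minimal sketch of the "weight save and transfer" idea (the names old_model, new_model, build_keras_model and new_keyed_vectors are hypothetical; it assumes the rebuilt model keeps the same layer names and compatible shapes for the layers being reused):

old_model.save_weights('drmm_tks_weights.h5')     # persist learned weights before the swap
new_model = build_keras_model(new_keyed_vectors)  # hypothetical rebuild with the new embeddings
new_model.load_weights('drmm_tks_weights.h5', by_name=True)  # transfer weights layer by layer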

About the current state:

  • python -m doctest -v drmm_tks.py runs successfully
  • I made some documentation fixes and the code changes you requested

I am trying to get things reviewed early so I can stay on the right track, so some changes are still missing. I am working on:

  1. Fixing the IPYNB
  2. Random seeding

For random seeding I need help: I don't understand what you want. Could you link me to a code example or a tutorial? How should I set the seed?

I have also added the .rst, but I am not sure it is correct. Is there a way to generate the docs?

aneesh-joshi (Contributor, Author) commented on Jun 29, 2018:

With tox -e docs I get the error:

Warning, treated as error:
/home/circleci/gensim/.tox/docs/lib/python2.7/site-packages/gensim/models/experimental/drmm_tks.py:docstring of gensim.models.experimental.drmm_tks.DRMM_TKS:9:Unexpected indentation.

but I cannot see any indentation on that line. 😕

https://github.com/aneesh-joshi/gensim/blob/2e6805181518b47adf4c30e1fbc8cecde83c6e2c/gensim/models/experimental/drmm_tks.py#L9

menshikh-iv (Contributor) commented on Jun 30, 2018:

> If a model has already built its vocab with a certain KeyedVectors, we cannot retrain it with another KeyedVectors if the user wants to change it, since that would mean the Keras model has to be rebuilt. Perhaps we will have to do a weight save and transfer.

That's expected.

> For random seeding I need help: I don't understand what you want. Could you link me to a code example or a tutorial? How should I set the seed?

In your case, you need to create a random vector for each OOV word manually (not the full matrix for all OOV words at once), like this (not a perfect example, but it demonstrates what I mean):

import numpy as np

emb_size = 300
oov_words = ["hello", "world", "wow"]
matrix = []

for word in oov_words:
    # seed the RNG deterministically from the word itself, so the same OOV word
    # always gets the same random vector
    rng = np.random.RandomState(seed=abs(hash(word)) % (2 ** 32 - 1))
    matrix.append(rng.rand(emb_size))

matrix = np.array(matrix)  # use this matrix in Keras
assert matrix.shape == (len(oov_words), emb_size)
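
(One caveat beyond the original snippet: on Python 3, hash() of a string is salted per process, so the seeds above are only reproducible across runs if PYTHONHASHSEED is fixed. A checksum of the word's bytes is a stable alternative; the stable_seed helper below is hypothetical.)

import zlib

def stable_seed(word):
    # CRC32 of the word's bytes: deterministic across runs, processes and Python versions
    return zlib.crc32(word.encode('utf-8')) % (2 ** 32 - 1)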

> /home/circleci/gensim/.tox/docs/lib/python2.7/site-packages/gensim/models/experimental/drmm_tks.py:docstring of gensim.models.experimental.drmm_tks.DRMM_TKS:9:Unexpected indentation.

This points to the class docstring (not the module one), because the DRMM_TKS class is mentioned; the potentially "problematic" places look like this.

Replace SPHINXOPTS = -W with SPHINXOPTS = in gensim/docs/src/Makefile to see all the issues (not only the first one). This disables the "warnings -> errors" conversion, which is useful for seeing all Sphinx issues at once instead of one by one.

@@ -68,6 +68,7 @@ Modules:
models/deprecated/keyedvectors
models/deprecated/fasttext_wrapper
models/base_any2vec
models/experimental/drmm_tks
menshikh-iv (Contributor) commented on this diff:

You also need to include the other files in the documentation build (like callbacks, layers, etc.).

aneesh-joshi (Contributor, Author) replied:

@menshikh-iv
Please refer to the link below, which shows the diff of the requested changes:

451e3b1?utf8=%E2%9C%93&diff=unified

aneesh-joshi (Contributor, Author) replied:

Please note: tox -e docs will throw errors, not on my files but on some Keras files, since I am inheriting from the Keras Layer class, which has some unformatted docs.

menshikh-iv (Contributor) replied:

@aneesh-joshi that shouldn't happen (because you include only your files, not Keras). Can you show me a tox -e docs log that mentions the error in some Keras file (not yours)?

aneesh-joshi (Contributor, Author) replied:

/home/aneeshj/Projects/gensim/.tox/docs/local/lib/python2.7/site-packages/gensim/models/experimental/custom_layers.py:docstring of gensim.models.experimental.custom_layers.TopKLayer.add_weight:10: WARNING: Unexpected indentation.
/home/aneeshj/Projects/gensim/.tox/docs/local/lib/python2.7/site-packages/gensim/models/experimental/custom_layers.py:docstring of gensim.models.experimental.custom_layers.TopKLayer.add_weight:12: WARNING: Block quote ends without a blank line; unexpected unindent.
/home/aneeshj/Projects/gensim/.tox/docs/local/lib/python2.7/site-packages/gensim/models/experimental/custom_layers.py:docstring of gensim.models.experimental.custom_layers.TopKLayer.call:4: WARNING: Inline strong start-string without end-string.

I haven't implemented any of the above functions; I just inherited from the Layer class.

menshikh-iv (Contributor) replied on Jul 4, 2018:

Aha, looks like you are right (an issue with the parent class's docstring, which we can't control).
A simple workaround: define these methods yourself and call super (but don't worry much about it now; you have more critical tasks).
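
A minimal sketch of that workaround (the import path may vary with the Keras version, and TopKLayer here is reduced to the delegation itself): re-declare the inherited method with a clean one-line docstring and delegate to the parent class.

from keras.layers import Layer

class TopKLayer(Layer):
    def add_weight(self, *args, **kwargs):
        """Delegate to Layer.add_weight (redefined only to give Sphinx a clean docstring)."""
        return super(TopKLayer, self).add_weight(*args, **kwargs)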


The trained model needs to be trained on data in the format:

>>> queries = ["When was World War 1 fought ?".lower().split(),
menshikh-iv (Contributor) commented on this docstring:

No vertical indents (again), here and everywhere.
Also, all imports should be at the top of the examples (and please import the current model too).

aneesh-joshi (Contributor, Author) replied on Jul 2, 2018:

> No vertical indents (again)

Sorry for making you repeat yourself; I keep missing it.

>>> queries = ["how are glacier caves formed ?".lower().split()]
>>> docs = ["A partly submerged glacier cave on Perito Moreno Glacier".lower().split(),
... "A glacier cave is a cave formed within the ice of a glacier".lower().split()]

menshikh-iv (Contributor) commented:

Where is your testing?

- fixes all docs and doctest errors
- fixes requested changes in PR
"metadata": {},
"outputs": [],
"source": [
"!python experimental_data/get_data.py"
menshikh-iv (Contributor) commented on this notebook cell:

Better to place this code directly in the notebook and remove get_data.py from the repo.

"metadata": {},
"outputs": [],
"source": [
"queries = [simple_preprocess(\"how are glacier caves formed\"),\n",
menshikh-iv (Contributor) commented:

Again, no vertical indents (here and everywhere).

"skipping query-doc pair due to no words in vocab\n",
"MAP: 0.56\n",
"nDCG@1 : 0.41 \n",

menshikh-iv (Contributor) commented:

Apply this function to your NN too.

----------
test_data : dict
    A dictionary which holds the validation data. It consists of the following keys:

    - "X1" : numpy array
menshikh-iv (Contributor) commented:

Is this rendered correctly? I can't check because the current build failed.

)
for key in test_data.keys():
    if key not in ['X1', 'X2', 'y', 'doc_lengths']:
        raise ValueError("test_data dictionary doesn't have the keys: 'X1', 'X2', 'y', 'doc_lengths'")
menshikh-iv (Contributor) commented:

Incorrect check: if test_data.keys() contains the needed keys plus some additional key, this will fail.
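
A minimal sketch of a check along those lines (assuming these four keys are exactly the required ones): test for missing required keys rather than rejecting extras.

required_keys = {'X1', 'X2', 'y', 'doc_lengths'}
missing = required_keys - set(test_data.keys())
if missing:
    # complain only about what is actually absent; extra keys are tolerated
    raise ValueError("test_data dictionary is missing the keys: %s" % ', '.join(sorted(missing)))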


# get all the vocab words
for q in self.queries:
    self.word_counter.update(q)
menshikh-iv (Contributor) commented:

If I call build_vocab twice, what happens?
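
(Presumably the concern is double counting: if self.word_counter persists between calls, a second build_vocab call adds every word's count again. A sketch of one way to make the call idempotent, with a simplified signature, not the PR's actual code:)

from collections import Counter

def build_vocab(self, queries):
    self.word_counter = Counter()  # start fresh, so repeated calls don't double the counts
    for q in queries:
        self.word_counter.update(q)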

self.build_vocab(self.queries, self.docs, self.labels, self.word_embedding)

is_iterable = False
if isinstance(self.queries, Iterable) and not isinstance(self.queries, list):
menshikh-iv (Contributor) commented:

Again, is_iterable is super strange; your input is always iterable.

    loss = 'mse'
    self.model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
else:
    logger.info("Model will be retrained")
menshikh-iv (Contributor) commented:

What does "retrained" mean here? Is the model updated, or trained from scratch?

)
val_callback = [val_callback] # since `model.fit` requires a list

# If train is called again, not all values should be reset
menshikh-iv (Contributor) commented:

Which values? Can you clarify, please?

self.first_train = False

if is_iterable:
    self.model.fit_generator(train_generator, steps_per_epoch=steps_per_epoch, callbacks=val_callback,
menshikh-iv (Contributor) commented:

It should always be fit_generator, and no is_iterable.
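
A minimal sketch of that suggestion (the helper _batch_generator and the input names are hypothetical; 'query' and 'doc' match the input layer names from the model summary above): wrap any input, list or stream, in a batch generator and always call fit_generator, so the is_iterable branching disappears.

import numpy as np

def _batch_generator(X1, X2, y, batch_size=32):
    # yield (inputs, labels) batches forever, as Keras' fit_generator expects
    n = len(y)
    while True:
        for start in range(0, n, batch_size):
            end = start + batch_size
            yield {'query': X1[start:end], 'doc': X2[start:end]}, y[start:end]

# usage (hypothetical):
# model.fit_generator(_batch_generator(X1, X2, y, batch_size=32),
#                     steps_per_epoch=int(np.ceil(len(y) / 32.0)), epochs=2)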
