
Fix datatype parameter for KeyedVectors.load_word2vec_format. Fix #1682 #1819

Merged
merged 19 commits into piskvorky:develop on Feb 16, 2018

Conversation

pushpankar
Contributor

No description provided.

@menshikh-iv
Contributor

Hi @pushpankar, thanks for your PR. What about the "other cases" from #1682?
CC @jayantj, can you elaborate on the additional cases?

@menshikh-iv
Contributor

menshikh-iv commented Jan 15, 2018

Ping @pushpankar, how is it going? You also need to write additional tests to check this behavior and prevent regressions, and please merge develop from upstream.

@pushpankar
Contributor Author

Sorry, I thought my work was done. Should I write a test script in gensim/test and check different types of inputs, like float16, float32, ints, etc.?

@menshikh-iv
Contributor

@pushpankar two types are already enough; also, use the example from the original issue.

@@ -0,0 +1,3 @@
2 2
Contributor

should be in gensim/test/test_data

def test_datatype(self):
    path = os.path.join(os.path.dirname(__file__), 'test.kv.txt')
    kv = KeyedVectors.load_word2vec_format(path, datatype=np.float64)
    self.assertEqual(kv['horse.n.01'][0], -0.0008546282343595379)
Contributor

  1. You should use assertAlmostEqual instead of assertEqual for float comparisons.
  2. You also need to check the datatype.


class TestDataType(unittest.TestCase):
    def test_datatype(self):
        path = os.path.join(os.path.dirname(__file__), 'test.kv.txt')
Contributor

When you move test.kv.txt, the construction should look like

from gensim.test.utils import datapath

path = datapath('test.kv.txt')

@menshikh-iv menshikh-iv changed the title from "Fixes #1682" to "Fix datatype parameter for KeyedVectors.load_word2vec_format. Fix #1682" on Jan 16, 2018
@jayantj
Contributor

jayantj commented Jan 16, 2018

Hi @pushpankar, thanks for the PR! Apart from the suggestions @menshikh-iv mentioned, it'd also be good to have a test for word2vec files in binary format.
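
A minimal sketch of what such a binary-format test could look like (it assumes the 'horse.n.01' entry from the test vectors used elsewhere in this thread; the get_tmpfile helper from gensim.test.utils is an assumption here, not something shown in the PR):

import unittest

import numpy as np
from gensim.models.keyedvectors import KeyedVectors
from gensim.test.utils import datapath, get_tmpfile


class TestBinaryDataType(unittest.TestCase):
    def test_binary_roundtrip(self):
        # load the text vectors with high precision, save as binary, reload, compare
        kv = KeyedVectors.load_word2vec_format(datapath('test.kv.txt'), datatype=np.float64)
        binary_path = get_tmpfile('test.kv.bin')
        kv.save_word2vec_format(binary_path, binary=True)
        kv2 = KeyedVectors.load_word2vec_format(binary_path, datatype=np.float64, binary=True)
        self.assertAlmostEqual(float(kv['horse.n.01'][0]), float(kv2['horse.n.01'][0]), places=4)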

@@ -0,0 +1,18 @@
import logging
Contributor

Please add a file header, like in the other test files.
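
For reference, the headers in the existing gensim test files look roughly like this (treat the exact licence line as an illustration, not the required text):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Automated tests for the datatype parameter of KeyedVectors.load_word2vec_format.
"""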

@pushpankar
Contributor Author

Saving a model (loaded with datatype=np.float64) in binary and then reloading it raises a unicode error. I am looking for a solution. Everything else is done.

@pushpankar
Contributor Author

pushpankar commented Jan 16, 2018

Saving the model also causes a loss of precision in both text and binary formats.

@menshikh-iv
Contributor

I see many failed tests, @pushpankar, please have a look.

@pushpankar
Contributor Author

@menshikh-iv I am working on it. Changing the way binary files are loaded is what's causing it.

Contributor Author

@pushpankar pushpankar left a comment

@menshikh-iv @jayantj I need some help.

@@ -220,7 +220,7 @@ def add_word(word, weights):
        result.index2word.append(word)

    if binary:
-        binary_len = dtype(REAL).itemsize * vector_size
+        binary_len = dtype(datatype).itemsize * vector_size
Contributor Author

I am having a problem detecting the binary_len of a vector saved with a custom datatype. The only clue is that the next vector starts after a " ", but before the space comes a string (also converted to Python bytes) which can be of any length. @menshikh-iv, any suggestions?
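
To make the problem concrete, here is a rough sketch (not gensim's actual code) of the binary word2vec layout: each entry is the word, a single space, then the raw vector bytes, and only the dtype's itemsize tells the reader where those bytes end:

import io

import numpy as np

def write_entry(fout, word, vec):
    # word, a space, then the raw bytes of the vector -- no explicit length marker
    fout.write(word.encode('utf8') + b" " + vec.tobytes())

def read_entry(fin, vector_size, dtype=np.float32):
    word = b""
    while True:
        ch = fin.read(1)
        if ch == b" ":
            break
        word += ch
    # if this dtype differs from the dtype used when writing, the wrong number
    # of bytes is consumed here and every following entry is misread
    binary_len = np.dtype(dtype).itemsize * vector_size
    return word.decode('utf8'), np.frombuffer(fin.read(binary_len), dtype=dtype)

buf = io.BytesIO()
write_entry(buf, 'horse.n.01', np.zeros(2, dtype=np.float64))
buf.seek(0)
print(read_entry(buf, vector_size=2, dtype=np.float64))  # correct only because the dtypes match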

Contributor Author

@pushpankar pushpankar Jan 18, 2018

I tried adding \n at the end of each vector when saving in binary, but that broke many other tests.

Contributor

@pushpankar how does it work if datatype=REAL here?

Contributor Author

@pushpankar pushpankar Jan 19, 2018

Since floats were previously saved with only 32-bit precision, knowing the size of each vector in the binary format was easy. Casting to a lower precision is only done after the vectors are loaded.
Please also note that in the develop branch too, casting a vector to a lower precision, saving it in binary, and then loading it leads to errors. This is because float32 is assumed while loading, but the vectors were saved with a lower precision such as float16. I am adding some code to make it clearer.

import numpy as np
from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('./test_data/test.kv.txt', datatype=np.float16)
print(model['horse.n.01'][0])
model.save_word2vec_format('./test_data/test.kv.bin', binary=True)
model2 = KeyedVectors.load_word2vec_format('./test_data/test.kv.bin', datatype=np.float32, binary=True)
print(model2['horse.n.01'][0])

Gives

Traceback (most recent call last):
  File "convert2binary.py", line 7, in <module>
    print(model2['horse.n.01'][0])
  File "/home/pushpankar/gensim/gensim/models/keyedvectors.py", line 326, in __getitem__
    return self.word_vec(words)
  File "/home/pushpankar/gensim/gensim/models/keyedvectors.py", line 453, in word_vec
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'horse.n.01' not in vocabulary"

This is because float32 was assumed while reading the binary vectors, but they were originally saved with float16, so more bytes than necessary were read for every vector.
Let me know if I am not being clear enough.

Contributor

So probably the easiest solution for this case is to read/write with the REAL type and cast it before the end of the "load" process, wdyt @jayantj?
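
A minimal sketch of that idea, assuming the syn0 attribute and the REAL alias used in keyedvectors.py (an illustration of the suggestion, not the final patch):

import numpy as np
from numpy import float32 as REAL

def finalize_load(result, datatype=REAL):
    # vectors are read and accumulated as REAL (float32), so binary_len stays
    # predictable; the requested datatype is applied once, at the very end of load
    result.syn0 = result.syn0.astype(datatype)
    return result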

@pushpankar
Contributor Author

pushpankar commented Jan 20, 2018

This fixes:

  1. The precision loss issue when the model is saved in text format.
  2. The crash when loading a binary model with a precision other than float32; the precision loss issue still persists, however, if the precision is higher than float32 during both loading and saving of a binary model.

def test_text(self):
    path = datapath('test.kv.txt')
    kv = KeyedVectors.load_word2vec_format(path, binary=False,
                                           datatype=np.float64)
Contributor

What about different datatypes?

Contributor Author

Will np.float16, np.float32, and np.float64 be enough?

Contributor

Yes
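
One compact way to cover all three, sketched against the load_model helper that appears later in this review (the test name here is hypothetical, and it assumes the fixed behavior where the loaded vectors carry the requested precision):

    def test_datatypes(self):
        for dt in (np.float16, np.float32, np.float64):
            kv = self.load_model(dt)
            self.assertEqual(kv['horse.n.01'].dtype, dt)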

@@ -147,9 +147,10 @@ def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None)
        for word, vocab in sorted(iteritems(self.vocab), key=lambda item: -item[1].count):
            row = self.syn0[vocab.index]
            if binary:
+                row = row.astype(REAL)
Contributor

Why is this needed?

Contributor Author

Because

import numpy as np
from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('./test_data/test.kv.txt', datatype=np.float16)
print(model['horse.n.01'][0])
model.save_word2vec_format('./test_data/test.kv.bin', binary=True)
model2 = KeyedVectors.load_word2vec_format('./test_data/test.kv.bin', datatype=np.float32, binary=True)

this causes a crash. This is another bug that exists in the develop branch.
Steps to reproduce the error:

  1. Load the model with low precision.
  2. Save that model in binary.
  3. Then try to load the binary model.

Contributor

@jayantj jayantj Feb 16, 2018

Ah, thanks for investigating that and reporting it. Do you have any ideas why that might be occurring? The fix seems a little hack-ish and might mask other genuine problems.

@menshikh-iv
Contributor

#1777 should be merged first

@menshikh-iv
Contributor

Hello @pushpankar, can you resolve the merge conflict?

@menshikh-iv
Contributor

Good work @pushpankar, LGTM, wdyt @jayantj?


class TestDataType(unittest.TestCase):
    def load_model(self, datatype):
        path = datapath('test.kv.txt')
Contributor

@jayantj jayantj Feb 16, 2018

A slightly more descriptive name would be helpful; there's already a lot of test data and it can easily get confusing.

        model1.save_word2vec_format(binary_path, binary=True)
        model2 = KeyedVectors.load_word2vec_format(binary_path, datatype=np.float64, binary=True)
        self.assertAlmostEqual(model1["horse.n.01"][0], np.float16(model2["horse.n.01"][0]))

Contributor

Another test to verify that the type of a value in the loaded array matches the datatype passed to load_word2vec_format would explicitly confirm the new behaviour.
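
As a sketch, such a check could look like the following (binary_path is assumed to be a temporary file path, and the element-type assertions assume the fixed behavior of load_word2vec_format):

    def test_type_conversion(self):
        path = datapath('high_precision.kv.txt')
        model1 = KeyedVectors.load_word2vec_format(path, datatype=np.float16)
        model1.save_word2vec_format(binary_path, binary=True)
        model2 = KeyedVectors.load_word2vec_format(binary_path, datatype=np.float64, binary=True)
        # the element type should match the datatype passed to load_word2vec_format
        self.assertEqual(model1['horse.n.01'][0].dtype, np.float16)
        self.assertEqual(model2['horse.n.01'][0].dtype, np.float64)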

@jayantj
Contributor

jayantj commented Feb 16, 2018

LGTM too (I left some minor comments which should be taken care of, IMO), thanks for the fix @pushpankar

@menshikh-iv
Contributor

@pushpankar good work, congrats on your first contribution! 👍

@menshikh-iv menshikh-iv merged commit 205fd79 into piskvorky:develop Feb 16, 2018
sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this pull request Feb 21, 2018
…iskvorky#1682 (piskvorky#1819)

* load vector with high precision

* Test changes

* Fix flake8 error

* Fix path error

* Reformat code

* Fix precision loss issue for binary word2vec

* Fix precision loss during saving model in text format

* Fix binary file loading issue

* Test other datatypes as well.

* Test type conversion

* Fix build error

* Use better names

* Test type after conversion
Owner

@piskvorky piskvorky left a comment

Minor code style request.

class TestDataType(unittest.TestCase):
    def load_model(self, datatype):
        path = datapath('high_precision.kv.txt')
        kv = KeyedVectors.load_word2vec_format(path, binary=False,
Owner

Hanging indent please (not vertical).

Contributor Author

I guess you are talking about lines 22-23. I have merged them.
