FastTextKeyedVectors.add_vectors is not adding vectors #3388

globba · 2022-09-28T09:13:24Z

Problem description

I have been trying to create a FastTextKeyedVectors and adding vectors to it using either add_vector or add_vectors but the methods are not adding anything. After looking at the implementation of those methods, I think there is an error while checking if a key has already been added.

Steps/code/corpus to reproduce

I create a FastTextKeyedVectors using the defaults used by the FastText model, then try to add vectors to it using add_vector or add_vectors:

wv = FastTextKeyedVectors(vector_size=2, min_n=3, max_n=6, bucket=2000000)
wv.add_vector("test", [0.5, 0.5])
print(wv.key_to_index)
>> {}
print(wv.index_to_key)
>> []
print(wv.vectors)
>> []

wv.add_vectors(["test"], [[0.5, 0.5]])
print(wv.key_to_index)
>> {}
print(wv.index_to_key)
>> []
print(wv.vectors)
>> []

wv.key_to_index, wv.index_to_key and wv.vectors are all empty.

FastTextKeyedVectors is a child of KeyedVectors where the add_vector/s methods are implemented. add_vector does a few checks then calls add_vectors.
In add_vectors, there is an in_vocab_mask, which is a list of booleans indicating if a key is already present in the KeyedVectors.

in_vocab_mask = np.zeros(len(keys), dtype=bool)
        for idx, key in enumerate(keys):
            if key in self:
                in_vocab_mask[idx] = True

Since Gensim 4.0, key in wv will always return True with FastText by design. The proper way of checking if a key exists is by calling key in wv.key_to_index (See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#10-check-if-a-word-is-fully-oov-out-of-vocabulary-for-fasttext)

So replacing the above code by

in_vocab_mask = np.zeros(len(keys), dtype=bool)
        for idx, key in enumerate(keys):
            if key in self.key_to_index:
                in_vocab_mask[idx] = True

seems to fix the issue.

wv = FastTextKeyedVectors(vector_size=2, min_n=3, max_n=6, bucket=2000000)
wv.add_vectors(["test"], [[0.5, 0.5]])
print(wv.key_to_index)
>> {'test': 0}
print(wv.index_to_key)
>> ['test']
print(wv.vectors)
>> [[0.5 0.5]]

I am not sure how FastText models are able to add vectors to FastTextKeyedVectors the proper way when training without encountering this issue as I have not looked at the training code in detail.

Versions

Linux-5.10.0-17-amd64-x86_64-with-glibc2.31
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
Bits 64
NumPy 1.21.6
SciPy 1.7.3
gensim 4.2.0
FAST_VERSION 1

The text was updated successfully, but these errors were encountered:

gojomo · 2022-09-28T16:17:01Z

Pretty clearly a bug & looks like you've found the real cause. Can you make a unit test that fails before your fix, and succeeds after, and bundle that wth the fix in a PR?

The training-from-scratch path likely works because it doesn't rely on any add_vector[s]() paths, but rather allocates a new object for all needed words at once, then lets training modify them directly in-place (rather than via specific inserts/replaces). The original/basic functionality for word2vec/fasttext/etc was only fresh training from a complete corpus into a frozen set of known words – add_vector() functionality was only bolted-on, somwhat awkwardly, as a later convenience. (And, whether & when it makes good theoretical sense to be insert new vectors into a model, that weren't co-trained with the main batch, is still not under-explored/under-documented.)

globba · 2022-09-29T10:24:42Z

Pushed the fix.

For context, I found this while trying to create a subset of a trained FastTextKeyedVectors as I needed to do a FastTextKeyedVectors.most_similar() while ignoring some words from the vocabulary.

I ended up copying all the keys, vectors and other variables I needed to a new FastTextKeyedVectors which seems to work, but it looks messy and I'm not sure that will work properly if I try to train it with new words.

Having a built-in way of performing a most_similar() while ignoring a list of words from the vocab would be very convenient. It would also be more memory-efficient than having two separate instances of the KeyedVectors.

edit: Just figured out I could simply call most_similar() with topn=None, then get the indices of the words I want to keep and sort them based on similarity.

piskvorky added the bug Issue described a bug label Sep 29, 2022

globba mentioned this issue Sep 29, 2022

Fixed FastTextKeyedVectors handling in add_vector #3389

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FastTextKeyedVectors.add_vectors is not adding vectors #3388

FastTextKeyedVectors.add_vectors is not adding vectors #3388

globba commented Sep 28, 2022 •

edited

gojomo commented Sep 28, 2022

globba commented Sep 29, 2022 •

edited

FastTextKeyedVectors.add_vectors is not adding vectors #3388

FastTextKeyedVectors.add_vectors is not adding vectors #3388

Comments

globba commented Sep 28, 2022 • edited

Problem description

Steps/code/corpus to reproduce

Versions

gojomo commented Sep 28, 2022

globba commented Sep 29, 2022 • edited

globba commented Sep 28, 2022 •

edited

globba commented Sep 29, 2022 •

edited