Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FastTextKeyedVectors.add_vectors is not adding vectors #3388

Open
globba opened this issue Sep 28, 2022 · 2 comments
Open

FastTextKeyedVectors.add_vectors is not adding vectors #3388

globba opened this issue Sep 28, 2022 · 2 comments
Labels
bug Issue described a bug

Comments

@globba
Copy link
Contributor

globba commented Sep 28, 2022

Problem description

I have been trying to create a FastTextKeyedVectors and adding vectors to it using either add_vector or add_vectors but the methods are not adding anything. After looking at the implementation of those methods, I think there is an error while checking if a key has already been added.

Steps/code/corpus to reproduce

I create a FastTextKeyedVectors using the defaults used by the FastText model, then try to add vectors to it using add_vector or add_vectors:

wv = FastTextKeyedVectors(vector_size=2, min_n=3, max_n=6, bucket=2000000)
wv.add_vector("test", [0.5, 0.5])
print(wv.key_to_index)
>> {}
print(wv.index_to_key)
>> []
print(wv.vectors)
>> []

wv.add_vectors(["test"], [[0.5, 0.5]])
print(wv.key_to_index)
>> {}
print(wv.index_to_key)
>> []
print(wv.vectors)
>> []

wv.key_to_index, wv.index_to_key and wv.vectors are all empty.

FastTextKeyedVectors is a child of KeyedVectors where the add_vector/s methods are implemented. add_vector does a few checks then calls add_vectors.
In add_vectors, there is an in_vocab_mask, which is a list of booleans indicating if a key is already present in the KeyedVectors.

in_vocab_mask = np.zeros(len(keys), dtype=bool)
        for idx, key in enumerate(keys):
            if key in self:
                in_vocab_mask[idx] = True

Since Gensim 4.0, key in wv will always return True with FastText by design. The proper way of checking if a key exists is by calling key in wv.key_to_index (See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#10-check-if-a-word-is-fully-oov-out-of-vocabulary-for-fasttext)

So replacing the above code by

in_vocab_mask = np.zeros(len(keys), dtype=bool)
        for idx, key in enumerate(keys):
            if key in self.key_to_index:
                in_vocab_mask[idx] = True

seems to fix the issue.

wv = FastTextKeyedVectors(vector_size=2, min_n=3, max_n=6, bucket=2000000)
wv.add_vectors(["test"], [[0.5, 0.5]])
print(wv.key_to_index)
>> {'test': 0}
print(wv.index_to_key)
>> ['test']
print(wv.vectors)
>> [[0.5 0.5]]

I am not sure how FastText models are able to add vectors to FastTextKeyedVectors the proper way when training without encountering this issue as I have not looked at the training code in detail.

Versions

Linux-5.10.0-17-amd64-x86_64-with-glibc2.31
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
Bits 64
NumPy 1.21.6
SciPy 1.7.3
gensim 4.2.0
FAST_VERSION 1

@gojomo
Copy link
Collaborator

gojomo commented Sep 28, 2022

Pretty clearly a bug & looks like you've found the real cause. Can you make a unit test that fails before your fix, and succeeds after, and bundle that wth the fix in a PR?

The training-from-scratch path likely works because it doesn't rely on any add_vector[s]() paths, but rather allocates a new object for all needed words at once, then lets training modify them directly in-place (rather than via specific inserts/replaces). The original/basic functionality for word2vec/fasttext/etc was only fresh training from a complete corpus into a frozen set of known words – add_vector() functionality was only bolted-on, somwhat awkwardly, as a later convenience. (And, whether & when it makes good theoretical sense to be insert new vectors into a model, that weren't co-trained with the main batch, is still not under-explored/under-documented.)

@globba
Copy link
Contributor Author

globba commented Sep 29, 2022

Pushed the fix.

For context, I found this while trying to create a subset of a trained FastTextKeyedVectors as I needed to do a FastTextKeyedVectors.most_similar() while ignoring some words from the vocabulary.

I ended up copying all the keys, vectors and other variables I needed to a new FastTextKeyedVectors which seems to work, but it looks messy and I'm not sure that will work properly if I try to train it with new words.

Having a built-in way of performing a most_similar() while ignoring a list of words from the vocab would be very convenient. It would also be more memory-efficient than having two separate instances of the KeyedVectors.

edit: Just figured out I could simply call most_similar() with topn=None, then get the indices of the words I want to keep and sort them based on similarity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug
Projects
None yet
Development

No branches or pull requests

3 participants