I have been trying to create a FastTextKeyedVectors and adding vectors to it using either add_vector or add_vectors but the methods are not adding anything. After looking at the implementation of those methods, I think there is an error while checking if a key has already been added.
Steps/code/corpus to reproduce
I create a FastTextKeyedVectors using the defaults used by the FastText model, then try to add vectors to it using add_vector or add_vectors:
wv.key_to_index, wv.index_to_key and wv.vectors are all empty.
FastTextKeyedVectors is a child of KeyedVectors where the add_vector/s methods are implemented. add_vector does a few checks then calls add_vectors.
In add_vectors, there is an in_vocab_mask, which is a list of booleans indicating if a key is already present in the KeyedVectors.
in_vocab_mask = np.zeros(len(keys), dtype=bool)
for idx, key in enumerate(keys):
    if key in self:
        in_vocab_mask[idx] = True
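To see why a check like this can misfire, here is a minimal sketch that does not use Gensim at all. ToyVectors is a hypothetical stand-in whose membership test always returns True. That behaviour is an assumption, modelled on a FastText-style store that can synthesize a vector for any word from character n-grams. The mask is a plain list rather than a NumPy array.

```python
class ToyVectors:
    """Hypothetical stand-in for a keyed-vector store; not Gensim's class."""

    def __init__(self):
        self.key_to_index = {}
        self.index_to_key = []

    def __contains__(self, key):
        # Assumption: any key can be synthesized, so membership is always True.
        return True

    def add_keys(self, keys):
        # Same shape as the check quoted above: mark keys considered present.
        in_vocab_mask = [key in self for key in keys]
        for key, present in zip(keys, in_vocab_mask):
            if not present:
                # Only keys flagged as new are ever stored.
                self.key_to_index[key] = len(self.index_to_key)
                self.index_to_key.append(key)
        return in_vocab_mask


wv = ToyVectors()
mask = wv.add_keys(["cat", "dog"])
print(mask)             # every entry is True
print(wv.index_to_key)  # still empty: nothing was treated as new
```

Because every key is reported as already present, the branch that stores new keys never runs, which matches the empty key_to_index and index_to_key described above.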
I am not sure how FastText models are able to add vectors to FastTextKeyedVectors the proper way when training without encountering this issue as I have not looked at the training code in detail.
Pretty clearly a bug & looks like you've found the real cause. Can you make a unit test that fails before your fix, and succeeds after, and bundle that with the fix in a PR?
The training-from-scratch path likely works because it doesn't rely on any add_vector[s]() paths, but rather allocates a new object for all needed words at once, then lets training modify them directly in-place (rather than via specific inserts/replaces). The original/basic functionality for word2vec/fasttext/etc was only fresh training from a complete corpus into a frozen set of known words – add_vector() functionality was only bolted-on, somewhat awkwardly, as a later convenience. (And, whether & when it makes good theoretical sense to insert new vectors into a model, that weren't co-trained with the main batch, is still under-explored/under-documented.)
For context, I found this while trying to create a subset of a trained FastTextKeyedVectors as I needed to do a FastTextKeyedVectors.most_similar() while ignoring some words from the vocabulary.
I ended up copying all the keys, vectors and other variables I needed to a new FastTextKeyedVectors which seems to work, but it looks messy and I'm not sure that will work properly if I try to train it with new words.
Having a built-in way of performing a most_similar() while ignoring a list of words from the vocab would be very convenient. It would also be more memory-efficient than having two separate instances of the KeyedVectors.
edit: Just figured out I could simply call most_similar() with topn=None, then get the indices of the words I want to keep and sort them based on similarity.
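The workaround in the edit can be sketched roughly as follows. The helper name most_similar_excluding is hypothetical, and the sketch assumes sims is the array of similarities returned by most_similar(..., topn=None), aligned position by position with index_to_key. Plain lists stand in for the real objects here.

```python
def most_similar_excluding(sims, index_to_key, exclude, topn=10):
    """Rank keys by similarity, skipping any key listed in `exclude`.

    Hypothetical helper: `sims[i]` is assumed to be the similarity of
    `index_to_key[i]` to the query.
    """
    exclude = set(exclude)
    kept = [
        (key, float(sim))
        for key, sim in zip(index_to_key, sims)
        if key not in exclude
    ]
    # Highest similarity first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:topn]


# Toy data standing in for wv.index_to_key and wv.most_similar(..., topn=None).
keys = ["cat", "dog", "car", "kitten"]
sims = [1.0, 0.8, 0.1, 0.9]
result = most_similar_excluding(sims, keys, exclude={"cat", "car"}, topn=2)
print(result)
```

This avoids building a second KeyedVectors instance. Since the full similarity array is not filtered in any way, the query words themselves may need to go into exclude as well.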
Problem description
I have been trying to create a FastTextKeyedVectors and adding vectors to it using either add_vector or add_vectors but the methods are not adding anything. After looking at the implementation of those methods, I think there is an error while checking if a key has already been added.
Steps/code/corpus to reproduce
I create a FastTextKeyedVectors using the defaults used by the FastText model, then try to add vectors to it using add_vector or add_vectors:
wv.key_to_index, wv.index_to_key and wv.vectors are all empty.
FastTextKeyedVectors is a child of KeyedVectors where the add_vector/s methods are implemented. add_vector does a few checks then calls add_vectors.
In add_vectors, there is an in_vocab_mask, which is a list of booleans indicating if a key is already present in the KeyedVectors.
Since Gensim 4.0, key in wv will always return True with FastText by design. The proper way of checking if a key exists is by calling key in wv.key_to_index (see https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#10-check-if-a-word-is-fully-oov-out-of-vocabulary-for-fasttext).
So replacing key in self with key in self.key_to_index in the above code seems to fix the issue.
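A minimal sketch of the corrected check, written without Gensim or NumPy so that it runs on its own. build_in_vocab_mask is a hypothetical helper name, and key_to_index is assumed to be an ordinary dict mapping stored keys to row indices.

```python
def build_in_vocab_mask(keys, key_to_index):
    """Return one boolean per key: True only if the key is actually stored.

    Hypothetical helper; it looks keys up in the mapping directly instead of
    relying on the container's `in` operator.
    """
    return [key in key_to_index for key in keys]


key_to_index = {"cat": 0}
mask = build_in_vocab_mask(["cat", "dog"], key_to_index)
print(mask)  # [True, False]
```

With the lookup going through the mapping, "dog" is correctly reported as new and would be appended rather than silently skipped.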
I am not sure how FastText models are able to add vectors to FastTextKeyedVectors the proper way when training without encountering this issue as I have not looked at the training code in detail.
Versions
Linux-5.10.0-17-amd64-x86_64-with-glibc2.31
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
Bits 64
NumPy 1.21.6
SciPy 1.7.3
gensim 4.2.0
FAST_VERSION 1