Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doc2VecKeyedVectors doesn't effectively support __setitem__()/add() #2683

Open
gojomo opened this issue Nov 21, 2019 · 2 comments
Open

Doc2VecKeyedVectors doesn't effectively support __setitem__()/add() #2683

gojomo opened this issue Nov 21, 2019 · 2 comments
Labels
bug Issue described a bug feature Issue described a new feature

Comments

@gojomo
Copy link
Collaborator

gojomo commented Nov 21, 2019

Per user report on SO, neither assignment to a bracketed-access (as would be implemented by __setitem__()) nor use of the add() method will successfully mutate a Doc2VecKeyedVectors object.

Looking closer, it seems the superclass __setItem__() passes through to superclass add(), which was only ever implemented for word-centric sets of vectors – consulting/updating properties like .vocab that only exist as empty values in Doc2VecKeyedVectors because of the currently confused inheritance created by #1777.

@mpenkov mpenkov added the feature Issue described a new feature label Dec 21, 2019
@gojomo gojomo added the bug Issue described a bug label Dec 22, 2019
@ThijsKranenburg
Copy link

As an addition to the SO post, I want to add new documents to the model.

It seems this should be done with the add() method, but since this is not working I figured the following work-around out:

model = Doc2Vec.load(PATH_to_model)

# Add vector and identifier to original values
model.docvecs.vectors_docs =  np.vstack([model.docvecs.vectors_docs, new_vec])
model.docvecs.index2entity.append(new_identifier)

# Test if new document is included
model.docvecs.most_similar(positive = [new_vec])

Calling the most_similar() method returns results including this new document, also after saving and loading the model. So it seems to work.

My question is whether this is a 'correct' way of working around this bug, or if I am missing something.

@gojomo
Copy link
Collaborator Author

gojomo commented Jan 14, 2020

@ThijsKranenburg - If it works for your purposes, it's good enough! Note though you've not yet done enough to look-up the new vectors by identifier – that's also require adding entries to the model.docvecs.doctags dict. And the possible effects of such a workaround on any further training are unclear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug feature Issue described a new feature
Projects
None yet
Development

No branches or pull requests

3 participants