This repository was archived by the owner on Jul 28, 2025. It is now read-only.

Fix pipeline for creating vocab with gensim 4.1.2#156

Merged
w-is-h merged 1 commit into CogStack:master from umcu:fix_create_vocab
Oct 21, 2021
Conversation

sandertan (Contributor) commented Oct 21, 2021

Hi @w-is-h, gensim 4.1.2, the version used by MedCAT 1.2.0, changed the Word2Vec() class, which caused MedCAT to crash while creating the vocabulary. Our medcat-model-creator integration test picked this up. I'll try to create a separate PR next week to port this creator and its tests to MedCAT, as discussed in #116.

I renamed the parameters in add_vectors() to match the names in Word2Vec. Let me know if you'd like to keep the old MedCAT names.

Changes in Word2Vec

  File "/Users/stan3/Data/MedCAT/medcat/utils/make_vocab.py", line 149, in add_vectors
    for word in w2v.wv.vocab.keys():
  File "/Users/stan3/miniconda3/envs/medcat-1-2-0/lib/python3.8/site-packages/gensim/models/keyedvectors.py", line 661, in vocab
    raise AttributeError(
AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

Also see:
https://github.com/RaRe-Technologies/gensim/blob/5bec27767ad40712e8912d53a896cb2282c33880/gensim/models/word2vec.py#L322

and:
https://github.com/RaRe-Technologies/gensim/blob/5bec27767ad40712e8912d53a896cb2282c33880/gensim/models/word2vec.py#L273

@sandertan (Contributor, Author)

Also, I'm not sure whether the class SpacyHFTok is still being used, but it also contained an incompatibility with the new gensim, so I updated it.

@sandertan (Contributor, Author)

The tutorials also need an update for the new Word2Vec parameter names, e.g. size is now vector_size.
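For tutorial code that still passes the old keyword names, the rename can be sketched as a small translation step. This is an illustration only: modernize_w2v_kwargs and RENAMED_W2V_PARAMS are hypothetical names, and the mapping covers just two renames from the Gensim 4 migration guide (size to vector_size, iter to epochs), not every change.

```python
# Old Gensim 3.x keyword -> Gensim 4 keyword (assumed subset, not exhaustive).
RENAMED_W2V_PARAMS = {
    "size": "vector_size",
    "iter": "epochs",
}


def modernize_w2v_kwargs(kwargs):
    """Return a copy of `kwargs` with old Word2Vec keyword names renamed."""
    # Keywords that were not renamed pass through unchanged.
    return {
        RENAMED_W2V_PARAMS.get(name, name): value
        for name, value in kwargs.items()
    }


if __name__ == "__main__":
    old_style = {"size": 300, "window": 5, "iter": 2}
    print(modernize_w2v_kwargs(old_style))
```

The result could then be passed on as Word2Vec(sentences, **new_kwargs). In a tutorial it is probably simpler to edit the keyword names directly.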

@w-is-h (Collaborator) left a comment

Looks good, thanks for the fix

w-is-h (Collaborator) commented Oct 21, 2021

Also, I'm not sure whether class SpacyHFTok is still being used, but it also contained an incompatibility with the new gensim so I updated it.

It's not used anymore. We have to clean up some things, which will be done soon.

@sandertan sandertan deleted the fix_create_vocab branch October 25, 2021 14:06
mart-r pushed a commit to mart-r/MedCAT that referenced this pull request Jun 14, 2023
Fix pipeline for creating vocab with gensim 4.1.2
alhendrickson pushed a commit to CogStack/cogstack-nlp that referenced this pull request Jul 1, 2025
Fix pipeline for creating vocab with gensim 4.1.2