Simplify vectorizer #138
Conversation
Also, how do we handle
UNK should be a special symbol which is then randomly initialized in the vectorizer.
Should the vectorizer specifically check whether the vocab has those symbols and then generate random vectors? Also, that would not be reproducible: for the same dataset, loading the vectors twice could yield two different embeddings for those symbols.
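The reproducibility concern above could be addressed by deriving the random draw from the token itself, so repeated loads of the same dataset produce identical embeddings. A hypothetical sketch (the function name and hashing scheme are assumptions, not the project's API):

```python
import numpy as np

def deterministic_vector(token: str, dim: int = 300) -> np.ndarray:
    """Draw a random vector whose value depends only on the token,
    so repeated loads yield identical embeddings for special symbols."""
    # Seed a local RNG from a stable byte encoding of the token
    # (illustrative scheme; any stable hash would do).
    seed = int.from_bytes(token.encode("utf-8"), "little") % (2**32)
    rng = np.random.RandomState(seed)
    return rng.normal(size=dim)

v1 = deterministic_vector("<unk>", dim=8)
v2 = deterministic_vector("<unk>", dim=8)
assert np.allclose(v1, v2)  # same token, same vector across loads
```

Note that Python's built-in `hash()` is randomized per process for strings, so a stable encoding like the one above is needed for cross-run reproducibility.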
Re 1.: if the vectorizer doesn't find a vector for a word in the vocab, it should randomly initialize it. This is done in
@ivansmokovic added doc notes. From Python 3.6 onward (https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes), dictionaries preserve insertion order; since we support 3.6+, there is no need for OrderedDict or similar. Also: the default random vector for words not found in the vector file is now drawn from the normal distribution (as is standard practice).
Minor additions required.
@ivansmokovic revised wrt comments
LGTM
LGTM
@mttk Feel free to merge.
load_vocab(vocab) for a non-None vocab now returns the embedding matrix by default, to avoid the second call. Also, maybe create a custom classmethod which only returns the embedding matrix:
embeddings = GloVe.for_vocab(vocab)
which would mask the constructor call.
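The proposed convenience classmethod could look like the following. This is only a sketch of the suggested API shape; the class body here is a stand-in, not the project's actual GloVe implementation:

```python
import numpy as np

class GloVe:
    """Illustrative stand-in for the vectorizer class."""

    def __init__(self, dim=300):
        self.dim = dim

    def load_vocab(self, vocab):
        # For a non-None vocab, return the embedding matrix directly
        # (here every word is randomly initialized, as a placeholder
        # for looking vectors up in the vector file).
        rng = np.random.default_rng()
        return np.stack([rng.normal(size=self.dim) for _ in vocab])

    @classmethod
    def for_vocab(cls, vocab, **kwargs):
        """Construct the vectorizer and return only the embedding
        matrix, masking the constructor + load_vocab two-step call."""
        return cls(**kwargs).load_vocab(vocab)

embeddings = GloVe.for_vocab(["the", "cat"], dim=4)
assert embeddings.shape == (2, 4)
```

The classmethod keeps the common case a one-liner while leaving the explicit constructor available for callers that need the vectorizer object itself.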