Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify vectorizer #138

Merged
merged 6 commits into from
Mar 3, 2020
Merged

Simplify vectorizer #138

merged 6 commits into from
Mar 3, 2020

Conversation

mttk
Copy link
Member

@mttk mttk commented Feb 24, 2020

From

vectorizer = GloVe()
vectorizer.load_vocab(vocab)
embeddings = vectorizer.get_embedding_matrix(vocab)

to

embeddings = GloVe().load_vocab(vocab)

load_vocab(vocab) for a non-None vocab now returns the embedding matrix by default to avoid the second call.

Also, maybe create a custom classmethod which only returns the embedding matrix:
embeddings = GloVe.for_vocab(vocab) which would mask the constructor call.

@ivansmokovic
Copy link
Collaborator

ivansmokovic commented Feb 24, 2020

Also, how do we handle <UNK> and '' in vectorizers. Should we let users define that in their custom_numericalize?

@mttk
Copy link
Member Author

mttk commented Feb 24, 2020

UNK should be a special symbol which is then randomly initialized in the vectorizer.

@ivansmokovic
Copy link
Collaborator

UNK should be a special symbol which is then randomly initialized in the vectorizer.e

Should vectorizer specifically check if vocab has those symbols and then generate randoms? Also, that would not be reproducible because for the same dataset if you load vectors two times you could get two different embeddings for those symbols.

@mttk
Copy link
Member Author

mttk commented Feb 24, 2020

Re 1.: if the vectorizer doesn't find a vector for a word in the vocab, it should randomly initialize it. This is done in token_to_vector (but currently initialized to zeros by default).
It is also reproducible given a fixed random seed, which is fine.

@mttk
Copy link
Member Author

mttk commented Feb 27, 2020

@ivansmokovic added doc notes, it seems that from python 3.6 and above (https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes) the insertion order is preserved by dictionaries. Since we support 3.6+, there is no need for OrderedDict or sth alike.

Also: set the default random vector for words not found in the vector file to be drawn from the normal distribution (as is standard practice).

Copy link
Collaborator

@ivansmokovic ivansmokovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor additions required.

takepod/storage/vectorizers/vectorizer.py Show resolved Hide resolved
takepod/storage/vectorizers/vectorizer.py Show resolved Hide resolved
takepod/storage/vectorizers/vectorizer.py Outdated Show resolved Hide resolved
@mttk
Copy link
Member Author

mttk commented Feb 28, 2020

@ivansmokovic revised wrt comments

Copy link
Collaborator

@sskudar sskudar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@ivansmokovic ivansmokovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ivansmokovic
Copy link
Collaborator

@mttk Feel free to merge.

@mttk mttk merged commit f8429a5 into master Mar 3, 2020
@mttk mttk deleted the simplify_vectorizer branch January 4, 2021 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants