Simplify vectorizer #138

mttk · 2020-02-24T13:46:08Z

From

vectorizer = GloVe()
vectorizer.load_vocab(vocab)
embeddings = vectorizer.get_embedding_matrix(vocab)

to

embeddings = GloVe().load_vocab(vocab)

load_vocab(vocab) for a non-None vocab now returns the embedding matrix by default to avoid the second call.

Also, maybe add a classmethod that returns only the embedding matrix:
embeddings = GloVe.for_vocab(vocab)
which would hide the constructor call.
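
The proposed call shapes can be sketched with a toy stand-in. `ToyVectorizer`, its in-memory vector table, and the zero fallback are assumptions made for illustration; this is not the takepod `GloVe` implementation, and only the names `load_vocab`, `get_embedding_matrix`, and `for_vocab` come from the description above.

```python
import numpy as np


class ToyVectorizer:
    """Hypothetical stand-in for a vectorizer; not the takepod GloVe class."""

    def __init__(self, vectors=None, dim=3):
        # Assumption: vectors live in a plain dict instead of a file on disk.
        self.dim = dim
        self._source = vectors or {"cat": np.ones(dim), "dog": np.full(dim, 2.0)}
        self._loaded = {}

    def load_vocab(self, vocab=None):
        """Load vectors; return the embedding matrix when a vocab is given."""
        if vocab is None:
            self._loaded = dict(self._source)
            return None
        self._loaded = {w: self._source[w] for w in vocab if w in self._source}
        return self.get_embedding_matrix(vocab)

    def get_embedding_matrix(self, vocab):
        # Words without a vector fall back to zeros in this sketch.
        return np.vstack([self._loaded.get(w, np.zeros(self.dim)) for w in vocab])

    @classmethod
    def for_vocab(cls, vocab, **kwargs):
        """Build a vectorizer and return only the embedding matrix."""
        return cls(**kwargs).load_vocab(vocab)


vocab = ["cat", "dog", "bird"]
embeddings = ToyVectorizer.for_vocab(vocab)  # one call, one row per vocab word
```

The classmethod suits callers who only need the matrix; anyone who wants to keep the vectorizer around can still construct it and call `load_vocab` directly.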

takepod/storage/vectorizers/vectorizer.py

ivansmokovic · 2020-02-24T15:29:36Z

Also, how do we handle <UNK> and '' in vectorizers? Should we let users define that in their custom_numericalize?

mttk · 2020-02-24T15:32:21Z

UNK should be a special symbol which is then randomly initialized in the vectorizer.

ivansmokovic · 2020-02-24T15:35:17Z

> UNK should be a special symbol which is then randomly initialized in the vectorizer.

Should the vectorizer explicitly check whether the vocab contains those symbols and then generate random vectors for them? Also, that would not be reproducible: loading the vectors twice for the same dataset could give two different embeddings for those symbols.

mttk · 2020-02-24T15:40:27Z

Re 1.: if the vectorizer doesn't find a vector for a word in the vocab, it should randomly initialize one. This is done in token_to_vector (which currently initializes to zeros by default).
It is also reproducible given a fixed random seed, which is fine.
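
A minimal sketch of the reproducibility point, assuming a locally seeded NumPy generator. `init_missing` is a hypothetical helper written for this example, not the library's `token_to_vector`.

```python
import numpy as np


def init_missing(vocab, known, dim, seed):
    """Return one row per word; words without a vector get seeded random rows."""
    rng = np.random.RandomState(seed)  # fixed seed -> identical draws on every call
    rows = []
    for word in vocab:
        if word in known:
            rows.append(known[word])
        else:
            rows.append(rng.uniform(-1.0, 1.0, size=dim))
    return np.vstack(rows)


known = {"cat": np.ones(4)}
vocab = ["<UNK>", "cat"]

# Loading twice with the same seed yields the same vector for <UNK>.
first = init_missing(vocab, known, dim=4, seed=42)
second = init_missing(vocab, known, dim=4, seed=42)
```

Note that the draws also depend on the order in which missing words are visited, so the vocab order has to be stable for the result to be reproducible.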

mttk · 2020-02-27T11:39:50Z

@ivansmokovic added doc notes. It seems that from Python 3.6 onward (https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes) dictionaries preserve insertion order (a CPython implementation detail in 3.6, guaranteed by the language from 3.7). Since we support 3.6+, there is no need for OrderedDict or something similar.
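
A small illustration of the dictionary point, using a hypothetical `stoi` mapping rather than the library's vocab class. On CPython 3.6 and on any Python 3.7+ a plain dict iterates in insertion order.

```python
# Build a token -> index mapping with a plain dict.
tokens = ["<UNK>", "the", "cat", "sat"]
stoi = {}
for token in tokens:
    stoi.setdefault(token, len(stoi))

# Iteration follows insertion order, so the row order of an
# embedding matrix built from this dict is deterministic.
itos = list(stoi)
```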

Also: set the default random vector for words not found in the vector file to be drawn from the normal distribution (as is standard practice).
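
The change of default can be sketched as a pluggable fallback function. All names below (`zeros_default_vector`, `random_normal_default_vector`, `default_vector_function`, and this standalone `token_to_vector`) are hypothetical and chosen for illustration; the actual signatures in vectorizer.py may differ.

```python
import numpy as np


def zeros_default_vector(token, dim):
    """Fallback that returns an all-zero vector."""
    return np.zeros(dim)


def random_normal_default_vector(token, dim):
    """Fallback that draws each component from a standard normal distribution."""
    return np.random.randn(dim)


def token_to_vector(token, vectors, dim,
                    default_vector_function=random_normal_default_vector):
    """Look up a token; use the fallback when the vector file had no entry."""
    if token in vectors:
        return vectors[token]
    return default_vector_function(token, dim)


np.random.seed(0)  # fixed seed keeps the random fallback reproducible
vectors = {"cat": np.ones(5)}
unk_vector = token_to_vector("<UNK>", vectors, dim=5)
zero_vector = token_to_vector("<UNK>", vectors, dim=5,
                              default_vector_function=zeros_default_vector)
```

Passing the fallback as a parameter keeps the zero initialization available to callers who prefer it.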

ivansmokovic

Minor additions required.

takepod/storage/vectorizers/vectorizer.py

mttk · 2020-02-28T12:09:35Z

@ivansmokovic revised wrt comments

sskudar

LGTM

ivansmokovic

LGTM

ivansmokovic · 2020-03-03T13:04:00Z

@mttk Feel free to merge.

mttk added 2 commits February 24, 2020 14:35

QOL: simplify vectorizer

fa44a1f

QOL: simplify vectorizer

f76f4ce

mttk requested review from ivansmokovic and sskudar February 24, 2020 13:46

ivansmokovic reviewed Feb 24, 2020

mttk added 2 commits February 27, 2020 12:34

Doc

5443d89

Add random normal init + set it as default

bfd0edf

flake fix

1a7705f

ivansmokovic requested changes Feb 28, 2020

Typos

1544b1e

sskudar approved these changes Mar 2, 2020

mttk requested a review from ivansmokovic March 2, 2020 15:44

ivansmokovic assigned mttk Mar 3, 2020

ivansmokovic approved these changes Mar 3, 2020

mttk merged commit f8429a5 into master Mar 3, 2020

FilipBolt mentioned this pull request Mar 3, 2020

Update BiLSTM Chain CRF model to make it pickleable #132

Merged

mttk deleted the simplify_vectorizer branch January 4, 2021 14:03
