
add bert base class and restructure encoding extractor #383

Merged
92 commits merged on Apr 27, 2020

Conversation

@rbroc (Collaborator) commented Feb 27, 2020

Restructures the BERT extractors so that there is a general PretrainedBert base class from which both the encoding extractor and the masked language model extractor inherit.

@rbroc (Collaborator, author) commented Mar 12, 2020

@adelavega @tyarkoni this is now at a stage where it would benefit from some input, both on the general structure and on a few more specific points:

  • All models have (a good part of) the __init__ method plus the preprocess, extract and to_df methods in common. However, I have the feeling that there might be simpler ways of doing this; input on how to simplify is very welcome.
  • Right now, there are three Bert extractors: 1) BertEncoding, which returns token-level encodings (excluding special tokens); 2) BertSequence, which extracts sequence-level encodings (the CLS token, the SEP token, or pooled token-level encodings); 3) BertLM, which extracts probabilities for each vocabulary token at a specific masked position. What they do differently is mainly the post-processing of the model output, which also means we could potentially add any BERT-based fine-tuned model, including user-defined models (a rough sketch of the shared structure follows this list). I think the sentiment ones could be especially cool to have, but this deserves a separate discussion (more later today).
  • I still have to test this, but these extractors could easily be made general enough to also work with Bert-like models such as DistilBert, Roberta, and Albert (the only difference being the dimensionality of the encodings you get back).
  • Re: BertLMExtractor (the language model one). As the extractor is set up right now, the user specifies which items to mask via the mask argument (which can be a word or an index) at __init__. It would be nicer to do that at extract time, but that makes logging the mask a tad more complicated.
  • I've found myself wondering a few times about what to do with additional information that is neither in log_attributes nor encoded in the Stim object, but could still be useful to return. This is metadata such as the encoded tokens themselves in the encoding extractor (not strictly speaking features, but still nice to keep in the dataset in most cases), or the sequence the masked token belongs to in BertLMExtractor (i.e. the whole input sequence). Do we want to consider a more pliers-general fix for these cases?
  • This latter problem is solved, right now, by setting stims.name to the input sequence whenever stims.name is not defined, so that the user can retrieve it from the result dataframe if metadata=True.
  • Another major question is whether, where, and how to implement the Extractor that computes entropy and/or other metrics on the probability distributions. Ale suggests a solution like this, potentially generalizable to functions other than entropy and to all extractors that output probability distributions. Another extractor of this type could be one that computes autoregressive metrics on the result of a given extractor, e.g. how correlated the encodings of two successive words are.
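
To make the shared structure concrete, here is a rough sketch of the hierarchy (class names, method bodies and slicing details are simplified placeholders, not a literal copy of the code in this PR):

```python
from transformers import BertModel, BertTokenizer


class PretrainedBertBase:
    """Shared logic: loading the model and tokenizer, preprocessing, framing output."""

    def __init__(self, pretrained_model='bert-base-uncased'):
        self.tokenizer = BertTokenizer.from_pretrained(pretrained_model)
        self.model = BertModel.from_pretrained(pretrained_model)

    def _preprocess(self, text):
        # Turn the input sequence into model-ready token ids.
        return self.tokenizer.encode(text, return_tensors='pt')

    def _postprocess(self, output):
        # Subclasses decide what to keep from the raw model output.
        raise NotImplementedError


class BertEncodingExtractor(PretrainedBertBase):
    def _postprocess(self, output):
        # Token-level encodings, excluding the [CLS] and [SEP] special tokens.
        return output[0][:, 1:-1, :]


class BertSequenceExtractor(PretrainedBertBase):
    def _postprocess(self, output):
        # Sequence-level encoding: here simply the [CLS] token's hidden state.
        return output[0][:, 0, :]


class BertLMExtractor(PretrainedBertBase):
    def _postprocess(self, output):
        # Probabilities over the vocabulary for the masked position
        # (would use a masked-LM head rather than the plain encoder).
        raise NotImplementedError
```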

@tyarkoni (Collaborator) left a comment

Looks good overall; see comments.

(Nine inline review comments on pliers/extractors/text.py, since resolved.)
@tyarkoni (Collaborator) commented

Thoughts on your questions:

All models have (a good part of) the __init__ method plus the preprocess, extract and to_df methods in common. However, I have the feeling that there might be simpler ways of doing this; input on how to simplify is very welcome.

Looks pretty sensible to me. One thing you could potentially do is add a slimmer, abstract (i.e., uses ABCMeta as metaclass) BertBase class that provides just a skeleton of required methods (_preprocess, _postprocess, etc.). That would add some code but might increase clarity in understanding the logical structure of these classes. But it's probably not worth it so long as there are only 3 classes in a single hierarchy.
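
For illustration, the abstract skeleton might look something like this (purely a sketch, not a proposal for the exact method set):

```python
from abc import ABCMeta, abstractmethod


class BertBase(metaclass=ABCMeta):
    """Abstract skeleton: subclasses must implement these hooks."""

    @abstractmethod
    def _preprocess(self, stim):
        """Convert the input stim into model-ready tensors."""

    @abstractmethod
    def _postprocess(self, output):
        """Convert raw model output into feature values."""

    @abstractmethod
    def _to_df(self, result):
        """Convert an ExtractorResult into a pandas DataFrame."""
```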

I still have to test this, but these extractors could easily be made general enough to also work with Bert-like models such as DistilBert, Roberta, and Albert (the only difference being the dimensionality of the encodings you get back).

That would be neat if it's very straightforward. If it isn't, we can hold off until after we've used these classes a bit and have a sense of how we like the API.

Re: BertLMExtractor (the language model one). As the extractor is set up right now, the user specifies which items to mask via the mask argument (which can be a word or an index) at __init__. It would be nicer to do that at extract time, but that makes logging the mask a tad more complicated.

Yeah, this problem has come up elsewhere. Passing additional arguments to extract() is a non-starter; it's a stipulation of the whole package that we avoid doing that (because there's no way to flow arguments through a transformer graph at the moment). If the concern is efficiency (i.e., it sucks to have to reinitialize every time just to change the mask), a good compromise that doesn't break the intended API is to add an update_mask() method or something like that.
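
Something along these lines, say (a rough sketch of the mask handling only, not the final signature or validation logic):

```python
class BertLMExtractor:
    """Masked-language-model extractor (only the mask handling is shown)."""

    def __init__(self, mask=None):
        # mask can be a word (str) or a token index (int); it can also be
        # set later via update_mask().
        self.mask = mask

    def update_mask(self, new_mask):
        """Swap in a new mask without re-initializing the model."""
        if not isinstance(new_mask, (str, int)):
            raise ValueError('mask must be a string (word) or an int (index)')
        self.mask = new_mask
```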

I've found myself wondering a few times about what to do with additional information that is neither in log_attributes nor encoded in the Stim object, but could still be useful to return. This is metadata such as the encoded tokens themselves in the encoding extractor (not strictly speaking features, but still nice to keep in the dataset in most cases), or the sequence the masked token belongs to in BertLMExtractor (i.e. the whole input sequence). Do we want to consider a more pliers-general fix for these cases?

That seems like a good idea (though probably not a high priority); feel free to open an issue.

This latter problem is solved, right now, by setting stims.name to the input sequence whenever stims.name is not defined, so that the user can retrieve it from the result dataframe if metadata=True.
Another major question is whether, where, and how to implement the Extractor that computes entropy and/or other metrics on the probability distributions. Ale suggests a solution like this, potentially generalizable to functions other than entropy and to all extractors that output probability distributions. Another extractor of this type could be one that computes autoregressive metrics on the result of a given extractor, e.g. how correlated the encodings of two successive words are.

Per our discussion a couple of weeks ago, I think I'm leaning towards defining a new Transformer type that takes ExtractorResult as input and returns ExtractorResult as output. I don't like the idea of creating an Extractor that implicitly wraps a fixed set of other Extractor classes because that's not very scalable (i.e., because of combinatorial explosion as complexity increases). But this probably deserves its own separate discussion, so maybe open an issue for it.
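
To make that concrete, the rough shape of such a transformer could be as below (hypothetical; a real implementation would wrap the output back into an ExtractorResult rather than returning a plain pandas object):

```python
import numpy as np
from scipy.stats import entropy


class MetricTransformer:
    """Applies a function (entropy by default) row-wise to the numeric
    probability columns of another extractor's result."""

    def __init__(self, func=entropy):
        self.func = func

    def transform(self, result):
        # `result` is assumed to be an ExtractorResult whose to_df() yields
        # one numeric column per vocabulary token.
        df = result.to_df()
        probs = df.select_dtypes(include=[np.number])
        return probs.apply(self.func, axis=1)
```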

@adelavega (Member) commented

Pretty much agree on everything Tal already said.

I think adding an update_mask method would meet our needs. In that case, though, maybe allowing mask to be None at init would be nice. That way, if you plan on cycling through masks, you can initialize the extractor generically. However, if the mask is still not set before _extract, it should fail gracefully.
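
i.e. something like this inside _extract (sketch only):

```python
def _extract(self, stims):
    # Fail early with a clear message if no mask has been provided.
    if self.mask is None:
        raise ValueError(
            'No mask is set; pass mask= at initialization or call '
            'update_mask() before transforming stimuli.')
    ...
```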

About the _log_attributes issue, I agree it's generally not a high priority, especially if we can get away with adding attributes to the Stimulus. A generic solution seems pretty easy, since essentially we're just talking about having a different type of result at the same level as a "feature" but that is semantically different (i.e. a result of the processing vs the extraction itself).

I have some thoughts about the postprocessing issue, I'll open an issue to discuss.

@rbroc (Collaborator, author) commented Apr 3, 2020

@adelavega @tyarkoni
I think I need your help with tests. I've run out of ideas for fixing the issues that pop up when testing the Bert extractors on Travis (tests pass locally).

There are a bunch of new tests compared to the previous Bert implementations, as we now support more extractors and more pretrained models. However, the tests always crash on Travis, most often with a mysterious exit code 137, which likely points to memory (137 usually means the process was killed, typically by the out-of-memory killer). It's not one specific test causing this, but more of an incremental issue that appears once more than a few tests are added.

I tried a few things to solve this. First, I del all the extractors and ExtractorResults at the end of each test, and clear the folder that contains the model weights.
I have also tried not to store extractors in variables whenever we don't need to initialize them more than once. In most cases, the tests just run init, transform and to_df in one line, only storing the resulting DataFrame in a variable. Memory issues still pop up every now and then.
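
For reference, most of these tests now look roughly like this (the stimulus text and assertion here are made up, not the actual test content):

```python
from pliers.stimuli import ComplexTextStim
from pliers.extractors import BertLMExtractor


def test_bert_lm_extractor():
    stim = ComplexTextStim(text='This is not a test sentence.')
    # init, transform and to_df in one line; only the DataFrame is kept
    df = BertLMExtractor(mask='test').transform(stim).to_df()
    assert df.shape[0] > 0
    del df
```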

When I run only one test and disable all the others, the test tends to pass, which suggests that the problem is not the memory requirements of any individual test.

What's weird is that sometimes, even when I choose a subset of the tests and all of them succeed, out-of-memory errors show up afterwards, namely in the test_video_extractor tests that run right after the text extractor tests, and the whole thing crashes. This might suggest that, for some reason, either Travis, pytest, or conda is keeping stuff in memory.

I've also tried putting all the Bert tests in a separate test file and running it on Travis as a separate line in script, in which case I get the error fork: cannot allocate memory.

There's also one more issue, though it seems easier to solve. Some of the tests require downloading pre-trained models and tokenizers. This takes a few minutes, sometimes even more than 10 (it's much faster locally), and Travis has a 10-minute no-output timeout. This could probably be handled with travis_wait.

Sorry for the long comment and the bother; I've really tried my best and have now run out of ideas.

@tyarkoni (Collaborator) commented Apr 3, 2020

No worries, thanks for making an effort! My guess is we're hitting the limits of Travis's free tier. I'll look into pricing etc. and see if it makes sense to upgrade (likely not), but for the time being let's not worry about this. What would probably be helpful is to create a new pytest marker called "high_mem" or something like that, mark any tests known to cause failures on Travis, and make sure they're skipped in the pytest configuration. We already do this for the paid tests; you can just adapt from that.
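
Concretely, something like the following (a sketch; adapt it to however the paid-test skipping is currently wired up):

```python
import pytest


@pytest.mark.high_mem
def test_bert_sequence_extractor():
    ...  # any test known to exhaust memory on Travis


# Then exclude those tests in the Travis invocation, e.g.:
#   pytest -m "not high_mem"
```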

Obviously this isn't ideal, but for the time being we can just run the tests locally, and we'll figure something out later. (This is also still an issue for the paid services; I disabled them because of cost/bugs > 1 year ago, and haven't re-enabled them. At some point we might just need to rethink our testing approach/infrastructure).

@rbroc (Collaborator, author) commented Apr 3, 2020

Thanks for the quick reply, Tal!
I'll mark most of the BERT tests as high-memory then, by Monday.
I'll then run all tests again locally, and the PR should be ready for review.
Maybe someone else could also run the tests locally, just to check that things generalize across environments.
Have a nice weekend!

@coveralls commented Apr 7, 2020

Coverage Status

Coverage decreased (-7.2%) to 71.437% when pulling 71d9b7c on rbroc:bertLM into 46bc248 on tyarkoni:master.

@tyarkoni (Collaborator) left a comment

LGTM! Just one minor comment about a missing docstring.

```python
        self.return_softmax = return_softmax
        self.return_masked_word = return_masked_word

    def update_mask(self, new_mask):
```
@tyarkoni (Collaborator) commented:

This is publicly exposed, so needs a docstring.
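
Something as simple as this would do (wording is just a suggestion):

```python
def update_mask(self, new_mask):
    """Replace the current mask.

    Args:
        new_mask (str or int): the word (str) or token position (int)
            to be masked in subsequent extractions.
    """
```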

@rbroc (Collaborator, author) replied:

Thanks Tal! Docstring updated. If everything else is good, this should be ready to merge.

@adelavega (Member) commented

@rbroc this is ready to merge, right? If so, I can go ahead and do so

@rbroc (Collaborator, author) commented Apr 27, 2020

ready to merge, @adelavega, unless @tyarkoni has further remarks.

@adelavega merged commit 63e668a into PsychoinformaticsLab:master on Apr 27, 2020