
add bert base class and restructure encoding extractor #383

Merged
92 commits merged on Apr 27, 2020

Conversation

@rbroc (Collaborator) commented Feb 27, 2020

Restructures the BERT extractors so that there is a general PretrainedBert base class from which both the encoding extractor and the masked language model extractor inherit.

@rbroc (Collaborator, author) commented Mar 12, 2020

@adelavega @tyarkoni this is now at a stage where it would benefit from some input, both on the general structure and on a few more specific points:

  • All models have (a good part of) the __init__ method plus the preprocess, extract and to_df methods in common. However, I have the feeling that there might be simpler ways of doing this; input on how to simplify is very welcome.
  • Right now, there are three Bert extractors: 1) BertEncoding, which returns token-level encodings (excluding special tokens); 2) BertSequence, which extracts sequence-level encodings (the CLS token, the SEP token, or pooled token-level encodings); 3) BertLM, which extracts probabilities for each vocabulary token at a specific masked position. What they do differently is mainly the post-processing of the model output, which also means we could potentially add any BERT-based fine-tuned model, including user-defined models (a rough sketch of the shared structure follows this list). I think the sentiment ones could be especially cool to have, but this deserves a separate discussion (more later today).
  • I still have to test this, but these extractors could easily be made general enough to also work with Bert-like models such as DistilBert, Roberta, and Albert (the only difference being the dimensionality of the encodings you get back).
  • Re: BertLMExtractor (the language model one). As the extractor is set up right now, the user specifies which items to mask via the mask argument (which can be a word or an index) at __init__. It would be nicer to do that at extract time, but that makes logging the mask a tad more complicated.
  • I've found myself wondering a few times about what to do with additional information that is neither in log_attributes nor encoded in the Stim object, but could still be useful to return. This is metadata such as the encoded tokens themselves in the encoding extractor (not strictly speaking features, but still nice to keep in the dataset in most cases), or the sequence the masked token belongs to in BertLMExtractor (i.e. the whole input sequence). Do we want to consider a more pliers-general fix for these cases?
  • This latter problem is solved, right now, by setting stims.name to the input sequence whenever stims.name is not defined, so that the user can retrieve it from the result dataframe if metadata=True.
  • Another major question is whether, where, and how to implement the Extractor that computes entropy and/or other metrics on the probability distributions. Ale suggests a solution like this, potentially generalizable to functions other than entropy and to all extractors that output probability distributions. Another extractor of this type could be one that computes autoregressive metrics on the result of a given extractor, e.g. how correlated the encodings of two successive words are.
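
To make the shared structure concrete, here is a rough sketch of the hierarchy (class names, method bodies and slicing details are simplified placeholders, not a literal copy of the code in this PR):

```python
from transformers import BertModel, BertTokenizer


class PretrainedBertBase:
    """Shared logic: loading the model and tokenizer, preprocessing, framing output."""

    def __init__(self, pretrained_model='bert-base-uncased'):
        self.tokenizer = BertTokenizer.from_pretrained(pretrained_model)
        self.model = BertModel.from_pretrained(pretrained_model)

    def _preprocess(self, text):
        # Turn the input sequence into model-ready token ids.
        return self.tokenizer.encode(text, return_tensors='pt')

    def _postprocess(self, output):
        # Subclasses decide what to keep from the raw model output.
        raise NotImplementedError


class BertEncodingExtractor(PretrainedBertBase):
    def _postprocess(self, output):
        # Token-level encodings, excluding the [CLS] and [SEP] special tokens.
        return output[0][:, 1:-1, :]


class BertSequenceExtractor(PretrainedBertBase):
    def _postprocess(self, output):
        # Sequence-level encoding: here simply the [CLS] token's hidden state.
        return output[0][:, 0, :]


class BertLMExtractor(PretrainedBertBase):
    def _postprocess(self, output):
        # Probabilities over the vocabulary for the masked position
        # (would use a masked-LM head rather than the plain encoder).
        raise NotImplementedError
```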

@tyarkoni (Collaborator) left a comment

Looks good overall; see comments.

(Nine inline review comments on pliers/extractors/text.py, since resolved.)
@tyarkoni (Collaborator) commented

Thoughts on your questions:

All models have (a good part of) the __init__ method plus the preprocess, extract and to_df methods in common. However, I have the feeling that there might be simpler ways of doing this; input on how to simplify is very welcome.

Looks pretty sensible to me. One thing you could potentially do is add a slimmer, abstract (i.e., uses ABCMeta as metaclass) BertBase class that provides just a skeleton of required methods (_preprocess, _postprocess, etc.). That would add some code but might increase clarity in understanding the logical structure of these classes. But it's probably not worth it so long as there are only 3 classes in a single hierarchy.
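
For illustration, the abstract skeleton might look something like this (purely a sketch, not a proposal for the exact method set):

```python
from abc import ABCMeta, abstractmethod


class BertBase(metaclass=ABCMeta):
    """Abstract skeleton: subclasses must implement these hooks."""

    @abstractmethod
    def _preprocess(self, stim):
        """Convert the input stim into model-ready tensors."""

    @abstractmethod
    def _postprocess(self, output):
        """Convert raw model output into feature values."""

    @abstractmethod
    def _to_df(self, result):
        """Convert an ExtractorResult into a pandas DataFrame."""
```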

I still have to test this, but these extractors could easily be made general enough to also work with Bert-like models such as DistilBert, Roberta, and Albert (the only difference being the dimensionality of the encodings you get back).

That would be neat if it's very straightforward. If it isn't, we can hold off until after we've used these classes a bit and have a sense of how we like the API.

Re: BertLMExtractor (the language model one). As the extractor is set up right now, the user specifies which items to mask via the mask argument (which can be a word or an index) at __init__. It would be nicer to do that at extract time, but that makes logging the mask a tad more complicated.

Yeah, this problem has come up elsewhere. Passing additional arguments to extract() is a non-starter; it's a stipulation of the whole package that we avoid doing that (because there's no way to flow arguments through a transformer graph at the moment). If the concern is efficiency (i.e., it sucks to have to reinitialize every time just to change the mask), a good compromise that doesn't break the intended API is to add an update_mask() method or something like that.
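
Something along these lines, say (a rough sketch of the mask handling only, not the final signature or validation logic):

```python
class BertLMExtractor:
    """Masked-language-model extractor (only the mask handling is shown)."""

    def __init__(self, mask=None):
        # mask can be a word (str) or a token index (int); it can also be
        # set later via update_mask().
        self.mask = mask

    def update_mask(self, new_mask):
        """Swap in a new mask without re-initializing the model."""
        if not isinstance(new_mask, (str, int)):
            raise ValueError('mask must be a string (word) or an int (index)')
        self.mask = new_mask
```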

I've found myself wondering a few times about what to do with additional information that is neither in log_attributes nor encoded in the Stim object, but could still be useful to return. This is metadata such as the encoded tokens themselves in the encoding extractor (not strictly speaking features, but still nice to keep in the dataset in most cases), or the sequence the masked token belongs to in BertLMExtractor (i.e. the whole input sequence). Do we want to consider a more pliers-general fix for these cases?

That seems like a good idea (though probably not a high priority); feel free to open an issue.

This latter problem is solved, right now, by setting stims.name to the input sequence whenever stims.name is not defined, so that the user can retrieve it from the result dataframe if metadata=True.
Another major question is whether, where, and how to implement the Extractor that computes entropy and/or other metrics on the probability distributions. Ale suggests a solution like this, potentially generalizable to functions other than entropy and to all extractors that output probability distributions. Another extractor of this type could be one that computes autoregressive metrics on the result of a given extractor, e.g. how correlated the encodings of two successive words are.

Per our discussion a couple of weeks ago, I think I'm leaning towards defining a new Transformer type that takes ExtractorResult as input and returns ExtractorResult as output. I don't like the idea of creating an Extractor that implicitly wraps a fixed set of other Extractor classes because that's not very scalable (i.e., because of combinatorial explosion as complexity increases). But this probably deserves its own separate discussion, so maybe open an issue for it.
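
To make that concrete, the rough shape of such a transformer could be as below (hypothetical; a real implementation would wrap the output back into an ExtractorResult rather than returning a plain pandas object):

```python
import numpy as np
from scipy.stats import entropy


class MetricTransformer:
    """Applies a function (entropy by default) row-wise to the numeric
    probability columns of another extractor's result."""

    def __init__(self, func=entropy):
        self.func = func

    def transform(self, result):
        # `result` is assumed to be an ExtractorResult whose to_df() yields
        # one numeric column per vocabulary token.
        df = result.to_df()
        probs = df.select_dtypes(include=[np.number])
        return probs.apply(self.func, axis=1)
```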

@adelavega (Member) commented

Pretty much agree on everything Tal already said.

I think adding an update_mask method would meet our needs. In that case, though, maybe allowing mask to be None at init would be nice. That way, if you plan on cycling through masks, you can initialize the extractor generically. However, if the mask is still not set before _extract, it should fail gracefully.
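
i.e. something like this inside _extract (sketch only):

```python
def _extract(self, stims):
    # Fail early with a clear message if no mask has been provided.
    if self.mask is None:
        raise ValueError(
            'No mask is set; pass mask= at initialization or call '
            'update_mask() before transforming stimuli.')
    ...
```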

About the _log_attributes issue, I agree it's generally not a high priority, especially if we can get away with adding attributes to the Stimulus. A generic solution seems pretty easy, since essentially we're just talking about having a different type of result at the same level as a "feature" but that is semantically different (i.e. a result of the processing vs the extraction itself).

I have some thoughts about the postprocessing issue, I'll open an issue to discuss.

@rbroc (Collaborator, author) commented Apr 3, 2020

@adelavega @tyarkoni
I think I need your help with tests. I've run out of ideas for fixing the issues that pop up when testing the Bert extractors on Travis (tests pass locally).

There are a bunch of new tests compared to the previous Bert implementations, as we now support more extractors and more pretrained models. However, the tests always crash on Travis, most often with a mysterious exit code 137, which likely points to memory (137 usually means the process was killed, typically by the out-of-memory killer). It's not one specific test causing this, but more of an incremental issue that appears once more than a few tests are added.

I tried a few things to solve this. First, I del all the extractors and ExtractorResults at the end of each test, and clear the folder that contains the model weights.
I have also tried not to store extractors in variables whenever we don't need to initialize them more than once. In most cases, the tests just run init, transform and to_df in one line, only storing the resulting DataFrame in a variable. Memory issues still pop up every now and then.
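
For reference, most of these tests now look roughly like this (the stimulus text and assertion here are made up, not the actual test content):

```python
from pliers.stimuli import ComplexTextStim
from pliers.extractors import BertLMExtractor


def test_bert_lm_extractor():
    stim = ComplexTextStim(text='This is not a test sentence.')
    # init, transform and to_df in one line; only the DataFrame is kept
    df = BertLMExtractor(mask='test').transform(stim).to_df()
    assert df.shape[0] > 0
    del df
```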

When I run only one test and disable all the others, the test tends to pass, which suggests that the problem is not the memory requirements of any individual test.

What's weird is that sometimes, even when I choose a subset of the tests and all of them succeed, out-of-memory errors show up afterwards, namely in the test_video_extractor tests that run right after the text extractor tests, and the whole thing crashes. This might suggest that, for some reason, either Travis, pytest, or conda is keeping stuff in memory.

I've also tried putting all the Bert tests in a separate test file and running it on Travis as a separate line in script, in which case I get the error fork: cannot allocate memory.

There's also one more issue, though it seems easier to solve. Some of the tests require downloading pre-trained models and tokenizers. This takes a few minutes, sometimes even more than 10 (it's much faster locally), and Travis has a 10-minute no-output timeout. This could probably be handled with travis_wait.

Sorry for the long comment and the bother; I've really tried my best and have now run out of ideas.

@tyarkoni (Collaborator) commented Apr 3, 2020

No worries, thanks for making an effort! My guess is we're hitting the limits of Travis's free tier. I'll look into pricing etc. and see if it makes sense to upgrade (likely not), but for the time being let's not worry about this. What would probably be helpful is to create a new pytest marker called "high_mem" or something like that, mark any tests known to cause failures on Travis, and make sure they're skipped in the pytest configuration. We already do this for the paid tests; you can just adapt from that.
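
Concretely, something like the following (a sketch; adapt it to however the paid-test skipping is currently wired up):

```python
import pytest


@pytest.mark.high_mem
def test_bert_sequence_extractor():
    ...  # any test known to exhaust memory on Travis


# Then exclude those tests in the Travis invocation, e.g.:
#   pytest -m "not high_mem"
```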

Obviously this isn't ideal, but for the time being we can just run the tests locally, and we'll figure something out later. (This is also still an issue for the paid services; I disabled them because of cost/bugs > 1 year ago, and haven't re-enabled them. At some point we might just need to rethink our testing approach/infrastructure).

@rbroc (Collaborator, author) commented Apr 3, 2020

Thanks for the quick reply, Tal!
I'll mark most of the BERT tests as high-memory then, by Monday.
I'll then run all tests again locally, and the PR should be ready for review.
Maybe someone else could also run the tests locally, just to check that things generalize across environments.
Have a nice weekend!

@coveralls commented Apr 7, 2020

Coverage Status

Coverage decreased (-7.2%) to 71.437% when pulling 71d9b7c on rbroc:bertLM into 46bc248 on tyarkoni:master.

@tyarkoni (Collaborator) left a comment

LGTM! Just one minor comment about a missing docstring.

```python
        self.return_softmax = return_softmax
        self.return_masked_word = return_masked_word

    def update_mask(self, new_mask):
```
@tyarkoni (Collaborator) commented:

This is publicly exposed, so needs a docstring.
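
Something as simple as this would do (wording is just a suggestion):

```python
def update_mask(self, new_mask):
    """Replace the current mask.

    Args:
        new_mask (str or int): the word (str) or token position (int)
            to be masked in subsequent extractions.
    """
```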

@rbroc (Collaborator, author) replied:

Thanks Tal! Docstring updated. If everything else is good, this should be ready to merge.

@adelavega (Member) commented

@rbroc this is ready to merge, right? If so, I can go ahead and do so

@rbroc (Collaborator, author) commented Apr 27, 2020

ready to merge, @adelavega, unless @tyarkoni has further remarks.

@adelavega merged commit 63e668a into PsychoinformaticsLab:master on Apr 27, 2020