Unify & simplify model adapter integration #263

Merged: 4 commits merged into adapter-hub:develop on Jan 28, 2022

Conversation

@calpt (Member) commented on Dec 16, 2021

Note: Depends on #257


Background

Currently, we have a lot of redundant code in the model mixins of each model class integration (src/transformers/adapters/models).

Solution

This PR performs a range of refactorings with the goal of unifying and simplifying the adapter integrations into the currently supported models:

  • AdapterLayerBaseMixin has been converted into a torch module, AdapterLayer, with a proper forward() method. This makes the integration into most models cleaner, as they no longer require specific module mixins (the exceptions being (Ro)BERT(a) and T5).
  • The transformer is now expected to be passed to AdapterLayer instead of being a class attribute. This removes the need for a couple of model integration workarounds.
  • A lot of management code (train_adapter(), train_adapter_fusion(), add_adapter(), get_adapter(), etc.) has been pulled up into ModelAdaptersMixin. All of these methods are now model-agnostic and have been removed from each model-specific mixin.
  • To enable the previous point, a new abstract iter_layers() method has been introduced in ModelAdaptersMixin. In the best case (e.g. DistilBERT), this is the only method that has to be implemented by each model mixin (see the sketch after this list).
  • A new mechanism for calling methods on each AdapterLayer has been added (ModelAdaptersMixin.apply_to_adapter_layers()). This means a base model mixin can directly call AdapterLayer methods, so the intermediate add_adapter(), enable_adapters(), etc. methods in each model layer are no longer necessary.
  • As a consequence of the last points, all module mixins between the module holding AdapterLayer and the base model mixin have been removed (e.g. BertEncoderAdaptersMixin).
  • All tensor adjustments for parallel adapter composition now use adjust_tensors_for_parallel() in composition.py.
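
To make the new pattern concrete, the following is a minimal sketch, not the library's actual code, of how iter_layers() and apply_to_adapter_layers() interact. The class names ending in "Sketch", the placeholder AdapterLayer body, and attribute paths such as self.transformer.layer are assumptions for illustration only.

    from typing import Callable, Iterable, Tuple

    import torch.nn as nn


    class AdapterLayer(nn.Module):
        """Placeholder standing in for the AdapterLayer module introduced in this PR."""

        def add_adapter(self, adapter_name: str, layer_idx: int) -> None:
            ...  # the real module would instantiate the adapter bottleneck modules here


    class ModelAdaptersMixinSketch:
        """Illustrates the model-agnostic management code pulled up into ModelAdaptersMixin."""

        def iter_layers(self) -> Iterable[Tuple[int, nn.Module]]:
            raise NotImplementedError  # implemented once per model, see below

        def apply_to_adapter_layers(self, fn: Callable[[int, AdapterLayer], None]) -> None:
            # Call fn(layer_index, adapter_layer) on every AdapterLayer in the model.
            for i, layer in self.iter_layers():
                for module in layer.modules():
                    if isinstance(module, AdapterLayer):
                        fn(i, module)

        def add_adapter(self, adapter_name: str) -> None:
            # Model-agnostic: delegate directly to each AdapterLayer, no per-layer mixins needed.
            self.apply_to_adapter_layers(lambda i, layer: layer.add_adapter(adapter_name, i))


    class DistilBertModelAdaptersMixinSketch(ModelAdaptersMixinSketch):
        """In the best case, iter_layers() is the only method a model mixin has to implement."""

        def iter_layers(self) -> Iterable[Tuple[int, nn.Module]]:
            # self.transformer.layer is an assumed attribute path, for illustration only.
            for i, layer in enumerate(self.transformer.layer):
                yield i, layer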

Breaking changes

  • The return value of get_adapter() has changed.

@calpt calpt force-pushed the master branch 2 times, most recently from 6245d31 to d9c06ec Compare January 6, 2022 19:43
@calpt calpt changed the base branch from master to develop January 6, 2022 19:46
@adapter-hub-bert (Member) commented:

This PR/issue depends on: #257

Automatically provided by Dependent Issues (🤖).

@calpt calpt marked this pull request as ready for review January 19, 2022 09:52
@calpt calpt requested a review from hSterz January 19, 2022 10:18

hSterz (Member) left a comment


This looks like some great changes that will make things easier. I just have some small questions :)

        else:
            hidden_states = hidden_states + input_tensor

        return hidden_states

    def forward(self, hidden_states, input_tensor, layer_norm):

hSterz (Member):

What is the advantage of passing that with every forward call instead of having a parameter in the class containing the layer norm?

calpt (Member, Author):

The layer norm is usually a child module of the Transformer block module (or part of it, as e.g. in BERT). As the AdapterLayer usually is also a child module of the same block module, we would have to keep a reference to an attribute of the parent module. As this makes the implementation less clean and poses some issues (cf. #228), I decided to pass the layer norm in the forward pass.

I might have missed some better solution though, please let me know if there are any :)
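
To illustrate the call pattern under discussion, here is a minimal, hedged sketch. The module and attribute names (BlockOutputSketch, output_adapters, self.LayerNorm) are hypothetical and not necessarily those used in the codebase; the point is only that the parent block owns the layer norm and hands it to the adapter layer on every forward call, so the adapter layer never keeps a reference to its parent's attributes.

    import torch
    import torch.nn as nn


    class AdapterLayerSketch(nn.Module):
        """Simplified stand-in for AdapterLayer; only the layer_norm handling is shown."""

        def forward(self, hidden_states, input_tensor, layer_norm):
            # (the adapter bottleneck computation is omitted here)
            return layer_norm(hidden_states + input_tensor)


    class BlockOutputSketch(nn.Module):
        """Hypothetical output sub-module of a Transformer block."""

        def __init__(self, hidden_size: int):
            super().__init__()
            self.dense = nn.Linear(hidden_size, hidden_size)
            self.LayerNorm = nn.LayerNorm(hidden_size)
            self.output_adapters = AdapterLayerSketch()

        def forward(self, hidden_states, input_tensor):
            hidden_states = self.dense(hidden_states)
            # The layer norm is passed per call instead of being stored on the adapter layer.
            return self.output_adapters(hidden_states, input_tensor, self.LayerNorm)


    # Usage:
    block = BlockOutputSketch(hidden_size=768)
    out = block(torch.randn(2, 16, 768), torch.randn(2, 16, 768))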

from ..model_mixin import InvertibleAdaptersMixin, ModelAdaptersMixin


logger = logging.getLogger(__name__)


- class BertSelfOutputAdaptersMixin(AdapterLayerBaseMixin):
+ # For backwards compatibility, BertSelfOutput inherits directly from AdapterLayer
+ class BertSelfOutputAdaptersMixin(AdapterLayer):

hSterz (Member):

Why do some models still have adapter mixins while others don't? E.g., in the BART implementation, the BartSelfAttentionAdaptersModule is completely removed.

calpt (Member, Author):

Mainly for compatibility reasons. I didn't want to change the weight names and write yet another weight renaming method, so mixins are kept wherever necessary.

calpt (Member, Author):

However, we could go even further and also remove mixins such as BartEncoderLayerAdaptersMixin and just add the code to the main modeling classes, as they mainly consist of creating child modules 🤔
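
As a purely hypothetical sketch of that follow-up idea (the class, import path, and constructor arguments below are assumptions, not code from this PR), the modeling class itself would create the AdapterLayer child modules directly instead of inheriting them from a mixin:

    import torch.nn as nn

    # Import path assumed for illustration; the AdapterLayer module is introduced in this PR.
    from transformers.adapters.layer import AdapterLayer


    class BartEncoderLayerSketch(nn.Module):
        def __init__(self, config):
            super().__init__()
            # ... attention and feed-forward sub-modules as in the upstream class ...
            # The layer mixin's main job was creating these child modules,
            # so they could be created here directly instead:
            self.attention_adapters = AdapterLayer("mh_adapter", config)
            self.output_adapters = AdapterLayer("output_adapter", config)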

@calpt calpt merged commit a357ffe into adapter-hub:develop Jan 28, 2022
@calpt calpt deleted the dev/layer_refactoring branch January 28, 2022 19:06
calpt added commits that referenced this pull request on Feb 2, 2022 and Feb 9, 2022:

* `AdapterLayer` module

* Simplify adapter model mixin implementations

* Remove unused imports

* Fix AdapterSetup context test