Unify & simplify model adapter integration #263
Conversation
Force-pushed from 6245d31 to d9c06ec.
Force-pushed from 60c6e48 to e062a16.
Force-pushed from 059e9ad to 560fc79.
These look like some great changes that will make things easier. I just have a few small questions :)
```python
        else:
            hidden_states = hidden_states + input_tensor

        return hidden_states

    def forward(self, hidden_states, input_tensor, layer_norm):
```
What is the advantage of passing it with every forward call instead of keeping the layer norm as an attribute of the class?
The layer norm is usually a child module of the Transformer block module (or part of it, as e.g. for BERT). As the AdapterLayer is usually also a child module of the same block module, we would have to keep a reference to an attribute of the parent module. As this makes the implementation less clean and poses some issues (cf. #228), I decided to pass the layer norm in the forward pass. I might have missed a better solution though, please let me know if there is one :)
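For illustration, here is a minimal sketch of that design. This is not the actual adapter-transformers code; all class names are made up for the example. The point is that the layer norm stays a child of the block, and the adapter layer receives it per call instead of holding a reference into its parent module:

```python
import torch
import torch.nn as nn

class AdapterLayerSketch(nn.Module):
    """Illustrative stand-in for an adapter layer module."""

    def forward(self, hidden_states, input_tensor, layer_norm):
        # ... adapter bottleneck computation would happen here ...
        hidden_states = hidden_states + input_tensor  # residual connection
        if layer_norm is not None:
            hidden_states = layer_norm(hidden_states)
        return hidden_states

class TransformerBlockSketch(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.layer_norm = nn.LayerNorm(hidden_size)  # owned by the block
        self.adapter_layer = AdapterLayerSketch()    # sibling child module

    def forward(self, hidden_states, input_tensor):
        # No back-reference from the adapter layer to the block is needed:
        # the block simply hands its own norm over with each call.
        return self.adapter_layer(hidden_states, input_tensor, self.layer_norm)

block = TransformerBlockSketch(hidden_size=16)
x = torch.randn(2, 4, 16)  # (batch, seq, hidden)
out = block(x, x)
```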
```diff
 from ..model_mixin import InvertibleAdaptersMixin, ModelAdaptersMixin


 logger = logging.getLogger(__name__)


-class BertSelfOutputAdaptersMixin(AdapterLayerBaseMixin):
+# For backwards compatibility, BertSelfOutput inherits directly from AdapterLayer
+class BertSelfOutputAdaptersMixin(AdapterLayer):
```
Why do some models still have adapter mixins while others don't? E.g., in the BART implementation, the BartSelfAttentionAdaptersModule is completely removed.
Mainly for compatibility reasons. I didn't want to change the weight names and write yet another weight renaming method, so mixins are kept wherever necessary.
However, we could go even further and also remove mixins such as BartEncoderLayerAdaptersMixin and just add the code to the main modeling classes, as they mainly consist of creating child modules 🤔
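To make the suggestion concrete, here is a hedged sketch (illustrative names, not the library's actual API) of what such a layer mixin typically boils down to. Since it only instantiates child modules, the same few lines could live directly in the model's layer class:

```python
import torch.nn as nn

class AdapterLayerStub(nn.Module):
    """Stand-in for the real AdapterLayer module."""

    def __init__(self, location_key: str):
        super().__init__()
        self.location_key = location_key

class EncoderLayerAdaptersMixinSketch:
    """Mixed into an encoder layer; it merely creates child modules."""

    def _init_adapter_modules(self):
        # the two typical adapter locations within a Transformer layer
        self.attention_adapters = AdapterLayerStub("mh_adapter")
        self.output_adapters = AdapterLayerStub("output_adapter")
```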
* `AdapterLayer` module
* Simplify adapter model mixin implementations
* Remove unused imports
* Fix AdapterSetup context test
Note: Depends on #257
Background

Currently, we have a lot of redundant code in the model mixins of each model class integration (`src/transformers/adapters/models`).

Solution

This PR does a range of refactoring with the goal to unify and simplify the adapter integrations into the currently supported models:
- `AdapterLayerBaseMixin` has been converted to a torch module, `AdapterLayer`, with a proper `forward()` method. This makes the integration into most models cleaner (except for (Ro)BERT(a)/T5) as they don't require specific module mixins anymore.
- The layer norm is now passed to the `forward()` call of `AdapterLayer` instead of being a class attribute. This removes the need for a couple of model integration workarounds.
- All common adapter management functionality (`train_adapter()`, `train_adapter_fusion()`, `add_adapter()`, `get_adapter()` etc.) has been pulled up to `ModelAdaptersMixin`. All of these methods are now model-agnostic and have been removed from each model-specific mixin.
- A new `iter_layers()` method has been introduced in `ModelAdaptersMixin`. In the best case (e.g. DistilBERT), this is the only method that has to be implemented by each model mixin (a sketch of this pattern follows at the end of this description).
- A new helper for applying a function to each `AdapterLayer` has been added (`ModelAdaptersMixin.apply_to_adapter_layers()`). This means a base model mixin can directly call `AdapterLayer` methods. All intermediate `add_adapter()`, `enable_adapters()` etc. methods in each model layer are not necessary anymore.
- All intermediate mixins between `AdapterLayer` and the base model mixin have been removed (e.g. `BertEncoderAdaptersMixin`).
- Tensor adjustments for the Parallel composition block are now handled by `adjust_tensors_for_parallel()` in `composition.py`.

Breaking changes

- The output of `get_adapter()` has changed.
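As referenced in the list above, here is a hedged sketch of how `iter_layers()` and `apply_to_adapter_layers()` could fit together. The class and method names follow this description, but every body is an illustrative assumption, not the actual implementation:

```python
from typing import Callable, Iterable, Tuple
import torch.nn as nn

class AdapterLayerStub(nn.Module):
    """Stand-in for the real AdapterLayer module."""

    def add_adapter(self, adapter_name: str, layer_idx: int):
        print(f"adding adapter '{adapter_name}' to layer {layer_idx}")

class ModelAdaptersMixinSketch:
    def iter_layers(self) -> Iterable[Tuple[int, nn.Module]]:
        """The one model-specific hook: yield (index, layer) pairs."""
        raise NotImplementedError

    def apply_to_adapter_layers(self, fn: Callable):
        # Walk every layer and call fn on each adapter layer found in it,
        # replacing the per-layer add_adapter()/enable_adapters() plumbing.
        for i, layer in self.iter_layers():
            for module in layer.modules():
                if isinstance(module, AdapterLayerStub):
                    fn(i, module)

    def add_adapter(self, adapter_name: str):
        # Model-agnostic: delegates straight to the adapter layers.
        self.apply_to_adapter_layers(lambda i, m: m.add_adapter(adapter_name, i))

# A DistilBERT-like integration would then only implement iter_layers():
class DistilBertLikeMixinSketch(ModelAdaptersMixinSketch):
    def __init__(self, transformer_layers: nn.ModuleList):
        self.transformer_layers = transformer_layers

    def iter_layers(self):
        return enumerate(self.transformer_layers)

model = DistilBertLikeMixinSketch(nn.ModuleList([AdapterLayerStub() for _ in range(2)]))
model.add_adapter("my_adapter")
```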