[Proposal] Support BERT Model Series (and/or T5). #258
Comments
Some caveats: BERT is encoder-only, *not* encoder-decoder. It should be a
HookedBERT or HookedEncoder class, in my opinion.
Note that post-LayerNorm means "LayerNorm applied after the layer".
Confusingly, the LayerNormPre class means "LayerNorm with weights folded
into the subsequent layer". Pre-LayerNorm when used in papers means
LayerNorm at the start of a layer. *shrug* Naming conventions are hard man
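To make the naming distinction concrete, here is a minimal PyTorch sketch of the two orderings. The class names are illustrative only, not TransformerLens internals:

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """BERT-style post-LayerNorm: LN applied *after* each sublayer's residual add."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])  # residual add, then LayerNorm
        return self.ln2(x + self.mlp(x))

class PreLNBlock(nn.Module):
    """GPT-2-style pre-LayerNorm: LN applied at the *start* of each sublayer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        normed = self.ln1(x)
        x = x + self.attn(normed, normed, normed)[0]  # LayerNorm first, then residual
        return x + self.mlp(self.ln2(x))
```

The pre-LN ordering is what allows LayerNorm weights to be folded into the subsequent layer (the LayerNormPre trick); in post-LN the norm sits after the residual add, so no such fold is possible.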
…On Wed, 26 Apr 2023 at 11:01, Joseph Bloom ***@***.***> wrote:
[image: Screenshot 2023-04-26 at 8 00 40 pm]
<https://user-images.githubusercontent.com/69127271/234541975-b5f48a88-0354-4e2c-a060-0deb52d60de8.png>
Proposal
A glaring gap in models supported by TransformerLens is the lack of
support for encoder/decoder models like BERT and T5. In order to rectify
this, we need to build support for encoder-decoder models, cross attention
and likely a few other details.
Motivation
- I've been asked a number of times by people whether we have T5 or
BERT support coming.
- I think this is a reasonable contribution for people skilling up.
You can ask for help or collaborate with other people working on building
it.
- I (Joseph) and likely others will be able to give feedback on your
work and pitch in if needed!
Pitch
In order to complete this card we are going to build a class called
HookedEncoderDecoder which can be instantiated and have the BERT or T5
weights from huggingface <https://huggingface.co/bert-base-uncased>
loaded in. This is the same strategy the package uses for other models so
there's nothing fundamentally new, only components we haven't yet built
with hooks. Once we have a class that has corresponding weights as needed,
we will want to ensure all components are "hooked" correctly, enabling
methods like .run_with_cache to work. It's not as complicated as it sounds
to get it working.
Possibly the "best" version of this is tricky, but we're willing to
tolerate some code duplication as long as we're confident the
implementation is accurate (i.e. tests show the hooked activations are
correct and inference is identical to what we get without using
TransformerLens).
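The "hooked" idea itself can be illustrated with plain PyTorch forward hooks. This is a toy analogue of what the package does, not its actual implementation; the function name is illustrative:

```python
import torch
import torch.nn as nn

def run_with_cache(model: nn.Module, x: torch.Tensor):
    """Run `model` on `x`, caching every submodule's output by name.
    A toy analogue of TransformerLens's .run_with_cache."""
    cache, handles = {}, []
    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(
                lambda mod, inp, out, name=name: cache.__setitem__(name, out)))
    try:
        out = model(x)
    finally:
        for h in handles:  # always detach the hooks, even if forward raises
            h.remove()
    return out, cache
```

Building HookedEncoderDecoder is then largely a matter of writing each component so that every interesting intermediate activation passes through a hook point like this.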
Tasks:
- Make a fork for BERT support (keep it synced if building this takes
more than a moment)
- Write some tests to be used during development; these should cover
any novel architecture components involved. I'd start with two
placeholder tests: 1. do you have a model architecture which you can
load weights into, and 2. given that architecture, does it produce
identical results to the hugging-face model?
- Load the BERT model and look at each component. Maybe read the BERT
paper if needed to get context. Make a list of every nn.Module that we
don't currently support. Some items that will be on the list: 1. a linear
block before the unembed, 2. post-LayerNorm (can't be unfolded, RIP),
3. sinusoidal positional embeddings. T5 might have other stuff.
- Keep in mind you might need a new config class to configure the
entire architecture. If so, build it.
- If you can pass the test for the model doing inference correctly and
show that you can use .run_with_cache and all the other methods that
work on HookedTransformer (within reason), then we're likely ready to
merge.
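The second placeholder test above is an equivalence test: a reimplemented component should produce numerically identical output to the reference. A small runnable sketch of that pattern, here checking a hand-rolled LayerNorm against PyTorch's (the real test would compare the new BERT components against the huggingface model in the same way):

```python
import torch
import torch.nn as nn

def manual_layer_norm(x, weight, bias, eps=1e-5):
    """A hand-rolled LayerNorm — the kind of reimplemented component
    we want to verify against a reference implementation."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # LayerNorm uses biased variance
    return (x - mean) / torch.sqrt(var + eps) * weight + bias

def test_matches_reference():
    torch.manual_seed(0)
    ref = nn.LayerNorm(16)
    x = torch.randn(3, 7, 16)
    ours = manual_layer_norm(x, ref.weight, ref.bias)
    assert torch.allclose(ours, ref(x), atol=1e-5)
```

Writing the test first gives you a tight feedback loop while porting each component's weights.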
Some things to note:
- You should refine the to do list and add detail as you go.
- Writing tests should make your life easier and create feedback
cycles. It will also make it easier if multiple people want to collaborate
on this.
- A previous PR <#82> attempted to
provide BERT support. Unfortunately, due to drift, it became too
complicated to merge. To avoid this, I recommend either syncing regularly
or getting someone else to finish the PR if you can't.
- Possibly quite useful: a link to the original ARENA materials where we
did this (without the hooks) during the bootcamp:
<https://github.com/callummcdougall/arena-v1/blob/main/w2d2/solutions_build_bert.py>
Additional context
- BERT paper <https://arxiv.org/abs/1810.04805>
- BERT on huggingface <https://huggingface.co/bert-base-uncased>
- Related T5 Model issue
<#94>
Checklist
- I have checked that there is no similar issue
<https://github.com/Farama-Foundation/MiniGrid/issues> in the repo (
*required*)
|
I'd like to work on this! I'll start looking into it tomorrow. |
@rusheb nice! It would be a good idea to create a simple PR first that splits the components from https://github.com/neelnanda-io/TransformerLens/blob/main/transformer_lens/components.py into different files |
Why? To me that seems worse - having all components in the same place, in a
file where you can easily Ctrl+F and jump to definition, seems great. And
adding some BERTBlocks or CrossAttention components to that file seems
totally fine to me
|
If you are developing and end up with the same files open multiple times, this feels like a tell to me, but if that's not happening it's probably fine. Seems like personal preference? |
Haha, uh oh, I just spent a couple of hours on this. I see both sides of the argument. I think it's pythonic to have all the components in one file, and I'm also having some issues with getting the imports right when moving them to multiple files. OTOH it is a super long file, and it might make it easier to factor out duplication if components were organised into files. In my implementation I was planning to put all components in a … I'll probably pause on this until we get to a resolution. Branch here if anyone wants to look: https://github.com/rusheb/TransformerLens/tree/rusheb-components |
I think it's minor. If Neel doesn't want to move them then maybe best to just keep everything in the one file and focus on the important parts of this ticket? |
Yeah I get it's not crazy yet, although 800 lines is quite a bit for newcomers (as you have to skim a bit to understand what is there). It was really just a suggestion based on the direction of travel with the repo - as the methods start getting split up for unit tests, it's going to get into the thousands. Anyway fine for leaving as is and sorry about the extra work @rusheb! |
For what it's worth if we keep the split we can probably just drop in all these tests - https://github.com/skyhookadventure/transformer-from-scratch/tree/main/transformer_from_scratch/components/tests |
Based on various offline discussions, the MVP of this will be BERT for Masked Language Modelling -- basically the same architecture as huggingface BertForMaskedLM: the core BERT module plus a language modelling head. Other tasks/outputs (e.g. embeddings, next sentence prediction, causal language modelling) are being considered as out of scope for the first iteration |
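For reference, the language-modelling head sitting on top of the core BERT module is small. A hedged sketch of its shape, based on huggingface's BertForMaskedLM (class and attribute names here are illustrative, not the eventual TransformerLens names):

```python
import torch
import torch.nn as nn

class MLMHead(nn.Module):
    """Sketch of BERT's masked-LM head: dense -> GELU -> LayerNorm -> unembed.
    In BertForMaskedLM the unembed ('decoder') weights are tied to the
    input token embeddings."""
    def __init__(self, d_model: int, d_vocab: int):
        super().__init__()
        self.dense = nn.Linear(d_model, d_model)
        self.act = nn.GELU()
        self.ln = nn.LayerNorm(d_model)
        self.decoder = nn.Linear(d_model, d_vocab)

    def forward(self, hidden):
        # [batch, seq, d_model] -> [batch, seq, d_vocab] of token logits
        return self.decoder(self.ln(self.act(self.dense(hidden))))
```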
Hi, are there any updates? I do hope T5 can be supported. Thanks! |
BERT support was added, but T5 was not, I believe (see HookedEncoder). We'd
welcome a PR adding T5 if you're interested! I'm not sure there's the
capacity to add it ourselves. What's the use case?
|
@neelnanda-io Thank you! I am interested and hope to squeeze some time trying that (but cannot guarantee).
You know, many tasks, such as machine translation, use an encoder-decoder architecture instead of decoder-only. |
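Cross-attention is the main genuinely new component a T5/encoder-decoder port would need: the decoder attends with its own queries over the encoder output's keys and values. A minimal sketch (illustrative only; T5's actual attention additionally uses relative position biases and no bias terms):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries come from the decoder stream; keys/values from the encoder output."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_x, encoder_out):
        # Output has the decoder's sequence length, regardless of the
        # encoder's sequence length.
        out, _ = self.attn(query=decoder_x, key=encoder_out, value=encoder_out)
        return out
```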
Makes sense! Note that the nnsight library works out of the box for any
PyTorch model. It's more lightweight than TransformerLens accordingly, but
likely still useful
|
I see. Thank you! |