[Proposal] Support BERT Model Series (and/or T5). #258
Comments
Some caveats: BERT is encoder-only, *not* encoder-decoder. It should be a
HookedBERT or HookedEncoder class, in my opinion.
Note that post-LayerNorm means "LayerNorm applied after the layer".
Confusingly, the LayerNormPre class means "LayerNorm with weights folded
into the subsequent layer". Pre-LayerNorm when used in papers means
LayerNorm at the start of a layer. *shrug* Naming conventions are hard man
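To make the naming distinction concrete, here is a minimal PyTorch sketch of the two orderings. The class names are illustrative only, not TransformerLens internals:

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """BERT-style post-LayerNorm: LN applied *after* each sublayer's residual add."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])  # residual add, then LayerNorm
        return self.ln2(x + self.mlp(x))

class PreLNBlock(nn.Module):
    """GPT-2-style pre-LayerNorm: LN applied at the *start* of each sublayer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        normed = self.ln1(x)
        x = x + self.attn(normed, normed, normed)[0]  # LayerNorm first, then residual
        return x + self.mlp(self.ln2(x))
```

The pre-LN ordering is what allows LayerNorm weights to be folded into the subsequent layer (the LayerNormPre trick); in post-LN the norm sits after the residual add, so no such fold is possible.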
…On Wed, 26 Apr 2023 at 11:01, Joseph Bloom ***@***.***> wrote:
[image: Screenshot 2023-04-26 at 8 00 40 pm]
<https://user-images.githubusercontent.com/69127271/234541975-b5f48a88-0354-4e2c-a060-0deb52d60de8.png>
Proposal
A glaring gap in models supported by TransformerLens is the lack of
support for encoder/decoder models like BERT and T5. In order to rectify
this, we need to build support for encoder-decoder models, cross attention
and likely a few other details.
Motivation
- I've been asked a number of times by people whether we have T5 or
BERT support coming.
- I think this is a reasonable contribution for people skilling up.
You can ask for help or collaborate with other people working on building
it.
- I (Joseph) and likely others will be able to give feedback on your
work and pitch in if needed!
Pitch
In order to complete this card we are going to build a class called
HookedEncoderDecoder which can be instantiated and have the BERT or T5
weights from huggingface <https://huggingface.co/bert-base-uncased>
loaded in. This is the same strategy the package uses for other models so
there's nothing fundamentally new, only components we haven't yet built
with hooks. Once we have a class that has corresponding weights as needed,
we will want to ensure all components are "hooked" correctly, enabling
methods like .run_with_cache to work. It's not as complicated as it sounds
to get it working.
Possibly the "best" version of this is tricky, but we're willing to
tolerate some code duplication as long as we're confident the
implementation is accurate (i.e. tests show the hooked activations are
correct and inference is identical to what we get without using
TransformerLens).
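The "hooked" idea itself can be illustrated with plain PyTorch forward hooks. This is a toy analogue of what the package does, not its actual implementation; the function name is illustrative:

```python
import torch
import torch.nn as nn

def run_with_cache(model: nn.Module, x: torch.Tensor):
    """Run `model` on `x`, caching every submodule's output by name.
    A toy analogue of TransformerLens's .run_with_cache."""
    cache, handles = {}, []
    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(
                lambda mod, inp, out, name=name: cache.__setitem__(name, out)))
    try:
        out = model(x)
    finally:
        for h in handles:  # always detach the hooks, even if forward raises
            h.remove()
    return out, cache
```

Building HookedEncoderDecoder is then largely a matter of writing each component so that every interesting intermediate activation passes through a hook point like this.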
Tasks:
- Make a fork for BERT support (keep it synced if building this takes
more than a moment)
- Write some tests to be used during development; these should cover
any novel architecture components involved. I'd start with two
placeholder tests: 1. do you have a model architecture which you can
load weights into, and 2. given that architecture, does it produce
identical results to the hugging-face model?
- Load the BERT model and look at each component. Maybe read the BERT
paper if needed to get context. Make a list of every nn.Module that we
don't currently support. Some items that will be on the list: 1. a linear
block before the unembed, 2. post-LayerNorm (can't be unfolded, RIP),
3. sinusoidal positional embeddings. T5 might have other stuff.
- Keep in mind you might need a new config class to configure the
entire architecture. If so, build it.
- If you can pass the test for the model doing inference correctly and
show that you can use .run_with_cache and all the other methods that
work on HookedTransformer (within reason), then we're likely ready to
merge.
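The second placeholder test above is an equivalence test: a reimplemented component should produce numerically identical output to the reference. A small runnable sketch of that pattern, here checking a hand-rolled LayerNorm against PyTorch's (the real test would compare the new BERT components against the huggingface model in the same way):

```python
import torch
import torch.nn as nn

def manual_layer_norm(x, weight, bias, eps=1e-5):
    """A hand-rolled LayerNorm — the kind of reimplemented component
    we want to verify against a reference implementation."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # LayerNorm uses biased variance
    return (x - mean) / torch.sqrt(var + eps) * weight + bias

def test_matches_reference():
    torch.manual_seed(0)
    ref = nn.LayerNorm(16)
    x = torch.randn(3, 7, 16)
    ours = manual_layer_norm(x, ref.weight, ref.bias)
    assert torch.allclose(ours, ref(x), atol=1e-5)
```

Writing the test first gives you a tight feedback loop while porting each component's weights.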
Some things to note:
- You should refine the to do list and add detail as you go.
- Writing tests should make your life easier and create feedback
cycles. It will also make it easier if multiple people want to collaborate
on this.
- A previous PR <#82> attempted to
provide BERT support. Unfortunately, due to drift, it became too
complicated to merge. To avoid this, I recommend either syncing regularly
or getting someone else to finish the PR if you can't.
- Possibly quite useful: a link to the original ARENA materials where we
did this (without the hooks) during the bootcamp:
<https://github.com/callummcdougall/arena-v1/blob/main/w2d2/solutions_build_bert.py>
Additional context
- BERT paper <https://arxiv.org/abs/1810.04805>
- BERT on huggingface <https://huggingface.co/bert-base-uncased>
- Related T5 Model issue
<#94>
Checklist
- I have checked that there is no similar issue
<https://github.com/Farama-Foundation/MiniGrid/issues> in the repo (
*required*)
|
I'd like to work on this! I'll start looking into it tomorrow. |
@rusheb nice! It would be a good idea to create a simple PR first that splits the components from https://github.com/neelnanda-io/TransformerLens/blob/main/transformer_lens/components.py into different files |
Why? To me that seems worse - having all components in the same place, in a
file where you can easily Ctrl+F and jump to definition, seems great. And
adding some BERTBlocks or CrossAttention components to that file seems
totally fine to me
|
If you are developing and end up with the same files open multiple times, this feels like a tell to me, but if that's not happening it's probably fine. Seems like personal preference? |
Haha, uh oh, I just spent a couple of hours on this. I see both sides of the argument. I think it's pythonic to have all the components in one file, and I'm also having some issues with getting the imports right when moving them to multiple files. OTOH it is a super long file, and it might make it easier to factor out duplication if components were organised into files. In my implementation I was planning to put all components in a … I'll probably pause on this until we get to a resolution. Branch here if anyone wants to look: https://github.com/rusheb/TransformerLens/tree/rusheb-components |
I think it's minor. If Neel doesn't want to move them then maybe best to just keep everything in the one file and focus on the important parts of this ticket? |
Yeah I get it's not crazy yet, although 800 lines is quite a bit for newcomers (as you have to skim a bit to understand what is there). It was really just a suggestion based on the direction of travel with the repo - as the methods start getting split up for unit tests, it's going to get into the thousands. Anyway fine for leaving as is and sorry about the extra work @rusheb! |
For what it's worth if we keep the split we can probably just drop in all these tests - https://github.com/skyhookadventure/transformer-from-scratch/tree/main/transformer_from_scratch/components/tests |
Based on various offline discussions, the MVP of this will be BERT for Masked Language Modelling -- basically the same architecture as huggingface BertForMaskedLM: the core BERT module plus a language modelling head. Other tasks/outputs (e.g. embeddings, next sentence prediction, causal language modelling) are being considered as out of scope for the first iteration |
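For reference, the language-modelling head sitting on top of the core BERT module is small. A hedged sketch of its shape, based on huggingface's BertForMaskedLM (class and attribute names here are illustrative, not the eventual TransformerLens names):

```python
import torch
import torch.nn as nn

class MLMHead(nn.Module):
    """Sketch of BERT's masked-LM head: dense -> GELU -> LayerNorm -> unembed.
    In BertForMaskedLM the unembed ('decoder') weights are tied to the
    input token embeddings."""
    def __init__(self, d_model: int, d_vocab: int):
        super().__init__()
        self.dense = nn.Linear(d_model, d_model)
        self.act = nn.GELU()
        self.ln = nn.LayerNorm(d_model)
        self.decoder = nn.Linear(d_model, d_vocab)

    def forward(self, hidden):
        # [batch, seq, d_model] -> [batch, seq, d_vocab] of token logits
        return self.decoder(self.ln(self.act(self.dense(hidden))))
```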
Hi, are there any updates? I do hope T5 can be supported. Thanks! |
BERT support was added, but T5 was not, I believe (see HookedEncoder). We'd
welcome a PR adding T5 if you're interested! I'm not sure there's the
capacity to add it ourselves. What's the use case?
|
@neelnanda-io Thank you! I am interested and hope to squeeze some time trying that (but cannot guarantee).
You know, many tasks, such as machine translation, use an encoder-decoder architecture instead of decoder-only. |
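Cross-attention is the main genuinely new component a T5/encoder-decoder port would need: the decoder attends with its own queries over the encoder output's keys and values. A minimal sketch (illustrative only; T5's actual attention additionally uses relative position biases and no bias terms):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries come from the decoder stream; keys/values from the encoder output."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_x, encoder_out):
        # Output has the decoder's sequence length, regardless of the
        # encoder's sequence length.
        out, _ = self.attn(query=decoder_x, key=encoder_out, value=encoder_out)
        return out
```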
Makes sense! Note that the nnsight library works out of the box for any
PyTorch model. It's more lightweight than TransformerLens accordingly, but
likely still useful
|
I see. Thank you! |