
[Proposal] Support BERT Model Series (and/or T5). #258

Closed
1 of 6 tasks
jbloomAus opened this issue Apr 26, 2023 · 16 comments

@jbloomAus
Collaborator

Proposal

A glaring gap in the models supported by TransformerLens is the lack of support for encoder-only and encoder-decoder models such as BERT and T5. To rectify this, we need to build support for encoder and encoder-decoder architectures, cross-attention, and likely a few other details.

Motivation

  • I've been asked a number of times by people whether we have T5 or BERT support coming.
  • I think this is a reasonable contribution for people skilling up. You can ask for help or collaborate with other people working on building it.
  • I (Joseph) and likely others will be able to give feedback on your work and pitch in if needed!

Pitch

To complete this card we are going to build a class called HookedEncoderDecoder which can be instantiated with the BERT or T5 weights from HuggingFace loaded in. This is the same strategy the package uses for other models, so there's nothing fundamentally new here, only components we haven't yet built with hooks. Once we have a class with the corresponding weights loaded, we will want to ensure all components are "hooked" correctly, so that methods like .run_with_cache work. It's not as complicated as it sounds to get this working.
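As a rough sketch of the intended interface (the class name, import path, and method signatures below are assumptions about an API that doesn't exist yet, modelled on HookedTransformer):

```python
# Hypothetical usage sketch -- HookedEncoder / HookedEncoderDecoder and the
# exact from_pretrained signature are assumptions, not a finalised API.
from transformer_lens import HookedEncoder  # assumed import path

# Load the HuggingFace BERT weights into the hooked architecture
model = HookedEncoder.from_pretrained("bert-base-cased")

tokens = model.to_tokens("The [MASK] sat on the mat.")

# Hooks should populate the activation cache, just like HookedTransformer
logits, cache = model.run_with_cache(tokens)
print(cache["blocks.0.attn.hook_pattern"].shape)
```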

Possibly the "best" version of this is tricky, but we're willing to tolerate some code duplication as long as we're confident the implementation is accurate (i.e. tests show the hooked activations are correct and inference is identical to what we get without using TransformerLens).

Tasks:

  • Make a fork for BERT support (keep it synced if building this takes more than a moment)
  • Write some tests to be used during development; these should cover any novel architecture components involved. I'd start with two placeholder tests: 1. do you have a model architecture into which you can load the weights, and 2. given that architecture, does it produce identical results to the HuggingFace model (see the test sketch after this list).
  • Load the BERT model and look at each component, reading the BERT paper if needed for context. Make a list of every nn.Module that we don't currently support. Some items that will be on the list: 1. a linear block before the unembed, 2. post-LayerNorm (can't be folded in, RIP), 3. absolute positional embeddings (learned in BERT, not sinusoidal) and token type embeddings. T5 might have other stuff.
  • Keep in mind you might need a new config class to configure the entire architecture. If so, build it.
  • If you can pass the test for the model doing inference correctly and show that you can use .run_with_cache and all the other methods that work on HookedTransformer (within reason), then we're likely ready to merge.
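
A minimal pytest sketch for the second placeholder test above (matching the HuggingFace model). The HookedEncoder import and call signature are assumptions about the not-yet-written class; only the HuggingFace/PyTorch side uses existing APIs:

```python
# Sketch of the "identical results to the HuggingFace model" test.
import torch
from transformers import AutoTokenizer, BertForMaskedLM

MODEL_NAME = "bert-base-cased"


def test_logits_match_huggingface():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    hf_model = BertForMaskedLM.from_pretrained(MODEL_NAME)
    hf_model.eval()

    # Hypothetical hooked implementation under test (assumed name/signature)
    from transformer_lens import HookedEncoder
    hooked_model = HookedEncoder.from_pretrained(MODEL_NAME)

    inputs = tokenizer("The [MASK] sat on the mat.", return_tensors="pt")
    with torch.no_grad():
        hf_logits = hf_model(**inputs).logits
        hooked_logits = hooked_model(inputs["input_ids"])  # assumed forward signature

    torch.testing.assert_close(hooked_logits, hf_logits, atol=1e-4, rtol=1e-4)
```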

Some things to note:

  • You should refine the to-do list and add detail as you go.
  • Writing tests should make your life easier and create feedback cycles. It will also make it easier if multiple people want to collaborate on this.
  • A previous PR that attempted to add BERT support is here. Unfortunately, it drifted too far from main to merge. To avoid this, I recommend either syncing regularly or handing the PR off to someone else to finish if you can't.
  • Possibly quite useful: a link to the original ARENA materials, where we did this without the hooks during the bootcamp.

Additional context

Checklist

  • I have checked that there is no similar issue in the repo (required)
@neelnanda-io
Collaborator

neelnanda-io commented Apr 26, 2023 via email

@rusheb
Collaborator

rusheb commented Apr 26, 2023

I'd like to work on this! I'll start looking into it tomorrow.

@rusheb rusheb self-assigned this Apr 26, 2023
@alan-cooney
Collaborator

@rusheb nice! It would be a good idea to create a simple PR first that splits the components from https://github.com/neelnanda-io/TransformerLens/blob/main/transformer_lens/components.py into different files

@neelnanda-io
Collaborator

neelnanda-io commented Apr 27, 2023 via email

@jbloomAus
Collaborator Author

If you are developing and end up with the same files open multiple times, this feels like a tell to me, but if that's not happening it's probably fine. Seems like personal preference?

@rusheb
Collaborator

rusheb commented Apr 27, 2023

Haha, uh oh, I just spent a couple of hours on this.

I see both sides of the argument. I think it's pythonic to have all the components in one file, and I'm also having some issues with getting the imports right when moving them to multiple files. OTOH it is a super long file, and splitting the components into separate files might make it easier to factor out duplication.

In my implementation I was planning to put all components in a components module so they would still be easy to find/search through. And obviously IDEs make searching pretty easy in either scenario.

I'll probably pause on this until we get to a resolution. Branch here if anyone wants to look: https://github.com/rusheb/TransformerLens/tree/rusheb-components
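
For what it's worth, one common way to split a large module without breaking existing imports is to re-export everything from the new package's __init__.py. A sketch under the assumption that the components end up in a transformer_lens/components/ package (the file names below are hypothetical):

```python
# transformer_lens/components/__init__.py (hypothetical layout)
# Re-export each component so existing imports such as
# `from transformer_lens.components import Attention` keep working
# after the single components.py file is split into separate modules.
from .attention import Attention      # assumed file names
from .layer_norm import LayerNorm
from .mlp import MLP
from .embed import Embed, Unembed
```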

@rusheb
Collaborator

rusheb commented Apr 27, 2023

I think it's minor. If Neel doesn't want to move them then maybe best to just keep everything in the one file and focus on the important parts of this ticket?

@alan-cooney
Collaborator

Why? To me that seems worse - having all components in the same place, in a file where you can easily Ctrl+F and jump to definition seems great. And adding some BERTBlocks or CrossAttention components to that file seems totally fine to me

Yeah I get it's not crazy yet, although 800 lines is quite a bit for newcomers (as you have to skim a bit to understand what is there). It was really just a suggestion based on the direction of travel with the repo - as the methods start getting split up for unit tests, it's going to get into the thousands.

Anyway, fine to leave it as is, and sorry about the extra work @rusheb!

@alan-cooney
Collaborator

For what it's worth, if we keep the split we can probably just drop in all these tests: https://github.com/skyhookadventure/transformer-from-scratch/tree/main/transformer_from_scratch/components/tests

@rusheb
Collaborator

rusheb commented May 5, 2023

Sneak preview!!

@rusheb
Collaborator

rusheb commented May 11, 2023

Based on various offline discussions, the MVP of this will be BERT for Masked Language Modelling -- basically the same architecture as huggingface BertForMaskedLM: the core BERT module plus a language modelling head.

Other tasks/outputs (e.g. embeddings, next sentence prediction, causal language modelling) are considered out of scope for the first iteration.
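
For reference, a rough PyTorch sketch of what such a language modelling head looks like (a simplified illustration rather than the exact HuggingFace or TransformerLens code; the class and argument names are made up):

```python
import torch
import torch.nn as nn


class MLMHead(nn.Module):
    """Simplified sketch of a BERT masked-language-modelling head:
    dense layer + GELU + LayerNorm, then a projection to the vocabulary."""

    def __init__(self, d_model: int, d_vocab: int):
        super().__init__()
        self.dense = nn.Linear(d_model, d_model)    # the linear block before the unembed
        self.act = nn.GELU()
        self.ln = nn.LayerNorm(d_model)
        self.unembed = nn.Linear(d_model, d_vocab)  # often weight-tied to the token embedding

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # resid: [batch, seq, d_model] output of the core BERT encoder
        return self.unembed(self.ln(self.act(self.dense(resid))))
```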

@fzyzcjy

fzyzcjy commented May 4, 2024

Hi, are there any updates? I do hope T5 can be supported. Thanks!

@neelnanda-io
Collaborator

neelnanda-io commented May 4, 2024 via email

@fzyzcjy

fzyzcjy commented May 4, 2024

@neelnanda-io Thank you! I am interested and hope to squeeze in some time to try this (but I can't guarantee it).

What's the use case?

Many tasks, such as machine translation, use an encoder-decoder architecture instead of a decoder-only one.

@neelnanda-io
Collaborator

neelnanda-io commented May 4, 2024 via email

@fzyzcjy

fzyzcjy commented May 4, 2024

I see. Thank you!
