align gpt-j layernorm to hf #481

Merged · 13 commits merged into main on Dec 6, 2022

Conversation

sweinbach (Contributor)

Looking deeper into the gpt-j residual implementation, I found a delta in the way the layernorm(s) are applied: I don't see the point in applying two separate layernorm modules to the hidden_states (x).

Compare the HF implementation:
https://github.com/huggingface/transformers/blob/a94105f95fb66ee4129077c03e4e8a224f6a07fd/src/transformers/models/gptj/modeling_gptj.py#L279

Is there a reason for having two layernorms? Am I completely off?
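
For reference, roughly what the two variants look like (simplified, illustrative PyTorch only; these are not the actual NeoX or HF modules):

```python
import torch.nn as nn

class ParallelBlockTwoNorms(nn.Module):
    """Roughly the current NeoX gpt-j residual path: two separate
    LayerNorms are applied to the same hidden states x."""
    def __init__(self, hidden_size, attn, mlp):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.ln_mlp = nn.LayerNorm(hidden_size)
        self.attn, self.mlp = attn, mlp

    def forward(self, x):
        # attention and MLP branches each get their own normalization of x
        return x + self.attn(self.ln_attn(x)) + self.mlp(self.ln_mlp(x))


class ParallelBlockOneNorm(nn.Module):
    """Roughly the HF GPTJBlock behaviour: a single LayerNorm output
    feeds both the attention branch and the MLP branch."""
    def __init__(self, hidden_size, attn, mlp):
        super().__init__()
        self.ln_1 = nn.LayerNorm(hidden_size)
        self.attn, self.mlp = attn, mlp

    def forward(self, x):
        h = self.ln_1(x)
        return x + self.attn(h) + self.mlp(h)
```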

sweinbach requested a review from a team as a code owner on December 15, 2021 at 13:20
sdtblck (Contributor) commented Dec 17, 2021

Good catch, a couple of other people have already noted this. There's no reason, just an oversight. Although I suspect it can't really hurt, aside from possibly a slight efficiency degradation.

The problem is, we already have models trained with this config, and naively merging this would break backward compatibility. So I'm not really sure how best to handle this. Any ideas?

sweinbach (Contributor, Author)

> Good catch, a couple of other people have already noted this. There's no reason, just an oversight. Although I suspect it can't really hurt, aside from possibly a slight efficiency degradation.
>
> The problem is, we already have models trained with this config, and naively merging this would break backward compatibility. So I'm not really sure how best to handle this. Any ideas?

Let’s add a neox parameter that defaults to two layernorms, with the single layernorm as the non-default option. I can do that Monday; pretty spent for the week.

sdtblck (Contributor) commented Dec 18, 2021

> > Good catch, a couple of other people have already noted this. There's no reason, just an oversight. Although I suspect it can't really hurt, aside from possibly a slight efficiency degradation.
> >
> > The problem is, we already have models trained with this config, and naively merging this would break backward compatibility. So I'm not really sure how best to handle this. Any ideas?
>
> Let’s add a neox parameter that defaults to two layernorms, with the single layernorm as the non-default option. I can do that Monday; pretty spent for the week.

We can always make the one layernorm the default, and just patch the checkpoint / config files for 20B...

Not sure.

StellaAthena (Member)

I think it largely comes down to how many more models we expect to train with this codebase / how much we expect others to use it. If we don’t train many more models after the scaling suite and don’t expect others to use it, leave it as is. But the more additional models we train the more annoying the weird default is, and if we expect others to use the code it’s going to cause problems.

sweinbach (Contributor, Author)

> > Good catch, a couple of other people have already noted this. There's no reason, just an oversight. Although I suspect it can't really hurt, aside from possibly a slight efficiency degradation.
> >
> > The problem is, we already have models trained with this config, and naively merging this would break backward compatibility. So I'm not really sure how best to handle this. Any ideas?
>
> Let’s add a neox parameter that defaults to two layernorms, with the single layernorm as the non-default option. I can do that Monday; pretty spent for the week.

Sorry, I did not get to it so far; it is not very high on my list of priorities. Feel free to take over or close the PR. We can always reopen…

StellaAthena (Member)

It looks like this makes no difference in LM training. I want to try removing the second norm from the model we just trained and see whether it literally makes no impact or if it's just approximately zero impact. I'm pretty in favor of the change to a single norm though, as it's quite untidy to have a second norm out of nowhere.

[Screenshot attached: Screen Shot 2022-02-01 at 9.41.04 PM]

sweinbach (Contributor, Author)

> It looks like this makes no difference in LM training. I want to try removing the second norm from the model we just trained and see whether it literally makes no impact or if it's just approximately zero impact. I'm pretty in favor of the change to a single norm though, as it's quite untidy to have a second norm out of nowhere.
>
> [Screenshot attached: Screen Shot 2022-02-01 at 9.41.04 PM]

You mean just remove it in a trained model? The layernorm has trainable parameters (weight and bias). I assume this will not work.

StellaAthena (Member)

> > It looks like this makes no difference in LM training. I want to try removing the second norm from the model we just trained and see whether it literally makes no impact or if it's just approximately zero impact. I'm pretty in favor of the change to a single norm though, as it's quite untidy to have a second norm out of nowhere.
> >
> > [Screenshot attached: Screen Shot 2022-02-01 at 9.41.04 PM]
>
> You mean just remove it in a trained model? The layernorm has trainable parameters (weight and bias). I assume this will not work.

It does, but at least in theory those can be propagated up one layer without disturbing the overall computation.
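
Roughly, something like this (a sketch assuming a parallel-residual block where both norms see the same input x, the MLP begins with a plain nn.Linear, and the attention-side norm weight has no zero entries; the function and attribute names are hypothetical):

```python
import torch

@torch.no_grad()
def fold_second_norm(ln_attn, ln_mlp, mlp_fc_in):
    """Rewrite the MLP's first linear layer so that feeding it ln_attn(x)
    reproduces what it used to compute from ln_mlp(x).

    Both norms see the same input x, so they share the normalized n(x) and
    differ only in their affine parameters: ln_i(x) = g_i * n(x) + b_i.
    We need  W' @ ln_attn(x) + c'  ==  W @ ln_mlp(x) + c  for all x, i.e.
        W' = W @ diag(g_mlp / g_attn)
        c' = c + W @ b_mlp - W' @ b_attn
    """
    g_a, b_a = ln_attn.weight, ln_attn.bias
    g_m, b_m = ln_mlp.weight, ln_mlp.bias
    W, c = mlp_fc_in.weight, mlp_fc_in.bias

    W_new = W * (g_m / g_a)               # rescales the columns of W
    c_new = c + W @ b_m - W_new @ b_a
    mlp_fc_in.weight.copy_(W_new)
    mlp_fc_in.bias.copy_(c_new)
```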

StellaAthena (Member)

@sdtblck @sweinbach I'm going to make the executive decision that the error will be corrected by default. I've created a tag to pin the old version, and anyone using the bugged version can use the tagged release.

@sweinbach Can you confirm that this PR is still ready to go, get it up-to-date with the latest commit on main, and then ping me? I'll add text to the readme explaining the breaking change, link to tagged version, and we'll merge it.

Quentin-Anthony (Member) left a review comment

These changes look good to me. I think we're clear to merge.

StellaAthena (Member)

@Quentin-Anthony the thing we need to be careful about here is backwards compatibility. We know that this will break past models, and will cause a small amount of drift with the HF implementation.

I don’t think that there’s a way to convert the untied set-up to the tied set-up. If that’s the case, I see two options:

  1. Tying the layernorm, but introducing a config that allows you to toggle it
  2. Leaving the codebase as is.

I think option 1 is the right thing to do, theoretically. The question is whether it's the right thing to do pragmatically.

Quentin-Anthony (Member)

> @Quentin-Anthony the thing we need to be careful about here is backwards compatibility. We know that this will break past models, and will cause a small amount of drift with the HF implementation.
>
> I don't think that there's a way to convert the untied set-up to the tied set-up. If that's the case, I see two options:
>
> 1. Tying the layernorm, but introducing a config that allows you to toggle it
> 2. Leaving the codebase as is.
>
> I think option 1 is the right thing to do, theoretically. The question is whether it's the right thing to do pragmatically.

@StellaAthena -- I see. I think that a good middle-of-the-road strategy would be to go with option 1 you listed, yet set the toggle config option to remain untied by default. That way most users don't need to make a change, but we'll tie for any future models we intend to release.

StellaAthena (Member)

> I think that a good middle-of-the-road strategy would be to go with option 1 you listed, yet set the toggle config option to remain untied by default. That way most users don't need to make a change, but we'll tie for any future models we intend to release.

My main issue with that is that people will use the default configs the overwhelming majority of the time. The population of people who have trained their own GPT-NeoX models is quite small, and I would assume that most people just use the config file we provide uncritically.

I’ll go add the code and we can quibble about the default behavior later.
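
A sketch of what that might look like (the flag and attribute names here are illustrative, not necessarily what ends up in the PR):

```python
import torch.nn as nn

class ParallelTransformerLayerSketch(nn.Module):
    """Simplified toggleable layer: gpt_j_tied=False keeps the old
    two-layernorm behaviour (backwards compatible); gpt_j_tied=True
    matches HF GPT-J by reusing one norm for both branches."""
    def __init__(self, hidden_size, attn, mlp, gpt_j_tied=False):
        super().__init__()
        self.gpt_j_tied = gpt_j_tied
        self.input_layernorm = nn.LayerNorm(hidden_size)
        # only build the second norm when it is actually used
        self.post_attention_layernorm = (
            None if gpt_j_tied else nn.LayerNorm(hidden_size)
        )
        self.attn, self.mlp = attn, mlp

    def forward(self, x):
        h_attn = self.input_layernorm(x)
        h_mlp = h_attn if self.gpt_j_tied else self.post_attention_layernorm(x)
        return x + self.attn(h_attn) + self.mlp(h_mlp)
```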

StellaAthena (Member) commented Nov 29, 2022

I didn't update the 20b.yml or Pythia config files yet because they don't actually exist on this branch. But with tying defaulting to False this should be fully backwards compatible.

We will need to update both the HF conversion script and the HF library to allow for this to be configured. For now, let's just raise a warning when the conversion script is called.
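
For the warning, something along these lines would do (a rough sketch, assuming the conversion script has access to the NeoX config as a dict; the function name and config key are placeholders):

```python
import warnings

def convert_neox_to_hf(neox_config, state_dict):
    # Hypothetical guard: the HF GPT-J block applies a single layernorm to
    # both branches, so converting a checkpoint trained with the old untied
    # two-layernorm setup cannot be exactly faithful.
    if not neox_config.get("gpt-j-tied", False):
        warnings.warn(
            "This checkpoint was trained with two untied layernorms per "
            "layer; the HF GPT-J implementation uses a single layernorm, so "
            "the converted model's outputs may drift slightly from the "
            "original."
        )
    # ... rest of the conversion ...
```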

StellaAthena (Member)

@Quentin-Anthony I think we’re good to go here? Unless you want to double check my latest commit

Quentin-Anthony (Member)

> @Quentin-Anthony I think we’re good to go here? Unless you want to double check my latest commit

Yep we're good to go. Just took a look.

Quentin-Anthony merged commit 0accac6 into main on Dec 6, 2022
Quentin-Anthony deleted the gpt_j_layernorm_fix branch on December 6, 2022 at 02:39