align gpt-j layernorm to hf #481
Conversation
Good catch; a couple of other people have already noted this. There's no reason for it, just an oversight, though I suspect it can't really hurt beyond possibly a slight efficiency degradation. The problem is that we already have models trained with this config, and naively merging this would break backward compatibility. So I'm not really sure how best to handle it. Any ideas?
Let’s add a neox parameter that defaults to the two layernorms, with the single layernorm as the non-default option. I can do that Monday; I'm pretty spent for the week.
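A minimal sketch of what such a parameter could look like, assuming a dataclass-style args object; the name `gpt_j_tied` and its placement are illustrative guesses, not the actual NeoX configuration class:

```python
# Hypothetical NeoX args field for the proposed toggle (names are
# assumptions for illustration, not the real NeoX configuration class).
from dataclasses import dataclass

@dataclass
class NeoXArgsModel:
    gpt_j_residual: bool = False
    # False -> two separate layernorms (historical behavior, kept as the
    #          default so existing checkpoints keep loading unchanged)
    # True  -> single shared layernorm, matching the HF GPT-J block
    gpt_j_tied: bool = False
```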
We can always make the one layernorm the default, and just patch the checkpoint / config files for 20B... Not sure.
I think it largely comes down to how many more models we expect to train with this codebase / how much we expect others to use it. If we don’t train many more models after the scaling suite and don’t expect others to use it, leave it as is. But the more additional models we train the more annoying the weird default is, and if we expect others to use the code it’s going to cause problems.
Sorry, I did not get to it so far; it is not very high on my list of priorities. Feel free to take over or close the PR. We can always reopen…
It looks like this makes no difference in LM training. I want to try removing the second norm from the model we just trained and see whether it literally makes no impact or the impact is just approximately zero. I'm strongly in favor of the change to a single norm though, as it's quite untidy to have a second norm appear out of nowhere.
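A minimal sketch of one way to run that check, assuming a loaded NeoX model whose layers expose `input_layernorm` and `post_attention_layernorm` attributes (these names are assumptions); it ties the second norm to the first and measures how far the outputs move:

```python
import torch

@torch.no_grad()
def logit_drift(model, batch):
    """Compare outputs before and after tying the two layernorms (sketch)."""
    ref = model(batch)  # forward pass with the original two untied norms
    for layer in model.layers:
        # Copy the first norm's affine parameters into the second, so both
        # branches see identically normalized hidden states.
        layer.post_attention_layernorm.weight.copy_(layer.input_layernorm.weight)
        layer.post_attention_layernorm.bias.copy_(layer.input_layernorm.bias)
    tied = model(batch)
    # ~0 only if the two norms actually converged to the same parameters.
    return (ref - tied).abs().max().item()
```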
update from main
@sdtblck @sweinbach I'm going to make the executive decision that the error will be corrected by default. I've created a tag to pin the old version, and anyone using the bugged version can use the tagged release. @sweinbach Can you confirm that this PR is still ready to go, and get it up-to-date with the latest commit on `main`?
These changes look good to me. I think we're clear to merge.
@Quentin-Anthony the thing we need to be careful about here is backwards compatibility. We know that this will break past models, and will cause a small amount of drift with the HF implementation. I don’t think that there’s a way to convert the untied set-up to the tied set-up. If that’s the case, I see two options:
I think option 1 is the right thing to do theoretically. The question is whether it's the right thing to do pragmatically.
@StellaAthena -- I see. I think that a good middle-of-the-road strategy would be to go with option 1 you listed, but set the toggle config option to remain untied by default. That way most users don't need to make a change, but we'll tie for any future models we intend to release.
My main issue with that is that people will use the default configs the overwhelming majority of the time. The population of people who have trained their own GPT-NeoX models is quite small, and I would assume that most people just use the config files we provide uncritically. I’ll go add the code and we can quibble about the default behavior later.
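A minimal sketch of how the toggle could plug into the gpt-j-residual block, assuming standard PyTorch modules; the class and attribute names are illustrative, not the actual NeoX code:

```python
import torch.nn as nn

class GPTJResidualBlock(nn.Module):
    """Illustrative gpt-j-residual block with a tied/untied layernorm toggle."""

    def __init__(self, hidden_size, attention, mlp, tied_layernorm=False):
        super().__init__()
        self.attention = attention
        self.mlp = mlp
        self.tied_layernorm = tied_layernorm
        self.input_layernorm = nn.LayerNorm(hidden_size)
        # Only used for the historical untied behavior.
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        residual = x
        if self.tied_layernorm:
            # HF-aligned: a single layernorm feeds both branches.
            h = self.input_layernorm(x)
            attn_out, mlp_out = self.attention(h), self.mlp(h)
        else:
            # Historical NeoX behavior: each branch normalizes x separately.
            attn_out = self.attention(self.input_layernorm(x))
            mlp_out = self.mlp(self.post_attention_layernorm(x))
        return residual + attn_out + mlp_out
```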
I didn't update the 20b.yml or Pythia config files yet because they don't actually exist on this branch. But given the new tying default, we will need to update both the HF conversion script and the HF library to allow for this to be configured. For now, let's just raise a warning when the conversion script is called.
Raise error in HF conversion script if using single tied layernorm with GPT-J residual
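A minimal sketch of the guard that commit describes, reusing the hypothetical `gpt_j_residual` / `gpt_j_tied` names from the earlier sketch; the real script's argument object and flag names may differ:

```python
def assert_hf_convertible(neox_args):
    """Refuse conversion for configs the current HF class cannot represent."""
    if getattr(neox_args, "gpt_j_residual", False) and getattr(
        neox_args, "gpt_j_tied", False
    ):
        raise ValueError(
            "Checkpoint uses the gpt-j residual with a single tied layernorm; "
            "the HF conversion does not support this configuration yet, so "
            "converting would produce an incorrect model."
        )
```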
@Quentin-Anthony I think we’re good to go here? Unless you want to double-check my latest commit.
Yep, we're good to go. Just took a look.
Looking deeper into the gpt-j residual implementation, I found a delta in the way the layernorm(s) are applied: I don't see the point in applying two separate layernorm modules to the same hidden_states (x).
Compare the HF implementation:
https://github.com/huggingface/transformers/blob/a94105f95fb66ee4129077c03e4e8a224f6a07fd/src/transformers/models/gptj/modeling_gptj.py#L279
Is there a reason for having two layernorms? Am I completely off?
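For reference, a runnable paraphrase of the two variants under discussion; `attention` and `mlp` are stand-in linear layers, not the real sublayers:

```python
import torch
import torch.nn as nn

d = 16
x = torch.randn(2, 4, d)                           # hidden_states
ln_1, ln_2 = nn.LayerNorm(d), nn.LayerNorm(d)
attention, mlp = nn.Linear(d, d), nn.Linear(d, d)  # stand-ins for the sublayers

# NeoX gpt-j residual as currently written: two separate layernorm
# modules applied to the same hidden_states (x).
two_norm = x + attention(ln_1(x)) + mlp(ln_2(x))

# HF GPT-J (linked above): a single layernorm feeds both branches.
h = ln_1(x)
one_norm = x + attention(h) + mlp(h)
```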