align gpt-j layernorm to hf #481
Conversation
Good catch; a couple of other people have already noted this. There's no reason for it, just an oversight, though I suspect it can't really hurt beyond possibly a slight efficiency degradation. The problem is that we already have models trained with this config, and naively merging this would break backward compatibility. So I'm not really sure how best to handle it. Any ideas?
Let’s add a neox parameter that defaults to the two layernorms, with the single layernorm as the non-default option. I can do that Monday; I'm pretty spent for the week.
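A minimal sketch of what such a parameter could look like, assuming a dataclass-style args object; the name `gpt_j_tied` and its placement are illustrative guesses, not the actual NeoX configuration class:

```python
# Hypothetical NeoX args field for the proposed toggle (names are
# assumptions for illustration, not the real NeoX configuration class).
from dataclasses import dataclass

@dataclass
class NeoXArgsModel:
    gpt_j_residual: bool = False
    # False -> two separate layernorms (historical behavior, kept as the
    #          default so existing checkpoints keep loading unchanged)
    # True  -> single shared layernorm, matching the HF GPT-J block
    gpt_j_tied: bool = False
```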
We can always make the one layernorm the default, and just patch the checkpoint / config files for 20B... Not sure.
I think it largely comes down to how many more models we expect to train with this codebase / how much we expect others to use it. If we don’t train many more models after the scaling suite and don’t expect others to use it, leave it as is. But the more additional models we train the more annoying the weird default is, and if we expect others to use the code it’s going to cause problems.
Sorry, I did not get to it so far; it is not very high on my list of priorities. Feel free to take over or close the PR. We can always reopen…
It looks like this makes no difference in LM training. I want to try removing the second norm from the model we just trained and see whether it literally makes no impact or the impact is just approximately zero. I'm strongly in favor of the change to a single norm though, as it's quite untidy to have a second norm appear out of nowhere.
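A minimal sketch of one way to run that check, assuming a loaded NeoX model whose layers expose `input_layernorm` and `post_attention_layernorm` attributes (these names are assumptions); it ties the second norm to the first and measures how far the outputs move:

```python
import torch

@torch.no_grad()
def logit_drift(model, batch):
    """Compare outputs before and after tying the two layernorms (sketch)."""
    ref = model(batch)  # forward pass with the original two untied norms
    for layer in model.layers:
        # Copy the first norm's affine parameters into the second, so both
        # branches see identically normalized hidden states.
        layer.post_attention_layernorm.weight.copy_(layer.input_layernorm.weight)
        layer.post_attention_layernorm.bias.copy_(layer.input_layernorm.bias)
    tied = model(batch)
    # ~0 only if the two norms actually converged to the same parameters.
    return (ref - tied).abs().max().item()
```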
update from main
@sdtblck @sweinbach I'm going to make the executive decision that the error will be corrected by default. I've created a tag to pin the old version, and anyone using the bugged version can use the tagged release. @sweinbach Can you confirm that this PR is still ready to go, and get it up-to-date with the latest commit on `main`?
These changes look good to me. I think we're clear to merge.
@Quentin-Anthony the thing we need to be careful about here is backwards compatibility. We know that this will break past models, and will cause a small amount of drift with the HF implementation. I don’t think that there’s a way to convert the untied set-up to the tied set-up. If that’s the case, I see two options:
I think option 1 is the right thing to do theoretically. The question is whether it's the right thing to do pragmatically.
@StellaAthena -- I see. I think that a good middle-of-the-road strategy would be to go with option 1 you listed, but set the toggle config option to remain untied by default. That way most users don't need to make a change, but we'll tie for any future models we intend to release.
My main issue with that is that people will use the default configs the overwhelming majority of the time. The population of people who have trained their own GPT-NeoX models is quite small, and I would assume that most people just use the config files we provide uncritically. I’ll go add the code and we can quibble about the default behavior later.
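A minimal sketch of how the toggle could plug into the gpt-j-residual block, assuming standard PyTorch modules; the class and attribute names are illustrative, not the actual NeoX code:

```python
import torch.nn as nn

class GPTJResidualBlock(nn.Module):
    """Illustrative gpt-j-residual block with a tied/untied layernorm toggle."""

    def __init__(self, hidden_size, attention, mlp, tied_layernorm=False):
        super().__init__()
        self.attention = attention
        self.mlp = mlp
        self.tied_layernorm = tied_layernorm
        self.input_layernorm = nn.LayerNorm(hidden_size)
        # Only used for the historical untied behavior.
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        residual = x
        if self.tied_layernorm:
            # HF-aligned: a single layernorm feeds both branches.
            h = self.input_layernorm(x)
            attn_out, mlp_out = self.attention(h), self.mlp(h)
        else:
            # Historical NeoX behavior: each branch normalizes x separately.
            attn_out = self.attention(self.input_layernorm(x))
            mlp_out = self.mlp(self.post_attention_layernorm(x))
        return residual + attn_out + mlp_out
```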
I didn't update the 20b.yml or Pythia config files yet because they don't actually exist on this branch. But given the new tying default, we will need to update both the HF conversion script and the HF library to allow for this to be configured. For now, let's just raise a warning when the conversion script is called.
Raise error in HF conversion script if using single tied layernorm with GPT-J residual
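A minimal sketch of the guard that commit describes, reusing the hypothetical `gpt_j_residual` / `gpt_j_tied` names from the earlier sketch; the real script's argument object and flag names may differ:

```python
def assert_hf_convertible(neox_args):
    """Refuse conversion for configs the current HF class cannot represent."""
    if getattr(neox_args, "gpt_j_residual", False) and getattr(
        neox_args, "gpt_j_tied", False
    ):
        raise ValueError(
            "Checkpoint uses the gpt-j residual with a single tied layernorm; "
            "the HF conversion does not support this configuration yet, so "
            "converting would produce an incorrect model."
        )
```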
@Quentin-Anthony I think we’re good to go here? Unless you want to double-check my latest commit.
Yep, we're good to go. Just took a look.
Looking deeper into the gpt-j residual implementation, I found a delta in the way the layernorm(s) are applied: I don't see the point in applying two separate layernorm modules to the same hidden_states (x).
Compare the HF implementation:
https://github.com/huggingface/transformers/blob/a94105f95fb66ee4129077c03e4e8a224f6a07fd/src/transformers/models/gptj/modeling_gptj.py#L279
Is there a reason for having two layernorms? Am I completely off?
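For reference, a runnable paraphrase of the two variants under discussion; `attention` and `mlp` are stand-in linear layers, not the real sublayers:

```python
import torch
import torch.nn as nn

d = 16
x = torch.randn(2, 4, d)                           # hidden_states
ln_1, ln_2 = nn.LayerNorm(d), nn.LayerNorm(d)
attention, mlp = nn.Linear(d, d), nn.Linear(d, d)  # stand-ins for the sublayers

# NeoX gpt-j residual as currently written: two separate layernorm
# modules applied to the same hidden_states (x).
two_norm = x + attention(ln_1(x)) + mlp(ln_2(x))

# HF GPT-J (linked above): a single layernorm feeds both branches.
h = ln_1(x)
one_norm = x + attention(h) + mlp(h)
```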