Fix GPTNeoX-20B training #2240

Merged — dvruette merged 27 commits from training_gpt_neox into main on Mar 31, 2023

Conversation

@dvruette (Collaborator) commented Mar 26, 2023

Notable config settings that are required to get GPTNeoX-20B to behave nicely:

  • bf16: when training in fp16, gradients often over/underflow, leading to bad convergence. (bf16 could potentially also improve the loss of other models; we should try it.)
  • More warmup steps: a gentle warmup is required to fill the optimizer buffers. Right now we achieve this simply by using more warmup steps, but in the future a linear LR warmup schedule could be used for the same effect.
  • Gradient checkpointing is faster than gradient accumulation. This could also translate to other models.
  • Stage 3 is required to fit the 20B model in bf16. For some reason, fp16 fit with stage 2 but bf16 didn't. (A config sketch covering these settings follows this list.)
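
For concreteness, a minimal sketch (not this PR's actual config) of how these settings map onto a Hugging Face `TrainingArguments` + DeepSpeed setup; the concrete numbers (warmup steps, batch size, learning rate) and the output path are placeholders:

```python
from transformers import TrainingArguments

# Sketch of a DeepSpeed ZeRO stage-3 config with bf16 enabled.
ds_config = {
    "bf16": {"enabled": True},          # bf16 instead of fp16 to avoid gradient over/underflow
    "zero_optimization": {"stage": 3},  # stage 3 needed to fit the 20B model in bf16
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="output/gpt-neox-20b-sft",  # placeholder path
    bf16=True,
    warmup_steps=1000,            # placeholder; "more warmup steps" per the note above
    gradient_checkpointing=True,  # preferred over gradient accumulation here
    per_device_train_batch_size=1,
    learning_rate=8e-6,           # placeholder
    deepspeed=ds_config,
)
```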

Other learnings:

  • Residual dropout quickly degrades performance of the pre-trained model. Even `p=0.1` leads to an initial loss of > 5.
  • Flash attention does not play well with the 20B model. There are slight numerical differences between flash attention and the GPTNeoX attention implementation that accumulate layer-by-layer, ultimately leading to vastly different results. (A sketch of how this drift can be observed follows at the end of this comment.)
  • Run in bf16 with regular implementation: https://wandb.ai/open-assistant/supervised-finetuning/runs/x2pczqa9
  • Run in bf16 with flash attention: https://wandb.ai/open-assistant/supervised-finetuning/runs/cvv3edm8
  • Run in fp32 with regular implementation: https://wandb.ai/open-assistant/supervised-finetuning/runs/shrzz3xp

EDIT: Updated the fp32 comparison run. It actually behaves nicely, just like bf16, so the most likely explanation for flash attention not working is accumulating numerical errors.
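
A rough sketch (not part of this PR) of how the layer-by-layer drift could be observed. The model name is a smaller GPTNeoX-style stand-in, and the commented-out `patch_model(...)` call stands for the repo helper touched by this PR (its import path and exact signature are omitted here), so treat both as assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Smaller GPTNeoX-style model as a stand-in for a quick numerics check.
name = "EleutherAI/pythia-1.4b"
tok = AutoTokenizer.from_pretrained(name)
inputs = tok("A short prompt for a numerics check.", return_tensors="pt").to("cuda")

ref = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("cuda").eval()
patched = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("cuda").eval()
# patch_model(patched, flash_attention=True)  # repo helper; swaps in the flash attention path

with torch.no_grad():
    h_ref = ref(**inputs, output_hidden_states=True).hidden_states
    h_new = patched(**inputs, output_hidden_states=True).hidden_states

# If the patched attention drifts, the max abs difference grows with depth.
for i, (a, b) in enumerate(zip(h_ref, h_new)):
    print(f"layer {i:02d}: max abs diff = {(a - b).abs().max().item():.3e}")
```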

@sanagno (Collaborator) commented Mar 27, 2023

Are all the requirements necessary?

If I run `pipreqs`, I get:

bitsandbytes==0.36.0.post2
datasets==2.8.0
deepspeed==0.7.7
evaluate==0.4.0
fastlangid==1.0.11
flash_attn==0.2.8
gdown==4.7.1
ipython==8.11.0
model_training==1.0.0
numpy==1.24.2
pytest==7.2.2
PyYAML==6.0
scikit_learn==1.2.2
sentencepiece==0.1.97
tokenizers==0.13.2
torch==1.13.1
tqdm==4.65.0
transformers==4.26.1
trlx==0.3.0

@dvruette (Collaborator, Author) commented:

Maybe not. I think @andreaskoepf added them at some point; maybe I wasn't supposed to commit them.

@andreaskoepf (Collaborator) commented:

> Maybe not. I think @andreaskoepf added them at some point; maybe I wasn't supposed to commit them.

If it was me who changed requirements.txt, it was not intentional... IMO we should only add the requirements that are really necessary.

@dvruette (Collaborator, Author) commented:

Ok, I'll adopt the requirements suggested by @sanagno.

@@ -127,6 +127,9 @@ def patch_model(
    if resid_pdrop is not None and (resid_pdrop < 0 or resid_pdrop > 1.0):
        raise ValueError("Invalid argument: `resid_pdrop` must be between 0.0 and 1.0")

    if not flash_attention and (resid_pdrop is None or resid_pdrop == 0.0):
Collaborator: `and not resid_pdrop` would also be possible.

Collaborator (Author): I personally like the explicitness, but can change it if preferred.
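
For reference, a quick check (not part of the PR) that the two spellings accept exactly the same values, given that the validation above restricts `resid_pdrop` to `None` or a float in `[0.0, 1.0]`:

```python
# resid_pdrop is None or a float in [0.0, 1.0] at this point, so the explicit
# check and the terse `not resid_pdrop` are interchangeable.
for resid_pdrop in (None, 0.0, 0.05, 0.1, 1.0):
    explicit = resid_pdrop is None or resid_pdrop == 0.0
    terse = not resid_pdrop
    assert explicit == terse
```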

@@ -15,18 +15,27 @@ dependencies = [
"datasets==2.8.0",
"deepspeed==0.7.7",
@andreaskoepf (Collaborator): I am currently training with a newer version of deepspeed everywhere, since the new HF transformers version complained about the old deepspeed version. IMO we can safely update this to the latest version.

@andreaskoepf (Collaborator) left a review comment:

Looks good. Left a comment regarding the deepspeed version, but that could also be changed in a separate PR.

@dvruette enabled auto-merge (squash) March 31, 2023 10:58
@dvruette merged commit d9ad23c into main on Mar 31, 2023
@dvruette deleted the training_gpt_neox branch March 31, 2023 11:00
yk pushed a commit that referenced this pull request on Apr 2, 2023.