Fix GPTNeoX-20B training #2240

Merged — dvruette merged 27 commits from training_gpt_neox into main on Mar 31, 2023

Conversation

@dvruette (Collaborator) commented Mar 26, 2023

Notable config settings that are required to get GPTNeoX-20B to behave nicely:

  • bf16: when training in fp16, gradients often over/underflow, leading to bad convergence. (bf16 could potentially also improve the loss of other models; we should try it.)
  • More warmup steps: a gentle warmup is required to fill the optimizer buffers. Right now we achieve this simply by using more warmup steps, but in the future a linear LR warmup schedule could be used for the same effect.
  • Gradient checkpointing is faster than gradient accumulation. This could also translate to other models.
  • Stage 3 is required to fit the 20B model in bf16. For some reason, fp16 fit with stage 2 but bf16 didn't. (A config sketch covering these settings follows this list.)
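
For concreteness, a minimal sketch (not this PR's actual config) of how these settings map onto a Hugging Face `TrainingArguments` + DeepSpeed setup; the concrete numbers (warmup steps, batch size, learning rate) and the output path are placeholders:

```python
from transformers import TrainingArguments

# Sketch of a DeepSpeed ZeRO stage-3 config with bf16 enabled.
ds_config = {
    "bf16": {"enabled": True},          # bf16 instead of fp16 to avoid gradient over/underflow
    "zero_optimization": {"stage": 3},  # stage 3 needed to fit the 20B model in bf16
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="output/gpt-neox-20b-sft",  # placeholder path
    bf16=True,
    warmup_steps=1000,            # placeholder; "more warmup steps" per the note above
    gradient_checkpointing=True,  # preferred over gradient accumulation here
    per_device_train_batch_size=1,
    learning_rate=8e-6,           # placeholder
    deepspeed=ds_config,
)
```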

Other learnings:

  • Residual dropout quickly degrades performance of the pre-trained model. Even `p=0.1` leads to an initial loss of > 5.
  • Flash attention does not play well with the 20B model. There are slight numerical differences between flash attention and the GPTNeoX attention implementation that accumulate layer-by-layer, ultimately leading to vastly different results. (A sketch of how this drift can be observed follows at the end of this comment.)
  • Run in bf16 with regular implementation: https://wandb.ai/open-assistant/supervised-finetuning/runs/x2pczqa9
  • Run in bf16 with flash attention: https://wandb.ai/open-assistant/supervised-finetuning/runs/cvv3edm8
  • Run in fp32 with regular implementation: https://wandb.ai/open-assistant/supervised-finetuning/runs/shrzz3xp

EDIT: Updated the fp32 comparison run. It actually behaves nicely, just like bf16, so the most likely explanation for flash attention not working is accumulating numerical errors.
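
A rough sketch (not part of this PR) of how the layer-by-layer drift could be observed. The model name is a smaller GPTNeoX-style stand-in, and the commented-out `patch_model(...)` call stands for the repo helper touched by this PR (its import path and exact signature are omitted here), so treat both as assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Smaller GPTNeoX-style model as a stand-in for a quick numerics check.
name = "EleutherAI/pythia-1.4b"
tok = AutoTokenizer.from_pretrained(name)
inputs = tok("A short prompt for a numerics check.", return_tensors="pt").to("cuda")

ref = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("cuda").eval()
patched = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("cuda").eval()
# patch_model(patched, flash_attention=True)  # repo helper; swaps in the flash attention path

with torch.no_grad():
    h_ref = ref(**inputs, output_hidden_states=True).hidden_states
    h_new = patched(**inputs, output_hidden_states=True).hidden_states

# If the patched attention drifts, the max abs difference grows with depth.
for i, (a, b) in enumerate(zip(h_ref, h_new)):
    print(f"layer {i:02d}: max abs diff = {(a - b).abs().max().item():.3e}")
```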

@sanagno (Collaborator) commented Mar 27, 2023

Are all the requirements necessary?

If I run `pipreqs`, I get:

bitsandbytes==0.36.0.post2
datasets==2.8.0
deepspeed==0.7.7
evaluate==0.4.0
fastlangid==1.0.11
flash_attn==0.2.8
gdown==4.7.1
ipython==8.11.0
model_training==1.0.0
numpy==1.24.2
pytest==7.2.2
PyYAML==6.0
scikit_learn==1.2.2
sentencepiece==0.1.97
tokenizers==0.13.2
torch==1.13.1
tqdm==4.65.0
transformers==4.26.1
trlx==0.3.0

@dvruette (Collaborator, Author) commented:

Maybe not. I think @andreaskoepf added them at some point; maybe I wasn't supposed to commit them.

@andreaskoepf (Collaborator) commented:

> Maybe not. I think @andreaskoepf added them at some point; maybe I wasn't supposed to commit them.

If it was me who changed requirements.txt, it was not intentional... IMO we should only add the requirements that are really necessary.

@dvruette (Collaborator, Author) commented:

Ok, I'll adopt the requirements suggested by @sanagno.

@@ -127,6 +127,9 @@ def patch_model(
    if resid_pdrop is not None and (resid_pdrop < 0 or resid_pdrop > 1.0):
        raise ValueError("Invalid argument: `resid_pdrop` must be between 0.0 and 1.0")

    if not flash_attention and (resid_pdrop is None or resid_pdrop == 0.0):
Collaborator: `and not resid_pdrop` would also be possible.

Collaborator (Author): I personally like the explicitness, but can change it if preferred.
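
For reference, a quick check (not part of the PR) that the two spellings accept exactly the same values, given that the validation above restricts `resid_pdrop` to `None` or a float in `[0.0, 1.0]`:

```python
# resid_pdrop is None or a float in [0.0, 1.0] at this point, so the explicit
# check and the terse `not resid_pdrop` are interchangeable.
for resid_pdrop in (None, 0.0, 0.05, 0.1, 1.0):
    explicit = resid_pdrop is None or resid_pdrop == 0.0
    terse = not resid_pdrop
    assert explicit == terse
```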

@@ -15,18 +15,27 @@ dependencies = [
"datasets==2.8.0",
"deepspeed==0.7.7",
@andreaskoepf (Collaborator): I am currently training with a newer version of deepspeed everywhere, since the new HF transformers version complained about the old deepspeed version. IMO we can safely update this to the latest version.

@andreaskoepf (Collaborator) left a review comment:

Looks good. Left a comment regarding the deepspeed version, but that could also be changed in a separate PR.

@dvruette enabled auto-merge (squash) March 31, 2023 10:58
@dvruette merged commit d9ad23c into main on Mar 31, 2023
@dvruette deleted the training_gpt_neox branch March 31, 2023 11:00
yk pushed a commit that referenced this pull request on Apr 2, 2023.