Fix GPTNeoX-20B training #2240
Conversation
Are all the requirements necessary? If I run pipreqs I get:
Maybe not. I think @andreaskoepf added them at some point; maybe I wasn't supposed to commit them.
If it was me who changed requirements.txt, it was not intentional ... IMO we should only add the requirements that are really necessary.
Ok, I'll adopt the requirements suggested by @sanagno.
@@ -127,6 +127,9 @@ def patch_model(
    if resid_pdrop is not None and (resid_pdrop < 0 or resid_pdrop > 1.0):
        raise ValueError("Invalid argument: `resid_pdrop` must be between 0.0 and 1.0")

    if not flash_attention and (resid_pdrop is None or resid_pdrop == 0.0):
`and not resid_pdrop` would also be possible.
I personally like the explicitness, but I can change it if preferred.
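For context on the suggestion above: both spellings agree for the values in question, since `None` and `0.0` are both falsy in Python. A tiny standalone check (not part of this patch) illustrates the equivalence.

```python
# Standalone illustration (not part of this PR): `not resid_pdrop` is True
# exactly when resid_pdrop is None or 0.0, matching the explicit check.
for resid_pdrop in (None, 0.0, 0.1, 1.0):
    explicit = resid_pdrop is None or resid_pdrop == 0.0
    terse = not resid_pdrop
    assert explicit == terse  # the two forms agree for these values
```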
@@ -15,18 +15,27 @@ dependencies = [
    "datasets==2.8.0",
    "deepspeed==0.7.7",
I am currently training with a newer version of deepspeed everywhere, since the new HF transformers version complained about the old deepspeed version. IMO we can safely update this to the latest version.
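(Illustrative only, not part of this PR: a quick local check of whether the installed pins are in sync is to print both versions.)

```python
# Quick local sanity check (illustrative, not part of this PR): print the
# installed versions to see whether the deepspeed pin still matches what the
# current transformers release expects.
import deepspeed
import transformers

print("transformers:", transformers.__version__)
print("deepspeed:", deepspeed.__version__)
```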
Looks good. Left a comment regarding the deepspeed version, but that could also be changed in a separate PR.
Notable config settings that are required to get GPTNeoX-20B to behave nicely (a rough config sketch follows after the lists below):

- bf16: when training in fp16, gradients often over/underflow, leading to bad convergence. (bf16 could also potentially improve the loss of other models; we should try.)
- More warmup steps: to fill the optimizer buffers, a gentle warmup is required. Right now we do this by just using more warmup steps, but in the future a linear LR warmup schedule could be used to achieve the same effect.
- Gradient checkpointing is faster than gradient accumulation. This could also translate to other models.
- Stage 3 is required to fit the 20B model in bf16. For some reason, fp16 was possible to fit with stage 2 but bf16 wasn't.

Other learnings:

- Residual dropout quickly degrades performance of the pre-trained model. Even `p=0.1` leads to an initial loss of > 5.
- Flash attention does not go too well with the 20B model. There are slight numerical differences between flash attention and the GPTNeoX attention implementation that accumulate layer by layer, ultimately leading to vastly different results.
  - Run in bf16 with regular implementation: https://wandb.ai/open-assistant/supervised-finetuning/runs/x2pczqa9
  - Run in bf16 with flash attention: https://wandb.ai/open-assistant/supervised-finetuning/runs/cvv3edm8
  - Run in fp32 with regular implementation: https://wandb.ai/open-assistant/supervised-finetuning/runs/shrzz3xp

EDIT: Updated the fp32 comparison run. It actually behaves nicely just like bf16, so the most likely explanation for flash attention not working is accumulating errors.
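For a concrete picture of the settings listed above, here is a minimal sketch (not the repository's actual config) that combines bf16, extra warmup steps, gradient checkpointing, and DeepSpeed ZeRO stage 3 via Hugging Face `TrainingArguments`; the output path, warmup step count, and batch size are placeholders.

```python
# Minimal sketch of the settings described above (placeholder values, not the
# project's actual configuration): bf16 training, more warmup steps,
# gradient checkpointing, and DeepSpeed ZeRO stage 3.
import json

from transformers import TrainingArguments

# DeepSpeed side: enable bf16 and ZeRO stage 3 so the 20B parameters and
# optimizer state can be sharded across GPUs.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
with open("ds_bf16_stage3.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Trainer side: bf16 instead of fp16, a longer warmup, and gradient
# checkpointing rather than heavy gradient accumulation.
training_args = TrainingArguments(
    output_dir="out/gpt-neox-20b-sft",  # placeholder path
    bf16=True,                          # avoids the fp16 over/underflow issue
    warmup_steps=1000,                  # placeholder for "more warmup steps"
    gradient_checkpointing=True,
    per_device_train_batch_size=2,      # placeholder
    deepspeed="ds_bf16_stage3.json",
)
```

With a setup like this, the stage 3 partitioning is what lets the 20B parameters plus optimizer state fit in memory, while bf16 avoids the fp16 over/underflow described above.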