
Investigation: Loss spikes as the loss approaches convergence #129

@le1nux

Description


During training of the 3.6B and 7B models with FSDP, we experienced a loss spike as the model was moving towards convergence.

Things that we should check in our implementation:

  • Correctness of gradient clipping with FSDP
  • Exploration of implementation differences between AdamW and Adam in Megatron-LM and PyTorch
  • Weight initialization
  • GPT2 implementation (we could train a small model directly from Hugging Face for comparison)
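On the first point, what correct clipping has to compute is a *global* L2 norm over all parameters; under FSDP each rank only holds a shard of the gradients, so the per-rank squared sums must be all-reduced before scaling (PyTorch exposes `FullyShardedDataParallel.clip_grad_norm_` for this, whereas calling the plain per-rank utility on sharded gradients would under-clip). A minimal single-process sketch of the math, with illustrative names not taken from the codebase:

```python
def clip_global_norm(grads, max_norm, eps=1e-6):
    """Global-norm gradient clipping over a flat list of scalar gradients.

    Sketch only: with FSDP, `total` must be formed from an all-reduce of
    each rank's local sum of squares before the scale factor is applied.
    """
    total = sum(g * g for g in grads) ** 0.5
    scale = max_norm / (total + eps)
    if scale < 1.0:  # only shrink, never grow, the gradients
        grads = [g * scale for g in grads]
    return grads, total
```

If the implementation clips each rank's shard against `max_norm` independently, the effective global norm after clipping can exceed the intended bound, which is one plausible source of instability to rule out.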
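On the second point, the key difference is where weight decay enters the update: Adam with L2 regularization folds `wd * theta` into the gradient, so the decay term gets rescaled by the adaptive denominator, while AdamW applies the decay directly to the parameter. Megatron-LM has historically used Apex's fused optimizers, whose defaults may differ from `torch.optim.AdamW`, so this is worth checking. A scalar sketch of both update rules (hyperparameter values and the `decoupled` flag are illustrative):

```python
import math

def adam_step(theta, g, m, v, t,
              lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01,
              decoupled=False):
    """One optimizer step on a single scalar parameter.

    decoupled=False: Adam with L2 regularization (decay folded into g).
    decoupled=True:  AdamW (decay applied directly to the parameter).
    """
    if not decoupled:
        g = g + wd * theta               # coupled decay (Adam + L2)
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    if decoupled:
        theta = theta - lr * wd * theta  # decoupled decay (AdamW)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Running one step with a zero gradient makes the divergence visible: AdamW shrinks the parameter by exactly `lr * wd * theta`, while coupled Adam's decay passes through the moment estimates and the adaptive denominator, giving a different effective step.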

Metadata

Labels

enhancement (New feature or request), help wanted (Extra attention is needed)
