During the training of the 3.6B and 7B models with FSDP we experienced a loss spike as the model was moving towards convergence.

Things that we should check in our implementation:

- [x] Correctness of gradient clipping with FSDP
- [x] Exploration of implementation differences between AdamW and Adam in Megatron-LM and PyTorch
- [x] Weight initialization
- [ ] ~~GPT2 implementation (we could train a small model directly from Huggingface for comparison)~~
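The AdamW-vs-Adam check above comes down to where weight decay enters the update: Adam with L2 regularization folds the decay into the gradient before the moment estimates, while AdamW applies it directly to the parameter. A minimal single-scalar sketch of the two variants in plain Python (illustrative only, not Megatron-LM's or PyTorch's actual implementation):

```python
import math

def adam_like_step(p, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, wd=0.1, decoupled=False, t=1):
    """One optimizer step on a scalar parameter (illustrative sketch).

    decoupled=False: Adam with L2 regularization -- the decay term is
    added to the gradient, so it flows through the moment estimates.
    decoupled=True:  AdamW-style decoupled weight decay -- the decay is
    applied directly to the parameter and bypasses the moments.
    """
    if not decoupled:
        g = g + wd * p            # L2 penalty enters the moments
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    if decoupled:
        p = p - lr * wd * p       # decay applied outside the moments
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p0, g0 = 1.0, 0.5
adam_p, _, _ = adam_like_step(p0, g0, 0.0, 0.0, decoupled=False)
adamw_p, _, _ = adam_like_step(p0, g0, 0.0, 0.0, decoupled=True)
print(adam_p, adamw_p)  # the two variants produce different parameter values
```

Because the L2 term in the non-decoupled variant is rescaled by the adaptive second moment, the effective decay differs per parameter, which is exactly the kind of subtle divergence between two codebases that can surface late in training.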