During the training of the 3.6B and 7B models with FSDP we experienced a loss spike as the model was moving towards convergence.

Things that we should check in our implementation:

- [x] Correctness of gradient clipping with FSDP
- [x] Exploration of implementation differences between AdamW and Adam in Megatron-LM and PyTorch
- [x] Weight initialization
- [ ] ~~GPT2 implementation (we could train a small model directly from Huggingface for comparison)~~
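The AdamW-vs-Adam check above comes down to where weight decay enters the update: Adam with L2 regularization folds the decay into the gradient before the moment estimates, while AdamW applies it directly to the parameter. A minimal single-scalar sketch of the two variants in plain Python (illustrative only, not Megatron-LM's or PyTorch's actual implementation):

```python
import math

def adam_like_step(p, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, wd=0.1, decoupled=False, t=1):
    """One optimizer step on a scalar parameter (illustrative sketch).

    decoupled=False: Adam with L2 regularization -- the decay term is
    added to the gradient, so it flows through the moment estimates.
    decoupled=True:  AdamW-style decoupled weight decay -- the decay is
    applied directly to the parameter and bypasses the moments.
    """
    if not decoupled:
        g = g + wd * p            # L2 penalty enters the moments
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    if decoupled:
        p = p - lr * wd * p       # decay applied outside the moments
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p0, g0 = 1.0, 0.5
adam_p, _, _ = adam_like_step(p0, g0, 0.0, 0.0, decoupled=False)
adamw_p, _, _ = adam_like_step(p0, g0, 0.0, 0.0, decoupled=True)
print(adam_p, adamw_p)  # the two variants produce different parameter values
```

Because the L2 term in the non-decoupled variant is rescaled by the adaptive second moment, the effective decay differs per parameter, which is exactly the kind of subtle divergence between two codebases that can surface late in training.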