Fix Training Step Logging & Log Number of Consumed Tokens #137
le1nux merged 28 commits into dev_experiments
Conversation
flxst
left a comment
If you look at the config file now, e.g. in L8-11, I think there is a general problem:
For the parameters global_training_log_interval_in_steps, global_checkpointing_interval_in_steps & global_evaluation_interval_in_steps, "steps" corresponds to "optimizer steps". In contrast, for the parameter global_num_seen_steps (and the related skip_num_micro_steps), "steps" refers to "micro batch steps".
This seems confusing. Maybe we should either have this difference explicitly reflected in the names of the parameters (e.g. global_num_seen_steps -> global_num_seen_micro_steps), or make further changes such that "steps" always refers to the same thing.
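To make the distinction concrete, the two step notions differ by the gradient accumulation factor: one optimizer step is taken only after several micro batch steps. A minimal sketch of the conversion, assuming a hypothetical `gradient_accumulation_steps` parameter (the names here are illustrative, not the repo's actual API):

```python
def micro_steps_to_optimizer_steps(num_micro_steps: int, gradient_accumulation_steps: int) -> int:
    """One optimizer step is performed after `gradient_accumulation_steps` micro batch steps."""
    return num_micro_steps // gradient_accumulation_steps


def optimizer_steps_to_micro_steps(num_optimizer_steps: int, gradient_accumulation_steps: int) -> int:
    """Inverse direction: each optimizer step corresponds to several micro batch steps."""
    return num_optimizer_steps * gradient_accumulation_steps


# With gradient accumulation of 4, 100 optimizer steps correspond to 400 micro steps,
# so `global_num_seen_steps` (micro steps) and the logging intervals (optimizer steps)
# count different things.
print(optimizer_steps_to_micro_steps(100, 4))  # 400
print(micro_steps_to_optimizer_steps(400, 4))  # 100
```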
Based on your proposal, I would suggest the following changes:
When changing the batch_size and num_ranks between the previous run and the warmstart, we might see a few samples twice in this case.
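The reason a few samples can be seen twice: the number of previously consumed samples is generally not divisible by the new per-micro-step sample count, so rounding down the skip leaves a small remainder that gets replayed. A sketch under assumed parameter names (not the repo's actual API):

```python
def num_seen_samples(num_micro_steps: int, micro_batch_size: int, num_ranks: int) -> int:
    """Samples consumed so far: each micro step processes micro_batch_size samples per rank."""
    return num_micro_steps * micro_batch_size * num_ranks


def skip_micro_steps_for_warmstart(seen_samples: int, new_micro_batch_size: int, new_num_ranks: int) -> int:
    """Micro steps to skip under the new configuration; floor division can leave a remainder."""
    samples_per_micro_step = new_micro_batch_size * new_num_ranks
    return seen_samples // samples_per_micro_step


# Previous run: 1000 micro steps, micro batch size 8, 4 ranks -> 32000 samples.
prev_seen = num_seen_samples(num_micro_steps=1000, micro_batch_size=8, num_ranks=4)

# Warmstart with micro batch size 6 on 4 ranks: 24 samples per micro step.
skip = skip_micro_steps_for_warmstart(prev_seen, new_micro_batch_size=6, new_num_ranks=4)

# 32000 // 24 = 1333 micro steps cover only 31992 samples, so 8 samples are seen twice.
replayed = prev_seen - skip * 24
print(skip, replayed)
```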
What do you think?
…o FSDPCheckpointSaving
…pointing_interval_in_steps, global_evaluation_interval_in_steps to training_log_interval_in_steps, checkpointing_interval_in_steps, evaluation_interval_in_steps, respectively
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
…lities into fix_logging_steps
No description provided.