
Fix Training Step Logging & Log Number of Consumed Tokens#137

Merged
le1nux merged 28 commits into dev_experiments from fix_logging_steps
Jun 6, 2024
Conversation

@mali-git
Member

No description provided.

@mali-git mali-git changed the title feat: log consumed tokens Fix Training Step Logging & Log Number of Consumed Tokens May 24, 2024
@le1nux le1nux changed the base branch from main to dev_experiments May 27, 2024 09:56
@le1nux le1nux requested a review from flxst May 27, 2024 12:47
@le1nux le1nux added the enhancement New feature or request label May 27, 2024
Member

@flxst flxst left a comment


If you look at the config file now, e.g. here in L8-11, I think there is a general problem:

For the parameters global_training_log_interval_in_steps, global_checkpointing_interval_in_steps & global_evaluation_interval_in_steps, "steps" corresponds to "optimizer steps". In contrast, for the parameter global_num_seen_steps (and the related skip_num_micro_steps), "steps" refers to "micro batch steps".

This seems confusing. Maybe we should either have this difference explicitly reflected in the names of the parameters (e.g. global_num_seen_steps -> global_num_seen_micro_steps), or make further changes such that "steps" always refers to the same thing.

Comment thread src/modalities/config/config.py Outdated
Comment thread src/modalities/dataloader/dataloader_factory.py Outdated
@le1nux
Member

le1nux commented May 29, 2024

> If you look at the config file now, e.g. here in L8-11, I think there is a general problem:
>
> For the parameters global_training_log_interval_in_steps, global_checkpointing_interval_in_steps & global_evaluation_interval_in_steps, "steps" corresponds to "optimizer steps". In contrast, for the parameter global_num_seen_steps (and the related skip_num_micro_steps), "steps" refers to "micro batch steps".
>
> This seems confusing. Maybe we should either have this difference explicitly reflected in the names of the parameters (e.g. global_num_seen_steps -> global_num_seen_micro_steps), or make further changes such that "steps" always refers to the same thing.

Based on your proposal, I would suggest the following changes:

  1. Rename all global steps variables, i.e., global_training_log_interval_in_steps, global_checkpointing_interval_in_steps, global_evaluation_interval_in_steps and global_num_seen_steps, to training_log_interval_in_steps, checkpointing_interval_in_steps, evaluation_interval_in_steps and num_seen_steps, since FSDP and DDP don't have the concept of local vs. global steps. We should still distinguish between local and global batch sizes, though!
  2. Dataloader: to define the number of batches to be skipped by the dataloader during a warmstart, I would suggest we use the variable skip_num_batches instead of skip_num_micro_steps. The dataloader does not have to know about things like (micro) steps. The calculation should happen outside, possibly even manually at the beginning:

```
skip_num_samples = old_batch_size * old_gradient_accumulation_steps * old_num_steps * old_num_ranks
skip_num_batches = skip_num_samples // (current_batch_size * current_num_ranks)
```

When changing the batch_size and num_ranks between the previous run and the warmstart, we might see a few samples twice in this case.

  3. Rename tokens_per_train_step to global_num_tokens_per_train_step

What do you think?
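The skip calculation proposed in point 2 can be sketched as follows. This is a minimal illustration of the two formulas above; the function name and parameter names mirror the comment, not the actual modalities API:

```python
def compute_skip_num_batches(
    old_batch_size: int,
    old_gradient_accumulation_steps: int,
    old_num_steps: int,
    old_num_ranks: int,
    current_batch_size: int,
    current_num_ranks: int,
) -> int:
    """Hypothetical helper: how many batches the warmstart dataloader must skip."""
    # Total samples consumed by the previous run across all ranks.
    skip_num_samples = (
        old_batch_size * old_gradient_accumulation_steps * old_num_steps * old_num_ranks
    )
    # Floor division: if batch_size or num_ranks changed between runs,
    # a few samples at the boundary may be seen twice.
    return skip_num_samples // (current_batch_size * current_num_ranks)
```

For example, a previous run with batch size 4, 2 gradient accumulation steps, 10 optimizer steps and 8 ranks consumed 640 samples; resuming with batch size 8 on 8 ranks, each dataloader skips 10 batches.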

@flxst
Member

flxst commented May 29, 2024

  1. I think that's a good idea.
  2. This would also be an improvement in my opinion. However, can't we go a step further and specify the number of skipped samples (skip_num_samples) directly? Or even skipped tokens (skip_num_tokens)? This would be particularly convenient if the number of consumed samples / tokens was written to the checkpoint. The number of skipped batches (skip_num_batches) may then be derived using the new (global) batch size internally, i.e. the user doesn't need to think about this.
  3. Wouldn't this contradict the logic behind the suggested changes in 1.? "steps" is short for "optimizer steps", so it is clear that tokens_per_train_step refers to the global batch size. We could actually call it global_batch_size as well :)

@le1nux
Member

le1nux commented May 29, 2024

  1. Ok, I'll change this then.
  2. Good idea! If we pass the global_batch_size and context_size to CheckpointExecution, then we can calculate the number of seen samples and the number of seen tokens there and save them as part of the checkpoint file name, e.g., eid_{experiment_id}-{entity}-num_steps_{num_train_steps}-num_samples_{num_samples}-num_tokens_{num_tokens}.bin. For a warmstart, we would pass the number of seen samples or the number of seen tokens to the dataloader factory.
  3. From my point of view this would not contradict it. There is local_num_tokens_per_train_step, which is the number of tokens seen within one step on a single rank, and global_num_tokens_per_train_step, which is local_num_tokens_per_train_step * num_ranks. So "global" and "local" do not refer to the step but to num_tokens.
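The checkpoint naming scheme from point 2 can be sketched like this. Everything here is a hypothetical illustration of the discussion: the function name, parameters and derivations follow the comment, not the actual CheckpointExecution implementation:

```python
def checkpoint_file_name(
    experiment_id: str,
    entity: str,
    num_train_steps: int,
    global_batch_size: int,
    context_size: int,
) -> str:
    """Hypothetical sketch: derive seen samples/tokens and embed them in the file name."""
    # One optimizer step consumes one global batch across all ranks.
    num_samples = num_train_steps * global_batch_size
    # Each sample contributes context_size tokens.
    num_tokens = num_samples * context_size
    return (
        f"eid_{experiment_id}-{entity}-num_steps_{num_train_steps}"
        f"-num_samples_{num_samples}-num_tokens_{num_tokens}.bin"
    )
```

On a warmstart, the num_samples or num_tokens value parsed from this file name would then be handed to the dataloader factory, which derives skip_num_batches from the new global batch size internally.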

@le1nux le1nux assigned le1nux and mali-git and unassigned mali-git Jun 2, 2024
@le1nux le1nux requested a review from flxst June 2, 2024 11:14
@flxst flxst requested review from fromm-m June 3, 2024 08:44
Member

@flxst flxst left a comment


LGTM! I added a trivial fix for the unit tests (d98bb1b) and left some minor comments.

Comment thread src/modalities/trainer.py Outdated
Comment thread src/modalities/trainer.py Outdated
Comment thread src/modalities/utils/number_conversion.py
le1nux and others added 4 commits June 6, 2024 11:59
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
@le1nux le1nux merged commit 34f2cd5 into dev_experiments Jun 6, 2024
@le1nux le1nux deleted the fix_logging_steps branch June 6, 2024 13:52
@le1nux le1nux mentioned this pull request Jun 11, 2024