
Fix Training Step Logging & Log Number of Consumed Tokens#137

Merged
le1nux merged 28 commits into dev_experiments from fix_logging_steps
Jun 6, 2024
Conversation

@mali-git
Member

No description provided.

@mali-git mali-git changed the title feat: log consumed tokens Fix Training Step Logging & Log Number of Consumed Tokens May 24, 2024
@le1nux le1nux changed the base branch from main to dev_experiments May 27, 2024 09:56
@le1nux le1nux requested a review from flxst May 27, 2024 12:47
@le1nux le1nux added the enhancement New feature or request label May 27, 2024
Member

@flxst flxst left a comment


If you look at the config file now, e.g. here in L8-11, I think there is a general problem:

For the parameters global_training_log_interval_in_steps, global_checkpointing_interval_in_steps & global_evaluation_interval_in_steps, "steps" corresponds to "optimizer steps". In contrast, for the parameter global_num_seen_steps (and the related skip_num_micro_steps), "steps" refers to "micro batch steps".

This seems confusing. Maybe we should either have this difference explicitly reflected in the names of the parameters (e.g. global_num_seen_steps -> global_num_seen_micro_steps), or make further changes such that "steps" always refers to the same thing.

Comment thread src/modalities/config/config.py Outdated
Comment thread src/modalities/dataloader/dataloader_factory.py Outdated
@le1nux
Member

le1nux commented May 29, 2024

> If you look at the config file now, e.g. here in L8-11, I think there is a general problem:
>
> For the parameters global_training_log_interval_in_steps, global_checkpointing_interval_in_steps & global_evaluation_interval_in_steps, "steps" corresponds to "optimizer steps". In contrast, for the parameter global_num_seen_steps (and the related skip_num_micro_steps), "steps" refers to "micro batch steps".
>
> This seems confusing. Maybe we should either have this difference explicitly reflected in the names of the parameters (e.g. global_num_seen_steps -> global_num_seen_micro_steps), or make further changes such that "steps" always refers to the same thing.

Based on your proposal, I would suggest the following changes:

  1. Rename all global steps variables, i.e., global_training_log_interval_in_steps, global_checkpointing_interval_in_steps, global_evaluation_interval_in_steps and global_num_seen_steps, to training_log_interval_in_steps, checkpointing_interval_in_steps, evaluation_interval_in_steps and num_seen_steps, since FSDP and DDP don't have the concept of local vs. global steps. We should still distinguish between local and global batch sizes, though!
  2. Dataloader: to define the number of batches to be skipped by the dataloader during a warmstart, I would suggest we use the variable skip_num_batches instead of skip_num_micro_steps. The dataloader does not have to know about things like (micro) steps. The calculation should happen outside, possibly even manually at the beginning:

```
skip_num_samples = old_batch_size * old_gradient_accumulation_steps * old_num_steps * old_num_ranks
skip_num_batches = skip_num_samples // (current_batch_size * current_num_ranks)
```

When changing the batch_size and num_ranks between the previous run and the warmstart, we might see a few samples twice in this case.

  3. Rename tokens_per_train_step to global_num_tokens_per_train_step

What do you think?
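The skip calculation proposed in point 2 can be sketched as follows. This is a minimal illustration of the two formulas above; the function name and parameter names mirror the comment, not the actual modalities API:

```python
def compute_skip_num_batches(
    old_batch_size: int,
    old_gradient_accumulation_steps: int,
    old_num_steps: int,
    old_num_ranks: int,
    current_batch_size: int,
    current_num_ranks: int,
) -> int:
    """Hypothetical helper: how many batches the warmstart dataloader must skip."""
    # Total samples consumed by the previous run across all ranks.
    skip_num_samples = (
        old_batch_size * old_gradient_accumulation_steps * old_num_steps * old_num_ranks
    )
    # Floor division: if batch_size or num_ranks changed between runs,
    # a few samples at the boundary may be seen twice.
    return skip_num_samples // (current_batch_size * current_num_ranks)
```

For example, a previous run with batch size 4, 2 gradient accumulation steps, 10 optimizer steps and 8 ranks consumed 640 samples; resuming with batch size 8 on 8 ranks, each dataloader skips 10 batches.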

@flxst
Member

flxst commented May 29, 2024

  1. I think that's a good idea.
  2. This would also be an improvement in my opinion. However, can't we go a step further and specify the number of skipped samples (skip_num_samples) directly? Or even skipped tokens (skip_num_tokens)? This would be particularly convenient if the number of consumed samples / tokens was written to the checkpoint. The number of skipped batches (skip_num_batches) may then be derived using the new (global) batch size internally, i.e. the user doesn't need to think about this.
  3. Wouldn't this contradict the logic behind the suggested changes in 1.? "steps" is short for "optimizer steps", so it is clear that tokens_per_train_step refers to the global batch size. We could actually call it global_batch_size as well :)

@le1nux
Member

le1nux commented May 29, 2024

  1. Ok, I'll change this then.
  2. Good idea! If we pass the global_batch_size and context_size to CheckpointExecution, then we can calculate the number of seen samples and the number of seen tokens there and save them as part of the checkpoint file name, e.g., eid_{experiment_id}-{entity}-num_steps_{num_train_steps}-num_samples_{num_samples}-num_tokens_{num_tokens}.bin. For a warmstart, we would pass the number of seen samples or the number of seen tokens to the dataloader factory.
  3. From my point of view this would not contradict it. There is local_num_tokens_per_train_step, which is the number of tokens seen within one step on a single rank, and global_num_tokens_per_train_step, which is local_num_tokens_per_train_step * num_ranks. So "global" and "local" do not refer to the step but to num_tokens.
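The checkpoint naming scheme from point 2 can be sketched like this. Everything here is a hypothetical illustration of the discussion: the function name, parameters and derivations follow the comment, not the actual CheckpointExecution implementation:

```python
def checkpoint_file_name(
    experiment_id: str,
    entity: str,
    num_train_steps: int,
    global_batch_size: int,
    context_size: int,
) -> str:
    """Hypothetical sketch: derive seen samples/tokens and embed them in the file name."""
    # One optimizer step consumes one global batch across all ranks.
    num_samples = num_train_steps * global_batch_size
    # Each sample contributes context_size tokens.
    num_tokens = num_samples * context_size
    return (
        f"eid_{experiment_id}-{entity}-num_steps_{num_train_steps}"
        f"-num_samples_{num_samples}-num_tokens_{num_tokens}.bin"
    )
```

On a warmstart, the num_samples or num_tokens value parsed from this file name would then be handed to the dataloader factory, which derives skip_num_batches from the new global batch size internally.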

@le1nux le1nux assigned le1nux and mali-git and unassigned mali-git Jun 2, 2024
@le1nux le1nux requested a review from flxst June 2, 2024 11:14
@flxst flxst requested review from fromm-m June 3, 2024 08:44
Member

@flxst flxst left a comment


LGTM! I added a trivial fix for the unit tests (d98bb1b) and left some minor comments.

Comment thread src/modalities/trainer.py Outdated
Comment thread src/modalities/trainer.py Outdated
Comment thread src/modalities/utils/number_conversion.py
le1nux and others added 4 commits June 6, 2024 11:59
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
@le1nux le1nux merged commit 34f2cd5 into dev_experiments Jun 6, 2024
@le1nux le1nux deleted the fix_logging_steps branch June 6, 2024 13:52
@le1nux le1nux mentioned this pull request Jun 11, 2024