Pstjohn/stop and go test non validation #476

Merged
pstjohn merged 9 commits into NVIDIA-BioNeMo:main from pstjohn:pstjohn/stop-and-go-test-non-validation
Dec 4, 2024
@pstjohn pstjohn commented Nov 26, 2024

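Explicitly test that training inputs and outputs are consistent when we use PreemptionCallback to handle a training interrupt.
Currently, stopping and resuming appears to change the validation step schedule, but training is otherwise identical: in the logs below, the resumed run validates after global steps 4 and 8, whereas the continuous run validates after global steps 3 and 7 (and once more after the final step), while reduced_train_loss matches at every step that both runs log.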

STOP OUTPUT:

Sanity checking Validation: iteration 1/2
Sanity checking Validation: iteration 2/2
Training epoch 0, iteration 0/9 | lr: 0 | global_batch_size: 2 | global_step: 0 | reduced_train_loss: 4.87
Training epoch 0, iteration 1/9 | lr: 2e-06 | global_batch_size: 2 | global_step: 1 | reduced_train_loss: 4.771 | consumed_samples: 4
[NeMo I 2024-12-02 19:12:44 preemption:87] Received signal 12, initiating graceful stop
[NeMo I 2024-12-02 19:12:44 preemption:67] Preemption detected, saving checkpoint and exiting
2024-12-02 19:12:44,487 _dedup_tensors.py:46 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
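
The stop log above shows the pattern under test: a signal arrives mid-run (signal 12, which is SIGUSR2 on Linux), a checkpoint is saved, and the process exits. A minimal, framework-free sketch of that pattern is given here. It is not NeMo's PreemptionCallback implementation; `run_training`, `save_checkpoint`, and `interrupt_at` are hypothetical names used only for illustration, and the loop assumes it runs in the main thread of a POSIX process.

```python
import os
import signal

# Flag flipped by the signal handler; the training loop polls it after each step.
stop_requested = False


def _handle_preemption(signum, frame):
    """Only record the request; the loop decides when it is safe to stop."""
    global stop_requested
    stop_requested = True


def run_training(max_steps, save_checkpoint, interrupt_at=None):
    """Toy training loop (hypothetical). Returns the number of completed steps.

    interrupt_at simulates a scheduler sending SIGUSR2 during that step.
    """
    global stop_requested
    stop_requested = False
    previous = signal.signal(signal.SIGUSR2, _handle_preemption)
    completed = 0
    try:
        for step in range(max_steps):
            if step == interrupt_at:
                # Simulate the preemption notice arriving during this step.
                os.kill(os.getpid(), signal.SIGUSR2)
            completed = step + 1  # stand-in for one optimizer step
            if stop_requested:
                save_checkpoint(completed)  # checkpoint first, then exit the loop
                break
    finally:
        # Restore whatever handler was installed before (None means it was
        # not installed from Python and cannot be passed back).
        if previous is not None:
            signal.signal(signal.SIGUSR2, previous)
    return completed
```

The handler does no work beyond setting a flag, so the checkpoint is always written from the loop itself at a step boundary rather than from inside the signal handler.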

RESUME OUTPUT:

Restored all states from the checkpoint at /tmp/tmpn8luhenb/TestESM2StopAndGoCheckpointNotAtValidation/checkpoints/epoch=0-step=2-val_loss=0.00-last/weights
Training epoch 0, iteration 3/9 | lr: 6e-06 | consumed_samples: 8 | global_batch_size: 2 | global_step: 3 | reduced_train_loss: 4.783
Training epoch 0, iteration 4/9 | lr: 8e-06 | consumed_samples: 10 | global_batch_size: 2 | global_step: 4 | reduced_train_loss: 4.785
Validation: iteration 1/2
Validation: iteration 2/2
2024-12-02 19:12:50,690 _dedup_tensors.py:46 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Training epoch 0, iteration 5/9 | lr: 1e-05 | consumed_samples: 12 | global_batch_size: 2 | global_step: 5 | reduced_train_loss: 4.672 | val_loss: 4.711
Training epoch 0, iteration 6/9 | lr: 1.2e-05 | consumed_samples: 14 | global_batch_size: 2 | global_step: 6 | reduced_train_loss: 4.7 | val_loss: 4.711
Training epoch 0, iteration 7/9 | lr: 1.4e-05 | consumed_samples: 16 | global_batch_size: 2 | global_step: 7 | reduced_train_loss: 4.621 | val_loss: 4.711
Training epoch 0, iteration 8/9 | lr: 1.6e-05 | consumed_samples: 18 | global_batch_size: 2 | global_step: 8 | reduced_train_loss: 4.532 | val_loss: 4.711
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 9/9 | lr: 1.8e-05 | consumed_samples: 20 | global_batch_size: 2 | global_step: 9 | reduced_train_loss: 4.497 | val_loss: 4.544
`Trainer.fit` stopped: `max_steps=10` reached.

CONTINUOUS OUTPUT:

Sanity checking Validation: iteration 1/2
Sanity checking Validation: iteration 2/2
Training epoch 0, iteration 0/9 | lr: 0 | global_batch_size: 2 | global_step: 0 | reduced_train_loss: 4.87
Training epoch 0, iteration 1/9 | lr: 2e-06 | global_batch_size: 2 | global_step: 1 | reduced_train_loss: 4.771 | consumed_samples: 4
Training epoch 0, iteration 2/9 | lr: 4e-06 | global_batch_size: 2 | global_step: 2 | reduced_train_loss: 4.87 | consumed_samples: 6
Training epoch 0, iteration 3/9 | lr: 6e-06 | global_batch_size: 2 | global_step: 3 | reduced_train_loss: 4.783 | consumed_samples: 8
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 4/9 | lr: 8e-06 | global_batch_size: 2 | global_step: 4 | reduced_train_loss: 4.785 | consumed_samples: 10 | val_loss: 4.758
Training epoch 0, iteration 5/9 | lr: 1e-05 | global_batch_size: 2 | global_step: 5 | reduced_train_loss: 4.672 | consumed_samples: 12 | val_loss: 4.758
Training epoch 0, iteration 6/9 | lr: 1.2e-05 | global_batch_size: 2 | global_step: 6 | reduced_train_loss: 4.7 | consumed_samples: 14 | val_loss: 4.758
Training epoch 0, iteration 7/9 | lr: 1.4e-05 | global_batch_size: 2 | global_step: 7 | reduced_train_loss: 4.621 | consumed_samples: 16 | val_loss: 4.758
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 8/9 | lr: 1.6e-05 | global_batch_size: 2 | global_step: 8 | reduced_train_loss: 4.532 | consumed_samples: 18 | val_loss: 4.573
Training epoch 0, iteration 9/9 | lr: 1.8e-05 | global_batch_size: 2 | global_step: 9 | reduced_train_loss: 4.497 | consumed_samples: 20 | val_loss: 4.573
Validation: iteration 1/2
Validation: iteration 2/2
`Trainer.fit` stopped: `max_steps=10` reached.
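
The comparison between the resumed and continuous logs above can also be made mechanically. The sketch below is only an illustration and not the test harness added in this PR: `losses_by_step` and `validation_steps` are hypothetical helpers that parse the log format shown above, and the two excerpts are trimmed to the fields the helpers read.

```python
import re

_STEP_RE = re.compile(r"global_step: (\d+)")
_LOSS_RE = re.compile(r"reduced_train_loss: ([0-9.]+)")

# Excerpts of the logs above, trimmed to the fields used by the helpers.
RESUMED_EXCERPT = """\
Training epoch 0, iteration 3/9 | global_step: 3 | reduced_train_loss: 4.783
Training epoch 0, iteration 4/9 | global_step: 4 | reduced_train_loss: 4.785
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 5/9 | global_step: 5 | reduced_train_loss: 4.672
"""

CONTINUOUS_EXCERPT = """\
Training epoch 0, iteration 3/9 | global_step: 3 | reduced_train_loss: 4.783
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 4/9 | global_step: 4 | reduced_train_loss: 4.785
Training epoch 0, iteration 5/9 | global_step: 5 | reduced_train_loss: 4.672
"""


def losses_by_step(log_text):
    """Map global_step -> reduced_train_loss for every training line."""
    losses = {}
    for line in log_text.splitlines():
        step, loss = _STEP_RE.search(line), _LOSS_RE.search(line)
        if line.startswith("Training epoch") and step and loss:
            losses[int(step.group(1))] = float(loss.group(1))
    return losses


def validation_steps(log_text):
    """List the global steps after which a (non-sanity) validation pass starts."""
    steps, last_step, in_validation = [], None, False
    for line in log_text.splitlines():
        if line.startswith("Training epoch"):
            match = _STEP_RE.search(line)
            if match:
                last_step = int(match.group(1))
            in_validation = False
        elif line.startswith("Validation:"):
            # Count each run of consecutive validation lines once.
            if not in_validation and last_step is not None:
                steps.append(last_step)
            in_validation = True
    return steps


resumed = losses_by_step(RESUMED_EXCERPT)
continuous = losses_by_step(CONTINUOUS_EXCERPT)
shared = sorted(set(resumed) & set(continuous))
mismatched = [s for s in shared if resumed[s] != continuous[s]]
print("steps with differing loss:", mismatched)
print("validation after steps (resumed):", validation_steps(RESUMED_EXCERPT))
print("validation after steps (continuous):", validation_steps(CONTINUOUS_EXCERPT))
```

Lines that begin with "Sanity checking Validation" are deliberately ignored, so only validation passes that interleave with training steps are counted.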


@pstjohn pstjohn force-pushed the pstjohn/stop-and-go-test-non-validation branch 3 times, most recently from c5deefd to bbf2bd7 December 2, 2024 22:44
@pstjohn pstjohn marked this pull request as ready for review December 2, 2024 22:44
pstjohn commented Dec 2, 2024

/build-ci

Comment thread sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py Outdated
Comment thread sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py Outdated
Comment thread sub-packages/bionemo-testing/src/bionemo/testing/harnesses/stop_and_go.py Outdated
Comment thread sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py Outdated
Comment thread sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py Outdated
@sichu2023 sichu2023 left a comment

Good to go. Only some comments to help me understand it better.

@pstjohn pstjohn force-pushed the pstjohn/stop-and-go-test-non-validation branch from e8e4853 to 364659c December 4, 2024 16:05
@pstjohn pstjohn force-pushed the pstjohn/stop-and-go-test-non-validation branch from 364659c to 117e00c December 4, 2024 16:59
pstjohn commented Dec 4, 2024

/build-ci

@pstjohn pstjohn enabled auto-merge (squash) December 4, 2024 17:01
pstjohn commented Dec 4, 2024

/build-ci

@pstjohn pstjohn merged commit 38be873 into NVIDIA-BioNeMo:main Dec 4, 2024