Add support on num steps for learning rate scheduler #489

Merged: sichu2023 merged 10 commits into main from sichu/esm2_learning_rate_num_steps on Dec 5, 2024

@sichu2023 sichu2023 commented Dec 2, 2024

Decouple num_steps (the number of training steps) from the number of steps used by the learning rate scheduler, so the two can be set independently.

python sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py ... --learning-rate-num-steps=...

Or, taking the pydantic route:

export MY_DATA_SOURCE="ngc"

# The fastest transformer engine environment variables in testing were the following two
TEST_DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source $MY_DATA_SOURCE); \
bionemo-esm2-recipe \
--train-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/train_clusters_sanity.parquet     \
--train-database-path ${TEST_DATA_DIR}/2024_03_sanity/train_sanity.db     \
--valid-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/valid_clusters.parquet     \
--valid-database-path ${TEST_DATA_DIR}/2024_03_sanity/validation.db     \
--result-dir ./results     \
--dest my_config.json \
--recipe 8m \
--scheduler-max-steps=500000 \
--max-steps=10000

bionemo-esm2-train \
--data-config-t bionemo.esm2.run.config_models.ESM2DataConfig \
--model-config-t bionemo.esm2.run.config_models.ExposedESM2PretrainConfig \
--config my_config.json
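
To illustrate why decoupling matters, here is a minimal sketch of a generic linear-warmup plus cosine-decay schedule. It is not the scheduler code from this PR: the function `cosine_lr`, the learning-rate bounds, and the warmup length are hypothetical values chosen for illustration. Only the step counts (10000 training steps, 500000 scheduler steps) come from the recipe command above.

```python
import math


def cosine_lr(step, *, max_lr, min_lr, warmup_steps, scheduler_num_steps):
    """Hypothetical schedule: linear warmup, then cosine decay to min_lr.

    The decay horizon is `scheduler_num_steps`, which is independent of
    how many steps the training loop actually runs.
    """
    if step < warmup_steps:
        # Linear warmup towards max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= scheduler_num_steps:
        # Past the scheduler horizon the rate stays at its floor.
        return min_lr
    progress = (step - warmup_steps) / (scheduler_num_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Step counts taken from the recipe command; the remaining values are assumed.
num_steps = 10_000
scheduler_num_steps = 500_000
common = dict(max_lr=4e-4, min_lr=4e-5, warmup_steps=2_000)

# Coupled: the scheduler horizon equals the training length, so the
# rate has decayed almost to min_lr by the last training step.
lr_coupled = cosine_lr(num_steps - 1, scheduler_num_steps=num_steps, **common)

# Decoupled: the scheduler horizon is much longer than this run, so the
# rate is still close to max_lr when training stops.
lr_decoupled = cosine_lr(num_steps - 1, scheduler_num_steps=scheduler_num_steps, **common)

print(f"coupled:   {lr_coupled:.3e}")
print(f"decoupled: {lr_decoupled:.3e}")
```

With a decoupled horizon, a short run follows the first part of the long run's learning-rate curve instead of compressing the whole decay into the short run.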

@sichu2023 sichu2023 self-assigned this Dec 2, 2024
@sichu2023 sichu2023 requested a review from dorotat-nv December 2, 2024 12:35
@sichu2023 sichu2023 force-pushed the sichu/esm2_learning_rate_num_steps branch from 66fa7db to aebe277 Compare December 2, 2024 12:35
@sichu2023
Contributor Author

/build-ci

@sichu2023 sichu2023 enabled auto-merge (squash) December 2, 2024 12:37
@sichu2023
Contributor Author

sichu2023 commented Dec 2, 2024

Comment thread sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py Outdated
@sichu2023
Contributor Author

/build-ci

@skothenhill-nv
Collaborator

I didn't quite figure out the right abstraction for schedulers + optimizers. When you do this, if an approach stands out to you, please let me know! To answer your question:

> @skothenhill-nv Do you know where I should update TrainingConfig to mirror the change to train.py? https://github.com/NVIDIA/bionemo-framework/blob/sichu/esm2_learning_rate_num_steps/sub-packages/bionemo-llm/src/bionemo/llm/train.py#L216

Schedulers are defined in the train method, so you'll want to use the new setting there and expose it through the config:

  1. Scheduler definition in train: https://github.com/NVIDIA/bionemo-framework/blob/sichu/esm2_learning_rate_num_steps/sub-packages/bionemo-llm/src/bionemo/llm/train.py#L204

  2. TrainingConfig is defined here: https://github.com/NVIDIA/bionemo-framework/blob/sichu/esm2_learning_rate_num_steps/sub-packages/bionemo-llm/src/bionemo/llm/run/config_models.py#L262

Add it as a field on TrainingConfig and update the recipes to do the right thing.
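
As a rough sketch of that suggestion, the snippet below adds an optional scheduler step count to a config object and resolves it when building the scheduler arguments. Everything here is hypothetical: `TrainingConfigSketch`, `resolved_scheduler_steps`, and `build_scheduler_kwargs` are illustrative names, a standard-library dataclass stands in for the real pydantic model, and falling back to `max_steps` when the field is unset is an assumption rather than behavior confirmed by this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingConfigSketch:
    """Hypothetical stand-in for TrainingConfig (the real one is a pydantic model)."""

    max_steps: int
    # Hypothetical new field: scheduler horizon, independent of max_steps.
    scheduler_max_steps: Optional[int] = None

    def resolved_scheduler_steps(self) -> int:
        # Assumed fallback: keep the old coupled behavior when the field is unset.
        if self.scheduler_max_steps is None:
            return self.max_steps
        return self.scheduler_max_steps


def build_scheduler_kwargs(cfg: TrainingConfigSketch) -> dict:
    """Collect the arguments a scheduler constructor might receive (illustrative)."""
    return {"max_steps": cfg.resolved_scheduler_steps()}


# Values mirror the recipe flags --max-steps=10000 and --scheduler-max-steps=500000.
coupled = build_scheduler_kwargs(TrainingConfigSketch(max_steps=10_000))
decoupled = build_scheduler_kwargs(
    TrainingConfigSketch(max_steps=10_000, scheduler_max_steps=500_000)
)
print(coupled, decoupled)
```

Making the field optional keeps existing configs valid, since a config that omits it behaves exactly as before.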

@sichu2023 sichu2023 requested a review from dorotat-nv December 3, 2024 13:51
@sichu2023 sichu2023 force-pushed the sichu/esm2_learning_rate_num_steps branch from 995d9ad to 7ba00fc Compare December 3, 2024 14:09
@sichu2023
Contributor Author

/build-ci

Comment thread sub-packages/bionemo-llm/src/bionemo/llm/train.py Outdated
@sichu2023 sichu2023 force-pushed the sichu/esm2_learning_rate_num_steps branch from 7ba00fc to 0532c87 Compare December 3, 2024 18:20
@sichu2023
Contributor Author

/build-ci

@sichu2023 sichu2023 force-pushed the sichu/esm2_learning_rate_num_steps branch from 0532c87 to 2cbb32c Compare December 4, 2024 12:30
@sichu2023
Contributor Author

/build-ci

@sichu2023 sichu2023 force-pushed the sichu/esm2_learning_rate_num_steps branch from ba3e002 to 1529c0a Compare December 4, 2024 20:39
@sichu2023
Contributor Author

/build-ci

@sichu2023 sichu2023 force-pushed the sichu/esm2_learning_rate_num_steps branch from c595c18 to e4d8a1e Compare December 4, 2024 21:39
@sichu2023
Contributor Author

/build-ci

@sichu2023 sichu2023 force-pushed the sichu/esm2_learning_rate_num_steps branch from 2c700b2 to 0f8e1dd Compare December 4, 2024 22:20
@sichu2023
Contributor Author

/build-ci

@sichu2023 sichu2023 merged commit c95d24e into main Dec 5, 2024
@sichu2023 sichu2023 deleted the sichu/esm2_learning_rate_num_steps branch December 5, 2024 14:58
@sichu2023 sichu2023 restored the sichu/esm2_learning_rate_num_steps branch December 5, 2024 16:11
@sichu2023 sichu2023 deleted the sichu/esm2_learning_rate_num_steps branch December 5, 2024 16:11
5 participants