# Using Sequence Packing to Improve ESM-2 Pretraining with BioNeMo Recipes
This Starter Kit demonstrates pretraining the [ESM-2 model](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v2) using BioNeMo Recipes.
BioNeMo Recipes showcases an easy path to accelerate, scale, and deploy transformer-based biological foundation models using NVIDIA [TransformerEngine](https://github.com/NVIDIA/TransformerEngine). To learn more about BioNeMo Recipes, check out the GitHub repo: https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes

ESM-2 is a pre-trained, bidirectional encoder (BERT-style model) over amino acid sequences. ESM-2 models provide embeddings for amino acids that have led to state-of-the-art performance on downstream tasks such as structure and function prediction.

The ESM-2 recipe also supports sequence packing in THD format (T: total tokens across the packed sequences, H: attention heads, D: head dimension), which avoids spending compute on padding tokens when training on variable-length protein sequences. This example pretrains the ESM-2 model with and without sequence packing to show its benefits.
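
To make the idea concrete, here is a minimal pure-Python sketch contrasting a padded batch with a packed one. It is an illustration only, not the recipe's actual data collator: the function names `pad_batch` and `pack_batch`, the toy token ids, and the padding id of 1 are all assumptions made for this example. The cumulative sequence lengths (`cu_seqlens`) stand in for the boundary metadata that packed attention kernels typically use to keep sequences from attending to one another.

```python
from itertools import accumulate

PAD_ID = 1  # assumed padding token id, for illustration only


def pad_batch(seqs, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


def pack_batch(seqs):
    """Concatenate sequences into one flat token list plus boundaries.

    cu_seqlens[i] and cu_seqlens[i + 1] delimit sequence i in the flat list.
    """
    tokens = [tok for s in seqs for tok in s]
    cu_seqlens = [0] + list(accumulate(len(s) for s in seqs))
    return tokens, cu_seqlens


# Three toy "proteins" of very different lengths (token ids are made up).
seqs = [[5, 6, 7, 8, 9, 10, 11, 12], [5, 6], [7, 8, 9]]

padded = pad_batch(seqs)
tokens, cu_seqlens = pack_batch(seqs)

padded_slots = sum(len(row) for row in padded)  # 3 rows x 8 columns
real_tokens = len(tokens)                       # 8 + 2 + 3

print(f"padded layout: {padded_slots} slots for {real_tokens} real tokens")
print(f"packed layout: {real_tokens} slots, cu_seqlens={cu_seqlens}")
print(f"share of padded slots wasted: {1 - real_tokens / padded_slots:.2f}")
```

In this toy batch almost half of the padded slots hold padding, which is the kind of waste that packing removes; the actual savings depend on the length distribution of the training data.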


## Requirements:
* This recipe requires an NVIDIA GPU of the Ampere generation or newer.
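
One way to check this requirement is to compare the GPU's CUDA compute capability against 8.0, the value that corresponds to Ampere. The sketch below assumes PyTorch is available in the environment and returns `None` when it is not, or when no CUDA device is visible. The helper names `is_ampere_or_newer` and `check_local_gpu` are hypothetical.

```python
def is_ampere_or_newer(capability):
    """Return True if a (major, minor) compute capability is at least 8.0."""
    major, _minor = capability
    return major >= 8


def check_local_gpu():
    """Check GPU 0, or return None if the check cannot be performed."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment
    if not torch.cuda.is_available():
        return None  # no CUDA device is visible
    return is_ampere_or_newer(torch.cuda.get_device_capability(0))


result = check_local_gpu()
if result is None:
    print("Could not query a CUDA GPU from this environment.")
elif result:
    print("GPU is Ampere or newer.")
else:
    print("GPU is older than Ampere.")
```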



## Setting up BioNeMo Recipes

To start using BioNeMo Recipes, clone BioNeMo Framework from GitHub and install the `requirements.txt` for your desired recipe.

For this example, we will install the [ESM-2 example](https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/esm2_native_te).

This recipe uses the `pytorch:25.06-py3` image from NGC, which comes with TransformerEngine preinstalled.


In [2]:
%%bash
git clone https://github.com/NVIDIA/bionemo-framework.git
cd bionemo-framework/bionemo-recipes/recipes
pip install -r esm2_native_te/requirements.txt

Cloning into 'bionemo-framework'...
Updating files: 100% (956/956), done.


Collecting datasets (from -r esm2_native_te_mfsdp_thd/requirements.txt (line 1))
  Downloading datasets-4.1.1-py3-none-any.whl.metadata (18 kB)
Collecting megatron-fsdp==0.1.0rc0 (from -r esm2_native_te_mfsdp_thd/requirements.txt (line 2))
  Downloading megatron_fsdp-0.1.0rc0-py3-none-any.whl.metadata (5.1 kB)
Collecting hydra-core (from -r esm2_native_te_mfsdp_thd/requirements.txt (line 3))
  Downloading hydra_core-1.3.2-py3-none-any.whl.metadata (5.5 kB)
Collecting transformers (from -r esm2_native_te_mfsdp_thd/requirements.txt (line 7))
  Downloading transformers-4.56.2-py3-none-any.whl.metadata (40 kB)
Collecting wandb (from -r esm2_native_te_mfsdp_thd/requirements.txt (line 8))
  Downloading wandb-0.22.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting pyarrow>=21.0.0 (from datasets->-r esm2_native_te_mfsdp_thd/requirements.txt (line 1))
  Downloading pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting xxhash (from d

  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/nvidia-dlfw-inspect.git /tmp/pip-install-p1twhju0/nvdlfw-inspect_468200e586284fd3833a74c4fee815a2
  Running command git checkout -q 6b60eb0e675606fb2fbbfcb12667ecbec75eaf87


  Resolved https://github.com/NVIDIA/nvidia-dlfw-inspect.git to commit 6b60eb0e675606fb2fbbfcb12667ecbec75eaf87
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers->-r esm2_native_te_mfsdp_thd/requirements.txt (line 7))
  Downloading tokenizers-0.22.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting hf-xet<2.0.0,>=1.1.3 (from huggingface-hub>=0.24.0->datasets->-r esm2_native_te_mfsdp_thd/requirements.txt (line 1))
  Downloading hf_xet-1.1.10-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.7 kB)
Collecting gitpython!=3.1.29,>=1.0.0 (from wandb->-r esm2_native_te_mfsdp_thd/requirements.txt (line 8))

  DEPRECATION: Building 'antlr4-python3-runtime' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'antlr4-python3-runtime'. Discussion can be found at https://github.com/pypa/pip/issues/6334

  Building wheel for antlr4-python3-runtime (setup.py): started
  Building wheel for antlr4-python3-runtime (setup.py): finished with status 'done'
  Created wheel for antlr4-python3-runtime: filename=antlr4_python3_runtime-4.9.3-py3-none-any.whl size=144590 sha256=36983943e3f4e4d1dd5441ab8740a7b8a850b1d3641f036ad355b24fd3cfc7ab
  Stored in directory: /root/.cache/pip/wheels/1f/be/48/13754633f1d08d1fbfc60d5e80ae1e5d7329500477685286cd
Successfully built antlr4-python3-runtime
Installing collected packages: antlr4-python3-runtime, xxhash, smmap, sentry-sdk, pyarrow, omegaconf, multiprocess, hf-xet, hydra-core, huggingface-hub, gitdb, tokenizers, megatron-fsdp, gitpython, wandb, transformers, datasets
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 19.0.1
    Uninstalling pyarrow-19.0.1:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 25.4.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.
pylibcudf 25.4.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.

Successfully installed antlr4-python3-runtime-4.9.3 datasets-4.1.1 gitdb-4.0.12 gitpython-3.1.45 hf-xet-1.1.10 huggingface-hub-0.35.1 hydra-core-1.3.2 megatron-fsdp-0.1.0rc0 multiprocess-0.70.16 omegaconf-2.3.0 pyarrow-21.0.0 sentry-sdk-2.39.0 smmap-5.0.2 tokenizers-0.22.1 transformers-4.56.2 wandb-0.22.0 xxhash-3.5.0

Collecting transformers@ git+https://github.com/huggingface/transformers (from -r esm2_native_te/requirements.txt (line 7))
  Cloning https://github.com/huggingface/transformers to /tmp/pip-install-ab5t2823/transformers_7bfb42d02d9741269aca95c84d6c037a


  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-install-ab5t2823/transformers_7bfb42d02d9741269aca95c84d6c037a


  Resolved https://github.com/huggingface/transformers to commit 53838edde77cb10f3a360150aa85a457637e9ac3
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting huggingface-hub==1.0.0.rc1 (from transformers@ git+https://github.com/huggingface/transformers->-r esm2_native_te/requirements.txt (line 7))
  Downloading huggingface_hub-1.0.0rc1-py3-none-any.whl.metadata (14 kB)
Collecting typer-slim (from huggingface-hub==1.0.0.rc1->transformers@ git+https://github.com/huggingface/transformers->-r esm2_native_te/requirements.txt (line 7))
  Downloading typer_slim-0.19.2-py3-none-any.whl.metadata (16 kB)
Collecting nvdlfw-inspect@ git+https://github.com/NVIDIA/nvidia-dlfw-inspect.git@v0.1#egg=nvdlfw-inspect (fro

  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/nvidia-dlfw-inspect.git /tmp/pip-install-ab5t2823/nvdlfw-inspect_ea4fcb7ec86c44c99963d1251e3467db
  Running command git checkout -q 6b60eb0e675606fb2fbbfcb12667ecbec75eaf87


  Resolved https://github.com/NVIDIA/nvidia-dlfw-inspect.git to commit 6b60eb0e675606fb2fbbfcb12667ecbec75eaf87
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Downloading huggingface_hub-1.0.0rc1-py3-none-any.whl (526 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 526.7/526.7 kB 19.5 MB/s eta 0:00:00
Downloading typer_slim-0.19.2-py3-none-any.whl (46 kB)
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml): started
  Building wheel for transformers (pyproject.toml): finished with status 'done'
  Created wheel for transformers: filename=transformers-4.57.0.dev0-py3-none-any.whl size=11510255 sha256=ab977ebe114a91

## ESM-2 Training with Megatron FSDP without Sequence Packing
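
The run in this section logs both `tokens_per_second` and `unpadded_tokens_per_second` at every step. Assuming the first counts every slot in the padded batch and the second counts only non-padding tokens (an interpretation of the field names, not something confirmed here), their ratio estimates how much of each step does useful work. The sketch below parses one such line; the helper name `parse_perf_line` is hypothetical, and the sample line is copied from this run's output for `global_step: 1`.

```python
import re

# Sample line copied from the perf_logger output of this run (global_step 1).
LINE = (
    "[2025-09-26 23:30:56,496][perf_logger][INFO] - loss: 3.48, global_step: 1, "
    "learning_rate: 8e-06, grad_norm: 15.7, step_time: 0.0506, "
    "tokens_per_second: 4.05e+04, unpadded_tokens_per_second: 4.82e+03"
)


def parse_perf_line(line):
    """Extract the `key: value` metric pairs after ' - ' as floats."""
    _, _, metrics_text = line.partition(" - ")
    pairs = re.findall(r"(\w+): ([^,\s]+)", metrics_text)
    return {key: float(value) for key, value in pairs}


metrics = parse_perf_line(LINE)

# Fraction of processed tokens that are real (non-padding) tokens,
# under the interpretation of the field names stated above.
useful = metrics["unpadded_tokens_per_second"] / metrics["tokens_per_second"]
step = int(metrics["global_step"])
print(f"step {step}: {useful:.1%} of processed tokens are non-padding")
```

A low ratio in the unpacked run is what motivates sequence packing: most of the measured throughput is being spent on padding.
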

In [15]:
%%bash
cd bionemo-framework/bionemo-recipes/recipes/esm2_native_te
torchrun train_mfsdp.py --config-name L0_sanity num_train_steps=100

[2025-09-26 23:30:46,711][__main__][INFO] - Initializing distributed training: DistributedConfig(rank=0, local_rank=0, world_size=1, _master_addr='127.0.0.1', _master_port='29500')
[2025-09-26 23:30:47,601][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/nvidia/esm2_t6_8M_UR50D/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
[2025-09-26 23:30:47,724][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t6_8M_UR50D/0c3e6113b0ea62b7cd004320c04260a95f26b1be/config.json "HTTP/1.1 200 OK"
[2025-09-26 23:30:47,847][httpx][INFO] - HTTP Request: GET https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t6_8M_UR50D/0c3e6113b0ea62b7cd004320c04260a95f26b1be/config.json "HTTP/1.1 200 OK"


The module name nvidia/esm2_t6_8M_UR50D (originally nvidia/esm2_t6_8M_UR50D) is not a valid Python identifier. Please rename the original module to avoid import issues.


[2025-09-26 23:30:47,987][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/nvidia/esm2_t6_8M_UR50D/resolve/main/esm_nv.py "HTTP/1.1 307 Temporary Redirect"
[2025-09-26 23:30:48,109][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t6_8M_UR50D/0c3e6113b0ea62b7cd004320c04260a95f26b1be/esm_nv.py "HTTP/1.1 200 OK"
[2025-09-26 23:30:48,233][httpx][INFO] - HTTP Request: GET https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t6_8M_UR50D/0c3e6113b0ea62b7cd004320c04260a95f26b1be/esm_nv.py "HTTP/1.1 200 OK"


A new version of the following files was downloaded from https://huggingface.co/nvidia/esm2_t6_8M_UR50D:
- esm_nv.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


[2025-09-26 23:30:48,881][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/nvidia/esm2_t6_8M_UR50D/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
[2025-09-26 23:30:48,886][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t6_8M_UR50D/0c3e6113b0ea62b7cd004320c04260a95f26b1be/config.json "HTTP/1.1 200 OK"


The module name nvidia/esm2_t6_8M_UR50D (originally nvidia/esm2_t6_8M_UR50D) is not a valid Python identifier. Please rename the original module to avoid import issues.


[2025-09-26 23:30:49,019][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/nvidia/esm2_t6_8M_UR50D/resolve/main/esm_nv.py "HTTP/1.1 307 Temporary Redirect"
[2025-09-26 23:30:49,024][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t6_8M_UR50D/0c3e6113b0ea62b7cd004320c04260a95f26b1be/esm_nv.py "HTTP/1.1 200 OK"
[2025-09-26 23:30:50,461][megatron_fsdp.param_and_grad_buffer][INFO] - Number of FSDP Parameter Groups: 9
[FSDP_UNIT 13] Group 0: elems=21184 dtype=torch.bfloat16 bufs=weight,main_weight,grad pad=0.00 MB
	esm.embeddings.word_embeddings.weight (64, 320)
	lm_head.decoder.layer_norm_weight (320,)
	lm_head.decoder.layer_norm_bias (320,)
	lm_head.decoder.bias (64,)
[FSDP_UNIT 0] Group 1: elems=1232960 dtype=torch.bfloat16 bufs=weight,main_weight,grad pad=0.01 MB
	esm.encoder.layers.0.layernorm_mlp.fc2_weight (320, 1280)
	esm.encoder.layers.0.self_attention.layernorm_qkv.weight (960, 320)
	esm.encoder.layers.0.self_attention.proj.wei

wandb: Tracking run with wandb version 0.22.0
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /mount/zozhang/bionemo/bionemo-framework/bionemo-recipes/recipes/esm2_native_te/wandb/offline-run-20250926_233052-ywfet95h
Training:   1%|          | 1/100 [00:02<04:49,  2.93s/it, loss=3.48]

[2025-09-26 23:30:56,446][perf_logger][INFO] - loss: 3.48, global_step: 0, learning_rate: 4e-06, grad_norm: 14.6, step_time: nan, tokens_per_second: nan, unpadded_tokens_per_second: nan


Training:   2%|▏         | 2/100 [00:02<04:47,  2.93s/it, loss=3.48]

[2025-09-26 23:30:56,496][perf_logger][INFO] - loss: 3.48, global_step: 1, learning_rate: 8e-06, grad_norm: 15.7, step_time: 0.0506, tokens_per_second: 4.05e+04, unpadded_tokens_per_second: 4.82e+03


Training:   3%|▎         | 3/100 [00:03<04:44,  2.93s/it, loss=3.4] 

[2025-09-26 23:30:56,537][perf_logger][INFO] - loss: 3.4, global_step: 2, learning_rate: 1.2e-05, grad_norm: 13.6, step_time: 0.0409, tokens_per_second: 5.01e+04, unpadded_tokens_per_second: 1.72e+04


Training:   4%|▍         | 4/100 [00:03<00:56,  1.70it/s, loss=3.44]

[2025-09-26 23:30:56,578][perf_logger][INFO] - loss: 3.44, global_step: 3, learning_rate: 1.6e-05, grad_norm: 11, step_time: 0.0408, tokens_per_second: 5.02e+04, unpadded_tokens_per_second: 2.13e+04


Training:   5%|▌         | 5/100 [00:03<00:56,  1.70it/s, loss=3.55]

[2025-09-26 23:30:56,618][perf_logger][INFO] - loss: 3.55, global_step: 4, learning_rate: 2e-05, grad_norm: 16.8, step_time: 0.0406, tokens_per_second: 5.05e+04, unpadded_tokens_per_second: 1.5e+04


Training:   6%|▌         | 6/100 [00:03<00:55,  1.70it/s, loss=3.24]

[2025-09-26 23:30:56,658][perf_logger][INFO] - loss: 3.24, global_step: 5, learning_rate: 2.4e-05, grad_norm: 11.5, step_time: 0.0395, tokens_per_second: 5.18e+04, unpadded_tokens_per_second: 1.21e+04


Training:   7%|▋         | 7/100 [00:03<00:27,  3.39it/s, loss=3.25]

[2025-09-26 23:30:56,698][perf_logger][INFO] - loss: 3.25, global_step: 6, learning_rate: 2.8e-05, grad_norm: 11.5, step_time: 0.0403, tokens_per_second: 5.09e+04, unpadded_tokens_per_second: 1.09e+04


Training:   8%|▊         | 8/100 [00:03<00:27,  3.39it/s, loss=3.13]

[2025-09-26 23:30:56,737][perf_logger][INFO] - loss: 3.13, global_step: 7, learning_rate: 3.2e-05, grad_norm: 6.45, step_time: 0.0396, tokens_per_second: 5.18e+04, unpadded_tokens_per_second: 1.42e+04


Training:   9%|▉         | 9/100 [00:03<00:26,  3.39it/s, loss=3.02]

[2025-09-26 23:30:56,777][perf_logger][INFO] - loss: 3.02, global_step: 8, learning_rate: 3.6e-05, grad_norm: 6.9, step_time: 0.0393, tokens_per_second: 5.22e+04, unpadded_tokens_per_second: 1.54e+04


Training:  10%|█         | 10/100 [00:03<00:16,  5.44it/s, loss=2.89]

[2025-09-26 23:30:56,816][perf_logger][INFO] - loss: 2.89, global_step: 9, learning_rate: 4e-05, grad_norm: 10.2, step_time: 0.0394, tokens_per_second: 5.2e+04, unpadded_tokens_per_second: 8.23e+03


Training:  11%|█         | 11/100 [00:03<00:16,  5.44it/s, loss=2.84]

[2025-09-26 23:30:56,856][perf_logger][INFO] - loss: 2.84, global_step: 10, learning_rate: 4.4e-05, grad_norm: 6.59, step_time: 0.0395, tokens_per_second: 5.19e+04, unpadded_tokens_per_second: 1.76e+04


Training:  12%|█▏        | 12/100 [00:03<00:16,  5.44it/s, loss=3.01]

[2025-09-26 23:30:56,896][perf_logger][INFO] - loss: 3.01, global_step: 11, learning_rate: 4.8e-05, grad_norm: 6.47, step_time: 0.04, tokens_per_second: 5.11e+04, unpadded_tokens_per_second: 9.17e+03


Training:  13%|█▎        | 13/100 [00:03<00:11,  7.76it/s, loss=3.17]

[2025-09-26 23:30:56,936][perf_logger][INFO] - loss: 3.17, global_step: 12, learning_rate: 5.2e-05, grad_norm: 6.45, step_time: 0.0398, tokens_per_second: 5.14e+04, unpadded_tokens_per_second: 1.3e+04


Training:  14%|█▍        | 14/100 [00:03<00:11,  7.76it/s, loss=3.25]

[2025-09-26 23:30:56,975][perf_logger][INFO] - loss: 3.25, global_step: 13, learning_rate: 5.6e-05, grad_norm: 7.36, step_time: 0.0398, tokens_per_second: 5.15e+04, unpadded_tokens_per_second: 8.04e+03


Training:  15%|█▌        | 15/100 [00:03<00:10,  7.76it/s, loss=3.11]

[2025-09-26 23:30:57,015][perf_logger][INFO] - loss: 3.11, global_step: 14, learning_rate: 6e-05, grad_norm: 4.78, step_time: 0.0399, tokens_per_second: 5.13e+04, unpadded_tokens_per_second: 2.1e+04


Training:  16%|█▌        | 16/100 [00:03<00:08, 10.28it/s, loss=3.01]

[2025-09-26 23:30:57,054][perf_logger][INFO] - loss: 3.01, global_step: 15, learning_rate: 6.4e-05, grad_norm: 3.73, step_time: 0.0391, tokens_per_second: 5.24e+04, unpadded_tokens_per_second: 1.92e+04


Training:  17%|█▋        | 17/100 [00:03<00:08, 10.28it/s, loss=2.94]

[2025-09-26 23:30:57,094][perf_logger][INFO] - loss: 2.94, global_step: 16, learning_rate: 6.8e-05, grad_norm: 3.69, step_time: 0.0395, tokens_per_second: 5.19e+04, unpadded_tokens_per_second: 1.52e+04


Training:  18%|█▊        | 18/100 [00:03<00:07, 10.28it/s, loss=3.12]

[2025-09-26 23:30:57,133][perf_logger][INFO] - loss: 3.12, global_step: 17, learning_rate: 7.2e-05, grad_norm: 4.47, step_time: 0.039, tokens_per_second: 5.25e+04, unpadded_tokens_per_second: 2.21e+04


Training:  19%|█▉        | 19/100 [00:03<00:06, 12.85it/s, loss=2.93]

[2025-09-26 23:30:57,172][perf_logger][INFO] - loss: 2.93, global_step: 18, learning_rate: 7.6e-05, grad_norm: 4.4, step_time: 0.0392, tokens_per_second: 5.22e+04, unpadded_tokens_per_second: 1.72e+04


Training:  20%|██        | 20/100 [00:03<00:06, 12.85it/s, loss=2.89]

[2025-09-26 23:30:57,213][perf_logger][INFO] - loss: 2.89, global_step: 19, learning_rate: 8e-05, grad_norm: 6.75, step_time: 0.0411, tokens_per_second: 4.98e+04, unpadded_tokens_per_second: 8.24e+03


Training:  21%|██        | 21/100 [00:03<00:06, 12.85it/s, loss=2.84]

[2025-09-26 23:30:57,254][perf_logger][INFO] - loss: 2.84, global_step: 20, learning_rate: 8.4e-05, grad_norm: 5.79, step_time: 0.0404, tokens_per_second: 5.07e+04, unpadded_tokens_per_second: 1.11e+04


Training:  22%|██▏       | 22/100 [00:03<00:05, 15.18it/s, loss=3.04]

[2025-09-26 23:30:57,295][perf_logger][INFO] - loss: 3.04, global_step: 21, learning_rate: 8.8e-05, grad_norm: 4.55, step_time: 0.0415, tokens_per_second: 4.94e+04, unpadded_tokens_per_second: 1.48e+04


Training:  23%|██▎       | 23/100 [00:03<00:05, 15.18it/s, loss=2.81]

[2025-09-26 23:30:57,335][perf_logger][INFO] - loss: 2.81, global_step: 22, learning_rate: 9.2e-05, grad_norm: 3.57, step_time: 0.0398, tokens_per_second: 5.15e+04, unpadded_tokens_per_second: 8.14e+03


Training:  24%|██▍       | 24/100 [00:03<00:05, 15.18it/s, loss=2.93]

[2025-09-26 23:30:57,374][perf_logger][INFO] - loss: 2.93, global_step: 23, learning_rate: 9.6e-05, grad_norm: 3.66, step_time: 0.0391, tokens_per_second: 5.23e+04, unpadded_tokens_per_second: 1.24e+04


Training:  25%|██▌       | 25/100 [00:03<00:04, 17.39it/s, loss=2.87]

[2025-09-26 23:30:57,413][perf_logger][INFO] - loss: 2.87, global_step: 24, learning_rate: 0.0001, grad_norm: 5.57, step_time: 0.0395, tokens_per_second: 5.19e+04, unpadded_tokens_per_second: 1.38e+04


Training:  26%|██▌       | 26/100 [00:03<00:04, 17.39it/s, loss=2.97]

[2025-09-26 23:30:57,453][perf_logger][INFO] - loss: 2.97, global_step: 25, learning_rate: 0.000104, grad_norm: 3.67, step_time: 0.0394, tokens_per_second: 5.2e+04, unpadded_tokens_per_second: 8.07e+03


Training:  27%|██▋       | 27/100 [00:03<00:04, 17.39it/s, loss=2.93]

[2025-09-26 23:30:57,492][perf_logger][INFO] - loss: 2.93, global_step: 26, learning_rate: 0.000108, grad_norm: 2.45, step_time: 0.0393, tokens_per_second: 5.21e+04, unpadded_tokens_per_second: 3.05e+04


Training:  28%|██▊       | 28/100 [00:04<00:03, 19.29it/s, loss=2.95]

[2025-09-26 23:30:57,531][perf_logger][INFO] - loss: 2.95, global_step: 27, learning_rate: 0.000112, grad_norm: 6.16, step_time: 0.0391, tokens_per_second: 5.24e+04, unpadded_tokens_per_second: 5.76e+03


Training:  29%|██▉       | 29/100 [00:04<00:03, 19.29it/s, loss=3.16]

[2025-09-26 23:30:57,570][perf_logger][INFO] - loss: 3.16, global_step: 28, learning_rate: 0.000116, grad_norm: 4.61, step_time: 0.0393, tokens_per_second: 5.21e+04, unpadded_tokens_per_second: 9.82e+03


Training:  30%|███       | 30/100 [00:04<00:03, 19.29it/s, loss=2.99]

[2025-09-26 23:30:57,610][perf_logger][INFO] - loss: 2.99, global_step: 29, learning_rate: 0.00012, grad_norm: 3.28, step_time: 0.0395, tokens_per_second: 5.18e+04, unpadded_tokens_per_second: 2.45e+04


Training:  31%|███       | 31/100 [00:04<00:03, 20.75it/s, loss=2.92]

[2025-09-26 23:30:57,651][perf_logger][INFO] - loss: 2.92, global_step: 30, learning_rate: 0.000124, grad_norm: 4.62, step_time: 0.0411, tokens_per_second: 4.99e+04, unpadded_tokens_per_second: 8.82e+03


Training:  32%|███▏      | 32/100 [00:04<00:03, 20.75it/s, loss=3.02]

[2025-09-26 23:30:57,695][perf_logger][INFO] - loss: 3.02, global_step: 31, learning_rate: 0.000128, grad_norm: 3.28, step_time: 0.0443, tokens_per_second: 4.62e+04, unpadded_tokens_per_second: 1.08e+04


Training:  33%|███▎      | 33/100 [00:04<00:03, 20.75it/s, loss=2.79]

[2025-09-26 23:30:57,741][perf_logger][INFO] - loss: 2.79, global_step: 32, learning_rate: 0.000132, grad_norm: 3.48, step_time: 0.0457, tokens_per_second: 4.48e+04, unpadded_tokens_per_second: 5.53e+03


Training:  34%|███▍      | 34/100 [00:04<00:03, 21.31it/s, loss=2.88]

[2025-09-26 23:30:57,784][perf_logger][INFO] - loss: 2.88, global_step: 33, learning_rate: 0.000136, grad_norm: 2.57, step_time: 0.0422, tokens_per_second: 4.85e+04, unpadded_tokens_per_second: 3.05e+04


Training:  35%|███▌      | 35/100 [00:04<00:03, 21.31it/s, loss=3.02]

[2025-09-26 23:30:57,824][perf_logger][INFO] - loss: 3.02, global_step: 34, learning_rate: 0.00014, grad_norm: 3.44, step_time: 0.0404, tokens_per_second: 5.07e+04, unpadded_tokens_per_second: 2.05e+04


Training:  36%|███▌      | 36/100 [00:04<00:03, 21.31it/s, loss=2.98]

[2025-09-26 23:30:57,863][perf_logger][INFO] - loss: 2.98, global_step: 35, learning_rate: 0.000144, grad_norm: 2.99, step_time: 0.0395, tokens_per_second: 5.18e+04, unpadded_tokens_per_second: 1.8e+04


Training:  37%|███▋      | 37/100 [00:04<00:02, 22.32it/s, loss=2.91]

[2025-09-26 23:30:57,903][perf_logger][INFO] - loss: 2.91, global_step: 36, learning_rate: 0.000148, grad_norm: 5.08, step_time: 0.0398, tokens_per_second: 5.15e+04, unpadded_tokens_per_second: 1.17e+04


Training:  38%|███▊      | 38/100 [00:04<00:02, 22.32it/s, loss=2.87]

[2025-09-26 23:30:57,944][perf_logger][INFO] - loss: 2.87, global_step: 37, learning_rate: 0.000152, grad_norm: 4.13, step_time: 0.0413, tokens_per_second: 4.96e+04, unpadded_tokens_per_second: 1.77e+04


Training:  39%|███▉      | 39/100 [00:04<00:02, 22.32it/s, loss=2.96]

[2025-09-26 23:30:57,984][perf_logger][INFO] - loss: 2.96, global_step: 38, learning_rate: 0.000156, grad_norm: 3.14, step_time: 0.0395, tokens_per_second: 5.19e+04, unpadded_tokens_per_second: 2.04e+04


Training:  40%|████      | 40/100 [00:04<00:02, 22.96it/s, loss=2.75]

[2025-09-26 23:30:58,025][perf_logger][INFO] - loss: 2.75, global_step: 39, learning_rate: 0.00016, grad_norm: 3.41, step_time: 0.0413, tokens_per_second: 4.96e+04, unpadded_tokens_per_second: 1.16e+04


Training:  41%|████      | 41/100 [00:04<00:02, 22.96it/s, loss=2.84]

[2025-09-26 23:30:58,071][perf_logger][INFO] - loss: 2.84, global_step: 40, learning_rate: 0.000164, grad_norm: 2.83, step_time: 0.0456, tokens_per_second: 4.49e+04, unpadded_tokens_per_second: 2.5e+04


Training:  42%|████▏     | 42/100 [00:04<00:02, 22.96it/s, loss=3.01]

[2025-09-26 23:30:58,117][perf_logger][INFO] - loss: 3.01, global_step: 41, learning_rate: 0.000168, grad_norm: 4.02, step_time: 0.0459, tokens_per_second: 4.46e+04, unpadded_tokens_per_second: 1.59e+04


Training:  43%|████▎     | 43/100 [00:04<00:02, 22.59it/s, loss=2.91]

[2025-09-26 23:30:58,163][perf_logger][INFO] - loss: 2.91, global_step: 42, learning_rate: 0.000172, grad_norm: 3.2, step_time: 0.0463, tokens_per_second: 4.43e+04, unpadded_tokens_per_second: 2.74e+04


Training:  44%|████▍     | 44/100 [00:04<00:02, 22.59it/s, loss=3.02]

[2025-09-26 23:30:58,203][perf_logger][INFO] - loss: 3.02, global_step: 43, learning_rate: 0.000176, grad_norm: 6.25, step_time: 0.0405, tokens_per_second: 5.05e+04, unpadded_tokens_per_second: 4.32e+03


Training:  45%|████▌     | 45/100 [00:04<00:02, 22.59it/s, loss=2.87]

[2025-09-26 23:30:58,243][perf_logger][INFO] - loss: 2.87, global_step: 44, learning_rate: 0.00018, grad_norm: 2.65, step_time: 0.0393, tokens_per_second: 5.22e+04, unpadded_tokens_per_second: 1.16e+04


Training:  46%|████▌     | 46/100 [00:04<00:02, 23.32it/s, loss=2.91]

[2025-09-26 23:30:58,282][perf_logger][INFO] - loss: 2.91, global_step: 45, learning_rate: 0.000184, grad_norm: 3.93, step_time: 0.0393, tokens_per_second: 5.21e+04, unpadded_tokens_per_second: 8.09e+03


Training:  47%|████▋     | 47/100 [00:04<00:02, 23.32it/s, loss=3.02]

[2025-09-26 23:30:58,322][perf_logger][INFO] - loss: 3.02, global_step: 46, learning_rate: 0.000188, grad_norm: 2.77, step_time: 0.0399, tokens_per_second: 5.14e+04, unpadded_tokens_per_second: 3.42e+04


Training:  48%|████▊     | 48/100 [00:04<00:02, 23.32it/s, loss=2.79]

[2025-09-26 23:30:58,361][perf_logger][INFO] - loss: 2.79, global_step: 47, learning_rate: 0.000192, grad_norm: 3.35, step_time: 0.0392, tokens_per_second: 5.22e+04, unpadded_tokens_per_second: 1.51e+04


Training:  49%|████▉     | 49/100 [00:04<00:02, 23.91it/s, loss=2.87]

[2025-09-26 23:30:58,400][perf_logger][INFO] - loss: 2.87, global_step: 48, learning_rate: 0.000196, grad_norm: 4.83, step_time: 0.039, tokens_per_second: 5.25e+04, unpadded_tokens_per_second: 7.34e+03


Training:  50%|█████     | 50/100 [00:04<00:02, 23.91it/s, loss=2.92]

[2025-09-26 23:30:58,439][perf_logger][INFO] - loss: 2.92, global_step: 49, learning_rate: 0.0002, grad_norm: 3.49, step_time: 0.0391, tokens_per_second: 5.24e+04, unpadded_tokens_per_second: 6.6e+03


Training:  51%|█████     | 51/100 [00:04<00:02, 23.91it/s, loss=3]   

[2025-09-26 23:30:58,478][perf_logger][INFO] - loss: 3, global_step: 50, learning_rate: 0.000204, grad_norm: 2.88, step_time: 0.039, tokens_per_second: 5.25e+04, unpadded_tokens_per_second: 3.04e+04


Training:  52%|█████▏    | 52/100 [00:05<00:01, 24.37it/s, loss=2.88]

[2025-09-26 23:30:58,518][perf_logger][INFO] - loss: 2.88, global_step: 51, learning_rate: 0.000208, grad_norm: 2.91, step_time: 0.0394, tokens_per_second: 5.2e+04, unpadded_tokens_per_second: 1.15e+04


Training:  53%|█████▎    | 53/100 [00:05<00:01, 24.37it/s, loss=2.9] 

[2025-09-26 23:30:58,557][perf_logger][INFO] - loss: 2.9, global_step: 52, learning_rate: 0.000212, grad_norm: 2.28, step_time: 0.0394, tokens_per_second: 5.2e+04, unpadded_tokens_per_second: 1.88e+04


Training:  54%|█████▍    | 54/100 [00:05<00:01, 24.37it/s, loss=2.76]

[2025-09-26 23:30:58,596][perf_logger][INFO] - loss: 2.76, global_step: 53, learning_rate: 0.000216, grad_norm: 3.75, step_time: 0.0393, tokens_per_second: 5.22e+04, unpadded_tokens_per_second: 1.15e+04


Training:  55%|█████▌    | 55/100 [00:05<00:01, 24.72it/s, loss=2.75]

[2025-09-26 23:30:58,635][perf_logger][INFO] - loss: 2.75, global_step: 54, learning_rate: 0.00022, grad_norm: 3.31, step_time: 0.0388, tokens_per_second: 5.28e+04, unpadded_tokens_per_second: 2.11e+04


Training:  56%|█████▌    | 56/100 [00:05<00:01, 24.72it/s, loss=2.93]

[2025-09-26 23:30:58,674][perf_logger][INFO] - loss: 2.93, global_step: 55, learning_rate: 0.000224, grad_norm: 2.56, step_time: 0.0389, tokens_per_second: 5.27e+04, unpadded_tokens_per_second: 2.24e+04


Training:  57%|█████▋    | 57/100 [00:05<00:01, 24.72it/s, loss=2.98]

[2025-09-26 23:30:58,712][perf_logger][INFO] - loss: 2.98, global_step: 56, learning_rate: 0.000228, grad_norm: 3.13, step_time: 0.0387, tokens_per_second: 5.29e+04, unpadded_tokens_per_second: 2.36e+04


Training:  58%|█████▊    | 58/100 [00:05<00:01, 25.04it/s, loss=2.76]

[2025-09-26 23:30:58,751][perf_logger][INFO] - loss: 2.76, global_step: 57, learning_rate: 0.000232, grad_norm: 2.55, step_time: 0.0386, tokens_per_second: 5.31e+04, unpadded_tokens_per_second: 1.16e+04


Training:  59%|█████▉    | 59/100 [00:05<00:01, 25.04it/s, loss=2.96]

[2025-09-26 23:30:58,790][perf_logger][INFO] - loss: 2.96, global_step: 58, learning_rate: 0.000236, grad_norm: 2.77, step_time: 0.0392, tokens_per_second: 5.23e+04, unpadded_tokens_per_second: 1.09e+04


Training:  60%|██████    | 60/100 [00:05<00:01, 25.04it/s, loss=2.9] 

[2025-09-26 23:30:58,829][perf_logger][INFO] - loss: 2.9, global_step: 59, learning_rate: 0.00024, grad_norm: 4.22, step_time: 0.039, tokens_per_second: 5.25e+04, unpadded_tokens_per_second: 2.19e+04


Training:  61%|██████    | 61/100 [00:05<00:01, 25.22it/s, loss=2.84]

[2025-09-26 23:30:58,868][perf_logger][INFO] - loss: 2.84, global_step: 60, learning_rate: 0.000244, grad_norm: 4.91, step_time: 0.0388, tokens_per_second: 5.28e+04, unpadded_tokens_per_second: 1.19e+04


Training:  62%|██████▏   | 62/100 [00:05<00:01, 25.22it/s, loss=2.93]

[2025-09-26 23:30:58,907][perf_logger][INFO] - loss: 2.93, global_step: 61, learning_rate: 0.000248, grad_norm: 5.73, step_time: 0.0391, tokens_per_second: 5.23e+04, unpadded_tokens_per_second: 1.02e+04


Training:  63%|██████▎   | 63/100 [00:05<00:01, 25.22it/s, loss=2.84]

[2025-09-26 23:30:58,946][perf_logger][INFO] - loss: 2.84, global_step: 62, learning_rate: 0.000252, grad_norm: 5.71, step_time: 0.039, tokens_per_second: 5.25e+04, unpadded_tokens_per_second: 1.22e+04


Training:  64%|██████▍   | 64/100 [00:05<00:01, 25.36it/s, loss=2.84]

[2025-09-26 23:30:58,985][perf_logger][INFO] - loss: 2.84, global_step: 63, learning_rate: 0.000256, grad_norm: 4.18, step_time: 0.0387, tokens_per_second: 5.3e+04, unpadded_tokens_per_second: 1.46e+04


Training:  65%|██████▌   | 65/100 [00:05<00:01, 25.36it/s, loss=2.79]

[2025-09-26 23:30:59,024][perf_logger][INFO] - loss: 2.79, global_step: 64, learning_rate: 0.00026, grad_norm: 10.3, step_time: 0.0392, tokens_per_second: 5.23e+04, unpadded_tokens_per_second: 1.35e+04


Training:  66%|██████▌   | 66/100 [00:05<00:01, 25.36it/s, loss=2.66]

[2025-09-26 23:30:59,063][perf_logger][INFO] - loss: 2.66, global_step: 65, learning_rate: 0.000264, grad_norm: 3.67, step_time: 0.0393, tokens_per_second: 5.21e+04, unpadded_tokens_per_second: 1.81e+04


Training:  67%|██████▋   | 67/100 [00:05<00:01, 25.35it/s, loss=2.85]

[2025-09-26 23:30:59,103][perf_logger][INFO] - loss: 2.85, global_step: 66, learning_rate: 0.000268, grad_norm: 3.75, step_time: 0.0401, tokens_per_second: 5.11e+04, unpadded_tokens_per_second: 7.96e+03


Training:  68%|██████▊   | 68/100 [00:05<00:01, 25.35it/s, loss=2.89]

[2025-09-26 23:30:59,143][perf_logger][INFO] - loss: 2.89, global_step: 67, learning_rate: 0.000272, grad_norm: 4.05, step_time: 0.0396, tokens_per_second: 5.17e+04, unpadded_tokens_per_second: 2.92e+04


Training:  69%|██████▉   | 69/100 [00:05<00:01, 25.35it/s, loss=2.88]

[2025-09-26 23:30:59,182][perf_logger][INFO] - loss: 2.88, global_step: 68, learning_rate: 0.000276, grad_norm: 3.36, step_time: 0.0387, tokens_per_second: 5.29e+04, unpadded_tokens_per_second: 1.37e+04


Training:  70%|███████   | 70/100 [00:05<00:01, 25.41it/s, loss=2.93]

[2025-09-26 23:30:59,221][perf_logger][INFO] - loss: 2.93, global_step: 69, learning_rate: 0.00028, grad_norm: 3.38, step_time: 0.0391, tokens_per_second: 5.24e+04, unpadded_tokens_per_second: 2.88e+04


Training:  71%|███████   | 71/100 [00:05<00:01, 25.41it/s, loss=2.74]

[2025-09-26 23:30:59,259][perf_logger][INFO] - loss: 2.74, global_step: 70, learning_rate: 0.000284, grad_norm: 2.69, step_time: 0.0386, tokens_per_second: 5.3e+04, unpadded_tokens_per_second: 1.79e+04


Training:  72%|███████▏  | 72/100 [00:05<00:01, 25.41it/s, loss=2.79]

[2025-09-26 23:30:59,298][perf_logger][INFO] - loss: 2.79, global_step: 71, learning_rate: 0.000288, grad_norm: 4.07, step_time: 0.0387, tokens_per_second: 5.29e+04, unpadded_tokens_per_second: 1.35e+04


Training:  73%|███████▎  | 73/100 [00:05<00:01, 25.52it/s, loss=2.92]

[2025-09-26 23:30:59,337][perf_logger][INFO] - loss: 2.92, global_step: 72, learning_rate: 0.000292, grad_norm: 3.51, step_time: 0.039, tokens_per_second: 5.24e+04, unpadded_tokens_per_second: 1.62e+04


Training:  74%|███████▍  | 74/100 [00:05<00:01, 25.52it/s, loss=2.86]

[2025-09-26 23:30:59,376][perf_logger][INFO] - loss: 2.86, global_step: 73, learning_rate: 0.000296, grad_norm: 4.78, step_time: 0.0392, tokens_per_second: 5.23e+04, unpadded_tokens_per_second: 1.25e+04


Training:  75%|███████▌  | 75/100 [00:05<00:00, 25.52it/s, loss=2.87]

[2025-09-26 23:30:59,416][perf_logger][INFO] - loss: 2.87, global_step: 74, learning_rate: 0.0003, grad_norm: 3.01, step_time: 0.0394, tokens_per_second: 5.2e+04, unpadded_tokens_per_second: 1.12e+04


Training:  76%|███████▌  | 76/100 [00:05<00:00, 25.46it/s, loss=2.86]

[2025-09-26 23:30:59,456][perf_logger][INFO] - loss: 2.86, global_step: 75, learning_rate: 0.000304, grad_norm: 3.28, step_time: 0.0398, tokens_per_second: 5.14e+04, unpadded_tokens_per_second: 1.75e+04


Training:  77%|███████▋  | 77/100 [00:05<00:00, 25.46it/s, loss=2.9] 

[2025-09-26 23:30:59,495][perf_logger][INFO] - loss: 2.9, global_step: 76, learning_rate: 0.000308, grad_norm: 3.46, step_time: 0.0391, tokens_per_second: 5.24e+04, unpadded_tokens_per_second: 1.21e+04


Training:  78%|███████▊  | 78/100 [00:06<00:00, 25.46it/s, loss=2.86]

[2025-09-26 23:30:59,534][perf_logger][INFO] - loss: 2.86, global_step: 77, learning_rate: 0.000312, grad_norm: 5.82, step_time: 0.0392, tokens_per_second: 5.22e+04, unpadded_tokens_per_second: 6.6e+03


Training:  79%|███████▉  | 79/100 [00:06<00:00, 25.47it/s, loss=2.92]

[2025-09-26 23:30:59,573][perf_logger][INFO] - loss: 2.92, global_step: 78, learning_rate: 0.000316, grad_norm: 3.24, step_time: 0.0394, tokens_per_second: 5.2e+04, unpadded_tokens_per_second: 9.72e+03


Training:  80%|████████  | 80/100 [00:06<00:00, 25.47it/s, loss=2.93]

[2025-09-26 23:30:59,614][perf_logger][INFO] - loss: 2.93, global_step: 79, learning_rate: 0.00032, grad_norm: 4.2, step_time: 0.0405, tokens_per_second: 5.05e+04, unpadded_tokens_per_second: 1.77e+04


Training:  81%|████████  | 81/100 [00:06<00:00, 25.47it/s, loss=2.77]

[2025-09-26 23:30:59,654][perf_logger][INFO] - loss: 2.77, global_step: 80, learning_rate: 0.000324, grad_norm: 5.02, step_time: 0.0399, tokens_per_second: 5.13e+04, unpadded_tokens_per_second: 1.1e+04


Training:  82%|████████▏ | 82/100 [00:06<00:00, 25.34it/s, loss=3.1] 

[2025-09-26 23:30:59,693][perf_logger][INFO] - loss: 3.1, global_step: 81, learning_rate: 0.000328, grad_norm: 5.03, step_time: 0.0394, tokens_per_second: 5.2e+04, unpadded_tokens_per_second: 2.18e+04


Training:  83%|████████▎ | 83/100 [00:06<00:00, 25.34it/s, loss=2.84]

[2025-09-26 23:30:59,733][perf_logger][INFO] - loss: 2.84, global_step: 82, learning_rate: 0.000332, grad_norm: 3.09, step_time: 0.0395, tokens_per_second: 5.19e+04, unpadded_tokens_per_second: 1.45e+04


Training:  84%|████████▍ | 84/100 [00:06<00:00, 25.34it/s, loss=2.99]

[2025-09-26 23:30:59,771][perf_logger][INFO] - loss: 2.99, global_step: 83, learning_rate: 0.000336, grad_norm: 3.51, step_time: 0.0388, tokens_per_second: 5.28e+04, unpadded_tokens_per_second: 1.01e+04


Training:  85%|████████▌ | 85/100 [00:06<00:00, 25.38it/s, loss=2.89]

[2025-09-26 23:30:59,811][perf_logger][INFO] - loss: 2.89, global_step: 84, learning_rate: 0.00034, grad_norm: 2.59, step_time: 0.0394, tokens_per_second: 5.2e+04, unpadded_tokens_per_second: 1.2e+04


Training:  86%|████████▌ | 86/100 [00:06<00:00, 25.38it/s, loss=2.85]

[2025-09-26 23:30:59,850][perf_logger][INFO] - loss: 2.85, global_step: 85, learning_rate: 0.000344, grad_norm: 4.02, step_time: 0.0395, tokens_per_second: 5.18e+04, unpadded_tokens_per_second: 1.11e+04


Training:  87%|████████▋ | 87/100 [00:06<00:00, 25.38it/s, loss=2.81]

[2025-09-26 23:30:59,890][perf_logger][INFO] - loss: 2.81, global_step: 86, learning_rate: 0.000348, grad_norm: 2.95, step_time: 0.0397, tokens_per_second: 5.16e+04, unpadded_tokens_per_second: 1.96e+04


Training:  88%|████████▊ | 88/100 [00:06<00:00, 25.38it/s, loss=2.81]

[2025-09-26 23:30:59,929][perf_logger][INFO] - loss: 2.81, global_step: 87, learning_rate: 0.000352, grad_norm: 3.12, step_time: 0.039, tokens_per_second: 5.26e+04, unpadded_tokens_per_second: 9.04e+03


Training:  89%|████████▉ | 89/100 [00:06<00:00, 25.38it/s, loss=2.93]

[2025-09-26 23:30:59,968][perf_logger][INFO] - loss: 2.93, global_step: 88, learning_rate: 0.000356, grad_norm: 2.55, step_time: 0.0392, tokens_per_second: 5.22e+04, unpadded_tokens_per_second: 2.01e+04


Training:  90%|█████████ | 90/100 [00:06<00:00, 25.38it/s, loss=2.91]

[2025-09-26 23:31:00,007][perf_logger][INFO] - loss: 2.91, global_step: 89, learning_rate: 0.00036, grad_norm: 4.31, step_time: 0.0389, tokens_per_second: 5.26e+04, unpadded_tokens_per_second: 1.26e+04


Training:  91%|█████████ | 91/100 [00:06<00:00, 25.50it/s, loss=2.88]

[2025-09-26 23:31:00,045][perf_logger][INFO] - loss: 2.88, global_step: 90, learning_rate: 0.000364, grad_norm: 2.86, step_time: 0.0382, tokens_per_second: 5.36e+04, unpadded_tokens_per_second: 1.74e+04


Training:  92%|█████████▏| 92/100 [00:06<00:00, 25.50it/s, loss=2.9] 

[2025-09-26 23:31:00,084][perf_logger][INFO] - loss: 2.9, global_step: 91, learning_rate: 0.000368, grad_norm: 3.48, step_time: 0.039, tokens_per_second: 5.25e+04, unpadded_tokens_per_second: 1.82e+04


Training:  93%|█████████▎| 93/100 [00:06<00:00, 25.50it/s, loss=2.99]

[2025-09-26 23:31:00,124][perf_logger][INFO] - loss: 2.99, global_step: 92, learning_rate: 0.000372, grad_norm: 4.93, step_time: 0.0394, tokens_per_second: 5.2e+04, unpadded_tokens_per_second: 4.95e+03


Training:  94%|█████████▍| 94/100 [00:06<00:00, 25.52it/s, loss=2.89]

[2025-09-26 23:31:00,163][perf_logger][INFO] - loss: 2.89, global_step: 93, learning_rate: 0.000376, grad_norm: 2.49, step_time: 0.0389, tokens_per_second: 5.26e+04, unpadded_tokens_per_second: 2.31e+04


Training:  95%|█████████▌| 95/100 [00:06<00:00, 25.52it/s, loss=2.83]

[2025-09-26 23:31:00,202][perf_logger][INFO] - loss: 2.83, global_step: 94, learning_rate: 0.00038, grad_norm: 2.84, step_time: 0.0393, tokens_per_second: 5.21e+04, unpadded_tokens_per_second: 1.05e+04


Training:  96%|█████████▌| 96/100 [00:06<00:00, 25.52it/s, loss=2.5] 

[2025-09-26 23:31:00,240][perf_logger][INFO] - loss: 2.5, global_step: 95, learning_rate: 0.000384, grad_norm: 5.13, step_time: 0.0381, tokens_per_second: 5.38e+04, unpadded_tokens_per_second: 4.41e+03


Training:  97%|█████████▋| 97/100 [00:06<00:00, 25.61it/s, loss=2.81]

[2025-09-26 23:31:00,279][perf_logger][INFO] - loss: 2.81, global_step: 96, learning_rate: 0.000388, grad_norm: 2.39, step_time: 0.0388, tokens_per_second: 5.28e+04, unpadded_tokens_per_second: 1.59e+04


Training:  98%|█████████▊| 98/100 [00:06<00:00, 25.61it/s, loss=2.71]

[2025-09-26 23:31:00,318][perf_logger][INFO] - loss: 2.71, global_step: 97, learning_rate: 0.000392, grad_norm: 2.86, step_time: 0.0392, tokens_per_second: 5.22e+04, unpadded_tokens_per_second: 1.26e+04


Training:  99%|█████████▉| 99/100 [00:06<00:00, 25.61it/s, loss=2.77]

[2025-09-26 23:31:00,362][perf_logger][INFO] - loss: 2.77, global_step: 98, learning_rate: 0.000396, grad_norm: 2.28, step_time: 0.0439, tokens_per_second: 4.66e+04, unpadded_tokens_per_second: 2.18e+04


Training: 100%|██████████| 100/100 [00:06<00:00, 25.28it/s, loss=2.79]

[2025-09-26 23:31:00,401][perf_logger][INFO] - loss: 2.79, global_step: 99, learning_rate: 0.0004, grad_norm: 2.51, step_time: 0.0391, tokens_per_second: 5.24e+04, unpadded_tokens_per_second: 2.72e+04


wandb: 
wandb: Run history:
wandb:                train/global_step ▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
wandb:                  train/grad_norm ▇█▇▆▆▃▃▃▂▂▂▃▂▂▂▁▂▁▁▃▁▁▂▁▂▂▅▂▂▃▂▂▂▁▁▁▂▂▁
wandb:              train/learning_rate ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇██
wandb:                       train/loss ▇▇█▅▂▅▃▄▃▄▂▃▃▃▂▃▃▂▂▃▂▂▁▂▂▂▂▃▁▃▂▃▃▂▂▂▃▃▂
wandb:                  train/step_time  ▃▃▂▂▃▂▂▃▄▂▂▂▂▃█▅▂█▂▂▂▁▂▁▂▂▂▁▂▂▂▂▂▂▁▂▂▂▂
wandb:          train/tokens_per_second  ▁▆▆▇▇▇▇▇▆▆▅▆▇▇▆▃▇▇▇██▇▇▇▇▇██▇▇▇▇▇▇▇█▇▇
wandb: train/unpadded_tokens_per_second  ▄▄▃▄▂▅▄▆▂█▂▃▁▅▃▄▇▁▃▄▃▅▃▆▆▃▃▂▅▅▃▂▅▆▁▆▁▄▆
wandb: 
wandb: Run summary:
wandb:                train/global_step 99
wandb:                  train/grad_norm 2.50514
wandb:              train/learning_rate 0.0004
wandb:                       train/loss 2.7893
wandb:                  train/step_time 0.03907
wandb:          train/tokens_per_second 52416.3742
wandb: train/unpadded_tokens_per_second 27231.94441
wandb: 
wandb: You can sync this

[2025-09-26 23:31:00,413][perf_logger][INFO] - RUN CONFIG:
{'adamw_kwargs': {'betas': [0.9, 0.98],
                  'eps': 1e-08,
                  'fused': True,
                  'lr': 0.0004,
                  'weight_decay': 0.01},
 'checkpoint': {'ckpt_dir': None,
                'resume_from_checkpoint': True,
                'save_checkpoints': True,
                'save_every_n_steps': 50,
                'save_final_model': False,
                'use_distributed_checkpoint_fsdp2': True},
 'dataset': {'load_dataset_kwargs': {'data_files': 'train.parquet',
                                     'path': 'parquet',
                                     'split': 'train',
                                     'streaming': True},
             'max_seq_length': 1024,
             'micro_batch_size': 2,
             'num_workers': 1,
             'sequence_packing_pad_to_multiple_of': None,
             'tokenizer_name': 'nvidia/esm2_t6_8M_UR50D',
             'use_sequence_packing': Fa

## ESM2 Training with Sequence Packing

To turn on sequence packing, set `dataset.use_sequence_packing=true` in your config, or pass it as a command-line override, as in the `torchrun` command below.
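To make the effect of this flag concrete, here is a minimal sketch of the idea behind sequence packing. It is not the recipe's actual collator: the helper names `pad_batch` and `pack_batch`, the token ids, and the padding id are all made up for illustration. With padding, every sequence in a batch is extended to the longest one, so the GPU spends time on filler tokens. With packing, sequences are concatenated into one flat list and their boundaries are recorded as cumulative lengths (commonly called `cu_seqlens`), so only real tokens are processed. This is why the unpacked run above logs a `tokens_per_second` well above its `unpadded_tokens_per_second`, while the packed run below logs nearly identical values for both.

```python
from itertools import accumulate

PAD_ID = 1  # assumed padding token id, for illustration only


def pad_batch(seqs, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


def pack_batch(seqs):
    """Concatenate sequences and record their boundaries as cumulative lengths."""
    tokens = [tok for s in seqs for tok in s]
    cu_seqlens = [0] + list(accumulate(len(s) for s in seqs))
    return tokens, cu_seqlens


# Two toy "proteins" of different length (token ids are made up).
seqs = [[0, 5, 6, 7, 2], [0, 9, 2]]

padded = pad_batch(seqs)
packed_tokens, cu_seqlens = pack_batch(seqs)

padded_slots = sum(len(row) for row in padded)  # positions processed with padding
real_tokens = len(packed_tokens)                # positions that carry information

print(f"padded slots: {padded_slots}, real tokens: {real_tokens}")
print(f"cu_seqlens: {cu_seqlens}")
print(f"fraction of useful work with padding: {real_tokens / padded_slots:.2f}")
```

In this toy batch, 2 of the 10 padded positions are filler. The wasted fraction grows as sequence lengths in a batch become more uneven, which is typical for protein datasets.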

In [17]:
%%bash
cd bionemo-framework/bionemo-recipes/recipes/esm2_native_te
torchrun train_mfsdp.py --config-name L0_sanity ++dataset.use_sequence_packing=true

[2025-09-26 23:55:43,710][__main__][INFO] - Initializing distributed training: DistributedConfig(rank=0, local_rank=0, world_size=1, _master_addr='127.0.0.1', _master_port='29500')
[2025-09-26 23:55:44,635][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/nvidia/esm2_t33_650M_UR50D/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
[2025-09-26 23:55:44,775][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t33_650M_UR50D/a9cc870059da10264e80e3fbf67d6bb3986b9442/config.json "HTTP/1.1 200 OK"
[2025-09-26 23:55:44,937][httpx][INFO] - HTTP Request: GET https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t33_650M_UR50D/a9cc870059da10264e80e3fbf67d6bb3986b9442/config.json "HTTP/1.1 200 OK"


The module name nvidia/esm2_t33_650M_UR50D (originally nvidia/esm2_t33_650M_UR50D) is not a valid Python identifier. Please rename the original module to avoid import issues.


[2025-09-26 23:55:45,107][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/nvidia/esm2_t33_650M_UR50D/resolve/main/esm_nv.py "HTTP/1.1 307 Temporary Redirect"
[2025-09-26 23:55:45,245][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t33_650M_UR50D/a9cc870059da10264e80e3fbf67d6bb3986b9442/esm_nv.py "HTTP/1.1 200 OK"
[2025-09-26 23:55:45,381][httpx][INFO] - HTTP Request: GET https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t33_650M_UR50D/a9cc870059da10264e80e3fbf67d6bb3986b9442/esm_nv.py "HTTP/1.1 200 OK"


A new version of the following files was downloaded from https://huggingface.co/nvidia/esm2_t33_650M_UR50D:
- esm_nv.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


[2025-09-26 23:55:46,111][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/nvidia/esm2_t33_650M_UR50D/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
[2025-09-26 23:55:46,143][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t33_650M_UR50D/a9cc870059da10264e80e3fbf67d6bb3986b9442/config.json "HTTP/1.1 200 OK"


The module name nvidia/esm2_t33_650M_UR50D (originally nvidia/esm2_t33_650M_UR50D) is not a valid Python identifier. Please rename the original module to avoid import issues.


[2025-09-26 23:55:46,292][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/nvidia/esm2_t33_650M_UR50D/resolve/main/esm_nv.py "HTTP/1.1 307 Temporary Redirect"
[2025-09-26 23:55:46,323][httpx][INFO] - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/nvidia/esm2_t33_650M_UR50D/a9cc870059da10264e80e3fbf67d6bb3986b9442/esm_nv.py "HTTP/1.1 200 OK"
[2025-09-26 23:55:48,060][megatron_fsdp.param_and_grad_buffer][INFO] - Number of FSDP Parameter Groups: 36
[FSDP_UNIT 67] Group 0: elems=84544 dtype=torch.bfloat16 bufs=weight,main_weight,grad pad=0.01 MB
	esm.embeddings.word_embeddings.weight (64, 1280)
	lm_head.decoder.layer_norm_weight (1280,)
	lm_head.decoder.layer_norm_bias (1280,)
	lm_head.decoder.bias (64,)
[FSDP_UNIT 0] Group 1: elems=19677440 dtype=torch.bfloat16 bufs=weight,main_weight,grad pad=0.03 MB
	esm.encoder.layers.0.layernorm_mlp.fc2_weight (1280, 5120)
	esm.encoder.layers.0.self_attention.layernorm_qkv.weight (3840, 1280)
	esm.encoder.layers.0.self_atte

wandb: Tracking run with wandb version 0.22.0
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /mount/zozhang/bionemo/bionemo-framework/bionemo-recipes/recipes/esm2_native_te/wandb/offline-run-20250926_235550-asa9m0c2
Training:   0%|          | 1/200 [00:06<20:14,  6.10s/it, loss=3.54]

[2025-09-26 23:55:57,948][perf_logger][INFO] - loss: 3.54, global_step: 0, learning_rate: 2e-07, grad_norm: 23.9, step_time: nan, tokens_per_second: nan, unpadded_tokens_per_second: nan


Training:   1%|          | 2/200 [00:07<10:41,  3.24s/it, loss=3.56]

[2025-09-26 23:55:59,188][perf_logger][INFO] - loss: 3.56, global_step: 1, learning_rate: 4e-07, grad_norm: 25.3, step_time: 1.24, tokens_per_second: 1.27e+03, unpadded_tokens_per_second: 1.27e+03


Training:   2%|▏         | 3/200 [00:07<06:30,  1.98s/it, loss=3.46]

[2025-09-26 23:55:59,668][perf_logger][INFO] - loss: 3.46, global_step: 2, learning_rate: 6e-07, grad_norm: 30.1, step_time: 0.48, tokens_per_second: 2.26e+03, unpadded_tokens_per_second: 2.26e+03


Training:   2%|▏         | 4/200 [00:08<04:10,  1.28s/it, loss=3.46]

[2025-09-26 23:55:59,869][perf_logger][INFO] - loss: 3.46, global_step: 3, learning_rate: 8e-07, grad_norm: 24, step_time: 0.201, tokens_per_second: 4.99e+03, unpadded_tokens_per_second: 4.99e+03


Training:   2%|▎         | 5/200 [00:08<02:53,  1.13it/s, loss=3.47]

[2025-09-26 23:56:00,066][perf_logger][INFO] - loss: 3.47, global_step: 4, learning_rate: 1e-06, grad_norm: 24, step_time: 0.197, tokens_per_second: 4.73e+03, unpadded_tokens_per_second: 4.73e+03


Training:   3%|▎         | 6/200 [00:08<02:06,  1.54it/s, loss=3.43]

[2025-09-26 23:56:00,258][perf_logger][INFO] - loss: 3.43, global_step: 5, learning_rate: 1.2e-06, grad_norm: 32.1, step_time: 0.192, tokens_per_second: 5.54e+03, unpadded_tokens_per_second: 5.54e+03


Training:   4%|▎         | 7/200 [00:08<01:37,  1.98it/s, loss=3.52]

[2025-09-26 23:56:00,460][perf_logger][INFO] - loss: 3.52, global_step: 6, learning_rate: 1.4e-06, grad_norm: 21.5, step_time: 0.203, tokens_per_second: 4.13e+03, unpadded_tokens_per_second: 4.13e+03


Training:   4%|▍         | 8/200 [00:08<01:18,  2.46it/s, loss=3.69]

[2025-09-26 23:56:00,656][perf_logger][INFO] - loss: 3.69, global_step: 7, learning_rate: 1.6e-06, grad_norm: 26.6, step_time: 0.196, tokens_per_second: 8.13e+03, unpadded_tokens_per_second: 8.12e+03


Training:   4%|▍         | 9/200 [00:09<01:14,  2.56it/s, loss=3.43]

[2025-09-26 23:56:01,014][perf_logger][INFO] - loss: 3.43, global_step: 8, learning_rate: 1.8e-06, grad_norm: 21.4, step_time: 0.358, tokens_per_second: 4.09e+03, unpadded_tokens_per_second: 4.09e+03


Training:   5%|▌         | 10/200 [00:09<01:02,  3.05it/s, loss=3.34]

[2025-09-26 23:56:01,199][perf_logger][INFO] - loss: 3.34, global_step: 9, learning_rate: 2e-06, grad_norm: 23.8, step_time: 0.185, tokens_per_second: 5.48e+03, unpadded_tokens_per_second: 5.48e+03


Training:   6%|▌         | 11/200 [00:09<00:53,  3.51it/s, loss=3.36]

[2025-09-26 23:56:01,388][perf_logger][INFO] - loss: 3.36, global_step: 10, learning_rate: 2.2e-06, grad_norm: 21.1, step_time: 0.189, tokens_per_second: 5.63e+03, unpadded_tokens_per_second: 5.63e+03


Training:   6%|▌         | 12/200 [00:09<00:48,  3.88it/s, loss=3.22]

[2025-09-26 23:56:01,582][perf_logger][INFO] - loss: 3.22, global_step: 11, learning_rate: 2.4e-06, grad_norm: 18.9, step_time: 0.194, tokens_per_second: 4.17e+03, unpadded_tokens_per_second: 4.17e+03


Training:   6%|▋         | 13/200 [00:09<00:45,  4.09it/s, loss=3.19]

[2025-09-26 23:56:01,796][perf_logger][INFO] - loss: 3.19, global_step: 12, learning_rate: 2.6e-06, grad_norm: 21.3, step_time: 0.214, tokens_per_second: 4.04e+03, unpadded_tokens_per_second: 4.04e+03


Training:   7%|▋         | 14/200 [00:10<00:43,  4.32it/s, loss=3.19]

[2025-09-26 23:56:01,999][perf_logger][INFO] - loss: 3.19, global_step: 13, learning_rate: 2.8e-06, grad_norm: 15.9, step_time: 0.203, tokens_per_second: 7.04e+03, unpadded_tokens_per_second: 7.03e+03


Training:   8%|▊         | 15/200 [00:10<00:40,  4.52it/s, loss=3.12]

[2025-09-26 23:56:02,195][perf_logger][INFO] - loss: 3.12, global_step: 14, learning_rate: 3e-06, grad_norm: 14.4, step_time: 0.197, tokens_per_second: 6.89e+03, unpadded_tokens_per_second: 6.89e+03


Training:   8%|▊         | 16/200 [00:10<00:38,  4.74it/s, loss=3.06]

[2025-09-26 23:56:02,382][perf_logger][INFO] - loss: 3.06, global_step: 15, learning_rate: 3.2e-06, grad_norm: 17.1, step_time: 0.187, tokens_per_second: 4.51e+03, unpadded_tokens_per_second: 4.5e+03


Training:   8%|▊         | 17/200 [00:10<00:37,  4.90it/s, loss=3.01]

[2025-09-26 23:56:02,570][perf_logger][INFO] - loss: 3.01, global_step: 16, learning_rate: 3.4e-06, grad_norm: 17.5, step_time: 0.188, tokens_per_second: 8.21e+03, unpadded_tokens_per_second: 8.21e+03


Training:   9%|▉         | 18/200 [00:10<00:35,  5.07it/s, loss=3.1] 

[2025-09-26 23:56:02,752][perf_logger][INFO] - loss: 3.1, global_step: 17, learning_rate: 3.6e-06, grad_norm: 15.1, step_time: 0.181, tokens_per_second: 8.49e+03, unpadded_tokens_per_second: 8.49e+03


Training:  10%|▉         | 19/200 [00:11<00:35,  5.09it/s, loss=3.02]

[2025-09-26 23:56:02,946][perf_logger][INFO] - loss: 3.02, global_step: 18, learning_rate: 3.8e-06, grad_norm: 17.1, step_time: 0.195, tokens_per_second: 6.14e+03, unpadded_tokens_per_second: 6.13e+03


Training:  10%|█         | 20/200 [00:11<00:35,  5.08it/s, loss=2.85]

[2025-09-26 23:56:03,144][perf_logger][INFO] - loss: 2.85, global_step: 19, learning_rate: 4e-06, grad_norm: 11.3, step_time: 0.198, tokens_per_second: 6.49e+03, unpadded_tokens_per_second: 6.49e+03


Training:  10%|█         | 21/200 [00:11<00:36,  4.92it/s, loss=2.84]

[2025-09-26 23:56:03,363][perf_logger][INFO] - loss: 2.84, global_step: 20, learning_rate: 4.2e-06, grad_norm: 8.91, step_time: 0.218, tokens_per_second: 8.56e+03, unpadded_tokens_per_second: 8.54e+03


Training:  11%|█         | 22/200 [00:11<00:35,  5.03it/s, loss=2.91]

[2025-09-26 23:56:03,551][perf_logger][INFO] - loss: 2.91, global_step: 21, learning_rate: 4.4e-06, grad_norm: 13.1, step_time: 0.188, tokens_per_second: 7.66e+03, unpadded_tokens_per_second: 7.65e+03


Training:  12%|█▏        | 23/200 [00:11<00:34,  5.09it/s, loss=2.72]

[2025-09-26 23:56:03,742][perf_logger][INFO] - loss: 2.72, global_step: 22, learning_rate: 4.6e-06, grad_norm: 14.7, step_time: 0.19, tokens_per_second: 4.06e+03, unpadded_tokens_per_second: 4.06e+03


Training:  12%|█▏        | 24/200 [00:12<00:36,  4.84it/s, loss=2.97]

[2025-09-26 23:56:03,972][perf_logger][INFO] - loss: 2.97, global_step: 23, learning_rate: 4.8e-06, grad_norm: 10.9, step_time: 0.23, tokens_per_second: 8.5e+03, unpadded_tokens_per_second: 8.49e+03


Training:  12%|█▎        | 25/200 [00:12<00:35,  4.86it/s, loss=2.92]

[2025-09-26 23:56:04,176][perf_logger][INFO] - loss: 2.92, global_step: 24, learning_rate: 5e-06, grad_norm: 13.6, step_time: 0.204, tokens_per_second: 2.67e+03, unpadded_tokens_per_second: 2.67e+03


Training:  13%|█▎        | 26/200 [00:12<00:34,  5.03it/s, loss=2.93]

[2025-09-26 23:56:04,358][perf_logger][INFO] - loss: 2.93, global_step: 25, learning_rate: 5.2e-06, grad_norm: 7.91, step_time: 0.183, tokens_per_second: 8.98e+03, unpadded_tokens_per_second: 8.97e+03


Training:  14%|█▎        | 27/200 [00:12<00:33,  5.17it/s, loss=2.86]

[2025-09-26 23:56:04,539][perf_logger][INFO] - loss: 2.86, global_step: 26, learning_rate: 5.4e-06, grad_norm: 8.63, step_time: 0.18, tokens_per_second: 6.62e+03, unpadded_tokens_per_second: 6.62e+03


Training:  14%|█▍        | 28/200 [00:12<00:33,  5.17it/s, loss=3.05]

[2025-09-26 23:56:04,732][perf_logger][INFO] - loss: 3.05, global_step: 27, learning_rate: 5.6e-06, grad_norm: 13.9, step_time: 0.193, tokens_per_second: 8.76e+03, unpadded_tokens_per_second: 8.76e+03


Training:  14%|█▍        | 29/200 [00:13<00:33,  5.16it/s, loss=2.96]

[2025-09-26 23:56:04,927][perf_logger][INFO] - loss: 2.96, global_step: 28, learning_rate: 5.8e-06, grad_norm: 11, step_time: 0.195, tokens_per_second: 7e+03, unpadded_tokens_per_second: 6.99e+03


Training:  15%|█▌        | 30/200 [00:13<00:32,  5.26it/s, loss=2.99]

[2025-09-26 23:56:05,108][perf_logger][INFO] - loss: 2.99, global_step: 29, learning_rate: 6e-06, grad_norm: 12.3, step_time: 0.181, tokens_per_second: 7.07e+03, unpadded_tokens_per_second: 7.06e+03


Training:  16%|█▌        | 31/200 [00:13<00:31,  5.33it/s, loss=2.93]

[2025-09-26 23:56:05,291][perf_logger][INFO] - loss: 2.93, global_step: 30, learning_rate: 6.2e-06, grad_norm: 14.6, step_time: 0.183, tokens_per_second: 4.73e+03, unpadded_tokens_per_second: 4.72e+03


Training:  16%|█▌        | 32/200 [00:13<00:31,  5.28it/s, loss=2.87]

[2025-09-26 23:56:05,483][perf_logger][INFO] - loss: 2.87, global_step: 31, learning_rate: 6.4e-06, grad_norm: 8.38, step_time: 0.193, tokens_per_second: 5.4e+03, unpadded_tokens_per_second: 5.4e+03


Training:  16%|█▋        | 33/200 [00:13<00:31,  5.30it/s, loss=2.73]

[2025-09-26 23:56:05,671][perf_logger][INFO] - loss: 2.73, global_step: 32, learning_rate: 6.6e-06, grad_norm: 9.17, step_time: 0.188, tokens_per_second: 6.61e+03, unpadded_tokens_per_second: 6.61e+03


Training:  17%|█▋        | 34/200 [00:14<00:31,  5.32it/s, loss=2.92]

[2025-09-26 23:56:05,857][perf_logger][INFO] - loss: 2.92, global_step: 33, learning_rate: 6.8e-06, grad_norm: 11, step_time: 0.186, tokens_per_second: 7.92e+03, unpadded_tokens_per_second: 7.92e+03


Training:  18%|█▊        | 35/200 [00:14<00:30,  5.36it/s, loss=2.95]

[2025-09-26 23:56:06,041][perf_logger][INFO] - loss: 2.95, global_step: 34, learning_rate: 7e-06, grad_norm: 10.3, step_time: 0.184, tokens_per_second: 9.02e+03, unpadded_tokens_per_second: 9.01e+03


Training:  18%|█▊        | 36/200 [00:14<00:31,  5.20it/s, loss=2.82]

[2025-09-26 23:56:06,246][perf_logger][INFO] - loss: 2.82, global_step: 35, learning_rate: 7.2e-06, grad_norm: 9.7, step_time: 0.205, tokens_per_second: 5.93e+03, unpadded_tokens_per_second: 5.93e+03


Training:  18%|█▊        | 37/200 [00:14<00:31,  5.19it/s, loss=2.88]

[2025-09-26 23:56:06,440][perf_logger][INFO] - loss: 2.88, global_step: 36, learning_rate: 7.4e-06, grad_norm: 8.45, step_time: 0.194, tokens_per_second: 5.79e+03, unpadded_tokens_per_second: 5.78e+03


Training:  19%|█▉        | 38/200 [00:14<00:30,  5.26it/s, loss=2.81]

[2025-09-26 23:56:06,623][perf_logger][INFO] - loss: 2.81, global_step: 37, learning_rate: 7.6e-06, grad_norm: 7.83, step_time: 0.183, tokens_per_second: 6.21e+03, unpadded_tokens_per_second: 6.21e+03


Training:  20%|█▉        | 39/200 [00:14<00:31,  5.16it/s, loss=2.77]

[2025-09-26 23:56:06,827][perf_logger][INFO] - loss: 2.77, global_step: 38, learning_rate: 7.8e-06, grad_norm: 8.52, step_time: 0.203, tokens_per_second: 3.61e+03, unpadded_tokens_per_second: 3.61e+03


Training:  20%|██        | 40/200 [00:15<00:30,  5.21it/s, loss=2.83]

[2025-09-26 23:56:07,014][perf_logger][INFO] - loss: 2.83, global_step: 39, learning_rate: 8e-06, grad_norm: 6.5, step_time: 0.188, tokens_per_second: 5.87e+03, unpadded_tokens_per_second: 5.86e+03


Training:  20%|██        | 41/200 [00:15<00:30,  5.19it/s, loss=2.94]

[2025-09-26 23:56:07,208][perf_logger][INFO] - loss: 2.94, global_step: 40, learning_rate: 8.2e-06, grad_norm: 8.1, step_time: 0.194, tokens_per_second: 6.71e+03, unpadded_tokens_per_second: 6.7e+03


Training:  21%|██        | 42/200 [00:15<00:30,  5.20it/s, loss=3.13]

[2025-09-26 23:56:07,400][perf_logger][INFO] - loss: 3.13, global_step: 41, learning_rate: 8.4e-06, grad_norm: 10.5, step_time: 0.192, tokens_per_second: 5.04e+03, unpadded_tokens_per_second: 5.04e+03


Training:  22%|██▏       | 43/200 [00:15<00:30,  5.22it/s, loss=2.84]

[2025-09-26 23:56:07,589][perf_logger][INFO] - loss: 2.84, global_step: 42, learning_rate: 8.6e-06, grad_norm: 7.85, step_time: 0.189, tokens_per_second: 4.81e+03, unpadded_tokens_per_second: 4.81e+03


Training:  22%|██▏       | 44/200 [00:15<00:29,  5.27it/s, loss=2.87]

[2025-09-26 23:56:07,775][perf_logger][INFO] - loss: 2.87, global_step: 43, learning_rate: 8.8e-06, grad_norm: 8.62, step_time: 0.186, tokens_per_second: 6.06e+03, unpadded_tokens_per_second: 6.06e+03


Training:  22%|██▎       | 45/200 [00:16<00:29,  5.32it/s, loss=2.86]

[2025-09-26 23:56:07,959][perf_logger][INFO] - loss: 2.86, global_step: 44, learning_rate: 9e-06, grad_norm: 6.69, step_time: 0.183, tokens_per_second: 7.01e+03, unpadded_tokens_per_second: 7.01e+03


Training:  23%|██▎       | 46/200 [00:16<00:28,  5.38it/s, loss=2.94]

[2025-09-26 23:56:08,140][perf_logger][INFO] - loss: 2.94, global_step: 45, learning_rate: 9.2e-06, grad_norm: 5.68, step_time: 0.181, tokens_per_second: 7.59e+03, unpadded_tokens_per_second: 7.59e+03


Training:  24%|██▎       | 47/200 [00:16<00:28,  5.44it/s, loss=2.94]

[2025-09-26 23:56:08,319][perf_logger][INFO] - loss: 2.94, global_step: 46, learning_rate: 9.4e-06, grad_norm: 8.84, step_time: 0.179, tokens_per_second: 6.11e+03, unpadded_tokens_per_second: 6.11e+03


Training:  24%|██▍       | 48/200 [00:16<00:28,  5.35it/s, loss=2.99]

[2025-09-26 23:56:08,513][perf_logger][INFO] - loss: 2.99, global_step: 47, learning_rate: 9.6e-06, grad_norm: 9.7, step_time: 0.194, tokens_per_second: 2.99e+03, unpadded_tokens_per_second: 2.98e+03


Training:  24%|██▍       | 49/200 [00:16<00:28,  5.39it/s, loss=2.92]

[2025-09-26 23:56:08,696][perf_logger][INFO] - loss: 2.92, global_step: 48, learning_rate: 9.8e-06, grad_norm: 4.02, step_time: 0.182, tokens_per_second: 6.11e+03, unpadded_tokens_per_second: 6.11e+03


Training:  25%|██▌       | 50/200 [00:17<00:28,  5.30it/s, loss=2.85]

[2025-09-26 23:56:08,891][perf_logger][INFO] - loss: 2.85, global_step: 49, learning_rate: 1e-05, grad_norm: 5.37, step_time: 0.196, tokens_per_second: 1.03e+04, unpadded_tokens_per_second: 1.03e+04
[2025-09-26 23:56:18,475][checkpoint][INFO] - Saved mFSDP checkpoint to checkpoints/esm2_t33_650M_UR50D_sanity/train_mfsdp/step_50


Training:  26%|██▌       | 51/200 [00:26<07:28,  3.01s/it, loss=2.9] 

[2025-09-26 23:56:18,479][perf_logger][INFO] - loss: 2.9, global_step: 50, learning_rate: 1.02e-05, grad_norm: 5.98, step_time: 9.59, tokens_per_second: 100, unpadded_tokens_per_second: 100


Training:  26%|██▌       | 52/200 [00:26<05:21,  2.17s/it, loss=2.85]

[2025-09-26 23:56:18,700][perf_logger][INFO] - loss: 2.85, global_step: 51, learning_rate: 1.04e-05, grad_norm: 8.08, step_time: 0.221, tokens_per_second: 5.92e+03, unpadded_tokens_per_second: 5.92e+03


Training:  26%|██▋       | 53/200 [00:27<03:52,  1.58s/it, loss=2.93]

[2025-09-26 23:56:18,893][perf_logger][INFO] - loss: 2.93, global_step: 52, learning_rate: 1.06e-05, grad_norm: 5.21, step_time: 0.193, tokens_per_second: 8.91e+03, unpadded_tokens_per_second: 8.9e+03


Training:  27%|██▋       | 54/200 [00:27<02:49,  1.16s/it, loss=2.84]

[2025-09-26 23:56:19,078][perf_logger][INFO] - loss: 2.84, global_step: 53, learning_rate: 1.08e-05, grad_norm: 6.04, step_time: 0.186, tokens_per_second: 6.8e+03, unpadded_tokens_per_second: 6.8e+03


Training:  28%|██▊       | 55/200 [00:27<02:05,  1.15it/s, loss=2.82]

[2025-09-26 23:56:19,264][perf_logger][INFO] - loss: 2.82, global_step: 54, learning_rate: 1.1e-05, grad_norm: 6.71, step_time: 0.185, tokens_per_second: 5.68e+03, unpadded_tokens_per_second: 5.68e+03


Training:  28%|██▊       | 56/200 [00:27<01:36,  1.50it/s, loss=2.88]

[2025-09-26 23:56:19,461][perf_logger][INFO] - loss: 2.88, global_step: 55, learning_rate: 1.12e-05, grad_norm: 6.47, step_time: 0.198, tokens_per_second: 2.86e+03, unpadded_tokens_per_second: 2.86e+03


Training:  28%|██▊       | 57/200 [00:27<01:14,  1.91it/s, loss=2.86]

[2025-09-26 23:56:19,653][perf_logger][INFO] - loss: 2.86, global_step: 56, learning_rate: 1.14e-05, grad_norm: 5.19, step_time: 0.192, tokens_per_second: 7.12e+03, unpadded_tokens_per_second: 7.12e+03


Training:  29%|██▉       | 58/200 [00:27<01:00,  2.36it/s, loss=2.84]

[2025-09-26 23:56:19,841][perf_logger][INFO] - loss: 2.84, global_step: 57, learning_rate: 1.16e-05, grad_norm: 5.19, step_time: 0.188, tokens_per_second: 9.22e+03, unpadded_tokens_per_second: 9.22e+03


Training:  30%|██▉       | 59/200 [00:28<00:49,  2.84it/s, loss=2.79]

[2025-09-26 23:56:20,025][perf_logger][INFO] - loss: 2.79, global_step: 58, learning_rate: 1.18e-05, grad_norm: 3.46, step_time: 0.184, tokens_per_second: 6.87e+03, unpadded_tokens_per_second: 6.87e+03


Training:  30%|███       | 60/200 [00:28<00:42,  3.31it/s, loss=2.78]

[2025-09-26 23:56:20,212][perf_logger][INFO] - loss: 2.78, global_step: 59, learning_rate: 1.2e-05, grad_norm: 3.75, step_time: 0.186, tokens_per_second: 5.32e+03, unpadded_tokens_per_second: 5.32e+03


Training:  30%|███       | 61/200 [00:28<00:49,  2.83it/s, loss=2.91]

[2025-09-26 23:56:20,686][perf_logger][INFO] - loss: 2.91, global_step: 60, learning_rate: 1.22e-05, grad_norm: 5.31, step_time: 0.475, tokens_per_second: 2.88e+03, unpadded_tokens_per_second: 2.88e+03


Training:  31%|███       | 62/200 [00:29<00:42,  3.27it/s, loss=2.83]

[2025-09-26 23:56:20,880][perf_logger][INFO] - loss: 2.83, global_step: 61, learning_rate: 1.24e-05, grad_norm: 7.35, step_time: 0.193, tokens_per_second: 7.81e+03, unpadded_tokens_per_second: 7.81e+03


Training:  32%|███▏      | 63/200 [00:29<00:36,  3.72it/s, loss=2.91]

[2025-09-26 23:56:21,062][perf_logger][INFO] - loss: 2.91, global_step: 62, learning_rate: 1.26e-05, grad_norm: 7.25, step_time: 0.183, tokens_per_second: 7.26e+03, unpadded_tokens_per_second: 7.26e+03


Training:  32%|███▏      | 64/200 [00:29<00:33,  4.02it/s, loss=2.88]

[2025-09-26 23:56:21,265][perf_logger][INFO] - loss: 2.88, global_step: 63, learning_rate: 1.28e-05, grad_norm: 5.24, step_time: 0.202, tokens_per_second: 5.84e+03, unpadded_tokens_per_second: 5.84e+03


Training:  32%|███▎      | 65/200 [00:29<00:30,  4.36it/s, loss=2.85]

[2025-09-26 23:56:21,448][perf_logger][INFO] - loss: 2.85, global_step: 64, learning_rate: 1.3e-05, grad_norm: 6.12, step_time: 0.183, tokens_per_second: 8.84e+03, unpadded_tokens_per_second: 8.84e+03


Training:  33%|███▎      | 66/200 [00:29<00:29,  4.57it/s, loss=2.98]

[2025-09-26 23:56:21,642][perf_logger][INFO] - loss: 2.98, global_step: 65, learning_rate: 1.32e-05, grad_norm: 9.75, step_time: 0.194, tokens_per_second: 5.07e+03, unpadded_tokens_per_second: 5.07e+03


Training:  34%|███▎      | 67/200 [00:29<00:28,  4.74it/s, loss=2.9] 

[2025-09-26 23:56:21,835][perf_logger][INFO] - loss: 2.9, global_step: 66, learning_rate: 1.34e-05, grad_norm: 7.95, step_time: 0.193, tokens_per_second: 4.88e+03, unpadded_tokens_per_second: 4.88e+03


Training:  34%|███▍      | 68/200 [00:30<00:27,  4.88it/s, loss=2.78]

[2025-09-26 23:56:22,026][perf_logger][INFO] - loss: 2.78, global_step: 67, learning_rate: 1.36e-05, grad_norm: 5.94, step_time: 0.191, tokens_per_second: 9.46e+03, unpadded_tokens_per_second: 9.46e+03


Training:  34%|███▍      | 69/200 [00:30<00:26,  5.02it/s, loss=2.85]

[2025-09-26 23:56:22,212][perf_logger][INFO] - loss: 2.85, global_step: 68, learning_rate: 1.38e-05, grad_norm: 7.35, step_time: 0.185, tokens_per_second: 8.47e+03, unpadded_tokens_per_second: 8.46e+03


Training:  35%|███▌      | 70/200 [00:30<00:25,  5.10it/s, loss=2.78]

[2025-09-26 23:56:22,401][perf_logger][INFO] - loss: 2.78, global_step: 69, learning_rate: 1.4e-05, grad_norm: 6.97, step_time: 0.189, tokens_per_second: 4.44e+03, unpadded_tokens_per_second: 4.44e+03


Training:  36%|███▌      | 71/200 [00:30<00:24,  5.18it/s, loss=2.85]

[2025-09-26 23:56:22,587][perf_logger][INFO] - loss: 2.85, global_step: 70, learning_rate: 1.42e-05, grad_norm: 5.74, step_time: 0.186, tokens_per_second: 5.76e+03, unpadded_tokens_per_second: 5.76e+03


Training:  36%|███▌      | 72/200 [00:30<00:24,  5.25it/s, loss=2.81]

[2025-09-26 23:56:22,771][perf_logger][INFO] - loss: 2.81, global_step: 71, learning_rate: 1.44e-05, grad_norm: 7.37, step_time: 0.185, tokens_per_second: 8.51e+03, unpadded_tokens_per_second: 8.51e+03


Training:  36%|███▋      | 73/200 [00:31<00:23,  5.31it/s, loss=2.95]

[2025-09-26 23:56:22,954][perf_logger][INFO] - loss: 2.95, global_step: 72, learning_rate: 1.46e-05, grad_norm: 9.14, step_time: 0.183, tokens_per_second: 4.27e+03, unpadded_tokens_per_second: 4.27e+03


Training:  37%|███▋      | 74/200 [00:31<00:23,  5.37it/s, loss=2.9] 

[2025-09-26 23:56:23,135][perf_logger][INFO] - loss: 2.9, global_step: 73, learning_rate: 1.48e-05, grad_norm: 5.48, step_time: 0.181, tokens_per_second: 9.16e+03, unpadded_tokens_per_second: 9.15e+03


Training:  38%|███▊      | 75/200 [00:31<00:23,  5.41it/s, loss=3.09]

[2025-09-26 23:56:23,317][perf_logger][INFO] - loss: 3.09, global_step: 74, learning_rate: 1.5e-05, grad_norm: 7.72, step_time: 0.182, tokens_per_second: 3.05e+03, unpadded_tokens_per_second: 3.05e+03


Training:  38%|███▊      | 76/200 [00:31<00:22,  5.40it/s, loss=2.85]

[2025-09-26 23:56:23,503][perf_logger][INFO] - loss: 2.85, global_step: 75, learning_rate: 1.52e-05, grad_norm: 7.77, step_time: 0.186, tokens_per_second: 5.35e+03, unpadded_tokens_per_second: 5.34e+03


Training:  38%|███▊      | 77/200 [00:31<00:27,  4.53it/s, loss=2.83]

[2025-09-26 23:56:23,808][perf_logger][INFO] - loss: 2.83, global_step: 76, learning_rate: 1.54e-05, grad_norm: 6.75, step_time: 0.304, tokens_per_second: 6.98e+03, unpadded_tokens_per_second: 6.98e+03


Training:  39%|███▉      | 78/200 [00:32<00:26,  4.67it/s, loss=2.9] 

[2025-09-26 23:56:24,007][perf_logger][INFO] - loss: 2.9, global_step: 77, learning_rate: 1.56e-05, grad_norm: 5.26, step_time: 0.199, tokens_per_second: 9.8e+03, unpadded_tokens_per_second: 9.8e+03


Training:  40%|███▉      | 79/200 [00:32<00:25,  4.76it/s, loss=2.9]

[2025-09-26 23:56:24,207][perf_logger][INFO] - loss: 2.9, global_step: 78, learning_rate: 1.58e-05, grad_norm: 4.84, step_time: 0.2, tokens_per_second: 6.94e+03, unpadded_tokens_per_second: 6.94e+03


Training:  40%|████      | 80/200 [00:32<00:24,  4.94it/s, loss=2.78]

[2025-09-26 23:56:24,391][perf_logger][INFO] - loss: 2.78, global_step: 79, learning_rate: 1.6e-05, grad_norm: 5.66, step_time: 0.185, tokens_per_second: 7.19e+03, unpadded_tokens_per_second: 7.19e+03


Training:  40%|████      | 81/200 [00:32<00:23,  5.04it/s, loss=2.87]

[2025-09-26 23:56:24,580][perf_logger][INFO] - loss: 2.87, global_step: 80, learning_rate: 1.62e-05, grad_norm: 5.34, step_time: 0.189, tokens_per_second: 1.12e+04, unpadded_tokens_per_second: 1.12e+04


Training:  41%|████      | 82/200 [00:32<00:22,  5.14it/s, loss=2.95]

[2025-09-26 23:56:24,766][perf_logger][INFO] - loss: 2.95, global_step: 81, learning_rate: 1.64e-05, grad_norm: 5.59, step_time: 0.187, tokens_per_second: 7.57e+03, unpadded_tokens_per_second: 7.55e+03


Training:  42%|████▏     | 83/200 [00:33<00:26,  4.34it/s, loss=2.86]

[2025-09-26 23:56:25,080][perf_logger][INFO] - loss: 2.86, global_step: 82, learning_rate: 1.66e-05, grad_norm: 4.71, step_time: 0.313, tokens_per_second: 8.96e+03, unpadded_tokens_per_second: 8.96e+03


Training:  42%|████▏     | 84/200 [00:33<00:25,  4.58it/s, loss=2.75]

[2025-09-26 23:56:25,270][perf_logger][INFO] - loss: 2.75, global_step: 83, learning_rate: 1.68e-05, grad_norm: 6.71, step_time: 0.19, tokens_per_second: 3.4e+03, unpadded_tokens_per_second: 3.39e+03


Training:  42%|████▎     | 85/200 [00:33<00:24,  4.77it/s, loss=2.83]

[2025-09-26 23:56:25,459][perf_logger][INFO] - loss: 2.83, global_step: 84, learning_rate: 1.7e-05, grad_norm: 5.43, step_time: 0.189, tokens_per_second: 8.93e+03, unpadded_tokens_per_second: 8.92e+03


Training:  43%|████▎     | 86/200 [00:33<00:22,  4.96it/s, loss=2.9] 

[2025-09-26 23:56:25,642][perf_logger][INFO] - loss: 2.9, global_step: 85, learning_rate: 1.72e-05, grad_norm: 6.36, step_time: 0.183, tokens_per_second: 7.41e+03, unpadded_tokens_per_second: 7.41e+03


Training:  44%|████▎     | 87/200 [00:33<00:22,  4.97it/s, loss=2.96]

[2025-09-26 23:56:25,842][perf_logger][INFO] - loss: 2.96, global_step: 86, learning_rate: 1.74e-05, grad_norm: 7.59, step_time: 0.2, tokens_per_second: 6.12e+03, unpadded_tokens_per_second: 6.12e+03


Training:  44%|████▍     | 88/200 [00:34<00:22,  5.09it/s, loss=2.88]

[2025-09-26 23:56:26,028][perf_logger][INFO] - loss: 2.88, global_step: 87, learning_rate: 1.76e-05, grad_norm: 7.23, step_time: 0.186, tokens_per_second: 8.1e+03, unpadded_tokens_per_second: 8.1e+03


Training:  44%|████▍     | 89/200 [00:34<00:21,  5.20it/s, loss=2.97]

[2025-09-26 23:56:26,211][perf_logger][INFO] - loss: 2.97, global_step: 88, learning_rate: 1.78e-05, grad_norm: 6.99, step_time: 0.183, tokens_per_second: 6e+03, unpadded_tokens_per_second: 5.99e+03


Training:  45%|████▌     | 90/200 [00:34<00:20,  5.28it/s, loss=2.74]

[2025-09-26 23:56:26,393][perf_logger][INFO] - loss: 2.74, global_step: 89, learning_rate: 1.8e-05, grad_norm: 4.89, step_time: 0.183, tokens_per_second: 9.99e+03, unpadded_tokens_per_second: 9.98e+03


Training:  46%|████▌     | 91/200 [00:34<00:20,  5.28it/s, loss=2.9] 

[2025-09-26 23:56:26,582][perf_logger][INFO] - loss: 2.9, global_step: 90, learning_rate: 1.82e-05, grad_norm: 7.85, step_time: 0.189, tokens_per_second: 3.47e+03, unpadded_tokens_per_second: 3.47e+03


Training:  46%|████▌     | 92/200 [00:34<00:20,  5.30it/s, loss=2.86]

[2025-09-26 23:56:26,770][perf_logger][INFO] - loss: 2.86, global_step: 91, learning_rate: 1.84e-05, grad_norm: 5.65, step_time: 0.187, tokens_per_second: 6.61e+03, unpadded_tokens_per_second: 6.61e+03


Training:  46%|████▋     | 93/200 [00:35<00:20,  5.34it/s, loss=2.93]

[2025-09-26 23:56:26,953][perf_logger][INFO] - loss: 2.93, global_step: 92, learning_rate: 1.86e-05, grad_norm: 5.31, step_time: 0.184, tokens_per_second: 6.79e+03, unpadded_tokens_per_second: 6.78e+03


Training:  47%|████▋     | 94/200 [00:35<00:19,  5.35it/s, loss=2.86]

[2025-09-26 23:56:27,140][perf_logger][INFO] - loss: 2.86, global_step: 93, learning_rate: 1.88e-05, grad_norm: 5.29, step_time: 0.186, tokens_per_second: 6.73e+03, unpadded_tokens_per_second: 6.73e+03


Training:  48%|████▊     | 95/200 [00:35<00:20,  5.16it/s, loss=2.91]

[2025-09-26 23:56:27,350][perf_logger][INFO] - loss: 2.91, global_step: 94, learning_rate: 1.9e-05, grad_norm: 6.98, step_time: 0.21, tokens_per_second: 3.03e+03, unpadded_tokens_per_second: 3.03e+03


Training:  48%|████▊     | 96/200 [00:35<00:20,  5.06it/s, loss=2.9] 

[2025-09-26 23:56:27,556][perf_logger][INFO] - loss: 2.9, global_step: 95, learning_rate: 1.92e-05, grad_norm: 9.3, step_time: 0.206, tokens_per_second: 3.42e+03, unpadded_tokens_per_second: 3.41e+03


Training:  48%|████▊     | 97/200 [00:35<00:20,  5.08it/s, loss=2.86]

[2025-09-26 23:56:27,751][perf_logger][INFO] - loss: 2.86, global_step: 96, learning_rate: 1.94e-05, grad_norm: 7.03, step_time: 0.195, tokens_per_second: 3.37e+03, unpadded_tokens_per_second: 3.37e+03


Training:  49%|████▉     | 98/200 [00:36<00:19,  5.16it/s, loss=2.88]

[2025-09-26 23:56:27,938][perf_logger][INFO] - loss: 2.88, global_step: 97, learning_rate: 1.96e-05, grad_norm: 8.28, step_time: 0.187, tokens_per_second: 6.11e+03, unpadded_tokens_per_second: 6.11e+03


Training:  50%|████▉     | 99/200 [00:36<00:19,  5.22it/s, loss=2.92]

[2025-09-26 23:56:28,124][perf_logger][INFO] - loss: 2.92, global_step: 98, learning_rate: 1.98e-05, grad_norm: 6.76, step_time: 0.186, tokens_per_second: 1.02e+04, unpadded_tokens_per_second: 1.02e+04


Training:  50%|█████     | 100/200 [00:36<00:18,  5.30it/s, loss=2.89]

[2025-09-26 23:56:28,305][perf_logger][INFO] - loss: 2.89, global_step: 99, learning_rate: 2e-05, grad_norm: 6.09, step_time: 0.182, tokens_per_second: 1.02e+04, unpadded_tokens_per_second: 1.02e+04
[2025-09-26 23:56:36,659][checkpoint][INFO] - Saved mFSDP checkpoint to checkpoints/esm2_t33_650M_UR50D_sanity/train_mfsdp/step_100


Training:  50%|█████     | 101/200 [00:44<04:21,  2.64s/it, loss=2.86]

[2025-09-26 23:56:36,662][perf_logger][INFO] - loss: 2.86, global_step: 100, learning_rate: 2.02e-05, grad_norm: 5.47, step_time: 8.36, tokens_per_second: 181, unpadded_tokens_per_second: 181


Training:  51%|█████     | 102/200 [00:45<03:07,  1.91s/it, loss=2.84]

[2025-09-26 23:56:36,873][perf_logger][INFO] - loss: 2.84, global_step: 101, learning_rate: 2.04e-05, grad_norm: 5.23, step_time: 0.211, tokens_per_second: 8.15e+03, unpadded_tokens_per_second: 8.15e+03


Training:  52%|█████▏    | 103/200 [00:45<02:15,  1.39s/it, loss=2.89]

[2025-09-26 23:56:37,060][perf_logger][INFO] - loss: 2.89, global_step: 102, learning_rate: 2.06e-05, grad_norm: 5.46, step_time: 0.187, tokens_per_second: 9.25e+03, unpadded_tokens_per_second: 9.25e+03


Training:  52%|█████▏    | 104/200 [00:45<01:39,  1.03s/it, loss=2.78]

[2025-09-26 23:56:37,247][perf_logger][INFO] - loss: 2.78, global_step: 103, learning_rate: 2.08e-05, grad_norm: 5.16, step_time: 0.187, tokens_per_second: 9.98e+03, unpadded_tokens_per_second: 9.97e+03


Training:  52%|█████▎    | 105/200 [00:45<01:13,  1.29it/s, loss=2.85]

[2025-09-26 23:56:37,429][perf_logger][INFO] - loss: 2.85, global_step: 104, learning_rate: 2.1e-05, grad_norm: 5.9, step_time: 0.182, tokens_per_second: 1.13e+04, unpadded_tokens_per_second: 1.13e+04


Training:  53%|█████▎    | 106/200 [00:45<00:56,  1.67it/s, loss=2.95]

[2025-09-26 23:56:37,612][perf_logger][INFO] - loss: 2.95, global_step: 105, learning_rate: 2.12e-05, grad_norm: 8.66, step_time: 0.183, tokens_per_second: 7.08e+03, unpadded_tokens_per_second: 7.08e+03


Training:  54%|█████▎    | 107/200 [00:45<00:44,  2.10it/s, loss=2.86]

[2025-09-26 23:56:37,800][perf_logger][INFO] - loss: 2.86, global_step: 106, learning_rate: 2.14e-05, grad_norm: 6.99, step_time: 0.188, tokens_per_second: 6.98e+03, unpadded_tokens_per_second: 6.97e+03


Training:  54%|█████▍    | 108/200 [00:46<00:35,  2.58it/s, loss=2.98]

[2025-09-26 23:56:37,985][perf_logger][INFO] - loss: 2.98, global_step: 107, learning_rate: 2.16e-05, grad_norm: 8.32, step_time: 0.185, tokens_per_second: 3.57e+03, unpadded_tokens_per_second: 3.56e+03


Training:  55%|█████▍    | 109/200 [00:46<00:29,  3.06it/s, loss=2.88]

[2025-09-26 23:56:38,169][perf_logger][INFO] - loss: 2.88, global_step: 108, learning_rate: 2.18e-05, grad_norm: 4.97, step_time: 0.184, tokens_per_second: 4.92e+03, unpadded_tokens_per_second: 4.92e+03


Training:  55%|█████▌    | 110/200 [00:46<00:25,  3.52it/s, loss=2.68]

[2025-09-26 23:56:38,352][perf_logger][INFO] - loss: 2.68, global_step: 109, learning_rate: 2.2e-05, grad_norm: 5.55, step_time: 0.183, tokens_per_second: 5.72e+03, unpadded_tokens_per_second: 5.72e+03


Training:  56%|█████▌    | 111/200 [00:46<00:22,  3.94it/s, loss=2.82]

[2025-09-26 23:56:38,535][perf_logger][INFO] - loss: 2.82, global_step: 110, learning_rate: 2.22e-05, grad_norm: 5.66, step_time: 0.183, tokens_per_second: 4.52e+03, unpadded_tokens_per_second: 4.52e+03


Training:  56%|█████▌    | 112/200 [00:46<00:20,  4.30it/s, loss=2.76]

[2025-09-26 23:56:38,718][perf_logger][INFO] - loss: 2.76, global_step: 111, learning_rate: 2.24e-05, grad_norm: 5.36, step_time: 0.184, tokens_per_second: 6.6e+03, unpadded_tokens_per_second: 6.6e+03


Training:  56%|█████▋    | 113/200 [00:47<00:19,  4.56it/s, loss=2.84]

[2025-09-26 23:56:38,907][perf_logger][INFO] - loss: 2.84, global_step: 112, learning_rate: 2.26e-05, grad_norm: 6.49, step_time: 0.188, tokens_per_second: 9.85e+03, unpadded_tokens_per_second: 9.85e+03


Training:  57%|█████▋    | 114/200 [00:47<00:18,  4.71it/s, loss=2.8] 

[2025-09-26 23:56:39,103][perf_logger][INFO] - loss: 2.8, global_step: 113, learning_rate: 2.28e-05, grad_norm: 4.69, step_time: 0.196, tokens_per_second: 6.03e+03, unpadded_tokens_per_second: 6.03e+03


Training:  57%|█████▊    | 115/200 [00:47<00:17,  4.75it/s, loss=2.75]

[2025-09-26 23:56:39,309][perf_logger][INFO] - loss: 2.75, global_step: 114, learning_rate: 2.3e-05, grad_norm: 4.87, step_time: 0.206, tokens_per_second: 9.84e+03, unpadded_tokens_per_second: 9.83e+03


Training:  58%|█████▊    | 116/200 [00:47<00:17,  4.89it/s, loss=2.78]

[2025-09-26 23:56:39,499][perf_logger][INFO] - loss: 2.78, global_step: 115, learning_rate: 2.32e-05, grad_norm: 4.97, step_time: 0.19, tokens_per_second: 4.12e+03, unpadded_tokens_per_second: 4.12e+03


Training:  58%|█████▊    | 117/200 [00:47<00:16,  4.93it/s, loss=2.81]

[2025-09-26 23:56:39,699][perf_logger][INFO] - loss: 2.81, global_step: 116, learning_rate: 2.34e-05, grad_norm: 5.11, step_time: 0.2, tokens_per_second: 9.7e+03, unpadded_tokens_per_second: 9.69e+03


Training:  59%|█████▉    | 118/200 [00:48<00:16,  5.02it/s, loss=2.92]

[2025-09-26 23:56:39,889][perf_logger][INFO] - loss: 2.92, global_step: 117, learning_rate: 2.36e-05, grad_norm: 5.32, step_time: 0.19, tokens_per_second: 4.12e+03, unpadded_tokens_per_second: 4.12e+03


Training:  60%|█████▉    | 119/200 [00:48<00:15,  5.13it/s, loss=2.65]

[2025-09-26 23:56:40,074][perf_logger][INFO] - loss: 2.65, global_step: 118, learning_rate: 2.38e-05, grad_norm: 4.51, step_time: 0.186, tokens_per_second: 5.18e+03, unpadded_tokens_per_second: 5.18e+03


Training:  60%|██████    | 120/200 [00:48<00:15,  5.20it/s, loss=2.79]

[2025-09-26 23:56:40,261][perf_logger][INFO] - loss: 2.79, global_step: 119, learning_rate: 2.4e-05, grad_norm: 4.41, step_time: 0.187, tokens_per_second: 1.21e+04, unpadded_tokens_per_second: 1.21e+04


Training:  60%|██████    | 121/200 [00:48<00:15,  5.21it/s, loss=2.71]

[2025-09-26 23:56:40,452][perf_logger][INFO] - loss: 2.71, global_step: 120, learning_rate: 2.42e-05, grad_norm: 5.42, step_time: 0.191, tokens_per_second: 5.62e+03, unpadded_tokens_per_second: 5.62e+03


Training:  61%|██████    | 122/200 [00:49<00:22,  3.52it/s, loss=2.71]

[2025-09-26 23:56:40,951][perf_logger][INFO] - loss: 2.71, global_step: 121, learning_rate: 2.44e-05, grad_norm: 4.86, step_time: 0.499, tokens_per_second: 3.88e+03, unpadded_tokens_per_second: 3.88e+03


Training:  62%|██████▏   | 123/200 [00:49<00:19,  3.91it/s, loss=2.76]

[2025-09-26 23:56:41,140][perf_logger][INFO] - loss: 2.76, global_step: 122, learning_rate: 2.46e-05, grad_norm: 6.03, step_time: 0.19, tokens_per_second: 7.48e+03, unpadded_tokens_per_second: 7.47e+03


Training:  62%|██████▏   | 124/200 [00:49<00:17,  4.27it/s, loss=2.8] 

[2025-09-26 23:56:41,324][perf_logger][INFO] - loss: 2.8, global_step: 123, learning_rate: 2.48e-05, grad_norm: 4.53, step_time: 0.183, tokens_per_second: 5.22e+03, unpadded_tokens_per_second: 5.22e+03


Training:  62%|██████▎   | 125/200 [00:49<00:16,  4.57it/s, loss=2.82]

[2025-09-26 23:56:41,507][perf_logger][INFO] - loss: 2.82, global_step: 124, learning_rate: 2.5e-05, grad_norm: 7.11, step_time: 0.184, tokens_per_second: 4.58e+03, unpadded_tokens_per_second: 4.56e+03


Training:  63%|██████▎   | 126/200 [00:49<00:15,  4.76it/s, loss=2.72]

[2025-09-26 23:56:41,697][perf_logger][INFO] - loss: 2.72, global_step: 125, learning_rate: 2.52e-05, grad_norm: 4.69, step_time: 0.189, tokens_per_second: 8.09e+03, unpadded_tokens_per_second: 8.08e+03


Training:  64%|██████▎   | 127/200 [00:50<00:14,  4.94it/s, loss=2.77]

[2025-09-26 23:56:41,882][perf_logger][INFO] - loss: 2.77, global_step: 126, learning_rate: 2.54e-05, grad_norm: 4.36, step_time: 0.185, tokens_per_second: 7.06e+03, unpadded_tokens_per_second: 7.06e+03


Training:  64%|██████▍   | 128/200 [00:50<00:14,  5.06it/s, loss=2.73]

[2025-09-26 23:56:42,068][perf_logger][INFO] - loss: 2.73, global_step: 127, learning_rate: 2.56e-05, grad_norm: 4, step_time: 0.187, tokens_per_second: 7.16e+03, unpadded_tokens_per_second: 7.15e+03


Training:  64%|██████▍   | 129/200 [00:50<00:14,  5.04it/s, loss=2.82]

[2025-09-26 23:56:42,268][perf_logger][INFO] - loss: 2.82, global_step: 128, learning_rate: 2.58e-05, grad_norm: 4.23, step_time: 0.199, tokens_per_second: 5.84e+03, unpadded_tokens_per_second: 5.84e+03


Training:  65%|██████▌   | 130/200 [00:50<00:13,  5.18it/s, loss=2.87]

[2025-09-26 23:56:42,449][perf_logger][INFO] - loss: 2.87, global_step: 129, learning_rate: 2.6e-05, grad_norm: 5.16, step_time: 0.181, tokens_per_second: 3.63e+03, unpadded_tokens_per_second: 3.63e+03


Training:  66%|██████▌   | 131/200 [00:50<00:13,  5.19it/s, loss=2.77]

[2025-09-26 23:56:42,640][perf_logger][INFO] - loss: 2.77, global_step: 130, learning_rate: 2.62e-05, grad_norm: 5.02, step_time: 0.192, tokens_per_second: 4.16e+03, unpadded_tokens_per_second: 4.16e+03


Training:  66%|██████▌   | 132/200 [00:50<00:13,  5.20it/s, loss=2.74]

[2025-09-26 23:56:42,832][perf_logger][INFO] - loss: 2.74, global_step: 131, learning_rate: 2.64e-05, grad_norm: 4.39, step_time: 0.192, tokens_per_second: 8.47e+03, unpadded_tokens_per_second: 8.47e+03


Training:  66%|██████▋   | 133/200 [00:51<00:13,  5.11it/s, loss=2.77]

[2025-09-26 23:56:43,035][perf_logger][INFO] - loss: 2.77, global_step: 132, learning_rate: 2.66e-05, grad_norm: 5.21, step_time: 0.203, tokens_per_second: 5.64e+03, unpadded_tokens_per_second: 5.64e+03


Training:  67%|██████▋   | 134/200 [00:51<00:12,  5.22it/s, loss=2.72]

[2025-09-26 23:56:43,217][perf_logger][INFO] - loss: 2.72, global_step: 133, learning_rate: 2.68e-05, grad_norm: 4.72, step_time: 0.182, tokens_per_second: 6.59e+03, unpadded_tokens_per_second: 6.59e+03


Training:  68%|██████▊   | 135/200 [00:51<00:12,  5.30it/s, loss=2.75]

[2025-09-26 23:56:43,400][perf_logger][INFO] - loss: 2.75, global_step: 134, learning_rate: 2.7e-05, grad_norm: 5.38, step_time: 0.183, tokens_per_second: 4.49e+03, unpadded_tokens_per_second: 4.49e+03


Training:  68%|██████▊   | 136/200 [00:51<00:11,  5.34it/s, loss=2.75]

[2025-09-26 23:56:43,583][perf_logger][INFO] - loss: 2.75, global_step: 135, learning_rate: 2.72e-05, grad_norm: 4.74, step_time: 0.183, tokens_per_second: 7.37e+03, unpadded_tokens_per_second: 7.37e+03


Training:  68%|██████▊   | 137/200 [00:51<00:11,  5.37it/s, loss=2.74]

[2025-09-26 23:56:43,768][perf_logger][INFO] - loss: 2.74, global_step: 136, learning_rate: 2.74e-05, grad_norm: 4.6, step_time: 0.185, tokens_per_second: 7.17e+03, unpadded_tokens_per_second: 7.17e+03


Training:  69%|██████▉   | 138/200 [00:52<00:11,  5.34it/s, loss=2.84]

[2025-09-26 23:56:43,957][perf_logger][INFO] - loss: 2.84, global_step: 137, learning_rate: 2.76e-05, grad_norm: 4.26, step_time: 0.19, tokens_per_second: 9.35e+03, unpadded_tokens_per_second: 9.35e+03


Training:  70%|██████▉   | 139/200 [00:52<00:11,  5.38it/s, loss=2.73]

[2025-09-26 23:56:44,140][perf_logger][INFO] - loss: 2.73, global_step: 138, learning_rate: 2.78e-05, grad_norm: 6.06, step_time: 0.182, tokens_per_second: 6.24e+03, unpadded_tokens_per_second: 6.24e+03


Training:  70%|███████   | 140/200 [00:52<00:11,  5.41it/s, loss=2.65]

[2025-09-26 23:56:44,323][perf_logger][INFO] - loss: 2.65, global_step: 139, learning_rate: 2.8e-05, grad_norm: 6.48, step_time: 0.183, tokens_per_second: 8.13e+03, unpadded_tokens_per_second: 8.13e+03


Training:  70%|███████   | 141/200 [00:52<00:10,  5.41it/s, loss=2.85]

[2025-09-26 23:56:44,507][perf_logger][INFO] - loss: 2.85, global_step: 140, learning_rate: 2.82e-05, grad_norm: 4.1, step_time: 0.184, tokens_per_second: 6.65e+03, unpadded_tokens_per_second: 6.64e+03


Training:  71%|███████   | 142/200 [00:52<00:10,  5.33it/s, loss=2.87]

[2025-09-26 23:56:44,701][perf_logger][INFO] - loss: 2.87, global_step: 141, learning_rate: 2.84e-05, grad_norm: 5.51, step_time: 0.194, tokens_per_second: 6.93e+03, unpadded_tokens_per_second: 6.92e+03


Training:  72%|███████▏  | 143/200 [00:53<00:10,  5.25it/s, loss=2.78]

[2025-09-26 23:56:44,899][perf_logger][INFO] - loss: 2.78, global_step: 142, learning_rate: 2.86e-05, grad_norm: 5.02, step_time: 0.198, tokens_per_second: 9.39e+03, unpadded_tokens_per_second: 9.39e+03


Training:  72%|███████▏  | 144/200 [00:53<00:10,  5.20it/s, loss=2.8] 

[2025-09-26 23:56:45,095][perf_logger][INFO] - loss: 2.8, global_step: 143, learning_rate: 2.88e-05, grad_norm: 5.12, step_time: 0.196, tokens_per_second: 3.56e+03, unpadded_tokens_per_second: 3.56e+03


Training:  72%|███████▎  | 145/200 [00:53<00:10,  5.20it/s, loss=2.9]

[2025-09-26 23:56:45,288][perf_logger][INFO] - loss: 2.9, global_step: 144, learning_rate: 2.9e-05, grad_norm: 5.39, step_time: 0.193, tokens_per_second: 4.79e+03, unpadded_tokens_per_second: 4.79e+03


Training:  73%|███████▎  | 146/200 [00:53<00:10,  5.22it/s, loss=2.74]

[2025-09-26 23:56:45,477][perf_logger][INFO] - loss: 2.74, global_step: 145, learning_rate: 2.92e-05, grad_norm: 4.22, step_time: 0.189, tokens_per_second: 1.48e+04, unpadded_tokens_per_second: 1.48e+04


Training:  74%|███████▎  | 147/200 [00:53<00:10,  5.22it/s, loss=2.84]

[2025-09-26 23:56:45,668][perf_logger][INFO] - loss: 2.84, global_step: 146, learning_rate: 2.94e-05, grad_norm: 5.1, step_time: 0.192, tokens_per_second: 6.42e+03, unpadded_tokens_per_second: 6.41e+03


Training:  74%|███████▍  | 148/200 [00:54<00:09,  5.22it/s, loss=2.74]

[2025-09-26 23:56:45,860][perf_logger][INFO] - loss: 2.74, global_step: 147, learning_rate: 2.96e-05, grad_norm: 3.37, step_time: 0.192, tokens_per_second: 6.41e+03, unpadded_tokens_per_second: 6.4e+03


Training:  74%|███████▍  | 149/200 [00:54<00:09,  5.26it/s, loss=2.8] 

[2025-09-26 23:56:46,047][perf_logger][INFO] - loss: 2.8, global_step: 148, learning_rate: 2.98e-05, grad_norm: 3.72, step_time: 0.187, tokens_per_second: 5.07e+03, unpadded_tokens_per_second: 5.06e+03


Training:  75%|███████▌  | 150/200 [00:54<00:09,  5.25it/s, loss=2.66]

[2025-09-26 23:56:46,239][perf_logger][INFO] - loss: 2.66, global_step: 149, learning_rate: 3e-05, grad_norm: 3.61, step_time: 0.192, tokens_per_second: 6.85e+03, unpadded_tokens_per_second: 6.84e+03
[2025-09-26 23:56:56,347][checkpoint][INFO] - Saved mFSDP checkpoint to checkpoints/esm2_t33_650M_UR50D_sanity/train_mfsdp/step_150


Training:  76%|███████▌  | 151/200 [01:04<02:35,  3.17s/it, loss=2.61]

[2025-09-26 23:56:56,350][perf_logger][INFO] - loss: 2.61, global_step: 150, learning_rate: 3.02e-05, grad_norm: 4.79, step_time: 10.1, tokens_per_second: 119, unpadded_tokens_per_second: 119


Training:  76%|███████▌  | 152/200 [01:04<01:49,  2.28s/it, loss=2.73]

[2025-09-26 23:56:56,558][perf_logger][INFO] - loss: 2.73, global_step: 151, learning_rate: 3.04e-05, grad_norm: 5.46, step_time: 0.209, tokens_per_second: 4.8e+03, unpadded_tokens_per_second: 4.8e+03


Training:  76%|███████▋  | 153/200 [01:04<01:17,  1.65s/it, loss=2.83]

[2025-09-26 23:56:56,744][perf_logger][INFO] - loss: 2.83, global_step: 152, learning_rate: 3.06e-05, grad_norm: 5.69, step_time: 0.185, tokens_per_second: 5.46e+03, unpadded_tokens_per_second: 5.46e+03


Training:  77%|███████▋  | 154/200 [01:05<00:55,  1.21s/it, loss=2.73]

[2025-09-26 23:56:56,927][perf_logger][INFO] - loss: 2.73, global_step: 153, learning_rate: 3.08e-05, grad_norm: 4.8, step_time: 0.184, tokens_per_second: 7.4e+03, unpadded_tokens_per_second: 7.4e+03


Training:  78%|███████▊  | 155/200 [01:05<00:40,  1.11it/s, loss=2.82]

[2025-09-26 23:56:57,115][perf_logger][INFO] - loss: 2.82, global_step: 154, learning_rate: 3.1e-05, grad_norm: 4.31, step_time: 0.187, tokens_per_second: 8.26e+03, unpadded_tokens_per_second: 8.26e+03


Training:  78%|███████▊  | 156/200 [01:05<00:30,  1.45it/s, loss=2.73]

[2025-09-26 23:56:57,312][perf_logger][INFO] - loss: 2.73, global_step: 155, learning_rate: 3.12e-05, grad_norm: 5.1, step_time: 0.197, tokens_per_second: 4.75e+03, unpadded_tokens_per_second: 4.75e+03


Training:  78%|███████▊  | 157/200 [01:05<00:23,  1.84it/s, loss=2.69]

[2025-09-26 23:56:57,510][perf_logger][INFO] - loss: 2.69, global_step: 156, learning_rate: 3.14e-05, grad_norm: 4.1, step_time: 0.198, tokens_per_second: 1.22e+04, unpadded_tokens_per_second: 1.22e+04


Training:  79%|███████▉  | 158/200 [01:05<00:18,  2.29it/s, loss=2.84]

[2025-09-26 23:56:57,695][perf_logger][INFO] - loss: 2.84, global_step: 157, learning_rate: 3.16e-05, grad_norm: 4.84, step_time: 0.185, tokens_per_second: 7.42e+03, unpadded_tokens_per_second: 7.41e+03


Training:  80%|███████▉  | 159/200 [01:06<00:15,  2.71it/s, loss=2.77]

[2025-09-26 23:56:57,908][perf_logger][INFO] - loss: 2.77, global_step: 158, learning_rate: 3.18e-05, grad_norm: 5.5, step_time: 0.212, tokens_per_second: 2.46e+03, unpadded_tokens_per_second: 2.46e+03


Training:  80%|████████  | 160/200 [01:06<00:12,  3.17it/s, loss=2.68]

[2025-09-26 23:56:58,100][perf_logger][INFO] - loss: 2.68, global_step: 159, learning_rate: 3.2e-05, grad_norm: 4.02, step_time: 0.192, tokens_per_second: 1.09e+04, unpadded_tokens_per_second: 1.09e+04


Training:  80%|████████  | 161/200 [01:06<00:10,  3.61it/s, loss=2.67]

[2025-09-26 23:56:58,285][perf_logger][INFO] - loss: 2.67, global_step: 160, learning_rate: 3.22e-05, grad_norm: 6.81, step_time: 0.185, tokens_per_second: 4.52e+03, unpadded_tokens_per_second: 4.52e+03


Training:  81%|████████  | 162/200 [01:06<00:09,  3.99it/s, loss=2.68]

[2025-09-26 23:56:58,474][perf_logger][INFO] - loss: 2.68, global_step: 161, learning_rate: 3.24e-05, grad_norm: 5.15, step_time: 0.19, tokens_per_second: 1.01e+04, unpadded_tokens_per_second: 1e+04


Training:  82%|████████▏ | 163/200 [01:06<00:08,  4.33it/s, loss=2.65]

[2025-09-26 23:56:58,659][perf_logger][INFO] - loss: 2.65, global_step: 162, learning_rate: 3.26e-05, grad_norm: 3.17, step_time: 0.185, tokens_per_second: 5.71e+03, unpadded_tokens_per_second: 5.7e+03


Training:  82%|████████▏ | 164/200 [01:06<00:07,  4.63it/s, loss=2.71]

[2025-09-26 23:56:58,841][perf_logger][INFO] - loss: 2.71, global_step: 163, learning_rate: 3.28e-05, grad_norm: 3.44, step_time: 0.182, tokens_per_second: 6.9e+03, unpadded_tokens_per_second: 6.89e+03


Training:  82%|████████▎ | 165/200 [01:07<00:07,  4.86it/s, loss=2.72]

[2025-09-26 23:56:59,022][perf_logger][INFO] - loss: 2.72, global_step: 164, learning_rate: 3.3e-05, grad_norm: 4.56, step_time: 0.182, tokens_per_second: 7.43e+03, unpadded_tokens_per_second: 7.43e+03


Training:  83%|████████▎ | 166/200 [01:07<00:06,  5.02it/s, loss=2.78]

[2025-09-26 23:56:59,206][perf_logger][INFO] - loss: 2.78, global_step: 165, learning_rate: 3.32e-05, grad_norm: 4.63, step_time: 0.184, tokens_per_second: 7.27e+03, unpadded_tokens_per_second: 7.26e+03


Training:  84%|████████▎ | 167/200 [01:07<00:06,  5.11it/s, loss=2.71]

[2025-09-26 23:56:59,394][perf_logger][INFO] - loss: 2.71, global_step: 166, learning_rate: 3.34e-05, grad_norm: 4.55, step_time: 0.188, tokens_per_second: 5.42e+03, unpadded_tokens_per_second: 5.41e+03


Training:  84%|████████▍ | 168/200 [01:07<00:06,  5.18it/s, loss=2.69]

[2025-09-26 23:56:59,580][perf_logger][INFO] - loss: 2.69, global_step: 167, learning_rate: 3.36e-05, grad_norm: 3.72, step_time: 0.187, tokens_per_second: 9.83e+03, unpadded_tokens_per_second: 9.83e+03


Training:  84%|████████▍ | 169/200 [01:07<00:05,  5.19it/s, loss=2.75]

[2025-09-26 23:56:59,773][perf_logger][INFO] - loss: 2.75, global_step: 168, learning_rate: 3.38e-05, grad_norm: 4.81, step_time: 0.192, tokens_per_second: 5.91e+03, unpadded_tokens_per_second: 5.9e+03


Training:  85%|████████▌ | 170/200 [01:08<00:05,  5.21it/s, loss=2.77]

[2025-09-26 23:56:59,963][perf_logger][INFO] - loss: 2.77, global_step: 169, learning_rate: 3.4e-05, grad_norm: 4.52, step_time: 0.191, tokens_per_second: 8.79e+03, unpadded_tokens_per_second: 8.79e+03


Training:  86%|████████▌ | 171/200 [01:08<00:05,  5.21it/s, loss=2.73]

[2025-09-26 23:57:00,155][perf_logger][INFO] - loss: 2.73, global_step: 170, learning_rate: 3.42e-05, grad_norm: 5.23, step_time: 0.191, tokens_per_second: 8.55e+03, unpadded_tokens_per_second: 8.54e+03


Training:  86%|████████▌ | 172/200 [01:08<00:05,  5.18it/s, loss=2.72]

[2025-09-26 23:57:00,350][perf_logger][INFO] - loss: 2.72, global_step: 171, learning_rate: 3.44e-05, grad_norm: 3.1, step_time: 0.196, tokens_per_second: 9.28e+03, unpadded_tokens_per_second: 9.27e+03


Training:  86%|████████▋ | 173/200 [01:08<00:05,  5.19it/s, loss=2.63]

[2025-09-26 23:57:00,542][perf_logger][INFO] - loss: 2.63, global_step: 172, learning_rate: 3.46e-05, grad_norm: 3.14, step_time: 0.191, tokens_per_second: 1.13e+04, unpadded_tokens_per_second: 1.13e+04


Training:  87%|████████▋ | 174/200 [01:08<00:05,  5.20it/s, loss=2.48]

[2025-09-26 23:57:00,734][perf_logger][INFO] - loss: 2.48, global_step: 173, learning_rate: 3.48e-05, grad_norm: 4.59, step_time: 0.192, tokens_per_second: 6.32e+03, unpadded_tokens_per_second: 6.31e+03


Training:  88%|████████▊ | 175/200 [01:09<00:05,  4.98it/s, loss=2.75]

[2025-09-26 23:57:00,954][perf_logger][INFO] - loss: 2.75, global_step: 174, learning_rate: 3.5e-05, grad_norm: 3.45, step_time: 0.22, tokens_per_second: 5.12e+03, unpadded_tokens_per_second: 5.11e+03


Training:  88%|████████▊ | 176/200 [01:09<00:04,  5.05it/s, loss=2.77]

[2025-09-26 23:57:01,146][perf_logger][INFO] - loss: 2.77, global_step: 175, learning_rate: 3.52e-05, grad_norm: 4.06, step_time: 0.192, tokens_per_second: 8.44e+03, unpadded_tokens_per_second: 8.44e+03


Training:  88%|████████▊ | 177/200 [01:09<00:04,  5.04it/s, loss=2.83]

[2025-09-26 23:57:01,345][perf_logger][INFO] - loss: 2.83, global_step: 176, learning_rate: 3.54e-05, grad_norm: 4.08, step_time: 0.199, tokens_per_second: 1.33e+04, unpadded_tokens_per_second: 1.33e+04


Training:  89%|████████▉ | 178/200 [01:09<00:04,  5.09it/s, loss=2.73]

[2025-09-26 23:57:01,538][perf_logger][INFO] - loss: 2.73, global_step: 177, learning_rate: 3.56e-05, grad_norm: 4.35, step_time: 0.192, tokens_per_second: 1.34e+04, unpadded_tokens_per_second: 1.34e+04


Training:  90%|████████▉ | 179/200 [01:10<00:05,  3.52it/s, loss=2.82]

[2025-09-26 23:57:02,025][perf_logger][INFO] - loss: 2.82, global_step: 178, learning_rate: 3.58e-05, grad_norm: 5.09, step_time: 0.488, tokens_per_second: 2.49e+03, unpadded_tokens_per_second: 2.48e+03


Training:  90%|█████████ | 180/200 [01:10<00:05,  3.89it/s, loss=2.57]

[2025-09-26 23:57:02,220][perf_logger][INFO] - loss: 2.57, global_step: 179, learning_rate: 3.6e-05, grad_norm: 5.48, step_time: 0.195, tokens_per_second: 4.94e+03, unpadded_tokens_per_second: 4.94e+03


Training:  90%|█████████ | 181/200 [01:10<00:04,  4.25it/s, loss=2.76]

[2025-09-26 23:57:02,404][perf_logger][INFO] - loss: 2.76, global_step: 180, learning_rate: 3.62e-05, grad_norm: 3.73, step_time: 0.184, tokens_per_second: 7.04e+03, unpadded_tokens_per_second: 7.04e+03


Training:  91%|█████████ | 182/200 [01:10<00:03,  4.51it/s, loss=2.55]

[2025-09-26 23:57:02,594][perf_logger][INFO] - loss: 2.55, global_step: 181, learning_rate: 3.64e-05, grad_norm: 3.3, step_time: 0.19, tokens_per_second: 5.84e+03, unpadded_tokens_per_second: 5.84e+03


Training:  92%|█████████▏| 183/200 [01:10<00:03,  4.68it/s, loss=2.76]

[2025-09-26 23:57:02,789][perf_logger][INFO] - loss: 2.76, global_step: 182, learning_rate: 3.66e-05, grad_norm: 4.48, step_time: 0.195, tokens_per_second: 6.81e+03, unpadded_tokens_per_second: 6.81e+03


Training:  92%|█████████▏| 184/200 [01:11<00:03,  4.86it/s, loss=2.72]

[2025-09-26 23:57:02,976][perf_logger][INFO] - loss: 2.72, global_step: 183, learning_rate: 3.68e-05, grad_norm: 4.05, step_time: 0.187, tokens_per_second: 7.38e+03, unpadded_tokens_per_second: 7.38e+03


Training:  92%|█████████▎| 185/200 [01:11<00:02,  5.02it/s, loss=2.65]

[2025-09-26 23:57:03,160][perf_logger][INFO] - loss: 2.65, global_step: 184, learning_rate: 3.7e-05, grad_norm: 4.01, step_time: 0.184, tokens_per_second: 7.23e+03, unpadded_tokens_per_second: 7.22e+03


Training:  93%|█████████▎| 186/200 [01:11<00:02,  5.07it/s, loss=2.72]

[2025-09-26 23:57:03,353][perf_logger][INFO] - loss: 2.72, global_step: 185, learning_rate: 3.72e-05, grad_norm: 4.48, step_time: 0.193, tokens_per_second: 3.91e+03, unpadded_tokens_per_second: 3.91e+03


Training:  94%|█████████▎| 187/200 [01:11<00:02,  5.11it/s, loss=2.66]

[2025-09-26 23:57:03,545][perf_logger][INFO] - loss: 2.66, global_step: 186, learning_rate: 3.74e-05, grad_norm: 3.82, step_time: 0.192, tokens_per_second: 6.18e+03, unpadded_tokens_per_second: 6.18e+03


Training:  94%|█████████▍| 188/200 [01:11<00:02,  5.15it/s, loss=2.78]

[2025-09-26 23:57:03,736][perf_logger][INFO] - loss: 2.78, global_step: 187, learning_rate: 3.76e-05, grad_norm: 5.84, step_time: 0.19, tokens_per_second: 6.84e+03, unpadded_tokens_per_second: 6.83e+03


Training:  94%|█████████▍| 189/200 [01:12<00:02,  5.12it/s, loss=2.71]

[2025-09-26 23:57:03,933][perf_logger][INFO] - loss: 2.71, global_step: 188, learning_rate: 3.78e-05, grad_norm: 6.78, step_time: 0.198, tokens_per_second: 3.44e+03, unpadded_tokens_per_second: 3.44e+03


Training:  95%|█████████▌| 190/200 [01:12<00:01,  5.04it/s, loss=2.82]

[2025-09-26 23:57:04,140][perf_logger][INFO] - loss: 2.82, global_step: 189, learning_rate: 3.8e-05, grad_norm: 6.26, step_time: 0.206, tokens_per_second: 5.37e+03, unpadded_tokens_per_second: 5.37e+03


Training:  96%|█████████▌| 191/200 [01:12<00:01,  5.11it/s, loss=2.8] 

[2025-09-26 23:57:04,328][perf_logger][INFO] - loss: 2.8, global_step: 190, learning_rate: 3.82e-05, grad_norm: 5.38, step_time: 0.189, tokens_per_second: 5.8e+03, unpadded_tokens_per_second: 5.8e+03


Training:  96%|█████████▌| 192/200 [01:12<00:01,  5.20it/s, loss=2.91]

[2025-09-26 23:57:04,513][perf_logger][INFO] - loss: 2.91, global_step: 191, learning_rate: 3.84e-05, grad_norm: 5.87, step_time: 0.185, tokens_per_second: 3.65e+03, unpadded_tokens_per_second: 3.65e+03


Training:  96%|█████████▋| 193/200 [01:12<00:01,  5.28it/s, loss=2.69]

[2025-09-26 23:57:04,695][perf_logger][INFO] - loss: 2.69, global_step: 192, learning_rate: 3.86e-05, grad_norm: 4.75, step_time: 0.182, tokens_per_second: 5.5e+03, unpadded_tokens_per_second: 5.5e+03


Training:  97%|█████████▋| 194/200 [01:13<00:01,  5.32it/s, loss=2.73]

[2025-09-26 23:57:04,880][perf_logger][INFO] - loss: 2.73, global_step: 193, learning_rate: 3.88e-05, grad_norm: 4.58, step_time: 0.185, tokens_per_second: 7.15e+03, unpadded_tokens_per_second: 7.15e+03


Training:  98%|█████████▊| 195/200 [01:13<00:00,  5.35it/s, loss=2.8] 

[2025-09-26 23:57:05,064][perf_logger][INFO] - loss: 2.8, global_step: 194, learning_rate: 3.9e-05, grad_norm: 4.4, step_time: 0.184, tokens_per_second: 8.23e+03, unpadded_tokens_per_second: 8.22e+03


Training:  98%|█████████▊| 196/200 [01:13<00:00,  5.35it/s, loss=2.73]

[2025-09-26 23:57:05,251][perf_logger][INFO] - loss: 2.73, global_step: 195, learning_rate: 3.92e-05, grad_norm: 4.58, step_time: 0.187, tokens_per_second: 7.1e+03, unpadded_tokens_per_second: 7.1e+03


Training:  98%|█████████▊| 197/200 [01:13<00:00,  5.33it/s, loss=2.61]

[2025-09-26 23:57:05,441][perf_logger][INFO] - loss: 2.61, global_step: 196, learning_rate: 3.94e-05, grad_norm: 3.57, step_time: 0.19, tokens_per_second: 8.25e+03, unpadded_tokens_per_second: 8.24e+03


Training:  99%|█████████▉| 198/200 [01:13<00:00,  5.33it/s, loss=2.74]

[2025-09-26 23:57:05,629][perf_logger][INFO] - loss: 2.74, global_step: 197, learning_rate: 3.96e-05, grad_norm: 4.29, step_time: 0.188, tokens_per_second: 7.41e+03, unpadded_tokens_per_second: 7.4e+03


Training: 100%|█████████▉| 199/200 [01:13<00:00,  5.15it/s, loss=2.5] 

[2025-09-26 23:57:05,838][perf_logger][INFO] - loss: 2.5, global_step: 198, learning_rate: 3.98e-05, grad_norm: 6.17, step_time: 0.21, tokens_per_second: 4.58e+03, unpadded_tokens_per_second: 4.58e+03


Training: 100%|██████████| 200/200 [01:14<00:00,  5.19it/s, loss=2.69]

[2025-09-26 23:57:06,027][perf_logger][INFO] - loss: 2.69, global_step: 199, learning_rate: 4e-05, grad_norm: 4.65, step_time: 0.189, tokens_per_second: 7.21e+03, unpadded_tokens_per_second: 7.21e+03
[2025-09-26 23:57:06,029][checkpoint][INFO] - Starting mFSDP parameter gathering...
[2025-09-26 23:57:06,038][checkpoint][INFO] - mFSDP parameter gathering completed
[2025-09-26 23:57:10,645][checkpoint][INFO] - Saved final mFSDP model to checkpoints/esm2_t33_650M_UR50D_sanity/train_mfsdp/final_model


wandb: 
wandb: Run history:
wandb:                train/global_step ▁▁▁▁▂▂▂▂▂▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇█████
wandb:                  train/grad_norm █▆█▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▃▁▂▁▁▁▂▂▂▁▁▁▁▁▁▁▂▁▁▂
wandb:              train/learning_rate ▁▁▁▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇▇▇███
wandb:                       train/loss █▆▅▂▃▆▃▃▄▃▄▄▃▃▄▄▄▃▃▂▂▃▃▃▂▃▃▃▁▂▂▂▂▃▃▁▂▂▂
wandb:                  train/step_time ▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:          train/tokens_per_second ▃▅▄▅▆▅▄▄▃▅▄▁▆▄▃▄▆▅▅▅▇▅▂▂▂▆▃▅▇▄▅▅▃▆█▁▅▅▅▆
wandb: train/unpadded_tokens_per_second ▂▂▅▅▅▆▄▅▆▄▄▄▁▅▂▅▅▆▃▅▃▁▅▆▃▆▄▃▆▄▄▆▅▅▅▆▅█▄▃
wandb: 
wandb: Run summary:
wandb:                train/global_step 199
wandb:                  train/grad_norm 4.65146
wandb:              train/learning_rate 4e-05
wandb:                       train/loss 2.68555
wandb:                  train/step_time 0.18905
wandb:          train/tokens_per_second 7209.77588
wandb: train/unpadded_tokens_per_second 7209.77588
wandb: 
wandb: You can sync this

[2025-09-26 23:57:10,659][perf_logger][INFO] - RUN CONFIG:
{'adamw_kwargs': {'betas': [0.9, 0.98],
                  'eps': 1e-08,
                  'fused': True,
                  'lr': 0.0004,
                  'weight_decay': 0.01},
 'checkpoint': {'ckpt_dir': 'checkpoints/esm2_t33_650M_UR50D_sanity',
                'resume_from_checkpoint': True,
                'save_checkpoints': True,
                'save_every_n_steps': 50,
                'save_final_model': True,
                'use_distributed_checkpoint_fsdp2': True},
 'dataset': {'load_dataset_kwargs': {'data_files': 'train.parquet',
                                     'path': 'parquet',
                                     'split': 'train',
                                     'streaming': True},
             'max_seq_length': 1024,
             'micro_batch_size': 4,
             'num_workers': 1,
             'sequence_packing_pad_to_multiple_of': None,
             'tokenizer_name': 'nvidia/esm2_t33_650M_UR50D',
 

# TODO: analysis
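
One possible starting point for this analysis, assuming the training output above has been captured as text, is to parse the `perf_logger` lines into per-step records and summarize them. The sketch below is illustrative only: `parse_perf_line` and `summarize` are hypothetical helpers, not part of BioNeMo Recipes, and the three sample lines are copied from the final steps of the log above.

```python
import re
from statistics import mean, median

# Matches "key: value" pairs in a perf_logger line, e.g. "step_time: 0.192".
PAIR_RE = re.compile(r"(\w+): ([-+0-9.eE]+)")


def parse_perf_line(line):
    """Return {metric: float} for a per-step perf_logger line, else None."""
    if "[perf_logger][INFO] - loss:" not in line:
        return None
    payload = line.split(" - ", 1)[1]
    return {key: float(value) for key, value in PAIR_RE.findall(payload)}


def summarize(records):
    """Summarize step time and throughput over a list of parsed records."""
    return {
        "steps": len(records),
        "mean_step_time": mean(r["step_time"] for r in records),
        "median_tokens_per_second": median(
            r["tokens_per_second"] for r in records
        ),
        "median_unpadded_tokens_per_second": median(
            r["unpadded_tokens_per_second"] for r in records
        ),
    }


# Sample input: the last three per-step lines from the run above.
sample_log = """\
[2025-09-26 23:57:05,629][perf_logger][INFO] - loss: 2.74, global_step: 197, learning_rate: 3.96e-05, grad_norm: 4.29, step_time: 0.188, tokens_per_second: 7.41e+03, unpadded_tokens_per_second: 7.4e+03
[2025-09-26 23:57:05,838][perf_logger][INFO] - loss: 2.5, global_step: 198, learning_rate: 3.98e-05, grad_norm: 6.17, step_time: 0.21, tokens_per_second: 4.58e+03, unpadded_tokens_per_second: 4.58e+03
[2025-09-26 23:57:06,027][perf_logger][INFO] - loss: 2.69, global_step: 199, learning_rate: 4e-05, grad_norm: 4.65, step_time: 0.189, tokens_per_second: 7.21e+03, unpadded_tokens_per_second: 7.21e+03
"""

records = [
    rec
    for rec in (parse_perf_line(line) for line in sample_log.splitlines())
    if rec is not None
]
summary = summarize(records)
print(summary)
```

Running the same helpers over the full logs of two runs, one with and one without sequence packing, would allow a like-for-like comparison of `step_time` and `unpadded_tokens_per_second`. Medians are used for throughput because individual steps in the log vary widely.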