Fix resuming checkpoint path bug #661

Merged on Oct 26, 2023 (6 commits)

@larrylawl larrylawl commented Oct 20, 2023

Minor fix to the checkpoint path used when resuming. Currently, resuming takes the last path of the sorted list. However, paths are compared lexicographically, so the sorted list is not necessarily in ascending iter_num order. For example,

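from pathlib import Path

out_dir = Path("out") / "openwebtext"
# One checkpoint every 40,000 iterations, from iter-0 up to iter-1080000
paths = [out_dir / f"iter-{iter_num}-ckpt.pth" for iter_num in range(0, 1100000, 40000)]
# Paths are ordered lexicographically, not by iteration number
print(sorted(paths))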

prints the following. Note that the list isn't in ascending iteration order: iter-1080000 should be last, but iter-960000 is. Resuming would therefore load iter-960000 instead of the latest checkpoint in this case.

[PosixPath('out/openwebtext/iter-0-ckpt.pth'), PosixPath('out/openwebtext/iter-1000000-ckpt.pth'), PosixPath('out/openwebtext/iter-1040000-ckpt.pth'), PosixPath('out/openwebtext/iter-1080000-ckpt.pth'), PosixPath('out/openwebtext/iter-120000-ckpt.pth'), PosixPath('out/openwebtext/iter-160000-ckpt.pth'), PosixPath('out/openwebtext/iter-200000-ckpt.pth'), PosixPath('out/openwebtext/iter-240000-ckpt.pth'), PosixPath('out/openwebtext/iter-280000-ckpt.pth'), PosixPath('out/openwebtext/iter-320000-ckpt.pth'), PosixPath('out/openwebtext/iter-360000-ckpt.pth'), PosixPath('out/openwebtext/iter-40000-ckpt.pth'), PosixPath('out/openwebtext/iter-400000-ckpt.pth'), PosixPath('out/openwebtext/iter-440000-ckpt.pth'), PosixPath('out/openwebtext/iter-480000-ckpt.pth'), PosixPath('out/openwebtext/iter-520000-ckpt.pth'), PosixPath('out/openwebtext/iter-560000-ckpt.pth'), PosixPath('out/openwebtext/iter-600000-ckpt.pth'), PosixPath('out/openwebtext/iter-640000-ckpt.pth'), PosixPath('out/openwebtext/iter-680000-ckpt.pth'), PosixPath('out/openwebtext/iter-720000-ckpt.pth'), PosixPath('out/openwebtext/iter-760000-ckpt.pth'), PosixPath('out/openwebtext/iter-80000-ckpt.pth'), PosixPath('out/openwebtext/iter-800000-ckpt.pth'), PosixPath('out/openwebtext/iter-840000-ckpt.pth'), PosixPath('out/openwebtext/iter-880000-ckpt.pth'), PosixPath('out/openwebtext/iter-920000-ckpt.pth'), PosixPath('out/openwebtext/iter-960000-ckpt.pth')]
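
For illustration, one way to select the latest checkpoint is to compare the iteration number parsed from each file name instead of comparing the paths themselves. This is a minimal sketch and not necessarily the exact change made in this PR; the helper name `iter_num_of` is hypothetical, and it assumes every file name follows the `iter-<number>-ckpt.pth` pattern used in the example above.

```python
from pathlib import Path


def iter_num_of(path: Path) -> int:
    # Hypothetical helper: assumes names like "iter-<number>-ckpt.pth".
    # Splitting "iter-1080000-ckpt.pth" on "-" gives ["iter", "1080000", "ckpt.pth"].
    return int(path.name.split("-")[1])


out_dir = Path("out") / "openwebtext"
paths = [out_dir / f"iter-{n}-ckpt.pth" for n in range(0, 1100000, 40000)]

# Lexicographic ordering puts iter-960000 last ...
lexicographic_last = sorted(paths)[-1]
# ... while comparing the parsed integer picks the highest iteration.
latest = max(paths, key=iter_num_of)

print(lexicographic_last.name)  # iter-960000-ckpt.pth
print(latest.name)              # iter-1080000-ckpt.pth
```

Using `max` with a numeric key avoids sorting the whole list; `sorted(paths, key=iter_num_of)` would work equally well if the full ordering is needed.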

Review threads (resolved, outdated): pretrain/openwebtext.py (2), pretrain/redpajama.py (1)
larrylawl and others added 3 commits October 26, 2023 10:34
Co-authored-by: Andrei-Aksionov <58434077+Andrei-Aksionov@users.noreply.github.com>
Co-authored-by: Andrei-Aksionov <58434077+Andrei-Aksionov@users.noreply.github.com>
@larrylawl commented

Hi, can we merge this PR in? Thank you!

@carmocca carmocca merged commit 09c8281 into Lightning-AI:main Oct 26, 2023
5 checks passed

4 participants