Resuming Training with New Dataset Fails #263
Comments
Hey @schopra8, did you consume exactly N epochs of the first dataset? As a temporary hack, did you try dropping the dataloader state from the checkpoint before reloading it? This might unblock you. We should enable dropping the state entirely when the epoch is terminated, so that reloading with different parameters works. Maybe we can do it there: https://github.com/Lightning-AI/litdata/blob/main/src/litdata/streaming/dataloader.py#L468. Would you be interested in trying to contribute a fix?
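For anyone trying the temporary hack, here is a minimal sketch of stripping the dataloader state out of a checkpoint dictionary before resuming. The checkpoint layout and the `combined_loader` key name below are assumptions for illustration, not confirmed in this thread; inspect your own checkpoint to find where the dataloader state is actually stored. The helper `drop_keys` is hypothetical and not part of LitData or Lightning.

```python
from typing import Any, Iterable


def drop_keys(obj: Any, keys_to_drop: Iterable[str]) -> Any:
    """Return a copy of a nested dict/list structure without the given keys."""
    drop = set(keys_to_drop)
    if isinstance(obj, dict):
        # Skip unwanted keys at this level and recurse into the remaining values.
        return {k: drop_keys(v, drop) for k, v in obj.items() if k not in drop}
    if isinstance(obj, list):
        return [drop_keys(v, drop) for v in obj]
    # Leaves (tensors, numbers, strings, ...) are returned unchanged.
    return obj


# Hypothetical checkpoint layout, assumed for illustration only.
checkpoint = {
    "epoch": 3,
    "state_dict": {"layer.weight": [0.1, 0.2]},
    "optimizer_states": [{"lr": 1e-3}],
    "loops": {
        "fit_loop": {
            "state_dict": {
                # Assumed location of the saved dataloader state.
                "combined_loader": [{"dataset": {"input_dir_path": "dataset-1"}}],
            }
        }
    },
}

# Model weights and optimizer states are kept; only the dataloader state is removed.
cleaned = drop_keys(checkpoint, {"combined_loader"})
```

With a real checkpoint you would load the file with `torch.load`, apply the same function, write the result back with `torch.save`, and pass the new path as `ckpt_path`. Since the helper removes the key wherever it appears in the structure, make sure the name does not collide with anything you want to keep.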
In reality, I'd use this after exactly N epochs of the first dataset had been consumed. In this dummy example, I manually killed training on the first dataset after K steps (less than 1 epoch) and then tried changing the datasets. Would that require a different change to the codebase? I'll try the hack today, but I'm definitely down to contribute a fix.
Yes, we need to hack around and find the right fix. Feel free to make a draft PR and we can help you land a reliable fix.
Hey @schopra8. Any updates?
@tchaton No updates on my end yet -- got busy with another modeling task. I'll be taking a crack at it in the next few days.
Hi @schopra8, We’ve released an update with bug fixes, including the one related to this issue. Currently, it only supports StreamingDataset. We will be adding fixes for CombinedStreamingDataset soon as well. Please feel free to try it out and let us know how it goes. Thanks! 😊
🐛 Bug
If you train a model with a particular dataset for N epochs and then want to continue training with a new dataset, LitData throws an exception.
To Reproduce
Steps to reproduce the behavior:
1. Train a model on `dataset-1` for N epochs.
2. Resume training with `trainer.fit(model, datamodule=datamodule, ckpt_path=ckpt_path)`, where the datamodule now points to `dataset-2`.
Code sample
Expected behavior
Training to start with the optimizer states, model weights, etc. but with a net new dataset.
Environment
- Install method (`conda`, `pip`, source): pip
Additional context