Resuming Training with New Dataset Fails #263

Closed · schopra8 opened this issue Jul 24, 2024 · 6 comments · Fixed by #318
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

🐛 Bug

If you train a model with a particular dataset for N epochs and then want to continue training with a new dataset, LitData throws an exception.

To Reproduce

Steps to reproduce the behavior:

  1. Train a model with dataset-1
  2. Cancel training after the first checkpoint is saved
  3. Resume training with `trainer.fit(model, datamodule=datamodule, ckpt_path=ckpt_path)`, where the datamodule now points to dataset-2.
  4. Observe the following error:

```
[rank7]: Original Traceback (most recent call last):
[rank7]:   File "/home/sahil/.cache/pypoetry/virtualenvs/auw7Hy33-py3.10/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 252, in _worker_loop
[rank7]:     fetcher = _DatasetKind.create_fetcher(dataset_kind, dataset, auto_collation, collate_fn, drop_last)
[rank7]:   File "/home/sahil/.cache/pypoetry/virtualenvs/auw7Hy33-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 79, in create_fetcher
[rank7]:     return _utils.fetch._IterableDatasetFetcher(dataset, auto_collation, collate_fn, drop_last)
[rank7]:   File "/home/sahil/.cache/pypoetry/virtualenvs/auw7Hy33-py3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 21, in __init__
[rank7]:     self.dataset_iter = iter(dataset)
[rank7]:   File "/home/sahil/.cache/pypoetry/virtualenvs/auw7Hy33-py3.10/lib/python3.10/site-packages/litdata/streaming/combined.py", line 155, in __iter__
[rank7]:     self._iterator = _CombinedDatasetIterator(
[rank7]:   File "/home/sahil/.cache/pypoetry/virtualenvs/auw7Hy33-py3.10/lib/python3.10/site-packages/litdata/streaming/combined.py", line 203, in __init__
[rank7]:     self._dataset_iters = [iter(dataset) for dataset in datasets]
[rank7]:   File "/home/sahil/.cache/pypoetry/virtualenvs/auw7Hy33-py3.10/lib/python3.10/site-packages/litdata/streaming/combined.py", line 203, in <listcomp>
[rank7]:     self._dataset_iters = [iter(dataset) for dataset in datasets]
[rank7]:   File "/home/sahil/.cache/pypoetry/virtualenvs/auw7Hy33-py3.10/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 219, in __iter__
[rank7]:     self._validate_state_dict()
[rank7]:   File "/home/sahil/.cache/pypoetry/virtualenvs/auw7Hy33-py3.10/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 447, in _validate_state_dict
[rank7]:     raise ValueError(
[rank7]: ValueError: The provided input_dir URL state doesn't match the current one. Found s3://dataset-2 instead of s3://dataset-1.
```

Code sample
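
No sample was attached; the following is a minimal sketch of the repro. `MyLightningModule`, the batch size, worker count, and the checkpoint path are placeholders, not code from the actual run:

```python
import lightning as L
from litdata import StreamingDataset, StreamingDataLoader


class StreamingDataModule(L.LightningDataModule):
    """Hypothetical datamodule wrapping a litdata StreamingDataset."""

    def __init__(self, input_dir: str):
        super().__init__()
        self.input_dir = input_dir

    def train_dataloader(self):
        dataset = StreamingDataset(input_dir=self.input_dir)
        return StreamingDataLoader(dataset, batch_size=32, num_workers=8)


model = MyLightningModule()  # placeholder for the actual model
trainer = L.Trainer(max_epochs=10)

# Steps 1-2: train on dataset-1; cancel after the first checkpoint is saved.
trainer.fit(model, datamodule=StreamingDataModule("s3://dataset-1"))

# Step 3: resume from that checkpoint with a datamodule pointing at dataset-2.
# Step 4: a dataloader worker raises:
#   ValueError: The provided input_dir URL state doesn't match the current one.
trainer.fit(
    model,
    datamodule=StreamingDataModule("s3://dataset-2"),
    ckpt_path="lightning_logs/version_0/checkpoints/last.ckpt",
)
```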

Expected behavior

Training should resume with the optimizer states, model weights, etc. restored, but with a net-new dataset.

Environment

  • PyTorch Version (e.g., 1.0): 2.3.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.10
  • CUDA/cuDNN version: 12.1
  • GPU models and configuration: 2x8H100
  • Any other relevant information:

Additional context

tchaton (Collaborator) commented Jul 24, 2024

Hey @schopra8,

Did you consume exactly N epochs of the first dataset?

As a temporary hack, did you try dropping the dataloader state from the checkpoint before reloading it? This might unblock you.
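
In case it helps, here's a minimal sketch of that hack, assuming the loader state lives under the checkpoint's "loops" key (true for recent Lightning versions, but worth verifying against `ckpt.keys()` on your checkpoint):

```python
import torch

# Load the Lightning checkpoint and drop the fit-loop state, which carries
# the StreamingDataLoader state (including the old input_dir).
ckpt = torch.load("last.ckpt", map_location="cpu")
ckpt.pop("loops", None)  # key name is an assumption; inspect ckpt.keys() first
torch.save(ckpt, "last-no-loader-state.ckpt")

# Then resume from the stripped checkpoint: model weights and optimizer
# states are restored, while the new dataset starts from scratch.
# trainer.fit(model, datamodule=datamodule, ckpt_path="last-no-loader-state.ckpt")
```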

We should allow dropping the state entirely once the epoch has terminated, so training can be resumed with different parameters. Maybe we can do it here: https://github.com/Lightning-AI/litdata/blob/main/src/litdata/streaming/dataloader.py#L468.

Would you be interested in trying to contribute a fix?

schopra8 (Author) commented Jul 24, 2024

In reality, I'd use this after exactly N epochs of the first dataset had been consumed. In this dummy example, I manually killed training on the first dataset after K steps (less than one epoch) and then tried changing datasets. Would that require a different change to the codebase?

I'll try the hack today -- but definitely down to contribute a fix.

tchaton (Collaborator) commented Jul 24, 2024

Yes, we need to hack around and find the right fix. Feel free to open a draft PR and we can help you land a reliable fix.

tchaton (Collaborator) commented Jul 29, 2024

Hey @schopra8. Any updates?

schopra8 (Author) commented

@tchaton No updates on my end yet -- got busy with another modeling task. I'll be taking a crack at it in the next few days.

bhimrazy (Collaborator) commented

Hi @schopra8,

We've released an update with bug fixes, including one for this issue. Currently, the fix only covers `StreamingDataset`; we'll be adding fixes for `CombinedStreamingDataset` soon as well.

Please feel free to try it out and let us know how it goes. Thanks! 😊
