optimize() with num_workers > 1 leads to deletion issues #245

@awaelchli

Description

🐛 Bug

In the LitData tests, we only ever call optimize() with num_workers=1. In PR #237 I found that if optimize() is called with more workers, we hit what appears to be a race condition that causes some chunks to be deleted prematurely, after which streaming fails.
#237 (comment)

This happens in this test:
https://github.com/Lightning-AI/litdata/blob/c58b67346a3be22de26679fb6788f38894c47cd1/tests/streaming/test_dataset.py#L826
(see the TODO comments).

The test fails with

__________________ test_dataset_resume_on_future_chunks[True] __________________

shuffle = True
tmpdir = local('/tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0')
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f6a4124f460>

    @pytest.mark.skipif(sys.platform == "win32", reason="Not tested on windows and MacOs")
    @mock.patch.dict(os.environ, {}, clear=True)
    @pytest.mark.timeout(60)
    @pytest.mark.parametrize("shuffle", [True, False])
    def test_dataset_resume_on_future_chunks(shuffle, tmpdir, monkeypatch):
        """This test is constructed to test resuming from a chunk past the first chunk, when subsequent chunks don't have
        the same size."""
        s3_cache_dir = str(tmpdir / "s3cache")
        optimize_data_cache_dir = str(tmpdir / "optimize_data_cache")
        optimize_cache_dir = str(tmpdir / "optimize_cache")
        data_dir = str(tmpdir / "optimized")
        monkeypatch.setenv("DATA_OPTIMIZER_DATA_CACHE_FOLDER", optimize_data_cache_dir)
        monkeypatch.setenv("DATA_OPTIMIZER_CACHE_FOLDER", optimize_cache_dir)
    
>       optimize(
            fn=_simple_preprocess,
            inputs=list(range(8)),
            output_dir=data_dir,
            chunk_size=190,
            num_workers=4,
            num_uploaders=1,
copying /tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimize_cache/chunk-3-1.bin to /tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimized/chunk-3-1.bin
putting /tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimize_cache/chunk-3-1.bin on the remove queue
Worker 1 is done.
Worker 2 is done.
Worker 3 is done.
Worker 0 is done.
Workers are finished.
----------------------------- Captured stderr call -----------------------------


Progress:   0%|          | 0/8 [00:00<?, ?it/s]Process Process-85:1:
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/runner/work/litdata/litdata/src/litdata/processing/data_processor.py", line 259, in _upload_fn
    shutil.copy(local_filepath, output_filepath)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/shutil.py", line 427, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimize_cache/chunk-0-0.bin'

Progress: 100%|██████████| 8/8 [00:00<00:00, 122.77it/s]
=========================== short test summary info ============================
FAILED tests/streaming/test_dataset.py::test_dataset_resume_on_future_chunks[True] - RuntimeError: All the chunks should have been deleted. Found ['chunk-0-1.bin']
====== 1 failed, 191 passed, 8 skipped, 11 warnings in 247.94s (0:04:07) =======

when setting optimize(num_workers=4). This needs to be investigated. However, it has not been possible to reproduce locally so far (only observed in CI)!
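For illustration, here is a minimal, self-contained sketch of the suspected failure mode: a chunk file is put on the remove queue (and deleted) before the uploader has finished copying it, so the plain shutil.copy in the uploader raises the FileNotFoundError seen in the traceback above. This is not litdata's actual implementation; the names `_upload` and `_remove` are hypothetical, and threads plus a plain queue stand in for litdata's worker processes.

```python
import os
import queue
import shutil
import tempfile
import threading

def _upload(local_filepath: str, output_dir: str) -> str:
    # Mirrors the failing call in _upload_fn: a plain shutil.copy of the chunk.
    dst = os.path.join(output_dir, os.path.basename(local_filepath))
    return shutil.copy(local_filepath, dst)

def _remove(remove_queue: "queue.Queue") -> None:
    # Deletes every path placed on the remove queue; None is the stop signal.
    while True:
        path = remove_queue.get()
        if path is None:
            return
        os.remove(path)

cache_dir = tempfile.mkdtemp()
output_dir = tempfile.mkdtemp()
chunk = os.path.join(cache_dir, "chunk-0-0.bin")
with open(chunk, "wb") as f:
    f.write(b"data")

remove_queue: "queue.Queue" = queue.Queue()
remover = threading.Thread(target=_remove, args=(remove_queue,))
remover.start()

# Wrong ordering: the chunk lands on the remove queue before it is uploaded.
remove_queue.put(chunk)
remove_queue.put(None)
remover.join()  # the chunk is now gone from the cache dir

try:
    _upload(chunk, output_dir)
    outcome = "ok"
except FileNotFoundError:
    outcome = "FileNotFoundError"

print(outcome)  # the deleted chunk makes the copy fail
```

With num_workers=1 the ordering is effectively serialized, which would explain why the existing tests never trigger this; with multiple workers the delete can win the race.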


Labels: bug, ci / tests, help wanted
