Large number of chunks causes OSError: [Errno 24] Too many open files #366

@fdalvi

Description

🐛 Bug

If an optimized dataset has too many chunks (this can be reproduced with a small chunk size, or with a reasonable chunk size and a lot of data), reading it (e.g. with a StreamingDataLoader) eventually fails with a Too many open files error. This happens because each chunk is mmap'ed as it is loaded and the mmap handle is never released, so the number of open files grows with the number of chunks read:

https://github.com/Lightning-AI/litdata/blob/7efd76197e0191c0a07c9f0583ce9aec4b80a49e/src/litdata/streaming/item_loader.py#L320-L322

One could of course raise the limit with ulimit, but at some point the number of chunks can simply be too large for any reasonable limit.
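For reference, the soft limit can also be inspected and raised from within the training process via the standard resource module (a small Linux-oriented sketch, not part of the repro below), though this only postpones the failure:

import resource

# Current soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit; no root privileges needed.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))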

This could be resolved by keeping the number of mmap'ed files under a certain limit, for example by occasionally evicting entries from the dict linked above, either randomly or with a simple policy such as FIFO or LRU.

I wanted to check whether this fix sounds reasonable; I am happy to send a PR for it if it makes sense.
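As a rough illustration, a capped FIFO cache along these lines could work (the class name, default cap, and structure below are only assumptions for the sketch, not litdata's actual internals):

from collections import OrderedDict
from mmap import ACCESS_READ, mmap

class CappedMmapCache:
    """Keeps at most `max_open` chunk files mmap'ed at once (FIFO eviction).

    Illustrative sketch only, not litdata's real item loader.
    """

    def __init__(self, max_open: int = 128) -> None:
        self.max_open = max_open
        self._mmaps: OrderedDict[str, mmap] = OrderedDict()
        self._files = {}  # open file objects backing each mmap

    def get(self, chunk_path: str) -> mmap:
        if chunk_path in self._mmaps:
            return self._mmaps[chunk_path]
        # Once the cap is reached, evict the oldest chunk and release its handles.
        if len(self._mmaps) >= self.max_open:
            old_path, old_mmap = self._mmaps.popitem(last=False)
            old_mmap.close()
            self._files.pop(old_path).close()
        f = open(chunk_path, "rb")
        m = mmap(f.fileno(), 0, access=ACCESS_READ)
        self._files[chunk_path] = f
        self._mmaps[chunk_path] = m
        return m

An LRU variant (moving entries to the end of the OrderedDict on access) would keep frequently used chunks mapped while still bounding the number of open files.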

To Reproduce

Steps to reproduce the behavior:

  1. Optimize a dataset that results in a large number of chunks
  2. Use StreamingDataset and StreamingDataLoader to load the data
  3. At some point, OSError: [Errno 24] Too many open files will be encountered

Minimal code sample below to reproduce the issue.

Code sample

import numpy as np
import random

from pathlib import Path

from litdata import optimize
from litdata.streaming import StreamingDataLoader, StreamingDataset, TokensLoader

# Fake tokenizer
def tokenize_fn(filepath):
    yield np.array([random.randint(0, 10000) for _ in range(random.randint(100, 1000))])

def main():
    Path("fake_file.txt").touch()
    optimize(
        fn=tokenize_fn,
        inputs=["fake_file.txt" for i in range(10000)], # increase number of files if error is not encountered on a specific machine
        output_dir="./optimized/",
        chunk_bytes="10KB",
        num_workers=1,
        item_loader=TokensLoader(block_size=1024),
    )

    train_dataset = StreamingDataset(
        input_dir="./optimized/",
        item_loader=TokensLoader(block_size=1024),
        shuffle=True,
        drop_last=False,
    )

    train_dataloader = StreamingDataLoader(
        train_dataset, batch_size=1, pin_memory=False, num_workers=1, drop_last=False
    )

    total_tokens = 0
    total_batches = 0
    for sample in train_dataloader:
        total_batches += 1
        total_tokens += np.prod(sample.shape)

        print(total_batches, total_tokens)
            
    print("Batches:", total_batches)
    print("Tokens:", total_tokens)

if __name__ == "__main__":
    main()

Expected behavior

Iterating over the StreamingDataLoader completes without hitting OSError: [Errno 24] Too many open files, regardless of the number of chunks.

Environment

  • LitData version: 0.2.26
  • OS (e.g., Linux): Linux
  • How you installed: pip
  • Python version: 3.12
