Description
🐛 Bug
If an optimized dataset has too many chunks (this can be replicated with a small chunk size, or with a reasonable chunk size and a lot of data), then while reading (e.g. using StreamingDataLoader) a Too many open files error is encountered at some point. This is because each chunk is mmap'ed as it is loaded and the mmap'ed handle is never released, leading to too many open files across the system:
One could of course use ulimit to increase the limit, but at some point the number of chunks can simply be too high.
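For reference, the per-process limit can also be inspected and raised from Python itself via the standard resource module (Unix only); this is just the programmatic equivalent of ulimit -n and only postpones the problem:

import resource

# Inspect the per-process open-file limits (equivalent of `ulimit -n`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit; going beyond the hard limit
# requires elevated privileges.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))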
It is easily resolved by keeping the number of mmap'ed files under a certain limit, for example by occasionally evicting entries from the above dict that caches the mmap'ed handles, either randomly or more deliberately with a FIFO policy (a rough sketch is included below).
I wanted to check whether this fix sounds reasonable; I am happy to send a PR for it if it makes sense.
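For illustration, here is a minimal sketch of such a capped cache, assuming a FIFO eviction policy; the class and parameter names (MmapChunkCache, max_open) are hypothetical and not part of the LitData codebase:

import mmap
import os
from collections import OrderedDict


class MmapChunkCache:
    """Keep at most `max_open` chunk files mmap'ed; evict the oldest first (FIFO)."""

    def __init__(self, max_open: int = 128):
        self.max_open = max_open
        self._mmaps: "OrderedDict[str, mmap.mmap]" = OrderedDict()

    def get(self, filepath: str) -> mmap.mmap:
        if filepath in self._mmaps:
            return self._mmaps[filepath]
        # Evict the oldest mapping(s) once the limit is reached; closing them
        # releases the underlying file descriptors.
        while len(self._mmaps) >= self.max_open:
            _, oldest = self._mmaps.popitem(last=False)
            oldest.close()
        fd = os.open(filepath, os.O_RDONLY)
        try:
            mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        finally:
            os.close(fd)  # the mmap object holds its own duplicate of the descriptor
        self._mmaps[filepath] = mm
        return mm

    def close(self) -> None:
        for mm in self._mmaps.values():
            mm.close()
        self._mmaps.clear()

Closing an evicted mapping releases its file descriptor, so the number of open files stays bounded at the cost of occasionally re-mmap'ing a chunk that is requested again.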
To Reproduce
Steps to reproduce the behavior:
- Optimize a dataset that results in a large number of chunks
- Use StreamingDataset and StreamingDataLoader to load the data
- At some point, OSError: [Errno 24] Too many open files will be encountered
Minimal code sample below to reproduce the issue.
Code sample
import glob
import random
from pathlib import Path

import numpy as np

from litdata import optimize
from litdata.streaming import StreamingDataLoader, StreamingDataset, TokensLoader


# Fake tokenizer
def tokenize_fn(filepath):
    yield np.array([random.randint(0, 10000) for _ in range(random.randint(100, 1000))])


def main():
    Path("fake_file.txt").touch()
    outputs = optimize(
        fn=tokenize_fn,
        inputs=["fake_file.txt" for i in range(10000)],  # increase the number of files if the error is not encountered on a specific machine
        output_dir="./optimized/",
        chunk_bytes="10KB",
        num_workers=1,
        item_loader=TokensLoader(block_size=1024),
    )

    train_dataset = StreamingDataset(
        input_dir="./optimized/",
        item_loader=TokensLoader(block_size=1024),
        shuffle=True,
        drop_last=False,
    )
    train_dataloader = StreamingDataLoader(
        train_dataset, batch_size=1, pin_memory=False, num_workers=1, drop_last=False
    )

    total_tokens = 0
    total_batches = 0
    for sample in train_dataloader:
        total_batches += 1
        total_tokens += np.prod(sample.shape)
        print(total_batches, total_tokens)

    print("Batches:", total_batches)
    print("Tokens:", total_tokens)


if __name__ == "__main__":
    main()

Expected behavior
The dataloader iterates over the entire dataset without raising OSError: [Errno 24] Too many open files.
Environment
- LitData version: 0.2.26
- OS (e.g., Linux): Linux
- How you installed: pip
- Python version: 3.12