Description
🐛 Bug
I'm currently using litdata entirely locally: I first convert the dataset with `optimize` and then use `StreamingDataset` to stream records from a local directory to train my model. I want to train multiple models (on the same dataset) in parallel, but the cache files created by previous runs end up blocking the `StreamingDataset` of subsequent runs (probably due to locking?). It took me quite a while to figure out that the freeze was caused by the cache files.
My workaround for now is to create a new cache directory for each run, following the documentation, using the `Dir` class from `resolver.py`. This was a bit confusing at first, because `Dir` takes the arguments `url` and `path`, which makes it seem like it only works when your data is in the cloud (`url`). It would have made more sense if the arguments were named something like `path` (either a URL or a local directory) and `cache_dir` (the directory in which to store the cache).
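For reference, a minimal sketch of this workaround. The `make_run_cache_dir` helper and the base directory are placeholder names I chose; the `Dir(url=..., path=...)` wiring (left as comments) is my reading of the documented workaround above and should be treated as an assumption, not a confirmed litdata API usage:

```python
import pathlib
import tempfile


def make_run_cache_dir(base: str = "/tmp/litdata_caches") -> str:
    # Create a fresh, unique cache directory for this run so that
    # parallel trainings do not contend on the same cache/lock files.
    pathlib.Path(base).mkdir(parents=True, exist_ok=True)
    return tempfile.mkdtemp(dir=base)


cache_dir = make_run_cache_dir()

# Sketch of wiring the per-run cache dir into litdata (not executed here;
# argument mapping assumed from the workaround described above):
# from litdata import StreamingDataset
# from litdata.streaming.resolver import Dir
#
# dataset = StreamingDataset(
#     input_dir=Dir(
#         url="path/to/optimized_dataset",  # where the optimized data lives
#         path=cache_dir,                   # per-run cache directory
#     ),
# )
```

With a unique cache directory per run, each training process gets its own lock files and the freeze no longer occurs.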
The question I had was: why does it have to cache data when all the data is already available locally?
It would be great if StreamingDataset directly took an argument like cache_dir.
Thanks.