Description
🐛 Bug
I'm currently using litdata entirely locally: I first convert the dataset with `optimize` and then use `StreamingDataset` to stream records from a local directory to train my model. I want to train multiple models (on the same dataset) in parallel, but the cache files created by previous runs end up blocking the `StreamingDataset` of subsequent runs (probably due to locking?). It took me quite a while to figure out that the freeze was caused by the cache files.
My workaround for now is to create a new cache directory for each run, following the documentation, using the `Dir` class from `resolver.py`. This was a bit confusing at first, because `Dir` takes the arguments `url` and `path`, which makes it seem like it only works when your data is in the cloud (`url`). It would have made more sense if the arguments were named something like `path` (either a URL or a local directory) and `cache_dir` (the directory in which to store the cache).
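For reference, a minimal sketch of this workaround. The `make_run_cache_dir` helper and the base directory are placeholder names I chose; the `Dir(url=..., path=...)` wiring (left as comments) is my reading of the documented workaround above and should be treated as an assumption, not a confirmed litdata API usage:

```python
import pathlib
import tempfile


def make_run_cache_dir(base: str = "/tmp/litdata_caches") -> str:
    # Create a fresh, unique cache directory for this run so that
    # parallel trainings do not contend on the same cache/lock files.
    pathlib.Path(base).mkdir(parents=True, exist_ok=True)
    return tempfile.mkdtemp(dir=base)


cache_dir = make_run_cache_dir()

# Sketch of wiring the per-run cache dir into litdata (not executed here;
# argument mapping assumed from the workaround described above):
# from litdata import StreamingDataset
# from litdata.streaming.resolver import Dir
#
# dataset = StreamingDataset(
#     input_dir=Dir(
#         url="path/to/optimized_dataset",  # where the optimized data lives
#         path=cache_dir,                   # per-run cache directory
#     ),
# )
```

With a unique cache directory per run, each training process gets its own lock files and the freeze no longer occurs.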
The question I had was: why does it have to cache data when all the data is already available locally?
It would be great if StreamingDataset directly took an argument like cache_dir.
Thanks.