@bhimrazy (Collaborator) commented Aug 1, 2025

What does this PR do?

This PR introduces a multi-level caching mechanism for the file index in StreamingRawDataset to significantly speed up dataset initialization, especially for large datasets stored in the cloud.

The index is now cached both locally and remotely, reducing the need to repeatedly scan the entire dataset directory on subsequent runs.

Multi-Level Cache System

  1. It first attempts to load the index from a local cache file (<cache_dir>/index.json.zstd).
  2. If the local cache is not found, it tries to download the index from the remote input_dir (<remote_path>/index.json.zstd).
  3. If no cached index is available in either location, it discovers the files, builds a new index, saves it to the local cache, and uploads it to the remote input_dir for future use.
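The lookup order above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `load_or_build_index`, `discover`, and the `{"files": [...]}` layout are hypothetical names chosen for the example, the "remote" location is simulated with a local directory, and stdlib `zlib` stands in for zstd compression so the snippet runs without extra dependencies.

```python
import json
import shutil
import tempfile
import zlib
from pathlib import Path
from typing import Callable, List, Tuple

INDEX_NAME = "index.json.zstd"  # file name taken from the PR description


def _write_index(path: Path, files: List[str]) -> None:
    # Assumption: zlib is a stdlib stand-in; the real file name implies zstd.
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps({"files": files}).encode("utf-8")
    path.write_bytes(zlib.compress(payload))


def _read_index(path: Path) -> List[str]:
    return json.loads(zlib.decompress(path.read_bytes()).decode("utf-8"))["files"]


def load_or_build_index(
    cache_dir: Path,
    remote_dir: Path,  # simulated remote input_dir (a local folder here)
    discover: Callable[[], List[str]],  # hypothetical file-listing callback
    recompute_index: bool = False,
) -> Tuple[List[str], str]:
    """Return (files, source) where source is 'local', 'remote', or 'built'."""
    local = cache_dir / INDEX_NAME
    remote = remote_dir / INDEX_NAME

    if not recompute_index:
        # 1. Local cache hit.
        if local.exists():
            return _read_index(local), "local"
        # 2. Remote cache hit: "download" into the local cache, then read it.
        if remote.exists():
            cache_dir.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(remote, local)
            return _read_index(local), "remote"

    # 3. No usable cache (or a forced rebuild): scan, save locally, "upload".
    files = sorted(discover())
    _write_index(local, files)
    remote_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(local, remote)
    return files, "built"


with tempfile.TemporaryDirectory() as tmp:
    cache_dir, remote_dir = Path(tmp) / "cache", Path(tmp) / "remote"
    listing = lambda: ["b.jpg", "a.jpg"]
    _, first = load_or_build_index(cache_dir, remote_dir, listing)
    _, second = load_or_build_index(cache_dir, remote_dir, listing)
    (cache_dir / INDEX_NAME).unlink()  # simulate a fresh machine
    files, third = load_or_build_index(cache_dir, remote_dir, listing)
    _, forced = load_or_build_index(
        cache_dir, remote_dir, listing, recompute_index=True
    )
    print(first, second, third, forced)
```

In this sketch a forced rebuild skips both cache levels and overwrites them, which is one plausible reading of the `recompute_index` behavior requested in the note below; the actual implementation may differ.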

Tested on Lightning Studios with S3 and GS.

Note from Thomas: Can we add an argument to the dataset to recompute the index? Once the index is computed, can we push it to the folder so that the next read is fast, unless that argument is set to True?

Usage Example

from litdata import StreamingRawDataset

# First run: builds and caches index
dataset = StreamingRawDataset("s3://bucket/files/")

# Subsequent runs: load the index from cache instead of rescanning
dataset = StreamingRawDataset("s3://bucket/files/")

# Force rebuild when files change
dataset = StreamingRawDataset("s3://bucket/files/", recompute_index=True)

Follow up to #652

Note: The fsspec-based parts will be replaced when obstore support is added to these indexing components.

@bhimrazy bhimrazy self-assigned this Aug 1, 2025
@bhimrazy bhimrazy added the enhancement New feature or request label Aug 1, 2025

codecov bot commented Aug 1, 2025

Codecov Report

❌ Patch coverage is 87.01299% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 84%. Comparing base (e572b84) to head (26263c5).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #666   +/-   ##
===================================
  Coverage    84%    84%           
===================================
  Files        52     52           
  Lines      7023   7081   +58     
===================================
+ Hits       5906   5958   +52     
- Misses     1117   1123    +6     

@bhimrazy bhimrazy changed the title from "[wip] cache index file for raw dataset to cloud for faster load" to "feat(litdata/raw): Implement remote and local index caching for StreamingRawDataset" Aug 10, 2025
@bhimrazy bhimrazy marked this pull request as ready for review August 10, 2025 17:15
bhimrazy and others added 2 commits August 11, 2025 17:43
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
@tchaton tchaton merged commit 47a8c7a into Lightning-AI:main Aug 11, 2025
36 checks passed
@bhimrazy bhimrazy deleted the feat/store-index-in-raw-dataset branch August 11, 2025 19:48