feat(litdata/raw): Implement remote and local index caching for `StreamingRawDataset` #666

bhimrazy · 2025-08-01T07:04:04Z

What does this PR do?

This PR introduces a multi-level caching mechanism for the file index in StreamingRawDataset to significantly speed up dataset initialization, especially for large datasets stored in the cloud.

The index is now cached both locally and remotely, reducing the need to repeatedly scan the entire dataset directory on subsequent runs.

Multi-Level Cache System

1. It first attempts to load the index from a local cache file (<cache_dir>/index.json.zstd).
2. If the local cache is not found, it tries to download the index from the remote input_dir (<remote_path>/index.json.zstd).
3. If no cached index is available in either location, it discovers the files, builds a new index, saves it to the local cache, and uploads it to the remote input_dir for future use.
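
For reviewers who want the lookup order in one place, here is a minimal sketch of the three steps above. It is not the code in src/litdata/raw/indexer.py: the names `load_or_build_index`, `discover_files`, and `INDEX_NAME` are hypothetical, a plain local directory stands in for the remote input_dir (no fsspec calls), and zlib stands in for zstd so the sketch only needs the standard library. The `recompute_index` flag is assumed to skip both caches, as requested in the note below.

```python
import json
import shutil
import zlib
from pathlib import Path

# Hypothetical file name; the PR uses index.json.zstd (zlib keeps this sketch stdlib-only).
INDEX_NAME = "index.json.zlib"


def discover_files(data_dir: Path) -> list[str]:
    """Scan the dataset directory, leaving the index file itself out of the listing."""
    return sorted(
        p.relative_to(data_dir).as_posix()
        for p in data_dir.rglob("*")
        if p.is_file() and p.name != INDEX_NAME
    )


def load_or_build_index(
    data_dir: Path, cache_dir: Path, recompute_index: bool = False
) -> tuple[list[str], str]:
    """Return (files, source), where source is 'local', 'remote', or 'built'."""
    local_index = cache_dir / INDEX_NAME
    remote_index = data_dir / INDEX_NAME  # stand-in for <remote_path>/index.json.zstd

    if not recompute_index:
        # Step 1: local cache.
        if local_index.exists():
            return json.loads(zlib.decompress(local_index.read_bytes())), "local"
        # Step 2: "download" the remote index into the local cache.
        if remote_index.exists():
            cache_dir.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(remote_index, local_index)
            return json.loads(zlib.decompress(local_index.read_bytes())), "remote"

    # Step 3: scan, build, save locally, and "upload" next to the data.
    files = discover_files(data_dir)
    payload = zlib.compress(json.dumps(files).encode("utf-8"))
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_index.write_bytes(payload)
    remote_index.write_bytes(payload)
    return files, "built"
```

In this sketch a stale local index is still served until `recompute_index=True` is passed, which is why the flag matters when files change.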

Tested on Lightning Studios with S3 and GS.

Note from thomas: Can we add an argument to the dataset to recompute the index? Once the index is computed, can we push the index to the folder, so the next read is fast unless that argument is set to True?

Usage Example

from litdata import StreamingRawDataset

# First run: builds and caches index
dataset = StreamingRawDataset("s3://bucket/files/")

# Subsequent runs: loads from cache instantly
dataset = StreamingRawDataset("s3://bucket/files/")

# Force rebuild when files change
dataset = StreamingRawDataset("s3://bucket/files/", recompute_index=True)

Follow up to #652

Note: The fsspec parts will be replaced when obstore support is added to these indexing components.


codecov · 2025-08-01T07:26:07Z

Codecov Report

❌ Patch coverage is 87.01299% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 84%. Comparing base (e572b84) to head (26263c5).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@         Coverage Diff         @@
##           main   #666   +/-   ##
===================================
  Coverage    84%    84%           
===================================
  Files        52     52           
  Lines      7023   7081   +58     
===================================
+ Hits       5906   5958   +52     
- Misses     1117   1123    +6



bhimrazy and others added 2 commits August 1, 2025 12:47

Prepare for cloud/local index caching

dc06c65

[pre-commit.ci] auto fixes from pre-commit.com hooks

b02a3e7

for more information, see https://pre-commit.ci

bhimrazy self-assigned this Aug 1, 2025

bhimrazy added the enhancement New feature or request label Aug 1, 2025

bhimrazy added 9 commits August 4, 2025 10:39

Merge branch 'main' into feat/store-index-in-raw-dataset

9c04ca0

Merge branch 'main' into feat/store-index-in-raw-dataset

792be1d

ref: Enhance file indexing with local and remote caching mechanisms

ecfb732

ref: Exclude index files from inclusion in file indexing

38cb6ac

Improve index loading documentation and streamline remote cache handling in BaseIndexer

fbb087c

ref: Validate input directory scheme in BaseIndexer and streamline local file discovery

0530a8e

Validate input directory scheme in FileIndexer and raise error for unsupported schemes

f6e1a2a

Add tests for handling unsupported input directory schemes in FileIndexer

a90b910

Add tests for building and loading remote index with caching in FileIndexer

ff1d58a

bhimrazy changed the title ~~[wip] cache index file for raw dataset to cloud for faster load~~ feat(litdata/raw): Implement remote and local index caching for StreamingRawDataset Aug 10, 2025

bhimrazy and others added 4 commits August 10, 2025 22:31

Add test to ensure index file is excluded during recompute

fc93dad

Enhance test description for index file exclusion during recompute

44f736b

[pre-commit.ci] auto fixes from pre-commit.com hooks

2ae9325

for more information, see https://pre-commit.ci

Merge branch 'main' into feat/store-index-in-raw-dataset

4bbd8cb

bhimrazy mentioned this pull request Aug 10, 2025

feat(litdata): Add Support for StreamingRawDataset to Stream Raw Datasets from Cloud Storage #652

Merged

Add documentation for Smart Index Caching in StreamingRawDataset

36acd45

bhimrazy marked this pull request as ready for review August 10, 2025 17:15

bhimrazy requested review from Borda, justusschock, lantiga and tchaton as code owners August 10, 2025 17:15

bhimrazy and others added 4 commits August 10, 2025 23:02

Refine description of Smart Index Caching in StreamingRawDataset for clarity

0c9673b

Add Windows compatibility checks for remote index tests

06eb0b1

Add Windows compatibility check for recompute index test

ed0e532

[pre-commit.ci] auto fixes from pre-commit.com hooks

a31ee84

for more information, see https://pre-commit.ci

Borda reviewed Aug 11, 2025

src/litdata/raw/indexer.py (outdated, resolved)

Borda reviewed Aug 11, 2025

src/litdata/raw/indexer.py (outdated, resolved)

Borda approved these changes Aug 11, 2025


bhimrazy and others added 2 commits August 11, 2025 17:43

Apply suggestions

ec6d6a6

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>

Apply suggestions

26263c5

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>

tchaton approved these changes Aug 11, 2025


tchaton merged commit 47a8c7a into Lightning-AI:main Aug 11, 2025
36 checks passed

bhimrazy deleted the feat/store-index-in-raw-dataset branch August 11, 2025 19:48
