@bhimrazy
Collaborator

@bhimrazy bhimrazy commented Jul 31, 2025

What does this PR do?

Adds grouping support to the StreamingRawDataset class through its setup method, so users can structure dataset items either as single files or as groups of files. Also improves asynchronous handling of batch downloads, updates the tests to cover the new behavior, and removes unused imports.

Fixes #662

Follow up to #652

Usage Example

from litdata.streaming.raw_dataset import StreamingRawDataset, FileMetadata
from torch.utils.data import DataLoader
from typing import Union

class CustomStreamingRawDataset(StreamingRawDataset):
    def setup(self, files: list[FileMetadata]) -> Union[list[FileMetadata], list[list[FileMetadata]]]:
        # Customize grouping logic here.
        # For example: return pairs like [[image_1, mask_1], [image_2, mask_2], ...]
        return files

# Initialize the streaming raw dataset from S3 path
dataset = CustomStreamingRawDataset("s3://bucket/files/")
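As a rough illustration of the kind of grouping logic that could go inside setup, here is a minimal, standalone sketch that pairs each image with its mask. It makes several assumptions that are not part of this PR: FileStub is a hypothetical stand-in for FileMetadata (only a path field is assumed), pair_images_with_masks is a made-up helper name, and the images/ and masks/ directory layout is invented for the example.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileStub:
    """Hypothetical stand-in for FileMetadata; only a `path` field is assumed."""

    path: str


def pair_images_with_masks(files, image_dir="images", mask_dir="masks"):
    """Return [[image, mask], ...] for files that share a stem across two directories.

    The directory names are assumptions made for this example only.
    """
    images, masks = {}, {}
    for f in files:
        p = PurePosixPath(f.path)
        if p.parent.name == image_dir:
            images[p.stem] = f
        elif p.parent.name == mask_dir:
            masks[p.stem] = f
    # Keep only stems present in both directories; sort for a deterministic order.
    return [[images[s], masks[s]] for s in sorted(images.keys() & masks.keys())]
```

A setup override could then return the result of such a helper instead of the flat file list, provided the real FileMetadata objects expose their location in a comparable way. Files without a matching partner are dropped here; whether to drop or keep them is a design choice left to the user.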

@bhimrazy bhimrazy self-assigned this Jul 31, 2025
@bhimrazy bhimrazy added the enhancement New feature or request label Jul 31, 2025
@bhimrazy bhimrazy changed the title from "feat(litdata): Add grouping functionality and improve StreamingRawDataset" to "feat(litdata): Add grouping functionality in StreamingRawDataset" Jul 31, 2025
@codecov

codecov bot commented Jul 31, 2025

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 84%. Comparing base (f88a139) to head (9e1f8ee).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #665   +/-   ##
===================================
  Coverage    83%    84%           
===================================
  Files        50     50           
  Lines      7124   7137   +13     
===================================
+ Hits       5935   5960   +25     
+ Misses     1189   1177   -12     

Collaborator

@tchaton tchaton left a comment

Mind adding an example to the README?

@bhimrazy
Collaborator Author

bhimrazy commented Jul 31, 2025

Mind adding an example to the README?

Sure! Just updated.
My bad — I missed adding it earlier. Thanks for the reminder!

Collaborator

@tchaton tchaton left a comment

Looks great!

@bhimrazy bhimrazy merged commit 47c9098 into Lightning-AI:main Jul 31, 2025
35 checks passed
@bhimrazy bhimrazy deleted the feat/add-support-for-regrouping branch July 31, 2025 20:05

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Add support for re-grouping in Streaming raw dataset

2 participants