@deependujha
Collaborator

@deependujha deependujha commented Jul 14, 2025

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Adds support for passing a list of transformation functions to `StreamingDataset`; they are applied to each sample as it is streamed.

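    from functools import partial

    from litdata import StreamingDataset

    # Define two simple transform functions
    def transform_fn_1(x):
        """A simple transform function that doubles the input."""
        return x * 2

    def transform_fn_2(x, extra_num):
        """A simple transform function that adds `extra_num` to the input."""
        return x + extra_num

    # `data_dir`, `cache_dir`, and `shuffle` are assumed to be defined by the caller.
    dataset = StreamingDataset(
        data_dir,
        cache_dir=str(cache_dir),
        shuffle=shuffle,
        # `partial` binds `extra_num`, so each transform takes only the sample.
        transform=[transform_fn_1, partial(transform_fn_2, extra_num=100)],
    )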
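
To make the intent of the example concrete, here is a minimal standalone sketch of what chaining the two transforms would do to a single value. It assumes the transforms are applied in list order, each receiving the previous one's output; the PR description does not spell out the ordering, and `apply_transforms` is a hypothetical helper, not part of litdata.

```python
from functools import partial
from typing import Any, Callable, Sequence


def transform_fn_1(x):
    """Double the input."""
    return x * 2


def transform_fn_2(x, extra_num):
    """Add `extra_num` to the input."""
    return x + extra_num


def apply_transforms(sample: Any, transforms: Sequence[Callable[[Any], Any]]) -> Any:
    # Hypothetical helper (assumption): feed each transform's output into the next.
    for fn in transforms:
        sample = fn(sample)
    return sample


result = apply_transforms(3, [transform_fn_1, partial(transform_fn_2, extra_num=100)])
print(result)  # 106: (3 * 2) + 100
```

Under that ordering assumption, swapping the two entries in the list would give a different result, so the order of the list matters.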

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@deependujha deependujha requested a review from Copilot July 14, 2025 09:17
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds an optional transform_kwargs API to StreamingDataset, allowing users to pass keyword arguments into their transform functions, and updates the core logic, tests, and documentation to support this feature.

  • Introduce fn_accepts_kwargs utility to detect if a function can accept **kwargs.
  • Extend StreamingDataset to accept and inject transform_kwargs into each transform call.
  • Add tests for the new behavior and update the README to document transform_kwargs.
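
The detection step described in the overview can be sketched with the standard `inspect` module. This is only an illustration of the general technique, not the PR's actual code: `accepts_var_keyword` and `call_transform` are hypothetical names, and the real `fn_accepts_kwargs` helper may be implemented differently.

```python
import inspect
from typing import Any, Callable, Dict


def accepts_var_keyword(fn: Callable) -> bool:
    # True if the signature declares a `**kwargs`-style parameter.
    params = inspect.signature(fn).parameters.values()
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params)


def call_transform(fn: Callable, sample: Any, extra_kwargs: Dict[str, Any]) -> Any:
    # Assumption: extra keyword arguments are injected only when the
    # function can accept them; otherwise it is called with the sample alone.
    if accepts_var_keyword(fn):
        return fn(sample, **extra_kwargs)
    return fn(sample)


def plain(x):
    return x * 2


def with_kwargs(x, **kwargs):
    return x + kwargs.get("extra_num", 0)


print(call_transform(plain, 3, {"extra_num": 100}))        # 6
print(call_transform(with_kwargs, 3, {"extra_num": 100}))  # 103
```

Checking the signature up front avoids a `TypeError` when a transform that takes only the sample is handed unexpected keyword arguments.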

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File | Description
tests/utilities/test_dataset_utilities.py | Import and test the new fn_accepts_kwargs helper
tests/streaming/test_dataset.py | Add test_dataset_multiple_transform to verify multiple transforms and transform_kwargs
src/litdata/utilities/dataset_utilities.py | Implement fn_accepts_kwargs, import inspect and Callable
src/litdata/streaming/dataset.py | Extend constructor, store transform_kwargs, and update __getitem__ to pass kwargs
README.md | Document use of transform and transform_kwargs in StreamingDataset API
Comments suppressed due to low confidence (3)

README.md:938

  • There's a small grammatical issue: it should read "a list of transformation functions" (plural) instead of "a list of transformation function."
- You can use the `transform` argument in `StreamingDataset` to apply a `transformation function` or `a list of transformation function` to each sample as it is streamed.

tests/streaming/test_dataset.py:1699

  • Consider adding a complementary test where transform is a single callable (not a list) to verify that transform_kwargs are injected correctly for non-list transforms.
def test_dataset_multiple_transform(tmpdir, shuffle):

src/litdata/utilities/dataset_utilities.py:347

  • [nitpick] The name fn_accepts_kwargs is clear but a bit terse; consider renaming to function_accepts_kwargs or accepts_kwargs to improve readability.
def fn_accepts_kwargs(_fn: Callable) -> bool:
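
The suggestion above about covering a single callable as well as a list could be checked along these lines. This is a rough sketch that does not touch `StreamingDataset` itself: `as_transform_list` and `run` are hypothetical helpers that assume a single callable is treated the same as a one-element list.

```python
from typing import Any, Callable, List, Optional, Sequence, Union

Transform = Callable[[Any], Any]


def as_transform_list(
    transform: Optional[Union[Transform, Sequence[Transform]]],
) -> List[Transform]:
    # Hypothetical normalization (assumption): None -> no transforms,
    # a single callable -> one-element list, a sequence -> a plain list.
    if transform is None:
        return []
    if callable(transform):
        return [transform]
    return list(transform)


def run(sample: Any, transform: Optional[Union[Transform, Sequence[Transform]]]) -> Any:
    # Apply the normalized transforms one after another.
    for fn in as_transform_list(transform):
        sample = fn(sample)
    return sample


def double(x):
    return x * 2


# A single callable and a one-element list should behave identically.
assert run(4, double) == run(4, [double]) == 8
# No transform leaves the sample unchanged.
assert run(4, None) == 4
```

Normalizing once keeps the per-sample code path the same for both input shapes, which is what such a complementary test would be exercising.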

@codecov

codecov bot commented Jul 14, 2025

Codecov Report

Attention: Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 83%. Comparing base (54cb63b) to head (c79dbd6).
Report is 2 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #655   +/-   ##
===================================
- Coverage    83%    83%   -0%     
===================================
  Files        49     49           
  Lines      6761   6768    +7     
===================================
+ Hits       5642   5647    +5     
- Misses     1119   1121    +2     
@deependujha deependujha changed the title from "Feat: add support to pass transform kwargs in StreamingDataset" to "Feat: add multiple transform_fn support in StreamingDataset" on Jul 14, 2025
@deependujha deependujha requested a review from Borda July 14, 2025 12:24
@Borda Borda changed the title Feat: add multiple transform_fn support in StreamingDataset Feat: add multiple transform_fn support in StreamingDataset Jul 14, 2025
@tchaton
Collaborator

tchaton commented Jul 14, 2025

@deependujha Do you have a use case for this?

@deependujha
Collaborator Author

deependujha commented Jul 14, 2025

https://github.com/Lightning-AI/litData/pull/651/files#diff-7d756c9b6b80c4ccc46eb78e1f37bade5d944e57095e98ea6dcd3b9884d1d51aR76-R80

Ultralytics parses the labels file and already implements some helper methods for it. I could have written another wrapper transform method to do the same, but passing a list felt cleaner at the time, and I think it may be helpful for future integrations as well.

wdyt? @tchaton @Borda @bhimrazy

@tchaton tchaton merged commit fc59c8a into Lightning-AI:main Jul 14, 2025
35 checks passed
@deependujha deependujha deleted the feat/add-support-to-pass-transform-kwargs branch July 14, 2025 16:50