Feat: add multiple transform_fn support in StreamingDataset #655
        
Pull Request Overview
This PR adds an optional transform_kwargs API to StreamingDataset, allowing users to pass keyword arguments into their transform functions, and updates the core logic, tests, and documentation to support this feature.
- Introduce a `fn_accepts_kwargs` utility to detect whether a function can accept `**kwargs`.
- Extend `StreamingDataset` to accept `transform_kwargs` and inject them into each transform call.
- Add tests for the new behavior and update the README to document `transform_kwargs`.
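As a rough illustration of the behavior summarized above, the sketch below applies either a single transform or a list of transforms to a sample and forwards the same keyword arguments to each call. The names `apply_transforms`, `scale`, and `shift` are hypothetical and are not taken from the PR or from litdata. For brevity, the sketch assumes every transform tolerates the supplied keywords by declaring `**kwargs`; the PR instead checks this with `fn_accepts_kwargs` before injecting them.

```python
from typing import Any, Callable, Dict, List, Optional, Union

Transform = Callable[..., Any]


def apply_transforms(
    sample: Any,
    transform: Union[Transform, List[Transform]],
    transform_kwargs: Optional[Dict[str, Any]] = None,
) -> Any:
    """Hypothetical helper: run one or more transforms over a sample, in order."""
    # Normalize a single callable into a one-element list.
    transforms = list(transform) if isinstance(transform, (list, tuple)) else [transform]
    kwargs = transform_kwargs or {}
    for fn in transforms:
        # Assumption: each transform accepts the supplied keywords (e.g. via **kwargs).
        sample = fn(sample, **kwargs)
    return sample


def scale(sample, factor=1, **kwargs):
    # Illustrative transform: multiplies the sample by `factor`.
    return sample * factor


def shift(sample, offset=0, **kwargs):
    # Illustrative transform: adds `offset` to the sample.
    return sample + offset


result_list = apply_transforms(3, [scale, shift], {"factor": 2, "offset": 1})
result_single = apply_transforms(3, scale, {"factor": 2})
```

Here the list case scales first and shifts second, so the order of the list determines the result; the single-callable case goes through the same code path after normalization.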
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/utilities/test_dataset_utilities.py | Import and test the new `fn_accepts_kwargs` helper |
| tests/streaming/test_dataset.py | Add `test_dataset_multiple_transform` to verify multiple transforms and `transform_kwargs` |
| src/litdata/utilities/dataset_utilities.py | Implement `fn_accepts_kwargs`, import `inspect` and `Callable` |
| src/litdata/streaming/dataset.py | Extend constructor, store `transform_kwargs`, and update `__getitem__` to pass kwargs |
| README.md | Document use of `transform` and `transform_kwargs` in `StreamingDataset` API |
Comments suppressed due to low confidence (3)
README.md:938
- There's a small grammatical issue: it should read "a list of transformation functions" (plural) instead of "a list of transformation function."
- You can use the `transform` argument in `StreamingDataset` to apply a `transformation function` or `a list of transformation function` to each sample as it is streamed.
tests/streaming/test_dataset.py:1699
- Consider adding a complementary test where `transform` is a single callable (not a list) to verify that `transform_kwargs` are injected correctly for non-list transforms.
def test_dataset_multiple_transform(tmpdir, shuffle):
src/litdata/utilities/dataset_utilities.py:347
- [nitpick] The name `fn_accepts_kwargs` is clear but a bit terse; consider renaming to `function_accepts_kwargs` or `accepts_kwargs` to improve readability.
def fn_accepts_kwargs(_fn: Callable) -> bool:
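Only the signature above appears on this page, so the body below is a guess at how such a helper might work, not the PR's actual implementation. It assumes the check is done with `inspect.signature` by looking for a `**kwargs`-style parameter; `plain` and `flexible` are hypothetical functions used only to exercise it.

```python
import inspect
from typing import Callable


def fn_accepts_kwargs(_fn: Callable) -> bool:
    """Return True if `_fn` declares a catch-all `**kwargs` parameter (assumed behavior)."""
    try:
        signature = inspect.signature(_fn)
    except (TypeError, ValueError):
        # Some callables expose no introspectable signature; treat them as not accepting kwargs.
        return False
    return any(
        param.kind is inspect.Parameter.VAR_KEYWORD
        for param in signature.parameters.values()
    )


def plain(sample):
    # Hypothetical transform without **kwargs.
    return sample


def flexible(sample, **kwargs):
    # Hypothetical transform that accepts arbitrary keyword arguments.
    return sample
```

With a check like this, a caller could pass keyword arguments only to transforms for which the helper returns `True` and call the rest with the sample alone.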
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@         Coverage Diff         @@
##           main   #655   +/-   ##
===================================
- Coverage    83%    83%   -0%
===================================
  Files        49     49
  Lines      6761   6768    +7
===================================
+ Hits       5642   5647    +5
- Misses     1119   1121    +2
@deependujha Do you have a use case for this?

Before submitting
What does this PR do?
Adds a feature to pass a list of transformation functions, which will be applied to the streamed dataset.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃