
Feat: Append data to pre-optimized dataset #180

Conversation

deependujha
Collaborator

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #23

The test below illustrates the feature best:

# tests/processing/test_functions.py

from typing import Tuple

import pytest

from litdata import StreamingDataset, optimize


def test_optimize_function_modes(tmpdir):
    output_dir = str(tmpdir.mkdir("output"))

    def compress(index: int) -> Tuple[int, int]:
        return (index, index**2)

    def different_compress(index: int) -> Tuple[int, int, int]:
        return (index, index**2, index**3)

    # default mode (mode=None): write a fresh dataset
    optimize(
        fn=compress,
        inputs=list(range(1, 101)),
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    my_dataset = StreamingDataset(output_dir)
    assert len(my_dataset) == 100
    assert my_dataset[:] == [(i, i**2) for i in range(1, 101)]

    # append mode: new items are added after the existing ones
    optimize(
        fn=compress,
        mode="append",
        inputs=list(range(101, 201)),
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    my_dataset = StreamingDataset(output_dir)
    assert len(my_dataset) == 200
    assert my_dataset[:] == [(i, i**2) for i in range(1, 201)]

    # overwrite mode: the existing data is discarded before writing
    optimize(
        fn=compress,
        mode="overwrite",
        inputs=list(range(201, 351)),
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    my_dataset = StreamingDataset(output_dir)
    assert len(my_dataset) == 150
    assert my_dataset[:] == [(i, i**2) for i in range(201, 351)]

    # failing case: the item format (config) no longer matches the existing dataset
    with pytest.raises(ValueError, match="The config of the optimized dataset is different from the original one."):
        optimize(
            fn=different_compress,
            mode="overwrite",
            inputs=list(range(201, 351)),
            output_dir=output_dir,
            chunk_bytes="64MB",
        )

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃


codecov bot commented Jun 22, 2024

Codecov Report

Attention: Patch coverage is 90.76923% with 6 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@d5eff39). Learn more about missing BASE report.

Additional details and impacted files
@@          Coverage Diff          @@
##             main   #180   +/-   ##
=====================================
  Coverage        ?    78%           
=====================================
  Files           ?     33           
  Lines           ?   4380           
  Branches        ?      0           
=====================================
  Hits            ?   3410           
  Misses          ?    970           
  Partials        ?      0           


deependujha commented Jun 22, 2024

This still needs to be tested against S3, since setting up an S3 connection with a Lightning AI Studio requires a pro account.
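
If it does work, the call should look just like the local examples above, only with an S3 URI as output_dir (an untested sketch; the bucket name is hypothetical):

# Untested sketch: appending to a pre-optimized dataset stored on S3.
# Assumes output_dir accepts an s3:// URI the same way it accepts local paths.
optimize(
    fn=compress,
    mode="append",
    inputs=list(range(101, 201)),
    output_dir="s3://my-bucket/optimized-dataset",  # hypothetical bucket
    chunk_bytes="64MB",
)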

But if it does work, one improvement remains:

When incompatible datasets (ones whose config differs) are appended or overwritten, the current behavior is to run the whole optimization and only attempt the merge/overwrite at the very end.

A better approach would be to check for compatibility up front, before any work is done; a rough sketch of that idea follows.
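
A minimal sketch of that upfront check (hypothetical helper names, not litdata's actual internals; it assumes the optimized dataset stores its config under the "config" key of index.json):

import json
import os


def _load_existing_config(output_dir: str) -> dict:
    """Read the config of an already-optimized dataset, if one exists."""
    index_path = os.path.join(output_dir, "index.json")
    if not os.path.isfile(index_path):
        return {}
    with open(index_path) as f:
        return json.load(f).get("config", {})


def check_compatibility(output_dir: str, new_config: dict, mode: str) -> None:
    """Fail fast, before any optimization work, if the configs cannot be merged."""
    if mode not in ("append", "overwrite"):
        return
    existing = _load_existing_config(output_dir)
    if existing and existing != new_config:
        raise ValueError("The config of the optimized dataset is different from the original one.")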


Also, the tests pass in Lightning Studio; I don't know why they failed here on macOS and Windows.

deependujha marked this pull request as draft on June 24, 2024
deependujha deleted the feat/append-data-to-preoptimize-dataset branch on June 26, 2024