
Feat: Append data to pre-optimized dataset #180

Conversation

deependujha
Collaborator

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #23

The test below illustrates the feature best:

# tests/processing/test_functions.py

from typing import Tuple

import pytest

from litdata import StreamingDataset, optimize


def test_optimize_function_modes(tmpdir):
    output_dir = str(tmpdir.mkdir("output"))

    def compress(index: int) -> Tuple[int, int]:
        return (index, index**2)

    def different_compress(index: int) -> Tuple[int, int, int]:
        return (index, index**2, index**3)

    # default mode (mode=None): write a fresh dataset
    optimize(
        fn=compress,
        inputs=list(range(1, 101)),
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    my_dataset = StreamingDataset(output_dir)
    assert len(my_dataset) == 100
    assert my_dataset[:] == [(i, i**2) for i in range(1, 101)]

    # append mode: new items are added after the existing ones
    optimize(
        fn=compress,
        mode="append",
        inputs=list(range(101, 201)),
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    my_dataset = StreamingDataset(output_dir)
    assert len(my_dataset) == 200
    assert my_dataset[:] == [(i, i**2) for i in range(1, 201)]

    # overwrite mode: the existing data is discarded before writing
    optimize(
        fn=compress,
        mode="overwrite",
        inputs=list(range(201, 351)),
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    my_dataset = StreamingDataset(output_dir)
    assert len(my_dataset) == 150
    assert my_dataset[:] == [(i, i**2) for i in range(201, 351)]

    # failing case: the item format (config) no longer matches the existing dataset
    with pytest.raises(ValueError, match="The config of the optimized dataset is different from the original one."):
        optimize(
            fn=different_compress,
            mode="overwrite",
            inputs=list(range(201, 351)),
            output_dir=output_dir,
            chunk_bytes="64MB",
        )

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃


codecov bot commented Jun 22, 2024

Codecov Report

Attention: Patch coverage is 90.76923% with 6 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@d5eff39). Learn more about missing BASE report.

Additional details and impacted files
@@          Coverage Diff          @@
##             main   #180   +/-   ##
=====================================
  Coverage        ?    78%           
=====================================
  Files           ?     33           
  Lines           ?   4380           
  Branches        ?      0           
=====================================
  Hits            ?   3410           
  Misses          ?    970           
  Partials        ?      0           


deependujha commented Jun 22, 2024

This still needs to be tested against S3, since setting up an S3 connection with a Lightning AI Studio requires a pro account.
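
If it does work, the call should look just like the local examples above, only with an S3 URI as output_dir (an untested sketch; the bucket name is hypothetical):

# Untested sketch: appending to a pre-optimized dataset stored on S3.
# Assumes output_dir accepts an s3:// URI the same way it accepts local paths.
optimize(
    fn=compress,
    mode="append",
    inputs=list(range(101, 201)),
    output_dir="s3://my-bucket/optimized-dataset",  # hypothetical bucket
    chunk_bytes="64MB",
)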

But if it does work, one improvement remains:

When incompatible datasets (ones whose config differs) are appended or overwritten, the current behavior is to run the whole optimization and only attempt the merge/overwrite at the very end.

A better approach would be to check for compatibility up front, before any work is done; a rough sketch of that idea follows.
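
A minimal sketch of that upfront check (hypothetical helper names, not litdata's actual internals; it assumes the optimized dataset stores its config under the "config" key of index.json):

import json
import os


def _load_existing_config(output_dir: str) -> dict:
    """Read the config of an already-optimized dataset, if one exists."""
    index_path = os.path.join(output_dir, "index.json")
    if not os.path.isfile(index_path):
        return {}
    with open(index_path) as f:
        return json.load(f).get("config", {})


def check_compatibility(output_dir: str, new_config: dict, mode: str) -> None:
    """Fail fast, before any optimization work, if the configs cannot be merged."""
    if mode not in ("append", "overwrite"):
        return
    existing = _load_existing_config(output_dir)
    if existing and existing != new_config:
        raise ValueError("The config of the optimized dataset is different from the original one.")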


Also, the tests pass in Lightning Studio; I don't know why they failed here on macOS and Windows.

deependujha marked this pull request as draft on June 24, 2024
deependujha deleted the feat/append-data-to-preoptimize-dataset branch on June 26, 2024