Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deletes: adding purge option for consolidation. #3458

Merged
merged 4 commits into from Sep 3, 2022

Conversation

KiterLuc
Copy link
Contributor

This adds the ability to purge deleted cells when running consolidation with deletes. When this is done, the cells that were deleted are fully removed from the fragment, unless they get added again after the deletion. This will also not write the delete metadata columns for this fragment as there is no delete times for the cells.

The harder problem to solve for this PR was for the no duplicates array, when a cell gets deleted, deduplication needs to delete only the cells that were added before a certain cell was deleted. For fragments with timestamps, as we still want to write every cells with their appropriate timestamps, this means that a fragment could have more than one cell with the same coordinate to process. The solution is to add all cells with the same coordinate to the sorting tile queue, and to add the timestamp dimension to the sorting (with the greater timestamp coming first). That way we can merge all cells until a deleted cell gets hit, at which point we stop and get rid of the cells that came in before the delete.

This also fixed a few tests that actually didn't run consolidation, and fixes consolidating a fragment consolidated with deletes, as the delete condition index tiles were not getting loaded properly.


TYPE: IMPROVEMENT
DESC: Deletes: adding purge option for consolidation.

@KiterLuc KiterLuc requested a review from ihnorton August 17, 2022 16:12
@shortcut-integration
Copy link

This pull request has been linked to Shortcut Story #20031: Deletes: implement purge option for consolidation..

@KiterLuc KiterLuc marked this pull request as draft August 17, 2022 20:04
@KiterLuc KiterLuc force-pushed the lr/deletes-purge-deleted-cells/ch20031 branch 2 times, most recently from 2223fc6 to 41f5112 Compare August 18, 2022 14:17
@KiterLuc KiterLuc marked this pull request as ready for review August 18, 2022 14:57
@KiterLuc KiterLuc force-pushed the lr/deletes-purge-deleted-cells/ch20031 branch 2 times, most recently from a9f9990 to 659af3a Compare August 18, 2022 18:39
tiledb/sm/misc/comparators.h Show resolved Hide resolved
tiledb/sm/misc/comparators.h Show resolved Hide resolved
tiledb/sm/query/readers/result_tile.cc Show resolved Hide resolved
tiledb/sm/query/readers/sparse_global_order_reader.cc Outdated Show resolved Hide resolved
tiledb/sm/query/readers/result_tile.h Outdated Show resolved Hide resolved
tiledb/sm/query/readers/sparse_global_order_reader.cc Outdated Show resolved Hide resolved
tiledb/sm/query/readers/sparse_global_order_reader.cc Outdated Show resolved Hide resolved
tiledb/sm/query/readers/sparse_global_order_reader.cc Outdated Show resolved Hide resolved
tiledb/sm/query/readers/sparse_global_order_reader.cc Outdated Show resolved Hide resolved
This adds the ability to purge deleted cells when running consolidation with deletes. When this is done, the cells that were deleted are fully removed from the fragment, unless they get added again after the deletion. This will also not write the delete metadata columns for this fragment as there is no delete times for the cells.

The harder problem to solve for this PR was for the no duplicates array, when a cell gets deleted, deduplication needs to delete only the cells that were added before a certain cell was deleted. For fragments with timestamps, as we still want to write every cells with their appropriate timestamps, this means that a fragment could have more than one cell with the same coordinate to process. The solution is to add all cells with the same coordinate to the sorting tile queue, and to add the timestamp dimension to the sorting (with the greater timestamp coming first). That way we can merge all cells until a deleted cell gets hit, at which point we stop and get rid of the cells that came in before the delete.

This also fixed a few tests that actually didn't run consolidation, and fixes consolidating a fragment consolidated with deletes, as the delete condition index tiles were not getting loaded properly.

---
TYPE: IMPROVEMENT
DESC: Deletes: adding purge option for consolidation.
Copy link
Member

@ihnorton ihnorton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be an internal ordering assumption violation if I do a write at the same timestamp as a delete, and then consolidate+purge. See the following sequence (from here as with the previous issue).

write(ts=1,
	idx=[0,1,2,3,4], data=[0,1,2,3,4]
)
write(ts=2,
	idx=[5,6,7,8,9], data=[5,6,7,8,9]
)

apply_delete(ts=3, “data == 2”)

consolidate()
vacuum()

# THIS step causes failures below; if I disable, consolidate works fine and state matches
write(ts=3, idx=[2], data=[2])

consolidate(purge=True)

I get this when running consolidate:

TileDBError                               Traceback (most recent call last)
File ~/work/git/TileDB-Py/tiledb/cc/tests/test_cc_deletes.py:150, in <module>
    145     assert 2 not in rr(3)
    147     p_all()
--> 150 test_sparse_delete_purge()

File ~/work/git/TileDB-Py/tiledb/cc/tests/test_cc_deletes.py:135, in test_sparse_delete_purge()
    132 cfg = tiledb.Config({"sm.consolidation.purge_deleted_cells": "true"})
    134 with tiledb.scope_ctx(cfg):
--> 135     tiledb.consolidate(uri, config=cfg)
    136     print("---- consolidate WITH PURGE")
    137 assert 2 in rr(2)

File ~/work/git/TileDB-Py/tiledb/libtiledb.pyx:5406, in tiledb.libtiledb.consolidate()

File ~/work/git/TileDB-Py/tiledb/libtiledb.pyx:575, in tiledb.libtiledb._raise_ctx_err()

File ~/work/git/TileDB-Py/tiledb/libtiledb.pyx:560, in tiledb.libtiledb._raise_tiledb_error()

TileDBError: [TileDB::IO] Error: Cannot get file size of '/var/folders/qr/vvplm_c12mbbptpkrzc_jcfh0000gn/T/tmpb8uto0c8/__fragments/__1_3_6335269638534211b13bc2d4d365d1e7_16/__fragment_metadata.tdb'; No such file or directory

@KiterLuc KiterLuc force-pushed the lr/deletes-purge-deleted-cells/ch20031 branch from 3220067 to 45a951a Compare August 31, 2022 19:37
@KiterLuc KiterLuc merged commit 348e735 into dev Sep 3, 2022
@KiterLuc KiterLuc deleted the lr/deletes-purge-deleted-cells/ch20031 branch September 3, 2022 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants