New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deletes: adding purge option for consolidation. #3458
Conversation
This pull request has been linked to Shortcut Story #20031: Deletes: implement purge option for consolidation.. |
2223fc6
to
41f5112
Compare
a9f9990
to
659af3a
Compare
This adds the ability to purge deleted cells when running consolidation with deletes. When this is done, the cells that were deleted are fully removed from the fragment, unless they get added again after the deletion. This will also not write the delete metadata columns for this fragment as there is no delete times for the cells. The harder problem to solve for this PR was for the no duplicates array, when a cell gets deleted, deduplication needs to delete only the cells that were added before a certain cell was deleted. For fragments with timestamps, as we still want to write every cells with their appropriate timestamps, this means that a fragment could have more than one cell with the same coordinate to process. The solution is to add all cells with the same coordinate to the sorting tile queue, and to add the timestamp dimension to the sorting (with the greater timestamp coming first). That way we can merge all cells until a deleted cell gets hit, at which point we stop and get rid of the cells that came in before the delete. This also fixed a few tests that actually didn't run consolidation, and fixes consolidating a fragment consolidated with deletes, as the delete condition index tiles were not getting loaded properly. --- TYPE: IMPROVEMENT DESC: Deletes: adding purge option for consolidation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be an internal ordering assumption violation if I do a write at the same timestamp as a delete, and then consolidate+purge. See the following sequence (from here as with the previous issue).
write(ts=1,
idx=[0,1,2,3,4], data=[0,1,2,3,4]
)
write(ts=2,
idx=[5,6,7,8,9], data=[5,6,7,8,9]
)
apply_delete(ts=3, “data == 2”)
consolidate()
vacuum()
# THIS step causes failures below; if I disable, consolidate works fine and state matches
write(ts=3, idx=[2], data=[2])
consolidate(purge=True)
I get this when running consolidate
:
TileDBError Traceback (most recent call last)
File ~/work/git/TileDB-Py/tiledb/cc/tests/test_cc_deletes.py:150, in <module>
145 assert 2 not in rr(3)
147 p_all()
--> 150 test_sparse_delete_purge()
File ~/work/git/TileDB-Py/tiledb/cc/tests/test_cc_deletes.py:135, in test_sparse_delete_purge()
132 cfg = tiledb.Config({"sm.consolidation.purge_deleted_cells": "true"})
134 with tiledb.scope_ctx(cfg):
--> 135 tiledb.consolidate(uri, config=cfg)
136 print("---- consolidate WITH PURGE")
137 assert 2 in rr(2)
File ~/work/git/TileDB-Py/tiledb/libtiledb.pyx:5406, in tiledb.libtiledb.consolidate()
File ~/work/git/TileDB-Py/tiledb/libtiledb.pyx:575, in tiledb.libtiledb._raise_ctx_err()
File ~/work/git/TileDB-Py/tiledb/libtiledb.pyx:560, in tiledb.libtiledb._raise_tiledb_error()
TileDBError: [TileDB::IO] Error: Cannot get file size of '/var/folders/qr/vvplm_c12mbbptpkrzc_jcfh0000gn/T/tmpb8uto0c8/__fragments/__1_3_6335269638534211b13bc2d4d365d1e7_16/__fragment_metadata.tdb'; No such file or directory
3220067
to
45a951a
Compare
This adds the ability to purge deleted cells when running consolidation with deletes. When this is done, the cells that were deleted are fully removed from the fragment, unless they get added again after the deletion. This will also not write the delete metadata columns for this fragment as there is no delete times for the cells.
The harder problem to solve for this PR was for the no duplicates array, when a cell gets deleted, deduplication needs to delete only the cells that were added before a certain cell was deleted. For fragments with timestamps, as we still want to write every cells with their appropriate timestamps, this means that a fragment could have more than one cell with the same coordinate to process. The solution is to add all cells with the same coordinate to the sorting tile queue, and to add the timestamp dimension to the sorting (with the greater timestamp coming first). That way we can merge all cells until a deleted cell gets hit, at which point we stop and get rid of the cells that came in before the delete.
This also fixed a few tests that actually didn't run consolidation, and fixes consolidating a fragment consolidated with deletes, as the delete condition index tiles were not getting loaded properly.
TYPE: IMPROVEMENT
DESC: Deletes: adding purge option for consolidation.