persist: truncate String and Vec<u8> stats to a maximum byte length #18771

Merged
merged 1 commit into from
Apr 14, 2023

danhhz
Contributor

@danhhz danhhz commented Apr 13, 2023

These stats are inclusive lower and upper bounds, so we can (carefully) truncate almost any String or Vec<u8> value to produce one that's smaller but still a relatively tight bound.
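As a rough illustration of why truncation preserves the bounds for byte vectors, here is a minimal sketch. It is not the code merged in this PR; the function names `truncate_bytes_lower` and `truncate_bytes_upper` and the `max_len` parameter are hypothetical. The idea: any prefix sorts at or below the original, so it is a valid lower bound, while an upper bound needs the prefix's last byte bumped by one (dropping trailing 0xFF bytes that cannot be bumped).

```rust
/// Hypothetical helper (not from the PR): a lower bound of at most `max_len` bytes.
/// A prefix always sorts less than or equal to the full value, so truncation is enough.
fn truncate_bytes_lower(data: &[u8], max_len: usize) -> Vec<u8> {
    data[..data.len().min(max_len)].to_vec()
}

/// Hypothetical helper (not from the PR): an upper bound of at most `max_len` bytes,
/// or `None` when no such bound exists (the kept prefix is empty or all 0xFF).
fn truncate_bytes_upper(data: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if data.len() <= max_len {
        // Already short enough: the value is its own tight upper bound.
        return Some(data.to_vec());
    }
    let mut out = data[..max_len].to_vec();
    // Walk back from the end: 0xFF bytes cannot be incremented, so drop them;
    // the first byte that can be incremented yields a strictly larger value.
    while let Some(last) = out.pop() {
        if last < u8::MAX {
            out.push(last + 1);
            return Some(out);
        }
    }
    None
}
```

With these definitions, truncating `[1, 2, 3, 4]` to two bytes gives the lower bound `[1, 2]` and the upper bound `[1, 3]`, while `[0xFF, 0xFF, 0]` has no two-byte upper bound, which is presumably why the description says "almost any" value.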

There is TONS of stats pruning left to do (including giving the ProtoDatum and Json stats the same treatment), but this is enough to get CI (which sometimes inserts 100 MiB strings) to pass with the stats collection feature flag on, so it seems like a nice start.

Touches #12684

Motivation

  • This PR adds a known-desirable feature.

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@danhhz danhhz requested a review from bkirwi April 13, 2023 22:51
@danhhz danhhz requested a review from a team as a code owner April 13, 2023 22:51
Contributor

@bkirwi bkirwi left a comment

LGTM!

(I think it would probably be okay to just not write stats for columns like this. But this is of course even better!)

@danhhz
Contributor Author

danhhz commented Apr 14, 2023

yeah, I think we want a general mechanism for pruning stats for a column, but it's not clear to me yet exactly what the heuristics for that should be (I'm tempted to say that's probably something to be vetted during the design doc process, which it seems about time to circle back to), whereas this felt like a straightforward win. TFTR!

@danhhz danhhz enabled auto-merge April 14, 2023 17:39
@danhhz danhhz merged commit 1ed65df into MaterializeInc:main Apr 14, 2023
@danhhz danhhz deleted the persist_mfp_stats_pruning branch April 14, 2023 20:17
Contributor

@pH14 pH14 left a comment

The logic to get an upper bound from a UTF8 string actually reads much simpler than I expected, very nice!
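For readers curious what makes the UTF-8 case fiddly, a minimal sketch follows. It is an illustration under stated assumptions, not the logic from this PR: the names `truncate_str_lower`, `truncate_str_upper`, and `next_char` are hypothetical. It relies on two facts: Rust compares strings bytewise, which for UTF-8 matches code point order, and incrementing a `char` has to skip the surrogate gap and may produce a longer encoding.

```rust
/// Hypothetical helper: the longest prefix of `s` that fits in `max_len` bytes
/// without splitting a multi-byte character. A prefix is a valid lower bound.
fn truncate_str_lower(s: &str, max_len: usize) -> &str {
    let mut end = s.len().min(max_len);
    // Back up until we land on a char boundary (index 0 always is one).
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Hypothetical helper: the next Unicode scalar value after `c`, if any.
fn next_char(c: char) -> Option<char> {
    let next = match c as u32 {
        0xD7FF => 0xE000, // step over the surrogate range U+D800..=U+DFFF
        n => n + 1,
    };
    char::from_u32(next) // `None` once we go past char::MAX
}

/// Hypothetical helper: an upper bound of at most `max_len` bytes, or `None`
/// when no such bound exists.
fn truncate_str_upper(s: &str, max_len: usize) -> Option<String> {
    if s.len() <= max_len {
        return Some(s.to_owned());
    }
    let mut out = truncate_str_lower(s, max_len).to_owned();
    while let Some(last) = out.pop() {
        if let Some(next) = next_char(last) {
            // The incremented char may need more bytes than the one it replaces
            // (e.g. U+007F -> U+0080); if it no longer fits, keep walking back.
            if out.len() + next.len_utf8() <= max_len {
                out.push(next);
                return Some(out);
            }
        }
    }
    None
}
```

For example, `truncate_str_upper("abcdef", 3)` yields `"abd"`, and `truncate_str_upper("a\u{7f}z", 2)` yields `"b"` because bumping U+007F would need two bytes and overflow the budget.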

Wouldn't mind a proptest that generates strings/Vec<u8> and verifies the lower and upper bounds work as intended over a myriad of inputs.
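The kind of property being asked for could look roughly like the sketch below. To stay dependency-free it uses a small hand-rolled xorshift generator rather than the proptest crate, and it covers only byte vectors; `lower`, `upper`, `XorShift`, and `check_bounds` are hypothetical names, and the truncation functions are compact stand-ins rather than the PR's implementation.

```rust
// Compact stand-ins for the truncation under test (hypothetical, not the PR's code).
fn lower(data: &[u8], max_len: usize) -> Vec<u8> {
    data[..data.len().min(max_len)].to_vec()
}

fn upper(data: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if data.len() <= max_len {
        return Some(data.to_vec());
    }
    let mut out = data[..max_len].to_vec();
    while let Some(last) = out.pop() {
        if last < u8::MAX {
            out.push(last + 1);
            return Some(out);
        }
    }
    None
}

/// Tiny deterministic xorshift64 generator so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Checks the bound invariants on `cases` pseudo-random inputs and returns
/// how many cases were checked. Panics on the first violated invariant.
fn check_bounds(cases: usize) -> usize {
    let mut rng = XorShift(0x9E37_79B9_7F4A_7C15);
    for _ in 0..cases {
        let len = (rng.next_u64() % 8) as usize;
        // Bias toward 0xFF so the "cannot increment" path gets exercised.
        let data: Vec<u8> = (0..len)
            .map(|_| if rng.next_u64() % 3 == 0 { 0xFF } else { rng.next_u64() as u8 })
            .collect();
        let max_len = (rng.next_u64() % 8) as usize;

        // Lower bound: fits the budget and sorts at or below the input.
        let lo = lower(&data, max_len);
        assert!(lo.len() <= max_len && lo.as_slice() <= data.as_slice());

        match upper(&data, max_len) {
            // Upper bound: fits the budget and sorts at or above the input.
            Some(hi) => assert!(hi.len() <= max_len && hi.as_slice() >= data.as_slice()),
            // No bound is only acceptable when the kept prefix is all 0xFF.
            None => assert!(data.len() > max_len && data[..max_len].iter().all(|b| *b == 0xFF)),
        }
    }
    cases
}
```

Calling `check_bounds(10_000)` runs the invariants over ten thousand generated inputs. A real proptest would add shrinking of failing cases, which this sketch does not attempt.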

@danhhz
Contributor Author

danhhz commented Apr 17, 2023

> The logic to get an upper bound from a UTF8 string actually reads much simpler than I expected, very nice!

lol that took me 7 hours to write and I almost gave up at one point XD

@danhhz
Contributor Author

danhhz commented Apr 17, 2023

> Wouldn't mind a proptest that generates strings/Vec<u8> and verifies the lower and upper bounds work as intended over a myriad of inputs

done! #18805
