-
Notifications
You must be signed in to change notification settings - Fork 457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
persist: truncate String and Vec<u8> stats to a maximum byte length #18771
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
(I think it would probably be okay to just not write stats for columns like this. But this is of course even better!)
These stats are inclusive lower and upper bounds, so we can (carefully) truncate almost any String or Vec<u8> value to produce one that's smaller but still a relatively tight bound. There is TONS of stats pruning left to do (including given the ProtoDatum and Json stats the same treatment), but this is enough to get CI (which sometimes inserts 100MiB strings) to pass with the stats collection ff on, so it seems like a nice start.
91b1724
to
43fcec1
Compare
yeah, I think we want a general mechanism for pruning stats for a column, but it's not clear to me yet exactly what the heuristics for that should be (I'm tempted to say that's probably something to be vetted during the design doc process, which it seems about time to circle back to), whereas this felt like a straightforward win. TFTR! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic to get an upper bound from a UTF8 string actually reads much simpler than I expected, very nice!
Wouldn't mind a proptest that generates strings/Vec and verifies the lower and upper bounds work as intended over a myriad of inputs
lol that took me 7 hours to write and I almost gave up at one point XD |
done! #18805 |
These stats are inclusive lower and upper bounds, so we can (carefully) truncate almost any String or Vec value to produce one that's smaller but still a relatively tight bound.
There is TONS of stats pruning left to do (including given the ProtoDatum and Json stats the same treatment), but this is enough to get CI (which sometimes inserts 100MiB strings) to pass with the stats collection ff on, so it seems like a nice start.
Touches #12684
Motivation
Tips for reviewer
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way) and therefore is tagged with aT-proto
label.