[WIP] Dictionary compressed arrangements #32095
antiguru left a comment
Left some feedback, but otherwise I like the change. We should figure out the runtime overhead, but I suspect it won't be too bad.
    fn report(&self) {
        if self.total > 500000 {
            //} && self.columns.iter().all(|c| c.decode.len() > 0) {
            println!(
                "REPORT: {:?} -> {:?} (x{:?})",
                self.total,
                self.bytes,
                self.total / self.bytes
            );
            println!("COLUMNS: {:?}", self.columns.len());
            for column in self.columns.iter() {
                column.report()
            }
        }
    }
This might need to change before we turn it on in prod :) I think it'd be fine to use trace! instead.
    ColumnsIter {
        index: None,
        column: 0,
        data: row.data(),
    }
This could call without_codec instead. Would require unsafe and the associated safety argument, but then we'd have a comment as to why this is safe.
    // /// Allocates a Misra-Gries summary which intends to hold up to `k` examples.
    // ///
    // /// After `n` insertions it will contain only elements that were inserted at least `n/k` times.
    // /// The actual memory use is proportional to `2 * k`, so that we can amortize the consolidation.
    // pub fn with_capacity(k: usize) -> Self {
    //     Self {
    //         inner: Vec::with_capacity(2 * k),
    //     }
    // }
Nit: Remove, or allow dead code.
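For readers unfamiliar with the summary that the commented-out constructor refers to, here is a minimal, self-contained sketch of a batched Misra-Gries summary. It is not the PR's implementation: the `k` field, `insert`, and `tidy` are hypothetical names chosen for illustration, and only `with_capacity` and `done` mirror names visible in the diff. It buffers up to `2 * k` entries and consolidates when the buffer fills, which is one way to amortize the consolidation cost the doc comment mentions.

```rust
/// Hypothetical sketch of a batched Misra-Gries summary (not the PR's code).
pub struct MisraGries<T> {
    /// Target number of retained candidates (assumed field).
    k: usize,
    /// Unsorted buffer of (item, count) pairs, at most `2 * k` long.
    inner: Vec<(T, usize)>,
}

impl<T: Ord> MisraGries<T> {
    /// Reserves `2 * k` slots so consolidation runs only once per `k` or so inserts.
    pub fn with_capacity(k: usize) -> Self {
        Self { k, inner: Vec::with_capacity(2 * k) }
    }

    /// Records one occurrence of `item`, consolidating when the buffer is full.
    pub fn insert(&mut self, item: T) {
        self.inner.push((item, 1));
        if self.inner.len() >= 2 * self.k {
            self.tidy();
        }
    }

    /// Merges duplicate items, then keeps at most `k` candidates.
    fn tidy(&mut self) {
        self.inner.sort_by(|a, b| a.0.cmp(&b.0));
        let mut merged: Vec<(T, usize)> = Vec::with_capacity(2 * self.k);
        for (item, count) in self.inner.drain(..) {
            if let Some(last) = merged.last_mut() {
                if last.0 == item {
                    last.1 += count;
                    continue;
                }
            }
            merged.push((item, count));
        }
        if merged.len() > self.k {
            // Subtract the (k+1)-th largest count from every entry and drop
            // entries that reach zero; at most `k` entries can survive.
            let mut counts: Vec<usize> = merged.iter().map(|e| e.1).collect();
            counts.sort_unstable_by(|a, b| b.cmp(a));
            let cut = counts[self.k];
            merged.retain_mut(|entry| {
                if entry.1 > cut {
                    entry.1 -= cut;
                    true
                } else {
                    false
                }
            });
        }
        self.inner = merged;
    }

    /// Finishes the summary, returning the surviving (item, count) pairs.
    pub fn done(mut self) -> Vec<(T, usize)> {
        self.tidy();
        self.inner
    }
}
```

In this sketch the reported counts are lower bounds on the true counts, and any item inserted more than `n / (k + 1)` times out of `n` insertions is still present after `done()`. Feeding it the column values seen so far and calling `done()` would yield candidates for a per-column dictionary.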
    fn report(&self) {
        let mut tags_used = 0;
        tags_used += self.stats.1[0].count_ones();
        tags_used += self.stats.1[1].count_ones();
        tags_used += self.stats.1[2].count_ones();
        tags_used += self.stats.1[3].count_ones();
        let mg = self.stats.0.clone().done();
        let mut bytes = 0;
        for (vec, _count) in mg.iter() {
            bytes += vec.len();
        }
        // if self.total > 10000 && !mg.is_empty() {
        println!(
            "\t{:?}v{:?}: {:?} -> {:?} + {:?} = (x{:?})",
            tags_used,
            mg.len(),
            self.total,
            self.bytes,
            bytes,
            self.total / (self.bytes + bytes),
        )
        // }
    }
Remove, or use tracing instead.
    /// Enable per-column dictionary compression for row containers in arrangements.
    pub const ENABLE_ARRANGEMENT_DICTIONARY_COMPRESSION: Config<bool> = Config::new(
        "enable_arrangement_dictionary_compression",
        true,
Suggested change:
    -    true,
    +    false,
Please disable by default before you merge.
Evolving PR for enabling dictionary compression in arrangements.

Roughly, as each `row: &[u8]` is presented, we'll rip it apart into columns and look for the option to use otherwise unused byte patterns (tags, otherwise used for row decoding) to reference popular values in each column. Columns are encoded independently, so things can be differently popular in different columns.

Fair bit of work still to do, but checking in for the moment.
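
To make the idea above concrete, here is a minimal sketch of a single-column codec, under loudly stated assumptions: the real row encoding and its tag space are not shown on this page, so the sketch pretends that every real column value starts with a byte below `0x80`, leaving `0x80..=0xFF` free as dictionary references. `ColumnCodec`, `FIRST_FREE_TAG`, and all method names are hypothetical and not taken from the PR.

```rust
use std::collections::HashMap;

/// Assumption for this sketch only: real values never start with a byte at or
/// above this, so those byte patterns are free to act as dictionary references.
const FIRST_FREE_TAG: u8 = 0x80;

/// Hypothetical per-column dictionary; one instance would exist per column.
#[derive(Default)]
pub struct ColumnCodec {
    /// Popular value -> the single byte that stands in for it.
    tags: HashMap<Vec<u8>, u8>,
    /// Reverse mapping, indexed by `tag - FIRST_FREE_TAG`.
    values: Vec<Vec<u8>>,
}

impl ColumnCodec {
    /// Builds a codec from popular values, ignoring duplicates and anything
    /// beyond the number of free byte patterns.
    pub fn new(popular: impl IntoIterator<Item = Vec<u8>>) -> Self {
        let capacity = usize::from(u8::MAX - FIRST_FREE_TAG) + 1;
        let mut codec = Self::default();
        for value in popular {
            let slot = codec.values.len();
            if slot >= capacity {
                break;
            }
            if !codec.tags.contains_key(&value) {
                codec.tags.insert(value.clone(), FIRST_FREE_TAG + slot as u8);
                codec.values.push(value);
            }
        }
        codec
    }

    /// Appends either a one-byte reference or the value's bytes verbatim.
    pub fn encode(&self, value: &[u8], out: &mut Vec<u8>) {
        match self.tags.get(value) {
            Some(tag) => out.push(*tag),
            None => out.extend_from_slice(value),
        }
    }

    /// Expands a one-byte reference, or returns verbatim bytes unchanged.
    /// Panics if the reference is not in this codec's dictionary.
    pub fn decode<'a>(&'a self, bytes: &'a [u8]) -> &'a [u8] {
        match bytes {
            [tag] if *tag >= FIRST_FREE_TAG => {
                &self.values[usize::from(*tag - FIRST_FREE_TAG)]
            }
            _ => bytes,
        }
    }
}
```

Because each column would own its own `ColumnCodec`, the same byte can refer to different popular values in different columns, which matches the independence described above. The sketch sidesteps how encoded columns are framed within a row; that detail depends on the actual row format.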