Skip to content

Conversation

@frankmcsherry
Copy link
Contributor

Evolving PR for enabling dictionary compression in arrangements.

Roughly, as each row: &[u8] is presented, we'll rip it apart into columns, and look for the option to use otherwise unused byte patterns (tags, otherwise used for row decoding) to reference popular values in each column. Columns are encoded independently, so things can be differently popular in different columns.

Fair bit of work still to do, but checking in for the moment.

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@frankmcsherry frankmcsherry changed the title WIP checkin [WIP] Dictionary compressed arrangements Apr 4, 2025
@frankmcsherry frankmcsherry force-pushed the dictionary_arrangements branch 7 times, most recently from 5768a61 to 6b555df Compare April 9, 2025 20:59
@frankmcsherry frankmcsherry force-pushed the dictionary_arrangements branch from 6b555df to 35774af Compare April 14, 2025 13:59
Copy link
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some feedback, but otherwise I like the change. We should figure out the runtime overhead, but I suspect it won't be too bad.

Comment on lines +1169 to +1313
fn report(&self) {
if self.total > 500000 {
//} && self.columns.iter().all(|c| c.decode.len() > 0) {
println!(
"REPORT: {:?} -> {:?} (x{:?})",
self.total,
self.bytes,
self.total / self.bytes
);
println!("COLUMNS: {:?}", self.columns.len());
for column in self.columns.iter() {
column.report()
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might need to change before we turn it on in prod :) I think it'd be fine to use trace! instead.

Comment on lines +1186 to +1321
ColumnsIter {
index: None,
column: 0,
data: row.data(),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could call without_codec instead. Would require unsafe and the associated safety argument, but then we'd have a comment as to why this is safe.

Comment on lines +1472 to +1620
// /// Allocates a Misra-Gries summary which intends to hold up to `k` examples.
// ///
// /// After `n` insertions it will contain only elements that were inserted at least `n/k` times.
// /// The actual memory use is proportional to `2 * k`, so that we can amortize the consolidation.
// pub fn with_capacity(k: usize) -> Self {
// Self {
// inner: Vec::with_capacity(2 * k),
// }
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Remove, or allow dead code.

Comment on lines +1338 to +1493
fn report(&self) {
let mut tags_used = 0;
tags_used += self.stats.1[0].count_ones();
tags_used += self.stats.1[1].count_ones();
tags_used += self.stats.1[2].count_ones();
tags_used += self.stats.1[3].count_ones();
let mg = self.stats.0.clone().done();
let mut bytes = 0;
for (vec, _count) in mg.iter() {
bytes += vec.len();
}
// if self.total > 10000 && !mg.is_empty() {
println!(
"\t{:?}v{:?}: {:?} -> {:?} + {:?} = (x{:?})",
tags_used,
mg.len(),
self.total,
self.bytes,
bytes,
self.total / (self.bytes + bytes),
)
// }
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove, or use tracing instead.

/// Enable per-column dictionary compression for row containers in arrangements.
pub const ENABLE_ARRANGEMENT_DICTIONARY_COMPRESSION: Config<bool> = Config::new(
"enable_arrangement_dictionary_compression",
true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
true,
false,

Please disable by default before you merge.

@antiguru antiguru force-pushed the dictionary_arrangements branch 2 times, most recently from 8df3452 to 1d08926 Compare September 18, 2025 08:38
@frankmcsherry frankmcsherry force-pushed the dictionary_arrangements branch from 1d08926 to 5ea8e0c Compare October 21, 2025 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants