Yc remove unnecessary #32

Merged
GordonYuanyc merged 15 commits into main from yc-remove-unnecessary
Apr 1, 2026

Conversation

Collaborator

@GordonYuanyc GordonYuanyc commented Mar 30, 2026

Summary

Remove unused infrastructure, clean up core sketch APIs, add OctoSketch multi-threaded framework.

Removed: orchestrator, Locher, Microscope, benchmarks

  • Deleted the entire sketch_framework/orchestrator/ module (6 files, ~1480 lines) — NodeOrchestrator, NodeCatalog, and all node types (EHNode, HashlayerNode, NitroNode, SketchNode).
  • Deleted sketches/locher.rs and sketches/microscope.rs.
  • Deleted all benches/ files (9 benchmark harnesses) and removed the criterion dev-dependency.
  • Added core_affinity and crossbeam-channel dependencies (used by Octo).

HashLayer: user-facing API

HashLayer<H> groups sketches that share a compatible hash. It hashes each input once and fans the result out to every sketch in the layer.

Accepted sketch types (only prehashed fast-path):

  • CountMin<_, FastPath, _> — Count-Min Sketch
  • Count<_, FastPath, _> — Count Sketch
  • HyperLogLog<DataFusion> / HyperLogLog<Regular> / HyperLogLogHIP

All matrix-backed sketches in one layer must have the same dimensions (rows x cols). HLL can coexist because it only consumes the lower 64 bits of the shared hash.

Example usage:

use sketchlib_rust::*;
use sketchlib_rust::sketch_framework::HashSketchEnsemble;

// Two CMS + one HLL sharing one hash per insert
let mut ensemble = HashSketchEnsemble::<DefaultXxHasher>::new(vec![
    CountMin::<Vector2D<i32>, FastPath>::with_dimensions(3, 4096).into(),
    CountMin::<Vector2D<i32>, FastPath>::with_dimensions(3, 4096).into(),
    HyperLogLog::<DataFusion>::default().into(),
]).unwrap();

// Insert — hashes once, updates all 3 sketches
ensemble.insert(&SketchInput::U64(42));

// Query frequency (CMS at index 0)
let freq = ensemble.estimate(0, &SketchInput::U64(42)).unwrap();

// Query cardinality (HLL at index 2)
let card = ensemble.cardinality(2).unwrap();

// Pre-computed hash path for hot loops
let hash = ensemble.hash_input(&SketchInput::U64(42));
ensemble.insert_with_hash(&hash);
let freq = ensemble.estimate_with_hash(0, &hash).unwrap();
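
To illustrate the hash-once, fan-out idea and the same-dimensions rule independently of the crate, here is a minimal standalone sketch. Every name in it (`ToyCountMin`, `ToyLayer`, `hashes_computed`) is hypothetical, the row-mixing function is an arbitrary choice, and it uses the standard library's `DefaultHasher` rather than the crate's hasher; it is not the sketchlib_rust implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Count-Min sketch that is updated from a precomputed hash (hypothetical).
struct ToyCountMin {
    rows: usize,
    cols: usize,
    counts: Vec<u32>,
}

impl ToyCountMin {
    fn new(rows: usize, cols: usize) -> Self {
        assert!(rows > 0 && cols > 0, "dimensions must be non-zero");
        Self { rows, cols, counts: vec![0; rows * cols] }
    }

    /// Derive a per-row column from the shared hash instead of rehashing the key.
    /// The mixing constants are arbitrary and only for illustration.
    fn column(&self, hash: u64, row: usize) -> usize {
        let salt = (row as u64 + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        ((hash ^ salt).rotate_left(row as u32 * 7 % 64) % self.cols as u64) as usize
    }

    fn insert_hash(&mut self, hash: u64) {
        for r in 0..self.rows {
            let c = self.column(hash, r);
            self.counts[r * self.cols + c] += 1;
        }
    }

    fn estimate_hash(&self, hash: u64) -> u32 {
        (0..self.rows)
            .map(|r| self.counts[r * self.cols + self.column(hash, r)])
            .min()
            .unwrap_or(0)
    }
}

/// Toy layer: one hash per insert, fanned out to every sketch (hypothetical).
struct ToyLayer {
    sketches: Vec<ToyCountMin>,
    hashes_computed: usize, // bookkeeping so the "hash once" claim can be checked
}

impl ToyLayer {
    fn new() -> Self {
        Self { sketches: Vec::new(), hashes_computed: 0 }
    }

    /// Reject a sketch whose dimensions differ from those already in the layer.
    fn push(&mut self, sketch: ToyCountMin) -> Result<(), String> {
        if let Some(first) = self.sketches.first() {
            if first.rows != sketch.rows || first.cols != sketch.cols {
                return Err(format!(
                    "expected {}x{}, got {}x{}",
                    first.rows, first.cols, sketch.rows, sketch.cols
                ));
            }
        }
        self.sketches.push(sketch);
        Ok(())
    }

    fn hash_key<K: Hash>(&mut self, key: &K) -> u64 {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        self.hashes_computed += 1;
        hasher.finish()
    }

    /// Hash the key once, then update every sketch with that hash.
    fn insert<K: Hash>(&mut self, key: &K) {
        let hash = self.hash_key(key);
        for sketch in &mut self.sketches {
            sketch.insert_hash(hash);
        }
    }

    fn estimate<K: Hash>(&mut self, index: usize, key: &K) -> Option<u32> {
        let hash = self.hash_key(key);
        self.sketches.get(index).map(|s| s.estimate_hash(hash))
    }
}
```

In this toy version, five inserts of the same key into a layer of two 3 x 64 sketches compute five hashes in total, not ten, and pushing a 4 x 64 sketch afterwards returns an error.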

Full API:

| Method | Description |
| --- | --- |
| new(Vec<HashLayerSketch>) | Construct with validation |
| push(sketch) | Add a sketch (rejects incompatible) |
| insert(&SketchInput) | Hash once, insert to all |
| insert_with_hash(&hash) | Insert pre-computed hash to all |
| insert_at(&[usize], &SketchInput) | Insert to specific indices |
| insert_at_with_hash(&[usize], &hash) | Same with pre-computed hash |
| bulk_insert / bulk_insert_with_hashes | Batch variants |
| bulk_insert_at / bulk_insert_at_with_hashes | Batch + index variants |
| estimate(index, &SketchInput) | Frequency query (CMS/Count) |
| estimate_with_hash(index, &hash) | Frequency with pre-hash |
| cardinality(index) | Distinct-count (HLL) |
| hash_input(&SketchInput) | Expose the shared hash |
| get / get_mut / len / is_empty | Accessors |

Also cleaned up sketch_catalog.rs: removed all unused catalog enums (FreqSketch, CardinalitySketch, QuantileSketch, etc.). Only the adapter traits (CountMinFastOps, CountFastOps, etc.) used by HashLayer remain.

HLL: adjustable register storage

HllBucketList types are now backed by Box<[u8; N]> with a HllRegisterStorage trait providing PRECISION, REGISTER_BITS, NUM_REGISTERS, and slice access. Three precisions are supported: P12 (4096 registers), P14 (16384), P16 (65536).
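
As a rough picture of what such a trait could look like, the sketch below pairs a `Box<[u8; N]>` with associated constants. Only the names `HllRegisterStorage`, `PRECISION`, `REGISTER_BITS`, and `NUM_REGISTERS` come from the description above; the struct `HllBuckets`, the aliases, the method names, the value `REGISTER_BITS = 8`, and the textbook `observe` update are assumptions made for illustration and may differ from the merged code.

```rust
use std::convert::TryFrom;

/// Assumed shape of the storage trait; only the constant names come from the PR text.
trait HllRegisterStorage {
    const PRECISION: u32;
    const REGISTER_BITS: u32;
    const NUM_REGISTERS: usize;

    fn registers(&self) -> &[u8];
    fn registers_mut(&mut self) -> &mut [u8];
}

/// Hypothetical register array with N one-byte registers on the heap.
struct HllBuckets<const N: usize> {
    regs: Box<[u8; N]>,
}

impl<const N: usize> HllBuckets<N> {
    fn new() -> Self {
        // Build on the heap first so a large array never sits on the stack.
        let boxed: Box<[u8]> = vec![0u8; N].into_boxed_slice();
        let regs = Box::<[u8; N]>::try_from(boxed).expect("length is exactly N");
        Self { regs }
    }
}

impl<const N: usize> HllRegisterStorage for HllBuckets<N> {
    // N is assumed to be a power of two, so log2(N) is its trailing-zero count.
    const PRECISION: u32 = N.trailing_zeros();
    // Assumption: one full byte per register.
    const REGISTER_BITS: u32 = 8;
    const NUM_REGISTERS: usize = N;

    fn registers(&self) -> &[u8] {
        &self.regs[..]
    }

    fn registers_mut(&mut self) -> &mut [u8] {
        &mut self.regs[..]
    }
}

// Hypothetical aliases for the three precisions named in the description.
type HllP12 = HllBuckets<4096>;
type HllP14 = HllBuckets<16384>;
type HllP16 = HllBuckets<65536>;

/// Textbook HyperLogLog update written against the trait: the top PRECISION
/// bits pick the register, the rest of the hash supplies the rank.
fn observe<S: HllRegisterStorage>(storage: &mut S, hash: u64) {
    let p = S::PRECISION;
    let index = (hash >> (64 - p)) as usize;
    let rank = ((hash << p).leading_zeros().min(64 - p) + 1) as u8;
    let slot = &mut storage.registers_mut()[index];
    *slot = (*slot).max(rank);
}
```

Code written against the trait, such as `observe`, then works unchanged for all three precisions, because the register count and index width are resolved at compile time.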

KLL: rearranged memory layout for speed

  • Pre-allocated flat buffer (Box<[f64]>) with level offsets instead of per-level Vecs.
  • MAX_LEVELS = 61 hard cap with compute_max_capacity sizing.
  • In-place randomly_halve_up and merge_sorted_runs with a reusable scratch buffer — avoids allocations during compaction.
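
The layout and the two in-place steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from this PR: the struct `FlatKll`, the `offsets` field, `compact_level0`, and the function signatures are hypothetical, the halving function is named `halve_up` and takes an explicit `keep_odd` flag in place of the random coin flip, and odd-length levels are not handled.

```rust
/// Keep every other item of the sorted run buf[start..start + len] and pack
/// the survivors into the upper half of that run. `keep_odd` stands in for
/// the random coin flip. Returns the index where the survivors begin.
fn halve_up(buf: &mut [f64], start: usize, len: usize, keep_odd: bool) -> usize {
    assert!(len % 2 == 0, "this sketch only handles even-length runs");
    let half = len / 2;
    let off = usize::from(keep_odd);
    // Walking from the top down never overwrites an item that is still needed.
    for i in (0..half).rev() {
        buf[start + half + i] = buf[start + 2 * i + off];
    }
    start + half
}

/// Merge the adjacent sorted runs buf[left..mid] and buf[mid..right] in place.
/// Only the left run is copied out, into a scratch buffer that is reused.
fn merge_sorted_runs(buf: &mut [f64], left: usize, mid: usize, right: usize, scratch: &mut Vec<f64>) {
    scratch.clear();
    scratch.extend_from_slice(&buf[left..mid]);
    let (mut i, mut j, mut w) = (0, mid, left);
    while i < scratch.len() {
        if j < right && buf[j] < scratch[i] {
            buf[w] = buf[j];
            j += 1;
        } else {
            buf[w] = scratch[i];
            i += 1;
        }
        w += 1;
    }
    // Once the scratch copy is drained, the rest of the right run is already in place.
}

/// Hypothetical flat layout: level h occupies items[offsets[h]..offsets[h + 1]].
struct FlatKll {
    items: Box<[f64]>,
    offsets: Vec<usize>,
    scratch: Vec<f64>,
}

impl FlatKll {
    fn level(&self, h: usize) -> &[f64] {
        &self.items[self.offsets[h]..self.offsets[h + 1]]
    }

    /// Compact level 0 into level 1 without allocating a new level buffer.
    fn compact_level0(&mut self, keep_odd: bool) {
        let (start, end, next_end) = (self.offsets[0], self.offsets[1], self.offsets[2]);
        self.items[start..end].sort_by(|a, b| a.total_cmp(b));
        let survivors = halve_up(&mut self.items, start, end - start, keep_odd);
        merge_sorted_runs(&mut self.items, survivors, end, next_end, &mut self.scratch);
        // Level 0 is now empty; the slots below `survivors` are free again.
        self.offsets[0] = survivors;
        self.offsets[1] = survivors;
    }
}
```

Because the survivors land directly below level 1, promoting them only requires a merge and a boundary update. The scratch vector grows to the largest run it has seen and is then reused, so steady-state compaction does not allocate.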

OctoSketch: multi-threaded sketch framework

New sketch_framework/octo.rs module implementing the OctoSketch pattern — pin workers to cores, each with a local small sketch, and promote deltas to a shared parent sketch when local counters overflow.

  • OctoWorker / OctoAggregator traits for custom sketch types.
  • Built-in implementations: CmOctoWorker, CountOctoWorker, HllOctoWorker.
  • OctoRuntime manages worker threads with core_affinity pinning and crossbeam-channel communication.
  • OctoReadHandle for lock-free reads of the parent sketch during ingestion.
  • Delta types in sketches/octo_delta.rs: CmDelta, CountDelta, HllDelta with configurable promotion thresholds.
  • CMS and Count workers are generic over FastPath / RegularPath.
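
The promotion pattern described above can be sketched with the standard library alone. This is an assumption-laden illustration, not the `octo.rs` code: it uses `std::sync::mpsc` where the PR uses crossbeam-channel, does no core pinning, applies deltas on the calling thread, and the names `Delta`, `LocalWorker`, and `run_octo_demo` are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// A promoted batch of increments for one counter (hypothetical type).
struct Delta {
    index: usize,
    count: u64,
}

/// Per-thread worker holding small local counters (hypothetical type).
struct LocalWorker {
    counters: Vec<u8>,
    threshold: u8,
    tx: mpsc::Sender<Delta>,
}

impl LocalWorker {
    fn insert(&mut self, index: usize) {
        self.counters[index] += 1;
        // Promote before the small counter can overflow.
        if self.counters[index] >= self.threshold {
            self.promote(index);
        }
    }

    /// Send the local count to the aggregator and reset the local counter.
    fn promote(&mut self, index: usize) {
        let count = u64::from(std::mem::take(&mut self.counters[index]));
        if count > 0 {
            self.tx.send(Delta { index, count }).expect("aggregator is alive");
        }
    }

    /// Push out whatever is still held locally at the end of the stream.
    fn flush(&mut self) {
        for index in 0..self.counters.len() {
            self.promote(index);
        }
    }
}

/// Run `workers` threads that each perform `inserts_per_worker` updates, and
/// return the parent counters after every delta has been applied.
fn run_octo_demo(workers: usize, inserts_per_worker: usize, width: usize, threshold: u8) -> Vec<u64> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let tx = tx.clone();
            thread::spawn(move || {
                let mut worker = LocalWorker { counters: vec![0; width], threshold, tx };
                for i in 0..inserts_per_worker {
                    worker.insert((w + i) % width);
                }
                worker.flush();
            })
        })
        .collect();
    drop(tx); // the receive loop ends once every worker has dropped its sender

    // Aggregator: apply deltas to the shared parent counters.
    let mut parent = vec![0u64; width];
    for delta in rx {
        parent[delta.index] += delta.count;
    }
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    parent
}
```

The threshold trades channel traffic against how far the parent lags behind the workers: a higher threshold sends fewer deltas, but more increments sit in local counters between promotions.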

@GordonYuanyc GordonYuanyc marked this pull request as ready for review March 31, 2026 20:37
use crate::common::input::sketch_input_to_f64;
use crate::{SketchInput, Vector1D};

const MAX_LEVELS: usize = 61;
Contributor

Is the intent to use const variables, or avoid them?

Collaborator Author

Yes, using a const here is intentional, for performance.
In theory, 61 levels should be large enough (2^60 insertions or more).

Contributor

What will be the user API for configuring parameters such as K in KLL?

Collaborator Author

This one:
pub fn init(k: usize, m: usize) -> Self
Here k is the bottom compactor size and m is the minimum compactor size.

@GordonYuanyc GordonYuanyc merged commit 8a99d57 into main Apr 1, 2026
1 check passed