
CTable: Columnar Compressed Table for Blosc2 #614

Merged
FrancescAlted merged 22 commits into Blosc:ctable4 from Jacc4224:my_ctable3
Apr 8, 2026

Conversation


@Jacc4224 Jacc4224 commented Apr 7, 2026

CTable: Columnar Compressed Table for Blosc2

This PR introduces CTable, a typed columnar table backed by blosc2.NDArray per column, with full
schema validation, persistence, and interoperability.

Schema layer

  • @dataclass + blosc2.field() spec primitives: int8/16/32/64, uint8/16/32/64, float32/64, bool,
    complex64/128, string, bytes
  • Pydantic-backed row validation (append) and vectorized NumPy bulk validation (extend)
  • Schema serialization/deserialization for persistence
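
The vectorized bulk-validation path can be sketched in plain NumPy: whole-column checks instead of per-row Pydantic calls. This is an illustrative standalone sketch, not the PR's actual code — `validate_batch`, its column names, and its limits are all hypothetical:

```python
import numpy as np

def validate_batch(names, ages, max_name_len=16, age_min=0, age_max=150):
    # Vectorized checks over whole columns, in the spirit of the extend() path.
    names = np.asarray(names, dtype=np.str_)
    ages = np.asarray(ages)
    bad_len = np.char.str_len(names) > max_name_len   # string-length constraint
    bad_age = (ages < age_min) | (ages > age_max)     # numeric range constraint
    bad = bad_len | bad_age
    if bad.any():
        # Re-raise as a plain ValueError, listing every offending row at once
        raise ValueError(f"invalid rows at indices {np.flatnonzero(bad).tolist()}")
    return True

validate_batch(["ada", "grace"], [36, 45])   # passes
```

One `np.char.str_len` call replaces a Python-level loop over every string cell, which is where the bulk path wins over row-at-a-time validation.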

Core table operations

  • append() / extend() with schema validation
  • Tombstone deletion model: delete() sets a mask, compact() closes gaps
  • sort_by(cols, ascending, inplace) — single and multi-column stable sort
  • where(expr) row-filter views, select(cols) column-projection views (no data copy)
  • Schema mutations: add_column, drop_column, rename_column
  • Column aggregates: sum, min, max, mean, std, any, all, unique, value_counts
  • Table-level: describe(), cov(), head(), tail(), sample(), info()
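
The tombstone model and the multi-column stable sort can both be sketched in plain NumPy (`TinyColumn` is a hypothetical stand-in, not the PR's Column class):

```python
import numpy as np

class TinyColumn:
    """Minimal single-column store illustrating the tombstone model:
    delete() only flips a validity mask; compact() rewrites without gaps."""
    def __init__(self, data):
        self.data = np.asarray(data)
        self.valid = np.ones(len(self.data), dtype=bool)

    def delete(self, idx):
        self.valid[idx] = False            # tombstone: no data movement

    def compact(self):
        self.data = self.data[self.valid]  # close the gaps in one pass
        self.valid = np.ones(len(self.data), dtype=bool)

col = TinyColumn([10, 20, 30, 40])
col.delete(1)
col.compact()          # col.data is now [10, 30, 40]

# Multi-column stable sort: np.lexsort sorts by the *last* key first,
# so keys are passed in reverse priority order.
city = np.array(["b", "a", "b", "a"])
temp = np.array([3, 1, 2, 4])
order = np.lexsort((temp, city))   # sort by city, then temp within city
```

Keeping deletion as a mask flip makes delete() cheap and leaves physical row order intact until compact() is explicitly requested.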

Persistence

  • File-backed NDArrays: one .b2nd per column + _valid_rows.b2nd + _meta.b2frame
  • CTable(Row, urlpath=...), CTable.open(), t.save(), CTable.load()
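
The per-column file layout can be illustrated with a small stdlib helper; `expected_layout` is hypothetical, and only the file names themselves come from the PR description:

```python
from pathlib import Path

def expected_layout(root, columns):
    # One .b2nd per column, plus the validity mask and metadata frame
    # named in the PR description.
    root = Path(root)
    return {root / f"{name}.b2nd" for name in columns} | {
        root / "_valid_rows.b2nd",
        root / "_meta.b2frame",
    }

paths = expected_layout("climate.ctable", ["city", "temp"])
```

Storing each column in its own file means reads of a projected view (`select(cols)`) never have to touch the other columns' data on disk.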

Interoperability

  • Arrow: to_arrow() → pyarrow.Table, from_arrow(arrow_table) → CTable — direct bulk-write per
    column, no row-level overhead
  • CSV: to_csv(path), from_csv(path, row_cls) → CTable — stdlib only, chunk-aware, respects
    deletions

Tests & examples

  • 363 tests across construct, validation, delete, compact, sort, aggregates, persistence, Arrow
    interop, CSV interop
  • 9 standalone example scripts (examples/ctable/) covering every feature
  • Benchmark: pandas ↔ CTable roundtrip pipeline (bench/ctable/bench_pandas_roundtrip.py)
  • Jupyter notebook tutorial (examples/ctable/ctable_tutorial.ipynb) with a 10-city climate
    dataset as the running example

Jacc4224 and others added 22 commits March 26, 2026 11:05
Introduce CTable, a new columnar table class for efficient in-memory
data storage using Blosc2 as the underlying compression engine.

Each column is represented as a Column object wrapping a blosc2.NDArray
with typed, compressed storage. Building on top of blosc2's existing
infrastructure, CTable supports append, iteration and
column-based queries.

This is an early-stage (beta) implementation; the table is always fully
loaded in memory.

New files:
- src/blosc2/ctable.py: CTable and Column class definitions
- tests/ctable/: unit tests covering construction, slicing, deletion,
  compaction and row logic
- bench/ctable/: benchmarks comparing CTable against pandas
Add CTable, a columnar in-memory table built on top of blosc2
  - Add schema.py with spec primitives: int8/16/32/64, uint8/16/32/64,
    float32/64, bool, complex64/128, string, bytes — sharing a _NumericSpec
    mixin to avoid boilerplate
  - Add schema_compiler.py: compile_schema(), CompiledSchema/Column/Config,
    schema_to_dict() / schema_from_dict() for persistence groundwork
  - Export all spec types and field() from blosc2 namespace

  Validation:
  - Add schema_validation.py: Pydantic-backed row validation for append(),
    cached per schema, re-raised as plain ValueError
  - Add schema_vectorized.py: vectorized NumPy constraint checks for extend(),
    using np.char.str_len() for string/bytes columns
  - validate= per-call override on extend() (None inherits table default)

  CTable refactor:
  - Constructor accepts dataclass schemas; legacy Pydantic adapter kept
  - Schema introspection: table.schema, column_schema(), schema_dict()
  - _last_pos cache eliminates backward chunk scan on every append/extend
  - _grow() shared resize helper; delete() writes back in-place without
    creating a new array; _n_rows updated by subtraction not count_nonzero
  - head() and tail() unified through _find_physical_index()

  Tests and docs:
  - 135 tests across 10 test files, all passing
  - plans/ctable-implementation-log.md and ctable-user-guide.md added
  - Benchmarks: bench_validation.py and bench_append_regression.py
…QoL)

  Persistency:
    - FileTableStorage backend: disk layout _meta.b2frame / _valid_rows.b2nd / _cols/<name>.b2nd
    - CTable(Row, urlpath=..., mode="w"/"a"/"r"), CTable.open(), CTable.save(), CTable.load()
    - Read-only mode blocks all writes; save() always writes compacted rows

  Column aggregates: sum, min, max, mean, std, any, all (chunk-aware via iter_chunks)
  Column utilities: unique(), value_counts(), assign(), boolean mask __getitem__/__setitem__

  Schema mutations: add_column (fills default for existing rows), drop_column, rename_column
    - All three update schema, handle disk files, and block on views

  View mutability model fix:
    - Views allow value writes (assign, __setitem__) — only structural mutations are blocked
    - _read_only=True reserved for mode="r" disk tables; base is not None guards structural ops

  QoL: __str__ pandas-style, __repr__, cbytes/nbytes, sample(n), Column.iter_chunks(size)

  Tests: 258 tests, ~5s — new test_persistency.py (33), test_schema_mutations.py (41),
    expanded test_column.py; optimized helpers to use to_numpy() instead of row[i]
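
The chunk-aware aggregate pattern described above (iter_chunks feeding a running accumulator, so no column is ever fully materialized) can be sketched as follows; `chunked_mean` is an illustrative stand-in, not the PR's code:

```python
import numpy as np

def chunked_mean(chunks):
    # Accumulate partial sums and counts per chunk; only one chunk's
    # worth of data is decompressed/live at a time.
    total, count = 0.0, 0
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        total += chunk.sum()
        count += chunk.size
    return total / count

chunked_mean([[1, 2], [3, 4, 5]])   # -> 3.0
```

The same accumulator shape works for sum, min, max, any, and all; std needs a second accumulated moment but follows the identical chunk loop.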
Arrow compatibility
Examples
Tutorial
@FrancescAlted FrancescAlted changed the title from "Ctable 4 request" to "CTable: Columnar Compressed Table for Blosc2" on Apr 8, 2026
@FrancescAlted FrancescAlted merged commit a3852b6 into Blosc:ctable4 Apr 8, 2026
6 of 12 checks passed
@FrancescAlted FrancescAlted mentioned this pull request Apr 8, 2026