CTable: Columnar Compressed Table for Blosc2#614
Merged
FrancescAlted merged 22 commits into Blosc:ctable4 on Apr 8, 2026
Conversation
Introduce CTable, a new columnar table class for efficient in-memory data storage, using Blosc2 as the underlying compression engine. Each column is represented as a Column object wrapping a blosc2.NDArray with typed, compressed storage. Building on blosc2's existing infrastructure, CTable supports append, iteration, and column-based queries. This is an early-stage (beta) implementation; the table is always fully loaded in memory.
New files:
- src/blosc2/ctable.py: CTable and Column class definitions
- tests/ctable/: unit tests covering construction, slicing, deletion, compaction and row logic
- bench/ctable/: benchmarks comparing CTable against pandas
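The columnar idea can be sketched with a minimal, illustrative table. This is not the CTable API from the PR: ToyTable and ToyColumn are hypothetical stand-ins, and plain typed arrays stand in for compressed blosc2.NDArray columns.

```python
from array import array

class ToyColumn:
    """One typed, densely packed column (stands in for a compressed blosc2.NDArray)."""
    def __init__(self, typecode):
        self.data = array(typecode)  # typed storage, e.g. 'q' = int64, 'd' = float64

class ToyTable:
    """Minimal columnar table: one ToyColumn per field, rows appended column-wise."""
    def __init__(self, schema):
        self.columns = {name: ToyColumn(tc) for name, tc in schema.items()}

    def append(self, row):
        # A row append is one append per column.
        for name, col in self.columns.items():
            col.data.append(row[name])

    def __len__(self):
        return len(next(iter(self.columns.values())).data)

    def rows(self):
        # Iteration re-assembles rows from the columns on the fly.
        names = list(self.columns)
        for i in range(len(self)):
            yield {n: self.columns[n].data[i] for n in names}

    def where(self, name, pred):
        """Column-based query: scan a single column, return matching row indices."""
        return [i for i, v in enumerate(self.columns[name].data) if pred(v)]

t = ToyTable({"id": "q", "price": "d"})
t.append({"id": 1, "price": 9.5})
t.append({"id": 2, "price": 3.0})
hits = t.where("price", lambda v: v > 5.0)  # touches only the "price" column
```

The point of the sketch is the access pattern: a query over one field reads only that field's column, which is what makes per-column compression pay off.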
Add CTable, a columnar in-memory table built on top of blosc2
- Add schema.py with spec primitives: int8/16/32/64, uint8/16/32/64,
float32/64, bool, complex64/128, string, bytes — sharing a _NumericSpec
mixin to avoid boilerplate
- Add schema_compiler.py: compile_schema(), CompiledSchema/Column/Config,
schema_to_dict() / schema_from_dict() for persistence groundwork
- Export all spec types and field() from blosc2 namespace
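The spec/compile/persist flow can be illustrated with a toy version. Spec, Trade, and this schema_to_dict are hypothetical stand-ins, not the schema.py / schema_compiler.py implementations from the PR.

```python
from dataclasses import dataclass, fields

# Toy stand-ins for the spec primitives exported from the blosc2 namespace.
@dataclass(frozen=True)
class Spec:
    dtype: str

int64 = Spec("int64")
float64 = Spec("float64")
string = Spec("str")

def schema_to_dict(schema_cls):
    """Flatten a dataclass schema into a plain dict, as a persistence-ready form."""
    return {f.name: f.default.dtype for f in fields(schema_cls)}

# A user-defined schema: one dataclass field per column, spec as default.
@dataclass
class Trade:
    id: Spec = int64
    price: Spec = float64
    symbol: Spec = string

d = schema_to_dict(Trade)
```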
Validation:
- Add schema_validation.py: Pydantic-backed row validation for append(),
cached per schema, re-raised as plain ValueError
- Add schema_vectorized.py: vectorized NumPy constraint checks for extend(),
using np.char.str_len() for string/bytes columns
- validate= per-call override on extend() (None inherits table default)
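The vectorized approach for extend() can be sketched as follows; check_str_maxlen is a hypothetical helper, but np.char.str_len is the NumPy call named above.

```python
import numpy as np

def check_str_maxlen(values, max_len):
    """Vectorized constraint check for a string column: one NumPy call
    over the whole batch instead of a Python-level loop per row."""
    arr = np.asarray(values, dtype=np.str_)
    lengths = np.char.str_len(arr)           # per-element string lengths
    bad = np.flatnonzero(lengths > max_len)  # indices of offending rows
    if bad.size:
        raise ValueError(
            f"{bad.size} value(s) exceed max length {max_len}: rows {bad.tolist()}"
        )

check_str_maxlen(["AAPL", "MSFT"], max_len=8)  # passes silently
```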
CTable refactor:
- Constructor accepts dataclass schemas; legacy Pydantic adapter kept
- Schema introspection: table.schema, column_schema(), schema_dict()
- _last_pos cache eliminates backward chunk scan on every append/extend
- _grow() shared resize helper; delete() writes back in-place without
creating a new array; _n_rows updated by subtraction not count_nonzero
- head() and tail() unified through _find_physical_index()
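The _last_pos optimization amounts to caching the next write position. A minimal sketch, with a hypothetical AppendCursor over a boolean valid-rows mask standing in for the real table internals:

```python
class AppendCursor:
    """Caches the next write position so append() is O(1) instead of
    rescanning the valid-rows mask backward on every call."""
    def __init__(self, valid):
        self.valid = list(valid)
        self._last_pos = self._scan()   # one backward scan at construction

    def _scan(self):
        # Find the slot just past the last valid row.
        for i in range(len(self.valid) - 1, -1, -1):
            if self.valid[i]:
                return i + 1
        return 0

    def append(self):
        pos = self._last_pos            # cached: no scan on the hot path
        if pos == len(self.valid):
            self.valid.append(True)     # grow the mask
        else:
            self.valid[pos] = True      # reuse a trailing free slot
        self._last_pos = pos + 1
        return pos

c = AppendCursor([True, True, False, False])
p1 = c.append()   # fills slot 2
p2 = c.append()   # fills slot 3
```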
Tests and docs:
- 135 tests across 10 test files, all passing
- plans/ctable-implementation-log.md and ctable-user-guide.md added
- Benchmarks: bench_validation.py and bench_append_regression.py
…QoL)
Persistency:
- FileTableStorage backend: disk layout _meta.b2frame / _valid_rows.b2nd / _cols/<name>.b2nd
- CTable(Row, urlpath=..., mode="w"/"a"/"r"), CTable.open(), CTable.save(), CTable.load()
- Read-only mode blocks all writes; save() always writes compacted rows
Column aggregates: sum, min, max, mean, std, any, all (chunk-aware via iter_chunks)
Column utilities: unique(), value_counts(), assign(), boolean mask __getitem__/__setitem__
Schema mutations: add_column (fills default for existing rows), drop_column, rename_column
- All three update schema, handle disk files, and block on views
View mutability model fix:
- Views allow value writes (assign, __setitem__) — only structural mutations are blocked
- _read_only=True reserved for mode="r" disk tables; base is not None guards structural ops
QoL: __str__ pandas-style, __repr__, cbytes/nbytes, sample(n), Column.iter_chunks(size)
Tests: 258 tests, ~5s — new test_persistency.py (33), test_schema_mutations.py (41),
expanded test_column.py; optimized helpers to use to_numpy() instead of row[i]
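The chunk-aware aggregate pattern can be sketched like this; iter_chunks here is a plain-list stand-in for the Column.iter_chunks named above, which would yield decompressed blocks.

```python
def iter_chunks(values, size):
    """Yield successive fixed-size chunks of a column
    (stand-in for Column.iter_chunks(size))."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

def chunked_mean(values, size=3):
    """Chunk-aware mean: accumulate (sum, count) per chunk so only one
    chunk needs to be held decompressed at a time."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(values, size):
        total += sum(chunk)
        count += len(chunk)
    return total / count

m = chunked_mean([1.0, 2.0, 3.0, 4.0], size=3)
```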
Arrow compatibility
Examples
Tutorial
CTable: Columnar Compressed Table for Blosc2
This PR introduces CTable, a typed columnar table backed by a blosc2.NDArray per column, with full schema validation, persistence, and interoperability.
Schema layer
- int8/16/32/64, uint8/16/32/64, float32/64, bool, complex64/128, string, bytes spec types
Core table operations
Persistence
Interoperability
- … column, no row-level overhead
- … deletions
Tests & examples
- … interop, CSV interop
- … dataset as the running example
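The CSV interop mentioned above can be illustrated with a minimal export sketch; columns_to_csv is a hypothetical helper (the PR's actual interop API is not shown here), built only on the standard-library csv module.

```python
import csv
import io

def columns_to_csv(columns):
    """Serialize a dict of equal-length columns as CSV text by
    re-assembling one row at a time from the columns."""
    names = list(columns)
    n = len(columns[names[0]])
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(names)  # header row from the column names
    for i in range(n):
        writer.writerow([columns[name][i] for name in names])
    return buf.getvalue()

text = columns_to_csv({"id": [1, 2], "price": [9.5, 3.0]})
```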