Skip to content

Ctable column alignment and more #641

Merged
FrancescAlted merged 9 commits into main from ctable-column-alignment
May 30, 2026

@FrancescAlted
Member

Summary of main changes

1. CTable column alignment (core feature)

Fixed-size scalar columns are now written on a shared chunk/block grid during
Arrow import, ensuring every numeric and fixed-string column has identical chunk
and block sizes. This is critical for efficient multi-column scans (no
cross-column re-chunking overhead). Dictionary code arrays are also allocated at
capacity on the same grid. Small fixed-length strings (up to a configurable
threshold) are admitted to the shared grid alongside numeric columns.

2. Vectorized dictionary column import

Arrow dictionary import was rewritten to operate on full batches rather than
row-by-row, giving a significant throughput improvement for dictionary-encoded
columns.

3. --reduce-mem flag for Parquet import

The parquet-to-blosc2 CLI gained a --reduce-mem flag. Without it (the
default), the original auto batch sizing is used for maximum speed. With
--reduce-mem, the auto Parquet batch size is capped so one Arrow read batch
fits a ~48 MiB budget — trading speed for lower peak RSS (~700 MB vs ~1.6 GB
on the Chicago taxi dataset).

4. NestedColumn — public class for nested column groups

The internal _NestedColumnNamespace (returned by attribute access such as
t.trip or t.trip.begin on a nested CTable) was renamed to NestedColumn,
given a proper docstring, exported in blosc2.__all__, and documented in the
Sphinx reference alongside CTable and Column.

5. Unified .info across all Blosc2 objects

CTable.info, Column.info, and NestedColumn.info were redesigned with a
consistent block layout (identity → shape/grid → sizes → content → params) and
consistent field names (nrows/ncols replacing rows/columns; storage
meaning persistent/in-memory/computed; backend describing the physical
column kind: NDArray, dictionary, list, variable-length scalar,
LazyExpr). The cratio format was unified to .2fx (two decimals + x
suffix) across all Blosc2 info reporters: NDArray, SChunk, C2Array,
ObjectArray, ListArray, BatchArray, Column, and CTable. CTable,
Column, and NestedColumn were also added to __all__.
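For illustration, the unified cratio format amounts to two decimals followed by an x suffix. The following is a minimal sketch, assuming cratio is the uncompressed size divided by the compressed size; format_cratio is a hypothetical helper name, not a function in blosc2.

```python
def format_cratio(nbytes: int, cbytes: int) -> str:
    """Render a compression ratio with two decimals and an 'x' suffix."""
    # Guard against empty containers, where the ratio is undefined.
    if cbytes == 0:
        return "n/a"
    return f"{nbytes / cbytes:.2f}x"


print(format_cratio(1000, 250))  # 4.00x
```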

Lazy expressions over CTable columns only take the fast_eval path when
all operands share identical element-unit chunk/block grids. Previously
each column sized its grid from its own dtype, so mixing dtypes (e.g. a
float64 column among float32 ones) forced the slow slices_eval path.
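The alignment condition described above can be sketched as a simple equality check over the operands' grids. This is a simplified illustration only: Grid and pick_eval_path are hypothetical names, the chunk/block lengths are made up, and the real lazyexpr dispatch likely checks more than this.

```python
from typing import NamedTuple


class Grid(NamedTuple):
    chunks: tuple[int, ...]  # chunk shape, in elements
    blocks: tuple[int, ...]  # block shape, in elements


def pick_eval_path(grids: list[Grid]) -> str:
    """Return 'fast_eval' only when every operand has the same grid."""
    if grids and all(g == grids[0] for g in grids):
        return "fast_eval"
    return "slices_eval"


# Three columns on one shared grid versus one column sized on its own.
aligned = [Grid((262144,), (16384,))] * 3
mixed = aligned[:2] + [Grid((131072,), (8192,))]
print(pick_eval_path(aligned))  # fast_eval
print(pick_eval_path(mixed))    # slices_eval
```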

Compute one table-wide grid sized for the median itemsize of the
fixed-size scalar columns (snapped to {1,2,4,8,16}, ties round down) and
apply it to all eligible columns and the _valid_rows mask. Wide string
and ndarray-item columns, and user-pinned grids, are left untouched; a
4x cap keeps wide columns off the shared grid. Applied across every
creation path: __init__, from_arrow/from_parquet, from_pandas, from_csv,
save, load, and _empty_copy.
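A rough sketch of the grid-sizing rule described above, under two assumptions that the commit message does not spell out: "snapped" is taken to mean nearest by absolute difference, and the 4x cap is read as excluding columns whose itemsize exceeds four times the grid itemsize. The names grid_itemsize and on_shared_grid are hypothetical.

```python
from statistics import median

SNAP_SIZES = (1, 2, 4, 8, 16)
WIDE_CAP = 4  # assumed reading of the "4x cap"


def grid_itemsize(itemsizes):
    """Median itemsize snapped to the nearest allowed size; ties round down."""
    med = median(itemsizes)
    # min() returns the first minimum and SNAP_SIZES is ascending,
    # so an exact tie resolves to the smaller size.
    return min(SNAP_SIZES, key=lambda s: abs(s - med))


def on_shared_grid(itemsize, grid_size):
    """True if a column is narrow enough to join the shared grid (assumed rule)."""
    return itemsize <= WIDE_CAP * grid_size


print(grid_itemsize([4, 4, 4, 8]))  # 4
print(grid_itemsize([4, 8]))        # 4 (median 6 ties between 4 and 8)
```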

Also add CTable.chunks / CTable.blocks properties exposing the shared grid.

On the chicago-taxi benchmark this drops a mixed-dtype where() onto the
fast path: ~31% faster and ~60% less peak memory.
…umerics

Fixed-width string equality runs on the lazyexpr fast_eval path, but only
when the string column shares the numeric columns' chunk/block grid — a
misaligned string forces the slow slices_eval path. Previously fixed
strings were sized per-dtype and rarely aligned.

Size the shared grid from numeric columns only, so string itemsizes no
longer coarsen the grid that fused arithmetic depends on. Then admit
fixed string/bytes columns to that grid via an absolute byte ceiling
(_MAX_ALIGNED_STR_ITEMSIZE = 128, i.e. U32 / 32 UTF-32 chars) rather than
the numeric 4x relative cap: they fast-path equality filters and a few-MB
block is fine. Larger strings keep per-dtype sizing — dictionary-encode
those instead. All-string tables fall back to sizing the grid from the
string columns.
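The byte ceiling can be illustrated with a small check. This is a sketch, assuming the ceiling is inclusive (so U32 itself is admitted); the function names are hypothetical, and only the 128-byte constant comes from the commit message.

```python
MAX_ALIGNED_STR_ITEMSIZE = 128  # bytes; mirrors _MAX_ALIGNED_STR_ITEMSIZE above


def fixed_str_itemsize(kind: str, length: int) -> int:
    """Byte width of a NumPy fixed-width string dtype such as U32 or S16."""
    # NumPy stores 'U' strings as UTF-32 (4 bytes per char)
    # and 'S' byte strings as 1 byte per char.
    bytes_per_char = {"U": 4, "S": 1}[kind]
    return bytes_per_char * length


def admitted_to_shared_grid(kind: str, length: int) -> bool:
    """Assumed rule: admit the column when its itemsize is within the ceiling."""
    return fixed_str_itemsize(kind, length) <= MAX_ALIGNED_STR_ITEMSIZE


print(admitted_to_shared_grid("U", 32))  # True  (32 * 4 = 128 bytes)
print(admitted_to_shared_grid("U", 33))  # False (132 bytes)
```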

With this, short high-cardinality string filters (e.g. s == 'foo') stay
on the fast path when combined with numeric conditions.

extend_from_arrow translated Arrow dictionary indices to global codes with
a per-row Python loop over indices.to_pylist() — ~24M iterations on the
chicago-taxi import. Replace it with a numpy lookup table (lut[indices]),
keeping the null and ordered= validation paths intact.
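The lookup-table idea can be sketched with plain NumPy. This is not the code from dictionary_column.py: build_lut, translate_indices, and the -1 null code are assumptions made for the example. The Python loop runs over the batch's dictionary values only (usually few), while the per-row work is a single vectorized gather.

```python
import numpy as np


def build_lut(batch_values, global_codes):
    """Map each batch-local dictionary slot to a table-wide code."""
    lut = np.empty(len(batch_values), dtype=np.int32)
    for i, value in enumerate(batch_values):
        # setdefault assigns the next free code to values not seen before.
        lut[i] = global_codes.setdefault(value, len(global_codes))
    return lut


def translate_indices(indices, lut, null_mask=None, null_code=-1):
    """Translate batch-local indices to global codes with one gather."""
    indices = np.asarray(indices)
    if null_mask is None:
        return lut[indices]
    mask = np.asarray(null_mask, dtype=bool)
    # Index values under nulls may be arbitrary, so gather slot 0 there
    # and overwrite the result with the null code afterwards.
    codes = lut[np.where(mask, 0, indices)]
    codes[mask] = null_code
    return codes


global_codes = {"cash": 0, "card": 1}
lut = build_lut(["card", "cash", "mobile"], global_codes)
print(translate_indices([0, 0, 2, 1], lut).tolist())  # [1, 1, 2, 0]
```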

Cuts the parquet-to-blosc2 import of chicago-taxi (24.3M rows) from 37.6s
to 24.6s (-35%).

Arrow/Parquet import wrote each variable-sized batch directly to
arr[pos:pos+m]. Because batch sizes don't align to the column chunk grid,
most writes straddled chunk boundaries, forcing a decompress-merge-
recompress of partially filled chunks (~1.86x overhead measured).

Add _ChunkAlignedWriter, which buffers per-column appends and flushes them
in exact chunk-aligned blocks so each chunk is compressed once; only the
final tail may be partial. Wire it into _write_arrow_batches for fixed-size
columns and mark _valid_rows in a single write. List/dict/varlen paths are
unchanged.
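The buffering strategy can be sketched on a plain Python list. This toy class is not the _ChunkAlignedWriter from the PR (which works per column on compressed arrays); it only shows how arbitrary batch sizes turn into writes that start and stop on chunk boundaries, with a single partial tail.

```python
class ChunkAlignedWriter:
    """Buffer appends and write only whole chunks; the tail is flushed on close."""

    def __init__(self, target, chunk_len):
        self.target = target        # any object supporting slice assignment
        self.chunk_len = chunk_len
        self.buffer = []
        self.pos = 0                # next write offset, a multiple of chunk_len
        self.writes = []            # (start, stop) of each write, for inspection

    def append(self, values):
        self.buffer.extend(values)
        # Flush as many complete chunks as the buffer currently holds.
        while len(self.buffer) >= self.chunk_len:
            self._write(self.chunk_len)

    def close(self):
        # Only the final tail may be shorter than a chunk.
        if self.buffer:
            self._write(len(self.buffer))

    def _write(self, n):
        self.target[self.pos:self.pos + n] = self.buffer[:n]
        self.writes.append((self.pos, self.pos + n))
        del self.buffer[:n]
        self.pos += n


column = [None] * 10
writer = ChunkAlignedWriter(column, chunk_len=4)
for batch in ([0, 1, 2], [3, 4, 5], [6, 7, 8], [9]):
    writer.append(batch)
writer.close()
print(writer.writes)  # [(0, 4), (4, 8), (8, 10)]
```

Writing the four batches directly would have touched the first chunk twice and the second chunk twice; with buffering each chunk is written once.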

Cuts the chicago-taxi import (24.3M rows) from 24.6s to 19.6s; combined
with the dictionary vectorization the import is down from 37.6s to 19.6s.
Output is byte-identical, verified across chunk boundaries.

Dictionary columns created their int32 codes NDArray at chunks=(4096,) and
then resize()'d it to capacity, leaving thousands of micro-chunks (5,937
for chicago-taxi). That made codes writes slow, hurt compression, and kept
dict-column filters off the fast path.

create_dictionary_column now accepts a codes grid; the Arrow-import path
creates codes at full capacity on the shared aligned grid (codes are int32,
matching the numeric grid), dropping the create-then-resize.
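The chunk counts quoted in this commit follow from ceiling division of the row count by the chunk length. A quick check, with two loud assumptions: the row count is the rounded 24.3M figure (so the first result only approximates the reported 5,937), and 2**21 is a hypothetical chunk length picked because it is consistent with the 12 chunks reported after the change, not a value taken from the PR.

```python
def n_chunks(nrows: int, chunk_len: int) -> int:
    """Number of chunks needed to hold nrows items (ceiling division)."""
    return -(-nrows // chunk_len)


nrows = 24_300_000                # rounded; the exact row count is not given
print(n_chunks(nrows, 4096))      # 5933
print(n_chunks(nrows, 2**21))     # 12
```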

chicago-taxi import: 19.6s -> 12.5s (37.6s -> 12.5s cumulative); output
900 MB (was 909). Codes go from 5,937 chunks to 12, and mixed numeric+dict
filters now take the fast_eval path. Output verified byte-identical.

The unnamed-root list<struct> import defaulted the parquet read batch to the
source row-group size (309 rows for chicago-taxi). Nested path lists amplify
~10x downstream, so peak RSS reached 1.63 GB regardless of output format.

Cap the auto batch size so one Arrow read batch stays within a 48 MiB budget
(estimated from a small sample), and read that estimation sample with only a
few outer rows so it no longer leaves a large batch retained in the Arrow
pool. An explicit --parquet-batch-size is still honored verbatim.
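The capping logic can be sketched as follows. This is an illustration under assumptions: capped_batch_size and its parameters are hypothetical names, the per-row estimate is a simple average over the sample, and only the 48 MiB budget and the "explicit size wins" rule come from the commit message.

```python
BUDGET_BYTES = 48 * 1024 * 1024  # 48 MiB budget for one Arrow read batch


def capped_batch_size(auto_rows, sample_bytes, sample_rows, explicit_rows=None):
    """Cap the automatic batch size so one read batch fits the budget."""
    # An explicitly requested batch size is honored verbatim.
    if explicit_rows is not None:
        return explicit_rows
    # Estimate bytes per outer row from the sample, rounding up to be safe.
    bytes_per_row = max(1, -(-sample_bytes // max(1, sample_rows)))
    budget_rows = max(1, BUDGET_BYTES // bytes_per_row)
    return min(auto_rows, budget_rows)


# Made-up numbers: 4 sampled outer rows occupying 4 MiB in Arrow memory.
print(capped_batch_size(309, 4 * 1024 * 1024, 4))                     # 48
print(capped_batch_size(309, 4 * 1024 * 1024, 4, explicit_rows=309))  # 309
```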

chicago-taxi import peak: 1.63 GB -> 708 MB (-57%), ~15s. Output identical.

Contributor

Copilot AI left a comment

Pull request overview

This PR aligns CTable fixed-size columns on a single shared chunk/block grid during construction and Arrow/Parquet import so that lazy expressions can take the fast_eval path; it also vectorizes Arrow dictionary import, adds a --reduce-mem CLI flag, promotes _NestedColumnNamespace to a public NestedColumn class, and unifies info/cratio formatting across all Blosc2 array types.

Changes:

  • Introduce _compute_aligned_grid + _ChunkAlignedWriter so fixed-size scalar (and small fixed-string) columns and the _valid_rows mask share one grid; dictionary code arrays are now created at full capacity on that grid.
  • Vectorize DictionaryColumn.extend_from_arrow (LUT gather instead of per-row Python loop) and add --reduce-mem to cap auto Parquet batch sizes.
  • Rename _NestedColumnNamespace → public NestedColumn, restructure Column/CTable/NestedColumn info_items, and unify cratio to .2fx everywhere.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/blosc2/ctable.py | Aligned-grid logic, _ChunkAlignedWriter, info restructure, NestedColumn public class, chunks/blocks properties. |
| src/blosc2/ctable_storage.py | create_dictionary_column accepts codes_shape/chunks/blocks. |
| src/blosc2/dictionary_column.py | Vectorized Arrow batch import via numpy LUT gather. |
| src/blosc2/cli/parquet_to_blosc2.py | --reduce-mem flag and Arrow batch memory-budget capping. |
| src/blosc2/{schunk,ndarray,c2array,objectarray,list_array,batch_array}.py | Unified cratio to .2fx format. |
| src/blosc2/__init__.py | Export Column, CTable, NestedColumn. |
| src/blosc2/b2view/model.py | Rename rows/columns → nrows/ncols in fallback metadata. |
| doc/reference/{ctable,classes}.rst | Document NestedColumn. |
| tests/ctable/* and tests/test_b2view_model.py | Tests for aligned grid, chunk-aligned writer, dictionary vectorization, and updated info field names. |

@FrancescAlted FrancescAlted merged commit 19c69b6 into main May 30, 2026
18 checks passed
@FrancescAlted FrancescAlted deleted the ctable-column-alignment branch May 30, 2026 07:51