Ctable column alignment and more by FrancescAlted · Pull Request #641 · Blosc/python-blosc2

FrancescAlted · 2026-05-30T07:04:48Z

Summary of main changes

1. CTable column alignment (core feature)

Fixed-size scalar columns are now written on a shared chunk/block grid during
Arrow import, ensuring every numeric and fixed-string column has identical chunk
and block sizes. This is critical for efficient multi-column scans (no
cross-column re-chunking overhead). Dictionary code arrays are also allocated at
capacity on the same grid. Small fixed-length strings (up to a configurable
threshold) are admitted to the shared grid alongside numeric columns.

2. Vectorized dictionary column import

Arrow dictionary import was rewritten to operate on full batches rather than
row-by-row, giving a significant throughput improvement for dictionary-encoded
columns.

3. `--reduce-mem` flag for Parquet import

The `parquet-to-blosc2` CLI gained a `--reduce-mem` flag. Without it (the
default), the original auto batch sizing is used for maximum speed. With
`--reduce-mem`, the auto Parquet batch size is capped so that one Arrow read
batch fits a ~48 MiB budget, trading speed for lower peak RSS (~700 MB vs
~1.6 GB on the Chicago taxi dataset).

4. `NestedColumn` — public class for nested column groups

The internal `_NestedColumnNamespace` (returned by attribute access such as
`t.trip` or `t.trip.begin` on a nested CTable) was renamed to `NestedColumn`,
given a proper docstring, exported in `blosc2.__all__`, and documented in the
Sphinx reference alongside `CTable` and `Column`.

5. Unified `.info` across all Blosc2 objects

`CTable.info`, `Column.info`, and `NestedColumn.info` were redesigned with a
consistent block layout (identity → shape/grid → sizes → content → params) and
consistent field names (`nrows`/`ncols` replacing `rows`/`columns`; `storage`
meaning persistent/in-memory/computed; `backend` describing the physical
column kind: NDArray, dictionary, list, variable-length scalar,
LazyExpr). The `cratio` format was unified to `.2fx` (two decimals plus an `x`
suffix) across all Blosc2 info reporters: `NDArray`, `SChunk`, `C2Array`,
`ObjectArray`, `ListArray`, `BatchArray`, `Column`, and `CTable`. `CTable`,
`Column`, and `NestedColumn` were also added to `__all__`.
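
For reference, the `.2fx` convention described above amounts to a two-decimal fixed-point number followed by a literal `x`. A minimal sketch in plain Python; the helper name `format_cratio` is hypothetical and is not taken from the PR:

```python
def format_cratio(cratio: float) -> str:
    """Format a compression ratio with two decimals and an 'x' suffix."""
    # ':.2f' gives two decimals; the trailing 'x' is a literal suffix.
    return f"{cratio:.2f}x"


print(format_cratio(3.14159))
print(format_cratio(10))
```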

Lazy expressions over CTable columns only take the fast_eval path when all operands share identical element-unit chunk/block grids. Previously each column sized its grid from its own dtype, so mixing dtypes (e.g. a float64 column among float32 ones) forced the slow slices_eval path. Compute one table-wide grid sized for the median itemsize of the fixed-size scalar columns (snapped to {1,2,4,8,16}, ties round down) and apply it to all eligible columns and the _valid_rows mask. Wide string and ndarray-item columns, and user-pinned grids, are left untouched; a 4x cap keeps wide columns off the shared grid. Applied across every creation path: __init__, from_arrow/from_parquet, from_pandas, from_csv, save, load, and _empty_copy. Also add CTable.chunks / CTable.blocks properties exposing the shared grid. On the chicago-taxi benchmark this drops a mixed-dtype where() onto the fast path: ~31% faster and ~60% less peak memory.
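
The itemsize selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the median is taken with `statistics.median` (which averages the two middle values for even-length inputs), and the 4x cap is read as "a column is eligible when its itemsize is at most 4 times the grid itemsize". The real `_compute_aligned_grid` also derives chunk and block shapes, which are not modelled here.

```python
from statistics import median

# Allowed grid itemsizes, as listed in the commit message.
ALLOWED_ITEMSIZES = (1, 2, 4, 8, 16)


def snap_itemsize(value):
    """Snap a value to the nearest allowed itemsize; ties go to the smaller one."""
    # The secondary key `s` breaks distance ties in favour of the smaller size.
    return min(ALLOWED_ITEMSIZES, key=lambda s: (abs(s - value), s))


def shared_grid_itemsize(itemsizes):
    """Pick one table-wide itemsize from the fixed-size scalar columns."""
    return snap_itemsize(median(itemsizes))


def is_eligible(itemsize, grid_itemsize, cap=4):
    """Assumed reading of the 4x cap: wider columns stay off the shared grid."""
    return itemsize <= cap * grid_itemsize


# Three float32 columns and one float64 column: the grid follows float32.
grid = shared_grid_itemsize([4, 4, 4, 8])
print(grid, is_eligible(8, grid), is_eligible(32, grid))
```

With this reading, a single float64 column among float32 ones no longer gets its own grid, which is the situation the commit message says used to force the `slices_eval` path.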

…umerics

Fixed-width string equality runs on the lazyexpr fast_eval path, but only when the string column shares the numeric columns' chunk/block grid; a misaligned string forces the slow slices_eval path. Previously fixed strings were sized per-dtype and rarely aligned. Size the shared grid from numeric columns only, so string itemsizes no longer coarsen the grid that fused arithmetic depends on. Then admit fixed string/bytes columns to that grid via an absolute byte ceiling (_MAX_ALIGNED_STR_ITEMSIZE = 128, i.e. U32 / 32 UTF-32 chars) rather than the numeric 4x relative cap: they fast-path equality filters and a few-MB block is fine. Larger strings keep per-dtype sizing; dictionary-encode those instead. All-string tables fall back to sizing the grid from the string columns. With this, short high-cardinality string filters (e.g. s == 'foo') stay on the fast path when combined with numeric conditions.
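
The 128-byte ceiling lines up with U32 because a fixed-width unicode element stores 4 bytes (UTF-32) per character. A small sketch of that arithmetic; the helper names are hypothetical, and treating the ceiling as inclusive is an assumption based on "U32" being quoted as the limit:

```python
# Value quoted in the commit message for _MAX_ALIGNED_STR_ITEMSIZE.
MAX_ALIGNED_STR_ITEMSIZE = 128


def fixed_unicode_itemsize(nchars: int) -> int:
    """Bytes per element of a fixed-width unicode column of nchars characters."""
    # Fixed-width unicode ("U<n>") uses 4 bytes per character (UTF-32).
    return 4 * nchars


def admitted_to_shared_grid(itemsize: int) -> bool:
    """Assumed rule: string/bytes columns join the grid up to the ceiling, inclusive."""
    return itemsize <= MAX_ALIGNED_STR_ITEMSIZE


for nchars in (8, 32, 33):
    size = fixed_unicode_itemsize(nchars)
    print(f"U{nchars}: {size} bytes, admitted={admitted_to_shared_grid(size)}")
```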

extend_from_arrow translated Arrow dictionary indices to global codes with a per-row Python loop over indices.to_pylist() — ~24M iterations on the chicago-taxi import. Replace it with a numpy lookup table (lut[indices]), keeping the null and ordered= validation paths intact. Cuts the parquet-to-blosc2 import of chicago-taxi (24.3M rows) from 37.6s to 24.6s (-35%).
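
The loop-to-LUT change can be illustrated with plain NumPy. This is a sketch of the general technique, not the PR's `extend_from_arrow` code: the function names, the `valid` mask argument, and the `-1` null sentinel are assumptions made for the example.

```python
import numpy as np


def translate_loop(indices, local_to_global):
    """Per-row translation, one Python-level lookup per element (the slow shape)."""
    return np.array([local_to_global[i] for i in indices.tolist()], dtype=np.int32)


def translate_lut(indices, local_to_global, valid=None, null_code=-1):
    """Vectorized translation: one fancy-indexing gather over the whole batch."""
    lut = np.asarray(local_to_global, dtype=np.int32)
    if valid is None:
        return lut[indices]
    # Assumed null handling: null slots get a sentinel, valid slots are gathered.
    out = np.full(len(indices), null_code, dtype=np.int32)
    out[valid] = lut[indices[valid]]
    return out


# Batch-local dictionary entries 0, 1, 2 map to global codes 10, 11, 12.
local_to_global = [10, 11, 12]
indices = np.array([0, 2, 1, 0])
valid = np.array([True, False, True, True])

print(translate_lut(indices, local_to_global))
print(translate_lut(indices, local_to_global, valid=valid))
```

The gather does the same work as the loop but stays inside NumPy, which is why the cost no longer scales with a Python-level iteration per row.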

Arrow/Parquet import wrote each variable-sized batch directly to arr[pos:pos+m]. Because batch sizes don't align to the column chunk grid, most writes straddled chunk boundaries, forcing a decompress-merge-recompress of partially filled chunks (~1.86x overhead measured). Add _ChunkAlignedWriter, which buffers per-column appends and flushes them in exact chunk-aligned blocks so each chunk is compressed once; only the final tail may be partial. Wire it into _write_arrow_batches for fixed-size columns and mark _valid_rows in a single write. List/dict/varlen paths are unchanged. Cuts the chicago-taxi import (24.3M rows) from 24.6s to 19.6s; combined with the dictionary vectorization the import is down from 37.6s to 19.6s. Output is byte-identical, verified across chunk boundaries.
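
The buffering idea can be sketched without any Blosc2 dependency. The class below is a hypothetical stand-in, not the PR's `_ChunkAlignedWriter`: it assumes a `sink(start, values)` callback in place of the real `arr[start:stop] = values` write, and it uses Python lists rather than NumPy buffers.

```python
class ChunkAlignedBuffer:
    """Buffer appends and emit only writes that start and end on chunk boundaries."""

    def __init__(self, chunk_len, sink):
        self.chunk_len = chunk_len
        self.sink = sink      # callable(start, values), stands in for arr[start:stop] = values
        self.pending = []     # rows not yet written
        self.pos = 0          # next write offset, always a multiple of chunk_len

    def append(self, values):
        self.pending.extend(values)
        # Number of buffered rows that form whole chunks.
        n_full = len(self.pending) // self.chunk_len * self.chunk_len
        if n_full:
            self.sink(self.pos, self.pending[:n_full])
            self.pos += n_full
            del self.pending[:n_full]

    def close(self):
        # Only the final tail may be shorter than a chunk.
        if self.pending:
            self.sink(self.pos, self.pending)
            self.pos += len(self.pending)
            self.pending = []


writes = []
writer = ChunkAlignedBuffer(4, lambda start, vals: writes.append((start, list(vals))))
writer.append([0, 1, 2])            # 3 rows: less than one chunk, nothing written yet
writer.append([3, 4, 5, 6, 7, 8])   # 9 rows buffered: two full chunks are flushed
writer.close()                      # the 1-row tail is written last
print(writes)
```

Every write except the last begins and ends on a multiple of `chunk_len`, so no chunk is touched twice.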

Dictionary columns created their int32 codes NDArray at chunks=(4096,) and then resize()'d it to capacity, leaving thousands of micro-chunks (5,937 for chicago-taxi). That made codes writes slow, hurt compression, and kept dict-column filters off the fast path. create_dictionary_column now accepts a codes grid; the Arrow-import path creates codes at full capacity on the shared aligned grid (codes are int32, matching the numeric grid), dropping the create-then-resize. chicago-taxi import: 19.6s -> 12.5s (37.6s -> 12.5s cumulative); output 900 MB (was 909). Codes go from 5,937 chunks to 12, and mixed numeric+dict filters now take the fast_eval path. Output verified byte-identical.

The unnamed-root list<struct> import defaulted the parquet read batch to the source row-group size (309 rows for chicago-taxi). Nested path lists amplify ~10x downstream, so peak RSS reached 1.63 GB regardless of output format. Cap the auto batch size so one Arrow read batch stays within a 48 MiB budget (estimated from a small sample), and read that estimation sample with only a few outer rows so it no longer leaves a large batch retained in the Arrow pool. An explicit --parquet-batch-size is still honored verbatim. chicago-taxi import peak: 1.63 GB -> 708 MB (-57%), ~15s. Output identical.
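
The budget cap boils down to simple arithmetic: estimate bytes per outer row from a small sample, then limit the batch to as many rows as fit in the budget. The sketch below assumes that shape; the function name, its parameters, and the sample figures are hypothetical, and only the 48 MiB budget and the 309-row default come from the commit message.

```python
MIB = 1024 * 1024


def capped_batch_rows(auto_rows, sample_bytes, sample_rows, budget_bytes=48 * MIB):
    """Cap an auto-chosen batch size so one read batch fits the memory budget."""
    # Ceiling division, so the per-row estimate is never rounded down to zero.
    bytes_per_row = max(1, -(-sample_bytes // max(1, sample_rows)))
    max_rows = max(1, budget_bytes // bytes_per_row)
    return min(auto_rows, max_rows)


# Hypothetical sample: 8 outer rows occupying 16 MiB, so about 2 MiB per row.
print(capped_batch_rows(auto_rows=309, sample_bytes=16 * MIB, sample_rows=8))
# A batch that already fits the budget is left alone.
print(capped_batch_rows(auto_rows=10, sample_bytes=16 * MIB, sample_rows=8))
```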

… --reduce-mem

Copilot

Pull request overview

This PR aligns CTable fixed-size columns on a single shared chunk/block grid during construction and Arrow/Parquet import so that lazy expressions can take the fast_eval path; it also vectorizes Arrow dictionary import, adds a --reduce-mem CLI flag, promotes _NestedColumnNamespace to a public NestedColumn class, and unifies info/cratio formatting across all Blosc2 array types.

Changes:

- Introduce `_compute_aligned_grid` + `_ChunkAlignedWriter` so fixed-size scalar (and small fixed-string) columns and the `_valid_rows` mask share one grid; dictionary code arrays are now created at full capacity on that grid.
- Vectorize `DictionaryColumn.extend_from_arrow` (LUT gather instead of per-row Python loop) and add `--reduce-mem` to cap auto Parquet batch sizes.
- Rename `_NestedColumnNamespace` → public `NestedColumn`, restructure `Column`/`CTable`/`NestedColumn` `info_items`, and unify `cratio` to `.2fx` everywhere.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.


| File | Description |
| --- | --- |
| src/blosc2/ctable.py | Aligned-grid logic, `_ChunkAlignedWriter`, info restructure, `NestedColumn` public class, `chunks`/`blocks` properties. |
| src/blosc2/ctable_storage.py | `create_dictionary_column` accepts `codes_shape/chunks/blocks`. |
| src/blosc2/dictionary_column.py | Vectorized Arrow batch import via numpy LUT gather. |
| src/blosc2/cli/parquet_to_blosc2.py | `--reduce-mem` flag and Arrow batch memory-budget capping. |
| src/blosc2/{schunk,ndarray,c2array,objectarray,list_array,batch_array}.py | Unified `cratio` to `.2fx` format. |
| src/blosc2/\_\_init\_\_.py | Export `Column`, `CTable`, `NestedColumn`. |
| src/blosc2/b2view/model.py | Rename `rows`/`columns` → `nrows`/`ncols` in fallback metadata. |
| doc/reference/{ctable,classes}.rst | Document `NestedColumn`. |
| tests/ctable/* and tests/test_b2view_model.py | Tests for aligned grid, chunk-aligned writer, dictionary vectorization, and updated info field names. |


FrancescAlted added 9 commits May 29, 2026 18:40

Make shrink an auto-chosen Parquet batch size accessible via optional…

f382bec

… --reduce-mem

Make .info more uniform throughout all classes that have it

9579514

New NestedColumn to better represent groups of hierarchical fields

e1576c9

FrancescAlted requested a review from Copilot May 30, 2026 07:04

Copilot started reviewing on behalf of FrancescAlted May 30, 2026 07:05

Copilot AI reviewed May 30, 2026


FrancescAlted merged commit 19c69b6 into main May 30, 2026
18 checks passed

FrancescAlted deleted the ctable-column-alignment branch May 30, 2026 07:51

FrancescAlted merged 9 commits into main from ctable-column-alignment
