Ctable column alignment and more #641
Merged
Lazy expressions over CTable columns only take the fast_eval path when
all operands share identical element-unit chunk/block grids. Previously
each column sized its grid from its own dtype, so mixing dtypes (e.g. a
float64 column among float32 ones) forced the slow slices_eval path.
Compute one table-wide grid sized for the median itemsize of the
fixed-size scalar columns (snapped to {1,2,4,8,16}, ties round down) and
apply it to all eligible columns and the _valid_rows mask. Wide string
and ndarray-item columns, and user-pinned grids, are left untouched; a
4x cap keeps wide columns off the shared grid. Applied across every
creation path: __init__, from_arrow/from_parquet, from_pandas, from_csv,
save, load, and _empty_copy.
Also add CTable.chunks / CTable.blocks properties exposing the shared grid.
On the chicago-taxi benchmark this drops a mixed-dtype where() onto the
fast path: ~31% faster and ~60% less peak memory.
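A minimal sketch of the itemsize selection described above, for readers who want to see the rule in isolation. The names pick_shared_itemsize and is_eligible are hypothetical and are not the PR's _compute_aligned_grid; "nearest" is assumed to mean smallest absolute distance, and the 4x cap is assumed to compare a column's itemsize against the shared one.

```python
from statistics import median

# Allowed shared itemsizes, in ascending order.
SNAP_SIZES = (1, 2, 4, 8, 16)


def pick_shared_itemsize(itemsizes):
    """Median of the column itemsizes, snapped to the nearest allowed size."""
    m = median(itemsizes)
    # min() returns the first minimal element; because SNAP_SIZES is
    # ascending, an exact tie resolves to the smaller size (round down).
    return min(SNAP_SIZES, key=lambda s: abs(s - m))


def is_eligible(itemsize, shared, cap=4):
    """Assumed reading of the 4x cap: wider columns stay off the shared grid."""
    return itemsize <= cap * shared


# Three float32 columns and one float64 column: the median stays at 4 bytes.
shared = pick_shared_itemsize([4, 4, 8, 4])
```

Under these assumptions a float64 column (8 bytes) still joins a 4-byte grid, while a 32-byte item does not.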
…umerics
Fixed-width string equality runs on the lazyexpr fast_eval path, but only when the string column shares the numeric columns' chunk/block grid; a misaligned string forces the slow slices_eval path. Previously fixed strings were sized per-dtype and rarely aligned.
Size the shared grid from numeric columns only, so string itemsizes no longer coarsen the grid that fused arithmetic depends on. Then admit fixed string/bytes columns to that grid via an absolute byte ceiling (_MAX_ALIGNED_STR_ITEMSIZE = 128, i.e. U32 / 32 UTF-32 chars) rather than the numeric 4x relative cap: they fast-path equality filters and a few-MB block is fine. Larger strings keep per-dtype sizing; dictionary-encode those instead. All-string tables fall back to sizing the grid from the string columns.
With this, short high-cardinality string filters (e.g. s == 'foo') stay on the fast path when combined with numeric conditions.
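To make the byte ceiling concrete, here is a small sketch of the admission test using NumPy dtypes. The helper admits_string is hypothetical and only mirrors the rule as stated in the commit message (fixed string/bytes dtype, itemsize at most 128 bytes); the real check in ctable.py may differ.

```python
import numpy as np

# Same value as the commit message's _MAX_ALIGNED_STR_ITEMSIZE.
MAX_ALIGNED_STR_ITEMSIZE = 128


def admits_string(dtype):
    """True for fixed-width str/bytes dtypes that fit under the byte ceiling."""
    dt = np.dtype(dtype)
    # kind "U" is fixed-width unicode (4 bytes per char), "S" is fixed bytes.
    return dt.kind in "SU" and dt.itemsize <= MAX_ALIGNED_STR_ITEMSIZE


# U32 is 32 UTF-32 code units of 4 bytes each, i.e. exactly at the ceiling.
u32_itemsize = np.dtype("U32").itemsize
```

With this rule U32 is admitted and U33 (132 bytes) is not, which matches the "U32 / 32 UTF-32 chars" note above.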
extend_from_arrow translated Arrow dictionary indices to global codes with a per-row Python loop over indices.to_pylist() — ~24M iterations on the chicago-taxi import. Replace it with a numpy lookup table (lut[indices]), keeping the null and ordered= validation paths intact. Cuts the parquet-to-blosc2 import of chicago-taxi (24.3M rows) from 37.6s to 24.6s (-35%).
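The lookup-table idea can be sketched with plain NumPy. This is not the PR's extend_from_arrow: translate_codes and its arguments are hypothetical, and the null and ordered= handling mentioned above is left out. The point is that the Python loop runs once per dictionary entry, not once per row.

```python
import numpy as np


def translate_codes(indices, batch_dictionary, global_codes):
    """Map per-batch dictionary indices to global codes with one gather."""
    # One LUT entry per value in the batch dictionary (usually small).
    lut = np.array([global_codes[v] for v in batch_dictionary], dtype=np.int32)
    # Fancy indexing translates every row at once: out[i] = lut[indices[i]].
    return lut[indices]


# The batch dictionary orders values differently from the global one.
global_codes = {"a": 0, "b": 1, "c": 2}
batch_dictionary = ["b", "c", "a"]
indices = np.array([0, 2, 1, 0])

codes = translate_codes(indices, batch_dictionary, global_codes)
```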
Arrow/Parquet import wrote each variable-sized batch directly to arr[pos:pos+m]. Because batch sizes don't align to the column chunk grid, most writes straddled chunk boundaries, forcing a decompress-merge-recompress of partially filled chunks (~1.86x overhead measured). Add _ChunkAlignedWriter, which buffers per-column appends and flushes them in exact chunk-aligned blocks so each chunk is compressed once; only the final tail may be partial. Wire it into _write_arrow_batches for fixed-size columns and mark _valid_rows in a single write. List/dict/varlen paths are unchanged. Cuts the chicago-taxi import (24.3M rows) from 24.6s to 19.6s; combined with the dictionary vectorization the import is down from 37.6s to 19.6s. Output is byte-identical, verified across chunk boundaries.
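A rough sketch of the buffering pattern, using plain lists so it runs without blosc2. ChunkAlignedBuffer and its sink callback are hypothetical stand-ins for _ChunkAlignedWriter and the column write; the sketch only shows that irregular batches turn into writes that start on chunk boundaries.

```python
class ChunkAlignedBuffer:
    """Buffer appends and emit them in whole chunks; only the tail is partial."""

    def __init__(self, chunk_len, sink):
        self.chunk_len = chunk_len
        self.sink = sink      # called as sink(start, values) for each write
        self.pending = []     # rows received but not yet written
        self.pos = 0          # next write offset, always a chunk multiple

    def append(self, values):
        self.pending.extend(values)
        # Flush as many complete chunks as the buffer currently holds.
        while len(self.pending) >= self.chunk_len:
            self._flush(self.chunk_len)

    def close(self):
        # The final tail is the only write allowed to be shorter than a chunk.
        if self.pending:
            self._flush(len(self.pending))

    def _flush(self, n):
        block, self.pending = self.pending[:n], self.pending[n:]
        self.sink(self.pos, block)
        self.pos += n


writes = []
buf = ChunkAlignedBuffer(4, lambda start, vals: writes.append((start, vals)))
buf.append([0, 1, 2])           # 3 rows: nothing written yet
buf.append([3, 4, 5, 6, 7, 8])  # 9 rows buffered: two full chunks go out
buf.close()                     # tail of 1 row
```

Each full chunk is handed to the sink exactly once, which is the property that avoids the decompress-merge-recompress cycle described above.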
Dictionary columns created their int32 codes NDArray at chunks=(4096,) and then resize()'d it to capacity, leaving thousands of micro-chunks (5,937 for chicago-taxi). That made codes writes slow, hurt compression, and kept dict-column filters off the fast path. create_dictionary_column now accepts a codes grid; the Arrow-import path creates codes at full capacity on the shared aligned grid (codes are int32, matching the numeric grid), dropping the create-then-resize. chicago-taxi import: 19.6s -> 12.5s (37.6s -> 12.5s cumulative); output 900 MB (was 909). Codes go from 5,937 chunks to 12, and mixed numeric+dict filters now take the fast_eval path. Output verified byte-identical.
The unnamed-root list<struct> import defaulted the parquet read batch to the source row-group size (309 rows for chicago-taxi). Nested path lists amplify ~10x downstream, so peak RSS reached 1.63 GB regardless of output format. Cap the auto batch size so one Arrow read batch stays within a 48 MiB budget (estimated from a small sample), and read that estimation sample with only a few outer rows so it no longer leaves a large batch retained in the Arrow pool. An explicit --parquet-batch-size is still honored verbatim. chicago-taxi import peak: 1.63 GB -> 708 MB (-57%), ~15s. Output identical.
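The capping rule can be sketched as simple arithmetic. capped_batch_size and its parameters are hypothetical, and the bytes-per-row estimate from a sample is an assumption about how the budget is applied; only the 48 MiB figure and the "explicit size is honored verbatim" behaviour come from the text above.

```python
BUDGET_BYTES = 48 * 1024 * 1024  # 48 MiB, as stated in the commit message


def capped_batch_size(default_rows, sample_nbytes, sample_rows,
                      explicit_rows=None, budget=BUDGET_BYTES):
    """Pick a read batch size (in outer rows) that fits the memory budget."""
    if explicit_rows is not None:
        # A user-supplied batch size is never adjusted.
        return explicit_rows
    # Estimate the Arrow footprint of one outer row from a small sample.
    bytes_per_row = max(1, sample_nbytes // max(1, sample_rows))
    # Never exceed the default, never drop below one row.
    return max(1, min(default_rows, budget // bytes_per_row))


# 309-row row groups where each outer row expands to about 1 MiB in Arrow.
heavy = capped_batch_size(309, 8 * 1024 * 1024, 8)
# Light rows (about 1 KiB each): the default row-group size already fits.
light = capped_batch_size(309, 8 * 1024, 8)
```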
Pull request overview
This PR aligns CTable fixed-size columns on a single shared chunk/block grid during construction and Arrow/Parquet import so that lazy expressions can take the fast_eval path; it also vectorizes Arrow dictionary import, adds a --reduce-mem CLI flag, promotes _NestedColumnNamespace to a public NestedColumn class, and unifies info/cratio formatting across all Blosc2 array types.
Changes:
- Introduce _compute_aligned_grid + _ChunkAlignedWriter so fixed-size scalar (and small fixed-string) columns and the _valid_rows mask share one grid; dictionary code arrays are now created at full capacity on that grid.
- Vectorize DictionaryColumn.extend_from_arrow (LUT gather instead of per-row Python loop) and add --reduce-mem to cap auto Parquet batch sizes.
- Rename _NestedColumnNamespace → public NestedColumn, restructure Column/CTable/NestedColumn info_items, and unify cratio to .2fx everywhere.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/blosc2/ctable.py | Aligned-grid logic, _ChunkAlignedWriter, info restructure, NestedColumn public class, chunks/blocks properties. |
| src/blosc2/ctable_storage.py | create_dictionary_column accepts codes_shape/chunks/blocks. |
| src/blosc2/dictionary_column.py | Vectorized Arrow batch import via numpy LUT gather. |
| src/blosc2/cli/parquet_to_blosc2.py | --reduce-mem flag and Arrow batch memory-budget capping. |
| src/blosc2/{schunk,ndarray,c2array,objectarray,list_array,batch_array}.py | Unified cratio to .2fx format. |
| src/blosc2/__init__.py | Export Column, CTable, NestedColumn. |
| src/blosc2/b2view/model.py | Rename rows/columns → nrows/ncols in fallback metadata. |
| doc/reference/{ctable,classes}.rst | Document NestedColumn. |
| tests/ctable/* and tests/test_b2view_model.py | Tests for aligned grid, chunk-aligned writer, dictionary vectorization, and updated info field names. |
Summary of main changes
1. CTable column alignment (core feature)
Fixed-size scalar columns are now written on a shared chunk/block grid during
Arrow import, ensuring every numeric and fixed-string column has identical chunk
and block sizes. This is critical for efficient multi-column scans (no
cross-column re-chunking overhead). Dictionary code arrays are also allocated at
capacity on the same grid. Small fixed-length strings (up to a configurable
threshold) are admitted to the shared grid alongside numeric columns.
2. Vectorized dictionary column import
Arrow dictionary import was rewritten to operate on full batches rather than
row-by-row, giving a significant throughput improvement for dictionary-encoded
columns.
3. --reduce-mem flag for Parquet import
The parquet-to-blosc2 CLI gained a --reduce-mem flag. Without it (the
default), the original auto batch sizing is used for maximum speed. With
--reduce-mem, the auto Parquet batch size is capped so one Arrow read batch
fits a ~48 MiB budget, trading speed for lower peak RSS (~700 MB vs ~1.6 GB
on the Chicago taxi dataset).
4. NestedColumn: public class for nested column groups
The internal _NestedColumnNamespace (returned by attribute access such as
t.trip or t.trip.begin on a nested CTable) was renamed to NestedColumn,
given a proper docstring, exported in blosc2.__all__, and documented in the
Sphinx reference alongside CTable and Column.
5. Unified .info across all Blosc2 objects
CTable.info, Column.info, and NestedColumn.info were redesigned with a
consistent block layout (identity → shape/grid → sizes → content → params) and
consistent field names (nrows/ncols replacing rows/columns; storage meaning
persistent/in-memory/computed; backend describing the physical column kind:
NDArray, dictionary, list, variable-length scalar, LazyExpr). The cratio
format was unified to .2fx (two decimals + x suffix) across all Blosc2 info
reporters: NDArray, SChunk, C2Array, ObjectArray, ListArray, BatchArray,
Column, and CTable. CTable, Column, and NestedColumn were also added to
__all__.
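For reference, the .2fx convention amounts to the standard .2f format spec followed by a literal x. The helper below is a hypothetical illustration, not the function the info reporters actually call.

```python
def format_cratio(cratio):
    """Render a compression ratio with two decimals and an 'x' suffix."""
    return f"{cratio:.2f}x"


example = format_cratio(3.14159)
```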