
Support for fixed-shape N-dimensional CTable columns#637

Merged
FrancescAlted merged 24 commits into main from ndim-cols on May 18, 2026

@FrancescAlted
Member

Summary:

  • Added blosc2.ndarray(...) schema support for per-row fixed-shape array columns.
  • Extended CTable append/extend/read paths to validate and preserve ndarray item shapes.
  • Added logical Column.shape, ndim, size, and item-shape helpers.
  • Added ndarray-column reductions and row-wise generated-column helpers.
  • Improved display/info output for ndarray-backed columns and nested column namespaces.
  • Extended Arrow/Parquet/CSV/Pandas interop to handle ndarray columns where supported.
  • Added validation around groupby/indexing Cython helpers and fixed object-key group_reduce sorting with None.
  • Added/updated tests covering ndarray columns, nested namespace info, groupby fallback sorting, and low-level validation paths.
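
As a rough illustration of what validating and preserving item shapes on append/extend could involve, here is a minimal sketch in plain NumPy. The helper name `coerce_batch` is hypothetical and is not part of blosc2; the PR's own validation lives in the schema and CTable code and may work differently.

```python
import numpy as np


def coerce_batch(values, item_shape, dtype):
    """Coerce input to shape (nrows, *item_shape) or raise ValueError.

    Hypothetical helper for illustration only; not a blosc2 API.
    """
    item_shape = tuple(item_shape)
    arr = np.asarray(values, dtype=dtype)
    # A single item (as in an append) is promoted to a batch of one row.
    if arr.shape == item_shape:
        arr = arr[np.newaxis, ...]
    # Every axis after the leading row axis must match the declared item shape.
    if arr.shape[1:] != item_shape:
        raise ValueError(f"expected item shape {item_shape}, got {arr.shape[1:]}")
    return arr


batch = coerce_batch([[1, 2, 3], [4, 5, 6]], item_shape=(3,), dtype=np.float32)
single = coerce_batch([7, 8, 9], item_shape=(3,), dtype=np.float32)
```

Here `batch` has two rows and `single` has one; an input whose trailing axes do not match `item_shape` is rejected before it reaches storage.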

 Changed:
 - src/blosc2/ctable.py
     - ndarray column metadata: item_shape, item_ndim, item_size, logical ndim/size
     - tuple inner-axis indexing: t.embedding[:, 0], t.image[:, :, :, 0]
     - direct comparison guards for full ndarray columns
     - string where() ndarray-column guard + 1-D row-mask validation
     - scalar-only guards for sort/index/describe/cov
     - axis-aware reductions: sum/mean/min/max/std/norm(axis=...)
     - RowTransformer + Column.row_transformer
     - add_generated_column()
     - generated-column append/extend autofill
     - generated-column staleness and refresh:
           - refresh_generated_column()
           - refresh_generated_columns(source=...)
     - compact display for ndarray cells
     - Column.summary()
     - Arrow FixedSizeList export/import for ndarray columns
 - src/blosc2/groupby.py
     - group-by/aggregate guards for ndarray columns
 - src/blosc2/__init__.py
     - exports RowTransformer
 - Added tests:
     - tests/ctable/test_ctable_ndarray_columns.py
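
The inner-axis indexing and axis-aware reductions listed above can be pictured with plain NumPy, assuming the column is stored as one array of shape (nrows, *item_shape). This sketch does not call blosc2; the variable names are illustrative only, and the CTable methods may return different container types.

```python
import numpy as np

nrows, item_shape = 4, (3,)
# One leading row axis plus the per-row item axes.
embedding = np.arange(nrows * 3, dtype=np.float64).reshape((nrows, *item_shape))

# Tuple inner-axis indexing: the first component of every row's item.
first_component = embedding[:, 0]

# Axis-aware reductions over the same storage.
per_row_sum = embedding.sum(axis=1)            # one value per row
per_component_mean = embedding.mean(axis=0)    # one value per item position
row_norms = np.linalg.norm(embedding, axis=1)  # Euclidean norm of each row's item
```

Axis 0 is the row axis in this layout, so reductions over axis 1 and higher act within each row's item.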
 to_csv() — ndarray column cells are serialized as JSON arrays (e.g., "[1.0, 2.0, 3.0]"). Null ndarray cells write empty CSV fields, matching the scalar
 null convention.

 from_csv() — ndarray column cells are parsed from JSON arrays and stacked into the proper (nrows, *item_shape) storage. Empty cells for nullable ndarray
 columns restore the null sentinel. Wrong-shaped JSON arrays raise a clear ValueError with the expected item_shape.

 _csv_ndarray_col_to_array() — new static helper for the JSON→ndarray conversion path.
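
The CSV convention described above (JSON arrays per cell, empty fields for nulls, shape checks on read) can be sketched with the standard library and NumPy. The functions `write_cells` and `read_cells` are hypothetical stand-ins, not the CTable methods, and the NaN sentinel is an assumption made for this example.

```python
import csv
import io
import json

import numpy as np


def write_cells(cells, null_mask):
    """Write each ndarray cell as a JSON array; null cells become empty fields."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    writer.writerow(["embedding"])
    for cell, is_null in zip(cells, null_mask):
        writer.writerow(["" if is_null else json.dumps(cell.tolist())])
    return buf.getvalue()


def read_cells(text, item_shape, dtype, null_value):
    """Parse JSON-array cells and stack them into (nrows, *item_shape)."""
    item_shape = tuple(item_shape)
    reader = csv.reader(io.StringIO(text, newline=""))
    next(reader)  # skip the header row
    out = []
    for row in reader:
        field = row[0].strip()
        if field == "":
            # An empty field restores the null sentinel for the whole item.
            out.append(np.full(item_shape, null_value, dtype=dtype))
            continue
        arr = np.array(json.loads(field), dtype=dtype)
        if arr.shape != item_shape:
            raise ValueError(f"expected item shape {item_shape}, got {arr.shape}")
        out.append(arr)
    return np.stack(out) if out else np.empty((0, *item_shape), dtype=dtype)


cells = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
text = write_cells(cells, [False, True])
restored = read_cells(text, (3,), np.float64, np.nan)
```

Because JSON arrays contain commas, the CSV writer quotes those fields, so the reader still sees one field per ndarray cell.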

 to_pandas() — new method. Scalar columns become regular DataFrame columns. Ndarray columns become object-dtype columns whose cells hold NumPy arrays of
 per-row item_shape.

 from_pandas() — new classmethod. Builds a CTable from a DataFrame using an explicit row_cls schema. Object-dtype columns are NOT automatically inferred
 as ndarray — the schema must declare blosc2.ndarray fields explicitly. Validates column name matching and rejects non-object columns for ndarray specs.
Contributor

Copilot AI left a comment


Pull request overview

This PR adds first-class support for fixed-shape N-dimensional array (“ndarray”) columns in blosc2.CTable, including schema representation, storage/layout changes, validation, reductions, generated-column helpers, and multiple interop paths (Arrow/Parquet/CSV/Pandas), plus expanded test coverage.

Changes:

  • Introduces NDArraySpec / blosc2.ndarray(...) schema support and compiler/validation plumbing for fixed-shape per-row arrays.
  • Extends CTable/Column to preserve ndarray item shapes through append/extend/read, enable tuple inner slicing, axis-aware reductions, and generated columns via RowTransformer (with stale tracking + refresh APIs).
  • Updates groupby/indexing Cython helpers with additional validation and improves group-reduce sorting behavior with None.
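
To make the idea of stale tracking and refresh for generated columns concrete, here is a toy model in plain Python and NumPy. The class `ToyGeneratedColumn` and its methods are invented for this sketch; they are not the `RowTransformer` or `refresh_generated_column()` APIs from the PR, whose signatures are not shown on this page.

```python
import numpy as np


class ToyGeneratedColumn:
    """Toy model of a generated column with staleness tracking (not blosc2)."""

    def __init__(self, source, row_fn):
        self.source = source  # array of shape (nrows, *item_shape)
        self.row_fn = row_fn  # maps one row's item to one derived value
        self.values = None
        self.stale = True
        self.refresh()

    def refresh(self):
        # Recompute the derived value for every row, then clear the stale flag.
        self.values = np.array([self.row_fn(item) for item in self.source])
        self.stale = False

    def update_source(self, row, item):
        # Writing to the source invalidates derived values until refresh().
        self.source[row] = item
        self.stale = True


source = np.array([[3.0, 4.0], [6.0, 8.0]])
norms = ToyGeneratedColumn(source, np.linalg.norm)
norms.update_source(0, [0.0, 1.0])
was_stale = norms.stale
norms.refresh()
```

The point of the flag is that derived values are never silently trusted after the source changes; a reader can check it and decide when to pay for the recomputation.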

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/ndarray/test_indexing.py | Adds tests ensuring Cython indexing helpers validate input shapes/lengths. |
| tests/ctable/test_groupby.py | Adds tests for object-key sorting (including None) and shape validation for checked kernels. |
| tests/ctable/test_ctable_ndarray_columns.py | New comprehensive tests for ndarray columns (metadata, slicing, reductions, nullable behavior, Arrow roundtrip, generated columns + staleness). |
| tests/ctable/test_ctable_dataclass_schema.py | Tests ndarray schema roundtrip/persistence and nested namespace info helpers. |
| tests/ctable/test_csv_interop.py | Adds CSV and Pandas interop tests for fixed-shape ndarray columns (including nullable behavior). |
| tests/ctable/test_column.py | Updates expectations for Column.info_items (logical shape/nrows instead of logical/physical length). |
| src/blosc2/schema.py | Adds NDArraySpec and ndarray() schema constructor with metadata serialization. |
| src/blosc2/schema_vectorized.py | Adds batch validation logic for ndarray columns (including null-sentinel handling). |
| src/blosc2/schema_validation.py | Extends null-masking logic to recognize ndarray null sentinels during row validation. |
| src/blosc2/schema_compiler.py | Adds schema compiler support for NDArraySpec (metadata kind mapping, display width, annotation validation, deserialization). |
| src/blosc2/indexing_ext.pyx | Adds validation guards for array dimensionality, length matching, and bounds in indexing kernels. |
| src/blosc2/groupby.py | Adds ndarray guardrails for group-by/agg, optional-kernel fallback handling, and improved object-key sorting with None. |
| src/blosc2/groupby_ext.pyx | Adds shape validation for checked dense groupby kernels. |
| src/blosc2/ctable.py | Core implementation: ndarray physical shapes, coercion, tuple indexing, reductions w/ axis, RowTransformer, generated columns + staleness/refresh, improved info/display, Arrow/CSV/Pandas interop, and additional safety guards. |
| src/blosc2/__init__.py | Exports ndarray, NDArraySpec, and RowTransformer; keeps compatibility attrs on blosc2.ndarray. |
| plans/ctable-ndarray-cols.md | Design/status plan document for ndarray columns and generated-column behavior. |
| plans/ctable-ndarray-cols-copilot-sonnet.md | Additional design/implementation planning notes for ndarray columns. |
| doc/reference/reduction_functions.rst | Updates reduction function reference listing (currently drops group_reduce). |
| doc/reference/ctable.rst | Enhances CTable reference docs around __getitem__/__getattr__ and attribute-vs-item column access. |


Comment thread doc/reference/reduction_functions.rst
Comment thread src/blosc2/ctable.py Outdated
Comment thread src/blosc2/ctable.py Outdated
Comment thread src/blosc2/ctable.py
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.

Comment thread src/blosc2/groupby.py
Comment on lines 1749 to 1752
 order = list(acc)
 if sort:
-    order.sort(key=lambda k: (1, "") if k is _NAN_KEY else (0, display[k]))
+    order.sort(key=lambda k: _sortable_key_part(display[k]))
 groups = np.asarray([display[k] for k in order], dtype=keys.dtype)
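
The sorting change in this hunk addresses a general Python issue: `None` cannot be compared with strings or numbers, so sorting object keys that include `None` raises `TypeError`. A minimal sketch of a key function that avoids this follows. `sortable_key` is a hypothetical stand-in; the body of the PR's `_sortable_key_part` is not shown here and may differ, and placing `None` last is an assumption.

```python
def sortable_key(value):
    """Map a group key to a tuple that orders None after every real value.

    Hypothetical stand-in for illustration; not the PR's _sortable_key_part.
    """
    # The leading 0/1 decides the order, so None is never compared with a value.
    return (1, "") if value is None else (0, value)


keys = ["b", None, "a", None, "c"]
ordered = sorted(keys, key=sortable_key)
```

This only handles `None`; keys that mix otherwise incomparable types (for example `int` and `str`) would still need their own tagging.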
Comment thread src/blosc2/ctable.py Outdated
Comment on lines +6187 to +6199
new_cols: dict[str, blosc2.NDArray] = {}
for col in schema.columns:
    shape = cls._column_physical_shape(col, capacity)
    chunks, blocks = cls._column_chunks_blocks(col, shape)
    new_cols[col.name] = mem_storage.create_column(
        col.name,
        dtype=col.dtype,
        shape=shape,
        chunks=chunks,
        blocks=blocks,
        cparams=None,
        dparams=None,
    )
Comment thread src/blosc2/ctable.py Outdated
Comment on lines 1718 to 1724
if isinstance(nv, float) and np.isnan(nv):
    elem_mask = np.isnan(arr)
else:
    elem_mask = arr == nv
inner_axes = tuple(range(1, elem_mask.ndim))
return elem_mask.all(axis=inner_axes) if inner_axes else elem_mask.astype(np.bool_)
if isinstance(nv, float) and np.isnan(nv):
Comment on lines +103 to +108
if isinstance(col.spec, NDArraySpec):
    try:
        arr = np.asarray(val, dtype=col.spec.dtype)
        is_null = arr.shape == col.spec.item_shape and bool(
            np.isnan(arr).all() if isinstance(nv, float) and math.isnan(nv) else (arr == nv).all()
        )
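
Both review excerpts above apply the same rule: a row of an ndarray column counts as null only when every element of its item equals the null sentinel, with NaN sentinels tested through `isnan` because NaN never compares equal to itself. The standalone sketch below restates the batch variant in plain NumPy; the function name `row_null_mask` is an assumption made for this example.

```python
import numpy as np


def row_null_mask(arr, null_value):
    """Return a 1-D boolean mask, True where a whole row equals the sentinel."""
    if isinstance(null_value, float) and np.isnan(null_value):
        elem_mask = np.isnan(arr)  # equality would always be False for NaN
    else:
        elem_mask = arr == null_value
    # Collapse every axis except the leading row axis.
    inner_axes = tuple(range(1, elem_mask.ndim))
    return elem_mask.all(axis=inner_axes) if inner_axes else elem_mask.astype(np.bool_)


floats = np.array([[1.0, np.nan], [np.nan, np.nan], [0.0, 2.0]])
float_mask = row_null_mask(floats, float("nan"))

ints = np.array([[-1, -1], [-1, 3]])
int_mask = row_null_mask(ints, -1)
```

A row that contains the sentinel in only some positions, like the first row of `floats`, is treated as regular data rather than as null.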
Comment thread src/blosc2/ctable.py Outdated
Comment on lines +5942 to +5951
if stripped == "" and null_value is not None:
    rows.append(np.full(item_shape, null_value, dtype=dtype))
else:
    arr = np.array(json.loads(stripped), dtype=dtype)
    if arr.shape != item_shape:
        raise ValueError(
            f"Column {col.name!r}: expected item shape {item_shape}, got {arr.shape}"
        )
    rows.append(arr)

Comment thread src/blosc2/schema.py
Comment on lines +740 to +744
if self.nullable:
d["nullable"] = True
if hasattr(self, "null_value"):
d["null_value"] = self.null_value
return d
   - DictionarySpec
   - ListSpec
   - VLStringSpec / VLBytesSpec
   - other varlen scalar specs

It now fills data via obj.extend(...), so the normal CTable ingestion paths handle dictionary/list/varlen/ndarray/scalar columns consistently.

Pandas missing values (None, NaN, pandas.NA) are normalized to None for special/object-style columns before ingestion.
@FrancescAlted FrancescAlted merged commit b3d514e into main May 18, 2026
17 checks passed
@FrancescAlted FrancescAlted deleted the ndim-cols branch May 18, 2026 10:35

2 participants