[Merged in GitCode][BREAKING][refactor] Convert BatchMeta to columnar layout; enable zero-copy serialization by default #39
Conversation
CLA Signature Pass: mpb159753, thanks for your pull request. All authors of the commits have signed the CLA. 👍
Force-pushed efdcc43 to 0a17163
Force-pushed 89b43bb to ed4a66f
Force-pushed ed4a66f to 83a3ee7
```python
    dataclasses (bypassing enc_hook), and BatchMeta fields contain torch.dtype which
    msgpack cannot handle natively.
    """
    meta_dict = obj.to_dict()
```
Why do we still need `BatchMeta.to_dict()`?
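For context on why a pre-processing step (or `to_dict()`) exists at all: msgspec encodes dataclasses itself without invoking `enc_hook`, so non-msgpack-native values such as `torch.dtype` nested inside a dataclass never reach the hook. Below is a minimal stdlib-only sketch of the idea — recursively rewriting unsupported values before handing the object to the encoder. The names (`preprocess`, `FakeDtype`) are illustrative, not the project's API:

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any


# Stand-in for a value a msgpack encoder cannot handle natively (e.g. torch.dtype).
class FakeDtype:
    def __init__(self, name: str) -> None:
        self.name = name


def preprocess(obj: Any) -> Any:
    """Recursively replace encoder-unsupported values with encodable ones.

    Must run *before* encoding, because dataclasses are auto-serialized by
    msgspec-style encoders and are never passed to enc_hook.
    """
    if isinstance(obj, FakeDtype):
        return obj.name  # encode as a plain string tag
    if is_dataclass(obj) and not isinstance(obj, type):
        return {f.name: preprocess(getattr(obj, f.name)) for f in fields(obj)}
    if isinstance(obj, dict):
        return {k: preprocess(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [preprocess(v) for v in obj]
    return obj


@dataclass
class Meta:
    field_schema: dict


m = Meta(field_schema={"obs": {"dtype": FakeDtype("float32")}})
print(preprocess(m))  # {'field_schema': {'obs': {'dtype': 'float32'}}}
```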
```diff
+    storage_unit_to_global_indexes = self._group_by_hash(metadata.global_indexes)
+    # Build global_idx -> batch position mapping for non-contiguous slicing
+    gi_to_pos = {gi: pos for pos, gi in enumerate(metadata.global_indexes)}
     tasks = [
-        self._put_to_single_storage_unit(
-            meta_group.get_local_indexes(),
-            _filter_storage_data(meta_group, results),
-            target_storage_unit=storage_id,
+        self._prepare_and_send_to_unit_by_positions(
+            storage_id=su_id,
+            positions=[gi_to_pos[gi] for gi in gi_list],
+            data=data,
+            metadata=metadata,
         )
-        for storage_id, meta_group in storage_meta_groups.items()
+        for su_id, gi_list in storage_unit_to_global_indexes.items()
     ]
```
Very hard to understand
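To unpack the comprehension the reviewer is flagging: the batch is first grouped by storage unit (deterministic hash routing on global index), then each group's global indexes are translated back to positions in the original batch so a non-contiguous slice can be sent per unit. A hedged stdlib sketch of that two-step mapping (the hash function and names are illustrative, not the project's implementation):

```python
import hashlib
from collections import defaultdict


def route(global_index: int, num_units: int) -> str:
    # Deterministic hash routing: the same index always lands on the same unit.
    digest = hashlib.md5(str(global_index).encode()).hexdigest()
    return f"storage_{int(digest, 16) % num_units}"


def group_by_unit(global_indexes: list[int], num_units: int) -> dict[str, list[int]]:
    """Step 1: group global indexes by their routed storage unit."""
    groups: dict[str, list[int]] = defaultdict(list)
    for gi in global_indexes:
        groups[route(gi, num_units)].append(gi)
    return dict(groups)


global_indexes = [40, 41, 42, 43]
groups = group_by_unit(global_indexes, num_units=2)

# Step 2: translate each group's global indexes back to batch positions,
# so a non-contiguous slice of the batch can be taken per unit.
gi_to_pos = {gi: pos for pos, gi in enumerate(global_indexes)}
positions_per_unit = {unit: [gi_to_pos[gi] for gi in gis] for unit, gis in groups.items()}

# Every batch position is routed exactly once across all units.
assert sorted(p for ps in positions_per_unit.values() for p in ps) == [0, 1, 2, 3]
```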
```python
    async def _prepare_and_send_to_unit_by_positions(
        self,
        storage_id,
        positions,
```
```python
        finally:
            _encoder_aux_buffers.reset(token)

    def _preprocess_for_batchmeta(self, obj: Any) -> Any:
```

```python
        # Pre-process to convert BatchMeta to Ext; msgspec auto-serializes
        # dataclasses and won't call enc_hook for them.
        obj = self._preprocess_for_batchmeta(obj)
        try:
            return list(_encoder.encode(obj))
        except (TypeError, ValueError) as e:
            logger.debug(
```
This should be a warning.
```python
_decoder = MsgpackDecoder()


def encode_with_fallback(obj: Any) -> list[bytestr]:
```
We can just call it `encode`.
```python
    return [_PICKLE_FALLBACK_SENTINEL, pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)]


def decode_with_fallback(frames: list) -> Any:
```
We can just call it `decode`.
```python
        i for i, global_index in enumerate(full_meta.global_indexes) if global_index in update_gis
    ]
    update_meta_with_backend = full_meta.select_samples(update_positions_in_full)
    extended_meta = update_meta_with_backend.with_data_fields(
```
Why is this new interface needed?
```python
    extended_fields = base_fields + ["new_extra_tensor", "new_extra_non_tensor"]
    update_region_meta = poll_for_meta(
        client, partition_id, extended_fields, 20, "update_region_task", mode="force_fetch"
    # 9. Verify new fields exist in update region (indices 10-29 only have new fields).
```
Very hard to understand
```python
async def test_put_data_no_batch_counter():
    """put_data should not have _batch_counter attribute (already removed)."""
    storage_unit_infos = {
        "storage_0": ZMQServerInfo(
            role=TransferQueueRole.STORAGE,
            id="storage_0",
            ip="127.0.0.1",
            ports={"put_get_socket": 19002},
        ),
    }
    with patch("transfer_queue.storage.managers.base.TransferQueueStorageManager._connect_to_controller"):
        manager = AsyncSimpleStorageManager.__new__(AsyncSimpleStorageManager)
        manager.storage_manager_id = "test_manager_2"
        manager.storage_unit_infos = storage_unit_infos
        manager.controller_info = None
        manager.data_status_update_socket = None
        manager.controller_handshake_socket = None
        manager.zmq_context = None

        assert not hasattr(manager, "_batch_counter"), "_batch_counter should have been removed"
```
```python
# ============================================================================
# Numpy Native Serialization Tests (CUSTOM_TYPE_NUMPY)
# ============================================================================
class TestNumpyNativeSerialization:
```
Consider merging the essential tests into the previous test class?
```python
    assert 1 in storage_data.field_data["log_probs"]  # other key intact


def test_storage_unit_data_dict_key():
```
```python
    )


def test_storage_unit_data_partial_consume_safety():
```
```python
    torch.testing.assert_close(storage.field_data["f"][1], torch.tensor([9.0]))


def test_storage_unit_data_active_keys_tracking():
```
Force-pushed 33bb19f to 14391db
…zation

- Convert BatchMeta/KVBatchMeta to columnar list layout for zero-copy serialization
- Add columnar custom_meta and _custom_backend_meta support
- Add with_data_fields to BatchMeta; fix cross-shard e2e test
- Add CUSTOM_TYPE_NUMPY for native numpy round-trip in serial_utils
- Apply code review fixes from columnar-batchmeta branch review
- Simplify storage manager: extract helpers, rename variables for clarity
- Rename local_indexes/gi_list to global_indexes across codebase
- Remove unused StorageMetaGroup dead code
- Replace deepcopy with shallow copy in BatchMeta.__post_init__
- Rewrite concat extra_info merge to batch-level semantics
- Replace chunk-based routing with deterministic hash routing
- Detect dtype/shape changes in field_schema_cache
- Make _SampleView a complete read-only single-sample view
- Remove to_dict/from_dict/_parse_dtype, use direct pickle for BatchMeta
- Rename encode/decode_with_fallback to encode/decode

Signed-off-by: 看我72遍 <m.pb@msn.com>
Force-pushed 14391db to f6ab22e
Close as merged in GitCode https://gitcode.com/Ascend/TransferQueue/pull/28 |
[fix,refactor] Complete columnar metadata refactor for manager→controller path

Co-authored-by: 看我72遍 <m.pb@msn.com>
Merge: !29 merge refactor/columnar-field-schema into main (message auto-generated for no-merge-commit merge)
Created-by: mpb159753 · Commit-by: 看我72遍 · Merged-by: ascend-robot

Description:

# Columnar FieldSchema + Unified Controller Metadata

## 1. Context & Motivation

Follows: [#28 — Columnar BatchMeta + Zero-Copy Default](https://gitcode.com/Ascend/TransferQueue/pull/28)

PR #39 converted `BatchMeta` from row-oriented to columnar layout, but two O(B×F) bottlenecks remained on the **Manager → Controller** path:

1. **`notify_data_update` payload**: The Manager expanded columnar `field_schema` back into per-sample dicts (`dtypes: {global_index: {field: dtype}}`, `shapes: {global_index: {field: shape}}`), transmitting O(B×F) data over ZMQ for information that is inherently O(F).
2. **Controller metadata storage**: `DataPartitionStatus` maintained three separate stores (`field_dtypes`, `field_shapes`, `field_schema_cache`) with redundant per-sample indexing, requiring multi-pass reconciliation logic to detect nested tensors.

This PR completes the columnar refactoring by:

- Transmitting `field_schema` directly as O(F) columnar data (no per-sample expansion)
- Introducing `FieldColumnMeta` as the **single source of truth** for per-field metadata in the Controller
- Adding `RoutingGroup` to carry batch positions alongside global indexes, eliminating the intermediate mapping
- Extracting `_pack_field_values` as a reusable static method with defensive checks

## 2. Key Changes

### 2.1 Columnar `notify_data_update` Protocol (`base.py`, `simple_backend_manager.py`)

**Before** (O(B×F) expansion in Manager):

```python
dtypes_for_notify = {
    global_index: {field_name: field_meta.get("dtype") for field_name, field_meta in field_schema.items()}
    for global_index in metadata.global_indexes
}
shapes_for_notify = { ... }  # same pattern
await self.notify_data_update(partition_id, field_names, global_indexes, dtypes_for_notify, shapes_for_notify)
```

**After** (O(F) — pass through as-is):

```python
await self.notify_data_update(partition_id, global_indexes, field_schema)
```

- Removed `fields`, `dtypes`, `shapes` parameters
- `field_schema` is already columnar from `metadata.py` — no expansion needed
- KV path (`base.py`) similarly simplified, removing a 25-line per-sample expansion loop

### 2.2 `FieldColumnMeta` Dataclass (`controller.py`)

Replaces three separate stores (`field_dtypes`, `field_shapes`, `field_schema_cache`) with a single `@dataclass`:

```python
@dataclass
class FieldColumnMeta:
    dtype: Any = None
    shape: Optional[tuple] = None
    is_nested: bool = False
    is_non_tensor: bool = False
    per_sample_shapes: dict[int, tuple] = field(default_factory=dict)
```

- Field-level attributes are O(1) — shared across all samples
- Sample-level shapes only stored for nested tensors — O(B_nested), not O(B)
- `to_batch_schema()` generates `BatchMeta`-compatible dicts on demand
- `remove_samples()` cleans up released indexes

### 2.3 `RoutingGroup` NamedTuple (`simple_backend_manager.py`)

```python
class RoutingGroup(NamedTuple):
    global_indexes: list[int]
    batch_positions: list[int]
```

- `_group_by_hash` now returns `dict[str, RoutingGroup]` instead of `dict[str, list[int]]`
- Carries both global indexes and batch positions, eliminating the intermediate `global_idx → position` mapping in `get_data`
- GET merge logic simplified: scatter results directly to batch positions without building per-sample dicts

### 2.4 `_pack_field_values` Extraction (`simple_backend_manager.py`)

Extracted inline packing logic into a reusable `@staticmethod` with explicit error handling:

- Validates non-empty input and absence of `None` values
- Handles regular tensors (`torch.stack`), nested tensors (`torch.nested.as_nested_tensor`), and non-tensors (`NonTensorStack`)

### 2.5 Simplified Controller API

- `update_production_status`: Removed `field_names` and `dtypes`/`shapes` parameters; `field_names` derived from `field_schema.keys()`
- `get_field_schema`: Delegates to `FieldColumnMeta.to_batch_schema()` instead of building from cache
- Removed `get_field_dtype` and `get_field_shape` helper methods (no longer needed)

### 2.6 Test Suite

- All test files updated to match the new `notify_data_update` and `update_production_status` signatures
- `test_controller_data_partitions.py`: Tests adapted for `FieldColumnMeta`-based schema storage

## 3. Benchmark Results

Tests conducted in Docker (single-node Ray) across 7 payload sizes (0.05 MB → 25.4 GB). Three configurations compared:

- **pre-refactor**: Baseline (row-oriented, before PR #39)
- **columnar-batch-meta**: After PR #39 (columnar BatchMeta + zero-copy)
- **columnar-field-schema**: This PR (columnar notify + FieldColumnMeta + RoutingGroup)

### Speedup (relative to pre-refactor baseline)

| Data Scale | PUT Speedup (vs baseline) | PUT Speedup (vs PR #39) | GET Speedup (vs baseline) | GET Speedup (vs PR #39) |
|------------|:------------------------:|:-----------------------:|:------------------------:|:-----------------------:|
| debug (0.05 MB) | **1.4×** | +12% | **1.5×** | +16% |
| tiny (1.5 MB) | **1.8×** | +19% | **2.1×** | +13% |
| small (0.15 GB) | **5.1×** | +20% | **3.4×** | ≈0% |
| medium (1.5 GB) | **5.8×** | +7% | **2.2×** | −1% |
| large (6.3 GB) | **5.6×** | +8% | **2.0×** | −4% |
| xlarge (12.7 GB) | **5.5×** | +8% | **2.2×** | +1% |
| huge (25.4 GB) | **5.4×** | +6% | **2.2×** | +1% |

### Absolute Bandwidth

| Data Scale | Pre-Refactor | Columnar BatchMeta (PR #39) | Columnar FieldSchema (This PR) |
|------------|:-----------:|:---------------------------:|:------------------------------:|
| **PUT** medium | 3.95 Gbps | 21.29 Gbps | **22.84 Gbps** |
| **PUT** large | 5.04 Gbps | 26.14 Gbps | **28.18 Gbps** |
| **PUT** huge | 5.09 Gbps | 26.05 Gbps | **27.49 Gbps** |
| **GET** medium | 4.24 Gbps | 9.50 Gbps | **9.39 Gbps** |
| **GET** large | 4.98 Gbps | 10.51 Gbps | **10.14 Gbps** |
| **GET** huge | 4.86 Gbps | 10.46 Gbps | **10.53 Gbps** |

### Summary

- **PUT path** benefits most: +6% to +20% over PR #39 across all scales, and a consistent 5×+ improvement over the pre-refactor baseline at medium+ scales
- **GET path** maintains parity with PR #39 — improvements are within noise margin; the GET bottleneck is in ZMQ transport, not metadata
- Small payloads see the largest relative improvement, confirming the metadata overhead reduction

### Resource Usage

Memory usage is comparable or slightly reduced (eliminated per-sample `field_dtypes`/`field_shapes` dicts in the Controller).

## 4. API Breaking Changes

- `notify_data_update()`: Removed `fields`, `dtypes`, `shapes` parameters; replaced with a single `field_schema` dict
- `update_production_status()`: Removed `field_names`, `dtypes`, `shapes` parameters; replaced with a single `field_schema` dict; `field_names` derived from `field_schema.keys()`
- `get_field_dtype()` / `get_field_shape()`: Removed (replaced by `FieldColumnMeta`)
- `_group_by_hash()`: Now returns `dict[str, RoutingGroup]` instead of `dict[str, list[int]]`

## 5. Files Changed

```
7 files changed, 451 insertions(+), 440 deletions(-)
```

| File | Description |
|------|-------------|
| `controller.py` | `FieldColumnMeta` dataclass; simplified `update_production_status` / `get_field_schema`; removed `get_field_dtype`/`get_field_shape` |
| `simple_backend_manager.py` | `RoutingGroup`; `_pack_field_values`; position-based GET merge; columnar `notify_data_update` |
| `base.py` | Columnar `notify_data_update` protocol; simplified KV path |
| `test_controller.py` | Adapted to new API signatures |
| `test_controller_data_partitions.py` | Adapted to `FieldColumnMeta`-based schema |
| `test_async_simple_storage_manager.py` | Adapted to `RoutingGroup` and new notify protocol |
| `test_kv_storage_manager.py` | Minor signature update |

## 6. Conclusion

This PR completes the second phase of columnar refactoring by eliminating the remaining O(B×F) metadata expansion in the Manager→Controller path and unifying metadata storage in the Controller:

- **PUT throughput**: Up to 5.8× over the pre-refactor baseline, +6–20% over PR #39
- **GET throughput**: Up to 3.4× over the pre-refactor baseline, parity with PR #39
- **Code clarity**: Three separate metadata stores → one `FieldColumnMeta` dataclass; per-sample expansion loops eliminated
- **Net change**: +451 / −440 lines across 7 files

> **Note on GET path**: The GET path performance improvement from metadata-level refactoring has reached diminishing returns — the minor fluctuations (±1–4%) observed in benchmarks are within normal measurement noise. Further GET throughput gains would likely require a deeper architectural change: fully columnarizing the GET data flow itself (e.g., columnar storage layout in StorageUnit, field-level parallel retrieval), rather than continuing to optimize the metadata layer.

See merge request: Ascend/TransferQueue!29
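To make the `FieldColumnMeta` contract concrete, here is a hedged sketch of how `to_batch_schema()` and `remove_samples()` might behave. Only the field names and the two method names come from the PR description; the method bodies below are illustrative guesses, not the actual implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FieldColumnMeta:
    dtype: Any = None
    shape: Optional[tuple] = None
    is_nested: bool = False
    is_non_tensor: bool = False
    # Per-sample shapes are only populated for nested tensors: O(B_nested), not O(B).
    per_sample_shapes: dict[int, tuple] = field(default_factory=dict)

    def to_batch_schema(self) -> dict:
        """Generate a BatchMeta-compatible schema entry on demand (sketch)."""
        return {
            "dtype": self.dtype,
            "shape": self.shape,
            "is_nested": self.is_nested,
            "is_non_tensor": self.is_non_tensor,
        }

    def remove_samples(self, global_indexes: list[int]) -> None:
        """Drop per-sample shapes for released indexes (sketch)."""
        for gi in global_indexes:
            self.per_sample_shapes.pop(gi, None)


col = FieldColumnMeta(dtype="float32", shape=(4,), is_nested=True,
                      per_sample_shapes={0: (3,), 1: (5,)})
col.remove_samples([0])
assert col.per_sample_shapes == {1: (5,)}
assert col.to_batch_schema()["dtype"] == "float32"
```

The key invariant is that field-level attributes are stored once per field, and the only per-sample state kept is the nested-tensor shape map.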
https://gitcode.com/Ascend/TransferQueue/pull/29
# Columnar BatchMeta + Zero-Copy Default

## 1. Context & Motivation
Closes: [refactor] Convert BatchMeta from row-oriented to column-oriented layout
The current `BatchMeta` uses a row-oriented design (`BatchMeta` → `List[SampleMeta]` → `Dict[str, FieldMeta]`), which introduces three scaling issues in high-throughput scenarios; among them, hot-path operations (`build_storage_meta_groups`, `add_fields`, `_filter_storage_data`) involve nested loops over every sample × every field, incurred multiple times per PUT.

This PR refactors `BatchMeta` to a column-oriented (structure-of-arrays) design, reducing metadata complexity from O(B×F) to O(B) + O(F), and enables zero-copy serialization by default with automatic pickle fallback.

## 2. Key Changes

### 2.1 Columnar BatchMeta (`metadata.py`)

| Aspect | Before | After |
|--------|--------|-------|
| Sample storage | `BatchMeta.samples: List[SampleMeta]` | parallel columns: `global_indexes`, `partition_ids`, `production_status` |
| Field metadata | `FieldMeta` objects (B×F instances) | `field_schema` dict (F entries) |
| Readiness check | per-sample iteration | `np.all()` on ndarray, O(1) |
| Classes | `BatchMeta`, `SampleMeta`, `FieldMeta` | `BatchMeta` only |

- Removes the `SampleMeta` and `FieldMeta` classes entirely
- Introduces a `field_schema` dict with three field types: Regular Tensor, Nested Tensor (`is_nested`), Non-Tensor (`is_non_tensor`)
- Stores `production_status` as `np.ndarray` (int8) — enables O(1) readiness checks via `np.all()`

### 2.2 Zero-Copy Serialization Default (`serial_utils.py`)

- `ZERO_COPY_SERIALIZATION` environment variable switch

### 2.3 Storage & Transport Adaptation

- `simple_backend.py` / `simple_backend_manager.py` / `controller.py`: Adapted to columnar API; `clear()` uses `del` instead of `None` assignment to reduce memory fragmentation
- `zmq_utils.py`: ZMQ transport uses the new serialization utilities; frame count reduced from O(B) to F+1 (one metadata header + one per field)

### 2.4 Test Suite

- `test_metadata.py`: Fully rewritten for the columnar API (net −799 lines)
- Tests updated for the new `BatchMeta` constructor

## 3. Benchmark Results
Tests conducted in Docker (single-node Ray) across 7 payload sizes. Three configurations compared:
### Throughput Comparison (Gbps)

### Speedup vs Baseline (main-no-zerocopy)

### Visualization

### Resource Usage
Columnar layout reduces CPU time by eliminating per-sample object creation and pickle overhead:
## 4. API Breaking Changes

| Before | After |
|--------|-------|
| `BatchMeta.samples` (`List[SampleMeta]`) | removed |
| `SampleMeta` class | removed |
| `FieldMeta` class | removed |
| `sample.fields['x'].dtype` | `batch.field_schema['x']['dtype']` |
| `BatchMeta(samples=[...])` | `BatchMeta(global_indexes=..., partition_ids=..., field_schema=..., production_status=...)` |

## 5. Files Changed

- `metadata.py`
- `serial_utils.py`, `zmq_utils.py`
- `simple_backend.py`, `simple_backend_manager.py`, `base.py`
- `controller.py`
- `test_metadata.py` + 7 test files
- `put_benchmark.py`

## 6. Conclusion
The columnar `BatchMeta` refactoring, combined with default zero-copy serialization, delivers substantially higher PUT/GET throughput while reducing metadata complexity from O(B×F) to O(B) + O(F).
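The structure-of-arrays idea at the heart of this refactor can be sketched in a few lines: parallel per-sample columns indexed by batch position, plus a per-field schema shared across all samples, with slicing producing a new object over the same columns. Plain lists stand in for the `np.ndarray` used in the real code, and while the constructor fields follow the PR's description, the implementation below is purely illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class ColumnarBatchMeta:
    # One entry per sample (column-oriented / structure-of-arrays):
    global_indexes: list[int] = field(default_factory=list)
    partition_ids: list[str] = field(default_factory=list)
    production_status: list[int] = field(default_factory=list)
    # One entry per field, shared across all samples — O(F), not O(B×F):
    field_schema: dict[str, dict] = field(default_factory=dict)

    def all_ready(self) -> bool:
        # The real code stores production_status as an int8 ndarray and uses np.all().
        return all(s == 1 for s in self.production_status)

    def select_samples(self, positions: list[int]) -> "ColumnarBatchMeta":
        # Slicing touches only the per-sample columns; field_schema is shared as-is.
        return ColumnarBatchMeta(
            global_indexes=[self.global_indexes[p] for p in positions],
            partition_ids=[self.partition_ids[p] for p in positions],
            production_status=[self.production_status[p] for p in positions],
            field_schema=self.field_schema,
        )


meta = ColumnarBatchMeta(
    global_indexes=[10, 11, 12],
    partition_ids=["p0", "p0", "p1"],
    production_status=[1, 1, 0],
    field_schema={"obs": {"dtype": "float32", "shape": (4,)}},
)
sub = meta.select_samples([0, 2])
assert sub.global_indexes == [10, 12]
assert not meta.all_ready()
```

Because the field schema is a single shared dict, adding a field or checking readiness no longer requires touching every `SampleMeta`/`FieldMeta` instance, which is where the O(B×F) loops came from.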