[Merged in GitCode][BREAKING][refactor] Convert BatchMeta to columnar layout; enable zero-copy serialization by default #39
Conversation
CLA Signature Pass: mpb159753, thanks for your pull request. All authors of the commits have signed the CLA. 👍
Force-pushed efdcc43 to 0a17163
Force-pushed 89b43bb to ed4a66f
Force-pushed ed4a66f to 83a3ee7
```python
    dataclasses (bypassing enc_hook), and BatchMeta fields contain torch.dtype which
    msgpack cannot handle natively.
    """
    meta_dict = obj.to_dict()
```
Why do we still need `BatchMeta.to_dict()`?
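For context on why a pre-processing step (or `to_dict()`) exists at all: msgspec encodes dataclasses itself without invoking `enc_hook`, so non-msgpack-native values such as `torch.dtype` nested inside a dataclass never reach the hook. Below is a minimal stdlib-only sketch of the idea — recursively rewriting unsupported values before handing the object to the encoder. The names (`preprocess`, `FakeDtype`) are illustrative, not the project's API:

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any


# Stand-in for a value a msgpack encoder cannot handle natively (e.g. torch.dtype).
class FakeDtype:
    def __init__(self, name: str) -> None:
        self.name = name


def preprocess(obj: Any) -> Any:
    """Recursively replace encoder-unsupported values with encodable ones.

    Must run *before* encoding, because dataclasses are auto-serialized by
    msgspec-style encoders and are never passed to enc_hook.
    """
    if isinstance(obj, FakeDtype):
        return obj.name  # encode as a plain string tag
    if is_dataclass(obj) and not isinstance(obj, type):
        return {f.name: preprocess(getattr(obj, f.name)) for f in fields(obj)}
    if isinstance(obj, dict):
        return {k: preprocess(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [preprocess(v) for v in obj]
    return obj


@dataclass
class Meta:
    field_schema: dict


m = Meta(field_schema={"obs": {"dtype": FakeDtype("float32")}})
print(preprocess(m))  # {'field_schema': {'obs': {'dtype': 'float32'}}}
```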
```diff
+    storage_unit_to_global_indexes = self._group_by_hash(metadata.global_indexes)
+    # Build global_idx -> batch position mapping for non-contiguous slicing
+    gi_to_pos = {gi: pos for pos, gi in enumerate(metadata.global_indexes)}
     tasks = [
-        self._put_to_single_storage_unit(
-            meta_group.get_local_indexes(),
-            _filter_storage_data(meta_group, results),
-            target_storage_unit=storage_id,
+        self._prepare_and_send_to_unit_by_positions(
+            storage_id=su_id,
+            positions=[gi_to_pos[gi] for gi in gi_list],
+            data=data,
+            metadata=metadata,
         )
-        for storage_id, meta_group in storage_meta_groups.items()
+        for su_id, gi_list in storage_unit_to_global_indexes.items()
     ]
```
Very hard to understand
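To unpack the comprehension the reviewer is flagging: the batch is first grouped by storage unit (deterministic hash routing on global index), then each group's global indexes are translated back to positions in the original batch so a non-contiguous slice can be sent per unit. A hedged stdlib sketch of that two-step mapping (the hash function and names are illustrative, not the project's implementation):

```python
import hashlib
from collections import defaultdict


def route(global_index: int, num_units: int) -> str:
    # Deterministic hash routing: the same index always lands on the same unit.
    digest = hashlib.md5(str(global_index).encode()).hexdigest()
    return f"storage_{int(digest, 16) % num_units}"


def group_by_unit(global_indexes: list[int], num_units: int) -> dict[str, list[int]]:
    """Step 1: group global indexes by their routed storage unit."""
    groups: dict[str, list[int]] = defaultdict(list)
    for gi in global_indexes:
        groups[route(gi, num_units)].append(gi)
    return dict(groups)


global_indexes = [40, 41, 42, 43]
groups = group_by_unit(global_indexes, num_units=2)

# Step 2: translate each group's global indexes back to batch positions,
# so a non-contiguous slice of the batch can be taken per unit.
gi_to_pos = {gi: pos for pos, gi in enumerate(global_indexes)}
positions_per_unit = {unit: [gi_to_pos[gi] for gi in gis] for unit, gis in groups.items()}

# Every batch position is routed exactly once across all units.
assert sorted(p for ps in positions_per_unit.values() for p in ps) == [0, 1, 2, 3]
```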
```python
    async def _prepare_and_send_to_unit_by_positions(
        self,
        storage_id,
        positions,
```
```python
        finally:
            _encoder_aux_buffers.reset(token)

    def _preprocess_for_batchmeta(self, obj: Any) -> Any:
```

```python
        # Pre-process to convert BatchMeta to Ext; msgspec auto-serializes
        # dataclasses and won't call enc_hook for them.
        obj = self._preprocess_for_batchmeta(obj)
        try:
            return list(_encoder.encode(obj))
        except (TypeError, ValueError) as e:
            logger.debug(
```
This should be a warning.
```python
_decoder = MsgpackDecoder()


def encode_with_fallback(obj: Any) -> list[bytestr]:
```
We can just call it `encode`.
```python
    return [_PICKLE_FALLBACK_SENTINEL, pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)]


def decode_with_fallback(frames: list) -> Any:
```
We can just call it `decode`.
```python
        i for i, global_index in enumerate(full_meta.global_indexes) if global_index in update_gis
    ]
    update_meta_with_backend = full_meta.select_samples(update_positions_in_full)
    extended_meta = update_meta_with_backend.with_data_fields(
```
Why is this new interface needed?
```python
    extended_fields = base_fields + ["new_extra_tensor", "new_extra_non_tensor"]
    update_region_meta = poll_for_meta(
        client, partition_id, extended_fields, 20, "update_region_task", mode="force_fetch"
    # 9. Verify new fields exist in update region (indices 10-29 only have new fields).
```
Very hard to understand
```python
async def test_put_data_no_batch_counter():
    """put_data should not have _batch_counter attribute (already removed)."""
    storage_unit_infos = {
        "storage_0": ZMQServerInfo(
            role=TransferQueueRole.STORAGE,
            id="storage_0",
            ip="127.0.0.1",
            ports={"put_get_socket": 19002},
        ),
    }
    with patch("transfer_queue.storage.managers.base.TransferQueueStorageManager._connect_to_controller"):
        manager = AsyncSimpleStorageManager.__new__(AsyncSimpleStorageManager)
        manager.storage_manager_id = "test_manager_2"
        manager.storage_unit_infos = storage_unit_infos
        manager.controller_info = None
        manager.data_status_update_socket = None
        manager.controller_handshake_socket = None
        manager.zmq_context = None

        assert not hasattr(manager, "_batch_counter"), "_batch_counter should have been removed"
```
```python
# ============================================================================
# Numpy Native Serialization Tests (CUSTOM_TYPE_NUMPY)
# ============================================================================
class TestNumpyNativeSerialization:
```
Consider merging the essential tests into the previous test class?
```python
    assert 1 in storage_data.field_data["log_probs"]  # other key intact


def test_storage_unit_data_dict_key():
```
```python
    )


def test_storage_unit_data_partial_consume_safety():
```
```python
    torch.testing.assert_close(storage.field_data["f"][1], torch.tensor([9.0]))


def test_storage_unit_data_active_keys_tracking():
```
Force-pushed 33bb19f to 14391db
…zation

- Convert BatchMeta/KVBatchMeta to columnar list layout for zero-copy serialization
- Add columnar custom_meta and _custom_backend_meta support
- Add with_data_fields to BatchMeta; fix cross-shard e2e test
- Add CUSTOM_TYPE_NUMPY for native numpy round-trip in serial_utils
- Apply code review fixes from columnar-batchmeta branch review
- Simplify storage manager: extract helpers, rename variables for clarity
- Rename local_indexes/gi_list to global_indexes across codebase
- Remove unused StorageMetaGroup dead code
- Replace deepcopy with shallow copy in BatchMeta.__post_init__
- Rewrite concat extra_info merge to batch-level semantics
- Replace chunk-based routing with deterministic hash routing
- Detect dtype/shape changes in field_schema_cache
- Make _SampleView a complete read-only single-sample view
- Remove to_dict/from_dict/_parse_dtype, use direct pickle for BatchMeta
- Rename encode/decode_with_fallback to encode/decode

Signed-off-by: 看我72遍 <m.pb@msn.com>
Force-pushed 14391db to f6ab22e
Close as merged in GitCode https://gitcode.com/Ascend/TransferQueue/pull/28 |
[fix,refactor] Complete columnar metadata refactor for manager→controller path

Co-authored-by: 看我72遍 <m.pb@msn.com>
Merge: !29 merge refactor/columnar-field-schema into main (message auto-generated for no-merge-commit merge)
Created-by: mpb159753 · Commit-by: 看我72遍 · Merged-by: ascend-robot

Description:

# Columnar FieldSchema + Unified Controller Metadata

## 1. Context & Motivation

Follows: [#28 — Columnar BatchMeta + Zero-Copy Default](https://gitcode.com/Ascend/TransferQueue/pull/28)

PR #39 converted `BatchMeta` from row-oriented to columnar layout, but two O(B×F) bottlenecks remained on the **Manager → Controller** path:

1. **`notify_data_update` payload**: The Manager expanded columnar `field_schema` back into per-sample dicts (`dtypes: {global_index: {field: dtype}}`, `shapes: {global_index: {field: shape}}`), transmitting O(B×F) data over ZMQ for information that is inherently O(F).
2. **Controller metadata storage**: `DataPartitionStatus` maintained three separate stores (`field_dtypes`, `field_shapes`, `field_schema_cache`) with redundant per-sample indexing, requiring multi-pass reconciliation logic to detect nested tensors.

This PR completes the columnar refactoring by:

- Transmitting `field_schema` directly as O(F) columnar data (no per-sample expansion)
- Introducing `FieldColumnMeta` as the **single source of truth** for per-field metadata in the Controller
- Adding `RoutingGroup` to carry batch positions alongside global indexes, eliminating the intermediate mapping
- Extracting `_pack_field_values` as a reusable static method with defensive checks

## 2. Key Changes

### 2.1 Columnar `notify_data_update` Protocol (`base.py`, `simple_backend_manager.py`)

**Before** (O(B×F) expansion in Manager):

```python
dtypes_for_notify = {
    global_index: {field_name: field_meta.get("dtype") for field_name, field_meta in field_schema.items()}
    for global_index in metadata.global_indexes
}
shapes_for_notify = { ... }  # same pattern
await self.notify_data_update(partition_id, field_names, global_indexes, dtypes_for_notify, shapes_for_notify)
```

**After** (O(F) — pass through as-is):

```python
await self.notify_data_update(partition_id, global_indexes, field_schema)
```

- Removed `fields`, `dtypes`, `shapes` parameters
- `field_schema` is already columnar from `metadata.py` — no expansion needed
- KV path (`base.py`) similarly simplified, removing a 25-line per-sample expansion loop

### 2.2 `FieldColumnMeta` Dataclass (`controller.py`)

Replaces three separate stores (`field_dtypes`, `field_shapes`, `field_schema_cache`) with a single `@dataclass`:

```python
@dataclass
class FieldColumnMeta:
    dtype: Any = None
    shape: Optional[tuple] = None
    is_nested: bool = False
    is_non_tensor: bool = False
    per_sample_shapes: dict[int, tuple] = field(default_factory=dict)
```

- Field-level attributes are O(1) — shared across all samples
- Sample-level shapes only stored for nested tensors — O(B_nested), not O(B)
- `to_batch_schema()` generates `BatchMeta`-compatible dicts on demand
- `remove_samples()` cleans up released indexes

### 2.3 `RoutingGroup` NamedTuple (`simple_backend_manager.py`)

```python
class RoutingGroup(NamedTuple):
    global_indexes: list[int]
    batch_positions: list[int]
```

- `_group_by_hash` now returns `dict[str, RoutingGroup]` instead of `dict[str, list[int]]`
- Carries both global indexes and batch positions, eliminating the intermediate `global_idx → position` mapping in `get_data`
- GET merge logic simplified: scatter results directly to batch positions without building per-sample dicts

### 2.4 `_pack_field_values` Extraction (`simple_backend_manager.py`)

Extracted inline packing logic into a reusable `@staticmethod` with explicit error handling:

- Validates non-empty input and absence of `None` values
- Handles regular tensors (`torch.stack`), nested tensors (`torch.nested.as_nested_tensor`), and non-tensors (`NonTensorStack`)

### 2.5 Simplified Controller API

- `update_production_status`: Removed `field_names` and `dtypes`/`shapes` parameters; `field_names` derived from `field_schema.keys()`
- `get_field_schema`: Delegates to `FieldColumnMeta.to_batch_schema()` instead of building from cache
- Removed `get_field_dtype` and `get_field_shape` helper methods (no longer needed)

### 2.6 Test Suite

- All test files updated to match the new `notify_data_update` and `update_production_status` signatures
- `test_controller_data_partitions.py`: Tests adapted for `FieldColumnMeta`-based schema storage

## 3. Benchmark Results

Tests conducted in Docker (single-node Ray) across 7 payload sizes (0.05 MB → 25.4 GB). Three configurations compared:

- **pre-refactor**: Baseline (row-oriented, before PR #39)
- **columnar-batch-meta**: After PR #39 (columnar BatchMeta + zero-copy)
- **columnar-field-schema**: This PR (columnar notify + FieldColumnMeta + RoutingGroup)

### Speedup (relative to pre-refactor baseline)

| Data Scale | PUT Speedup (vs baseline) | PUT Speedup (vs PR #39) | GET Speedup (vs baseline) | GET Speedup (vs PR #39) |
|------------|:------------------------:|:-----------------------:|:------------------------:|:-----------------------:|
| debug (0.05 MB) | **1.4×** | +12% | **1.5×** | +16% |
| tiny (1.5 MB) | **1.8×** | +19% | **2.1×** | +13% |
| small (0.15 GB) | **5.1×** | +20% | **3.4×** | ≈0% |
| medium (1.5 GB) | **5.8×** | +7% | **2.2×** | −1% |
| large (6.3 GB) | **5.6×** | +8% | **2.0×** | −4% |
| xlarge (12.7 GB) | **5.5×** | +8% | **2.2×** | +1% |
| huge (25.4 GB) | **5.4×** | +6% | **2.2×** | +1% |

### Absolute Bandwidth

| Data Scale | Pre-Refactor | Columnar BatchMeta (PR #39) | Columnar FieldSchema (This PR) |
|------------|:-----------:|:---------------------------:|:------------------------------:|
| **PUT** medium | 3.95 Gbps | 21.29 Gbps | **22.84 Gbps** |
| **PUT** large | 5.04 Gbps | 26.14 Gbps | **28.18 Gbps** |
| **PUT** huge | 5.09 Gbps | 26.05 Gbps | **27.49 Gbps** |
| **GET** medium | 4.24 Gbps | 9.50 Gbps | **9.39 Gbps** |
| **GET** large | 4.98 Gbps | 10.51 Gbps | **10.14 Gbps** |
| **GET** huge | 4.86 Gbps | 10.46 Gbps | **10.53 Gbps** |

### Summary

- **PUT path** benefits most: +6% to +20% over PR #39 across all scales, and a consistent 5×+ improvement over the pre-refactor baseline at medium+ scales
- **GET path** maintains parity with PR #39 — improvements are within noise margin; the GET bottleneck is in ZMQ transport, not metadata
- Small payloads see the largest relative improvement, confirming the metadata overhead reduction

### Resource Usage

Memory usage is comparable or slightly reduced (eliminated per-sample `field_dtypes`/`field_shapes` dicts in the Controller).

## 4. API Breaking Changes

- `notify_data_update()`: Removed `fields`, `dtypes`, `shapes` parameters; replaced with a single `field_schema` dict
- `update_production_status()`: Removed `field_names`, `dtypes`, `shapes` parameters; replaced with a single `field_schema` dict; `field_names` derived from `field_schema.keys()`
- `get_field_dtype()` / `get_field_shape()`: Removed (replaced by `FieldColumnMeta`)
- `_group_by_hash()`: Now returns `dict[str, RoutingGroup]` instead of `dict[str, list[int]]`

## 5. Files Changed

```
7 files changed, 451 insertions(+), 440 deletions(-)
```

| File | Description |
|------|-------------|
| `controller.py` | `FieldColumnMeta` dataclass; simplified `update_production_status` / `get_field_schema`; removed `get_field_dtype`/`get_field_shape` |
| `simple_backend_manager.py` | `RoutingGroup`; `_pack_field_values`; position-based GET merge; columnar `notify_data_update` |
| `base.py` | Columnar `notify_data_update` protocol; simplified KV path |
| `test_controller.py` | Adapted to new API signatures |
| `test_controller_data_partitions.py` | Adapted to `FieldColumnMeta`-based schema |
| `test_async_simple_storage_manager.py` | Adapted to `RoutingGroup` and new notify protocol |
| `test_kv_storage_manager.py` | Minor signature update |

## 6. Conclusion

This PR completes the second phase of columnar refactoring by eliminating the remaining O(B×F) metadata expansion in the Manager→Controller path and unifying metadata storage in the Controller:

- **PUT throughput**: Up to 5.8× over the pre-refactor baseline, +6–20% over PR #39
- **GET throughput**: Up to 3.4× over the pre-refactor baseline, parity with PR #39
- **Code clarity**: Three separate metadata stores → one `FieldColumnMeta` dataclass; per-sample expansion loops eliminated
- **Net change**: +451 / −440 lines across 7 files

> **Note on GET path**: The GET path performance improvement from metadata-level refactoring has reached diminishing returns — the minor fluctuations (±1–4%) observed in benchmarks are within normal measurement noise. Further GET throughput gains would likely require a deeper architectural change: fully columnarizing the GET data flow itself (e.g., columnar storage layout in StorageUnit, field-level parallel retrieval), rather than continuing to optimize the metadata layer.

See merge request: Ascend/TransferQueue!29
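To make the `FieldColumnMeta` contract concrete, here is a hedged sketch of how `to_batch_schema()` and `remove_samples()` might behave. Only the field names and the two method names come from the PR description; the method bodies below are illustrative guesses, not the actual implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FieldColumnMeta:
    dtype: Any = None
    shape: Optional[tuple] = None
    is_nested: bool = False
    is_non_tensor: bool = False
    # Per-sample shapes are only populated for nested tensors: O(B_nested), not O(B).
    per_sample_shapes: dict[int, tuple] = field(default_factory=dict)

    def to_batch_schema(self) -> dict:
        """Generate a BatchMeta-compatible schema entry on demand (sketch)."""
        return {
            "dtype": self.dtype,
            "shape": self.shape,
            "is_nested": self.is_nested,
            "is_non_tensor": self.is_non_tensor,
        }

    def remove_samples(self, global_indexes: list[int]) -> None:
        """Drop per-sample shapes for released indexes (sketch)."""
        for gi in global_indexes:
            self.per_sample_shapes.pop(gi, None)


col = FieldColumnMeta(dtype="float32", shape=(4,), is_nested=True,
                      per_sample_shapes={0: (3,), 1: (5,)})
col.remove_samples([0])
assert col.per_sample_shapes == {1: (5,)}
assert col.to_batch_schema()["dtype"] == "float32"
```

The key invariant is that field-level attributes are stored once per field, and the only per-sample state kept is the nested-tensor shape map.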
https://gitcode.com/Ascend/TransferQueue/pull/29
# Columnar BatchMeta + Zero-Copy Default

## 1. Context & Motivation
Closes: [refactor] Convert BatchMeta from row-oriented to column-oriented layout
The current `BatchMeta` uses a row-oriented design (`BatchMeta` → `List[SampleMeta]` → `Dict[str, FieldMeta]`), which introduces three scaling issues in high-throughput scenarios; among them, hot-path operations (`build_storage_meta_groups`, `add_fields`, `_filter_storage_data`) involve nested loops over every sample × every field, incurred multiple times per PUT.

This PR refactors `BatchMeta` to a column-oriented (structure-of-arrays) design, reducing metadata complexity from O(B×F) to O(B) + O(F), and enables zero-copy serialization by default with automatic pickle fallback.

## 2. Key Changes

### 2.1 Columnar BatchMeta (`metadata.py`)

| Aspect | Before | After |
|--------|--------|-------|
| Sample storage | `BatchMeta.samples: List[SampleMeta]` | parallel columns: `global_indexes`, `partition_ids`, `production_status` |
| Field metadata | `FieldMeta` objects (B×F instances) | `field_schema` dict (F entries) |
| Readiness check | per-sample iteration | `np.all()` on ndarray, O(1) |
| Classes | `BatchMeta`, `SampleMeta`, `FieldMeta` | `BatchMeta` only |

- Removes the `SampleMeta` and `FieldMeta` classes entirely
- Introduces a `field_schema` dict with three field types: Regular Tensor, Nested Tensor (`is_nested`), Non-Tensor (`is_non_tensor`)
- Stores `production_status` as `np.ndarray` (int8) — enables O(1) readiness checks via `np.all()`

### 2.2 Zero-Copy Serialization Default (`serial_utils.py`)

- `ZERO_COPY_SERIALIZATION` environment variable switch

### 2.3 Storage & Transport Adaptation

- `simple_backend.py` / `simple_backend_manager.py` / `controller.py`: Adapted to columnar API; `clear()` uses `del` instead of `None` assignment to reduce memory fragmentation
- `zmq_utils.py`: ZMQ transport uses the new serialization utilities; frame count reduced from O(B) to F+1 (one metadata header + one per field)

### 2.4 Test Suite

- `test_metadata.py`: Fully rewritten for the columnar API (net −799 lines)
- Tests updated for the new `BatchMeta` constructor

## 3. Benchmark Results
Tests conducted in Docker (single-node Ray) across 7 payload sizes. Three configurations compared:
### Throughput Comparison (Gbps)

### Speedup vs Baseline (main-no-zerocopy)

### Visualization

### Resource Usage
Columnar layout reduces CPU time by eliminating per-sample object creation and pickle overhead:
## 4. API Breaking Changes

| Before | After |
|--------|-------|
| `BatchMeta.samples` (`List[SampleMeta]`) | removed |
| `SampleMeta` class | removed |
| `FieldMeta` class | removed |
| `sample.fields['x'].dtype` | `batch.field_schema['x']['dtype']` |
| `BatchMeta(samples=[...])` | `BatchMeta(global_indexes=..., partition_ids=..., field_schema=..., production_status=...)` |

## 5. Files Changed

- `metadata.py`
- `serial_utils.py`, `zmq_utils.py`
- `simple_backend.py`, `simple_backend_manager.py`, `base.py`
- `controller.py`
- `test_metadata.py` + 7 test files
- `put_benchmark.py`

## 6. Conclusion
The columnar `BatchMeta` refactoring, combined with default zero-copy serialization, delivers substantially higher PUT/GET throughput while reducing metadata complexity from O(B×F) to O(B) + O(F).
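The structure-of-arrays idea at the heart of this refactor can be sketched in a few lines: parallel per-sample columns indexed by batch position, plus a per-field schema shared across all samples, with slicing producing a new object over the same columns. Plain lists stand in for the `np.ndarray` used in the real code, and while the constructor fields follow the PR's description, the implementation below is purely illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class ColumnarBatchMeta:
    # One entry per sample (column-oriented / structure-of-arrays):
    global_indexes: list[int] = field(default_factory=list)
    partition_ids: list[str] = field(default_factory=list)
    production_status: list[int] = field(default_factory=list)
    # One entry per field, shared across all samples — O(F), not O(B×F):
    field_schema: dict[str, dict] = field(default_factory=dict)

    def all_ready(self) -> bool:
        # The real code stores production_status as an int8 ndarray and uses np.all().
        return all(s == 1 for s in self.production_status)

    def select_samples(self, positions: list[int]) -> "ColumnarBatchMeta":
        # Slicing touches only the per-sample columns; field_schema is shared as-is.
        return ColumnarBatchMeta(
            global_indexes=[self.global_indexes[p] for p in positions],
            partition_ids=[self.partition_ids[p] for p in positions],
            production_status=[self.production_status[p] for p in positions],
            field_schema=self.field_schema,
        )


meta = ColumnarBatchMeta(
    global_indexes=[10, 11, 12],
    partition_ids=["p0", "p0", "p1"],
    production_status=[1, 1, 0],
    field_schema={"obs": {"dtype": "float32", "shape": (4,)}},
)
sub = meta.select_samples([0, 2])
assert sub.global_indexes == [10, 12]
assert not meta.all_ready()
```

Because the field schema is a single shared dict, adding a field or checking readiness no longer requires touching every `SampleMeta`/`FieldMeta` instance, which is where the O(B×F) loops came from.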