Rework deserializing #15

Alex-PLACET · 2025-09-02T13:11:33Z

No description provided.

pitrou · 2025-09-04T09:39:08Z

include/sparrow_ipc/encapsulated_message.hpp

+
+    private:
+
+        const uint8_t* m_buf_ptr;


If you don't store the buffer length, how can you make sure you're not accessing memory out of bounds?

pitrou · 2025-09-04T09:43:07Z

include/sparrow_ipc/encapsulated_message.hpp

+        const uint8_t* m_buf_ptr;
+    };
+
+    [[nodiscard]] EncapsulatedMessage create_encapsulated_message(const uint8_t* buf_ptr);


Perhaps you want an API like:

// Return the encapsulated message and the rest of the span
std::pair<EncapsulatedMessage, std::span<const uint8_t>> extract_encapsulated_message(std::span<const uint8_t>);

include/sparrow_ipc/deserialize_primitive_array.hpp

src/encapsulated_message.cpp

Hind-M · 2025-09-04T12:49:45Z

As a side note, I think in the future, and for relatively big PRs with multiple files like this one, having multiple commits where corresponding messages describe what they are doing would make the review easier.
The linting changes are adding some noise as well, maybe keep that in specific independent PRs in the future.

I don't know if the additional code is making the project not buildable anymore, but maybe it's best if we remove all the versioning and install parts to focus on the core changes in this PR.

Hind-M · 2025-09-04T12:52:07Z

Are we planning eventually to use functions from sparrow directly (everything related to arrow_interface and comparison functions in the tests)?

Copilot

Pull Request Overview

This PR implements a comprehensive rework of the serialization system to improve code organization, add new functionality, and enhance testing infrastructure. The changes restructure the codebase with proper namespacing, introduce new deserialization capabilities, and add extensive integration testing with Arrow data files.

Reorganized headers and source files with proper sparrow_ipc namespace structure
Added new deserialization functionality for streams and various array types
Introduced comprehensive integration testing with Arrow testing data files

Reviewed Changes

Copilot reviewed 41 out of 44 changed files in this pull request and generated 5 comments.

File	Description
tests/test_utils.cpp	Reformatted test assertions for better readability
tests/test_primitive_array_with_files.cpp	New integration tests comparing stream vs JSON deserialization
tests/test_primitive_array_serialization.cpp	Updated includes and minor formatting improvements
tests/test_null_array_serialization.cpp	Updated includes to use new header structure
tests/test_arrow_schema.cpp	New comprehensive tests for Arrow schema functionality
tests/metadata_sample.hpp	New helper for metadata testing with endianness support
tests/CMakeLists.txt	Added new test files and dependencies
src/utils.cpp	Updated includes and improved code formatting
src/serialize_null_array.cpp	Updated to use new deserialization functions
src/serialize.cpp	Moved deserialization functions and improved formatting
src/metadata.cpp	New utility for metadata conversion
src/encapsulated_message.cpp	New class for handling encapsulated Arrow messages
src/deserialize_utils.cpp	New utilities for deserialization operations
src/deserialize_fixedsizebinary_array.cpp	New deserialization for fixed-size binary arrays
src/deserialize.cpp	New comprehensive deserialization implementation
Multiple header files	Reorganized with proper namespace structure and new functionality

Comments suppressed due to low confidence (3)

src/utils.cpp:1

[nitpick] These multi-line trailing comments are hard to read and maintain. Consider moving them above the variable declarations or making them single-line comments.

#include "sparrow_ipc/utils.hpp"

tests/test_utils.cpp

Copilot · 2025-09-05T13:14:29Z

src/utils.cpp

+                    const auto map_type = org::apache::arrow::flatbuf::CreateMap(builder, false);  // not
+                                                                                                   // sorted
+                                                                                                   // keys


[nitpick] The multi-line comment is unnecessarily fragmented. Consider using a single-line comment: // not sorted keys or moving the comment above the line.

Suggested change

-                    const auto map_type = org::apache::arrow::flatbuf::CreateMap(builder, false);  // not
-                                                                                                   // sorted
-                                                                                                   // keys
+                    const auto map_type = org::apache::arrow::flatbuf::CreateMap(builder, false);  // not sorted keys

Copilot · 2025-09-05T13:14:29Z

include/sparrow_ipc/encapsulated_message.hpp

@@ -0,0 +1,44 @@
+#include <cstdint>


Missing header guard. Add #pragma once at the beginning of the file to prevent multiple inclusions.

pitrou · 2025-09-08T10:22:39Z

include/sparrow_ipc/deserialize_variable_size_binary_array.hpp

+        const auto offset_metadata = record_batch.buffers()->Get(buffer_index++);
+        auto offset_ptr = const_cast<uint8_t*>(body.data() + offset_metadata->offset());
+        const auto buffer_metadata = record_batch.buffers()->Get(buffer_index++);
+        auto buffer_ptr = const_cast<uint8_t*>(body.data() + buffer_metadata->offset());


Here as well, you might check that the advertised buffer lengths for offsets and data are consistent with the batch length and fall within bounds of the body.

Besides being good security practice, having such checks will make the debugging experience much nicer in general.

pitrou · 2025-09-08T10:26:40Z

include/sparrow_ipc/serialize_primitive_array.hpp

-        details::deserialize_schema_message(buf_ptr, current_offset, name, metadata);
+        deserialize_schema_message(std::span<const uint8_t>(buffer), current_offset, name, metadata);

        // II - Deserialize the RecordBatch message


I'm a bit surprised by the logic here. Typically, an IPC stream has a single Schema message at start, followed by an arbitrary number of RecordBatch messages (optionally other messages as well). Here you seem to be assuming that the IPC stream will only ever contain a single RecordBatch, which is weird.

In other words, I would not expect a single stateless function to handle both Schema and RecordBatch at once.

I'm also surprised that this function knows the desired type primitive_array<T> before deserializing the Schema.

I only reworked the deserialization; the serialization will be handled in another PR. This is the old implementation, which is kept for the tests but will be removed soon.
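
The stream layout described in the review comment above (one Schema message, then any number of RecordBatch messages) suggests a stateful reader rather than a single stateless function. The following is only a sketch under that assumption: `Schema`, `RecordBatch` and `StreamReader` are simplified, hypothetical types, not the sparrow-ipc or flatbuffer ones:

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for decoded IPC messages.
struct Schema
{
    std::vector<std::string> field_names;
};

struct RecordBatch
{
    std::size_t row_count;
};

// Stateful reader: the schema is decoded once, then reused for every batch.
class StreamReader
{
public:

    void on_schema(Schema schema)
    {
        if (m_schema.has_value())
        {
            throw std::runtime_error("unexpected second Schema message");
        }
        m_schema = std::move(schema);
    }

    void on_record_batch(RecordBatch batch)
    {
        if (!m_schema.has_value())
        {
            throw std::runtime_error("RecordBatch received before Schema");
        }
        m_batches.push_back(batch);
    }

    const Schema& schema() const
    {
        return m_schema.value();
    }

    const std::vector<RecordBatch>& batches() const
    {
        return m_batches;
    }

private:

    std::optional<Schema> m_schema;
    std::vector<RecordBatch> m_batches;
};
```

Because the schema lives in the reader, the concrete array types are only known after the Schema message has been handled, and nothing has to be decoded again for each batch.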

pitrou · 2025-09-08T10:29:28Z

include/sparrow_ipc/serialize_primitive_array.hpp

+        );

        current_offset += utils::align_to_8(batch_meta_len);
        const uint8_t* body_ptr = buf_ptr + current_offset;


Unrelated to this PR, but I'm surprised to see raw new invocations below.

It's the old implementation, which will be removed soon.

pitrou · 2025-09-08T10:32:00Z

src/deserialize.cpp

+    )
+    {
+        const uint32_t schema_meta_len = *(reinterpret_cast<const uint32_t*>(data.data() + current_offset));
+        current_offset += sizeof(uint32_t);


Similarly, you might want some bounds checking here.

pitrou · 2025-09-08T10:46:26Z

src/deserialize.cpp

+            schema_message->header()
+        );
+        const auto fields = flatbuffer_schema->fields();
+        if (fields->size() != 1)


I suppose you plan to remove this limitation later? Otherwise you'll fail on most real-world data.

pitrou · 2025-09-08T10:47:18Z

src/deserialize.cpp

+        size_t buffer_index = 0;
+
+        std::vector<sparrow::array> arrays;
+        arrays.reserve(schema.fields()->size());


Interestingly you are not limited to 1 field here.

pitrou · 2025-09-08T10:50:54Z

src/deserialize.cpp

+            const std::optional<std::vector<sparrow::metadata_pair>>
+                metadata = fb_custom_metadata == nullptr
+                               ? std::nullopt
+                               : std::make_optional(to_sparrow_metadata(*fb_custom_metadata));


This deserializes Schema metadata again for each RecordBatch, why?

For the moment I avoid recreating the metadata, but we still create the ArrowSchema several times. I created a PR to share the same schema across several record batches: #20

pitrou · 2025-09-08T11:14:40Z

src/deserialize.cpp

+        const EncapsulatedMessage& encapsulated_message
+    )
+    {
+        const size_t length = static_cast<size_t>(record_batch.length());


Given that you don't support buffer compression, you should IMHO check the compression field in the RecordBatch and error out if present. Better than returning garbage data to the user :)

I created an issue for that: #18
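
Until compression is supported, the guard suggested above could be as small as the following sketch. It assumes the record batch type exposes a `compression()` accessor returning a pointer that is null when the body is not compressed; `ensure_uncompressed` is a hypothetical name:

```cpp
#include <stdexcept>

// Works with any record-batch type exposing a `compression()` accessor that
// returns a pointer, null when the body is not compressed (assumed here).
template <class RecordBatchT>
void ensure_uncompressed(const RecordBatchT& record_batch)
{
    if (record_batch.compression() != nullptr)
    {
        throw std::runtime_error("compressed record batch bodies are not supported");
    }
}
```

Calling it before any buffer is touched turns silently wrong data into an explicit error.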

pitrou · 2025-09-08T11:17:24Z

src/deserialize.cpp

+            {
+                case org::apache::arrow::flatbuf::MessageHeader::Schema:
+                {
+                    schema = message->header_as_Schema();


Why do you have a function deserialize_schema_message if it's not being used here?

I removed the deserialize_schema_message 👍

Hind-M · 2025-09-08T11:23:40Z

So what's the plan here? This rework seems to be intending to drop all specific serializations/deserializations (primitive_array, null_array) since they are not used. If that's the case, we should remove them here to avoid confusion.

pitrou · 2025-09-08T11:35:07Z

src/serialize.cpp


-            const size_t current_size = final_buffer.size(); // Get the current size (which is the end of the Schema message)
+            const size_t current_size = final_buffer.size();  // Get the current size (which is the end of the
+                                                              // Schema message)


The comment assumes you're only serializing a single RecordBatch in the IPC stream?

This will be reworked in another PR 👍, I just reworked the deserialization for the moment

pitrou · 2025-09-08T11:37:02Z

src/serialize.cpp

+            const ArrowArray& arrow_arr,
+            const std::vector<int64_t>& buffers_sizes,
+            std::vector<uint8_t>& final_buffer
+        )


In a more performance-minded implementation, you would probably want to serialize directly to a generic writable handle (which can be an in-memory buffer writer but also a file handle), to avoid making an intermediate copy of all buffers before emitting them. Perhaps std::ostream is a suitable abstraction or perhaps not, I don't know :)

This will be reworked in another PR 👍, I just reworked the deserialization for the moment

pitrou · 2025-09-08T11:38:43Z

src/serialize.cpp

+                                                                // start

            // Write the 4-byte metadata length for the RecordBatch message
            *(reinterpret_cast<uint32_t*>(dst)) = batch_meta_len;


About the code below: the null bitmap is optional in the IPC stream as well, so the memset thing is really sub-optimal :)

This will be reworked in another PR 👍
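
A small sketch of what skipping the bitmap could mean on the writing side, based on the remark above that the null bitmap is optional: when an array has no nulls, an empty validity buffer is emitted instead of an allocated, filled bitmap. `validity_bytes_to_write` is a hypothetical helper and ignores padding:

```cpp
#include <cstddef>
#include <cstdint>

// Number of validity-bitmap bytes to emit for an array (before any padding).
// With no nulls, the bitmap is left out, i.e. written as an empty buffer.
constexpr std::size_t validity_bytes_to_write(int64_t length, int64_t null_count)
{
    if (null_count == 0 || length <= 0)
    {
        return 0;
    }
    // One bit per element, rounded up to whole bytes.
    return static_cast<std::size_t>((length + 7) / 8);
}
```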

Alex-PLACET · 2025-09-08T13:24:21Z

@Hind-M Ok let's delete the old implementation

codecov-commenter · 2025-09-15T13:17:26Z

Codecov Report

❌ Patch coverage is 82.96296% with 69 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@43abdae). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/deserialize.cpp	80.73%	21 Missing ⚠️
src/encapsulated_message.cpp	62.74%	19 Missing ⚠️
src/utils.cpp	74.50%	13 Missing ⚠️
src/metadata.cpp	0.00%	9 Missing ⚠️
...ow_interface/arrow_array_schema_common_release.hpp	93.75%	2 Missing ⚠️
...row_ipc/deserialize_variable_size_binary_array.hpp	88.23%	2 Missing ⚠️
...nclude/sparrow_ipc/deserialize_primitive_array.hpp	92.30%	1 Missing ⚠️
src/deserialize_fixedsizebinary_array.cpp	92.30%	1 Missing ⚠️
src/deserialize_utils.cpp	90.00%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #15   +/-   ##
=======================================
  Coverage        ?   70.91%           
=======================================
  Files           ?       18           
  Lines           ?      832           
  Branches        ?        0           
=======================================
  Hits            ?      590           
  Misses          ?      242           
  Partials        ?        0

Flag	Coverage Δ
unittests	`70.91% <82.96%> (?)`

Alex-PLACET self-assigned this Sep 2, 2025

Alex-PLACET requested a review from Copilot September 3, 2025 15:05

Alex-PLACET force-pushed the rework_serializing branch 2 times, most recently from 85ef02d to f82d723 Compare September 3, 2025 15:07

Alex-PLACET marked this pull request as ready for review September 4, 2025 09:27

Alex-PLACET requested review from Hind-M and JohanMabille September 4, 2025 09:28

pitrou reviewed Sep 4, 2025

Hind-M reviewed Sep 4, 2025

Alex-PLACET added 4 commits September 4, 2025 15:38

Rework serialization

df92795

wip wip wip WIP wip wip wip wip wip

wip

190af27

wip

c6f0202

wip

620ea81

Alex-PLACET force-pushed the rework_serializing branch from 9347b29 to 620ea81 Compare September 4, 2025 13:38

Alex-PLACET added 8 commits September 4, 2025 15:46

Fix osx build

9fd39b8

fix compilation

5b779ba

wip

9c32fc0

compilation fix

1c1cd18

wip

2db5795

Use std::span

e618b0b

wip

4b8cf6c

wip

db55113

Alex-PLACET requested review from Hind-M, Copilot and pitrou September 5, 2025 13:13

Copilot AI reviewed Sep 5, 2025

pitrou reviewed Sep 8, 2025

Alex-PLACET changed the title ~~Rework serializing~~ Rework deserializing Sep 8, 2025

wip

6359136

Alex-PLACET added 4 commits September 8, 2025 16:04

Remove serialization

b6734ad

wip

c36b7de

Avoid recreating metadata

0ae315d

address comments

b94eea7

Hind-M mentioned this pull request Sep 10, 2025

Add code coverage #17

Merged

Alex-PLACET added 3 commits September 11, 2025 11:47

Update conda env

115d3a9

TRY FIX

e4bb2e9

Upgrade sparrow version

dd01882

Fix windows run tests

02d7322

Alex-PLACET requested a review from JohanMabille September 17, 2025 06:56

JohanMabille approved these changes Sep 17, 2025

Alex-PLACET merged commit 963b02b into QuantStack:main Sep 17, 2025
14 of 27 checks passed

Alex-PLACET deleted the rework_serializing branch September 17, 2025 07:08
