
Conversation

Alex-PLACET
Member

No description provided.

@Alex-PLACET Alex-PLACET self-assigned this Sep 2, 2025
@Alex-PLACET Alex-PLACET requested a review from Copilot September 3, 2025 15:05
@Alex-PLACET Alex-PLACET force-pushed the rework_serializing branch 2 times, most recently from 85ef02d to f82d723 Compare September 3, 2025 15:07
Copilot

This comment was marked as outdated.

@Alex-PLACET Alex-PLACET marked this pull request as ready for review September 4, 2025 09:27

private:

const uint8_t* m_buf_ptr;
Member

If you don't store the buffer length, how can you make sure you're not accessing memory out of bounds?

const uint8_t* m_buf_ptr;
};

[[nodiscard]] EncapsulatedMessage create_encapsulated_message(const uint8_t* buf_ptr);
Member

Perhaps you want an API like:

// Return the encapsulated message and the rest of the span
std::pair<EncapsulatedMessage, std::span<const uint8_t>> extract_encapsulated_message(std::span<const uint8_t>);

@Hind-M
Member

Hind-M commented Sep 4, 2025

As a side note, for relatively big multi-file PRs like this one, splitting the work into several commits with descriptive messages would make the review easier.
The linting changes also add some noise; it may be better to keep them in separate, dedicated PRs in the future.

I don't know if the additional code is making the project not buildable anymore, but maybe it's best if we remove all the versioning and install parts to focus on the core changes in this PR.

@Hind-M
Member

Hind-M commented Sep 4, 2025

Are we planning eventually to use functions from sparrow directly (everything related to arrow_interface and comparison functions in the tests)?

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements a comprehensive rework of the serialization system to improve code organization, add new functionality, and enhance testing infrastructure. The changes restructure the codebase with proper namespacing, introduce new deserialization capabilities, and add extensive integration testing with Arrow data files.

  • Reorganized headers and source files with proper sparrow_ipc namespace structure
  • Added new deserialization functionality for streams and various array types
  • Introduced comprehensive integration testing with Arrow testing data files

Reviewed Changes

Copilot reviewed 41 out of 44 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/test_utils.cpp | Reformatted test assertions for better readability |
| tests/test_primitive_array_with_files.cpp | New integration tests comparing stream vs JSON deserialization |
| tests/test_primitive_array_serialization.cpp | Updated includes and minor formatting improvements |
| tests/test_null_array_serialization.cpp | Updated includes to use new header structure |
| tests/test_arrow_schema.cpp | New comprehensive tests for Arrow schema functionality |
| tests/metadata_sample.hpp | New helper for metadata testing with endianness support |
| tests/CMakeLists.txt | Added new test files and dependencies |
| src/utils.cpp | Updated includes and improved code formatting |
| src/serialize_null_array.cpp | Updated to use new deserialization functions |
| src/serialize.cpp | Moved deserialization functions and improved formatting |
| src/metadata.cpp | New utility for metadata conversion |
| src/encapsulated_message.cpp | New class for handling encapsulated Arrow messages |
| src/deserialize_utils.cpp | New utilities for deserialization operations |
| src/deserialize_fixedsizebinary_array.cpp | New deserialization for fixed-size binary arrays |
| src/deserialize.cpp | New comprehensive deserialization implementation |
| Multiple header files | Reorganized with proper namespace structure and new functionality |
Comments suppressed due to low confidence (3)

src/utils.cpp:1

  • [nitpick] These multi-line trailing comments are hard to read and maintain. Consider moving them above the variable declarations or making them single-line comments.
#include "sparrow_ipc/utils.hpp"

src/utils.cpp:1

  • [nitpick] These multi-line trailing comments are hard to read and maintain. Consider moving them above the variable declarations or making them single-line comments.
#include "sparrow_ipc/utils.hpp"

src/utils.cpp:1

  • [nitpick] These multi-line trailing comments are hard to read and maintain. Consider moving them above the variable declarations or making them single-line comments.
#include "sparrow_ipc/utils.hpp"


src/utils.cpp Outdated
Comment on lines 380 to 382
const auto map_type = org::apache::arrow::flatbuf::CreateMap(builder, false); // not
// sorted
// keys
Copilot AI Sep 5, 2025

[nitpick] The multi-line comment is unnecessarily fragmented. Consider using a single-line comment: // not sorted keys or moving the comment above the line.

Suggested change
- const auto map_type = org::apache::arrow::flatbuf::CreateMap(builder, false); // not
- // sorted
- // keys
+ const auto map_type = org::apache::arrow::flatbuf::CreateMap(builder, false); // not sorted keys


@@ -0,0 +1,44 @@
#include <cstdint>
Copilot AI Sep 5, 2025

Missing header guard. Add #pragma once at the beginning of the file to prevent multiple inclusions.


const auto offset_metadata = record_batch.buffers()->Get(buffer_index++);
auto offset_ptr = const_cast<uint8_t*>(body.data() + offset_metadata->offset());
const auto buffer_metadata = record_batch.buffers()->Get(buffer_index++);
auto buffer_ptr = const_cast<uint8_t*>(body.data() + buffer_metadata->offset());
Member

Here as well, you might check that the advertised buffer lengths for offsets and data are consistent with the batch length and fall within bounds of the body.

Besides being good security practice, having such checks will make the debugging experience much nicer in general.

Member Author

fixed

details::deserialize_schema_message(buf_ptr, current_offset, name, metadata);
deserialize_schema_message(std::span<const uint8_t>(buffer), current_offset, name, metadata);

// II - Deserialize the RecordBatch message
Member

I'm a bit surprised by the logic here. Typically, an IPC stream has a single Schema message at start, followed by an arbitrary number of RecordBatch messages (optionally other messages as well). Here you seem to be assuming that the IPC stream will only ever contain a single RecordBatch, which is weird.

Member

In other words, I would not expect a single stateless function to handle both Schema and RecordBatch at once.

I'm also surprised that this function knows the desired type primitive_array<T> before deserializing the Schema.
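One way to picture the stateful alternative is the sketch below. It is an assumption-laden illustration, not the PR's design: `StreamReader`, `Message` and `MessageKind` are hypothetical, and the payload string stands in for decoded flatbuffer data. The reader accepts one Schema message first and then any number of RecordBatch messages.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class MessageKind
{
    Schema,
    RecordBatch
};

// Hypothetical decoded message; the payload stands in for flatbuffer data.
struct Message
{
    MessageKind kind;
    std::string payload;
};

// Keeps the schema as state so that each record batch is decoded against it.
class StreamReader
{
public:

    void consume(Message message)
    {
        if (message.kind == MessageKind::Schema)
        {
            if (m_schema.has_value())
            {
                throw std::runtime_error("unexpected second Schema message");
            }
            m_schema = std::move(message.payload);
            return;
        }
        if (!m_schema.has_value())
        {
            throw std::runtime_error("RecordBatch received before Schema");
        }
        m_batches.push_back(std::move(message.payload));
    }

    const std::optional<std::string>& schema() const
    {
        return m_schema;
    }

    std::size_t batch_count() const
    {
        return m_batches.size();
    }

private:

    std::optional<std::string> m_schema;
    std::vector<std::string> m_batches;
};
```

In this shape the array types come from the stored schema rather than from a template parameter chosen by the caller before anything has been read.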

Member Author

I only reworked the deserialization. The serialization will be in another PR. Here it's the old implementation which is kept for tests. But it will be removed soon.

);

current_offset += utils::align_to_8(batch_meta_len);
const uint8_t* body_ptr = buf_ptr + current_offset;
Member

Unrelated to this PR, but I'm surprised to see raw new invocations below.

Member Author

It's the old implementation, which will be removed soon.

)
{
const uint32_t schema_meta_len = *(reinterpret_cast<const uint32_t*>(data.data() + current_offset));
current_offset += sizeof(uint32_t);
Member

Similarly, you might want some bounds checking here.

Member Author

fixed

schema_message->header()
);
const auto fields = flatbuffer_schema->fields();
if (fields->size() != 1)
Member

I suppose you plan to remove this limitation later? Otherwise you'll fail on most real-world data.

Member Author

removed

size_t buffer_index = 0;

std::vector<sparrow::array> arrays;
arrays.reserve(schema.fields()->size());
Member

Interestingly you are not limited to 1 field here.

const std::optional<std::vector<sparrow::metadata_pair>>
metadata = fb_custom_metadata == nullptr
? std::nullopt
: std::make_optional(to_sparrow_metadata(*fb_custom_metadata));
Member

This deserializes Schema metadata again for each RecordBatch, why?

Member Author

Good point

Member Author

For the moment I avoid recreating the metadata, but we still create the ArrowSchema several times. I created a PR to share the same schema across several record batches: #20

const EncapsulatedMessage& encapsulated_message
)
{
const size_t length = static_cast<size_t>(record_batch.length());
Member

Given that you don't support buffer compression, you should IMHO check the compression field in the RecordBatch and error out if present. Better than returning garbage data to the user :)
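A guard of this kind could look like the sketch below. It assumes the record batch type exposes a `compression()` accessor that returns a null pointer when the body is uncompressed; that is an assumption about the generated flatbuffer code, not something shown in this thread. `require_uncompressed`, `FakeCompression` and `FakeRecordBatch` are hypothetical names used only for illustration.

```cpp
#include <stdexcept>

// Batch is any type with a compression() accessor returning a pointer that is
// null when the body is uncompressed (assumed, see the note above).
template <class Batch>
void require_uncompressed(const Batch& batch)
{
    if (batch.compression() != nullptr)
    {
        throw std::runtime_error("compressed record batches are not supported");
    }
}

// Minimal stand-ins used only to exercise the check.
struct FakeCompression
{
};

struct FakeRecordBatch
{
    const FakeCompression* m_compression = nullptr;

    const FakeCompression* compression() const
    {
        return m_compression;
    }
};
```

Calling the guard before touching any buffer turns an unsupported input into an explicit error rather than silently decoded garbage.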

Member Author

I created an issue for that: #18

{
case org::apache::arrow::flatbuf::MessageHeader::Schema:
{
schema = message->header_as_Schema();
Member

Why do you have a function deserialize_schema_message if it's not being used here?

Member Author

I removed the deserialize_schema_message 👍

@Hind-M
Member

Hind-M commented Sep 8, 2025

So what's the plan here? This rework seems intended to drop all the type-specific serializations/deserializations (primitive_array, null_array), since they are no longer used. If that's the case, we should remove them here to avoid confusion.


const size_t current_size = final_buffer.size(); // Get the current size (which is the end of the Schema message)
const size_t current_size = final_buffer.size(); // Get the current size (which is the end of the
// Schema message)
Member

The comment assumes you're only serializing a single RecordBatch in the IPC stream?

Member Author

This will be reworked in another PR 👍, I just reworked the deserialization for the moment

Member Author

removed

const ArrowArray& arrow_arr,
const std::vector<int64_t>& buffers_sizes,
std::vector<uint8_t>& final_buffer
)
Member

In a more performance-minded implementation, you would probably want to serialize directly to a generic writable handle (which can be an in-memory buffer writer but also a file handle), to avoid making an intermediate copy of all buffers before emitting them. Perhaps std::ostream is a suitable abstraction or perhaps not, I don't know :)

Member Author

This will be reworked in another PR 👍, I just reworked the deserialization for the moment

Member Author

removed

// start

// Write the 4-byte metadata length for the RecordBatch message
*(reinterpret_cast<uint32_t*>(dst)) = batch_meta_len;
Member

About the code below: the null bitmap is optional in the IPC stream as well, so the memset thing is really sub-optimal :)

Member Author

This will be reworked in another PR 👍

Member Author

removed

@Alex-PLACET Alex-PLACET changed the title Rework serializing Rework deserializing Sep 8, 2025
@Alex-PLACET
Member Author

@Hind-M Ok let's delete the old implementation

@Hind-M Hind-M mentioned this pull request Sep 10, 2025
@codecov-commenter

codecov-commenter commented Sep 15, 2025


Codecov Report

❌ Patch coverage is 82.96296% with 69 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@43abdae). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/deserialize.cpp | 80.73% | 21 Missing ⚠️ |
| src/encapsulated_message.cpp | 62.74% | 19 Missing ⚠️ |
| src/utils.cpp | 74.50% | 13 Missing ⚠️ |
| src/metadata.cpp | 0.00% | 9 Missing ⚠️ |
| ...ow_interface/arrow_array_schema_common_release.hpp | 93.75% | 2 Missing ⚠️ |
| ...row_ipc/deserialize_variable_size_binary_array.hpp | 88.23% | 2 Missing ⚠️ |
| ...nclude/sparrow_ipc/deserialize_primitive_array.hpp | 92.30% | 1 Missing ⚠️ |
| src/deserialize_fixedsizebinary_array.cpp | 92.30% | 1 Missing ⚠️ |
| src/deserialize_utils.cpp | 90.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #15   +/-   ##
=======================================
  Coverage        ?   70.91%           
=======================================
  Files           ?       18           
  Lines           ?      832           
  Branches        ?        0           
=======================================
  Hits            ?      590           
  Misses          ?      242           
  Partials        ?        0           
Flag Coverage Δ
unittests 70.91% <82.96%> (?)


@Alex-PLACET Alex-PLACET merged commit 963b02b into QuantStack:main Sep 17, 2025
14 of 27 checks passed
@Alex-PLACET Alex-PLACET deleted the rework_serializing branch September 17, 2025 07:08

5 participants