Expose the ability to have zero allocation sends. #4802

Open
wants to merge 9 commits into
base: master

cptspacemanspiff

The Objective:

At a high level, I want to use zeromq messages to send data as fast as possible from a sending thread. However, as I have started going through the optimization process with perf on x64 Linux, about 25% of my steady-state time is spent in malloc calls allocating the control blocks for the reference-counted long messages.

This pull request is a request to expose a public API for the zero-copy long message type type_zclmsg, which is used internally on the receive side.

References:

As part of this process I have looked at previous issues that reference this topic, both to avoid stomping on things and to get a better understanding of the concerns (if I have missed any, it would be great to know):

Most of the discussion seems to be in:
#2795

Though this PR also solves the issue here:
#4343

Changes:

The primary change is to expose the internal function init_external_storage and let users pass in a pointer to a preallocated memory block in which the init function constructs the content_t control block. The method and struct we want to expose are below:

// src/msg.hpp

    struct content_t
    {
        void *data;
        size_t size;
        msg_free_fn *ffn;
        void *hint;
        zmq::atomic_counter_t refcnt;
    };

// The above control block is 40 bytes, with 8 byte alignment on x64.

    int init_external_storage (content_t *content_,
                               void *data_,
                               size_t size_,
                               msg_free_fn *ffn_,
                               void *hint_);

To avoid exposing private implementation details, and to allow for future modification of the internals, we round the control block size up from ~40 to 64 bytes and expose the larger, opaque structure to users in the draft API.
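The size argument above can be checked at compile time. The following is only a sketch: `content_mirror_t`, `opaque_content_t` and `spare_bytes` are hypothetical names, and `std::atomic<std::uint32_t>` is an assumed stand-in for `zmq::atomic_counter_t`, so the numbers are illustrative rather than taken from the library.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the internal content_t, used for size checks only.
// ASSUMPTION: std::atomic<std::uint32_t> stands in for zmq::atomic_counter_t.
typedef void (free_fn_mirror) (void *data, void *hint);

struct content_mirror_t
{
    void *data;
    std::size_t size;
    free_fn_mirror *ffn;
    void *hint;
    std::atomic<std::uint32_t> refcnt;
};

// Hypothetical stand-in for the opaque 64-byte block in the draft API.
struct opaque_content_t
{
    alignas (sizeof (void *)) unsigned char _[64];
};

// The internal block must fit inside the public one, and the public one
// must be at least as strictly aligned as the internal one.
static_assert (sizeof (content_mirror_t) <= sizeof (opaque_content_t),
               "control block does not fit in the opaque storage");
static_assert (alignof (content_mirror_t) <= alignof (opaque_content_t),
               "opaque storage is under-aligned for the control block");

// Bytes left over for future growth of the internal struct.
constexpr std::size_t spare_bytes ()
{
    return sizeof (opaque_content_t) - sizeof (content_mirror_t);
}
```

A pair of static assertions like these inside the library would turn any future growth of the control block past 64 bytes into a build error instead of silent memory corruption.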

// include/zmq.h -- In the Draft API:

typedef struct zmq_msg_content_t{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
  __declspec (align (8)) unsigned char _[64];
#elif defined(_MSC_VER)                                                        \
&& (defined(_M_IX86) || defined(_M_ARM_ARMV7VE) || defined(_M_ARM))
  __declspec (align (4)) unsigned char _[64];
#elif defined(__GNUC__) || defined(__INTEL_COMPILER)                           \
|| (defined(__SUNPRO_C) && __SUNPRO_C >= 0x590)                              \
|| (defined(__SUNPRO_CC) && __SUNPRO_CC >= 0x590)
  unsigned char _[64] __attribute__ ((aligned (sizeof (void *))));
#else
  unsigned char _[64];
#endif
} zmq_msg_content_t;

// The above is lifted from the definition of zmq_msg_t; it has 8-byte alignment on x64.

ZMQ_EXPORT int zmq_msg_init_external_storage (
  zmq_msg_t *msg_, zmq_msg_content_t *content_, void *data_, size_t size_, zmq_free_fn *ffn_, void *hint_);

Thus, users just allocate a 64-byte control block and pass it to zmq. They are then responsible for the lifetime of the block (in most use cases, it will probably be released in the zmq_free_fn *).
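One way to handle that lifetime is to keep the control block and the payload in the same allocation and release both from the free function. The sketch below does not call into zmq at all: `content_storage_t`, `message_slot_t`, the tiny fixed pool and its helper functions are hypothetical, and only the callback signature `void (void *data_, void *hint_)` comes from the existing `zmq_free_fn`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the proposed 64-byte zmq_msg_content_t.
struct content_storage_t
{
    alignas (sizeof (void *)) unsigned char _[64];
};

// Hypothetical slot: control block and payload live in one allocation.
struct message_slot_t
{
    content_storage_t content;   // would be passed as content_
    unsigned char payload[1024]; // would be passed as data_
};

// Tiny fixed pool so that neither acquire nor release touches malloc.
const std::size_t pool_size = 4;
static message_slot_t g_pool[pool_size];
static bool g_in_use[pool_size];

message_slot_t *acquire_slot ()
{
    for (std::size_t i = 0; i < pool_size; ++i) {
        if (!g_in_use[i]) {
            g_in_use[i] = true;
            return &g_pool[i];
        }
    }
    return nullptr; // pool exhausted
}

// Same shape as zmq_free_fn. The hint carries the slot, so releasing the
// payload also releases the control block that sits next to it.
void release_slot (void *data_, void *hint_)
{
    message_slot_t *slot = static_cast<message_slot_t *> (hint_);
    (void) data_; // equal to slot->payload in this layout
    g_in_use[slot - g_pool] = false;
}

std::size_t slots_in_use ()
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < pool_size; ++i)
        n += g_in_use[i] ? 1 : 0;
    return n;
}
```

With the proposed API, the slot would presumably be handed over by passing `&slot->content` as `content_`, `slot->payload` as `data_`, `release_slot` as `ffn_` and `slot` as `hint_` to `zmq_msg_init_external_storage`. The single-threaded pool here is only for illustration; a real sender would use its own lock-free pool.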

Internally, all objects in the control data block are manually destructed as part of zmq_msg_close (any nontrivial types are constructed with placement new):

// src/msg.cpp -- zmq::msg_t::close ()

    if (is_zcmsg ()) {
        zmq_assert (_u.zclmsg.content->ffn);

        //  If the content is not shared, or if it is shared and the reference
        //  count has dropped to zero, deallocate it.
        if (!(_u.zclmsg.flags & msg_t::shared)
            || !_u.zclmsg.content->refcnt.sub (1)) {
            //  We used "placement new" operator to initialize the reference
            //  counter so we call the destructor explicitly now.
            _u.zclmsg.content->refcnt.~atomic_counter_t ();

            _u.zclmsg.content->ffn (_u.zclmsg.content->data,
                                    _u.zclmsg.content->hint);
        }
    }
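The release decision above can be modelled outside the library. This is a simplified sketch, not the zmq implementation: `control_block_t`, `external_storage_t`, `init_in_storage`, `add_ref`, `release` and `count_free` are hypothetical names, and `std::atomic<int>` is an assumed stand-in for `zmq::atomic_counter_t`. It shows the control block being constructed with placement new in caller-owned storage and destroyed explicitly before the free function runs.

```cpp
#include <atomic>
#include <cassert>
#include <new>

typedef void (free_fn_t) (void *data, void *hint);

// Simplified stand-in for content_t.
// ASSUMPTION: std::atomic<int> replaces zmq::atomic_counter_t.
struct control_block_t
{
    void *data;
    free_fn_t *ffn;
    void *hint;
    std::atomic<int> refcnt;
};

// Caller-owned storage, analogous to the proposed 64-byte opaque block.
struct external_storage_t
{
    alignas (8) unsigned char _[64];
};

static_assert (sizeof (control_block_t) <= sizeof (external_storage_t),
               "control block must fit in the external storage");
static_assert (alignof (control_block_t) <= alignof (external_storage_t),
               "external storage must be sufficiently aligned");

// Construct the control block inside caller-owned storage: no allocation.
control_block_t *init_in_storage (external_storage_t *storage,
                                  void *data, free_fn_t *ffn, void *hint)
{
    control_block_t *cb = new (storage->_) control_block_t ();
    cb->data = data;
    cb->ffn = ffn;
    cb->hint = hint;
    cb->refcnt.store (1);
    return cb;
}

void add_ref (control_block_t *cb)
{
    cb->refcnt.fetch_add (1);
}

// Mirrors the decision in close (): free when the message is not shared,
// or when the last shared reference is dropped. Returns true if ffn ran.
bool release (control_block_t *cb, bool shared)
{
    if (!shared || cb->refcnt.fetch_sub (1) == 1) {
        free_fn_t *ffn = cb->ffn;
        void *data = cb->data;
        void *hint = cb->hint;
        cb->~control_block_t (); // explicit destructor pairs with placement new
        ffn (data, hint);        // the storage may be reclaimed from here on
        return true;
    }
    return false;
}

static int g_free_calls = 0;

void count_free (void *, void *)
{
    ++g_free_calls;
}
```

Copying the fields out before the destructor call matters in this model because the free function is the point at which the caller may recycle the storage that holds the control block.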

My Use-case:

For my sending thread, I have a slab allocator that pulls memory from a lock-free object pool. Under the hood this uses a freelist that batches memory allocations. This is fast, though to reduce thread contention on the CAS loop, I generally pull larger chunks and construct multiple smaller messages within each allocated buffer.

The messages I am using are not huge, but not tiny, at around 1 KB, so I am currently using zmq_msg_init_data. This works: there is an inline reference-counted control block at the beginning of the allocation that is decremented in zmq's free_fn.
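A rough sketch of that arrangement, for readers who have not seen the pattern: one slab holds several messages, an inline counter tracks how many are still alive, and a `zmq_free_fn`-shaped callback decrements it. All names here (`slab_t`, `carve_message`, `release_message`, the 4 KB slab size) are hypothetical, and the `returned_to_pool` flag merely stands in for pushing the slab back onto the freelist.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

const std::size_t message_size = 1024;
const std::size_t messages_per_slab = 4;

// Hypothetical slab: one pool allocation carved into several messages.
struct slab_t
{
    std::atomic<int> live_messages; // inline reference count
    bool returned_to_pool;          // stands in for a freelist push
    unsigned char bytes[messages_per_slab * message_size];
};

void init_slab (slab_t *slab)
{
    slab->live_messages.store (0);
    slab->returned_to_pool = false;
}

// Hand out the index-th message of the slab and count it as live.
void *carve_message (slab_t *slab, std::size_t index)
{
    slab->live_messages.fetch_add (1, std::memory_order_relaxed);
    return slab->bytes + index * message_size;
}

// Same shape as zmq_free_fn: hint_ is the slab that owns data_.
void release_message (void *data_, void *hint_)
{
    (void) data_;
    slab_t *slab = static_cast<slab_t *> (hint_);
    // The last message to be released hands the whole slab back.
    if (slab->live_messages.fetch_sub (1, std::memory_order_acq_rel) == 1)
        slab->returned_to_pool = true;
}
```

In a multithreaded sender the producer would normally hold one extra reference of its own until it has finished carving, so that an early release on the I/O thread cannot return a slab that is still being filled.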

At this point, the malloc time from control block allocation starts to add up.

I realize I could send larger messages, and I will be doing that, but it involves rewriting a lot of the code and message logic on both the send and receive sides. Just allowing me to point to a preallocated control block at the beginning of the message gives me an easy ~25% speedup.

Additionally, I am generally wary of the non-deterministic nature of new/delete, especially in the hot path loop.

Gotchas? / random thoughts?

Trying to think of issues with this, it seems pretty safe. It obviously involves properly managing and releasing the control block memory, but that should be easy for people already using a free function. It is not a solution for all issues, but requiring zmq to call new/delete for any message above ~33 bytes seems like a less than ideal scenario.

It also seems like the content object, as an internal API, is very stable (last touched 9 years ago?).

One question I was thinking about is the alignment. 8 bytes is the minimum on x64, and it was easy to copy and paste the msg_t alignment, but it may be better to raise it to 16 bytes in case there is ever a need to use 128-bit CAS-type instructions. That being said, people who use the library should respect the alignment of the types they are given rather than assume one.
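To illustrate the last point, user code that carves control blocks out of a raw byte buffer can take the alignment from the type instead of hard-coding 8. This is a sketch only: `content_storage_t`, `content_storage16_t` and `place_in_buffer` are hypothetical names, and the 16-byte variant is the alternative floated above, not something in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical stand-in for the proposed opaque block (pointer alignment).
struct content_storage_t
{
    alignas (sizeof (void *)) unsigned char _[64];
};

// Hypothetical variant with the 16-byte alignment discussed above.
struct content_storage16_t
{
    alignas (16) unsigned char _[64];
};

// Place a T at the first suitably aligned address inside [buf, buf + len).
// Returns nullptr if the buffer is too small once alignment is applied.
template <typename T> T *place_in_buffer (unsigned char *buf, std::size_t len)
{
    void *p = buf;
    std::size_t space = len;
    if (!std::align (alignof (T), sizeof (T), p, space))
        return nullptr;
    return new (p) T; // the alignment requirement comes from T itself
}
```

Because the helper asks the type for its alignment, switching the library's block from 8 to 16 bytes would not require any change in code written this way.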

Finally, as a gripe, I am personally not a fan of the name content_t: it is not the message content but the control block for the message content, and it keeps confusing me (which is why I say "control block" when I refer to it in this PR).

@cptspacemanspiff cptspacemanspiff marked this pull request as ready for review July 1, 2025 13:13