Expose the ability to have zero allocation sends. #4802

cptspacemanspiff · 2025-06-28T19:22:29Z

The Objective:

At a high level, I want to use ZeroMQ messages to send data as fast as possible from a sending thread. However, as I have started going through the optimization process, via perf on x64 Linux, about 25% of my steady-state time is spent in malloc calls allocating the control blocks for the reference-counted long messages.

This pull request is a request to expose a public API for the zero-copy long message type type_zclmsg, which is used internally on the receive side.

References:

As part of this process I have looked at previous issues that reference this topic, to avoid stepping on existing work and to get a better understanding of the concerns (if I have missed any, it would be great to know):

Most of the discussion seems to be in:
#2795

Though this PR also solves the issue here:
#4343

Changes:

The primary change is to expose the internal function init_external_storage and allow users to pass in a pointer to a preallocated memory block in which the init function constructs the content_t control block. The method and struct we want to expose are below:

// src/msg.hpp

    struct content_t
    {
        void *data;
        size_t size;
        msg_free_fn *ffn;
        void *hint;
        zmq::atomic_counter_t refcnt;
    };

// The above control block is 40 bytes, with 8 byte alignment on x64.

    int init_external_storage (content_t *content_,
                               void *data_,
                               size_t size_,
                               msg_free_fn *ffn_,
                               void *hint_);

In order to not expose private implementation details, and to allow for future modification of the internals, we round up the control block size from ~40 to 64 bytes and expose the larger opaque structure to users in the draft API.

// include/zmq.h -- In the Draft API:

typedef struct zmq_msg_content_t
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
    __declspec (align (8)) unsigned char _[64];
#elif defined(_MSC_VER)                                                        \
  && (defined(_M_IX86) || defined(_M_ARM_ARMV7VE) || defined(_M_ARM))
    __declspec (align (4)) unsigned char _[64];
#elif defined(__GNUC__) || defined(__INTEL_COMPILER)                           \
  || (defined(__SUNPRO_C) && __SUNPRO_C >= 0x590)                              \
  || (defined(__SUNPRO_CC) && __SUNPRO_CC >= 0x590)
    unsigned char _[64] __attribute__ ((aligned (sizeof (void *))));
#else
    unsigned char _[64];
#endif
} zmq_msg_content_t;

// above is lifted from the definition of zmq_msg_t, it has 8 byte alignment.
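
To make the size and alignment argument concrete, here is a minimal standalone sketch. It does not use the libzmq headers: content_mirror_t and opaque_content_t are hypothetical stand-ins for the internal content_t and the proposed zmq_msg_content_t, and std::atomic is only assumed to be comparable in size to zmq::atomic_counter_t.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical mirror of the internal control block (NOT the real
// zmq::msg_t::content_t); the field order follows the struct shown above.
struct content_mirror_t
{
    void *data;
    std::size_t size;
    void (*ffn) (void *data, void *hint);
    void *hint;
    std::atomic<std::uint32_t> refcnt; // stand-in for zmq::atomic_counter_t
};

// Hypothetical mirror of the opaque public type: 64 bytes, pointer-aligned.
struct alignas (sizeof (void *)) opaque_content_t
{
    unsigned char _[64];
};

// The opaque block must be at least as large and as strictly aligned as the
// structure that gets constructed inside it.
static_assert (sizeof (content_mirror_t) <= sizeof (opaque_content_t),
               "control block does not fit in the opaque storage");
static_assert (alignof (content_mirror_t) <= alignof (opaque_content_t),
               "opaque storage is under-aligned for the control block");

// Bytes left over for future growth of the internal structure.
constexpr std::size_t spare_bytes ()
{
    return sizeof (opaque_content_t) - sizeof (content_mirror_t);
}

// Constructs the mirror inside caller-provided storage with placement new,
// which is the pattern the cast from opaque storage relies on.
inline content_mirror_t *construct_in (opaque_content_t *storage)
{
    assert (reinterpret_cast<std::uintptr_t> (storage)
              % alignof (content_mirror_t)
            == 0);
    return new (storage) content_mirror_t ();
}
```

Checks like these static_asserts could live next to the cast inside the library, so that a future change to the internal struct fails at compile time instead of overflowing the user's block.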

ZMQ_EXPORT int zmq_msg_init_external_storage (
  zmq_msg_t *msg_, zmq_msg_content_t *content_, void *data_, size_t size_, zmq_free_fn *ffn_, void *hint_);

Thus, users just allocate a 64-byte control block and pass it to zmq. They are then responsible for the lifetime of the block (in most use cases, it will probably be released in the zmq_free_fn callback).

Internally, all objects in the control block are manually destructed as part of zmq_msg_close (any nontrivial types are constructed with placement new):
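
As an illustration of the intended usage, here is a rough standalone sketch of a pooled message that carries its control block at the head of the allocation. Everything in it is hypothetical: control_block_t stands in for zmq_msg_content_t, msg_pool_t is a single-threaded placeholder for a real lock-free pool, and the library call itself appears only as a comment because the function exists only in this PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the proposed zmq_msg_content_t (64 opaque bytes).
struct alignas (sizeof (void *)) control_block_t
{
    unsigned char _[64];
};

// One pooled message: the control block sits in front of the payload, so a
// single pool allocation covers both.
struct pooled_msg_t
{
    control_block_t control;
    unsigned char payload[1024];
    pooled_msg_t *next_free;
};

// Minimal single-threaded free list standing in for a lock-free pool.
struct msg_pool_t
{
    pooled_msg_t *head = nullptr;

    void push (pooled_msg_t *m)
    {
        m->next_free = head;
        head = m;
    }

    pooled_msg_t *pop ()
    {
        pooled_msg_t *m = head;
        if (m)
            head = m->next_free;
        return m;
    }
};

// Same shape as zmq_free_fn: void (void *data, void *hint).
// data points at the payload, hint at the owning pool. The whole
// pooled_msg_t, control block included, goes back to the pool here.
inline void pooled_free_fn (void *data, void *hint)
{
    unsigned char *base =
      static_cast<unsigned char *> (data) - offsetof (pooled_msg_t, payload);
    static_cast<msg_pool_t *> (hint)->push (
      reinterpret_cast<pooled_msg_t *> (base));
}

// With the API proposed in this PR, a send would look roughly like:
//   pooled_msg_t *m = pool.pop ();
//   zmq_msg_init_external_storage (&msg, &m->control, m->payload, size,
//                                  pooled_free_fn, &pool);
```

Recycling the control block inside the free function relies on the library no longer touching the block once the callback has been invoked.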

// src/msg.cpp -- zmq::msg_t::close ()

    if (is_zcmsg ()) {
        zmq_assert (_u.zclmsg.content->ffn);

        //  If the content is not shared, or if it is shared and the reference
        //  count has dropped to zero, deallocate it.
        if (!(_u.zclmsg.flags & msg_t::shared)
            || !_u.zclmsg.content->refcnt.sub (1)) {
            //  We used "placement new" operator to initialize the reference
            //  counter so we call the destructor explicitly now.
            _u.zclmsg.content->refcnt.~atomic_counter_t ();

            _u.zclmsg.content->ffn (_u.zclmsg.content->data,
                                    _u.zclmsg.content->hint);
        }
    }

My Use-case:

For my sending thread, I have a slab allocator that pulls memory from a lock-free object pool. Under the hood this uses a freelist that batches memory allocations. This is fast, though to reduce thread contention on the CAS loop, I generally pull larger chunks and construct multiple smaller messages within each allocated buffer.

The messages I am using are not huge, but not tiny, at around 1 KB, so I am currently using zmq_msg_init_data. This works: there is an inline reference-counted control block at the beginning of the allocation, and its count is decremented in zmq's free_fn.
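
The current arrangement can be sketched roughly as follows. This is a simplified, standalone illustration rather than my actual allocator: slab_header_t, init_slab, payload_at, and return_to_pool are hypothetical names, and malloc/free stand in for the lock-free pool.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical header placed at the start of each slab.
struct slab_header_t
{
    std::atomic<int> live_msgs;   // messages carved from this slab, not yet freed
    void (*release) (void *slab); // hands the slab back to its pool
};

// Constructs the header in place and records how many messages share the slab.
inline slab_header_t *
init_slab (void *slab, int count, void (*release) (void *))
{
    slab_header_t *h = new (slab) slab_header_t;
    h->live_msgs.store (count, std::memory_order_relaxed);
    h->release = release;
    return h;
}

// Address of the i-th fixed-size payload, laid out right after the header.
inline unsigned char *
payload_at (void *slab, std::size_t msg_size, std::size_t i)
{
    return static_cast<unsigned char *> (slab) + sizeof (slab_header_t)
           + i * msg_size;
}

// Same shape as zmq_free_fn: void (void *data, void *hint), with hint set to
// the slab base. The last message to be freed releases the whole slab.
inline void slab_free_fn (void * /*data*/, void *hint)
{
    slab_header_t *h = static_cast<slab_header_t *> (hint);
    if (h->live_msgs.fetch_sub (1, std::memory_order_acq_rel) == 1) {
        void (*release) (void *) = h->release;
        h->~slab_header_t ();
        release (hint);
    }
}

// Stand-in for returning the slab to the pool; counts releases for checking.
static int g_released = 0;
static void return_to_pool (void *slab)
{
    ++g_released;
    std::free (slab);
}
```

Each payload would be passed to zmq_msg_init_data with slab_free_fn as the free function and the slab base as the hint. The slab itself never touches malloc in steady state, but zmq_msg_init_data still allocates one content_t per message, which is the cost this PR targets.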

At this point, the malloc time from control block allocation starts to add up.

I realize I could send larger messages, and I will be doing that, but it involves rewriting a lot of the code and message logic on both the send and receive sides. Just allowing me to point to a preallocated control block at the beginning of the message gives me an easy ~25% speedup.

Additionally, I am generally wary of the non-deterministic nature of new/delete, especially in the hot-path loop.

Gotchas? / random thoughts?

Trying to think of issues with this, it seems pretty safe. Obviously it involves properly managing and releasing the control block memory, but that seems pretty easy for people already using a free function. It is not a solution for all issues, but requiring new/delete for any message above ~33 bytes seems less than ideal.

It also seems like the content object, as an internal API, is very stable (last touched 9 years ago?).

One open question is the alignment. 8 bytes is the minimum on x64, and it was easy to copy the msg_t alignment, but it may be better to raise it to 16 bytes in case there is ever a need for 128-bit CAS instructions. That being said, people who use the library should respect the alignment of the types they are given and not assume.

Finally, as a gripe, I personally am not a fan of the name content_t: it is not actually the message content but the control block for the message content, and it keeps confusing me (which is why I say "control block" throughout this PR).

Nicholas Long added 9 commits June 28, 2025 15:32

Initial changes to the header file to support no-allocation sends.

62440f3

cast the memory content to actual zmq type.

c232158

duplicated ffn test.

3b39139

Added external storage test.

dcbe56d

moved the new methods to draft

2c787a0

fixed draft symbols/ formatting / tests

2d8c5e5

formatting.

2744358

autotools tests

7cd1725

added more tests locations?

981d964

cptspacemanspiff marked this pull request as ready for review July 1, 2025 13:13
