ipc_*, ipc, flow: Port them beyond Linux/x86-64 - macOS, ARM64; Windows. (No *BSD - due to no capnp) #101

Open
ygoldfeld opened this issue Mar 27, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@ygoldfeld
Contributor

ygoldfeld commented Mar 27, 2024

As of this writing, in its original form, Flow-IPC (including Flow, repository flow, which is self-contained) builds with gcc/clang; targets Linux running in 64-bit mode (x86_64 aka AMD64). Nothing about its design is Linux-y per se (and earlier versions of Flow ran in such environments as Android and iOS and Windows); Linux was just the thing needed in practice.

This issue:

  • covers a possible road map of what to port in what order (of course this could always change depending on demand; and on available help);
  • contains notes I've assembled about what it would, specifically, take to get it done.

I should note that some of the technical notes below could be wrong, if my knowledge was in error or if I looked up something incorrectly. Still, they should be useful.

First the goal(s):

  • Random note: Having contemplated iOS and Android, I figure IPC in this form, in general, is not really that useful in the mobile setting. So for now at least it would appear we can skip this family of stuff. (Flow, being a general library, is different - so TODO: file Issue for that - but as part of Flow-IPC the not-that-useful calculus applies to it.)
  • The first target is two-fold: macOS + ARM64 (the latter being the current architecture used by modern Macs). Whether/how to test macOS + x86-64 (which is now a retired combo, I believe) is TBD; we should do it if convenient, otherwise forget it.
    • Why the first target: Why not? So many of us have Mac laptops. GitHub Actions apparently supplies Mac hardware (details TBD). Plus many of the solutions, such as cooking up a /dev/shm replacement, will probably keep working in all the other OS including Windows. So it's a (relatively) low-hanging fruit; and it has high value in the sense that many developers would be able to immediately build Flow-IPC directly on their laptops.
    • High-level notes in addition to the above:
      • macOS is at its core Darwin which is BSD-esque.
      • Local testing should be a relative cinch, while developing.
      • ARM64 versus x86-64 should not come up much at all, with one exception: The pointer tagging scheme used by our SHM-jemalloc module will need to be updated for ARM64 pointer conventions. (Spoiler alert: Apparently this adjustment will be quite easy, as basically they use the same system, except ARM64 is less restrictive - has no canonical form.)
        • It appears after covering x86-64 compatibility and adding ARM64 - where it would be an issue, which appears to be
  • The second target is a bit vague: it is *BSD, meaning FreeBSD, OpenBSD, maybe NetBSD. They're not the same, and also I personally am not super familiar with them. That said my impression so far is that the stuff that would need to be ported has straightforward answers that would either carry-over from macOS, or be different but fairly simple.
    • Why the second target: The way I see it, we are Unixy, and if we handle macOS, then the major family that remains of the Unixes = *BSD. EDIT: WAIT, NO! See note just below: No capnp for BSD, so no Flow-IPC for BSD until then.
    • High-level notes in addition to the above:
      • macOS is at its core Darwin which is BSD-esque.
      • Local testing should be a relative cinch, while developing.
      • No direct GitHub Actions BSD runners, but apparently something called "VM Actions" allows to run such within macOS runners. TBD.
      • OOPS!!! capnp (Cap'n Proto) apparently does not build for BSD, at least not officially. With that, we are dead in the water - for now.
  • The third target is Windows (Visual C++).
    • Why the third target: We are not really Windows people, so it is farthest from our expertise... we being the original developers/maintainers. But... so? Windows is important.
    • High-level notes in addition to the above:
      • Based on my notes, all the need-to-port items have fine answers in Windows. Though, some of it could be quite different from Unix-y code; for instance there is no Unix domain socket that transmits FDs, so we have to use the quite-different Windows equivalent.
      • Local testing should be doable; just need a Windows machine and Visual Studio.
      • GitHub Actions has Windows runners. However, our CI pipeline (the stuff in .github/main.yml and flow/.github/main.yml) casually assumes Unix-y command line tools, etc. etc. So that stuff will be significant work (and learning, depending on who does it).
        • So in that sense it will be added work versus macOS/BSD.

So, basically, at the moment:

  • Start with macOS + ARM64. Get it done.
  • Repeat at some point for Windows; some of the things done for macOS should be reusable.

Now for the notes.

Firstly, should get Flow-IPC/flow#88 out of the way before tackling this. Then it'll be static_assert()s failing instead of sometimes those, sometimes direct #errors. Detail... but anyway.

Then: Be aware that everything that we consciously knew was not portable, anywhere in the code, was #ifdefed and should #error or static_assert(false) if one tries to build for the port-target OS/arch. (NOTE!!!! SHM-jemalloc - namely anything in ipc_shm_arena_lend/ paths src/ipc/shm/arena_lend/** except src/ipc/shm/arena_lend/detail/shm_pool_offset_ptr_data.?pp src/ipc/session/standalone/** - may not have held to this. Will need special attention. Other than that, though, yep.)

So, I went over all instances of such to-port code and made the following notes. Following these (where they're correct at least) may well be the bulk of the work.

Last thing before that though... I'd recommend, before doing any of that, setting up two things:

  • (If possible) a local dev/test/debug environment. E.g., Macbook Pro with Xcode and all that.
  • A CI environment in GitHub Actions. E.g., take our main.yml for flow/, write a simple portable hello-world program instead of existing compiled code, and have the main.yml build and run that in the target-OS/arch.

Once that's ready can start hacking away at the to-port code. Now those notes:


This code checks the location of a certain executable, by itself: const auto bin_path = fs::read_symlink("/proc/self/exe"); // Can throw.

  • macOS: _NSGetExecutablePath() + realpath()
  • Windows: GetModuleFileName()
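
A minimal sketch of how the three variants above could sit behind one function. Assumptions: the name current_executable_path() is hypothetical (not a Flow-IPC API), and std::filesystem stands in for whatever fs alias the real code uses. Only the Linux branch mirrors the existing snippet; the macOS and Windows branches are untested guesses at the shape.

```cpp
#include <cerrno>
#include <cstdint>
#include <filesystem>
#include <system_error>
#if defined(__APPLE__)
#  include <climits>       // PATH_MAX
#  include <cstdlib>       // realpath()
#  include <mach-o/dyld.h> // _NSGetExecutablePath()
#  include <vector>
#elif defined(_WIN32)
#  include <windows.h>     // GetModuleFileNameW()
#endif

namespace fs = std::filesystem;

// Hypothetical helper: absolute path of the currently running executable. Throws on failure.
fs::path current_executable_path()
{
#if defined(__linux__)
  return fs::read_symlink("/proc/self/exe"); // Throws fs::filesystem_error on failure.
#elif defined(__APPLE__)
  std::uint32_t size = 0;
  _NSGetExecutablePath(nullptr, &size); // Expected to fail; it reports the required size.
  std::vector<char> raw(size);
  if (_NSGetExecutablePath(raw.data(), &size) != 0)
  {
    throw std::system_error(ENAMETOOLONG, std::generic_category(), "_NSGetExecutablePath");
  }
  char resolved[PATH_MAX];
  if (!::realpath(raw.data(), resolved)) // Resolves symlinks and ".." components.
  {
    throw std::system_error(errno, std::generic_category(), "realpath");
  }
  return fs::path(resolved);
#elif defined(_WIN32)
  wchar_t buf[MAX_PATH];
  const DWORD n = ::GetModuleFileNameW(nullptr, buf, MAX_PATH);
  if ((n == 0) || (n == MAX_PATH)) // 0 = failure; MAX_PATH = result was truncated.
  {
    throw std::system_error(static_cast<int>(::GetLastError()), std::system_category(),
                            "GetModuleFileNameW");
  }
  return fs::path(buf, buf + n);
#else
#  error "current_executable_path() not ported to this OS."
#endif
}
```

The #error in the last branch follows the same convention as the rest of the to-port code: fail the build loudly rather than guess.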

/var/run is assumed to exist in Linux by default (though I have a mechanism for overriding this location, if only for test/debug at least). Is it similarly assumed to exist in macOS/iOS? And is there a Windows equivalent? In particular, it would be nice if it were something that would get cleared via reboot.

  • macOS: Should be fine; do ensure it has the same kernel-persistent properties (emptied on reboot); and if not whether that matters for anything we care about (I forget at the moment).
  • Windows: Possibly GetTempPath() or ProgramData or AppData/Local. Look into it.

A key one: We use /dev/shm/ listing for important SHM-pool and bipc-MQ management. (Also /dev/mqueue for POSIX MQs - however since no POSIX MQs in macOS, it is moot.)

  • macOS, Windows: It would appear, from my research so far, that there is simply no such feature in either OS.
    • Need to really triple-check this. Because if it were available, this would really be much easier than the next bullet point.
    • If so, though... then there are essentially a couple of approaches. Both could be significant work (as of this writing looks like the hardest one). Not absurd amount of work but definitely new stuff/algorithms.
      • Approach A: Essentially make our own equivalent of /dev/shm: For example, make a SHM-segment that itself would keep track of the SHM-segments that we create or unlink. Conceptually it is straightforward; but it needs to be:
        • VERY robust/simple: If we mess something up here due to being too fancy, everything else SHM-related is potentially a mess - at least the cleanup aspects anyway. So keep it very very very simple.
        • Don't run out of space... but don't use some huge RAM chunk either. Now... in Unix at least (Windows: TBD)... we can size a quite-large SHM pool; that RAM is not actually taken from other apps/OS, until something is actually written to a particular page. So we should pick some size -- which will, after all, be system-wide, not app/process/whatever-wide -- which should be enough to track all SHM-pool-names currently in existence/not-unlinked. If we fill it up, then there's a problem. On the other hand maybe that's simply life; after all Linux keeps track of all of them, instead having OS-limits on how many pools can exist at a time. So we should reserve enough pool-space to handle that. Then, hopefully, there won't be that many in existence, so not that much RAM will be used to track them.
      • Approach B: Use smart naming for each object type. E.g., for objects of a similar conceptual type, name them ...1, ...2, ...3; then when trying to "list" them, just go from ...1 to <first guy that does not exist>. Or something like that.
        • This honestly would suck, as it would be a constant source of complexity and bugs. And, some things are named based on App::m_name, so there would still need to be some kind of kernel-persistent directory of those names. Anyway, a difficult mess.
        • It would potentially be more space-efficient.
    • NOTE: Perf-wise, this is not very sensitive. We don't often do the list-/dev/shm/* op; it's only at cleanup points basically. Similarly, we don't create new SHM-segments that often. It should be quick but need not be optimized down to the microsecond.
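
To make Approach A a bit more concrete, here is a deliberately dumb sketch of the bookkeeping part only. Everything in it is an assumption for illustration: the names (Pool_registry, registry_add(), etc.), the slot count, and the name-length cap are all made up. It also omits the two hard parts: placing the struct in an actual SHM segment and guarding it with a cross-process lock.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical limits; real values would need thought (see the sizing discussion above).
constexpr std::size_t MAX_POOLS = 1024;
constexpr std::size_t MAX_NAME_LEN = 63;

// Fixed layout, no pointers: could live in a SHM segment. All-zero memory = empty registry.
struct Pool_registry
{
  char m_names[MAX_POOLS][MAX_NAME_LEN + 1]; // Empty string = free slot.
};

// Records `name`. Returns false if the name is invalid or the registry is full.
bool registry_add(Pool_registry* reg, std::string_view name)
{
  if (name.empty() || (name.size() > MAX_NAME_LEN)) { return false; }
  std::size_t free_idx = MAX_POOLS;
  for (std::size_t idx = 0; idx != MAX_POOLS; ++idx)
  {
    if (name == reg->m_names[idx]) { return true; } // Already recorded.
    if ((free_idx == MAX_POOLS) && (reg->m_names[idx][0] == '\0')) { free_idx = idx; }
  }
  if (free_idx == MAX_POOLS) { return false; } // Full.
  std::memcpy(reg->m_names[free_idx], name.data(), name.size());
  reg->m_names[free_idx][name.size()] = '\0';
  return true;
}

// Forgets `name`. Returns false if it was not recorded.
bool registry_remove(Pool_registry* reg, std::string_view name)
{
  if (name.empty()) { return false; }
  for (std::size_t idx = 0; idx != MAX_POOLS; ++idx)
  {
    if (name == reg->m_names[idx])
    {
      reg->m_names[idx][0] = '\0';
      return true;
    }
  }
  return false;
}

// The stand-in for listing /dev/shm: every name currently recorded.
std::vector<std::string> registry_list(const Pool_registry& reg)
{
  std::vector<std::string> names;
  for (std::size_t idx = 0; idx != MAX_POOLS; ++idx)
  {
    if (reg.m_names[idx][0] != '\0') { names.emplace_back(reg.m_names[idx]); }
  }
  return names;
}
```

With these made-up limits the table is 64 KiB, and untouched pages would not cost real RAM on Unix, per the reasoning above. Linear scans are fine given that listing and creation are rare operations.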

We use boost::interprocess::shared_memory_object::remove(X) to remove SHM-segment (SHM-pool) named X. Error handling, specifically, relies on the undocumented fact that on failure it'll set errno. So, if it fails, we check errno and throw the appropriate error. What about the other OSes?

  • macOS: This should still work the same. They both just do unlink() really.
  • Windows: It depends on what boost::interprocess::shared_memory_object::remove(X) actually does. Find out by looking at Boost code. But potentially it'll be some WindowsThingLikeThis() and perhaps we would just check GetLastError().

Using kill(process_id, 0) to check whether PID process_id is running at the moment.

  • macOS: Should work.
  • Windows: Something like this:
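    #include <windows.h>
    
    // Sketch: returns true if a process with this PID appears to be running.
    bool process_running(DWORD process_id) {
        HANDLE process_handle = OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, FALSE, process_id);
        if (process_handle != NULL) {
            // A handle can still be opened for an exited process while other
            // handles to it remain open, so check the exit code as well.
            DWORD exit_code = 0;
            const bool running = GetExitCodeProcess(process_handle, &exit_code)
                                 && (exit_code == STILL_ACTIVE);
            CloseHandle(process_handle);
            return running;
        }
        const DWORD err = GetLastError();
        if (err == ERROR_INVALID_PARAMETER) {
            return false; // No process with this PID exists.
        }
        // ERROR_ACCESS_DENIED: the process exists, but we may not query it.
        // Anything else: treat as "not running"; refine as needed.
        return err == ERROR_ACCESS_DENIED;
    }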
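
For comparison, the Unix-side check (Linux now, and presumably macOS unchanged) is sketched below. The function name is hypothetical; the EPERM handling is my reading of kill() semantics: the signal could not be sent, but only because the process exists and belongs to someone else.

```cpp
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: true if a process with this PID exists right now.
// Caveat: a zombie (exited but not yet reaped) still counts as existing.
bool process_running_posix(pid_t process_id)
{
  if (::kill(process_id, 0) == 0) // Signal 0: error checking only; nothing is delivered.
  {
    return true;
  }
  return errno == EPERM; // Exists, but we lack permission to signal it. ESRCH = no such process.
}
```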
    

PIDs being used as effectively unique across time:

  • macOS: Still yes. Double-check.
  • Windows: Apparently still yes. And apparently the PID type is wider (32 bits).

Process_credentials::process_invoked_as() checks how (via what command line, argv[0]-ish) this->m_process_id process was executed, assuming that guy is currently active (which it is). It reads /proc/...pid.../cmdline. This works, unlike some other things, even if ...pid... is running as some other user - which is an important thing in our use-case. We use this for a safety check when opening session: A certain location shall be hard-coded by user in ipc::session::App struct known to us; and we use process_invoked_as() to see what the OS says. The two must match, or else the app is misbehaving, and we reject session. So the goal here is not the specific behavior we use in Linux, but merely some check like it that we could execute.

  • macOS: proc_pidpath() is somewhat similar (might be the actual location of executable, not as specified on command line... that would be fine, as long as specified and documented to mean that in ipc::session::App). However it might have security restrictions which would make it useless for us. If it can check it for a process run by another user or even another group -- then no problem. Otherwise, not useful. Allegedly only a privileged process can check it for other users; so that would mean no-go for us; but double-check.
    • Moral of the story: We may need to either
      • skip this check for the OS (and therefore omit it in ipc::session::App declaration); or
      • do something cheesy like have each process send-over-IPC its own argv[0] or proc_pidpath(...itself...) or equivalent. It's not being reported by OS then but still not-nothing as a safety feature.
  • Windows: <subsumed by Unix-domain-socket question, see below>
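
A rough sketch of the Linux behavior next to the macOS candidate, assuming a free function rather than the actual Process_credentials member. The Linux branch returns argv[0] as invoked; the macOS branch returns the executable's location via proc_pidpath(), which is not the same thing, as noted above. Whether proc_pidpath() works across users is exactly the open question, so treat that branch as unverified.

```cpp
#include <fstream>
#include <string>
#include <sys/types.h>
#include <unistd.h>
#if defined(__APPLE__)
#  include <libproc.h> // proc_pidpath()
#endif

// Hypothetical stand-in for Process_credentials::process_invoked_as().
// Returns an empty string if the information is unavailable.
std::string process_invoked_as(pid_t process_id)
{
#if defined(__linux__)
  // cmdline holds the NUL-separated argv; the first token is argv[0].
  std::ifstream cmdline_file("/proc/" + std::to_string(process_id) + "/cmdline",
                             std::ios::binary);
  std::string argv0;
  std::getline(cmdline_file, argv0, '\0');
  return argv0;
#elif defined(__APPLE__)
  char buf[PROC_PIDPATHINFO_MAXSIZE];
  const int len = ::proc_pidpath(process_id, buf, sizeof(buf)); // Returns <= 0 on failure.
  return (len > 0) ? std::string(buf, static_cast<std::size_t>(len)) : std::string();
#else
  (void)process_id;
  return std::string(); // No OS-reported answer; caller would skip the check.
#endif
}
```

An empty result maps onto the "skip this check for the OS" option; the send-over-IPC fallback would live elsewhere.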

In Linux we use getsockopt(AF_LOCAL/SO_PEERCRED) to obtain the opposing process's info: PID (important, not just for safety), UID+GID (for safety check only).

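  • macOS: There is the LOCAL_PEEREPID sockopt (and LOCAL_PEERPID) for the PID. There is no SO_PEERCRED equivalent that returns PID+UID+GID in one shot, apparently; however getpeereid() and the LOCAL_PEERCRED sockopt look like they provide the effective UID (and GID/groups), so double-check those.
    • Failing that, can (again) either skip the check or have each app send its own geteuid()+getegid(). The fact we can still get the PID is good.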
  • Windows: <subsumed by Unix-domain-socket question, see below>
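
Here is a sketch of how the peer-info lookup might be wrapped. The struct and function names are hypothetical. The Linux branch is the SO_PEERCRED call described above. The macOS branch is my assumption: it uses LOCAL_PEERPID (the notes above name LOCAL_PEEREPID, which reports the effective PID instead) plus getpeereid() for the effective UID/GID, and has not been tried on real hardware.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

// Hypothetical result type: what we learn about the process at the other end.
struct Peer_info
{
  pid_t m_process_id;
  uid_t m_user_id;
  gid_t m_group_id;
};

// Fills *info for a connected Unix-domain socket. Returns false on failure or unported OS.
bool get_peer_info(int sock_fd, Peer_info* info)
{
#if defined(__linux__)
  ucred cred{};
  socklen_t len = sizeof(cred);
  if (::getsockopt(sock_fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) != 0) { return false; }
  *info = Peer_info{ cred.pid, cred.uid, cred.gid };
  return true;
#elif defined(__APPLE__)
  pid_t process_id = 0;
  socklen_t len = sizeof(process_id);
  if (::getsockopt(sock_fd, SOL_LOCAL, LOCAL_PEERPID, &process_id, &len) != 0) { return false; }
  uid_t user_id = 0;
  gid_t group_id = 0;
  if (::getpeereid(sock_fd, &user_id, &group_id) != 0) { return false; } // Effective IDs.
  *info = Peer_info{ process_id, user_id, group_id };
  return true;
#else
  (void)sock_fd; (void)info;
  return false;
#endif
}
```

A quick way to try it locally is a socketpair(AF_UNIX, SOCK_STREAM, ...): on Linux the reported PID should equal getpid().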

HUGE IMPORTANT THING TO FILL-IN FOR Windows

We use, for Native_socket_stream, on which very much stuff is built, a Unix-domain stream socket. This is for several important purposes without which everything falls apart:

  • It is a fast/good way to transmit binary blobs. (bipc-MQ is also available in all OS. POSIX MQ is also available but in Linux only, not in Windows or macOS; BSD does have it, I believe, but moot.)

  • It is the only way to transmit I/O handles, including from socketpair() so as to create more pre-connected socket streams!

  • It also provides the bootstrap process-to-process acceptor-connector pair of classes. That's how sessions start, so it is very much a key thing!

  • Related: The very concept of transmitting Native_handles is well-defined in Linux and macOS (and BSD and any Unix): it just wraps an int (FD). (EDIT: I didn't mean to imply it's a mere matter of sending an int; rather it's sendmsg() with SOL_SOCKET/SCM_RIGHTS ancillary data, which will create an FD in the receiver process pointing to the same file-description.) But in Windows, what kind of thing would we even potentially want to transmit?

    • If there is something we can transmit to be able to set-up more socket-streams... then Native_handle would need to cover that at least.
    • If there is something typical users would want to transmit -- file handles? SOCKET network handles? What else? -- then we should support that.
  • macOS: It has all of this the same as Linux, portably. It is FD-based, with Unix-domain stream sockets being able to transmit FDs via sendmsg/recvmsg().... Whew!

  • Windows: Unclear! I have some notes, but I am not pasting them yet. We need a strategy for all of the above use-cases, before starting to port to Windows. So filling this in is a MAJOR TODO and blocks all other Windows-port work.
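
For reference when working out the Windows strategy, this is roughly what the Unix mechanism looks like in isolation: one FD sent as SCM_RIGHTS ancillary data alongside a 1-byte payload. It is a bare sketch, not the Native_socket_stream code; send_fd() and recv_fd() are hypothetical names, and real code would handle partial I/O, EINTR, and multiple FDs.

```cpp
#include <cstring>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Sends fd_to_send over connected Unix-domain socket sock_fd. Returns true on success.
bool send_fd(int sock_fd, int fd_to_send)
{
  char payload = 'F'; // Stream sockets need at least 1 byte of regular data.
  iovec iov{ &payload, 1 };
  alignas(cmsghdr) char ctl_buf[CMSG_SPACE(sizeof(int))] = {};

  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctl_buf;
  msg.msg_controllen = sizeof(ctl_buf);

  cmsghdr* const cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  std::memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

  return ::sendmsg(sock_fd, &msg, 0) == 1;
}

// Receives one FD from sock_fd. Returns the new FD (valid in this process), or -1 on failure.
int recv_fd(int sock_fd)
{
  char payload = 0;
  iovec iov{ &payload, 1 };
  alignas(cmsghdr) char ctl_buf[CMSG_SPACE(sizeof(int))] = {};

  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctl_buf;
  msg.msg_controllen = sizeof(ctl_buf);

  if (::recvmsg(sock_fd, &msg, 0) != 1) { return -1; }

  const cmsghdr* const cmsg = CMSG_FIRSTHDR(&msg);
  if ((!cmsg) || (cmsg->cmsg_level != SOL_SOCKET) || (cmsg->cmsg_type != SCM_RIGHTS)
      || (cmsg->cmsg_len != CMSG_LEN(sizeof(int))))
  {
    return -1; // No FD was attached.
  }
  int received_fd = -1;
  std::memcpy(&received_fd, CMSG_DATA(cmsg), sizeof(int));
  return received_fd; // A new descriptor referring to the same open file description.
}
```

The received value is generally a different integer than the one sent; the kernel installs a fresh descriptor in the receiver. Whatever Windows mechanism is chosen needs an equivalent of that step.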

Flow's Logger::this_thread_set_logged_nickname(util::String_view thread_nickname, Logger* logger_ptr, bool also_set_os_name) -- if that bool=true -- uses pthread_setname_np(pthread_self(), os_name.c_str()) to set system-visible thread nickname. This is useful in top, sanitizer output, debuggers... it's quite useful. However, Linux has a 15-char limit which we grapple with, not very well.

  • Windows: Something like this (the Visual C++ convention of raising a special exception that the debugger interprets):

const DWORD MS_VC_EXCEPTION = 0x406D1388;

#pragma pack(push, 8)
typedef struct tagTHREADNAME_INFO
{
    DWORD dwType; // Must be 0x1000.
    LPCSTR szName; // Pointer to name (in user addr space).
    DWORD dwThreadID; // Thread ID (-1=caller thread).
    DWORD dwFlags; // Reserved for future use, must be zero.
} THREADNAME_INFO;
#pragma pack(pop)

void SetThreadName(DWORD dwThreadID, const char* threadName) // -1 for cur thread
{
    THREADNAME_INFO info;
    info.dwType = 0x1000;
    info.szName = threadName;
    info.dwThreadID = dwThreadID;
    info.dwFlags = 0;

    __try
    {
        RaiseException(MS_VC_EXCEPTION, 0, sizeof(info)/sizeof(ULONG_PTR), (ULONG_PTR*)&info);
    }
    __except(EXCEPTION_EXECUTE_HANDLER)
    {
    }
}
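
On the Unix side, a sketch of what the call might look like with the length limit handled by plain truncation. Names are hypothetical. Assumptions to double-check: on macOS pthread_setname_np() takes only the name and applies to the calling thread, and I am assuming a 63-character limit there. Plain truncation is just the simplest policy, not necessarily the one Flow should use.

```cpp
#include <cstddef>
#include <pthread.h>
#include <string>
#include <string_view>

// Cuts nickname down to at most max_len characters (keeps the front).
std::string truncate_thread_name(std::string_view nickname, std::size_t max_len)
{
  return std::string(nickname.substr(0, max_len));
}

// Hypothetical helper: sets the OS-visible name of the calling thread. Returns true on success.
bool set_this_thread_os_name(std::string_view nickname)
{
#if defined(__linux__)
  // Linux: 16 bytes including the terminating NUL; longer names fail with ERANGE.
  const std::string os_name = truncate_thread_name(nickname, 15);
  return ::pthread_setname_np(::pthread_self(), os_name.c_str()) == 0;
#elif defined(__APPLE__)
  // macOS: no thread argument; limit assumed to be 63 characters plus NUL.
  const std::string os_name = truncate_thread_name(nickname, 63);
  return ::pthread_setname_np(os_name.c_str()) == 0;
#else
  (void)nickname;
  return false; // Not ported.
#endif
}
```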

cpu_idx() gets the current processor core index (0, 1, ..., N-1), where N = # of logical cores. In Linux can get it from ::sched_getcpu().

  • macOS: If Intel, can use a certain hack involving __cpuid_count(). But that'll be rare now anyway. With ARM64, I could not find a good answer. Check again - but so far it looks like there's nothing to use.
  • Windows: ...forgot to check....
  • Moral of the story: This is used only for logging (and TRACE/DATA logging at that). It might be best to just omit it for architectures where there's no good official technique.
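
The "omit it where there is no good technique" idea could be as simple as the following sketch. The -1 sentinel is my assumption about how "unknown" would be reported. The Windows branch uses GetCurrentProcessorNumber(), which I believe exists but which only reports the index within the current processor group, so it would need checking on machines with more than 64 logical cores.

```cpp
#if defined(__linux__)
#  include <sched.h>   // sched_getcpu()
#elif defined(_WIN32)
#  include <windows.h> // GetCurrentProcessorNumber()
#endif

// Sketch: index of the logical core running the calling thread, or -1 if unknown.
// The answer can be stale by the time the caller uses it; fine for logging.
int cpu_idx()
{
#if defined(__linux__)
  return ::sched_getcpu(); // Itself returns -1 on failure.
#elif defined(_WIN32)
  return static_cast<int>(::GetCurrentProcessorNumber());
#else
  return -1; // E.g., macOS/ARM64: no official technique found so far.
#endif
}
```

The logging code would then print the index only when it is non-negative.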

optimize_pinning_in_thread_pool():

void optimize_pinning_in_thread_pool(flow::log::Logger* logger_ptr,
                                     const std::vector<util::Thread*>& threads_in_pool,
                                     [[maybe_unused]] bool est_hw_core_sharing_helps_algo,
                                     bool est_hw_core_pinning_helps_algo,
                                     bool hw_threads_is_grouping_collated)

where est_hw_core_pinning_helps_algo is to be set by the user to true if and only if the algorithm they're running over the given thread pool would be actively helped perf-wise if one were to pin each thread to a different CPU core. So if there are 32 threads and 32 logical cores, AND they set this to true, THEN it'll pin thread 1 to core 1, thread 2 to core 2, etc.; if there are 16 threads and 32 logical cores (but 16 physical cores), then it'll pin thread 1 to cores 1+17, thread 2 to 2+18, etc. (Unless hw_threads_is_grouping_collated, then t1 to 1+2, t2 to 3+4, etc.) I am not describing this perfectly, but I think you get the idea.
Linux: Use pthread_setaffinity_np().

  • macOS: Use thread_policy_set(). (The code is already there. However, it indicates a possible bug. Look into that, before considering the task done.)
  • Windows: Supposedly use SetThreadAffinityMask(GetCurrentThread(), affinityMask);.
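
Since the description above is admittedly loose, here is the thread-to-core mapping written out as a small pure function, using 0-based indices (so "thread 1 to cores 1+17" becomes thread 0 to cores {0, 16}). This is only my reading of the description, not the actual Flow code; the function name is hypothetical, and it assumes the logical core count is a multiple of the thread count. The result is what would be fed to pthread_setaffinity_np(), thread_policy_set(), or SetThreadAffinityMask().

```cpp
#include <cassert>
#include <vector>

// Returns the 0-based logical core indices that thread thread_idx would be pinned to.
// collated = false: sibling hardware threads are n_threads apart (0 and 16 share a core).
// collated = true:  sibling hardware threads are adjacent (0 and 1 share a core).
std::vector<unsigned> cores_for_thread(unsigned thread_idx, unsigned n_threads,
                                       unsigned n_logical_cores, bool collated)
{
  assert((n_threads > 0) && (thread_idx < n_threads));
  assert((n_logical_cores % n_threads) == 0);

  const unsigned n_cores_per_thread = n_logical_cores / n_threads;
  std::vector<unsigned> cores;
  for (unsigned k = 0; k != n_cores_per_thread; ++k)
  {
    cores.push_back(collated ? ((thread_idx * n_cores_per_thread) + k)
                             : (thread_idx + (k * n_threads)));
  }
  return cores;
}
```

For example, with 16 threads and 32 logical cores, thread index 1 maps to {1, 17} in the non-collated case and {2, 3} in the collated case; with 32 threads and 32 logical cores each thread gets exactly one core.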

Last but not least, in SHM-jemalloc fancy-pointers -- the core living in Shm_pool_offset_ptr_data which I wrote -- there is a pointer-tagging scheme in use, designed to ensure the representation fits in 64 bits (i.e., sizeof() is 8).

  • When storing a SHM-pointer, a pool ID and pool offset are encoded in 63 bits, but the MSB is set to 1 to indicate that, in fact, a SHM-pointer is there.
  • When storing a raw vaddr - such as to a stack location - the MSB is set to 0, while the other bits are simply copied from the original real vaddr. Then, the get-vaddr operation in this case:
    • Returns bits 62 through 0 as-is.
    • Returns the MSB as 0 if bit 62 is 0; else 1. This is because x86-64 pointers have to follow a canonical form, wherein the significant bits are the lowest 48-ish bits, while the remaining bits must all equal the most significant of those 48-ish bits. So it's either 00000..0...stuff...; or 11111..1...stuff.

This is a processor thing, not an OS thing. So this is where ARM64 could have different behavior. Turns out, apparently, its behavior is different but simpler: still only the low 48-ish bits are the significant ones; hence it is safe for us to keep using the MSB for our pointer-tagging scheme. But, it is simpler: there is no "canonical form": so we can just return the entire thing as-is in the get-vaddr op. There will be an #if but a simple one.
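
A sketch of the scheme on plain 64-bit integers, to show where that #if would go. This is not the Shm_pool_offset_ptr_data code. The function names are made up, and the split of the 63 bits into a 31-bit pool ID and 32-bit offset is purely an assumption for illustration. The ARM64 branch encodes the "return as-is" claim above and inherits its uncertainty.

```cpp
#include <cstdint>

constexpr std::uint64_t MSB_MASK = std::uint64_t(1) << 63;
constexpr std::uint64_t BIT_62_MASK = std::uint64_t(1) << 62;

// SHM-pointer: MSB = 1, then (assumed split) 31 bits of pool ID, 32 bits of offset.
constexpr std::uint64_t encode_shm_ptr(std::uint32_t pool_id, std::uint32_t pool_offset)
{
  return MSB_MASK | (std::uint64_t(pool_id & 0x7FFFFFFFu) << 32) | pool_offset;
}

constexpr bool is_shm_ptr(std::uint64_t rep) { return (rep & MSB_MASK) != 0; }

// Raw vaddr: MSB forced to 0; bits 62 through 0 copied as-is.
constexpr std::uint64_t encode_raw_vaddr(std::uint64_t vaddr) { return vaddr & ~MSB_MASK; }

// x86-64 decode: restore the MSB from bit 62, since canonical form makes them equal.
constexpr std::uint64_t decode_raw_vaddr_canonical(std::uint64_t rep)
{
  return (rep & BIT_62_MASK) ? (rep | MSB_MASK) : rep;
}

// Decode for the build's architecture.
constexpr std::uint64_t decode_raw_vaddr(std::uint64_t rep)
{
#if defined(__aarch64__) || defined(_M_ARM64)
  return rep; // Per the notes above: no canonical form to restore (to be verified).
#else
  return decode_raw_vaddr_canonical(rep);
#endif
}
```

The canonical-form restore is kept as its own function so it can be exercised on any build machine, including with a high-half address such as 0xFFFF800000001000.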

@ygoldfeld ygoldfeld added the enhancement New feature or request label Mar 27, 2024
@ygoldfeld
Contributor Author

Missed one. This snippet pretty much covers it:

#ifdef FLOW_OS_WIN
static_assert(false, "Design of Permissions_level assumes a POSIX-y security model with users and groups; "
                       "we have not yet considered whether it can apply to Windows in its current form.");
#endif
  • macOS: Should be fine.
  • Windows: Well, see above. S_USER_ACCESS and S_GROUP_ACCESS, to begin with, assume a user/group-based security model - unsure how it works in Windows in the meantime. If it does map cleanly enough, then all instances where a Permissions_level is translated into a specific Permissions value (the POSIX RWXRWXRWX thing) have to be coded appropriately for Windows also. (At a glance, Boost.interprocess permissions class has some simple notion of Windows permissions. There is a void* involved and... stuff.)

@ygoldfeld
Contributor Author

ygoldfeld commented Apr 14, 2024

A note from delightful HackerNews thread, by user rurban:

I also went this route and came to the very same conclusions. Cap'n proto for fast reading, SHM for shared data, simple short messaging, just everything in C.
My only problem is MacOS with its too small default SHM buffers, you need to enhance them. Most solutions need a reboot, but a simple setter is enough. Like sudo sysctl -w kern.sysv.shmmax=16777216

Details unclear at the moment (to me) but reminder to look into macOS's and Windows's SHM size limits in general at least. (Certainly they were a "thing" for SHM-classic in Linux, much less so in SHM-jemalloc.)

EDIT: Oh. System V. We use the modern POSIX API. This is probably irrelevant as written. But there may be other such limits applicable, as in Linux.

@ygoldfeld
Contributor Author

Let's just link the HackerNews thread here - it'll be good to refer to some notes there. E.g., iceoryx guys indicate Windows porting might not be as simple (not that it looks that simple) as the above might indicate. Point is, comb through this when working on it. https://news.ycombinator.com/item?id=40028118
