addressing races in concurrent process startup #1

Artemy-Mellanox · 2023-11-28T15:48:30Z

Starting multiple processes concurrently, with automatic detection of the
primary process, exposes several race conditions. This patch series adds a
simple test that reproduces the issue and then fixes the problems the test
surfaces. The fixes aim to make DPDK robust and secure to use in complex
deployments where processes are started by job orchestrators such as Slurm
or Hadoop YARN.

This commit adds a test scenario that initiates multiple processes concurrently. These processes attach to the same shared heap, with an automatic detection mechanism to identify the primary process.

Signed-off-by: Artemy Kovalyov <artemyko@nvidia.com>

There is a time gap between the creation of the multiprocess channel and the registration of request action handlers. Within this window, a secondary process that receives an eal_dev_mp_request broadcast notification might respond with ENOTSUP. This, in turn, causes the rte_dev_probe() operation to fail in another secondary process. To avoid this, disregard ENOTSUP responses to attach notifications.

Fixes: 244d513 ("eal: enable hotplug on multi-process")
Cc: stable@dpdk.org
Signed-off-by: Artemy Kovalyov <artemyko@nvidia.com>
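The reply handling described above can be sketched in isolation. This is a minimal illustration, not the EAL code: `struct mp_reply_sketch` and `attach_result()` are hypothetical names, and the assumption is that each secondary answers with a negative errno value in a `result` field.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one reply to an attach notification. */
struct mp_reply_sketch {
	int result; /* 0 on success, negative errno on failure */
};

/*
 * Fold the replies to an attach notification into a single status.
 * -ENOTSUP is taken to mean "this peer has not registered its handler
 * yet" and is skipped; any other non-zero result fails the request.
 */
static int attach_result(const struct mp_reply_sketch *replies, size_t n)
{
	size_t i;

	assert(replies != NULL || n == 0);
	for (i = 0; i < n; i++) {
		int r = replies[i].result;

		if (r == -ENOTSUP)
			continue; /* peer still starting up, not a real error */
		if (r != 0)
			return r;
	}
	return 0;
}
```

With this rule, a reply set of `{0, -ENOTSUP, 0}` is treated as success, while a genuine failure such as `-ENODEV` is still reported to the caller.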

This commit addresses an issue related to the cleanup of the multiprocess channel. Previously, when closing the channel, there was a risk of losing trailing messages. This issue was particularly noticeable when a broadcast message from the primary to the secondary processes was sent while a secondary process was closing its mp channel. In this fix, we delete the mp socket file before stopping the mp receive thread.

Fixes: e788528 ("ipc: stop mp control thread on cleanup")
Cc: stable@dpdk.org
Signed-off-by: Artemy Kovalyov <artemyko@nvidia.com>
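The ordering can be illustrated with a plain Unix datagram socket. This is a sketch under assumptions, not the EAL implementation: `chan_open()`, `chan_send()` and `chan_close()` are hypothetical helpers, and a non-blocking drain loop stands in for the mp receive thread. The point is only that removing the socket file first stops new senders, so whatever is already queued can still be consumed before the descriptor is closed.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill a Unix socket address from a path (hypothetical helper). */
static void chan_addr(struct sockaddr_un *addr, const char *path)
{
	assert(strlen(path) < sizeof(addr->sun_path));
	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	strcpy(addr->sun_path, path);
}

/* Create a datagram socket bound to path; returns the fd or -1. */
static int chan_open(const char *path)
{
	struct sockaddr_un addr;
	int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	chan_addr(&addr, path);
	unlink(path); /* drop a stale file left by an earlier run */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Send one message to the socket bound at path; 0 on success, -1 on error. */
static int chan_send(const char *path, const char *msg)
{
	struct sockaddr_un addr;
	ssize_t rc;
	int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	chan_addr(&addr, path);
	rc = sendto(fd, msg, strlen(msg), 0,
		    (struct sockaddr *)&addr, sizeof(addr));
	close(fd);
	return rc < 0 ? -1 : 0;
}

/*
 * Close the channel: remove the socket file first so no new message can
 * be addressed to it, then consume what is already queued, then close.
 * Returns the number of trailing messages that were drained.
 */
static int chan_close(int fd, const char *path)
{
	char buf[64];
	int drained = 0;

	unlink(path);
	while (recv(fd, buf, sizeof(buf), MSG_DONTWAIT) >= 0)
		drained++;
	close(fd);
	return drained;
}
```

Once the file is gone, a late `chan_send()` fails immediately instead of queuing a message that nobody will read.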

If the configuration file is absent, the autodetection function should create it and lock it. Otherwise, multiple processes opening it simultaneously could each erroneously identify themselves as the primary.

Fixes: af75078 ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Artemy Kovalyov <artemyko@nvidia.com>
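A rough sketch of the create-and-lock idea, assuming an `fcntl()` write lock on the configuration file decides the role. `detect_role()` and `detect_role_in_child()` are hypothetical helpers, not the EAL functions, and for brevity any lock failure is treated as "someone else is primary".

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum proc_role { ROLE_PRIMARY, ROLE_SECONDARY, ROLE_ERROR };

/*
 * Create the file if it is missing and try to take an exclusive lock.
 * On ROLE_PRIMARY, *fd_out stays open and keeps holding the lock.
 */
static enum proc_role detect_role(const char *path, int *fd_out)
{
	struct flock wr_lock = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0, /* 0 means "to end of file" */
	};
	int fd;

	assert(path != NULL && fd_out != NULL);
	*fd_out = -1;
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return ROLE_ERROR;
	if (fcntl(fd, F_SETLK, &wr_lock) == 0) {
		*fd_out = fd;
		return ROLE_PRIMARY;
	}
	close(fd); /* lock is held by another process */
	return ROLE_SECONDARY;
}

/* Run the same detection in a forked child and report what it decided. */
static enum proc_role detect_role_in_child(const char *path)
{
	int status, fd;
	pid_t pid = fork();

	if (pid < 0)
		return ROLE_ERROR;
	if (pid == 0)
		_exit(detect_role(path, &fd));
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return ROLE_ERROR;
	return (enum proc_role)WEXITSTATUS(status);
}
```

Because the file is created and locked by the same detection call, the first process to acquire the lock becomes primary and every later process sees the lock and falls back to secondary, even when the file did not exist beforehand.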

The initialization of the memzone file-backed array ensures its uniqueness by taking an exclusive lock. This is crucial because only one primary process can exist per shm_id, which is further protected by the exclusive EAL runtime configuration lock. However, during process shutdown, the exclusive locks on the fbarray and the configuration are not explicitly released. Releasing them is left to the generic quit procedure. This can lead to a race condition when the configuration lock is released before the fbarray lock. To address this, we propose explicitly closing the memzone fbarray. This ensures the proper order of operations during process shutdown and prevents races caused by mismatched lock release timings.

Fixes: af75078 ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Artemy Kovalyov <artemyko@nvidia.com>
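The release order can be sketched with two lock files. This is an illustration under assumptions: `open_exclusive()`, `struct runtime_sketch` and `runtime_cleanup()` are hypothetical, and both locks use `flock()` here for simplicity, which may differ from what the EAL does for each file.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open (creating if needed) and exclusively lock a file; fd or -1. */
static int open_exclusive(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
		close(fd); /* someone else holds the lock */
		return -1;
	}
	return fd;
}

/* Hypothetical holder for the two locks a primary process owns. */
struct runtime_sketch {
	int cfg_fd;     /* runtime configuration lock */
	int memzone_fd; /* memzone array lock */
};

/*
 * Explicit teardown: drop the memzone lock first, the config lock last.
 * A new primary that wins the config lock then never finds the memzone
 * array still locked by the process that is exiting.
 */
static void runtime_cleanup(struct runtime_sketch *rt)
{
	if (rt->memzone_fd >= 0) {
		close(rt->memzone_fd); /* closing the fd releases its flock */
		rt->memzone_fd = -1;
	}
	if (rt->cfg_fd >= 0) {
		close(rt->cfg_fd);
		rt->cfg_fd = -1;
	}
}
```

Leaving both descriptors to be closed implicitly at process exit gives no such ordering guarantee, which is the window the commit message describes.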

Artemy-Mellanox added 5 commits November 28, 2023 18:30

Artemy-Mellanox force-pushed the fix branch from 0c6e314 to 3ea4543 Compare November 28, 2023 16:33
