Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

addressing races in concurrent process startup #1

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Artemy-Mellanox
Copy link
Owner

In the process of initiating multiple processes concurrently, specifically with
automatic detection of the primary process, certain race conditions have been
identified. This patch series introduces a straightforward test that showcases
the issue and subsequently addresses the problems surfaced by the test. These
fixes aim to ensure the robust and secure utilization of DPDK within intricate
solutions that involve starting processes with job orchestrators such as Slurm
or Hadoop YARN.

This commit adds a test scenario that initiates multiple processes
concurrently. These processes attach to the same shared heap, with an
automatic detection mechanism to identify the primary process.

Signed-off-by: Artemy Kovalyov <artemyko@nvidia.com>
There exists a time gap between the creation of the multiprocess channel
and the registration of request action handlers. Within this window, a
secondary process that receives an eal_dev_mp_request broadcast
notification might respond with ENOTSUP. This, in turn, causes the
rte_dev_probe() operation to fail in another secondary process.
To avoid this, disregarding ENOTSUP responses to attach notifications.

Fixes: 244d513 ("eal: enable hotplug on multi-process")
Cc: stable@dpdk.org

Signed-off-by: Artemy Kovalyov <artemyko@nvidia.com>
This commit addresses an issue related to the cleanup of the
multiprocess channel. Previously, when closing the channel, there was a
risk of losing trailing messages. This issue was particularly noticeable
when broadcast message from primary to secondary processes was sent
while a secondary process was closing it's mp channel. In this fix, we
delete mp socket file before stopping mp receive thread.

Fixes: e788528 ("ipc: stop mp control thread on cleanup")
Cc: stable@dpdk.org

Signed-off-by: Artemy Kovalyov <artemyko@nvidia.com>
If the configuration file is absent, the autodetection function should
generate and secure it. Otherwise, multiple simultaneous openings could
erroneously identify themselves as primary instances.

Fixes: af75078 ("first public release")
Cc: stable@dpdk.org

Signed-off-by: Artemy Kovalyov <artemyko@nvidia.com>
The initialization of the Memzone file-backed array ensures its
uniqueness by employing an exclusive lock. This is crucial because only
one primary process can exist per specific shm_id, which is further
protected by the exclusive EAL runtime configuration lock.

However, during the process closure, the exclusive lock on both the
fbarray and the configuration is not explicitly released. The
responsibility of releasing these locks is left to the generic quit
procedure. This can lead to a potential race condition when the
configuration is released before the fbarray.

To address this, we propose explicitly closing the memzone fbarray. This
ensures proper order of operations during process closure and prevents
any potential race conditions arising from the mismatched lock release
timings.

Fixes: af75078 ("first public release")
Cc: stable@dpdk.org

Signed-off-by: Artemy Kovalyov <artemyko@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant