Possible race in sliding semaphore #2338

biddisco · 2016-09-21T19:19:13Z

Whilst experimenting with the osu_latency test using a sliding semaphore I found a problem that looks like a race in sliding_semaphore. The attached code snippet should reproduce a deadlock when run on >1 localities and >1 threads.

It looks as though the thread that calls wait on the semaphore takes the lock and goes into a wait, but the signal to wake the thread comes in before the waiting thread has actually begun waiting, so the notify_one call does not wake it and it then blocks forever. This seems to happen because only one thread is processing actions; with a window_size>1 the problem goes away, as another thread can signal a new lower_value and things wake properly.

This is only my guesswork from testing today ...

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>
#include <hpx/include/actions.hpp>
#include <hpx/lcos/local/detail/sliding_semaphore.hpp>

#include <cstddef>
#include <iostream>
#include <vector>
// -----------------------------------------------------------------------------------
double message_double(double d)
{
    return d;
}
HPX_PLAIN_ACTION(message_double);

// -----------------------------------------------------------------------------------
int main()
{
    // use the first remote locality to bounce messages, if possible
    hpx::id_type here = hpx::find_here();

    hpx::id_type there = here;
    std::vector<hpx::id_type> localities = hpx::find_remote_localities();
    if (!localities.empty())
        there = localities[0];

    std::size_t parcel_count = 0;
    std::size_t loop         = 10000;
    std::size_t window_size  = 1;
    std::size_t skip         = 50;

    hpx::lcos::local::sliding_semaphore sem(window_size,0);
    message_double_action msg;
    //
    //
    for (std::size_t i = 0; i < (loop*window_size) + skip; ++i) {
        // launch a message to the remote node
        hpx::async(msg, there, 3.5).then(
            hpx::launch::sync,
            // when the message completes, increment our semaphore count
            // so that N are always in flight
            [&,parcel_count](auto &&f) -> void {
                sem.signal(parcel_count);
                std::cout << "Signalled with value " << parcel_count << std::endl;
            }
        );

        //
        parcel_count++;

        //
        std::cout << "Waiting with value " << parcel_count << std::endl;
        sem.wait(parcel_count);
    }

    // wait on the last message, otherwise semaphore throws an exception
    // because it is signalled, but nobody is waiting on it
    std::cout << "Waiting for final signal before exit pc is " << parcel_count <<
        " wait is " << parcel_count+window_size-1 << "\n";
    sem.wait(parcel_count + window_size - 1);
    std::cout << "Finished Waiting for final signal before exit \n";

    return 0;
}
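
For reference, the suspected failure mode (a signal that arrives before the waiter has really started waiting, so the notification is lost) can be illustrated outside HPX. The following is only a minimal sketch built on std::mutex and std::condition_variable. The class `toy_sliding_semaphore` and the helper `early_signal_is_not_lost` are hypothetical names, not part of HPX, and the blocking rule (wait while `upper_limit - max_difference > lower_limit`) is an assumption about the intended semantics. It is not the HPX implementation and not the fix that was merged. The point is that the waiter re-checks the condition under the lock instead of relying on having seen the notification.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for illustration only; NOT hpx::lcos::local::sliding_semaphore.
class toy_sliding_semaphore
{
public:
    toy_sliding_semaphore(std::int64_t max_difference, std::int64_t lower_limit)
      : max_difference_(max_difference)
      , lower_limit_(lower_limit)
    {
    }

    // Assumed semantics: block while upper_limit is more than
    // max_difference ahead of the largest lower limit signalled so far.
    void wait(std::int64_t upper_limit)
    {
        std::unique_lock<std::mutex> lk(mtx_);
        // The predicate is evaluated under the lock before sleeping and
        // again after every wakeup, so a signal that was delivered before
        // this thread started waiting is still observed.
        cond_.wait(lk, [this, upper_limit] {
            return upper_limit - max_difference_ <= lower_limit_;
        });
    }

    // Raise the lower limit (it never decreases) and wake all waiters.
    void signal(std::int64_t lower_limit)
    {
        {
            std::lock_guard<std::mutex> lk(mtx_);
            lower_limit_ = std::max(lower_limit_, lower_limit);
        }
        cond_.notify_all();
    }

    std::int64_t lower_limit()
    {
        std::lock_guard<std::mutex> lk(mtx_);
        return lower_limit_;
    }

private:
    std::mutex mtx_;
    std::condition_variable cond_;
    std::int64_t max_difference_;
    std::int64_t lower_limit_;
};

// Replays the problematic ordering on one thread: the signal is delivered
// first, the wait starts afterwards. With the predicate check, the wait
// returns immediately instead of sleeping on a notification it missed.
bool early_signal_is_not_lost()
{
    toy_sliding_semaphore sem(1, 0);
    sem.signal(1);    // arrives "too early", nobody is waiting yet
    sem.wait(2);      // 2 - 1 <= 1, so this does not block
    return true;
}
```

If the waiter instead slept unconditionally and depended on a later notify to wake it, the same ordering would leave it asleep until some other signal arrived, which would match the observation that the hang disappears with window_size>1.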


hkaiser · 2016-09-22T00:42:39Z

@biddisco Could you try the fixing_2338 branch and check whether it fixes your issue?

hkaiser · 2016-09-22T13:29:05Z

This has been fixed by merging #2339

biddisco added category: LCOs affecting: CSCS labels Sep 21, 2016

biddisco added this to the 1.0.0 milestone Sep 21, 2016

hkaiser added the type: defect label Sep 21, 2016

hkaiser added a commit that referenced this issue Sep 22, 2016

Attempt to fix sliding_semaphore (#2338)

3748b4f

hkaiser closed this as completed Sep 22, 2016
