Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential zero-copy for primarynamespace::bulk_service_async et.al. #1543

Closed
wants to merge 2 commits into from

Conversation

hkaiser
Copy link
Member

@hkaiser hkaiser commented May 22, 2015

This fixes one instance where zero-copy serialization was interfering with a plain async invocation.

// )
// {
// this->base_type::bulk_service_non_blocking(this->get_gid(), reqs, priority);
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll remove the code.

@hkaiser
Copy link
Member Author

hkaiser commented May 23, 2015

This is related to #1521

@sithhell
Copy link
Member

I'm not sure what this is supposed to fix. How is the cb different to what is already in place in parcelhandler::put_parcel?

@hkaiser
Copy link
Member Author

hkaiser commented May 24, 2015

I'm not sure what this is supposed to fix. How is the cb different to what is already in place in parcelhandler::put_parcel ?

This change turns one of the internal async calls (which has a high potential to trigger zero-copy serialization) into an equivalent call to async_cb. The difference is that the new code keeps the arguments alive long enough (until the parcel has been sent).

@sithhell
Copy link
Member

I'm not sure what this is supposed to fix. How is the cb different to what is already in place in parcelhandler::put_parcel ?

This change turns one of the internal async calls (which has a high potential to trigger zero-copy serialization) into an equivalent call to async_cb. The difference is that the new code keeps the arguments alive long enough (until the parcel has been sent).

Yes, that's what I can read from the code as well, but how is that
different to the mechanism already in place?

@hkaiser
Copy link
Member Author

hkaiser commented May 24, 2015

Yes, that's what I can read from the code as well, but how is that different to the mechanism already in place?

As a far as I remember, we have no such mechanism in place, do we?

@sithhell
Copy link
Member

Yes, that's what I can read from the code as well, but how is that different to the mechanism already in place?

As a far as I remember, we have no such mechanism in place, do we?

The parcel is being kept alive until after it has been completely sent.
That should take care of all lifetime issues (
https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime/parcelset/parcelhandler.cpp#L431
).
The proposed patch only takes care of the lifetime of the created requests,
not the ones passed to async, which are copied inside of the to be sent
parcel.

@hkaiser
Copy link
Member Author

hkaiser commented May 24, 2015

The parcel is being kept alive until after it has been completely sent. That should take care of all lifetime issues

If I understand the situation correctly, this is not true. The parcel stays alive indeed, but zero-copy-serialized arguments are not embedded in the parcel, but just referenced by it. Those arguments go out of scope as soon as async returns, at which point the parcel might not have been sent. Thus the parcel might end up referencing non-existing data. The proposed patch rectifies this situation for this one call site. This is a stop-gap measure necessary until we have properly fixed async to wait before returning if arguments have been zero-copy-serialized. I believe we discussed this at length.

@sithhell
Copy link
Member

The parcel is being kept alive until after it has been completely sent.
That should take care of all lifetime issues

If I understand the situation correctly, this is not true. The parcel stays alive indeed, but zero-copy-serialized arguments are not embedded in the parcel, but just referenced by it. Those arguments go out of scope as soon as async returns, at which point the parcel might not have been sent. Thus the parcel might end up referencing non-existing data. The proposed patch rectifies this situation for this one call site. This is a stop-gap measure necessary until we have properly fixed async to wait before returning if arguments have been zero-copy-serialized. I believe we discussed this at length.

I think this analysis of the problem is wrong. First of all, we don't have
any proof that zero copy serialization is the culprit for John's problems.
Second, every parameter that is passed to async and friends is decay copied
somewhere before those function calls return. In the case of zero copy
serialization that shouldn't be a problem either. As far as I can see is
that there are three scenarios:

  1. a parameter is zero copy serialized for which the parcel/transfer_action
    has unique ownership (for example std::vector<char>). There is absolutely
    no problem with that. We create a zero copy chunk and the data is being
    kept alive by the parcel
  2. a parameter is zero copy serialized for which the parcel/transfer_action
    has shared ownership (for example serialize_buffer<char>). Here, the zero
    copy chunk pointer is kept alive in the parcel as well by the help of
    shared pointer semantics. The only problem here is to properly synchronize
    in user code to not overwrite data inside of the serialize buffer which
    hasn't been sent yet. Another problem of course occurs if the user passes a
    pointer to data without taking care of the proper lifetime, there is IMHO
    nothing we can do about it.
  3. a parameter without any ownership semantics. There is nothing we can do
    here.

The proposed patch doesn't take care of the potential problems in 2 or the
third scenario.

@hkaiser
Copy link
Member Author

hkaiser commented May 25, 2015

Are you saying that there is no potential problem when async is triggering a zero-copy-serialization?

@sithhell
Copy link
Member

Are you saying that there is no potential problem when async is triggering a zero-copy-serialization?

That is what I'm trying to get at, yes. Do we have a unit/regression test
that suggests otherwise?

@hkaiser
Copy link
Member Author

hkaiser commented May 25, 2015

Are you saying that there is no potential problem when async is triggering a zero-copy-serialization?

That is what I'm trying to get at, yes. Do we have a unit/regression test that suggests otherwise?

You know as well as I do that reproducing this issue is very difficult. However, I will try to create a test. All we know is that this patch improved the problems @biddisco was seeing.

Besides, even if this patch turns out not to do any good, it definitely does not do any harm either.

@sithhell
Copy link
Member

Yes, I know that reproducing this is hard, yet, there hasn't been a single test failure within our existing teststuite that shows that problem (the TCP parcelport is sending fully asynchronous, the MPI one is currently sending synchronous if there are no futures which aren't ready yet involved). That's the primary reason why I am highly suspicous that zero copy serialization is to blame here.

stubs::primary_namespace::bulk_service_async(
(*it).first, (*it).second, action_priority_));
(*it).first, (*it).second,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though the requests are stored in a shared pointer which is kept alive by the callback function, the requests are copied here. The actual requests that are part of the asynchronous communication are not being held alive, by the bound callback function.

@sithhell
Copy link
Member

Besides, even if this patch turns out not to do any good, it definitely does not do any harm either.

That's correct, that patch, if I am not completely mistaken, is essentially a noop associated with the cost of allocating memory and atomic increments/decrements of the shared_ptr reference counts.

@hkaiser hkaiser closed this May 26, 2015
@hkaiser hkaiser deleted the zero_copy_bulk_service_fix branch May 26, 2015 23:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants