
Towards safer migration #1378

Merged
merged 12 commits into from Feb 25, 2015

Conversation

hkaiser
Member

@hkaiser hkaiser commented Feb 20, 2015

This PR adds a new AGAS API: hpx::agas::begin_migration and hpx::agas::end_migration. These functions make sure that no address resolution proceeds while a migration operation is in flight. All address resolution requests will be deferred until end_migration is called.

This PR also adapts the existing API functions hpx::components::migrate, hpx::components::migrate_to_storage, and hpx::components::migrate_from_storage to use the new AGAS API.

Overall, this is the first step to make the process of migrating objects transparent to the user (see #559).

@sithhell
Member

While this is absolutely the step in the right direction, I don't like the proposed API:

  1. What happens in the case of an exception between begin and end migration?
  2. In the case of a migration to storage, isn't the end of the migration only marked when the component is migrated back into AGAS?

@hkaiser
Member Author

hkaiser commented Feb 20, 2015

While this is absolutely the step in the right direction, I don't like the proposed API:

What would you suggest doing differently?

  1. What happens in the case of an exception between begin and end migration?

The code makes sure that end_migration is always executed (even if some exception is thrown in between; see for instance here).

  1. In the case of a migration to storage, isn't the end of the migration only marked when the
    component is migrated back into AGAS?

Even if an object is migrated to storage, it still stays registered with AGAS. Otherwise we will not be able to bring it back transparently later on.

@sithhell
Member

On Friday, February 20, 2015 12:37:04 Hartmut Kaiser wrote:

While this is absolutely the step in the right direction, I don't like the
proposed API:
What would you suggest doing differently?

For case 1) I'd suggest implementing something akin to scoped_lock.

  1. What happens in the case of an exception between begin and end
    migration?
    The code makes sure that end_migration is always executed (even if some
    exception is thrown in between; see for instance
    [here](https://github.com/STEllAR-GROUP/hpx/blob/towards_safer_migration/hpx/runtime/components/server/migrate_component.hpp#L169)).
  2. In the case of a migration to storage, isn't the end of the migration
    only marked when the component is migrated back into AGAS?

Even if an object is migrated to storage, it still stays registered with
AGAS. Otherwise we will not be able to bring it back transparently later
on.

That's true; however, it doesn't make sense to resolve the gid as long as it
is in the storage space. This can be interpreted to mean that the migration
hasn't been completed yet. I guess it is a semantic peculiarity of migration
to some storage that needs to be discussed.

@hkaiser
Member Author

hkaiser commented Feb 20, 2015

While this is absolutely the step in the right direction, I don't like the
proposed API:
What would you suggest doing differently?

  1. What happens in the case of an exception between begin and end
    migration?
    The code makes sure that end_migration is always executed (even if some
    exception is thrown in between; see for instance
    [here](https://github.com/STEllAR-GROUP/hpx/blob/towards_safer_migration/hpx/runtime/components/server/migrate_component.hpp#L169)).
    For case 1) I'd suggest implementing something akin to scoped_lock.

That's exactly what has been implemented (semantically). All exceptions will go through the futures returned from the separate steps of the operation. Once one of the futures has an exception it will bubble up through all remaining continuations, which will make sure that end_migration ends up being executed. I wouldn't know how to use a literal scoped guard, as the whole thing is an asynchronous operation.

  1. In the case of a migration to storage, isn't the end of the migration
    only marked when the component is migrated back into AGAS?

Even if an object is migrated to storage, it still stays registered with
AGAS. Otherwise we will not be able to bring it back transparently later
on.

That's true; however, it doesn't make sense to resolve the gid as long as it
is in the storage space. This can be interpreted to mean that the migration
hasn't been completed yet. I guess it is a semantic peculiarity of migration
to some storage that needs to be discussed.

Sorry, I don't understand what you mean. An object which was moved to storage (or to disk) still maintains its global id and stays registered with the AGAS instance where it was initially registered. If somebody resolves this global id while the object is on disk (not in main memory), this will return the global id of the storage component (as the object's 'locality') and an invalid local virtual address.

IOW, storage components act like a special locality which puts those objects to disk which are migrated there. Migrating objects from this special locality resurrects the original object using its old global id.

@sithhell
Member

Am 20.02.2015 22:32 schrieb "Hartmut Kaiser" notifications@github.com:

While this is absolutely the step in the right direction, I don't like the proposed API.

What would you suggest doing differently?

  1. What happens in the case of an exception between begin and end migration?

The code makes sure that end_migration is always executed (even if some exception is thrown in between; see for instance here).

For case 1) I'd suggest implementing something akin to scoped_lock.

That's exactly what has been implemented (semantically). All exceptions will go through the futures returned from the separate steps of the operation. Once one of the futures has an exception it will bubble up through all remaining continuations, which will make sure that end_migration ends up being executed. I wouldn't know how to use a literal scoped guard, as the whole thing is an asynchronous operation.

  2. In the case of a migration to storage, isn't the end of the migration only marked when the component is migrated back into AGAS?

Even if an object is migrated to storage, it still stays registered with AGAS. Otherwise we will not be able to bring it back transparently later on.

That's true; however, it doesn't make sense to resolve the gid as long as it is in the storage space. This can be interpreted to mean that the migration hasn't been completed yet. I guess it is a semantic peculiarity of migration to some storage that needs to be discussed.

Sorry, I don't understand what you mean. An object which was moved to storage (or to disk) still maintains its global id and stays registered with the AGAS instance where it was initially registered. If somebody resolves this global id while the object is on disk (not in main memory), this will return the global id of the storage component (as the object's 'locality') and an invalid local virtual address.

IOW, storage components act like a special locality which puts those objects to disk which are migrated there. Migrating objects from this special locality resurrects the original object using its old global id.

You can't call an action on a component that's stored away, can you? As
such, it might look like the migration is still in progress (until after
migrate_from_storage is called). That is,
begin_migration->migrate_to_storage and
migrate_from_storage->end_migration. If that weren't the case, it might
be extremely difficult to transparently hide the fact that a component has
been swapped out from a user of such a component.

@hkaiser
Member Author

hkaiser commented Feb 20, 2015

You can't call an action on a component that's stored away, can you? As
such, it might look like the migration is still in progress (until after
migrate_from_storage is called). That is,
begin_migration->migrate_to_storage and
migrate_from_storage->end_migration. If that weren't the case, it might
be extremely difficult to transparently hide the fact that a component has
been swapped out from a user of such a component.

Currently, you can't call an action on a migrated object; that's what I said above - it would blow up. However, the plan is to transparently bring the object back if somebody invokes an action on it. For this to happen, more work is required, though. But I have the distinct feeling that I still don't understand what you have in mind...

@sithhell
Member

Ok, one more try:
In general, migration of a component is the process of moving that object from locality A to locality B, correct?

Now, as far as I understand, before migration starts, one calls begin_migration to mark the start of the migration process. This protects the component from being resolved incorrectly while being in transit. end_migration is then called when the component has been fully transferred to locality B (or some error occurred), correct?

In my understanding, migration to disk does not form an exception to the semantics described above. The "medium" through which the component is transferred is some storage, and possibly the network. As such, begin_migration still forms the start of a migration process which is initiated with migrate_to_storage. As long as the component "lives" on the storage device, it can be seen as being "in transit". Now, whenever migrate_from_storage is called and has successfully returned, the migration process is done and end_migration should be called to mark the component as being fully migrated.

Having special handling for the above described case sounds dangerous. How do you differentiate between program failure (naming::address being invalid and the GID accidentally pointing to a component storage, yet the type of the component and the type of the storage mismatch) and an intentional call?

I hope that explains everything a little more clearly.

@hkaiser
Member Author

hkaiser commented Feb 20, 2015

Ok, one more try:
In general, migration of a component is the process of moving that object from locality A to
locality B, correct?

yes.

Now, as far as I understand, before migration starts, one calls begin_migration to mark the start of the migration process. This protects the component from being resolved incorrectly while being in transit. end_migration is then called when the component has been fully transferred to locality B (or some error occurred), correct?

In my understanding, migration to disk does not form an exception to the semantics described above. The "medium" through which the component is transferred is some storage, and possibly the network. As such, begin_migration still forms the start of a migration process which is initiated with migrate_to_storage. As long as the component "lives" on the storage device, it can be seen as being "in transit". Now, whenever migrate_from_storage is called and has successfully returned, the migration process is done and end_migration should be called to mark the component as being fully migrated.

Having special handling for the above described case sounds dangerous. How do you differentiate between program failure (naming::address being invalid and the GID accidentally pointing to a component storage, yet the type of the component and the type of the storage mismatch) and an intentional call?

I hope that explains everything a little more clearly.

Ok, that explains your point. The misunderstanding, however, is that you see the process of migrating an object to storage and back as a whole as 'migration'.

The way I see it is that the storage is just another 'locality'. That means that migrating something to the storage itself comprises the migration step. Same would be true for migrating an object back from the storage locality. That means that in between the two migration operations the object 'lives' in the storage.

The goal is to have a mechanism which transparently blurs the boundary between in-core and out-of-core memory, where objects are brought back to life transparently once needed.

@hkaiser
Member Author

hkaiser commented Feb 24, 2015

Can we merge this now?

@sithhell
Member

On Friday, February 20, 2015 15:17:58 Hartmut Kaiser wrote:

Ok, one more try:
In general, migration of a component is the process of moving that object from locality A to locality B, correct?

yes.

Now, as far as I understand, before migration starts, one calls begin_migration to mark the start of the migration process. This protects the component from being resolved incorrectly while being in transit. end_migration is then called when the component has been fully transferred to locality B (or some error occurred), correct?

In my understanding, migration to disk does not form an exception to the semantics described above. The "medium" through which the component is transferred is some storage, and possibly the network. As such, begin_migration still forms the start of a migration process which is initiated with migrate_to_storage. As long as the component "lives" on the storage device, it can be seen as being "in transit". Now, whenever migrate_from_storage is called and has successfully returned, the migration process is done and end_migration should be called to mark the component as being fully migrated.

Having special handling for the above described case sounds dangerous. How do you differentiate between program failure (naming::address being invalid and the GID accidentally pointing to a component storage, yet the type of the component and the type of the storage mismatch) and an intentional call?

I hope that explains everything a little more clearly.

Ok, that explains your point. The misunderstanding, however, is that you see the process of migrating an object to storage and back as a whole as 'migration'.

Quite some misunderstanding indeed. I made all those implications from reading
through the code. The only thing that didn't add up was the strange (IMHO)
semantics imposed by the way begin_migration and end_migration are called.
The primary use case I see for migrate_to_storage is micro-checkpointing,
or any other case where you don't immediately need the component's data
again, for example swapping if you run out of main memory.

The way I see it is that the storage is just another 'locality'. That means
that migrating something to the storage itself comprises the migration
step. Same would be true for migrating an object back from the storage
locality. That means that in between the two migration operations the
object 'lives' in the storage.

I don't think the current implementation is suitable for that kind of
migration. There are far more efficient solutions that don't require
completely swapping the data out of the process's address space. That is,
such a facility should come with some kind of allocation policy that takes
care of constructing the component in the appropriate place (a different
NUMA domain, a file, etc.); see for example memkind or memory-mapped files.
If there really is a scenario where you don't have direct access to the
memory region, one could hide that behind a "smart pointer" which marshals
the data back and forth with special attention to concurrent accesses (for
example, only serialize the data once; this could be done by reference
counting). By having such an allocation policy which constructs the
component appropriately, instead of storing it away and marking it as
invalid, the order in which begin_migration and end_migration are called
makes sense again.

The goal is to have a mechanism which transparently blurs the boundary
between in-core and out-of-core memory, where objects are brought back to
life transparently once needed.

Why do they need to be not alive in the first place? By having them "not
really alive" it gets hard to reason about which components are alive and
which are not. Additionally, this transparent layer is currently missing
completely, and it is very hard for me to imagine how this might work,
despite adding a lot of complexity and imposing artificial overheads for
something that is, IMHO, not needed for this particular use case. As a
matter of fact, we already use in-core memory (L1 cache, L2 cache,
registers) and out-of-core memory (last-level cache, main memory)
transparently, and this can easily be extended to any other kind of memory
(files, memory on dedicated accelerators, etc.). Why not use the same
mechanisms that are already available, instead of inventing yet another
layer of complexity?

@hkaiser
Member Author

hkaiser commented Feb 25, 2015

@sithhell: Sorry, but I don't think the issues you're raising have any relation to the PR at hand. This PR introduces two additional safeguards necessary to make migration more transparent in use (any kind of migration, even that which we have already had for over a year).

I also think I have not been able to properly explain the migrate_to_storage and migrate_from_storage functionalities. Both are, btw, completely independent of HPX itself; they are implemented as a component, which could easily be separated from the core library.

The migrate_to_storage and migrate_from_storage functionalities are meant to test out the infrastructure needed to truly put objects to disk (or any other non-volatile memory like NVRAM, SDRAM, etc.). Those objects would maintain their global ids while being stored on disk, even beyond the lifetime of the application they were created by. In the end, AGAS itself could be put to disk this way. We need this functionality for one of our projects where we look into integration of HPX with parallel file systems in the context of some DNA sequencing application.

@sithhell
Member

On 02/25/2015 05:20 AM, Hartmut Kaiser wrote:

@sithhell https://github.com/sithhell: Sorry, but I don't think the issues you're raising have any relation to the PR at hand. This PR introduces two additional safeguards necessary to make migration more transparent in use (any kind of migration, even that which we have already had for over a year).

Right, this PR adds the safeguard and implements the safeguard for the migration facilities we already have: migration between localities and migration to/from storage. It's the implementation of the migration to/from storage (part of this PR) that I disagree with.

I also think I have not been able to properly explain the |migrate_to_storage| and |migrate_from_storage| functionalities. Both are, btw, completely independent of HPX itself; they are implemented as a component, which could easily be separated from the core library.

The |migrate_to_storage| and |migrate_from_storage| functionalities are meant to test out the infrastructure needed to truly put objects to disk (or any other non-volatile memory like NVRAM, SDRAM, etc.). Those objects would maintain their global ids while being stored on disk, even beyond the lifetime of the application they were created by. In the end, AGAS itself could be put to disk this way. We need this functionality for one of our projects where we look into integration of HPX with parallel file systems in the context of some DNA sequencing application.

Ok, let's recap how the safeguard works and what it should protect from:

  1. It implements a facility that prevents a GID from being resolved
     while it is being migrated.
  2. begin_migration marks the start of a migration process; it's an
     asynchronous function. After the future returned by this function
     becomes ready, it's safe to start the migration process.
  3. end_migration is the counterpart. After that function returns, the
     migration process should have finished, and it is safe to resolve
     that GID again.
  4. Any resolve request that happens between begin_migration and
     end_migration is suspended and continues to be executed after
     end_migration has finished.

I hope I got those points correct. The implication I derive from these
semantics is that hpx::agas::resolve will always return a valid
hpx::naming::address. Anything else doesn't make sense to me.

Now here comes the, IMHO, problematic part: those derived semantics are
broken by how this safeguard is implemented for migrate_to_storage.
After migrate_to_storage returns, the migration is marked as "done",
that is, hpx::agas::resolve would return some nonsensical address.
This is in contradiction to the implementation: the migrated component
is archived away to some storage location (which can be main memory,
disk, or any other non-volatile memory) and is therefore not alive; as
such, the migration process has not finished.
The rationale you brought up for having this difference in semantics is
to still be able to use the component as if it were still alive (this
functionality is clearly missing). I think persistent archiving of
components and the ability to use components stored in different
locations are sufficiently different that they deserve a separation of
concerns. As explained in my previous mail, this scenario doesn't even
necessarily need serialization to begin with. If it turns out to need it
though, there shouldn't be a problem implementing that on top of the
existing `migrate_to_storage` and `migrate_from_storage` facilities. In
either case, the reason why I am reluctant to agree to merge this PR is
that neither of the use cases described here is sufficiently handled,
and the change in semantics with respect to what `hpx::agas::resolve`
returns is, IMHO, very troublesome.

@hkaiser
Member Author

hkaiser commented Feb 25, 2015

Ok, let's recap how the safeguard works and what it should protect from:

  1. It implements a facility that prevents a GID from being resolved
     while it is being migrated.
  2. begin_migration marks the start of a migration process; it's an
     asynchronous function. After the future returned by this function
     becomes ready, it's safe to start the migration process.
  3. end_migration is the counterpart. After that function returns, the
     migration process should have finished, and it is safe to resolve
     that GID again.
  4. Any resolve request that happens between begin_migration and
     end_migration is suspended and continues to be executed after
     end_migration has finished.

All of this is correct.

Those derived semantics are broken by how this safeguard is implemented
for migrate_to_storage. After migrate_to_storage returns, the migration
is marked as "done", that is, hpx::agas::resolve would return some
nonsensical address.

Ok, I see your point. I will make sure that any object in storage is transparently brought back when accessed. This PR however is all about the two safeguarding mechanisms, not about the migration to storage. I take it that you have no objections to the safeguarding code as it is.

As explained in my previous mail, this scenario doesn't
even necessarily need serialization to begin with

FWIW, you always have to serialize an object in order to put it to disk.

@sithhell
Member

FWIW, you always have to serialize an object in order to put it to disk.

That's not entirely true. One could imagine an allocator that uses a memory-mapped file to store the necessary data (similar to boost::interprocess::allocator). This of course requires that all members of said component are either not heap-allocated or support an allocator. Allocators using pointers into non-volatile memory are conceivable as well. Nevertheless, this is completely orthogonal to this PR, but it is IMHO important for further considerations on how to proceed with migrating components to different storage locations.

@sithhell
Member

Ok, I see your point. I will make sure that any object in storage is transparently brought back when accessed. This PR however is all about the two safeguarding mechanisms, not about the migration to storage. I take it that you have no objections to the safeguarding code as it is.

No objection to the general safeguarding mechanism. The only objection is to the way it is used for migrate_to_storage and migrate_from_storage, which is also part of this PR. Before merging this PR I would like to have some form of consensus about how to proceed on that issue, either by filing a separate issue or by fixing the issues in question within this PR prior to the merge.

@hkaiser
Member Author

hkaiser commented Feb 25, 2015

[Feb 25th, 08:17] heller: hkaiser: gtg now ... I'm fine with merging the PR ... but please create issue that explains what needs to be done so that migrate_to_storage/migrate_from_storage works again as expected

hkaiser added a commit that referenced this pull request Feb 25, 2015
@hkaiser hkaiser merged commit c041b66 into master Feb 25, 2015
@hkaiser hkaiser deleted the towards_safer_migration branch February 25, 2015 14:27