
Towards safer migration #1378

Merged
merged 12 commits into from Feb 25, 2015

Conversation

hkaiser
Member

@hkaiser hkaiser commented Feb 20, 2015

This PR adds a new AGAS API: hpx::agas::begin_migration and hpx::agas::end_migration. These functions make sure that no address resolution proceeds while a migration operation is in flight. All address resolution requests will be deferred until end_migration is called.

This PR also adapts the existing API functions hpx::components::migrate, hpx::components::migrate_to_storage, and hpx::components::migrate_from_storage to use the new AGAS API.

Overall, this is the first step to make the process of migrating objects transparent to the user (see #559).

@sithhell
Member

While this is absolutely the step in the right direction, I don't like the proposed API:

  1. What happens in the case of an exception between begin and end migration?
  2. In the case of a migration to storage, isn't the end of the migration only marked when the component is migrated back into AGAS?

@hkaiser
Member Author

hkaiser commented Feb 20, 2015

While this is absolutely the step in the right direction, I don't like the proposed API:

What would you suggest doing differently?

  1. What happens in the case of an exception between begin and end migration?

The code makes sure that end_migration is always executed (even if some exception is thrown in between; see for instance here).

  1. In the case of a migration to storage, isn't the end of the migration only marked when the
    component is migrated back into AGAS?

Even if an object is migrated to storage, it still stays registered with AGAS. Otherwise we will not be able to bring it back transparently later on.

@sithhell
Member

On Friday, February 20, 2015 12:37:04 Hartmut Kaiser wrote:

While this is absolutely the step in the right direction, I don't like the
proposed API:
What would you suggest doing differently?

For case 1) I'd suggest implementing something akin to scoped_lock.

  1. What happens in the case of an exception between begin and end
    migration?
    The code makes sure that end_migration is always executed (even if some
    exception is thrown in between; see for instance
    [here](https://github.com/STEllAR-GROUP/hpx/blob/towards_safer_migration/hpx/runtime/components/server/migrate_component.hpp#L169)).
  2. In the case of a migration to storage, isn't the end of the migration
    only marked when the component is migrated back into AGAS?

Even if an object is migrated to storage, it still stays registered with
AGAS. Otherwise we will not be able to bring it back transparently later
on.

That's true; however, it doesn't make sense to resolve the gid as long as it
is in the storage space. This can be interpreted to mean that the migration
hasn't been completed yet. I guess it is a semantic peculiarity of migration
to some storage that needs to be discussed.

@hkaiser
Member Author

hkaiser commented Feb 20, 2015

While this is absolutely the step in the right direction, I don't like the
proposed API:
What would you suggest doing differently?

  1. What happens in the case of an exception between begin and end
    migration?
    The code makes sure that end_migration is always executed (even if some
    exception is thrown in between; see for instance
    [here](https://github.com/STEllAR-GROUP/hpx/blob/towards_safer_migration/hpx/runtime/components/server/migrate_component.hpp#L169)).
    For case 1) I'd suggest implementing something akin to scoped_lock.

That's exactly what has been implemented (semantically). All exceptions will go through the futures returned from the separate steps of the operation. Once one of the futures has an exception it will bubble up through all remaining continuations, which will make sure that end_migration ends up being executed. I wouldn't know how to use a literal scoped guard, as the whole thing is an asynchronous operation.

  1. In the case of a migration to storage, isn't the end of the migration
    only marked when the component is migrated back into AGAS?

Even if an object is migrated to storage, it still stays registered with
AGAS. Otherwise we will not be able to bring it back transparently later
on.

That's true; however, it doesn't make sense to resolve the gid as long as it
is in the storage space. This can be interpreted to mean that the migration
hasn't been completed yet. I guess it is a semantic peculiarity of migration
to some storage that needs to be discussed.

Sorry, I don't understand what you mean. An object which was moved to storage (or to disk) still maintains its global id and stays registered with the AGAS instance where it was initially registered. If somebody resolves this global id while the object is on disk (not in main memory), this will return the global id of the storage component (as the object's 'locality') and an invalid local virtual address.

IOW, storage components act like a special locality which puts those objects to disk which are migrated there. Migrating objects from this special locality resurrects the original object using its old global id.

@sithhell
Member

Am 20.02.2015 22:32 schrieb "Hartmut Kaiser" notifications@github.com:

While this is absolutely the step in the right direction, I don't like the proposed API.

What would you suggest doing differently?

  1. What happens in the case of an exception between begin and end migration?

The code makes sure that end_migration is always executed (even if some exception is thrown in between; see for instance here).

For case 1) I'd suggest implementing something akin to scoped_lock.

That's exactly what has been implemented (semantically). All exceptions will go through the futures returned from the separate steps of the operation. Once one of the futures has an exception it will bubble up through all remaining continuations, which will make sure that end_migration ends up being executed. I wouldn't know how to use a literal scoped guard, as the whole thing is an asynchronous operation.

  2. In the case of a migration to storage, isn't the end of the migration only marked when the component is migrated back into AGAS?

Even if an object is migrated to storage, it still stays registered with AGAS. Otherwise we will not be able to bring it back transparently later on.

That's true; however, it doesn't make sense to resolve the gid as long as it is in the storage space. This can be interpreted to mean that the migration hasn't been completed yet. I guess it is a semantic peculiarity of migration to some storage that needs to be discussed.

Sorry, I don't understand what you mean. An object which was moved to storage (or to disk) still maintains its global id and stays registered with the AGAS instance where it was initially registered. If somebody resolves this global id while the object is on disk (not in main memory), this will return the global id of the storage component (as the object's 'locality') and an invalid local virtual address.

IOW, storage components act like a special locality which puts those objects to disk which are migrated there. Migrating objects from this special locality resurrects the original object using its old global id.

You can't call an action on a component that's stored away, can you? As
such, it might look like the migration is still in progress (until after
migrate_from_storage is called). That is,
begin_migration->migrate_to_storage and
migrate_from_storage->end_migration. If that weren't the case, it might
be extremely difficult to transparently hide the fact that a component has
been swapped out from a user of such a component.

@hkaiser
Member Author

hkaiser commented Feb 20, 2015

You can't call an action on a component that's stored away, can you? As
such, it might look like the migration is still in progress (until after
migrate_from_storage is called). That is,
begin_migration->migrate_to_storage and
migrate_from_storage->end_migration. If that weren't the case, it might
be extremely difficult to transparently hide the fact that a component has
been swapped out from a user of such a component.

Currently, you can't call an action on a migrated object; that's what I said above - it would blow up. However, the plan is to transparently bring the object back if somebody invokes an action on it. For this to happen, more work is required, though. But I have the distinct feeling that I still don't understand what you have in mind...

@sithhell
Member

Ok, one more try:
In general, migration of a component is the process of moving that object from locality A to locality B, correct?

Now, as far as I understand, before migration starts, one calls begin_migration to mark the start of the migration process. This protects the component from being resolved incorrectly while being in transit. end_migration is then called when the component has been fully transferred to locality B (or some error occurred), correct?

In my understanding, migration to disk does not form an exception to the semantics described above. The "medium" through which the component is transferred is some storage, and possibly the network. As such, begin_migration still forms the start of a migration process which is initiated with migrate_to_storage. As long as the component "lives" on the storage device, it can be seen as being "in transit". Now, whenever migrate_from_storage is called and has successfully returned, the migration process is done and end_migration should be called to mark the component as being fully migrated.

Having special handling for the above described case sounds dangerous. How do you differentiate between program failure (naming::address being invalid and the GID accidentally pointing to a component storage, yet the type of the component and the type of the storage mismatch) and an intentional call?

I hope that explains everything a little more clearly.

@hkaiser
Member Author

hkaiser commented Feb 20, 2015

Ok, one more try:
In general, migration of a component is the process of moving that object from locality A to
locality B, correct?

yes.

Now, as far as I understand, before migration starts, one calls begin_migration to mark the start of the migration process. This protects the component from being resolved incorrectly while being in transit. end_migration is then called when the component has been fully transferred to locality B (or some error occurred), correct?

In my understanding, migration to disk does not form an exception to the semantics described above. The "medium" through which the component is transferred is some storage, and possibly the network. As such, begin_migration still forms the start of a migration process which is initiated with migrate_to_storage. As long as the component "lives" on the storage device, it can be seen as being "in transit". Now, whenever migrate_from_storage is called and has successfully returned, the migration process is done and end_migration should be called to mark the component as being fully migrated.

Having special handling for the above described case sounds dangerous. How do you differentiate between program failure (naming::address being invalid and the GID accidentally pointing to a component storage, yet the type of the component and the type of the storage mismatch) and an intentional call?

I hope that explains everything a little more clearly.

Ok, that explains your point. The misunderstanding, however, is that you see the process of migrating an object to storage and back as a whole as 'migration'.

The way I see it is that the storage is just another 'locality'. That means that migrating something to the storage itself comprises the migration step. Same would be true for migrating an object back from the storage locality. That means that in between the two migration operations the object 'lives' in the storage.

The goal is to have a mechanism which transparently blurs the boundary between in-core and out-of-core memory, where objects are brought back to life transparently once needed.

@hkaiser
Member Author

hkaiser commented Feb 24, 2015

Can we merge this now?

@sithhell
Member

On Friday, February 20, 2015 15:17:58 Hartmut Kaiser wrote:

Ok, one more try:
In general, migration of a component is the process of moving that object from locality A to locality B, correct?

yes.

Now, as far as I understand, before migration starts, one calls begin_migration to mark the start of the migration process. This protects the component from being resolved incorrectly while being in transit. end_migration is then called when the component has been fully transferred to locality B (or some error occurred), correct?

In my understanding, migration to disk does not form an exception to the semantics described above. The "medium" through which the component is transferred is some storage, and possibly the network. As such, begin_migration still forms the start of a migration process which is initiated with migrate_to_storage. As long as the component "lives" on the storage device, it can be seen as being "in transit". Now, whenever migrate_from_storage is called and has successfully returned, the migration process is done and end_migration should be called to mark the component as being fully migrated.

Having special handling for the above described case sounds dangerous. How do you differentiate between program failure (naming::address being invalid and the GID accidentally pointing to a component storage, yet the type of the component and the type of the storage mismatch) and an intentional call?

I hope that explains everything a little more clearly.

Ok, that explains your point. The misunderstanding, however, is that you see the process of migrating an object to storage and back as a whole as 'migration'.

Quite some misunderstanding indeed. I made all those implications from reading
through the code. The only thing that didn't add up was the strange (IMHO)
semantics imposed by the way begin_migration and end_migration are called.
The primary use case I see for migrate_to_storage is micro-checkpointing,
or any other case where you don't immediately need the component's data
again, for example swapping if you run out of main memory.

The way I see it is that the storage is just another 'locality'. That means
that migrating something to the storage itself comprises the migration
step. Same would be true for migrating an object back from the storage
locality. That means that in between the two migration operations the
object 'lives' in the storage.

I don't think the current implementation is suitable for that kind of
migration. There are far more efficient solutions that don't require
completely swapping the data out of the process's address space. That is,
such a facility should come with some kind of allocation policy that takes
care of constructing the component in the appropriate place (a different
NUMA domain, a file, etc.); see for example memkind or memory-mapped files.
If there really is a scenario where you don't have direct access to the
memory region, one could hide that behind a "smart pointer" which marshals
the data back and forth with special attention to concurrent accesses (for
example, only serialize the data once; this could be done by reference
counting). By having such an allocation policy which constructs the
component appropriately, instead of storing it away and marking it as
invalid, the order in which begin_migration and end_migration are called
makes sense again.

The goal is to have a mechanism which transparently blurs the boundary
between in-core and out-of-core memory, where objects are brought back to
life transparently once needed.

Why do they need to be not alive in the first place? By having them "not
really alive" it gets hard to reason about which components are alive and
which are not. Additionally, this transparent layer is currently missing
completely, and it is very hard for me to imagine how this might work,
despite adding a lot of complexity and imposing artificial overheads for
something that is, IMHO, not needed for this particular use case. As a
matter of fact, we already use in-core memory (L1 cache, L2 cache,
registers) and out-of-core memory (last-level cache, main memory)
transparently, and this can easily be extended to any other kind of memory
(files, memory on dedicated accelerators, etc.). Why not use the same
mechanisms that are already available, instead of inventing yet another
layer of complexity?

@hkaiser
Member Author

hkaiser commented Feb 25, 2015

@sithhell: Sorry, but I don't think the issues you're raising have any relation to the PR at hand. This PR introduces two additional safeguards necessary to make migration more transparent in use (any kind of migration, even that which we have already had for over a year).

I also think I have not been able to properly explain the migrate_to_storage and migrate_from_storage functionalities. Both are, btw, completely independent of HPX itself; they are implemented as a component, which could easily be separated from the core library.

The migrate_to_storage and migrate_from_storage functionalities are meant to test out the infrastructure needed to truly put objects to disk (or any other non-volatile memory like NVRAM, SDRAM, etc.). Those objects would maintain their global ids while being stored on disk, even beyond the lifetime of the application they were created by. In the end, AGAS itself could be put to disk this way. We need this functionality for one of our projects where we look into integration of HPX with parallel file systems in the context of some DNA sequencing application.

@sithhell
Member

On 02/25/2015 05:20 AM, Hartmut Kaiser wrote:

@sithhell https://github.com/sithhell: Sorry, but I don't think the issues you're raising have any relation to the PR at hand. This PR introduces two additional safeguards necessary to make migration more transparent in use (any kind of migration, even that which we have already had for over a year).

Right, this PR adds the safeguard and implements the safeguard for the migration facilities we already have: migration between localities and migration to/from storage. It's the implementation of the migration to/from storage (part of this PR) that I disagree with.

I also think I have not been able to properly explain the |migrate_to_storage| and |migrate_from_storage| functionalities. Both are, btw, completely independent of HPX itself; they are implemented as a component, which could easily be separated from the core library.

The |migrate_to_storage| and |migrate_from_storage| functionalities are meant to test out the infrastructure needed to truly put objects to disk (or any other non-volatile memory like NVRAM, SDRAM, etc.). Those objects would maintain their global ids while being stored on disk, even beyond the lifetime of the application they were created by. In the end, AGAS itself could be put to disk this way. We need this functionality for one of our projects where we look into integration of HPX with parallel file systems in the context of some DNA sequencing application.

Ok, let's recap how the safeguard works and what it should protect from:

  1. It implements a facility that prevents a GID from being resolved
     while it is being migrated.
  2. begin_migration marks the start of a migration process; it's an
     asynchronous function. After the future returned by this function
     becomes ready, it's safe to start the migration process.
  3. end_migration is the counterpart. After that function returns, the
     migration process should have finished, and it is safe to resolve
     that GID again.
  4. Any resolve request that happens between begin_migration and
     end_migration is suspended and continues to be executed after
     end_migration has finished.

I hope I got those points correct. The implication I derive from these
semantics is that hpx::agas::resolve will always return a valid
hpx::naming::address. Anything else doesn't make sense to me.

Now here comes the, IMHO, problematic part: those derived semantics are
broken by how this safeguard is implemented for migrate_to_storage.
After migrate_to_storage returns, the migration is marked as "done",
that is, hpx::agas::resolve would return some nonsensical address.
This is in contradiction to the implementation: the migrated component
is archived away to some storage location (which can be main memory,
disk, or any other non-volatile memory) and is therefore not alive; as
such, the migration process has not finished.
The rationale you brought up for having this difference in semantics is
to still be able to use the component as if it were still alive (this
functionality is clearly missing). I think persistent archiving of
components and the ability to use components stored in different
locations are sufficiently different that they deserve a separation of
concerns. As explained in my previous mail, this scenario doesn't even
necessarily need serialization to begin with. If it turns out to need it
though, there shouldn't be a problem implementing that on top of the
existing `migrate_to_storage` and `migrate_from_storage` facilities. In
either case, the reason why I am reluctant to agree to merge this PR is
that neither of the use cases described here is sufficiently handled,
and the change in semantics with respect to what `hpx::agas::resolve`
returns is, IMHO, very troublesome.

@hkaiser
Member Author

hkaiser commented Feb 25, 2015

Ok, let's recap how the safeguard works and what it should protect from:

  1. It implements a facility that prevents a GID from being resolved
     while it is being migrated.
  2. begin_migration marks the start of a migration process; it's an
     asynchronous function. After the future returned by this function
     becomes ready, it's safe to start the migration process.
  3. end_migration is the counterpart. After that function returns, the
     migration process should have finished, and it is safe to resolve
     that GID again.
  4. Any resolve request that happens between begin_migration and
     end_migration is suspended and continues to be executed after
     end_migration has finished.

All of this is correct.

Those derived semantics are broken by how this safeguard is implemented
for migrate_to_storage. After migrate_to_storage returns, the migration
is marked as "done", that is, hpx::agas::resolve would return some
nonsensical address.

Ok, I see your point. I will make sure that any object in storage is transparently brought back when accessed. This PR however is all about the two safeguarding mechanisms, not about the migration to storage. I take it that you have no objections to the safeguarding code as it is.

As explained in my previous mail, this scenario doesn't
even necessarily need serialization to begin with

FWIW, you always have to serialize an object in order to put it to disk.

@sithhell
Member

FWIW, you always have to serialize an object in order to put it to disk.

That's not entirely true. One could imagine an allocator that uses a memory-mapped file to store the necessary data (similar to boost::interprocess::allocator). This of course requires that all members of said component are either not heap-allocated or support an allocator. Allocators using pointers into non-volatile memory are conceivable as well. Nevertheless, this is completely orthogonal to this PR, but it is IMHO important for further considerations on how to proceed with migrating components to different storage locations.

@sithhell
Member

Ok, I see your point. I will make sure that any object in storage is transparently brought back when accessed. This PR however is all about the two safeguarding mechanisms, not about the migration to storage. I take it that you have no objections to the safeguarding code as it is.

No objection to the general safeguarding mechanism. The only objection is to the way it is used for migrate_to_storage and migrate_from_storage, which is also part of this PR. Before merging this PR I would like to have some form of consensus about how to proceed on that issue, either by filing a separate issue or by fixing the issues in question within this PR prior to the merge.

@hkaiser
Member Author

hkaiser commented Feb 25, 2015

[Feb 25th, 08:17] heller: hkaiser: gtg now ... I'm fine with merging the PR ... but please create issue that explains what needs to be done so that migrate_to_storage/migrate_from_storage works again as expected

hkaiser added a commit that referenced this pull request Feb 25, 2015
@hkaiser hkaiser merged commit c041b66 into master Feb 25, 2015
@hkaiser hkaiser deleted the towards_safer_migration branch February 25, 2015 14:27