New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate components #1966

Merged
merged 28 commits into from Feb 9, 2016

Conversation

Projects
None yet
2 participants
@hkaiser
Member

hkaiser commented Jan 23, 2016

This fixes the remaining open issues in #559.

Transparent migration of arbitrary components is fully implemented now - \o/

@sithhell

This comment has been minimized.

Show comment
Hide comment
@sithhell

sithhell Jan 22, 2016

Member

I don't think the approach of going through the thread function/action decoration is the right approach.
The code as it is right now has indeed a race condition. The race condition is in applier::schedule_action:

If I am not completely mistaken, an object could have been migrated in the meantime before the actual pinning happened.

My proposal would be to add a function that returns a smart pointer to the component here that pins it in the ctor and unpins it in the dtor. If the component has been migrated in the meantime, the smart pointer would be invalid. The thread function should then take this special smart pointer as an argument if it is valid or route the parcel if it is invalid. This makes the check of being migrated and the pinning an atomic operation. In addition, RIIA will take care of correct unpinning once the thread has been finished execution.

Member

sithhell commented on 0e881f7 Jan 22, 2016

I don't think the approach of going through the thread function/action decoration is the right approach.
The code as it is right now has indeed a race condition. The race condition is in applier::schedule_action:

If I am not completely mistaken, an object could have been migrated in the meantime before the actual pinning happened.

My proposal would be to add a function that returns a smart pointer to the component here that pins it in the ctor and unpins it in the dtor. If the component has been migrated in the meantime, the smart pointer would be invalid. The thread function should then take this special smart pointer as an argument if it is valid or route the parcel if it is invalid. This makes the check of being migrated and the pinning an atomic operation. In addition, RIIA will take care of correct unpinning once the thread has been finished execution.

This comment has been minimized.

Show comment
Hide comment
@sithhell

sithhell Jan 22, 2016

Member

Here is a gist with an outline of how the implementation in schedule_action might look like:
https://gist.github.com/sithhell/64d3cb1bf7c54c148889

Member

sithhell replied Jan 22, 2016

Here is a gist with an outline of how the implementation in schedule_action might look like:
https://gist.github.com/sithhell/64d3cb1bf7c54c148889

This comment has been minimized.

Show comment
Hide comment
@hkaiser

hkaiser Jan 22, 2016

Member

The bottom line is: we need a) a scheme of scoped pinning/unpinning, and b) a means of avoiding the described race.

Member

hkaiser replied Jan 22, 2016

The bottom line is: we need a) a scheme of scoped pinning/unpinning, and b) a means of avoiding the described race.

Show outdated Hide outdated hpx/runtime/components/migrate_component.hpp
typedef server::trigger_migrate_component_action<Component> action_type;
return async<action_type>(naming::get_locality_from_id(to_migrate),
typedef server::perform_migrate_component_action<Component> action_type;
return hpx::detail::async_colocated<action_type>(to_migrate,

This comment has been minimized.

@sithhell

sithhell Jan 23, 2016

Member

Why don't you use hpx::components::colocated here?

@sithhell

sithhell Jan 23, 2016

Member

Why don't you use hpx::components::colocated here?

This comment has been minimized.

@hkaiser

hkaiser Jan 23, 2016

Member

It's an internal invocation, so I directly call the implementation for the colocated async.

@hkaiser

hkaiser Jan 23, 2016

Member

It's an internal invocation, so I directly call the implementation for the colocated async.

Show outdated Hide outdated hpx/runtime/components/server/memory_block.hpp
@@ -10,6 +10,7 @@
#include <hpx/traits/is_component.hpp>
#include <hpx/exception.hpp>
#include <hpx/runtime/components/component_type.hpp>
#include <hpx/runtime/components/pinned_ptr.hpp>

This comment has been minimized.

@sithhell

sithhell Jan 23, 2016

Member

I think this include can be removed now.

@sithhell

sithhell Jan 23, 2016

Member

I think this include can be removed now.

This comment has been minimized.

@hkaiser

hkaiser Jan 23, 2016

Member

Yes, will do.

@hkaiser

hkaiser Jan 23, 2016

Member

Yes, will do.

Show outdated Hide outdated hpx/runtime/components/server/migration_support.hpp
trigger_migration_.set_value();
l.unlock(); // avoid holding lock during suspension

This comment has been minimized.

@sithhell

sithhell Jan 23, 2016

Member

This unlock might be problematic. At this point, it is not guaranteed that the object has been marked as migrated in AGAS and actions might still be scheduled on to be migrated objects.

@sithhell

sithhell Jan 23, 2016

Member

This unlock might be problematic. At this point, it is not guaranteed that the object has been marked as migrated in AGAS and actions might still be scheduled on to be migrated objects.

This comment has been minimized.

@hkaiser

hkaiser Jan 23, 2016

Member

Yes, this is a race. Will fix.

@hkaiser

hkaiser Jan 23, 2016

Member

Yes, this is a race. Will fix.

Show outdated Hide outdated hpx/runtime/components/server/migration_support.hpp
@@ -73,42 +141,47 @@ namespace hpx { namespace components
static threads::thread_function_type
decorate_action(naming::address::address_type lva, F && f)
{
using util::placeholders::_1;
// Make sure we pin the thread at construction of the bound object

This comment has been minimized.

@sithhell

sithhell Jan 23, 2016

Member

I think this is a typo, we don't pin the thread, but the component (object) the task will work on.

@sithhell

sithhell Jan 23, 2016

Member

I think this is a typo, we don't pin the thread, but the component (object) the task will work on.

This comment has been minimized.

@hkaiser

hkaiser Jan 23, 2016

Member

Thanks!

@hkaiser

hkaiser Jan 23, 2016

Member

Thanks!

@sithhell

This comment has been minimized.

Show comment
Hide comment
@sithhell

sithhell Jan 23, 2016

Member

The code currently tries to unpin an already migrated component right after the migration has been completed: https://gist.github.com/sithhell/44bf05eb58c1a9b83eaa

Member

sithhell commented Jan 23, 2016

The code currently tries to unpin an already migrated component right after the migration has been completed: https://gist.github.com/sithhell/44bf05eb58c1a9b83eaa

sithhell and others added some commits Jan 23, 2016

More fixes for component migration
- added missing checks for was_migrated
- routing of just received parcel is now done with normal priority (if local)
- store stripped gids in AGAS was_migrated table
- add more tests

- flyby changes:
-- add optional deadlock detection in spinlock (very crude!)
-- minor move optimizations in parcel-port and AGAS
-- renamed HPX_THREAD_MINIMAL_DEADLOCK_DETECTION to HPX_HAVE_THREAD_...
Don't store migrating objects in AGAS cache
- flyby: write error message for failing tests
Fixed problem in AGAS migration table, migration seems to be fully fu…
…nctional now

- added more asserts to migration code
- diagnostic printouts in migrate_component test
Adding migrate() function taking a distribution policy
- adding serialization to distribution policies
Fixing and cleaning up migrate_[to|from]_storage
- migrate_[to|from]_storage is fully transparent now
- fixing some held locks while suspending
- adding tests, docs, and comments
@hkaiser

This comment has been minimized.

Show comment
Hide comment
@hkaiser

hkaiser Jan 31, 2016

Member

All discussed issues have been addressed. This PR is ready to be merged.

Member

hkaiser commented Jan 31, 2016

All discussed issues have been addressed. This PR is ready to be merged.

@hkaiser hkaiser referenced this pull request Jan 31, 2016

Closed

version 1.0? #1896

hkaiser added some commits Feb 1, 2016

Implemented ability to store component storage to disk
- implemented segmented iterator for hpx::unordered_map
- implemented serialization for std::unordered_map and corresponding test
- refactored test for component storage
- flyby: renamed partition_vector to partitioned_vector_partition

this fixes #1163
Unregister all gids from component store at destruction
- flyby: fixed un-registration of migrated simple components
@hkaiser

This comment has been minimized.

Show comment
Hide comment
@hkaiser

hkaiser Feb 3, 2016

Member

@sithhell Which of the tests is causing this?

Member

hkaiser commented Feb 3, 2016

@sithhell Which of the tests is causing this?

@sithhell

This comment has been minimized.

Show comment
Hide comment
@sithhell

sithhell Feb 3, 2016

Member

@hkaiser migrate_component

Member

sithhell commented Feb 3, 2016

@hkaiser migrate_component

@sithhell

This comment has been minimized.

Show comment
Hide comment
@sithhell

sithhell Feb 3, 2016

Member

shouldn't bind_sync be implemented by just calling bind(...).get(ec) to avoid code duplication?

shouldn't bind_sync be implemented by just calling bind(...).get(ec) to avoid code duplication?

This comment has been minimized.

Show comment
Hide comment
@hkaiser

hkaiser Feb 3, 2016

Member

Well yes, that can be done.

Member

hkaiser replied Feb 3, 2016

Well yes, that can be done.

@hkaiser

This comment has been minimized.

Show comment
Hide comment
@hkaiser

hkaiser Feb 3, 2016

Member

@sithhell I understand that migrate_component fails. I meant what sub-test is failing? Is it blowing up always or is it a race?

Member

hkaiser commented Feb 3, 2016

@sithhell I understand that migrate_component fails. I meant what sub-test is failing? Is it blowing up always or is it a race?

@sithhell

This comment has been minimized.

Show comment
Hide comment
@sithhell

sithhell Feb 3, 2016

Member

For completeness: The eariler post has been deleted because it used an earlier version. However, the problem is still there. Running two localities, one thread per locality.

On 02/03/2016 04:19 PM, Hartmut Kaiser wrote:

@sithhell https://github.com/sithhell I understand that
migrate_component fails. I meant what sub-test is failing? Is it blowing
up always or is it a race?

Seems to blow up always. Looks like I used a earlier version, however,
after updating, here is the full error:
https://gist.github.com/sithhell/c963f9a8957f6e1d2d16

A release build, without address sanitizer, segfaults as well and gives
this output:
https://gist.github.com/sithhell/70c29cbc3e9bf5482ca2

Member

sithhell commented Feb 3, 2016

For completeness: The eariler post has been deleted because it used an earlier version. However, the problem is still there. Running two localities, one thread per locality.

On 02/03/2016 04:19 PM, Hartmut Kaiser wrote:

@sithhell https://github.com/sithhell I understand that
migrate_component fails. I meant what sub-test is failing? Is it blowing
up always or is it a race?

Seems to blow up always. Looks like I used a earlier version, however,
after updating, here is the full error:
https://gist.github.com/sithhell/c963f9a8957f6e1d2d16

A release build, without address sanitizer, segfaults as well and gives
this output:
https://gist.github.com/sithhell/70c29cbc3e9bf5482ca2

@sithhell

This comment has been minimized.

Show comment
Hide comment
@sithhell

sithhell Feb 3, 2016

Member

There seems to be one leftover problem with caching. when disabling the cache, it works. However, once using more than one thread per locality, the test_migrate_busy_component2 test either hangs (in release):

[16:43:06]:heller@luna:/home/heller/build/hpx/release:0:$ mpirun -np 2 ./bin/migrate_component_test -Ihpx.stacks.small_size=0x20000 -t2 -Ihpx.agas.use_caching=0
test_migrate_component: ->{0000000200000000, 0000000000000000}
test_migrate_component: <-{0000000200000000, 0000000000000000}
test_migrate_busy_component: ->{0000000200000000, 0000000000000000}
test_migrate_busy_component: <-{0000000200000000, 0000000000000000}
test_migrate_component2: ->{0000000200000000, 0000000000000000}
....................................................................................................
test_migrate_component2: <-{0000000200000000, 0000000000000000}
....................................................................................................
test_migrate_busy_component2: ->{0000000200000000, 0000000000000000}
001.1100.001.100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

or prints this output (debug + sanitizer):

mpirun -np 2 ./bin/migrate_component_test -Ihpx.stacks.small_size=0x20000 -t2 -Ihpx.agas.use_caching=0
test_migrate_component: ->{0000000200000000, 0000000000000000}
test_migrate_component: <-{0000000200000000, 0000000000000000}
test_migrate_busy_component: ->{0000000200000000, 0000000000000000}
test_migrate_busy_component: <-{0000000200000000, 0000000000000000}
test_migrate_component2: ->{0000000200000000, 0000000000000000}
....................................................................................................
test_migrate_component2: <-{0000000200000000, 0000000000000000}
......................................migrate_component_test: /home/heller/hpx/hpx/runtime/get_ptr.hpp:55: void hpx::detail::get_ptr_for_migration_deleter::operator()(Component *) [Component = test_server]: Assertion `was_migrated' failed.
[luna:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 22034 RUNNING AT luna
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Member

sithhell commented Feb 3, 2016

There seems to be one leftover problem with caching. when disabling the cache, it works. However, once using more than one thread per locality, the test_migrate_busy_component2 test either hangs (in release):

[16:43:06]:heller@luna:/home/heller/build/hpx/release:0:$ mpirun -np 2 ./bin/migrate_component_test -Ihpx.stacks.small_size=0x20000 -t2 -Ihpx.agas.use_caching=0
test_migrate_component: ->{0000000200000000, 0000000000000000}
test_migrate_component: <-{0000000200000000, 0000000000000000}
test_migrate_busy_component: ->{0000000200000000, 0000000000000000}
test_migrate_busy_component: <-{0000000200000000, 0000000000000000}
test_migrate_component2: ->{0000000200000000, 0000000000000000}
....................................................................................................
test_migrate_component2: <-{0000000200000000, 0000000000000000}
....................................................................................................
test_migrate_busy_component2: ->{0000000200000000, 0000000000000000}
001.1100.001.100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

or prints this output (debug + sanitizer):

mpirun -np 2 ./bin/migrate_component_test -Ihpx.stacks.small_size=0x20000 -t2 -Ihpx.agas.use_caching=0
test_migrate_component: ->{0000000200000000, 0000000000000000}
test_migrate_component: <-{0000000200000000, 0000000000000000}
test_migrate_busy_component: ->{0000000200000000, 0000000000000000}
test_migrate_busy_component: <-{0000000200000000, 0000000000000000}
test_migrate_component2: ->{0000000200000000, 0000000000000000}
....................................................................................................
test_migrate_component2: <-{0000000200000000, 0000000000000000}
......................................migrate_component_test: /home/heller/hpx/hpx/runtime/get_ptr.hpp:55: void hpx::detail::get_ptr_for_migration_deleter::operator()(Component *) [Component = test_server]: Assertion `was_migrated' failed.
[luna:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 22034 RUNNING AT luna
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
@hkaiser

This comment has been minimized.

Show comment
Hide comment
@hkaiser

hkaiser Feb 3, 2016

Member

Disabling caching just causes for more networking to happen as things will never be resolved by the AGAS caches. I think this has no bearing wrt the migration operation itself, just exposes the problem in a different way.

Member

hkaiser commented Feb 3, 2016

Disabling caching just causes for more networking to happen as things will never be resolved by the AGAS caches. I think this has no bearing wrt the migration operation itself, just exposes the problem in a different way.

@hkaiser hkaiser referenced this pull request Feb 3, 2016

Merged

Refactor action traits #1977

@hkaiser

This comment has been minimized.

Show comment
Hide comment
@hkaiser

hkaiser Feb 3, 2016

Member

@sithhell https://github.com/sithhell I understand that migrate_component fails. I meant what sub-test is failing? Is it blowing up always or is it a race?

Seems to blow up always. Looks like I used a earlier version, however, after updating, here is the full error: https://gist.github.com/sithhell/c963f9a8957f6e1d2d16

Which of the localities is failing? the one where the object comes from or the one where it goes to?

Member

hkaiser commented Feb 3, 2016

@sithhell https://github.com/sithhell I understand that migrate_component fails. I meant what sub-test is failing? Is it blowing up always or is it a race?

Seems to blow up always. Looks like I used a earlier version, however, after updating, here is the full error: https://gist.github.com/sithhell/c963f9a8957f6e1d2d16

Which of the localities is failing? the one where the object comes from or the one where it goes to?

@sithhell

This comment has been minimized.

Show comment
Hide comment
@sithhell

sithhell Feb 4, 2016

Member

The latest commit makes it work for the one thread per locality case, the more than one thread per locality test is still failing with the same symptoms.

Member

sithhell commented Feb 4, 2016

The latest commit makes it work for the one thread per locality case, the more than one thread per locality test is still failing with the same symptoms.

@sithhell

This comment has been minimized.

Show comment
Hide comment
@sithhell

sithhell Feb 8, 2016

Member

\o/
The latest commits seem to fix the remaining issues with component migration!

The bad news ... the touched migration to storage tests fail now:

Member

sithhell commented Feb 8, 2016

\o/
The latest commits seem to fix the remaining issues with component migration!

The bad news ... the touched migration to storage tests fail now:

@hkaiser

This comment has been minimized.

Show comment
Hide comment
@hkaiser

hkaiser Feb 8, 2016

Member

The assertion you're seeing is the same as reported in #1944. It seems to be unrelated to the migration functionality. By the looks of it, the code which handles sending parcels to gids which are not AGAS cached misses to set the destination locality properly under certain circumstances.

Member

hkaiser commented Feb 8, 2016

The assertion you're seeing is the same as reported in #1944. It seems to be unrelated to the migration functionality. By the looks of it, the code which handles sending parcels to gids which are not AGAS cached misses to set the destination locality properly under certain circumstances.

hkaiser added a commit that referenced this pull request Feb 9, 2016

@hkaiser hkaiser merged commit 0a2c647 into master Feb 9, 2016

1 check passed

ci/circleci Your tests passed on CircleCI!
Details

@hkaiser hkaiser deleted the migrate_component branch Feb 9, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment