Migrate components #1966

Merged
merged 28 commits into from Feb 9, 2016

@hkaiser hkaiser commented Jan 23, 2016

This fixes the remaining open issues in #559.

Transparent migration of arbitrary components is fully implemented now - \o/

@hkaiser hkaiser force-pushed the migrate_component branch 3 times, most recently from f402153 to 1762853 on January 23, 2016 18:48
typedef server::trigger_migrate_component_action<Component> action_type;
return async<action_type>(naming::get_locality_from_id(to_migrate),
typedef server::perform_migrate_component_action<Component> action_type;
return hpx::detail::async_colocated<action_type>(to_migrate,
Member (review comment):

Why don't you use hpx::components::colocated here?

Member Author (review comment):

It's an internal invocation, so I directly call the implementation for the colocated async.

@sithhell

The code currently tries to unpin an already migrated component right after the migration has been completed: https://gist.github.com/sithhell/44bf05eb58c1a9b83eaa

sithhell and others added 8 commits January 23, 2016 22:25
- added missing checks for was_migrated
- routing of just received parcel is now done with normal priority (if local)
- store stripped gids in AGAS was_migrated table
- add more tests

- flyby changes:
-- add optional deadlock detection in spinlock (very crude!)
-- minor move optimizations in parcel-port and AGAS
-- renamed HPX_THREAD_MINIMAL_DEADLOCK_DETECTION to HPX_HAVE_THREAD_...
- flyby: write error message for failing tests
…nctional now

- added more asserts to migration code
- diagnostic printouts in migrate_component test
- adding serialization to distribution policies

hkaiser commented Jan 31, 2016

All discussed issues have been addressed. This PR is ready to be merged.

@hkaiser hkaiser mentioned this pull request Jan 31, 2016
- implemented segmented iterator for hpx::unordered_map
- implemented serialization for std::unordered_map and corresponding test
- refactored test for component storage
- flyby: renamed partition_vector to partitioned_vector_partition

this fixes #1163
- flyby: fixed un-registration of migrated simple components

hkaiser commented Feb 3, 2016

@sithhell Which of the tests is causing this?


sithhell commented Feb 3, 2016

@hkaiser migrate_component


hkaiser commented Feb 3, 2016

@sithhell I understand that migrate_component fails. I meant what sub-test is failing? Is it blowing up always or is it a race?


sithhell commented Feb 3, 2016

For completeness: the earlier post has been deleted because it was based on an older version. However, the problem is still there. Running two localities, one thread per locality.

On 02/03/2016 04:19 PM, Hartmut Kaiser wrote:

@sithhell I understand that
migrate_component fails. I meant what sub-test is failing? Is it blowing
up always or is it a race?

Seems to blow up always. Looks like I used an earlier version; however,
after updating, here is the full error:
https://gist.github.com/sithhell/c963f9a8957f6e1d2d16

A release build, without address sanitizer, segfaults as well and gives
this output:
https://gist.github.com/sithhell/70c29cbc3e9bf5482ca2


sithhell commented Feb 3, 2016

There seems to be one leftover problem with caching: with the cache disabled, it works. However, with more than one thread per locality, the test_migrate_busy_component2 test either hangs (in release):

[16:43:06]:heller@luna:/home/heller/build/hpx/release:0:$ mpirun -np 2 ./bin/migrate_component_test -Ihpx.stacks.small_size=0x20000 -t2 -Ihpx.agas.use_caching=0
test_migrate_component: ->{0000000200000000, 0000000000000000}
test_migrate_component: <-{0000000200000000, 0000000000000000}
test_migrate_busy_component: ->{0000000200000000, 0000000000000000}
test_migrate_busy_component: <-{0000000200000000, 0000000000000000}
test_migrate_component2: ->{0000000200000000, 0000000000000000}
....................................................................................................
test_migrate_component2: <-{0000000200000000, 0000000000000000}
....................................................................................................
test_migrate_busy_component2: ->{0000000200000000, 0000000000000000}
001.1100.001.100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

or prints this output (debug + sanitizer):

mpirun -np 2 ./bin/migrate_component_test -Ihpx.stacks.small_size=0x20000 -t2 -Ihpx.agas.use_caching=0
test_migrate_component: ->{0000000200000000, 0000000000000000}
test_migrate_component: <-{0000000200000000, 0000000000000000}
test_migrate_busy_component: ->{0000000200000000, 0000000000000000}
test_migrate_busy_component: <-{0000000200000000, 0000000000000000}
test_migrate_component2: ->{0000000200000000, 0000000000000000}
....................................................................................................
test_migrate_component2: <-{0000000200000000, 0000000000000000}
......................................migrate_component_test: /home/heller/hpx/hpx/runtime/get_ptr.hpp:55: void hpx::detail::get_ptr_for_migration_deleter::operator()(Component *) [Component = test_server]: Assertion `was_migrated' failed.
[luna:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 22034 RUNNING AT luna
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions


hkaiser commented Feb 3, 2016

Disabling caching just causes more network traffic, since nothing will ever be resolved by the AGAS caches. I think this has no bearing on the migration operation itself; it just exposes the problem in a different way.

@hkaiser hkaiser mentioned this pull request Feb 3, 2016

hkaiser commented Feb 3, 2016

@sithhell I understand that migrate_component fails. I meant what sub-test is failing? Is it blowing up always or is it a race?

Seems to blow up always. Looks like I used an earlier version; however, after updating, here is the full error: https://gist.github.com/sithhell/c963f9a8957f6e1d2d16

Which of the localities is failing? The one where the object comes from, or the one where it goes to?


sithhell commented Feb 4, 2016

The latest commit makes it work for the one-thread-per-locality case; the test with more than one thread per locality is still failing with the same symptoms.


sithhell commented Feb 8, 2016

\o/
The latest commits seem to fix the remaining issues with component migration!

The bad news: the touched migration-to-storage tests fail now:


hkaiser commented Feb 8, 2016

The assertion you're seeing is the same as the one reported in #1944. It seems to be unrelated to the migration functionality. By the looks of it, the code that handles sending parcels to gids that are not in the AGAS cache fails to set the destination locality properly under certain circumstances.

hkaiser added a commit that referenced this pull request Feb 9, 2016
@hkaiser hkaiser merged commit 0a2c647 into master Feb 9, 2016
@hkaiser hkaiser deleted the migrate_component branch February 9, 2016 12:59