WeeklyTelcon_20160517
Geoff Paulsen edited this page May 17, 2016 · 4 revisions
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Jeff Squyres
- Brad Benton
- Howard
- Josh Hursey
- Joshua Ladd
- Nathan Hjelm
- Ralph
- Sylvain Jeaugey
- Todd Kordenbrock
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
- 1161 - Open IB error path: Gilles asked Mike to review; now in its 2nd iteration.
- Joshua Ladd tagged on 2.x version.
- 1150 - There are 2 places in Init and 1 in Finalize where we do an RTE barrier.
- If launched with mpirun, it works just fine.
- But direct launch will hang with Cray or Slurm PMIx because those have blocking RTE barriers, and those do not progress.
- Patched in master by using an MPI barrier so that other things progress.
- Will need to block 2.0.x for this fix also. Ralph will create a PR.
- Once these get in, do another RC and move this out.
- Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
- Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
- PMIx barrier
- Nathan will review 1164.
- PR 1673: the multi-threaded issue that George ran into is a doozy.
- Free path in C++: one thread was in the dereg hooks inside delete.
- Another thread was trying to allocate space, triggering internal garbage collection.
- Classic deadlock.
- Nathan reworked the rcache / mpool code to not hold the lock while doing deletes.
- All locks are always on in RDMA because there is no way around it.
- The last rcache bug: with more than 100 registrations associated with the memory being munmapped, the code ran into an infinite loop.
- Nathan and George testing.
- IBM will do some multi-threaded testing as well.
- PowerPC issues as well. Nathan had to revise the table a bit.
- ppc64le, if you do a dlsym, pointer is into table of contents: 1 is real address.
- The problem is that the TOC is getting patched.
- When patching, we need to patch the real function, not the TOC pointer.
- ppc64BE - may still
- 1162 - multiple threads create the same endpoint simultaneously.
- Nathan thought he handled that case.
- For 2.0.0rc2 we forgot to send the announcement to users-alias. Will do for rc3.
- Put an announcement about the migration guide on the announcement list.
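
The rcache / mpool rework discussed under PR 1673 (do not hold the lock while doing deletes) can be illustrated with a small sketch. This is not the actual Open MPI code: `cache_t`, `reg_t`, `cache_evict_below`, and `example_hook` are hypothetical names, and the eviction criterion (an integer id below a limit) is an assumption chosen only to keep the example short. The pattern is to unlink doomed entries onto a private list while holding the lock, release the lock, and only then run the deregistration hooks and free the entries, so a hook that re-enters the cache cannot deadlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical registration record; not the real rcache type. */
typedef struct reg {
    struct reg *next;
    int id;
} reg_t;

typedef struct {
    pthread_mutex_t lock;
    reg_t *head;
} cache_t;

/* Hypothetical deregistration hook; it may take the cache lock itself. */
typedef void (*dereg_hook_t)(reg_t *r, void *ctx);

void cache_init(cache_t *c) {
    assert(c != NULL);
    pthread_mutex_init(&c->lock, NULL);
    c->head = NULL;
}

int cache_insert(cache_t *c, int id) {
    reg_t *r = malloc(sizeof *r);
    if (r == NULL) return -1;
    r->id = id;
    pthread_mutex_lock(&c->lock);
    r->next = c->head;
    c->head = r;
    pthread_mutex_unlock(&c->lock);
    return 0;
}

int cache_size(cache_t *c) {
    int n = 0;
    pthread_mutex_lock(&c->lock);
    for (reg_t *r = c->head; r != NULL; r = r->next) n++;
    pthread_mutex_unlock(&c->lock);
    return n;
}

/* Evict every entry with id < limit. Returns the number evicted. */
int cache_evict_below(cache_t *c, int limit, dereg_hook_t hook, void *ctx) {
    reg_t *doomed = NULL;

    /* Phase 1 (lock held): only unlink; no hooks, no free. */
    pthread_mutex_lock(&c->lock);
    reg_t **pp = &c->head;
    while (*pp != NULL) {
        reg_t *r = *pp;
        if (r->id < limit) {
            *pp = r->next;      /* remove from the shared list */
            r->next = doomed;   /* push onto the private list */
            doomed = r;
        } else {
            pp = &r->next;
        }
    }
    pthread_mutex_unlock(&c->lock);

    /* Phase 2 (lock released): hooks and frees are safe to re-enter. */
    int n = 0;
    while (doomed != NULL) {
        reg_t *r = doomed;
        doomed = r->next;
        if (hook != NULL) hook(r, ctx);
        free(r);
        n++;
    }
    return n;
}

/* Example hook that re-enters the cache lock via cache_size(). */
typedef struct {
    cache_t *cache;
    int calls;
} hook_ctx_t;

void example_hook(reg_t *r, void *ctx) {
    hook_ctx_t *h = ctx;
    (void)r;
    (void)cache_size(h->cache); /* would self-deadlock if the lock were held */
    h->calls++;
}
```

The cost of this design is that an entry can be unlinked but not yet freed for a short window, so anything that must observe the deletion as complete has to wait for phase 2 rather than for the unlock.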
Review Master MTT testing (https://mtt.open-mpi.org/)
- IBM trying to ramp up MTT testing. Hopefully will have Power8 XL compiler testing soon.
- Some issues passing certain flags to XL compilers. Josh Hursey is working on it.
- Cisco / Intercomm create failures.
- Getbyte offset test requires v2.0.0 or greater and spins until timeout on 1.10.
- Mellanox, Sandia, Intel
- LANL, Houston, IBM
- Cisco, ORNL, UTK, NVIDIA