WeeklyTelcon_20160606
Geoff Paulsen edited this page Jun 7, 2016 · 5 revisions
- Dialup Info: (Do not post to public mailing list or public wiki)
- Jeff Squyres
- Arm Patinyasakdikul
- Nathan
- Edgar Gabriel
- Ralph
- Todd Kordenbrock
- Geoff Paulsen
- Howard Pritchard
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
- Does anyone care about NAG on 1.10? - consensus is no.
- Nathan: Some threading issue on 1.10 OB1 Pending Progress. There is a leak in RDMA. Hanging in RDMACM.
- master PR 1758: if BTL Vader (purposely allocating 4000 fragments) does an Isend, it would go ahead and progress. This is a bug we've had for a long, long time: the frag list will leak and keep growing without bound.
- Vader was the first to hit it because every other BTL's free list max is -1.
- Need to hit really hard with lots of isends.
- The door is closing on RHEL 7.3, so we want to get this into 1.10.3.
- Red Hat / Ubuntu / SLES generally just pick up the latest for their release.
- Let it go into master tonight and see how it goes before deciding the risk for 1.10.3.
- Once it's in master, generate the PR for 1.10 to see how bad the backport is.
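
To illustrate why a capped free list might surface the leak discussed under PR 1758 sooner than an uncapped one, here is a toy model in Python. It is not Open MPI code: the `FreeList` class, the `run_sends` helper, and the reading of a max of -1 as "no upper bound" are all assumptions made for illustration, and the 4000 cap is simply borrowed from the figure in the notes above.

```python
class FreeList:
    """Toy fragment free list (hypothetical, not the Open MPI implementation).

    Assumption: max_frags < 0 means the list may grow without limit.
    """

    def __init__(self, max_frags):
        self.max_frags = max_frags
        self.allocated = 0  # fragments created so far
        self.free = []      # fragments currently available for reuse

    def get(self):
        """Return a fragment, or None if the cap has been reached."""
        if self.free:
            return self.free.pop()
        if self.max_frags >= 0 and self.allocated >= self.max_frags:
            return None  # capped list is exhausted
        self.allocated += 1
        return object()

    def put(self, frag):
        """Hand a fragment back for reuse."""
        self.free.append(frag)


def run_sends(flist, n_sends, leak):
    """Simulate n_sends sends; return how many could not get a fragment.

    If leak is True, fragments are never returned to the free list.
    """
    starved = 0
    for _ in range(n_sends):
        frag = flist.get()
        if frag is None:
            starved += 1
            continue
        if not leak:
            flist.put(frag)
    return starved


if __name__ == "__main__":
    capped = FreeList(4000)
    print("capped, leaking:", run_sends(capped, 10000, leak=True), capped.allocated)

    uncapped = FreeList(-1)
    print("uncapped, leaking:", run_sends(uncapped, 10000, leak=True), uncapped.allocated)
```

In this model, 10000 leaking sends against a cap of 4000 leave 6000 sends without a fragment, while the uncapped list never runs dry and instead grows to 10000 allocations. That matches the notes only in spirit: a capped list turns the leak into a visible failure, whereas an uncapped list hides it as steady memory growth.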
- Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
- Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- With the OB1 fixes, things are looking good. As of now, we haven't had a test that's causing things to blow up.
- Hitting a lot of subsystems in Open MPI harder than we used to.
- Threaded tests that we
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
- Nysal: File Open and a couple of others are not multi-thread safe, because we do a lazy open of the framework.
- PR 1199: are we expecting more commits?
- Yes, a couple more; George and Nathan are going back and forth. It has gone into master and fixed the hang on master.
- OB1 failures.
- Nvidia issues.
- Nathan also cherry-picked some warning cleanup code.
- Feeling better and better about this, but still more
- PR 1218 MPOOL - rcache. Timeout. RDMACM test.
- platform file revert on 1.10 around RDMACM.
- The iWARP people aren't complaining, and this is their only connection method.
- Howard is not okay with merging yet; he wants a better explanation of why it's hanging.
- Was on 2.x before Request fallout happened.
- On master, it hangs in Finalize waiting for disconnect. 1758.
Review Master MTT testing (https://mtt.open-mpi.org/)
- LANL
- Houston
- IBM
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel
- LANL, Houston, IBM