WeeklyTelcon_20160510
- Dialup Info: (Do not post to public mailing list or public wiki)
- Jeff Squyres
- Brad Benton
- George
- Howard
- Josh Hursey
- Joshua Ladd
- Ralph Castain
- Geoff Paulsen
- Ryan Grant
- Todd Kordenbrock
- Sylvain Jeaugey
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
- PMI Barrier - 2 PRs waiting for verification.
- When launched by SLURM, use PMIx ModeX and Blocked Opal Progress.
- Need Howard or Nathan to verify these two.
- A bunch of hangs in the 1.10 series, but no one can replicate them by hand.
- Possibly MTT-induced? In some cases it looks like the app is not hung, but MTT timed out.
- George identified a Blocker C++ hang issue.
- Schedule? Maybe another RC at the end of next week.
- Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
- Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- 1663: hwloc fix will go in after the call.
- Ralph will fix the configure logic around external PMIx.
- If the user asks for an external PMIx but configure can't find it, configure doesn't fail, but things could break at runtime.
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
- Looking pretty good, until Paul found a bunch of obscure things.
- Most of them either have fixes, or have issues or PRs open to fix them.
- Nathan has one: can't clobber EBX; would hope the compiler would put a store/restore around it.
- Fix against master is in.
- 32bit powerpc issue in hook.
- PR 1129 - Ralph pulled the fix, waiting for Paul's test result.
- PR 1133 - trivial.
- PR 1134 - OMPIO comp on NetBSD - Paul queued tests.
- Howard queued up some README changes.
- PR 1051 - marked for 2.1, but is annoying, would like to pull back to 2.0
- Howard is okay.
- Nathan kinda wants PR1127 in v2.x - OSC correctness fixes. Fixes map-by node for Graph500.
- Important for Mellanox, v2.0.0
- Howard is concerned about the change churn to put this into v2.0.0, and would prefer this in v2.0.1
- master PR1617 - hcoll, hang in Finalize with srun - Mellanox would prefer v2.0.0
- Fix on 1.10, but not on master or 2.x, but haven't opened PR for v2.x yet (today).
- File-get-byte-offset - Edgar (not here), Jeff will ask about progress.
- coll tuned, two proc errors
- Discussion:
- What "gotchas" do we need to communicate to users? I.e., what will people upgrading from v1.8.x/v1.10.x be surprised by?
- PR 1631 -
- The most obvious one I can think of is mpirun requiring -np when slots are not specified somehow.
- OMPIO is not default for Lustre - Edgar will write up a blurb.
- Discuss version numbering change.
- What else do we need to communicate? It would be nice to avoid the confusion users experienced regarding affinity functionality/options when upgrading from v1.6 -> v1.8 (because we didn't communicate these changes well, IMHO).
- https://github.com/open-mpi/ompi/wiki/User-Migration-Guide:-1.8.x-and-v1.10.x-to-v2.0.0
- https://github.com/open-mpi/ompi/wiki/Developer-Migration-Guide:-v1.8.x-and-v1.10.x-to-v2.x
- Want it to be googleable.
- A couple of paragraphs or 3 on biggest changes.
- Removed support.
- We need to collectively edit on wiki, and then we'll put it up on the open-mpi website.
- New OSHMEM interfaces added, but not implemented until 2.1.
- Biggest change is job launch / stuff to support (Josh)
- PMI support changed, it's a framework now, expect orte_info components.
- New RMA capabilities (Nathan)
- Two-minute blurbs, not too much detail here.
- Work on this over the next couple of days.
- Jenkins is having problems; one is induced by Ralph.
- Ralph needs help from Josh Hursey or Josh Ladd.
- Env variable forwarding.
Review Master MTT testing (https://mtt.open-mpi.org/)
- min-dist mapper test failing. Jeff opened Issue 1623.
- PMIx external seems like a red herring.
- hwloc was upgraded.
- static build issue because MPIR_ symbols in wrong place, so ORTE
- IBM would like an explicit declaration of the license under which the website / documentation is available.
- no objections.
- IBM will file a pull request, and email devel for more discussion.
- Some Discussion on MTT Timeouts
- Issue is that if MTT Timeout happens during timeout, it looks like a timeout, rather than a success.
- Josh is considering adding some additional functionality to grab stack traces on hang.
- Geoff mentioned a possible feature in Platform-MPI could be added,
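As a rough illustration of the stack-traces-on-hang idea (a sketch only, not how MTT is implemented), a harness could report a timeout as its own outcome and run a diagnostic hook on the still-running process before killing it. The names `run_with_timeout` and `gdb_backtrace` are hypothetical, and the gdb call assumes gdb is installed and permitted to attach to the process.

```python
import subprocess


def gdb_backtrace(pid):
    """Hypothetical hook: ask gdb for a backtrace of every thread in `pid`.

    Assumes gdb is installed and allowed to attach; returns a short note otherwise.
    """
    try:
        result = subprocess.run(
            ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"],
            capture_output=True, text=True, timeout=60,
        )
        return result.stdout
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        return f"could not collect backtrace: {exc}"


def run_with_timeout(cmd, timeout_s, on_hang=None):
    """Run `cmd`; report 'passed', 'failed', or 'timed_out' as distinct outcomes."""
    proc = subprocess.Popen(cmd)
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Collect diagnostics while the process is still alive, then clean up.
        diagnostics = on_hang(proc.pid) if on_hang else None
        proc.kill()
        proc.wait()
        return {"status": "timed_out", "returncode": None,
                "diagnostics": diagnostics}
    status = "passed" if returncode == 0 else "failed"
    return {"status": status, "returncode": returncode, "diagnostics": None}
```

A caller would pass something like `on_hang=gdb_backtrace`, so the saved output can help tell a genuinely hung app from one that was merely slow when the timeout fired.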
- Mellanox, Sandia, Intel
- LANL, Houston, IBM
- Cisco, ORNL, UTK, NVIDIA