
WeeklyTelcon_20160510

Geoff Paulsen edited this page May 10, 2016 · 7 revisions

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Brad Benton
  • George
  • Howard
  • Josh Hursey
  • Joshua Ladd
  • Ralph Castain
  • Geoff Paulsen
  • Ryan Grant
  • Todd Kordenbrock
  • Sylvain Jeaugey

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
    • PMI Barrier - 2 PRs waiting for verification.
      • When launched by SLURM, use PMIx ModeX and Blocked Opal Progress.
      • Need Howard or Nathan to verify these two.
    • A bunch of hangs in the 1.10 series, but no one can replicate them by hand.
    • Possibly MTT induced? In some cases it looks like the app is not hung, but MTT timed out.
    • George identified a Blocker C++ hang issue.
  • Schedule? Maybe another RC at the end of next week.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
    • 1663: hwloc fix will go in after the call.
    • Ralph will fix the configure logic around external PMIx.
      • If the user asks for external PMIx but configure can't find it, configure doesn't fail, and things could break at runtime.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
    • Looking pretty good, until Paul found a bunch of obscure things.
      • Most of them are either fixed or have issues or PRs open to fix them.
    • Nathan has one: can't clobber EBX. Would hope the compiler would put a store/restore around it.
      • Fix against master is in.
    • 32-bit PowerPC issue in hook.
    • PR 1129 - Ralph pulled the fix, waiting for Paul's test result.
    • PR 1133 - trivial.
    • PR 1134 - OMPIO comp on NetBSD - Paul queued tests.
    • Howard queued up some README changes.
    • PR 1051 - marked for 2.1, but is annoying; would like to pull it back to 2.0.
      • Howard is okay.
    • Nathan kinda wants PR 1127 in v2.x - OSC correctness fixes. Fixes map-by node for Graph500.
      • Important for Mellanox, v2.0.0
      • Howard is concerned about the change churn to put this into v2.0.0, and would prefer this in v2.0.1
    • master PR 1617 - hcoll, hang in Finalize with srun - Mellanox would prefer v2.0.0
      • Fix is on 1.10, but not on master or 2.x; no PR opened for v2.x yet (as of today).

Master PRs

  • File-get-byte-offset - Edgar (not here), Jeff will ask about progress.
  • coll tuned, two proc errors

v2.0 Migration Guide

  • Discussion:
    • What "gotchas" do we need to communicate to users? I.e., what will people upgrading from v1.8.x/v1.10.x be surprised by?
    • Want it to be googlable.
    • Two or three paragraphs on the biggest changes.
    • Removed support.
    • We need to collectively edit on wiki, and then we'll put it up on the open-mpi website.
    • New OSHMEM interfaces added, but not implemented until 2.1.
      • Biggest change is job launch / stuff to support (Josh)
      • PMI support changed, it's a framework now, expect orte_info components.
    • New RMA capabilities (Nathan)
    • Two-minute blurbs, not too much detail here.
    • Work on this over the next couple of days.

Jenkins on Master

  • Jenkins is having problems; one is induced by Ralph.
    • Ralph needs help from Josh Hursey or Josh Ladd.
    • Env variable forwarding.

Review Master MTT testing (https://mtt.open-mpi.org/)

  • min-dist mapper test failing. Jeff opened Issue 1623.

    • PMIx external seems like a red herring.
    • hwloc was upgraded.
  • static build issue because MPIR_ symbols in wrong place, so ORTE

  • IBM would like an explicit declaration of the license under which the website / documentation is available.

    • No objections.
    • IBM will file a pull request, and email devel for more discussion.

MTT Dev status:

  • Some Discussion on MTT Timeouts
    1. Issue is that if MTT Timeout happens during timeout, it looks like a timeout, rather than a success.
    2. Josh is considering adding some functionality to grab stack traces on hang.
    3. Geoff mentioned that a feature from Platform-MPI could possibly be added.
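
To illustrate the timeout discussion above, here is a minimal Python sketch of a harness that records a timeout as its own outcome, separate from pass and fail. This is not MTT's implementation; the function name `run_with_timeout` and the result fields are hypothetical, chosen only for illustration. Stack-trace capture on hang is left out because it depends on an external debugger.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the result as 'passed', 'failed', or 'timed_out'.

    Hypothetical helper for illustration only; not part of MTT.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired,
        # so the process is gone by the time we get here.
        return {
            "status": "timed_out",
            "returncode": None,
            "elapsed_s": time.monotonic() - start,
        }
    status = "passed" if proc.returncode == 0 else "failed"
    return {
        "status": status,
        "returncode": proc.returncode,
        "elapsed_s": time.monotonic() - start,
    }


if __name__ == "__main__":
    quick = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30)
    slow = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=1
    )
    print(quick["status"], slow["status"])
```

Keeping the elapsed time next to the status makes it easier to tell a test that ran right up to the limit from one that was genuinely hung.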

Status Updates:


Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM
  3. Cisco, ORNL, UTK, NVIDIA

Back to 2016 WeeklyTelcon-2016
