WeeklyTelcon_20160517

Geoff Paulsen edited this page May 17, 2016 · 4 revisions

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Brad Benton
  • Howard
  • Josh Hursey
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
  • 1161 - Open IB error path - Gilles asked Mike to review; now in its 2nd iteration.
    • Joshua Ladd tagged on 2.x version.
  • 1150 - 2 places in Init and 1 in Finalize where we do RTE Barrier.
    • If launched with mpirun, it works just fine.
    • But direct launch will hang under Cray or Slurm PMIx, because those have blocking RTE barriers, and those don't progress.
    • Patched it in master with MPI Barrier to make other things progress.
    • Will need to block 2.0.x for this fix also. Ralph will create PR.
  • Once these get in, do another RC and move this out.
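
The hang described under 1150 can be illustrated with a toy model. This is only a sketch and not Open MPI code: `ToyProgressEngine`, `barrier_with_progress`, and `barrier_without_progress` are hypothetical names, and the spin limit exists only so the example terminates. It assumes the barrier release depends on an event that is delivered only when the progress engine is driven, which is the situation the notes describe for blocking RTE barriers.

```python
from collections import deque


class ToyProgressEngine:
    """Hypothetical stand-in for a communication library's progress engine."""

    def __init__(self):
        # Callbacks that only run when progress() is called.
        self.pending = deque()

    def post(self, callback):
        self.pending.append(callback)

    def progress(self):
        """Run one pending callback; return True if something ran."""
        if not self.pending:
            return False
        self.pending.popleft()()
        return True


def barrier_with_progress(engine, is_released, max_spins=1000):
    """Wait for release while driving progress on every spin."""
    for _ in range(max_spins):
        if is_released():
            return True
        engine.progress()
    return is_released()


def barrier_without_progress(engine, is_released, max_spins=1000):
    """Model of a blocking barrier that never drives progress.

    A real blocking barrier would hang here; the toy gives up after
    max_spins and reports failure instead.
    """
    for _ in range(max_spins):
        if is_released():
            return True
    return False


def demo():
    results = {}
    for name, barrier in (("with_progress", barrier_with_progress),
                          ("without_progress", barrier_without_progress)):
        engine = ToyProgressEngine()
        state = {"released": False}
        # The release is queued as an event that needs progress() to fire.
        engine.post(lambda s=state: s.update(released=True))
        results[name] = barrier(engine, lambda s=state: s["released"])
    return results


if __name__ == "__main__":
    print(demo())
```

In this model only the barrier that keeps calling `progress()` completes, which mirrors why a barrier that drives progress was used for the patch in master.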

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
    • PMIx barrier
    • Nathan will review 1164.
    • PR 1673 - The multi-threaded issue that George ran into is a doozie.
      • Free path in C++. One thread was in the dereg hooks in Delete.
      • Another thread was trying to allocate space, triggering internal garbage collection.
      • Classic deadlock.
      • Nathan reworked the rcache / mpool code to not hold lock while doing deletes.
      • All locks are always on in RDMA because no way around it.
      • Last rcache bug: if you had > 100 registrations associated with a memory registration being munmapped, it ran into an infinite loop.
      • Nathan and George testing.
      • IBM will do some multi-threaded testing as well.
    • PowerPC issues as well. Nathan had to revise table a bit.
      • ppc64le, if you do a dlsym, pointer is into table of contents: 1 is real address.
        • problem is TOC is getting patched.
      • when patching, need to patch the real function, not the other.
      • ppc64BE - may still
    • 1162 - multiple threads make same endpoint simultaneously.
      • Nathan thought he handled that case.
    • One thing we forgot to do for 2.0.0rc2: send the announcement to users-alias. Will do for rc3.
      • Put an announcement about the migration guide on the announcement list.
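
The fix noted for PR 1673 (not holding the lock while doing deletes) follows a common pattern that can be sketched as below. This is a toy illustration and not the actual rcache / mpool code: `ToyRegistrationCache`, its methods, and the hook are all hypothetical. Entries are detached from the cache while the lock is held, and the deregistration hooks run only after the lock is released, so a hook that calls back into the cache cannot deadlock on it.

```python
import threading


class ToyRegistrationCache:
    """Hypothetical cache; not Open MPI's rcache."""

    def __init__(self):
        self._lock = threading.Lock()  # deliberately non-reentrant
        self._entries = {}

    def register(self, key, dereg_hook):
        with self._lock:
            self._entries[key] = dereg_hook

    def size(self):
        with self._lock:
            return len(self._entries)

    def evict_all(self):
        # Step 1: detach the entries while holding the lock.
        with self._lock:
            doomed = list(self._entries.items())
            self._entries.clear()
        # Step 2: run the hooks with the lock released, so a hook may
        # safely call back into the cache.
        for key, hook in doomed:
            hook(key)
        return len(doomed)


def demo():
    cache = ToyRegistrationCache()
    seen = []

    def hook(key):
        # Re-enters the cache; this would deadlock if evict_all()
        # still held the lock while calling the hook.
        seen.append((key, cache.size()))

    for key in ("a", "b", "c"):
        cache.register(key, hook)
    evicted = cache.evict_all()
    return evicted, seen


if __name__ == "__main__":
    print(demo())
```

The trade-off is that the hooks observe the cache after the entries have already been removed, which is usually acceptable for cleanup callbacks.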

Review Master MTT testing (https://mtt.open-mpi.org/)

  • IBM trying to ramp up MTT testing. Hopefully will have Power8 XL compiler testing soon.
    • Some issues passing certain flags to XL compilers. Josh Hursey is working on it.
  • Cisco / Intercomm create failures.
  • Getbyte offset test requires v2.0.0 or greater and spins until timeout on 1.10.

MTT Dev status:

Status Updates:


Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM
  3. Cisco, ORNL, UTK, NVIDIA

Back to 2016 WeeklyTelcon-2016