WeeklyTelcon_20160606

Geoff Paulsen edited this page Jun 7, 2016 · 5 revisions

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Arm Patinyasakdikul
  • Nathan
  • Edgar Gabriel
  • Ralph
  • Todd Kordenbrock
  • Geoff Paulsen
  • Howard Pritchard

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
  • Does anyone care about NAG on 1.10? - consensus is no.
  • Nathan: Some threading issue on 1.10 OB1 Pending Progress. There is a leak in RDMA. Hanging in RDMACM.
  • master PR 1758 - with BTL Vader (purposely allocating 4000 fragments), an Isend would go ahead and progress. This is a bug we've had for a long time: the frag list leaks and keeps growing without bound.
    • Vader was the first to hit it because every other BTL's free list max is -1.
    • Need to hit really hard with lots of isends.
  • Door is closing on RHEL 7.3, so we want to get this into 1.10.3.
    • Red Hat / Ubuntu / SLES generally just pick up the latest for their release.
  • Let it go into master tonight and see how it goes before deciding on the risk for 1.10.3.
    • Once it's in master, generate the PR for 1.10 to see how bad the backport is.
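
As a rough illustration of the free-list point above (why a BTL with a finite free list max hits the exhaustion path while one with max -1 just keeps growing), here is a toy model. It is only a sketch under stated assumptions: `FreeList`, `post_sends`, and their parameters are hypothetical names, not Open MPI's free list code, and the 4000 figure is borrowed from the notes purely as an illustrative bound.

```python
class FreeList:
    """Toy free list (hypothetical); max_items == -1 means no upper bound."""

    def __init__(self, max_items=-1):
        self.max_items = max_items
        self.allocated = 0   # total fragments ever created
        self.free = []       # fragments currently available for reuse

    def get(self):
        """Return a fragment, or None if a bounded list is exhausted."""
        if self.free:
            return self.free.pop()
        if self.max_items != -1 and self.allocated >= self.max_items:
            return None      # caller has to queue the send as pending
        self.allocated += 1
        return object()

    def put(self, frag):
        """Return a fragment to the list so it can be reused."""
        self.free.append(frag)


def post_sends(free_list, count, release=False):
    """Post `count` sends; return how many could not get a fragment."""
    pending = 0
    for _ in range(count):
        frag = free_list.get()
        if frag is None:
            pending += 1
        elif release:
            free_list.put(frag)
    return pending
```

In this model, a list created with `FreeList(4000)` starts refusing requests once 4000 fragments are outstanding, so the pending path gets exercised. A list created with the default `-1` never refuses, and its `allocated` count simply keeps climbing if fragments are never returned.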

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
    • With the OB1 fixes, things are looking good. As of now, we haven't had a test that's causing things to blow up.
    • Hitting a lot of subsystems in Open MPI harder than we used to.
    • Threaded tests that we
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
    • Nysal - File open and a couple of others are not multi-thread safe, because we do a lazy open of the framework.
    • PR 1199 - are we expecting more commits?
      • Yes, a couple more; George and Nathan are going back and forth. It has gone into master and fixed the hang on master.
      • OB1 failures.
      • Nvidia issues.
      • Nathan also cherry-picked some warning cleanup code.
    • Feeling better and better about this, but still more
    • PR 1218 MPOOL - rcache. Timeout. RDMACM test.
      • Platform file revert on 1.10 around RDMACM.
      • iWARP people aren't complaining, and this is their only connection method.
      • Howard is not okay with merging yet; he wants a better explanation of why it's hanging.
      • Was on 2.x before the Request fallout happened.
      • On master, hanging in Finalize waiting for disconnect. 1758.
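
The lazy-open concern Nysal raised above is the classic unsynchronized lazy-initialization race: two threads can both see "not yet open" and both run the open. A minimal sketch of one common guard (a lock plus a re-check) follows. `LazyFramework` and `opener` are hypothetical names for illustration only; this is not how Open MPI's framework open is implemented, and the eventual fix may look quite different.

```python
import threading


class LazyFramework:
    """Hypothetical lazily opened resource that is safe to open from many threads."""

    def __init__(self, opener):
        self._opener = opener          # callable that performs the real open
        self._lock = threading.Lock()
        self._handle = None
        self.open_count = 0            # how many times the opener actually ran

    def get(self):
        """Open on first use; later callers reuse the same handle."""
        if self._handle is None:              # fast path, no lock taken
            with self._lock:
                if self._handle is None:      # re-check while holding the lock
                    self.open_count += 1
                    self._handle = self._opener()
        return self._handle
```

Without the lock and the second check, several threads calling `get()` at the same time could each run the opener, which is the kind of failure a lazy open invites in threaded tests.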

Review Master MTT testing (https://mtt.open-mpi.org/)

MTT Dev status:

Status Updates:

  • LANL
  • Houston
  • IBM

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM

Back to 2016 WeeklyTelcon-2016
