This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

v0.11.0: April 2019 Release

@mzmssg mzmssg released this 03 Apr 04:41
4d63434

Release v0.11.0

New Features

  • Support team-wise NFS storage, including:

    • An NFS configuration plug-in and a command-line tool. #2346
    • A simple NFS job submission plug-in. #2358

    Refer to Simplified Job Submission for OpenPAI + NFS deployment for more details.

  • New alerts for unhealthy GPUs (#2209), currently including:

    • GPU used by a zombie container
    • GPU used by an external process
    • GPU ECC error
    • GPU hangs
    • GPU memory leak

  • Admins can see all running jobs on a node. #2197

  • Filter support in the job list view. #302

  • Hold the environment for failed jobs that are caused by user error. #2272

Improvements

Service

  • Webportal:

    • New job list page look and feel. #302
    • New job detail page. #2211
  • Alert-manager:
    Increase node memory and CPU thresholds to reduce false alerts. #2345, #2296

  • Hadoop:
    Persist YARN and HDFS service logs to the host. #2244

  • Runtime:
    Support Samba shares in containers. #2318

Documentation

  • Add a troubleshooting guide for jobs. #2305
  • Refine the documentation for new users submitting jobs. #2278

Examples

  • Remove the TensorFlow MPI example, which cannot currently be run. #2337

Others

  • Operations:
    Add a command-line tool to query unhealthy GPU information from Prometheus. #2319
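
    As a rough illustration of what such a query involves (this is not the tool added in #2319), the sketch below asks the Prometheus HTTP API for firing alerts and keeps the GPU-related ones. It assumes that Prometheus exposes firing alerts through its built-in ALERTS series and that GPU alert names contain "gpu"; the function names and the name filter are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def firing_gpu_alerts(payload):
    """Return sorted (alertname, instance) pairs for GPU-related alerts."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % payload.get("error"))
    pairs = []
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        name = labels.get("alertname", "")
        # Assumption: GPU alert names contain "gpu"; adjust to the deployed rules.
        if "gpu" in name.lower():
            pairs.append((name, labels.get("instance", "unknown")))
    return sorted(pairs)


def query_unhealthy_gpus(base_url):
    """Fetch firing alerts from Prometheus and keep the GPU-related ones."""
    url = build_query_url(base_url, 'ALERTS{alertstate="firing"}')
    with urllib.request.urlopen(url, timeout=10) as resp:
        return firing_gpu_alerts(json.load(resp))
```

    Calling query_unhealthy_gpus("http://<prometheus-host>:9090") would return the matching alerts for a Prometheus server at that address; the host is a placeholder, and 9090 is only Prometheus's default port.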

Notable Fixes

  • Hadoop: The scheduler may get stuck in an indefinite loop. #2365
  • Hadoop: Sometimes, hadoop-ai can't detect ECC errors. #2343
  • Runtime: Users might see unallocated GPUs. #2352
  • Runtime: Jobs might get a free retry when they exceed their memory limit. #1108
  • Drivers: Fix IB installation bugs. #2278, #2271, #2269

Known Issues

  • There might be a mismatch between the Linux kernel and the driver. #2446
  • The retry link on the new job details page is missing. #2466

Upgrading from Earlier Releases

Please follow Upgrading to v0.11 for detailed instructions.