This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

v0.11.0: April 2019 Release

mzmssg released this 03 Apr 04:41

Release v0.11.0

New Features

Support team-wise NFS storage, including:
- An NFS configuration plug-in and a command-line tool. #2346
- A simple NFS-job submit plug-in. #2358
Refer to Simplified Job Submission for OpenPAI + NFS deployment for more details.
New alerts for unhealthy GPUs (#2209), currently including:
- GPU used by a zombie container
- GPU used by an external process
- GPU ECC error
- GPU hangs
- GPU memory leak
Admins can see all running jobs on a node. #2197
Filter support in the Job List View. #302
Keep the environment of failed jobs whose failure was caused by user error. #2272

Improvements

Service

Webportal:
- New job list page look and feel. #302
- New job detail page. #2211
Alert-manager:
- Increase node memory and CPU thresholds to reduce false alerts. #2345, #2296
Hadoop:
- Persist YARN and HDFS service logs to the host. #2244
Runtime:
- Support Samba shares in containers. #2318

Documentation

Add a troubleshooting guide for jobs. #2305
Refine the documentation for new users on how to submit a job. #2278

Examples

Remove the TensorFlow MPI example, which currently cannot be run. #2337

Others

Operations:
Add a command-line tool to query unhealthy GPU information from Prometheus. #2319
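For readers who want a feel for what such a query involves, here is a minimal Python sketch. It is not the tool added in #2319. It assumes the GPU alerts are visible through Prometheus's built-in `ALERTS` series and the standard `/api/v1/query` HTTP endpoint. The alert names (`GpuEccError`, `GpuHangs`), the host name, and all function names are hypothetical placeholders; substitute the alert names defined in your own deployment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, alert_names):
    """Build an instant-query URL that selects firing alerts by name."""
    # ALERTS is the synthetic series Prometheus exposes for pending and
    # firing alerts. The alert names passed in are assumptions, not the
    # names OpenPAI actually uses.
    selector = 'ALERTS{alertstate="firing",alertname=~"%s"}' % "|".join(alert_names)
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": selector})


def parse_unhealthy(response_body):
    """Return sorted (alertname, instance) pairs from a query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %r" % payload.get("error"))
    return sorted(
        (series["metric"].get("alertname", ""), series["metric"].get("instance", ""))
        for series in payload["data"]["result"]
    )


def query_unhealthy_gpus(base_url, alert_names, timeout=10):
    """Send the query to a live Prometheus server and parse the reply."""
    with urlopen(build_query_url(base_url, alert_names), timeout=timeout) as resp:
        return parse_unhealthy(resp.read().decode("utf-8"))
```

Calling `query_unhealthy_gpus("http://prometheus.example:9090", ["GpuEccError", "GpuHangs"])` against a reachable server would return one pair per firing alert. URL construction and response parsing are kept separate from the network call so each can be checked without a running Prometheus instance.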

Notable Fixes

Hadoop: The scheduler may get stuck in an indefinite loop. #2365
Hadoop: Sometimes, hadoop-ai can't detect ECC errors. #2343
Runtime: Users might see unallocated GPUs. #2352
Runtime: Jobs might get a free retry when they exceed their memory limit. #1108
Drivers: Fix IB installation bugs. #2278, #2271, #2269

Known Issues

There might be a mismatch between the Linux kernel and the driver. #2446
The retry link on the new job details page is missing. #2466

Upgrading from Earlier Release

Please follow Upgrading to v0.11 for detailed instructions.
