v0.11.0: April 2019 Release
Release v0.11.0
New Features
- Support team-wise NFS storage. Refer to Simplified Job Submission for OpenPAI + NFS deployment for more details.
- New alerts for unhealthy GPUs, currently including the following alerts. #2209
  - GPU used by zombie container
  - GPU used by external process
  - GPU ECC error
  - GPU hangs
  - GPU memory leak
- Admins can see all running jobs on a node. #2197
- Filter support in Job List View. #302
- Keep the environment of failed jobs whose failure is caused by user error. #2272
Improvements
Service
- Webportal:
- Alert-manager: Increase node memory and CPU thresholds to reduce false alerts. #2345, #2296
- Hadoop: Persist YARN and HDFS service logs to host. #2244
- Runtime: Support Samba shares in containers. #2318
Documentation
Examples
- Remove the TensorFlow MPI example, which cannot currently be run. #2337
Others
- Operations: Add a command-line tool to query unhealthy GPU information from Prometheus. #2319
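As a rough illustration of the kind of query such a tool could issue, the sketch below asks the Prometheus HTTP API for currently firing alerts whose name mentions "gpu". It is not the tool added in #2319: the function names, the alert-name pattern, and the Prometheus address are all assumptions made for this example, and the real alert names are whatever the OpenPAI alerting rules define.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical pattern: matches any alert whose name contains "gpu".
# The actual alert names in an OpenPAI deployment may differ.
GPU_ALERT_REGEX = "(?i).*gpu.*"


def build_query_url(prometheus_url, alert_regex=GPU_ALERT_REGEX):
    """Build an instant-query URL for firing alerts matching alert_regex."""
    # ALERTS is the synthetic series Prometheus exposes for pending/firing alerts.
    promql = 'ALERTS{alertstate="firing", alertname=~"%s"}' % alert_regex
    return prometheus_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def parse_alerts(payload):
    """Return sorted (alertname, instance) pairs from an instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %r" % payload.get("error"))
    pairs = []
    for series in payload["data"]["result"]:
        labels = series["metric"]
        pairs.append((labels.get("alertname", ""), labels.get("instance", "")))
    return sorted(pairs)


def query_unhealthy_gpus(prometheus_url):
    """Query a live Prometheus server (address supplied by the caller)."""
    with urlopen(build_query_url(prometheus_url), timeout=10) as resp:
        return parse_alerts(json.load(resp))
```

Calling `query_unhealthy_gpus` with the address of the cluster's Prometheus service would list the matching alerts per node; the parsing is kept separate from the HTTP call so it can be checked against a saved response.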
Notable Fixes
- Hadoop: The scheduler may get stuck in an infinite loop. #2365
- Hadoop: Sometimes, hadoop-ai can't detect ECC errors. #2343
- Runtime: Users might see unallocated GPUs. #2352
- Runtime: Jobs might get a free retry when they exceed their memory limit. #1108
- Drivers: Fix IB installation bugs. #2278, #2271, #2269
Known Issues
- There might be a mismatch between the Linux kernel and the driver. #2446
- The retry link on the new job details page is missing. #2466
Upgrading from Earlier Release
Please follow Upgrading to v0.11 for detailed instructions.