
ZZ Feature Release Plan 2021 09 30


1. Intelligent GPU allocation

  • Team: Yaohui (owner), David
  • Goal: Tailor the Kubernetes scheduler for AI workloads, considering cluster-wide GPU status and historical execution results
  • Functions
    • Determine GPU count, type, location (nodes), etc., for each given job
    • Resource allocation, reclamation, and dynamic scaling (for already running jobs, not initialization; a to-do in the current elastic training framework)
    • Multi-objective: maximize utilization, minimize fragmentation, and mitigate workload interference, covering various allocation scenarios (multi-GPU, single GPU, fractional GPU, GPU+CPU)
    • Refer to historical execution records, using ML to find better solutions
    • Over-allocation/recycling/preemption/re-balancing, hard/soft limits
    • Define metrics to evaluate scheduling results
  • Done
    • Interference analysis on GPU sharing (6 workloads, 3 types of GPUs: Titan X, K80, RTX 2080)
    • Complete interference analysis report, with tentative packing suggestions (GPU utilization no more than 70%) and interference metrics (throughput)
    • Get familiar with scheduling frameworks, e.g. the Yoda scheduler and the Kubernetes scheduling SIG
    • Modify the score plugin with static metrics
    • Integrate the co-scheduling plugin into the default scheduler and deploy it to the cluster environment
  • To-do
    • Integrate the co-scheduling plugin into the default scheduler, and update the elastic framework's scheduler and pod template to enable co-scheduling
    • Integrate the customized score plugin, considering GPU type and affinity in the score function (see the sketch after this list)
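The customized score plugin would sit on the Kubernetes scheduling framework. Below is a minimal sketch, assuming an out-of-tree score plugin written against the (circa-2021) framework interfaces; the plugin name, the gpu.example.com/type label/annotation, and the scoring formula (free nvidia.com/gpu devices plus a bonus for a matching GPU type) are illustrative placeholders, not the team's actual implementation.

```go
// Minimal sketch of a scheduling-framework score plugin. Names, labels, and
// the scoring formula are illustrative placeholders.
package gpuscore

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const Name = "GPUTypeAndFreeScore"

type GPUScore struct {
	handle framework.Handle
}

var _ framework.ScorePlugin = &GPUScore{}

// New is the plugin factory registered with the scheduler.
func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &GPUScore{handle: h}, nil
}

func (g *GPUScore) Name() string { return Name }

// Score favors nodes with more unallocated "nvidia.com/gpu" devices and gives a
// bonus when the node's GPU type matches the type the pod asks for.
func (g *GPUScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := g.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(err)
	}
	const gpuRes = v1.ResourceName("nvidia.com/gpu")
	free := nodeInfo.Allocatable.ScalarResources[gpuRes] - nodeInfo.Requested.ScalarResources[gpuRes]
	if free < 0 {
		free = 0
	}
	score := free
	// Hypothetical annotation/label names for the requested GPU type.
	if want, ok := pod.Annotations["gpu.example.com/type"]; ok &&
		nodeInfo.Node().Labels["gpu.example.com/type"] == want {
		score += 10
	}
	return score, nil
}

func (g *GPUScore) ScoreExtensions() framework.ScoreExtensions { return g }

// NormalizeScore rescales raw scores into [0, MaxNodeScore].
func (g *GPUScore) NormalizeScore(ctx context.Context, state *framework.CycleState, pod *v1.Pod, scores framework.NodeScoreList) *framework.Status {
	var max int64 = 1
	for _, s := range scores {
		if s.Score > max {
			max = s.Score
		}
	}
	for i := range scores {
		scores[i].Score = scores[i].Score * framework.MaxNodeScore / max
	}
	return nil
}
```

A plugin like this is compiled into a scheduler binary (e.g. via app.WithPlugin) and enabled in a KubeSchedulerConfiguration profile, which is also how the co-scheduling plugin is enabled.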
  • Status
    • Investigation & Design (100%) --> Development & Test (50%)
  • Questions
    • Do we extend the Kubernetes scheduler, or just modify the pod YAML (simpler)? A: There are two layers of scheduling: the allocator on top and the Kubernetes default scheduler at the bottom. Work mainly on the bottom layer.
    • Job Migration (3rd release)
    • GPU resolution (1)
    • Notes: scheduling policy conflicts (e.g. spread vs. minimize resource fragmentation)

2. Workload-oriented profiling

  • Team: Zhaobo (owner), Ziyu

  • Goal:

    • Collect workload resource utilization data and reflect the cluster's resource availability, so the scheduler can make decisions based on availability
    • Collect workload behavior data and infer workload type and completion time, so the scheduler can make decisions based on similarity and feedback (previous execution efficiency)
  • Functions

    • Report GPU attributes and utilization on each node (see the sketch after this list)
    • Monitor DLT job behavior, report resource utilization and workload characteristics for each job, and store job execution data (start/completion time, number of GPUs used, etc.)
    • Provide a dry-run function for elastic Horovod jobs and store structured execution results under various configurations
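As one possible shape for the per-node reporting function, the sketch below reads GPU attributes, utilization, and running compute processes through NVIDIA's go-nvml bindings; it only prints the values, whereas the real reporter would export them to the metrics pipeline.

```go
// Minimal sketch of per-node GPU reporting with NVIDIA's go-nvml bindings.
// Prints to stdout only; a real reporter would export these values.
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("NVML init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("device count failed: %v", nvml.ErrorString(ret))
	}

	for i := 0; i < count; i++ {
		dev, _ := nvml.DeviceGetHandleByIndex(i)
		name, _ := dev.GetName()                     // GPU attribute: model name
		mem, _ := dev.GetMemoryInfo()                // total/used device memory
		util, _ := dev.GetUtilizationRates()         // SM and memory utilization (%)
		procs, _ := dev.GetComputeRunningProcesses() // PIDs to associate with pods

		fmt.Printf("gpu=%d name=%s mem_used=%d/%d util_gpu=%d%% util_mem=%d%% procs=%d\n",
			i, name, mem.Used, mem.Total, util.Gpu, util.Memory, len(procs))
	}
}
```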
  • Done

    • Explore metrics collection libraries (cAdvisor, NVML, metrics-server, nvprof, DCGM)
    • Identify metrics that represent workload characteristics (CPU/GPU/memory/memcpy utilization, disk I/O, network bandwidth; CUDA API call distribution, input data location/size, model type, weight size)
    • Associate GPU process IDs with k8s pod names, and annotate pods with "patch_namespaced_pod" (see the sketch after this list)
    • Patch GPU mem_used onto the pod, and remove the annotation after the job (GPU processes) is done
    • Dry-run (trial job) CRD and controller design
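The annotation step above uses the Python client's patch_namespaced_pod; the sketch below shows an equivalent strategic-merge patch with client-go, as one hedged illustration. The annotation key profiler.example.com/gpu-mem-used and the namespace/pod names are hypothetical placeholders; in the real agent they would come from mapping the GPU process PID to its owning pod (e.g. via the pod's cgroup) and from NVML's per-process memory info.

```go
// Sketch of annotating a pod with its GPU memory usage via client-go, the
// counterpart of the Python client's patch_namespaced_pod. Names are placeholders.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// annotateGPUMem patches the pod that owns a GPU process with its mem_used bytes.
func annotateGPUMem(ctx context.Context, cs kubernetes.Interface, ns, pod string, memUsed uint64) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"profiler.example.com/gpu-mem-used":"%d"}}}`, memUsed))
	_, err := cs.CoreV1().Pods(ns).Patch(ctx, pod, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes the profiler agent runs inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// Placeholder namespace/pod/value; the real agent derives these from the GPU process.
	if err := annotateGPUMem(context.Background(), cs, "default", "example-dlt-pod", 1<<30); err != nil {
		log.Fatal(err)
	}
}
```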
  • To-do

    • Aggregate pod-level metrics to the owner job, add DLT job annotation tagging (multiple components), and store results (in Prometheus; see the sketch after this list)
    • Complete the profiling trial job controller implementation
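For the results-storing item, a minimal sketch of exposing aggregated job-level metrics for Prometheus to scrape is shown below; the metric name dlt_job_gpu_utilization and its labels are illustrative, not an agreed schema.

```go
// Sketch of exposing job-level (aggregated) metrics for Prometheus to scrape.
// Metric and label names are illustrative placeholders.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// jobGPUUtil holds per-DLT-job GPU utilization, summed over the job's pods.
var jobGPUUtil = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "dlt_job_gpu_utilization",
		Help: "GPU utilization aggregated over all pods owned by a DLT job.",
	},
	[]string{"job", "namespace"},
)

func main() {
	prometheus.MustRegister(jobGPUUtil)

	// In the real profiler, pod-level samples would be grouped by the owner job
	// (found via ownerReferences or job annotations) before being set here.
	jobGPUUtil.WithLabelValues("resnet50-elastic", "default").Set(172.5)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9101", nil))
}
```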
  • Future

    • Characterize jobs (infer job type, completion time, current iteration, execution stage, performance bottleneck, co-location, interference impact)
  • Status

    • Investigation & Design (100%) --> Development & Test (50%)
  • Questions

    • Dry-run control before scheduling: how much extra time is allowed? (Run offline to collect training data, or leave it as a customer choice; an optional function that can be switched on/off)
    • Which k8s resource is used to run DLT: Pod, Job, or CRD? A: CRD, a modified Job
    • Do we assume the logs of training jobs are accessible (current step and epoch, current accuracy/loss)? A: Yes

3. GPU sharing and isolation

  • Team: Hao (owner)

  • Goal: Limit memory and threads (computing resources) for container processes using two solutions, CUDA API interception and MPS; stretch goal: a complete Kubernetes device plugin that supports GPU sharing

  • Functions

    • CUDA API wrapper to intercept memory allocation/de-allocation and kernel launch functions
    • Utilization monitoring; grant or reject resource requests from processes
    • Compute resource isolation for AI inference workloads, fair sharing
    • MPS solution: configure limits, investigate and mitigate interference
  • Done

    • Explore Gaia and KubeShare, two GPU sharing implementations, and their methods and principles
    • Complete a smoke test using LD_PRELOAD; solve the segmentation fault caused by missing symbols
    • Demo CUDA API interception with a preloaded .so library in a Docker container; the .so library is mounted from the host into the container
    • Design the components and data flow for the K8s GPU plugin to support GPU virtualization, including the intercept library and the vGPU registration server
    • Implement a TCP-based registration server to support information sharing between the host and its containers (local communication; see the sketch after this list)
    • Explore how the CUDA driver API is used across different ML frameworks
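A minimal sketch of the host-side registration server is below, assuming a simple line-based "pid,mem_limit_bytes" request format; the port and wire format are placeholders, and in practice the server would need to be reachable from containers (e.g. via host networking or a mounted socket).

```go
// Minimal sketch of a host-side TCP registration server: containers' intercept
// libraries connect and register their PID and requested GPU memory limit.
// The "pid,mem_limit_bytes" line format is an assumed placeholder.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"strings"
	"sync"
)

var (
	mu     sync.Mutex
	limits = map[string]string{} // pid -> requested GPU memory limit (bytes)
)

func handle(conn net.Conn) {
	defer conn.Close()
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return
	}
	parts := strings.SplitN(strings.TrimSpace(line), ",", 2)
	if len(parts) != 2 {
		fmt.Fprintln(conn, "ERR bad request")
		return
	}
	mu.Lock()
	limits[parts[0]] = parts[1] // record the limit so the host side can track/enforce it
	mu.Unlock()
	fmt.Fprintln(conn, "OK") // acknowledge registration back to the container
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:5555") // local communication only
	if err != nil {
		log.Fatal(err)
	}
	log.Println("registration server listening on", ln.Addr())
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go handle(conn)
	}
}
```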
  • To-do

    • Complete the intercept library functions to constrain a container's GPU memory and compute resource usage in a couple of typical scenarios
    • Test the accuracy and stability of the intercept library, e.g. whether the container's GPU utilization is indeed within the requested range (50%)
  • Future

    • Complete the entire GPU plugin to support fractional GPU requests
    • NVIDIA kernel-space reverse engineering
  • Status

    • Investigation & Design (70%) --> Development & Test (20%)
  • Questions

    • LD_PRELOAD limitation
    • Differences in CUDA calls between PyTorch and TensorFlow (unclear use of the CUDA runtime library)
    • Whether we need to wrap all APIs
    • What is the ML framework's response when the system/CUDA rejects a memory allocation (limit reached)? Can the training job keep running with limited memory?