Code for the paper "Dynamic Flow Scheduling for DNN Training Workloads in Data Centers".
| Dependency | Version |
|---|---|
| OS | Ubuntu-18.04 |
| OS Kernel | Linux 5.4.0-77-generic |
| GCC | gcc 7.5.0 |
| CUDA-Toolkit | cuda 10.2 |
| OpenMPI | openmpi 4.1.1 |
| Horovod | v0.22.0 |
| BytePS | v0.2.4 |
| Python | python3.6 |
| Rust | rustc 1.58.0-nightly |
| NCCL | Customized, based on v2.11.4 |
The software dependencies are listed in the table above. All of them can be downloaded from their official repositories.
See `./proto/nccl_patch` for the NCCL customizations.
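
Before building, it can help to confirm that the installed toolchain matches the table. The following is a minimal sketch, not part of this repository: the `EXPECTED` mapping and both helper functions are hypothetical names, and the commands used to query each tool assume a typical install where `gcc`, `nvcc`, `mpirun`, `rustc`, and `python3` are on `PATH`.

```python
import re
import shutil
import subprocess

# Expected versions copied from the dependency table.
# The query commands are assumptions about a typical install.
EXPECTED = {
    "gcc": (["gcc", "--version"], "7.5.0"),
    "nvcc": (["nvcc", "--version"], "10.2"),
    "mpirun": (["mpirun", "--version"], "4.1.1"),
    "rustc": (["rustc", "--version"], "1.58.0"),
    "python3": (["python3", "--version"], "3.6"),
}


def version_matches(output, expected):
    """Return True if `expected` occurs in `output` as a version prefix.

    "3.6" matches "Python 3.6.9" but not "Python 13.6" or "Python 3.60".
    """
    pattern = r"(?<![\d.])" + re.escape(expected) + r"(?!\d)"
    return re.search(pattern, output) is not None


def check_all():
    """Run each version command and report ok / mismatch / missing."""
    results = {}
    for name, (cmd, expected) in EXPECTED.items():
        if shutil.which(cmd[0]) is None:
            results[name] = "missing"
            continue
        # stdout/stderr pipes instead of capture_output keep this
        # compatible with Python 3.6.
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        output = proc.stdout + proc.stderr
        results[name] = "ok" if version_matches(output, expected) else "mismatch"
    return results


if __name__ == "__main__":
    for tool, status in check_all().items():
        print("%s: %s" % (tool, status))
```

The sketch only compares version strings. It does not check Horovod, BytePS, or whether the patched NCCL is the one being linked.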
Hardware: 6 servers, each with two GTX 1080Ti GPUs.
Build and load the kernel datapath module: `cd ./proto/kernel && make && sudo ./ccp_kernel_load ipc=0`
Run `sh ./proto/rebuild_ccp.sh` to rebuild if the datapath program changes.
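
After loading the module, one way to check that it registered with the TCP stack is to read the kernel's list of available congestion control algorithms. This is a minimal sketch under one assumption that this README does not confirm: that the module registers under the name `ccp`. The function names are hypothetical; the procfs path is the standard Linux one.

```python
from pathlib import Path

# Standard Linux procfs entry listing the registered TCP congestion
# control algorithms as a whitespace-separated string.
AVAILABLE_CC = Path("/proc/sys/net/ipv4/tcp_available_congestion_control")


def is_registered(listing, name="ccp"):
    """Return True if `name` is one of the algorithms in `listing`.

    The default name "ccp" is an assumption; adjust it if the module
    registers under a different name.
    """
    return name in listing.split()


def ccp_loaded(path=AVAILABLE_CC):
    """Read the procfs entry; treat an unreadable file as 'not loaded'."""
    try:
        return is_registered(path.read_text())
    except OSError:
        return False


if __name__ == "__main__":
    print("ccp registered" if ccp_loaded() else "ccp not registered")
```

If the check fails right after a rebuild, reloading the module is the first thing to try.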
On a dedicated server, use `switch.py` to start an emulated switch with RPC services.
On each server, run the distributed agent with `ccp.py`.
Inject workloads with the template scripts in `./proto/workloads`.
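
The launch steps above can be sketched as a dry-run plan that prints one `ssh` command per machine, in the order the steps are listed: the emulated switch first, then one agent per GPU server. Everything specific here is an assumption for illustration: the host names, `REPO_DIR`, and the script paths relative to it are hypothetical, and any arguments or privileges that `switch.py` and `ccp.py` may need are not shown.

```python
import shlex

# Hypothetical values: replace with your own hosts and checkout path.
SWITCH_HOST = "switch-host"
WORKER_HOSTS = ["worker-%d" % i for i in range(1, 7)]  # the 6 GPU servers
REPO_DIR = "~/repo"  # left unquoted so the remote shell expands "~"


def ssh_command(host, script):
    """Build the ssh argv that runs `script` from REPO_DIR on `host`."""
    remote = "cd %s && python3 %s" % (REPO_DIR, shlex.quote(script))
    return ["ssh", host, remote]


def launch_plan(switch_script="switch.py", agent_script="ccp.py"):
    """Return commands in launch order: the switch, then every agent.

    The default script paths are assumptions about where the two
    scripts live relative to REPO_DIR.
    """
    plan = [ssh_command(SWITCH_HOST, switch_script)]
    plan += [ssh_command(host, agent_script) for host in WORKER_HOSTS]
    return plan


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in launch_plan():
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere. Once the paths and arguments are confirmed, each command could be passed to `subprocess.Popen`.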
See `./bbr`, `./reno-cubic`, and `./deepcc` for some of the baseline implementations.