Defragmentation Scheduling with Deep Reinforcement Learning in Shared GPU Clusters

DRR is a defragmentation scheduler for shared GPU clusters, based on deep reinforcement learning. It mitigates the GPU fragmentation caused by GPU sharing, diverse jobs, and asynchronous job lifecycles, and so improves resource utilization under dynamic scheduling.
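To make the fragmentation problem concrete, here is a toy illustration. It is not DRR's fragmentation metric or any code from this repository; the function name `can_place` and the numbers are hypothetical. It shows a node whose total free GPU capacity would cover a job, even though no single GPU can host it.

```python
def can_place(free_per_gpu, demand):
    """Return True if at least one GPU has enough free capacity for the job."""
    return any(free >= demand for free in free_per_gpu)


# Hypothetical node with four shared GPUs; values are free capacity in percent.
fragmented = [30, 20, 30, 20]
demand = 50  # a job that needs half of one GPU

# Total free capacity (100%) is twice the demand, yet the job cannot be placed.
print(sum(fragmented), can_place(fragmented, demand))

# Same total free capacity after packing the running jobs onto fewer GPUs.
consolidated = [50, 50, 0, 0]
print(sum(consolidated), can_place(consolidated, demand))
```

The second layout has the same amount of free capacity, but it sits in usable chunks. Recovering that kind of layout is what a defragmentation scheduler aims for.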

Getting Started

Environment Version

Python 3.10

Install dependencies

pip install -r requirements.txt

Run

For a 64-node cluster simulation:

python simulator.py --num-node 64 --interarrival-time 8 --scheduler DRR \
                    --init_dim 3584 --action_space 64 --lr_actor 0.04 --lr_critic 0.02 \
                    --use_imitation True --imitation_loss_weight 0.1 \
                    --use_dynamic_entropy True --beta0 0.04 \
                    --use_attn True \
                    --use_advantage_adjustment 0.6 \
                    --use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5

# Other baseline schedulers
python simulator.py --num-node 64 --interarrival-time 8 --scheduler ElasticFlow
python simulator.py --num-node 64 --interarrival-time 8 --scheduler "R&P"
python simulator.py --num-node 64 --interarrival-time 8 --scheduler FGD
python simulator.py --num-node 64 --interarrival-time 8 --scheduler Hops
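The four baseline runs above differ only in the `--scheduler` value, so they can be driven from a small script. The sketch below is a convenience wrapper, not part of the repository: the names `BASELINES`, `baseline_command`, and `run_baselines` are hypothetical, and it uses only the flags shown in the commands above. Passing arguments as a list also avoids the shell quoting that `R&P` needs on the command line.

```python
import shlex
import subprocess
import sys

# Scheduler names taken from the baseline commands above.
BASELINES = ["ElasticFlow", "R&P", "FGD", "Hops"]


def baseline_command(scheduler, num_node=64, interarrival_time=8):
    """Build the argument list for one baseline run (hypothetical helper)."""
    return [
        sys.executable, "simulator.py",
        "--num-node", str(num_node),
        "--interarrival-time", str(interarrival_time),
        "--scheduler", scheduler,
    ]


def run_baselines(dry_run=True):
    """Print each baseline command and run it unless dry_run is set."""
    for name in BASELINES:
        cmd = baseline_command(name)
        print(shlex.join(cmd))
        if not dry_run:
            # Run from the repository root so simulator.py is found.
            subprocess.run(cmd, check=True)


run_baselines(dry_run=True)
```

Call `run_baselines(dry_run=False)` from the repository root to actually launch the simulations one after another.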

For a 32-node cluster simulation:

python simulator.py --num-node 32 --interarrival-time 16 --scheduler DRR \
                    --init_dim 1792 --action_space 32 --lr_actor 0.03 --lr_critic 0.02 \
                    --use_imitation True --imitation_loss_weight 0.1 \
                    --use_dynamic_entropy True --beta0 0.01 \
                    --use_attn True \
                    --use_advantage_adjustment 0.4 \
                    --use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5

For a 128-node cluster simulation:

python simulator.py --num-node 128 --interarrival-time 3.8 --scheduler DRR \
                    --init_dim 7168 --action_space 128 --lr_actor 0.06 --lr_critic 0.04 \
                    --use_imitation True --imitation_loss_weight 0.1 \
                    --use_dynamic_entropy True --beta0 0.06 \
                    --use_attn True \
                    --use_advantage_adjustment 0.1 \
                    --use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5
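In the three DRR commands above, `--action_space` equals the node count and `--init_dim` is 56 times the node count (1792/32 = 3584/64 = 7168/128 = 56). The sketch below encodes that observed pattern. The constant `FEATURES_PER_NODE` and the helper `drr_size_flags` are hypothetical names, and the README does not say whether the pattern holds for other cluster sizes, so treat this as an assumption drawn from the three examples.

```python
# Assumption inferred from the three example commands, not documented behavior:
# init_dim / num_node is 56 in every case.
FEATURES_PER_NODE = 56


def drr_size_flags(num_node):
    """Return the size-dependent DRR flags for a cluster of num_node nodes."""
    return {
        "--init_dim": num_node * FEATURES_PER_NODE,
        "--action_space": num_node,
    }


for n in (32, 64, 128):
    print(n, drr_size_flags(n))
```

The other settings (`--interarrival-time`, `--lr_actor`, `--lr_critic`, `--beta0`, `--use_advantage_adjustment`) follow no such simple rule across the three configurations, so copy them from the matching command rather than extrapolating.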

Project Structure

Defrag
├── cluster.py                  # Cluster environment implementation
├── clusterdata                 # Cluster trace data and preprocessing scripts
│   ├── cluster-trace-gpu-v2020
│   │   └── trace.txt
│   ├── filtered_traces.csv
│   ├── mypreprocess.ipynb
│   ├── sampled_traces.csv
│   ├── share_0.2_traces.csv
│   └── share_0.6_traces.csv
├── imgs
│   └── overview.jpg
├── job.py                      # Job representation
├── policy                      # Scheduling policies
│   ├── __init__.py
│   ├── drr.py
│   ├── elasticflow.py
│   ├── fgd.py
│   ├── gpupacking.py
│   ├── hops.py
│   └── policy.py
├── README.md
├── requirements.txt
├── simulator.py                 # Main simulation script
└── utils.py                     # Utility functions

IntelligentDDS/Defrag

Folders and files

Latest commit

History

Repository files navigation

Defragmentation Scheduling with Deep Reinforcement Learning in Shared GPU Clusters

Getting Started

Environment Version

Install dependencies

Run

Project Structure

Reference

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages