IntelligentDDS/Defrag

Defragmentation Scheduling with Deep Reinforcement Learning in Shared GPU Clusters

DRR is a defragmentation scheduler for shared GPU clusters. It mitigates GPU fragmentation arising from GPU sharing, diverse jobs, and asynchronous lifecycles, improving resource utilization under dynamic scheduling.
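To make the notion of fragmentation concrete, here is a toy sketch. It is not DRR's fragmentation metric; the function name `fits_somewhere` and the capacity values are hypothetical and chosen only for illustration. It shows a cluster whose aggregate free capacity exceeds a job's demand even though no single GPU can host the job.

```python
# Toy illustration only: this is NOT the metric used by DRR.
def fits_somewhere(free_per_gpu, demand):
    """Return True if at least one GPU has enough free capacity for the job."""
    return any(free >= demand for free in free_per_gpu)


# Free fraction left on each of four shared GPUs (hypothetical values).
free_per_gpu = [0.25, 0.25, 0.25, 0.25]
total_free = sum(free_per_gpu)  # one full GPU is free in aggregate
demand = 0.5                    # the job needs half of a single GPU

print("enough capacity in aggregate:", total_free >= demand)
print("fits on some GPU:", fits_somewhere(free_per_gpu, demand))
```

In this example the capacity exists but is scattered across GPUs, so the job cannot be placed until running work is consolidated.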

Overview of DRR

Getting Started

Environment Version

  • Python 3.10

Install dependencies

pip install -r requirements.txt

Run

For a 64-node cluster simulation:

python simulator.py --num-node 64 --interarrival-time 8 --scheduler DRR \
                    --init_dim 3584 --action_space 64 --lr_actor 0.04 --lr_critic 0.02 \
                    --use_imitation True --imitation_loss_weight 0.1 \
                    --use_dynamic_entropy True --beta0 0.04 \
                    --use_attn True \
                    --use_advantage_adjustment 0.6 \
                    --use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5

# Other baseline schedulers
python simulator.py --num-node 64 --interarrival-time 8 --scheduler ElasticFlow
python simulator.py --num-node 64 --interarrival-time 8 --scheduler "R&P"
python simulator.py --num-node 64 --interarrival-time 8 --scheduler FGD
python simulator.py --num-node 64 --interarrival-time 8 --scheduler Hops
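The baseline runs above differ only in the scheduler name, so they can be generated in a loop. The following is a minimal sketch, not part of the repository: `baseline_commands` and `run_all` are hypothetical helpers, and the only flags used are the ones shown in the commands above. Passing each command as an argument list also avoids having to shell-quote `R&P`.

```python
import shlex
import subprocess

# Scheduler names taken from the baseline commands above.
BASELINES = ["ElasticFlow", "R&P", "FGD", "Hops"]


def baseline_commands(num_node, interarrival_time, schedulers=BASELINES):
    """Build one argument list per baseline scheduler (hypothetical helper)."""
    return [
        ["python", "simulator.py",
         "--num-node", str(num_node),
         "--interarrival-time", str(interarrival_time),
         "--scheduler", name]
        for name in schedulers
    ]


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Print the commands for inspection; shlex.join adds shell quoting where needed.
for cmd in baseline_commands(64, 8):
    print(shlex.join(cmd))
```

To launch the simulations, call `run_all(baseline_commands(64, 8))` from the repository root.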

For a 32-node cluster simulation:

python simulator.py --num-node 32 --interarrival-time 16 --scheduler DRR \
                    --init_dim 1792 --action_space 32 --lr_actor 0.03 --lr_critic 0.02 \
                    --use_imitation True --imitation_loss_weight 0.1 \
                    --use_dynamic_entropy True --beta0 0.01 \
                    --use_attn True \
                    --use_advantage_adjustment 0.4 \
                    --use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5

For a 128-node cluster simulation:

python simulator.py --num-node 128 --interarrival-time 3.8 --scheduler DRR \
                    --init_dim 7168 --action_space 128 --lr_actor 0.06 --lr_critic 0.04 \
                    --use_imitation True --imitation_loss_weight 0.1 \
                    --use_dynamic_entropy True --beta0 0.06 \
                    --use_attn True \
                    --use_advantage_adjustment 0.1 \
                    --use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5
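In the three DRR commands above, `--action_space` equals the node count and `--init_dim` is 56 times the node count (3584/64 = 1792/32 = 7168/128 = 56). The sketch below encodes that observed pattern. Whether it holds for other cluster sizes is an assumption, and `FEATURES_PER_NODE` and `derived_dims` are hypothetical names that do not come from the repository.

```python
# Assumption: ratio inferred from the three example commands only
# (3584 / 64 == 1792 / 32 == 7168 / 128 == 56).
FEATURES_PER_NODE = 56


def derived_dims(num_node):
    """Return the --init_dim and --action_space values implied by the examples."""
    return {
        "init_dim": FEATURES_PER_NODE * num_node,
        "action_space": num_node,
    }


for n in (32, 64, 128):
    print(n, derived_dims(n))
```

The other hyperparameters (`--lr_actor`, `--lr_critic`, `--beta0`, `--use_advantage_adjustment`) and `--interarrival-time` follow no simple rule across the three examples, so this sketch does not try to derive them.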

Project Structure

Defrag
├── cluster.py                  # Cluster environment implementation
├── clusterdata                 # Cluster trace data and preprocessing scripts
│   ├── cluster-trace-gpu-v2020
│   │   └── trace.txt
│   ├── filtered_traces.csv
│   ├── mypreprocess.ipynb
│   ├── sampled_traces.csv
│   ├── share_0.2_traces.csv
│   └── share_0.6_traces.csv
├── imgs
│   └── overview.jpg
├── job.py                      # Job representation
├── policy                      # Scheduling policies
│   ├── __init__.py
│   ├── drr.py
│   ├── elasticflow.py
│   ├── fgd.py
│   ├── gpupacking.py
│   ├── hops.py
│   └── policy.py
├── README.md
├── requirements.txt
├── simulator.py                # Main simulation script
└── utils.py                    # Utility functions

Reference

About

Defragmentation scheduler DRR in SoCC'25
