Artifacts Evaluation of SDPipe [VLDB'23]

This code base is used for Artifact Evaluation of VLDB'23 paper: "SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training".

Installation

Clone the repository.
Prepare the environment. We use Anaconda to manage packages. The following command create the conda environment to be used:conda env create -f environment.yml. The environment requires Cuda toolkit version 10.2.
We use CMake to compile the system. Please copy the example configuration for compilation by cp cmake/config.example.cmake cmake/config.cmake. If your system is using another version of NCCL/cudnn/MPI, you should manually change the path defined in this config file.

# compile
# make all
mkdir build && cd build && cmake ..
make -j

Run conda activate pipe and prepare PYTHONPATH by executing the command source hetu.exp at the root folder.
Run python and try import hetu to check whether the compilation is successful.

Prepare datasets

The imagenet dataset is used for resnet50 model training. It can be downloaded at https://www.image-net.org/download.php. The downloaded dataset is in pytorch image folder format. It has a TRAIN folder and a EVAL folder, each folder has 1000 sub folders (the 1000 classes) in them. Please set the IMAGENET_TRAIN_ROOT and IMAGENET_VAL_ROOT variable in the pipeline/datasets/imagenet.py to your imagenet TRAIN and EVAL folder.

We use the wikipedia dataset for bert pretraining. We use the dataset processing method used in NVIDIA/DeepLearningExamples (commit ddbcd5). In our experiments, we used wiki (not wiki+bookcorpus) and use 128 as the sequence length. Please follow their steps to generate the dataset. You will finally get some hdf5 training files (we don't need the eval files). Put the hdf5 files in a folder and set the directory variable in pipeline/models/wiki.py to point to the folder where hdf5 files are stored.

Config distributed training

Our system can start multiple pipelines on different machines. If you have two machine named w0 and w1, each machine has 8 GPUs, then you should write a configuration file like the one in pipeline/config/w0_w1.yml. It tells how many worker process and server process will be launched during the training process.

You will also need to specify how the pipeline is placed on these GPUs. We provide how you can do this in pipeline/device_list.py.

pipe_4_4 = [
    [rgpu("w0", 0), rgpu("w0", 4), rgpu("w1", 0), rgpu("w1", 4)],
    [rgpu("w0", 1), rgpu("w0", 5), rgpu("w1", 1), rgpu("w1", 5)],
    [rgpu("w0", 2), rgpu("w0", 6), rgpu("w1", 2), rgpu("w1", 6)],
    [rgpu("w0", 3), rgpu("w0", 7), rgpu("w1", 3), rgpu("w1", 7)],
]
my_device = pipe_4_4 # exported

In this case there are 4 pipelines in the 2 machine with 16 GPUs, 0-1-2-3 GPU in machine w0 belongs to one pipeline.

Run pipelined data parallel

You will need to make sure that both MPI and NCCL is working in your cluster. You can enter the pipeline folder and then runs the command to get the full results.

# resnet50 training
heturun -c config/w0_w1.yml \
python3 train_resnet.py \
--batch-size 64 --model resnet50 --dataset imagenet --learning-rate 0.01 --weight-decay 1e-4 --epochs 100 \
--pipeline pipedream \
--preduce \

# bert training
heturun -c config/w0_w1.yml \
python3 train_bert.py \
--batch-size 64 --learning-rate 1e-5 --epochs 5000 \
--pipeline pipedream \
--preduce \

Arguments explanation: You can remove the --preduce flag to get the Allreduce results. You can add --adpsgd to get the time by appling ADPSGD algorithm. You can use the --pipeline hetpipe to get the baseline for HetPipe. You can add --hetero 0.1 flag to mimic the heterogeneous environments.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
artifacts		artifacts
bin		bin
cmake		cmake
ps-lite		ps-lite
python		python
src		src
CMakeLists.txt		CMakeLists.txt
README.md		README.md
VLDB2023_SDPipe_Artifacts_and_Proofs.pdf		VLDB2023_SDPipe_Artifacts_and_Proofs.pdf
environment.yml		environment.yml
hetu.exp		hetu.exp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

artifacts

artifacts

bin

bin

cmake

cmake

ps-lite

ps-lite

python

python

src

src

CMakeLists.txt

CMakeLists.txt

README.md

README.md

VLDB2023_SDPipe_Artifacts_and_Proofs.pdf

VLDB2023_SDPipe_Artifacts_and_Proofs.pdf

environment.yml

environment.yml

hetu.exp

hetu.exp

Repository files navigation

Artifacts Evaluation of SDPipe [VLDB'23]

Installation

Prepare datasets

Config distributed training

Run pipelined data parallel

About

Releases

Packages

Languages

Hsword/VLDB2023_SDPipe

Folders and files

Latest commit

History

Repository files navigation

Artifacts Evaluation of SDPipe [VLDB'23]

Installation

Prepare datasets

Config distributed training

Run pipelined data parallel

About

Resources

Stars

Watchers

Forks

Languages