HET

A distributed deep learning framework for training huge embedding models (previously named Athena). HET is developed by DAIM Lab at Peking University. This is a preview version that lets reviewers verify reproducibility; the whole system is not yet fully released. If you have any questions, please email xupeng.miao@pku.edu.cn.

Installation

Clone the repository.
Edit the athena.exp file to set the environment path for Python, then source it:

source athena.exp

CMake is used to compile Hetu. Generate the Makefile first:

conda install cmake # ensure cmake version >= 3.18
cp cmake/config.example.cmake cmake/config.cmake
# modify paths for CUDA, CUDNN, NCCL, MPI in cmake/config.cmake if necessary
mkdir build && cd build && cmake ..
# if NCCL is needed, download and install NCCL 2.7.8.
# if the Hetu cache is needed, install pybind11: conda install pybind11
# if GNN support is needed, install METIS.
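
The version requirement noted above (cmake >= 3.18) can be checked before configuring. The following is a minimal sketch and not part of HET: the helper names `parse_cmake_version` and `cmake_is_new_enough` are hypothetical, and it assumes that `cmake --version` prints a first line of the form `cmake version X.Y.Z`.

```python
import re
import shutil
import subprocess

MIN_CMAKE = (3, 18)  # minimum version stated in the build instructions


def parse_cmake_version(text):
    """Extract (major, minor, patch) from `cmake --version` output, or None."""
    match = re.search(r"cmake version (\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def cmake_is_new_enough(minimum=MIN_CMAKE):
    """Return True if the cmake found on PATH is at least `minimum`."""
    if shutil.which("cmake") is None:
        return False  # cmake is not installed or not on PATH
    result = subprocess.run(
        ["cmake", "--version"], capture_output=True, text=True, check=False
    )
    version = parse_cmake_version(result.stdout)
    return version is not None and version[:2] >= minimum


if __name__ == "__main__":
    print("cmake OK" if cmake_is_new_enough() else "cmake missing or older than 3.18")
```

If the check fails, installing cmake through conda as shown above is one way to obtain a recent version.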

Compile Athena with the generated Makefile:

# current directory is ./build/
make clean
make athena version=mkl -j 32
make athena version=gpu -j 32
# or: make athena version=all -j 32
make ps pslib -j 32 # for ps support
make mpi mpi_nccl -j 32 # for mpi-based allreduce, time-consuming
# alternatively, `make -j 32` builds all of the targets above

Install graphviz to support Graphboard visualization (not maintained; may be deprecated):

sudo apt-get install graphviz
sudo pip install graphviz

Run some simple examples

Train logistic regression on GPU:

python tests/models_tests/main.py --model logreg --validate

Train a 3-layer MLP on CPU:

python tests/models_tests/main.py --model mlp --validate --gpu -1

Train a 3-layer MLP on GPU:

python tests/models_tests/main.py --model mlp --validate

Train a 3-layer CNN on CPU:

python tests/models_tests/main.py --model cnn_3_layers --validate --gpu -1

Train a 3-layer CNN on GPU:

python tests/models_tests/main.py --model cnn_3_layers --validate
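
The single-device examples above differ only in the `--model` value and in whether `--gpu -1` is passed to select the CPU. The following is a minimal sketch of a helper that assembles these command lines. `build_command` is a hypothetical name, not part of HET, and only the flags shown above are used.

```python
import shlex
import sys

SCRIPT = "tests/models_tests/main.py"  # entry point used by the examples above

# (model, use_gpu) pairs matching the examples listed above
EXAMPLES = [
    ("logreg", True),
    ("mlp", False),
    ("mlp", True),
    ("cnn_3_layers", False),
    ("cnn_3_layers", True),
]


def build_command(model, use_gpu=True, python=sys.executable):
    """Return the argv list for one single-device training run."""
    cmd = [python, SCRIPT, "--model", model, "--validate"]
    if not use_gpu:
        cmd += ["--gpu", "-1"]  # -1 selects the CPU in the examples above
    return cmd


if __name__ == "__main__":
    # Print each example as a shell-ready command line without running it.
    for model, use_gpu in EXAMPLES:
        print(shlex.join(build_command(model, use_gpu, python="python")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to run the example.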

Train a 3-layer MLP with allreduce on 2 GPUs (use the mpirun binary from your Open MPI installation path):

path/to/deps/mpirun --allow-run-as-root -np 2 python tests/models_tests/allreduce_main.py --model mlp --validate

Train a 3-layer MLP with a parameter server (PS) on 1 server and 2 workers (the configurations must be set in the JSON files):

# in scheduler process
python tests/models_tests/ps_main.py --model mlp --setting scheduler_conf.json
# in server process
python tests/models_tests/ps_main.py --model mlp --setting server_conf.json
# in worker1 process
python tests/models_tests/ps_main.py --model mlp --setting worker_conf.json --validate
# in worker2 process
python tests/models_tests/ps_main.py --model mlp --setting worker_conf_2.json --validate
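
The four processes above are normally started in separate terminals. As a rough sketch, they could also be started from a single script on one machine. This assumes that the four JSON files already exist in the working directory and that starting the scheduler, then the server, then the workers is acceptable; the source confirms neither. `ps_commands` and `launch` are hypothetical names, not part of HET.

```python
import subprocess
import sys

SCRIPT = "tests/models_tests/ps_main.py"

# (role, settings file, pass --validate?) taken from the commands above
ROLES = [
    ("scheduler", "scheduler_conf.json", False),
    ("server", "server_conf.json", False),
    ("worker1", "worker_conf.json", True),
    ("worker2", "worker_conf_2.json", True),
]


def ps_commands(model="mlp", python=sys.executable):
    """Return {role: argv} for the scheduler, the server and the two workers."""
    commands = {}
    for role, setting, validate in ROLES:
        cmd = [python, SCRIPT, "--model", model, "--setting", setting]
        if validate:
            cmd.append("--validate")
        commands[role] = cmd
    return commands


def launch(model="mlp"):
    """Start every process, wait for the workers, then stop the rest."""
    procs = {role: subprocess.Popen(cmd) for role, cmd in ps_commands(model).items()}
    try:
        for role in ("worker1", "worker2"):
            procs[role].wait()
    finally:
        # Assumption: the scheduler and server may not exit on their own.
        for role in ("scheduler", "server"):
            if procs[role].poll() is None:
                procs[role].terminate()
            procs[role].wait()
    return {role: proc.returncode for role, proc in procs.items()}
```

Calling `ps_commands()` alone only builds the command lines, which is useful for inspecting them before calling `launch()`.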

Graphboard is served at http://localhost:9997 during training. The port can be changed by editing PORT in mnist_dlsys.py (not maintained; may be deprecated).

Evaluation on CTR and GNN tasks:

Please refer to our examples.

License

The entire codebase is released under the Apache-2.0 license.
