
Requirement

This guide only covers launching distributed training with Singularity. Before you start, make sure your Singularity installation works by following INSTALL.md.

Setup

  1. Obtain the MXNet launcher and place it in the parent directory of the simpledet working directory.
git clone https://github.com/RogerChern/mxnet-dist-lancher.git lancher
  2. Move data, pretrain_model, and experiments outside of simpledet and symlink them back. This avoids unnecessary rsync of large binary files in the working directory during launch.

  3. After steps 1 and 2, your directory layout should look like this:

lancher/
simpledet/
  data -> /path/to/data
  pretrain_model -> /path/to/pretrain_model
  experiments -> /path/to/experiments
  ...
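
The move-and-symlink step can be scripted. The following is a minimal sketch, not part of simpledet: the helper name `externalize` and the storage path are hypothetical, and the storage path should be absolute so that the symlink resolves from inside the working directory.

```shell
# Hypothetical helper (not shipped with simpledet): move one directory out of
# the working tree and link it back under the same name.
#   $1 = working directory, $2 = absolute storage directory, $3 = entry name
externalize() {
    workdir=$1
    storage=$2
    name=$3
    # Nothing to do if the entry is already a symlink or is not a directory.
    if [ -L "$workdir/$name" ] || [ ! -d "$workdir/$name" ]; then
        return 0
    fi
    mkdir -p "$storage"
    mv "$workdir/$name" "$storage/$name"
    ln -s "$storage/$name" "$workdir/$name"
}

# Example usage (both paths are placeholders, adjust them to your setup):
# for d in data pretrain_model experiments; do
#     externalize "$PWD/simpledet" /path/to/storage "$d"
# done
```

Because the helper skips entries that are already symlinks, rerunning it after a partial setup should be harmless.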
  4. Create a hostfile listing the hostnames of all nodes. The launch node must be able to reach every node over ssh without a password. Save it as simpledet/hostfile.txt:
node1
node2
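
Before launching, it may help to confirm that passwordless ssh actually works for every entry in the hostfile. The following is a minimal sketch under that assumption: the helper name `check_hosts` is hypothetical and not part of simpledet or the launcher. It relies on the standard OpenSSH options `BatchMode` and `ConnectTimeout`.

```shell
# Hypothetical helper: verify non-interactive ssh access to every host listed
# in a hostfile (one hostname per line). Returns non-zero if any host fails.
check_hosts() {
    hostfile=$1
    failed=0
    # The "|| [ -n "$host" ]" part handles a last line without a trailing newline.
    while IFS= read -r host || [ -n "$host" ]; do
        # Skip blank lines.
        [ -z "$host" ] && continue
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # -n stops ssh from consuming the rest of the hostfile on stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done < "$hostfile"
    return "$failed"
}

# Example usage from the launch node:
# check_hosts simpledet/hostfile.txt
```

A host reported as FAILED usually means key-based login is not set up for it or the node is unreachable from the launch node.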
  5. Change the Singularity mount point in scripts/dist_worker.sh.

  6. Change the working directories in scripts/train_hpc.sh.

  7. Launch distributed training with the script:

bash scripts/train_hpc.sh