
Choosing the Batch Size for Single-GPU and Multi-GPU Training #21

Open
dhzhd1 opened this issue Nov 2, 2017 · 2 comments


dhzhd1 commented Nov 2, 2017

Issue summary

I am training GoogleNet (v1) on the ImageNet dataset with Caffe. For single-GPU (MI25) training I use a batch size of 128. When I move to multi-MI25 training on hipCaffe, the total GPU memory is four times larger (16GB x 4), so a batch size of 512 images/batch (128 images/batch/card) should fit. In my tests, however, the batch size cannot be increased at all: even 192 (a multiple of 64) fails with "error: 'hipErrorMemoryAllocation' (1002)".
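
For context on the arithmetic: in stock BVLC-style Caffe data parallelism, the batch_size in train_val.prototxt is applied per GPU, so the effective batch scales with the card count (the recipe later in this thread sets a batchsize_per_gpu accordingly). A minimal sketch of that bookkeeping, with illustrative numbers:

batch_size_per_gpu=128                                # the value in train_val.prototxt
num_gpus=4                                            # e.g. --gpu 0,1,2,3
effective_batch=$(( batch_size_per_gpu * num_gpus ))  # 512 images per iteration
echo "effective batch: ${effective_batch}"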

Since the batch size is stuck at 128, by rough math the four-card training time will be 3~3.5x longer than training on a 4x P100 system (batch_size=512).

Are there any environment parameters I should set before training that would help enlarge the batch size for multi-GPU training?

As a cross-check, on one of my NVIDIA P100 x4 servers the batch size can be increased as long as I use more cards. The batch numbers above are based on my experience running the same dataset and network on NVIDIA P100 (16GB) and V100 (16GB) training jobs.

Steps to reproduce

Use the bvlc_googlenet training network under the hipCaffe installation path, with the ImageNet dataset from the official ImageNet website.

Your system configuration

Operating system: Ubuntu 16.04.3
Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
CUDA version (if applicable):
CUDNN version (if applicable):
BLAS: USE_ROCBLAS := 1
Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7.12
Other:
miopen-hip 1.1.4
miopengemm 1.1.5
rocm-libs 1.6.180
Server: Inventec P47
GPU: AMD MI25 x4
CPU: AMD EPYC 7601 x2
Memory: 512GB

parallelo (Contributor) commented:

Hi @dhzhd1,

Thanks for the feedback. If I'm understanding your comments correctly, I believe I just reproduced your setup, but I didn't hit OOM errors.

First, reboot and try re-running your workload.

If that doesn't work, can you please send the results of hipInfo? See this directory: /opt/rocm/hip/samples/1_Utils/hipInfo. Also, can you show how you are running this 4-GPU workload?
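
For reference, a minimal sketch of building and running hipInfo, assuming the sample ships with its usual Makefile:

# Assumes the stock HIP samples layout under /opt/rocm; adjust if yours differs.
cd /opt/rocm/hip/samples/1_Utils/hipInfo
make       # compiles hipInfo with hipcc
./hipInfo  # prints device name, total memory, etc. for each GPU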

Thanks,

Jeff


PS - Here's an example of how you might accomplish a 4-GPU run. You'll have to point the prototxt files to wherever you have ImageNet data located.

Prepare GoogleNet

Params to be set by the user:

gpuids="0,1,2,3"
batchsize_per_gpu=128
iterations=500
model_path=models/bvlc_googlenet

Update the train_val prototxt's batch size:

train_val_prototxt=${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
cp ${model_path}/train_val.prototxt ${train_val_prototxt}
sed -i "s|batch_size: 32|batch_size: ${batchsize_per_gpu}|g" ${train_val_prototxt}

Update the solver prototxt's max_iter, snapshot, and train_val prototxt path:

solver_prototxt=${model_path}/solver_short.prototxt
cp ${model_path}/solver.prototxt ${solver_prototxt}
sed -i "s|max_iter: 10000000|max_iter: ${iterations}|g" ${solver_prototxt}
sed -i "s|snapshot: 40000|snapshot_after_train: 0|g" ${solver_prototxt}
sed -i "s|${model_path}/train_val.prototxt|${train_val_prototxt}|g" ${solver_prototxt}

Train with ImageNet data

Using the parameters set above, run it:

ngpus=$(( 1 + $(grep -o "," <<< "$gpuids" | wc -l) ))
train_log=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}.log
train_log_sec=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}_sec.log
./build/tools/caffe train --solver=${solver_prototxt} --gpu ${gpuids} 2>&1 | tee ${train_log}
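
As a hedged sketch of one way to use the train_log_sec file defined above, you could pull the per-iteration progress lines out of the main log; the exact line format varies across Caffe versions, so the pattern may need adjusting:

# Extract "Iteration ..." lines for a quick look at training progress and throughput.
grep "Iteration" ${train_log} > ${train_log_sec}
tail ${train_log_sec}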


dhzhd1 commented Nov 6, 2017

Hi @parallelo, thanks for your feedback. The system has just been shipped to the SC17 show together with the MI25s. I will provide an update when I get the system back.
