Description
Related to the classification models under TensorFlow/TensorFlow2, such as the TensorFlow/Classification models and the TensorFlow2/Classification and TensorFlow2/Detection models.
Describe the bug
When training the TensorFlow/resnet50 model, the last line of output reports train_throughput : xxxx images/s
as the training throughput of the whole process, which is incorrect IMHO.
After checking the code, I found the source of the confusion: the data structure MeanAccumulator
accumulates images per second (a.k.a. ips) for every step and simply reports the arithmetic mean of those per-step rates as the throughput. (caller: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Classification/ConvNets/runtime/runner.py#L509 ; the actual implementation is here: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Classification/ConvNets/utils/hooks/training_hooks.py#L32)
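To make the pattern concrete, here is a minimal, hypothetical sketch of what an accumulate-and-average helper like MeanAccumulator does (the class and method names mirror the linked code; the body is an assumption, not a copy of the repo's implementation):

```python
# Hypothetical sketch: accumulate per-step images/sec samples and
# report their arithmetic mean as the "throughput".
class MeanAccumulator:
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def consume(self, value):
        # One sample per training step, e.g. batch_size / step_time.
        self.sum += value
        self.count += 1

    def value(self):
        return self.sum / self.count if self.count else 0.0


acc = MeanAccumulator()
for step_ips in [1000.0, 1000.0, 250.0]:  # per-step images/sec samples
    acc.consume(step_ips)
print(acc.value())  # arithmetic mean of the per-step rates: 750.0
```

Note that this mean weights every step equally regardless of how long the step actually took, which is exactly why it can drift away from end-to-end throughput.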
In general, train throughput can be calculated as:

train_throughput = (total images processed) / (total process time)
                 = global_batch_size / (average process time per step)
In practice, train throughput is an important metric for vision/classification tasks, so I would prefer to label the mean_throughput.value()
output "average ips" instead of "train throughput". When the variance of the per-step ips sequence is large, the two metrics can differ significantly.
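A toy example (all numbers hypothetical) illustrates the divergence: with a couple of fast steps and one slow step, the mean of per-step ips overestimates the true end-to-end throughput, because the slow step contributes only one sample to the mean while dominating the wall time:

```python
# Hypothetical per-step timings: two fast steps and one slow step.
batch_size = 256
step_times = [0.1, 0.1, 1.0]  # seconds per step

# "Average ips" way: mean of per-step rates (what MeanAccumulator reports).
per_step_ips = [batch_size / t for t in step_times]
average_ips = sum(per_step_ips) / len(per_step_ips)

# "Train throughput" way: total images divided by total wall time.
total_images = batch_size * len(step_times)
total_time = sum(step_times)
train_throughput = total_images / total_time

print(average_ips)       # (2560 + 2560 + 256) / 3 = 1792 images/s
print(train_throughput)  # 768 / 1.2 = 640 images/s
```

Here the reported "throughput" is nearly 3x the rate at which images were actually processed.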
I also checked how some other models handle throughput/average fps; it turns out both approaches exist:
average ips way:
- https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow2/Classification/ConvNets/utils/callbacks.py#L352
- https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow2/Detection/Efficientdet/model/callback_builder.py#L144
train throughput way:
- https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow2/Classification/ConvNets/utils/callbacks.py#L346
- https://github.com/NVIDIA/DeepLearningExamples/blob/master/MxNet/Classification/RN50v1.5/fit.py#L442
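For contrast, a minimal sketch of the "train throughput way" (my own illustration, not the repo's actual callback code): time the whole run once and divide total images by elapsed wall time, so slow steps are weighted by how long they actually take:

```python
import time

# Hypothetical timer following the "train throughput" approach:
# one wall-clock measurement over the whole run, instead of
# averaging per-step rates.
class ThroughputTimer:
    def __init__(self, global_batch_size):
        self.global_batch_size = global_batch_size
        self.steps = 0
        self.start = None

    def on_train_begin(self):
        self.start = time.perf_counter()

    def on_step_end(self):
        self.steps += 1

    def train_throughput(self):
        elapsed = time.perf_counter() - self.start
        return self.global_batch_size * self.steps / elapsed


timer = ThroughputTimer(global_batch_size=256)
timer.on_train_begin()
for _ in range(3):
    time.sleep(0.01)  # stand-in for one training step
    timer.on_step_end()
print(f"train_throughput: {timer.train_throughput():.0f} images/s")
```

This is equivalent to global_batch_size divided by the average step time, matching the formula above.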
To Reproduce
Steps to reproduce the behavior:
- Follow the instructions in the README.md under "TensorFlow/Classification/ConvNets#quick-start-guide"
- Run the following commands:
```bash
## sample train script taken from resnet50v1.5/training/DGXA100_RN50_AMP_90E.sh
WORKSPACE=${1:-"/workspace/rn50v15_tf"}
DATA_DIR=${2:-"/data"}
OTHERS=" --display_every 1"
mpiexec --allow-run-as-root -np 1 python3 main.py --arch=resnet50 \
    --mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
    --batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
    --lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
    --amp --static_loss_scale 128 \
    --data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
    --results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHERS}
```
Expected behavior
IPS should be stable across steps, and the final train throughput should correctly match
global_batch_size / average_process_time_per_step.
Environment
Please provide at least:
- Container version: nvcr.io/nvidia/tensorflow:22.10-py3
- GPUs in the system: 1x A100 80GB
- CUDA driver version: 520.61.05, CUDA version: 11.8