
[Resnet50/TensorFlow] train throughput calculation is incorrect #1264

@watsonw

Description


Related to the TensorFlow/TensorFlow2 classification models,
i.e. TensorFlow/Classification models and TensorFlow2/Classification/Detection models

Describe the bug
When training the TensorFlow/resnet50 models, the last line of output reports train_throughput : xxxx images/s as the training throughput of the whole run, which is incorrect IMHO.

After checking the code, I found the source of the confusion: the data structure MeanAccumulator accumulates the images-per-second (ips) value of every step and simply reports the arithmetic mean of those values as the throughput. (caller: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Classification/ConvNets/runtime/runner.py#L509 ; actual implementation: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Classification/ConvNets/utils/hooks/training_hooks.py#L32)
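The linked implementation amounts to the following pattern (a simplified sketch for illustration, not the exact NVIDIA code):

```python
# Simplified sketch of the MeanAccumulator pattern described above:
# it collects a per-step images/sec sample and reports the arithmetic mean.
class MeanAccumulator:
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def consume(self, value):
        # called once per training step with that step's images/sec
        self.sum += value
        self.count += 1

    def value(self):
        return self.sum / self.count if self.count else 0.0

acc = MeanAccumulator()
for ips in [1024.0, 1024.0, 256.0]:  # hypothetical per-step ips samples
    acc.consume(ips)
print(acc.value())  # 768.0 -- arithmetic mean of per-step ips
```

Averaging per-step ips like this weights every step equally, regardless of how long the step actually took, which is where the discrepancy comes from.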

In general, train throughput can be calculated as:
train_throughput = total images processed / total processing time
                 = global_batch_size / average processing time per step

In practice, train throughput is an important metric for vision/classification tasks, so I would prefer to report the mean_throughput.value() output as "average ips" rather than "train throughput". When the variance of the per-step ips sequence is large, the two metrics can differ substantially.
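To make the difference concrete, here is a small example with made-up step times (batch size 256, three steps; the numbers are hypothetical, not measured):

```python
# Hypothetical numbers showing how the mean of per-step ips diverges from the
# true throughput when step times vary.
batch_size = 256
step_times = [0.25, 0.25, 1.0]  # seconds per step (one slow step)

per_step_ips = [batch_size / t for t in step_times]               # [1024.0, 1024.0, 256.0]
mean_ips = sum(per_step_ips) / len(per_step_ips)                  # 768.0  (what MeanAccumulator reports)
true_throughput = batch_size * len(step_times) / sum(step_times)  # 512.0  images/s

print(mean_ips, true_throughput)  # 768.0 512.0
```

Note that the true throughput equals global_batch_size divided by the mean step time, i.e. the harmonic mean of the per-step ips values, so the arithmetic mean of ips systematically overestimates it whenever step times vary.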

I also checked some other models in this repo to see how they report throughput / average fps; it turns out both approaches exist:

average ips way:

train throughput way:

To Reproduce
Steps to reproduce the behavior:

  1. Follow the instructions of the README.md file under "TensorFlow/Classification/ConvNets#quick-start-guide"
  2. Run the following commands:
## sample train script taken from resnet50v1.5/training/DGXA100_RN50_AMP_90E.sh
WORKSPACE=${1:-"/workspace/rn50v15_tf"}
DATA_DIR=${2:-"/data"}
OTHERS=" --display_every 1"

mpiexec --allow-run-as-root -np 1 python3 main.py --arch=resnet50 \
    --mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
    --batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
    --lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
    --amp --static_loss_scale 128 \
    --data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
    --results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHERS}

Expected behavior
IPS should be stable across steps, and the final train throughput should match global_batch_size / average_process_time_per_step.

Environment
Please provide at least:

  • Container version: nvcr.io/nvidia/tensorflow:22.10-py3
  • GPUs in the system: 1x A100 80GB
  • CUDA driver version: 520.61.05 (CUDA Version: 11.8)


Labels: bug (Something isn't working)
