
[Resnet50/TensorFlow] train throughput calculation is incorrect #1264

@watsonw

Description


Related to the TensorFlow/TensorFlow2 classification models,
i.e. TensorFlow/Classification models and TensorFlow2/Classification/Detection models

Describe the bug
When training the TensorFlow/resnet50 models, the last line of output reports train_throughput : xxxx images/s as the training throughput of the whole run, which is incorrect IMHO.

After checking the code, I found the source of the confusion: the data structure MeanAccumulator accumulates the images-per-second (ips) value of every step and simply reports the arithmetic mean of those values as the throughput. (caller: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Classification/ConvNets/runtime/runner.py#L509 ; actual implementation: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Classification/ConvNets/utils/hooks/training_hooks.py#L32)
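The linked implementation amounts to the following pattern (a simplified sketch for illustration, not the exact NVIDIA code):

```python
# Simplified sketch of the MeanAccumulator pattern described above:
# it collects a per-step images/sec sample and reports the arithmetic mean.
class MeanAccumulator:
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def consume(self, value):
        # called once per training step with that step's images/sec
        self.sum += value
        self.count += 1

    def value(self):
        return self.sum / self.count if self.count else 0.0

acc = MeanAccumulator()
for ips in [1024.0, 1024.0, 256.0]:  # hypothetical per-step ips samples
    acc.consume(ips)
print(acc.value())  # 768.0 -- arithmetic mean of per-step ips
```

Averaging per-step ips like this weights every step equally, regardless of how long the step actually took, which is where the discrepancy comes from.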

In general, train throughput can be calculated as:
train_throughput = total images processed / total processing time
                 = global_batch_size / average processing time per step

In practice, train throughput is an important metric for vision/classification tasks, so I would prefer to report the mean_throughput.value() output as "average ips" rather than "train throughput". When the variance of the per-step ips sequence is large, the two metrics can differ substantially.
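To make the difference concrete, here is a small example with made-up step times (batch size 256, three steps; the numbers are hypothetical, not measured):

```python
# Hypothetical numbers showing how the mean of per-step ips diverges from the
# true throughput when step times vary.
batch_size = 256
step_times = [0.25, 0.25, 1.0]  # seconds per step (one slow step)

per_step_ips = [batch_size / t for t in step_times]               # [1024.0, 1024.0, 256.0]
mean_ips = sum(per_step_ips) / len(per_step_ips)                  # 768.0  (what MeanAccumulator reports)
true_throughput = batch_size * len(step_times) / sum(step_times)  # 512.0  images/s

print(mean_ips, true_throughput)  # 768.0 512.0
```

Note that the true throughput equals global_batch_size divided by the mean step time, i.e. the harmonic mean of the per-step ips values, so the arithmetic mean of ips systematically overestimates it whenever step times vary.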

I also checked some other models in this repo to see how they report throughput / average fps; it turns out both approaches exist:

average ips way:

train throughput way:

To Reproduce
Steps to reproduce the behavior:

  1. Follow the instructions of the README.md file under "TensorFlow/Classification/ConvNets#quick-start-guide"
  2. Run the following commands:
## sample train script taken from resnet50v1.5/training/DGXA100_RN50_AMP_90E.sh
WORKSPACE=${1:-"/workspace/rn50v15_tf"}
DATA_DIR=${2:-"/data"}
OTHERS=" --display_every 1"

mpiexec --allow-run-as-root -np 1 python3 main.py --arch=resnet50 \
    --mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
    --batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
    --lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
    --amp --static_loss_scale 128 \
    --data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
    --results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHERS}

Expected behavior
IPS should be stable across steps, and the final train throughput should match global_batch_size / average_process_time_per_step.

Environment
Please provide at least:

  • Container version: nvcr.io/nvidia/tensorflow:22.10-py3
  • GPUs in the system: 1x A100 80GB
  • CUDA driver version: 520.61.05 (CUDA Version: 11.8)


Labels: bug (Something isn't working)
