fix benchmark test example manifest #53

andy108369 · 2024-01-26T22:24:18Z

fixes #52

andy108369 · 2024-01-26T22:25:55Z

Tested with:

kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/c9fc007f07fca4ea1c495ab57f54e10ffa9e2a6b/example/pod/alexnet-gpu.yaml

Logs

root@GPUF019:~# kubectl logs alexnet-tf-gpu-pod
2024-01-26 22:25:05.343811: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-26 22:25:05.407837: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
WARNING:tensorflow:From /usr/local/lib/python3.9/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2024-01-26 22:25:07.936951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 63950 MB memory:  -> device: 0, name: AMD Instinct MI210, pci bus id: 0000:1b:00.0
WARNING:tensorflow:From /usr/local/lib/python3.9/dist-packages/tensorflow/python/util/dispatch.py:1260: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
W0126 22:25:08.468486 139660273379136 deprecation.py:50] From /usr/local/lib/python3.9/dist-packages/tensorflow/python/util/dispatch.py:1260: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2245: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0126 22:25:08.658049 139660273379136 deprecation.py:50] From /benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2245: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2024-01-26 22:25:08.701558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 63950 MB memory:  -> device: 0, name: AMD Instinct MI210, pci bus id: 0000:1b:00.0
2024-01-26 22:25:08.705860: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
2024-01-26 22:25:08.777454: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
INFO:tensorflow:Running local_init_op.
I0126 22:25:09.774426 139660273379136 session_manager.py:526] Running local_init_op.
2024-01-26 22:25:09.784490: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
INFO:tensorflow:Done running local_init_op.
I0126 22:25:09.823423 139660273379136 session_manager.py:529] Done running local_init_op.
2024-01-26 22:25:09.863777: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-01-26 22:25:09.869208: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-01-26 22:25:09.873754: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-01-26 22:25:09.942623: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
TensorFlow:  2.14
Model:       alexnet
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  512 global
             512 per device
Num batches: 100
Num epochs:  0.04
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step	Img/sec	total_loss
1	images/sec: 5858.7 +/- 0.0 (jitter = 0.0)	7.282
10	images/sec: 5857.6 +/- 1.8 (jitter = 3.5)	7.282
20	images/sec: 5857.1 +/- 1.0 (jitter = 2.6)	7.282
30	images/sec: 5856.1 +/- 0.8 (jitter = 4.0)	7.282
40	images/sec: 5856.0 +/- 0.7 (jitter = 4.1)	7.282
50	images/sec: 5855.5 +/- 0.6 (jitter = 4.3)	7.282
60	images/sec: 5854.9 +/- 0.7 (jitter = 4.4)	7.282
70	images/sec: 5854.0 +/- 0.7 (jitter = 5.0)	7.282
80	images/sec: 5853.7 +/- 0.6 (jitter = 5.0)	7.282
90	images/sec: 5852.7 +/- 0.7 (jitter = 6.2)	7.282
100	images/sec: 5851.8 +/- 0.7 (jitter = 7.1)	7.282
----------------------------------------------------------------
total images/sec: 5843.79
----------------------------------------------------------------
root@GPUF019:~#
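For comparing runs, the per-step throughput lines and the final total in a log like the one above can be pulled out with a small script. This is only a minimal sketch: the line format is assumed from this single log, and the name `parse_benchmark_log` is hypothetical, not part of tf_cnn_benchmarks or this repository.

```python
import re

# Per-step lines, format assumed from the log above, e.g.
# "10\timages/sec: 5857.6 +/- 1.8 (jitter = 3.5)\t7.282"
STEP_RE = re.compile(
    r"^(\d+)\s+images/sec:\s+([\d.]+)\s+\+/-\s+[\d.]+"
    r"\s+\(jitter = [\d.]+\)\s+([\d.]+)\s*$"
)
# Final summary line, e.g. "total images/sec: 5843.79"
TOTAL_RE = re.compile(r"^total images/sec:\s+([\d.]+)\s*$")


def parse_benchmark_log(text):
    """Return ([(step, images_per_sec, total_loss), ...], total_images_per_sec).

    Hypothetical helper: the total is None if no summary line is found.
    """
    steps = []
    total = None
    for line in text.splitlines():
        m = STEP_RE.match(line)
        if m:
            steps.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
            continue
        m = TOTAL_RE.match(line)
        if m:
            total = float(m.group(1))
    return steps, total


# Sample lines copied from the log above.
sample = "\n".join([
    "Step\tImg/sec\ttotal_loss",
    "1\timages/sec: 5858.7 +/- 0.0 (jitter = 0.0)\t7.282",
    "100\timages/sec: 5851.8 +/- 0.7 (jitter = 7.1)\t7.282",
    "total images/sec: 5843.79",
])
steps, total = parse_benchmark_log(sample)
print(steps[-1], total)
```

The output of `kubectl logs alexnet-tf-gpu-pod` could be saved to a file and its contents passed to the function. All other log lines (warnings, device info) are ignored.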

Commit c9fc007: fix benchmark test example manifest

fixes ROCm#52

andy108369 mentioned this pull request Jan 26, 2024

[Issue]: benchmark example is broken #52

Closed

y2kenny-amd merged commit f27abcb into ROCm:master Jan 27, 2024
