fix benchmark test example manifest #53

Merged on Jan 27, 2024 (1 commit)


andy108369

fixes #52

@andy108369
Author

Tested with:

kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/c9fc007f07fca4ea1c495ab57f54e10ffa9e2a6b/example/pod/alexnet-gpu.yaml

Logs

root@GPUF019:~# kubectl logs alexnet-tf-gpu-pod
2024-01-26 22:25:05.343811: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-26 22:25:05.407837: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
WARNING:tensorflow:From /usr/local/lib/python3.9/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2024-01-26 22:25:07.936951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 63950 MB memory:  -> device: 0, name: AMD Instinct MI210, pci bus id: 0000:1b:00.0
WARNING:tensorflow:From /usr/local/lib/python3.9/dist-packages/tensorflow/python/util/dispatch.py:1260: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
W0126 22:25:08.468486 139660273379136 deprecation.py:50] From /usr/local/lib/python3.9/dist-packages/tensorflow/python/util/dispatch.py:1260: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2245: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0126 22:25:08.658049 139660273379136 deprecation.py:50] From /benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2245: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2024-01-26 22:25:08.701558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 63950 MB memory:  -> device: 0, name: AMD Instinct MI210, pci bus id: 0000:1b:00.0
2024-01-26 22:25:08.705860: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
2024-01-26 22:25:08.777454: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
INFO:tensorflow:Running local_init_op.
I0126 22:25:09.774426 139660273379136 session_manager.py:526] Running local_init_op.
2024-01-26 22:25:09.784490: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
INFO:tensorflow:Done running local_init_op.
I0126 22:25:09.823423 139660273379136 session_manager.py:529] Done running local_init_op.
2024-01-26 22:25:09.863777: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-01-26 22:25:09.869208: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-01-26 22:25:09.873754: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-01-26 22:25:09.942623: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
TensorFlow:  2.14
Model:       alexnet
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  512 global
             512 per device
Num batches: 100
Num epochs:  0.04
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step	Img/sec	total_loss
1	images/sec: 5858.7 +/- 0.0 (jitter = 0.0)	7.282
10	images/sec: 5857.6 +/- 1.8 (jitter = 3.5)	7.282
20	images/sec: 5857.1 +/- 1.0 (jitter = 2.6)	7.282
30	images/sec: 5856.1 +/- 0.8 (jitter = 4.0)	7.282
40	images/sec: 5856.0 +/- 0.7 (jitter = 4.1)	7.282
50	images/sec: 5855.5 +/- 0.6 (jitter = 4.3)	7.282
60	images/sec: 5854.9 +/- 0.7 (jitter = 4.4)	7.282
70	images/sec: 5854.0 +/- 0.7 (jitter = 5.0)	7.282
80	images/sec: 5853.7 +/- 0.6 (jitter = 5.0)	7.282
90	images/sec: 5852.7 +/- 0.7 (jitter = 6.2)	7.282
100	images/sec: 5851.8 +/- 0.7 (jitter = 7.1)	7.282
----------------------------------------------------------------
total images/sec: 5843.79
----------------------------------------------------------------
root@GPUF019:~# 
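
To compare runs like the one above without reading the log by eye, the step lines and the final `total images/sec` line can be extracted with a small script. This is only a sketch: the names `parse_benchmark_log`, `STEP_RE`, and `TOTAL_RE` are hypothetical and not part of this repository, and the regular expressions assume the exact line layout shown in the log above.

```python
import re

# Assumed layout of a step line, copied from the log above:
# "10<TAB>images/sec: 5857.6 +/- 1.8 (jitter = 3.5)<TAB>7.282"
STEP_RE = re.compile(
    r"^(\d+)\s+images/sec:\s+([\d.]+)\s+\+/-\s+([\d.]+)"
    r"\s+\(jitter = ([\d.]+)\)\s+([\d.]+)\s*$"
)
# Assumed layout of the summary line: "total images/sec: 5843.79"
TOTAL_RE = re.compile(r"^total images/sec:\s+([\d.]+)\s*$")


def parse_benchmark_log(text):
    """Return (steps, total) parsed from benchmark output.

    steps is a list of dicts, one per matching step line;
    total is the summary throughput, or None if the line is absent.
    """
    steps = []
    total = None
    for line in text.splitlines():
        m = STEP_RE.match(line)
        if m:
            steps.append({
                "step": int(m.group(1)),
                "images_per_sec": float(m.group(2)),
                "loss": float(m.group(5)),
            })
            continue
        m = TOTAL_RE.match(line)
        if m:
            total = float(m.group(1))
    return steps, total


# Two step lines and the summary line taken from the log above.
sample = """\
Step\tImg/sec\ttotal_loss
1\timages/sec: 5858.7 +/- 0.0 (jitter = 0.0)\t7.282
100\timages/sec: 5851.8 +/- 0.7 (jitter = 7.1)\t7.282
----------------------------------------------------------------
total images/sec: 5843.79
"""

steps, total = parse_benchmark_log(sample)
print(len(steps), steps[-1]["images_per_sec"], total)
```

In practice the input would be the saved output of `kubectl logs alexnet-tf-gpu-pod`; lines that do not match either pattern, such as the TensorFlow warnings, are ignored.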

@y2kenny-amd merged commit f27abcb into ROCm:master on Jan 27, 2024
Linked issue: "[Issue]: benchmark example is broken"