XLA compilation not working on RNN example rocm 5.4.3 MI100 #2026

Epliz · 2023-03-25T09:22:27Z

Issue Type

Bug

Have you reproduced the bug with TF nightly?

No

Source

binary

Tensorflow Version

tf-rocm 2.11

Custom Code

No

OS Platform and Distribution

Ubuntu 22.04.2 LTS

Mobile device

No response

Python version

3.10.6

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

rocm 5.4.3

GPU model and memory

MI100

Current Behaviour?

XLA compilation fails with the following error:


2023-03-25 10:08:12.600064: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x7fa7f8102fa0 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
2023-03-25 10:08:12.600101: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): AMD Instinct MI100, AMDGPU ISA version: gfx908:sramecc+:xnack-
2023-03-25 10:08:12.623830: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-03-25 10:08:12.724333: E tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:289] bitcode module is required by this HLO module but was not found at ./opencl.bc
2023-03-25 10:08:12.725250: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2023-03-25 10:08:12.725363: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: bitcode module not found at ./opencl.bc
2023-03-25 10:08:12.749220: E tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:289] bitcode module is required by this HLO module but was not found at ./opencl.bc
2023-03-25 10:08:12.749567: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: bitcode module not found at ./opencl.bc
2023-03-25 10:08:12.782255: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
Traceback (most recent call last):
  File "/home/me/git/ml/textgen_rnn/./rnn.py", line 148, in <module>
    history = model.fit(dataset, epochs=EPOCHS, batch_size=BATCH_SIZE)
  File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node 'StatefulPartitionedCall_5' defined at (most recent call last):
    File "/home/me/git/ml/textgen_rnn/./rnn.py", line 148, in <module>
      history = model.fit(dataset, epochs=EPOCHS, batch_size=BATCH_SIZE)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/engine/training.py", line 1650, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/engine/training.py", line 1249, in train_function
      return step_function(self, iterator)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/engine/training.py", line 1233, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/engine/training.py", line 1222, in run_step
      outputs = model.train_step(data)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/engine/training.py", line 1027, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
      self.apply_gradients(grads_and_vars)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
      return super().apply_gradients(grads_and_vars, name=name)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
      iteration = self._internal_apply_gradients(grads_and_vars)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
      return tf.__internal__.distribute.interim.maybe_merge_call(
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
      distribution.extended.update(
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
      return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_5'
bitcode module not found at ./opencl.bc
	 [[{{node StatefulPartitionedCall_5}}]] [Op:__inference_train_function_2592]
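
The log shows XLA looking for the bitcode file at the relative path `./opencl.bc`, which suggests the ROCm installation directory was not resolved. A minimal diagnostic sketch for checking whether the file exists under a ROCm root is below. The helper `find_bitcode` is hypothetical, and the fallback root `/opt/rocm` is an assumption about where ROCm is installed, not something stated in this report.

```python
import os
from pathlib import Path


def find_bitcode(root: str, name: str = "opencl.bc") -> list[Path]:
    """Return every file called `name` found under `root` (empty list if none)."""
    base = Path(root)
    if not base.is_dir():
        return []
    # Search recursively so no particular sub-directory layout is assumed.
    return sorted(p for p in base.rglob(name) if p.is_file())


if __name__ == "__main__":
    # Assumption: fall back to /opt/rocm when ROCM_PATH is not set.
    root = os.environ.get("ROCM_PATH", "/opt/rocm")
    matches = find_bitcode(root)
    if matches:
        for path in matches:
            print(f"found: {path}")
    else:
        print(f"no opencl.bc found under {root}")
```

If the file is found under the ROCm root but XLA still reports `./opencl.bc`, the installation is probably fine and only the root lookup is failing.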



Standalone code to reproduce the issue

Code from https://www.tensorflow.org/text/tutorials/text_generation

Relevant log output

No response


Epliz · 2023-03-25T09:25:07Z

Same symptoms as ROCm/ROCm#1796.

Epliz · 2023-03-25T09:27:06Z

Using the solution from there (setting ROCM_PATH) worked.
Please have this environment variable set automatically instead of making users figure it out themselves.
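
For anyone hitting the same error, here is a minimal sketch of the workaround applied from inside the script rather than from the shell. The value `/opt/rocm` is an assumption (the usual default install location) and the helper name `ensure_rocm_path` is hypothetical; the comment above only confirms that setting ROCM_PATH fixed the problem, not which value was used.

```python
import os

# Assumption: ROCm is installed under /opt/rocm; adjust to the actual install root.
ASSUMED_ROCM_ROOT = "/opt/rocm"


def ensure_rocm_path(default_root: str = ASSUMED_ROCM_ROOT) -> str:
    """Set ROCM_PATH if it is not already set and return its effective value."""
    # setdefault leaves an existing value (e.g. exported in the shell) untouched.
    return os.environ.setdefault("ROCM_PATH", default_root)


rocm_root = ensure_rocm_path()
print(f"ROCM_PATH={rocm_root}")

# Import TensorFlow only after the variable is set, so that it is visible
# by the time XLA looks up the ROCm bitcode directory:
# import tensorflow as tf
```

Exporting the variable in the shell before launching Python (`export ROCM_PATH=...`) should have the same effect and keeps the script unchanged.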

Epliz mentioned this issue Mar 25, 2023

Slowness on Fashion MNIST and RNN sample programs on MI100 (gfx908) rocm 5.3.3 #2025

Closed
