XLA compilation not working on RNN example rocm 5.4.3 MI100 #2026

Open
Epliz opened this issue Mar 25, 2023 · 2 comments

Comments


Epliz commented Mar 25, 2023

Issue Type

Bug

Have you reproduced the bug with TF nightly?

No

Source

binary

Tensorflow Version

tf-rocm 2.11

Custom Code

No

OS Platform and Distribution

Ubuntu 22.04.2 LTS

Mobile device

No response

Python version

3.10.6

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

rocm 5.4.3

GPU model and memory

MI100

Current Behaviour?

Getting the following error:


2023-03-25 10:08:12.600064: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x7fa7f8102fa0 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
2023-03-25 10:08:12.600101: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): AMD Instinct MI100, AMDGPU ISA version: gfx908:sramecc+:xnack-
2023-03-25 10:08:12.623830: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-03-25 10:08:12.724333: E tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:289] bitcode module is required by this HLO module but was not found at ./opencl.bc
2023-03-25 10:08:12.725250: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2023-03-25 10:08:12.725363: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: bitcode module not found at ./opencl.bc
2023-03-25 10:08:12.749220: E tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:289] bitcode module is required by this HLO module but was not found at ./opencl.bc
2023-03-25 10:08:12.749567: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: bitcode module not found at ./opencl.bc
2023-03-25 10:08:12.782255: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
Traceback (most recent call last):
  File "/home/me/git/ml/textgen_rnn/./rnn.py", line 148, in <module>
    history = model.fit(dataset, epochs=EPOCHS, batch_size=BATCH_SIZE)
  File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node 'StatefulPartitionedCall_5' defined at (most recent call last):
    File "/home/me/git/ml/textgen_rnn/./rnn.py", line 148, in <module>
      history = model.fit(dataset, epochs=EPOCHS, batch_size=BATCH_SIZE)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/engine/training.py", line 1650, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/engine/training.py", line 1249, in train_function
      return step_function(self, iterator)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/engine/training.py", line 1233, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/engine/training.py", line 1222, in run_step
      outputs = model.train_step(data)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/engine/training.py", line 1027, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
      self.apply_gradients(grads_and_vars)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
      return super().apply_gradients(grads_and_vars, name=name)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
      iteration = self._internal_apply_gradients(grads_and_vars)
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
      return tf.__internal__.distribute.interim.maybe_merge_call(
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
      distribution.extended.update(
    File "/home/me/git/ml/venv-gpu/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
      return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_5'
bitcode module not found at ./opencl.bc
	 [[{{node StatefulPartitionedCall_5}}]] [Op:__inference_train_function_2592]


Standalone code to reproduce the issue

Code from https://www.tensorflow.org/text/tutorials/text_generation

Relevant log output

No response


Epliz commented Mar 25, 2023

Same symptoms as ROCm/ROCm#1796.


Epliz commented Mar 25, 2023

Using the solution from there (setting ROCM_PATH) worked.
Please have the environment variable set automatically instead of making users figure it out by themselves.
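
For anyone hitting the same error, here is a minimal sketch of that workaround done from inside the training script. The `/opt/rocm` default and the helper name `ensure_rocm_path` are assumptions for illustration, not something confirmed in this thread; substitute the directory where ROCm is actually installed. Setting the variable before importing TensorFlow is the conservative choice.

```python
import os


def ensure_rocm_path(default="/opt/rocm"):
    """Set ROCM_PATH if it is not already set and return the value in effect.

    The default of /opt/rocm is an assumption (a common install location);
    an existing ROCM_PATH in the environment is left untouched.
    """
    return os.environ.setdefault("ROCM_PATH", default)


rocm_path = ensure_rocm_path()

# Warn early if the directory is missing, since a wrong path would most
# likely lead to the same "bitcode module not found" failure.
if not os.path.isdir(rocm_path):
    print(f"warning: ROCM_PATH={rocm_path} does not exist on this machine")

# Import TensorFlow only after this point, e.g.:
# import tensorflow as tf
```

Exporting the variable in the shell before launching the script should have the same effect; the in-script version just keeps the workaround next to the code that needs it.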
