[bug] keras issue on tpu-vm #19448

Open
innat opened this issue Apr 5, 2024 · 3 comments
Labels: To investigate (Looks like a bug. It needs someone to investigate.)

@innat
innat commented Apr 5, 2024

keras: 3.0.5
tensorflow: 2.15.0

There seems to be a conflict when using Keras 3 on a tpu-vm. See Kaggle/docker-python#1370 (comment).

import numpy as np
import tensorflow as tf
import keras

tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
strategy = tf.distribute.TPUStrategy(tpu)

with strategy.scope():
    # Build and compile a small functional model inside the strategy scope
    inputs = keras.Input(shape=(32,))
    outputs = keras.layers.Dense(1)(inputs)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# Just use `fit` as usual
x = np.random.random((1000, 32))
y = np.random.random((1000, 1))
model.fit(x, y, epochs=3)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1712289536.759567      13 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
Epoch 1/3
---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
Cell In[6], line 11
      9 x = np.random.random((1000, 32))
     10 y = np.random.random((1000, 1))
---> 11 model.fit(x, y, epochs=3)

File /usr/local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py:123, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    120     filtered_tb = _process_traceback_frames(e.__traceback__)
    121     # To get the full stack trace, call:
    122     # `keras.config.disable_traceback_filtering()`
--> 123     raise e.with_traceback(filtered_tb) from None
    124 finally:
    125     del filtered_tb

File /usr/local/lib/python3.10/site-packages/tensorflow/python/eager/execute.py:53, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     51 try:
     52   ctx.ensure_initialized()
---> 53   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     54                                       inputs, attrs, num_outputs)
     55 except core._NotOkStatusException as e:
     56   if name is not None:

NotFoundError: Graph execution error:

Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
Detected at node TPUReplicate/_compile/_15189418723048853925/_4 defined at (most recent call last):
<stack traces unavailable>
9 root error(s) found.
  (0) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
  (1) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
  (2) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
  (3) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
  (4) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
	 [[cluster_one_step_on_iterator/control_after/_1/_387]]
  (5) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
	 [[cluster_one_step_on_iterator/control_after/_1/_387]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_220]]
  (6) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
	 [[cluster_one_step_on_iterator/control_after/_1/_387]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_220]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_284]]
  (7) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
	 [[cluster_one_step_on_iterator/control_after/_1/_387]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_220]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_284]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_236]]
  (8) NOT_FOUND:  XLA:TPU compile permanent error. Container localhost does not exist. (Could not find resource: localhost/tpu_mesh_common_state)
	 [[{{node TPUReplicate/_compile/_15189418723048853925/_4}}]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_316]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_255]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_271]]
	 [[cluster_one_step_on_iterator/control_after/_1/_387]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_220]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_284]]
	 [[TPUReplicate/_compile/_15189418723048853925/_4/_236]]
	 [[tpu_compile_succeeded_assert/_15801172523729505459/_5/_303]]
0 successful operations.
0 derived errors ignored. [Op:__inference_one_step_on_iterator_2865]
@SuryanarayanaY SuryanarayanaY added To investigate Looks like a bug. It needs someone to investigate. keras-team-review-pending Pending review by a Keras team member. and removed keras-team-review-pending Pending review by a Keras team member. labels Apr 5, 2024
@SuryanarayanaY
Collaborator

Hi @innat,

I tested this in a Colab TPU environment with TF 2.15 and Keras 3.1.1, and it seems to work fine, as shown in the attached gist. Could you please cross-check with Keras 3.1.1 and update us? Thanks!
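
When cross-checking between environments, it can help to print the installed package versions in each one so the two reports are directly comparable. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list is an assumption based on the versions quoted in this thread.

```python
from importlib import metadata


def installed_versions(packages=("keras", "tensorflow", "numpy")):
    """Return a {package: version-or-None} mapping for the given distributions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both the Colab and the Kaggle tpu-vm notebooks and pasting the output makes it easier to see whether the two environments actually differ in Keras or TensorFlow version.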

@innat
Author

innat commented Apr 8, 2024

@SuryanarayanaY
I think I clearly mentioned that this is about tpu-vm. You can test it in the Kaggle environment. Please check this too: Kaggle/docker-python#1370 (comment)

@SuryanarayanaY
Collaborator

Hi @innat,

Thanks for the additional context.

@SuryanarayanaY SuryanarayanaY added the keras-team-review-pending Pending review by a Keras team member. label Apr 10, 2024
@sachinprasadhs sachinprasadhs removed the keras-team-review-pending Pending review by a Keras team member. label Apr 11, 2024