Check failed: !info.content.empty() #944

Open
MoFHeka opened this issue Jul 8, 2023 · 0 comments

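Please describe the bug
Training with ShardParallel aborts: a MeshHostWorker process fails the XLA check !info.content.empty() in gpu_executable.cc and receives SIGABRT.
Please describe the expected behavior
Training should proceed without this unexpected system error.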
System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): ubuntu 20.04
  • Python version: 3.8
  • CUDA version: 11.7.0
  • NCCL version: 2.12.10
  • cupy version: 9.6.0
  • GPU model and memory: llama-13B-hf
  • Alpa version: 0.2.3
  • TensorFlow version: 2.4.0~2.8.0
  • JAX version: 0.3.22

To Reproduce
Steps to reproduce the behavior:
1. Run the example to start training.
2. See the error.

Screenshots
(MeshHostWorker pid=595449) 2023-07-08 01:00:54.519514: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:459] Check failed: !info.content.empty()
(MeshHostWorker pid=595449) *** SIGABRT received at time=1688778054 on cpu 149 ***
(MeshHostWorker pid=595449) PC: @ 0x7f41ad5cd03b (unknown) raise
(MeshHostWorker pid=595449) @ 0x7f41ad5cd0c0 4016 (unknown)
(MeshHostWorker pid=595449) @ 0x7f10a9fae28e 752 xla::gpu::GpuExecutable::ResolveConstantGlobals()
(MeshHostWorker pid=595449) @ 0x7f10ab561864 2784 xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl()
(MeshHostWorker pid=595449) @ 0x7f10ab5631bf 128 xla::gpu::GpuExecutable::ExecuteAsyncOnStream()
(MeshHostWorker pid=595449) @ 0x7f10adf836e6 1376 xla::Executable::ExecuteAsyncOnStreamWrapper()
(MeshHostWorker pid=595449) @ 0x7f10ab9ff720 2432 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=595449) @ 0x7f10ab9ffe90 256 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=595449) @ 0x7f10ab5eb1fa 2720 xla::PjRtStreamExecutorExecutable::EnqueueExecution()
(MeshHostWorker pid=595449) @ 0x7f10ab5ec631 5360 xla::PjRtStreamExecutorExecutable::ExecuteHelper()
(MeshHostWorker pid=595449) @ 0x7f10ab5eea59 240 std::_Function_handler<>::_M_invoke()
(MeshHostWorker pid=595449) @ 0x7f10ab9d8378 208 xla::WorkerThread::WorkLoop()
(MeshHostWorker pid=595449) @ 0x7f10af0de3e5 80 tsl::(anonymous namespace)::PThread::ThreadFn()
(MeshHostWorker pid=595449) @ 0x7f41ad56f609 (unknown) start_thread
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: *** SIGABRT received at time=1688778054 on cpu 149 ***
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: PC: @ 0x7f41ad5cd03b (unknown) raise
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f41ad5cd0c0 4016 (unknown)
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10a9fae28e 752 xla::gpu::GpuExecutable::ResolveConstantGlobals()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab561864 2784 xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab5631bf 128 xla::gpu::GpuExecutable::ExecuteAsyncOnStream()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10adf836e6 1376 xla::Executable::ExecuteAsyncOnStreamWrapper()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab9ff720 2432 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab9ffe90 256 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab5eb1fa 2720 xla::PjRtStreamExecutorExecutable::EnqueueExecution()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab5ec631 5360 xla::PjRtStreamExecutorExecutable::ExecuteHelper()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab5eea59 240 std::_Function_handler<>::_M_invoke()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab9d8378 208 xla::WorkerThread::WorkLoop()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10af0de3e5 80 tsl::(anonymous namespace)::PThread::ThreadFn()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f41ad56f609 (unknown) start_thread

Code snippet to reproduce the problem

Additional information
