
Killed unexpectedly in Colab with TPU #7

Closed
onegigbyte opened this issue Nov 25, 2022 · 1 comment
Comments

@onegigbyte

I'm on a budget, so I'm running train_agent.py for Caro on Colab with a TPU.
However, it always gets killed during iteration #1, at around 64% of self-play, without much of a stack trace.

Any experience with, or theories on, why this may happen?

!TF_CPP_MIN_LOG_LEVEL=0
!time python3 train_agent.py \
    --game-class="caro_game.CaroGame" \
    --agent-class="resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=1024 \
    --training-batch-size=1024 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=102400 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./caro_agent_9x9_128.ckpt" \
    --num-iterations=100 \
    --lr-decay-steps=500000

2022-11-25 08:59:37.077139: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Cores: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1), TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1), TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]
Loading weights at ./caro_agent_9x9_128.ckpt
Iteration 1
self play [######################--------------] 63% 00:09:41 /bin/bash: line 1: 2377 Killed python3 train_agent.py --game-class="caro_game.CaroGame" --agent-class="resnet_policy.ResnetPolicyValueNet128" --selfplay-batch-size=1024 --training-batch-size=1024 --num-simulations-per-move=32 --num-self-plays-per-iteration=102400 --learning-rate=1e-2 --random-seed=42 --ckpt-filename="./caro_agent_9x9_128.ckpt" --num-iterations=100 --lr-decay-steps=500000

real 17m19.797s
user 10m5.645s
sys 5m3.467s
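
For a silent "Killed" like the one above, the kernel log usually records whether the Linux OOM killer terminated the process. A minimal check, assuming dmesg is readable from the Colab VM, would be:

!dmesg | grep -iE "out of memory|killed process" | tail -n 5

If that prints an oom-kill entry naming the python3 process, the run exceeded host RAM rather than crashing inside the JAX/TPU code.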

@onegigbyte (Author)

Solved, sorry: it simply ran out of RAM and was killed by Colab.
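
If anyone hits the same limit, the most direct lever is the memory footprint of self-play. As a sketch (the flag names come from the command above; the smaller values are illustrative, not tuned), reducing the batch sizes and the number of self-plays per iteration should shrink the host-side buffers at the cost of slower data generation:

!time python3 train_agent.py \
    --game-class="caro_game.CaroGame" \
    --agent-class="resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=256 \
    --training-batch-size=256 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=25600 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./caro_agent_9x9_128.ckpt" \
    --num-iterations=100 \
    --lr-decay-steps=500000

Whether the self-play buffer actually lives in host RAM depends on the training script, so this is a guess based on the out-of-memory behavior, not a confirmed fix.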
