
Killed unexpectedly in Colab with TPU #7

Closed
onegigbyte opened this issue Nov 25, 2022 · 1 comment
Comments

@onegigbyte

I'm on a budget, so I'm running train_agent.py for Caro on Colab with a TPU.
However, it always gets killed during iteration #1, at around 64% of self-play, without much of a stack trace.

Any experience with, or theories on, why this may happen?

!TF_CPP_MIN_LOG_LEVEL=0
!time python3 train_agent.py \
    --game-class="caro_game.CaroGame" \
    --agent-class="resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=1024 \
    --training-batch-size=1024 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=102400 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./caro_agent_9x9_128.ckpt" \
    --num-iterations=100 \
    --lr-decay-steps=500000

2022-11-25 08:59:37.077139: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Cores: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1), TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1), TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]
Loading weights at ./caro_agent_9x9_128.ckpt
Iteration 1
self play [######################--------------] 63% 00:09:41 /bin/bash: line 1: 2377 Killed python3 train_agent.py --game-class="caro_game.CaroGame" --agent-class="resnet_policy.ResnetPolicyValueNet128" --selfplay-batch-size=1024 --training-batch-size=1024 --num-simulations-per-move=32 --num-self-plays-per-iteration=102400 --learning-rate=1e-2 --random-seed=42 --ckpt-filename="./caro_agent_9x9_128.ckpt" --num-iterations=100 --lr-decay-steps=500000

real 17m19.797s
user 10m5.645s
sys 5m3.467s
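
For a silent "Killed" like the one above, the kernel log usually records whether the Linux OOM killer terminated the process. A minimal check, assuming dmesg is readable from the Colab VM, would be:

!dmesg | grep -iE "out of memory|killed process" | tail -n 5

If that prints an oom-kill entry naming the python3 process, the run exceeded host RAM rather than crashing inside the JAX/TPU code.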

@onegigbyte (Author)

Solved, sorry: it simply ran out of RAM and was killed by Colab.
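
If anyone hits the same limit, the most direct lever is the memory footprint of self-play. As a sketch (the flag names come from the command above; the smaller values are illustrative, not tuned), reducing the batch sizes and the number of self-plays per iteration should shrink the host-side buffers at the cost of slower data generation:

!time python3 train_agent.py \
    --game-class="caro_game.CaroGame" \
    --agent-class="resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=256 \
    --training-batch-size=256 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=25600 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./caro_agent_9x9_128.ckpt" \
    --num-iterations=100 \
    --lr-decay-steps=500000

Whether the self-play buffer actually lives in host RAM depends on the training script, so this is a guess based on the out-of-memory behavior, not a confirmed fix.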
