
Set size changed during iteration -- is this a problem #3

Closed

brianprichardson opened this issue Nov 6, 2017 · 12 comments

Comments

@brianprichardson

brianprichardson commented Nov 6, 2017

Ubuntu 16.04 LTS
Thanks,
BrianR (author of Tinker chess engine)

brianr@Tinker-Ubuntu:~/alphagozero$ python3 main.py
Using TensorFlow backend.
2017-11-06 16:53:14.674061: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-06 16:53:14.674512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 770 major: 3 minor: 0 memoryClockRate(GHz): 1.163
pciBusID: 0000:01:00.0
totalMemory: 1.94GiB freeMemory: 1.67GiB
2017-11-06 16:53:14.674554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 770, pci bus id: 0000:01:00.0, compute capability: 3.0)
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/brianr/.local/lib/python3.5/site-packages/tqdm/_tqdm.py", line 144, in run
    for instance in self.tqdm_cls._instances:
  File "/usr/lib/python3.5/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration

BTW it is running and says:
Evaluation model_2 vs model_1 (winrate:100%): 100%|██████████████████████████████████████████████████████| 10/10 [19:11<00:00, 115.13s/it]
We found a new best model : model_2!

@Narsil
Owner

Narsil commented Nov 7, 2017

Yes, I'm having the same problem and I'm not too sure what causes it. It's possible that tqdm (the progress bar) does not like it when a game finishes early (right now that only happens when both players pass).
It does not happen every game, though, and does not seem to affect the results (you can just remove the tqdm wrapper from the loop over moves in self_play to silence this error message).
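A minimal sketch of that workaround (the function signature and game API below are illustrative placeholders, not copied from this repo): iterate over the moves directly instead of wrapping the loop in tqdm, so the progress-bar bookkeeping no longer touches its weakref set while a game ends early.

```python
# Hypothetical sketch of disabling the per-move progress bar in self_play.
# Only the idea comes from the comment above; the game API shown is a placeholder.
from tqdm import tqdm

def self_play(game, max_moves, show_progress=False):
    moves = range(max_moves)
    # Wrapping the iterator with tqdm is what can trigger
    # "RuntimeError: Set changed size during iteration" when games end early;
    # a plain loop avoids it at the cost of losing the progress bar.
    iterator = tqdm(moves) if show_progress else moves
    for _ in iterator:
        if game.is_over():          # e.g. both players passed
            break
        game.play_next_move()
```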

@brianprichardson
Author

brianprichardson commented Nov 7, 2017

As it is not an AlphaGoZero issue, I don't mind ignoring it; I just wanted to make sure it was running OK in spite of the message. It ran for a few more cycles and then the 2GB on the 770 GPU was exceeded. It is odd that Keras asks for too much GPU memory while TF alone does not seem to. I can always run it on another PC I have with a 1070 with 8GB, but I'd have to install Linux on that box as well. Right now I'm using VMware Workstation (free), which does not support GPUs.

@Narsil
Owner

Narsil commented Nov 7, 2017

I'm having the same problem on my GTX 970. It's a far bigger issue, but I'm having trouble debugging it naively; the TF stack trace is way too large.

For me it seems to happen after 5 or so hours.
I start to see some NaNs before the actual OOM error, so I don't think debugging the dying stack trace will help.

I'm trying to figure out what is holding on to memory, but I'm not sure how to go about debugging GPU memory leaks.

Are you sure you have a 1070? Your stack trace says you have a 770.
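On the memory-leak hunting mentioned above, one way to narrow it down (an assumed approach, not something already in this repo) would be to log GPU memory between cycles and watch for a steady climb, e.g. with a small nvidia-smi helper:

```python
# Hypothetical leak-spotting helper: sample the GPU memory reported by
# nvidia-smi, e.g. once per self-play/evaluation cycle, and compare readings.
import subprocess

def gpu_memory_used_mib(gpu_index=0):
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
        "-i", str(gpu_index),
    ])
    return int(out.decode().strip())

# Example: print("GPU MiB used:", gpu_memory_used_mib()) before and after a cycle.
```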

@Narsil
Owner

Narsil commented Nov 7, 2017

Probably linked to tensorflow/tensorflow#10408

@brianprichardson
Author

Edited earlier comment. Two PCs: one with a 770 and another with a 1070.

@brianprichardson
Author

Update to my comment about TF not overflowing GPU memory: I had only been running LeelaZero in playing mode. In training mode with TF it also overflowed the 770's 2GB limit, so it is not just a Keras issue.

@Narsil
Owner

Narsil commented Nov 8, 2017

The link I provided seems to blame the session not being cleared: some placeholders are always created when a session is initiated (which Keras does when calling fit).

In my local version I added a K.clear_session() after each evaluation, and so far I have not had any OOM. Something may still be leaking, though, because I did end up with NaNs after a while.
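A minimal sketch of that workaround, assuming the model is reloaded from disk each cycle (the function and path names here are illustrative, not this repo's actual code):

```python
# Sketch: clear the Keras/TensorFlow session after each evaluation so the
# placeholders created by repeated fit()/predict() calls don't accumulate on the GPU.
from keras import backend as K
from keras.models import load_model

def evaluation_cycle(model_path, n_cycles):
    for _ in range(n_cycles):
        model = load_model(model_path)  # rebuild the model in a fresh graph
        # ... self-play / evaluation / model.fit(...) would go here ...
        del model
        K.clear_session()               # drop the old graph and free GPU memory
```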

@Narsil
Owner

Narsil commented Nov 9, 2017

The new version seems to fix it for me. Does it fix yours?

@brianprichardson
Author

Trying the latest version. It took about 5-6 hours before it ran out of memory last time, so I'll let you know.

@brianprichardson
Author

It has been running for about 6 hours and is up to model 17. So far none have been better than model 1. Other than that, things seem more stable (no OOM for RAM or GPU).

@Narsil
Owner

Narsil commented Nov 10, 2017

Ok, so that's one problem fixed.

Did you activate SHOW_END_GAME? You can look at the end value for each model and check for anything suspicious (e.g. a high value while the model is being beaten badly).

I changed the value from {1, 0} to {+1, -1} in the last version you have. For me it never stays stuck on model_1, but it does seem to reach a plateau after a few models (something like 6-10), and looking at the end games it's still not playing quite properly, though nothing is really suspicious (values are sometimes off, but for both models).

I have also added more self-play games to my current configuration. Because the algorithm only learns from these self-play games, it's unlikely to progress very far if there aren't enough of them. The DeepMind parameters point the same way: 25k self-play games per iteration in the paper, while the conf here has 20.

I moved to 50 and it's not as stuck anymore, but I feel it would need more like 250 or more to avoid getting stuck. It's a costly check, so for now I'm focusing on adding the missing features of the algorithm.
I've got a feeling we could add the evaluation games to the training set in order to reduce this "plateau" effect almost for free.

Also, I'm currently at 0.4 s/move but with 16 MCTS simulations instead of 1600, so there are two orders of magnitude to work on. Batched exploration could give 8x, and I'm hoping quantization could give another improvement. With that, it would be easier to increase the number of self-play games.
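For reference, a hypothetical sketch of the knobs discussed above (the constant names are placeholders; the real config module in this repo may name them differently):

```python
# Illustrative configuration values only, mirroring the discussion above.

# End-of-game value for the winner/loser (changed from {1, 0} to {+1, -1}).
WIN_VALUE = +1
LOSS_VALUE = -1

# Self-play games generated per iteration.
# DeepMind used 25,000; the conf here had 20, bumped to 50 in this experiment.
SELF_PLAY_GAMES_PER_ITERATION = 50

# MCTS simulations per move (DeepMind used 1,600; currently 16 here).
MCTS_SIMULATIONS = 16
```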

@Narsil
Owner

Narsil commented Nov 10, 2017

Let's close this.
