
Set size changed during iteration -- is this a problem #3

Closed

brianprichardson opened this issue Nov 6, 2017 · 12 comments

Comments

@brianprichardson

brianprichardson commented Nov 6, 2017

Ubuntu 16.04 LTS
Thanks,
BrianR (author of Tinker chess engine)

brianr@Tinker-Ubuntu:~/alphagozero$ python3 main.py
Using TensorFlow backend.
2017-11-06 16:53:14.674061: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-06 16:53:14.674512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 770 major: 3 minor: 0 memoryClockRate(GHz): 1.163
pciBusID: 0000:01:00.0
totalMemory: 1.94GiB freeMemory: 1.67GiB
2017-11-06 16:53:14.674554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 770, pci bus id: 0000:01:00.0, compute capability: 3.0)
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/brianr/.local/lib/python3.5/site-packages/tqdm/_tqdm.py", line 144, in run
    for instance in self.tqdm_cls._instances:
  File "/usr/lib/python3.5/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration

BTW it is running and says:
Evaluation model_2 vs model_1 (winrate:100%): 100%|██████████████████████████████████████████████████████| 10/10 [19:11<00:00, 115.13s/it]
We found a new best model : model_2!

@Narsil
Owner

Narsil commented Nov 7, 2017

Yes, I'm having the same problem and I'm not too sure what causes it. It's possible that tqdm (the progress bar) does not like it when a game finishes early (right now that only happens when both players pass).
It does not happen every game, though, and does not seem to affect the results (you can just remove the tqdm wrapper from the loop over moves in self_play to silence this error message).
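A minimal sketch of that workaround (the function signature and game API below are illustrative placeholders, not copied from this repo): iterate over the moves directly instead of wrapping the loop in tqdm, so the progress-bar bookkeeping no longer touches its weakref set while a game ends early.

```python
# Hypothetical sketch of disabling the per-move progress bar in self_play.
# Only the idea comes from the comment above; the game API shown is a placeholder.
from tqdm import tqdm

def self_play(game, max_moves, show_progress=False):
    moves = range(max_moves)
    # Wrapping the iterator with tqdm is what can trigger
    # "RuntimeError: Set changed size during iteration" when games end early;
    # a plain loop avoids it at the cost of losing the progress bar.
    iterator = tqdm(moves) if show_progress else moves
    for _ in iterator:
        if game.is_over():          # e.g. both players passed
            break
        game.play_next_move()
```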

@brianprichardson
Author

brianprichardson commented Nov 7, 2017

As it is not an AlphaGoZero issue, I don't mind ignoring it; I just wanted to make sure it was running OK in spite of the message. It ran for a few more cycles and then the 2GB on the 770 GPU was exceeded. It is odd that Keras asks for too much GPU memory while TF alone does not seem to. I can always run it on another PC I have with a 1070 with 8GB, but I'd have to install Linux on that box as well. Right now I'm using VMware Workstation (free), which does not support GPUs.

@Narsil
Owner

Narsil commented Nov 7, 2017

I'm having the same problem on my GTX 970. It's a far bigger issue, but I'm having trouble debugging it naively; the TF stack trace is way too large.

For me it seems to happen after 5 or so hours.
I start to see some NaNs before the actual OOM error, so I don't think debugging the dying stack trace will help.

I'm trying to figure out what is holding on to memory, but I'm not sure how to go about debugging GPU memory leaks.

Are you sure you have a 1070? Your stack trace says you have a 770.
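On the memory-leak hunting mentioned above, one way to narrow it down (an assumed approach, not something already in this repo) would be to log GPU memory between cycles and watch for a steady climb, e.g. with a small nvidia-smi helper:

```python
# Hypothetical leak-spotting helper: sample the GPU memory reported by
# nvidia-smi, e.g. once per self-play/evaluation cycle, and compare readings.
import subprocess

def gpu_memory_used_mib(gpu_index=0):
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
        "-i", str(gpu_index),
    ])
    return int(out.decode().strip())

# Example: print("GPU MiB used:", gpu_memory_used_mib()) before and after a cycle.
```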

@Narsil
Owner

Narsil commented Nov 7, 2017

Probably linked to tensorflow/tensorflow#10408

@brianprichardson
Author

Edited earlier comment. Two PCs: one with a 770 and another with a 1070.

@brianprichardson
Author

Update to my comment about TF not overflowing GPU memory: I had only been running LeelaZero in playing mode. In training mode with TF it also overflowed the 770's 2GB limit, so it is not just a Keras issue.

@Narsil
Owner

Narsil commented Nov 8, 2017

The link I provided seems to blame the session not being cleared: some placeholders are always created when a session is initiated (which Keras does when calling fit).

In my local version I added a K.clear_session() after each evaluation, and so far I have not had any OOM. Something may still be leaking, though, because I did end up with NaNs after a while.
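A minimal sketch of that workaround, assuming the model is reloaded from disk each cycle (the function and path names here are illustrative, not this repo's actual code):

```python
# Sketch: clear the Keras/TensorFlow session after each evaluation so the
# placeholders created by repeated fit()/predict() calls don't accumulate on the GPU.
from keras import backend as K
from keras.models import load_model

def evaluation_cycle(model_path, n_cycles):
    for _ in range(n_cycles):
        model = load_model(model_path)  # rebuild the model in a fresh graph
        # ... self-play / evaluation / model.fit(...) would go here ...
        del model
        K.clear_session()               # drop the old graph and free GPU memory
```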

@Narsil
Owner

Narsil commented Nov 9, 2017

The new version seems to fix it for me. Does it fix yours?

@brianprichardson
Author

Trying the latest version. It took about 5-6 hours before it ran out of memory last time, so I'll let you know.

@brianprichardson
Author

It has been running for about 6 hours and is up to model 17. So far none have been better than model 1. Other than that, things seem more stable (no OOM for RAM or GPU).

@Narsil
Owner

Narsil commented Nov 10, 2017

Ok, so that's one problem fixed.

Did you activate SHOW_END_GAME? You can look at the end value for each model and check for anything suspicious (e.g. a high value while the model is being beaten badly).

I changed the value from {1, 0} to {+1, -1} in the last version you have. For me it never stays stuck on model_1, but it does seem to reach a plateau after a few models (something like 6-10), and looking at the end games it's still not playing quite properly, though nothing is really suspicious (values are sometimes off, but for both models).

I have also added more self-play games to my current configuration. Because the algorithm only learns from these self-play games, it's unlikely to progress very far if there aren't enough of them. The DeepMind parameters point the same way: 25k self-play games per iteration in the paper, while the conf here has 20.

I moved to 50 and it's not as stuck anymore, but I feel it would need more like 250 or more to avoid getting stuck. It's a costly check, so for now I'm focusing on adding the missing features of the algorithm.
I've got a feeling we could add the evaluation games to the training set in order to reduce this "plateau" effect almost for free.

Also, I'm currently at 0.4 s/move but with 16 MCTS simulations instead of 1600, so there are two orders of magnitude to work on. Batched exploration could give 8x, and I'm hoping quantization could give another improvement. With that, it would be easier to increase the number of self-play games.
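For reference, a hypothetical sketch of the knobs discussed above (the constant names are placeholders; the real config module in this repo may name them differently):

```python
# Illustrative configuration values only, mirroring the discussion above.

# End-of-game value for the winner/loser (changed from {1, 0} to {+1, -1}).
WIN_VALUE = +1
LOSS_VALUE = -1

# Self-play games generated per iteration.
# DeepMind used 25,000; the conf here had 20, bumped to 50 in this experiment.
SELF_PLAY_GAMES_PER_ITERATION = 50

# MCTS simulations per move (DeepMind used 1,600; currently 16 here).
MCTS_SIMULATIONS = 16
```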

@Narsil
Owner

Narsil commented Nov 10, 2017

Let's close this.
