Set size changed during iteration -- is this a problem? #3
Comments
Yes, I'm having the same problem. I'm not sure what causes this; it's possible that tqdm (the progress bar) does not like it when a game finishes early (right now that only happens when both players pass).
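For reference, the underlying Python behavior is easy to reproduce outside tqdm. This is only an illustrative sketch, not code from the repo: mutating a set while iterating over it raises exactly the RuntimeError shown in the traceback, and iterating over a snapshot avoids it (which is essentially how later tqdm versions guard their internal `_instances` set).

```python
# Minimal reproduction of "Set changed size during iteration":
# CPython raises RuntimeError if a set shrinks or grows while a
# for-loop is iterating over it. tqdm hit this because its internal
# set of progress-bar instances was mutated from another thread.

def mutate_during_iteration(items):
    """Discard elements mid-iteration; raises RuntimeError."""
    for item in items:
        items.discard(item)  # shrinks the set while iterating

def mutate_over_copy(items):
    """Safe variant: iterate over a snapshot, mutate the original."""
    for item in list(items):  # copy first, then mutate freely
        items.discard(item)
```

The fix pattern (`for x in list(s)`) costs one shallow copy per loop but makes the iteration immune to concurrent mutation.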
As it is not an AlphaGoZero issue, I don't mind ignoring it. I just wanted to make sure it seemed to be running OK in spite of the message. It ran for a few more cycles and then exceeded the 2 GB on the 770 GPU. It is odd that Keras asks for too much GPU memory while TF does not seem to. I can always run it on another PC I have with a 1070 and 8 GB, but I'd have to install Linux on that box as well. Right now I'm using VMware Workstation (free), which does not support GPUs.
I'm having the same problem on my GTX 970. The memory issue is a far greater one, and I'm having trouble debugging it; the TF stack trace is way too large. For me it seems to happen after 5 or so hours. I'm trying to figure out what is holding on to memory, but I'm not sure how to go about debugging GPU memory leaks. Are you sure you have a 1070? Your stack trace says you have a 770?
Probably linked to tensorflow/tensorflow#10408
Edited earlier comment. Two PCs: one with a 770 and another with a 1070.
Update to my comment about TF not overflowing GPU memory: I was only running LeelaZero in playing mode. It also overflowed the 770's 2 GB limit in training mode with TF, so it is not just a Keras issue.
The link I provided seems to blame clear_session. Some placeholders are always created when initiating a session (which Keras does when calling fit). In my local version I added a K.clear_session() after each evaluation, and so far I have not had any OOM. It might still be overflowing, though, because I did end up with NaNs after a while.
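The workaround described above can be sketched as follows. This is only an illustrative sketch, not the repo's actual loop: `evaluate_models` and `num_rounds` are hypothetical stand-ins, and the cleanup hook is passed in so the pattern is shown without requiring TensorFlow to be installed. In a real Keras setup the hook would be `keras.backend.clear_session`, which drops the accumulated default graph so placeholders created by each `fit`/`predict` call do not pile up.

```python
# Sketch of the fix: run each evaluation round, then clear the
# backend session so stale graph state is freed before the next
# round. `evaluate_models` is a hypothetical stand-in; in Keras the
# cleanup hook would be keras.backend.clear_session.

def run_evaluations(evaluate_models, clear_session, num_rounds):
    """Run evaluation rounds, freeing backend state between them."""
    results = []
    for _ in range(num_rounds):
        results.append(evaluate_models())
        clear_session()  # e.g. K.clear_session() after each round
    return results
```

One caveat the thread itself hints at: clearing the session also discards any model still living in that graph, so models must be rebuilt or reloaded from weights after each call.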
New version seems to fix it for me. Does it fix yours?
Trying the latest version. It took about 5-6 hours before it ran out of memory last time, so I will let you know.
It has been running about 6 hours and is up to model 17. So far none have been better than model 1. Other than that, things seem more stable (no OOM for RAM or GPU).
Ok, so that's one problem fixed. Did you activate SHOW_END_GAME? You can look at the end value for each model and check for something suspicious (e.g. the value staying high while the model was being beaten badly). I changed the value range from {1, 0} to {+1, -1} in the last version you have. For me it never stays stuck on model 1, but it does seem to reach a plateau after a few models (something like 6-10), and looking at end games it's still not playing quite properly, though nothing is really suspicious (values are sometimes off, but for both models).

I have also added more self-play games in my current configuration. Because the algorithm only learns from these self-play games, if there are not enough of them it's unlikely to progress very far. That also shows against DeepMind's parameters: I moved to 50 games and it's not as stuck anymore, but I feel it would need more like 250 or more to avoid getting stuck. It's a costly check, so for now I'm focusing on adding the missing features of the algorithm.

I'm also currently at 0.4 s/move, but with 16 MCTS simulations instead of 1600, so there are two orders of magnitude to work on. Batched exploration could get an 8x speedup, and I'm hoping quantization could get another improvement. With those, I think it would be easier to increase the number of self-play games.
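The batched-exploration idea mentioned above can be sketched as follows. This is only an illustrative sketch, not code from the repo: `evaluate_batch` is a hypothetical stand-in for the (policy, value) network forward pass. The point is that calling the network once per chunk of leaf positions, instead of once per position, amortizes the per-call GPU overhead, which is where the hoped-for ~8x comes from.

```python
# Hypothetical sketch of batching MCTS leaf evaluations: collect leaf
# positions and run them through the network in chunks, rather than
# one forward pass per leaf. `evaluate_batch` stands in for the real
# network call and must return one result per position in its input.

def evaluate_leaves(leaves, evaluate_batch, batch_size=8):
    """Evaluate leaf positions in chunks of `batch_size`."""
    results = []
    for i in range(0, len(leaves), batch_size):
        chunk = leaves[i:i + batch_size]
        results.extend(evaluate_batch(chunk))  # one network call per chunk
    return results
```

With `batch_size=8`, ten pending leaves cost two network calls instead of ten; the trade-off is that the search must accumulate leaves (e.g. via virtual loss) before each call.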
Let's close this.
Ubuntu 16.04 LTS
Thanks,
BrianR (author of Tinker chess engine)
brianr@Tinker-Ubuntu:~/alphagozero$ python3 main.py
Using TensorFlow backend.
2017-11-06 16:53:14.674061: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-06 16:53:14.674512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 770 major: 3 minor: 0 memoryClockRate(GHz): 1.163
pciBusID: 0000:01:00.0
totalMemory: 1.94GiB freeMemory: 1.67GiB
2017-11-06 16:53:14.674554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 770, pci bus id: 0000:01:00.0, compute capability: 3.0)
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/brianr/.local/lib/python3.5/site-packages/tqdm/_tqdm.py", line 144, in run
    for instance in self.tqdm_cls._instances:
  File "/usr/lib/python3.5/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration
BTW, it is running and says:
Evaluation model_2 vs model_1 (winrate:100%): 100%|██████████████████████████████████████████████████████| 10/10 [19:11<00:00, 115.13s/it]
We found a new best model: model_2!