add ThreadedEvaluator and DistributedEvaluator #96
Conversation
I have added a `neat.threaded.ThreadedEvaluator` for evaluating genomes in threads. This is useful when using a Python implementation without a GIL. The `ThreadedEvaluator` is based on the `ParallelEvaluator`.
`neat.threaded.ThreadedEvaluator` will now start its workers automatically.
I have added a test for neat.threaded.ThreadedEvaluator based on the test for neat.parallel.ParallelEvaluator.
I removed the test checking whether `neat.threaded.ThreadedEvaluator.__del__` stops the threads, because `__del__` is not always called and the test could therefore yield false results.
I have changed `.travis.yml` to use `pypy3.5-5.8.0` instead of `pypy3`. Travis uses an outdated version of `pypy3`; `pypy3.5-5.8.0` contains some fixes for multithreaded scripts, which *may* fix the bug in the Travis CI build for `neat.threaded.ThreadedEvaluator`.
I added the first version of the DistributedEvaluator, an evaluator for evaluating genomes across multiple machines. While the tests (I will commit them later) seem to work pretty well, further testing is needed. The tests only use one machine, so I have to wait until I am able to use a cluster (or just use VMs).
I have added some tests for neat.distributed. These tests are not perfect because they run on only one machine. However, I doubt that we can change this on Travis CI. Unfortunately, my cloned neat-python repo now contains too many checkpoints, so I have to commit all the changes using the GitHub web interface :( . Well, why am I even writing this here? I doubt anyone will actually read these commit messages. (quick note: the perfect diary: hidden in front of everyone).
The authkeys used in the tests are now explicitly binary.
Heh. Due to the browser I mostly use displaying the full commit messages instead of having to click on the '...', I actually did see that part of the commit message... -Allen
Even the most recent version of pypy3 is rather slow on the parallel evaluator (dunno yet on threaded), according to profiling (on OS X 10.12), BTW - lots of time spent in mutexes (or equivalent - waiting for thread lock) that didn't happen with other Python versions (including 2.7 pypy).
Now that I think about it, the parallel evaluator is supposed to be using subprocesses, not threads... I'm thinking that pypy3 may have problems with parallel/threaded execution because it's probably using a separate thread to do its JIT compilation. (To be fair, it's also clearly labeled as a beta...) Reducing the number of parallel subprocesses to 2 (from 4) in the test did not significantly affect this. (This is running on a machine with 1 processor, 2 cores, incidentally - 2011 Mac mini.)
@drallensmith I think you are right. But I think the issue with the JIT and subprocesses is that each new subprocess spawned by
lynx - text-only, no javascript (I use Firefox for doing things like typing this comment)... but also low-memory and fast. NEAT-Tetris? Interesting! Probably after some testing of possible enhancements on less-complex systems (lander, perhaps?), I've been looking at doing some experiments with LARG/HFO. BTW, I should add that my profiling was using the test suite (since I was looking at why pypy3 was having problems - good spotting of the older version in use on Travis, BTW!), which probably doesn't run long enough for compilation to help much. I did put together a variation of the test suite meant for profiling, but have been working more on other things (particularly since LARG/HFO is mostly C++)... about all I did was trim it down to just the tests actually doing runs, then up the generation count and population size (and adjust the fitness function termination criteria so they wouldn't happen).
@drallensmith I knew it! I actually thought that you might be using lynx (I use it sometimes too), but it seemed too unlikely.
The `DistributedEvaluator` will now shut down its manager when `stop()` is called.
Thank you thank you to both of you for the work on this! Apologies for not having the time to look through it yet, hoping to change that soon. :)
This looks so thorough that I figured it's just best to merge and let everybody try it out. :)
I have added the `ThreadedEvaluator` class and the `DistributedEvaluator` class for a more flexible usage of computational resources during evaluation.

ThreadedEvaluator: A class inspired by the `ParallelEvaluator`, but one that uses threads for evaluating genomes. This is useful when using a Python implementation without a GIL (e.g. Jython and pypy-stm).

DistributedEvaluator: An evaluator for the evaluation of genomes across multiple machines. This class is also inspired by the `ParallelEvaluator`. However, its overhead is even bigger than that of the `ParallelEvaluator`, which means it is only useful when the evaluation function requires heavy calculations.

Both the `ThreadedEvaluator` and the `DistributedEvaluator` are implemented using only standard-library modules.

Usage of the `ThreadedEvaluator`:

The usage of the `ThreadedEvaluator` is much like the usage of the `ParallelEvaluator`. However, the `ThreadedEvaluator` does not support the `timeout` argument. Also, the worker threads are not stopped automatically when the instance is deleted. They can still be stopped using the `stop()` method of the `ThreadedEvaluator`, and they will stop automatically once the other non-daemonic threads are done.
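For illustration, a minimal sketch (assuming the constructor mirrors `ParallelEvaluator(num_workers, eval_function)` without `timeout`; the fitness function body and config path are placeholders):

```python
import neat
from neat.threaded import ThreadedEvaluator

def eval_genome(genome, config):
    # Placeholder fitness function: build the network and score it on the task.
    net = neat.nn.FeedForwardNetwork.create(genome, config)
    return 1.0  # replace with a real evaluation

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     "config-feedforward")  # hypothetical config file
pop = neat.Population(config)

evaluator = ThreadedEvaluator(4, eval_genome)  # 4 worker threads, started automatically
winner = pop.run(evaluator.evaluate, 300)
evaluator.stop()  # __del__ will not stop the workers, so stop them explicitly
```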
Usage of the `DistributedEvaluator`:

The usage of the `DistributedEvaluator` is very simple. Both the master node (= the computer mutating genomes) and the slave nodes (= the computers evaluating genomes) can run the exact same script. Please note that the master node will not try to evaluate any genomes by itself. At least one slave node is required, but you can launch a slave node on the same physical node as the master node. Please keep in mind that you will have to force the slave node into slave mode in this case. The `examples/xor/evolve-feedforward-distributed.py` example has a `--force-slave` argument for this case.

1. Create a `neat.DistributedEvaluator` using the following arguments:
   - `addr`: a tuple of `(hostname/ip, port)` pointing to the master node.
   - `authkey`: a password used for authentication to the master node.
   - `eval_function`: the function for evaluating a single genome.
   - `slave_chunksize=1`: the number of genomes which will be sent to a slave at once. When a slave node is using multiple worker processes, this number should be at least equal to the number of worker processes. Higher values may be more efficient with regard to communication overhead. Default: 1.
   - `num_workers=1`: when this value is greater than 1 and this node is in slave mode, use this many worker processes for the evaluation. Otherwise, evaluate in the thread which called `DistributedEvaluator.start()` (most likely the main thread). Default: 1.
   - `worker_timeout`: when this node is in slave mode, wait at most this many seconds for the result of the worker processes. Default: 60.
   - `mode=neat.distributed.MODE_AUTO`: the mode in which this node should operate (one of `MODE_MASTER`, `MODE_SLAVE`, `MODE_AUTO`, as defined in `neat.distributed`). If the value is `MODE_AUTO` (the default), check whether the `addr` argument points to the localhost; if it does, set the mode to `MODE_MASTER`, otherwise to `MODE_SLAVE`. The other two values force this node into the corresponding mode.
2. Call the `start()` method of the `DistributedEvaluator` instance. This method blocks on the slave nodes, but returns on the master node. By default, the slave nodes will exit at the end of this call when the work is done. The arguments are all optional:
   - `exit_on_stop=True`: if in slave mode, call `sys.exit(0)` once the work is done.
   - `slave_wait=0`: if in slave mode, wait this many seconds before connecting. This is useful if the master node may take some time to start.
3. Call the `run` method of the population, using the `evaluate` method/attribute of the `DistributedEvaluator` instance.
4. Call the `stop` method of the `DistributedEvaluator` instance.

If you are using multiple worker processes, all calls starting from step 2 should be done inside an `if __name__ == "__main__"` block, as in the sketch below.
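For illustration, a condensed sketch of the four steps above (the address, authkey, and fitness function body are placeholders; argument names follow the list above):

```python
import neat
from neat.distributed import DistributedEvaluator, MODE_AUTO

def eval_genome(genome, config):
    # Placeholder: heavy per-genome computation goes here.
    return 0.0

def run(config_path):
    config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                         neat.DefaultSpeciesSet, neat.DefaultStagnation,
                         config_path)
    # Step 1: create the evaluator; master and slaves run this same code.
    de = DistributedEvaluator(
        ("192.0.2.1", 8022),      # addr: (hostname/ip, port) of the master node (placeholder)
        b"insert-password-here",  # authkey: explicitly binary (see commit note above)
        eval_genome,
        slave_chunksize=4,        # at least num_workers genomes per request
        num_workers=4,            # worker processes used in slave mode
        mode=MODE_AUTO,           # master if addr is the local host, slave otherwise
    )
    # Step 2: blocks on slaves (and exits them when the work is done);
    # returns on the master.
    de.start()
    # Only the master node reaches this point.
    pop = neat.Population(config)
    winner = pop.run(de.evaluate, 300)  # step 3
    de.stop()                           # step 4

if __name__ == "__main__":
    # Required guard because num_workers > 1 spawns worker processes.
    run("config-feedforward")  # hypothetical config file
```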
This PR also contains tests and xor examples for the `ThreadedEvaluator` and the `DistributedEvaluator`. The `DistributedEvaluator` example requires command line arguments to specify the address of the master node. It is not the simplest possible example for the `DistributedEvaluator`, but it tries to use most of the arguments in order to explain what they do.

important: This PR changes the `travis.yml` config to use `pypy3.5-5.8.0` instead of `pypy3`. Travis apparently uses an outdated `pypy3` version, which contains a few bugs.

This PR replaces PR #95. However, some changes (like the changes to the docstrings of `ParallelEvaluator.__init__`) have been removed by a history rewrite.

Edit: I added a note to this PR stating that the master node does not try to evaluate any genomes by itself.