
Conversation

@bennr01 (Contributor) commented Jul 15, 2017

I have added the ThreadedEvaluator class and the DistributedEvaluator class to allow a more flexible use of computational resources during evaluation.

ThreadedEvaluator: A class inspired by the ParallelEvaluator, but one that uses threads for evaluating genomes. This is useful when using a Python implementation without a GIL (e.g. Jython or pypy-stm).
DistributedEvaluator: An evaluator for evaluating genomes across multiple compute nodes. This class is also inspired by the ParallelEvaluator. However, its overhead is even larger than that of the ParallelEvaluator, so it is only useful when the evaluation function requires heavy computation.
Both the ThreadedEvaluator and the DistributedEvaluator are implemented using only standard-library modules.
Usage of the ThreadedEvaluator:
The usage of the ThreadedEvaluator is much like that of the ParallelEvaluator, with two differences: the ThreadedEvaluator does not support the timeout argument, and the worker threads are not stopped automatically when the instance is deleted. The threads can still be stopped using the stop() method of the ThreadedEvaluator, and they will stop automatically once all other non-daemonic threads are done.
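Here is a minimal sketch of this usage (the eval_genome body and the "config-feedforward" filename are placeholders, not part of this PR):

```python
import neat
from neat.threaded import ThreadedEvaluator

def eval_genome(genome, config):
    net = neat.nn.FeedForwardNetwork.create(genome, config)
    # ... run the network on the task and return the resulting fitness ...
    return 0.0  # placeholder fitness

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     "config-feedforward")
pop = neat.Population(config)

te = ThreadedEvaluator(4, eval_genome)  # 4 worker threads, started automatically
winner = pop.run(te.evaluate, 300)      # used like ParallelEvaluator.evaluate
te.stop()                               # the threads are NOT stopped on deletion
```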
Usage of the DistributedEvaluator:
The usage of the DistributedEvaluator is very simple. Both the master node (the computer mutating the genomes) and the slave nodes (the computers evaluating the genomes) can run the exact same script.
Please note that the master node will not try to evaluate any genomes by itself. At least one slave node is required, but you can launch a slave node on the same physical machine as the master node. Please keep in mind that in this case you will have to force that node into slave mode. The examples/xor/evolve-feedforward-distributed.py script has a --force-slave argument for this purpose.

  1. define evaluation logic, load config and create population as you would normally do
  2. create an instance of neat.DistributedEvaluator using the following arguments:
    • addr is a tuple of (hostname/ip, port) pointing to the master node.
    • authkey is a password used for authentication with the master node.
    • eval_function is the function for evaluating a single genome.
    • slave_chunksize=1 defines the number of genomes that will be sent to a slave at once. When a slave node uses multiple worker processes, this number should be at least equal to the number of worker processes. Higher values may reduce the communication overhead. Default: 1.
    • num_workers=1: When this value is greater than 1 and this node is in slave mode, use this many worker processes for the evaluation. Otherwise, evaluate in the thread which called DistributedEvaluator.start() (most likely the main thread). Default: 1.
    • worker_timeout: When this node is in slave mode, wait at most this many seconds for the results of the worker processes. Default: 60.
    • mode=neat.distributed.MODE_AUTO: In which mode this node should operate (one of MODE_MASTER, MODE_SLAVE, MODE_AUTO (the default), as defined in neat.distributed). If the value is MODE_AUTO (the default), check if the addr argument points to the localhost. If it does, set the mode to MODE_MASTER, otherwise to MODE_SLAVE. The other two values force this node into the corresponding mode.
  3. call the start() method of the DistributedEvaluator instance. This call blocks on the slave nodes, but returns on the master node. By default, the slave nodes will exit at the end of this call when the work is done. The arguments are all optional:
    • exit_on_stop=True: if in slave mode, call sys.exit(0) once the work is done.
    • slave_wait=0: If in slave mode, wait this many seconds before connecting. This is useful if the master node may take some time to start.
  4. call the run method of the population, using the evaluate method/attribute of the DistributedEvaluator instance.
  5. call the stop method of the DistributedEvaluator instance.
  6. proceed normally (print statistics, show graphs, ...)
    If you are using multiple worker processes, all calls starting from step 2 should be placed inside an if __name__ == "__main__": block (see the sketch below).
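Here is a minimal sketch of steps 1-6 (the eval_genome body, the address, the authkey and the "config-feedforward" filename are placeholders; the same script is run on the master and on every slave):

```python
import neat
from neat.distributed import MODE_AUTO

def eval_genome(genome, config):
    # ... heavy per-genome computation returning a fitness ...
    return 0.0  # placeholder fitness

if __name__ == "__main__":  # required when using multiple worker processes
    config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                         neat.DefaultSpeciesSet, neat.DefaultStagnation,
                         "config-feedforward")
    pop = neat.Population(config)

    de = neat.DistributedEvaluator(
        ("localhost", 8022),  # addr of the master node (placeholder)
        b"insecure-authkey",  # authkey; note that it is binary
        eval_genome,
        slave_chunksize=4,    # send 4 genomes to a slave at once
        num_workers=4,        # worker processes per slave node
        worker_timeout=60,    # seconds to wait for worker results
        mode=MODE_AUTO,       # master if addr is the localhost, slave otherwise
    )
    de.start(exit_on_stop=True)         # blocks on slaves, returns on the master
    winner = pop.run(de.evaluate, 300)  # only reached on the master node
    de.stop()
    print(winner)
```

Note that with MODE_AUTO and an addr pointing to the localhost, a slave launched on the same machine as the master would also detect itself as the master, which is exactly why the example script provides the --force-slave switch mentioned above.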

This PR also contains tests and xor examples for the ThreadedEvaluator and the DistributedEvaluator. The DistributedEvaluator example requires command-line arguments to specify the address of the master node. It is not the simplest possible example for the DistributedEvaluator, but it tries to use most of the arguments in order to explain what they do.
Important: This PR changes the travis.yml config to use pypy3.5-5.8.0 instead of pypy3. Travis apparently uses an outdated pypy3 version, which contains a few bugs.

This PR replaces PR #95. However, some changes (like the changes to the docstrings of ParallelEvaluator.__init__) have been removed using a history rewrite.

Edit: I added a note to this PR stating that the master node does not try to evaluate any genomes by itself.

bennr01 and others added 13 commits July 8, 2017 19:28
I have added a 'neat.threaded.ThreadedEvaluator' for evaluating genomes
in threads. This is useful when using a Python implementation without a
GIL.
The ThreadedEvaluator is based on the ParallelEvaluator.
neat.threaded.ThreadedEvaluator will now start its workers automatically
I have added a test for neat.threaded.ThreadedEvaluator based on the
test for neat.parallel.ParallelEvaluator.
I removed the test checking whether
'neat.threaded.ThreadedEvaluator.__del__' stops the threads. This is
because __del__ is not always called and may thus produce false test
results.
I have changed travis.yml to use `pypy3.5-5.8.0` instead of `pypy3`.
Travis uses an outdated version of `pypy3`.
`pypy3.5-5.8.0` contains some fixes for multithreaded scripts, which *may* fix the bug in the travis-ci build for `neat.threaded.ThreadedEvaluator`.
I added the first version of the DistributedEvaluator, an evaluator for evaluating genomes across multiple machines.
While the tests (I will commit them later) seem to work pretty well, further tests are needed.
The tests only use one machine, so I have to wait until I am able to use a cluster (or just use VMs).
I have added some tests for neat.distributed.
These tests are not perfect because they run on only one machine. However, I doubt that we can change this on Travis-CI.
Unfortunately, my neat-python clone now contains too many checkpoints, so I have to commit all the changes using the GitHub web interface :( .
Well, why am I even writing this here? I doubt anyone will actually read these commit messages. (Quick note: the perfect diary: hidden in front of everyone.)
The authkeys used in the tests are now explicitly binary.
@coveralls commented Jul 15, 2017

Coverage increased (+0.7%) to 93.259% when pulling 5eea101 on bennr01:pr_prepare into 765c5b5 on CodeReclaimers:master.

@drallensmith (Contributor)

Heh. Because the browser I mostly use displays the full commit messages instead of hiding them behind the '...', I actually did see that part of the commit message...

-Allen

@drallensmith (Contributor) commented Jul 15, 2017

BTW, even the most recent version of pypy3 is rather slow on the parallel evaluator (I don't know yet about threaded), according to profiling (on OS X 10.12) - lots of time is spent in mutexes (or the equivalent - waiting for thread locks), which didn't happen with other Python versions (including pypy 2.7).

@drallensmith (Contributor)

Now that I think about it, the parallel evaluator is supposed to be using subprocesses, not threads... I'm thinking that pypy3 may have problems with parallel/threaded execution because it's probably using a separate thread to do its JIT compilation. (To be fair, it's also clearly labeled as a beta...) Reducing the number of parallel subprocesses to 2 (from 4) in the test did not significantly affect this. (This is running on a machine with 1 processor, 2 cores, incidentally - 2011 Mac mini.)

@bennr01 (Contributor, Author) commented Jul 15, 2017

@drallensmith I think you are right. But I think the issue with the JIT and subprocesses is that each new subprocess spawned by multiprocessing launches its own instance of pypy with its own JIT. And because a JIT relies on predicting the next operations (I may be wrong about this, but I think I read somewhere that a JIT learns that a call on a variable always resolves to a specific method and can thus skip some of the abstract logic in between), the calls in a worker cannot be predicted well, because they are initiated by the parent process.
My experience using neat-python, pypy2 and cpython2.7 was:

  • single-process + pypy: (after a short warm-up) way faster than cpython (sometimes even faster than the parallel evaluator on cpython)
  • multi-process + pypy: slower than single-process + pypy
  • single-process + cpython: slow
  • multi-process + cpython: faster than single-process + cpython

However, this is probably very case dependent. An evaluation function which requires a minute of calculation time for a single genome on cpython may well be faster with pypy + multiple processes than with pypy + a single process.
Of course, the ThreadedEvaluator will currently be way slower than the ParallelEvaluator in most cases, but this will hopefully change when pypy-stm becomes stable. I think the pypy team estimated that pypy-stm would be about 25% slower than pypy when using a single thread, which means that just using two threads would already make the whole script faster. This could mean that threaded evaluation on pypy-stm may be way faster than serial evaluation on pypy and cpython, and also faster than pypy multi-process evaluation.
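A quick back-of-the-envelope check of that estimate (the 25% figure is from memory, so the numbers are only illustrative):

```python
# If plain pypy needs a normalized time T = 1.0 for a workload,
# pypy-stm at ~25% overhead needs 1.25 on a single thread.
T = 1.0
stm_one_thread = 1.25 * T               # pypy-stm, one thread
stm_two_threads = stm_one_thread / 2.0  # pypy-stm, two threads, ideal scaling
print(stm_one_thread, stm_two_threads)  # 1.25 0.625 -> faster than plain pypy
```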
Unfortunately, I can't say much about the performance of the DistributedEvaluator. I used the xor example for testing, but the overhead was way too large compared to the time required for evaluation.
Maybe I will make some benchmarks once I have finally fixed my neat-tetris.
By the way, which browser are you using?

@drallensmith (Contributor) commented Jul 15, 2017

lynx - text-only, no javascript (I use Firefox for doing things like typing this comment)... but also low-memory and fast.

NEAT-Tetris? Interesting! I've been looking at doing some experiments with LARG/HFO, though probably after some testing of possible enhancements on less-complex systems (lander, perhaps?).

BTW, I should add that my profiling was using the test suite (since I was looking at why pypy3 was having problems - good spotting of the older version in use on Travis, BTW!), which probably doesn't run long enough for compilation to help much. I did put together a variation of the test suite meant for profiling, but have been working more on other things (particularly since LARG/HFO is mostly C++)... about all I did was trim it down to just the tests actually doing runs, then up the generation count and population size (and adjust the fitness function termination criteria so they wouldn't happen).

@bennr01 (Contributor, Author) commented Jul 15, 2017

@drallensmith I knew it! I actually thought you might be using lynx (I use it sometimes too), but it seemed too unlikely.
About NEAT-tetris: well, it is very simple (no UI, but you can print an ASCII version of the current state). I finally got the rotate function to work, but the combination of the rotation logic and the not-yet-implemented movement logic for the x-axis is problematic. LARG/HFO seems interesting, though.

The `DistributedEvaluator` will now shut down its manager when `stop()` is called.
@CodeReclaimers (Owner)

Thank you thank you to both of you for the work on this! Apologies for not having the time to look through it yet, hoping to change that soon. :)

@coveralls commented Jul 17, 2017

Coverage increased (+0.7%) to 93.267% when pulling 4da38f4 on bennr01:pr_prepare into 765c5b5 on CodeReclaimers:master.

@CodeReclaimers merged commit a27b2c7 into CodeReclaimers:master Jul 17, 2017
@CodeReclaimers (Owner)

This looks so thorough that I figured it's just best to merge and let everybody try it out. :)
