
Parallelisation hangs #125

Closed
drvinceknight opened this issue Mar 25, 2015 · 10 comments

Comments

@drvinceknight
Member

As far as I can tell this has something to do with #122, but I have no idea what exactly.

It hangs at this stage:

➜ Axelrod git:(master) ✗ python run_tournament.py -p 3
Starting basic_strategies tournament with 10 round robins of 200 turns per pair.
Passing cache with 0 entries to basic_strategies tournament
Running repetitions with 3 parallel processes
Finished basic_strategies tournament in 0.0s
Starting ecological variant of basic_strategies
Finished ecological variant of basic_strategies in 0.0s
Cache now has 10 entries
Finished all basic_strategies tasks in 1.3s

Starting strategies tournament with 10 round robins of 200 turns per pair.
Passing cache with 10 entries to strategies tournament
Running repetitions with 3 parallel processes
@langner
Member

langner commented Mar 25, 2015

Yup, it also hangs here, but only for the standard strategies; the cheating and basic tournaments go through fine. I'm not well versed enough in the code from #122 to tell what the matter is.

@langner
Member

langner commented Mar 25, 2015

Now I see you are using multiprocessing. My first guess: deadlock caused by a large queue.

@drvinceknight
Member Author

Let me tag @meatballs just in case he hasn't seen this...

@meatballs
Member

I think the problem here, as @langner says, is a deadlocked multiprocessing queue. Specifically, it's the 'done' queue.

I think the problem has been there all along. If you look at lines 92-95 of tournament.py:

        # There is a 0.5 second timeout here as the all_strategies tournament
        # occasionally hangs the join method for some strange reason.
        for process in processes:
            process.join(0.5)

I had to add that timeout to deal with an occasional hang on my machine. The timeout simply makes the main process stop waiting and move on early, and, although it stops the hang, it means the results are mostly an empty matrix. Increasing the timeout to something like 60s hides the problem completely, but doesn't solve it!
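
To illustrate (a minimal sketch; the worker name and payload are hypothetical, not the actual tournament code): join(0.5) returns whether or not the worker has finished, so anything the worker hasn't produced by then simply never reaches the results.

import time
from multiprocessing import Process, Queue

def slow_worker(done_queue):
    # Hypothetical stand-in for a round robin worker that takes
    # longer than the join timeout to produce its result.
    time.sleep(2)
    done_queue.put('payoff row')

if __name__ == '__main__':
    done_queue = Queue()
    p = Process(target=slow_worker, args=(done_queue,))
    p.start()
    p.join(0.5)                # returns after 0.5s; the worker is still going
    print(p.is_alive())        # True: the join timed out rather than finishing
    print(done_queue.empty())  # True: the result is simply missing
    p.join()                   # wait properly so the demo exits cleanly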

I think the deterministic cache being passed around by #122, the extra strategies added by #121, and the removal of the timeout in #122 have combined to bring the problem to a head.

The crux of the problem is that multiprocessing can't cope with the size of the output queue we are creating. Even setting the queue's maxsize to 0 (supposedly infinite) doesn't solve the problem.

It might be worth looking at using threading instead of multiprocessing.
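
If we went that route, a rough sketch might look like the following (helper names are hypothetical, not from our codebase). Threads share memory, so results can be appended to a plain list under a lock, and there is no inter-process queue to deadlock on:

import threading

def run_threaded(tasks, worker, num_threads=3):
    # 'worker' computes the result for a single task. Results go
    # straight into a shared list; no pipe, no feeder thread.
    results = []
    lock = threading.Lock()

    def consume(chunk):
        for task in chunk:
            result = worker(task)
            with lock:
                results.append(result)

    chunks = [tasks[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=consume, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # safe: nothing is buffered in a pipe
    return results

The caveat is that our workers are CPU-bound (simulating matches), and CPython's GIL stops threads running truly in parallel, which is presumably why multiprocessing was chosen in the first place.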

@meatballs
Member

Another potential solution might be to kick off a daemon process first which reads from the done queue and appends to the payoffs list.
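
Something along these lines, say (sketched with a daemon thread in the parent rather than a separate process; all names are hypothetical). The consumer pulls results off the done queue while the workers run, so the queue's underlying pipe buffer never fills and join() can't block forever:

import threading
from multiprocessing import Process, Queue

def worker(n, done_queue):
    # Hypothetical worker: exactly one result per task.
    done_queue.put(n * n)

if __name__ == '__main__':
    done_queue = Queue()
    payoffs = []
    tasks = range(10)

    # Daemon consumer: drains the done queue concurrently and
    # appends to the payoffs list, as suggested above.
    def drain():
        for _ in tasks:
            payoffs.append(done_queue.get())

    consumer = threading.Thread(target=drain)
    consumer.daemon = True
    consumer.start()

    processes = [Process(target=worker, args=(n, done_queue)) for n in tasks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()     # safe: the consumer keeps the queue empty
    consumer.join()  # all expected results have been collected
    print(payoffs)   # order depends on which worker finished first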

@meatballs
Member

Seems I've fallen for a well-known gotcha: https://docs.python.org/2/library/multiprocessing.html#all-platforms

Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the cancel_join_thread() method of the queue to avoid this behaviour.)

This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.

An example which will deadlock is the following:

from multiprocessing import Process, Queue

def f(q):
    q.put('X' * 1000000)

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=f, args=(queue,))
    p.start()
    p.join()                    # this deadlocks
    obj = queue.get()

A fix here would be to swap the last two lines (or simply remove the p.join() line).
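
For completeness, the swapped version of that example runs to completion: the parent drains the queue first, the feeder thread can flush, and the join succeeds.

from multiprocessing import Process, Queue

def f(q):
    q.put('X' * 1000000)

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=f, args=(queue,))
    p.start()
    obj = queue.get()           # drain the queue first...
    p.join()                    # ...then join: no deadlock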

@meatballs
Member

I think I have this fixed. Branch 125 in my repo. I'm still testing it out.

@drvinceknight
Member Author

Works fine for me!


@drvinceknight
Member Author

Fixed by #127
