
Implement asynchronous methods #5

Closed
lake4790k opened this issue Mar 19, 2016 · 14 comments

@lake4790k
Collaborator

http://arxiv.org/pdf/1602.01783v1.pdf describes asynchronous methods for both off-policy (one-step / n-step Q-learning) and on-policy (Sarsa and advantage actor-critic, A3C) reinforcement learning.

These algorithms converge faster with fewer resources (CPU-only, multithreaded on a single machine, without a large replay memory) and can achieve better results than other methods.

I think the Hogwild method they use for lock-free updating of the shared network can be implemented with Torch/Lua threads and sharedserialize.
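As a minimal sketch of that idea, assuming the standard torch 'threads' package (names like nThreads are just illustrative):

local threads = require 'threads'
-- share tensor storages across the thread boundary instead of copying them
threads.Threads.serialization('threads.sharedserialize')

local nThreads = 4  -- illustrative; e.g. one learner per (hyper)thread
local pool = threads.Threads(nThreads, function()
  require 'nn'      -- each worker loads its own packages
end)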

@Kaixhin
Owner

Kaixhin commented Mar 19, 2016

I currently have no plans to implement A3C in this repo because it is quite different from the original DQN, rather than a relatively simple addition to it. You're welcome to submit a PR if you manage to come up with an -asynchronous true-style option.

Edit: Actually, asynchronous one-step Q-learning could be in scope. And I agree that 'threads.sharedserialize' would be one part of the solution. But running multiple (presumably physical) threads across cores and coordinating them with a master thread may not be possible with the threads library?

@ghost

ghost commented Apr 9, 2016

@Kaixhin @lake4790k I was actually working off this repo as a starting point not long ago (maybe a month ago?) to attempt exactly this. I evaluated threads and concluded it was not a very good option for one-step or n-step Q-learning due to the way upvalues work within the threads (in particular, the singleton instance and step counts within those algorithms cause a number of issues). If you're interested in collaborating, I can invite you to the library I was working on this in. I think that lua---parallel can handle A3C better, and I was using a structure similar to the original DeepMind code.

Edit: To add to @Kaixhin's edit, this is possible. You can quite easily run multiple instances of the Atari code in parallel; it is just that serialization at a given step is neither straightforward nor painless between threads.

@lake4790k
Collaborator Author

@Kaixhin I think these async methods can be done in pure Torch using threads and sharedserialize. Some implementation remarks (don't read this if you want to give it a try first; it's an interesting exercise...):

T (global shared counter): this would need to be an atomic variable (e.g. C++ std::atomic), which is not available in Lua/Torch. In the paper it is used to update targetTheta from the threads, which happens relatively infrequently (e.g. every Nth frame). Instead one can update targetTheta from a non-learning thread at fixed time intervals (i.e. sleep for x seconds) in an unsynchronized way with the same effect; doing it on exactly every Nth step is not necessary.
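A sketch of that timer-based target update, assuming theta and targetTheta are the flat shared parameter tensors (the interval is an arbitrary choice, not from the paper):

local sys = require 'sys'

-- body of a hypothetical non-learning thread
local function targetUpdater(theta, targetTheta, intervalSecs)
  while true do
    sys.sleep(intervalSecs)   -- e.g. a few seconds
    targetTheta:copy(theta)   -- unsynchronized copy of the shared parameters
  end
end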

A network with the global shared theta is created first, then a thread pool is started and the learner threads in the pool each do a clone('weight','bias'), so dTheta (the accumulation of gradients and all other internal state) is per thread, but they share the Storage of theta. To acquire a flattened dTheta in a thread one can do:

_,gradParams = self.network:parameters()
dTheta = nn.Module.flatten(gradParams)

so dTheta is a single flattened tensor that can be added to theta.
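Putting the cloning and flattening together, the per-thread setup might look roughly like this (a sketch; the nn.Linear here is just a stand-in for the network holding the global theta):

require 'nn'

local sharedNet = nn.Linear(4, 2)                   -- stand-in for the global network holding theta

-- per learner thread: share the parameter storages, keep gradients thread-local
local localNet = sharedNet:clone('weight', 'bias')  -- weight/bias share storage with the global theta
local _, gradParams = localNet:parameters()
local dTheta = nn.Module.flatten(gradParams)        -- one flat tensor over this thread's gradients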

In the one-step case the learner then does n forwards/backwards, accumulating gradients in dTheta, before doing a Hogwild theta += dTheta, i.e. don't worry about synchronization and trust the CPU caches to be synchronised anyway for most of the updates. Adding the gradient is safe in the sense that the worst that can happen is rarely losing an update; theta doesn't get corrupted. Asynchronous = unsynchronised. There's no master thread or coordination; the pool can run forever, with no need to stop on synchronize().
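A rough sketch of a single such update under those assumptions (in practice several forward/backward passes would accumulate into dTheta before the add; all names here are illustrative):

require 'torch'

local function oneStepUpdate(localNet, targetNet, theta, dTheta, s, a, r, sPrime, gamma, lr)
  localNet:zeroGradParameters()                            -- dTheta aliases these gradient tensors
  local y = r + gamma * targetNet:forward(sPrime):max()    -- one-step Q-learning target
  local Q = localNet:forward(s)
  local gradOutput = torch.zeros(Q:size())
  gradOutput[a] = Q[a] - y                                 -- TD error only on the action taken
  localNet:backward(s, gradOutput)                         -- accumulates into dTheta
  theta:add(-lr, dTheta)                                   -- Hogwild: unsynchronized add into shared theta
end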

The shared RMSProp they describe, with the shared g, g2, is trickier, as the asynchronous neg() and sqrt() will corrupt the shared tensor with NaNs; I think a thread-local g, g2 copy is needed as well for the interim calculations.
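One possible way around that, sketched here with a single shared second-moment tensor g and a thread-local scratch tensor tmp so the in-place sqrt() never touches the shared statistics (an assumption about how to avoid the corruption, not the repo's actual code):

local function sharedRmsPropStep(theta, dTheta, g, tmp, lr, alpha, epsilon)
  g:mul(alpha):addcmul(1 - alpha, dTheta, dTheta)  -- Hogwild update of the shared second moment g
  tmp:copy(g):add(epsilon):sqrt()                  -- interim calculation on the thread-local copy
  theta:addcdiv(-lr, dTheta, tmp)                  -- theta <- theta - lr * dTheta / sqrt(g + eps)
end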

@Kaixhin
Owner

Kaixhin commented Apr 13, 2016

@michaelghaben @lake4790k Thanks for the comments - one-step (and possibly n-step) Q-learning can hopefully be integrated with the other features in this repo, e.g. the dueling architecture. If that can be achieved then the others, like advantage actor-critic, can be considered.

I'm not going to be able to take the lead on this in the near future, but I'm happy to lend a hand to either a fork or a new repo if I can. For now I've just added an async branch to make it a little cleaner than working directly on master.

@lake4790k
Collaborator Author

@Kaixhin I did a reference implementation as per the above in a simple codebase without all the other methods. I hope I'll have time to merge it with your codebase; it would be interesting to see the performance compared to all the other methods. Will do a PR once ready.

@Kaixhin
Owner

Kaixhin commented Apr 28, 2016

Someone is trying to replicate this and, after skimming through miyosuda/async_deep_reinforce#1, it seems like they got hold of hyperparameters not noted in the paper. Worth keeping an eye on.

@lake4790k
Collaborator Author

@Kaixhin yes, interesting about the hyperparameters. But I'm not sure that implementing this in Python is a good idea; afaik Python does not support multithreaded execution of Python script code at all (i.e. all of the RL logic...), only code that runs outside the GIL (i.e. TensorFlow operations, Cython parts). They also mention slow performance compared to the original; I would not be surprised if that is because of the Python issue (but even single-threaded Python performance could be poor compared to Lua/native for this use case).

I'll add my implementation of async to Atari in the coming days, I'm curious how it will work...

@lake4790k
Collaborator Author

@Kaixhin btw I started with Catch, comparing CPU and GPU behaviour, and noticed that CPU did not converge while GPU did, which should not be the case as all the code is the same for both. Except for the random initialization: by default the code sets a manual seed of 1, but in the GPU case it then does a cutorch.manualSeed(torch.random()) before constructing the net. So I think the CPU always gets a poor initialization from seed 1, while the GPU net gets some other random weights that work better. If I set the CPU seed to random it then converges similarly to GPU (and also faster, as you note, because of the small net). I'm not sure if this random initialization behaviour is intended; it got me quite confused at first...
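For reference, a hypothetical sketch of the two seeding paths being compared (names and structure are illustrative, not the repo's actual code):

require 'torch'

torch.manualSeed(1)                     -- default: CPU net initialised from seed 1
local useGpu = false                    -- illustrative flag
if useGpu then
  require 'cutorch'
  cutorch.manualSeed(torch.random())    -- GPU case: the net effectively gets a random seed
else
  torch.seed()                          -- re-seed the CPU RNG randomly, as described above
end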

@Kaixhin
Owner

Kaixhin commented Apr 28, 2016

@lake4790k Interesting - I went with the same initialisation as in DeepMind's code. I don't think a seed of 1 is worse than any other seed for a random number generator - it seems like you just got unlucky. It might be more obvious with Catch, but as far as I know weight initialisation hasn't been looked into for deep reinforcement learning.

@lake4790k
Collaborator Author

lake4790k commented May 9, 2016

Some results from running async one-step Q. I used Pong to compare the learning speed with the result on page 20 of the async paper.

This experiment I ran on 10 hyperthreads (5 physical cores). I would expect the equivalent DeepMind performance to be somewhat below the midpoint between the 4- and 8-thread curves, as the speed is limited by the 5 physical cores, but having more threads with diverse experiences helps a bit.

The time scale of this figure is a little less than 14 hours. It reached a score of 0 in about 11 hours, which is exactly where the interpolated DeepMind curve would be. I used a learning rate of 0.0007.

[Figure: scores_10b - learning curve for the 10-hyperthread run]

This experiment I ran on 8 hyperthreads (4 physical cores). The equivalent DeepMind curve should be a bit above the 4-thread curve on page 20.

The time scale of this figure is a little less than 24 hours. At 14 hours it reached a score of -3, which is exactly where the 4-thread DeepMind curve is. I used a learning rate of 0.0016.

[Figure: scores_8 - learning curve for the 8-hyperthread run]

In these experiments I did not use learning rate decay as in the paper. The paper says they used the same experimental setup as the double Q paper, but then also says they used gradient norm clipping (which I didn't turn on either), which was introduced in the dueling paper.

I also had an experiment with the more aggressive 0.0016 learning rate that got stuck at the beginning, not improving for a long time. My guess would be that gradient clipping would have helped it get out of there (and eventually the learning rate decay as well).

Given that the curves in the paper are the average of the 3 best agents out of 50 experiments, that they most likely used an optimized C++ implementation (with TensorFlow) while ours is pure Torch, and that I ran only a few experiments, these results look pretty good.

I still plan to implement n-step Q (in combination with double/PAL/dueling) and A3C, and to unify as much as possible with the experience replay codebase as we discussed.

@Kaixhin
Owner

Kaixhin commented May 9, 2016

@lake4790k looks great - thanks for comparing with DeepMind results. Epochs are more meaningful than training time due to differences in hardware, but your estimates sound about right. Keep at it!

@Kaixhin
Owner

Kaixhin commented May 30, 2016

Closed by #30.

@Kaixhin Kaixhin closed this as completed May 30, 2016
@lake4790k
Collaborator Author

Ready for the next method based on A3C...

@hym1120

hym1120 commented Feb 26, 2017

In A3CAgent:accumulateGradients, what is the reason we have 0.5 in vTarget instead of 2?
I was thinking d(R - V)^2 / dTheta is equal to -2 * (R - V) * dV/dTheta.
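For what it's worth, if the value loss is written with a 0.5 factor, the 2 from differentiating the square cancels; a worked sketch of the algebra (not a claim about what the repo code actually computes):

\frac{\partial}{\partial\theta}\left[\tfrac{1}{2}\bigl(R - V(s;\theta)\bigr)^{2}\right]
  = -\bigl(R - V(s;\theta)\bigr)\,\frac{\partial V(s;\theta)}{\partial\theta}

so with the 0.5 in place the gradient is just -(R - V) * dV/dTheta, with no leading factor of 2.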
