Update report and readme draft
Improved wording and corrected some spelling mistakes. Also fixed the
score issue: previously I used the mean of both agents' rewards; now I
use the correct max score.
SwamyDev committed Mar 14, 2020
1 parent 120898a commit 3c6089c
Showing 3 changed files with 19 additions and 18 deletions.
1 change: 1 addition & 0 deletions Report.md
@@ -4,3 +4,4 @@ Each Udacity project is accompanied by a report. Each report follows a similar s

- [Project: Navigation](doc/Report_p1_navigation.md)
- [Project: Continuous Control](doc/Report_p2_continuous.md)
- [Project: Multi-Agent](doc/Report_p3_multiagent.md)
8 changes: 4 additions & 4 deletions doc/README_p3_multiagent.md
@@ -1,13 +1,13 @@
# Project: Continuous Control
# Project: Multi-Agent

This project is part of the Udacity Reinforcement Learning Nanodegree. In this project, multiple `DDPG` agents are trained to solve a continuous control task. Specifically, each agent needs to control a tennis racket to pass a ball back and forth, keeping it in game as long as possible. The each agent receives a reward of each time it hits the ball over the net and gets penalized when the ball hits the ground or goes out of bounds. Hence, it is in the interest of both agents to keep the ball in play, making this a cooperative environment. The environment is considered solved, when the maximum score of each agent reaches an average of >0.5 points throughout 100 episodes.
This project is part of the Udacity Reinforcement Learning Nanodegree. In this project, multiple `DDPG` agents are trained to solve a multi-agent environment. Specifically, each agent needs to control a tennis racket to pass a ball back and forth, keeping it in play as long as possible. Each agent receives a reward each time it hits the ball over the net and gets penalized when the ball hits the ground or goes out of bounds. Hence, it is in the interest of both agents to keep the ball in play, making this a cooperative environment. The environment is considered solved when the maximum score over both agents reaches an average of >0.5 points over 100 episodes.
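
As a quick illustration of the scoring rule, here is a minimal sketch (with illustrative names, not the repository's actual code): the per-episode score is the maximum of the two agents' accumulated rewards, and the rolling 100-episode average of that score decides whether the environment counts as solved.

```python
import numpy as np

def episode_score(agent_returns):
    # the episode score is the maximum of the two agents' accumulated rewards
    return float(np.max(agent_returns))

def is_solved(scores, window=100, target=0.5):
    # solved once the mean over the last `window` episode scores exceeds the target
    return len(scores) >= window and float(np.mean(scores[-window:])) > target
```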

## Environment Setup
### Reward Signal
Each agent receives a reward of `0.1` when it hits the ball over the net, but get's a penalty of -0.01 each time the ball hits the ground or goes out of bounds. The goal for both agents is therefore to keep the ball in play as long as possible.
Each agent receives a reward of `0.1` when it hits the ball over the net, but gets a penalty of `-0.01` each time the ball hits the ground or goes out of bounds. The goal for both agents is therefore to keep the ball in play as long as possible.

### Observation
An observation state for each agent individually consists of the agent's current position and velocity and the position and velocity of the ball. The total observation of both agent is encoded in a 2x24 tensor (stacking the observations of both agents).
The observation for each agent individually consists of the agent's current position and velocity and the position and velocity of the ball. The combined observation of both agents is encoded in a 2x24 tensor (stacking the observations of both agents).

### Actions
The action each agent can take consists of a 1x2 tensor corresponding to 2 continuous actions: moving towards or away from the net, and jumping. The action values are normalized to a range between `-1` and `1`.
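
For reference, the tensor shapes involved look roughly like this (a minimal sketch; the variable names are illustrative):

```python
import numpy as np

observations = np.zeros((2, 24))        # stacked per-agent observations (2 agents x 24 values)
actions = np.random.randn(2, 2)         # e.g. raw policy output plus exploration noise
actions = np.clip(actions, -1.0, 1.0)   # action values must stay within [-1, 1]
```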
28 changes: 14 additions & 14 deletions doc/Report_p3_multiagent.md
@@ -1,28 +1,28 @@
# Project: Continuous Control
# Project: Multi-Agent

This report details how multiple agents using the Deep Deterministic Policy Gradient (`DDPG`) algorithm solve a cooperative task. One of the main issues with multi-agent (`MA`) environments is, that actions of each agent influence observations, rewards and the set of valid actions for each other agent. This makes the problem non-stationary. However, in the particular case of the `Tennis` environment, this issue seems to be not that pronounced. In this environment two agents control a tennis racket and their goal is to keep the ball in play without hitting it out of bounds or dropping it to the ground (more details in [README_p3_multiagent.md](README_p3_multiagent.md)). Each agent can act relatively independent of the actions of the other agent (agents can't obstruct each other for instance) and focus marley on hitting the ball over the net. This makes this environment easier to solve then other `MA` settings, and is possibly the reason why the naive approach of training two independent `DDPG` agents was sufficient. As with the last project I was able to implement the agent relatively quick, and without running into too many defects, by using best software engineering practises such as Continuous Integration (`CI`) and Test Driven Development (`TDD`). Additionally I've learned, that it is a good idea to study the performance of a random agent within the environment, before diving into it. In this particular case it happened quite often that I thought the agent was doing well and getting excited. However, it turned out that there was some bug in the code resulting in constant random exploration. What made this phenomenon more pronounced was the fact that the learning agent would crash initially to the absolute minimum of rewards, hence the random agent actually looked better in comparison.
This report details how multiple agents using the Deep Deterministic Policy Gradient (`DDPG`) algorithm solve a cooperative task. One of the main issues with multi-agent (`MA`) environments is that the actions of each agent influence the observations, rewards and set of valid actions of the other agents. This makes the problem non-stationary. However, in the case of the `Tennis` environment, this issue does not seem to be that pronounced. In this environment, two agents each control a tennis racket and their goal is to keep the ball in play without hitting it out of bounds or dropping it to the ground (more details in [README_p3_multiagent.md](README_p3_multiagent.md)). Each agent can act relatively independently of the other agent (agents can't interfere with each other, for instance) and just focus on hitting the ball over the net. This makes the environment easier to solve than other `MA` settings and is possibly the reason why the naive approach of training two independent `DDPG` agents was sufficient. As with the last project, I was able to implement the agent relatively quickly, and without running into too many defects, by using software engineering best practices such as Continuous Integration (`CI`) and Test Driven Development (`TDD`). Additionally, I've learned that it is a good idea to study the performance of a random agent within the environment before diving into it. In this particular case, it happened quite often that I thought the agent was doing well when it was just acting randomly. Often this was due to some bug in the code or an oversight in the configuration. What made this phenomenon more pronounced was the fact that the learning agents would initially crash to the absolute minimum of rewards, so the random agent actually looked better in comparison.

## N Deep Deterministic Policy Gradient
The N Deep Deterministic Policy Gradient simply specifies multiple agents using the `DDPG` algorithm independently. For this environment it wasn't necessary to do much tweaking of the original algorithm, such as for learning the `Critic` together. In this implementation I simply reused the memory replay buffer, but without sharing experiences between agents.
The N Deep Deterministic Policy Gradient simply instantiates multiple independent agents using the `DDPG` algorithm. For this environment, it wasn't necessary to tweak the original algorithm much. In this implementation, I reused the memory replay buffer, but without sharing experiences between agents.
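
The idea can be sketched as a thin wrapper around the single-agent implementation; the class and method names below are hypothetical and only meant to illustrate the structure, not the repository's actual API.

```python
class NDDPG:
    def __init__(self, num_agents, obs_size, action_size, make_agent):
        # one fully independent DDPG agent (actor, critic, replay buffer) per racket
        self.agents = [make_agent(obs_size, action_size) for _ in range(num_agents)]

    def act(self, observations):
        # each agent only sees, and acts on, its own observation row
        return [agent.act(obs) for agent, obs in zip(self.agents, observations)]

    def step(self, observations, actions, rewards, next_observations, dones):
        # experiences are stored per agent; replay buffers are not shared
        for i, agent in enumerate(self.agents):
            agent.step(observations[i], actions[i], rewards[i],
                       next_observations[i], dones[i])
```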

The main "trick" was just to train both agents long enough to get them over the initial crashing phase. Once the algorithm reached peak performance around `>2`the algorithm would still fluctuate a lot, but not crash down to the minimum again. I'm suspecting this is due, the agents learning a careful balance where they would bounce the ball back end forth at the exact same position, for very long periods. This results a very high amount of similar experience leading to oversampling of these states. Intuitively this would mean that the agents do not see many different situations where the ball would come from different angles at different positions and hence are "surprised" when they still happen and do not react properly leading to poor performance until balance is found again.
The main "trick" was just to train both agents long enough to get them over the initial crashing phase. After about 1000 - 2000 episodes they would get out of their initial rut and quickly reach peak performance around `~4` points on the average 100 episodes. Once the algorithm reached peak performance the algorithm would still fluctuate a lot, but not crash down to the minimum again. In fact, in the run, I submitted it consistently stayed above `0.5` points, even though it got close once. I'm suspecting this is due to the agents learning to reach a delicate balance where they would bounce the ball back end forth at the exact same position, for very long periods. This results in accumulating a very high amount of similar experiences in the replay buffer. In turn, this leads to the oversampling of these states. Intuitively this would mean that the agents do not learn from a diverse set of situations where the ball would come from different angles at different positions. Hence agents are easily "surprised" when getting in an unusual situation. This might result in agents to underperform until some balance is found again.

## The DDPG Algorithm
The algorithm is the same as described in [Report_p2_continuous](Report_p2_continuous.md). However, there are some minor tweaks to the exploration handling. I implement a similar setup as the `DDPG` algorithm described in [Spinning Up RL from OpenAI](https://spinningup.openai.com/en/latest/algorithms/ddpg.html). I now start with completely random "preheating" steps before letting the agent act, and then simply add gaussian noise with a fixed standard deviation to ensure continued exploration.
The algorithm is the same as described in [Report_p2_continuous.md](Report_p2_continuous.md). However, there are some minor tweaks to how exploration is done. I implemented a setup similar to the one found in [Spinning Up RL from OpenAI](https://spinningup.openai.com/en/latest/algorithms/ddpg.html). Now the agents start with completely random "preheating" steps before they are allowed to act. After that, Gaussian noise with a fixed standard deviation is added to ensure continued exploration.
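
A minimal sketch of this exploration scheme follows; the parameter values are illustrative defaults, not the ones actually used in training.

```python
import numpy as np

def explore_action(policy, obs, step, preheat_steps=1000, noise_std=0.1, action_size=2):
    # during the preheating phase, act completely at random
    if step < preheat_steps:
        return np.random.uniform(-1.0, 1.0, size=action_size)
    # afterwards, add fixed-scale Gaussian noise to the policy's action
    action = policy(obs) + np.random.normal(0.0, noise_std, size=action_size)
    return np.clip(action, -1.0, 1.0)
```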

## Notes on Development
First I implemented a single `DDPG` agent which would just control both rackets and receive both observations. With this much easier approach to solving the environment I could explore the various pitfalls of it. It turned out in the end that one of the major challenges wasn't agent coordination or the environments being non-stationary, but rather exploration/exploitation trade-off and the reward signal. Even the easier "one-mind" agent would crash early during training and not recover for a long time.
First, I implemented a single `DDPG` agent which would control both rackets and receive both observations, which I called the "one-mind" agent. With this much simpler approach to solving the environment, I could explore its various pitfalls. It turned out in the end that the major challenges weren't agent coordination or the environment being non-stationary, but rather the exploration/exploitation trade-off and the reward signal. Even the simpler "one-mind" agent would crash early during training and not recover for a long time.
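
Conceptually, the "one-mind" agent just flattens and reshapes the tensors around a single `DDPG` agent. The helper below is a hypothetical sketch of that idea, not the actual implementation.

```python
import numpy as np

def one_mind_step(agent, observations):          # observations: shape (2, 24)
    flat_obs = observations.reshape(-1)          # -> single 48-dim state vector
    flat_action = agent.act(flat_obs)            # -> single 4-dim action vector
    return np.asarray(flat_action).reshape(2, 2) # one row of 2 actions per racket
```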

I've spent a lot of time investigating the exploration/exploitation tread-off because this hard initial crash of the agent (constant `-0.05` average reward). The setup described in the previous section turned out to be the best, however, it still didn't improve that much. I tried tuning various other hyper parameters like learning rate, model architecture, gamma and tau. However, none of these improved the performance much.
I spent a lot of time investigating the exploration/exploitation trade-off because of this initial hard crash of the agent (a constant `0` average reward over 100 episodes). The exploration setup described in the previous section turned out to be the best; however, it still didn't improve things that much. I also tried tuning various other hyperparameters like the learning rate, model architecture, gamma and tau. However, none of these improved the performance much.

I contemplated implementing prioritized experience replay as I suspected that accumulating lots of low reward experiences would lead to oversampling them and stall learning. However, once I trained the "one-mind" agent for longer episodes I noticed that it got out of its rut after a while. Hence, I decided to postpone the implementation of prioritized replay and try the environment with multiple agents. I started with a naive approach by just training two `DDPG` agents simultaneously on the environment. It turned out that these performed just as well as the "one-mind" agent, solving it after `~7000` episodes.
Following that, I considered implementing prioritized experience replay, as I suspected that accumulating lots of similar low-reward experiences would lead to oversampling them and stall learning. However, once I trained the "one-mind" agent for more episodes, I noticed that it got out of its rut after a while. Hence, I decided to postpone the implementation of prioritized replay and try the environment with multiple agents. I started with a naive approach by just training two `DDPG` agents simultaneously on the environment. It turned out that these performed just as well as the "one-mind" agent, solving it after `~4500` episodes.

From looking at poorly performing agents and well performing ones I have the suspicion that various factors contribute to the initial crash of performance. One being that there might be a better way to initialize model parameters - for now I use the default initialization of PyTorch. I'm also suspecting that the reward signal is difficult. When the agents constantly drop the ball they have no indication which of their actions actually got them closer to their desired goal. For instance, one agent might have hit the ball and got it closer to the net but still dropped it, however, the current reward signal would still be just `-0.05` no matter how close it was. This means that agents have to rely on randomly hitting the ball across the net. I'm thinking that prioritized replay might help with that.
Comparing poorly performing agents with well-performing ones, I suspect that various factors contribute to the initial crash in performance. One is that there might be a better way to initialize the model parameters; for now, I use the default initialization of PyTorch. I also suspect that the reward signal makes learning difficult. When the agents constantly drop the ball, they have no indication which of their actions actually got them closer to their desired goal. For instance, one agent might have hit the ball and got it closer to the net but still dropped it; however, the reward signal would still be just `-0.05`, no matter how close the ball got to the net. This means that the agents have to rely on randomly hitting the ball across the net. I think that prioritized replay might help with that. Of course, reshaping the reward function might help as well.

While investigating these issues I've extended the command line interface with some convenience functionality. For instance it is now possible to create snapshots once the agent reaches a certain performance level. This is also what I used during training to save the agent model parameters when it actually achieved its highest performance. These are also the model parameters saved as final parameters in this repository. Additionally, I also now properly take care of keyboard interrupts allowing me to stop training in between, save the agent and show the training graph. This helped a lot in investigating issues with training performance.
While investigating these issues, I extended the command line interface with some convenience functionality. For instance, it is now possible to create snapshots once the agent reaches a certain performance level. This feature was used during training to save the agent's model parameters when it actually achieved its highest performance. This peak-performance model is also the model reported in the repository (`resources/models/p3_tennis_final`). Additionally, I now properly handle keyboard interrupts, which allows me to stop training in between, save the agent and display the training graph. This helped a lot in investigating issues with training performance.
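
A rough sketch of how such a training loop can combine threshold-based snapshots with graceful interrupt handling is shown below; the function and method names are placeholders, not the repository's CLI.

```python
def train(agent, env, run_episode, episodes=8000, snapshot_threshold=0.5):
    """run_episode(agent, env) -> float (episode score) is supplied by the caller."""
    best_avg, scores = -float("inf"), []
    try:
        for _ in range(episodes):
            scores.append(run_episode(agent, env))
            avg = sum(scores[-100:]) / min(len(scores), 100)
            if avg >= snapshot_threshold and avg > best_avg:
                best_avg = avg
                agent.save("snapshot_best")   # keep the peak-performance model
    except KeyboardInterrupt:
        pass                                  # Ctrl+C stops training gracefully
    agent.save("latest")
    return scores                             # e.g. plot the training graph from these
```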

## Results
Using a neural network architecture similar to the one used in the [Report_p2_continuous.md](Report_p2_continuous.md), my agent solved the environment after about ~7000 episodes. However, I must say that I do not take the maximum score from both agent rewards, but rather average them. However, this leads to an overall lower score, meaning that the agent might have solved it earlier, but I'm suspecting that the difference is small.
Using a neural network architecture similar to the one used in [Report_p2_continuous.md](Report_p2_continuous.md), my agents solved the environment after `~4500` episodes.

![Graph of Training Run](../resources/images/nddpg_training.png)

@@ -73,8 +73,8 @@ Both agents used the `multi_ddpg_ann_a_2x256_c_2x256_1x128-2020-03-07` agent con
The final trained model is stored under `resources/models/p3_tennis_final`.

## Future Work
Considering the challenges I faced in this environment, I think the focus for improving performance is more on improving the replayed experience and the exploration/exploitation trade-off. What leads me to this believe, is the fact that the "one-minded" agents exhibits similar performance to the naive multiple-agents approach. One easy improvement could be to find a better way to initialize the model weights, so the agents does better exploring initially.
Considering the challenges I faced in this environment, I think the most fruitful approach to improving performance is to focus on improving the replayed experience and the exploration/exploitation trade-off. What points me to this conclusion is the fact that the "one-mind" agent exhibits similar performance to the naive multi-agent approach. An easy improvement could be to find a better way to initialize the model weights, so the agents explore better initially.
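
One candidate, sketched below, would be the fan-in uniform initialization from the original DDPG paper instead of PyTorch's defaults; this is a suggestion, not what the current code does.

```python
import numpy as np
import torch.nn as nn

def init_hidden(layer):
    # hidden layers: uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)]
    bound = 1.0 / np.sqrt(layer.weight.data.size(1))
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

def init_output(layer, bound=3e-3):
    # output layer: small uniform range so initial actions/values stay near zero
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)
```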

Additionally prioritized replay could help the agent to learn more from unusual experiences and reduce the oversampling of what I'd call "states in careful balance". The agent might then be not that "surprised" by unexpected ball trajectories and perform more robustly overall.
Additionally, prioritized replay could help the agent learn more from unusual experiences and reduce the oversampling of what I'd call "states in delicate balance". The agent might then not be as "surprised" by unexpected ball trajectories and would perform more robustly overall.
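
A minimal sketch of what a proportional prioritized replay buffer could look like follows (illustrative only, not the repository's buffer): transitions are sampled with probability proportional to their TD-error priority, so rare, surprising experiences are replayed more often than the many near-identical "delicate balance" states.

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.buffer) >= self.capacity:      # drop the oldest transition
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities, dtype=float)
        probs /= probs.sum()                       # sampling probability ~ priority
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        return [self.buffer[i] for i in idx], idx  # idx lets the caller update priorities

    def update(self, indices, td_errors):
        for i, err in zip(indices, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```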

Of course one could also improve the `multi-agent` aspect of it as well and implement the [MADDPG](https://arxiv.org/abs/1706.02275) algorithm (which I intended initially, but turned out not to be necessary).
Of course, one could also improve the `multi-agent` aspect and implement the [MADDPG](https://arxiv.org/abs/1706.02275) algorithm (which I intended to do initially, but it turned out not to be necessary).
