# Deep Reinforcement Learning for Multi-Agent Systems
https://arxiv.org/pdf/1812.11794.pdf

## Multi-Agent Systems
Multi-Agent Systems are able to solve complex tasks through the cooperation of individual agents. These agents can communicate with each other and interact with the environment.  
We consider a fully cooperative multi-agent setting in which $n$ agents identified by $a \in A=\{1, ..., n\}$ participate in a stochastic game.  
At every time step each agent takes an action $u_a$ forming a joint action $\textbf{u}$. A state of the environment is represented by $s$.  
The observation of each agent is denoted $O(s,a)$. Each agent $a$ conditions its behaviour on its own action-observation history $\tau_a = \{ (s,a,(u_a)), ...\}$ according to its policy $\pi_a(u_a|\tau_a)$.  
After each transition the action $u_a$ and new observation $O(s,a)$ are added to $\tau_a$.
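
The bookkeeping above can be sketched as follows. This is a minimal illustration, assuming discrete actions and a uniform random policy as a stand-in for $\pi_a$; the names `Agent`, `act`, and `joint_step` are hypothetical and do not come from the cited paper.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent that keeps its own action-observation history tau_a."""
    agent_id: int
    n_actions: int
    history: list = field(default_factory=list)  # tau_a as (observation, action) pairs

    def act(self, observation, rng):
        # Placeholder for pi_a(u_a | tau_a): uniform over the discrete actions.
        # A learned policy would condition on self.history instead.
        action = rng.randrange(self.n_actions)
        self.history.append((observation, action))
        return action


def joint_step(agents, observations, rng):
    """Each agent picks u_a from its own observation; the tuple is the joint action u."""
    return tuple(agent.act(obs, rng) for agent, obs in zip(agents, observations))


rng = random.Random(0)
agents = [Agent(agent_id=a, n_actions=3) for a in range(2)]
joint_action = joint_step(agents, observations=["o_0", "o_1"], rng=rng)
```

Each agent only ever sees its own entry of `observations`, which is what makes the policies decentralized at execution time.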

## Dealing with Non-Stationarity
https://arxiv.org/pdf/1906.04737.pdf

In a multi-agent environment, an agent not only observes the outcome of its own action but also the behavior of other agents.
Since each agent's policy changes as training progresses, the environment is non-stationary from the perspective of an individual agent.

### Decentralized Learning techniques
The agents are trained independently from each other. Each agent possesses a policy, takes a local observation as input and outputs an action.
This architecture does not suffer from scalability issues but faces other problems such as the non-stationarity of the environment.
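
As an illustration of decentralized training, here is a minimal tabular sketch in which every agent runs ordinary Q-learning on its local observation and treats the other agents as part of the environment. The class name `IndependentQLearner`, the hyperparameters, and the toy transition are assumptions made for the example.

```python
from collections import defaultdict


class IndependentQLearner:
    """Tabular Q-learner that ignores the other agents' actions (illustrative)."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9):
        self.n_actions = n_actions
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        self.q = defaultdict(float)  # maps (observation, action) to a value

    def greedy_action(self, obs):
        return max(range(self.n_actions), key=lambda u: self.q[(obs, u)])

    def update(self, obs, action, reward, next_obs):
        # Standard Q-learning target computed from the local observation only.
        best_next = max(self.q[(next_obs, u)] for u in range(self.n_actions))
        target = reward + self.gamma * best_next
        self.q[(obs, action)] += self.alpha * (target - self.q[(obs, action)])


# Two learners receive the same team reward but keep separate tables.
learners = [IndependentQLearner(n_actions=2) for _ in range(2)]
for learner, action in zip(learners, (1, 0)):
    learner.update(obs="start", action=action, reward=1.0, next_obs="end")
```

Because each table is updated without any knowledge of what the other learner did, the reward an agent sees for the same observation and action can change as its teammate changes, which is the non-stationarity mentioned above.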

#### Stabilising Experience Replay
https://arxiv.org/pdf/1702.08887.pdf

The multi-agent RL method "Independent Q-Learning" (http://web.media.mit.edu/~cynthiab/Readings/tan-MAS-reinfLearn.pdf) introduces non-stationarity which makes it incompatible with the experience replay memory on which DQL relies.
Indeed, for each individual agent, the dynamics that generated the data in its memory no longer reflect the current dynamics in which it is learning.  
To deal with this issue, experience replay could be disabled or restricted to short, recent buffers; however, such simple solutions still threaten the stability of multi-agent RL.

##### Multi-Agent Importance Sampling
To enable importance sampling, at the time of collection $t$ we can record the action of the other agent denoted $\pi'_t(s)$ in the replay memory forming an augmented transition tuple $(s, u_a, r, \pi'_t(s), s')$.
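At a later replay time, the agent trains on these transitions with a loss weighted by the ratio between the other agents' current and recorded action probabilities, which corrects for the change in their policies since collection.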
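
A hedged sketch of an importance-weighted replay loss over the augmented tuples $(s, u_a, r, \pi'_t(s), s')$. It assumes that the stored $\pi'_t(s)$ is the probability the other agents assigned, at collection time, to the action they actually took, and that the same probability can be re-evaluated under their current policy. The callables `q_value`, `td_target`, and `current_other_prob` are hypothetical placeholders, not an API from the cited paper.

```python
def importance_weighted_loss(batch, q_value, td_target, current_other_prob):
    """Mean squared TD error, each sample weighted by
    (other agents' action probability now) / (probability recorded at collection)."""
    total = 0.0
    for s, u_a, r, recorded_prob, s_next in batch:
        weight = current_other_prob(s) / recorded_prob
        td_error = td_target(r, s_next) - q_value(s, u_a)
        total += weight * td_error ** 2
    return total / len(batch)


# Toy batch of augmented transitions (s, u_a, r, recorded probability, s').
batch = [("s0", 0, 1.0, 0.5, "s1"), ("s0", 1, 0.0, 0.25, "s1")]
loss = importance_weighted_loss(
    batch,
    q_value=lambda s, u: 0.0,                # untrained Q-function
    td_target=lambda r, s_next: r,           # terminal next state, so target = r
    current_other_prob=lambda s: 0.5,        # other agents' current probability
)
```

Samples whose recorded probability is close to the current one get a weight near 1; samples generated by a policy the other agents have since moved away from are down-weighted.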


##### Multi-Agent Fingerprints

The idea behind hyper Q-Learning (https://www.semanticscholar.org/paper/Extending-Q-Learning-to-General-Adaptive-Systems-Tesauro/810837f7308363fcc43e7673e13c5aaa3df65968) is to augment each agent's state space with an estimate of the other agents' policies computed via Bayesian inference.
This reduces each agent's learning problem to a standard single-agent problem in a stationary but much larger environment.
The main issue with hyper Q-Learning is that it increases the dimensionality of the Q-function, especially if the other agents' policies are high-dimensional neural networks.  
A naive way to combine hyper Q-Learning with deep reinforcement learning would be to include the weights of the other agents' networks in the observation function: the network representing the Q-function would take as input the agent's own state together with those weights.
However, the weight vectors of such networks are far too large to serve as input to the Q-function.  
The sequence of policies that generated the data in the agent's buffer can be thought of as following a single trajectory through the high-dimensional policy space.
To stabilise experience replay, the agent only needs to know, for each state in its buffer, where along that trajectory the state originated.  
How could we design a low-dimensional fingerprint containing that information?
For example, the agent's input could include the training iteration number as well as the exploration rate.
These two elements carry information about the policies of the other agents: the exploration rate is annealed smoothly throughout training and is correlated with performance.
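
A minimal sketch of such a fingerprint, assuming a linear annealing schedule for the exploration rate and an iteration counter scaled to $[0, 1]$. Both choices, as well as the helper names `linear_epsilon` and `add_fingerprint`, are illustrative assumptions.

```python
def linear_epsilon(iteration, max_iterations, start=1.0, end=0.05):
    """Exploration rate annealed linearly from `start` to `end`."""
    fraction = min(iteration / max_iterations, 1.0)
    return start + fraction * (end - start)


def add_fingerprint(observation, iteration, epsilon, max_iterations):
    """Append the fingerprint (scaled iteration, exploration rate) to the observation."""
    return list(observation) + [iteration / max_iterations, epsilon]


# Every stored observation carries the fingerprint of the moment it was collected.
replay_buffer = []
for iteration in (0, 500, 1000):
    eps = linear_epsilon(iteration, max_iterations=1000)
    replay_buffer.append(add_fingerprint([0.2, -1.0], iteration, eps, max_iterations=1000))
```

The same raw observation `[0.2, -1.0]` now appears three times with different fingerprints, so the Q-network can tell old experience from recent experience.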


#### Lenient-DQN
https://arxiv.org/pdf/1707.04402.pdf

### Centralized Critic Techniques
With this architecture agents are jointly modelled and a common centralized policy for all the agents is trained. The input of the network is the concatenation of all the agents' observations and the output is the combination of all the actions.
The main issue with this type of architecture is the size of the input and output spaces: the joint action space grows exponentially with the number of agents.
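
To make the size of these spaces concrete, the following sketch concatenates per-agent observations and enumerates the joint actions. The helper names and the toy sizes (3 agents, 4 actions each) are assumptions chosen for illustration.

```python
from itertools import product


def joint_observation(observations):
    """Concatenate per-agent observation vectors into one network input."""
    return [x for obs in observations for x in obs]


def joint_actions(n_agents, n_actions):
    """Enumerate every joint action; there are n_actions ** n_agents of them."""
    return list(product(range(n_actions), repeat=n_agents))


central_input = joint_observation([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
outputs = joint_actions(n_agents=3, n_actions=4)
```

With only 3 agents and 4 actions each, a centralized network already needs one output per element of `outputs`, and each additional agent multiplies that count by 4.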

### Opponent Modelling
https://arxiv.org/pdf/1609.05559.pdf

To stabilize training, it can also help to model the intentions and policies of the other agents.
We use $a$ to denote the action of the agent we control and $o$ to denote the joint action of all other agents. The joint policy of the secondary agents is denoted $\pi^o$.  
A Deep Reinforcement Opponent Network is:
- A Q-Network that evaluates actions for a state
- An opponent network that learns a representation of $\pi^o$

Two different ways to combine the two networks are described in the article:
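- DRON-concat concatenates the hidden representations of the two networks and predicts the Q-values from the result
- DRON-MoE applies a Mixture-of-Experts model: several expert networks predict Q-values, and the opponent representation determines how they are weighted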
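
The two ways of combining the networks can be sketched with single linear layers standing in for the real networks. This is a simplified illustration under that assumption; the function names, weights, and feature vectors are made up for the example.

```python
import math


def linear(weights, x):
    """Matrix-vector product with weights given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dron_concat(h_state, h_opp, w_out):
    # Q-values are computed from the concatenated hidden representations.
    return linear(w_out, h_state + h_opp)


def dron_moe(h_state, h_opp, expert_weights, w_gate):
    # Each expert predicts Q-values from the state features;
    # the opponent features decide how much each expert contributes.
    gate = softmax(linear(w_gate, h_opp))
    expert_q = [linear(w, h_state) for w in expert_weights]
    n_actions = len(expert_q[0])
    return [sum(g * q[a] for g, q in zip(gate, expert_q)) for a in range(n_actions)]


h_state, h_opp = [1.0, 2.0], [0.5]
q_concat = dron_concat(h_state, h_opp, w_out=[[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]])
q_moe = dron_moe(
    h_state,
    h_opp,
    expert_weights=[[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]],
    w_gate=[[0.0], [0.0]],  # zero gate weights give both experts equal weight
)
```

In the concat variant the opponent features enter the Q-value computation directly, whereas in the mixture variant they only select among experts that each look at the state alone.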

### Communication
https://arxiv.org/pdf/1602.02672.pdf  

The different agents can communicate to exchange information about their observations and actions.
This article presents Deep Distributed Recurrent Q-Networks, an architecture where all the agents share the same hidden layers and learn to communicate to solve riddles.
Independent Q-Learning assumes that each agent can fully observe the state of the environment.  
This article focuses on settings that are both partially observable and multi-agent. Deep recurrent Q-network architectures had been used to address single-agent, partially observable settings, but not settings that are both multi-agent and partially observable.  

Three modifications are introduced:
- Last-action inputs: each agent receives its previous action as input at the next time step, so that agents can approximate their action-observation history.
- Inter-agent weight sharing: a single network is learned and used by all agents, which enables fast learning. The agents can still behave differently because they receive different observations and because each agent also receives its own index $i$ as input.
- Disabling experience replay: when several agents learn at the same time, the non-stationarity of the environment can make stored experience obsolete or misleading.

So the Deep Distributed Recurrent Q-network learns a Q-function of the form $Q(o_t, h_{t-1}, i, a_{t-1}, a_t)$ with $o_t$ the observation (specific to this agent), $h_{t-1}$ the previous internal state (the hidden state of the network), $i$ the agent index, $a_{t-1}$ the previous action (which is a portion of history) and $a_t$ the action whose value the Q-network estimates.
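
A small sketch of how the per-step input of such a shared network could be assembled. One-hot encodings for the agent index and the previous action are an assumption made here for illustration, and `ddrqn_input` is a hypothetical helper; the hidden state $h_{t-1}$ is not part of this vector because the recurrent layer carries it.

```python
def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]


def ddrqn_input(observation, agent_index, last_action, n_agents, n_actions):
    """Concatenate o_t, the agent index i, and the previous action a_{t-1}."""
    return (
        list(observation)
        + one_hot(agent_index, n_agents)
        + one_hot(last_action, n_actions)
    )


# The same shared network would see different inputs for different agents.
x = ddrqn_input([0.0, 1.0], agent_index=1, last_action=2, n_agents=3, n_actions=4)
```

Two agents with identical observations and previous actions still produce different input vectors, which is what lets one set of weights express different behaviours.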

See Deep Recurrent Q-Learning https://arxiv.org/pdf/1507.06527.pdf  
Partially Observable Markov Decision Processes describe environments in which the agent receives only a limited observation of the underlying state.  
To deal with partially observable worlds the idea is to give the agent a capacity for temporal integration of observations.  
This can be done with a recurrent block in the network. The block maintains a hidden state that is updated at every time step and fed back into itself, so the network retains information about what has come before.
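
The feedback of the hidden state can be illustrated with a plain tanh recurrent cell. This is a deliberately simple stand-in chosen for the example (recurrent Q-networks typically use a gated cell such as an LSTM); the weights below are hand-picked assumptions.

```python
import math


def rnn_step(x, h_prev, w_x, w_h):
    """One step of a plain tanh recurrent cell: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(wx_row, x))
            + sum(w * hi for w, hi in zip(wh_row, h_prev))
        )
        for wx_row, wh_row in zip(w_x, w_h)
    ]


w_x = [[1.0], [0.0]]            # only the first unit reads the observation
w_h = [[0.0, 0.0], [1.0, 0.0]]  # the second unit reads the first unit's previous value
h = [0.0, 0.0]
hidden_states = []
for observation in ([1.0], [0.0]):
    h = rnn_step(observation, h, w_x, w_h)
    hidden_states.append(h)
```

After the second step the current observation is zero, yet the second hidden unit is non-zero: it still carries a trace of the first observation, which is the temporal integration described above.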
