##### INF8250 – Reinforcement Learning - Fall 2024 - Final Project

# **Multi-Agent Reinforcement Learning**

### Members:

- Alexandre Fournier - 2147771

- Thomas Mousseau - 2149672


## Introduction

Multi-Agent Reinforcement Learning (MARL) extends the reinforcement learning (RL) framework to environments where multiple agents interact, each potentially learning and adapting simultaneously. In MARL, agents may cooperate, compete, or both, which makes each agent's learning problem harder because the other agents keep changing their behavior. Understanding MARL is crucial for developing systems where multiple autonomous entities need to operate in shared environments, such as swarm robotics, autonomous driving, and distributed control systems.


## 1. From Single-Agent MDP to Multi-Agent Markov Game

### Markov Decision Process (MDP)

In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment with the objective of maximizing the expected cumulative reward over time. The framework is defined as:

$$
\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)
$$

$$
\begin{align}
\mathcal{S} &: \text{ States of the environment} \\
\mathcal{A} &: \text{ Action set of the agent } \\
a \in \mathcal{A} &: \text{ The action taken by the agent in state } s \\
\mathcal{P}(s' \mid s, a) &: \text{ Transition probability for reaching state } s' \text{ from } s \text{ after action } a \\
r(s, a) &: \text{ Reward for taking action } a \text{ in state } s \\
\gamma &: \text{ Discount factor, determining the importance of future rewards }
\end{align}
$$

The agent's policy $\pi(a \mid s)$ gives the probability of selecting action $a$ in state $s$, while the value function $V(s)$ represents the expected discounted cumulative reward obtained by starting from $s$ and following $\pi$.
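
To make these definitions concrete, here is a minimal sketch of iterative policy evaluation on a tiny MDP. The two-state, two-action MDP, its numbers, and the helper name `evaluate_policy` are all made up for illustration and are not part of the project.

```python
# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# P[s][a][s2] is the probability of reaching s2 from s after action a.
P = {
    0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}
# r[s][a] is the immediate reward for taking a in s.
r = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}}
# pi[s][a] is the probability that the policy picks a in s.
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}


def evaluate_policy(P, r, pi, gamma, tol=1e-10):
    """Repeat the Bellman expectation backup until V stops changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {}
        for s in STATES:
            new_V[s] = sum(
                pi[s][a]
                * (r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in STATES))
                for a in ACTIONS
            )
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V


V = evaluate_policy(P, r, pi, GAMMA)
```

In state 1 this policy always takes action 1, collects a reward of 2, and stays in state 1, so $V(1) = 2 / (1 - \gamma) = 20$, which gives a quick sanity check on the result.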

### Markov Game (MG)

In the Markov Game (MG) formalization of reinforcement learning, multiple adaptive agents interact within a shared environment. Each agent aims to maximize its own expected cumulative reward, which generally depends on the actions of the other agents. A Markov game is defined by the tuple:

$$
\mathcal{M} = (\mathcal{N}, \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)
$$

$$
\begin{align}
\mathcal{N} = \{1, \dots, N\} &: \text{ Set of } N \text{ agents } \\
\mathcal{S} &: \text{ States of the shared environment} \\
\mathcal{A} = A_1 \times \dots \times A_N &: \text{ The joint action set, where each } A_i \text{ is the action set of agent } i \in \mathcal{N} \\
\mathbf{a} = (a_1, a_2, \dots, a_N) &: \text{ The joint action taken by all agents in state } s \\
\mathcal{P}(s' \mid s, \mathbf{a}) &: \text{ The transition probability function, returning the probability of transitioning to state } s' \\
&\quad \text{ given state } s \text{ and joint action } \mathbf{a} \\
\mathcal{R}_i(s, \mathbf{a}) &: \text{ The reward function of agent } i, \text{ mapping states and joint actions to rewards } \\
\mathcal{R} = \{\mathcal{R}_1, \dots, \mathcal{R}_N\} &: \text{ The set of reward functions } \\
\gamma &: \text{ Discount factor, determining the importance of future rewards }
\end{align}
$$
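
As a rough illustration of how this tuple maps onto code, the sketch below wraps it in a small container with a `step` method. The class name `MarkovGame`, its field names, and the two-agent coordination game are hypothetical choices made for this example only.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[int, ...]


@dataclass
class MarkovGame:
    """Container mirroring the tuple (N, S, A, P, R, gamma)."""

    n_agents: int
    states: List[int]
    action_sets: List[List[int]]  # action_sets[i] plays the role of A_i
    transition: Callable[[int, JointAction], Dict[int, float]]  # P(. | s, a)
    rewards: List[Callable[[int, JointAction], float]]  # rewards[i] is R_i
    gamma: float

    def step(self, s: int, joint_action: JointAction, rng: random.Random):
        """Sample s' from P(. | s, a) and return (s', (R_1, ..., R_N))."""
        dist = self.transition(s, joint_action)
        next_states = list(dist)
        weights = [dist[s2] for s2 in next_states]
        s_next = rng.choices(next_states, weights=weights)[0]
        agent_rewards = tuple(R(s, joint_action) for R in self.rewards)
        return s_next, agent_rewards


# Made-up coordination game: matching actions always leads to state 1.
def transition(s, a):
    return {1: 1.0} if a[0] == a[1] else {0: 0.9, 1: 0.1}


def shared_reward(s, a):
    return 1.0 if a[0] == a[1] else 0.0


game = MarkovGame(
    n_agents=2,
    states=[0, 1],
    action_sets=[[0, 1], [0, 1]],
    transition=transition,
    rewards=[shared_reward, shared_reward],
    gamma=0.9,
)

rng = random.Random(0)
s_next, agent_rewards = game.step(0, (1, 1), rng)
```

Note that `step` takes the whole joint action: no agent can predict its own reward or the next state from its own action alone.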

Each agent $i$ has its own policy $\pi_i(a_i \mid s)$. Assuming the agents select their actions independently given the state, the individual policies combine into a joint policy over joint actions:

$$
\boldsymbol{\pi}(\mathbf{a} \mid s) = \prod_{i=1}^N \pi_i(a_i \mid s).
$$
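
The product form can be sketched in a few lines. The function name `joint_policy`, the dictionary layout, and the action labels below are assumptions made for illustration.

```python
from itertools import product


def joint_policy(policies, s):
    """Return {joint_action: probability} for state s.

    policies[i][s] is a dict mapping agent i's actions to probabilities.
    """
    per_agent = [pi_i[s] for pi_i in policies]
    joint = {}
    for joint_action in product(*(d.keys() for d in per_agent)):
        prob = 1.0
        for d, a_i in zip(per_agent, joint_action):
            prob *= d[a_i]  # multiply the independent per-agent probabilities
        joint[joint_action] = prob
    return joint


# Two hypothetical agents in a single state 0.
pi_1 = {0: {"L": 0.25, "R": 0.75}}
pi_2 = {0: {"U": 0.5, "D": 0.5}}
joint = joint_policy([pi_1, pi_2], 0)
```

The resulting dictionary has one entry per joint action, so its size grows as the product of the individual action-set sizes.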

The value function of agent $i$ is similarly extended: it is the expected discounted return of agent $i$ when all agents act according to the joint policy $\boldsymbol{\pi}$:

$$
V_i(s) = \mathbb{E}_{\boldsymbol{\pi}} \left[ \sum_{t=0}^\infty \gamma^t \mathcal{R}_i(s_t, \mathbf{a}_t) \mid s_0 = s \right].
$$
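
As a sketch of how $V_i$ could be computed when the model is known, the code below applies the Bellman expectation backup under a joint policy to a made-up two-agent, two-state game. The game, the uniform policies, and all function names are illustrative assumptions.

```python
from itertools import product

GAMMA = 0.9
STATES = [0, 1]
ACTION_SETS = [[0, 1], [0, 1]]  # A_1 and A_2


def transition(s, a):
    """Made-up dynamics: matching actions lead to state 1, otherwise state 0."""
    return {1: 1.0} if a[0] == a[1] else {0: 1.0}


def reward(i, s, a):
    """Made-up general-sum rewards R_i(s, a)."""
    if i == 0:
        return 1.0 if (s == 1 and a[0] == a[1]) else 0.0
    return 1.0 if a[1] == 0 else 0.0


# Both agents act uniformly at random in every state.
policies = [{s: {0: 0.5, 1: 0.5} for s in STATES} for _ in range(2)]


def joint_prob(s, a):
    """Probability of joint action a in state s under independent policies."""
    p = 1.0
    for pi_i, a_i in zip(policies, a):
        p *= pi_i[s][a_i]
    return p


def evaluate_agent(i, tol=1e-10):
    """Iterate the Bellman expectation backup for agent i until convergence."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {}
        for s in STATES:
            total = 0.0
            for a in product(*ACTION_SETS):
                future = sum(p * V[s2] for s2, p in transition(s, a).items())
                total += joint_prob(s, a) * (reward(i, s, a) + GAMMA * future)
            new_V[s] = total
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V


V1 = evaluate_agent(0)
V2 = evaluate_agent(1)
```

The backup for agent $i$ averages over the joint action, so changing the other agent's policy changes $V_i$ even if agent $i$'s own policy stays fixed.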

In the case of two-player zero-sum games, where $\mathcal{R}_1(s, \mathbf{a}) = -\mathcal{R}_2(s, \mathbf{a})$, the value functions satisfy:

$$
V_1(s) = -V_2(s),
$$

indicating that the gain of one agent is the loss of the other. In cooperative settings, a single team value function may aggregate the individual values into a shared cumulative reward:

$$
V(s) = \sum_{i=1}^N V_i(s),
$$

capturing the collective outcome of all agents acting together.
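
Both relations can be checked numerically on a single-state game, where the value reduces to the expected stage reward divided by $1 - \gamma$. The sketch below assumes two agents with fixed mixed policies; the payoffs (matching pennies and a simple coordination game) and all names are chosen only for illustration.

```python
from itertools import product

GAMMA = 0.9
# Fixed mixed policies of the two agents over actions {0, 1} (made-up numbers).
pi_1 = {0: 0.7, 1: 0.3}
pi_2 = {0: 0.4, 1: 0.6}


def value(reward_fn):
    """Single-state game: V = E[R(a)] / (1 - gamma) under the joint policy."""
    expected = sum(
        pi_1[a1] * pi_2[a2] * reward_fn(a1, a2) for a1, a2 in product(pi_1, pi_2)
    )
    return expected / (1.0 - GAMMA)


# Zero-sum case (matching pennies): R_2 = -R_1.
def r1_zero_sum(a1, a2):
    return 1.0 if a1 == a2 else -1.0


def r2_zero_sum(a1, a2):
    return -r1_zero_sum(a1, a2)


V1_zs = value(r1_zero_sum)
V2_zs = value(r2_zero_sum)


# Cooperative case: both agents receive the same reward for coordinating.
def r_coop(a1, a2):
    return 1.0 if a1 == a2 else 0.0


V1_coop = value(r_coop)
V2_coop = value(r_coop)
V_team = V1_coop + V2_coop  # the summed team value
```

Under these assumptions the zero-sum values cancel exactly, while in the cooperative game both agents have the same value and the team value is their sum.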

