# Day 60: Multi-agent Reinforcement Learning

## Introduction

Welcome to Day 60! Today, we delve into one of the most exciting and challenging frontiers of reinforcement learning: **Multi-agent Reinforcement Learning (MARL)**. While we've explored single-agent RL in previous lessons, real-world scenarios often involve multiple agents interacting with each other and the environment simultaneously.

Imagine a fleet of autonomous vehicles navigating city traffic, a team of robots collaborating in a warehouse, or AI agents playing team-based games. In these scenarios, each agent must not only learn to achieve its own goals but also coordinate, compete, or cooperate with other agents. This introduces unique challenges such as non-stationarity (the environment changes as other agents learn), credit assignment (determining which agent contributed to success), and emergent behaviors.

MARL has transformative applications in:
- **Autonomous systems**: Self-driving cars coordinating at intersections
- **Robotics**: Multi-robot systems working together in manufacturing or search-and-rescue
- **Game AI**: Creating intelligent opponents and teammates in video games
- **Economics and trading**: Modeling markets with multiple strategic agents
- **Network optimization**: Distributed resource allocation and traffic routing

### Learning Objectives

By the end of this lesson, you will:
- Understand the fundamental differences between single-agent and multi-agent reinforcement learning
- Learn about cooperative, competitive, and mixed environments
- Explore key MARL algorithms including Independent Q-Learning (IQL) and Value Decomposition Networks
- Implement a simple multi-agent environment with cooperative agents
- Visualize agent interactions and emergent behaviors
- Understand the challenges and solutions in MARL systems

## Theory

### Single-Agent vs Multi-Agent RL

In **single-agent RL**, an agent interacts with a stationary (Markovian) environment: the transition and reward dynamics do not change while the agent learns. The agent's goal is to find an optimal policy $\pi^*$ that maximizes expected cumulative reward:

$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid \pi\right]$$
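
To make the objective concrete, here is a minimal sketch that evaluates the discounted sum for a single finite episode. The function name `discounted_return` and the sample rewards are illustrative assumptions; the expectation in $J(\pi)$ would be approximated by averaging this quantity over many episodes.

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of gamma**t * r_t over one finite episode."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Three steps with a -0.1 step cost, then a +10 terminal reward
episode = [-0.1, -0.1, -0.1, 10.0]
print(discounted_return(episode, gamma=0.95))  # approximately 8.2885
```

With $\gamma = 1$ the same episode would score 9.7, so discounting is what makes reaching the goal sooner more valuable.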

In **multi-agent RL**, multiple agents $\{1, 2, \ldots, n\}$ interact simultaneously. Each agent $i$ has:
- Its own policy $\pi_i$
- Its own observation or local state $s_i$ (or access to a shared global state $s$)
- Its own reward function $r_i$
- Actions that affect other agents and the environment

The key challenge: **The environment is non-stationary** from any single agent's perspective because other agents are learning and changing their policies!
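
A small sketch can illustrate non-stationarity. It assumes a hypothetical 2x2 coordination game (the matrix `R` and the two policies are made up for illustration) and shows that the value of agent 0's actions, and therefore its best response, shifts when agent 1's policy shifts, even though the game itself never changes.

```python
import numpy as np

# Hypothetical 2x2 coordination game: rows are agent 0's actions,
# columns are agent 1's actions, entries are the shared reward.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def action_values(reward_matrix, other_policy):
    """Expected reward of each of agent 0's actions given agent 1's mixed policy."""
    return reward_matrix @ other_policy

early = action_values(R, np.array([0.9, 0.1]))  # agent 1 mostly plays action 0
late = action_values(R, np.array([0.2, 0.8]))   # agent 1 has shifted to action 1

print("early:", early, "best action:", int(np.argmax(early)))
print("late: ", late, "best action:", int(np.argmax(late)))
```

From agent 0's point of view the "environment" gave different expected rewards for the same action at two points in training, which is exactly the moving target described above.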

### Types of Multi-Agent Environments

**1. Fully Cooperative (Team)**
- All agents share the same reward: $r_1 = r_2 = \ldots = r_n$
- Goal: Maximize team reward through coordination
- Examples: Multi-robot systems, distributed sensor networks
- Challenge: Credit assignment (which agent contributed to success?)

**2. Fully Competitive (Zero-Sum)**
- One agent's gain is another's loss: $\sum_i r_i = 0$
- Goal: Outperform opponents
- Examples: Board games (Chess, Go), competitive sports
- Challenge: Predicting and countering opponent strategies

**3. Mixed (General-Sum)**
- Agents have individual rewards with potential conflicts and alignments
- Goal: Balance self-interest with cooperation
- Examples: Traffic navigation, economics, social dilemmas
- Challenge: Finding equilibrium strategies
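
The three categories can be checked mechanically for small matrix games. The sketch below is an illustration only: `classify_game` is a hypothetical helper, and the payoff tables are standard textbook examples rather than anything defined in this lesson.

```python
import numpy as np

def classify_game(payoffs, tol=1e-9):
    """Classify a game from payoffs of shape (n_agents, ...joint actions...)."""
    payoffs = np.asarray(payoffs, dtype=float)
    # Fully cooperative: every agent receives the same reward everywhere
    if np.all(np.abs(payoffs - payoffs[0]) < tol):
        return "fully cooperative"
    # Zero-sum: rewards cancel out for every joint action
    if np.all(np.abs(payoffs.sum(axis=0)) < tol):
        return "zero-sum"
    return "general-sum"

coordination = [[[1, 0], [0, 1]],
                [[1, 0], [0, 1]]]
matching_pennies = [[[1, -1], [-1, 1]],
                    [[-1, 1], [1, -1]]]
prisoners_dilemma = [[[-1, -3], [0, -2]],
                     [[-1, 0], [-3, -2]]]

for name, game in [("coordination", coordination),
                   ("matching pennies", matching_pennies),
                   ("prisoner's dilemma", prisoners_dilemma)]:
    print(f"{name}: {classify_game(game)}")
```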

### Key Concepts in MARL

**Nash Equilibrium**: A set of policies $(\pi_1^*, \pi_2^*, \ldots, \pi_n^*)$ where no agent can improve its expected return by unilaterally changing its policy:

$$J_i(\pi_i^*, \pi_{-i}^*) \geq J_i(\pi_i, \pi_{-i}^*) \quad \forall i, \forall \pi_i$$
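
For a one-shot, two-player matrix game with deterministic policies, the inequality above can be checked by brute force. This is a minimal sketch under those assumptions; `pure_nash_equilibria` is a hypothetical helper, and it ignores mixed (randomized) equilibria entirely.

```python
import numpy as np
from itertools import product

def pure_nash_equilibria(payoff_a, payoff_b):
    """Return all (row, column) action pairs where neither player gains by deviating alone."""
    payoff_a = np.asarray(payoff_a, dtype=float)
    payoff_b = np.asarray(payoff_b, dtype=float)
    equilibria = []
    for a, b in product(range(payoff_a.shape[0]), range(payoff_a.shape[1])):
        # Row player cannot do better in column b; column player cannot do better in row a
        a_is_best = payoff_a[a, b] >= payoff_a[:, b].max()
        b_is_best = payoff_b[a, b] >= payoff_b[a, :].max()
        if a_is_best and b_is_best:
            equilibria.append((a, b))
    return equilibria

# Prisoner's dilemma: action 0 = cooperate, action 1 = defect
pd_a = [[-1, -3], [0, -2]]
pd_b = [[-1, 0], [-3, -2]]
print(pure_nash_equilibria(pd_a, pd_b))  # [(1, 1)]: mutual defection

# Coordination game: two equilibria, so agents must agree on one
coord = [[1, 0], [0, 1]]
print(pure_nash_equilibria(coord, coord))  # [(0, 0), (1, 1)]
```

Matching pennies has no pure equilibrium at all, which is one reason equilibrium-finding in MARL generally has to consider stochastic policies.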

**Joint Action Space**: The combined action space of all agents: $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \ldots \times \mathcal{A}_n$

The size grows exponentially with the number of agents, creating scalability challenges!
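
To see the growth directly, the sketch below enumerates the joint action space for agents with four actions each (the same count as the grid world later in this lesson). The helper name `joint_action_space` is an assumption for illustration.

```python
from itertools import product

def joint_action_space(n_agents, n_actions):
    """All joint actions when every agent has the same number of actions."""
    return list(product(range(n_actions), repeat=n_agents))

for n in (1, 2, 3, 5):
    print(f"{n} agent(s): {len(joint_action_space(n, 4))} joint actions")
# 1 agent(s): 4, 2 agent(s): 16, 3 agent(s): 64, 5 agent(s): 1024
```

A centralized learner that treats each joint action as a single action would need a Q-value for every one of these tuples in every state.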

**Communication**: Agents may communicate to share information:
- **Centralized Training, Decentralized Execution (CTDE)**: Agents share information during training but act independently during deployment
- **Message passing**: Agents send explicit messages to coordinate

### MARL Algorithms

**Independent Q-Learning (IQL)**
- Simplest approach: Each agent learns independently using standard Q-learning
- Treats other agents as part of the environment
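- Problem: From each agent's viewpoint the dynamics shift as the others learn, which breaks the stationarity assumption behind Q-learning's convergence guarantees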
- Works surprisingly well in practice for some problems
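
Written out, each agent $i$ applies the standard tabular Q-learning update to its own table, using only its own action $a_i$. The symbols $\alpha$ (learning rate) and $s'$ (next state) are notation introduced here:

$$Q_i(s, a_i) \leftarrow Q_i(s, a_i) + \alpha \left[ r + \gamma \max_{a_i'} Q_i(s', a_i') - Q_i(s, a_i) \right]$$

As a worked example, assume $\alpha = 0.1$, $\gamma = 0.95$, an all-zero table, and a step reward of $r = -0.1$. The update gives $Q_i(s, a_i) \leftarrow 0 + 0.1 \cdot (-0.1 + 0.95 \cdot 0 - 0) = -0.01$. The other agents' actions never appear in the formula; their influence reaches agent $i$ only through $r$ and $s'$.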

**Value Decomposition Networks (VDN)**
- Learns a joint action-value function as a sum of individual value functions:
$$Q_{tot}(s, \mathbf{a}) = \sum_{i=1}^n Q_i(s, a_i)$$
- Enables centralized training with decentralized execution
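- Guarantees consistency rather than optimality: because $Q_{tot}$ is a sum, each agent greedily maximizing its own $Q_i$ also maximizes $Q_{tot}$. The result is globally optimal only if the true joint value actually decomposes additively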
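
The link between the additive form and decentralized execution can be checked numerically. The sketch below uses random numbers as stand-ins for learned $Q_i$ values in one fixed state (an assumption; nothing is trained here) and compares per-agent greedy choices against a brute-force search over joint actions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

# Hypothetical per-agent utilities Q_i(s, a_i) for one fixed state s
q_individual = rng.normal(size=(n_agents, n_actions))

def q_tot(joint_action):
    """VDN-style joint value: the sum of each agent's own utility."""
    return sum(q_individual[i, a] for i, a in enumerate(joint_action))

# Decentralized choice: every agent maximizes its own Q_i
greedy_joint = tuple(int(a) for a in q_individual.argmax(axis=1))

# Centralized choice: brute-force search over all 4**3 joint actions
best_joint = max(product(range(n_actions), repeat=n_agents), key=q_tot)

print(greedy_joint, best_joint)  # the two tuples match
```

The match holds for the learned $Q_{tot}$. Whether that joint action is also optimal for the real task depends on how well a sum can represent the true joint value.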

**QMIX**
- Extension of VDN using a mixing network
- Allows more expressive value decomposition while maintaining monotonicity
- A widely used, strong baseline for cooperative tasks such as the StarCraft Multi-Agent Challenge (SMAC)
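
The monotonicity idea can be sketched with a tiny mixing function. In QMIX the mixing weights are produced by hypernetworks conditioned on the global state and constrained to be non-negative; the version below is a simplification that uses fixed random weights and random stand-ins for the $Q_i$ values, so treat every array here as an assumption.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_agents, n_actions, hidden = 3, 4, 8

q_individual = rng.normal(size=(n_agents, n_actions))  # hypothetical Q_i(s, a_i)

# abs() keeps the mixing weights non-negative, so Q_tot never decreases
# when any single Q_i increases (the monotonicity constraint).
w1 = np.abs(rng.normal(size=(n_agents, hidden)))
b1 = rng.normal(size=hidden)
w2 = np.abs(rng.normal(size=hidden))
b2 = rng.normal()

def elu(x):
    """Monotonically increasing activation."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1)

def q_tot(joint_action):
    """Nonlinear but monotonic mix of the chosen per-agent values."""
    qs = np.array([q_individual[i, a] for i, a in enumerate(joint_action)])
    return float(elu(qs @ w1 + b1) @ w2 + b2)

greedy_joint = tuple(int(a) for a in q_individual.argmax(axis=1))
best_joint = max(product(range(n_actions), repeat=n_agents), key=q_tot)
print(greedy_joint, best_joint)  # the two tuples match
```

Unlike a plain sum, this mix is nonlinear, yet per-agent greedy actions still maximize $Q_{tot}$ because every path from a $Q_i$ to the output has non-negative weight.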

**Multi-Agent Deep Deterministic Policy Gradient (MADDPG)**
- Actor-critic approach for continuous action spaces
- Centralized critics (see all agents' actions) + decentralized actors
- Handles competitive and mixed scenarios
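
The split between decentralized actors and centralized critics comes down to what each function is allowed to see. The sketch below shows only that input structure: linear maps stand in for neural networks, all dimensions and weights are made up, and no training step is included.

```python
import numpy as np

rng = np.random.default_rng(2)
n_agents, obs_dim, act_dim = 3, 4, 2

# Hypothetical linear actor and critic parameters, one set per agent
actor_w = [rng.normal(size=(obs_dim, act_dim)) for _ in range(n_agents)]
critic_w = [rng.normal(size=n_agents * (obs_dim + act_dim)) for _ in range(n_agents)]

def act(i, obs_i):
    """Decentralized actor: agent i sees only its own observation."""
    return np.tanh(obs_i @ actor_w[i])  # continuous action in [-1, 1]

def critic(i, all_obs, all_actions):
    """Centralized critic: agent i's Q-value uses everyone's observations and actions."""
    joint_input = np.concatenate([np.ravel(all_obs), np.ravel(all_actions)])
    return float(joint_input @ critic_w[i])

observations = rng.normal(size=(n_agents, obs_dim))
actions = np.stack([act(i, observations[i]) for i in range(n_agents)])
q_values = [critic(i, observations, actions) for i in range(n_agents)]

print("actions shape:", actions.shape)  # (3, 2)
print("one Q-value per agent:", len(q_values))
```

Because each critic conditions on all agents' actions, its learning target does not shift just because another agent changed its policy. The critics are only needed during training; at deployment each agent runs `act` alone.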

### Challenges in MARL

1. **Scalability**: Exponential growth of joint action space
2. **Credit assignment**: Determining individual contributions to team success
3. **Non-stationarity**: Environment changes as other agents learn
4. **Partial observability**: Agents may not see full state
5. **Communication**: How and when should agents communicate?
6. **Emergent behaviors**: Unexpected strategies that arise from interactions

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, deque
import random
from typing import List, Tuple, Dict
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)

# Configure matplotlib
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

In [None]:
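# Visualize training progress
# Note: run this cell after the training cell, which defines `rewards`, `lengths`,
# `success`, and `trained_agents`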
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Episode rewards over time
axes[0, 0].plot(rewards, alpha=0.3, label='Raw')
# Moving average for smoother visualization
window = 50
moving_avg = pd.Series(rewards).rolling(window=window, min_periods=1).mean()
axes[0, 0].plot(moving_avg, label=f'{window}-episode moving average', linewidth=2)
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Total Reward')
axes[0, 0].set_title('Episode Rewards During Training')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Episode lengths over time
axes[0, 1].plot(lengths, alpha=0.3, label='Raw')
moving_avg_length = pd.Series(lengths).rolling(window=window, min_periods=1).mean()
axes[0, 1].plot(moving_avg_length, label=f'{window}-episode moving average', linewidth=2)
axes[0, 1].set_xlabel('Episode')
axes[0, 1].set_ylabel('Episode Length')
axes[0, 1].set_title('Episode Length During Training')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Success rate over time (rolling window)
success_windows = pd.Series(success).rolling(window=100, min_periods=1).mean()

axes[1, 0].plot(success_windows, linewidth=2)
axes[1, 0].set_xlabel('Episode')
axes[1, 0].set_ylabel('Success Rate')
axes[1, 0].set_title('Success Rate (100-episode rolling average)')
axes[1, 0].set_ylim([0, 1.1])
axes[1, 0].axhline(y=0.5, color='r', linestyle='--', label='50% success')
axes[1, 0].axhline(y=0.8, color='g', linestyle='--', label='80% success')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Distribution of rewards (last 200 episodes)
last_rewards = rewards[-200:]
axes[1, 1].hist(last_rewards, bins=30, edgecolor='black', alpha=0.7)
axes[1, 1].axvline(x=np.mean(last_rewards), color='r', linestyle='--', 
                    linewidth=2, label=f'Mean: {np.mean(last_rewards):.2f}')
axes[1, 1].set_xlabel('Total Reward')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Reward Distribution (Last 200 Episodes)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nTraining Statistics:")
print(f"  Average reward (last 100 episodes): {np.mean(rewards[-100:]):.2f}")
print(f"  Average length (last 100 episodes): {np.mean(lengths[-100:]):.1f}")
print(f"  Success rate (last 100 episodes): {np.mean(success[-100:]):.2%}")
print(f"  Total states explored by Agent 0: {len(trained_agents[0].q_table)}")
print(f"  Total states explored by Agent 1: {len(trained_agents[1].q_table)}")

In [None]:
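# Visualize trained agents in action
# Note: requires `MultiAgentGridWorld` and `trained_agents` from the environment and training cells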
def visualize_episode(env, agents, max_steps=50):
    """
    Run one episode with trained agents and visualize their paths
    """
    state = env.reset()
    
    # Record trajectory
    trajectories = [[] for _ in range(env.n_agents)]
    for i in range(env.n_agents):
        trajectories[i].append(tuple(env.agent_positions[i]))
    
    done = False
    steps = 0
    total_reward = 0
    
    while not done and steps < max_steps:
        # Get actions from trained agents (no exploration)
        actions = [agent.get_action(state, training=False) for agent in agents]
        
        # Take step
        next_state, reward, done = env.step(actions)
        
        # Record positions
        for i in range(env.n_agents):
            trajectories[i].append(tuple(env.agent_positions[i]))
        
        state = next_state
        total_reward += reward
        steps += 1
    
    # Create visualization
    fig, ax = plt.subplots(figsize=(8, 8))
    
    # Draw grid
    for i in range(env.grid_size + 1):
        ax.axhline(i, color='gray', linewidth=0.5)
        ax.axvline(i, color='gray', linewidth=0.5)
    
    # Draw goals
    colors = ['red', 'blue', 'green', 'orange']
    for i, goal in enumerate(env.goal_positions):
        circle = plt.Circle((goal[1] + 0.5, goal[0] + 0.5), 0.3, 
                          color=colors[i], alpha=0.3, label=f'Goal {i}')
        ax.add_patch(circle)
    
    # Draw trajectories
    for i, traj in enumerate(trajectories):
        traj_array = np.array(traj)
        ax.plot(traj_array[:, 1] + 0.5, traj_array[:, 0] + 0.5, 
               marker='o', color=colors[i], linewidth=2, markersize=8,
               label=f'Agent {i}', alpha=0.7)
        
        # Mark start position
        ax.scatter(traj[0][1] + 0.5, traj[0][0] + 0.5, 
                  marker='s', s=200, color=colors[i], edgecolors='black', linewidth=2)
    
    ax.set_xlim(0, env.grid_size)
    ax.set_ylim(0, env.grid_size)
    ax.set_aspect('equal')
    ax.invert_yaxis()
    ax.set_xlabel('Column')
    ax.set_ylabel('Row')
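    # `done` is also True when the step limit is hit, so check the goal condition explicitly
    success = all(
        env.agent_positions[i] == env.goal_positions[i]
        for i in range(env.n_agents)
    )
    ax.set_title(f'Trained Agents Trajectory\nSteps: {steps}, Reward: {total_reward:.2f}, Success: {success}')
    ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
    plt.tight_layout()
    plt.show()
    
    return steps, total_reward, success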

# Run and visualize multiple episodes
print("="*60)
print("Visualizing Trained Agent Behavior")
print("="*60)

successes = 0
for episode in range(3):
    print(f"\nEpisode {episode + 1}:")
    env = MultiAgentGridWorld(grid_size=5, n_agents=2)
    steps, reward, success_flag = visualize_episode(env, trained_agents)
    if success_flag:
        successes += 1
    print(f"  Steps: {steps}, Reward: {reward:.2f}, Success: {success_flag}")

print(f"\nSuccess rate: {successes}/3 = {successes/3:.1%}")

In [None]:
# Create a simple Multi-Agent Grid World Environment
class MultiAgentGridWorld:
    """
    A cooperative grid world where multiple agents must reach goals.
    Agents receive a shared reward when all reach their goals.
    """
    
    def __init__(self, grid_size=5, n_agents=2):
        self.grid_size = grid_size
        self.n_agents = n_agents
        self.action_space = 4  # Up, Down, Left, Right
        self.actions = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
        self.action_names = {0: "Up", 1: "Down", 2: "Left", 3: "Right"}
        self.reset()
        
    def reset(self):
        """Reset environment to initial state"""
        # Place agents randomly
        self.agent_positions = []
        occupied = set()
        
        for i in range(self.n_agents):
            while True:
                pos = (np.random.randint(0, self.grid_size), 
                       np.random.randint(0, self.grid_size))
                if pos not in occupied:
                    self.agent_positions.append(list(pos))
                    occupied.add(pos)
                    break
        
        # Place goals randomly (different from agent positions)
        self.goal_positions = []
        for i in range(self.n_agents):
            while True:
                pos = (np.random.randint(0, self.grid_size), 
                       np.random.randint(0, self.grid_size))
                if pos not in occupied:
                    self.goal_positions.append(list(pos))
                    occupied.add(pos)
                    break
        
        self.steps = 0
        self.max_steps = 50
        return self._get_state()
    
    def _get_state(self):
        """Return current state (flattened positions)"""
        # State is all agent positions + all goal positions
        state = []
        for pos in self.agent_positions:
            state.extend(pos)
        for pos in self.goal_positions:
            state.extend(pos)
        return tuple(state)
    
    def step(self, actions):
        """
        Execute actions for all agents
        actions: list of action indices for each agent
        """
        self.steps += 1
        
        # Move each agent
        new_positions = []
        for i, action in enumerate(actions):
            current_pos = self.agent_positions[i]
            delta = self.actions[action]
            new_pos = [
                max(0, min(self.grid_size - 1, current_pos[0] + delta[0])),
                max(0, min(self.grid_size - 1, current_pos[1] + delta[1]))
            ]
            new_positions.append(new_pos)
        
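        # Resolve collisions: agents that target the same cell all stay put.
        # Simplification: agents can still swap cells, and with 3+ agents one can
        # step into the cell of an agent that was blocked this turn.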
        for i in range(len(new_positions)):
            collision = False
            for j in range(len(new_positions)):
                if i != j and new_positions[i] == new_positions[j]:
                    collision = True
                    break
            if not collision:
                self.agent_positions[i] = new_positions[i]
        
        # Calculate reward
        reward = self._calculate_reward()
        
        # Check if done
        done = self._is_terminal()
        
        return self._get_state(), reward, done
    
    def _calculate_reward(self):
        """Calculate shared reward"""
        # Check if all agents reached their goals
        all_at_goal = all(
            self.agent_positions[i] == self.goal_positions[i]
            for i in range(self.n_agents)
        )
        
        if all_at_goal:
            return 10.0  # Large positive reward for success
        
        # Small negative reward for each step (encourages efficiency)
        return -0.1
    
    def _is_terminal(self):
        """Check if episode should end"""
        # Episode ends if all agents at goals or max steps reached
        all_at_goal = all(
            self.agent_positions[i] == self.goal_positions[i]
            for i in range(self.n_agents)
        )
        return all_at_goal or self.steps >= self.max_steps
    
    def render(self):
        """Print current state"""
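        # '<U2' leaves room for two-character labels such as 'G0' or 'A1'
        # (dtype=str would truncate them to a single character)
        grid = np.full((self.grid_size, self.grid_size), '.', dtype='<U2')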
        
        for i, pos in enumerate(self.goal_positions):
            grid[pos[0], pos[1]] = f'G{i}'
        
        for i, pos in enumerate(self.agent_positions):
            if grid[pos[0], pos[1]].startswith('G'):
                grid[pos[0], pos[1]] = f'*{i}'  # Agent at goal
            else:
                grid[pos[0], pos[1]] = f'A{i}'
        
        print("\n".join([" ".join(row) for row in grid]))
        print()

# Test the environment
print("="*50)
print("Testing Multi-Agent Grid World Environment")
print("="*50)

env = MultiAgentGridWorld(grid_size=5, n_agents=2)
state = env.reset()

print(f"Grid size: {env.grid_size}x{env.grid_size}")
print(f"Number of agents: {env.n_agents}")
print(f"Initial state: {state}")
print("\nInitial configuration:")
env.render()

# Take a few random actions
print("Taking 3 random steps...")
for step in range(3):
    actions = [np.random.randint(0, 4) for _ in range(env.n_agents)]
    action_str = [env.action_names[a] for a in actions]
    print(f"\nStep {step + 1}: Actions = {action_str}")
    
    state, reward, done = env.step(actions)
    env.render()
    print(f"Reward: {reward:.2f}, Done: {done}")
    
    if done:
        break

print("Environment test completed!")

In [None]:
# Implement Independent Q-Learning (IQL)
class IndependentQLearningAgent:
    """
    Q-Learning agent that learns independently.
    Each agent maintains its own Q-table.
    """
    
    def __init__(self, agent_id, action_space, learning_rate=0.1, 
                 discount_factor=0.95, epsilon=1.0, epsilon_decay=0.995, 
                 epsilon_min=0.01):
        self.agent_id = agent_id
        self.action_space = action_space
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.q_table = defaultdict(lambda: np.zeros(action_space))
    
    def get_action(self, state, training=True):
        """Select action using epsilon-greedy policy"""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.action_space)
        else:
            return np.argmax(self.q_table[state])
    
    def update(self, state, action, reward, next_state, done):
        """Update Q-value using Q-learning update rule"""
        current_q = self.q_table[state][action]
        
        if done:
            target_q = reward
        else:
            target_q = reward + self.gamma * np.max(self.q_table[next_state])
        
        # Q-learning update
        self.q_table[state][action] = current_q + self.lr * (target_q - current_q)
    
    def decay_epsilon(self):
        """Decay exploration rate"""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

# Training function for IQL
def train_independent_q_learning(env, n_episodes=1000, verbose=True):
    """
    Train multiple agents using Independent Q-Learning
    """
    # Create one agent per environment agent
    agents = [
        IndependentQLearningAgent(
            agent_id=i, 
            action_space=env.action_space,
            learning_rate=0.1,
            discount_factor=0.95,
            epsilon=1.0,
            epsilon_decay=0.995,
            epsilon_min=0.01
        ) for i in range(env.n_agents)
    ]
    
    # Track metrics
    episode_rewards = []
    episode_lengths = []
    success_rate = deque(maxlen=100)
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        done = False
        steps = 0
        
        while not done:
            # Each agent selects its action
            actions = [agent.get_action(state) for agent in agents]
            
            # Environment step
            next_state, reward, done = env.step(actions)
            
            # Each agent updates its Q-table
            for agent in agents:
                agent.update(state, actions[agent.agent_id], reward, next_state, done)
            
            state = next_state
            total_reward += reward
            steps += 1
        
        # Decay epsilon for all agents
        for agent in agents:
            agent.decay_epsilon()
        
        # Track metrics
        episode_rewards.append(total_reward)
        episode_lengths.append(steps)
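        # A positive return implies success: the +10 goal reward outweighs
        # the accumulated step costs, which total at most 5 over 50 steps
        success_rate.append(1 if total_reward > 0 else 0)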
        
        # Print progress
        if verbose and (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            avg_length = np.mean(episode_lengths[-100:])
            recent_success = np.mean(success_rate)
            epsilon = agents[0].epsilon
            print(f"Episode {episode + 1}/{n_episodes} | "
                  f"Avg Reward: {avg_reward:.2f} | "
                  f"Avg Length: {avg_length:.1f} | "
                  f"Success Rate: {recent_success:.2%} | "
                  f"Epsilon: {epsilon:.3f}")
    
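    # Return the full per-episode success record. The deque above keeps only the
    # last 100 episodes, which is enough for the progress printout but not for plots.
    success_history = [1 if r > 0 else 0 for r in episode_rewards]
    return agents, episode_rewards, episode_lengths, success_history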

# Train the agents
print("="*60)
print("Training Independent Q-Learning Agents")
print("="*60)

env = MultiAgentGridWorld(grid_size=5, n_agents=2)
trained_agents, rewards, lengths, success = train_independent_q_learning(
    env, n_episodes=1000, verbose=True
)

print("\nTraining completed!")
print(f"Final success rate (last 100 episodes): {np.mean(success[-100:]):.2%}")
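
It helps to know how long the agents keep exploring under the schedule used above (`epsilon=1.0`, `epsilon_decay=0.995`, `epsilon_min=0.01`, one decay per episode). The sketch below works that out in closed form; both helper names are assumptions for illustration and are not part of the training code.

```python
import math

def epsilon_after(episodes, epsilon=1.0, decay=0.995, epsilon_min=0.01):
    """Exploration rate after a number of per-episode multiplicative decays."""
    return max(epsilon_min, epsilon * decay ** episodes)

def episodes_to_floor(epsilon=1.0, decay=0.995, epsilon_min=0.01):
    """Smallest number of decays after which epsilon sits at epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon) / math.log(decay))

for n in (100, 500, 1000):
    print(f"after {n} episodes: epsilon = {epsilon_after(n):.3f}")
print("floor reached after", episodes_to_floor(), "episodes")  # 919
```

With these defaults the floor is reached only around episode 919 of 1000, so the agents act mostly greedily for just the last few dozen episodes. If you train for more or fewer episodes, consider adjusting the decay rate to match.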

## Hands-On Activity

Now it's your turn to experiment with multi-agent reinforcement learning! In this activity, you'll:

1. **Modify the environment**: Change the grid size and number of agents
2. **Compare algorithms**: Try different learning rates and exploration strategies
3. **Analyze emergent behavior**: Observe how agents coordinate (or fail to coordinate)

### Task 1: Scale Up the Environment

Try training agents on a larger grid with more agents. What happens to the training time and success rate?

**Instructions:**
- Create a 7x7 grid with 3 agents
- Train for 2000 episodes
- Compare performance metrics

### Task 2: Implement a Competitive Scenario

Modify the reward function so agents compete for goals instead of cooperating.

**Instructions:**
- Change `_calculate_reward()` so only the first agent to reach any goal gets +10
- Other agents get -1 when another agent succeeds
- Train and observe competitive behavior
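
One possible starting point for this task is sketched below as a standalone function, so it can be tested before being wired into the environment. The name `competitive_rewards`, the tie-breaking rule, and the choice to keep the -0.1 step cost are all assumptions, not part of the original class.

```python
def competitive_rewards(agent_positions, goal_positions, step_cost=-0.1):
    """Per-agent rewards: +10 for the agent that reaches any goal, -1 for the rest."""
    goals = {tuple(g) for g in goal_positions}
    at_goal = [tuple(p) in goals for p in agent_positions]
    if not any(at_goal):
        # Nobody has scored yet: everyone pays the usual step cost
        return [step_cost] * len(agent_positions)
    winner = at_goal.index(True)  # assumption: ties go to the lowest agent index
    return [10.0 if i == winner else -1.0 for i in range(len(agent_positions))]

print(competitive_rewards([[0, 0], [1, 1]], [[2, 2], [3, 3]]))  # [-0.1, -0.1]
print(competitive_rewards([[0, 0], [3, 3]], [[2, 2], [3, 3]]))  # [-1.0, 10.0]
```

Note that this returns a list rather than a single shared number. To use it, the training loop would need to pass each agent its own entry (for example `rewards[agent.agent_id]`) when calling `update`, and the episode should end as soon as any agent scores.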

### Task 3: Add Communication

Extend the agents to share information about their intended actions or discovered states.

**Challenge:** Implement a simple message-passing mechanism where agents can observe each other's last action.
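
A minimal way to approach the challenge is to append the other agents' previous actions to the state each agent uses as its Q-table key. The sketch below assumes the state is the tuple returned by `_get_state()`; the helper `augment_state` and the `-1` placeholder for "no action yet" are hypothetical.

```python
def augment_state(state, last_actions, agent_id, no_action=-1):
    """Append the other agents' previous actions to the shared state tuple."""
    others = tuple(
        int(a) if a is not None else no_action
        for i, a in enumerate(last_actions)
        if i != agent_id  # an agent does not need a message about itself
    )
    return tuple(state) + others

state = (0, 0, 4, 4, 2, 2, 3, 3)  # two agent positions, then two goal positions
print(augment_state(state, [None, None], agent_id=0))  # first step: (..., -1)
print(augment_state(state, [2, 3], agent_id=0))        # agent 0 sees agent 1's action 3
print(augment_state(state, [2, 3], agent_id=1))        # agent 1 sees agent 0's action 2
```

Since the result is still a hashable tuple, it works as a key for the existing `defaultdict` Q-tables. The trade-off is a larger state space: each observed action multiplies the number of distinct states by up to five (four actions plus the placeholder).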

In [None]:
# Activity Solution: Task 1 - Scale up the environment

print("="*60)
print("Task 1: Training on Larger Environment (7x7 grid, 3 agents)")
print("="*60)

# Create larger environment
large_env = MultiAgentGridWorld(grid_size=7, n_agents=3)

# Train agents
large_agents, large_rewards, large_lengths, large_success = train_independent_q_learning(
    large_env, n_episodes=2000, verbose=True
)

# Compare results
print("\n" + "="*60)
print("Comparison: Original vs. Larger Environment")
print("="*60)

# Original environment (5x5, 2 agents)
orig_success = np.mean(success[-100:])
orig_steps = np.mean(lengths[-100:])

# Larger environment (7x7, 3 agents)
large_success_rate = np.mean(large_success[-100:])
large_steps = np.mean(large_lengths[-100:])

comparison_df = pd.DataFrame({
    'Metric': ['Grid Size', 'Num Agents', 'Success Rate', 'Avg Steps'],
    'Original': ['5x5', 2, f'{orig_success:.2%}', f'{orig_steps:.1f}'],
    'Larger': ['7x7', 3, f'{large_success_rate:.2%}', f'{large_steps:.1f}']
})

print(comparison_df.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Success rate comparison
axes[0].bar(['Original\n(5x5, 2 agents)', 'Larger\n(7x7, 3 agents)'], 
           [orig_success, large_success_rate], 
           color=['steelblue', 'coral'])
axes[0].set_ylabel('Success Rate')
axes[0].set_title('Success Rate Comparison')
axes[0].set_ylim([0, 1.0])
axes[0].grid(True, alpha=0.3)

# Average steps comparison
axes[1].bar(['Original\n(5x5, 2 agents)', 'Larger\n(7x7, 3 agents)'], 
           [orig_steps, large_steps], 
           color=['steelblue', 'coral'])
axes[1].set_ylabel('Average Steps')
axes[1].set_title('Average Episode Length Comparison')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservations:")
print(f"  - Larger environment is more complex (49 vs 25 cells)")
print(f"  - More agents increase coordination challenge")
if orig_success > 0:
    print(f"  - Success rate: {large_success_rate/orig_success:.2f}x of original")
else:
    print("  - Success rate: original is 0%, so the ratio is undefined")
print(f"  - Episode length: {large_steps/orig_steps:.2f}x of original")

## Key Takeaways

Congratulations on completing this lesson on Multi-agent Reinforcement Learning! Here are the key concepts to remember:

- **Multi-agent systems are fundamentally different from single-agent RL** due to non-stationarity. As agents learn, they change the environment for each other, creating a moving target for learning.

- **Three main types of multi-agent environments**: Cooperative (shared rewards), competitive (zero-sum), and mixed (general-sum). Each requires different solution approaches and presents unique challenges.

- **Independent Q-Learning (IQL) is surprisingly effective** despite treating other agents as part of the environment. It's simple to implement and often works well in practice, making it a good baseline approach.

- **Value decomposition methods (VDN, QMIX)** address the credit assignment problem by learning how individual agent values contribute to team performance, enabling centralized training with decentralized execution.

- **Scalability is a major challenge**: The joint action space grows exponentially with the number of agents ($|\mathcal{A}_i|^n$ joint actions for $n$ agents with equally sized action sets), making naive approaches intractable for large teams.

- **Communication and coordination are critical** in cooperative settings. Agents must learn not just individual skills but also how to work together effectively.

- **Emergent behaviors** often arise in multi-agent systems that weren't explicitly programmed, such as collaborative strategies or competitive tactics.

- **The exploration-exploitation tradeoff** is even more critical in MARL, as agents must explore both their own action space and the space of joint policies.

### Practical Implications

- Start with Independent Q-Learning as a baseline before trying more complex algorithms
- Consider whether your problem is cooperative, competitive, or mixed when choosing algorithms
- Use centralized training with decentralized execution (CTDE) when possible
- Monitor both individual agent performance and team-level metrics
- Be prepared for longer training times compared to single-agent RL

## Further Resources

### Research Papers
- **[QMIX: Monotonic Value Function Factorisation for Decentralised Deep Reinforcement Learning](https://arxiv.org/abs/1803.11485)** - Key paper on value decomposition for cooperative MARL
- **[Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](https://arxiv.org/abs/1706.02275)** - Introduction to MADDPG algorithm
- **[The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games](https://arxiv.org/abs/2103.01955)** - Recent insights on simple approaches

### Libraries and Frameworks
- **[PettingZoo](https://pettingzoo.farama.org/)** - Standardized multi-agent RL environments from Farama Foundation
- **[RLlib](https://docs.ray.io/en/latest/rllib/index.html)** - Scalable RL library with excellent multi-agent support
- **[EPyMARL](https://github.com/uoe-agents/epymarl)** - Extended PyMARL codebase for cooperative multi-agent RL research

### Tutorials and Courses
- **[Spinning Up in Deep RL](https://spinningup.openai.com/en/latest/spinningup/extra_pg_proof2.html)** - OpenAI's educational resource on deep RL; the linked page is a policy-gradient derivation, useful as single-agent background
- **[UCL Course on Reinforcement Learning](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)** - Lectures by David Silver covering the single-agent foundations that MARL builds on
- **[Cooperative Multi-Agent Reinforcement Learning Tutorial](https://sites.google.com/view/cmarl)** - Comprehensive tutorial from NeurIPS

### Books
- **"Multi-Agent Reinforcement Learning: Foundations and Modern Approaches" by Stefano V. Albrecht, Filippos Christianos, and Lukas Schäfer** - Comprehensive textbook covering theory and algorithms
- **"Game Theory" by Drew Fudenberg and Jean Tirole** - Essential background on strategic interactions

### Environments for Practice
- **StarCraft Multi-Agent Challenge (SMAC)** - Complex cooperative scenarios
- **Google Research Football** - Multi-agent sports environment
- **Multi-Agent Particle Environment** - Simple scenarios for testing algorithms