
# **Disclaimer**
This assignment must be completed solely by the members of your group. Sharing of code between groups is strictly prohibited. However, you are allowed to discuss general solution approaches or share publicly available resources with members of other groups. Therefore, clearly indicate which public resources you consulted and/or copied code from. Any plagiarism between groups will result in the initiation of a fraud procedure with the director of education. Additionally, do not put the code of this assignment online, because of the license on the code.


# Introduction
Welcome to the third assignment of the Reinforcement Learning course! In this project, you will dive deep into the world of cooperative multi-agent reinforcement learning (MARL). Your mission is to implement, extend, and innovate upon the QMix algorithm to train a team of intelligent agents for the Pacman Capture the Flag challenge. You will control two blue Pacman agents, guiding them to work together to capture food while outsmarting their red opponents.

This assignment is divided into three distinct parts:
1. **Implement the QMix Algorithm**: You will begin by building the core components of the QMix architecture. This involves creating the individual agent networks and the crucial mixing network that enables centralized training. You will then implement the complete training loop and the QMix loss function to bring your agents to life. The goal here is to establish a solid, working implementation.
2. **Specialize in an Advanced Technique**: With a functional QMix agent, each student will choose one specific area to focus on for improvement. This allows you to explore a state-of-the-art MARL topic in depth. Your options will include implementing advanced mixing networks (such as QTran or QPLEX), sophisticated exploration strategies (such as count-based exploration), or enhancing individual agents with techniques from the Rainbow paper.
3. **Explore and Innovate**: In the final section, you will build upon the insights gained in Part 2. Based on your experiments, you will be able to identify a promising direction for further improvement and implement it. This is an open-ended challenge where you can explore advanced topics such as policy gradient methods (MAPPO), novel exploration techniques, generalization to random maps, or better observation representations for your agents.


## Tournaments: Put Your Agents to the Test
Throughout the assignment, you will have the opportunity to test your agents' skills in a series of round-robin tournaments against your peers:

Three Intermediate Tournaments: These serve as valuable checkpoints to evaluate your models and refine your training strategies. These are the deadlines for submitting your agents:
1.	Monday 30 November before 10 am.
2.	Monday 7 December before 10 am.
3.	Friday 12 December before 1 pm.

One Final Tournament: This determines the ultimate winner! The victorious team will earn a permanent place in the RL course Hall of Fame.

All tournaments will be held on the bloxCapture.lay map. For detailed information on submission guidelines and deadlines, please refer to the main assignment document. **It is important to note that tournament results will not impact your grade.** However, we would appreciate it if you reflect on the tournament results in your report, based on the video footage of the matches.

By the end of this assignment, you will have gained hands-on experience implementing and experimenting with a powerful MARL algorithm, preparing you to tackle complex multi-agent problems.

## The Pacman Environment: [PacMan, a capture the flag variant](https://ai.berkeley.edu/contest.html)

The environment you'll be working in is a cooperative, multi-agent variant of the classic Pacman game. It features multiple agents that must coordinate to achieve a shared objective. Each agent receives its own local observations and must act upon them, making this an ideal testbed for cooperative MARL research.
The core game logic is adapted from the Berkeley AI contest, originally developed at UC Berkeley. We encourage you to explore the source code to understand its mechanics; it is not too hard to understand if you want to delve a little deeper. To facilitate training, we have wrapped the game in an interface that follows the popular Gym API, making it straightforward to integrate with your deep reinforcement learning algorithms.

The provided code includes a display argument for visualizing gameplay. For the best experience, we recommend running the notebook on a local machine to render the game.

A separate document contains a detailed breakdown of the environment, including the observation space, action space, and reward function. Please review it carefully before you begin.

## Assignment submission

For the final submission, you will provide a report detailing your work across all three parts of the assignment. This report is a critical component of the project, as it allows you to document your journey and showcase what you have learned.

We place a high value on reflection. We want to see more than just final results as we want to understand your thought process. Explain why you made certain design choices, what challenges you encountered, and what your experiments (incl. the ones that failed) taught you about the algorithms you implemented. Your insights are just as important as the outcomes.

To communicate your findings effectively, please use plots and tables to visualize your results. A clear graph of win rates or training rewards often holds more information than a thousand words. While there is no hard page limit, we encourage you to be concise and clear. Focus on creating a well-structured and insightful analysis of your project.

We expect you to submit the following things for the final submission:

1. The report
2. The notebook
3. Any custom code you wrote
4. Your final agents for the tournament (in the format mentioned in the assignment description PDF)





# **Setup**

Before we dive into coding, let's make sure everything is set up correctly.

1. Install Dependencies

You'll need to install the following libraries to run the notebook. Run the cell below to install them:

*   PacMan Capture the Flag: a reinforcement learning environment.
*   Packages you'll use throughout the notebook.


In [1]:
!git init
!git clone https://student:vaAwWR2Kse-jkAMH2_U5@gitlab.ilabt.imec.be/emalomgr/rl-lab-3-pacman.git --branch student_version
!mv ./rl-lab-3-pacman/* ./
!rm -rf ./rl-lab-3-pacman/

Initialized empty Git repository in /home/drizzy/Downloads/assignment_3/assignment_3/.git/
Cloning into 'rl-lab-3-pacman'...
remote: Enumerating objects: 812, done.
remote: Counting objects: 100% (124/124), done.
remote: Compressing objects: 100% (121/121), done.
remote: Total 812 (delta 45), reused 0 (delta 0), pack-reused 688 (from 1)
Receiving objects: 100% (812/812), 19.26 MiB | 3.33 MiB/s, done.


2. Import and Install Necessary Python Libraries

Once the dependencies are installed, import the key libraries you’ll need throughout the notebook:

> Remark: if you want to run the notebook on your local machine you'll have to install the packages manually. You can use the `requirements.txt` file from the cloned repository and the [PyTorch documentation](https://pytorch.org/get-started/locally/) to install PyTorch (with CUDA support).


In [None]:
!uv pip install numpy
!uv pip install torch
!uv pip install matplotlib

import os
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
from gymPacMan import gymPacMan_parallel_env

device = "cpu"
if torch.cuda.is_available():
    device = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
print(f"Device: {device}")

Resolved 1 package in 475ms
Prepared 1 package in 5.84s
Installed 1 package in 28ms
 + numpy==2.3.5
Resolved 25 packages in 389ms

In [None]:
layout_name = 'tinyCapture.lay'                       # see 'layouts/' dir for other options
layout_path = os.path.join('layouts', layout_name)
env = gymPacMan_parallel_env(layout_file=layout_path, # see class def for options
                             display=False,
                             reward_forLegalAction=True,
                             defenceReward=False,
                             length=299,
                             enemieName = 'randomTeam',
                             self_play=False,
                             random_layout = False)
env.reset()

# **Section 1: QMix Implementation**

In this section, you will implement the QMix algorithm to control agents in the PacMan environment. QMix is a powerful algorithm in multi-agent reinforcement learning that allows for centralized training with decentralized execution (CTDE). The key idea behind QMix is to learn a mixing network that combines individual agent Q-values into a global Q-value, which allows agents to make coordinated decisions while still acting independently during execution. The original QMix paper can be found [here](https://arxiv.org/abs/1803.11485) and will come in handy during implementation of the architecture.

## QMix Theory Overview

QMix is a value-based multi-agent reinforcement learning algorithm designed for cooperative tasks. It addresses the challenge of decentralized control while maintaining a centralized training framework. The key idea is to learn individual Q-values for each agent and combine them into a global Q-value that represents the team's joint policy.

Core Concepts:

1.	Individual Q-Values: Each agent has a separate Q-network that predicts the Q-values for its actions based on its local observations.
2.	Global Q-Value: A mixer network aggregates the individual Q-values into a global Q-value, ensuring that the global Q-value is monotonic with respect to individual Q-values. This monotonicity ensures that maximizing the global Q-value aligns with maximizing the individual Q-values.
3.	Hypernetworks: QMix uses hypernetworks to generate the weights for the mixer network dynamically. These weights depend on the global state, allowing the mixer network to adapt its behavior based on the team's overall situation.
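
In the notation of the QMix paper, the monotonicity constraint from point 2 and the consistency it gives between the joint greedy action and the per-agent greedy actions can be written as follows. Here $n$ is the number of agents, $\mathbf{u}$ the joint action, and $\tau_a$ the input of agent $a$ (in this notebook, its current local observation):

$$
\frac{\partial Q_{tot}}{\partial Q_a} \ge 0 \quad \forall a \in \{1, \dots, n\}
\quad\Longrightarrow\quad
\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) =
\begin{pmatrix}
\arg\max_{u_1} Q_1(\tau_1, u_1) \\
\vdots \\
\arg\max_{u_n} Q_n(\tau_n, u_n)
\end{pmatrix}
$$

This is why each agent can act greedily on its own Q-values at execution time without access to the mixer.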


Step-by-Step implementation: You will be implementing QMix step by step, focusing on the following parts:

1.	Implement the individual agent Q-networks.
2.	Build the mixing network to combine individual Q-values.
3.	Set up the loss function and training loop.
4.	Train the agents in the PacMan environment.

Let's begin!

## 1.1   Agent Q-Network Implementation

Before implementing the QMix code, it is essential to have a solid baseline for comparison. You are provided with a working implementation of Independent Q-Learning (IQL). Moreover, you can actually run the notebook and add logging and visualization code (e.g. with WandB) right now, and use the performance of the IQL agent(s) as a reference, since it is guaranteed to reach the maximum score (i.e. the amount of food in the layout) on the tinyCapture layout.

**Your First Task**: Your first step is to study the provided code, run the IQL agent, and analyze its performance. Train it against random opponents on the smaller maps (tinyCapture.lay and smallCapture.lay). Observe its behavior and log its performance (e.g., win rate, score). This analysis will provide a crucial benchmark that will help you evaluate whether your QMix implementation is an improvement.

**Evaluate during training**: Track the learning progress of your agents to check whether they are learning or not. A good tool for this is WandB, where you can easily compare different strategies during training.
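
As a starting point for such tracking, the sketch below keeps a sliding window of episode results and reports the average score and win rate. The class name `EpisodeTracker` and the rule "a positive final score counts as a win for the blue team" are assumptions made for illustration; adapt them to the score convention of your own training loop.

```python
from collections import deque


class EpisodeTracker:
    """Sliding window of episode results (hypothetical helper, not part of the provided code)."""

    def __init__(self, window=100):
        self.scores = deque(maxlen=window)
        self.wins = deque(maxlen=window)

    def add(self, score):
        # Assumption: a positive final score means the blue team won the episode.
        self.scores.append(float(score))
        self.wins.append(1.0 if score > 0 else 0.0)

    def summary(self):
        n = max(len(self.scores), 1)  # avoid division by zero before the first episode
        return {
            "avg_score": sum(self.scores) / n,
            "win_rate": sum(self.wins) / n,
        }


tracker = EpisodeTracker(window=3)
for final_score in [2, -1, 0, 4]:
    tracker.add(final_score)

# Only the last three episodes (-1, 0, 4) are still inside the window.
metrics = tracker.summary()
print(metrics)
```

The resulting dictionary can be passed to the logger of your choice once per episode, for example `wandb.log(metrics)` if you use WandB.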

In [None]:
class AgentQNetwork(nn.Module):
    def __init__(self, obs_shape, action_dim, hidden_dim=128):
        super(AgentQNetwork, self).__init__()

        # Convolutional layers
        self.conv1 = nn.Conv2d(obs_shape[0], 16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1)

        conv_output_shape = obs_shape[1] * obs_shape[2] * 32 # assuming obs shape (C, H, W)

        # Flatten layer
        self.flatten = nn.Flatten()

        # Fully connected layers
        self.fc1 = nn.Linear(conv_output_shape, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs):
        # Pass through convolutional layers
        x = F.relu(self.conv1(obs))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))

        # Flatten the output
        x = self.flatten(x)

        x = F.relu(self.fc1(x))

        # Output Q-values
        q_values = self.fc2(x)
        return q_values

	•	obs_shape: The shape of the agent’s local observation, given as (C, H, W).
	•	action_dim: The number of possible actions the agent can take.
	•	hidden_dim: The width of the fully connected hidden layer (default 128).

## 1.2 Mixing Network

The mixing network is responsible for combining the individual Q-values from each agent into a global Q-value. The mixing network ensures that the global Q-value is a monotonic function of each agent’s Q-value, which allows the system to maintain decentralized decision-making at runtime.

**Task:** Implement the mixing network.

The mixing network will take the Q-values of all agents as input and output a single global Q-value. Plot the results and compare with IQL.

Try to optimize the QMix parameters but don't spend too much time on this yet.


In [None]:
class SimpleQMixer(nn.Module):
    def __init__(self, n_agents, state_shape):
        super(SimpleQMixer, self).__init__()

        # Much simpler state processing

        # Single layer mixing network

        # Initialize close to equal weights

    def forward(self, agent_qs, states):

        # Simple positive weights

        # Simple weighted sum

        return q_tot.view(bs, -1, 1)


	•	state_shape: The shape of the global state (available during centralized training).
	•	n_agents: The number of agents, which determines the number of Q-values being mixed.
	•	Weights and biases: The weights and biases of the mixing network depend on the global state, ensuring that different states lead to different weightings of agent Q-values.
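
To see why non-negative mixing weights matter, the toy example below compares the joint greedy action with the per-agent greedy actions, once with `abs()` applied to the weights and once without. It is a sketch with made-up numbers and hypothetical function names (`mix`, `joint_argmax`). In QMix the raw weights and the bias would come from hypernetworks conditioned on the global state, whereas here they are fixed constants.

```python
from itertools import product


def mix(agent_qs, raw_weights, bias, monotonic=True):
    """Weighted sum of per-agent Q-values; abs() keeps the weights non-negative."""
    weights = [abs(w) for w in raw_weights] if monotonic else list(raw_weights)
    return sum(w * q for w, q in zip(weights, agent_qs)) + bias


def joint_argmax(q_tables, raw_weights, bias, monotonic):
    """Brute-force search over all joint actions for the highest mixed value."""
    joint_actions = product(*(range(len(q)) for q in q_tables))
    return max(
        joint_actions,
        key=lambda u: mix([q[a] for q, a in zip(q_tables, u)], raw_weights, bias, monotonic),
    )


# Toy Q-values: two agents with three actions each (made-up numbers).
q_tables = [[0.1, 0.9, -0.3], [1.2, 0.4, 0.5]]
raw_weights = [-0.7, 1.3]  # stand-in for the output of a hypernetwork
bias = 0.2

# What each agent would pick on its own at execution time.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in q_tables)

with_abs = joint_argmax(q_tables, raw_weights, bias, monotonic=True)
without_abs = joint_argmax(q_tables, raw_weights, bias, monotonic=False)

print("per-agent greedy:", greedy)
print("joint argmax with abs():", with_abs)
print("joint argmax without abs():", without_abs)
```

With non-negative weights the joint maximizer coincides with the decentralized greedy actions. With the negative weight left in place, the mixer prefers the first agent's worst action, so decentralized execution would no longer maximize the global value.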

## 1.3 Loss Function and Training Loop

The agents need to learn their Q-values by minimizing the Temporal Difference (TD) error. The loss is computed from the difference between the predicted Q-value (from the agent's Q-network) and the target Q-value (computed using the Bellman equation). Note that the Huber loss is used for more stability.
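
Written out, the per-agent target computed by the provided IQL code (with Double DQN action selection) is shown first, followed by a team-level counterpart that a QMix loss could use. The second pair is a sketch: it assumes the Double DQN action selection is carried over and that $Q_{tot}^{-}$ denotes a target copy of the mixer. Here $r$ is the shared team reward, $d$ the termination flag, $o_a$ the observation of agent $a$, $s$ the global state, and $Q^{-}$ a target network:

$$
y_a = r + \gamma\,(1 - d)\; Q_a^{-}\!\Big(o_a',\ \arg\max_{u} Q_a(o_a', u)\Big),
\qquad
\mathcal{L}_a = \mathrm{Huber}\big(Q_a(o_a, u_a),\ y_a\big)
$$

$$
y_{tot} = r + \gamma\,(1 - d)\; Q_{tot}^{-}\!\Big(s',\ Q_1^{-}(o_1', u_1^{*}), \dots, Q_n^{-}(o_n', u_n^{*})\Big),
\qquad
u_a^{*} = \arg\max_{u} Q_a(o_a', u)
$$

$$
\mathcal{L}_{tot} = \mathrm{Huber}\Big(Q_{tot}\big(s,\ Q_1(o_1, u_1), \dots, Q_n(o_n, u_n)\big),\ y_{tot}\Big)
$$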

**Task:** update the training loop for QMix.


In [None]:

def compute_td_loss(agent_q_networks, target_q_networks, batch, weights=None, gamma=0.99, lambda_=0.1):
    """
    Computes the per-agent TD losses using the Huber loss.

    NOTE: as provided, this implements Independent Q-Learning (IQL);
    extending it to QMix is part of the assignment.

    Args:
        agent_q_networks (list): List of Q-networks for each agent.
        target_q_networks (list): List of target Q-networks for each agent.
        batch (tuple): A batch of experiences (states, actions, rewards, next_states, dones).
        weights (torch.Tensor): Importance sampling weights (optional, currently unused).
        gamma (float): Discount factor for future rewards.
        lambda_ (float): Regularization factor for stability (currently unused).

    Returns:
        tuple: (loss_agent1, loss_agent2), one scalar loss tensor per agent.
    """
    states, actions, rewards, next_states, dones = batch

    # Convert to tensors and move to device
    states = torch.tensor(states, dtype=torch.float32).to(device)
    actions = torch.tensor(actions, dtype=torch.long).to(device)
    rewards = torch.tensor(rewards, dtype=torch.float32).to(device)
    next_states = torch.tensor(next_states, dtype=torch.float32).to(device)
    dones = torch.tensor(dones, dtype=torch.float32).to(device)

    # Current Q-values for each agent
    agent_q_values = []
    for agent_index, q_net in enumerate(agent_q_networks):
        q_vals = q_net(states[:, agent_index, :, :, :])  # Get Q-values for each agent
        agent_q_values.append(
            q_vals.gather(dim=1, index=actions[:, agent_index].unsqueeze(1)))  # Select Q-value for taken action
    agent_q_values = torch.cat(agent_q_values, dim=1)  # Shape: (batch_size, n_agents)

    # Target Q-values using Double DQN
    with torch.no_grad():
        # Get actions from current Q-networks
        next_agent_q_values = []
        for agent_index, (q_net, target_net) in enumerate(zip(agent_q_networks, target_q_networks)):
            next_q_vals = q_net(next_states[:, agent_index, :, :, :])  # Get Q-values from current network
            max_next_actions = next_q_vals.argmax(dim=1, keepdim=True)  # Greedy actions
            target_q_vals = target_net(next_states[:, agent_index, :, :, :])  # Get Q-values from target network
            max_next_q_vals = target_q_vals.gather(1, max_next_actions)
            done_mask = dones[:, 0, 0].unsqueeze(1)
            filtered_target_q_vals = max_next_q_vals * (1 - done_mask)

            next_agent_q_values.append(filtered_target_q_vals)  # Use target Q-values for selected actions
        next_agent_q_values = torch.cat(next_agent_q_values, dim=1)  # Shape: (batch_size, n_agents)

    # Independent Q-learning target for each agent (all members of the blue team receive the same reward)
    target_q = rewards[:, 0, 0].unsqueeze(1) + gamma * next_agent_q_values

    # Compute Huber loss, try also with MSE loss
    loss_fn = torch.nn.HuberLoss()

    loss_agent1 = loss_fn(agent_q_values[:, 0], target_q[:, 0])
    loss_agent2 = loss_fn(agent_q_values[:, 1], target_q[:, 1])

    return loss_agent1, loss_agent2

## 1.4 Training the QMix Algorithm

Now that you have defined the agent Q-networks, the mixing network, and the loss function, it's time to train the agents in the gym environment. Please note that the given IQL implementation uses soft updates, but feel free to use hard updates.

**Task:** Implement the training loop.

In [None]:
from collections import deque
import random

class ReplayBuffer:
    def __init__(self, buffer_size=10_000):
        self.buffer = deque(maxlen=buffer_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        experiences = random.sample(self.buffer, batch_size)

        # Restructure the batch into separate arrays for states, actions, rewards, next_states, and dones
        states = np.array([exp[0].cpu().numpy() for exp in experiences], dtype=np.float32)
        actions = np.array([exp[1] for exp in experiences], dtype=np.int64)
        rewards = np.array([exp[2] for exp in experiences])
        next_states = np.array([exp[3] for exp in experiences])
        dones = np.array([exp[4] for exp in experiences])

        return states, actions, rewards, next_states, dones

    def size(self):
        return len(self.buffer)


def epsilon_greedy_action(agent_q_network, state, epsilon, legal_actions):
    if random.random() < epsilon:
        # Explore: take a random action
        action = random.choice(legal_actions)
    else:
        state = torch.unsqueeze(state.clone().detach(), 0).to(device)
        q_values = agent_q_network(state).cpu().detach().numpy()
        action = np.random.choice(np.flatnonzero(q_values == q_values.max()))

    return action

def update_target_network(agent_q_networks, target_q_networks):
    for target, source in zip(target_q_networks, agent_q_networks):
        target.load_state_dict(source.state_dict())

def soft_update_target_network(agent_q_networks, target_q_networks, tau=0.01):
    for target, source in zip(target_q_networks, agent_q_networks):
        for target_param, source_param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(
                tau * source_param.data + (1 - tau) * target_param.data
            )

In [None]:
# NOTE: this is currently training Independent Q-Learning (IQL) and
#   it's up to you to make the necessary changes in order
#       to obtain the QMix architecture and training loop
def train_qmix(env, agent_q_networks, target_q_networks, replay_buffer, n_episodes=500,
               batch_size=32, gamma=0.95, lr=0.001):
    optimizer = optim.Adam([param for net in agent_q_networks for param in net.parameters()], lr=lr)

    epsilon = 1                  # Initial exploration probability
    epsilon_min = 0.01
    epsilon_decay = 0.99
    #target_update_frequency = 5 # Hint: in case you want to use hard updates instead
    steps_counter = 0
    legal_actions = [0, 1, 2, 3, 4]
    info = None
    agent_indexes = [1, 3]

    for episode in range(n_episodes):
        done = {agent_id: False for agent_id in agent_indexes}
        env.reset()
        info = None  # legal actions from the previous episode are stale after a reset
        blue_player1_reward = 0
        blue_player2_reward = 0
        score = 0
        while not all(done.values()):
            actions = [-1 for _, _ in enumerate(env.agents)]
            states = []
            for i, agent_index in enumerate(agent_indexes):
                obs_agent = env.get_Observation(agent_index)
                state = torch.tensor(obs_agent, dtype=torch.float32).to(device)
                states.append(state)
                action = epsilon_greedy_action(agent_q_networks[i],
                                               state,
                                               epsilon,
                                               legal_actions=(lambda: info["legal_actions"][agent_index] if info is not None else legal_actions)()
                                               )
                actions[agent_index] = action



            next_states, rewards, terminations, info = env.step(actions)
            score -= info["score_change"]
            done = {key: value for key, value in terminations.items() if key in agent_indexes}
            blue_player1_reward += rewards[1]
            blue_player2_reward += rewards[3]

            next_states_converted = []
            rewards_converted = []
            terminations_converted = []
            actions_converted = []

            for index in agent_indexes:
                next_states_converted.append(list(next_states.values())[index])
                rewards_converted.append(rewards[index])
                terminations_converted.append(terminations[index])
                actions_converted.append(actions[index])

            next_states_converted = torch.stack(next_states_converted)
            states_converted = torch.stack(states)
            rewards_converted = [rewards_converted]
            terminations_converted = [terminations_converted]
            replay_buffer.add(
                (states_converted, actions_converted, rewards_converted, next_states_converted, terminations_converted))

            if replay_buffer.size() >= batch_size:
                batch = replay_buffer.sample(batch_size)
                loss1, loss2 = compute_td_loss(agent_q_networks, target_q_networks, batch,
                                               gamma=gamma)
                # Zero gradients for all optimizers
                optimizer.zero_grad()

                # Backpropagate both agent losses in a single pass
                (loss1 + loss2).backward()

                # Update weights
                optimizer.step()

                # Soft update after each step
                soft_update_target_network(agent_q_networks, target_q_networks)

        steps_counter += 1
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
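
The schedule defined at the top of `train_qmix` multiplies epsilon by `epsilon_decay` once per episode. The short calculation below, which uses only the default values from the code above, estimates how many episodes pass before epsilon reaches `epsilon_min`. It is meant as a sanity check for your own schedule, not as a recommended setting.

```python
import math

epsilon_start, epsilon_min, epsilon_decay = 1.0, 0.01, 0.99

# Closed form: smallest k with epsilon_start * epsilon_decay**k <= epsilon_min
episodes_to_min = math.ceil(math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay))

# Cross-check by simulating the per-episode update used in the training loop
epsilon, simulated = epsilon_start, 0
while epsilon > epsilon_min:
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    simulated += 1

print(episodes_to_min, simulated)
```

Both numbers come out at 459, so with the default `n_episodes=500` the agents act almost greedily only during roughly the last 40 episodes. Keep this in mind when you change the number of episodes or the decay rate.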

In [None]:
n_agents = int(len(env.agents) / 2)
action_dim_individual_agent = 5  # North, South, East, West, Stop

obs_individual_agent = env.get_Observation(0)
obs_shape = obs_individual_agent.shape

agent_q_networks = [AgentQNetwork(obs_shape=obs_shape, action_dim=action_dim_individual_agent).to(device) for _ in
                    range(n_agents)]
target_q_networks = [AgentQNetwork(obs_shape=obs_shape, action_dim=action_dim_individual_agent).to(device) for _ in
                     range(n_agents)]

# Initialize target Q-networks with the same weights as the main Q-networks
update_target_network(agent_q_networks, target_q_networks)

# Initialize the replay buffer
replay_buffer = ReplayBuffer(buffer_size=10_000)

# NOTE: initially, this is just IQL!
train_qmix(env, agent_q_networks, target_q_networks, replay_buffer)

## 1.5 Reflection Questions

Evaluate your final results on smallCapture.lay and bloxCapture.lay (the tournament map) against random agents and answer the following questions:

*  How do your QMix agents improve over time during the training?
*  How does the performance of QMix compare to IQL?
*  Do you observe different roles for the agents within a team?
*  What other reflection questions can you think of yourself?






You've now implemented the QMix algorithm for the PacMan capture the flag environment!

# Section 2: Specializing in Advanced Techniques
With a functional QMix agent from Part 1, it's time to push the boundaries. In this section, each student will choose one of the following specialization tracks to implement a specific, advanced improvement.

The goal is to move beyond the basics and tackle some of the core challenges in multi-agent learning. All experiments in this section should be conducted on the larger and more complex bloxCapture.lay map.

## 2.1 Improvements

Each student must select one of the following three tracks.

### Student 1: Advanced Mixing Networks (QTran or QPLEX)
**The Challenge**: The standard QMix monotonic mixing network is effective but has representational limitations. More advanced architectures can model more complex team dynamics.

**Your Tasks**:
- Implement an Advanced Mixer: Choose and implement either QTran or QPLEX, which are more powerful alternatives to the standard QMix mixer.
- Integrate and Train: Replace the vanilla mixer in your existing architecture with your new implementation.
- Analyze and Compare: Rigorously analyze the performance, training time (time per 100k steps for example), training stability, and sample efficiency of your new agent. In your report, compare it directly against the QMix and IQL baseline from Part 1. Does the more expressive network lead to better coordination and higher win rates and average score?

### Student 2: Coordinated Exploration with Count-Based Methods
**The Challenge**: In complex environments, agents can easily fail to discover optimal strategies if they don't explore the state space effectively. This is especially true in MARL, where coordinated exploration is key.

**Your Tasks**:
- Implement a tabular Count-Based exploration bonus: Add an exploration bonus to the agents' rewards using a tabular count-based method. This bonus should be inversely related to how often a particular state has been visited.
- Start Simple: Begin by defining the "state" for the counting mechanism using only the agent's current position (from the observation's position layer).
- Experiment and Expand: Improve upon your initial implementation. Choose at least two expansions for your state definition. For example, you could include:
  - The enemy food layer.
  - The positions of your teammate.
  - The positions of enemies.
  - A combination of the above, or another creative idea.
- Analyze and Reflect: Investigate the impact of the exploration scaling factor (beta). How does it affect the trade-off between exploration and exploitation? Does your policy now act suboptimally because the bonus induces exploration behavior in it? If so, how could you avoid this? In your report, reflect on the effectiveness of your chosen state representations. Did the exploration bonus lead to better coverage of the map and the discovery of new strategies?
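
A minimal sketch of a tabular count-based bonus is given below. The class name `CountBonus` and the form `beta / sqrt(N(s))` are assumptions: the task only requires the bonus to be inversely related to the visit count, and this is one common choice. How to extract a hashable state key (for example an `(x, y)` position) from the observation depends on the environment and is not shown.

```python
from collections import defaultdict
from math import sqrt


class CountBonus:
    """Tabular visit counter returning beta / sqrt(N(s)) (hypothetical helper)."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state_key):
        # state_key must be hashable, e.g. an (x, y) position tuple.
        self.counts[state_key] += 1
        return self.beta / sqrt(self.counts[state_key])


counter = CountBonus(beta=0.1)
first = counter.bonus((3, 4))   # first visit to this position
second = counter.bonus((3, 4))  # second visit: the bonus shrinks
other = counter.bonus((5, 1))   # a new position receives the full bonus again
print(first, second, other)
```

The bonus would be added to the environment reward before the transition is stored or used for training. Expanding the state definition then amounts to building a richer `state_key`, at the cost of a larger and sparser count table.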

### Student 3: Enhancing the Individual Agents
**The Challenge**: The performance of any multi-agent system is limited by the capabilities of its individual agents. Techniques from single-agent RL can make each agent smarter and more efficient.

**Your Tasks**:
- Drawing inspiration from Assignment 1 and the Rainbow paper, integrate components of your choice into your individual agent Q-networks. We do not expect you to implement every component; rather, we want you to use your insights to choose what works best.
- Implement these components within your QMix framework. Tune the hyperparameters to make them work well with the new architecture and environment.
- Analyze how these improvements affect agent learning. Do they learn faster? Do they achieve a higher final performance? How do the new components interact with the multi-agent credit assignment problem? How do the components differ in performance compared to the single agent setting?

## 2.2 Combine your work
Combine your work and try to briefly optimize it. Compare the improvements from each student and show how they work together.

## 2.3 Analysis and Reflection
In your report for this section, you must provide a detailed analysis of your chosen specialization.

1. Describe the approach you implemented. Why did you choose this specific method (e.g., QPLEX over QTran)? What were the key steps in your implementation?

2. Present clear evidence of your agent's performance. Use plots (e.g., win rate vs. episodes, rewards, loss curves) and tables to compare your improved agent against the vanilla QMix (and IQL) baseline. Is it better? Is it more stable? Does it learn faster? Why is this the case?

3. Analyze your agent's behavior. Did it learn the strategies you expected? What are its remaining limitations? If you had more time, what would you try next to overcome these issues?

4. What were the biggest challenges you faced during implementation and training? What key insights did you gain about your chosen topic and about MARL in general?

# **Section 3: Experimenting with Your Own Improvements**

This section is an open challenge: based on the previous section, choose what you want to improve in your algorithm. The goal is for you to explore different techniques and report on your findings. We encourage you to try out different things. You may also go into depth on a single point, as long as your report shows the effort you put into it. You will implement and test this algorithm on the Pacman Capture the Flag environment on the map "bloxCapture.lay".

This section is open-ended, allowing you to experiment and think critically about the challenges and opportunities in multi-agent learning.

## 3.1 Suggested Directions

Here are some ideas to get you started. You may choose one of these or propose a completely new direction:
1.	Policy Gradient Approaches
* Implement a multi-agent Proximal Policy Optimization (PPO) or Actor-Critic algorithm.
* How do policy gradient methods handle coordination between agents compared to value-based methods like QMix?
2.	Counterfactual Multi-Agent Policy Gradients (COMA)
* Explore COMA, which uses counterfactual baselines to address the credit assignment problem.
* How does COMA adjust the contribution of each agent to the team’s reward?
3.	Modifications to the coordination
* Experiment with different mixer architectures or other coordination approaches.
4. Change the observation space
* Try to find a better representation of the environment, as this could speed up and improve learning.
5. Explore if your method works on random maps
* Try to train your method on random maps and see if it can learn a strategy that transfers to new maps.
* Reflect on whether your agents can learn, and possibly try to implement something to make it work better.
6. Go deeper into exploration methods
* Until now, you used a tabular version of Count-Based exploration. Now explore generalizable methods, for which you are allowed to use the package [RLeXplore](https://github.com/RLE-Foundation/RLeXplore); the code can be imported from [here](https://github.com/RLE-Foundation/rllte).
* Possibly think about heuristic reward shaping methods that teach certain behavior. However, keep in mind that this can have unintended outcomes.
* Do you see a difference when using Potential-Based Reward Shaping? For implementation details, check the assignment presentation given in class.
7. Create a training strategy
* Last year we didn't see any real improvements with curriculum learning or self-play, but maybe you are able to solve it.
8. Have an idea for something else? Go right ahead!
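
For the Potential-Based Reward Shaping idea mentioned in the list above, the shaping term has the form F(s, s') = gamma * Phi(s') - Phi(s). The sketch below illustrates it with a hypothetical potential, the negative Manhattan distance to the nearest food. The function names, the choice of potential, and the position and food tuples are all assumptions for illustration; the implementation presented in class may differ.

```python
def potential(position, food_positions):
    """Hypothetical potential: negative Manhattan distance to the nearest food."""
    if not food_positions:
        return 0.0
    x, y = position
    return -float(min(abs(x - fx) + abs(y - fy) for fx, fy in food_positions))


def shaped_reward(reward, phi_s, phi_next, gamma=0.99):
    """Adds the shaping term F(s, s') = gamma * Phi(s') - Phi(s) to the reward."""
    return reward + gamma * phi_next - phi_s


food = [(5, 2)]  # the food layout is assumed unchanged between the two states

# Step from (1, 2) to (2, 2): one cell closer to the food.
toward = shaped_reward(0.0, potential((1, 2), food), potential((2, 2), food))

# Step from (2, 2) to (1, 2): one cell further away from the food.
away = shaped_reward(0.0, potential((2, 2), food), potential((1, 2), food))

print(toward, away)
```

Moving toward the food yields a positive shaping term and moving away a negative one. When food is eaten between `s` and `s'`, each potential should be computed from the food positions of its own state.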

## 3.2 Reflection Questions

After implementing your chosen algorithm, reflect on the following:

1. Design choices
* Which things did you implement, and why?
* Did you get the results you expected?

2.	Performance
* How does your algorithm perform compared to standard QMix and your work from section 2?

3.	Strengths and Weaknesses
* What are the strengths of your chosen approach in the multi-agent Pacman environment?
* What are the weaknesses or challenges you encountered?

4.	Coordination
* Did your algorithm encourage better coordination between agents? Why or why not?

5.	Generalization
* How well do you think your algorithm generalizes to random maps? Did you have to change something to make this work?

6. Future work
* If you had more time, what would you improve or implement, and why?

What other reflection questions can you think of yourself?


