### Installing dependencies

In [1]:
import os
!pip install vmas
!pip install Pillow
!pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
!pip install ipython
!pip install autoreload
!pip install torch-geometric
!pip install wandb


### Importing dependencies

In [None]:
import threading
import copy
import random
import time

import torch
from PIL import Image
from torch import tensor, Tensor
from vmas import make_env

import wandb
from Cleaning import Scenario as CleaningScenario
from DeepQLearner import DeepQLearner
from LearningConfiguration import LearningConfiguration, NNFactory
from ReplayBuffer import ReplayBufferFactory
import Device

### Initializing wandb
Weights & Biases (wandb) is the online platform used to track the training process and store the results. Here it logs the per-agent reward, the loss, the epsilon value and the mean reward at every training step.

In [None]:
os.environ["WANDB_API_KEY"] = "0" #TODO: Add your API key here
run = wandb.init(project="vmas", reinit=True, config={
    "learning_rate": 0.0005,
    "architecture": "MLP",
    #"epochs": n_steps
})

### Defining the scenario
The CleaningScenario class defines the number of agents and targets, the size of the environment, the action space, the reward function and other parameters such as the number of epochs and the number of episodes (or steps).
The reward function in the "Cleaning Agents" scenario is designed to encourage agents to make optimal decisions while cleaning the targets. The reward function operates as follows:

1. When an agent's distance from a target falls below a predefined threshold (K), indicating successful cleaning, the agent is rewarded positively: $$Reward = 1 + \text{number of previously removed targets}$$

2. If an agent's LIDAR sensor does not detect any targets in its vicinity, it receives a negative reward (-1): $$Reward = -1$$

3. When an agent's LIDAR sensor detects a target and the agent is now closer to it than it was at its previous position, the agent is rewarded based on the function: $$Reward = \frac{-\text{LIDAR range}}{x}$$ (normalized to fall within the range of -0.5 to 0)

4. Conversely, if the agent is now farther from the target than it was at its previous position, indicating a suboptimal move, the agent is penalized based on the same function: $$Reward = \frac{-\text{LIDAR range}}{x}$$ (normalized to fall within the range of -1 to -0.5)

The strategic use of negative rewards, except for successful target removal, ensures that agents are motivated to make coordinated and efficient moves. It encourages agents to continually improve their performance by minimizing suboptimal actions and maximizing successful cleaning operations.
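
The four cases can be summarized in a small function. This is only a sketch of the rules as described above, not the code in the `Cleaning` scenario: the function name, its arguments, and in particular the `closeness` input are assumptions. `closeness` is a value in [0, 1] standing in for the normalized $-\text{LIDAR range}/x$ term, whose exact normalization is not specified here.

```python
def cleaning_reward(cleaned, removed_before, target_detected, moved_closer, closeness):
    """Sketch of the piecewise reward described above (hypothetical helper).

    closeness is an assumed stand-in for the normalized shaping term:
    1.0 means the agent is right next to the target, 0.0 means the target
    sits at the edge of the LIDAR range.
    """
    if cleaned:
        # Case 1: target removed, the bonus grows with previously removed targets
        return 1 + removed_before
    if not target_detected:
        # Case 2: no target within LIDAR range
        return -1.0
    penalty = 0.5 * (1.0 - closeness)  # assumed mapping, always in [0, 0.5]
    if moved_closer:
        # Case 3: approaching the target, reward in [-0.5, 0]
        return 0.0 - penalty
    # Case 4: moving away from the target, reward in [-1, -0.5]
    return -0.5 - penalty
```

Whatever the real normalization looks like, the key property is that only case 1 yields a positive value, so every step spent not cleaning costs the agent something.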


In [None]:
scenario_name = CleaningScenario()

# Scenario specific variables
n_agents = 1
n_targets = 8
num_envs = 1  # Number of vectorized environments
continuous_actions = True
device = Device.get()  # or cuda or any other torch device
n_steps = 1000  # Number of steps before returning done
n_epochs = 1000
dict_spaces = True  # Whether to return obs, rewards, and infos as dictionaries with agent names (by default they are lists of len # of agents)

dataset_size = 10000

frame_list = []  # For creating a gif
init_time = time.time()
step = 0

# Actions
speed = 0.5
north = tensor([0, -1*speed])
south = tensor([0, speed])
east = tensor([speed, 0])
west = tensor([-1*speed, 0])
#stop = tensor([0, 0])
ne = tensor([speed, -1*speed])
nw = tensor([-1*speed, -1*speed])
se = tensor([speed, speed])
sw = tensor([-1*speed, speed])

lidar_measure_shape = 50# * 2
pos_shape = 2
vel_shape = 2
tot_shape = lidar_measure_shape# + pos_shape + vel_shape

actions = [north, south, east, west, ne, nw, se, sw]

### Deep Q-Learning
My implementation of Deep Q Learning (DQL) for Multilayer Perceptron (MLP) networks is designed to accommodate the unique characteristics of the Vectorized Multi-Agent Simulator (VMAS) environment. VMAS employs additional dimensions to parallelize environments and agents, necessitating a flexible approach to handle complex tensor shapes.

##### State Representation

In this implementation, I employ a state representation that aligns with the structure of VMAS. Specifically, the state of each agent is represented as a tensor of shape `[50, 1]`. This choice is informed by the fact that each agent's LIDAR sensor in VMAS emits 50 rays, returning float distances. The positions of the individual agents within the environment are not considered relevant for decision-making since agents rely solely on LIDAR measurements, and the positions of targets can vary across different VMAS environments.

##### Tensor Shapes

To facilitate the training and execution of the MLP networks within VMAS, my network architecture accommodates tensors with the shape `[number of environments, number of agents, 50, 1]`. This tensor shape aligns with VMAS's parallelized structure, allowing the network to process data from multiple environments and agents simultaneously.

Adapting my DQL implementation to these batched tensor shapes lets it plug directly into VMAS's multi-agent framework, so agents can learn and act across the parallelized environments.

In [None]:
#learning_configuration = LearningConfiguration(update_each=math.floor(n_steps/3),dqn_factory=NNFactory(tot_shape,64,len(actions)))
learning_configuration = LearningConfiguration(update_each=200,dqn_factory=NNFactory(tot_shape,64,len(actions)))

dql = DeepQLearner(
    memory=ReplayBufferFactory(dataset_size),
    action_space=actions,
    learning_configuration=learning_configuration
)

#dql.load_snapshot("./-38-2023-09-21-23-55-46-agent-0")

### Utility functions
Utility functions to save GIFs of the training process and to check whether an environment is done. The cell also samples the target positions once, so every epoch uses the same layout.

In [None]:
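targets_pos = []

# Note: randint(-1, 1) returns -1, 0 or 1, so each coordinate is exactly 0 about a third of the time
for i in range(n_targets):
    targets_pos.append(tensor([random.random() * random.randint(-1, 1), random.random() * random.randint(-1, 1)], device=Device.get()))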

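def isOneEnvDone(info_array):
    """Return True if at least one environment has no active targets left."""
    # Renamed from `tensor` to avoid shadowing the imported torch.tensor function
    active_targets = info_array["agent_0"]["active_targets"]
    for i in range(num_envs):
        if active_targets[i] == 0:
            return True
    return False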

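def save_gif(frame_list, epoch):
    # A single environment is rendered, so one GIF is written per epoch
    gif_name = scenario_name.__class__.__name__ + "-env-0-epoch-" + str(epoch) + ".gif"
    frame_list[0].save(
        gif_name,
        save_all=True,
        append_images=frame_list[1:],  # remaining frames follow the first one
        duration=1,  # display time per frame, in milliseconds
        loop=0,  # loop forever
    )
    # Also store a snapshot of the learner for this epoch
    dql.snapshot(epoch, "0")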

### Training process
The core DQL algorithm can be summarized in several steps:

1. Initialize Q-network and target network with random weights.
2. Initialize replay buffer.
3. For each episode:
   1. Observe the current state `s`.
   2. Select an action `a` using $\epsilon$-greedy policy.
   3. Execute action `a`, observe reward `r` and next state `s'`.
   4. Store the experience `(s, a, r, s')` in the replay buffer.
   5. Sample a mini-batch from the replay buffer.
   6. Calculate the target Q-values using the target network.
   7. Calculate the loss between predicted and target Q-values using the Bellman equation.
   8. Update the Q-network weights using backpropagation.
   9. Periodically update the target network weights.
4. Repeat until convergence or a predetermined number of episodes.
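
The individual steps can be illustrated with a few framework-free helpers. This is a minimal sketch in plain Python, not the `DeepQLearner`, `ReplayBufferFactory`, or `LearningConfiguration` classes used in this notebook. The names `ReplayBuffer`, `epsilon_greedy`, and `td_target`, the discount factor `gamma`, and the terminal flag `done` are all assumptions made for illustration. The target follows the standard Q-learning form $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s') transitions (hypothetical helper)."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached
        self.memory = deque(maxlen=capacity)

    def record(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Bellman target built from the target network's Q-values for s'."""
    if done:
        # No future return after a terminal transition
        return reward
    return reward + gamma * max(next_q_values)
```

In a full loop, the squared difference between `td_target(...)` and the Q-network's prediction for the chosen action would serve as the loss, and the target network would be overwritten with the Q-network's weights at a fixed interval.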

In [None]:
for e in range(0, n_epochs):
    env = make_env(
        scenario=scenario_name,
        num_envs=num_envs,
        device=device,
        continuous_actions=continuous_actions,
        dict_spaces=dict_spaces,
        wrapper=None,
        seed=None,
        n_targets=n_targets,
        n_agents=n_agents,
        wandb=wandb,
        targets_pos=targets_pos
    )
    previous_states = {}
    for step in range(1, n_steps):
        print(f"Step {step}")
        actions = {}
        logs = {}
        for i, agent in enumerate(env.agents):
            lidar_measure = previous_states[agent.name]["lidar_measure"] if step > 1 else torch.zeros(num_envs, lidar_measure_shape).to(Device.get())
            positions = agent.state.pos
            velocities = agent.state.vel
            agent_actions_list = []
            for j in range(num_envs):
                state = lidar_measure[j]#torch.cat((positions[j], velocities[j], lidar_measure[j]),dim=-1).to(Device.get())
                action = dql.behavioural(state)
                #print(action)
                agent_actions_list.append(action)
            agent_actions = torch.stack(agent_actions_list)
            actions.update({agent.name: agent_actions})
            if step > dql.batch_size/num_envs:
                dql.improve() # Improve the model
                #TODO Should I do the improve once for each env or once for each agent?
        obs, rewards, dones, info = env.step(actions)
        mean_reward = 0
        #print(rewards)
        for i, agent in enumerate(env.agents):
            positions = agent.state.pos
            velocities = agent.state.vel
            lidar_measure = obs[agent.name][:, (tot_shape - lidar_measure_shape):]
            # Read the state the action was chosen from *before* overwriting it with the new observation
            if agent.name in previous_states:
                prev_lidar_measure = previous_states[agent.name]["lidar_measure"]
            else:
                # First step: the policy was fed zeros, so record the same state
                prev_lidar_measure = torch.zeros(num_envs, lidar_measure_shape).to(Device.get())
            for j in range(num_envs):
                reward = rewards[agent.name][j]
                mean_reward += reward
                logs.update({f"reward_{agent.name}_env_{j}": reward})
                prev_state = prev_lidar_measure[j]
                state = lidar_measure[j]
                action = actions[agent.name][j]
                # Store the transition (s, a, r, s')
                dql.record(prev_state, action, reward, state)
            previous_states.update({agent.name: {"lidar_measure": lidar_measure, "pos": positions, "vel": velocities}})
        mean_reward /= (num_envs*n_agents)
        logs.update({"epsilon": dql.epsilon.value()})
        logs.update({"loss": dql.last_loss})
        logs.update({"mean_reward": mean_reward})
        logs.update({f"mean_reward_epoch_{e}": mean_reward})
    
        wandb.log(logs)
        dql.epsilon.update() # Update epsilon
        #dql.snapshot(step, "0")
        frame_list.append(
            Image.fromarray(env.render(mode="rgb_array", agent_index_focus=None))
        )  # Can give the camera an agent index to focus on
        
        #print(info)
        if isOneEnvDone(info):
            print("Env done")
            dql.target_network.load_state_dict(dql.policy_network.state_dict())
            break
    
    
    # Produce a gif
    frame_list_copy = copy.deepcopy(frame_list)
    thread = threading.Thread(target=save_gif, args=(frame_list_copy, e))
    
    thread.start()
    
    frame_list.clear()
    
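    # init_time is set once before training, so this is the cumulative elapsed time
    total_time = time.time() - init_time
    print(
        f"Epoch {e} finished: {total_time}s elapsed since the start of training, with up to {n_steps} steps per epoch "
        f"in {num_envs} parallel environments on device {device} for the {scenario_name.__class__.__name__} scenario."
    )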

### Performance Results: MLP Networks in the Cleaning Agents Scenario

This section evaluates how well my Multilayer Perceptron (MLP) network-based agents learn and make decisions in the Cleaning Agents scenario. The training process consisted of 150 epochs of up to 1000 steps each. If an agent completed its task before using all the steps in an epoch, training moved on to the next epoch. Below, I analyze the training and evaluation results.

#### Training Progress

![Reward and Loss comparison](https://i.imgur.com/hwZ7rNF.png "Reward and Loss comparison")

During the initial epoch, agents exhibited limited familiarity with the task of removing targets from the 2D environment. Consequently, the mean reward achieved by the agents in this phase remained relatively low, with a maximum reward of approximately 3.0. As training progressed, agents displayed improved performance. By the 70th epoch, the highest recorded mean reward reached 8.0, typically occurring at around 900 steps into the episode. This milestone indicated that agents had grasped the fundamentals of their task but still had room for refinement.

A significant leap in performance was observed at the 145th epoch. Agents removed almost all targets in fewer than 500 steps, with the last remaining target eliminated in the final steps of the episode. Over the same period, the loss followed a clear pattern: it initially spiked whenever agents received rewards exceeding 1.0, then gradually decreased and stabilized near zero after the 100th epoch.

![Mean rewards](https://i.imgur.com/QVtu3HF.png "mean rewards")

#### Evaluation Results

![Different stages of a simulation with one agent and 8 targets](https://i.imgur.com/unYsDTi.png "Different stages of a simulation with one agent and 8 targets")

To assess the agents' performance in practical scenarios, I conducted evaluations with varying numbers of agents and targets. In a scenario involving one agent and eight targets, the results indicated the following:
* On average, it took approximately 10 steps for the agent to remove the first target, reflecting a quick initiation of the cleaning process.
* Removal of half the targets occurred, on average, in approximately 170 steps, demonstrating consistent progress.
* To remove all eight targets, the agent required an average of around 1440 steps, highlighting the complexity of clearing the entire environment.

![Different stages of a simulation with four agents and 14 targets](https://i.imgur.com/acA7fVT.png "Different stages of a simulation with four agents and 14 targets")

In a more challenging scenario with four agents and fourteen targets, the outcomes were as follows:
* The first target was removed in an average of 10 steps, emphasizing swift task initiation.
* Removal of half the targets was achieved in approximately 240 steps, showcasing efficient teamwork among agents.
* Clearing all fourteen targets took an average of about 400 steps, demonstrating improved collaboration among agents.

![Remaining targets at each step](https://i.imgur.com/8eLGGts.png "Remaining targets at each step")
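
Milestones like the ones reported above can be read off a per-step series of remaining targets. The helper below is a sketch under assumptions, not the evaluation code used for these results: `removal_milestones` and its input format (one remaining-target count per step, starting at step 1) are hypothetical, and the sample series is made up purely to show the call.

```python
def removal_milestones(remaining_per_step, n_targets):
    """Return the first 1-indexed step at which the first target, at least
    half of the targets, and all targets have been removed (hypothetical helper).

    A milestone that is never reached is reported as None.
    """
    # Maximum number of remaining targets allowed for each milestone
    thresholds = {
        "first": n_targets - 1,
        "half": n_targets // 2,
        "all": 0,
    }
    milestones = {name: None for name in thresholds}
    for step, remaining in enumerate(remaining_per_step, start=1):
        for name, limit in thresholds.items():
            if milestones[name] is None and remaining <= limit:
                milestones[name] = step
    return milestones


# Illustrative series only (not measured data): 4 targets over 6 steps
example = removal_milestones([4, 3, 3, 2, 1, 0], n_targets=4)
print(example)  # {'first': 2, 'half': 4, 'all': 6}
```

Averaging each milestone over several evaluation runs would give figures comparable to the ones listed in this section.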

These performance results underscore the adaptability and learning capabilities of my MLP network-based agents in the Cleaning Agents scenario. As training progressed, agents demonstrated a substantial improvement in their task execution, achieving efficient target removal even in scenarios with multiple agents and numerous targets. These findings reflect the effectiveness of my reinforcement learning approach in addressing complex multi-agent coordination tasks.