> ERC Starting Grant on COeXISTENCE between humans and machines in urban mobility.


<img src="../../images/img_mileston1.png" alt="Milestone 1 Image" width="500" height="400">

# Title: Machine training using Double DQN algorithm
## Name: Anastasia
### Date: July 11, 2024
---

### Description

> In this notebook, we implement the training of independent machine agents using the DQN algorithm.
---

## Objective

> The purpose of this notebook is to understand whether DQN algorithm can train effectively our RL agents.
---

## Experiment Summary

### Network Architecture
- Csomor network
---

### Agents
| **Type**          |           |
|-------------------|---------------------|
| **Number**        | 5 machines |
| **Total demand** | random |
---


### Origin and Destination Details
| **Origin Count**      | 2                            |
|-----------------------|------------------------------|
| **Destination Count** | 2                            |
| **Origin Pairing**    | 279952229#0, 115604053       |
| **Destination Pairing**| -115602933#2, -441496282#1     |
---

### Execution time
- 8 min 28 sec

### Hardware Utilized for Experiment Execution
| **Type of Machine** | Personal computer (or server) |
|----------------------|-------------------------------|
| **CPU**              | 12th Gen Intel(R) Core(TM) i7-1255U |
|                      | Cores: 10                   |
|                      | Sockets: 1                  |
|                      | Base Speed: 1.70 GHz        |
| **Memory**           | 16GB                          |
| **Disc (SSD)**       | 477 GB                        |
| **Operating System** | Windows 11                    |
---


### Imported libraries 

In [1]:
import matplotlib.pyplot as plt
import os
import pandas as pd
from tensordict.nn import TensorDictModule, TensorDictSequential
import torch
from torchrl.collectors import SyncDataCollector
from torch.distributions import Categorical
from torchrl.envs.libs.pettingzoo import PettingZooWrapper
from torchrl.envs.transforms import TransformedEnv, RewardSum
from torchrl.envs.utils import check_env_specs
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
from torchrl.data.replay_buffers.storages import LazyTensorStorage
from torchrl.modules import MultiAgentMLP, ProbabilisticActor
from torchrl.objectives.value import GAE
from torchrl.objectives import ClipPPOLoss, ValueEstimators
from torchrl.modules import MLP, QValueActor
from torchrl.data import CompositeSpec
from torchrl.modules import EGreedyModule
from torchrl.objectives import DQNLoss, HardUpdate, SoftUpdate
from torchrl.record.loggers import generate_exp_name, get_logger
from torchrl.envs.transforms import RenameTransform
from torchrl.modules.tensordict_module import QValueModule
from torchrl.trainers import (
    LogReward,
    Recorder,
    ReplayBufferTrainer,
    Trainer,
    UpdateWeights,
)
import wandb
from tqdm import tqdm
import sys

current_dir = os.getcwd()
parent_of_parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir, os.pardir))
sys.path.append(parent_of_parent_dir)

from environment import TrafficEnvironment
from keychain import Keychain as kc
from services.plotter import Plotter
from utilities import get_params

os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

### Hyperparameters specification

In [2]:
# Devices
device = (
    torch.device(0)
    if torch.cuda.is_available() and not is_fork
    else torch.device("cpu")
)
vmas_device = device  # The device where the simulator is run

# Sampling
frames_per_batch = 4  # Number of team frames collected per training iteration
n_iters = 4  # Number of sampling and training iterations
total_frames = frames_per_batch * n_iters

# Training
num_epochs = 100  # Number of optimization steps per training iteration
minibatch_size = 2  # Size of the mini-batches in each optimization step
lr = 3e-4  # Learning rate
max_grad_norm = 1.0  # Maximum norm for the gradients

# DQN
gamma = 0.99  # discount factor
hard_update_freq = 10

### Environment Creation

In [3]:
params = get_params(kc.PARAMS_PATH)

In [4]:
env = TrafficEnvironment(params[kc.RUNNER], params[kc.ENVIRONMENT], params[kc.SIMULATOR], params[kc.AGENT_GEN], params[kc.AGENTS], params[kc.PHASE])

[CONFIRMED] Environment variable exists: SUMO_HOME
[SUCCESS] Added module directory: C:\Program Files (x86)\Eclipse\Sumo\tools


In [5]:
env.start()

### Human learning

In [6]:
num_episodes = 10

for episode in range(num_episodes):
    env.step()

------------------STEP------------------

Reset episode [] 



Resetting the simulator
------------------STEP------------------

Reset episode [] 



Resetting the simulator
------------------STEP------------------

Reset episode [] 



Resetting the simulator
------------------STEP------------------

Reset episode [] 



Resetting the simulator
------------------STEP------------------

Reset episode [] 



Resetting the simulator
------------------STEP------------------

Reset episode [] 



Resetting the simulator
------------------STEP------------------

Reset episode [] 



Resetting the simulator
------------------STEP------------------

Reset episode [] 



Resetting the simulator
------------------STEP------------------

Reset episode [] 



Resetting the simulator
------------------STEP------------------

Reset episode [] 



Resetting the simulator


### Mutation

In [7]:
env.mutation()

### Machine learning

In [8]:
env = PettingZooWrapper(
    env=env,
    use_mask=True,
    group_map=None,
    categorical_actions=True,
    done_on_any = False
)

Resetting the simulator


In [9]:
out_keys = []

for group, agents in env.group_map.items():
    out_keys.append((group, "episode_reward"))

print(out_keys)

[('3', 'episode_reward')]


In [10]:
env = TransformedEnv(
    env,
    RewardSum(
        in_keys=env.reward_keys,
        reset_keys=["_reset"] * len(env.group_map.keys()),
        out_keys = out_keys
    ),
)

In [11]:
env.reward_keys

[('3', 'reward')]

In [12]:
env.group_map.keys()

dict_keys(['3'])

In [13]:
check_env_specs(env)

Resetting the simulator
------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------



2024-07-25 14:23:06,227 [torchrl][INFO] check_env_specs succeeded!


Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 





In [14]:
reset_td = env.reset()

Resetting the simulator


### Policy network

In [15]:
modules = {}
for group, agents in env.group_map.items():
    share_parameters_policy = False 

    mlp = MultiAgentMLP(
        n_agent_inputs =env.observation_spec[group, "observation"].shape[-1],  
        n_agent_outputs = env.full_action_spec[group, "action"].space.n, 
        n_agents = len(agents),
        centralised=False,  
        share_params = share_parameters_policy,
        device = device,
        depth = 4,
        num_cells = 64,
        activation_class=torch.nn.ReLU,
    )

    module = TensorDictModule(mlp, 
                              in_keys=[(group, "observation")],
                              out_keys=[(group,"action_value")],
    )

    modules[group] = module

In [16]:
q_value_modules = {}

for group, agents in env.group_map.items():

    q_value_module = QValueModule(
            action_value_key=(group, "action_value"),
            out_keys=[
                (group, "action"),
                (group, "action_value"),
                (group, "chosen_action_value"),
            ],
            spec=env.full_action_spec[group, "action"],
            action_space=None,
        )

    q_value_modules[group] = q_value_module

In [17]:
policy = TensorDictSequential(*modules.values(), *q_value_modules.values())

Run a random rollout to ensure it works.

In [18]:
for group, agents in env.group_map.items():

    tensordict = env.fake_tensordict()
    policy(tensordict)

### Greedy module

In [19]:
greedy_module = {}

for group, agents in env.group_map.items():

    greedy_module[group] = EGreedyModule(
        action_key = (group, "action"),
        spec=env.full_action_spec[group, "action"],
    )

Incorporate the greedy module inside the policy.

In [20]:
col_policy = {}

for group, agents in env.group_map.items():
    col_policy[group] = TensorDictSequential(policy, greedy_module[group])

In [21]:
col_policies = TensorDictSequential(*col_policy.values())

### Collector

In [22]:
collector = SyncDataCollector(
    env,
    col_policies,
    device=device,
    storing_device=device,
    frames_per_batch=frames_per_batch,
    reset_at_each_iter=False,
    total_frames=total_frames,
)

Resetting the simulator


### Replay Buffer

In [23]:
replay_buffers = {}
for group, _agents in env.group_map.items():
    replay_buffers[group] = ReplayBuffer(
        storage=LazyTensorStorage(
            frames_per_batch, device=device
        ), 
        batch_size=minibatch_size, 
    )

replay_buffer = replay_buffers[group]

### DQN loss function

In [24]:
losses = {}
optimizers = {}
target_net_updaters = {}


for group, _agents in env.group_map.items():
    loss_module = DQNLoss(
        value_network=col_policies,
        loss_function="l2",
        double_dqn = False,
        delay_value=True,
        action_space = "categorical"
    )

    loss_module.set_keys(  # We have to tell the loss where to find the keys
        reward=(group, "reward"),  
        action_value=(group, "action_value"),
        action=(group, "action"), 
        done=(group, "done"),
        terminated=(group, "terminated"),
        value=(group, "chosen_action_value"),
    )

    loss_module.make_value_estimator(gamma=gamma)

    target_net_updaters[group] = SoftUpdate(
        loss_module, eps=0.98
    )    

    losses[group] = loss_module

    optimizer = torch.optim.Adam(loss_module.parameters(), lr)
    
    optimizers[group] = optimizer

# Access loss module for the first group for example
group = next(iter(env.group_map))

### Trying to implement the Trainer class ~ probably not working for multi agent scenarios

In [25]:
"""n_optim = 8
trainer = Trainer(
    collector=collector,
    total_frames=total_frames,
    frame_skip=1,
    loss_module=loss_module,
    optimizer=optimizer,
    optim_steps_per_batch=n_optim,
)"""

In [26]:
"""log_keys = []
out_keys = {}

for group, _agents in env.group_map.items():
    log_keys.append(("next", group, "reward"))
    out_keys[("next", group, "reward")] = "rewards""""

In [27]:
"""buffer_hook = ReplayBufferTrainer(
    replay_buffer,
    flatten_tensordicts=False,
)
#buffer_hook.register(trainer)
weight_updater = UpdateWeights(collector, update_weights_interval=1)
weight_updater.register(trainer)
recorder = Recorder(
    record_interval=1,  # log every 100 optimization steps
    record_frames=1,  # maximum number of frames in the record
    frame_skip=1,
    policy_exploration=col_policies,
    environment=env,
    #exploration_type=ExplorationType.MODE,
    log_keys=log_keys,
    out_keys=out_keys,
    log_pbar=True,
)
recorder.register(trainer)"""

In [28]:
#trainer.train()

  0%|          | 0/16 [00:00<?, ?it/s]

------------------STEP------------------

Reset episode ['6'] 



Resetting the simulator
self.agent_selection is:  6 



------------------STEP------------------

Reset episode ['6'] 



Resetting the simulator
self.agent_selection is:  6 



------------------STEP------------------

Reset episode ['6'] 



Resetting the simulator
self.agent_selection is:  6 



------------------STEP------------------

Reset episode ['6'] 



Resetting the simulator
self.agent_selection is:  6 



Resetting the simulator
------------------STEP------------------

Reset episode ['6'] 



Resetting the simulator
self.agent_selection is:  6 





rewards: -1.7333:  25%|██▌       | 4/16 [00:06<00:18,  1.52s/it]

------------------STEP------------------



FatalTraCIError: Not connected.

### Create the logger

In [None]:
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33manastasiapsarou123[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [None]:
logger = None

exp_name = generate_exp_name("DQN", f"TrafficEnv")
logger = get_logger(
    "wandb",
    logger_name="dqn",
    experiment_name=exp_name,
    wandb_kwargs={
        "project": "2_machines_mutation",
    },
)

In [25]:
import time
import tqdm

collected_frames = 0
start_time = time.time()
num_updates = 5
batch_size = 10
test_interval = 5
max_grad = 1
num_test_episodes = 5
frames_per_batch = frames_per_batch
init_random_frames = 5
n_optim = 8
sampling_start = time.time()
q_losses = torch.zeros(num_updates, device=device)

### Training loop

In [26]:
q_losses_loop = {group: [] for group in env.group_map.keys()}

pbar = tqdm.tqdm(total=n_iters)
for i, tensordict_data in enumerate(collector):

    for group, _agents in env.group_map.items():
        tensordict_data.set(
            ("next", group, "done"),
            tensordict_data.get(("next", "done"))
            .unsqueeze(-1)
            .expand(tensordict_data.get_item_shape(("next", group, "reward"))),  # Adjust index to start from 0
        )
        tensordict_data.set(
            ("next", group, "terminated"),
            tensordict_data.get(("next", "terminated"))
            .unsqueeze(-1)
            .expand(tensordict_data.get_item_shape(("next", group, "reward"))),  # Adjust index to start from 0
        )

    log_info = {}
    sampling_time = time.time() - sampling_start
    pbar.update(tensordict_data.numel())

    data = tensordict_data.reshape(-1)
    current_frames = data.numel()
    collected_frames += current_frames

    for group, agents in env.group_map.items():
        replay_buffers[group].extend(data)
        greedy_module[group].step(current_frames)

    # Get and log training rewards and episode lengths
    
        episode_rewards = data["next", group, "episode_reward"][data["next", group, "done"]]
        if len(episode_rewards) > 0:
            episode_reward_mean = episode_rewards.mean().item()
            #episode_length = data["next", group, "step_count"][data["next", group, "done"]]
            #episode_length_mean = episode_length.sum().item() / len(episode_length)
            log_info.update(
                {
                    f"train/episode_reward_{group}": episode_reward_mean,
                    #"train/episode_length": episode_length_mean,
                }
            )

        """if collected_frames < init_random_frames:
            if logger:
                for key, value in log_info.items():
                    logger.log_scalar(key, value, step=collected_frames)
            continue"""


    # optimization steps
    training_start = time.time()
    for group, agent in env.group_map.items():
        for _ in range(frames_per_batch // minibatch_size):

            sampled_tensordict = replay_buffers[group].sample()
            sampled_tensordict = sampled_tensordict.to(device)

            loss_td = losses[group](sampled_tensordict)
            q_loss = loss_td["loss"]

            q_losses_loop[group].append(q_loss)

            optimizer.zero_grad()
            q_loss.backward()

            torch.nn.utils.clip_grad_norm_(
                list(losses[group].parameters()), max_norm=max_grad
            )
            
            optimizers[group].step()
            target_net_updaters[group].step()

            training_time = time.time() - training_start

            # Get and log q-values, loss, epsilon, sampling time and training time
            log_info.update(
                {
                    f"train/q_values_{group}": (data[group, "action_value"] * data[group, "action"]).sum().item()
                    / frames_per_batch,
                    f"train/q_loss_{group}": torch.stack(q_losses_loop[group]).mean().item(),
                    f"train/epsilon_{group}": greedy_module[group].eps,
                    "train/sampling_time": sampling_time,
                    "train/training_time": training_time,
                }
            )

            """if logger:
                for key, value in log_info.items():
                    logger.log_scalar(key, value, step=collected_frames)"""

            
            # update weights of the inference policy
            collector.update_policy_weights_()
            sampling_start = time.time()


collector.shutdown()
end_time = time.time()
execution_time = end_time - start_time
print(f"Training took {execution_time:.2f} seconds to finish")

  0%|          | 0/4 [00:00<?, ?it/s]

------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------



100%|██████████| 4/4 [00:05<00:00,  1.42s/it]

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------



8it [00:10,  1.34s/it]                       

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------



12it [00:17,  1.49s/it]

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 



------------------STEP------------------

Reset episode ['3'] 



Resetting the simulator
self.agent_selection is:  3 





16it [00:24,  1.60s/it]

Training took 24.63 seconds to finish


In [None]:
from services import plotter

plotter(params[kc.PLOTTER])

<services.plotter.Plotter at 0x29278d1b910>

In [None]:
import os
from IPython.display import display, Markdown

# Path to the images directory
images_dir = '../../results/humans_mutation_dqn'

# List all image files in the directory
images = [f for f in os.listdir(images_dir) if os.path.isfile(os.path.join(images_dir, f))]

# Generate and display Markdown for each image
for image in images:
    markdown_image = f"![{image}]({images_dir}/{image})"
    display(Markdown(markdown_image))

![actions.png](../../results/humans_mutation_dqn/actions.png)

![actions_shifts.png](../../results/humans_mutation_dqn/actions_shifts.png)

![rewards.png](../../results/humans_mutation_dqn/rewards.png)

![simulation_length.png](../../results/humans_mutation_dqn/simulation_length.png)

![travel_times.png](../../results/humans_mutation_dqn/travel_times.png)

![tt_dist.png](../../results/humans_mutation_dqn/tt_dist.png)