# RouteRL Quickstart

We simulate a simple network topology where humans and later AVs make routing decisions to maximize their rewards (i.e., minimize travel times) over a sequence of days.

* For the first 100 days, we model a human-driven system, where drivers update their routing policies using behavioral models to optimize rewards.
* Each day, we simulate the impact of joint actions using the [`SUMO`](https://eclipse.dev/sumo/) traffic simulator, which returns the reward for each agent.
* After 100 days, we introduce 10 `Autonomous Vehicles` (AVs) as `PettingZoo` agents, allowing them to use any `MARL` algorithm to maximise rewards. In this tutorial, we use a trained policy from the Independent Deep Q-Learning (IDQN) algorithm.
* Finally, we analyse basic results from the simulation.
  

# Tutorial Outline

* Establishing the Connection with SUMO

* Initializing the Traffic Environment
  - Define the `TrafficEnvironment`, which initializes human agents and generates the routes agents will travel within the network.

* Training Human Agents
  - Train human-driven vehicles to navigate the environment efficiently using human behavioural models from transportation research.

* Introducing Autonomous Vehicles (AVs)
  - Transform a subset of human agents into AVs.
  - AVs select their routes using a pre-trained policy based on the IDQN algorithm.

* Analyzing the Impact of AVs
  - Evaluate the effects of AV introduction on human travel time, congestion, and CO₂ emissions.
  - Demonstrate how AV deployment can potentially increase travel delays and environmental impact.

<p align="center">
  <img src="../../docs/img/two_route_net_1.png" alt="Two-route network" />
  <img src="../../docs/img/two_route_net_1_2.png" alt="Two-route network" />
</p>  

#### Import libraries

In [1]:
import sys
import os
import pandas as pd
import torch
from torchrl.envs.libs.pettingzoo import PettingZooWrapper
from torchrl.modules import QValueModule, SafeSequential
from torchrl.modules.models.multiagent import MultiAgentMLP
from tensordict.nn import TensorDictModule, TensorDictSequential
from torch import nn
from torchrl.modules import EGreedyModule

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../../')))

from routerl import TrafficEnvironment

#### Define hyperparameters

> Users can customize parameters for the `TrafficEnvironment` class by consulting the [`routerl/environment/params.json`](https://github.com/COeXISTENCE-PROJECT/RouteRL/blob/4f4bc0a90d821e95b7193b00c93d6aaf10b34f41/routerl/environment/params.json) file. Based on its contents, they can create a dictionary with their preferred settings and pass it as an argument to the `TrafficEnvironment` class.

> **For this tutorial, we do not recommend changing the number of agents, because the provided policy was trained for 10 AV agents.**


In [2]:
# Parameters for the torchrl policy
device = torch.device("cpu")

mlp_depth = 2
mlp_cells = 32

tau = 0.05
frames_per_batch = 100  # Number of team frames collected per training iteration
n_iters = 30
total_frames = frames_per_batch * n_iters
exploration_fraction = 1 / 3  # Fraction of frames over which the exploration rate is annealed

eps = 1 - tau
eps_init = 0.99
eps_end = 0


human_learning_episodes = 100


env_params = {
    "agent_parameters" : {
        "num_agents" : 100,
        "new_machines_after_mutation": 10, # the number of human agents that will mutate to AVs
        "human_parameters" : {
            "model" : "gawron"
        },
        "machine_parameters" :
        {
            "behavior" : "selfish",
        }
    },
    "simulator_parameters" : {
        "network_name" : "two_route_yield"
    },  
    "plotter_parameters" : {
        "phases" : [0, human_learning_episodes], # the number of episodes human learning will take
    },
}
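
The `env_params` dictionary above sets only a few nested keys and leaves everything else to the defaults in `params.json`. How RouteRL combines the two is not shown in this tutorial; the sketch below only illustrates the general idea of applying nested partial overrides. The helper `merge_params` and the default values are hypothetical.

```python
from copy import deepcopy


def merge_params(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` applied recursively."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: descend instead of replacing.
            merged[key] = merge_params(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults, standing in for the contents of params.json.
defaults = {
    "agent_parameters": {"num_agents": 50, "human_parameters": {"model": "gawron"}},
    "simulator_parameters": {"network_name": "two_route_yield"},
}
overrides = {"agent_parameters": {"num_agents": 100}}

params = merge_params(defaults, overrides)
print(params["agent_parameters"])
# {'num_agents': 100, 'human_parameters': {'model': 'gawron'}}
```

Only `num_agents` changes; the sibling key `human_parameters` survives because the merge descends into nested dictionaries instead of replacing them.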

#### Environment initialization

In our setup, road networks initially consist of human agents, with AVs introduced later.

- The `TrafficEnvironment` is initialized first.
- The traffic network is instantiated, and the paths between the designated origin and destination points are determined.
- The driver (agent) objects are created.

In [3]:
env = TrafficEnvironment(seed=42, **env_params)

[CONFIRMED] Environment variable exists: SUMO_HOME
[SUCCESS] Added module directory: C:\Program Files (x86)\Eclipse\Sumo\tools


> The available paths are created using the [JanuX](https://github.com/COeXISTENCE-PROJECT/JanuX) framework.

<p >
  <img src="plots_saved/0_0.png" width="600" />
</p>  

In [4]:
print("Number of total agents is: ", len(env.all_agents), "\n")
print("Number of human agents is: ", len(env.human_agents), "\n")
print("Number of machine agents (autonomous vehicles) is: ", len(env.machine_agents), "\n")

Number of total agents is:  100 

Number of human agents is:  100 

Number of machine agents (autonomous vehicles) is:  0 



> Reset the environment and the connection with SUMO

In [5]:
env.start()

#### Human learning

In [6]:
for episode in range(human_learning_episodes):
    env.step() # all the human agents execute an action in the environment

> Average travel time of human agents during their training process.

<p align="center">
  <img src="plots_saved/human_learning.png"/>
</p> 

> Show the first saved `.csv` file, which contains information about the agents in the system.


In [7]:
df = pd.read_csv("plots_saved/_training_records/episodes/ep1.csv")
df


Unnamed: 0,travel_time,id,kind,action,origin,destination,start_time,reward,cost_table
0,3.483333,0,Human,1,0,0,99,-3.483333,"0.2217422606191504,-1.514644828413727"
1,1.200000,1,Human,1,0,0,58,-1.200000,"0.2217422606191504,-0.3729781617470602"
2,4.550000,2,Human,1,0,0,112,-4.550000,"0.2217422606191504,-2.0479781617470603"
3,4.916667,3,Human,1,0,0,118,-4.916667,"0.2217422606191504,-2.231311495080394"
4,0.933333,4,Human,1,0,0,31,-0.933333,"0.2217422606191504,-0.23964482841372692"
...,...,...,...,...,...,...,...,...,...
95,1.066667,95,Human,1,0,0,46,-1.066667,"0.2217422606191504,-0.30631149508039357"
96,1.083333,96,Human,1,0,0,50,-1.083333,"0.2217422606191504,-0.3146448284137269"
97,1.216667,97,Human,1,0,0,60,-1.216667,"0.2217422606191504,-0.3813114950803935"
98,3.566667,98,Human,1,0,0,101,-3.566667,"0.2217422606191504,-1.5563114950803936"
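
The `cost_table` column holds each human's per-route cost estimates. The values printed above look consistent with a simple smoothing update in which the entry for the chosen route moves halfway toward the observed reward. This is an inference from the visible rows, not a description of RouteRL's `gawron` implementation; the weight of 0.5 and both helper names are assumptions. The sketch inverts that assumed update for rows 0, 1 and 2 and checks that they imply the same starting estimate.

```python
# (travel_time, updated cost of the chosen route) copied from rows 0, 1 and 2.
rows = [
    (3.483333, -1.514644828413727),
    (1.200000, -0.3729781617470602),
    (4.550000, -2.0479781617470603),
]

WEIGHT = 0.5  # assumed smoothing weight, inferred from the table


def smoothed_cost(old_cost: float, reward: float, weight: float = WEIGHT) -> float:
    """Blend the previous estimate with the newly observed reward."""
    return weight * old_cost + (1 - weight) * reward


def implied_old_cost(new_cost: float, reward: float, weight: float = WEIGHT) -> float:
    """Invert `smoothed_cost` to recover the estimate held before the update."""
    return (new_cost - (1 - weight) * reward) / weight


# The reward is the negative travel time, as in the `reward` column.
implied = [implied_old_cost(cost, -travel_time) for travel_time, cost in rows]
print(implied)  # three values that agree to about six decimal places
```

All three rows point to the same prior estimate of roughly 0.454 for route 1, which is what one would expect if every agent started from the same initial table and applied the same update.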


#### Mutation

> Mutation: a portion of the human agents is converted into machine agents (autonomous vehicles).

In [8]:
env.mutation()

In [9]:
print("Number of total agents is: ", len(env.all_agents), "\n")
print("Number of human agents is: ", len(env.human_agents), "\n")
print("Number of machine agents (autonomous vehicles) is: ", len(env.machine_agents), "\n")

Number of total agents is:  100 

Number of human agents is:  90 

Number of machine agents (autonomous vehicles) is:  10 



In [10]:
env.machine_agents

[Machine 1,
 Machine 15,
 Machine 10,
 Machine 91,
 Machine 22,
 Machine 73,
 Machine 5,
 Machine 52,
 Machine 81,
 Machine 77]
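
Conceptually, the mutation step picks a subset of the human agents and changes their kind while keeping their ids. A minimal sketch of that idea, assuming uniform random selection; the `Agent` class and `mutate` function are hypothetical stand-ins, and the ids chosen here will not match the list above.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    id: int
    kind: str = "Human"


def mutate(agents, n_machines, seed=42):
    """Turn `n_machines` randomly chosen agents into AVs, in place."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(agents)), n_machines)
    for index in chosen:
        agents[index].kind = "AV"
    return [agents[index] for index in chosen]


agents = [Agent(id=i) for i in range(100)]
machines = mutate(agents, n_machines=10)
humans = [agent for agent in agents if agent.kind == "Human"]
print(len(agents), len(humans), len(machines))  # 100 90 10
```

The total population stays at 100, matching the counts printed above: 90 humans and 10 machines.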

> To use the `TorchRL` library with our environment, we wrap it in TorchRL's `PettingZooWrapper` class.

In [11]:
group = {'agents': [str(machine.id) for machine in env.machine_agents]}

env = PettingZooWrapper(
    env=env,
    use_mask=True,
    categorical_actions=True,
    done_on_any = False,
    group_map=group,
)

> Define the neural network used by `TorchRL`.

In [12]:
net = MultiAgentMLP(
        n_agent_inputs=env.observation_spec["agents", "observation"].shape[-1],
        n_agent_outputs=env.action_spec.space.n,
        n_agents=env.n_agents,
        centralised=False,
        share_params=False,
        device=device,
        depth=mlp_depth,
        num_cells=mlp_cells,
        activation_class=nn.ReLU,
    )

module = TensorDictModule(
        net, in_keys=[("agents", "observation")], out_keys=[("agents", "action_value")]
)

value_module = QValueModule(
    action_value_key=("agents", "action_value"),
    out_keys=[
        env.action_key,
        ("agents", "action_value"),
        ("agents", "chosen_action_value"),
    ],
    spec=env.action_spec,
    action_space=None,
)

qnet = SafeSequential(module, value_module)

qnet = TensorDictSequential(
    qnet,
    EGreedyModule(
        eps_init=eps_init,
        eps_end=eps_end,
        annealing_num_steps=int(total_frames * exploration_fraction),
        action_key=env.action_key,
        spec=env.action_spec,
    ),
)
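
The `EGreedyModule` above is configured with `eps_init`, `eps_end` and `annealing_num_steps`. A minimal sketch of such an exploration schedule, assuming the rate decays linearly between the two values. The helpers `epsilon_at` and `egreedy` are hypothetical, and the code is plain Python, not TorchRL's implementation.

```python
import random

# Values from the hyperparameter cell.
eps_init, eps_end = 0.99, 0.0
frames_per_batch, n_iters, exploration_fraction = 100, 30, 1 / 3
total_frames = frames_per_batch * n_iters
annealing_num_steps = int(total_frames * exploration_fraction)


def epsilon_at(step: int) -> float:
    """Exploration rate after `step` frames, assuming linear annealing."""
    fraction = min(step / annealing_num_steps, 1.0)
    return eps_init + fraction * (eps_end - eps_init)


def egreedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
print(annealing_num_steps)  # 1000
print(egreedy([0.2, 0.7], epsilon_at(annealing_num_steps), rng))  # 1
```

With these settings, exploration would be annealed over the first 1000 of the 3000 training frames. Once the rate reaches `eps_end = 0`, the action with the highest Q-value is always selected.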

> Use a policy already trained with the Independent Deep Q-Learning algorithm.

In [13]:
state_dict = torch.load("policy_checkpoint.pth", map_location=device)

# Print all keys (i.e., parameter names)
print("\n🔍 Components inside the state dict:\n")
for key, value in state_dict.items():
    print(f"📌 {key}: {value.shape} | dtype: {value.dtype}")


🔍 Components inside the state dict:

📌 module.0.module.0.module.agent_networks.0.0.weight: torch.Size([32, 3]) | dtype: torch.float32
📌 module.0.module.0.module.agent_networks.0.0.bias: torch.Size([32]) | dtype: torch.float32
📌 module.0.module.0.module.agent_networks.0.2.weight: torch.Size([32, 32]) | dtype: torch.float32
📌 module.0.module.0.module.agent_networks.0.2.bias: torch.Size([32]) | dtype: torch.float32
📌 module.0.module.0.module.agent_networks.0.4.weight: torch.Size([2, 32]) | dtype: torch.float32
📌 module.0.module.0.module.agent_networks.0.4.bias: torch.Size([2]) | dtype: torch.float32
📌 module.0.module.0.module.agent_networks.1.0.weight: torch.Size([32, 3]) | dtype: torch.float32
📌 module.0.module.0.module.agent_networks.1.0.bias: torch.Size([32]) | dtype: torch.float32
📌 module.0.module.0.module.agent_networks.1.2.weight: torch.Size([32, 32]) | dtype: torch.float32
📌 module.0.module.0.module.agent_networks.1.2.bias: torch.Size([32]) | dtype: torch.float32
📌 module.0.modul

In [16]:
state_dict

OrderedDict([('module.0.module.0.module.agent_networks.0.0.weight',
              tensor([[ 0.3303,  0.0453,  0.2349],
                      [ 0.2892, -0.3562, -0.3027],
                      [-0.0565,  0.4493, -0.0386],
                      [-0.3070, -0.2662, -0.0022],
                      [ 0.3540, -0.6206, -0.2286],
                      [ 0.3214, -0.3819,  0.3637],
                      [ 0.2811,  0.3238,  0.0779],
                      [ 0.4129,  0.2227,  0.3723],
                      [-0.0993,  0.3398, -0.2848],
                      [ 0.3344, -0.5170,  0.4122],
                      [ 0.1427, -0.2798, -0.1840],
                      [ 0.4241,  0.2287,  0.2460],
                      [-0.0517, -0.2322, -0.2780],
                      [-0.5128, -0.1059,  0.4394],
                      [-0.2349, -0.5657,  0.0090],
                      [ 0.0366, -0.2812, -0.0121],
                      [-0.3770,  0.5375,  0.0897],
                      [-0.2851,  0.5158,  0.2310],
              

In [15]:
qnet.state_dict()

OrderedDict([('module.0.module.0.module.agent_networks.0.0.weight',
              tensor([[-0.1587, -0.5062, -0.3545],
                      [ 0.5383,  0.4344,  0.0621],
                      [ 0.4930, -0.1582,  0.1746],
                      [ 0.4054, -0.5095, -0.4369],
                      [ 0.1860,  0.0915, -0.1014],
                      [-0.0122, -0.1398, -0.0864],
                      [ 0.2846, -0.4876,  0.1144],
                      [ 0.4197,  0.0196, -0.1850],
                      [ 0.3979, -0.4183, -0.5424],
                      [ 0.3029,  0.1706,  0.0866],
                      [ 0.1645,  0.1927,  0.4139],
                      [-0.0248, -0.0635,  0.5416],
                      [ 0.0766, -0.5359, -0.0932],
                      [-0.5629,  0.2579,  0.0256],
                      [ 0.2590,  0.2211, -0.0178],
                      [-0.0242, -0.1881, -0.2966],
                      [ 0.5438,  0.5029, -0.3246],
                      [-0.3984,  0.2327,  0.3992],
              

In [18]:
qnet_keys = set(qnet.state_dict().keys())
loaded_keys = set(state_dict.keys())
print("Missing in loaded model:", qnet_keys - loaded_keys)
print("Extra in loaded model:", loaded_keys - qnet_keys)


Missing in loaded model: set()
Extra in loaded model: set()
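
Matching key sets do not by themselves guarantee that the tensors have the expected shapes. A sketch of a slightly stricter check over plain name-to-shape mappings; the helper `diff_state_dicts` and the `net.*` names are hypothetical. With PyTorch, such a mapping could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
def diff_state_dicts(expected, loaded):
    """Compare two name -> shape mappings.

    Returns three sorted lists: names missing from `loaded`, names only
    in `loaded`, and shared names whose shapes differ.
    """
    missing = sorted(expected.keys() - loaded.keys())
    extra = sorted(loaded.keys() - expected.keys())
    mismatched = sorted(
        name
        for name in expected.keys() & loaded.keys()
        if tuple(expected[name]) != tuple(loaded[name])
    )
    return missing, extra, mismatched


# Toy shapes; these names are illustrative, not the checkpoint's keys.
expected = {"net.0.weight": (32, 3), "net.0.bias": (32,), "net.2.weight": (2, 32)}
loaded = {"net.0.weight": (32, 4), "net.0.bias": (32,), "net.3.weight": (2, 32)}

print(diff_state_dicts(expected, loaded))
# (['net.2.weight'], ['net.3.weight'], ['net.0.weight'])
```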


In [17]:
import torch

# Load saved state_dict
state_dict = torch.load("policy_checkpoint.pth", map_location=torch.device('cpu'))

# Load model's expected keys
model_keys = set(qnet.state_dict().keys())
state_dict_keys = set(state_dict.keys())

# Find missing and extra keys
missing_keys = model_keys - state_dict_keys
extra_keys = state_dict_keys - model_keys

print("\n🚨 Missing keys in state_dict (expected by model):", missing_keys)
print("\n🚨 Extra keys in state_dict (not expected by model):", extra_keys)



🚨 Missing keys in state_dict (expected by model): set()

🚨 Extra keys in state_dict (not expected by model): set()


In [20]:
qnet.load_state_dict(torch.load("policy_checkpoint.pth", map_location=torch.device('cpu')), strict=False)
qnet.eval()

TensorDictSequential(
    module=ModuleList(
      (0): SafeSequential(
          module=ModuleList(
            (0): TensorDictModule(
                module=MultiAgentMLP(
                  (agent_networks): ModuleList(
                    (0-9): 10 x MLP(
                      (0): Linear(in_features=3, out_features=32, bias=True)
                      (1): ReLU()
                      (2): Linear(in_features=32, out_features=32, bias=True)
                      (3): ReLU()
                      (4): Linear(in_features=32, out_features=2, bias=True)
                    )
                  )
                ),
                device=cpu,
                in_keys=[('agents', 'observation')],
                out_keys=[('agents', 'action_value')])
            (1): QValueModule()
          ),
          device=cpu,
          in_keys=[('agents', 'observation')],
          out_keys=[('agents', 'action'), ('agents', 'action_value'), ('agents', 'chosen_action_value')])
      (1): EGreedyModule
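
The printed architecture shows ten separate MLPs, one per AV, each with three linear layers. As a quick sanity check, the number of trainable weights and biases can be worked out from the layer sizes alone. The sketch counts only the `Linear` layers shown above and ignores any buffers the other modules may hold.

```python
# (in_features, out_features) of the three Linear layers in each agent's MLP.
layer_shapes = [(3, 32), (32, 32), (32, 2)]
n_agents = 10  # share_params=False, so every agent has its own copy

# Each Linear layer has in*out weights plus one bias per output unit.
per_agent = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)
total = per_agent * n_agents
print(per_agent, total)  # 1250 12500
```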

> Human and AV agents interact with the environment over multiple episodes, with AVs following a trained policy.

In [19]:
num_test_episodes = 100

for episode in range(num_test_episodes):  # run rollouts in the environment using the already trained policy
    env.rollout(len(env.machine_agents), policy=qnet)

> Show the first `.csv` file saved after the mutation, which contains information about the agents now in the system.

In [None]:
df = pd.read_csv("plots_saved/_training_records/episodes/ep101.csv")
df

Unnamed: 0,travel_time,id,kind,action,origin,destination,start_time,reward,cost_table
0,0.700000,1,AV,0,0,0,58,-0.700000,00
1,1.100000,15,AV,0,0,0,64,-1.100000,00
2,3.783333,10,AV,0,0,0,116,-3.783333,00
3,2.383333,91,AV,1,0,0,87,-2.383333,00
4,3.900000,22,AV,0,0,0,126,-3.900000,00
...,...,...,...,...,...,...,...,...,...
95,0.533333,95,Human,0,0,0,46,-0.533333,"-0.5833333333333333,-0.6864890808735301"
96,0.533333,96,Human,0,0,0,50,-0.533333,"-0.5833333333333333,-0.6989890808735301"
97,0.883333,97,Human,0,0,0,60,-0.883333,"-0.6833333333333333,-0.79898908087353"
98,2.983333,98,Human,0,0,0,101,-2.983333,"-2.916666666666667,-3.036205603551716"
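
To compare the two groups, the records can be aggregated by the `kind` column. The sketch below uses only the standard library and only the nine rows visible in the truncated output above, so the numbers it prints are not representative of the full episode. On the complete DataFrame, `df.groupby("kind")["travel_time"].mean()` would give the per-group averages.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Only the rows visible in the truncated output, not the full file.
sample = """travel_time,id,kind
0.700000,1,AV
1.100000,15,AV
3.783333,10,AV
2.383333,91,AV
3.900000,22,AV
0.533333,95,Human
0.533333,96,Human
0.883333,97,Human
2.983333,98,Human
"""

# Group travel times by agent kind.
by_kind = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    by_kind[row["kind"]].append(float(row["travel_time"]))

means = {kind: mean(times) for kind, times in by_kind.items()}
print(means)
```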


#### Plot results 

> The plots are saved in the `\plots` folder.

In [None]:
env.plot_results()

> The results highlight a critical challenge in AV deployment: rather than improving traffic flow, AVs may increase travel time for human drivers. This suggests potential inefficiencies in mixed traffic conditions due to differences in driving behavior. Understanding these effects is essential for designing better reinforcement learning strategies, informing policymakers, and optimizing AV integration to prevent increased congestion and CO₂ emissions.


| |  |
|---------|---------|
| **Action shifts of human and AV agents** ![](plots_saved/actions_shifts.png) | **Action shifts of all vehicles in the network** ![](plots_saved/actions.png) |
| ![](plots_saved/rewards.png) | ![](plots_saved/travel_times.png) |


<p align="center">
  <img src="plots_saved/tt_dist.png" width="700" />
</p>


> Close the connection with `SUMO`.

In [None]:
env.stop_simulation()