# **Introduction, motivation, problem statement**

Urban traffic congestion is a growing challenge in cities worldwide, leading to increased travel times, fuel consumption, and air pollution. Traditional traffic light systems rely on pre-set timers or simple reactive policies, and often fail to respond effectively to dynamic traffic conditions. This inefficiency not only frustrates commuters but also carries environmental and economic costs.


# **Reinforcement Learning Task**

Dataset URL: https://github.com/eclipse-sumo/sumo

The task is to develop a deep reinforcement learning algorithm capable of dynamically controlling traffic lights to optimize traffic flow in real time. SUMO (Simulation of Urban MObility) gives the RL agent a realistic traffic model to interact with. The agent must learn to adapt its traffic signal policy to current traffic conditions, minimizing average waiting times and congestion and improving overall traffic flow.

# **Exploratory Analysis of RL task**

The SUMO environment offers a rich framework for simulating traffic in scenarios ranging from simple intersections to expansive urban networks. Using NetSim, we created a simple 8-lane intersection as the shared environment for all RL agents, both our models and the baselines, so that their results can be compared directly.

The baseline RL agents control an intersection by switching between the phases of a traffic light program, so the intersection must define such a program with several light phases; we adopted the same method. When the agent chooses the phase that is already active, it simply extends that phase's duration. When the agent chooses a new phase, it first activates the yellow phase corresponding to the last green phase before switching to the new one.
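
The switching rule above can be sketched without a running simulation. The sketch assumes the phase layout used later in this notebook (the green phase for action `a` sits at index `2 * a` and its yellow phase at `2 * a + 1`). The function name `phase_sequence` and the default durations are illustrative only and are not part of SUMO or sumo-rl.

```python
def phase_sequence(prev_action, action, yellow_s=3.0, min_green_s=5.0, hold_s=1.0):
    """Return the (phase_index, duration_seconds) pairs to run for one agent decision."""
    if action == prev_action:
        # Same phase requested: keep the current green and simply extend it.
        return [(2 * prev_action, hold_s)]
    return [
        (2 * prev_action + 1, yellow_s),  # yellow for the outgoing green phase
        (2 * action, min_green_s),        # then the newly selected green phase
    ]


print(phase_sequence(0, 0))  # hold the current green
print(phase_sequence(0, 2))  # yellow of phase 0, then green of action 2
```

Keeping the transition logic in one place like this makes it easy to check that a green phase is never replaced without its yellow phase running first.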

Because of the vast library of SUMO APIs available to the agents, it is not just the architecture of the RL models that distinguishes any two agents, but also the definition of the state at each timestep and of the reward function, which governs how the agent interprets the optimization task. The challenging part is defining a reward function that faithfully captures the rather loosely specified goal of improving traffic flow. For example, when creating an agent for this task, one might penalize the number of cars waiting at the intersection, only to find that the agent's optimal solution rapidly flickers the lights so that cars are constantly in motion but few make it through the intersection.
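
To make the reward design concrete, here is a minimal sketch of one reward shape that discourages flickering. It mirrors the form of `calculate_reward` in the Environment class later in this notebook: a bonus for holding the phase, an exponential term for the fraction of waiting cars served, and an exponential penalty on queue length. The example queue lengths are made up for illustration.

```python
import math


def reward(pct_served, avg_waiting_per_phase, kept_phase):
    """Bonus for holding the phase, plus a served term, minus a queue penalty."""
    bonus = 1.0 if kept_phase else 0.0
    served_term = math.exp(4 * pct_served)
    queue_penalty = math.exp(0.2 * sum(avg_waiting_per_phase))
    return bonus + served_term - queue_penalty


# A phase that cleared its whole queue while other queues stayed short ...
good_cycle = reward(pct_served=1.0, avg_waiting_per_phase=[1, 0, 1, 0], kept_phase=False)
# ... versus a switch that served nobody while queues built up.
flicker = reward(pct_served=0.0, avg_waiting_per_phase=[5, 2, 5, 2], kept_phase=False)
print(good_cycle > flicker)
```

Because the served term only grows when cars that were actually waiting get through, switching phases without serving anyone earns nothing, while the queue penalty keeps accumulating.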

# **Baselines**

## 1. Preset Timer

In [None]:
import traci
import sumolib

environment = "intersection/sumo_config.sumocfg"

sumobin = sumolib.checkBinary('sumo-gui')
traci.start([sumobin, '-c', environment, '--start'])

trafficlight_id = traci.trafficlight.getIDList()[0]

# Function to reset the SUMO environment
def reset_sumo_environment():
    # reload the simulation
    traci.load(['-c', environment, '--start'])
    traci.trafficlight.setProgram(trafficlight_id, '0')


num_episodes = 100

for e in range(num_episodes):
    reset_sumo_environment()
    current_time = 0
    
    while current_time < 2000:
        traci.simulationStep()
        current_time = traci.simulation.getTime()

    print(f"Episode: {e+1}/{num_episodes}")

## 2. DQN

In [None]:
import os
import sys

import gymnasium as gym
from stable_baselines3.dqn.dqn import DQN


if "SUMO_HOME" in os.environ:
  tools = os.path.join(os.environ["SUMO_HOME"], "tools")
  sys.path.append(tools)
else:
  sys.exit("Please declare the environment variable 'SUMO_HOME'")
import traci

from sumo_rl import SumoEnvironment


if __name__ == "__main__":
  env = SumoEnvironment(
    net_file="intersection/environment.net.xml",
    route_file="intersection/episode_routes.rou.xml",
    out_csv_name="outputs/intersection/dqn",
    single_agent=True,
    use_gui=True,
    num_seconds=5400,
    yellow_time=4,
    min_green=5,
    max_green=60,
  )

  model = DQN(
    env=env,
    policy="MlpPolicy",
    learning_rate=1e-3,
    learning_starts=0,
    buffer_size=50000,
    train_freq=1,
    target_update_interval=500,
    exploration_fraction=0.05,
    exploration_final_eps=0.01,
    verbose=1,
  )
  model.learn(total_timesteps=100000)

## 3. Double DQN

In [None]:
import os
import sys

import gymnasium as gym
from stable_baselines3.dqn.dqn import DQN


if "SUMO_HOME" in os.environ:
  tools = os.path.join(os.environ["SUMO_HOME"], "tools")
  sys.path.append(tools)
else:
  sys.exit("Please declare the environment variable 'SUMO_HOME'")
import traci

from sumo_rl import SumoEnvironment


# Initialize the environment (SumoEnvironment)
env = SumoEnvironment(
  net_file=r"intersection\environment.net.xml",
  route_file=r"intersection\episode_routes.rou.xml",
  out_csv_name=r"outputs\intersection_DoubleDQN\dqn",
  single_agent=True,
  use_gui=True,
  num_seconds=5400,
  yellow_time=4,
  min_green=5,
  max_green=60,
)

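# policy_kwargs only sets the Q-network architecture (two hidden layers of 128 units).
# Note: Stable-Baselines3's DQN is vanilla DQN; net_arch does not switch on Double DQN.
policy_kwargs = dict(
  net_arch=[128, 128],
)

# Initialize the DQN model with the custom network architecture
model = DQN(
  "MlpPolicy",  # Using a Multi-layer Perceptron policy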
  env,
  learning_rate=1e-3,
  learning_starts=0,
  buffer_size=50000,
  train_freq=1,
  target_update_interval=500,
  exploration_fraction=0.05,
  exploration_final_eps=0.01,
  verbose=1,
  policy_kwargs=policy_kwargs,  # Pass policy_kwargs for additional configuration
)

model.learn(total_timesteps=100000)



# **Models and Methods**

## 1. DQN

### Setting up connection to SUMO

In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
from collections import deque
import traci
import sumolib
import math

environment = "intersection/sumo_config.sumocfg"
phase_lane_control = np.array([
        ["N2TL_0", "N2TL_1", "N2TL_2", "S2TL_0", "S2TL_1", "S2TL_2"],
        ["N2TL_3", "S2TL_3"],
        ["W2TL_0", "W2TL_1", "W2TL_2", "E2TL_0", "E2TL_1", "E2TL_2"],
        ["W2TL_3", "E2TL_3"]
    ], dtype=object)

sumobin = sumolib.checkBinary('sumo-gui')

traci.start([sumobin, '-c', environment, '--start'])

traci.simulation.subscribe([traci.constants.VAR_COLLIDING_VEHICLES_IDS])

# Subscribe to vehicle accelerations for all vehicles
for veh_id in traci.vehicle.getIDList():
    traci.vehicle.subscribe(veh_id, traci.constants.VAR_ACCELERATION)

# for single agent
trafficlight_id = traci.trafficlight.getIDList()[0]
controlled_lanes = traci.trafficlight.getControlledLanes(trafficlight_id)
TIME_STEP = 0.8 # amount of time (in seconds) per step of the simulation, i.e. 0.01 => 10ms per step

print("Connected to TraCI")

Connected to TraCI


### SUMO helper functions

In [1]:
# Returns, for each group of lanes sharing a green phase, the number of halted cars divided by the number of lanes
def get_avg_waiting():
    grouped_avg_waiting = [get_lane_num_waiting(lanes) / len(lanes) for lanes in phase_lane_control]
    return grouped_avg_waiting

# returns the total number of cars waiting in the set of lanes
def get_lane_num_waiting(lanes):
    # a generator expression avoids shadowing the built-in sum
    return sum(traci.lane.getLastStepHaltingNumber(lane_id) for lane_id in lanes)

# returns a list of vehicle ids that are currently stopped in one of the lanes
def get_waiting_ids(lanes):
    ids = []
    for lane_id in lanes:
        ids.extend([veh_id for veh_id in traci.lane.getLastStepVehicleIDs(lane_id) if traci.vehicle.getSpeed(veh_id) < 0.1])
    return np.array(ids)

def pct_served(waiting_ids):
    if len(waiting_ids) == 0:
        return 0
    
    # vehicles that have been served but exited simulation need to be counted a different way
    loaded_ids = set(traci.vehicle.getLoadedIDList())  # query TraCI once instead of once per vehicle
    still_loaded = [veh_id for veh_id in waiting_ids if veh_id in loaded_ids]
    num_waiting_served = len([veh_id for veh_id in still_loaded if traci.vehicle.getSpeed(veh_id) > 0.5])
    num_waiting_served += len(waiting_ids) - len(still_loaded)

    return num_waiting_served / len(waiting_ids)
    
def get_total_waiting_time():
    vehicles = traci.vehicle.getIDList()
    waiting_times = [traci.vehicle.getWaitingTime(vehicle) for vehicle in vehicles]
    return sum(waiting_times)
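
The counting rule in `pct_served` can be checked without TraCI. The sketch below is a hypothetical offline variant: `pct_served_offline` and the `speeds` dictionary are stand-ins for the TraCI queries, and the vehicle ids and speeds are invented for illustration. A vehicle counts as served if it is moving faster than 0.5 m/s or has already left the simulation.

```python
def pct_served_offline(waiting_ids, speeds, moving_threshold=0.5):
    """speeds maps vehicle id -> current speed for vehicles still in the simulation."""
    if len(waiting_ids) == 0:
        return 0
    still_loaded = [veh_id for veh_id in waiting_ids if veh_id in speeds]
    served = sum(1 for veh_id in still_loaded if speeds[veh_id] > moving_threshold)
    # Vehicles that already left the network are counted as served.
    served += len(waiting_ids) - len(still_loaded)
    return served / len(waiting_ids)


waiting = ["a", "b", "c", "d"]
speeds = {"a": 8.0, "b": 0.0, "c": 3.2}  # "d" has left the network
print(pct_served_offline(waiting, speeds))  # 0.75
```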

### Environment Class

In [4]:
class Environment:
    def __init__(self):
        self.prev_action = traci.trafficlight.getPhase(trafficlight_id)
        self.yellow_duration = 3 # duration of yellow phases in seconds between actions
        self.green_duration = 5 # minimum amount of time the green phases are on for

        self.static_action = 0 # adds reward for not changing the phase, prevents flickering
        self.waiting_ids = [] # list of vehicle ids that were waiting in one of the lanes now greenlit in the current phase
        self.pct_served = 0 # percentage of cars waiting at the relevant lanes that made it through on the last light cycle


    # Function to reset the SUMO environment
    def reset_sumo_environment(self, environment):
        # reload the simulation (command-line options are passed to SUMO as strings)
        traci.load(['-c', environment, '--start', '--step-length', str(TIME_STEP)])
        traci.trafficlight.setProgram(trafficlight_id, '0')
        
        # reset some variables
        self.waiting_ids = []
        self.pct_served = 0
        state = self.get_state()

        return state


    # Function to step through the SUMO simulation
    def step_in_sumo(self, action):
        # Apply the action
        self.apply_action(action)
        
        # Step the SUMO simulation forward
        traci.simulationStep()
        
        # Get the new state after taking the action
        next_state = self.get_state()
        
        # Calculate the reward for this step
        reward = self.calculate_reward()
        
        # Check if the episode is done
        done = self.check_done_condition()
        
        return next_state, reward, done


    # Function to get the current state
    def get_state(self):
        # average number of halted cars per lane, for the lanes each green phase controls
        state = get_avg_waiting()
        state.append(self.pct_served) # include the served percent of the current phase
        state.append(self.prev_action) # include the current action value
        
        return np.array(state)


    # Function to apply the action: hold the current phase, or switch via the yellow phase
    def apply_action(self, action):
        if action == self.prev_action:
            self.static_action = 1
            return
        
        # simulate the yellow light phase corresponding to the last green phase
        self.simulate_phase(2 * self.prev_action + 1, self.yellow_duration)

        # get the success parameters of the last light phase
        self.pct_served = pct_served(self.waiting_ids)
        self.waiting_ids = get_waiting_ids(phase_lane_control[action])
        
        # change to the new green phase, simulate for the minimum amount of time
        self.simulate_phase(2 * action, self.green_duration)
        self.prev_action = action


    # changes the phase and simulates it for the required amount of time
    def simulate_phase(self, action, duration):
        traci.trafficlight.setPhase(trafficlight_id, action)
        steps = 0
        while steps < duration / TIME_STEP:
            traci.simulationStep()
            steps += 1


    # Function to calculate the reward: hold bonus, plus served percentage term, minus queue length penalty
    def calculate_reward(self):
        reward = self.static_action + math.exp(4 * self.pct_served) - math.exp(0.2 * sum(get_avg_waiting()))
        
        self.static_action = 0
        self.pct_served = 0
        return reward


    # Function to check if the simulation should terminate
    def check_done_condition(self):
        collision_data = traci.simulation.getSubscriptionResults()
        
        # Terminate early only if colliding vehicles were actually reported
        if collision_data and collision_data.get(traci.constants.VAR_COLLIDING_VEHICLES_IDS):
            return True
        
        # Otherwise terminate once the simulation time exceeds the limit
        current_time = traci.simulation.getTime()
        return current_time > 2000  # Change this threshold as necessary
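
One detail of `simulate_phase` is worth checking numerically: the loop runs while `steps < duration / TIME_STEP`, so a duration that is not a multiple of the step length is rounded up to the next whole step. The sketch below reproduces only that loop, with no simulator attached; `simulated_duration` is a hypothetical helper, and the values 3, 5 and 25 are the yellow and green durations used in this notebook.

```python
def simulated_duration(duration, time_step):
    """Mirror the simulate_phase loop and report how long the phase really runs."""
    steps = 0
    while steps < duration / time_step:
        steps += 1
    return steps, steps * time_step


for requested in (3, 5, 25):
    steps, actual = simulated_duration(requested, 0.8)
    print(f"requested {requested}s -> {steps} steps, about {actual:.1f}s simulated")
```

With `TIME_STEP = 0.8`, each phase therefore lasts slightly longer than requested, and `step_in_sumo` adds one further simulation step after the action is applied.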

### DQN Class

In [7]:
# Define the neural network for the Q-function
class DQN(nn.Module):
    def __init__(self, n_state_params, n_actions):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(n_state_params, 12)
        self.fc2 = nn.Linear(12, 12)
        self.fc3 = nn.Linear(12, n_actions)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

### RL Agent Class

In [8]:
# Define the RL agent
class RLAgent:
    def __init__(self, n_state_params, n_actions):
        self.n_state_params = n_state_params
        self.n_actions = n_actions
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95  # discount rate
        self.epsilon = 0.05  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.model = DQN(n_state_params, n_actions)
        self.optimizer = optim.Adam(self.model.parameters(), lr=0.001)
        self.criterion = nn.MSELoss()

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.n_actions)
        state = torch.FloatTensor(state)
        q_values = self.model(state)
        return np.argmax(q_values.detach().numpy())

    def replay(self, batch_size):
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target += self.gamma * np.amax(self.model(torch.FloatTensor(next_state)).detach().numpy())
            target_f = self.model(torch.FloatTensor(state)).detach().numpy()
            # Check if action index is valid
            if 0 <= action < self.n_actions:
                target_f[action] = target
            else:
                print(f"Invalid action: {action}")

            # Convert back to tensor for loss calculation
            target_f_tensor = torch.FloatTensor(target_f)
            self.model.zero_grad()
            loss = self.criterion(target_f_tensor, self.model(torch.FloatTensor(state)))
            loss.backward()
            self.optimizer.step()
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
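
Since `replay` multiplies epsilon by `epsilon_decay` at most once per call, it is easy to estimate how many calls are needed before exploration reaches `epsilon_min`. The sketch below simply repeats that update; `decays_until_floor` is a hypothetical helper, and the inputs are the hyperparameters of the two agents in this notebook (decay 0.995 here, 0.999 for the Double DQN agent).

```python
def decays_until_floor(epsilon, epsilon_min, epsilon_decay):
    """Count multiplicative decays until epsilon is no longer above epsilon_min."""
    n = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        n += 1
    return n


print(decays_until_floor(0.05, 0.01, 0.995))  # DQN agent settings
print(decays_until_floor(0.05, 0.01, 0.999))  # Double DQN agent settings
```

Because the training loop in this notebook calls `replay` once per episode, a 200-episode run performs at most 200 decays, so epsilon stays above its floor for the whole run with either setting.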

### Start learning!

In [9]:
# Simulation interaction loop
def run_simulation(agent, env, num_episodes, batch_size):
    for e in range(num_episodes):
        state = env.reset_sumo_environment(environment)  # Reset the SUMO environment and get the initial state
        done = False
        total_reward = 0

        while not done:
            action = agent.act(state)
            next_state, reward, done = env.step_in_sumo(action)  # Step through the SUMO simulation
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward

        print(f"Episode: {e+1}/{num_episodes}, Total Reward: {total_reward}")
        agent.replay(batch_size)
        
        
# number of state parameters: one average halting count per phase group, plus the served percentage and the current action
env = Environment()
n_state_params = len(env.get_state())
print("Number of inputs:", n_state_params)
# Get the full phase program for the traffic light
program = traci.trafficlight.getAllProgramLogics(trafficlight_id)[0]

# Each action is a green phase; every green phase is paired with a yellow phase, so halve the phase count
n_actions = len(program.phases) // 2
print("actions:", n_actions)

agent = RLAgent(n_state_params, n_actions)
run_simulation(agent, env, num_episodes=200, batch_size=32)

Number of inputs: 6
actions: 4
Episode: 1/200, Total Reward: -150368.15186837577
Episode: 2/200, Total Reward: -13776.5059239758
Episode: 3/200, Total Reward: -2061.8097383189574
Episode: 4/200, Total Reward: 3835.394697783431
Episode: 5/200, Total Reward: -224.66470092972247
Episode: 6/200, Total Reward: 3039.657866567434
Episode: 7/200, Total Reward: 2409.3052344657235
Episode: 8/200, Total Reward: 1329.0604703049016
Episode: 9/200, Total Reward: 1274.9859693437254
Episode: 10/200, Total Reward: 850.1474864180457
Episode: 11/200, Total Reward: 1904.7084240220788
Episode: 12/200, Total Reward: 4143.226673564125
Episode: 13/200, Total Reward: 3766.861379099308
Episode: 14/200, Total Reward: -1737.9875706025498
Episode: 15/200, Total Reward: 4423.340685920078
Episode: 16/200, Total Reward: 3631.152514960783
Episode: 17/200, Total Reward: 4079.7195734691754
Episode: 18/200, Total Reward: 3515.013412373573
Episode: 19/200, Total Reward: 3424.262214589568
Episode: 20/200, Total Reward: -54

KeyboardInterrupt: 

## 2. Double DQN

### Setting up connection to SUMO

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
from collections import deque
import traci
import sumolib
import math

# Initialization for SUMO environment
environment = "intersection/sumo_config.sumocfg"
phase_lane_control = np.array([
    ["N2TL_0", "N2TL_1", "N2TL_2", "S2TL_0", "S2TL_1", "S2TL_2"],
    ["N2TL_3", "S2TL_3"],
    ["W2TL_0", "W2TL_1", "W2TL_2", "E2TL_0", "E2TL_1", "E2TL_2"],
    ["W2TL_3", "E2TL_3"]
], dtype=object)

sumobin = sumolib.checkBinary('sumo-gui')
traci.start([sumobin, '-c', environment, '--start'])  

traci.simulation.subscribe([traci.constants.VAR_COLLIDING_VEHICLES_IDS])

# Subscribe to vehicle accelerations for all vehicles
for veh_id in traci.vehicle.getIDList():
    traci.vehicle.subscribe(veh_id, traci.constants.VAR_ACCELERATION)

trafficlight_id = traci.trafficlight.getIDList()[0]
controlled_lanes = traci.trafficlight.getControlledLanes(trafficlight_id)
TIME_STEP = 0.8  # Simulation time step in seconds


### SUMO helper functions

In [2]:
# Utility functions
def get_avg_waiting():
    grouped_avg_waiting = [get_lane_num_waiting(lanes) / len(lanes) for lanes in phase_lane_control]
    return grouped_avg_waiting

def get_lane_num_waiting(lanes):
    # a generator expression avoids shadowing the built-in sum
    return sum(traci.lane.getLastStepHaltingNumber(lane_id) for lane_id in lanes)

def get_waiting_ids(lanes):
    ids = []
    for lane_id in lanes:
        ids.extend([veh_id for veh_id in traci.lane.getLastStepVehicleIDs(lane_id) if traci.vehicle.getSpeed(veh_id) < 0.1])
    return np.array(ids)

def pct_served(waiting_ids):
    if len(waiting_ids) == 0:
        return 0
    still_loaded = [veh_id for veh_id in waiting_ids if veh_id in traci.vehicle.getLoadedIDList()]
    num_waiting_served = len([veh_id for veh_id in still_loaded if traci.vehicle.getSpeed(veh_id) > 0.5])
    num_waiting_served += len(waiting_ids) - len(still_loaded)
    return num_waiting_served / len(waiting_ids)
    
def get_total_waiting_time():
    vehicles = traci.vehicle.getIDList()
    waiting_times = [traci.vehicle.getWaitingTime(vehicle) for vehicle in vehicles]
    return sum(waiting_times)


### Environment Class

In [3]:
class Environment:
    def __init__(self):
        self.prev_action = traci.trafficlight.getPhase(trafficlight_id)
        self.yellow_duration = 3
        self.green_duration = 25
        self.static_action = 0
        self.waiting_ids = []
        self.pct_served = 0

    def reset_sumo_environment(self, environment):
        traci.load(['-c', environment, '--start', '--step-length', str(TIME_STEP)])
        traci.trafficlight.setProgram(trafficlight_id, '0')
        self.waiting_ids = []
        self.pct_served = 0
        state = self.get_state()
        return state

    def step_in_sumo(self, action):
        self.apply_action(action)
        traci.simulationStep()
        next_state = self.get_state()
        reward = self.calculate_reward()
        done = self.check_done_condition()
        return next_state, reward, done

    def get_state(self):
        state = get_avg_waiting()
        state.append(self.pct_served)
        state.append(self.prev_action)
        return np.array(state)

    def apply_action(self, action):
        if action == self.prev_action:
            self.static_action = 1
            return
        self.simulate_phase(2 * self.prev_action + 1, self.yellow_duration)
        self.pct_served = pct_served(self.waiting_ids)
        self.waiting_ids = get_waiting_ids(phase_lane_control[action])
        self.simulate_phase(2 * action, self.green_duration)
        self.prev_action = action

    def simulate_phase(self, action, duration):
        traci.trafficlight.setPhase(trafficlight_id, action)
        steps = 0
        while steps < duration / TIME_STEP:
            traci.simulationStep()
            steps += 1

    def calculate_reward(self):
        reward = self.static_action + math.exp(4 * self.pct_served) - math.exp(0.2 * sum(get_avg_waiting()))
        self.static_action = 0
        self.pct_served = 0
        return reward

    def check_done_condition(self):
        collision_data = traci.simulation.getSubscriptionResults()
        # Terminate early only if colliding vehicles were actually reported
        if collision_data and collision_data.get(traci.constants.VAR_COLLIDING_VEHICLES_IDS):
            print('Collision detected')
            return True
        current_time = traci.simulation.getTime()
        return current_time > 3300

### DQN Class

In [4]:
# Q-network used for both the online and the target network of the Double DQN agent
class DQN(nn.Module):
    def __init__(self, n_state_params, n_actions):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(n_state_params, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, n_actions)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

### Double DQN Agent Class

In [5]:
class DoubleDQNAgent:
    def __init__(self, n_state_params, n_actions):
        self.n_state_params = n_state_params
        self.n_actions = n_actions
        self.memory = deque(maxlen=3300)
        self.gamma = 0.95  # Discount rate
        self.epsilon = 0.05  # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.999

        # Primary and Target Networks
        self.model = DQN(n_state_params, n_actions)
        self.target_model = DQN(n_state_params, n_actions)
        self.update_target_network()
        
        # Optimizer and loss
        self.optimizer = optim.Adam(self.model.parameters(), lr=0.0001)
        self.criterion = nn.MSELoss()

    def update_target_network(self):
        """Copy weights from the primary network to the target network."""
        self.target_model.load_state_dict(self.model.state_dict())

    def remember(self, state, action, reward, next_state, done):
        """Store experience in replay memory."""
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        """Select an action using an epsilon-greedy policy."""
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.n_actions)
        state = torch.FloatTensor(state)
        q_values = self.model(state)
        return np.argmax(q_values.detach().numpy())

    def replay(self, batch_size):
        """Sample a minibatch from replay memory and update the primary network."""
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        
        for state, action, reward, next_state, done in minibatch:
            # Double DQN target calculation
            target = reward
            if not done:
                next_state_tensor = torch.FloatTensor(next_state)
                
                # Double DQN: use model to select action, and target_model for Q-value
                next_action = np.argmax(self.model(next_state_tensor).detach().numpy())
                target_q_value = self.target_model(next_state_tensor).detach().numpy()[next_action]
                target += self.gamma * target_q_value

            # Prepare for gradient update
            target_f = self.model(torch.FloatTensor(state)).detach().numpy()
            if 0 <= action < self.n_actions:
                target_f[action] = target
            else:
                print(f"Invalid action: {action}")

            # Convert back to tensor for loss calculation
            target_f_tensor = torch.FloatTensor(target_f)
            self.model.zero_grad()
            loss = self.criterion(target_f_tensor, self.model(torch.FloatTensor(state)))
            loss.backward()
            self.optimizer.step()

        # Decay epsilon after each replay
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def train_target_network(self, update_frequency, episode):
        """Update target network every 'update_frequency' episodes."""
        if episode % update_frequency == 0:
            self.update_target_network()
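
The difference between the two targets can be seen on a single transition. The sketch below uses plain Python lists in place of network outputs; the function names and the Q-values are invented for illustration. The standard DQN target takes the maximum of one network's next-state estimates, while the Double DQN target, as in `replay` above, lets the online network pick the action and the target network score it.

```python
def dqn_target(reward, gamma, q_next, done):
    """Standard target: maximum over one network's next-state Q-values."""
    return reward if done else reward + gamma * max(q_next)


def double_dqn_target(reward, gamma, q_next_online, q_next_target, done):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return reward + gamma * q_next_target[best_action]


q_online = [1.0, 3.0, 2.0, 0.5]  # made-up online estimates for the next state
q_target = [1.2, 2.0, 2.5, 0.4]  # made-up target estimates for the next state
print(dqn_target(1.0, 0.95, q_target, False))
print(double_dqn_target(1.0, 0.95, q_online, q_target, False))
```

Here the Double DQN target is lower because the action the online network prefers is not the one the target network rates highest, which is how the method reduces overestimation of Q-values.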



### Start Learning!

In [6]:
# Simulation interaction loop
def run_simulation(agent, env, num_episodes, batch_size, update_frequency=10):
    for e in range(num_episodes):
        state = env.reset_sumo_environment(environment)
        done = False
        total_reward = 0

        while not done:
            action = agent.act(state)
            next_state, reward, done = env.step_in_sumo(action)
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward

        print(f"Episode: {e+1}/{num_episodes}, Total Reward: {total_reward}")
        agent.replay(batch_size)
        agent.train_target_network(update_frequency, e)


# Initialize environment and agent
env = Environment()
n_state_params = len(env.get_state())
program = traci.trafficlight.getAllProgramLogics(trafficlight_id)[0]
n_actions = len(program.phases) // 2  # one action per green phase; each has a paired yellow phase
agent = DoubleDQNAgent(n_state_params, n_actions)

# Run simulation
run_simulation(agent, env, num_episodes=200, batch_size=64)


Episode: 1/200, Total Reward: -57010.95386306469
Episode: 2/200, Total Reward: -7554.14254449663
Episode: 3/200, Total Reward: -148201.11396121263
Episode: 4/200, Total Reward: -9576.780691198876
Episode: 5/200, Total Reward: -44917.928764175274
Episode: 6/200, Total Reward: -63919.36106463328
Episode: 7/200, Total Reward: -29817.006145240757
Episode: 8/200, Total Reward: 3244.0908053237304
Episode: 9/200, Total Reward: 4321.277429511865
Episode: 10/200, Total Reward: 3947.9031082897477
Episode: 11/200, Total Reward: 3411.043955194727
Episode: 12/200, Total Reward: 4144.071788057313
Episode: 13/200, Total Reward: 4044.3482756774424
Episode: 14/200, Total Reward: 3299.1847051534937
Episode: 15/200, Total Reward: 2840.901768566655
Episode: 16/200, Total Reward: -68621.29862786568
Episode: 17/200, Total Reward: 749.1485801259033
Episode: 18/200, Total Reward: 3889.2306740349027
Episode: 19/200, Total Reward: 459.81637499685655
Episode: 20/200, Total Reward: 815.5488888228346


KeyboardInterrupt: 

# **Results**

## Results Plotter

In [None]:
import os
import re
import pandas as pd
import matplotlib.pyplot as plt

# Directories where the per-episode CSV files of the two baselines are stored
directory1 = 'outputs/baseline-1'
directory2 = 'outputs/baseline-2'

# Sort key that orders file names by their embedded numbers (e.g. ep2 before ep10)
def natural_key(filename):
    return [(0, int(part)) if part.isdigit() else (1, part) for part in re.split(r'(\d+)', filename)]

# Function to calculate average waiting times from a directory
def calculate_average_waiting_times(directory):
    average_waiting_times = []
    # os.listdir returns files in arbitrary order, so sort them to keep episodes in sequence
    for filename in sorted(os.listdir(directory), key=natural_key):
        if filename.endswith('.csv'):
            # Load the CSV file
            file_path = os.path.join(directory, filename)
            data = pd.read_csv(file_path)
            
            # Calculate the average waiting time for this episode
            avg_waiting_time = data['system_total_waiting_time'].mean()
            average_waiting_times.append(avg_waiting_time)
    return average_waiting_times

# Get average waiting times for both baselines
average_waiting_times_1 = calculate_average_waiting_times(directory1)
average_waiting_times_2 = calculate_average_waiting_times(directory2)


data = pd.read_csv('outputs/simulation_results_wait_time.csv')
average_waiting_times_3 = data['system_total_waiting_time']


# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(average_waiting_times_1, marker='o', linestyle='-', label='DQN')
plt.plot(average_waiting_times_2, marker='o', linestyle='-', label='Double DQN')
plt.plot(average_waiting_times_3, marker='o', linestyle='-', label='Our model')
plt.xlabel('Episode')
plt.ylabel('Average System Total Waiting Time')
plt.title('Average System Total Waiting Time Across Episodes')
plt.grid(True)
plt.legend()
plt.show()

## Baseline Results

DQN

![dqn](results/dqn_b.png)

Double DQN

![ddqn](results/ddqn_b.png)

## DQN Results

![eps = 0.05](results/dqn.png)

![eps = 0.9](results/dqn_0.9.png)

## Double DQN Results

![eps = 0.05](results/ddqn.png)

![eps = 0.5](results/ddqn_0.5.png)

# **Discussion**