# Introduction, motivation, problem statement

Urban traffic congestion is a growing challenge in cities worldwide, leading to increased travel times, fuel consumption, and air pollution. Traditional traffic light systems rely on pre-set timers or simple reactive policies, and often fail to respond effectively to dynamic traffic conditions. This inefficiency not only frustrates commuters but also imposes environmental and economic costs.

The task is to develop a deep reinforcement learning (RL) algorithm that dynamically controls traffic lights to optimize traffic flow in real time. SUMO (Simulation of Urban MObility) gives the RL agent a realistic traffic model to interact with. The agent must learn to adapt its traffic signal policy to current traffic conditions, minimizing average waiting times and congestion and improving overall traffic flow.


# Data sources or RL task


# Exploratory Analysis of Data or RL tasks

The SUMO environment offers a complex framework for simulating traffic in scenarios ranging from simple intersections to expansive urban networks. Using NetSim, we created a simple 8-lane intersection as the single environment that all RL agents, both our models and the baselines, interact with, so that their results can be compared directly.

The baseline RL agents control an intersection through a traffic light program that contains several light phases for the agent to switch between, and we adopted the same method. When the agent chooses the phase that is already active, it simply extends that phase's duration. When the agent chooses a new phase, it first activates the yellow phase corresponding to the last green phase and then switches to the new one.
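The switching rule described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the phase layout (each green phase at an even index, followed by its yellow phase at the next odd index), the yellow duration, and the names `PhaseController` and `choose` are all assumptions made for the example. In a real run, each returned index would be passed to SUMO, for instance through TraCI's `traci.trafficlight.setPhase`.

```python
from dataclasses import dataclass

# Assumed layout (hypothetical): green phase k sits at program index 2*k,
# and its yellow phase sits at index 2*k + 1.
NUM_GREEN_PHASES = 4


@dataclass
class PhaseController:
    current_green: int = 0   # program index of the active green phase
    yellow_steps: int = 3    # assumed yellow duration, in simulation steps

    def choose(self, action: int) -> list:
        """Return the program phase indices to run for the chosen action."""
        if not 0 <= action < NUM_GREEN_PHASES:
            raise ValueError(f"action must be in [0, {NUM_GREEN_PHASES})")
        target = 2 * action
        if target == self.current_green:
            # Same phase chosen again: just extend it.
            return [target]
        # New phase chosen: run the yellow of the last green first.
        yellow = self.current_green + 1
        sequence = [yellow] * self.yellow_steps + [target]
        self.current_green = target
        return sequence


controller = PhaseController()
print(controller.choose(0))  # extends the active green phase
print(controller.choose(2))  # yellow of phase 0, then the new green phase
```

Keeping the yellow transition inside the controller means the agent only ever chooses among the 4 green phases, which matches the action space described above.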

For our environment, we created 4 distinct green light phases as well as their accompanying yellow light phases. Any RL agent interacting with our SUMO scenario therefore learns, through the definable reward and state functions, which light phase should be active at a given time to improve traffic flow.

Because SUMO exposes a large library of APIs to the agents, two agents are distinguished not only by the architecture of their RL models but also by how they define the state of the environment and the reward function, which is central to how the agent interprets the optimization task. The challenge is to define a reward function that translates directly to the rather loosely specified problem statement of improving traffic flow. For example, an agent that is penalized only for the number of cars waiting at the intersection may find that the optimal solution is to flicker the lights rapidly, so that the cars are constantly in motion but few make it through the intersection.

# Models and/or method


The key difference between our model and the baselines lies in our redefinition of the reward function. To keep the agent from finding solutions that prioritize rapid phase changes, we chose to reward the success of each individual light phase. Before a phase is activated, we gather a list of all the vehicles waiting at the corresponding red light. After the phase has changed again, we look up those vehicles using SUMO API commands and check how many of them made it through the intersection. This ratio is rewarded exponentially to discourage situations where a vehicle has to wait several light cycles to pass through the single intersection.

We also needed to discourage the agent from letting a very long queue build up in the other lanes, so we included a penalty on the number of vehicles queued there. This penalty is also exponential, so that a few waiting vehicles are not punished too harshly while a large queue is punished severely. Finally, we included a simple reward for when the agent decides to extend the duration of the current phase, to help it build a strong connection between the state of the environment and which of the 4 outputs is the active one.
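The three reward terms described above can be sketched as a single function. The text does not give the exact formula, so everything numeric here is an assumption: the use of `exp(x) - 1`, the scale and rate constants, and the size of the extension bonus are placeholders, and the function and parameter names are hypothetical. The vehicle id sets would come from SUMO queries (for example, TraCI's `traci.lane.getLastStepVehicleIDs`), which are left out so the sketch stays self-contained.

```python
import math


def phase_reward(waiting_before, passed_after, other_queue, extended,
                 clear_scale=1.0, queue_scale=0.1, queue_rate=0.2,
                 extend_bonus=0.1):
    """Sketch of the phase-success reward; all constants are assumed.

    waiting_before: ids of vehicles waiting at the red light before the phase
    passed_after:   ids of vehicles that made it through the intersection
    other_queue:    number of vehicles queued in the other lanes
    extended:       True if the agent extended the current phase
    """
    waiting_before = set(waiting_before)
    cleared = waiting_before & set(passed_after)
    ratio = len(cleared) / len(waiting_before) if waiting_before else 0.0

    # Exponential reward for the share of waiting vehicles that got through.
    clear_reward = clear_scale * (math.exp(ratio) - 1.0)
    # Exponential penalty: mild for a few queued vehicles, severe for many.
    queue_penalty = queue_scale * (math.exp(queue_rate * other_queue) - 1.0)
    # Small fixed bonus for extending the active phase.
    bonus = extend_bonus if extended else 0.0
    return clear_reward - queue_penalty + bonus


full = phase_reward({"a", "b"}, {"a", "b"}, other_queue=2, extended=False)
half = phase_reward({"a", "b"}, {"a"}, other_queue=2, extended=False)
print(full > half)  # clearing more of the waiting vehicles scores higher
```

Subtracting 1 from each exponential makes both terms zero when nothing was cleared or nothing is queued; this is a design choice of the sketch, not something stated in the text.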

Along with the reward function, we redefined the information the agent receives on a state call. The principle here is simpler: give the agent all the information it needs to find the optimal solution under the chosen reward function. In this case, we gave the agent the current number of cars waiting at the intersection, the current success of the active phase expressed as a percentage, and the id number of the active phase. The last two help the agent develop a strong connection with the active phase and an incentive to extend its duration rather than switch to a new one.
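A minimal sketch of such a state vector is shown below. The function name, argument names, and the ordering of the three features are assumptions for illustration; the text only specifies which three quantities the agent receives.

```python
from typing import List


def build_state(waiting_count: int, cleared: int, targeted: int,
                active_phase: int) -> List[float]:
    """Assemble the three state features described above (hypothetical layout).

    waiting_count: cars currently waiting at the intersection
    cleared:       vehicles the active phase has let through so far
    targeted:      vehicles that were waiting when the phase started
    active_phase:  id number of the active phase
    """
    # Success of the active phase as a percentage; 0 if nothing was waiting.
    success_percent = 100.0 * cleared / targeted if targeted else 0.0
    return [float(waiting_count), success_percent, float(active_phase)]


print(build_state(7, 3, 4, 2))  # [7.0, 75.0, 2.0]
```

Feeding the phase id as a single number follows the description above; a one-hot encoding would be a common alternative, but it is not something the text claims was used.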

Tweaking the hyperparameters of the RL model also changed the convergence rate and the stability over a large number of episodes. The most impactful change was adjusting the epsilon value and its decay, which govern how often the network explores a random action instead of choosing one with its still-developing policy.
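For readers unfamiliar with the role of epsilon, the following is a generic epsilon-greedy sketch with multiplicative decay. It is not taken from the project: the class name, the decay schedule, and the default values are assumptions, and the actual models may decay epsilon differently.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy action selection with multiplicative decay (assumed schedule)."""

    def __init__(self, epsilon=1.0, decay=0.995, min_epsilon=0.05, seed=None):
        self.epsilon = epsilon
        self.decay = decay
        self.min_epsilon = min_epsilon
        self.rng = random.Random(seed)

    def select(self, q_values):
        """Pick a random action with probability epsilon, else the greedy one."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(q_values))
        return max(range(len(q_values)), key=q_values.__getitem__)

    def step(self):
        """Shrink epsilon after an episode, never going below the floor."""
        self.epsilon = max(self.min_epsilon, self.epsilon * self.decay)


policy = EpsilonGreedy(epsilon=1.0, decay=0.5, min_epsilon=0.2, seed=0)
for _ in range(3):
    policy.step()
print(policy.epsilon)  # 0.2, the floor, after 1.0 -> 0.5 -> 0.25 -> 0.2
```

A slower decay keeps the agent exploring for more episodes, which tends to help stability at the cost of slower convergence; a faster decay does the opposite.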

# Results

![title](results/dqn_b.png)
![title](results/ddqn_b.png)
![title](results/dqn.png)
![title](results/dqn_0.9.png)
![title](results/ddqn.png)
![title](results/ddqn_0.5.png)
![title](results/comparison.png)

# Baseline

### Environment

### Agent (simple)

### Run the program

