# COGS 181 - Final Project

# Insert title here

## Group members

- Abhay Anand
- Ashesh Kaji
- Varun Pillai
- Ansh Bhatnagar


# Abstract 
The goal of this project is to train a reinforcement learning (RL) agent for autonomous driving. We plan to use the DonkeyCar Simulator to navigate the "Warren Field" track and collect our data. The data we plan to collect includes visual input from the front-facing camera, speed, and changes in steering angle, all gathered during training in the simulated environment. We implement and compare two deep RL algorithms, Actor-Critic and Proximal Policy Optimization (PPO), both designed for continuous action spaces. We also plan to include a simple Q-learning model to show how inefficient and ineffective it is on more complex problems. We will use the gathered data to train agents to take optimal actions, such as steering left or right, accelerating, and braking, based on the car's current position. Performance will be evaluated using key metrics such as cumulative reward, lap completion time, and DonkeySim's own accuracy rating. By comparing these metrics across models and training scenarios, we aim to determine which RL method provides the most robust and efficient control for autonomous driving in simulated environments.

# Background

Autonomous driving has emerged as a rapidly evolving field, as recent advances in computing power
and machine learning continue to push the boundaries of vehicle autonomy<a name="koutnik"></a>[<sup>[1]</sup>](#koutniknote).
A major breakthrough was the introduction of deep reinforcement learning methods capable of learning
complex control policies from high-dimensional data, such as pixel inputs<a name="mnih"></a>[<sup>[2]</sup>](#mnihnote).

Among various simulation tools, DonkeyCar Simulator ("DonkeySim") provides a relatively accessible
environment where researchers can collect training data in a less resource-intensive, hobby-focused
setting<a name="donkeycar"></a>[<sup>[3]</sup>](#donkeycarnote). DonkeySim offers a front-facing camera stream,
speed readings, and steering angle data—sufficient for exploring end-to-end RL pipelines.

Concurrent work in policy optimization techniques, such as Proximal Policy Optimization (PPO),
has improved training stability for continuous control tasks, making them more suitable
for real-world driving problems<a name="schulman"></a>[<sup>[4]</sup>](#schulmannote). By leveraging
vision-based RL, robust network architectures, and user-friendly simulators like DonkeySim,
researchers aim to accelerate the development of autonomous vehicle control systems.

Why is this important? Autonomous driving stands to improve road safety, increase mobility,
and reduce congestion. However, it also introduces unique challenges in perception, planning,
and control. Studying reinforcement learning in this domain is crucial for advancing algorithms
that can handle high-dimensional state spaces and continuous action controls, ultimately bringing
us closer to reliable self-driving cars.

# Problem Statement

Autonomous navigation for industrial and factory environments requires precise and efficient vehicle control to ensure safe and timely transportation of goods. Traditional rule-based and vision-based approaches struggle with real-time adaptability and robustness in dynamic settings, where unexpected obstacles may arise from minor mishaps. Through our project, we aim to develop a deep reinforcement learning (RL) model that enables autonomous vehicles to navigate factory environments using only LiDAR data as input. By leveraging reinforcement learning techniques, particularly Proximal Policy Optimization (PPO) and Actor-Critic methods, we aim to train a model capable of handling continuous action spaces while minimizing computational complexity. Our goal is to create an efficient, collision-free, and fast driving policy that enhances safety, accuracy, and cost-effectiveness in automated logistics and manufacturing operations.

# Data
For this reinforcement learning project, we generated training data through interactions with the CARLA simulator, a high-fidelity environment designed for autonomous vehicle research. Unlike traditional datasets, our approach relies on real-time sensory inputs from the simulator, with LiDAR data serving as the cornerstone of our state representation.

### LiDAR Data Collection
- **LiDAR Sensor**: We attached a LiDAR sensor to the vehicle in CARLA, configured with a 50-meter range to capture point cloud data representing distances to surrounding objects.
- **Data Processing**: The raw LiDAR point cloud is processed to create a compact yet informative state:
  - The point cloud is divided into 8 sectors, each spanning 45 degrees around the vehicle (from -180° to 180°).
  - For each sector, we compute the minimum and maximum distances to obstacles, yielding 16 features (8 minimums and 8 maximums). If no points are detected in a sector, a default distance of 50 meters is used.
  - This approach reduces computational complexity while preserving spatial awareness, enabling the agent to detect obstacles in all directions.
- **State Augmentation**: The LiDAR-derived features are augmented with two additional variables:
  - Distance to the next waypoint, calculated as the Euclidean distance from the vehicle’s current position to the target waypoint’s location.
  - Angular difference between the vehicle’s current heading (yaw) and the direction to the next waypoint, normalized to [-180°, 180°].
  - This results in an 18-dimensional state vector (16 LiDAR features + 2 navigation features).
- **Implementation Details**: In our `CarlaEnvWrapper` class, the `_get_state` method processes LiDAR data by converting raw points into polar coordinates (angles and distances), binning them into sectors, and computing min/max distances. The state is updated synchronously with each simulation tick to ensure consistency.
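
The processing steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the actual `_get_state` code: the function names, the assumption that points arrive as x/y coordinates in the vehicle frame, and the ordering of features (8 minimums, then 8 maximums, then the two waypoint features) are all hypothetical choices made for the example.

```python
import numpy as np

NUM_SECTORS = 8
MAX_RANGE_M = 50.0  # sensor range; also the default when a sector has no points


def lidar_to_features(points_xy: np.ndarray) -> np.ndarray:
    """Reduce an (N, 2) array of x/y points in the vehicle frame to 16 features."""
    angles = np.degrees(np.arctan2(points_xy[:, 1], points_xy[:, 0]))  # [-180, 180]
    dists = np.hypot(points_xy[:, 0], points_xy[:, 1])
    # Map each angle to a 45-degree sector 0..7; clip so +180 lands in the last sector.
    sectors = np.clip(((angles + 180.0) // 45.0).astype(int), 0, NUM_SECTORS - 1)

    mins = np.full(NUM_SECTORS, MAX_RANGE_M)
    maxs = np.full(NUM_SECTORS, MAX_RANGE_M)
    for s in range(NUM_SECTORS):
        d = dists[sectors == s]
        if d.size:  # keep the 50 m default for empty sectors
            mins[s] = d.min()
            maxs[s] = d.max()
    return np.concatenate([mins, maxs])


def waypoint_features(pos_xy, yaw_deg, waypoint_xy):
    """Return (Euclidean distance, heading error in degrees) to the next waypoint."""
    dx, dy = waypoint_xy[0] - pos_xy[0], waypoint_xy[1] - pos_xy[1]
    distance = float(np.hypot(dx, dy))
    bearing = np.degrees(np.arctan2(dy, dx))
    # Wrap the difference into [-180, 180).
    angle_diff = (bearing - yaw_deg + 180.0) % 360.0 - 180.0
    return distance, float(angle_diff)


def build_state(points_xy, pos_xy, yaw_deg, waypoint_xy) -> np.ndarray:
    """Assemble the 18-dimensional state: 16 LiDAR features + 2 navigation features."""
    dist, ang = waypoint_features(pos_xy, yaw_deg, waypoint_xy)
    return np.concatenate([lidar_to_features(points_xy), [dist, ang]])
```

Calling `build_state` with an empty point cloud (`np.empty((0, 2))`) yields 16 entries of 50.0 followed by the two waypoint features, which matches the default-distance rule described above.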

By using LiDAR data instead of camera inputs, we avoid the computational overhead of convolutional neural networks, making our model efficient for real-time industrial applications while maintaining robust environmental perception.

## Old (needs to be updated for donkey sim and then removed)
For this reinforcement learning project, we will not use a pre-existing dataset to train our agent. Instead, we will generate training data in the DonkeySim environment, an application with pre-made racing tracks for testing autonomous vehicles. The simulator provides essential sensory inputs, including a front-facing camera stream, speed readings, obstacle-hit counters, and steering angle data, which will form the basis of the state space for our models. (We may not use all of the listed data sources, or we may add more. Since this is our first time using DonkeySim, we will have to experiment as we go.)

The agent will interact with the environment by taking actions such as steering, accelerating, and braking, and it will receive rewards based on the reward function's predefined criteria, such as staying on track and minimizing lap time. During training, we currently plan to collect and store experience tuples (state, action, reward, next state) from which the agent learns. To enhance performance, we may experiment with different reward functions and data augmentation techniques, such as varying tracks, adjusting the weights of the reward function, and penalizing sharp turns to improve model generalization. By training our agent iteratively within DonkeySim, we ensure that our approach remains scalable and adaptable without the need for an external dataset.

# Proposed Solution

### 1. State Representation
- The vehicle relies solely on LiDAR data, processed into an 18-dimensional state vector (16 LiDAR features + 2 waypoint features), as detailed in the Data section.
- This design ensures a low computational footprint and real-time decision-making capability.

### 2. Action Space
- The action space is continuous, with two dimensions: throttle/brake ([-1, 1], where positive values are throttle and negative are brake) and steering ([-1, 1]).
- This allows the agent to dynamically adjust speed and direction, learning the interplay between velocity and turning for smooth navigation.
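
One way to turn the combined throttle/brake value into separate controls is sketched below. The function name and the returned dictionary are hypothetical; the three keys were chosen to mirror the `throttle`, `steer`, and `brake` fields of CARLA's `VehicleControl`, and the clipping step is an assumption about how out-of-range samples would be handled.

```python
def split_action(throttle_brake: float, steering: float) -> dict:
    """Map a 2-D action in [-1, 1]^2 to throttle, brake, and steer commands."""
    # Clip first, since a sampled Gaussian action can fall outside [-1, 1].
    tb = max(-1.0, min(1.0, throttle_brake))
    steer = max(-1.0, min(1.0, steering))
    return {
        "throttle": max(tb, 0.0),  # positive part drives the throttle
        "brake": max(-tb, 0.0),    # negative part drives the brake
        "steer": steer,
    }
```

For example, `split_action(0.6, -0.2)` gives 60% throttle, no brake, and a slight left steer, while any negative first component produces braking with zero throttle.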

### 3. Reinforcement Learning Approach
- We implemented Proximal Policy Optimization (PPO) using PyTorch Lightning, training the agent in the CARLA simulator.
- PPO balances exploration and exploitation, making it ideal for our continuous control task.
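
The clipped surrogate objective that gives PPO its stable updates can be illustrated with a short NumPy sketch. This is the standard PPO-Clip loss, not an excerpt from our training code; the function name and the clip range of 0.2 are assumptions made for the example.

```python
import numpy as np


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over a batch."""
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes the incentive to move the ratio far from 1.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negative mean advantage; once the ratio leaves the clip range in the direction that would increase the objective, the gradient through that sample vanishes.
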
#### Neural Network Architecture
The PPO agent relies on two distinct neural networks: the actor, which determines the policy (action selection), and the critic, which estimates the value function. These networks are designed as multi-layer perceptrons (MLPs) with the following structures:

- **Actor Network**:
  - **Structure**: A four-layer MLP:
    - **Input Layer**: 18 neurons, corresponding to the 18-dimensional state vector (16 LiDAR features + 2 waypoint features).
    - **Hidden Layers**: Three layers with 256, 256, and 128 neurons, respectively, each followed by Tanh activation functions.
    - **Output Layer**: 4 neurons, representing the mean and log standard deviation (log_std) for two continuous actions: throttle and steering (2 neurons per action).
  - **Output Processing**:
    - **Throttle Mean**: Passed through a sigmoid function to produce values in the range [0,1], biasing the agent toward forward movement.
    - **Steering Mean**: Passed through a tanh function to produce values in the range [-1,1], enabling smooth left and right turns.
    - **Standard Deviation**: The log_std outputs are exponentiated, clamped, and constrained to a minimum value to ensure sufficient exploration during training.
  - **Initialization**: Weights are initialized using Xavier uniform initialization, and biases are set to zero.
  - **Purpose**: The Tanh activations help stabilize policy updates, while the split output design accommodates the continuous action space required for driving control.

- **Critic Network**:
  - **Structure**: A four-layer MLP:
    - **Input Layer**: 18 neurons, matching the state dimension.
    - **Hidden Layers**: Three layers with 256, 256, and 128 neurons, respectively, each followed by ReLU activation functions.
    - **Output Layer**: 1 neuron, providing the value estimate for the given state.
  - **Initialization**: Weights are initialized using Xavier uniform initialization, and biases are set to zero.
  - **Purpose**: The ReLU activations support effective value approximation, enabling the critic to provide stable and accurate estimates of the state’s expected return.
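
The two networks described above could be written in plain PyTorch roughly as follows. This is a sketch under stated assumptions, not the project's actual module: the class and helper names are hypothetical, the ordering of the four actor outputs (two means, then two log standard deviations) is assumed, and the `log_std` clamp bounds are placeholder values, with the lower bound playing the role of the minimum standard deviation.

```python
import torch
import torch.nn as nn

STATE_DIM = 18
ACTION_DIM = 2  # throttle, steering


def mlp(out_dim: int, activation: type) -> nn.Sequential:
    """18 -> 256 -> 256 -> 128 -> out_dim with Xavier-uniform weights, zero biases."""
    sizes = [STATE_DIM, 256, 256, 128]
    layers = []
    for in_f, out_f in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(in_f, out_f), activation()]
    layers.append(nn.Linear(sizes[-1], out_dim))
    net = nn.Sequential(*layers)
    for m in net.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)
    return net


class Actor(nn.Module):
    def __init__(self, log_std_min: float = -2.0, log_std_max: float = 0.5):
        super().__init__()
        # Clamp bounds are assumed values, not taken from the project code.
        self.net = mlp(2 * ACTION_DIM, nn.Tanh)
        self.log_std_min, self.log_std_max = log_std_min, log_std_max

    def forward(self, state: torch.Tensor):
        out = self.net(state)
        mean_raw, log_std = out[..., :ACTION_DIM], out[..., ACTION_DIM:]
        throttle = torch.sigmoid(mean_raw[..., :1])  # [0, 1], biased toward forward motion
        steering = torch.tanh(mean_raw[..., 1:])     # [-1, 1]
        mean = torch.cat([throttle, steering], dim=-1)
        # Clamping log_std before exponentiating keeps std >= exp(log_std_min).
        std = torch.exp(log_std.clamp(self.log_std_min, self.log_std_max))
        return mean, std


class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = mlp(1, nn.ReLU)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)  # one value estimate per state
```

During rollouts, an action would typically be sampled from `torch.distributions.Normal(mean, std)`, and the same distribution provides the log-probabilities needed for the PPO update.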

### 4. Reward Function Design
The reward function evolved iteratively to guide the agent toward safe, efficient, and route-following behavior:

- **Initial Reward Function**:
  - **Collision Avoidance**: A penalty of -50 was applied for collisions to prioritize safety.
  - **Speed Maintenance**: Reward was proportional to distance traveled per step, encouraging forward movement.
  - This basic design promoted movement while avoiding obstacles but lacked route guidance.

- **Intermediate Reward Function**:
  - **Lane Discipline**: Added a -1 penalty for lane invasions to keep the vehicle within track boundaries.
  - **Speed Regulation**: Introduced a target speed of 30 km/h, with a penalty (-0.1 * |speed - target|) for deviations, and an additional -1 penalty for speeds below 5 km/h to prevent stalling.
  - **Steering Smoothness**: Penalized large steering actions (-0.5 * |steering|) when speed was below 5 km/h to reduce erratic behavior at low speeds.
  - This improved track adherence and consistency but didn’t ensure progress along a specific path.

- **Final Reward Function**:
  - **Waypoint Following**: Added a reward based on proximity to the next waypoint (max(0, 5 - distance/10)), encouraging route adherence.
  - **Heading Alignment**: Included a bonus (max(0, 1 - |angle_diff|/180)) for aligning the vehicle’s heading with the waypoint direction, promoting smoother turns.
  - **Progress Reward**: Retained distance traveled as a base reward, augmented by waypoint incentives.
  - **Safety Penalties**: Kept collision (-50) and lane invasion (-1) penalties.
  - **Stuck Detection**: Penalized (-2) if the vehicle’s position varied by less than 1 meter over 20 steps, preventing circular or stagnant behavior.
  - Implemented in `CarlaEnvWrapper.step`, this final version balances safety, efficiency, and navigation.
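
The final reward terms listed above can be combined as in the following sketch. It is an illustration of the stated formulas, not the code in `CarlaEnvWrapper.step`: the function and class names are hypothetical, only the terms listed for the final version are included, and "position varied by less than 1 meter" is interpreted here as the diagonal of the bounding box of the last 20 positions, which is an assumption.

```python
import math
from collections import deque


class StuckDetector:
    """Flags the vehicle as stuck if it moved less than min_move_m over a window of steps."""

    def __init__(self, window: int = 20, min_move_m: float = 1.0):
        self.positions = deque(maxlen=window)
        self.min_move_m = min_move_m

    def update(self, x: float, y: float) -> bool:
        self.positions.append((x, y))
        if len(self.positions) < self.positions.maxlen:
            return False  # not enough history yet
        xs, ys = zip(*self.positions)
        spread = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
        return spread < self.min_move_m


def final_reward(step_distance_m, waypoint_distance_m, angle_diff_deg,
                 collided, lane_invaded, stuck):
    """Sum the final reward components for a single simulation step."""
    reward = step_distance_m                                # progress
    reward += max(0.0, 5.0 - waypoint_distance_m / 10.0)    # waypoint following
    reward += max(0.0, 1.0 - abs(angle_diff_deg) / 180.0)   # heading alignment
    if collided:
        reward -= 50.0
    if lane_invaded:
        reward -= 1.0
    if stuck:
        reward -= 2.0
    return reward
```

As a worked example, a step that covers 0.5 m while 20 m from the waypoint with a 90° heading error earns 0.5 + 3.0 + 0.5 = 4.0; the same step with a collision earns -46.0, so a single crash outweighs many steps of good progress.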

### 5. Deployment and Applications
#### Setup deployment details
A complete list of setup details can be found in the code repo's README. In brief, the following items must be set up:
1. CARLA 0.9.15 must be installed on a GPU-based machine, along with the Python API of the same version for full compatibility.
2. Python must be installed with the required packages.
3. The car is placed in the default map with the default settings. The map simulates a real-world environment, but with no NPCs, to reduce complexity given the project's scale.
#### Potential Future applications
- The trained PPO model can optimize logistics in factory settings, enabling autonomous vehicles to transport goods safely and efficiently along predefined routes.
- This approach reduces costs and enhances precision in industrial automation.

# Evaluation Metrics
#### 1. Cumulative Reward
- **Definition**: The total reward accumulated over an episode, calculated using the reward function defined in the Proposed Solution section.
- **Significance**: This metric reflects the overall performance of the agent. Higher cumulative rewards indicate better navigation, fewer collisions, and more effective adherence to the intended route. It serves as a primary indicator of policy improvement during training.

#### 2. Collision Rate
- **Definition**: The frequency of collisions with obstacles or boundaries during an episode.
- **Significance**: A lower collision rate is desirable, as it demonstrates the agent’s ability to navigate safely and avoid hazards. This metric is critical for evaluating the safety performance of the driving policy.

#### 3. Distance Traveled
- **Definition**: The total distance covered by the vehicle over the course of an episode.
- **Significance**: When paired with a low collision rate, a higher distance traveled suggests efficient and effective navigation. This metric highlights the agent’s progress and ability to follow the desired path.
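
The three metrics can be computed from a per-step episode log, as in the sketch below. The log format (a list of rewards, a list of x/y positions, and a list of collision flags) and the function name are hypothetical, and collision rate is taken here as the fraction of steps with a collision, which is one possible reading of "frequency".

```python
import math


def episode_metrics(rewards, positions, collision_flags):
    """Summarize one episode from per-step rewards, (x, y) positions, and collision flags."""
    cumulative_reward = sum(rewards)
    # Path length: sum of straight-line distances between consecutive positions.
    distance = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    collisions = sum(bool(c) for c in collision_flags)
    collision_rate = collisions / len(collision_flags) if collision_flags else 0.0
    return {
        "cumulative_reward": cumulative_reward,
        "distance_traveled_m": distance,
        "collision_rate": collision_rate,
    }
```

Averaging these dictionaries over the episodes of an epoch gives the per-epoch curves used to compare reward designs.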


# Results

The primary objective of this study is to demonstrate that a well-designed reward function is essential for effective reinforcement learning (RL)-based autonomous navigation. Additional considerations include the utility of LiDAR data for state representation and the appropriateness of PPO for continuous control tasks. We evaluate the PPO agent's performance using two distinct reward models: a simpler reward-based model and an older, more complicated reward-based model. The analysis centers on key performance metrics, including episode rewards, distance traveled, episode lengths, and reward components, derived from the training progress at epoch 50 for the older model and the rewards vs. steps relationship for the simpler model.

---
#### Subsection 1: Performance of the Simpler Reward-Based Model
<img src="PPO_agent/reward_vs_steps_by_epoch.png" alt="Episode Reward vs Steps (Colored by Epoch)" />
The performance of the PPO agent with the simpler reward-based model is illustrated in a scatter plot titled "Episode Reward vs Steps (Colored by Epoch)," which tracks episode rewards against the number of steps across epochs 35 to 72.

- **Early Epochs (35-40)**:
  - During the initial training phase (epochs 35-40), episode rewards are consistently low, ranging between -6000 and -4000. At around 100 steps, rewards frequently reach -6000, reflecting poor performance early in training. As steps increase to 200, rewards remain predominantly negative, with significant variability—some points improve to -1000, while others drop back to -5000 or lower.

- **Mid Epochs (41-55)**:
  - In the mid-training phase (epochs 41-55), the agent's performance begins to improve. Between 200 and 300 steps, episode rewards cluster between -4000 and -2000, with fewer points falling below -5000. Variability persists, with some episodes achieving rewards as high as -1000, while others remain around -4000, indicating inconsistent progress.

- **Later Epochs (56-72)**:
  - By the later epochs (56-72), further improvement is evident. Between 300 and 450 steps, rewards concentrate between -3000 and -1000, with several points nearing 0 (e.g., around -500 at 400-450 steps). However, some outliers still drop to -4000, suggesting that the agent has not yet achieved fully stable performance.

**Key Observation**:
The scatter plot reveals a clear upward trend in episode rewards as both steps and epochs increase, demonstrating that the PPO agent is learning and refining its navigation abilities under the simpler reward model. Nevertheless, the persistent variability in rewards indicates that additional training or model refinement may be required to achieve consistent and optimal performance.

---

#### Subsection 2: Performance of the Older, More Complicated Reward-Based Model (Training Progress at Epoch 50)
<img src="PPO_agent/training_progress_epoch_50.png" alt="Training Progress Epoch 50" />
The older, more complicated reward-based model’s performance is assessed using four charts depicting training progress over 100 epochs. Here, we focus specifically on the agent's behavior at epoch 50.

- **Episode Rewards**:
  - At epoch 50, the total reward is approximately 2,000. This follows a peak of around 10,000 at episode 40 and a subsequent decline. Across the 100 epochs, rewards fluctuate widely, with notable spikes at episode 10 (around 20,000), episode 60 (around 15,000), and episode 80 (around 5,000), but they generally hover around 2,000 outside these peaks.

- **Distance Traveled**:
  - The distance traveled at epoch 50 reaches approximately 120 meters, a significant peak compared to surrounding episodes, which typically range between 20 and 40 meters. Other prominent peaks include episode 10 (120 meters), episode 30 (100 meters), episode 70 (80 meters), and episode 90 (140 meters).

- **Episode Lengths**:
  - At epoch 50, the episode length is around 300 steps, relatively modest compared to earlier peaks like episode 10 (900 steps). Episode lengths generally fluctuate between 200 and 400 steps, with occasional spikes, such as 600 steps at episode 60 and 700 steps at episode 80.

- **Reward Components**:
  - The total reward at epoch 50 is dominated by the "progress" component (approximately 2,000), with a minor contribution from the "longevity" component (around 500). Other components—collision, lane invasion, lane centering, heading, speed, steering smoothness, survival bonus, and completion bonus—remain near zero, indicating negligible influence on the total reward.

**Key Observation**:
At epoch 50, the older model yields a total reward of 2,000, with a notable distance traveled of 120 meters. The overwhelming contribution of the "progress" reward component suggests that the agent is primarily incentivized to move forward, while safety and efficiency metrics (e.g., collision avoidance, lane discipline) play a minimal role. This imbalance may result in suboptimal navigation behavior, as the agent prioritizes distance over other critical objectives.

---

#### Subsection 3: Comparative Analysis and Key Insights

- **Simpler Reward Model**:
  - The simpler model exhibits a steady increase in episode rewards, progressing from -6000 to near 0 across epochs 35 to 72. While this indicates effective learning, the high variability in rewards suggests that the agent has not yet converged to a stable, optimal policy.
  - Without detailed reward component data, specific behavioral improvements (e.g., collision avoidance, lane adherence) are difficult to assess, but the overall trend reflects enhanced navigation capability.

- **Older, More Complicated Reward Model**:
  - The older model displays significant reward volatility, with occasional high-reward episodes (e.g., 20,000 at episode 10) followed by drops to lower values (e.g., 2,000 at epoch 50). Distance traveled and episode lengths also fluctuate widely.
  - The reward components highlight a heavy reliance on "progress," with minimal contributions from safety or efficiency metrics. This skewed reward structure likely undermines consistent performance by favoring forward movement over balanced navigation.

**Insight**:
The simpler reward model, despite its simplicity, shows more consistent improvement in rewards over time, suggesting that a straightforward, well-tuned reward function may outperform a complex one in this context. Conversely, the older model’s focus on progress at the expense of other behaviors leads to erratic outcomes, underscoring the need for a balanced reward design.

---

#### Subsection 4: Impact of Reward Function on Agent Behavior

- **Simpler Reward Model**:
  - The steady reward improvement implies that the agent is learning to balance basic navigation tasks, such as advancing while avoiding obstacles. However, the lack of reward component breakdown limits insight into specific behavioral advancements.

- **Older Reward Model**:
  - The dominance of the "progress" component likely drives the agent to prioritize distance covered over safety or precision. The high distance traveled at epoch 50 (120 meters) despite a modest reward (2,000) suggests risky actions, such as neglecting collisions or lane discipline, which contribute little to the reward.

**Observation**:
The older model’s reward structure may encourage unsafe navigation by underweighting critical safety and efficiency metrics. This highlights the importance of a reward function that integrates multiple objectives—progress, safety, and adherence—to foster robust autonomous navigation.




# Discussion

### Interpreting the result

### Limitations

### Future work

### Ethics & Privacy

### Conclusion

