# Reinforcement Learning Exploration on `CartPole-v1` by [GitHub User]

This report summarizes a GitHub user's work applying reinforcement learning to the `CartPole-v1` environment with the Policy Gradients method. The project progresses from a hard-coded policy to a neural network policy.

## Initial Strategy: Basic Policy

The initial approach used a simple, hard-coded policy: if the pole tilts left, push the cart left; if it tilts right, push the cart right. This policy kept the pole balanced for at most 63 steps, well short of the 200-step target the project used as its criterion for solving the environment. (A 200-step limit is the episode cap of `CartPole-v0`; `CartPole-v1` caps episodes at 500 steps.)
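
The rule described above can be sketched in a few lines. This is an illustration rather than the original author's code: the names `basic_policy` and `run_episode` are hypothetical, and `run_episode` assumes an environment that follows the Gymnasium API (`reset()` returns `(obs, info)`, `step()` returns a 5-tuple).

```python
def basic_policy(obs):
    """Push the cart toward the side the pole is leaning to."""
    pole_angle = obs[2]  # CartPole observation: [x, x_dot, angle, angle_dot]
    return 0 if pole_angle < 0 else 1  # 0 = push left, 1 = push right


def run_episode(env, policy, max_steps=500):
    """Play one episode and return the number of steps survived.

    Assumes a Gymnasium-style env (an assumption, not confirmed by the source).
    """
    obs, _info = env.reset()
    for step in range(1, max_steps + 1):
        obs, _reward, terminated, truncated, _info = env.step(policy(obs))
        if terminated or truncated:
            return step
    return max_steps
```

With Gymnasium installed, something like `run_episode(gymnasium.make("CartPole-v1"), basic_policy)` should return the episode length for one run; repeating it over many episodes would give a maximum comparable to the figure reported here.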

## Advancement with Neural Network Policies

Recognizing the need for a more advanced solution, the GitHub user introduced a neural network to function as the policy model. This model predicts the probability of moving the cart left or right based on the environment's state, aiming to enhance decision-making.

### Neural Network Design:

- **Input Layer**: Takes in the state of the environment.
- **Hidden Layer**: Comprises 5 neurons with ReLU activation to introduce non-linearity.
- **Output Layer**: A single neuron with sigmoid activation outputs the probability of moving left.
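
To make the shapes concrete, here is a minimal forward pass for such a network in plain Python. It is a sketch under assumptions: the input size of 4 matches the CartPole observation, the weight range and the names `init_policy`, `prob_left`, and `sample_action` are hypothetical, and a real implementation would more likely use a deep learning library.

```python
import math
import random


def init_policy(n_inputs=4, n_hidden=5, seed=0):
    """Randomly initialise weights for a 4-5-1 network (layer sizes from the text)."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def prob_left(params, obs):
    """Forward pass: ReLU hidden layer, then a sigmoid output in (0, 1)."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, obs)) + b)
        for row, b in zip(params["w1"], params["b1"])
    ]
    logit = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return 1.0 / (1.0 + math.exp(-logit))


def sample_action(params, obs, rng=random):
    """Sample 0 (left) with probability prob_left, otherwise 1 (right)."""
    return 0 if rng.random() < prob_left(params, obs) else 1
```

Sampling the action, rather than always taking the more probable one, lets the agent keep exploring while it trains.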

## Training via Policy Gradients

The Policy Gradients algorithm was utilized for training the neural network. This involved:

- **Episode Playthroughs**: Generating data by playing multiple episodes.
- **Reward Discounting**: Scoring each action by the discounted sum of the rewards that follow it, so that nearer rewards count more than distant ones; this addresses the credit assignment problem.
- **Reward Normalization**: Standardizing the learning signal across episodes.
- **Probability Adjustment**: Making beneficial actions more probable based on their outcomes.
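
The discounting and normalization steps above can be sketched as follows. The function names and the `gamma` parameter are illustrative assumptions; the source does not state which discount factor was used.

```python
from statistics import mean, pstdev


def discount_rewards(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    return discounted


def discount_and_normalize(all_rewards, gamma):
    """Discount each episode, then standardise over all episodes together.

    Assumes the discounted returns are not all identical (non-zero std).
    """
    all_discounted = [discount_rewards(r, gamma) for r in all_rewards]
    flat = [g for episode in all_discounted for g in episode]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(g - mu) / sigma for g in episode] for episode in all_discounted]
```

For example, `discount_rewards([10, 0, -50], gamma=0.8)` evaluates to `[-22.0, -40.0, -50.0]`: the first action is penalized for the large negative reward that arrives two steps later. After normalization, actions with a positive score are made more probable and those with a negative score less probable.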

## Achievements

Through iterative training (150 iterations with 10 episodes each), the neural network policy demonstrated significant progress:

- **Mean Rewards**: The policy reached a mean reward of about 190.3 per episode, roughly three times the 63-step maximum of the basic policy and close to the 200-step target.

## Conclusion and Acknowledgement

The GitHub user's exploration into reinforcement learning with Policy Gradients has effectively demonstrated how a shift from basic strategies to more complex neural network policies can significantly improve performance in the `CartPole-v1` environment. This work not only highlights the potential of neural networks in learning and decision-making tasks but also serves as an insightful reference for those looking to delve into reinforcement learning.

We commend the original contributor for their innovative approach and contributions to the field.


```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create the environment
env = make_vec_env('CartPole-v1', n_envs=1)

# Initialize the agent
model = PPO('MlpPolicy', env, verbose=1)

# Train the agent
model.learn(total_timesteps=10000)

# Evaluate the agent
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = env.step(action)
    env.render()
```


Using cpu device
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 21.2     |
|    ep_rew_mean     | 21.2     |
| time/              |          |
|    fps             | 5318     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 26.1       |
|    ep_rew_mean          | 26.1       |
| time/                   |            |
|    fps                  | 3467       |
|    iterations           | 2          |
|    time_elapsed         | 1          |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.00899041 |
|    clip_fraction        | 0.107      |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.686     |
|    explained_variance   | -0.00108   |
|    learning_rate        | 

```python
initial_state = env.reset()
print(initial_state)
```


(array([ 0.04134518, -0.01705906, -0.01273812, -0.03185254], dtype=float32), {})


# PPO Agent Training on CartPole-v1 Environment

In this project, we employed the Proximal Policy Optimization (PPO) algorithm, a widely used policy-gradient reinforcement learning method, to train an agent on the `CartPole-v1` environment. This classic environment challenges the agent to balance a pole on a moving cart and serves as a benchmark for evaluating reinforcement learning algorithms.

## Environment Setup and Agent Initialization

The environment was initialized using the `make_vec_env` function from the Stable Baselines3 library, allowing for straightforward setup and potential parallelization. For this demonstration, a single environment instance was used:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create the environment
env = make_vec_env('CartPole-v1', n_envs=1)
```

Following this, we initialized the PPO agent with a Multi-Layer Perceptron (MLP) policy and set the verbosity level to 1 to enable detailed logging throughout the training process:

```python
# Initialize the agent
model = PPO('MlpPolicy', env, verbose=1)
```


## Training
The agent was trained over 10,000 timesteps, engaging with the environment, gathering experiences, and iteratively refining its policy based on the rewards and outcomes of its actions.

## Training Results
Throughout the training phase, several key metrics were logged, shedding light on the agent's performance and the learning progression. These metrics included the average episode length, average rewards, and various loss metrics essential for understanding the PPO algorithm's optimization dynamics.

A brief overview of the training progression based on the logs:

- The average episode length and the average reward increased consistently, indicating that the agent learned to keep the pole balanced for longer periods.
- Performance indicators such as frames per second (fps) and total timesteps were monitored, providing insight into the efficiency of the training process.
- Loss metrics (e.g., policy gradient loss, value loss) were monitored to check the stability and effectiveness of the learning algorithm.

For instance, towards the end of training:

- The average episode length improved to about 63.6 steps.
- Because `CartPole-v1` gives a reward of +1 per step, the average reward per episode rose in step with the episode length, reflecting the agent's improved ability to keep the pole balanced.
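
The `clip_range` and `clip_fraction` entries in the training log relate to PPO's clipped surrogate objective. The sketch below shows that computation on plain Python lists so the clipping rule is easy to follow; it is not Stable Baselines3's implementation, and the function name `clipped_surrogate` is hypothetical. Here `ratios` stands for the probability ratios between the new and old policy, and `advantages` for the advantage estimates.

```python
def clipped_surrogate(ratios, advantages, clip_range=0.2):
    """Return (mean clipped objective, fraction of ratios outside the clip range)."""
    terms = []
    n_outside = 0
    for ratio, adv in zip(ratios, advantages):
        # Keep the ratio within [1 - clip_range, 1 + clip_range].
        clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        # Pessimistic choice: take the smaller of the two candidate terms.
        terms.append(min(ratio * adv, clipped_ratio * adv))
        n_outside += abs(ratio - 1.0) > clip_range
    return sum(terms) / len(terms), n_outside / len(terms)
```

PPO maximizes this objective (equivalently, minimizes its negative), so a policy update gains nothing from moving the ratio far from 1. A clip fraction near zero would suggest that updates rarely reach the clipping boundary.
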

## Evaluation and Visualization
Post-training, the agent's performance was evaluated through its interaction with the environment, employing the learned policy. Although direct visualization results are not displayed in this report, they can be achieved by running the provided code within an environment that supports rendering, enabling qualitative assessment of the agent's behavior.
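
For a quantitative check alongside rendering, the mean return per episode can be measured directly. The helper below is a sketch with a hypothetical name, `mean_episode_return`; it assumes a single-environment vectorized env such as the one created with `n_envs=1`, where `step()` returns arrays of rewards and done flags and finished episodes are reset automatically.

```python
def mean_episode_return(model, vec_env, n_episodes=10):
    """Average undiscounted return over n_episodes on a single-env VecEnv."""
    returns = []
    obs = vec_env.reset()
    episode_return = 0.0
    while len(returns) < n_episodes:
        action, _state = model.predict(obs, deterministic=True)
        obs, rewards, dones, _infos = vec_env.step(action)
        episode_return += float(rewards[0])
        if dones[0]:  # the vectorized env resets itself when an episode ends
            returns.append(episode_return)
            episode_return = 0.0
    return sum(returns) / len(returns)
```

Calling `mean_episode_return(model, env)` after training would give one number to compare across runs. Stable Baselines3 also provides `evaluate_policy` in `stable_baselines3.common.evaluation`, which serves the same purpose and additionally reports the standard deviation.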

## Conclusion
The observed improvements in episode length and rewards throughout the training iterations show that PPO with an MLP policy was learning the `CartPole-v1` task. An average episode length of about 63.6 steps after 10,000 timesteps is still well below the environment's 500-step episode cap, so longer training would be needed to master it. This project showcases the capabilities of contemporary reinforcement learning techniques and lays the groundwork for addressing more complex environments and challenges within the domain.