# Stepwise Policy Gradient Reinforcement Learning on CartPole-v1

This notebook demonstrates training a reinforcement learning agent using the **Stepwise Policy Gradient (REINFORCE)** algorithm on the classic OpenAI Gym environment **CartPole-v1**.

---

## Objective

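The goal is to keep the pole balanced upright on the cart for as long as possible. The agent receives a reward of +1 for every step the pole stays up, and CartPole-v1 ends an episode after at most 500 steps, so the maximum reward per episode is 500.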

We will:

- Evaluate a **hardcoded baseline policy** that chooses actions based on pole angle.
- Train a **neural network policy** using policy gradients.
- Compare performance before and after training.
- Visualize training progress.


In [None]:
# Imports and setup
import numpy as np
import matplotlib.pyplot as plt
import gym
import tensorflow as tf
from stepwise_policy_gradient import StepwisePolicyGradientAgent, create_policy_network

## 1. Baseline Hardcoded Policy

We start with a simple policy that pushes the cart left (action 0) if the pole is leaning left (angle < 0), and right (action 1) otherwise.

This will serve as a baseline for comparison.


In [None]:
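def basic_policy(obs):
    """Push the cart toward the side the pole is leaning."""
    # obs = [cart position, cart velocity, pole angle, pole angular velocity]
    angle = obs[2]
    return 0 if angle < 0 else 1  # 0 = push left, 1 = push right

# Note: the loop below uses the Gym API from before version 0.26, where
# reset() returns only the observation and step() returns four values.
env = gym.make('CartPole-v1')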

rewards = []
n_episodes = 500
for episode in range(n_episodes):
    obs = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = basic_policy(obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    rewards.append(total_reward)

rewards = np.array(rewards)
print(f"Hardcoded policy average reward over {n_episodes} episodes: {rewards.mean():.2f}")
print(f"Min reward: {rewards.min()}, Max reward: {rewards.max()}")

## 2. Initialize and Evaluate Untrained Neural Network Policy

Now, we create a neural network policy and evaluate its performance **before training**.
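
Because the agent in this section is configured with `binary_crossentropy`, the network presumably emits a single probability for one of the two actions. The sketch below illustrates how an action could be sampled from such an output. It is not the code inside `stepwise_policy_gradient`: the helper names `sigmoid` and `sample_action`, and the convention that the output is the probability of action 0 (push left), are assumptions made for illustration.

```python
import math
import random

def sigmoid(logit):
    """Map a raw network output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def sample_action(left_proba, rng):
    """Return 0 (push left) with probability left_proba, else 1 (push right)."""
    return 0 if rng.random() < left_proba else 1

rng = random.Random(42)

# Assumption: a freshly initialized network often produces logits close to
# zero, so the probability is close to 0.5 and both actions are about
# equally likely.
left_proba = sigmoid(0.0)
actions = [sample_action(left_proba, rng) for _ in range(1000)]
left_fraction = actions.count(0) / len(actions)
print(f"P(left) = {left_proba:.2f}, sampled left fraction = {left_fraction:.2f}")
```

Sampling from the probability, rather than always taking the more likely action, keeps the policy exploring, which is what lets policy gradients discover better actions during training.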


In [None]:
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)

policy_net = create_policy_network(n_inputs=4)

agent = StepwisePolicyGradientAgent(
    env_name='CartPole-v1',
    model=policy_net,
    loss_fn=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    discount_factor=0.95,
    seed=seed
)

mean_reward_before, _ = agent.evaluate_policy(n_eval_episodes=50)
print(f"Untrained policy average reward over 50 episodes: {mean_reward_before:.2f}")


## 3. Train the Neural Network Policy using Policy Gradient

We train for 150 iterations. Each iteration plays 10 episodes and then applies one policy update.

Training progress is printed, and the rewards recorded during training are returned as `eval_rewards` for visualization.
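
The `discount_factor=0.95` passed to the agent controls how much later rewards count toward the return credited to each action. As a rough illustration of that computation, and not necessarily how `StepwisePolicyGradientAgent` implements it, the hypothetical helpers below compute discounted returns for one episode and then normalize them across all episodes of one update. A discount factor of 0.8 is used here only to keep the arithmetic easy to follow.

```python
import statistics

def discount_rewards(rewards, discount_factor):
    """Return the discounted sum of future rewards for every step."""
    discounted = list(rewards)
    # Walk backwards so each step adds the already-discounted future return.
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discount_factor * discounted[step + 1]
    return discounted

def discount_and_normalize(all_rewards, discount_factor):
    """Discount each episode, then standardize over all steps of all episodes."""
    all_discounted = [discount_rewards(r, discount_factor) for r in all_rewards]
    flat = [value for episode in all_discounted for value in episode]
    mean = statistics.fmean(flat)
    # Fall back to 1.0 if every return is identical, to avoid dividing by zero.
    std = statistics.pstdev(flat) or 1.0
    return [[(value - mean) / std for value in episode]
            for episode in all_discounted]

# Step 0: 10 + 0.8 * (0 + 0.8 * -50) = -22; step 1: -40; step 2: -50
returns = discount_rewards([10, 0, -50], discount_factor=0.8)
normalized = discount_and_normalize([[10, 0, -50], [10, 20]], discount_factor=0.8)
print(returns)
print(normalized)
```

After normalization, actions with above-average returns get positive weights and the rest get negative weights, so a gradient step makes the former more likely and the latter less likely.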


In [None]:
n_iterations = 150
n_episodes_per_update = 10
# Episode step limit registered for the environment (500 for CartPole-v1)
n_max_steps = gym.make('CartPole-v1').spec.max_episode_steps

eval_rewards = agent.train(
    n_iterations=n_iterations,
    n_episodes_per_update=n_episodes_per_update,
    n_max_steps=n_max_steps
)


## 4. Evaluate the Trained Policy

We evaluate the trained policy over 50 episodes and compare it to the baseline.


In [None]:
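mean_reward_after, rewards_after = agent.evaluate_policy(n_eval_episodes=50)
print(f"Trained policy average reward over 50 episodes: {mean_reward_after:.2f}")
# Side-by-side comparison with the results from Sections 1 and 2
print(f"Hardcoded baseline: {rewards.mean():.2f} | Untrained: {mean_reward_before:.2f} | Trained: {mean_reward_after:.2f}")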

## 5. Visualize Training Progress

Plot average reward during training to see how the policy improves over time.
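
The plotting code for this section pairs each recorded reward with a training iteration, which only works if both sequences have the same length. The sketch below is a defensive variant under the assumption that rewards are recorded every `eval_every` iterations plus once after the final iteration. The helper `evaluation_iterations` and its parameters are hypothetical, and the actual recording schedule of `agent.train` may differ.

```python
def evaluation_iterations(n_iterations, eval_every, n_points):
    """Build the x-axis for the reward curve and check it matches the data."""
    # Assumed schedule: iterations 0, eval_every, 2 * eval_every, ...
    # plus the final iteration.
    iterations = list(range(0, n_iterations, eval_every)) + [n_iterations - 1]
    if len(iterations) != n_points:
        raise ValueError(
            f"expected {len(iterations)} recorded rewards, got {n_points}"
        )
    return iterations

# With 150 iterations and a reward recorded every 10, there are 16 points.
iterations = evaluation_iterations(n_iterations=150, eval_every=10, n_points=16)
print(iterations[0], iterations[-1], len(iterations))
```

In the notebook, `n_points` would be `len(eval_rewards)`, so a mismatch raises a descriptive error before the plot is drawn.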


In [None]:
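# Assumes agent.train() records one reward every 10 iterations plus one for
# the final iteration, so that len(iterations) == len(eval_rewards).
iterations = list(range(0, n_iterations, 10)) + [n_iterations - 1]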
plt.figure(figsize=(10, 6))
plt.plot(iterations, eval_rewards, label='Policy Training Reward')
plt.xlabel('Training Iteration')
plt.ylabel('Average Reward over 10 Episodes')
plt.title('Policy Performance Over Training Iterations')
plt.grid(True)
plt.legend()
plt.show()

## 6. Summary

- The **hardcoded policy** provides a fixed baseline score to compare against.
- The **untrained neural network policy** is expected to perform close to random action selection, since its weights are randomly initialized.
- Training with the **stepwise policy gradient** should raise the average reward well above the untrained score.
- The training curve shows how closely the learned policy approaches the maximum reward of 500.

---

This concludes the demonstration of the Stepwise Policy Gradient method for reinforcement learning on CartPole-v1.
