# REINFORCE Algorithm (No Baseline) — CartPole-v1

This notebook demonstrates a stochastic policy gradient method — the **REINFORCE algorithm** — implemented from scratch using TensorFlow and applied to the classic `CartPole-v1` environment.

### Key Characteristics:
- No baseline used (pure Monte Carlo returns)
- Stochastic policy output from a neural network
- Trained by weighting each action's log-likelihood gradient by its discounted return
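
For reference, the characteristics above correspond to the usual single-episode Monte Carlo estimate of the policy gradient. This is the textbook form; the code in `reinforce_policy` is not shown in this notebook, so details such as return normalization or averaging over several episodes are assumptions and may differ there.

$$
\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} \, r_k
$$

Here $\pi_\theta$ is the stochastic policy, $a_t$ and $s_t$ are the sampled action and observed state at step $t$, $r_k$ is the reward at step $k$, $\gamma$ is the discount factor, and $T$ is the episode length. With no baseline, $G_t$ is used directly as the weight.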


In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import gym
import matplotlib.pyplot as plt

from reinforce_policy import build_policy_network, evaluate_policy, train_policy_model

np.random.seed(65)
tf.random.set_seed(65)

## Try a Handcrafted Policy (Angle Heuristic)

Let's see how well a basic deterministic policy performs before training anything.


In [None]:
env = gym.make("CartPole-v1")

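def angle_policy(obs):
    """Push the cart toward the side the pole is leaning."""
    angle = obs[2]  # pole angle; negative means the pole leans left
    return 0 if angle < 0 else 1  # 0 = push left, 1 = push right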

rewards = []
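for episode in range(500):
    # Classic Gym API (gym < 0.26): reset() returns only the observation
    obs = env.reset()
    total = 0
    for _ in range(200):  # cap each heuristic episode at 200 steps
        action = angle_policy(obs)
        # Classic Gym API: step() returns (obs, reward, done, info)
        obs, reward, done, _ = env.step(action)
        total += reward
        if done:
            break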
    rewards.append(total)

rewards = np.array(rewards)
print(f"Mean reward: {rewards.mean():.2f}")
print(f"Max reward: {rewards.max():.2f}")
print(f"Min reward: {rewards.min():.2f}")

## Train REINFORCE Policy (Stochastic)

We'll train a shallow neural network using the REINFORCE algorithm with Monte Carlo returns and a log-likelihood loss.
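
As a rough illustration of the two ingredients named above, the sketch below computes discounted Monte Carlo returns and the log-likelihood surrogate loss for one episode. It is written in plain Python so it runs without TensorFlow. The function names `discounted_returns` and `reinforce_loss` are hypothetical and are not taken from `reinforce_policy`, whose implementation may differ.

```python
import math

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_loss(action_probs, returns):
    """Surrogate loss -sum_t G_t * log pi(a_t | s_t).

    action_probs[t] is the probability the policy assigned to the action
    that was actually taken at step t.
    """
    return -sum(g * math.log(p) for p, g in zip(action_probs, returns))

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
```

In a TensorFlow version, the same loss would typically be computed inside a `tf.GradientTape` context so that its gradient with respect to the network weights can be passed to the optimizer.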

In [None]:
# Build the stochastic policy model
model = build_policy_network()

# Optimizer with fixed learning rate
optimizer = keras.optimizers.Adam(learning_rate=0.01)

# Evaluate before training
pretrain_reward = evaluate_policy(env, model)
print(f"Reward before training: {pretrain_reward}")

# Train using REINFORCE algorithm
reward_history = train_policy_model(env_name="CartPole-v1", model=model,
                                    optimizer=optimizer, episodes=500, gamma=0.99)


## Training Progress


In [None]:
plt.plot(reward_history)
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("REINFORCE Training Progress")
plt.grid(True)
plt.show()

## Evaluate After Training

In [None]:
# Evaluate after training
posttrain_reward = evaluate_policy(env, model)
print(f"Reward after training: {posttrain_reward}")

# Average over multiple runs
eval_runs = [evaluate_policy(env, model) for _ in range(20)]
print(f"Mean reward over 20 eval runs: {np.mean(eval_runs):.2f}")


## Summary

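- REINFORCE improved the policy substantially over its pre-training reward (compare the evaluation printouts before and after training).
- Sampling actions from the model’s output distribution, rather than always taking the most likely action, provided the exploration.
- The episode reward trended upward toward the 500-step cap of `CartPole-v1`, with the episode-to-episode noise typical of REINFORCE without a baseline.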
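
To make the stochastic sampling point concrete, here is a minimal sketch of how an action can be drawn from a policy output. It assumes the network emits a single probability of pushing left, which is one common design for CartPole but is not confirmed by this notebook. `sample_action` is a hypothetical helper, not part of `reinforce_policy`.

```python
import random

def sample_action(p_left, rng):
    """Sample action 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random() < p_left else 1

rng = random.Random(65)  # seeded so the run is repeatable

# Even when the policy prefers "left" (p_left = 0.8), it still tries
# "right" some of the time, which is what drives exploration.
actions = [sample_action(0.8, rng) for _ in range(1000)]
left_fraction = actions.count(0) / len(actions)
print(f"Fraction of left moves: {left_fraction:.2f}")
```

A greedy policy would always pick the more likely action and never visit the alternatives, so it would collect no information about whether they lead to higher returns.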

Next steps:
- Add a **baseline** to reduce variance
- Explore **actor-critic methods**
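
As a hint at the first next step, the sketch below subtracts the episode's mean return from each return before it is used as a gradient weight. This is only a simple heuristic baseline, chosen here for brevity. A learned state-value estimate is the more principled choice and leads naturally to actor-critic methods. `center_returns` is a hypothetical name.

```python
from statistics import mean

def center_returns(returns):
    """Subtract the episode's mean return, a simple heuristic baseline."""
    baseline = mean(returns)
    return [g - baseline for g in returns]

# Steps with above-average returns get positive weights and steps with
# below-average returns get negative ones, instead of all being positive.
advantages = center_returns([3.0, 2.0, 1.0])
print(advantages)  # [1.0, 0.0, -1.0]
```

In CartPole every step yields a positive reward, so without a baseline every sampled action is reinforced to some degree. Centering the weights lets the update distinguish better-than-average from worse-than-average steps.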
