<a href="https://colab.research.google.com/github/RAVITEJA-VADLURI/Reinforcement_Learning/blob/main/2303A51942_ASGN(8).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Policy Gradient Methods & the REINFORCE Algorithm

Below is a compact, assignment-ready explanation **plus** a ready-to-run PyTorch implementation of the **REINFORCE (Monte Carlo policy gradient)** algorithm using the CartPole-v1 environment. Use the explanation for theory, and the code to demonstrate the algorithm experimentally.

---

# 1. High-level introduction

Policy gradient (PG) methods directly parameterize a policy $\pi_\theta(a|s)$ and optimize it to maximize the expected return $J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]$, where $R(\tau)$ is the return of trajectory $\tau$.
Unlike value-based methods (Q-learning, DQN), which derive a policy from learned action values, PG methods optimize the policy itself, so they handle stochastic policies and continuous action spaces naturally. They work by computing (or estimating) the gradient $\nabla_\theta J(\theta)$ and performing gradient ascent on $\theta$.

---

# 2. Core concepts (concise)

* **Policy** $\pi_\theta(a|s)$: probability of taking action $a$ in state $s$, parameterized by $\theta$ (e.g., neural network weights).
* **Trajectory** $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \dots)$.
* **Return** $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$.
* **Performance objective** $J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[G_0]$.
* **Policy gradient theorem** (intuitively):
  $$
  \nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\,G_t \Big],
  $$
  so we can use sampled trajectories and Monte Carlo estimates to compute the gradient without differentiating environment dynamics.
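
As a sanity check on the score-function form of the gradient, the sketch below uses a one-step problem (a three-armed bandit) where the expectation can be computed exactly by enumerating actions. The softmax parameterization, the reward vector, and all function names are illustrative assumptions, not part of the CartPole code in this notebook. The score-function gradient should agree with a finite-difference gradient of $J(\theta)$.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())  # subtract the max for numerical stability
    return z / z.sum()

def expected_return(theta, rewards):
    # J(theta) = sum_a pi_theta(a) * r(a) for a one-step episode
    return float(softmax(theta) @ rewards)

def score_function_gradient(theta, rewards):
    # E_a[ grad_theta log pi_theta(a) * r(a) ], with the expectation taken exactly
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for a, r in enumerate(rewards):
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0  # grad of log softmax(theta)[a] is onehot(a) - pi
        grad += pi[a] * grad_log_pi * r
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    # Central differences on J(theta), used only as an independent reference
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (expected_return(theta + step, rewards)
                   - expected_return(theta - step, rewards)) / (2 * eps)
    return grad

theta = np.array([0.2, -0.5, 1.0])   # hypothetical policy logits
rewards = np.array([1.0, 0.0, 2.0])  # hypothetical reward per action
pg = score_function_gradient(theta, rewards)
fd = finite_difference_gradient(theta, rewards)
print("score-function gradient:  ", pg)
print("finite-difference gradient:", fd)
```

Note that the score-function version never differentiates the reward itself, which is why the same idea carries over to environments with unknown dynamics.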

---

# 3. REINFORCE algorithm (conceptual steps)

1. Initialize policy parameters $\theta$.
2. Repeat for many episodes:

   * Generate a full episode trajectory $\tau$ by following $\pi_\theta$.
   * For each timestep $t$ in the episode, compute the discounted return $G_t$.
   * For each timestep $t$, compute the gradient estimate $\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$.
   * Sum (or average) these gradients over the episode and update:
     $$
     \theta \leftarrow \theta + \alpha \sum_{t}\nabla_\theta \log \pi_\theta(a_t|s_t)\,G_t
     $$
     where $\alpha$ is the learning rate.
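
To make the steps concrete, here is a small worked sketch of one update on a hand-written three-step episode. It assumes a deliberately simplified, state-independent softmax policy over two actions so that the gradient of the log-probability can be written by hand; the episode data, `discounted_returns`, and `reinforce_update` are hypothetical names chosen for illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}], where rewards[t] holds r_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = running
    return returns

def reinforce_update(theta, actions, returns, alpha):
    """One REINFORCE step for a state-independent policy pi = softmax(theta)."""
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    grad = np.zeros_like(theta)
    for a, G in zip(actions, returns):
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0          # grad of log pi(a) is onehot(a) - pi
        grad += grad_log_pi * G        # sum_t grad log pi(a_t) * G_t
    return theta + alpha * grad        # gradient ascent

rewards = [1.0, 1.0, 1.0]   # hypothetical episode rewards
actions = [0, 1, 0]         # hypothetical sampled actions
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]

theta = np.zeros(2)
new_theta = reinforce_update(theta, actions, returns, alpha=0.1)
print(new_theta)
```

Action 0 was taken at the steps with the larger total weight, so its logit increases and the logit of action 1 decreases by the same amount.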

---

# 4. Variance reduction (practical note)

Direct REINFORCE has **high variance**. Common improvements:

* **Baseline** $b(s)$ subtraction: use the advantage $A_t = G_t - b(s_t)$. Any baseline that depends only on the state leaves the gradient unbiased; a learned state value $V_\phi(s)$ is the usual choice and reduces variance.
* Use **reward-to-go**: weight each action by $G_t$, the return from step $t$ onward, instead of the full-episode return. This is already standard.
* Use an **entropy bonus** to encourage exploration.
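
The effect of a baseline can be checked exactly on a toy problem. The sketch below assumes a one-step, three-action problem with a uniform softmax policy and rewards that share a large common offset (all values are made up for illustration). It computes the mean and the total variance of the single-sample gradient estimate $\nabla_\theta \log \pi_\theta(a)\,(r(a) - b)$ by enumerating actions, once with $b = 0$ and once with $b$ equal to the expected reward.

```python
import numpy as np

def estimator_stats(pi, rewards, baseline):
    """Exact mean and total variance of grad log pi(a) * (r(a) - baseline)."""
    n = len(pi)
    samples = []
    for a in range(n):
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0  # softmax score: onehot(a) - pi
        samples.append(grad_log_pi * (rewards[a] - baseline))
    samples = np.array(samples)                     # one row per action
    mean = (pi[:, None] * samples).sum(axis=0)      # expectation under pi
    variance = float((pi * ((samples - mean) ** 2).sum(axis=1)).sum())
    return mean, variance

pi = np.ones(3) / 3.0                     # uniform policy (assumed)
rewards = np.array([10.0, 11.0, 12.0])    # hypothetical rewards with a large offset

mean_plain, var_plain = estimator_stats(pi, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(pi, rewards, baseline=float(pi @ rewards))

print("mean without baseline:", mean_plain, "variance:", var_plain)
print("mean with baseline:   ", mean_base, "variance:", var_base)
```

Both estimators have the same mean, so the baseline introduces no bias, while the variance drops sharply because the common offset in the rewards no longer scales every sample.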

---

# 5. Pseudocode (compact)

```
Initialize policy parameters θ
for episode = 1..N:
    generate trajectory τ by following πθ
    compute G_t for each time step t in τ
    for each timestep t:
        θ ← θ + α * ∇θ log πθ(a_t|s_t) * (G_t - baseline(s_t))
```

---

# 6. REINFORCE — PyTorch implementation (CartPole-v1)

This is a minimal implementation using a neural policy that outputs action probabilities. It uses reward-to-go returns, normalized per episode for stability. A learned baseline (value network) is *not* included; the `USE_BASELINE` flag is only a placeholder for adding one.

In [4]:
# reinforce_cartpole.py
import time

import numpy as np

# Compatibility shim: some gym releases still reference np.bool8,
# which recent NumPy versions no longer define.
np.bool8 = np.bool_

import gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

# ---------------------
# Hyperparameters
# ---------------------
ENV_ID = "CartPole-v1"
SEED = 1
GAMMA = 0.99
LR = 1e-3
HIDDEN_SIZE = 128
MAX_EPISODES = 1000
PRINT_INTERVAL = 10
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
USE_BASELINE = False  # If True, implement a value network baseline (not implemented below)

# ---------------------
# Utilities for gym compatibility
# ---------------------
def safe_reset(env):
    res = env.reset()
    return res[0] if isinstance(res, tuple) else res

def safe_step(env, action):
    res = env.step(action)
    if len(res) == 5:  # gym >=0.26 (obs, reward, terminated, truncated, info)
        obs, reward, terminated, truncated, info = res
        done = terminated or truncated
        return obs, reward, done, info
    obs, reward, done, info = res
    return obs, reward, done, info

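def set_seed(env, seed=SEED):
    np.random.seed(seed)
    torch.manual_seed(seed)
    try:
        env.reset(seed=seed)  # newer gym / gymnasium seed the env through reset()
    except TypeError:
        env.seed(seed)        # older gym exposes env.seed() instead
    try:
        env.action_space.seed(seed)
    except AttributeError:
        pass  # very old action spaces have no seed() method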

# ---------------------
# Policy Network
# ---------------------
class PolicyNet(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden=HIDDEN_SIZE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim)
        )

    def forward(self, x):
        logits = self.net(x)
        return torch.softmax(logits, dim=-1)

# ---------------------
# Helper: compute discounted rewards-to-go
# ---------------------
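def compute_returns(rewards, gamma=GAMMA):
    # reward-to-go: G_t = r_{t+1} + gamma * r_{t+2} + ... (rewards[t] holds r_{t+1})
    returns = []
    R = 0.0
    # Accumulate backwards so each G_t reuses G_{t+1}, then restore time order.
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns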

# ---------------------
# Training loop
# ---------------------
def train():
    env = gym.make(ENV_ID)
    set_seed(env, SEED)
    obs_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    policy = PolicyNet(obs_dim, action_dim).to(DEVICE)
    optimizer = torch.optim.Adam(policy.parameters(), lr=LR)

    running_rewards = []

    for episode in range(1, MAX_EPISODES + 1):
        obs = safe_reset(env)
        log_probs = []
        rewards = []
        done = False

        # generate one episode (full trajectory)
        while not done:
            obs_tensor = torch.tensor(obs, dtype=torch.float32, device=DEVICE).unsqueeze(0)
            probs = policy(obs_tensor).squeeze(0)
            dist = Categorical(probs)
            action = dist.sample()
            log_prob = dist.log_prob(action)
            next_obs, reward, done, _ = safe_step(env, action.item())

            log_probs.append(log_prob)
            rewards.append(reward)

            obs = next_obs

        # compute returns
        returns = compute_returns(rewards)
        returns = torch.tensor(returns, dtype=torch.float32, device=DEVICE)
        # optional: normalize returns for stability
        returns = (returns - returns.mean()) / (returns.std(unbiased=False) + 1e-8)

        # policy gradient step (REINFORCE)
        # Negate because the optimizer minimizes, while REINFORCE ascends the return.
        policy_loss = -(torch.stack(log_probs) * returns).sum()

        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

        episode_return = sum(rewards)
        running_rewards.append(episode_return)

        if episode % PRINT_INTERVAL == 0:
            avg_return = np.mean(running_rewards[-PRINT_INTERVAL:])
            print(f"Episode {episode}\tAvgReturn(last {PRINT_INTERVAL}) = {avg_return:.2f}")

    env.close()
    print("Training finished.")

if __name__ == "__main__":
    start = time.time()
    train()
    print("Elapsed:", time.time() - start)



Episode 10	AvgReturn(last 10) = 22.60
Episode 20	AvgReturn(last 10) = 21.10
Episode 30	AvgReturn(last 10) = 32.90
Episode 40	AvgReturn(last 10) = 30.30
Episode 50	AvgReturn(last 10) = 33.00
Episode 60	AvgReturn(last 10) = 44.50
Episode 70	AvgReturn(last 10) = 41.50
Episode 80	AvgReturn(last 10) = 52.10
Episode 90	AvgReturn(last 10) = 52.30
Episode 100	AvgReturn(last 10) = 68.90
Episode 110	AvgReturn(last 10) = 168.70
Episode 120	AvgReturn(last 10) = 168.70
Episode 130	AvgReturn(last 10) = 205.10
Episode 140	AvgReturn(last 10) = 170.40
Episode 150	AvgReturn(last 10) = 220.40
Episode 160	AvgReturn(last 10) = 228.00
Episode 170	AvgReturn(last 10) = 210.50
Episode 180	AvgReturn(last 10) = 207.00
Episode 190	AvgReturn(last 10) = 175.50
Episode 200	AvgReturn(last 10) = 167.20
Episode 210	AvgReturn(last 10) = 203.90
Episode 220	AvgReturn(last 10) = 216.10
Episode 230	AvgReturn(last 10) = 354.60
Episode 240	AvgReturn(last 10) = 261.60
Episode 250	AvgReturn(last 10) = 362.20
Episode 260	AvgRetu

---

# 7. How to run, dependencies & notes

* Install dependencies: `pip install torch gym numpy`
  (If you prefer `gymnasium`, change the import to `import gymnasium as gym`; `safe_reset` and `safe_step` already handle the `(obs, info)` and five-value return formats.)
* Run: `python reinforce_cartpole.py`
* Expected behavior: REINFORCE is simple but high-variance. In the run logged above, the 10-episode average return climbs from about 22 to over 350 within 250 episodes, but not monotonically (for example, it falls from 228.00 at episode 160 to 167.20 at episode 200). Convergence is typically slower and noisier than with actor-critic or PPO. Use more episodes, a baseline (value network), or advantage normalization for better performance.

---

# 8. Extensions / improvements (quick list)

* **Add baseline**: train a value network $V_\phi(s)$ and use the advantage $A_t = G_t - V_\phi(s_t)$. This reduces variance.
* **Actor-Critic**: update policy using estimated advantage from a critic trained by bootstrapping (TD).
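* **Entropy bonus**: add $-\beta\,H(\pi_\theta(\cdot|s))$ to the loss, where $H(\pi) = -\mathbb{E}_{a\sim\pi}[\log \pi(a|s)]$ is the policy entropy and $\beta > 0$, so that minimizing the loss rewards high entropy and keeps exploration.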
* **Mini-batch & vectorized envs**: collect multiple episodes in parallel (faster and lower variance).
* **Use reward normalization or advantage normalization** for more stable training.
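
The first and third items can be combined into a single loss. The following is a minimal sketch, not the implementation used in this notebook: the function name `reinforce_loss_with_baseline`, the coefficients `entropy_coef` and `value_coef`, and the random inputs are all assumptions for illustration. It expects the policy to output logits rather than probabilities, which differs from `PolicyNet` above.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def reinforce_loss_with_baseline(logits, actions, returns, values,
                                 entropy_coef=0.01, value_coef=0.5):
    """
    logits:  (T, num_actions) policy outputs for the visited states
    actions: (T,) actions that were sampled
    returns: (T,) reward-to-go G_t
    values:  (T,) baseline predictions V_phi(s_t)
    """
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    # Detach the baseline so the policy term does not update the value network.
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).sum()
    # Regress the baseline toward the observed returns.
    value_loss = F.mse_loss(values, returns)
    # Subtracting the entropy from the loss rewards more exploratory policies.
    entropy = dist.entropy().sum()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy

# Hypothetical inputs standing in for one five-step episode.
torch.manual_seed(0)
T, num_actions = 5, 2
logits = torch.randn(T, num_actions, requires_grad=True)
values = torch.zeros(T, requires_grad=True)
actions = torch.randint(num_actions, (T,))
returns = torch.tensor([5.0, 4.0, 3.0, 2.0, 1.0])

loss = reinforce_loss_with_baseline(logits, actions, returns, values)
loss.backward()
print(loss.item())
```

In a full training loop, `logits` and `values` would come from the policy and value networks evaluated on the episode's states, and a single optimizer step on this loss would update both.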

---

# 9. Short assignment-ready summary (copy into your report)

Policy gradient methods directly parameterize a stochastic policy and optimize its parameters by gradient ascent on expected return. REINFORCE is a straightforward Monte Carlo policy gradient method that estimates the policy gradient using complete sampled episodes. It computes gradients of log-probabilities weighted by the returns $G_t$. REINFORCE is unbiased but high-variance; common practical improvements include subtracting a baseline (e.g., value function), advantage normalization, and using actor-critic architectures. A simple PyTorch implementation applied to CartPole demonstrates core ideas: parameterize a policy network, sample episodes, compute returns, and update parameters with $\nabla_\theta \log \pi_\theta(a|s)\,G_t$.

---