# REINFORCE Algorithm

The **REINFORCE algorithm** is a foundational policy gradient method in reinforcement learning. It is a *Monte Carlo* approach: it optimizes the policy directly, using returns sampled from complete episodes to maximize the expected return.

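This notebook explains REINFORCE step by step, along with a practical implementation in PyTorch using the `CartPole-v1` environment from OpenAI Gym. The code assumes the Gym 0.26+ API, where `env.reset()` returns `(observation, info)` and `env.step()` returns five values.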

## 🎯 1. What is REINFORCE?

The REINFORCE algorithm updates the policy parameters according to how good the sampled actions turned out to be, that is, whether they were followed by higher or lower returns.

Formally, the objective is to maximize the expected return $R$ of an episode generated by following the policy $\pi_\theta$:

$$ J(\theta) = \mathbb{E}_{\pi_\theta}[R] $$

The policy gradient is computed as:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t \right] $$

where $G_t$ is the **discounted return** from time $t$ onward.
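As a small worked example of the return, the sketch below computes $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$ for a short reward list using the backward recursion $G_t = r_t + \gamma G_{t+1}$. The helper name `discounted_returns` and the reward values are illustrative assumptions, and $r_t$ here denotes the reward received after taking action $a_t$.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for one episode's list of rewards."""
    returns = []
    G = 0.0
    # Walk backwards so each step reuses the return of the following step.
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()  # restore chronological order
    return returns

example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

Earlier time steps accumulate more future reward, so $G_0$ is the largest value here.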

## 🧠 2. Algorithm Steps

The REINFORCE algorithm can be summarized as follows:

1. Initialize the policy parameters $\theta$.
2. For each episode:
   - Generate an episode by following the current policy $\pi_\theta$.
   - For each time step $t$ in the episode:
     - Compute the return $G_t$.
     - Update parameters using:

     $$ \theta \leftarrow \theta + \alpha G_t \nabla_\theta \log \pi_\theta(a_t|s_t) $$

3. Repeat until convergence.

👉 The algorithm increases the probability of actions that were followed by higher returns and decreases it for actions followed by lower returns.
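To make the update rule concrete without a neural network, here is a minimal sketch on a hypothetical two-armed bandit, where every episode has a single step and $G_t$ is just the reward. It assumes a softmax policy over one logit per arm, for which $\nabla_\theta \log \pi_\theta(a) = \mathbf{1}_a - \pi_\theta$. The reward values, learning rate, and function names are illustrative choices, not part of the notebook's implementation.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(arm_rewards, steps=2000, alpha=0.1, seed=0):
    """Run REINFORCE on a one-step bandit and return the final action probabilities."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)  # one logit per arm
    for _ in range(steps):
        probs = softmax(theta)
        # Sample an action from the current policy.
        action = rng.choices(range(len(theta)), weights=probs)[0]
        G = arm_rewards[action]  # one-step episode: return == reward
        # theta <- theta + alpha * G * grad log pi(action)
        for i in range(len(theta)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += alpha * G * grad_log_pi
    return softmax(theta)

final_probs = reinforce_bandit([0.0, 1.0])
print(final_probs)  # probability mass shifts toward the rewarding arm (index 1)
```

The same update drives the PyTorch implementation, except that autograd computes $\nabla_\theta \log \pi_\theta$ for the network.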

## ⚙️ 3. Implementation with PyTorch

In [ ]:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Define Policy Network
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(PolicyNetwork, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
    
    def forward(self, x):
        return self.model(x)

# Create environment
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

policy = PolicyNetwork(state_dim, action_dim)
optimizer = optim.Adam(policy.parameters(), lr=0.01)

# Function to choose action
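def select_action(state):
    # Convert the NumPy observation to a float32 tensor
    state = torch.as_tensor(state, dtype=torch.float32)
    probs = policy(state)
    # Sample an action from the categorical distribution over action probabilities
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    # Return the action and its log-probability (needed for the policy gradient)
    return action.item(), dist.log_prob(action)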

In [ ]:
# REINFORCE Training Loop
num_episodes = 1000
gamma = 0.99

for episode in range(num_episodes):
    log_probs = []
    rewards = []
    state = env.reset()[0]
    done = False
    
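    while not done:
        action, log_prob = select_action(state)
        # Gym >= 0.26: step() returns (obs, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        # End the episode on failure or when the 500-step time limit is reached
        done = terminated or truncated
        log_probs.append(log_prob)
        rewards.append(reward)
        state = next_state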

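    # Compute discounted returns G_t = r_t + gamma * G_{t+1}, iterating backwards
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)
    # Standardize returns to reduce gradient variance (acts like a simple baseline)
    returns = (returns - returns.mean()) / (returns.std() + 1e-9)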

    # Compute policy loss: minimizing -sum(log_prob * G_t) performs gradient ascent on J
    policy_loss = -(torch.stack(log_probs) * returns).sum()

    # Update policy
    optimizer.zero_grad()
    policy_loss.backward()
    optimizer.step()

    if episode % 50 == 0:
        print(f"Episode {episode} | Total Reward: {sum(rewards)}")

## 📊 4. Visualizing the Learning Progress
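The cell below runs the trained policy for 500 more episodes, without further updates, and plots the total reward per episode. This shows how well the learned policy performs; to see a learning curve instead, record `sum(rewards)` for each episode inside the training loop and plot that list.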

In [ ]:
import matplotlib.pyplot as plt

episode_rewards = []

for episode in range(500):
    state = env.reset()[0]
    done = False
    total_reward = 0
    
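    while not done:
        with torch.no_grad():  # evaluation only, no gradients needed
            action, _ = select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        # End the episode on failure or when the time limit is reached
        done = terminated or truncated
        total_reward += reward
        state = next_state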
    episode_rewards.append(total_reward)

plt.plot(episode_rewards)
plt.title('Policy Performance over Episodes')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.show()
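Per-episode rewards are noisy because the policy is stochastic, so a trailing moving average can make the trend easier to read. The following is a minimal sketch; the helper name `moving_average` and the window size are hypothetical, and it assumes a plain list of per-episode totals such as `episode_rewards`.

```python
def moving_average(values, window=20):
    """Trailing mean over up to `window` most recent values."""
    smoothed = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

print(moving_average([10, 20, 30, 40], window=2))  # [10.0, 15.0, 25.0, 35.0]
```

The smoothed list has the same length as the input, so it can be overlaid on the raw curve with a second `plt.plot` call.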

## ✅ 5. Advantages and Limitations

**Advantages:**
- Simple and easy to implement.
- Works well with stochastic policies.
- Directly optimizes policy without a value function.

**Limitations:**
- High variance in gradient estimates.
- Requires complete episodes to update (Monte Carlo approach).
- Can be unstable without normalization or baselines.

➡️ The next step to improve REINFORCE is to subtract a **baseline** (e.g., a learned value function) from the returns to reduce variance, which leads to the **Actor-Critic** approach.
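Why a baseline helps can be checked exactly on a toy problem. The sketch below assumes a hypothetical two-armed bandit with a softmax policy and computes, by enumerating both actions, the mean and variance of the single-sample gradient estimate $(\mathbf{1}_{a=0} - \pi_0)(r_a - b)$ for the first logit. The probabilities, rewards, and the function name `grad_stats` are illustrative assumptions.

```python
def grad_stats(probs, rewards, baseline):
    """Exact mean and variance of the one-sample gradient estimate
    for logit 0 of a softmax policy on a one-step bandit."""
    samples = []
    for action, (p, r) in enumerate(zip(probs, rewards)):
        # d/d(theta_0) log pi(action) for a softmax policy
        grad_log_pi = (1.0 if action == 0 else 0.0) - probs[0]
        samples.append((p, grad_log_pi * (r - baseline)))
    mean = sum(p * g for p, g in samples)
    var = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, var

probs, rewards = [0.5, 0.5], [10.0, 11.0]
expected_reward = sum(p * r for p, r in zip(probs, rewards))

print(grad_stats(probs, rewards, baseline=0.0))              # (-0.25, 27.5625)
print(grad_stats(probs, rewards, baseline=expected_reward))  # (-0.25, 0.0)
```

In this example the expected gradient is identical with and without the baseline, while the variance drops from 27.5625 to 0. In larger problems a baseline typically reduces the variance rather than eliminating it.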