# PyTorch Tutorial: Reinforcement Learning (RL)

Reinforcement Learning (RL) is a paradigm in which an **agent** learns to make decisions by interacting with an **environment** to maximize its cumulative **reward**. Unlike supervised learning, where every example comes with a label, the only feedback in RL is a scalar reward, which is often delayed and sparse.

## Learning Objectives
- Understand the Vocabulary of RL (Agent, Environment, State, Action, Reward)
- Implement a Q-Learning Agent (Tabular)
- Implement a Policy Gradient Agent (REINFORCE) using PyTorch
- Solve the `CartPole-v1` environment from Gymnasium

## 1. Vocabulary First

Before coding, let's define the key terms:

- **Agent**: The learner (the model).
- **Environment**: The world the agent interacts with (e.g., a game, a robot simulation).
- **State ($s$)**: The current situation of the agent (e.g., position, velocity).
- **Action ($a$)**: What the agent does (e.g., move left, jump).
- **Reward ($r$)**: Feedback from the environment (e.g., +1 for surviving, -10 for crashing).
- **Policy ($\pi$)**: The strategy the agent uses to decide an action given a state ($s \to a$).
- **Episode**: One full run of the game/task from start to finish.
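
To make the vocabulary concrete, here is a minimal sketch of the agent-environment loop. `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented for this illustration; the `reset`/`step` return values only mimic the Gymnasium convention used later in this tutorial.

```python
import random

class CorridorEnv:
    """Hypothetical environment: a 1-D corridor with the goal in the last cell."""
    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos, {}                      # (state, info)

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        terminated = self.pos == self.length - 1
        reward = 1.0 if terminated else 0.0      # sparse reward at the goal
        return self.pos, reward, terminated, False, {}

def random_policy(state):
    """A policy maps a state to an action; this one ignores the state."""
    return random.choice([0, 1])

def run_episode(env, policy, max_steps=200):
    """One episode: act until the environment terminates or max_steps is hit."""
    state, _ = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            break
    return total_reward, steps

random.seed(0)
total_reward, steps = run_episode(CorridorEnv(), random_policy)
print(f"Random policy: {steps} steps, total reward {total_reward}")
```

A policy that always moves right reaches the goal in four steps, which is the behaviour a learning agent should eventually discover.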

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym # Standard RL library (formerly OpenAI Gym)
import numpy as np
import matplotlib.pyplot as plt

print("Ready for RL!")

## 2. The Environment: CartPole

We will use `CartPole-v1`. The goal is to balance a pole on a cart.
- **State**: [Cart Position, Cart Velocity, Pole Angle, Pole Angular Velocity]
- **Actions**: 0 (Push Left), 1 (Push Right)
- **Reward**: +1 for every step the pole stays upright.

In [None]:
env = gym.make('CartPole-v1')
state, info = env.reset()
print(f"Initial State: {state}")
print(f"Action Space: {env.action_space}") # Discrete(2) -> 0 or 1
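
As a rough sketch of how to read the state vector, the helper below names the four components and checks the termination limits. The thresholds (cart position beyond ±2.4, pole angle beyond ±12 degrees) are quoted from the Gymnasium CartPole documentation as an assumption and are worth checking against your installed version; `describe_state` and `would_terminate` are hypothetical helper names.

```python
import math

X_THRESHOLD = 2.4                             # assumed cart position limit
THETA_THRESHOLD = 12 * 2 * math.pi / 360      # assumed pole angle limit: 12 degrees in radians

def describe_state(state):
    """Attach names to the four CartPole state components."""
    cart_pos, cart_vel, pole_angle, pole_ang_vel = state
    return {
        "cart_position": cart_pos,
        "cart_velocity": cart_vel,
        "pole_angle_rad": pole_angle,
        "pole_angular_velocity": pole_ang_vel,
    }

def would_terminate(state):
    """True if the cart left the track or the pole tipped past the limit."""
    cart_pos, _, pole_angle, _ = state
    return abs(cart_pos) > X_THRESHOLD or abs(pole_angle) > THETA_THRESHOLD

example_state = [0.02, -0.01, 0.03, 0.04]     # made-up values near the upright position
print(describe_state(example_state))
print("Terminal?", would_terminate(example_state))
```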

## 3. Policy Gradient (REINFORCE)

In Deep RL, we use a Neural Network to approximate the policy $\pi(a|s)$.

### The Network
Input: State (4 values) -> Hidden Layer -> Output: Probability of each Action (2 values).

In [None]:
class PolicyNetwork(nn.Module):
    def __init__(self, s_size, a_size, h_size):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=-1)  # normalize over the action dimension

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.softmax(x)

# Initialize
policy = PolicyNetwork(s_size=4, a_size=2, h_size=16)
optimizer = optim.Adam(policy.parameters(), lr=0.01)

### The Training Loop (REINFORCE Algorithm)

1. **Collect Trajectory**: Play an entire episode using the current policy.
2. **Calculate Returns**: Compute the discounted return from each step to the end of the episode.
3. **Update Policy**: Increase probability of actions that led to high rewards.

$$ \text{Loss} = -\sum_t \log \pi(a_t \mid s_t) \, R_t $$

where $R_t$ is the discounted return from step $t$ onward.
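
Steps 2 and 3 can be worked through by hand on a tiny episode. The sketch below uses three made-up rewards and assumes, purely for illustration, that the policy assigned probability 0.5 to each chosen action; `discounted_returns` is a hypothetical helper name.

```python
import math

def discounted_returns(rewards, gamma):
    """Return R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):          # walk backwards so each step reuses the next one
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

rewards = [1.0, 1.0, 1.0]                # made-up 3-step episode
returns = discounted_returns(rewards, gamma=0.99)
print([round(g, 4) for g in returns])    # [2.9701, 1.99, 1.0]

# Assumed log-probabilities of the actions that were taken (probability 0.5 each).
log_probs = [math.log(0.5)] * 3
loss = -sum(lp * g for lp, g in zip(log_probs, returns))
print(f"REINFORCE loss: {loss:.4f}")
```

Early steps receive the largest returns because more reward still lies ahead of them, so their log-probabilities carry the most weight in the loss.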

In [None]:
def reinforce(env, policy, optimizer, n_episodes=500):
    gamma = 0.99 # Discount factor
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        log_probs = []
        rewards = []
        
        # 1. Collect Trajectory
        done = False
        while not done:
            state_t = torch.from_numpy(state).float().unsqueeze(0)
            probs = policy(state_t)
            
            # Sample action from probability distribution
            m = torch.distributions.Categorical(probs)
            action = m.sample()
            
            log_probs.append(m.log_prob(action))
            state, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
            
        # 2. Calculate Returns (Discounted Reward)
        returns = []
        R = 0
        for r in reversed(rewards):
            R = r + gamma * R
            returns.insert(0, R)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-9) # Normalize
        
        # 3. Update Policy
        policy_loss = []
        for log_prob, R in zip(log_probs, returns):
            policy_loss.append(-log_prob * R)
        
        optimizer.zero_grad()
        policy_loss = torch.cat(policy_loss).sum()
        policy_loss.backward()
        optimizer.step()
        
        if episode % 50 == 0:
            print(f"Episode {episode}, Total Reward: {sum(rewards)}")

# Run training
# reinforce(env, policy, optimizer)
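
To see why minimizing `-log_prob * R` increases the probability of well-rewarded actions, here is a small self-contained check on a two-action policy. The logits, the chosen action, the return of 1.0, and the step size of 0.1 are all made-up values for illustration.

```python
import torch

logits = torch.zeros(2, requires_grad=True)   # uniform policy over two actions
action = 1                                    # the action that was sampled
ret = 1.0                                     # assumed positive return for that action

log_prob = torch.log_softmax(logits, dim=0)[action]
loss = -log_prob * ret
loss.backward()

print(logits.grad)   # tensor([ 0.5000, -0.5000])

# One manual gradient-descent step on the logits.
with torch.no_grad():
    new_logits = logits - 0.1 * logits.grad
    new_probs = torch.softmax(new_logits, dim=0)

print(f"P(action=1) before: 0.5000, after: {new_probs[action].item():.4f}")
```

The gradient is negative for the chosen action's logit, so the descent step raises that logit. A negative return would flip the sign and push the probability down instead.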

## 4. Value Functions and Advantage

Before Actor-Critic, we need to understand **value functions**.

In [None]:
# Value Functions
print("""
VALUE FUNCTIONS IN RL
=====================

1. State Value Function V(s):
   "Expected discounted return starting from state s and following the policy"
   V(s) = E[R_t + γR_{t+1} + γ²R_{t+2} + ... | s_t = s]

2. Action Value Function Q(s, a):
   "Expected discounted return after taking action a in state s, then following the policy"
   Q(s, a) = E[R_t + γR_{t+1} + ... | s_t = s, a_t = a]

3. Advantage Function A(s, a):
   "How much better is action a compared to average?"
   A(s, a) = Q(s, a) - V(s)
   
   - A > 0: Better than average action
   - A < 0: Worse than average action
   - A = 0: Average action

Why Advantage is useful:
- Reduces variance in policy gradient
- Makes credit assignment clearer
- The "advantage" tells us exactly how good an action was
""")
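
For small, discrete problems, Q(s, a) can be stored in a table and learned directly with tabular Q-learning. The sketch below is one possible illustration: it uses a hypothetical 5-cell corridor (not CartPole, whose state is continuous) and a uniformly random behaviour policy, which works because Q-learning is off-policy. It then derives V(s) and A(s, a) for the greedy policy from the learned table.

```python
import random

N_STATES, GOAL = 5, 4        # hypothetical corridor: cells 0..4, goal in cell 4
ACTIONS = [0, 1]             # 0 = left, 1 = right

def step(state, action):
    """Deterministic transition; reward 1.0 only when the goal is reached."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def q_learning(n_episodes=500, alpha=0.5, gamma=0.9, max_steps=100, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(n_episodes):
        state = 0
        for _ in range(max_steps):
            action = rng.choice(ACTIONS)          # random behaviour policy
            next_state, reward, done = step(state, action)
            # Q-learning target: bootstrap from the best next action.
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q

Q = q_learning()
V = [max(q) for q in Q]                                   # value of the greedy policy
A = [[q_sa - v for q_sa in q] for q, v in zip(Q, V)]      # advantage = Q - V

for s in range(GOAL):
    print(f"s={s}  Q(left)={Q[s][0]:.3f}  Q(right)={Q[s][1]:.3f}  "
          f"V={V[s]:.3f}  A(left)={A[s][0]:.3f}")
```

With γ = 0.9 the learned Q(s, right) approaches 0.9 raised to the number of remaining steps minus one, and moving left always has a negative advantage.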

## 5. Actor-Critic Architecture

**Actor-Critic** combines policy gradient (Actor) with value estimation (Critic).

- **Actor**: Policy network π(a|s) - decides actions
- **Critic**: Value network V(s) - evaluates states

In [None]:
class ActorCritic(nn.Module):
    """
    Actor-Critic Network with shared backbone.
    
    Architecture:
    State -> Shared Layers -> Actor Head (actions)
                           -> Critic Head (value)
    """
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        
        # Shared layers (feature extraction)
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Actor head: outputs action probabilities
        self.actor = nn.Sequential(
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
        
        # Critic head: outputs state value
        self.critic = nn.Linear(hidden_dim, 1)
    
    def forward(self, state):
        features = self.shared(state)
        action_probs = self.actor(features)
        state_value = self.critic(features)
        return action_probs, state_value
    
    def act(self, state):
        action_probs, value = self.forward(state)
        dist = torch.distributions.Categorical(action_probs)
        action = dist.sample()
        return action, dist.log_prob(action), value

# Create network
ac_network = ActorCritic(state_dim=4, action_dim=2)

# Forward pass
state = torch.randn(1, 4)
action_probs, value = ac_network(state)
print(f"Action probabilities: {action_probs.detach().numpy()}")
print(f"State value: {value.item():.4f}")

In [None]:
def train_actor_critic(env, model, optimizer, n_episodes=500, gamma=0.99):
    """
    Advantage Actor-Critic (A2C) Training Loop.
    
    Loss = Policy Loss + Value Loss
    
    Policy Loss: -log(π(a|s)) * Advantage
    Value Loss:  (V(s) - Return)²
    """
    for episode in range(n_episodes):
        state, _ = env.reset()
        log_probs = []
        values = []
        rewards = []
        
        done = False
        while not done:
            state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
            action, log_prob, value = model.act(state_t)
            
            log_probs.append(log_prob)
            values.append(value)
            
            state, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        
        # Calculate returns and advantages
        returns = []
        R = 0
        for r in reversed(rewards):
            R = r + gamma * R
            returns.insert(0, R)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)
        
        values = torch.cat(values)
        log_probs = torch.cat(log_probs)
        
        # Advantage = Return - Value (how much better than expected)
        advantages = returns - values.squeeze(-1)

        # Policy loss (maximize reward for good actions)
        policy_loss = -(log_probs * advantages.detach()).mean()

        # Value loss (accurate value prediction)
        value_loss = nn.functional.mse_loss(values.squeeze(-1), returns)

        # Combined loss (an entropy bonus is often added for exploration)
        loss = policy_loss + 0.5 * value_loss
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if episode % 50 == 0:
            print(f"Episode {episode}, Reward: {sum(rewards):.0f}, Policy Loss: {policy_loss.item():.4f}")
    
    return model

# Training would be done like:
# ac_model = ActorCritic(4, 2)
# optimizer = optim.Adam(ac_model.parameters(), lr=0.001)
# train_actor_critic(env, ac_model, optimizer)

print("Actor-Critic training loop defined!")
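
The loop above waits for the episode to end and uses Monte Carlo returns as the critic target. A common alternative is to bootstrap from the critic after every step and use the one-step TD error as the advantage. The sketch below shows only that arithmetic on made-up numbers; `td_advantage` is a hypothetical helper, not part of the training loop above.

```python
def td_advantage(reward, value, next_value, terminated, gamma=0.99):
    """One-step TD target and advantage.

    target    = r + gamma * V(s')   (no bootstrap if the episode terminated)
    advantage = target - V(s)
    """
    # Only a true termination zeroes the bootstrap term. A time-limit
    # truncation should still bootstrap from V(s').
    bootstrap = 0.0 if terminated else gamma * next_value
    td_target = reward + bootstrap
    return td_target - value, td_target

# Made-up critic estimates for a non-terminal and a terminal transition.
adv, target = td_advantage(reward=1.0, value=10.0, next_value=10.5, terminated=False)
print(f"non-terminal: target={target:.3f}, advantage={adv:.3f}")

adv_end, target_end = td_advantage(reward=1.0, value=10.0, next_value=10.5, terminated=True)
print(f"terminal:     target={target_end:.3f}, advantage={adv_end:.3f}")
```

Bootstrapping lowers variance because it replaces the rest of the episode with a single estimate, at the cost of bias whenever the critic is wrong.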

## 6. Proximal Policy Optimization (PPO)

PPO is one of the most widely used deep RL algorithms. Among other applications, it has been used for the RL stage of RLHF in models such as ChatGPT.

Key insight: Limit how much the policy can change in one update to ensure stable training.

In [None]:
def ppo_loss(old_log_probs, new_log_probs, advantages, epsilon=0.2):
    """
    PPO Clipped Objective.
    
    Instead of directly using policy gradient, PPO:
    1. Computes ratio: r(θ) = π_new(a|s) / π_old(a|s)
    2. Clips the ratio to [1-ε, 1+ε]
    3. Takes minimum of clipped and unclipped objective
    
    This prevents the policy from changing too much in one update.
    """
    # Probability ratio; the old log-probs are constants from the data-collecting policy
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    
    # Unclipped objective
    obj_unclipped = ratio * advantages
    
    # Clipped objective
    obj_clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    
    # PPO uses the minimum (pessimistic bound)
    loss = -torch.min(obj_unclipped, obj_clipped).mean()
    
    return loss

# Demonstrate clipping behavior
print("PPO Clipping Visualization:")
print("=" * 50)

ratios = torch.linspace(0.5, 1.5, 100)
advantages_pos = torch.ones_like(ratios) * 1.0  # Positive advantage
advantages_neg = torch.ones_like(ratios) * -1.0  # Negative advantage

epsilon = 0.2

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Positive advantage
clipped_pos = torch.clamp(ratios, 1 - epsilon, 1 + epsilon)
obj_unclipped_pos = ratios * advantages_pos
obj_clipped_pos = clipped_pos * advantages_pos
obj_ppo_pos = torch.min(obj_unclipped_pos, obj_clipped_pos)

axes[0].plot(ratios.numpy(), obj_unclipped_pos.numpy(), 'b--', label='Unclipped')
axes[0].plot(ratios.numpy(), obj_clipped_pos.numpy(), 'r--', label='Clipped')
axes[0].plot(ratios.numpy(), obj_ppo_pos.numpy(), 'g-', linewidth=2, label='PPO (min)')
axes[0].axvline(x=1.0, color='gray', linestyle=':')
axes[0].set_title('Positive Advantage (A > 0)')
axes[0].set_xlabel('Probability Ratio')
axes[0].set_ylabel('Objective')
axes[0].legend()

# Negative advantage
clipped_neg = torch.clamp(ratios, 1 - epsilon, 1 + epsilon)
obj_unclipped_neg = ratios * advantages_neg
obj_clipped_neg = clipped_neg * advantages_neg
obj_ppo_neg = torch.min(obj_unclipped_neg, obj_clipped_neg)

axes[1].plot(ratios.numpy(), obj_unclipped_neg.numpy(), 'b--', label='Unclipped')
axes[1].plot(ratios.numpy(), obj_clipped_neg.numpy(), 'r--', label='Clipped')
axes[1].plot(ratios.numpy(), obj_ppo_neg.numpy(), 'g-', linewidth=2, label='PPO (min)')
axes[1].axvline(x=1.0, color='gray', linestyle=':')
axes[1].set_title('Negative Advantage (A < 0)')
axes[1].set_xlabel('Probability Ratio')
axes[1].set_ylabel('Objective')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\nKey insight: PPO prevents extreme policy updates by clipping the ratio.")
print("This makes training much more stable than vanilla policy gradient.")
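
The same clipping rule can be checked by hand on single numbers. This sketch evaluates the per-sample objective for four made-up (ratio, advantage) pairs with ε = 0.2; `clipped_objective` is a hypothetical scalar version of the tensor code above.

```python
def clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1 - epsilon), 1 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)

cases = [
    (1.5, 1.0),    # good action, already much more likely -> capped at 1.2
    (0.5, 1.0),    # good action, now less likely          -> unclipped 0.5
    (1.5, -1.0),   # bad action, now more likely           -> unclipped -1.5
    (0.5, -1.0),   # bad action, already much less likely  -> capped at -0.8
]
for ratio, advantage in cases:
    value = clipped_objective(ratio, advantage)
    print(f"ratio={ratio:.1f}, A={advantage:+.1f} -> objective={value:+.2f}")
```

In the two capped cases the objective no longer depends on the ratio, so the gradient is zero and the update stops pushing further in a direction it has already moved. In the other two cases the policy moved the wrong way and the full, unclipped penalty applies.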

## 7. Reward Shaping

**Sparse rewards** (e.g., +1 only at goal) make learning hard. **Reward shaping** adds intermediate rewards.

In [None]:
# Reward Shaping Examples
print("""
REWARD SHAPING STRATEGIES
=========================

1. Distance-based rewards:
   reward = -distance_to_goal
   
2. Progress rewards:
   reward = previous_distance - current_distance
   
3. Potential-based shaping (theoretically sound):
   F(s, s') = γ * Φ(s') - Φ(s)
   where Φ is a potential function
   
4. Curriculum learning:
   Start with easy tasks, gradually increase difficulty

Common Pitfalls:
- Reward hacking: Agent finds unintended shortcuts
- Overfit to shaped reward, ignore true objective
- Dense rewards can distract from sparse goal

Example: Self-driving car
- Sparse: +1 for reaching destination, -1 for crash
- Shaped: +0.1 for staying in lane, -0.5 for near-miss, +0.01 per meter forward

Best Practice: Start with sparse, add shaping only if needed.
""")

def shaped_reward(state, action, next_state, done, sparse_reward):
    """
    Example of potential-based reward shaping.
    This is guaranteed not to change the optimal policy!
    """
    gamma = 0.99
    
    # Potential function (example: negative distance to goal)
    def potential(s):
        goal = np.array([0, 0])
        return -np.linalg.norm(s[:2] - goal)  # Closer to goal = higher potential
    
    # Potential-based shaping bonus: F(s, s') = γ * Φ(s') - Φ(s)
    if done:
        # The potential of a terminal state is defined as 0, so F = γ * 0 - Φ(s)
        shaping_bonus = -potential(state)
    else:
        shaping_bonus = gamma * potential(next_state) - potential(state)
    
    return sparse_reward + shaping_bonus

print("Potential-based shaping preserves optimal policy (provably safe)!")
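
One way to build intuition for why potential-based shaping is safe: along any trajectory, the discounted shaping bonuses telescope, so their sum depends only on the first and last states, not on the path between them. The sketch below checks this numerically on made-up 1-D positions; `potential` and `discounted_shaping_sum` are hypothetical helpers.

```python
def potential(position, goal=0.0):
    """Example potential: negative distance to the goal."""
    return -abs(position - goal)

def discounted_shaping_sum(positions, gamma):
    """Sum over t of gamma^t * (gamma * phi(s_{t+1}) - phi(s_t))."""
    total = 0.0
    for t in range(len(positions) - 1):
        bonus = gamma * potential(positions[t + 1]) - potential(positions[t])
        total += (gamma ** t) * bonus
    return total

gamma = 0.99
path_a = [5.0, 4.0, 6.0, 3.0, 1.0]    # wanders before approaching the goal
path_b = [5.0, 4.0, 3.0, 2.0, 1.0]    # same endpoints and length, direct route

T = len(path_a) - 1
closed_form = gamma ** T * potential(path_a[-1]) - potential(path_a[0])

print(f"path A: {discounted_shaping_sum(path_a, gamma):.6f}")
print(f"path B: {discounted_shaping_sum(path_b, gamma):.6f}")
print(f"gamma^T * phi(s_T) - phi(s_0): {closed_form:.6f}")
```

Both paths collect the same total shaping bonus, so shaping cannot make one route look better than another. Only the true reward distinguishes them.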

## 8. FAANG Interview Questions

### Q1: Compare REINFORCE, Actor-Critic, and PPO. When would you use each?

**Answer**:

| Algorithm | Pros | Cons | Use Case |
|-----------|------|------|----------|
| **REINFORCE** | Simple, easy to implement | High variance, sample inefficient | Learning, prototyping |
| **Actor-Critic** | Lower variance (baseline), online updates | Still unstable | Continuous control |
| **PPO** | Stable, efficient, works well in practice | Hyperparameter sensitive | Production, RLHF |

Decision framework:
1. **Learning/Prototyping**: REINFORCE (simple baseline)
2. **Continuous control**: Actor-Critic (TD learning)
3. **Production/RLHF**: PPO (stable, widely used)

---

### Q2: Explain the exploration-exploitation tradeoff. How do different algorithms handle it?

**Answer**:
- **Exploitation**: Use current best action (greedy)
- **Exploration**: Try new actions to find better strategies

**Methods**:
1. **ε-greedy**: Random action with probability ε (simple, widely used)
2. **Softmax/Boltzmann**: Sample from action distribution (natural for policy gradient)
3. **UCB (Upper Confidence Bound)**: Optimism in face of uncertainty (principled)
4. **Entropy regularization**: Add entropy bonus to encourage diverse actions (commonly added to the PPO loss)
5. **Intrinsic motivation**: Reward curiosity/novelty (complex environments)

In policy gradient methods, exploration comes naturally from sampling actions from the policy distribution.
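
The first two methods and the entropy measure behind the fourth can be sketched in a few lines. The Q-values, temperatures, and function names below are made up for illustration.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; a higher temperature gives a flatter distribution."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; larger means more exploratory."""
    return -sum(p * math.log(p) for p in probs if p > 0)

rng = random.Random(0)
q_values = [1.0, 2.0, 3.0]                           # made-up action values

picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
print("share of greedy picks:", picks.count(2) / len(picks))

for temperature in (0.1, 1.0, 10.0):
    probs = boltzmann_probs(q_values, temperature)
    print(f"T={temperature:>4}: probs={[round(p, 3) for p in probs]}, "
          f"entropy={entropy(probs):.3f}")
```

An entropy bonus adds a multiple of this entropy to the objective, which discourages the policy from collapsing onto a single action too early.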

---

### Q3: What is RLHF (Reinforcement Learning from Human Feedback) and how does it work?

**Answer**: RLHF is a fine-tuning technique used to align large language models such as ChatGPT and Claude with human preferences.

**Three-stage process**:
1. **Supervised Fine-Tuning (SFT)**: Train on human demonstrations
2. **Reward Model Training**: Learn to predict human preferences from comparisons
3. **RL Fine-Tuning**: Use PPO to optimize policy against reward model

**Key components**:
- **Reward Model**: R(prompt, response) → scalar score
- **KL Penalty**: Prevent policy from diverging too far from SFT
- **PPO**: Stable policy optimization

**Challenges**:
- Reward hacking (model finds loopholes)
- Reward model errors (wrong preferences)
- Distribution shift (training ≠ deployment)
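
Two of these components reduce to short formulas. The sketch below assumes a commonly used formulation: a pairwise (Bradley-Terry style) loss for the reward model, and a per-sample reward equal to the reward-model score minus β times the log-probability gap to the SFT reference. The function names, scores, and β = 0.1 are illustrative assumptions, not any particular system's implementation.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise reward-model loss: -log(sigmoid(score_chosen - score_rejected))."""
    diff = score_chosen - score_rejected
    return math.log1p(math.exp(-diff))     # identical to -log(sigmoid(diff))

def kl_penalized_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward used in the RL stage: RM score minus a KL-style penalty.

    (logp_policy - logp_reference) is a per-sample estimate of how far the
    policy has drifted from the SFT reference on this response.
    """
    return rm_score - beta * (logp_policy - logp_reference)

# The reward model is penalized more when it ranks the pair the wrong way round.
print(f"correct ranking: {preference_loss(2.0, 0.0):.4f}")
print(f"wrong ranking:   {preference_loss(0.0, 2.0):.4f}")

# Made-up numbers: the policy likes this response more than the reference does.
print(f"penalized reward: {kl_penalized_reward(1.0, -2.0, -3.0):.2f}")
```

Raising β keeps the policy closer to the SFT model, which limits reward hacking but also limits how much the RL stage can change behaviour.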

---

### Q4: How do you handle sparse rewards in RL?

**Answer**:

1. **Reward Shaping**: Add intermediate rewards
   - Risk: Can change optimal policy if not careful
   - Safe: Use potential-based shaping

2. **Curriculum Learning**: Start easy, increase difficulty
   - Example: Short mazes → Long mazes

3. **Hindsight Experience Replay (HER)**: 
   - Failed attempts become successes with different goals
   - "I didn't reach A, but I reached B!"

4. **Intrinsic Motivation**:
   - Curiosity-driven exploration
   - Reward visiting new states

5. **Demonstration Learning**:
   - Imitation learning from expert
   - Combine with RL (DQfD, GAIL)
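
As an illustration of item 3, here is a minimal sketch of hindsight relabeling. It replaces the intended goal with the state the agent actually ended in (often described as the "final" strategy) and recomputes the sparse reward. The tuple layout and the names `her_relabel` and `sparse_reward` are assumptions made for this example.

```python
def sparse_reward(state, goal):
    """1.0 only when the goal is reached exactly."""
    return 1.0 if state == goal else 0.0

def her_relabel(episode, reward_fn):
    """Relabel an episode with the goal that was actually achieved.

    episode: list of (state, action, next_state, goal) tuples.
    Returns (state, action, next_state, new_goal, new_reward) tuples.
    """
    achieved_goal = episode[-1][2]          # final state reached in the episode
    relabeled = []
    for state, action, next_state, _ in episode:
        reward = reward_fn(next_state, achieved_goal)
        relabeled.append((state, action, next_state, achieved_goal, reward))
    return relabeled

# Made-up failed episode: the goal was state 5, but the agent stopped at state 2.
episode = [(0, 1, 1, 5), (1, 1, 2, 5)]
original_rewards = [sparse_reward(ns, g) for _, _, ns, g in episode]
relabeled = her_relabel(episode, sparse_reward)

print("original rewards: ", original_rewards)
print("relabeled rewards:", [t[4] for t in relabeled])
```

The relabeled transitions are stored in the replay buffer next to the originals, so a goal-conditioned agent gets a learning signal even from episodes that missed the intended goal.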

---

### Q5: Explain the credit assignment problem in RL.

**Answer**: Credit assignment is determining which actions contributed to a reward.

**The Problem**:
- Reward comes at end of episode (delayed)
- Many actions preceded the reward
- Which ones were actually responsible?

**Solutions**:

1. **Discounting (γ)**:
   - Rewards that arrive soon after an action count more than distant ones
   - $R = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$

2. **Temporal Difference (TD)**:
   - Bootstrap from value estimates
   - Update after each step, not episode

3. **Advantage Function**:
   - A(s,a) = Q(s,a) - V(s)
   - Measures action quality relative to average

4. **Generalized Advantage Estimation (GAE)**:
   - Balance bias-variance in advantage estimation
   - λ interpolates between MC and TD

5. **Attention mechanisms**:
   - Transformers in RL (Decision Transformer)
   - Learn to attend to relevant past states
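
Item 4 can be made concrete with a short sketch of the GAE recursion, assuming the usual definitions $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ and $A_t = \delta_t + \gamma \lambda A_{t+1}$. The rewards, value estimates, and the helper name `gae` are made up for illustration.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards[t], values[t] are r_t and V(s_t); last_value is V of the state
    after the final step (0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running               # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]      # made-up critic estimates

for lam in (0.0, 0.95, 1.0):
    adv = gae(rewards, values, last_value=0.0, gamma=0.9, lam=lam)
    print(f"lambda={lam}: {[round(a, 4) for a in adv]}")
```

With λ = 0 each advantage is just the one-step TD error (low variance, more bias). With λ = 1 it equals the Monte Carlo return minus V(s) (no bootstrap bias, more variance).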

## Key Takeaways

1. **RL is different**: No labels, just rewards from environment interaction.
2. **Policy Gradient (REINFORCE)**: Directly optimizes the policy network to maximize expected reward.
3. **Value Functions**: V(s) and Q(s,a) estimate expected future rewards.
4. **Advantage**: A(s,a) = Q(s,a) - V(s) measures how much better an action is than average.
5. **Actor-Critic**: Combines policy (actor) and value (critic) networks for lower variance.
6. **PPO**: Clips policy updates for stability - the go-to algorithm for production RL.
7. **Exploration vs Exploitation**: Balance trying new things with using known strategies.
8. **RLHF**: How modern LLMs are aligned with human preferences, commonly using PPO.