# PyTorch Tutorial: Reinforcement Learning (RL)

Reinforcement Learning is a paradigm where an **agent** learns to make decisions by interacting with an **environment** to maximize cumulative **reward**. Unlike supervised learning, where every example comes with a label, RL provides only a scalar reward signal, and that feedback is often delayed and sparse.

## Learning Objectives
- Understand the vocabulary of RL (Agent, Environment, State, Action, Reward)
- Implement a Q-Learning Agent (Tabular)
- Implement a Policy Gradient Agent (REINFORCE) using PyTorch
- Solve the `CartPole-v1` environment from Gymnasium

## 1. Vocabulary First

Before coding, let's define the key terms:

- **Agent**: The learner (the model).
- **Environment**: The world the agent interacts with (e.g., a game, a robot simulation).
- **State ($s$)**: The current situation of the agent (e.g., position, velocity).
- **Action ($a$)**: What the agent does (e.g., move left, jump).
- **Reward ($r$)**: Feedback from the environment (e.g., +1 for surviving, -10 for crashing).
- **Policy ($\pi$)**: The strategy the agent uses to decide an action given a state. It can be a deterministic mapping ($s \to a$) or a probability distribution over actions, written $\pi(a|s)$.
- **Episode**: One full run of the game/task from start to finish.
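To make these terms concrete, here is a minimal sketch of the agent-environment loop in plain Python. `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented for this illustration. The toy environment returns a simplified `(state, reward, done)` tuple, which is not the Gymnasium interface used later in this tutorial.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, goal at cell `length - 1`."""

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.pos = 0
        self.steps = 0
        return self.pos  # the state is the current cell index

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.pos == self.length - 1
        reward = 1.0 if reached_goal else 0.0  # sparse: only the goal pays
        done = reached_goal or self.steps >= self.max_steps
        return self.pos, reward, done


def random_policy(state, rng):
    """Policy: maps a state to an action (this one ignores the state)."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng):
    """One episode: repeat state -> action -> (reward, next state) until done."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state, rng)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


rng = random.Random(0)
toy_env = CorridorEnv()
episode_return = run_episode(toy_env, random_policy, rng)
print(f"Episode return: {episode_return}")
```

The reward here arrives only at the goal, which is a small example of the delayed, sparse feedback that distinguishes RL from supervised learning.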

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym # Standard RL environment library (maintained fork of OpenAI Gym)
import numpy as np
import matplotlib.pyplot as plt

print("Ready for RL!")

## 2. The Environment: CartPole

We will use `CartPole-v1`. The goal is to balance a pole on a cart.
- **State**: [Cart Position, Cart Velocity, Pole Angle, Pole Angular Velocity]
- **Actions**: 0 (Push Left), 1 (Push Right)
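- **Reward**: +1 for every step the pole stays upright.
- **Episode end**: The episode terminates when the pole tilts more than about 12° from vertical or the cart moves more than 2.4 units from the center, and it is truncated after 500 steps.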

In [None]:
env = gym.make('CartPole-v1')
state, info = env.reset()
print(f"Initial State: {state}")
print(f"Action Space: {env.action_space}") # Discrete(2) -> 0 or 1

## 3. Policy Gradient (REINFORCE)

In Deep RL, we use a Neural Network to approximate the policy $\pi(a|s)$.

### The Network
Input: State (4 values) -> Hidden Layer -> Output: Probability of each Action (2 values).

In [None]:
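class PolicyNetwork(nn.Module):
    """Maps a batch of states to a probability distribution over actions."""

    def __init__(self, s_size, a_size, h_size):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)  # state -> hidden features
        self.fc2 = nn.Linear(h_size, a_size)  # hidden features -> action logits
        self.relu = nn.ReLU()
        # Normalize over the last (action) dimension so each row sums to 1
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.softmax(x)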

# Initialize
policy = PolicyNetwork(s_size=4, a_size=2, h_size=16)
optimizer = optim.Adam(policy.parameters(), lr=0.01)
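As a quick sanity check on how such a network is used, the sketch below builds a stand-in with the same 4 -> 16 -> 2 layout and turns its output into an action. `demo_policy` and `demo_state` are made up for this illustration (the state values are arbitrary, not real CartPole observations). It also contrasts sampling an action, which is what training relies on for exploration, with greedily picking the most probable one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in with the same layout as the policy network: 4 -> 16 -> 2, softmax output
demo_policy = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
    nn.Softmax(dim=1),
)

# An arbitrary 4-value state with a batch dimension of 1
demo_state = torch.tensor([[0.01, -0.02, 0.03, 0.04]])

probs = demo_policy(demo_state)  # shape (1, 2); the row sums to 1
dist = torch.distributions.Categorical(probs)

sampled_action = dist.sample()             # stochastic choice (exploration)
greedy_action = probs.argmax(dim=1)        # most probable action (exploitation)
log_prob = dist.log_prob(sampled_action)   # differentiable w.r.t. the weights

print(f"Probabilities: {probs.detach().numpy()}")
print(f"Sampled: {sampled_action.item()}, greedy: {greedy_action.item()}")
print(f"log pi(a|s) of the sampled action: {log_prob.item():.4f}")
```

The log-probability keeps its connection to the network weights, which is what lets the training loop backpropagate through it.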

### The Training Loop (REINFORCE Algorithm)

1. **Collect Trajectory**: Play an entire episode using the current policy.
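2. **Calculate Returns**: For each step $t$, compute the discounted sum of rewards from that step to the end of the episode, $R_t$.
3. **Update Policy**: Increase the probability of actions that were followed by high returns.

$$ \text{Loss} = -\sum_{t} \log \pi(a_t|s_t) \, R_t, \qquad R_t = \sum_{k=0}^{T-t} \gamma^{k} \, r_{t+k} $$

Here $\gamma$ is the discount factor and $T$ is the final step of the episode.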
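As a small worked example of step 2, the sketch below computes discounted returns for three rewards of 1 with a deliberately small discount factor of 0.5, so the arithmetic is easy to follow by hand. `discounted_returns` is a helper name chosen for this illustration. It walks backward through the episode using the recurrence $R_t = r_t + \gamma R_{t+1}$.

```python
def discounted_returns(rewards, gamma):
    """Return [R_0, R_1, ...] using the recurrence R_t = r_t + gamma * R_{t+1}."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example_returns)  # [1.75, 1.5, 1.0]
```

The last step only sees its own reward (1.0), the middle step sees 1 + 0.5 × 1 = 1.5, and the first step sees 1 + 0.5 × 1.5 = 1.75. Earlier actions therefore get credit for more of the future.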

In [None]:
def reinforce(env, policy, optimizer, n_episodes=500):
    gamma = 0.99 # Discount factor
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        log_probs = []
        rewards = []
        
        # 1. Collect Trajectory
        done = False
        while not done:
            state_t = torch.from_numpy(state).float().unsqueeze(0)
            probs = policy(state_t)
            
            # Sample action from probability distribution
            m = torch.distributions.Categorical(probs)
            action = m.sample()
            
            log_probs.append(m.log_prob(action))
            state, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
            
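        # 2. Calculate Returns (discounted sum of future rewards from each step)
        returns = []
        R = 0.0
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        returns.reverse()  # restore chronological order
        returns = torch.tensor(returns, dtype=torch.float32)
        # Normalize to reduce gradient variance; std() is undefined for a single step
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-9)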
        
        # 3. Update Policy: minimize -sum_t log pi(a_t|s_t) * R_t
        policy_loss = -(torch.cat(log_probs) * returns).sum()

        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()
        
        if episode % 50 == 0:
            print(f"Episode {episode}, Total Reward: {sum(rewards)}")

# Run training
# reinforce(env, policy, optimizer)
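To see why minimizing $-\log \pi(a|s) \times R$ raises the probability of a rewarded action, a single hand-built update is enough. The sketch below is a simplification, not the training loop itself: two free logits stand in for the network output on one fixed state, plain SGD replaces Adam, and the action and return are invented for the illustration.

```python
import torch

# Two logits stand in for the policy's output on one fixed state
logits = torch.zeros(2, requires_grad=True)
sgd = torch.optim.SGD([logits], lr=0.1)


def action_probs():
    """Current action probabilities, detached from the autograd graph."""
    return torch.softmax(logits, dim=0).detach().clone()


before = action_probs()  # both actions start at probability 0.5

# Pretend action 1 was taken and was followed by a positive return
action, ret = 1, 1.0
loss = -torch.log_softmax(logits, dim=0)[action] * ret

sgd.zero_grad()
loss.backward()
sgd.step()

after = action_probs()
print(f"P(action=1) before: {before[1].item():.3f}, after: {after[1].item():.3f}")
```

With a positive return, the update raises the probability of the chosen action (from 0.5 to roughly 0.525 in this setup). With a negative return, the same code would lower it, which is why normalized returns push some actions up and others down.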

## Key Takeaways

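1. **RL is different**: There are no labels, only reward signals that the agent must learn from through interaction.
2. **Policy Gradient**: Directly optimizes the policy network to maximize expected reward.
3. **Exploration vs Exploitation**: The agent needs to try uncertain actions (exploration) to discover good strategies, but also use what it already knows (exploitation). In REINFORCE, sampling actions from the policy's probability distribution, rather than always taking the most likely one, provides the exploration.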