In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
try:
    import gymnasium as gym
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    DEEP_RL_AVAILABLE = True
except ImportError:
    DEEP_RL_AVAILABLE = False
from collections import deque
from IPython.display import display, Markdown, Image

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (10, 6), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg): display(Markdown(f"<div class='alert alert-block alert-info'>📝 **Note:** {msg}</div>"))
def sec(title): print(f"\n{80*'='}\n| {title.upper()} |\n{80*'='}")

if not DEEP_RL_AVAILABLE: note("TensorFlow and/or Gymnasium not installed. Skipping Deep RL code labs.")
note(f"Environment initialized. Deep RL libraries available: {DEEP_RL_AVAILABLE}")

# Chapter 7.15: Reinforcement Learning: Learning from Interaction

---

### Table of Contents

1.  [**Introduction: Learning Goal-Directed Behavior**](#intro)
2.  [**The Mathematical Framework: Markov Decision Processes (MDPs)**](#mdps)
    - [The Bellman Optimality Equations](#bellman)
3.  [**Model-Free Learning: Q-Learning**](#q-learning)
    - [Code Lab: Q-Learning for a Grid World](#code-q-learning)
4.  [**Deep Reinforcement Learning: Deep Q-Networks (DQN)**](#dqn)
    - [Code Lab: DQN for CartPole](#code-dqn)
5.  [**Policy Gradient Methods**](#policy-gradient)
    - [The Policy Gradient Theorem: A Proof Sketch](#pg-theorem)
6.  [**Actor-Critic Methods**](#actor-critic)
7.  [**Case Study: Optimal Resource Extraction**](#casestudy)
8.  [**Exercises**](#exercises)
9.  [**Summary and Key Takeaways**](#summary)

<a id='intro'></a>
## 1. Introduction: Learning Goal-Directed Behavior

**Reinforcement Learning (RL)** is a computational approach to learning goal-directed behavior through **interaction**. It stands as one of the three primary paradigms of machine learning, but its philosophy is distinct. An RL **agent**, situated within an **environment**, is not given a static dataset but must learn by doing, generating its own data through trial and error. Its goal is to learn a behavioral strategy, or **policy**, that maximizes a cumulative reward signal over time.

The connection to economics is deep and fundamental. RL can be viewed as a sample-based, approximate way of solving the problems studied in **dynamic programming** and **optimal control theory**. The agent's problem has the same structure as that of an economic agent maximizing discounted lifetime utility. The **Bellman equation**, the cornerstone of dynamic programming, is also the cornerstone of reinforcement learning.

In [None]:
import graphviz
dot = graphviz.Digraph(comment='RL Loop')
dot.attr('node', shape='box', style='rounded', fontname='Helvetica', fontsize='14')
dot.attr('edge', fontname='Helvetica', fontsize='12')
dot.attr(rankdir='LR', size='8,5')

dot.node('Agent', 'Agent', height='1.2', width='2.2')
dot.node('Env', 'Environment', height='1.2', width='2.2')

dot.edge('Agent', 'Env', label=' Action (A_t) ')
dot.edge('Env', 'Agent', label=' State (S_{t+1})\nReward (R_{t+1}) ')

dot.graph_attr['label'] = '\nFigure 1: The Agent-Environment Interaction Loop'
dot.graph_attr['labelloc'] = 't'
dot.graph_attr['fontsize'] = '18'

display(dot)

<a id='mdps'></a>
## 2. The Mathematical Framework: Markov Decision Processes (MDPs)

To formalize the problem of sequential decision-making, we use the framework of a **Markov Decision Process (MDP)**. An MDP is defined by a tuple ($S, A, P, R, \gamma$):

- **$S$ - The State Space**: The set of all possible states.
- **$A$ - The Action Space**: The set of all possible actions.
- **$P$ - The Transition Model**: $P(s' | s, a)$, the probability of transitioning to state $s'$ given the current state $s$ and action $a$.
- **$R$ - The Reward Function**: $R(s, a, s')$, the immediate reward received.
- **$\gamma$ - The Discount Factor**: $\gamma \in [0, 1)$, which discounts future rewards.

The agent's goal is to learn a **policy** $\pi(a|s)$ that maximizes the expected **return** (the discounted sum of future rewards, $G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$). Value-based methods do this by estimating one of two **value functions**:

1.  **State-Value Function ($V^\pi(s)$)**: The expected return starting from state $s$ and then following policy $\pi$.
2.  **Action-Value Function ($Q^\pi(s, a)$)**: The expected return from taking action $a$ in state $s$, and thereafter following policy $\pi$. This is the central object of value-based RL.
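
To make the return $G_t$ concrete, the sketch below computes it for every step of a short, finite reward sequence by working backward with the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. The function name `discounted_returns` and the sample rewards are illustrative assumptions, not part of the chapter's code labs.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t for each step t, where rewards[t] plays the role of R_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Work backward so each step reuses the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Illustrative rewards: G_2 = 2, G_1 = 0 + 0.5 * 2 = 1, G_0 = 1 + 0.5 * 1 = 1.5
example_returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(example_returns)
```

The state-value function $V^\pi(s)$ is the expectation of this quantity over trajectories that start in $s$ and follow $\pi$.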

<a id='bellman'></a>
### 2.1 The Bellman Optimality Equations

The value functions for the optimal policy, $\pi^*$, must satisfy the **Bellman Optimality Equations**. These equations decompose the value of a state or state-action pair into the expected immediate reward and the discounted *optimal* value of the successor state. They are the foundation for nearly all value-based RL algorithms.

$$ V^*(s) = \max_{a} E[R_{t+1} + \gamma V^*(S_{t+1}) | S_t=s, A_t=a] $$ 
$$ Q^*(s, a) = E[R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') | S_t=s, A_t=a] $$ 
If we know the transition dynamics $P$ and reward function $R$, we can solve these equations using dynamic programming methods like **value iteration**.
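
As a minimal sketch of value iteration, the code below repeatedly applies the Bellman optimality backup to a toy two-state, two-action MDP until the values stop changing. The array layout (`P[s, a, s']`, `R[s, a]` as expected immediate reward), the function name `value_iteration`, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """P[s, a, s'] are transition probabilities; R[s, a] are expected rewards."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    greedy_policy = (R + gamma * P @ V).argmax(axis=1)
    return V, greedy_policy

# Toy MDP (hypothetical): action 0 = stay, action 1 = switch state, all deterministic
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],    # state 0: staying pays 0, switching pays 1
              [2.0, 0.0]])   # state 1: staying pays 2, switching pays 0
V_star, policy = value_iteration(P, R, gamma=0.9)
print(V_star, policy)
```

In this toy problem the agent should switch out of state 0 and then stay in state 1 forever, so $V^*(1) = 2 / (1 - 0.9) = 20$ and $V^*(0) = 1 + 0.9 \cdot 20 = 19$.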

<a id='q-learning'></a>
## 3. Model-Free Learning: Q-Learning

In most interesting problems, the agent does not know the model of the environment ($P$ and $R$ are unknown). This is **model-free** reinforcement learning. The agent must learn the optimal policy purely from trial-and-error interaction.

**Q-Learning** is the canonical model-free, value-based algorithm. It directly learns an estimate of the optimal action-value function, $Q^*(s,a)$, from experience. It uses a **temporal-difference (TD)** update, which updates the Q-value for a state-action pair based on the reward received and the estimated value of the *next* state.

The update rule for the Q-value of the state-action pair $(s, a)$ experienced at time $t$ is:
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( \underbrace{r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')}_{\text{TD Target}} - Q(s_t, a_t) \right) $$ 
where $\alpha$ is the learning rate. Q-learning is an **off-policy** algorithm because the update rule uses the `max` operator: it bootstraps from the value of the *greedy* action in the next state (the best action under the current estimates), not necessarily the action actually taken by the current (potentially exploratory) behavior policy.
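
A single application of the update rule can be traced by hand. The sketch below uses a hypothetical two-state, two-action table and made-up numbers; `q_update` is an illustrative helper name.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap from the greedy next action
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]                               # current estimates for the next state
td_error = q_update(Q, s=0, a=1, r=2.0, s_next=1, alpha=0.5, gamma=0.9)
# TD target = 2 + 0.9 * 3 = 4.7, so Q[0, 1] moves halfway from 0 to 4.7
print(td_error, Q[0, 1])
```

Note that the target depends only on the maximum over `Q[s_next]`, so the result is the same whichever action the agent goes on to take in the next state.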

<a id='code-q-learning'></a>

In [None]:
sec("Code Lab: Q-Learning to Solve a Grid World")

class GridWorld:
    def __init__(self):
        self.grid = np.zeros((5, 5)); self.pos = [0, 0]
        self.grid[4, 4] = 10
        self.actions = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
    def step(self, action):
        move = self.actions[action]
        self.pos[0] = np.clip(self.pos[0] + move[0], 0, 4)
        self.pos[1] = np.clip(self.pos[1] + move[1], 0, 4)
        reward = self.grid[tuple(self.pos)]
        return tuple(self.pos), reward
    def reset(self): self.pos = [0, 0]; return tuple(self.pos)

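def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1, max_steps=1000):
    q_table = np.zeros((5, 5, 4))
    for ep in range(episodes):
        state = env.reset()
        for _ in range(max_steps):  # cap episode length so every episode terminates
            if np.random.random() < epsilon:
                action = np.random.choice(4)  # explore: uniformly random action
            else:
                # exploit, breaking ties at random so an all-zero table does not always pick "up"
                best_actions = np.flatnonzero(q_table[state] == q_table[state].max())
                action = np.random.choice(best_actions)
            next_state, reward = env.step(action)
            # TD update toward reward + discounted value of the greedy next action
            q_table[state][action] = q_table[state][action] + alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state][action])
            state = next_state
            if reward > 0: break  # the goal cell is the only rewarding (terminal) state
    return q_table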

env = GridWorld()
q_table = q_learning(env)
optimal_policy = np.argmax(q_table, axis=2)
plt.imshow(optimal_policy, cmap='viridis')
plt.title('Figure 2: Learned Optimal Policy (0:U, 1:D, 2:L, 3:R)')
plt.colorbar(); plt.show()

<a id='dqn'></a>
## 4. Deep Reinforcement Learning: Deep Q-Networks (DQN)

The Q-learning approach with a lookup table works for small, discrete state spaces. For problems with large or continuous state spaces (e.g., controlling a robot from pixel inputs), we cannot store a Q-value for every state. **Deep Q-Networks (DQN)**, developed by DeepMind, solve this by using a deep neural network to approximate the action-value function: $Q(s, a; \theta) \approx Q^*(s, a)$. The network takes the state $s$ as input and outputs a Q-value for each possible action $a$.

Training a Q-network with the standard Q-learning update is notoriously unstable. DQN introduced two key innovations to stabilize training:
1.  **Experience Replay:** The agent stores its experiences $(s, a, r, s')$ in a large **replay buffer**. During training, it samples random mini-batches from this buffer. This breaks the strong temporal correlation between consecutive samples, making the training data closer to the i.i.d. samples that stochastic gradient training assumes.
2.  **Target Network:** The TD target, $r + \gamma \max_{a'} Q(s', a')$, creates a moving target problem because the same network is used to predict the current Q-value and the target Q-value. DQN solves this by using a separate **target network** to calculate the TD target. The target network is a periodically updated copy of the main policy network, providing a more stable target for the loss calculation.
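
The two ideas can be sketched without a neural network. In the NumPy code below, `ReplayBuffer` and `td_targets` are hypothetical helper names, and a plain array of Q-values stands in for the output of the target network. The `dones` mask, which switches off bootstrapping at terminal transitions, is a standard detail assumed here.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are dropped first
    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    def sample(self, batch_size):
        # Uniform sampling breaks the temporal order of consecutive transitions
        batch = random.sample(list(self.buffer), batch_size)
        return tuple(np.array(column) for column in zip(*batch))
    def __len__(self):
        return len(self.buffer)

def td_targets(rewards, next_q_values, dones, gamma):
    """next_q_values has shape (batch, n_actions) and comes from the target network."""
    not_done = 1.0 - dones.astype(float)
    return rewards + gamma * not_done * next_q_values.max(axis=1)

# Fill a small buffer with made-up 4-dimensional states
buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(np.full(4, float(i)), i % 2, 1.0, np.full(4, float(i + 1)), i == 4)
states, actions, rewards, next_states, dones = buffer.sample(2)

# Hand-checkable targets: [1 + 0.9 * 2, 1] because the second transition is terminal
targets = td_targets(np.array([1.0, 1.0]),
                     np.array([[0.5, 2.0], [3.0, 1.0]]),
                     np.array([False, True]), gamma=0.9)
print(len(buffer), states.shape, targets)
```

Because the targets are computed from a separate, slowly changing set of Q-values, a gradient step on the main network does not immediately shift the values it is being regressed toward.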

<a id='code-dqn'></a>

In [None]:
sec("Code Lab: Deep Q-Network (DQN) for CartPole")
if DEEP_RL_AVAILABLE:
    class DQNAgent:
        def __init__(self, state_size, action_size):
            self.state_size, self.action_size = state_size, action_size
            self.memory = deque(maxlen=2000)
            self.gamma, self.epsilon = 0.95, 1.0
            self.epsilon_decay, self.epsilon_min = 0.995, 0.01
            self.model = self._build_model()
            self.target_model = self._build_model()
            self.update_target_model()
        def _build_model(self):
            # An explicit Input layer is recommended over passing `input_dim` to the first Dense layer
            model = keras.Sequential([keras.Input(shape=(self.state_size,)),
                                      layers.Dense(24, activation='relu'),
                                      layers.Dense(24, activation='relu'),
                                      layers.Dense(self.action_size, activation='linear')])
            model.compile(loss='mse', optimizer=keras.optimizers.Adam(learning_rate=0.001))
            return model
        def update_target_model(self): self.target_model.set_weights(self.model.get_weights())
        def remember(self, s, a, r, s_next, d): self.memory.append((s, a, r, s_next, d))
        def act(self, state):
            if np.random.rand() <= self.epsilon: return random.randrange(self.action_size)
            return np.argmax(self.model.predict(state, verbose=0)[0])
        def replay(self, batch_size):
            minibatch = random.sample(self.memory, batch_size)
            for state, action, reward, next_state, done in minibatch:
                target = self.model.predict(state, verbose=0)
                if done: target[0][action] = reward
                else: target[0][action] = reward + self.gamma * np.amax(self.target_model.predict(next_state, verbose=0)[0])
                self.model.fit(state, target, epochs=1, verbose=0)
            if self.epsilon > self.epsilon_min: self.epsilon *= self.epsilon_decay

    env = gym.make('CartPole-v1'); state_size = env.observation_space.shape[0]; action_size = env.action_space.n
    agent = DQNAgent(state_size, action_size); episodes = 100; batch_size = 32; rewards = []
    note(f"Running DQN training for {episodes} episodes...")
    for e in range(episodes):
        state = env.reset()[0]; state = np.reshape(state, [1, state_size]); total_reward = 0
        for t in range(500):  # `t`, not `time`, so the imported `time` module is not shadowed
            action = agent.act(state)
            # Gymnasium's step() returns separate `terminated` and `truncated` flags
            next_state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward; next_state = np.reshape(next_state, [1, state_size])
            # Store only `terminated`: a time-limit truncation should still bootstrap from next_state
            agent.remember(state, action, reward, next_state, terminated); state = next_state
            if terminated or truncated: agent.update_target_model(); break
        rewards.append(total_reward)
        if len(agent.memory) > batch_size: agent.replay(batch_size)
        if e % 20 == 0: print(f"Episode: {e}/{episodes}, Score: {total_reward}, Epsilon: {agent.epsilon:.2f}")
            
    plt.figure(figsize=(12, 7)); plt.plot(rewards); plt.plot(pd.Series(rewards).rolling(10).mean(), label='10-episode MA')
    plt.title('DQN Training Progress on CartPole'); plt.xlabel('Episode'); plt.ylabel('Total Reward'); plt.legend()
    plt.show()
else:
    note("Deep RL libraries not available. Skipping DQN example.")

<a id='policy-gradient'></a>
## 5. Policy Gradient Methods

An alternative to value-based methods like Q-learning is the family of **policy gradient methods**. Instead of learning a value function and deriving a policy from it, these methods learn a **parameterized policy** directly, $\pi(a|s; \theta)$. The goal is to update the policy parameters $\theta$ by performing gradient ascent on an objective function $J(\theta)$ that measures the expected return.

<a id='pg-theorem'></a>
### 5.1 The Policy Gradient Theorem: A Proof Sketch

The **Policy Gradient Theorem** provides a surprisingly simple expression for the gradient of the performance objective, linking it to the policy and the action-value function.

$$ \nabla_\theta J(\theta) = E_\pi [\nabla_\theta \log \pi(A_t|S_t; \theta) Q^\pi(S_t, A_t)] $$ 

**Proof Sketch:**
1. The objective is the expected return: $J(\theta) = E_{\tau \sim \pi_\theta}[R(\tau)]$, where $\tau$ is a trajectory.
2. The gradient is $\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau) R(\tau) d\tau$.
3. Use the **log-derivative trick**: $\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}$, so $\nabla_x f(x) = f(x) \nabla_x \log f(x)$.
4. Apply this: $\nabla_\theta J(\theta) = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) d\tau = E_{\tau \sim \pi_\theta}[\nabla_\theta \log p_\theta(\tau) R(\tau)]$.
5. The probability of a trajectory is $p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) p(s_{t+1}|s_t, a_t)$.
6. The log-probability is $\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T-1} (\log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t))$.
7. The gradient $\nabla_\theta$ only affects the policy term: $\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)$.
8. Substituting back into the expectation and using the definition of $Q^\pi$ gives the final result.

The **REINFORCE** algorithm is the simplest implementation of this theorem. It runs an episode, calculates the actual return $G_t$ at each step (as a noisy estimate of $Q^\pi$), and updates the policy parameters in the direction that makes actions leading to high returns more likely.
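
A minimal sketch of one REINFORCE update for a tabular softmax policy follows. The names `softmax`, `grad_log_softmax`, and `reinforce_update` are illustrative, as is the representation of the policy as a table of logits `theta[s]`. The sketch omits the $\gamma^t$ weighting that appears in some statements of the algorithm, which is a common simplification.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()             # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def grad_log_softmax(logits, action):
    """Gradient of log pi(action) with respect to the logits: one_hot(action) - pi."""
    grad = -softmax(logits)
    grad[action] += 1.0
    return grad

def reinforce_update(theta, episode, gamma, lr):
    """theta[s] holds the logits for state s; episode is a list of (state, action, reward)."""
    grad = np.zeros_like(theta)
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G                  # return from this step onward
        grad[state] += grad_log_softmax(theta[state], action) * G
    return theta + lr * grad                    # gradient ascent on expected return

# One state, two actions, uniform initial policy; the agent took action 1 and got reward 1
theta = np.zeros((1, 2))
new_theta = reinforce_update(theta, [(0, 1, 1.0)], gamma=1.0, lr=0.1)
print(new_theta, softmax(new_theta[0]))
```

With a uniform policy the score for action 1 is $(-0.5, 0.5)$, so the update raises the logit of the rewarded action and lowers the other by the same amount.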

<a id='actor-critic'></a>
## 6. Actor-Critic Methods

Policy gradient methods can have high variance because the return $G_t$ is a noisy estimate. **Actor-Critic** methods reduce this variance by learning a value function alongside the policy. 
- The **Actor** is the parameterized policy, which decides which action to take.
- The **Critic** is a learned value function (e.g., a neural network) that estimates the Q-value of the action taken. 

The critic's estimate of Q is used to update the actor's policy parameters, providing a lower-variance and more stable learning signal. This hybrid approach underpins many state-of-the-art RL algorithms.
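
The sketch below shows a single tabular actor-critic step. It is a simplified variant, not the exact scheme described above: the critic here is a state-value table `V`, and the one-step TD error serves as the learning signal for the actor in place of a Q-value estimate. All names and numbers are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

def actor_critic_step(theta, V, s, a, r, s_next, done, gamma, lr_actor, lr_critic):
    """Update the critic table V and the actor logits theta in place; return the TD error."""
    target = r if done else r + gamma * V[s_next]
    td_error = target - V[s]                    # how much better than expected the step was
    V[s] += lr_critic * td_error                # critic: move V(s) toward the TD target
    score = -softmax(theta[s])
    score[a] += 1.0                             # gradient of log pi(a|s) w.r.t. the logits
    theta[s] += lr_actor * td_error * score     # actor: reinforce a if td_error > 0
    return td_error

theta = np.zeros((2, 2))                        # actor: logits per state
V = np.zeros(2)                                 # critic: state values
delta = actor_critic_step(theta, V, s=0, a=0, r=1.0, s_next=1, done=False,
                          gamma=0.9, lr_actor=0.1, lr_critic=0.5)
print(delta, V, theta[0])
```

Unlike the Monte Carlo return used by REINFORCE, the TD error is available after every step, which is one reason the resulting signal tends to have lower variance (at the cost of some bias from the critic's own errors).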

In [None]:
display(Image(filename='../images/07-Machine-Learning/actor_critic_architecture.webp'))

<a id='casestudy'></a>
## 7. Case Study: Optimal Resource Extraction (The Fish Pond Problem)

To connect RL directly to economic principles, we can apply Q-learning to solve a classic dynamic optimization problem: the management of a renewable resource. We will model a simple fishery where our agent must decide how many fish to harvest each period to maximize the long-run discounted profit.

**The Economic Model as an MDP:**
- **State ($s_t$):** The current fish stock (biomass).
- **Action ($a_t$):** The quantity of fish to harvest.
- **Reward ($r_{t+1}$):** The profit from the harvest, e.g., $p \cdot a_t$, where $p$ is the price of fish.
- **Transition:** The fish stock follows a logistic growth function, less the amount harvested. $s_{t+1} = (s_t - a_t) + g(s_t - a_t)$, where $g(\cdot)$ is the growth function.

The agent faces a trade-off: harvesting more today yields higher immediate profits but depletes the stock, potentially reducing future profits. The Q-learning algorithm will learn an optimal policy that balances this trade-off.
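
The transition rule can be checked with a few lines of arithmetic before any learning takes place. The sketch below assumes the logistic form $g(x) = r x (1 - x/K)$ with $r = 0.1$ and $K = 100$; the helper names `logistic_growth` and `next_stock` are hypothetical.

```python
def logistic_growth(x, r=0.1, k=100.0):
    """Growth produced by an after-harvest stock x under logistic dynamics."""
    return r * x * (1.0 - x / k)

def next_stock(stock, harvest, r=0.1, k=100.0):
    """Next-period stock: what is left after harvest plus its growth, capped at k."""
    remaining = stock - min(harvest, stock)     # cannot harvest more than exists
    return min(k, remaining + logistic_growth(remaining, r, k))

# Growth peaks when the after-harvest stock is K/2, where it equals r*K/4
max_growth = logistic_growth(50.0)
stock_after_big_harvest = next_stock(100.0, 50.0)   # 50 left, plus growth of 2.5
stock_after_no_harvest = next_stock(100.0, 0.0)     # at capacity, growth is zero
print(max_growth, stock_after_big_harvest, stock_after_no_harvest)
```

Under these assumed parameters the stock can regenerate at most 2.5 units per period, so any policy that harvests more than that on average must eventually run the stock down.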

In [None]:
sec("Case Study: Q-Learning for Optimal Harvesting")

class FisheryEnv:
    def __init__(self, k=100, r=0.1, price=1):
        self.k, self.r, self.price = k, r, price
        self.max_stock = k
        self.state = self.max_stock
    def reset(self): self.state = self.max_stock; return int(self.state)
    def step(self, action):
        harvest = min(action, self.state)
        stock_after_harvest = self.state - harvest
        growth = self.r * stock_after_harvest * (1 - stock_after_harvest / self.k)
        self.state = min(self.max_stock, stock_after_harvest + growth)
        reward = self.price * harvest
        return int(self.state), reward

def q_learning_fishery(env, episodes=20000, alpha=0.1, gamma=0.9, epsilon=0.1):
    n_states = env.max_stock + 1
    n_actions = env.max_stock // 5 + 1
    q_table = np.zeros((n_states, n_actions))
    for ep in range(episodes):
        state = env.reset()
        for t in range(50):
            action_idx = np.argmax(q_table[state]) if np.random.random() > epsilon else np.random.choice(n_actions)
            harvest_amount = action_idx * 5
            next_state, reward = env.step(harvest_amount)
            q_table[state, action_idx] += alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state, action_idx])
            state = next_state
    return q_table

note("Solving for the optimal harvesting policy using Q-learning...")
fish_env = FisheryEnv()
q_table_fish = q_learning_fishery(fish_env)
optimal_policy_fish = np.argmax(q_table_fish, axis=1) * 5

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
fig.suptitle('Figure 4: Optimal Fishery Management via Q-Learning', fontsize=18, y=1.02)
im = ax1.imshow(q_table_fish.T, cmap='viridis', aspect='auto', origin='lower')
ax1.set_title('a) Learned Q-Values'); ax1.set_xlabel('Current Fish Stock'); ax1.set_ylabel('Harvest Amount')
fig.colorbar(im, ax=ax1, label='Expected Future Reward')
ax2.plot(range(len(optimal_policy_fish)), optimal_policy_fish, '-o')
ax2.set_title('b) Learned Optimal Harvesting Policy'); ax2.set_xlabel('Current Fish Stock'); ax2.set_ylabel('Optimal Harvest Amount')
plt.show()
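note("When training has converged, the learned policy should look sensible: at low stock levels the agent harvests little so the stock can recover, and at high stock levels it harvests more, keeping the stock near a sustainable level. Stock levels that were never visited during training keep all-zero Q-values, so the zero harvest plotted for them is a default, not a learned decision.")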

<a id='exercises'></a>
## 8. Exercises

1.  **The Bellman Equation:** Write out the Bellman optimality equation for $Q^*(s,a)$ in words. What does it say about the relationship between the value of taking an action now and the value of the best action in the future?
2.  **Exploration vs. Exploitation:** In the Q-learning algorithm, what is the role of the $\epsilon$ parameter in the $\epsilon$-greedy policy? What would happen if $\epsilon=0$? What if $\epsilon=1$?
3.  **On-Policy vs. Off-Policy:** Q-learning is an off-policy algorithm. What does this mean? Why is it considered off-policy, and what is the practical advantage of this?
4.  **Optimal Harvesting:** In the fish pond case study, how would you expect the optimal harvesting policy to change if the discount factor $\gamma$ were increased (i.e., the agent becomes more patient)? What if the fish growth rate was higher?

<a id='summary'></a>
## 9. Summary and Key Takeaways

This chapter introduced Reinforcement Learning, a powerful framework for solving sequential decision-making problems that is deeply connected to the economic theory of dynamic programming.

**Key Concepts**:
- **The RL Problem**: An **agent** learns to take **actions** in an **environment** to maximize a cumulative, discounted **reward**. This is formalized as a **Markov Decision Process (MDP)**.
- **Value Functions & Bellman Equations**: The core of value-based RL is estimating value functions ($V^\pi(s)$ or $Q^\pi(s,a)$) that satisfy the Bellman equations, which provide a recursive definition of value.
- **Q-Learning**: A classic model-free, off-policy algorithm that directly learns the optimal action-value function, $Q^*(s,a)$, using temporal-difference updates.
- **Deep Q-Networks (DQN)**: A major breakthrough that uses deep neural networks to approximate the Q-function, enabling RL to solve high-dimensional problems. It relies on **experience replay** and a **target network** for stable training.
- **Policy Gradient Methods**: An alternative approach that directly optimizes a parameterized policy $\pi(a|s;\theta)$ by performing gradient ascent on the expected return.
- **Actor-Critic Methods**: A hybrid approach that combines the strengths of value-based and policy-based methods, forming the basis for most state-of-the-art RL agents.

### Solutions to Exercises

---

**1. The Bellman Equation:**
The Bellman optimality equation for $Q^*(s,a)$ says: "The value of taking action *a* in state *s* and then acting optimally forever after is equal to the expected immediate reward you get, plus the discounted value of the *best possible action* you can take from the next state you land in." It provides the fundamental recursive relationship that links the value of the current state-action pair to the values of all possible next state-action pairs.

---

**2. Exploration vs. Exploitation:**
The $\epsilon$ parameter controls the trade-off between exploration and exploitation.
- **If $\epsilon=0$**: The agent always chooses the action with the highest current estimated Q-value. This is pure **exploitation**. The agent never tries new actions and can get stuck in a suboptimal policy if its initial Q-value estimates are poor.
- **If $\epsilon=1$**: The agent always chooses a random action, regardless of its Q-value estimates. This is pure **exploration**. The agent never uses what it has learned and will behave randomly.

A small, non-zero $\epsilon$ (that often decays over time) ensures the agent mostly exploits its knowledge but occasionally explores to discover better policies.

---

**3. On-Policy vs. Off-Policy:**
An **on-policy** algorithm updates its policy based on actions taken by that same policy. An **off-policy** algorithm can learn about the optimal policy while following a different, more exploratory policy. Q-learning is off-policy because its update rule involves the term $\max_{a'} Q(s_{t+1}, a')$. This term finds the value of the *best* possible action in the next state, regardless of what action the exploratory $\epsilon$-greedy policy might actually choose. The practical advantage is that it allows the agent to explore its environment widely while still learning a deterministic, optimal policy.

---

**4. Optimal Harvesting:**
- **Higher $\gamma$**: A higher discount factor means the agent is more patient and values the future more. The agent would learn to harvest *less* in the present to ensure the fish stock remains high, leading to larger, sustainable harvests in the future. The optimal policy would be more conservative.
- **Higher Growth Rate**: If the fish stock replenishes more quickly, the agent can afford to harvest more aggressively without depleting the stock. The optimal policy would be to harvest a larger quantity at each step, as the resource is less scarce.