**Table of contents**<a id='toc0_'></a>    
- [Episodic Sarsa with Function Approximation](#toc1_)    
- [Control with Function Approximation](#toc2_)    
- [Expected Sarsa with Function Approximation](#toc3_)    
- [Exploration under Function Approximation](#toc4_)    
- [Average Reward](#toc5_)    
  - [Average Reward: A New Way of Formulating Control Problems](#toc5_1_)    
- [tldr](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Episodic Sarsa with Function Approximation](#toc0_)

$\mathbf{OBJECTIVES}$

**Lesson 1: Episodic Sarsa with Function Approximation**

- Explain the update for Episodic Sarsa with function approximation 
- Introduce the feature choices, including passing actions to features or stacking state features 
- Visualize value function and learning curves 
- Discuss how this extends to Q-learning easily, since it is a special case of Expected Sarsa 

**Lesson 2: Exploration under Function Approximation**

- Understanding optimistically initializing your value function as a form of exploration

**Lesson 3: Average Reward**

- Describe the average reward setting 
- Explain when average reward optimal policies are different from discounted solutions 
- Understand how differential value functions are different from discounted value functions


# <a id='toc2_'></a>[Control with Function Approximation](#toc0_)

<p align="center">
  <img width="600" height="300" src="imgs/c3m4-computing-action-values.png">
</p>

- To move from TD (prediction) to Sarsa (control), we need to estimate action values rather than state values.
- **Action-dependent features**: Extend state features by stacking them per action — only the selected action's features are active.
- **Linear approximation**: Each action has its own segment of the weight vector; Q-values are computed via dot product with stacked features.
- **Neural networks**: Equivalent behavior via multiple outputs, one per action, computed from shared state features.
  - Here only the state is the input; each output is the value of one action.
- **Alternative (generalization)**: Input both the state **and** the action into a single-output network to generalize over actions.
- **Sarsa with approximation**: Use parameterized Q-functions and gradient-based updates (like semi-gradient TD) to learn.
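
The stacking idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the course: the helper names `stack_features` and `q_hat`, and the example numbers, are made up for this note.

```python
def stack_features(state_features, action, num_actions):
    """Build x(s, a) by copying the state features into the slot for `action`.

    All other action slots stay zero, so only the chosen action's
    segment of the weight vector contributes to the estimate.
    """
    d = len(state_features)
    x = [0.0] * (d * num_actions)
    x[action * d:(action + 1) * d] = state_features
    return x


def q_hat(weights, state_features, action, num_actions):
    """Linear action value: dot product of w with the stacked features."""
    x = stack_features(state_features, action, num_actions)
    return sum(w_i * x_i for w_i, x_i in zip(weights, x))


# Hypothetical example: 2 state features, 3 actions -> 6 weights.
state = [1.0, 0.5]
w = [0.1, 0.2,    # segment for action 0
     1.0, 2.0,    # segment for action 1
     -1.0, 0.0]   # segment for action 2
q_values = [q_hat(w, state, a, 3) for a in range(3)]
```

Because the segments do not overlap, an update for one action leaves the estimates for the other actions unchanged. That is why this representation does not generalize over actions.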

<p align="center">
  <img width="600" height="400" src="imgs/c3m4-episodic-sarsa.png">
</p>

# <a id='toc3_'></a>[Expected Sarsa with Function Approximation](#toc0_)

$\textbf{Sarsa}$
$$
    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha( R_{t+1} + \gamma Q({\color{red}{S_{t+1}, A_{t+1}}}) - Q(S_t, A_t) )
$$

$\textbf{Expected Sarsa}$
$$
    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha( R_{t+1} + \gamma {\color{red}{\sum_{a'}\pi(a' | S_{t+1})Q(S_{t+1}, a')}} - Q(S_t, A_t))
$$
- To compute the expectation, we simply sum over the action values weighted by their probability under the target policy. 

$\textbf{Sarsa with Function Approximation}$
$$
    \mathbf{w} \leftarrow \mathbf{w} + \alpha( R_{t+1} + \gamma \hat{q}({\color{red}{S_{t+1}, A_{t+1}}}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w})) \nabla\hat{q}(S_t, A_t, \mathbf{w})
$$

$\textbf{Expected Sarsa with Function Approximation}$
$$
    \mathbf{w} \leftarrow \mathbf{w} + \alpha( R_{t+1} + \gamma{\color{red}{\sum_{a'}\pi(a' | S_{t+1})\hat{q}(S_{t+1}, a', \mathbf{w})}} - \hat{q}(S_t, A_t, \mathbf{w})) \nabla\hat{q}(S_t, A_t, \mathbf{w})
$$
where $q_{\pi}(s, a) \approx \hat{q}(s,a,\mathbf{w})$.

$\textbf{Expected Sarsa to Q-Learning}$
$$
    \mathbf{w} \leftarrow \mathbf{w} + \alpha( R_{t+1} + \gamma{\color{red}{\max_{a'}\hat{q}(S_{t+1}, a', \mathbf{w})}} - \hat{q}(S_t, A_t, \mathbf{w})) \nabla\hat{q}(S_t, A_t, \mathbf{w})
$$
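where $\hat{q}(s,a,\mathbf{w}) = \mathbf{w}^T\mathbf{x}(s,a)$ in the linear case. With a greedy target policy the expectation reduces to the max, so $\hat{q}$ here approximates $q_*$ rather than $q_\pi$.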
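
The three updates differ only in the bootstrapped part of the target (the red terms). The sketch below makes that explicit for the linear case, where the gradient of $\hat{q}$ is just the feature vector. The function names and the example numbers are illustrative assumptions, not part of the course material.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_target(kind, reward, gamma, q_next, pi_next=None, a_next=None):
    """Bootstrapped target for one transition.

    q_next  : list of q_hat(S', a', w), one entry per action
    pi_next : target-policy probabilities at S' (used by Expected Sarsa)
    a_next  : action actually selected at S' (used by Sarsa)
    """
    if kind == "sarsa":
        bootstrap = q_next[a_next]
    elif kind == "expected_sarsa":
        bootstrap = sum(p * q for p, q in zip(pi_next, q_next))
    elif kind == "q_learning":
        bootstrap = max(q_next)
    else:
        raise ValueError(f"unknown update kind: {kind}")
    return reward + gamma * bootstrap


def semi_gradient_update(w, x, target, alpha):
    """w <- w + alpha * (target - w^T x) * x  (linear q_hat, so grad = x)."""
    delta = target - dot(w, x)
    return [w_i + alpha * delta * x_i for w_i, x_i in zip(w, x)]


# Hypothetical transition: two actions available in S'.
q_next = [1.0, 3.0]
sarsa = td_target("sarsa", 1.0, 0.5, q_next, a_next=0)
expected = td_target("expected_sarsa", 1.0, 0.5, q_next, pi_next=[0.25, 0.75])
q_learning = td_target("q_learning", 1.0, 0.5, q_next)
# With a greedy target policy, Expected Sarsa gives the Q-learning target.
greedy = td_target("expected_sarsa", 1.0, 0.5, q_next, pi_next=[0.0, 1.0])

w_new = semi_gradient_update([0.0, 0.0], [1.0, 0.0], q_learning, alpha=0.5)
```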

# <a id='toc4_'></a>[Exploration under Function Approximation](#toc0_)

- **Optimistic initial values** work well in tabular settings — they encourage the agent to try actions by pretending they’re better than they are.
- With **function approximation**, it's harder: updates affect many states at once, so optimism can fade quickly, even before visiting all states.
- Some methods like **tile coding** allow more localized updates, which helps keep optimism alive longer.
- **Neural networks** often generalize too much, making optimism harder to maintain.
- **Epsilon-greedy** is simple and works with any function approximator, but it’s **random**, not systematic.
- Good exploration with function approximation is still an open challenge — for now, we use simple tools like epsilon-greedy.
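
A toy illustration of why optimism fades under generalization: one update in a visited state also lowers the estimate of a state that was never visited, as long as the two share a feature. The features and numbers below are invented for this sketch, and state values are used instead of action values to keep it short.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def update_toward(w, x, target, alpha):
    """One linear semi-gradient step toward `target` for features x."""
    delta = target - dot(w, x)
    return [w_i + alpha * delta * x_i for w_i, x_i in zip(w, x)]


# Localized (one-hot, tabular-like) features: states do not share weights.
one_hot = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0]}
w_local = [10.0, 10.0, 10.0]               # optimistic: every value starts at 10
w_local = update_toward(w_local, one_hot[0], target=0.0, alpha=0.5)
local_unvisited = dot(w_local, one_hot[1])  # state 1 was never visited

# Overlapping features: states 0 and 1 share the middle feature.
shared = {0: [1.0, 1.0, 0.0], 1: [0.0, 1.0, 1.0]}
w_shared = [5.0, 5.0, 5.0]                 # also gives an initial value of 10
w_shared = update_toward(w_shared, shared[0], target=0.0, alpha=0.5)
shared_unvisited = dot(w_shared, shared[1])
```

In this sketch the unvisited state keeps its optimistic value of 10 with one-hot features, but drops to 5 with the overlapping features after a single update elsewhere.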

$\mathbf{\epsilon}\textbf{-greedy}$
$$
\begin{align}
    A_t = \begin{cases}
        \argmax_a\hat{q}(S_t, a, \mathbf{w}) & \text{with probability } 1-\epsilon \\
        \text{a random action} & \text{with probability } \epsilon
    \end{cases}
\end{align}
$$
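
A minimal sketch of $\epsilon$-greedy action selection over a list of approximate action values. Breaking ties at random among the greedy actions is a design choice made for this sketch, and the `rng` parameter is only there to make the example reproducible.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action,
    otherwise pick a greedy action (ties broken at random)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    greedy_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(greedy_actions)


# Hypothetical action values q_hat(S_t, a, w) for three actions.
q_values = [0.1, 0.9, 0.3]
rng = random.Random(0)
greedy_only = [epsilon_greedy(q_values, 0.0, rng) for _ in range(20)]
explore_only = [epsilon_greedy(q_values, 1.0, rng) for _ in range(20)]
```

The function only needs the action values, so it works the same way whether they come from a table, a linear approximator, or a neural network.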

# <a id='toc5_'></a>[Average Reward](#toc0_)
## <a id='toc5_1_'></a>[Average Reward: A New Way of Formulating Control Problems](#toc0_)

- Avg Reward Formulation = new way of formulating continuing problems 

- **Average reward $r(\pi)$**: Measures long-term performance as reward per time step (no discounting needed).

$\textbf{The Average Reward Objective:}$
$$
\begin{align}
    r(\pi) = 
        \underbrace{\sum_s \mu_\pi(s)}_{\substack{\text{State visitation} \\ \text{(how frequently } \pi \text{ visits } s)}}
        \underbrace{\sum_a \pi(a|s)
            \underbrace{\sum_{s',r} p(s', r | s,a) \, r}_{\substack{\text{Expected immediate} \\ \text{reward given } s,a}}
        }_{\substack{\text{Expected reward in the state} \\ \text{under policy } \pi}}
\end{align}    
$$
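
The objective can be evaluated directly for a small MDP. The sketch below uses a made-up two-state, two-action MDP and estimates $\mu_\pi$ by repeatedly pushing a distribution through the policy-induced transitions. That procedure assumes the chain is ergodic; it is an illustration, not the course's method.

```python
def stationary_distribution(pi, p, num_states, iters=1000):
    """Estimate mu_pi by repeatedly applying the policy-induced transitions."""
    mu = [1.0 / num_states] * num_states
    for _ in range(iters):
        nxt = [0.0] * num_states
        for s in range(num_states):
            for a, prob_a in enumerate(pi[s]):
                for prob, s_next, _r in p[(s, a)]:
                    nxt[s_next] += mu[s] * prob_a * prob
        mu = nxt
    return mu


def average_reward(pi, p, num_states):
    """r(pi) = sum_s mu(s) sum_a pi(a|s) sum_{s',r} p(s',r|s,a) r."""
    mu = stationary_distribution(pi, p, num_states)
    return sum(
        mu[s] * prob_a * prob * r
        for s in range(num_states)
        for a, prob_a in enumerate(pi[s])
        for prob, _s_next, r in p[(s, a)]
    )


# Hypothetical MDP: p[(s, a)] is a list of (probability, next_state, reward).
p = {
    (0, 0): [(1.0, 0, 0.0)],  # stay in state 0, no reward
    (0, 1): [(1.0, 1, 1.0)],  # move to state 1, reward 1
    (1, 0): [(1.0, 1, 2.0)],  # stay in state 1, reward 2
    (1, 1): [(1.0, 0, 0.0)],  # move back to state 0, no reward
}
uniform_policy = [[0.5, 0.5], [0.5, 0.5]]
r_pi = average_reward(uniform_policy, p, num_states=2)
```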
$\textbf{Returns for Average Reward:}$
$$
\begin{align}
    G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \dots
\end{align}    
$$
The sum $G_t$ of the terms $R_{t+i} - r(\pi)$, $i = 1, 2, \dots$, is called the $\textbf{Differential Return}$.
- $\textbf{Differential Returns}$ 
  - $=$ the sum of rewards into the future with the average reward subtracted from each one.
  - Represents how much better it is to take an action in a state than on average under a certain policy. 
  - Can only be used to compare actions if the same policy is followed on subsequent time steps. 
  - To compare policies, their average reward should be used instead.

- **Why it's useful**: Unlike discounting, it treats near and far rewards equally — ideal for continuing tasks.
- **Differential return**: Measures how much more (or less) reward you get compared to the average — used to evaluate actions within a fixed policy.
- **Stacked loops example**: Shows that discount-based policies can prefer different paths than average reward-based ones, depending on γ.
- **Differential Sarsa**: Like regular Sarsa, but subtracts the average reward estimate ($\bar{R}$) during updates.
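
To make the dependence on $\gamma$ concrete, here is a sketch of a two-loop task. The layout is an assumption made for this note, not taken from the notes above: from a start state the agent commits to one of two loops of five steps, where the left loop pays $+1$ on its first step and the right loop pays $+2$ on its last step.

```python
LOOP_LENGTH = 5  # assumed number of steps to get back to the start state


def loop_average_reward(reward_per_loop):
    """Reward per time step when the same loop is repeated forever."""
    return reward_per_loop / LOOP_LENGTH


def discounted_value(reward, delay, gamma):
    """Discounted value at the start state of repeating one loop forever,
    where the loop's single reward arrives after `delay` steps."""
    return reward * gamma ** delay / (1.0 - gamma ** LOOP_LENGTH)


def preferred_loop(gamma):
    left = discounted_value(1.0, 0, gamma)
    right = discounted_value(2.0, LOOP_LENGTH - 1, gamma)
    return "right" if right > left else "left"


left_avg = loop_average_reward(1.0)    # 0.2 per step
right_avg = loop_average_reward(2.0)   # 0.4 per step
choice_low_gamma = preferred_loop(0.5)
choice_high_gamma = preferred_loop(0.9)
```

Under these assumptions the right loop always has the higher average reward, but the discounted agent only prefers it when $2\gamma^4 > 1$, which is roughly $\gamma > 0.84$.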

$\textbf{Value Functions for Average Reward:}$
$$
\begin{align}
    G_t        &= R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \dots \\
    q_\pi(s,a) &= \mathbb{E}_\pi[G_t | S_t=s, A_t=a] \\
    q_\pi(s,a) &= \sum_{s',r}p(s',r | s,a)(r - r(\pi) + \sum_{a'}\pi(a'|s')q_\pi(s',a'))
\end{align}    
$$

The differential value $q_\pi(s,a)$ captures how much more reward the agent will get by starting in state $s$ and taking action $a$ than it would get on average over all states and actions, provided it follows the fixed policy $\pi$ afterwards.

<p align="center">
  <img width="700" height="400" src="imgs/c3m4-differential-sarsa.png">
</p>
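
One step of differential semi-gradient Sarsa can be sketched as follows for a linear $\hat{q}$. The sketch follows the commonly stated form of the algorithm, in which the TD error also drives the average reward estimate. The function name, the second step size `beta`, and the example numbers are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def differential_sarsa_step(w, avg_reward, x, reward, x_next, alpha, beta):
    """One update for linear q_hat(s, a, w) = w^T x(s, a).

    x      : features of (S, A)
    x_next : features of (S', A'), with A' chosen by the behaviour policy
    There is no discount factor; the average reward estimate is subtracted.
    """
    delta = reward - avg_reward + dot(w, x_next) - dot(w, x)
    new_avg_reward = avg_reward + beta * delta
    new_w = [w_i + alpha * delta * x_i for w_i, x_i in zip(w, x)]
    return new_w, new_avg_reward, delta


# Hypothetical transition with two stacked features.
w, avg_reward, delta = differential_sarsa_step(
    w=[0.0, 0.0], avg_reward=0.0,
    x=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0],
    alpha=0.5, beta=0.1,
)
```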

> 📌 Use **average reward** to compare policies and **differential return** to compare actions under a policy.

# Intrinsic Rewards

<p align="center">
  <img width="700" height="400" src="imgs/c3m4-revised-autonomous-agent.png">
</p>

(1) A diagram of the standard reinforcement learning agent, where the reward function is assumed to be part of the environment. The critic is part of the environment, and the agent's task is to optimize some cumulative measure of that reward function.

(2) A second diagram in which we expand the agent into what we might call the organism. Inside the organism is a critic that rewards the agent, and the external environment plays the role of the previous environment, where the extrinsic reward would lie. Here,
- Agent reward is ***internal/intrinsic*** to the agent.
- Parameters to be designed by agent-designer.


# <a id='toc6_'></a>[tldr](#toc0_)

> Topic: *how to do control when using function approximation*

**How to estimate action values with function approximation?**

- If the action space is discrete, it's probably easiest to stack the state features. 
- If the action space is continuous or you want to generalize over actions, the action can be passed as an input like any other state variable.

**Control Algorithms (Function Approximation)**

- Covers: SARSA, Expected SARSA, Q-Learning
- Based on tabular versions from Course 2
- Key difference: use gradients to update weights: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\delta\nabla\hat{q}(s,a,\mathbf{w})$

**Exploration Strategies**

- **Optimistic Initialization**: works with structured features (e.g. tile coding); unreliable with neural nets.
- **$\epsilon$-greedy**: works regardless of the function approximator.

**Average Reward Framework** (=new way to think about the continuing control problem)

- Alternative to discounting:
  - Maximize average reward per time step
- Introduced:
  - Differential Return
  - Differential Value
  - Differential Semi-gradient SARSA

**Remember**

- SARSA, expected SARSA, and Q-Learning are all extensions of the tabular control algorithms
- Differential SARSA is also in the left half of the algorithm map, but unlike the algorithms we covered earlier in the week, it uses the average reward framework. 


---

This week, we talked about how to do control when using function approximation. Let's go over the main ideas.

First, we showed you how to estimate action values with function approximation. If the action space is discrete, it's probably easiest to stack the state features. If the action space is continuous or you want to generalize over actions, the action can be passed as an input like any other state variable.

Let's get some context about the next part of the module by looking at the algorithm map. Function approximation puts us on the left side of the map. The focus of the first lecture was on the control algorithms in the bottom left corner: SARSA, expected SARSA, and Q-Learning. These are all extensions of the tabular control algorithms we covered in Course 2; the only difference between these algorithms and their tabular counterparts is the update equations. The updates are all adapted for function approximation in the same way, using the gradient to update the weights. We also saw how episodic SARSA could be used to solve the mountain car problem. In this case, the agent learned more quickly with the larger step size of 0.5.

Next, we talked about exploration. Optimistic initialization can be used with some structured feature representations like tile coding. But in general, it's not clear how to optimistically initialize values with nonlinear function approximators like neural networks, and it might not behave as expected; for example, the optimism may fade too quickly. Epsilon-greedy can be used regardless of the function approximator.

Finally, we talked about a new way to think about the continuing control problem. Instead of maximizing the discounted return from the current state, we can think about maximizing the average reward that a policy receives over time. We defined differential returns and differential values; these enable the agent to assess the relative value of actions in the average reward setting. We also introduced differential semi-gradient SARSA, which approximates differential values to learn policies. Differential SARSA is also in the left half of the algorithm map, but unlike the algorithms we covered earlier in the week, it uses the average reward framework.

We extended our tabular control algorithms to function approximation, discussed how exploration changes, and introduced a new way to think about the control problem. Next week, we'll talk all about how to do reinforcement learning without learning the value function.