**Table of contents**<a id='toc0_'></a>    
- [Lunar Lander MDP](#toc1_)    
- [Review: Expected SARSA](#toc2_)    
- [Review: Q-Learning](#toc3_)    
- [Review: Average Reward- A New Way of Formulating Control Problems](#toc4_)    
  - [Summary](#toc4_1_)    
- [Review: Actor-Critic Algorithm](#toc5_)    
- [tldr](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Lunar Lander MDP](#toc0_)

$\textbf{How to solve the Lunar Lander MDP? Which algorithm to use?}$

1. Can we represent the value function using only a table? 
   - The agent observes the position, orientation, velocity and contact sensors of the lunar module. 
   - Six of the eight state variables are continuous, which means that we cannot represent them with a table. 
   - And in any case we'd like to take advantage of generalization to learn faster.
2. Would this be well formulated as an average reward problem?
   - Think about the dynamics of this problem. 
   - The lunar module starts in low orbit and descends until it comes to rest on the surface of the moon. 
   - This process then repeats with each new attempt at landing beginning independently of how the previous one ended. 
   - This is exactly our definition of an episodic task.
   - We use the average reward formulation for continuing tasks, so that is not the best choice here.
3. Is it possible and beneficial to update the policy and value function on every time step? MC or TD?
   - Think about landing your module on the moon. 
   - If any of our sensors becomes damaged during the episode, we want to be able to update the policy before the end of the episode. 
   - As in the driving home example, we expect the TD method to do better in this kind of problem.
4. Objective? 
   - We want to learn a safe and a robust policy in our simulator so that we can use it on the moon. We want to learn a policy that maximizes reward, and so this is a control task.

This leaves us with three algorithms, SARSA, expected SARSA and Q-learning. 

- Since we are using function approximation, learning an epsilon-soft policy will be more robust than learning a deterministic policy. 
- Remember the example where, due to state aliasing, a deterministic policy was suboptimal. 
- Expected SARSA and SARSA both allow us to learn an optimal epsilon-soft policy, but Q-learning does not. Now we need to choose between expected SARSA and SARSA. 
- We mentioned in an earlier video that expected SARSA usually performs better than SARSA. So, let's eliminate SARSA.

> ${\textcolor{red}{\textbf{Expected SARSA}}}$

# <a id='toc2_'></a>[Review: Expected SARSA](#toc0_)

<p align="center">
  <img width="700" height="400" src="imgs/c4m3-sarsa.png">
</p>

Recall the Bellman equation for action values. Here you can see the expectation over the values of possible next state-action pairs. Breaking this expectation apart, we see a sum over possible next states as well as possible next actions. Sarsa estimates this expectation by sampling the next state from the environment and the next action from its policy. But the agent already knows this policy, so why should it have to sample its next action? Instead, it should just compute the expectation directly. In this case, we can take a weighted sum of the values of all possible next actions. The weights are the probabilities of taking each action under the agent's policy.

> Explicitly computing the expectation over next actions is the **main idea** behind the ${\textcolor{red}{\textbf{expected SARSA algorithm}}}$.

The algorithm is nearly identical to Sarsa, except the TD-error uses the *expected estimate* of the next action value instead of a sample of the next action value.

- That means that on every time step, the agent has to average the next state's action values according to how likely they are under the policy.

<p align="center">
  <img width="700" height="400" src="imgs/c4m3-expected-sarsa.png">
</p>

- In general, expected Sarsa's update targets have much lower variance than Sarsa's.

The lower variance comes with a downside though. Computing the average over next actions becomes more expensive as the number of actions increases. When there are many actions, computing the average might take a long time, especially since the average has to be computed every time step.
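
The expected Sarsa target described above can be sketched as follows. This is a minimal illustration, not the course's implementation: it assumes a tabular action-value array `q[state, action]` and an epsilon-greedy policy, and the function names (`epsilon_greedy_probs`, `expected_sarsa_target`, `expected_sarsa_update`) are hypothetical. With function approximation, `q[s_next]` would be replaced by the approximator's action values for the next state.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (assumed policy form).

    Every action gets epsilon / n_actions; the remaining (1 - epsilon)
    is shared equally among the greedy actions.
    """
    q_values = np.asarray(q_values, dtype=float)
    n_actions = q_values.size
    probs = np.full(n_actions, epsilon / n_actions)
    greedy = np.flatnonzero(q_values == q_values.max())
    probs[greedy] += (1.0 - epsilon) / greedy.size
    return probs

def expected_sarsa_target(reward, next_q_values, gamma, epsilon, terminal=False):
    """R + gamma * sum_a' pi(a'|S') * q(S', a'): the expectation is computed, not sampled."""
    if terminal:
        return reward
    probs = epsilon_greedy_probs(next_q_values, epsilon)
    return reward + gamma * float(np.dot(probs, next_q_values))

def expected_sarsa_update(q, s, a, reward, s_next, alpha, gamma, epsilon, terminal=False):
    """One tabular expected Sarsa update of q[s, a], done in place."""
    target = expected_sarsa_target(reward, q[s_next], gamma, epsilon, terminal)
    q[s, a] += alpha * (target - q[s, a])
    return q
```

Note that the dot product runs over every action on every time step, which is where the extra cost mentioned above comes from.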

# <a id='toc3_'></a>[Review: Q-Learning](#toc0_)

<p align="center">
  <img width="700" height="400" src="imgs/c4m3-q-learning-algo.png">
</p>

Q-learning uses the maximum action value in the next state in its target. This differs from Sarsa, which uses the value of the next state-action pair in its target.

<p align="center">
  <img width="700" height="400" src="imgs/c4m3-revisiting-bellman-equations.png">
</p>

If you look at the update equation for Sarsa, it is suspiciously similar to the Bellman equation for action values. In fact, Sarsa is a sample-based algorithm for solving the Bellman equation for action values. Q-learning also solves a Bellman equation using samples from the environment, but instead of the standard Bellman equation it uses the Bellman optimality equation for action values. The optimality equation enables Q-learning to learn $q_*$ directly instead of switching between policy improvement and policy evaluation steps.

Even though Sarsa and Q-learning are both based on Bellman equations, they are based on very different ones. Sarsa is a sample-based version of policy iteration, which uses Bellman equations for action values that each depend on a fixed policy. Q-learning is a sample-based version of value iteration, which iteratively applies the Bellman optimality equation. Applying the Bellman optimality equation strictly improves the value function unless it is already optimal, so value iteration continually improves its value function estimate, which eventually converges to the optimal solution. For the same reason, Q-learning also converges to the optimal value function as long as the agent continues to explore and samples all areas of the state-action space.

$$
\begin{align*}
    \text{SARSA}      \sim \text{Policy Iteration} \\ 
    \text{Q-Learning} \sim \text{Value Iteration}
\end{align*}
$$
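
To make the contrast concrete, here is a small sketch of the two update targets side by side. The function names are hypothetical, and `next_q_values` is assumed to be an array of action-value estimates for the next state.

```python
import numpy as np

def sarsa_target(reward, next_q_values, next_action, gamma):
    """Sample of the Bellman equation for q_pi: uses the action actually taken next."""
    return reward + gamma * next_q_values[next_action]

def q_learning_target(reward, next_q_values, gamma):
    """Sample of the Bellman optimality equation: uses the max over next actions."""
    return reward + gamma * np.max(next_q_values)

# Same transition, two different targets.
next_q = np.array([0.5, 2.0, 1.0])
sarsa_t = sarsa_target(1.0, next_q, next_action=2, gamma=0.9)
q_learning_t = q_learning_target(1.0, next_q, gamma=0.9)
```

The Sarsa target depends on which action the behavior policy happened to pick, while the Q-learning target does not, which is why Q-learning can learn about the greedy policy while following an exploratory one.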

# <a id='toc4_'></a>[Review: Average Reward- A New Way of Formulating Control Problems](#toc0_)

- ${\textcolor{red}{\textbf{Average Reward Formulation}}}=$ new way of formulating continuing problems
- If the agent's goal is to maximize this average reward, then it cares equally about nearby and distant rewards.

<p align="center">
  <img width="600" height="400" src="imgs/c4m3-avg-reward-objective.png">
</p>

$$
\begin{align}
    r(\pi) = \underbrace{
                \sum_{s}\mu(s) 
                \underbrace{
                    \sum_{a}\pi(a | s,\theta) 
                        \underbrace{
                            \sum_{s',r}p(s',r | s,a)r
                        }_{\mathbb{E}_{\pi}[R_t | S_t=s, A_t=a]}
                }_{\mathbb{E}_{\pi}[R_t | S_t=s]}
             }_{\mathbb{E}_{\pi}[R_t]}
\end{align}
$$

- The inner terms give the expected reward in a state under policy $\pi$. 
- The outer sum takes the expectation over how frequently the policy is in each state, given by $\mu(s)$. 
- Together, we get the expected reward across states. In other words, the average reward for a policy.

> We can see the average reward puts preference on the policy that receives more reward in total without having to consider larger and larger discounts.
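
As a worked example of the formula for $r(\pi)$, the sketch below evaluates it on a made-up two-state, two-action MDP. Everything here is an assumption for illustration: the arrays `p`, `r`, and `pi`, the helper names, and the use of a least-squares solve to find the stationary distribution $\mu$ (which presumes the chain induced by the policy has a unique stationary distribution).

```python
import numpy as np

def stationary_distribution(p_pi):
    """Solve mu @ p_pi = mu with sum(mu) = 1 for a state-to-state matrix p_pi."""
    n = p_pi.shape[0]
    a = np.vstack([p_pi.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu = np.linalg.lstsq(a, b, rcond=None)[0]
    return mu

def average_reward(p, r, pi):
    """r(pi) = sum_s mu(s) sum_a pi(a|s) * E[R | s, a].

    p[s, a, s'] : transition probabilities
    r[s, a]     : expected immediate reward (the innermost sum, precomputed)
    pi[s, a]    : policy probabilities
    """
    p_pi = np.einsum("sa,sat->st", pi, p)    # state transitions under pi
    mu = stationary_distribution(p_pi)       # how often pi visits each state
    reward_per_state = (pi * r).sum(axis=1)  # E_pi[R | S = s]
    return float(mu @ reward_per_state), mu

# Hypothetical toy MDP with a uniform random policy.
p = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
r_pi, mu = average_reward(p, r, pi)
```

For this toy problem the policy spends one third of its time in the first state and two thirds in the second, so $r(\pi) = \tfrac{1}{3}\cdot 0.5 + \tfrac{2}{3}\cdot 1.0 = \tfrac{5}{6}$.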

$$
\begin{align}
    \text{Episodic:}   \qquad &G_t = \sum_{k=t+1}^{T} R_k\\
    \text{Continuing:} \qquad &G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \qquad \underbrace{G_t = \sum_{k=0}^{\infty}\Big(\underbrace{\overbrace{R_{t+k+1}}^{\text{Immediate}} - \overbrace{r(\pi)}^{\text{Average}}}_{{\textcolor{red}{\textbf{Differential Return}}}}\Big)}_{\text{Average Reward Formulation}}
\end{align}
$$

In the average reward setting, returns are defined in terms of differences between rewards and the average reward $r(\pi)$. This is called the ${\textcolor{red}{\textbf{differential return}}}$.

<p align="center">
  <img width="700" height="400" src="imgs/c4m3-differential-sarsa.png">
</p>

## <a id='toc4_1_'></a>[Summary](#toc0_)

Discounting vs. Average Reward
- Discounting emphasizes near-term rewards.
- Average reward treats all time steps equally.
- In the “nearsighted MDP”:
  - Small $\gamma$: prefer immediate rewards
  - Large $\gamma$: prefer delayed, higher rewards
- Average reward eliminates the need to tune $\gamma$

Differential Return
- Measures how much better a trajectory is than the average
- Converges **only if** the correct $\bar{r}$ is subtracted

Differential SARSA
- Similar to standard SARSA
- Tracks average reward $\bar{r}$
- Uses $R_{t+1} - \bar{r}$ in its TD update
- Improved performance with variance-reducing average reward update
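
A single differential Sarsa step might look like the sketch below. It is a tabular simplification under stated assumptions: `q` is a NumPy array indexed as `q[state, action]`, `alpha` and `beta` are hypothetical names for the value and average-reward step sizes, and the average-reward estimate is updated with the TD error, which is the lower-variance choice referred to in the last bullet.

```python
import numpy as np

def differential_sarsa_update(q, avg_reward, s, a, reward, s_next, a_next, alpha, beta):
    """One tabular differential Sarsa step; returns (q, avg_reward, delta).

    No discount factor appears: the reward is compared against the
    running estimate of the average reward instead.
    """
    delta = reward - avg_reward + q[s_next, a_next] - q[s, a]
    avg_reward = avg_reward + beta * delta  # track r-bar using the TD error
    q[s, a] += alpha * delta                # in-place value update
    return q, avg_reward, delta

# Usage on a hypothetical 2-state, 2-action problem.
q = np.zeros((2, 2))
q, avg_reward, delta = differential_sarsa_update(
    q, avg_reward=0.0, s=0, a=1, reward=1.0, s_next=1, a_next=0, alpha=0.5, beta=0.1
)
```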


# <a id='toc5_'></a>[Review: Actor-Critic Algorithm](#toc0_)

In this setup, the parameterized policy plays the role of an actor, while the value function plays the role of a critic, evaluating the actions selected by the actor. These so-called actor-critic methods were some of the earliest TD-based methods introduced in RL.

<p align="center">
  <img width="600" height="400" src="imgs/c4m3-actor-critic.png">
</p>
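
The interaction between actor and critic can be sketched as a single update step. This is only one possible instantiation, and several choices are assumptions not stated in these notes: a tabular softmax actor with preferences `theta[state, action]`, a tabular critic `v[state]`, the average reward formulation for the TD error, and the step-size names `alpha_theta`, `alpha_v`, and `alpha_r`.

```python
import numpy as np

def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    z = prefs - prefs.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, v, avg_reward, s, a, reward, s_next,
                      alpha_theta, alpha_v, alpha_r):
    """One average-reward actor-critic update (tabular sketch).

    The critic's TD error tells the actor whether the chosen action
    turned out better or worse than expected.
    """
    delta = reward - avg_reward + v[s_next] - v[s]  # critic's evaluation
    avg_reward = avg_reward + alpha_r * delta       # update average-reward estimate
    v[s] += alpha_v * delta                         # critic update
    pi = softmax(theta[s])
    grad_log_pi = -pi                               # gradient of ln pi(a|s) w.r.t. theta[s]
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log_pi   # actor update
    return theta, v, avg_reward, delta

# Usage on a hypothetical 2-state, 2-action problem.
theta = np.zeros((2, 2))
v = np.zeros(2)
theta, v, avg_reward, delta = actor_critic_step(
    theta, v, 0.0, s=0, a=1, reward=1.0, s_next=1,
    alpha_theta=0.5, alpha_v=0.5, alpha_r=0.1
)
```

A positive TD error raises the preference for the action that was taken and lowers the others; a negative one does the opposite.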

# Thinking About RL Problems

$\textbf{State the goal!}$

1. Is the goal to solve a (type of a) problem?
2. Is it to develop a tool? For what? 

$\textbf{Specify the problem!}$

1. How is data collected? 
2. How will algorithms be evaluated? 
3. Are there specific properties of the environments? 

$\textbf{Data sources}$

1. Interact with the real world?
2. Interact with simulator? 
3. No interaction: Learn from a batch of data.

## Algorithm Evaluation

$\textbf{Efficient learning?}$

1. How well does it do at the end? 
2. How well does it do during learning?

$\textbf{Performance metrics}$

1. Total (discounted?) reward
2. Risk-sensitivity
3. Average vs. worst-case
4. Knowledge acquired, skills

## Aspects of environments

1. #states, #actions
2. Bounded rewards? Random rewards?
3. Deterministic transitions?
4. Factored state? Factored actions? 
5. Special dynamics (e.g. linear)?
6. Observations: How much do they reveal of the state of the environment? 
7. How much memory is needed to solve the task?
8. Factored observations?
9. Diameter of environment?
10. ...

# <a id='toc6_'></a>[tldr](#toc0_)

- The ${\textcolor{red}{\textbf{expected SARSA}}}$ algorithm explicitly computes the **expectation under its policy**, which is more expensive than sampling but has **lower variance**.