**Table of contents**<a id='toc0_'></a>    
- [Temporal Difference Learning Methods for Control](#toc1_)    
  - [TD for Control](#toc1_1_)    
    - [SARSA GPI with TD](#toc1_1_1_)    
  - [Off-Policy TD Control: Q-Learning](#toc1_2_)    
  - [Expected SARSA](#toc1_3_)    
- [Summary](#toc2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

> This notebook contains notes and summaries for week 3️⃣ from course 2️⃣ from the `Reinforcement Learning Specialization` from Coursera and the University of Alberta.

# <a id='toc1_'></a>[Temporal Difference Learning Methods for Control](#toc0_)

**Learning Objectives**

- Explain how generalized policy iteration can be used with TD to find improved policies
- Describe the Sarsa Control algorithm
- Understand how the Sarsa control algorithm operates in an example MDP
- Analyze the performance of a learning algorithm
- Describe the Q-learning algorithm
- Explain the relationship between Q-learning and the Bellman optimality equations.
- Apply Q-learning to an MDP to find the optimal policy
- Understand how Q-learning performs in an example MDP
- Understand the differences between Q-learning and Sarsa
- Understand how Q-learning can be off-policy without using importance sampling
- Describe how the on-policy nature of SARSA and the off-policy nature of Q-learning affect their relative performance
- Describe the Expected Sarsa algorithm
- Describe Expected Sarsa’s behaviour in an example MDP
- Understand how Expected Sarsa compares to Sarsa control
- Understand how Expected Sarsa can do off-policy learning without using importance sampling
- Explain how Expected Sarsa generalizes Q-learning

## <a id='toc1_1_'></a>[TD for Control](#toc0_)

Earlier, we talked about using generalized policy iteration (GPI) to find an optimal policy. We've also talked about using TD to estimate value functions. What would it look like if we used TD for the policy evaluation step in generalized policy iteration?

### <a id='toc1_1_1_'></a>[SARSA GPI with TD](#toc0_)

Sarsa makes predictions about the values of state-action pairs. The agent chooses an action in the initial state to create the first state-action pair. Next, it takes that action in the current state and observes the reward $R_{t+1}$ and the next state $S_{t+1}$. In Sarsa, the agent needs to know its next state-action pair before updating its value estimates. That means it has to commit to its next action $A_{t+1}$ before the update. Since our agent is learning action values for a specific policy, it uses that policy to sample the next action.


<p align="center">
  <img width="700" height="350" src="imgs/sarsa-illustrated.png">
</p>

> SARSA is an **action-value** form of TD which combines these ideas
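
The loop described above can be sketched in a few lines of Python. This is a minimal sketch, not the course's reference implementation: the helper names `epsilon_greedy` and `sarsa_update`, the dictionary-based `Q` table, and the states, rewards, and hyperparameters are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Sample an action from an epsilon-greedy policy over Q[(state, .)]."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """One Sarsa step: move Q(s, a) toward r + gamma * Q(s_next, a_next)."""
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Hypothetical two-state example with made-up values.
rng = random.Random(0)
actions = ["left", "right"]
Q = defaultdict(float)          # unseen pairs default to 0.0
Q[("s1", "right")] = 2.0

# The agent commits to its next action *before* updating.
a_next = epsilon_greedy(Q, "s1", actions, epsilon=0.0, rng=rng)
new_value = sarsa_update(Q, "s0", "right", r=1.0, s_next="s1",
                         a_next=a_next, alpha=0.5, gamma=0.9)
```

With `epsilon=0.0` the policy is purely greedy, so the example is deterministic. In practice a small positive `epsilon` keeps the agent exploring.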

## <a id='toc1_2_'></a>[Off-Policy TD Control: Q-Learning](#toc0_)

The new element in Q-learning is the action-value update. Here, the target is the reward $R_{t+1}$ plus $\gamma$ times the maximum action value in the following state, $R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$. This differs from Sarsa, which uses the value of the next state-action pair, $Q(S_{t+1}, A_{t+1})$, in its target.

<p align="center">
  <img width="700" height="350" src="imgs/q-learning-algo.png">
</p>
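
As a rough illustration of that update, here is a minimal Python sketch. The function name `q_learning_update`, the dictionary-based `Q` table, and the example states and numbers are hypothetical and chosen only to show the arithmetic.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """One Q-learning step: bootstrap off the largest action value in s_next."""
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Hypothetical values for the next state's actions.
actions = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "left")] = -1.0
Q[("s1", "right")] = 2.0

q_new = q_learning_update(Q, "s0", "right", r=1.0, s_next="s1",
                          actions=actions, alpha=0.5, gamma=0.9)
```

Note that the update never asks which action the agent will actually take in `s1`; it only needs the maximum over the next state's action values.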

> In fact, $\textsf{Sarsa}$ is a sample-based algorithm to solve the Bellman equation for action values. $\textsf{Q-learning}$ also solves a Bellman equation using samples from the environment. But instead of using the standard Bellman equation, Q-learning uses the Bellman optimality equation for action values.

<p align="center">
  <img width="700" height="350" src="imgs/q-learning-and-bellman.png">
</p>

The optimality equations enable Q-learning to directly learn $q_*$ instead of switching between policy improvement and policy evaluation steps.

> Even though Sarsa and Q-learning are both based on Bellman equations, they're based on very different Bellman equations. Sarsa is a sample-based version of policy iteration, which uses Bellman equations for action values that each depend on a fixed policy. Q-learning is a sample-based version of value iteration, which iteratively applies the Bellman optimality equation.
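
For reference, the two equations can be written side by side. This uses the usual textbook notation, with $p(s', r \mid s, a)$ for the environment dynamics; that symbol is an assumption here, since these notes do not define it.

The Bellman equation for action values under a fixed policy $\pi$, which Sarsa samples:

$$
q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \Big]
$$

The Bellman optimality equation for action values, which Q-learning samples:

$$
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \max_{a'} q_*(s', a') \Big]
$$

The corresponding sample-based updates differ only in the bootstrap term:

$$
\begin{aligned}
\textsf{Sarsa:} \quad & Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big] \\
\textsf{Q-learning:} \quad & Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \big]
\end{aligned}
$$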

Applying the Bellman optimality equation strictly improves the value function, unless it is already optimal.

- $\textsf{Sarsa}$ $\sim$ Policy Iteration
- $\textsf{Q-learning}$ $\sim$ Value Iteration
  - = off-policy algo
  - but without using importance sampling

Recall that an agent estimates its value function according to expected returns under its target policy, while it actually behaves according to its behavior policy. When the target policy and behavior policy are the same, the agent is learning **on-policy**; otherwise, the agent is learning **off-policy**.

- In $\textsf{Sarsa}$, the agent bootstraps off of the value of the action it's going to take next, which is sampled from its behavior policy. 
- $\textsf{Q-learning}$ instead, bootstraps off of the largest action value in its next state. This is like sampling an action under an estimate of the optimal policy rather than the behavior policy. 

> Since Q-learning learns about the best action it could possibly take rather than the actions it actually takes, it is learning off-policy.

Whenever you see a reinforcement learning algorithm, a natural question to ask is, **"What are the target and behavior policies?"**

But if Q-learning learns off-policy, why don't we see any importance sampling ratios?

- It is because the agent is estimating action values with a known policy. It does not need importance sampling ratios to correct for the difference in action selection. The action-value function represents the returns following each action in a given state. The agent's target policy represents the probability of taking each action in a given state. Putting these two elements together, the agent can calculate the expected return under its target policy from any given state, in particular, the next state, $S_{t+1}$.
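
One way to make this concrete, written as a sketch in standard notation with $\pi$ denoting the target policy: the quantity the agent can compute directly from its action values is

$$
\sum_{a'} \pi(a' \mid S_{t+1}) \, Q(S_{t+1}, a')
$$

When $\pi$ is greedy with respect to $Q$, all of the probability sits on a maximizing action, so the sum collapses to

$$
\max_{a'} Q(S_{t+1}, a')
$$

which is exactly the bootstrap term in the Q-learning target. No action sampled from the behavior policy appears in either expression, so there is nothing for an importance sampling ratio to reweight.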


<p align="center">
  <img width="700" height="350" src="imgs/sarsa-vs-q-learning-cliff-env.png">
</p>

## <a id='toc1_3_'></a>[Expected SARSA](#toc0_)

<p align="center">
  <img width="700" height="350" src="imgs/bellmann-equation-for-action-values.png">
</p>

> Explicitly computing the expectation over next actions is the main idea behind the expected Sarsa algorithm.

The algorithm is nearly identical to Sarsa, except the TD error uses the expected estimate of the next action value instead of a sample of the next action value.

> That means that on every time step, the agent has to average the next state's action values according to how likely they are under the policy. 

<p align="center">
  <img width="700" height="350" src="imgs/expected-sarsa-algo.png">
</p>
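
The averaging step can be sketched as follows. This is a minimal sketch assuming an epsilon-greedy policy; the helper names `epsilon_greedy_probs` and `expected_sarsa_update`, the dictionary-based `Q` table, and all numbers are hypothetical.

```python
from collections import defaultdict

def epsilon_greedy_probs(Q, state, actions, epsilon):
    """Return {action: probability} for an epsilon-greedy policy.

    Ties are broken in favor of the first best action in `actions`.
    """
    greedy = max(actions, key=lambda a: Q[(state, a)])
    probs = {a: epsilon / len(actions) for a in actions}
    probs[greedy] += 1.0 - epsilon
    return probs

def expected_sarsa_update(Q, s, a, r, s_next, actions, alpha, gamma,
                          epsilon, terminal=False):
    """One Expected Sarsa step: bootstrap off the policy-weighted average."""
    if terminal:
        expected_next = 0.0
    else:
        probs = epsilon_greedy_probs(Q, s_next, actions, epsilon)
        expected_next = sum(probs[b] * Q[(s_next, b)] for b in actions)
    target = r + gamma * expected_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Hypothetical values for the next state's actions.
actions = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "left")] = -1.0
Q[("s1", "right")] = 2.0

es_new = expected_sarsa_update(Q, "s0", "right", r=1.0, s_next="s1",
                               actions=actions, alpha=0.5, gamma=0.9,
                               epsilon=0.2)
```

Here the policy puts probability 0.9 on `"right"` and 0.1 on `"left"`, so the bootstrap term is the weighted average 1.7 rather than either individual action value.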

> In general, Expected Sarsa's update targets have much lower variance than Sarsa's.

The lower variance comes with a downside, though. Computing the average over next actions becomes more expensive as the number of actions increases. When there are many actions, computing the average might take a long time, especially since the average has to be computed on every time step. In short, Expected Sarsa explicitly computes the expectation under its policy, which is more expensive than sampling but has lower variance.
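
The variance difference can be seen in a small simulation. This is an illustrative sketch only: the next-state action values `q_next`, the policy probabilities, and the fixed reward are made-up numbers, and the reward and next state are held fixed so that the only randomness left is the choice of next action.

```python
import random
import statistics

rng = random.Random(0)                      # seeded for repeatability
gamma = 0.9
reward = 1.0
q_next = {"left": -1.0, "right": 2.0}       # hypothetical Q(s', .)
policy = {"left": 0.1, "right": 0.9}        # hypothetical pi(. | s')

def sample_action(policy, rng):
    """Draw one action according to the policy's probabilities."""
    return rng.choices(list(policy), weights=list(policy.values()), k=1)[0]

# Sarsa: each target depends on a freshly sampled next action.
sarsa_targets = [reward + gamma * q_next[sample_action(policy, rng)]
                 for _ in range(10_000)]

# Expected Sarsa: the target averages over next actions, so it is the
# same number every time for this fixed reward and next state.
expected_target = reward + gamma * sum(policy[a] * q_next[a] for a in policy)

sarsa_mean = statistics.mean(sarsa_targets)
sarsa_var = statistics.pvariance(sarsa_targets)
```

The Sarsa targets average out to roughly `expected_target` but scatter around it, while the Expected Sarsa target has no spread at all in this toy setting. In a real environment, random rewards and transitions would still add variance to both methods.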

- $\textsf{Expected Sarsa}$ is **more stable and learns faster** than Sarsa, especially with large step sizes, due to averaging over policy randomness.
- $\textsf{Sarsa}$ is **more sensitive to $\alpha$** and may fail to converge at higher values.

Expected Sarsa generalizes Q-learning. 

<p align="center">
  <img width="700" height="350" src="imgs/off-policy-expected-sarsa.png">
</p>

- $\textsf{Expected Sarsa}$ uses the same technique as **Q-learning** to learn **off-policy** without **importance sampling**
- $\textsf{Expected Sarsa}$ with a **target policy** that's **greedy** with respect to its action-values is exactly **Q-learning**
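
The second point can be checked numerically. The sketch below uses hypothetical action values and hypothetical helper names (`expected_value`, `greedy_probs`); it only illustrates that an expectation under a greedy target policy reduces to a max.

```python
def expected_value(q_next, target_probs):
    """Average the next state's action values under the target policy."""
    return sum(target_probs[a] * q_next[a] for a in q_next)

def greedy_probs(q_next):
    """Greedy target policy: probability 1 on a maximizing action."""
    best = max(q_next, key=q_next.get)
    return {a: 1.0 if a == best else 0.0 for a in q_next}

# Hypothetical Q(s', .) values, reward, and discount.
q_next = {"left": -1.0, "right": 2.0, "up": 0.5}
reward, gamma = 1.0, 0.9

expected_sarsa_target = reward + gamma * expected_value(q_next, greedy_probs(q_next))
q_learning_target = reward + gamma * max(q_next.values())
```

Both targets come out the same, regardless of which behavior policy generated the transition. Swapping `greedy_probs` for any other target policy gives the general off-policy Expected Sarsa target.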

# <a id='toc2_'></a>[Summary](#toc0_)

<p align="center">
  <img width="700" height="350" src="imgs/summary-td-control-and-bellman-equations.png">
</p>

- Sarsa, Q-learning, and Expected Sarsa are TD control methods based on Bellman equations.
- Expected Sarsa outperforms Sarsa and Q-learning in online performance by reducing variance through averaging over next actions.