**Table of contents**<a id='toc0_'></a>    
- [Policy Gradient](#toc1_)    
- [Learning Parameterized Policies](#toc2_)    
  - [Advantages of Policy Parameterization](#toc2_1_)    
- [Policy Gradient for Continuing Tasks](#toc3_)    
  - [The Objective for Learning Policies](#toc3_1_)    
  - [The Policy Gradient Theorem](#toc3_2_)    
  - [Estimating the Policy Gradient](#toc3_3_)    
- [Actor-Critic for Continuing Tasks](#toc4_)    
  - [Actor-Critic Algorithm](#toc4_1_)    
- [Policy Parameterizations](#toc5_)    
  - [Actor-Critic with Softmax Policies](#toc5_1_)    
  - [Gaussian Policies for Continuous Actions](#toc5_2_)    
- [tldr](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Policy Gradient](#toc0_)

$\mathbf{OBJECTIVES}$

**Lesson 1: Learning Parameterized Policies**

- Understand how to define policies as parameterized functions 
- Define one class of parameterized policies based on the softmax function 
- Understand the advantages of using parameterized policies over action-value based methods 

**Lesson 2: Policy Gradient for Continuing Tasks**

- Describe the objective for policy gradient algorithms 
- Describe the results of the policy gradient theorem 
- Understand the importance of the policy gradient theorem 

**Lesson 3: Actor-Critic for Continuing Tasks**

- Derive a sample-based estimate for the gradient of the average reward objective 
- Describe the actor-critic algorithm for control with function approximation, for continuing tasks 

**Lesson 4: Policy Parameterizations**

- Derive the actor-critic update for a softmax policy with linear action preferences 
- Implement this algorithm 
- Design concrete function approximators for an average reward actor-critic algorithm 
- Analyze the performance of an average reward agent 
- Derive the actor-critic update for a gaussian policy 
- Apply average reward actor-critic with a gaussian policy to a particular task with continuous actions

# <a id='toc2_'></a>[Learning Parameterized Policies](#toc0_)

- Understand how to define policies as parameterized functions
- Define one class of parameterized policies based on the softmax function

We're not actually going to specify policies by hand. Rather, we will learn them. We can use the language of function approximation to both represent and learn policies directly.

$$
\begin{align}
    \text{Learning value functions:} \qquad
    \underbrace{s,a}_{\text{Input}} \rightarrow \underbrace{\mathbf{w}}_{\text{Parameters}} \rightarrow \underbrace{\hat{q}(s,a,\mathbf{w})}_{\text{Output}} 
    \\
    \text{Learning policies:} \qquad \underbrace{s,a}_{\text{Input}} \rightarrow \underbrace{\mathbf{\theta}}_{\text{Parameters}} \rightarrow \underbrace{\pi(a | s, \theta)}_{\text{Output}}
\end{align}
$$

The parameterized function has to generate a valid policy. This means it has to generate a valid probability distribution over actions for every state.

**Constraints on the Policy Parameterization**:
$$
\begin{align}
    \pi(a | s, \theta) \geq 0 \qquad &\text{for all} \; a \in \mathcal{A} \; \text{and} \; s \in \mathcal{S} \\ 
    \sum_{a \in \mathcal{A}} \pi(a | s, \theta) = 1 \qquad &\text{for all} \; s \in \mathcal{S} \\ 
\end{align}
$$

So far, all the methods we've looked at for learning good policies estimate action values. Every control algorithm we studied was built on the framework of generalized policy iteration. In this module, we'll explore a new class of methods where the policies are parameterized directly. 

$\textbf{The Softmax Policy Parameterization:}$

$$
    \pi(a | s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_{b\in\mathcal{A}}e^{h(s,b,\theta)}}
$$

- $h=$ action preference: a higher preference for a particular action in a state means that the action is more likely to be selected.
  - Preferences indicate how much the agent prefers each action, but they are not summaries of future reward.

This is a simple yet effective way to satisfy the two constraints above: the exponential makes every probability positive, and the normalizing sum makes the probabilities add up to one.

- The policy can start off stochastic to guarantee exploration. 
- Then as learning progresses, the policy can naturally converge towards a deterministic greedy policy.
- A softmax policy can adequately approximate a deterministic policy by making one action preference very large.
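
The softmax parameterization can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `softmax_policy` and the preference values are assumptions made up for the example.

```python
import numpy as np

def softmax_policy(preferences):
    """Turn a vector of action preferences h(s, ., theta) into probabilities."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    shifted = preferences - np.max(preferences)
    exp_prefs = np.exp(shifted)
    return exp_prefs / np.sum(exp_prefs)

# Hypothetical preferences for three actions in a single state.
h = np.array([1.0, 2.0, 0.5])
probs = softmax_policy(h)

# Making one preference very large pushes the policy toward a deterministic one.
near_greedy = softmax_policy(np.array([1.0, 50.0, 0.5]))

print(probs, probs.sum())
print(near_greedy)
```

Every entry of `probs` is positive and the entries sum to one, so both constraints hold for any choice of preferences.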

## <a id='toc2_1_'></a>[Advantages of Policy Parameterization](#toc0_)

We learned that it's possible to learn parameterized policies directly. Now we compare learning approximate values with learning approximate policies. Advantages of parameterized stochastic policies:
1. They can autonomously decrease exploration over time. (Agent can make its policy more greedy over time autonomously.)
2. They can avoid failures due to deterministic policies with limited function approximation.
3. Sometimes the policy is less complicated than the value function.

In general:
- We want our agents to be autonomous. We don't want them to rely on us to decide when exploration is done. We can avoid this issue with parameterized policies.
- In the tabular settings there is always a deterministic optimal policy.

# <a id='toc3_'></a>[Policy Gradient for Continuing Tasks](#toc0_)

## <a id='toc3_1_'></a>[The Objective for Learning Policies](#toc0_)

Now that we've introduced the idea of parameterizing policies directly, we're ready to talk about how we can learn to improve a parameterized policy.

> Just like with action value-based methods, the basic idea will be to specify an objective, and then figure out how to *estimate the gradient of that objective from an agent's experience*.

The Goal of RL: ***Maximize Reward in the Long Run***. It turns out that when we parameterize our policy directly, we can also use this goal directly as the learning objective. But what does it mean to obtain as much reward as possible in the long run?

$\textbf{Formalizing the Goals as an Objective:}$

$$
\begin{align}
    \text{Episodic:}   \qquad &G_t = \sum_{k=t+1}^{T} R_k\\
    \text{Continuing:} \qquad &G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \qquad \underbrace{G_t = \sum_{k=0}^{\infty}\Big(\overbrace{R_{t+k+1}}^{\text{Immediate}} - \overbrace{r(\pi)}^{\text{Average}}\Big) }_{\text{Average Reward Formulation}}
\end{align}
$$

- For the episodic case, we can use the undiscounted return, which is the sum of rewards over a whole episode.
- For the continuing case, we introduced the discounted return which places more emphasis on immediate reward in order to keep the sum finite.

Now, our aim is to learn a policy that directly optimizes average reward.
$$
\begin{align}
    r(\pi) = \underbrace{
                \sum_{s}\mu(s) 
                \underbrace{
                    \sum_{a}\pi(a | s,\theta) 
                        \underbrace{
                            \sum_{s',r}p(s',r | s,a)r
                        }_{\mathbb{E}_{\pi}[R_t | S_t=s, A_t=a]}
                }_{\mathbb{E}_{\pi}[R_t | S_t=s]}
             }_{\mathbb{E}_{\pi}[R_t]}
\end{align}
$$
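
To make the nested sums concrete, here is a small sketch that evaluates $r(\pi)$ for a made-up two-state, two-action MDP. The transition array `P`, the expected-reward table `R` (standing in for $\sum_{s',r}p(s',r | s,a)r$), and the uniform policy `pi` are all assumptions chosen for illustration. The stationary distribution $\mu$ is obtained by solving $\mu = \mu P_\pi$ together with $\sum_s \mu(s) = 1$, which assumes the induced chain is ergodic.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Solve mu = mu @ P_pi with sum(mu) = 1 (assumes an ergodic chain)."""
    n = P_pi.shape[0]
    A = P_pi.T - np.eye(n)
    A[-1, :] = 1.0            # replace one redundant equation by sum(mu) = 1
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

def average_reward(P, R, pi):
    """r(pi) = sum_s mu(s) sum_a pi(a|s) R[s, a]."""
    # State-to-state transition matrix under the policy.
    P_pi = np.einsum("sa,sat->st", pi, P)
    mu = stationary_distribution(P_pi)
    return float(np.sum(mu[:, None] * pi * R)), mu

# Hypothetical MDP: P[s, a, s'] are transition probabilities,
# R[s, a] is the expected immediate reward for taking a in s.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.full((2, 2), 0.5)     # uniform random policy

r_pi, mu = average_reward(P, R, pi)
print(r_pi, mu)
```

Changing `pi` changes both the inner action average and `mu` itself, which is the dependence that makes the gradient of this objective awkward to compute directly.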

Our goal of policy optimization will be to find a policy which maximizes the average reward. Our basic approach will be to estimate the gradient of the objective with respect to the policy parameters and adjust the parameters based on this estimate. The class of methods that use this idea are often referred to as policy gradient methods.

$\textbf{Policy-Gradient Method:}$

$$
    \nabla r(\pi) = \nabla \sum_{s}\mu(s) \sum_{a}\pi(a | s,\theta) \sum_{s',r}p(s',r | s,a)r
$$

Following this gradient *optimizes the average reward objective*.

- There are a few challenges in computing this gradient. The main difficulty is that modifying our policy changes the distribution $\mu$.
- This contrasts with value function approximation, where we minimized the mean squared value error under a particular policy. There, the distribution $\mu$ was fixed.

$\textbf{The Policy Gradient Theorem:}$

$$
    \nabla r(\pi) = \sum_{s}\mu(s) \sum_{a}\nabla\pi(a | s,\theta) q_{\pi}(s,a)
$$

- The theorem gives us an expression for the gradient of the average reward that does not involve the gradient of $\mu$.

## <a id='toc3_2_'></a>[The Policy Gradient Theorem](#toc0_)

We just discussed an objective for policy optimization. The next step is to figure out how the agent can optimize it based on its own experience.

Now, we discuss the $\textbf{Policy Gradient Theorem}$. This is a key theoretical result. It allows us to write the gradient of the average reward so that it is easier to estimate from experience.

- In value function approximation, we used stochastic gradient descent to minimize the value error. Policy gradient methods use a similar approach, but with the average reward objective and the policy parameters $\theta$. Because we want to maximize the average reward rather than minimize an error, we do gradient ascent and move in the direction of the positive gradient.

The gradient of $\mu$ is not straightforward to estimate. The stationary distribution $\mu$ depends on a long-term interaction between the policy and the environment.

## <a id='toc3_3_'></a>[Estimating the Policy Gradient](#toc0_)

We have an objective for policy optimization. We also have the policy gradient theorem, which gives us a simple expression for the gradient of that objective. Now we'll complete the puzzle by showing how to estimate this gradient using the experience of an agent interacting with the environment.

<p align="center">
  <img width="700" height="400" src="imgs/c3m5-stochastic-gradient-ascent.png">
</p>
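
A sketch of how the theorem leads to a sample-based update, assuming states are visited according to $\mu$ by following $\pi$ and writing $\alpha$ for a step size (the symbol is an assumption here). Multiplying and dividing by $\pi(a | s, \theta)$ turns the sum over actions into an expectation under the policy:

$$
\begin{align}
    \nabla r(\pi) &= \sum_{s}\mu(s) \sum_{a}\nabla\pi(a | s,\theta) q_{\pi}(s,a) \\
                  &= \mathbb{E}_{\mu}\Big[ \sum_{a}\pi(a | S,\theta) \frac{\nabla\pi(a | S,\theta)}{\pi(a | S,\theta)} q_{\pi}(S,a) \Big] \\
                  &= \mathbb{E}_{\pi}\big[ \nabla\ln\pi(A | S,\theta) \, q_{\pi}(S,A) \big]
\end{align}
$$

A single sample of the quantity inside the expectation then gives a stochastic gradient ascent step:

$$
    \theta_{t+1} = \theta_t + \alpha \nabla\ln\pi(A_t | S_t,\theta_t) \, q_{\pi}(S_t,A_t)
$$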

# <a id='toc4_'></a>[Actor-Critic for Continuing Tasks](#toc0_)

## <a id='toc4_1_'></a>[Actor-Critic Algorithm](#toc0_)

Do we have to choose between directly learning the policy parameters and learning a value function? No. Even within policy gradient methods, value-learning methods like TD still have an important role to play. In this setup, the parameterized policy plays the role of an actor, while the value function plays the role of a critic, evaluating the actions selected by the actor.

In RL, **actor-critic methods** combine **policy gradient methods** (the "actor," which learns the best actions to take) with **value function learning** (the "critic," which evaluates how good those actions are). The actor uses policy gradient updates to adjust action probabilities, while the critic approximates the value function using temporal-difference (TD) learning, specifically the average reward version of semi-gradient TD.

To improve learning efficiency, a baseline (the state's value estimate) is subtracted from the critic’s evaluation, forming the TD error. This reduces variance in updates without changing their expected value, leading to faster learning. A positive TD error means the action was better than expected, so the actor increases its probability; a negative error decreases it. Both actor and critic learn simultaneously, with the actor improving the policy based on the critic’s feedback, and the critic refining its value estimates based on the actor’s actions.

For continuing tasks, the algorithm initializes policy and value function parameters, tracks average rewards, and updates both using TD error and policy gradient steps. This allows ongoing improvement of the policy over time.

- $\textbf{Critic:}$
  - The critic provides immediate feedback. 
  - To train the critic, we can use any state value learning algorithm.
- $\textbf{Actor:}$
  - Is the parameterized policy.

<p align="center">
  <img width="700" height="400" src="imgs/c3m5-approximating-the-action-value.png">
</p>

<p align="center">
  <img width="800" height="300" src="imgs/c3m5-subtracting-the-current-stats-value-estimate.png">
</p>

Subtracting the baseline $\hat{v}(S_t, w)$ tends to reduce the variance of the update which results in faster learning. 

- This update makes sense intuitively. After we execute an action, we use the TD error to decide how good the action was compared to the average for that state. If the TD error is positive, then it means the selected action resulted in a higher value than expected. Taking that action more often should improve our policy.
- That is exactly what this update does. It changes the policy parameters to increase the probability of actions that were better than expected according to the critic. Correspondingly, if the critic is disappointed and the TD error is negative, then the probability of the action is decreased. The actor and the critic learn at the same time, constantly interacting. The actor is continually changing the policy to exceed the critic's expectation, and the critic is constantly updating its value function to evaluate the actor's changing policy.

<p align="center">
  <img width="700" height="450" src="imgs/c3m5-actor-critic-algo.png">
</p>
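
One step of the average reward actor-critic described above can be sketched as follows. This is an illustration under stated assumptions, not a transcription of the figure: the critic is assumed to be linear, $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T\mathbf{x}(s)$, the gradient of the log-policy is passed in as an argument, and the function name and step-size names (`alpha_w`, `alpha_theta`, `alpha_r`) are hypothetical.

```python
import numpy as np

def actor_critic_step(w, theta, avg_reward, x_s, x_next, reward, grad_log_pi,
                      alpha_w=0.1, alpha_theta=0.01, alpha_r=0.01):
    """One average-reward actor-critic update with a linear critic."""
    # Differential TD error: reward relative to the running average,
    # plus the change in estimated state value.
    delta = reward - avg_reward + w @ x_next - w @ x_s
    # Track the average reward.
    avg_reward = avg_reward + alpha_r * delta
    # Critic: semi-gradient TD update (gradient of a linear v-hat is x(S)).
    w = w + alpha_w * delta * x_s
    # Actor: move along grad log pi, scaled by the TD error.
    theta = theta + alpha_theta * delta * grad_log_pi
    return w, theta, avg_reward, delta

# Hypothetical one-hot state features and a made-up log-policy gradient.
w0 = np.zeros(2)
theta0 = np.zeros(3)
x_s, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
grad = np.array([0.5, -0.25, -0.25])

w1, theta1, r_bar1, delta = actor_critic_step(
    w0, theta0, 0.0, x_s, x_next, reward=1.0, grad_log_pi=grad)
print(delta, r_bar1, w1, theta1)
```

With a positive TD error, the actor parameters move in the direction of `grad`, which raises the probability of the action that was just taken.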

# <a id='toc5_'></a>[Policy Parameterizations](#toc0_)
## <a id='toc5_1_'></a>[Actor-Critic with Softmax Policies](#toc0_)

Actor critic elegantly mixes direct policy optimization, value functions, and temporal difference learning. 

$\textbf{Actor Critic Algorithm}$: 
- combines *policy evaluation*, which is the **critic**, 
- and the *policy gradient rule* to update the policy, which is the **actor**.

$$
\begin{align}
    \mathbf{w}      &\leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\delta\nabla\hat{v}(S, \mathbf{w}) \\
    \boldsymbol{\theta} &\leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}\delta\nabla\text{ln}\pi(A | S,\boldsymbol{\theta})
\end{align}
$$

A common choice of policy parameterization for finite actions is the Softmax policy we discussed before. 
$$
    \pi(a | s,\boldsymbol{\theta}) = \frac{e^{h(s,a,\boldsymbol{\theta})}}{\sum_{b\in\mathcal{A}}e^{h(s,b,\boldsymbol{\theta})}}
$$

- To select an action, we follow a simple procedure. 
- In the current state we query the Softmax for each action. 
- This generates a vector of probabilities, one entry for each action. 
- Given this vector, we pick an action proportionally to these probabilities. 

Features of the Action Preference Function: 

- Remember that the critic updates an estimate of the state-value function, so the critic only requires a feature vector characterizing the current state.
- The actor's action preferences depend on the state and action, so our action preference function requires a state-action feature vector.

$\textbf{Concrete Implementation of Actor Critic Algorithm}$: 
$$
\begin{align}
    \mathbf{w}          &\leftarrow \mathbf{w} + \alpha^{\mathbf{w}}
                                    &&\delta
                                    \underbrace{\nabla\hat{v}(S, \mathbf{w})}_{=\mathbf{x}(S)} \\ 
    \boldsymbol{\theta} &\leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}
                                    &&\delta
                                    \underbrace{\nabla\text{ln}\pi(A | S,\boldsymbol{\theta})}_{\mathbf{x}_h(s,a) - \sum_{b}\pi(b | s,\boldsymbol{\theta})\mathbf{x}_h (s,b)}
\end{align}
$$
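
The actor's gradient term can be checked numerically. The sketch below assumes linear action preferences, $h(s,a,\boldsymbol{\theta}) = \boldsymbol{\theta}^T\mathbf{x}_h(s,a)$, and stacks the state-action feature vectors for one state as the rows of a matrix. The function names and the feature values are made up for the example.

```python
import numpy as np

def softmax_probs(theta, features):
    """features has shape (num_actions, d); row b is x_h(s, b)."""
    prefs = features @ theta
    prefs = prefs - prefs.max()        # numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def grad_log_softmax(theta, features, action):
    """x_h(s, a) - sum_b pi(b | s, theta) x_h(s, b)."""
    probs = softmax_probs(theta, features)
    return features[action] - probs @ features

# Hypothetical features for three actions in one state.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
theta = np.array([0.2, -0.1])
action = 2

grad = grad_log_softmax(theta, features, action)

# Central finite differences on ln pi(action | s, theta) as a sanity check.
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.size):
    step = np.zeros_like(theta)
    step[i] = eps
    plus = np.log(softmax_probs(theta + step, features)[action])
    minus = np.log(softmax_probs(theta - step, features)[action])
    numeric[i] = (plus - minus) / (2 * eps)

print(grad, numeric)
```

The analytic and numeric gradients should agree to several decimal places, which is a quick way to catch sign or indexing mistakes before plugging the expression into the actor update.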

## <a id='toc5_2_'></a>[Gaussian Policies for Continuous Actions](#toc0_)

One of the nice things about policy-based methods is that they give us a natural way to deal with very large or even continuous action spaces. We don't have to use a policy parameterization that assigns individual probabilities to each action. We could instead learn the parameters of some distribution over actions. In this section, we discuss how to learn a state-dependent Gaussian distribution over continuous actions. 

> **Probability density** means that for a given range, the probability of x lying in that range will be the area under the probability density curve. 

$\textbf{Gaussian Policy:}$

$$
\begin{align}
    \pi(a | s, \boldsymbol{\theta}) &= \frac{1}{\sigma(s, \boldsymbol{\theta})\sqrt{2\pi}}\exp
                                       \Big(  
                                            -\frac{(a - \mu(s, \boldsymbol{\theta}))^2}{2\sigma(s, \boldsymbol{\theta})^2}
                                       \Big) \\
        \mu(s, \boldsymbol{\theta}) &= \boldsymbol{\theta}^T_\mu \mathbf{x}(s) \\
     \sigma(s, \boldsymbol{\theta}) &= \exp(\boldsymbol{\theta}^T_\sigma \mathbf{x}(s))
\end{align}
$$

The policy parameters associated with $\mu$ are denoted by $\boldsymbol{\theta}_\mu$, and those associated with $\sigma$ by $\boldsymbol{\theta}_\sigma$. Our policy parameters now consist of these two stacked parameter vectors of equal size.
$$
    \boldsymbol{\theta} = \begin{pmatrix} \boldsymbol{\theta}_\mu \\ \boldsymbol{\theta}_\sigma \end{pmatrix}
$$

We've defined our Gaussian policy. Remember, the main point of a policy is that it gives us a way to select actions. To select actions with this policy, we sample from the Gaussian.

> The procedure is simple. In a state, we compute $\mu$ and $\sigma$ from that state. 
> We then call the Gaussian random number generator with that $\mu$ and $\sigma$. 
> In one state, $\mu$ and $\sigma$ might look like this. For this state, a wide range of actions are likely to be sampled.

<p align="center">
  <img width="700" height="450" src="imgs/c3m5-sampling-from-the-gaussian-policy.png">
</p>


- $\sigma$ essentially controls the degree of exploration
  - We typically initialize the variance to be large so that a wide range of actions are tried.
  - As learning progresses, we expect the variance to shrink and the policy to concentrate around the best action in each state.
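
Action selection with the Gaussian policy can be sketched as below, using the linear mean and exponentiated linear standard deviation defined earlier. The function names, the feature vector, and the parameter values are assumptions picked for illustration. Here $\boldsymbol{\theta}_\sigma = \mathbf{0}$ gives $\sigma = 1$.

```python
import numpy as np

def gaussian_policy_params(theta_mu, theta_sigma, x):
    """Return the state-dependent mean and standard deviation."""
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)    # exp keeps sigma strictly positive
    return mu, sigma

def sample_actions(theta_mu, theta_sigma, x, rng, size=None):
    """Draw one or more continuous actions from pi(. | s, theta)."""
    mu, sigma = gaussian_policy_params(theta_mu, theta_sigma, x)
    return rng.normal(loc=mu, scale=sigma, size=size)

# Hypothetical state features and parameters.
x = np.array([1.0, 0.5])
theta_mu = np.array([0.2, -0.4])       # mean = 0.2 - 0.2 = 0.0
theta_sigma = np.zeros(2)              # sigma = exp(0) = 1.0

rng = np.random.default_rng(0)
mu, sigma = gaussian_policy_params(theta_mu, theta_sigma, x)
actions = sample_actions(theta_mu, theta_sigma, x, rng, size=10_000)

print(mu, sigma, actions.mean(), actions.std())
```

Lowering the entries of `theta_sigma` shrinks `sigma`, so the sampled actions cluster more tightly around the mean, which corresponds to less exploration.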

<p align="center">
  <img width="700" height="450" src="imgs/c3m5-gradient-of-log-of-gaussian-policy.png">
</p>
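
For the parameterization above, differentiating $\ln\pi(a | s, \boldsymbol{\theta})$ gives $\nabla_{\boldsymbol{\theta}_\mu}\ln\pi = \frac{a - \mu}{\sigma^2}\mathbf{x}(s)$ and $\nabla_{\boldsymbol{\theta}_\sigma}\ln\pi = \big(\frac{(a - \mu)^2}{\sigma^2} - 1\big)\mathbf{x}(s)$. The sketch below implements these two expressions and compares them with finite differences. Function names and numeric values are assumptions for the example.

```python
import numpy as np

def gaussian_log_prob(theta_mu, theta_sigma, x, a):
    """ln pi(a | s, theta) for the Gaussian policy."""
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)
    return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - (a - mu) ** 2 / (2 * sigma ** 2)

def grad_log_gaussian(theta_mu, theta_sigma, x, a):
    """Gradients of ln pi with respect to theta_mu and theta_sigma."""
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)
    grad_mu = (a - mu) / sigma ** 2 * x
    grad_sigma = ((a - mu) ** 2 / sigma ** 2 - 1.0) * x
    return grad_mu, grad_sigma

# Hypothetical features, parameters, and sampled action.
x = np.array([1.0, 0.5])
theta_mu = np.array([0.2, -0.4])
theta_sigma = np.array([0.1, 0.2])
a = 0.7

grad_mu, grad_sigma = grad_log_gaussian(theta_mu, theta_sigma, x, a)

# Central finite differences as a sanity check on both gradients.
eps = 1e-6
num_mu = np.zeros(2)
num_sigma = np.zeros(2)
for i in range(2):
    step = np.zeros(2)
    step[i] = eps
    num_mu[i] = (gaussian_log_prob(theta_mu + step, theta_sigma, x, a)
                 - gaussian_log_prob(theta_mu - step, theta_sigma, x, a)) / (2 * eps)
    num_sigma[i] = (gaussian_log_prob(theta_mu, theta_sigma + step, x, a)
                    - gaussian_log_prob(theta_mu, theta_sigma - step, x, a)) / (2 * eps)

print(grad_mu, num_mu)
print(grad_sigma, num_sigma)
```

The sign of `grad_sigma` depends on whether the sampled action lies more than one standard deviation from the mean: if it does, increasing $\sigma$ raises the log-density of that action.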

But why did we do all this given that the discrete actions that we used previously seemed to work fine on this problem? 
The most obvious reason is it might not be straightforward to choose a discrete set of actions. 
For example, imagine a robot trained to play golf. Putting accurately requires quite a bit of precision. 
We could try to pick a discrete set of forces to apply to the club and hope it's good enough. 
But inevitably, there will be situations where it isn't. 
Depending on the distance and terrain, we may need to use the full range of forces. 
Continuous actions also allow us to generalize over actions. 
For example, a Gaussian policy smoothly assigns probability density to nearby actions. 
Imagine an action is found to be good and the update increases density for that action. 
The density for nearby actions will also increase, so the agent generalizes that those actions are likely to be good as well. 
Finally, even if the true action set is discrete but very large, it might be better to treat it as a continuous range. 
This gives a natural way to generalize over them and avoids the costly process of exploring each one independently. 

# <a id='toc6_'></a>[tldr](#toc0_)

This week, we talked about a whole new way to do control: directly learning parameterized policies. We introduced a new objective and a new algorithmic framework called actor-critic. 

<p align="center">
  <img width="900" height="450" src="imgs/c3m5-summary.png">
</p>

We focused on actor-critic with the average reward objective, which is where we sit in the figure above. From there, we introduced two possible parameterizations depending on whether the actions are continuous or discrete. 
- For **discrete actions**, we use the softmax policy parameterization. 
- For **continuous action spaces**, we use a Gaussian policy which samples actions from a continuous range. 

A parameterized policy has to be a valid probability distribution. $\textbf{Softmax Parameterization}$:

$$
\begin{align}
    \pi(a | s, \theta) \geq 0 \qquad &\text{for all} \; a \in \mathcal{A} \; \text{and} \; s \in \mathcal{S} \\ 
    \sum_{a \in \mathcal{A}} \pi(a | s, \theta) = 1 \qquad &\text{for all} \; s \in \mathcal{S} \\ 
    \pi(a | s,\boldsymbol{\theta}) &= \frac{e^{h(s,a,\boldsymbol{\theta})}}{\sum_{b\in\mathcal{A}}e^{h(s,b,\boldsymbol{\theta})}}
\end{align}
$$

$\textbf{Advantages of Policy Parameterization}$:
- The ability to autonomously converge to a deterministic policy over time
- The ability to learn stochastic policies
- The fact that a good policy might be easier to learn and represent than precise action-values

To learn parameterized policies, we had to consider a new objective and a new strategy to optimize it. $\textbf{The Average Reward Objective}$:
$$
\begin{align}
    r(\pi) = \sum_{s}\mu(s) 
             \sum_{a}\pi(a | s,\boldsymbol{\theta}) 
             \sum_{s',r}p(s',r | s,a)r
\end{align}
$$

Our strategy to optimize this objective was to use stochastic gradient ascent. For that, we needed an estimate of the gradient of the average reward. $\textbf{The Policy Gradient Theorem}$:
$$
    \nabla r(\pi) = \sum_{s}\mu(s) \sum_{a}\nabla\pi(a | s, \boldsymbol{\theta}) q_{\pi}(s,a)
$$

Using the policy gradient theorem, we derive the $\textbf{Actor-Critic Algorithm}$. 


- Simultaneously learns a parameterized policy, the actor, and an estimate of the policy's value function, the critic. 
- The TD error from the critic reflects whether the actor did better or worse than the critic expected. The actor is trained to favor actions that exceed the critic's expectations, and the critic is trained to improve its value estimates for the actor's policy so that it knows what value to expect from this actor.

<p align="center">
  <img width="700" height="450" src="imgs/c3m5-actor-critic-algorithm.png">
</p>

- Can be used for both discrete and continuous action spaces. 
  - For discrete actions, we used a softmax policy parameterization. 
  - For continuous actions, we used a Gaussian policy with state dependent mean and variance. 
 
<p align="center">
  <img width="800" height="450" src="imgs/c3m5-actor-critic-for-discrete-and-cont.png">
</p>

This made the Gaussian policy over actions conditional on the state. 

# long summary 

Average reward actor-critic can be used in the same settings as the differential semi-gradient SARSA algorithm we introduced previously. We started this week by making the shift from parameterized action values to parameterized policies. Parameterized policies take in state-action pairs and output the associated action probabilities. This function is parameterized by a vector of parameters denoted by $\boldsymbol{\theta}$. A parameterized policy has to be a valid probability distribution. That is, every action probability must be greater than or equal to zero, and the sum over all actions in a given state must be one. A softmax over action preferences is one way to ensure the parameterized policy obeys these constraints in discrete action spaces.

Parameterizing policies directly has a number of potential advantages. These advantages include the ability to autonomously converge to a deterministic policy over time while using a stochastic policy to explore early on. Another advantage is the ability to learn stochastic policies. This can be useful with function approximation, when the optimal deterministic policy is not representable. Finally, we discussed that a good policy might be easier to learn and represent than precise action values.

To learn parameterized policies, we had to consider a new objective and a new strategy to optimize it. We revisited the average reward objective. We showed that the average reward can be expanded as a sum over states weighted by $\mu$, actions weighted by the policy, and rewards weighted by the dynamics. Our strategy to optimize this objective was to use stochastic gradient ascent. For that, we needed an estimate of the gradient of the average reward. This is challenging because the state distribution $\mu$ depends on the policy parameters. The policy gradient theorem provides an expression for the gradient of the average reward objective that's convenient to optimize. This form allows us to estimate the gradient by sampling states from $\mu$ by following the policy $\pi$. Using the policy gradient theorem, we derived the actor-critic algorithm.

Actor-critic simultaneously learns a parameterized policy, the actor, and an estimate of the policy's value function, the critic. The TD error from the critic reflects whether the actor did better or worse than the critic expected. The actor is trained to favor actions that exceed the critic's expectations, and the critic is trained to improve its value estimates for the actor's policy so that it knows what value to expect from this actor. We demonstrated how the actor-critic algorithm can be used for both discrete and continuous action spaces. For discrete actions, we used a softmax policy parameterization. For continuous actions, we used a Gaussian policy with state-dependent mean and variance. This made the Gaussian policy over actions conditional on the state. Now you know how to parameterize and learn policies directly. In addition to the action value methods we looked at before, this is a valuable new tool in your reinforcement learning toolbox.


---
- We can use the average reward as an objective for policy optimization.
- It is useful to learn a value function to estimate the gradient for the policy parameters, and the actor-critic algorithm implements this idea with a critic to learn the value function for the actor.