# <center> Policy Gradient Methods</center>

## Table of Contents

1) Introduction: Policy Gradient vs the World, Advantages and Disadvantages


2) REINFORCE: Simplest Policy Gradient Method


3) Actor-Critic Methods


4) Additional Enhancements to Actor-Critic Methods


# <center>Introduction</center>

## <center> Introduction </center>
<center><img src="img/pg_1.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Value-Based Vs Policy-Based RL</center>
<center><img src="img/pg_2.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Why Policy-Based RL</center>
<center><img src="img/pg_3.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Can Learning a Policy Be Easier than Learning State Values?</center>
* The policy may be a simpler function to approximate.
* This is the simplest advantage that policy parameterization may have over action-value parameterization.

Why?
* Problems vary in the complexity of their policies and action-value functions. 
* For some, the action-value function is simpler and thus easier to approximate. 
* For others, the policy is simpler. 


**In the latter case, a policy-based method will typically learn faster and yield a superior asymptotic policy.**

Example: robotics tasks with a continuous action space.

## <center> Example of Stochastic Optimal Policy</center>
<center><img src="img/pg_4.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Example of Stochastic Optimal Policy</center>
<center><img src="img/pg_5.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Example of Stochastic Optimal Policy</center>
<center><img src="img/pg_6.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Example of Stochastic Optimal Policy</center>
<center><img src="img/pg_7.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## Why not use softmax of Action-Values for stochastic Policies?

* A softmax over action values alone cannot approach a deterministic policy when one is required.
* The action-value estimates converge toward the true values, which differ by a finite amount, so the resulting action probabilities settle at specific values other than 0 and 1.


* With a softmax plus a temperature parameter T, T could be reduced over time to approach determinism.
* However, in practice it is difficult to choose the reduction schedule, or even the initial temperature, without more knowledge of the true action values.

* By contrast, a policy gradient method with a softmax over action preferences is driven toward the optimal stochastic policy.
* If the optimal policy is deterministic, the preferences of the optimal actions are driven infinitely higher than those of all suboptimal actions.
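The effect of the temperature can be illustrated with a minimal sketch. The action values in `q_values` and the temperatures tried are hypothetical, chosen only to show that a finite gap between action values yields probabilities strictly between 0 and 1 unless T is pushed toward zero.

```python
import math

def softmax(values, temperature=1.0):
    """Softmax over action values, scaled by a temperature T > 0."""
    scaled = [v / temperature for v in values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical action-value estimates that differ by a finite amount.
q_values = [1.0, 1.5]

# Lower temperatures move the policy closer to the greedy (deterministic) one.
for T in (1.0, 0.1, 0.01):
    probs = softmax(q_values, T)
    print(f"T={T}: {[round(p, 4) for p in probs]}")
```

At T = 1 the better action is chosen only somewhat more often than the other; the policy becomes nearly greedy only for small T, and a suitable T depends on the scale of the true action values, which is unknown in advance.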

# <center>REINFORCE: Simplest Policy Gradient Method</center>

## <center>Quality Measure of Policy</center>
<center><img src="img/pg_8.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Policy Optimisation</center>
<center><img src="img/pg_9.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Gradient Ascent</center>
<center><img src="img/pg_10.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Gradient Ascent - FDM</center>
<center><img src="img/pg_11.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Analytic Gradient Ascent</center>
<center><img src="img/pg_12.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Example- Softmax Policy</center>
<center><img src="img/pg_13.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Example- Gaussian Policy</center>
<center><img src="img/pg_14.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>One-step MDP</center>
<center><img src="img/pg_15.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Policy Gradient Theorem</center>
<center><img src="img/pg_16.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Policy Gradient Theorem-Proof</center>
<center><img src="img/sutton_1.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Monte-Carlo Policy Gradient (REINFORCE)</center>
<center><img src="img/pg_17.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> PuckWorld Example</center>
<center><img src="img/pg_18.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>
**DQN Demo: [Reinforce.js](http://cs.stanford.edu/people/karpathy/reinforcejs/puckworld.html)**

## <center> REINFORCE with Baseline</center>

* REINFORCE has good theoretical convergence properties. 
* The expected update over an episode is in the same direction as the performance gradient.
* This assures: 
  * An improvement in expected performance for sufficiently small $\alpha$, and 
  * Convergence to a local optimum under standard stochastic approximation conditions. 
  

* **However**,
  * as a Monte Carlo method, REINFORCE can have high variance, and thus
  * can be slow to learn.

**Can we reduce the variance somehow?**

## <center> REINFORCE with Baseline</center>

* The gradient of the policy's quality measure $\eta(\theta)$ can be written as:


<center><img src="img/sutton_5.JPG" alt="Multi-armed Bandit" style="width: 400px;"/></center>



* Instead of using the returns or action values directly, we first compare them with a baseline:


<center><img src="img/sutton_2.JPG" alt="Multi-armed Bandit" style="width: 500px;"/></center>


* The baseline can be any function, even a random variable.
* **Only condition**: it must not vary with the action $a$.
* **Any guesses why?**

<center><img src="img/sutton_3.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>
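The claim that an action-independent baseline leaves the expected gradient unchanged can also be checked numerically. The sketch below assumes a softmax policy over action preferences in a single state; the names `theta`, `grad_pi`, and `baseline_term` are illustrative and not taken from the slides. It sums $b \, \nabla_\theta \pi(a)$ over all actions, which should vanish because the action probabilities always sum to 1.

```python
import math

def softmax(prefs):
    """Softmax policy over action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_pi(prefs, a):
    """Gradient of pi(a) with respect to each preference theta_k."""
    pi = softmax(prefs)
    return [pi[a] * ((1.0 if a == k else 0.0) - pi[k]) for k in range(len(prefs))]

def baseline_term(prefs, baseline):
    """Sum over actions of b * grad pi(a), with b independent of the action."""
    n = len(prefs)
    total = [0.0] * n
    for a in range(n):
        g = grad_pi(prefs, a)
        for k in range(n):
            total[k] += baseline * g[k]
    return total

theta = [0.2, -1.0, 0.7]  # hypothetical action preferences
print(baseline_term(theta, baseline=5.0))  # every component is zero up to rounding
```

If the baseline were allowed to depend on the action, it could no longer be pulled outside the sum and the term would generally not cancel.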


* Finally, the update equation becomes:
<center><img src="img/sutton_4.JPG" alt="Multi-armed Bandit" style="width: 500px;"/></center>
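A minimal sketch of an update with a baseline, under simplifying assumptions that are not from the slides: a single-state problem (a bandit) so that the return is just the sampled reward, a softmax policy over preferences, and a running average of past returns as the baseline. The function `reinforce_bandit`, its parameters, and the reward means are all hypothetical.

```python
import math
import random

def softmax(prefs):
    """Softmax policy over action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(true_means, alpha=0.1, steps=2000, use_baseline=True, seed=0):
    """REINFORCE-style updates on a single-state problem (assumed setup)."""
    rng = random.Random(seed)
    theta = [0.0] * len(true_means)  # action preferences
    baseline = 0.0                   # running average of past returns
    for t in range(1, steps + 1):
        pi = softmax(theta)
        a = rng.choices(range(len(theta)), weights=pi)[0]
        g = rng.gauss(true_means[a], 1.0)  # sampled return for action a
        delta = g - baseline               # return compared with the baseline
        if use_baseline:
            baseline += (g - baseline) / t  # update after use, so it ignores the current action
        for k in range(len(theta)):
            # grad of log pi(a) w.r.t. theta_k for a softmax policy
            grad_log = (1.0 if k == a else 0.0) - pi[k]
            theta[k] += alpha * delta * grad_log
    return softmax(theta)

policy = reinforce_bandit([1.0, 2.0, 4.0])
print([round(p, 3) for p in policy])
```

Calling `reinforce_bandit(..., use_baseline=False)` runs the same loop with the baseline fixed at zero, which gives a simple way to compare how noisy the two learning curves are.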


# <center> Actor Critic Methods</center>

## <center> Reducing Variance Using a Critic</center>
<center><img src="img/pg_19.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Estimating the Action-Value Function</center>
<center><img src="img/pg_20.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Action Value Actor Critic</center>
<center><img src="img/pg_21.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Bias In Actor-Critic Algorithm</center>
<center><img src="img/pg_22.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Compatible Function Approximation</center>
<center><img src="img/pg_23.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Compatible Function Approximation- Proof</center>
<center><img src="img/pg_24.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

# <center>Additional Enhancements to Actor Critic</center>

## <center> Actor Critic with Baseline</center>
<center><img src="img/pg_25.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Estimating the Advantage Function</center>
<center><img src="img/pg_26.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Estimating the Advantage Function</center>
<center><img src="img/pg_27.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Critics at different Time-Scales</center>
<center><img src="img/pg_27.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Actors at different Time-Scales</center>
<center><img src="img/pg_28.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Policy Gradient with Eligibility Traces</center>
<center><img src="img/pg_29.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Policy Gradient with Eligibility Traces</center>
<center><img src="img/pg_30.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Summary</center>
<center><img src="img/pg_34.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>