
# Dynamic Programming



### Value iteration




Update rule:



$V(s) \leftarrow \max_{a} \sum_{s',r} p(s',r|s,a)[r+\gamma V(s')]$



Each update moves $V(s)$ towards the value under the optimal policy. The algorithm follows this pseudocode:



![Value Iteration pseudocode](https://i.imgur.com/xLAssHj.png)




We start with a table holding an initial value for every state. For each state, the algorithm computes a new value by taking the best action under the current value estimates. It terminates when the largest change in a sweep is less than a small threshold $\theta$. Here is an example of the algorithm at work:



![Value Iteration example](https://i.imgur.com/kaJqfze.png)




In this example, a ball starts at the top left and must find its way out of the maze. By default, all of the states except for the terminal state at the bottom right have an initial value of -1. The updates then propagate iteratively outward from the terminal state to the values of every other state.
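A minimal sketch of value iteration on a maze-like grid may make the update rule concrete. Everything specific here is an assumption for illustration, not the maze from the figure: a hypothetical 3x3 grid with no inner walls, deterministic moves, a reward of -1 per step, and $\gamma = 0.99$. The names `step` and `value_iteration` are likewise made up for this sketch.

```python
# Hypothetical 3x3 grid: start at (0, 0), terminal exit at (2, 2).
# Assumptions (not from the notes): deterministic moves, reward -1 per step,
# and bumping into the outer wall leaves the ball where it is.
SIZE = 3
GOAL = (SIZE - 1, SIZE - 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]


def step(state, action):
    """Return (next_state, reward) for a deterministic move."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        r, c = state  # blocked by the outer wall
    return (r, c), -1.0


def value_iteration(gamma=0.99, theta=1e-6):
    # Initial table: -1 everywhere, 0 at the terminal state.
    V = {s: -1.0 for s in STATES}
    V[GOAL] = 0.0
    while True:
        delta = 0.0
        for s in STATES:
            if s == GOAL:
                continue
            # Bellman optimality backup: max over actions of r + gamma * V(s').
            backups = []
            for a in ACTIONS:
                s2, reward = step(s, a)
                backups.append(reward + gamma * V[s2])
            best = max(backups)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # largest change in this sweep is below theta
            return V


V = value_iteration()
for r in range(SIZE):
    print([round(V[(r, c)], 2) for c in range(SIZE)])
```

Under these assumptions, a state's value ends up as the discounted cost of the shortest path to the exit, so states next to the exit settle at -1 and the start state, four moves away, settles near -3.94.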




The maze example works through value iteration in the following steps:



- Initialize the environment

- Define the policy $\pi(\bullet | s)$

	- Create the policy

	- Test the policy with state $(0,0)$

	- See how the random policy does in the maze

	- Plot the policy

- Define value table $V(s)$

	- Create the $V(s)$ table

	- Plot $V(s)$

- Implement the value iteration algorithm

- Show results

	- Show resulting value table $V(s)$

	- Show resulting policy $\pi(\bullet | s)$

	- Test the resulting agent
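The setup steps in the list (define a random policy, test it at state $(0,0)$, see how it does in the maze, create the value table) could look roughly like the sketch below. The 3x3 open grid and the helper names `random_policy`, `step`, and `run_episode` are assumptions for illustration only; the real maze environment will differ.

```python
import random

# Hypothetical 3x3 grid without inner walls; exit at the bottom right.
SIZE = 3
GOAL = (SIZE - 1, SIZE - 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]


def random_policy(state):
    """pi(. | s): the same probability for every action, in every state."""
    return [1.0 / len(ACTIONS)] * len(ACTIONS)


def step(state, action):
    """Deterministic move; hitting the outer wall keeps the ball in place."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        r, c = state
    return (r, c)


def run_episode(policy, start=(0, 0), max_steps=10_000, seed=0):
    """Follow the policy from start; return the number of steps taken."""
    rng = random.Random(seed)
    state, steps = start, 0
    while state != GOAL and steps < max_steps:
        action = rng.choices(ACTIONS, weights=policy(state))[0]
        state = step(state, action)
        steps += 1
    return steps


# Test the policy with state (0, 0): four equal action probabilities.
print(random_policy((0, 0)))

# See how the random policy does in the maze.
print("steps to exit:", run_episode(random_policy))

# Value table V(s): -1 for every state except the terminal one.
V = {s: -1.0 for s in STATES}
V[GOAL] = 0.0
```

The random policy usually wanders for much longer than the four moves the shortest path needs, which is the baseline the value iteration result is compared against.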



### Policy iteration



Given an initial policy and value function, policy iteration finds the optimal policy and values through alternating rounds of evaluation and improvement.





Policy iteration can be summarized by the following pseudocode:




![Policy Iteration Pseudocode](https://i.imgur.com/lfgs9g0.png)




The maze example works through policy iteration in the following steps:

- Initialize the environment

- Define the policy $\pi(\bullet | s)$

	- Create the policy

	- Test the policy with state $(0,0)$

	- See how the random policy does in the maze

	- Plot the policy

- Define value table $V(s)$

	- Create the $V(s)$ table

	- Plot $V(s)$

- Implement the policy iteration algorithm

- Show results

	- Show resulting value table $V(s)$

	- Show resulting policy $\pi(\bullet | s)$

	- Test the resulting agent


### Bellman variants


Policy evaluation iteratively approximates the value function of a given policy. The update rule has the same form as the Bellman equation for that policy, used as an assignment:


$V(s) \leftarrow \sum_{a} \pi(a|s) \sum_{s',r} p(s',r|s,a)[r+\gamma V(s')]$

The algorithm is the evaluation step of the policy iteration algorithm: it sweeps over all states until the largest change falls below a threshold.


![Policy evaluation pseudocode](https://i.imgur.com/VfavhoA.png)
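As a rough sketch of iterative policy evaluation, the code below evaluates a uniform random policy on a tiny corridor. The corridor (states 0 to 3, state 3 terminal, reward -1 per move, $\gamma = 1$) and the model format `P[s][a] = [(probability, next_state, reward), ...]` are assumptions chosen for illustration.

```python
def policy_evaluation(policy, P, states, gamma=1.0, theta=1e-10):
    """Sweep over all states until the largest update is below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not P[s]:  # terminal state: no actions, value stays 0
                continue
            # Sum over actions of pi(a|s) times the expected backup for a.
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in P[s].items()
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V


# Hypothetical corridor: moving left from state 0 bumps into a wall.
states = [0, 1, 2, 3]
P = {
    s: {
        "left": [(1.0, max(s - 1, 0), -1.0)],
        "right": [(1.0, s + 1, -1.0)],
    }
    for s in states[:-1]
}
P[3] = {}  # terminal

# pi(a|s): uniform random policy over the two actions.
policy = {s: {"left": 0.5, "right": 0.5} for s in states[:-1]}

V = policy_evaluation(policy, P, states)
print({s: round(v, 3) for s, v in V.items()})
```

Solving the three linear equations for this corridor by hand gives $v_\pi(0) = -12$, $v_\pi(1) = -10$, $v_\pi(2) = -6$, so the iterative estimates should land very close to those numbers.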

The other half is policy improvement: for each state, decide whether to change the action the current policy takes or to stick with it. The decision rests on the value of taking action $a$ in state $s$ and following $\pi$ afterwards:


$q_\pi(s,a) =  \sum_{s',r}^{}p(s',r|s,a)[r+\gamma v_\pi(s')]$


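Here is the policy improvement theorem for two deterministic policies $\pi$ and $\pi'$. In the simplest case, $\pi$ and $\pi'$ differ only in the action they take at a single state.

If $q_\pi(s,\pi'(s)) \geq v_\pi(s)$ for all states $s$, then $v_{\pi'}(s)\geq v_\pi(s)$ for all states $s$.

Acting greedily with respect to $q_\pi$ satisfies this condition:

$\pi'(s) = \arg\max_{a} \sum_{s',r} p(s', r|s,a)[r+\gamma v_\pi (s')]$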


![enter image description here](https://i.imgur.com/tnR8SjJ.png)
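A small worked example of the greedy improvement step, under assumed inputs: a hypothetical four-state corridor (state 3 terminal, reward -1 per move, $\gamma = 1$) and the values $v_\pi$ of the uniform random policy on it, which are -12, -10, and -6 for states 0, 1, and 2. The function names are illustrative.

```python
def q_value(s, a, V, P, gamma=1.0):
    """q_pi(s, a): expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def improve_policy(V, P, gamma=1.0):
    """Return the greedy policy pi'(s) = argmax_a q_pi(s, a)."""
    return {
        s: max(P[s], key=lambda a: q_value(s, a, V, P, gamma))
        for s in P
        if P[s]  # skip terminal states, which have no actions
    }


# Hypothetical corridor model: P[s][a] = [(probability, next_state, reward)].
P = {
    s: {
        "left": [(1.0, max(s - 1, 0), -1.0)],
        "right": [(1.0, s + 1, -1.0)],
    }
    for s in range(3)
}
P[3] = {}  # terminal

# Assumed input: values of the uniform random policy on this corridor.
v_pi = {0: -12.0, 1: -10.0, 2: -6.0, 3: 0.0}

greedy = improve_policy(v_pi, P)
for s in sorted(greedy):
    qs = {a: q_value(s, a, v_pi, P) for a in P[s]}
    print(s, qs, "->", greedy[s])
```

In every state, moving right has the higher action value (for example $q_\pi(2, \text{right}) = -1$ against $q_\pi(2, \text{left}) = -11$), so the improved policy always heads for the exit, and its value is at least $v_\pi$ everywhere, as the theorem promises.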

## DP Summary:

![enter image description here](https://i.imgur.com/BKgPgbQ.png)


The algorithm first runs policy evaluation, which generates a value for each state in the table. Once evaluation is complete, policy improvement uses those values to choose the action at each state. Another round of policy evaluation is then run to update the table of values for the new policy. The algorithm keeps looping between evaluation and improvement until the policy no longer changes.
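The evaluate-improve loop can be sketched as follows. This is an illustration under assumptions, not the maze implementation: a hypothetical 3x3 open grid with deterministic moves, a reward of -1 per step, $\gamma = 0.9$ (so evaluation converges even for a policy that never reaches the exit), and an arbitrary initial policy that always moves up.

```python
SIZE = 3
GOAL = (SIZE - 1, SIZE - 1)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]


def step(state, action):
    """Deterministic move with reward -1; the outer wall blocks movement."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        r, c = state
    return (r, c), -1.0


def q_value(state, action, V, gamma):
    next_state, reward = step(state, action)
    return reward + gamma * V[next_state]


def evaluate(policy, V, gamma, theta):
    """Policy evaluation: sweep until the value table stops changing."""
    while True:
        delta = 0.0
        for s in policy:
            new_v = q_value(s, policy[s], V, gamma)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V


def policy_iteration(gamma=0.9, theta=1e-10):
    policy = {s: "up" for s in STATES if s != GOAL}  # arbitrary start
    V = {s: 0.0 for s in STATES}
    while True:
        V = evaluate(policy, V, gamma, theta)
        stable = True
        for s in policy:
            best = max(ACTIONS, key=lambda a: q_value(s, a, V, gamma))
            # Switch only on a clear improvement, so ties cannot cause cycling.
            if q_value(s, best, V, gamma) > q_value(s, policy[s], V, gamma) + 1e-6:
                policy[s] = best
                stable = False
        if stable:  # no action changed: the policy is greedy w.r.t. its own values
            return policy, V


policy, V = policy_iteration()
for r in range(SIZE):
    print([policy.get((r, c), "exit") for c in range(SIZE)])
```

The small margin in the improvement check is a design choice for this sketch: several actions are equally good in an open grid, and without it the loop could keep swapping between tied actions.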

The disadvantages of dynamic programming are:

 - High computational cost
 - In each sweep we update all the states
 - Complexity grows very rapidly with the number of states


# Monte Carlo methods

The agent follows a policy for an entire episode of the task. Monte Carlo methods approximate the values by interacting with the environment to generate sample returns and averaging them.
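A minimal sketch of this idea is first-visit Monte Carlo prediction: run whole episodes, record the return that followed the first visit to each state, and average. The environment is an assumption for illustration (a four-state corridor with terminal state 3 and reward -1 per move, sampled under a uniform random policy), and the function names are hypothetical. Note that the code only samples from `step`; it never reads transition probabilities.

```python
import random
from collections import defaultdict

TERMINAL = 3


def step(state, action):
    """Hypothetical corridor: moving left from state 0 bumps into a wall."""
    next_state = max(state - 1, 0) if action == "left" else state + 1
    return next_state, -1.0


def generate_episode(rng, start=0):
    """Follow the uniform random policy until the terminal state."""
    episode, state = [], start
    while state != TERMINAL:
        action = rng.choice(["left", "right"])
        next_state, reward = step(state, action)
        episode.append((state, reward))
        state = next_state
    return episode


def mc_prediction(num_episodes=5000, gamma=1.0, seed=0):
    rng = random.Random(seed)
    returns = defaultdict(list)
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        G = 0.0
        first_visit_return = {}
        # Walk backwards so G accumulates the return from each time step.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            # Overwritten on each earlier visit, so the first visit wins.
            first_visit_return[state] = G
        for state, g in first_visit_return.items():
            returns[state].append(g)
    # The value estimate is the average of the sampled returns.
    return {s: sum(g) / len(g) for s, g in returns.items()}


V = mc_prediction()
print({s: round(v, 2) for s, v in sorted(V.items())})
```

For this corridor the exact values under the random policy are -12, -10, and -6 for states 0, 1, and 2, so the averages should come out close to those, with some sampling noise.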

Advantages of Monte Carlo methods:

 - The estimate for one state does not depend on the estimates for the other states. The cost of estimating a state's value is independent of the total number of states, so it can be much cheaper than DP.
 - We can focus the estimates on the states that help solve the task and skip the irrelevant ones.
 - No need to know the dynamics of the environment. For many tasks it is easier to generate samples than to model the dynamics of the environment.


Monte Carlo methods improve the policy with the following equation, which needs only the action values and no model of the environment:


$\pi'(s) = \arg\max_{a} q_\pi(s,a)$


TODO: Pickup from here
