# 1. Intro to Dynamic Programming and Iterative Policy Evaluation
We are now going to start looking at solutions to MDPs. As we saw in the last section, the centerpiece of the discussion is the **Bellman Equation**:

#### <span style="color:#0000cc">$$\text{Bellman Equation} \rightarrow V_\pi(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma V_\pi(s') \Big \}$$</span>

In fact, the Bellman equation can be used directly to solve for the value function. If you look carefully, you will see that this is actually a set of $|S|$ equations with $|S|$ unknowns, one for each state. Because every equation is linear in the unknowns $V_\pi(s)$, the system is not too difficult to solve. In addition, a lot of the matrix entries will be zero, since the state transitions will most likely be sparse.
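To make the linear structure explicit, here is a short sketch in matrix form. The shorthand $R_\pi$ (expected one-step reward under $\pi$) and $P_\pi$ (state-to-state transition matrix under $\pi$) is introduced here for illustration and is not notation used elsewhere in this section:

#### $$ R_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'}\sum_r p(s',r \mid s,a) \, r, \qquad P_\pi(s, s') = \sum_a \pi(a \mid s) \sum_r p(s',r \mid s,a) $$

#### $$ V_\pi = R_\pi + \gamma P_\pi V_\pi \quad \Longrightarrow \quad V_\pi = (I - \gamma P_\pi)^{-1} R_\pi $$

Here $V_\pi$ and $R_\pi$ are vectors with one entry per state. The inverse exists whenever $\gamma < 1$, because every row of $P_\pi$ sums to at most 1.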

However, this is _not_ the approach we will take. Instead, we will do what is called **iterative policy evaluation**. 

## 1.1 Iterative Policy Evaluation
What exactly is iterative policy evaluation? Essentially, it means that we apply the Bellman equation as an update rule again and again until the values converge. We can write down the algorithm in pseudocode as follows:

---

$
\text{def iterative_policy_evaluation}(\pi)\text{:} \\
\hspace{1cm} \text{initialize V(s) = 0 for all s} \in \text{S} \\
\hspace{1cm} \text{while True:} \\
\hspace{2cm} \Delta = 0 \\
\hspace{2cm} \text{for each s} \in \text{S:} \\
\hspace{3cm} \text{old_v = V(s)} \\
\hspace{3cm} V(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma V(s') \Big \}\\
\hspace{3cm} \Delta = \text{max(} \Delta \text{, |V(s) - old_v|)} \\
\hspace{2cm} \text{if } \Delta \text{ < threshold: break} \\
\hspace{1cm} \text{return V(s)}
$

---

We can see above that the input to iterative policy evaluation is a policy, $\pi$, and the output is the value function for that particular policy. It works as follows:
> * We start by initializing $V(s)$ to 0 for all states.
> * Then, inside a loop, we reset a variable called $\Delta$ to 0 at the start of each iteration. $\Delta$ tracks the maximum change in $V(s)$ during the current iteration and is used to decide when to quit: once it falls below a small threshold, we break out of the loop.
> * For every state in the state space, we keep a copy of the old $V(s)$ and then use the Bellman equation to update $V(s)$.
> * We set $\Delta$ to the largest change in $V(s)$ seen so far in that iteration.
> * Once $\Delta$ falls below the threshold, we return $V(s)$.
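The steps above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the dictionary layout for `policy` and `dynamics`, the default `gamma` and `threshold` values, and the three-state chain used as a test case are all assumptions made for this example.

```python
def iterative_policy_evaluation(policy, dynamics, gamma=0.9, threshold=1e-8):
    """Estimate V_pi by repeatedly applying the Bellman equation.

    Assumed (hypothetical) data layout:
      policy[s][a]   -> probability pi(a | s)
      dynamics[s][a] -> list of (probability, next_state, reward) triples,
                        i.e. the entries of p(s', r | s, a)
    Terminal states have no actions, so their value stays at 0.
    """
    V = {s: 0.0 for s in dynamics}  # initialize V(s) = 0 for all s
    while True:
        delta = 0.0
        for s in dynamics:
            old_v = V[s]
            new_v = 0.0
            for a, pi_a in policy.get(s, {}).items():
                for prob, s_next, reward in dynamics[s][a]:
                    new_v += pi_a * prob * (reward + gamma * V[s_next])
            V[s] = new_v
            delta = max(delta, abs(V[s] - old_v))
        if delta < threshold:
            break
    return V


# Hypothetical chain A -> B -> T, where T is terminal and entering T pays 1.
dynamics = {
    "A": {"go": [(1.0, "B", 0.0)]},
    "B": {"go": [(1.0, "T", 1.0)]},
    "T": {},
}
policy = {"A": {"go": 1.0}, "B": {"go": 1.0}}

V = iterative_policy_evaluation(policy, dynamics, gamma=0.9)
print(V)
```

In this toy chain the result can be checked by hand: $V(B) = 1$ because the only reward arrives on the next step, and $V(A) = \gamma \cdot V(B) = 0.9$.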

The main point of interest in this algorithm is, of course, the part that contains the Bellman equation. Notice that, strictly speaking, the values at iteration $k+1$ depend only on the values at iteration $k$:

#### $$ V_{k+1}(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma V_{k}(s') \Big \}$$

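However, this need not be the case. In fact, we can always just use the most up-to-date value of each state, overwriting $V(s)$ in place as soon as it is computed, so that states updated later in the same iteration already see the new values. This is what the pseudocode above does, and it usually ends up converging faster.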
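The difference between the two update styles can be seen in a small experiment. The sketch below is only an illustration under stated assumptions: the function name `evaluate`, the list-based encoding of a deterministic chain, and the chain itself are all hypothetical. It counts how many passes over the state space ("sweeps") each variant needs.

```python
def evaluate(next_state, reward, gamma, in_place, threshold=1e-8):
    """Evaluate a deterministic chain; return (values, number of sweeps).

    next_state[s] is the successor of s (None marks a terminal state) and
    reward[s] is the reward received on leaving s.
    """
    n = len(next_state)
    V = [0.0] * n
    sweeps = 0
    while True:
        sweeps += 1
        # Synchronous: read from a frozen copy (V_k). In place: read from V itself.
        source = V if in_place else list(V)
        delta = 0.0
        for s in range(n):
            if next_state[s] is None:  # terminal state keeps value 0
                continue
            new_v = reward[s] + gamma * source[next_state[s]]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < threshold:
            return V, sweeps


# Hypothetical chain 4 -> 3 -> 2 -> 1 -> 0, with state 0 terminal
# and a reward of 1 for the step from state 1 into state 0.
next_state = [None, 0, 1, 2, 3]
reward = [0.0, 1.0, 0.0, 0.0, 0.0]

V_sync, sweeps_sync = evaluate(next_state, reward, gamma=0.9, in_place=False)
V_inpl, sweeps_inpl = evaluate(next_state, reward, gamma=0.9, in_place=True)
print(sweeps_sync, sweeps_inpl)  # prints: 5 2
```

Both variants reach the same values. The in-place version needs fewer sweeps here because the states are visited in an order that lets each new value be reused immediately; with a less favorable ordering the gain would be smaller.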

## 1.2 Definitions
A final note: we generally call the task of finding the value function for a given policy the _**prediction problem**_. Finding the optimal policy is known as the _**control problem**_, and we will soon learn an algorithm for it.