# Dynamic Programming

The term dynamic programming (DP) refers to a collection of algorithms that can be  
used to compute optimal policies given a perfect model of the environment as a Markov  
decision process (MDP).

Classical DP algorithms are of limited utility in reinforcement  
learning both because of their assumption of a perfect model and because of their great  
computational expense, but they are still important theoretically.

We usually assume that the environment is a finite MDP. That is, we assume that its  
state, action, and reward sets, $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$, are finite, and that its dynamics are given by a  
set of probabilities $p(s', r \mid s, a)$, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, $r \in \mathcal{R}$, and $s' \in \mathcal{S}^+$ ($\mathcal{S}^+$ is $\mathcal{S}$ plus a  
terminal state if the problem is episodic).
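
As a rough illustration of what "dynamics given by a set of probabilities" can look like in code, the sketch below stores $p(s', r \mid s, a)$ as a plain dictionary. The two-state MDP, the state and action names, and the helper `is_valid_dynamics` are all hypothetical and chosen only for this example.

```python
from math import isclose

# Hypothetical episodic MDP with states "s0", "s1" and terminal state "T".
# dynamics[(s, a)] lists (probability, next_state, reward) triples,
# i.e. the nonzero entries of p(s', r | s, a).
dynamics = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"): [(1.0, "T", 5.0)],
}


def is_valid_dynamics(dynamics):
    """Return True if the probabilities sum to 1 for every (s, a) pair."""
    return all(
        isclose(sum(prob for prob, _, _ in outcomes), 1.0)
        for outcomes in dynamics.values()
    )


print(is_valid_dynamics(dynamics))
```

A dictionary keyed by `(state, action)` is only one possible layout; for larger problems a dense array indexed by state and action would be a common alternative.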

The key idea of DP, and of reinforcement learning generally, is the use of value functions  
to organize and structure the search for good policies.

We can easily obtain optimal policies once we have found the optimal value functions, $v_*$ or
$q_*$, which satisfy the Bellman optimality equations:

$
\begin{aligned}
v_*(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\
       &= \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_*(s')]
\end{aligned}
$

$
\begin{aligned}
q_*(s, a) &= \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a] \\
          &= \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma \max_{a'} q_*(s', a')]
\end{aligned}
$
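
To make concrete the claim that optimal policies follow easily from an optimal value function, here is a minimal sketch of a one-step lookahead that picks, in each state, the action maximizing $\sum_{s', r} p(s', r \mid s, a)[r + \gamma v_*(s')]$. The tiny deterministic MDP, the values in `v_star` (worked out by hand for $\gamma = 0.9$), and the function name `greedy_policy` are assumptions made for illustration.

```python
# Hypothetical MDP: from "A" you can quit ("left") or move to "B" ("right");
# from "B" the only action exits to the terminal state "T" with reward 1.
# dynamics[(s, a)] lists (probability, next_state, reward) triples.
dynamics = {
    ("A", "left"): [(1.0, "T", 0.0)],
    ("A", "right"): [(1.0, "B", 0.0)],
    ("B", "exit"): [(1.0, "T", 1.0)],
}

# Optimal state values for gamma = 0.9, computed by hand for this example.
v_star = {"A": 0.9, "B": 1.0, "T": 0.0}


def greedy_policy(dynamics, v, gamma):
    """Pick, for each state, the action with the best one-step lookahead."""
    policy = {}
    best_value = {}
    for (s, a), outcomes in dynamics.items():
        # Expected return of taking a in s and then following v afterwards.
        q = sum(p * (r + gamma * v[s_next]) for p, s_next, r in outcomes)
        if s not in best_value or q > best_value[s]:
            best_value[s] = q
            policy[s] = a
    return policy


print(greedy_policy(dynamics, v_star, gamma=0.9))
```

With $q_*$ available instead of $v_*$, the lookahead through the dynamics would not be needed: the policy could simply take the action with the largest $q_*(s, a)$ in each state.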

As we shall see, DP algorithms are obtained by turning Bellman equations such as these into assignments,  
that is, into update rules for improving approximations of the desired value functions.
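
The idea of turning a Bellman equation into an assignment can be sketched as follows: the right-hand side of the optimality equation for $v_*$ is evaluated with the current estimate and the result is written back into that estimate. This is only an illustrative sketch; the small MDP, the in-place sweep order, the stopping threshold, and the name `bellman_sweep` are assumptions rather than a prescribed algorithm.

```python
# Hypothetical MDP; dynamics[(s, a)] lists (probability, next_state, reward).
dynamics = {
    ("A", "left"): [(1.0, "T", 0.0)],
    ("A", "right"): [(1.0, "B", 0.0)],
    ("B", "exit"): [(1.0, "T", 1.0)],
}


def bellman_sweep(dynamics, v, gamma):
    """Apply v(s) <- max_a sum p(s', r | s, a) [r + gamma v(s')] to every
    non-terminal state, in place. Returns the largest change made."""
    states = sorted({s for s, _ in dynamics})
    max_change = 0.0
    for s in states:
        new_value = max(
            sum(p * (r + gamma * v[s_next]) for p, s_next, r in outcomes)
            for (s_from, _), outcomes in dynamics.items()
            if s_from == s
        )
        max_change = max(max_change, abs(new_value - v[s]))
        v[s] = new_value  # the Bellman equation used as an assignment
    return max_change


# Start from an arbitrary estimate; the terminal state "T" stays at 0.
v = {"A": 0.0, "B": 0.0, "T": 0.0}
while bellman_sweep(dynamics, v, gamma=0.9) > 1e-12:
    pass
print(v)
```

On this example the estimates settle after a few sweeps at the values that satisfy the optimality equation, 0.9 for `"A"` and 1.0 for `"B"`.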

## Policy Evaluation (Prediction)

First we consider how to compute the state-value function $v_\pi$ for an arbitrary policy $\pi$.
This is called **policy evaluation** (or the **prediction problem**) in the DP literature.