| tags | comments | dg-publish |
| --- | --- | --- |
|  | true | true |
Considering finite horizons, the time-limited value for a state $s$ with a time limit of $k$ timesteps is denoted $U_k(s)$.
Value iteration is a dynamic programming algorithm that uses an iteratively longer time limit to compute time-limited values until convergence[^1], operating as follows:
- For all states $s \in S$, initialize $U_0(s) = 0$, since with a time limit of zero no actions can be taken to acquire rewards.
- Repeat the following update rule until convergence (written compactly as $U_{k+1} \leftarrow B U_k$, where $B$ is called the Bellman operator):

$$\forall s\in S:\ U_{k+1}(s)\leftarrow\max_{a}\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma U_{k}(s^{\prime})\right]$$
When convergence is reached, the Bellman equation will hold for every state:

$$\forall s\in S:\ U^{*}(s)=\max_{a}\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma U^{*}(s^{\prime})\right]$$
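To make the procedure concrete, here is a minimal Python sketch of value iteration. It assumes the MDP is given as a collection `states`, a function `actions(s)` returning the legal actions in `s`, and functions `T(s, a, s2)` and `R(s, a, s2)` for the transition model and rewards; these names are illustrative assumptions, not something defined in this note.

```python
# Minimal value iteration sketch, assuming every state has at least one action.
def value_iteration(states, actions, T, R, gamma, tol=1e-8):
    """Return a dict of converged values U*(s)."""
    U = {s: 0.0 for s in states}  # U_0(s) = 0: with no time left, no reward can be earned
    while True:
        U_next = {}
        for s in states:
            # Bellman update: U_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma * U_k(s')]
            U_next[s] = max(
                sum(T(s, a, s2) * (R(s, a, s2) + gamma * U[s2]) for s2 in states)
                for a in actions(s)
            )
        # Convergence: U_{k+1}(s) (approximately) equals U_k(s) for every state
        if max(abs(U_next[s] - U[s]) for s in states) < tol:
            return U_next
        U = U_next
```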
Q-value iteration is a dynamic programming algorithm that computes time-limited Q-values. It is described by the following update rule:

$$\forall s\in S,\ a\in A:\ Q_{k+1}(s,a)\leftarrow\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma\max_{a^{\prime}}Q_{k}(s^{\prime},a^{\prime})\right]$$

Once the optimal values or Q-values have converged, policy extraction recovers the optimal policy:
$$\forall s\in S:\ \pi^{*}(s)=\underset{a}{\operatorname{argmax}}\ Q^{*}(s,a)=\underset{a}{\operatorname{argmax}}\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma U^{*}(s^{\prime})\right]$$
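As an illustration, here is a minimal Python sketch of Q-value iteration followed by policy extraction, under the same assumed MDP interface as the value iteration sketch above (`states`, `actions`, `T`, `R`, `gamma` are illustrative names, not part of the note).

```python
# Minimal Q-value iteration sketch; Q is a dict keyed by (state, action).
def q_value_iteration(states, actions, T, R, gamma, tol=1e-8):
    """Return converged Q-values Q*(s, a)."""
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    while True:
        Q_next = {}
        for s in states:
            for a in actions(s):
                # Q_{k+1}(s,a) = sum_{s'} T [R + gamma * max_{a'} Q_k(s', a')]
                Q_next[(s, a)] = sum(
                    T(s, a, s2) * (R(s, a, s2)
                                   + gamma * max(Q[(s2, a2)] for a2 in actions(s2)))
                    for s2 in states
                )
        if max(abs(Q_next[k] - Q[k]) for k in Q) < tol:
            return Q_next
        Q = Q_next

def extract_policy(states, actions, Q):
    """Policy extraction: pi*(s) = argmax_a Q*(s, a)."""
    return {s: max(actions(s), key=lambda a: Q[(s, a)]) for s in states}
```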
For a policy $\pi$, policy evaluation means computing $U^{\pi}(s)$, the expected utility of starting in state $s$ and acting according to $\pi$ thereafter.
Once we’ve evaluated the current policy, use policy improvement to generate a better policy.
Define the policy at iteration $i$ of policy iteration as $\pi_i$. Policy evaluation then repeatedly applies the fixed-policy update until convergence:
$$U_{k+1}^{\pi_i}(s)\leftarrow\sum_{s^{\prime}}T(s,\pi_i(s),s^{\prime})\left[R(s,\pi_i(s),s^{\prime})+\gamma U_{k}^{\pi_i}(s^{\prime})\right]$$
Finally, improve the policy with:
$$\pi_{i+1}(s)=\underset{a}{\operatorname{argmax}}\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma U^{\pi_i}(s^{\prime})\right]$$
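Putting policy evaluation and policy improvement together, here is a minimal Python sketch of policy iteration under the same assumed MDP interface (`states`, `actions`, `T`, `R`, `gamma` are illustrative names).

```python
# Minimal policy iteration sketch: evaluate the current policy, then improve it greedily.
def policy_evaluation(pi, states, T, R, gamma, tol=1e-8):
    """Iterate the fixed-policy update until U^pi converges."""
    U = {s: 0.0 for s in states}
    while True:
        U_next = {
            s: sum(T(s, pi[s], s2) * (R(s, pi[s], s2) + gamma * U[s2]) for s2 in states)
            for s in states
        }
        if max(abs(U_next[s] - U[s]) for s in states) < tol:
            return U_next
        U = U_next

def policy_iteration(states, actions, T, R, gamma):
    pi = {s: next(iter(actions(s))) for s in states}  # arbitrary initial policy
    while True:
        U = policy_evaluation(pi, states, T, R, gamma)
        # Policy improvement: greedy one-step lookahead on U^pi
        pi_next = {
            s: max(actions(s),
                   key=lambda a: sum(T(s, a, s2) * (R(s, a, s2) + gamma * U[s2])
                                     for s2 in states))
            for s in states
        }
        if pi_next == pi:  # policy is stable, hence optimal
            return pi, U
        pi = pi_next
```

The loop terminates when the improved policy is identical to the current one, at which point no one-step lookahead can do better and the policy is optimal.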
[^1]: Convergence means that $\forall s \in S,\ U_{k+1}(s) = U_{k}(s)$.