---
tags:
  - notes
  - cs188
comments: true
dg-publish: true
---

## Value Iteration

Considering finite horizons, the time-limited value of a state $s$ with a time limit of $k$ timesteps is denoted $U_k(s)$: the maximum expected utility attainable from $s$ if the MDP terminates in $k$ timesteps.

Value iteration is a dynamic programming algorithm that uses an iteratively longer time limit to compute time-limited values until convergence[^1], operating as follows:

  1. $\forall s \in S$, initialize $U_0(s) = 0$, since no actions can be taken to acquire rewards.
  2. Repeat the following update rule until convergence ($B$ is called the Bellman operator):

$$\forall s \in S,\quad U_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma\, U_k(s')] \qquad (\text{or}\quad U_{k+1} \leftarrow B\,U_k)$$

When convergence is reached, the Bellman equation will hold for every state: $\forall s \in S,\ U_{k+1}(s) = U_k(s) = U^*(s)$.
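A minimal value-iteration sketch in Python, assuming a hypothetical encoding (not from these notes) where `mdp[(s, a)]` is a list of `(s_next, prob, reward)` triples combining $T$ and $R$; a tolerance-based test stands in for exact convergence:

```python
def value_iteration(mdp, states, actions, gamma=0.9, tol=1e-8):
    U = {s: 0.0 for s in states}          # step 1: U_0(s) = 0 for every state
    while True:
        U_next = {}
        for s in states:
            # Bellman update: best action's expected discounted return
            U_next[s] = max(
                sum(p * (r + gamma * U[s2]) for s2, p, r in mdp.get((s, a), []))
                for a in actions
            )
        if all(abs(U_next[s] - U[s]) < tol for s in states):   # convergence check
            return U_next
        U = U_next
```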

## Q-value Iteration

Q-value iteration is a dynamic programming algorithm that computes time-limited Q-values. It is described by the following update:

$$Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')]$$
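Under the same hypothetical `mdp[(s, a)] -> [(s_next, prob, reward), ...]` encoding, a sketch of this update:

```python
def q_value_iteration(mdp, states, actions, gamma=0.9, tol=1e-8):
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q_0(s, a) = 0
    while True:
        Q_next = {}
        for s in states:
            for a in actions:
                # expected reward plus discounted value of the best next action
                Q_next[(s, a)] = sum(
                    p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p, r in mdp.get((s, a), [])
                )
        if all(abs(Q_next[k] - Q[k]) < tol for k in Q):
            return Q_next
        Q = Q_next
```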

## Policy Iteration

### policy extraction

Given the converged optimal values $U^*$, the optimal policy is extracted by choosing, in every state, the action with the highest expected utility:

$$\forall s\in S,\quad \pi^{*}(s)=\underset{a}{\operatorname{argmax}}\;Q^{*}(s,a)=\underset{a}{\operatorname{argmax}}\sum_{s'}T(s,a,s')\,[R(s,a,s')+\gamma\,U^{*}(s')]$$
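A sketch of policy extraction from converged values `U` (a dict), again with the hypothetical `mdp[(s, a)]` encoding:

```python
def extract_policy(mdp, states, actions, U, gamma=0.9):
    # one-step lookahead Q-value computed from U
    def q(s, a):
        return sum(p * (r + gamma * U[s2]) for s2, p, r in mdp.get((s, a), []))
    # pick the argmax action in every state
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}
```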

### policy evaluation

For a policy $\pi$, policy evaluation means computing $U^{\pi}(s)$ for all states $s$, where $U^{\pi}(s)$ is the expected utility of starting in state $s$ and following $\pi$:

$$U^{\pi}(s)=\sum_{s'}T(s,\pi(s),s')\,[R(s,\pi(s),s')+\gamma\,U^{\pi}(s')]$$
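Because the max over actions is gone, this is a system of $|S|$ linear equations in the $|S|$ unknowns $U^{\pi}(s)$, so it can be solved directly. A minimal sketch with numpy, assuming the hypothetical `mdp[(s, a)]` encoding and a policy given as a dict `pi[s] -> a`:

```python
import numpy as np

def evaluate_policy(mdp, states, pi, gamma=0.9):
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T_pi = np.zeros((n, n))   # T_pi[i, j] = T(s_i, pi(s_i), s_j)
    R_pi = np.zeros(n)        # expected one-step reward under pi
    for s in states:
        for s2, p, r in mdp.get((s, pi[s]), []):
            T_pi[idx[s], idx[s2]] += p
            R_pi[idx[s]] += p * r
    # U = R_pi + gamma * T_pi @ U  <=>  (I - gamma * T_pi) U = R_pi
    U = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: U[idx[s]] for s in states}
```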

### policy improvement

Once we’ve evaluated the current policy, we use policy improvement to generate a better policy.

Defining the policy at iteration $i$ of policy iteration as $\pi_i$, we first evaluate it by iterating the fixed-policy Bellman update:

$$U_{k+1}^{\pi_{i}}(s)\leftarrow\sum_{s'}T(s,\pi_{i}(s),s')\,[R(s,\pi_{i}(s),s')+\gamma\,U_{k}^{\pi_{i}}(s')]$$

Finally, improve the policy with:

$$\pi_{i+1}(s)=\underset{a}{\operatorname{argmax}}\sum_{s'}T(s,a,s')\,[R(s,a,s')+\gamma\,U^{\pi_{i}}(s')]$$
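A minimal sketch tying the two steps together, under the same hypothetical `mdp` encoding; the arbitrary initial policy and the tolerance are illustrative choices, not part of the notes:

```python
def policy_iteration(mdp, states, actions, gamma=0.9, tol=1e-8):
    pi = {s: actions[0] for s in states}           # arbitrary initial policy
    while True:
        # policy evaluation: iterate the fixed-policy Bellman update to convergence
        U = {s: 0.0 for s in states}
        while True:
            U_next = {
                s: sum(p * (r + gamma * U[s2])
                       for s2, p, r in mdp.get((s, pi[s]), []))
                for s in states
            }
            converged = all(abs(U_next[s] - U[s]) < tol for s in states)
            U = U_next
            if converged:
                break
        # policy improvement: greedy one-step lookahead on U^{pi_i}
        def q(s, a):
            return sum(p * (r + gamma * U[s2]) for s2, p, r in mdp.get((s, a), []))
        pi_next = {s: max(actions, key=lambda a: q(s, a)) for s in states}
        if pi_next == pi:                          # policy is stable: done
            return pi, U
        pi = pi_next
```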

## summary

link

[^1]: Convergence means that $\forall s \in S,\ U_{k+1}(s) = U_{k}(s)$.