# On-policy Control with Approximation

Last time we spoke about:

$\hat v(s,w)\approx v_{\pi}(s)$

Now, we speak about

$\hat q(s,a,w)\approx q_{\pi}(s,a)$

This is a natural extension. The new ingredient today is the **average reward** setting, which replaces the **discounted reward** setting for continuing tasks.


## Episodic Semi-gradient Methods

Last time we had the following update for semi-gradient TD(0):

$$w_{t+1} = w_{t}+\alpha\left(R_{t+1}+\gamma\hat v(S_{t+1},w_t)-\hat v(S_{t},w_t)\right)\nabla \hat v (S_{t},w_t)$$

Replacing $\hat v$ by $\hat q(\cdot,\cdot,w)$ and bootstrapping from the next state-action pair gives semi-gradient Sarsa:
$$w_{t+1} = w_{t}+\alpha\left(R_{t+1}+\gamma\hat q(S_{t+1},A_{t+1},w_t)-\hat q(S_{t},A_t,w_t)\right)\nabla \hat q (S_{t},A_t,w_t)$$

Question:

* If we have learned $\hat q(\cdot,\cdot,w)$, what would be the action selection mechanism?

<img src="https://i.stack.imgur.com/nQotE.png" width="65%"/>


In [6]:
import gym
env = gym.envs.make('MountainCar-v0')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


In [7]:
env.action_space

Discrete(3)

In [11]:
env.observation_space

Box(2,)

In [12]:
env.observation_space.low

array([-1.20000005, -0.07      ], dtype=float32)

In [13]:
env.observation_space.high

array([ 0.60000002,  0.07      ], dtype=float32)

In [14]:
import numpy as np
def features(s):
    return np.array([1,s[0],s[1],s[0]*s[1],s[0]**2,s[1]**2])

In [18]:
x_high = features(env.observation_space.high)
x_high

array([ 1.        ,  0.60000002,  0.07      ,  0.042     ,  0.36000003,
        0.0049    ])

In [19]:
n_features = len(x_high)
n_features

6

In [92]:
w = 0*np.random.randn(env.action_space.n,n_features)
w

array([[ 0., -0.,  0., -0., -0., -0.],
       [-0.,  0.,  0.,  0., -0.,  0.],
       [ 0.,  0.,  0.,  0., -0., -0.]])

In [93]:
gamma = 0.9
alpha = 0.01
epsilon = 0.1

In [94]:
def epsilon_greedy(values,epsilon):
    n = len(values)
    if np.random.random()<epsilon:
        return np.random.randint(n)
    else:
        return values.argmax()
epsilon_greedy(np.array([1,2,3]),0.1)

2

In [95]:
w.shape

(3, 6)

In [96]:
for i_episode in range(1000):
    observation = env.reset()
    values = np.matmul(w,features(observation))
    action = epsilon_greedy(values,epsilon)
    for t in range(1000):
        observation_new, reward, done, info = env.step(action)
        if done:
            # terminal step: the target is the reward alone, no bootstrap term
            w[action,:] += alpha*(reward-values[action])*features(observation)
            print(t)
            break
        # evaluate the next state once and pick the next action from those values
        values_new = np.matmul(w,features(observation_new))
        action_new = epsilon_greedy(values_new,epsilon)
        # semi-gradient Sarsa: only the weights of the action taken are updated
        w[action,:] += alpha*(reward+gamma*values_new[action_new]-values[action])*features(observation)
        observation = observation_new
        action = action_new
        values = values_new

199
199
199
...

Every episode in the recorded output ends at step 199, which corresponds to the 200-step time limit of `MountainCar-v0`, so the car apparently never reached the goal in this run.


## $n$-step Semi-gradient Sarsa

We can also use the approximation $\hat q$ to bootstrap after more than one step, i.e.

$$
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1}R_{t+n} +\gamma^{n}\hat q(S_{t+n},A_{t+n},w)
$$
The update is then practically the same, just using a different target:
$$
w_{t+n} = w_{t+n-1}+\alpha\left(G_{t:t+n}-\hat q(S_{t},A_t,w_{t+n-1})\right)\nabla \hat q (S_{t},A_t,w_{t+n-1})
$$

<img src="https://jaydottechdotblog.files.wordpress.com/2017/01/rl-episodic-semi-gradient-n-step-sarsa-for-estimatimation-algorithm.png?w=730" width="65%"/>
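The $n$-step target and the corresponding weight update can be sketched as follows. This is only an illustration under assumptions: $\hat q$ is taken to be linear with one weight row per action (as in the Mountain Car code above), and the helper names `n_step_return` and `n_step_sarsa_update` are hypothetical, not part of any library.

```python
import numpy as np

def n_step_return(rewards, gamma, q_bootstrap):
    """G_{t:t+n} from the rewards R_{t+1..t+n} and the bootstrap value
    q_hat(S_{t+n}, A_{t+n}, w). Hypothetical helper for illustration."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)          # 1, gamma, ..., gamma^(n-1)
    return float(discounts @ rewards + gamma ** n * q_bootstrap)

def n_step_sarsa_update(w, x_t, action, G, alpha):
    """One semi-gradient update towards the target G.
    Assumes linear q_hat(s, a, w) = w[a] @ x(s), so the gradient
    with respect to w[a] is simply x(s)."""
    w = w.copy()
    w[action] += alpha * (G - w[action] @ x_t) * x_t
    return w

# Example: three rewards of -1, gamma = 0.9, bootstrap value 2.0
G = n_step_return([-1.0, -1.0, -1.0], gamma=0.9, q_bootstrap=2.0)
w_new = n_step_sarsa_update(np.zeros((3, 2)), np.array([1.0, 0.5]),
                            action=1, G=G, alpha=0.1)
```

With $n=1$ the target reduces to the one-step Sarsa target $R_{t+1}+\gamma\hat q(S_{t+1},A_{t+1},w)$.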

## Average Reward: A New Problem for Continuing Tasks

How do we accumulate reward?

* Simple sum (episodic tasks only)
* Discounted reward ($\gamma$ needs to be specified)
* **Average reward**

$$
\begin{align}
r(\pi) & = & \lim_{h\to\infty}\frac{1}{h}\sum_{t=1}^{h} \mathbb{E}[R_t | A_{0:t-1}\sim \pi] \\
 & = & \lim_{t\to\infty} \mathbb{E}[R_t|A_{0:t-1}\sim \pi] \\
 & = & \sum_s \mu_\pi(s) \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)r
\end{align}
$$

where $\mu_\pi(s)$ is the steady-state distribution
$$
\mu_\pi(s) = \lim_{t\to\infty} \textrm{Pr}\{S_t = s|A_{0:t-1}\sim \pi\}
$$

which is assumed to exist and to be independent of $S_0$. This assumption is called *ergodicity*.

Important property of steady-state distributions:
$$
\sum_s \mu_\pi(s) \sum_a \pi(a|s) \sum_{r} p(s',r|s,a) = \mu_\pi(s')
$$
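A small numerical check may make the steady-state distribution and $r(\pi)$ more concrete. The 3-state chain below is hypothetical: `P[s, s2]` stands for $\sum_a \pi(a|s)\sum_r p(s_2,r|s,a)$ under some fixed policy, and `r[s]` for the expected one-step reward from state $s$. Neither comes from the lecture.

```python
import numpy as np

# Hypothetical state-transition matrix under a fixed policy pi
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
# Hypothetical expected one-step reward from each state
r = np.array([0.0, 1.0, 2.0])

def stationary_distribution(P, start, n_iter=1000):
    """Propagate a start distribution through the chain many times."""
    mu = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        mu = mu @ P
    return mu

# Two different start distributions lead to the same limit (ergodicity)
mu = stationary_distribution(P, [1.0, 0.0, 0.0])
mu_other = stationary_distribution(P, [0.0, 0.0, 1.0])

# Average reward: r(pi) = sum_s mu(s) * expected reward in s
avg_reward = mu @ r
```

For this chain the limit is $\mu_\pi = (5/7, 1/7, 1/7)$ and $r(\pi)=3/7$. Applying `P` once more leaves `mu` unchanged, which is exactly the steady-state property above.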

Question:

* What can we use $r(\pi)$ for?

We can introduce **differential return** and differential value functions $v$ and $q$:
$$
G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + \dots
$$

Bellman equations for the differential case (no $\gamma$; each reward is replaced by its difference from the average reward):
$$
v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\left[r-r(\pi)+v_{\pi}(s')\right]
$$


$$
q^{*}(s,a) =  \sum_{s',r} p(s',r|s,a)\left[r-\max_{\pi}r(\pi)+\max_{a'} q^{*}(s',a')\right]
$$
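For a small chain the differential Bellman equation for $v_\pi$ can be solved directly as a linear system. The sketch below uses a hypothetical 3-state chain (transition matrix `P` under a fixed policy, expected rewards `r`, stationary distribution `mu`), none of which comes from the lecture. Because the equation determines $v_\pi$ only up to an additive constant, the normalization $\sum_s \mu_\pi(s)v_\pi(s)=0$ is added here as an assumption.

```python
import numpy as np

# Hypothetical chain under a fixed policy (illustration only)
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([0.0, 1.0, 2.0])           # expected one-step rewards
mu = np.array([5.0, 1.0, 1.0]) / 7.0    # stationary distribution: mu @ P == mu
r_bar = mu @ r                          # average reward r(pi)

# Differential Bellman equation in matrix form: (I - P) v = r - r_bar.
# I - P is singular, so append the normalization row mu @ v = 0.
A = np.vstack([np.eye(3) - P, mu])
b = np.append(r - r_bar, 0.0)
v, *_ = np.linalg.lstsq(A, b, rcond=None)

# Residual of v = r - r_bar + P v, should be numerically zero
residual = v - (r - r_bar + P @ v)
```

States with above-average expected reward ahead of them get positive differential values, the others negative ones.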


Question:

* What about $q_{\pi}$ and $v^{*}$?

TD error:
$$
\delta_t = R_{t+1}-\bar R_{t+1} + \hat q(S_{t+1},A_{t+1})-\hat q(S_{t},A_{t})
$$
where $\bar R_{t+1}$ is the recursive estimate of $r(\pi)$.

The semi-gradient update is then:
$$
w_{t+1}=w_{t}+\alpha \delta_t \nabla \hat q(S_t,A_t,w_t)
$$
<img src="https://i.stack.imgur.com/W38yU.jpg" width="65%" />
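One step of differential semi-gradient Sarsa might look like the sketch below. It assumes a linear $\hat q$ with one weight row per action, and it assumes that the average-reward estimate is updated recursively as $\bar R \leftarrow \bar R + \beta\,\delta_t$. The step size `beta` and the function name `differential_sarsa_step` are illustrative choices, not taken from these notes.

```python
import numpy as np

def differential_sarsa_step(w, r_bar, x, a, reward, x_next, a_next, alpha, beta):
    """One differential semi-gradient Sarsa update (illustrative sketch).

    w      : weights, shape (n_actions, n_features); q_hat(s, a) = w[a] @ x(s)
    r_bar  : current estimate of the average reward r(pi)
    x, a   : features of S_t and the action A_t
    x_next, a_next : features of S_{t+1} and the action A_{t+1}
    """
    # TD error without discounting: reward relative to the average reward
    delta = reward - r_bar + w[a_next] @ x_next - w[a] @ x
    # Assumed recursive estimate of the average reward
    r_bar = r_bar + beta * delta
    # Semi-gradient step on the weights of the action taken
    w = w.copy()
    w[a] += alpha * delta * x
    return w, r_bar, delta

w0 = np.zeros((2, 2))
w1, r_bar1, delta1 = differential_sarsa_step(
    w0, r_bar=0.0, x=np.array([1.0, 0.0]), a=0, reward=1.0,
    x_next=np.array([0.0, 1.0]), a_next=1, alpha=0.5, beta=0.1)
```

Note that no $\gamma$ appears anywhere: the subtraction of $\bar R$ is what keeps the value estimates bounded in a continuing task.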

## Deprecating the Discounted Setting

## $n$-step Differential Semi-gradient Sarsa

$$
G_{t:t+n}= R_{t+1}-\bar R_{t+1}+\dots+R_{t+n}-\bar R_{t+n} + \hat q(S_{t+n},A_{t+n})
$$

$$
\delta_t=G_{t:t+n}-\hat q(S_t,A_t) 
$$
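The differential $n$-step return and its TD error can be computed as in the following sketch. The helper name `differential_n_step_return` is hypothetical, and the list `r_bars` is assumed to hold the average-reward estimates $\bar R_{t+1},\dots,\bar R_{t+n}$ that were current at each step.

```python
import numpy as np

def differential_n_step_return(rewards, r_bars, q_bootstrap):
    """G_{t:t+n} = sum_k (R_{t+k} - Rbar_{t+k}) + q_hat(S_{t+n}, A_{t+n}).
    Hypothetical helper; no discounting is applied."""
    rewards = np.asarray(rewards, dtype=float)
    r_bars = np.asarray(r_bars, dtype=float)
    return float(np.sum(rewards - r_bars) + q_bootstrap)

# Example: rewards 1, 2, 3 against a constant estimate of 2, bootstrap 0.5
G = differential_n_step_return([1.0, 2.0, 3.0], [2.0, 2.0, 2.0], q_bootstrap=0.5)
q_t = 0.2                      # hypothetical value of q_hat(S_t, A_t)
delta = G - q_t                # TD error used in the weight update
```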


