# Deep SARSA : On-policy Prediction & Control Using NN

The goal is **approximating** $v_\pi$ from experience generated using a known policy $\pi$. The approximate value function is represented not as a table but as a **parameterized functional form with weight vector** $\mathbf{w} \in \mathbb{R}^d$.

What function approximation cannot do, however, is **augment the state representation with memories of past observations**.

## 1. On-policy Prediction with Approximation

### 1.1. The Prediction Objective (VE)

By assumption, we have **far more states than weights**, so making one state's estimate more accurate invariably means making others' less accurate. We must specify a state distribution $\mu(s) \ge 0, \sum_s \mu(s) = 1$, representing how much we care about the error in each state $s$.

$$
\text{VE}(w) \overset{def}{\equiv} \sum_{s \in S} \mu (s) \Big[ v_{\pi} (s) - \hat{v} (s, w) \Big] ^2
$$
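As a minimal sketch of how $\text{VE}$ is evaluated, the snippet below computes the weighted squared error for a hypothetical 3-state problem. The numbers in `mu`, `v_pi`, and `v_hat` are invented for illustration; in practice $v_\pi$ is unknown and $\text{VE}$ can only be estimated.

```python
def value_error(mu, v_pi, v_hat):
    """Weighted squared error VE = sum_s mu(s) * (v_pi(s) - v_hat(s))^2."""
    return sum(m * (v - vh) ** 2 for m, v, vh in zip(mu, v_pi, v_hat))

# Hypothetical 3-state example: all numbers are made up for illustration.
mu = [0.5, 0.3, 0.2]      # state weighting, non-negative and summing to 1
v_pi = [1.0, 2.0, 3.0]    # "true" values (normally unknown)
v_hat = [1.0, 1.0, 5.0]   # approximate values

ve = value_error(mu, v_pi, v_hat)
print(ve)
```

The error in the third state is the largest, but it contributes less than its raw size suggests because $\mu$ gives that state a small weight.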

### Comparison of estimators

1. **Maximum MSE** : $\max_{\theta \in \Omega} \mathbb{E}_\theta \Big[ ( \hat{\eta} - \eta(\theta) )^2 \Big]$. The minimax estimator $\hat{\eta}^{*}$ minimizes this worst-case error:

$$
\max_{\theta \in \Omega} \mathbb{E}_\theta \Big[ ( \hat{\eta}^{*} - \eta(\theta))^2 \Big] = \min_{\hat{\eta}} \max_{\theta \in \Omega} \mathbb{E}_\theta \Big[ (\hat{\eta} - \eta(\theta))^2 \Big]
$$

2. **Bayesian MSE** : average of the MSE weighted by a **prior density** $\pi$, a function on $\Omega$ (here $\pi$ denotes the prior, not the policy)

$$
r(\pi, \hat{\eta}) = \int_{\Omega} \mathbb{E}_\theta \Big[ ( \hat{\eta} - \eta(\theta) )^2 \Big] \pi(\theta) d \theta
$$

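**When unbiased** : the MSE equals the variance, so the unbiased estimator that minimizes $\text{Var}_\theta (\hat{\eta}^{\text{UE}})$ for every $\theta$ is the **UMVUE** (uniformly minimum-variance unbiased estimator)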

$$
\text{MSE} (\hat{\eta}^{\text{UE}}, \theta) = \mathbb{E}_\theta \Big[ (\hat{\eta}^{\text{UE}} - \eta(\theta) )^2 \Big] = \mathbb{E}_\theta \Big[ (\hat{\eta}^{\text{UE}} - \mathbb{E}_\theta \hat{\eta}^{\text{UE}} )^2 \Big] = \text{Var}_\theta (\hat{\eta}^{\text{UE}})
$$
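The identity above can be checked numerically. The sketch below is a small simulation under assumed settings (Gaussian data, the sample mean as the estimator, `theta = 2.0`), none of which come from the text. It compares the empirical MSE with the empirical variance plus squared bias.

```python
import random
import statistics

random.seed(0)
theta = 2.0            # true parameter (assumed for the simulation)
n, trials = 10, 2000   # samples per estimate, number of repeated experiments

# Each trial: draw n Gaussian samples and estimate theta by the sample mean.
estimates = [
    statistics.fmean(random.gauss(theta, 1.0) for _ in range(n))
    for _ in range(trials)
]

mse = statistics.fmean((e - theta) ** 2 for e in estimates)
bias = statistics.fmean(estimates) - theta
var = statistics.pvariance(estimates)

# MSE = variance + bias^2; for an unbiased estimator the bias term is near 0,
# so the MSE is essentially the variance (about 1/n here).
print(mse, var + bias ** 2, bias)
```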

### 1.2. Stochastic-gradient and Semi-gradient Methods

### Stochastic gradient methods (Linear regression cases)

**a. Batch gradient descent** : $J_{\text{train}} = \frac{1}{2m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)})^2$

Repeat

$$
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)}) x_j^{(i)} \Big(= \theta_j -  \alpha  \frac{\partial}{\partial \theta_j} J_{\text{train}}(\theta) \Big)
$$

for every $j = 0, \ldots, n$
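The batch update can be sketched in plain Python as follows. The toy data set (generated from $y = 1 + 2x$), the step size, and the iteration count are assumptions chosen for illustration. Every parameter is updated simultaneously, using the error summed over all $m$ samples and multiplied by the $j$-th component of each input.

```python
def batch_gradient_step(theta, xs, ys, alpha):
    """One batch update: theta_j -= alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m = len(xs)
    # Prediction errors h_theta(x_i) - y_i, computed once with the current theta.
    errors = [sum(t * xj for t, xj in zip(theta, x)) - y for x, y in zip(xs, ys)]
    return [
        t - alpha * sum(e * x[j] for e, x in zip(errors, xs)) / m
        for j, t in enumerate(theta)
    ]

# Toy data from y = 1 + 2x; the first feature x_0 = 1 is the bias term.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]

theta = [0.0, 0.0]
for _ in range(5000):
    theta = batch_gradient_step(theta, xs, ys, alpha=0.1)
print(theta)  # close to [1.0, 2.0]
```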

**b. Stochastic gradient descent** : single sample-based

$$
\text{cost} \big(\theta, (x^{(i)}, y^{(i)}) \big) = \frac{1}{2} \big( h_\theta (x^{(i)}) - y^{(i)} \big)^2
$$

$J_{\text{train}} = \frac{1}{m} \sum_{i=1}^m \text{cost} \big(\theta, (x^{(i)}, y^{(i)}) \big)$

1) Randomly shuffle dataset.

2) Repeat

for $i = 1, \ldots, m$

$$
\theta_j := \theta_j - \alpha (h_\theta (x^{(i)}) - y^{(i)}) x_j^{(i)} \Big(= \theta_j -  \alpha  \frac{\partial}{\partial \theta_j} \text{cost} \big(\theta, (x^{(i)}, y^{(i)}) \big) \Big)
$$

for every $j = 0, \ldots, n$
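The two steps above (shuffle, then update after every single sample) might look like this in plain Python. The data set, step size, and epoch count are illustrative assumptions. Because the toy data is noise-free, the per-sample updates settle on the exact solution; with noisy data and a constant step size they would keep fluctuating around it.

```python
import random

def sgd_epoch(theta, data, alpha, rng):
    """One pass over shuffled data, updating theta after every single sample."""
    data = data[:]          # copy so the caller's list is not reordered
    rng.shuffle(data)
    for x, y in data:
        error = sum(t * xj for t, xj in zip(theta, x)) - y
        theta = [t - alpha * error * xj for t, xj in zip(theta, x)]
    return theta

rng = random.Random(0)
# Toy data from y = 1 + 2x; the first feature x_0 = 1 is the bias term.
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]

theta = [0.0, 0.0]
for _ in range(2000):
    theta = sgd_epoch(theta, data, alpha=0.05, rng=rng)
print(theta)  # close to [1.0, 2.0]
```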

### 1.3. Linear Approximation Methods

#### Proof of Convergence of Linear TD(0)

The general SGD update for state-value prediction is given below. Because the true value $v_\pi (S_t)$ is **unknown**, a target $U_t$ is substituted for it:

$$
w_{t+1} \equiv w_t + \alpha \Big[ U_t - \hat{v} (S_t, w_t) \Big] \nabla \hat{v} (S_t, w_t)
$$

Since in linear methods,

$$
\begin{aligned}
\hat{v} (s, w) &\equiv w^T \mathbf{x} (s) \equiv \sum_{i=1}^d w_i x_i (s) \\
\nabla \hat{v} (s, w) &= \mathbf{x} (s)
\end{aligned}
$$

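thus, writing $\mathbf{x}_t \equiv \mathbf{x} (S_t)$ and using the TD(0) target $U_t = R_{t+1} + \gamma \hat{v} (S_{t+1}, w_t)$,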

$$
\begin{aligned}
w_{t+1} &\equiv w_t + \alpha \Big[ U_t - \hat{v} (S_t, w_t) \Big] \mathbf{x}_t \\[15pt]
&\equiv w_t + \alpha \big( R_{t+1} + \gamma w_t ^T \mathbf{x}_{t+1} - w_t^T \mathbf{x}_t \big) \mathbf{x}_t \\[15pt]
&= w_t + \alpha \big( R_{t+1} \mathbf{x}_t - \mathbf{x}_t ( \mathbf{x}_t - \gamma \mathbf{x}_{t+1})^T w_t \big)
\end{aligned}
$$
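A minimal sketch of this update in code is shown below. The 3-state chain, the feature vectors, the step size, and the episode count are hypothetical choices, not taken from the text. The features are deliberately not one-hot, but they can represent the true values exactly, so the learned estimates can be compared with $v_\pi = (0.81, 0.9, 1.0)$.

```python
GAMMA = 0.9
ALPHA = 0.01

# Hypothetical chain 0 -> 1 -> 2 -> terminal; only the last step pays reward 1.
FEATURES = {0: [1.0, 0.0, 0.0], 1: [1.0, 1.0, 0.0], 2: [1.0, 1.0, 1.0]}
TRANSITIONS = {0: (1, 0.0), 1: (2, 0.0), 2: (None, 1.0)}  # state -> (next state, reward)

def v_hat(state, w):
    """Linear value estimate w^T x(s); the terminal state (None) has value 0."""
    if state is None:
        return 0.0
    return sum(wi * xi for wi, xi in zip(w, FEATURES[state]))

def td0_update(w, state, reward, next_state):
    """Semi-gradient TD(0): w += alpha * (R + gamma * v(S') - v(S)) * x(S)."""
    delta = reward + GAMMA * v_hat(next_state, w) - v_hat(state, w)
    return [wi + ALPHA * delta * xi for wi, xi in zip(w, FEATURES[state])]

w = [0.0, 0.0, 0.0]
for _ in range(20000):              # episodes
    state = 0
    while state is not None:
        next_state, reward = TRANSITIONS[state]
        w = td0_update(w, state, reward, next_state)
        state = next_state

values = [v_hat(s, w) for s in range(3)]
print(values)  # approaches [0.81, 0.9, 1.0]
```

The target bootstraps from the current weights, but only $\hat{v}(S_t, w_t)$ is differentiated, which is why the method is called semi-gradient.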

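When the system has reached **steady state**, for any given $w_t$, the expected next weight vector is

$$
\mathbb{E} [w_{t+1} | w_t ] = w_t + \alpha (\mathbf{b} - \mathbf{A} w_t)
$$

where $\mathbf{b} \equiv \mathbb{E} [ R_{t+1} \mathbf{x}_t ] \in \mathbb{R}^d$ and $\mathbf{A} \equiv \mathbb{E} \big[ \mathbf{x}_t ( \mathbf{x}_t - \gamma \mathbf{x}_{t+1})^T \big] = \mathbf{X}^T \mathbf{D} (\mathbf{I} - \gamma \mathbf{P}) \mathbf{X}$. Here $\mathbf{X}$ is the matrix whose rows are the feature vectors $\mathbf{x}(s)^T$, $\mathbf{D}$ is the diagonal matrix with $\mu(s)$ on its diagonal, and $\mathbf{P}$ is the state-transition matrix under $\pi$. If the iteration converges, it converges to the TD fixed point $w_{\text{TD}} = \mathbf{A}^{-1} \mathbf{b}$, and convergence is assured when $\mathbf{A}$ is positive definite. With linearly independent feature columns in $\mathbf{X}$, this holds because $\mathbf{D} (\mathbf{I} - \gamma \mathbf{P})$ is positive definite: its diagonal entries are positive, its off-diagonal entries are nonpositive, and both its row sums and its column sums are positive. The row sums are positive because $\mathbf{P}$ is a stochastic matrix and $\gamma < 1$; the column sums are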

$$
1^T \mathbf{D} (\mathbf{I} - \gamma \mathbf{P}) = (1 - \gamma) \mathbf{\mu}^T
$$

where $\mathbf{\mu}$ is the stationary distribution under $\pi$, so that $\mathbf{\mu}^T \mathbf{P} = \mathbf{\mu}^T$.
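The column-sum identity above can be verified numerically. The sketch below uses a hypothetical 3-state transition matrix `P` (an assumption for illustration), finds the stationary distribution by power iteration, and compares the column sums of $\mathbf{D} (\mathbf{I} - \gamma \mathbf{P})$ with $(1 - \gamma) \mathbf{\mu}^T$.

```python
GAMMA = 0.9

# Hypothetical 3-state transition matrix under the policy (each row sums to 1).
P = [
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.4, 0.0, 0.6],
]
n = len(P)

# Stationary distribution by power iteration: repeatedly apply mu^T <- mu^T P.
mu = [1.0 / n] * n
for _ in range(1000):
    mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

# M = D (I - gamma P), with D = diag(mu).
M = [
    [mu[i] * ((1.0 if i == j else 0.0) - GAMMA * P[i][j]) for j in range(n)]
    for i in range(n)
]

column_sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
row_sums = [sum(row) for row in M]
expected = [(1 - GAMMA) * m for m in mu]
print(column_sums)
print(expected)   # same numbers as column_sums, up to rounding
```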

### 1.4. Feature Construction for Linear Methods

#### Polynomials

$$
x_i (s) \equiv \prod_{j=1}^k s_j ^{c_{i,j}}
$$
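As an illustration, the sketch below enumerates every exponent vector with entries in $\{0, \ldots, n\}$ and builds the corresponding product features. The function name `polynomial_features` and the argument `order` (playing the role of $n$) are names introduced here, not from the text.

```python
from itertools import product
from math import prod

def polynomial_features(s, order):
    """All features prod_j s_j ** c_j with each exponent c_j in {0, ..., order}."""
    exponents = product(range(order + 1), repeat=len(s))
    return [prod(sj ** cj for sj, cj in zip(s, c)) for c in exponents]

features = polynomial_features([2.0, 3.0], order=1)
print(features)  # [1.0, 3.0, 2.0, 6.0]
```

For a $k$-dimensional state this yields $(n+1)^k$ features, so the feature count grows quickly with both the order and the state dimension.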

#### Fourier Basis

$$
x_i (s) \equiv \cos (\pi \mathbf{s} ^T \mathbf{c}^i )
$$
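A small sketch of the Fourier cosine features follows. It assumes each state component has been scaled to $[0, 1]$ and that the vectors $\mathbf{c}^i$ range over all integer vectors with entries in $\{0, \ldots, n\}$; both are common conventions but are assumptions here, as is the function name.

```python
from itertools import product
from math import cos, pi

def fourier_features(s, order):
    """Features cos(pi * s . c) for every integer vector c in {0..order}^k."""
    return [
        cos(pi * sum(sj * cj for sj, cj in zip(s, c)))
        for c in product(range(order + 1), repeat=len(s))
    ]

features = fourier_features([0.5], order=2)
print(features)  # approximately [1.0, 0.0, -1.0]
```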

#### Radial Basis Functions

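**RBF**s are the natural **generalization of coarse coding** to **continuous-valued features**: instead of being 0 or 1, each feature takes a value in $[0, 1]$ that reflects how close the state is to the feature's center.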

$$
x_i (s) \equiv \exp \Big( - \frac {\lVert s - c_i \rVert^2} {2 \sigma_i^2} \Big)
$$
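The formula translates directly into code. In the sketch below the centers and widths are arbitrary illustrative values, and `rbf_features` is a name introduced here.

```python
from math import exp

def rbf_features(s, centers, sigmas):
    """Gaussian RBF features exp(-||s - c_i||^2 / (2 * sigma_i^2))."""
    feats = []
    for c, sigma in zip(centers, sigmas):
        sq_dist = sum((sj - cj) ** 2 for sj, cj in zip(s, c))
        feats.append(exp(-sq_dist / (2.0 * sigma ** 2)))
    return feats

features = rbf_features([0.0, 0.0], centers=[[0.0, 0.0], [1.0, 0.0]], sigmas=[1.0, 1.0])
print(features)  # 1.0 at the matching center, exp(-0.5) one unit away
```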

### 1.5. Nonlinear Function Approximation : ANN

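The most sophisticated artificial neural network (ANN) and statistical methods all assume a static training set over which multiple passes are made. In reinforcement learning, however, learning must be able to occur online, while the agent interacts with its environment, so the approximator has to learn efficiently from incrementally acquired data.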

#### States defined

1. **The position** of the terminal relative to the agent

2. **The label** of the terminal

3. **The positions** of obstacles relative to the agent

4. **The label** of the obstacles

5. **The speed** of the obstacles

#### Q function updating

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) -Q(S_t, A_t) \Big)
$$
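In tabular form the update is a single line of code. The sketch below stores $Q$ in a dictionary keyed by `(state, action)`; the state and action labels and the `done` flag (which skips bootstrapping at the end of an episode) are illustrative assumptions.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)); no bootstrap when done."""
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

Q = defaultdict(float)            # unseen (state, action) pairs start at 0
Q[("s1", "right")] = 2.0
sarsa_update(Q, "s0", "right", r=1.0, s_next="s1", a_next="right", alpha=0.5, gamma=0.9)
print(Q[("s0", "right")])         # 0 + 0.5 * (1 + 0.9 * 2 - 0) = 1.4
```

The next action `a_next` is the one the agent will actually take under its current policy, which is what makes SARSA on-policy.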

#### Gradient Descent : MSE

$$
\text{MSE} = \Big(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1} ) - Q(S_t, A_t) \Big)^2
$$
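One way to connect this loss to a weight update is sketched below. It assumes that $Q$ is parameterized by weights $\mathbf{w}$ (written $Q_{\mathbf{w}}$, a notation introduced only for this sketch) and that the target $U_t = R_{t+1} + \gamma Q_{\mathbf{w}}(S_{t+1}, A_{t+1})$ is held constant when differentiating, which is the semi-gradient convention.

$$
\begin{aligned}
\nabla_{\mathbf{w}} \tfrac{1}{2} \big( U_t - Q_{\mathbf{w}}(S_t, A_t) \big)^2 &= - \big( U_t - Q_{\mathbf{w}}(S_t, A_t) \big) \nabla_{\mathbf{w}} Q_{\mathbf{w}}(S_t, A_t) \\[10pt]
\mathbf{w} &\leftarrow \mathbf{w} + \alpha \big( U_t - Q_{\mathbf{w}}(S_t, A_t) \big) \nabla_{\mathbf{w}} Q_{\mathbf{w}}(S_t, A_t)
\end{aligned}
$$

The factor $\tfrac{1}{2}$ is only a convenience; without it, the resulting constant 2 is absorbed into the step size $\alpha$.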

## 2. On-policy Control with Approximation

Control uses a parametric approximation of the **action-value function**, $\hat{q}(s,a,\mathbf{w}) \approx q_{*} (s, a)$, where $\mathbf{w} \in \mathbb{R}^d$.

### 2.1. Episodic Semi-gradient Control : SARSA

$$
w_{t+1} \equiv w_t + \alpha \Big[ U_t - \hat{q} (S_t, A_t, w_t) \Big] \nabla \hat{q} (S_t, A_t, w_t)
$$
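A minimal sketch of this update for a linear $\hat{q}$ is given below, assuming the one-step SARSA target $U_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w_t)$. The feature vectors `x` and `x_next` stand for hypothetical state-action features $\mathbf{x}(S_t, A_t)$ and $\mathbf{x}(S_{t+1}, A_{t+1})$. With a neural network, the gradient would come from backpropagation instead of being the feature vector itself.

```python
def q_hat(w, x):
    """Linear action value w^T x(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def semi_gradient_sarsa_update(w, x, r, x_next, alpha, gamma, done=False):
    """w += alpha * (U - q_hat(S, A)) * grad q_hat, with U = r + gamma * q_hat(S', A')."""
    target = r if done else r + gamma * q_hat(w, x_next)
    delta = target - q_hat(w, x)
    # For a linear q_hat, the gradient with respect to w is the feature vector x.
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]

w = [0.5, -0.5]
w = semi_gradient_sarsa_update(w, x=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0], alpha=0.1, gamma=0.9)
print(w)  # approximately [0.505, -0.5]
```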

### 2.2. Semi-gradient $n$-step SARSA