# Practical Session 4: Approximation Methods in Reinforcement Learning 

##### *M2 Artificial Intelligence (Paris Saclay University) - Reinforcement Learning*

---

In the case where the state or action spaces are very large (or continuous), it is not possible to represent the value function or the policy as a simple vector (a.k.a. "tabular" representation). Instead, the use of **function approximation** techniques to estimate these functions is required.

--

Good references for the approximation part of Reinforcement Learning are again *J. Kwon - "An Introduction to Reinforcement Learning: From theory to algorithms, 2024"* and the classic book by *R. Sutton and A. Barto - "Reinforcement Learning: An Introduction, 2nd edition, 2018"*.


<br>
<br>
<br>
<br>
<br>

## Part I: Semi-Gradient Methods

### Function Approximation

Recall that one of the main challenges in Reinforcement Learning is to estimate value functions (state-value $v^\pi(s)$ or action-value $q^\pi(s,a)$). In the *function approximation* setting, we assume that the value function can be represented as a parametric function $v(s; \theta)$ (or $q(s,a; \theta)$) where $\theta \in \Theta \subset \mathbb{R}^d$ is a vector of parameters (*e.g.* neural network weights, linear coefficients, SVM parameters, etc.).

An important kind of parametrisation is the **linear function approximation**, where the value function is represented as a linear combination of features:
$$v(s; \theta) = \phi(s)^T \theta = \sum_{i=1}^d \phi_i(s) \theta_i$$
where $\phi(s) = [\phi_1(s), \phi_2(s), \ldots, \phi_d(s)]^T$ is a vector of $d$ features extracted from the state $s$ (e.g. polynomial features, or features produced by a neural network).
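
As an illustration of the linear parametrisation above, here is a minimal sketch for a scalar state with polynomial features. The function names `polynomial_features` and `linear_value` and the coefficient values are assumptions made for this example only.

```python
import numpy as np

def polynomial_features(s, degree):
    """Return phi(s) = [1, s, s^2, ..., s^degree] for a scalar state s."""
    return np.array([s ** i for i in range(degree + 1)], dtype=float)

def linear_value(s, theta):
    """Linear approximation v(s; theta) = phi(s)^T theta, with d = len(theta)."""
    phi = polynomial_features(s, degree=len(theta) - 1)
    return float(phi @ theta)

theta = np.array([1.0, 0.5, -2.0])      # illustrative coefficients, d = 3
value = linear_value(2.0, theta)        # 1 + 0.5 * 2 - 2 * 4 = -6
gradient = polynomial_features(2.0, 2)  # for a linear model, grad_theta v(s; theta) = phi(s)
```

Note that for a linear model the gradient $\nabla_\theta v(s; \theta)$ is simply the feature vector $\phi(s)$, which keeps the update rules used later in this session cheap to compute.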

### Solving the Bellman Equation with Function Approximation

The Bellman equations are fixed-point equations for the following Bellman operators:

- for the value functions, the operators $\mathcal{T}_\gamma^\pi$ and $\mathcal{T}_\gamma^*$ are defined as:
$$
\mathcal{T}_\gamma^\pi v(s) = r(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim P(\cdot | s, \pi(s))}[v(s')]
$$
$$
\mathcal{T}_\gamma^* v(s) = \max_{a \in A} \left( r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot | s, a)}[v(s')] \right)
$$
- for the Q-functions:
$$
\mathcal{T}_\gamma^\pi q(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot | s, a)}[q(s', \pi(s'))]
$$
$$
\mathcal{T}_\gamma^* q(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot | s, a)}\left[ \max_{a' \in A} q(s', a') \right]
$$
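
To make the operators concrete, the sketch below applies $\mathcal{T}_\gamma^\pi$ and $\mathcal{T}_\gamma^*$ to a value vector on a tiny tabular MDP. The array layout (`r[s, a]`, `P[s, a, s']`), the function names and the two-state example are assumptions chosen for illustration.

```python
import numpy as np

def bellman_policy_operator(v, r, P, policy, gamma):
    """(T^pi v)(s) = r(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) v(s')."""
    states = np.arange(len(v))
    return r[states, policy] + gamma * P[states, policy] @ v

def bellman_optimal_operator(v, r, P, gamma):
    """(T^* v)(s) = max_a [ r(s, a) + gamma * sum_s' P(s'|s, a) v(s') ]."""
    return np.max(r + gamma * P @ v, axis=1)

# Hypothetical MDP with 2 states and 2 actions.
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])                 # r[s, a]
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])   # P[s, a, s']
policy = np.array([0, 0])                  # deterministic policy pi(s)
v = np.array([1.0, 2.0])

tv_pi = bellman_policy_operator(v, r, P, policy, gamma=0.5)   # [0.5, 3.0]
tv_star = bellman_optimal_operator(v, r, P, gamma=0.5)        # [2.0, 3.0]
```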


Consequently, in the function approximation setting, we want to find a parameter vector $\theta^*$ such that the approximated value function $v(s; \theta^*)$ (or $q(s,a; \theta^*)$) satisfies the Bellman equation:

- for the value function:
$$v(s; \theta^*) = \mathcal{T}_\gamma^\pi v(s; \theta^*) \quad \text{or} \quad v(s; \theta^*) = \mathcal{T}_\gamma^* v(s; \theta^*)$$

- for the Q-function:
$$q(s,a; \theta^*) = \mathcal{T}_\gamma^\pi q(s,a; \theta^*) \quad \text{or} \quad q(s,a; \theta^*) = \mathcal{T}_\gamma^* q(s,a; \theta^*)$$

However, due to the function approximation, it is generally not possible to satisfy these equations exactly for all states (or state-action pairs). Instead, we aim to find a parameter vector $\theta^*$ that minimizes the **mean squared Bellman error** over the state (or state-action) space.
$$
\theta^* = \arg \min_\theta \mathbb{E}_{s \sim \rho(s)} \left[ \left( v(s; \theta) - \mathcal{T}_\gamma^\pi v(s; \theta) \right)^2 \right]
$$
or
$$
\theta^* = \arg \min_\theta \mathbb{E}_{(s,a) \sim \rho(s,a)} \left[ \left( q(s,a; \theta) - \mathcal{T}_\gamma^\pi q(s,a; \theta) \right)^2 \right]
$$
where $\rho(s)$ (or $\rho(s,a)$) is a distribution over states (or state-action pairs), often chosen as the visitation distribution induced by the policy $\pi$.
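
The objective above can be evaluated exactly when the state space is small. The following sketch does so for a linear model, assuming a feature matrix `Phi` (one row $\phi(s)^T$ per state), a reward vector `r_pi` and a transition matrix `P_pi` under the policy $\pi$. All names and numerical values are hypothetical.

```python
import numpy as np

def mean_squared_bellman_error(theta, Phi, r_pi, P_pi, rho, gamma):
    """E_{s ~ rho}[(v(s; theta) - T^pi v(s; theta))^2] for v = Phi @ theta."""
    v = Phi @ theta
    bellman_residual = v - (r_pi + gamma * P_pi @ v)
    return float(rho @ bellman_residual ** 2)

# Hypothetical 3-state chain under a fixed policy.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, 2.0])
rho = np.full(3, 1.0 / 3.0)      # uniform distribution over states
gamma = 0.5
Phi = np.eye(3)                  # one-hot features, i.e. the tabular special case

# With one-hot features the Bellman equation can be solved exactly.
theta_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
error_exact = mean_squared_bellman_error(theta_exact, Phi, r_pi, P_pi, rho, gamma)
error_zero = mean_squared_bellman_error(np.zeros(3), Phi, r_pi, P_pi, rho, gamma)
```

With one-hot features the error vanishes at the exact solution. With fewer features than states, the minimum is in general strictly positive.
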
<br>

--

**Theorem (optimality necessary condition)**:
If the function $v(s; \theta^*)$ (or $q(s,a; \theta^*)$) minimises the mean squared Bellman error, i.e. solves one of the optimisation problems above with the Bellman target treated as fixed when differentiating, then the following condition holds in expectation under $\rho$:

- for the value function:
$$
 \left( v(s; \theta^*) - \mathcal{T}_\gamma^\pi v(s; \theta^*) \right) \nabla_\theta v(s; \theta^*) = 0
$$

or (for the optimal value function):

$$
 \left( v(s; \theta^*) - \mathcal{T}_\gamma^* v(s; \theta^*) \right) \nabla_\theta v(s; \theta^*) = 0
$$


- for the Q-function:
$$
 \left( q(s,a; \theta^*) - \mathcal{T}_\gamma^\pi q(s,a; \theta^*) \right) \nabla_\theta q(s,a; \theta^*) = 0
$$

or (for the optimal Q-function):

$$
 \left( q(s,a; \theta^*) - \mathcal{T}_\gamma^* q(s,a; \theta^*) \right) \nabla_\theta q(s,a; \theta^*) = 0
$$

--

The theorem above is a standard criterion in optimisation theory (critical point condition).

Consequently, a necessary condition for a value function $v(s; \theta^*)$ to minimise the mean squared Bellman error is that its parameters $\theta^*$ satisfy:
$$
\mathbb{E}_{s \sim \rho(s)} \left[ \left( v(s; \theta^*) - \mathcal{T}_\gamma^\pi v(s; \theta^*) \right) \nabla_\theta v(s; \theta^*) \right] = 0
$$
and similarly for the other cases.

This suggests a stochastic approximation approach to find such a parameter vector $\theta^*$.
Indeed, stochastic approximation methods aim in particular to find zeros of functions defined as expectations.
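
As a minimal illustration of stochastic approximation, the sketch below runs a Robbins-Monro iteration to find the zero of $h(\theta) = \mathbb{E}[\theta - X]$ from noisy samples only. The Gaussian distribution, its mean and the step-size $\alpha_n = 1/(n+1)$ are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=10_000)   # X_n with E[X] = 2

theta = 0.0
for n, x in enumerate(samples):
    alpha = 1.0 / (n + 1)                 # satisfies the Robbins-Monro conditions
    theta = theta - alpha * (theta - x)   # noisy evaluation of h(theta) = E[theta - X]
```

With this particular step-size the iterate coincides with the running sample mean, so it converges to the zero $\theta = \mathbb{E}[X]$ of $h$.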

This condition forms the basis for the **Semi-Gradient Policy Evaluation** algorithm. The term "semi-gradient" refers to the fact that this gradient method is performed with a moving target: the Bellman target depends on the current parameter $\theta_n \in \Theta$ at iteration $n$, but this dependence is ignored when taking the gradient.

### Semi-Gradient Policy Evaluation Algorithm
The **Semi-Gradient Policy Evaluation** algorithm is defined as follows:
- Initialise the parameter vector $\theta_0 \in \Theta$, set $n = 0$.
- Repeat until convergence:
    - Generate a transition $(S_n, A_n, R_n, S_{n+1})$ by following policy $\pi$ in the environment.
    - Update the parameter vector:
    $$\theta_{n+1} = \theta_n + \alpha_n \left( R_n + \gamma v(S_{n+1}; \theta_n) - v(S_n; \theta_n) \right) \nabla_\theta v(S_n; \theta_n)$$
    The temporal-difference term $R_n + \gamma v(S_{n+1}; \theta_n) - v(S_n; \theta_n)$ is a one-sample estimate of $\mathcal{T}_\gamma^\pi v(S_n; \theta_n) - v(S_n; \theta_n)$.
    - Increment $n$ by 1.

Here $\{\alpha_n\}_{n \geq 0}$ is a sequence of positive step-sizes satisfying the Robbins-Monro conditions ($\sum_n \alpha_n = \infty$ and $\sum_n \alpha_n^2 < \infty$).
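
The loop above can be sketched as follows for a linear model. This is only an illustration under stated assumptions: the environment is a hypothetical 3-state continuing chain in which the policy is already folded into `P_pi` and `r_pi`, the features are one-hot (the tabular special case) and the step-size schedule is one arbitrary choice satisfying the Robbins-Monro conditions.

```python
import numpy as np

def semi_gradient_td0(features, sample_transition, s0, d, gamma, n_steps, step_size, rng):
    """Semi-gradient TD(0) for a linear model v(s; theta) = features(s) @ theta."""
    theta = np.zeros(d)
    s = s0
    for n in range(n_steps):
        r, s_next = sample_transition(s, rng)          # one transition under pi
        phi = features(s)
        td_error = r + gamma * features(s_next) @ theta - phi @ theta
        theta = theta + step_size(n) * td_error * phi  # grad_theta v(s; theta) = phi(s)
        s = s_next
    return theta

# Hypothetical 3-state continuing chain; the policy is folded into P_pi and r_pi.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, 2.0])
gamma = 0.5

def sample_transition(s, rng):
    """Return the reward r(s, pi(s)) and a next state drawn from P(.|s, pi(s))."""
    return r_pi[s], rng.choice(3, p=P_pi[s])

def one_hot(s):
    """One-hot features, so that theta[s] directly estimates v(s)."""
    return np.eye(3)[s]

theta = semi_gradient_td0(one_hot, sample_transition, s0=0, d=3, gamma=gamma,
                          n_steps=50_000, step_size=lambda n: 100.0 / (1000.0 + n),
                          rng=np.random.default_rng(0))
v_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)   # reference solution
```

Comparing `theta` with `v_exact` gives a quick sanity check. Swapping `one_hot` for another feature map only requires changing `features` and `d`.
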
<br>


<div class="alert alert-warning">

#### Question: Implementation of Semi-Gradient policy evaluation

Implement the Semi-Gradient TD(0) algorithm for estimating the state-value function $v^\pi$ using linear function approximation. Use polynomial features for the state representation.

</div>


<div class="alert alert-warning">

#### Question: Implementation of Semi-Gradient Q-learning

Give the semi-gradient Q-learning update rule for estimating the optimal state-action value function $q^*$.
Implement the algorithm with different types of function approximators (linear, neural networks, etc.).

</div>

<br>
<br>
<br>
<br>
<br>

## Part II: Policy Gradient Methods



### Policy Parametrisation

This approach is similar to the value function approximation methods seen previously. Here, the policy is not derived from a value function estimate, but is directly parametrised as a function $\pi(a|s; \theta)$ where $\theta \in \Theta \subset \mathbb{R}^d$ is a vector of parameters.

The policy is assumed to be differentiable w.r.t. $\theta \in \Theta$ and satisfies
$$
\pi(a|s; \theta) \geq 0, \quad \sum_{a \in A} \pi(a|s; \theta) = 1, \quad \forall s \in S, a \in A, \theta \in \Theta
$$

#### Example: Softmax Policy
A common choice for discrete action spaces is the **softmax policy** defined as:
$$\pi(a|s; \theta) = \frac{e^{f(s,a; \theta)}}{\sum_{a' \in A} e^{f(s,a'; \theta)}}$$
where $f(s,a; \theta)$ is a parametric function (e.g. a linear function or a neural network) that outputs a score for each action $a \in A$ given the state $s \in S$.
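
A minimal sketch of a softmax policy with linear scores $f(s,a;\theta) = \phi(s,a)^T \theta$ is given below. The state-action features and parameter values are hypothetical. For linear scores, the score function has the closed form $\nabla_\theta \log \pi(a|s;\theta) = \phi(s,a) - \sum_{a'} \pi(a'|s;\theta) \phi(s,a')$, which the last function computes.

```python
import numpy as np

def softmax_policy(scores):
    """Return pi(.|s) from the vector of scores f(s, a; theta) over actions a."""
    shifted = scores - np.max(scores)   # stabilises exp without changing the result
    weights = np.exp(shifted)
    return weights / weights.sum()

def grad_log_softmax(action_features, theta, a):
    """grad_theta log pi(a|s; theta) for linear scores, one feature row per action."""
    probs = softmax_policy(action_features @ theta)
    return action_features[a] - probs @ action_features

# Hypothetical features phi(s, a) for one state and three actions.
action_features = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
theta = np.array([np.log(2.0), 0.0])

probs = softmax_policy(action_features @ theta)       # [0.4, 0.2, 0.4]
grad = grad_log_softmax(action_features, theta, a=1)  # [-0.8, 0.4]
```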


### Optimising the Policy Parameters

The goal of policy gradient methods is to find the optimal parameter vector $\theta^*$ that maximises the expected cumulative reward:
$$\theta^* = \arg \max_{\theta \in \Theta} \mathbb{E}_{S \sim \rho} \left[ v^{\pi(\cdot| \cdot ; \theta)}_\gamma(S) \right]$$
where $\rho$ is a distribution over states (e.g. the initial state distribution or the stationary distribution induced by the policy).

This optimisation problem can be solved using gradient ascent methods. The following key theorem provides an expression of a stochastic estimator of the gradient of the above objective function, to be used in a stochastic gradient ascent algorithm. Note that there are different versions of the policy gradient theorem.



--

**Theorem (Policy Gradient Theorem):**
Suppose that the policy $\pi(a|s; \theta)$ is differentiable w.r.t. $\theta \in \Theta$. Let $\rho$ be the distribution over states used in the objective function.

Then, the gradient of the expected cumulative reward w.r.t. $\theta$ is given by:
$$\nabla_\theta \mathbb{E}_{S \sim \rho} \left[ v^{\pi(\cdot| \cdot ; \theta)}_\gamma(S) \right] = \mathbb{E}_{S_0 \sim \rho, \pi(\cdot | \cdot ; \theta)} \left[ \left( \sum_{t = 0}^\infty \gamma^t r(S_t, A_t) \right) \left( \sum_{k = 0}^\infty \nabla_\theta \log \pi(A_k | S_k; \theta) \right) \right]$$


--


Remark:

The policy gradient theorem says that the gradient of the expected cumulative reward is an expectation. This expectation can be estimated using independent samples of MDP trajectories obtained by interacting with the environment using the current policy $\pi(\cdot | \cdot ; \theta)$.
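
The Monte Carlo estimation described in this remark can be sketched on a toy problem. Everything below is an assumption made for illustration: a hypothetical two-step episodic MDP (state 0, then state 1, then termination), a tabular softmax policy with parameters `theta[s, a]`, and $\rho$ concentrated on state 0. The exact gradient at $\theta = 0$ was worked out by hand for this toy problem, where the objective is $\pi(0|0;\theta) + \gamma \pi(1|1;\theta)$.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def sample_episode(theta, rng):
    """Two-step toy MDP (hypothetical): returns the list of (s, a, r) triples."""
    episode = []
    for s in (0, 1):                    # state 0, then state 1, then termination
        a = rng.choice(2, p=softmax(theta[s]))
        r = 1.0 if a == s else 0.0      # action 0 pays in state 0, action 1 in state 1
        episode.append((s, a, r))
    return episode

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s; theta) w.r.t. the whole table theta."""
    grad = np.zeros_like(theta)
    grad[s] = -softmax(theta[s])
    grad[s, a] += 1.0
    return grad

def policy_gradient_estimate(theta, gamma, n_episodes, rng):
    """Average over episodes of (discounted return) * (sum of score functions)."""
    total = np.zeros_like(theta)
    for _ in range(n_episodes):
        episode = sample_episode(theta, rng)
        ret = sum(gamma ** t * r for t, (s, a, r) in enumerate(episode))
        score = sum(grad_log_pi(theta, s, a) for (s, a, r) in episode)
        total += ret * score
    return total / n_episodes

gamma = 0.9
theta = np.zeros((2, 2))
estimate = policy_gradient_estimate(theta, gamma, 20_000, np.random.default_rng(0))
exact = np.array([[0.25, -0.25],
                  [-0.25 * gamma, 0.25 * gamma]])   # hand-computed gradient at theta = 0
```

The estimate is noisy, so it matches `exact` only up to Monte Carlo error. Using more episodes, or subtracting a baseline from the return, reduces the variance.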


<div class="alert alert-warning">

#### Question: On the hypothesis of the Policy Gradient Theorem

Why are the hypotheses of differentiability and positivity of the policy $\pi(a|s; \theta)$ necessary to derive the Policy Gradient Theorem?

</div>


<div class="alert alert-warning">

#### Question: Implementation of the Policy Gradient Theorem

Use the Policy Gradient Theorem to implement a stochastic gradient ascent algorithm that learns a policy for a simple MDP (e.g. the Stochastic Inventory Problem, a Gymnasium 'classic control' environment, or an environment with a continuous state or action space).

</div>