# RL Exploration considering Uncertainty

# 1. Theoretical view

## 1-1. Posterior Sampling for Reinforcement Learning (PSRL)

Assume the reward $\mu$ and the transition $P$ are stochastic. At the start of each episode, the algorithm samples an MDP from the posterior distribution conditioned on the history $\mathcal{F}_{t}$ generated up to episode $t$. It then computes the optimal policy for the sampled MDP and follows that policy for the episode.
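The sampling and planning steps above can be sketched for a small tabular, finite-horizon MDP. This is only an illustrative sketch under assumptions that are not in the text: a Dirichlet posterior over each next-state distribution, a Normal posterior over each mean reward (zero-mean Normal prior, known noise variance), and a time-homogeneous MDP. The names `sample_mdp` and `solve_finite_horizon` and all hyperparameters are hypothetical.

```python
import numpy as np

def sample_mdp(counts, reward_sum, visits, rng,
               prior_alpha=1.0, prior_var=1.0, noise_var=1.0):
    """Draw one MDP (P, mu) from an assumed conjugate posterior.

    counts:     (S, A, S) observed transition counts n(s, a, s')
    reward_sum: (S, A) sum of rewards observed at (s, a)
    visits:     (S, A) number of visits to (s, a)
    """
    S, A, _ = counts.shape
    # Assumed Dirichlet posterior over each next-state distribution.
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a] + prior_alpha)
    # Assumed Normal posterior over each mean reward.
    post_var = 1.0 / (1.0 / prior_var + visits / noise_var)
    post_mean = post_var * reward_sum / noise_var
    mu = rng.normal(post_mean, np.sqrt(post_var))
    return P, mu

def solve_finite_horizon(P, mu, H):
    """Backward induction on the sampled MDP; returns a greedy policy of shape (H, S)."""
    S, A, _ = P.shape
    V = np.zeros(S)                      # value after the final step is zero
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = mu + P @ V                   # (S, A): reward plus expected next value
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

# One PSRL episode: sample an MDP from the posterior, then plan in it.
rng = np.random.default_rng(0)
S, A, H = 3, 2, 4
counts = np.zeros((S, A, S))
reward_sum = np.zeros((S, A))
visits = np.zeros((S, A))
P_sample, mu_sample = sample_mdp(counts, reward_sum, visits, rng)
policy = solve_finite_horizon(P_sample, mu_sample, H)
```

After the episode, the counts and reward statistics would be updated with the observed transitions, so the posterior concentrates as more data arrives.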

## 1-2. Uncertainty Bellman Equation (UBE)

Assume the posterior distributions of $\mu$ and $P$ can be derived. Then the **Uncertainty Bellman Equation** is

\begin{equation}
u_{sa}^{h}
=\nu_{sa}^{h}+\sum_{s',a'}\pi_{s'a'}^{h}\mathbb{E}\left( \hat{P}_{s'sa}^{h} | \mathcal{F}_{t} \right)u_{s'a'}^{h+1}
\end{equation}

where $\nu_{sa}^{h}$ is the local uncertainty at $(s,a)$, which is given by

\begin{equation}
\nu_{sa}^{h}=\mathrm{Var}\left(\hat{\mu}_{sa}^{h} | \mathcal{F}_{t} \right)+Q_{\mathrm{max}}^{2}\sum_{s'}\frac{\mathrm{Var}\left(\hat{P}_{s'sa}^{h} | \mathcal{F}_{t} \right)}{\mathbb{E}\left( \hat{P}_{s'sa}^{h} | \mathcal{F}_{t} \right)}
\end{equation}
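To make the two formulas concrete, the sketch below computes the local uncertainty and then solves the UBE by backward recursion in a tabular setting. It rests on assumptions that are not in the text: a Dirichlet posterior with parameters `alpha` over each next-state distribution, a given posterior variance `reward_var` for the mean reward, a posterior that does not depend on the step $h$, and zero uncertainty after the final step. The function names are hypothetical.

```python
import numpy as np

def local_uncertainty(reward_var, alpha, q_max):
    """nu[s, a] = Var(mu_hat) + Q_max^2 * sum_{s'} Var(P_hat) / E(P_hat).

    reward_var: (S, A) posterior variance of the mean reward
    alpha:      (S, A, S) Dirichlet parameters (all > 0), indexed [s, a, s']
    """
    alpha0 = alpha.sum(axis=-1, keepdims=True)
    p_mean = alpha / alpha0
    # Marginal variance of each component of a Dirichlet distribution.
    p_var = alpha * (alpha0 - alpha) / (alpha0 ** 2 * (alpha0 + 1.0))
    return reward_var + q_max ** 2 * (p_var / p_mean).sum(axis=-1)

def solve_ube(nu, p_mean, pi):
    """Backward recursion for the uncertainty u.

    nu:     (S, A) local uncertainty (assumed identical for every step h)
    p_mean: (S, A, S) posterior mean transitions, indexed [s, a, s']
    pi:     (H, S, A) policy probabilities pi^h[s, a]
    Returns u with shape (H + 1, S, A), where u[H] = 0 is the assumed terminal condition.
    """
    H, S, A = pi.shape
    u = np.zeros((H + 1, S, A))
    for h in reversed(range(H)):
        # Policy-weighted uncertainty at each next state s'.
        next_u = (pi[h] * u[h + 1]).sum(axis=-1)   # (S,)
        u[h] = nu + p_mean @ next_u                # (S, A)
    return u

# Toy example: uniform Dirichlet posterior and uniform policy.
S, A, H = 3, 2, 2
alpha = np.ones((S, A, S))
reward_var = np.full((S, A), 0.1)
nu = local_uncertainty(reward_var, alpha, q_max=2.0)
p_mean = alpha / alpha.sum(axis=-1, keepdims=True)
pi = np.full((H, S, A), 1.0 / A)
u = solve_ube(nu, p_mean, pi)
```

Under the Dirichlet assumption the transition term simplifies to $(S-1)/(\alpha_{0}+1)$ with $\alpha_{0}=\sum_{s'}\alpha_{s'}$, so the local uncertainty shrinks as visit counts grow.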

Given the solution $u_{sa}^{h}$ of the UBE, one can approximate the posterior distribution of the Q-value as

\begin{equation}
\begin{aligned}
Q_{sa}^{h}&\sim\mathcal{N}\left( \bar{Q}_{sa}^{h}, \mathrm{diag}(u_{sa}^{h}) \right) \\
\bar{Q}_{sa}^{h}&=\mathbb{E}\left( \hat{\mu}_{sa}^{h} | \mathcal{F}_{t} \right)+\sum_{s',a'}\pi_{s'a'}^{h}\mathbb{E}\left( \hat{P}_{s'sa}^{h} | \mathcal{F}_{t} \right)\bar{Q}_{s'a'}^{h+1}
\end{aligned}
\end{equation}

and use it to perform Thompson sampling (posterior sampling) as an exploration strategy:

\begin{equation}
\begin{aligned}
a&=\mathrm{argmax}_{b}\left( \bar{Q}_{sb}^{h}+\epsilon_{b}\cdot\left( u_{sb}^{h} \right)^{1/2} \right) \\
\epsilon_{b}&\sim\mathcal{N}(0,1)
\end{aligned}
\end{equation}
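The mean recursion and the sampling rule can be sketched in a few lines of NumPy. As before, this is a tabular illustration under assumptions not stated in the text: posterior means that do not depend on the step $h$, and a zero terminal value. The names `mean_q` and `thompson_action` are hypothetical.

```python
import numpy as np

def mean_q(mu_mean, p_mean, pi):
    """Backward recursion for the posterior mean Q_bar.

    mu_mean: (S, A) posterior mean reward
    p_mean:  (S, A, S) posterior mean transitions, indexed [s, a, s']
    pi:      (H, S, A) policy probabilities pi^h[s, a]
    Returns q_bar with shape (H + 1, S, A), where q_bar[H] = 0 (assumed).
    """
    H, S, A = pi.shape
    q_bar = np.zeros((H + 1, S, A))
    for h in reversed(range(H)):
        # Policy-weighted mean value at each next state s'.
        next_v = (pi[h] * q_bar[h + 1]).sum(axis=-1)   # (S,)
        q_bar[h] = mu_mean + p_mean @ next_v           # (S, A)
    return q_bar

def thompson_action(q_bar_s, u_s, rng):
    """Pick argmax_b of Q_bar[s, b] + eps_b * sqrt(u[s, b]) with eps_b ~ N(0, 1).

    q_bar_s: (A,) mean Q-values at the current state and step
    u_s:     (A,) non-negative uncertainties at the current state and step
    """
    eps = rng.standard_normal(q_bar_s.shape)
    return int(np.argmax(q_bar_s + eps * np.sqrt(u_s)))

# Toy example: two states, two actions, horizon 2, uniform policy.
rng = np.random.default_rng(0)
S, A, H = 2, 2, 2
mu_mean = np.ones((S, A))
p_mean = np.full((S, A, S), 1.0 / S)
pi = np.full((H, S, A), 1.0 / A)
q_bar = mean_q(mu_mean, p_mean, pi)
action = thompson_action(q_bar[0, 0], np.array([0.5, 0.5]), rng)
```

When every $u_{sb}^{h}$ is zero the rule reduces to greedy action selection, while a large uncertainty on an action makes it more likely to be tried even if its mean is lower.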

```python
import numpy as np
import torch
import torch.nn as nn
import torch.distributions as D
import torch.nn.functional as F
```