# Continuous Observation Spaces

In this notebook we develop the active inference agent for environments with continuous observation spaces. 

We start by modifying the minimal environment to emit continuous-valued observations that represent the proportion of repeated experiments yielding food in a given state. Then, we modify the components of the minimal agent that currently exploit the discreteness of the observation space, namely the belief update after a new observation and the estimation of information gain during action selection.

#### Housekeeping (run once per kernel restart)

In [None]:
# change directory to parent
import os
os.chdir('..')
print(os.getcwd())

# Imports

In [None]:
import importlib
import itertools

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.markers import CARETUP, CARETDOWN
import pandas as pd
from scipy.stats import beta
import seaborn as sns
import torch

# Continuous Observation Environment

So far, the environment emitted binary observations `NO_FOOD` or `FOOD` with a probability that decayed exponentially with the distance of a state from the food source. To make observations continuous, we replace this binary observation with samples from the _Beta_ distribution with mean equal to the probability of observing `FOOD`. These samples represent (roughly) the proportion of positive observations in a finite set of repeated coin flips. The `sample_size` governs the number of repeated experiments, and the variance decreases with increasing sample size. In the limit of infinite sample size, the _Beta_ distribution concentrates all of its mass at the mean.
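As a quick numerical check of how the variance shrinks with sample size, the sketch below builds a Beta distribution from a mean and a sample size using the parameterization `a = mean * sample_size`, `b = (1 - mean) * sample_size`, and compares SciPy's variance with the closed form $m(1-m)/(n+1)$. The helper name `beta_from_mean` is an assumption made for illustration.

```python
from scipy.stats import beta

def beta_from_mean(mean, sample_size):
    """Beta distribution with the given mean; `sample_size` equals a + b."""
    return beta(a=mean * sample_size, b=(1 - mean) * sample_size)

mean = 0.3
variances = {}
for sample_size in (3, 10, 30, 200):
    rv = beta_from_mean(mean, sample_size)
    # closed-form variance of Beta(a, b) with a + b = sample_size
    closed_form = mean * (1 - mean) / (sample_size + 1)
    variances[sample_size] = rv.var()
    print(f"n={sample_size:3d}  mean={rv.mean():.3f}  "
          f"var={rv.var():.5f}  closed form={closed_form:.5f}")
```

The mean stays fixed at 0.3 while the variance falls roughly in proportion to `1 / sample_size`.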

## The Beta distribution

Let's explore the density of the _Beta_ distribution with varying sample size.

In [None]:
mean = 0.3
params = [(mean, 3), (mean, 10), (mean, 30), (mean, 200)]
fig, ax = plt.subplots(figsize=(8, 6))
plt.sca(ax)
for mean, sample_size in params:
  
  x = np.linspace(0, 1, 1000)
  p = beta.pdf(x, a=mean * sample_size, b=(1-mean) * sample_size)
  plt.plot(x, p, label=f'mean:{mean}, sample size:{sample_size}')
  
plt.legend()
plt.title('Density of the Beta distribution with varying sample size')
plt.ylabel('density')
plt.xlabel('observation o')

## From discrete to continuous observations
We can now define, sample from, and visualise the distribution of observations generated in each state of the environment, where observations are sampled from the _Beta_ distribution with mean equal to the probability of observing food in a single coin flip experiment as in the `MinimalEnvironment`.

In [None]:
import minimal_environment as me
importlib.reload(me)

sample_size=10
n_samples = 1000

fig, ax = plt.subplots(figsize=(12, 6))
plt.sca(ax)
env = me.MinimalEnv(N=16, # number of states
                    s_food=0, # location of the food source
                    o_decay=0.2) # decay of observing food away from source

def emission_probability(sample_size):
  means = env.emission_probability()[:,1]
  return [beta(a=m*sample_size, b=(1-m)*sample_size) for m in means]

def sample_o(p_o_given_s, s, n_samples):
  return p_o_given_s[s].rvs(size=n_samples)
  
p_o_given_s = emission_probability(sample_size)
samples = [sample_o(p_o_given_s, s, n_samples) for s in range(env.s_N)]
df = pd.DataFrame(np.array(samples).T)
sns.violinplot(df, cut=0, width=2)
plt.xlabel('state s')
plt.ylabel('p(o|s)')
plt.title('Continuous environment emission probability')

## Full environment specification

We modify the code of `MinimalEnv` as follows:
- `emission_probability(sample_size)` returns a list of beta distributions, one for each state.
- `p_o_given_s` stores a copy of this list, which is computed once at initialization.
- `sample_o` samples from the beta distribution of the current state.

In [None]:
import numpy as np
from scipy.stats import beta

# environment
class ContinuousObservationEnv(object):
  """ Wrap-around 1D state space with single food source.
  
  The probability of sensing food at locations near the food source decays 
  exponentially with increasing distance.
  
  state (int): 1 of N discrete locations in 1D space.
  observation (float): proportion of times food detected in finite sample.
  actions (int): index in {0, 1}, mapped to the intention to move left (-1) or right (+1).
  """
  def __init__(self, 
               N = 16, # how many discrete locations can the agent reside in
               s_0 = 0, # where does the agent start each episode?
               s_food = 0, # where is the food?
               p_move = 0.75, # execute intent with p, else don't move.
               o_sample_size=10, # observation Beta distribution parameter.
               p_o_max = 0.9, # maximum probability of sensing food
               o_decay = 0.2 # decay rate of observing distant food source
               ):
    
    self.o_decay = o_decay
    self.p_move = p_move
    self.o_sample_size = o_sample_size
    self.p_o_max = p_o_max
    self.s_0 = s_0
    self.s_food = s_food
    self.s_N = N
    self.a_N = 2 # {0, 1} to move left/ right in wrap-around 1D state-space
    """
    environment dynamics are governed by two probability distributions
    1. state transition probability p(s'|s, a)
    2. emission/ observation probability p(o|s)
    although we only need to be able to sample from these distributions to 
    implement the environment, we pre-compute the full conditional probability
    table (1.) and conditional emission random variables (2.) here so agents 
    can access the true dynamics if required.
    """
    self.p_s1_given_s_a = self.transition_dynamics() # Matrix B
    self.p_o_given_s = self.emission_probability() # Matrix A
    self.s_t = None # state at current timestep


  def transition_dynamics(self):
    """ computes transition probability p(s'| s, a) 
    
    Returns:
    p[s, a, s1] of size (s_N, a_N, s_N)
    """

    p = np.zeros((self.s_N, self.a_N, self.s_N))
    p[:,0,:] = self.p_move * np.roll(np.identity(self.s_N), -1, axis=1) \
              + (1-self.p_move) * np.identity(self.s_N)
    p[:,1,:] = self.p_move * np.roll(np.identity(self.s_N), 1, axis=1) \
              + (1-self.p_move) * np.identity(self.s_N)
    return p

  def emission_probability(self):
    """ initialises conditional random variables p(o|s). 
    
    Returns:
    p[s] of size (s_N) with one scipy.stats.rv_continuous per state
    """
    s = np.arange(self.s_N)
    # distance from food source
    d = np.minimum(np.abs(s - self.s_food), 
                   np.minimum(
                   np.abs(s - self.s_N - self.s_food), 
                   np.abs(s + self.s_N - self.s_food)))
  
    # exponentially decaying concentration ~ probability of detection
    mean = self.p_o_max * np.exp(-self.o_decay * d)
    # continuous relaxation: proportion of food detected in finite sample
    sample_size = self.o_sample_size
    return np.array([beta(a=m*sample_size, b=(1-m)*sample_size) for m in mean])

  def reset(self):
    self.s_t = self.s_0
    return self.sample_o()

  def step(self, a):
    if (self.s_t is None):
      print("Warning: reset environment before first action.")
      self.reset()

    if (a not in [0, 1]):
      print("Warning: only permitted actions are [0, 1].")

    # convert action index to action
    a = [-1,1][a]

    if np.random.random() < self.p_move:
      self.s_t = (self.s_t + a) % self.s_N
    return self.sample_o()

  def sample_o(self):
    return self.p_o_given_s[self.s_t].rvs()

## Random Agent Behavior

To test the environment we simulate a random agent's interactions with it. Here, the random agent samples actions uniformly from `{0, 1}`, i.e. it intends to move left or right with equal probability.

In [None]:
import continuous_observation_environment as coe
importlib.reload(coe)

env = coe.ContinuousObservationEnv(N=16, # number of states
                    s_food=0, # location of the food source
                    o_sample_size=100) # variance of observation decreases with increasing sample size.

n_steps = 100
ss, oo, aa = [], [], []

o = env.reset()
ss.append(env.s_t)
oo.append(o)

for i in range(n_steps):
  a = np.random.choice([0,1]) # random agent
  o = env.step(a)
  ss.append(env.s_t)
  oo.append(o)
  aa.append(a)

We inspect the sequence of states, actions and emissions during this interaction.

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(16, 12))
ax[0].plot(ss, label='agent state $s_t$')
ax[0].plot(np.ones_like(ss) * env.s_food, 
           'r--', label='food source', linewidth=1)
for i in range(len(aa)):
  ax[0].plot([i, i], [ss[i], ss[i]+[-1,1][aa[i]]], 
             color='orange', 
             linewidth=0.5,
             marker= CARETUP if aa[i] > 0 else CARETDOWN,
             label=None if i > 0 else 'action')
  
ax[0].set_xlabel('timestep t')
ax[0].set_ylabel('state s')
ax[0].legend()
ax[1].plot(np.array(oo))
ax[1].set_xlabel('timestep t')
ax[1].set_ylabel('observation o')

# Continuous Observation Agent

We start implementing the continuous observation agent by updating its belief in light of a new observation.

## Update based on new observation

The belief update based on new observations involves minimizing the KL-divergence between $Q(s;\theta')$ and $p(o, s) = Q(s; \theta) p(o|s)$ with respect to $\theta'$. The new observation is used to compute the joint probability $p(o, s)$ for each discrete state $s$, for which we need to estimate the probability density at $o$ in each state $s$.

```
p_o_given_s = np.array([p.pdf(o) for p in env.p_o_given_s])
p = torch.tensor(p_o_given_s * q(theta)) # p(o|s)p(s)
```
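Because the state space is discrete, the minimiser of this KL divergence is also available in closed form: it is the joint $Q(s;\theta)\,p(o|s)$ normalised over states. The sketch below computes that reference posterior for a hypothetical 8-state ring that mirrors the environment's default parameters. The names `exact_posterior` and `likelihoods` are illustrative and not part of the notebook's modules; the result is meant as a sanity check for the gradient-based update, not as a replacement for it.

```python
import numpy as np
from scipy.stats import beta

def exact_posterior(prior, likelihoods, o):
    """Bayes rule on a discrete state space: normalise prior(s) * p(o|s)."""
    joint = prior * np.array([rv.pdf(o) for rv in likelihoods])
    return joint / joint.sum()

# hypothetical ring world: food at state 0, defaults p_o_max=0.9, o_decay=0.2
n_states, sample_size = 8, 10
s = np.arange(n_states)
d = np.minimum(s, n_states - s)          # wrap-around distance to the food source
means = 0.9 * np.exp(-0.2 * d)
likelihoods = [beta(a=m * sample_size, b=(1 - m) * sample_size) for m in means]

prior = np.full(n_states, 1.0 / n_states)  # uniform belief
post_high = exact_posterior(prior, likelihoods, o=0.95)
post_low = exact_posterior(prior, likelihoods, o=0.15)
print("o=0.95 most likely state:", post_high.argmax())
print("o=0.15 most likely state:", post_low.argmax())
```

With a uniform prior, a high observation favours the food source and a low observation favours the state furthest from it.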

In [None]:
def update_belief(env, theta_prev, o, lr=4., n_steps=10, debug=False):
    theta = torch.tensor(theta_prev)
    
    # make p(s) from b
    q = torch.nn.Softmax(dim=0)
    p_o_given_s = torch.tensor([p.pdf(o) for p in env.p_o_given_s])
    p = p_o_given_s * q(theta) # p(o|s)p(s)
    log_p = torch.log(p)
    
    # initialize updated belief with current belief
    theta1 = torch.tensor(theta_prev, requires_grad=True)
    
    # estimate loss
    def forward():
        q1 = q(theta1)
        # free energy: KL[ q(s) || p(s, o) ]
        fe = torch.sum(q1 * (torch.log(q1) - log_p))
        return fe
    
    optimizer = torch.optim.SGD([theta1], lr=lr)
    ll = np.zeros(n_steps)
    for i in range(n_steps):
        l = forward()
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
        
        if debug:
            ll[i] = l.detach().numpy()
            
    theta1 = theta1.detach().numpy()
    if debug:
        return theta1, ll
        
    return theta1

Let's see the effect of this in action. We start with a uniform prior belief. Recall that the food source is in state 0 and that the probability of observing food decreases exponentially with the distance of a state from the food source, with the state space wrapping around.

If we observed infrequent food ($o=0.15$), then it is most likely that we are in the state furthest away from the food source. If we observed frequent food ($o=0.95$), then it is most likely that we are at the food source. If we observed food at an intermediate frequency ($o=0.65$), then we infer a bi-modal distribution with two most likely states at equal distance on either side of the food source and a lot of uncertainty about the exact state. This uncertainty decreases as we increase the `sample_size` of the environment to, e.g., 100.

In [None]:
o = 0.15

env = coe.ContinuousObservationEnv(N=8,s_food=0, o_sample_size=10)
theta = np.zeros(env.s_N)
theta1, ll = update_belief(env, theta, o=o, lr=2., n_steps=20, debug=True)

def softmax(x):
  e = np.exp(x - x.max())
  return e / e.sum()

fig, ax = plt.subplots(1, 2, figsize=(12, 6))
plt.sca(ax[0])
plt.plot(ll)
plt.plot([0, ll.shape[0]-1], [ll.min()]*2, 'k--')
plt.xlabel('optimization step')
plt.ylabel('loss')

plt.sca(ax[1])
plt.bar(np.arange(env.s_N)-0.2, width=0.4, height=softmax(theta), alpha=0.5, label='before') # belief before update
plt.bar(np.arange(env.s_N)+0.2, width=0.4, height=softmax(theta1), alpha=0.5, label='after') # belief after update
plt.xlabel('env state')
plt.ylabel('belief')
plt.title('Updating beliefs in light of a new observation.')
plt.legend()

## Information gain for action selection

Now we update the estimation of information gain during action selection. Before we start we copy some code unchanged from the previous notebook, because action selection requires belief updating in light of new observations and rollouts.



In [None]:
def kl(a, b):
    """ Discrete KL-divergence """
    return (a * (np.log(a) - np.log(b))).sum()
  
def update_belief_a(env, theta_prev, a, lr=4., n_steps=10, debug=False):
    # prior assumed to be expressed as parameters of the softmax (logits)
    theta = torch.tensor(theta_prev)
    q = torch.nn.Softmax(dim=0)(theta)
    
    # this is the prior for the distribution at time t
    # if we worked on this level, we would be done. 
    # but we need to determine the parameters of Q that produce 
    # this distribution
    q1 = torch.matmul(q, torch.tensor(env.p_s1_given_s_a[:,a,:]))

    # initialize updated belief to uniform
    theta1 = torch.zeros_like(theta, requires_grad=True)
    loss = torch.nn.CrossEntropyLoss() # expects logits and target distribution.
    optimizer = torch.optim.SGD([theta1], lr=lr)
    if debug:
        ll = np.zeros(n_steps)
        
    for i in range(n_steps):
        l = loss(theta1, q1)
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
        
        if debug:
            ll[i] = l.detach().numpy()
            
    theta1 = theta1.detach().numpy()
    if debug:
        return theta1, ll
        
    return theta1

Recall that the _information gain_ term quantifies the belief update due to making observations in future states.

$\mathbb{E}_{s \sim Q_{\theta}, o \sim p(o|s)}\left[ D_{KL}(\, Q_{\theta'}(s|o),  Q_{\theta}(s) \,) \right]$

Because observations were discrete in the minimal environment, we were able to enumerate all possible observations, compute the belief update after making each observation and weigh the KL divergence by the probability of generating each observation given our current belief about the state distribution.

Because observations are now continuous, we need to resort to sampling and Monte Carlo approximation of the information gain. For a finite set of samples, we 
- generate observations by first sampling a state $s\sim Q_{\theta}$ and then sampling an observation  $o \sim p(o|s)$
- perform one belief update based on each sampled observation
- compute the KL between the discrete belief over states before and after each belief update
- compute the mean of KL across samples.
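The steps above can be sketched in a self-contained form as follows. To keep the example short, the gradient-based belief update is swapped for the exact Bayes posterior, which is an assumption of this sketch and not the update used by the agent; the ring-world setup and the name `mc_info_gain` are likewise illustrative.

```python
import numpy as np
from scipy.stats import beta

def kl(a, b):
    """Discrete KL divergence KL[a || b]."""
    return float((a * (np.log(a) - np.log(b))).sum())

def mc_info_gain(q, likelihoods, n_samples, rng):
    """Monte Carlo estimate of the expected KL[posterior || prior]."""
    ss = rng.choice(len(q), p=q, size=n_samples)       # 1. s ~ Q
    gains = []
    for s in ss:
        o = likelihoods[s].rvs(random_state=rng)        # 2. o ~ p(o|s)
        o = np.clip(o, 1e-6, 1 - 1e-6)                  # keep densities finite
        # 3. exact Bayes posterior (stand-in for the gradient-based update)
        joint = q * np.array([rv.pdf(o) for rv in likelihoods])
        posterior = joint / joint.sum()
        gains.append(kl(posterior, q))                  # 4. KL per sample
    return float(np.mean(gains)), gains                 # 5. mean across samples

# hypothetical 8-state ring with food at state 0
rng = np.random.default_rng(0)
n_states, sample_size = 8, 10
s = np.arange(n_states)
d = np.minimum(s, n_states - s)
means = 0.9 * np.exp(-0.2 * d)
likelihoods = [beta(a=m * sample_size, b=(1 - m) * sample_size) for m in means]

q = np.full(n_states, 1.0 / n_states)
info_gain, gains = mc_info_gain(q, likelihoods, n_samples=100, rng=rng)
print(f"estimated information gain: {info_gain:.3f}")
```

Returning the per-sample values alongside the mean makes it possible to inspect how noisy the estimate is for a given number of samples.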

Note that while previously we performed one belief update per possible observation, i.e. 2 belief updates, we now perform one belief update in each rollout step per sampled observation, e.g., 100. This massively increases the computation time, because all belief updates are performed in sequence. Future work should explore how this could be vectorized. In order to explore how the estimate of info_gain changes with the number of sampled observations, we provide as optional debug output the list of information gain estimates per observation.

In [None]:
def rollout_step(env, log_p_c, theta, pi,
                 n_samples, use_info_gain, use_pragmatic_value, debug=False):
    
    if pi == []:
        return []

    a, pi_rest = pi[0], pi[1:]

    # Where will I be after taking action a?
    theta1 = update_belief_a(env, theta, a=a, lr=1.) 
    q = softmax(theta1)

    # Do I like being there?
    pragmatic = np.dot(q, log_p_c)

    # What might I observe after taking action a? (marginalize p(o, s) over s)
    ss = np.random.choice(range(env.s_N), p=q, size=n_samples)
    oo = [rv.rvs() for rv in env.p_o_given_s[ss]]
    # Do I learn about s from observing o?
    q_o = [softmax(update_belief(env, theta1, o=o)) for o in oo]
    d_o = [kl(q_o_i, q) for q_o_i in q_o] # info gain for each observation
    info_gain = np.mean(d_o) # expected value of info gain

    # negative expected free energy for this timestep
    nefe = use_pragmatic_value * pragmatic + use_info_gain * info_gain

    # nefe for remainder of policy rollout
    nefe_rest = rollout_step(env, log_p_c, theta1, pi_rest, 
                        n_samples=n_samples,
                        use_info_gain=use_info_gain, 
                        use_pragmatic_value=use_pragmatic_value, debug=False)

    # concatenate expected free energy across future time steps
    if debug:
      return [nefe] + nefe_rest, d_o
    
    return [nefe] + nefe_rest

def select_action(env, theta_star, theta_start, 
                  k=4, # planning horizon (number of sequential actions per plan)
                  n_samples=100,
                  use_info_gain=True, 
                  use_pragmatic_value=True,
                  select_max_pi=False, # replace sampling with best action selection
                  debug=False, # return plans, p of selecting each, and marginal p of actions
                 ):
    log_p_c = np.log(softmax(theta_star))

    # generate all plans
    plans = [ list(x) for x in itertools.product(range(env.a_N), repeat=k)]

    # evaluate negative expected free energy of all plans
    nefes = []
    for pi in plans:
      if debug:
        step_nefes, info_gains = rollout_step(env, log_p_c, theta_start, pi, 
                                  n_samples=n_samples,
                                  use_info_gain=use_info_gain, 
                                  use_pragmatic_value=use_pragmatic_value,
                                  debug=True)
      else:
        step_nefes = rollout_step(env, log_p_c, theta_start, pi, 
                                  n_samples=n_samples,
                                  use_info_gain=use_info_gain, 
                                  use_pragmatic_value=use_pragmatic_value)
        
      nefe = np.array(step_nefes).mean() # expected value over steps
      nefes.append(nefe)
        
    # compute probability of following each plan
    p_pi = softmax(np.array(nefes)).tolist()  

    if select_max_pi:
        a = plans[np.argmax(nefes)]
    else:
        a = plans[np.random.choice(len(plans), p=p_pi)]
    
    if debug:
        p_a = np.zeros(env.a_N)
        for p, pi in zip(p_pi, plans):
            p_a[pi[0]] += p
            
        return a, p_a, plans, p_pi, info_gains
    
    return a

Let's explore action selection from plans with horizon $k$ by specifying sharp priors on the starting state and target state $k-1$ steps apart.

If the starting state is to the right of the target (recall the state space wraps around), then policies that take a sequence of left actions ($a=0$) are scored higher. Note that this holds irrespective of the food source location.

If the starting state is to the left of the target (e.g., $s_0=11$), then policies that take a sequence of right actions ($a=1$) are scored higher.

If the starting state and the target state coincide, then policies that take equal numbers of left and right actions are scored highest.
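These expectations can be checked without any gradient steps. The sketch below scores every plan by pragmatic value only, propagating the belief exactly through the transition matrix instead of fitting logits; this shortcut, the helper names `ring_transitions` and `pragmatic_score`, and the omission of information gain are all assumptions made for illustration.

```python
import itertools
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ring_transitions(n_states, p_move):
    """p[s, a, s1] for a wrap-around 1D world; a=0 moves left, a=1 moves right."""
    eye = np.identity(n_states)
    p = np.zeros((n_states, 2, n_states))
    p[:, 0, :] = p_move * np.roll(eye, -1, axis=1) + (1 - p_move) * eye
    p[:, 1, :] = p_move * np.roll(eye, 1, axis=1) + (1 - p_move) * eye
    return p

def pragmatic_score(q, log_p_c, plan, transitions):
    """Mean expected log-preference along a plan, using exact belief propagation."""
    scores = []
    for a in plan:
        q = q @ transitions[:, a, :]      # predicted state distribution
        scores.append(q @ log_p_c)        # expected log-preference at this step
    return float(np.mean(scores))

n_states, k = 16, 4
starting_state, target_state = 11, 14
transitions = ring_transitions(n_states, p_move=0.75)
q_start = softmax(np.eye(n_states)[starting_state] * 10)        # sharp belief
log_p_c = np.log(softmax(np.eye(n_states)[target_state] * 10))  # sharp preference

plans = [list(p) for p in itertools.product(range(2), repeat=k)]
scores = [pragmatic_score(q_start, log_p_c, pi, transitions) for pi in plans]
best_plan = plans[int(np.argmax(scores))]
print("highest-scoring plan:", best_plan)
```

Starting to the left of the target, the plan consisting only of right actions scores highest; swapping the starting and target states makes the all-left plan win.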

In [None]:
starting_state = 11
target_state = 14

env = coe.ContinuousObservationEnv(N=16,s_food=0, o_sample_size=10)

# initialize belief
theta_start = np.eye(env.s_N)[starting_state] * 10 # sharp belief that we are in starting_state

# initialize preference
theta_star = np.eye(env.s_N)[target_state] * 10

a, p_a, plans, p_pi, info_gains = select_action(env, theta_star, theta_start, debug=True, n_samples=100)

# and explore what the agent prefers
plt.bar(x = range(len(plans)), height=p_pi)
plt.xlabel('plan id')
plt.ylabel(r'$p(\pi)$')

print('plans and associated probability of selecting them.')
for p, pi in zip(p_pi, plans):
    print(pi, p)

# estimate marginal probability of selecting a plan with first action 0 or 1
print('marginal probability of next action')
print(p_a)

## How many observation samples do we need?

Since the action selection runtime increases linearly with the number of observation samples used to approximate information gain, it would be useful to quantify the effect of sample size on downstream tasks.
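To make the cost concrete, the sketch below counts the belief updates implied by the rollout code above, assuming every plan of horizon `k` is evaluated and every rollout step performs one action update plus `n_samples` observation updates. The function name `count_belief_updates` is illustrative.

```python
def count_belief_updates(n_actions, horizon, n_samples):
    """Number of plans and belief updates for one call to action selection."""
    n_plans = n_actions ** horizon               # all action sequences of length k
    action_updates = n_plans * horizon           # one per rollout step
    observation_updates = action_updates * n_samples
    return n_plans, action_updates, observation_updates

print(count_belief_updates(n_actions=2, horizon=4, n_samples=100))
# (16, 64, 6400)
print(count_belief_updates(n_actions=2, horizon=4, n_samples=10))
# (16, 64, 640)
```

The observation updates dominate: they grow linearly in `n_samples` and exponentially in the planning horizon.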

### Uncertainty in information gain estimates

First, it is useful to explore how the uncertainty in the information gain estimate decreases with an increased number of observation samples. This lets us make an informed decision about the tradeoff between runtime and the precision of the information gain estimate. To estimate this, we compute per-observation information gains from a large number of sampled observations and use bootstrap estimates of the information gain with a varying number of samples to quantify uncertainty.
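As a numeric complement to a histogram, the spread of the bootstrap means can be summarised by their standard deviation. The sketch below uses synthetic gamma-distributed values as a hypothetical stand-in for the per-observation information gains, so the numbers it prints say nothing about the actual agent; `bootstrap_spread` is an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical stand-in for the per-observation information gains
info_gains = rng.gamma(shape=2.0, scale=0.1, size=1000)

def bootstrap_spread(values, n_o, n_bootstrap, rng):
    """Standard deviation of the mean over bootstrap resamples of size n_o."""
    resamples = rng.choice(values, size=(n_bootstrap, n_o), replace=True)
    return resamples.mean(axis=1).std()

spreads = {n_o: bootstrap_spread(info_gains, n_o, 1000, rng)
           for n_o in (10, 20, 50, 100)}
for n_o, spread in spreads.items():
    print(f"n={n_o:3d}  std of bootstrap mean: {spread:.4f}")
```

For independent samples the spread is expected to shrink roughly with the square root of the number of observations, so a tenfold increase in samples buys about a threefold reduction in noise.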

In [None]:
# explore how the estimate of info gain changes with increasing number of samples
print('number of info gain estimates', len(info_gains))
n_bootstrap = 100 # bootstrap samples of observation subsets
n_oo = [10, 20, 50, 100]
for n_o in n_oo:
  ig = np.random.choice(info_gains, size=(n_bootstrap, n_o), replace=True).mean(axis=1)
  plt.hist(ig, density=True, label=f'n={n_o}', alpha=0.5)
plt.legend()
plt.title('Bootstrap estimate of information gain with varying number of sampled observations')

### Uncertainty in action selection

Ultimately, though, the estimate of information gain is only relevant to the extent that it influences action selection. If this estimate were very uncertain or noisy due to a low number of observation samples, but the probability of selecting each plan were virtually the same across repetitions, then the noise would be irrelevant for all practical purposes. So another way to explore this tradeoff, one that is closer to what ultimately matters (action selection), is to explore how policies are scored differently with and without information gain estimates.

In [None]:
n_samples = 10 # number of observation samples in rollout used for info gain
n_trials = 10 # number of repeated estimates of plan probabilities with info gain
starting_state = 11
target_state = 14

env = coe.ContinuousObservationEnv(N=16,s_food=0, o_sample_size=10)
theta_start = np.eye(env.s_N)[starting_state] * 10 # initialize belief
theta_star = np.eye(env.s_N)[target_state] * 10 # initialize preference
pp_pi = []
pp_a = []
for _ in range(n_trials):
  _, p_a, _, p_pi, _ = select_action(env, theta_star, theta_start, debug=True, 
                                   n_samples=n_samples)
  pp_a.append(p_a)
  pp_pi.append(p_pi)

a1, p_a1, plans1, p_pi1, info_gain1 = select_action(env, 
                                                    theta_star, 
                                                    theta_start, 
                                                    use_info_gain=False,
                                                    debug=True, 
                                                    n_samples=1)

pp_pi = np.array(pp_pi)
pp_a = np.array(pp_a)
mean_pi, std_pi = pp_pi.mean(axis=0), pp_pi.std(axis=0)
mean_a, std_a = pp_a.mean(axis=0), pp_a.std(axis=0)

fig, ax = plt.subplots(2, 1, figsize=(8, 2*6))
plt.sca(ax[0])
plt.errorbar(x=range(len(p_pi)), y=mean_pi, yerr=std_pi, color='red', label='with info gain')
plt.bar(range(len(p_pi)), height= p_pi1, color='blue', label='without info gain')
plt.legend()
plt.title('Probability of selecting plans with noisy information gain.')

print('plans and associated probability of selecting them.')
for p, pi in zip(p_pi1, plans1):
    print(pi, p)
    
plt.sca(ax[1])
plt.errorbar(x=range(len(p_a)), y=mean_a, yerr=std_a, color='red', label='with info gain')
plt.bar(range(len(p_a)), height= p_a1, color='blue', label='without info gain')
plt.legend()
plt.title('Probability of selecting next actions with noisy information gain.')


## Putting it all together

Now we have all components required to implement an Active Inference agent for environments with discrete state spaces, discrete action spaces and _continuous_ observation spaces. The changes to the minimal agent that interacts with discrete observation space environments, again, turned out to be few and small.

1. Updating the belief based on new observations required an interface change from accessing the environment's conditional probability table $p[s,o]$ to state-specific random variables that let us evaluate the likelihood.

2. Information gain during action selection can no longer be computed by enumerating all possible discrete observations. Instead, we sample a finite set of observations from the observation space based on our current belief distribution over states.

Note that both of these changes could be ported back into the minimal agent to derive an interface that works for both discrete and continuous observation spaces. But sampling observations is far less efficient than enumerating all possible observations and making use of their state-conditional probabilities.

Let's encapsulate these changes into an agent class that, as the other agents before it, manages the target state and current belief state over time and provides a minimal interface with reset and step methods.

In [None]:
import itertools

import numpy as np
import torch

def softmax(x):
  e = np.exp(x - x.max())
  return e / e.sum()

def kl(a, b):
    """ Discrete KL-divergence """
    return (a * (np.log(a) - np.log(b))).sum()

class ContinuousObservationAgent:
    
    def __init__(self, 
                 env,
                 target_state, 
                 k=2, # planning horizon
                 n_o_samples=10, # observation samples for information gain
                 use_info_gain=True, # score actions by info gain
                 use_pragmatic_value=True, # score actions by pragmatic value
                 select_max_pi=False, # sample plan (False), select max negEFE (True).
                 n_steps_o=20, # optimization steps after new observation
                 n_steps_a=20, # optimization steps after new action
                 lr_o=4., # learning rate of optimization after new observation
                 lr_a=4.): # learning rate of optimization after new action
        
        self.env = env
        self.target_state = target_state
        self.k = k
        self.n_o_samples = n_o_samples
        self.use_info_gain = use_info_gain
        self.use_pragmatic_value = use_pragmatic_value
        self.select_max_pi = select_max_pi
        self.n_steps_o = n_steps_o
        self.n_steps_a = n_steps_a
        self.lr_a = lr_a
        self.lr_o = lr_o
        
    def reset(self):
        # initialize state preference
        self.b_star = np.eye(self.env.s_N)[self.target_state] * 10
        self.log_p_c = np.log(softmax(self.b_star))
        # initialize state prior as uniform
        self.b = np.zeros(self.env.s_N)
        
    def step(self, o, debug=False):
        if debug:
            return self._step_debug(o)
        
        self.b = self._update_belief(theta_prev=self.b, o=o)
        a = self._select_action(theta_start=self.b)[0] # pop first action of selected plan
        self.b = self._update_belief_a(theta_prev=self.b, a=a)
        return a
    
    def _step_debug(self, o):
        self.b, ll_o = self._update_belief(theta_prev=self.b, o=o, debug=True)
        a, p_a, _, _, _ = self._select_action(theta_start=self.b, debug=True)
        a = a[0]
        self.b, ll_a = self._update_belief_a(theta_prev=self.b, a=a, debug=True)
        return a, ll_o, ll_a, p_a
    
    def _update_belief_a(self, theta_prev, a, debug=False):
        # prior assumed to be expressed as parameters of the softmax (logits)
        theta = torch.tensor(theta_prev)
        q = torch.nn.Softmax(dim=0)(theta)

        # this is the prior for the distribution at time t
        q1 = torch.matmul(q, torch.tensor(self.env.p_s1_given_s_a[:,a,:]))

        # initialize parameters of updated belief to uniform
        theta1 = torch.zeros_like(theta, requires_grad=True)
        loss = torch.nn.CrossEntropyLoss() # expects logits and target distribution.
        optimizer = torch.optim.SGD([theta1], lr=self.lr_a)
        if debug:
            ll = np.zeros(self.n_steps_a)

        for i in range(self.n_steps_a):
            l = loss(theta1, q1)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()

            if debug:
                ll[i] = l.detach().numpy()

        theta1 = theta1.detach().numpy()
        if debug:
            return theta1, ll

        return theta1
    
    def _update_belief(self, theta_prev, o, debug=False):
        theta = torch.tensor(theta_prev)

        # make p(s) from b
        q = torch.nn.Softmax(dim=0)
        p_o_given_s = torch.tensor([p.pdf(o) for p in self.env.p_o_given_s])
        p = p_o_given_s * q(theta) # p(o|s)p(s)
        log_p = torch.log(p)

        # initialize updated belief with current belief
        theta1 = torch.tensor(theta_prev, requires_grad=True)

        # estimate loss
        def forward():
            q1 = q(theta1)
            # free energy: KL[ q(s) || p(s, o) ]
            fe = torch.sum(q1 * (torch.log(q1) - log_p))
            return fe

        optimizer = torch.optim.SGD([theta1], lr=self.lr_o)
        ll = np.zeros(self.n_steps_o)
        for i in range(self.n_steps_o):
            l = forward()
            optimizer.zero_grad()
            l.backward()
            optimizer.step()

            if debug:
                ll[i] = l.detach().numpy()

        theta1 = theta1.detach().numpy()
        if debug:
            return theta1, ll

        return theta1

    def _select_action(self, theta_start, debug=False): # return plans, p of selecting each, and marginal p of actions
      
        # generate all plans
        plans = [ list(x) for x in itertools.product(range(self.env.a_N), repeat=self.k)]
        # evaluate negative expected free energy of all plans
        nefes = []
        for pi in plans:
          
          if debug:
            step_nefes, info_gains = self._rollout_step(theta_start, pi, 
                                                        debug=True)
          else:
            step_nefes = self._rollout_step(theta_start, pi)
            
          nefe = np.array(step_nefes).mean() # expected value over steps
          nefes.append(nefe)

        # compute probability of following each plan
        p_pi = softmax(np.array(nefes)).tolist()
        if self.select_max_pi:
            a = plans[np.argmax(nefes)]
        else:
            a = plans[np.random.choice(len(plans), p=p_pi)]

        if debug:
            # compute marginal action probabilities
            p_a = np.zeros(self.env.a_N)
            for p, pi in zip(p_pi, plans):
                p_a[pi[0]] += p

            return a, p_a, plans, p_pi, info_gains

        return a

    def _rollout_step(self, theta, pi, debug=False):
        if pi == []:
            return []

        a, pi_rest = pi[0], pi[1:]
        # Where will I be after taking action a?
        theta1 = self._update_belief_a(theta, a=a) 
        q = softmax(theta1)
        #print('--------------')
        #print(theta)
        #print(a)
        #print(theta1)
        #print(q)
        
        # Do I like being there?
        pragmatic = np.dot(q, self.log_p_c)
        # What might I observe after taking action a? (marginalize p(o, s) over s)
        ss = np.random.choice(range(self.env.s_N), p=q, size=self.n_o_samples)
        oo = [rv.rvs() for rv in self.env.p_o_given_s[ss]]
        # Do I learn about s from observing o?
        q_o = [softmax(self._update_belief(theta1, o=o)) for o in oo]
        d_o = [kl(q_o_i, q) for q_o_i in q_o] # info gain for each observation
        info_gain = np.mean(d_o) # expected value of info gain
        # negative expected free energy for this timestep
        nefe = self.use_pragmatic_value * pragmatic + \
               self.use_info_gain * info_gain
        
        # nefe for remainder of policy rollout
        nefe_rest = self._rollout_step(theta1, pi_rest)
        # concatenate expected free energy across future time steps
        if debug:
          return [nefe] + nefe_rest, d_o

        return [nefe] + nefe_rest

The code below iterates over all steps involved in the interaction between the environment and the active inference agent. In each interaction step, the agent updates its belief about the current state given a new observation and selects an action to minimise expected free energy. It then updates its belief assuming the selected action was taken and starts anew by updating its belief based on the next observation.

In [None]:
import importlib
import continuous_observation_environment as coe
import continuous_observation_agent as coa
importlib.reload(coe)
importlib.reload(coa)

target_state = 4
k = 4 # planning horizon; run time increases exponentially with planning horizon

# runtime increases linearly with optimization steps during belief update
n_steps_o = 20 # optimization steps updating belief after observation
n_steps_a = 10 # optimization steps updating belief after action
lr_o = 8. # learning rate updating belief after observation
lr_a = 4. # learning rate updating belief after action

render_losses = True

env = coe.ContinuousObservationEnv(N=16, # number of states
                              s_food=0, # location of the food source
                              s_0=10, # starting location 
                              o_sample_size=5) # observation Beta distribution parameter.

# visualise emission probability
samples = [env.p_o_given_s[s].rvs(size=1000) for s in range(env.s_N)]
df = pd.DataFrame(np.array(samples).T)
sns.violinplot(df, cut=0, width=2)
plt.xlabel('state s')
plt.ylabel('p(o|s)')
plt.title('Continuous environment emission probability')

agent = coa.ContinuousObservationAgent(env=env, 
                             target_state=target_state,
                             k=k, 
                             use_info_gain=True,
                             use_pragmatic_value=True,
                             select_max_pi=True,
                             n_steps_o=n_steps_o, 
                             n_steps_a=n_steps_a, 
                             lr_a=lr_a, 
                             lr_o=lr_o)

o = env.reset() # set state to starting state
agent.reset() # initialize belief state and target state distribution

ss = [env.s_t]
bb = [agent.b]
aa = []
if render_losses:
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))
    ax[0].set_title('updates from actions')
    ax[0].set_ylabel('loss')
    ax[0].set_xlabel('optimization step')
    ax[1].set_title('updates from observations')
    ax[1].set_ylabel('loss')
    ax[1].set_xlabel('optimization step')
    
for i in range(64):
    a, ll_o, ll_a, p_a = agent.step(o, debug=True)
    print(f"step {i}, s: {env.s_t}, max b:{bb[-1].argmax()}, o: {o:.2f}, p(a): {p_a}, a: {a}")
    if render_losses:
        ax[0].plot(ll_a)
        ax[1].plot(ll_o)
    
    o = env.step(a)
    
    ss.append(env.s_t)
    bb.append(agent.b)
    aa.append(a)


from matplotlib.markers import CARETUP, CARETDOWN
aa = np.array(aa)
ss = np.array(ss)

fig, ax = plt.subplots(figsize=(16, 6))
plt.imshow(np.array(bb).T, label='belief')

for i in range(len(aa)):
  plt.plot([i, i], [ss[i], ss[i]+[-1,1][aa[i]]], 
             color='orange', 
             linewidth=0.5,
             marker= CARETDOWN if aa[i] > 0 else CARETUP,
             label=None if i > 0 else 'action')


plt.plot(ss, label='state')
plt.plot([0, len(ss)-1], [target_state]*2, label='target')
plt.plot([0, len(ss)-1], [env.s_food]*2, 'w--', label='food')
plt.legend()

# Future Work

We highlighted that during rollout for action selection we perform a vast number of belief updates due to sampling hypothetical observations in each timestep. In future work, we should explore how this could be vectorized.
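One possible direction, sketched below under strong assumptions, is to evaluate the likelihood of all sampled observations under all states in a single broadcast call and to replace the gradient-based update with the exact Bayes posterior. Whether the exact posterior is an acceptable substitute for the optimized belief is an open design question; the function name `batched_info_gain` and the ring-world setup are hypothetical.

```python
import numpy as np
from scipy.stats import beta

def batched_info_gain(q, a, b, oo):
    """Information gain for a batch of observations without a Python loop.

    q: belief over states, shape (S,); a, b: Beta parameters per state, shape (S,)
    oo: sampled observations, shape (M,)
    """
    # likelihood[i, s] = p(o_i | s) for every observation/state pair
    likelihood = beta.pdf(oo[:, None], a[None, :], b[None, :])
    joint = likelihood * q[None, :]
    posterior = joint / joint.sum(axis=1, keepdims=True)  # exact Bayes, shape (M, S)
    kl = (posterior * (np.log(posterior) - np.log(q)[None, :])).sum(axis=1)
    return kl.mean(), posterior

# hypothetical 16-state ring with food at state 0 and default emission parameters
rng = np.random.default_rng(0)
n_states, sample_size = 16, 10
s = np.arange(n_states)
d = np.minimum(s, n_states - s)              # wrap-around distance to the food source
mean = 0.9 * np.exp(-0.2 * d)
a, b = mean * sample_size, (1 - mean) * sample_size

q = np.full(n_states, 1.0 / n_states)        # uniform belief
ss = rng.choice(n_states, p=q, size=100)     # s ~ Q
oo = np.clip(rng.beta(a[ss], b[ss]), 1e-6, 1 - 1e-6)  # o ~ p(o|s), kept inside (0, 1)
info_gain, posterior = batched_info_gain(q, a, b, oo)
print(f"batched information gain estimate: {info_gain:.3f}")
```

All 100 posteriors are computed with one density evaluation, so the cost per rollout step no longer scales with a sequence of separate optimizations.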
