<a href="https://colab.research.google.com/github/Kinds-of-Intelligence-CFI/measurement-layout-tutorial/blob/main/tutorial-notebooks/4_BuildingGoodBenchmarks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building Good Benchmarks

**Lead Presenter**: Kozzy Voudouris

In this tutorial, we introduce some key concepts that should guide the development of purpose-built benchmarks for use with measurement layouts. We incrementally build a measurement layout for studying the cognitive capability of object permanence in a complex three-dimensional environment. Finally, we evaluate some agents on a suite of tests.

## Preamble

First, let's import the libraries, functions, and data that we will need.

In [None]:
!pip install arviz --quiet
!pip install erroranalysis --quiet
!pip install numpy --quiet
!pip install pymc --quiet

In [None]:
import arviz as az
import erroranalysis as ea
import gc
import graphviz
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import random as rm
import seaborn as sns

from IPython.display import Image
from scipy import stats
from sklearn.model_selection import train_test_split
from google.colab import files
from pymc import model

# Note: Colab imports an older version of PyMC by default. This won't cause problems for this tutorial,
# but may do if you use a different backend (e.g., GPU) and a JAX/NumPyro sampler.
# In that case, run `!pip install 'pymc>5.9' --quiet`.
print(f"Running on PyMC v{pm.__version__}")

In [None]:
data_url = 'https://raw.githubusercontent.com/Kinds-of-Intelligence-CFI/measurement-layout-tutorial/main/data/4_PCTB_data.csv'
pctb_dataset = pd.read_csv(data_url)

Let's inspect the dataset. There are **551 instances**, **8 metafeatures**, and the performances of **7 agents**.

The metafeatures are as follows:
1. `basicTask` - is the task a basic task? Values (discrete, binary): `0` (No), `1` (Yes).
2. `pctbGridTask` - is the task a PCTB Grid task? Values (discrete, binary): `0` (No), `1` (Yes).
3. `mainGoalSize` - what size is the goal in the task? Value range: `0.5` - `5.0`.
4. `goalPosition` - what is the relative position of the goal with respect to the agent's starting position? Value range: `-1.5` - `1.5` (0 is centre point).
5. `goalOccluded` - is the goal occluded when the agent starts the episode? Values (discrete, binary): `0` (No), `1` (Yes).
6. `minDistToGoal` - how far is the goal from the agent? This is calculated as the Manhattan distance to the goal, avoiding any obstacles/pits. Value range: `9.0` - `47.0`.
7. `minNumTurnsGoal` - how many right-angle turns would the agent take on the trajectory described by `minDistToGoal`? Value range: `0.0` - `3.0`.
8. `numChoices` - how many choices does the agent have in the task? Value range: `1.0` - `12.0`.

The agents performed each of these tasks, and whether they obtained the goal (`1`) or not (`0`) was recorded. The agents are as follows:
1. `Random_Agent` - An agent which randomly samples one of the 9 actions in the Animal-AI Environment (`no action`, `forwards`, `backwards`, `left rotate`, `right rotate`, `forwards left`, `forwards right`, `backwards left`, `backwards right`). It then takes that action for a number of steps sampled from $U(1, 20)$.
2. `Heuristic_Agent` - An agent that navigates towards green goals, following a rigid rule.
3. `Dreamer_basic` - A dreamer-v3 agent trained for 2M steps on a set of 300 basic tasks, of which all tasks where `basicTask == 1` are a subset.
4. `Dreamer_basic_control` - A dreamer-v3 agent trained for 10M steps on a set of 2372 basic and practice tasks, of which all tasks where `basicTask == 1 or (pctbGridTask == 1 and goalOccluded == 0)` are a subset.
5. `PPO_basic` - A PPO agent trained for 2M steps on a set of 300 basic tasks, of which all tasks where `basicTask == 1` are a subset.
6. `PPO_basic_control` - A PPO agent trained for 10M steps on a set of 2372 basic and practice tasks, of which all tasks where `basicTask == 1 or (pctbGridTask == 1 and goalOccluded == 0)` are a subset.

In [None]:
pctb_dataset
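
Before fitting anything, it can help to look at raw success rates and at which tasks fall inside an agent's training set. The sketch below uses a tiny made-up table rather than the real `pctb_dataset` (the rows and the `toy` name are assumptions for illustration only). It also shows one way to write the subset condition `basicTask == 1 or (pctbGridTask == 1 and goalOccluded == 0)` element-wise in pandas, where `|` and `&` are needed instead of `or` and `and`.

```python
import pandas as pd

# Hypothetical stand-in rows; the real values come from `pctb_dataset`.
toy = pd.DataFrame({
    "basicTask":       [1, 0, 0, 0],
    "pctbGridTask":    [0, 1, 1, 1],
    "goalOccluded":    [0, 0, 1, 1],
    "Random_Agent":    [1, 0, 0, 0],
    "Heuristic_Agent": [1, 1, 1, 0],
})

agent_cols = ["Random_Agent", "Heuristic_Agent"]

# Outcomes are coded 0/1, so the column mean is the success rate.
success_rate = toy[agent_cols].mean()

# Success rate split by whether the goal starts out occluded.
by_occlusion = toy.groupby("goalOccluded")[agent_cols].mean()

# Element-wise version of the training-subset condition.
in_control_training = (toy["basicTask"] == 1) | (
    (toy["pctbGridTask"] == 1) & (toy["goalOccluded"] == 0)
)

print(success_rate)
print(by_occlusion)
print(in_control_training.tolist())
```

Swapping `toy` for `pctb_dataset` and extending `agent_cols` should give the same summaries for the real data, assuming the column names match those listed above.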

## Object Permanence

First, let's set up a simple measurement layout for the *Object Permanence* ability.

The leaf node of the measurement layout is task success, so we define it as a Bernoulli variable that takes a probability of success. To turn the margin between ability and demand into a probability, we need a logistic function. Because we are dealing with bounded capabilities, we scale the logistic so that an agent with maximum ability on a task with minimum demand succeeds with probability 0.999, while an agent with minimum ability on a task with maximum demand succeeds with probability 0.001. This is a convenient parameterisation of the logistic function for bounded capabilities.

In [None]:
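def logistic999(x, min, max):
  # Scaled logistic for bounded capabilities. After x is shifted by `min`, a value of
  # (max - min) gives probability 0.999 and a value of -(max - min) gives probability 0.001.
  # Note: the parameters `min` and `max` shadow the Python built-ins inside this function.
  x = x - min
  max = max - min
  x = 6.90675478 * x / max    # 6.90675478 is approximately ln(999)
  return 1 / (1 + np.exp(-x))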
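
The endpoint behaviour of this function can be checked numerically. The sketch below restates the same arithmetic under a different name (`scaled_logistic`, with `lo`/`hi` in place of `min`/`max`; these names are assumptions for illustration, not part of the notebook) and evaluates it at a few margins.

```python
import numpy as np

def scaled_logistic(x, lo, hi):
    """Same arithmetic as `logistic999`, with `lo`/`hi` in place of `min`/`max`."""
    scale = np.log(999.0)  # approximately 6.90675478, the constant used above
    return 1.0 / (1.0 + np.exp(-scale * (x - lo) / (hi - lo)))

# With a lower bound of 0, the margin runs from -hi to +hi.
p_best = scaled_logistic(10.0, lo=0.0, hi=10.0)    # maximum ability, minimum demand
p_even = scaled_logistic(0.0, lo=0.0, hi=10.0)     # ability equals demand
p_worst = scaled_logistic(-10.0, lo=0.0, hi=10.0)  # minimum ability, maximum demand

# With a non-zero lower bound, the margin is shifted by `lo` first,
# so the 0.5 point sits at margin == lo rather than at margin == 0.
p_shifted = scaled_logistic(2.0, lo=2.0, hi=10.0)
```

The constant works because `1 / (1 + exp(-ln(999)))` equals `999 / 1000`.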

In [None]:
def setupOPModel(data, agent_col_name: str):

  # get results column
  results = data[agent_col_name]

  # define bounds
  abilityMin = {}
  abilityMax = {}

  minPermAbility = ((data["minDistToGoal"] * data["numChoices"]).min())

  maxPermAbility = ((data["minDistToGoal"] * data["numChoices"]).max())

  abilityMin["objPermAbility"] = minPermAbility
  abilityMax["objPermAbility"] = maxPermAbility


  m = pm.Model()
  with m:

    # Define abilities and their priors

    objPermAbility = pm.Uniform("objPermAbility", minPermAbility, maxPermAbility)

    # Define environment variables as MutableData

    goalDist = pm.MutableData("goalDistance", data["minDistToGoal"].values)
    numChoices = pm.MutableData("numChoices", data["numChoices"].values)
    opTest = pm.MutableData("goalOccluded", data["goalOccluded"].values)

    # Margins

    objPermMargin = (objPermAbility - (goalDist * numChoices * opTest))
    objPermP = pm.Deterministic("objPermP", logistic999(objPermMargin, min = minPermAbility, max = maxPermAbility))

    taskSuccess = pm.Bernoulli("taskSuccess", objPermP, observed = results)

  return m, abilityMin, abilityMax
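
To see what the margin in `setupOPModel` does for a given ability value, the forward computation can be reproduced in plain NumPy. This is only a sketch: the three tasks, the fixed `ability` value, and the `scaled_logistic` helper name are assumptions for illustration, and no inference is performed.

```python
import numpy as np

def scaled_logistic(x, lo, hi):
    # Same arithmetic as `logistic999` above.
    return 1.0 / (1.0 + np.exp(-np.log(999.0) * (x - lo) / (hi - lo)))

# Hypothetical tasks: distance to goal, number of choices, occlusion flag.
goal_dist = np.array([10.0, 10.0, 40.0])
num_choices = np.array([2.0, 2.0, 12.0])
occluded = np.array([0.0, 1.0, 1.0])

# Bounds derived as in `setupOPModel`: min and max of distance * choices.
lo = (goal_dist * num_choices).min()
hi = (goal_dist * num_choices).max()

ability = 250.0  # hypothetical fixed ability, inside [lo, hi]

# The occlusion flag switches the demand off when the goal is visible.
demand = goal_dist * num_choices * occluded
margin = ability - demand
p_success = scaled_logistic(margin, lo, hi)

print(demand)
print(p_success)
```

Because the demand is multiplied by `goalOccluded`, every unoccluded task has zero object permanence demand, so for a fixed ability all such tasks receive the same predicted probability of success.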

In [None]:
m, abilityMin, abilityMax = setupOPModel(pctb_dataset, 'Dreamer_basic')
gv = pm.model_graph.model_to_graphviz(m)
gv

Let's run this measurement layout on two of our agents: the `Dreamer_basic` agent and the `Heuristic_Agent`.

In [None]:
model_dreamer_basic, abilityMin, abilityMax = setupOPModel(pctb_dataset, 'Dreamer_basic')
with model_dreamer_basic:
  data_dreamer_basic = pm.sample(1000, target_accept=0.95)

model_heuristic, abilityMin, abilityMax = setupOPModel(pctb_dataset, 'Heuristic_Agent')
with model_heuristic:
  data_heuristic = pm.sample(1000, target_accept=0.95)
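
As a sanity check on what the sampler is estimating, the posterior over a single bounded ability can also be approximated on a grid. The sketch below is not part of the original workflow: it simulates outcomes from made-up demands and a made-up "true" ability, then normalises the Bernoulli likelihood over a grid of candidate abilities (which is the posterior under a uniform prior). All names and values here are assumptions for illustration.

```python
import numpy as np

def scaled_logistic(x, lo, hi):
    # Same arithmetic as `logistic999` above.
    return 1.0 / (1.0 + np.exp(-np.log(999.0) * (x - lo) / (hi - lo)))

rng = np.random.default_rng(0)

# Hypothetical demands and a hypothetical "true" ability used to simulate outcomes.
lo, hi = 0.0, 100.0
demand = rng.uniform(lo, hi, size=200)
true_ability = 60.0
outcomes = rng.random(200) < scaled_logistic(true_ability - demand, lo, hi)

# Evaluate the Bernoulli log-likelihood at each candidate ability on a grid.
grid = np.linspace(lo, hi, 501)
p = scaled_logistic(grid[:, None] - demand[None, :], lo, hi)  # shape (501, 200)
log_lik = np.where(outcomes, np.log(p), np.log1p(-p)).sum(axis=1)

# A Uniform(lo, hi) prior is constant on the grid, so the posterior is the
# normalised likelihood.
posterior = np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()
posterior_mean = float((grid * posterior).sum())

print(posterior_mean)
```

Grid approximation only scales to one or two abilities, which is why the layouts in this notebook rely on `pm.sample` instead.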

Let's compare the inferred object permanence capability of these agents, by plotting their capabilities as forest plots:

In [None]:
forest_plot_dreamer_basic = az.plot_forest(data=data_dreamer_basic['posterior'][['objPermAbility']])
axes_dreamer_basic = forest_plot_dreamer_basic.ravel()[0]
axes_dreamer_basic.set_xlim(left=abilityMin['objPermAbility'], right=abilityMax['objPermAbility'])

forest_plot_heuristic = az.plot_forest(data=data_heuristic['posterior'][['objPermAbility']])
axes_heuristic = forest_plot_heuristic.ravel()[0]
axes_heuristic.set_xlim(left=abilityMin['objPermAbility'], right=abilityMax['objPermAbility'])

We can also look at the summary statistics from the two measurement layouts:

In [None]:
summary_dreamer_basic = az.summary(data_dreamer_basic['posterior']['objPermAbility'])
summary_dreamer_basic

In [None]:
summary_heuristic = az.summary(data_heuristic['posterior']['objPermAbility'])
summary_heuristic

With this layout, we infer that the heuristic agent has a higher object permanence capability than the Dreamer agent.
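
One way to quantify a comparison like this is to estimate the probability that one agent's ability exceeds the other's, directly from posterior draws. The sketch below uses synthetic normal draws as stand-ins for the real posteriors (the means, spreads, and variable names are assumptions, not results from this notebook).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for flattened posterior draws of `objPermAbility`.
draws_a = rng.normal(loc=300.0, scale=20.0, size=4000)
draws_b = rng.normal(loc=200.0, scale=20.0, size=4000)

# The two layouts are fitted independently, so draws can be paired arbitrarily.
diff = draws_a - draws_b

# Fraction of paired draws in which agent A has the higher ability.
prob_a_higher = float((diff > 0).mean())

# Central 94% interval of the difference in ability.
interval = np.percentile(diff, [3, 97])

print(prob_a_higher, interval)
```

With the fitted layouts, the stand-in arrays could be replaced by something like `data_heuristic['posterior']['objPermAbility'].values.ravel()` and the corresponding Dreamer draws.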

## Introducing Navigation

These tasks are fundamentally search tasks. The agent must navigate towards the reward. As such, an agent with object permanence but poor at navigation may fail many of these tasks. Moreover, an agent without object permanence, but that is good at navigating, may accidentally obtain the reward on occasion.

The simplest way to frame this is to say that navigation and object permanence are non-compensatory: being good at navigation does not compensate for being bad at object permanence, and vice versa. For the purposes of this tutorial, we proceed with this formulation, although it may be more accurate to implement an asymmetric compensatory relationship between the two (since, arguably, navigation compensates for object permanence more than vice versa).

Navigation demands can be implemented in terms of how far away the goal is along with the circuitousness of the route. We can define this as the product of distance and number of turns.
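
As a small worked example of this definition, the sketch below computes the navigation demand for a few made-up routes (the distances and turn counts are assumptions chosen from within the value ranges listed earlier).

```python
import numpy as np

# Hypothetical routes: distance along the route and number of right-angle turns.
min_dist = np.array([9.0, 47.0, 20.0, 47.0])
num_turns = np.array([0.0, 0.0, 2.0, 3.0])

# Navigation demand as the product of distance and number of turns.
nav_demand = min_dist * num_turns

print(nav_demand)
```

One consequence of using a product is that any route with zero turns has zero navigation demand, however long it is: the first two routes above are treated as equally easy.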

Let's extend the measurement layout to include navigation:

In [None]:
def setupOPNavModel(data, agent_col_name: str):

  # get results column
  results = data[agent_col_name]

  # define bounds
  abilityMin = {}
  abilityMax = {}

  minPermAbility = ((data["minDistToGoal"] * data["numChoices"]).min())
  minNavAbility = ((data["minDistToGoal"] * data["minNumTurnsGoal"]).min())

  maxPermAbility = ((data["minDistToGoal"] * data["numChoices"]).max())
  maxNavAbility = ((data["minDistToGoal"] * data["minNumTurnsGoal"]).max())

  abilityMin["objPermAbility"] = minPermAbility
  abilityMax["objPermAbility"] = maxPermAbility

  abilityMin["navAbility"] = minNavAbility
  abilityMax["navAbility"] = maxNavAbility


  m = pm.Model()
  with m:

    # Define abilities and their priors

    objPermAbility = pm.Uniform("objPermAbility", minPermAbility, maxPermAbility)

    navAbility = pm.Uniform("navAbility", minNavAbility, maxNavAbility)

    # Define environment variables as MutableData

    goalDist = pm.MutableData("goalDistance", data["minDistToGoal"].values)
    numChoices = pm.MutableData("numChoices", data["numChoices"].values)
    opTest = pm.MutableData("goalOccluded", data["goalOccluded"].values)
    numTurnsGoal = pm.MutableData("minTurnsToGoal", data["minNumTurnsGoal"].values)

    # Margins

    objPermMargin = (objPermAbility - (goalDist * numChoices * opTest))
    objPermP = pm.Deterministic("objPermP", logistic999(objPermMargin, min = minPermAbility, max = maxPermAbility))

    navP = pm.Deterministic("navP", logistic999(navAbility - (goalDist * numTurnsGoal), min = minNavAbility, max = maxNavAbility))

    # Define final success probability with a non-compensatory interaction (product of probabilities)

    finalP = pm.Deterministic("finalP", (objPermP * navP))

    taskSuccess = pm.Bernoulli("taskSuccess", finalP, observed = results)

  return m, abilityMin, abilityMax
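
The effect of multiplying `objPermP` and `navP` can be illustrated with a few made-up probabilities. In the sketch below, the averaging rule is included purely for contrast and is not used anywhere in this notebook; all values and names are assumptions for illustration.

```python
import numpy as np

# Hypothetical per-task success probabilities for each capability.
obj_perm_p = np.array([0.95, 0.95, 0.10, 0.10])
nav_p = np.array([0.95, 0.10, 0.95, 0.10])

# Non-compensatory interaction, as in the layout: both must succeed.
non_compensatory = obj_perm_p * nav_p

# For contrast only: an average lets strength in one capability offset
# weakness in the other.
compensatory = (obj_perm_p + nav_p) / 2

print(non_compensatory)
print(compensatory)
```

Under the product, the combined probability can never exceed the weaker of the two components, so a strong navigator with poor object permanence is still predicted to fail most occluded tasks.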

In [None]:
m, abilityMin, abilityMax = setupOPNavModel(pctb_dataset, 'Dreamer_basic')
gv = pm.model_graph.model_to_graphviz(m)
gv

Let's run this new measurement layout with the same two agents:

In [None]:
model_dreamer_basic, abilityMin, abilityMax = setupOPNavModel(pctb_dataset, 'Dreamer_basic')
with model_dreamer_basic:
  data_dreamer_basic = pm.sample(1000, target_accept=0.95)

model_heuristic, abilityMin, abilityMax = setupOPNavModel(pctb_dataset, 'Heuristic_Agent')
with model_heuristic:
  data_heuristic = pm.sample(1000, target_accept=0.95)

Let's compare the inferred object permanence and navigation capabilities of these agents, by plotting their capabilities as forest plots:

In [None]:
forest_plot_dreamer_basic = az.plot_forest(data=data_dreamer_basic['posterior'][['objPermAbility']])
axes_dreamer_basic = forest_plot_dreamer_basic.ravel()[0]
axes_dreamer_basic.set_xlim(left=abilityMin['objPermAbility'], right=abilityMax['objPermAbility'])

forest_plot_heuristic = az.plot_forest(data=data_heuristic['posterior'][['objPermAbility']])
axes_heuristic = forest_plot_heuristic.ravel()[0]
axes_heuristic.set_xlim(left=abilityMin['objPermAbility'], right=abilityMax['objPermAbility'])

In [None]:
forest_plot_dreamer_basic = az.plot_forest(data=data_dreamer_basic['posterior'][['navAbility']])
axes_dreamer_basic = forest_plot_dreamer_basic.ravel()[0]
axes_dreamer_basic.set_xlim(left=abilityMin['navAbility'], right=abilityMax['navAbility'])

forest_plot_heuristic = az.plot_forest(data=data_heuristic['posterior'][['navAbility']])
axes_heuristic = forest_plot_heuristic.ravel()[0]
axes_heuristic.set_xlim(left=abilityMin['navAbility'], right=abilityMax['navAbility'])

Let's compare these results to the random agent too:

In [None]:
model_random, abilityMin, abilityMax = setupOPNavModel(pctb_dataset, 'Random_Agent')
with model_random:
  data_random = pm.sample(1000, target_accept=0.95)

In [None]:
forest_plot_random = az.plot_forest(data=data_random['posterior'][['objPermAbility']])
axes_random = forest_plot_random.ravel()[0]
axes_random.set_xlim(left=abilityMin['objPermAbility'], right=abilityMax['objPermAbility'])

In [None]:
forest_plot_random = az.plot_forest(data=data_random['posterior'][['navAbility']])
axes_random = forest_plot_random.ravel()[0]
axes_random.set_xlim(left=abilityMin['navAbility'], right=abilityMax['navAbility'])