# Introduction to the computational Theory of Mind (ToM) model

The following is an introduction to the computational implementation of Theory of Mind (ToM) originally presented by [Devaine et al. (2017)](http://dx.plos.org/10.1371/journal.pcbi.1005833), which is the model used by the ToM agents in the tomsup package. The guide will first give a conceptual overview of the model and its parameters, and will then go through the model in detail.

---
## Conceptual overview
The ToM model used here is an attempt at approximating the theory of mind processes in humans. The simplest kind of ToM agent, the 0-ToM, assumes that its opponent chooses randomly with a fixed probability. It then employs a variational Bayes Kalman filter to estimate that choice probability from the opponent's behavior, and makes the choice that gives it the highest expected value given the opponent's choice probability. In this sense, the 0-ToM does not attribute any intentionality or adaptivity to the opponent, but treats the opponent's choices as randomly generated phenomena. Within the context of this module, this corresponds to assuming that the opponent is a random bias agent.

The more advanced 1-ToM, in turn, assumes that its opponent is a 0-ToM trying to predict the actions of the 1-ToM in the manner described above. The 1-ToM takes its opponent's perspective by simulating a 0-ToM model given the same inputs as the opponent. Based on this simulation of its opponent, and on an estimate of the opponent's model parameters, the 1-ToM finally estimates the opponent's choice probability, and makes its own choice accordingly. An even more sophisticated 2-ToM agent, then, assumes its opponent to be either a 1-ToM or a 0-ToM.




The ToM agents use a variational Bayes Laplace approximation (for more on this, see chapter 5 in [this book](https://bookdown.org/rdpeng/advstatcomp/laplace-approximation.html)) to estimate their opponent's parameter values.


In [3]:
import os
os.chdir('..')
import tomsup as ts

In [4]:
# Passing the strategy names as a list creates a group with one agent per name
agents = ts.create_agents(["1-TOM", "2-TOM"])
agents.agent_names

# Mathematical description: the *k*-ToM model

Following the conceptual introduction, we will now go further into the mathematical underpinnings of the ToM agent. We will start off with an introduction to the decision process and then examine the learning process. The description of the learning process will start by introducing the 0-ToM and then generalize to the *k*-ToM agent.

## The Decision Process
When the learning process is concluded, the probability ${p}^{op}_{t}$ of the opponent choosing 1 has been estimated. As the first step in 0-ToM's decision process, ${p}^{op}_{t}$ is then inserted into the expected payoff function, shown below:

$$
\begin{equation}
\Delta V_t = {p}^{op}_{t} (U(1,1) - U(0, 1)) + (1-{p}^{op}_{t}) (U(1,0) - U(0,0))
\end{equation}\tag{7}
$$

Where $\Delta V_t$ is 0-ToM's expected payoff of choosing 1 relative to 0 on the current trial $t$. The notation $U({c}^{self},{c}^{op})$ denotes the payoff function, which returns the reward $R$ given a payoff matrix and the (hypothetical) choices of 0-ToM herself ${c}^{self}$ and the opponent ${c}^{op}$. This equation essentially finds the payoff difference of choosing 1 relative to 0 for each of the two possible opponent choices, and sums these differences weighted by the probability of the opponent making that choice.
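As a minimal sketch of eq. 7, the expected payoff difference can be computed in a few lines of plain Python. The payoff matrix below (a "matcher" in a matching pennies game) and the function name `expected_payoff_difference` are assumptions made for illustration; they are not taken from the tomsup source.

```python
# Hypothetical payoff matrix for illustration: payoff[c_self][c_op] is the
# reward for own choice c_self when the opponent chooses c_op.
payoff = [[1, -1],
          [-1, 1]]


def expected_payoff_difference(p_op, payoff):
    """Eq. 7: expected payoff of choosing 1 relative to choosing 0."""
    gain_if_op_1 = payoff[1][1] - payoff[0][1]  # U(1,1) - U(0,1)
    gain_if_op_0 = payoff[1][0] - payoff[0][0]  # U(1,0) - U(0,0)
    return p_op * gain_if_op_1 + (1 - p_op) * gain_if_op_0


# The opponent is believed to choose 1 with probability 0.8
delta_v = expected_payoff_difference(0.8, payoff)
print(delta_v)
```

With this payoff matrix, a belief that the opponent mostly chooses 1 gives a positive $\Delta V_t$, while a belief of exactly 0.5 gives a difference of zero.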

To calculate 0-ToM's own probability of choosing 1, the expected payoff difference $\Delta V_t$ is now inserted into the softmax decision rule, as shown below:

$$
\begin{equation}
P({ c }^{ self }_t = 1)= \frac { 1 }{ 1 + \exp(-\frac { \Delta V_t }{ \beta^{0}  } ) }
\end{equation}\tag{8}
$$


Where $P({c}^{self}_t=1)$ is 0-ToM's probability of choosing 1 on the current trial $t$, and $\beta^{0}$ denotes 0-ToM’s behavioural temperature parameter, which randomizes behaviour. An expected payoff difference of 0, i.e. equal values of choosing 1 or 0, results in a random choice, $P({c}^{self}_t=1)=0.5$. Higher values of $\Delta V_t$ result in higher probabilities of choosing 1, in a sigmoidal manner with asymptotes at 0 and 1. Higher $\beta$ values make choice probabilities closer to 0.5, i.e. increase exploration. The softmax choice rule has previously proven efficient in game theory and in modelling choices on tasks such as the Iowa gambling task, as it provides a good balance between exploitation and exploration (Camerer, 2003; Steingroever et al., 2013).
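The softmax rule in eq. 8 can be sketched as follows. The function name is hypothetical, and the sketch assumes that $\beta$ is given as a positive number on its natural scale.

```python
import math


def softmax_choice_probability(delta_v, beta):
    """Eq. 8: probability of choosing 1 given the expected payoff difference.

    Assumes beta > 0 (temperature on its natural scale).
    """
    return 1.0 / (1.0 + math.exp(-delta_v / beta))


# Equal expected payoffs give a coin flip
p_indifferent = softmax_choice_probability(0.0, beta=1.0)

# The same payoff difference under a low and a high temperature
p_low_temp = softmax_choice_probability(1.2, beta=0.5)
p_high_temp = softmax_choice_probability(1.2, beta=5.0)
print(p_indifferent, p_low_temp, p_high_temp)
```

The high-temperature probability lies closer to 0.5 than the low-temperature one, which is the exploration effect described above.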

Now that $P({c}^{self}_t=1)$ has been calculated, 0-ToM's decision process is concluded. All that remains is for a choice to be sampled from this probability, after which the next trial commences.



## The 0-ToM
All ToM agents estimate their opponents’ parameter values $\theta$ in order to learn the choice probability of their opponents $p_t^{op}$, but since 0-ToM assumes its opponent to use a random bias (RB) strategy, estimating the probability parameter and estimating the choice probability $p_t^{op}$ become identical. The estimate of the choice probability parameter is represented as a normal distribution with mean $\mu$ and variance $\Sigma$, each of which is updated on a turn-by-turn basis, based on the opponent’s last choice. This is done using a variational Bayes Laplace approximation for parameters with a linear relation to observed behaviour. 0-ToM’s learning and decision process can be seen in the graphical model (figure 1 below). First the variance $\Sigma$ is updated, then the mean estimate $\mu$. This allows the estimation of the opponent’s choice probability $p_t^{op}$, after which the expected payoff difference $\Delta V$ can be calculated and inserted into the softmax function to decide 0-ToM’s own choice probability $P(c^{self} = 1)$.

<img src="img/gm_tom_0.png" alt="Graphical Models of 0-ToM" style="width: 500px;"/>

In [15]:
# If the image is not rendered above, it can be loaded using this code:
# from IPython.display import Image # for loading in images
# Image("tutorials/img/gm_tom_0.png")

#### Figure 1
A graphical model of a single trial for the 0-ToM model. Shaded (grey) nodes are observed variables, while non-shaded (white) nodes are unobserved. Squares and circles denote discrete and continuous variables, respectively. Double-bordered variables are deterministic. For more on graphical models see Bartlema et al. (2014).

This graphical model is meant to help the reader understand the structure of the agent. It is, however, by no means necessary for understanding the agent, so the more mathematically inclined reader should feel free to move straight to the formulas.


### Variance update
The variance $\Sigma$ is updated using the following equation:

$$
\begin{equation}
{ \Sigma  }_{ t }^{ 0 }\quad \approx \quad \frac { 1 }{ \frac { 1 }{ { \Sigma  }_{ t-1 }^{ 0 }+{ \sigma  }^{ 0 } } \quad +\quad s({ { \mu  } }_{ t-1 }^{ 0 })(1-s({ { \mu  } }_{ t-1 }^{ 0 })) } 
\end{equation} \tag{1}
$$


Where ${ \mu}_{t}^{0}$ denotes 0-ToM’s mean estimate, in log-odds, at trial $t$ of the opponent’s probability parameter $p$, ${ \Sigma}_{t}^{0}$ denotes the subjective uncertainty of the parameter estimate at trial $t$, and $t-1$ indicates the previous turn. $s$ is the sigmoid function, and the expression $s(\mu)$ is 0-ToM’s estimate, in probability space, of the opponent's choice probability $p^{op}_t$, without taking the uncertainty $\Sigma$ into account. Note that a $\mu$ close to chance level (i.e. $s(\mu)$ close to 0.5) results in a lower updated variance $\Sigma$, because the term $s(\mu)(1-s(\mu))$ in the denominator is then at its largest. ${\sigma}^{0}$ denotes 0-ToM’s volatility parameter, which captures her prior assumptions on how much the opponent's parameters vary with time (see Mathys et al., 2011, for a model where $\sigma$ is learned instead of assumed). The volatility parameter $\sigma$ controls the updating of the variance $\Sigma$, where a higher volatility results in a higher updated variance on every trial. Together, the sizes of $\sigma$ and $\mu$ create a dynamic lower bound for the variance $\Sigma$.
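The variance update in eq. 1 can be sketched directly from the formula. This is an illustration only: the function names are hypothetical, and the sketch assumes that $\Sigma$ and $\sigma$ are given on their natural (positive) scale, which may differ from how an actual implementation stores them.

```python
import math


def sigmoid(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def update_variance(prev_variance, prev_mean, volatility):
    """Eq. 1: updated uncertainty about the opponent's choice probability."""
    s = sigmoid(prev_mean)
    return 1.0 / (1.0 / (prev_variance + volatility) + s * (1.0 - s))


# Same prior variance, mean at chance level vs. an extreme mean
var_at_chance = update_variance(1.0, prev_mean=0.0, volatility=0.1)
var_extreme = update_variance(1.0, prev_mean=3.0, volatility=0.1)

# Same mean, higher assumed volatility
var_volatile = update_variance(1.0, prev_mean=0.0, volatility=1.0)
print(var_at_chance, var_extreme, var_volatile)
```

The three calls illustrate the two effects described above: a mean near chance level gives a lower updated variance, and a higher volatility gives a higher one.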

### Mean update
After $\Sigma$ has been updated, it is used for updating the mean estimate $\mu$ of the opponent's choice probability parameter $p$ by insertion into the following equation:

$$
\begin{equation}
{ { \mu  } }_{ t }^{ 0 }\quad \approx \quad {  \mu   }^{0}_{ t-1 }\quad +\quad { \Sigma  }_{ t }^{ 0 }({ c }_{ t-1 }^{ op }\quad -\quad s({ { \mu  } }^{0}_{ t-1 }))\quad 
\end{equation} \tag{5}
$$

Where ${c}^{op}_{t}$ denotes the opponent’s choice at trial $t$. Consequently, the term ${c}^{op}_{t-1}-s({\mu}^{0}_{t-1})$ becomes the prediction error on the last trial, which is added to the previous mean to update it. The prediction error is weighted by the variance $\Sigma$, meaning that very uncertain beliefs are affected more by new data than very certain ones.
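A minimal sketch of the mean update in eq. 5 makes the role of the variance as a learning rate explicit. The function names and the example values are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def update_mean(prev_mean, variance, prev_op_choice):
    """Eq. 5: move the mean (in log-odds) by the variance-weighted prediction error."""
    prediction_error = prev_op_choice - sigmoid(prev_mean)
    return prev_mean + variance * prediction_error


# The opponent chose 1 while the agent expected a coin flip (mean of 0 log-odds)
mean_uncertain = update_mean(0.0, variance=1.0, prev_op_choice=1)
mean_certain = update_mean(0.0, variance=0.2, prev_op_choice=1)
print(mean_uncertain, mean_certain)
```

The same observation moves the uncertain belief five times further than the certain one, because the prediction error is identical and only the variance differs.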

### Probability of opponent choosing 1
After having updated the mean and variance of the estimate of the opponent’s choice probability, 0-ToM uses the following equation to estimate the probability of the opponent choosing 1, $P({c}^{op}=1)$:

$$
\begin{equation}
{  p   }_{ t }^{ op }\approx s\left( \frac { \mu _{ t }^{ 0 } }{ \sqrt { 1+(\Sigma ^{ 0 }_{ t }+\sigma ^{ 0 })3/\pi ^{ 2 } }  }  \right) 
\end{equation} \tag{6a} 
$$
Where ${p}^{op}_{t}$ is the estimated probability of the opponent choosing 1. Note that higher variance $\Sigma$ and volatility $\sigma$ values result in choice probability estimates ${p}^{op}_{t}$ closer to chance level. This means that high subjective uncertainty and assuming more noise in the opponent's behaviour make 0-ToM's estimates of its opponent's choice probability less extreme, and thus 0-ToM’s own choices more random, thereby preventing overfitting.
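The shrinkage towards chance level in eq. 6a can be illustrated with a short sketch. The function name `estimate_p_op_full` is hypothetical, and $\Sigma$ and $\sigma$ are again assumed to be on their natural scale.

```python
import math


def sigmoid(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def estimate_p_op_full(mean, variance, volatility):
    """Eq. 6a: estimated probability of the opponent choosing 1."""
    scale = math.sqrt(1.0 + (variance + volatility) * 3.0 / math.pi ** 2)
    return sigmoid(mean / scale)


# Same mean estimate, with and without uncertainty
p_certain = estimate_p_op_full(2.0, variance=0.0, volatility=0.0)
p_uncertain = estimate_p_op_full(2.0, variance=5.0, volatility=1.0)
print(p_certain, p_uncertain)
```

With no uncertainty the estimate equals $s(\mu)$. With high variance and volatility it is pulled towards 0.5, while staying on the same side of chance level.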

Importantly, while eq. 6a is the theoretically derived equation (Devaine et al., 2017), the implementation used in the VBA package (Daunizeau et al., 2014) uses an approximation to avoid identifiability issues:

$$
\begin{equation}
{  p   }_{ t }^{ op }\approx s\left( \frac { \mu _{ t }^{ 0 } }{ \sqrt { 1+a\cdot \Sigma ^{ 0 }_{ t } }  }  \right) 
\end{equation} \tag{6b} 
$$

Where $a=0.36$ is an approximation constant which absorbs the volatility parameter $\sigma$. This means that the volatility parameter $\sigma$ is effectively held constant in this equation. Note that small uncertainties result in estimates of the opponent's choice probability ${p}^{op}_{t}$ closer to the sigmoid-transformed mean $s(\mu)$, i.e. $\Sigma \rightarrow 0 \Rightarrow p^{op}_t\rightarrow s(\mu)$.

The two equations, eq. 6a and eq. 6b, give similar results when $\sigma$ values are below 1. With eq. 6b, large volatility values do not affect the choice probability estimate directly, but only through the variance update in eq. 1. In initial simulations, using eq. 6a had a tendency to yield extreme parameter estimates because of identifiability issues, especially in the more complex *k*-ToM model.
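To get a feeling for how the two variants relate, the following sketch evaluates eq. 6a and eq. 6b for the same mean and variance under a small and a large volatility. The function names and the chosen values ($\mu = 1$, $\Sigma = 1$) are assumptions for illustration, and the comparison only covers this single point rather than the full parameter space.

```python
import math


def sigmoid(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def p_op_full(mean, variance, volatility):
    """Eq. 6a: theoretically derived estimate."""
    return sigmoid(mean / math.sqrt(1.0 + (variance + volatility) * 3.0 / math.pi ** 2))


def p_op_approx(mean, variance, a=0.36):
    """Eq. 6b: approximation with the volatility absorbed into the constant a."""
    return sigmoid(mean / math.sqrt(1.0 + a * variance))


mean, variance = 1.0, 1.0
approx = p_op_approx(mean, variance)

# Absolute difference between the two variants for a small and a large volatility
gap_small_volatility = abs(p_op_full(mean, variance, volatility=0.1) - approx)
gap_large_volatility = abs(p_op_full(mean, variance, volatility=5.0) - approx)
print(gap_small_volatility, gap_large_volatility)
```

At this point the two variants nearly coincide for the small volatility, and the gap grows when the volatility is large.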

The SiRToM package is able to use both variants, but similarly to the VBA implementation (Daunizeau et al., 2014), it defaults to using eq. 6b, since it gives better performance and more stable results.
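Putting the pieces together, a single 0-ToM trial consists of the learning steps (eq. 1, 5 and 6b) followed by the decision steps (eq. 7 and 8). The sketch below is a plain Python illustration under stated assumptions, not the package's implementation: the function name `zero_tom_trial`, the matching pennies payoff matrix, the natural-scale parameter values, and the fixed opponent sequence (four 1s followed by a 0, repeated, instead of random draws) are all chosen for illustration and reproducibility.

```python
import math


def sigmoid(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def zero_tom_trial(prev_mean, prev_variance, prev_op_choice,
                   volatility, beta, payoff, a=0.36):
    """One 0-ToM trial: learning (eq. 1, 5, 6b), then decision (eq. 7, 8)."""
    s = sigmoid(prev_mean)
    # Eq. 1: variance update
    variance = 1.0 / (1.0 / (prev_variance + volatility) + s * (1.0 - s))
    # Eq. 5: mean update with the variance-weighted prediction error
    mean = prev_mean + variance * (prev_op_choice - s)
    # Eq. 6b: estimated probability of the opponent choosing 1
    p_op = sigmoid(mean / math.sqrt(1.0 + a * variance))
    # Eq. 7: expected payoff of choosing 1 relative to 0
    delta_v = (p_op * (payoff[1][1] - payoff[0][1])
               + (1.0 - p_op) * (payoff[1][0] - payoff[0][0]))
    # Eq. 8: softmax decision rule (a sigmoid of delta_v / beta)
    p_self = sigmoid(delta_v / beta)
    return mean, variance, p_op, p_self


# Hypothetical setup: payoff[c_self][c_op] for a player rewarded for matching
payoff = [[1, -1], [-1, 1]]
# Deterministic stand-in for an opponent that chooses 1 on 80% of trials
op_choices = [1, 1, 1, 1, 0] * 40

mean, variance = 0.0, 1.0  # prior: chance level with unit variance
for c_op in op_choices:
    mean, variance, p_op, p_self = zero_tom_trial(
        mean, variance, c_op, volatility=0.1, beta=1.0, payoff=payoff)

print(p_op, p_self)
```

After observing this opponent, the agent's estimate of the opponent's choice probability sits above chance level, and as a matcher it therefore prefers to choose 1 itself.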


## *k*-ToM

This graphical model is meant to help the reader understand the structure of the agent. It is, however, by no means necessary for understanding the agent, so the more mathematically inclined reader should feel free to move straight to the formulas.

Let us briefly go through the graphical model, starting at the output $c_t^{self}$, which is the choice of the agent at time $t$, and which is stochastically determined.

Let us start with the parameters which are already defined for this agent: the volatility $\sigma$ = -2 and the behavioural temperature $\beta$ = -1.