In [None]:
!pip install cpm-toolbox
import cpm
from packaging import version

## cpm checks
print(cpm.__version__)
if version.parse(cpm.__version__) < version.parse("0.22"):
    raise ImportError("cpm version must be >= 0.22. Please install the latest version using: pip install --upgrade cpm-toolbox")

# EXERCISE 1

In this exercise, you will use the `cpm` toolbox to implement a model from its mathematical description. You will:

1. Build a model of a simple bandit task using the toolbox based on the mathematical description
1. Explore the model's behaviour by varying its parameters

## The model description


Let each stimulus have an associated value, which is the expected reward that can be obtained from selecting that stimulus. Let $Q(a)$ be the estimated value of action $a$. We set the starting value of every $Q(a)$ to be nonzero and equal across all stimuli.

In each trial, $t$, there are two stimuli present, so $Q(a)$ could be $Q(\text{left})$ or $Q(\text{right})$, where the corresponding Q values are derived from the associated value of the stimuli present on the left or right.
More formally, we can say that the expected value of the action $a$ selected at time $t$ is given by:

\begin{equation}
Q_t(a) = \mathbb{E}[R_t | A_t = a]
\end{equation}

where $R_t$ is the reward received at time $t$, and $A_t$ is the action selected at time $t$. In each trial $t$, the Softmax choice rule (Bridle, 1990), conceptually related to Luce's choice axiom (Luce, 1959), assigns a probability to each action (left or right) according to the following policy:

\begin{equation}
P(a_t) = \frac{e^{Q_{a,t} \beta}}{\sum_{i = 1}^{k}{e^{Q_{i,t} \beta}}}
\end{equation}

where $\beta$ is the inverse temperature parameter, also referred to as choice stochasticity, and $Q_{a,t}$ is the estimated value of the action $a$ at time $t$. $k$ is the number of actions available, and in our case, $k = 2$. The model uses a variant of the delta rule (Rescorla & Wagner, 1972; Rumelhart, Hinton & Williams, 1986) adapted for multi-armed bandit problems where each option has a single dimension (Sutton & Barto, 2018). This reduces Rescorla-Wagner's summed error term to the following equation, similar to single linear operators (Bush & Mosteller, 1951):


\begin{equation}
\Delta Q_t(A_t) = \alpha \times \Big[ R_t - Q_t(A_t) \Big]
\end{equation}


where $\alpha$ is the learning rate and $R_t$ is the reward received at time $t$, also called a teaching signal and sometimes denoted $\lambda$. $A_t$ is the action chosen on trial $t$. We then update the Q-values as follows:

\begin{equation}
Q_{t+1}(A_t) = Q_t(A_t) + \Delta Q_t(A_t)
\end{equation}
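
The equations above can be traced by hand for a single trial. The following is a minimal sketch in plain Python, not the `cpm` implementation; the function names `softmax` and `delta_update` and all numbers are illustrative assumptions.

```python
import math

def softmax(q_values, beta):
    """Equation 2: turn Q-values into choice probabilities."""
    exps = [math.exp(q * beta) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def delta_update(q_chosen, reward, alpha):
    """Equations 3 and 4: move the chosen Q-value towards the reward."""
    delta_q = alpha * (reward - q_chosen)  # Equation 3
    return q_chosen + delta_q              # Equation 4

# Hypothetical trial: both options start at 0.25,
# the right option is chosen and rewarded.
q_left, q_right = 0.25, 0.25
policy = softmax([q_left, q_right], beta=2.5)  # equal Q-values give [0.5, 0.5]

q_right = delta_update(q_right, reward=1.0, alpha=0.5)  # 0.25 + 0.5 * 0.75 = 0.625
new_policy = softmax([q_left, q_right], beta=2.5)       # now favours the right option
```

Only the chosen option is updated, so the unchosen Q-value stays at its starting value until that option is selected.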

## Explore your data

Your task here will be to implement the model described above using the `cpm` toolbox. Fortunately, most of the code is already here, so you will only need to fill in the blanks.

First, let us look at the data.

In [None]:
import cpm.datasets as datasets

data = datasets.load_bandit_data()
data.head()

The model processes the data one trial at a time, so the first row below shows what the model receives on a single call:

In [None]:
data.iloc[0]
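
To make the role of each column concrete, here is a rough sketch of how one trial row is turned into model inputs. The field names are the ones the model function in this notebook reads; the values are hypothetical, and the coding of `response` (0 for left, 1 for right) is an assumption inferred from how the model indexes the feedback.

```python
from types import SimpleNamespace

# Hypothetical trial with the fields the model reads later in this notebook.
trial = SimpleNamespace(arm_left=2, arm_right=4, reward_left=1, reward_right=0, response=0)

q_values = [0.25, 0.25, 0.25, 0.25]  # one Q-value per stimulus (identifiers 1 to 4)

stimulus = [trial.arm_left, trial.arm_right]        # stimuli on screen: [2, 4]
feedback = [trial.reward_left, trial.reward_right]  # reward attached to each side

# Python counts from 0, so stimulus identifier s is stored at position s - 1.
expected_rewards = [q_values[s - 1] for s in stimulus]

# Assumed coding: response 0 means left, 1 means right.
chosen_stimulus = stimulus[trial.response]  # 2
reward = feedback[trial.response]           # 1

# One-hot vector marking which of the four Q-values should be updated.
what_to_update = [0] * len(q_values)
what_to_update[chosen_stimulus - 1] = 1     # [0, 1, 0, 0]
```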

## EXERCISE 1.1A: Model Parameters

Let's start with the model parameters. Specify each model parameter, its prior, and its initial value. The model parameters are:
- `alpha`: the learning rate
- `temperature`: the inverse temperature parameter
- `Qvalues`: the initial Q-values for each stimulus (not a free parameter, but a model state)

In [None]:
from cpm.generators import Parameters, Value
import numpy
import pandas as pd

parameters = Parameters(
    # free parameters are indicated by specifying priors
    alpha=Value(
    ____________________ 
    ),
    temperature=Value(
    _____________________
    ),
    # everything without a prior is part of the initial state of the
    # model or constructs fixed throughout the simulation
    # (e.g. exemplars in the generalized context model of categorization)
    # initial Q-values start from a non-zero value
    # that is equal across all 4 stimuli (1 / 4)
    Qvalues = _______________________________
    )

In [None]:
parameters.export()
parameters.sample(5)
parameters.bounds()
parameters.update(alpha=0.6, temperature=2.5)
parameters.alpha.export()

## EXERCISE 1.1B: Model Implementation

Fill in the missing steps in the model, which is implemented as a simple function. You will need to add the following components:

1. Learning rule. The learning rule calculates the change in Q-values from the reward received and the current Q-value of the selected action. You will use a simple delta rule, which is already implemented in [`cpm.models.learning.SeparableRule`](https://devcompsy.github.io/cpm/references/models/#cpm.models.learning.SeparableRule).
2. Choice rule. The choice rule assigns a probability to each action based on the Q-values and the inverse temperature parameter. You will use a softmax choice rule, which is already implemented in [`cpm.models.decision.Softmax`](https://devcompsy.github.io/cpm/references/models/#cpm.models.decision.Softmax).

In [None]:
import cpm
import ipyparallel as ipp  ## for parallel computing with ipython (specific for Jupyter Notebook)

@ipp.require("numpy")
def model(parameters, trial):
    # pull out the parameters
    alpha = parameters.alpha
    temperature = parameters.temperature
    values = numpy.array(parameters.Qvalues)
    
    # pull out the trial information
    stimulus = numpy.array([trial.arm_left, trial.arm_right]).astype(int)
    feedback = numpy.array([trial.reward_left, trial.reward_right])
    human_choice = trial.response.astype(int)

    # Equation 1. - get the value of each available action
    # Note that because python counts from 0, we need to shift
    # the stimulus identifiers by -1
    expected_rewards = values[stimulus - 1]
    # convert columns to rows
    expected_rewards = expected_rewards.reshape(2, 1)
    # calculate a policy based on the activations
    # Equation 2.
    ## you will need the expected rewards and the temperature
    ## see the function documentation linked above
    _____________________________
    # after that, you need to compute the policy with the .compute method
    ____________________________
    # if the policy is NaN for an action, then we need to set it to 1
    # this corrects some numerical issues with python and infinities
    if numpy.isnan(choice_rule.policies).any():
        choice_rule.policies[numpy.isnan(choice_rule.policies)] = 1
    # get the received reward for the choice
    reward = feedback[human_choice]
    reward = numpy.array([reward])
    # we now create a vector that tells our learning rule what...
    # ... stimulus to update according to the participant's choice
    what_to_update = numpy.zeros(4)
    chosen_stimulus = stimulus[human_choice] - 1
    what_to_update[chosen_stimulus] = 1

    # Equation 3.
    # update the values based on the received reward
    ## you will need the:
    # learning rate (alpha)
    # values: Q-values
    # reward: received reward for the choice
    # what_to_update: telling the function what q-value you are updating 
    ____________________________
    # Equation 4.
    values += update.weights.flatten()
    # compile output
    output = {
        "trial"    : trial.trial.astype(int), # trial numbers
        "activation" : expected_rewards.flatten(), # expected reward of arms
        "policy"   : _________________,       # policies
        "reward"   : reward,                  # received reward
        "error"    : update.weights,          # change in value (alpha * prediction error)
        "Qvalues"   : values,                  # updated values
        # dependent variable
        "dependent"  : numpy.array([choice_rule.policies[1]]),
    }
    return output
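
The NaN check in the model guards against exponentials that exceed the floating-point range: when $Q \beta$ is very large, NumPy's `exp` returns infinity, and infinity divided by infinity is NaN. A common alternative, which is not necessarily what `cpm` does internally, is to subtract the largest scaled value before exponentiating. Below is a minimal sketch in plain Python; the function name `stable_softmax` is hypothetical.

```python
import math

def stable_softmax(q_values, beta):
    """Softmax that subtracts the maximum before exponentiating.

    Subtracting the same constant from every exponent leaves the ratios
    unchanged, but keeps every exponent at or below zero, so math.exp
    cannot overflow.
    """
    scaled = [q * beta for q in q_values]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# A naive math.exp(1000.0) raises OverflowError; the shifted version is fine.
probabilities = stable_softmax([1.0, 0.0], beta=1000.0)
```

With this approach the largest option always contributes `exp(0) = 1` to the denominator, so the division is never `inf / inf`.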

One important thing to note is that the model must output a variable called `dependent`, which is the prediction we wish to compare to observations. Once you have filled in the blanks, you can run the model and see how it performs on the data:

In [None]:
model(parameters, data.iloc[0])

## EXERCISE 1.2: run your model with different parameters and explore its behaviour

Here, we will see how the model behaves with different parameters. You can change the values of `alpha` and `temperature` to see how they affect the model's predictions. First we will input the model, parameters, and data into the model wrapper. The wrapper will take care of running the model and exporting the results. We will use the `cpm.generators.Wrapper` class to do this. If you need more information, read the documentation [here](https://devcompsy.github.io/cpm/references/generators/#cpm.generators.Wrapper).

In [None]:
from cpm.generators import Simulator, Wrapper

wrapper = Wrapper(model=model, parameters=parameters, data=data[data.ppt == 1])
wrapper.run()
wrapper.export()

In [None]:
wrapper.reset(parameters={__________________}, data=________________)
wrapper.run()
wrapper.export()
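
Before comparing runs, it can help to see in isolation what the inverse temperature does. The sketch below uses plain Python rather than `cpm`; the Q-values and the grid of $\beta$ values are hypothetical. Note that the parameter named `temperature` in this notebook plays the role of $\beta$, so larger values make choices more deterministic.

```python
import math

def softmax(q_values, beta):
    """Softmax policy over Q-values with inverse temperature beta."""
    exps = [math.exp(q * beta) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

q_values = [0.75, 0.25]  # hypothetical Q-values for the left and right option

p_left_by_beta = {}
for beta in [0.0, 1.0, 5.0, 20.0]:
    p_left, p_right = softmax(q_values, beta)
    p_left_by_beta[beta] = p_left
    print(f"beta={beta:>4}: P(left)={p_left:.3f}, P(right)={p_right:.3f}")
```

At $\beta = 0$ both options are equally likely regardless of their values; as $\beta$ grows, the probability of the higher-valued option approaches 1.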

In the following code, you will need to change the initial Q-values for the stimuli. Do you notice any differences in the model's predictions?

In [None]:
wrapper.reset(parameters={__________________}, data=________________)
wrapper.run()
wrapper.export()

## EXERCISE 1.3: plot the model output as a function of change in the learning rate

Pick a range of learning rates and plot the model output as a function of the change in the learning rate. Try to find learning rates that lead to different behaviours of the model.
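
As a rough guide to what "different behaviours" can look like, the sketch below applies the delta rule to a single option under three learning rates. It is plain Python, independent of `cpm`; the function name `q_trajectory`, the reward sequence, and the learning rates are hypothetical.

```python
def q_trajectory(alpha, rewards, q_start=0.25):
    """Apply the delta rule to one option across a sequence of rewards."""
    q = q_start
    trajectory = [q]
    for reward in rewards:
        q = q + alpha * (reward - q)
        trajectory.append(q)
    return trajectory

rewards = [1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical reward sequence for one option

trajectories = {alpha: q_trajectory(alpha, rewards) for alpha in (0.05, 0.5, 1.0)}
for alpha, trajectory in trajectories.items():
    print(alpha, [round(q, 3) for q in trajectory])
```

A small learning rate gives slow, smooth changes, whereas a learning rate of 1 makes the Q-value jump to the most recent reward on every trial.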

In [None]:
alpha_not_so_random = numpy.array([__, __, __])

big_results = pd.DataFrame()

for i in numpy.arange(len(alpha_not_so_random)):
    print(f"Running simulation {i + 1} with alpha={alpha_not_so_random[i]}")
    wrapper.reset(parameters={"alpha": alpha_not_so_random[i], "Qvalues": numpy.ones(4)/4}, data=data[data.ppt == 1])
    wrapper.run()
    output = wrapper.export()
    output["alpha"] = alpha_not_so_random[i]
    big_results = pd.concat([big_results, output], ignore_index=True)

big_results.head()

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 1, figsize=(12, 10), sharex=True)

value_cols = ['Qvalues_0', 'Qvalues_1', 'Qvalues_2', 'Qvalues_3']
colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red']

for idx, alpha in enumerate(alpha_not_so_random):
    ax = axes[idx]
    subset = big_results[big_results['alpha'] == alpha]
    for vcol, color in zip(value_cols, colors):
        ax.plot(subset['trial_0'], subset[vcol], label=vcol, color=color)
    ax.set_ylabel('Q-value')
    ax.set_title(f'alpha={alpha}')
    ax.legend(['stimulus 1', 'stimulus 2', 'stimulus 3', 'stimulus 4'], loc='upper left')
axes[-1].set_xlabel('Trial')
fig.suptitle('Evolution of Q-values for Different Learning Rates\n', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

# Questions

* What do you notice here? What do you think about the model's behaviour?
* How do the parameters affect the model's predictions?
* How do the initial Q-values affect the model's predictions?
* Anything that surprises you?

## (NOT) EXERCISE 1.4: Simulating different participants with the same and different parameters

There are built-in tools in the `cpm` toolbox to allow you to explore model behaviour in a variety of ways. The process of trying to understand how the model explains the data often involves exploring its parameter space, simulating different trial orders, and so on. The tool we are using here is called `cpm.generators.Simulator`, which allows you to simulate different participants with the same or different parameters. You can read more about it in the documentation [here](https://devcompsy.github.io/cpm/references/generators/#cpm.generators.Simulator).

In [None]:
subset = data[data.ppt.isin([1, 3, 9, 4, 10])].copy()
numpy.random.seed(42)
multiple = parameters.sample(5) ## draw 5 random parameter sets, one per participant

simulate =  cpm.generators.Simulator(
    wrapper=wrapper,
    parameters=multiple,
    data=subset.groupby('ppt'),
)
simulate.run()
simulations_multiple_ppt = simulate.export()
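
For intuition about simulating several parameter sets, here is a plain-Python sketch that does not use `cpm.generators.Simulator`. Unlike the model above, which learns from the recorded human choices, each hypothetical agent here generates its own choices from the softmax policy. The function name `simulate_agent`, the reward probabilities, and the parameter values are all assumptions made for illustration.

```python
import math
import random

def simulate_agent(alpha, beta, reward_probabilities, n_trials, rng):
    """Generate choices from a softmax policy and learn with the delta rule."""
    q_values = [0.5, 0.5]
    choices = []
    for _ in range(n_trials):
        # two-option softmax written as a logistic function of the value difference
        p_second = 1.0 / (1.0 + math.exp(-beta * (q_values[1] - q_values[0])))
        choice = 1 if rng.random() < p_second else 0
        # hypothetical Bernoulli reward for the chosen option
        reward = 1.0 if rng.random() < reward_probabilities[choice] else 0.0
        q_values[choice] += alpha * (reward - q_values[choice])
        choices.append(choice)
    return choices, q_values

# Two hypothetical agents facing the same two-armed bandit.
parameter_sets = [{"alpha": 0.1, "beta": 1.0}, {"alpha": 0.5, "beta": 5.0}]
rng = random.Random(42)
results = [
    simulate_agent(p["alpha"], p["beta"], (0.2, 0.8), 100, rng)
    for p in parameter_sets
]
```

Passing a seeded `random.Random` instance keeps the sketch reproducible without touching the global random state.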


In [None]:
import matplotlib.pyplot as plt

participants = simulations_multiple_ppt['ppt'].unique()
n_participants = len(participants)

fig, axes = plt.subplots(n_participants, 1, figsize=(12, 3 * n_participants), sharex=True)

value_cols = ['Qvalues_0', 'Qvalues_1', 'Qvalues_2', 'Qvalues_3']
colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red']

for idx, ppt in enumerate(participants):
    ax = axes[idx] if n_participants > 1 else axes
    subset = simulations_multiple_ppt[simulations_multiple_ppt['ppt'] == ppt]
    for vcol, color in zip(value_cols, colors):
        ax.plot(subset['trial_0'], subset[vcol], label=vcol, color=color)
    ax.set_ylabel('Q-value')
    ax.set_title(f'Participant {ppt} with alpha={numpy.round(multiple[idx].get("alpha"), 3)}')

fig.legend(['stimulus 1', 'stimulus 2', 'stimulus 3', 'stimulus 4'], loc='upper right')
axes[-1].set_xlabel('Trial')
fig.suptitle('Evolution of Q-values for Different Participants\n', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

# References

Bridle JS. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In: Neurocomputing: Algorithms, architectures and applications. Springer; 1990. p. 227–236.

Bush RR, Mosteller F. A mathematical model for simple learning. Psychological Review. 1951;58(5):313.

Luce RD. Individual choice behavior: A theoretical analysis. Wiley; 1959.

Rescorla RA, Wagner AR. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In: Black AH, Prokasy WF, editors. Classical Conditioning II: Current Research and Theory. Appleton-Century-Crofts; 1972. p. 64–99.

Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–536.

Sutton RS, Barto AG. Reinforcement learning: An introduction. 2nd ed. The MIT Press; 2018.

