# Reinforcement Learning

Here, we implement a simple two-armed bandit task. We then run the same task on a language model trained specifically on tasks like this one ([Centaur](https://marcelbinz.github.io/centaur/)) and compare the results.

## Two-Armed Bandit Task

### Imports

In [1]:
from sweetbean import Block, Experiment
from sweetbean.stimulus import Bandit, Text
from sweetbean.variable import (
    DataVariable,
    FunctionVariable,
    SharedVariable,
    SideEffect,
    TimelineVariable,
)

### Timeline

Here, we gradually change the value of `bandit_1` from 10 to 0 and the value of `bandit_2` in the opposite direction, from 0 to 10.


In [2]:
timeline = []
for i in range(11):
    timeline.append(
        {
            "bandit_1": {"color": "orange", "value": 10 - i},
            "bandit_2": {"color": "blue", "value": i},
        }
    )
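
To make the reward schedule concrete, the following sketch rebuilds the same timeline in plain Python (no SweetBean needed) and prints the two value sequences. The names `values_1`, `values_2`, and `crossover` are introduced here for illustration only.

```python
# Rebuild the timeline exactly as in the cell above, using plain Python only.
timeline = [
    {
        "bandit_1": {"color": "orange", "value": 10 - i},
        "bandit_2": {"color": "blue", "value": i},
    }
    for i in range(11)
]

values_1 = [trial["bandit_1"]["value"] for trial in timeline]
values_2 = [trial["bandit_2"]["value"] for trial in timeline]

print(values_1)  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(values_2)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Index of the first trial on which the blue bandit pays strictly more
# than the orange one (trial indices start at 0).
crossover = next(
    i for i, (a, b) in enumerate(zip(values_1, values_2)) if b > a
)
print(crossover)  # 6
```

The two values always sum to 10, they are equal on trial index 5, and from trial index 6 onward the blue bandit is the better choice.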

### Implementation

We also track the score in a shared variable so that we can display it between bandit trials.

In [3]:
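# Each trial reads its two bandit definitions from the timeline.
bandit_1 = TimelineVariable("bandit_1")
bandit_2 = TimelineVariable("bandit_2")

# `score` is shared across trials and starts at 0; `value` is read from the trial data.
score = SharedVariable("score", 0)
value = DataVariable("value", 0)

# New score = old score + value obtained on the trial.
update_score = FunctionVariable(
    "update_score", lambda sc, val: sc + val, [score, value]
)

# Write the updated score back into the shared variable.
update_score_side_effect = SideEffect(score, update_score)

bandit_task = Bandit(
    bandits=[bandit_1, bandit_2],
    side_effects=[update_score_side_effect],
)

# Text stimulus that displays the current score for 2000 ms.
score_text = FunctionVariable("score_text", lambda sc: f"Score: {sc}", [score])

show_score = Text(duration=2000, text=score_text)

# Run the bandit trial and the score display once per timeline entry.
trial_sequence = Block([bandit_task, show_score], timeline=timeline)
experiment = Experiment([trial_sequence])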
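
The score bookkeeping above can be illustrated with a plain-Python simulation. This is only a sketch of the arithmetic (`sc + val` after each bandit trial), not a description of how SweetBean evaluates variables internally; the function `running_scores` and the encoding of choices as `0` (orange) or `1` (blue) are assumptions made for this example.

```python
def running_scores(choices):
    """Return the score text shown after each trial.

    `choices` holds one entry per trial: 0 for the orange bandit,
    1 for the blue bandit (an encoding assumed for this sketch).
    """
    score = 0  # plays the role of SharedVariable("score", 0)
    shown = []
    for i, choice in enumerate(choices):
        values = [10 - i, i]       # same schedule as the timeline
        value = values[choice]     # reward of the chosen bandit
        score = score + value      # same rule as the update_score lambda
        shown.append(f"Score: {score}")
    return shown


print(running_scores([0, 0, 1]))
# ['Score: 10', 'Score: 19', 'Score: 21']

# Always picking the better bandit over all 11 trials gives the maximum score.
best_choices = [0 if 10 - i >= i else 1 for i in range(11)]
print(running_scores(best_choices)[-1])  # Score: 85
```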

Export the experiment to an HTML file and run it in the browser.

In [4]:
experiment.to_html("bandit.html", path_local_download="bandit.json")

### Results
After running `bandit.html`, a file called `bandit.json` should appear in your download directory. You can open the file in your browser to inspect the raw results. First, we process it so that it contains only the relevant data:

In [4]:
import json
from sweetbean.data import process_js, get_n_responses, until_response

with open("bandit.json") as f:
    data_raw = json.load(f)
    
data = process_js(data_raw)

We can now count how many responses were made and extract the data up to just before the third response:

In [5]:
n_responses = get_n_responses(data)
data_third_response = until_response(data, 3)
data_third_response

[{'rt': 1154,
  'stimulus': ['<div class="slotmachine" style="position: absolute; top:10vh; left:10vw; width: 35vw; height: 35vh; border-color: orange"></div>',
   '<div class="slotmachine" style="position: absolute; top:10vh; left:55vw; width: 35vw; height: 35vh; border-color: blue"></div>'],
  'response': 0,
  'trial_duration': None,
  'duration': None,
  'html_array': ['<div class="slotmachine" style="position: absolute; top:10vh; left:10vw; width: 35vw; height: 35vh; border-color: orange"></div>',
   '<div class="slotmachine" style="position: absolute; top:10vh; left:55vw; width: 35vw; height: 35vh; border-color: blue"></div>'],
  'values': [10, 0],
  'time_after_response': 2000,
  'type': 'jsPsychHtmlChoice',
  'bandits': [{'color': 'orange', 'value': 10}, {'color': 'blue', 'value': 0}],
  'value': 10,
  'score': 10},
 {'rt': None,
  'stimulus': "<div style='color:white'>Score: 10</div>",
  'response': None,
  'trial_duration': 2000,
  'duration': 2000,
  'choices': [],
  'correct
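
The processed data is a list of dictionaries, one per stimulus, so it can be summarized with ordinary Python. The sketch below uses two abbreviated records containing only fields visible in the output above; the helper `choice_summary` is hypothetical and not part of SweetBean.

```python
# Two abbreviated records, restricted to fields visible in the output above.
records = [
    {"type": "jsPsychHtmlChoice", "rt": 1154, "response": 0,
     "values": [10, 0], "value": 10, "score": 10},
    {"rt": None, "response": None, "duration": 2000,
     "stimulus": "<div style='color:white'>Score: 10</div>"},
]


def choice_summary(records):
    """Keep only records with a response and reduce them to a few fields."""
    return [
        {
            "response": r["response"],
            "value": r.get("value"),
            "score": r.get("score"),
            "rt": r.get("rt"),
        }
        for r in records
        if r.get("response") is not None
    ]


summary = choice_summary(records)
print(summary)
# [{'response': 0, 'value': 10, 'score': 10, 'rt': 1154}]
```

Filtering on `response` drops the score screens, which record no response.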

# Experiment on a language model

With the partial data, we can replay the experiment up to that point and then run the remainder on language input. To test this, we pass Python's built-in `input` and answer the prompts manually:

In [6]:
data_new, _ = experiment.run_on_language(input, data=data_third_response)

hi
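
Typing answers by hand is fine for a quick test, but a scripted stand-in for `input` makes runs repeatable. The sketch below assumes that `run_on_language` simply calls its first argument with a prompt string and uses the returned string, as it would with the built-in `input`; that assumption, the helper `make_scripted_input`, and the placeholder answers are not taken from the SweetBean documentation.

```python
def make_scripted_input(answers):
    """Build a stand-in for input() that replays `answers` in order."""
    remaining = iter(answers)
    prompts_seen = []  # every prompt the callable received, for later inspection

    def scripted_input(prompt=""):
        prompts_seen.append(prompt)
        try:
            return next(remaining)
        except StopIteration:
            # Fail loudly instead of blocking or returning an empty answer.
            raise RuntimeError("scripted answers exhausted") from None

    return scripted_input, prompts_seen


# Placeholder answers; the strings the experiment actually expects are not shown here.
respond, prompts_seen = make_scripted_input(["first answer", "second answer"])
print(respond("prompt A"))  # first answer
print(respond("prompt B"))  # second answer
print(prompts_seen)         # ['prompt A', 'prompt B']
```

Under the stated assumption, passing `respond` in place of `input` to `experiment.run_on_language` would replay the scripted answers, and `prompts_seen` would show what the model would be asked.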


In [7]:
print(data_new)

([{'rt': 1154, 'stimulus': ['<div class="slotmachine" style="position: absolute; top:10vh; left:10vw; width: 35vw; height: 35vh; border-color: orange"></div>', '<div class="slotmachine" style="position: absolute; top:10vh; left:55vw; width: 35vw; height: 35vh; border-color: blue"></div>'], 'response': 0, 'trial_duration': None, 'duration': None, 'html_array': ['<div class="slotmachine" style="position: absolute; top:10vh; left:10vw; width: 35vw; height: 35vh; border-color: orange"></div>', '<div class="slotmachine" style="position: absolute; top:10vh; left:55vw; width: 35vw; height: 35vh; border-color: blue"></div>'], 'values': [10, 0], 'time_after_response': 2000, 'type': 'jsPsychHtmlChoice', 'bandits': [{'color': 'orange', 'value': 10}, {'color': 'blue', 'value': 0}], 'value': 10, 'score': 10}, {'rt': None, 'stimulus': "<div style='color:white'>Score: 10</div>", 'response': None, 'trial_duration': 2000, 'duration': 2000, 'html_array': ['<div class="slotmachine" style="position: absol