-
Notifications
You must be signed in to change notification settings - Fork 0
/
todo.qmd
89 lines (63 loc) · 3.09 KB
/
todo.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
---
title: "Project Scratchpad"
author:
- name: Matthew J. C. Crump
orcid: 0000-0002-5612-0090
email: mcrump@brooklyn.cuny.edu
affiliations:
- name: Brooklyn College of CUNY
abstract: "Scratchpad for thinking through the project."
date: 6-27-2023
title-block-style: default
order: 1
---
I've already started on the work in a blog post: <https://crumplab.com/blog/771_GPT_Stroop/>
Need to briefly list what I've done already.
Then, come up with some goals for moving the project forward.
## Blog post summary
- Briefly described the Stroop task
- Develops some motivation for why I care whether or not LLMs can simulate performance in a Stroop-task
- data-spoofing, such as mturk workers using LLMs to exploit online tasks.
- Describe methods
- use the openai API and R to send instructions for a Stroop task, and then individual trials (in text format)
- have the model simulate trial-by-trial responses and reaction times, return them in JSON
- analyse the data and see what happens
- The post shows some draft code to simulate a single subject, and to simulate multiple subjects
- Got data from 10 simulated subjects
- Results showed:
- gpt-3.5-turbo generated data files that showed Stroop effects
- Simulated RTs were different across simulated subjects
- Accuracy was 100%
- RTs looked almost credible. Many individual RTs had 0 ending, and were too round looking.
## Modelling To do
Not an exhaustive list, something to get me started.
- [x] Settle on one script that can be extended across the examples.
- [x] Run multiple subjects, say batches of 20-30, which should be enough for the kinds of questions I want to ask
Answer the following basic questions
- [x] Does the model produce Stroop effects in RT and accuracy?
- [x] Does the model produce different answers for each run of simulated subjects?
- [x] Does the model produce RTs that look like human subject RTs?
- [x] Does the model produce additional Stroop phenomena without further prompting?
- [x] Congruency sequence effect?
- [x] Proportion Congruent effect?
- [x] Can the instruction prompt be used to control how the model simulates performance.
- [x] simulate 75% correct
- [] simulate 50% correct
- [x] simulate some long reaction times
- [x] simulate a reverse Stroop effect.
- [] simulate proportion congruent effects
- [] simulate some other tasks
- [x] Flanker task
- [x] Simon Task
- [] SRT (Nissen & Bullemer)
- [] Negative Priming
- [] Inhibition of Return
- [] SNARC effect
## Writing to do
- [x] interim overview
- [] consider in more detail what I'm trying to accomplish here and whether this project turns into something more formal than a shared repo on github
- [] Things that make me go hmmm.
- the modeling work is not reproducible
- openAI says the new default is that data sent through the API will not be used to train models, but this is an opt-in option
- I did not opt in, but if someone else did and re-ran this kind of code with feedback to alter the model resposnes, then the results from this kind of work would change
- Unclear what happens with newer models like GPT 4