## Preprocessing for Stan Modeling
**Note:** This is a tutorial notebook that walks through the first half of the `mcmcjob.py` script. Although this notebook can be used to prepare raw data for Stan modeling on a personal machine, it cannot be used on the HBS cluster, because Jupyter notebooks are not supported on the cluster at this time. Therefore, please use the `mcmcjob.py` script when submitting jobs on the cluster. This notebook is intended as a reference.

### Step 1: Import Libraries & Data
The `mcmcjob.py` script requires the following libraries:
* pickle
* os
* sys
* numpy
* pandas
* random

Most Python users will be familiar with these libraries, with the possible exception of pickle. We will use pickle to serialize our model code, fit, and data into a byte stream for saving. Let's import the libraries:

In [1]:
import pickle
import os
import sys
import numpy as np
import pandas as pd
import random
import warnings
# Optional: silences warnings (such as pandas' SettingWithCopyWarning) in the output.
# Feel free to comment out the next line if you'd like to see warnings.
warnings.filterwarnings("ignore")
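
For readers new to pickle, here is a minimal round trip. It is only a sketch: the `payload` dictionary is a made-up stand-in for the data dictionary we build at the end of this notebook, not real data.

```python
import pickle

# Toy stand-in for the data dictionary built later in this notebook.
payload = {'N': 2, 'Tsub': [3, 2], 'rtbound': 0.1}

# dumps() serializes the object to bytes; loads() rebuilds an equal object.
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

print(type(blob).__name__)   # bytes
print(restored == payload)   # True
```

`pickle.dump` and `pickle.load` do the same thing with an open file instead of an in-memory bytes object, which is what we will use when saving.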

Next we'll load the data. The data should be saved in the same working directory as `mcmcjob.py`.

In [2]:
raw = pd.read_csv("final_dataset_for_ddm_dec_21.csv")

We will build a new dataframe that contains only the variables that we need for the Stan model:

In [3]:
df = pd.DataFrame({'ID': raw['Random.ID'], 'choice': raw['emo_binary'],
                   'rt': raw['rt'], 'valence': raw['valence'],
                   'identity': raw['faces'], 'intensity': raw['valence_values'],
                   'ratio': raw['b_person_ratio']})

### Step 2: Preprocessing - Converting Strings to Integers
The Stan language can be peculiar. We need to modify our raw data so that Stan can interpret it. First we will drop any trials with NAs, NaNs, or missing data. Then we will convert the data in our **choice** and **valence** columns from strings to integers. Unlike Python, Stan has no string data type, so a model cannot contain comparisons such as `if value == 'string':`. In order to write conditional statements into our model, we have to convert these strings to integers.

In [4]:
df = df.dropna(axis=0).reset_index(drop=True)
df['choice'] = [1 if x == 'Not Emotional' else 2 for x in df['choice']]
df['valence'] = [1 if x == 'Happy' else 2 for x in df['valence']]

Here, we specified that all 'Not Emotional' cells in the **choice** column will instead read $1$. Otherwise, cells will read $2$, as is the case with 'Emotional' judgments. We similarly convert strings to integers in the **valence** column.
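
One caveat of the `else 2` pattern is that any unexpected label (a typo, different capitalization) is silently coded as $2$. The sketch below shows a stricter alternative; `encode_labels` is a hypothetical helper, not part of `mcmcjob.py`, and the toy labels are made up for illustration.

```python
# Hypothetical helper: map string labels to integer codes and fail loudly
# on anything unexpected instead of silently coding it as 2.
def encode_labels(labels, codes):
    unknown = sorted(set(labels) - set(codes))
    if unknown:
        raise ValueError('unexpected labels: %s' % unknown)
    return [codes[x] for x in labels]

choice_codes = {'Not Emotional': 1, 'Emotional': 2}
toy_choices = ['Emotional', 'Not Emotional', 'Emotional']

encoded = encode_labels(toy_choices, choice_codes)
print(encoded)  # [2, 1, 2]
```

If your labels are clean, the list comprehension above gives the same result; the stricter version is only worth it when the raw file may contain surprises.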

Next we need to convert the strings in the **identity** and **intensity** columns to lists. We will do this by splitting the strings along the separator ', ':

In [5]:
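df['identity'] = [x.split(', ') for x in df['identity']]
# split() returns strings, so cast each intensity to int
# (this assumes the intensities are whole numbers).
df['intensity'] = [[int(v) for v in x.split(', ')] for x in df['intensity']]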

The **intensity** lists now hold integer-valued elements, which Stan can work with once they are stored in integer arrays, but **identity** still has strings as elements, so we need to convert those strings to integers. While we are doing this, we will also pad every list to length 12. Padding is not something Stan requires; it is needed because the Stan model as I wrote it does not account for differences in array size. If we fed in trials with differing array sizes without accounting for them, the model would give disproportionate weight to the stimuli in trials with smaller arrays. Our goal with this modeling analysis is to estimate a unit of transformation from intensity to evidence, and we do not want to bias that estimate.

In [6]:
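identitydict = {'E': 1, 'F': 1, 'B': 2, 'C': 2, 'NA': 0}
nfaces = 12  # every trial is padded to this many faces

def pad(values, fill):
    # Right-pad a list with `fill` until it has nfaces elements.
    return list(values) + [fill] * (nfaces - len(values))

# Build whole new columns rather than assigning cell by cell, which avoids
# pandas' chained-assignment warnings.
df['intensity'] = [pad(x, 0) for x in df['intensity']]
df['identity'] = [[identitydict[e] for e in pad(x, 'NA')] for x in df['identity']]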

Notice that we collapsed all identities to be either Black (1) or White (2). We could always make a more complicated model that delineates the faces within these groups, but we chose to collapse for simplicity. Also note that in our efforts to make all lists length $12$, we appended $0$s to the intensity lists and NAs (0) to the identity lists.
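
To make the split-convert-pad sequence concrete, here is a minimal sketch on a single made-up trial with three faces. The `pad_to` helper and the raw strings are illustrative assumptions; only the identity codes come from the cell above.

```python
def pad_to(values, length, fill=0):
    """Return a copy of `values` right-padded with `fill` to `length`."""
    if len(values) > length:
        raise ValueError('list longer than %d: %r' % (length, values))
    return list(values) + [fill] * (length - len(values))

identity_codes = {'E': 1, 'F': 1, 'B': 2, 'C': 2}

# Made-up raw cells for one trial with three faces.
raw_identity = 'E, B, C'
raw_intensity = '10, 25, 40'

identity_row = pad_to([identity_codes[e] for e in raw_identity.split(', ')], 12)
intensity_row = pad_to([int(v) for v in raw_intensity.split(', ')], 12)

print(identity_row)   # [1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(intensity_row)  # [10, 25, 40, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Because padded positions get identity $0$ and intensity $0$, the two rows stay aligned position by position.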

### Step 3: Preprocessing - Additional Modification

Although our extended DDM will use **intensity** and **identity** to calculate drift rate, most of our simpler models will only use **valence** and **ratio**. Next, we will create a simple index variable for keeping track of the various **valence** and **ratio** conditions for some of these models. You might ask, why do we need these variables if we're simply going to index over them? We don't actually need them, but since an index is less interpretable to the uninformed, it is nice to include the variables anyway so that they can be referenced in our model object. 

In [7]:
# Vectorized; round() and astype(int) make the index an exact integer
# rather than a float such as 5.0.
df['indexer'] = (4 * df['ratio'] + 3 * (df['valence'] - 1)).round().astype(int)

What this **indexer** does is convert our **ratio** and **valence** values into $6$ discrete indexing integers. See for yourself by plugging in values (e.g., where ratio is $0.50$ and valence is $2$, indexer is $4(0.50) + 3(2-1) = 5$). Each of these six indices represents one of our six conditions (e.g., Happy-25%Black, Angry-75%Black, etc.). Note that **indexer** starts at $1$ because Stan uses 1-based indexing (unlike Python's 0-based indexing).
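
The full mapping can be enumerated in a few lines. This sketch assumes the three ratio levels are $0.25$, $0.50$, and $0.75$, which is inferred from the six conditions described above rather than read from the data.

```python
# Enumerate every (valence, ratio) pair and compute its index with the
# same formula used for df['indexer'].
conditions = {}
for valence in (1, 2):                 # 1 = Happy, 2 = the other valence
    for ratio in (0.25, 0.50, 0.75):   # assumed ratio levels
        conditions[(valence, ratio)] = int(round(4 * ratio + 3 * (valence - 1)))

for key, index in conditions.items():
    print(key, '->', index)
```

Valence $1$ occupies indices $1$ to $3$ and valence $2$ occupies indices $4$ to $6$, so no two conditions share an index.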

Now it is time for some data cleaning. This is primarily removing subjects that lack variability in their choices. I'm being particularly liberal with inclusion, removing only those subjects that make the same decision on *every* trial. You could rewrite this loop to apply a stricter criterion (e.g., excluding subjects with a 9:1 ratio of emotional to non-emotional decisions). We will print the IDs that need to be excluded due to low variability for our records. If the printed list is empty, congratulations, you have no subjects that need excluding.

In [8]:
dellist = []
for x in df['ID'].unique():
    if len(df[df['ID']==x]['choice'].unique()) < 2:
        dellist.append(x)
print('subjects with no variation: %s' % dellist)
df = df[~df['ID'].isin(dellist)]

subjects with no variation: []
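
A stricter criterion, such as the 9:1 split mentioned above, could be sketched as follows. The toy dataframe and the `threshold` value are assumptions for illustration; this is not what `mcmcjob.py` does.

```python
import pandas as pd

# Toy data: subject 'a' chooses 2 on 9 of 10 trials, subject 'b' is balanced.
toy = pd.DataFrame({
    'ID': ['a'] * 10 + ['b'] * 10,
    'choice': [2] * 9 + [1] + [1, 2] * 5,
})

# Share of each subject's trials taken up by their most frequent choice.
top_share = toy.groupby('ID')['choice'].agg(
    lambda c: c.value_counts(normalize=True).max())

threshold = 0.9  # hypothetical cutoff: exclude at 9:1 or more lopsided
lopsided = top_share[top_share >= threshold].index.tolist()
print(lopsided)  # ['a']

toy = toy[~toy['ID'].isin(lopsided)]
```

Setting `threshold = 1.0` reproduces the liberal rule used in this notebook, since only subjects with a single unique choice reach a share of 1.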


Finally, we need to make some small modifications to our response time variable **rt**. First, we will divide each value in the **rt** column by $1000$ so that we're working in seconds rather than milliseconds. Stan itself does not require this; it is necessary because of the way I wrote the Stan models we will use. For example, I constrain $\tau$, the non-decision time parameter, to be greater than 0.1, or 100 ms. If we fed milliseconds into this model, that lower bound would be interpreted as 0.1 ms (100 µs).

In [9]:
df['rt'] = df['rt'] / 1000  # milliseconds -> seconds

Second, we will drop all trials where **rt** is 0.1 seconds or less. As previously mentioned, we are constraining non-decision time to be greater than 100 ms; anything faster than this is likely to be a false start and inappropriate to model with a DDM. We will also store this bound as `rtbound` so that we can include it in the data we send to Stan.

In [10]:
rtbound = 0.1  # lower bound on response time, in seconds
df = df[df['rt'] > rtbound]
df = df.reset_index(drop=True)

### Step 4: Setting up the Data Pickle
We are now ready to convert everything - that we just so painstakingly converted to a dataframe - to arrays and vectors. Why didn't we just start with arrays and vectors?! Well, I *like* dataframes... I find them easier to work with. Unfortunately, Stan is peculiar, and will only take integers, real numbers, arrays, vectors, and matrices as data. We did all of the preprocessing in a dataframe because that's what I'm most comfortable with, but now it is time to convert that dataframe into something Stan will accept.

First, we need a few more bits of information. Namely, the number of subjects, the max number of trials, and the number of trials each subject has.

In [11]:
grouped = df.groupby(['ID'], sort=False)
trials_per = grouped.size()
subs = list(trials_per.index)
nsubs = len(subs)
tsubs = list(trials_per)
tmax = max(tsubs)

Next, we will create a set of arrays with shape `nsubs` × `tmax`, that is, the number of subjects by the max number of trials (the exception is `rtmin`, which holds one value per subject). These arrays will all be filled with $-1$ values. Why $-1$? It is a placeholder that marks the absence of data; we will overwrite these values with our data in the next cell. Note that some of the arrays are 3-dimensional. Those are for data that we need to import as vectors, where there are multiple values present in each trial (i.e., intensities and identities). You'll also see that some arrays are composed of integers and others of floats. These correspond to integers and reals in Stan, and depend on whether floating point numbers are necessary to represent your data.

In [12]:
choice = np.full((nsubs, tmax), -1, dtype=int)
rt = np.full((nsubs, tmax), -1, dtype=float)
valence = np.full((nsubs, tmax), -1, dtype=int)
intensity = np.full((nsubs, tmax, 12), -1, dtype=int)
identity = np.full((nsubs, tmax, 12), -1, dtype=int)
ratio = np.full((nsubs, tmax), -1, dtype=float)
indexer = np.full((nsubs, tmax), -1, dtype=int)
rtmin = np.full(nsubs, -1, dtype=float)

Next, we are going to iterate over each subject to fill their data into our newly created arrays.

In [13]:
# Groups come out in the same order as `subs` because sort=False was used above.
for s, (_, sub_data) in enumerate(grouped):
    t = tsubs[s]  # number of trials for this subject
    choice[s, :t] = sub_data['choice']
    rt[s, :t] = sub_data['rt']
    valence[s, :t] = sub_data['valence']
    # Each cell holds a length-12 list, so these become (t, 12) arrays.
    intensity[s, :t] = np.array(sub_data['intensity'].tolist())
    identity[s, :t] = np.array(sub_data['identity'].tolist())
    ratio[s, :t] = sub_data['ratio']
    indexer[s, :t] = sub_data['indexer']
    rtmin[s] = sub_data['rt'].min()
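
The padding scheme is easier to see on a toy example. The sketch below uses made-up response times for two subjects with unequal trial counts; the names `toy_rt` and `toy_tsubs` are illustrative only.

```python
import numpy as np

# Toy ragged data: subject 0 has 3 trials, subject 1 has 2.
toy_rt = [[0.62, 0.48, 0.91], [0.55, 0.73]]
toy_tsubs = [len(r) for r in toy_rt]

# Start from an all -1 array, then overwrite the leading entries of each row.
padded = np.full((len(toy_rt), max(toy_tsubs)), -1, dtype=float)
for s, trials in enumerate(toy_rt):
    padded[s, :len(trials)] = trials

print(padded)

# Only the first toy_tsubs[s] entries of row s are real data.
valid = [padded[s, :t] for s, t in enumerate(toy_tsubs)]
```

This is why the trial counts need to travel with the arrays: anything reading the data should loop over the first `Tsub[s]` trials of each subject and never touch the $-1$ placeholders.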

With these arrays we will create a dictionary and save that dictionary to our working directory as a pkl file.

In [14]:
data = {
    'N': nsubs,
    'T': tmax,
    'Tsub': tsubs,
    'choice': choice,
    'valence': valence,
    'rt': rt,
    'rtmin': rtmin,
    'rtbound': rtbound,
    'intensity': intensity,
    'identity': identity,
    'ratio': ratio,
    'indexer': indexer,
}
with open("modeldata.pkl", "wb") as f:
    pickle.dump(data, f, protocol=-1)
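
Before moving on, it can be worth reloading the file and checking that the padding is where we expect it. The following is a sketch on a toy dictionary written to a temporary directory; `check_padding` is a hypothetical helper, and it assumes that $-1$ never occurs as a real value in the array being checked.

```python
import os
import pickle
import tempfile

import numpy as np

def check_padding(data, key):
    """Check that `key` holds real values up to Tsub and -1 afterwards."""
    arr = np.asarray(data[key])
    for s, t in enumerate(data['Tsub']):
        if np.any(arr[s, :t] == -1) or not np.all(arr[s, t:] == -1):
            return False
    return True

# Toy dictionary with the same layout as `data` (2 subjects, up to 3 trials).
toy = {'N': 2, 'T': 3, 'Tsub': [3, 2],
       'choice': np.array([[1, 2, 2], [2, 1, -1]])}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toydata.pkl')
    with open(path, 'wb') as f:
        pickle.dump(toy, f, protocol=-1)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded['choice'].shape == (loaded['N'], loaded['T']))  # True
print(check_padding(loaded, 'choice'))                        # True
```

The same check could be pointed at `modeldata.pkl` by opening that file instead of the temporary one.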

And that's it. We now have our data preprocessed for Stan. In the next tutorial, we will use PyStan, a Python interface to Stan, to write and run our models on the data we've prepared. Before you go, take a moment to look through your final data dictionary.

In [15]:
data

{'N': 271,
 'T': 50,
 'Tsub': [50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  11,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,
  50,