## Preprocessing for Stan Modeling
**Note:** This tutorial notebook walks through the `mcmcjob.py` script. You can use the notebook to prepare raw data for Stan modeling on a personal machine, but not on the HBS cluster, because Jupyter notebooks are not supported there at this time. When submitting jobs on the cluster, use the `mcmcjob.py` script instead and treat this notebook as a reference.

### Step 1: Import Libraries & Data
The `mcmcjob.py` script requires the following libraries:
* pickle
* os
* sys
* numpy
* pandas
* random
* pystan

Most Python users will be familiar with these libraries, with the possible exceptions of `pystan` and `pickle`. We will use `pickle` to serialize our model code and fit into a byte stream that can be saved to disk. PyStan is the Python interface to Stan, the language in which we will write our models. Let's import the libraries:

In [11]:
import pickle
import os
import sys
import numpy as np
import pandas as pd
import random
import pystan
import warnings
warnings.filterwarnings("ignore")  # optional: hides warnings such as pandas' SettingWithCopyWarning in the output
# Feel free to comment out the line above if you'd like to see warnings.
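
Since `pickle` may be the least familiar of these libraries, here is a minimal sketch of the save-and-restore round trip it provides. The dictionary and the file name `example_fit.pkl` are placeholders invented for this illustration; in the script, the pickled objects would be the model code and its fit.

```python
import os
import pickle
import tempfile

# Placeholder standing in for a model and its fit (hypothetical contents).
payload = {"model_name": "example", "draws": [0.12, 0.34, 0.56]}

# Hypothetical file name, written to a temporary directory for this demo.
path = os.path.join(tempfile.gettempdir(), "example_fit.pkl")

# Serialize the object to a byte stream on disk ...
with open(path, "wb") as f:
    pickle.dump(payload, f)

# ... and read it back, as you would in a later session.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == payload)

# Clean up the demo file.
os.remove(path)
```

Note that the files must be opened in binary mode (`"wb"` and `"rb"`), because `pickle` reads and writes bytes rather than text.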

Next we'll load the data. The data file should be saved in the same working directory as `mcmcjob.py`.

In [13]:
raw = pd.read_csv("final_dataset_for_ddm_dec_21.csv")

We will build a new dataframe that contains only the variables that we need for the Stan model:

In [14]:
df = pd.DataFrame({'ID': raw['Random.ID'], 'choice': raw['emo_binary'],
                   'rt': raw['rt'], 'valence': raw['valence'],
                   'identity': raw['faces'], 'intensity': raw['valence_values'],
                   'ratio': raw['b_person_ratio']})

### Step 2: Preprocessing - Converting Strings to Integers & Vectors
The Stan language can be peculiar, so we need to modify our raw data before Stan can interpret it. First, we will drop any trials with NAs, NaNs, or other missing data. Then we will convert the data in our **choice** and **valence** columns from strings to integers. Unlike Python, Stan cannot compare strings in a conditional (e.g., `if value == 'string':`). To write conditional statements into our model, we have to encode these strings as integers.

In [15]:
df = df.dropna(axis=0).reset_index(drop=True)
df['choice'] = [1 if x == 'Not Emotional' else 2 for x in df['choice']]
df['valence'] = [1 if x == 'Happy' else 2 for x in df['valence']]

Here, we specified that all 'Not Emotional' cells in the **choice** column will instead read $1$. Otherwise, cells will read $2$, as is the case with 'Emotional' judgments. We similarly convert strings to integers in the **valence** column.
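
As a quick illustration, the same conversion can be run on a few toy rows. The rows below are invented for this sketch, and `'Angry'` is only a stand-in for whatever non-'Happy' label the real data uses. Because the `else` branch maps every non-matching value to $2$, the sketch first lists any unexpected labels, so that a typo is not silently recoded.

```python
import pandas as pd

# Toy rows invented for illustration; 'Angry' is a hypothetical label.
toy = pd.DataFrame({
    "choice": ["Not Emotional", "Emotional", "Emotional"],
    "valence": ["Happy", "Angry", "Happy"],
})

# Optional check: the else branch below would turn any unexpected label into 2.
unexpected = set(toy["choice"]) - {"Not Emotional", "Emotional"}
print("Unexpected choice labels:", unexpected)

# Same conversion as in the cell above.
toy["choice"] = [1 if x == "Not Emotional" else 2 for x in toy["choice"]]
toy["valence"] = [1 if x == "Happy" else 2 for x in toy["valence"]]

print(toy["choice"].tolist())
print(toy["valence"].tolist())
```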

Next we need to convert the strings in the **identity** and **intensity** columns to vectors. We will do this by splitting the strings along the separator ', ':

In [16]:
df['identity'] = [x.split(', ') for x in df['identity']]
# split() returns strings, so cast each intensity to an integer
# (this assumes whole-number intensities; use float() if they are not)
df['intensity'] = [[int(v) for v in x.split(', ')] for x in df['intensity']]

**Intensity** is now a column of vectors with integers as elements, which Stan can work with, but **identity** still has strings as elements. We need to convert these strings to integers. While we are doing this, we will also force all vectors to be length 12. Why? Stan itself does not require padding; it is needed because the Stan model, as I wrote it, does not account for differences in array size. If we fed in trials with arrays of different sizes, the model would give disproportionate weight to the stimuli in the trials with smaller arrays. Our goal with this modeling analysis is to estimate a unit of transformation from intensity to evidence, and we do not want to bias that estimate.

In [17]:
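identitydict = {'E': 1, 'F': 1, 'B': 2, 'C': 2, 'NA': 0}
# pad both vectors in step, in place, until the identity vector has 12 elements
for ident, inten in zip(df['identity'], df['intensity']):
    while len(ident) < 12:
        ident.append('NA')
        inten.append(0)
# map identity labels to integer codes; assigning the whole column avoids chained assignment
df['identity'] = [[identitydict[e] for e in ident] for ident in df['identity']]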

Notice that we collapsed all identities to be either Black (1) or White (2). We could always build a more complicated model that delineates the individual faces within these groups, but we chose to collapse them for this analysis. Also note that, to make all vectors length 12, we appended $0$s to the intensity vectors and NAs (encoded as $0$) to the identity vectors.
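
To make the splitting and padding concrete, here is a small worked example on a single made-up trial with three faces. The raw strings and the name `TARGET_LEN` are invented for this sketch, and the `int()` cast assumes whole-number intensities.

```python
# One toy trial invented for illustration: three faces instead of twelve.
identity_raw = "E, B, C"
intensity_raw = "10, 25, 40"

identitydict = {'E': 1, 'F': 1, 'B': 2, 'C': 2, 'NA': 0}
TARGET_LEN = 12  # hypothetical name for the fixed vector length used above

# Split the strings into lists; split() yields strings, so cast intensities.
identity = identity_raw.split(', ')
intensity = [int(v) for v in intensity_raw.split(', ')]

# Pad both vectors up to the target length.
identity += ['NA'] * (TARGET_LEN - len(identity))
intensity += [0] * (TARGET_LEN - len(intensity))

# Map identity labels to integer group codes.
identity = [identitydict[e] for e in identity]

print(identity)
print(intensity)
```

The three real faces keep their codes and intensities, and the nine padded slots carry identity code $0$ and intensity $0$.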

Finally, we will divide each value in the **rt** column by $1000$ so that we're working in seconds rather than milliseconds. Stan does not require this step; it is necessary only because the Stan models we will use were written with reaction times in seconds. You could just as easily write a model that works in milliseconds.

In [7]:
df['rt'] = df['rt'] / 1000  # milliseconds to seconds
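
Before moving on, it can be worth confirming that the dataframe has the shape described in the steps above. The helper below is a hypothetical sketch, not part of `mcmcjob.py`; the function name `check_preprocessed` and the toy rows are invented for illustration.

```python
import pandas as pd

def check_preprocessed(df, target_len=12):
    """Return a list of problems found in a preprocessed dataframe (hypothetical helper)."""
    problems = []
    # Scalar columns should contain no missing values after dropna().
    if df[["choice", "valence", "rt"]].isna().any().any():
        problems.append("missing values remain")
    # choice and valence should be coded as 1 or 2 only.
    for col in ("choice", "valence"):
        if not set(df[col]).issubset({1, 2}):
            problems.append(f"{col} contains codes other than 1 and 2")
    # identity and intensity vectors should all have the padded length.
    for col in ("identity", "intensity"):
        if not all(len(v) == target_len for v in df[col]):
            problems.append(f"{col} vectors are not all length {target_len}")
    return problems

# Toy rows invented for illustration, with rt still in milliseconds.
toy = pd.DataFrame({
    "choice": [1, 2],
    "valence": [2, 1],
    "rt": [850, 1200],
    "identity": [[1, 2] + [0] * 10, [2] * 12],
    "intensity": [[10, 25] + [0] * 10, [5] * 12],
})
toy["rt"] = toy["rt"] / 1000  # milliseconds to seconds

print(toy["rt"].tolist())
print(check_preprocessed(toy))
```

An empty list means none of the checks failed. The function returns messages rather than raising, so all problems can be reviewed at once.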

### Step 3:
In progress