# Convert and evaluate raw data

The raw data to be converted contains, for each participant:

- 2 buffer trials in practice, to be ignored
- 18 practice trials
- 72 trials from the main part

Each trial response contains:

- identification of the two samples in the order of appearance. The format is `<letter>_<top half style><bottom half style>`, e.g. `a_AD`. There are four styles identified by the capital letters A, B, C, D.
- whether the top halves are identical
- participant’s response
- exposure time for each letter that is derived from practice and different for each participant
- response time

For example: `a_AD,a_AD,true, Probably same, 266.7, 2968`.
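The field list above can be illustrated with a small parser. This is only a sketch: the `Trial` tuple and `parse_trial` are hypothetical names that are not part of the notebook, and it assumes the third field is always the literal `true` or `false`.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    first: str      # first sample, e.g. "a_AD"
    second: str     # second sample
    same_top: bool  # whether the top halves are identical
    response: str   # participant's response
    et: float       # exposure time
    rt: float       # response time


def parse_trial(raw: str) -> Trial:
    """Split one comma-separated trial response into typed fields."""
    first, second, same_top, response, et, rt = (p.strip() for p in raw.split(","))
    # assumption: the flag is stored as the lowercase string "true" or "false"
    return Trial(first, second, same_top == "true", response, float(et), float(rt))


trial = parse_trial("a_AD,a_AD,true, Probably same, 266.7, 2968")
print(trial)
```

The conversion code further below does not rely on the stored flag; it recomputes `Same` from the sample names.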


The processed data is saved in two forms:

- non-aggregated, one row per trial
- aggregated, one row per participant and pair type (main part only)

### Columns used for indexing

A study (StudyID) was conducted in sessions, one session per participant (ParticipantID). Each session consists of a practice part and a main part. Each part is made up of trials (TrialID) in which two samples are shown to the participant consecutively, each for a fixed period of time, after which a response is collected. The practice part included 20 trials (2 buffer trials and 18 practice trials) and the main part included 72 trials. There was also a simple introductory questionnaire.

- `StudyID` (str) either “Pilot” or “Main”
- `ParticipantID` (int, 0 and higher) for a session (all trials) completed by one participant
- `TestID` (int) 0 refers to practice, 1 refers to the main part
- `TrialID` (int, positive) for individual trials, used only in the non-aggregated data

### Columns with responses from the introduction

- `Training` (str) one of the training categories
- `isDesigner` (bool) whether the participant belongs to one of the designer categories, based on `Training`

### Columns with responses from the trials

- `First` (str) first sample used
- `Second` (str) second sample
- `Response` (str) one of: Sure same, Probably same, Probably different, Sure different
- `Correct` (float) whether participant’s response was correct (1.0) or not (0.0) or **mean** of these values in the aggregated data
- `ET` (float) exposure time of the samples
- `RT` (float) response time or **mean** of response times in the aggregated data
- `RTnorm` (float) normalized response time or **mean** of normalized response times in the aggregated data (natural logarithm was applied)

### Columns used in aggregated data

- all columns above and
- `AUC` (float) for recognition tasks only, area under curve of participant’s responses in this task
- `AUCnorm` (float) same as `AUC`, but normalized by 2 × arcsin(√(AUC))
- `Correctnorm` (float) same as `Correct`, but normalized by 2 × arcsin(√(Correct))
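The arcsine normalization used for `AUCnorm` and `Correctnorm` can be checked on a few values. The sketch below uses only the standard `math` module rather than NumPy, and the function names `arcsine_transform` and `arcsine_inverse` are illustrative, not the notebook's own.

```python
import math


def arcsine_transform(p: float) -> float:
    """Map a proportion in [0, 1] to 2 * arcsin(sqrt(p))."""
    return 2 * math.asin(math.sqrt(p))


def arcsine_inverse(t: float) -> float:
    """Map a transformed value back to a proportion: sin(t / 2) ** 2."""
    return math.sin(t / 2) ** 2


# print each proportion, its transformed value, and the round trip back
for p in (0.0, 0.5, 0.75, 1.0):
    t = arcsine_transform(p)
    print(f"{p:.2f} -> {t:.4f} -> {arcsine_inverse(t):.2f}")
```

A proportion of 0.5 maps to π/2 and a proportion of 1.0 maps to π. In the non-aggregated data `Correct` is either 0.0 or 1.0, so its normalized value is either 0 or π.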

### Other columns

- `Date`

In [4]:
from datetime import datetime
import glob
import numpy as np
import pandas as pd
import os

alpha = 0.05

# set up a DataFrame to collect the processed data
columns = [
    "StudyID", "ParticipantID", "TestID", "TrialID",
    "Training", "isDesigner",
    "First", "Second", "Composite pair", "Congruent pair", "Same", "Response", "Correct", "Correct (normalized)", "ET", "RT", "RT (normalized)", "Date",
]
startDate = datetime.strptime("2022-06-03 00:00:00", '%Y-%m-%d %H:%M:%S')
d = pd.DataFrame(columns=columns)

## Normalization functions and calculation of AUC

In [5]:
def normalize_auc(auc):
    """
    Transform the square root of AUC
    using arcsin and multiply by 2.
    """

    return 2 * np.arcsin(np.sqrt(auc))


def normalize_rt(rt):
    """
    Transform RTs
    using natural logarithm.
    Non-positive RTs have no logarithm and give NaN.
    """
    if rt > 0:
        return np.log(rt)
    return np.nan


def denormalize_auc(aucnorm):
    """
    Transform the normalized AUC back
    using a square of the sine value of its half.
    """

    return np.sin(aucnorm / 2) ** 2


def denormalize_rt(rtnorm):
    """
    Transform the normalized RTs back
    using the exponential function.
    """

    return np.exp(rtnorm)


def cumulative(x):
    """
    Return the running totals of x.
    """
    return [sum(x[0:i+1]) for i in range(len(x))]


def get_auc(x, y):
    """
    Area under the curve given by the cumulative,
    normalized frequencies x and y (trapezoidal rule).
    """
    # make cumulative
    x, y = cumulative(x), cumulative(y)
    # normalize to the 0-1 range (left as is if all frequencies are zero)
    if max(x) != 0:
        x = [xi / max(x) for xi in x]
    if max(y) != 0:
        y = [yi / max(y) for yi in y]
    # sum the trapezoids, starting from the origin
    auc = 0
    x1, y1 = 0, 0
    for x2, y2 in zip(x, y):
        auc += (x2 - x1) * (y1 + y2) / 2
        x1, y1 = x2, y2
    return auc

map_JoL = {
    "very easy to read": 100,
    "easy to read": 75,
    "ok": 50,
    "difficult to read": 25,
    "very difficult to read": 0,
}
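To make the AUC calculation in `get_auc` concrete, here is a self-contained sketch of the same trapezoidal computation applied to made-up response frequencies. The name `roc_auc` and the example counts are hypothetical. Frequencies are assumed to be ordered as Sure same, Probably same, Probably different, Sure different.

```python
from itertools import accumulate


def roc_auc(different, same):
    """Trapezoidal area under the curve of cumulative response shares."""
    # a total of zero is replaced by 1 so that empty counts stay at zero
    total_x = sum(different) or 1
    total_y = sum(same) or 1
    xs = [v / total_x for v in accumulate(different)]
    ys = [v / total_y for v in accumulate(same)]
    area, x1, y1 = 0.0, 0.0, 0.0
    for x2, y2 in zip(xs, ys):
        area += (x2 - x1) * (y1 + y2) / 2
        x1, y1 = x2, y2
    return area


# made-up counts of responses to "same" and "different" pairs
perfect = roc_auc(different=[0, 0, 0, 10], same=[10, 0, 0, 0])
chance = roc_auc(different=[5, 5, 5, 5], same=[5, 5, 5, 5])
typical = roc_auc(different=[1, 1, 2, 6], same=[6, 2, 1, 1])
print(perfect, chance, typical)
```

A participant who always answers "Sure same" to same pairs and "Sure different" to different pairs gets an AUC of 1, while identical response distributions for both kinds of pairs give 0.5.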

## Convert data from the raw format to stats-ready format

The raw format (saved from the website) has all responses from a single participant saved in a single row.
The responses for individual trials are saved in columns like “practice_1” or “main_1”.
The following code converts this format and saves responses into individual rows.
It also deals with some minor format differences as the formatting evolved with time.
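The wide-to-long idea can be sketched without pandas on a single participant row. This is a simplified illustration, not the notebook's implementation: `wide_to_long` is a hypothetical helper, the row is made up from the example values in this document, and the record keeps the unparsed response under an illustrative `Raw` key.

```python
def wide_to_long(row: dict) -> list:
    """Turn one participant row into one record per trial column."""
    records = []
    for column, value in row.items():
        # only columns named practice_<n> or main_<n> hold trial responses
        if not column.startswith(("practice_", "main_")):
            continue
        records.append({
            "TestID": 0 if column.startswith("practice_") else 1,
            "TrialID": int(column.split("_")[-1]),
            "Raw": value,
        })
    return records


# hypothetical participant row built from example values
row = {
    "Designer": "Non-designer",
    "practice_1": "a_AD,a_AD,true, Probably same, 266.7, 2968",
    "main_1": "b_AB,b_AD,true, Sure same, 577.8, 3549",
}
records = wide_to_long(row)
for record in records:
    print(record)
```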

In [6]:
# Warning: this takes quite a while to compute

# congruent pairs
congruents = [("AA", "AA"), ("AD", "AD"), ("AD", "BC"), ("AA", "DD"), ("BC", "AD"), ("DD", "AA")]

# participant counter (Participant ID)
pid = 0
# counter for trials within each session of a single participant
x = 0
for fn in glob.glob(os.path.join("..", "data", "raw-data*.csv")):
    raw = pd.read_csv(fn)
    for i, rraw in raw.iterrows():
        # collect data that will be shared across all rows
        # for one participant
        shared = pd.Series(index=d.columns, dtype="object")  # holds strings, booleans and numbers
        shared["StudyID"] = "Pilot"  # default
        if "submissionDate" in rraw:
            submissionDate = datetime.strptime(rraw["submissionDate"], '%Y-%m-%d %H:%M:%S')
            if submissionDate > startDate:
                shared["StudyID"] = "Main"
            shared["Date"] = rraw["submissionDate"]
        shared["ParticipantID"] = pid
        if "Designer" in rraw:
            shared["Training"] = rraw["Designer"]
            # isDesigner is a boolean column to conveniently group designers together
            shared["isDesigner"] = (rraw["Designer"] != "Non-designer")
        for c in rraw.index:
            # get values from columns like this: practice_<number of trial> or main_<number of trial>
            if c.startswith("practice_") or c.startswith("main_"):
                # prefill with shared data
                # copy so that values from one trial do not leak into the next
                rd = shared.copy()
                if c.startswith("practice_"):
                    rd["TestID"] =  0
                else:
                    rd["TestID"] =  1
                # get Trial ID from the column name
                rd["TrialID"] = int(c.strip().split("_")[-1])
                if isinstance(rraw[c], str) and "," in rraw[c]:
                    # get the first sample, second sample, response, ET, RT from the value in this column
                    # e.g. b_AB,b_AD,true, Sure same, 577.8, 3549
                    response = rraw[c].strip().split(",")
                    rd["First"] = response[0].strip()
                    rd["Second"] = response[1].strip()
                    # type of the pair
                    # Composite (e.g. AB, CB) vs Normal (e.g. AA, DD)
                    rd["Composite pair"] = rd["First"][2] != rd["First"][3]
                    # Congruent pair (e.g. AD/AD, AD/BC) vs Incongruent (e.g. AD/BD, AD/AB)
                    pair = (rd["First"][2:], rd["Second"][2:])
                    rd["Congruent pair"] = pair in congruents
                    rd["Same"] = rd["First"][:3] == rd["Second"][:3]  # same letter, same top half style
                    rd["Response"] = response[3].strip()
                    rd["ET"] = float(response[4].strip())
                    rd["RT"] = float(response[5].strip())
                    # evaluate response
                    rd["Correct"] = 0
                    if rd["Same"] and ("same" in rd["Response"]):
                        rd["Correct"] = 1
                    elif (not rd["Same"]) and ("different" in rd["Response"]):
                        rd["Correct"] = 1
                    # normalize
                    rd["RT (normalized)"] = normalize_rt(rd["RT"])
                    rd["Correct (normalized)"] = normalize_auc(rd["Correct"])
                # add a row with this individual trial to the data
                d.loc[x] = rd
                x += 1
        pid += 1
# fix types
d["ParticipantID"] = d["ParticipantID"].astype(int)
d["TestID"] = d["TestID"].astype("int")
d = d.sort_values("Date")  # sort_values returns a new frame, so keep the result

print("Processed %d responses from %d participants." % (len(d), pid))

Processed 19504 responses from 212 participants.
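The rules for deriving the pair type and the correctness of a response can be isolated into a small function. The sketch below mirrors the logic of the conversion code above. The function name `classify` is hypothetical and the sample names in the usage lines are illustrative.

```python
# congruent style pairs, copied from the conversion code
CONGRUENTS = {("AA", "AA"), ("AD", "AD"), ("AD", "BC"),
              ("AA", "DD"), ("BC", "AD"), ("DD", "AA")}


def classify(first: str, second: str, response: str) -> dict:
    """Derive the pair type and correctness for one trial."""
    styles_first, styles_second = first[2:], second[2:]
    same = first[:3] == second[:3]  # same letter and same top-half style
    correct = (same and "same" in response) or (not same and "different" in response)
    return {
        "Composite pair": styles_first[0] != styles_first[1],
        "Congruent pair": (styles_first, styles_second) in CONGRUENTS,
        "Same": same,
        "Correct": 1.0 if correct else 0.0,
    }


print(classify("b_AB", "b_AD", "Sure same"))
print(classify("a_AD", "a_BC", "Probably same"))
```

In the first call the top halves match and the participant answered "same", so the response counts as correct. In the second call the top halves differ, so a "same" response counts as incorrect.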


# Aggregate data for each participant

Calculate AUC and mean RT for the Main part only (the Practice part is dropped), separately for each participant and each combination of composite and congruent pair types.

In [7]:
def get_agg_results(d_):
    """
    Aggregate data of the main part for every
    (study, participant, composite pair, congruent pair) combination.
    """
    
    d = d_.copy()

    # remove Practice round
    d = d[d["TestID"] == 1]

    # Prefill results
    # aggregate correct-ness and response times (use mean value)
    # keep the rest as is or set NaN value for new columns
    result_columns = columns + ["AUC", "AUC (normalized)"]
    agg_columns = {k:"first" for k in set(d.columns).intersection(result_columns)}
    agg_columns["TrialID"] = "count"
    agg_columns["Correct"] = "mean"
    agg_columns["Correct (normalized)"] = "mean"
    agg_columns["RT"] = "mean"
    agg_columns["RT (normalized)"] = "mean"
    results = d.groupby(["StudyID", "ParticipantID", "Composite pair", "Congruent pair"]).agg(agg_columns)  # look only at the Main part, ignore the Practice
    results = pd.DataFrame(results, columns=result_columns)
    results.set_index(["StudyID", "ParticipantID", "Composite pair", "Congruent pair"], inplace=True)
    
    # index for temporary data frames used to calculate the AUC.
    responses = ["Sure same", "Probably same", "Probably different", "Sure different"]
    ix = pd.MultiIndex.from_product([[True, False], responses], names=["Same", "Response"])

    # Get each part/test separately

    for sid in d["StudyID"].unique():  # Pilot vs Main
        for pid in d[d["StudyID"] == sid]["ParticipantID"].unique():  # Participant
            for composite in (True, False):
                for congruent in (True, False):
                    # Subset the data frame to participant-test combination
                    dtt = d[(d["StudyID"] == sid) & (d["ParticipantID"] == pid) & (d["Composite pair"] == composite) & (d["Congruent pair"] == congruent)]
                    
                    # Calculate the AUC
                    # the temporary data frame is needed to
                    # ensure the order in the index is always the same
                    dg = pd.DataFrame(index=ix)
                    dg["Frequencies"] = dtt.groupby(["Same"])["Response"].value_counts()
                    dg = dg.fillna(0)
                    # get_auc builds the curve from cumulative response frequencies:
                    # x coordinate: responses to different stimuli (Same == False, last four values)
                    # y coordinate: responses to same stimuli (Same == True, first four values)
                    freqs = dg["Frequencies"].tolist()
                    auc = get_auc(freqs[4:], freqs[:4])
                    results.loc[(sid, pid, composite, congruent), "AUC"] = auc
                    results.loc[(sid, pid, composite, congruent), "AUC (normalized)"] = normalize_auc(auc)
    return results

aggregated = get_agg_results(d)
del aggregated["First"]
del aggregated["Second"]
del aggregated["Same"]
del aggregated["Response"]

In [8]:
# save the processed and aggregated data
d.to_csv(os.path.join("..", "data", "serial-data.csv"))
aggregated.to_csv(os.path.join("..", "data", "aggregated-data.csv"))
print("Successfully saved to CSV files.")

Successfully saved to CSV files.
