# Convert and evaluate raw data

This notebook converts the data from the raw form returned by the website to a form better suited for statistical analysis and subsequent aggregation.

The processed data looks like this:

### Columns used for indexing

- `StudyID` (int), 0 refers to pilot study, 1 and 2 refer to studies #1 and #2 which had a different study format (see our report)
- `ParticipantID` (int, 0 and higher)
- `TestID` (int, 1 or 2) the first or second part/test of the study (this is called a “study part” in our report)
- `Type` (“lexical” or “recognition”) identifies a task within a part/test, the lexical task is always the first and the recognition task is always the second
- `TrialID` (int, positive) for individual trials in the non-aggregated data

### Columns with responses from the introduction

- `Fluent` (bool) whether participant was fluent in English
- `Training` (str) one of the training categories

### Columns with responses from the trials

- `Font` (str) font used, “arial” or “sansforgetica”
- `Sample` (str) sample text used
- `Category` (str) word or non-word
- `Response` (str) differs for the lexical (Sure word, Probably word, Sure non-word, Probably non-word) and recognition (Sure seen, Probably seen, Sure not seen, Probably not seen) task
- `Correct` (float) whether the participant’s response was correct (1.0) or not (0.0), or the **mean** of these values in the aggregated data
- `Seen` (bool) for recognition tasks only, whether the item (sample) had been shown in the lexical task
- `Foil` (str) for recognition tasks only, fake sample used instead of the real sample for items “not seen” in the lexical task
- `RT` (float) response time or **mean** of response times in the aggregated data
- `RTnorm` (float) normalized response time or **mean** of normalized response times in the aggregated data

### Columns used in aggregated data

- all columns above and
- `TaskID` (int, 1–4) the order of tasks within a single study session, for the same single participant
- `Order` (int, 1 or 2) to specify the order of tasks in a study session
- `Firstfont` (str) the first font the participant had seen
- `isDesigner` (bool) whether participants belong to one of the categories of designers, based on `Training`
- `RT_word` (float) mean response time for words only
- `RT_nonword` (float) mean response time for non-words only
- `RTnorm_word` (float) mean normalized response time for words only
- `RTnorm_nonword` (float) mean normalized response time for non-words only
- `AUC` (float) for recognition tasks only, area under curve of participant’s responses in this task
- `AUC_word` (float) same as `AUC`, but for words only
- `AUC_nonword` (float) same as `AUC`, but for non-words only
- `AUCnorm` (float) same as `AUC`, but normalized
- `AUCnorm_word` (float) same as `AUC`, but normalized and for words only
- `AUCnorm_nonword` (float) same as `AUC`, but normalized and for non-words only
- `Correctnorm` (float) same as `Correct`, but normalized in non-aggregated data

### Columns with responses from the intervening questionnaires

- `JoM` (float) response to judgement of memory for the lexical task (copied over to the following recognition task too)
- `JoL` (str) response to judgement of legibility for the lexical task (copied over to the following recognition task too); in the aggregated data, the responses are converted to a 0–100 scale to get a mean (see the `map_JoL` mapping below)

### Other columns

- `Date`

In [1]:
import glob
import numpy as np
import pandas as pd
import os

# set up a DataFrame to collect the processed data
columns = [
    "StudyID", "ParticipantID", "Fluent", "Training",
    "TestID", "Type", "TrialID",
    "Font", "Sample", "Category",
    "Response", "Correct", "Seen", "Foil", "RT", "RTnorm",
    "JoM", "JoL", "Date",
]
d = pd.DataFrame(columns=columns)

## Normalization functions and calculation of AUC

In [2]:
def normalize_auc(auc):
    """
    Transform the square root of AUC
    using arcsin and multiply by 2.
    """

    return 2 * np.arcsin(np.sqrt(auc))


def normalize_rt(rt):
    """
    Transform RTs
    using natural logarithm.
    """

    return np.log(rt)


def denormalize_auc(aucnorm):
    """
    Transform the normalized AUC back
    using a square of the sine value of its half.
    """

    return np.sin(aucnorm / 2) ** 2


def denormalize_rt(rtnorm):
    """
    Transform the normalized RTs back
    using the exponential function.
    """

    return np.exp(rtnorm)


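def cummulative(x):
    """
    Return the running (cumulative) sums of x.
    """

    return [sum(x[0:i+1]) for i in range(len(x))]


def get_auc(x, y):
    """
    Calculate the area under the ROC curve using the trapezoidal rule.
    x and y are response frequencies ordered by confidence
    (x for non-word/not seen items, y for word/seen items).
    """

    # make cumulative
    x, y = cummulative(x), cummulative(y)
    # normalize to the 0-1 range
    x = [xi/max(x) for xi in x]
    y = [yi/max(y) for yi in y]
    # sum the trapezoids between consecutive points, starting at (0, 0)
    auc = 0
    x1, y1 = 0, 0
    for x2, y2 in zip(x, y):
        auc += (x2 - x1) * (y1 + y2) / 2
        x1, y1 = x2, y2
    return auc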

map_JoL = {
    "very easy to read": 100,
    "easy to read": 75,
    "ok": 50,
    "difficult to read": 25,
    "very difficult to read": 0,
}
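
A small worked example may help to see what `get_auc` and the AUC normalization do. The sketch below re-implements them using only the standard library (the notebook itself uses NumPy), and the response frequencies are made up purely for illustration.

```python
import math
from itertools import accumulate


def normalize_auc(auc):
    # arcsine square-root transform, as in the notebook (math instead of NumPy)
    return 2 * math.asin(math.sqrt(auc))


def denormalize_auc(aucnorm):
    # inverse of the transform above
    return math.sin(aucnorm / 2) ** 2


def get_auc(x, y):
    # cumulative proportions give the ROC points, trapezoids give the area
    x, y = list(accumulate(x)), list(accumulate(y))
    x = [xi / max(x) for xi in x]
    y = [yi / max(y) for yi in y]
    auc, x1, y1 = 0.0, 0.0, 0.0
    for x2, y2 in zip(x, y):
        auc += (x2 - x1) * (y1 + y2) / 2
        x1, y1 = x2, y2
    return auc


# Hypothetical response counts, ordered as in the notebook:
# "Sure word", "Probably word", "Probably non-word", "Sure non-word"
word_freqs = [6, 2, 1, 1]     # responses to real words (y axis)
nonword_freqs = [1, 1, 2, 6]  # responses to non-words (x axis)

auc = get_auc(nonword_freqs, word_freqs)
print(round(auc, 3))  # 0.84
print(round(denormalize_auc(normalize_auc(auc)), 3))  # 0.84 again after the round trip
```

A participant who responds identically to words and non-words would land on the diagonal of the ROC plot, which gives an AUC of 0.5.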

## Convert data from the raw format to stats-ready format

The raw format has all responses from one participant in a single row. This step breaks down the results for individual trials (saved in columns like “test_1_lexical_5”) and saves them as individual rows.

It also deals with some minor format differences, as the formatting evolved over time.
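
To make the breakdown concrete, here is a minimal sketch of how one raw column could be split into fields. `parse_trial` is a hypothetical helper and the cell value is invented; only the column-name pattern and the position of the font, response, and response time follow the conversion code in this notebook.

```python
def parse_trial(column, value):
    """Split one raw column name and its comma-separated value into fields."""
    # column names look like test_<TestID>_<Type>_<TrialID>
    _, test_id, task_type, trial_id = column.strip().split("_")
    parts = [p.strip() for p in value.split(",")]
    return {
        "TestID": int(test_id),
        "Type": task_type,
        "TrialID": int(trial_id),
        "Font": parts[0],        # the font always comes first
        "Response": parts[-2],   # the response is second to last
        "RT": float(parts[-1]),  # the response time is last
    }


# Hypothetical raw cell in the newer lexical layout:
# font, category, sample, response, response time
trial = parse_trial("test_1_lexical_5", "arial, word, house, Sure word, 1234")
print(trial)
```

Reading the response and the response time from the end of the list is what lets the same code handle legacy rows with fewer fields.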

In [3]:
# Warning: this takes quite a while to compute

# participant counter (Participant ID)
pid = 0
# counter for trials within each session of a single participant
x = 0
for fn in glob.glob(os.path.join("..", "data", "raw", "*.csv")):
    raw = pd.read_csv(fn)
    for i, rraw in raw.iterrows():
        # collect data that will be shared across all rows
        # for one participant
        # use the object dtype, since the shared values mix strings and numbers
        shared = pd.Series(index=d.columns, dtype="object")
        if "studyid" in rraw:
            shared["StudyID"] = rraw["studyid"]
        else:
            shared["StudyID"] = 0  # pilot study
        shared["ParticipantID"] = pid
        if "Fluent" in rraw:
            shared["Fluent"] = rraw["Fluent"]
        # deal with legacy column names
        if "Native" in rraw:
            shared["Fluent"] = rraw["Native"]
        if "Designer" in rraw:
            shared["Training"] = rraw["Designer"]
        if "Design_skills" in rraw:
            shared["Training"] = rraw["Design_skills"]
        for c in rraw.index:
            # get values from columns like this: test_1_lexical_5
            # ignore values from columns like this: test_1_remember
            # or test_1_legibility
            if c.startswith("test_") and \
               not (c.endswith("_remember") or c.endswith("_legibility")):
                # prefill with shared data
                rd = shared.copy()
                # set defaults
                rd["Category"], rd["Seen"], rd["Foil"] = np.nan, np.nan, np.nan
                # get Test ID, Type, and Trial ID from the column name
                _, rd["TestID"], rd["Type"], rd["TrialID"] = c.strip().split("_")
                # get the response from the value in this column
                response = rraw[c].strip().split(",")
                # tackle legacy formats of responses
                # when only some values were provided
                rd["Font"] = response[0].strip()
                rd["Response"] = response[-2].strip()
                rd["RT"] = float(response[-1].strip())
                if rd["Type"] == "lexical":
                    if len(response) == 4:
                        rd["Sample"] = response[1].strip()
                    else:
                        rd["Category"] = response[1].strip()
                        rd["Sample"] = response[2].strip()
                else:
                    if len(response) == 5:
                        rd["Sample"] = response[1].strip()
                        rd["Seen"] = response[2].strip()
                    elif len(response) == 6:
                        rd["Category"] = response[1].strip()
                        rd["Sample"] = response[2].strip()
                        rd["Seen"] = response[3].strip()
                    else:
                        rd["Category"] = response[1].strip()
                        rd["Sample"] = response[2].strip()
                        rd["Seen"] = response[3].strip()
                        rd["Foil"] = response[4].strip()
                # fix legacy values
                if isinstance(rd["Category"], str):
                    rd["Category"] = rd["Category"].replace("nonword", "non-word")
                if isinstance(rd["Seen"], str):
                    rd["Seen"] = rd["Seen"].replace("non-seen", "not seen")
                rd["Response"] = rd["Response"].replace("non-seen", "not seen")
                # add the judgement of memory (JoM) for this part
                # value from column test_1_remember
                rd["JoM"] = rraw["test_%s_remember" % rd["TestID"]]
                # add the judgement of legibility (JoL) for this part
                # value from column test_1_legibility
                rd["JoL"] = rraw["test_%s_legibility" % rd["TestID"]]
                # the date is taken from the last column of the raw data
                rd["Date"] = rraw.iloc[-1]
                # add a row for the individual trial
                d.loc[x] = rd
                x += 1
        pid += 1
# fix types
d["StudyID"] = d["StudyID"].astype(int)
d["ParticipantID"] = d["ParticipantID"].astype(int)
d["TestID"] = d["TestID"].astype("int")

print("Processed %d responses from %d participants." % (len(d), pid))

Processed 15768 responses from 219 participants.


## Add missing data & evaluate responses

Evaluate whether responses are correct or not, and add the normalized response time (RT), transformed using the natural logarithm.
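
The evaluation rule can be sketched as a small standalone function. `is_correct` is a hypothetical helper, not part of the notebook: it assumes a response counts as correct when it names the true label (the category in lexical tasks, the seen status in recognition tasks), regardless of the stated confidence.

```python
def is_correct(response, truth):
    """Return 1 if the response names the true label, else 0."""
    # both confidence levels ("Sure", "Probably") count as the same answer
    return int(response in ("Sure " + truth, "Probably " + truth))


# lexical task: the truth is the sample's category
print(is_correct("Probably non-word", "non-word"))  # 1
print(is_correct("Sure word", "non-word"))          # 0
# recognition task: the truth is whether the sample had been seen
print(is_correct("Sure not seen", "not seen"))      # 1
print(is_correct("Probably seen", "not seen"))      # 0
```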

In [5]:
# Warning: this takes quite a while to compute

# get a list of words and non-words from txt files used for the website
# and map them to their category names (word, non-word)
categories = {}
for cat in ["words", "non-words"]:
    with open(os.path.join("..", "data", "samples-databases", cat + ".txt")) as f:
        for w in f.readlines():
            categories[w.strip()] = cat[:-1] # remove the final "s"

# add missing data & evaluate responses
for i, rd in d.iterrows():
    # convert string "yes" to boolean
    rd["Fluent"] = (rd["Fluent"] == "yes")
    if isinstance(rd["Category"], float) or rd["Category"] is np.nan:
        # assign correct category if missing
        rd["Category"] = categories[rd["Sample"]]
    # set Correct to 1 when the response (sure or probably) matches
    # the true category (lexical) or the seen status (recognition)
    # set to zero otherwise
    rd["Correct"] = 0
    if rd["Type"] == "lexical":
        if rd["Response"] == ("Sure " + rd["Category"]) or \
          rd["Response"] == ("Probably " + rd["Category"]):
            rd["Correct"] = 1
    elif rd["Type"] == "recognition":
        if rd["Response"] == ("Sure " + rd["Seen"]) or \
          rd["Response"] == ("Probably " + rd["Seen"]):
            rd["Correct"] = 1
    d.loc[i] = rd
            
# add normalized RT
d["RTnorm"] = normalize_rt(d["RT"])

# Aggregate data for each part and participant

Aggregate data for every (study, participant, test, task type) combination. This results in four rows per participant:
- TaskID 1 for: part/test 1, lexical task
- TaskID 2 for: part/test 1, recognition task
- TaskID 3 for: part/test 2, lexical task
- TaskID 4 for: part/test 2, recognition task

Calculate AUC and mean RT across all of their relevant responses
and get them for: all, words, and non-words, normalized and not.
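
The numbering of the four rows listed above can be expressed as a simple formula. The sketch below is illustrative only; `task_id` and `TASK_ORDER` are hypothetical names, and the formula assumes the lexical task always precedes the recognition task within a part/test.

```python
# position of each task type within a part/test
TASK_ORDER = {"lexical": 1, "recognition": 2}


def task_id(test_id, task_type):
    """Map (part/test, task type) to a TaskID in the range 1-4."""
    return 2 * (test_id - 1) + TASK_ORDER[task_type]


for test_id in (1, 2):
    for task_type in ("lexical", "recognition"):
        print(test_id, task_type, task_id(test_id, task_type))
```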

In [7]:
def get_agg_results(d_):
    """
    Aggregate data for every (study, test, participant) combination.
    """
    
    d = d_.copy()

    # Prefill results
    # aggregate correct-ness and response times (use mean value)
    # keep the rest as is or set NaN value for new columns
    result_columns = ["StudyID", "ParticipantID", "TestID", "TaskID", "Type", "Order", "Ordertype",
                      "Firstfont",
                      "Fluent", "Training", "isDesigner", "Font", "Correct", "RT", "RTnorm",
                      "RT_word", "RT_nonword", "RTnorm_word", "RTnorm_nonword",
                      "AUC", "AUC_word", "AUC_nonword",
                      "JoL", "JoM", "Date"]
    agg_columns = {k:"first" for k in set(d.columns).intersection(result_columns)}
    agg_columns["Correct"] = "mean"
    agg_columns["RT"] = "mean"
    agg_columns["RTnorm"] = "mean"
    results = d.groupby(["StudyID", "ParticipantID", "TestID", "Type"]).agg(agg_columns)
    results = pd.DataFrame(results, columns=result_columns)
    results.set_index(["StudyID", "ParticipantID", "TestID", "Type"], inplace=True)
    # isDesigner is a boolean column to conveniently group designers together
    results["isDesigner"] = (results["Training"] != "Non-designer")
    # convert JoL responses to numerical values
    for k, v in map_JoL.items():
        results["JoL"] = results["JoL"].astype(str).replace(k, v)
    results["JoL"] = results["JoL"].astype(float)

    test_ids = sorted(set(d["TestID"].unique()))
    ttypes = sorted(d["Type"].unique())
    
    # Prepare indexes for temporary data frames used to calculate the AUC.
    # There are two indexes, one based on the “Category” column used for lexical tasks
    # and one based on the “Seen” column used for recognition.
    ix = {}
    category_name = "Category"
    categories = ["word", "non-word"]
    responses = ["Sure word", "Probably word", "Probably non-word", "Sure non-word"]
    ix["lexical"] = (category_name, pd.MultiIndex.from_product([categories, responses], names=[category_name, "Response"]))
    category_name = "Seen"
    categories = ["seen", "not seen"]
    responses = ["Sure seen", "Probably seen", "Probably not seen", "Sure not seen"]
    ix["recognition"] = (category_name, pd.MultiIndex.from_product([categories, responses], names=[category_name, "Response"]))

    # Get each part/test separately
    for sid in d["StudyID"].unique():
        for pid in d[d["StudyID"] == sid]["ParticipantID"].unique():
            taskid = 0
            for tid in test_ids:
                for order, ttype in enumerate(ttypes):
                    order += 1  # (0, 1) -> (1, 2)
                    taskid += 1  # equals 2 * (int(tid) - 1) + order -> (1, 2, 3, 4)
                    
                    # Subset the data frame to single task-type combination
                    # there are four (two parts/tests with two tasks) for each participant in a study
                    dtt = d[(d["StudyID"] == sid) & (d["ParticipantID"] == pid) & (d["TestID"] == tid) & (d["Type"] == ttype)]
                    # get/save the order
                    if sid == 1:
                        # in study #1 it corresponds to 1 = lexical, 2 = recognition
                        results.loc[(sid, pid, tid, ttype), "Order"] = order
                    else:
                        # in study #2 it depends on the Test ID
                        if tid == 1:
                            results.loc[(sid, pid, tid, ttype), "Order"] = order
                        elif tid == 2 and order == 1:
                            results.loc[(sid, pid, tid, ttype), "Order"] = 2
                        elif tid == 2 and order == 2:
                            results.loc[(sid, pid, tid, ttype), "Order"] = 1
                        else:
                            print("This should not happen", tid, order)
                    results.loc[(sid, pid, tid, ttype), "TaskID"] = taskid
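                    # "Firstfont" is only filled in after this loop, so read the font
                    # of the first lexical task directly instead of the still-empty column
                    ffont = results.loc[(sid, pid, 1, "lexical"), "Font"]
                    results.loc[(sid, pid, tid, ttype), "Ordertype"] = int(ffont == "arial") + 1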
                    
                    # Calculate the AUC
                    # figure out which category and index to use for this test type
                    # category_name is either “Category” (in lexical task) or “Seen” (in recognition task)
                    category_name, index = ix[ttype]
                    # get response frequencies for this test type first
                    # ensure the order in the index is always the same
                    dg = pd.DataFrame(index=index)
                    dg["Frequencies"] = dtt.groupby([category_name])["Response"].value_counts()
                    dg = dg.fillna(0)
                    # use frequencies for word/seen for the y coordinate in the get_auc function
                    # use frequencies for non-word/not seen for the x coordinate
                    freqs = dg["Frequencies"].tolist()
                    auc = get_auc(freqs[4:], freqs[:4])
                    results.loc[(sid, pid, tid, ttype), "AUC"] = auc
                    results.loc[(sid, pid, tid, ttype), "AUCnorm"] = normalize_auc(auc)
                    correct = results.loc[(sid, pid, tid, ttype), "Correct"]
                    results.loc[(sid, pid, tid, ttype), "Correctnorm"] = normalize_auc(correct)

                    # For recognition task only
                    # get the mean AUC and RT for words only and non-words only
                    if ttype == "recognition":
                        for cat in ["word", "non-word"]:
                            cat_ = cat.replace("-", "")
                            dg["Frequencies"] = dtt[dtt["Category"] == cat].groupby([category_name])["Response"].value_counts()
                            dg = dg.fillna(0)
                            freqs = dg["Frequencies"].tolist()
                            auc = get_auc(freqs[4:], freqs[:4])
                            results.loc[(sid, pid, tid, ttype), "AUC_%s" % cat_] = auc
                            results.loc[(sid, pid, tid, ttype), "AUCnorm_%s" % cat_] = normalize_auc(auc)
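                            # keep the RT calculation inside the loop so that it runs
                            # for both words and non-words
                            # note: RTnorm_* is the logarithm of the mean RT
                            rt = dtt[dtt["Category"] == cat]["RT"].mean()
                            results.loc[(sid, pid, tid, ttype), "RT_%s" % cat_] = rt
                            results.loc[(sid, pid, tid, ttype), "RTnorm_%s" % cat_] = normalize_rt(rt)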
            results.loc[(sid, pid), "Firstfont"] = results.loc[(sid, pid, 1, "lexical"), "Font"]
    # fix the type for order column
    # results["Order"] = results["Order"].astype(int)
    # swap recognition tests for SID == 2
    # to have lexical and recognition next to each other
    for pid in d[d["StudyID"] == 2]["ParticipantID"].unique():
        backup = results.loc[(2, pid, 1, "recognition")].copy()
        results.loc[(2, pid, 1, "recognition")] = results.loc[(2, pid, 2, "recognition")]
        results.loc[(2, pid, 2, "recognition")] = backup
    results.reset_index(inplace=True)
    # fix column type
    results["TaskID"] = results["TaskID"].astype("int")
    results["Ordertype"] = results["Ordertype"].astype("int")
    return results

aggregated = get_agg_results(d)

In [8]:
# save the processed and aggregated data
d.to_csv(os.path.join("..", "data", "data.csv"))
aggregated.to_csv(os.path.join("..", "data", "data_aggregated.csv"))
print("Successfully saved to CSV files.")

Successfully saved to CSV files.


# Not used: remove outliers

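Outliers are response times (RTs) outside mean ± 2*STD, judged on the normalized RTs. Note that the code below only replaces RTs at or above the upper bound (mean + 2*STD).
These RTs are replaced by the mean of RTs for a participant and task type (lexical, recognition).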
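
The replacement rule can be illustrated on a single list of response times. This is a simplified sketch using only the standard library: `replace_slow_outliers` is a hypothetical helper, the RT values are invented, and the list stands in for all RTs of one participant and task type.

```python
import math
from statistics import mean, stdev


def replace_slow_outliers(rts, n_sd=2):
    """Replace RTs at or above mean + n_sd * STD (on the log scale) with the mean RT."""
    logs = [math.log(rt) for rt in rts]
    # the cutoff is computed on the normalized (log) RTs
    cutoff = mean(logs) + n_sd * stdev(logs)
    # the replacement is the mean of the original RTs, outlier included
    replacement = mean(rts)
    return [replacement if log_rt >= cutoff else rt
            for rt, log_rt in zip(rts, logs)]


# Hypothetical RTs with one very slow response at the end
rts = [500, 520, 480, 510, 490, 505, 495, 515, 485, 5000]
cleaned = replace_slow_outliers(rts)
print(cleaned[-1])  # 950, the mean of all ten RTs
```

Only the last value is replaced here, and the other nine are returned unchanged.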

In [9]:
def replace_outliers(d_):
    """
    Replace RTs at or above mean + 2*STD (judged on RTnorm)
    with the mean RT for the participant and task type.
    """

    d = d_.copy()
    means = d.groupby(["ParticipantID", "Type"])["RTnorm"].mean()
    stds = d.groupby(["ParticipantID", "Type"])["RTnorm"].std()
    meanorigs = d.groupby(["ParticipantID", "Type"])["RT"].mean()
    d.set_index(["ParticipantID", "Type", "TestID", "TrialID"], inplace=True)
    count = 0
    # traverse individual responses
    for ix, dt in d.iterrows():
        # get mean and std for participant and task type
        mean = means[ix[:-2]]
        std = stds[ix[:-2]]
        meanorig = meanorigs[ix[:-2]]
        # judge by RTnorm, but replace both RTnorm and RT
        if dt["RTnorm"] >= (mean + 2*std):
            count += 1
            d.loc[ix, "RT"] = meanorig
            d.loc[ix, "RTnorm"] = normalize_rt(meanorig)
    d.reset_index(inplace=True)
    print("Replaced %d outliers" % count)
    return d

dwo = replace_outliers(d)
aggregatedwo = get_agg_results(dwo)

print("Replacing outliers changes the global normalized mean from %.3f to %.3f" % (d["RTnorm"].mean(), dwo["RTnorm"].mean()))
print("Replacing outliers changes the global mean from %.3f to %.3f" % (d["RT"].mean(), dwo["RT"].mean()))

Replaced 557 outliers
Replacing outliers changes the global normalized mean from 7.756 to 7.719
Replacing outliers changes the global mean from 4017.465 to 2642.310


In [10]:
# save the processed and aggregated data
dwo.to_csv(os.path.join("..", "data", "data_outliers-replaced.csv"))
aggregatedwo.to_csv(os.path.join("..", "data", "data_outliers-replaced_aggregated.csv"))
print("Successfully saved to CSV files.")

Successfully saved to CSV files.
