# pEYEpline: Preprocessing of the MultiplEYE data

Files that go to the MultiplEyeStore repository:


- **Raw recordings:** Raw files without any formatting applied (in the original encoding). To be decided whether these data should really be included (con: storage; pro: stage 0 is good to have, because some messages in the original files will be lost in the first step of preprocessing). **Filename** must contain: participantId, stimulusId, trialId

- **Raw data:** one csv file per stimulus text and per reader containing the following columns: screenId, (x,y) screen location in pixels, gaze event (fixation, saccade, NaN, blink), optionally pupil size. **Filename** must contain: participantId, stimulusId, trialId

- **Fixation data:** one csv file per stimulus text and per reader containing one fixation per line with the following columns: screenId, onset-time, offset-time, (x,y) screen location in pixels (mean and std), duration, etc. **Filename** must contain: participantId, stimulusId, trialId

- **Saccade data:** one csv file per stimulus text and per reader containing one saccade per line with the following columns: screenId, onset-time, offset-time, start (x,y) screen location in pixels, end (x,y) screen location in pixels, duration, amplitude in deg of visual angle, amplitude in chars, mean velocity, peak velocity. **Filename** must contain: participantId, stimulusId, trialId

- **Interest area files:** word- and char-based interest area files (can be merged with the data when loading it); they contain line information. **Filename** must contain: participantId, stimulusId, trialId

- **Reading measures files:** one csv file per stimulus text and per reader containing reading measures, aois, screenIds. **Filename** must contain: participantId, stimulusId, trialId

- **Data quality reports:**
  - **Trial-level data quality reports:** json; one file per stimulus text per participant. **Filename** must contain: participantId, stimulusId, trialId
  - **Session-level data quality reports:** json; one file per session (participant). Contains data quality measures aggregated over all data from one session. **Filename** must contain: participantId
  - **Dataset-level data quality reports:** json; one file per dataset

- **Response accuracies and text difficulty and familiarity ratings:** one csv or json file for each stimulus containing itemId, recorded response (pressed key), response type {target, distractor a, distractor b, distractor c}, response accuracies and latencies for all questions (=items), and the text difficulty and familiarity rating for this text. **Filename** must contain: participantId, stimulusId
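
The filename requirements above can be sketched with a small helper. The naming scheme (`key-value` pairs joined by underscores, with the stage as the last part) and the function names `build_filename` and `parse_filename` are assumptions for illustration, not a convention decided for MultiplEYE.

```python
from pathlib import Path

# Hypothetical naming scheme: "participant-<id>_stimulus-<id>_trial-<id>_<stage>.<ext>"
ID_KEYS = ("participant", "stimulus", "trial")


def build_filename(stage, participant_id, stimulus_id=None, trial_id=None, ext="csv"):
    """Encode the identifiers in the filename; identifiers that are None are omitted."""
    ids = dict(zip(ID_KEYS, (participant_id, stimulus_id, trial_id)))
    parts = [f"{key}-{value}" for key, value in ids.items() if value is not None]
    return "_".join(parts + [stage]) + f".{ext}"


def parse_filename(filename):
    """Recover the identifiers and the stage from a filename built by build_filename."""
    *id_parts, stage = Path(filename).stem.split("_")
    info = dict(part.split("-", 1) for part in id_parts)
    info["stage"] = stage
    return info


# Trial-level file: all three identifiers are in the name.
fixation_file = build_filename("fixations", "007", "3", "5")
# Session-level report: only the participant identifier is in the name.
session_report = build_filename("quality", "007", ext="json")
```

This sketch assumes that neither the identifiers nor the stage name contain underscores; a different separator would be needed otherwise.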



In [None]:
# TODOs:
# generate data quality report for each session asap after the session

## Stimulus texts preprocessing 

In [None]:
# set set-up-specific variable values; default values are set for DiLi lab, ZH
eyetracker = "eyelink"
# TODO add all relevant set-up specifications

*Terminology*
- output = {preprocessing, filter-criteria, repository}
- raw_recordings: eyetracking recording files generated by the eyetracker converted to *human-readable pure text format* (ascii)
- raw_files: eyetracking raw recordings (csv format) in device-unspecific format containing timestamps and screen coordinates in pixels and, optionally, pupil size and unit
- quality_report_raw_asc: file containing data quality measures extracted from the raw_recordings asc files (csv); eye-tracker-unspecific format, eyetracker-specific contents (missing values for some devices); one file per participant

## Eye movements preprocessing

Compute different representations of eye movements (raw samples, gaze event data, reading measures) and add relevant information (aois) and the various identifiers (textId, screenId, trialId).

**Generate one file per participant per text for all stages of preprocessing (raw, events, reading measures).**

**Identifiers:**
Encode textId, participantId, and trialId only in the filename;
Add screenId to the data

**Interest areas**
For the raw and the event data, aois (char-based and word-based) will be stored as separate files that need to be merged with the eye movement data via the gaze/aoi screen coordinates when loading the data.

**Note:** implement this when adding multipleye to the pymovements library.
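
Merging via screen coordinates could look roughly like the following sketch. It assumes that each aoi is stored as an axis-aligned rectangle in screen pixels; the field names (`screen_id`, `x_min`, `x_max`, `y_min`, `y_max`, `aoi_id`) and the function `assign_aoi` are hypothetical and do not reflect the actual aoi file format or the pymovements API.

```python
import math


def assign_aoi(x, y, screen_id, aois):
    """Return the id of the aoi on the given screen that contains (x, y), else None."""
    if math.isnan(x) or math.isnan(y):
        return None  # lost sample: no aoi can be assigned
    for aoi in aois:
        if (
            aoi["screen_id"] == screen_id
            and aoi["x_min"] <= x < aoi["x_max"]
            and aoi["y_min"] <= y < aoi["y_max"]
        ):
            return aoi["aoi_id"]
    return None


# Two hypothetical word-based aois on screen 1.
word_aois = [
    {"screen_id": 1, "aoi_id": "w0", "x_min": 100, "x_max": 160, "y_min": 50, "y_max": 80},
    {"screen_id": 1, "aoi_id": "w1", "x_min": 170, "x_max": 240, "y_min": 50, "y_max": 80},
]

hit = assign_aoi(175.0, 60.0, 1, word_aois)       # inside "w1"
miss = assign_aoi(500.0, 400.0, 1, word_aois)     # outside every aoi
```

A linear scan is enough for illustration; with many samples per screen, a vectorised or indexed lookup would be preferable.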

### Processing of eye-tracker-specific raw recording files (e.g., edf files)

#### Eyetracker-specific recording files to human-readable format
- eyetracker-specific step
- input is eyetracker specific, output is still eyetracker-specific
- only applicable if original eyetracking recording files are not human readable
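
For EyeLink, this step could be sketched as a thin wrapper around the `edf2asc` command-line tool. The sketch assumes that `edf2asc` is installed and on the `PATH`, and that it writes the `.asc` file next to the input file when called with the `.edf` path as its only argument. The helper names `asc_path_for` and `convert_edf` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def asc_path_for(edf_path):
    """Expected location of the converted file (assumption: same folder, .asc suffix)."""
    return Path(edf_path).with_suffix(".asc")


def convert_edf(edf_path):
    """Convert one edf file to asc and return the path of the asc file."""
    if shutil.which("edf2asc") is None:
        raise FileNotFoundError("edf2asc was not found on the PATH")
    # No flags are passed here; any options should be checked against the tool's help.
    subprocess.run(["edf2asc", str(edf_path)])
    asc_path = asc_path_for(edf_path)
    # Check for the output file instead of relying on the tool's exit code.
    if not asc_path.exists():
        raise RuntimeError(f"conversion of {edf_path} did not produce {asc_path}")
    return asc_path
```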


In [None]:
if eyetracker == "eyelink":
    pass
    # load edf files
    # apply edf2asc
    # input: all edf data files
    # output:
    # name: raw_recordings
    # format: asc (eyetracker specific contents)
    # goes to: preprocessing
elif eyetracker == "tobii":
    pass
    # load raw_files

### Processing of eyetracker-specific human-readable raw recording files 

- Eyetracker-specific input
- Output should be eyetracker-unspecific in format, but will contain missing values in some places depending on the device
- Extract information that is written directly as meta-information into the recording file (calibration scores etc.)
- Only stimulus-independent metrics
- No metrics that need to be calculated from the data samples

#### Blink extraction 

- only applicable if eyetracker provides blinks
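
For EyeLink asc files, blink extraction might be sketched as follows. The sketch assumes that blink ends are logged as whitespace-separated lines of the form `EBLINK <eye> <start> <end> <duration>`; the function names and the csv column names are hypothetical.

```python
import csv
import io

BLINK_COLUMNS = ["eye", "onset", "offset", "duration"]


def extract_blinks(asc_lines):
    """Collect one record per EBLINK line; all other lines are ignored."""
    blinks = []
    for line in asc_lines:
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "EBLINK":
            onset, offset, duration = (float(value) for value in fields[2:5])
            blinks.append(
                {"eye": fields[1], "onset": onset, "offset": offset, "duration": duration}
            )
    return blinks


def write_blinks_csv(blinks, handle):
    """Write the blink records to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=BLINK_COLUMNS)
    writer.writeheader()
    writer.writerows(blinks)


# Made-up example lines, not taken from a real recording.
example_lines = [
    "SBLINK R 2042315",
    "EBLINK R 2042315 2042400 86",
    "EFIX R 2042410 2042600 191 512.3 384.1 1023",
]
example_blinks = extract_blinks(example_lines)
buffer = io.StringIO()
write_blinks_csv(example_blinks, buffer)
```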

In [None]:
# Blink detection:
# If provided by the eyetracker: extract blinks (timestamp and duration) and write to csv

# Input: raw_recordings
# Output:
# name: blinks_eyetracker
# format: csv
# goes to: preprocessing (data quality reports, gaze events)

#### Extract data quality information from eyetracker-specific recording files

In [None]:
# Generate data quality report that contains information that the eyetracker writes as meta-data into the recording
# TODO decide about the exact metrics
# Preliminary list of values/information to extract
# - Is the session complete (all stimuli completed and all questions answered)? If not: how much is missing? (e.g., provide the proportion completed in terms of screens and in terms of texts)
# - Have all comprehension questions been answered? (if this information is available in raw recordings)
# - Was calibration and validation performed at the beginning of the experiment? (scores?)
# - Was validation performed before each text? Extract validation scores before each text (if validation was followed by a calibration and then a second validation, use the scores from the second validation)
# - Was a validation check performed at the end of the recording? Extract validation scores.
# - calibration scores, validation scores; when/how many calibrations were performed?
# - when/to what extent has drift correction been performed? (timestamp, before which trial/item id? Was the drift corrected or only checked?)
# - What (if any) filter was applied for data recording?
# - if blinks have been extracted (see above), compute proportion/frequency of blinks, and some measure reflecting their mean and std duration (or median)

# Input: raw_recordings
# Output:
# name: quality_report_raw_asc
# format: csv or json
# goes to: preprocessing: session-level data quality reports

#### Parsing of eyetracker-specific raw recording files to consistent eyetracker-unspecific csv files containing the raw samples

- Eyetracker-specific input/code, output eyetracker-unspecific
- Apply inclusion criteria: a given participant needs to have completed reading at least one entire text (practice texts do not count) and have answered the comprehension questions for that text
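
The inclusion criterion could be checked roughly as in the sketch below. The per-text record with the fields `is_practice`, `screens_total`, `screens_read`, `questions_total` and `questions_answered` is a hypothetical structure; how completion is actually determined from the recordings is still open.

```python
def is_included(texts):
    """True if at least one non-practice text was read completely and all its questions answered."""
    return any(
        not text["is_practice"]
        and text["screens_read"] == text["screens_total"]
        and text["questions_answered"] == text["questions_total"]
        for text in texts
    )


# Participant who only finished the practice text: excluded.
only_practice = [
    {"is_practice": True, "screens_total": 2, "screens_read": 2,
     "questions_total": 2, "questions_answered": 2},
    {"is_practice": False, "screens_total": 5, "screens_read": 3,
     "questions_total": 6, "questions_answered": 0},
]
# Participant who finished one experimental text: included.
one_text_done = only_practice + [
    {"is_practice": False, "screens_total": 4, "screens_read": 4,
     "questions_total": 6, "questions_answered": 6},
]
```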

In [None]:
# Process raw recording files to csv

# Generate one file for each participant with the following columns: trialID, stimulusId, screenId, timestamp in ms, x-gaze coordinate in screen pixels, y-gaze coordinate in screen pixels, optional: pupil size, pupil size measurement unit (diameter, area...)
# Ensure that the same coordinate system is used across devices/datasets
# make sure to split data by stimulusId (=text) and screenid
# Apply inclusion criterion: remove participants who have not completed at least one entire text plus the corresponding comprehension questions
# merge multiple eyetracking files from one participant (only applicable if experiment was aborted and re-started)
# handle any other inconsistencies (wrong participant IDs (caution: the id is in many files), aborted trials, missing data, ...)

# Arguments: eyetracking device (format of raw_recordings)
# Input: raw_recordings
# Output:
# name: raw_files
# format: csv (consistent columns across devices; trialId, stimulusId, screenId, x,y-screen px coords, optional column for pupil size); one file for each participant
# goes to: preprocessing

### Processing of raw samples (csv format)

#### Generate data quality measures from raw samples
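
A minimal sketch of two such measures, recording duration and proportion of data loss, is given below. It assumes that the raw samples of one text or screen are available as `(timestamp_ms, x_px, y_px)` tuples and that lost samples are marked by NaN coordinates; both the input layout and the function name `quality_from_samples` are assumptions.

```python
import math


def quality_from_samples(samples):
    """samples: list of (timestamp_ms, x_px, y_px); lost samples have NaN coordinates."""
    if not samples:
        return {"n_samples": 0, "duration_ms": 0, "data_loss": None}
    n_lost = sum(1 for _, x, y in samples if math.isnan(x) or math.isnan(y))
    return {
        "n_samples": len(samples),
        # Duration as the difference between the last and the first timestamp.
        "duration_ms": samples[-1][0] - samples[0][0],
        # Proportion of samples without valid gaze coordinates.
        "data_loss": n_lost / len(samples),
    }


nan = float("nan")
example_report = quality_from_samples(
    [(0, 512.0, 384.0), (1, 513.0, 384.5), (2, nan, nan), (3, 515.0, 385.0)]
)
```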

In [None]:
# Compute the following measures (preliminary list) from the raw eyetracking data:

# duration of the recording (for each text or for each screen and total duration);
# proportion of data loss

# Input: raw_files
# Output:
# name: quality_report_raw
# format: csv or json
# goes to: repository, filter_criteria (possibly after merging with quality measures computed at the other stages of preprocessing)

#### Gaze event detection and evaluation
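
As an illustration only, the sketch below labels samples with a simple velocity threshold and then aggregates runs of fixation samples into fixation features. The detection algorithm for MultiplEYE has not been specified here, so this is a stand-in technique rather than the chosen method. The threshold value, the `(timestamp_ms, x_px, y_px)` input layout, strictly increasing timestamps, and all function names are assumptions.

```python
import math
from itertools import groupby
from statistics import mean, pstdev


def label_samples(samples, threshold_px_per_ms=1.0):
    """Label each sample as 'fixation', 'saccade' or 'missing' by sample-to-sample velocity."""
    labels = []
    previous = None
    for t, x, y in samples:
        if math.isnan(x) or math.isnan(y):
            labels.append("missing")
            previous = None  # do not compute a velocity across a gap
            continue
        if previous is None:
            labels.append("fixation")  # no velocity available for the first valid sample
        else:
            pt, px, py = previous
            velocity = math.hypot(x - px, y - py) / (t - pt)
            labels.append("saccade" if velocity > threshold_px_per_ms else "fixation")
        previous = (t, x, y)
    return labels


def fixation_features(samples, labels):
    """Aggregate each run of consecutive fixation samples into one fixation record."""
    fixations = []
    for label, run in groupby(zip(samples, labels), key=lambda pair: pair[1]):
        if label != "fixation":
            continue
        points = [sample for sample, _ in run]
        xs = [p[1] for p in points]
        ys = [p[2] for p in points]
        fixations.append({
            "onset": points[0][0],
            "offset": points[-1][0],
            "duration": points[-1][0] - points[0][0],
            "x_mean": mean(xs), "y_mean": mean(ys),
            "x_std": pstdev(xs), "y_std": pstdev(ys),
        })
    return fixations


example_samples = [(0, 100.0, 100.0), (1, 100.5, 100.0), (2, 101.0, 100.0),
                   (3, 200.0, 100.0), (4, 200.0, 100.5)]
example_labels = label_samples(example_samples)
example_fixations = fixation_features(example_samples, example_labels)
```

Blink and artifact detection are left out of this sketch; a real implementation would also convert velocities to degrees of visual angle per second and enforce minimum event durations.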

In [None]:
# Gaze event detection
# Compute gaze events and add them as additional column to the raw samples (saccade, fixation, blink, artifact/corrupt measurement)
# Apply artifact detection, blink detection, saccade/fixation detection

# Input: raw_files
# Output:
# name: gaze_event_files
# format: csv
# goes to: preprocessing, repository

In [None]:
# Compute fixation files
# From raw samples classified as fixations, compute fixation features:

# Preliminary list of fixation features:
# start timestamp
# end timestamp
# duration
# standard deviation
# location (mean)

# Input: gaze_event_files
# Output:
# name: fixation_files
# format: csv files (one fixation file and one saccade file per participant and text)
# goes to: preprocessing for adding aoi infos

In [None]:
# Compute saccade files


### Processing of raw samples (csv format) and aoi files

Input is raw samples and aoi files (char-based or word-based)

#### Add aoi information to raw samples
- To be decided: Shall we share these files on the repository?
    - Pro: maybe useful for some users who want to work on the raw data plus aoi info
    - Con: they take a lot of space; these data can be easily generated by the user of MultiplEYE
- (Potential) use cases of these data: plotting of raw data and aoi
- Generation of Data quality report (next step)

In [None]:
# Merge aoi files with raw data (aoi as additional columns and NaNs)
# Input: char-based or word-based aoi's
# Output:
# name: raw_files_aoi
# format: csv
# goes to:

#### Generate stimulus-dependent quality measures from raw samples

### Add aois to fixation files
Input: fixation files, word-based and char-based aoi files

## Write trial-level data quality reports

## Write session-level data quality reports
Combine all session-level data quality reports that have been generated at the different steps of the pipeline into a single report (one file per session (=reader)). 
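
Combining the reports could be as simple as nesting each stage's report under its own key, as in the sketch below. The stage keys reuse the report names from this notebook, while the measure names inside the reports and the function `combine_session_reports` are hypothetical.

```python
import json


def combine_session_reports(participant_id, reports_by_stage):
    """Nest the report of each preprocessing stage under its stage name."""
    combined = {"participant_id": participant_id}
    for stage, report in reports_by_stage.items():
        combined[stage] = dict(report)  # copy, so the inputs are not modified later
    return combined


session_report = combine_session_reports(
    "007",
    {
        "quality_report_raw_asc": {"n_calibrations": 2},  # made-up measure names
        "quality_report_raw": {"data_loss": 0.04},
    },
)
# Serialise to json, the format planned for session-level reports.
session_report_json = json.dumps(session_report, indent=2)
```

Keeping the stages separate avoids name clashes between measures that are computed at different steps of the pipeline.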

In [None]:
# For all readers combine quality reports from all texts into a single session (=reader)-level quality report

# Inputs:
# quality_report_raw_asc, quality_report_raw, TODO
# Output:
# name: session_level_quality_reports (one file per reader)
# format: json
# goes to: repository, filter-criteria, preprocessing (dataset-level data quality reports)

## Generate dataset-level data quality reports

### Compute dataset-level data quality information from session-level data quality reports
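
One possible aggregation is sketched below: for each measure, summary statistics are computed over all sessions that report a value. The flat report layout, the measure name `data_loss` and the function `aggregate_sessions` are assumptions; which statistics are actually reported at the dataset level is still to be decided.

```python
from statistics import mean, pstdev


def aggregate_sessions(session_reports, measures):
    """Summarise each measure over all sessions that provide a value for it."""
    summary = {}
    for measure in measures:
        values = [r[measure] for r in session_reports if r.get(measure) is not None]
        if values:
            summary[measure] = {
                "n_sessions": len(values),
                "mean": mean(values),
                "std": pstdev(values),  # population standard deviation
                "min": min(values),
                "max": max(values),
            }
    return summary


dataset_summary = aggregate_sessions(
    [{"data_loss": 0.1}, {"data_loss": 0.3}, {"data_loss": None}],
    measures=["data_loss"],
)
```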

In [None]:
# Aggregate the session-level data quality measures to the dataset level
# Input: session_level_quality_reports (one file per reader)
# Output:

### Get dataset-level data quality information from meta-data documentation

In [None]:
# Read meta-data documentation, deviation form etc. TODO Which files


### Write dataset-level data quality report

## Comprehension questions and difficulty/familiarity rating response processing
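
Scoring a single response could be sketched as follows. The sketch assumes that, for each item, a mapping from the pressed key to the response type shown under that key is available (for example because answer options are shuffled); this mapping, the key names and the function `score_response` are hypothetical.

```python
def score_response(pressed_key, key_to_option):
    """Map the pressed key to its response type and derive the accuracy (1 or 0)."""
    response_type = key_to_option.get(pressed_key)  # None if an unexpected key was pressed
    return {
        "pressed_key": pressed_key,
        "response_type": response_type,
        "accuracy": int(response_type == "target"),
    }


# Made-up key assignment for one item.
item_keys = {
    "left": "distractor a",
    "up": "target",
    "right": "distractor b",
    "down": "distractor c",
}
correct = score_response("up", item_keys)
wrong = score_response("left", item_keys)
```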

In [None]:
# From the participant's response (pressed key) and target answer, compute response accuracy

In [None]:
# For each participant and each text, write a file with response behavior and text difficulty and familiarity ratings:
# one csv or json file for each stimulus containing itemId, recorded response (pressed key), response type {target, distractor a, distractor b, distractor c}, response accuracies and latencies for all questions (=items), and the text difficulty and familiarity rating for this text. Filename must contain: participantId, stimulusId