# Pre-processing MultiplEYE Data

This notebook walks through how to process the eye-tracking data and the psychometric test data collected within the MultiplEYE project. The goal of this notebook is twofold:

1. To provide a step-by-step guide on how to preprocess MultiplEYE data using the `pymovements` library and our custom preprocessing functions.
2. To serve as a tutorial for researchers who want to preprocess their own MultiplEYE data, or data from other eye-tracking datasets, using the `pymovements` library.

## Preparation steps
1. Download the data folder from the online repository. This is only possible if you have access to at least one protected data collection folder, which is the case if you are an active member of a data collection group. Download the entire content of the folder. When you download it from SwitchDrive, a .tar file is created automatically.
2. Add the folder to the `data/` folder in this repo. The name of the folder is the data collection name, e.g., `MultiplEYE_ZH_CH_Zurich_1_2025`.
3. Extract the .tar file in the `data/` folder.
4. Make sure that the folder structure is correct. It should look like the one online and like this (there might be more data but this is not relevant at this point):
```
	MultiplEYE_ZH_CH_Zurich_1_2025/
		documentation/
		eye-tracking-sessions/
			001_.../
			002_.../
			...
			pilot_sessions/
				001_.../
				002_.../
				...
		psychometric-tests-sessions/
		stimuli_MultiplEYE_ZH_CH_Zurich_1_2025/
		...
```
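The layout above can be checked programmatically before running the pipeline. The following is a minimal sketch and not part of the `preprocessing` package: the helper `missing_entries` and the constant `EXPECTED_SUBFOLDERS` are hypothetical names, and the sketch assumes that all four sub-folders shown in the tree are present (the psychometric tests folder may legitimately be absent for some data collections).

```python
from pathlib import Path

# Sub-folders named in the tree above; any additional folders are ignored.
EXPECTED_SUBFOLDERS = (
    "documentation",
    "eye-tracking-sessions",
    "psychometric-tests-sessions",
)


def missing_entries(collection_dir: Path) -> list[str]:
    """Return the expected sub-folders that are missing from collection_dir."""
    # The stimuli folder carries the data collection name as a suffix.
    expected = [*EXPECTED_SUBFOLDERS, f"stimuli_{collection_dir.name}"]
    return [name for name in expected if not (collection_dir / name).is_dir()]
```

Calling `missing_entries(Path("data") / "MultiplEYE_ZH_CH_Zurich_1_2025")` returns an empty list when the folder matches the tree above.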

## The config file



The pipeline uses a config file to specify parameters and settings for the preprocessing. It is located in the top-level folder of the repo and is named `multipleye_settings_preprocessing.yaml`. Please fill in the file with the appropriate settings for your data collection. The file is documented, so you can find explanations for each parameter there. Once you have done so, you can run the next cell.

Please restart the notebook kernel and re-run the cell whenever you make changes to the config file.

In [None]:
# from preprocessing.data_collection.multipleye_data_collection import prepare_language_folder
from preprocessing.data_collection.multipleye_data_collection import (
    MultipleyeDataCollection,
)
from pathlib import Path

import preprocessing

# the config will be loaded into general constants module, so we can access all settings at the same place
from preprocessing import constants
from preprocessing.scripts.prepare_language_folder import prepare_language_folder

from preprocessing.metrics.words import (
    all_tokens_from_aois,
    mark_skipped_tokens,
    repair_word_labels,
)
from preprocessing.metrics.fixations import annotate_fixations
from preprocessing.metrics.reading_measures import build_word_level_table

import polars as pl

In [None]:
# get the data collection name from the config and create the path to the data folder
this_repo = Path().resolve()
data_collection_name = constants.DATA_COLLECTION_NAME
data_folder_path = this_repo / "data" / data_collection_name

## MultiplEYE-specific preprocessing & cleaning

In order to be able to run a more generic preprocessing, the MultiplEYE data folder for one language needs to be cleaned and organized in a specific way. Running the script below will:
- unzip session folders if needed
- move session folders from core_sessions folder to the top folder
- check if there is a config file in the stimuli folder (if not, the stimulus folder was probably not uploaded correctly)
- check if there are psychometric tests (if applicable)
	- if necessary, restructure the psychometric test folder.

These steps are specific to this data collection and result from bugs or changes over the years of collecting data.

In [None]:
# run the preparation function to prepare the language folder structure
prepare_language_folder(data_collection_name)

Next, we create a `MultipleyeDataCollection` object from the data folder. This will allow us to easily access the sessions and their information in the next steps.

In [None]:
multipleye = MultipleyeDataCollection.create_from_data_folder(
    data_folder_path,
    include_pilots=constants.INCLUDE_PILOTS,
    excluded_sessions=constants.EXCLUDE_SESSIONS,
    included_sessions=constants.INCLUDE_SESSIONS,
)

## Stage 0: Converting EDF to ASC and Preparing Session-Level Information

Stage 0 refers to the initial steps of preprocessing, which involve converting raw eye-tracking data from its original format (e.g., EDF) into a more accessible format (e.g., ASC), and preparing session-level information. This stage is specific to EyeLink eye-trackers and can be omitted for other eye-trackers.

In [None]:
multipleye.convert_edf_to_asc()

Once this conversion has been completed, we can load all sessions and parse the .asc files.

In [None]:
multipleye.prepare_session_level_information()

In [None]:
# print an overview on the data collection and the sessions
multipleye

## Stage 1: Extracting Gaze Samples

In the first preprocessing stage, we extract gaze samples from the .asc files and create a gaze dataframe for each session. This dataframe contains the raw gaze data, including the x and y coordinates of the gaze and the timestamp. We also save the raw gaze data in a separate file for each session.

The next steps are performed for one session only. It is always possible to loop over all sessions and apply the same preprocessing steps to each of them, but for the sake of clarity and simplicity, we will work with one session as an example.



In [None]:
# pick only one session as an example to work with in the next steps
sessions = list(multipleye)
sess = sessions[0]
idf = sess.session_identifier

### Creating Gaze Frame from ASCII File

In [None]:
# get the path to the .asc file for the session
asc = sess.asc_path

In [None]:
gaze = preprocessing.load_gaze_data(
    asc_file=asc,
    lab_config=sess.lab_config,
    session_idf=idf,
    trial_cols=constants.TRIAL_COLS,
)

In [None]:
# save gaze and metadata
preprocessing.save_raw_data(constants.OUTPUT_DIR, idf, gaze)
preprocessing.save_session_metadata(constants.OUTPUT_DIR, idf, gaze)

To make the metadata extracted by pymovements available when creating our session overview, we copy this information from pymovements into our session object.

In [None]:
# saving the ._metadata like this is just a temporary solution, this will be changed soon
sess.pm_gaze_metadata = gaze._metadata
sess.calibrations = gaze.calibrations
sess.validations = gaze.validations

### Coordinate and Velocity Preprocessing

Eye movements are recorded in screen pixel coordinates, which depend on stimulus size and monitor setup. To compare gaze behavior across participants, screens, or datasets, it is standard to convert pixel positions 
into **degrees of visual angle (dva)**. Next, we compute **gaze velocity**, which allows us to detect saccades and distinguish them from fixations.
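To make the two conversions concrete, here is a minimal sketch using only the standard library. It is an illustration, not the code behind `preprocessing.preprocess_gaze`: the function names, the screen geometry parameters, and the simple central-difference velocity are all assumptions, and the pipeline may use a different differentiation scheme.

```python
import math


def pix2deg(pixel, screen_px, screen_cm, distance_cm):
    """Convert a pixel coordinate on one axis to degrees of visual angle.

    The result is measured relative to the screen centre.
    """
    cm_per_px = screen_cm / screen_px
    # Offset of the pixel from the screen centre, in centimetres.
    offset_cm = (pixel - (screen_px - 1) / 2) * cm_per_px
    return math.degrees(math.atan2(offset_cm, distance_cm))


def velocity(positions_deg, sampling_rate_hz):
    """Central-difference velocity in deg/s; the two end points stay None."""
    v = [None] * len(positions_deg)
    for i in range(1, len(positions_deg) - 1):
        v[i] = (positions_deg[i + 1] - positions_deg[i - 1]) * sampling_rate_hz / 2
    return v
```

For example, with a viewing distance of 60 cm, a gaze point 60 cm to the right of the screen centre lies at 45 dva.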

In [None]:
# inspect the gaze samples
gaze.samples.head()

In [None]:
preprocessing.preprocess_gaze(gaze)

In [None]:
# inspect the preprocessed gaze samples, the dataframe should now also contain a position in dva and velocity columns
gaze.samples.head()

## Stage 2a: Detect Events and Compute Their Properties

Eye-tracking data are typically segmented into events, i.e. `fixations` and `saccades`. Fixations represent moments when the eyes remain relatively still, allowing visual information to be processed, while saccades are the rapid movements between fixations that reposition the gaze. Detecting these events and computing their properties, such as `dispersion`, fixation `duration`, saccade `amplitude`, and `peak velocity`, provides the foundation for analyzing visual behavior and understanding how participants explore a stimulus.

### Fixations

We can detect fixations by applying the `I-VT` or the `I-DT` method.

The **I-VT (Velocity-Threshold Identification)** method distinguishes fixation and saccade points based on their point-to-point velocities. Each point is classified as a fixation point if its velocity is below the specified threshold. Consecutive fixation points are then merged into a single fixation. A threshold of 20 degrees/second is commonly used as a default maximum value. Read more about [the I-VT algorithm in the documentation](https://pymovements.readthedocs.io/en/stable/reference/api/pymovements.events.detection.ivt.html).
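The I-VT idea described above can be sketched in a few lines. This is a simplified illustration and not the pymovements implementation: the function name `ivt_sketch` and its arguments are hypothetical, `velocities` is assumed to hold one speed value (in degrees/second) per sample, and the 100 ms minimum duration is the default listed further down in this section.

```python
def ivt_sketch(velocities, timestamps_ms, velocity_threshold=20.0, minimum_duration_ms=100):
    """Return (onset, offset) timestamp pairs of fixations found by a basic I-VT pass."""
    fixations = []
    start = None  # index of the first sample of the currently open fixation
    for i, v in enumerate(velocities):
        below = v < velocity_threshold
        if below and start is None:
            start = i  # a new run of slow samples begins
        elif not below and start is not None:
            # the run ended at the previous sample
            fixations.append((timestamps_ms[start], timestamps_ms[i - 1]))
            start = None
    if start is not None:
        # the recording ended while a run was still open
        fixations.append((timestamps_ms[start], timestamps_ms[-1]))
    # discard runs that are shorter than the minimum duration
    return [(on, off) for on, off in fixations if off - on >= minimum_duration_ms]
```

Short runs of slow samples, for instance between two saccades in quick succession, are dropped by the final duration filter.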

The **I-DT (Dispersion-Threshold Identification)** method finds fixations by grouping consecutive points within a maximum separation (dispersion) threshold and a minimum duration threshold. The algorithm slides a moving window across the data: if the dispersion within the window is below the threshold, the window represents a fixation and is gradually expanded until the dispersion exceeds the threshold.
Read more about [our implementation of the IDT method](https://pymovements.readthedocs.io/en/stable/reference/api/pymovements.events.detection.idt.html).
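The sliding-window logic of I-DT can be sketched as follows. This is a simplified illustration and not the pymovements implementation: the names `dispersion` and `idt_sketch` are hypothetical, dispersion is assumed to be the sum of the horizontal and vertical ranges, and the default thresholds (1.0 degrees, 100 ms) are assumptions chosen for the example.

```python
def dispersion(xs, ys):
    """Sum of the horizontal and vertical extent of a set of points."""
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def idt_sketch(xs, ys, timestamps_ms, dispersion_threshold=1.0, minimum_duration_ms=100):
    """Return (onset, offset) timestamp pairs of fixations found by a basic I-DT pass."""
    fixations = []
    n = len(xs)
    start = 0
    while start < n:
        # grow the initial window until it spans the minimum duration
        end = start
        while end < n and timestamps_ms[end] - timestamps_ms[start] < minimum_duration_ms:
            end += 1
        if end >= n:
            break  # not enough samples left for another fixation
        if dispersion(xs[start:end + 1], ys[start:end + 1]) <= dispersion_threshold:
            # expand the window for as long as the dispersion stays within the threshold
            while end + 1 < n and dispersion(xs[start:end + 2], ys[start:end + 2]) <= dispersion_threshold:
                end += 1
            fixations.append((timestamps_ms[start], timestamps_ms[end]))
            start = end + 1
        else:
            start += 1  # slide the window by one sample
    return fixations
```

The sketch recomputes the dispersion for every window and is therefore quadratic in the worst case, which is acceptable for illustration but not for long recordings.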

We use the `I-VT` algorithm with the following key default parameters:
- `minimum duration`: 100 ms
- `velocity threshold`: 20.0 degrees/second

Properties such as `location`, which contains the centroid coordinates of each fixation, and `dispersion` are also calculated.

In [None]:
preprocessing.detect_fixations(
    gaze,
)

### Saccades

Saccades are rapid eye movements that shift the point of fixation from one location to another. We detect saccades (or micro-saccades) from the velocity sequence of gaze data using the [microsaccades algorithm](https://pymovements.readthedocs.io/en/stable/reference/api/pymovements.events.detection.microsaccades.html#pymovements.events.detection.microsaccades). This algorithm implements a noise-adaptive velocity threshold, meaning that the detection threshold automatically scales with the noise level of the velocity signal. Properties such as `amplitude` and `peak velocity` of the detected saccades are also calculated.

The key default parameters are:
- `threshold_factor`: Multiplier used to determine the velocity threshold relative to the noise level of the signal. The default value is 6. A higher factor makes the algorithm more conservative (detects fewer saccades), while a lower factor makes it more sensitive.
- `minimum_duration`: Defines how long a velocity peak must persist to be classified as a saccade. The duration is expressed in the same units as timesteps. If no timesteps are provided, the value refers to the number of samples (default = 6), which corresponds to about 12 ms at a 500 Hz sampling rate. Shorter events are ignored as noise. 
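The interplay of the two parameters above can be sketched as follows. This is an illustration under stated assumptions and not the pymovements implementation: the function names are hypothetical, the noise level is estimated with a median-based spread (one common choice for this kind of algorithm), and a sample counts as part of a saccade when its velocity lies outside the ellipse spanned by the horizontal and vertical thresholds.

```python
import math
from statistics import median


def noise_adaptive_threshold(component_velocities, threshold_factor=6.0):
    """Threshold = factor times a median-based spread of one velocity component."""
    centre = median(component_velocities)
    spread = math.sqrt(median((v - centre) ** 2 for v in component_velocities))
    return threshold_factor * spread


def sketch_detect_saccades(vx, vy, threshold_factor=6.0, minimum_duration=6):
    """Return (first_index, last_index) pairs of runs that exceed the threshold."""
    tx = noise_adaptive_threshold(vx, threshold_factor)
    ty = noise_adaptive_threshold(vy, threshold_factor)
    if tx == 0 or ty == 0:
        raise ValueError("velocity signal has no measurable noise")
    # elliptic criterion: a sample is a candidate if it lies outside the ellipse
    above = [(x / tx) ** 2 + (y / ty) ** 2 > 1 for x, y in zip(vx, vy)]
    events, start = [], None
    for i, flag in enumerate(above + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= minimum_duration:  # duration counted in samples
                events.append((start, i - 1))
            start = None
    return events
```

Because the threshold is derived from the signal itself, a noisier recording automatically gets a higher threshold, and runs shorter than `minimum_duration` samples are ignored.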

In [None]:
preprocessing.detect_saccades(
    gaze,
)

Next, we save the detected fixations and saccades.

In [None]:
preprocessing.save_events_data(
    constants.FIXATION,
    constants.OUTPUT_DIR,
    idf,
    split_column="trial",
    name_columns=["trial", "stimulus"],
    file_columns=["onset", "duration", "location_x", "location_y", "page"],
    data=gaze,
)

preprocessing.save_events_data(
    constants.SACCADE,
    constants.OUTPUT_DIR,
    idf,
    split_column="trial",
    name_columns=["trial", "stimulus"],
    file_columns=[
        "onset",
        "duration",
        "amplitude",
        "peak_velocity",
        "dispersion",
        "page",
    ],
    data=gaze,
)

## Stage 2b: Map Fixations to AOIs

Once we have the fixations, we can map each of them to the AOIs of the stimulus. The resulting scanpath can then be saved. Note that this feature is not yet completely finished.
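Conceptually, the mapping is a point-in-rectangle lookup. The sketch below illustrates this under simplifying assumptions and is not the code behind `preprocessing.map_fixations_to_aois`: the `Aoi` tuple, its field names, and the function `aoi_for_fixation` are hypothetical, and AOIs are assumed to be axis-aligned, non-overlapping rectangles in pixel coordinates.

```python
from typing import NamedTuple, Optional


class Aoi(NamedTuple):
    label: str
    x: float       # left edge in pixels
    y: float       # top edge in pixels
    width: float
    height: float


def aoi_for_fixation(fx: float, fy: float, aois) -> Optional[str]:
    """Return the label of the AOI that contains the fixation, or None."""
    for aoi in aois:
        if aoi.x <= fx < aoi.x + aoi.width and aoi.y <= fy < aoi.y + aoi.height:
            return aoi.label
    return None  # the fixation landed outside all AOIs
```

Applying `aoi_for_fixation` to the fixations in temporal order yields the sequence of fixated AOIs, i.e. a scanpath.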

In [None]:
preprocessing.map_fixations_to_aois(gaze, sess.stimuli)

In [None]:
# The resulting mapping can be stored as a scanpath, which is a sequence of AOIs that were fixated in the order they were fixated.
preprocessing.save_scanpaths(constants.OUTPUT_DIR, idf, gaze)

In [None]:
# save metadata again
preprocessing.save_session_metadata(constants.OUTPUT_DIR, idf, gaze)

## Stage 3: Calculate AOI-based Measures

In this last step, we calculate the AOI-based measures. These are also referred to as reading measures, as they are typically used in reading research. They include measures such as first pass fixation duration (FPF), total fixation count (TFC), regression path duration (RPD), and many more. These measures are calculated based on the fixations that were mapped to the AOIs in the previous step.
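As a simple illustration of how such measures follow from AOI-mapped fixations, the sketch below computes a total fixation count and a total fixation time per word. It is not the code behind `build_word_level_table`: the function name and the input format, a temporally ordered list of `(word_idx, duration_ms)` pairs, are assumptions made for this example.

```python
from collections import defaultdict


def total_fixation_measures(fixations):
    """Return per-word fixation counts and summed durations.

    fixations: iterable of (word_idx, duration_ms) pairs in temporal order.
    """
    counts = defaultdict(int)
    times = defaultdict(int)
    for word_idx, duration_ms in fixations:
        counts[word_idx] += 1          # every fixation on the word counts
        times[word_idx] += duration_ms  # durations are summed across all passes
    return dict(counts), dict(times)
```

Words that never appear in the input received no fixation at all, which is how skipped words can be identified once the full token list is known.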

In [None]:
# we pick just one stimulus of our session as an example
stimulus = 4
trial_label = sess.stimuli[stimulus].trial_id
aois = sess.stimuli[stimulus].text_stimulus.aois

In [None]:
# Relabel the white space between words in the AOIs. The AOI files currently map
# white space to the preceding word, but it should be mapped to the following word.
aois_clean = repair_word_labels(aois)

# collect all words from AOIs for the given trial
all_tokens = all_tokens_from_aois(aois_clean, trial=trial_label)

### Fixation-based Metrics

As an intermediate step, the fixations are annotated. These annotations include:
- The run ID. This ID specifies continuous sequences of fixations on the same word. It is used to calculate first pass and second pass measures.
- Whether the fixation is within the first pass or not
- The index of the preceding word and the following word
- If the saccade entering or leaving the fixation is a regression or not
- Whether it is the first fixation on the word or not

This information is necessary to calculate the reading measures in the next step.
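Some of these annotations can be sketched from the sequence of fixated word indices alone. The example below is an illustration and not the code behind `annotate_fixations`: the function and column names are hypothetical, and a fixation is treated as first pass when it belongs to the first run on its word. Stricter definitions also require that no later word was fixated beforehand, so the pipeline's own definition may differ.

```python
def annotate_sketch(word_indices):
    """Annotate a temporally ordered list of fixated word indices."""
    rows = []
    seen_words = set()
    run_id = -1
    prev = None
    first_pass_run = False
    for idx in word_indices:
        if idx != prev:
            # the gaze moved to a different word, so a new run starts
            run_id += 1
            first_pass_run = idx not in seen_words
            seen_words.add(idx)
        rows.append(
            {
                "word_idx": idx,
                "run_id": run_id,
                "is_first_pass": first_pass_run,
                # the incoming saccade is a regression if it moves to an earlier word
                "incoming_regression": prev is not None and idx < prev,
            }
        )
        prev = idx
    return rows
```

For the sequence `[0, 0, 1, 3, 2, 3]`, the last fixation on word 3 starts a new run but is no longer first pass, and the fixation on word 2 is reached by a regression.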

In [None]:
# create a fixation table
fixation_table = annotate_fixations(gaze.events.frame)
fixation_table.head()

In [None]:
# annotate skipped words based on the fixation table and all tokens
words_with_skip = mark_skipped_tokens(all_tokens, fixation_table)

In [None]:
# calculate word-level reading measures
word_level_table = build_word_level_table(
    words=words_with_skip,
    fix=fixation_table,
)

In [None]:
with pl.Config(tbl_rows=50):
    print(
        word_level_table.filter(pl.col("page") == "page_1").select(
            [
                "word_idx",
                "word",
                "skipped",
                "FPF",
                "TFC",
                "SL_in",
                "RPD_inc",
                "RBRT",
                "TFT",
            ]
        )
    )

## Final Steps

Finally, we create the session and dataset overviews and store them. In addition, the participant data can be parsed and stored.

For the MultiplEYE data, there is also the option to create a sanity check report.

In [None]:
multipleye.create_sanity_check_report(
    gaze,
    sess.session_identifier,
    output_dir=constants.OUTPUT_DIR,
    plotting=True,
    overwrite=True,
)

In [None]:
multipleye.create_session_overview(sess.session_identifier, path=constants.OUTPUT_DIR)
multipleye.create_dataset_overview(path=constants.OUTPUT_DIR)
multipleye.parse_participant_data(constants.OUTPUT_DIR / "participant_data.csv")

In [None]:
from preprocessing.psychometric_tests.preprocess_psychometric_tests import (
    preprocess_all_sessions,
)

preprocess_all_sessions()