# Preprocessing eye-tracking data

#### ⚠️ Before you start, download the two dataset definition files `gctg.yaml` and `gctg-clean.yaml` from Moodle and put them in the root directory of this repository. ⚠️

> Please don't upload/push/publish these files elsewhere, as the dataset is confidential.

---

This notebook will cover:

1. [Levels of preprocessing](#1-levels-of-preprocessing)
2. [Inspecting and visualizing data](#2-inspecting-and-visualizing-data)
3. [Cleaning up data](#3-cleaning-up-data)
4. [Detecting events](#4-detecting-events)
5. [Mapping fixations to AOIs](#5-mapping-fixations-to-aois)
6. [Calculating reading measures](#6-calculating-reading-measures)

We will make use of the Python library [pymovements](https://pymovements.readthedocs.io/).

> **NOTE:** pymovements is still relatively young and under active development. Some features may be missing or buggy. You can get support during the seminar sessions and on the Moodle forum. If you notice any bugs or think that something could be improved in the library, don't hesitate to open an issue [on GitHub](https://github.com/aeye-lab/pymovements/issues).

In [None]:
%pip install pymovements

## 1. Levels of preprocessing

Generally, there are three levels of preprocessing eye-tracking-while-reading data:

- **Raw data:** Eye movement data as recorded by the eye tracker, usually between 500 and 2000 samples per second. Each sample includes X/Y gaze coordinates and possibly other measurements such as pupil size.
- **Event data:** Automatically detected events like fixations and saccades. Each event has an onset and offset time; in combination with the raw data, properties like fixation duration, saccade amplitude, or maximum velocity can be calculated.
- **Reading measures:** Scalar measures like first fixation duration, first-pass gaze duration, and skip rate, calculated for each area of interest (e.g., for each word).

![Preprocessing pipeline](pipeline.drawio.png)

Unfortunately, many [publicly available datasets](https://pymovements.readthedocs.io/en/stable/datasets/index.html#public-datasets) do not include the raw data. If you want to use a dataset for an experiment where you need to apply your own preprocessing pipeline, make sure to check the available data types.

## 2. Inspecting and visualizing data

First, let's download the dataset and load it into memory:

In [None]:
import pymovements as pm

dataset = pm.Dataset("gctg.yaml", "data")
dataset.download()
dataset.load()

The dataset contains one `GazeDataFrame` per subject:

In [None]:
dataset.gaze[0]

For convenience, let's split those tables so that there is one `GazeDataFrame` for each screen:

In [None]:
dataset.split_gaze_data("stimulus")
dataset

Let's look at the gaze data from one subject on one of the screens:

In [None]:
stimulus_gaze = dataset.gaze[0]
pm.plotting.traceplot(stimulus_gaze)

This is not very helpful, because we can't see the text that the subject was looking at.

Let's extract the name of the stimulus and find the corresponding stimulus image in the data folder:

In [None]:
from pathlib import Path

stimulus_name = stimulus_gaze.frame["stimulus"].unique().item()
stimulus_path = Path("data", "raw", "stimuli", f"{stimulus_name}.png")

stimulus_name, stimulus_path

Now we can add the stimulus image to the plot:

In [None]:
pm.plotting.traceplot(stimulus_gaze, add_stimulus=True, path_to_image_stimulus=stimulus_path)
stimulus_path

As you can see, the calibration towards the bottom of the screen appears to be a bit off (we will fix this in step 3 below).

Data in pymovements is stored as [polars](https://pola.rs/) data frames. Polars is a library similar to pandas, but is generally faster and provides a more functional-programming-like interface.

You can access the raw data using the `frame` attribute and use it, for example, to create your own plots:

In [None]:
import polars as pl
import matplotlib.pyplot as plt

pupil_size = stimulus_gaze.frame["pupil"]
pixel_x = stimulus_gaze.frame.select(pl.col("pixel").list.first())
pixel_y = stimulus_gaze.frame.select(pl.col("pixel").list.last())

fig, (ax_x, ax_y, ax_pupil) = plt.subplots(3, 1, figsize=(10, 6), sharex=True)
ax_x.plot(pixel_x)
ax_y.plot(pixel_y)
ax_pupil.plot(pupil_size)
ax_x.set_ylabel("X location (pixel)")
ax_y.set_ylabel("Y location (pixel)")
ax_pupil.set_ylabel("Pupil size")
ax_pupil.set_xlabel("Time (ms)")

Let's check how many samples we have per screen and how many subjects saw each screen:

In [None]:
all_samples = pl.concat([gaze.frame for gaze in dataset.gaze])
all_samples.group_by("stimulus").agg(
    [
        pl.col("time").count().alias("num_samples"),
        pl.col("subject_id").unique().count().alias("num_subjects"),
    ]
)

There is one screen which was accidentally skipped during the experiment. Can you find it?

In [None]:
# TODO: Find the stimulus with fewer than 4 subjects

## 3. Cleaning up data

Which data cleaning steps are necessary depends heavily on the use case. For reading experiments, common steps include:

- Correcting sample or fixation locations in case of bad calibration
- Removing the first and last fixations on a page or line (because there is often a bit of "random" movement)
- Removing blinks and other noise

Here, we will only look at manual correction of sample locations.

In [None]:
%pip install scikit-image

The `correct.py` module implements a simple graphical interface for moving and warping raw gaze data. You can open a CSV file and edit one screen at a time.

- **Left mouse button:** Add and move anchor points
- **Right mouse button:** Remove anchor points
- `→`: Next screen
- `←`: Previous screen
- `CTRL+Z`: Undo
- `ESC`: Save and exit

In [None]:
import correct

correct.main(Path("data", "raw", "gaze", "P01.csv"), vertical=True)
# vertical=True means that you will only correct the vertical location (recommended)

Alternatively, you can use the CLI from a terminal:

```bash
python correct.py data/raw/gaze/P01.csv --vertical
```

When you are done correcting, you can apply the transformations, which will create a new CSV file:

In [None]:
import apply_transforms

apply_transforms.main(Path("data", "raw", "gaze", "P01.csv"))
# This will create P01.corrected.csv in the same directory

We have prepared a version of the dataset with manually corrected raw data, which you can use for your experiments. This version only includes the **text screens for the experimental trials**. If you want to use gaze data from the question screens or the practice trial, you need to take it from the uncleaned dataset and possibly correct it manually first.

You can fetch and load the cleaned dataset like this:

In [None]:
dataset = pm.Dataset("gctg-clean.yaml", "data-clean").download().load()
dataset.split_gaze_data("stimulus")
dataset

In some trials, calibration quality was too low to be adequately corrected, so we excluded them. (But note that we only removed the gaze data, not the response data.)

You can find the missing trials like this:

In [None]:
all_samples = pl.concat([gaze.frame for gaze in dataset.gaze])
all_samples.group_by("stimulus").agg(
    [
        pl.col("time").count().alias("num_samples"),
        pl.col("subject_id").unique().count().alias("num_subjects"),
    ]
)

## 4. Detecting events

Depending on the research question, different events may be of interest:

- Fixations
- Saccades
- Microsaccades
- Smooth pursuit
- Blinks
- ...

For calculating reading measures, fixations are the most relevant. Two algorithms are commonly used for fixation detection:

- **IVT:** identification by velocity threshold; consecutive samples are grouped into a fixation for as long as the gaze velocity stays below a predefined threshold
- **IDT:** identification by dispersion threshold; consecutive samples are grouped into a fixation for as long as their spatial spread (dispersion) stays below a predefined threshold
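
To make the velocity-threshold idea concrete, here is a minimal pure-Python sketch of IVT on a toy signal. It is not the pymovements implementation. The function name `ivt_fixations`, the default values, and the assumption of one sample per millisecond are chosen for illustration only.

```python
def ivt_fixations(speeds, velocity_threshold=20.0, minimum_duration=100):
    """Group consecutive slow samples into fixations.

    speeds: gaze speed per sample in deg/s (assumed: one sample per ms).
    Returns a list of (onset, offset) sample indices, both inclusive.
    """
    fixations = []
    onset = None
    for i, speed in enumerate(speeds):
        if speed < velocity_threshold:
            if onset is None:
                onset = i  # a new candidate fixation starts here
        elif onset is not None:
            # The slow run ended at sample i - 1; keep it if it is long enough
            if i - onset >= minimum_duration:
                fixations.append((onset, i - 1))
            onset = None
    # Handle a slow run that lasts until the end of the recording
    if onset is not None and len(speeds) - onset >= minimum_duration:
        fixations.append((onset, len(speeds) - 1))
    return fixations


# Toy signal: slow, fast, briefly slow, fast, slow
speeds = [5.0] * 150 + [300.0] * 30 + [8.0] * 50 + [250.0] * 20 + [4.0] * 200
fixations = ivt_fixations(speeds)
print(fixations)  # [(0, 149), (250, 449)]
```

The 50-sample slow stretch in the middle is discarded because it is shorter than the minimum duration, which shows how both parameters shape the result.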

Both IVT and IDT use the angular gaze position, so we first need to convert the pixel coordinates to degrees of visual angle:

In [None]:
dataset.pix2deg()

stimulus_gaze = dataset.gaze[0]
pixel_x = stimulus_gaze.frame.select(pl.col("pixel").list.first())[:5000]
pixel_y = stimulus_gaze.frame.select(pl.col("pixel").list.last())[:5000]
position_x = stimulus_gaze.frame.select(pl.col("position").list.first())[:5000]
position_y = stimulus_gaze.frame.select(pl.col("position").list.last())[:5000]

fig, (ax_x, ax_y, ax_px, ax_py) = plt.subplots(4, 1, figsize=(10, 6), sharex=True)
ax_x.plot(pixel_x)
ax_y.plot(pixel_y)
ax_px.plot(position_x)
ax_py.plot(position_y)
ax_x.set_ylabel("X location [pix]")
ax_y.set_ylabel("Y location [pix]")
ax_px.set_ylabel("X position [°]")
ax_py.set_ylabel("Y position [°]")
ax_py.set_xlabel("Time [ms]")
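
The conversion itself is basic trigonometry. The following sketch shows one common way to compute it for a single axis, assuming the gaze origin is the upper-left corner of the screen and the participant looks at the screen center. The function name `pixel_to_degrees` and the screen dimensions are hypothetical. pymovements takes the real values from the dataset definition, and its implementation may differ in details.

```python
import math


def pixel_to_degrees(pixel, screen_px, screen_cm, distance_cm):
    """Convert a pixel coordinate on one axis to degrees of visual angle."""
    # Move the origin from the screen edge to the screen center
    centered_px = pixel - (screen_px - 1) / 2
    # Convert pixels to centimeters on the screen surface
    centered_cm = centered_px * screen_cm / screen_px
    # Angle between the line of sight to the center and to the gaze point
    return math.degrees(math.atan2(centered_cm, distance_cm))


# Hypothetical setup: 1280 px wide screen, 38 cm wide, viewed from 68 cm
center = pixel_to_degrees(639.5, screen_px=1280, screen_cm=38.0, distance_cm=68.0)
right_edge = pixel_to_degrees(1279, screen_px=1280, screen_cm=38.0, distance_cm=68.0)
print(center, round(right_edge, 1))
```

With these assumed dimensions, the screen center maps to 0° and the right edge to roughly 15 to 16° of visual angle.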

For IVT, we also need to calculate the velocity (in degrees per second). Due to random noise, the signal constantly jumps around a bit, even if the eye is completely still. Therefore, smoothing is usually applied when calculating the velocity. You can find the different methods implemented in pymovements [here](https://pymovements.readthedocs.io/en/stable/reference/pymovements.gaze.transforms.pos2vel.html).

In [None]:
dataset.pos2vel(method="fivepoint")

stimulus_gaze = dataset.gaze[0]
position_x = stimulus_gaze.frame.select(pl.col("position").list.first())[:5000]
position_y = stimulus_gaze.frame.select(pl.col("position").list.last())[:5000]
velocity_x = stimulus_gaze.frame.select(pl.col("velocity").list.first())[:5000]
velocity_y = stimulus_gaze.frame.select(pl.col("velocity").list.last())[:5000]

fig, (ax_x, ax_y, ax_vx, ax_vy) = plt.subplots(4, 1, figsize=(10, 6), sharex=True)
ax_x.plot(position_x)
ax_y.plot(position_y)
ax_vx.plot(velocity_x)
ax_vy.plot(velocity_y)
ax_x.set_ylabel("X position [°]")
ax_y.set_ylabel("Y position [°]")
ax_vx.set_ylabel("X velocity [°/s]")
ax_vy.set_ylabel("Y velocity [°/s]")
ax_vy.set_xlabel("Time [ms]")
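
As an illustration of such smoothing, here is a sketch of a five-point difference, which estimates the velocity at each sample from the two samples before and the two samples after it. This follows the common textbook formula and is not guaranteed to match pymovements' `fivepoint` method in every detail (for example, in how the edges are handled). The function name is hypothetical.

```python
def fivepoint_velocity(positions, sampling_rate):
    """Estimate velocity (units/s) using a five-point difference.

    The first and last two samples have no full neighborhood and stay None.
    """
    n = len(positions)
    velocities = [None] * n
    for i in range(2, n - 2):
        # Mean of the next two samples minus mean of the previous two,
        # divided by the 3-sample distance between those two means
        difference = (
            positions[i + 2] + positions[i + 1] - positions[i - 1] - positions[i - 2]
        )
        velocities[i] = difference * sampling_rate / 6
    return velocities


# Toy signal: the eye moves 0.01 degrees per sample at 1000 Hz
positions = [0.01 * i for i in range(10)]
velocities = fivepoint_velocity(positions, sampling_rate=1000)
print(velocities)
```

For this constant drift, every interior sample gets a velocity of about 10°/s. Averaging over neighbors makes the estimate less sensitive to single-sample noise than a plain difference of adjacent samples.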

Now, we can use IDT or IVT to detect fixations:

In [None]:
dataset.detect("idt", clear=True)
dataset.events[0]

The duration of each event is automatically calculated, but the fixation locations are missing. Let's add them by calculating the average pixel coordinates for each fixation.

In [None]:
dataset.compute_properties(("location", {"position_column": "pixel"}))
dataset.gaze[0].events
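
Conceptually, the location of a fixation is the mean of all gaze samples between its onset and offset. A minimal sketch on toy data, assuming inclusive onset and offset times (the function name and the numbers are made up for illustration):

```python
from statistics import mean


def fixation_location(times, xs, ys, onset, offset):
    """Average the gaze samples whose timestamps fall within [onset, offset]."""
    inside = [i for i, t in enumerate(times) if onset <= t <= offset]
    return mean(xs[i] for i in inside), mean(ys[i] for i in inside)


# Toy samples: three samples near (101, 51), then three near (400, 51)
times = [0, 1, 2, 3, 4, 5]
xs = [100.0, 102.0, 101.0, 400.0, 401.0, 399.0]
ys = [50.0, 52.0, 51.0, 50.0, 50.0, 53.0]

location = fixation_location(times, xs, ys, onset=0, offset=2)
print(location)  # (101.0, 51.0)
```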

Let's visualize the detected events:

In [None]:
stimulus_gaze = dataset.gaze[0]
stimulus_name = stimulus_gaze.frame["stimulus"].unique().item()

stimulus_path = Path("data", "raw", "stimuli", f"{stimulus_name}.png")
pm.plotting.scanpathplot(
    stimulus_gaze.events,
    stimulus_gaze,
    add_traceplot=True,
    add_stimulus=True,
    path_to_image_stimulus=stimulus_path,
)

Visualize the data for a couple of screens and subjects. Are fixations detected reliably? If there are missing fixations, try tweaking the thresholds or using a different algorithm. You can find more information about [IDT](https://pymovements.readthedocs.io/en/stable/reference/pymovements.events.idt.html#pymovements.events.idt) and [IVT](https://pymovements.readthedocs.io/en/stable/reference/pymovements.events.ivt.html#pymovements.events.ivt) and their parameters in the documentation.
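
To build intuition for what happens when you tweak the dispersion threshold, here is a toy re-implementation of the IDT idea in pure Python. It is a simplified sketch, not the library code. Dispersion is assumed to be the sum of the horizontal and vertical ranges within the window, and the function and parameter names are chosen for illustration.

```python
def dispersion(xs, ys):
    """Spread of a window: horizontal range plus vertical range."""
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def idt_fixations(xs, ys, dispersion_threshold=1.0, minimum_duration=100):
    """Return (onset, offset) sample indices (inclusive) of detected fixations."""
    fixations = []
    n = len(xs)
    start = 0
    while start + minimum_duration <= n:
        end = start + minimum_duration  # exclusive end of the window
        if dispersion(xs[start:end], ys[start:end]) <= dispersion_threshold:
            # Grow the window until the next sample would exceed the threshold
            while end < n and (
                dispersion(xs[start : end + 1], ys[start : end + 1])
                <= dispersion_threshold
            ):
                end += 1
            fixations.append((start, end - 1))
            start = end
        else:
            start += 1  # slide the window forward by one sample
    return fixations


# Toy signal: two fixations 5 degrees apart, each with 0.3 degrees of jitter
xs = [0.3 * (i % 2) for i in range(120)] + [5.0 + 0.3 * (i % 2) for i in range(120)]
ys = [0.0] * 240

loose = idt_fixations(xs, ys, dispersion_threshold=1.0)
strict = idt_fixations(xs, ys, dispersion_threshold=0.2)
print(loose)   # [(0, 119), (120, 239)]
print(strict)  # []
```

A threshold smaller than the noise level of the recording finds no fixations at all, while a threshold that is too large would merge neighboring fixations into one.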

## 5. Mapping fixations to AOIs

We would like to assign each fixation to a word-level area of interest. The AOI rectangles have been predefined in CSV files. We can load them like this:

In [None]:
stimulus_names = all_samples["stimulus"].unique()
stimuli = {}
for stimulus_name in stimulus_names:
    aois_path = Path("data", "raw", "stimuli", f"{stimulus_name}.word.csv")
    aois = pl.read_csv(aois_path)
    stimulus = pm.stimulus.TextStimulus(
        aois,
        aoi_column="content",
        start_x_column="left",
        start_y_column="top",
        end_x_column="right",
        end_y_column="bottom",
    )
    stimuli[stimulus_name] = stimulus
stimuli["goldfish-zero.text.0"]

Next, let's map fixations to AOIs:

In [None]:
for events in dataset.events:
    stimulus_name = events.frame["stimulus"].unique().item()
    events.map_to_aois(stimuli[stimulus_name])

dataset.events[0]
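
Under the hood, this kind of mapping boils down to a point-in-rectangle test for each fixation. The sketch below shows the idea with plain dictionaries. The column names match the AOI files used above, but the AOI values, the function name, and the choice to treat the right and bottom edges as exclusive are assumptions for illustration.

```python
def find_aoi(x, y, aois):
    """Return the content of the first AOI that contains the point, else None."""
    for aoi in aois:
        inside_x = aoi["left"] <= x < aoi["right"]
        inside_y = aoi["top"] <= y < aoi["bottom"]
        if inside_x and inside_y:
            return aoi["content"]
    return None


# Two made-up word AOIs on the same line
aois = [
    {"content": "The", "left": 100, "top": 200, "right": 150, "bottom": 240},
    {"content": "goldfish", "left": 150, "top": 200, "right": 260, "bottom": 240},
]

# Fixation locations in pixels; the last one lies outside every AOI
fixation_locations = [(120.0, 215.0), (155.5, 230.0), (500.0, 600.0)]
mapped = [find_aoi(x, y, aois) for x, y in fixation_locations]
print(mapped)  # ['The', 'goldfish', None]
```

Fixations that fall outside every rectangle end up without an AOI, which is why such rows need to be filtered out before aggregating reading measures.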

## 6. Calculating reading measures

Unfortunately, this final part of the pipeline is still underdeveloped in pymovements, so you will have to implement the reading measures you need yourself.

Some things to watch out for:
- If you want to calculate reading measures for every word, make sure you include words that are never fixated by anyone (i.e., set the reading measures to 0).
- Make sure to exclude missing trials/screens (i.e., *don't* set the reading measures to 0 in that case).

Here is a simple example for total reading time (i.e., the sum of durations of all fixations on an AOI), applied to one screen and subject:

In [None]:
events = dataset.events[0].frame

reading_measures = (
    events
    # Remove fixations that are not located in any AOI
    .filter(pl.col("index").is_not_null())
    # Group by AOI index and aggregate
    .group_by("index")
    .agg(
        pl.col("subject_id").first(),
        pl.col("stimulus").first(),
        pl.col("content").first(),
        pl.col("duration").sum().alias("total_reading_time"),
    )
    # Sort by AOI index
    .sort("index")
)
reading_measures

Right now, `reading_measures` only includes AOIs which have been fixated at some point. However, we want to calculate total reading time (TRT) for *all* AOIs, even those that were never fixated (like the AOI with index 61, which should get a TRT of 0). We can achieve this by doing a *right join* between the reading measures and the AOIs:

![Diagram of a right join](https://www.w3schools.com/sql/img_right_join.png)

In [None]:
subject_id = events["subject_id"].unique().item()
stimulus_name = events["stimulus"].unique().item()
stimulus_aois = stimuli[stimulus_name].aois

reading_measures = reading_measures.join(
    stimulus_aois,
    on=["index", "content"],
    how="right",
).select("subject_id", "stimulus", "index", "content", "total_reading_time")

# Fill the values that are null after the join (i.e., for AOIs that were never fixated)
reading_measures = reading_measures.with_columns(
    pl.col("subject_id").fill_null(subject_id),
    pl.col("stimulus").fill_null(stimulus_name),
    pl.col("total_reading_time").fill_null(0),
)
reading_measures

Try calculating some more complex reading measures, like first-pass gaze duration or regression rate (0 if there is no regression from this AOI, 1 if there is).

If you're not comfortable with polars, you can also convert the data to a pandas data frame:

In [None]:
events.to_pandas()

In [None]:
# TODO: Calculate more reading measures