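## Description of the dataset

Angles are expressed in degrees of visual angle (dva); $\theta_h$ and $\theta_v$ denote the horizontal and vertical gaze angles.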

| Column header | Unit of measure | Description |
|---------------|-----------------|-------------|
| n            | ms             | Timestamp of the recorded gaze sample since the beginning of the recording |
| x            | dva            | $\theta_h$ for the cyclopean eye |
| y            | dva            | $\theta_v$ for the cyclopean eye |
| lx           | dva            | $\theta_h$ for the left eye |
| ly           | dva            | $\theta_v$ for the left eye |
| rx           | dva            | $\theta_h$ for the right eye |
| ry           | dva            | $\theta_v$ for the right eye |
| xT*          | dva            | $\theta_h$ for the stimulus, relative to the cyclopean eye |
| yT*          | dva            | $\theta_v$ for the stimulus, relative to the cyclopean eye |
| zT           | m              | Depth of the stimulus |
| clx          | m              | X position of the center of the left eyeball, relative to the camera origin |
| cly          | m              | Y position of the center of the left eyeball, relative to the camera origin |
| clz          | m              | Z position of the center of the left eyeball, relative to the camera origin |
| crx          | m              | X position of the center of the right eyeball, relative to the camera origin |
| cry          | m              | Y position of the center of the right eyeball, relative to the camera origin |
| crz          | m              | Z position of the center of the right eyeball, relative to the camera origin |
| round        |                | Recording round (1-3) |
| participant  |                | Participant ID (001-465) |
| session      |                | Recording session (1-2) |
| task         |                | Task category (1-5) |

In [1]:
# Dataset parameters
COL_TO_DROP = ["round", "session", "task", "xT", "yT", "zT"]

In [2]:
import pandas as pd

# Import filtered dataset
df = pd.read_parquet(
    "dataset/gazebasevr_filtered.parquet"
)

# Print number of records
print(f"Number of records: {df.shape[0]}")

Number of records: 4862282


In [3]:
# Drop unnecessary columns
df = df.drop(columns=COL_TO_DROP)

In [4]:
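# For each participant, fill missing values by linear interpolation.
# limit_direction="both" also fills gaps at the start and end of each group.
# The grouping key is excluded so that it is never interpolated itself.
numeric_cols = [
    col for col in df.select_dtypes(include="number").columns if col != "participant"
]

df[numeric_cols] = df.groupby("participant")[numeric_cols].transform(
    lambda group: group.interpolate(method="linear", limit_direction="both")
)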
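
To make the behavior of the per-participant interpolation concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the dataset. It suggests two properties: gaps are filled only from samples of the same participant, and with `limit_direction="both"` the leading and trailing gaps take the nearest valid value instead of being extrapolated.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two participants with gaps in "x"
toy = pd.DataFrame({
    "participant": [1, 1, 1, 1, 2, 2, 2],
    "x": [np.nan, 1.0, np.nan, 3.0, 10.0, np.nan, np.nan],
})

# Interpolate within each participant only
filled = toy.groupby("participant")["x"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)

# Participant 1: the inner gap becomes 2.0, the leading gap takes the first valid value
# Participant 2: the trailing gaps take the last valid value of that participant
print(filled.tolist())
```

Long trailing gaps are therefore replaced by a constant, which may be worth keeping in mind when the derivatives are computed afterwards.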

In [5]:
# Differences between the left-eye and right-eye gaze angles
df["dx"] = df["lx"] - df["rx"]
df["dy"] = df["ly"] - df["ry"]

# First and second derivatives of the gaze angles with respect to time (n),
# computed per participant as finite differences (units: dva/ms and dva/ms^2)
angle_cols = ["x", "y", "lx", "ly", "rx", "ry"]

# Time step between consecutive samples of the same participant
dt = df.groupby("participant")["n"].diff()

# First derivatives
for col in angle_cols:
    df[f"{col}_d1"] = df.groupby("participant")[col].diff() / dt

# Second derivatives
for col in angle_cols:
    df[f"{col}_d2"] = df.groupby("participant")[f"{col}_d1"].diff() / dt

# Inter-eye distance: Euclidean distance between the two eyeball centers (m)
df["ied"] = ((df["clx"] - df["crx"])**2 + (df["cly"] - df["cry"])**2 + (df["clz"] - df["crz"])**2)**0.5

# Drop rows containing NaN; this removes at least the first two samples
# of each participant, whose second derivative is undefined
df = df.dropna()
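
One point may deserve a check here. Since `n` counts from the beginning of each recording and the `round`, `session` and `task` columns were dropped earlier, a per-participant `diff()` can span the boundary between two recordings, where the timestamp restarts. Whether this happens depends on how the rows of the filtered file are ordered, which is an assumption here. The sketch below, on made-up values, masks non-positive time steps so that they yield NaN (later removed by `dropna()`) instead of a negative or infinite derivative.

```python
import pandas as pd

# Toy data (hypothetical values): the timestamp restarts at the 4th sample
toy = pd.DataFrame({
    "participant": [1, 1, 1, 1, 1],
    "n": [0.0, 4.0, 8.0, 0.0, 4.0],
    "x": [0.0, 2.0, 6.0, 1.0, 3.0],
})

# Time step per participant; steps that are not strictly positive become NaN
dt = toy.groupby("participant")["n"].diff()
dt = dt.where(dt > 0)

# First derivative of x; undefined at the first sample and at the restart
toy["x_d1"] = toy.groupby("participant")["x"].diff() / dt
print(toy["x_d1"].tolist())
```

Note that `dropna()` does not remove infinite values, so a zero time step left unmasked would survive into the normalization step.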

In [6]:
# Min-max normalize every column to [0, 1], except the participant ID,
# the timestamp, and the inter-eye distance
blacklist = ["participant", "n", "ied"]

columns_to_normalize = [col for col in df.columns if col not in blacklist]
df[columns_to_normalize] = df[columns_to_normalize].apply(
    lambda x: (x - x.min()) / (x.max() - x.min())
)
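
Min-max scaling divides by `x.max() - x.min()`, which is zero for a constant column and would turn that column into NaN. A minimal sketch of a guarded variant follows, assuming that mapping a constant column to 0 is acceptable; `min_max` is a hypothetical helper and the values are made up for illustration.

```python
import pandas as pd

def min_max(col: pd.Series) -> pd.Series:
    """Scale a column to [0, 1]; a constant column is mapped to 0."""
    span = col.max() - col.min()
    if span == 0:
        # Avoid 0/0: return zeros with the same index
        return col * 0.0
    return (col - col.min()) / span

# Toy data (hypothetical values): "b" is constant
toy = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [5.0, 5.0, 5.0]})
scaled = toy.apply(min_max)
print(scaled)
```

The minimum and maximum here come from the whole frame, so if the data are later split into training and test sets, the scaling parameters would normally be estimated on the training portion only.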

In [7]:
# Save the modified DataFrame to a new Parquet file
df.to_parquet(
    "dataset/gazebasevr_processed.parquet",
    index=False
)