## Description of the dataset

| Column header | Unit of measure | Description |
|---------------|-----------------|-------------|
| n            | ms             | Timestamp of the recorded gaze sample since the beginning of the recording |
| x            | dva            | $\theta_h$ for the cyclopean eye |
| y            | dva            | $\theta_v$ for the cyclopean eye |
| lx           | dva            | $\theta_h$ for the left eye |
| ly           | dva            | $\theta_v$ for the left eye |
| rx           | dva            | $\theta_h$ for the right eye |
| ry           | dva            | $\theta_v$ for the right eye |
| xT*          | dva            | $\theta_h$ for the stimulus, relative to the cyclopean eye |
| yT*          | dva            | $\theta_v$ for the stimulus, relative to the cyclopean eye |
| zT           | m              | Depth of the stimulus |
| clx          | m              | X position of the center of the left eyeball, relative to the camera origin |
| cly          | m              | Y position of the center of the left eyeball, relative to the camera origin |
| clz          | m              | Z position of the center of the left eyeball, relative to the camera origin |
| crx          | m              | X position of the center of the right eyeball, relative to the camera origin |
| cry          | m              | Y position of the center of the right eyeball, relative to the camera origin |
| crz          | m              | Z position of the center of the right eyeball, relative to the camera origin |
| round        |                | Recording round (1-3) |
| participant  |                | Participant ID (001-465) |
| session      |                | Recording session (1-2) |
| task         |                | Task category (1-5) |
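
To illustrate how the angular columns relate to each other, the sketch below computes the per-sample offset between the cyclopean gaze direction (`x`, `y`) and the stimulus direction, both expressed in degrees of visual angle (dva). It rests on several assumptions: the stimulus columns are named `xT` and `yT` (without the asterisk shown in the table), the angles are small enough that combining the horizontal and vertical offsets with a Euclidean norm is an acceptable approximation, and the `gaze_error` helper and toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def gaze_error(frame: pd.DataFrame) -> pd.Series:
    """Approximate angular offset (dva) between cyclopean gaze and stimulus."""
    dx = frame["x"] - frame["xT"]  # horizontal offset
    dy = frame["y"] - frame["yT"]  # vertical offset
    return np.hypot(dx, dy)


# Toy data standing in for a few gaze samples (not taken from the dataset)
toy = pd.DataFrame({
    "x": [0.0, 3.0, 1.0],
    "y": [0.0, 4.0, 1.0],
    "xT": [0.0, 0.0, 1.0],
    "yT": [0.0, 0.0, 2.0],
})
toy["error"] = gaze_error(toy)
```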

In [4]:
import pandas as pd

In [5]:
# Load the dataset with pandas, keeping only the rows for task 2 (PUR)
df = pd.read_parquet('dataset/gazebasevr.parquet', filters=[('task', '=', 2)])

# Exclude metadata columns
df = df.drop(columns=['round', 'session', 'task'])

# Print memory usage
print(f"{df.memory_usage().sum() / 1024 ** 2:.2f} MB")

2760.84 MB
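
If the memory footprint printed above is a concern, one possible mitigation is to store floating-point columns at lower precision. The sketch below assumes the gaze columns are held as `float64`, which this notebook does not confirm; `downcast_floats` is a hypothetical helper and the toy frame only mimics the column layout.

```python
import numpy as np
import pandas as pd


def downcast_floats(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every float64 column converted to float32."""
    float_cols = frame.select_dtypes(include="float64").columns
    return frame.astype({col: "float32" for col in float_cols})


# Toy frame standing in for the real data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "n": np.arange(1000, dtype="float64") * 4.0,
    "x": rng.normal(size=1000),
    "participant": np.repeat(np.arange(10), 100),
})

small = downcast_floats(toy)
before = toy.memory_usage().sum()
after = small.memory_usage().sum()
print(f"{before} bytes -> {after} bytes")
```

Halving the precision trades memory for accuracy, so whether it is acceptable depends on how sensitive the downstream features are to rounding.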


In [6]:
NUM_PARTICIPANTS = 20

# Keep the first NUM_PARTICIPANTS participant IDs in order of appearance (not a random sample)
participants = df['participant'].unique()
participants = participants[:NUM_PARTICIPANTS]

# Filter the dataset to include only the selected participants
df = df[df['participant'].isin(participants)]

# Print memory usage
print(f"{df.memory_usage().sum() / 1024 ** 2:.2f} MB")

159.38 MB
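
The cell above keeps the first `NUM_PARTICIPANTS` IDs in the order they appear in the data. If a random subset is wanted instead, a minimal sketch with a seeded NumPy generator could look like the following. The `SEED` value and the toy frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

NUM_PARTICIPANTS = 3
SEED = 42  # hypothetical seed, fixed so the selection is reproducible

# Toy frame: 10 participants with 5 samples each
toy = pd.DataFrame({
    "participant": np.repeat(np.arange(1, 11), 5),
    "x": np.arange(50, dtype=float),
})

rng = np.random.default_rng(SEED)
all_ids = toy["participant"].unique()
# Draw distinct participant IDs without replacement
chosen = rng.choice(all_ids, size=NUM_PARTICIPANTS, replace=False)
subset = toy[toy["participant"].isin(chosen)]
```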


In [7]:
# Fill missing values: forward fill first, then backward fill any leading gaps
# (applied to the whole frame, so fills can cross participant boundaries)
df = df.ffill().bfill()
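
Because the fill above runs over the whole frame, a gap at the start or end of one participant's recording can be filled with a neighbouring participant's values. A minimal sketch of a per-participant alternative follows, using toy data; whether this matters in practice depends on where the missing samples actually fall.

```python
import numpy as np
import pandas as pd

# Toy frame: each participant starts with a missing sample
toy = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2],
    "x": [np.nan, 1.0, np.nan, np.nan, 5.0, 6.0],
})

# Global fill: participant 2's first sample inherits participant 1's last value
global_fill = toy.ffill().bfill()

# Per-participant fill: values never cross the participant boundary
value_cols = [c for c in toy.columns if c != "participant"]
grouped_fill = toy.copy()
grouped_fill[value_cols] = (
    toy.groupby("participant")[value_cols]
    .transform(lambda s: s.ffill().bfill())
)
```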

In [None]:
import os

import IPython.display  # explicit submodule import for clear_output
import tsfel

DATA_FREQUENCY = 250  # sampling rate in Hz
WINDOW_SIZE = 10 * DATA_FREQUENCY  # samples per 10-second window

# For each participant in the dataset, extract features for that participant
results = []
for participant_id, group in df.groupby("participant"):
    IPython.display.clear_output(wait=True)
    print(f"Extracting features for participant {participant_id}, rows: {len(group)}")

    cache_file = f"cache/tsfel/{participant_id}.parquet"
    # create cache directory if not exists
    os.makedirs(os.path.dirname(cache_file), exist_ok=True)
    if os.path.exists(cache_file):
        features = pd.read_parquet(cache_file)
    else:
        # Extract features with TSFEL
        cfg = tsfel.get_features_by_domain()
        # https://github.com/fraunhoferportugal/tsfel/issues/173
        features = tsfel.time_series_features_extractor(
            cfg, group.drop(columns=["n", "participant"]), fs=DATA_FREQUENCY, window_size=WINDOW_SIZE, n_jobs=-1
        )
        features.to_parquet(cache_file)

    features["participant"] = participant_id
    results.append(features)

features_df = pd.concat(results).reset_index(drop=True)
# Save the extracted features for all participants
features_df.to_parquet("cache/gazebasevr-features.parquet")

Extracting features for participant 2, rows: 165051
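
To make the windowing concrete: at 250 Hz a 10-second window holds 2500 samples. The sketch below splits a signal of the length reported above into windows and computes two simple per-window statistics. It assumes non-overlapping windows and that an incomplete trailing window is discarded, which may not match TSFEL's behaviour exactly. The mean and standard deviation are stand-ins for the TSFEL feature set, and `split_windows` is a hypothetical helper.

```python
import numpy as np

DATA_FREQUENCY = 250  # Hz
WINDOW_SIZE = 10 * DATA_FREQUENCY  # 2500 samples per window


def split_windows(signal: np.ndarray, window_size: int) -> np.ndarray:
    """Reshape a 1-D signal to (n_windows, window_size), dropping the tail."""
    n_windows = len(signal) // window_size
    return signal[: n_windows * window_size].reshape(n_windows, window_size)


# Synthetic signal with the same number of rows as the participant above
rng = np.random.default_rng(0)
signal = rng.normal(size=165051)

windows = split_windows(signal, WINDOW_SIZE)
window_means = windows.mean(axis=1)  # one value per window
window_stds = windows.std(axis=1)
print(windows.shape)
```

Under these assumptions each feature yields one row per window, so the number of rows contributed by a participant is the recording length divided by `WINDOW_SIZE`, rounded down.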


In [None]:
# print shape of the extracted features
print(f"Dataset shape: {features_df.shape}")

In [None]:
features_df.head()