In [None]:
%load_ext cudf.pandas


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import datashader as ds
import datashader.transfer_functions as tf

In [None]:
pd.set_option('display.max_columns', None)

# **11_IVT**

In [None]:
df_11_IVT = pd.read_csv('data/STData/11/11_IVT.csv')

In [None]:
df_11_IVT.head()

In [None]:
df_11_IVT.columns

In [None]:
df_11_IVT.shape

In [None]:
df_11_IVT.info()

In [None]:
df_11_IVT.isnull().sum()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_11_IVT.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
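
One way to check the interval pattern is to collapse consecutive rows into runs and see how the null runs alternate with the question runs. The sketch below is only an illustration: `question_runs` is a hypothetical helper, and the small series is synthetic rather than taken from the recording.

```python
import pandas as pd

def question_runs(keys: pd.Series) -> pd.DataFrame:
    """Collapse consecutive identical QuestionKey values (NaN included) into runs."""
    # Placeholder label so that consecutive missing rows compare as equal
    filled = keys.fillna('<no question>')
    # A new run id starts whenever the label differs from the previous row
    run_id = (filled != filled.shift()).cumsum()
    runs = filled.groupby(run_id).agg(label='first', n_rows='count')
    return runs.reset_index(drop=True)

# Synthetic example: two questions separated by stretches with no QuestionKey
demo = pd.Series([None, None, 'Q1', 'Q1', 'Q1', None, 'Q2', 'Q2', None])
runs = question_runs(demo)
print(runs)
```

If the null runs sit between question runs in the real data, that would support reading them as "no question displayed" rather than as recording errors.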

In [None]:
df_11_IVT['QuestionKey'].unique()

In [None]:
df_11_IVT['Timestamp'] = pd.to_datetime(df_11_IVT['Timestamp'])

In [None]:
df_11_IVT.head(3)

In [None]:
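# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated chained assignment and may not update the DataFrame.
df_11_IVT['QuestionKey'] = df_11_IVT['QuestionKey'].fillna('None')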

In [None]:
df_11_IVT['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_11_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_11_IVT.isnull().sum()

In [None]:
df_11_IVT.head()

In [None]:
df_11_IVT['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_11_IVT['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.
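
Before dropping the column, the "simple row index" reading can be verified rather than inferred from the histogram. A minimal sketch, assuming a counter that increases by exactly 1 per row; `is_plain_row_index` is a hypothetical helper and the frame is synthetic.

```python
import numpy as np
import pandas as pd

def is_plain_row_index(row: pd.Series) -> bool:
    """True if the values increase by exactly 1 from one row to the next."""
    steps = np.diff(row.to_numpy())
    return bool((steps == 1).all())

# Synthetic frame: a counter column next to a measurement column
demo = pd.DataFrame({
    'Row': [1, 2, 3, 4, 5],
    'Gaze X': [10.0, 11.5, np.nan, 12.0, 12.2],
})
print(is_plain_row_index(demo['Row']))     # counter-like column
print(is_plain_row_index(demo['Gaze X']))  # measurement column
```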

In [None]:
df_11_IVT.drop('Row', axis=1, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

sns.scatterplot(data=df_11_IVT, x='Gaze X', y='Gaze Y', ax=axes[0])
axes[0].set_title('Gaze X vs Gaze Y')

sns.scatterplot(data=df_11_IVT, x='Interpolated Gaze X', y='Interpolated Gaze Y', ax=axes[1])
axes[1].set_title('Interpolated Gaze X vs Interpolated Gaze Y')

plt.tight_layout()
plt.show()

# Gaze and Interpolated Gaze Scatter Plots

The scatter plots above visualize the relationship between the x and y coordinates of both the raw gaze data and the interpolated gaze data.

- **Gaze X vs Gaze Y:** This plot shows the raw gaze coordinates. The scattered points indicate the locations on the screen where the participant was looking. The density of points in certain areas might suggest regions of interest.
- **Interpolated Gaze X vs Interpolated Gaze Y:** This plot shows the interpolated gaze coordinates. Interpolation is often used to fill in gaps in the raw gaze data, providing a smoother representation of the gaze path. Comparing this plot to the raw gaze plot can show the effect of the interpolation process.

Both plots can help in understanding the distribution of gaze points across the screen and identifying potential patterns or biases in eye movements.
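
With many overlapping points, density is hard to judge from a scatter plot alone. A coarse 2D histogram makes the "regions of interest" idea concrete. The sketch below uses synthetic gaze points (one cluster plus uniform noise); the helper name `gaze_density` and the bin count are illustrative choices, not part of the dataset.

```python
import numpy as np
import pandas as pd

def gaze_density(df, x='Gaze X', y='Gaze Y', bins=4):
    """Count gaze samples per screen cell, ignoring rows with missing coordinates."""
    pts = df[[x, y]].dropna()
    counts, x_edges, y_edges = np.histogram2d(pts[x], pts[y], bins=bins)
    return counts, x_edges, y_edges

# Synthetic data: 500 samples clustered near (250, 400), 100 spread over the screen
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Gaze X': np.concatenate([rng.normal(250, 20, 500), rng.uniform(0, 1920, 100), [np.nan]]),
    'Gaze Y': np.concatenate([rng.normal(400, 20, 500), rng.uniform(0, 1080, 100), [np.nan]]),
})

counts, x_edges, y_edges = gaze_density(demo)
# Locate the cell with the most samples
ix, iy = np.unravel_index(counts.argmax(), counts.shape)
print(f'densest cell: x in [{x_edges[ix]:.0f}, {x_edges[ix + 1]:.0f}], '
      f'y in [{y_edges[iy]:.0f}, {y_edges[iy + 1]:.0f}]')
```

The returned `counts` array can also be passed to `sns.heatmap` for a quick visual check.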

In [None]:
df_11_IVT.describe()

In [None]:
df_11_IVT.head(3)

In [None]:
df_11_IVT['Timestamp'] = pd.to_datetime(df_11_IVT['Timestamp'])

In [None]:
df_11_IVT.columns

In [None]:
cols = ['Gaze X', 'Gaze Y',
       'Interpolated Gaze X', 'Interpolated Gaze Y', 'Interpolated Distance',
       'Gaze Velocity', 'Gaze Acceleration', 'Fixation Index',
       'Fixation Index by Stimulus', 'Fixation X', 'Fixation Y',
       'Fixation Start', 'Fixation End', 'Fixation Duration',
       'Fixation Dispersion', 'Saccade Index', 'Saccade Index by Stimulus',
       'Saccade Start', 'Saccade End', 'Saccade Duration', 'Saccade Amplitude',
       'Saccade Peak Velocity', 'Saccade Peak Acceleration',
       'Saccade Peak Deceleration', 'Saccade Direction']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    sns.lineplot(x=df_11_IVT['Timestamp'], y=df_11_IVT[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

In [None]:
df_11_IVT.head()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_11_IVT[['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus']].isnull(), cmap='viridis')
plt.show()

# Observation

The `Fixation Index`, `Fixation Index by Stimulus`, `Saccade Index` and `Saccade Index by Stimulus` columns are essentially just sequence numbers for identified events. While they indicate the order of fixations and saccades, they don't provide meaningful features for a machine learning model attempting to predict or classify eye movement patterns. Therefore, we will drop these columns as they are not useful for model building.

In [None]:
df_11_IVT.drop(['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(14,10))
sns.scatterplot(data=df_11_IVT, x='Fixation X', y='Fixation Y')
plt.title('Fixation X vs Fixation Y')
plt.show()

In [None]:
df_11_IVT['Fixation Start'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_11_IVT['Fixation Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Fixation Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Fixation Start')

sns.histplot(df_11_IVT['Fixation End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Fixation End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Fixation End')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_11_IVT['Saccade Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Saccade Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Saccade Start')

sns.histplot(df_11_IVT['Saccade End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Saccade End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Saccade End')

plt.tight_layout()
plt.show()

# Observation on Fixation and Saccade Timestamps

Upon examining the time series plots of `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` against `Timestamp`, we observe a clear linear, diagonal pattern. This indicates that these values are largely sequential and track the elapsed recording time.

Furthermore, the histograms of these features show distributions that, while informative about the timing of events, don't necessarily reveal complex patterns that would be highly predictive for a machine learning model.

Crucially, the dataset already contains `Fixation Duration` and `Saccade Duration` columns. These duration features capture the length of each event, which is often a more directly relevant metric for understanding eye movement behavior than the absolute start and end times. Since the duration can be derived from the start and end times (Duration = End - Start), the start and end time columns introduce redundancy and do not provide substantial additional, independent information for modeling purposes.

Therefore, to simplify the dataset and focus on the most informative features for eye movement analysis or machine learning modeling, we will drop the `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` columns.
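
The redundancy argument rests on Duration = End - Start, which is worth confirming before the columns are removed. A minimal sketch on synthetic values, assuming start, end, and duration share the same time unit; `duration_matches` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def duration_matches(df, prefix, tol=1e-6):
    """Fraction of event rows where '<prefix> Duration' equals End minus Start."""
    cols = [f'{prefix} Start', f'{prefix} End', f'{prefix} Duration']
    rows = df.dropna(subset=cols)
    derived = rows[f'{prefix} End'] - rows[f'{prefix} Start']
    return float(np.isclose(derived, rows[f'{prefix} Duration'], atol=tol).mean())

# Synthetic example; the last row stands for a sample outside any fixation
demo = pd.DataFrame({
    'Fixation Start': [0.0, 0.0, 250.0, np.nan],
    'Fixation End': [200.0, 200.0, 400.0, np.nan],
    'Fixation Duration': [200.0, 200.0, 150.0, np.nan],
})
share = duration_matches(demo, 'Fixation')
print(share)
```

A share well below 1.0 on the real data would mean the start and end columns carry information that the duration columns do not.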

In [None]:
df_11_IVT.drop(['Fixation Start', 'Fixation End', 'Saccade Start', 'Saccade End'], axis=1, inplace=True)

In [None]:
df_11_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_11_IVT[['Gaze X', 'Gaze Y', 'Interpolated Gaze X', 'Interpolated Gaze Y']].isnull(), cmap='viridis')
plt.show()


# Observations on Gaze and Interpolated Gaze Data

Based on the scatter plots of `Gaze X` vs `Gaze Y` and `Interpolated Gaze X` vs `Interpolated Gaze Y`, the distributions of the raw and interpolated gaze points appear very similar. The spatial patterns of where the participant was looking are consistent between the two sets of coordinates.

Furthermore, the heatmap of null values for these columns (`Gaze X`, `Gaze Y`, `Interpolated Gaze X`, `Interpolated Gaze Y`) shows missing values in the same rows for both the raw and interpolated gaze coordinates. This suggests that the interpolation process did not fill in the gaps in the raw gaze data for these specific instances.

Since the interpolated gaze data shows the same spatial distribution and the same pattern of null values as the raw gaze data, the interpolation does not appear to have altered or completed the data in this case. Keeping both sets of columns would be redundant, so we drop `Interpolated Gaze X` and `Interpolated Gaze Y` and keep the raw coordinates.
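
The heatmap comparison is visual, so a direct comparison of the null masks gives a firmer basis for the decision. The sketch below is an assumption-laden illustration on synthetic data, where one gap is deliberately filled by interpolation to show what a mismatch looks like; `compare_missing` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def compare_missing(df, raw_cols, interp_cols):
    """Compare where raw and interpolated gaze are missing and how far they differ."""
    raw_missing = df[raw_cols].isnull().any(axis=1)
    interp_missing = df[interp_cols].isnull().any(axis=1)
    both_present = ~raw_missing & ~interp_missing
    diff = np.abs(df[raw_cols].to_numpy() - df[interp_cols].to_numpy())[both_present.to_numpy()]
    return {
        'same_null_rows': bool((raw_missing == interp_missing).all()),
        # Rows where interpolation supplied a value that the raw data lacks
        'rows_filled': int((raw_missing & ~interp_missing).sum()),
        'max_abs_diff': float(diff.max()) if both_present.any() else float('nan'),
    }

raw = ['Gaze X', 'Gaze Y']
interp = ['Interpolated Gaze X', 'Interpolated Gaze Y']
demo = pd.DataFrame({
    'Gaze X': [10.0, np.nan, 12.0, np.nan],
    'Gaze Y': [20.0, np.nan, 22.0, np.nan],
    'Interpolated Gaze X': [10.0, np.nan, 12.0, 12.5],
    'Interpolated Gaze Y': [20.0, np.nan, 22.0, 22.5],
})
summary = compare_missing(demo, raw, interp)
print(summary)
```

On the real data, `same_null_rows` being true and `rows_filled` being 0 would match the observation above.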

In [None]:
df_11_IVT.drop(['Interpolated Gaze X', 'Interpolated Gaze Y'], axis=1, inplace=True)

In [None]:
df_11_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_11_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_11_IVT.columns

In [None]:
fix_1_df = df_11_IVT.dropna(subset=['Fixation Duration'])
sac_1_df = df_11_IVT.dropna(subset=['Saccade Duration'])

In [None]:
fix_1_df.shape

In [None]:
sac_1_df.shape

In [None]:
fix_1_feature = fix_1_df.groupby('QuestionKey').agg({
    'Fixation Duration': ['count','mean','max','sum','var'],
    'Fixation Dispersion': ['mean','max'],
    'Fixation X': ['var'],   # screen spread X
    'Fixation Y': ['var']    # screen spread Y
})

In [None]:
fix_1_feature.columns = ['fix_count','fix_mean_dur','fix_max_dur','fix_total_time',
                        'fix_dur_var','fix_disp_mean','fix_disp_max',
                        'fix_x_var','fix_y_var']

In [None]:
fix_1_feature

In [None]:
sac_1_features = sac_1_df.groupby('QuestionKey').agg({
    'Saccade Duration': ['count','mean','sum'],
    'Saccade Amplitude': ['mean','max'],
    'Saccade Peak Velocity': ['mean','max'],
    'Saccade Peak Acceleration': ['mean'],
    'Saccade Peak Deceleration': ['mean'],
    'Saccade Direction': ['var']   # direction variance
})

In [None]:
sac_1_features.columns = ['sac_count','sac_mean_dur','sac_total_time',
                        'sac_amp_mean','sac_amp_max',
                        'sac_vel_mean','sac_vel_max',
                        'sac_acc_mean','sac_dec_mean','sac_dir_var']

In [None]:
sac_1_features

In [None]:
ivt_1_features = fix_1_feature.join(sac_1_features, how='outer').fillna(0)

In [None]:
ivt_1_features

In [None]:
ivt_1_features['fix_sac_count_ratio'] = ivt_1_features['fix_count'] / (ivt_1_features['sac_count']+1e-5)
ivt_1_features['fix_sac_time_ratio']  = ivt_1_features['fix_total_time'] / (ivt_1_features['sac_total_time']+1e-5)

In [None]:
ivt_1_features

# Aggregation of Fixation and Saccade Features

In the preceding code cells, we aggregated the `fix_1_df` and `sac_1_df` DataFrames, which hold the rows with a non-null `Fixation Duration` and `Saccade Duration`, respectively. The goal of this aggregation was to create a summary of eye-tracking metrics for each `QuestionKey`.

For fixations, we calculated:
- Count of fixations (`fix_count`)
- Mean, max, sum, and variance of fixation duration (`fix_mean_dur`, `fix_max_dur`, `fix_total_time`, `fix_dur_var`)
- Mean and max of fixation dispersion (`fix_disp_mean`, `fix_disp_max`)
- Variance of fixation X and Y coordinates (`fix_x_var`, `fix_y_var`) to represent screen spread.

For saccades, we calculated:
- Count of saccades (`sac_count`)
- Mean and sum of saccade duration (`sac_mean_dur`, `sac_total_time`)
- Mean and max of saccade amplitude (`sac_amp_mean`, `sac_amp_max`)
- Mean and max of saccade peak velocity (`sac_vel_mean`, `sac_vel_max`)
- Mean of saccade peak acceleration and deceleration (`sac_acc_mean`, `sac_dec_mean`)
- Variance of saccade direction (`sac_dir_var`).

Finally, we joined these aggregated fixation and saccade features into a single DataFrame called `ivt_1_features` with an outer join on the `QuestionKey` index, and filled the resulting missing values (from `QuestionKey` values that have only fixations or only saccades) with 0. We then added two ratio features, `fix_sac_count_ratio` and `fix_sac_time_ratio`, with a small constant (1e-5) in the denominator to avoid division by zero. The `ivt_1_features` DataFrame now provides a consolidated per-question summary of key eye-tracking characteristics that can be used for further analysis or modeling.
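
Since the same aggregation is applied to every recording, it can be wrapped in one function. The sketch below mirrors the cells above using pandas named aggregation; the function name `ivt_features`, the `eps` parameter, and the synthetic frame are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

def ivt_features(df: pd.DataFrame, key: str = 'QuestionKey', eps: float = 1e-5) -> pd.DataFrame:
    """Per-question fixation and saccade summary, mirroring the cells above."""
    fix = df.dropna(subset=['Fixation Duration']).groupby(key).agg(
        fix_count=('Fixation Duration', 'count'),
        fix_mean_dur=('Fixation Duration', 'mean'),
        fix_max_dur=('Fixation Duration', 'max'),
        fix_total_time=('Fixation Duration', 'sum'),
        fix_dur_var=('Fixation Duration', 'var'),
        fix_disp_mean=('Fixation Dispersion', 'mean'),
        fix_disp_max=('Fixation Dispersion', 'max'),
        fix_x_var=('Fixation X', 'var'),
        fix_y_var=('Fixation Y', 'var'),
    )
    sac = df.dropna(subset=['Saccade Duration']).groupby(key).agg(
        sac_count=('Saccade Duration', 'count'),
        sac_mean_dur=('Saccade Duration', 'mean'),
        sac_total_time=('Saccade Duration', 'sum'),
        sac_amp_mean=('Saccade Amplitude', 'mean'),
        sac_amp_max=('Saccade Amplitude', 'max'),
        sac_vel_mean=('Saccade Peak Velocity', 'mean'),
        sac_vel_max=('Saccade Peak Velocity', 'max'),
        sac_acc_mean=('Saccade Peak Acceleration', 'mean'),
        sac_dec_mean=('Saccade Peak Deceleration', 'mean'),
        sac_dir_var=('Saccade Direction', 'var'),
    )
    # Outer join keeps questions that have only fixations or only saccades;
    # fillna(0) also zeroes variances that are NaN because a group has one event.
    out = fix.join(sac, how='outer').fillna(0)
    out['fix_sac_count_ratio'] = out['fix_count'] / (out['sac_count'] + eps)
    out['fix_sac_time_ratio'] = out['fix_total_time'] / (out['sac_total_time'] + eps)
    return out

# Synthetic frame: Q1 has two fixation rows and one saccade row, Q2 only fixations
nan = float('nan')
demo = pd.DataFrame({
    'QuestionKey': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2'],
    'Fixation Duration': [200.0, 300.0, nan, 100.0, 150.0],
    'Fixation Dispersion': [0.5, 0.7, nan, 0.4, 0.6],
    'Fixation X': [100.0, 120.0, nan, 500.0, 520.0],
    'Fixation Y': [200.0, 210.0, nan, 300.0, 310.0],
    'Saccade Duration': [nan, nan, 40.0, nan, nan],
    'Saccade Amplitude': [nan, nan, 3.0, nan, nan],
    'Saccade Peak Velocity': [nan, nan, 250.0, nan, nan],
    'Saccade Peak Acceleration': [nan, nan, 9000.0, nan, nan],
    'Saccade Peak Deceleration': [nan, nan, -8000.0, nan, nan],
    'Saccade Direction': [nan, nan, 45.0, nan, nan],
})
features = ivt_features(demo)
print(features[['fix_count', 'sac_count', 'fix_sac_count_ratio']])
```

Note that with `eps` in the denominator, a question without saccades gets a very large ratio rather than an undefined one, which may need separate handling before modeling.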

# **12_IVT**

In [None]:
df_12_IVT = pd.read_csv('data/STData/12/12_IVT.csv')

In [None]:
df_12_IVT.head()

In [None]:
df_12_IVT.columns

In [None]:
df_12_IVT.shape

In [None]:
df_12_IVT.info()

In [None]:
df_12_IVT.isnull().sum()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_12_IVT.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.

In [None]:
df_12_IVT['QuestionKey'].unique()

In [None]:
df_12_IVT['Timestamp'] = pd.to_datetime(df_12_IVT['Timestamp'])

In [None]:
df_12_IVT.head(3)

In [None]:
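# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated chained assignment and may not update the DataFrame.
df_12_IVT['QuestionKey'] = df_12_IVT['QuestionKey'].fillna('None')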

In [None]:
df_12_IVT['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_12_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_12_IVT.isnull().sum()

In [None]:
df_12_IVT.head()

In [None]:
df_12_IVT['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_12_IVT['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_12_IVT.drop('Row', axis=1, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

sns.scatterplot(data=df_12_IVT, x='Gaze X', y='Gaze Y', ax=axes[0])
axes[0].set_title('Gaze X vs Gaze Y')

sns.scatterplot(data=df_12_IVT, x='Interpolated Gaze X', y='Interpolated Gaze Y', ax=axes[1])
axes[1].set_title('Interpolated Gaze X vs Interpolated Gaze Y')

plt.tight_layout()
plt.show()

# Gaze and Interpolated Gaze Scatter Plots

The scatter plots above visualize the relationship between the x and y coordinates of both the raw gaze data and the interpolated gaze data.

- **Gaze X vs Gaze Y:** This plot shows the raw gaze coordinates. The scattered points indicate the locations on the screen where the participant was looking. The density of points in certain areas might suggest regions of interest.
- **Interpolated Gaze X vs Interpolated Gaze Y:** This plot shows the interpolated gaze coordinates. Interpolation is often used to fill in gaps in the raw gaze data, providing a smoother representation of the gaze path. Comparing this plot to the raw gaze plot can show the effect of the interpolation process.

Both plots can help in understanding the distribution of gaze points across the screen and identifying potential patterns or biases in eye movements.

In [None]:
df_12_IVT.describe()

In [None]:
df_12_IVT.head(3)

In [None]:
df_12_IVT['Timestamp'] = pd.to_datetime(df_12_IVT['Timestamp'])

In [None]:
df_12_IVT.columns

In [None]:
cols = ['Gaze X', 'Gaze Y',
       'Interpolated Gaze X', 'Interpolated Gaze Y', 'Interpolated Distance',
       'Gaze Velocity', 'Gaze Acceleration', 'Fixation Index',
       'Fixation Index by Stimulus', 'Fixation X', 'Fixation Y',
       'Fixation Start', 'Fixation End', 'Fixation Duration',
       'Fixation Dispersion', 'Saccade Index', 'Saccade Index by Stimulus',
       'Saccade Start', 'Saccade End', 'Saccade Duration', 'Saccade Amplitude',
       'Saccade Peak Velocity', 'Saccade Peak Acceleration',
       'Saccade Peak Deceleration', 'Saccade Direction']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    sns.lineplot(x=df_12_IVT['Timestamp'], y=df_12_IVT[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

In [None]:
df_12_IVT.head()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_12_IVT[['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus']].isnull(), cmap='viridis')
plt.show()

# Observation

The `Fixation Index`, `Fixation Index by Stimulus`, `Saccade Index` and `Saccade Index by Stimulus` columns are essentially just sequence numbers for identified events. While they indicate the order of fixations and saccades, they don't provide meaningful features for a machine learning model attempting to predict or classify eye movement patterns. Therefore, we will drop these columns as they are not useful for model building.
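
One caveat worth checking before the drop: if the export has one row per gaze sample, so that a single fixation spans several rows (an assumption about the file layout, not something confirmed here), then the index column is the only way to tell distinct events apart. The synthetic sketch below contrasts the number of rows with the number of distinct fixations per question.

```python
import pandas as pd

nan = float('nan')
# Synthetic sample-level data: fixation 1 spans three rows, fixation 3 spans two
demo = pd.DataFrame({
    'QuestionKey': ['Q1'] * 5 + ['Q2'] * 3,
    'Fixation Index': [1, 1, 1, nan, 2, 3, 3, nan],
    'Fixation Duration': [120.0, 120.0, 120.0, nan, 80.0, 60.0, 60.0, nan],
})

per_question = demo.groupby('QuestionKey')['Fixation Index'].agg(
    rows='count',      # non-null rows, i.e. samples inside a fixation
    events='nunique',  # distinct fixations
)
print(per_question)
```

If `rows` and `events` differ on the real data, counts and means computed per row after the drop describe samples rather than fixations.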

In [None]:
df_12_IVT.drop(['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(14,10))
sns.scatterplot(data=df_12_IVT, x='Fixation X', y='Fixation Y')
plt.title('Fixation X vs Fixation Y')
plt.show()

In [None]:
df_12_IVT['Fixation Start'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_12_IVT['Fixation Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Fixation Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Fixation Start')

sns.histplot(df_12_IVT['Fixation End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Fixation End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Fixation End')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_12_IVT['Saccade Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Saccade Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Saccade Start')

sns.histplot(df_12_IVT['Saccade End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Saccade End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Saccade End')

plt.tight_layout()
plt.show()

# Observation on Fixation and Saccade Timestamps

Upon examining the time series plots of `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` against `Timestamp`, we observe a clear linear, diagonal pattern. This indicates that these values are largely sequential and track the elapsed recording time.

Furthermore, the histograms of these features show distributions that, while informative about the timing of events, don't necessarily reveal complex patterns that would be highly predictive for a machine learning model.

Crucially, the dataset already contains `Fixation Duration` and `Saccade Duration` columns. These duration features capture the length of each event, which is often a more directly relevant metric for understanding eye movement behavior than the absolute start and end times. Since the duration can be derived from the start and end times (Duration = End - Start), the start and end time columns introduce redundancy and do not provide substantial additional, independent information for modeling purposes.

Therefore, to simplify the dataset and focus on the most informative features for eye movement analysis or machine learning modeling, we will drop the `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` columns.

In [None]:
df_12_IVT.drop(['Fixation Start', 'Fixation End', 'Saccade Start', 'Saccade End'], axis=1, inplace=True)

In [None]:
df_12_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_12_IVT[['Gaze X', 'Gaze Y', 'Interpolated Gaze X', 'Interpolated Gaze Y']].isnull(), cmap='viridis')
plt.show()


# Observations on Gaze and Interpolated Gaze Data

Based on the scatter plots of `Gaze X` vs `Gaze Y` and `Interpolated Gaze X` vs `Interpolated Gaze Y`, the distributions of the raw and interpolated gaze points appear very similar. The spatial patterns of where the participant was looking are consistent between the two sets of coordinates.

Furthermore, the heatmap of null values for these columns (`Gaze X`, `Gaze Y`, `Interpolated Gaze X`, `Interpolated Gaze Y`) shows missing values in the same rows for both the raw and interpolated gaze coordinates. This suggests that the interpolation process did not fill in the gaps in the raw gaze data for these specific instances.

Since the interpolated gaze data shows the same spatial distribution and the same pattern of null values as the raw gaze data, the interpolation does not appear to have altered or completed the data in this case. Keeping both sets of columns would be redundant, so we drop `Interpolated Gaze X` and `Interpolated Gaze Y` and keep the raw coordinates.

In [None]:
df_12_IVT.drop(['Interpolated Gaze X', 'Interpolated Gaze Y'], axis=1, inplace=True)

In [None]:
df_12_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_12_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_12_IVT.columns

In [None]:
fix_1_df = df_12_IVT.dropna(subset=['Fixation Duration'])
sac_1_df = df_12_IVT.dropna(subset=['Saccade Duration'])

In [None]:
fix_1_df.shape

In [None]:
sac_1_df.shape

In [None]:
fix_1_feature = fix_1_df.groupby('QuestionKey').agg({
    'Fixation Duration': ['count','mean','max','sum','var'],
    'Fixation Dispersion': ['mean','max'],
    'Fixation X': ['var'],   # screen spread X
    'Fixation Y': ['var']    # screen spread Y
})

In [None]:
fix_1_feature.columns = ['fix_count','fix_mean_dur','fix_max_dur','fix_total_time',
                        'fix_dur_var','fix_disp_mean','fix_disp_max',
                        'fix_x_var','fix_y_var']

In [None]:
fix_1_feature

In [None]:
sac_1_features = sac_1_df.groupby('QuestionKey').agg({
    'Saccade Duration': ['count','mean','sum'],
    'Saccade Amplitude': ['mean','max'],
    'Saccade Peak Velocity': ['mean','max'],
    'Saccade Peak Acceleration': ['mean'],
    'Saccade Peak Deceleration': ['mean'],
    'Saccade Direction': ['var']   # direction variance
})

In [None]:
sac_1_features.columns = ['sac_count','sac_mean_dur','sac_total_time',
                        'sac_amp_mean','sac_amp_max',
                        'sac_vel_mean','sac_vel_max',
                        'sac_acc_mean','sac_dec_mean','sac_dir_var']

In [None]:
sac_1_features

In [None]:
ivt_1_features = fix_1_feature.join(sac_1_features, how='outer').fillna(0)

In [None]:
ivt_1_features

In [None]:
ivt_1_features['fix_sac_count_ratio'] = ivt_1_features['fix_count'] / (ivt_1_features['sac_count']+1e-5)
ivt_1_features['fix_sac_time_ratio']  = ivt_1_features['fix_total_time'] / (ivt_1_features['sac_total_time']+1e-5)

In [None]:
ivt_1_features

# Aggregation of Fixation and Saccade Features

In the preceding code cells, we aggregated the `fix_1_df` and `sac_1_df` DataFrames, which hold the rows with a non-null `Fixation Duration` and `Saccade Duration`, respectively. The goal of this aggregation was to create a summary of eye-tracking metrics for each `QuestionKey`.

For fixations, we calculated:
- Count of fixations (`fix_count`)
- Mean, max, sum, and variance of fixation duration (`fix_mean_dur`, `fix_max_dur`, `fix_total_time`, `fix_dur_var`)
- Mean and max of fixation dispersion (`fix_disp_mean`, `fix_disp_max`)
- Variance of fixation X and Y coordinates (`fix_x_var`, `fix_y_var`) to represent screen spread.

For saccades, we calculated:
- Count of saccades (`sac_count`)
- Mean and sum of saccade duration (`sac_mean_dur`, `sac_total_time`)
- Mean and max of saccade amplitude (`sac_amp_mean`, `sac_amp_max`)
- Mean and max of saccade peak velocity (`sac_vel_mean`, `sac_vel_max`)
- Mean of saccade peak acceleration and deceleration (`sac_acc_mean`, `sac_dec_mean`)
- Variance of saccade direction (`sac_dir_var`).

Finally, we joined these aggregated fixation and saccade features into a single DataFrame called `ivt_1_features` with an outer join on the `QuestionKey` index, and filled the resulting missing values (from `QuestionKey` values that have only fixations or only saccades) with 0. We then added two ratio features, `fix_sac_count_ratio` and `fix_sac_time_ratio`, with a small constant (1e-5) in the denominator to avoid division by zero. The `ivt_1_features` DataFrame now provides a consolidated per-question summary of key eye-tracking characteristics that can be used for further analysis or modeling.
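
Because the names `fix_1_df`, `sac_1_df`, and `ivt_1_features` are reused for each recording, every section overwrites the result of the previous one. One option is to store each feature table under its recording label and stack them at the end. The sketch below assumes the numbered folders (`11`, `12`, ...) identify separate participants or sessions; the tables and their values are synthetic stand-ins for the real feature tables.

```python
import pandas as pd

def fake_features(fix_counts, sac_counts):
    """Build a tiny stand-in for one recording's per-question feature table."""
    index = pd.Index(['None', 'Q1'], name='QuestionKey')
    return pd.DataFrame({'fix_count': fix_counts, 'sac_count': sac_counts}, index=index)

# One entry per recording, keyed by its label
per_recording = {
    '11': fake_features([12, 30], [5.0, 11.0]),
    '12': fake_features([8, 25], [3.0, 9.0]),
}

# The dict keys become an outer index level named 'Participant'
all_features = pd.concat(per_recording, names=['Participant'])
print(all_features)
```

The stacked table keeps `QuestionKey` as the inner index level, so rows can still be selected per question with `all_features.xs('Q1', level='QuestionKey')`.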

# **13_IVT**

In [None]:
df_13_IVT = pd.read_csv('data/STData/13/13_IVT.csv')

In [None]:
df_13_IVT.head()

In [None]:
df_13_IVT.columns

In [None]:
df_13_IVT.shape

In [None]:
df_13_IVT.info()

In [None]:
df_13_IVT.isnull().sum()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_13_IVT.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.

In [None]:
df_13_IVT['QuestionKey'].unique()

In [None]:
df_13_IVT['Timestamp'] = pd.to_datetime(df_13_IVT['Timestamp'])

In [None]:
df_13_IVT.head(3)

In [None]:
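# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated chained assignment and may not update the DataFrame.
df_13_IVT['QuestionKey'] = df_13_IVT['QuestionKey'].fillna('None')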

In [None]:
df_13_IVT['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_13_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_13_IVT.isnull().sum()

In [None]:
df_13_IVT.head()

In [None]:
df_13_IVT['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_13_IVT['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_13_IVT.drop('Row', axis=1, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

sns.scatterplot(data=df_13_IVT, x='Gaze X', y='Gaze Y', ax=axes[0])
axes[0].set_title('Gaze X vs Gaze Y')

sns.scatterplot(data=df_13_IVT, x='Interpolated Gaze X', y='Interpolated Gaze Y', ax=axes[1])
axes[1].set_title('Interpolated Gaze X vs Interpolated Gaze Y')

plt.tight_layout()
plt.show()

# Gaze and Interpolated Gaze Scatter Plots

The scatter plots above visualize the relationship between the x and y coordinates of both the raw gaze data and the interpolated gaze data.

- **Gaze X vs Gaze Y:** This plot shows the raw gaze coordinates. The scattered points indicate the locations on the screen where the participant was looking. The density of points in certain areas might suggest regions of interest.
- **Interpolated Gaze X vs Interpolated Gaze Y:** This plot shows the interpolated gaze coordinates. Interpolation is often used to fill in gaps in the raw gaze data, providing a smoother representation of the gaze path. Comparing this plot to the raw gaze plot can show the effect of the interpolation process.

Both plots can help in understanding the distribution of gaze points across the screen and identifying potential patterns or biases in eye movements.

In [None]:
df_13_IVT.describe()

In [None]:
df_13_IVT.head(3)

In [None]:
df_13_IVT['Timestamp'] = pd.to_datetime(df_13_IVT['Timestamp'])

In [None]:
df_13_IVT.columns

In [None]:
cols = ['Gaze X', 'Gaze Y',
       'Interpolated Gaze X', 'Interpolated Gaze Y', 'Interpolated Distance',
       'Gaze Velocity', 'Gaze Acceleration', 'Fixation Index',
       'Fixation Index by Stimulus', 'Fixation X', 'Fixation Y',
       'Fixation Start', 'Fixation End', 'Fixation Duration',
       'Fixation Dispersion', 'Saccade Index', 'Saccade Index by Stimulus',
       'Saccade Start', 'Saccade End', 'Saccade Duration', 'Saccade Amplitude',
       'Saccade Peak Velocity', 'Saccade Peak Acceleration',
       'Saccade Peak Deceleration', 'Saccade Direction']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    sns.lineplot(x=df_13_IVT['Timestamp'], y=df_13_IVT[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

In [None]:
df_13_IVT.head()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_13_IVT[['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus']].isnull(), cmap='viridis')
plt.show()

# Observation

The `Fixation Index`, `Fixation Index by Stimulus`, `Saccade Index` and `Saccade Index by Stimulus` columns are essentially just sequence numbers for identified events. While they indicate the order of fixations and saccades, they don't provide meaningful features for a machine learning model attempting to predict or classify eye movement patterns. Therefore, we will drop these columns as they are not useful for model building.

In [None]:
df_13_IVT.drop(['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(14,10))
sns.scatterplot(data=df_13_IVT, x='Fixation X', y='Fixation Y')
plt.title('Fixation X vs Fixation Y')
plt.show()

In [None]:
df_13_IVT['Fixation Start'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_13_IVT['Fixation Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Fixation Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Fixation Start')

sns.histplot(df_13_IVT['Fixation End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Fixation End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Fixation End')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_13_IVT['Saccade Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Saccade Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Saccade Start')

sns.histplot(df_13_IVT['Saccade End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Saccade End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Saccade End')

plt.tight_layout()
plt.show()

# Observation on Fixation and Saccade Timestamps

Upon examining the time series plots of `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` against `Timestamp`, we observe a clear linear, diagonal pattern. This indicates that these values are largely sequential and track the elapsed recording time.

Furthermore, the histograms of these features show distributions that, while informative about the timing of events, don't necessarily reveal complex patterns that would be highly predictive for a machine learning model.

Crucially, the dataset already contains `Fixation Duration` and `Saccade Duration` columns. These duration features capture the length of each event, which is often a more directly relevant metric for understanding eye movement behavior than the absolute start and end times. Since the duration can be derived from the start and end times (Duration = End - Start), the start and end time columns introduce redundancy and do not provide substantial additional, independent information for modeling purposes.

Therefore, to simplify the dataset and focus on the most informative features for eye movement analysis or machine learning modeling, we will drop the `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` columns.

In [None]:
df_13_IVT.drop(['Fixation Start', 'Fixation End', 'Saccade Start', 'Saccade End'], axis=1, inplace=True)

In [None]:
df_13_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_13_IVT[['Gaze X', 'Gaze Y', 'Interpolated Gaze X', 'Interpolated Gaze Y']].isnull(), cmap='viridis')
plt.show()


# Observations on Gaze and Interpolated Gaze Data

Based on the scatter plots of `Gaze X` vs `Gaze Y` and `Interpolated Gaze X` vs `Interpolated Gaze Y`, the distributions of the raw and interpolated gaze points appear very similar. The spatial patterns of where the participant was looking are consistent between the two sets of coordinates.

Furthermore, the heatmap of null values for `Gaze X`, `Gaze Y`, `Interpolated Gaze X`, and `Interpolated Gaze Y` shows missing values in the same rows for both the raw and interpolated coordinates. This suggests that the interpolation did not fill the gaps in the raw gaze data for these rows.

Because the interpolated gaze data shows the same spatial distribution and the same null pattern as the raw gaze data, the interpolation does not appear to have altered or completed the data in this case. Keeping both sets of columns is therefore redundant, so we drop the interpolated columns to simplify the dataset without losing significant information.
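
The heatmap comparison is visual, so a numeric check of the same claim may be useful. The following is a minimal sketch on invented values, not on the real recording: it compares the null masks of the raw and interpolated columns and counts how many raw gaps the interpolation actually filled.

```python
import numpy as np
import pandas as pd

# Invented values; in this toy frame the interpolation fills nothing.
toy = pd.DataFrame({
    'Gaze X': [100.0, np.nan, 120.0, np.nan],
    'Gaze Y': [200.0, np.nan, 210.0, np.nan],
    'Interpolated Gaze X': [100.0, np.nan, 120.0, np.nan],
    'Interpolated Gaze Y': [200.0, np.nan, 210.0, np.nan],
})

# A row counts as missing if either coordinate of the pair is null.
raw_missing = toy[['Gaze X', 'Gaze Y']].isnull().any(axis=1)
interp_missing = toy[['Interpolated Gaze X', 'Interpolated Gaze Y']].isnull().any(axis=1)

# True only if both pairs are missing in exactly the same rows.
same_pattern = bool((raw_missing == interp_missing).all())

# Rows where the raw sample is missing but the interpolated one is present.
filled_by_interpolation = int((raw_missing & ~interp_missing).sum())
print(same_pattern, filled_by_interpolation)
```

A non-zero `filled_by_interpolation` on the real data would mean the interpolated columns do add samples, which would argue for keeping them instead of the raw pair.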

In [None]:
df_13_IVT.drop(['Interpolated Gaze X', 'Interpolated Gaze Y'], axis=1, inplace=True)

In [None]:
df_13_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_13_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_13_IVT.columns

In [None]:
fix_1_df = df_13_IVT.dropna(subset=['Fixation Duration'])
sac_1_df = df_13_IVT.dropna(subset=['Saccade Duration'])

In [None]:
fix_1_df.shape

In [None]:
sac_1_df.shape

In [None]:
fix_1_feature = fix_1_df.groupby('QuestionKey').agg({
    'Fixation Duration': ['count','mean','max','sum','var'],
    'Fixation Dispersion': ['mean','max'],
    'Fixation X': ['var'],   # screen spread X
    'Fixation Y': ['var']    # screen spread Y
})

In [None]:
fix_1_feature.columns = ['fix_count','fix_mean_dur','fix_max_dur','fix_total_time',
                        'fix_dur_var','fix_disp_mean','fix_disp_max',
                        'fix_x_var','fix_y_var']

In [None]:
fix_1_feature

In [None]:
sac_1_features = sac_1_df.groupby('QuestionKey').agg({
    'Saccade Duration': ['count','mean','sum'],
    'Saccade Amplitude': ['mean','max'],
    'Saccade Peak Velocity': ['mean','max'],
    'Saccade Peak Acceleration': ['mean'],
    'Saccade Peak Deceleration': ['mean'],
    'Saccade Direction': ['var']   # direction variance
})

In [None]:
sac_1_features.columns = ['sac_count','sac_mean_dur','sac_total_time',
                        'sac_amp_mean','sac_amp_max',
                        'sac_vel_mean','sac_vel_max',
                        'sac_acc_mean','sac_dec_mean','sac_dir_var']

In [None]:
sac_1_features

In [None]:
ivt_1_features = fix_1_feature.join(sac_1_features, how='outer').fillna(0)

In [None]:
ivt_1_features

In [None]:
ivt_1_features['fix_sac_count_ratio'] = ivt_1_features['fix_count'] / (ivt_1_features['sac_count']+1e-5)
ivt_1_features['fix_sac_time_ratio']  = ivt_1_features['fix_total_time'] / (ivt_1_features['sac_total_time']+1e-5)

In [None]:
ivt_1_features

# Aggregation of Fixation and Saccade Features

In the preceding code cells, we performed aggregation on the `fix_1_df` and `sac_1_df` DataFrames, which contain the cleaned fixation and saccade data, respectively. The goal of this aggregation was to create a summary of eye-tracking metrics for each `QuestionKey`.

For fixations, we calculated:
- Count of fixations (`fix_count`)
- Mean, max, sum, and variance of fixation duration (`fix_mean_dur`, `fix_max_dur`, `fix_total_time`, `fix_dur_var`)
- Mean and max of fixation dispersion (`fix_disp_mean`, `fix_disp_max`)
- Variance of fixation X and Y coordinates (`fix_x_var`, `fix_y_var`) to represent screen spread.

For saccades, we calculated:
- Count of saccades (`sac_count`)
- Mean and sum of saccade duration (`sac_mean_dur`, `sac_total_time`)
- Mean and max of saccade amplitude (`sac_amp_mean`, `sac_amp_max`)
- Mean and max of saccade peak velocity (`sac_vel_mean`, `sac_vel_max`)
- Mean of saccade peak acceleration and deceleration (`sac_acc_mean`, `sac_dec_mean`)
- Variance of saccade direction (`sac_dir_var`).

Finally, we joined the aggregated fixation and saccade features into a single DataFrame, `ivt_1_features`, indexed by `QuestionKey`. The outer join keeps every `QuestionKey` that has fixations, saccades, or both, and the remaining missing values are filled with 0. These arise when a `QuestionKey` has only one of the two event types, and also in the variance columns when a group contains a single event, since the sample variance of one value is undefined. `ivt_1_features` therefore gives a consolidated per-question summary of eye-tracking characteristics for further analysis or modeling.
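
The cells above aggregate with a dictionary and then rename the resulting columns by position, which silently mislabels features if the dictionary order ever changes. One alternative is pandas named aggregation, sketched here on an invented three-row frame with only a subset of the features; the frame name `toy_fix` and its values are illustrative assumptions.

```python
import pandas as pd

# Invented fixation rows, one row per fixation.
toy_fix = pd.DataFrame({
    'QuestionKey': ['Q1', 'Q1', 'Q2'],
    'Fixation Duration': [200.0, 300.0, 150.0],
    'Fixation Dispersion': [0.5, 0.7, 0.4],
})

# Each output column is named next to the (input column, function) pair it comes from.
fix_features = toy_fix.groupby('QuestionKey').agg(
    fix_count=('Fixation Duration', 'count'),
    fix_mean_dur=('Fixation Duration', 'mean'),
    fix_total_time=('Fixation Duration', 'sum'),
    fix_disp_max=('Fixation Dispersion', 'max'),
)
print(fix_features)
```

The result has flat column names from the start, so no separate renaming cell is needed.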

# **14_IVT**

In [None]:
df_14_IVT = pd.read_csv('data/STData/14/14_IVT.csv')

In [None]:
df_14_IVT.head()

In [None]:
df_14_IVT.columns

In [None]:
df_14_IVT.shape

In [None]:
df_14_IVT.info()

In [None]:
df_14_IVT.isnull().sum()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_14_IVT.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
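
To examine the interval pattern noted above, the null rows can be grouped into contiguous runs. This is a minimal sketch on an invented `QuestionKey` column; the helper names `run_id` and `null_runs` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Invented sequence: two rows of Q1, a two-row gap, Q2, a one-row gap, Q3.
toy = pd.DataFrame({
    'QuestionKey': ['Q1', 'Q1', np.nan, np.nan, 'Q2', np.nan, 'Q3'],
})

is_null = toy['QuestionKey'].isnull()

# A new run starts whenever the null flag differs from the previous row.
run_id = (is_null != is_null.shift()).cumsum()

# Keep only the null rows and summarise each run by first row and length.
rows = pd.DataFrame({
    'run': run_id[is_null].to_numpy(),
    'row': toy.index[is_null].to_numpy(),
})
null_runs = rows.groupby('run')['row'].agg(first_row='min', n_rows='count')
print(null_runs)
```

On the real recording, a small number of long runs would support the reading that nulls mark periods between questions, whereas many short scattered runs would point to genuinely missing labels.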

In [None]:
df_14_IVT['QuestionKey'].unique()

In [None]:
df_14_IVT['Timestamp'] = pd.to_datetime(df_14_IVT['Timestamp'])

In [None]:
df_14_IVT.head(3)

In [None]:
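# Assign the result back; calling fillna(..., inplace=True) on a column selection is deprecated in recent pandas.
df_14_IVT['QuestionKey'] = df_14_IVT['QuestionKey'].fillna('None')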

In [None]:
df_14_IVT['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_14_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_14_IVT.isnull().sum()

In [None]:
df_14_IVT.head()

In [None]:
df_14_IVT['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_14_IVT['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.
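
The claim that `Row` is a plain index can be verified directly instead of read off the histogram. The sketch below runs on an invented frame and checks two properties: that the values are unique and that they increase by exactly 1 from row to row.

```python
import pandas as pd

# Invented frame with a Row column that behaves like a running index.
toy = pd.DataFrame({
    'Row': [1, 2, 3, 4, 5],
    'Gaze X': [10.0, 11.0, 12.0, 13.0, 14.0],
})

row = toy['Row']
is_unique = row.is_unique

# diff() is NaN for the first row, so drop it before testing the step size.
is_consecutive = bool((row.diff().dropna() == 1).all())

is_plain_index = is_unique and is_consecutive
print(is_plain_index)
```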

In [None]:
df_14_IVT.drop('Row', axis=1, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

sns.scatterplot(data=df_14_IVT, x='Gaze X', y='Gaze Y', ax=axes[0])
axes[0].set_title('Gaze X vs Gaze Y')

sns.scatterplot(data=df_14_IVT, x='Interpolated Gaze X', y='Interpolated Gaze Y', ax=axes[1])
axes[1].set_title('Interpolated Gaze X vs Interpolated Gaze Y')

plt.tight_layout()
plt.show()

# Gaze and Interpolated Gaze Scatter Plots

The scatter plots above visualize the relationship between the x and y coordinates of both the raw gaze data and the interpolated gaze data.

- **Gaze X vs Gaze Y:** This plot shows the raw gaze coordinates. The scattered points indicate the locations on the screen where the participant was looking. The density of points in certain areas might suggest regions of interest.
- **Interpolated Gaze X vs Interpolated Gaze Y:** This plot shows the interpolated gaze coordinates. Interpolation is often used to fill in gaps in the raw gaze data, providing a smoother representation of the gaze path. Comparing this plot to the raw gaze plot can show the effect of the interpolation process.

Both plots can help in understanding the distribution of gaze points across the screen and identifying potential patterns or biases in eye movements.
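
With many overlapping points, a scatter plot can hide where the density is highest. A binned count is one way to make the "regions of interest" idea concrete. The sketch below uses synthetic points and an assumed 1920x1080 coordinate range; neither comes from the recording.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic gaze points clustered around one screen location (invented data).
gaze_x = rng.normal(loc=960, scale=100, size=5000)
gaze_y = rng.normal(loc=540, scale=80, size=5000)

# Count samples per cell on a grid covering the assumed screen area.
counts, x_edges, y_edges = np.histogram2d(
    gaze_x, gaze_y, bins=(48, 27), range=[[0, 1920], [0, 1080]]
)

# Locate the densest cell and report its centre in screen coordinates.
ix, iy = np.unravel_index(np.argmax(counts), counts.shape)
densest_x = (x_edges[ix] + x_edges[ix + 1]) / 2
densest_y = (y_edges[iy] + y_edges[iy + 1]) / 2
print(densest_x, densest_y)
```

The `counts` array can also be passed to an image plot to obtain a density map of the same data.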

In [None]:
df_14_IVT.describe()

In [None]:
df_14_IVT.head(3)

In [None]:
df_14_IVT['Timestamp'] = pd.to_datetime(df_14_IVT['Timestamp'])

In [None]:
df_14_IVT.columns

In [None]:
cols = ['Gaze X', 'Gaze Y',
       'Interpolated Gaze X', 'Interpolated Gaze Y', 'Interpolated Distance',
       'Gaze Velocity', 'Gaze Acceleration', 'Fixation Index',
       'Fixation Index by Stimulus', 'Fixation X', 'Fixation Y',
       'Fixation Start', 'Fixation End', 'Fixation Duration',
       'Fixation Dispersion', 'Saccade Index', 'Saccade Index by Stimulus',
       'Saccade Start', 'Saccade End', 'Saccade Duration', 'Saccade Amplitude',
       'Saccade Peak Velocity', 'Saccade Peak Acceleration',
       'Saccade Peak Deceleration', 'Saccade Direction']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    sns.lineplot(x=df_14_IVT['Timestamp'], y=df_14_IVT[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

In [None]:
df_14_IVT.head()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_14_IVT[['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus']].isnull(), cmap='viridis')
plt.show()

# Observation

The `Fixation Index`, `Fixation Index by Stimulus`, `Saccade Index` and `Saccade Index by Stimulus` columns are essentially just sequence numbers for identified events. While they indicate the order of fixations and saccades, they don't provide meaningful features for a machine learning model attempting to predict or classify eye movement patterns. Therefore, we will drop these columns as they are not useful for model building.
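
Before dropping the index columns, it can be worth confirming that they are plain sequence numbers and checking how many sample rows each event spans. The toy frame below is built on the assumption that an event's values are repeated on every sample row it covers; whether the real export behaves this way is not confirmed here. If it does, collapsing to one row per event while the index is still available avoids counting the same fixation several times in later aggregations.

```python
import numpy as np
import pandas as pd

# Invented sample rows: fixation 1 spans three rows, fixation 2 two rows, fixation 3 one row.
toy = pd.DataFrame({
    'Fixation Index': [1, 1, 1, np.nan, 2, 2, np.nan, 3],
    'Fixation Duration': [150.0, 150.0, 150.0, np.nan, 100.0, 100.0, np.nan, 50.0],
})

# Sequence-number check: distinct index values increase in steps of 1.
index_values = toy['Fixation Index'].dropna().unique()
is_sequence = bool((np.diff(index_values) == 1).all())

# Number of sample rows that carry each event.
rows_per_event = toy.groupby('Fixation Index').size()

# One row per event, usable for per-event statistics.
one_row_per_event = (
    toy.dropna(subset=['Fixation Index'])
       .drop_duplicates(subset='Fixation Index')
)
print(is_sequence, rows_per_event.tolist(), len(one_row_per_event))
```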

In [None]:
df_14_IVT.drop(['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(14,10))
sns.scatterplot(data=df_14_IVT, x='Fixation X', y='Fixation Y')
plt.title('Fixation X vs Fixation Y')
plt.show()

In [None]:
df_14_IVT['Fixation Start'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_14_IVT['Fixation Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Fixation Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Fixation Start')

sns.histplot(df_14_IVT['Fixation End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Fixation End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Fixation End')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_14_IVT['Saccade Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Saccade Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Saccade Start')

sns.histplot(df_14_IVT['Saccade End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Saccade End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Saccade End')

plt.tight_layout()
plt.show()

# Observation on Fixation and Saccade Timestamps

The time series plots of `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` against `Timestamp` show a clear linear, diagonal pattern. These values increase with recording time, so they mainly encode when an event occurred rather than how the eye moved.

Furthermore, the histograms of these features show distributions that, while informative about the timing of events, don't necessarily reveal complex patterns that would be highly predictive for a machine learning model.

Crucially, the dataset already contains `Fixation Duration` and `Saccade Duration` columns. These duration features capture the length of each event, which is often a more directly relevant metric for understanding eye movement behavior than the absolute start and end times. Since the duration can be derived from the start and end times (Duration = End - Start), the start and end time columns introduce redundancy and do not provide substantial additional, independent information for modeling purposes.

Therefore, to simplify the dataset and focus on the most informative features for eye movement analysis or modeling, we will drop the `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` columns.
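
The diagonal pattern described above can be quantified instead of judged by eye. The sketch below computes the Pearson correlation between elapsed recording time and `Fixation Start` on an invented frame, and checks that the start values never decrease. The millisecond unit of the toy timestamps is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Invented samples at 100 ms spacing; one row has no fixation.
toy = pd.DataFrame({
    'Timestamp': pd.to_datetime([0, 100, 200, 300, 400], unit='ms'),
    'Fixation Start': [0.0, 0.0, 200.0, np.nan, 400.0],
})

# Elapsed time since the first sample, in milliseconds.
elapsed_ms = (toy['Timestamp'] - toy['Timestamp'].iloc[0]).dt.total_seconds() * 1000

# Restrict both series to rows that belong to a fixation.
valid = toy['Fixation Start'].notna()
r = elapsed_ms[valid].corr(toy.loc[valid, 'Fixation Start'])

# Non-strict monotonicity: repeated start values within one fixation are allowed.
is_non_decreasing = toy.loc[valid, 'Fixation Start'].is_monotonic_increasing
print(round(r, 3), is_non_decreasing)
```

A correlation close to 1 together with a non-decreasing series supports the view that the column is essentially a clock reading.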

In [None]:
df_14_IVT.drop(['Fixation Start', 'Fixation End', 'Saccade Start', 'Saccade End'], axis=1, inplace=True)

In [None]:
df_14_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_14_IVT[['Gaze X', 'Gaze Y', 'Interpolated Gaze X', 'Interpolated Gaze Y']].isnull(), cmap='viridis')
plt.show()


# Observations on Gaze and Interpolated Gaze Data

Based on the scatter plots of `Gaze X` vs `Gaze Y` and `Interpolated Gaze X` vs `Interpolated Gaze Y`, the distributions of the raw and interpolated gaze points appear very similar. The spatial patterns of where the participant was looking are consistent between the two sets of coordinates.

Furthermore, the heatmap of null values for `Gaze X`, `Gaze Y`, `Interpolated Gaze X`, and `Interpolated Gaze Y` shows missing values in the same rows for both the raw and interpolated coordinates. This suggests that the interpolation did not fill the gaps in the raw gaze data for these rows.

Because the interpolated gaze data shows the same spatial distribution and the same null pattern as the raw gaze data, the interpolation does not appear to have altered or completed the data in this case. Keeping both sets of columns is therefore redundant, so we drop the interpolated columns to simplify the dataset without losing significant information.
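
Similar-looking scatter plots do not show whether the raw and interpolated coordinates are numerically identical. The sketch below measures the largest absolute difference per axis on rows where both values are present. The data is invented, and in this toy frame the two sets agree exactly.

```python
import numpy as np
import pandas as pd

# Invented values; raw and interpolated coordinates are identical here.
toy = pd.DataFrame({
    'Gaze X': [100.0, np.nan, 120.0, 130.0],
    'Interpolated Gaze X': [100.0, np.nan, 120.0, 130.0],
    'Gaze Y': [200.0, np.nan, 210.0, 215.0],
    'Interpolated Gaze Y': [200.0, np.nan, 210.0, 215.0],
})

max_diff = {}
for axis in ['X', 'Y']:
    raw = toy[f'Gaze {axis}']
    interp = toy[f'Interpolated Gaze {axis}']
    # Compare only rows where both the raw and the interpolated value exist.
    both = raw.notna() & interp.notna()
    max_diff[axis] = float((raw[both] - interp[both]).abs().max())
print(max_diff)
```

A maximum difference of zero on the real data would confirm that the interpolated columns are exact copies of the raw ones wherever both exist.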

In [None]:
df_14_IVT.drop(['Interpolated Gaze X', 'Interpolated Gaze Y'], axis=1, inplace=True)

In [None]:
df_14_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_14_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_14_IVT.columns

In [None]:
fix_1_df = df_14_IVT.dropna(subset=['Fixation Duration'])
sac_1_df = df_14_IVT.dropna(subset=['Saccade Duration'])

In [None]:
fix_1_df.shape

In [None]:
sac_1_df.shape

In [None]:
fix_1_feature = fix_1_df.groupby('QuestionKey').agg({
    'Fixation Duration': ['count','mean','max','sum','var'],
    'Fixation Dispersion': ['mean','max'],
    'Fixation X': ['var'],   # screen spread X
    'Fixation Y': ['var']    # screen spread Y
})

In [None]:
fix_1_feature.columns = ['fix_count','fix_mean_dur','fix_max_dur','fix_total_time',
                        'fix_dur_var','fix_disp_mean','fix_disp_max',
                        'fix_x_var','fix_y_var']

In [None]:
fix_1_feature

In [None]:
sac_1_features = sac_1_df.groupby('QuestionKey').agg({
    'Saccade Duration': ['count','mean','sum'],
    'Saccade Amplitude': ['mean','max'],
    'Saccade Peak Velocity': ['mean','max'],
    'Saccade Peak Acceleration': ['mean'],
    'Saccade Peak Deceleration': ['mean'],
    'Saccade Direction': ['var']   # direction variance
})

In [None]:
sac_1_features.columns = ['sac_count','sac_mean_dur','sac_total_time',
                        'sac_amp_mean','sac_amp_max',
                        'sac_vel_mean','sac_vel_max',
                        'sac_acc_mean','sac_dec_mean','sac_dir_var']

In [None]:
sac_1_features

In [None]:
ivt_1_features = fix_1_feature.join(sac_1_features, how='outer').fillna(0)

In [None]:
ivt_1_features

In [None]:
ivt_1_features['fix_sac_count_ratio'] = ivt_1_features['fix_count'] / (ivt_1_features['sac_count']+1e-5)
ivt_1_features['fix_sac_time_ratio']  = ivt_1_features['fix_total_time'] / (ivt_1_features['sac_total_time']+1e-5)

In [None]:
ivt_1_features

# Aggregation of Fixation and Saccade Features

In the preceding code cells, we performed aggregation on the `fix_1_df` and `sac_1_df` DataFrames, which contain the cleaned fixation and saccade data, respectively. The goal of this aggregation was to create a summary of eye-tracking metrics for each `QuestionKey`.

For fixations, we calculated:
- Count of fixations (`fix_count`)
- Mean, max, sum, and variance of fixation duration (`fix_mean_dur`, `fix_max_dur`, `fix_total_time`, `fix_dur_var`)
- Mean and max of fixation dispersion (`fix_disp_mean`, `fix_disp_max`)
- Variance of fixation X and Y coordinates (`fix_x_var`, `fix_y_var`) to represent screen spread.

For saccades, we calculated:
- Count of saccades (`sac_count`)
- Mean and sum of saccade duration (`sac_mean_dur`, `sac_total_time`)
- Mean and max of saccade amplitude (`sac_amp_mean`, `sac_amp_max`)
- Mean and max of saccade peak velocity (`sac_vel_mean`, `sac_vel_max`)
- Mean of saccade peak acceleration and deceleration (`sac_acc_mean`, `sac_dec_mean`)
- Variance of saccade direction (`sac_dir_var`).

Finally, we joined the aggregated fixation and saccade features into a single DataFrame, `ivt_1_features`, indexed by `QuestionKey`. The outer join keeps every `QuestionKey` that has fixations, saccades, or both, and the remaining missing values are filled with 0. These arise when a `QuestionKey` has only one of the two event types, and also in the variance columns when a group contains a single event, since the sample variance of one value is undefined. `ivt_1_features` therefore gives a consolidated per-question summary of eye-tracking characteristics for further analysis or modeling.
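
Because missing saccade counts are filled with 0, the ratio cells divide by `0 + 1e-5` for any question without saccades, which produces a very large number instead of flagging the ratio as undefined. The sketch below contrasts the two behaviours on an invented two-row frame; the column names `ratio_eps` and `ratio_nan` are illustrative.

```python
import numpy as np
import pandas as pd

# Invented per-question counts; Q2 has fixations but no saccades.
toy = pd.DataFrame(
    {'fix_count': [10, 4], 'sac_count': [5, 0]},
    index=pd.Index(['Q1', 'Q2'], name='QuestionKey'),
)

# Epsilon version, as in the cells above: a zero denominator yields a huge ratio.
toy['ratio_eps'] = toy['fix_count'] / (toy['sac_count'] + 1e-5)

# Alternative: mask zero denominators so the ratio becomes NaN instead.
safe_denominator = toy['sac_count'].where(toy['sac_count'] > 0)
toy['ratio_nan'] = toy['fix_count'] / safe_denominator
print(toy)
```

Which variant is preferable depends on the downstream model: the epsilon form keeps the column free of NaN, while the masked form avoids outliers that are an artefact of the constant.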

# **15_IVT**

In [None]:
df_15_IVT = pd.read_csv('data/STData/15/15_IVT.csv')

In [None]:
df_15_IVT.head()

In [None]:
df_15_IVT.columns

In [None]:
df_15_IVT.shape

In [None]:
df_15_IVT.info()

In [None]:
df_15_IVT.isnull().sum()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_15_IVT.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
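
One possible form of context-aware handling is to label each gap by the question that preceded it, instead of assigning a single placeholder to every null. This is only a sketch of that idea on an invented column; the `Segment` column and the `after_` prefix are hypothetical names, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Invented sequence with a leading gap and gaps after Q1 and Q2.
toy = pd.DataFrame({
    'QuestionKey': [np.nan, 'Q1', 'Q1', np.nan, np.nan, 'Q2', np.nan],
})

is_gap = toy['QuestionKey'].isnull()

# Carry the last seen question forward; rows before the first question get 'start'.
previous_question = toy['QuestionKey'].ffill().fillna('start')

# Keep real labels, and name each gap after the question it follows.
toy['Segment'] = toy['QuestionKey'].where(~is_gap, 'after_' + previous_question)
print(toy['Segment'].tolist())
```

Grouping by `Segment` then keeps between-question periods separate from each other instead of pooling them into one group.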

In [None]:
df_15_IVT['QuestionKey'].unique()

In [None]:
df_15_IVT['Timestamp'] = pd.to_datetime(df_15_IVT['Timestamp'])

In [None]:
df_15_IVT.head(3)

In [None]:
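# Assign the result back; calling fillna(..., inplace=True) on a column selection is deprecated in recent pandas.
df_15_IVT['QuestionKey'] = df_15_IVT['QuestionKey'].fillna('None')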

In [None]:
df_15_IVT['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_15_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_15_IVT.isnull().sum()

In [None]:
df_15_IVT.head()

In [None]:
df_15_IVT['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_15_IVT['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.
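
In a notebook, a drop cell that is run twice raises a `KeyError` on the second run because the column is already gone. A small sketch of a re-runnable variant, on an invented frame, uses the `errors='ignore'` option of `DataFrame.drop` and assigns the result back instead of modifying in place.

```python
import pandas as pd

# Invented frame with a Row column to remove.
toy = pd.DataFrame({'Row': [1, 2, 3], 'Gaze X': [10.0, 11.0, 12.0]})

# The first call removes the column; the second is a no-op instead of an error.
toy = toy.drop(columns=['Row'], errors='ignore')
toy = toy.drop(columns=['Row'], errors='ignore')
print(list(toy.columns))
```

The trade-off is that a misspelled column name is also ignored silently, so this form suits exploratory cells more than a final pipeline.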

In [None]:
df_15_IVT.drop('Row', axis=1, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

sns.scatterplot(data=df_15_IVT, x='Gaze X', y='Gaze Y', ax=axes[0])
axes[0].set_title('Gaze X vs Gaze Y')

sns.scatterplot(data=df_15_IVT, x='Interpolated Gaze X', y='Interpolated Gaze Y', ax=axes[1])
axes[1].set_title('Interpolated Gaze X vs Interpolated Gaze Y')

plt.tight_layout()
plt.show()

# Gaze and Interpolated Gaze Scatter Plots

The scatter plots above visualize the relationship between the x and y coordinates of both the raw gaze data and the interpolated gaze data.

- **Gaze X vs Gaze Y:** This plot shows the raw gaze coordinates. The scattered points indicate the locations on the screen where the participant was looking. The density of points in certain areas might suggest regions of interest.
- **Interpolated Gaze X vs Interpolated Gaze Y:** This plot shows the interpolated gaze coordinates. Interpolation is often used to fill in gaps in the raw gaze data, providing a smoother representation of the gaze path. Comparing this plot to the raw gaze plot can show the effect of the interpolation process.

Both plots can help in understanding the distribution of gaze points across the screen and identifying potential patterns or biases in eye movements.

In [None]:
df_15_IVT.describe()

In [None]:
df_15_IVT.head(3)

In [None]:
df_15_IVT['Timestamp'] = pd.to_datetime(df_15_IVT['Timestamp'])

In [None]:
df_15_IVT.columns

In [None]:
cols = ['Gaze X', 'Gaze Y',
       'Interpolated Gaze X', 'Interpolated Gaze Y', 'Interpolated Distance',
       'Gaze Velocity', 'Gaze Acceleration', 'Fixation Index',
       'Fixation Index by Stimulus', 'Fixation X', 'Fixation Y',
       'Fixation Start', 'Fixation End', 'Fixation Duration',
       'Fixation Dispersion', 'Saccade Index', 'Saccade Index by Stimulus',
       'Saccade Start', 'Saccade End', 'Saccade Duration', 'Saccade Amplitude',
       'Saccade Peak Velocity', 'Saccade Peak Acceleration',
       'Saccade Peak Deceleration', 'Saccade Direction']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    sns.lineplot(x=df_15_IVT['Timestamp'], y=df_15_IVT[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

In [None]:
df_15_IVT.head()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_15_IVT[['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus']].isnull(), cmap='viridis')
plt.show()

# Observation

The `Fixation Index`, `Fixation Index by Stimulus`, `Saccade Index` and `Saccade Index by Stimulus` columns are essentially just sequence numbers for identified events. While they indicate the order of fixations and saccades, they don't provide meaningful features for a machine learning model attempting to predict or classify eye movement patterns. Therefore, we will drop these columns as they are not useful for model building.

In [None]:
df_15_IVT.drop(['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(14,10))
sns.scatterplot(data=df_15_IVT, x='Fixation X', y='Fixation Y')
plt.title('Fixation X vs Fixation Y')
plt.show()

In [None]:
df_15_IVT['Fixation Start'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_15_IVT['Fixation Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Fixation Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Fixation Start')

sns.histplot(df_15_IVT['Fixation End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Fixation End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Fixation End')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_15_IVT['Saccade Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Saccade Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Saccade Start')

sns.histplot(df_15_IVT['Saccade End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Saccade End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Saccade End')

plt.tight_layout()
plt.show()

# Observation on Fixation and Saccade Timestamps

The time series plots of `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` against `Timestamp` show a clear linear, diagonal pattern. These values increase with recording time, so they mainly encode when an event occurred rather than how the eye moved.

Furthermore, the histograms of these features show distributions that, while informative about the timing of events, don't necessarily reveal complex patterns that would be highly predictive for a machine learning model.

Crucially, the dataset already contains `Fixation Duration` and `Saccade Duration` columns. These duration features capture the length of each event, which is often a more directly relevant metric for understanding eye movement behavior than the absolute start and end times. Since the duration can be derived from the start and end times (Duration = End - Start), the start and end time columns introduce redundancy and do not provide substantial additional, independent information for modeling purposes.

Therefore, to simplify the dataset and focus on the most informative features for eye movement analysis or modeling, we will drop the `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` columns.

In [None]:
df_15_IVT.drop(['Fixation Start', 'Fixation End', 'Saccade Start', 'Saccade End'], axis=1, inplace=True)

In [None]:
df_15_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_15_IVT[['Gaze X', 'Gaze Y', 'Interpolated Gaze X', 'Interpolated Gaze Y']].isnull(), cmap='viridis')
plt.show()


# Observations on Gaze and Interpolated Gaze Data

Based on the scatter plots of `Gaze X` vs `Gaze Y` and `Interpolated Gaze X` vs `Interpolated Gaze Y`, the distributions of the raw and interpolated gaze points appear very similar. The spatial patterns of where the participant was looking are consistent between the two sets of coordinates.

Furthermore, the heatmap of null values for `Gaze X`, `Gaze Y`, `Interpolated Gaze X`, and `Interpolated Gaze Y` shows missing values in the same rows for both the raw and interpolated coordinates. This suggests that the interpolation did not fill the gaps in the raw gaze data for these rows.

Because the interpolated gaze data shows the same spatial distribution and the same null pattern as the raw gaze data, the interpolation does not appear to have altered or completed the data in this case. Keeping both sets of columns is therefore redundant, so we drop the interpolated columns to simplify the dataset without losing significant information.

In [None]:
df_15_IVT.drop(['Interpolated Gaze X', 'Interpolated Gaze Y'], axis=1, inplace=True)

In [None]:
df_15_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_15_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_15_IVT.columns

In [None]:
fix_1_df = df_15_IVT.dropna(subset=['Fixation Duration'])
sac_1_df = df_15_IVT.dropna(subset=['Saccade Duration'])

In [None]:
fix_1_df.shape

In [None]:
sac_1_df.shape

In [None]:
fix_1_feature = fix_1_df.groupby('QuestionKey').agg({
    'Fixation Duration': ['count','mean','max','sum','var'],
    'Fixation Dispersion': ['mean','max'],
    'Fixation X': ['var'],   # screen spread X
    'Fixation Y': ['var']    # screen spread Y
})

In [None]:
fix_1_feature.columns = ['fix_count','fix_mean_dur','fix_max_dur','fix_total_time',
                        'fix_dur_var','fix_disp_mean','fix_disp_max',
                        'fix_x_var','fix_y_var']

In [None]:
fix_1_feature

In [None]:
sac_1_features = sac_1_df.groupby('QuestionKey').agg({
    'Saccade Duration': ['count','mean','sum'],
    'Saccade Amplitude': ['mean','max'],
    'Saccade Peak Velocity': ['mean','max'],
    'Saccade Peak Acceleration': ['mean'],
    'Saccade Peak Deceleration': ['mean'],
    'Saccade Direction': ['var']   # direction variance
})

In [None]:
sac_1_features.columns = ['sac_count','sac_mean_dur','sac_total_time',
                        'sac_amp_mean','sac_amp_max',
                        'sac_vel_mean','sac_vel_max',
                        'sac_acc_mean','sac_dec_mean','sac_dir_var']

In [None]:
sac_1_features

In [None]:
ivt_1_features = fix_1_feature.join(sac_1_features, how='outer').fillna(0)

In [None]:
ivt_1_features

In [None]:
ivt_1_features['fix_sac_count_ratio'] = ivt_1_features['fix_count'] / (ivt_1_features['sac_count']+1e-5)
ivt_1_features['fix_sac_time_ratio']  = ivt_1_features['fix_total_time'] / (ivt_1_features['sac_total_time']+1e-5)

In [None]:
ivt_1_features

# Aggregation of Fixation and Saccade Features

In the preceding code cells, we performed aggregation on the `fix_1_df` and `sac_1_df` DataFrames, which contain the cleaned fixation and saccade data, respectively. The goal of this aggregation was to create a summary of eye-tracking metrics for each `QuestionKey`.

For fixations, we calculated:
- Count of fixations (`fix_count`)
- Mean, max, sum, and variance of fixation duration (`fix_mean_dur`, `fix_max_dur`, `fix_total_time`, `fix_dur_var`)
- Mean and max of fixation dispersion (`fix_disp_mean`, `fix_disp_max`)
- Variance of fixation X and Y coordinates (`fix_x_var`, `fix_y_var`) to represent screen spread.

For saccades, we calculated:
- Count of saccades (`sac_count`)
- Mean and sum of saccade duration (`sac_mean_dur`, `sac_total_time`)
- Mean and max of saccade amplitude (`sac_amp_mean`, `sac_amp_max`)
- Mean and max of saccade peak velocity (`sac_vel_mean`, `sac_vel_max`)
- Mean of saccade peak acceleration and deceleration (`sac_acc_mean`, `sac_dec_mean`)
- Variance of saccade direction (`sac_dir_var`).

Finally, we joined the aggregated fixation and saccade features into a single DataFrame, `ivt_1_features`, indexed by `QuestionKey`. The outer join keeps every `QuestionKey` that has fixations, saccades, or both, and the remaining missing values are filled with 0. These arise when a `QuestionKey` has only one of the two event types, and also in the variance columns when a group contains a single event, since the sample variance of one value is undefined. `ivt_1_features` therefore gives a consolidated per-question summary of eye-tracking characteristics for further analysis or modeling.
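
The same aggregation steps are repeated for every participant, and the names `fix_1_df`, `sac_1_df`, and `ivt_1_features` are overwritten each time. Wrapping the steps in a function is one way to keep a result per participant. The sketch below is a reduced version that covers only the duration columns; the function names `summarize_events` and `summarize_ivt`, the dictionary `features_by_participant`, and the toy data are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def summarize_events(df, duration_col, prefix):
    """Per-question count, mean, and total of one event-duration column."""
    events = df.dropna(subset=[duration_col])
    out = events.groupby('QuestionKey')[duration_col].agg(['count', 'mean', 'sum'])
    out.columns = [f'{prefix}_count', f'{prefix}_mean_dur', f'{prefix}_total_time']
    return out

def summarize_ivt(df):
    """Join fixation and saccade summaries, filling absent groups with 0."""
    fix = summarize_events(df, 'Fixation Duration', 'fix')
    sac = summarize_events(df, 'Saccade Duration', 'sac')
    return fix.join(sac, how='outer').fillna(0)

# Invented rows: Q1 has one fixation and one saccade, Q2 has two fixations only.
toy = pd.DataFrame({
    'QuestionKey': ['Q1', 'Q1', 'Q2', 'Q2'],
    'Fixation Duration': [200.0, np.nan, 150.0, 250.0],
    'Saccade Duration': [np.nan, 40.0, np.nan, np.nan],
})

# One entry per participant instead of one overwritten variable.
features_by_participant = {'toy': summarize_ivt(toy)}
print(features_by_participant['toy'])
```

With this layout, each participant's DataFrame could be passed through `summarize_ivt` in a loop and stored under its own key.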

# **16_IVT**

In [None]:
df_16_IVT = pd.read_csv('data/STData/16/16_IVT.csv')

In [None]:
df_16_IVT.head()

In [None]:
df_16_IVT.columns

In [None]:
df_16_IVT.shape

In [None]:
df_16_IVT.info()

In [None]:
df_16_IVT.isnull().sum()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_16_IVT.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
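
To put a number on "many" nulls, the share of rows per label can be computed with the nulls included. The sketch below uses an invented `QuestionKey` series; `value_counts` is called with `dropna=False` so that the missing label appears as its own entry.

```python
import numpy as np
import pandas as pd

# Invented labels: half of the rows have no question attached.
toy = pd.Series(['Q1', 'Q1', np.nan, 'Q2', np.nan, np.nan], name='QuestionKey')

# Share of rows per label, with NaN kept as a separate category.
shares = toy.value_counts(dropna=False, normalize=True)

# Overall fraction of rows without a question label.
null_share = toy.isnull().mean()
print(shares)
print(null_share)
```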

In [None]:
df_16_IVT['QuestionKey'].unique()

In [None]:
df_16_IVT['Timestamp'] = pd.to_datetime(df_16_IVT['Timestamp'])

In [None]:
df_16_IVT.head(3)

In [None]:
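# Assign the result back; calling fillna(..., inplace=True) on a column selection is deprecated in recent pandas.
df_16_IVT['QuestionKey'] = df_16_IVT['QuestionKey'].fillna('None')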

In [None]:
df_16_IVT['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_16_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_16_IVT.isnull().sum()

In [None]:
df_16_IVT.head()

In [None]:
df_16_IVT['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_16_IVT['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_16_IVT.drop('Row', axis=1, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

sns.scatterplot(data=df_16_IVT, x='Gaze X', y='Gaze Y', ax=axes[0])
axes[0].set_title('Gaze X vs Gaze Y')

sns.scatterplot(data=df_16_IVT, x='Interpolated Gaze X', y='Interpolated Gaze Y', ax=axes[1])
axes[1].set_title('Interpolated Gaze X vs Interpolated Gaze Y')

plt.tight_layout()
plt.show()

# Gaze and Interpolated Gaze Scatter Plots

The scatter plots above visualize the relationship between the x and y coordinates of both the raw gaze data and the interpolated gaze data.

- **Gaze X vs Gaze Y:** This plot shows the raw gaze coordinates. The scattered points indicate the locations on the screen where the participant was looking. The density of points in certain areas might suggest regions of interest.
- **Interpolated Gaze X vs Interpolated Gaze Y:** This plot shows the interpolated gaze coordinates. Interpolation is often used to fill in gaps in the raw gaze data, providing a smoother representation of the gaze path. Comparing this plot to the raw gaze plot can show the effect of the interpolation process.

Both plots can help in understanding the distribution of gaze points across the screen and identifying potential patterns or biases in eye movements.
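
One bias that such plots can reveal is gaze samples falling outside the display area. The sketch below estimates the share of off-screen samples on invented coordinates. The 1920x1080 resolution is an assumption for the example and should be replaced with the actual screen size of the experiment.

```python
import numpy as np
import pandas as pd

SCREEN_W, SCREEN_H = 1920, 1080  # assumed resolution, not taken from the data

# Invented samples: two on screen, two off screen, one missing.
toy = pd.DataFrame({
    'Gaze X': [960.0, -15.0, 1900.0, np.nan, 2100.0],
    'Gaze Y': [540.0, 300.0, 1075.0, np.nan, 500.0],
})

# Ignore rows without a gaze sample.
valid = toy.dropna(subset=['Gaze X', 'Gaze Y'])

# A sample is on screen when both coordinates lie within the display bounds.
on_screen = valid['Gaze X'].between(0, SCREEN_W) & valid['Gaze Y'].between(0, SCREEN_H)
off_screen_share = 1 - on_screen.mean()
print(off_screen_share)
```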

In [None]:
df_16_IVT.describe()

In [None]:
df_16_IVT.head(3)

In [None]:
df_16_IVT['Timestamp'] = pd.to_datetime(df_16_IVT['Timestamp'])

In [None]:
df_16_IVT.columns

In [None]:
cols = ['Gaze X', 'Gaze Y',
       'Interpolated Gaze X', 'Interpolated Gaze Y', 'Interpolated Distance',
       'Gaze Velocity', 'Gaze Acceleration', 'Fixation Index',
       'Fixation Index by Stimulus', 'Fixation X', 'Fixation Y',
       'Fixation Start', 'Fixation End', 'Fixation Duration',
       'Fixation Dispersion', 'Saccade Index', 'Saccade Index by Stimulus',
       'Saccade Start', 'Saccade End', 'Saccade Duration', 'Saccade Amplitude',
       'Saccade Peak Velocity', 'Saccade Peak Acceleration',
       'Saccade Peak Deceleration', 'Saccade Direction']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    sns.lineplot(x=df_16_IVT['Timestamp'], y=df_16_IVT[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

In [None]:
df_16_IVT.head()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_16_IVT[['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus']].isnull(), cmap='viridis')
plt.show()

# Observation

The `Fixation Index`, `Fixation Index by Stimulus`, `Saccade Index` and `Saccade Index by Stimulus` columns are essentially just sequence numbers for identified events. While they indicate the order of fixations and saccades, they don't provide meaningful features for a machine learning model attempting to predict or classify eye movement patterns. Therefore, we will drop these columns as they are not useful for model building.
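
One caveat is worth checking before the drop. If the export stores one row per gaze sample (an assumption here, not confirmed by the source), the index columns are the only way to tell which rows belong to the same event. The sketch below uses a hypothetical `demo` frame to show how row counts and event counts can differ in that layout.

```python
import numpy as np
import pandas as pd

# Hypothetical per-sample layout: several rows share one 'Fixation Index'.
demo = pd.DataFrame({
    'Fixation Index':    [1, 1, 1, np.nan, 2, 2, np.nan, 3],
    'Fixation Duration': [120.0, 120.0, 120.0, np.nan, 80.0, 80.0, np.nan, 200.0],
})

# Rows that fall inside any fixation versus distinct fixation events.
sample_rows = int(demo['Fixation Duration'].notna().sum())
n_fixations = int(demo['Fixation Index'].nunique())

# One row per event, so that each fixation contributes once to the mean.
per_event = demo.dropna(subset=['Fixation Index']).drop_duplicates('Fixation Index')
mean_duration_per_event = per_event['Fixation Duration'].mean()
print(sample_rows, n_fixations, mean_duration_per_event)
```

If `nunique()` on the real index column is much smaller than the number of non-null rows, later `count` and `mean` aggregations are computed per sample rather than per event.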

In [None]:
df_16_IVT.drop(['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(14,10))
sns.scatterplot(data=df_16_IVT, x='Fixation X', y='Fixation Y')
plt.title('Fixation X vs Fixation Y')
plt.show()

In [None]:
df_16_IVT['Fixation Start'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_16_IVT['Fixation Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Fixation Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Fixation Start')

sns.histplot(df_16_IVT['Fixation End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Fixation End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Fixation End')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_16_IVT['Saccade Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Saccade Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Saccade Start')

sns.histplot(df_16_IVT['Saccade End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Saccade End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Saccade End')

plt.tight_layout()
plt.show()

# Observation on Fixation and Saccade Timestamps

In the time series plots of `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` against `Timestamp`, we observe a clear linear, diagonal pattern. This indicates that these values increase steadily with recording time and carry little information beyond the timestamp itself.

Furthermore, the histograms of these features show distributions that, while informative about the timing of events, don't necessarily reveal complex patterns that would be highly predictive for a machine learning model.

Crucially, the dataset already contains `Fixation Duration` and `Saccade Duration` columns. These duration features capture the length of each event, which is often a more directly relevant metric for understanding eye movement behavior than the absolute start and end times. Since the duration can be derived from the start and end times (Duration = End - Start), the start and end time columns introduce redundancy and do not provide substantial additional, independent information for modeling purposes.

Therefore, to simplify the dataset and focus on the most informative features for eye movement analysis or modeling, we will drop the `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` columns.
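
The redundancy argument rests on Duration = End - Start, which can be verified before the columns are removed. A minimal sketch on a hypothetical `demo` frame; the tolerance of `1e-6` is an assumption and may need loosening if the export rounds its values.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for the fixation columns of df_16_IVT.
demo = pd.DataFrame({
    'Fixation Start':    [0.0, 150.0, np.nan, 400.0],
    'Fixation End':      [120.0, 230.0, np.nan, 600.0],
    'Fixation Duration': [120.0, 80.0, np.nan, 200.0],
})

# Recompute the duration and compare only rows where both values exist.
derived = demo['Fixation End'] - demo['Fixation Start']
both = derived.notna() & demo['Fixation Duration'].notna()
max_abs_diff = (derived[both] - demo.loc[both, 'Fixation Duration']).abs().max()
durations_match = bool(np.isclose(max_abs_diff, 0.0, atol=1e-6))
print(max_abs_diff, durations_match)
```

The same check applies to the saccade columns by swapping the column names.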

In [None]:
df_16_IVT.drop(['Fixation Start', 'Fixation End', 'Saccade Start', 'Saccade End'], axis=1, inplace=True)

In [None]:
df_16_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_16_IVT[['Gaze X', 'Gaze Y', 'Interpolated Gaze X', 'Interpolated Gaze Y']].isnull(), cmap='viridis')
plt.show()


# Observations on Gaze and Interpolated Gaze Data

Based on the scatter plots of `Gaze X` vs `Gaze Y` and `Interpolated Gaze X` vs `Interpolated Gaze Y`, the distributions of the raw and interpolated gaze points appear very similar. The spatial patterns of where the participant was looking are consistent between the two sets of coordinates.

Furthermore, the heatmap of null values for these four columns shows that the missing values fall in the same rows for both the raw and interpolated gaze coordinates. This suggests that the interpolation did not fill the gaps in the raw gaze data for these rows.

Because the interpolated gaze data shows the same spatial distribution and the same null pattern as the raw gaze data, the interpolation does not appear to have altered or completed the data in this case. Keeping both sets is therefore redundant, so we drop the interpolated columns and keep the raw gaze columns.
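
The visual comparison can be backed by a direct numeric check. The sketch below, on a hypothetical `demo` frame, tests whether the null masks are identical and reports the largest absolute difference between raw and interpolated coordinates.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the four gaze columns of df_16_IVT.
demo = pd.DataFrame({
    'Gaze X': [100.0, np.nan, 104.0, 106.0],
    'Gaze Y': [200.0, np.nan, 202.0, 203.0],
    'Interpolated Gaze X': [100.0, np.nan, 104.0, 106.0],
    'Interpolated Gaze Y': [200.0, np.nan, 202.0, 203.0],
})

raw = demo[['Gaze X', 'Gaze Y']]
interp = demo[['Interpolated Gaze X', 'Interpolated Gaze Y']]

# True only if every missing cell is missing in both sets.
same_null_rows = bool((raw.isnull().to_numpy() == interp.isnull().to_numpy()).all())
# Largest coordinate difference, ignoring rows where either value is missing.
max_abs_diff = float(np.nanmax(np.abs(raw.to_numpy() - interp.to_numpy())))
print(same_null_rows, max_abs_diff)
```

A `max_abs_diff` near zero together with `same_null_rows == True` would support treating the interpolated columns as duplicates.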

In [None]:
df_16_IVT.drop(['Interpolated Gaze X', 'Interpolated Gaze Y'], axis=1, inplace=True)

In [None]:
df_16_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_16_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_16_IVT.columns

In [None]:
fix_1_df = df_16_IVT.dropna(subset=['Fixation Duration'])
sac_1_df = df_16_IVT.dropna(subset=['Saccade Duration'])

In [None]:
fix_1_df.shape

In [None]:
sac_1_df.shape

In [None]:
fix_1_feature = fix_1_df.groupby('QuestionKey').agg({
    'Fixation Duration': ['count','mean','max','sum','var'],
    'Fixation Dispersion': ['mean','max'],
    'Fixation X': ['var'],   # screen spread X
    'Fixation Y': ['var']    # screen spread Y
})

In [None]:
fix_1_feature.columns = ['fix_count','fix_mean_dur','fix_max_dur','fix_total_time',
                        'fix_dur_var','fix_disp_mean','fix_disp_max',
                        'fix_x_var','fix_y_var']

In [None]:
fix_1_feature

In [None]:
sac_1_features = sac_1_df.groupby('QuestionKey').agg({
    'Saccade Duration': ['count','mean','sum'],
    'Saccade Amplitude': ['mean','max'],
    'Saccade Peak Velocity': ['mean','max'],
    'Saccade Peak Acceleration': ['mean'],
    'Saccade Peak Deceleration': ['mean'],
    'Saccade Direction': ['var']   # direction variance
})

In [None]:
sac_1_features.columns = ['sac_count','sac_mean_dur','sac_total_time',
                        'sac_amp_mean','sac_amp_max',
                        'sac_vel_mean','sac_vel_max',
                        'sac_acc_mean','sac_dec_mean','sac_dir_var']

In [None]:
sac_1_features

In [None]:
ivt_1_features = fix_1_feature.join(sac_1_features, how='outer').fillna(0)

In [None]:
ivt_1_features

In [None]:
ivt_1_features['fix_sac_count_ratio'] = ivt_1_features['fix_count'] / (ivt_1_features['sac_count']+1e-5)
ivt_1_features['fix_sac_time_ratio']  = ivt_1_features['fix_total_time'] / (ivt_1_features['sac_total_time']+1e-5)

In [None]:
ivt_1_features

# Aggregation of Fixation and Saccade Features

In the preceding code cells, we performed aggregation on the `fix_1_df` and `sac_1_df` DataFrames, which contain the cleaned fixation and saccade data, respectively. The goal of this aggregation was to create a summary of eye-tracking metrics for each `QuestionKey`.

For fixations, we calculated:
- Count of fixations (`fix_count`)
- Mean, max, sum, and variance of fixation duration (`fix_mean_dur`, `fix_max_dur`, `fix_total_time`, `fix_dur_var`)
- Mean and max of fixation dispersion (`fix_disp_mean`, `fix_disp_max`)
- Variance of fixation X and Y coordinates (`fix_x_var`, `fix_y_var`) to represent screen spread.

For saccades, we calculated:
- Count of saccades (`sac_count`)
- Mean and sum of saccade duration (`sac_mean_dur`, `sac_total_time`)
- Mean and max of saccade amplitude (`sac_amp_mean`, `sac_amp_max`)
- Mean and max of saccade peak velocity (`sac_vel_mean`, `sac_vel_max`)
- Mean of saccade peak acceleration and deceleration (`sac_acc_mean`, `sac_dec_mean`)
- Variance of saccade direction (`sac_dir_var`).

Finally, we joined these aggregated fixation and saccade features into a single DataFrame called `ivt_1_features`, using `QuestionKey` as the index. We also filled any resulting missing values (from `QuestionKey` values that may only have fixations or saccades, but not both) with 0. This `ivt_1_features` DataFrame now provides a consolidated summary of key eye-tracking characteristics for each question, which can be used for further analysis or modeling.
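
The two ratio features computed above add `1e-5` to the denominator to avoid division by zero. When a question has no saccades, that produces a very large number rather than a missing value. The sketch below contrasts the two behaviours on a hypothetical `demo` frame; replacing zeros with `NaN` is one alternative, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical aggregated counts for two questions; Q2 has no saccades.
demo = pd.DataFrame(
    {'fix_count': [40.0, 12.0], 'sac_count': [8.0, 0.0]},
    index=pd.Index(['Q1', 'Q2'], name='QuestionKey'),
)

# Epsilon in the denominator: a zero count yields a huge ratio.
eps_ratio = demo['fix_count'] / (demo['sac_count'] + 1e-5)
# Zero replaced by NaN: the ratio is marked as undefined instead.
safe_ratio = demo['fix_count'] / demo['sac_count'].replace(0, np.nan)
print(eps_ratio['Q2'], safe_ratio['Q2'])
```

Which behaviour is preferable depends on the downstream model: tree-based models may tolerate the large value, while scaled linear models are usually more sensitive to it.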

# **17_IVT**

In [None]:
df_17_IVT = pd.read_csv('data/STData/17/17_IVT.csv')

In [None]:
df_17_IVT.head()

In [None]:
df_17_IVT.columns

In [None]:
df_17_IVT.shape

In [None]:
df_17_IVT.info()

In [None]:
df_17_IVT.isnull().sum()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_17_IVT.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
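
The interval pattern can be measured by computing the length of each consecutive run of nulls. A minimal sketch on a hypothetical `demo` Series standing in for `df_17_IVT['QuestionKey']`:

```python
import numpy as np
import pandas as pd

# Hypothetical QuestionKey values with two gaps between questions.
demo = pd.Series(
    ['Q1', 'Q1', np.nan, np.nan, np.nan, 'Q2', 'Q2', np.nan, np.nan, 'Q3'],
    name='QuestionKey',
)

is_null = demo.isnull()
# A new run starts wherever the null flag differs from the previous row.
run_id = is_null.ne(is_null.shift()).cumsum()
# Keep only the null rows and count how many fall in each run.
null_run_lengths = is_null[is_null].groupby(run_id[is_null]).size().tolist()
print(null_run_lengths)
```

A small number of long runs would be consistent with pauses between questions, whereas many runs of length one would point to sporadic missing values instead.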

In [None]:
df_17_IVT['QuestionKey'].unique()

In [None]:
df_17_IVT['Timestamp'] = pd.to_datetime(df_17_IVT['Timestamp'])

In [None]:
df_17_IVT.head(3)

In [None]:
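# Assign the result back; calling fillna(inplace=True) on a selected column
# is chained assignment and is deprecated in recent pandas versions.
df_17_IVT['QuestionKey'] = df_17_IVT['QuestionKey'].fillna('None')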

In [None]:
df_17_IVT['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_17_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_17_IVT.isnull().sum()

In [None]:
df_17_IVT.head()

In [None]:
df_17_IVT['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_17_IVT['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_17_IVT.drop('Row', axis=1, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

sns.scatterplot(data=df_17_IVT, x='Gaze X', y='Gaze Y', ax=axes[0])
axes[0].set_title('Gaze X vs Gaze Y')

sns.scatterplot(data=df_17_IVT, x='Interpolated Gaze X', y='Interpolated Gaze Y', ax=axes[1])
axes[1].set_title('Interpolated Gaze X vs Interpolated Gaze Y')

plt.tight_layout()
plt.show()

# Gaze and Interpolated Gaze Scatter Plots

The scatter plots above visualize the relationship between the x and y coordinates of both the raw gaze data and the interpolated gaze data.

- **Gaze X vs Gaze Y:** This plot shows the raw gaze coordinates. The scattered points indicate the locations on the screen where the participant was looking. The density of points in certain areas might suggest regions of interest.
- **Interpolated Gaze X vs Interpolated Gaze Y:** This plot shows the interpolated gaze coordinates. Interpolation is often used to fill in gaps in the raw gaze data, providing a smoother representation of the gaze path. Comparing this plot to the raw gaze plot can show the effect of the interpolation process.

Both plots can help in understanding the distribution of gaze points across the screen and identifying potential patterns or biases in eye movements.

In [None]:
df_17_IVT.describe()

In [None]:
df_17_IVT.head(3)

In [None]:
df_17_IVT['Timestamp'] = pd.to_datetime(df_17_IVT['Timestamp'])

In [None]:
df_17_IVT.columns

In [None]:
cols = ['Gaze X', 'Gaze Y',
       'Interpolated Gaze X', 'Interpolated Gaze Y', 'Interpolated Distance',
       'Gaze Velocity', 'Gaze Acceleration', 'Fixation Index',
       'Fixation Index by Stimulus', 'Fixation X', 'Fixation Y',
       'Fixation Start', 'Fixation End', 'Fixation Duration',
       'Fixation Dispersion', 'Saccade Index', 'Saccade Index by Stimulus',
       'Saccade Start', 'Saccade End', 'Saccade Duration', 'Saccade Amplitude',
       'Saccade Peak Velocity', 'Saccade Peak Acceleration',
       'Saccade Peak Deceleration', 'Saccade Direction']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    sns.lineplot(x=df_17_IVT['Timestamp'], y=df_17_IVT[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

In [None]:
df_17_IVT.head()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_17_IVT[['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus']].isnull(), cmap='viridis')
plt.show()

# Observation

The `Fixation Index`, `Fixation Index by Stimulus`, `Saccade Index` and `Saccade Index by Stimulus` columns are essentially just sequence numbers for identified events. While they indicate the order of fixations and saccades, they don't provide meaningful features for a machine learning model attempting to predict or classify eye movement patterns. Therefore, we will drop these columns as they are not useful for model building.

In [None]:
df_17_IVT.drop(['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(14,10))
sns.scatterplot(data=df_17_IVT, x='Fixation X', y='Fixation Y')
plt.title('Fixation X vs Fixation Y')
plt.show()

In [None]:
df_17_IVT['Fixation Start'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_17_IVT['Fixation Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Fixation Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Fixation Start')

sns.histplot(df_17_IVT['Fixation End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Fixation End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Fixation End')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_17_IVT['Saccade Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Saccade Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Saccade Start')

sns.histplot(df_17_IVT['Saccade End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Saccade End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Saccade End')

plt.tight_layout()
plt.show()

# Observation on Fixation and Saccade Timestamps

In the time series plots of `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` against `Timestamp`, we observe a clear linear, diagonal pattern. This indicates that these values increase steadily with recording time and carry little information beyond the timestamp itself.

Furthermore, the histograms of these features show distributions that, while informative about the timing of events, don't necessarily reveal complex patterns that would be highly predictive for a machine learning model.

Crucially, the dataset already contains `Fixation Duration` and `Saccade Duration` columns. These duration features capture the length of each event, which is often a more directly relevant metric for understanding eye movement behavior than the absolute start and end times. Since the duration can be derived from the start and end times (Duration = End - Start), the start and end time columns introduce redundancy and do not provide substantial additional, independent information for modeling purposes.

Therefore, to simplify the dataset and focus on the most informative features for eye movement analysis or modeling, we will drop the `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` columns.

In [None]:
df_17_IVT.drop(['Fixation Start', 'Fixation End', 'Saccade Start', 'Saccade End'], axis=1, inplace=True)

In [None]:
df_17_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_17_IVT[['Gaze X', 'Gaze Y', 'Interpolated Gaze X', 'Interpolated Gaze Y']].isnull(), cmap='viridis')
plt.show()


# Observations on Gaze and Interpolated Gaze Data

Based on the scatter plots of `Gaze X` vs `Gaze Y` and `Interpolated Gaze X` vs `Interpolated Gaze Y`, the distributions of the raw and interpolated gaze points appear very similar. The spatial patterns of where the participant was looking are consistent between the two sets of coordinates.

Furthermore, the heatmap of null values for these four columns shows that the missing values fall in the same rows for both the raw and interpolated gaze coordinates. This suggests that the interpolation did not fill the gaps in the raw gaze data for these rows.

Because the interpolated gaze data shows the same spatial distribution and the same null pattern as the raw gaze data, the interpolation does not appear to have altered or completed the data in this case. Keeping both sets is therefore redundant, so we drop the interpolated columns and keep the raw gaze columns.

In [None]:
df_17_IVT.drop(['Interpolated Gaze X', 'Interpolated Gaze Y'], axis=1, inplace=True)

In [None]:
df_17_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_17_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_17_IVT.columns

In [None]:
fix_1_df = df_17_IVT.dropna(subset=['Fixation Duration'])
sac_1_df = df_17_IVT.dropna(subset=['Saccade Duration'])

In [None]:
fix_1_df.shape

In [None]:
sac_1_df.shape

In [None]:
fix_1_feature = fix_1_df.groupby('QuestionKey').agg({
    'Fixation Duration': ['count','mean','max','sum','var'],
    'Fixation Dispersion': ['mean','max'],
    'Fixation X': ['var'],   # screen spread X
    'Fixation Y': ['var']    # screen spread Y
})

In [None]:
fix_1_feature.columns = ['fix_count','fix_mean_dur','fix_max_dur','fix_total_time',
                        'fix_dur_var','fix_disp_mean','fix_disp_max',
                        'fix_x_var','fix_y_var']

In [None]:
fix_1_feature

In [None]:
sac_1_features = sac_1_df.groupby('QuestionKey').agg({
    'Saccade Duration': ['count','mean','sum'],
    'Saccade Amplitude': ['mean','max'],
    'Saccade Peak Velocity': ['mean','max'],
    'Saccade Peak Acceleration': ['mean'],
    'Saccade Peak Deceleration': ['mean'],
    'Saccade Direction': ['var']   # direction variance
})

In [None]:
sac_1_features.columns = ['sac_count','sac_mean_dur','sac_total_time',
                        'sac_amp_mean','sac_amp_max',
                        'sac_vel_mean','sac_vel_max',
                        'sac_acc_mean','sac_dec_mean','sac_dir_var']

In [None]:
sac_1_features

In [None]:
ivt_1_features = fix_1_feature.join(sac_1_features, how='outer').fillna(0)

In [None]:
ivt_1_features

In [None]:
ivt_1_features['fix_sac_count_ratio'] = ivt_1_features['fix_count'] / (ivt_1_features['sac_count']+1e-5)
ivt_1_features['fix_sac_time_ratio']  = ivt_1_features['fix_total_time'] / (ivt_1_features['sac_total_time']+1e-5)

In [None]:
ivt_1_features

# Aggregation of Fixation and Saccade Features

In the preceding code cells, we performed aggregation on the `fix_1_df` and `sac_1_df` DataFrames, which contain the cleaned fixation and saccade data, respectively. The goal of this aggregation was to create a summary of eye-tracking metrics for each `QuestionKey`.

For fixations, we calculated:
- Count of fixations (`fix_count`)
- Mean, max, sum, and variance of fixation duration (`fix_mean_dur`, `fix_max_dur`, `fix_total_time`, `fix_dur_var`)
- Mean and max of fixation dispersion (`fix_disp_mean`, `fix_disp_max`)
- Variance of fixation X and Y coordinates (`fix_x_var`, `fix_y_var`) to represent screen spread.

For saccades, we calculated:
- Count of saccades (`sac_count`)
- Mean and sum of saccade duration (`sac_mean_dur`, `sac_total_time`)
- Mean and max of saccade amplitude (`sac_amp_mean`, `sac_amp_max`)
- Mean and max of saccade peak velocity (`sac_vel_mean`, `sac_vel_max`)
- Mean of saccade peak acceleration and deceleration (`sac_acc_mean`, `sac_dec_mean`)
- Variance of saccade direction (`sac_dir_var`).

Finally, we joined these aggregated fixation and saccade features into a single DataFrame called `ivt_1_features`, using `QuestionKey` as the index. We also filled any resulting missing values (from `QuestionKey` values that may only have fixations or saccades, but not both) with 0. This `ivt_1_features` DataFrame now provides a consolidated summary of key eye-tracking characteristics for each question, which can be used for further analysis or modeling.
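
One of the listed features deserves a note: `sac_dir_var` is an ordinary variance of `Saccade Direction`. If that column holds angles (assumed here to be in degrees, which the source does not state), directions of 359 and 1 are almost identical yet contribute a large linear variance. The sketch below shows a circular variance as a possible alternative; the helper `circular_variance` is hypothetical and not part of the notebook.

```python
import numpy as np

def circular_variance(degrees):
    """Circular variance in [0, 1]: 0 means identical directions,
    values near 1 mean no dominant direction."""
    angles = np.deg2rad(np.asarray(degrees, dtype=float))
    angles = angles[~np.isnan(angles)]
    if angles.size == 0:
        return np.nan
    # Length of the mean unit vector; 1 when all angles coincide.
    mean_resultant_length = np.hypot(np.cos(angles).mean(), np.sin(angles).mean())
    return 1.0 - mean_resultant_length

# Four directions clustered around 0 degrees.
near_zero = [359.0, 1.0, 358.0, 2.0]
linear_var = float(np.var(near_zero, ddof=1))
circ_var = float(circular_variance(near_zero))
print(linear_var, circ_var)
```

The helper can be passed to a grouped aggregation, for example `sac_1_df.groupby('QuestionKey')['Saccade Direction'].agg(circular_variance)`.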

# **18_IVT**

In [None]:
df_18_IVT = pd.read_csv('data/STData/18/18_IVT.csv')

In [None]:
df_18_IVT.head()

In [None]:
df_18_IVT.columns

In [None]:
df_18_IVT.shape

In [None]:
df_18_IVT.info()

In [None]:
df_18_IVT.isnull().sum()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_18_IVT.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.

In [None]:
df_18_IVT['QuestionKey'].unique()

In [None]:
df_18_IVT['Timestamp'] = pd.to_datetime(df_18_IVT['Timestamp'])

In [None]:
df_18_IVT.head(3)

In [None]:
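# Assign the result back; calling fillna(inplace=True) on a selected column
# is chained assignment and is deprecated in recent pandas versions.
df_18_IVT['QuestionKey'] = df_18_IVT['QuestionKey'].fillna('None')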

In [None]:
df_18_IVT['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_18_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_18_IVT.isnull().sum()

In [None]:
df_18_IVT.head()

In [None]:
df_18_IVT['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_18_IVT['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_18_IVT.drop('Row', axis=1, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

sns.scatterplot(data=df_18_IVT, x='Gaze X', y='Gaze Y', ax=axes[0])
axes[0].set_title('Gaze X vs Gaze Y')

sns.scatterplot(data=df_18_IVT, x='Interpolated Gaze X', y='Interpolated Gaze Y', ax=axes[1])
axes[1].set_title('Interpolated Gaze X vs Interpolated Gaze Y')

plt.tight_layout()
plt.show()

# Gaze and Interpolated Gaze Scatter Plots

The scatter plots above visualize the relationship between the x and y coordinates of both the raw gaze data and the interpolated gaze data.

- **Gaze X vs Gaze Y:** This plot shows the raw gaze coordinates. The scattered points indicate the locations on the screen where the participant was looking. The density of points in certain areas might suggest regions of interest.
- **Interpolated Gaze X vs Interpolated Gaze Y:** This plot shows the interpolated gaze coordinates. Interpolation is often used to fill in gaps in the raw gaze data, providing a smoother representation of the gaze path. Comparing this plot to the raw gaze plot can show the effect of the interpolation process.

Both plots can help in understanding the distribution of gaze points across the screen and identifying potential patterns or biases in eye movements.

In [None]:
df_18_IVT.describe()

In [None]:
df_18_IVT.head(3)

In [None]:
df_18_IVT['Timestamp'] = pd.to_datetime(df_18_IVT['Timestamp'])

In [None]:
df_18_IVT.columns

In [None]:
cols = ['Gaze X', 'Gaze Y',
       'Interpolated Gaze X', 'Interpolated Gaze Y', 'Interpolated Distance',
       'Gaze Velocity', 'Gaze Acceleration', 'Fixation Index',
       'Fixation Index by Stimulus', 'Fixation X', 'Fixation Y',
       'Fixation Start', 'Fixation End', 'Fixation Duration',
       'Fixation Dispersion', 'Saccade Index', 'Saccade Index by Stimulus',
       'Saccade Start', 'Saccade End', 'Saccade Duration', 'Saccade Amplitude',
       'Saccade Peak Velocity', 'Saccade Peak Acceleration',
       'Saccade Peak Deceleration', 'Saccade Direction']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    sns.lineplot(x=df_18_IVT['Timestamp'], y=df_18_IVT[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

In [None]:
df_18_IVT.head()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_18_IVT[['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus']].isnull(), cmap='viridis')
plt.show()

# Observation

The `Fixation Index`, `Fixation Index by Stimulus`, `Saccade Index` and `Saccade Index by Stimulus` columns are essentially just sequence numbers for identified events. While they indicate the order of fixations and saccades, they don't provide meaningful features for a machine learning model attempting to predict or classify eye movement patterns. Therefore, we will drop these columns as they are not useful for model building.

In [None]:
df_18_IVT.drop(['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(14,10))
sns.scatterplot(data=df_18_IVT, x='Fixation X', y='Fixation Y')
plt.title('Fixation X vs Fixation Y')
plt.show()

In [None]:
df_18_IVT['Fixation Start'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_18_IVT['Fixation Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Fixation Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Fixation Start')

sns.histplot(df_18_IVT['Fixation End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Fixation End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Fixation End')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_18_IVT['Saccade Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Saccade Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Saccade Start')

sns.histplot(df_18_IVT['Saccade End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Saccade End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Saccade End')

plt.tight_layout()
plt.show()

# Observation on Fixation and Saccade Timestamps

In the time series plots of `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` against `Timestamp`, we observe a clear linear, diagonal pattern. This indicates that these values increase steadily with recording time and carry little information beyond the timestamp itself.

Furthermore, the histograms of these features show distributions that, while informative about the timing of events, don't necessarily reveal complex patterns that would be highly predictive for a machine learning model.

Crucially, the dataset already contains `Fixation Duration` and `Saccade Duration` columns. These duration features capture the length of each event, which is often a more directly relevant metric for understanding eye movement behavior than the absolute start and end times. Since the duration can be derived from the start and end times (Duration = End - Start), the start and end time columns introduce redundancy and do not provide substantial additional, independent information for modeling purposes.

Therefore, to simplify the dataset and focus on the most informative features for eye movement analysis or modeling, we will drop the `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` columns.

In [None]:
df_18_IVT.drop(['Fixation Start', 'Fixation End', 'Saccade Start', 'Saccade End'], axis=1, inplace=True)

In [None]:
df_18_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_18_IVT[['Gaze X', 'Gaze Y', 'Interpolated Gaze X', 'Interpolated Gaze Y']].isnull(), cmap='viridis')
plt.show()


# Observations on Gaze and Interpolated Gaze Data

Based on the scatter plots of `Gaze X` vs `Gaze Y` and `Interpolated Gaze X` vs `Interpolated Gaze Y`, the distributions of the raw and interpolated gaze points appear very similar. The spatial patterns of where the participant was looking are consistent between the two sets of coordinates.

Furthermore, the heatmap of null values for these four columns shows that the missing values fall in the same rows for both the raw and interpolated gaze coordinates. This suggests that the interpolation did not fill the gaps in the raw gaze data for these rows.

Because the interpolated gaze data shows the same spatial distribution and the same null pattern as the raw gaze data, the interpolation does not appear to have altered or completed the data in this case. Keeping both sets is therefore redundant, so we drop the interpolated columns and keep the raw gaze columns.

In [None]:
df_18_IVT.drop(['Interpolated Gaze X', 'Interpolated Gaze Y'], axis=1, inplace=True)

In [None]:
df_18_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_18_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_18_IVT.columns

In [None]:
fix_1_df = df_18_IVT.dropna(subset=['Fixation Duration'])
sac_1_df = df_18_IVT.dropna(subset=['Saccade Duration'])

In [None]:
fix_1_df.shape

In [None]:
sac_1_df.shape

In [None]:
fix_1_feature = fix_1_df.groupby('QuestionKey').agg({
    'Fixation Duration': ['count','mean','max','sum','var'],
    'Fixation Dispersion': ['mean','max'],
    'Fixation X': ['var'],   # screen spread X
    'Fixation Y': ['var']    # screen spread Y
})

In [None]:
fix_1_feature.columns = ['fix_count','fix_mean_dur','fix_max_dur','fix_total_time',
                        'fix_dur_var','fix_disp_mean','fix_disp_max',
                        'fix_x_var','fix_y_var']

In [None]:
fix_1_feature

In [None]:
sac_1_features = sac_1_df.groupby('QuestionKey').agg({
    'Saccade Duration': ['count','mean','sum'],
    'Saccade Amplitude': ['mean','max'],
    'Saccade Peak Velocity': ['mean','max'],
    'Saccade Peak Acceleration': ['mean'],
    'Saccade Peak Deceleration': ['mean'],
    'Saccade Direction': ['var']   # direction variance
})

In [None]:
sac_1_features.columns = ['sac_count','sac_mean_dur','sac_total_time',
                        'sac_amp_mean','sac_amp_max',
                        'sac_vel_mean','sac_vel_max',
                        'sac_acc_mean','sac_dec_mean','sac_dir_var']

In [None]:
sac_1_features

In [None]:
ivt_1_features = fix_1_feature.join(sac_1_features, how='outer').fillna(0)

In [None]:
ivt_1_features

In [None]:
ivt_1_features['fix_sac_count_ratio'] = ivt_1_features['fix_count'] / (ivt_1_features['sac_count']+1e-5)
ivt_1_features['fix_sac_time_ratio']  = ivt_1_features['fix_total_time'] / (ivt_1_features['sac_total_time']+1e-5)

In [None]:
ivt_1_features

# Aggregation of Fixation and Saccade Features

In the preceding code cells, we performed aggregation on the `fix_1_df` and `sac_1_df` DataFrames, which contain the cleaned fixation and saccade data, respectively. The goal of this aggregation was to create a summary of eye-tracking metrics for each `QuestionKey`.

For fixations, we calculated:
- Count of fixations (`fix_count`)
- Mean, max, sum, and variance of fixation duration (`fix_mean_dur`, `fix_max_dur`, `fix_total_time`, `fix_dur_var`)
- Mean and max of fixation dispersion (`fix_disp_mean`, `fix_disp_max`)
- Variance of fixation X and Y coordinates (`fix_x_var`, `fix_y_var`) to represent screen spread.

For saccades, we calculated:
- Count of saccades (`sac_count`)
- Mean and sum of saccade duration (`sac_mean_dur`, `sac_total_time`)
- Mean and max of saccade amplitude (`sac_amp_mean`, `sac_amp_max`)
- Mean and max of saccade peak velocity (`sac_vel_mean`, `sac_vel_max`)
- Mean of saccade peak acceleration and deceleration (`sac_acc_mean`, `sac_dec_mean`)
- Variance of saccade direction (`sac_dir_var`).

Finally, we joined the aggregated fixation and saccade features into a single DataFrame, `ivt_1_features`, indexed by `QuestionKey`, and filled the resulting missing values with 0. Missing values arise for `QuestionKey` values that have only fixations or only saccades, and in the variance columns of any group with a single observation, where the sample variance is undefined. Two ratio features, `fix_sac_count_ratio` and `fix_sac_time_ratio`, were then added. `ivt_1_features` now provides a consolidated per-question summary of eye-tracking characteristics that can be used for further analysis or modeling.
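
Because missing counts are filled with 0 and the ratio cells add a small epsilon to the denominator, a question with no saccades ends up with a very large ratio rather than an undefined one. The sketch below contrasts that with one possible alternative that leaves the ratio as NaN. The toy table and the column names `ratio_eps` and `ratio_nan` are invented for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `ivt_1_features`; the values are made up for illustration.
features = pd.DataFrame(
    {"fix_count": [10, 4, 3], "sac_count": [5, 0, 2]},
    index=pd.Index(["Q1", "Q2", "None"], name="QuestionKey"),
)

# Epsilon version, as in the notebook: a zero denominator yields a huge value.
features["ratio_eps"] = features["fix_count"] / (features["sac_count"] + 1e-5)

# Alternative (an assumption, not the notebook's method): treat a zero
# denominator as undefined by masking it to NaN before dividing.
denominator = features["sac_count"].where(features["sac_count"] > 0)
features["ratio_nan"] = features["fix_count"] / denominator

print(features)
```

Which behaviour is preferable depends on the downstream model: some estimators reject NaN, while a very large ratio can dominate scaled features.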

# **19_IVT**

In [None]:
df_19_IVT = pd.read_csv('data/STData/19/19_IVT.csv')

In [None]:
df_19_IVT.head()

In [None]:
df_19_IVT.columns

In [None]:
df_19_IVT.shape

In [None]:
df_19_IVT.info()

In [None]:
df_19_IVT.isnull().sum()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_19_IVT.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
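
One way to examine the interval pattern is to group consecutive null rows into runs and look at where each run starts and how long it lasts. The following is a minimal sketch on an invented toy series; on the real data, `question_key` would be the raw `QuestionKey` column before any filling.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw `QuestionKey` column; values are invented.
question_key = pd.Series(["Q1", "Q1", np.nan, np.nan, np.nan, "Q2", np.nan, np.nan, "Q3"])

is_null = question_key.isna()
# A new run starts wherever the null/non-null state changes.
run_id = (is_null != is_null.shift()).cumsum()

# Keep only the null runs and record where each starts and how long it is.
null_runs = (
    pd.DataFrame({"run_id": run_id[is_null], "position": np.flatnonzero(is_null)})
    .groupby("run_id")["position"]
    .agg(start="min", length="count")
)
print(null_runs)
```

A small number of long runs would be consistent with gaps between questions, whereas many short, scattered runs would point to something else.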

In [None]:
df_19_IVT['QuestionKey'].unique()

In [None]:
df_19_IVT['Timestamp'] = pd.to_datetime(df_19_IVT['Timestamp'])

In [None]:
df_19_IVT.head(3)

In [None]:
df_19_IVT['QuestionKey'] = df_19_IVT['QuestionKey'].fillna('None')

In [None]:
df_19_IVT['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_19_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_19_IVT.isnull().sum()

In [None]:
df_19_IVT.head()

In [None]:
df_19_IVT['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_19_IVT['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.
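
The claim that `Row` is only a counter can be checked directly rather than read off a histogram. A minimal sketch, using an invented toy series in place of the real column:

```python
import pandas as pd

# Toy stand-in for the `Row` column; the real values come from the CSV.
row = pd.Series([1, 2, 3, 4, 5], name="Row")

is_unique = row.is_unique
# A plain counter increases by exactly 1 from one row to the next.
is_step_one = bool((row.diff().dropna() == 1).all())

print(f"unique: {is_unique}, step of 1 throughout: {is_step_one}")
```

If both checks hold on the real column, it carries no information beyond the DataFrame's own index.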

In [None]:
df_19_IVT.drop('Row', axis=1, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

sns.scatterplot(data=df_19_IVT, x='Gaze X', y='Gaze Y', ax=axes[0])
axes[0].set_title('Gaze X vs Gaze Y')

sns.scatterplot(data=df_19_IVT, x='Interpolated Gaze X', y='Interpolated Gaze Y', ax=axes[1])
axes[1].set_title('Interpolated Gaze X vs Interpolated Gaze Y')

plt.tight_layout()
plt.show()

# Gaze and Interpolated Gaze Scatter Plots

The scatter plots above visualize the relationship between the x and y coordinates of both the raw gaze data and the interpolated gaze data.

- **Gaze X vs Gaze Y:** This plot shows the raw gaze coordinates. The scattered points indicate the locations on the screen where the participant was looking. The density of points in certain areas might suggest regions of interest.
- **Interpolated Gaze X vs Interpolated Gaze Y:** This plot shows the interpolated gaze coordinates. Interpolation is often used to fill in gaps in the raw gaze data, providing a smoother representation of the gaze path. Comparing this plot to the raw gaze plot can show the effect of the interpolation process.

Both plots can help in understanding the distribution of gaze points across the screen and identifying potential patterns or biases in eye movements.

In [None]:
df_19_IVT.describe()

In [None]:
df_19_IVT.head(3)

In [None]:
df_19_IVT['Timestamp'] = pd.to_datetime(df_19_IVT['Timestamp'])

In [None]:
df_19_IVT.columns

In [None]:
cols = ['Gaze X', 'Gaze Y',
       'Interpolated Gaze X', 'Interpolated Gaze Y', 'Interpolated Distance',
       'Gaze Velocity', 'Gaze Acceleration', 'Fixation Index',
       'Fixation Index by Stimulus', 'Fixation X', 'Fixation Y',
       'Fixation Start', 'Fixation End', 'Fixation Duration',
       'Fixation Dispersion', 'Saccade Index', 'Saccade Index by Stimulus',
       'Saccade Start', 'Saccade End', 'Saccade Duration', 'Saccade Amplitude',
       'Saccade Peak Velocity', 'Saccade Peak Acceleration',
       'Saccade Peak Deceleration', 'Saccade Direction']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    sns.lineplot(x=df_19_IVT['Timestamp'], y=df_19_IVT[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

In [None]:
df_19_IVT.head()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_19_IVT[['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus']].isnull(), cmap='viridis')
plt.show()

# Observation

The `Fixation Index`, `Fixation Index by Stimulus`, `Saccade Index` and `Saccade Index by Stimulus` columns are essentially just sequence numbers for identified events. While they indicate the order of fixations and saccades, they don't provide meaningful features for a machine learning model attempting to predict or classify eye movement patterns. Therefore, we will drop these columns as they are not useful for model building.

In [None]:
df_19_IVT.drop(['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(14,10))
sns.scatterplot(data=df_19_IVT, x='Fixation X', y='Fixation Y')
plt.title('Fixation X vs Fixation Y')
plt.show()

In [None]:
df_19_IVT['Fixation Start'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_19_IVT['Fixation Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Fixation Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Fixation Start')

sns.histplot(df_19_IVT['Fixation End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Fixation End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Fixation End')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_19_IVT['Saccade Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Saccade Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Saccade Start')

sns.histplot(df_19_IVT['Saccade End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Saccade End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Saccade End')

plt.tight_layout()
plt.show()

# Observation on Fixation and Saccade Timestamps

Upon examining the time series plots of `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` against `Timestamp`, we observe a clear linear, diagonal pattern. These values increase steadily with recording time, so they mainly encode when an event happened rather than anything about the event itself.

Furthermore, the histograms of these features show distributions that, while informative about the timing of events, don't necessarily reveal complex patterns that would be highly predictive for a machine learning model.

Crucially, the dataset already contains `Fixation Duration` and `Saccade Duration` columns. These duration features capture the length of each event, which is often a more directly relevant metric for understanding eye movement behavior than the absolute start and end times. Since the duration can be derived from the start and end times (Duration = End - Start), the start and end time columns introduce redundancy and do not provide substantial additional, independent information for modeling purposes.

Therefore, to simplify the dataset and focus on the most informative features for eye movement analysis or machine learning modeling, we will drop the `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` columns.
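
The redundancy argument rests on Duration = End - Start, which can be verified before the columns are dropped. The sketch below uses invented values and assumes the three columns share one time unit; that assumption should be confirmed against the real export.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the event columns; values are invented and assumed
# to be expressed in the same time unit.
events = pd.DataFrame({
    "Fixation Start": [100.0, 350.0, np.nan],
    "Fixation End": [300.0, 420.0, np.nan],
    "Fixation Duration": [200.0, 70.0, np.nan],
})

derived = events["Fixation End"] - events["Fixation Start"]
valid = events["Fixation Duration"].notna()

# Share of non-null rows where the stored duration matches End - Start.
match_rate = np.isclose(derived[valid], events.loc[valid, "Fixation Duration"]).mean()
print(f"rows where Duration == End - Start: {match_rate:.1%}")
```

A match rate well below 100% on the real data would mean the start and end columns are not fully redundant.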

In [None]:
df_19_IVT.drop(['Fixation Start', 'Fixation End', 'Saccade Start', 'Saccade End'], axis=1, inplace=True)

In [None]:
df_19_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_19_IVT[['Gaze X', 'Gaze Y', 'Interpolated Gaze X', 'Interpolated Gaze Y']].isnull(), cmap='viridis')
plt.show()


# Observations on Gaze and Interpolated Gaze Data

Based on the scatter plots of `Gaze X` vs `Gaze Y` and `Interpolated Gaze X` vs `Interpolated Gaze Y`, the distributions of the raw and interpolated gaze points appear very similar. The spatial patterns of where the participant was looking are consistent between the two sets of coordinates.

Furthermore, the heatmap of null values for these columns (`Gaze X`, `Gaze Y`, `Interpolated Gaze X`, `Interpolated Gaze Y`) shows missing values in the same rows for both the raw and interpolated coordinates. This suggests that the interpolation did not fill the gaps in the raw gaze data for these rows.

Given that the interpolated gaze data shows the same spatial distribution and the same pattern of null values as the raw gaze data, the interpolation does not appear to have altered or completed the data in this case. Keeping both sets of columns would therefore be redundant, so we drop the interpolated columns and keep the raw ones.
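
The visual comparison can be backed by a direct check of the null masks and the values. A minimal sketch on an invented toy frame, shown for the X coordinate only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw and interpolated X columns; values are invented.
gaze = pd.DataFrame({
    "Gaze X": [10.0, np.nan, 30.0, np.nan],
    "Interpolated Gaze X": [10.0, np.nan, 30.0, np.nan],
})
raw, interp = gaze["Gaze X"], gaze["Interpolated Gaze X"]

# True when both columns are missing in exactly the same rows.
same_null_rows = bool((raw.isna() == interp.isna()).all())
# Rows that interpolation filled: raw is missing but the interpolated value is present.
filled_by_interpolation = int((raw.isna() & interp.notna()).sum())
# Largest absolute difference where both values are present.
max_abs_diff = (raw - interp).abs().max()

print(same_null_rows, filled_by_interpolation, max_abs_diff)
```

On the real data, a nonzero `filled_by_interpolation` or a large `max_abs_diff` would argue for keeping the interpolated columns instead.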

In [None]:
df_19_IVT.drop(['Interpolated Gaze X', 'Interpolated Gaze Y'], axis=1, inplace=True)

In [None]:
df_19_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_19_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_19_IVT.columns

In [None]:
fix_1_df = df_19_IVT.dropna(subset=['Fixation Duration'])
sac_1_df = df_19_IVT.dropna(subset=['Saccade Duration'])

In [None]:
fix_1_df.shape

In [None]:
sac_1_df.shape

In [None]:
fix_1_feature = fix_1_df.groupby('QuestionKey').agg({
    'Fixation Duration': ['count','mean','max','sum','var'],
    'Fixation Dispersion': ['mean','max'],
    'Fixation X': ['var'],   # screen spread X
    'Fixation Y': ['var']    # screen spread Y
})

In [None]:
fix_1_feature.columns = ['fix_count','fix_mean_dur','fix_max_dur','fix_total_time',
                        'fix_dur_var','fix_disp_mean','fix_disp_max',
                        'fix_x_var','fix_y_var']

In [None]:
fix_1_feature

In [None]:
sac_1_features = sac_1_df.groupby('QuestionKey').agg({
    'Saccade Duration': ['count','mean','sum'],
    'Saccade Amplitude': ['mean','max'],
    'Saccade Peak Velocity': ['mean','max'],
    'Saccade Peak Acceleration': ['mean'],
    'Saccade Peak Deceleration': ['mean'],
    'Saccade Direction': ['var']   # direction variance
})

In [None]:
sac_1_features.columns = ['sac_count','sac_mean_dur','sac_total_time',
                        'sac_amp_mean','sac_amp_max',
                        'sac_vel_mean','sac_vel_max',
                        'sac_acc_mean','sac_dec_mean','sac_dir_var']

In [None]:
sac_1_features

In [None]:
ivt_1_features = fix_1_feature.join(sac_1_features, how='outer').fillna(0)

In [None]:
ivt_1_features

In [None]:
ivt_1_features['fix_sac_count_ratio'] = ivt_1_features['fix_count'] / (ivt_1_features['sac_count']+1e-5)
ivt_1_features['fix_sac_time_ratio']  = ivt_1_features['fix_total_time'] / (ivt_1_features['sac_total_time']+1e-5)

In [None]:
ivt_1_features

# Aggregation of Fixation and Saccade Features

In the preceding code cells, we performed aggregation on the `fix_1_df` and `sac_1_df` DataFrames, which contain the cleaned fixation and saccade data, respectively. The goal of this aggregation was to create a summary of eye-tracking metrics for each `QuestionKey`.

For fixations, we calculated:
- Count of fixations (`fix_count`)
- Mean, max, sum, and variance of fixation duration (`fix_mean_dur`, `fix_max_dur`, `fix_total_time`, `fix_dur_var`)
- Mean and max of fixation dispersion (`fix_disp_mean`, `fix_disp_max`)
- Variance of fixation X and Y coordinates (`fix_x_var`, `fix_y_var`) to represent screen spread.

For saccades, we calculated:
- Count of saccades (`sac_count`)
- Mean and sum of saccade duration (`sac_mean_dur`, `sac_total_time`)
- Mean and max of saccade amplitude (`sac_amp_mean`, `sac_amp_max`)
- Mean and max of saccade peak velocity (`sac_vel_mean`, `sac_vel_max`)
- Mean of saccade peak acceleration and deceleration (`sac_acc_mean`, `sac_dec_mean`)
- Variance of saccade direction (`sac_dir_var`).

Finally, we joined the aggregated fixation and saccade features into a single DataFrame, `ivt_1_features`, indexed by `QuestionKey`, and filled the resulting missing values with 0. Missing values arise for `QuestionKey` values that have only fixations or only saccades, and in the variance columns of any group with a single observation, where the sample variance is undefined. Two ratio features, `fix_sac_count_ratio` and `fix_sac_time_ratio`, were then added. `ivt_1_features` now provides a consolidated per-question summary of eye-tracking characteristics that can be used for further analysis or modeling.
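
The same aggregation is repeated for every participant, so it could be wrapped in a function. The sketch below covers the fixation half only and uses pandas named aggregation, which assigns the output column names in the same step instead of overwriting `.columns` afterwards. The function name `fixation_features` and the toy data are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def fixation_features(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate fixation rows per QuestionKey using named aggregation."""
    fix = df.dropna(subset=["Fixation Duration"])
    return fix.groupby("QuestionKey").agg(
        fix_count=("Fixation Duration", "count"),
        fix_mean_dur=("Fixation Duration", "mean"),
        fix_max_dur=("Fixation Duration", "max"),
        fix_total_time=("Fixation Duration", "sum"),
        fix_dur_var=("Fixation Duration", "var"),
        fix_disp_mean=("Fixation Dispersion", "mean"),
        fix_disp_max=("Fixation Dispersion", "max"),
        fix_x_var=("Fixation X", "var"),
        fix_y_var=("Fixation Y", "var"),
    )

# Invented toy data: Q2 has a single valid fixation row.
toy = pd.DataFrame({
    "QuestionKey": ["Q1", "Q1", "Q2", "Q2"],
    "Fixation Duration": [100.0, 300.0, 50.0, np.nan],
    "Fixation Dispersion": [0.2, 0.4, 0.1, np.nan],
    "Fixation X": [10.0, 20.0, 5.0, np.nan],
    "Fixation Y": [1.0, 3.0, 2.0, np.nan],
})
features = fixation_features(toy)
print(features)
```

In the toy output, `Q2` has a NaN variance because it has only one valid row. That is the kind of value the later `fillna(0)` turns into 0.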

# **20_IVT**

In [None]:
df_20_IVT = pd.read_csv('data/STData/20/20_IVT.csv')

In [None]:
df_20_IVT.head()

In [None]:
df_20_IVT.columns

In [None]:
df_20_IVT.shape

In [None]:
df_20_IVT.info()

In [None]:
df_20_IVT.isnull().sum()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_20_IVT.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.

In [None]:
df_20_IVT['QuestionKey'].unique()

In [None]:
df_20_IVT['Timestamp'] = pd.to_datetime(df_20_IVT['Timestamp'])

In [None]:
df_20_IVT.head(3)

In [None]:
df_20_IVT['QuestionKey'] = df_20_IVT['QuestionKey'].fillna('None')

In [None]:
df_20_IVT['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_20_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_20_IVT.isnull().sum()

In [None]:
df_20_IVT.head()

In [None]:
df_20_IVT['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_20_IVT['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_20_IVT.drop('Row', axis=1, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

sns.scatterplot(data=df_20_IVT, x='Gaze X', y='Gaze Y', ax=axes[0])
axes[0].set_title('Gaze X vs Gaze Y')

sns.scatterplot(data=df_20_IVT, x='Interpolated Gaze X', y='Interpolated Gaze Y', ax=axes[1])
axes[1].set_title('Interpolated Gaze X vs Interpolated Gaze Y')

plt.tight_layout()
plt.show()

# Gaze and Interpolated Gaze Scatter Plots

The scatter plots above visualize the relationship between the x and y coordinates of both the raw gaze data and the interpolated gaze data.

- **Gaze X vs Gaze Y:** This plot shows the raw gaze coordinates. The scattered points indicate the locations on the screen where the participant was looking. The density of points in certain areas might suggest regions of interest.
- **Interpolated Gaze X vs Interpolated Gaze Y:** This plot shows the interpolated gaze coordinates. Interpolation is often used to fill in gaps in the raw gaze data, providing a smoother representation of the gaze path. Comparing this plot to the raw gaze plot can show the effect of the interpolation process.

Both plots can help in understanding the distribution of gaze points across the screen and identifying potential patterns or biases in eye movements.

In [None]:
df_20_IVT.describe()

In [None]:
df_20_IVT.head(3)

In [None]:
df_20_IVT['Timestamp'] = pd.to_datetime(df_20_IVT['Timestamp'])

In [None]:
df_20_IVT.columns

In [None]:
cols = ['Gaze X', 'Gaze Y',
       'Interpolated Gaze X', 'Interpolated Gaze Y', 'Interpolated Distance',
       'Gaze Velocity', 'Gaze Acceleration', 'Fixation Index',
       'Fixation Index by Stimulus', 'Fixation X', 'Fixation Y',
       'Fixation Start', 'Fixation End', 'Fixation Duration',
       'Fixation Dispersion', 'Saccade Index', 'Saccade Index by Stimulus',
       'Saccade Start', 'Saccade End', 'Saccade Duration', 'Saccade Amplitude',
       'Saccade Peak Velocity', 'Saccade Peak Acceleration',
       'Saccade Peak Deceleration', 'Saccade Direction']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    sns.lineplot(x=df_20_IVT['Timestamp'], y=df_20_IVT[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

In [None]:
df_20_IVT.head()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df_20_IVT[['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus']].isnull(), cmap='viridis')
plt.show()

# Observation

The `Fixation Index`, `Fixation Index by Stimulus`, `Saccade Index` and `Saccade Index by Stimulus` columns are essentially just sequence numbers for identified events. While they indicate the order of fixations and saccades, they don't provide meaningful features for a machine learning model attempting to predict or classify eye movement patterns. Therefore, we will drop these columns as they are not useful for model building.
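
Before dropping these columns, it may be worth checking whether each event occupies one row or many. The sketch below assumes, purely for illustration, a sample-level layout in which every row of a fixation repeats that fixation's index and duration. The toy values are invented, and whether the real export looks like this is not confirmed here. Under that assumption, a row count overstates the number of fixations, and the index column is what allows one row per event to be recovered.

```python
import pandas as pd

# Toy sample-level table in which one fixation spans several rows
# (an assumed layout, not confirmed for the real export).
samples = pd.DataFrame({
    "QuestionKey": ["Q1"] * 5,
    "Fixation Index": [1, 1, 1, 2, 2],
    "Fixation Duration": [120.0, 120.0, 120.0, 80.0, 80.0],
})

# Counting rows versus counting distinct events.
rows_per_question = samples.groupby("QuestionKey")["Fixation Duration"].count()
events_per_question = samples.groupby("QuestionKey")["Fixation Index"].nunique()

# Keep one row per event so durations are not summed once per sample.
per_event = samples.drop_duplicates(subset=["QuestionKey", "Fixation Index"])
total_time = per_event.groupby("QuestionKey")["Fixation Duration"].sum()

print(rows_per_question["Q1"], events_per_question["Q1"], total_time["Q1"])
```

If the two counts agree on the real data, the table is already one row per event and the index columns can be dropped without consequence.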

In [None]:
df_20_IVT.drop(['Fixation Index', 'Fixation Index by Stimulus', 'Saccade Index', 'Saccade Index by Stimulus'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(14,10))
sns.scatterplot(data=df_20_IVT, x='Fixation X', y='Fixation Y')
plt.title('Fixation X vs Fixation Y')
plt.show()

In [None]:
df_20_IVT['Fixation Start'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_20_IVT['Fixation Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Fixation Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Fixation Start')

sns.histplot(df_20_IVT['Fixation End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Fixation End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Fixation End')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.histplot(df_20_IVT['Saccade Start'], bins=100, kde=True, ax=axes[0])
axes[0].set_xlabel('Saccade Start')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Saccade Start')

sns.histplot(df_20_IVT['Saccade End'], bins=100, kde=True, ax=axes[1])
axes[1].set_xlabel('Saccade End')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Saccade End')

plt.tight_layout()
plt.show()

# Observation on Fixation and Saccade Timestamps

Upon examining the time series plots of `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` against `Timestamp`, we observe a clear linear, diagonal pattern. These values increase steadily with recording time, so they mainly encode when an event happened rather than anything about the event itself.

Furthermore, the histograms of these features show distributions that, while informative about the timing of events, don't necessarily reveal complex patterns that would be highly predictive for a machine learning model.

Crucially, the dataset already contains `Fixation Duration` and `Saccade Duration` columns. These duration features capture the length of each event, which is often a more directly relevant metric for understanding eye movement behavior than the absolute start and end times. Since the duration can be derived from the start and end times (Duration = End - Start), the start and end time columns introduce redundancy and do not provide substantial additional, independent information for modeling purposes.

Therefore, to simplify the dataset and focus on the most informative features for eye movement analysis or machine learning modeling, we will drop the `Fixation Start`, `Fixation End`, `Saccade Start`, and `Saccade End` columns.

In [None]:
df_20_IVT.drop(['Fixation Start', 'Fixation End', 'Saccade Start', 'Saccade End'], axis=1, inplace=True)

In [None]:
df_20_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_20_IVT[['Gaze X', 'Gaze Y', 'Interpolated Gaze X', 'Interpolated Gaze Y']].isnull(), cmap='viridis')
plt.show()


# Observations on Gaze and Interpolated Gaze Data

Based on the scatter plots of `Gaze X` vs `Gaze Y` and `Interpolated Gaze X` vs `Interpolated Gaze Y`, the distributions of the raw and interpolated gaze points appear very similar. The spatial patterns of where the participant was looking are consistent between the two sets of coordinates.

Furthermore, the heatmap of null values for these columns (`Gaze X`, `Gaze Y`, `Interpolated Gaze X`, `Interpolated Gaze Y`) shows missing values in the same rows for both the raw and interpolated coordinates. This suggests that the interpolation did not fill the gaps in the raw gaze data for these rows.

Given that the interpolated gaze data shows the same spatial distribution and the same pattern of null values as the raw gaze data, the interpolation does not appear to have altered or completed the data in this case. Keeping both sets of columns would therefore be redundant, so we drop the interpolated columns and keep the raw ones.

In [None]:
df_20_IVT.drop(['Interpolated Gaze X', 'Interpolated Gaze Y'], axis=1, inplace=True)

In [None]:
df_20_IVT.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df_20_IVT.isnull(), cmap='viridis')
plt.show()

In [None]:
df_20_IVT.columns

In [None]:
fix_1_df = df_20_IVT.dropna(subset=['Fixation Duration'])
sac_1_df = df_20_IVT.dropna(subset=['Saccade Duration'])

In [None]:
fix_1_df.shape

In [None]:
sac_1_df.shape

In [None]:
fix_1_feature = fix_1_df.groupby('QuestionKey').agg({
    'Fixation Duration': ['count','mean','max','sum','var'],
    'Fixation Dispersion': ['mean','max'],
    'Fixation X': ['var'],   # screen spread X
    'Fixation Y': ['var']    # screen spread Y
})

In [None]:
fix_1_feature.columns = ['fix_count','fix_mean_dur','fix_max_dur','fix_total_time',
                        'fix_dur_var','fix_disp_mean','fix_disp_max',
                        'fix_x_var','fix_y_var']

In [None]:
fix_1_feature

In [None]:
sac_1_features = sac_1_df.groupby('QuestionKey').agg({
    'Saccade Duration': ['count','mean','sum'],
    'Saccade Amplitude': ['mean','max'],
    'Saccade Peak Velocity': ['mean','max'],
    'Saccade Peak Acceleration': ['mean'],
    'Saccade Peak Deceleration': ['mean'],
    'Saccade Direction': ['var']   # direction variance
})

In [None]:
sac_1_features.columns = ['sac_count','sac_mean_dur','sac_total_time',
                        'sac_amp_mean','sac_amp_max',
                        'sac_vel_mean','sac_vel_max',
                        'sac_acc_mean','sac_dec_mean','sac_dir_var']

In [None]:
sac_1_features

In [None]:
ivt_1_features = fix_1_feature.join(sac_1_features, how='outer').fillna(0)

In [None]:
ivt_1_features

In [None]:
ivt_1_features['fix_sac_count_ratio'] = ivt_1_features['fix_count'] / (ivt_1_features['sac_count']+1e-5)
ivt_1_features['fix_sac_time_ratio']  = ivt_1_features['fix_total_time'] / (ivt_1_features['sac_total_time']+1e-5)

In [None]:
ivt_1_features

# Aggregation of Fixation and Saccade Features

In the preceding code cells, we performed aggregation on the `fix_1_df` and `sac_1_df` DataFrames, which contain the cleaned fixation and saccade data, respectively. The goal of this aggregation was to create a summary of eye-tracking metrics for each `QuestionKey`.

For fixations, we calculated:
- Count of fixations (`fix_count`)
- Mean, max, sum, and variance of fixation duration (`fix_mean_dur`, `fix_max_dur`, `fix_total_time`, `fix_dur_var`)
- Mean and max of fixation dispersion (`fix_disp_mean`, `fix_disp_max`)
- Variance of fixation X and Y coordinates (`fix_x_var`, `fix_y_var`) to represent screen spread.

For saccades, we calculated:
- Count of saccades (`sac_count`)
- Mean and sum of saccade duration (`sac_mean_dur`, `sac_total_time`)
- Mean and max of saccade amplitude (`sac_amp_mean`, `sac_amp_max`)
- Mean and max of saccade peak velocity (`sac_vel_mean`, `sac_vel_max`)
- Mean of saccade peak acceleration and deceleration (`sac_acc_mean`, `sac_dec_mean`)
- Variance of saccade direction (`sac_dir_var`).

Finally, we joined the aggregated fixation and saccade features into a single DataFrame, `ivt_1_features`, indexed by `QuestionKey`, and filled the resulting missing values with 0. Missing values arise for `QuestionKey` values that have only fixations or only saccades, and in the variance columns of any group with a single observation, where the sample variance is undefined. Two ratio features, `fix_sac_count_ratio` and `fix_sac_time_ratio`, were then added. `ivt_1_features` now provides a consolidated per-question summary of eye-tracking characteristics that can be used for further analysis or modeling.
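
Each participant section reuses the names `fix_1_df`, `sac_1_df`, and `ivt_1_features`, so every run overwrites the previous participant's result. If the per-participant tables are kept under separate names, they can be stacked into one frame keyed by participant. The sketch below uses invented toy tables; `features_19`, `features_20`, and the `participant` level name are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Toy stand-ins for two participants' per-question feature tables.
features_19 = pd.DataFrame(
    {"fix_count": [12, 7]}, index=pd.Index(["Q1", "Q2"], name="QuestionKey")
)
features_20 = pd.DataFrame(
    {"fix_count": [9, 15]}, index=pd.Index(["Q1", "Q2"], name="QuestionKey")
)

# Passing a dict makes its keys an outer index level; `names` labels both levels.
all_features = pd.concat(
    {"19": features_19, "20": features_20},
    names=["participant", "QuestionKey"],
)
print(all_features)
```

The result has one row per participant and question, which is a convenient shape for modeling across participants.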