# Eye Tracking Data Analysis

This notebook performs exploratory data analysis and cleaning on eye-tracking data.


In [None]:
%load_ext cudf.pandas

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import datashader as ds
import datashader.transfer_functions as tf

In [None]:
pd.set_option('display.max_columns', None)

# **1_EYE**

In [None]:
df_1_EYE = pd.read_csv('data/STData/1/1_EYE.csv')

In [None]:
df_1_EYE.head()

In [None]:
df_1_EYE.shape

In [None]:
df_1_EYE.columns

In [None]:
df_1_EYE.info()

In [None]:
df_1_EYE.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_1_EYE.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
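
One way to check the interval pattern is to label each contiguous run of null and non-null `QuestionKey` rows and summarise where each run starts and ends. The sketch below uses a small made-up frame; the sample values and the `run_id` and `runs` names are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real frame; all values are invented for illustration.
toy = pd.DataFrame({
    "Timestamp": pd.date_range("2024-01-01", periods=8, freq="s"),
    "QuestionKey": [None, None, "Q1", "Q1", "Q1", None, "Q2", "Q2"],
})

is_null = toy["QuestionKey"].isna()
# A new run starts whenever the null flag differs from the previous row.
toy["run_id"] = (is_null != is_null.shift()).cumsum()

# One summary row per run: time span, length, and whether the run is null.
runs = (
    toy.groupby("run_id")
       .agg(start=("Timestamp", "min"),
            end=("Timestamp", "max"),
            n_rows=("Timestamp", "size"),
            is_null=("QuestionKey", lambda s: s.isna().all()))
       .reset_index(drop=True)
)
print(runs)
```

If the null runs sit between question runs with plausible durations, that supports reading them as "no question displayed" rather than as lost data.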

In [None]:
df_1_EYE['QuestionKey'].unique()

In [None]:
df_1_EYE['Timestamp'] = pd.to_datetime(df_1_EYE['Timestamp'])

In [None]:
df_1_EYE.head(3)

In [None]:
df_1_EYE['QuestionKey'] = df_1_EYE['QuestionKey'].fillna('None')

In [None]:
df_1_EYE['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_1_EYE.isnull(), cmap='viridis')
plt.show()

In [None]:
df_1_EYE.isnull().sum()

In [None]:
df_1_EYE.dropna(inplace=True)

In [None]:
df_1_EYE.head()

In [None]:
df_1_EYE['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_1_EYE['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.
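
The claim that `Row` is only a counter can be checked directly instead of read off the histogram. A minimal sketch, assuming a plain counter means unique values that increase by exactly 1; the sample values are invented.

```python
import pandas as pd

# Invented values standing in for the real `Row` column.
toy = pd.DataFrame({"Row": [5, 6, 7, 8, 9]})

row = toy["Row"]
# True when every value is unique and each step increases by exactly 1.
is_plain_counter = row.is_unique and (row.diff().dropna() == 1).all()
print(is_plain_counter)
```

On the real frame this check is best run before `dropna`, because removed rows leave gaps in the counter.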

In [None]:
df_1_EYE.drop('Row', axis=1, inplace=True)

In [None]:
df_1_EYE['ET_ValidityLeft'].unique()

In [None]:
df_1_EYE['ET_ValidityLeft'].value_counts()

In [None]:
df_1_EYE['ET_ValidityRight'].unique()

In [None]:
df_1_EYE['ET_ValidityRight'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=df_1_EYE['ET_ValidityLeft'].value_counts().index, y=df_1_EYE['ET_ValidityLeft'].value_counts().values)
plt.title('Count of ET_ValidityLeft')
plt.xlabel('Validity')
plt.ylabel('Count')


plt.subplot(1, 2, 2)
sns.barplot(x=df_1_EYE['ET_ValidityRight'].value_counts().index, y=df_1_EYE['ET_ValidityRight'].value_counts().values)
plt.title('Count of ET_ValidityRight')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Notes & Observations

- The `ET_ValidityLeft` and `ET_ValidityRight` columns indicate the validity of the eye-tracking data for the left and right eye, respectively.
- Based on the value counts and the bar plots, it appears that a value of `0.0` represents valid eye-tracking data, while a value of `4.0` represents invalid data.
- Although the amount of invalid data is relatively small, removing these rows could introduce unwanted patterns or gaps in the time series data.
- Therefore, we will keep the data and replace the value `4.0` with `1.0` in both `ET_ValidityLeft` and `ET_ValidityRight` columns. This will indicate to a machine learning model that the eye tracker had invalid data at those specific points in time while maintaining the integrity of the time series.

Define a mapping to convert validity values from `0.0` and `4.0` to `0` and `1`.
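
A mapping like this returns `NaN` for any code it does not list, and the integer cast then fails. The sketch below shows one possible guard on invented sample values; the code `2.0` is hypothetical and is not known to occur in the data.

```python
import numpy as np
import pandas as pd

validity_map = {0.0: 0, 4.0: 1}

# Invented sample; 2.0 is a hypothetical code that the map does not cover.
raw = pd.Series([0.0, 4.0, 0.0, 2.0], name="ET_ValidityLeft")

mapped = raw.map(validity_map)
# Codes that were not in the map and would break the integer cast.
unmapped = raw[mapped.isna()].unique()
print(unmapped)

# For this illustration, cast only the rows that were mapped successfully.
clean = mapped.dropna().astype(np.int8)
print(clean.tolist())
```

How to treat an unexpected code is a separate decision; the point here is only to detect it before the cast.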

In [None]:
validity_map = {4.0: 1.0, 0.0: 0.0}

In [None]:
df_1_EYE['ET_ValidityLeft'] = df_1_EYE['ET_ValidityLeft'].map(validity_map).astype(np.int8)
df_1_EYE['ET_ValidityRight'] = df_1_EYE['ET_ValidityRight'].map(validity_map).astype(np.int8)

In [None]:
df_1_EYE.head(3)

In [None]:
df_1_EYE.describe()

In [None]:
df_1_EYE[df_1_EYE['ET_ValidityLeft'] == 1].shape

In [None]:
df_1_EYE[df_1_EYE['ET_ValidityRight'] == 1].shape

In [None]:
df_1_EYE[df_1_EYE['ET_ValidityLeft'] == 1].shape[0] / df_1_EYE.shape[0]

In [None]:
df_1_EYE[df_1_EYE['ET_ValidityRight'] == 1].shape[0] / df_1_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_1_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_1_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
df_1_EYE[df_1_EYE['ET_PupilLeft'] == -1].shape

In [None]:
df_1_EYE[df_1_EYE['ET_PupilRight'] == -1].shape

In [None]:
df_1_EYE[df_1_EYE['ET_PupilLeft'] == -1].shape[0] / df_1_EYE.shape[0]

In [None]:
df_1_EYE[df_1_EYE['ET_PupilRight'] == -1].shape[0] / df_1_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_1_EYE[df_1_EYE['ET_ValidityLeft'] == 1] == -1, cmap='viridis')
plt.title('-1 Values Where ET_ValidityLeft == 1')

plt.subplot(1, 2, 2)
sns.heatmap(df_1_EYE[df_1_EYE['ET_ValidityRight'] == 1] == -1, cmap='viridis')
plt.title('-1 Values Where ET_ValidityRight == 1')

plt.tight_layout()
plt.show()

# Notes & Observations

- The heatmaps reveal the distribution of -1 values across different columns.
- It is evident that the `-1` values are not randomly scattered but appear in specific columns, notably `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY`.
- These `-1` values often coincide with instances where `ET_ValidityLeft` or `ET_ValidityRight` is 1, indicating invalid eye-tracking data. This suggests that `-1` is used as a placeholder for missing or invalid measurements in these columns when the eye tracker is not providing valid data for a particular eye.
- More than 70% of the values in `ET_PupilLeft` and `ET_PupilRight` are marked as invalid (`-1`). Rather than dropping these columns, we create a new indicator feature for each one that flags the rows containing invalid pupil data.
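
The indicator can also be built in a single step from a boolean comparison, which should give the same result as mapping `-1` to `1` and filling the remainder with `0`. A minimal sketch on invented readings:

```python
import numpy as np
import pandas as pd

# Invented pupil readings; -1 is the sentinel discussed above.
toy = pd.DataFrame({"ET_PupilLeft": [-1.0, 3.2, -1.0, 3.4, -1.0]})

# The comparison yields the flag directly, with no intermediate NaN step.
toy["ET_PupilLeft_validity"] = (toy["ET_PupilLeft"] == -1).astype(np.int8)

# The mean of a 0/1 flag is the share of flagged rows.
invalid_share = toy["ET_PupilLeft_validity"].mean()
print(invalid_share)
```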

In [None]:
pupil_validity = {-1: 1}

In [None]:
df_1_EYE['ET_PupilLeft_validity'] = df_1_EYE['ET_PupilLeft'].map(pupil_validity)

In [None]:
df_1_EYE['ET_PupilRight_validity'] = df_1_EYE['ET_PupilRight'].map(pupil_validity)

In [None]:
df_1_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
df_1_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull().sum()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_1_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_1_EYE['ET_PupilLeft_validity'] = df_1_EYE['ET_PupilLeft_validity'].fillna(0)

In [None]:
df_1_EYE['ET_PupilRight_validity'] = df_1_EYE['ET_PupilRight_validity'].fillna(0)

In [None]:
df_1_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_1_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_1_EYE.head()

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_1_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_1_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
valid_left_ratio  = 1 - df_1_EYE['ET_ValidityLeft'].mean()

In [None]:
valid_left_ratio

In [None]:
valid_right_ratio = 1 - df_1_EYE['ET_ValidityRight'].mean()

In [None]:
valid_right_ratio

In [None]:
df_1_EYE['ET_PupilLeft_validity'] = df_1_EYE['ET_PupilLeft_validity'].astype(np.int8)
df_1_EYE['ET_PupilRight_validity'] = df_1_EYE['ET_PupilRight_validity'].astype(np.int8)

# Feature Engineering and Observations

Based on the analysis of the data, we've created two new features, `ET_PupilLeft_validity` and `ET_PupilRight_validity`. These features indicate the validity of the pupil data for the left and right eyes, respectively, with a value of 1 representing invalid data (originally -1) and 0 representing valid data.

The heatmaps above visually demonstrate the distribution of -1 and 1 values across the dataset. We observed that:
- The `-1` values are concentrated in specific columns related to gaze, pupil size, distance, and camera position, suggesting they represent missing or invalid sensor readings.
- The `1` values, after mapping from `4.0` in the original validity columns, indicate instances of invalid eye-tracking data.
- The `-1` values in the pupil columns line up exactly with a value of 1 in the new pupil validity features. This is expected, because the features are derived directly from those `-1` values; the heatmaps mainly show how much of the pupil data is affected.
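
How often the `-1` placeholder coincides with the tracker's own validity flag can be counted with a contingency table rather than judged by eye. The sketch below uses invented values, and the `reading`/`sentinel` labels are illustrative names.

```python
import numpy as np
import pandas as pd

# Invented sample: validity already mapped to 0 (valid) / 1 (invalid).
toy = pd.DataFrame({
    "ET_ValidityLeft": [0, 0, 1, 1, 0, 0],
    "ET_PupilLeft":    [3.1, -1.0, -1.0, -1.0, 3.3, -1.0],
})

# Label each pupil value as a real reading or the -1 sentinel.
pupil_state = np.where(toy["ET_PupilLeft"] == -1, "sentinel", "reading")

# Rows: tracker validity flag. Columns: pupil reading vs sentinel.
table = pd.crosstab(toy["ET_ValidityLeft"], pupil_state, colnames=["pupil"])
print(table)
```

A large count in the "validity 0, sentinel" cell would mean pupil size is often missing even when the eye itself was tracked.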

In [None]:
df_1_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding time-related columns
numeric_cols = df_1_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_1_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms

The grid of histograms provides insights into the distribution of values for each numeric column in the dataset (excluding 'UnixTime'). Key observations include:

- Several columns, such as `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, and `ET_GazeRighty`, show distributions that appear somewhat multimodal or skewed, suggesting variations in gaze patterns.
- The `ET_PupilLeft` and `ET_PupilRight` histograms clearly show a peak at -1, confirming the presence of a significant number of invalid pupil readings.
- `ET_TimeSignal` shows a relatively uniform distribution, as expected for a time-based signal.
- `ET_DistanceLeft` and `ET_DistanceRight` appear to have distributions centered around certain values, with some outliers or variations.
- The camera position columns (`ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`) seem to have distributions concentrated within specific ranges, reflecting the camera's field of view.
- The validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show distributions dominated by 0, indicating that most of the data is considered valid after the mapping. The smaller peaks at 1 represent the instances of invalid data.

These distributions highlight the need for appropriate handling of the -1 values and potential outliers in subsequent analysis or modeling steps.

In [None]:
df_1_EYE.columns

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_1_EYE['Timestamp'], df_1_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots

The line plots showing various features against the `Timestamp` reveal the temporal patterns and fluctuations in the eye-tracking data. Key observations include:

- **Gaze Coordinates (`ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`):** These plots show the changes in gaze position over time. We can observe periods of relatively stable gaze interspersed with rapid movements (saccades) and blinks or other events where the gaze data might be invalid (-1 values appear as gaps or spikes if not handled).
- **Pupil Size (`ET_PupilLeft`, `ET_PupilRight`):** The pupil size plots show variations over time. The presence of many -1 values is evident as flat lines at the bottom of the plot, indicating periods where pupil data was not recorded or was invalid.
- **Time Signal (`ET_TimeSignal`):** This plot shows a steady, increasing trend, as expected for a time-based signal.
- **Distance and Camera Position (`ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`):** These plots show how the distance from the eye tracker and the camera positions change over time. Variations in these features can be related to head movements or changes in the user's position relative to the eye tracker.
- **Validity (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`):** These plots clearly show periods of invalid data (represented by 1) as spikes or plateaus, corresponding to instances where the eye tracker lost track of the eyes or the pupil data was marked as invalid.

Analyzing these time series plots is crucial for understanding the dynamics of the eye-tracking data and identifying patterns or anomalies that may require further investigation or specific handling during subsequent analysis.

In [None]:
# Select all numeric columns for plotting boxplots
numeric_cols = df_1_EYE.select_dtypes(include=np.number).columns

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(df_1_EYE[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

# Observations from Boxplots and Handling -1 Values

The boxplots provide a visual summary of the distribution and potential outliers for each numeric column. Key observations from the boxplots include:

- The boxplots for columns like `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY` clearly show the presence of -1 values as significant outliers, confirming our earlier observations from the heatmaps and histograms.
- The boxplots for the validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show the discrete nature of these features, with the majority of data points at 0 (valid) and a smaller number at 1 (invalid).

Given the significant presence of -1 values, which represent invalid or missing data, especially in the pupil-related columns, we have decided to replace these -1 values with NaN to properly represent them as missing data. Subsequently, we will impute these missing values using the mean of each respective column. This approach helps to retain the data structure and allows for further analysis or modeling without the distortion caused by the -1 placeholders.
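
The effect of the placeholder on a column mean, and the replace-then-impute sequence, can be seen on a tiny example. This is a sketch on invented values; `sensor_cols` is an assumed list of the columns that use `-1` as a sentinel, which restricts the replacement to those columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ET_PupilLeft": [3.0, -1.0, 4.0, -1.0, 5.0],
    "ET_ValidityLeft": [0, 1, 0, 1, 0],
})
sensor_cols = ["ET_PupilLeft"]  # assumed list of columns that use the -1 sentinel

# With the sentinel left in place: (3 - 1 + 4 - 1 + 5) / 5
mean_with_sentinel = toy["ET_PupilLeft"].mean()

# Mark the sentinel as missing; mean() skips NaN: (3 + 4 + 5) / 3
toy[sensor_cols] = toy[sensor_cols].replace(-1, np.nan)
mean_without_sentinel = toy["ET_PupilLeft"].mean()

# Fill each missing value with the mean of its own column.
toy[sensor_cols] = toy[sensor_cols].fillna(toy[sensor_cols].mean())
print(mean_with_sentinel, mean_without_sentinel)
print(toy)
```

Replacing the sentinel before computing the mean matters: otherwise the fill value itself is pulled toward `-1`.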

In [None]:
df_1_EYE.replace({-1: np.nan}, inplace=True)

In [None]:
df_1_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].mean()

In [None]:
df_1_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].median()

In [None]:
numeric_cols = df_1_EYE.select_dtypes(include=np.number).columns

for col in numeric_cols:
    df_1_EYE[col] = df_1_EYE[col].fillna(df_1_EYE[col].mean())

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_1_EYE.isnull(), cmap='viridis')
plt.title('Heatmap of Missing Values After Imputation')

plt.subplot(1, 2, 2)
sns.heatmap(df_1_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

# Handling Missing Values (Imputation)

As decided, we have replaced all the `-1` values with `NaN` to treat them as missing data. Subsequently, we have imputed these `NaN` values with the mean of their respective columns. The heatmap above, which was generated after the imputation, now shows no visible signs of `NaN` values, indicating that the imputation was successful.

In [None]:
df_1_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding time-related columns
numeric_cols = df_1_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_1_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms After Imputation

The histograms generated after replacing the -1 values with the mean of each column show the distributions of the numeric features with the missing data handled. Key observations from these updated histograms include:

- The distinct peaks at -1, which were prominent in the histograms for several columns (e.g., pupil size, gaze coordinates, distance, and camera position) before imputation, are now replaced by a peak at the mean of each respective column.
- The distributions in many columns now appear more unimodal or show shifted modes compared to the original histograms.
- The histograms for the validity columns still show their bimodal distributions with peaks at 0 and 1, as these were handled separately.

These histograms provide an updated view of the data's distribution after handling the missing values, highlighting the impact of the imputation method on the data's characteristics.
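
One concrete impact of mean imputation is that it keeps the column mean but shrinks the spread, since every filled value sits exactly at the centre. A small worked example on invented numbers:

```python
import numpy as np
import pandas as pd

# Three observed values and three missing ones (invented).
observed = pd.Series([2.0, 4.0, 6.0, np.nan, np.nan, np.nan])

std_before = observed.std()                  # uses only the 3 observed values
imputed = observed.fillna(observed.mean())   # fills the gaps with 4.0
std_after = imputed.std()                    # uses all 6 values

print(std_before, std_after, imputed.mean())
```

The more values are imputed, the stronger this shrinkage becomes, which is worth keeping in mind for the pupil columns.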

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_1_EYE['Timestamp'], df_1_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots After Imputation

The line plots generated after imputing the missing values with the mean show the temporal patterns of the features with the missing data handled. Key observations from these updated plots include:

- The gaps or flat lines at -1, which were prominent in the plots for columns like gaze coordinates, pupil size, distance, and camera position, are now filled by lines at the mean value of the respective columns.
- The plots for the validity columns remain the same as they were handled separately.
- The `ET_TimeSignal` plot still shows a steady increasing trend, as expected.

In [None]:
plt.figure(figsize=(16, 10))
sns.heatmap(df_1_EYE.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()

# Observations from Correlation Heatmap

The correlation heatmap provides a visual representation of the pairwise correlations between the numeric columns in the dataset. Key observations from the heatmap include:

- **High Positive Correlations:** We observe strong positive correlations (values close to 1) between:
  - `ET_GazeLeftx` and `ET_GazeRightx`: This is expected as the gaze positions of both eyes should be highly correlated when fixating on a point.
  - `ET_GazeLefty` and `ET_GazeRighty`: Similar to the x-coordinates, the y-coordinates of gaze should also be highly correlated.
  - `ET_PupilLeft` and `ET_PupilRight`: Pupil sizes of both eyes tend to change together in response to light and cognitive load.
  - `ET_DistanceLeft` and `ET_DistanceRight`: The distance from the eye tracker to each eye should be highly correlated.
  - `ET_CameraLeftX` and `ET_CameraRightX`, `ET_CameraLeftY` and `ET_CameraRightY`: The camera positions for both eyes are also expected to be highly correlated.
  - `UnixTime` and `ET_TimeSignal`: These two columns are almost perfectly linearly correlated, indicating redundancy.
  - `ET_ValidityLeft` and `ET_PupilLeft_validity`: There is a positive correlation, suggesting that when the overall left eye data is invalid, the left pupil data is also likely to be invalid.
  - `ET_ValidityRight` and `ET_PupilRight_validity`: Similar to the left eye, there is a positive correlation between the overall right eye validity and the right pupil validity.
- **Other Correlations:** We can also observe other varying degrees of correlations between different features, which can provide insights into the relationships between gaze behavior, pupil size, distance, and camera position. For example, there might be correlations between gaze coordinates and camera positions, reflecting head movements.
- **Low or Near-Zero Correlations:** Columns with low or near-zero correlations are relatively independent of each other.

Understanding these correlations is important for feature selection and for building models, as highly correlated features might indicate multicollinearity, while correlations between features can reveal underlying patterns in the data.
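
Highly correlated pairs can also be listed programmatically instead of read off the annotated heatmap. The sketch below uses synthetic columns, and the 0.9 cut-off is an arbitrary illustrative threshold, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.01, size=200),  # near-duplicate of a
    "c": rng.normal(size=200),                     # unrelated to a and b
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.9]  # illustrative threshold
print(high)
```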

# Analysis of ET_TimeSignal and Decision to Drop

As observed in the time series plot and confirmed by the correlation heatmap, the `ET_TimeSignal` column exhibits a near-perfect linear relationship with both the `Timestamp` and `UnixTime` columns. This strong correlation (close to 1) suggests that `ET_TimeSignal` is essentially redundant and likely represents another form of time recording or a signal directly derived from the timestamp.

Including highly correlated features like this in a dataset can lead to issues such as multicollinearity in some statistical models, which can make it difficult to interpret the individual impact of each feature. Since the `Timestamp` column already provides the necessary temporal information, retaining `ET_TimeSignal` does not appear to add significant value for further analysis or modeling in most cases.

Therefore, based on its high correlation and lack of unique insight, we will proceed to drop the `ET_TimeSignal` column to simplify the dataset and potentially improve the performance and interpretability of future analyses.
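
The redundancy argument can be made explicit by correlating the signal with elapsed time before dropping it. This is a sketch on synthetic data: the linear form of the toy `ET_TimeSignal` and the 0.999 threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Timestamp": pd.date_range("2024-01-01", periods=100, freq="10ms"),
})
# Seconds since the first sample.
elapsed = (toy["Timestamp"] - toy["Timestamp"].iloc[0]).dt.total_seconds()
# Hypothetical device clock: same time axis, different origin and unit.
toy["ET_TimeSignal"] = 5000.0 + 1000.0 * elapsed

# Pearson correlation between elapsed time and the signal.
r = np.corrcoef(elapsed, toy["ET_TimeSignal"])[0, 1]
if r > 0.999:  # threshold chosen for illustration
    toy = toy.drop(columns="ET_TimeSignal")
print(r, list(toy.columns))
```

A correlation this close to 1 only shows linear redundancy with time; it does not by itself say which clock is the more accurate one.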

In [None]:
df_1_EYE.drop('ET_TimeSignal', axis=1, inplace=True)

In [None]:
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(df_1_EYE)
plt.show()

# **2_EYE**

In [None]:
df_2_EYE = pd.read_csv('data/STData/2/2_EYE.csv')

In [None]:
df_2_EYE.head()

In [None]:
df_2_EYE.shape

In [None]:
df_2_EYE.columns

In [None]:
df_2_EYE.info()

In [None]:
df_2_EYE.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_2_EYE.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
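
Because these first steps are the same for every participant file, they could be collected in a small helper and applied to each frame. The function name `basic_clean` and the toy frame are hypothetical; the steps mirror the parse, fill, drop sequence used in this notebook.

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the first cleaning steps to a copy of one participant's frame."""
    out = df.copy()
    out["Timestamp"] = pd.to_datetime(out["Timestamp"])
    out["QuestionKey"] = out["QuestionKey"].fillna("None")
    out = out.dropna()               # drop rows still missing any value
    return out.drop(columns="Row")   # `Row` is only a counter

# Invented sample frame with one null QuestionKey and one null reading.
toy = pd.DataFrame({
    "Row": [1, 2, 3],
    "Timestamp": ["2024-01-01 00:00:00", "2024-01-01 00:00:01", "2024-01-01 00:00:02"],
    "QuestionKey": [None, "Q1", "Q1"],
    "ET_PupilLeft": [3.1, None, 3.3],
})
cleaned = basic_clean(toy)
print(cleaned)
```

Working on a copy keeps the raw frame available for comparison, unlike the `inplace=True` calls.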

In [None]:
df_2_EYE['QuestionKey'].unique()

In [None]:
df_2_EYE['Timestamp'] = pd.to_datetime(df_2_EYE['Timestamp'])

In [None]:
df_2_EYE.head(3)

In [None]:
df_2_EYE['QuestionKey'] = df_2_EYE['QuestionKey'].fillna('None')

In [None]:
df_2_EYE['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_2_EYE.isnull(), cmap='viridis')
plt.show()

In [None]:
df_2_EYE.isnull().sum()

In [None]:
df_2_EYE.dropna(inplace=True)

In [None]:
df_2_EYE.head()

In [None]:
df_2_EYE['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_2_EYE['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_2_EYE.drop('Row', axis=1, inplace=True)

In [None]:
df_2_EYE['ET_ValidityLeft'].unique()

In [None]:
df_2_EYE['ET_ValidityLeft'].value_counts()

In [None]:
df_2_EYE['ET_ValidityRight'].unique()

In [None]:
df_2_EYE['ET_ValidityRight'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=df_2_EYE['ET_ValidityLeft'].value_counts().index, y=df_2_EYE['ET_ValidityLeft'].value_counts().values)
plt.title('Count of ET_ValidityLeft')
plt.xlabel('Validity')
plt.ylabel('Count')


plt.subplot(1, 2, 2)
sns.barplot(x=df_2_EYE['ET_ValidityRight'].value_counts().index, y=df_2_EYE['ET_ValidityRight'].value_counts().values)
plt.title('Count of ET_ValidityRight')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Notes & Observations

- The `ET_ValidityLeft` and `ET_ValidityRight` columns indicate the validity of the eye-tracking data for the left and right eye, respectively.
- Based on the value counts and the bar plots, it appears that a value of `0.0` represents valid eye-tracking data, while a value of `4.0` represents invalid data.
- Although the amount of invalid data is relatively small, removing these rows could introduce unwanted patterns or gaps in the time series data.
- Therefore, we will keep the data and replace the value `4.0` with `1.0` in both `ET_ValidityLeft` and `ET_ValidityRight` columns. This will indicate to a machine learning model that the eye tracker had invalid data at those specific points in time while maintaining the integrity of the time series.

Define a mapping to convert validity values from `0.0` and `4.0` to `0` and `1`.

In [None]:
validity_map = {4.0: 1.0, 0.0: 0.0}

In [None]:
df_2_EYE['ET_ValidityLeft'] = df_2_EYE['ET_ValidityLeft'].map(validity_map).astype(np.int8)
df_2_EYE['ET_ValidityRight'] = df_2_EYE['ET_ValidityRight'].map(validity_map).astype(np.int8)

In [None]:
df_2_EYE.head(3)

In [None]:
df_2_EYE.describe()

In [None]:
df_2_EYE[df_2_EYE['ET_ValidityLeft'] == 1].shape

In [None]:
df_2_EYE[df_2_EYE['ET_ValidityRight'] == 1].shape

In [None]:
df_2_EYE[df_2_EYE['ET_ValidityLeft'] == 1].shape[0] / df_2_EYE.shape[0]

In [None]:
df_2_EYE[df_2_EYE['ET_ValidityRight'] == 1].shape[0] / df_2_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_2_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_2_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
df_2_EYE[df_2_EYE['ET_PupilLeft'] == -1].shape

In [None]:
df_2_EYE[df_2_EYE['ET_PupilRight'] == -1].shape

In [None]:
df_2_EYE[df_2_EYE['ET_PupilLeft'] == -1].shape[0] / df_2_EYE.shape[0]

In [None]:
df_2_EYE[df_2_EYE['ET_PupilRight'] == -1].shape[0] / df_2_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_2_EYE[df_2_EYE['ET_ValidityLeft'] == 1] == -1, cmap='viridis')
plt.title('-1 Values Where ET_ValidityLeft == 1')

plt.subplot(1, 2, 2)
sns.heatmap(df_2_EYE[df_2_EYE['ET_ValidityRight'] == 1] == -1, cmap='viridis')
plt.title('-1 Values Where ET_ValidityRight == 1')

plt.tight_layout()
plt.show()

# Notes & Observations

- The heatmaps reveal the distribution of -1 values across different columns.
- It is evident that the `-1` values are not randomly scattered but appear in specific columns, notably `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY`.
- These `-1` values often coincide with instances where `ET_ValidityLeft` or `ET_ValidityRight` is 1, indicating invalid eye-tracking data. This suggests that `-1` is used as a placeholder for missing or invalid measurements in these columns when the eye tracker is not providing valid data for a particular eye.
- More than 70% of the values in `ET_PupilLeft` and `ET_PupilRight` are marked as invalid (`-1`). Rather than dropping these columns, we create a new indicator feature for each one that flags the rows containing invalid pupil data.
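
The share of `-1` placeholders per column can be tabulated in one line, which makes it easy to compare columns and participants. A sketch on invented values; `sentinel_share` is an illustrative name.

```python
import pandas as pd

# Invented sample with different amounts of -1 in each column.
toy = pd.DataFrame({
    "ET_GazeLeftx": [100.0, -1.0, 120.0, 130.0],
    "ET_PupilLeft": [-1.0, -1.0, -1.0, 3.2],
    "ET_ValidityLeft": [0, 1, 0, 0],
})

# Fraction of rows equal to -1 in each column, largest first.
sentinel_share = (toy == -1).mean().sort_values(ascending=False)
print(sentinel_share)
```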

In [None]:
pupil_validity = {-1: 1}

In [None]:
df_2_EYE['ET_PupilLeft_validity'] = df_2_EYE['ET_PupilLeft'].map(pupil_validity)

In [None]:
df_2_EYE['ET_PupilRight_validity'] = df_2_EYE['ET_PupilRight'].map(pupil_validity)

In [None]:
df_2_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
df_2_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull().sum()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_2_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_2_EYE['ET_PupilLeft_validity'] = df_2_EYE['ET_PupilLeft_validity'].fillna(0)

In [None]:
df_2_EYE['ET_PupilRight_validity'] = df_2_EYE['ET_PupilRight_validity'].fillna(0)

In [None]:
df_2_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_2_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_2_EYE.head()

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_2_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_2_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
valid_left_ratio  = 1 - df_2_EYE['ET_ValidityLeft'].mean()

In [None]:
valid_left_ratio

In [None]:
valid_right_ratio = 1 - df_2_EYE['ET_ValidityRight'].mean()

In [None]:
valid_right_ratio

In [None]:
df_2_EYE['ET_PupilLeft_validity'] = df_2_EYE['ET_PupilLeft_validity'].astype(np.int8)
df_2_EYE['ET_PupilRight_validity'] = df_2_EYE['ET_PupilRight_validity'].astype(np.int8)

# Feature Engineering and Observations

Based on the analysis of the data, we've created two new features, `ET_PupilLeft_validity` and `ET_PupilRight_validity`. These features indicate the validity of the pupil data for the left and right eyes, respectively, with a value of 1 representing invalid data (originally -1) and 0 representing valid data.

The heatmaps above visually demonstrate the distribution of -1 and 1 values across the dataset. We observed that:
- The `-1` values are concentrated in specific columns related to gaze, pupil size, distance, and camera position, suggesting they represent missing or invalid sensor readings.
- The `1` values, after mapping from `4.0` in the original validity columns, indicate instances of invalid eye-tracking data.
- The `-1` values in the pupil columns line up exactly with a value of 1 in the new pupil validity features. This is expected, because the features are derived directly from those `-1` values; the heatmaps mainly show how much of the pupil data is affected.

In [None]:
df_2_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding time-related columns
numeric_cols = df_2_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_2_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms

The grid of histograms provides insights into the distribution of values for each numeric column in the dataset (excluding 'UnixTime'). Key observations include:

- Several columns, such as `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, and `ET_GazeRighty`, show distributions that appear somewhat multimodal or skewed, suggesting variations in gaze patterns.
- The `ET_PupilLeft` and `ET_PupilRight` histograms clearly show a peak at -1, confirming the presence of a significant number of invalid pupil readings.
- `ET_TimeSignal` shows a relatively uniform distribution, as expected for a time-based signal.
- `ET_DistanceLeft` and `ET_DistanceRight` appear to have distributions centered around certain values, with some outliers or variations.
- The camera position columns (`ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`) seem to have distributions concentrated within specific ranges, reflecting the camera's field of view.
- The validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show distributions dominated by 0, indicating that most of the data is considered valid after the mapping. The smaller peaks at 1 represent the instances of invalid data.

These distributions highlight the need for appropriate handling of the -1 values and potential outliers in subsequent analysis or modeling steps.
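
To put a number on the peaks at -1 seen in the histograms, one option is to compute the share of rows equal to -1 in each column. This is a sketch on a small invented frame, not a result from the dataset; on the real data `toy` would be replaced by `df_2_EYE`.

```python
import pandas as pd

# Toy stand-in for the eye-tracking frame; values are invented.
toy = pd.DataFrame({
    "ET_GazeLeftx":    [512.0, -1.0, 530.0, 541.0],
    "ET_PupilLeft":    [-1.0, -1.0, 3.2, -1.0],
    "ET_ValidityLeft": [0, 1, 0, 0],
})

# Fraction of rows equal to the -1 sentinel, per column, largest first.
sentinel_share = (toy == -1).mean().sort_values(ascending=False)
print(sentinel_share)
```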

In [None]:
df_2_EYE.columns

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_2_EYE['Timestamp'], df_2_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots

The line plots showing various features against the `Timestamp` reveal the temporal patterns and fluctuations in the eye-tracking data. Key observations include:

- **Gaze Coordinates (`ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`):** These plots show the changes in gaze position over time. We can observe periods of relatively stable gaze interspersed with rapid movements (saccades) and blinks or other events where the gaze data might be invalid (-1 values appear as gaps or spikes if not handled).
- **Pupil Size (`ET_PupilLeft`, `ET_PupilRight`):** The pupil size plots show variations over time. The presence of many -1 values is evident as flat lines at the bottom of the plot, indicating periods where pupil data was not recorded or was invalid.
- **Time Signal (`ET_TimeSignal`):** This plot shows a steady, increasing trend, as expected for a time-based signal.
- **Distance and Camera Position (`ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`):** These plots show how the distance from the eye tracker and the camera positions change over time. Variations in these features can be related to head movements or changes in the user's position relative to the eye tracker.
- **Validity (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`):** These plots clearly show periods of invalid data (represented by 1) as spikes or plateaus, corresponding to instances where the eye tracker lost track of the eyes or the pupil data was marked as invalid.

Analyzing these time series plots is crucial for understanding the dynamics of the eye-tracking data and identifying patterns or anomalies that may require further investigation or specific handling during subsequent analysis.
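
The "spikes or plateaus" in the validity plots can be summarized as run lengths: how many consecutive samples each invalid stretch lasts. The following is a minimal sketch on an invented flag series; the run-labelling trick (`shift` plus `cumsum`) is a common pandas idiom, not something specific to this dataset.

```python
import pandas as pd

# Invented validity flag: 1 = invalid sample, 0 = valid sample.
flag = pd.Series([0, 1, 1, 0, 0, 1, 1, 1, 0], name="ET_ValidityLeft")

# A new run starts whenever the flag differs from the previous row.
run_id = (flag != flag.shift()).cumsum()

# Length (in samples) of each run of invalid data.
invalid = flag == 1
invalid_runs = flag[invalid].groupby(run_id[invalid]).size()
print(invalid_runs.tolist())
```

Short runs are more likely to be blinks, while long runs suggest the tracker lost the eye entirely; where to draw that line depends on the sampling rate, which is not stated here.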

In [None]:
# Select the numeric columns for the boxplot grid
numeric_cols = df_2_EYE.select_dtypes(include=np.number).columns

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(df_2_EYE[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

# Observations from Boxplots and Handling -1 Values

The boxplots provide a visual summary of the distribution and potential outliers for each numeric column. Key observations from the boxplots include:

- The boxplots for columns like `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY` clearly show the presence of -1 values as significant outliers, confirming our earlier observations from the heatmaps and histograms.
- The boxplots for the validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show the discrete nature of these features, with the majority of data points at 0 (valid) and a smaller number at 1 (invalid).

Given the significant presence of -1 values, which represent invalid or missing data, especially in the pupil-related columns, we have decided to replace these -1 values with NaN to properly represent them as missing data. Subsequently, we will impute these missing values using the mean of each respective column. This approach helps to retain the data structure and allows for further analysis or modeling without the distortion caused by the -1 placeholders.
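
As a hedged variation on the step described above, the replacement can be limited to the columns where -1 is known to be a placeholder, so that columns where -1 could be a legitimate value are left alone. The column list `sensor_cols` and the toy values below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the eye-tracking frame; values are invented.
toy = pd.DataFrame({
    "ET_PupilLeft":    [3.0, -1.0, 5.0, -1.0],
    "ET_GazeLeftx":    [100.0, 200.0, -1.0, 300.0],
    "ET_ValidityLeft": np.array([0, 1, 0, 1], dtype=np.int8),
})

# Hypothetical list of columns where -1 is a sentinel rather than a measurement.
sensor_cols = ["ET_PupilLeft", "ET_GazeLeftx"]

# Turn the sentinel into NaN only in those columns, then fill with each column's mean.
toy[sensor_cols] = toy[sensor_cols].replace(-1, np.nan)
toy[sensor_cols] = toy[sensor_cols].fillna(toy[sensor_cols].mean())
print(toy)
```

The mean is computed after the -1 values become NaN, so the placeholders do not pull the fill value downward.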

In [None]:
df_2_EYE.replace({-1: np.nan}, inplace=True)

In [None]:
df_2_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].mean()

In [None]:
df_2_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].median()

In [None]:
numeric_cols = df_2_EYE.select_dtypes(include=np.number).columns

for col in numeric_cols:
    df_2_EYE[col] = df_2_EYE[col].fillna(df_2_EYE[col].mean())

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_2_EYE.isnull(), cmap='viridis')
plt.title('Heatmap of Missing Values After Imputation')

plt.subplot(1, 2, 2)
sns.heatmap(df_2_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

# Handling Missing Values (Imputation)

As decided, we have replaced all the `-1` values with `NaN` to treat them as missing data. Subsequently, we have imputed these `NaN` values with the mean of their respective columns. The heatmap above, which was generated after the imputation, now shows no visible signs of `NaN` values, indicating that the imputation was successful.
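
A heatmap of a long recording can hide a handful of leftover rows, so a numeric check is a useful complement. This sketch counts remaining `NaN` values and remaining -1 values on an invented frame that stands in for the imputed data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the frame after imputation; values are invented.
toy = pd.DataFrame({
    "ET_PupilLeft": [3.0, 4.0, 5.0],
    "ET_GazeLeftx": [100.0, 200.0, 200.0],
    "QuestionKey":  ["None", "Q1", "Q1"],
})

# Total number of missing cells across the whole frame.
n_missing = int(toy.isnull().sum().sum())

# Total number of -1 placeholders left in the numeric columns.
numeric = toy.select_dtypes(include=np.number)
n_sentinel = int((numeric == -1).sum().sum())
print(n_missing, n_sentinel)
```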

In [None]:
df_2_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding time-related columns
numeric_cols = df_2_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_2_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms After Imputation

The histograms generated after replacing the -1 values with the mean of each column show the distributions of the numeric features with the missing data handled. Key observations from these updated histograms include:

- The distinct peaks at -1, which were prominent in the histograms for several columns (e.g., pupil size, gaze coordinates, distance, and camera position) before imputation, are now replaced by a peak at the mean of each respective column.
- The distributions in many columns now appear more unimodal or show shifted modes compared to the original histograms.
- The histograms for the validity columns still show their bimodal distributions with peaks at 0 and 1, as these were handled separately.

These histograms provide an updated view of the data's distribution after handling the missing values, highlighting the impact of the imputation method on the data's characteristics.
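
One concrete effect of mean imputation is that it leaves the column mean unchanged while shrinking the spread, which is why a new peak appears at the mean. The sketch below demonstrates this on an invented series; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Invented pupil readings where -1 marks an invalid sample.
raw = pd.Series([2.0, -1.0, 4.0, -1.0, 6.0, -1.0], name="ET_PupilLeft")

observed = raw.replace(-1, np.nan)          # valid readings only
imputed = observed.fillna(observed.mean())  # gaps filled with the mean

std_observed = observed.std()  # NaN values are skipped
std_imputed = imputed.std()
print(observed.mean(), imputed.mean(), std_observed, std_imputed)
```

The more values are imputed, the stronger this shrinkage becomes, which is worth keeping in mind for any later analysis that depends on variance.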

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_2_EYE['Timestamp'], df_2_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots After Imputation

The line plots generated after imputing the missing values with the mean show the temporal patterns of the features with the missing data handled. Key observations from these updated plots include:

- The gaps or flat lines at -1, which were prominent in the plots for columns like gaze coordinates, pupil size, distance, and camera position, are now filled by lines at the mean value of the respective columns.
- The plots for the validity columns remain the same as they were handled separately.
- The `ET_TimeSignal` plot still shows a steady increasing trend, as expected.

In [None]:
plt.figure(figsize=(16, 10))
sns.heatmap(df_2_EYE.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()

# Observations from Correlation Heatmap

The correlation heatmap provides a visual representation of the pairwise correlations between the numeric columns in the dataset. Key observations from the heatmap include:

- **High Positive Correlations:** We observe strong positive correlations (values close to 1) between:
  - `ET_GazeLeftx` and `ET_GazeRightx`: This is expected as the gaze positions of both eyes should be highly correlated when fixating on a point.
  - `ET_GazeLefty` and `ET_GazeRighty`: Similar to the x-coordinates, the y-coordinates of gaze should also be highly correlated.
  - `ET_PupilLeft` and `ET_PupilRight`: Pupil sizes of both eyes tend to change together in response to light and cognitive load.
  - `ET_DistanceLeft` and `ET_DistanceRight`: The distance from the eye tracker to each eye should be highly correlated.
  - `ET_CameraLeftX` and `ET_CameraRightX`, `ET_CameraLeftY` and `ET_CameraRightY`: The camera positions for both eyes are also expected to be highly correlated.
  - `UnixTime` and `ET_TimeSignal`: As previously noted, these two columns are almost perfectly linearly correlated, indicating redundancy.
  - `ET_ValidityLeft` and `ET_PupilLeft_validity`: There is a positive correlation, suggesting that when the overall left eye data is invalid, the left pupil data is also likely to be invalid.
  - `ET_ValidityRight` and `ET_PupilRight_validity`: Similar to the left eye, there is a positive correlation between the overall right eye validity and the right pupil validity.
- **Other Correlations:** We can also observe other varying degrees of correlations between different features, which can provide insights into the relationships between gaze behavior, pupil size, distance, and camera position. For example, there might be correlations between gaze coordinates and camera positions, reflecting head movements.
- **Low or Near-Zero Correlations:** Columns with low or near-zero correlations have little linear relationship with each other. This does not by itself rule out a nonlinear dependence.

Understanding these correlations matters for feature selection and modeling: highly correlated pairs can introduce multicollinearity, while correlations between different feature groups can reveal underlying patterns in the data.
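
With many columns, the strongly correlated pairs are easier to read from a list than from an annotated heatmap. The sketch below extracts them from the upper triangle of the correlation matrix; the toy data and the 0.9 cut-off are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the eye-tracking frame; values are invented.
toy = pd.DataFrame({
    "ET_GazeLeftx":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "ET_GazeRightx": [1.1, 2.0, 3.2, 3.9, 5.1],
    "ET_PupilLeft":  [3.0, 1.0, 4.0, 1.0, 5.0],
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Hypothetical threshold for calling a pair "strongly" correlated.
strong = pairs[pairs.abs() > 0.9].sort_values(ascending=False)
print(strong)
```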

# Analysis of ET_TimeSignal and Decision to Drop

As observed in the time series plot and confirmed by the correlation heatmap, the `ET_TimeSignal` column exhibits a near-perfect linear relationship with both the `Timestamp` and `UnixTime` columns. This strong correlation (close to 1) suggests that `ET_TimeSignal` is essentially redundant and likely represents another form of time recording or a signal directly derived from the timestamp.

Including highly correlated features like this in a dataset can lead to issues such as multicollinearity in some statistical models, which can make it difficult to interpret the individual impact of each feature. Since the `Timestamp` column already provides the necessary temporal information, retaining `ET_TimeSignal` does not appear to add significant value for further analysis or modeling in most cases.

Therefore, based on its high correlation and lack of unique insight, we will proceed to drop the `ET_TimeSignal` column to simplify the dataset and potentially improve the performance and interpretability of future analyses.
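
The decision above can be made explicit by checking the correlation before dropping the column. This is a sketch under stated assumptions: the toy values are invented and the 0.999 threshold is a hypothetical cut-off, not a value taken from the analysis.

```python
import pandas as pd

# Toy stand-in for the eye-tracking frame; values are invented.
toy = pd.DataFrame({
    "UnixTime":      [1000, 1010, 1020, 1030],
    "ET_TimeSignal": [5.0, 15.0, 25.0, 35.0],
    "ET_PupilLeft":  [3.0, 3.2, 2.9, 3.1],
})

# Pearson correlation between the two time-like columns.
r = toy["ET_TimeSignal"].corr(toy["UnixTime"])

# Hypothetical cut-off for treating the column as redundant.
THRESHOLD = 0.999
reduced = toy.drop(columns="ET_TimeSignal") if abs(r) > THRESHOLD else toy
print(list(reduced.columns))
```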

In [None]:
df_2_EYE.drop('ET_TimeSignal', axis=1, inplace=True)

In [None]:
# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df_2_EYE)
plt.show()

# **3_EYE**

In [None]:
df_3_EYE = pd.read_csv('data/STData/3/3_EYE.csv')

In [None]:
df_3_EYE.head()

In [None]:
df_3_EYE.shape

In [None]:
df_3_EYE.columns

In [None]:
df_3_EYE.info()

In [None]:
df_3_EYE.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_3_EYE.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
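
One way to test the interval hypothesis is to compare the number of null rows with the number of contiguous null blocks: if the nulls mark periods without a question, there should be few blocks relative to the number of null rows. The series below is invented for illustration.

```python
import pandas as pd

# Invented QuestionKey column: None marks rows where no question was displayed.
key = pd.Series([None, None, "Q1", "Q1", None, None, None, "Q2"], name="QuestionKey")

is_null = key.isnull()
n_null = int(is_null.sum())

# A block starts on a null row whose previous row was not null.
block_starts = is_null & ~is_null.shift(fill_value=False)
n_blocks = int(block_starts.sum())
print(n_null, n_blocks)
```

Randomly scattered nulls would give a block count close to the null count instead.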

In [None]:
df_3_EYE['QuestionKey'].unique()

In [None]:
df_3_EYE['Timestamp'] = pd.to_datetime(df_3_EYE['Timestamp'])

In [None]:
df_3_EYE.head(3)

In [None]:
df_3_EYE['QuestionKey'] = df_3_EYE['QuestionKey'].fillna('None')

In [None]:
df_3_EYE['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_3_EYE.isnull(), cmap='viridis')
plt.show()

In [None]:
df_3_EYE.isnull().sum()

In [None]:
df_3_EYE.dropna(inplace=True)

In [None]:
df_3_EYE.head()

In [None]:
df_3_EYE['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_3_EYE['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_3_EYE.drop('Row', axis=1, inplace=True)

In [None]:
df_3_EYE['ET_ValidityLeft'].unique()

In [None]:
df_3_EYE['ET_ValidityLeft'].value_counts()

In [None]:
df_3_EYE['ET_ValidityRight'].unique()

In [None]:
df_3_EYE['ET_ValidityRight'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=df_3_EYE['ET_ValidityLeft'].value_counts().index, y=df_3_EYE['ET_ValidityLeft'].value_counts().values)
plt.title('Count of ET_ValidityLeft')
plt.xlabel('Validity')
plt.ylabel('Count')


plt.subplot(1, 2, 2)
sns.barplot(x=df_3_EYE['ET_ValidityRight'].value_counts().index, y=df_3_EYE['ET_ValidityRight'].value_counts().values)
plt.title('Count of ET_ValidityRight')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Notes & Observations

- The `ET_ValidityLeft` and `ET_ValidityRight` columns indicate the validity of the eye-tracking data for the left and right eye, respectively.
- Based on the value counts and the bar plots, it appears that a value of `0.0` represents valid eye-tracking data, while a value of `4.0` represents invalid data.
- Although the amount of invalid data is relatively small, removing these rows could introduce unwanted patterns or gaps in the time series data.
- Therefore, we will keep the data and replace the value `4.0` with `1.0` in both `ET_ValidityLeft` and `ET_ValidityRight` columns. This will indicate to a machine learning model that the eye tracker had invalid data at those specific points in time while maintaining the integrity of the time series.

Define a mapping to convert validity values from `0.0` and `4.0` to `0` and `1`.

In [None]:
validity_map = {4.0: 1.0, 0.0: 0.0}

In [None]:
df_3_EYE['ET_ValidityLeft'] = df_3_EYE['ET_ValidityLeft'].map(validity_map).astype(np.int8)
df_3_EYE['ET_ValidityRight'] = df_3_EYE['ET_ValidityRight'].map(validity_map).astype(np.int8)
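
The mapping above assumes the validity columns contain only `0.0` and `4.0`. `Series.map` returns `NaN` for any value missing from the dictionary, and casting `NaN` to `np.int8` raises an error, so it can help to list unexpected codes first. In this sketch the value `2.0` is an invented example of such a code.

```python
import pandas as pd

validity_map = {4.0: 1.0, 0.0: 0.0}

# Invented column; 2.0 stands in for a code the mapping does not cover.
raw = pd.Series([0.0, 4.0, 0.0, 2.0], name="ET_ValidityLeft")

mapped = raw.map(validity_map)

# Values that were present in the raw data but have no entry in the mapping.
unmapped = raw[mapped.isna() & raw.notna()].unique()
print(unmapped)
```

An empty result means the cast to `np.int8` is safe.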

In [None]:
df_3_EYE.head(3)

In [None]:
df_3_EYE.describe()

In [None]:
df_3_EYE[df_3_EYE['ET_ValidityLeft'] == 1].shape

In [None]:
df_3_EYE[df_3_EYE['ET_ValidityRight'] == 1].shape

In [None]:
df_3_EYE[df_3_EYE['ET_ValidityLeft'] == 1].shape[0] / df_3_EYE.shape[0]

In [None]:
df_3_EYE[df_3_EYE['ET_ValidityRight'] == 1].shape[0] / df_3_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_3_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_3_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
df_3_EYE[df_3_EYE['ET_PupilLeft'] == -1].shape

In [None]:
df_3_EYE[df_3_EYE['ET_PupilRight'] == -1].shape

In [None]:
df_3_EYE[df_3_EYE['ET_PupilLeft'] == -1].shape[0] / df_3_EYE.shape[0]

In [None]:
df_3_EYE[df_3_EYE['ET_PupilRight'] == -1].shape[0] / df_3_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_3_EYE[df_3_EYE['ET_ValidityLeft'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityLeft == 1)')

plt.subplot(1, 2, 2)
sns.heatmap(df_3_EYE[df_3_EYE['ET_ValidityRight'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityRight == 1)')

plt.tight_layout()
plt.show()

# Notes & Observations

- The heatmaps reveal the distribution of -1 values across different columns.
- It is evident that the `-1` values are not randomly scattered but appear in specific columns, notably `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY`.
- These `-1` values often coincide with instances where `ET_ValidityLeft` or `ET_ValidityRight` is 1, indicating invalid eye-tracking data. This suggests that `-1` is used as a placeholder for missing or invalid measurements in these columns when the eye tracker is not providing valid data for a particular eye.
- Over 70% of the data in the `ET_PupilLeft` and `ET_PupilRight` columns is marked as invalid (-1). Instead of dropping these columns, we create a new feature for each of them that flags the rows containing invalid `ET_PupilLeft` or `ET_PupilRight` data.
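
As an alternative sketch to the map-then-fill steps that follow, the same flag can be built in one step by comparing the column with -1. The series below is invented and only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Invented pupil readings where -1 marks an invalid sample.
pupil = pd.Series([3.1, -1.0, -1.0, 2.8, -1.0], name="ET_PupilLeft")

# 1 where the reading is the -1 sentinel, 0 otherwise; no NaN step in between.
flag = (pupil == -1).astype(np.int8)

# Share of invalid pupil readings in this toy series.
invalid_share = flag.mean()
print(flag.tolist(), invalid_share)
```

Because the comparison never produces `NaN`, the later `fillna(0)` and `astype(np.int8)` steps are already covered.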

In [None]:
pupil_validity = {-1: 1 }

In [None]:
df_3_EYE['ET_PupilLeft_validity'] = df_3_EYE['ET_PupilLeft'].map(pupil_validity)

In [None]:
df_3_EYE['ET_PupilRight_validity'] = df_3_EYE['ET_PupilRight'].map(pupil_validity)

In [None]:
df_3_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
df_3_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull().sum()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_3_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_3_EYE['ET_PupilLeft_validity'] = df_3_EYE['ET_PupilLeft_validity'].fillna(0)

In [None]:
df_3_EYE['ET_PupilRight_validity'] = df_3_EYE['ET_PupilRight_validity'].fillna(0)

In [None]:
df_3_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_3_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_3_EYE.head()

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_3_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_3_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
valid_left_ratio  = 1 - df_3_EYE['ET_ValidityLeft'].mean()

In [None]:
valid_left_ratio

In [None]:
valid_right_ratio = 1 - df_3_EYE['ET_ValidityRight'].mean()

In [None]:
valid_right_ratio

In [None]:
df_3_EYE['ET_PupilLeft_validity'] = df_3_EYE['ET_PupilLeft_validity'].astype(np.int8)
df_3_EYE['ET_PupilRight_validity'] = df_3_EYE['ET_PupilRight_validity'].astype(np.int8)

# Feature Engineering and Observations

Based on the analysis of the data, we've created two new features, `ET_PupilLeft_validity` and `ET_PupilRight_validity`. These features indicate the validity of the pupil data for the left and right eyes, respectively, with a value of 1 representing invalid data (originally -1) and 0 representing valid data.

The heatmaps above visually demonstrate the distribution of -1 and 1 values across the dataset. We observed that:
- The `-1` values are concentrated in specific columns related to gaze, pupil size, distance, and camera position, suggesting they represent missing or invalid sensor readings.
- The `1` values, after mapping from `4.0` in the original validity columns, indicate instances of invalid eye-tracking data.
- The heatmaps also show a strong correlation between the `-1` values in the pupil columns and a validity of 1 in the newly created pupil validity features, confirming that -1 was used to mark invalid pupil data.

In [None]:
df_3_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding time-related columns
numeric_cols = df_3_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_3_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms

The grid of histograms provides insights into the distribution of values for each numeric column in the dataset (excluding 'UnixTime'). Key observations include:

- Several columns, such as `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, and `ET_GazeRighty`, show distributions that appear somewhat multimodal or skewed, suggesting variations in gaze patterns.
- The `ET_PupilLeft` and `ET_PupilRight` histograms clearly show a peak at -1, confirming the presence of a significant number of invalid pupil readings.
- `ET_TimeSignal` shows a relatively uniform distribution, as expected for a time-based signal.
- `ET_DistanceLeft` and `ET_DistanceRight` appear to have distributions centered around certain values, with some outliers or variations.
- The camera position columns (`ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`) seem to have distributions concentrated within specific ranges, reflecting the camera's field of view.
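- The eye validity columns (`ET_ValidityLeft`, `ET_ValidityRight`) show distributions dominated by 0, indicating that most samples are considered valid after the mapping; the smaller peaks at 1 are the invalid samples. The pupil validity flags (`ET_PupilLeft_validity`, `ET_PupilRight_validity`) are 1 wherever the pupil reading was -1, so their balance between 0 and 1 mirrors the share of invalid pupil readings noted earlier.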

These distributions highlight the need for appropriate handling of the -1 values and potential outliers in subsequent analysis or modeling steps.

In [None]:
df_3_EYE.columns

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_3_EYE['Timestamp'], df_3_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots

The line plots showing various features against the `Timestamp` reveal the temporal patterns and fluctuations in the eye-tracking data. Key observations include:

- **Gaze Coordinates (`ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`):** These plots show the changes in gaze position over time. We can observe periods of relatively stable gaze interspersed with rapid movements (saccades) and blinks or other events where the gaze data might be invalid (-1 values appear as gaps or spikes if not handled).
- **Pupil Size (`ET_PupilLeft`, `ET_PupilRight`):** The pupil size plots show variations over time. The presence of many -1 values is evident as flat lines at the bottom of the plot, indicating periods where pupil data was not recorded or was invalid.
- **Time Signal (`ET_TimeSignal`):** This plot shows a steady, increasing trend, as expected for a time-based signal.
- **Distance and Camera Position (`ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`):** These plots show how the distance from the eye tracker and the camera positions change over time. Variations in these features can be related to head movements or changes in the user's position relative to the eye tracker.
- **Validity (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`):** These plots clearly show periods of invalid data (represented by 1) as spikes or plateaus, corresponding to instances where the eye tracker lost track of the eyes or the pupil data was marked as invalid.

Analyzing these time series plots is crucial for understanding the dynamics of the eye-tracking data and identifying patterns or anomalies that may require further investigation or specific handling during subsequent analysis.

In [None]:
# Select the numeric columns for the boxplot grid
numeric_cols = df_3_EYE.select_dtypes(include=np.number).columns

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(df_3_EYE[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

# Observations from Boxplots and Handling -1 Values

The boxplots provide a visual summary of the distribution and potential outliers for each numeric column. Key observations from the boxplots include:

- The boxplots for columns like `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY` clearly show the presence of -1 values as significant outliers, confirming our earlier observations from the heatmaps and histograms.
- The boxplots for the validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show the discrete nature of these features, which take only the values 0 (valid) and 1 (invalid).

Given the significant presence of -1 values, which represent invalid or missing data, especially in the pupil-related columns, we have decided to replace these -1 values with NaN to properly represent them as missing data. Subsequently, we will impute these missing values using the mean of each respective column. This approach helps to retain the data structure and allows for further analysis or modeling without the distortion caused by the -1 placeholders.

In [None]:
df_3_EYE.replace({-1: np.nan}, inplace=True)

In [None]:
df_3_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].mean()

In [None]:
df_3_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].median()

In [None]:
numeric_cols = df_3_EYE.select_dtypes(include=np.number).columns

for col in numeric_cols:
    df_3_EYE[col] = df_3_EYE[col].fillna(df_3_EYE[col].mean())

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_3_EYE.isnull(), cmap='viridis')
plt.title('Heatmap of Missing Values After Imputation')

plt.subplot(1, 2, 2)
sns.heatmap(df_3_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

# Handling Missing Values (Imputation)

As decided, we have replaced all the `-1` values with `NaN` to treat them as missing data. Subsequently, we have imputed these `NaN` values with the mean of their respective columns. The heatmap above, which was generated after the imputation, now shows no visible signs of `NaN` values, indicating that the imputation was successful.

In [None]:
df_3_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding time-related columns
numeric_cols = df_3_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_3_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms After Imputation

The histograms generated after replacing the -1 values with the mean of each column show the distributions of the numeric features with the missing data handled. Key observations from these updated histograms include:

- The distinct peaks at -1, which were prominent in the histograms for several columns (e.g., pupil size, gaze coordinates, distance, and camera position) before imputation, are now replaced by a peak at the mean of each respective column.
- The distributions in many columns now appear more unimodal or show shifted modes compared to the original histograms.
- The histograms for the validity columns still show their bimodal distributions with peaks at 0 and 1, as these were handled separately.

These histograms provide an updated view of the data's distribution after handling the missing values, highlighting the impact of the imputation method on the data's characteristics.
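
The spike at the mean has a measurable side effect: mean imputation leaves the column mean unchanged but shrinks its standard deviation. A minimal sketch on invented numbers (the series `s` is hypothetical and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical pupil-size readings with two missing samples.
s = pd.Series([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])

mean_before = s.mean()  # NaN values are skipped by default
std_before = s.std()

# Fill the gaps with the mean of the valid readings.
s_imputed = s.fillna(mean_before)

mean_after = s_imputed.mean()
std_after = s_imputed.std()

print(f"mean: {mean_before:.2f} -> {mean_after:.2f}")
print(f"std:  {std_before:.2f} -> {std_after:.2f}")
```

The larger the share of imputed values in a column, the stronger this shrinkage is likely to be.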

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_3_EYE['Timestamp'], df_3_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots After Imputation

The line plots generated after imputing the missing values with the mean show the temporal patterns of the features with the missing data handled. Key observations from these updated plots include:

- The gaps or flat lines at -1, which were prominent in the plots for columns like gaze coordinates, pupil size, distance, and camera position, are now filled by lines at the mean value of the respective columns.
- The plots for the validity columns remain the same as they were handled separately.
- The `ET_TimeSignal` plot still shows a steady increasing trend, as expected.

In [None]:
plt.figure(figsize=(16, 10))
sns.heatmap(df_3_EYE.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()

# Observations from Correlation Heatmap

The correlation heatmap provides a visual representation of the pairwise correlations between the numeric columns in the dataset. Key observations from the heatmap include:

- **High Positive Correlations:** We observe strong positive correlations (values close to 1) between:
  - `ET_GazeLeftx` and `ET_GazeRightx`: This is expected as the gaze positions of both eyes should be highly correlated when fixating on a point.
  - `ET_GazeLefty` and `ET_GazeRighty`: Similar to the x-coordinates, the y-coordinates of gaze should also be highly correlated.
  - `ET_PupilLeft` and `ET_PupilRight`: Pupil sizes of both eyes tend to change together in response to light and cognitive load.
  - `ET_DistanceLeft` and `ET_DistanceRight`: The distance from the eye tracker to each eye should be highly correlated.
  - `ET_CameraLeftX` and `ET_CameraRightX`, `ET_CameraLeftY` and `ET_CameraRightY`: The camera positions for both eyes are also expected to be highly correlated.
  - `UnixTime` and `ET_TimeSignal`: As previously noted, these two columns are almost perfectly linearly correlated, indicating redundancy.
  - `ET_ValidityLeft` and `ET_PupilLeft_validity`: There is a positive correlation, suggesting that when the overall left eye data is invalid, the left pupil data is also likely to be invalid.
  - `ET_ValidityRight` and `ET_PupilRight_validity`: Similar to the left eye, there is a positive correlation between the overall right eye validity and the right pupil validity.
- **Other Correlations:** We can also observe other varying degrees of correlations between different features, which can provide insights into the relationships between gaze behavior, pupil size, distance, and camera position. For example, there might be correlations between gaze coordinates and camera positions, reflecting head movements.
- **Low or Near-Zero Correlations:** Columns with low or near-zero correlations have little linear relationship with each other, although this does not rule out nonlinear dependence.

Understanding these correlations is important for feature selection and for building models, as highly correlated features might indicate multicollinearity, while correlations between features can reveal underlying patterns in the data.
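
Reading pairs off an annotated heatmap gets error-prone as the number of columns grows. The sketch below lists the strongly correlated pairs directly from the correlation matrix. The helper name `strong_pairs`, the `0.9` threshold, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Invented data: right gaze tracks left gaze closely, pupil size is unrelated.
rng = np.random.default_rng(0)
left = rng.normal(size=200)
toy = pd.DataFrame({
    "ET_GazeLeftx": left,
    "ET_GazeRightx": left + rng.normal(scale=0.05, size=200),
    "ET_PupilLeft": rng.normal(size=200),
})

result = strong_pairs(toy)
print(result)
```

On the real data, the call would be `strong_pairs(df_3_EYE)`, and the threshold should be chosen to suit the analysis.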

# Analysis of ET_TimeSignal and Decision to Drop

As observed in the time series plot and confirmed by the correlation heatmap, the `ET_TimeSignal` column exhibits a near-perfect linear relationship with both the `Timestamp` and `UnixTime` columns. This strong correlation (close to 1) suggests that `ET_TimeSignal` is essentially redundant and likely represents another form of time recording or a signal directly derived from the timestamp.

Including highly correlated features like this in a dataset can lead to issues such as multicollinearity in some statistical models, which can make it difficult to interpret the individual impact of each feature. Since the `Timestamp` column already provides the necessary temporal information, retaining `ET_TimeSignal` does not appear to add significant value for further analysis or modeling in most cases.

Therefore, based on its high correlation and lack of unique insight, we will proceed to drop the `ET_TimeSignal` column to simplify the dataset and potentially improve the performance and interpretability of future analyses.

In [None]:
df_3_EYE.drop('ET_TimeSignal', axis=1, inplace=True)

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() call would only add an empty one
sns.pairplot(df_3_EYE)
plt.show()

# **4_EYE**

In [None]:
df_4_EYE = pd.read_csv('data/STData/4/4_EYE.csv')

In [None]:
df_4_EYE.head()

In [None]:
df_4_EYE.shape

In [None]:
df_4_EYE.columns

In [None]:
df_4_EYE.info()

In [None]:
df_4_EYE.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_4_EYE.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
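
One way to test the interval hypothesis is to measure the length of each contiguous run of nulls. A few long runs would support the "no question displayed" reading, while many runs of length one would point to sporadic dropout instead. The sketch below uses an invented `keys` series rather than the real `QuestionKey` column.

```python
import numpy as np
import pandas as pd

# Invented sequence: questions shown in blocks, with nothing on screen in between.
keys = pd.Series(["Q1", "Q1", np.nan, np.nan, np.nan, "Q2", "Q2", np.nan, "Q3"])

is_null = keys.isna()
# A new run starts whenever the null/non-null state changes.
run_id = is_null.ne(is_null.shift()).cumsum()

# Length of each contiguous run of nulls, in order of appearance.
null_run_lengths = is_null[is_null].groupby(run_id[is_null]).size().tolist()
print(null_run_lengths)
```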

In [None]:
df_4_EYE['QuestionKey'].unique()

In [None]:
df_4_EYE['Timestamp'] = pd.to_datetime(df_4_EYE['Timestamp'])

In [None]:
df_4_EYE.head(3)

In [None]:
df_4_EYE['QuestionKey'] = df_4_EYE['QuestionKey'].fillna('None')

In [None]:
df_4_EYE['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_4_EYE.isnull(), cmap='viridis')
plt.show()

In [None]:
df_4_EYE.isnull().sum()

In [None]:
df_4_EYE.dropna(inplace=True)

In [None]:
df_4_EYE.head()

In [None]:
df_4_EYE['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_4_EYE['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.
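
The claim that `Row` is only a counter can be checked before dropping it. The sketch below uses an invented `row` series and treats any gaps as the result of rows removed earlier, which is an assumption about how the data was cleaned.

```python
import pandas as pd

# Invented example: rows 3 and 4 were removed by an earlier dropna().
row = pd.Series([1, 2, 5, 6, 7], name="Row")

# A pure counter is unique and increasing; gaps only reflect removed rows.
is_counter = row.is_unique and row.is_monotonic_increasing

# Distribution of step sizes between consecutive rows.
steps = row.diff().dropna().value_counts().to_dict()
print(is_counter, steps)
```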

In [None]:
df_4_EYE.drop('Row', axis=1, inplace=True)

In [None]:
df_4_EYE['ET_ValidityLeft'].unique()

In [None]:
df_4_EYE['ET_ValidityLeft'].value_counts()

In [None]:
df_4_EYE['ET_ValidityRight'].unique()

In [None]:
df_4_EYE['ET_ValidityRight'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=df_4_EYE['ET_ValidityLeft'].value_counts().index, y=df_4_EYE['ET_ValidityLeft'].value_counts().values)
plt.title('Count of ET_ValidityLeft')
plt.xlabel('Validity')
plt.ylabel('Count')


plt.subplot(1, 2, 2)
sns.barplot(x=df_4_EYE['ET_ValidityRight'].value_counts().index, y=df_4_EYE['ET_ValidityRight'].value_counts().values)
plt.title('Count of ET_ValidityRight')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Notes & Observations

- The `ET_ValidityLeft` and `ET_ValidityRight` columns indicate the validity of the eye-tracking data for the left and right eye, respectively.
- Based on the value counts and the bar plots, it appears that a value of `0.0` represents valid eye-tracking data, while a value of `4.0` represents invalid data.
- Although the amount of invalid data is relatively small, removing these rows could introduce unwanted patterns or gaps in the time series data.
- Therefore, we will keep the data and replace the value `4.0` with `1.0` in both `ET_ValidityLeft` and `ET_ValidityRight` columns. This will indicate to a machine learning model that the eye tracker had invalid data at those specific points in time while maintaining the integrity of the time series.

Define a mapping to convert validity values from `0.0` and `4.0` to `0` and `1`.

In [None]:
validity_map = {4.0: 1.0, 0.0: 0.0}

In [None]:
df_4_EYE['ET_ValidityLeft'] = df_4_EYE['ET_ValidityLeft'].map(validity_map).astype(np.int8)
df_4_EYE['ET_ValidityRight'] = df_4_EYE['ET_ValidityRight'].map(validity_map).astype(np.int8)
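
One caveat with `Series.map`: any value missing from the dictionary becomes `NaN`, and the subsequent `astype(np.int8)` then raises an error that does not name the offending value. A defensive sketch, assuming `0.0` and `4.0` are the only codes expected (the helper name `encode_validity` is hypothetical):

```python
import numpy as np
import pandas as pd

validity_map = {0.0: 0, 4.0: 1}

def encode_validity(col: pd.Series) -> pd.Series:
    """Map raw validity codes to 0/1, failing loudly on codes outside the map."""
    unexpected = set(col.dropna().unique()) - set(validity_map)
    if unexpected:
        raise ValueError(f"Unmapped validity codes: {sorted(unexpected)}")
    # NaN values are not handled here and would still break the cast below.
    return col.map(validity_map).astype(np.int8)

# Invented readings for illustration.
raw = pd.Series([0.0, 4.0, 0.0, 0.0])
encoded = encode_validity(raw)
print(encoded.tolist())
```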

In [None]:
df_4_EYE.head(3)

In [None]:
df_4_EYE.describe()

In [None]:
df_4_EYE[df_4_EYE['ET_ValidityLeft'] == 1].shape

In [None]:
df_4_EYE[df_4_EYE['ET_ValidityRight'] == 1].shape

In [None]:
df_4_EYE[df_4_EYE['ET_ValidityLeft'] == 1].shape[0] / df_4_EYE.shape[0]

In [None]:
df_4_EYE[df_4_EYE['ET_ValidityRight'] == 1].shape[0] / df_4_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_4_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_4_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
df_4_EYE[df_4_EYE['ET_PupilLeft'] == -1].shape

In [None]:
df_4_EYE[df_4_EYE['ET_PupilRight'] == -1].shape

In [None]:
df_4_EYE[df_4_EYE['ET_PupilLeft'] == -1].shape[0] / df_4_EYE.shape[0]

In [None]:
df_4_EYE[df_4_EYE['ET_PupilRight'] == -1].shape[0] / df_4_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_4_EYE[df_4_EYE['ET_ValidityLeft'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityLeft == 1)')

plt.subplot(1, 2, 2)
sns.heatmap(df_4_EYE[df_4_EYE['ET_ValidityRight'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityRight == 1)')

plt.tight_layout()
plt.show()

# Notes & Observations

- The heatmaps reveal the distribution of -1 values across different columns.
- It is evident that the `-1` values are not randomly scattered but appear in specific columns, notably `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY`.
- These `-1` values often coincide with instances where `ET_ValidityLeft` or `ET_ValidityRight` is 1, indicating invalid eye-tracking data. This suggests that `-1` is used as a placeholder for missing or invalid measurements in these columns when the eye tracker is not providing valid data for a particular eye.
- More than 70% of the values in the `ET_PupilLeft` and `ET_PupilRight` columns are marked as invalid (`-1`). Instead of dropping these columns, we create a new indicator feature for each one that flags the rows where the pupil reading is invalid.
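
The indicator can also be built in a single step with a boolean comparison. This is an alternative to the map-then-fill approach used in this notebook, shown here on an invented frame:

```python
import numpy as np
import pandas as pd

# Invented pupil readings; -1 marks an invalid sample.
toy = pd.DataFrame({"ET_PupilLeft": [3.2, -1.0, 3.4, -1.0, 3.1]})

# 1 where the reading is the -1 placeholder, 0 otherwise.
toy["ET_PupilLeft_validity"] = toy["ET_PupilLeft"].eq(-1).astype(np.int8)

# The mean of a 0/1 flag is the share of invalid rows.
invalid_share = toy["ET_PupilLeft_validity"].mean()
print(toy["ET_PupilLeft_validity"].tolist(), invalid_share)
```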

In [None]:
pupil_validity = {-1: 1 }

In [None]:
df_4_EYE['ET_PupilLeft_validity'] = df_4_EYE['ET_PupilLeft'].map(pupil_validity)

In [None]:
df_4_EYE['ET_PupilRight_validity'] = df_4_EYE['ET_PupilRight'].map(pupil_validity)

In [None]:
df_4_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
df_4_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull().sum()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_4_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_4_EYE['ET_PupilLeft_validity'] = df_4_EYE['ET_PupilLeft_validity'].fillna(0)

In [None]:
df_4_EYE['ET_PupilRight_validity'] = df_4_EYE['ET_PupilRight_validity'].fillna(0)

In [None]:
df_4_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_4_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_4_EYE.head()

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_4_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_4_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
valid_left_ratio  = 1 - df_4_EYE['ET_ValidityLeft'].mean()

In [None]:
valid_left_ratio

In [None]:
valid_right_ratio = 1 - df_4_EYE['ET_ValidityRight'].mean()

In [None]:
valid_right_ratio

In [None]:
df_4_EYE['ET_PupilLeft_validity'] = df_4_EYE['ET_PupilLeft_validity'].astype(np.int8)
df_4_EYE['ET_PupilRight_validity'] = df_4_EYE['ET_PupilRight_validity'].astype(np.int8)

# Feature Engineering and Observations

Based on the analysis of the data, we've created two new features, `ET_PupilLeft_validity` and `ET_PupilRight_validity`. These features indicate the validity of the pupil data for the left and right eyes, respectively, with a value of 1 representing invalid data (originally -1) and 0 representing valid data.

The heatmaps above visually demonstrate the distribution of -1 and 1 values across the dataset. We observed that:
- The `-1` values are concentrated in specific columns related to gaze, pupil size, distance, and camera position, suggesting they represent missing or invalid sensor readings.
- The `1` values, after mapping from `4.0` in the original validity columns, indicate instances of invalid eye-tracking data.
- The heatmaps also show that the `-1` values in the pupil columns line up exactly with a value of 1 in the newly created pupil validity features. This is expected, because those features were derived directly from the `-1` placeholder.
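
The overlap between the eye-level validity flag and the pupil-level flag can be quantified with a contingency table rather than judged from the heatmaps. A minimal sketch on invented flags, where 1 means invalid in both columns:

```python
import pandas as pd

# Invented flags for six samples.
toy = pd.DataFrame({
    "ET_ValidityLeft":       [0, 0, 1, 1, 0, 1],
    "ET_PupilLeft_validity": [0, 1, 1, 1, 0, 1],
})

# Rows: overall eye validity; columns: pupil validity flag.
table = pd.crosstab(toy["ET_ValidityLeft"], toy["ET_PupilLeft_validity"])
print(table)

# Share of eye-invalid rows whose pupil reading is also invalid.
overlap = toy.loc[toy["ET_ValidityLeft"] == 1, "ET_PupilLeft_validity"].mean()
print(overlap)
```

The off-diagonal cells are the informative ones: they count rows where one flag reports invalid data and the other does not.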

In [None]:
df_4_EYE.head()

In [None]:
# Select the numeric columns for plotting histograms, excluding UnixTime
numeric_cols = df_4_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_4_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms

The grid of histograms provides insights into the distribution of values for each numeric column in the dataset (excluding 'UnixTime'). Key observations include:

- Several columns, such as `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, and `ET_GazeRighty`, show distributions that appear somewhat multimodal or skewed, suggesting variations in gaze patterns.
- The `ET_PupilLeft` and `ET_PupilRight` histograms clearly show a peak at -1, confirming the presence of a significant number of invalid pupil readings.
- `ET_TimeSignal` shows a relatively uniform distribution, as expected for a time-based signal.
- `ET_DistanceLeft` and `ET_DistanceRight` appear to have distributions centered around certain values, with some outliers or variations.
- The camera position columns (`ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`) seem to have distributions concentrated within specific ranges, reflecting the camera's field of view.
- The validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show distributions dominated by 0, indicating that most of the data is considered valid after the mapping. The smaller peaks at 1 represent the instances of invalid data.

These distributions highlight the need for appropriate handling of the -1 values and potential outliers in subsequent analysis or modeling steps.
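
The size of the `-1` peak in each histogram can be read off directly as a per-column share. The frame below is invented and covers only three of the sensor columns:

```python
import pandas as pd

# Invented sample of three sensor columns.
toy = pd.DataFrame({
    "ET_GazeLeftx": [512.0, -1.0, 530.0, 498.0],
    "ET_PupilLeft": [-1.0, -1.0, 3.3, -1.0],
    "ET_DistanceLeft": [600.0, 605.0, 598.0, 602.0],
})

# Fraction of rows equal to the -1 placeholder, per column, largest first.
placeholder_share = toy.eq(-1).mean().sort_values(ascending=False)
print(placeholder_share)
```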

In [None]:
df_4_EYE.columns

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_4_EYE['Timestamp'], df_4_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots

The line plots showing various features against the `Timestamp` reveal the temporal patterns and fluctuations in the eye-tracking data. Key observations include:

- **Gaze Coordinates (`ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`):** These plots show the changes in gaze position over time. We can observe periods of relatively stable gaze interspersed with rapid movements (saccades) and blinks or other events where the gaze data might be invalid (-1 values appear as gaps or spikes if not handled).
- **Pupil Size (`ET_PupilLeft`, `ET_PupilRight`):** The pupil size plots show variations over time. The presence of many -1 values is evident as flat lines at the bottom of the plot, indicating periods where pupil data was not recorded or was invalid.
- **Time Signal (`ET_TimeSignal`):** This plot shows a steady, increasing trend, as expected for a time-based signal.
- **Distance and Camera Position (`ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`):** These plots show how the distance from the eye tracker and the camera positions change over time. Variations in these features can be related to head movements or changes in the user's position relative to the eye tracker.
- **Validity (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`):** These plots clearly show periods of invalid data (represented by 1) as spikes or plateaus, corresponding to instances where the eye tracker lost track of the eyes or the pupil data was marked as invalid.

Analyzing these time series plots is crucial for understanding the dynamics of the eye-tracking data and identifying patterns or anomalies that may require further investigation or specific handling during subsequent analysis.

In [None]:
# Select the numeric columns for the boxplots
numeric_cols = df_4_EYE.select_dtypes(include=np.number).columns

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(x=df_4_EYE[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

# Observations from Boxplots and Handling -1 Values

The boxplots provide a visual summary of the distribution and potential outliers for each numeric column. Key observations from the boxplots include:

- The boxplots for columns like `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY` clearly show the presence of -1 values as significant outliers, confirming our earlier observations from the heatmaps and histograms.
- The boxplots for the validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show the discrete nature of these features, with the majority of data points at 0 (valid) and a smaller number at 1 (invalid).

Given the significant presence of -1 values, which represent invalid or missing data, especially in the pupil-related columns, we have decided to replace these -1 values with NaN to properly represent them as missing data. Subsequently, we will impute these missing values using the mean of each respective column. This approach helps to retain the data structure and allows for further analysis or modeling without the distortion caused by the -1 placeholders.
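
The order of the two steps matters. If the mean were computed while the `-1` placeholders were still present, it would be pulled toward `-1`. Converting to `NaN` first lets `mean()` skip those rows. A small sketch on an invented distance series:

```python
import numpy as np
import pandas as pd

# Invented distances; -1 marks an invalid sample.
dist = pd.Series([600.0, -1.0, 610.0, -1.0, 590.0], name="ET_DistanceLeft")

# Mean with the placeholders still in place is dragged toward -1.
biased_mean = dist.mean()

# Convert placeholders to NaN first, so mean() skips them.
clean = dist.replace(-1, np.nan)
valid_mean = clean.mean()

imputed = clean.fillna(valid_mean)
print(biased_mean, valid_mean, imputed.tolist())
```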

In [None]:
df_4_EYE.replace({-1: np.nan}, inplace=True)

In [None]:
df_4_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].mean()

In [None]:
df_4_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].median()

In [None]:
numeric_cols = df_4_EYE.select_dtypes(include=np.number).columns

for col in numeric_cols:
    df_4_EYE[col] = df_4_EYE[col].fillna(df_4_EYE[col].mean())

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_4_EYE.isnull(), cmap='viridis')
plt.title('Heatmap of Missing Values After Imputation')

plt.subplot(1, 2, 2)
sns.heatmap(df_4_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

# Handling Missing Values (Imputation)

As decided, we have replaced all the `-1` values with `NaN` to treat them as missing data. Subsequently, we have imputed these `NaN` values with the mean of their respective columns. The heatmap above, which was generated after the imputation, now shows no visible signs of `NaN` values, indicating that the imputation was successful.

In [None]:
df_4_EYE.head()

In [None]:
# Select the numeric columns for plotting histograms, excluding UnixTime
numeric_cols = df_4_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_4_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms After Imputation

The histograms generated after replacing the -1 values with the mean of each column show the distributions of the numeric features with the missing data handled. Key observations from these updated histograms include:

- The distinct peaks at -1, which were prominent in the histograms for several columns (e.g., pupil size, gaze coordinates, distance, and camera position) before imputation, are now replaced by a peak at the mean of each respective column.
- The distributions in many columns now appear more unimodal or show shifted modes compared to the original histograms.
- The histograms for the validity columns still show their bimodal distributions with peaks at 0 and 1, as these were handled separately.

These histograms provide an updated view of the data's distribution after handling the missing values, highlighting the impact of the imputation method on the data's characteristics.

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_4_EYE['Timestamp'], df_4_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots After Imputation

The line plots generated after imputing the missing values with the mean show the temporal patterns of the features with the missing data handled. Key observations from these updated plots include:

- The gaps or flat lines at -1, which were prominent in the plots for columns like gaze coordinates, pupil size, distance, and camera position, are now filled by lines at the mean value of the respective columns.
- The plots for the validity columns remain the same as they were handled separately.
- The `ET_TimeSignal` plot still shows a steady increasing trend, as expected.
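
For comparison only, the sketch below contrasts mean filling with time-based linear interpolation on a short gap. Interpolation is a different technique from the mean imputation used in this notebook and is not applied to the data here. The timestamps and values are invented.

```python
import numpy as np
import pandas as pd

# Invented pupil trace with a short gap, indexed by timestamp.
ts = pd.to_datetime([
    "2024-01-01 10:00:00",
    "2024-01-01 10:00:01",
    "2024-01-01 10:00:02",
    "2024-01-01 10:00:03",
    "2024-01-01 10:00:04",
])
pupil = pd.Series([3.0, np.nan, np.nan, 3.6, 3.8], index=ts)

# Mean filling produces a flat segment at the column mean.
mean_filled = pupil.fillna(pupil.mean())

# Time-based interpolation connects the valid neighbours of the gap.
interp_filled = pupil.interpolate(method="time")

print(mean_filled.round(2).tolist())
print(interp_filled.round(2).tolist())
```

Interpolation is only plausible for short gaps. Across long invalid stretches it would invent a trend that was never measured.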

In [None]:
plt.figure(figsize=(16, 10))
sns.heatmap(df_4_EYE.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()

# Observations from Correlation Heatmap

The correlation heatmap provides a visual representation of the pairwise correlations between the numeric columns in the dataset. Key observations from the heatmap include:

- **High Positive Correlations:** We observe strong positive correlations (values close to 1) between:
  - `ET_GazeLeftx` and `ET_GazeRightx`: This is expected as the gaze positions of both eyes should be highly correlated when fixating on a point.
  - `ET_GazeLefty` and `ET_GazeRighty`: Similar to the x-coordinates, the y-coordinates of gaze should also be highly correlated.
  - `ET_PupilLeft` and `ET_PupilRight`: Pupil sizes of both eyes tend to change together in response to light and cognitive load.
  - `ET_DistanceLeft` and `ET_DistanceRight`: The distance from the eye tracker to each eye should be highly correlated.
  - `ET_CameraLeftX` and `ET_CameraRightX`, `ET_CameraLeftY` and `ET_CameraRightY`: The camera positions for both eyes are also expected to be highly correlated.
  - `UnixTime` and `ET_TimeSignal`: As previously noted, these two columns are almost perfectly linearly correlated, indicating redundancy.
  - `ET_ValidityLeft` and `ET_PupilLeft_validity`: There is a positive correlation, suggesting that when the overall left eye data is invalid, the left pupil data is also likely to be invalid.
  - `ET_ValidityRight` and `ET_PupilRight_validity`: Similar to the left eye, there is a positive correlation between the overall right eye validity and the right pupil validity.
- **Other Correlations:** We can also observe other varying degrees of correlations between different features, which can provide insights into the relationships between gaze behavior, pupil size, distance, and camera position. For example, there might be correlations between gaze coordinates and camera positions, reflecting head movements.
- **Low or Near-Zero Correlations:** Columns with low or near-zero correlations have little linear relationship with each other, although this does not rule out nonlinear dependence.

Understanding these correlations is important for feature selection and for building models, as highly correlated features might indicate multicollinearity, while correlations between features can reveal underlying patterns in the data.

# Analysis of ET_TimeSignal and Decision to Drop

As observed in the time series plot and confirmed by the correlation heatmap, the `ET_TimeSignal` column exhibits a near-perfect linear relationship with both the `Timestamp` and `UnixTime` columns. This strong correlation (close to 1) suggests that `ET_TimeSignal` is essentially redundant and likely represents another form of time recording or a signal directly derived from the timestamp.

Including highly correlated features like this in a dataset can lead to issues such as multicollinearity in some statistical models, which can make it difficult to interpret the individual impact of each feature. Since the `Timestamp` column already provides the necessary temporal information, retaining `ET_TimeSignal` does not appear to add significant value for further analysis or modeling in most cases.

Therefore, based on its high correlation and lack of unique insight, we will proceed to drop the `ET_TimeSignal` column to simplify the dataset and potentially improve the performance and interpretability of future analyses.
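
The drop can be made conditional on the redundancy actually holding for the participant at hand. In the sketch below, the `0.999` cutoff and the toy values are assumptions, and `ET_TimeSignal` is constructed as an exact linear function of `UnixTime` purely for illustration.

```python
import pandas as pd

# Invented frame: ET_TimeSignal is a linear function of UnixTime.
toy = pd.DataFrame({"UnixTime": [1000, 1010, 1020, 1030, 1040]})
toy["ET_TimeSignal"] = 0.5 * toy["UnixTime"] + 3.0
toy["ET_PupilLeft"] = [3.1, 3.4, 3.0, 3.3, 3.2]

# Pearson correlation between the two time columns.
r = toy["UnixTime"].corr(toy["ET_TimeSignal"])

# Drop only when the redundancy is confirmed; 0.999 is an assumed cutoff.
if abs(r) > 0.999:
    toy = toy.drop(columns="ET_TimeSignal")

print(round(r, 6), list(toy.columns))
```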

In [None]:
df_4_EYE.drop('ET_TimeSignal', axis=1, inplace=True)

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() call would only add an empty one
sns.pairplot(df_4_EYE)
plt.show()

# **5_EYE**

In [None]:
df_5_EYE = pd.read_csv('data/STData/5/5_EYE.csv')

In [None]:
df_5_EYE.head()

In [None]:
df_5_EYE.shape

In [None]:
df_5_EYE.columns

In [None]:
df_5_EYE.info()

In [None]:
df_5_EYE.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_5_EYE.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.

In [None]:
df_5_EYE['QuestionKey'].unique()

In [None]:
df_5_EYE['Timestamp'] = pd.to_datetime(df_5_EYE['Timestamp'])

In [None]:
df_5_EYE.head(3)

In [None]:
df_5_EYE['QuestionKey'] = df_5_EYE['QuestionKey'].fillna('None')

In [None]:
df_5_EYE['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_5_EYE.isnull(), cmap='viridis')
plt.show()

In [None]:
df_5_EYE.isnull().sum()

In [None]:
df_5_EYE.dropna(inplace=True)

In [None]:
df_5_EYE.head()

In [None]:
df_5_EYE['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_5_EYE['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_5_EYE.drop('Row', axis=1, inplace=True)

In [None]:
df_5_EYE['ET_ValidityLeft'].unique()

In [None]:
df_5_EYE['ET_ValidityLeft'].value_counts()

In [None]:
df_5_EYE['ET_ValidityRight'].unique()

In [None]:
df_5_EYE['ET_ValidityRight'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=df_5_EYE['ET_ValidityLeft'].value_counts().index, y=df_5_EYE['ET_ValidityLeft'].value_counts().values)
plt.title('Count of ET_ValidityLeft')
plt.xlabel('Validity')
plt.ylabel('Count')


plt.subplot(1, 2, 2)
sns.barplot(x=df_5_EYE['ET_ValidityRight'].value_counts().index, y=df_5_EYE['ET_ValidityRight'].value_counts().values)
plt.title('Count of ET_ValidityRight')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Notes & Observations

- The `ET_ValidityLeft` and `ET_ValidityRight` columns indicate the validity of the eye-tracking data for the left and right eye, respectively.
- Based on the value counts and the bar plots, it appears that a value of `0.0` represents valid eye-tracking data, while a value of `4.0` represents invalid data.
- Although the amount of invalid data is relatively small, removing these rows could introduce unwanted patterns or gaps in the time series data.
- Therefore, we will keep the data and replace the value `4.0` with `1.0` in both `ET_ValidityLeft` and `ET_ValidityRight` columns. This will indicate to a machine learning model that the eye tracker had invalid data at those specific points in time while maintaining the integrity of the time series.
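
One practical caveat with this recoding: `Series.map` returns `NaN` for any value that is missing from the dictionary, and the subsequent cast to `np.int8` then fails. The sketch below, on made-up toy values, wraps the mapping in a hypothetical helper `recode_validity` (not part of the notebook) that reports unexpected codes explicitly. It assumes that `0.0` and `4.0` are the only codes present, as the value counts suggest.

```python
import numpy as np
import pandas as pd

def recode_validity(series: pd.Series) -> pd.Series:
    """Recode tracker validity codes (0.0 valid, 4.0 invalid) to 0/1 flags."""
    validity_map = {0.0: 0, 4.0: 1}
    mapped = series.map(validity_map)
    # Any code missing from the mapping becomes NaN; fail loudly instead of silently.
    unexpected = series[mapped.isna()].unique()
    if len(unexpected) > 0:
        raise ValueError(f"Unmapped validity codes: {unexpected.tolist()}")
    return mapped.astype(np.int8)

# Toy stand-in for df_5_EYE['ET_ValidityLeft'] (values are made up).
toy = pd.Series([0.0, 4.0, 0.0, 0.0, 4.0])
flags = recode_validity(toy)
```

If a third code ever shows up in another participant's file, the helper raises a `ValueError` naming it, which is easier to diagnose than a failed integer cast.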

Define a mapping to convert validity values from `0.0` and `4.0` to `0` and `1`.

In [None]:
validity_map = {4.0: 1.0, 0.0: 0.0}

In [None]:
df_5_EYE['ET_ValidityLeft'] = df_5_EYE['ET_ValidityLeft'].map(validity_map).astype(np.int8)
df_5_EYE['ET_ValidityRight'] = df_5_EYE['ET_ValidityRight'].map(validity_map).astype(np.int8)

In [None]:
df_5_EYE.head(3)

In [None]:
df_5_EYE.describe()

In [None]:
df_5_EYE[df_5_EYE['ET_ValidityLeft'] == 1].shape

In [None]:
df_5_EYE[df_5_EYE['ET_ValidityRight'] == 1].shape

In [None]:
df_5_EYE[df_5_EYE['ET_ValidityLeft'] == 1].shape[0] / df_5_EYE.shape[0]

In [None]:
df_5_EYE[df_5_EYE['ET_ValidityRight'] == 1].shape[0] / df_5_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_5_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_5_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
df_5_EYE[df_5_EYE['ET_PupilLeft'] == -1].shape

In [None]:
df_5_EYE[df_5_EYE['ET_PupilRight'] == -1].shape

In [None]:
df_5_EYE[df_5_EYE['ET_PupilLeft'] == -1].shape[0] / df_5_EYE.shape[0]

In [None]:
df_5_EYE[df_5_EYE['ET_PupilRight'] == -1].shape[0] / df_5_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_5_EYE[df_5_EYE['ET_ValidityLeft'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityLeft == 1)')

plt.subplot(1, 2, 2)
sns.heatmap(df_5_EYE[df_5_EYE['ET_ValidityRight'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityRight == 1)')

plt.tight_layout()
plt.show()

# Notes & Observations

- The heatmaps reveal the distribution of -1 values across different columns.
- It is evident that the `-1` values are not randomly scattered but appear in specific columns, notably `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY`.
- These `-1` values often coincide with instances where `ET_ValidityLeft` or `ET_ValidityRight` is 1, indicating invalid eye-tracking data. This suggests that `-1` is used as a placeholder for missing or invalid measurements in these columns when the eye tracker is not providing valid data for a particular eye.
- Over 70% of the values in the `ET_PupilLeft` and `ET_PupilRight` columns are marked as invalid (`-1`). Rather than dropping these columns or rows, we create a new feature for each of `ET_PupilLeft` and `ET_PupilRight` that flags the rows containing invalid pupil data.
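
The flag described in the last point can also be derived in a single step with a boolean comparison, which avoids the intermediate `NaN` values produced by a partial dictionary mapping. This is a minimal sketch on made-up toy values, not the notebook's own cells; the column names follow the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy pupil readings; -1 is the sentinel for "no valid measurement" (values are made up).
toy = pd.DataFrame({
    "ET_PupilLeft": [3.1, -1.0, 2.9, -1.0],
    "ET_PupilRight": [-1.0, -1.0, 3.0, 3.2],
})

for side in ("Left", "Right"):
    # 1 marks an invalid (sentinel) reading, 0 a usable one.
    toy[f"ET_Pupil{side}_validity"] = toy[f"ET_Pupil{side}"].eq(-1).astype(np.int8)

# Mean of a 0/1 flag is the share of invalid rows.
invalid_share = toy[["ET_PupilLeft_validity", "ET_PupilRight_validity"]].mean()
```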

In [None]:
pupil_validity = {-1: 1 }

In [None]:
df_5_EYE['ET_PupilLeft_validity'] = df_5_EYE['ET_PupilLeft'].map(pupil_validity)

In [None]:
df_5_EYE['ET_PupilRight_validity'] = df_5_EYE['ET_PupilRight'].map(pupil_validity)

In [None]:
df_5_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
df_5_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull().sum()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_5_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_5_EYE['ET_PupilLeft_validity'] = df_5_EYE['ET_PupilLeft_validity'].fillna(0)

In [None]:
df_5_EYE['ET_PupilRight_validity'] = df_5_EYE['ET_PupilRight_validity'].fillna(0)

In [None]:
df_5_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_5_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_5_EYE.head()

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_5_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_5_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
valid_left_ratio  = 1 - df_5_EYE['ET_ValidityLeft'].mean()

In [None]:
valid_left_ratio

In [None]:
valid_right_ratio = 1 - df_5_EYE['ET_ValidityRight'].mean()

In [None]:
valid_right_ratio

In [None]:
df_5_EYE['ET_PupilLeft_validity'] = df_5_EYE['ET_PupilLeft_validity'].astype(np.int8)
df_5_EYE['ET_PupilRight_validity'] = df_5_EYE['ET_PupilRight_validity'].astype(np.int8)

# Feature Engineering and Observations

Based on the analysis of the data, we've created two new features, `ET_PupilLeft_validity` and `ET_PupilRight_validity`. These features indicate the validity of the pupil data for the left and right eyes, respectively, with a value of 1 representing invalid data (originally -1) and 0 representing valid data.

The heatmaps above visually demonstrate the distribution of -1 and 1 values across the dataset. We observed that:
- The `-1` values are concentrated in specific columns related to gaze, pupil size, distance, and camera position, suggesting they represent missing or invalid sensor readings.
- The `1` values, after mapping from `4.0` in the original validity columns, indicate instances of invalid eye-tracking data.
- The `1` values in the newly created pupil validity features line up exactly with the `-1` values in the pupil columns. This is by construction, since each flag is derived from the `-1` placeholder, so it serves as a sanity check on the new features rather than as independent evidence.
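
Because the pupil validity features are derived directly from the `-1` placeholder, a more informative comparison is against the tracker-level validity flag. The sketch below uses a cross-tabulation on made-up toy flags to quantify how often the two agree; the numbers are illustrative only and say nothing about the real data.

```python
import pandas as pd

# Toy flags (made up): 1 = invalid, 0 = valid.
toy = pd.DataFrame({
    "ET_ValidityLeft":       [0, 0, 1, 1, 0, 0],
    "ET_PupilLeft_validity": [0, 1, 1, 1, 0, 1],
})

# Rows: tracker-level validity; columns: pupil-level flag.
table = pd.crosstab(toy["ET_ValidityLeft"], toy["ET_PupilLeft_validity"])

# Share of pupil-invalid samples that the tracker itself also marked invalid.
pupil_invalid = toy["ET_PupilLeft_validity"] == 1
share_explained = toy.loc[pupil_invalid, "ET_ValidityLeft"].mean()
```

A low `share_explained` would indicate that many pupil readings are missing even while the tracker reports the eye as valid, which is worth knowing before modeling.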

In [None]:
df_5_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding 'UnixTime'
numeric_cols = df_5_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_5_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms

The grid of histograms provides insights into the distribution of values for each numeric column in the dataset (excluding 'UnixTime'). Key observations include:

- Several columns, such as `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, and `ET_GazeRighty`, show distributions that appear somewhat multimodal or skewed, suggesting variations in gaze patterns.
- The `ET_PupilLeft` and `ET_PupilRight` histograms clearly show a peak at -1, confirming the presence of a significant number of invalid pupil readings.
- `ET_TimeSignal` shows a relatively uniform distribution, as expected for a time-based signal.
- `ET_DistanceLeft` and `ET_DistanceRight` appear to have distributions centered around certain values, with some outliers or variations.
- The camera position columns (`ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`) seem to have distributions concentrated within specific ranges, reflecting the camera's field of view.
- The validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show distributions dominated by 0, indicating that most of the data is considered valid after the mapping. The smaller peaks at 1 represent the instances of invalid data.

These distributions highlight the need for appropriate handling of the -1 values and potential outliers in subsequent analysis or modeling steps.
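
To put numbers on what the histograms show, the share of `-1` values per column can be computed directly, and summary statistics can be restricted to valid readings. This is a minimal sketch on a made-up toy frame; the column names mirror this notebook, the values do not.

```python
import pandas as pd

# Toy frame with made-up values; -1 stands in for invalid readings.
toy = pd.DataFrame({
    "ET_GazeLeftx": [512.0, -1.0, 530.0, 498.0],
    "ET_PupilLeft": [-1.0, -1.0, 3.0, -1.0],
    "ET_DistanceLeft": [600.0, 610.0, 605.0, 598.0],
})

# Fraction of rows equal to the sentinel, per column, largest first.
sentinel_share = toy.eq(-1).mean().sort_values(ascending=False)

# Summary statistics computed on valid readings only (sentinel masked out as NaN).
valid_stats = toy.mask(toy.eq(-1)).describe().loc[["count", "mean"]]
```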

In [None]:
df_5_EYE.columns

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_5_EYE['Timestamp'], df_5_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots

The line plots showing various features against the `Timestamp` reveal the temporal patterns and fluctuations in the eye-tracking data. Key observations include:

- **Gaze Coordinates (`ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`):** These plots show the changes in gaze position over time. We can observe periods of relatively stable gaze interspersed with rapid movements (saccades) and blinks or other events where the gaze data might be invalid (-1 values appear as gaps or spikes if not handled).
- **Pupil Size (`ET_PupilLeft`, `ET_PupilRight`):** The pupil size plots show variations over time. The presence of many -1 values is evident as flat lines at the bottom of the plot, indicating periods where pupil data was not recorded or was invalid.
- **Time Signal (`ET_TimeSignal`):** This plot shows a steady, increasing trend, as expected for a time-based signal.
- **Distance and Camera Position (`ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`):** These plots show how the distance from the eye tracker and the camera positions change over time. Variations in these features can be related to head movements or changes in the user's position relative to the eye tracker.
- **Validity (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`):** These plots clearly show periods of invalid data (represented by 1) as spikes or plateaus, corresponding to instances where the eye tracker lost track of the eyes or the pupil data was marked as invalid.

Analyzing these time series plots is crucial for understanding the dynamics of the eye-tracking data and identifying patterns or anomalies that may require further investigation or specific handling during subsequent analysis.
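
The spikes and plateaus in the validity plots can be turned into a table of invalid intervals by grouping consecutive samples with the same flag value. The sketch below assumes a 0/1 validity flag indexed by timestamp; the timestamps and values are made up for illustration.

```python
import pandas as pd

# Toy validity flag sampled at a fixed rate (made-up values): 1 = invalid.
ts = pd.date_range("2024-01-01", periods=8, freq="100ms")
flag = pd.Series([0, 1, 1, 0, 0, 1, 1, 1], index=ts, name="ET_ValidityLeft")

# A new run starts wherever the flag differs from the previous sample.
run_id = flag.ne(flag.shift()).cumsum()

# Keep only the invalid samples and summarise each run by start, end and length.
mask = flag.eq(1)
stamps = flag.index.to_series()[mask]
runs = (
    stamps.groupby(run_id[mask])
    .agg(start="min", end="max", n_samples="count")
    .reset_index(drop=True)
)
```

Short runs are plausibly blinks, while long runs point to tracking loss; the cut-off between the two is a modeling choice that the data alone does not fix.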

In [None]:
# Select the numeric columns for the boxplots
numeric_cols = df_5_EYE.select_dtypes(include=np.number).columns

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(x=df_5_EYE[col])  # pass the values as x so the x-axis label matches the data axis
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

# Observations from Boxplots and Handling -1 Values

The boxplots provide a visual summary of the distribution and potential outliers for each numeric column. Key observations from the boxplots include:

- The boxplots for columns like `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY` clearly show the presence of -1 values as significant outliers, confirming our earlier observations from the heatmaps and histograms.
- The boxplots for the validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show the discrete nature of these features, with the majority of data points at 0 (valid) and a smaller number at 1 (invalid).

Given the significant presence of -1 values, which represent invalid or missing data, especially in the pupil-related columns, we have decided to replace these -1 values with NaN to properly represent them as missing data. Subsequently, we will impute these missing values using the mean of each respective column. This approach helps to retain the data structure and allows for further analysis or modeling without the distortion caused by the -1 placeholders.
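
Note that `DataFrame.replace({-1: np.nan})` applies to every column of the frame. A more conservative variant, sketched below on made-up toy values, restricts the replacement and the mean imputation to the columns where `-1` is known to be a placeholder. The `sensor_cols` list is an assumption for this sketch and would need to match the real column set.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values); only the sensor columns use -1 as a sentinel.
toy = pd.DataFrame({
    "ET_PupilLeft": [3.0, -1.0, 5.0, -1.0],
    "ET_GazeLeftx": [100.0, 200.0, -1.0, 300.0],
    "ET_ValidityLeft": [0, 1, 0, 1],
})
sensor_cols = ["ET_PupilLeft", "ET_GazeLeftx"]  # assumed list for this sketch

# Turn the sentinel into NaN only where it is known to mean "missing".
toy[sensor_cols] = toy[sensor_cols].replace(-1, np.nan)

# Column means skip NaN by default, so they are based on valid readings only.
col_means = toy[sensor_cols].mean()
toy[sensor_cols] = toy[sensor_cols].fillna(col_means)
```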

In [None]:
df_5_EYE.replace({-1: np.nan}, inplace=True)

In [None]:
df_5_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].mean()

In [None]:
df_5_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].median()

In [None]:
numeric_cols = df_5_EYE.select_dtypes(include=np.number).columns

for col in numeric_cols:
    df_5_EYE[col] = df_5_EYE[col].fillna(df_5_EYE[col].mean())

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_5_EYE.isnull(), cmap='viridis')
plt.title('Heatmap of Missing Values After Imputation')

plt.subplot(1, 2, 2)
sns.heatmap(df_5_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

# Handling Missing Values (Imputation)

As decided, we have replaced all the `-1` values with `NaN` to treat them as missing data. Subsequently, we have imputed these `NaN` values with the mean of their respective columns. The heatmap above, which was generated after the imputation, now shows no visible signs of `NaN` values, indicating that the imputation was successful.
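
A heatmap can hide a handful of remaining cells in a long frame, so a numeric check is a useful complement. The sketch below counts leftover `NaN` and `-1` values on a made-up toy frame that stands in for the imputed data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the imputed data (made-up values).
toy = pd.DataFrame({
    "ET_PupilLeft": [3.0, 4.0, 5.0, 4.0],
    "ET_GazeLeftx": [100.0, 200.0, 200.0, 300.0],
    "QuestionKey": ["None", "Q1", "Q1", "None"],
})

numeric = toy.select_dtypes(include=np.number)

# Count remaining NaN and sentinel values; both totals should be zero.
remaining_nan = int(toy.isna().sum().sum())
remaining_sentinel = int(numeric.eq(-1).sum().sum())

assert remaining_nan == 0, "NaN values remain after imputation"
assert remaining_sentinel == 0, "-1 sentinels remain after imputation"
```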

In [None]:
df_5_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding 'UnixTime'
numeric_cols = df_5_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_5_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms After Imputation

The histograms generated after replacing the -1 values with the mean of each column show the distributions of the numeric features with the missing data handled. Key observations from these updated histograms include:

- The distinct peaks at -1, which were prominent in the histograms for several columns (e.g., pupil size, gaze coordinates, distance, and camera position) before imputation, are now replaced by a peak at the mean of each respective column.
- The distributions in many columns now appear more unimodal or show shifted modes compared to the original histograms.
- The histograms for the validity columns still show their bimodal distributions with peaks at 0 and 1, as these were handled separately.

These histograms provide an updated view of the data's distribution after handling the missing values, highlighting the impact of the imputation method on the data's characteristics.
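
One measurable side effect of mean imputation is that it preserves the column mean but shrinks the spread, because every imputed point sits exactly at the mean. The toy series below (made-up values) illustrates this; the effect grows with the share of imputed values.

```python
import numpy as np
import pandas as pd

# Toy pupil series (made-up values) where half of the readings are missing.
observed = pd.Series([2.0, 4.0, np.nan, np.nan, 6.0, np.nan])

imputed = observed.fillna(observed.mean())

# The mean is preserved, but the standard deviation drops because the
# imputed points add no deviation from the mean.
mean_before, mean_after = observed.mean(), imputed.mean()
std_before, std_after = observed.std(), imputed.std()
```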

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_5_EYE['Timestamp'], df_5_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots After Imputation

The line plots generated after imputing the missing values with the mean show the temporal patterns of the features with the missing data handled. Key observations from these updated plots include:

- The gaps or flat lines at -1, which were prominent in the plots for columns like gaze coordinates, pupil size, distance, and camera position, are now filled by lines at the mean value of the respective columns.
- The plots for the validity columns remain the same as they were handled separately.
- The `ET_TimeSignal` plot still shows a steady increasing trend, as expected.
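
For comparison only, and not as the method used in this notebook, time-based linear interpolation fills gaps from the neighbouring samples instead of a constant. The sketch below contrasts the two on a made-up toy trace; whether interpolation is appropriate depends on gap length, which is an open question for this dataset.

```python
import numpy as np
import pandas as pd

# Toy gaze trace on a regular time grid (made-up values); NaN marks gaps.
ts = pd.date_range("2024-01-01", periods=6, freq="100ms")
gaze = pd.Series([100.0, np.nan, np.nan, 130.0, np.nan, 150.0], index=ts)

# Mean imputation: every gap receives the same constant.
mean_filled = gaze.fillna(gaze.mean())

# Time-weighted linear interpolation: gaps follow the neighbouring samples.
interp_filled = gaze.interpolate(method="time")
```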

In [None]:
plt.figure(figsize=(16, 10))
sns.heatmap(df_5_EYE.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()

# Observations from Correlation Heatmap

The correlation heatmap provides a visual representation of the pairwise correlations between the numeric columns in the dataset. Key observations from the heatmap include:

- **High Positive Correlations:** We observe strong positive correlations (values close to 1) between:
  - `ET_GazeLeftx` and `ET_GazeRightx`: This is expected as the gaze positions of both eyes should be highly correlated when fixating on a point.
  - `ET_GazeLefty` and `ET_GazeRighty`: Similar to the x-coordinates, the y-coordinates of gaze should also be highly correlated.
  - `ET_PupilLeft` and `ET_PupilRight`: Pupil sizes of both eyes tend to change together in response to light and cognitive load.
  - `ET_DistanceLeft` and `ET_DistanceRight`: The distance from the eye tracker to each eye should be highly correlated.
  - `ET_CameraLeftX` and `ET_CameraRightX`, `ET_CameraLeftY` and `ET_CameraRightY`: The camera positions for both eyes are also expected to be highly correlated.
  - `UnixTime` and `ET_TimeSignal`: As previously noted, these two columns are almost perfectly linearly correlated, indicating redundancy.
  - `ET_ValidityLeft` and `ET_PupilLeft_validity`: There is a positive correlation, suggesting that when the overall left eye data is invalid, the left pupil data is also likely to be invalid.
  - `ET_ValidityRight` and `ET_PupilRight_validity`: Similar to the left eye, there is a positive correlation between the overall right eye validity and the right pupil validity.
- **Other Correlations:** We can also observe other varying degrees of correlations between different features, which can provide insights into the relationships between gaze behavior, pupil size, distance, and camera position. For example, there might be correlations between gaze coordinates and camera positions, reflecting head movements.
- **Low or Near-Zero Correlations:** Columns with low or near-zero correlations are relatively independent of each other.

Understanding these correlations is important for feature selection and for building models, as highly correlated features might indicate multicollinearity, while correlations between features can reveal underlying patterns in the data.
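
Reading pairs off an annotated heatmap becomes tedious with many columns. The sketch below lists the highly correlated pairs programmatically from the upper triangle of the correlation matrix. The toy values are made up, and the `threshold` of 0.9 is an assumed cut-off, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): the two gaze columns move together.
toy = pd.DataFrame({
    "ET_GazeLeftx":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "ET_GazeRightx": [1.1, 2.1, 2.9, 4.2, 5.0],
    "ET_DistanceLeft": [5.0, 3.0, 6.0, 2.0, 4.0],
})

corr = toy.corr(numeric_only=True)

# Keep the strict upper triangle so each pair is listed once, without the diagonal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()

threshold = 0.9  # assumed cut-off for "highly correlated"
high_pairs = pairs[pairs.abs() > threshold].sort_values(ascending=False)
```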

# Analysis of ET_TimeSignal and Decision to Drop

As observed in the time series plot and confirmed by the correlation heatmap, the `ET_TimeSignal` column exhibits a near-perfect linear relationship with both the `Timestamp` and `UnixTime` columns. This strong correlation (close to 1) suggests that `ET_TimeSignal` is essentially redundant and likely represents another form of time recording or a signal directly derived from the timestamp.

Including highly correlated features like this in a dataset can lead to issues such as multicollinearity in some statistical models, which can make it difficult to interpret the individual impact of each feature. Since the `Timestamp` column already provides the necessary temporal information, retaining `ET_TimeSignal` does not appear to add significant value for further analysis or modeling in most cases.

Therefore, based on its high correlation and lack of unique insight, we will proceed to drop the `ET_TimeSignal` column to simplify the dataset and potentially improve the performance and interpretability of future analyses.
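
The drop can be made conditional on an explicit redundancy check, so the same cell stays safe when applied to other participants' files. This is a minimal sketch on made-up toy values; `redundancy_threshold` is an assumed cut-off chosen for illustration.

```python
import pandas as pd

# Toy time columns (made-up values): ET_TimeSignal is an offset, rescaled clock.
toy = pd.DataFrame({
    "UnixTime": [1000.0, 1000.1, 1000.2, 1000.3, 1000.4],
    "ET_TimeSignal": [50.0, 150.0, 250.0, 350.0, 450.0],
    "ET_PupilLeft": [3.0, 3.4, 2.9, 3.1, 3.3],
})

r = toy["UnixTime"].corr(toy["ET_TimeSignal"])  # Pearson by default

redundancy_threshold = 0.999  # assumed cut-off for this sketch
if abs(r) > redundancy_threshold:
    toy = toy.drop(columns="ET_TimeSignal")
```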

In [None]:
df_5_EYE.drop('ET_TimeSignal', axis=1, inplace=True)

In [None]:
sns.pairplot(df_5_EYE)  # pairplot creates its own figure, so no plt.figure() call is needed
plt.show()

# **6_EYE**

In [None]:
df_6_EYE = pd.read_csv('data/STData/6/6_EYE.csv')

In [None]:
df_6_EYE.head()

In [None]:
df_6_EYE.shape

In [None]:
df_6_EYE.columns

In [None]:
df_6_EYE.info()

In [None]:
df_6_EYE.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_6_EYE.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
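
One way to check the interval pattern numerically is to count how many contiguous blocks the nulls form. The sketch below does this on a made-up toy column; the question labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy QuestionKey column (made-up labels); NaN marks rows without a question.
qk = pd.Series([np.nan, np.nan, "Q1", "Q1", np.nan, np.nan, np.nan, "Q2"],
               name="QuestionKey")

is_null = qk.isna()

# A block starts on a null row whose predecessor is not null.
block_starts = is_null & ~is_null.shift(fill_value=False)
n_null_rows = int(is_null.sum())
n_null_blocks = int(block_starts.sum())

# Few blocks relative to many null rows points to structured, interval-like gaps.
mean_block_length = n_null_rows / n_null_blocks
```

Randomly scattered nulls would give a mean block length close to 1, whereas nulls that mark "no question on screen" should form a small number of long blocks.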

In [None]:
df_6_EYE['QuestionKey'].unique()

In [None]:
df_6_EYE['Timestamp'] = pd.to_datetime(df_6_EYE['Timestamp'])

In [None]:
df_6_EYE.head(3)

In [None]:
df_6_EYE['QuestionKey'] = df_6_EYE['QuestionKey'].fillna('None')

In [None]:
df_6_EYE['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_6_EYE.isnull(), cmap='viridis')
plt.show()

In [None]:
df_6_EYE.isnull().sum()

In [None]:
df_6_EYE.dropna(inplace=True)

In [None]:
df_6_EYE.head()

In [None]:
df_6_EYE['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_6_EYE['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.
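
The claim that `Row` is a plain counter can be checked before dropping it. This is a minimal sketch on made-up toy values, assuming `Row` holds integers.

```python
import pandas as pd

# Toy Row column (made-up values).
row = pd.Series([3, 4, 5, 6, 7], name="Row")

# A plain counter is unique, increasing, and steps by one.
is_unique = row.is_unique
is_increasing = row.is_monotonic_increasing
steps_by_one = bool(row.diff().dropna().eq(1).all())

row_is_plain_index = is_unique and is_increasing and steps_by_one
```

On the real data, rows removed by the earlier `dropna` leave gaps in `Row`, so `steps_by_one` may be false even for a plain counter; uniqueness and monotonicity are the more robust checks.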

In [None]:
df_6_EYE.drop('Row', axis=1, inplace=True)

In [None]:
df_6_EYE['ET_ValidityLeft'].unique()

In [None]:
df_6_EYE['ET_ValidityLeft'].value_counts()

In [None]:
df_6_EYE['ET_ValidityRight'].unique()

In [None]:
df_6_EYE['ET_ValidityRight'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=df_6_EYE['ET_ValidityLeft'].value_counts().index, y=df_6_EYE['ET_ValidityLeft'].value_counts().values)
plt.title('Count of ET_ValidityLeft')
plt.xlabel('Validity')
plt.ylabel('Count')


plt.subplot(1, 2, 2)
sns.barplot(x=df_6_EYE['ET_ValidityRight'].value_counts().index, y=df_6_EYE['ET_ValidityRight'].value_counts().values)
plt.title('Count of ET_ValidityRight')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Notes & Observations

- The `ET_ValidityLeft` and `ET_ValidityRight` columns indicate the validity of the eye-tracking data for the left and right eye, respectively.
- Based on the value counts and the bar plots, it appears that a value of `0.0` represents valid eye-tracking data, while a value of `4.0` represents invalid data.
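- Although the amount of invalid data is relatively small, removing these rows could introduce unwanted patterns or gaps in the time series data.
- Therefore, we will keep the data and replace the value `4.0` with `1.0` in both `ET_ValidityLeft` and `ET_ValidityRight` columns. This will indicate to a machine learning model that the eye tracker had invalid data at those specific points in time while maintaining the integrity of the time series.
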
Define a mapping to convert validity values from `0.0` and `4.0` to `0` and `1`.

In [None]:
validity_map = {4.0: 1.0, 0.0: 0.0}

In [None]:
df_6_EYE['ET_ValidityLeft'] = df_6_EYE['ET_ValidityLeft'].map(validity_map).astype(np.int8)
df_6_EYE['ET_ValidityRight'] = df_6_EYE['ET_ValidityRight'].map(validity_map).astype(np.int8)

In [None]:
df_6_EYE.head(3)

In [None]:
df_6_EYE.describe()

In [None]:
df_6_EYE[df_6_EYE['ET_ValidityLeft'] == 1].shape

In [None]:
df_6_EYE[df_6_EYE['ET_ValidityRight'] == 1].shape

In [None]:
df_6_EYE[df_6_EYE['ET_ValidityLeft'] == 1].shape[0] / df_6_EYE.shape[0]

In [None]:
df_6_EYE[df_6_EYE['ET_ValidityRight'] == 1].shape[0] / df_6_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_6_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_6_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
df_6_EYE[df_6_EYE['ET_PupilLeft'] == -1].shape

In [None]:
df_6_EYE[df_6_EYE['ET_PupilRight'] == -1].shape

In [None]:
df_6_EYE[df_6_EYE['ET_PupilLeft'] == -1].shape[0] / df_6_EYE.shape[0]

In [None]:
df_6_EYE[df_6_EYE['ET_PupilRight'] == -1].shape[0] / df_6_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_6_EYE[df_6_EYE['ET_ValidityLeft'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityLeft == 1)')

plt.subplot(1, 2, 2)
sns.heatmap(df_6_EYE[df_6_EYE['ET_ValidityRight'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityRight == 1)')

plt.tight_layout()
plt.show()

# Notes & Observations

- The heatmaps reveal the distribution of -1 values across different columns.
- It is evident that the `-1` values are not randomly scattered but appear in specific columns, notably `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY`.
- These `-1` values often coincide with instances where `ET_ValidityLeft` or `ET_ValidityRight` is 1, indicating invalid eye-tracking data. This suggests that `-1` is used as a placeholder for missing or invalid measurements in these columns when the eye tracker is not providing valid data for a particular eye.
- Over 70% of the values in the `ET_PupilLeft` and `ET_PupilRight` columns are marked as invalid (`-1`). Rather than dropping these columns or rows, we create a new feature for each of `ET_PupilLeft` and `ET_PupilRight` that flags the rows containing invalid pupil data.

In [None]:
pupil_validity = {-1: 1 }

In [None]:
df_6_EYE['ET_PupilLeft_validity'] = df_6_EYE['ET_PupilLeft'].map(pupil_validity)

In [None]:
df_6_EYE['ET_PupilRight_validity'] = df_6_EYE['ET_PupilRight'].map(pupil_validity)

In [None]:
df_6_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
df_6_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull().sum()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_6_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_6_EYE['ET_PupilLeft_validity'] = df_6_EYE['ET_PupilLeft_validity'].fillna(0)

In [None]:
df_6_EYE['ET_PupilRight_validity'] = df_6_EYE['ET_PupilRight_validity'].fillna(0)

In [None]:
df_6_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_6_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_6_EYE.head()

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_6_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_6_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
valid_left_ratio  = 1 - df_6_EYE['ET_ValidityLeft'].mean()

In [None]:
valid_left_ratio

In [None]:
valid_right_ratio = 1 - df_6_EYE['ET_ValidityRight'].mean()

In [None]:
valid_right_ratio

In [None]:
df_6_EYE['ET_PupilLeft_validity'] = df_6_EYE['ET_PupilLeft_validity'].astype(np.int8)
df_6_EYE['ET_PupilRight_validity'] = df_6_EYE['ET_PupilRight_validity'].astype(np.int8)

# Feature Engineering and Observations

Based on the analysis of the data, we've created two new features, `ET_PupilLeft_validity` and `ET_PupilRight_validity`. These features indicate the validity of the pupil data for the left and right eyes, respectively, with a value of 1 representing invalid data (originally -1) and 0 representing valid data.

The heatmaps above visually demonstrate the distribution of -1 and 1 values across the dataset. We observed that:
- The `-1` values are concentrated in specific columns related to gaze, pupil size, distance, and camera position, suggesting they represent missing or invalid sensor readings.
- The `1` values, after mapping from `4.0` in the original validity columns, indicate instances of invalid eye-tracking data.
- The `1` values in the newly created pupil validity features line up exactly with the `-1` values in the pupil columns. This is by construction, since each flag is derived from the `-1` placeholder, so it serves as a sanity check on the new features rather than as independent evidence.
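
Since the same validity steps are applied to every participant's frame, they can be collected in one function and reused. The helper `add_validity_features` below is hypothetical and not part of this notebook. It assumes the tracker validity columns contain only `0.0` and `4.0`, and it treats any value other than `4.0` as valid.

```python
import numpy as np
import pandas as pd

def add_validity_features(df: pd.DataFrame) -> pd.DataFrame:
    """Recode tracker validity to 0/1 and add pupil validity flags (hypothetical helper)."""
    out = df.copy()  # leave the caller's frame untouched
    for side in ("Left", "Right"):
        # 4.0 -> 1 (invalid); anything else -> 0 (assumes only 0.0 and 4.0 occur).
        out[f"ET_Validity{side}"] = out[f"ET_Validity{side}"].eq(4.0).astype(np.int8)
        # -1 in the pupil column marks an invalid reading.
        out[f"ET_Pupil{side}_validity"] = out[f"ET_Pupil{side}"].eq(-1).astype(np.int8)
    return out

# Toy frames (made-up values) standing in for per-participant data.
toy_a = pd.DataFrame({
    "ET_ValidityLeft": [0.0, 4.0], "ET_ValidityRight": [0.0, 0.0],
    "ET_PupilLeft": [3.0, -1.0], "ET_PupilRight": [-1.0, 3.1],
})
toy_b = toy_a.iloc[::-1].reset_index(drop=True)

cleaned = {name: add_validity_features(frame)
           for name, frame in {"a": toy_a, "b": toy_b}.items()}
```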

In [None]:
df_6_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding 'UnixTime'
numeric_cols = df_6_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_6_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms

The grid of histograms provides insights into the distribution of values for each numeric column in the dataset (excluding 'UnixTime'). Key observations include:

- Several columns, such as `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, and `ET_GazeRighty`, show distributions that appear somewhat multimodal or skewed, suggesting variations in gaze patterns.
- The `ET_PupilLeft` and `ET_PupilRight` histograms clearly show a peak at -1, confirming the presence of a significant number of invalid pupil readings.
- `ET_TimeSignal` shows a relatively uniform distribution, as expected for a time-based signal.
- `ET_DistanceLeft` and `ET_DistanceRight` appear to have distributions centered around certain values, with some outliers or variations.
- The camera position columns (`ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`) seem to have distributions concentrated within specific ranges, reflecting the camera's field of view.
- The validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show distributions dominated by 0, indicating that most of the data is considered valid after the mapping. The smaller peaks at 1 represent the instances of invalid data.

These distributions highlight the need for appropriate handling of the -1 values and potential outliers in subsequent analysis or modeling steps.

In [None]:
df_6_EYE.columns

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_6_EYE['Timestamp'], df_6_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots

The line plots showing various features against the `Timestamp` reveal the temporal patterns and fluctuations in the eye-tracking data. Key observations include:

- **Gaze Coordinates (`ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`):** These plots show the changes in gaze position over time. We can observe periods of relatively stable gaze interspersed with rapid movements (saccades) and blinks or other events where the gaze data might be invalid (-1 values appear as gaps or spikes if not handled).
- **Pupil Size (`ET_PupilLeft`, `ET_PupilRight`):** The pupil size plots show variations over time. The presence of many -1 values is evident as flat lines at the bottom of the plot, indicating periods where pupil data was not recorded or was invalid.
- **Time Signal (`ET_TimeSignal`):** This plot shows a steady, increasing trend, as expected for a time-based signal.
- **Distance and Camera Position (`ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`):** These plots show how the distance from the eye tracker and the camera positions change over time. Variations in these features can be related to head movements or changes in the user's position relative to the eye tracker.
- **Validity (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`):** These plots clearly show periods of invalid data (represented by 1) as spikes or plateaus, corresponding to instances where the eye tracker lost track of the eyes or the pupil data was marked as invalid.

Analyzing these time series plots is crucial for understanding the dynamics of the eye-tracking data and identifying patterns or anomalies that may require further investigation or specific handling during subsequent analysis.

In [None]:
# Select all numeric columns for the boxplot grid
numeric_cols = df_6_EYE.select_dtypes(include=np.number).columns

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(x=df_6_EYE[col])  # pass the column explicitly as x so the box matches the x-axis label
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

# Observations from Boxplots and Handling -1 Values

The boxplots provide a visual summary of the distribution and potential outliers for each numeric column. Key observations from the boxplots include:

- The boxplots for columns like `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY` clearly show the presence of -1 values as significant outliers, confirming our earlier observations from the heatmaps and histograms.
- The boxplots for the validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show the discrete nature of these features, with the majority of data points at 0 (valid) and a smaller number at 1 (invalid).

Given the significant presence of -1 values, which represent invalid or missing data, especially in the pupil-related columns, we have decided to replace these -1 values with NaN to properly represent them as missing data. Subsequently, we will impute these missing values using the mean of each respective column. This approach helps to retain the data structure and allows for further analysis or modeling without the distortion caused by the -1 placeholders.
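
The replace-then-impute step described above can be sketched on a toy frame. This is a minimal illustration, not the notebook's exact procedure: the values are made up, and `sensor_cols` is a hypothetical list that restricts the replacement to columns where -1 is known to be a sentinel.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: -1 marks an invalid reading
toy = pd.DataFrame({
    "ET_PupilLeft": [3.0, -1.0, 5.0, -1.0],
    "ET_ValidityLeft": [0, 1, 0, 1],
})

# Hypothetical list of columns where -1 is a sentinel rather than a real value
sensor_cols = ["ET_PupilLeft"]

# Step 1: turn the sentinel into a proper missing value
toy[sensor_cols] = toy[sensor_cols].replace(-1, np.nan)

# Step 2: fill each column with the mean of its remaining (valid) readings
toy[sensor_cols] = toy[sensor_cols].fillna(toy[sensor_cols].mean())

print(toy)
```

Restricting the replacement to named columns is a design choice of this sketch; it avoids touching a column in which -1 could be a legitimate measurement.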

In [None]:
df_6_EYE.replace({-1: np.nan}, inplace=True)

In [None]:
df_6_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].mean()

In [None]:
df_6_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].median()

In [None]:
numeric_cols = df_6_EYE.select_dtypes(include=np.number).columns

for col in numeric_cols:
    df_6_EYE[col] = df_6_EYE[col].fillna(df_6_EYE[col].mean())

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_6_EYE.isnull(), cmap='viridis')
plt.title('Heatmap of Missing Values After Imputation')

plt.subplot(1, 2, 2)
sns.heatmap(df_6_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

# Handling Missing Values (Imputation)

As decided, we have replaced all the `-1` values with `NaN` to treat them as missing data. Subsequently, we have imputed these `NaN` values with the mean of their respective columns. The heatmap above, which was generated after the imputation, now shows no visible signs of `NaN` values, indicating that the imputation was successful.
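
A heatmap can hide a handful of missing cells in a long frame, so a numeric check is a useful complement. The sketch below uses a toy frame with made-up values; `remaining_nan` and `remaining_sentinel` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, including one NaN and one -1 sentinel
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [-1.0, 2.0, 2.0]})

# Same two steps as in the notebook: sentinel -> NaN, then mean imputation
toy = toy.replace(-1, np.nan)
toy = toy.fillna(toy.mean())

# Count what is left after imputation
remaining_nan = int(toy.isnull().sum().sum())
remaining_sentinel = int((toy == -1).sum().sum())

print(remaining_nan, remaining_sentinel)
```

Note that a column consisting entirely of missing values has no mean, so it would stay `NaN` after this kind of imputation; a nonzero `remaining_nan` would reveal that case.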

In [None]:
df_6_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding time-related columns
numeric_cols = df_6_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_6_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms After Imputation

The histograms generated after replacing the -1 values with the mean of each column show the distributions of the numeric features with the missing data handled. Key observations from these updated histograms include:

- The distinct peaks at -1, which were prominent in the histograms for several columns (e.g., pupil size, gaze coordinates, distance, and camera position) before imputation, are now replaced by a peak at the mean of each respective column.
- The distributions in many columns now appear more unimodal or show shifted modes compared to the original histograms.
- The histograms for the validity columns still show their bimodal distributions with peaks at 0 and 1, as these columns contained no -1 values and were therefore unaffected by the imputation.

These histograms provide an updated view of the data's distribution after handling the missing values, highlighting the impact of the imputation method on the data's characteristics.

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_6_EYE['Timestamp'], df_6_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots After Imputation

The line plots generated after imputing the missing values with the mean show the temporal patterns of the features with the missing data handled. Key observations from these updated plots include:

- The gaps or flat lines at -1, which were prominent in the plots for columns like gaze coordinates, pupil size, distance, and camera position, are now filled by lines at the mean value of the respective columns.
- The plots for the validity columns remain the same, as these columns contained no -1 values and were therefore unaffected by the imputation.
- The `ET_TimeSignal` plot still shows a steady increasing trend, as expected.

In [None]:
plt.figure(figsize=(16, 10))
sns.heatmap(df_6_EYE.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()

# Observations from Correlation Heatmap

The correlation heatmap provides a visual representation of the pairwise correlations between the numeric columns in the dataset. Key observations from the heatmap include:

- **High Positive Correlations:** We observe strong positive correlations (values close to 1) between:
  - `ET_GazeLeftx` and `ET_GazeRightx`: This is expected as the gaze positions of both eyes should be highly correlated when fixating on a point.
  - `ET_GazeLefty` and `ET_GazeRighty`: Similar to the x-coordinates, the y-coordinates of gaze should also be highly correlated.
  - `ET_PupilLeft` and `ET_PupilRight`: Pupil sizes of both eyes tend to change together in response to light and cognitive load.
  - `ET_DistanceLeft` and `ET_DistanceRight`: The distance from the eye tracker to each eye should be highly correlated.
  - `ET_CameraLeftX` and `ET_CameraRightX`, `ET_CameraLeftY` and `ET_CameraRightY`: The camera positions for both eyes are also expected to be highly correlated.
  - `UnixTime` and `ET_TimeSignal`: As previously noted, these two columns are almost perfectly linearly correlated, indicating redundancy.
  - `ET_ValidityLeft` and `ET_PupilLeft_validity`: There is a positive correlation, suggesting that when the overall left eye data is invalid, the left pupil data is also likely to be invalid.
  - `ET_ValidityRight` and `ET_PupilRight_validity`: Similar to the left eye, there is a positive correlation between the overall right eye validity and the right pupil validity.
- **Other Correlations:** We can also observe other varying degrees of correlations between different features, which can provide insights into the relationships between gaze behavior, pupil size, distance, and camera position. For example, there might be correlations between gaze coordinates and camera positions, reflecting head movements.
- **Low or Near-Zero Correlations:** Columns with low or near-zero correlations have little linear relationship with each other, although a nonlinear dependence is still possible.

Understanding these correlations is important for feature selection and for building models, as highly correlated features might indicate multicollinearity, while correlations between features can reveal underlying patterns in the data.
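
Reading strongly correlated pairs off an annotated heatmap is error-prone when there are many columns. As a sketch, the pairs can be listed programmatically. The data below is synthetic, the column names are placeholders, and the `threshold` of 0.9 is an arbitrary assumption rather than a value used in this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data: two nearly identical columns plus one unrelated column
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "left_x": base,
    "right_x": base + rng.normal(scale=0.05, size=200),
    "noise": rng.normal(size=200),
})

# Absolute pairwise Pearson correlations
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

threshold = 0.9  # assumed cutoff for "highly correlated"
high_pairs = pairs[pairs > threshold]

print(high_pairs)
```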

# Analysis of ET_TimeSignal and Decision to Drop

As observed in the time series plot against `Timestamp` and confirmed by the correlation heatmap against `UnixTime`, the `ET_TimeSignal` column exhibits a near-perfect linear relationship with both time columns. This strong correlation (close to 1) suggests that `ET_TimeSignal` is essentially redundant and likely represents another form of time recording or a signal directly derived from the timestamp.

Including highly correlated features like this in a dataset can lead to issues such as multicollinearity in some statistical models, which can make it difficult to interpret the individual impact of each feature. Since the `Timestamp` column already provides the necessary temporal information, retaining `ET_TimeSignal` does not appear to add significant value for further analysis or modeling in most cases.

Therefore, based on its high correlation and lack of unique insight, we will proceed to drop the `ET_TimeSignal` column to simplify the dataset and potentially improve the performance and interpretability of future analyses.
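
Before dropping a column as redundant, the claim can be checked numerically. The sketch below uses synthetic values: `unix_time` and `time_signal` are stand-ins for the real columns, and the 0.999 cutoff is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: a clock and a linear transform of the same clock
unix_time = pd.Series(np.arange(0, 1000, 10, dtype=float))
time_signal = 2.5 * unix_time + 7.0

# Pearson correlation between the two time columns
r = unix_time.corr(time_signal)

# Assumed cutoff for treating the second column as redundant
is_redundant = r > 0.999

print(r, is_redundant)
```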

In [None]:
df_6_EYE.drop('ET_TimeSignal', axis=1, inplace=True)

In [None]:
# pairplot builds its own figure grid, so a separate plt.figure() call would only create an empty figure
sns.pairplot(df_6_EYE)
plt.show()

# Summary and Decision for 6_EYE Data

Based on the analysis of the 6_EYE dataset, we observed a significant amount of invalid data, particularly in the eye-tracking validity and pupil size columns. The heatmaps and value counts clearly showed that in many instances, the eye tracker did not provide valid data for either the left or right eye, and a large proportion of the pupil data was marked with -1, indicating invalid measurements.

While we performed imputation to handle the -1 values and replace them with the mean, the high percentage of invalid data still raises concerns about the overall reliability and representativeness of this dataset for model building. Including data with such a high proportion of imputed values, even if based on the mean, could introduce bias and negatively impact the performance and generalizability of any models trained on this data.

Therefore, considering the extent of the invalid data and its potential impact on model quality, we have decided **not to use the 6_EYE dataset for model building**. We will proceed with analyzing and preparing the other datasets that exhibit a higher proportion of valid eye-tracking data.
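
A decision like this can be made reproducible by computing the share of invalid readings per column and comparing it with a cutoff. The following is only a sketch: the toy values are made up, `MAX_INVALID` is a hypothetical threshold that was not used in this analysis, and the check has to run before the -1 values are imputed.

```python
import pandas as pd

# Toy frame with made-up values: -1 marks an invalid reading
toy = pd.DataFrame({
    "ET_PupilLeft": [-1, -1, -1, 3.1],
    "ET_GazeLeftx": [510, -1, 498, 502],
})

# Fraction of sentinel values in each column
invalid_fraction = (toy == -1).mean()

MAX_INVALID = 0.5  # hypothetical cutoff for accepting a dataset

# The dataset is accepted only if every column stays under the cutoff
usable = bool((invalid_fraction <= MAX_INVALID).all())

print(invalid_fraction)
print(usable)
```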

# **7_EYE**

In [None]:
df_7_EYE = pd.read_csv('data/STData/7/7_EYE.csv')

In [None]:
df_7_EYE.head()

In [None]:
df_7_EYE.shape

In [None]:
df_7_EYE.columns

In [None]:
df_7_EYE.info()

In [None]:
df_7_EYE.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_7_EYE.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
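
To examine the interval pattern mentioned above, the lengths of consecutive null runs can be computed. This is a minimal sketch on a made-up `QuestionKey` series; the names `run_id` and `null_runs` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up QuestionKey values with two gaps
key = pd.Series(["Q1", "Q1", np.nan, np.nan, "Q2", np.nan, "Q2"])

is_null = key.isnull()

# A new run starts whenever the null flag differs from the previous row
run_id = (is_null != is_null.shift()).cumsum()

# Number of null rows per run; runs of non-null rows sum to 0 and are dropped
run_lengths = is_null.groupby(run_id).sum()
null_runs = run_lengths[run_lengths > 0].tolist()

print(null_runs)
```

Long, regularly spaced runs would support the reading that no question was displayed during those periods, while isolated single-row gaps would point to genuinely missing entries.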

In [None]:
df_7_EYE['QuestionKey'].unique()

In [None]:
df_7_EYE['Timestamp'] = pd.to_datetime(df_7_EYE['Timestamp'])

In [None]:
df_7_EYE.head(3)

In [None]:
df_7_EYE['QuestionKey'] = df_7_EYE['QuestionKey'].fillna('None')

In [None]:
df_7_EYE['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_7_EYE.isnull(), cmap='viridis')
plt.show()

In [None]:
df_7_EYE.isnull().sum()

In [None]:
df_7_EYE.dropna(inplace=True)

In [None]:
df_7_EYE.head()

In [None]:
df_7_EYE['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_7_EYE['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_7_EYE.drop('Row', axis=1, inplace=True)

In [None]:
df_7_EYE['ET_ValidityLeft'].unique()

In [None]:
df_7_EYE['ET_ValidityLeft'].value_counts()

In [None]:
df_7_EYE['ET_ValidityRight'].unique()

In [None]:
df_7_EYE['ET_ValidityRight'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
left_counts = df_7_EYE['ET_ValidityLeft'].value_counts()  # compute once, reuse for both axes
sns.barplot(x=left_counts.index, y=left_counts.values)
plt.title('Count of ET_ValidityLeft')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
right_counts = df_7_EYE['ET_ValidityRight'].value_counts()
sns.barplot(x=right_counts.index, y=right_counts.values)
plt.title('Count of ET_ValidityRight')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Notes & Observations

- The `ET_ValidityLeft` and `ET_ValidityRight` columns indicate the validity of the eye-tracking data for the left and right eye, respectively.
- Based on the value counts and the bar plots, it appears that a value of `0.0` represents valid eye-tracking data, while a value of `4.0` represents invalid data.
- Although the amount of invalid data is relatively small, removing these rows could introduce unwanted patterns or gaps in the time series data.
- Therefore, we will keep the data and replace the value `4.0` with `1.0` in both `ET_ValidityLeft` and `ET_ValidityRight` columns. This will indicate to a machine learning model that the eye tracker had invalid data at those specific points in time while maintaining the integrity of the time series.

Define a mapping to convert validity values from `0.0` and `4.0` to `0` and `1`.
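
One caveat with `Series.map` is that any code missing from the dictionary becomes `NaN`, which then makes the integer cast fail. The sketch below adds a guard for that case; the `raw` series is made up, and the guard is an addition of this example rather than part of the notebook's code.

```python
import numpy as np
import pandas as pd

# Made-up raw validity codes as they appear in the file
raw = pd.Series([0.0, 4.0, 0.0, 4.0], name="ET_ValidityLeft")

validity_map = {0.0: 0, 4.0: 1}

# Guard: stop early if the column holds a code the mapping does not cover
unexpected = set(raw.unique()) - set(validity_map)
if unexpected:
    raise ValueError(f"Unmapped validity codes: {unexpected}")

# Map to 0 (valid) / 1 (invalid) and store compactly
flag = raw.map(validity_map).astype(np.int8)

print(flag.tolist())
```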

In [None]:
validity_map = {4.0: 1.0, 0.0: 0.0}

In [None]:
df_7_EYE['ET_ValidityLeft'] = df_7_EYE['ET_ValidityLeft'].map(validity_map).astype(np.int8)
df_7_EYE['ET_ValidityRight'] = df_7_EYE['ET_ValidityRight'].map(validity_map).astype(np.int8)

In [None]:
df_7_EYE.head(3)

In [None]:
df_7_EYE.describe()

In [None]:
df_7_EYE[df_7_EYE['ET_ValidityLeft'] == 1].shape

In [None]:
df_7_EYE[df_7_EYE['ET_ValidityRight'] == 1].shape

In [None]:
df_7_EYE[df_7_EYE['ET_ValidityLeft'] == 1].shape[0] / df_7_EYE.shape[0]

In [None]:
df_7_EYE[df_7_EYE['ET_ValidityRight'] == 1].shape[0] / df_7_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_7_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_7_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
df_7_EYE[df_7_EYE['ET_PupilLeft'] == -1].shape

In [None]:
df_7_EYE[df_7_EYE['ET_PupilRight'] == -1].shape

In [None]:
df_7_EYE[df_7_EYE['ET_PupilLeft'] == -1].shape[0] / df_7_EYE.shape[0]

In [None]:
df_7_EYE[df_7_EYE['ET_PupilRight'] == -1].shape[0] / df_7_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_7_EYE[df_7_EYE['ET_ValidityLeft'] == 1] == -1, cmap='viridis')
plt.title('-1 Values in Rows Where ET_ValidityLeft == 1')

plt.subplot(1, 2, 2)
sns.heatmap(df_7_EYE[df_7_EYE['ET_ValidityRight'] == 1] == -1, cmap='viridis')
plt.title('-1 Values in Rows Where ET_ValidityRight == 1')

plt.tight_layout()
plt.show()

# Notes & Observations

- The heatmaps reveal the distribution of -1 values across different columns.
- It is evident that the `-1` values are not randomly scattered but appear in specific columns, notably `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY`.
- These `-1` values often coincide with instances where `ET_ValidityLeft` or `ET_ValidityRight` is 1, indicating invalid eye-tracking data. This suggests that `-1` is used as a placeholder for missing or invalid measurements in these columns when the eye tracker is not providing valid data for a particular eye.
- Over 70% of the values in the `ET_PupilLeft` and `ET_PupilRight` columns are marked as invalid (-1). Instead of dropping these columns, we can create a new feature for each of them that flags the rows containing invalid pupil data.
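
The invalid-pupil flag can also be derived in a single step with a boolean comparison. The sketch below, on a made-up series, shows that this gives the same result as mapping -1 to 1 and filling the rest with 0; it is an equivalent alternative, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up pupil readings: -1 marks an invalid measurement
pupil = pd.Series([3.2, -1.0, 2.9, -1.0], name="ET_PupilLeft")

# One-step flag: 1 where the reading is the -1 sentinel, 0 otherwise
pupil_invalid = (pupil == -1).astype(np.int8)

# Two-step variant: map the sentinel to 1, then fill unmatched rows with 0
via_map = pupil.map({-1.0: 1}).fillna(0).astype(np.int8)

print(pupil_invalid.tolist(), via_map.tolist())
```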

In [None]:
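# Flag invalid pupil readings: -1 maps to 1; all other values become NaN and are filled with 0 later
pupil_validity = {-1: 1}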

In [None]:
df_7_EYE['ET_PupilLeft_validity'] = df_7_EYE['ET_PupilLeft'].map(pupil_validity)

In [None]:
df_7_EYE['ET_PupilRight_validity'] = df_7_EYE['ET_PupilRight'].map(pupil_validity)

In [None]:
df_7_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
df_7_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull().sum()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_7_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_7_EYE['ET_PupilLeft_validity'] = df_7_EYE['ET_PupilLeft_validity'].fillna(0)

In [None]:
df_7_EYE['ET_PupilRight_validity'] = df_7_EYE['ET_PupilRight_validity'].fillna(0)

In [None]:
df_7_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_7_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_7_EYE.head()

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_7_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_7_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
valid_left_ratio  = 1 - df_7_EYE['ET_ValidityLeft'].mean()

In [None]:
valid_left_ratio

In [None]:
valid_right_ratio = 1 - df_7_EYE['ET_ValidityRight'].mean()

In [None]:
valid_right_ratio

In [None]:
df_7_EYE['ET_PupilLeft_validity'] = df_7_EYE['ET_PupilLeft_validity'].astype(np.int8)
df_7_EYE['ET_PupilRight_validity'] = df_7_EYE['ET_PupilRight_validity'].astype(np.int8)

# Feature Engineering and Observations

Based on the analysis of the data, we've created two new features, `ET_PupilLeft_validity` and `ET_PupilRight_validity`. These features indicate the validity of the pupil data for the left and right eyes, respectively, with a value of 1 representing invalid data (originally -1) and 0 representing valid data.

The heatmaps above visually demonstrate the distribution of -1 and 1 values across the dataset. We observed that:
- The `-1` values are concentrated in specific columns related to gaze, pupil size, distance, and camera position, suggesting they represent missing or invalid sensor readings.
- The `1` values, after mapping from `4.0` in the original validity columns, indicate instances of invalid eye-tracking data.
- The heatmaps also show that the `-1` values in the pupil columns line up exactly with a value of 1 in the newly created pupil validity features, which is expected because the flags were derived directly from those `-1` values.
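
How often an invalid eye reading coincides with an invalid pupil reading can be quantified with a contingency table instead of being read off a heatmap. This is a sketch on made-up flags; `share` is an illustrative name.

```python
import pandas as pd

# Made-up flags: 1 = invalid, 0 = valid
toy = pd.DataFrame({
    "ET_ValidityLeft":       [0, 0, 1, 1, 0, 1],
    "ET_PupilLeft_validity": [0, 1, 1, 1, 0, 1],
})

# Rows: eye validity flag, columns: pupil validity flag
table = pd.crosstab(toy["ET_ValidityLeft"], toy["ET_PupilLeft_validity"])

# Share of invalid-eye rows in which the pupil reading is invalid as well
share = table.loc[1, 1] / table.loc[1].sum()

print(table)
print(share)
```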

In [None]:
df_7_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding time-related columns
numeric_cols = df_7_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_7_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms

The grid of histograms provides insights into the distribution of values for each numeric column in the dataset (excluding 'UnixTime'). Key observations include:

- Several columns, such as `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, and `ET_GazeRighty`, show distributions that appear somewhat multimodal or skewed, suggesting variations in gaze patterns.
- The `ET_PupilLeft` and `ET_PupilRight` histograms clearly show a peak at -1, confirming the presence of a significant number of invalid pupil readings.
- `ET_TimeSignal` shows a relatively uniform distribution, as expected for a time-based signal.
- `ET_DistanceLeft` and `ET_DistanceRight` appear to have distributions centered around certain values, with some outliers or variations.
- The camera position columns (`ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`) seem to have distributions concentrated within specific ranges, reflecting the camera's field of view.
- The validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show distributions dominated by 0, indicating that most of the data is considered valid after the mapping. The smaller peaks at 1 represent the instances of invalid data.

These distributions highlight the need for appropriate handling of the -1 values and potential outliers in subsequent analysis or modeling steps.

In [None]:
df_7_EYE.columns

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_7_EYE['Timestamp'], df_7_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots

The line plots showing various features against the `Timestamp` reveal the temporal patterns and fluctuations in the eye-tracking data. Key observations include:

- **Gaze Coordinates (`ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`):** These plots show the changes in gaze position over time. We can observe periods of relatively stable gaze interspersed with rapid movements (saccades) and blinks or other events where the gaze data might be invalid (-1 values appear as gaps or spikes if not handled).
- **Pupil Size (`ET_PupilLeft`, `ET_PupilRight`):** The pupil size plots show variations over time. The presence of many -1 values is evident as flat lines at the bottom of the plot, indicating periods where pupil data was not recorded or was invalid.
- **Time Signal (`ET_TimeSignal`):** This plot shows a steady, increasing trend, as expected for a time-based signal.
- **Distance and Camera Position (`ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`):** These plots show how the distance from the eye tracker and the camera positions change over time. Variations in these features can be related to head movements or changes in the user's position relative to the eye tracker.
- **Validity (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`):** These plots clearly show periods of invalid data (represented by 1) as spikes or plateaus, corresponding to instances where the eye tracker lost track of the eyes or the pupil data was marked as invalid.

Analyzing these time series plots is crucial for understanding the dynamics of the eye-tracking data and identifying patterns or anomalies that may require further investigation or specific handling during subsequent analysis.
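
Periods of invalid data can also be summarised numerically by resampling a validity flag over time. The sketch below assumes a `Timestamp` column of datetime type and uses made-up values; the one-second window is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up samples taken every 250 ms
ts = pd.date_range("2024-01-01", periods=8, freq="250ms")
toy = pd.DataFrame({
    "Timestamp": ts,
    "ET_ValidityLeft": [0, 0, 1, 1, 1, 1, 1, 1],  # 1 = invalid
})

# Mean of a 0/1 flag per window equals the share of invalid samples in that window
invalid_share = (
    toy.set_index("Timestamp")["ET_ValidityLeft"]
    .resample("1s")
    .mean()
)

print(invalid_share)
```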

In [None]:
# Select all numeric columns for the boxplot grid
numeric_cols = df_7_EYE.select_dtypes(include=np.number).columns

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(x=df_7_EYE[col])  # pass the column explicitly as x so the box matches the x-axis label
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

# Observations from Boxplots and Handling -1 Values

The boxplots provide a visual summary of the distribution and potential outliers for each numeric column. Key observations from the boxplots include:

- The boxplots for columns like `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY` clearly show the presence of -1 values as significant outliers, confirming our earlier observations from the heatmaps and histograms.
- The boxplots for the validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show the discrete nature of these features, with the majority of data points at 0 (valid) and a smaller number at 1 (invalid).

Given the significant presence of -1 values, which represent invalid or missing data, especially in the pupil-related columns, we have decided to replace these -1 values with NaN to properly represent them as missing data. Subsequently, we will impute these missing values using the mean of each respective column. This approach helps to retain the data structure and allows for further analysis or modeling without the distortion caused by the -1 placeholders.

In [None]:
df_7_EYE.replace({-1: np.nan}, inplace=True)

In [None]:
df_7_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].mean()

In [None]:
df_7_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].median()

In [None]:
numeric_cols = df_7_EYE.select_dtypes(include=np.number).columns

for col in numeric_cols:
    df_7_EYE[col] = df_7_EYE[col].fillna(df_7_EYE[col].mean())

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_7_EYE.isnull(), cmap='viridis')
plt.title('Heatmap of Missing Values After Imputation')

plt.subplot(1, 2, 2)
sns.heatmap(df_7_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

# Handling Missing Values (Imputation)

As decided, we have replaced all the `-1` values with `NaN` to treat them as missing data. Subsequently, we have imputed these `NaN` values with the mean of their respective columns. The heatmap above, which was generated after the imputation, now shows no visible signs of `NaN` values, indicating that the imputation was successful.

In [None]:
df_7_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding time-related columns
numeric_cols = df_7_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_7_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms After Imputation

The histograms generated after replacing the -1 values with the mean of each column show the distributions of the numeric features with the missing data handled. Key observations from these updated histograms include:

- The distinct peaks at -1, which were prominent in the histograms for several columns (e.g., pupil size, gaze coordinates, distance, and camera position) before imputation, are now replaced by a peak at the mean of each respective column.
- The distributions in many columns now appear more unimodal or show shifted modes compared to the original histograms.
- The histograms for the validity columns still show their bimodal distributions with peaks at 0 and 1, as these columns contained no -1 values and were therefore unaffected by the imputation.

These histograms provide an updated view of the data's distribution after handling the missing values, highlighting the impact of the imputation method on the data's characteristics.
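
The effect of mean imputation on a column's distribution can be made concrete with a small example. In this sketch with made-up values, the mean is unchanged after filling while the standard deviation shrinks, which corresponds to the extra peak at the mean seen in the histograms.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing readings
s = pd.Series([2.0, np.nan, 4.0, np.nan, 6.0])

# Mean imputation: every gap receives the mean of the observed values
filled = s.fillna(s.mean())

mean_before, mean_after = s.mean(), filled.mean()
std_before, std_after = s.std(), filled.std()

print(mean_before, mean_after)
print(std_before, std_after)
```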

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_7_EYE['Timestamp'], df_7_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots After Imputation

The line plots generated after imputing the missing values with the mean show the temporal patterns of the features with the missing data handled. Key observations from these updated plots include:

- The gaps or flat lines at -1, which were prominent in the plots for columns like gaze coordinates, pupil size, distance, and camera position, are now filled by lines at the mean value of the respective columns.
- The plots for the validity columns remain the same as they were handled separately.
- The `ET_TimeSignal` plot still shows a steady increasing trend, as expected.

In [None]:
plt.figure(figsize=(16, 10))
sns.heatmap(df_7_EYE.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()

# Observations from Correlation Heatmap

The correlation heatmap provides a visual representation of the pairwise correlations between the numeric columns in the dataset. Key observations from the heatmap include:

- **High Positive Correlations:** We observe strong positive correlations (values close to 1) between:
  - `ET_GazeLeftx` and `ET_GazeRightx`: This is expected as the gaze positions of both eyes should be highly correlated when fixating on a point.
  - `ET_GazeLefty` and `ET_GazeRighty`: Similar to the x-coordinates, the y-coordinates of gaze should also be highly correlated.
  - `ET_PupilLeft` and `ET_PupilRight`: Pupil sizes of both eyes tend to change together in response to light and cognitive load.
  - `ET_DistanceLeft` and `ET_DistanceRight`: The distance from the eye tracker to each eye should be highly correlated.
  - `ET_CameraLeftX` and `ET_CameraRightX`, `ET_CameraLeftY` and `ET_CameraRightY`: The camera positions for both eyes are also expected to be highly correlated.
  - `UnixTime` and `ET_TimeSignal`: As previously noted, these two columns are almost perfectly linearly correlated, indicating redundancy.
  - `ET_ValidityLeft` and `ET_PupilLeft_validity`: There is a positive correlation, suggesting that when the overall left eye data is invalid, the left pupil data is also likely to be invalid.
  - `ET_ValidityRight` and `ET_PupilRight_validity`: Similar to the left eye, there is a positive correlation between the overall right eye validity and the right pupil validity.
- **Other Correlations:** We can also observe other varying degrees of correlations between different features, which can provide insights into the relationships between gaze behavior, pupil size, distance, and camera position. For example, there might be correlations between gaze coordinates and camera positions, reflecting head movements.
- **Low or Near-Zero Correlations:** Columns with low or near-zero correlations are relatively independent of each other.

Understanding these correlations is important for feature selection and for building models, as highly correlated features might indicate multicollinearity, while correlations between features can reveal underlying patterns in the data.
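
Reading strongly correlated pairs off an annotated heatmap can be error-prone when there are many columns. As a minimal sketch, the pairs can be listed programmatically. The helper name `high_corr_pairs`, the threshold of 0.9, and the toy frame are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame with made-up values: 'b' is a scaled copy of 'a', 'c' is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Applied to the eye-tracking frame, the same helper would be called with the DataFrame in place of `toy`.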

# Analysis of ET_TimeSignal and Decision to Drop

As observed in the time series plot and confirmed by the correlation heatmap, the `ET_TimeSignal` column exhibits a near-perfect linear relationship with both the `Timestamp` and `UnixTime` columns. This strong correlation (close to 1) suggests that `ET_TimeSignal` is essentially redundant and likely represents another form of time recording or a signal directly derived from the timestamp.

Including highly correlated features like this in a dataset can lead to issues such as multicollinearity in some statistical models, which can make it difficult to interpret the individual impact of each feature. Since the `Timestamp` column already provides the necessary temporal information, retaining `ET_TimeSignal` does not appear to add significant value for further analysis or modeling in most cases.

Therefore, based on its high correlation and lack of unique insight, we will proceed to drop the `ET_TimeSignal` column to simplify the dataset and potentially improve the performance and interpretability of future analyses.

In [None]:
df_7_EYE.drop('ET_TimeSignal', axis=1, inplace=True)

In [None]:
# sns.pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df_7_EYE)
plt.show()

# **8_EYE**

In [None]:
df_8_EYE = pd.read_csv('data/STData/8/8_EYE.csv')

In [None]:
df_8_EYE.head()

In [None]:
df_8_EYE.shape

In [None]:
df_8_EYE.columns

In [None]:
df_8_EYE.info()

In [None]:
df_8_EYE.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_8_EYE.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
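
The claim that the nulls follow interval patterns can be checked by grouping consecutive rows into runs. The following is a minimal sketch on a toy frame; the question labels are made up, and only the column name `QuestionKey` comes from the data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the recording; the question labels are made up.
toy = pd.DataFrame({
    "QuestionKey": [np.nan, np.nan, "Q1", "Q1", np.nan,
                    "Q2", "Q2", "Q2", np.nan, np.nan],
})

is_null = toy["QuestionKey"].isnull()
# A new run starts whenever the null/non-null state differs from the previous row.
run_id = (is_null != is_null.shift()).cumsum().rename("run")

# Summarise each run: whether it is a null run and how many rows it spans.
runs = is_null.groupby(run_id).agg(["first", "count"])
runs.columns = ["is_null", "length"]
null_runs = runs[runs["is_null"].astype(bool)]
print(null_runs)
```

A small number of long null runs, rather than many isolated nulls, would support the reading that no question was displayed during those periods.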

In [None]:
df_8_EYE['QuestionKey'].unique()

In [None]:
df_8_EYE['Timestamp'] = pd.to_datetime(df_8_EYE['Timestamp'])

In [None]:
df_8_EYE.head(3)

In [None]:
df_8_EYE['QuestionKey'] = df_8_EYE['QuestionKey'].fillna('None')

In [None]:
df_8_EYE['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_8_EYE.isnull(), cmap='viridis')
plt.show()

In [None]:
df_8_EYE.isnull().sum()

In [None]:
df_8_EYE.dropna(inplace=True)

In [None]:
df_8_EYE.head()

In [None]:
df_8_EYE['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_8_EYE['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.
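
Whether a column is a plain row index can be tested rather than judged from a histogram. The sketch below uses a toy series in place of the real `Row` column. Because the notebook calls `dropna` before this point, the real column may contain gaps, so the check is most informative on freshly loaded data.

```python
import pandas as pd

# Toy stand-in; in the notebook this would be the raw `Row` column.
row = pd.Series([1, 2, 3, 4, 5], name="Row")

# A plain row index should be unique and increase by exactly 1 per record.
is_unique = row.is_unique
steps = row.diff().dropna()
is_unit_step = bool((steps == 1).all())

print(f"unique: {is_unique}, unit step: {is_unit_step}")
```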

In [None]:
df_8_EYE.drop('Row', axis=1, inplace=True)

In [None]:
df_8_EYE['ET_ValidityLeft'].unique()

In [None]:
df_8_EYE['ET_ValidityLeft'].value_counts()

In [None]:
df_8_EYE['ET_ValidityRight'].unique()

In [None]:
df_8_EYE['ET_ValidityRight'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=df_8_EYE['ET_ValidityLeft'].value_counts().index, y=df_8_EYE['ET_ValidityLeft'].value_counts().values)
plt.title('Count of ET_ValidityLeft')
plt.xlabel('Validity')
plt.ylabel('Count')


plt.subplot(1, 2, 2)
sns.barplot(x=df_8_EYE['ET_ValidityRight'].value_counts().index, y=df_8_EYE['ET_ValidityRight'].value_counts().values)
plt.title('Count of ET_ValidityRight')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Notes & Observations

- The `ET_ValidityLeft` and `ET_ValidityRight` columns indicate the validity of the eye-tracking data for the left and right eye, respectively.
- Based on the value counts and the bar plots, it appears that a value of `0.0` represents valid eye-tracking data, while a value of `4.0` represents invalid data.
- Although the amount of invalid data is relatively small, removing these rows could introduce unwanted patterns or gaps in the time series data.
- Therefore, we will keep the data and replace the value `4.0` with `1.0` in both `ET_ValidityLeft` and `ET_ValidityRight` columns. This will indicate to a machine learning model that the eye tracker had invalid data at those specific points in time while maintaining the integrity of the time series.

Define a mapping to convert validity values from `0.0` and `4.0` to `0` and `1`.

In [None]:
validity_map = {4.0: 1.0, 0.0: 0.0}

In [None]:
df_8_EYE['ET_ValidityLeft'] = df_8_EYE['ET_ValidityLeft'].map(validity_map).astype(np.int8)
df_8_EYE['ET_ValidityRight'] = df_8_EYE['ET_ValidityRight'].map(validity_map).astype(np.int8)

In [None]:
df_8_EYE.head(3)

In [None]:
df_8_EYE.describe()

In [None]:
df_8_EYE[df_8_EYE['ET_ValidityLeft'] == 1].shape

In [None]:
df_8_EYE[df_8_EYE['ET_ValidityRight'] == 1].shape

In [None]:
df_8_EYE[df_8_EYE['ET_ValidityLeft'] == 1].shape[0] / df_8_EYE.shape[0]

In [None]:
df_8_EYE[df_8_EYE['ET_ValidityRight'] == 1].shape[0] / df_8_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_8_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_8_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
df_8_EYE[df_8_EYE['ET_PupilLeft'] == -1].shape

In [None]:
df_8_EYE[df_8_EYE['ET_PupilRight'] == -1].shape

In [None]:
df_8_EYE[df_8_EYE['ET_PupilLeft'] == -1].shape[0] / df_8_EYE.shape[0]

In [None]:
df_8_EYE[df_8_EYE['ET_PupilRight'] == -1].shape[0] / df_8_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_8_EYE[df_8_EYE['ET_ValidityLeft'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityLeft == 1)')

plt.subplot(1, 2, 2)
sns.heatmap(df_8_EYE[df_8_EYE['ET_ValidityRight'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityRight == 1)')

plt.tight_layout()
plt.show()

# Notes & Observations

- The heatmaps reveal the distribution of -1 values across different columns.
- It is evident that the `-1` values are not randomly scattered but appear in specific columns, notably `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY`.
- These `-1` values often coincide with instances where `ET_ValidityLeft` or `ET_ValidityRight` is 1, indicating invalid eye-tracking data. This suggests that `-1` is used as a placeholder for missing or invalid measurements in these columns when the eye tracker is not providing valid data for a particular eye.
- Over 70% of the values in the `ET_PupilLeft` and `ET_PupilRight` columns are marked as invalid (`-1`). Rather than dropping this data, we create a new feature for each of `ET_PupilLeft` and `ET_PupilRight` that flags which rows contain invalid pupil data.
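
One way to build such flags is a direct comparison against the placeholder, sketched here on toy data with made-up pupil sizes. The notebook's own cells use a mapping followed by `fillna(0)`; the comparison is an alternative that should give the same 0/1 result in a single step.

```python
import numpy as np
import pandas as pd

# Toy stand-in with made-up pupil sizes; -1 marks an invalid sample.
toy = pd.DataFrame({
    "ET_PupilLeft":  [3.1, -1.0, 2.9, -1.0],
    "ET_PupilRight": [-1.0, -1.0, 3.0, 3.2],
})

for side in ("Left", "Right"):
    col = f"ET_Pupil{side}"
    # 1 where the reading is the -1 placeholder, 0 otherwise.
    toy[f"{col}_validity"] = (toy[col] == -1).astype(np.int8)

print(toy)
```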

In [None]:
# Map the -1 placeholder to 1 (invalid); every other value maps to NaN and is filled with 0 later
pupil_validity = {-1: 1}

In [None]:
df_8_EYE['ET_PupilLeft_validity'] = df_8_EYE['ET_PupilLeft'].map(pupil_validity)

In [None]:
df_8_EYE['ET_PupilRight_validity'] = df_8_EYE['ET_PupilRight'].map(pupil_validity)

In [None]:
df_8_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
df_8_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull().sum()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_8_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_8_EYE['ET_PupilLeft_validity'] = df_8_EYE['ET_PupilLeft_validity'].fillna(0)

In [None]:
df_8_EYE['ET_PupilRight_validity'] = df_8_EYE['ET_PupilRight_validity'].fillna(0)

In [None]:
df_8_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_8_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_8_EYE.head()

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_8_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_8_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
valid_left_ratio = 1 - df_8_EYE['ET_ValidityLeft'].mean()

In [None]:
valid_left_ratio

In [None]:
valid_right_ratio = 1 - df_8_EYE['ET_ValidityRight'].mean()

In [None]:
valid_right_ratio

In [None]:
df_8_EYE['ET_PupilLeft_validity'] = df_8_EYE['ET_PupilLeft_validity'].astype(np.int8)
df_8_EYE['ET_PupilRight_validity'] = df_8_EYE['ET_PupilRight_validity'].astype(np.int8)

# Feature Engineering and Observations

Based on the analysis of the data, we've created two new features, `ET_PupilLeft_validity` and `ET_PupilRight_validity`. These features indicate the validity of the pupil data for the left and right eyes, respectively, with a value of 1 representing invalid data (originally -1) and 0 representing valid data.

The heatmaps above visually demonstrate the distribution of -1 and 1 values across the dataset. We observed that:
- The `-1` values are concentrated in specific columns related to gaze, pupil size, distance, and camera position, suggesting they represent missing or invalid sensor readings.
- The `1` values, after mapping from `4.0` in the original validity columns, indicate instances of invalid eye-tracking data.
- The `-1` values in the pupil columns line up with a value of 1 in the newly created pupil validity features. This holds by construction, since the features were derived from the `-1` placeholder, and it confirms that the flags mark the rows with invalid pupil data.
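
The overlap between the eye-level validity flag and the pupil-level flag can be quantified with a contingency table instead of being read from the heatmaps. This is a minimal sketch on made-up flags; only the column names are taken from the notebook.

```python
import pandas as pd

# Made-up flags: 1 means invalid, 0 means valid.
toy = pd.DataFrame({
    "ET_ValidityLeft":       [0, 0, 1, 1, 0, 1],
    "ET_PupilLeft_validity": [0, 1, 1, 1, 0, 1],
})

# Rows: eye validity, columns: pupil validity.
table = pd.crosstab(toy["ET_ValidityLeft"], toy["ET_PupilLeft_validity"])
print(table)

# Share of invalid-eye rows whose pupil reading is also flagged invalid.
invalid_rows = toy[toy["ET_ValidityLeft"] == 1]
share = invalid_rows["ET_PupilLeft_validity"].mean()
print(f"share of invalid-eye rows with invalid pupil: {share:.2f}")
```

In the toy data every invalid-eye row also has an invalid pupil reading, while one row has an invalid pupil reading despite a valid eye flag. The same table on the real data would show how often each case occurs.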

In [None]:
df_8_EYE.head()

In [None]:
# Select the numeric columns for the histograms, excluding UnixTime
numeric_cols = df_8_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_8_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms

The grid of histograms provides insights into the distribution of values for each numeric column in the dataset (excluding 'UnixTime'). Key observations include:

- Several columns, such as `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, and `ET_GazeRighty`, show distributions that appear somewhat multimodal or skewed, suggesting variations in gaze patterns.
- The `ET_PupilLeft` and `ET_PupilRight` histograms clearly show a peak at -1, confirming the presence of a significant number of invalid pupil readings.
- `ET_TimeSignal` shows a relatively uniform distribution, as expected for a time-based signal.
- `ET_DistanceLeft` and `ET_DistanceRight` appear to have distributions centered around certain values, with some outliers or variations.
- The camera position columns (`ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`) seem to have distributions concentrated within specific ranges, reflecting the camera's field of view.
- The validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show distributions dominated by 0, indicating that most of the data is considered valid after the mapping. The smaller peaks at 1 represent the instances of invalid data.

These distributions highlight the need for appropriate handling of the -1 values and potential outliers in subsequent analysis or modeling steps.
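
To put numbers on how much of each column consists of the `-1` placeholder, the fraction of placeholder rows can be computed per column. The sketch below uses a toy frame with made-up values and a subset of the column names.

```python
import pandas as pd

# Toy stand-in with made-up values; -1 marks an invalid sample.
toy = pd.DataFrame({
    "ET_GazeLeftx": [512.0, -1.0, 530.0, 498.0],
    "ET_PupilLeft": [-1.0, -1.0, 3.0, -1.0],
    "ET_ValidityLeft": [0, 1, 0, 0],
})

# Fraction of rows per column that equal the -1 placeholder, largest first.
placeholder_share = (toy == -1).mean().sort_values(ascending=False)
print(placeholder_share)
```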

In [None]:
df_8_EYE.columns

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_8_EYE['Timestamp'], df_8_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots

The line plots showing various features against the `Timestamp` reveal the temporal patterns and fluctuations in the eye-tracking data. Key observations include:

- **Gaze Coordinates (`ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`):** These plots show the changes in gaze position over time. We can observe periods of relatively stable gaze interspersed with rapid movements (saccades) and blinks or other events where the gaze data might be invalid (-1 values appear as gaps or spikes if not handled).
- **Pupil Size (`ET_PupilLeft`, `ET_PupilRight`):** The pupil size plots show variations over time. The presence of many -1 values is evident as flat lines at the bottom of the plot, indicating periods where pupil data was not recorded or was invalid.
- **Time Signal (`ET_TimeSignal`):** This plot shows a steady, increasing trend, as expected for a time-based signal.
- **Distance and Camera Position (`ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`):** These plots show how the distance from the eye tracker and the camera positions change over time. Variations in these features can be related to head movements or changes in the user's position relative to the eye tracker.
- **Validity (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`):** These plots clearly show periods of invalid data (represented by 1) as spikes or plateaus, corresponding to instances where the eye tracker lost track of the eyes or the pupil data was marked as invalid.

Analyzing these time series plots is crucial for understanding the dynamics of the eye-tracking data and identifying patterns or anomalies that may require further investigation or specific handling during subsequent analysis.

In [None]:
# Select the numeric columns for the boxplots
numeric_cols = df_8_EYE.select_dtypes(include=np.number).columns

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(df_8_EYE[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

# Observations from Boxplots and Handling -1 Values

The boxplots provide a visual summary of the distribution and potential outliers for each numeric column. Key observations from the boxplots include:

- The boxplots for columns like `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY` clearly show the presence of -1 values as significant outliers, confirming our earlier observations from the heatmaps and histograms.
- The boxplots for the validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show the discrete nature of these features, with the majority of data points at 0 (valid) and a smaller number at 1 (invalid).

Given the significant presence of -1 values, which represent invalid or missing data, especially in the pupil-related columns, we have decided to replace these -1 values with NaN to properly represent them as missing data. Subsequently, we will impute these missing values using the mean of each respective column. This approach helps to retain the data structure and allows for further analysis or modeling without the distortion caused by the -1 placeholders.
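
The two steps described above can be sketched on a toy frame. As an assumption made for this illustration, the replacement is restricted to an explicit list of measurement columns (`sensor_cols`, a hypothetical name), so that a column in which `-1` is a legitimate value would not be altered. The values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in with made-up values; -1 marks an invalid sample.
toy = pd.DataFrame({
    "ET_GazeLeftx": [100.0, -1.0, 300.0],
    "ET_PupilLeft": [3.0, -1.0, 5.0],
    "ET_ValidityLeft": np.array([0, 1, 0], dtype=np.int8),
})

# Hypothetical list of measurement columns in which -1 is a placeholder.
sensor_cols = ["ET_GazeLeftx", "ET_PupilLeft"]

# Mark placeholders as missing only in the measurement columns...
toy[sensor_cols] = toy[sensor_cols].replace(-1, np.nan)
# ...then fill each column with the mean of its valid samples.
toy[sensor_cols] = toy[sensor_cols].fillna(toy[sensor_cols].mean())

print(toy)
```

The validity column keeps its `int8` dtype because it is never touched by the replacement.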

In [None]:
df_8_EYE.replace({-1: np.nan}, inplace=True)

In [None]:
df_8_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].mean()

In [None]:
df_8_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].median()

In [None]:
numeric_cols = df_8_EYE.select_dtypes(include=np.number).columns

for col in numeric_cols:
    df_8_EYE[col] = df_8_EYE[col].fillna(df_8_EYE[col].mean())

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_8_EYE.isnull(), cmap='viridis')
plt.title('Heatmap of Missing Values After Imputation')

plt.subplot(1, 2, 2)
sns.heatmap(df_8_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

# Handling Missing Values (Imputation)

As decided, we have replaced all the `-1` values with `NaN` to treat them as missing data. Subsequently, we have imputed these `NaN` values with the mean of their respective columns. The heatmap above, which was generated after the imputation, now shows no visible signs of `NaN` values, indicating that the imputation was successful.

In [None]:
df_8_EYE.head()

In [None]:
# Select the numeric columns for the histograms, excluding UnixTime
numeric_cols = df_8_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_8_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms After Imputation

The histograms generated after replacing the -1 values with the mean of each column show the distributions of the numeric features with the missing data handled. Key observations from these updated histograms include:

- The distinct peaks at -1, which were prominent in the histograms for several columns (e.g., pupil size, gaze coordinates, distance, and camera position) before imputation, are now replaced by a peak at the mean of each respective column.
- The distributions in many columns now appear more unimodal or show shifted modes compared to the original histograms.
- The histograms for the validity columns still show their bimodal distributions with peaks at 0 and 1, as these were handled separately.

These histograms provide an updated view of the data's distribution after handling the missing values, highlighting the impact of the imputation method on the data's characteristics.

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
for col in cols:
    # Render a Markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_8_EYE['Timestamp'], df_8_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots After Imputation

The line plots generated after imputing the missing values with the mean show the temporal patterns of the features with the missing data handled. Key observations from these updated plots include:

- The gaps or flat lines at -1, which were prominent in the plots for columns like gaze coordinates, pupil size, distance, and camera position, are now filled by lines at the mean value of the respective columns.
- The plots for the validity columns remain the same as they were handled separately.
- The `ET_TimeSignal` plot still shows a steady increasing trend, as expected.

In [None]:
plt.figure(figsize=(16, 10))
sns.heatmap(df_8_EYE.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()

# Observations from Correlation Heatmap

The correlation heatmap provides a visual representation of the pairwise correlations between the numeric columns in the dataset. Key observations from the heatmap include:

- **High Positive Correlations:** We observe strong positive correlations (values close to 1) between:
  - `ET_GazeLeftx` and `ET_GazeRightx`: This is expected as the gaze positions of both eyes should be highly correlated when fixating on a point.
  - `ET_GazeLefty` and `ET_GazeRighty`: Similar to the x-coordinates, the y-coordinates of gaze should also be highly correlated.
  - `ET_PupilLeft` and `ET_PupilRight`: Pupil sizes of both eyes tend to change together in response to light and cognitive load.
  - `ET_DistanceLeft` and `ET_DistanceRight`: The distance from the eye tracker to each eye should be highly correlated.
  - `ET_CameraLeftX` and `ET_CameraRightX`, `ET_CameraLeftY` and `ET_CameraRightY`: The camera positions for both eyes are also expected to be highly correlated.
  - `UnixTime` and `ET_TimeSignal`: As previously noted, these two columns are almost perfectly linearly correlated, indicating redundancy.
  - `ET_ValidityLeft` and `ET_PupilLeft_validity`: There is a positive correlation, suggesting that when the overall left eye data is invalid, the left pupil data is also likely to be invalid.
  - `ET_ValidityRight` and `ET_PupilRight_validity`: Similar to the left eye, there is a positive correlation between the overall right eye validity and the right pupil validity.
- **Other Correlations:** We can also observe other varying degrees of correlations between different features, which can provide insights into the relationships between gaze behavior, pupil size, distance, and camera position. For example, there might be correlations between gaze coordinates and camera positions, reflecting head movements.
- **Low or Near-Zero Correlations:** Columns with low or near-zero correlations are relatively independent of each other.

Understanding these correlations is important for feature selection and for building models, as highly correlated features might indicate multicollinearity, while correlations between features can reveal underlying patterns in the data.

# Analysis of ET_TimeSignal and Decision to Drop

As observed in the time series plot, `ET_TimeSignal` increases steadily with `Timestamp`, and the correlation heatmap shows a near-perfect linear relationship (correlation close to 1) with `UnixTime`. This suggests that `ET_TimeSignal` is essentially redundant and likely represents another form of time recording or a signal directly derived from the timestamp.

Including highly correlated features like this in a dataset can lead to issues such as multicollinearity in some statistical models, which can make it difficult to interpret the individual impact of each feature. Since the `Timestamp` column already provides the necessary temporal information, retaining `ET_TimeSignal` does not appear to add significant value for further analysis or modeling in most cases.

Therefore, based on its high correlation and lack of unique insight, we will proceed to drop the `ET_TimeSignal` column to simplify the dataset and potentially improve the performance and interpretability of future analyses.
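
Before dropping the column, its relationship with `Timestamp` can be checked numerically. `DataFrame.corr(numeric_only=True)` only considers float, int, and boolean columns, so a datetime column such as `Timestamp` does not appear in the heatmap. The sketch below converts the timestamps to elapsed seconds first. The timestamps and signal values are made up for illustration.

```python
import pandas as pd

# Toy stand-in with made-up timestamps and signal values.
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2024-01-01 00:00:00.000", "2024-01-01 00:00:00.010",
        "2024-01-01 00:00:00.020", "2024-01-01 00:00:00.030",
    ]),
    "ET_TimeSignal": [1000.0, 1010.0, 1020.0, 1030.0],
})

# Elapsed seconds since the first sample give a numeric view of Timestamp.
elapsed = (toy["Timestamp"] - toy["Timestamp"].iloc[0]).dt.total_seconds()
r = elapsed.corr(toy["ET_TimeSignal"])
print(f"Pearson r between elapsed time and ET_TimeSignal: {r:.4f}")
```

A value of `r` very close to 1 on the real data would support treating `ET_TimeSignal` as redundant with `Timestamp`.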

In [None]:
df_8_EYE.drop('ET_TimeSignal', axis=1, inplace=True)

In [None]:
# sns.pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df_8_EYE)
plt.show()

# **9_EYE**

In [None]:
df_9_EYE = pd.read_csv('data/STData/9/9_EYE.csv')

In [None]:
df_9_EYE.head()

In [None]:
df_9_EYE.shape

In [None]:
df_9_EYE.columns

In [None]:
df_9_EYE.info()

In [None]:
df_9_EYE.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_9_EYE.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.

In [None]:
df_9_EYE['QuestionKey'].unique()

In [None]:
df_9_EYE['Timestamp'] = pd.to_datetime(df_9_EYE['Timestamp'])

In [None]:
df_9_EYE.head(3)

In [None]:
df_9_EYE['QuestionKey'] = df_9_EYE['QuestionKey'].fillna('None')

In [None]:
df_9_EYE['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_9_EYE.isnull(), cmap='viridis')
plt.show()

In [None]:
df_9_EYE.isnull().sum()

In [None]:
df_9_EYE.dropna(inplace=True)

In [None]:
df_9_EYE.head()

In [None]:
df_9_EYE['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_9_EYE['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.

In [None]:
df_9_EYE.drop('Row', axis=1, inplace=True)

In [None]:
df_9_EYE['ET_ValidityLeft'].unique()

In [None]:
df_9_EYE['ET_ValidityLeft'].value_counts()

In [None]:
df_9_EYE['ET_ValidityRight'].unique()

In [None]:
df_9_EYE['ET_ValidityRight'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=df_9_EYE['ET_ValidityLeft'].value_counts().index, y=df_9_EYE['ET_ValidityLeft'].value_counts().values)
plt.title('Count of ET_ValidityLeft')
plt.xlabel('Validity')
plt.ylabel('Count')


plt.subplot(1, 2, 2)
sns.barplot(x=df_9_EYE['ET_ValidityRight'].value_counts().index, y=df_9_EYE['ET_ValidityRight'].value_counts().values)
plt.title('Count of ET_ValidityRight')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Notes & Observations

- The `ET_ValidityLeft` and `ET_ValidityRight` columns indicate the validity of the eye-tracking data for the left and right eye, respectively.
- Based on the value counts and the bar plots, it appears that a value of `0.0` represents valid eye-tracking data, while a value of `4.0` represents invalid data.
- Although the amount of invalid data is relatively small, removing these rows could introduce unwanted patterns or gaps in the time series data.
- Therefore, we will keep the data and replace the value `4.0` with `1.0` in both `ET_ValidityLeft` and `ET_ValidityRight` columns. This will indicate to a machine learning model that the eye tracker had invalid data at those specific points in time while maintaining the integrity of the time series.

Define a mapping to convert validity values from `0.0` and `4.0` to `0` and `1`.

In [None]:
validity_map = {4.0: 1.0, 0.0: 0.0}

In [None]:
df_9_EYE['ET_ValidityLeft'] = df_9_EYE['ET_ValidityLeft'].map(validity_map).astype(np.int8)
df_9_EYE['ET_ValidityRight'] = df_9_EYE['ET_ValidityRight'].map(validity_map).astype(np.int8)
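
The mapping above relies on the validity columns containing only `0.0` and `4.0`. Any other code would map to `NaN`, and the subsequent `astype(np.int8)` would then fail. As a hedged sketch, a small helper can make that assumption explicit. The function name `map_validity` and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd

validity_map = {0.0: 0, 4.0: 1}

def map_validity(codes: pd.Series) -> pd.Series:
    """Map raw validity codes to 0/1, failing loudly on unexpected codes."""
    unexpected = set(codes.unique()) - set(validity_map)
    if unexpected:
        raise ValueError(f"Unexpected validity codes: {sorted(unexpected)}")
    return codes.map(validity_map).astype(np.int8)

# Toy series with made-up codes.
raw = pd.Series([0.0, 4.0, 0.0, 0.0], name="ET_ValidityLeft")
flags = map_validity(raw)
print(flags.tolist())
```

If a recording ever contains a code other than `0.0` or `4.0`, the helper raises a `ValueError` naming the code instead of failing later during the dtype conversion.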

In [None]:
df_9_EYE.head(3)

In [None]:
df_9_EYE.describe()

In [None]:
df_9_EYE[df_9_EYE['ET_ValidityLeft'] == 1].shape

In [None]:
df_9_EYE[df_9_EYE['ET_ValidityRight'] == 1].shape

In [None]:
df_9_EYE[df_9_EYE['ET_ValidityLeft'] == 1].shape[0] / df_9_EYE.shape[0]

In [None]:
df_9_EYE[df_9_EYE['ET_ValidityRight'] == 1].shape[0] / df_9_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_9_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_9_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
df_9_EYE[df_9_EYE['ET_PupilLeft'] == -1].shape

In [None]:
df_9_EYE[df_9_EYE['ET_PupilRight'] == -1].shape

In [None]:
df_9_EYE[df_9_EYE['ET_PupilLeft'] == -1].shape[0] / df_9_EYE.shape[0]

In [None]:
df_9_EYE[df_9_EYE['ET_PupilRight'] == -1].shape[0] / df_9_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_9_EYE[df_9_EYE['ET_ValidityLeft'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityLeft == 1)')

plt.subplot(1, 2, 2)
sns.heatmap(df_9_EYE[df_9_EYE['ET_ValidityRight'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityRight == 1)')

plt.tight_layout()
plt.show()

# Notes & Observations

- The heatmaps reveal the distribution of -1 values across different columns.
- It is evident that the `-1` values are not randomly scattered but appear in specific columns, notably `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY`.
- These `-1` values often coincide with instances where `ET_ValidityLeft` or `ET_ValidityRight` is 1, indicating invalid eye-tracking data. This suggests that `-1` is used as a placeholder for missing or invalid measurements in these columns when the eye tracker is not providing valid data for a particular eye.
- Over 70% of the values in the `ET_PupilLeft` and `ET_PupilRight` columns are marked as invalid (`-1`). Instead of dropping those rows, we create a new indicator feature for each of `ET_PupilLeft` and `ET_PupilRight` that flags the rows containing invalid pupil data.
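
The claims in this list (which columns carry `-1`, and how often `-1` coincides with an invalid flag) can also be checked numerically instead of being read off the heatmaps. The sketch below uses a small made-up frame named `demo`; the values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the eye-tracking frame (values are made up).
demo = pd.DataFrame({
    "ET_ValidityLeft": [0, 1, 0, 1, 0],
    "ET_PupilLeft": [3.1, -1.0, -1.0, -1.0, 2.9],
    "ET_GazeLeftx": [512.0, -1.0, 640.0, -1.0, 700.0],
})

# Share of rows holding the -1 placeholder, per column.
sentinel_share = (demo == -1).mean()

# Share of -1 pupil readings among valid (0) and invalid (1) rows.
pupil_missing_by_validity = (
    demo["ET_PupilLeft"].eq(-1).groupby(demo["ET_ValidityLeft"]).mean()
)
```

A share close to 1 in the invalid group together with a clearly lower share in the valid group would support reading `-1` as a placeholder tied to tracking loss.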

In [None]:
pupil_validity = {-1: 1}

In [None]:
df_9_EYE['ET_PupilLeft_validity'] = df_9_EYE['ET_PupilLeft'].map(pupil_validity)

In [None]:
df_9_EYE['ET_PupilRight_validity'] = df_9_EYE['ET_PupilRight'].map(pupil_validity)

In [None]:
df_9_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
df_9_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull().sum()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_9_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_9_EYE['ET_PupilLeft_validity'] = df_9_EYE['ET_PupilLeft_validity'].fillna(0)

In [None]:
df_9_EYE['ET_PupilRight_validity'] = df_9_EYE['ET_PupilRight_validity'].fillna(0)

In [None]:
df_9_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_9_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_9_EYE.head()

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_9_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_9_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
valid_left_ratio = 1 - df_9_EYE['ET_ValidityLeft'].mean()

In [None]:
valid_left_ratio

In [None]:
valid_right_ratio = 1 - df_9_EYE['ET_ValidityRight'].mean()

In [None]:
valid_right_ratio

In [None]:
df_9_EYE['ET_PupilLeft_validity'] = df_9_EYE['ET_PupilLeft_validity'].astype(np.int8)
df_9_EYE['ET_PupilRight_validity'] = df_9_EYE['ET_PupilRight_validity'].astype(np.int8)

# Feature Engineering and Observations

Based on the analysis of the data, we've created two new features, `ET_PupilLeft_validity` and `ET_PupilRight_validity`. These features indicate the validity of the pupil data for the left and right eyes, respectively, with a value of 1 representing invalid data (originally -1) and 0 representing valid data.

The heatmaps above visually demonstrate the distribution of -1 and 1 values across the dataset. We observed that:
- The `-1` values are concentrated in specific columns related to gaze, pupil size, distance, and camera position, suggesting they represent missing or invalid sensor readings.
- The `1` values, after mapping from `4.0` in the original validity columns, indicate instances of invalid eye-tracking data.
- By construction, the new pupil validity features equal 1 exactly where the corresponding pupil column holds `-1`, so the rows flagged in these features are the same rows that appear as `-1` in the pupil columns of the heatmap.
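
The indicator features described in this section can also be built in a single step with a boolean comparison, which avoids the intermediate `NaN` values produced by `map`. This is a sketch on made-up data; the `demo` frame and the `two_step` name are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up pupil readings; -1 marks an invalid sample.
demo = pd.DataFrame({"ET_PupilLeft": [3.2, -1.0, 2.8, -1.0]})

# One-step flag: 1 where the reading is the -1 placeholder, 0 otherwise.
demo["ET_PupilLeft_validity"] = demo["ET_PupilLeft"].eq(-1).astype(np.int8)

# The map / fillna / astype sequence used in the notebook, for comparison.
two_step = demo["ET_PupilLeft"].map({-1: 1}).fillna(0).astype(np.int8)
```

Both routes give the same flags; the one-step form simply never creates missing values that later need filling.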

In [None]:
df_9_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding UnixTime
numeric_cols = df_9_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_9_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms

The grid of histograms provides insights into the distribution of values for each numeric column in the dataset (excluding 'UnixTime'). Key observations include:

- Several columns, such as `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, and `ET_GazeRighty`, show distributions that appear somewhat multimodal or skewed, suggesting variations in gaze patterns.
- The `ET_PupilLeft` and `ET_PupilRight` histograms clearly show a peak at -1, confirming the presence of a significant number of invalid pupil readings.
- `ET_TimeSignal` shows a relatively uniform distribution, as expected for a time-based signal.
- `ET_DistanceLeft` and `ET_DistanceRight` appear to have distributions centered around certain values, with some outliers or variations.
- The camera position columns (`ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`) seem to have distributions concentrated within specific ranges, reflecting the camera's field of view.
- The validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show distributions dominated by 0, indicating that most of the data is considered valid after the mapping. The smaller peaks at 1 represent the instances of invalid data.

These distributions highlight the need for appropriate handling of the -1 values and potential outliers in subsequent analysis or modeling steps.
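
To illustrate why the `-1` placeholders need handling before summary statistics are trusted, the sketch below compares the mean of a made-up pupil series with and without the placeholder. The numbers are invented for the example.

```python
import pandas as pd

# Made-up pupil readings; -1 marks an invalid sample.
pupil = pd.Series([3.0, -1.0, 3.2, -1.0, 2.8])

# Mean with the placeholder treated as a real measurement.
raw_mean = pupil.mean()

# Mean with the placeholder masked out (where() turns -1 into NaN, which mean() skips).
masked_mean = pupil.where(pupil != -1).mean()
```

In this toy series the placeholder pulls the mean from about 3.0 down to about 1.4, which is the kind of distortion visible as the spike at -1 in the histograms.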

In [None]:
df_9_EYE.columns

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_9_EYE['Timestamp'], df_9_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots

The line plots showing various features against the `Timestamp` reveal the temporal patterns and fluctuations in the eye-tracking data. Key observations include:

- **Gaze Coordinates (`ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`):** These plots show the changes in gaze position over time. We can observe periods of relatively stable gaze interspersed with rapid movements (saccades), as well as blinks or other events where the gaze data is invalid (until they are handled, the `-1` placeholders appear as abrupt drops to -1 in the line plot).
- **Pupil Size (`ET_PupilLeft`, `ET_PupilRight`):** The pupil size plots show variations over time. The presence of many -1 values is evident as flat lines at the bottom of the plot, indicating periods where pupil data was not recorded or was invalid.
- **Time Signal (`ET_TimeSignal`):** This plot shows a steady, increasing trend, as expected for a time-based signal.
- **Distance and Camera Position (`ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`):** These plots show how the distance from the eye tracker and the camera positions change over time. Variations in these features can be related to head movements or changes in the user's position relative to the eye tracker.
- **Validity (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`):** These plots clearly show periods of invalid data (represented by 1) as spikes or plateaus, corresponding to instances where the eye tracker lost track of the eyes or the pupil data was marked as invalid.

Analyzing these time series plots is crucial for understanding the dynamics of the eye-tracking data and identifying patterns or anomalies that may require further investigation or specific handling during subsequent analysis.
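
The plateaus of invalid data mentioned above can be quantified as run lengths, i.e. how many consecutive samples each invalid stretch lasts. The following is a sketch on a made-up flag series; `flag`, `run_id`, and `invalid_run_lengths` are illustrative names.

```python
import pandas as pd

# Made-up validity flags in time order; 1 = invalid sample.
flag = pd.Series([0, 1, 1, 0, 0, 1, 1, 1, 0])

# A new run starts whenever the flag differs from the previous sample.
run_id = (flag != flag.shift()).cumsum()

# Length of each run that consists of invalid samples.
invalid = flag == 1
invalid_run_lengths = flag[invalid].groupby(run_id[invalid]).size()
```

Short runs would be consistent with blinks, while long runs point to extended tracking loss; which threshold separates the two is a judgment call that depends on the sampling rate.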

In [None]:
# Select all numeric columns for the boxplots
numeric_cols = df_9_EYE.select_dtypes(include=np.number).columns

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(df_9_EYE[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

# Observations from Boxplots and Handling -1 Values

The boxplots provide a visual summary of the distribution and potential outliers for each numeric column. Key observations from the boxplots include:

- The boxplots for columns like `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY` clearly show the presence of -1 values as significant outliers, confirming our earlier observations from the heatmaps and histograms.
- The boxplots for the validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show the discrete nature of these features, with the majority of data points at 0 (valid) and a smaller number at 1 (invalid).

Given the significant presence of -1 values, which represent invalid or missing data, especially in the pupil-related columns, we have decided to replace these -1 values with NaN to properly represent them as missing data. Subsequently, we will impute these missing values using the mean of each respective column. This approach helps to retain the data structure and allows for further analysis or modeling without the distortion caused by the -1 placeholders.
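
The replace-then-impute approach described above can be restricted to the sensor columns, so that a legitimate `-1` in any other numeric column is left alone. This is a sketch under that assumption; the `Delta` column is hypothetical and only serves to show a column where `-1` is a real value.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "ET_PupilLeft": [3.0, -1.0, 3.4, -1.0],
    "ET_GazeLeftx": [500.0, -1.0, 520.0, 540.0],
    "Delta": [-1, 0, 1, 2],  # hypothetical column where -1 is a real value
})
sensor_cols = ["ET_PupilLeft", "ET_GazeLeftx"]

cleaned = demo.copy()
# Turn the -1 placeholder into NaN in the sensor columns only.
cleaned[sensor_cols] = cleaned[sensor_cols].replace(-1, np.nan)
# Fill each sensor column with its own mean, computed over the valid samples.
cleaned[sensor_cols] = cleaned[sensor_cols].fillna(cleaned[sensor_cols].mean())
```

A frame-wide `replace` is equivalent when `-1` never occurs as a real value; scoping the replacement just makes that assumption explicit.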

In [None]:
df_9_EYE.replace({-1: np.nan}, inplace=True)

In [None]:
df_9_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].mean()

In [None]:
df_9_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].median()

In [None]:
numeric_cols = df_9_EYE.select_dtypes(include=np.number).columns

for col in numeric_cols:
    df_9_EYE[col] = df_9_EYE[col].fillna(df_9_EYE[col].mean())

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_9_EYE.isnull(), cmap='viridis')
plt.title('Heatmap of Missing Values After Imputation')

plt.subplot(1, 2, 2)
sns.heatmap(df_9_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

# Handling Missing Values (Imputation)

As decided, we have replaced all the `-1` values with `NaN` to treat them as missing data. Subsequently, we have imputed these `NaN` values with the mean of their respective columns. The heatmap above, which was generated after the imputation, now shows no visible signs of `NaN` values, indicating that the imputation was successful.

In [None]:
df_9_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding UnixTime
numeric_cols = df_9_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_9_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms After Imputation

The histograms generated after replacing the -1 values with the mean of each column show the distributions of the numeric features with the missing data handled. Key observations from these updated histograms include:

- The distinct peaks at -1, which were prominent in the histograms for several columns (e.g., pupil size, gaze coordinates, distance, and camera position) before imputation, are now replaced by a peak at the mean of each respective column.
- The distributions in many columns now appear more unimodal or show shifted modes compared to the original histograms.
- The histograms for the validity columns still show their bimodal distributions with peaks at 0 and 1, as these were handled separately.

These histograms provide an updated view of the data's distribution after handling the missing values, highlighting the impact of the imputation method on the data's characteristics.
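
One measurable side effect of the peak at the mean is a reduced spread: mean imputation keeps the column mean unchanged but shrinks its standard deviation. The sketch below shows this on a made-up series.

```python
import numpy as np
import pandas as pd

# Made-up column with two valid readings and two missing ones.
observed = pd.Series([2.0, 4.0, np.nan, np.nan])
imputed = observed.fillna(observed.mean())

# Standard deviation before (NaN skipped) and after imputation.
std_before = observed.std()
std_after = imputed.std()
```

The larger the share of imputed values, the stronger this shrinkage, which is worth keeping in mind for columns where most samples were `-1`.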

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
for col in cols:
    # Render a markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_9_EYE['Timestamp'], df_9_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots After Imputation

The line plots generated after imputing the missing values with the mean show the temporal patterns of the features with the missing data handled. Key observations from these updated plots include:

- The gaps or flat lines at -1, which were prominent in the plots for columns like gaze coordinates, pupil size, distance, and camera position, are now filled by lines at the mean value of the respective columns.
- The plots for the validity columns remain the same as they were handled separately.
- The `ET_TimeSignal` plot still shows a steady increasing trend, as expected.
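
Because mean imputation draws flat segments into the time series, a time-aware fill is a possible alternative. The sketch below is not what the notebook does; it only contrasts the two options on a made-up series, using `Series.interpolate(method="time")`, which requires a datetime index.

```python
import numpy as np
import pandas as pd

timestamps = pd.to_datetime([
    "2024-01-01 00:00:00",
    "2024-01-01 00:00:01",
    "2024-01-01 00:00:02",
    "2024-01-01 00:00:03",
    "2024-01-01 00:00:04",
])
# Made-up pupil readings with a two-sample gap.
pupil = pd.Series([3.0, np.nan, np.nan, 3.6, 3.4], index=timestamps)

# Option used in the notebook: every gap gets the column mean.
mean_filled = pupil.fillna(pupil.mean())

# Alternative: interpolate linearly along the time axis between valid neighbours.
time_filled = pupil.interpolate(method="time")
```

Interpolation follows the local trend, but it is only plausible for short gaps; for the long invalid stretches seen in the pupil columns neither option recovers real information.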

In [None]:
plt.figure(figsize=(16, 10))
sns.heatmap(df_9_EYE.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()

# Observations from Correlation Heatmap

The correlation heatmap provides a visual representation of the pairwise correlations between the numeric columns in the dataset. Key observations from the heatmap include:

- **High Positive Correlations:** We observe strong positive correlations (values close to 1) between:
  - `ET_GazeLeftx` and `ET_GazeRightx`: This is expected as the gaze positions of both eyes should be highly correlated when fixating on a point.
  - `ET_GazeLefty` and `ET_GazeRighty`: Similar to the x-coordinates, the y-coordinates of gaze should also be highly correlated.
  - `ET_PupilLeft` and `ET_PupilRight`: Pupil sizes of both eyes tend to change together in response to light and cognitive load.
  - `ET_DistanceLeft` and `ET_DistanceRight`: The distance from the eye tracker to each eye should be highly correlated.
  - `ET_CameraLeftX` and `ET_CameraRightX`, `ET_CameraLeftY` and `ET_CameraRightY`: The camera positions for both eyes are also expected to be highly correlated.
  - `UnixTime` and `ET_TimeSignal`: As previously noted, these two columns are almost perfectly linearly correlated, indicating redundancy.
  - `ET_ValidityLeft` and `ET_PupilLeft_validity`: There is a positive correlation, suggesting that when the overall left eye data is invalid, the left pupil data is also likely to be invalid.
  - `ET_ValidityRight` and `ET_PupilRight_validity`: Similar to the left eye, there is a positive correlation between the overall right eye validity and the right pupil validity.
- **Other Correlations:** We can also observe other varying degrees of correlations between different features, which can provide insights into the relationships between gaze behavior, pupil size, distance, and camera position. For example, there might be correlations between gaze coordinates and camera positions, reflecting head movements.
- **Low or Near-Zero Correlations:** Columns with low or near-zero correlations are relatively independent of each other.

Understanding these correlations matters for feature selection and modeling: very highly correlated features can introduce multicollinearity, while moderate correlations can reveal underlying relationships in the data.
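
Highly correlated pairs can also be listed programmatically rather than read off the annotated heatmap. The sketch below uses a made-up frame and an arbitrary threshold of 0.9; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; the two gaze columns are deliberately almost identical.
demo = pd.DataFrame({
    "ET_GazeLeftx": [100, 200, 300, 400, 500],
    "ET_GazeRightx": [110, 205, 310, 395, 505],
    "ET_PupilLeft": [5, 1, 4, 2, 3],
})

corr = demo.corr(numeric_only=True).abs()
# Keep only the upper triangle so each pair is listed once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]  # 0.9 is an illustrative threshold
```

`high_pairs` is a Series indexed by column pairs, which is easier to scan than a large annotated matrix.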

# Analysis of ET_TimeSignal and Decision to Drop

As observed in the time series plot and confirmed by the correlation heatmap, the `ET_TimeSignal` column exhibits a near-perfect linear relationship with both the `Timestamp` and `UnixTime` columns. This strong correlation (close to 1) suggests that `ET_TimeSignal` is essentially redundant and likely represents another form of time recording or a signal directly derived from the timestamp.

Including highly correlated features like this in a dataset can lead to issues such as multicollinearity in some statistical models, which can make it difficult to interpret the individual impact of each feature. Since the `Timestamp` column already provides the necessary temporal information, retaining `ET_TimeSignal` does not appear to add significant value for further analysis or modeling in most cases.

Therefore, based on its high correlation and lack of unique insight, we will proceed to drop the `ET_TimeSignal` column to simplify the dataset and potentially improve the performance and interpretability of future analyses.
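
The redundancy argument can be turned into an explicit check before the column is removed. This is a sketch on made-up data; the 0.99 threshold is an assumption and not a value used in the notebook.

```python
import pandas as pd

# Made-up frame in which ET_TimeSignal is a linear function of UnixTime.
demo = pd.DataFrame({
    "UnixTime": [1000, 1010, 1020, 1030, 1040],
    "ET_TimeSignal": [5.00, 5.01, 5.02, 5.03, 5.04],
    "ET_PupilLeft": [3.1, 3.3, 2.9, 3.2, 3.0],
})

# Pearson correlation between the two time-like columns.
r = demo["UnixTime"].corr(demo["ET_TimeSignal"])

# Drop the column only if it is (almost) perfectly linear in UnixTime.
if abs(r) > 0.99:
    demo = demo.drop(columns="ET_TimeSignal")
```

Guarding the drop this way documents the reason for it and avoids removing the column for a participant where the relationship does not hold.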

In [None]:
df_9_EYE.drop('ET_TimeSignal', axis=1, inplace=True)

In [None]:
# sns.pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df_9_EYE)
plt.show()

# **10_EYE**

In [None]:
df_10_EYE = pd.read_csv('data/STData/10/10_EYE.csv')

In [None]:
df_10_EYE.head()

In [None]:
df_10_EYE.shape

In [None]:
df_10_EYE.columns

In [None]:
df_10_EYE.info()

In [None]:
df_10_EYE.isnull().sum()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_10_EYE.isnull(), cmap='viridis')
plt.show()

# Notes & Observations

- We observe many **null** (or missing) values in the `QuestionKey` column.
- The nulls in the `QuestionKey` column may not represent “true” nulls. Rather, they follow interval patterns, suggesting that during those periods no question was displayed.
- These missing values in `QuestionKey` require additional investigation and context-aware handling.
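
One way to test the interval hypothesis is to count how many contiguous blocks the nulls form: if missing values were scattered at random, the number of blocks would be close to the number of null rows, whereas interval-like gaps give far fewer blocks. The sketch below uses a made-up `QuestionKey` series.

```python
import numpy as np
import pandas as pd

# Made-up QuestionKey values in time order.
question_key = pd.Series([np.nan, np.nan, "Q1", "Q1", np.nan, "Q2"])

is_null = question_key.isna()
# A block starts where a row is null and the previous row is not.
block_starts = is_null & ~is_null.shift(fill_value=False)

n_null_rows = int(is_null.sum())
n_null_blocks = int(block_starts.sum())
```

On the real data, a block count that is tiny compared with the null row count would support treating the nulls as "no question displayed" periods.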

In [None]:
df_10_EYE['QuestionKey'].unique()

In [None]:
df_10_EYE['Timestamp'] = pd.to_datetime(df_10_EYE['Timestamp'])

In [None]:
df_10_EYE.head(3)

In [None]:
df_10_EYE['QuestionKey'] = df_10_EYE['QuestionKey'].fillna('None')

In [None]:
df_10_EYE['QuestionKey'].value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_10_EYE.isnull(), cmap='viridis')
plt.show()

In [None]:
df_10_EYE.isnull().sum()

In [None]:
df_10_EYE.dropna(inplace=True)

In [None]:
df_10_EYE.head()

In [None]:
df_10_EYE['Row'].unique()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df_10_EYE['Row'])
plt.show()

# Notes & Observations

- The `Row` column appears to be a simple row index and does not provide meaningful information relevant to the eye-tracking data itself. Therefore, it can be dropped.
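
The assumption that `Row` is a plain counter can be verified before dropping it. This is a small sketch on made-up values; `is_plain_counter` is an illustrative name.

```python
import pandas as pd

# Made-up Row values.
row = pd.Series([1, 2, 3, 4, 5], name="Row")

# A plain counter is unique and increases by exactly 1 from one row to the next.
is_plain_counter = bool(row.is_unique and row.diff().dropna().eq(1).all())

# Counter-example with a skipped value.
gappy = pd.Series([1, 2, 4], name="Row")
gappy_is_counter = bool(gappy.is_unique and gappy.diff().dropna().eq(1).all())
```

If the check fails on the real data, the skipped values may indicate dropped samples, in which case `Row` carries information and should be inspected before removal.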

In [None]:
df_10_EYE.drop('Row', axis=1, inplace=True)

In [None]:
df_10_EYE['ET_ValidityLeft'].unique()

In [None]:
df_10_EYE['ET_ValidityLeft'].value_counts()

In [None]:
df_10_EYE['ET_ValidityRight'].unique()

In [None]:
df_10_EYE['ET_ValidityRight'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=df_10_EYE['ET_ValidityLeft'].value_counts().index, y=df_10_EYE['ET_ValidityLeft'].value_counts().values)
plt.title('Count of ET_ValidityLeft')
plt.xlabel('Validity')
plt.ylabel('Count')


plt.subplot(1, 2, 2)
sns.barplot(x=df_10_EYE['ET_ValidityRight'].value_counts().index, y=df_10_EYE['ET_ValidityRight'].value_counts().values)
plt.title('Count of ET_ValidityRight')
plt.xlabel('Validity')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Notes & Observations

- The `ET_ValidityLeft` and `ET_ValidityRight` columns indicate the validity of the eye-tracking data for the left and right eye, respectively.
- Based on the value counts and the bar plots, it appears that a value of `0.0` represents valid eye-tracking data, while a value of `4.0` represents invalid data.
- Although the amount of invalid data is relatively small, removing these rows could introduce unwanted patterns or gaps in the time series data.
- Therefore, we will keep the data and replace the value `4.0` with `1.0` in both `ET_ValidityLeft` and `ET_ValidityRight` columns. This will indicate to a machine learning model that the eye tracker had invalid data at those specific points in time while maintaining the integrity of the time series.
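
The concern about gaps can be made concrete by comparing the largest time step between consecutive samples with and without the invalid rows. The sketch below uses made-up timestamps spaced 10 ms apart.

```python
import pandas as pd

demo = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2024-01-01 00:00:00.000",
        "2024-01-01 00:00:00.010",
        "2024-01-01 00:00:00.020",
        "2024-01-01 00:00:00.030",
        "2024-01-01 00:00:00.040",
    ]),
    "ET_ValidityLeft": [0.0, 4.0, 4.0, 0.0, 0.0],  # 4.0 = invalid
})

# Largest step between consecutive samples when all rows are kept.
gap_kept = demo["Timestamp"].diff().max()

# Largest step after dropping the invalid rows.
valid_only = demo[demo["ET_ValidityLeft"] == 0.0]
gap_dropped = valid_only["Timestamp"].diff().max()
```

Keeping the rows preserves the regular 10 ms spacing, while dropping them leaves a 30 ms jump that a sequence model would see as an irregular sampling step.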

Define a mapping to convert validity values from `0.0` and `4.0` to `0` and `1`.

In [None]:
validity_map = {4.0: 1.0, 0.0: 0.0}

In [None]:
df_10_EYE['ET_ValidityLeft'] = df_10_EYE['ET_ValidityLeft'].map(validity_map).astype(np.int8)
df_10_EYE['ET_ValidityRight'] = df_10_EYE['ET_ValidityRight'].map(validity_map).astype(np.int8)

In [None]:
df_10_EYE.head(3)

In [None]:
df_10_EYE.describe()

In [None]:
df_10_EYE[df_10_EYE['ET_ValidityLeft'] == 1].shape

In [None]:
df_10_EYE[df_10_EYE['ET_ValidityRight'] == 1].shape

In [None]:
df_10_EYE[df_10_EYE['ET_ValidityLeft'] == 1].shape[0] / df_10_EYE.shape[0]

In [None]:
df_10_EYE[df_10_EYE['ET_ValidityRight'] == 1].shape[0] / df_10_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_10_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_10_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
df_10_EYE[df_10_EYE['ET_PupilLeft'] == -1].shape

In [None]:
df_10_EYE[df_10_EYE['ET_PupilRight'] == -1].shape

In [None]:
df_10_EYE[df_10_EYE['ET_PupilLeft'] == -1].shape[0] / df_10_EYE.shape[0]

In [None]:
df_10_EYE[df_10_EYE['ET_PupilRight'] == -1].shape[0] / df_10_EYE.shape[0]

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_10_EYE[df_10_EYE['ET_ValidityLeft'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityLeft == 1)')

plt.subplot(1, 2, 2)
sns.heatmap(df_10_EYE[df_10_EYE['ET_ValidityRight'] == 1] == -1, cmap='viridis')
plt.title('Heatmap of -1 Values (ET_ValidityRight == 1)')

plt.tight_layout()
plt.show()

# Notes & Observations

- The heatmaps reveal the distribution of -1 values across different columns.
- It is evident that the `-1` values are not randomly scattered but appear in specific columns, notably `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY`.
- These `-1` values often coincide with instances where `ET_ValidityLeft` or `ET_ValidityRight` is 1, indicating invalid eye-tracking data. This suggests that `-1` is used as a placeholder for missing or invalid measurements in these columns when the eye tracker is not providing valid data for a particular eye.
- Over 70% of the values in the `ET_PupilLeft` and `ET_PupilRight` columns are marked as invalid (`-1`). Instead of dropping those rows, we create a new indicator feature for each of `ET_PupilLeft` and `ET_PupilRight` that flags the rows containing invalid pupil data.

In [None]:
pupil_validity = {-1: 1}

In [None]:
df_10_EYE['ET_PupilLeft_validity'] = df_10_EYE['ET_PupilLeft'].map(pupil_validity)

In [None]:
df_10_EYE['ET_PupilRight_validity'] = df_10_EYE['ET_PupilRight'].map(pupil_validity)

In [None]:
df_10_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
df_10_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull().sum()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_10_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_10_EYE['ET_PupilLeft_validity'] = df_10_EYE['ET_PupilLeft_validity'].fillna(0)

In [None]:
df_10_EYE['ET_PupilRight_validity'] = df_10_EYE['ET_PupilRight_validity'].fillna(0)

In [None]:
df_10_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].head()

In [None]:
plt.figure(figsize=(18, 8))
sns.heatmap(df_10_EYE[['ET_PupilLeft_validity', 'ET_PupilRight_validity']].isnull(), cmap='viridis')
plt.show()

In [None]:
df_10_EYE.head()

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_10_EYE == -1, cmap='viridis')
plt.title('Heatmap of -1 Values')

plt.subplot(1, 2, 2)
sns.heatmap(df_10_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

In [None]:
valid_left_ratio = 1 - df_10_EYE['ET_ValidityLeft'].mean()

In [None]:
valid_left_ratio

In [None]:
valid_right_ratio = 1 - df_10_EYE['ET_ValidityRight'].mean()

In [None]:
valid_right_ratio

In [None]:
df_10_EYE['ET_PupilLeft_validity'] = df_10_EYE['ET_PupilLeft_validity'].astype(np.int8)
df_10_EYE['ET_PupilRight_validity'] = df_10_EYE['ET_PupilRight_validity'].astype(np.int8)

# Feature Engineering and Observations

Based on the analysis of the data, we've created two new features, `ET_PupilLeft_validity` and `ET_PupilRight_validity`. These features indicate the validity of the pupil data for the left and right eyes, respectively, with a value of 1 representing invalid data (originally -1) and 0 representing valid data.

The heatmaps above visually demonstrate the distribution of -1 and 1 values across the dataset. We observed that:
- The `-1` values are concentrated in specific columns related to gaze, pupil size, distance, and camera position, suggesting they represent missing or invalid sensor readings.
- The `1` values, after mapping from `4.0` in the original validity columns, indicate instances of invalid eye-tracking data.
- By construction, the new pupil validity features equal 1 exactly where the corresponding pupil column holds `-1`, so the rows flagged in these features are the same rows that appear as `-1` in the pupil columns of the heatmap.

In [None]:
df_10_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding UnixTime
numeric_cols = df_10_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_10_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms

The grid of histograms provides insights into the distribution of values for each numeric column in the dataset (excluding 'UnixTime'). Key observations include:

- Several columns, such as `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, and `ET_GazeRighty`, show distributions that appear somewhat multimodal or skewed, suggesting variations in gaze patterns.
- The `ET_PupilLeft` and `ET_PupilRight` histograms clearly show a peak at -1, confirming the presence of a significant number of invalid pupil readings.
- `ET_TimeSignal` shows a relatively uniform distribution, as expected for a time-based signal.
- `ET_DistanceLeft` and `ET_DistanceRight` appear to have distributions centered around certain values, with some outliers or variations.
- The camera position columns (`ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`) seem to have distributions concentrated within specific ranges, reflecting the camera's field of view.
- The validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show distributions dominated by 0, indicating that most of the data is considered valid after the mapping. The smaller peaks at 1 represent the instances of invalid data.

These distributions highlight the need for appropriate handling of the -1 values and potential outliers in subsequent analysis or modeling steps.

In [None]:
df_10_EYE.columns

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import display, Markdown

for col in cols:
    # Render a markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_10_EYE['Timestamp'], df_10_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots

The line plots showing various features against the `Timestamp` reveal the temporal patterns and fluctuations in the eye-tracking data. Key observations include:

- **Gaze Coordinates (`ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`):** These plots show the changes in gaze position over time. We can observe periods of relatively stable gaze interspersed with rapid movements (saccades), as well as blinks or other events where the gaze data is invalid (until they are handled, the `-1` placeholders appear as abrupt drops to -1 in the line plot).
- **Pupil Size (`ET_PupilLeft`, `ET_PupilRight`):** The pupil size plots show variations over time. The presence of many -1 values is evident as flat lines at the bottom of the plot, indicating periods where pupil data was not recorded or was invalid.
- **Time Signal (`ET_TimeSignal`):** This plot shows a steady, increasing trend, as expected for a time-based signal.
- **Distance and Camera Position (`ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, `ET_CameraRightY`):** These plots show how the distance from the eye tracker and the camera positions change over time. Variations in these features can be related to head movements or changes in the user's position relative to the eye tracker.
- **Validity (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`):** These plots clearly show periods of invalid data (represented by 1) as spikes or plateaus, corresponding to instances where the eye tracker lost track of the eyes or the pupil data was marked as invalid.

Analyzing these time series plots is crucial for understanding the dynamics of the eye-tracking data and identifying patterns or anomalies that may require further investigation or specific handling during subsequent analysis.

In [None]:
# Select all numeric columns for the boxplots
numeric_cols = df_10_EYE.select_dtypes(include=np.number).columns

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(df_10_EYE[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

# Observations from Boxplots and Handling -1 Values

The boxplots provide a visual summary of the distribution and potential outliers for each numeric column. Key observations from the boxplots include:

- The boxplots for columns like `ET_GazeLeftx`, `ET_GazeLefty`, `ET_GazeRightx`, `ET_GazeRighty`, `ET_PupilLeft`, `ET_PupilRight`, `ET_DistanceLeft`, `ET_DistanceRight`, `ET_CameraLeftX`, `ET_CameraLeftY`, `ET_CameraRightX`, and `ET_CameraRightY` clearly show the presence of -1 values as significant outliers, confirming our earlier observations from the heatmaps and histograms.
- The boxplots for the validity columns (`ET_ValidityLeft`, `ET_ValidityRight`, `ET_PupilLeft_validity`, `ET_PupilRight_validity`) show the discrete nature of these features, with the majority of data points at 0 (valid) and a smaller number at 1 (invalid).

Because -1 values, which mark invalid or missing samples, are so common (especially in the pupil-related columns), we replace them with NaN so that they are treated as missing data, and then impute them with the mean of each respective column. This keeps the row structure intact and allows further analysis or modeling without the distortion caused by the -1 placeholders.
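
The two steps described above can be sketched on a small toy frame. The values below are made up for illustration and the frame name `toy` is hypothetical; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two eye-tracking columns; values are invented for illustration.
toy = pd.DataFrame({
    "ET_PupilLeft": [3.0, -1.0, 3.4, -1.0, 3.2],
    "ET_PupilRight": [3.1, 3.3, -1.0, 3.5, 3.3],
})

# Step 1: mark the -1 placeholders as missing.
toy_nan = toy.replace(-1, np.nan)

# Step 2: fill each column's missing entries with that column's mean.
# DataFrame.mean() skips NaN by default, so the placeholders do not bias it.
toy_imputed = toy_nan.fillna(toy_nan.mean())

print(toy_imputed)
```

Note that `replace` acts on every column of the frame, so any column in which -1 is a legitimate measurement would be affected as well.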

In [None]:
df_10_EYE.replace({-1: np.nan}, inplace=True)

In [None]:
df_10_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].mean()

In [None]:
df_10_EYE[['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']].median()

In [None]:
numeric_cols = df_10_EYE.select_dtypes(include=np.number).columns

for col in numeric_cols:
    df_10_EYE[col] = df_10_EYE[col].fillna(df_10_EYE[col].mean())

In [None]:
plt.figure(figsize=(18, 8))

plt.subplot(1, 2, 1)
sns.heatmap(df_10_EYE.isnull(), cmap='viridis')
plt.title('Heatmap of Missing Values After Imputation')

plt.subplot(1, 2, 2)
sns.heatmap(df_10_EYE == 1, cmap='viridis')
plt.title('Heatmap of 1 Values')

plt.tight_layout()
plt.show()

# Handling Missing Values (Imputation)

As decided, we have replaced all the `-1` values with `NaN` to treat them as missing data. Subsequently, we have imputed these `NaN` values with the mean of their respective columns. The heatmap above, which was generated after the imputation, now shows no visible signs of `NaN` values, indicating that the imputation was successful.
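
A heatmap can hide a handful of isolated missing cells in a long recording, so a numeric check is a useful complement. The sketch below counts remaining `NaN` and `-1` values per numeric column; `remaining_placeholders` is a hypothetical helper, and the toy frame uses invented values.

```python
import numpy as np
import pandas as pd

def remaining_placeholders(df: pd.DataFrame, placeholder: float = -1) -> pd.DataFrame:
    """Count NaN and placeholder values per numeric column (hypothetical helper)."""
    numeric = df.select_dtypes(include=np.number)
    return pd.DataFrame({
        "n_nan": numeric.isna().sum(),
        "n_placeholder": (numeric == placeholder).sum(),
    })

# Toy frame with one NaN and one -1 left in it (invented values).
toy = pd.DataFrame({
    "ET_PupilLeft": [3.0, np.nan, 3.4],
    "ET_GazeLeftx": [512.0, -1.0, 530.0],
})
report = remaining_placeholders(toy)
print(report)
```

On a fully imputed frame, both columns of the report should be zero everywhere.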

In [None]:
df_10_EYE.head()

In [None]:
# Select only the numeric columns for plotting histograms, excluding time-related columns
numeric_cols = df_10_EYE.select_dtypes(include=np.number).columns
cols_to_plot = [col for col in numeric_cols if col not in ['UnixTime']]

# Calculate the number of rows and columns for the grid
n_cols = 4  # You can adjust the number of columns as needed
n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols

plt.figure(figsize=(n_cols * 5, n_rows * 4)) # Adjust figure size as needed

for i, col in enumerate(cols_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df_10_EYE[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Observations from Histograms After Imputation

The histograms generated after replacing the -1 values with the mean of each column show the distributions of the numeric features with the missing data handled. Key observations from these updated histograms include:

- The distinct peaks at -1, which were prominent in the histograms for several columns (e.g., pupil size, gaze coordinates, distance, and camera position) before imputation, are now replaced by a peak at the mean of each respective column.
- The distributions in many columns now appear more unimodal or show shifted modes compared to the original histograms.
- The histograms for the validity columns still show their bimodal distributions with peaks at 0 and 1, as these were handled separately.

These histograms provide an updated view of the data's distribution after handling the missing values, highlighting the impact of the imputation method on the data's characteristics.
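
The effect of mean imputation on a distribution can be quantified on synthetic data. The sketch below is not run on the notebook's data: the normal distribution and the 30% missing rate are arbitrary assumptions chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic pupil-size-like values; roughly 30% are then marked missing at random.
values = pd.Series(rng.normal(loc=3.2, scale=0.4, size=1000))
missing = rng.random(1000) < 0.3
observed = values.mask(missing)

# Mean imputation, as applied in this notebook.
imputed = observed.fillna(observed.mean())

# The mean is preserved, but the standard deviation shrinks because every
# imputed sample sits exactly at the mean.
print("mean before/after:", observed.mean(), imputed.mean())
print("std  before/after:", observed.std(), imputed.std())
print("share of samples at the mean:", (imputed == observed.mean()).mean())
```

The reduced spread and the spike at the mean are what the updated histograms show as a new peak.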

In [None]:
cols = ['ET_GazeLeftx', 'ET_GazeLefty',
       'ET_GazeRightx', 'ET_GazeRighty', 'ET_PupilLeft', 'ET_PupilRight',
       'ET_TimeSignal', 'ET_DistanceLeft', 'ET_DistanceRight',
       'ET_CameraLeftX', 'ET_CameraLeftY', 'ET_CameraRightX',
       'ET_CameraRightY', 'ET_ValidityLeft', 'ET_ValidityRight',
       'ET_PupilLeft_validity', 'ET_PupilRight_validity']

In [None]:
from IPython.display import Markdown, display

for col in cols:
    # Render a markdown heading before each plot for better separation and labeling
    display(Markdown(f'### {col} over Time'))
    plt.figure(figsize=(16, 10))
    plt.plot(df_10_EYE['Timestamp'], df_10_EYE[col])
    plt.xlabel("Timestamp") # Add x-axis label
    plt.ylabel(col) # Add y-axis label
    plt.show()

# Observations from Time Series Plots After Imputation

The line plots generated after imputing the missing values with the mean show the temporal patterns of the features with the missing data handled. Key observations from these updated plots include:

- The gaps or flat lines at -1, which were prominent in the plots for columns like gaze coordinates, pupil size, distance, and camera position, are now filled by lines at the mean value of the respective columns.
- The plots for the validity columns remain the same as they were handled separately.
- The `ET_TimeSignal` plot still shows a steady increasing trend, as expected.
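
To tell which flat segments in these plots come from imputation, it helps to record where the `-1` values were before filling them. The following is a minimal sketch on an invented toy series that measures the length of each contiguous run of placeholder samples; the variable names are hypothetical.

```python
import pandas as pd

# Toy signal with -1 placeholders (invented values).
signal = pd.Series([3.1, -1, -1, 3.3, 3.2, -1, 3.4, -1, -1, -1])

# Record the mask before imputing; afterwards the information is lost.
was_missing = signal.eq(-1)

# A new run starts wherever the mask changes value.
run_id = was_missing.astype(int).diff().ne(0).cumsum()

# Length of each run of consecutive missing samples, in order of appearance.
run_lengths = was_missing[was_missing].groupby(run_id[was_missing]).size()

print(run_lengths.tolist())
```

Long runs correspond to the extended flat stretches at the mean, while runs of one or two samples are barely visible in the line plots.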

In [None]:
plt.figure(figsize=(16, 10))
sns.heatmap(df_10_EYE.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()

# Observations from Correlation Heatmap

The correlation heatmap provides a visual representation of the pairwise correlations between the numeric columns in the dataset. Key observations from the heatmap include:

- **High Positive Correlations:** We observe strong positive correlations (values close to 1) between:
  - `ET_GazeLeftx` and `ET_GazeRightx`: This is expected as the gaze positions of both eyes should be highly correlated when fixating on a point.
  - `ET_GazeLefty` and `ET_GazeRighty`: Similar to the x-coordinates, the y-coordinates of gaze should also be highly correlated.
  - `ET_PupilLeft` and `ET_PupilRight`: Pupil sizes of both eyes tend to change together in response to light and cognitive load.
  - `ET_DistanceLeft` and `ET_DistanceRight`: The distance from the eye tracker to each eye should be highly correlated.
  - `ET_CameraLeftX` and `ET_CameraRightX`, `ET_CameraLeftY` and `ET_CameraRightY`: The camera positions for both eyes are also expected to be highly correlated.
  - `UnixTime` and `ET_TimeSignal`: As previously noted, these two columns are almost perfectly linearly correlated, indicating redundancy.
  - `ET_ValidityLeft` and `ET_PupilLeft_validity`: There is a positive correlation, suggesting that when the overall left eye data is invalid, the left pupil data is also likely to be invalid.
  - `ET_ValidityRight` and `ET_PupilRight_validity`: Similar to the left eye, there is a positive correlation between the overall right eye validity and the right pupil validity.
- **Other Correlations:** We can also observe other varying degrees of correlations between different features, which can provide insights into the relationships between gaze behavior, pupil size, distance, and camera position. For example, there might be correlations between gaze coordinates and camera positions, reflecting head movements.
- **Low or Near-Zero Correlations:** Columns with low or near-zero correlations have little linear relationship with each other. This does not by itself rule out a nonlinear dependence.

Understanding these correlations matters for feature selection and modeling: highly correlated features can cause multicollinearity, while the remaining correlations can reveal underlying patterns in the data.
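
Reading strongly correlated pairs off an annotated heatmap becomes tedious with many columns. As a sketch, the pairs can be listed programmatically. `high_corr_pairs` and the 0.9 threshold are assumptions for illustration, and the toy data is synthetic.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # Comparisons with NaN are False, so masked cells are dropped here too.
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic example: the right-eye x coordinate tracks the left-eye one closely.
rng = np.random.default_rng(0)
left_x = rng.normal(size=200)
toy = pd.DataFrame({
    "ET_GazeLeftx": left_x,
    "ET_GazeRightx": left_x + rng.normal(scale=0.05, size=200),
    "ET_PupilLeft": rng.normal(size=200),
})
result = high_corr_pairs(toy)
print(result)
```

Applied to the real frame, the same helper would be called as `high_corr_pairs(df_10_EYE)`.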

# Analysis of ET_TimeSignal and Decision to Drop

As observed in the time series plot and confirmed by the correlation heatmap, the `ET_TimeSignal` column exhibits a near-perfect linear relationship with both the `Timestamp` and `UnixTime` columns. This strong correlation (close to 1) suggests that `ET_TimeSignal` is essentially redundant and likely represents another form of time recording or a signal directly derived from the timestamp.

Including highly correlated features like this in a dataset can lead to issues such as multicollinearity in some statistical models, which can make it difficult to interpret the individual impact of each feature. Since the `Timestamp` column already provides the necessary temporal information, retaining `ET_TimeSignal` does not appear to add significant value for further analysis or modeling in most cases.

Therefore, based on its high correlation and lack of unique insight, we will proceed to drop the `ET_TimeSignal` column to simplify the dataset and potentially improve the performance and interpretability of future analyses.
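
The redundancy argument can be made explicit by checking the correlation before dropping the column. This is a sketch on synthetic data: the 0.999 cutoff is an arbitrary assumption, and the toy `UnixTime` and `ET_TimeSignal` values are invented to mimic two clocks that advance together.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ticks = 8.0 * np.arange(1000)

# Two time-like columns that advance together, plus an unrelated feature.
toy = pd.DataFrame({
    "UnixTime": 1_000_000.0 + ticks,
    "ET_TimeSignal": 250.0 + ticks + rng.normal(scale=0.5, size=1000),
    "ET_PupilLeft": rng.normal(3.2, 0.4, size=1000),
})

# Pearson correlation between the two time columns.
r = toy["UnixTime"].corr(toy["ET_TimeSignal"])
print(f"correlation: {r:.6f}")

# Drop the column only if it is (almost) a linear copy of UnixTime.
if r > 0.999:
    toy = toy.drop(columns="ET_TimeSignal")

print(toy.columns.tolist())
```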

In [None]:
df_10_EYE.drop('ET_TimeSignal', axis=1, inplace=True)

In [None]:
# sns.pairplot builds its own figure grid, so a preceding plt.figure() would only leave an empty figure
sns.pairplot(df_10_EYE)
plt.show()