# Exploratory Data Analysis (EDA)

## Table of Contents
1. [Dataset Overview](#dataset-overview)
2. [Handling Missing Values](#handling-missing-values)
3. [Feature Distributions](#feature-distributions)
4. [Possible Biases](#possible-biases)
5. [Correlations](#correlations)


# Data Description
The original dataset was distributed in MATLAB format and was converted to CSV for easier handling. The self-developed MATLAB export scripts are in the folder [Matlab-Export-Scripts](./MatlabExport).

- The data were recorded at 240 Hz.
- The NASA-TLX scores were recorded after each task.
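
The nominal recording rate can be checked against the recorded timestamps. The following is a minimal sketch, assuming the timestamp column (for example `pupil_timestamp`) is expressed in seconds; the helper name `estimate_sampling_rate` is illustrative and not part of the dataset or its export scripts. For the pupil data it is probably sensible to select a single `eye_id` first, since the samples of both eyes may be interleaved in one table.

```python
import numpy as np
import pandas as pd


def estimate_sampling_rate(timestamps: pd.Series) -> float:
    """Return the median sampling rate in Hz for timestamps given in seconds."""
    # Sort first so that out-of-order rows do not produce negative intervals.
    intervals = np.diff(np.sort(timestamps.to_numpy(dtype=float)))
    # Ignore duplicated timestamps, which would give a zero interval.
    intervals = intervals[intervals > 0]
    return float(1.0 / np.median(intervals))


# Synthetic check: 2 seconds of evenly spaced samples at 240 Hz.
synthetic = pd.Series(5437.0 + np.arange(480) / 240.0)
rate = estimate_sampling_rate(synthetic)
print(f"Estimated rate: {rate:.1f} Hz")
```

Using the median rather than the mean interval keeps the estimate robust against occasional dropped samples.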

The description of the original data is as follows:

<details>

<summary>Click to expand</summary>

---Title---

COLET: A Dataset for Cognitive Workload Estimation based on Eye-tracking

---Contributors---

Emmanouil Ktistakis,Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH) and the Laboratory of Optics and Vision, School of Medicine, University of Crete, GR-710 03 Heraklion, Greece
Vasileios Skaramagkas, Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH), GR-700 13 Heraklion, Crete, Greece, ORCID: 0000-0002-3279-8016
Dimitris Manousos, Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH), GR-700 13 Heraklion, Crete, Greece
Nikolaos S. Tachos, Department of Biomedical Research, Institute of Molecular Biology and Biotechnology, FORTH, GR-451 15, Ioannina, Greece and the Department of Materials Science and Engineering, Unit of Medical Technology and Intelligent Information Systems, University of Ioannina, GR-451 10, Ioannina, Greece
Evanthia Tripoliti, Department of Materials Science and Engineering, Unit of Medical Technology and Intelligent Information Systems, University of Ioannina, GR-451 10, Ioannina, Greece
Dimitrios I. Fotiadis, Department of Biomedical Research, Institute of Molecular Biology and Biotechnology, FORTH, GR-451 15, Ioannina, Greece and the Department of Materials Science and Engineering, Unit of Medical Technology and Intelligent Information Systems, University of Ioannina, GR-451 10, Ioannina, Greece
Manolis Tsiknakis, Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH) and the Department of Electrical and Computer Engineering, Hellenic Mediterranean University, GR-710 04 Heraklion, Crete, Greece


---Corresponding Authors---

Vasileios Skaramagkas, vskaramagkas96@gmail.com
Emmanouil Ktistakis, mankti@ics.forth.gr


---DOI---

10.2139/ssrn.4059768 (temporary)

---database Description---

Database including eye movements from 47 participants as they solved puzzles involving visual search tasks of varying complexity and duration.
Participants rated their performance based on NASA RTLX index scale.

The uploaded files include:
1) "data.mat": the database in Matlab cell and struct-type format.
2) "tasks" folder: A folder containing the 21 tasks shown.
3) "readme.txt": a text file providing more information on the database structure.


Part of the dataset was used in the work accepted for publication in: 10.2139/ssrn.4059768


Please refer to the above mentioned article for more information on the data acquisition protocol.


---database Structure---

"data.mat": A matlab workspace file containing 'data' 1x47 cell-vector, which expands to
47 structs, each containing information and recordings from a single subject involved in the 4 tasks.

Version 1:

Each data struct contains the following three fields:

a) data{#}.task{i}

where i = [1,4] and is the task number (please refer to the respective publication for more info on the tasks)

Struct including the following subject information in struct format:

task{i}.gaze --> Gaze related metrics recorded from the eye tracker for each of the 4 total tasks (gaze_timestamp, world_index, confidence, norm_pos_x, norm_pos_y, base_data, gaze_point_3d_x, 
									gaze_point_3d_y, gaze_point_3d_z, eye_center0_3d_x, eye_center0_3d_y, eye_center0_3d_z, gaze_normal0_x, gaze_normal0_y, gaze_normal0_z, 
									eye_center1_3d_x, eye_center1_3d_y, eye_center1_3d_z, gaze_normal1_x, gaze_normal1_y, gaze_normal1_z)

task{i}.pupil --> Pupil related metrics recorded from the eye tracker for each of the 4 total tasks (pupil_timestamp, world_index, eye_id, confidence, norm_pos_x, norm_pos_y, diameter, method,
 									ellipse_center_x, ellipse_center_y, ellipse_axis_a, ellipse_axis_b, ellipse_angle, diameter_3d, model_confidence, model_id, sphere_center_x, 
									sphere_center_y, sphere_center_z, sphere_radius, circle_3d_center_x, circle_3d_center_y, circle_3d_center_z, circle_3d_normal_x, circle_3d_normal_y, 
									circle_3d_normal_z, circle_3d_radius, theta, phi, projected_sphere_center_x,projected_sphere_center_y, projected_sphere_axis_a, projected_sphere_axis_b, 
									projected_sphere_angle)

task{i}.blinks --> Blink related metrics recorded from the eye tracker for each of the 4 total tasks (id, start_timestamp, duration, end_timestamp, start_frame_index, index, end_frame_index, confidence,
								        filter_response, base_data)

task{i}.annotation --> NASA RTLX scores for each of the 4 total tasks

For more info regarding the recordings from Pupil Core visit: https://docs.pupil-labs.com/core/software/pupil-capture/


b) data{#}.subject_info

Struct including general subject information:

1) Visual acuity: measured binocularly in distance of 0.80cm (logMAR)
2) Gender ('F': Female, 'M': Male)  
3) Age (years) 
4) Education level (years)


c) "images" folder: Folder containing the images used in the experiment

1 --> bowling_balls
2 --> candles
3 --> chandelier
4 --> classroom
5 --> garage
6 --> handles
7 --> kitchen
8 --> light
9 --> paintings_1
10 --> paintings_2
11 --> pc_screens
12 --> pillows
13 --> poof
14 --> pool
15 --> seats_1
16 --> seats_2
17 --> shoes
18 --> students
19 --> towels
20 --> water
21 --> windows

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
For more information, please refer to the following article:

Ktistakis, E., Skaramagkas, V., Manousos, D., Tachos, N. S., Tripoliti, E., Fotiadis, D. I., & Tsiknakis, M. (2022). Colet: A dataset for cognitive workload estimation based on eye-tracking. Computer Methods and Programs in Biomedicine, 106989. https://doi.org/10.1016/j.cmpb.2022.106989

Last Update: 2022/06/07
 
</details>

# New Data Format

The new data structure is as follows:
- Participants.csv: Contains the participant information with some demographic data and a unique participant ID.
- Participant_X
    - Test1: 5 Images, no time constraint, no secondary task.
        - Participant_X_Annotations_1.csv: Contains the annotations for the test.
        - Participant_X_Blinks_1.csv: Contains the blinks data for the test.
        - Participant_X_Gaze_1.csv: Contains the gaze data for the test.
        - Participant_X_Pupil_1.csv: Contains the pupil data for the test.
    - Test2: 5 images, with time constraint, no secondary task.
    - Test3: 5 images, with time constraint, with secondary task.
    - Test4: 5 images, no time constraint, with secondary task.
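
The folder layout above can be turned into file paths programmatically. This is a minimal sketch, assuming the naming pattern shown for Test1 also holds for Test2 to Test4; the helper `recording_paths` and the constant `RECORDING_TYPES` are hypothetical names introduced only for illustration.

```python
from pathlib import Path

# The four recording types stored per test, as listed in the layout above.
RECORDING_TYPES = ("Annotations", "Blinks", "Gaze", "Pupil")


def recording_paths(data_root, participant: int, test: int) -> dict:
    """Map each recording type to its CSV path for one participant and test."""
    test_dir = Path(data_root) / f"Participant_{participant}" / f"Test{test}"
    return {
        kind: test_dir / f"Participant_{participant}_{kind}_{test}.csv"
        for kind in RECORDING_TYPES
    }


paths = recording_paths("../Data", participant=1, test=1)
for kind, path in paths.items():
    print(kind, path.as_posix())
```

Each path can then be passed to `pd.read_csv`, which makes it straightforward to loop over all participants and tests.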

In [11]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# Load Data

In [7]:
# Variables
data_path = '../Data/'

# Load data
participant_list = pd.read_csv(data_path + 'participants.csv')
print(participant_list.head())

   ID  VisualAcuity_logMAR_ Gender  Age  Education
0   1                 -0.04      F   28         18
1   2                 -0.10      F   28         18
2   3                 -0.08      F   38         16
3   4                 -0.07      F   29         18
4   5                 -0.15      M   30         18


In [12]:

participant_number = 1
test_number = 1

# Build the folder and the common file-name prefix for the selected participant and test
path_to_participant = f'{data_path}Participant_{participant_number}'
path_to_test = f'{path_to_participant}/Test{test_number}/'
file_prefix = f'{path_to_test}Participant_{participant_number}'
path_to_annotations = f'{file_prefix}_Annotations_{test_number}.csv'
path_to_blinks = f'{file_prefix}_Blinks_{test_number}.csv'
path_to_gaze = f'{file_prefix}_Gaze_{test_number}.csv'
path_to_pupil = f'{file_prefix}_Pupil_{test_number}.csv'

# Load the four recordings of this test
test1_participant1_annotations = pd.read_csv(path_to_annotations)
test1_participant1_blinks = pd.read_csv(path_to_blinks)
test1_participant1_gaze = pd.read_csv(path_to_gaze)
test1_participant1_pupil = pd.read_csv(path_to_pupil)


In [10]:
print(test1_participant1_blinks.head())

   id  start_timestamp  duration  end_timestamp  start_frame_index  index  \
0   1      5437.625617  0.236131    5437.861748                 37     40   
1   2      5444.161561  0.180073    5444.341634                231    233   

   end_frame_index  confidence  \
0               44    0.703872   
1              236    0.553669   

                                     filter_response  \
0  0.5068225043614704 0.5512669488059149 0.595711...   
1  0.5048782729116744 0.5493227173561188 0.592878...   

                                           base_data  
0  5437.625617 5437.629564 5437.633621 5437.63786...  
1  5444.161561 5444.165642 5444.169793 5444.17554...  
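
In the preview above, the `filter_response` and `base_data` columns appear to hold several numbers packed into a single string. A minimal sketch for unpacking them, assuming the values are separated by whitespace as the preview suggests; `parse_float_list` is a hypothetical helper and the sample values are shortened for illustration.

```python
import numpy as np
import pandas as pd


def parse_float_list(cell) -> np.ndarray:
    """Convert a whitespace-separated string of numbers into a float array."""
    return np.array(str(cell).split(), dtype=float)


# Small stand-in for the blinks table with shortened, illustrative values.
blinks = pd.DataFrame({
    "id": [1, 2],
    "filter_response": ["0.5068 0.5513 0.5957", "0.5049 0.5493"],
})
blinks["filter_response_values"] = blinks["filter_response"].apply(parse_float_list)
blinks["n_filter_samples"] = blinks["filter_response_values"].apply(len)
print(blinks[["id", "n_filter_samples"]])
```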


## Dataset Overview

[Provide a high-level overview of the dataset. This should include the source of the dataset, the number of samples, the number of features, and example showing the structure of the dataset.]


In [None]:

# Load the data
# Replace 'your_dataset.csv' with the path to your actual dataset
df = pd.read_csv('your_dataset.csv')

# Number of samples
num_samples = df.shape[0]

# Number of features
num_features = df.shape[1]

# Display these dataset characteristics
print(f"Number of samples: {num_samples}")
print(f"Number of features: {num_features}")

# Display the first few rows of the dataframe to show the structure
print("Example data:")
print(df.head())
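
Because this dataset is split into several files per participant and test, a single `df.shape` does not describe it fully. One possible approach, sketched here with synthetic stand-in frames, is to collect the sample and feature counts of each recording into one table; `overview_table` is a hypothetical helper name.

```python
import pandas as pd


def overview_table(frames: dict) -> pd.DataFrame:
    """Summarize row and column counts for a dict of named DataFrames."""
    rows = [
        {"recording": name, "n_samples": frame.shape[0], "n_features": frame.shape[1]}
        for name, frame in frames.items()
    ]
    return pd.DataFrame(rows).set_index("recording")


# Synthetic stand-ins; in the notebook these would be the frames loaded earlier,
# e.g. {"blinks": test1_participant1_blinks, "gaze": test1_participant1_gaze}.
example_frames = {
    "blinks": pd.DataFrame({"id": [1, 2], "duration": [0.24, 0.18]}),
    "gaze": pd.DataFrame({"confidence": [0.9, 0.8, 0.7]}),
}
overview = overview_table(example_frames)
print(overview)
```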



## Handling Missing Values

[Identify any missing values in the dataset, and describe your approach to handle them if there are any. If there are no missing values simply indicate that there are none.]


In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values


In [None]:
# Handling missing values
# Example: Replacing NaN values with the mean value of the column
# df.fillna(df.mean(numeric_only=True), inplace=True)  # numeric_only skips text columns

# Your code for handling missing values goes here
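
Besides explicit NaN values, the `confidence` column listed in the dataset description offers another way to flag unreliable eye-tracking samples. The sketch below masks low-confidence values and fills short gaps by linear interpolation. The threshold of 0.6, the gap limit of 10 samples, and the helper name `mask_low_confidence` are assumptions chosen for illustration, not values taken from the dataset documentation.

```python
import numpy as np
import pandas as pd


def mask_low_confidence(df, value_col, threshold=0.6, confidence_col="confidence"):
    """Return a copy where values with confidence below the threshold become NaN."""
    cleaned = df.copy()
    cleaned[value_col] = cleaned[value_col].astype(float)
    cleaned.loc[cleaned[confidence_col] < threshold, value_col] = np.nan
    return cleaned


# Synthetic pupil samples; the column names follow the dataset description.
pupil = pd.DataFrame({
    "confidence": [0.95, 0.20, 0.90, 0.99],
    "diameter": [40.0, 5.0, 44.0, 46.0],
})
cleaned = mask_low_confidence(pupil, "diameter")
n_missing = int(cleaned["diameter"].isna().sum())
# Linear interpolation fills short gaps from the neighbouring valid samples.
cleaned["diameter"] = cleaned["diameter"].interpolate(method="linear", limit=10)
print(n_missing, cleaned["diameter"].tolist())
```

For time series such as pupil diameter, interpolating between neighbours is usually a better fit than replacing gaps with the column mean, which would ignore the temporal order of the samples.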


## Feature Distributions

[Plot the distribution of various features and target variables. Comment on the skewness, outliers, or any other observations.]


In [None]:
# Example: Plotting histograms of all numerical features
df.hist(figsize=(12, 12))
plt.show()
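
To back up visual impressions from the histograms with numbers, skewness and a simple outlier count can be computed per column. This is a minimal sketch using the common 1.5 × IQR rule; the helper `distribution_summary` and the blink durations below are synthetic and purely illustrative.

```python
import pandas as pd


def distribution_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Report skewness and the number of IQR outliers for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # A value counts as an outlier if it lies more than 1.5 * IQR outside the quartiles.
    is_outlier = (numeric < (q1 - 1.5 * iqr)) | (numeric > (q3 + 1.5 * iqr))
    return pd.DataFrame({"skewness": numeric.skew(), "n_outliers": is_outlier.sum()})


# Synthetic blink durations in seconds with one unusually long blink.
example = pd.DataFrame({"duration": [0.18, 0.20, 0.22, 0.24, 0.21, 1.50]})
summary = distribution_summary(example)
print(summary)
```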


## Possible Biases

[Investigate the dataset for any biases that could affect the model’s performance and fairness (e.g., class imbalance, historical biases).]


In [None]:
# Example: Checking for class imbalance in a classification problem
# sns.countplot(x='target_variable', data=df)

# Your code to investigate possible biases goes here
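
For this dataset, one natural starting point is the demographic balance of the participants. The sketch below uses only the first five rows shown in the participant preview in "Load Data", so its numbers are not representative of the full table; with the real data, `participants` would be the frame read from the participant CSV.

```python
import pandas as pd

# First five rows of the participant table, copied from the preview in "Load Data".
participants = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Gender": ["F", "F", "F", "F", "M"],
    "Age": [28, 28, 38, 29, 30],
})

# Share of each gender among the listed participants.
gender_share = participants["Gender"].value_counts(normalize=True)
# Age statistics per gender to spot groups with few or unusually distributed members.
age_by_gender = participants.groupby("Gender")["Age"].agg(["count", "mean", "min", "max"])
print(gender_share)
print(age_by_gender)
```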


## Correlations

[Explore correlations between features and the target variable, as well as among features themselves.]


In [None]:
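# Example: Plotting a heatmap to show feature correlations
# numeric_only=True skips non-numeric columns, which would otherwise raise an error in recent pandas versions
correlation_matrix = df.corr(numeric_only=True)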
sns.heatmap(correlation_matrix, annot=True)
plt.show()
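
When the target is a questionnaire score such as the NASA RTLX rating, a rank-based correlation may be more appropriate than the default Pearson coefficient. The following sketch assumes a hypothetical per-test feature table; all column names and values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical per-test feature table: one row per participant and test.
# All column names and values here are invented for illustration only.
features = pd.DataFrame({
    "blink_count": [12, 15, 9, 20, 7],
    "mean_pupil_diameter": [41.0, 43.5, 39.8, 46.2, 38.1],
    "rtlx_score": [35.0, 48.0, 30.0, 62.0, 22.0],
})

# Spearman's rank correlation only assumes a monotonic relationship.
spearman = features.corr(method="spearman")
# Correlation of every feature with the workload score, strongest first.
workload_corr = spearman["rtlx_score"].drop("rtlx_score").sort_values(ascending=False)
print(workload_corr)
```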
