In [None]:
import pandas as pd
import numpy as np
import math
from typing import Literal, Sequence, Tuple
import matplotlib.pyplot as plt

# Question 1: Handling the data

Last session we recorded gaze data using different devices. Today we are going to look at how to analyze that data. The very first, **absolutely essential** step that **must never be left out in any data science project** is to check data quality.

We will perform the following steps:
- Load and parse the recorded data of all groups.
- Perform a basic data quality analysis for all datasets:
  - What is the sampling rate (Hz)?
  - Is the sampling rate constant?
  - Are all recordings of the same (or similar) length? If not, should they?
  - Calculate the tracking rate, i.e., the ratio of valid gaze samples.
  - Calculate spatial accuracy and precision.
- Are there differences between groups? What is the source of these differences?

In [None]:
# Run this line to download the data to your Colab runtime environment.
!wget https://raw.githubusercontent.com/kueblert/gazeinteraction/master/yarbus_recordings.zip
!unzip yarbus_recordings.zip

**Question 1.b. (2 points)** Unify the data format. Later on, we want to use the same routines to process data from all recordings. Therefore, for each data format, write a class that inherits from the following class and implements its methods.

Use the other groups' readme files to interpret the data. In case it is insufficient or hard to understand, please let them know and get the missing info from them (**in a friendly way!**). Not all devices report binocular data. In that case, just use the same data for each eye.
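
As a rough illustration of what such a reader could look like, here is a minimal sketch for a made-up monocular CSV format. The class name `MonocularExampleDataset` and the column names (`t_ms`, `gx`, `gy`, `valid`, `stimulus`) are assumptions for this example only; your groups' files will differ, and in the notebook the class would inherit from the `GazeDataset` base class defined in the next cell.

```python
import io
import pandas as pd


class MonocularExampleDataset:
    """Sketch of a reader for a hypothetical monocular CSV format.

    In the notebook this class would inherit from GazeDataset. The column
    names used here are invented for illustration.
    """

    def __init__(self, filename):
        # pd.read_csv accepts a path as well as a file-like object
        self.df = pd.read_csv(filename)

    def x(self, eye):
        # Monocular device: the same column is returned for both eyes
        return self.df["gx"]

    def y(self, eye):
        return self.df["gy"]

    def timestamp(self):
        # Assumed to be in milliseconds already
        return self.df["t_ms"]

    def validity(self, eye):
        return self.df["valid"].astype(bool)

    def stimulus(self):
        return self.df["stimulus"]


# Tiny in-memory recording instead of a real file
csv_text = "t_ms,gx,gy,valid,stimulus\n0,100,200,1,cross\n16,101,199,0,cross\n"
example = MonocularExampleDataset(io.StringIO(csv_text))
```

Note that the methods in this sketch return single columns (`pd.Series`); whether you return a series or a one-column data frame is a design choice, as long as all of your readers behave the same way.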

In [None]:
class GazeDataset:
  def __init__(self, filename: str):
    # load a single recording from a file
    pass

  def x(self, eye: Literal["left", "right"]) -> pd.DataFrame:
    # the horizontal gaze coordinate on screen [pixels] for the left/right eye
    pass

  def y(self, eye: Literal["left", "right"]) -> pd.DataFrame:
    # the vertical gaze coordinate on screen [pixels] for the left/right eye
    pass

  def timestamp(self) -> pd.DataFrame:
    # the timestamp in milliseconds
    pass

  def validity(self, eye: Literal["left", "right"]) -> pd.DataFrame:
    # whether a sample (of the left/right eye, if distinguishable) is valid or not
    pass

  def stimulus(self) -> pd.DataFrame:
    # which image / task was displayed to the subject
    pass

In [None]:
# Your code goes here

In [None]:
# Helper code to apply a specific function f to all data within the dataset
# You will need to adjust paths and class names, but should be ready to work with otherwise.

dataset_paths = ["group1", "group2", "group3"]
class_types = [GazeDatasetG1, GazeDatasetG2, GazeDatasetG3]

# probably no changes necessary from here on
import glob

def apply_to_datasets(f):
  results = []
  # iterate over all groups
  for dataset_path, dataset_class in zip(dataset_paths, class_types):
    group = []
    # iterate over all recordings by that group
    for recording_filename in glob.glob("yarbus_recordings/%s/*.csv" % dataset_path):
      # parse the dataset
      dataset = dataset_class(recording_filename)
      # apply function to dataset
      group.append(f(dataset))
    results.append(group)
  return results

def print_length(d: GazeDataset):
  print(len(d.timestamp()))

apply_to_datasets(print_length)

# Question 2: Sampling rate

The sampling rate of an eye tracker is important for knowing what type of events you can measure and which event detection algorithm is appropriate. A rule of thumb is the *Nyquist-Shannon sampling theorem* (Shannon, 1949), which states that the sampling frequency must be at least twice the highest frequency present in the signal, here the eye movement of interest.

However, we cannot simply infer the sampling rate from the device type: devices often support multiple settings, and performance bottlenecks in the recording hardware might have an effect as well.

**Question 2.a. (1 point)** In the following code snippets, determine the average duration (in milliseconds) between data samples.

In [None]:
# Function for calculating the average time (ms) between samples
def mean_duration_between_samples(data: GazeDataset) -> float:
  pass

**Question 2.b. (1 point)** In the following code snippets, determine the sampling frequency, i.e., the number of samples per second that the device recorded (in Hz), of the eye tracker.
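
The relationship between the two quantities can be sketched on a plain array of timestamps. The helper names `mean_interval_ms` and `rate_hz` and the synthetic timestamps are made up for this example; it assumes timestamps in milliseconds, as in the `GazeDataset` interface.

```python
import numpy as np


def mean_interval_ms(timestamps_ms):
    """Average time between consecutive samples, in milliseconds."""
    t = np.asarray(timestamps_ms, dtype=float)
    return float(np.mean(np.diff(t)))


def rate_hz(interval_ms):
    """Samples per second for a given inter-sample interval in milliseconds."""
    return 1000.0 / interval_ms


# Synthetic recording: one sample every 4 ms
timestamps = np.array([0.0, 4.0, 8.0, 12.0, 16.0])
interval = mean_interval_ms(timestamps)  # 4.0 ms
rate = rate_hz(interval)                 # 250.0 Hz
```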

In [None]:
# Function for calculating the sampling rate (samples per second)
def sampling_rate(time_between_samples: float) -> int:
  pass

**Question 2.c. (1 point)** Is the sampling rate (reasonably) constant within each group?

*Why we check this: Data is often recorded with more than one device. Because the sampling rate is a manual setting, it might be configured differently on each recording device. Perhaps one recording laptop failed in the middle of the experiment and the settings had to be reconfigured on a second device, so some of the data is sampled differently. You need to know that!*
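
One possible way to look at constancy is to summarize the inter-sample intervals rather than only their mean. The sketch below is an illustration on synthetic timestamps; the helper name `interval_stats` and the 1.5x-median threshold for flagging gaps are assumptions, not a fixed convention.

```python
import numpy as np


def interval_stats(timestamps_ms):
    """Return (median interval, standard deviation, number of suspicious gaps)."""
    intervals = np.diff(np.asarray(timestamps_ms, dtype=float))
    median = float(np.median(intervals))
    std = float(np.std(intervals))
    # Intervals much longer than the median hint at dropped samples.
    # The factor 1.5 is an arbitrary choice for this sketch.
    n_gaps = int(np.sum(intervals > 1.5 * median))
    return median, std, n_gaps


steady = np.arange(0, 100, 10)             # perfectly regular, 10 ms apart
gappy = np.array([0, 10, 20, 50, 60, 70])  # two samples missing after 20 ms

steady_stats = interval_stats(steady)
gappy_stats = interval_stats(gappy)
```

The median is less sensitive to a few long gaps than the mean, so comparing the two can already reveal dropped samples.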

In [None]:
def sampling_rate_variance(data: GazeDataset) -> float:
  pass

**Question 2.d. (3 points):** For the next code snippet, write a function that resamples the data at a different sampling rate using linear interpolation. Resample the data to **60 Hz**.

*In case your data is sampled inconsistently, it is always possible to downsample to the lowest sampling rate in your data. Upsampling would mean making up new values and should be avoided.*
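
Linear interpolation onto a regular time grid can be sketched with `np.interp`. This example works on bare arrays, not on a `GazeDataset`; the helper name `resample_linear` and the synthetic signal are assumptions, and it ignores validity flags, which a real implementation would have to handle (for example by not interpolating across invalid samples).

```python
import numpy as np


def resample_linear(t_ms, values, target_rate_hz=60.0):
    """Linearly interpolate `values` onto a regular grid at `target_rate_hz`."""
    t_ms = np.asarray(t_ms, dtype=float)
    values = np.asarray(values, dtype=float)
    step_ms = 1000.0 / target_rate_hz
    # Regular grid from the first to the last original timestamp;
    # the small epsilon keeps the end point if it falls exactly on the grid
    new_t = np.arange(t_ms[0], t_ms[-1] + 1e-9, step_ms)
    new_values = np.interp(new_t, t_ms, values)
    return new_t, new_values


# Synthetic 250 Hz signal that grows linearly with time
t = np.arange(0.0, 101.0, 4.0)
x = 2.0 * t
new_t, new_x = resample_linear(t, x, target_rate_hz=60.0)
```

Because the synthetic signal is itself linear, the resampled values still satisfy `new_x == 2 * new_t` up to rounding, which makes for an easy sanity check.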

In [None]:
# Function for downsampling to another sampling rate
def resample(data: GazeDataset, target_sampling_rate: float = 60) -> GazeDataset:
  pass

**Question 2.e. (1 point):** Do you think 60 Hz is an adequate sampling rate for analyzing saccades? Support your answer with one sentence.

**Answer:** your answer here

## Question 3: Tracking ratio

**Question 3: (2 points)** Measurement devices might be unable to acquire a good measurement under some conditions (eyeglasses, lighting, headbox too small). In some cases the device knows that a measurement error has occurred and reports it. In other cases such a failure happens silently. Calculate the tracking rate, i.e., the ratio of reported valid samples to all samples.
Return the tracking rates for both eyes separately.
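
On a plain array of validity flags, the ratio is simply the mean of the boolean values. The helper name `valid_fraction` and the flags below are made up for illustration.

```python
import numpy as np


def valid_fraction(valid_flags):
    """Fraction of samples flagged as valid (NaN for an empty recording)."""
    flags = np.asarray(valid_flags, dtype=bool)
    return float(flags.mean()) if flags.size else float("nan")


# Synthetic validity flags for both eyes
left_valid = [True, True, False, True]
right_valid = [True, False, False, True]
ratios = (valid_fraction(left_valid), valid_fraction(right_valid))  # (0.75, 0.5)
```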

In [None]:
def tracking_ratio(data: GazeDataset) -> Tuple[float, float]:
  pass

- Are there differences between devices? Possibly even between groups using the same devices?
- What are possible causes of such tracking errors?
- Are there certain characteristics observable (e.g., are there many individual losses or are there segments where tracking failed for a longer time)? 
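
To tell many short losses apart from a few long ones, it can help to look at the lengths of consecutive invalid runs. The following is a sketch on a synthetic flag sequence; the helper name `invalid_run_lengths` is an assumption for this example.

```python
import numpy as np


def invalid_run_lengths(valid_flags):
    """Lengths of consecutive runs of invalid samples, in order of occurrence."""
    invalid = ~np.asarray(valid_flags, dtype=bool)
    # Pad with False so that every run has both a rising and a falling edge
    padded = np.concatenate(([False], invalid, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # index where a run begins
    ends = np.flatnonzero(edges == -1)    # index just after a run ends
    return (ends - starts).tolist()


# 1 = valid, 0 = invalid
flags = [1, 0, 1, 1, 0, 0, 0, 1, 0]
runs = invalid_run_lengths(flags)  # [1, 3, 1]
```

Multiplying the run lengths by the inter-sample interval gives loss durations in milliseconds; durations in the range of a few hundred milliseconds might be blinks, although that interpretation is a heuristic.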

## Question 3: Accuracy and precision

**Question 3.a. (3 points):** For the analysis of accuracy and precision we need to know where the subject is looking. Accuracy measures how well the eye tracker hits that point (on average). Precision measures how far the samples spread around that point.

We recorded such an episode during the fixation cross phase of the experiment.
Extract data segments during the first fixation cross phase.
Ideally, drop the first 200 ms of data. This phase might not contain a fixation towards the fixation cross just yet.

In [None]:
def extract_fixation_cross(data: GazeDataset) -> GazeDataset:
  # Extracts the time segment during which the first fixation cross was displayed
  # your code here
  pass


Calculate accuracy and precision, for each eye separately.
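
Several definitions of both measures are in use. The sketch below picks one plausible pair, in pixels: accuracy as the mean Euclidean distance between the samples and the target, and precision as the RMS distance of the samples from their own centroid. The helper names, the target position, and the sample values are assumptions for illustration; other common choices are the offset of the mean gaze position (accuracy) and the RMS of sample-to-sample distances (precision), often expressed in degrees of visual angle. Only valid samples should be passed in.

```python
import numpy as np


def accuracy_px(x, y, target):
    """Mean Euclidean distance between gaze samples and the target, in pixels."""
    dx = np.asarray(x, dtype=float) - target[0]
    dy = np.asarray(y, dtype=float) - target[1]
    return float(np.mean(np.hypot(dx, dy)))


def precision_px(x, y):
    """RMS distance of the samples from their own centroid, in pixels."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sqrt(np.mean(dx**2 + dy**2)))


# Synthetic fixation: a tight cluster centred 10 px to the right of the target
gx = [111.0, 109.0, 110.0, 110.0]
gy = [100.0, 100.0, 101.0, 99.0]
acc = accuracy_px(gx, gy, target=(100, 100))
prec = precision_px(gx, gy)
```

In this synthetic case the cluster is tight (precision of 1 px) but offset from the target (accuracy of roughly 10 px), which shows that the two measures are independent of each other.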

In [None]:
def calculate_accuracy(data: GazeDataset, target: Tuple[int, int]) -> Tuple[float, float]:
  # your code here
  pass

In [None]:
def calculate_precision(data: GazeDataset) -> Tuple[float, float]:
  # your code here
  pass

**Question 3.b.: (1 point)** Visualize the accuracy per group using a boxplot.

In [None]:
import matplotlib


Are there differences between both eyes? If so, what are possible explanations?

# Question 4: Visualize! (Bonus)
Visualization is probably the most powerful tool for spotting many potential problems with the data at a glance.

- Select a single dataset of your choice.
- Visualize gaze samples on the stimulus image. Be careful with where on the screen the stimulus was displayed.
- Visualize only valid data.
- Each sample should be marked by a small dot / colorful pixel.
- The stimulus should be visible in the background.