# Dataset Description

# Course Datasets 

In this course, you will work with **two real-world datasets**.  
Each dataset represents a different type of machine learning problem.  
Because the datasets differ in structure, we will use a **different model** for each one.

---

## Dataset A: Raw Eye-Tracking Data
The eye-tracking data were collected from multiple users for the task of **user identification**.  
The goal is to predict the **User ID** based solely on eye movement behavior.

Each user in the dataset has **two separate eye-tracking recordings**, corresponding to two different recording sessions. This design allows the model to learn user-specific eye movement patterns that generalize across sessions rather than memorizing a single recording.

---




### How Is the Eye-Tracking Data Represented?

Each recording is stored in its own text file, and the file name encodes two fields:

- `UserID` uniquely identifies a participant  
- `SessionID` indicates the recording session (1 or 2)

**Example:**
- `ID_240_1.txt` → User 240, session 1  
- `ID_240_2.txt` → User 240, session 2  
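
The naming scheme in the examples can be parsed programmatically. The following is a minimal sketch, assuming every file follows the `ID_<UserID>_<SessionID>.txt` pattern seen in the two examples; the helper name `parse_recording_name` is hypothetical and not part of the course code.

```python
import re

# Pattern inferred from the examples: ID_<UserID>_<SessionID>.txt
FILENAME_PATTERN = re.compile(r"^ID_(\d+)_(\d+)\.txt$")


def parse_recording_name(filename):
    """Return (user_id, session_id) as integers, or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_recording_name("ID_240_1.txt"))  # (240, 1)
print(parse_recording_name("ID_240_2.txt"))  # (240, 2)
print(parse_recording_name("notes.txt"))     # None
```

The user ID would serve as the prediction target, while the session ID could be used to keep the two recordings of a user apart, for example when checking whether a model generalizes across sessions.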

---


Each file contains **time-ordered eye-tracking samples**, where each row represents a single time step recorded by the eye tracker.

The raw measurements include gaze-related values such as:
- Horizontal and vertical gaze coordinates
- Eye movement dynamics
- Other low-level eye-tracking signals

These measurements capture the **temporal dynamics of visual behavior**.
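
One common way to describe temporal dynamics is the rate of change of gaze position over time. The sketch below is an illustration on synthetic data, not part of the course pipeline: the sampling rate `sampling_rate_hz` and the gaze trace are invented, since the real sampling rate is not stated here.

```python
import numpy as np

# Hypothetical values for illustration only; the real sampling rate is an assumption.
sampling_rate_hz = 1000.0
dt = 1.0 / sampling_rate_hz

# Synthetic horizontal gaze trace (degrees): a steady drift of 5 degrees per second.
t = np.arange(100) * dt
x_deg = 2.0 + 5.0 * t

# Finite-difference estimate of horizontal velocity in degrees per second.
x_velocity = np.gradient(x_deg, dt)

print(x_velocity[:3])
```

Because the velocity is computed from consecutive samples, it only makes sense when the rows stay in their recorded order.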

---

## Dataset B: Material Science Dataset

### How is the Material Science Data Represented?

Each material sample is stored as a **2D grid**, similar to an image.

- Each cell in the grid contains a numerical value  
- The value represents a physical or chemical property  
- Together, the grid shows a **pattern across the material**  

You can think of each sample as a **map of material properties**.

---

### Why Does Spatial Structure Matter?

In material science:
- Nearby regions affect each other  
- Local patterns influence overall material behavior  
- The **position** of each value is important  

If the values in the grid are shuffled, the meaning of the data is lost.

> Where a value is located matters just as much as the value itself.
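
The effect of shuffling can be checked numerically. The sketch below uses a synthetic grid (a smooth gradient with a little noise), not the course dataset, and the helper `neighbor_difference` is a hypothetical name introduced for illustration. It shows that shuffling keeps exactly the same set of values while destroying the similarity between neighboring cells.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "material map": a smooth left-to-right gradient plus a little noise.
grid = np.linspace(0.0, 1.0, 32)[np.newaxis, :].repeat(32, axis=0)
grid = grid + 0.01 * rng.standard_normal(grid.shape)

# Shuffle all cell values: the values are kept, their positions are not.
shuffled = rng.permutation(grid.ravel()).reshape(grid.shape)


def neighbor_difference(a):
    """Mean absolute difference between horizontally adjacent cells."""
    return np.abs(np.diff(a, axis=1)).mean()


same_values = np.array_equal(np.sort(grid.ravel()), np.sort(shuffled.ravel()))
print("Same set of values:", same_values)
print("Neighbor difference (original):", neighbor_difference(grid))
print("Neighbor difference (shuffled):", neighbor_difference(shuffled))
```

Summary statistics such as the mean or a histogram are identical for both grids, so only a model that looks at positions can tell them apart.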

---

## Data Exploration

Data exploration is the process of examining a dataset to understand its structure, identify patterns, detect outliers, test assumptions, and generate hypotheses.

This step relies on:
- **Summary statistics** (such as mean, minimum, and maximum values)
- **Visualizations** (such as plots and charts)

Data exploration helps us understand the data **before** building any machine learning model.

---


In [1]:
## Import Required Libraries

import os                                # file paths and directory listings

import numpy as np                       # mathematical and numerical operations
import pandas as pd                      # data loading and preprocessing
import matplotlib.pyplot as plt          # visualization
from scipy.signal import savgol_filter   # Savitzky-Golay smoothing filter





## Load Dataset 

In [None]:

from google.colab import drive
drive.mount('/content/drive')


## Dataset Paths

In [None]:

base_path = "/content/drive/MyDrive/RF_eye_tracking"
tex_path = os.path.join(base_path, "TEX")
ran_path = os.path.join(base_path, "RAN")

tex_files = sorted(os.listdir(tex_path))
ran_files = sorted(os.listdir(ran_path))

print("Number of TEX files:", len(tex_files))
print("Number of RAN files:", len(ran_files))


In [None]:
print("\nRAN files:")
print(ran_files)

## Load One File

In [None]:
sample_file = ran_files[0]

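# Read one whitespace-delimited recording into a DataFrame.
df = pd.read_csv(
    os.path.join(ran_path, sample_file),
    sep=r"\s+",      # columns are separated by one or more whitespace characters
    skiprows=1,      # skip the first line of the file
    header=None,     # column names are supplied below, not read from the file
    names=[
        "SAMPLE",
        "X_DEGREE",
        "Y_DEGREE",
        "VALIDITY",
        "X_STIMULUS",
        "Y_STIMULUS"
    ]
)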

df.head()

## Summary Statistics

In [None]:
df = df.apply(pd.to_numeric, errors="coerce")
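
The `errors="coerce"` option in the conversion turns any entry that cannot be parsed as a number into `NaN` instead of raising an error. A minimal sketch on a made-up frame (the values and the name `toy` are invented for illustration):

```python
import pandas as pd

# Toy frame with one non-numeric entry; values are invented for illustration.
toy = pd.DataFrame({"X_DEGREE": ["1.5", "bad", "3.0"], "VALIDITY": ["1", "0", "1"]})

# Convert every column to numeric; unparseable entries become NaN.
toy_numeric = toy.apply(pd.to_numeric, errors="coerce")

print(toy_numeric)
print(toy_numeric.isna().sum())  # number of NaN values per column
```

Counting the `NaN` values after the conversion is a quick way to see whether a file contained unexpected text.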

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.describe()

## Filter Valid Data

In [None]:

df_valid = df[df["VALIDITY"] == 1]
df_valid.shape
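
Before relying on the filtered data, it can help to see how many samples the filter removes. The sketch below uses invented toy values and assumes, as the filter implies, that a `VALIDITY` of 1 marks a usable sample.

```python
import pandas as pd

# Toy data standing in for one recording; values are invented for illustration.
toy = pd.DataFrame({
    "X_DEGREE": [0.1, 0.2, 9.9, 0.3],
    "VALIDITY": [1, 1, 0, 1],
})

# Same kind of filter as in the notebook: keep rows where VALIDITY equals 1.
toy_valid = toy[toy["VALIDITY"] == 1]

# Fraction of samples that survive the filter.
valid_fraction = len(toy_valid) / len(toy)
print(f"Valid samples: {len(toy_valid)} of {len(toy)} ({valid_fraction:.0%})")
```

A recording with a very low valid fraction may deserve a closer look before it is used for training.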


## Visualize X Eye Position

In [None]:

plt.hist(df_valid["X_DEGREE"], bins=50)
plt.xlabel("X Degree")
plt.ylabel("Frequency")
plt.title("Distribution of X Eye Position")
plt.show()


## Visualize Y Eye Position

In [None]:
plt.hist(df_valid["Y_DEGREE"], bins=50)
plt.xlabel("Y Degree")
plt.ylabel("Frequency")
plt.title("Distribution of Y Eye Position")
plt.show()