# GazeClusterML

### Authored by: Taimur Khan, Benjamin Nava Höer
**Final Project for TU Berlin WU'20 course: Machine Learning using Python: Theory and Application**

___**Licensed under:**___

### 1. Abstract

An unlabelled eyetracking time series was clustered using the spatial clustering algorithms DBSCAN and OPTICS, as well as the spatio-temporal clustering algorithms ST-DBSCAN and ST-OPTICS. The silhouette score was found not to be an appropriate evaluation metric for the resulting clusterings. A second, manually labelled dataset was used to evaluate the accuracy of the most promising algorithm, ST-OPTICS. XX% of the predicted labels matched the provided labels, making ST-OPTICS a valuable tool for the future analysis of webcam-based eyetracking data.

### 2. Introduction

### 3. Theoretical Rationalization

#### 3.1 DBSCAN
DBSCAN discovers arbitrarily shaped clusters in a dataset using a radius value $\epsilon$ and a user-defined distance measure $d$, e.g. the Euclidean distance. Additionally, a $MinPts$ value defines the minimum number of points that must lie within the $\epsilon$ radius of a point. Given the neighborhood of $p$ as $N(p) := \{q \in D: d(p,q) \leq \epsilon\}$, where $D$ is the dataset and $p$ and $q$ are points in it, this leads to the following three kinds of points:
- Core points: $\mid N(p)\mid \geq MinPts$
- Border points: $\mid N(p)\mid < MinPts$, but $p$ lies in the neighborhood of a core point
- Noise: all remaining points

Reference x
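
The three kinds of points can be made concrete with a small sketch. It uses scikit-learn's `DBSCAN` on made-up toy coordinates (not the eyetracking data) and derives border points as the non-core points that still received a cluster label. Note that scikit-learn counts a point as part of its own neighborhood, which matches the definition of $N(p)$ above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2D points, invented for illustration only
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense blob
    [0.25, 0.1],                                      # sits on the edge of the blob
    [5.0, 5.0],                                       # isolated point
])

eps, min_pts = 0.2, 4
db = DBSCAN(eps=eps, min_samples=min_pts).fit(points)

# Core points are reported directly by scikit-learn
is_core = np.zeros(len(points), dtype=bool)
is_core[db.core_sample_indices_] = True

# Noise points carry the label -1
is_noise = db.labels_ == -1

# Border points: assigned to a cluster, but not core themselves
is_border = ~is_core & ~is_noise

for point, core, border in zip(points, is_core, is_border):
    kind = "core" if core else "border" if border else "noise"
    print(point, kind)
```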
#### 3.3 ST-DBSCAN
ST-DBSCAN builds on DBSCAN by adding a second, temporal radius value $\epsilon_{2}$ alongside the spatial radius, which is now written $\epsilon_{1}$. The same kinds of distance metric can be used for both radii, e.g. the Euclidean distance. The neighborhood of a point is now described by both $\epsilon_{1}$ and $\epsilon_{2}$: $N(p) := \{q \in D: d_{1}(p,q) \leq \epsilon_{1},d_{2}(p,q) \leq \epsilon_{2}\}$, where $d_{1}$ is the spatial and $d_{2}$ the temporal distance. The points in the dataset are then classified into the categories described above.
Reference y
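
As a minimal sketch of the spatio-temporal neighborhood, the helper below selects all points that satisfy both radius conditions at once. The function name `st_neighborhood`, the toy samples and the column order (timestamp, x, y) are assumptions made for illustration; this is not the implementation used by the `st_dbscan` package.

```python
import numpy as np

def st_neighborhood(data, idx, eps1, eps2):
    """Indices of all points within eps1 (spatial) and eps2 (temporal) of point idx.

    data is assumed to have shape (n, 3) with columns (timestamp, x, y).
    """
    t, xy = data[:, 0], data[:, 1:3]
    d_spatial = np.linalg.norm(xy - xy[idx], axis=1)  # Euclidean distance in the x-y plane
    d_temporal = np.abs(t - t[idx])                   # absolute time difference
    return np.flatnonzero((d_spatial <= eps1) & (d_temporal <= eps2))

# Hypothetical samples: (timestamp, x, y)
samples = np.array([
    [0.0, 100.0, 100.0],
    [50.0, 110.0, 105.0],    # close in space and time
    [100.0, 400.0, 300.0],   # close in time, far in space
    [5000.0, 102.0, 101.0],  # close in space, far in time
])

neighbors = st_neighborhood(samples, 0, eps1=70, eps2=250)
print(neighbors)  # [0 1]
```

Only the second sample joins the neighborhood of the first: the third fails the spatial condition and the fourth fails the temporal one.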

### 4. Implementation

**4.0. Setup Environment**

In [17]:
# Install third-party packages that are not part of the base environment
!pip install st-dbscan
!pip install ipympl

import json
import urllib.request

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib versions
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from st_dbscan import ST_DBSCAN



**4.1. Load and explore data**

In [2]:
# Load the first dataset and store it in dataframe 'df1'
with urllib.request.urlopen("http://dschr.de/api/resultCombineData") as response:
    data1 = json.loads(response.read().decode())
df1 = pd.DataFrame(data1[0]["data"])

# Load the second dataset and store it in dataframe 'df2'
with urllib.request.urlopen("http://dschr.de/api/handLabeled") as response:
    data2 = json.loads(response.read().decode())
df2 = pd.DataFrame(data2[0]["data"])


df1, df2

(     timestamp            x           y     label
 0       102708   986.288075  508.004755  Fixation
 1       102781  1005.492167  495.522600  Fixation
 2       102842   942.353008  492.123891  Fixation
 3       102893   948.193646  474.714589  Fixation
 4       102943   938.728917  481.875697  Fixation
 ..         ...          ...         ...       ...
 968     163046   904.812802  246.252619   Saccade
 969     163105   861.176955  262.574859  Fixation
 970     163155   732.960155  276.037371  Fixation
 971     163239   635.075056  295.309572   Saccade
 972     163289   618.075163  313.958367  Fixation
 
 [973 rows x 4 columns],
      timestamp           x           y     label
 0        39342  739.023953  417.475902  fixation
 1        39380  707.481049  444.210737  fixation
 2        39424  713.926225  445.593758  fixation
 3        39469  704.775582  469.488595  fixation
 4        39507  704.159921  476.988842  fixation
 ..         ...         ...         ...       ...
 836      7

**4.2. Preprocess data**

In [3]:
# Shift the timestamps in both dataframes so that each recording starts at 0
df1['timestamp'] = df1['timestamp'] - df1['timestamp'].iloc[0]
df2['timestamp'] = df2['timestamp'] - df2['timestamp'].iloc[0]

# Convert dataframes to numpy arrays
array1 = df1.to_numpy()
array2 = df2.to_numpy()

array1, array2

(array([[0, 986.2880749379, 508.0047550332, 'Fixation'],
        [73, 1005.4921671685, 495.5226000186, 'Fixation'],
        [134, 942.3530079831, 492.1238913635, 'Fixation'],
        ...,
        [60447, 732.960154917, 276.0373709741, 'Fixation'],
        [60531, 635.0750563979, 295.3095717554, 'Saccade'],
        [60581, 618.0751626144, 313.9583669145, 'Fixation']], dtype=object),
 array([[0, 739.0239530773961, 417.4759015313773, 'fixation'],
        [38, 707.4810489651718, 444.2107367829889, 'fixation'],
        [82, 713.9262252699162, 445.593757589654, 'fixation'],
        ...,
        [36541, 674.5909449373224, 369.15619639448806, 'saccade'],
        [36582, 671.6680822153079, 416.0326625889876, 'fixation'],
        [36619, 684.2615176944689, 428.94640490405294, 'fixation']],
       dtype=object))
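
Both arrays above have `dtype=object` because the string `label` column is converted together with the numeric columns. The sketch below shows one way to keep the numeric features and the labels apart; the small dataframe is a stand-in built from the first rows shown above, and the names `features` and `labels_true` are illustrative assumptions. Lower-casing the labels also evens out the different capitalisation in the two datasets ("Fixation" vs. "fixation").

```python
import numpy as np
import pandas as pd

# Stand-in for df1, using the first three rows printed above
df = pd.DataFrame({
    "timestamp": [0, 73, 134],
    "x": [986.288075, 1005.492167, 942.353008],
    "y": [508.004755, 495.522600, 492.123891],
    "label": ["Fixation", "Fixation", "Fixation"],
})

# Numeric feature matrix with a proper float dtype
features = df[["timestamp", "x", "y"]].to_numpy(dtype=float)

# Labels kept separately and normalised to lower case
labels_true = df["label"].str.lower().to_numpy()

print(features.dtype, features.shape)
print(labels_true)
```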

**4.3. Choose and implement model**

In [13]:
# Set up the DBSCAN model
eps_dbscan = 150
min_samples_dbscan = 5
clf_dbscan = DBSCAN(eps=eps_dbscan, min_samples=min_samples_dbscan, metric='euclidean', algorithm='auto', leaf_size=30, p=2, n_jobs=1)

# Set up the ST-DBSCAN model (eps1: spatial radius, eps2: temporal radius)
eps_stdbscan = 70
eps2_stdbscan = 250
min_samples_stdbscan = 5
clf_st_dbscan = ST_DBSCAN(eps1=eps_stdbscan, eps2=eps2_stdbscan, min_samples=min_samples_stdbscan)
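
The radius values above are given without a derivation. One common heuristic for picking a DBSCAN radius, which is not necessarily how these values were chosen, is to sort every sample's distance to its k-th nearest neighbour (with k equal to `min_samples`) and look for a knee in the resulting curve. The helper name `k_distances` and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k):
    """Sorted distance from each sample to its k-th nearest neighbour.

    The sample itself counts as the first neighbour, which matches how
    scikit-learn's DBSCAN counts min_samples.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

# Toy data: three evenly spaced points and one outlier
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
kd = k_distances(X_demo, k=2)
print(kd)  # the jump at the end of the curve marks the outlier
```

Plotting `kd` for the real feature matrix and reading off the distance at the knee gives a candidate radius that can then be checked visually.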

**4.4. Train model and predict labels**

##### 4.4.1 DBSCAN

In [16]:
# Fit DBSCAN on the timestamp, x and y columns
clf_dbscan.fit(df1.iloc[:,:3])

# Collapse all cluster ids into a single label: 1 = clustered, -1 = noise
labels_pred_dbscan = np.where(clf_dbscan.labels_ >= 0, 1, -1)

# 3D scatter plot of x, y and timestamp, coloured by predicted label
%matplotlib widget
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df1.iloc[:,1],df1.iloc[:,2],df1.iloc[:,0], c=labels_pred_dbscan)


##### 4.4.2 ST-DBSCAN

In [15]:
# Fit ST-DBSCAN; the first column is treated as time, the remaining ones as spatial coordinates
clf_st_dbscan.fit(df1.iloc[:,:3])

# Collapse all cluster ids into a single label: 1 = clustered, -1 = noise
labels_pred_stdbscan = np.where(clf_st_dbscan.labels >= 0, 1, -1)

# 3D scatter plot of x, y and timestamp, coloured by predicted label
%matplotlib widget
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df1.iloc[:,1],df1.iloc[:,2],df1.iloc[:,0], c=labels_pred_stdbscan)


**4.5. Evaluate model**

In [21]:
# Silhouette score DBSCAN
ss_dbscan = silhouette_score(df1.iloc[:,:3],labels_pred_dbscan)
# Silhouette score ST-DBSCAN
ss_stdbscan = silhouette_score(df1.iloc[:,:3],labels_pred_stdbscan)
print(f"Silhouette scores:\nDBSCAN: {ss_dbscan}\nST-DBSCAN: {ss_stdbscan}")

Silhouette scores:
DBSCAN: 0.004875167353651087
ST-DBSCAN: 0.017654084873375348
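
Because both datasets carry a `label` column, the predicted labels can also be compared directly with the provided ones, as described in the abstract. The sketch below assumes that clustered samples (label >= 0) correspond to fixations and noise samples (label -1) to saccades; that mapping, the function name `label_agreement` and the toy inputs are assumptions for illustration, not results from this project.

```python
import numpy as np

def label_agreement(labels_pred, labels_true):
    """Fraction of samples whose predicted label matches the provided label.

    Assumed mapping: cluster member (>= 0) -> 'fixation', noise (-1) -> 'saccade'.
    """
    pred_names = np.where(np.asarray(labels_pred) >= 0, "fixation", "saccade")
    # Lower-case the provided labels so 'Fixation' and 'fixation' compare equal
    true_names = np.char.lower(np.asarray(labels_true, dtype=str))
    return float(np.mean(pred_names == true_names))

# Toy example with made-up labels
pred = np.array([1, 1, -1, 1])
true = np.array(["Fixation", "fixation", "Saccade", "Saccade"])
agreement = label_agreement(pred, true)
print(agreement)  # 0.75
```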


### 5. Conclusions 

### 6. References