# MECO

Feature extraction and data preparation for the MECO dataset

We chose the "joint_data_trimmed.dat" file from the MECO website (https://meco-read.com/).

An interesting paper that describes the dataset: https://link.springer.com/epdf/10.3758/s13428-021-01772-6?sharing_token=As4e3osuA15IaUCKtCvDT5AH0g46feNdnc402WrhzyoEtpF3alySPm1lAWocS1ewk9OZlpPc3CqibACC23iBC_nacc6BD4_GPYLuUZJAvfWHoa8e0hjmhhFn9fLIgIRd3VzSfjlcpQ3gS4EiUY2YpRXjDSh3hB5Zx5kZpkk4yIQ=

## Import Libs and Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from tqdm import tqdm
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances_argmin_min

In [2]:
df = pd.read_csv("augmenting_nlms_meco_data/joint_data_trimmed.csv", index_col=0)

We have chosen to use the following features for each sample:

- **Skipping**: a binary index of whether the word was skipped, i.e., never fixated during the entire reading of the text [and not only during the first pass]. In the data, skip == 1 means the word received no fixation.
- **First Fixation**: the duration of the first fixation landing on the word.
- **Gaze Duration**: the summed duration of fixations on the word in the first pass, i.e., before the gaze leaves it for the first time.
- **Total Fixation Duration**: the summed duration of all fixations on the word.
- **First-run Number of Fixations**: the number of fixations on a word during the first pass.
- **Total Number of Fixations**: number of fixations on a word overall.
- **Regression**: a binary index of whether the gaze returned to the word after inspecting further textual material.
- **Rereading**: a binary index of whether the word elicited fixations after the first pass.
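
To make the definitions above concrete, here is a minimal sketch that derives most of these measures from a toy fixation sequence. The function `word_measures`, the `(word_index, duration_ms)` input format and the toy values are assumptions made for illustration only; this is not how MECO computes its measures. The "first pass" is simplified here to the first uninterrupted run of fixations on the word, and the regression column is left out.

```python
def word_measures(fixations, n_words):
    """fixations: list of (word_index, duration_ms) in temporal order (hypothetical format)."""
    rows = []
    for w in range(n_words):
        # positions in the fixation sequence that landed on word w
        hits = [i for i, (idx, _) in enumerate(fixations) if idx == w]
        if not hits:
            rows.append({"word": w, "skip": 1})
            continue
        # simplified first pass: consecutive fixations on w, starting at the first hit
        first_run = [hits[0]]
        for i in hits[1:]:
            if i == first_run[-1] + 1:
                first_run.append(i)
            else:
                break
        durs = [fixations[i][1] for i in hits]
        run_durs = [fixations[i][1] for i in first_run]
        rows.append({
            "word": w,
            "skip": 0,
            "firstfix_dur": run_durs[0],
            "firstrun_dur": sum(run_durs),
            "dur": sum(durs),
            "firstrun_nfix": len(run_durs),
            "nfix": len(durs),
            "reread": int(len(durs) > len(run_durs)),
        })
    return rows


# toy sequence: word 0, word 2 twice, word 3, then back to word 0; word 1 is never fixated
toy_fixations = [(0, 150), (2, 200), (2, 120), (3, 180), (0, 250)]
toy_measures = word_measures(toy_fixations, n_words=4)
for row in toy_measures:
    print(row)
```

In this toy sequence word 1 is skipped, so it carries only `skip = 1` and none of the other measures, which mirrors the Null values seen in the real data for skipped words.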


In [3]:
# following a paper cited on the MECO website, I will use a subset of the gaze features
# (column names use underscores, matching the columns of the loaded CSV)
gaze_features = ["skip", "firstfix_dur", "firstrun_dur", "dur", "firstrun_nfix", "nfix", "refix", "reread"]
other_features = ["trialid", "sentnum", "ianum", "ia", "lang", "uniform_id"]
df = df[other_features + gaze_features]

In [4]:
df.head()

Unnamed: 0,trialid,sentnum,ianum,ia,lang,uniform_id,skip,firstfix_dur,firstrun_dur,dur,firstrun_nfix,nfix,refix,reread
1,1.0,1.0,1.0,Janus,du,du_1,0.0,154.0,154.0,400.0,1.0,2.0,0.0,1.0
2,1.0,1.0,2.0,is,du,du_1,1.0,,,,,,,
3,1.0,1.0,3.0,in,du,du_1,0.0,551.0,551.0,551.0,1.0,1.0,0.0,0.0
4,1.0,1.0,4.0,de,du,du_1,1.0,,,,,,,
5,1.0,1.0,5.0,oude,du,du_1,0.0,189.0,189.0,439.0,1.0,2.0,0.0,1.0


## Data Understanding

We can notice that there are some Null elements. For the gaze_features other than skip, those Null elements are in the rows with skip == 1: the word was never fixated, so no fixation measure can be recorded for it.
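
One way to check this kind of claim is to compute the fraction of Null values per column separately for each value of skip. The sketch below does that on a small made-up frame (the values and the reduced column set are assumptions, not the MECO data):

```python
import numpy as np
import pandas as pd

# toy frame: skipped words carry NaN in every fixation measure
toy = pd.DataFrame({
    "skip": [0.0, 1.0, 0.0, 1.0],
    "firstfix_dur": [154.0, np.nan, 551.0, np.nan],
    "dur": [400.0, np.nan, 551.0, np.nan],
    "nfix": [2.0, np.nan, 1.0, np.nan],
})

measure_cols = ["firstfix_dur", "dur", "nfix"]
# fraction of Null values per column, separately for skip == 0 and skip == 1
null_fraction_by_skip = toy[measure_cols].isna().astype(int).groupby(toy["skip"]).mean()
print(null_fraction_by_skip)
```

A row of all ones for skip == 1 and all zeros for skip == 0 would mean that the Null values coincide exactly with the skipped words.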

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 855123 entries, 1 to 855123
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   trialid        855122 non-null  float64
 1   sentnum        855122 non-null  float64
 2   ianum          855122 non-null  float64
 3   ia             854741 non-null  object 
 4   lang           855122 non-null  object 
 5   uniform_id     855123 non-null  object 
 6   skip           855122 non-null  float64
 7   firstfix_dur   639530 non-null  float64
 8   firstrun_dur   639530 non-null  float64
 9   dur            639530 non-null  float64
 10  firstrun_nfix  639530 non-null  float64
 11  nfix           639530 non-null  float64
 12  refix          639454 non-null  float64
 13  reread         639530 non-null  float64
dtypes: float64(11), object(3)
memory usage: 97.9+ MB


In [6]:
df.describe()

Unnamed: 0,trialid,sentnum,ianum,skip,firstfix_dur,firstrun_dur,dur,firstrun_nfix,nfix,refix,reread
count,855122.0,855122.0,855122.0,855122.0,639530.0,639530.0,639530.0,639530.0,639530.0,639454.0,639530.0
mean,6.319812,5.100584,84.710652,0.252118,214.771812,274.000635,396.190598,1.291295,1.870305,0.270565,0.315846
std,3.44021,2.697842,51.443266,0.434229,94.834265,181.464901,332.095123,0.666067,1.378493,0.444252,0.464852
min,1.0,1.0,1.0,0.0,2.0,2.0,2.0,1.0,1.0,0.0,0.0
25%,3.0,3.0,41.0,0.0,156.0,171.0,199.0,1.0,1.0,0.0,0.0
50%,6.0,5.0,82.0,0.0,200.0,229.0,297.0,1.0,1.0,0.0,0.0
75%,9.0,7.0,124.0,1.0,255.0,324.0,478.0,1.0,2.0,1.0,1.0
max,12.0,16.0,243.0,1.0,12688.0,12688.0,15579.0,44.0,50.0,1.0,1.0


In [7]:
df.lang.unique()

array(['du', 'ee', 'fi', 'ge', 'gr', 'he', 'it', 'ko', 'en', 'no', nan,
       'ru', 'sp', 'tr'], dtype=object)

We keep only a subset of the languages:

- **English**
- **Italian**

In [8]:
df = df[df.lang.isin(["en", "it"])]

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 171595 entries, 397572 to 604808
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   trialid        171595 non-null  float64
 1   sentnum        171595 non-null  float64
 2   ianum          171595 non-null  float64
 3   ia             171525 non-null  object 
 4   lang           171595 non-null  object 
 5   uniform_id     171595 non-null  object 
 6   skip           171595 non-null  float64
 7   firstfix_dur   122875 non-null  float64
 8   firstrun_dur   122875 non-null  float64
 9   dur            122875 non-null  float64
 10  firstrun_nfix  122875 non-null  float64
 11  nfix           122875 non-null  float64
 12  refix          122847 non-null  float64
 13  reread         122875 non-null  float64
dtypes: float64(11), object(3)
memory usage: 19.6+ MB


In [10]:
df.head()

Unnamed: 0,trialid,sentnum,ianum,ia,lang,uniform_id,skip,firstfix_dur,firstrun_dur,dur,firstrun_nfix,nfix,refix,reread
397572,1.0,2.0,25.0,come,it,it_3,0.0,555.0,555.0,555.0,1.0,1.0,0.0,0.0
397573,1.0,2.0,26.0,avente,it,it_3,0.0,282.0,282.0,282.0,1.0,1.0,0.0,0.0
397574,1.0,2.0,27.0,due,it,it_3,0.0,281.0,281.0,281.0,1.0,1.0,0.0,0.0
397575,1.0,2.0,28.0,"facce,",it,it_3,1.0,,,,,,,
397576,1.0,2.0,29.0,poiché,it,it_3,1.0,,,,,,,


Notice that for the samples with skip == 0 the gaze_features contain no Null elements, apart from 28 missing refix values (122847 non-null out of 122875).

In [11]:
df[df.skip==0].info()

<class 'pandas.core.frame.DataFrame'>
Index: 122875 entries, 397572 to 604808
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   trialid        122875 non-null  float64
 1   sentnum        122875 non-null  float64
 2   ianum          122875 non-null  float64
 3   ia             122868 non-null  object 
 4   lang           122875 non-null  object 
 5   uniform_id     122875 non-null  object 
 6   skip           122875 non-null  float64
 7   firstfix_dur   122875 non-null  float64
 8   firstrun_dur   122875 non-null  float64
 9   dur            122875 non-null  float64
 10  firstrun_nfix  122875 non-null  float64
 11  nfix           122875 non-null  float64
 12  refix          122847 non-null  float64
 13  reread         122875 non-null  float64
dtypes: float64(11), object(3)
memory usage: 14.1+ MB


In [12]:
df[df.skip==1].info()

<class 'pandas.core.frame.DataFrame'>
Index: 48720 entries, 397575 to 604806
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   trialid        48720 non-null  float64
 1   sentnum        48720 non-null  float64
 2   ianum          48720 non-null  float64
 3   ia             48657 non-null  object 
 4   lang           48720 non-null  object 
 5   uniform_id     48720 non-null  object 
 6   skip           48720 non-null  float64
 7   firstfix_dur   0 non-null      float64
 8   firstrun_dur   0 non-null      float64
 9   dur            0 non-null      float64
 10  firstrun_nfix  0 non-null      float64
 11  nfix           0 non-null      float64
 12  refix          0 non-null      float64
 13  reread         0 non-null      float64
dtypes: float64(11), object(3)
memory usage: 5.6+ MB


We noticed that, for both skip == 1 and skip == 0, some "ia" elements are Null. Most of those rows (63 out of 70) are also Null over all the GAZE features.

~~We will drop them because, moreover, the "ia" feature represents the text of the word, so without the text the element cannot be processed by our machine learning model.~~

Instead of dropping them, we will use a placeholder token (`<unk>`) to represent the Null ia elements.

In [13]:
print("Probabilities of Null elements by columns, for the Null ia")
df[df.ia.isna()].isna().sum()/df[df.ia.isna()].shape[0]

Probabilities of Null elements by columns, for the Null ia


trialid          0.0
sentnum          0.0
ianum            0.0
ia               1.0
lang             0.0
uniform_id       0.0
skip             0.0
firstfix_dur     0.9
firstrun_dur     0.9
dur              0.9
firstrun_nfix    0.9
nfix             0.9
refix            0.9
reread           0.9
dtype: float64

In [14]:
print("Number of Null elements by columns, for the Null ia")
df[df.ia.isna()].isna().sum()

Number of Null elements by columns, for the Null ia


trialid           0
sentnum           0
ianum             0
ia               70
lang              0
uniform_id        0
skip              0
firstfix_dur     63
firstrun_dur     63
dur              63
firstrun_nfix    63
nfix             63
refix            63
reread           63
dtype: int64

In [15]:
df[df.ia.isna()]

Unnamed: 0,trialid,sentnum,ianum,ia,lang,uniform_id,skip,firstfix_dur,firstrun_dur,dur,firstrun_nfix,nfix,refix,reread
399129,9.0,6.0,147.0,,it,it_3,1.0,,,,,,,
400261,9.0,6.0,147.0,,it,it_4,0.0,97.0,97.0,97.0,1.0,1.0,0.0,0.0
401436,9.0,6.0,147.0,,it,it_5,1.0,,,,,,,
404528,9.0,6.0,147.0,,it,it_7,1.0,,,,,,,
406588,9.0,6.0,147.0,,it,it_8,1.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
594782,3.0,8.0,149.0,,en,en_94,1.0,,,,,,,
596890,3.0,8.0,149.0,,en,en_95,1.0,,,,,,,
598998,3.0,8.0,149.0,,en,en_97,1.0,,,,,,,
601106,3.0,8.0,149.0,,en,en_98,1.0,,,,,,,


In [16]:
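# boolean mask of the rows whose word text is missing
ia_nan_els = df.ia.isna()
# replace the missing word text with a placeholder token
# (assign returns a new frame, which avoids pandas' SettingWithCopyWarning on a filtered df)
df = df.assign(ia=df.ia.fillna("<unk>"))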

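Fill the gaze features of the skipped words with 0. Note that `fillna(0)` is applied to the whole frame, so the 28 missing refix values of fixated words also become 0.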

In [17]:
df = df.fillna(0)
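
As a possible alternative to filling the whole frame, the fill can be restricted to selected columns and to the skipped rows only, so that any other missing value stays visible. This is a minimal sketch on a made-up frame; `fill_cols` and the toy values are assumptions for illustration, not part of the original pipeline:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ia": ["Janus", "is", None],
    "skip": [0.0, 1.0, 1.0],
    "dur": [400.0, np.nan, np.nan],
    "nfix": [2.0, np.nan, np.nan],
})

fill_cols = ["dur", "nfix"]  # hypothetical subset of the gaze columns
filled = toy.copy()
# fill only the listed columns, and only on rows where the word was skipped
skipped = filled["skip"] == 1
filled.loc[skipped, fill_cols] = filled.loc[skipped, fill_cols].fillna(0)
print(filled)
```

With this variant the missing `ia` value is left untouched, whereas a frame-wide `fillna(0)` would replace it with the number 0.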

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 171595 entries, 397572 to 604808
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   trialid        171595 non-null  float64
 1   sentnum        171595 non-null  float64
 2   ianum          171595 non-null  float64
 3   ia             171595 non-null  object 
 4   lang           171595 non-null  object 
 5   uniform_id     171595 non-null  object 
 6   skip           171595 non-null  float64
 7   firstfix_dur   171595 non-null  float64
 8   firstrun_dur   171595 non-null  float64
 9   dur            171595 non-null  float64
 10  firstrun_nfix  171595 non-null  float64
 11  nfix           171595 non-null  float64
 12  refix          171595 non-null  float64
 13  reread         171595 non-null  float64
dtypes: float64(11), object(3)
memory usage: 19.6+ MB


In [19]:
# pairwise Pearson correlations between the gaze features
df[["skip", "firstrun_dur", "dur", "firstrun_nfix", "nfix", "refix", "reread"]].corr()

Unnamed: 0,skip,firstrun_dur,dur,firstrun_nfix,nfix,refix,reread
skip,1.0,-0.668898,-0.552499,-0.763031,-0.600375,-0.280106,-0.342848
firstrun_dur,-0.668898,1.0,0.722246,0.864126,0.622603,0.551049,0.241115
dur,-0.552499,0.722246,1.0,0.651924,0.926565,0.556965,0.63789
firstrun_nfix,-0.763031,0.864126,0.651924,1.0,0.713521,0.668339,0.273338
nfix,-0.600375,0.622603,0.926565,0.713521,1.0,0.626607,0.696863
refix,-0.280106,0.551049,0.556965,0.668339,0.626607,1.0,0.253991
reread,-0.342848,0.241115,0.63789,0.273338,0.696863,0.253991,1.0
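
The matrix above is computed after the skipped words were zero-filled, so part of the correlation between features may come from rows where every measure is 0 at once. The sketch below, on made-up numbers (the toy frame is an assumption, not MECO data), shows how one could compare the correlation over all rows with the correlation over fixated words only:

```python
import pandas as pd

# toy frame: four fixated words plus two skipped, zero-filled words
toy = pd.DataFrame({
    "skip":         [0, 0, 0, 0, 1, 1],
    "firstrun_dur": [200, 300, 200, 300, 0, 0],
    "nfix":         [1, 1, 2, 2, 0, 0],
})

# Pearson correlation over all rows, zero-filled skipped words included
corr_all = toy["firstrun_dur"].corr(toy["nfix"])

# same correlation restricted to the words that were actually fixated
fixated = toy[toy["skip"] == 0]
corr_fixated = fixated["firstrun_dur"].corr(fixated["nfix"])

print(f"all words: {corr_all:.3f}, fixated words only: {corr_fixated:.3f}")
```

In this toy case the two features are uncorrelated among fixated words, yet the zero-filled rows alone produce a strong positive correlation. Whether the same effect matters for the real data would need to be checked on `df[df.skip == 0]`.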


### En

In [20]:
df_en = df[df.lang=="en"]
df_en.head()

Unnamed: 0,trialid,sentnum,ianum,ia,lang,uniform_id,skip,firstfix_dur,firstrun_dur,dur,firstrun_nfix,nfix,refix,reread
520174,1.0,1.0,1.0,In,en,en_3,0.0,154.0,154.0,154.0,1.0,1.0,0.0,0.0
520175,1.0,1.0,2.0,ancient,en,en_3,0.0,139.0,550.0,550.0,3.0,3.0,1.0,0.0
520176,1.0,1.0,3.0,Roman,en,en_3,0.0,90.0,274.0,274.0,2.0,2.0,0.0,0.0
520177,1.0,1.0,4.0,religion,en,en_3,0.0,301.0,301.0,301.0,1.0,1.0,0.0,0.0
520178,1.0,1.0,5.0,and,en,en_3,0.0,270.0,270.0,542.0,1.0,2.0,0.0,1.0


### It

In [21]:
df_it = df[df.lang=="it"]
df_it.head()

Unnamed: 0,trialid,sentnum,ianum,ia,lang,uniform_id,skip,firstfix_dur,firstrun_dur,dur,firstrun_nfix,nfix,refix,reread
397572,1.0,2.0,25.0,come,it,it_3,0.0,555.0,555.0,555.0,1.0,1.0,0.0,0.0
397573,1.0,2.0,26.0,avente,it,it_3,0.0,282.0,282.0,282.0,1.0,1.0,0.0,0.0
397574,1.0,2.0,27.0,due,it,it_3,0.0,281.0,281.0,281.0,1.0,1.0,0.0,0.0
397575,1.0,2.0,28.0,"facce,",it,it_3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
397576,1.0,2.0,29.0,poiché,it,it_3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Clustering Users

We cluster the users to account for different reading behaviours, because the per-reader aggregated features have a different correlation matrix from the non-aggregated (word-level) data.

We use K=5 clusters and, for each cluster, take as representative user the reader closest to the cluster centroid (used here as the medoid).
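
The reader closest to a K-Means centroid is not always the medoid in the strict sense (the member with the smallest total distance to the other members of its cluster). The sketch below contrasts the two on a toy one-dimensional cluster; the helper names `nearest_to_centroid` and `true_medoids` and the data are assumptions for illustration, not part of this notebook's pipeline:

```python
import numpy as np
from sklearn.metrics import pairwise_distances, pairwise_distances_argmin_min


def nearest_to_centroid(X, labels, centers):
    """For each cluster, index (into X) of the member closest to its centroid."""
    picks = []
    for k, center in enumerate(centers):
        members = np.flatnonzero(labels == k)
        idx, _ = pairwise_distances_argmin_min(center.reshape(1, -1), X[members])
        picks.append(int(members[idx[0]]))
    return picks


def true_medoids(X, labels):
    """For each cluster, index (into X) of the member with the smallest
    summed distance to all other members of the same cluster."""
    picks = []
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        dist = pairwise_distances(X[members])
        picks.append(int(members[dist.sum(axis=1).argmin()]))
    return picks


# toy data: a single cluster with one far-away point that pulls the mean
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])
labels_toy = np.zeros(len(X_toy), dtype=int)
centers_toy = X_toy.mean(axis=0, keepdims=True)

by_centroid = nearest_to_centroid(X_toy, labels_toy, centers_toy)
by_medoid = true_medoids(X_toy, labels_toy)
print(by_centroid, by_medoid)
```

The outlier shifts the centroid towards it, so the point nearest to the centroid differs from the strict medoid in this example. On compact clusters the two choices usually coincide.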

In [22]:
seed_ = 12345

np.random.seed(seed_)

### Profiling the Users

Generate a profile for each user: we treat the profile as the average of the GAZE features over all the words in that user's data (skipped words count as 0).

In [23]:
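# normalize dotted column names (e.g. "firstfix.dur") to the underscore names used in df;
# with dotted names the groupby below fails with
# KeyError: "Columns not found: 'firstfix.dur', 'firstrun.dur', 'firstrun.nfix'"
gaze_features = [f.replace(".", "_") for f in gaze_features]

# one row per reader: mean of each gaze feature over that reader's words
reader_grouped_df_en = df_en.groupby(["uniform_id", "lang"])[gaze_features].mean().reset_index()
reader_grouped_df_it = df_it.groupby(["uniform_id", "lang"])[gaze_features].mean().reset_index()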

In [None]:
reader_grouped_df_en.head()

In [None]:
reader_grouped_df_it.head()

In [None]:
reader_grouped_df_en.info()

In [None]:
reader_grouped_df_it.info()

In [None]:
reader_grouped_df_en.uniform_id.unique()

In [None]:
reader_grouped_df_it.uniform_id.unique()

In [None]:
reader_grouped_df_en[gaze_features].corr()

In [None]:
reader_grouped_df_it[gaze_features].corr()

### Apply K-means to cluster our data

In [None]:
def clusterize_user_profiling(reader_grouped_df, gaze_features):
    """
    Apply the K-Means algorithm to retrieve K clusters and return, for each
    cluster, the uniform_id of the reader closest to its centroid.
    """

    scaler = MinMaxScaler()

    X = scaler.fit_transform(reader_grouped_df[gaze_features].values)

    sse_list = list()
    separations = list()
    silouettes_ = list()

    max_k = 10
    for k in tqdm(range(2, max_k + 1)):
        kmeans = KMeans(n_clusters=k, random_state=seed_, n_init=100, max_iter=100)
        kmeans.fit(X)

        sse = kmeans.inertia_
        sse_list.append(sse)
        separations.append(metrics.davies_bouldin_score(X, kmeans.labels_))
        silouettes_.append(silhouette_score(X, kmeans.labels_))

    plt.plot(range(2, len(sse_list) + 2), sse_list)
    plt.ylabel('SSE', fontsize=22)
    plt.xlabel('K', fontsize=22)
    plt.xticks(range(2, len(sse_list) + 2))
    plt.show()

    plt.plot(range(2, len(separations) + 2), separations)
    plt.ylabel('Separation', fontsize=22)
    plt.xlabel('K', fontsize=22)
    plt.xticks(range(2, len(separations) + 2))
    plt.show()

    plt.plot(range(2, len(silouettes_) + 2), silouettes_)
    plt.ylabel('Silhouette', fontsize=22)
    plt.xlabel('K', fontsize=22)
    plt.xticks(range(2, len(silouettes_) + 2))
    plt.show()

    selected_k=5

    kmeans = KMeans(n_clusters=selected_k, random_state=seed_, n_init=100, max_iter=500)
    kmeans.fit(X)

    # sum up the metrics

    print(f"SSE : {kmeans.inertia_}")
    print(f"Separation : {metrics.davies_bouldin_score(X, kmeans.labels_)}")
    print(f"Silhouette : {silhouette_score(X, kmeans.labels_)}")

    # number of readers per cluster, split by language
    lang_per_cluster = pd.crosstab(kmeans.labels_, reader_grouped_df["lang"])
    lang_per_cluster.plot(kind='bar', stacked=False,
                       title='lang per cluster')
    plt.xlabel('Cluster')
    plt.ylabel("Number of readers")
    plt.show()

    center = scaler.inverse_transform(kmeans.cluster_centers_)

    plt.figure(figsize=(8, 4))
    for i in range(0, len(center)):
        plt.plot(kmeans.cluster_centers_[i], marker='o', label='Cluster %s' % i)
    plt.tick_params(axis='both', which='major', labelsize=10)
    plt.xticks(range(0, len(gaze_features)), gaze_features, fontsize=18, rotation=90)
    plt.legend(fontsize=10)
    plt.show()
    
    """
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    #plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=kmeans.labels_, s=20)
    
    color_legend = {0: "green", 1: "yellow", 2: "blue"}

    fig, ax = plt.subplots()
    for g in np.unique(kmeans.labels_):
        ix = np.where(kmeans.labels_ == g)
        ax.scatter(X_reduced[ix, 0], X_reduced[ix, 1], c = color_legend[g], label = g, s = 100)
    ax.legend()

    plt.tick_params(axis='both', which='major', labelsize=11)
    plt.show()
    """

    # return the user nearest to each cluster centroid

    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
    # closest[i] -> contains the index of the point closest to the i-th centroid
    medoids = []

    for i in range(selected_k):
        # positional lookup, so it does not depend on the DataFrame index labels
        medoids.append(reader_grouped_df.uniform_id.iloc[closest[i]])
        
    return medoids
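
The function above plots SSE, Davies-Bouldin and silhouette curves but then fixes `selected_k=5` by hand. As a hedged alternative, K could be picked programmatically from one of those scores. The sketch below selects the K with the highest silhouette score; `pick_k_by_silhouette`, the synthetic blobs and the reduced `n_init` are assumptions for illustration, not the method used in this notebook:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def pick_k_by_silhouette(X, k_values, seed=12345):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # the silhouette score is higher for better-separated clusterings
    best_k = max(scores, key=scores.get)
    return best_k, scores


# synthetic data with three well-separated groups (not reader profiles)
X_toy, _ = make_blobs(
    n_samples=90,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
best_k, scores = pick_k_by_silhouette(X_toy, range(2, 7))
print(best_k, scores)
```

With only a few dozen readers per language, the curves can be noisy, so an automatic choice like this is best read alongside the plots rather than as a replacement for them.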

In [None]:
medoids_en = clusterize_user_profiling(reader_grouped_df_en, gaze_features)

In [None]:
medoids_en

In [None]:
medoids_it = clusterize_user_profiling(reader_grouped_df_it, gaze_features)

In [None]:
medoids_it

### Creating one dataset per medoid, so one for each representative user

In [None]:
datasets_en = []

for user in medoids_en:
    datasets_en.append(df_en[df_en.uniform_id == user].reset_index(drop=True))
    
datasets_it = []

for user in medoids_it:
    datasets_it.append(df_it[df_it.uniform_id == user].reset_index(drop=True))

In [None]:
for df in datasets_en:
    print(df[["skip", "firstrun_dur", "dur", "firstrun_nfix", "nfix", "refix", "reread"]].corr())

In [None]:
for df in datasets_it:
    print(df[["skip", "firstrun_dur", "dur", "firstrun_nfix", "nfix", "refix", "reread"]].corr())

In [None]:
for i, df in enumerate(datasets_en):
    print(f"Len dataset_{i} : {df.shape}")

In [None]:
for i, df in enumerate(datasets_it):
    print(f"Len dataset_{i} : {df.shape}")

### Saving datasets

In [None]:
for user, df in zip(medoids_en, datasets_en):
    df.to_csv(f"augmenting_nlms_meco_data/en/{user}_dataset.csv")

In [None]:
for user, df in zip(medoids_it, datasets_it):
    df.to_csv(f"augmenting_nlms_meco_data/it/{user}_dataset.csv")
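
`DataFrame.to_csv` does not create missing directories, so the two cells above fail if the `en/` and `it/` folders do not exist yet. A small sketch of a helper that creates the target directory first; `save_user_datasets` is a hypothetical name, and the example writes to a temporary directory rather than to the notebook's data folder:

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_user_datasets(users, datasets, out_dir):
    """Write one CSV per user into out_dir, creating the directory if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for user, user_df in zip(users, datasets):
        path = out_dir / f"{user}_dataset.csv"
        user_df.to_csv(path)
        paths.append(path)
    return paths


# usage on a toy frame, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy_frames = [pd.DataFrame({"ia": ["In"], "dur": [154.0]})]
    written = save_user_datasets(["en_3"], toy_frames, Path(tmp) / "en")
    written_names = [p.name for p in written]
    all_exist = all(p.exists() for p in written)

print(written_names, all_exist)
```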