# MECO

Feature extraction and data preparation for the MECO dataset

We use the "joint_data_trimmed.dat" file from the MECO website (https://meco-read.com/), loaded below from a CSV copy with the same base name.

## Import Libs and Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("joint_data_trimmed.csv", index_col=0)

We chose the following features for each sample:

- **Skipping**: a binary index of whether the word was skipped, i.e. never fixated during the entire reading of the text [and not only during the first pass]; `skip == 1` marks a word that received no fixation.
- **First Fixation**: the duration of the first fixation landing on the word.
- **Gaze Duration**: the summed duration of fixations on the word in the first pass, i.e., before the gaze leaves it for the first time.
- **Total Fixation Duration**: the summed duration of all fixations on the word.
- **First-run Number of Fixations**: the number of fixations on a word during the first pass.
- **Total Number of Fixations**: number of fixations on a word overall.
- **Regression**: a binary index of whether the gaze returned to the word after inspecting further textual material.
- **Rereading**: a binary index of whether the word elicited fixations after the first pass.
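
To make these definitions concrete, here is a minimal sketch that derives several of the measures from a toy fixation sequence. The `fixations` list and the `word_measures` helper are hypothetical illustrations; MECO ships these measures precomputed, and "first pass" is simplified here to the first uninterrupted run of fixations on the word.

```python
# Toy fixation sequence: (word index, fixation duration in ms), in temporal order.
# Hypothetical data for illustration only.
fixations = [(1, 150), (1, 100), (2, 200), (1, 120), (4, 180)]


def word_measures(fixations, word):
    """Compute simplified gaze measures for one word (hypothetical helper)."""
    durations = [d for w, d in fixations if w == word]
    if not durations:
        # Skipped word: never fixated, so no duration can be measured.
        return {"skip": 1}
    # Simplified first pass: the first uninterrupted run of fixations on the word.
    start = next(i for i, (w, _) in enumerate(fixations) if w == word)
    first_run = []
    for w, d in fixations[start:]:
        if w != word:
            break
        first_run.append(d)
    return {
        "skip": 0,
        "firstfix.dur": durations[0],
        "firstrun.dur": sum(first_run),
        "dur": sum(durations),
        "firstrun.nfix": len(first_run),
        "nfix": len(durations),
        "reread": int(len(durations) > len(first_run)),
    }


measures_word1 = word_measures(fixations, 1)  # fixated three times, two in the first run
measures_word3 = word_measures(fixations, 3)  # never fixated
print(measures_word1)
print(measures_word3)
```

Word 3 is never fixated, so only `skip` is defined for it, which mirrors the Null gaze features of skipped words in the dataset.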


In [3]:
# following a paper cited on the MECO website, we use a subset of the gaze features
gaze_features = ["skip", "firstfix.dur", "firstrun.dur", "dur", "firstrun.nfix", "nfix", "refix", "reread"]
basic_features = ["trialid", "sentnum", "ianum", "ia", "lang", "uniform_id"]
df = df[basic_features + gaze_features]

In [4]:
df.head()

Unnamed: 0,trialid,sentnum,ianum,ia,lang,uniform_id,skip,firstfix.dur,firstrun.dur,dur,firstrun.nfix,nfix,refix,reread
1,1.0,1.0,1.0,Janus,du,du_1,0.0,154.0,154.0,400.0,1.0,2.0,0.0,1.0
2,1.0,1.0,2.0,is,du,du_1,1.0,,,,,,,
3,1.0,1.0,3.0,in,du,du_1,0.0,551.0,551.0,551.0,1.0,1.0,0.0,0.0
4,1.0,1.0,4.0,de,du,du_1,1.0,,,,,,,
5,1.0,1.0,5.0,oude,du,du_1,0.0,189.0,189.0,439.0,1.0,2.0,0.0,1.0


## Data Understanding

Some elements are Null. For every gaze feature except skip, the Null elements fall in the rows with skip == 1: a skipped word has no fixation, so nothing can be measured for it.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 855123 entries, 1 to 855123
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   trialid        855122 non-null  float64
 1   sentnum        855122 non-null  float64
 2   ianum          855122 non-null  float64
 3   ia             854741 non-null  object 
 4   lang           855122 non-null  object 
 5   uniform_id     855123 non-null  object 
 6   skip           855122 non-null  float64
 7   firstfix.dur   639530 non-null  float64
 8   firstrun.dur   639530 non-null  float64
 9   dur            639530 non-null  float64
 10  firstrun.nfix  639530 non-null  float64
 11  nfix           639530 non-null  float64
 12  refix          639454 non-null  float64
 13  reread         639530 non-null  float64
dtypes: float64(11), object(3)
memory usage: 97.9+ MB


In [6]:
df.describe()

Unnamed: 0,trialid,sentnum,ianum,skip,firstfix.dur,firstrun.dur,dur,firstrun.nfix,nfix,refix,reread
count,855122.0,855122.0,855122.0,855122.0,639530.0,639530.0,639530.0,639530.0,639530.0,639454.0,639530.0
mean,6.319812,5.100584,84.710652,0.252118,214.771812,274.000635,396.190598,1.291295,1.870305,0.270565,0.315846
std,3.44021,2.697842,51.443266,0.434229,94.834265,181.464901,332.095123,0.666067,1.378493,0.444252,0.464852
min,1.0,1.0,1.0,0.0,2.0,2.0,2.0,1.0,1.0,0.0,0.0
25%,3.0,3.0,41.0,0.0,156.0,171.0,199.0,1.0,1.0,0.0,0.0
50%,6.0,5.0,82.0,0.0,200.0,229.0,297.0,1.0,1.0,0.0,0.0
75%,9.0,7.0,124.0,1.0,255.0,324.0,478.0,1.0,2.0,1.0,1.0
max,12.0,16.0,243.0,1.0,12688.0,12688.0,15579.0,44.0,50.0,1.0,1.0


In [7]:
df.lang.unique()

array(['du', 'ee', 'fi', 'ge', 'gr', 'he', 'it', 'ko', 'en', 'no', nan,
       'ru', 'sp', 'tr'], dtype=object)

Keep a subset of the languages, chosen among those supported by mBERT (https://huggingface.co/bert-base-multilingual-cased).

- **German**
- **Italian**
- **Russian**
- **English**
- **Spanish**

In [8]:
# get only the languages that are necessary to the project
supported_languages = ["ge", "it", "ru", "en", "sp"]

In [9]:
df = df[df.lang.isin(supported_languages)]

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 402904 entries, 193910 to 823179
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   trialid        402904 non-null  float64
 1   sentnum        402904 non-null  float64
 2   ianum          402904 non-null  float64
 3   ia             402834 non-null  object 
 4   lang           402904 non-null  object 
 5   uniform_id     402904 non-null  object 
 6   skip           402904 non-null  float64
 7   firstfix.dur   292582 non-null  float64
 8   firstrun.dur   292582 non-null  float64
 9   dur            292582 non-null  float64
 10  firstrun.nfix  292582 non-null  float64
 11  nfix           292582 non-null  float64
 12  refix          292539 non-null  float64
 13  reread         292582 non-null  float64
dtypes: float64(11), object(3)
memory usage: 46.1+ MB


In [11]:
df.head()

Unnamed: 0,trialid,sentnum,ianum,ia,lang,uniform_id,skip,firstfix.dur,firstrun.dur,dur,firstrun.nfix,nfix,refix,reread
193910,1.0,1.0,1.0,In,ge,ge_1,0.0,164.0,164.0,164.0,1.0,1.0,0.0,0.0
193911,1.0,1.0,2.0,der,ge,ge_1,0.0,166.0,166.0,657.0,1.0,3.0,0.0,1.0
193912,1.0,1.0,3.0,alten,ge,ge_1,0.0,144.0,144.0,717.0,1.0,3.0,0.0,1.0
193913,1.0,1.0,4.0,römischen,ge,ge_1,0.0,219.0,219.0,1231.0,1.0,6.0,0.0,1.0
193914,1.0,1.0,5.0,Religion,ge,ge_1,0.0,151.0,151.0,1338.0,1.0,8.0,1.0,1.0


Notice that the rows with skip == 0 have no Null elements in the gaze_features, apart from refix, which is Null in 43 of them (292539 non-null out of 292582).

In [12]:
df[df.skip==0].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 292582 entries, 193910 to 823178
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   trialid        292582 non-null  float64
 1   sentnum        292582 non-null  float64
 2   ianum          292582 non-null  float64
 3   ia             292575 non-null  object 
 4   lang           292582 non-null  object 
 5   uniform_id     292582 non-null  object 
 6   skip           292582 non-null  float64
 7   firstfix.dur   292582 non-null  float64
 8   firstrun.dur   292582 non-null  float64
 9   dur            292582 non-null  float64
 10  firstrun.nfix  292582 non-null  float64
 11  nfix           292582 non-null  float64
 12  refix          292539 non-null  float64
 13  reread         292582 non-null  float64
dtypes: float64(11), object(3)
memory usage: 33.5+ MB


In [13]:
df[df.skip==1].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110322 entries, 193926 to 823179
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   trialid        110322 non-null  float64
 1   sentnum        110322 non-null  float64
 2   ianum          110322 non-null  float64
 3   ia             110259 non-null  object 
 4   lang           110322 non-null  object 
 5   uniform_id     110322 non-null  object 
 6   skip           110322 non-null  float64
 7   firstfix.dur   0 non-null       float64
 8   firstrun.dur   0 non-null       float64
 9   dur            0 non-null       float64
 10  firstrun.nfix  0 non-null       float64
 11  nfix           0 non-null       float64
 12  refix          0 non-null       float64
 13  reread         0 non-null       float64
dtypes: float64(11), object(3)
memory usage: 12.6+ MB
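
The link between skip and the Null elements can also be checked in one step, instead of comparing two `info()` outputs. The following is a small sketch on a made-up frame; `toy` and its values are assumptions, and in the notebook the same expression would be applied to `df` and its gaze columns.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the structure above (values are made up).
toy = pd.DataFrame({
    "skip": [0.0, 1.0, 0.0, 1.0],
    "dur": [400.0, np.nan, 551.0, np.nan],
    "nfix": [2.0, np.nan, 1.0, np.nan],
})

# Count Null elements per column, separately for fixated (0.0) and skipped (1.0) rows.
null_counts = toy[["dur", "nfix"]].isna().groupby(toy["skip"]).sum()
print(null_counts)
```

A row of zeros for skip == 0.0 confirms that the Null elements come only from skipped words.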


In addition, 70 rows have a Null ia. About 90% of them are skipped words, whose gaze features are all Null, so we drop these rows.

In [14]:
print("Probabilities of Null elements by columns, for the Null ia")
df[df.ia.isna()].isna().sum()/df[df.ia.isna()].shape[0]

Probabilities of Null elements by columns, for the Null ia


trialid          0.0
sentnum          0.0
ianum            0.0
ia               1.0
lang             0.0
uniform_id       0.0
skip             0.0
firstfix.dur     0.9
firstrun.dur     0.9
dur              0.9
firstrun.nfix    0.9
nfix             0.9
refix            0.9
reread           0.9
dtype: float64
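
The same fraction can be written more compactly, since the mean of a boolean column is the share of `True` values. This is a sketch on a made-up frame (`toy` is an assumption), not a change to the cell above.

```python
import numpy as np
import pandas as pd

# Made-up rows: three with a Null ia, two of which are skipped words.
toy = pd.DataFrame({
    "ia": [None, None, "word", None],
    "skip": [1.0, 1.0, 0.0, 0.0],
    "dur": [np.nan, np.nan, 300.0, 210.0],
})

# Fraction of Null elements per column, among the rows whose ia is Null.
null_fraction = toy[toy["ia"].isna()].isna().mean()
print(null_fraction)
```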

In [15]:
df = df[~df.ia.isna()]

Fill the remaining Null elements with 0. These are the gaze features of the skipped words, plus the few Null refix values of fixated words.

In [16]:
df = df.fillna(0)
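
`df.fillna(0)` fills every column of the frame. A more targeted variant, sketched below on a made-up frame, restricts the fill to the gaze columns, so that a Null element in any other column would stay visible. The `toy` frame and the shortened `gaze_cols` list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ia": ["In", "der", None],
    "skip": [0.0, 1.0, 1.0],
    "dur": [164.0, np.nan, np.nan],
    "nfix": [1.0, np.nan, np.nan],
})

gaze_cols = ["dur", "nfix"]  # hypothetical subset of the gaze features
filled = toy.copy()
# Only the gaze columns are filled; the Null ia is left untouched.
filled[gaze_cols] = filled[gaze_cols].fillna(0)
print(filled)
```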

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 402834 entries, 193910 to 823179
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   trialid        402834 non-null  float64
 1   sentnum        402834 non-null  float64
 2   ianum          402834 non-null  float64
 3   ia             402834 non-null  object 
 4   lang           402834 non-null  object 
 5   uniform_id     402834 non-null  object 
 6   skip           402834 non-null  float64
 7   firstfix.dur   402834 non-null  float64
 8   firstrun.dur   402834 non-null  float64
 9   dur            402834 non-null  float64
 10  firstrun.nfix  402834 non-null  float64
 11  nfix           402834 non-null  float64
 12  refix          402834 non-null  float64
 13  reread         402834 non-null  float64
dtypes: float64(11), object(3)
memory usage: 46.1+ MB


## Merge together samples of different readers from the same language

Average each word (ia) over all the readers of the same language.

In [18]:
# inspect one word across readers, before averaging over trialid, sentnum, lang, ianum, ia
df[(df.sentnum == 1) & (df.lang == "ge") & (df.ianum == 1) & (df.trialid == 1) & (df.ia == "In")].head(30)

Unnamed: 0,trialid,sentnum,ianum,ia,lang,uniform_id,skip,firstfix.dur,firstrun.dur,dur,firstrun.nfix,nfix,refix,reread
193910,1.0,1.0,1.0,In,ge,ge_1,0.0,164.0,164.0,164.0,1.0,1.0,0.0,0.0
195937,1.0,1.0,1.0,In,ge,ge_2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
197633,1.0,1.0,1.0,In,ge,ge_3,0.0,180.0,180.0,180.0,1.0,1.0,0.0,0.0
199660,1.0,1.0,1.0,In,ge,ge_4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
201181,1.0,1.0,1.0,In,ge,ge_5,0.0,135.0,135.0,135.0,1.0,1.0,0.0,0.0
202535,1.0,1.0,1.0,In,ge,ge_6,0.0,144.0,144.0,144.0,1.0,1.0,0.0,0.0
204562,1.0,1.0,1.0,In,ge,ge_8,0.0,37.0,37.0,37.0,1.0,1.0,0.0,0.0
206589,1.0,1.0,1.0,In,ge,ge_9,0.0,145.0,145.0,145.0,1.0,1.0,0.0,0.0
208616,1.0,1.0,1.0,In,ge,ge_10,0.0,333.0,333.0,333.0,1.0,1.0,0.0,0.0
210643,1.0,1.0,1.0,In,ge,ge_11,0.0,141.0,141.0,141.0,1.0,1.0,0.0,0.0


In [19]:
group_by_cols = ["trialid", "sentnum", "lang", "ianum", "ia"]
grouped_cols = ["skip", "firstrun.dur", "dur", "firstrun.nfix", "nfix", "refix", "reread"]

In [20]:
grouped_df = df.groupby(group_by_cols)[grouped_cols].mean()

In [21]:
grouped_df.head()

trialid,sentnum,lang,ianum,ia,skip,firstrun.dur,dur,firstrun.nfix,nfix,refix,reread
1.0,1.0,en,1.0,In,0.605263,52.631579,76.789474,0.421053,0.526316,0.026316,0.078947
1.0,1.0,en,2.0,ancient,0.026316,322.394737,615.368421,1.578947,2.894737,0.578947,0.631579
1.0,1.0,en,3.0,Roman,0.052632,258.131579,516.605263,1.236842,2.526316,0.236842,0.657895
1.0,1.0,en,4.0,religion,0.0,298.078947,683.684211,1.342105,3.052632,0.473684,0.631579
1.0,1.0,en,5.0,and,0.473684,129.868421,172.552632,0.578947,0.789474,0.052632,0.157895
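
The effect of averaging can be seen on a toy example: the mean of the binary skip column is the proportion of readers who skipped the word, and because skipped words were filled with 0, the mean duration also counts those readers as 0 ms. The frame below is made up, and the `fixated_only` variant is shown only for comparison, not as the method used in this notebook.

```python
import pandas as pd

# One word read by four hypothetical readers, two of whom skipped it.
toy = pd.DataFrame({
    "lang": ["en", "en", "en", "en"],
    "ianum": [1, 1, 1, 1],
    "ia": ["In", "In", "In", "In"],
    "uniform_id": ["en_1", "en_2", "en_3", "en_4"],
    "skip": [0.0, 1.0, 1.0, 0.0],
    "dur": [160.0, 0.0, 0.0, 200.0],
})
keys = ["lang", "ianum", "ia"]

# Mean over all readers, as in the notebook: zeros from skipped readers are included.
per_word = toy.groupby(keys)[["skip", "dur"]].mean()

# For comparison: mean duration over the readers who fixated the word.
fixated_only = toy[toy["skip"] == 0].groupby(keys)["dur"].mean()

print(per_word)
print(fixated_only)
```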


In [22]:
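# Move the index levels back into columns (each call moves the outermost level to the front, so the column order ends up reversed)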

grouped_df = grouped_df.reset_index(level=0).reset_index(level=0).reset_index(level=0).reset_index(level=0).reset_index(level=0)
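
A single `reset_index()` call without arguments moves all index levels to columns at once. The sketch below uses a made-up frame; note that it keeps the levels in their grouping order, whereas the chained `reset_index(level=0)` calls produce the reversed order.

```python
import pandas as pd

toy = pd.DataFrame({
    "trialid": [1, 1],
    "sentnum": [1, 1],
    "lang": ["en", "en"],
    "ianum": [1, 2],
    "ia": ["In", "ancient"],
    "skip": [0.6, 0.0],
})
grouped = toy.groupby(["trialid", "sentnum", "lang", "ianum", "ia"])[["skip"]].mean()

# All five index levels become ordinary columns in one call.
flat = grouped.reset_index()
print(list(flat.columns))
```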

In [23]:
grouped_df.head()

Unnamed: 0,ia,ianum,lang,sentnum,trialid,skip,firstrun.dur,dur,firstrun.nfix,nfix,refix,reread
0,In,1.0,en,1.0,1.0,0.605263,52.631579,76.789474,0.421053,0.526316,0.026316,0.078947
1,ancient,2.0,en,1.0,1.0,0.026316,322.394737,615.368421,1.578947,2.894737,0.578947,0.631579
2,Roman,3.0,en,1.0,1.0,0.052632,258.131579,516.605263,1.236842,2.526316,0.236842,0.657895
3,religion,4.0,en,1.0,1.0,0.0,298.078947,683.684211,1.342105,3.052632,0.473684,0.631579
4,and,5.0,en,1.0,1.0,0.473684,129.868421,172.552632,0.578947,0.789474,0.052632,0.157895


Number the sentences with a single global id, removing the dependence on trialid, as done in https://arxiv.org/abs/2104.05433. The original per-trial sentence number is kept in trial_sentnum.

In [24]:
grouped_df["trial_sentnum"] = grouped_df["sentnum"]
grouped_df["sentnum"] = grouped_df["sentnum"].astype("string") + grouped_df["trialid"].astype("string") + grouped_df["lang"].astype("string")
grouped_df.sentnum = grouped_df.sentnum.astype('category').cat.codes
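
Category codes are assigned in the lexicographic order of the concatenated strings, so the resulting ids are unique per sentence but do not necessarily follow reading order. If ordered ids were wanted, one possible alternative is `groupby(...).ngroup()`, which numbers the groups in sorted key order. This is a sketch on a made-up frame, not the approach used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "lang": ["en", "en", "en", "ge"],
    "trialid": [1, 1, 2, 1],
    "sentnum": [1, 2, 1, 1],
    "ia": ["In", "Janus", "The", "In"],
})

# Keep the per-trial sentence number, then replace sentnum with a global id.
toy["trial_sentnum"] = toy["sentnum"]
# ngroup() numbers each (lang, trialid, trial_sentnum) group by sorted key order.
toy["sentnum"] = toy.groupby(["lang", "trialid", "trial_sentnum"]).ngroup()
print(toy)
```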

In [25]:
grouped_df.head()

Unnamed: 0,ia,ianum,lang,sentnum,trialid,skip,firstrun.dur,dur,firstrun.nfix,nfix,refix,reread,trial_sentnum
0,In,1.0,en,0,1.0,0.605263,52.631579,76.789474,0.421053,0.526316,0.026316,0.078947,1.0
1,ancient,2.0,en,0,1.0,0.026316,322.394737,615.368421,1.578947,2.894737,0.578947,0.631579,1.0
2,Roman,3.0,en,0,1.0,0.052632,258.131579,516.605263,1.236842,2.526316,0.236842,0.657895,1.0
3,religion,4.0,en,0,1.0,0.0,298.078947,683.684211,1.342105,3.052632,0.473684,0.631579,1.0
4,and,5.0,en,0,1.0,0.473684,129.868421,172.552632,0.578947,0.789474,0.052632,0.157895,1.0


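Rename the columns: after averaging over readers, the binary features are proportions and can be read as probabilities.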

In [26]:
grouped_df.rename(columns={"skip" : "prob_skip", "refix" : "prob_refix", "reread" : "prob_reread"}, inplace=True)

In [27]:
grouped_df.head()

Unnamed: 0,ia,ianum,lang,sentnum,trialid,prob_skip,firstrun.dur,dur,firstrun.nfix,nfix,prob_refix,prob_reread,trial_sentnum
0,In,1.0,en,0,1.0,0.605263,52.631579,76.789474,0.421053,0.526316,0.026316,0.078947,1.0
1,ancient,2.0,en,0,1.0,0.026316,322.394737,615.368421,1.578947,2.894737,0.578947,0.631579,1.0
2,Roman,3.0,en,0,1.0,0.052632,258.131579,516.605263,1.236842,2.526316,0.236842,0.657895,1.0
3,religion,4.0,en,0,1.0,0.0,298.078947,683.684211,1.342105,3.052632,0.473684,0.631579,1.0
4,and,5.0,en,0,1.0,0.473684,129.868421,172.552632,0.578947,0.789474,0.052632,0.157895,1.0


Check for correlated features.

Several features are strongly correlated: for example, dur and nfix (0.96), and firstrun.dur and firstrun.nfix (0.94).

In [28]:
grouped_df[["prob_skip", "firstrun.dur", "dur", "firstrun.nfix", "nfix", "prob_refix", "prob_reread"]].corr()

Unnamed: 0,prob_skip,firstrun.dur,dur,firstrun.nfix,nfix,prob_refix,prob_reread
prob_skip,1.0,-0.778231,-0.731378,-0.828764,-0.772671,-0.574849,-0.600154
firstrun.dur,-0.778231,1.0,0.87637,0.937817,0.835661,0.78706,0.51592
dur,-0.731378,0.87637,1.0,0.849587,0.961604,0.79005,0.771978
firstrun.nfix,-0.828764,0.937817,0.849587,1.0,0.89144,0.840413,0.549345
nfix,-0.772671,0.835661,0.961604,0.89144,1.0,0.827222,0.801388
prob_refix,-0.574849,0.78706,0.79005,0.840413,0.827222,1.0,0.524673
prob_reread,-0.600154,0.51592,0.771978,0.549345,0.801388,0.524673,1.0
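
Strongly correlated pairs can be listed programmatically instead of being read off the matrix. The following is a minimal sketch on made-up data; the column names `a`, `b`, `c` and the `threshold` value are assumptions, and in the notebook the same steps would start from the `.corr()` result above.

```python
import numpy as np
import pandas as pd

# Made-up values: b is nearly 2 * a, c is only weakly related to both.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
corr = toy.corr()

# Keep the strict upper triangle so each pair appears once, without the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.9  # hypothetical cut-off for "strongly correlated"
high = pairs[pairs.abs() > threshold]
print(high)
```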


### Cast features to correct types, save data to csv

In [32]:
grouped_df.ianum = grouped_df.ianum.astype(int)
grouped_df.trialid = grouped_df.trialid.astype(int)
grouped_df.trial_sentnum = grouped_df.trial_sentnum.astype(int)

grouped_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11908 entries, 0 to 11907
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ia             11908 non-null  object 
 1   ianum          11908 non-null  int64  
 2   lang           11908 non-null  object 
 3   sentnum        11908 non-null  int16  
 4   trialid        11908 non-null  int64  
 5   prob_skip      11908 non-null  float64
 6   firstrun.dur   11908 non-null  float64
 7   dur            11908 non-null  float64
 8   firstrun.nfix  11908 non-null  float64
 9   nfix           11908 non-null  float64
 10  prob_refix     11908 non-null  float64
 11  prob_reread    11908 non-null  float64
 12  trial_sentnum  11908 non-null  int64  
dtypes: float64(7), int16(1), int64(3), object(2)
memory usage: 1.1+ MB


In [33]:
grouped_df.to_csv("cleaned_data.csv")
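
By default `to_csv` also writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back without `index_col=0`. If the RangeIndex is not needed, it can be omitted with `index=False`. The sketch below uses an in-memory buffer and made-up rows rather than the real file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"ia": ["In", "ancient"], "prob_skip": [0.605263, 0.026316]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # do not write the RangeIndex column
buffer.seek(0)

restored = pd.read_csv(buffer)
print(list(restored.columns))
```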