## Computational Human Reading Prediction

The study of eye movements is of great interest to neuroscience, since they reflect the cognitive processes that underlie visual tasks, in particular reading. An important variable for explaining these movements is Predictability (a term that arguably deserves a better name): the prediction a reader makes about the upcoming word while reading. In fields such as neurolinguistics, Predictability is not estimated from the text itself but from the responses of other readers, who are asked to fill in the word that follows a given context.
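One common way to turn such fill-in responses into a number is the fraction of respondents who guessed the target word. The following is a minimal sketch under that assumption, not necessarily the procedure used for this dataset; the function name `cloze_predictability` and the example guesses are hypothetical.

```python
from collections import Counter


def cloze_predictability(responses, target):
    """Fraction of respondents whose guess matches the target word."""
    if not responses:
        raise ValueError("need at least one response")
    # Normalize case and surrounding whitespace before counting
    counts = Counter(r.strip().lower() for r in responses)
    return counts[target.strip().lower()] / len(responses)


# Hypothetical guesses for the word that follows "El gato atrapó muchos ..."
guesses = ["ratones", "ratones", "pájaros", "ratones", "peces"]
p_ratones = cloze_predictability(guesses, "ratones")
p_perros = cloze_predictability(guesses, "perros")
```

A word nobody guesses gets a value of zero, which is one reason such estimates need many respondents per context.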

In parallel, the field of NLP has pursued the automatic estimation of this kind of prediction as one of its goals. A simple yet successful example is the n-gram model, in which the probability of a word is computed from how often it follows its context in a large text corpus, a corpus that stands in for the reader's knowledge of the language.
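To make the n-gram idea concrete, here is a minimal bigram sketch with add-k smoothing. The three-sentence corpus, the function names, and the choice of smoothing are illustrative assumptions, not the model used in this work.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens[1:])            # words that can be predicted
        context_counts.update(tokens[:-1])  # words that act as a context
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, vocab


def bigram_prob(word, prev, model, k=1.0):
    """P(word | prev) with add-k smoothing over the training vocabulary."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * len(vocab))


# Toy corpus standing in for the large corpus described above
corpus = ["el gato atrapó muchos ratones", "el gato duerme", "el perro duerme"]
model = train_bigram(corpus)
p_gato = bigram_prob("gato", "el", model)
p_perro = bigram_prob("perro", "el", model)
```

In this toy corpus "gato" follows "el" more often than "perro" does, so it receives the higher probability; smoothing keeps unseen continuations from getting probability zero.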

This model can be extended to use recently read text when computing these probabilities (cache n-gram). This work proposes estimating the predictability of a word automatically by combining components from different variants of language models. On our dataset, we show that the new automatic predictability is as effective as human-derived predictability at explaining eye movements, and that it is much easier to interpret, cheaper, and faster to obtain, since it does not require experiments involving a large number of participants.
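One common formulation of a cache model linearly interpolates a static n-gram probability with the relative frequency of the word in a window of recently read text. The sketch below assumes that formulation; the class `UnigramCache`, the window size, the weight `lam`, and the static probability of 0.1 are all hypothetical and not taken from this work.

```python
from collections import deque


class UnigramCache:
    """Relative frequency of words within the last `size` words read."""

    def __init__(self, size=100):
        self.window = deque(maxlen=size)  # oldest words fall out automatically

    def add(self, word):
        self.window.append(word)

    def prob(self, word):
        if not self.window:
            return 0.0
        return self.window.count(word) / len(self.window)


def cache_interpolated_prob(static_prob, cache_prob, lam=0.2):
    """Mix the static model and the cache with weight `lam` on the cache."""
    return (1.0 - lam) * static_prob + lam * cache_prob


cache = UnigramCache(size=4)
for w in "el gato vio al gato".split():
    cache.add(w)

# Hypothetical static n-gram probability for "gato" in the current context
static_p = 0.1
mixed_p = cache_interpolated_prob(static_p, cache.prob("gato"), lam=0.2)
```

Because "gato" appears twice in the four-word window, the interpolated probability rises above the static estimate, which is the effect a cache is meant to capture.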

In [1]:
import numpy as np
import pandas as pd
from os import walk 
from tqdm import tqdm

In [2]:
filepath="../data/LMM-CBP/csv_in"
data=[]

# Walk the data directory and load every file found
for root, dirs, files in tqdm(walk(filepath)):
    for file in files:
        # Build the path from root (not filepath) so files in subfolders resolve
        data.append(pd.read_csv(f"{root}/{file}", delimiter=";"))

df=pd.concat(data)
print(df.shape)


1it [00:22, 22.61s/it]


(2586948, 140)


In [3]:
df.describe()

Unnamed: 0,suj_id,n_orac,palnum,tipo,pred,freq,length,MaxJump,bad_epoch,stopword,...,E119,E120,E121,E122,E123,E124,E125,E126,E127,E128
count,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,...,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586947.0
mean,14.5,94.85284,4.375697,1.016722,-0.2119133,43773.67,4.341137,0.3199554,0.6005733,0.4760312,...,-0.0937847,-0.003066079,-0.04883013,0.01872346,-0.04121687,-0.04585046,0.04162723,0.1097454,0.1410374,0.1750807
std,8.077749,49.2555,2.373298,0.8517471,0.9609049,69408.15,2.360397,1.217291,0.4897807,0.4994253,...,11.99427,12.87864,12.01152,12.30401,12.80992,13.57826,14.6018,12.68019,12.12173,12.77105
min,1.0,11.0,1.0,0.0,-1.2553,0.0,1.0,-1.0,0.0,0.0,...,-580.8102,-723.6652,-725.82,-726.2158,-753.0132,-749.0185,-752.7536,-754.1962,-751.8157,-752.0677
25%,7.75,54.0,2.0,0.0,-1.2553,194.0,2.0,-1.0,0.0,0.0,...,-6.57413,-6.429716,-6.499453,-6.574852,-6.698232,-6.586955,-7.274955,-6.762504,-6.425458,-6.377016
50%,14.5,93.0,4.0,1.0,-0.30103,2718.0,4.0,0.0,1.0,0.0,...,-0.1456375,-0.1038615,-0.08447389,-0.02715446,-0.03462234,-0.01620925,-0.003477534,0.03852869,0.09089095,0.08002501
75%,21.25,138.0,6.0,2.0,0.69897,62214.0,6.0,2.0,1.0,1.0,...,6.298971,6.258519,6.357286,6.577438,6.600107,6.479055,7.341053,6.925016,6.65764,6.611491
max,28.0,197.0,12.0,2.0,1.2553,264721.0,12.0,2.0,1.0,1.0,...,196.7297,1282.204,188.5576,221.9662,189.5408,363.4467,214.7726,170.7255,169.2756,254.3207


In [4]:
# Rebuild the sentences, treating a word that contains a period as the end of a sentence
words=[]
sents=[]
for word in df.loc[df.suj_id==1, ' pal'].to_list():
    words.append(word)
    if "." in word:
        # Close the current sentence and start a new one
        sents.append(" ".join(words))
        words=[]

df_phrases=pd.DataFrame({"sentences":sents})
df_phrases.head()

Unnamed: 0,sentences
0,La picadura de ciertas arañas puede ser mortal.
1,Cuando hay hambre no hay pan duro.
2,La película terminó de forma extraña.
3,El gato atrapó muchos ratones.
4,Sobre gustos no hay nada escrito.


In [9]:
# Save to CSV
# df_phrases.to_csv("sentences.csv", index=False)

In [66]:
# Select a subset of the DataFrame's columns
df_subset=df[
    [
        'suj_id',
        ' pal',
        ' palnum',
        ' freq',
        ' length',
        ' time',
    ]
].copy()
df_subset.columns=["suj_id", "pal", "palnum", "freq", "length", "time"]

In [70]:
# Assign sentence IDs, using a word that contains a period as the sentence boundary.
# Note: the same sentence read by different subjects does not get the same ID.
sent_grp_count=0
sent_id=[]
for row in tqdm(df_subset.itertuples(index=False), total=len(df_subset)):
    sent_id.append(sent_grp_count)
    # A period marks the last word of the current sentence
    if "." in row.pal:
        sent_grp_count+=1
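If sentence IDs should instead restart for each subject, so that the same sentence gets the same ID across subjects, a vectorized variant is possible. This is a sketch on a hypothetical toy frame that reuses the `suj_id` and `pal` column names; it assumes each subject's rows are in reading order and that the index is unique (a concatenated frame may need `reset_index(drop=True)` first).

```python
import pandas as pd

# Hypothetical toy data: two subjects reading the same two short sentences
toy = pd.DataFrame({
    "suj_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "pal": ["El", "gato.", "La", "casa.", "El", "gato.", "La", "casa."],
})

# 1 where the word closes a sentence, 0 otherwise
ends_sentence = toy["pal"].str.contains(".", regex=False).astype(int)

# Sentences completed before each row, counted separately per subject
toy["sent_id"] = ends_sentence.groupby(toy["suj_id"]).cumsum() - ends_sentence
```

Subtracting the flag itself keeps the word that carries the period in the sentence it closes, and avoiding a Python-level loop matters on a frame with millions of rows.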

In [67]:
df_subset.groupby(
    ["suj_id", ""]
)

Unnamed: 0,suj_id,pal,palnum,freq,length,time
0,1,La,1,192476,2,-101.5625
1,1,picadura,2,9,8,-101.5625
2,1,de,3,264721,2,-101.5625
3,1,ciertas,4,380,7,-101.5625
4,1,arañas,5,20,6,-101.5625
...,...,...,...,...,...,...
25111,28,la,5,192476,2,664.0625
25112,28,panadería,6,10,9,664.0625
25113,28,cocinan,7,7,7,664.0625
25114,28,el,8,139594,2,664.0625
