## Computational Human Reading Prediction

The study of eye movements is of great interest to neuroscience, since they reflect the cognitive processes that underlie visual tasks, particularly reading. An important variable for determining these movements is called Predictability (a name that could probably be improved). Predictability represents the prediction a reader makes about the upcoming word during reading. In fields such as neurolinguistics, this variable is not estimated from the text itself but from the responses of other readers, who are asked to fill in the word that follows a given context.
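
As a rough illustration of the human estimate described above, the sketch below computes predictability as the fraction of readers who filled in the target word. The function name `cloze_predictability` and the responses are hypothetical, and the dataset used in this notebook may store a transformed version of this value.

```python
from collections import Counter


def cloze_predictability(responses, target):
    """Fraction of responses that match the target word (case-insensitive)."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(r.strip().lower() for r in responses)
    return counts[target.strip().lower()] / len(responses)


# Hypothetical answers for the context "La picadura de ciertas arañas puede ser ..."
responses = ["mortal", "peligrosa", "mortal", "dolorosa", "Mortal"]
predictability = cloze_predictability(responses, "mortal")
print(predictability)
```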

In parallel, the field of NLP has pursued the automatic estimation of this type of prediction as one of its goals. A simple yet successful example is the n-gram model, in which the probability of a word is built from how often its context appears in a large text corpus that stands in for the reader's knowledge of the language.
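
A minimal sketch of the n-gram idea for n = 2, assuming a plain maximum-likelihood estimate with no smoothing. The helper names and the toy corpus are illustrative and are not taken from the models used in this work.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams over a list of tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts


def bigram_probability(prev, word, context_counts, bigram_counts):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


# Toy corpus standing in for the large corpus mentioned in the text
toy_corpus = [
    ["no", "hay", "pan", "duro", "."],
    ["no", "hay", "nada", "escrito", "."],
    ["hay", "pan", "."],
]
context_counts, bigram_counts = train_bigram_model(toy_corpus)
p_pan = bigram_probability("hay", "pan", context_counts, bigram_counts)
print(p_pan)
```

An unsmoothed estimate assigns zero probability to any bigram absent from the corpus, which is one reason practical models add smoothing or backoff.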

This model can be extended to use recently read text when computing those probabilities (cache n-gram). In this work, we propose to estimate the predictability of a word automatically by combining parts of distinct language-model variants. For our dataset, we show that the new automatic predictability explains eye movements as well as human predictability does, and that it is easier to interpret, cheaper, and faster to obtain, since it does not require experiments involving a large number of people.
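
The cache idea can be sketched as a linear interpolation between a corpus-based probability and a probability estimated from recently read words. This is one common formulation and only an assumption here: the function names, the `cache_weight` value, and the base probability are hypothetical and do not describe the exact model used in this work.

```python
from collections import Counter


def cache_probability(word, recent_tokens):
    """Relative frequency of `word` among the recently read tokens."""
    if not recent_tokens:
        return 0.0
    return Counter(recent_tokens)[word] / len(recent_tokens)


def interpolated_probability(word, base_probability, recent_tokens, cache_weight=0.2):
    """Mix a corpus-based probability with the cache estimate."""
    if not 0.0 <= cache_weight <= 1.0:
        raise ValueError("cache_weight must be between 0 and 1")
    cached = cache_probability(word, recent_tokens)
    return (1.0 - cache_weight) * base_probability + cache_weight * cached


# Hypothetical values: 0.01 stands in for the n-gram probability of "gato"
recent = ["el", "gato", "atrapó", "muchos", "ratones", "el", "gato"]
p_gato = interpolated_probability("gato", 0.01, recent)
print(p_gato)
```

A word that was read recently gets a higher probability than the corpus alone would give it, which is the effect the cache is meant to capture.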

#### NLTK Set up

Follow [1.2 Getting Started with NLTK](https://www.nltk.org/book/ch01.html)

In [42]:
import re
import pprint

import nltk
from nltk import word_tokenize
from nltk.corpus import cess_esp as cess
from nltk.corpus import spanish_grammars as sg
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt
from nltk.util import ngrams

# Run once, or as many times as you need, to configure which files you want access to.
# nltk.download()

# Assumed definition: later cells use `cess_sents`, which is never defined in this notebook.
cess_sents = cess.tagged_sents()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## Other Packages that are useful

You probably already know [numpy](https://numpy.org/) and [pandas](https://pandas.pydata.org/). In case you are not familiar with them, both are linked here.

[os](https://docs.python.org/3/library/os.html) is very useful for exploring folders on your computer.

[tqdm](https://tqdm.github.io/) is very useful for tracking the progress of a long-running process.

In [2]:
import numpy as np
import pandas as pd
from os import walk 
from tqdm import tqdm

In [3]:
filepath="../data/LMM-CBP/csv_in"
data=[]

# Walk through the folders to load the files
for root, dirs, files in tqdm(walk(filepath)):
    for file in files:
        temp=pd.read_csv(f"{filepath}/{file}", delimiter=";")
        data.append(temp)

del temp

df=pd.concat(data)
print(df.shape)


  temp=pd.read_csv(f"{filepath}/{file}", delimiter=";")
1it [00:22, 22.76s/it]


(2586948, 140)


In [4]:
df.describe()

Unnamed: 0,suj_id,n_orac,palnum,tipo,pred,freq,length,MaxJump,bad_epoch,stopword,...,E119,E120,E121,E122,E123,E124,E125,E126,E127,E128
count,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,...,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586948.0,2586947.0
mean,14.5,94.85284,4.375697,1.016722,-0.2119133,43773.67,4.341137,0.3199554,0.6005733,0.4760312,...,-0.0937847,-0.003066079,-0.04883013,0.01872346,-0.04121687,-0.04585046,0.04162723,0.1097454,0.1410374,0.1750807
std,8.077749,49.2555,2.373298,0.8517471,0.9609049,69408.15,2.360397,1.217291,0.4897807,0.4994253,...,11.99427,12.87864,12.01152,12.30401,12.80992,13.57826,14.6018,12.68019,12.12173,12.77105
min,1.0,11.0,1.0,0.0,-1.2553,0.0,1.0,-1.0,0.0,0.0,...,-580.8102,-723.6652,-725.82,-726.2158,-753.0132,-749.0185,-752.7536,-754.1962,-751.8157,-752.0677
25%,7.75,54.0,2.0,0.0,-1.2553,194.0,2.0,-1.0,0.0,0.0,...,-6.57413,-6.429716,-6.499453,-6.574852,-6.698232,-6.586955,-7.274955,-6.762504,-6.425458,-6.377016
50%,14.5,93.0,4.0,1.0,-0.30103,2718.0,4.0,0.0,1.0,0.0,...,-0.1456375,-0.1038615,-0.08447389,-0.02715446,-0.03462234,-0.01620925,-0.003477534,0.03852869,0.09089095,0.08002501
75%,21.25,138.0,6.0,2.0,0.69897,62214.0,6.0,2.0,1.0,1.0,...,6.298971,6.258519,6.357286,6.577438,6.600107,6.479055,7.341053,6.925016,6.65764,6.611491
max,28.0,197.0,12.0,2.0,1.2553,264721.0,12.0,2.0,1.0,1.0,...,196.7297,1282.204,188.5576,221.9662,189.5408,363.4467,214.7726,170.7255,169.2756,254.3207


In [5]:
# Rebuild the sentences, using the period as the end-of-sentence marker
frase=""
sents=[]
for word in df.loc[df.suj_id==1][' pal'].to_list():
    if frase=="":
        frase=word
    elif word.find(".")>-1:
        frase=frase+" "+word
        sents.append(frase)
        frase=""
    else:
        frase=frase+" "+word

df_phrases=pd.DataFrame({"sentences":sents})
df_phrases.head()

Unnamed: 0,sentences
0,La picadura de ciertas arañas puede ser mortal.
1,Cuando hay hambre no hay pan duro.
2,La película terminó de forma extraña.
3,El gato atrapó muchos ratones.
4,Sobre gustos no hay nada escrito.


In [9]:
# Save to CSV
# df_phrases.to_csv("sentences.csv", index=0)

In [6]:
# Take a subset of the df columns
df_subset=df[
    [
        'suj_id',
        ' pal',
        ' palnum',
        ' freq',
        ' length',
        ' time',
    ]
].copy()
df_subset.columns=["suj_id", "pal", "palnum", "freq", "length", "time"]
df_subset.reset_index(inplace=True)

In [7]:
# Assign sentence IDs, using the period as the end-of-sentence marker
# The same sentence read by different subjects does not receive the same ID.
sent_grp_count=0
sent_id=[]
for i,row in tqdm(df_subset.iterrows()):
     sent_id.append(sent_grp_count)
     word=row.pal
     word_period=word.find(".")
     if word_period>-1:
          sent_grp_count+=1

df_subset['frase_id']=sent_id

2586948it [00:47, 54218.91it/s]


While it is helpful that the last word of each sentence carries a period, I actually want a form in which the period is a separate token. For example:

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']

In this form, we can use NLTK.

In [8]:
# Take the first group of sentences (the first subject). It does not matter which one you take.
lectura=[]
lectura_frases=[] 
prev_frase_id=0
for i,row in tqdm(df_subset.loc[df_subset.suj_id==df_subset.suj_id[0]].iterrows()):
    # Check whether the next word belongs to the same sentence as the previous word
    if row.frase_id==prev_frase_id:
        lectura.append(row.pal)

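    # Otherwise, strip the period from the last word,
    # append the period to the word list as its own token,
    # and append the word list to the list of sentences.
    # Note: a sentence is only collected here, once the next sentence starts,
    # so the last sentence of the subject remains in `lectura` after the loop.
    else: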
        ult_pal=lectura[-1]
        ult_pal=ult_pal.replace(".", "")
        lectura[-1]=ult_pal
        lectura.append(".")
        lectura_frases.append(lectura)
        
        prev_frase_id=row.frase_id
        lectura=[]
        lectura.append(row.pal)

92391it [00:02, 44374.03it/s]


In [9]:
# Check the last element of the list of sentences
lectura_frases[-1]

['No', 'te', 'des', 'por', 'vencido', 'ni', 'aún', 'vencido', '.']
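
The loop in cell `In [8]` appends a sentence only when it reaches the first word of the next one, so the final sentence is never added to `lectura_frases`. The sketch below is one way to do the same split on a plain list of words while also flushing the last sentence. `split_sentences` is a hypothetical helper, not part of the original notebook.

```python
def split_sentences(words):
    """Group words into sentences, emitting the final period as its own token."""
    sentences = []
    current = []
    for word in words:
        if "." in word:
            stripped = word.replace(".", "")
            if stripped:
                current.append(stripped)
            current.append(".")
            sentences.append(current)
            current = []
        else:
            current.append(word)
    if current:  # flush a trailing sentence that has no final period
        sentences.append(current)
    return sentences


words = ["El", "gato", "atrapó", "muchos", "ratones.",
         "Sobre", "gustos", "no", "hay", "nada", "escrito."]
sentences = split_sentences(words)
print(sentences[-1])
```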

An easier way is to use NLTK's `word_tokenize`.

For example:

In [25]:
print(
    word_tokenize(df_phrases.sentences[0], language="spanish")
)
df_phrases.head(3)

['La', 'picadura', 'de', 'ciertas', 'arañas', 'puede', 'ser', 'mortal', '.']


Unnamed: 0,sentences
0,La picadura de ciertas arañas puede ser mortal.
1,Cuando hay hambre no hay pan duro.
2,La película terminó de forma extraña.


In [45]:
tokenized=[]
bigram=[]
bigram_pos_tag=[]
for i,row in tqdm(df_phrases.iterrows()):
    tokenized.append(word_tokenize(row.sentences, language="spanish"))
    bigrams=ngrams(tokenized[-1], 2)
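    # Note: nltk.pos_tag defaults to the English tagger (lang="eng"),
    # so the tags it assigns to these Spanish tokens are unreliable.
    bigram_pos_tag.append(nltk.pos_tag(tokenized[-1]))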
    bigram.append(list(bigrams))
df_phrases["sentences_tokenized"]=tokenized
df_phrases["bigrams"]=bigram
df_phrases["bigrams_pos_tag"]=bigram_pos_tag

12360it [00:06, 2014.02it/s]


In [46]:
df_phrases.head(3)

Unnamed: 0,sentences,sentences_tokenized,bigrams,bigrams_pos_tag
0,La picadura de ciertas arañas puede ser mortal.,"[La, picadura, de, ciertas, arañas, puede, ser...","[(La, picadura), (picadura, de), (de, ciertas)...","[(La, NNP), (picadura, FW), (de, FW), (ciertas..."
1,Cuando hay hambre no hay pan duro.,"[Cuando, hay, hambre, no, hay, pan, duro, .]","[(Cuando, hay), (hay, hambre), (hambre, no), (...","[(Cuando, NNP), (hay, NN), (hambre, NN), (no, ..."
2,La película terminó de forma extraña.,"[La, película, terminó, de, forma, extraña, .]","[(La, película), (película, terminó), (terminó...","[(La, NNP), (película, NN), (terminó, NN), (de..."


In [47]:
df_phrases.bigrams_pos_tag

0        [(La, NNP), (picadura, FW), (de, FW), (ciertas...
1        [(Cuando, NNP), (hay, NN), (hambre, NN), (no, ...
2        [(La, NNP), (película, NN), (terminó, NN), (de...
3        [(El, NNP), (gato, NN), (atrapó, NN), (muchos,...
4        [(Sobre, NNP), (gustos, VBZ), (no, DT), (hay, ...
                               ...                        
12355    [(No, DT), (hagas, NN), (promesas, NN), (que, ...
12356    [(El, NNP), (que, VBZ), (a, DT), (hierro, NN),...
12357    [(Los, NNP), (loros, JJ), (comieron, NN), (la,...
12358    [(No, DT), (te, NN), (des, VBZ), (por, JJ), (v...
12359    [(Lucifer, NNP), (es, JJ), (uno, NN), (de, IN)...
Name: bigrams_pos_tag, Length: 12360, dtype: object

In [None]:
# Train the unigram tagger
# uni_tag = ut(cess_sents)
# uni_tag.tag(df_phrases.bigrams[0])
nltk.pos_tag(df_phrases.bigrams[0])

In [38]:
cess_sents[:train]

[[('El', 'da0ms0'), ('grupo', 'ncms000'), ('estatal', 'aq0cs0'), ('Electricité_de_France', 'np00000'), ('-Fpa-', 'Fpa'), ('EDF', 'np00000'), ('-Fpt-', 'Fpt'), ('anunció', 'vmis3s0'), ('hoy', 'rg'), (',', 'Fc'), ('jueves', 'W'), (',', 'Fc'), ('la', 'da0fs0'), ('compra', 'ncfs000'), ('del', 'spcms'), ('51_por_ciento', 'Zp'), ('de', 'sps00'), ('la', 'da0fs0'), ('empresa', 'ncfs000'), ('mexicana', 'aq0fs0'), ('Electricidad_Águila_de_Altamira', 'np00000'), ('-Fpa-', 'Fpa'), ('EAA', 'np00000'), ('-Fpt-', 'Fpt'), (',', 'Fc'), ('creada', 'aq0fsp'), ('por', 'sps00'), ('el', 'da0ms0'), ('japonés', 'aq0ms0'), ('Mitsubishi_Corporation', 'np00000'), ('para', 'sps00'), ('poner_en_marcha', 'vmn0000'), ('una', 'di0fs0'), ('central', 'ncfs000'), ('de', 'sps00'), ('gas', 'ncms000'), ('de', 'sps00'), ('495', 'Z'), ('megavatios', 'ncmp000'), ('.', 'Fp')], [('Una', 'di0fs0'), ('portavoz', 'nccs000'), ('de', 'sps00'), ('EDF', 'np00000'), ('explicó', 'vmis3s0'), ('a', 'sps00'), ('EFE', 'np00000'), ('que', 'c

In [55]:
# Split corpus into training and testing set.
train = int(len(df_phrases.bigrams_pos_tag)*70/100) # 70%

# Train a bigram tagger with only training data.
bi_tag = bt(df_phrases.iloc[:train].bigrams_pos_tag.to_list())

# Evaluate on the testing data (the remaining 30%)
bi_tag.accuracy(df_phrases.iloc[train+1:].bigrams_pos_tag.to_list())

0.838390230012339
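
A small sketch of a positional 70/30 split on a plain list. Slicing the test set from the cut index itself keeps every item, whereas starting at `train+1`, as the cell above does, leaves one sentence out of both sets. `split_by_position` and its `train_percent` parameter are hypothetical names.

```python
def split_by_position(items, train_percent=70):
    """Split a sequence into leading train and trailing test parts."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 1 and 99")
    cut = len(items) * train_percent // 100
    return items[:cut], items[cut:]


items = list(range(10))
train_items, test_items = split_by_position(items)
print(len(train_items), len(test_items))
```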

In [56]:
bi_tag.tag(df_phrases.bigrams[0])

[(('La', 'picadura'), None),
 (('picadura', 'de'), None),
 (('de', 'ciertas'), None),
 (('ciertas', 'arañas'), None),
 (('arañas', 'puede'), None),
 (('puede', 'ser'), None),
 (('ser', 'mortal'), None),
 (('mortal', '.'), None)]
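
Every tag above is `None`, most likely because the tagger was trained on single tokens and is here asked to tag bigram tuples, which it has never seen as tokens. NLTK's n-gram taggers look up a context made of the preceding tags and the current token. The toy tagger below is a simplified stand-in for that lookup, not NLTK's implementation, and its tags are made up for illustration.

```python
from collections import Counter


def train_toy_bigram_tagger(tagged_sentences):
    """Map (previous tag, token) contexts to the tag most often seen with them."""
    counts = {}
    for sentence in tagged_sentences:
        prev_tag = None
        for token, tag in sentence:
            counts.setdefault((prev_tag, token), Counter())[tag] += 1
            prev_tag = tag
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def tag_tokens(model, tokens):
    """Tag tokens left to right; unseen contexts get None."""
    tagged = []
    prev_tag = None
    for token in tokens:
        tag = model.get((prev_tag, token))
        tagged.append((token, tag))
        prev_tag = tag
    return tagged


# Hypothetical training data with made-up tags
model = train_toy_bigram_tagger([[("La", "DET"), ("picadura", "NOUN"), ("de", "ADP")]])

tagged_tokens = tag_tokens(model, ["La", "picadura", "de"])
tagged_bigrams = tag_tokens(model, [("La", "picadura"), ("picadura", "de")])
print(tagged_tokens)
print(tagged_bigrams)
```

Passing the token list tags every word, while passing the bigram tuples yields `None` throughout. Once one tag is `None`, the following contexts are also unseen, so the failure cascades down the sentence.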

In [11]:

from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])

[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'QL'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]

In [24]:
nltk.corpus.cess_esp.words()

['El', 'grupo', 'estatal', 'Electricité_de_France', ...]

In [19]:
print(len(brown_sents))
print(brown_sents[0])

4623
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']


In [67]:
df_subset.groupby(
    ["suj_id", ""]
)

Unnamed: 0,suj_id,pal,palnum,freq,length,time
0,1,La,1,192476,2,-101.5625
1,1,picadura,2,9,8,-101.5625
2,1,de,3,264721,2,-101.5625
3,1,ciertas,4,380,7,-101.5625
4,1,arañas,5,20,6,-101.5625
...,...,...,...,...,...,...
25111,28,la,5,192476,2,664.0625
25112,28,panadería,6,10,9,664.0625
25113,28,cocinan,7,7,7,664.0625
25114,28,el,8,139594,2,664.0625
