# Data Cleaning 3: Correcting Entity Spans (short)

Search for tokens to the left and right of a tagged person that appear in the entity list or the name lists but have not themselves been tagged as entities. We start by loading the spaCy docs; each untagged token should have an entity tag of 0.

In [1]:
from IPython.display import clear_output
import spacy
from spacy import displacy

import pickle

In [6]:
import pandas as pd

In [3]:
# Load the pickled (doc, context) pairs and close the file afterwards.
with open("Files_Cleaning2/NER_data_Cleaned22.p", 'rb') as file:
    docs = pickle.load(file)

Let's import the file with the list of tokens appearing in our 3 lists, which was manually checked for dubious options, and filter out the dubious ones (those marked with `check` = 1).

In [81]:
names = pd.read_excel("Files_Cleaning2/EntityTokensinKFLNamelists_manual_check.xlsx")

In [82]:
names.head()

Unnamed: 0,token,check
0,guejar,
1,belmonte,
2,gatica,
3,moyna,
4,r,1.0


In [83]:
acceptednames = names.loc[names['check']!=1,'token'].tolist()

In [85]:
acceptednames = list(set(acceptednames))

In [86]:
len(acceptednames)

1617
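
The `!= 1` filter keeps every row whose `check` cell is empty, because a missing value (NaN) never compares equal to 1. A minimal sketch of that behaviour on plain Python tuples, using a few tokens from the table above as toy data (the `rows` list and the `accepted_tokens` helper are illustrative names, not part of the notebook):

```python
import math

# Toy rows mirroring the (token, check) columns; NaN stands for an empty cell.
rows = [("guejar", math.nan), ("belmonte", math.nan), ("r", 1.0), ("guejar", math.nan)]

def accepted_tokens(rows):
    """Keep tokens whose check flag is not 1, dropping duplicates."""
    return sorted({token for token, check in rows if check != 1})

accepted = accepted_tokens(rows)
print(accepted)
```

Only tokens explicitly flagged with 1 are dropped; everything left unchecked passes through.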

We can also add tokens from our name lists that did not appear in our tagged tokens, but might be elsewhere in the documents.

In [118]:
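# Read one name per line; the files are closed when each block ends.
with open("DBE_Names/firstnames.txt", 'r') as fnames:
    firstnames = [line.rstrip('\n') for line in fnames]
with open("DBE_Names/lastnames.txt", 'r') as lnames:
    lastnames = [line.rstrip('\n') for line in lnames]

In [119]:
# Combine first names and last names into a single list.
DBEnames = firstnames + lastnames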

In [120]:
DBEnames = set(DBEnames)

In [179]:
len(DBEnames)

10004

Iterating through the rows, we found some tokens that are not necessary in these lists. We'll remove them:
maestro
licenciado
bachiller
maestre
fray
nombre
jurado
notario
vicario
mercader
escribano

In [211]:
'maestro' in DBEnames

True

In [212]:
'licenciado' in DBEnames

True

In [213]:
'bachiller' in DBEnames

True

In [244]:
'maestre' in DBEnames

True

In [245]:
'fray' in DBEnames

True

In [246]:
'nombre' in DBEnames

True

In [247]:
'jurado' in DBEnames

True

In [248]:
'notario' in DBEnames

True

In [249]:
'vicario' in DBEnames

True

In [250]:
'mercader' in DBEnames

True

In [282]:
'escribano' in DBEnames

True

In [178]:
DBEnames.remove('maestro')

In [214]:
DBEnames.remove('licenciado')
DBEnames.remove('bachiller')

In [251]:
DBEnames.remove('maestre')
DBEnames.remove('fray')
DBEnames.remove('nombre')
DBEnames.remove('jurado')
DBEnames.remove('notario')
DBEnames.remove('vicario')
DBEnames.remove('mercader')

In [283]:
DBEnames.remove('escribano')
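
The removals above could also be written as a single set operation. A sketch, assuming a small stand-in set in place of the real `DBEnames` (the `unwanted` name and the `drop_unwanted` helper are hypothetical):

```python
# Tokens listed above as not needed in the name lists.
unwanted = {"maestro", "licenciado", "bachiller", "maestre", "fray", "nombre",
            "jurado", "notario", "vicario", "mercader", "escribano"}

def drop_unwanted(name_set, unwanted):
    """Return a copy of name_set without the unwanted tokens."""
    return name_set - unwanted

# Toy stand-in for DBEnames.
toy_names = {"martin", "fray", "garcia", "escribano"}
cleaned = drop_unwanted(toy_names, unwanted)
print(sorted(cleaned))
```

Unlike `set.remove`, set difference does not raise `KeyError` when a token is already absent, so the cell can be re-run safely.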

Let's look at the tokens that appear in the DBE list but do not form part of any entity (`token.ent_iob` is 2, meaning the token is outside any entity).

In [284]:
untagged = []
for doc,context in docs:
    for token in doc:
        if (token.ent_iob==2) and ((token.text.lower() in DBEnames)):
            untagged.append(token.text)

In [285]:
len(untagged)

79780

In [286]:
print(set(untagged))

{'plano', 'marques', 'matos', 'pontones', 'pradilla', 'Gallego', 'benito', 'pascual', 'jesus', 'vela', 'isla', 'ferran', 'Sevillano', 'farfan', 'fregenal', 'poyo', 'gabriel', 'santaren', 'Redondo', 'Rafael', 'Encarnacion', 'acosta', 'bellon', 'rincon', 'maldonado', 'porto', 'Ruis', 'moneda', 'portillo', 'jusepe', 'red', 'costas', 'verdes', 'Panes', 'Ovejas', 'arias', 'costa', 'trinidad', 'ARTEAGA', 'cid', 'la', 'morillo', 'pinelo', 'caula', 'page', 'faber', 'olivar', 'moran', 'carbon', 'Alvarez', 'barca', 'pio', 'fita', 'Jacinto', 'sea', 'cabeza', 'Esquivel', 'domingo', 'marco', 'Rio', 'tomasa', 'abarca', 'sacristan', 'von', 'argote', 'gradilla', 'ana', 'cortes', 'leandro', 'beata', 'melgarejo', 'cantillo', 'mina', 'ines', 'serafin', 'armas', 'encalada', 'sebil', 'Cosme', 'aguilar', 'prieto', 'machuca', 'cajes', 'Guardia', 'Segunda', 'brabante', 'aceites', 'Guillermo', 'ayala', 'vermondo', 'alcayde', 'Abad', 'cesare', 'Castañeda', 'Romana', 'Torres', 'Portugal', 'corona', 'arostegui', 

In [287]:
len(set(untagged))

1464
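
Because `untagged` stores the original casing, `set(untagged)` would count 'Gallego' and 'gallego' as different tokens. One way to prioritise a manual review is to fold case and rank tokens by frequency. A sketch with a toy list (the sample is invented for illustration; `rank_tokens` is a hypothetical helper):

```python
from collections import Counter

# Toy stand-in for the real `untagged` list.
untagged_sample = ["Gallego", "gallego", "benito", "Torres", "torres", "torres"]

def rank_tokens(tokens):
    """Count tokens case-insensitively, most frequent first."""
    return Counter(t.lower() for t in tokens).most_common()

ranked = rank_tokens(untagged_sample)
print(ranked)
```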

It would be wise to review this list of 1,464 tokens manually to determine which are certainly names and which are dubious. But first, let's see how many of these matches (against DBE and acceptednames) occur immediately before or after a PER entity, since those are the spans we would have to edit.

In [49]:
docs[0][0][1]

martin

I want a rule that looks at the token immediately before and after each PER entity. If that token is in one of the name lists and is not tagged as an entity, it is appended to our list. If the token is a particle (de, del, el, la, los or las), the rule moves on to the next token.

In [288]:
nexttopers = []
# Particles to skip over; every comparison is made in lower case.
particles = {'de', 'del', 'la', 'el', 'los', 'las'}

for doc, context in docs:
    for ent in doc.ents:
        if ent.label_ == 'PER':
            # Token before the entity, skipping particles.
            try:
                i = 1
                a = doc[ent.start - i]
                while a.text.lower() in particles:
                    i = i + 1
                    a = doc[ent.start - i]
                if (a.ent_iob == 2) and ((a.text.lower() in acceptednames) or (a.text.lower() in DBEnames)):
                    nexttopers.append(a.text)
            except IndexError:
                pass
            # Token after the entity, skipping particles.
            try:
                i = 0
                b = doc[ent.end + i]
                while b.text.lower() in particles:
                    i = i + 1
                    b = doc[ent.end + i]
                if (b.ent_iob == 2) and ((b.text.lower() in acceptednames) or (b.text.lower() in DBEnames)):
                    nexttopers.append(b.text)
            except IndexError:
                pass

In [289]:
len(nexttopers)

838

In [290]:
len(set(nexttopers))

251

In [291]:
print(set(nexttopers))

{'della', 'dos', 'maeso', 'mozo', 'hidalgo', 'marques', 'matos', 'clavijo', 'peinado', 'casas', 'barrio', 'caja', 'corro', 'blanco', 'sobrino', 'vida', 'coronel', 'alcantara', 'Fernando', 'jesus', 'hombre', 'Jeronimo', 'catalina', 'menor', 'serrano', 'vea', 'quadrado', 'Pero', 'vargas', 'comino', 'tristan', 'tercero', 'haya', 'ballestero', 'sotomayor', 'Conde', 'Cantero', 'aleman', 'guerrero', 'restan', 'solar', 'Clemencia', 'ferrero', 'farfan', 'espadero', 'quero', 'cosa', 'texedor', 'carta', 'diez', 'pinto', 'fuentes', 'leal', 'Niño', 'Duque', 'juan', 'dias', 'acosta', 'nieto', 'fraile', 'Dios', 'banda', 'maldonado', 'barba', 'Ledesma', 'viñas', 'hijas', 'mayor', 'joven', 'albear', 'alexos', 'chamorro', 'trigo', 'ginoves', 'foi', 'duque', 'Bueno', 'teresa', 'segundo', 'verdad', 'carpintero', 'estrada', 'negro', 'reyna', 'dorado', 'Marcos', 'arias', 'fernandez', 'torre', 'costa', 'contador', 'placido', 'toma', 'torquemada', 'cuentas', 'centeno', 'cid', 'monesterio', 'LEAL', 'bueno', '

According to this rule, 838 modifications could be made, from 251 distinct tokens.
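
The rule can be sketched independently of spaCy on a plain list of tokens, which makes the edge cases easier to see. This is a minimal illustration, not the notebook's implementation: `neighbour_candidates`, `tagged` (the set of token indices already inside an entity) and the sample sentence are all hypothetical. It also checks bounds explicitly, since a negative index in Python wraps around to the end of the sequence instead of raising an error.

```python
PARTICLES = {"de", "del", "la", "el", "los", "las"}

def neighbour_candidates(tokens, tagged, start, end, names):
    """Return indices of untagged name tokens next to the span [start, end)."""
    found = []
    # Walk left from the entity, skipping particles.
    i = start - 1
    while i >= 0 and tokens[i].lower() in PARTICLES:
        i -= 1
    if i >= 0 and i not in tagged and tokens[i].lower() in names:
        found.append(i)
    # Walk right from the entity, skipping particles.
    j = end
    while j < len(tokens) and tokens[j].lower() in PARTICLES:
        j += 1
    if j < len(tokens) and j not in tagged and tokens[j].lower() in names:
        found.append(j)
    return found

tokens = ["pero", "garcia", "fustanero", "de", "la", "torre", "vecino"]
tagged = {1, 2}            # 'garcia fustanero' is the PER entity
names = {"pero", "torre"}  # toy name list
print(neighbour_candidates(tokens, tagged, 1, 3, names))
```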

In [260]:
peopleshort = []
# Particles to skip over; every comparison is made in lower case.
particles = {'de', 'del', 'la', 'el', 'los', 'las'}

for doc, context in docs:
    for ent in doc.ents:
        if ent.label_ == 'PER':
            # Token before the entity, skipping particles.
            try:
                i = 1
                a = doc[ent.start - i]
                while a.text.lower() in particles:
                    i = i + 1
                    a = doc[ent.start - i]
                if (a.ent_iob == 2) and ((a.text.lower() in acceptednames) or (a.text.lower() in DBEnames)):
                    # Record the document id, the entity span and the candidate token.
                    entry = [context['id'], ent.text, ent.start, ent.end, a.i, a.text, ent.label_]
                    peopleshort.append(entry)
            except IndexError:
                pass
            # Token after the entity, skipping particles.
            try:
                i = 0
                b = doc[ent.end + i]
                while b.text.lower() in particles:
                    i = i + 1
                    b = doc[ent.end + i]
                if (b.ent_iob == 2) and ((b.text.lower() in acceptednames) or (b.text.lower() in DBEnames)):
                    entry = [context['id'], ent.text, ent.start, ent.end, b.i, b.text, ent.label_]
                    peopleshort.append(entry)
            except IndexError:
                pass

In [261]:
len(peopleshort)

1154

In [262]:
peopleshort = pd.DataFrame(peopleshort, columns = ['docid','string','start','end','tokenloc','tokentext','entlabel'])

In [263]:
peopleshort.head()

Unnamed: 0,docid,string,start,end,tokenloc,tokentext,entlabel
0,50,Cristobal garcia,6,8,8,cantero,PER
1,56,garcia fustanero,23,25,22,pero,PER
2,59,Pedro Blascoen,2476,2478,2478,Sevilla,PER
3,101,nicolas jurate,163,165,165,dos,PER
4,106,niculao de leon,0,3,3,frances,PER


In [264]:
peopleshort['edit_short']=''

This should be enough to loop through our labels and edit them, as we did previously in Data Cleaning 2. The number of tags is small enough that we do not need to filter anything out first.

In [292]:
count = 0
for (index, row) in peopleshort.iterrows():
    # Only review rows that have not been answered yet.
    if row['edit_short'] == '':
        count = count + 1
        print(count)
        print(row['string'], row['start'], row['end'])
        print(row['tokentext'], row['tokenloc'])
        docid = row['docid']
        # Window covering the entity and the candidate token, plus 10 tokens after.
        strstart = min([int(row['start']), int(row['tokenloc'])])
        strend = max([int(row['end']), int(row['tokenloc'])]) + 10

        for doc, context in docs:
            if context['id'] == docid:
                displacy.render(doc[strstart:strend], style='ent', jupyter=True)
                peopleshort.loc[index, 'edit_short'] = input('Edit?')

                # Answering 'X' shows the whole document before asking again.
                if peopleshort.loc[index, 'edit_short'] == 'X':
                    displacy.render(doc, style='ent', jupyter=True)
                    peopleshort.loc[index, 'edit_short'] = input('Edit?')
                clear_output(wait=True)

231
Marcos de medina 1279 1282
Escribano 1282


KeyboardInterrupt: 

In [294]:
peopleshort = peopleshort[peopleshort['tokentext']!='Escribano']

In [295]:
peopleshort

Unnamed: 0,docid,string,start,end,tokenloc,tokentext,entlabel,edit_short
0,50,Cristobal garcia,6,8,8,cantero,PER,0
1,56,garcia fustanero,23,25,22,pero,PER,1
2,59,Pedro Blascoen,2476,2478,2478,Sevilla,PER,0
3,101,nicolas jurate,163,165,165,dos,PER,0
4,106,niculao de leon,0,3,3,frances,PER,0
...,...,...,...,...,...,...,...,...
1148,8577,DIEGO DÍAZ,255,257,254,oro,PER,
1149,8577,FRANCISCO MANUEL [SOLORZANO,290,294,289,oro,PER,
1150,8577,JOSEPH MERINO,513,515,512,oro,PER,
1151,8578,LUIS ANTONIO NAVARRO,25,28,29,menor,PER,


In [296]:
peopleshort.groupby('edit_short').count()

edit_short,docid,string,start,end,tokenloc,tokentext,entlabel
,550,550,550,550,550,550,550
0,179,179,179,179,179,179,179
1,104,104,104,104,104,104,104
o,5,5,5,5,5,5,5
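
The five rows labelled `o` look like a slip for `0`, though that is an assumption and worth checking against the rows themselves. If it holds, the answers could be normalised before counting. A sketch on a plain list (the `normalise` helper and the sample answers are hypothetical):

```python
from collections import Counter

def normalise(answer):
    """Strip stray whitespace and map the letter 'o'/'O' to '0' (assumed typo)."""
    answer = answer.strip()
    return "0" if answer in {"o", "O"} else answer

# Toy answers; the empty string stands for a row not yet reviewed.
answers = ["0", "1", "o", "", "0", " 1"]
counts = Counter(normalise(a) for a in answers)
print(counts)
```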


In [None]:
maestre
fray
nombre
jurado
notario
vicario
mercader
---
cantero
escribano

In [298]:
peopleshort.to_csv('Files_Cleaning3/PERS_tag_clean3_strat1.csv')
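
The saved file records, for each candidate, the entity span (`start`, `end`) and the position of the neighbouring token (`tokenloc`). A minimal sketch of how a widened span could be computed from one row, assuming `end` is exclusive as in spaCy spans and that `edit_short` equal to 1 marks rows to extend (that reading of the codes is an assumption, and `widened_span` is a hypothetical helper):

```python
def widened_span(start, end, tokenloc):
    """Return (start, end) of the smallest span covering the entity and the token.

    `end` is exclusive, so a token at `tokenloc` needs `tokenloc + 1`.
    """
    return min(start, tokenloc), max(end, tokenloc + 1)

# Row 1 of the table above: 'garcia fustanero' at 23-25 with 'pero' at 22.
print(widened_span(23, 25, 22))
# A token to the right with one skipped token in between: entity 25-28, token at 29.
print(widened_span(25, 28, 29))
```

Any particle skipped between the entity and the token falls inside the widened span, which is usually what is wanted for names such as "niculao de leon".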