# Data Cleaning 3: Correcting Entity Spans (short)

Search for tokens to the left and right of a tagged person that appear in the entity list or the name lists but have not been tagged as entities. We can load the spaCy docs; each candidate token should sit outside any entity span (`token.ent_iob == 2`).

In [1]:
from IPython.display import clear_output
import spacy
from spacy import displacy

import pickle

In [2]:
import pandas as pd

In [3]:
# Open the pickle in a with-block so the file handle is closed afterwards.
with open("Files_Cleaning2/NER_data_Cleaned22.p", 'rb') as file:
    docs = pickle.load(file)

Let's import the list of tokens that appear in our three lists. It was checked manually, so we can filter out the dubious entries (rows flagged with `check` = 1).

In [4]:
names = pd.read_excel("Files_Cleaning2/EntityTokensinKFLNamelists_manual_check.xlsx")

In [8]:
names.fillna('0').groupby('check').count()

check,token
1.0,311
0.0,1617


In [6]:
acceptednames = names.loc[names['check']!=1,'token'].tolist()

In [9]:
acceptednames = list(set(acceptednames))

In [10]:
len(acceptednames)

1617

We can also add tokens from our name lists that did not appear in our tagged tokens, but might be elsewhere in the documents.

In [11]:
fnames = open("DBE_Names/firstnames.txt", 'r')
lnames = open("DBE_Names/lastnames.txt",'r')

In [12]:
DBEnames = []

for line in fnames:
    DBEnames.append(line.rstrip('\n'))
    
for line in lnames:
    DBEnames.append(line.rstrip('\n'))

In [13]:
DBEnames = set(DBEnames)

In [14]:
len(DBEnames)

10005
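
A minimal sketch of the same loading step, assuming each file holds one name per line. The helper `read_names` is hypothetical (it is not part of the notebook). It accepts any iterable of lines, so it works on open file handles as well as on the in-memory stand-ins used here. Stripping all surrounding whitespace and skipping blank lines are added assumptions; the cells above only strip the trailing newline.

```python
import io

def read_names(*sources):
    """Collect stripped, non-empty lines from several line iterables into one set."""
    names = set()
    for source in sources:
        for line in source:
            name = line.strip()
            if name:  # assumption: blank lines are not names
                names.add(name)
    return names

# In-memory stand-ins for firstnames.txt and lastnames.txt (illustrative data only).
first = io.StringIO("juan\nmartin\n\n")
last = io.StringIO("garcia\nmartin\n")
demo_names = read_names(first, last)
print(sorted(demo_names))
```

With the real files, the two handles could be opened in a single `with` statement and passed to `read_names`, which closes them once the set is built.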

Iterating through the rows, we found some tokens in these lists that are titles, occupations or common words rather than names. We'll remove them:
maestro
licenciado
bachiller
maestre
fray
nombre
jurado
notario
vicario
mercader
escribano

In [15]:
'maestro' in DBEnames

True

In [16]:
'licenciado' in DBEnames

True

In [17]:
'bachiller' in DBEnames

True

In [18]:
'maestre' in DBEnames

True

In [19]:
'fray' in DBEnames

True

In [20]:
'nombre' in DBEnames

True

In [21]:
'jurado' in DBEnames

True

In [22]:
'notario' in DBEnames

True

In [23]:
'vicario' in DBEnames

True

In [24]:
'mercader' in DBEnames

True

In [25]:
'escribano' in DBEnames

True

In [26]:
DBEnames.remove('maestro')

In [27]:
DBEnames.remove('licenciado')
DBEnames.remove('bachiller')

In [28]:
DBEnames.remove('maestre')
DBEnames.remove('fray')
DBEnames.remove('nombre')
DBEnames.remove('jurado')
DBEnames.remove('notario')
DBEnames.remove('vicario')
DBEnames.remove('mercader')

In [29]:
DBEnames.remove('escribano')
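
The checks and removals above could also be written as a single set difference. This is a sketch on a small illustrative set, not on the real `DBEnames`; the names `TITLES` and `drop_titles` are hypothetical. Unlike `set.remove`, a set difference does not raise `KeyError` when a token is already absent, so the cell can be re-run safely.

```python
# The eleven tokens removed in the cells above.
TITLES = {
    'maestro', 'licenciado', 'bachiller', 'maestre', 'fray', 'nombre',
    'jurado', 'notario', 'vicario', 'mercader', 'escribano',
}

def drop_titles(names, titles=TITLES):
    """Return a new set with the title tokens taken out; `names` is left untouched."""
    return set(names) - set(titles)

demo = {'maestro', 'juan', 'fray', 'garcia'}
cleaned = drop_titles(demo)
print(sorted(cleaned))
```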

Let's look at the tokens that appear in the DBE list but do not form part of any entity (token.ent_iob is 2). 

In [30]:
untagged = []
for doc,context in docs:
    for token in doc:
        if (token.ent_iob==2) and ((token.text.lower() in DBEnames)):
            untagged.append(token.text)

In [31]:
len(untagged)

79780

In [32]:
print(set(untagged))

{'vera', 'bejar', 'labrador', 'hijas', 'campo', 'estada', 'salvador', 'Chaves', 'osuna', 'Hospital', 'peru', 'Ignacio', 'vuelta', 'bassa', 'oleo', 'Hombre', 'Jesus', 'romero', 'crespo', 'balduque', 'iglesias', 'obligado', 'horno', 'andino', 'provecho', 'balle', 'macia', 'batista', 'sota', 'coronel', 'linage', 'regla', 'Morales', 'angeles', 'negro', 'phelipe', 'balmazeda', 'Segunda', 'treviño', 'Rico', 'Corte', 'dominico', 'escriba', 'garcia', 'Monteros', 'sobrino', 'Yermo', 'Batista', 'celada', 'castellanos', 'tornos', 'llanas', 'juan', 'maeda', 'sillero', 'raso', 'Tome', 'Restan', 'Jacob', 'coll', 'plaza', 'peon', 'linaza', 'Villanueva', 'salcedo', 'remesal', 'pared', 'ursula', 'Aldana', 'laurencio', 'Amparo', 'eliseo', 'tizon', 'Justa', 'criado', 'Ramo', 'castillejo', 'mina', 'limpias', 'plata', 'CARTA', 'gallego', 'Perez', 'lara', 'Carta', 'gato', 'purificacion', 'Junta', 'cuentas', 'palomares', 'benegas', 'monasterio', 'rueda', 'rincon', 'camacho', 'Perea', 'rosa', 'anunciacion', '

In [33]:
len(set(untagged))

1464

It would be wise to review this list of 1,464 tokens manually to determine which are definitely names and which are dubious. But first, let's see how many of these matches (DBE and acceptednames) fall immediately before or after a PER entity, which tells us which spans we have to edit.

# Strategy 1: Exact Matching in Lists

In [34]:
docs[0][0][1]

martin

I want a rule that looks at the token immediately before and immediately after each PER entity. If that token is in a name list and is not tagged as an entity, it is appended to our list. If the token is a connector (de, del, la, el, los, las or y), the rule moves on to the next token.

In [35]:
# Connectors that can sit between name parts; compared in lowercase throughout.
connectors = {'de', 'del', 'la', 'el', 'los', 'las', 'y'}

nexttopers = []
for doc, context in docs:
    for ent in doc.ents:
        if ent.label_ == 'PER':
            try:
                # Walk left from the token just before the entity.
                i = 1
                a = doc[ent.start - i]
                while a.text.lower() in connectors:
                    i = i + 1
                    a = doc[ent.start - i]
                # ent_iob == 2 means the token is outside any entity.
                if (a.ent_iob == 2) and ((a.text.lower() in acceptednames) or (a.text.lower() in DBEnames)):
                    nexttopers.append(a.text)
            except IndexError:
                pass
            try:
                # Walk right from the token just after the entity.
                i = 0
                b = doc[ent.end + i]
                while b.text.lower() in connectors:
                    i = i + 1
                    b = doc[ent.end + i]
                if (b.ent_iob == 2) and ((b.text.lower() in acceptednames) or (b.text.lower() in DBEnames)):
                    nexttopers.append(b.text)
            except IndexError:
                pass

In [36]:
len(nexttopers)

971

In [37]:
len(set(nexttopers))

298

In [38]:
print(set(nexttopers))

{'capilla', 'costa', 'sa', 'cordero', 'Esquivel', 'delgado', 'matos', 'calero', 'cantero', 'maldonado', 'carrasco', 'abad', 'oro', 'ayala', 'labrador', 'Conde', 'puga', 'hijas', 'campo', 'bravo', 'villafuerte', 'hoz', 'Nisa', 'Abad', 'jesus', 'cavallero', 'texada', 'calafate', 'justa', 'toma', 'camacho', 'Vega', 'corral', 'hermoso', 'vida', 'camara', 'Chaves', 'serrano', 'guzman', 'portugues', 'comino', 'centeno', 'rua', 'vuelta', 'tapia', 'monesterio', 'farfan', 'ana', 'vico', 'Cantero', 'marques', 'Sala', 'doblas', 'chantre', 'francisco', 'cabeza', 'clavijo', 'beata', 'martin', 'maeso', 'cotan', 'moro', 'obligado', 'reyna', 'pero', 'fernandes', 'gragera', 'justicia', 'tome', 'andino', 'Clemencia', 'casado', 'teresa', 'Albo', 'balle', 'toledano', 'Jeronimo', 'espadero', 'crisostomo', 'moran', 'Linaje', 'tablada', 'pastor', 'herrero', 'Bueno', 'urban', 'falconete', 'melgarejo', 'linage', 'tamariz', 'palacios', 'coronel', 'encima', 'ceron', 'contador', 'vea', 'quero', 'aguirre', 'Marcos

According to this rule, 971 modifications could be made, from 298 distinct tokens.
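
To make the rule easier to test in isolation, here is a sketch of it on plain Python lists instead of spaCy docs. Everything in it is an assumption for illustration: `neighbour_candidates` is a hypothetical helper, `tagged` is a list of booleans standing in for "token is inside an entity", and the token lists are toy examples. It uses explicit bounds checks, so an entity at the very start of a document never looks at the last token through a negative index.

```python
CONNECTORS = {'de', 'del', 'la', 'el', 'los', 'las', 'y'}

def neighbour_candidates(tokens, tagged, start, end, names):
    """Return indices of untagged name tokens next to the span [start, end).

    Connectors are skipped on both sides; the search stops at the list edges.
    """
    found = []

    # Look left of the entity.
    i = start - 1
    while i >= 0 and tokens[i].lower() in CONNECTORS:
        i -= 1
    if i >= 0 and not tagged[i] and tokens[i].lower() in names:
        found.append(i)

    # Look right of the entity.
    j = end
    while j < len(tokens) and tokens[j].lower() in CONNECTORS:
        j += 1
    if j < len(tokens) and not tagged[j] and tokens[j].lower() in names:
        found.append(j)

    return found

# Entity at the start of the text, candidate on the right.
right_hit = neighbour_candidates(
    ['niculao', 'de', 'leon', 'frances', 'vecino'],
    [True, True, True, False, False], 0, 3, {'frances'})

# Candidate on the left, entity runs to the end of the text.
left_hit = neighbour_candidates(
    ['pero', 'garcia', 'fustanero'],
    [False, True, True], 1, 3, {'pero'})

# Connectors between the entity and the candidate are skipped.
skipped = neighbour_candidates(
    ['juan', 'de', 'la', 'vega'],
    [True, False, False, False], 0, 1, {'vega'})

print(right_hit, left_hit, skipped)
```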

In [39]:
# Connectors that can sit between name parts; skip over them.
connectors = {'de', 'del', 'la', 'el', 'los', 'las', 'y'}

peopleshort = []

for doc, context in docs:
    for ent in doc.ents:
        if ent.label_ == 'PER':
            try:
                # Walk left from the token just before the entity.
                i = 1
                a = doc[ent.start - i]
                while a.text.lower() in connectors:
                    i = i + 1
                    a = doc[ent.start - i]
                if (a.ent_iob == 2) and ((a.text.lower() in acceptednames) or (a.text.lower() in DBEnames)):
                    # docid, entity text, entity start, entity end, token index, token text, label
                    entry = [context['id'], ent.text, ent.start, ent.end, a.i, a.text, ent.label_]
                    peopleshort.append(entry)
            except IndexError:
                pass
            try:
                # Walk right from the token just after the entity.
                i = 0
                b = doc[ent.end + i]
                while b.text.lower() in connectors:
                    i = i + 1
                    b = doc[ent.end + i]
                if (b.ent_iob == 2) and ((b.text.lower() in acceptednames) or (b.text.lower() in DBEnames)):
                    entry = [context['id'], ent.text, ent.start, ent.end, b.i, b.text, ent.label_]
                    peopleshort.append(entry)
            except IndexError:
                pass

In [40]:
len(peopleshort)

971

In [41]:
peopleshort = pd.DataFrame(peopleshort, columns = ['docid','string','start','end','tokenloc','tokentext','entlabel'])

In [42]:
peopleshort.head()

Unnamed: 0,docid,string,start,end,tokenloc,tokentext,entlabel
0,50,Cristobal garcia,6,8,8,cantero,PER
1,56,garcia fustanero,23,25,22,pero,PER
2,59,Pedro Blascoen,2476,2478,2478,Sevilla,PER
3,101,nicolas jurate,163,165,165,dos,PER
4,106,niculao de leon,0,3,3,frances,PER


In [43]:
peopleshort['edit_short']=''

This should be enough to loop through our labels and edit them, as we did previously in Data Cleaning 2. And the number of tags is small enough that we do not need to filter things out. 

In [56]:
count = 0
for (index, row) in peopleshort.iterrows():
    if row['edit_short'] == '':
        count = count + 1
        print(count)
        print(row['string'], row['start'], row['end'])
        print(row['tokentext'], row['tokenloc'])
        docid = row['docid']
        # Show the entity, the candidate token and roughly ten tokens of following context.
        strstart = min([int(row['start']), int(row['tokenloc'])])
        strend = max([int(row['end']), int(row['tokenloc'])]) + 10

        for doc, context in docs:
            if context['id'] == docid:
                displacy.render(doc[strstart:strend], style='ent', jupyter=True)
                peopleshort.loc[index, 'edit_short'] = input('Edit?')

                # Entering 'X' shows the whole document before asking again.
                if peopleshort.loc[index, 'edit_short'] == 'X':
                    displacy.render(doc, style='ent', jupyter=True)
                    peopleshort.loc[index, 'edit_short'] = input('Edit?')
                clear_output(wait=True)

2
DIEGO AMBROSIO 14 16
menor 16


Edit? 0


In [None]:
peopleshort

In [57]:
peopleshort.groupby('edit_short').count()

edit_short,docid,string,start,end,tokenloc,tokentext,entlabel
0,686,686,686,686,686,686,686
1,262,262,262,262,262,262,262
L,1,1,1,1,1,1,1
R,7,7,7,7,7,7,7
o,15,15,15,15,15,15,15


Above: of the tags identified for checking, 686 are correct, 262 must be edited for length, one must be reclassified as a location, 7 must be removed and 15 have another problem (multiple tags or mis-tokenization).
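
Since the codes are typed by hand, a small legend can help catch typos before the file is saved. The sketch below is an assumption-laden illustration: the `EDIT_CODES` mapping is taken from the sentence above, `summarize` is a hypothetical helper, and the input is a toy list rather than the real `edit_short` column.

```python
from collections import Counter

# Meaning of each review code, as described in the text above.
EDIT_CODES = {
    '0': 'tag is correct',
    '1': 'span must be edited for length',
    'L': 'reclassify as location',
    'R': 'remove tag',
    'o': 'other problem (multiple tags or mis-tokenization)',
}

def summarize(codes):
    """Count each code and attach its meaning; unexpected codes are flagged."""
    counts = Counter(str(code).strip() for code in codes)
    return {code: (EDIT_CODES.get(code, 'unknown code'), n)
            for code, n in counts.items()}

summary = summarize(['0', '0', '1', 'R', ' 0', '?'])
for code, (meaning, n) in summary.items():
    print(code, n, meaning)
```

On the real data this could be called as `summarize(peopleshort['edit_short'])`; any row reported as `unknown code` would need another look.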

In [58]:
peopleshort.to_csv('Files_Cleaning3/PERS_tag_clean3_strat1.csv')

# Strategy 2: Fuzzy Matching to Lists

Let's do the same thing, but with tokens that approximately match names in our lists.

In [59]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [60]:
tokens = pd.DataFrame()
tokens['token']=''
tokens['score']=''
tokens['match']=''

In [61]:
untagged = []
for doc,context in docs:
    for token in doc:
        if (token.ent_iob==2):
            untagged.append(token.text)

In [62]:
untagged = set(untagged)

In [63]:
len(untagged)

31748

In [64]:
# DataFrame.append was removed in pandas 2.0; add all tokens in one concat instead.
tokens = pd.concat([tokens, pd.DataFrame({'token': list(untagged)})], ignore_index=True)

In [65]:
len(tokens)

31748

In [66]:
tokens = tokens.replace(to_replace='None', value='')

In [69]:
tokens.head()

Unnamed: 0,token,score,match
0,sucedido,,
1,yeruas,,
2,enojosa,,
3,brazado,,
4,italianas,,


The function below, taken from Data Cleaning 1, returns the highest fuzzy-match score for a token together with the name that produced it.

In [67]:
def fuzzmax(string, names):
    # Score the lowercased token against every name and keep the best one.
    string = str(string).lower()
    result = list(map(lambda a: fuzz.token_sort_ratio(string, a), names))
    best = max(result)
    return [best, names[result.index(best)]]
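
For readers who want to try the "best score plus best name" idea without installing fuzzywuzzy, here is a sketch that swaps in `difflib.SequenceMatcher` from the standard library. This is a stand-in, not the notebook's scorer: its scores are not guaranteed to equal those of `fuzz.token_sort_ratio`, and `best_match` is a hypothetical name.

```python
from difflib import SequenceMatcher

def best_match(token, names):
    """Return [score, name] for the name most similar to `token` (score 0-100)."""
    token = str(token).lower()
    scored = [(round(100 * SequenceMatcher(None, token, name).ratio()), name)
              for name in names]
    # max() keeps the first of several equally good names.
    score, name = max(scored, key=lambda pair: pair[0])
    return [score, name]

candidates = ['saucedo', 'hinojosa', 'espejo']
close = best_match('enojosa', candidates)
exact = best_match('Espejo', candidates)
print(close, exact)
```

An exact match (after lowercasing) scores 100, which is why the later filter on scores strictly between 90 and 100 isolates near misses.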

In [71]:
names = list(set(acceptednames +list(DBEnames)))

In [72]:
len(names)

10192

In [74]:
tokens['score'] = tokens.apply(lambda x: fuzzmax(x['token'], names), axis = 1)

In [75]:
tokens.head()

Unnamed: 0,token,score,match
0,sucedido,"[80, saucedo]",
1,yeruas,"[77, payeras]",
2,enojosa,"[80, hinojosa]",
3,brazado,"[80, bazcardo]",
4,italianas,"[75, albiñana]",


In [76]:
tokens['match']=tokens.apply(lambda x: x['score'][1],axis=1)

In [77]:
tokens['score']=tokens.apply(lambda x: x['score'][0], axis=1)

In [78]:
tokens.head()

Unnamed: 0,token,score,match
0,sucedido,80,saucedo
1,yeruas,77,payeras
2,enojosa,80,hinojosa
3,brazado,80,bazcardo
4,italianas,75,albiñana


In [79]:
len(tokens)

31748

In [80]:
tokens.to_csv('Files_Cleaning3/untagged_token_match_scores.csv')

In [81]:
tokens.set_index('token', inplace=True)

In [82]:
tokens.loc[(tokens['score']>90)&(tokens['score']<100)]

token,score,match
despejo,92,espejo
vigas,91,viegas
calabaza,94,calabazas
pasadas,93,passadas
colla,91,coalla
...,...,...
villacreses,91,villacreces
barias,91,briñas
honda,91,hondal
abraça,91,abraha
