# Data Cleaning 3: Correcting Entity Spans (short)

We search for tokens immediately to the left and right of a tagged person that appear in the entity list or the name lists but have not been tagged as entities. We start by loading the spaCy docs; each of these tokens should have an empty entity tag (entity tag = 0).

In [2]:
from IPython.display import clear_output
import spacy
from spacy import displacy

import pickle

In [3]:
import pandas as pd

In [4]:
# Load the docs saved at the end of Data Cleaning 2; the with-block closes the file.
with open("Files_Cleaning2/NER_data_Cleaned22.p", 'rb') as file:
    docs = pickle.load(file)

Let's import the file with the list of tokens appearing in our 3 lists, manually checked for dubious options, and filter those out. 

In [4]:
names = pd.read_excel("Files_Cleaning2/EntityTokensinKFLNamelists_manual_check.xlsx")

In [8]:
names.fillna('0').groupby('check').count()

Unnamed: 0_level_0,token
check,Unnamed: 1_level_1
1.0,311
0.0,1617


In [6]:
acceptednames = names.loc[names['check']!=1,'token'].tolist()

In [9]:
acceptednames = list(set(acceptednames))

In [10]:
len(acceptednames)

1617

We can also add tokens from our name lists that did not appear in our tagged tokens, but might be elsewhere in the documents.

In [11]:
fnames = open("DBE_Names/firstnames.txt", 'r')
lnames = open("DBE_Names/lastnames.txt",'r')

In [12]:
DBEnames = []

for line in fnames:
    DBEnames.append(line.rstrip('\n'))
    
for line in lnames:
    DBEnames.append(line.rstrip('\n'))

In [13]:
DBEnames = set(DBEnames)

In [14]:
len(DBEnames)

10005

Iterating through the rows, we found some tokens that are not necessary in these lists. We'll remove them:
maestro
licenciado
bachiller
maestre
fray
nombre
jurado
notario
vicario
mercader
escribano

In [15]:
'maestro' in DBEnames

True

In [16]:
'licenciado' in DBEnames

True

In [17]:
'bachiller' in DBEnames

True

In [18]:
'maestre' in DBEnames

True

In [19]:
'fray' in DBEnames

True

In [20]:
'nombre' in DBEnames

True

In [21]:
'jurado' in DBEnames

True

In [22]:
'notario' in DBEnames

True

In [23]:
'vicario' in DBEnames

True

In [24]:
'mercader' in DBEnames

True

In [25]:
'escribano' in DBEnames

True

In [26]:
DBEnames.remove('maestro')

In [27]:
DBEnames.remove('licenciado')
DBEnames.remove('bachiller')

In [28]:
DBEnames.remove('maestre')
DBEnames.remove('fray')
DBEnames.remove('nombre')
DBEnames.remove('jurado')
DBEnames.remove('notario')
DBEnames.remove('vicario')
DBEnames.remove('mercader')

In [29]:
DBEnames.remove('escribano')
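
The membership checks and one-by-one removals above could also be written as a single set operation. This is a minimal sketch, assuming a small stand-in set: `dbe_names` and its contents are illustrative, not the real DBE lists.

```python
# Titles and occupations that appear in the name lists but are not names.
NON_NAMES = {
    'maestro', 'licenciado', 'bachiller', 'maestre', 'fray', 'nombre',
    'jurado', 'notario', 'vicario', 'mercader', 'escribano',
}

# Illustrative stand-in for the real DBEnames set.
dbe_names = {'juan', 'garcia', 'maestro', 'fray', 'escribano'}

# Report which of the unwanted tokens are actually present, then drop them.
present = NON_NAMES & dbe_names
dbe_names -= NON_NAMES  # unlike set.remove, this never raises KeyError

print(sorted(present))
print(sorted(dbe_names))
```

Set difference also makes the cell safe to re-run, whereas a second call to `remove` on the same token would fail.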

Let's look at the tokens that appear in the DBE list but do not form part of any entity (token.ent_iob is 2). 

In [30]:
untagged = []
for doc,context in docs:
    for token in doc:
        if (token.ent_iob==2) and ((token.text.lower() in DBEnames)):
            untagged.append(token.text)

In [31]:
len(untagged)

79780

In [32]:
print(set(untagged))

{'vera', 'bejar', 'labrador', 'hijas', 'campo', 'estada', 'salvador', 'Chaves', 'osuna', 'Hospital', 'peru', 'Ignacio', 'vuelta', 'bassa', 'oleo', 'Hombre', 'Jesus', 'romero', 'crespo', 'balduque', 'iglesias', 'obligado', 'horno', 'andino', 'provecho', 'balle', 'macia', 'batista', 'sota', 'coronel', 'linage', 'regla', 'Morales', 'angeles', 'negro', 'phelipe', 'balmazeda', 'Segunda', 'treviño', 'Rico', 'Corte', 'dominico', 'escriba', 'garcia', 'Monteros', 'sobrino', 'Yermo', 'Batista', 'celada', 'castellanos', 'tornos', 'llanas', 'juan', 'maeda', 'sillero', 'raso', 'Tome', 'Restan', 'Jacob', 'coll', 'plaza', 'peon', 'linaza', 'Villanueva', 'salcedo', 'remesal', 'pared', 'ursula', 'Aldana', 'laurencio', 'Amparo', 'eliseo', 'tizon', 'Justa', 'criado', 'Ramo', 'castillejo', 'mina', 'limpias', 'plata', 'CARTA', 'gallego', 'Perez', 'lara', 'Carta', 'gato', 'purificacion', 'Junta', 'cuentas', 'palomares', 'benegas', 'monasterio', 'rueda', 'rincon', 'camacho', 'Perea', 'rosa', 'anunciacion', '

In [33]:
len(set(untagged))

1464

It would be wise to review this list of 1,464 tokens manually to determine which are definitely names and which are dubious. But first, let's see how many of these matches (DBE and acceptednames) occur adjacent to a PER entity (before or after), to see which spans we have to edit.

# Strategy 1: Exact Matching in Lists

In [34]:
docs[0][0][1]

martin

I want a rule that looks at the token before or after a PER entity. If that token is in the name lists and is not tagged as an entity, I want it to be appended to our list. If the token is a connector (de, del, la, el, los, las or y), I want the rule to move on to the next token.

In [35]:
nexttopers = []
for doc,context in docs:
    for ent in doc.ents:
        if ent.label_=='PER':
            try:
                i = 1
                a = doc[ent.start-i]
                while a.text.lower() in ('de', 'del', 'la', 'el', 'los', 'las', 'y'):
                    i = i+1
                    a = doc[ent.start-i]
                if (a.ent_iob==2) and ((a.text.lower() in acceptednames) or (a.text.lower() in DBEnames)):
                    nexttopers.append(a.text)
            except: 
                pass
            try:
                i = 0
                b = doc[ent.end+i]
                while b.text.lower() in ('de', 'del', 'la', 'el', 'los', 'las', 'y'):
                    i = i+1
                    b = doc[ent.end+i]
                if (b.ent_iob==2) and ((b.text.lower() in acceptednames)or (b.text.lower() in DBEnames)):
                    nexttopers.append(b.text)
            except:
                pass

In [36]:
len(nexttopers)

971

In [37]:
len(set(nexttopers))

298

In [38]:
print(set(nexttopers))

{'capilla', 'costa', 'sa', 'cordero', 'Esquivel', 'delgado', 'matos', 'calero', 'cantero', 'maldonado', 'carrasco', 'abad', 'oro', 'ayala', 'labrador', 'Conde', 'puga', 'hijas', 'campo', 'bravo', 'villafuerte', 'hoz', 'Nisa', 'Abad', 'jesus', 'cavallero', 'texada', 'calafate', 'justa', 'toma', 'camacho', 'Vega', 'corral', 'hermoso', 'vida', 'camara', 'Chaves', 'serrano', 'guzman', 'portugues', 'comino', 'centeno', 'rua', 'vuelta', 'tapia', 'monesterio', 'farfan', 'ana', 'vico', 'Cantero', 'marques', 'Sala', 'doblas', 'chantre', 'francisco', 'cabeza', 'clavijo', 'beata', 'martin', 'maeso', 'cotan', 'moro', 'obligado', 'reyna', 'pero', 'fernandes', 'gragera', 'justicia', 'tome', 'andino', 'Clemencia', 'casado', 'teresa', 'Albo', 'balle', 'toledano', 'Jeronimo', 'espadero', 'crisostomo', 'moran', 'Linaje', 'tablada', 'pastor', 'herrero', 'Bueno', 'urban', 'falconete', 'melgarejo', 'linage', 'tamariz', 'palacios', 'coronel', 'encima', 'ceron', 'contador', 'vea', 'quero', 'aguirre', 'Marcos

According to this rule, 971 modifications could be made, from 298 distinct tokens.
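
One caveat with the rule as written: a negative index appears to wrap around to the end of the document, so for an entity that starts at token 0 the left-hand lookup can return the last token, and the bare `except` does not catch this. The sketch below shows a bounds-checked version of the same search on plain lists of strings. The name `find_neighbor` and the sample tokens are illustrative assumptions, and the `ent_iob` check is left out because plain strings carry no entity information.

```python
CONNECTORS = {'de', 'del', 'la', 'el', 'los', 'las', 'y'}

def find_neighbor(tokens, start, end, names, step):
    """Return the index of the first non-connector token before (step=-1)
    or after (step=+1) the span tokens[start:end] if it is a known name,
    else None. Never wraps around the ends of the list."""
    i = start - 1 if step < 0 else end
    while 0 <= i < len(tokens) and tokens[i].lower() in CONNECTORS:
        i += step
    if 0 <= i < len(tokens) and tokens[i].lower() in names:
        return i
    return None

tokens = ['juan', 'de', 'la', 'vega', 'vecino', 'de', 'sevilla']
names = {'vega', 'juan'}

right = find_neighbor(tokens, 0, 1, names, +1)  # span 'juan', look right
left = find_neighbor(tokens, 0, 1, names, -1)   # span 'juan', look left
print(right, left)
```

Here the right-hand search skips `de` and `la` and lands on `vega`, while the left-hand search stops at the start of the list instead of wrapping around.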

In [39]:
peopleshort = []

for doc,context in docs:
    for ent in doc.ents:
        if ent.label_=='PER':
            try:
                i = 1
                a = doc[ent.start-i]
                while a.text.lower() == 'de' or a.text.lower() =='del' or a.text.lower() == 'la'or a.text.lower() == 'el' or a.text.lower() == 'los' or a.text.lower() == 'las' or a.text.lower()=='y':
                    i = i+1
                    a = doc[ent.start-i]
                if (a.ent_iob==2) and ((a.text.lower() in acceptednames) or (a.text.lower() in DBEnames)):
                    h = context['id']
                    b = ent.text
                    c = ent.start
                    d = ent.end
                    e = a.i
                    f = a.text
                    g = ent.label_
                    entry = [h, b, c, d, e, f,g]
                    peopleshort.append(entry)
            except: 
                pass
            try:
                i = 0
                b = doc[ent.end+i]
                while b.text.lower() == 'de' or b.text.lower() == 'del' or b.text.lower() == 'la'or b.text.lower() == 'el' or b.text.lower() == 'los' or b.text.lower() == 'las' or b.text.lower()=='y':
                    i = i+1
                    b = doc[ent.end+i]
                if (b.ent_iob==2) and ((b.text.lower() in acceptednames)or (b.text.lower() in DBEnames)):
                    a = context['id']
                    h = ent.text
                    c = ent.start
                    d = ent.end
                    e = b.i
                    f = b.text
                    g=ent.label_
                    entry = [a, h, c, d, e, f,g]
                    peopleshort.append(entry)
            except:
                pass

In [40]:
len(peopleshort)

971

In [41]:
peopleshort = pd.DataFrame(peopleshort, columns = ['docid','string','start','end','tokenloc','tokentext','entlabel'])

In [42]:
peopleshort.head()

Unnamed: 0,docid,string,start,end,tokenloc,tokentext,entlabel
0,50,Cristobal garcia,6,8,8,cantero,PER
1,56,garcia fustanero,23,25,22,pero,PER
2,59,Pedro Blascoen,2476,2478,2478,Sevilla,PER
3,101,nicolas jurate,163,165,165,dos,PER
4,106,niculao de leon,0,3,3,frances,PER


In [43]:
peopleshort['edit_short']=''

This should be enough to loop through our labels and edit them, as we did previously in Data Cleaning 2. And the number of tags is small enough that we do not need to filter things out. 

In [56]:
count = 0
for (index, row) in peopleshort.iterrows():
    if row['edit_short']=='':
        count=count+1
        print(count)
        print(row['string'],row['start'],row['end'])
        print(row['tokentext'], row['tokenloc'])
        docid = row['docid']
        strstart = min([int(row['start']), int(row['tokenloc'])])
        strend = max([int(row['end']), int(row['tokenloc'])]) + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:      
                displacy.render(doc[0][strstart:strend],style='ent',jupyter=True)
                peopleshort.loc[index, 'edit_short']=input('Edit?')
                
                if peopleshort.loc[index, 'edit_short']=='X':
                        for doc in docs:
                            if doc[1]['id'] == docid:
                                displacy.render(doc[0],style='ent',jupyter=True)
                                peopleshort.loc[index,'edit_short']=input('Edit?')
                clear_output(wait=True)

2
DIEGO AMBROSIO 14 16
menor 16


Edit? 0
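
The review loop scans all of `docs` for every row to find the document with the matching id. Building a lookup table once avoids the repeated scan. This is a sketch, assuming `docs` is a list of `(doc, context)` pairs as above; the documents here are stand-in lists of strings and `context_window` is a hypothetical helper that mirrors the `strstart`/`strend` computation.

```python
def context_window(start, end, tokenloc, pad=10):
    """Token range covering the entity and the candidate token,
    plus `pad` tokens of right-hand context (as in the loop above)."""
    return min(start, tokenloc), max(end, tokenloc) + pad

# Stand-in for the real list of (doc, context) pairs.
docs = [
    (['cristobal', 'garcia', 'cantero'], {'id': 50}),
    (['pero', 'garcia', 'fustanero'], {'id': 56}),
]

# Build the id -> doc mapping once instead of scanning docs for every row.
doc_by_id = {context['id']: doc for doc, context in docs}

lo, hi = context_window(start=1, end=3, tokenloc=0)
print(doc_by_id[56][lo:hi])
```

Slicing past the end of a document is harmless, so the padding needs no extra bounds check.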


In [None]:
peopleshort

In [363]:
peopleshort.fillna('').groupby('edit_short').count()

Unnamed: 0_level_0,Unnamed: 0,docid,string,start,end,tokenloc,tokentext,entlabel,newstart,newend
edit_short,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,686,686,686,686,686,686,686,686,686,686
1,262,262,262,262,262,262,262,262,262,262
L,1,1,1,1,1,1,1,1,1,1
R,7,7,7,7,7,7,7,7,7,7
o,15,15,15,15,15,15,15,15,15,15


Above: of the tags identified for checking, 686 are correct, 262 must be edited for length, one must be reclassified as a location, 7 must be removed and 15 have another problem (multiple tags or mis-tokenization).

In [58]:
peopleshort.to_csv('Files_Cleaning3/PERS_tag_clean3_strat1.csv')

# Strategy 2: Fuzzy Matching to Lists

Let's do the same thing, but with tokens that approximately match names in our lists.

In [59]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [60]:
tokens = pd.DataFrame()
tokens['token']=''
tokens['score']=''
tokens['match']=''

In [61]:
untagged = []
for doc,context in docs:
    for token in doc:
        if (token.ent_iob==2):
            untagged.append(token.text)

In [62]:
untagged = set(untagged)

In [63]:
len(untagged)

31748

In [64]:
# Build the frame in one step; DataFrame.append was removed in pandas 2.0.
tokens = pd.concat([tokens, pd.DataFrame({'token': list(untagged)})], ignore_index=True)

In [65]:
len(tokens)

31748

In [66]:
tokens = tokens.replace(to_replace='None', value='')

In [69]:
tokens.head()

Unnamed: 0,token,score,match
0,sucedido,,
1,yeruas,,
2,enojosa,,
3,brazado,,
4,italianas,,


The function below, taken from Data Cleaning 1, chooses the maximum fuzzy match for each token.

In [67]:
def fuzzmax(string, names):
    # Return [best score, best-matching name] for a token; names must be a list.
    string = str(string).lower()
    result = [fuzz.token_sort_ratio(string, a) for a in names]
    best = max(result)
    return [best, names[result.index(best)]]
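
For readers without fuzzywuzzy installed, the idea behind `fuzzmax` can be illustrated with the standard library alone. Note that this sketch swaps in `difflib.SequenceMatcher` for `fuzz.token_sort_ratio`, so its scores will not necessarily match the ones reported in this notebook; `best_match` is a hypothetical name.

```python
from difflib import SequenceMatcher

def best_match(token, names):
    """Return (score, name) for the name most similar to `token`.
    Scores are on a 0-100 scale, rounded to the nearest integer."""
    token = str(token).lower()
    scored = [
        (round(100 * SequenceMatcher(None, token, name).ratio()), name)
        for name in names
    ]
    # Compare on the score only; the first name wins on ties.
    return max(scored, key=lambda pair: pair[0])

score, name = best_match('Enojosa', ['saucedo', 'hinojosa', 'espejo'])
print(score, name)
```

Like `fuzzmax`, this compares the token against every name, so the cost grows with the number of tokens times the number of names.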

In [71]:
names = list(set(acceptednames +list(DBEnames)))

In [72]:
len(names)

10192

In [74]:
tokens['score'] = tokens.apply(lambda x: fuzzmax(x['token'], names), axis = 1)

In [75]:
tokens.head()

Unnamed: 0,token,score,match
0,sucedido,"[80, saucedo]",
1,yeruas,"[77, payeras]",
2,enojosa,"[80, hinojosa]",
3,brazado,"[80, bazcardo]",
4,italianas,"[75, albiñana]",


In [76]:
tokens['match']=tokens.apply(lambda x: x['score'][1],axis=1)

In [77]:
tokens['score']=tokens.apply(lambda x: x['score'][0], axis=1)

In [78]:
tokens.head()

Unnamed: 0,token,score,match
0,sucedido,80,saucedo
1,yeruas,77,payeras
2,enojosa,80,hinojosa
3,brazado,80,bazcardo
4,italianas,75,albiñana


In [79]:
len(tokens)

31748

In [80]:
tokens.to_csv('Files_Cleaning3/untagged_token_match_scores.csv')

In [81]:
tokens.set_index('token', inplace=True)

In [84]:
tokens90 = tokens.loc[(tokens['score']>90)&(tokens['score']<100)]

In [85]:
tokens90

Unnamed: 0_level_0,score,match
token,Unnamed: 1_level_1,Unnamed: 2_level_1
despejo,92,espejo
vigas,91,viegas
calabaza,94,calabazas
pasadas,93,passadas
colla,91,coalla
...,...,...
villacreses,91,villacreces
barias,91,briñas
honda,91,hondal
abraça,91,abraha


In [94]:
tokens90 = tokens90.drop(index='pintor')

In [98]:
tokens90 = tokens90.drop(index='dorador')
tokens90 = tokens90.drop(index='pintar')
tokens90 = tokens90.drop(index='frayle')

In [102]:
tokens90 = tokens90.drop(index='marido')
tokens90 = tokens90.drop(index='esclava')
tokens90 = tokens90.drop(index='vecino')

In [109]:
tokens90 = tokens90.drop(index='vecina')

In [114]:
tokens90 = tokens90.drop(index='hijos')

KeyError: "['hija'] not found in axis"
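
A `KeyError` like the one above is what `DataFrame.drop` raises when a label is not in the index. Passing all labels as one list together with `errors='ignore'` drops whatever is present and skips the rest. This is a sketch on a toy frame whose contents are illustrative.

```python
import pandas as pd

# Toy stand-in for tokens90, indexed by token.
tokens90 = pd.DataFrame(
    {'score': [92, 91, 93], 'match': ['espejo', 'viegas', 'pintor']},
    index=pd.Index(['despejo', 'vigas', 'pintar'], name='token'),
)

# Tokens to exclude; 'hija' is deliberately absent from the index.
not_names = ['pintar', 'hija']

# errors='ignore' skips labels that are not in the index instead of raising.
tokens90 = tokens90.drop(index=not_names, errors='ignore')
print(tokens90.index.tolist())
```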

In [115]:
peopleshortfuzz = []

for doc,context in docs:
    for ent in doc.ents:
        if ent.label_=='PER':
            try:
                i = 1
                a = doc[ent.start-i]
                while a.text.lower() == 'de' or a.text.lower() =='del' or a.text.lower() == 'la'or a.text.lower() == 'el' or a.text.lower() == 'los' or a.text.lower() == 'las' or a.text.lower()=='y':
                    i = i+1
                    a = doc[ent.start-i]
                if (a.ent_iob==2) and (a.text.lower() in tokens90.index):
                    h = context['id']
                    b = ent.text
                    c = ent.start
                    d = ent.end
                    e = a.i
                    f = a.text
                    g = ent.label_
                    entry = [h, b, c, d, e, f,g]
                    peopleshortfuzz.append(entry)
            except: 
                pass
            try:
                i = 0
                b = doc[ent.end+i]
                while b.text.lower() == 'de' or b.text.lower() == 'del' or b.text.lower() == 'la'or b.text.lower() == 'el' or b.text.lower() == 'los' or b.text.lower() == 'las' or b.text.lower()=='y':
                    i = i+1
                    b = doc[ent.end+i]
                if (b.ent_iob==2) and (b.text.lower() in tokens90.index):
                    a = context['id']
                    h = ent.text
                    c = ent.start
                    d = ent.end
                    e = b.i
                    f = b.text
                    g=ent.label_
                    entry = [a, h, c, d, e, f,g]
                    peopleshortfuzz.append(entry)
            except:
                pass

In [116]:
len(peopleshortfuzz)

583

In [117]:
peopleshortfuzz

[[22, 'Miguel de Gainza', 29, 32, 28, 'fueron', 'PER'],
 [49, 'Cristobal', 2, 3, 1, 'maestre', 'PER'],
 [49, 'luis sanches', 969, 971, 968, 'canteria', 'PER'],
 [49, 'juan fernandes', 981, 983, 984, 'maestre', 'PER'],
 [49, 'Cristobal', 985, 986, 984, 'maestre', 'PER'],
 [49, 'Cristobal', 1007, 1008, 1006, 'maestre', 'PER'],
 [110, 'Andres', 6, 7, 5, 'maestre', 'PER'],
 [112, 'Cristobal', 15, 16, 16, 'darcos', 'PER'],
 [132, 'Pedro de Campaña', 3, 6, 2, 'maestre', 'PER'],
 [139, 'nufrio de ortega', 281, 284, 285, 'cuarta', 'PER'],
 [139, 'bartolome de ortega', 359, 362, 363, 'cuarta', 'PER'],
 [172, 'diego de la barrera', 36, 40, 35, 'cuanto', 'PER'],
 [201, 'Juan barva', 33, 35, 35, 'cabeça', 'PER'],
 [221, 'francisco pavon', 67, 69, 69, 'maestre', 'PER'],
 [227, 'juan gonzales', 53, 55, 52, 'padrino', 'PER'],
 [229, 'Manuel de Artiaga y Salcedo', 84, 89, 83, 'padrino', 'PER'],
 [232, 'Manuel de Artiaga y Alua', 76, 81, 75, 'padrino', 'PER'],
 [256, 'Andres de Ibarburu y Galdona', 39,

In [118]:
peopleshortfuzz = pd.DataFrame(peopleshortfuzz, columns = ['docid','string','start','end','tokenloc','tokentext','entlabel'])

In [120]:
peopleshortfuzz['edit_short']=''

In [121]:
peopleshortfuzz.head()

Unnamed: 0,docid,string,start,end,tokenloc,tokentext,entlabel,edit_short
0,22,Miguel de Gainza,29,32,28,fueron,PER,
1,49,Cristobal,2,3,1,maestre,PER,
2,49,luis sanches,969,971,968,canteria,PER,
3,49,juan fernandes,981,983,984,maestre,PER,
4,49,Cristobal,985,986,984,maestre,PER,


In [122]:
count = 0
for (index, row) in peopleshortfuzz.iterrows():
    if row['edit_short']=='':
        count=count+1
        print(count)
        print(row['string'],row['start'],row['end'])
        print(row['tokentext'], row['tokenloc'])
        docid = row['docid']
        strstart = min([int(row['start']), int(row['tokenloc'])])
        strend = max([int(row['end']), int(row['tokenloc'])]) + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:      
                displacy.render(doc[0][strstart:strend],style='ent',jupyter=True)
                peopleshortfuzz.loc[index, 'edit_short']=input('Edit?')
                
                if peopleshortfuzz.loc[index, 'edit_short']=='X':
                        for doc in docs:
                            if doc[1]['id'] == docid:
                                displacy.render(doc[0],style='ent',jupyter=True)
                                peopleshortfuzz.loc[index,'edit_short']=input('Edit?')
                clear_output(wait=True)

583
Pedro de Plata 369 372
rosarios 367


Edit? 0


In [362]:
peopleshortfuzz.fillna('').groupby('edit_short').count()

Unnamed: 0_level_0,Unnamed: 0,docid,string,start,end,tokenloc,tokentext,entlabel
edit_short,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,519,519,519,519,519,519,519,519
1,59,59,59,59,59,59,59,59
R,3,3,3,3,3,3,3,3
o,2,2,2,2,2,2,2,2


In [125]:
peopleshortfuzz.to_csv('Files_Cleaning3/PERS_tag_clean3_strat2.csv')

# Editing the spans

By finding exact and fuzzy matches near PER entities, we found spans that need to be edited in our NER tags. Let's combine the information from both searches and mark the new start/end of the span to edit.

In [357]:
peopleshort = pd.read_csv('Files_Cleaning3/PERS_tag_clean3_strat1.csv')
peopleshortfuzz = pd.read_csv('Files_Cleaning3/PERS_tag_clean3_strat2.csv')

In [364]:
peopleshort = pd.concat([peopleshort, peopleshortfuzz])

In [365]:
peopleshort['newstart']=''
peopleshort['newend']=''

In [366]:
peopleshort.groupby('edit_short').count()

Unnamed: 0_level_0,Unnamed: 0,docid,string,start,end,tokenloc,tokentext,entlabel,newstart,newend
edit_short,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,1205,1205,1205,1205,1205,1205,1205,1205,1205,1205
1,321,321,321,321,321,321,321,321,321,321
L,1,1,1,1,1,1,1,1,1,1
R,10,10,10,10,10,10,10,10,10,10
o,17,17,17,17,17,17,17,17,17,17


In [367]:
count = 0
for (index, row) in peopleshortfuzz.iterrows():

    if row['edit_short']=='1':
        count=count+1
        print(count)
        print(row['string'],row['start'],row['end'])
        
        docid = row['docid']
        strstart = min([int(row['start']), int(row['tokenloc'])])
        strend = max([int(row['end']), int(row['tokenloc'])]) + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:      
                displacy.render(doc[0][strstart:strend],style='ent',jupyter=True)
                
                a=input('New start?')
                b=input('New end?')
                d= input('Edit_short?')           
                if d =='e':
                    peopleshortfuzz.loc[index,'edit_short']='e'
                    if a !='':
                        peopleshortfuzz.loc[index, 'newstart']=a
                    if b !='':
                        peopleshortfuzz.loc[index,'newend']=b
                else:
                    peopleshortfuzz.loc[index,'edit_short']=d
                    
                clear_output(wait=True)

59
Bartolomé 9 10


New start? 
New end? 11
Edit_short? e


In [154]:
# To double check some items
for doc,context in docs:
    if "pablo pere" in doc.text:
        print(doc.text)

en presencia de mi pedro de almonacid escriuano publico de seuilla parescio luisa ordoñez biuda muger que fue de geronimo hernandez escultor difunto que sea en gloria vezina de seuilla en la collacion de san juan y dijo porque del dho su marido quedaron ciertos bienes y porque sean sauidos e conoscidos a las personas que a ellos touieren derecho que hazia e hizo inventario de los tales bienes que son los siguientes deudas que se quedaron debiendo y se deuen primeramente 1 600 ducados que deve la fabrica de la villa de lucena de rresto de los marauedis en que fue tasada la hechura de un rretablo que geronimo hernandez hizo para la yglesia de la dicha villa 150 ducados que deue la fabrica de la yglesia de sant nicolas de seuilla de rresto de los maravedis en que fue tassado el rretablo que geronimo hernandez hizo para la dha yglesia 100 ducados que deve el monasterio de sant leandro de rresto de los marauedis en que fue concertado el rretablo que geronimo hernandez hizo para la yglesia d

In [374]:
peopleshort.groupby('edit_short').count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,docid,string,start,end,tokenloc,tokentext,entlabel,newstart,newend
edit_short,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,1208,686,1208,1208,1208,1208,1208,1208,1208,0,0
L,1,1,1,1,1,1,1,1,1,0,0
R,10,7,10,10,10,10,10,10,10,0,0
e,317,261,317,317,317,317,317,317,317,29,288
o,16,14,16,16,16,16,16,16,16,0,0


In [195]:
peopleshort.loc[1,'newstart']='22'

In [373]:
peopleshort.to_csv('Files_Cleaning3/PERS_tag_clean3_manual')

# Applying the changes to Spacy Docs

The first thing to do is to drop those marked as 0, as we do not need to change them.

In [5]:
peopleshort = pd.read_csv('Files_Cleaning3/PERS_tag_clean3_manual')

In [6]:
edits = peopleshort[peopleshort['edit_short']!='0']

In [7]:
edits.loc[(edits['edit_short']=='L'),'entlabel']='LOC'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
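
The warning above appears because `edits` was produced by filtering `peopleshort`, and pandas cannot tell whether the assignment is meant to reach the original frame. Taking an explicit copy before assigning is a common way to avoid it. This is a sketch with a toy frame whose values are illustrative.

```python
import pandas as pd

peopleshort = pd.DataFrame({
    'edit_short': ['0', 'e', 'L'],
    'entlabel': ['PER', 'PER', 'PER'],
})

# .copy() makes `edits` an independent frame, so the assignment below
# does not touch `peopleshort`.
edits = peopleshort[peopleshort['edit_short'] != '0'].copy()
edits.loc[edits['edit_short'] == 'L', 'entlabel'] = 'LOC'

print(edits['entlabel'].tolist())
print(peopleshort['entlabel'].tolist())
```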


In [8]:
edits.groupby('entlabel').count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,docid,string,start,end,tokenloc,tokentext,edit_short,newstart,newend
entlabel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LOC,1,1,1,1,1,1,1,1,1,1,0,0
PER,343,343,282,343,343,343,343,343,343,343,29,288


In [9]:
edits = edits.fillna('')

In the work below, we noticed two duplicated rows. Let's fix them before they cause problems in subsequent work:

In [10]:
edits[edits.duplicated(subset=['docid','start','end'], keep = False)]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,docid,string,start,end,tokenloc,tokentext,entlabel,edit_short,newstart,newend


We can drop the duplicates because the edits are the same.

In [243]:
peopleshort = peopleshort.drop(index=[283,389])

In [16]:
peopleshort[peopleshort.index == 389]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,docid,string,start,end,tokenloc,tokentext,entlabel,edit_short,newstart,newend
389,389,391,391.0,2617,bartolome muñoz,0,2,568,capilla,PER,0,,


In [12]:
edits.loc[edits['newstart']=='','newstart'] = edits['start']

In [13]:
edits.loc[edits['newend']=='','newend'] = edits['end']

In [17]:
edits

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,docid,string,start,end,tokenloc,tokentext,entlabel,edit_short,newstart,newend
1,1,1,1,56,garcia fustanero,23,25,22,pero,PER,e,22,25
2,2,2,2,59,Pedro Blascoen,2476,2478,2478,Sevilla,PER,o,2476,2478
6,6,6,6,195,Fernandez de Guadalupe,1,4,0,Pero,PER,e,0,4
7,7,7,7,197,Fernandez de Guadalupe,1,4,0,Pero,PER,e,0,4
9,9,9,9,227,firmefecho ut supraxpobal,70,73,73,alcantara,PER,o,70,73
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1346,377,377,,4319,juan bautista,73,75,75,billalpando,PER,e,73,76
1354,385,385,,4393,juan de uzeda,3,6,6,castroberde,PER,e,3,7
1375,406,406,,4625,Lusia,81,82,82,damiano,PER,e,81,85
1376,407,407,,4673,Andrés de Lazo,31,34,34,Destrada,PER,e,31,35


In [18]:
edits.dtypes

Unnamed: 0         int64
Unnamed: 0.1       int64
Unnamed: 0.1.1    object
docid              int64
string            object
start              int64
end                int64
tokenloc           int64
tokentext         object
entlabel          object
edit_short        object
newstart          object
newend            object
dtype: object

In [19]:
edits.groupby('edit_short').count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,docid,string,start,end,tokenloc,tokentext,entlabel,newstart,newend
edit_short,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
L,1,1,1,1,1,1,1,1,1,1,1,1
R,10,10,10,10,10,10,10,10,10,10,10,10
e,317,317,317,317,317,317,317,317,317,317,317,317
o,16,16,16,16,16,16,16,16,16,16,16,16


In [20]:
remove = edits[['docid','start','end']].values.tolist()

In [21]:
len(remove)

344

In [398]:
remove

[[56, 23, 25],
 [59, 2476, 2478],
 [195, 1, 4],
 [197, 1, 4],
 [227, 70, 73],
 [229, 6, 8],
 [238, 34, 38],
 [267, 154, 157],
 [275, 170, 173],
 [276, 55, 59],
 [320, 7, 10],
 [329, 48, 51],
 [333, 52, 54],
 [337, 68, 71],
 [338, 45, 48],
 [339, 34, 37],
 [362, 7, 9],
 [602, 225, 228],
 [754, 189, 190],
 [772, 1840, 1843],
 [772, 2479, 2482],
 [772, 2753, 2756],
 [787, 169, 171],
 [1007, 337, 340],
 [1009, 21, 24],
 [1017, 72, 76],
 [1017, 100, 102],
 [1022, 30, 34],
 [1022, 146, 150],
 [1022, 179, 182],
 [1022, 418, 421],
 [1023, 173, 176],
 [1023, 383, 387],
 [1023, 403, 406],
 [1023, 452, 456],
 [1027, 110, 113],
 [1028, 3, 5],
 [1028, 193, 196],
 [1029, 83, 86],
 [1044, 1340, 1342],
 [1049, 167, 170],
 [1049, 349, 352],
 [1059, 32, 34],
 [1066, 131, 133],
 [1066, 1142, 1144],
 [1086, 90, 92],
 [1107, 13, 15],
 [1107, 321, 323],
 [1107, 586, 588],
 [1170, 326, 328],
 [1197, 7, 9],
 [1199, 602, 604],
 [1201, 925, 926],
 [1201, 928, 929],
 [1201, 958, 959],
 [1211, 970, 972],
 [1227, 

In [22]:
edits = edits[edits['edit_short']!='R']

In [23]:
add = edits[['docid','newstart','newend','entlabel']].values.tolist()

In [24]:
len(add)

334

In [None]:
# First step: drop all tags included in our edits dataset.

In [25]:
tempents =[]

for doc, context in docs:
    ents = list(doc.ents)
    ents2 =[]
    for ent in ents:
        rep = [ context['id'] , ent.start , ent.end]
        if rep not in remove: 
            a = spacy.tokens.Span(doc, ent.start,ent.end,label = ent.label)
            ents2.append(a)
    ents_id = [context['id'], ents2]
    tempents.append(ents_id)

In [26]:
count=0
for line in tempents:
    for ent in line[1]:
        count = count+1

In [27]:
count

70859

Just to double check that this removed the right number of entities, there should be a difference of 344 with the original.

In [28]:
count =0
for doc,context in docs:
    ents = list(doc.ents)
    for ent in ents:
        count = count+1

In [29]:
count

71203

In [30]:
71203-344

70859

In [31]:
for doc, context in docs:
    for item in tempents:
        if item[0]==context['id']:
            doc.ents= item[1]

That should have updated the ents, including all but those in our 'remove' list.
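
Testing membership against a list of lists is linear in the length of the list, which adds up over tens of thousands of entities. A set of tuples is a cheap alternative. This is a sketch in which plain tuples stand in for spaCy spans; the sample values are illustrative.

```python
# (docid, start, end) triples to drop, as in the `remove` list above.
remove = [[56, 23, 25], [59, 2476, 2478]]
remove_set = {tuple(r) for r in remove}  # tuples are hashable, lists are not

# Stand-in entities for one document: (start, end, label).
docid = 56
ents = [(3, 5, 'PER'), (23, 25, 'PER'), (40, 41, 'LOC')]

# Keep every entity whose (docid, start, end) is not marked for removal.
kept = [e for e in ents if (docid, e[0], e[1]) not in remove_set]
print(kept)
```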

In [184]:
# Second step: add tags with modified spans and labels, not those marked R.

In [32]:
for doc, context in docs:
    for line in add:
        if line[0] == context['id']:
            a = spacy.tokens.Span(doc, int(line[1]),int(line[2]),label = line[3])
            doc.ents = list(doc.ents) + [a]

To double check, we should now have 70859 + 334 = 71193 entities.

In [33]:
len(add)

334

In [34]:
count =0
for doc,context in docs:
    ents = list(doc.ents)
    for ent in ents:
        count = count+1

In [35]:
count

71193

In [36]:
71193-70859

334
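
Assigning to `doc.ents` fails in spaCy if any two spans overlap, so it can be worth checking each new span against the existing ones before adding it. This is a sketch using `(start, end)` pairs with exclusive ends, as in spaCy; the helper name `overlaps` and the sample spans are hypothetical.

```python
def overlaps(span, others):
    """True if half-open interval `span` shares a token with any in `others`."""
    start, end = span
    return any(start < o_end and o_start < end for o_start, o_end in others)

existing = [(0, 2), (10, 13)]

adjacent = overlaps((2, 4), existing)     # starts where (0, 2) ends
clashing = overlaps((12, 14), existing)   # shares token 12 with (10, 13)
print(adjacent, clashing)
```

Because the ends are exclusive, a span that begins exactly where another one ends does not count as overlapping.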

Finally, let's pickle the results.

In [37]:
with open("Files_Cleaning3/NER_data_Cleaned3.p", "wb") as file:
    pickle.dump(docs, file)