# Data Cleaning 2: Correcting Token Spans

In this notebook, we review the "person" (PER) tags and correct their spans so that each one covers a full name, no more and no less.

We load the tags as cleaned in Data Cleaning 1 and focus on those we kept with `Remove` = 0.



# Strategy 1: To revise spans that are **too broad**

Create a list of tokens that appear in entities but not in the name lists. We can get it as the set difference between the entity tokens and the name set. Mark the tokens that look potentially problematic, find them in the tagged strings, and decide whether each one belongs in the span. Edit the spans accordingly.

We can also add Kinkead's capitalized tokens to the "acceptable" token set.
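
One way to "edit the spans accordingly" is to trim a span from both edges until it starts and ends on an accepted token. The sketch below is only an illustration: the function name `trim_span`, the sample tokens, and the `accepted` set are hypothetical, and spans are assumed to be (start, end) token offsets with an exclusive end.

```python
def trim_span(tokens, start, end, accepted):
    """Shrink a (start, end) token span until both edges are accepted name tokens."""
    # Drop leading tokens that are not accepted name tokens.
    while start < end and tokens[start].lower() not in accepted:
        start += 1
    # Drop trailing tokens that are not accepted name tokens.
    while end > start and tokens[end - 1].lower() not in accepted:
        end -= 1
    return start, end


tokens = ["--", "Gomez", "Brito", "vecino"]
accepted = {"gomez", "brito"}  # hypothetical accepted-token set

trimmed = trim_span(tokens, 0, 4, accepted)
print(trimmed, tokens[trimmed[0]:trimmed[1]])
```

Only the edges are trimmed, so an unlisted token in the middle of a name (a particle such as "de", for instance) is left in place.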

# Strategy 2: To revise spans that are **too short**

Search for tokens immediately to the left and right of a tagged person that appear in the entity list and the name lists but have not been tagged as entities. We can load the spaCy docs for this; each candidate token should have an entity tag = 0.
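
The neighbor check could look roughly like the sketch below, which uses plain Python lists in place of spaCy docs. The function name `extend_span`, the sample tokens, the `known_names` set, and the `tagged` flags are all hypothetical; spans are assumed to be (start, end) token offsets with an exclusive end.

```python
def extend_span(tokens, start, end, known_names, tagged):
    """Grow a (start, end) token span over untagged neighbours that are known name tokens."""
    # Extend to the left while the previous token is an untagged known name.
    while start > 0 and not tagged[start - 1] and tokens[start - 1].lower() in known_names:
        start -= 1
    # Extend to the right under the same condition.
    while end < len(tokens) and not tagged[end] and tokens[end].lower() in known_names:
        end += 1
    return start, end


tokens = ["ante", "mi", "Juan", "Gomez", "Brito", "vecino"]
known_names = {"juan", "gomez", "brito"}
# Only "Gomez" was tagged as part of a PER entity.
tagged = [False, False, False, True, False, False]

new_start, new_end = extend_span(tokens, 3, 4, known_names, tagged)
print(tokens[new_start:new_end])
```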

# Strategy 3: Untagged PER entities

Search the text for strings that exactly match strings already tagged as entities. This should pick up people who were left untagged.
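
A minimal sketch of this exact-match search, assuming character offsets rather than the token offsets stored in `start` and `end`. The function name `find_untagged` and the sample text are hypothetical.

```python
import re


def find_untagged(text, tagged_strings, tagged_spans):
    """Return (start, end, string) matches of known entity strings that are not yet tagged."""
    found = []
    # Try longer strings first so a full name wins over a shorter one inside it.
    for name in sorted(set(tagged_strings), key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text):
            taken = tagged_spans + [f[:2] for f in found]
            # Skip matches that overlap an existing tag or an earlier match.
            if not any(s < m.end() and m.start() < e for s, e in taken):
                found.append((m.start(), m.end(), name))
    return sorted(found)


text = "Juan Gomez vendio a Pedro Ruiz y despues Juan Gomez pago"
tagged_strings = ["Juan Gomez", "Pedro Ruiz"]
tagged_spans = [(0, 10), (20, 30)]  # character spans already tagged

untagged = find_untagged(text, tagged_strings, tagged_spans)
print(untagged)
```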

---

# 1 Creating a list of tokens that appear in our entities 

In [198]:
import pandas as pd
people = pd.read_csv('Files_Cleaning1/PER_tags_clean1.csv')

In [3]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
0,0,0,0,57935,57935,6792,,,135,138,1
1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0
2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0
3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0
4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0


In [2]:
grouped = people.groupby('Remove')
grouped.count()

Remove,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end
0,32442,32442,32442,32442,32442,32442,32442,32442,32442,32442
1,561,561,561,561,561,561,558,0,561,561
D,6,6,6,6,6,6,6,6,6,6
L,39,39,39,39,39,39,39,39,39,39
M,6,6,6,6,6,6,6,6,6,6
O,66,66,66,66,66,66,66,66,66,66


Let's generate a list of the tokens that appear in our PER entities. We split each string on whitespace, flatten the resulting list of lists, and build a set to remove duplicates. Then we store the result as a list again.

In [199]:
tokenlist= people.loc[people['Remove']=='0','string'].str.split().tolist()

In [200]:
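# Flatten the list of token lists, lowercasing each token.
newlist = []
for sublist in tokenlist:
    # Rows with a missing string yield NaN instead of a list; skip them.
    if isinstance(sublist, list):
        for item in sublist:
            newlist.append(str(item).lower())
# Deduplicate with a set, then store the result as a list again.
tokenlist = list(set(newlist))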

In [201]:
tokenlist.sort()

In [202]:
len(tokenlist)

5164

In [203]:
with open('PersonNameTokenList.csv','w') as file:
    for item in tokenlist:
        file.write(item + "\n")

In [32]:
print(tokenlist)

['-', '--', '--de', '--nino', '-basco', '-gaspar', '-geronimo', '-luisa', '-ndo', '-vasco', '2-', '20', '481', 'a', 'abaasquita', 'abaco', 'abadesa', 'abadessa', 'abadia', 'abaeu', 'abalos', 'abaxo', 'abdina', 'abedaria', 'abeis', 'abel', 'abellan', 'abellano', 'abendaria', 'abendaño', 'abila', 'abiles', 'abita', 'abley', 'ablo', 'abraham', 'abrego', 'abreu', 'abril', 'abrue', 'abueda', 'acacio', 'academia', 'acauna', 'acebedo', 'acedo', 'acencio', 'acencion', 'acenfio', 'acensio', 'acer', 'acerdian', 'acero', 'acevedo', 'acha', 'acharte', 'acinto', 'acipres', 'acisclos', 'acle', 'acofta', 'acolina', 'acosser', 'acosta', 'acuna', 'acuña', 'adalid', 'adallo', 'adam', 'adame', 'adan', 'adornio', 'adriaen', 'adrian', 'aduana', 'afan', 'afanador', 'afandor', 'afedo', 'afencio', 'afenfio', 'afeto', 'agais', 'agilon', 'agora', 'agramonte', 'agua', 'aguayo', 'aguero', 'aguftin', 'aguiar', 'aguiere', 'aguila', 'aguilar', 'aguilera', 'aguilla', 'aguillar', 'aguire', 'aguirre', 'aguitar', 'agust

# 2 Creating a list of tokens not in our combined namelists

In [204]:
fnames = open('DBE_Names/firstnames.txt')
lnames = open('DBE_Names/lastnames.txt')

In [205]:
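names = []

# Read one name per line from each file, dropping the trailing newline.
for line in fnames:
    names.append(line.rstrip('\n'))

for line in lnames:
    names.append(line.rstrip('\n'))

# Close the file handles now that both lists have been read.
fnames.close()
lnames.close()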

In [206]:
names = set(names)

In [46]:
names

{'arenzana',
 'altamirano',
 'bruneau',
 'riego',
 'batlle',
 'menendez-lavandera',
 'avinent',
 'balaguer',
 'johann',
 'alsedo',
 'lores',
 'rufino',
 'mazo',
 'german',
 'blumen',
 'obrador',
 'comallonga',
 'ruiz-bravo',
 'rechaule',
 'sesvilles',
 'baçan',
 'foz',
 'trastamar',
 'jacot',
 'poyo',
 'gazola',
 'ibero',
 'arrivillaga',
 'caamaño',
 'manjon',
 'yandiola',
 'buey',
 'bonanat',
 'nogueira',
 'francesc',
 'beraton',
 'fernandez-rajo',
 'texeda',
 'adsor',
 'ambrosio',
 'monge',
 'leaegui',
 'marzo',
 'olavide',
 'bergosa',
 'caunedo',
 'ciscar',
 'landaburu',
 'mirabent',
 'tramulles',
 'buach',
 'abadal',
 'centella',
 'torrella',
 'cienfuegos',
 'danese',
 'ohiggins',
 'ferry',
 'crispin',
 'presno',
 'fuensalida',
 'nordenflicht',
 'monteros',
 'salelles',
 'malvar',
 'tirry',
 'vicenç',
 'dupire',
 'mancio',
 'langarica',
 'cramer',
 'villagrassa',
 'giro',
 'zamalloa',
 'terrazas',
 'manuela',
 'taboada',
 'thubieres',
 'coleti',
 'deverez',
 'frontaura',
 'estuard'

In [47]:
len(names)

10005

Above, we created a single set of unique names from the first-name and last-name files. Because it is a set, a name that appears in both files is stored only once.

Next, let's find the difference between this list and our tokens found above.

In [207]:
tokset = set(tokenlist)
potential_non_names = list(tokset - names)

In [208]:
len(potential_non_names)

3477

In [209]:
acceptedtokens = list(tokset & names)

In [210]:
len(acceptedtokens)

1687

In [54]:
print(acceptedtokens)

['lario', 'pinelo', 'galvez', 'gaytan', 'briones', 'contreras', 'gil', 'mayor', 'catalan', 'alsedo', 'pinel', 'rufino', 'mazo', 'german', 'nardi', 'coria', 'cuevas', 'jacob', 'orduña', 'leona', 'patron', 'texeda', 'joseph', 'miguel', 'ambrosio', 'monge', 'leaegui', 'chinchetru', 'sojo', 'olea', 'medrano', 'zamudio', 'cienfuegos', 'silva', 'osuna', 'maestro', 'tamayo', 'segarra', 'monteros', 'pablo', 'pizarro', 'teodoro', 'trigo', 'martir', 'mirafuentes', 'salablanca', 'guarda', 'riba', 'aso', 'soria', 'manuela', 'calatrava', 'alexo', 'tellez', 'cantillana', 'correa', 'ruy', 'saenz', 'linares', 'pastor', 'acedo', 'roger', 'antonia', 'catala', 'jerez', 'milanes', 'iglesia', 'lainez', 'ignacio', 'palomino', 'minjares', 'redondo', 'cordova', 'morante', 'cerecedo', 'alcocer', 'pilares', 'trujillo', 'barua', 'orosco', 'velasquez', 'aguirre', 'enrique', 'jeronimo', 'barrientos', 'rosales', 'joana', 'andujar', 'noboa', 'buendia', 'cano', 'aguilar', 'castilla', 'hidalgo', 'anuncibay', 'sigler',

In [50]:
print(potential_non_names)

['rodrygo', 'schuitt', 'libera', 'universales', 'fugoso', 'belazques', 'guantero', 'laçaro', 'bartolo', 'bastos', 'salbadora', 'espinera', 'mañan', 'rreal', 'cotte', 'andrach', 'quinientos', 'sausedo', 'ynostrosa', 'rribas', 'millon', 'tibino', 'publico', 'rey', 'menezes', 'monrroi', 'luysa', 'gorge', 'salcedobenjumea', 'nanziba', 'prespectiba', 'afenfio', 'abalos', 'corbartes', 'algun', 'ybañes', 'barea', 'embila', 'jues', 'caruajal', 'asegurado', 'juanete', 'suaso', 'xpoual', 'urtaben', 'baja', 'hinoxosa', 'fumarraga', 'cibo', 'redon', 'ragel', 'arbuscody', 'usanqui', 'ranero', 'çurbaran', 'ponse', 'aya', 'rrio', 'urdiales', 'bauzel', 'montejer', 'parrafranco', 'ouido', 'hazello', 'malaves', 'pesaue', 'casablanca', 'yniesta', 'gonfales', 'argadona', 'canprovin', 'rreciue', 'tamares', 'grada', 'tamaro', 'bouadilla', 'migues', 'bordadores', 'monjas', 'bellarte', 'firme\x97fecho', 'usarte', 'cathalina', 'aluaro', 'mda', 'saya', 'conocio', 'ayllan', 'mathia', 'tamaris', 'jigon', 'aqua', 

This approach accepts 1,687 tokens as true name tokens, but 3,477 remain unverified. Many of them (see above) are, once again, spelling variants of actual name tokens. And, as we can see below, that leaves too many records to revise manually.

# Number of records with a token not from our name list

In [172]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove,matched_str
0,0,0,0,57935,57935,6792,,,135,138,1,n
1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0,-
2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0,--
3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0,--
4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0,--


In [211]:
import re
def pattern_searcher(search_str: str, search_list: str) -> int:
    # search_list is a regex alternation ('tok1|tok2|...').
    # Return 1 if any alternative occurs anywhere in search_str, else 0.
    search_obj = re.search(search_list, search_str)
    return 1 if search_obj else 0

In [212]:
pattern = '|'.join(potential_non_names)

In [213]:
people['matched_str'] = people['string'].apply(lambda x: pattern_searcher(search_str=str(x), search_list=pattern))

In [214]:
people[people['matched_str']==1].count()

Unnamed: 0            27102
Unnamed: 0.1          27102
Unnamed: 0.1.1        27102
Unnamed: 0.1.1.1      27102
Unnamed: 0.1.1.1.1    27102
docid                 27102
string                27099
label                 26541
start                 27102
end                   27102
Remove                27102
matched_str           27102
dtype: int64

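At this point, 27,102 tags match at least one potentially questionable token. This count is only a rough estimate: the pattern is lowercase and is matched case-sensitively as a substring anywhere in the string, not against whole tokens, so a short token can match inside a longer, valid name, while a capitalized token can be missed.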
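
Because `'|'.join(...)` builds an unescaped alternation, `re.search` reports a hit whenever any listed token occurs as a substring. A token-level membership test is one alternative. The sketch below is an illustration under that assumption; the function name `has_unlisted_token` and the sample data are hypothetical.

```python
def has_unlisted_token(entity, unlisted):
    """Return 1 if any whitespace-separated token of `entity` is in `unlisted`, else 0."""
    if not isinstance(entity, str):
        return 0  # missing strings (NaN) carry no tokens
    # Compare whole tokens, lowercased the same way as the token list.
    return int(any(tok.lower() in unlisted for tok in entity.split()))


unlisted = {"a", "rey", "--"}  # hypothetical questionable tokens
samples = ["Maria Gomez", "-- Ramirez", "Juan Rey", None]

flags = [has_unlisted_token(s, unlisted) for s in samples]
print(flags)  # [0, 1, 1, 0]
```

With the DataFrame from this notebook, the same function could be applied as `people['string'].apply(has_unlisted_token, args=(set(potential_non_names),))`. Note that "Maria Gomez" is not flagged here even though both tokens contain the letter "a".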

# 3 Fuzzy Matching

Let's use our fuzzy-matching results and accept the tokens whose best match against the name lists scores 90 or higher.
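
The notebook does not say which scorer produced `token_match_scores.csv`. As a stand-in, the sketch below uses `difflib.SequenceMatcher` from the standard library to show the general idea: score a token against every name, keep the best match, and apply the threshold of 90. The function name `best_match` and the sample names are hypothetical, and the scores will not necessarily agree with those in the file.

```python
from difflib import SequenceMatcher


def best_match(token, names):
    """Return (score, match): the closest name and its 0-100 similarity score."""
    token = token.lower()
    scored = (
        (round(100 * SequenceMatcher(None, token, name).ratio()), name)
        for name in names
    )
    # max() picks the highest score; ties fall back to alphabetical order.
    return max(scored)


names = ["morillo", "ramon", "velazquez"]  # hypothetical name list

for tok in ["MORILLO", "belazques", "quinientos"]:
    score, match = best_match(tok, names)
    decision = "accept" if score >= 90 else "review"
    print(tok, score, match, decision)
```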

In [215]:
fuzzymatches = pd.read_csv('Files_Cleaning1/token_match_scores.csv')

In [216]:
fuzzymatches.set_index('token', inplace= True)

In [217]:
fuzzymatches.head()

token,Unnamed: 0,score,match
BENJUMEA,0,71,enjuta
Prevendado,1,71,revenga
Lateral,2,77,antera
MORILLO,3,100,morillo
Ramon,4,100,ramon


In [218]:
tokens90 = fuzzymatches[fuzzymatches['score']>=90].index.str.lower().tolist()

In [219]:
len(tokens90)

3717

In [220]:
print(tokens90)

['morillo', 'ramon', 'ruela', 'morante', 'coello', 'velazquez', 'cardenas', 'tirado', 'borja', 'quebedo', 'martos', 'quevedo', 'teresa', 'juanes', 'monero', 'arze', 'biedma', 'durango', 'barrionuebo', 'despindola', 'trujillo', 'ruiz', 'baltasar', 'tenorio', 'villens', 'iriarte', 'canto', 'villa', 'astete', 'agustin', 'ruano', 'trelles', 'enrique', 'rroelas', 'grimon', 'rosal', 'horosco', 'atienza', 'sala', 'cuesta', 'espinosa', 'eraso', 'carasco', 'ayllon', 'meneses', 'sotto', 'cipriano', 'ribadeneira', 'dimas', 'barrassa', '-geronimo', 'alonso', 'calatayud', 'salafranca', 'caro', 'arnao', 'holgado', 'pan', 'gomez', 'morellon', 'llanes', 'porras', 'muñoz', 'ruis', 'mathia', 'garcia', 'torres', 'deza', 'siguença', 'vermondo', 'solarte', 'ruesta', 'caraballo', 'almonacid', 'aldrette', 'florencia', 'nicolas', 'deza', 'rrojas', 'hernandez', 'roela', 'carcel', 'matias', 'pino', 'goncalo', 'maria', 'carrasco', 'federigui', 'rroldan', 'torado', 'marcho', 'cuevas', 'arpe', 'arana', 'izaguire',

In [222]:
tok90set = set(tokens90)

In [224]:
acceptedtokens = list(tokset & (names | tok90set))

In [225]:
len(acceptedtokens)

2416

In [233]:
potential_non_names2 = list(tokset - names - tok90set)

In [234]:
len(potential_non_names2)

2748

This list is still too long to review manually.

# Number of records with a token not from our name list

In [195]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove,matched_str
0,0,0,0,57935,57935,6792,,,135,138,1,1
1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0,1
2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0,1
3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0,1
4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0,1


In [235]:
# Rebuild the pattern from the reduced token list.
pattern = '|'.join(potential_non_names2)

In [236]:
people['matched_str'] = people['string'].apply(lambda x: pattern_searcher(search_str=str(x), search_list=pattern))

In [237]:
people[people['matched_str']==1].count()

Unnamed: 0            27102
Unnamed: 0.1          27102
Unnamed: 0.1.1        27102
Unnamed: 0.1.1.1      27102
Unnamed: 0.1.1.1.1    27102
docid                 27102
string                27099
label                 26541
start                 27102
end                   27102
Remove                27102
matched_str           27102
dtype: int64

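We still have 27,102 tags with a potentially questionable token. The count is identical to the one before fuzzy matching, even though the list of questionable tokens shrank from 3,477 to 2,748, which suggests that the matches are dominated by tokens that remain in both lists.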

In [112]:
fuzzymatches.loc['SCHUITT']

Unnamed: 0     3668
score            83
match         schut
Name: SCHUITT, dtype: object

In [142]:
tok83set = set(fuzzymatches[(fuzzymatches['score']>83) &(fuzzymatches['score']<90)].index.str.lower().to_list())

In [143]:
len(tok83set)

981

In [144]:
fuzzymatches.loc['becino']

Unnamed: 0      5042
score             83
match         benino
Name: becino, dtype: object

In [145]:
print(tok83set)

{'rodrygo', 'zarcosa', 'libera', 'chamarro', 'lucente', 'rruiz', 'fugoso', 'graendes', 'jerusalen', 'reymundo', 'ferino', 'bartolo', 'pora', 'leonarda', 'kernandez', 'hespa', 'sayz', 'consecion', 'galbez', 'myranda', 'costança', 'fria', 'sauzedo', 'sal', 'beatris', 'fraucisco', 'espinera', 'mañan', 'rreal', 'andrach', 'cotte', 'antonyo', 'sausedo', 'u¡llalobos', 'gueuara', 'benabides', 'arguela', 'juhan', 'dotores', 'vrrutia', 'corbachino', 'cueba', 'guarin', 'monrroi', 'bejines', 'amarrosta', 'roças', 'moyna', 'mansano', 'lodobico', 'ase', 'cifneros', 'abalos', 'salbago', 'bernandes', 'barea', 'ala', 'hoya', 'altar', 'montearroyo', 'caruajal', 'merced\x97', 'forteza', 'loro', 'hinoxosa', 'lansa', 'hordones', 'faustina', 'anduxar', 'serabia', 'trabieso', 'montejer', 'ponse', 'rrio', 'jhoan', 'aya', 'escaraza', 'faucedo', 'malaves', 'tres', 'domingues', 'jasinto', 'cuia', 'alrecon', 'yniesta', 'argadona', 'reynalte', 'requadros', 'tamares', 'bouadilla', 'bos', 'malgarida', 'cauello', 'c

In [151]:
potential_non_names = list(tokset - names - tok90set - tok83set)

In [153]:
len(potential_non_names)

1841

In [152]:
print(potential_non_names)

['schuitt', 'universales', 'belazques', 'guantero', 'salbadora', 'quinientos', 'ynostrosa', 'millon', 'tibino', 'publico', 'rey', 'luysa', 'salcedobenjumea', 'prespectiba', 'nanziba', 'afenfio', 'corbartes', 'jues', 'algun', 'ybañes', 'embila', 'asegurado', 'juanete', 'suaso', 'xpoual', 'urtaben', 'baja', 'fumarraga', 'cibo', 'arbuscody', 'usanqui', 'ranero', 'çurbaran', 'urdiales', 'bauzel', 'parrafranco', 'ouido', 'hazello', 'pesaue', 'casablanca', 'gonfales', 'rreciue', 'canprovin', 'migues', 'bordadores', 'monjas', 'firme\x97fecho', 'usarte', 'mda', 'aluaro', 'conocio', 'ayllan', 'jigon', 'aqua', 'gemo', 'paiba', 'terron', 'suares', 'bazques', 'nupcias', 'prescio', 'laretana', 'baldibiefo', 'ynquisidor', 'seis', 'montañezsignada', 'donzelta', 'no', 'vellavoa', 'blaceo', 'fe', 'surbaran', 'sm', 'nusete', 'allarran', 'juy', 'josseffe', 'elega', 'pisaro', 'higales', 'purita', 'guandeola', 'graviel', 'pafecto', 'bieffas', 'samiedo', 'sita', 'questa', 'ualga', 'cancagorta', 'ymbrea', 'b

In [154]:
tokensless83 = list(set(fuzzymatches[fuzzymatches['score']<=83].index.str.lower().tolist()))

In [155]:
len(tokensless83)

2260

In [150]:
print(tokensless83)

['schuitt', 'relieuo', 'universales', 'belazques', 'becino', 'guantero', 'efto', 'salbadora', 'quinientos', 'ynostrosa', 'tibino', 'millon', 'publico', 'rey', 'luysa', 'salcedobenjumea', 'nanziba', 'prespectiba', 'afenfio', 'corbartes', 'jues', 'algun', 'ybañes', 'embila', 'asegurado', 'juanete', 'urtaben', 'xpoual', 'berbo', 'suaso', 'baja', 'fumarraga', 'cibo', 'arbuscody', 'usanqui', 'ranero', 'çurbaran', 'urdiales', 'bauzel', 'parrafranco', 'ouido', 'bifta', 'hazello', 'pesaue', 'casablanca', 'subceda', 'gonfales', 'rreciue', 'canprovin', 'migues', 'ntra', 'bordadores', 'monjas', 'firme\x97fecho', 'usarte', 'mda', 'aluaro', 'conocio', 'ayllan', 'jigon', 'aqua', 'gemo', 'obiere', 'paiba', 'terron', 'suares', 'bazques', 'nupcias', 'formulas', 'prescio', 'baptizo', 'laretana', 'ynquisidor', 'baldibiefo', 'seis', 'ciudadde', 'montañezsignada', 'pasteles', 'donzelta', 'animas', 'no', 'vellavoa', 'blaceo', 'fe', 'surbaran', 'sm', 'nusete', 'omniun', 'vmdes', 'cuando', 'allarran', 'juy', 

# Strategy 2

Flag for review the strings that contain "San", "Santo", "Santa", "Santos", or "Santas", along with other non-person tokens.
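
A word-boundary regular expression is one way to express this check without relying on a trailing space. The sketch below is an illustration only; the names `SAINT_PATTERN` and `needs_check` and the sample strings are hypothetical.

```python
import re

# Match the five forms only as whole words.
SAINT_PATTERN = re.compile(r"\b(?:San|Santo|Santa|Santos|Santas)\b")


def needs_check(entity):
    """Return True if `entity` is a string containing one of the flagged words."""
    return isinstance(entity, str) and SAINT_PATTERN.search(entity) is not None


samples = ["Antonio Santa Cruz", "Sancho Ruiz", "Pedro de los Santos", None]

results = [needs_check(s) for s in samples]
print(results)  # [True, False, True, False]
```

Unlike a literal search for `'Santos '` with a trailing space, this also flags a string that ends in "Santos", and it does not flag "Sancho".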

In [31]:
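nontokens = ['San ','Santo ','Santa ','Santos ','Santas ']

for token in nontokens:
    # na=False treats missing strings as non-matches; regex=False matches the token literally.
    people.loc[people['string'].str.contains(token, na=False, regex=False),'Remove']= 'check'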

In [32]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end,Remove
0,57935,57935,6792,,PER,135,138,
1,45285,45285,4978,Rodriguez,PER,6,8,
2,45173,45173,4961,Gomez Brito,PER,34,38,
3,45108,45108,4952,Manrique,PER,11,13,
4,61780,61780,7342,Ramirez,PER,29,31,


In [33]:
people.groupby('Remove').count()

Remove,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end
0,7076,7076,7076,7076,7076,7076,7076
,25943,25943,25943,25943,25943,25943,25943
check,101,101,101,101,101,101,101


In [135]:
people[people['Remove']=='check']

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end,Remove
1585,47680,47680,5302,Andrea Santa Trinidad,PER,62,67,check
1911,44395,44395,4876,Andres San Martin,PER,46,50,check
2452,16350,16350,1783,Antonio Santa Cruz,PER,37,41,check
2453,16540,16540,1806,Antonio Santa Cruz,PER,0,4,check
2454,16597,16597,1814,Antonio Santa Cruz,PER,9,13,check
...,...,...,...,...,...,...,...,...
18449,2112,2112,265,Sebastian Santa Maria,PER,10,14,check
18635,45302,45302,4979,Señor San Andres,PER,243,246,check
18636,66747,66747,8050,Señor San Laureano,PER,122,125,check
18639,52425,52425,5990,Señorr San Esteban Martir,PER,103,107,check


In [136]:
# And then go over them one by one. Make list and change tags

In [1]:
import spacy
from spacy import displacy

In [8]:
nlp = spacy.load("es_core_news_ml_EMS2")

In [144]:
from IPython.display import clear_output

In [9]:
import pickle

# Load the pickled docs; the context manager closes the file afterwards.
with open("Trained_EMS2_NER_data.p", 'rb') as file:
    docs = pickle.load(file)

In [10]:
for (index, row) in people.iterrows():
    if row['Remove']=='check':
        print(row['string'])

        docid = row['docid']
        strstart = int(row['start'])
        fetchend = strstart + 10

        # Each item in docs is indexed as (doc, metadata); find the one for this record.
        for doc in docs:
            if doc[1]['id'] == docid:
                # Show the tagged span plus a few tokens of context.
                displacy.render(doc[0][strstart:fetchend],style='ent',jupyter=True)
                people.loc[index,'Remove']=input('Remove?')
                # Entering 'X' shows the full document before asking again.
                if people.loc[index,'Remove']=='X':
                    displacy.render(doc[0],style='ent',jupyter=True)
                    people.loc[index,'Remove']=input('Remove?')
                clear_output(wait=True)
                break  # assumes doc ids are unique, so stop at the first match

NameError: name 'people' is not defined