# Data Cleaning 2: Correcting Token Spans

In this notebook, we go over "person" tags and clean spans to make sure they reference full names, no more and no less. 

We will recover the tags as cleaned in Data Cleaning 1, focusing on those we maintained as 'Remove'= 0. 



# Goal 1: To revise spans that are **too broad**

Create a list of tokens that appear in entities but do not appear in the name lists. We can do this by getting the difference between the name set and the entity tokens set. Mark the tokens that might be considered potentially problematic. Find them and determine if they should be a part of the span. Edit the spans respectively.

We can also add Kinkead's capitalized tokens to the "acceptable" token set.

# Goal 2: To revise spans that are **too short**

Search for tokens to the left and right of a tagged person that appear in the entity list and the name lists and have not been tagged as entities. We can import the spacy docs, each token should have an entity tag = 0. 

# Goal 3: Untagged PER entities

Search for strings that match exactly with strings that have been tagged as entities. This should work to pick up untagged people.

---
# Goal 1

## 1 Creating a list of tokens that appear in our entities 

In [120]:
import pandas as pd
people = pd.read_csv('Files_Cleaning1/PER_tags_clean1.csv')

In [121]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
0,0,0,0,57935,57935,6792,,,135,138,1
1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0
2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0
3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0
4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0


In [3]:
grouped = people.groupby('Remove')
grouped.count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end
Remove,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,32442,32442,32442,32442,32442,32442,32442,32442,32442,32442
1,561,561,561,561,561,561,558,0,561,561
D,6,6,6,6,6,6,6,6,6,6
L,39,39,39,39,39,39,39,39,39,39
M,6,6,6,6,6,6,6,6,6,6
O,66,66,66,66,66,66,66,66,66,66


Let's generate a list of tokens that appear in our PER entities. To do this, we are flattening the list of lists that is generated when we split each string by spaces, and we create a set to remove duplicates. Then we store it as a list again. 

In [11]:
tokenlist= people.loc[people['Remove']=='0','string'].str.split().tolist()

In [12]:
newlist= []
for sublist in tokenlist:
    if type(sublist)==list:
        for item in sublist:
            newlist.append(str(item))
tokenlist=list(set(newlist))

In [13]:
tokenlist.sort()

In [14]:
len(tokenlist)

6692

Note that this list includes tokens in upper and lowercase. This is because these cases are informative. As Kinkead marked artist names in full uppercase, we can mark uppercase tokens as tokens for true names.

In [15]:
with open('PersonNameTokenList.csv','w') as file:
    for item in tokenlist:
        file.write(item + "\n")

In [9]:
print(tokenlist)

['-', '--', '--de', '--nino', '-basco', '-gaspar', '-geronimo', '-luisa', '-ndo', '-vasco', '2-', '20', '481', 'a', 'abaasquita', 'abaco', 'abadesa', 'abadessa', 'abadia', 'abaeu', 'abalos', 'abaxo', 'abdina', 'abedaria', 'abeis', 'abel', 'abellan', 'abellano', 'abendaria', 'abendaño', 'abila', 'abiles', 'abita', 'abley', 'ablo', 'abraham', 'abrego', 'abreu', 'abril', 'abrue', 'abueda', 'acacio', 'academia', 'acauna', 'acebedo', 'acedo', 'acencio', 'acencion', 'acenfio', 'acensio', 'acer', 'acerdian', 'acero', 'acevedo', 'acha', 'acharte', 'acinto', 'acipres', 'acisclos', 'acle', 'acofta', 'acolina', 'acosser', 'acosta', 'acuna', 'acuña', 'adalid', 'adallo', 'adam', 'adame', 'adan', 'adornio', 'adriaen', 'adrian', 'aduana', 'afan', 'afanador', 'afandor', 'afedo', 'afencio', 'afenfio', 'afeto', 'agais', 'agilon', 'agora', 'agramonte', 'agua', 'aguayo', 'aguero', 'aguftin', 'aguiar', 'aguiere', 'aguila', 'aguilar', 'aguilera', 'aguilla', 'aguillar', 'aguire', 'aguirre', 'aguitar', 'agust

Let's turn this into a dataframe with a column that includes a column for "accepted" tokens, where 1 is a name token and 0 is a dubious token. Let's go ahead and assign a 1 value to those tokens that are in uppercase.

In [18]:
tokens = pd.DataFrame(tokenlist, columns = ['token'])
tokens.loc[tokens['token'].str.isupper(),'is_name']=1

In [19]:
tokens.groupby('is_name').count()

Unnamed: 0_level_0,token
is_name,Unnamed: 1_level_1
1.0,572


In [24]:
uppers = tokens.loc[tokens['is_name']==1,'token'].tolist()

In [25]:
uppers

['A',
 'ADRIAN',
 'AFANDOR',
 'AGUILA',
 'AGUILAR',
 'AGUSTIN',
 'AIALA',
 'ALA',
 'ALARCON',
 'ALAVA',
 'ALBERTO',
 'ALCANTEIES',
 'ALCANTUD',
 'ALDRETTE',
 'ALFONSO',
 'ALMANSA',
 'ALMANZA',
 'ALONSO',
 'ALTA',
 'ALTERI',
 'ALVA',
 'ALVAREZ',
 'AMARROSTA',
 'AMARROSTAO',
 'AMBAR',
 'AMBROSIO',
 'ANDRACH',
 'ANDRES',
 'ANGELO',
 'ANTOLINEZ',
 'ANTONIO',
 'APARICIO',
 'ARAGON',
 'ARANA',
 'ARANDA',
 'ARAUS',
 'ARAZ',
 'ARCE',
 'ARCOS',
 'ARENAS',
 'ARENSECALLATRADA',
 'ARIAS',
 'ARIENZACALATRAVA',
 'ARMEDIA',
 'ARRENAS',
 'ARROYO',
 'ARTEAGA',
 'ARTEAGAALFARO',
 'ARTENSE',
 'ARTENSECALATADO',
 'ARTENSECALATRADA',
 'ARTIAGA',
 'ARTIGA',
 'ARUAS',
 'ARZE',
 'ASCENCIO',
 'ASENCIO',
 'ASINTA',
 'ATIENZA',
 'ATIENZACALATRAVA',
 'AVILA',
 'AY',
 'AYALA',
 'B',
 'BAL',
 'BALDEQUIROS',
 'BALDES',
 'BALDEZ',
 'BALLESTEROS',
 'BALTASAR',
 'BALTAZAR',
 'BALTHAZAR',
 'BAN',
 'BANMOL',
 'BARAHONA',
 'BARAONA',
 'BARCO',
 'BARELA',
 'BARONA',
 'BARREDA',
 'BARRERA',
 'BARTOLOME',
 'BASQUEZ',
 'BAUTI

This has identified 572 tokens as acceptable. We can also mark their lowercase equivalents as acceptable.

In [26]:
tokens.loc[tokens['token'].str.upper().isin(uppers),'is_name']=1

In [27]:
tokens.groupby('is_name').count()

Unnamed: 0_level_0,token
is_name,Unnamed: 1_level_1
1.0,1253


This has identified 1,253 tokens as acceptable.

In [102]:
strat1 = tokens.loc[tokens['is_name']==1,'token']

strat1.to_csv("Files_Cleaning2/acceptednametokens2_strat1.csv")

A look at the .CSV file reveals that the list seems OK. We allow it to include articles (la, lo, el, los) because at this stage we don't think they should be problematic. 

# 2 Creating a list of tokens not in our combined namelists

Other acceptable names may not be in the list of Kinkead tokens. We can rely on tokens from our first and last name lists to be name tokens. (Or, put in other words, we will have reason to suspect tokens not in this list not to be names).

In [32]:
fnames = open('DBE_Names/firstnames.txt')
lnames = open('DBE_Names/lastnames.txt')

In [33]:
names = []

for line in fnames:
    names.append(line.rstrip('\n'))
    
for line in lnames:
    names.append(line.rstrip('\n'))

In [34]:
names = set(names)

In [35]:
names

{'moraza',
 'monte',
 'pichardo',
 'simeon',
 'guarinos',
 'virtus',
 'mejias',
 'gertrudis',
 'filiberto',
 'letamendi',
 'aguera',
 'parcero',
 'calduch',
 'echave',
 'sierra',
 'bolero',
 'nurnberger',
 'lalana',
 'quintano',
 'piedras',
 'castellon',
 'mortela',
 'nienguiri',
 'capo',
 'eleuterio',
 'ballinas',
 'acha',
 'icoaga',
 'iturri',
 'salt',
 'linaje',
 'cusi',
 'insaurralde',
 'encinas',
 'labrador',
 'peral',
 'bergaigne',
 'descatllar',
 'rozas',
 'jasso',
 'folguera',
 'cristofini',
 'corta',
 'gausy',
 'urquinaona',
 'labart',
 'carriedo',
 'urioste',
 'bernabe',
 'vallgornera',
 'welzer',
 'turell',
 'grassot',
 'rosari',
 'santiago',
 'leyes',
 'zabalo',
 'molinillo',
 'hermosino',
 'reding',
 'fraso',
 'sanctis',
 'boltas',
 'dauventon',
 'ungo',
 'nariño',
 'reinegg',
 'cochet',
 'yeregui',
 'valenti',
 'hondal',
 'muxica',
 'cannizaro',
 'bru',
 'benet',
 'azcune',
 'oncins',
 'sarrion',
 'borsoto',
 'segade',
 'haro',
 'bernuz',
 'escalona',
 'fondevila',
 'lang

In [47]:
len(names)

10005

Above, we have created one set of unique names from both our first names and last names file, no overlap. Note that these are all lowercase.

Next, let's find the difference between this list and our tokens found above.

In [36]:
tokens.loc[tokens['token'].str.lower().isin(names),'is_name']=1

In [37]:
tokens.groupby('is_name').count()

Unnamed: 0_level_0,token
is_name,Unnamed: 1_level_1
1.0,3171


In [38]:
len(tokens)

6692

So far, we have managed to mark 3,171 tokens as reliable names from a list of 6,692. 3,521 remain unaccounted for. 

In [40]:
dubioustokens = tokens.loc[tokens['is_name']!=1,'token'].to_list()

In [46]:
print(dubioustokens)

['-', '--', '--Nino', '--de', '-basco', '-gaspar', '-geronimo', '-luisa', '-ndo', '-vasco', '2-', '20', '481', 'Abaasquita', 'Abaco', 'Abadesa', 'Abaeu', 'Abalos', 'Abdina', 'Abedaria', 'Abendaria', 'Abrego', 'Abril', 'Abrue', 'Abueda', 'Academia', 'Acauna', 'Acencion', 'Acenfio', 'Acensio', 'Acerdian', 'Acisclos', 'Acle', 'Acolina', 'Acosser', 'Acuna', 'Adallo', 'Afanador', 'Agilon', 'Agua', 'Aguiere', 'Aguillar', 'Aguire', 'Agustino', 'Aillon', 'Alaras', 'Alarcon\x97Simon', 'Alariz', 'Alavarre', 'Albarada', 'Albarado', 'Albardo', 'Albardon', 'Albares', 'Albarez', 'Albaro', 'Albarran', 'Alber', 'Albero', 'Alberta', 'Alca', 'Alcala', 'Alcega\x97Franco', 'Alcina', 'Alcor', 'Alcosser', 'Aldape', 'Aleantara', 'Alegro', 'Aleola', 'Alexandro', 'Alfian', 'Alfonsa', 'Alhondiga', 'Alicante', 'Alina', 'Alja', 'Allendano', 'Alma', 'Almança', 'Almao', 'Almario', 'Almas', 'Almedo', 'Almonacir', 'Almonarir', 'Almoneria', 'Almonte', 'Alo', 'Alonfo', 'Alonsa', 'Alonsso', 'Alphonsa', 'Alrecon', 'Alseg

How many tags do these dubious tokens affect? Let's calculate this below.

# Number of records with a token not from our name list

In [39]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
0,0,0,0,57935,57935,6792,,,135,138,1
1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0
2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0
3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0
4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0


In [110]:
def pattern_searcher(search_str:str, search_list:list):
    
    search_strlist = search_str.lower().split()
    
    check =  any(item in search_strlist for item in search_list)
    
    if check:
        return_str = 1
        
    else:
        return_str = 0
        
    return return_str

In [59]:
people['matched_str'] = people['string'].apply(lambda x: pattern_searcher(search_str=str(x), search_list=dubioustokens))

In [72]:
people[people['Remove']=='0'].groupby(['matched_str']).count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
matched_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,23310,23310,23310,23310,23310,23310,23310,23310,23310,23310,23310
1,9132,9132,9132,9132,9132,9132,9132,9132,9132,9132,9132


Within our tags that have not been flagged for removal, We have 9132 tags that have a potentially questionable token right now, and 23,310 that are OK.

In [73]:
len((set(people.loc[(people['Remove']=='0')&(people['matched_str']==1),'string'])))

5511

Of these 9132 tags, 5511 are distinct strings. 

In [75]:
len((set(people.loc[(people['Remove']=='0')&(people['matched_str']==0),'string'].to_list())))

8498

Of the 23,310 correct tags, 8498 are distinct strings.

# 3 Fuzzy Matching

Let's also use our results from fuzzy matching and mark any tokens that represent a less than 90% match 90% match with tokens in the lists as dubious.

In [76]:
fuzzymatches = pd.read_csv('Files_Cleaning1/token_match_scores.csv')

In [77]:
fuzzymatches.set_index('token', inplace= True)

In [78]:
fuzzymatches.head()

Unnamed: 0_level_0,Unnamed: 0,score,match
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
BENJUMEA,0,71,enjuta
Prevendado,1,71,revenga
Lateral,2,77,antera
MORILLO,3,100,morillo
Ramon,4,100,ramon


In [84]:
tokens90 = fuzzymatches[fuzzymatches['score']>=90].index.tolist()

In [85]:
len(tokens90)

3717

In [86]:
print(tokens90)

['MORILLO', 'Ramon', 'Ruela', 'morante', 'Coello', 'VELAZQUEZ', 'Cardenas', 'TIRADO', 'BORJA', 'quebedo', 'martos', 'Quevedo', 'Teresa', 'Juanes', 'Monero', 'arze', 'Biedma', 'Durango', 'barrionuebo', 'despindola', 'trujillo', 'Ruiz', 'baltasar', 'Tenorio', 'Villens', 'IRIARTE', 'Canto', 'villa', 'Astete', 'AGUSTIN', 'Ruano', 'Trelles', 'Enrique', 'rroelas', 'Grimon', 'Rosal', 'horosco', 'ATIENZA', 'Sala', 'cuesta', 'Espinosa', 'Eraso', 'CARASCO', 'ayllon', 'MENESES', 'Sotto', 'Cipriano', 'Ribadeneira', 'Dimas', 'barrassa', '-geronimo', 'ALONSO', 'Calatayud', 'Salafranca', 'Caro', 'Arnao', 'holgado', 'Pan', 'GOMEZ', 'Morellon', 'Llanes', 'porras', 'muñoz', 'Ruis', 'Mathia', 'GARCIA', 'TORRES', 'Deza', 'siguença', 'Vermondo', 'Solarte', 'Ruesta', 'caraballo', 'Almonacid', 'ALDRETTE', 'florencia', 'Nicolas', 'deza', 'rrojas', 'Hernandez', 'roela', 'carcel', 'Matias', 'pino', 'goncalo', 'maria', 'carrasco', 'Federigui', 'rroldan', 'Torado', 'marcho', 'cuevas', 'arpe', 'ARANA', 'Izaguire',

Items that appear in this list should also be accepted as name tokens (is_name =1)

In [87]:
tokens.loc[tokens['token'].str.lower().isin(tokens90),'is_name']=1

In [105]:
tokens.groupby('is_name').count()

Unnamed: 0_level_0,token
is_name,Unnamed: 1_level_1
1.0,3610


In [89]:
len(tokens)

6692

So far, we have managed to mark 3,610 tokens as reliable names from a list of 6,692. 3,082 remain unaccounted for. 

Making all lowercase, these are 2,862 unique strings.

In [108]:
dubioustokens = tokens.loc[tokens['is_name']!=1,'token'].str.lower().to_list()

In [109]:
len(set(dubioustokens))

2862

In [46]:
print(dubioustokens)

['-', '--', '--Nino', '--de', '-basco', '-gaspar', '-geronimo', '-luisa', '-ndo', '-vasco', '2-', '20', '481', 'Abaasquita', 'Abaco', 'Abadesa', 'Abaeu', 'Abalos', 'Abdina', 'Abedaria', 'Abendaria', 'Abrego', 'Abril', 'Abrue', 'Abueda', 'Academia', 'Acauna', 'Acencion', 'Acenfio', 'Acensio', 'Acerdian', 'Acisclos', 'Acle', 'Acolina', 'Acosser', 'Acuna', 'Adallo', 'Afanador', 'Agilon', 'Agua', 'Aguiere', 'Aguillar', 'Aguire', 'Agustino', 'Aillon', 'Alaras', 'Alarcon\x97Simon', 'Alariz', 'Alavarre', 'Albarada', 'Albarado', 'Albardo', 'Albardon', 'Albares', 'Albarez', 'Albaro', 'Albarran', 'Alber', 'Albero', 'Alberta', 'Alca', 'Alcala', 'Alcega\x97Franco', 'Alcina', 'Alcor', 'Alcosser', 'Aldape', 'Aleantara', 'Alegro', 'Aleola', 'Alexandro', 'Alfian', 'Alfonsa', 'Alhondiga', 'Alicante', 'Alina', 'Alja', 'Allendano', 'Alma', 'Almança', 'Almao', 'Almario', 'Almas', 'Almedo', 'Almonacir', 'Almonarir', 'Almoneria', 'Almonte', 'Alo', 'Alonfo', 'Alonsa', 'Alonsso', 'Alphonsa', 'Alrecon', 'Alseg

How many tags do these dubious tokens affect? Let's calculate this below.

# Number of records with a token not from our name list

In [92]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove,matched_str
0,0,0,0,57935,57935,6792,,,135,138,1,0
1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0,1
2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0,1
3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0,1
4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0,1


In [93]:
people['matched_str'] = people['string'].apply(lambda x: pattern_searcher(search_str=str(x), search_list=dubioustokens))

In [94]:
people[people['Remove']=='0'].groupby(['matched_str']).count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
matched_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,24717,24717,24717,24717,24717,24717,24717,24717,24717,24717,24717
1,7725,7725,7725,7725,7725,7725,7725,7725,7725,7725,7725


Within our tags that have not been flagged for removal, We have 7,725 tags that have a potentially questionable token right now, and 24,717 that are OK.

In [104]:
len((set(people.loc[(people['Remove']=='0')&(people['matched_str']==1),'string'])))

4720

Of these 7,725 tags, 4720 are distinct strings. 

In [96]:
len((set(people.loc[(people['Remove']=='0')&(people['matched_str']==0),'string'].to_list())))

9289

Of the 24,717 correct tags, 9289 are distinct strings.

We could potentially lower the barrier of fuzzy matching, to reduce these 4,720 tags further. But seeing that we only have 3,082 dubious tokens to go through, let's go through these manually and mark the ones we know to be tokens referring to names.

First, let's save these dubious tokens to a file. 

In [98]:
tokens.loc[tokens['is_name']!=1,'token'].to_csv('Files_Cleaning2/Dubioustokens_strat2.csv')

In [101]:
strat3 = tokens.loc[tokens['is_name']!=1,'token']
strat3.to_csv("Files_Cleaning2/dubiousnametokens2_strat3.csv")

## 4 Manual Check of Dubious Tokens

In the file above, we went ahead and marked any tokens that were inequivocally last names with is_name=1. We also marked tokens that were clearly erroneous as is_name=0. These included verbs and numbers. We removed titles and descriptors such as "escribanos" or "fiadores", except if these descriptors might form part of the name ("Arcediano de Niebla").  

In [112]:
dubious_handchecked = pd.read_excel("Files_Cleaning2/dubiousnametokens23_manualcheck.xlsx")

In [114]:
dubious_handchecked.head()

Unnamed: 0,tokenlower,is_name
0,-,1.0
1,--,1.0
2,--nino,1.0
3,--de,1.0
4,-ndo,1.0


In [115]:
dubious_handchecked.set_index('tokenlower')
dubious_handchecked.head()

Unnamed: 0,tokenlower,is_name
0,-,1.0
1,--,1.0
2,--nino,1.0
3,--de,1.0
4,-ndo,1.0


We will take strings marked as 1 to be names, and strings marked as 0 or blank need to be edited.

In [118]:
dubiousfinal = dubious_handchecked.loc[dubious_handchecked['is_name']!=1,'tokenlower'].to_list()

In [119]:
len(dubiousfinal)

1874

In [116]:
people['matched_str'] = people['string'].apply(lambda x: pattern_searcher(search_str=str(x), search_list=dubious_handchecked))

In [117]:
people[people['Remove']=='0'].groupby(['matched_str']).count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
matched_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,32442,32442,32442,32442,32442,32442,32442,32442,32442,32442,32442


# Strategy 2

Check strings with "San, Santo, Santa, Santos, Santas" and other non-person tokens.

In [31]:
nontokens = ['San ','Santo ','Santa ','Santos ','Santas ']

for token in nontokens:
    people.loc[people['string'].str.contains(token),'Remove']= 'check'

In [32]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end,Remove
0,57935,57935,6792,,PER,135,138,
1,45285,45285,4978,Rodriguez,PER,6,8,
2,45173,45173,4961,Gomez Brito,PER,34,38,
3,45108,45108,4952,Manrique,PER,11,13,
4,61780,61780,7342,Ramirez,PER,29,31,


In [33]:
people.groupby('Remove').count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end
Remove,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,7076,7076,7076,7076,7076,7076,7076
,25943,25943,25943,25943,25943,25943,25943
check,101,101,101,101,101,101,101


In [135]:
people[people['Remove']=='check']

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end,Remove
1585,47680,47680,5302,Andrea Santa Trinidad,PER,62,67,check
1911,44395,44395,4876,Andres San Martin,PER,46,50,check
2452,16350,16350,1783,Antonio Santa Cruz,PER,37,41,check
2453,16540,16540,1806,Antonio Santa Cruz,PER,0,4,check
2454,16597,16597,1814,Antonio Santa Cruz,PER,9,13,check
...,...,...,...,...,...,...,...,...
18449,2112,2112,265,Sebastian Santa Maria,PER,10,14,check
18635,45302,45302,4979,Señor San Andres,PER,243,246,check
18636,66747,66747,8050,Señor San Laureano,PER,122,125,check
18639,52425,52425,5990,Señorr San Esteban Martir,PER,103,107,check


In [136]:
# And then go over them one by one. Make list and change tags

In [1]:
import spacy
from spacy import displacy

In [8]:
nlp = spacy.load("es_core_news_ml_EMS2")

In [144]:
from IPython.display import clear_output

In [9]:
import pickle

file = open("Trained_EMS2_NER_data.p", 'rb')
docs = pickle.load(file)

In [10]:
for (index, row) in people.iterrows():
    if row['Remove']=='check':
        print(row['string'])
        
        docid = row['docid']
        strstart = int(row['start'])
        fetchend = strstart + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:
                displacy.render(doc[0][strstart:fetchend],style='ent',jupyter=True)
                people.loc[index,'Remove']=input('Remove?')
            if  people.loc[index,'Remove']=='X':
                people.loc[index,'Remove']=''
                if doc[1]['id'] == docid:
                    displacy.render(doc[0],style='ent',jupyter=True)
                    people.loc[index,'Remove']=input('Remove?')
            clear_output(wait=True)

NameError: name 'people' is not defined