# Data Cleaning 2: Correcting Entity Spans (Long)

In this notebook, we review the "person" (PER) tags and clean their spans so that each one covers a full name, no more and no less.

We start from the tags as cleaned in Data Cleaning 1, focusing on those we kept with 'Remove' = 0.

# Goal 1: To revise spans that are **too broad** (this notebook)

Create a list of tokens that appear in entities but do not appear in the name lists. We can do this by taking the difference between the set of entity tokens and the set of names. Mark the tokens that look potentially problematic, find the tags that contain them, and decide whether they belong in the span. Edit the spans accordingly.

We can also add Kinkead's capitalized tokens to the "acceptable" token set.

# Goal 2: To revise spans that are **too short** (data cleaning 3)

Search for tokens to the left and right of a tagged person that appear in the entity list and the name lists and have not been tagged as entities. We can import the spaCy docs; each such token should have an entity tag = 0 (untagged).

# Goal 3: Untagged PER entities (data cleaning 4)

Search for strings that match exactly with strings that have been tagged as entities. This should work to pick up untagged people.

# Goal 4: Problematic words within the name list (data cleaning 5)
Some words may refer to saints or locations. Identify these words and find strings made up only of these words, so that they can be reclassified.

---
# Goal 1

## 1 Creating a list of tokens that appear in our entities 

In [1]:
import pandas as pd
people = pd.read_csv('Files_Cleaning1/PER_tags_clean1.csv')

In [2]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
0,0,0,0,57935,57935,6792,,,135,138,1
1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0
2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0
3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0
4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0
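
The repeated `Unnamed: 0…` columns in this preview are consistent with the dataframe having been saved with its index and reloaded several times. A minimal sketch of that behaviour, using toy rows and an in-memory buffer in place of the CSV file (both are assumptions for illustration only):

```python
import io
import pandas as pd

# Toy rows; the real file has many more columns.
toy = pd.DataFrame({"docid": [4978, 4961],
                    "string": ["- Rodriguez", "-- Gomez Brito"]})

# Default to_csv writes the index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False leaves the index out, so no extra column comes back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` when saving (or `index_col=0` when loading) would keep these helper columns from accumulating across notebooks.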


In [3]:
grouped = people.groupby('Remove')
grouped.count()

Remove,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end
0,32442,32442,32442,32442,32442,32442,32442,32442,32442,32442
1,561,561,561,561,561,561,558,0,561,561
D,6,6,6,6,6,6,6,6,6,6
L,39,39,39,39,39,39,39,39,39,39
M,6,6,6,6,6,6,6,6,6,6
O,66,66,66,66,66,66,66,66,66,66


Let's generate a list of tokens that appear in our PER entities. To do this, we split each string on whitespace, flatten the resulting list of lists, and build a set to remove duplicates. Then we store the result as a list again.

In [4]:
tokenlist= people.loc[people['Remove']=='0','string'].str.split().tolist()

In [5]:
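# Flatten the list of token lists, skipping missing strings (NaN is not a list)
newlist= []
for sublist in tokenlist:
    if isinstance(sublist, list):
        for item in sublist:
            newlist.append(str(item))
# Deduplicate with a set, then convert back to a list
tokenlist=list(set(newlist))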

In [6]:
tokenlist.sort()

In [7]:
len(tokenlist)

6692
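
The split, flatten, deduplicate and sort steps above can also be sketched in one chain with `Series.explode`. The dataframe below is a toy stand-in for `people` (its values are illustrative, not taken from the real file):

```python
import pandas as pd

# Toy stand-in for `people`; only the two columns used here.
toy = pd.DataFrame({
    "string": ["Juan Perez", "juan GOMEZ", None, "Maria"],
    "Remove": ["0", "0", "1", "0"],
})

kept = toy.loc[toy["Remove"] == "0", "string"]

# dropna() skips missing strings, split() tokenizes on whitespace,
# explode() turns each list element into its own row.
unique_tokens = sorted(kept.dropna().str.split().explode().unique())
print(unique_tokens)
```

As in the original cells, sorting is case-sensitive, so uppercase tokens come before lowercase ones.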

Note that this list keeps tokens in both upper and lower case, because case is informative here: Kinkead marked artist names in full uppercase, so we can treat uppercase tokens as true name tokens.

In [8]:
with open('PersonNameTokenList.csv','w') as file:
    for item in tokenlist:
        file.write(item + "\n")

In [9]:
print(tokenlist)

['-', '--', '--Nino', '--de', '-basco', '-gaspar', '-geronimo', '-luisa', '-ndo', '-vasco', '2-', '20', '481', 'A', 'ADRIAN', 'AFANDOR', 'AGUILA', 'AGUILAR', 'AGUSTIN', 'AIALA', 'ALA', 'ALARCON', 'ALAVA', 'ALBERTO', 'ALCANTEIES', 'ALCANTUD', 'ALDRETTE', 'ALFONSO', 'ALMANSA', 'ALMANZA', 'ALONSO', 'ALTA', 'ALTERI', 'ALVA', 'ALVAREZ', 'AMARROSTA', 'AMARROSTAO', 'AMBAR', 'AMBROSIO', 'ANDRACH', 'ANDRES', 'ANGELO', 'ANTOLINEZ', 'ANTONIO', 'APARICIO', 'ARAGON', 'ARANA', 'ARANDA', 'ARAUS', 'ARAZ', 'ARCE', 'ARCOS', 'ARENAS', 'ARENSECALLATRADA', 'ARIAS', 'ARIENZACALATRAVA', 'ARMEDIA', 'ARRENAS', 'ARROYO', 'ARTEAGA', 'ARTEAGAALFARO', 'ARTENSE', 'ARTENSECALATADO', 'ARTENSECALATRADA', 'ARTIAGA', 'ARTIGA', 'ARUAS', 'ARZE', 'ASCENCIO', 'ASENCIO', 'ASINTA', 'ATIENZA', 'ATIENZACALATRAVA', 'AVILA', 'AY', 'AYALA', 'Abaasquita', 'Abaco', 'Abadesa', 'Abaeu', 'Abalos', 'Abdina', 'Abedaria', 'Abel', 'Abellan', 'Abendaria', 'Abraham', 'Abrego', 'Abreu', 'Abril', 'Abrue', 'Abueda', 'Acacio', 'Academia', 'Acaun

Let's turn this into a dataframe with a column for "accepted" tokens, where 1 is a name token and 0 is a dubious token. Let's go ahead and assign a value of 1 to those tokens that are in uppercase.

In [10]:
tokens = pd.DataFrame(tokenlist, columns = ['token'])
tokens.loc[tokens['token'].str.isupper(),'is_name']=1

In [11]:
tokens.groupby('is_name').count()

is_name,token
1.0,572
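
Because `is_name` is only assigned where the condition holds, every other row is left as NaN, and `groupby(...).count()` leaves NaN groups out, which is why only the 1.0 row appears above. A minimal sketch on toy tokens (illustrative values) that also counts the unmarked rows:

```python
import pandas as pd

toy_tokens = pd.DataFrame({"token": ["ALONSO", "Alonso", "alonso", "20", "-"]})

# Same marking rule as above: only fully uppercase tokens get a 1.
toy_tokens.loc[toy_tokens["token"].str.isupper(), "is_name"] = 1

marked = int(toy_tokens["is_name"].notna().sum())
unmarked = int(toy_tokens["is_name"].isna().sum())

# value_counts(dropna=False) shows the NaN group alongside the 1.0 group.
print(toy_tokens["is_name"].value_counts(dropna=False))
print(marked, unmarked)
```

Tokens with no cased characters, such as `20` or `-`, are not uppercase according to `str.isupper`, so they stay unmarked.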


In [12]:
uppers = tokens.loc[tokens['is_name']==1,'token'].tolist()

In [13]:
uppers

['A',
 'ADRIAN',
 'AFANDOR',
 'AGUILA',
 'AGUILAR',
 'AGUSTIN',
 'AIALA',
 'ALA',
 'ALARCON',
 'ALAVA',
 'ALBERTO',
 'ALCANTEIES',
 'ALCANTUD',
 'ALDRETTE',
 'ALFONSO',
 'ALMANSA',
 'ALMANZA',
 'ALONSO',
 'ALTA',
 'ALTERI',
 'ALVA',
 'ALVAREZ',
 'AMARROSTA',
 'AMARROSTAO',
 'AMBAR',
 'AMBROSIO',
 'ANDRACH',
 'ANDRES',
 'ANGELO',
 'ANTOLINEZ',
 'ANTONIO',
 'APARICIO',
 'ARAGON',
 'ARANA',
 'ARANDA',
 'ARAUS',
 'ARAZ',
 'ARCE',
 'ARCOS',
 'ARENAS',
 'ARENSECALLATRADA',
 'ARIAS',
 'ARIENZACALATRAVA',
 'ARMEDIA',
 'ARRENAS',
 'ARROYO',
 'ARTEAGA',
 'ARTEAGAALFARO',
 'ARTENSE',
 'ARTENSECALATADO',
 'ARTENSECALATRADA',
 'ARTIAGA',
 'ARTIGA',
 'ARUAS',
 'ARZE',
 'ASCENCIO',
 'ASENCIO',
 'ASINTA',
 'ATIENZA',
 'ATIENZACALATRAVA',
 'AVILA',
 'AY',
 'AYALA',
 'B',
 'BAL',
 'BALDEQUIROS',
 'BALDES',
 'BALDEZ',
 'BALLESTEROS',
 'BALTASAR',
 'BALTAZAR',
 'BALTHAZAR',
 'BAN',
 'BANMOL',
 'BARAHONA',
 'BARAONA',
 'BARCO',
 'BARELA',
 'BARONA',
 'BARREDA',
 'BARRERA',
 'BARTOLOME',
 'BASQUEZ',
 'BAUTI

This has identified 572 tokens as acceptable. We can also mark their other case variants (for example, capitalized or lowercase forms of the same token) as acceptable.

In [14]:
tokens.loc[tokens['token'].str.upper().isin(uppers),'is_name']=1

In [15]:
tokens.groupby('is_name').count()

is_name,token
1.0,1253


This has identified 1,253 tokens as acceptable.

In [16]:
len(set(tokens.loc[tokens['is_name']==1,'token'].str.lower().to_list()))

572

In [17]:
strat1 = tokens.loc[tokens['is_name']==1,'token']

strat1.to_csv("Files_Cleaning2/acceptednametokens2_strat1.csv")

A look at the .CSV file reveals that the list seems OK. We allow it to include articles (la, lo, el, los) because at this stage we don't think they should be problematic. 

## 2 Creating a list of tokens not in our combined namelists

Other acceptable names may not be in the list of Kinkead tokens. We can rely on tokens that appear in our first-name and last-name lists being name tokens. (Put another way, tokens that are not in these lists give us reason to suspect that they are not names.)

In [18]:
fnames = open('DBE_Names/firstnames.txt')
lnames = open('DBE_Names/lastnames.txt')

In [19]:
names = []

for line in fnames:
    names.append(line.rstrip('\n'))
    
for line in lnames:
    names.append(line.rstrip('\n'))

In [20]:
names = set(names)
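
The loading steps above can be wrapped in a small helper that also closes the files when done. This is a sketch under assumptions: `load_names` is a hypothetical name, UTF-8 encoding is assumed, and the lowercasing and blank-line skipping are additions that the original cells do not perform. Temporary files stand in for the `DBE_Names` files.

```python
from pathlib import Path
import tempfile

def load_names(*paths):
    """Read one name per line from each file and return a lowercase set."""
    names = set()
    for path in paths:
        # The with-block closes each file even if reading fails.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                name = line.strip().lower()
                if name:  # skip blank lines
                    names.add(name)
    return names

# Demo on throwaway files (illustrative contents only).
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "firstnames.txt"
    last = Path(tmp) / "lastnames.txt"
    first.write_text("juan\nmaria\n", encoding="utf-8")
    last.write_text("perez\njuan\n\n", encoding="utf-8")
    demo_names = load_names(first, last)

print(sorted(demo_names))
```

A name that appears in both files is stored once, which is the same effect as calling `set(names)` on the combined list.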

In [21]:
names

{'grops',
 'carreras',
 'barsol',
 'bassadona',
 'oleaga',
 'belmonte',
 'marsilio',
 'diaz-bravo',
 'gaver',
 'dorlie',
 'cerecedo',
 'vizcayno',
 'mahoma',
 'bardela',
 'brochero',
 'daoiz',
 'florentino',
 'mazarredo',
 'richarte',
 'romero',
 'beruti',
 'aldazabal',
 'sustaete',
 'aguirre',
 'ablitas',
 'sagaseta',
 'soares',
 'bartolache',
 'ganuza',
 'pini',
 'estuard',
 'creswell',
 'bejes',
 'pedralvares',
 'urria',
 'alava',
 'diustegui',
 'torrecillas',
 'azcueta',
 'cadrecha',
 'crusat',
 'oreitia',
 'isidoro',
 'iturzaeta',
 'lautaro',
 'capo',
 'mayora',
 'eleno',
 'grimau',
 'ondegardo',
 'sagarvinaga',
 'uzqueta',
 'castelli',
 'lombardo',
 'melendez',
 'alas',
 'negrete-gomez',
 'costanzo',
 'o’doyle',
 'molleto',
 'omaña',
 'liedena',
 'verdugo',
 'albin',
 'aliguer',
 'cabrillo',
 'jufre',
 'beaugrant',
 'craywinckle',
 'laserna',
 'astorch',
 'urquijo',
 'azua',
 'burton',
 'belastegui',
 'cock',
 'proust',
 'villarreal',
 'gortari',
 'jaca',
 'archiga',
 'llaudes',


In [22]:
len(names)

10005

Above, we have merged our first-name and last-name files into a single set of unique names, so a name that appears in both files is counted once. Note that these are all lowercase.

Next, let's find the difference between this list and our tokens found above.

In [23]:
tokens.loc[tokens['token'].str.lower().isin(names),'is_name']=1

In [24]:
tokens.groupby('is_name').count()

is_name,token
1.0,3171


In [25]:
len(tokens)

6692

So far, we have managed to mark 3,171 tokens as reliable names from a list of 6,692. 3,521 remain unaccounted for. 
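
The comparison between entity tokens and the name lists amounts to a set difference once both sides are lowercased. A minimal sketch with toy sets (the values are illustrative, not the real lists):

```python
# Toy stand-ins for the entity tokens and the combined name set.
entity_tokens = {"Juan", "PEREZ", "muger", "20"}
name_list = {"juan", "perez", "maria"}

# Lowercase the tokens so they are comparable with the lowercase name set.
lowered_tokens = {token.lower() for token in entity_tokens}

accepted = sorted(lowered_tokens & name_list)   # tokens found in the name list
dubious = sorted(lowered_tokens - name_list)    # tokens left unaccounted for

print(accepted)
print(dubious)
```

The notebook keeps the tokens in a dataframe instead, so that the original casing and the `is_name` flag stay side by side.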

In [26]:
len(set(tokens.loc[tokens['is_name']==1, 'token'].str.lower().to_list()))

1928

Of the 3,171 accepted tokens, 1,928 are unique once lowercased.

In [24]:
dubioustokens = tokens.loc[tokens['is_name']!=1,'token'].to_list()

In [25]:
print(dubioustokens)

['-', '--', '--Nino', '--de', '-basco', '-gaspar', '-geronimo', '-luisa', '-ndo', '-vasco', '2-', '20', '481', 'Abaasquita', 'Abaco', 'Abadesa', 'Abaeu', 'Abalos', 'Abdina', 'Abedaria', 'Abendaria', 'Abrego', 'Abril', 'Abrue', 'Abueda', 'Academia', 'Acauna', 'Acencion', 'Acenfio', 'Acensio', 'Acerdian', 'Acisclos', 'Acle', 'Acolina', 'Acosser', 'Acuna', 'Adallo', 'Afanador', 'Agilon', 'Agua', 'Aguiere', 'Aguillar', 'Aguire', 'Agustino', 'Aillon', 'Alaras', 'Alarcon\x97Simon', 'Alariz', 'Alavarre', 'Albarada', 'Albarado', 'Albardo', 'Albardon', 'Albares', 'Albarez', 'Albaro', 'Albarran', 'Alber', 'Albero', 'Alberta', 'Alca', 'Alcala', 'Alcega\x97Franco', 'Alcina', 'Alcor', 'Alcosser', 'Aldape', 'Aleantara', 'Alegro', 'Aleola', 'Alexandro', 'Alfian', 'Alfonsa', 'Alhondiga', 'Alicante', 'Alina', 'Alja', 'Allendano', 'Alma', 'Almança', 'Almao', 'Almario', 'Almas', 'Almedo', 'Almonacir', 'Almonarir', 'Almoneria', 'Almonte', 'Alo', 'Alonfo', 'Alonsa', 'Alonsso', 'Alphonsa', 'Alrecon', 'Alseg

How many tags do these dubious tokens affect? Let's calculate this below.

# Number of records with a token not from our name list

In [26]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
0,0,0,0,57935,57935,6792,,,135,138,1
1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0
2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0
3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0
4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0


In [72]:
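def pattern_searcher(search_str:str, search_list:list):
    """Return 1 if any item of search_list is a token of search_str, else 0.

    search_str is lowercased before splitting; search_list items are compared
    as given, so they need to be lowercase to match.
    """
    search_strlist = search_str.lower().split()
    return int(any(item in search_strlist for item in search_list))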
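
Note that the function above lowercases the string being searched but not the search list, so a list entry such as `'Abadesa'` can never match. A minimal sketch of a variant that normalizes both sides (`has_dubious_token` is a hypothetical name, and using a set for the lookup is an assumption made for speed, not the notebook's method):

```python
def has_dubious_token(text, dubious):
    """Return 1 if any whitespace-separated token of `text` is in `dubious`.

    Both the text and the dubious tokens are lowercased before comparing.
    """
    dubious_set = {token.lower() for token in dubious}
    tokens = str(text).lower().split()
    return int(any(token in dubious_set for token in tokens))

print(has_dubious_token("Maria Muger", ["Muger"]))
print(has_dubious_token("Juan Perez", ["muger"]))
```

When applied to many rows, the lowercase set would be better built once outside the function and passed in, rather than rebuilt on every call.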

In [28]:
people['matched_str'] = people['string'].apply(lambda x: pattern_searcher(search_str=str(x), search_list=dubioustokens))

In [29]:
people[people['Remove']=='0'].groupby(['matched_str']).count()

matched_str,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
0,25162,25162,25162,25162,25162,25162,25162,25162,25162,25162,25162
1,7280,7280,7280,7280,7280,7280,7280,7280,7280,7280,7280


Within our tags that have not been flagged for removal, we have 7,280 tags that contain a potentially questionable token right now, and 25,162 that are OK.

In [30]:
len((set(people.loc[(people['Remove']=='0')&(people['matched_str']==1),'string'])))

4029

Of these 7,280 tags, 4,029 are distinct strings.

In [31]:
len((set(people.loc[(people['Remove']=='0')&(people['matched_str']==0),'string'].to_list())))

9980

Of the 25,162 tags with no dubious token, 9,980 are distinct strings.

## 3 Fuzzy Matching

Let's also use our results from fuzzy matching and mark as dubious any tokens whose best match with the tokens in the lists scores below 90%.

In [33]:
fuzzymatches = pd.read_csv('Files_Cleaning1/token_match_scores.csv')

In [34]:
fuzzymatches.set_index('token', inplace= True)

In [35]:
fuzzymatches.head()

token,Unnamed: 0,score,match
BENJUMEA,0,71,enjuta
Prevendado,1,71,revenga
Lateral,2,77,antera
MORILLO,3,100,morillo
Ramon,4,100,ramon


In [36]:
tokens90 = fuzzymatches[fuzzymatches['score']>=90].index.tolist()

In [37]:
len(tokens90)

3717
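
The scores in `token_match_scores.csv` were computed in Data Cleaning 1, and the method is not shown here. As an illustration only, the sketch below produces a comparable 0 to 100 score with the standard library's `difflib.SequenceMatcher`, which is a swapped-in technique and not necessarily the one used for the file; `best_match` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def best_match(token, candidates):
    """Return (closest candidate, score out of 100) for a lowercased token."""
    token = token.lower()
    best_name, best_score = None, -1
    for name in candidates:
        score = round(100 * SequenceMatcher(None, token, name).ratio())
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

toy_names = ["morillo", "enjuta", "ramon"]   # illustrative name list
match, score = best_match("MORILLO", toy_names)
accepted = score >= 90                         # same threshold as above

print(match, score, accepted)
```

Exact scores differ between fuzzy-matching methods, so a threshold of 90 here is not guaranteed to select the same tokens as the file does.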

In [38]:
print(tokens90)

['MORILLO', 'Ramon', 'Ruela', 'morante', 'Coello', 'VELAZQUEZ', 'Cardenas', 'TIRADO', 'BORJA', 'quebedo', 'martos', 'Quevedo', 'Teresa', 'Juanes', 'Monero', 'arze', 'Biedma', 'Durango', 'barrionuebo', 'despindola', 'trujillo', 'Ruiz', 'baltasar', 'Tenorio', 'Villens', 'IRIARTE', 'Canto', 'villa', 'Astete', 'AGUSTIN', 'Ruano', 'Trelles', 'Enrique', 'rroelas', 'Grimon', 'Rosal', 'horosco', 'ATIENZA', 'Sala', 'cuesta', 'Espinosa', 'Eraso', 'CARASCO', 'ayllon', 'MENESES', 'Sotto', 'Cipriano', 'Ribadeneira', 'Dimas', 'barrassa', '-geronimo', 'ALONSO', 'Calatayud', 'Salafranca', 'Caro', 'Arnao', 'holgado', 'Pan', 'GOMEZ', 'Morellon', 'Llanes', 'porras', 'muñoz', 'Ruis', 'Mathia', 'GARCIA', 'TORRES', 'Deza', 'siguença', 'Vermondo', 'Solarte', 'Ruesta', 'caraballo', 'Almonacid', 'ALDRETTE', 'florencia', 'Nicolas', 'deza', 'rrojas', 'Hernandez', 'roela', 'carcel', 'Matias', 'pino', 'goncalo', 'maria', 'carrasco', 'Federigui', 'rroldan', 'Torado', 'marcho', 'cuevas', 'arpe', 'ARANA', 'Izaguire',

Items that appear in this list should also be accepted as name tokens (is_name = 1).

In [39]:
tokens.loc[tokens['token'].str.lower().isin(tokens90),'is_name']=1

In [40]:
tokens.groupby('is_name').count()

is_name,token
1.0,3610
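
One detail of the cell above: the tokens are lowercased before the `isin` check, but `tokens90` keeps its original casing, so an entry such as `'MORILLO'` only counts if a lowercase `'morillo'` is also in `tokens90`. A minimal sketch of the difference on toy data (illustrative values; lowercasing both sides is an assumption about the intent, not the notebook's actual step):

```python
import pandas as pd

toy_tokens = pd.DataFrame({"token": ["MORILLO", "Ramon", "ramon", "Lateral"]})
toy_tokens90 = ["MORILLO", "Ramon"]           # mixed case, as in the file

# One-sided lowercasing: nothing matches, since the list has no lowercase entries.
one_sided = toy_tokens["token"].str.lower().isin(toy_tokens90)

# Lowercasing both sides: every case variant of an accepted token matches.
accepted_lower = {token.lower() for token in toy_tokens90}
both_sided = toy_tokens["token"].str.lower().isin(accepted_lower)

print(int(one_sided.sum()), int(both_sided.sum()))
```

Applying the two-sided version to the real data would change the counts reported in this section, so the figures here should be read as the result of the one-sided comparison.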


In [41]:
len(tokens)

6692

So far, we have managed to mark 3,610 tokens as reliable names from a list of 6,692. 3,082 remain unaccounted for. 

Once lowercased, the remaining tokens amount to 2,862 unique strings.

In [42]:
dubioustokens = tokens.loc[tokens['is_name']!=1,'token'].str.lower().to_list()

In [43]:
len(set(dubioustokens))

2862

In [44]:
print(dubioustokens)

['-', '--', '--nino', '--de', '-ndo', '2-', '20', '481', 'abaasquita', 'abaco', 'abadesa', 'abaeu', 'abalos', 'abdina', 'abedaria', 'abendaria', 'abrego', 'abril', 'abrue', 'abueda', 'academia', 'acauna', 'acencion', 'acenfio', 'acerdian', 'acisclos', 'acle', 'acolina', 'acosser', 'acuna', 'adallo', 'afanador', 'agilon', 'agua', 'aguiere', 'aguillar', 'aguire', 'agustino', 'aillon', 'alaras', 'alarcon\x97simon', 'alariz', 'alavarre', 'albarada', 'albarado', 'albardo', 'albardon', 'albares', 'albarez', 'albarran', 'alber', 'albero', 'alberta', 'alca', 'alcala', 'alcega\x97franco', 'alcina', 'alcor', 'alcosser', 'aldape', 'aleantara', 'alegro', 'aleola', 'alexandro', 'alfian', 'alfonsa', 'alhondiga', 'alicante', 'alina', 'alja', 'allendano', 'alma', 'almao', 'almario', 'almas', 'almedo', 'almonacir', 'almonarir', 'almoneria', 'almonte', 'alo', 'alonfo', 'alonsa', 'alphonsa', 'alrecon', 'alsega', 'alsicolla', 'alsola', 'altamirrano', 'altar', 'altas', 'altopica', 'alua', 'alvan', 'alvarez

How many tags do these dubious tokens affect? Let's calculate this below.

# Number of records with a token not from our name list

In [45]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove,matched_str
0,0,0,0,57935,57935,6792,,,135,138,1,0
1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0,1
2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0,1
3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0,1
4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0,1


In [46]:
people['matched_str'] = people['string'].apply(lambda x: pattern_searcher(search_str=str(x), search_list=dubioustokens))

In [47]:
people[people['Remove']=='0'].groupby(['matched_str']).count()

matched_str,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
0,24717,24717,24717,24717,24717,24717,24717,24717,24717,24717,24717
1,7725,7725,7725,7725,7725,7725,7725,7725,7725,7725,7725


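Within our tags that have not been flagged for removal, we have 7,725 tags that contain a potentially questionable token right now, and 24,717 that are OK. This is higher than the 7,280 found in the first pass even though the dubious list is now shorter, because this pass lowercases the dubious tokens before matching, whereas the first pass compared lowercased strings against tokens in their original case.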

In [48]:
len((set(people.loc[(people['Remove']=='0')&(people['matched_str']==1),'string'])))

4720

Of these 7,725 tags, 4,720 are distinct strings.

In [49]:
len((set(people.loc[(people['Remove']=='0')&(people['matched_str']==0),'string'].to_list())))

9289

Of the 24,717 tags with no dubious token, 9,289 are distinct strings.

We could potentially lower the fuzzy matching threshold to reduce these 4,720 distinct strings further. But seeing that we only have 3,082 dubious tokens to go through, let's go through these manually and mark the ones we know to be tokens referring to names.

First, let's save these dubious tokens to a file. 

In [51]:
strat3 = tokens.loc[tokens['is_name']!=1,'token']
strat3.to_csv("Files_Cleaning2/dubiousnametokens2_strat3.csv")

## 4 Manual Check of Dubious Tokens

In the file above, we went ahead and marked any tokens that were unequivocally last names with is_name=1. We also marked tokens that were clearly erroneous, such as verbs and numbers, with is_name=0. We removed titles and descriptors such as "escribanos" or "fiadores", except where these descriptors might form part of the name ("Arcediano de Niebla").

In [52]:
dubious_handchecked = pd.read_excel("Files_Cleaning2/dubiousnametokens23_manualcheck.xlsx")

In [54]:
dubious_handchecked.groupby('is_name').count()

is_name,tokenlower
0.0,185
1.0,983


In [55]:
len(dubious_handchecked)

2857

Of 2,857 tokens, we classified 983 as correct and 185 as wrong; the remaining 1,689 are still undetermined.
We will take strings marked as 1 to be names; strings marked as 0 or left blank need to be edited.

In [56]:
dubiousfinal = dubious_handchecked.loc[dubious_handchecked['is_name']!=1,'tokenlower'].to_list()

In [57]:
len(dubiousfinal)

1874

In [59]:
people['matched_str'] = people['string'].apply(lambda x: pattern_searcher(search_str=str(x), search_list=dubiousfinal))

In [60]:
people[people['Remove']=='0'].groupby(['matched_str']).count()

matched_str,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove
0,28768,28768,28768,28768,28768,28768,28768,28768,28768,28768,28768
1,3674,3674,3674,3674,3674,3674,3674,3674,3674,3674,3674


This leaves 3,674 tags whose spans need to be double checked.

## 5 Checking dubious tags manually - marking whether edits are needed

In this next step, we will create an `edit_long` column to mark the tags whose spans need editing (1) and those that do not (0).

In [61]:
people['edit_long']=''

In [83]:
people.loc[people['matched_str']==0,'edit_long']='0'

In [63]:
from IPython.display import clear_output
import spacy
from spacy import displacy

In [70]:
import pickle

file = open("../Text Mining (NER)/Trained_EMS2_NER_data.p", 'rb')
docs = pickle.load(file)

In [159]:
count = 0
for (index, row) in people.iterrows():

    if (row['Remove']=='0') and (row['matched_str']==1) and (row['edit_long']==''):
        count=count+1
        print(count)
        print(row['string'])
        
        docid = row['docid']
        strstart = int(row['start'])
        fetchend = strstart + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:      
                displacy.render(doc[0][strstart:fetchend],style='ent',jupyter=True)
                people.loc[index, 'edit_long']=input('Edit?')
                
                if people.loc[index, 'edit_long']=='X':
                    for doc in docs:
                        if doc[1]['id'] == docid:
                            displacy.render(doc[0],style='ent',jupyter=True)
                            people.loc[index,'edit_long']=input('Edit?')
                clear_output(wait=True)

1
maria muger


Edit? X


Edit? 1
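
The review loop above scans the whole `docs` list for every row. A minimal sketch of an alternative that builds an id lookup once, assuming (as the loop does) that `docs` is a list of `(doc, metadata)` pairs whose metadata carries an `'id'`; plain token lists stand in for spaCy docs, and `context_window` is a hypothetical helper:

```python
# Toy stand-in for `docs`: (tokens, metadata) pairs with illustrative ids.
toy_docs = [
    (["pedro", "de", "rodriguez"], {"id": 4978}),
    (["juan", "gomez", "brito", "escribano"], {"id": 4961}),
]

# Build the lookup once; if an id were repeated, the last pair would win.
doc_by_id = {meta["id"]: doc for doc, meta in toy_docs}

def context_window(docid, start, width=10):
    """Return up to `width` tokens of the document, beginning at `start`."""
    return doc_by_id[docid][start:start + width]

window = context_window(4961, 1, width=2)
print(window)
```

With real spaCy docs the same slice returns a span, which can then be passed to `displacy.render` as in the loop.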


In [160]:
people[people['Remove']=='0'].groupby(['edit_long']).count()

edit_long,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove,matched_str
0,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756
1,644,644,644,644,644,644,644,644,644,644,644,644
L,2,2,2,2,2,2,2,2,2,2,2,2
O,1,1,1,1,1,1,1,1,1,1,1,1
R,39,39,39,39,39,39,39,39,39,39,39,39


In [None]:
# there may be tokens in the name list that are suspect?

So now I have marked the strings that would require editing to remove excess tokens with a 1. 

The codes L, O and R refer to removing (R) or reclassifying (L, O) tags, so let's transfer these to the 'Remove' column.

In [164]:
people.loc[people['edit_long']=='R','Remove']= '1'

In [167]:
people.loc[people['edit_long']=='L','Remove']= 'L'
people.loc[people['edit_long']=='O','Remove']= 'O'

And let's clean up so that the edit_long category is blank for any items slated for removal or recategorization.

In [169]:
people.loc[people['Remove']=='1','edit_long']= ''
people.loc[people['Remove']=='D','edit_long']= ''
people.loc[people['Remove']=='L','edit_long']= ''
people.loc[people['Remove']=='O','edit_long']= ''
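
The four assignments above can also be sketched as a single `isin` selection. The dataframe below is a toy stand-in (illustrative values); like the original cell, the sketch leaves rows with `Remove == 'M'` untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "Remove": ["0", "1", "D", "L", "O", "M"],
    "edit_long": ["1", "R", "0", "L", "O", "0"],
})

# Same codes as the four separate assignments above.
codes_to_blank = ["1", "D", "L", "O"]
toy.loc[toy["Remove"].isin(codes_to_blank), "edit_long"] = ""

print(toy)
```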

In [170]:
people.groupby(['Remove','edit_long']).count()

Remove,edit_long,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,matched_str
0,0.0,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756
0,1.0,644,644,644,644,644,644,644,644,644,644,644
1,,600,600,600,600,600,600,597,39,600,600,600
D,,6,6,6,6,6,6,6,6,6,6,6
L,,41,41,41,41,41,41,41,41,41,41,41
M,,6,6,6,6,6,6,6,6,6,6,6
O,,67,67,67,67,67,67,67,67,67,67,67


As we can see, 31,756 tags are accepted. 644 must be edited for length. 600 must be removed, and 120 must be recategorized.

In [171]:
people.to_csv("Files_Cleaning2/PER_tags_clean2.csv")

In [172]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,docid,string,label,start,end,Remove,matched_str,edit_long
0,0,0,0,57935,57935,6792,,,135,138,1,0,
1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0,0,0.0
2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0,0,0.0
3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0,0,0.0
4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0,0,0.0


# Apply these changes to the Spans in Our Dataframe

The first thing to do is to change the spans in our people dataframe. 

We then have to apply these changes directly to the pickled spacy output, in order to find more information later (such as person descriptors). We can also use it later to update the strings in our people dataframe.

In [89]:
# importing pandas and our previous work in a new session
import pandas as pd


from IPython.display import clear_output
import spacy
from spacy import displacy

import pickle



In [44]:
people= pd.read_csv('Files_Cleaning2/PER_tags_clean2.csv')

In [90]:
file = open("../Text Mining (NER)/Trained_EMS2_NER_data.p", 'rb')
docs = pickle.load(file)

1. Changing spans and strings in our people dataframe.

First, let's add new columns so we can modify the spans without risk of mistakes.

In [45]:
people['newstart']=''
people['newend']=''

To reduce the chance of mistakes, we are going to loop twice: first editing the starts of spans, then, separately, the ends.

In [60]:
count = 0
for (index, row) in people.iterrows():

    if row['edit_long']=='1':
        count=count+1
        print(count)
        print(row['string'],row['start'],row['end'])
        
        docid = row['docid']
        strstart = int(row['start'])
        fetchend = strstart + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:      
                displacy.render(doc[0][strstart:fetchend],style='ent',jupyter=True)
                a=input('New start?')
                d= input('Need to fix edit_long?')           
                # Note: for 'y', the 'D' set here is overwritten with 'y' by the
                # else branch below, because that else belongs to the 'e' check.
                if d =='y':
                    people.loc[index,'edit_long']='D'
                    if a !='':
                        people.loc[index, 'newstart']=a
                if d=='e':
                    people.loc[index,'edit_long']='e'
                    if a !='':
                        people.loc[index, 'newstart']=a
                else:
                    people.loc[index,'edit_long']=d
                    
                clear_output(wait=True)

5
soror beatriz jesus priora 15 20


New start? 16
Need to fix edit_long? y


y = go to edit long
e = end (done with edits, does not go to edit long)
o = other issue (does not go to edit long, should also double check)
R = remove (must double check, does not go to edit long)

Two kinds of issue are grouped under 'o': incorrect tokenization (the extra word cannot be removed) and multiple persons in a single tag.
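
The three review loops in this section store answers in the same way, so the bookkeeping can be sketched as one helper. `record_decision` is a hypothetical name, and the sketch mirrors what the start-editing cell ends up storing as written (the typed code is kept in `edit_long`, and a new value is saved only for 'y' and 'e'); it is not a statement of the intended design. The toy rows are illustrative.

```python
import pandas as pd

def record_decision(frame, index, column, new_value, decision):
    """Store a reviewer's answer for one row (hypothetical helper).

    A non-empty new span boundary is saved only for 'y' and 'e';
    the decision code itself is always written to edit_long.
    """
    if decision in ("y", "e") and new_value != "":
        frame.loc[index, column] = new_value
    frame.loc[index, "edit_long"] = decision

toy = pd.DataFrame({
    "string": ["soror beatriz jesus priora", "maria muger"],
    "edit_long": ["1", "1"],
    "newstart": ["", ""],
})

record_decision(toy, 0, "newstart", "16", "y")   # new start, continue to end edit
record_decision(toy, 1, "newstart", "", "o")     # other issue, nothing stored

print(toy[["edit_long", "newstart"]])
```

The same helper would serve the end-editing loops by passing `"newend"` as the column.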

In [61]:
people.groupby(['Remove','edit_long']).count()

Remove,edit_long,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,docid,string,label,start,end,matched_str,newstart,newend
0,0.0,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756
0,0,11,11,11,11,11,11,11,11,11,11,11,11,11,11
0,D,150,150,150,150,150,150,150,150,150,150,150,150,150,150
0,R,11,11,11,11,11,11,11,11,11,11,11,11,11,11
0,e,79,79,79,79,79,79,79,79,79,79,79,79,79,79
0,o,153,153,153,153,153,153,153,153,153,153,153,153,153,153
0,y,240,240,240,240,240,240,240,240,240,240,240,240,240,240


In [72]:
count = 0
for (index, row) in people.iterrows():

    if row['edit_long']=='y':
        count=count+1
        print(count)
        print(row['string'],row['start'],row['end'])
        
        docid = row['docid']
        strstart = int(row['start'])
        fetchend = strstart + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:      
                displacy.render(doc[0][strstart:fetchend],style='ent',jupyter=True)
                a=input('New end?')
                d= input('Need to fix edit_long?')           
                if d =='y':
                    people.loc[index,'edit_long']='D'
                    if a !='':
                        people.loc[index, 'newend']=a
                if d=='e':
                    people.loc[index,'edit_long']='e'
                    if a !='':
                        people.loc[index, 'newend']=a
                else:
                    people.loc[index,'edit_long']=d
                    
                clear_output(wait=True)

139
 Francisco Saavedra Escribano 1821 1826


New end? 1825
Need to fix edit_long? e


In [73]:
people.groupby(['Remove','edit_long']).count()

Remove,edit_long,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,docid,string,label,start,end,matched_str,newstart,newend
0,0.0,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756
0,0,15,15,15,15,15,15,15,15,15,15,15,15,15,15
0,R,11,11,11,11,11,11,11,11,11,11,11,11,11,11
0,e,465,465,465,465,465,465,465,465,465,465,465,465,465,465
0,o,153,153,153,153,153,153,153,153,153,153,153,153,153,153


We now go over the full text of the tags we marked R to re-check them. Those that truly must be removed have been marked 'r'; the others have been edited and categorized.

In [75]:
count = 0
for (index, row) in people.iterrows():

    if row['edit_long']=='R':
        count=count+1
        print(count)
        print(row['string'],row['start'],row['end'])
        
        docid = row['docid']
        strstart = int(row['start'])
        fetchend = strstart + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:      
                displacy.render(doc[0],style='ent',jupyter=True)
                a=input('New end?')
                d= input('Need to fix edit_long?')           
                if d =='y':
                    people.loc[index,'edit_long']='D'
                    if a !='':
                        people.loc[index, 'newend']=a
                if d=='e':
                    people.loc[index,'edit_long']='e'
                    if a !='':
                        people.loc[index, 'newend']=a
                else:
                    people.loc[index,'edit_long']=d
                    
                clear_output(wait=True)

11
miguel fechas barro 75 79


New end? 76
Need to fix edit_long? e


In [96]:
people.groupby(['Remove','edit_long']).count()

Remove,edit_long,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,docid,string,label,start,end,matched_str,newstart,newend
0,0.0,31772,31772,31772,31772,31772,31772,31772,31772,31772,31772,31772,31772,31772,31772
0,e,470,470,470,470,470,470,470,470,470,470,470,470,470,470
0,o,154,154,154,154,154,154,154,154,154,154,154,154,154,154
1,,4,4,4,4,4,4,4,4,4,4,4,4,4,4


In [95]:
people.loc[people['edit_long']=='0','edit_long']=0

In [85]:
people.loc[people['edit_long']=='r','Remove']='1'

In [92]:
people.loc[people['edit_long']=='r','edit_long']= None

In [97]:
people.groupby(['Remove',people['edit_long'].isna()]).count()

Remove,edit_long,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,docid,string,label,start,end,matched_str,edit_long,newstart,newend
0,False,32396,32396,32396,32396,32396,32396,32396,32396,32396,32396,32396,32396,32396,32396,32396
1,False,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
1,True,600,600,600,600,600,600,600,597,39,600,600,600,0,600,600
D,True,6,6,6,6,6,6,6,6,6,6,6,6,0,6,6
L,True,41,41,41,41,41,41,41,41,41,41,41,41,0,41,41
M,True,6,6,6,6,6,6,6,6,6,6,6,6,0,6,6
O,True,67,67,67,67,67,67,67,67,67,67,67,67,0,67,67


In [98]:
people.to_csv("Files_Cleaning2/PER_tags_clean2_manual.csv")
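The run of `Unnamed: 0`, `Unnamed: 0.1`, ... columns in the tables above grows by one with every save and reload, because `to_csv` writes the index as an unnamed column by default and `read_csv` then reads it back as data. A minimal sketch of one way to avoid this, assuming the index carries no information worth keeping; the toy frame and the in-memory buffers are stand-ins for the real files.

```python
import io
import pandas as pd

# Toy stand-in for the tags table
df = pd.DataFrame({"docid": [4978, 4961], "string": ["- Rodriguez", "-- Gomez Brito"]})

# Default behaviour: the index is written out and comes back as "Unnamed: 0"
buf_default = io.StringIO()
df.to_csv(buf_default)
buf_default.seek(0)
reloaded_default = pd.read_csv(buf_default)

# With index=False the column set survives the round trip unchanged
buf_clean = io.StringIO()
df.to_csv(buf_clean, index=False)
buf_clean.seek(0)
reloaded_clean = pd.read_csv(buf_clean)

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```

Switching the existing files over would change the column layout shown in the later outputs, so this is probably best adopted at the start of a cleaning stage rather than midway.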

`e` marks the spans that were edited and corrected.

`o` marks the spans that have a different problem and have to be corrected in another way. There are two distinct problems:
    1. Two names are marked as one entity.
    2. The span is too long because of faulty tokenization (i.e., the name has absorbed another word because that word was not separated from the rest of the name).
    
The second case does not seem too problematic; we can accept these spans and edit them as needed.
The first case...


# Update: Also Revising some Dubious Words in the Namelist

Before proceeding, I would like to double-check whether some tokens that we accepted because they are on the name lists might nonetheless be wrong. I will import our lists of names (first names, last names, and name tokens identified by Kinkead) and find all the entity tokens that match an item in the name lists. Of these, I will manually mark those that could be problematic and review them as before.

In [40]:
fnames = open('DBE_Names/firstnames.txt')
lnames = open('DBE_Names/lastnames.txt')
knames = pd.read_csv("Files_Cleaning2/acceptednametokens2_strat1.csv")
knames= knames['token'].str.lower().tolist()

In [41]:
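names = []

# First and last names: one name per line in each file
for line in fnames:
    names.append(line.rstrip('\n'))
    
for line in lnames:
    names.append(line.rstrip('\n'))
    
# Kinkead tokens are already a lowercased Python list
for line in knames:
    names.append(line.rstrip('\n'))

# Release the file handles now that both files have been read
fnames.close()
lnames.close()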

In [44]:
names = set(names)

In [45]:
len(names)

10246

Just to check, this is a list of 10,246 tokens, which coincides with what we know of the name lists.

In [46]:
tokensintags = pd.read_csv('PersonNameTokenList.csv')

In [47]:
tokensintags

Unnamed: 0,-
0,--
1,--Nino
2,--de
3,-basco
4,-gaspar
...,...
6686,çuleta
6687,çumaraga
6688,çumarraga
6689,çurbaran


In [48]:
tokensintags = tokensintags['-'].tolist()

In [51]:
in_namelist = []

for line in tokensintags:
    if line.lower() in names:
        in_namelist.append(line.lower())

In [59]:
in_namelist= list(set(in_namelist))

In [60]:
len(in_namelist)

1928

So of the tokens present in our identified entities, 1,928 of them (lowercase) appear in our name lists, which coincides with our work above. This is a manageable number to revise by hand in a spreadsheet.
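The membership loop above can also be expressed with set operations, which yields the Goal 1 difference (tokens in entities but not in the name lists) at the same time. A minimal sketch on toy data; the values below are illustrative only and `not_in_namelist` is a name introduced here.

```python
# Toy stand-ins: `names` is the lowercase name set, `tokensintags` the raw entity tokens
names = {"rodriguez", "gomez", "brito", "de"}
tokensintags = ["--", "Rodriguez", "GOMEZ", "gomez", "çuleta"]

# Lowercase every entity token and deduplicate
entity_tokens = {tok.lower() for tok in tokensintags}

# Tokens that also appear in the name lists
in_namelist = sorted(entity_tokens & names)

# Tokens that appear in entities but in none of the name lists
not_in_namelist = sorted(entity_tokens - names)

print(in_namelist)
print(not_in_namelist)
```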

Let's output it to a file:

In [61]:
with open('Files_Cleaning2/EntityTokensinKFLNamelists.csv','w') as file:
    for line in in_namelist:
        file.write(line + '\n')

Below, we import the file that resulted from this manual check. Tokens with a `check` value of 1 should be double-checked manually. We flagged tokens that refer to geography, have a second meaning, or have no reasonably close form that we thought could be a name.

In [66]:
dubiousnametokens = pd.read_excel('Files_Cleaning2/EntityTokensinKFLNamelists_manual_check.xlsx')

In [67]:
dubiousnametokens

Unnamed: 0,token,check
0,guejar,
1,belmonte,
2,gatica,
3,moyna,
4,r,1.0
...,...,...
1923,solana,
1924,patiño,
1925,pradillo,
1926,montoro,


In [76]:
dubiousnametokens = dubiousnametokens.loc[dubiousnametokens['check']==1,'token'].to_list()

Using the previously defined `pattern_searcher` function, we can count how many entities we will have to review.
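`pattern_searcher` is defined earlier in the notebook and is not reproduced in this section. Purely as an illustration of the kind of check involved, the hypothetical `flag_dubious` below returns 1 when any whitespace-separated token of a string is in the list and 0 otherwise. The real function may work differently (for example, with a regular expression), and the sample tokens are made up.

```python
def flag_dubious(search_str, search_list):
    """Hypothetical stand-in for pattern_searcher: return 1 if any token of
    search_str (lowercased, split on whitespace) is in search_list, else 0."""
    dubious = {tok.lower() for tok in search_list}
    tokens = str(search_str).lower().split()
    return int(any(tok in dubious for tok in tokens))


# Illustrative list of dubious tokens (not the real one)
sample_dubious = ["angeles", "r"]
print(flag_dubious("fray Antonio los angeles", sample_dubious))
print(flag_dubious("Rodriguez", sample_dubious))
```

Matching on whole tokens rather than substrings matters here: a one-letter entry such as `r` would otherwise match almost every name.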

In [69]:
people = pd.read_csv('Files_Cleaning2/PER_tags_clean2_manual.csv')

In [70]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,docid,string,label,start,end,Remove,matched_str,edit_long,newstart,newend
0,0,0,0,0,0,57935,57935,6792,,,135,138,1,0,,,
1,1,1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0,0,0.0,,
2,2,2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0,0,0.0,,
3,3,3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0,0,0.0,,
4,4,4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0,0,0.0,,


In [71]:
people['matched_str'] = ''

In [77]:
people['matched_str'] = people['string'].apply(lambda x: pattern_searcher(search_str=str(x), search_list=dubiousnametokens))

In [103]:
people = people.fillna('')

Next, all the matches that occur in strings written entirely in capitals can be ignored, because Kinkead already identified them as correct.

In [107]:
people.loc[people['string'].str.isupper(), 'matched_str']=0

In [108]:
people[(people['Remove']=='0')].groupby(['matched_str','edit_long']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,docid,string,label,start,end,Remove,newstart,newend
matched_str,edit_long,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
0,0,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37
0,0.0,29858,29858,29858,29858,29858,29858,29858,29858,29858,29858,29858,29858,29858,29858,29858
0,e,425,425,425,425,425,425,425,425,425,425,425,425,425,425,425
0,o,138,138,138,138,138,138,138,138,138,138,138,138,138,138,138
1,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
1,0.0,1875,1875,1875,1875,1875,1875,1875,1875,1875,1875,1875,1875,1875,1875,1875
1,e,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45
1,o,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16


Labels tagged `e` were already fixed. Labels tagged `o` have no easy solution and were also checked. We have to go over the 1,875 tags that matched a dubious string and show up as `0.0`.

As a first step, let's see which ones we actually need to edit. We will set the `edit_long` column to 1 if the span needs to be changed, 0 if it is OK, and L, O, D, M, or R if the tag needs to be recategorized or the entity removed.
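Because these codes are typed by hand, a stray entry can slip in (a `0A` value is cleaned up later in this notebook). A small sketch of a prompt that keeps asking until it receives an allowed code. `ask_code` and `ALLOWED_CODES` are hypothetical names, and the set of codes is assumed from the scheme just described plus the `X` reply that the review loop uses to request the full document.

```python
# Assumed from the marking scheme: 0/1, recategorize (L, O, D, M), remove (R), more context (X)
ALLOWED_CODES = {"0", "1", "L", "O", "D", "M", "R", "X"}


def ask_code(prompt, allowed=ALLOWED_CODES, read=input):
    """Keep asking until the reply (stripped of whitespace) is an allowed code."""
    while True:
        reply = read(prompt).strip()
        if reply in allowed:
            return reply
        print(f"Unrecognized code {reply!r}; expected one of {sorted(allowed)}")
```

In the review loop, `input('Edit?')` could then be swapped for `ask_code('Edit?')`; the `read` parameter exists only so the function can be exercised without a keyboard.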

In [124]:
count = 0
for (index, row) in people.iterrows():
    
    if row['string'].isupper():
        people.loc[index, 'edit_long']='0'

    elif (row['Remove']=='0') and (row['matched_str']==1) and (row['edit_long']=='0.0'):
        count=count+1
        print(count)
        print(row['string'])
        
        docid = row['docid']
        strstart = int(row['start'])
        fetchend = strstart + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:      
                displacy.render(doc[0][strstart:fetchend],style='ent',jupyter=True)
                people.loc[index, 'edit_long']=input('Edit?')
                
                if people.loc[index, 'edit_long']=='X':
                    # 'X' asks for more context: show the whole document, then ask again
                    displacy.render(doc[0],style='ent',jupyter=True)
                    people.loc[index,'edit_long']=input('Edit?')
                clear_output(wait=True)

1
fray Antonio los angeles


Edit? 1


In [142]:
people[(people['Remove']=='0')].groupby(['edit_long']).count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,docid,string,label,start,end,Remove,matched_str,newstart,newend
edit_long,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
0,7718,7718,7718,7718,7718,7718,7718,7718,7718,7718,7718,7718,7718,7718,7718,7718
0.0,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862
1,146,146,146,146,146,146,146,146,146,146,146,146,146,146,146,146
L,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
O,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8
R,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35
e,470,470,470,470,470,470,470,470,470,470,470,470,470,470,470,470
o,154,154,154,154,154,154,154,154,154,154,154,154,154,154,154,154


As we can see, 146 spans need to be modified, and a few recategorized or removed.
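The review loops in this notebook scan every entry of `docs` to find the one whose `id` matches the current row. Assuming, as those loops imply, that `docs` is a list of `(doc, metadata)` pairs with one unique `id` per document, a dictionary built once makes each lookup direct. In this sketch plain token lists stand in for spaCy `Doc` objects, the ids are made up, and `doc_by_id` and `context_window` are hypothetical names.

```python
# Toy stand-in for `docs`: (tokens, metadata) pairs; real entries would hold spaCy Docs
docs = [
    (["fray", "Antonio", "los", "angeles"], {"id": 6792}),
    (["paulo", "quinto"], {"id": 4978}),
]

# Build the lookup once, before iterating over the tags
doc_by_id = {meta["id"]: doc for doc, meta in docs}


def context_window(docid, start, width=10):
    """Return up to `width` tokens starting at `start`, or None if the id is unknown."""
    doc = doc_by_id.get(docid)
    return None if doc is None else doc[start:start + width]


print(context_window(4978, 0))
```

Slicing a spaCy `Doc` returns a `Span`, which `displacy.render` accepts, so the same window could be passed on for display.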

Next, let's go over the 146 spans to be modified as we did before, and fill in the `newstart` and `newend` values.

In [147]:
count = 0
for (index, row) in people.iterrows():

    if row['edit_long']=='1':
        count=count+1
        print(count)
        print(row['string'],row['start'],row['end'])
        
        docid = row['docid']
        strstart = int(row['start'])
        fetchend = strstart + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:      
                displacy.render(doc[0][strstart:fetchend],style='ent',jupyter=True)
                
                a=input('New start?')
                b=input('New end?')
                d= input('Edit_long?')           
                if d =='e':
                    people.loc[index,'edit_long']='e'
                    if a !='':
                        people.loc[index, 'newstart']=a
                    if b !='':
                        people.loc[index,'newend']=b
                else:
                    people.loc[index,'edit_long']=d
                    
                clear_output(wait=True)

36
paulo quinto 71 73


New start? 
New end? 
Edit_long? R


In [152]:
count = 0
for (index, row) in people.iterrows():

    if 'fernandez carmona' in row['string']:
        count=count+1
        print(count)
        print(row['string'],row['start'],row['end'])
        
        docid = row['docid']
        strstart = int(row['start'])
        fetchend = strstart + 10
        
        for doc in docs:
            if doc[1]['id'] == docid:      
                displacy.render(doc[0],style='ent',jupyter=True)
                
                a=input('New start?')
                b=input('New end?')
                d= input('Edit_long?')           
                if d =='e':
                    people.loc[index,'edit_long']='e'
                    if a !='':
                        people.loc[index, 'newstart']=a
                    if b !='':
                        people.loc[index,'newend']=b
                else:
                    people.loc[index,'edit_long']=d
                    
                clear_output(wait=True)

1
fernandez carmona 34 37


New start? 33
New end? 
Edit_long? e


In [153]:
people[(people['Remove']=='0')].groupby(['edit_long']).count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,docid,string,label,start,end,Remove,matched_str,newstart,newend
edit_long,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
0,7720,7720,7720,7720,7720,7720,7720,7720,7720,7720,7720,7720,7720,7720,7720,7720
0.0,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862,23862
L,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
O,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8
R,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35
e,610,610,610,610,610,610,610,610,610,610,610,610,610,610,610,610
o,158,158,158,158,158,158,158,158,158,158,158,158,158,158,158,158


To tidy the `edit_long` codes and transfer the relevant results to the `Remove` column:

In [154]:
people.loc[people['edit_long']=='0','edit_long']=0

In [168]:
people.loc[people['edit_long']=='R','Remove']='1'
people.loc[people['edit_long']=='L','Remove']= 'L'
people.loc[people['edit_long']=='O','Remove']= 'O'

In [182]:
people.loc[people['edit_long']=='R','edit_long']= None
people.loc[people['edit_long']=='L','edit_long']= None
people.loc[people['edit_long']=='O','edit_long']= None
people.loc[people['edit_long']=='e','edit_long']= '1'
people.loc[people['edit_long']=='0.0','edit_long']= '0'
people.loc[people['edit_long']==0,'edit_long']= '0'

In [184]:
people.loc[(people['edit_long']=='0A'),'edit_long']='0'
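The cleanup assignments above could also be collected into two lookup tables, which keeps the whole recoding scheme visible in one place. A sketch on a toy frame; `REMOVE_FROM_CODE` and `EDIT_LONG_CLEAN` are hypothetical names, and the mappings are meant to mirror the assignments above rather than replace them.

```python
import pandas as pd

# Toy stand-in for the tags table after manual review
toy = pd.DataFrame({
    "Remove":    ["0", "0", "0", "0", "0", "0"],
    "edit_long": ["0.0", "0A", "e", "R", "L", "o"],
})

# Codes that move a tag out of the accepted set, and the Remove value they imply
REMOVE_FROM_CODE = {"R": "1", "L": "L", "O": "O"}

# Final spelling of each edit_long code; None clears the cell
EDIT_LONG_CLEAN = {
    "0.0": "0", "0A": "0", "0": "0", 0: "0",
    "e": "1", "o": "o",
    "R": None, "L": None, "O": None,
}

# Transfer first, while the original codes are still in edit_long
moved = toy["edit_long"].isin(list(REMOVE_FROM_CODE))
toy.loc[moved, "Remove"] = toy.loc[moved, "edit_long"].map(REMOVE_FROM_CODE)

# Then normalize edit_long in a single pass
toy["edit_long"] = toy["edit_long"].map(EDIT_LONG_CLEAN)

print(toy)
```

One caveat of `Series.map` with a dictionary: any code missing from the dictionary becomes NaN, so an unexpected value shows up as a blank instead of passing through unchanged.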

In [188]:
people.groupby(['Remove']).count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,docid,string,label,start,end,matched_str,edit_long,newstart,newend
Remove,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
0,32350,32350,32350,32350,32350,32350,32350,32350,32350,32350,32350,32350,32350,32350,32350,32350
1,639,639,639,639,639,639,639,639,639,639,639,639,639,604,639,639
D,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6
L,44,44,44,44,44,44,44,44,44,44,44,44,44,41,44,44
M,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6
O,75,75,75,75,75,75,75,75,75,75,75,75,75,67,75,75


In [189]:
people.groupby('edit_long').count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,docid,string,label,start,end,Remove,matched_str,newstart,newend
edit_long,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
,724,724,724,724,724,724,724,724,724,724,724,724,724,724,724,724
0,31582,31582,31582,31582,31582,31582,31582,31582,31582,31582,31582,31582,31582,31582,31582,31582
1,610,610,610,610,610,610,610,610,610,610,610,610,610,610,610,610
o,158,158,158,158,158,158,158,158,158,158,158,158,158,158,158,158


With these numbers tidied up, we can see that 639 tags are slated for removal, others for re-categorization (6 dates, 44 locations, 6 monetary, 75 organizations). Of the accepted tags, we edited 610 spans and found 158 with problems that we have not yet been able to resolve (more than one entity in a tag and tokenization issues). 31,582 tags have been accepted without modification.
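For the edited spans, a blank reply in the review loop left `newstart` or `newend` empty, which this sketch treats as "keep the original boundary". That reading is an assumption drawn from the loop, and `final_start` and `final_end` are hypothetical column names. The first two toy rows reuse the boundaries shown in the review outputs above.

```python
import pandas as pd

toy = pd.DataFrame({
    "start":    [34, 75, 6],
    "end":      [37, 79, 8],
    "newstart": ["33", "", ""],   # typed replies arrive as strings; "" means unchanged
    "newend":   ["", "76", ""],
})

for old, new, out in [("start", "newstart", "final_start"),
                      ("end", "newend", "final_end")]:
    # Blank or non-numeric entries become NaN, then fall back to the original boundary
    edited = pd.to_numeric(toy[new], errors="coerce")
    toy[out] = edited.fillna(toy[old]).astype(int)

print(toy[["final_start", "final_end"]])
```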

Note: I was going to make a distinction between spans we checked manually (c.7700) and those that were detected automatically, but I accidentally wrote over the information and will not be re-checking.

In [186]:
people.to_csv("Files_Cleaning2/PER_tags_clean2_manual2.csv")

In [None]:
import pickle

# The import has to run in the same session as the dump; otherwise `pickle` is undefined
with open("NER_data_Cleaned22.p", "wb") as f:
    pickle.dump(docs, f)