# Data Cleaning 1. Removing False Matches (Person Tags)

Data cleaning will comprise four phases:
1. Removing false matches (tags that do not refer to an entity at all)
1. Correcting miscategorized labels (tags that assign the wrong entity category, e.g. "PER" to a location)
1. Correcting truncated or extended spans (tags which are correct but whose extent is wrong, incorporating irrelevant tokens or ignoring relevant ones).
1. Identifying untagged entities

Perfect tagging might be impossible, but we can improve model results significantly by taking these measures.
This is an iterative process that will improve results as we go through the phases.

## Strategy 1. Capitalized names
    Duncan Kinkead set all artist names in capitals in his documents, so any tag that consists of an all-caps string is a true name. 
    Any strings that are the lowercase equivalent of those capitalized names are also person names.
    
## Strategy 2. First and last name lists
    We have developed lists of first and last names from 1400-1800 from a historical biographical dictionary. Using these, we can accept that tags containing tokens from these lists refer to an actual person.
    
    Because we do not want to be too broad, we will accept only strings that contain both a first name token AND a last name token found in the lists. 
    Strings that do not include both a first name and a last name will have to be checked manually; this includes all strings that have only one token. 
    
## Strategy 3. Visual check of remaining person tags
    If strategies 1 and 2 confirm enough PER tags, check the remaining ones manually.
    
## Strategy 4. Removing literary and religious references
    Some documents refer to ahistorical persons. This is especially important because the names of certain locations and associations take the names of Saints and other religious figures. (move this to "correcting miscategorized labels")
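
A minimal sketch of how such references could be flagged for review, assuming that a saint's title appearing as a token is a useful signal. The title list `SAINT_TITLES` and the helper `mentions_saint` are hypothetical and not part of the notebook; because some real people carry surnames such as "Santa Cruz", the sketch only flags tags and does not delete them.

```python
# Hypothetical list of saint titles; extend it as the corpus requires
SAINT_TITLES = {'san', 'santa', 'santo'}

def mentions_saint(tag):
    """Return True if any token of the tag is a saint title."""
    return any(token.lower() in SAINT_TITLES for token in tag.split())

# Example strings taken from the tag lists shown in this notebook
tags = [
    'juan martinez montañes',
    'dotores i santa barbara i santa elena',
    'Lucia Alvarez',
]

# Collect the tags that need a closer look
flagged = [tag for tag in tags if mentions_saint(tag)]
print(flagged)
```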
    


In [1]:
import pandas as pd
import re
import string

In [2]:
tagsDF = pd.read_csv('EntitiesEMSModel2.csv')

In [3]:
tagsDF.head()

Unnamed: 0.1,Unnamed: 0,docid,string,label,start,end
0,0,1,martin de gaynça,PER,1,4
1,1,1,santa iglesia,ORG,13,15
2,2,1,Sevilla,LOC,16,17
3,3,1,juan sanchez de caliz,PER,18,22
4,4,1,sevilla,LOC,34,35


# 1. Create a list of PER tags with IDs

We are creating a list of person tags with their unique ids. We remove stop words, excess whitespace and punctuation to reduce variation between strings. 

In [4]:
# Test: PER tags
a = tagsDF.loc[tagsDF["label"] == "PER"] #select PER tags
b = a.sort_values('string') #sort alphabetically
#b['string'] = b['string'].str.lower()  #lowercase all strings

More data cleaning: removing punctuation and accents

In [5]:
people=b

people.head()

Unnamed: 0.1,Unnamed: 0,docid,string,label,start,end
57935,57935,6792,"""""""",PER,135,138
45285,45285,4978,- Rodríguez,PER,6,8
45173,45173,4961,-- Gómez de Brito,PER,34,38
45108,45108,4952,-- Manrique,PER,11,13
61780,61780,7342,-- Ramírez,PER,29,31


# 2. Clean strings to remove stopwords and accents

In [114]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [6]:
def clean_iter (line):
    result = re.sub(' del ', ' ', line) #remove stopwords
    result = re.sub(' de ',' ',result)
    result = re.sub(' la ',' ',result)
    
    #result = re.sub(\s\s, \s, result) #remove extra spaces
    
    result= re.sub('á','a',result) #remove accents
    result= re.sub('é','e',result)
    result= re.sub('í','i',result)
    result= re.sub('ó','o',result)
    result= re.sub('ú','u',result)
    result= re.sub('ü','u',result)
    result= re.sub('à','a',result)
    result= re.sub('è','e',result)
    result= re.sub('ì','i',result)
    result= re.sub('ò','o',result)
    result= re.sub('ù','u',result)
    result= re.sub('Á','A',result)
    result= re.sub('É','E',result)
    result= re.sub('Í','I',result)
    result= re.sub('Ó','O',result)
    result= re.sub('Ú','U',result)
    result= re.sub('Ü','U',result)
    result= re.sub('À','A',result)
    result= re.sub('È','E',result)
    result= re.sub('Ì','I',result)
    result= re.sub('Ò','O',result)
    result= re.sub('Ù','U',result)
    result= result.strip()
    
    #result = result.translate(str.maketrans('', '', string.punctuation)) #remove punctuation
    table = str.maketrans('', '', string.punctuation)
    result = result.translate(table)
    
    return result
    
clean_series=[]

for index, row in people.iterrows():
    clean_series.append(clean_iter(row['string']))
people['string']= clean_series
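
The same cleaning steps can be written more compactly with translation tables. This is a sketch, not the function used to produce the outputs in this notebook: `clean_compact` is a hypothetical name, and unlike `clean_iter` it also collapses repeated spaces and strips whitespace after the punctuation is removed. Like `clean_iter`, it only removes lowercase stopwords and leaves ñ and ç untouched.

```python
import string

# Accented vowels mapped to plain vowels, character by character
ACCENTS = str.maketrans('áéíóúüàèìòùÁÉÍÓÚÜÀÈÌÒÙ',
                        'aeiouuaeiouAEIOUUAEIOU')
# Table that deletes every ASCII punctuation character
PUNCTUATION = str.maketrans('', '', string.punctuation)

def clean_compact(line):
    """Remove stopwords, accents and punctuation from one tag string."""
    # Apply the stopwords one after another, in the same order as clean_iter
    for stopword in (' del ', ' de ', ' la '):
        line = line.replace(stopword, ' ')
    line = line.translate(ACCENTS).translate(PUNCTUATION)
    # Collapse runs of whitespace and trim both ends
    return ' '.join(line.split())

print(clean_compact('-- Gómez de Brito'))
```

With pandas, a function like this could be applied to the whole column with `people['string'].apply(clean_compact)` instead of the `iterrows` loop.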


In [7]:
people.head()

Unnamed: 0.1,Unnamed: 0,docid,string,label,start,end
57935,57935,6792,,PER,135,138
45285,45285,4978,Rodriguez,PER,6,8
45173,45173,4961,Gomez Brito,PER,34,38
45108,45108,4952,Manrique,PER,11,13
61780,61780,7342,Ramirez,PER,29,31


# 3. Create an attribute for checking process called 'Remove'

Then we insert a column for a dummy variable for "remove" or "keep". We will use this to mark the false tags that have to be removed with a 1. Tags that should be kept will have a 0 value. Blank values must be double-checked manually before being assigned a 0 or 1.

In [8]:
people.insert(6,"Remove","")

In [9]:
people.head()

Unnamed: 0.1,Unnamed: 0,docid,string,label,start,end,Remove
57935,57935,6792,,PER,135,138,
45285,45285,4978,Rodriguez,PER,6,8,
45173,45173,4961,Gomez Brito,PER,34,38,
45108,45108,4952,Manrique,PER,11,13,
61780,61780,7342,Ramirez,PER,29,31,


Then we create a list of unique person tags.

In [10]:
uniquepers = people['string'].value_counts(sort=True)
uniquepers

juan martinez montañes                   354
ALONSO PEREZ                             268
andres ocanpo                            206
juan obiedo                              192
JUAN DE VALDES                           191
                                        ... 
juan muños                                 1
Lucia Alvarez                              1
dotores i santa barbara i santa elena      1
josephe naranjo                            1
Anton Ruis Guebara                         1
Name: string, Length: 14505, dtype: int64

There are 14505 unique strings.

# Strategy 1. Mark all Capitalized words and their lowercase equivalents as true names

Kinkead set all artist (painter, architect, goldbeater, gilder, etc) names in All-Caps. We can identify these strings and their lowercase equivalents as true names.

In [11]:
perscaps = people.loc[people['string'].str.isupper()==True]
perscaps.head()

Unnamed: 0.1,Unnamed: 0,docid,string,label,start,end,Remove
66787,66787,8056,AGUSTIN FRANCO,PER,7,9,
71698,71698,8579,AGUSTIN DE PEREA,PER,101,104,
45413,45413,4999,AGUSTIN DE PEREA,PER,16,19,
45421,45421,5000,AGUSTIN DE PEREA,PER,49,52,
45397,45397,4997,AGUSTIN DE PEREA,PER,26,29,


We can get an idea of what these strings are with a unique list.

In [12]:
uniqueperscaps = perscaps['string'].value_counts(sort=True)
uniqueperscaps

ALONSO PEREZ                  268
JUAN DE VALDES                191
BARTOLOME MURILLO             176
CARLOS DE ZARATE              166
MATIAS DE ARTEAGA             119
                             ... 
LOPEZ CHICO                     1
BLAS MUÑOZ DE MONSADA           1
ANTONIO JIMENEZ DE ZARZOSA      1
IGNACIO IRIARTE                 1
PEDRO BILLAVICENCIO             1
Name: string, Length: 892, dtype: int64

In [13]:
UPCapslist = list(uniqueperscaps.index.values)

Using this list of unique capitalized strings, we can set Remove='0' for any strings that match an uppercase name.

In [14]:
for ind in people.index:
    if people.loc[ind,'string'] in UPCapslist:
        people.loc[ind,'Remove']='0'

In [15]:
grouped = people.groupby('Remove')

grouped.count()

Unnamed: 0_level_0,Unnamed: 0,docid,string,label,start,end
Remove,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,27101,27101,27101,27101,27101,27101
0.0,6019,6019,6019,6019,6019,6019


This process has marked 6019 instances of a true name, which should be equivalent to the artists found in Kinkead's Pintores y Doradores en Sevilla.

Now we mark any lowercase strings equivalent to the uppercase tags we know are correct as 0 as well:

In [16]:
for ind in people.index:
    if people.loc[ind,'string'].upper() in UPCapslist:
        people.loc[ind,'Remove']='0'

In [17]:
grouped = people.groupby('Remove')
grouped.count()

Unnamed: 0_level_0,Unnamed: 0,docid,string,label,start,end
Remove,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,26055,26055,26055,26055,26055,26055
0.0,7065,7065,7065,7065,7065,7065


Below is a shorter way of achieving the same goal:

In [243]:
perscaps = people['string'][people['string'].str.isupper()==True].unique().tolist()

In [28]:
people.loc[people['string'].str.upper().isin(perscaps),'Remove'] = 0

These approaches have identified a total of 7065 tags as acceptable. That is, an additional 1046 tags corresponded to the same artists found in Kinkead. 

(Note: we will need to remove any documents that appear in Kinkead and other sources. This should be easy because Kinkead gives references as to where texts were published.)

# Strategy 2. Name Dictionary

In [18]:
# Read one name per line; the `with` blocks close the files once they are read
with open('firstnames.txt', 'r') as firstnames:
    fnames = [line.rstrip('\n') for line in firstnames]

In [19]:
with open('lastnames.txt', 'r') as lastnames:
    lnames = [line.rstrip('\n') for line in lastnames]

In [34]:
fnames

['abd',
 'abdo',
 'aben',
 'abraham',
 'abu',
 'abul',
 'acisclo',
 'adamo',
 'adan',
 'adaucto',
 'adria',
 'adriaen',
 'adrian',
 'adriano',
 'adrien',
 'advincula',
 'afan',
 'afonso',
 'africano',
 'agapito',
 'agesilao',
 'agostinho',
 'agostino',
 'agusti',
 'agustin',
 'agustina',
 'ahmad',
 'ahuitzotl',
 'aime-jacques',
 'al-basti',
 'al-garnati',
 'al-hayari',
 'al-malik',
 'al-manzari',
 'al-qalasadi',
 'al-titwani',
 'alal',
 'alberico',
 'albert-octave',
 'alberti',
 'alberto',
 'albino',
 'albret',
 'alcantara',
 'alcocer',
 'aldonza',
 'aleix',
 'aleixo',
 'alejandrino',
 'alejandro',
 'alejo',
 'alemania',
 'alessandro',
 'alex',
 'alexander',
 'alexandre',
 'alexo',
 'alexos',
 'alfonso',
 'ali',
 'almanzor',
 'almidez',
 'alonso',
 'alvar',
 'alvarado',
 'alvaro',
 'amadeo',
 'amador',
 'amalia',
 'amancio',
 'amaneo',
 'amaru',
 'ambrosia',
 'ambrosio',
 'amelia',
 'amerigo',
 'amparo',
 'ana',
 'anacaona',
 'anacleto',
 'anastasio',
 'andre',
 'andrea',
 'andres',
 '

In [177]:
lnames

['olim',
 'alava',
 'ab-ach',
 'abad',
 'abadal',
 'abadia',
 'abadiano',
 'abadie',
 'abadin',
 'abarca',
 'abaria',
 'abarrategui',
 'abascal',
 'abasolo',
 'abastas',
 'abat',
 'abbeville',
 'abd',
 'abduladin',
 'abel',
 'abella',
 'abellan',
 'abello',
 'aben',
 'abenaxara',
 'abengoechea',
 'abiancos',
 'abianelo',
 'ablanque',
 'ablitas',
 'aboab',
 'abollado',
 'aborim',
 'abraha',
 'abraham',
 'abranches',
 'abravanel',
 'abreu',
 'abrich',
 'abu',
 'abuin',
 'aburruza',
 'acacio',
 'acassuso',
 'accursio',
 'acebal',
 'acebedo',
 'acebes',
 'acebo',
 'acedo',
 'aceites',
 'acero',
 'acevedo',
 'acha',
 'achutegui',
 'acitores',
 'acorbe',
 'acorda',
 'acosta',
 'acquaviva',
 'acuña',
 'adalid',
 'adam',
 'adame',
 'adan',
 'adaro',
 'adarzo',
 'adell',
 'adeva',
 'adrada',
 'adrian',
 'adriano',
 'adsor',
 'aduart',
 'aduarte',
 'adurza',
 'aedo',
 'aerts',
 'afan',
 'afferden',
 'afonso',
 'aganduru',
 'agar',
 'agnesio',
 'agoiz',
 'agramont',
 'agramonte',
 'agraz',
 'agre

In [93]:
# Code used for reference

#def name_looping(df, namelist):
#    for ind in df.index:
#        if df.loc[ind,'Remove']=='':
#            for name in namelist:
#                if name in df.loc[ind,'string']:
#                    df.loc[ind,'Remove']='0'
#                else:
#                    df.loc[ind,'Remove']='A'  #very slow

In [None]:
# Online Examples to Follow

# For loop (slow)
#def soc_loop(leaguedf,TEAM,):
#    leaguedf['Draws'] = 99999
#    for row in range(0, len(leaguedf)):
#        if ((leaguedf['HomeTeam'].iloc[row] == TEAM) & (leaguedf['FTR'].iloc[row] == 'D')) | \
#            ((leaguedf['AwayTeam'].iloc[row] == TEAM) & (leaguedf['FTR'].iloc[row] == 'D')):
#            leaguedf['Draws'].iloc[row] = 'Draw'
#        elif ((leaguedf['HomeTeam'].iloc[row] == TEAM) & (leaguedf['FTR'].iloc[row] != 'D')) | \
#            ((leaguedf['AwayTeam'].iloc[row] == TEAM) & (leaguedf['FTR'].iloc[row] != 'D')):
#            leaguedf['Draws'].iloc[row] = 'No_Draw'
#        else:
#            leaguedf['Draws'].iloc[row] = 'No_Game'
            
#Using Iterrows (faster)

#def soc_iter(TEAM,home,away,ftr):
#    #team, row['HomeTeam'], row['AwayTeam'], row['FTR']
#    if [((home == TEAM) & (ftr == 'D')) | ((away == TEAM) & (ftr == 'D'))]:
#        result = 'Draw'
#    elif [((home == TEAM) & (ftr != 'D')) | ((away == TEAM) & (ftr != 'D'))]:
#        result = 'No_Draw'
#    else:
#        result = 'No_Game'
#    return result

#draw_series=[]
#for index, row in df.iterrows():
#    draw_series.append(soc_iter('Arsenal',row['HomeTeam'], row['AwayTeam'], row['FTR']))
#df['Draws']= draw_series

#Using apply (even faster)
#df['Draws'] = df.apply(lambda row: soc_iter('Arsenal', row['HomeTeam'], row['AwayTeam'], row['FTR']), axis=1)

#Pandas Vectorization (fastererer)

#def soc_iter(TEAM,home,away,ftr):
#    df['Draws'] = 'No_Game'
#    df.loc[((home == TEAM) & (ftr == 'D')) | ((away == TEAM) & (ftr == 'D')), 'Draws'] = 'Draw'
#    df.loc[((home == TEAM) & (ftr != 'D')) | ((away == TEAM) & (ftr != 'D')), 'Draws'] = 'No_Draw'
    
#df['Draws']= soc_iter('Arsenal', df['Hometeam'], df['AwayTeam'], df['FTR'])

#Numpys Vectorization (fastest)

#df['Draws']= soc_iter('Arsenal', df['HomeTeam'].values, df['AwayTeam'].values, df['FTR'].values)


We will define a function that iterates through strings, searching through our first and last name lists to find a match in each. Accepted matches will include at least one first name token and one last name token. 

In [41]:
#def name_iter(NAME, LNAME, remove, string):
#    n= '(.* )*'+NAME + '(.* )*'
#    x = re.search(n, string, re.IGNORECASE)
#    if x:
#        ln= '(.* )*'+ LNAME + '( .*)*'
#            y = re.search(ln, string, re.IGNORECASE)
#        if y:
#            result = '0'
#    else:
#        result = remove
#    return result
#    print(name)

In [42]:
#remove_series=[]

#for name, lname in zip(fnames, lnames):
#    for index, row in people.iterrows():
#        remove_series.append(name_iter(name, lname, row['Remove'], row['string']))
#    people['Remove']= remove_series

#    remove_series=[]

Below, the check is rewritten so that a string must match both lists, with one token from each (the earlier version would sometimes match the same value twice).

In [21]:
any_in = lambda a, b, c: any(i.lower() in b for i in a) and any(i.lower() in c for i in a) 

In [22]:
people['Remove'] = people.apply(lambda x: 0 if any_in(x['string'].split(),fnames,lnames) and len(x['string'].split())>1 else x['Remove'], axis=1)

The function above checks every row in the dataframe, assigning people['Remove'] = 0 whenever at least one token of the string appears in the first name list and at least one appears in the last name list. One issue arises when a name appears in both lists: a single token can then satisfy both conditions. I was not able to eliminate this completely, but the added condition that string.split() has more than one element at least weeds out strings that have only one token. This is particularly useful for excluding names of saints, which likely come from descriptions of paintings or images.
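
The remaining loophole can be closed by requiring that the first name and the last name come from two different token positions. The sketch below is one possible way to do this, not the approach used in the rest of the notebook; `has_first_and_last` is a hypothetical helper, and the small name sets are made up for illustration.

```python
# The check used above, repeated here so the sketch is self-contained
any_in = lambda a, b, c: any(i.lower() in b for i in a) and any(i.lower() in c for i in a)

def has_first_and_last(tokens, first_names, last_names):
    """True if one token is a first name and a different token is a last name."""
    lowered = [t.lower() for t in tokens]
    first_pos = [i for i, t in enumerate(lowered) if t in first_names]
    last_pos = [i for i, t in enumerate(lowered) if t in last_names]
    # A valid pairing needs two distinct positions in the string
    return any(i != j for i in first_pos for j in last_pos)

# Illustrative lists in which 'alonso' appears as both first and last name
first = {'alonso', 'juan'}
last = {'alonso', 'perez'}

loose = any_in(['san', 'alonso'], first, last)               # one token satisfies both lists
strict = has_first_and_last(['san', 'alonso'], first, last)  # no second name token
print(loose, strict)
```

Using sets for the name lists also makes each membership test constant-time, which matters when the check runs over tens of thousands of rows.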

But let's remove names that are repeated in the two lists. First we will make a list of the shared elements:

In [23]:
f = set(fnames)
l = set(lnames)

if (f & l): 
        f_and_l = list(f & l)
        print(f_and_l)

['carlota', 'kahi', 'foix', 'mata', 'bernal', 'gentil', 'fadrique', 'villagomez', 'esteban', 'barros', 'mariana', 'guillen', 'leonor', 'richard', 'monserrat', 'hurtado', 'mawlay', 'alonso', 'luis', 'cecilia', 'von', 'ernesto', 'felipe', 'ascension', 'alvaro', 'tutul', 'manrique', 'julia', 'maximiliano', 'amadeo', 'mendo', 'carles', 'mancio', 'mauro', 'oliva', 'vivar', 'leon', 'almanzor', 'martinillo', 'francisco', 'conrado', 'cacamatzin', 'luca', 'esteve', 'morejon', 'nieves', 'ibn', 'paul', 'bezon', 'robert', 'rui', 'dalmau', 'margarita', 'muley', 'nuñez', 'pedroso', 'inga', 'clemente', 'pio', 'al-malik', 'ferrante', 'moctezuma', 'regalado', 'juana', 'jordan', 'muhammad', 'sales', 'asensio', 'agusti', 'atanasio', 'garcia', 'alberti', 'saboya', 'andrea', 'guillermo', 'victoria', 'anastasio', 'macias', 'pedro', 'roque', 'jacome', 'jaime', 'narciso', 'francisca', 'egas', 'toribio', 'cebrian', 'estefania', 'ponsich', 'cayo', 'nadal', 'rubin', 'tupac', 'claudio', 'barbara', 'navarra', 'eli

And we will drop them all from the lnames list (our function does not actually make a semantic difference between both lists).

In [24]:
lnames = list(l-(f&l))

In [30]:
people['Remove'] = ''

Let's try our operations again:

In [31]:
people.loc[people['string'].str.upper().isin(perscaps),'Remove'] = 0

In [32]:
people['Remove'] = people.apply(lambda x: 0 if any_in(x['string'].split(),fnames,lnames) and len(x['string'].split())>1 else x['Remove'], axis=1)

In [33]:
grouped = people.groupby('Remove')
grouped.count()

Unnamed: 0_level_0,Unnamed: 0,docid,string,label,start,end
Remove,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0.0,18161,18161,18161,18161,18161,18161
,14959,14959,14959,14959,14959,14959


In [34]:
len(people['string'].unique())

14505

Number of unique strings identified:

In [35]:
len(people.loc[people['Remove']==0,'string'].unique())

7323

Number of unique strings not yet identified:

In [36]:
len(people.loc[people['Remove']=='','string'].unique())

7182

Let's export both lists of unique strings (confirmed and unconfirmed) to print them out, look at them and try to identify patterns for further work.

In [37]:
a= people.loc[people['Remove']==0,'string'].str.lower()
a= a.unique()
a= pd.Series(a)
len(a)

6386

In [38]:
b= people.loc[people['Remove']=='','string'].str.lower()
b= b.unique()
b= pd.Series(b)
len(b)

6623

In [40]:
a.to_csv("acceptedpersnames.csv")
b.to_csv("remainingpersnames.csv")

In [39]:
'alonso' in fnames

True

We will also export the intermediate people table so that we can work with it in later sessions:

In [41]:
people.to_csv("people_strat2.csv")

# Fuzzy Matching

Many of the unmatched tags above are spelling variations of real names that have approximate matches in the lists. We can use fuzzy matching to identify more of them.
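
As a rough illustration of the idea, the sketch below scores a token against a list of names using `difflib` from the standard library. This is a stand-in chosen so the example runs without extra packages; it is not the fuzzywuzzy scorer imported below, and its scores will differ. The helper names and the candidate list (including the spelling 'ocampo') are assumptions made for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity between two strings on a 0-100 scale, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def best_match(token, names):
    """Return (score, name) for the closest name, or (0, None) if names is empty."""
    return max(((similarity(token, name), name) for name in names),
               default=(0, None))

# 'ocanpo' appears among the tags above; the candidate names are illustrative
candidates = ['ocampo', 'obiedo', 'valdes']
score, name = best_match('ocanpo', candidates)
print(score, name)
```

A token would then be accepted only if its best score exceeds a chosen threshold; the right threshold has to be found by inspecting borderline matches.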


In [53]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [None]:
#     result = list(map(sum,input)) 
#     print (result.index(max(result)))

In [367]:
def fuzzmax(string, name):
    """Return the best fuzzy score (0-100) between `name` and any single token of `string`."""
    tokens = str(string).split()
    if not tokens:
        return 0
    return max(fuzz.token_sort_ratio(token, name) for token in tokens)

In [366]:
type(fuzzmax("Antonio de Medina","Jose"))

int

In [None]:
#map(lambda a: )

In [None]:
%%timeit
temp = pd.DataFrame()
for fname, lname in zip(fnames,lnames):
    temp[fname + lname] = people.apply(lambda x: fuzzmax(x['string'],fname) + fuzzmax(x['string'],lname), axis=1)

In [None]:
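# Abandoned draft, kept commented out: the conditional expression has an `else` but no `if`,
# and fname/lname are only defined inside the loop of the previous cell.
#people['Remove'] = people.apply(lambda x: str(max(map(lambda a: fuzz.token_sort_ratio(a,fname)), x['string'].split())) + str(max(map(lambda a: fuzz.token_sort_ratio(a,lname)), x['string'].split())) else x['Remove'], axis=1)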

In [None]:
def fuzznames(fnames, lnames, string, threshold=90):
    """Return True if one token of `string` fuzzily matches a first name and
    one token fuzzily matches a last name, both scoring above `threshold`."""
    stringspl = str(string).split()
    if len(stringspl) < 2:
        return False
    # Best score of any token against any name in each list
    fuzzf = max(fuzz.token_sort_ratio(fname, a) for fname in fnames for a in stringspl)
    fuzzl = max(fuzz.token_sort_ratio(lname, a) for lname in lnames for a in stringspl)
    return fuzzf > threshold and fuzzl > threshold

# Mark rows that are not yet confirmed with 90 when both scores pass the threshold
unconfirmed = people['Remove'] != 0
people.loc[unconfirmed, 'Remove'] = people.loc[unconfirmed].apply(
    lambda x: 90 if fuzznames(fnames, lnames, x['string']) else x['Remove'], axis=1)

# Reference idiom: the position of the highest score in a list is scores.index(max(scores))



# Check Against the Documents

The tags still not marked 0 have not been identified as names. We will look at them within the context of their documents to mark whether they are names or not.

In [None]:
# For future reference: per tags up to this point have been saved as "people_strat2.csv" 

In [1]:
import pandas as pd
people = pd.read_csv("people_strat2.csv")

In [2]:
people['Remove'] = people['Remove'].astype(str)
people['Remove'] = people['Remove'].replace('nan','')
people['Remove'] = people['Remove'].replace('0.0','0')
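
The three lines above repair the `Remove` column after the CSV round trip, where blanks come back as NaN and 0 comes back as 0.0. A minimal sketch of the same normalization as a single function, assuming the only legitimate values are '0', '1', blank and free-text markers typed during manual review; `normalize_remove` is a hypothetical helper, not part of the notebook.

```python
import math

def normalize_remove(value):
    """Map a Remove flag to '0' (keep), '1' (remove) or '' (unchecked)."""
    # Missing values read back from CSV arrive as float NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ''
    text = str(value).strip()
    if text in ('', 'nan'):
        return ''
    try:
        # '0', '0.0', 0 and 0.0 all become '0'
        return str(int(float(text)))
    except ValueError:
        # Leave non-numeric markers (such as 'X') unchanged
        return text

print([normalize_remove(v) for v in (0, 0.0, '0.0', float('nan'), '', 1, 'X')])
```

With pandas this could be applied as `people['Remove'].apply(normalize_remove)`, which keeps the column in one consistent string form whichever cell last wrote to it.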

In [3]:
people

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end,Remove
0,57935,57935,6792,,PER,135,138,
1,45285,45285,4978,Rodriguez,PER,6,8,0
2,45173,45173,4961,Gomez Brito,PER,34,38,0
3,45108,45108,4952,Manrique,PER,11,13,0
4,61780,61780,7342,Ramirez,PER,29,31,0
...,...,...,...,...,...,...,...,...
33115,45897,45897,5065,Angela Guillenes,PER,19,21,0
33116,65927,65927,7947,Angela Rodriguez Borte,PER,28,31,0
33117,11101,11101,1304,çamora,PER,238,239,
33118,32475,32475,3436,çamora,PER,90,91,


In [4]:
grouped = people.groupby('Remove')
grouped.count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end
Remove,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
,1478,1478,1478,1475,1478,1478,1478
0.0,31642,31642,31642,31642,31642,31642,31642


There are 1478 tags that do not contain at least one first name and one last name token from the lists above.  Next, we will go through them individually, looking at them in context, to determine whether they should be removed or kept.

In [1]:
#import mysql.connector

#mydb = mysql.connector.connect(
#  host="localhost",
#  user="root",
#  passwd="",
#  database="SevillePainters3"
#)

#print(mydb)

In [None]:
# The span being read is not the one that includes the entity, for whatever reason.
#"SELECT SUBSTRING(CAST(document as char), "+str(strstart) +", 50) FROM document WHERE document.id = " + str(docid)
# try:
 #           strstart = int(people.loc[index, 'start'])-20
  #      except: 
   #         strstart = 0

In [None]:
# The problem is how spaCy marks the position of a token. 
# Not by character but by token number. We could calculate 
# this based on spaces, but I think the easiest way would be 
# to pickle the Doc files in the 'exporting' notebook. 
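
Because `start` and `end` are token positions and not character offsets, a context window has to be cut from a token sequence, not from the raw text. The sketch below shows the idea on a plain Python list; the token list is written by hand for illustration and only approximates how spaCy would tokenize the document, and `token_window` is a hypothetical helper.

```python
def token_window(tokens, start, end, pad=5):
    """Return (left context, entity, right context) as lists of tokens."""
    left = tokens[max(0, start - pad):start]
    entity = tokens[start:end]
    right = tokens[end:end + pad]
    return left, entity, right

# Hand-written tokens for the opening of document 1 (an assumption, not spaCy output)
tokens = ['...', 'martin', 'de', 'gaynça', '...', 'soy', 'convenido']

# The first row of the tag table has start=1 and end=4
left, entity, right = token_window(tokens, 1, 4, pad=2)
print(' '.join(entity))
```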

In [17]:
# Create a cursor object that allows you to execute SQL

#mycursor = mydb.cursor()

#for (index, row) in people.iterrows():
#    if row['Remove']!='0':
#        print(row['string'])
        
#        docid = row[ 'docid']
#        strstart = int(row['start'])
#        sqlstart = strstart - 20
#        strlength= int(row['end']) - int(row['start'])
#        strend= strstart+strlength
        
#        sql1 = "SELECT SUBSTRING(CAST(document as char), " + str(sqlstart) +", 20) FROM document WHERE document.id = " + str(docid)
#        mycursor.execute(sql1)
#        print(mycursor.fetchall())
        
#        sql2 = "SELECT SUBSTRING(CAST(document as char), " + str(strstart) +"," + str(strlength) +") FROM document WHERE document.id = " + str(docid)
#        mycursor.execute(sql2)
#        print(mycursor.fetchall())
        
#        sql3 = "SELECT SUBSTRING(CAST(document as char), " + str(strend) +", 20) FROM document WHERE document.id = " + str(docid)
#        mycursor.execute(sql3)
#        print(mycursor.fetchall())
        
#        people.loc[index,'Remove']=input('Remove?')
#        clear_output(wait=True)

In [2]:
from IPython.display import clear_output
import spacy
from spacy import displacy

ModuleNotFoundError: No module named 'spacy'

In [1]:
import pickle

# Load the pickled (Doc, metadata) pairs; the `with` block closes the file afterwards
with open("Trained_EMS2_NER_data.p", 'rb') as file:
    docs = pickle.load(file)

ModuleNotFoundError: No module named 'spacy'

In [18]:
docs[1]

(...martin de gaynça... soy convenido... con los reverendos padres prior frailes e convento del monasterio de san pablo desta dicha ciudad de Sevilla... en tal manera que yo me obligo de fazer el cruzero de la iglesia junto a la capilla mayor... e de lo fazer de buena obra de canteria del puerto de santa maria del altura de la capilla mayor lo cual tengo de derrocar a mi costa con tanto que todo e material que saliere del dicho cruzero asi de piedra como de ladrillo sea mio e me obligo de ensanchar las ventanas del dicho cruzero del altura que fuere menester e desgarrar lo que fuere menester para ello e de encalar e torar los testeros donde estan las dichas ventanas (muy roto) e mas me obligo de fazer el pilarote sobre que a de venir el canpanario fasta en raz de la capilla e mas me obligo de alinpiar los dos arcos para la parte de la capilla que estan en el dicho cruzero... e de lo dar acabado en fin del mes de marzo de 1543... dandome 400 ducados,
 {'id': 2})

In [27]:
docs[1][0].ents

(martin de gaynça,
 monasterio de san pablo,
 Sevilla,
 puerto de santa maria,
 fin del mes de marzo de 1543,
 400 ducados)

In [67]:
import spacy
from spacy import displacy

In [None]:
# Index the documents by id once, so each tag is looked up directly
docs_by_id = {meta['id']: doc for doc, meta in docs}

for (index, row) in people.iterrows():
    if row['Remove']=='':
        print(row['string'])

        doc = docs_by_id.get(row['docid'])
        if doc is None:
            continue  # no pickled document for this tag

        strstart = int(row['start'])
        fetchend = strstart + 10

        # Show a 10-token window starting at the tag, then ask for a decision
        displacy.render(doc[strstart:fetchend],style='ent',jupyter=True)
        answer = input('Remove?')

        # 'X' asks for more context: show the whole document and ask again
        if answer=='X':
            displacy.render(doc,style='ent',jupyter=True)
            answer = input('Remove?')

        people.loc[index,'Remove'] = answer
        clear_output(wait=True)

Espejo


In [6]:
grouped = people.groupby('Remove')
grouped.count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end
Remove,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0.0,31642,31642,31642,31642,31642,31642,31642
,1478,1478,1478,1475,1478,1478,1478


In [None]:
# Note: keeping -- in name strings might be valuable because some scholars have used it to indicate a missing token.