# Data Cleaning 2: Correct Miscategorized Labels

In this notebook, we review the "person" (PER) tags and reduce the number of labels that correctly identify an entity but place it in the wrong category, for example by assigning "PER" to a location.

These errors are common because places and organizations often take the names of saints and other religious figures.

## Strategy 1. Capitalized names

    Duncan Kinkead wrote all artist names in capital letters in his documents, so any tag whose string is fully uppercase is a true name. 
    Any string that matches one of those names when case is ignored is also a person name.
    
## Strategy 2. "San", "Santo", "Santa", "Santos", "Santas" Tokens
Check the entity and the text around it for these tokens, which often mark a place or organization named after a saint rather than a person.

First and last name lists
    We have compiled a list of first and last names from 1400-1800 from a historical biographical dictionary. We can accept any tag that contains a token from these lists as referring to an actual person. 
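The name-list check described above is not implemented in this notebook, so the following is only a minimal sketch of how it could work. The `known_names` set and the toy `tags` table are hypothetical stand-ins for the real name lists and the PER table.

```python
import pandas as pd

# Hypothetical name tokens; the real lists come from the biographical dictionary.
known_names = {"pedro", "roldan", "antonio", "maria"}

# Toy tags standing in for the 'string' column of the PER table.
tags = pd.DataFrame({"string": ["Pedro Roldan", "San Lorenzo", "antonio", ""]})


def has_known_name(text, names=known_names):
    """Return True if any whitespace-separated token of text is in names."""
    return any(token.lower() in names for token in str(text).split())


# Flag tags that contain at least one token from the name list.
tags["in_name_list"] = tags["string"].apply(has_known_name)
print(tags)
```

A tag such as "Santa Maria" would also pass this check because "maria" is a name token, so a list-based test like this one would probably need to be combined with the "San"/"Santa" check rather than used on its own.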
    
## Strategy 3. Visual check of remaining person tags
    If strategies 1 and 2 resolve enough PER tags, check the remaining ones manually.

In [None]:
# Idea: create a list of tokens that do not appear in the name list. Mark those that could be problematic. 
# This can be used to eliminate false matches, but also to edit matches that have gone over their span.
# Would also be wise to mark tokens that could be problematic within the name list (ex. ángeles)

In [1]:
import pandas as pd
people = pd.read_csv('Files_Cleaning1/PER_tags_clean1.csv')

First, reset the Remove column.

In [12]:
people['Remove'] = ''

In [29]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end,Remove
0,57935,57935,6792,,PER,135,138,
1,45285,45285,4978,Rodriguez,PER,6,8,
2,45173,45173,4961,Gomez Brito,PER,34,38,
3,45108,45108,4952,Manrique,PER,11,13,
4,61780,61780,7342,Ramirez,PER,29,31,


In [28]:
people['string']=people['string'].fillna('')

# Strategy 1. Capitalized Names (Like in Data Cleaning 1)

In [13]:
# Collect every distinct tag string that is written entirely in uppercase.
perscaps = people.loc[people['string'].str.isupper() == True, 'string'].unique().tolist()

In [14]:
# Accept (Remove = 0) every tag that matches one of those names, ignoring case.
people.loc[people['string'].str.upper().isin(perscaps), 'Remove'] = 0
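To make the effect of this match concrete, here is a small sketch on toy data (the strings are invented for illustration). It uses a boolean `accepted` column instead of `Remove = 0` to keep the example short.

```python
import pandas as pd

# Toy tag strings: one uppercase name, two case variants of it, one unrelated tag.
toy = pd.DataFrame(
    {"string": ["PEDRO ROLDAN", "pedro roldan", "Pedro Roldan", "San Lorenzo"]}
)

# Distinct strings written entirely in uppercase.
caps = toy.loc[toy["string"].str.isupper(), "string"].unique().tolist()

# A tag is accepted when its uppercased form equals one of those strings.
toy["accepted"] = toy["string"].str.upper().isin(caps)
print(toy)
```

The comparison is on the whole string, so a tag that merely contains an uppercase name inside a longer span would not be accepted by this step.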

In [30]:
people[people['Remove']==0]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end,Remove
25,66787,66787,8056,AGUSTIN FRANCO,PER,7,9,0
26,71698,71698,8579,AGUSTIN DE PEREA,PER,101,104,0
27,45413,45413,4999,AGUSTIN DE PEREA,PER,16,19,0
28,45421,45421,5000,AGUSTIN DE PEREA,PER,49,52,0
29,45397,45397,4997,AGUSTIN DE PEREA,PER,26,29,0
...,...,...,...,...,...,...,...,...
32319,6873,6873,958,pedro ramirez,PER,1,3,0
32321,29490,29490,3146,pedro roldan,PER,368,370,0
32322,25215,25215,2727,pedro roldan,PER,295,297,0
32323,2503,2503,295,pedro roldan,PER,4,6,0


In [16]:
grouped = people.groupby('Remove')
grouped.count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end
Remove,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0.0,7076,7076,7076,7076,7076,7076,7076
,26044,26044,26044,26041,26044,26044,26044


We will accept these 7076 tags. 

# Step 2. Identifying Non-Person Tokens

Next, we will examine the list of tokens in the tags that have not yet been flagged and find the tokens that could indicate non-personhood:

In [17]:
tokenlist= people.loc[people['Remove']!=0,'string'].str.split().tolist()

In [18]:
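# Flatten the per-tag token lists into one list of distinct lowercase tokens.
newlist = []
for sublist in tokenlist:
    if isinstance(sublist, list):  # skip NaN entries left by missing strings
        for item in sublist:
            newlist.append(str(item).lower())
tokenlist = list(set(newlist))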

In [19]:
tokenlist.sort()
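The same flattening could also be written with pandas string methods alone. This is an alternative sketch on toy data, not the code that produced the results in this notebook.

```python
import pandas as pd

# Toy tag strings, including an empty string and a missing value.
strings = pd.Series(["Gomez Brito", "Señor San Andres", "", None, "gomez"])

# Lowercase, split on whitespace, put one token per row, and drop the
# missing values that empty strings leave behind after explode().
tokens = sorted(
    strings.dropna()
    .str.lower()
    .str.split()
    .explode()
    .dropna()
    .unique()
    .tolist()
)
print(tokens)
```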

In [20]:
len(tokenlist)

5509

In [21]:
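# Write one token per line; UTF-8 keeps accented characters (ex. ángeles) intact.
with open('PersonNameTokenList.csv', 'w', encoding='utf-8') as file:
    for item in tokenlist:
        file.write(item + "\n")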

# Strategy 2

Check strings with "San, Santo, Santa, Santos, Santas" and other non-person tokens.

In [31]:
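# The trailing space keeps names such as "Sancho" or "Santiago" from matching.
nontokens = ['San ', 'Santo ', 'Santa ', 'Santos ', 'Santas ']

# Flag every tag that contains one of the tokens for manual review.
for token in nontokens:
    people.loc[people['string'].str.contains(token, regex=False), 'Remove'] = 'check'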
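Because each token ends in a space, the check above only fires when another word follows the token. If tags that end in one of these tokens should also be reviewed, a word-boundary regular expression is one possible alternative. The sketch below uses invented strings, and whether a tag-final "Santos" (which can be a surname) should be flagged is a judgment call.

```python
import pandas as pd

# Toy tag strings chosen to show the edge cases.
toy = pd.DataFrame(
    {"string": ["Antonio Santa Cruz", "San", "Sancho Perez", "Juan de los Santos"]}
)

# \b matches at word boundaries, so "San" matches as a whole word at the
# start, middle, or end of a tag, while "Sancho" does not match.
pattern = r"\b(?:San|Santo|Santa|Santos|Santas)\b"
toy["check"] = toy["string"].str.contains(pattern, regex=True)
print(toy)
```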

In [32]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end,Remove
0,57935,57935,6792,,PER,135,138,
1,45285,45285,4978,Rodriguez,PER,6,8,
2,45173,45173,4961,Gomez Brito,PER,34,38,
3,45108,45108,4952,Manrique,PER,11,13,
4,61780,61780,7342,Ramirez,PER,29,31,


In [33]:
people.groupby('Remove').count()

Unnamed: 0_level_0,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end
Remove,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,7076,7076,7076,7076,7076,7076,7076
,25943,25943,25943,25943,25943,25943,25943
check,101,101,101,101,101,101,101


In [135]:
people[people['Remove']=='check']

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,docid,string,label,start,end,Remove
1585,47680,47680,5302,Andrea Santa Trinidad,PER,62,67,check
1911,44395,44395,4876,Andres San Martin,PER,46,50,check
2452,16350,16350,1783,Antonio Santa Cruz,PER,37,41,check
2453,16540,16540,1806,Antonio Santa Cruz,PER,0,4,check
2454,16597,16597,1814,Antonio Santa Cruz,PER,9,13,check
...,...,...,...,...,...,...,...,...
18449,2112,2112,265,Sebastian Santa Maria,PER,10,14,check
18635,45302,45302,4979,Señor San Andres,PER,243,246,check
18636,66747,66747,8050,Señor San Laureano,PER,122,125,check
18639,52425,52425,5990,Señorr San Esteban Martir,PER,103,107,check


In [136]:
# Then go over them one by one, make a list, and change the tags.

In [1]:
import spacy
from spacy import displacy

In [8]:
nlp = spacy.load("es_core_news_ml_EMS2")

In [144]:
from IPython.display import clear_output

In [9]:
import pickle

# Load the pickled documents; the context manager closes the file afterwards.
with open("Trained_EMS2_NER_data.p", 'rb') as file:
    docs = pickle.load(file)

In [10]:
# Review each flagged tag in context and record the decision in 'Remove'.
# Entering 'X' at the first prompt shows the whole document before asking again.
for index, row in people.iterrows():
    if row['Remove'] != 'check':
        continue
    print(row['string'])

    docid = row['docid']
    strstart = int(row['start'])
    fetchend = strstart + 10  # show a 10-token window starting at the tag

    for doc in docs:
        if doc[1]['id'] != docid:
            continue
        displacy.render(doc[0][strstart:fetchend], style='ent', jupyter=True)
        answer = input('Remove?')
        if answer == 'X':
            displacy.render(doc[0], style='ent', jupyter=True)
            answer = input('Remove?')
        people.loc[index, 'Remove'] = answer
        clear_output(wait=True)

NameError: name 'people' is not defined
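The review loop scans the full `docs` list for every flagged tag. One possible speed-up is to index the documents by id once and look each one up directly. The sketch below uses plain token lists as stand-ins for the spaCy documents, and the helper `context_window` is a hypothetical name; it assumes each document id appears only once.

```python
# Toy stand-ins: each entry mirrors the (doc, metadata) pairs indexed above as
# doc[0] and doc[1]['id'], with token lists in place of spaCy documents.
docs = [
    (["Antonio", "Santa", "Cruz", "pintor"], {"id": 1783}),
    (["Señor", "San", "Andres"], {"id": 4979}),
]

# Build the lookup once (assumes ids are unique).
docs_by_id = {meta["id"]: doc for doc, meta in docs}


def context_window(docid, start, width=10):
    """Return the tokens from start to start + width, or None if the id is unknown."""
    doc = docs_by_id.get(docid)
    if doc is None:
        return None
    return doc[start:start + width]


print(context_window(1783, 1, width=2))
```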