# Data Cleaning 3: Correcting Entity Spans (short)

We search for tokens immediately to the left and right of a tagged person that appear in the entity token list and the name lists but have not themselves been tagged as entities. To do this we can load the spaCy docs; every untagged token should have the IOB tag 'O' (in spaCy, token.ent_iob == 2).

In [5]:
import pandas as pd
people = pd.read_csv('Files_Cleaning2/PER_tags_clean2_manual.csv')

In [6]:
people.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,docid,string,label,start,end,Remove,matched_str,edit_long,newstart,newend
0,0,0,0,0,0,57935,57935,6792,,,135,138,1,0,,,
1,1,1,1,1,1,45285,45285,4978,- Rodriguez,PER,6,8,0,0,0.0,,
2,2,2,2,2,2,45173,45173,4961,-- Gomez Brito,PER,34,38,0,0,0.0,,
3,3,3,3,3,3,45108,45108,4952,-- Manrique,PER,11,13,0,0,0.0,,
4,4,4,4,4,4,61780,61780,7342,-- Ramirez,PER,29,31,0,0,0.0,,


In [11]:
people.fillna('blank').groupby(['Remove','edit_long']).count()

Remove,edit_long,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,docid,string,label,start,end,matched_str,newstart,newend
0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16
0,0.0,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756,31756
0,e,470,470,470,470,470,470,470,470,470,470,470,470,470,470,470
0,o,154,154,154,154,154,154,154,154,154,154,154,154,154,154,154
1,blank,604,604,604,604,604,604,604,604,604,604,604,604,604,604,604
D,blank,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6
L,blank,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41
M,blank,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6
O,blank,67,67,67,67,67,67,67,67,67,67,67,67,67,67,67


The table above shows the progress in data cleaning so far. The 'Remove' step flagged 604 tags for removal (600 originally, plus 4 that were discovered in the edit-length step), as well as others to reclassify: 6 dates (D), 41 locations (L), 6 monetary amounts (M) and 67 references to organizations (O). These were ignored in the edit-length stage, where we focused on the PER tags we are keeping and looked for spans that were too long: 470 (marked 'e') were given a new span length in a new column, and 154 (marked 'o') have problems that cannot be edited yet, either because of errors in tokenization or because they include multiple tags that will need to be split.
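Because count() reports the number of non-null values in every column, the table repeats the same figure across each row once the blanks have been filled. A more compact summary is possible with size(), which returns one count per group. The sketch below uses a small made-up dataframe (toy is a hypothetical stand-in, not the real 'people' data) purely to illustrate the idea:

```python
import pandas as pd

# Hypothetical toy data standing in for the 'Remove' and 'edit_long' columns
toy = pd.DataFrame({
    "Remove": ["0", "0", "0", "1", "L"],
    "edit_long": ["0.0", "e", "0.0", None, None],
})

# size() yields a single count per (Remove, edit_long) group
summary = toy.fillna("blank").groupby(["Remove", "edit_long"]).size()
print(summary)
```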

Next, we will look for words that should have formed part of an entity span and incorporate them. To do this, we have to look beyond our 'people' dataframe at the documents as a whole. Let's load our pickled documents file, which holds the output of our work in spaCy.

In [12]:
from IPython.display import clear_output
import spacy
from spacy import displacy

import pickle

In [13]:
# Use a context manager so the file is closed after loading
with open("../Text Mining (NER)/Trained_EMS2_NER_data.p", 'rb') as file:
    docs = pickle.load(file)

Our approach is to take the strings that appear in our first-name, last-name, and accepted-token lists and find the tokens in our documents that match these strings but are not marked as entities.

One complication we will have to resolve is that some of these tokens represent entities that have not been marked at all, while others should be incorporated into existing entities.

We will have to work with the attributes of tokens to find them and make these changes.
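The distinction between the two cases can be sketched without spaCy. In the toy example below, each token is a (text, IOB tag) pair, where "O" means the token is outside any entity; the token list, the names set and the helper classify_untagged are all hypothetical, made up for illustration. An untagged name token that touches a tagged neighbour is a candidate for incorporation into that entity, while one with no tagged neighbour may be an entity that was missed entirely. A real version would presumably also check that the neighbouring entity is labelled PER.

```python
# Toy stand-in for one document: (text, IOB tag) pairs, "O" = outside any entity
tokens = [
    ("don", "O"), ("juan", "B"), ("de", "I"), ("vargas", "O"),
    ("vecino", "O"), ("de", "O"), ("sevilla", "O"),
    ("y", "O"), ("catalina", "O"),
]
names = {"juan", "vargas", "catalina"}  # hypothetical name list


def classify_untagged(tokens, names):
    """Split untagged name-list tokens by whether they touch a tagged token."""
    adjacent, standalone = [], []
    for i, (text, iob) in enumerate(tokens):
        if iob != "O" or text not in names:
            continue  # already tagged, or not a known name
        left_tagged = i > 0 and tokens[i - 1][1] != "O"
        right_tagged = i + 1 < len(tokens) and tokens[i + 1][1] != "O"
        if left_tagged or right_tagged:
            adjacent.append((i, text))    # candidate to join an existing entity
        else:
            standalone.append((i, text))  # possibly an unmarked entity
    return adjacent, standalone


adjacent, standalone = classify_untagged(tokens, names)
print(adjacent)
print(standalone)
```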

In [56]:
with open('DBE_Names/firstnames.txt') as f:
    fnames = f.readlines()
with open('DBE_Names/lastnames.txt') as f:
    lnames = f.readlines()

We can also bring in the tokens we accepted from Kinkead's uppercased names:

In [57]:
knames = pd.read_csv("Files_Cleaning2/acceptednametokens2_strat1.csv")

In [58]:
knames = knames['token'].str.lower().tolist()

In [59]:
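names = []

# Strip the trailing newline from each name read from the two text files
for line in fnames:
    names.append(line.rstrip('\n'))

for line in lnames:
    names.append(line.rstrip('\n'))

# knames is already a list of lowercased tokens
names.extend(knames)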

In [60]:
names = set(names)

In [62]:
names

{'benegas',
 'ascencio',
 'fausto',
 'lastanosa',
 'valeto',
 'german',
 'rene',
 'lleonart',
 'brum',
 'llop',
 'rourera',
 'aichelberg',
 'bartomeu',
 'diez',
 'espindola',
 'cataldino',
 'commeg',
 'fierlant',
 'lorena',
 'cregenzan',
 'jeronimo',
 'sinues',
 'oliet',
 'bores',
 'barriales',
 'betancourt',
 'al-titwan',
 'texeda',
 'iturrate',
 'araz',
 'coca',
 'peñas',
 'coaza',
 'ponzoni',
 'bonilla',
 'mac',
 'atienza',
 'tacon',
 'santotis',
 'eleno',
 'alamo',
 'salavarrieta',
 'asturiano',
 'colsa',
 'exarch',
 'tristany',
 'arteagaalfaro',
 'villota',
 'zambiza',
 'boteller',
 'chacon',
 'meseron',
 'eztenaga',
 'leoncio',
 'croy-solre',
 'arago',
 'vernaccini',
 'negrete',
 'rovere',
 'salvanes',
 'simancas',
 'carbajal',
 'granollachs',
 'urraca',
 'gerra',
 'solares',
 'franci',
 'rexe',
 'botelho',
 'urdaniz',
 'melilla',
 'urruchi',
 'garnica',
 'gomendio',
 'borges',
 'corta',
 'lay',
 'salablanca',
 'montpalau',
 'habaqui',
 'villagome',
 'reinoso',
 'unzurrunzaga',
 

In [61]:
len(names)

10246

In [63]:
tokensintags = pd.read_csv('PersonNameTokenList.csv')

In [66]:
tokensintags

Unnamed: 0,-
0,--
1,--Nino
2,--de
3,-basco
4,-gaspar
...,...
6686,çuleta
6687,çumaraga
6688,çumarraga
6689,çurbaran


In [67]:
# The token column is labelled '-' in the CSV header shown above
tokensintags = tokensintags['-'].tolist()

In [68]:
in_namelist = []

for line in tokensintags:
    if line in names:
        in_namelist.append(line)

In [70]:
len(in_namelist)

1153
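The token list shown above includes entries with leading hyphens and capital letters (for example '--Nino' and '-basco'), which cannot match a lowercase name set exactly. A minimal sketch of normalizing tokens before the lookup follows; the names set here is a hypothetical stand-in, the normalize helper is not part of the original notebook, and whether stripping hyphens is appropriate for this corpus is an assumption:

```python
names = {"nino", "basco", "gaspar"}  # hypothetical lowercase name set
tokens_in_tags = ["--", "--Nino", "--de", "-basco", "-gaspar"]


def normalize(token):
    """Drop leading hyphens and lowercase the token (assumed cleanup rule)."""
    return token.lstrip("-").lower()


# Exact matching, as in the cell above
exact = [t for t in tokens_in_tags if t in names]

# Matching on the normalized form instead
normalized = [t for t in tokens_in_tags if normalize(t) in names]

print(exact)
print(normalized)
```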

In [39]:
untagged = []
for doc, context in docs:
    for token in doc:
        # ent_iob == 2 means the token is outside any entity ("O")
        if token.ent_iob == 2 and token.text in names:
            untagged.append(token.text)

In [43]:
len(untagged)

85465

In [42]:
print(set(untagged))

{'benegas', 'leonor', 'bello', 'muros', 'picado', 'pablo', 'cayo', 'gil', 'toscano', 'marino', 'rincon', 'lindo', 'espina', 'rueda', 'diez', 'sastre', 'texada', 'riesco', 'valor', 'michel', 'gusta', 'domingo', 'arcangel', 'piedras', 'armada', 'patricio', 'cerro', 'peñaranda', 'aguila', 'leal', 'vara', 'seva', 'ponce', 'ovejas', 'ybarra', 'baptista', 'romana', 'margarita', 'verdes', 'carranza', 'girona', 'caballeria', 'sola', 'enjuta', 'texedor', 'cean', 'castro', 'azuaga', 'losa', 'texeda', 'orta', 'rasa', 'puebla', 'peñas', 'tablada', 'rozas', 'balduque', 'guarda', 'cano', 'falconete', 'doblas', 'pacheco', 'higuera', 'notario', 'parado', 'escaño', 'buenaventura', 'quintero', 'alamo', 'leyba', 'gozar', 'vergara', 'contreras', 'ambrosio', 'castaño', 'nuñez', 'moreno', 'dueñas', 'tristan', 'luna', 'casillas', 'hospital', 'carnicero', 'mañara', 'anton', 'chacon', 'criado', 'hierro', 'fusta', 'rodrigo', 'romano', 'du', 'linage', 'aquino', 'monasterio', 'catalina', 'vargas', 'marfil', 'arte

In [41]:
len(set(untagged))

1096
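Several of the distinct strings above are also ordinary Spanish words ('hospital', 'notario', 'criado'), so frequency may help decide which matches to review first. One possible way to rank them is with a Counter; the list below is a hypothetical stand-in for the real 'untagged' list:

```python
from collections import Counter

# Hypothetical stand-in for the `untagged` list built above
untagged_sample = ["hospital", "vargas", "hospital", "domingo", "hospital", "vargas"]

# Count how often each untagged name-list string occurs
counts = Counter(untagged_sample)

# Show the most frequent strings first
for text, n in counts.most_common(2):
    print(text, n)
```

Very frequent strings are more likely to be common words than missed names, though that is a heuristic to check by hand rather than a rule.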