# Incorporating Data Cleaning 1 and 2 changes into spaCy

Before proceeding with further data cleaning that depends on untagged tokens in spaCy's document objects, we will apply our changes to the pickled spaCy documents.

spaCy outputs a doc object with several attributes, including tokens whose own attributes mark where they sit within an entity. In other words, what we need to change are these attributes in the tokens of our documents (which we already imported above).

Potentially useful token attributes include:

|Attribute|Datatype|Function|
|--|--|--|
|`ent_type`|int|Named entity type.|
|`ent_type_`|unicode|Named entity type.|
|`ent_iob`|int|IOB code of named entity tag. 3 means the token begins an entity, 2 means it is outside an entity, 1 means it is inside an entity, and 0 means no entity tag is set.|
|`ent_iob_`|unicode|IOB code of named entity tag. “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set.|
|`ent_kb_id` (v2.2)|int|Knowledge base ID that refers to the named entity this token is a part of, if any.|
|`ent_kb_id_` (v2.2)|unicode|Knowledge base ID that refers to the named entity this token is a part of, if any.|
|`ent_id`|int|ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution.|
|`ent_id_`|unicode|ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution.|
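
To make the integer codes concrete, here is a minimal plain-Python sketch (not spaCy's own implementation) that derives an `ent_iob`-style code for each token from a list of token-level entity spans. The function name `iob_codes` and the toy spans are illustrative assumptions.

```python
def iob_codes(n_tokens, spans):
    """Return one IOB code per token: 3 = begins, 1 = inside, 2 = outside.

    `spans` holds (start, end) token indices with an exclusive end.
    """
    codes = [2] * n_tokens          # every token starts as "outside"
    for start, end in spans:
        codes[start] = 3            # first token of the entity
        for i in range(start + 1, end):
            codes[i] = 1            # remaining tokens of the entity
    return codes

# Toy example: 6 tokens, one two-token entity and one single-token entity
codes = iob_codes(6, [(1, 3), (4, 5)])
print(codes)  # [2, 3, 1, 2, 3, 2]
```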

From the spaCy documentation:

> To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can’t write directly to the token.ent_iob or token.ent_type attributes, so the easiest way to set entities is to assign to the doc.ents attribute and create the new entity as a Span.

Sample code:
```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "fb" as an entity :(

fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 2, 'ORG')] 🎉
```
> Keep in mind that you need to create a Span with the start and end index of the token, not the start and end index of the entity in the document. In this case, “fb” is token (0, 1) – but at the document level, the entity will have the start and end indices (0, 2).

Source: https://spacy.io/usage/linguistic-features#named-entities
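
The difference between token indices and character offsets in the quote can be illustrated without spaCy. The sketch below assumes tokens are separated by single spaces, which is a simplification of spaCy's real tokenizer; `token_span_to_chars` is a hypothetical helper.

```python
text = "fb is hiring a new vice president of global policy"
tokens = text.split(" ")  # simplification: spaCy's tokenizer is more elaborate

def token_span_to_chars(tokens, start, end):
    """Convert a token span (exclusive end) to character offsets,
    assuming tokens are separated by single spaces."""
    start_char = sum(len(t) + 1 for t in tokens[:start])
    end_char = start_char + len(" ".join(tokens[start:end]))
    return start_char, end_char

print(token_span_to_chars(tokens, 0, 1))  # (0, 2), i.e. "fb"
s, e = token_span_to_chars(tokens, 5, 7)
print(text[s:e])                          # vice president
```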

So our plan is to import our processed, pickled documents, drop all the PER tags, and then re-add the ones that weren't rejected, with their updated spans and categories.

In [1]:
# importing pandas and our previous work in a new session
import pandas as pd
from IPython.display import clear_output
import spacy
from spacy import displacy
import pickle

In [2]:
# Load the pickled list of (doc, context) tuples; the context manager closes the file
with open("../Text Mining (NER)/Trained_EMS2_NER_data.p", 'rb') as file:
    docs = pickle.load(file)

In [3]:
people = pd.read_csv('Files_Cleaning2/PER_tags_clean2_manual2.csv')

Unfortunately, our people dataframe has no entity IDs, although it does record each entity's document ID and its position within the document. We also preserve the original spans, which can help us identify tokens.

In [4]:
people.groupby('Remove').count()

|Remove|Unnamed: 0|Unnamed: 0.1|Unnamed: 0.1.1|Unnamed: 0.1.1.1|Unnamed: 0.1.1.1.1|Unnamed: 0.1.1.1.1.1|Unnamed: 0.1.1.1.1.1.1|Unnamed: 0.1.1.1.1.1.1.1|docid|string|label|start|end|matched_str|edit_long|newstart|newend|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|173|449|
|1|639|639|639|639|639|639|639|639|639|636|78|639|639|639|0|0|0|
|D|6|6|6|6|6|6|6|6|6|6|6|6|6|6|0|0|0|
|L|44|44|44|44|44|44|44|44|44|44|44|44|44|44|0|0|0|
|M|6|6|6|6|6|6|6|6|6|6|6|6|6|6|0|0|0|
|O|75|75|75|75|75|75|75|75|75|75|75|75|75|75|0|0|0|


In [5]:
len(people)

33120

The first thing we can do is drop all the entities marked as '1' in the `Remove` column of our people dataframe. The number of remaining rows should be 32,481 (33,120 minus the 639 rejected rows).
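
The expected row count can be checked by hand from the groupby output above. A short sketch of that arithmetic, with the counts copied from the `docid` column of that output:

```python
# Row counts per 'Remove' value, copied from the groupby output above
counts = {"0": 32350, "1": 639, "D": 6, "L": 44, "M": 6, "O": 75}

total = sum(counts.values())
remaining = total - counts["1"]   # drop the rows marked '1'

print(total)      # 33120
print(remaining)  # 32481
```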

In [6]:
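# Keep every row not marked '1'; .copy() avoids pandas' SettingWithCopyWarning on later assignments
people = people[people['Remove'] != '1'].copy()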

In [7]:
len(people)

32481

Recall that docs are stored as tuples holding the doc itself and a context dictionary with its ID.

In [8]:
docs[0]

(...martin de gaynça maestro mayor de las obras de canteria de la santa iglesia de Sevilla y juan sanchez de caliz maestro mayor de las obras (roto) desta dicha ciudad de sevilla... otorgamos... que damos todo nuestro poder... a rodrigo de cordova vecino de la ciudad de gibraltar... para... que por nos... pueda fazer qualquier yguala e convenencia... con luis de toro en nombre de su majestad para fazer un muro de abajo que se a de fazer en la ciudad de gibraltar desde san francisco fasta encima del cuchillo de una sierra que a de ser el dicho muro de grueso de nueve pies desde el dicho monasterio de san francisco fasta la cueva de san cristoval e desde la dicha cueva... fasta el cuchillo de la sierra a de ser de cuatro pies de grueso... e otrosi nos pueda obligar que faremos otras qualesquier obras de canteria e manposteria que fueren menester de se faser en la dicha ciudad de Gibraltar,
 {'id': 1})
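
Because each item is a `(doc, context)` tuple, a dictionary keyed by `context['id']` gives direct access to any document. A minimal sketch, using plain strings as stand-ins for spaCy doc objects; `docs_demo` and `docs_by_id` are hypothetical names.

```python
# Stand-in data: plain strings take the place of spaCy doc objects
docs_demo = [
    ("martin de gaynça maestro mayor", {"id": 1}),
    ("monasterio de san pablo", {"id": 2}),
]

# Map each document ID to its doc for constant-time lookup
docs_by_id = {context["id"]: doc for doc, context in docs_demo}

print(docs_by_id[2])  # monasterio de san pablo
```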

Creating a temporary list of entities in our file except for the PER entities:

In [10]:
tempents = []

for doc, context in docs:
    # Rebuild every non-PER entity as a fresh Span on the same doc
    kept = [
        spacy.tokens.Span(doc, ent.start, ent.end, label=ent.label)
        for ent in doc.ents
        if ent.label_ != 'PER'
    ]
    # Pair the document ID with its filtered entity list
    tempents.append([context['id'], kept])

In [11]:
print(tempents)

[[1, [santa iglesia, Sevilla, sevilla, gibraltar, gibraltar, san francisco, monasterio de san francisco, cueva de san cristoval, Gibraltar]], [2, [monasterio de san pablo, Sevilla, puerto de santa maria, fin del mes de marzo de 1543, 400 ducados]], [3, [monesterio e, convento de señor san pablo, Sevilla, 40000 maravedis]], [4, [Sevilla, Santa Iglesia, Sevilla, provincia del Rio de la Plata]], [5, [Llanos, 10.150 maravedis]], [6, [Moron, 75.000 maravedis]], [7, [Santa Iglesia, Sevilla]], [8, [Jerez de la Frontera, Puerto de Santa Maria]], [9, [Alcazares reales, Sevilla, 40 ducados de oro, Alcazar]], [10, [95.962 maravedis]], [11, []], [12, []], [13, [Santa Iglesia, Sevilla, 150.000 maravedis, Holanda]], [14, [Hospital de los Caballeros]], [15, [Moron]], [16, [Moron]], [17, [Nueva España]], [18, [Marchena]], [19, [Santa Iglesia, Sevilla, Sevilla, collacion de Santa Maria, Capilla del Sagrario, Catedral, Sevilla, Medina-Sidonia, seis ducados de oro, Aracena, un ducado de oro]], [20, [1.50

In [12]:
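# Index the filtered entity lists by document ID, then give each doc its own list
ents_by_id = {doc_id: ents for doc_id, ents in tempents}
for doc, context in docs:
    doc.ents = ents_by_id[context['id']]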

Now we should have replaced all entities in the documents with the filtered tags. If we display a document with displacy, this should be reflected:

In [13]:
displacy.render(docs[0][0],style='ent', jupyter=True)

The change should also show up in each token's `ent_iob` attribute, where 2 means the token is outside any entity. Personal names, such as the one that opens the document, should now carry a 2.

In [14]:
for token in docs[0][0]:
    print(token.ent_iob,end=" ")

2 2 2 2 2 2 2 2 2 2 2 2 2 3 1 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 1 1 1 2 2 3 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 
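
A sequence of codes like this can be turned back into token ranges to see which stretches are still tagged. The following is a plain-Python sketch rather than a spaCy feature; `iob_to_spans` is a hypothetical helper, and the sample input is the first 17 codes of the output above.

```python
def iob_to_spans(codes):
    """Turn a list of ent_iob codes (3 = B, 1 = I, 2 = O, 0 = unset)
    into (start, end) token ranges with an exclusive end."""
    spans = []
    start = None
    for i, code in enumerate(codes):
        if code == 3:                           # a new entity begins here
            if start is not None:
                spans.append((start, i))
            start = i
        elif code != 1 and start is not None:   # entity ended on the previous token
            spans.append((start, i))
            start = None
    if start is not None:                       # entity runs to the end of the doc
        spans.append((start, len(codes)))
    return spans

# First 17 codes of the output above
codes = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3]
print(iob_to_spans(codes))  # [(13, 15), (16, 17)]
```

The opening 13 tokens, which include the first personal name, no longer belong to any entity, while the two remaining ranges line up with the first non-PER entities kept in the document.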

Now, let's create a list in the same format for our person entities: a list of lists, where each item holds a document number and a list of Span objects built like this:

spacy.tokens.Span(doc, ent.start,ent.end,label = ent.label)

To make this process easier, we first transfer the information held in the `Remove` column into the `label` column. We also copy the `start` and `end` values into any empty `newstart` and `newend` cells, so we can use the latter as the boundaries of our spans. Since a Span's end index is exclusive, the last token of each span sits at `newend - 1`.
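
The fallback logic and the exclusive end index can be sketched in plain Python before applying them with pandas. The rows below are hypothetical stand-ins (their numbers are borrowed from the dataframe), and the placeholder tokens `t0..t24` replace real document tokens.

```python
# Hypothetical rows: newstart/newend are None when no manual edit was made
rows = [
    {"start": 6, "end": 8, "newstart": None, "newend": None},
    {"start": 17, "end": 21, "newstart": 17, "newend": 19},
]

for row in rows:
    # Fall back to the original boundaries when no edited value exists
    if row["newstart"] is None:
        row["newstart"] = row["start"]
    if row["newend"] is None:
        row["newend"] = row["end"]

tokens = ["t%d" % i for i in range(25)]   # placeholder tokens t0..t24
first, second = rows
print(tokens[first["newstart"]:first["newend"]])    # ['t6', 't7']
print(tokens[second["newstart"]:second["newend"]])  # ['t17', 't18']
```

In both cases the last token selected is the one at index `newend - 1`, which is the same convention a Span uses.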

In [15]:
people.groupby('edit_long').count()

|edit_long|Unnamed: 0|Unnamed: 0.1|Unnamed: 0.1.1|Unnamed: 0.1.1.1|Unnamed: 0.1.1.1.1|Unnamed: 0.1.1.1.1.1|Unnamed: 0.1.1.1.1.1.1|Unnamed: 0.1.1.1.1.1.1.1|docid|string|label|start|end|Remove|matched_str|newstart|newend|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|31582|31582|31582|31582|31582|31582|31582|31582|31582|31582|31582|31582|31582|31582|31582|0|0|
|1|610|610|610|610|610|610|610|610|610|610|610|610|610|610|610|173|449|
|o|158|158|158|158|158|158|158|158|158|158|158|158|158|158|158|0|0|


In [16]:
people.fillna('').groupby(['Remove','label']).count()

|Remove|label|Unnamed: 0|Unnamed: 0.1|Unnamed: 0.1.1|Unnamed: 0.1.1.1|Unnamed: 0.1.1.1.1|Unnamed: 0.1.1.1.1.1|Unnamed: 0.1.1.1.1.1.1|Unnamed: 0.1.1.1.1.1.1.1|docid|string|start|end|matched_str|edit_long|newstart|newend|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|PER|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|
|D|DATE|6|6|6|6|6|6|6|6|6|6|6|6|6|6|6|6|
|L|LOC|39|39|39|39|39|39|39|39|39|39|39|39|39|39|39|39|
|L|PER|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|
|M|MON|6|6|6|6|6|6|6|6|6|6|6|6|6|6|6|6|
|O|ORG|66|66|66|66|66|66|66|66|66|66|66|66|66|66|66|66|
|O|PER|9|9|9|9|9|9|9|9|9|9|9|9|9|9|9|9|


In [17]:
people.loc[(people['Remove']=='L')&(people['label']=='PER'),'label']='LOC'

In [18]:
people.loc[(people['Remove']=='O')&(people['label']=='PER'),'label']='ORG'
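
The two relabelling steps follow one rule: a `Remove` code of L or O overrides a PER label. A plain-Python sketch of that rule, with hypothetical `(Remove, label)` pairs standing in for dataframe rows:

```python
# Hypothetical (Remove, label) pairs standing in for dataframe rows
rows = [("0", "PER"), ("L", "PER"), ("L", "LOC"), ("O", "PER"), ("D", "DATE")]

# Remove codes that imply a different label than PER
relabel = {"L": "LOC", "O": "ORG"}

fixed = [
    (remove, relabel[remove] if label == "PER" and remove in relabel else label)
    for remove, label in rows
]
print(fixed)
# [('0', 'PER'), ('L', 'LOC'), ('L', 'LOC'), ('O', 'ORG'), ('D', 'DATE')]
```

Applied to the counts above, the 5 L/PER rows join the 39 L/LOC rows (44 in total) and the 9 O/PER rows join the 66 O/ORG rows (75 in total).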

In [19]:
people.loc[people['newstart'].isnull(),'newstart'] = people['start']

In [20]:
people['newstart']

1          6.0
2         34.0
3         11.0
4         29.0
5         14.0
         ...  
33115     19.0
33116     28.0
33117    238.0
33118     90.0
33119    649.0
Name: newstart, Length: 32481, dtype: float64

In [21]:
people.loc[people['newend'].isnull(),'newend'] = people['end']

In [22]:
people['newstart']=people['newstart'].astype(int)
people['newend']=people['newend'].astype(int)

In [23]:
people['newstart']

1          6
2         34
3         11
4         29
5         14
        ... 
33115     19
33116     28
33117    238
33118     90
33119    649
Name: newstart, Length: 32481, dtype: int64

In [24]:
people.fillna('').groupby(['Remove','label']).count()

|Remove|label|Unnamed: 0|Unnamed: 0.1|Unnamed: 0.1.1|Unnamed: 0.1.1.1|Unnamed: 0.1.1.1.1|Unnamed: 0.1.1.1.1.1|Unnamed: 0.1.1.1.1.1.1|Unnamed: 0.1.1.1.1.1.1.1|docid|string|start|end|matched_str|edit_long|newstart|newend|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|PER|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|32350|
|D|DATE|6|6|6|6|6|6|6|6|6|6|6|6|6|6|6|6|
|L|LOC|44|44|44|44|44|44|44|44|44|44|44|44|44|44|44|44|
|M|MON|6|6|6|6|6|6|6|6|6|6|6|6|6|6|6|6|
|O|ORG|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|


In [25]:
people[people['edit_long']=='1']

| |Unnamed: 0|Unnamed: 0.1|Unnamed: 0.1.1|Unnamed: 0.1.1.1|Unnamed: 0.1.1.1.1|Unnamed: 0.1.1.1.1.1|Unnamed: 0.1.1.1.1.1.1|Unnamed: 0.1.1.1.1.1.1.1|docid|string|label|start|end|Remove|matched_str|edit_long|newstart|newend|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|814|814|814|814|814|814|814|21404|21404|2329|Alarcon Escribano|PER|299|301|0|1|1|299|300|
|815|815|815|815|815|815|815|21356|21356|2327|Alarcon Escribano Publico|PER|2301|2304|0|1|1|2301|2302|
|816|816|816|816|816|816|816|21254|21254|2325|Alarcon Escribano Publico|PER|253|256|0|1|1|253|254|
|817|817|817|817|817|817|817|21434|21434|2330|Alarcon Escribano Publico|PER|448|451|0|1|1|448|449|
|898|898|898|898|898|898|898|19755|19755|2186|Alo alarcon escriuo|PER|1584|1588|0|0|1|1584|1587|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|32942|32942|32942|32942|32942|32942|32942|27836|27836|3007|xpoval nuñes frayle profefo|PER|17|21|0|0|1|17|19|
|32982|32982|32982|32982|32982|32982|32982|41245|41245|4427|ynes carrillo buestra|PER|130|133|0|0|1|130|132|
|33061|33061|33061|33061|33061|33061|33061|19414|19414|2148|ysabel india horra|PER|41|44|0|0|1|41|42|
|33070|33070|33070|33070|33070|33070|33070|8106|8106|1065|ysabel rrodriguez hermoso vezina|PER|2|6|0|0|1|2|5|


Now we can use the information in our docid, label, newstart and newend columns to build our spans. 

In [26]:
people2 = people[['docid','label','newstart','newend']].values.tolist()

In [27]:
people2

[[4978, 'PER', 6, 8],
 [4961, 'PER', 34, 38],
 [4952, 'PER', 11, 13],
 [7342, 'PER', 29, 31],
 [4803, 'PER', 14, 16],
 [5849, 'PER', 32, 34],
 [5588, 'PER', 23, 25],
 [5810, 'PER', 1, 5],
 [8589, 'PER', 177, 182],
 [7839, 'PER', 1, 4],
 [5229, 'PER', 14, 18],
 [5402, 'PER', 5, 7],
 [3748, 'PER', 0, 2],
 [3739, 'PER', 1, 3],
 [3465, 'PER', 0, 3],
 [3855, 'PER', 0, 3],
 [3623, 'PER', 0, 2],
 [2622, 'PER', 0, 3],
 [1223, 'PER', 372, 376],
 [124, 'PER', 7, 9],
 [8056, 'PER', 7, 9],
 [8579, 'PER', 101, 104],
 [4999, 'PER', 16, 19],
 [5000, 'PER', 49, 52],
 [4997, 'PER', 26, 29],
 [4997, 'PER', 83, 86],
 [5666, 'PER', 9, 11],
 [8063, 'PER', 0, 2],
 [6170, 'PER', 25, 27],
 [5665, 'PER', 24, 26],
 [5664, 'PER', 97, 99],
 [5663, 'PER', 13, 15],
 [5884, 'PER', 7, 9],
 [5883, 'PER', 15, 17],
 [7547, 'PER', 5, 7],
 [7548, 'PER', 30, 32],
 [7544, 'PER', 0, 3],
 [7545, 'PER', 34, 37],
 [7546, 'PER', 0, 3],
 [6127, 'PER', 19, 21],
 [5447, 'PER', 82, 84],
 [8379, 'PER', 82, 84],
 [5500, 'PER', 1083, 1

In [28]:
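# Group the cleaned rows by document ID so each doc is visited only once
spans_by_doc = {}
for docid, label, start, end in people2:
    spans_by_doc.setdefault(docid, []).append((start, end, label))

for doc, context in docs:
    rows = spans_by_doc.get(context['id'], [])
    new_ents = [spacy.tokens.Span(doc, s, e, label=lab) for s, e, lab in rows]
    doc.ents = list(doc.ents) + new_ents  # spaCy rejects overlapping spans here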
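
As far as I know, spaCy does not accept overlapping entity spans in `doc.ents`, so it can be worth checking the combined spans before assigning them. A minimal plain-Python sketch, assuming spans are `(start, end, label)` tuples with an exclusive end; `find_overlaps` is a hypothetical helper, not part of spaCy.

```python
def find_overlaps(spans):
    """Return pairs of (start, end, label) spans that share at least one token."""
    ordered = sorted(spans)
    clashes = []
    for i, prev in enumerate(ordered):
        for cur in ordered[i + 1:]:
            if cur[0] >= prev[1]:   # sorted by start: no later span can overlap prev
                break
            clashes.append((prev, cur))
    return clashes

spans = [(1, 4, "PER"), (13, 15, "LOC"), (3, 5, "PER"), (16, 17, "LOC")]
print(find_overlaps(spans))  # [((1, 4, 'PER'), (3, 5, 'PER'))]
```

An empty result means the spans can be combined safely; any reported pair would need to be resolved in the dataframe first.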

In [29]:
docs[0][0].ents

(martin de gaynça,
 santa iglesia,
 Sevilla,
 juan sanchez de caliz,
 sevilla,
 rodrigo de cordova,
 gibraltar,
 luis de toro,
 gibraltar,
 san francisco,
 monasterio de san francisco,
 cueva de san cristoval,
 Gibraltar)

In [33]:
displacy.render(docs[0][0],style='ent', jupyter=True)

And that's how it's done! Let's export both our people dataframe and our updated pickled docs to work with later. We have to remember to keep the pickled docs listed in `.gitignore`.

In [34]:
people.to_csv("Files_Cleaning2/PER_tags_clean2_manual22.csv")

In [35]:
# Write the updated docs; the context manager flushes and closes the file
with open("NER_data_Cleaned22.p", "wb") as file:
    pickle.dump(docs, file)
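
As a sanity check on the export step, a pickle round trip can be tested on a small stand-in object before trusting it with the full corpus. This sketch uses only the standard library; the file name `example.p` and the `data` list are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for the list of (doc, context) tuples
data = [("doc text", {"id": 1})]

path = os.path.join(tempfile.mkdtemp(), "example.p")  # hypothetical file name

# The context manager closes the file even if pickling fails
with open(path, "wb") as f:
    pickle.dump(data, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == data)  # True
```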