# Saving NER output to file

Once we have trained the model, we want to save the NER output to a file so that we can clean the results and incorporate the data into our database. For each entity we want to keep its string value, its label, and an identifier for the text that contains it.

In [2]:
#Import modules
from __future__ import unicode_literals
import spacy
from spacy.lang.es import Spanish 
from spacy import displacy
import pandas as pd
import numpy as np
import json
# spaCy v2 import path; in spaCy v3 docs_to_json lives in spacy.training
from spacy.gold import docs_to_json

In [2]:
#Load the trained Spacy Model

nlp = spacy.load("es_core_news_ml_EMS")

In [3]:
Texts = pd.read_csv('KinkeadDocs.csv', delimiter =",")

In [4]:
Texts

,document,id
0,D. Juan García Blanco .. arriendo a MIGUEL DE ...,4631
1,"18 mayo 1692, Auto de Fe, celebrado en Santa A...",4632
2,"JACINTO DE AGUILAR, dorador, soltero, natural ...",4633
3,En Triana 26 julio 1663 .. Manuel Francisco de...,4634
4,Juan Rodríguez como padre de Lorenzo Rodríguez...,4635
5,"MANUEL DE AGUILAR, dorador de imaginería, arre...",4636
6,Restauración del monumento de la parroquia de ...,4637
7,En la ciudad de Sevilla en 18 junio 1656 ante ...,4638
8,Juan Paez .. 15 años .. aprendiz de dorador y ...,4639
9,CRISTÓBAL ALTA pintor de imaginería .. collaci...,4640


In [5]:
# Four rows are blank because they hold the author's comments rather than a particular document.
# Convert the empty strings to NaN and drop those rows so that the text analysis works.
Texts['document'] = Texts['document'].replace('', np.nan)
Texts.dropna(subset=['document'], inplace=True)

Text with metadata can be processed in spaCy by passing `as_tuples=True` to `nlp.pipe`. This preserves each text together with its context. First, we have to format our data as a list of tuples, with any metadata in a dictionary.
That is:
[(text, {context label: context value}), (...)]
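
The target structure can be sketched without spaCy or pandas. The snippet below is only an illustration: `to_tuples` is a hypothetical helper, the two sample texts are shortened from the table above, and the blank row (with its id) is made up to show how empty documents are skipped.

```python
def to_tuples(records):
    """Turn row dicts into (text, context) pairs, skipping blank documents."""
    pairs = []
    for record in records:
        text = record["document"]
        if not text or not text.strip():
            continue  # nothing to analyse in an empty row
        pairs.append((text, {"id": record["id"]}))
    return pairs


# Illustrative rows: two shortened texts from the table above and one blank row.
records = [
    {"document": "JACINTO DE AGUILAR, dorador, soltero, natural de Sevilla, 36 años", "id": 4633},
    {"document": "", "id": 9999},  # hypothetical blank row, not from the dataset
    {"document": "En Triana 26 julio 1663", "id": 4634},
]

pairs = to_tuples(records)
```

Each pair has the shape that `nlp.pipe(..., as_tuples=True)` takes as input; spaCy then yields `(doc, context)` pairs in the same order, so the `id` stays attached to its text.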

In [6]:
data = []

In [7]:
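for index, row in Texts.iterrows():
    # Select columns by name; positional access such as row[0] is deprecated in recent pandas
    context = {'id': row['id']}
    text = row['document']
    data.append((text, context))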

In [8]:
data

[('D. Juan García Blanco .. arriendo a MIGUEL DE AGUILA, maestro pintor, Santa María, calle de Tintores, una casa a la Ballestilla, un año, 6 ducados y medio, Francisco Melgarejo, fiador, maestro de hacer cajas de joyas, 22 junio 1683.',
  {'id': 4631}),
 ('18 mayo 1692, Auto de Fe, celebrado en Santa Ana, Triana, un reo fue MIGUEL DE ÁGUILA, pintor, pena de destierro de Sevilla y Madrid por 5 años por hechicero, adivinador y embustero. ',
  {'id': 4632}),
 ('JACINTO DE AGUILAR, dorador, soltero, natural de Sevilla, 36 años ',
  {'id': 4633}),
 ('En Triana 26 julio 1663 .. Manuel Francisco de edad de 14 años .. me pongo por aprendiz del oficio de pintor de loza de Talavera con JUAN DE AGUILAR, calle de los Esparteñas, 6 años.',
  {'id': 4634}),
 ('Juan Rodríguez como padre de Lorenzo Rodríguez de edad de 12 años .. le pongo aprender del oficio de pintor y de dorador con MANUEL DE AGUILAR, San Martín, 6 años, MANUEL DE AGUILAR no sabe firmar, 31 octubre 1647.',
  {'id': 4635}),
 ('MANUE

In [9]:
docs=list(nlp.pipe(data, as_tuples=True))

In [11]:
columns2=['docid','string','label','start','end']
Entities_df =pd.DataFrame(columns=columns2)

In [12]:
Entities_df

,docid,string,label,start,end


In [22]:
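# Collect one dict per entity, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0).
rows = []
for doc, context in docs:
    docid = context['id']
    for ent in doc.ents:
        # start and end are token offsets within the document
        rows.append({'docid': docid, 'string': ent.text, 'label': ent.label_, 'start': ent.start, 'end': ent.end})
Entities_df = pd.DataFrame(rows, columns=columns2)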

In [23]:
Entities_df

,docid,string,label,start,end
0,4631,Juan García Blanco,PER,1,4
1,4631,MIGUEL DE AGUILA,PER,7,10
2,4631,Santa María,PER,14,16
3,4631,calle de Tintores,LOC,17,20
4,4631,6 ducados y medio,MON,30,34
5,4631,Francisco Melgarejo,PER,35,37
6,4632,Santa Ana,LOC,10,12
7,4632,MIGUEL DE ÁGUILA,PER,18,21
8,4632,Sevilla,LOC,28,29
9,4632,Madrid,LOC,30,31


We have created a dataframe with each entity and its attributes, as well as the ID of the text it belongs to. We can export this as a CSV to later incorporate it into our database.

In [24]:
Entities_df.to_csv(path_or_buf="EntitiesEMSModel_Kinkead")
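
To check what an exported file of this shape looks like before loading it into a database, the same columns can be written and read back with Python's standard `csv` module. This is a minimal sketch and not the pandas call used above: `write_entities` is a hypothetical helper, the two sample rows are copied from the entity table above, and an in-memory buffer stands in for the file on disk.

```python
import csv
import io

FIELDS = ["docid", "string", "label", "start", "end"]


def write_entities(rows, handle):
    """Write entity rows (dicts keyed by FIELDS) as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Two sample rows copied from the entity table above.
rows = [
    {"docid": 4631, "string": "Juan García Blanco", "label": "PER", "start": 1, "end": 4},
    {"docid": 4631, "string": "6 ducados y medio", "label": "MON", "start": 30, "end": 34},
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
write_entities(rows, buffer)

# Read the CSV back to confirm that the columns survive the round trip.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
```

Values read back from CSV are strings, so `docid`, `start` and `end` need converting to integers again before they are compared or joined against numeric database keys.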