## Preparing a Text File with Tagged Entities for DataTurks.com - Random Selection

In this notebook, we run spaCy on our texts to produce an initial list of entities that can later be revised on Dataturks. We output these entities in the format the site requires for upload, which can then be saved as a text file.


In [45]:
# Import modules used in this notebook
from spacy.lang.es import Spanish 
from spacy import displacy
import spacy
from spacy.tokens import Doc

from collections import defaultdict, Counter
from spacy.attrs import ORTH
import pandas as pd
import json
import numpy as np
import random

In [4]:
# Load the trained medium Spanish model
nlp = spacy.load('es_core_news_ml_EMS')

In [6]:
# Read the file we are tagging
df = pd.read_csv('AllDocswithID.csv', delimiter =",")
print(df)

      iddoc                                           document
0         1  ...martin de gaynça maestro mayor de las obras...
1         2  ...martin de gaynça... soy convenido... con lo...
2         3  ...martin de gaynça... soy convenido... con el...
3         4   Francisco Buelta, cantero, vecino de Sevilla,...
4         5   Martin de Gainza y Juan de Escalona, vecinos ...
5         6   Martin de Gainza, se obligo a pagar a Juan Pe...
6         7   Rodrigo Alonso, calero, se obligo a pagar a M...
7         8   Fernan Rodriguez, cantero, vecino de Jerez de...
8         9   Ochoa de Isasi, maestro marmolero, y Martin d...
9        10   Martin de Gainza y Juan de Gainza, cantero, o...
10       11   Martin de Gainza, otorgo poder a Juanes de Ve...
11       12             Martin de Gainza. Escritura de deudo.]
12       13   Martin de Gainza, obrero mayor de la canteria...
13       14   Martin de Gainza, vizcaino, se constituyo en ...
14       15   Pedro Montañes, cantero, y Martin de Gain

In [11]:
# Four rows are blank because they hold an author's comments that do not refer to a particular document.
# We convert the empty strings to NaN and drop those rows so the text analysis can run.
df['document'] = df['document'].replace('', np.nan)
df.dropna(subset=['document'], inplace=True)

In [12]:
#Run the model on the texts
docs=list(nlp.pipe(df['document']))

The Dataturks upload format is: 

    {"content":"cd players and tuners","annotation":[{"label":["Category"],"points":[{"start":0,"end":1,"text":"cd"}]},{"label":["Category"],"points":[{"start":3,"end":9,"text":"players"}]},{"label":["Category"],"points":[{"start":15,"end":20,"text":"tuners"}]}],"extras":{"Name":"columnName","Class":"ColumnValue"}}

Each record is a dictionary with "content" and "annotation" keys. The "annotation" value is a list of entities, and each entity is a dictionary with "label" and "points" keys. "label" holds a list of label names, and "points" holds a list of dictionaries with "start", "end", and "text" keys. In the sample above, "start" and "end" are character offsets into "content", and "end" is inclusive ("cd" runs from 0 to 1).

Each text is on its own line.
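
As a minimal sketch of the structure just described, the helper below builds one record from plain character spans. The function name `to_dataturks_record` and the `(start_char, end_char, label)` span tuples are illustrative assumptions, not part of Dataturks or spaCy. It reproduces the three entities of the sample line.

```python
import json

def to_dataturks_record(text, spans):
    """Build one upload record from (start_char, end_char, label) spans.

    Assumes end_char is exclusive (Python slicing convention); the record
    stores an inclusive end, as in the sample upload line.
    """
    annotation = []
    for start_char, end_char, label in spans:
        annotation.append({
            "label": [label],                     # label is a list
            "points": [{                          # points is a list of dicts
                "start": start_char,
                "end": end_char - 1,              # inclusive end offset
                "text": text[start_char:end_char],
            }],
        })
    return {"content": text, "annotation": annotation}

record = to_dataturks_record(
    "cd players and tuners",
    [(0, 2, "Category"), (3, 10, "Category"), (15, 21, "Category")],
)
# One record per line, with non-ASCII characters left readable
line = json.dumps(record, ensure_ascii=False)
print(line)
```

For spaCy entities, the spans could be taken from `ent.start_char`, `ent.end_char`, and `ent.label_`.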

We need to output a JSON file of our documents with entity tags in this structure.

First, we write a file containing all our texts as tagged by the model, in the required structure:

In [72]:
filename = 'Alldocs_EMS20200209.json'
with open(filename, 'w', encoding='UTF-8') as jsonfile:
    for doc in docs:
        docdict = {"content": doc.text}
        annotation = []
        for ent in doc.ents:
            # ent.start and ent.end are token indices within the doc
            points = {"start": ent.start, "end": ent.end, "text": ent.text}
            annotation.append({"label": ent.label_, "points": points})
        docdict["annotation"] = annotation
        # Write one JSON object per line, keeping non-ASCII characters readable
        json.dump(docdict, jsonfile, ensure_ascii=False)
        jsonfile.write("\n")
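
spaCy entity spans expose both token indices (`ent.start`, `ent.end`) and character offsets (`ent.start_char`, `ent.end_char`), while the sample upload line uses inclusive character offsets. A quick check along the following lines can show whether a written line matches that convention. The helper name `misaligned_entities` and the two test lines are hypothetical; this is a sketch, not a Dataturks validator.

```python
import json

def misaligned_entities(line):
    """Return the points whose text differs from content[start:end + 1].

    Assumes inclusive character offsets, as in the sample upload line.
    Accepts "points" as either a list of dicts or a single dict.
    """
    record = json.loads(line)
    content = record["content"]
    problems = []
    for entity in record.get("annotation") or []:
        points = entity["points"]
        if isinstance(points, dict):
            points = [points]
        for point in points:
            if content[point["start"]:point["end"] + 1] != point["text"]:
                problems.append(point)
    return problems

good = '{"content":"cd players","annotation":[{"label":["Category"],"points":[{"start":0,"end":1,"text":"cd"}]}]}'
bad = '{"content":"cd players","annotation":[{"label":["Category"],"points":[{"start":1,"end":2,"text":"players"}]}]}'
print(misaligned_entities(good))  # offsets line up with the text
print(misaligned_entities(bad))   # token-style offsets do not
```

Lines that report mismatches may need their offsets converted before the annotations display correctly on upload.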

Then we draw a random sample of k lines from this file:

In [73]:
k = 500
filename = 'Alldocs_EMS20200209.json'
with open(filename, encoding='UTF-8') as file:
    lines = file.read().splitlines()

random_lines = random.sample(lines, k)
print("\n".join(random_lines)) # print random lines

{"content": "22 abril 1668. Este día un niño hijo de MATÍAS DE ARTEAGA, en calle Bayona.", "annotation": [{"label": "PER", "points": {"start": 10, "end": 13, "text": "MATÍAS DE ARTEAGA"}}, {"label": "LOC", "points": {"start": 15, "end": 17, "text": "calle Bayona"}}]}
{"content": " Sepan quantos esta carta vieren como yo blas martin silbestre pintor de ymagineria vezino desta ciudad de Sevilla en la collacion de la magdalena e yo lazaro perez castellanos escultor vesino desta ciudad de Sevilla en la dicha collacion de la madalena anbos a dos juntamente de mancomun... somos conbenidos y concertados con frai Juan de la cruz de la orden de san francisco comisario de los santos lugares de Jerusalen residente en el conbento grande de la dicha orden desta ciudad de sevilla en tal manera que nos obligamos de haser y que haremos y daremos ffecho y acabado en toda perfesion de aqui a en fin del mes de marso que biene deste año de mill y seiscientos y veinte y seis una ymagen de san agustin de m
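
`random.sample` raises a `ValueError` when k exceeds the number of lines, and its result changes on every run. The sketch below, with a hypothetical helper `sample_lines` and an arbitrary seed, samples by index so the draw is repeatable, never exceeds the file length, and keeps the chosen lines in their original order.

```python
import random

def sample_lines(lines, k, seed=None):
    """Pick up to k lines by index and return them in original file order."""
    rng = random.Random(seed)          # local generator; global state untouched
    k = min(k, len(lines))             # avoid ValueError on small files
    chosen = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in chosen]

demo = [f"line {i}" for i in range(10)]
picked = sample_lines(demo, 3, seed=42)
print(picked)
```

Sampling indices rather than line contents also means that two documents with identical text are treated as separate candidates.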

And save our results:

In [74]:
# Write the sampled lines in their original file order
with open('DocSample_EMS20200209', 'w', encoding='UTF-8') as output_file:
    output_file.writelines(line + "\n"
                           for line in lines if line in random_lines)

This file can now be uploaded into Dataturks, and its tags corrected. Once the texts are perfectly tagged, they can be used for retraining the model and testing our results.
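
For the retraining step, the corrected annotations eventually need to go back into a form spaCy can train on. The sketch below assumes that the corrected file keeps the same structure as the upload sample (inclusive character offsets, `label` and `points` as lists) and targets the spaCy v2-style `(text, {"entities": [(start, end, label)]})` tuples, where `end` is exclusive. The helper name `to_spacy_example` is hypothetical, and the actual export layout should be checked before relying on it.

```python
import json

def to_spacy_example(line):
    """Convert one annotated JSON line into a (text, annotations) tuple."""
    record = json.loads(line)
    entities = []
    # "annotation" may be missing or null for documents without tags
    for entity in record.get("annotation") or []:
        for label in entity["label"]:
            for point in entity["points"]:
                # Inclusive end offset becomes an exclusive one
                entities.append((point["start"], point["end"] + 1, label))
    return record["content"], {"entities": entities}

line = '{"content":"cd players","annotation":[{"label":["Category"],"points":[{"start":3,"end":9,"text":"players"}]}]}'
text, annotations = to_spacy_example(line)
print(text, annotations)
```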