## Preparing a Text File with Tagged Entities for DataTurks.com - Random Selection

In this notebook, we run spaCy on our texts to produce an initial list of entities that can later be revised on Dataturks. We write the tagged texts to a text file in the format the site requires for upload.


In [45]:
# Import modules used in this notebook
from spacy.lang.es import Spanish 
from spacy import displacy
import spacy
from spacy.tokens import Doc

from collections import defaultdict, Counter
from spacy.attrs import ORTH
import pandas as pd
import json
import numpy as np
import random

In [4]:
# Load the medium Spanish model
nlp = spacy.load('es_core_news_ml_EMS')

In [109]:
# Read the file we are tagging
filename=input("Input filename:")
df = pd.read_csv(filename, delimiter =",")
print(df)
  

Input filename:20200211AllDocsDB.csv
        id                                           document
0        1  ...martin de gaynça maestro mayor de las obras...
1        2  ...martin de gaynça... soy convenido... con lo...
2        3  ...martin de gaynça... soy convenido... con el...
3        4  Francisco Buelta, cantero, vecino de Sevilla, ...
4        5  Martin de Gainza y Juan de Escalona, vecinos d...
5        6  Martin de Gainza, se obligo a pagar a Juan Per...
6        7  Rodrigo Alonso, calero, se obligo a pagar a Ma...
7        8  Fernan Rodriguez, cantero, vecino de Jerez de ...
8        9  Ochoa de Isasi, maestro marmolero, y Martin de...
9       10  Martin de Gainza y Juan de Gainza, cantero, ot...
10      11  Martin de Gainza, otorgo poder a Juanes de Veo...
11      12              Martin de Gainza. Escritura de deudo.
12      13  Martin de Gainza, obrero mayor de la canteria ...
13      14  Martin de Gainza, vizcaino, se constituyo en f...
14      15  Pedro Montañes, cante

In [110]:
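# Four rows are blank because they hold an author's comments that do not refer to a particular document.
# We convert the empty strings to NaN and drop those rows so the text analysis works.
# Assigning the result back avoids relying on an in-place replace on a column selection,
# which newer pandas versions may not propagate to df.
df['document'] = df['document'].replace('', np.nan)
df.dropna(subset=['document'], inplace=True)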

We take a random subsample of our data to use as training and testing data.

In [112]:
#Take a random sample of size k of the file

k=500
random_sample = df.sample(k)
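
The sample above changes on every run, and `df.sample(k)` raises an error if the frame has fewer than `k` rows. As a minimal sketch, assuming a repeatable subsample is wanted, the draw can be seeded and capped. The helper name `sample_documents`, the `seed` value, and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def sample_documents(df, k, seed=42):
    """Return up to k randomly chosen rows; a fixed seed makes the draw repeatable."""
    k = min(k, len(df))  # sampling without replacement cannot exceed the row count
    return df.sample(n=k, random_state=seed)

# Toy frame standing in for the real CSV (hypothetical data)
toy = pd.DataFrame({"id": range(1, 11),
                    "document": [f"texto {i}" for i in range(1, 11)]})

first = sample_documents(toy, k=4)
second = sample_documents(toy, k=4)    # same seed, so the same rows
capped = sample_documents(toy, k=500)  # k larger than the frame: returns every row
print(first["id"].tolist())
```

Fixing the seed is only useful if the same training and testing texts should be recoverable later; leaving it out keeps the behaviour of the cell above.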

In [113]:
#Run the model on the texts
docs=list(nlp.pipe(random_sample['document']))

The Dataturks upload format is: 

    {"content":"cd players and tuners","annotation":[{"label":["Category"],"points":[{"start":0,"end":1,"text":"cd"}]},{"label":["Category"],"points":[{"start":3,"end":9,"text":"players"}]},{"label":["Category"],"points":[{"start":15,"end":20,"text":"tuners"}]}],"extras":{"Name":"columnName","Class":"ColumnValue"}}

Each record is a dictionary with "content" and "annotation" key/value pairs.
The "annotation" value is a list of entities.
Each entity is a dictionary with "label" and "points" key/value pairs.
"label" holds a list containing the label name.
"points" holds a list containing one dictionary with "start", "end", and "text" key/value pairs.

Each text goes on its own line.

We need to output a JSON file of our documents with entity tags in this structure.
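
To make the nesting concrete, here is a small sketch that rebuilds the annotated part of the sample record with plain Python and the standard `json` module. The helper `make_record` is hypothetical. In the sample line, "cd" runs from 0 to 1, so this sketch treats `end` as the index of the last character; that reading comes from the sample itself, not from Dataturks documentation.

```python
import json

def make_record(content, spans):
    """Build one upload record from (start, end, label) triples.

    `end` is treated as inclusive, matching the sample line above.
    """
    annotation = []
    for start, end, label in spans:
        annotation.append({
            "label": [label],  # the label name wrapped in a list
            "points": [{       # a list holding one offsets dictionary
                "start": start,
                "end": end,
                "text": content[start:end + 1],
            }],
        })
    return {"content": content, "annotation": annotation}

record = make_record(
    "cd players and tuners",
    [(0, 1, "Category"), (3, 9, "Category"), (15, 20, "Category")],
)

# One record per line: json.dumps never emits a raw newline by default
line = json.dumps(record, ensure_ascii=False)
print(line)
```

The optional "extras" key from the sample is left out here because the rest of this notebook does not use it.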

First, we create a file containing all our texts as tagged by the model, in the required structure:

In [114]:
filename = input("Input filename for output:")
with open(filename, 'w', encoding='UTF-8') as file:
    for doc in docs:
        annotation = []
        for ent in doc.ents:
            point = {
                "start": ent.start_char,  # Character offset where the entity begins
                # Exclusive end offset (start char + entity length). The sample record above
                # uses inclusive ends ("cd" is 0 to 1), so check how Dataturks displays these spans.
                "end": ent.end_char,
                "text": ent.text,  # Text string attribute of entity
            }
            # The label and the points dictionary are each wrapped in a list
            annotation.append({"label": [ent.label_], "points": [point]})
        docdict = {"content": doc.text, "annotation": annotation}
        # One JSON record per line; ensure_ascii=False keeps accented characters readable
        json.dump(docdict, file, ensure_ascii=False)
        file.write("\n")

Input filename for output:20200211TrainingSample_EMS.json
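
Before uploading, it can help to confirm that every span's offsets actually select its text. The following is a sketch of such a check, assuming the exclusive end offsets written by the cell above. The function name `check_offsets`, the two in-memory records, and the "PER" label are illustrative; with a real file, pass the open file handle instead of the `StringIO` object.

```python
import io
import json

def check_offsets(lines):
    """Return (line_number, point) for every span whose text does not match its offsets."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for entity in record["annotation"]:
            for point in entity["points"]:
                # Slice with an exclusive end, as written by the cell above
                found = record["content"][point["start"]:point["end"]]
                if found != point["text"]:
                    problems.append((n, point))
    return problems

# Two hypothetical records: the first is consistent, the second is shifted by one character
good = {"content": "Martin de Gainza, cantero",
        "annotation": [{"label": ["PER"],
                        "points": [{"start": 0, "end": 16, "text": "Martin de Gainza"}]}]}
bad = {"content": "Martin de Gainza, cantero",
       "annotation": [{"label": ["PER"],
                       "points": [{"start": 1, "end": 17, "text": "Martin de Gainza"}]}]}

sample_file = io.StringIO(json.dumps(good) + "\n" + json.dumps(bad) + "\n")
problems = check_offsets(sample_file)
print(problems)
```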


This file can now be uploaded to Dataturks and its tags corrected. Once the texts are correctly tagged, they can be used to retrain the model and test our results. Other notebooks explain how to convert Dataturks output into usable training data for spaCy, and then how to train spaCy's model.