# spaCy dataset creation

This notebook takes train, validation, and test datasets (of type `List[InputSample]`)
and transforms them into two structures consumed by spaCy:

1. spaCy JSON (see https://spacy.io/api/annotation#json-input)
2. spaCy pickle files, structured as `[(full_text, {"entities": [(start, end, type), ...]}), ...]`
   (see https://spacy.io/api/annotation#json-input for more details)

The JSON format is used by spaCy's CLI trainer.
The pickle format is used for fine-tuning with the logic in [../models/spacy_retrain.py](../models/spacy_retrain.py).
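
As a rough illustration of the pickle structure, the sketch below hand-builds one sample as a `(full_text, {"entities": [...]})` tuple and round-trips it through `pickle`. The sentence, offsets, and labels are made up for illustration and are not taken from the datasets used in this notebook.

```python
import pickle

# One hand-written sample (illustrative only, not from the real datasets).
text = "Mathias Jespersen scored for Denmark"
sample = (
    text,
    {"entities": [(0, 17, "PERSON"), (29, 36, "GPE")]},  # (start, end, type)
)
dataset = [sample]

# Character offsets are end-exclusive, so slicing recovers the entity text.
for start, end, label in sample[1]["entities"]:
    print(label, "->", text[start:end])

# Round-trip through pickle, as any consumer of the dumped file would.
restored = pickle.loads(pickle.dumps(dataset, protocol=pickle.HIGHEST_PROTOCOL))
assert restored == dataset
```

Tuples survive the pickle round-trip unchanged, which is not true of JSON (tuples come back as lists).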

In [1]:
from presidio_evaluator.data_generator import read_synth_dataset
%reload_ext autoreload

In [2]:
synth_train = read_synth_dataset('../../data/synth_train.json')
conll_train = read_synth_dataset('../../data/conll_train.json')
ontonotes_train = read_synth_dataset('../../data/ontonotes_train.json')

In [3]:
train_samples = conll_train + ontonotes_train # + synth_train  

In [4]:
synth_val = read_synth_dataset('../../data/synth_val.json')
conll_val = read_synth_dataset('../../data/conll_val.json')
ontonotes_val = read_synth_dataset('../../data/ontonotes_val.json')

In [5]:
val_samples = conll_val + ontonotes_val

In [6]:
synth_test = read_synth_dataset('../../data/synth_test.json')
conll_test = read_synth_dataset('../../data/conll_test.json')
ontonotes_test = read_synth_dataset('../../data/ontonotes_test.json')

In [7]:
test_samples = conll_test + ontonotes_test # + synth_test 

In [9]:
print("Read train {} samples".format(len(train_samples)))
print("Read val {} samples".format(len(val_samples)))
print("Read test {} samples".format(len(test_samples)))

Read train 13026 samples
Read val 4049 samples
Read test 3989 samples


For training, keep only sentences with entities:

In [10]:
train_tagged = [sample for sample in train_samples if len(sample.spans)>0]
print("Kept {} samples after removal of non-tagged samples".format(len(train_tagged)))

Kept 6773 samples after removal of non-tagged samples


Inspect the entity tags present in the training set:

In [11]:
print("Entities found in training set:")
entities = []
for sample in train_tagged:
    entities.extend(sample.tags)
set(entities)

Entities found in training set:


{'B-LOCATION',
 'B-MALE_TITLE',
 'B-NATIONALITY',
 'B-NATION_MAN',
 'B-NATION_PLURAL',
 'B-ORGANIZATION',
 'B-PERSON',
 'I-LOCATION',
 'I-NATIONALITY',
 'I-NATION_PLURAL',
 'I-ORGANIZATION',
 'I-PERSON',
 'L-LOCATION',
 'L-MALE_TITLE',
 'L-NATIONALITY',
 'L-NATION_MAN',
 'L-NATION_PLURAL',
 'L-ORGANIZATION',
 'L-PERSON',
 'O',
 'U-FEMALE_TITLE',
 'U-LOCATION',
 'U-MALE_TITLE',
 'U-NATIONALITY',
 'U-NATION_MAN',
 'U-NATION_PLURAL',
 'U-ORGANIZATION',
 'U-PERSON'}
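
The tags above follow the BILUO scheme (`B-`, `I-`, `L-`, `U-` prefixes plus `O`). To see which entity types occur and how often, one option is to strip the prefix and count tokens per type. A minimal sketch, using a small hand-made tag list in place of `sample.tags`; the helper name `entity_type` is hypothetical:

```python
from collections import Counter

def entity_type(tag):
    """Strip the BILUO prefix: 'B-PERSON' -> 'PERSON', 'O' -> 'O'."""
    return tag.split("-", 1)[1] if "-" in tag else tag

# Stand-in for the tags of a few samples (illustrative values only).
tags_per_sample = [
    ["O", "U-NATIONALITY", "O", "B-PERSON", "L-PERSON"],
    ["U-LOCATION", "O", "B-ORGANIZATION", "I-ORGANIZATION", "L-ORGANIZATION"],
]

# Token-level counts per entity type, ignoring untagged ("O") tokens.
type_counts = Counter(
    entity_type(tag) for tags in tags_per_sample for tag in tags if tag != "O"
)
print(type_counts.most_common())
```

These are counts of tagged tokens, not of entities; counting only `B-` and `U-` tags would give the number of entities instead.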

Create spaCy dataset (option 2: pickle)

In [12]:
from presidio_evaluator import InputSample
import pickle

spacy_train = InputSample.create_spacy_dataset(train_tagged)


In [13]:
# Collect the entity type (third element) of every span in the spaCy-format dataset
entities_spacy = [x[1]['entities'] for x in spacy_train]
entities_spacy_flat = []
for samp in entities_spacy:
    for ent in samp:
        entities_spacy_flat.append(ent[2])
set(entities_spacy_flat)

{'GPE', 'O', 'ORG', 'PERSON'}
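
Before training on the converted data, it can help to check that every entity offset falls inside its text and that spans do not overlap. The sketch below runs such a check on a hand-made list in the same shape as `spacy_train`; the function name `find_bad_spans` and the sample data are illustrative assumptions, not part of `presidio_evaluator`.

```python
def find_bad_spans(dataset):
    """Return (sample_index, span, reason) for spans that look invalid."""
    problems = []
    for i, (text, annotations) in enumerate(dataset):
        previous_end = 0
        # Sorting by start offset lets overlaps be detected in one pass.
        for start, end, label in sorted(annotations["entities"]):
            if not (0 <= start < end <= len(text)):
                problems.append((i, (start, end, label), "out of bounds"))
            elif start < previous_end:
                problems.append((i, (start, end, label), "overlaps previous span"))
            previous_end = max(previous_end, end)
    return problems

# Toy data: the second sample has an end offset past the end of the text.
toy_dataset = [
    ("Alice works at Contoso", {"entities": [(0, 5, "PERSON"), (15, 22, "ORG")]}),
    ("Bob lives in Paris", {"entities": [(0, 3, "PERSON"), (13, 30, "GPE")]}),
]
print(find_bad_spans(toy_dataset))
```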

Create spaCy dataset (option 1: JSON)

In [14]:
from presidio_evaluator import InputSample
spacy_train_json = InputSample.create_spacy_json(train_tagged)

6773it [00:00, 31822.83it/s]


Quick evaluation of samples

In [15]:
spacy_train_json[0]['paragraphs'][0]['sentences']

[{'tokens': [{'orth': 'The', 'tag': 'DT', 'ner': 'O'},
   {'orth': 'Romanian', 'tag': 'NNP', 'ner': 'U-GPE'},
   {'orth': 'found', 'tag': 'VBD', 'ner': 'O'},
   {'orth': 'the', 'tag': 'DT', 'ner': 'O'},
   {'orth': 'mark', 'tag': 'NN', 'ner': 'O'},
   {'orth': 'again', 'tag': 'RB', 'ner': 'O'},
   {'orth': 'two', 'tag': 'CD', 'ner': 'O'},
   {'orth': 'minutes', 'tag': 'NNS', 'ner': 'O'},
   {'orth': 'after', 'tag': 'IN', 'ner': 'O'},
   {'orth': 'halftime', 'tag': 'NN', 'ner': 'O'},
   {'orth': 'and', 'tag': 'CC', 'ner': 'O'},
   {'orth': 'again', 'tag': 'RB', 'ner': 'O'},
   {'orth': 'in', 'tag': 'IN', 'ner': 'O'},
   {'orth': 'the', 'tag': 'DT', 'ner': 'O'},
   {'orth': '56th', 'tag': 'JJ', 'ner': 'O'},
   {'orth': 'minute', 'tag': 'NN', 'ner': 'O'},
   {'orth': 'before', 'tag': 'IN', 'ner': 'O'},
   {'orth': 'midfielder', 'tag': 'NN', 'ner': 'O'},
   {'orth': 'Mathias', 'tag': 'NNP', 'ner': 'B-PERSON'},
   {'orth': 'Jespersen', 'tag': 'NNP', 'ner': 'L-PERSON'},
   {'orth': 'scored',

Dump the training set to pickle and JSON files

In [16]:
import pickle
import json
with open("../../data/spacy_train.pickle", 'wb') as handle:
    pickle.dump(spacy_train,handle, protocol=pickle.HIGHEST_PROTOCOL)

with open("../../data/spacy_train.json","w") as f:
    json.dump(spacy_train_json,f)

Create JSON and pickle files for test dataset

In [17]:
spacy_test = InputSample.create_spacy_dataset(test_samples)
spacy_test_json = InputSample.create_spacy_json(test_samples)

3989it [00:00, 48871.27it/s]


Dump the test set to pickle and JSON files

In [18]:
import pickle
with open("../../data/spacy_test.pickle", 'wb') as handle:
    pickle.dump(spacy_test,handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open("../../data/spacy_test.json","w") as f:
    json.dump(spacy_test_json,f)

Create JSON and pickle files for the validation dataset, then dump them
       

In [19]:
spacy_val = InputSample.create_spacy_dataset(val_samples)
spacy_val_json = InputSample.create_spacy_json(val_samples)

4049it [00:00, 42553.55it/s]


In [20]:
import pickle
with open("../../data/spacy_val.pickle", 'wb') as handle:
    pickle.dump(spacy_val,handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open("../../data/spacy_val.json","w") as f:
    json.dump(spacy_val_json,f)       
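
The dump cells above repeat the same two steps for the train, test, and validation splits. One way to factor that out is a small helper. In the sketch below, the name `dump_split` and its parameters are hypothetical, and the example writes toy data to a temporary directory rather than `../../data`.

```python
import json
import pickle
import tempfile
from pathlib import Path

def dump_split(name, spacy_samples, spacy_json, out_dir):
    """Write <out_dir>/spacy_<name>.pickle and <out_dir>/spacy_<name>.json."""
    out_dir = Path(out_dir)
    pickle_path = out_dir / f"spacy_{name}.pickle"
    json_path = out_dir / f"spacy_{name}.json"
    with open(pickle_path, "wb") as handle:
        pickle.dump(spacy_samples, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(json_path, "w") as f:
        json.dump(spacy_json, f)
    return pickle_path, json_path

# Toy stand-ins for the outputs of create_spacy_dataset / create_spacy_json.
toy_samples = [("Bob lives in Paris", {"entities": [(13, 18, "GPE")]})]
toy_json = [{"id": 0, "paragraphs": [{"sentences": []}]}]

with tempfile.TemporaryDirectory() as tmp:
    pickle_path, json_path = dump_split("val", toy_samples, toy_json, tmp)
    # Read both files back to confirm they round-trip.
    with open(pickle_path, "rb") as handle:
        reloaded_samples = pickle.load(handle)
    with open(json_path) as f:
        reloaded_json = json.load(f)

print(reloaded_samples == toy_samples, reloaded_json == toy_json)
```

With such a helper, each split would need a single call, for example `dump_split("val", spacy_val, spacy_val_json, "../../data")`.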