# Spacy dataset creation

This notebook takes train and test datasets (of type `List[InputSample]`)
and transforms them into two structures consumed by Spacy:
1. Spacy JSON (see https://spacy.io/api/annotation#json-input)
2. Spacy pickle files (of structure `[(full_text, {"entities": [(start, end, type), ...]}), ...]`;
see more details here: https://spacy.io/api/annotation#json-input)

The JSON file is used by Spacy's CLI trainer.
The pickle file is used for fine-tuning with the logic in [../models/spacy_retrain.py](../models/spacy_retrain.py).
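
To make the pickle structure concrete, here is a minimal sketch of one training sample in the `(full_text, {"entities": [...]})` layout. The sentence, the offsets, and the helper `spans_to_strings` are hypothetical and only illustrate how the character offsets map back to the text; they are not part of `presidio_evaluator`.

```python
# Hypothetical sample: (full_text, {"entities": [(start, end, type), ...]})
text = "Dr. Jane Smith works at Contoso in Seattle."
sample = (text, {"entities": [(4, 14, "PERSON"), (24, 31, "ORG"), (35, 42, "GPE")]})


def spans_to_strings(sample):
    """Return (substring, label) pairs for each annotated character span."""
    full_text, annotations = sample
    return [
        (full_text[start:end], label)
        for start, end, label in annotations["entities"]
    ]


extracted = spans_to_strings(sample)
print(extracted)
```

Offsets are character-based and end-exclusive, so `full_text[start:end]` returns exactly the annotated entity text.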

In [1]:
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.data_objects import InputSample
%reload_ext autoreload

In [2]:
n2c2_train = read_synth_dataset('../../data/n2c2/set_1.json')
n2c2_val = read_synth_dataset('../../data/n2c2/set_2.json')

In [3]:
print("Read train {} samples".format(len(n2c2_train)))
print("Read val {} samples".format(len(n2c2_val)))

Read train 445 samples
Read val 237 samples


For training, keep only sentences with entities:

In [4]:
train_tagged = [sample for sample in n2c2_train if len(sample.spans)>0]
print("Kept {} samples after removal of non-tagged samples".format(len(train_tagged)))

Kept 445 samples after removal of non-tagged samples


Inspect the entity tags found in the training set:

In [5]:
print("Entities found in training set:")
entities = []
for sample in train_tagged:
    entities.extend(sample.tags)
set(entities)

Entities found in training set:


{'B-AGE',
 'B-DATE_TIME',
 'B-HEALTHPLAN',
 'B-IDNUM',
 'B-LOCATION',
 'B-MEDICALRECORD',
 'B-ORGANIZATION',
 'B-PERSON',
 'B-PHONE_NUMBER',
 'B-PROFESSION',
 'B-URL',
 'I-DATE_TIME',
 'I-IDNUM',
 'I-LOCATION',
 'I-MEDICALRECORD',
 'I-ORGANIZATION',
 'I-PERSON',
 'I-PHONE_NUMBER',
 'I-PROFESSION',
 'I-URL',
 'L-AGE',
 'L-DATE_TIME',
 'L-HEALTHPLAN',
 'L-IDNUM',
 'L-LOCATION',
 'L-MEDICALRECORD',
 'L-ORGANIZATION',
 'L-PERSON',
 'L-PHONE_NUMBER',
 'L-PROFESSION',
 'L-URL',
 'O',
 'U-AGE',
 'U-BIOID',
 'U-DATE_TIME',
 'U-EMAIL_ADDRESS',
 'U-IDNUM',
 'U-LOCATION',
 'U-MEDICALRECORD',
 'U-ORGANIZATION',
 'U-PERSON',
 'U-PHONE_NUMBER',
 'U-PROFESSION',
 'U-USERNAME'}
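
The tags above appear to follow the BILUO scheme (`B-` begin, `I-` inside, `L-` last, `U-` unit, `O` outside). As a rough sketch, the prefix can be stripped to count entities per type. The helpers `entity_type` and `count_entities` and the example tag list are hypothetical and not part of `presidio_evaluator`.

```python
from collections import Counter


def entity_type(tag):
    """Strip the BILUO prefix from a tag; the 'O' tag carries no entity type."""
    if tag == "O":
        return None
    _prefix, _sep, label = tag.partition("-")
    return label


def count_entities(tags):
    """Count one entity per B- (multi-token start) or U- (single-token) tag."""
    return Counter(entity_type(t) for t in tags if t[:2] in ("B-", "U-"))


# Hypothetical token-level tags for a short sentence
example_tags = [
    "O", "B-PERSON", "L-PERSON", "O", "U-LOCATION",
    "B-DATE_TIME", "I-DATE_TIME", "L-DATE_TIME",
]
counts = count_entities(example_tags)
print(counts)
```

Counting only `B-` and `U-` tags avoids counting a multi-token entity once per token.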

Create Spacy dataset (option 2: pickle)

In [6]:
from presidio_evaluator import InputSample
import pickle

spacy_train = InputSample.create_spacy_dataset(train_tagged)

In [7]:
# Collect the distinct entity labels present in the Spacy-formatted training set
entities_spacy = [x[1]['entities'] for x in spacy_train]
entities_spacy_flat = []
for samp in entities_spacy:
    for ent in samp:
        entities_spacy_flat.append(ent[2])
set(entities_spacy_flat)

{'GPE', 'O', 'ORG', 'PERSON'}

Create Spacy dataset (option 1: JSON)

In [23]:
from presidio_evaluator import InputSample
spacy_train_json = InputSample.create_spacy_json(train_tagged)

445it [00:00, 1374.31it/s]


Quick evaluation of samples

Dump the training set to pickle and JSON files, respectively:

In [24]:
import pickle
import json

with open("../../data/n2c2/set_1_spacy.pickle", "wb") as handle:
    pickle.dump(spacy_train, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open("../../data/n2c2/set_1_spacy.json", "w") as f:
    json.dump(spacy_train_json, f)

Create JSON and pickle files for the test (validation) dataset

In [25]:
spacy_val = InputSample.create_spacy_dataset(n2c2_val)
spacy_val_json = InputSample.create_spacy_json(n2c2_val)

237it [00:00, 1304.10it/s]


Dump the test (validation) set to pickle and JSON files, respectively:

In [26]:
import pickle
import json

with open("../../data/n2c2/set_2_spacy.pickle", "wb") as handle:
    pickle.dump(spacy_val, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open("../../data/n2c2/set_2_spacy.json", "w") as f:
    json.dump(spacy_val_json, f)
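
After dumping, it can be useful to confirm that the files load back unchanged. The sketch below round-trips stand-in data through a temporary directory rather than the real `../../data/n2c2/` paths; the sample contents and file names are hypothetical, and the JSON stand-in only loosely resembles the Spacy JSON layout.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-ins for spacy_val and spacy_val_json
spacy_data = [
    ("Jane Smith lives in Seattle.", {"entities": [(0, 10, "PERSON"), (20, 27, "GPE")]})
]
json_stand_in = [{"id": 0, "paragraphs": []}]

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "example_spacy.pickle"
    json_path = Path(tmp_dir) / "example_spacy.json"

    # Write and re-read the pickle file
    with open(pickle_path, "wb") as handle:
        pickle.dump(spacy_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(pickle_path, "rb") as handle:
        reloaded = pickle.load(handle)

    # Write and re-read the JSON file
    with open(json_path, "w") as f:
        json.dump(json_stand_in, f)
    with open(json_path) as f:
        reloaded_json = json.load(f)

# Every annotated span should fall inside its text
for full_text, annotations in reloaded:
    for start, end, _label in annotations["entities"]:
        assert 0 <= start < end <= len(full_text)

print(reloaded == spacy_data, reloaded_json == json_stand_in)
```

Pickle preserves the tuples as written, whereas JSON would turn tuples into lists, which is one reason to keep the two outputs as separate structures.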
       