# Spacy dataset creation

This notebook takes train and test datasets (of type `List[InputSample]`)
and transforms them into two structures consumed by Spacy:
1. Spacy JSON (see https://spacy.io/api/annotation#json-input)
2. Spacy pickle files, each holding a list of tuples of the structure `(full_text, {"entities": [(start, end, type), ...]})`.  
See more details here: https://spacy.io/api/annotation#json-input

The JSON file is used by Spacy's CLI trainer.  
The pickle file is used for fine-tuning with the logic in [../models/spacy_retrain.py](../models/spacy_retrain.py)
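
To make the pickle structure concrete, here is a minimal sketch of a single record. The text, offsets, and labels are made up for illustration and do not come from the n2c2 data; the labels only mirror the entity types that appear later in this notebook.

```python
# Hypothetical example record; text and offsets are illustrative only.
text = "Patient was admitted to cardiology for chest pain."
record = (
    text,
    {"entities": [(24, 34, "CLINICAL_DEPT"), (39, 49, "PROBLEM")]},
)

# Each entity is (start_char, end_char, label), with the end offset exclusive.
for start, end, label in record[1]["entities"]:
    print(label, "->", text[start:end])
```

Character offsets index into the full text, so slicing `text[start:end]` recovers the surface form of each entity.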

In [1]:
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.data_objects import InputSample
%reload_ext autoreload

In [2]:
n2c2_train = read_synth_dataset('/data/datasets/n2c2/2012/processed/train.json')
n2c2_val = read_synth_dataset('/data/datasets/n2c2/2012/processed/test.json')

In [3]:
print("Read train {} samples".format(len(n2c2_train)))
print("Read val {} samples".format(len(n2c2_val)))

Read train 190 samples
Read val 120 samples


For training, keep only sentences with entities:

In [4]:
train_tagged = [sample for sample in n2c2_train if len(sample.spans) > 0]
print("Kept {} samples after removal of non-tagged samples".format(len(train_tagged)))

Kept 190 samples after removal of non-tagged samples


Inspect the entity tags found in the training set:

In [5]:
print("Entities found in training set:")
entities = []
for sample in train_tagged:
    entities.extend(sample.tags)
set(entities)

Entities found in training set:


{'B-',
 'B-CLINICAL_DEPT',
 'B-EVIDENTIAL',
 'B-OCCURRENCE',
 'B-PROBLEM',
 'B-TEST',
 'B-TREATMENT',
 'I-CLINICAL_DEPT',
 'I-EVIDENTIAL',
 'I-OCCURRENCE',
 'I-PROBLEM',
 'I-TEST',
 'I-TREATMENT',
 'L-',
 'L-CLINICAL_DEPT',
 'L-EVIDENTIAL',
 'L-OCCURRENCE',
 'L-PROBLEM',
 'L-TEST',
 'L-TREATMENT',
 'O',
 'U-CLINICAL_DEPT',
 'U-EVIDENTIAL',
 'U-OCCURRENCE',
 'U-PROBLEM',
 'U-TEST',
 'U-TREATMENT'}
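
The tags above follow the BILUO scheme (Begin, In, Last, Unit, Outside). The `B-` and `L-` tags with nothing after the hyphen suggest that some spans in the source data have an empty entity type. A minimal sketch for counting tags per entity type is shown below; it uses a made-up tag list in place of the real `sample.tags`, and `entity_type` is a hypothetical helper, not part of `presidio_evaluator`.

```python
from collections import Counter


def entity_type(tag):
    """Strip the BILUO prefix; return None for 'O' and '' for unlabeled spans."""
    if tag == "O":
        return None
    _prefix, _, label = tag.partition("-")
    return label


# Hypothetical tags standing in for the values collected from sample.tags.
tags = ["O", "B-PROBLEM", "L-PROBLEM", "U-TEST", "B-", "L-", "O"]

# Count tokens per entity type, skipping tokens outside any entity.
counts = Counter(entity_type(t) for t in tags if t != "O")
print(counts)
```

A non-zero count for the empty string is a quick signal that unlabeled spans exist and may need handling before training.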

Create Spacy dataset (option 2: pickle)

In [6]:
from presidio_evaluator import InputSample
import pickle

spacy_train = InputSample.create_spacy_dataset(train_tagged, ignore_unknown=False)

In [7]:
# Collect the entity type (third element) of every span in the Spacy dataset
entities_spacy = [x[1]['entities'] for x in spacy_train]
entities_spacy_flat = []
for samp in entities_spacy:
    for ent in samp:
        entities_spacy_flat.append(ent[2])
set(entities_spacy_flat)

{'',
 'CLINICAL_DEPT',
 'EVIDENTIAL',
 'OCCURRENCE',
 'PROBLEM',
 'TEST',
 'TREATMENT'}
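
The empty string in this set means that some spans carry no entity type. If such spans should be excluded before training, one option is to filter them out of the `(text, {"entities": [...]})` tuples. The sketch below runs on made-up records, and `drop_unlabeled` is a hypothetical helper; whether dropping these spans is appropriate depends on the dataset.

```python
def drop_unlabeled(dataset):
    """Return a copy of the dataset without entities whose label is empty."""
    cleaned = []
    for text, annotations in dataset:
        entities = [ent for ent in annotations["entities"] if ent[2]]
        cleaned.append((text, {"entities": entities}))
    return cleaned


# Hypothetical records in the same shape as spacy_train.
example = [
    ("He had a fever.", {"entities": [(9, 14, "PROBLEM"), (0, 2, "")]}),
    ("No findings.", {"entities": []}),
]
cleaned = drop_unlabeled(example)
print(cleaned)
```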

Create Spacy dataset (option 1: JSON)

In [8]:
from presidio_evaluator import InputSample
spacy_train_json = InputSample.create_spacy_json(train_tagged, ignore_unknown=False)

190it [00:00, 1767.04it/s]


Quick evaluation of samples

Dump the training set to pickle and JSON files, respectively

In [9]:
import pickle
import json
with open("/data/datasets/n2c2/2012/processed/train_spacy.pickle", "wb") as handle:
    pickle.dump(spacy_train, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open("/data/datasets/n2c2/2012/processed/train_spacy.json", "w") as f:
    json.dump(spacy_train_json, f)
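
After dumping, it can be worth checking that the pickle file loads back unchanged. The following sketch does a round trip on a made-up record in a temporary directory, so it does not touch the real dataset paths; `example_spacy.pickle` is a hypothetical file name.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for spacy_train.
data = [("He had a fever.", {"entities": [(9, 14, "PROBLEM")]})]

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "example_spacy.pickle")

    # Write with the same protocol setting used in this notebook.
    with open(pickle_path, "wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read the file back and compare with the original object.
    with open(pickle_path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == data)
```

Pickle preserves the nested tuples, so the restored object compares equal to the original list.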

Create JSON and pickle files for test dataset

In [10]:
spacy_val = InputSample.create_spacy_dataset(n2c2_val, ignore_unknown=False)
spacy_val_json = InputSample.create_spacy_json(n2c2_val, ignore_unknown=False)

120it [00:00, 1248.18it/s]


Dump the test set to pickle and JSON files, respectively

In [11]:
import pickle
import json
with open("/data/datasets/n2c2/2012/processed/test_spacy.pickle", "wb") as handle:
    pickle.dump(spacy_val, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open("/data/datasets/n2c2/2012/processed/test_spacy.json", "w") as f:
    json.dump(spacy_val_json, f)