In [2]:
# First, we import the modules used in this session

import json
import logging

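### Converting Dataturks training data to spaCy format

Both Dataturks and spaCy work with JSON-style data, but they structure it differently. A Dataturks export holds one JSON object per line, with entity offsets that include the end character. spaCy expects `(text, {"entities": [...]})` tuples whose end offsets are exclusive.

We first define a function that converts the Dataturks data into the spaCy format (available [here](https://dataturks.com/help/dataturks-ner-json-to-spacy-train.php)).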
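To make the difference concrete, here is a minimal sketch of one record in each shape. The sentence, labels, and offsets are invented for illustration; only the field names (`content`, `annotation`, `label`, `points`, `start`, `end`) are taken from the converter used in this notebook.

```python
import json

# Hypothetical single line of a Dataturks export (illustrative values only)
dataturks_line = json.dumps({
    "content": "Alice lives in Paris",
    "annotation": [
        {"label": ["PERSON"], "points": [{"start": 0, "end": 4}]},
        {"label": ["GPE"], "points": [{"start": 15, "end": 19}]},
    ],
})

record = json.loads(dataturks_line)
text = record["content"]

# Dataturks end offsets are inclusive, so add 1 to get end-exclusive offsets
entities = [
    (ann["points"][0]["start"], ann["points"][0]["end"] + 1, label)
    for ann in record["annotation"]
    for label in ann["label"]
]

spacy_example = (text, {"entities": entities})
print(spacy_example)
```

With end-exclusive offsets, an ordinary Python slice such as `text[0:5]` returns the entity text directly.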

In [3]:
#Define function that converts Dataturks into Spacy format (provided on Dataturks):

def convert_dataturks_to_spacy(dataturks_JSON_FilePath):
    try:
        training_data = []
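        lines = []
        # Dataturks exports one JSON object per line; read as UTF-8, the encoding used when saving later
        with open(dataturks_JSON_FilePath, 'r', encoding='utf-8') as f:
            lines = f.readlines()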

        for line in lines:
            data = json.loads(line)
            text = data['content']
            entities = []
            for annotation in data['annotation']:
                #only a single point in text annotation.
                point = annotation['points'][0]
                labels = annotation['label']
                # handle both list of labels or a single label.
                if not isinstance(labels, list):
                    labels = [labels]

                for label in labels:
                    # Dataturks indices are inclusive on both ends [start, end]; spaCy's end index is exclusive [start, end)
                    entities.append((point['start'], point['end'] + 1, label))


            training_data.append((text, {"entities" : entities}))

        return training_data
    except Exception as e:
        logging.exception("Unable to process " + dataturks_JSON_FilePath + "\n" + "error = " + str(e))
        return None
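Because the whole loop sits inside a single `try` block, one problematic record makes the function return `None` for the entire file. The sketch below is a hypothetical variant, not part of the Dataturks-provided code, that works on a list of lines and skips records without annotations. It assumes such records may appear in an export with a null or empty `annotation` field.

```python
import json

def convert_lines_tolerant(lines):
    """Hypothetical variant: convert line by line, skipping unannotated records."""
    training_data = []
    skipped = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        data = json.loads(line)
        annotations = data.get('annotation')
        if not annotations:
            # assumption: unannotated records carry a null or empty annotation field
            skipped += 1
            continue
        entities = []
        for annotation in annotations:
            point = annotation['points'][0]
            labels = annotation['label']
            if not isinstance(labels, list):
                labels = [labels]
            for label in labels:
                # inclusive end offset -> exclusive end offset
                entities.append((point['start'], point['end'] + 1, label))
        training_data.append((data['content'], {"entities": entities}))
    return training_data, skipped

# Invented sample lines for illustration
lines = [
    '{"content": "Alice lives in Paris", "annotation": [{"label": "PERSON", "points": [{"start": 0, "end": 4}]}]}',
    '{"content": "No entities here", "annotation": null}',
]
converted, skipped = convert_lines_tolerant(lines)
print(converted, skipped)
```

Whether to drop unannotated records is a design choice. They can also be kept with an empty entity list if they are meant to serve as negative examples.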

Next, we apply this function to our own data, saving the object as **TAGGED_DATA**:

In [7]:
#Convert Dataturks to Spacy Format

filename = input("Input file path:")
TAGGED_DATA = convert_dataturks_to_spacy(filename)

Input file path:Dataturksexport.json
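Before moving on, it can help to check that the converted offsets select the intended text. The helper below is a hypothetical sketch (`preview_entities` is not part of Dataturks or spaCy) and is shown on an invented sample rather than the real export. It also guards against the `None` that the converter returns on failure.

```python
def preview_entities(tagged_data, limit=3):
    """Return (label, surface text) pairs for the first `limit` examples."""
    if tagged_data is None:
        raise ValueError("Conversion failed; check the logged exception")
    preview = []
    for text, annotations in tagged_data[:limit]:
        for start, end, label in annotations["entities"]:
            # end is exclusive, so a plain slice yields the entity text
            preview.append((label, text[start:end]))
    return preview

# Invented sample in the converted format
sample = [("Alice lives in Paris", {"entities": [(0, 5, "PERSON"), (15, 20, "GPE")]})]
print(preview_entities(sample))
```

Calling `preview_entities(TAGGED_DATA)` on the converted data should show each label next to the exact span that was annotated. A missing or extra character would suggest an offset problem.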


We now have a Python object formatted for training the model. Finally, we save it as a JSON file for use in other notebooks.

In [8]:
# write to file
newfile = input("Input new file:")
with open(newfile, 'w', encoding='utf-8') as fp:
    json.dump(TAGGED_DATA, fp)

Input new file:TaggedData_SF.json
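One detail to keep in mind when reading the file back in another notebook: JSON has no tuple type, so `json.dump` writes the tuples as lists. The sketch below shows one way to restore the tuple form. `load_tagged_data` is a hypothetical helper, and the round trip uses a temporary file with invented sample data.

```python
import json
import os
import tempfile

def load_tagged_data(path):
    """Load saved training data and restore tuples that JSON stored as lists."""
    with open(path, 'r', encoding='utf-8') as fp:
        raw = json.load(fp)
    return [
        (text, {"entities": [tuple(ent) for ent in annotations["entities"]]})
        for text, annotations in raw
    ]

# Round trip on an invented sample
sample = [("Alice lives in Paris", {"entities": [(0, 5, "PERSON"), (15, 20, "GPE")]})]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, 'w', encoding='utf-8') as fp:
        json.dump(sample, fp)
    restored = load_tagged_data(path)

print(restored == sample)
```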
