# Read NER data

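Reads the CoNLL-2003 English NER data from https://github.com/glample/tagger and converts it to JSON. Each output file is a list of sentences, where every sentence has a `text` list (tokens) and a parallel `tags` list (NER tags).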

Download the originals (if we don't have them already):

In [1]:
!wget -nc https://github.com/glample/tagger/raw/master/dataset/eng.train
!wget -nc https://github.com/glample/tagger/raw/master/dataset/eng.testa
!wget -nc https://github.com/glample/tagger/raw/master/dataset/eng.testb

File ‘eng.train’ already there; not retrieving.

File ‘eng.testa’ already there; not retrieving.

File ‘eng.testb’ already there; not retrieving.
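
If `wget` is not available, the same "download only if missing" behaviour can be approximated in Python. This is a minimal sketch using only the standard library; `download_if_missing` and `BASE_URL` are names made up for this example and are not part of the original notebook.

```python
import os
import urllib.request

# Assumption: same location the wget commands above point at
BASE_URL = 'https://github.com/glample/tagger/raw/master/dataset/'


def download_if_missing(filename, base_url=BASE_URL):
    """Fetch base_url + filename unless the file already exists locally.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(filename):
        return False
    urllib.request.urlretrieve(base_url + filename, filename)
    return True


# Usage (needs network access, so it is left commented out here):
# for name in ('eng.train', 'eng.testa', 'eng.testb'):
#     download_if_missing(name)
```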



Conversion:

In [2]:
import json
import os


def to_json(in_path, out_path):
    sentences = []
    sentence_tokens = []
    sentence_tags = []
    with open(in_path, 'rt') as input_file:
        for line in input_file:
            # Empty lines are sentence boundaries in the CoNLL format
            if line.strip() == '':
                if sentence_tokens:
                    sentences.append({'text': sentence_tokens, 'tags': sentence_tags})
                sentence_tokens = []
                sentence_tags = []
            elif '-DOCSTART-' in line:  # These are metadata lines we can skip
                continue
            else:
                # Data format is <token> <pos> <chunk> <ner>
                data = line.strip().split()
                sentence_tokens.append(data[0])
                # Let's ignore the MISC class (mapped to 'O') and rewrite remaining B- prefixes to I-
                tag = data[3].replace('I-MISC', 'O').replace('B-MISC', 'O').replace('B-', 'I-')
                sentence_tags.append(tag)
    # Keep the last sentence if the file does not end with an empty line
    if sentence_tokens:
        sentences.append({'text': sentence_tokens, 'tags': sentence_tags})
    # Create the output directory if it does not exist yet
    out_dir = os.path.dirname(out_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    with open(out_path, 'w') as output_file:
        json.dump(sentences, output_file, indent=2)
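
To make the tag handling in the conversion easier to follow, here is the same chain of replacements pulled out into a small standalone function. `normalize_tag` is a hypothetical helper used only for this illustration; the notebook itself applies the replacements inline.

```python
def normalize_tag(tag):
    """Map a raw CoNLL NER tag to the reduced tag set used in the conversion."""
    # MISC entities are dropped by mapping them to the outside tag 'O';
    # any remaining B- prefix is then rewritten to I-.
    return tag.replace('I-MISC', 'O').replace('B-MISC', 'O').replace('B-', 'I-')


raw_tags = ['B-PER', 'I-PER', 'O', 'B-MISC', 'I-MISC', 'I-LOC', 'B-ORG']
print([normalize_tag(t) for t in raw_tags])
# ['I-PER', 'I-PER', 'O', 'O', 'O', 'I-LOC', 'I-ORG']
```

One consequence of rewriting `B-` to `I-` is that two directly adjacent entities of the same type can no longer be told apart from a single longer entity.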

Run the conversion, renaming the files in the process: `eng.testa` becomes the dev set and `eng.testb` the test set.

In [3]:
to_json('eng.train', 'data/ner-conll03-en-train.json')
to_json('eng.testa', 'data/ner-conll03-en-dev.json')
to_json('eng.testb', 'data/ner-conll03-en-test.json')
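
A quick sanity check on the converted files could look like the sketch below. `summarize` and `check_file` are hypothetical helpers, not part of the original notebook; they only assume the list-of-sentences layout with `text` and `tags` keys written by `to_json`.

```python
import json
from collections import Counter


def summarize(sentences):
    """Return (number of sentences, number of tokens, Counter of tags)."""
    tag_counts = Counter(tag for s in sentences for tag in s['tags'])
    n_tokens = sum(len(s['text']) for s in sentences)
    return len(sentences), n_tokens, tag_counts


def check_file(path):
    """Load a converted file and verify each sentence has one tag per token."""
    with open(path) as f:
        sentences = json.load(f)
    assert all(len(s['text']) == len(s['tags']) for s in sentences)
    return summarize(sentences)


# Tiny in-memory example in the same shape as the converted files
sample = [
    {'text': ['EU', 'rejects', 'German', 'call'], 'tags': ['I-ORG', 'O', 'O', 'O']},
    {'text': ['Peter', 'Blackburn'], 'tags': ['I-PER', 'I-PER']},
]
print(summarize(sample))
```

On the real data, `check_file('data/ner-conll03-en-dev.json')` would return the same kind of summary; since MISC was mapped to `O`, no MISC tags should show up in the counts.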