#### Importing libraries

In [42]:
import os
import numpy as np
import pandas as pd

import csv
from collections import Counter

#### Getting the data

This task uses the Groningen Meaning Bank (GMB) data set. It is not considered a
gold standard: the annotations were produced by automatic tagging software, and
human raters then corrected subsets of the data.

The following named entities are tagged in this corpus:
* geo = Geographical entity
* org = Organization
* per = Person
* gpe = Geopolitical entity
* tim = Time indicator
* art = Artifact
* eve = Event
* nat = Natural phenomenon


To download the data set:

In [1]:
!wget https://gmb.let.rug.nl/releases/gmb-2.2.0.zip
!unzip gmb-2.2.0.zip

--2021-05-13 11:36:30--  https://gmb.let.rug.nl/releases/gmb-2.2.0.zip
Resolving gmb.let.rug.nl (gmb.let.rug.nl)... 129.125.2.210
Connecting to gmb.let.rug.nl (gmb.let.rug.nl)|129.125.2.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 841163535 (802M) [application/zip]
Saving to: ‘gmb-2.2.0.zip’

gmb-2.2.0.zip         0%[                    ]   6.18M   249KB/s    eta 59m 43s^C
Archive:  gmb-2.2.0.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of gmb-2.2.0.zip or
        gmb-2.2.0.zip.zip, and cannot find gmb-2.2.0.zip.ZIP, period.
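
The log above shows the download being interrupted (`^C`) at 0%, after which `unzip` rejects the partial file. As a minimal sketch, the archive can be checked from Python before extraction. The helper name `extract_if_complete` is hypothetical and not part of the original notebook; it uses only the standard-library `zipfile` module.

```python
import os
import zipfile

def extract_if_complete(zip_path, dest_dir="."):
    """Extract zip_path into dest_dir only if it is a readable zip archive.

    Returns True if extraction happened, False if the file is missing or
    is not recognised as a zip file (for example, a truncated download).
    """
    if not os.path.exists(zip_path) or not zipfile.is_zipfile(zip_path):
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return True

# Prints False when the archive is absent or incomplete.
print(extract_if_complete("gmb-2.2.0.zip"))
```

A `False` result here suggests re-running the download (for instance with `wget -c` to resume) before moving on.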


#### Looking at data

We will use only the files named en.tags, which are found in various subdirectories.
These files are tab-separated, with one word of a sentence per row.
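
To make the file layout concrete, here is a small sketch that parses text in this shape. The sample string is made up for illustration and is not copied from the corpus. The column positions (0 for the word, 1 for the POS tag, 3 for the NER tag) and the blank line between sentences are assumptions taken from the indices used by `process_data` later in this notebook; the value in column 2 is a placeholder.

```python
# Hypothetical sample: one token per line, tab-separated fields,
# blank line between sentences.
sample = (
    "Thousands\tNNS\tthousand\tO\n"
    "marched\tVBD\tmarch\tO\n"
    "London\tNNP\tlondon\tgeo-nam\n"
    "\n"
    "They\tPRP\tthey\tO\n"
    "left\tVBD\tleave\tO\n"
)

def parse_tags_text(text):
    """Split raw text into sentences of (word, pos, ner) triples."""
    sentences = []
    for block in text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.split("\n") if line]
        sentences.append([(r[0], r[1], r[3]) for r in rows])
    return sentences

parsed = parse_tags_text(sample)
print(len(parsed))   # number of sentences in the sample
print(parsed[0][2])  # third token of the first sentence
```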

In [6]:
data_path = 'gmb-2.2.0'
output_fn = 'gmb-2.2.0/cleaned.csv'

In [8]:
def get_filenames_by_extension(data_path, extension):
    fnames = []
    
    for root, dirs, files in os.walk(data_path):
        for filename in files:
            if filename.endswith(extension):
                file_path = os.path.join(root, filename)
                fnames.append(file_path)
                
    return fnames

In [11]:
tags = get_filenames_by_extension(data_path, '.tags')

print('Length of tags: ', len(tags))

Length of tags:  10000


A few processing steps need to happen. Each file contains a number of sentences, with
one word per row. The entire sentence as a sequence and the corresponding
sequence of NER tags need to be fed in as inputs while training the model. As
mentioned above, the NER tags first need to be simplified to the top-level entities
only. Second, the NER tags need to be converted to the IOB format.

In [87]:
def strip_ner_subcat(tag):
    # NER tags are of form {cat}-{subcat}
    # eg tim-dow. We only want first part
    return tag.split("-")[0]

def iob_format(ners):
    # Converts a sequence of IO NER tags into IOB format, e.g.
    #   O, PERSON, PERSON, O, O, LOCATION, O
    # becomes
    #   O, B-PERSON, I-PERSON, O, O, B-LOCATION, O
    # Also updates the global iob_tags Counter, which must be defined
    # before this function is called.
    iob_tokens = []
    for idx, token in enumerate(ners):
        if token != 'O':  # any tag other than "outside"
            if idx == 0:
                token = "B-" + token  # entity at start of sentence
            elif ners[idx-1] == token:
                token = "I-" + token  # continues the previous entity
            else:
                token = "B-" + token  # a new entity begins
        iob_tokens.append(token)
        iob_tags[token] += 1
    return iob_tokens

def process_data(tags):
    # Reads every .tags file and returns one row per sentence:
    # [words, IOB tags, POS tags], each joined by single spaces.
    # Also updates the global ner_tags Counter with the raw NER tags.
    rows = []
    for file in tags:
        with open(file, 'rb') as content:
            data = content.read().decode('utf-8').strip()
            sentences = data.split("\n\n")  # blank line separates sentences

            for sentence in sentences:
                toks = sentence.split('\n')
                words, pos, ner = [], [], []

                for tok in toks:
                    t = tok.split("\t")
                    words.append(t[0])   # column 0: word
                    pos.append(t[1])     # column 1: POS tag
                    ner_tags[t[3]] += 1  # column 3: NER tag with sub-category
                    ner.append(strip_ner_subcat(t[3]))
                rows.append([" ".join(words), " ".join(iob_format(ner)), " ".join(pos)])
    return rows
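
The two tag transformations can be tried in isolation on a toy sequence. The sketch below is a standalone variant, not the notebook's own function: `to_iob` is a hypothetical name, it skips the global counters, and the input tags are illustrative rather than taken from a corpus file.

```python
def to_iob(tags):
    """Convert IO tags to IOB without touching any global counters."""
    out = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            out.append(tag)
        elif tag == prev:
            out.append("I-" + tag)  # same type as previous token: continue
        else:
            out.append("B-" + tag)  # type changed: a new entity begins
        prev = tag
    return out

# Illustrative tags with sub-categories, in the {cat}-{subcat} form.
raw = ["O", "per-nam", "per-nam", "O", "tim-dow", "O"]
top_level = [t.split("-")[0] for t in raw]  # keep only the top-level entity
print(to_iob(top_level))
# → ['O', 'B-per', 'I-per', 'O', 'B-tim', 'O']
```

With IO input, two adjacent entities of the same type cannot be told apart, so they come out as a single B-/I- span.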

In [88]:
ner_tags = Counter()
iob_tags = Counter()

In [96]:
data = process_data(tags)
df = pd.DataFrame(data)
df.columns = ['text', 'label', 'pos']
df.to_csv(os.path.join(data_path, 'dataset.csv'), index=False)
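
Since each row stores words, labels and POS tags as space-joined strings, a quick sanity check is that all three columns hold the same number of tokens once the CSV is read back. The sketch below runs on a small in-memory CSV with made-up rows; `find_misaligned` is a hypothetical helper, and only the column names `text`, `label` and `pos` come from the notebook.

```python
import csv
import io

def find_misaligned(csv_file):
    """Return row numbers whose text, label and pos token counts differ."""
    misaligned = []
    for i, rec in enumerate(csv.DictReader(csv_file)):
        lengths = {len(rec[col].split(" ")) for col in ("text", "label", "pos")}
        if len(lengths) > 1:
            misaligned.append(i)
    return misaligned

# Made-up rows in the same layout; the first contains a comma inside a
# token, which the csv module quotes on writing and restores on reading.
rows = [
    ["Police put the number at 10,000 .", "O O O O O O O", "NNS VBD DT NN IN CD ."],
    ["They marched to London .", "O O O B-geo O", "PRP VBD TO NNP ."],
]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "label", "pos"])
writer.writerows(rows)
buffer.seek(0)

print(find_misaligned(buffer))
```

An empty list means every row is aligned. To run the same check on the real file, an open handle to `dataset.csv` could be passed in place of the buffer.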

In [97]:
df

Unnamed: 0,text,label,pos
0,Thousands of demonstrators have marched throug...,O O O O O O B-geo O O O O O B-geo O O O O O B-...,NNS IN NNS VBP VBN IN NNP TO VB DT NN IN NNP C...
1,Families of soldiers killed in the conflict jo...,O O O O O O O O O O O O O O O O O O B-per O O ...,NNS IN NNS VBN IN DT NN VBD DT NNS WP VBD NNS ...
2,They marched from the Houses of Parliament to ...,O O O O O O O O O O O B-geo I-geo O,PRP VBD IN DT NNS IN NN TO DT NN IN NNP NNP .
3,"Police put the number of marchers at 10,000 wh...",O O O O O O O O O O O O O O O,NNS VBD DT NN IN NNS IN CD IN NNS VBD PRP VBD ...
4,The protest comes on the eve of the annual con...,O O O O O O O O O O O B-geo O O B-org I-org O ...,DT NN VBZ IN DT NN IN DT JJ NN IN NNP POS VBG ...
...,...,...,...
62005,"At last the Goatherd threw a stone , and break...",O O O B-tim O O O O O O O O O O O O O O O O O O,"IN JJ DT NNP VBD DT NN , CC VBG PRP$ NN , VBD ..."
62006,"The Goat replied , "" Why , you silly fellow , ...",O O O O O O O O O O O O O O O O O O O O O,"DT NNP VBD , LQU WRB , PRP JJ NN , DT NN MD VB..."
62007,Do not attempt to hide things which can not be...,O O O O O O O O O O O O,VBP RB VB TO VB NNS WDT MD RB VB JJ .
62008,A Sunday School teacher asked her class why Jo...,O B-tim I-tim O O O O O B-per O B-per O B-per ...,DT NNP NNP NN VBD PRP$ NN WRB NNP CC NNP VBD N...
