# Notebook for preparing the data for further processing

## Preparing the NER evaluation datasets

First, download the `.zip` archive of the [repository](https://github.com/juand-r/entity-recognition-datasets) from GitHub.

You can also clone the repository, but this Notebook is geared towards unpacking the required datasets from a `.zip` file.

In [1]:
# Set the filename variable (without suffix)
filename = "entity-recognition-datasets-master"

## Working directory

This Notebook is designed to run in Google Colaboratory, so the working directory is a folder inside Google Drive.

If you wish to use a local IPython instance, make sure to update the working directory and skip the step that connects the Notebook to Google Drive.
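
The choice between the two environments could be sketched as follows. This is only a minimal illustration: the helper names `in_colab` and `resolve_workdir` and the local fallback folder `./datasets` are assumptions and not part of the original Notebook.

```python
import importlib.util
from pathlib import Path


def in_colab():
  # True when the google.colab package can be found (i.e. inside Colab)
  try:
    return importlib.util.find_spec("google.colab") is not None
  except (ImportError, ValueError):
    return False


def resolve_workdir(colab_dir="/content/drive/MyDrive/ml_ner/datasets",
                    local_dir="./datasets"):
  # Use the Drive folder in Colab, otherwise a local folder (assumed name)
  return Path(colab_dir) if in_colab() else Path(local_dir).resolve()


print(resolve_workdir())
```

In Colab, Google Drive still has to be mounted before the returned folder can be accessed.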

In [2]:
from google.colab import drive
drive.mount("/content/drive/")

# Access your Drive data using folder '/content/drive/MyDrive'

# Set the working directory
workdir = "/content/drive/MyDrive/ml_ner/datasets"

!ls -lah "$workdir"

Mounted at /content/drive/
total 2.5M
drwx------ 2 root root 4.0K Dec 30 19:28 CoNLL03
drwx------ 2 root root 4.0K Jan  1 18:43 emtd
-rw------- 1 root root 2.4M Dec 30 19:30 entity-recognition-datasets-master.zip


## File management

Unzip the copy of the repository and copy out the selected datasets (Wikigold and BTC). Then remove the extracted repository files, which are no longer needed.

**Both datasets are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).**
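
For a local run without shell commands, the same steps could be approximated in plain Python. This is a hedged sketch, not the Notebook's own method: the function name `extract_datasets` is hypothetical, and it assumes the archive unpacks into a top-level folder named like the archive itself (without the `.zip` suffix).

```python
import shutil
import zipfile
from pathlib import Path


def extract_datasets(workdir, archive_stem):
  workdir = Path(workdir)
  # Assumption: the archive unpacks into a folder named like the archive
  repo_dir = workdir / archive_stem

  # Unzip the copy of the repository into the working directory
  with zipfile.ZipFile(workdir / f"{archive_stem}.zip") as archive:
    archive.extractall(workdir)

  # Create the dataset directories, including the 'parsed' subfolders
  for name in ("wikigold", "btc"):
    (workdir / name / "parsed").mkdir(parents=True, exist_ok=True)

  # Copy the selected dataset files
  shutil.copy(repo_dir / "data/wikigold/CONLL-format/data/wikigold.conll.txt",
              workdir / "wikigold")
  for conll_file in (repo_dir / "data/BTC/CONLL-format/data").glob("*.conll"):
    shutil.copy(conll_file, workdir / "btc")

  # Remove the extracted repository files (the .zip itself is kept)
  shutil.rmtree(repo_dir)
```

With the variables defined earlier, it would be called as `extract_datasets(workdir, filename)`.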

In [None]:
# Unzip the files
!unzip "$workdir/$filename"'.zip' -d "$workdir"

In [4]:
# Create the dataset directories and copy the dataset files
!mkdir -p "$workdir"'/wikigold/parsed' "$workdir"'/btc/parsed'

!cp "$workdir/$filename"'/data/wikigold/CONLL-format/data/wikigold.conll.txt' "$workdir"'/wikigold'
!cp "$workdir/$filename"'/data/BTC/CONLL-format/data/'*'.conll' "$workdir"'/btc'

# Remove the extracted repository files
!rm -r "$workdir/$filename"

## Read in and parse

Read in all the annotated data and parse it into a unified format for evaluation.
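
To illustrate the unified format, one parsed sentence is stored as a dict with the keys `sentence`, `words` and `annotations`. The record below is made up for this example (the tokens and tag names are assumptions, not taken from the datasets), and `validate_record` is a hypothetical helper for a quick consistency check.

```python
import json

# Illustrative record (made-up tokens and tags, not taken from the datasets)
example_record = {
    "sentence": "Anna lives in Vienna.",
    "words": ["Anna", "lives", "in", "Vienna", "."],
    "annotations": ["I-PER", "O", "O", "I-LOC", "O"],
}


def validate_record(record):
  # A record is consistent when it has exactly the three expected keys
  # and every word has exactly one annotation
  expected_keys = {"sentence", "words", "annotations"}
  return (set(record) == expected_keys
          and len(record["words"]) == len(record["annotations"]))


print(validate_record(example_record))

# Each dataset file is written out as a JSON list of such records
print(json.dumps([example_record]))
```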

In [5]:
# imports
import os
import re
import shutil
import json

from pathlib import PurePath

In [6]:
# Function for parsing the data in dataset files
def parse_data(fp):
  # List of dicts for data {sentence, words, NE annotations}
  dataset = []

  # Iteration variables
  sentence = ""
  words = []
  annotations = []

  # Iterate line by line
  for line in fp:
    # Line is empty (sentence boundary)
    if line in ["\n", "\r\n"]:
      # Write the sentence to the dataset in the specified dict format,
      # skipping empty sentences (e.g. after a skipped -DOCSTART- line)
      if words:
        dataset.append({
            "sentence": sentence.rstrip(),
            "words": words,
            "annotations": annotations
        })

      # Clear the iteration variables
      sentence = ""
      words = []
      annotations = []

      continue

    # Exclusions (document separator lines that need to be filtered out).
    # The line still carries its newline, so match on the prefix.
    if line.startswith("-DOCSTART-"):
      continue

    # Skip malformed lines that do not split into exactly (word, annotation)
    if len(line.split()) != 2:
      continue

    # Build the sentence and add the word with annotation to the current lists
    word, annotation = line.split()
    words.append(word)
    annotations.append(annotation)

    # If the word is a special character, adjust the whitespace around it.
    # Raw strings avoid invalid escape sequences; the matched characters are unchanged.
    rule_leading = re.compile(r'[,.!?;:\-_@%^&*)}=+/?|]')
    rule_trailing = re.compile(r'[<>"\'\-_@#$^&*({=+/|]')

    if len(word) == 1:
      if rule_leading.match(word):
        sentence = sentence.rstrip()
  
      # Add word to sentence without whitespace
      if rule_trailing.match(word):
        sentence += word
        continue

    # Add word+trailing whitespace to sentence
    sentence += f"{word} "

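  # Flush the last sentence if the file does not end with an empty line
  if words:
    dataset.append({
        "sentence": sentence.rstrip(),
        "words": words,
        "annotations": annotations
    })

  return dataset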

In [7]:
# Write out the data
def encode_data(path, filename, output):
  # Export the json file
  json_dump = json.dumps(output)

  with open(PurePath(path, filename), "w") as f:
    f.write(json_dump)

In [8]:
# Set filepaths for read and parse
filepaths = [PurePath(workdir, "wikigold"), PurePath(workdir, "btc")]

# Process all the files
for filepath in filepaths:
  for path, _, files in os.walk(filepath):
    for name in files:
      # Skip subfolders
      if PurePath(path) != filepath:
        continue
      # Compile the absolute filepath
      file = PurePath(filepath, name)

      # Open the file
      with open(file) as fp:
        # Process the file
        dataset = parse_data(fp)

        # Get filename
        filename = f"{name.split('.')[0]}.json"
        
        # Write the parsed data to json file
        encode_data(PurePath(filepath, "parsed"), filename, dataset)


## Flush changes

Only required when running in Google Colaboratory: flush the pending writes and unmount Google Drive.

In [9]:
# Run this at the end

drive.flush_and_unmount()
print('All changes made in this colab session should now be visible in Drive.')

All changes made in this colab session should now be visible in Drive.
