# NER

In this notebook, the parsed datasets are used to run the NER models and compute the performance metrics.

In [1]:
!pip install tensorflow-gpu torch pandas numpy scikit-learn transformers spacy stanza classla nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/



## I/O device registering

The current working directory is `/content` by default. You can also mount your Google Drive to save models and results there.


In [2]:
from google.colab import drive
drive.mount("/content/drive/")

# Access your Drive data using folder '/content/drive/MyDrive'

# Set the working directory
workdir = "/content/drive/MyDrive/ml_ner/datasets"

!ls -lah "$workdir"

Mounted at /content/drive/
total 2.5M
drwx------ 2 root root 4.0K Jan  2 19:23 btc
drwx------ 2 root root 4.0K Dec 30 19:28 CoNLL03
drwx------ 2 root root 4.0K Jan  1 18:43 emtd
-rw------- 1 root root 2.4M Dec 30 19:30 entity-recognition-datasets-master.zip
drwx------ 2 root root 4.0K Jan  2 19:23 wikigold


## File management

Create a `results` folder inside each dataset directory.

The NER output files are written to these folders.

**Both datasets are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).**

In [3]:
# Create the datasets results directories
!mkdir -p "$workdir"'/wikigold/results' "$workdir"'/btc/results'

## Read in and parse

Read in all the annotated data and parse it into a unified format for evaluation.

In [4]:
# imports
import os
import shutil
import json

from pathlib import PurePath

In [5]:
# Read in the data
def decode_data(file):
  # Parse the JSON file into Python objects
  with open(file) as f:
    return json.load(f)

# Write out the data
def encode_data(path, filename, output):
  # Serialize the output as JSON to <path>/<filename>
  with open(PurePath(path, filename), "w") as f:
    json.dump(output, f)
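
To illustrate the read/write round trip, here is a minimal sketch that stores a small dataset as JSON in a temporary directory and loads it again. Only the `sentence` key is taken from the processing loop in this notebook; the `tokens` key, the file name `sample.json`, and the sample entry itself are assumptions made for the example.

```python
import json
import tempfile
from pathlib import PurePath

# Hypothetical parsed dataset: a list of entries, one per sentence.
# Only the "sentence" key is known from this notebook; "tokens" is assumed.
sample = [
    {
        "sentence": "The Mad Capsule Markets is a band.",
        "tokens": ["The", "Mad", "Capsule", "Markets", "is", "a", "band", "."],
    },
]

with tempfile.TemporaryDirectory() as tmpdir:
    target = PurePath(tmpdir, "sample.json")

    # Write the dataset out as JSON
    with open(target, "w") as f:
        json.dump(sample, f)

    # Read it back in
    with open(target) as f:
        restored = json.load(f)

# Compare the restored data with the original
print(restored == sample)
```

Lists, dictionaries, and strings survive the round trip unchanged, which is why plain JSON is sufficient for exchanging tag sequences between notebooks.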

## Check GPU resources

In [6]:
!nvidia-smi

Mon Jan  2 19:48:44 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Check TensorFlow compatibility

In [7]:
import tensorflow as tf
print(f"Tensorflow version: {tf.__version__}")

# Restrict TensorFlow to allocate at most 4 GB of memory on the first GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(f"The system contains '{len(gpus)}' Physical GPUs and '{len(logical_gpus)}' Logical GPUs")
  except RuntimeError as e:
    print(e)
else:
  print("Your system does not contain a GPU that could be used by Tensorflow!")

Tensorflow version: 2.11.0
The system contains '1' Physical GPUs and '1' Logical GPUs


## NER pre-trained model wrappers

Below are the functions that employ the NER methods available in the spaCy, Stanza, Classla, and NLTK systems.

Each system is also initialized and prepared to process sentences.

### spaCy

Model: `en_core_web_trf`.

Format: IOB

In [8]:
# Download the model
!python -m spacy download en_core_web_trf

2023-01-02 19:49:03.313781: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-01-02 19:49:03.313913: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-trf==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.1/en_core_web_trf-3.4.1-py3-none-any.whl (460.3 MB)
     |████████████████████████████████| 460.3 MB 27 kB/s 
✔ Download and installation successful
You can now load the package via spacy.load(

In [9]:
# imports
import spacy

# Load the model
nlp_spacy = spacy.load("en_core_web_trf")

In [10]:
# Function to call
def ner_spacy(sentence):
  # Get the global variable
  global nlp_spacy

  # Process the sentence
  doc = nlp_spacy(sentence)

  # Return list of IOB entities
  return [token.ent_iob_ if not token.ent_type_ else f"{token.ent_iob_}-{token.ent_type_}" for token in doc]

In [11]:
# Test for correct output
print(ner_spacy("010 is the tenth album from Japanese Punk Techno band The Mad Capsule Markets."))

['B-CARDINAL', 'O', 'O', 'B-ORDINAL', 'O', 'O', 'B-NORP', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O']


### Stanza

Model: `en`.

Format: BIOES

In [12]:
# imports
import stanza

# Initialize the pipeline
nlp_stanza = stanza.Pipeline(lang = "en", processors = "tokenize,ner")

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

INFO:stanza:Use device: gpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


In [13]:
# Function to call
def ner_stanza(sentence):
  # Get the global variable
  global nlp_stanza

  # Process the sentence
  doc = nlp_stanza(sentence)

  # Return a list of BIOES entities
  return [token.ner for sent in doc.sentences for token in sent.tokens]

In [14]:
# Test for correct output
print(ner_stanza("010 is the tenth album from Japanese Punk Techno band The Mad Capsule Markets."))

['S-CARDINAL', 'O', 'O', 'S-ORDINAL', 'O', 'O', 'S-NORP', 'B-ORG', 'E-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'E-ORG', 'O']
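
Because spaCy and NLTK emit IOB tags while Stanza emits BIOES tags, the sequences need a common scheme before they can be compared. The following is a minimal sketch of one possible conversion, not necessarily the one used in the evaluation; the helper name `bioes_to_iob` is hypothetical.

```python
def bioes_to_iob(tags):
    """Map BIOES tags to IOB: S- becomes B-, E- becomes I-, the rest is unchanged."""
    prefix_map = {"S": "B", "E": "I"}
    converted = []
    for tag in tags:
        if "-" in tag:
            # Split only on the first hyphen so labels containing hyphens survive
            prefix, label = tag.split("-", 1)
            converted.append(f"{prefix_map.get(prefix, prefix)}-{label}")
        else:
            # "O" carries no prefix and is copied as-is
            converted.append(tag)
    return converted


# Stanza output from the test cell above
stanza_tags = ['S-CARDINAL', 'O', 'O', 'S-ORDINAL', 'O', 'O', 'S-NORP', 'B-ORG',
               'E-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'E-ORG', 'O']
iob_tags = bioes_to_iob(stanza_tags)
print(iob_tags)
```

The conversion is lossless in this direction: single-token entities become `B-` and entity-final tokens become `I-`, so entity boundaries stay recoverable.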


### Classla

Model: `sl`.

Format: IOB

In [15]:
# imports
import classla

# Download the model
classla.download('sl')

# Initialize the pipeline
nlp_classla = classla.Pipeline(lang = "sl", processors = "tokenize,ner")

Downloading https://raw.githubusercontent.com/clarinsi/classla-resources/main/resources_1.0.1.json: 10.3kB [00:00, 2.34MB/s]                   
INFO:classla:Downloading these customized packages for language: sl (Slovenian)...
| Processor | Package  |
------------------------
| tokenize  | standard |
| pos       | standard |
| lemma     | standard |
| depparse  | standard |
| ner       | standard |
| pretrain  | standard |

INFO:classla:File exists: /root/classla_resources/sl/pos/standard.pt.
INFO:classla:File exists: /root/classla_resources/sl/lemma/standard.pt.
INFO:classla:File exists: /root/classla_resources/sl/depparse/standard.pt.
INFO:classla:File exists: /root/classla_resources/sl/ner/standard.pt.
INFO:classla:File exists: /root/classla_resources/sl/pretrain/standard.pt.
INFO:classla:Finished downloading models and saved to /root/classla_resources.
INFO:classla:Loading these models for language: sl (Slovenian):
| Processor | Package  |
------------------------
| tokenize  | sta

In [16]:
# Function to call
def ner_classla(sentence):
  # Get the global variable
  global nlp_classla

  # Process the sentence
  doc = nlp_classla(sentence)

  # Return a list of IOB entities
  return [token.ner for sent in doc.sentences for token in sent.tokens]

In [17]:
# Test for correct output
print(ner_classla("010 is the tenth album from Japanese Punk Techno band The Mad Capsule Markets."))

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O']


### NLTK

Model: built-in chunker (`nltk.chunk.ne_chunk`).

Format: IOB

In [18]:
# imports
import nltk
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

In [19]:
# Function to call
def ner_nltk(sentence):
  # Process the sentence

  # Tokenization
  tokens = nltk.word_tokenize(sentence)

  # Tagging
  tagged_tokens = nltk.pos_tag(tokens)

  # NER
  entities = nltk.chunk.ne_chunk(tagged_tokens)

  # Transform tree to conll tags
  conll_tags = nltk.chunk.tree2conlltags(entities)

  # Return a list of IOB entities
  return [conll_tag[2] for conll_tag in conll_tags]

In [20]:
# Test for correct output
print(ner_nltk("010 is the tenth album from Japanese Punk Techno band The Mad Capsule Markets."))

['O', 'O', 'O', 'O', 'O', 'O', 'B-GPE', 'B-ORGANIZATION', 'I-ORGANIZATION', 'O', 'O', 'B-ORGANIZATION', 'I-ORGANIZATION', 'I-ORGANIZATION', 'O']
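
NLTK uses longer label names (for example `ORGANIZATION`) than the `ORG` seen in the spaCy and Stanza output, so labels may need to be normalized before the systems are compared. The sketch below shows one way to do this; the mapping `NLTK_LABEL_MAP` and the helper `normalize_labels` are assumptions for illustration and should be adapted to the label set of the gold annotations.

```python
# Assumed mapping from NLTK's long label names to shorter OntoNotes-style names.
# Labels that are not listed (for example GPE) are left unchanged.
NLTK_LABEL_MAP = {"ORGANIZATION": "ORG", "LOCATION": "LOC", "FACILITY": "FAC"}


def normalize_labels(tags, label_map=NLTK_LABEL_MAP):
    """Rename the entity label of each tag while keeping its IOB prefix."""
    normalized = []
    for tag in tags:
        prefix, sep, label = tag.partition("-")
        if sep:
            normalized.append(f"{prefix}-{label_map.get(label, label)}")
        else:
            # "O" has no label part
            normalized.append(tag)
    return normalized


nltk_tags = ['O', 'B-GPE', 'B-ORGANIZATION', 'I-ORGANIZATION', 'O']
normalized_tags = normalize_labels(nltk_tags)
print(normalized_tags)
```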


## Main execution

Run the NER taggers over the prepared sentences in the datasets and store the resulting entities/attributes for further analysis.

In [21]:
# List of wrappers to execute
ner_wrappers = [ner_spacy, ner_stanza, ner_classla, ner_nltk]
result_keys = ["spacy_entities", "stanza_entities", "classla_entities", "nltk_entities"]

# Set filepaths for read in
filepaths = [PurePath(workdir, "wikigold"), PurePath(workdir, "btc")]

# Process all the files
for filepath in filepaths:
  for path, _, files in os.walk(PurePath(filepath, "parsed")):
    for name in files:
      # Compile the absolute filepath
      file = PurePath(path, name)

      # Read in the dataset
      dataset = decode_data(file)

      # Iterate over the sentences in the dataset
      for entry in dataset:
        # Run all the NERs and save the results
        for ner_wrapper, result_key in zip(ner_wrappers, result_keys):
          entry[result_key] = ner_wrapper(entry["sentence"])

      # Get the filename and save the data
      filename = f"{file.stem}_results.json"

      encode_data(PurePath(filepath, "results"), filename, dataset)
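
Once the results are stored, the predicted tags can be scored against the gold annotation. The sketch below computes token-level precision, recall, and F1 over non-`O` tags. It is a simplified illustration rather than the evaluation used for the reported metrics: entity-level (span-based) scoring is stricter, the function name `token_level_scores` is hypothetical, and the gold and predicted sequences are assumed to be aligned token by token and to share one tagging scheme.

```python
def token_level_scores(gold_tags, predicted_tags):
    """Token-level precision, recall and F1 over non-"O" tags (exact tag match)."""
    if len(gold_tags) != len(predicted_tags):
        # Different tokenizers can split a sentence differently
        raise ValueError("Tag sequences differ in length; tokenizations do not align")

    # A true positive is a predicted entity tag that equals the gold tag
    true_pos = sum(1 for g, p in zip(gold_tags, predicted_tags) if p != "O" and g == p)
    predicted = sum(1 for p in predicted_tags if p != "O")
    actual = sum(1 for g in gold_tags if g != "O")

    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example sequences, not taken from the datasets
gold = ["O", "B-ORG", "I-ORG", "O", "B-LOC"]
pred = ["O", "B-ORG", "O", "O", "B-LOC"]
precision, recall, f1 = token_level_scores(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The length check matters in practice because each wrapper tokenizes the sentence itself, so its tag list is not guaranteed to match the gold token count.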




## Flush changes

Run this cell only when the notebook is executed in Google Colaboratory.

In [22]:
# Run this at the end

drive.flush_and_unmount()
print('All changes made in this colab session should now be visible in Drive.')

All changes made in this colab session should now be visible in Drive.
