# NLP. Assignment 3. Nested Named Entity Recognition
---

Name: Shulepin Danila

Innopolis email: d.shulepin@innopolis.university

CodaLab nickname: D4n1la

GitHub nickname: D4ni1a

GitHub repository: https://github.com/D4ni1a/nlp_projects/tree/main/Assignment%203

Named Entity Recognition (NER) is a natural language processing task that locates named entities in text and assigns each of them to a category, such as person, organization, or location. Nested named entity recognition is a subtask of NER that locates and classifies entities that may contain other entities (i.e., hierarchically structured entities) in unstructured text.
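
To make nesting concrete, here is a minimal sketch. It assumes annotations stored as `(start, end, label)` triples with inclusive character offsets, which is the layout used later in this notebook. The sentence, the offsets, and the helper `is_nested` are illustrative and not taken from the dataset.

```python
# Hypothetical example: a CITY entity nested inside an ORGANIZATION entity.
text = "Moscow State University"

# Each annotation is (start, end, label) with an inclusive end offset.
annotations = [
    (0, 22, "ORGANIZATION"),  # "Moscow State University"
    (0, 5, "CITY"),           # "Moscow"
]

def is_nested(inner, outer):
    """Return True if `inner` lies inside `outer` and covers a different range."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and inner[:2] != outer[:2]

# Collect (inner text, outer text) for every nested pair.
nested_pairs = [
    (text[i[0]:i[1] + 1], text[o[0]:o[1] + 1])
    for i in annotations
    for o in annotations
    if is_nested(i, o)
]
print(nested_pairs)
```

A flat NER model would have to pick one of the two spans, while a nested model is expected to return both.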

NER is used in many applications, including information extraction, question answering, chatbots, sentiment analysis, and recommendation systems.

### Fine-Tuning a spaCy Model for Named Entity Recognition

spaCy is an open-source natural language processing (NLP) library known for its efficiency across a range of linguistic tasks.

It offers pre-trained models for many tasks, including named entity recognition. spaCy is fast and memory-efficient, which makes it well suited to processing large volumes of text in real time. It also has a user-friendly interface, broad language support, integration with deep learning frameworks, and support for custom training and fine-tuning.

In [2]:
# !pip install gdown
# !pip install thinc==8.2.3

Downloading the dataset

In [3]:
import gdown

# The dataset is hosted on my Google Drive
# As a first step, download each split via gdown
url = 'https://drive.google.com/uc?id=10vGDK96wji8twLD-2wz6XdG7G_foQSbk'
output = 'dev.jsonl'
gdown.download(url, output, quiet=False)

url = 'https://drive.google.com/uc?id=1NjHU20IgEJ1gZD4eCTmmzHnvjpDG_M5h'
output = 'test.jsonl'
gdown.download(url, output, quiet=False)

url = 'https://drive.google.com/uc?id=1Wy0TjYjIUcN6q9pUTZ96CMICDrTodjyy'
output = 'train.jsonl'
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=10vGDK96wji8twLD-2wz6XdG7G_foQSbk
To: C:\Users\dshul\Desktop\NLP\A3\best_cat\dev.jsonl
100%|███████████████████████████████████████████████████████████████████████████████| 588k/588k [00:00<00:00, 2.22MB/s]
Downloading...
From: https://drive.google.com/uc?id=1NjHU20IgEJ1gZD4eCTmmzHnvjpDG_M5h
To: C:\Users\dshul\Desktop\NLP\A3\best_cat\test.jsonl
100%|███████████████████████████████████████████████████████████████████████████████| 507k/507k [00:00<00:00, 1.77MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Wy0TjYjIUcN6q9pUTZ96CMICDrTodjyy
To: C:\Users\dshul\Desktop\NLP\A3\best_cat\train.jsonl
100%|█████████████████████████████████████████████████████████████████████████████| 4.87M/4.87M [00:01<00:00, 4.11MB/s]


'train.jsonl'

In [6]:
train_file = "./train.jsonl"
test_file = "./test.jsonl"
dev_file = "./dev.jsonl"

In [7]:
import json

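def read_jsonl(path):
    """Read a JSON Lines file (one JSON object per line) into a list of dicts."""
    # Explicit UTF-8 avoids decoding errors on systems with another default encoding
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f]

train = read_jsonl(train_file)
test = read_jsonl(test_file)
val = read_jsonl(dev_file)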

The NEREL dataset contains sentences with the following labels: AGE, AWARD, CITY, COUNTRY, CRIME, DATE, DISEASE, DISTRICT, EVENT, FACILITY, FAMILY, IDEOLOGY, LANGUAGE, LAW, LOCATION, MONEY, NATIONALITY, NUMBER, ORDINAL, ORGANIZATION, PENALTY, PERCENT, PERSON, PRODUCT, PROFESSION, RELIGION, STATE_OR_PROVINCE, TIME, WORK_OF_ART.

In [8]:
ner_list = ["AGE", "AWARD", "CITY", "COUNTRY", "CRIME", "DATE", "DISEASE",
            "DISTRICT", "EVENT", "FACILITY", "FAMILY", "IDEOLOGY", "LANGUAGE",
            "LAW", "LOCATION", "MONEY", "NATIONALITY", "NUMBER", "ORDINAL",
            "ORGANIZATION", "PENALTY", "PERCENT", "PERSON", "PRODUCT",
            "PROFESSION", "RELIGION", "STATE_OR_PROVINCE", "TIME", "WORK_OF_ART"
            ]
ner_to_num = {word: str(i+1) for i, word in enumerate(ner_list)}
num_to_ner = {str(i+1): word for i, word in enumerate(ner_list)}

In [15]:
import matplotlib.pyplot as plt

# Calculate frequency of each named entity in the train dataset
count = {label: 0 for label in ner_list}
for sample in train:
    # Each annotation is a (start, end, label) triple
    for start, end, label in sample['ners']:
        count[label] += 1

print("Train dataset frequencies:")
for key, value in count.items():
    print(f"{key} - {value}")

Train dataset frequencies:
AGE - 657
AWARD - 404
CITY - 1261
COUNTRY - 2510
CRIME - 221
DATE - 2689
DISEASE - 220
DISTRICT - 103
EVENT - 3335
FACILITY - 424
FAMILY - 24
IDEOLOGY - 273
LANGUAGE - 54
LAW - 405
LOCATION - 314
MONEY - 179
NATIONALITY - 437
NUMBER - 1107
ORDINAL - 614
ORGANIZATION - 4088
PENALTY - 92
PERCENT - 68
PERSON - 5119
PRODUCT - 245
PROFESSION - 5039
RELIGION - 89
STATE_OR_PROVINCE - 412
TIME - 182
WORK_OF_ART - 270


According to the frequencies above, the labels are highly imbalanced. Only about 15 of them occur frequently: PERSON, PROFESSION, ORGANIZATION, EVENT, DATE, COUNTRY, CITY, NUMBER, AGE, ORDINAL, NATIONALITY, FACILITY, STATE_OR_PROVINCE, LAW, AWARD.
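
The imbalance can be quantified with a short sketch. The counts below are copied from the printed train frequencies. The variable names and the choice of `top_k = 15` are illustrative assumptions.

```python
from collections import Counter

# Train-set label counts, copied from the output above.
counts = Counter({
    "AGE": 657, "AWARD": 404, "CITY": 1261, "COUNTRY": 2510, "CRIME": 221,
    "DATE": 2689, "DISEASE": 220, "DISTRICT": 103, "EVENT": 3335,
    "FACILITY": 424, "FAMILY": 24, "IDEOLOGY": 273, "LANGUAGE": 54,
    "LAW": 405, "LOCATION": 314, "MONEY": 179, "NATIONALITY": 437,
    "NUMBER": 1107, "ORDINAL": 614, "ORGANIZATION": 4088, "PENALTY": 92,
    "PERCENT": 68, "PERSON": 5119, "PRODUCT": 245, "PROFESSION": 5039,
    "RELIGION": 89, "STATE_OR_PROVINCE": 412, "TIME": 182, "WORK_OF_ART": 270,
})

total = sum(counts.values())
top_k = 15
top = counts.most_common(top_k)

# Share of all annotations covered by the top_k most frequent labels.
top_share = sum(n for _, n in top) / total
print(f"Top {top_k} labels cover {top_share:.1%} of {total} annotations")

# Ratio between the most and the least frequent label.
rarest_label, rarest_count = counts.most_common()[-1]
print(f"{top[0][0]} is {top[0][1] / rarest_count:.0f}x more frequent than {rarest_label}")
```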

In [None]:
# Number of most frequent labels to keep (29 keeps every label)
# num = 15
num = 29
acceptance_list = [a for a, b in sorted(count.items(), key=lambda x:-x[1])[:num]]

The dataset must be converted to a specific format before a spaCy model can be trained on it. spaCy treats the end index of a span as exclusive, so it must equal the index of the last character + 1. The end indices of the original dataset are therefore incremented by one.
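
The offset shift can be illustrated with a small sketch. The sentence and the helper names `to_exclusive` and `to_inclusive` are hypothetical. The only assumption carried over from the dataset is that its end offsets are inclusive.

```python
def to_exclusive(span):
    """Convert (start, inclusive_end, label) to (start, exclusive_end, label)."""
    start, end, label = span
    return (start, end + 1, label)

def to_inclusive(span):
    """Inverse conversion, useful when writing predictions back out."""
    start, end, label = span
    return (start, end - 1, label)

text = "Anna lives in Kazan"
inclusive = (14, 18, "CITY")        # offsets of "Kazan", end included
exclusive = to_exclusive(inclusive)

# Python slicing (like spaCy's char_span) expects an exclusive end.
print(text[exclusive[0]:exclusive[1]])
```

The same `- 1` correction has to be applied in the opposite direction when predicted spans are exported.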

In [8]:
# https://ubiai.tools/fine-tuning-spacy-models-customizing-named-entity-recognition/

def convert(data):
    """
    Convert data into the format accepted by spaCy.

    :param data: dictionary-like dataset
    :return: dictionary-like dataset accepted by the spaCy model
    """
    # Class identifiers match the string ids produced by ner_to_num ("1".."29")
    data_spacy = {'classes': list(ner_to_num.values()), 'annotations': []}
    for i in range(len(data)):
        ners = data[i]['ners']
        tmp = {}
        # Extract sentence
        tmp['text'] = data[i]['sentences']
        tmp['entities'] = []
        for j in range(len(ners)):
            start, end, label = ners[j]
            # Append annotations
            if label in acceptance_list:
                new_label = ner_to_num[label]
                # Increment the end index so that it becomes exclusive
                tmp['entities'].append((start, end + 1, new_label))
        if len(tmp['entities']) != 0:
            data_spacy['annotations'].append(tmp)
    return data_spacy

train_data_spacy = convert(train)

In [14]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

# Create DocBin object to hold serialized annotations
nlp_doc = spacy.blank("ru")
doc_bin = DocBin()

The spaCy model is built with the SpanCategorizer, a spaCy component that predicts labeled spans, including overlapping and nested ones.
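
To see why a span-based component matters here, consider what happens when overlapping spans are not allowed, as is the case for flat entity annotations. The pure-Python sketch below keeps the longest span and drops anything overlapping it, which is similar in spirit to `spacy.util.filter_spans`. The function `drop_overlaps` and the sample spans are illustrative assumptions, not spaCy code.

```python
def drop_overlaps(spans):
    """Keep longest spans first and discard spans that overlap a kept one.

    Spans are (start, end, label) tuples with an exclusive end offset.
    """
    # Longest first; ties are broken by the earlier start offset.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        no_overlap = all(end <= k_start or start >= k_end
                         for k_start, k_end, _ in kept)
        if no_overlap:
            kept.append((start, end, label))
    return sorted(kept)

# "Moscow State University" with a nested CITY span.
spans = [(0, 23, "ORGANIZATION"), (0, 6, "CITY")]
flat = drop_overlaps(spans)
print(flat)
```

The nested CITY span is lost after filtering, whereas storing both spans in a span group keeps them available for training.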

In [None]:
from spacy.util import filter_spans

# Building doc for SpanCategorizer
for training_example in tqdm(train_data_spacy['annotations']):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp_doc.make_doc(text)
    ents = []
    # Building spans groups
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        # char_span returns None when the offsets cannot be aligned to tokens
        if span is not None:
            ents.append(span)
    # Alternative to SpanCategorizer
    # filtered_ents = filter_spans(ents)
    # doc.ents = filtered_ents
    doc.spans["sc"] = ents
    doc_bin.add(doc)

doc_bin.to_disk("training_data.spacy") # save the docbin object

In [16]:
# https://spacy.io/usage/training
# Building basic configuration file

base_config = '''[paths]
train = "./training_data.spacy"
dev = "./training_data.spacy"
vectors = null
[system]
gpu_allocator = null

[nlp]
lang = "ru"
pipeline = ["tok2vec","spancat"]
batch_size = 1000

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[initialize]
vectors = ${paths.vectors}'''

with open("base_config.cfg", "w") as f:
    f.write(base_config)
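
The `sizes = [1,2,3]` setting of the n-gram suggester in the config above limits which spans the model can ever predict: candidates are contiguous runs of 1, 2, or 3 tokens. The sketch below enumerates such candidates in plain Python. The function `ngram_candidates` and the whitespace tokenization are simplifying assumptions, not spaCy's implementation.

```python
def ngram_candidates(tokens, sizes):
    """Enumerate (start, end) token spans, end exclusive, for each n-gram size."""
    candidates = []
    for size in sizes:
        for start in range(len(tokens) - size + 1):
            candidates.append((start, start + size))
    return candidates

# Hypothetical six-token entity.
tokens = "Saint Petersburg State University of Economics".split()
candidates = ngram_candidates(tokens, sizes=[1, 2, 3])

print(len(candidates))                   # 6 + 5 + 4 = 15 candidate spans
print((0, len(tokens)) in candidates)    # False: the full entity is too long
```

Under this assumption, entities longer than three tokens can never be suggested, so adding larger sizes may improve recall at the cost of more candidates per document.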

In [20]:
# Initializing configuration file and training model on max of 5000 steps
!python -m spacy init fill-config base_config.cfg config.cfg
!python -m spacy train config.cfg --output ./ --training.max_steps 5000 --paths.train ./training_data.spacy --paths.dev ./training_data.spacy --gpu-id 0

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
ℹ Saving to output directory: .
ℹ Using GPU: 0
[2024-04-27 19:42:37,732] [INFO] Set up nlp object from config
[2024-04-27 19:42:37,773] [INFO] Pipeline: ['tok2vec', 'spancat']
[2024-04-27 19:42:37,779] [INFO] Created vocabulary
[2024-04-27 19:42:37,779] [INFO] Finished initializing nlp object
[2024-04-27 19:42:50,927] [INFO] Initialized pipeline components: ['tok2vec', 'spancat']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ------------  ----------  ----------  ----------  ------
  0       0        312.99       7599.03        0.45        0.23

Predicting NER spans for the test set with the best model.

In [48]:
# Loading best model
nlp_ner = spacy.load("./model-best")

output = []
for i in tqdm(range(len(test))):
    item = test[i]
    tmp = {}
    tmp["id"] = item['id']
    sentence = item['sentences']
    # Predicting NER-spans for test set
    ner_doc = nlp_ner(sentence)
    span = ner_doc.spans['sc']
    ners = []
    for j in range(len(span)):
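        # Map the numeric class id back to the original label name
        label = num_to_ner[span[j].label_]
        start = span[j].start_char
        # Convert spaCy's exclusive end index back to an inclusive one
        end = span[j].end_char - 1
        ners.append([start, end, label])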
    tmp["ners"] = ners
    output.append(tmp)

# Saving the results as JSON Lines (one valid JSON object per line)
# !mkdir ./output/
with open('./output/test.jsonl', 'w', encoding='utf-8') as f:
    for item in output:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

100%|██████████| 65/65 [00:01<00:00, 35.39it/s]
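
Each line of a `.jsonl` file is expected to be a standalone JSON document. The sketch below shows why `json.dumps` is a safer choice than string-formatting a Python dict when writing such a file. The record and the helper `is_valid_json` are hypothetical.

```python
import json

# Hypothetical prediction record in the (start, end, label) layout.
record = {"id": 0, "ners": [[0, 5, "CITY"]]}

as_repr = f"{record}"                              # Python repr, single quotes
as_json = json.dumps(record, ensure_ascii=False)   # valid JSON, double quotes

def is_valid_json(line):
    """Return True if `line` parses as JSON."""
    try:
        json.loads(line)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json(as_repr))   # False
print(is_valid_json(as_json))   # True
```

Passing `ensure_ascii=False` keeps Cyrillic text readable in the output file instead of escaping it.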
