# NLP. Assignment 3. Nested Named Entity Recognition
---

Name: Shulepin Danila

Innopolis email: d.shulepin@innopolis.university

CodaLab nickname: D4n1la

GitHub nickname: D4ni1a

GitHub repository: https://github.com/D4ni1a/nlp_projects/tree/main/Assignment%203

Named Entity Recognition (NER) is a natural language processing task that locates named entities in text and assigns each one to a category such as person, organization, or location. Nested named entity recognition is a subtask of NER that locates and classifies entities that may contain other entities (i.e., hierarchically structured entities) in unstructured text.
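
To make "nested" concrete, here is a small illustrative sketch. The sentence, its spans, and the helper `is_nested` are hypothetical and not taken from the dataset. Spans are assumed to use a `[start, end, label]` layout with character offsets and an inclusive end index, which is how the training data is sliced in this notebook.

```python
# Hypothetical example sentence; character offsets, end index inclusive.
text = "Moscow State University opened in 1755."
spans = [
    [0, 22, "ORGANIZATION"],  # "Moscow State University"
    [0, 5, "CITY"],           # "Moscow", nested inside the organization
    [34, 37, "DATE"],         # "1755"
]


def is_nested(inner, outer):
    """Return True if `inner` lies within `outer` and the two spans differ."""
    return (outer[0] <= inner[0] and inner[1] <= outer[1]
            and inner[:2] != outer[:2])


for start, end, label in spans:
    print(f"{label:13s} {text[start:end + 1]!r}")

# Collect (inner text, outer text) for every nested pair of spans.
nested_pairs = [
    (text[i[0]:i[1] + 1], text[o[0]:o[1] + 1])
    for i in spans for o in spans if is_nested(i, o)
]
print(nested_pairs)  # [('Moscow', 'Moscow State University')]
```

A flat NER tagger would have to choose between the CITY and the ORGANIZATION reading of "Moscow"; a nested one keeps both.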

NER underpins many applications, including information extraction, question answering, chatbots, sentiment analysis, and recommendation systems.

### Fine-tuning LLMs with LoRA

A large language model (LLM) is a language model that can generate text for a wide range of contexts and carry out many different NLP tasks. In brief, LLMs are large pretrained transformer models that have been trained to predict the next token given the input text.

Parameter-Efficient Fine-Tuning (PEFT) is a family of methods for adapting models with limited compute and budget. PEFT is a good fit for domain-specific tasks that require modifying a model: it adapts the pre-trained model to a specific task by training only a small number of parameters while preserving the knowledge the model already has. Parameter-efficient fine-tuning can be achieved in several ways; LoRA (Low-Rank Adaptation) is one of the most popular and useful.

The goal of LoRA is to minimize the number of trainable parameters by using low-rank representations. The pretrained model's parameters are all kept frozen; instead, the update to each adapted weight matrix is expressed as the product of two small low-rank matrices, and only those matrices are trained. After training, the low-rank matrices can be merged back into the original weights. Consequently, a LoRA adapter has far fewer trainable parameters, making it cheaper to train and store.

LoRA: https://arxiv.org/pdf/2106.09685
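
The idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the `peft` implementation: the layer size, the rank `r`, the scaling factor `alpha`, and the function name `lora_forward` are all hypothetical. Following the paper, the frozen weight `W` is augmented with a scaled product `B @ A`, where `B` starts at zero so the adapted layer initially behaves exactly like the original one.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in = 512, 512   # hypothetical layer size
r, alpha = 8, 16         # hypothetical rank and scaling factor

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init


def lora_forward(x, W, A, B, alpha, r):
    """Frozen path plus the scaled low-rank path."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)


# Stand-in for training: pretend the optimizer has changed B.
B = rng.normal(size=(d_out, r)) * 0.01

x = rng.normal(size=(4, d_in))
y_adapter = lora_forward(x, W, A, B, alpha, r)

# Merging folds the low-rank update into the dense weight matrix.
W_merged = W + (alpha / r) * (B @ A)
y_merged = x @ W_merged.T

full_params = W.size
lora_params = A.size + B.size
print(np.allclose(y_adapter, y_merged))  # True
print(full_params, lora_params)          # 262144 8192
```

In this toy layer the adapter trains 8,192 values instead of 262,144, and the merged weight reproduces the adapter's output, which is why a merged model needs no extra parameters at inference time.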

In [1]:
# !pip install gdown
# !pip install -q peft transformers datasets evaluate seqeval accelerate bitsandbytes trl

In [2]:
import numpy as np
import pandas as pd
import os
import gdown
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from trl import setup_chat_format
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings("ignore")


Downloading the dataset

In [3]:
import gdown

# The dataset is stored on my Google Drive
# First, download each split with gdown
url = 'https://drive.google.com/uc?id=10vGDK96wji8twLD-2wz6XdG7G_foQSbk'
output = 'dev.jsonl'
gdown.download(url, output, quiet=False)

url = 'https://drive.google.com/uc?id=1NjHU20IgEJ1gZD4eCTmmzHnvjpDG_M5h'
output = 'test.jsonl'
gdown.download(url, output, quiet=False)

url = 'https://drive.google.com/uc?id=1Wy0TjYjIUcN6q9pUTZ96CMICDrTodjyy'
output = 'train.jsonl'
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=10vGDK96wji8twLD-2wz6XdG7G_foQSbk
To: /kaggle/working/dev.jsonl
100%|██████████| 588k/588k [00:00<00:00, 123MB/s]
Downloading...
From: https://drive.google.com/uc?id=1NjHU20IgEJ1gZD4eCTmmzHnvjpDG_M5h
To: /kaggle/working/test.jsonl
100%|██████████| 507k/507k [00:00<00:00, 116MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Wy0TjYjIUcN6q9pUTZ96CMICDrTodjyy
To: /kaggle/working/train.jsonl
100%|██████████| 4.87M/4.87M [00:00<00:00, 260MB/s]


'train.jsonl'

In [4]:
train_file = "./train.jsonl"
test_file = "./test.jsonl"
dev_file = "./dev.jsonl"

In [5]:
import json


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts (one JSON object per line)"""
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f]


train = read_jsonl(train_file)
test = read_jsonl(test_file)
val = read_jsonl(dev_file)

The NEREL dataset contains sentences annotated with the following labels: AGE, AWARD, CITY, COUNTRY, CRIME, DATE, DISEASE, DISTRICT, EVENT, FACILITY, FAMILY, IDEOLOGY, LANGUAGE, LAW, LOCATION, MONEY, NATIONALITY, NUMBER, ORDINAL, ORGANIZATION, PENALTY, PERCENT, PERSON, PRODUCT, PROFESSION, RELIGION, STATE_OR_PROVINCE, TIME, WORK_OF_ART.

In [6]:
ner_list = ["AGE", "AWARD", "CITY", "COUNTRY", "CRIME", "DATE", "DISEASE",
            "DISTRICT", "EVENT", "FACILITY", "FAMILY", "IDEOLOGY", "LANGUAGE",
            "LAW", "LOCATION", "MONEY", "NATIONALITY", "NUMBER", "ORDINAL",
            "ORGANIZATION", "PENALTY", "PERCENT", "PERSON", "PRODUCT",
            "PROFESSION", "RELIGION", "STATE_OR_PROVINCE", "TIME", "WORK_OF_ART"
            ]
ner_to_num = {word: str(i+1) for i, word in enumerate(ner_list)}
num_to_ner = {str(i+1): word for i, word in enumerate(ner_list)}

In [7]:
# Choose device to train on
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [22]:
def generate_prompt(data_point):
    """
    Generate a text prompt from a single data point of the training data
    
    :param data_point: single data point of a training dataset
    :return: text prompt
    """
    return f"""
            Act as an expert analyst with more than 20 years of experience in linguistics. 
            You need to extract nested named entities from the following text:
            "{data_point["sentences"]}" and write it at Output.

            Output = {data_point["ners"]}.
            """.strip()


def generate_test_prompt(data_point):
    """
    Generate a text prompt from a single data point of the test data
    
    :param data_point: single data point of a test dataset
    :return: text prompt
    """
    return f"""
            Act as an expert analyst with more than 20 years of experience in linguistics. 
            You need to extract nested named entities from the following text:
            "{data_point["senences"]}" and write it at Output.

            Output = .
            """.strip()

def train_prompter(x):
    """
    Convert a single data point of the training dataset into a prompt
    
    :param x: single data point of the training dataset
    :return: text prompt
    """
    text, ners = x['sentences'], x['ners']
    tmp = []
    for start, end, label in ners:
        # Replace span boundaries (inclusive end) by the text they cover
        tmp.append([text[start:end + 1], label])
    return generate_prompt({'sentences':text, 'ners':tmp})
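
The prompt stores entities as `[word, label]` pairs rather than offsets, so predictions must later be mapped back to character positions. The sketch below illustrates that round trip and one limitation of it. The helper names `spans_to_words` and `words_to_spans` and the sample sentence are hypothetical; the backward mapping assumes the first occurrence of a word is the intended one, which is also what a first-match search over the text does.

```python
def spans_to_words(text, spans):
    """[start, end, label] with inclusive end -> [word, label]."""
    return [[text[start:end + 1], label] for start, end, label in spans]


def words_to_spans(text, pairs):
    """[word, label] -> [start, end, label], using the first occurrence of word."""
    spans = []
    for word, label in pairs:
        start = text.find(word)
        if start == -1:  # the word does not occur in the text; skip it
            continue
        spans.append([start, start + len(word) - 1, label])
    return spans


# Hypothetical sentence in which "Paris" occurs twice.
text = "Paris Hilton visited Paris."
gold = [[0, 11, "PERSON"], [21, 25, "CITY"]]

pairs = spans_to_words(text, gold)
restored = words_to_spans(text, pairs)
print(pairs)     # [['Paris Hilton', 'PERSON'], ['Paris', 'CITY']]
print(restored)  # [[0, 11, 'PERSON'], [0, 4, 'CITY']]
```

The CITY span comes back at offset 0 instead of 21: once offsets are replaced by words, repeated strings become ambiguous, so some correct predictions may be scored as wrong spans.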

In [9]:
# Convert train and test data
X_train = pd.DataFrame([train_prompter(x) for x in train], columns=["text"])
train_data = Dataset.from_pandas(X_train)
X_test = pd.DataFrame([generate_test_prompt(x) for x in test], columns=["text"])
test_data = Dataset.from_pandas(X_test)

Creating a BitsAndBytesConfig object and loading Llama2 LLM with quantization config

In [10]:
# Model path from Kaggle, as this notebook was run on Kaggle
# https://www.kaggle.com/datasets/lizhecheng/llama2-7b-hf
model_name = "/kaggle/input/llama2-7b-hf/Llama2-7b-hf"
compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # load the model weights in 4-bit precision
    bnb_4bit_quant_type="nf4", # NormalFloat4 quantization type
    bnb_4bit_compute_dtype=compute_dtype, # run computations in float16
    bnb_4bit_use_double_quant=True, # also quantize the quantization constants
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model, tokenizer = setup_chat_format(model, tokenizer)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
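
A rough back-of-the-envelope calculation shows why 4-bit loading matters on a single GPU. The parameter count below is an assumed round figure for a "7B" model, and the helper `weight_gib` is hypothetical; the estimate covers weight storage only and ignores quantization metadata, any layers kept in higher precision, activations, and optimizer state.

```python
n_params = 7e9  # assumed round figure for a "7B" model


def weight_gib(n_params, bits_per_param):
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    return n_params * bits_per_param / 8 / 2**30


fp16_gib = weight_gib(n_params, 16)
nf4_gib = weight_gib(n_params, 4)
print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```

Under these assumptions the weights shrink by a factor of four, from roughly 13 GiB to a little over 3 GiB.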

In [11]:
output_dir = "trained_weigths"

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_arguments = TrainingArguments(
    output_dir=output_dir,  # directory to save and repository id
    num_train_epochs=2,  # number of training epochs
    per_device_train_batch_size=1,  # batch size
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,  # gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,  # log every 25 steps
    learning_rate=2e-4,  # learning rate
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,  # max gradient norm
    max_steps=-1,
    warmup_ratio=0.03,  # warmup ratio
    group_by_length=True,
    lr_scheduler_type="cosine",  # cosine learning rate scheduler
    report_to="tensorboard",
    evaluation_strategy="epoch",
)

# Build trainer for Llama LLM model
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=train_data,  # NOTE: evaluated on the training set; the dev split (val) is not used here
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=1024,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    },
)

Map:   0%|          | 0/129 [00:00<?, ? examples/s]

Map:   0%|          | 0/129 [00:00<?, ? examples/s]

In [12]:
# Training
trainer.train()

Epoch,Training Loss,Validation Loss
0,No log,1.195784
1,No log,1.137321


TrainOutput(global_step=16, training_loss=1.2363789081573486, metrics={'train_runtime': 4936.4174, 'train_samples_per_second': 0.052, 'train_steps_per_second': 0.003, 'total_flos': 1.0362271594364928e+16, 'train_loss': 1.2363789081573486, 'epoch': 1.97})

In [13]:
# Saving weights
trainer.save_model()
tokenizer.save_pretrained(output_dir)

('trained_weigths/tokenizer_config.json',
 'trained_weigths/special_tokens_map.json',
 'trained_weigths/tokenizer.model',
 'trained_weigths/added_tokens.json',
 'trained_weigths/tokenizer.json')

In [14]:
# delete and call garbage collector to free memory
import gc

del [
    model,
    tokenizer,
    peft_config,
    trainer,
    train_data,
    bnb_config,
    training_arguments,
]
del [X_train]
del [TrainingArguments, SFTTrainer, LoraConfig, BitsAndBytesConfig]

# empty cuda cache several times
for _ in range(300):
    torch.cuda.empty_cache()
    gc.collect()

Loading model and saved weights

In [15]:
from peft import AutoPeftModelForCausalLM

finetuned_model = "./trained_weigths/"
compute_dtype = getattr(torch, "float16")
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/llama2-7b-hf/Llama2-7b-hf")

# Load model
model = AutoPeftModelForCausalLM.from_pretrained(
    finetuned_model,
    torch_dtype=compute_dtype,
    return_dict=False,
    low_cpu_mem_usage=True,
    device_map=device,
)

merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    "./merged_model", safe_serialization=True, max_shard_size="2GB"
)
tokenizer.save_pretrained("./merged_model")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


('./merged_model/tokenizer_config.json',
 './merged_model/special_tokens_map.json',
 './merged_model/tokenizer.model',
 './merged_model/added_tokens.json',
 './merged_model/tokenizer.json')

Prediction of the NER-spans for the test set on the best model.

In [61]:
from ast import literal_eval
import re

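def predict(test, model, tokenizer):
    """
    Predicting the test results
    
    :param test: test data
    :param model: trained model
    :param tokenizer: tokenizer for model
    :return: list of dictionaries with spans
    """
    # Build the generation pipeline once instead of once per example
    pipe = pipeline(
        task="text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=256,  # assumed budget; a single new token cannot hold an entity list
        do_sample=False,  # greedy decoding
    )
    y_pred = []
    for sample in tqdm(test):
        idx = sample['id']
        prompt = generate_test_prompt(sample)
        text = sample["senences"]
        result = pipe(prompt)
        # Keep only the part generated after the last 'Output ='
        generated = result[0]["generated_text"].split('Output =')[-1]
        # Parse the outermost [...] list; fall back to no entities if it is malformed
        try:
            answer = literal_eval(generated[generated.index('['):generated.rindex(']') + 1])
        except (ValueError, SyntaxError):
            answer = []
        answer_ner = []
        # Replace each word by its character indexes in the text
        for out in answer:
            if not isinstance(out, (list, tuple)) or len(out) != 2:
                continue
            word, label = out
            if not isinstance(word, str) or not word:
                continue
            # Escape the word so regex metacharacters are matched literally
            match = re.search(re.escape(word), text)
            if match is None:  # the predicted word does not occur in the text
                continue
            answer_ner.append([match.start(), match.end() - 1, label])
        y_pred.append({'id': idx, 'ners': answer_ner})
    return y_pred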

In [60]:
# Predict NER-spans
y_pred = predict(test, merged_model, tokenizer)

# Save results as JSON Lines (one JSON object per line)
os.makedirs('./output', exist_ok=True)
with open('./output/test.jsonl', 'w') as f:
    for pred in y_pred:
        f.write(json.dumps(pred) + '\n')

100%|██████████| 65/65 [00:00<00:00, 418143.80it/s]
