# **Model Training using Annotated Datasets**

This notebook was created after the `fine_tuning_for_RE.ipynb` Google Colab notebook, and most of its code is reused from there. Its sole purpose is to test how SpanBERT and RoBERTa perform on relation extraction (RE) without any training or fine-tuning. A more comprehensive outline of the source code can be found in that notebook.

The important points for base case testing are:
1. Only the `final_test.xlsx` file is required, and it must be in the root directory of the mounted drive.
2. There is no training and no model saving; each model is imported from HuggingFace and goes straight to testing.
3. Input formatting and tokenization still follow the fine-tuning notebook, including the maximum sequence length of 128 tokens.
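
The input formatting in point 3 amounts to replacing each entity mention in a chunk with its entity type in square brackets. A minimal sketch of that step on a made-up sentence follows; the helper name `mask_entities` and the type names `PERSON` and `ORG` are illustrative assumptions, not values taken from the dataset.

```python
def mask_entities(chunk, entity1, entity2, entity_type1, entity_type2):
    """Replace both entity mentions with their bracketed entity types."""
    masked = chunk.replace(entity1, f"[{entity_type1}]")
    return masked.replace(entity2, f"[{entity_type2}]")

# Hypothetical example row; the real rows come from final_test.xlsx.
example = mask_entities(
    chunk="Ada Lovelace worked with Acme Labs.",
    entity1="Ada Lovelace",
    entity2="Acme Labs",
    entity_type1="PERSON",
    entity_type2="ORG",
)
print(example)  # [PERSON] worked with [ORG].
```

Because `str.replace` substitutes every occurrence, an entity string that appears more than once in a chunk is masked everywhere it occurs.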

In general, these trials show that both models perform poorly on the downstream RE task without any training on the given dataset.

## **Preliminaries**

In [None]:
!pip install pandas openpyxl



In [None]:
# Mount Google Drive to read the Excel test file
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

## **Base Case SpanBERT**

In [None]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import Dataset
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

test_file = "/content/drive/MyDrive/final_test.xlsx"

SPANBERT_MODEL = "SpanBERT/spanbert-base-cased"

def load_test_data(file_path):
    df = pd.read_excel(file_path)
    df = df.rename(columns={
        "Chunk": "chunk",
        "Entity 1": "entity1",
        "Entity 2": "entity2",
        "Entity 1 Type": "entity_type1",
        "Entity 2 Type": "entity_type2",
        "relation": "relation"
    })
    return Dataset.from_pandas(df)

def preprocess_test_data(batch, tokenizer, model_name, max_seq_length):
    # Replace each entity mention with its entity type in square brackets.
    # The input format is the same for every model, so model_name is unused here.
    inputs = []
    for chunk, entity1, entity2, entity_type1, entity_type2 in zip(
        batch["chunk"], batch["entity1"], batch["entity2"], batch["entity_type1"], batch["entity_type2"]
    ):
        input_text = chunk.replace(entity1, f"[{entity_type1}]").replace(entity2, f"[{entity_type2}]")
        inputs.append(input_text)

    # Pad to the longest sequence and truncate anything beyond max_seq_length tokens
    return tokenizer(inputs, padding=True, truncation=True, max_length=max_seq_length, return_tensors="pt")

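def compute_metrics_base_case(predictions, labels):
    # Predicted class = index of the largest logit in each row
    preds = predictions.argmax(dim=1).cpu().numpy()
    labels = labels.cpu().numpy()
    # Macro averaging weights every relation class equally;
    # zero_division=0 scores undefined per-class metrics as 0 without a warning
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}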

test_dataset = load_test_data(test_file)

relations = test_dataset["relation"]
# Sort the labels so the label-to-id mapping is the same on every run
label_to_id = {label: idx for idx, label in enumerate(sorted(set(relations)))}
id_to_label = {v: k for k, v in label_to_id.items()}
test_dataset = test_dataset.map(lambda example: {"label": label_to_id[example["relation"]]})

tokenizer = AutoTokenizer.from_pretrained(SPANBERT_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(SPANBERT_MODEL, num_labels=len(label_to_id))
model.eval()

max_seq_length = 128
test_encodings = preprocess_test_data(test_dataset, tokenizer, SPANBERT_MODEL, max_seq_length)
labels = torch.tensor(test_dataset["label"])

with torch.no_grad():    # gradients are not needed for evaluation; skipping them also reduces memory use
    outputs = model(**test_encodings)
    logits = outputs.logits

metrics = compute_metrics_base_case(logits, labels)
print(metrics)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at SpanBERT/spanbert-base-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



{'accuracy': 0.3978779840848806, 'f1': 0.3413916032088351, 'precision': 0.3783372984403912, 'recall': 0.3314935717902186}
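
The cell above pushes the whole test set through the model in a single forward pass, which is likely what strains memory. One common alternative is to run inference in mini-batches. The sketch below shows only the slicing logic; `iter_batches` and the batch size are assumptions, not part of the original notebook. The helper works on any mapping whose values can be sliced (plain lists here, tensors in the notebook).

```python
def iter_batches(encodings, batch_size):
    """Yield dicts holding consecutive slices of every field in `encodings`."""
    n_rows = len(next(iter(encodings.values())))
    for start in range(0, n_rows, batch_size):
        yield {key: value[start:start + batch_size] for key, value in encodings.items()}

# Toy stand-in for tokenizer output (5 rows); real values would be tensors.
toy_encodings = {
    "input_ids": [[101, 7], [101, 8], [101, 9], [101, 10], [101, 11]],
    "attention_mask": [[1, 1]] * 5,
}
batch_sizes = [len(batch["input_ids"]) for batch in iter_batches(toy_encodings, 2)]
print(batch_sizes)  # [2, 2, 1]
```

In the notebook, this would run inside `torch.no_grad()`, collecting `model(**batch).logits` for each batch and joining the results with `torch.cat` before computing the metrics.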


## **Base Case RoBERTa**

In [None]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import Dataset
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

test_file = "/content/drive/MyDrive/final_test.xlsx"

ROBERTA_MODEL = "roberta-base"

def load_test_data(file_path):
    df = pd.read_excel(file_path)
    df = df.rename(columns={
        "Chunk": "chunk",
        "Entity 1": "entity1",
        "Entity 2": "entity2",
        "Entity 1 Type": "entity_type1",
        "Entity 2 Type": "entity_type2",
        "relation": "relation"
    })
    return Dataset.from_pandas(df)

def preprocess_test_data(batch, tokenizer, model_name, max_seq_length):
    inputs = []
    for chunk, entity1, entity2, entity_type1, entity_type2 in zip(
        batch["chunk"], batch["entity1"], batch["entity2"], batch["entity_type1"], batch["entity_type2"]
    ):
        input_text = chunk.replace(entity1, f"[{entity_type1}]").replace(entity2, f"[{entity_type2}]")
        inputs.append(input_text)

    return tokenizer(inputs, padding=True, truncation=True, max_length=max_seq_length, return_tensors="pt")

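def compute_metrics_base_case(predictions, labels):
    # Predicted class = index of the largest logit in each row
    preds = predictions.argmax(dim=1).cpu().numpy()
    labels = labels.cpu().numpy()
    # Macro averaging weights every relation class equally;
    # zero_division=0 scores undefined per-class metrics as 0 without a warning
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}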

test_dataset = load_test_data(test_file)

relations = test_dataset["relation"]
# Sort the labels so the label-to-id mapping is the same on every run
label_to_id = {label: idx for idx, label in enumerate(sorted(set(relations)))}
id_to_label = {v: k for k, v in label_to_id.items()}
test_dataset = test_dataset.map(lambda example: {"label": label_to_id[example["relation"]]})

tokenizer = AutoTokenizer.from_pretrained(ROBERTA_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(ROBERTA_MODEL, num_labels=len(label_to_id))
model.eval()

max_seq_length = 128
test_encodings = preprocess_test_data(test_dataset, tokenizer, ROBERTA_MODEL, max_seq_length)
labels = torch.tensor(test_dataset["label"])

with torch.no_grad():    # gradients are not needed for evaluation; skipping them also reduces memory use
    outputs = model(**test_encodings)
    logits = outputs.logits


metrics = compute_metrics_base_case(logits, labels)
print(metrics)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'accuracy': 0.5172413793103449, 'f1': 0.3166251791506374, 'precision': 0.347313596491228, 'recall': 0.3845305086615911}
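
RoBERTa's accuracy is higher than SpanBERT's, yet its macro F1 is lower. That pattern typically appears when predictions concentrate on a frequent class: accuracy rewards this, while macro averaging weights every class equally. The toy example below illustrates the effect with made-up labels (not the notebook's data); `accuracy` and `macro_f1` are hand-written helpers for illustration, not the scikit-learn functions used above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

# Made-up labels: class 0 is frequent, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0] * 10  # a classifier that always predicts the frequent class

print(accuracy(y_true, y_pred))            # 0.6
print(round(macro_f1(y_true, y_pred), 3))  # 0.25
```

Whether this is what happens in the run above would need to be checked against the per-class predictions, for example with a confusion matrix.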


