# **Model Fine-tuning using Annotated Datasets**

## **A. Preliminaries**

The source code depends on several libraries, which must be installed first.

In [None]:
!pip install pandas openpyxl



In [None]:
# Mounting Drive for loading .xlsx files and saving models
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

In [None]:
import os
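os.environ["WANDB_DISABLED"] = "true"       # Disable Weights & Biases logging to avoid the API-key prompt during fine-tuning
# Note: newer transformers releases flag this variable as deprecated and suggest report_to="none" instead.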

## **B. Fine-tuning**

The following assumes that Google Drive is mounted and that the dataset files (train, dev, and test) are stored in the Drive root directory. It also assumes that a directory named "MODEL" exists in the Drive root, where each model trained during the grid search is saved for documentation and experimentation purposes.
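The layout assumed above can be checked before starting a multi-hour run. The following is a minimal sketch, not part of the original notebook: `check_drive_layout` is a hypothetical helper, and the three file names are taken from the paths used in the fine-tuning cell.

```python
from pathlib import Path


def check_drive_layout(root):
    """Report missing dataset files and make sure the MODEL directory exists.

    `root` is the Drive root directory (an assumption: /content/drive/MyDrive in Colab).
    """
    root = Path(root)
    if not root.is_dir():
        # Most likely the Drive has not been mounted yet.
        raise FileNotFoundError(f"Drive root not found: {root}")

    expected = ["final_train.xlsx", "final_dev.xlsx", "final_test.xlsx"]
    missing = [name for name in expected if not (root / name).is_file()]

    # Create the MODEL directory if it is not there yet.
    model_dir = root / "MODEL"
    model_dir.mkdir(exist_ok=True)
    return missing, model_dir
```

In Colab this would be called as `check_drive_layout("/content/drive/MyDrive")`; an empty `missing` list means all three splits were found.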

Running the code on a T4 GPU is advised, as it greatly speeds up training and optimization. The expected runtime of the source code is approximately 4 hours on a T4 GPU.
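To put the runtime estimate in perspective, the size of the grid can be counted directly. This sketch reuses the learning rates, batch sizes, and model names from the fine-tuning cell. The per-run figure is simply the 4-hour estimate divided by the number of runs, so it is only a rough average.

```python
import itertools

# Values copied from the fine-tuning cell.
learning_rates = [5e-6, 1e-5, 2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
models = ["SpanBERT/spanbert-base-cased", "roberta-base"]

# Every (model, learning rate, batch size) combination is one training run.
runs = list(itertools.product(models, learning_rates, batch_sizes))
total_runs = len(runs)

# Rough average, assuming the 4-hour estimate covers the whole grid.
minutes_per_run = 4 * 60 / total_runs

print(f"{total_runs} runs, about {minutes_per_run:.0f} minutes each")
```

Each run trains for 8 epochs, so adding more values to either list lengthens the total runtime proportionally.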

Further details on each section of the code are given in the code's comments.

An overview of the whole code is outlined here:

1. The dataset (.xlsx), subdivided into *train*, *dev*, and *test*, is loaded into dataframes and then into a single dataset dictionary using the `load_excel_data()` function.
2. Relation labels are mapped to integer IDs, in this case 0, 1, and 2.
3. The parameter values are prepared for hyperparameter optimization.
4. A list, `results_summary`, is initialized to collect the results of each grid search iteration.
5. The grid search starts, varying the combinations of learning rate and batch size.
6. For training, the data is first formatted by masking each specific entity with its entity type, as done in `preprocess_data()`. The formatting is identical for SpanBERT and RoBERTa, because the code was recently updated to follow the conventional input formatting for RE fine-tuning. The newly formatted chunk is stored as `input_text`.
7. After formatting, the raw text columns are dropped so that only the encoded inputs and `labels` remain. Note that the `labels` column holds the integer ID corresponding to each unique relation label.
8. The text is then tokenized; both models go through the same tokenization code, as is done for BERT.
9. The pre-trained model is loaded from HuggingFace and trained with the specified parameters. The fine-tuned model is then saved to persistent storage.
10. After training and validation, each fine-tuned model is evaluated on the test set with the assigned metrics, and the results are logged in `results_summary`. The loop then moves on to train another model with different parameters.
11. After all iterations, the results are printed in a single table for analysis and discussion.

In [None]:
import itertools
import os
import shutil
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from datasets import Dataset, DatasetDict
import pandas as pd
import torch

# Labeled Dataset loaded from Google drive
train_file = "/content/drive/MyDrive/final_train.xlsx"
dev_file = "/content/drive/MyDrive/final_dev.xlsx"
test_file = "/content/drive/MyDrive/final_test.xlsx"

# Models from HuggingFace transformers
SPANBERT_MODEL = "SpanBERT/spanbert-base-cased"
ROBERTA_MODEL = "roberta-base"

# Function for loading the Excel files into
# dataframes and collecting them in one DatasetDict
def load_excel_data(train_file, dev_file, test_file):
    def process_file(file_path):

        df = pd.read_excel(file_path)

        # Rename the columns of interest (three-sentence chunk, entity pair, entity types, relation label)
        df = df.rename(columns={
            "Chunk": "chunk",
            "Entity 1": "entity1",
            "Entity 2": "entity2",
            "Entity 1 Type": "entity_type1",
            "Entity 2 Type": "entity_type2",
            "relation": "relation"
        })

        return Dataset.from_pandas(df)

    datasets = DatasetDict({
        "train": process_file(train_file),
        "dev": process_file(dev_file),
        "test": process_file(test_file)
    })
    return datasets


# Preprocessing the data
# Preparing it for input formatting for both models
def preprocess_data(batch, tokenizer, model_name, label_to_id, max_seq_length):
    inputs = []
    for chunk, entity1, entity2, entity_type1, entity_type2 in zip(
        batch["chunk"], batch["entity1"], batch["entity2"], batch["entity_type1"], batch["entity_type2"]
    ):
        # SpanBERT and RoBERTa currently share the same input format (the conventional
        # BERT-style entity masking), so both branches are identical: each entity
        # mention is replaced by its bracketed entity type.
        if "spanbert" in model_name:
            input_text = chunk.replace(entity1, f"[{entity_type1}]").replace(entity2, f"[{entity_type2}]")
        else:
            input_text = chunk.replace(entity1, f"[{entity_type1}]").replace(entity2, f"[{entity_type2}]")
        inputs.append(input_text)

    encodings = tokenizer(inputs, padding=True, truncation=True, max_length=max_seq_length)
    encodings["labels"] = [label_to_id[label] for label in batch["relation"]]
    return encodings

# Function for performance evaluation using scikit-learn metrics (macro-averaged)
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="macro")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

# The General Training Function
def train_model(model_name, tokenizer, datasets, output_dir, best_model_save_path, num_labels, preprocess_fn, max_seq_length, lr, batch_size):
    # Tokenize datasets
    tokenized_datasets = datasets.map(
        lambda batch: preprocess_fn(batch, tokenizer, model_name, label_to_id, max_seq_length),   # Apply the preprocessing and formatting defined earlier (label_to_id is read from the global scope)
        batched=True
    )
    tokenized_datasets = tokenized_datasets.remove_columns(["chunk", "entity1", "entity2", "relation"])  # Drop the raw text columns once the inputs are encoded

    # Load model
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)   # Loading the model from HuggingFace using the path specified

    # Training arguments and parameters
    # which are repeatedly modified during optimization
    training_args = TrainingArguments(
        output_dir=output_dir,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=lr,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=8,                        # 8 epochs for all iterations
        weight_decay=0.01,
        save_total_limit=2,
        logging_dir=f"{output_dir}/logs",
        load_best_model_at_end=True,
        metric_for_best_model="f1",                # f1 here is set as the metric
        fp16=False,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["dev"],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    # Train and save
    trainer.train()
    # Because load_best_model_at_end=True, the trainer now holds the best checkpoint
    # (highest dev F1), so both calls below save that best model.
    trainer.save_model(output_dir)

    # Save the best model to the specific location
    trainer.save_model(best_model_save_path)

    # Optional: Move the best model to a specific location using shutil
    specific_location = "/content/drive/MyDrive/MODEL/"  # model saving path in drive
    if os.path.exists(best_model_save_path):
        shutil.move(best_model_save_path, os.path.join(specific_location, os.path.basename(best_model_save_path)))

    # Evaluate on test set
    test_results = trainer.evaluate(tokenized_datasets["test"])
    return test_results



# MAIN CODE STARTS HERE
datasets = load_excel_data(train_file, dev_file, test_file)

# Map each relation label (e.g. `has_taxon`) to an integer ID.
# Sorting keeps the mapping reproducible across sessions (set order is not guaranteed).
label_to_id = {label: idx for idx, label in enumerate(sorted(set(datasets["train"]["relation"])))}
for split in datasets.keys():
    datasets[split] = datasets[split].map(lambda example: {"label": label_to_id[example["relation"]]})

# Hyperparameters to be tested using grid search
max_seq_length = 128
learning_rates = [5e-6, 1e-5, 2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
hyperparameter_combinations = list(itertools.product(learning_rates, batch_sizes))

# Train Models with Hyperparameter Tuning
results_summary = []

for model_name, tokenizer_name, model_output_dir in [
    (SPANBERT_MODEL, SPANBERT_MODEL, "./spanbert_model"),
    (ROBERTA_MODEL, ROBERTA_MODEL, "./roberta_model")
]:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    print(f"Now training {model_name}")

    for lr, batch_size in hyperparameter_combinations:
        print(f"Experimenting {model_name} with LR={lr}, Batch Size={batch_size}")

        # Train the model
        best_model_path = f"{model_output_dir}_best_lr{lr}_bs{batch_size}"
        test_results = train_model(
            model_name=model_name,
            tokenizer=tokenizer,
            datasets=datasets,
            output_dir=f"{model_output_dir}_lr{lr}_bs{batch_size}",
            best_model_save_path=best_model_path,
            num_labels=len(label_to_id),
            preprocess_fn=preprocess_data,
            max_seq_length=max_seq_length,
            lr=lr,
            batch_size=batch_size,
        )

        # Log results
        results_summary.append({
            "model": model_name,
            "learning_rate": lr,
            "batch_size": batch_size,
            "accuracy": test_results["eval_accuracy"],
            "f1": test_results["eval_f1"],
            "precision": test_results["eval_precision"],
            "recall": test_results["eval_recall"],
        })

# Print the summary stored in results_summary
results_df = pd.DataFrame(results_summary)
print(results_df)


Now training SpanBERT/spanbert-base-cased
Experimenting SpanBERT/spanbert-base-cased with LR=5e-06, Batch Size=16



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at SpanBERT/spanbert-base-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,1.192917,0.492647,0.35288,0.336001,0.406108
2,No log,1.066454,0.610294,0.598669,0.682459,0.581689
3,0.899600,0.961666,0.573529,0.5928,0.615279,0.578493
4,0.899600,0.949512,0.588235,0.599419,0.630014,0.581728
5,0.899600,0.922882,0.602941,0.615846,0.636111,0.602209
6,0.509800,0.924051,0.625,0.640299,0.67623,0.621053
7,0.509800,0.899822,0.661765,0.674896,0.70219,0.657453
8,0.390400,0.934331,0.632353,0.646097,0.680236,0.626576
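
Because the training arguments set `load_best_model_at_end=True` with `metric_for_best_model="f1"`, the model evaluated on the test set should be the checkpoint with the highest dev F1 rather than the one from the last epoch. The sketch below uses only the log above to show which epoch that would be for this run. It parses the table by hand and does not call the Trainer.

```python
import csv
import io

# Per-epoch dev metrics copied from the log above (SpanBERT, LR=5e-06, batch size 16).
log = """Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,1.192917,0.492647,0.35288,0.336001,0.406108
2,No log,1.066454,0.610294,0.598669,0.682459,0.581689
3,0.899600,0.961666,0.573529,0.5928,0.615279,0.578493
4,0.899600,0.949512,0.588235,0.599419,0.630014,0.581728
5,0.899600,0.922882,0.602941,0.615846,0.636111,0.602209
6,0.509800,0.924051,0.625,0.640299,0.67623,0.621053
7,0.509800,0.899822,0.661765,0.674896,0.70219,0.657453
8,0.390400,0.934331,0.632353,0.646097,0.680236,0.626576
"""

rows = list(csv.DictReader(io.StringIO(log)))

# Pick the epoch with the highest dev F1, mirroring metric_for_best_model="f1".
best = max(rows, key=lambda row: float(row["F1"]))
best_epoch = int(best["Epoch"])

print(f"Best epoch by dev F1: {best_epoch} (F1={best['F1']})")
```

For this run the dev F1 peaks at epoch 7 and drops again at epoch 8, so keeping the best checkpoint instead of the last one matters here.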




Experimenting SpanBERT/spanbert-base-cased with LR=5e-06, Batch Size=32




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,1.236097,0.397059,0.189474,0.132353,0.333333
2,No log,1.141174,0.5,0.365812,0.333333,0.410006
3,No log,1.01639,0.573529,0.559388,0.668827,0.53553
4,No log,0.959372,0.610294,0.614752,0.635034,0.605783
5,No log,0.969538,0.617647,0.629182,0.67509,0.607719
6,0.865800,0.963088,0.610294,0.623318,0.670428,0.601871
7,0.865800,0.939245,0.625,0.646636,0.677911,0.627888
8,0.865800,0.955566,0.617647,0.635272,0.674129,0.61488
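
The epoch 1 row of this run (accuracy 0.397059, precision 0.132353, recall 0.333333, F1 0.189474) is consistent with the model predicting a single class for all 136 dev examples. The sketch below reimplements macro averaging in plain Python to show how those figures would arise. It is an illustration under stated assumptions, not the notebook's `compute_metrics`: the count of 54 is inferred from 0.397059 = 54/136, and the 41/41 split of the remaining examples is made up, since it does not affect the result.

```python
def macro_scores(labels, preds):
    """Return (accuracy, macro F1, macro precision, macro recall).

    Classes that are never predicted get precision 0, which is also how the
    macro average treats them when the score would otherwise be undefined.
    """
    classes = sorted(set(labels) | set(preds))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
    return accuracy, sum(f1s) / n, sum(precisions) / n, sum(recalls) / n


# Assumed dev set: 54 examples of class 0, the other 82 split evenly (hypothetical).
labels = [0] * 54 + [1] * 41 + [2] * 41
preds = [0] * 136  # the model predicts class 0 for everything

accuracy, f1, precision, recall = macro_scores(labels, preds)
print(round(accuracy, 6), round(f1, 6), round(precision, 6), round(recall, 6))
# 0.397059 0.189474 0.132353 0.333333
```

Macro averaging gives each class equal weight, so two classes with zero precision and recall pull the averages down to one third of the single class that is predicted.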




Experimenting SpanBERT/spanbert-base-cased with LR=1e-05, Batch Size=16




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.969882,0.669118,0.693644,0.713889,0.695517
2,No log,0.731609,0.720588,0.745389,0.742042,0.749149
3,0.720500,0.728859,0.735294,0.738521,0.759837,0.724392
4,0.720500,0.71508,0.75,0.760269,0.786329,0.749435
5,0.720500,0.746195,0.757353,0.778315,0.778627,0.778713
6,0.322100,0.785308,0.742647,0.751425,0.77835,0.736751
7,0.322100,0.929219,0.735294,0.737602,0.774306,0.718207
8,0.217800,0.964225,0.727941,0.731841,0.769543,0.712359


Experimenting SpanBERT/spanbert-base-cased with LR=1e-05, Batch Size=32




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,1.124313,0.485294,0.354962,0.324015,0.397986
2,No log,1.029843,0.617647,0.606987,0.709558,0.596322
3,No log,0.908741,0.654412,0.654567,0.721801,0.631098
4,No log,0.804958,0.698529,0.708736,0.738563,0.706225
5,No log,0.857159,0.654412,0.673294,0.720612,0.653229
6,0.653200,0.905314,0.654412,0.667546,0.715857,0.645419
7,0.653200,0.88443,0.661765,0.685042,0.718758,0.665588
8,0.653200,0.94948,0.654412,0.667898,0.717927,0.645419




Experimenting SpanBERT/spanbert-base-cased with LR=2e-05, Batch Size=16




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.945613,0.669118,0.684911,0.724465,0.681196
2,No log,0.82229,0.676471,0.694445,0.73127,0.687044
3,0.579700,0.824602,0.742647,0.763745,0.780154,0.753996
4,0.579700,0.99543,0.625,0.633706,0.633373,0.655231
5,0.579700,0.770595,0.727941,0.736736,0.774255,0.723756
6,0.242500,0.856214,0.75,0.751663,0.802778,0.740975
7,0.242500,1.084868,0.691176,0.719955,0.748136,0.713385
8,0.148800,1.138026,0.683824,0.705847,0.736941,0.692242


Experimenting SpanBERT/spanbert-base-cased with LR=2e-05, Batch Size=32




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.981857,0.683824,0.699319,0.725724,0.692242
2,No log,0.89556,0.698529,0.722067,0.736794,0.726719
3,No log,0.907725,0.661765,0.67216,0.725029,0.652242
4,No log,0.689834,0.735294,0.740113,0.771635,0.723418
5,No log,0.972699,0.698529,0.713922,0.763642,0.697102
6,0.483600,1.042361,0.669118,0.677254,0.746575,0.659064
7,0.483600,1.01462,0.705882,0.713876,0.76087,0.695465
8,0.483600,1.086791,0.661765,0.670768,0.730626,0.652891


Experimenting SpanBERT/spanbert-base-cased with LR=3e-05, Batch Size=16




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.899093,0.691176,0.717527,0.722376,0.719896
2,No log,0.842254,0.705882,0.726042,0.757945,0.733541
3,0.551300,0.751825,0.742647,0.767561,0.777446,0.761481
4,0.551300,0.712832,0.772059,0.77386,0.795007,0.7813
5,0.551300,0.858087,0.705882,0.72549,0.724204,0.729318
6,0.247200,0.875282,0.757353,0.779474,0.789321,0.773502
7,0.247200,1.092554,0.698529,0.724249,0.755733,0.719883
8,0.150000,1.072566,0.742647,0.766138,0.781852,0.762456


Experimenting SpanBERT/spanbert-base-cased with LR=3e-05, Batch Size=32




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.885996,0.683824,0.709888,0.723902,0.714698
2,No log,0.822519,0.691176,0.720612,0.730621,0.720221
3,No log,0.806927,0.713235,0.743353,0.754612,0.737765
4,No log,0.723762,0.720588,0.733386,0.76294,0.718558
5,No log,0.75247,0.757353,0.786905,0.789508,0.788148
6,0.431700,0.962056,0.691176,0.715413,0.751818,0.706225
7,0.431700,0.938649,0.705882,0.729829,0.755567,0.716946
8,0.431700,1.116792,0.691176,0.720294,0.75141,0.713385


Experimenting SpanBERT/spanbert-base-cased with LR=5e-05, Batch Size=16




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.993244,0.676471,0.701352,0.706099,0.7082
2,No log,0.587322,0.735294,0.755168,0.76036,0.754984
3,0.507900,0.657587,0.742647,0.770308,0.774644,0.776777
4,0.507900,0.805045,0.772059,0.791214,0.78519,0.798545
5,0.507900,0.889489,0.727941,0.729258,0.750953,0.760858
6,0.232600,0.981251,0.698529,0.701208,0.703463,0.742339
7,0.232600,0.802794,0.764706,0.791961,0.800606,0.794971
8,0.138900,0.96961,0.764706,0.791961,0.800606,0.794971


Experimenting SpanBERT/spanbert-base-cased with LR=5e-05, Batch Size=32




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.955961,0.654412,0.676206,0.702131,0.691956
2,No log,0.823224,0.713235,0.734246,0.749686,0.738739
3,No log,0.705014,0.713235,0.751039,0.752125,0.751761
4,No log,0.845442,0.683824,0.712079,0.73189,0.697778
5,No log,0.71816,0.727941,0.76564,0.769801,0.771917
6,0.386400,1.140927,0.691176,0.708517,0.799623,0.708174
7,0.386400,1.035565,0.705882,0.743063,0.755597,0.747212
8,0.386400,1.197375,0.676471,0.703441,0.754957,0.702664



Now training roberta-base
Experimenting roberta-base with LR=5e-06, Batch Size=16



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.866591,0.536765,0.54809,0.538384,0.563249
2,No log,0.655572,0.654412,0.669799,0.654531,0.718973
3,0.628400,0.574611,0.713235,0.719792,0.707193,0.758596
4,0.628400,0.661828,0.705882,0.717028,0.741052,0.747875
5,0.628400,0.588508,0.705882,0.709252,0.733114,0.756985
6,0.326800,0.587786,0.676471,0.679663,0.690873,0.716985
7,0.326800,0.588581,0.698529,0.702899,0.707937,0.742014
8,0.247000,0.601843,0.683824,0.686827,0.696078,0.722833


Experimenting roberta-base with LR=5e-06, Batch Size=32




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,1.031799,0.433824,0.3167,0.294118,0.352827
2,No log,0.796506,0.713235,0.734714,0.754588,0.746225
3,No log,0.694919,0.661765,0.694657,0.696449,0.694555
4,No log,0.730162,0.654412,0.67334,0.685967,0.713125
5,No log,0.604497,0.720588,0.75184,0.740421,0.772255
6,0.577600,0.642871,0.713235,0.740144,0.749597,0.74525
7,0.577600,0.632054,0.742647,0.773283,0.769439,0.791098
8,0.577600,0.629065,0.735294,0.767481,0.759342,0.7846




Experimenting roberta-base with LR=1e-05, Batch Size=16




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.741678,0.669118,0.70403,0.701813,0.707888
2,No log,0.656039,0.705882,0.710559,0.715698,0.762508
3,0.527900,0.606289,0.698529,0.704695,0.706148,0.756335
4,0.527900,0.67167,0.661765,0.664293,0.706296,0.705627
5,0.527900,0.666112,0.698529,0.702241,0.708658,0.742989
6,0.262600,0.739164,0.705882,0.710604,0.712022,0.748187
7,0.262600,0.992938,0.683824,0.683501,0.703501,0.723808
8,0.181600,1.003991,0.683824,0.683501,0.703501,0.723808


Experimenting roberta-base with LR=1e-05, Batch Size=32




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.817268,0.654412,0.691087,0.681585,0.703028
2,No log,0.730308,0.727941,0.752299,0.790914,0.758895
3,No log,0.529342,0.75,0.777991,0.766405,0.794672
4,No log,0.578216,0.735294,0.756894,0.766003,0.780377
5,No log,0.572041,0.794118,0.811502,0.811113,0.829435
6,0.457800,0.614269,0.786765,0.809499,0.797627,0.825211
7,0.457800,0.61467,0.772059,0.797881,0.795173,0.81514
8,0.457800,0.564559,0.816176,0.833997,0.822604,0.849253


Experimenting roberta-base with LR=2e-05, Batch Size=16




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.773801,0.720588,0.742092,0.75138,0.743938
2,No log,0.665237,0.727941,0.729309,0.778797,0.713658
3,0.500700,0.736355,0.683824,0.676652,0.694773,0.701027
4,0.500700,0.940693,0.683824,0.682017,0.708884,0.723821
5,0.500700,0.494729,0.816176,0.826709,0.83694,0.819636
6,0.242000,0.840549,0.705882,0.698045,0.712698,0.718895
7,0.242000,0.834247,0.713235,0.705622,0.721521,0.725068
8,0.147600,1.117043,0.676471,0.670116,0.697734,0.695828


Experimenting roberta-base with LR=2e-05, Batch Size=32


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.707618,0.661765,0.704292,0.69084,0.722547
2,No log,0.597849,0.683824,0.712371,0.724247,0.705913
3,No log,0.740351,0.654412,0.688259,0.697228,0.680572
4,No log,0.7308,0.669118,0.700141,0.719868,0.685432
5,No log,0.527461,0.75,0.77248,0.781777,0.778726
6,0.402600,0.684174,0.742647,0.758747,0.787095,0.747485
7,0.402600,0.691288,0.786765,0.797315,0.848684,0.784522
8,0.402600,0.707465,0.779412,0.79158,0.839702,0.77835


Experimenting roberta-base with LR=3e-05, Batch Size=16


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.8283,0.639706,0.654196,0.678632,0.640559
2,No log,0.770999,0.588235,0.603797,0.597162,0.659181
3,0.513000,1.131499,0.676471,0.661724,0.684524,0.680533
4,0.513000,0.847679,0.661765,0.659103,0.672126,0.690643
5,0.513000,0.90271,0.705882,0.69889,0.699041,0.718571
6,0.250400,0.737721,0.786765,0.785361,0.775264,0.819025
7,0.250400,0.913856,0.698529,0.690567,0.704301,0.712723
8,0.156500,0.946784,0.742647,0.750013,0.78604,0.74065


Experimenting roberta-base with LR=3e-05, Batch Size=32


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.77102,0.573529,0.608365,0.603826,0.626342
2,No log,0.598799,0.683824,0.711261,0.711143,0.711449
3,No log,0.531799,0.727941,0.744331,0.748698,0.740676
4,No log,0.560838,0.720588,0.719734,0.762205,0.6987
5,No log,0.384044,0.779412,0.804879,0.801738,0.813177
6,0.376400,0.637611,0.720588,0.737496,0.765309,0.737427
7,0.376400,0.645578,0.727941,0.748344,0.769906,0.750435
8,0.376400,0.583268,0.727941,0.748344,0.769906,0.750435


Experimenting roberta-base with LR=5e-05, Batch Size=16


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.767162,0.639706,0.635655,0.718743,0.612242
2,No log,0.697024,0.639706,0.655679,0.639245,0.691982
3,0.487300,1.0008,0.602941,0.605871,0.60938,0.64256
4,0.487300,1.056421,0.610294,0.620879,0.616569,0.669565
5,0.487300,1.174828,0.647059,0.639603,0.658602,0.679597
6,0.253400,1.449857,0.639706,0.625858,0.667524,0.651293
7,0.253400,1.504304,0.647059,0.639967,0.678419,0.679597
8,0.154600,1.74381,0.654412,0.642759,0.684509,0.678609


Experimenting roberta-base with LR=5e-05, Batch Size=32


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.657289,0.669118,0.678827,0.665506,0.714724
2,No log,0.815458,0.610294,0.608139,0.739361,0.598285
3,No log,1.029751,0.661765,0.663086,0.666663,0.689344
4,No log,1.021951,0.632353,0.610099,0.663078,0.623314
5,No log,0.691882,0.713235,0.694339,0.691917,0.699363
6,0.380500,1.221044,0.625,0.615216,0.66462,0.647083
7,0.380500,1.306088,0.639706,0.637085,0.670894,0.673424
8,0.380500,1.379842,0.639706,0.63933,0.662087,0.672775


                           model  learning_rate  batch_size  accuracy  \
0   SpanBERT/spanbert-base-cased       0.000005          16  0.610080   
1   SpanBERT/spanbert-base-cased       0.000005          32  0.618037   
2   SpanBERT/spanbert-base-cased       0.000010          16  0.694960   
3   SpanBERT/spanbert-base-cased       0.000010          32  0.628647   
4   SpanBERT/spanbert-base-cased       0.000020          16  0.639257   
5   SpanBERT/spanbert-base-cased       0.000020          32  0.668435   
6   SpanBERT/spanbert-base-cased       0.000030          16  0.618037   
7   SpanBERT/spanbert-base-cased       0.000030          32  0.652520   
8   SpanBERT/spanbert-base-cased       0.000050          16  0.612732   
9   SpanBERT/spanbert-base-cased       0.000050          32  0.625995   
10                  roberta-base       0.000005          16  0.503979   
11                  roberta-base       0.000005          32  0.503979   
12                  roberta-base       0.000010    


---

These final results are discussed further in the Conference Paper Draft. One thing worth noting is that performance drops sharply between the validation metrics logged during training and the final test-set results, which may indicate overfitting given the small number of training and dev samples (3,128 and 136 examples, respectively).
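
The per-epoch logs themselves already hint at this. As a minimal sketch (not part of the original notebook), the snippet below parses the epoch log of the roberta-base run with LR=2e-05 and Batch Size=16, copied verbatim from the output above, and compares the epoch with the lowest validation loss against the final epoch. The helper name `parse_log` and the choice of validation loss as the selection criterion are assumptions made for illustration only.

```python
import csv
import io

# Epoch log copied from the roberta-base run (LR=2e-05, Batch Size=16) above.
LOG = """Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.773801,0.720588,0.742092,0.75138,0.743938
2,No log,0.665237,0.727941,0.729309,0.778797,0.713658
3,0.500700,0.736355,0.683824,0.676652,0.694773,0.701027
4,0.500700,0.940693,0.683824,0.682017,0.708884,0.723821
5,0.500700,0.494729,0.816176,0.826709,0.83694,0.819636
6,0.242000,0.840549,0.705882,0.698045,0.712698,0.718895
7,0.242000,0.834247,0.713235,0.705622,0.721521,0.725068
8,0.147600,1.117043,0.676471,0.670116,0.697734,0.695828
"""


def parse_log(text):
    """Hypothetical helper: turn the CSV-style epoch log into a list of dicts."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {}
        for key, value in record.items():
            # "No log" means no training loss had been reported yet.
            row[key] = None if value == "No log" else float(value)
        row["Epoch"] = int(row["Epoch"])
        rows.append(row)
    return rows


rows = parse_log(LOG)

# Assumption: the "best" epoch is the one with the lowest validation loss.
best = min(rows, key=lambda r: r["Validation Loss"])
last = rows[-1]

print(f"Best epoch by validation loss: {best['Epoch']}")
print(f"Accuracy at best epoch:  {best['Accuracy']:.6f}")
print(f"Accuracy at final epoch: {last['Accuracy']:.6f}")
print(f"Accuracy lost by training to the end: {best['Accuracy'] - last['Accuracy']:.6f}")
```

For this run, the lowest validation loss occurs at epoch 5 (accuracy 0.816176), while epoch 8 ends at 0.676471 even though the training loss keeps falling from 0.500700 to 0.147600. That pattern is consistent with the overfitting suspected above. If the training cell uses the Hugging Face `Trainer`, its `load_best_model_at_end` option is one way to keep the best checkpoint rather than the last one. Whether the notebook already does so is not shown in this output.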