# **SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification - The Base Model (Model 1)**

Model 1 Notebook Author:

- Nawar Turk - https://www.linkedin.com/in/nawart/

Paper Authors:
- Nawar Turk *
- Eeham Khan *
- Leila Kosseim

*Equal contribution



# 1. Introduction


## 1.1. Overview


This notebook addresses the SemEval-2025 Task 6 (PromiseEval) challenge, which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. The competition consists of four subtasks: promise identification, supporting evidence assessment, clarity evaluation, and verification timing.

We have developed three distinct model architectures of increasing complexity:

* Model 1 (This notebook): Base ESG-BERT with task-specific classification heads
* Model 2: Enhanced ESG-BERT with linguistic features
* Model 3: Combined multi-objective approach with attention mechanisms

## 1.2. The Base Model (Model 1)

Our base model employs ESG-BERT, a domain-specific BERT variant pre-trained on environmental, social, and governance texts. For each subtask, we use the same underlying ESG-BERT architecture but add separate classification heads trained independently:

* Task 1 (Promise Identification): Binary classification (Yes/No)
* Task 2 (Supporting Evidence): Binary classification (Yes/No)
* Task 3 (Clarity Evaluation): Multi-class classification (Clear, Not Clear, Misleading, N/A)
* Task 4 (Verification Timeline): Multi-class classification (Less than 2 years, 2 to 5 years, More than 5 years, Already, N/A)

We implement partial fine-tuning: the embeddings and lower transformer layers are frozen, and only the top two transformer layers and the task-specific classification head are trained. This keeps the pre-trained ESG knowledge in the frozen layers intact while letting the top layers adapt to the specific nuances of promise verification.
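As a rough illustration of the freezing rule, the sketch below decides from a parameter name alone whether that parameter stays trainable. It assumes the Hugging Face BERT naming scheme (`bert.encoder.layer.<i>.`, `classifier.`) and a 12-layer encoder; `is_trainable` is a hypothetical helper written for this note, not a function from the notebook, which applies the same rule by toggling `requires_grad` on the model.

```python
import re

def is_trainable(param_name: str, num_layers: int = 12, n_top: int = 2) -> bool:
    """Return True if the parameter stays trainable under top-N partial fine-tuning.

    Assumes Hugging Face BERT parameter names and a 12-layer encoder (assumption).
    """
    # The classification head is always trained.
    if param_name.startswith("classifier."):
        return True
    # Encoder layers are indexed from 0; only the last n_top are trained.
    match = re.match(r"bert\.encoder\.layer\.(\d+)\.", param_name)
    if match:
        return int(match.group(1)) >= num_layers - n_top
    # Everything else under the backbone (embeddings, pooler) stays frozen.
    return False

example_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.9.attention.self.query.weight",
    "bert.encoder.layer.10.attention.self.query.weight",
    "bert.encoder.layer.11.output.dense.weight",
    "bert.pooler.dense.weight",
    "classifier.weight",
]
for name in example_names:
    print(f"{name}: {'trainable' if is_trainable(name) else 'frozen'}")
```

With these defaults, only encoder layers 10 and 11 and the classifier are reported as trainable.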

## 1.3. Links

* Task Website: https://sites.google.com/view/promiseeval/promiseeval

* Competition Page: https://www.kaggle.com/competitions/sem-eval-2025-promise-eval-english/overview

* Our Repository and Paper: https://github.com/CLaC-Lab/SemEval-2025-Task6

* This Notebook: https://colab.research.google.com/drive/1qlOs2B7PWvADnD3TaIluonRC5XqfmJbG?usp=sharing

# 2. Environment Setup & Data Loading


## 2.1. Install dependencies

In [None]:
%%capture
!pip install --upgrade pip

# Pin all your working versions
!pip install \
    torch==2.6.0+cu124 \
    numpy==2.0.2 \
    pandas==2.2.2 \
    scikit-learn==1.6.1 \
    datasets==3.5.1 \
    optuna==4.3.0 \
    flair==0.15.1 \
    spacy==3.8.5 \
    transformers==4.49.0

# Download the spaCy English model that matches spacy v3.8.5
!python -m spacy download en_core_web_sm

## 2.2. Import libraries

In [None]:
import json
import torch
import numpy as np
from sklearn.model_selection import StratifiedKFold
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback
from datasets import Dataset
import optuna
import os
import random
import pandas as pd

## 2.3. Define Paths

In [None]:
train_data_path = 'PromiseEval_Trainset_English.json'
test_data_path = 'test_data_unlabelled.json'

# Output folders (one per task)
task1_output_path = './promise_task1_processed_Base-Model'
task2_output_path = './promise_task2_processed_Base-Model'
task3_output_path = './promise_task3_processed_Base-Model'
task4_output_path = './promise_task4_processed_Base-Model'

os.makedirs(task1_output_path, exist_ok=True)
os.makedirs(task2_output_path, exist_ok=True)
os.makedirs(task3_output_path, exist_ok=True)
os.makedirs(task4_output_path, exist_ok=True)

## 2.4. Training & Testing Datasets



### 2.4.1. Getting the Data





Contact the task organizers of SemEval-2025 PromiseEval to obtain the training and test data: https://groups.google.com/g/promiseeval

- Train data: "PromiseEval_Trainset_English.json"
- Test data: "english_submission_file.json"

### 2.4.2. Restructuring Test Data



We need to restructure the SemEval 2025 PromiseEval test data to match both our model's input format and the competition submission requirements.

In [None]:
# Load the original test file
with open("english_submission_file.json", "r") as file:
    original_data = json.load(file)

# Create new structure
restructured_data = {}

# Determine the total number of items (assuming all fields have the same count)
total_items = len(original_data["data"])

# For each index, create a corresponding entry
for idx in range(total_items):
    idx_str = str(idx)
    restructured_data[idx_str] = {
        "ID": original_data["ID"].get(idx_str, ""),
        "URL": original_data["URL"].get(idx_str, ""),
        "page_number": original_data["page_number"].get(idx_str, ""),
        "data": original_data["data"].get(idx_str, ""),
        "promise_status": "",
        "verification_timeline":"",
        "evidence_status":"",
        "evidence_quality":""
    }

# Save the restructured data
with open("english_submission_file_restructured_unlabeled.json", "w") as file:
    json.dump(restructured_data, file, indent=2)

print(f"Restructured data saved with {len(restructured_data)} items.")

Restructured data saved with 400 items.
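To make the shape change concrete, here is a minimal sketch of the same column-to-record conversion on a two-item toy input. The sample IDs, URLs, page numbers, and sentences are invented for illustration, and `restructure` is a hypothetical helper that mirrors the loop above rather than a function from the notebook.

```python
# Label fields that the prediction step fills in later.
EMPTY_LABELS = {
    "promise_status": "",
    "verification_timeline": "",
    "evidence_status": "",
    "evidence_quality": "",
}

def restructure(columnar):
    """Turn {"field": {"0": value, ...}} into {"0": {"field": value, ...}}."""
    records = {}
    for idx in range(len(columnar["data"])):
        key = str(idx)
        record = {
            field: columnar[field].get(key, "")
            for field in ("ID", "URL", "page_number", "data")
        }
        record.update(EMPTY_LABELS)  # labels stay blank until prediction
        records[key] = record
    return records

# Toy input in the column-oriented layout (all values are made up).
toy_columnar = {
    "ID": {"0": "doc-a", "1": "doc-b"},
    "URL": {"0": "https://example.com/a.pdf", "1": "https://example.com/b.pdf"},
    "page_number": {"0": 3, "1": 7},
    "data": {
        "0": "We will cut emissions by 2030.",
        "1": "Our offices are located in Montreal.",
    },
}

toy_records = restructure(toy_columnar)
print(toy_records["1"])
```

Each record now carries its own input fields plus four empty label slots.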


# 3. The Four Tasks, Train & Predict

## 3.1. Task1, Promise Identification

This is a binary label (Yes/No) indicating whether the text contains a promise.
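The short sketch below shows how the `promise_status` field maps to the binary training target. The three records are invented for illustration; only the field names `data` and `promise_status` and the Yes-to-1, No-to-0 mapping are taken from the notebook.

```python
from collections import Counter

# Toy records in the training-file layout (contents are made up).
toy_records = [
    {"data": "We commit to net-zero emissions by 2040.", "promise_status": "Yes"},
    {"data": "The company was founded in 1987.", "promise_status": "No"},
    {"data": "We will halve water use within five years.", "promise_status": "Yes"},
]

texts = [record["data"] for record in toy_records]
# Same encoding as the training cell: "Yes" -> 1, anything else -> 0.
labels = [1 if record["promise_status"] == "Yes" else 0 for record in toy_records]

class_counts = Counter(labels)
print(labels)
print(dict(class_counts))
```

Checking the class counts this way is a quick sanity check on label balance before running cross-validation.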


### 3.1.1. Task1, Train

In [None]:
if not torch.cuda.is_available():
    raise RuntimeError("This script requires a GPU to run")

# Create task-specific paths
task1_model_path = os.path.join(task1_output_path, "task1_final_model")
task1_hyperparams_path = os.path.join(task1_output_path, "task1_best_hyperparameters.json")
task1_config_path = os.path.join(task1_output_path, "task1_config.json")  # label_mapping

def set_seed(seed_value=42):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    torch.backends.cudnn.deterministic = True

set_seed()

def prepare_model_for_training():
    model = AutoModelForSequenceClassification.from_pretrained(
        "nbroad/ESG-BERT",
        num_labels=2,
        ignore_mismatched_sizes=True
    )

    # Reset the classifier
    model.classifier = torch.nn.Linear(model.config.hidden_size, 2)

    # Freeze all layers except last N transformer layers
    N = 2
    for param in model.bert.parameters():
        param.requires_grad = False
    for param in model.bert.encoder.layer[-N:].parameters():
        param.requires_grad = True
    for param in model.classifier.parameters():
        param.requires_grad = True

    return model

# Load and prepare dataset
print("Loading dataset...")
with open(train_data_path, 'r') as file:
    data = json.load(file)

# Preprocess dataset
print("Processing texts...")
texts = [item['data'] for item in data]
labels = [1 if item["promise_status"] == "Yes" else 0 for item in data]

tokenizer = AutoTokenizer.from_pretrained("nbroad/ESG-BERT")

def tokenize_function(examples):
    return tokenizer(examples, truncation=True, padding="max_length", max_length=512)


# Hyperparameter optimization setup
n_splits = 4
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 12])
    weight_decay = trial.suggest_float("weight_decay", 0.01, 0.3)

    fold_losses = []
    for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
        print(f"Trial {trial.number}, Fold {fold + 1}/{n_splits}")

        # Prepare fold data
        train_texts = [texts[i] for i in train_idx]
        val_texts = [texts[i] for i in val_idx]
        train_labels = [labels[i] for i in train_idx]
        val_labels = [labels[i] for i in val_idx]

        train_encodings = tokenize_function(train_texts)
        val_encodings = tokenize_function(val_texts)

        train_dataset = Dataset.from_dict({**train_encodings, "labels": train_labels})
        val_dataset = Dataset.from_dict({**val_encodings, "labels": val_labels})

        model = prepare_model_for_training()
        model.to("cuda")

        trial_dir = os.path.join(task1_output_path, f'trial_{trial.number}_fold_{fold}')

        training_args = TrainingArguments(
            output_dir=trial_dir,
            eval_strategy="epoch",  # current name of the deprecated `evaluation_strategy`
            save_strategy="epoch",
            save_total_limit=1,
            logging_strategy="steps",
            logging_steps=50,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            num_train_epochs=10,
            learning_rate=learning_rate,
            weight_decay=weight_decay,
            load_best_model_at_end=True,
            no_cuda=False,
            report_to="none",
            seed=42
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument
            callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
        )

        try:
            trainer.train()
            val_output = trainer.evaluate()
            fold_losses.append(val_output["eval_loss"])
        except Exception as e:
            print(f"Error in fold {fold}: {e}")
            return float('inf')
        finally:
            del model, trainer
            torch.cuda.empty_cache()

    return np.mean(fold_losses)

# Run hyperparameter optimization
print("Starting hyperparameter optimization...")
study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=7)

# Save best hyperparameters
best_hyperparameters = study.best_params
print("Best Hyperparameters:", best_hyperparameters)
with open(task1_hyperparams_path, "w") as file:
    json.dump(best_hyperparameters, file, indent=2)

# Train final model
print("Training final model...")
train_encodings = tokenize_function(texts)
train_dataset = Dataset.from_dict({**train_encodings, "labels": labels})

final_model = prepare_model_for_training()
final_model.to("cuda")

final_training_args = TrainingArguments(
    output_dir=task1_model_path,
    save_strategy="epoch",
    save_total_limit=2,
    logging_strategy="steps",
    logging_steps=50,
    per_device_train_batch_size=best_hyperparameters["batch_size"],
    num_train_epochs=10,
    learning_rate=best_hyperparameters["learning_rate"],
    weight_decay=best_hyperparameters["weight_decay"],
    no_cuda=False,
    report_to="none",
    seed=42

)

final_trainer = Trainer(
    model=final_model,
    args=final_training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer  # replaces the deprecated `tokenizer` argument
)

final_trainer.train()

# Save everything needed for prediction
print("Saving model and prediction requirements for task 1...")

# 1. Save the model and tokenizer
final_model.save_pretrained(task1_model_path)  # saves weights and config only; no optimizer state is written
tokenizer.save_pretrained(task1_model_path)

# 2. Save the label-mapping config
config = {
    'label_mapping': {
        '0': 'No',
        '1': 'Yes'
    }
}
with open(task1_config_path, 'w') as f:
    json.dump(config, f, indent=2)

print("Saved the following files for task 1:")
print(f"1. Model and tokenizer at: {task1_model_path}")
print(f"2. Best hyperparameters at: {task1_hyperparams_path}")
print(f"3. Configuration at: {task1_config_path}")
print("\nTask 1 training complete. You can now use these files for prediction on new data.")

Loading dataset...
Processing texts...



[I 2025-05-18 18:21:42,804] A new study created in memory with name: no-name-854f2559-603b-41e3-aeb9-d2017dedcc6a


Starting hyperparameter optimization...
Trial 0, Fold 1/4



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nbroad/ESG-BERT and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([26]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.weight: found shape torch.Size([26, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,0.5808,0.488443
2,0.4815,0.519868
3,0.4241,0.509283


Trial 0, Fold 2/4




Epoch,Training Loss,Validation Loss
1,0.5214,0.471425
2,0.4587,0.445259
3,0.3781,0.448028
4,0.4252,0.487672


Trial 0, Fold 3/4




Epoch,Training Loss,Validation Loss
1,0.5511,0.471344
2,0.4612,0.513724
3,0.4114,0.522649


Trial 0, Fold 4/4




Epoch,Training Loss,Validation Loss
1,0.5248,0.500708
2,0.4675,0.5042
3,0.4408,0.541676


[I 2025-05-18 18:23:26,117] Trial 0 finished with value: 0.47643864899873734 and parameters: {'learning_rate': 1.827226177606625e-05, 'batch_size': 4, 'weight_decay': 0.055245405728306586}. Best is trial 0 with value: 0.47643864899873734.
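The trial value reported above can be reproduced by hand from the four fold tables. The sketch below is a simplified model of the patience rule (`early_stopping_patience=2`) combined with `load_best_model_at_end=True`: training stops once the validation loss has failed to improve for two consecutive epochs, each fold contributes the loss of its best epoch, and the trial value is the mean over folds. `simulate_early_stopping` is a hypothetical helper written for this illustration; it is not the Hugging Face implementation.

```python
from statistics import mean

def simulate_early_stopping(val_losses, patience=2):
    """Return (best_epoch, last_epoch_run, best_loss) for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch, best_loss
    return best_epoch, len(val_losses), best_loss

# Validation losses logged for Trial 0, one list per fold.
trial_0_folds = [
    [0.488443, 0.519868, 0.509283],
    [0.471425, 0.445259, 0.448028, 0.487672],
    [0.471344, 0.513724, 0.522649],
    [0.500708, 0.5042, 0.541676],
]

results = [simulate_early_stopping(fold) for fold in trial_0_folds]
for fold_number, (best_epoch, last_epoch, best_loss) in enumerate(results, start=1):
    print(f"Fold {fold_number}: best epoch {best_epoch}, "
          f"stopped after epoch {last_epoch}, loss {best_loss}")

# Optuna receives the mean of the best fold losses as the trial value.
trial_value = mean(best_loss for _, _, best_loss in results)
print(f"Mean of best fold losses: {trial_value:.4f}")
```

The mean comes out at roughly 0.4764, which agrees with the logged trial value of 0.47643864899873734 up to the rounding of the displayed losses.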


Trial 1, Fold 1/4




Epoch,Training Loss,Validation Loss
1,No log,0.496476
2,0.527400,0.473327
3,0.443300,0.468185
4,0.382200,0.463592
5,0.382200,0.465116
6,0.354900,0.457552
7,0.336500,0.456893
8,0.326900,0.456171
9,0.326900,0.45612
10,0.321200,0.455942


Trial 1, Fold 2/4




Epoch,Training Loss,Validation Loss
1,No log,0.488474
2,0.519000,0.435166
3,0.414400,0.43338
4,0.407500,0.416764
5,0.407500,0.40436
6,0.379200,0.401388
7,0.317500,0.39398
8,0.339100,0.394793
9,0.339100,0.387211
10,0.304400,0.386583


Trial 1, Fold 3/4




Epoch,Training Loss,Validation Loss
1,No log,0.491252
2,0.513000,0.48508
3,0.433500,0.471355
4,0.355000,0.473013
5,0.355000,0.484971


Trial 1, Fold 4/4




Epoch,Training Loss,Validation Loss
1,No log,0.4752
2,0.521400,0.460956
3,0.434000,0.459492
4,0.382700,0.457178
5,0.382700,0.461137
6,0.371800,0.457078
7,0.297400,0.458336
8,0.326400,0.454981
9,0.326400,0.456811
10,0.317100,0.455739


[I 2025-05-18 18:26:26,446] Trial 1 finished with value: 0.4422152265906334 and parameters: {'learning_rate': 1.2853916978930139e-05, 'batch_size': 8, 'weight_decay': 0.21534104756085318}. Best is trial 1 with value: 0.4422152265906334.


Trial 2, Fold 1/4




Epoch,Training Loss,Validation Loss
1,0.5747,0.483988
2,0.4739,0.489934
3,0.4091,0.485198


Trial 2, Fold 2/4




Epoch,Training Loss,Validation Loss
1,0.5571,0.491754
2,0.4585,0.460704
3,0.3936,0.454671
4,0.4448,0.485802
5,0.4642,0.450486
6,0.3955,0.443483
7,0.3379,0.453337
8,0.3891,0.440769
9,0.3653,0.431855
10,0.3311,0.4355


Trial 2, Fold 3/4




Epoch,Training Loss,Validation Loss
1,0.5368,0.487682
2,0.4538,0.504403
3,0.4289,0.500388


Trial 2, Fold 4/4




Epoch,Training Loss,Validation Loss
1,0.5691,0.486359
2,0.4695,0.4795
3,0.4563,0.495427
4,0.4126,0.492238


[I 2025-05-18 18:28:22,808] Trial 2 finished with value: 0.4707561433315277 and parameters: {'learning_rate': 1.0336843570697396e-05, 'batch_size': 4, 'weight_decay': 0.06272924049005918}. Best is trial 1 with value: 0.4422152265906334.


Trial 3, Fold 1/4




Epoch,Training Loss,Validation Loss
1,No log,0.481731
2,0.546400,0.458078
3,0.452000,0.453193
4,0.387500,0.450117
5,0.387500,0.451021
6,0.367400,0.445779
7,0.351200,0.445215
8,0.333200,0.442109
9,0.333200,0.441174
10,0.323400,0.440502


Trial 3, Fold 2/4




Epoch,Training Loss,Validation Loss
1,No log,0.486174
2,0.517200,0.433241
3,0.412600,0.432175
4,0.404800,0.414834
5,0.404800,0.402857
6,0.375400,0.399816
7,0.312600,0.392073
8,0.333000,0.392992
9,0.333000,0.385121
10,0.298000,0.384397


Trial 3, Fold 3/4




Epoch,Training Loss,Validation Loss
1,No log,0.489776
2,0.510900,0.4858
3,0.431500,0.471001
4,0.351600,0.473323
5,0.351600,0.486022


Trial 3, Fold 4/4




Epoch,Training Loss,Validation Loss
1,No log,0.475084
2,0.519600,0.460721
3,0.432000,0.459457
4,0.379900,0.457228
5,0.379900,0.461096
6,0.367400,0.457428


[I 2025-05-18 18:31:03,044] Trial 3 finished with value: 0.43828199803829193 and parameters: {'learning_rate': 1.34336568680343e-05, 'batch_size': 8, 'weight_decay': 0.09445645065743215}. Best is trial 3 with value: 0.43828199803829193.


Trial 4, Fold 1/4




Epoch,Training Loss,Validation Loss
1,No log,0.482989
2,0.492000,0.471518
3,0.492000,0.450886
4,0.372600,0.446476
5,0.372600,0.450128
6,0.309600,0.443985
7,0.309600,0.439435
8,0.267900,0.441342
9,0.267900,0.435575
10,0.230900,0.434098


Trial 4, Fold 2/4




Epoch,Training Loss,Validation Loss
1,No log,0.454056
2,0.465500,0.418778
3,0.465500,0.417069
4,0.366700,0.393316
5,0.366700,0.374398
6,0.301500,0.379186
7,0.301500,0.376796


Trial 4, Fold 3/4




Epoch,Training Loss,Validation Loss
1,No log,0.50125
2,0.517300,0.470817
3,0.517300,0.454933
4,0.385300,0.460127
5,0.385300,0.467433


Trial 4, Fold 4/4




Epoch,Training Loss,Validation Loss
1,No log,0.488222
2,0.472700,0.466958
3,0.472700,0.460776
4,0.362700,0.461161
5,0.362700,0.460948


[I 2025-05-18 18:33:17,128] Trial 4 finished with value: 0.43105141818523407 and parameters: {'learning_rate': 2.6771137242145903e-05, 'batch_size': 12, 'weight_decay': 0.1422602954229404}. Best is trial 4 with value: 0.43105141818523407.


Trial 5, Fold 1/4




Epoch,Training Loss,Validation Loss
1,No log,0.477325
2,0.463900,0.465629
3,0.463900,0.440951
4,0.358800,0.433346
5,0.358800,0.450792
6,0.272600,0.422361
7,0.272600,0.423951
8,0.215900,0.425601


Trial 5, Fold 2/4




Epoch,Training Loss,Validation Loss
1,No log,0.44603
2,0.472400,0.421524
3,0.472400,0.424743
4,0.353100,0.408277
5,0.353100,0.389174
6,0.283700,0.409545
7,0.283700,0.422628


Trial 5, Fold 3/4




Epoch,Training Loss,Validation Loss
1,No log,0.490614
2,0.503400,0.473986
3,0.503400,0.453487
4,0.364700,0.459034
5,0.364700,0.473406


Trial 5, Fold 4/4




Epoch,Training Loss,Validation Loss
1,No log,0.490337
2,0.464000,0.465256
3,0.464000,0.459612
4,0.343100,0.465295
5,0.343100,0.464141


[I 2025-05-18 18:35:21,184] Trial 5 finished with value: 0.43115826696157455 and parameters: {'learning_rate': 3.538461259525519e-05, 'batch_size': 12, 'weight_decay': 0.02347061968879934}. Best is trial 4 with value: 0.43105141818523407.


Trial 6, Fold 1/4




Epoch,Training Loss,Validation Loss
1,No log,0.478
2,0.471100,0.460436
3,0.471100,0.447171
4,0.374000,0.442786
5,0.374000,0.445411
6,0.301300,0.429815
7,0.301300,0.430394
8,0.258000,0.429886


Trial 6, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,0.450876
2,0.480100,0.427551
3,0.480100,0.429954
4,0.373200,0.417425
5,0.373200,0.392521
6,0.317500,0.408693
7,0.317500,0.41219


Trial 6, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,0.501484
2,0.517700,0.470798
3,0.517700,0.455014
4,0.385800,0.46006
5,0.385800,0.467241


Trial 6, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,0.488163
2,0.473000,0.46695
3,0.473000,0.460799
4,0.363200,0.461035
5,0.363200,0.460881


[I 2025-05-18 18:37:26,265] Trial 6 finished with value: 0.4345373436808586 and parameters: {'learning_rate': 2.658616083788978e-05, 'batch_size': 12, 'weight_decay': 0.2900332895916222}. Best is trial 4 with value: 0.43105141818523407.


Best Hyperparameters: {'learning_rate': 2.6771137242145903e-05, 'batch_size': 12, 'weight_decay': 0.1422602954229404}
Training final model...


Step,Training Loss
50,0.5163
100,0.3768
150,0.3351
200,0.2765
250,0.2639
300,0.2054


Saving model and prediction requirements for task 1...
Saved the following files for task 1:
1. Model and tokenizer at: ./promise_task1_processed_Base-Model/task1_final_model
2. Best hyperparameters at: ./promise_task1_processed_Base-Model/task1_best_hyperparameters.json
3. Configuration at: ./promise_task1_processed_Base-Model/task1_config.json

Task 1 training complete. You can now use these files for prediction on new data.


### 3.1.2. Task1, Predict

In [None]:
base_path = './promise_task1_processed_Base-Model'  # Local path instead of Google Drive
test_data_path = 'english_submission_file_restructured_unlabeled.json'
test_data_T1_labeled_path = 'english_submission_file_restructured_T1-labeled.json'

task1_model_path = os.path.join(base_path, "task1_final_model")
task1_config_path = os.path.join(base_path, "task1_config.json") # contains the label mapping

print("Loading model and tokenizer...")
model = AutoModelForSequenceClassification.from_pretrained(task1_model_path)
tokenizer = AutoTokenizer.from_pretrained(task1_model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load label mapping
with open(task1_config_path, 'r') as f:
    config = json.load(f)
    label_mapping = config['label_mapping']


print("Loading test data...")
with open(test_data_path, 'r') as file:
    test_data = json.load(file)


texts = [item['data'] for item in test_data.values()]

# Tokenize preprocessed texts
def tokenize_function(examples):
    return tokenizer(examples, truncation=True, padding="max_length", max_length=512)

encodings = tokenize_function(texts)
dataset = Dataset.from_dict({
    "input_ids": encodings['input_ids'],
    "attention_mask": encodings['attention_mask']
})

def predict(dataset):
    model.eval()
    trainer = Trainer(model=model, tokenizer=tokenizer)
    raw_pred, _, _ = trainer.predict(dataset)
    predictions = torch.softmax(torch.from_numpy(raw_pred), dim=-1)
    return predictions.argmax(dim=1).numpy()


print("Making predictions...")
predictions = predict(dataset)

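# Update the promise_status based on predictions.
# Pair each prediction with its actual key instead of assuming keys are "0".."N-1";
# `texts` was built from test_data.values(), so the order matches.
for key, pred in zip(test_data.keys(), predictions):
    test_data[key]['promise_status'] = label_mapping[str(int(pred))]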


print(f"Saving predictions to {test_data_T1_labeled_path}")
with open(test_data_T1_labeled_path, 'w') as file:
    json.dump(test_data, file, indent=4)

print("\nPrediction Statistics:")
pred_distribution = {}
for item in test_data.values():
    status = item['promise_status']
    pred_distribution[status] = pred_distribution.get(status, 0) + 1

for status, count in pred_distribution.items():
    percentage = (count / len(test_data)) * 100
    print(f"{status}: {count} ({percentage:.1f}%)")

Loading model and tokenizer...
Loading test data...
Making predictions...


Saving predictions to english_submission_file_restructured_T1-labeled.json

Prediction Statistics:
Yes: 363 (90.8%)
No: 37 (9.2%)
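
The write-back and tally steps of the prediction cell can be tried in isolation. The following is a minimal sketch on toy data: the three records and the `predictions` list are hypothetical stand-ins, not taken from the test set. It also shows why the label mapping is indexed with `str(...)`: a mapping that has been round-tripped through JSON always has string keys.

```python
import json
from collections import Counter

# Hypothetical stand-ins for the objects used in the prediction cell.
test_data = {
    "0": {"data": "We will cut emissions by 2030."},
    "1": {"data": "Our office is in Montreal."},
    "2": {"data": "We commit to net zero."},
}
predictions = [1, 0, 1]  # class ids, as an argmax over the logits would return

# JSON object keys are always strings, so integer keys come back as "0" and "1".
label_mapping = json.loads(json.dumps({0: "No", 1: "Yes"}))

# Write each predicted label back onto its record.
for key, pred in zip(test_data, predictions):
    test_data[key]["promise_status"] = label_mapping[str(pred)]

# Tally the predicted label distribution.
counts = Counter(item["promise_status"] for item in test_data.values())
for status, count in counts.items():
    print(f"{status}: {count} ({100 * count / len(test_data):.1f}%)")
```

Iterating over the dictionary keys directly avoids any assumption about how the records are named.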


## 3.2. Task 2, Supporting Evidence

This is a binary label (Yes/No) indicating whether the text provides supporting evidence for the promise.
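
As a small illustration of how this label is encoded for training, the sketch below applies the same rule as the training cell (`1` for "Yes", `0` otherwise) to three hypothetical records. Only the `evidence_status` field name comes from the notebook. The "N/A" value is an assumption, included to show that anything other than "Yes" collapses into class 0.

```python
from collections import Counter

# Hypothetical records; the texts and the "N/A" value are illustrative assumptions.
records = [
    {"data": "We reduced water use by 12% in 2023.", "evidence_status": "Yes"},
    {"data": "We care about the planet.", "evidence_status": "No"},
    {"data": "Our office is in Montreal.", "evidence_status": "N/A"},
]

# Same encoding rule as the training cell: "Yes" -> 1, everything else -> 0.
labels = [1 if r["evidence_status"] == "Yes" else 0 for r in records]

# Class balance, which is what stratified k-fold preserves in each split.
balance = Counter(labels)
print(labels, dict(balance))
```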

### 3.2.1. Task2, Train

In [None]:
if not torch.cuda.is_available():
    raise RuntimeError("This script requires a GPU to run")

# Create task-specific paths
task2_model_path = os.path.join(task2_output_path, "task2_final_model")
task2_hyperparams_path = os.path.join(task2_output_path, "task2_best_hyperparameters.json")
task2_config_path = os.path.join(task2_output_path, "task2_config.json")  # label_mapping


def set_seed(seed_value=42):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
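    torch.backends.cudnn.deterministic = True
    # Disable the cuDNN autotuner, which can select different kernels between runs
    torch.backends.cudnn.benchmark = False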

set_seed()


def prepare_model_for_training():
    model = AutoModelForSequenceClassification.from_pretrained(
        "nbroad/ESG-BERT",
        num_labels=2,
        ignore_mismatched_sizes=True
    )

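    # Explicitly re-create the 2-way head. from_pretrained already re-initializes it
    # because the checkpoint head has 26 outputs, so this line is a safeguard.
    model.classifier = torch.nn.Linear(model.config.hidden_size, 2)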

    # Freeze all layers except last 2 transformer layers and classifier
    N = 2
    for param in model.bert.parameters():
        param.requires_grad = False
    for param in model.bert.encoder.layer[-N:].parameters():
        param.requires_grad = True
    for param in model.classifier.parameters():
        param.requires_grad = True

    return model

# Load dataset
print("Loading dataset...")
with open("PromiseEval_Trainset_English.json", 'r') as file:
    data = json.load(file)

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("nbroad/ESG-BERT")

def tokenize_function(examples):
    return tokenizer(examples, truncation=True, padding="max_length", max_length=512)

# Process texts
print("Processing texts...")
texts = [item['data'] for item in data]
labels = [1 if item["evidence_status"] == "Yes" else 0 for item in data]

# Setup k-fold
n_splits = 4
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 12])
    weight_decay = trial.suggest_float("weight_decay", 0.01, 0.3)

    fold_losses = []
    for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
        print(f"Trial {trial.number}, Fold {fold + 1}/{n_splits}")

        # Prepare fold data
        train_texts = [texts[i] for i in train_idx]
        val_texts = [texts[i] for i in val_idx]
        train_labels = [labels[i] for i in train_idx]
        val_labels = [labels[i] for i in val_idx]

        train_encodings = tokenize_function(train_texts)
        val_encodings = tokenize_function(val_texts)

        train_dataset = Dataset.from_dict({**train_encodings, "labels": train_labels})
        val_dataset = Dataset.from_dict({**val_encodings, "labels": val_labels})

        model = prepare_model_for_training()
        model.to("cuda")

        trial_dir = os.path.join(task2_output_path, f'trial_{trial.number}_fold_{fold}')

        training_args = TrainingArguments(
            output_dir=trial_dir,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            save_total_limit=1,
            logging_strategy="steps",
            logging_steps=50,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            num_train_epochs=10,
            learning_rate=learning_rate,
            weight_decay=weight_decay,
            load_best_model_at_end=True,
            no_cuda=False,
            report_to="none",
            seed=42
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            tokenizer=tokenizer,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
        )

        try:
            trainer.train()
            val_output = trainer.evaluate()
            fold_losses.append(val_output["eval_loss"])
        except Exception as e:
            print(f"Error in fold {fold}: {e}")
            return float('inf')
        finally:
            del model, trainer
            torch.cuda.empty_cache()

    return np.mean(fold_losses)

# Run hyperparameter optimization
print("Starting hyperparameter optimization...")
study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=7)

# Save best hyperparameters
best_hyperparameters = study.best_params
print("Best Hyperparameters:", best_hyperparameters)
with open(task2_hyperparams_path, 'w') as f:
    json.dump(best_hyperparameters, f, indent=2)

# Train final model
print("Training final model...")
train_encodings = tokenize_function(texts)
train_dataset = Dataset.from_dict({**train_encodings, "labels": labels})

final_model = prepare_model_for_training()
final_model.to("cuda")

final_training_args = TrainingArguments(
    output_dir=task2_model_path,
    save_strategy="epoch",
    save_total_limit=2,
    logging_strategy="steps",
    logging_steps=50,
    per_device_train_batch_size=best_hyperparameters["batch_size"],
    num_train_epochs=10,
    learning_rate=best_hyperparameters["learning_rate"],
    weight_decay=best_hyperparameters["weight_decay"],
    no_cuda=False,
    report_to="none",
    seed=42
)

final_trainer = Trainer(
    model=final_model,
    args=final_training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)

final_trainer.train()

# Save everything needed for prediction
print("Saving model and prediction requirements for task 2...")

# 1. Save the model and tokenizer
# (save_pretrained writes only the weights and config, so no optimizer flag is needed)
final_model.save_pretrained(task2_model_path)
tokenizer.save_pretrained(task2_model_path)

# 2. Save the label mapping config
config = {
    'label_mapping': {
        '0': 'No',
        '1': 'Yes'
    }
}

with open(task2_config_path, 'w') as f:
    json.dump(config, f, indent=2)

print("Saved the following files for task 2:")
print(f"1. Model and tokenizer at: {task2_model_path}")
print(f"2. Configuration at: {task2_config_path}")
print(f"3. Best hyperparameters at: {task2_hyperparams_path}")
print("\nTask 2 training complete. You can now use these files for prediction on new data.")

Loading dataset...


[I 2025-05-19 00:38:26,679] A new study created in memory with name: no-name-49e7d57f-f4ee-4b00-bca9-e8333fd20a72


Processing texts...
Starting hyperparameter optimization...
Trial 0, Fold 1/4


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nbroad/ESG-BERT and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([26]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.weight: found shape torch.Size([26, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Epoch,Training Loss,Validation Loss
1,0.6834,0.572438
2,0.56,0.557496
3,0.5059,0.552609
4,0.4741,0.53315
5,0.4387,0.547783
6,0.3983,0.546339


Trial 0, Fold 2/4


Epoch,Training Loss,Validation Loss
1,0.6741,0.604251
2,0.5822,0.619792
3,0.4698,0.638466


Trial 0, Fold 3/4


Epoch,Training Loss,Validation Loss
1,0.724,0.662308
2,0.5636,0.643338
3,0.497,0.662725
4,0.4818,0.65541


Trial 0, Fold 4/4


Epoch,Training Loss,Validation Loss
1,0.6973,0.621487
2,0.5399,0.620407
3,0.4664,0.602042
4,0.4659,0.582133
5,0.4845,0.57274
6,0.3818,0.605233
7,0.3616,0.582119


[I 2025-05-19 00:40:23,102] Trial 0 finished with value: 0.5883695185184479 and parameters: {'learning_rate': 1.827226177606625e-05, 'batch_size': 4, 'weight_decay': 0.055245405728306586}. Best is trial 0 with value: 0.5883695185184479.
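
The trial value can be reproduced from the per-epoch logs. The sketch below is a plain-Python approximation, not the `Trainer` internals: it assumes that training stops after `patience=2` consecutive epochs without improvement in validation loss, and that `load_best_model_at_end=True` makes the final evaluation report the best epoch's loss. The helper name `best_loss_with_early_stopping` is hypothetical. The loss values are copied from the Trial 0 logs above.

```python
from statistics import mean

def best_loss_with_early_stopping(val_losses, patience=2):
    """Return (stop_epoch, best_epoch, best_loss) for one fold."""
    best_loss, best_epoch, stale = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                # Stop: `patience` epochs in a row without improvement.
                return epoch, best_epoch, best_loss
    return len(val_losses), best_epoch, best_loss

# Validation losses per epoch for Task 2, Trial 0, folds 1 to 4.
trial0_folds = [
    [0.572438, 0.557496, 0.552609, 0.53315, 0.547783, 0.546339],
    [0.604251, 0.619792, 0.638466],
    [0.662308, 0.643338, 0.662725, 0.65541],
    [0.621487, 0.620407, 0.602042, 0.582133, 0.57274, 0.605233, 0.582119],
]

results = [best_loss_with_early_stopping(fold) for fold in trial0_folds]
trial_value = mean(best for _, _, best in results)

for i, (stop, best_epoch, best) in enumerate(results, start=1):
    print(f"Fold {i}: stopped at epoch {stop}, best epoch {best_epoch}, loss {best}")
print(f"Mean of best fold losses: {trial_value:.6f}")
```

The mean agrees with the value Optuna reports for Trial 0 up to the rounding of the logged losses, which supports reading each trial value as the average of the best validation loss across the four folds.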


Trial 1, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,0.649407
2,0.679400,0.612811
3,0.583400,0.600134
4,0.567100,0.591533
5,0.567100,0.584943
6,0.520000,0.585097
7,0.500700,0.58203
8,0.481800,0.581452
9,0.481800,0.581878
10,0.477000,0.581223


Trial 1, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,0.637305
2,0.632300,0.612368
3,0.575000,0.61706
4,0.556300,0.601829
5,0.556300,0.597803
6,0.496700,0.587766
7,0.489900,0.576295
8,0.421600,0.570098
9,0.421600,0.567122
10,0.448700,0.566893


Trial 1, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,0.661284
2,0.632200,0.664112
3,0.563000,0.639914
4,0.514500,0.645853
5,0.514500,0.66043


Trial 1, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,0.648012
2,0.642500,0.631537
3,0.547700,0.6191
4,0.533800,0.609184
5,0.533800,0.602858
6,0.485800,0.60501
7,0.431100,0.595425
8,0.440600,0.593917
9,0.440600,0.589587
10,0.414200,0.587949


[I 2025-05-19 00:43:28,509] Trial 1 finished with value: 0.5939946323633194 and parameters: {'learning_rate': 1.2853916978930139e-05, 'batch_size': 8, 'weight_decay': 0.21534104756085318}. Best is trial 0 with value: 0.5883695185184479.


Trial 2, Fold 1/4


Epoch,Training Loss,Validation Loss
1,0.6627,0.599058
2,0.6076,0.573739
3,0.5113,0.580231
4,0.528,0.564775
5,0.4893,0.559288
6,0.4674,0.554769
7,0.4221,0.554001
8,0.4217,0.554793
9,0.3604,0.557508


Trial 2, Fold 2/4


Epoch,Training Loss,Validation Loss
1,0.6755,0.633937
2,0.6002,0.618934
3,0.5244,0.632565
4,0.5382,0.62115


Trial 2, Fold 3/4


Epoch,Training Loss,Validation Loss
1,0.7003,0.659497
2,0.5961,0.644624
3,0.537,0.633899
4,0.53,0.642242
5,0.4909,0.653796


Trial 2, Fold 4/4


Epoch,Training Loss,Validation Loss
1,0.686,0.641269
2,0.5807,0.62845
3,0.5154,0.613939
4,0.5163,0.598097
5,0.5026,0.597098
6,0.4323,0.60645
7,0.4177,0.590652
8,0.4001,0.588554
9,0.4082,0.583189
10,0.391,0.581952


[I 2025-05-19 00:46:10,377] Trial 2 finished with value: 0.5971967428922653 and parameters: {'learning_rate': 1.0336843570697396e-05, 'batch_size': 4, 'weight_decay': 0.06272924049005918}. Best is trial 0 with value: 0.5883695185184479.


Trial 3, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,0.613681
2,0.651100,0.582105
3,0.552600,0.57328
4,0.537900,0.563743
5,0.537900,0.561251
6,0.480400,0.554684
7,0.454300,0.556899
8,0.415100,0.559627


Trial 3, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,0.68738
2,0.649400,0.658915
3,0.571700,0.654563
4,0.566600,0.635278
5,0.566600,0.628099
6,0.514600,0.620174
7,0.525500,0.604754
8,0.433600,0.599178
9,0.433600,0.598149
10,0.468000,0.598054


Trial 3, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,0.660839
2,0.630900,0.663956
3,0.560300,0.639149
4,0.510600,0.645591
5,0.510600,0.660878


Trial 3, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,0.646979
2,0.640900,0.630604
3,0.545000,0.618305
4,0.529800,0.607665
5,0.529800,0.601352
6,0.480800,0.604182
7,0.424700,0.593214
8,0.433900,0.592029
9,0.433900,0.587384
10,0.406500,0.585664


[I 2025-05-19 00:49:04,981] Trial 3 finished with value: 0.5943876504898071 and parameters: {'learning_rate': 1.34336568680343e-05, 'batch_size': 8, 'weight_decay': 0.09445645065743215}. Best is trial 0 with value: 0.5883695185184479.


Trial 4, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,0.596125
2,0.609500,0.57191
3,0.609500,0.572469
4,0.493300,0.567672
5,0.493300,0.552799
6,0.409400,0.555171
7,0.409400,0.548495
8,0.348900,0.551211
9,0.348900,0.55185


Trial 4, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,0.62601
2,0.606200,0.606568
3,0.606200,0.604716
4,0.504000,0.574391
5,0.504000,0.576374
6,0.437700,0.565346
7,0.437700,0.561154
8,0.394400,0.568065
9,0.394400,0.564456


Trial 4, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,0.644437
2,0.616200,0.666629
3,0.616200,0.631848
4,0.488200,0.617799
5,0.488200,0.662449
6,0.417300,0.627422


Trial 4, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,0.649986
2,0.607000,0.621392
3,0.607000,0.603355
4,0.487200,0.59361
5,0.487200,0.587541
6,0.419100,0.591588
7,0.419100,0.564147
8,0.371800,0.569342
9,0.371800,0.565841


[I 2025-05-19 00:51:47,615] Trial 4 finished with value: 0.572898805141449 and parameters: {'learning_rate': 2.6771137242145903e-05, 'batch_size': 12, 'weight_decay': 0.1422602954229404}. Best is trial 4 with value: 0.572898805141449.


Trial 5, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,0.602143
2,0.622100,0.582741
3,0.622100,0.589167
4,0.477700,0.584998


Trial 5, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,0.623031
2,0.609400,0.595772
3,0.609400,0.609791
4,0.496300,0.564914
5,0.496300,0.539105
6,0.414300,0.550942
7,0.414300,0.53317
8,0.348200,0.534337
9,0.348200,0.531655
10,0.304100,0.534248


Trial 5, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,0.654788
2,0.592500,0.686464
3,0.592500,0.64767
4,0.447100,0.653577
5,0.447100,0.691976


Trial 5, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,0.63087
2,0.599600,0.613296
3,0.599600,0.584646
4,0.458800,0.586296
5,0.458800,0.564699
6,0.373900,0.59132
7,0.373900,0.549116
8,0.305300,0.562401
9,0.305300,0.550724


[I 2025-05-19 00:54:09,937] Trial 5 finished with value: 0.5777955055236816 and parameters: {'learning_rate': 3.538461259525519e-05, 'batch_size': 12, 'weight_decay': 0.02347061968879934}. Best is trial 4 with value: 0.572898805141449.


Trial 6, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,0.607839
2,0.626300,0.589427
3,0.626300,0.585147
4,0.496200,0.579513
5,0.496200,0.556927
6,0.414300,0.560081
7,0.414300,0.558469


Trial 6, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,0.617597
2,0.628800,0.596791
3,0.628800,0.620386
4,0.525200,0.600844


Trial 6, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,0.643464
2,0.622100,0.650299
3,0.622100,0.621432
4,0.505600,0.616417
5,0.505600,0.637304
6,0.424900,0.622048


Trial 6, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,0.65022
2,0.607200,0.621666
3,0.607200,0.603638
4,0.487900,0.593892
5,0.487900,0.587672
6,0.420100,0.591682
7,0.420100,0.564643
8,0.373100,0.569581
9,0.373100,0.566193


[I 2025-05-19 00:56:21,937] Trial 6 finished with value: 0.5836943835020065 and parameters: {'learning_rate': 2.658616083788978e-05, 'batch_size': 12, 'weight_decay': 0.2900332895916222}. Best is trial 4 with value: 0.572898805141449.


Best Hyperparameters: {'learning_rate': 2.6771137242145903e-05, 'batch_size': 12, 'weight_decay': 0.1422602954229404}
Training final model...


Step,Training Loss
50,0.629
100,0.5098
150,0.4471
200,0.3899
250,0.3516
300,0.3177


Saving model and prediction requirements for task 2...
Saved the following files for task 2:
1. Model and tokenizer at: ./promise_task2_processed_Base-Model/task2_final_model
2. Configuration at: ./promise_task2_processed_Base-Model/task2_config.json
3. Best hyperparameters at: ./promise_task2_processed_Base-Model/task2_best_hyperparameters.json

Task 2 training complete. You can now use these files for prediction on new data.


### 3.2.2. Task2, Predict

In [None]:
test_data_T1_labeled_path = 'english_submission_file_restructured_T1-labeled.json'  # annotated in the previous step
test_data_T1_T2_labeled_path = 'english_submission_file_restructured_T1_T2-labeled.json'

task2_model_path = os.path.join(task2_output_path, "task2_final_model")
task2_config_path = os.path.join(task2_output_path, "task2_config.json")

print("Loading model and tokenizer...")
model = AutoModelForSequenceClassification.from_pretrained(task2_model_path)
tokenizer = AutoTokenizer.from_pretrained(task2_model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load label_mapping
with open(task2_config_path, 'r') as f:
    config = json.load(f)
    label_mapping = config['label_mapping']

print("Loading test data...")
with open(test_data_T1_labeled_path, 'r') as file:
    test_data = json.load(file)

# Extract texts for prediction
texts = [item['data'] for item in test_data.values()]

# Tokenize preprocessed texts
def tokenize_function(examples):
    return tokenizer(examples, truncation=True, padding="max_length", max_length=512)

encodings = tokenize_function(texts)
dataset = Dataset.from_dict({
    "input_ids": encodings['input_ids'],
    "attention_mask": encodings['attention_mask']
})

# Prediction function
def predict(dataset):
    model.eval()
    trainer = Trainer(model=model, tokenizer=tokenizer)
    raw_pred, _, _ = trainer.predict(dataset)
    predictions = torch.softmax(torch.from_numpy(raw_pred), dim=-1)
    return predictions.argmax(dim=1).numpy()

# Make predictions
print("Making predictions...")
predictions = predict(dataset)

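# Update evidence_status, pairing each prediction with its actual key
# (dict keys iterate in the same order as the values used to build `texts`)
for key, pred in zip(test_data.keys(), predictions):
    test_data[key]['evidence_status'] = label_mapping[str(pred)]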

print(f"Saving predictions to {test_data_T1_T2_labeled_path}")
with open(test_data_T1_T2_labeled_path, 'w') as file:
    json.dump(test_data, file, indent=4)

print("\nPrediction Statistics:")
pred_distribution = {}
for item in test_data.values():
    status = item['evidence_status']
    pred_distribution[status] = pred_distribution.get(status, 0) + 1

for status, count in pred_distribution.items():
    percentage = (count / len(test_data)) * 100
    print(f"{status}: {count} ({percentage:.1f}%)")

Loading model and tokenizer...
Loading test data...
Making predictions...


Saving predictions to english_submission_file_restructured_T1_T2-labeled.json

Prediction Statistics:
No: 168 (42.0%)
Yes: 232 (58.0%)
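
The lookup `label_mapping[str(pred)]` in the prediction cell relies on how JSON stores the mapping: the training step builds a dictionary with integer class ids as keys, and `json.dump` writes those keys as strings. A minimal sketch of that round trip, using a hypothetical two-class mapping (the actual contents of `task2_config.json` are not shown in this notebook output):

```python
import json

# Hypothetical id-to-label mapping; the real one is stored in task2_config.json.
id_to_label = {0: "No", 1: "Yes"}

# JSON object keys are always strings, so integer keys come back as "0", "1".
restored = json.loads(json.dumps({"label_mapping": id_to_label}))["label_mapping"]

pred = 1  # stands in for a predicted class id
label = restored[str(pred)]
print(restored, label)
```

Indexing the restored mapping with the raw integer would raise a `KeyError`, which is why the prediction code converts each class id with `str` first.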


Below is the alternative version that uses `zip` to pair each prediction with its test-data key.

In [None]:
base_path = './promise_task2_processed_Base-Model'  # Local path instead of Google Drive
test_data_path = 'english_submission_file_restructured_T1-labeled.json'
test_data_T2_labeled_path = 'english_submission_file_restructured_T1_T2-labeled.json'

task2_model_path = os.path.join(base_path, "task2_final_model")
task2_config_path = os.path.join(base_path, "task2_config.json") # contains the label mapping

print("Loading model and tokenizer...")
model = AutoModelForSequenceClassification.from_pretrained(task2_model_path)
tokenizer = AutoTokenizer.from_pretrained(task2_model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

print("Loading test data...")
with open(test_data_path, 'r') as file:
    test_data = json.load(file)


print("Loading label mapping...")
with open(task2_config_path, 'r') as f:
    config = json.load(f)
    label_mapping = config['label_mapping']

print("Processing texts...")
texts = []
indices = []
for idx in sorted(test_data.keys(), key=int):
    item = test_data[idx]
    input_text = item['data']
    texts.append(input_text)
    indices.append(idx)

# Tokenize all texts
encodings = tokenizer(texts, truncation=True, padding="max_length", max_length=512)
dataset = Dataset.from_dict({
    "input_ids": encodings['input_ids'],
    "attention_mask": encodings['attention_mask']
})

# Make predictions
print("Making predictions...")
model.eval()
trainer = Trainer(model=model, tokenizer=tokenizer)
predictions = trainer.predict(dataset)
predicted_labels = predictions.predictions.argmax(-1)

# Update test data with predictions
for idx, pred in zip(indices, predicted_labels):
    test_data[idx]['evidence_status'] = label_mapping[str(pred)]

# Save updated test data
output_path = test_data_T2_labeled_path
print(f"Saving predictions to {output_path}")
with open(output_path, 'w') as f:
    json.dump(test_data, f, indent=2)

# Print prediction statistics
print("\nPrediction Statistics:")
pred_distribution = {}
for idx in test_data:
    status = test_data[idx]['evidence_status']
    pred_distribution[status] = pred_distribution.get(status, 0) + 1

for status, count in pred_distribution.items():
    percentage = (count / len(test_data)) * 100
    print(f"{status}: {count} ({percentage:.1f}%)")

Loading model and tokenizer...
Loading test data...
Loading label mapping...
Processing texts...
Making predictions...


Saving predictions to english_submission_file_restructured_T1_T2-labeled.json

Prediction Statistics:
No: 168 (42.0%)
Yes: 232 (58.0%)
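
The loop over `sorted(test_data.keys(), key=int)` matters because the keys are strings: a plain `sorted` would order them lexicographically rather than numerically. A small illustration with made-up keys:

```python
keys = ["0", "1", "2", "10", "11"]  # made-up string keys for illustration

lexicographic = sorted(keys)      # string comparison: "10" sorts before "2"
numeric = sorted(keys, key=int)   # compares the integer values instead

print(lexicographic)
print(numeric)
```

Keeping the `indices` list alongside `texts` and zipping it with the predictions means the result does not depend on the keys being contiguous or starting at zero.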


## 3.3. Task 3, Clarity of the Promise-Evidence Pair

Three labels (Clear/Not Clear/Misleading) are defined for this task, assigned according to how clear the given evidence is in relation to the promise. The training code below also includes a fourth class, N/A, in its label mapping.

### 3.3.1. Task3, Train

In [None]:

# Create task-specific paths
task3_model_path = os.path.join(task3_output_path, "task3_final_model")
task3_hyperparams_path = os.path.join(task3_output_path, "task3_best_hyperparameters.json")
task3_config_path = os.path.join(task3_output_path, "task3_config.json")  # label_mapping

if not torch.cuda.is_available():
    raise RuntimeError("This script requires a GPU to run")

def set_seed(seed_value=42):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    torch.backends.cudnn.deterministic = True

set_seed()


def prepare_model_for_training():
    model = AutoModelForSequenceClassification.from_pretrained(
        "nbroad/ESG-BERT",
        num_labels=4,  # Clear, Not Clear, Misleading, N/A
        ignore_mismatched_sizes=True
    )

    # Freeze all layers except last N transformer layers
    N = 2
    for param in model.bert.parameters():
        param.requires_grad = False
    for param in model.bert.encoder.layer[-N:].parameters():
        param.requires_grad = True
    for param in model.classifier.parameters():
        param.requires_grad = True

    return model

# Load dataset
print("Loading dataset...")
with open("PromiseEval_Trainset_English.json", 'r') as file:
    data = json.load(file)

clarity_to_id = {
    "Clear": 0,
    "Not Clear": 1,
    "Misleading": 2,
    "N/A": 3
}

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("nbroad/ESG-BERT")

def tokenize_function(examples):
    return tokenizer(examples, truncation=True, padding="max_length", max_length=512)

# Preprocess dataset
print("Processing texts...")
texts = [item['data'] for item in data]
labels = [clarity_to_id[item["evidence_quality"]] for item in data]


# Setup cross-validation
n_splits = 4
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)


def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 12])
    weight_decay = trial.suggest_float("weight_decay", 0.01, 0.3)

    fold_losses = []
    fold_accuracies = []

    for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
        print(f"Trial {trial.number}, Fold {fold + 1}/{n_splits}")

        # Prepare fold data
        train_texts = [texts[i] for i in train_idx]
        val_texts = [texts[i] for i in val_idx]
        train_labels = [labels[i] for i in train_idx]
        val_labels = [labels[i] for i in val_idx]

        train_encodings = tokenize_function(train_texts)
        val_encodings = tokenize_function(val_texts)

        train_dataset = Dataset.from_dict({**train_encodings, "labels": train_labels})
        val_dataset = Dataset.from_dict({**val_encodings, "labels": val_labels})

        model = prepare_model_for_training()
        model.to("cuda")

        trial_dir = os.path.join(task3_output_path, f'trial_{trial.number}_fold_{fold}')

        training_args = TrainingArguments(
            output_dir=trial_dir,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            save_total_limit=1,
            logging_strategy="steps",
            logging_steps=50,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            num_train_epochs=10,
            learning_rate=learning_rate,
            weight_decay=weight_decay,
            load_best_model_at_end=True,

            no_cuda=False,
            report_to="none",
            seed=42
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            tokenizer=tokenizer,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
        )

        try:
            trainer.train()
            eval_results = trainer.evaluate()
            fold_losses.append(eval_results["eval_loss"])
        except Exception as e:
            print(f"Error in fold {fold}: {e}")
            return float('inf')
        finally:
            del model, trainer
            torch.cuda.empty_cache()

    return np.mean(fold_losses)

# Run hyperparameter optimization
print("Starting hyperparameter optimization...")
study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=7)

# Save best hyperparameters
best_hyperparameters = study.best_params
print("Best Hyperparameters:", best_hyperparameters)
with open(task3_hyperparams_path, "w") as file:
    json.dump(best_hyperparameters, file)

# Train final model
print("Training final model...")
train_encodings = tokenize_function(texts)
train_dataset = Dataset.from_dict({**train_encodings, "labels": labels})

final_model = prepare_model_for_training()
final_model.to("cuda")

final_training_args = TrainingArguments(
    output_dir=task3_model_path,
    save_strategy="epoch",
    save_total_limit=2,
    logging_strategy="steps",
    logging_steps=50,
    per_device_train_batch_size=best_hyperparameters["batch_size"],
    num_train_epochs=10,
    learning_rate=best_hyperparameters["learning_rate"],
    weight_decay=best_hyperparameters["weight_decay"],
    no_cuda=False,
    report_to="none",
    seed=42
)

final_trainer = Trainer(
    model=final_model,
    args=final_training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

final_trainer.train()

# Save final model
print("Saving final model...")
final_model.save_pretrained(task3_model_path, save_optimizer_state=False)
tokenizer.save_pretrained(task3_model_path)

# Save class mapping for future reference
class_mapping = {v: k for k, v in clarity_to_id.items()}
config = {
    'label_mapping': class_mapping,
}

with open(task3_config_path, "w") as f:
    json.dump(config, f)  # save the label mapping

print("Saved the following files for task 3:")
print(f"1. Model and tokenizer at: {task3_model_path}")
print(f"2. Configuration at: {task3_config_path}")
print(f"3. Best hyperparameters at: {task3_hyperparams_path}")
print("\nTask 3 training complete. You can now use these files for prediction on new data.")


Loading dataset...


[I 2025-05-19 16:20:05,469] A new study created in memory with name: no-name-635a0b71-9fb1-4cab-9fab-b46844f0eb21


Processing texts...
Starting hyperparameter optimization...
Trial 0, Fold 1/4


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nbroad/ESG-BERT and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([26]) in the checkpoint and torch.Size([4]) in the model instantiated
- classifier.weight: found shape torch.Size([26, 768]) in the checkpoint and torch.Size([4, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Epoch,Training Loss,Validation Loss
1,1.3121,1.086583
2,1.0339,1.067004
3,0.9602,1.063277
4,0.8769,1.071048
5,0.8438,1.060938
6,0.8657,1.04605
7,0.7658,1.050965
8,0.712,1.053021


Trial 0, Fold 2/4


Epoch,Training Loss,Validation Loss
1,1.2114,1.079501
2,1.0008,1.038278
3,0.9149,1.006993
4,0.9125,0.995923
5,0.8798,0.992376
6,0.7577,0.983242
7,0.8021,0.971141
8,0.7674,0.966584
9,0.7257,0.962874
10,0.7111,0.964402


Trial 0, Fold 3/4


Epoch,Training Loss,Validation Loss
1,1.2371,1.117333
2,0.9879,1.092512
3,0.9119,1.095775
4,0.8379,1.093683


Trial 0, Fold 4/4


Epoch,Training Loss,Validation Loss
1,1.1946,1.08262
2,1.0178,1.034296
3,0.8811,1.043054
4,0.9223,1.000492
5,0.8581,1.000985
6,0.8029,1.010005


[I 2025-05-19 16:22:49,196] Trial 0 finished with value: 1.0254821926355362 and parameters: {'learning_rate': 1.827226177606625e-05, 'batch_size': 4, 'weight_decay': 0.055245405728306586}. Best is trial 0 with value: 1.0254821926355362.
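
The value Optuna reports for a trial is the mean of the per-fold `eval_loss` values returned by the objective. Because `load_best_model_at_end=True` restores the best checkpoint before `trainer.evaluate()` runs, each fold contributes its lowest validation loss rather than its last one. A quick check using the Trial 0 tables above (values copied from the logs, so agreement is only up to their rounding):

```python
# Validation losses per fold for Trial 0, copied from the tables above.
fold_val_losses = [
    [1.086583, 1.067004, 1.063277, 1.071048, 1.060938, 1.04605, 1.050965, 1.053021],
    [1.079501, 1.038278, 1.006993, 0.995923, 0.992376, 0.983242, 0.971141,
     0.966584, 0.962874, 0.964402],
    [1.117333, 1.092512, 1.095775, 1.093683],
    [1.08262, 1.034296, 1.043054, 1.000492, 1.000985, 1.010005],
]

# Each fold is scored by its best (lowest) validation loss.
best_per_fold = [min(losses) for losses in fold_val_losses]
trial_value = sum(best_per_fold) / len(best_per_fold)

print(best_per_fold)
print(round(trial_value, 6))
```

The result agrees with the logged trial value of 1.0254821926355362 to within the rounding of the table entries.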


Trial 1, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,1.199445
2,1.221300,1.133204
3,1.053100,1.101505
4,0.976000,1.088836
5,0.976000,1.079462
6,0.909000,1.06995
7,0.906400,1.07165
8,0.899800,1.063105
9,0.899800,1.060049
10,0.866700,1.059122


Trial 1, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,1.154914
2,1.226200,1.081121
3,1.027100,1.046737
4,0.990700,1.02746
5,0.990700,1.011079
6,0.956200,1.003851
7,0.914300,0.995835
8,0.867200,0.990295
9,0.867200,0.984856
10,0.883400,0.983127


Trial 1, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,1.161198
2,1.200600,1.123048
3,1.015200,1.099569
4,0.953100,1.091872
5,0.953100,1.084308
6,0.901300,1.078659
7,0.862200,1.071521
8,0.838900,1.069639
9,0.838900,1.069647
10,0.817700,1.069663


Trial 1, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,1.148587
2,1.189700,1.094113
3,1.000700,1.076097
4,0.997000,1.058177
5,0.997000,1.048216
6,0.916400,1.032868
7,0.880000,1.030657
8,0.858800,1.030184
9,0.858800,1.028418
10,0.858000,1.028005


[I 2025-05-19 16:26:13,321] Trial 1 finished with value: 1.0349734276533127 and parameters: {'learning_rate': 1.2853916978930139e-05, 'batch_size': 8, 'weight_decay': 0.21534104756085318}. Best is trial 0 with value: 1.0254821926355362.


Trial 2, Fold 1/4


Epoch,Training Loss,Validation Loss
1,1.2773,1.103272
2,1.0611,1.068394
3,0.9786,1.055883
4,0.938,1.046827
5,0.9054,1.036315
6,0.9095,1.027424
7,0.8322,1.042055
8,0.791,1.033729


Trial 2, Fold 2/4


Epoch,Training Loss,Validation Loss
1,1.2437,1.117282
2,1.0439,1.069189
3,0.9801,1.047693
4,0.9919,1.032173
5,0.9706,1.022364
6,0.872,1.017971
7,0.9138,1.012257
8,0.9007,1.007981
9,0.8578,1.00637
10,0.8524,1.006732


Trial 2, Fold 3/4


Epoch,Training Loss,Validation Loss
1,1.2691,1.155004
2,1.0414,1.117699
3,0.9746,1.100846
4,0.9059,1.0964
5,0.8864,1.086723
6,0.863,1.086699
7,0.8463,1.080757
8,0.8385,1.080478
9,0.8146,1.080295
10,0.7966,1.078209


Trial 2, Fold 4/4


Epoch,Training Loss,Validation Loss
1,1.2136,1.142917
2,1.0441,1.083031
3,0.9103,1.075987
4,0.9443,1.046789
5,0.9481,1.02828
6,0.8647,1.022635
7,0.8057,1.023768
8,0.8277,1.020435
9,0.8023,1.024262
10,0.7612,1.024386


[I 2025-05-19 16:29:44,753] Trial 2 finished with value: 1.0331095159053802 and parameters: {'learning_rate': 1.0336843570697396e-05, 'batch_size': 4, 'weight_decay': 0.06272924049005918}. Best is trial 0 with value: 1.0254821926355362.
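
Several folds above stop before epoch 10. That is consistent with `EarlyStoppingCallback(early_stopping_patience=2)`: training halts once the validation loss has failed to improve on its best value for two consecutive evaluations. The function below is a simplified re-implementation for illustration only (`stop_epoch` is a hypothetical helper, not part of `transformers`), applied to the Trial 2, Fold 1 losses:

```python
def stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which patience-based early stopping halts.

    Simplified model: an epoch counts as an improvement only if its loss is
    strictly lower than the best loss seen so far.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses)  # patience never exhausted: all epochs ran


# Trial 2, Fold 1 validation losses, copied from the table above.
trial2_fold1 = [1.103272, 1.068394, 1.055883, 1.046827,
                1.036315, 1.027424, 1.042055, 1.033729]
print(stop_epoch(trial2_fold1))
```

For this fold the best loss occurs at epoch 6 and the next two epochs do not beat it, so the sketch stops at epoch 8, the last row of that table.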


Trial 3, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,1.106798
2,1.217200,1.073762
3,1.033000,1.044085
4,0.967000,1.032727
5,0.967000,1.024253
6,0.896200,1.016577
7,0.870700,1.024085
8,0.856200,1.013587
9,0.856200,1.013088
10,0.810000,1.010665


Trial 3, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,1.150681
2,1.223100,1.076904
3,1.021900,1.042689
4,0.984900,1.02412
5,0.984900,1.007416
6,0.949500,1.000362
7,0.906400,0.992293
8,0.857700,0.986623
9,0.857700,0.981012
10,0.873900,0.979259


Trial 3, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,1.157248
2,1.197000,1.119729
3,1.010100,1.097304
4,0.946900,1.090165
5,0.946900,1.082162
6,0.894300,1.077168
7,0.854300,1.069154
8,0.829900,1.067711
9,0.829900,1.067836
10,0.807800,1.067852


Trial 3, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,1.145668
2,1.185700,1.090564
3,0.995500,1.073861
4,0.991100,1.055931
5,0.991100,1.045685
6,0.909200,1.030443
7,0.872300,1.027825
8,0.850100,1.027597
9,0.850100,1.025668
10,0.848200,1.025165


[I 2025-05-19 16:33:07,490] Trial 3 finished with value: 1.0207000076770782 and parameters: {'learning_rate': 1.34336568680343e-05, 'batch_size': 8, 'weight_decay': 0.09445645065743215}. Best is trial 3 with value: 1.0207000076770782.


Trial 4, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,1.081222
2,1.113700,1.058025
3,1.113700,1.027113
4,0.911300,1.045842
5,0.911300,1.016502
6,0.805900,1.010988
7,0.805900,1.028062
8,0.726500,1.009489
9,0.726500,1.008293
10,0.678300,1.004835


Trial 4, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,1.096588
2,1.129500,1.03862
3,1.129500,1.016835
4,0.934800,0.987208
5,0.934800,0.977507
6,0.838400,0.966973
7,0.838400,0.959477
8,0.768100,0.951243
9,0.768100,0.943282
10,0.723800,0.938922


Trial 4, Fold 3/4




Epoch,Training Loss,Validation Loss
1,No log,1.11853
2,1.100100,1.092268
3,1.100100,1.076193
4,0.907000,1.063061
5,0.907000,1.066605
6,0.810700,1.06328


Trial 4, Fold 4/4




Epoch,Training Loss,Validation Loss
1,No log,1.141894
2,1.134900,1.080963
3,1.134900,1.064645
4,0.917400,1.046852
5,0.917400,1.028944
6,0.844500,1.021431
7,0.844500,1.033104
8,0.765900,1.018342
9,0.765900,1.017876
10,0.747200,1.015429


[I 2025-05-19 16:36:01,324] Trial 4 finished with value: 1.00556181371212 and parameters: {'learning_rate': 2.6771137242145903e-05, 'batch_size': 12, 'weight_decay': 0.1422602954229404}. Best is trial 4 with value: 1.00556181371212.


Trial 5, Fold 1/4




Epoch,Training Loss,Validation Loss
1,No log,1.059523
2,1.088700,1.04709
3,1.088700,1.028555
4,0.871700,1.052791
5,0.871700,1.008815
6,0.739400,1.008851
7,0.739400,1.047729


Trial 5, Fold 2/4




Epoch,Training Loss,Validation Loss
1,No log,1.065477
2,1.106300,1.033425
3,1.106300,1.029344
4,0.936200,1.013424
5,0.936200,1.009531
6,0.826900,1.002532
7,0.826900,0.986288
8,0.742500,0.993806
9,0.742500,0.98561
10,0.692300,0.980714


Trial 5, Fold 3/4




Epoch,Training Loss,Validation Loss
1,No log,1.101657
2,1.077000,1.079812
3,1.077000,1.074206
4,0.872600,1.059647
5,0.872600,1.056396
6,0.758800,1.068285
7,0.758800,1.046148
8,0.663700,1.047555
9,0.663700,1.039104
10,0.596900,1.038494


Trial 5, Fold 4/4




Epoch,Training Loss,Validation Loss
1,No log,1.112947
2,1.083500,1.046638
3,1.083500,1.080403
4,0.872800,1.049915


[I 2025-05-19 16:38:31,316] Trial 5 finished with value: 1.0186652094125748 and parameters: {'learning_rate': 3.538461259525519e-05, 'batch_size': 12, 'weight_decay': 0.02347061968879934}. Best is trial 4 with value: 1.00556181371212.


Trial 6, Fold 1/4




Epoch,Training Loss,Validation Loss
1,No log,1.128056
2,1.112900,1.094087
3,1.112900,1.080276
4,0.922300,1.085061
5,0.922300,1.05925
6,0.821800,1.044697
7,0.821800,1.042784
8,0.756600,1.04026
9,0.756600,1.038442
10,0.700500,1.037386


Trial 6, Fold 2/4




Epoch,Training Loss,Validation Loss
1,No log,1.097313
2,1.130100,1.039116
3,1.130100,1.017293
4,0.935700,0.987761
5,0.935700,0.977818
6,0.839900,0.967469
7,0.839900,0.960027
8,0.770100,0.95181
9,0.770100,0.943836
10,0.726200,0.939532


Trial 6, Fold 3/4




Epoch,Training Loss,Validation Loss
1,No log,1.119018
2,1.100700,1.092623
3,1.100700,1.076258
4,0.907900,1.063214
5,0.907900,1.06687
6,0.812100,1.063157
7,0.812100,1.049739
8,0.734900,1.051754
9,0.734900,1.046735
10,0.681400,1.043877


Trial 6, Fold 4/4




Epoch,Training Loss,Validation Loss
1,No log,1.133218
2,1.104200,1.066531
3,1.104200,1.065628
4,0.911100,1.040984
5,0.911100,1.018049
6,0.828000,1.010414
7,0.828000,0.998097
8,0.741200,0.997013
9,0.741200,0.991393
10,0.705000,0.989725


[I 2025-05-19 16:41:45,208] Trial 6 finished with value: 1.0026298016309738 and parameters: {'learning_rate': 2.658616083788978e-05, 'batch_size': 12, 'weight_decay': 0.2900332895916222}. Best is trial 6 with value: 1.0026298016309738.


Best Hyperparameters: {'learning_rate': 2.658616083788978e-05, 'batch_size': 12, 'weight_decay': 0.2900332895916222}
Training final model...




Step,Training Loss
50,1.1376
100,0.9396
150,0.8575
200,0.7649
250,0.7257
300,0.672


Saving final model...
Saved the following files for task 3:
1. Model and tokenizer at: ./promise_task3_processed_Base-Model/task3_final_model
2. Configuration at: ./promise_task3_processed_Base-Model/task3_config.json
3. Best hyperparameters at: ./promise_task3_processed_Base-Model/task3_best_hyperparameters.json

Task 3 training complete. You can now use these files for prediction on new data.
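
The trial values that Optuna reports in the log above can be checked by hand. Assuming the Task 3 search uses the same setup as the Task 4 training cell (`load_best_model_at_end=True`, objective returning the mean of the per-fold evaluation losses), each fold contributes its lowest logged validation loss and the trial value is the mean over the four folds. The sketch below uses the Trial 5 validation losses copied from the log; the helper name `trial_value` is ours and is not part of the notebook.

```python
from statistics import mean

# Per-epoch validation losses for Task 3, Trial 5, copied from the log above.
# Folds 1 and 4 are shorter because early stopping ended them before epoch 10.
trial5_val_losses = {
    1: [1.059523, 1.047090, 1.028555, 1.052791, 1.008815, 1.008851, 1.047729],
    2: [1.065477, 1.033425, 1.029344, 1.013424, 1.009531,
        1.002532, 0.986288, 0.993806, 0.985610, 0.980714],
    3: [1.101657, 1.079812, 1.074206, 1.059647, 1.056396,
        1.068285, 1.046148, 1.047555, 1.039104, 1.038494],
    4: [1.112947, 1.046638, 1.080403, 1.049915],
}


def trial_value(fold_losses):
    """Mean over folds of each fold's best (lowest) validation loss."""
    return mean(min(losses) for losses in fold_losses.values())


value = trial_value(trial5_val_losses)
print(f"{value:.6f}")  # 1.018665, matching the logged 1.0186652...
```

The same calculation on Trials 4 and 6 gives 1.005562 and 1.002630, in line with the logged values.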


### 3.3.2. Task3, Predict

In [None]:
test_data_T1_T2_labeled_path = 'english_submission_file_restructured_T1_T2-labeled.json'
test_data_T1_T2_T3_labeled_path = 'english_submission_file_restructured_T1_T2_T3-labeled.json'

task3_model_path = os.path.join(task3_output_path, "task3_final_model")
task3_config_path = os.path.join(task3_output_path, "task3_config.json")

print("Loading model and tokenizer...")
model = AutoModelForSequenceClassification.from_pretrained(task3_model_path)
tokenizer = AutoTokenizer.from_pretrained(task3_model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load label_mapping
with open(task3_config_path, 'r') as f:
    config = json.load(f)
    label_mapping = config['label_mapping']

print("Loading test data...")
with open(test_data_T1_T2_labeled_path, 'r') as file:
    test_data = json.load(file)

# Extract texts for prediction
texts = [item['data'] for item in test_data.values()]

# Tokenize preprocessed texts
def tokenize_function(examples):
    return tokenizer(examples, truncation=True, padding="max_length", max_length=512)

encodings = tokenize_function(texts)
dataset = Dataset.from_dict({
    "input_ids": encodings['input_ids'],
    "attention_mask": encodings['attention_mask']
})

# Prediction function
def predict(dataset):
    model.eval()
    trainer = Trainer(model=model, tokenizer=tokenizer)
    raw_pred, _, _ = trainer.predict(dataset)
    predictions = torch.softmax(torch.from_numpy(raw_pred), dim=-1)
    return predictions.argmax(dim=1).numpy()

# Make predictions
print("Making predictions...")
predictions = predict(dataset)

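# Write each prediction back to its record's evidence_quality field.
# Pair predictions with the dict's own keys (same order as `texts`)
# instead of assuming the keys are "0", "1", ...
for key, pred in zip(test_data.keys(), predictions):
    test_data[key]['evidence_quality'] = label_mapping[str(pred)]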

print(f"Saving predictions to {test_data_T1_T2_T3_labeled_path}")
with open(test_data_T1_T2_T3_labeled_path, 'w') as file:
    json.dump(test_data, file, indent=4)

print("\nPrediction Statistics:")
pred_distribution = {}
for item in test_data.values():
    status = item['evidence_quality']
    pred_distribution[status] = pred_distribution.get(status, 0) + 1

for status, count in pred_distribution.items():
    percentage = (count / len(test_data)) * 100
    print(f"{status}: {count} ({percentage:.1f}%)")

Loading model and tokenizer...
Loading test data...
Making predictions...




Saving predictions to english_submission_file_restructured_T1_T2_T3-labeled.json

Prediction Statistics:
N/A: 216 (54.0%)
Clear: 171 (42.8%)
Not Clear: 13 (3.2%)
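
The prediction cell looks up labels with `label_mapping[str(pred)]` rather than `label_mapping[pred]`. The reason is that JSON object keys are always strings, so an integer-keyed mapping written with `json.dump` comes back with string keys. A minimal illustration follows; the id assignment shown here is hypothetical, and the real one is whatever `task3_config.json` contains.

```python
import json

# Hypothetical id-to-label mapping with integer keys (illustrative ids only).
id_to_label = {0: "Clear", 1: "Not Clear", 2: "Misleading", 3: "N/A"}

# Round-trip through JSON, as happens between the training and prediction cells.
restored = json.loads(json.dumps({"label_mapping": id_to_label}))["label_mapping"]

print(list(restored.keys()))  # ['0', '1', '2', '3']

pred = 1                     # e.g. the argmax over the logits
label = restored[str(pred)]  # restored[pred] would raise KeyError
print(label)                 # Not Clear
```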


## 3.4. Task4, Timing for Verification

Following the MSCI guidelines, we set timing labels (within 2 years, 2-5 years, longer than 5 years, or other) to indicate when readers and investors should return to verify the promise. Here, "other" means that the promise has already been verified or has no specific verification time.
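
The description names four timing categories, while the `verification_timeline` field in the training data takes five label strings. One plausible reading, stated here as an assumption rather than taken from the task definition, is that "Already" and "N/A" together make up the "other" category. A small sketch of that correspondence:

```python
# Label strings as they appear in the `verification_timeline` field.
DATASET_LABELS = [
    "Less than 2 years",
    "2 to 5 years",
    "More than 5 years",
    "Already",
    "N/A",
]

# Assumed correspondence to the four categories in the description.
# Treating both "Already" and "N/A" as "other" is our reading, not an official mapping.
TO_DESCRIPTION_CATEGORY = {
    "Less than 2 years": "within 2 years",
    "2 to 5 years": "2-5 years",
    "More than 5 years": "longer than 5 years",
    "Already": "other",
    "N/A": "other",
}


def describe(label: str) -> str:
    """Map a dataset label (surrounding whitespace ignored) to its description category."""
    return TO_DESCRIPTION_CATEGORY[label.strip()]


for dataset_label in DATASET_LABELS:
    print(f"{dataset_label!r} -> {describe(dataset_label)!r}")
```

The classifier itself is trained on the five dataset labels, so this grouping is only an aid for reading the description.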

### 3.4.1. Task4, Train
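
Each fold in the training cell below runs with `EarlyStoppingCallback(early_stopping_patience=2)`, so a fold ends once the validation loss has failed to improve on its best value for two consecutive epochs. The following is a simplified pure-Python sketch of that rule, not the library's implementation; `stopping_epoch` is a hypothetical helper, and the loss values are copied from Trial 0, Fold 3 of the Task 4 log.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the patience counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Validation losses logged for Task 4, Trial 0, Fold 3.
fold3 = [1.437929, 1.447675, 1.475312]
print(stopping_epoch(fold3))  # 3
```

This explains why some folds in the log list fewer than ten epochs: the best epoch is always at least two rows above the last one.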



In [None]:
# Create task-specific paths
task4_model_path = os.path.join(task4_output_path, "task4_final_model")
task4_hyperparams_path = os.path.join(task4_output_path, "task4_best_hyperparameters.json")
task4_config_path = os.path.join(task4_output_path, "task4_config.json")  # label_mapping

if not torch.cuda.is_available():
    raise RuntimeError("This script requires a GPU to run")

def set_seed(seed_value=42):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # disable cuDNN autotuning, which can vary between runs

set_seed()

def prepare_model_for_training():
    model = AutoModelForSequenceClassification.from_pretrained(
        "nbroad/ESG-BERT",
        num_labels=5,
        ignore_mismatched_sizes=True
    )

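    # num_labels=5 already builds a 5-way head; this re-creates it explicitly
    # with PyTorch's default Linear initialization.
    model.classifier = torch.nn.Linear(model.config.hidden_size, 5)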

    # Freeze all layers except last N transformer layers
    N = 2
    for param in model.bert.parameters():
        param.requires_grad = False
    for param in model.bert.encoder.layer[-N:].parameters():
        param.requires_grad = True
    for param in model.classifier.parameters():
        param.requires_grad = True

    return model

print("Loading dataset...")
with open("PromiseEval_Trainset_English.json", 'r') as file:
    data = json.load(file)

timeline_to_id = {
    "Less than 2 years": 0,
    "2 to 5 years": 1,
    "More than 5 years": 2,
    "Already": 3,
    "N/A": 4
}

tokenizer = AutoTokenizer.from_pretrained("nbroad/ESG-BERT")

def tokenize_function(examples):
    return tokenizer(examples, truncation=True, padding="max_length", max_length=512)

# Extract texts and integer labels (no additional preprocessing is applied here)
print("Processing texts...")
texts = [item['data'] for item in data]
labels = [timeline_to_id[item["verification_timeline"].strip()] for item in data]

# Setup k-fold
n_splits = 4
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 12])
    weight_decay = trial.suggest_float("weight_decay", 0.01, 0.3)

    fold_losses = []
    for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
        print(f"Trial {trial.number}, Fold {fold + 1}/{n_splits}")

        # Prepare fold data
        train_texts = [texts[i] for i in train_idx]
        val_texts = [texts[i] for i in val_idx]
        train_labels = [labels[i] for i in train_idx]
        val_labels = [labels[i] for i in val_idx]

        train_encodings = tokenize_function(train_texts)
        val_encodings = tokenize_function(val_texts)

        train_dataset = Dataset.from_dict({**train_encodings, "labels": train_labels})
        val_dataset = Dataset.from_dict({**val_encodings, "labels": val_labels})

        model = prepare_model_for_training()
        model.to("cuda")

        trial_dir = os.path.join(task4_output_path, f'trial_{trial.number}_fold_{fold}')

        training_args = TrainingArguments(
            output_dir=trial_dir,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            save_total_limit=1,
            logging_strategy="steps",
            logging_steps=50,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            num_train_epochs=10,
            learning_rate=learning_rate,
            weight_decay=weight_decay,
            load_best_model_at_end=True,
            no_cuda=False,
            report_to="none",
            seed=42

        )

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            tokenizer=tokenizer,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
        )

        try:
            trainer.train()
            val_output = trainer.evaluate()
            fold_losses.append(val_output["eval_loss"])
        except Exception as e:
            print(f"Error in fold {fold}: {e}")
            return float('inf')
        finally:
            del model, trainer
            torch.cuda.empty_cache()

    return np.mean(fold_losses)

# Run hyperparameter optimization
print("Starting hyperparameter optimization...")
study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=7)

# Save best hyperparameters
best_hyperparameters = study.best_params
print("Best Hyperparameters:", best_hyperparameters)

# Train final model
print("Training final model...")
train_encodings = tokenize_function(texts)
train_dataset = Dataset.from_dict({**train_encodings, "labels": labels})

final_model = prepare_model_for_training()
final_model.to("cuda")


final_training_args = TrainingArguments(
    output_dir=task4_model_path,
    save_strategy="epoch",
    save_total_limit=2,
    logging_strategy="steps",
    logging_steps=50,
    per_device_train_batch_size=best_hyperparameters["batch_size"],
    num_train_epochs=10,
    learning_rate=best_hyperparameters["learning_rate"],
    weight_decay=best_hyperparameters["weight_decay"],
    no_cuda=False,
    report_to="none",
    seed=42

)

final_trainer = Trainer(
    model=final_model,
    args=final_training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)

final_trainer.train()

# Save everything needed for prediction
print("Saving model and prediction requirements for task 4...")

# 1. Save the model and tokenizer (save_pretrained writes weights and config only, no optimizer state)
final_model.save_pretrained(task4_model_path)
tokenizer.save_pretrained(task4_model_path)

# 2. Save label mapping (JSON serialization turns the integer ids into string keys)
label_mapping = {
    'label_mapping': {v: k for k, v in timeline_to_id.items()},
}

with open(task4_config_path, 'w') as f:
    json.dump(label_mapping, f, indent=2)

# 3. Save best hyperparameters
with open(task4_hyperparams_path, 'w') as f:
    json.dump(best_hyperparameters, f, indent=2)

print("Saved the following files for task 4 prediction:")
print(f"1. Model and tokenizer at: {task4_model_path}")
print(f"2. Label mapping at: {task4_config_path}")
print(f"3. Best hyperparameters at: {task4_hyperparams_path}")
print("\nTask 4 training complete. You can now use these files for prediction on new data.")

Loading dataset...



[I 2025-05-19 19:10:40,063] A new study created in memory with name: no-name-86ba3d72-c633-49a3-8129-5ef4ba96dfa1


Processing texts...
Starting hyperparameter optimization...
Trial 0, Fold 1/4



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nbroad/ESG-BERT and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([26]) in the checkpoint and torch.Size([5]) in the model instantiated
- classifier.weight: found shape torch.Size([26, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Epoch,Training Loss,Validation Loss
1,1.4635,1.400041
2,1.2617,1.371121
3,1.1822,1.36371
4,1.15,1.371544
5,1.1279,1.380913


Trial 0, Fold 2/4




Epoch,Training Loss,Validation Loss
1,1.4632,1.332383
2,1.2857,1.309384
3,1.2341,1.293656
4,1.2018,1.298221
5,1.101,1.301682


Trial 0, Fold 3/4




Epoch,Training Loss,Validation Loss
1,1.4811,1.437929
2,1.2756,1.447675
3,1.1836,1.475312


Trial 0, Fold 4/4




Epoch,Training Loss,Validation Loss
1,1.5036,1.329222
2,1.3015,1.303463
3,1.2252,1.311372
4,1.1805,1.306459


[I 2025-05-19 19:12:27,883] Trial 0 finished with value: 1.3496894240379333 and parameters: {'learning_rate': 1.827226177606625e-05, 'batch_size': 4, 'weight_decay': 0.055245405728306586}. Best is trial 0 with value: 1.3496894240379333.


Trial 1, Fold 1/4




Epoch,Training Loss,Validation Loss
1,No log,1.503303
2,1.565500,1.430037
3,1.319500,1.399427
4,1.266300,1.382958
5,1.266300,1.381661
6,1.209900,1.383805
7,1.181000,1.378861
8,1.137400,1.378286
9,1.137400,1.377866
10,1.114900,1.377581


Trial 1, Fold 2/4




Epoch,Training Loss,Validation Loss
1,No log,1.455437
2,1.509600,1.389818
3,1.320400,1.35866
4,1.270100,1.343753
5,1.270100,1.332474
6,1.206400,1.327327
7,1.170000,1.326807
8,1.184100,1.323764
9,1.184100,1.321217
10,1.144500,1.321051


Trial 1, Fold 3/4




Epoch,Training Loss,Validation Loss
1,No log,1.452943
2,1.478200,1.429368
3,1.303500,1.424725
4,1.205100,1.430577
5,1.205100,1.433271


Trial 1, Fold 4/4




Epoch,Training Loss,Validation Loss
1,No log,1.414852
2,1.473000,1.362774
3,1.310600,1.350118
4,1.249000,1.339346
5,1.249000,1.328625
6,1.176600,1.323247
7,1.199900,1.32247
8,1.157900,1.322288
9,1.157900,1.322335
10,1.152700,1.321535


[I 2025-05-19 19:15:32,328] Trial 1 finished with value: 1.3612230718135834 and parameters: {'learning_rate': 1.2853916978930139e-05, 'batch_size': 8, 'weight_decay': 0.21534104756085318}. Best is trial 0 with value: 1.3496894240379333.


Trial 2, Fold 1/4




Epoch,Training Loss,Validation Loss
1,1.5181,1.396352
2,1.3149,1.356275
3,1.2706,1.34173
4,1.205,1.344759
5,1.2166,1.338141
6,1.1738,1.337182
7,1.1386,1.334576
8,1.1127,1.333865
9,1.1107,1.330715
10,1.079,1.328876


Trial 2, Fold 2/4




Epoch,Training Loss,Validation Loss
1,1.5067,1.423914
2,1.3347,1.376605
3,1.2937,1.351792
4,1.2526,1.334417
5,1.1846,1.329957
6,1.2193,1.319873
7,1.1087,1.320713
8,1.1001,1.31759
9,1.0808,1.316111
10,1.0931,1.317049


Trial 2, Fold 3/4




Epoch,Training Loss,Validation Loss
1,1.531,1.436294
2,1.3318,1.419642
3,1.2777,1.422403
4,1.1441,1.439101


Trial 2, Fold 4/4




Epoch,Training Loss,Validation Loss
1,1.6462,1.432504
2,1.3695,1.361249
3,1.2988,1.341592
4,1.2556,1.333913
5,1.1889,1.336859
6,1.2392,1.331635
7,1.1258,1.324409
8,1.1318,1.328076
9,1.0997,1.327781


[I 2025-05-19 19:18:42,780] Trial 2 finished with value: 1.3472595810890198 and parameters: {'learning_rate': 1.0336843570697396e-05, 'batch_size': 4, 'weight_decay': 0.06272924049005918}. Best is trial 2 with value: 1.3472595810890198.


Trial 3, Fold 1/4




Epoch,Training Loss,Validation Loss
1,No log,1.424388
2,1.475500,1.376679
3,1.297600,1.349612
4,1.243400,1.344633
5,1.243400,1.346379
6,1.193300,1.348299


Trial 3, Fold 2/4




Epoch,Training Loss,Validation Loss
1,No log,1.405688
2,1.508500,1.343843
3,1.339500,1.323584
4,1.261800,1.31629
5,1.261800,1.307795
6,1.212000,1.311114
7,1.178000,1.310085


Trial 3, Fold 3/4




Epoch,Training Loss,Validation Loss
1,No log,1.456041
2,1.459900,1.438357
3,1.287600,1.442926
4,1.156200,1.453435


Trial 3, Fold 4/4




Epoch,Training Loss,Validation Loss
1,No log,1.445966
2,1.560800,1.368467
3,1.328400,1.352547
4,1.257600,1.332139
5,1.257600,1.327694
6,1.194100,1.321555
7,1.202900,1.31665
8,1.142000,1.322172
9,1.142000,1.319136


[I 2025-05-19 19:21:00,171] Trial 3 finished with value: 1.3518588840961456 and parameters: {'learning_rate': 1.34336568680343e-05, 'batch_size': 8, 'weight_decay': 0.09445645065743215}. Best is trial 2 with value: 1.3472595810890198.


Trial 4, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,1.384412
2,1.385300,1.354062
3,1.385300,1.348773
4,1.212500,1.356624
5,1.212500,1.369931


Trial 4, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,1.348148
2,1.416900,1.294717
3,1.416900,1.288102
4,1.202900,1.286109
5,1.202900,1.281103
6,1.120200,1.2921
7,1.120200,1.293403


Trial 4, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,1.437706
2,1.347900,1.451437
3,1.347900,1.466341


Trial 4, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,1.363502
2,1.405300,1.314727
3,1.405300,1.318871
4,1.204600,1.306175
5,1.204600,1.301393
6,1.112700,1.301782
7,1.112700,1.306108


[I 2025-05-19 19:22:53,094] Trial 4 finished with value: 1.3422439098358154 and parameters: {'learning_rate': 2.6771137242145903e-05, 'batch_size': 12, 'weight_decay': 0.1422602954229404}. Best is trial 4 with value: 1.3422439098358154.


Trial 5, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,1.382485
2,1.346500,1.371279
3,1.346500,1.340625
4,1.121500,1.330581
5,1.121500,1.32286
6,0.992800,1.329949
7,0.992800,1.331744


Trial 5, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,1.350407
2,1.377400,1.309378
3,1.377400,1.31478
4,1.121200,1.303354
5,1.121200,1.290678
6,1.001400,1.297239
7,1.001400,1.277081
8,0.905900,1.290497
9,0.905900,1.293127


Trial 5, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,1.447036
2,1.327900,1.467338
3,1.327900,1.468246


Trial 5, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,1.347474
2,1.386600,1.318381
3,1.386600,1.321368
4,1.172500,1.30771
5,1.172500,1.304684
6,1.061500,1.311904
7,1.061500,1.31913


[I 2025-05-19 19:25:03,652] Trial 5 finished with value: 1.3379151225090027 and parameters: {'learning_rate': 3.538461259525519e-05, 'batch_size': 12, 'weight_decay': 0.02347061968879934}. Best is trial 5 with value: 1.3379151225090027.


Trial 6, Fold 1/4


Epoch,Training Loss,Validation Loss
1,No log,1.391267
2,1.366800,1.378358
3,1.366800,1.350034
4,1.164800,1.341152
5,1.164800,1.329395
6,1.061000,1.334897
7,1.061000,1.325988
8,0.980400,1.326663
9,0.980400,1.328564


Trial 6, Fold 2/4


Epoch,Training Loss,Validation Loss
1,No log,1.381522
2,1.397200,1.320439
3,1.397200,1.312613
4,1.211900,1.309932
5,1.211900,1.305328
6,1.116900,1.315213
7,1.116900,1.305794


Trial 6, Fold 3/4


Epoch,Training Loss,Validation Loss
1,No log,1.437762
2,1.348800,1.450946
3,1.348800,1.465517


Trial 6, Fold 4/4


Epoch,Training Loss,Validation Loss
1,No log,1.363932
2,1.405800,1.314781
3,1.405800,1.318873
4,1.205400,1.306128
5,1.205400,1.301367
6,1.114000,1.301625
7,1.114000,1.305883


[I 2025-05-19 19:27:15,222] Trial 6 finished with value: 1.3426111340522766 and parameters: {'learning_rate': 2.658616083788978e-05, 'batch_size': 12, 'weight_decay': 0.2900332895916222}. Best is trial 5 with value: 1.3379151225090027.


Best Hyperparameters: {'learning_rate': 3.538461259525519e-05, 'batch_size': 12, 'weight_decay': 0.02347061968879934}
Training final model...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nbroad/ESG-BERT and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([26]) in the checkpoint and torch.Size([5]) in the model instantiated
- classifier.weight: found shape torch.Size([26, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  final_trainer = Trainer(


Step,Training Loss
50,1.3899
100,1.1766
150,1.0656
200,1.0017
250,0.8875
300,0.8679


Saving model and prediction requirements for task 4...
Saved the following files for task 4 prediction:
1. Model and tokenizer at: ./promise_task4_processed_Base-Model/task4_final_model
2. Label mapping at: ./promise_task4_processed_Base-Model/task4_config.json
3. Best hyperparameters at: ./promise_task4_processed_Base-Model/task4_best_hyperparameters.json

Task 4 training complete. You can now use these files for prediction on new data.

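The objective values Optuna reports in the log above are consistent with the mean, across the four folds, of each fold's lowest validation loss. The short check below recomputes the values for Trials 4 to 6 from the per-epoch tables. The numbers are copied from the log; the dictionary layout and the helper name `trial_objective` are illustrative assumptions, not the notebook's own objective function.

```python
from statistics import mean

# Per-epoch validation losses copied from the log above:
# {trial number: [fold 1, fold 2, fold 3, fold 4]}
val_losses = {
    4: [
        [1.384412, 1.354062, 1.348773, 1.356624, 1.369931],
        [1.348148, 1.294717, 1.288102, 1.286109, 1.281103, 1.2921, 1.293403],
        [1.437706, 1.451437, 1.466341],
        [1.363502, 1.314727, 1.318871, 1.306175, 1.301393, 1.301782, 1.306108],
    ],
    5: [
        [1.382485, 1.371279, 1.340625, 1.330581, 1.32286, 1.329949, 1.331744],
        [1.350407, 1.309378, 1.31478, 1.303354, 1.290678, 1.297239, 1.277081,
         1.290497, 1.293127],
        [1.447036, 1.467338, 1.468246],
        [1.347474, 1.318381, 1.321368, 1.30771, 1.304684, 1.311904, 1.31913],
    ],
    6: [
        [1.391267, 1.378358, 1.350034, 1.341152, 1.329395, 1.334897, 1.325988,
         1.326663, 1.328564],
        [1.381522, 1.320439, 1.312613, 1.309932, 1.305328, 1.315213, 1.305794],
        [1.437762, 1.450946, 1.465517],
        [1.363932, 1.314781, 1.318873, 1.306128, 1.301367, 1.301625, 1.305883],
    ],
}

# Objective values reported by Optuna in the same log.
reported = {
    4: 1.3422439098358154,
    5: 1.3379151225090027,
    6: 1.3426111340522766,
}


def trial_objective(folds):
    """Assumed objective: mean over folds of each fold's lowest validation loss."""
    return mean(min(fold) for fold in folds)


for trial, folds in val_losses.items():
    value = trial_objective(folds)
    print(f"Trial {trial}: recomputed {value:.6f}, reported {reported[trial]:.6f}")

# The trial with the lowest objective is the one whose hyperparameters are kept.
best_trial = min(val_losses, key=lambda t: trial_objective(val_losses[t]))
print("Best trial among 4-6:", best_trial)
```

In every fold, training stops two epochs after the lowest validation loss, which suggests early stopping with a patience of 2; this is an inference from the tables, not something the log states.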

### 3.4.2. Task4, Predict

In [None]:
test_data_T1_T2_T3_labeled_path = 'english_submission_file_restructured_T1_T2_T3-labeled.json' # input
test_data_T1_T2_T3_T4_labeled_path = 'english_submission_file_restructured_T1_T2_T3_T4-labeled.json' # output

task4_model_path = os.path.join(task4_output_path, "task4_final_model")
task4_config_path = os.path.join(task4_output_path, "task4_config.json")

print("Loading model and tokenizer...")
model = AutoModelForSequenceClassification.from_pretrained(task4_model_path)
tokenizer = AutoTokenizer.from_pretrained(task4_model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load label_mapping
with open(task4_config_path, 'r') as f:
    config = json.load(f)
    label_mapping = config['label_mapping']

print("Loading test data...")
with open(test_data_T1_T2_T3_labeled_path, 'r') as file:
    test_data = json.load(file)

texts = [item['data'] for item in test_data.values()]

# Tokenize preprocessed texts
def tokenize_function(examples):
    return tokenizer(examples, truncation=True, padding="max_length", max_length=512)

encodings = tokenize_function(texts)
dataset = Dataset.from_dict({
    "input_ids": encodings['input_ids'],
    "attention_mask": encodings['attention_mask']
})

# Prediction function
def predict(dataset):
    model.eval()
    trainer = Trainer(model=model, tokenizer=tokenizer)
    raw_pred, _, _ = trainer.predict(dataset)
    predictions = torch.softmax(torch.from_numpy(raw_pred), dim=-1)
    return predictions.argmax(dim=1).numpy()

# Make predictions
print("Making predictions...")
predictions = predict(dataset)

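# Update the verification_timeline based on predictions.
# Pair each prediction with its original key (same order as test_data.values())
# instead of assuming the keys are "0", "1", ...
for key, pred in zip(test_data.keys(), predictions):
    test_data[key]['verification_timeline'] = label_mapping[str(pred)]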

print(f"Saving predictions to {test_data_T1_T2_T3_T4_labeled_path}")
with open(test_data_T1_T2_T3_T4_labeled_path, 'w') as file:
    json.dump(test_data, file, indent=4)

print("\nPrediction Statistics:")
pred_distribution = {}
for item in test_data.values():
    timeline = item['verification_timeline']
    pred_distribution[timeline] = pred_distribution.get(timeline, 0) + 1

for timeline, count in pred_distribution.items():
    percentage = (count / len(test_data)) * 100
    print(f"{timeline}: {count} ({percentage:.1f}%)")

Loading model and tokenizer...
Loading test data...
Making predictions...
Saving predictions to english_submission_file_restructured_T1_T2_T3_T4-labeled.json

Prediction Statistics:
N/A: 66 (16.5%)
2 to 5 years: 90 (22.5%)
Already: 190 (47.5%)
Less than 2 years: 14 (3.5%)
More than 5 years: 40 (10.0%)

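The prediction cell above reduces each row of logits to a class index and then looks that index up in `label_mapping`. Because softmax is monotonic, taking the argmax of the raw logits yields the same index, so the softmax step only matters when probabilities are needed. The sketch below illustrates this with made-up logits and a hypothetical label mapping; the real mapping is stored in `task4_config.json` and its index order is not shown in this notebook.

```python
import math
from collections import Counter

# Hypothetical mapping for illustration only; keys are strings because
# the mapping is loaded from JSON.
label_mapping = {
    "0": "Already",
    "1": "Less than 2 years",
    "2": "2 to 5 years",
    "3": "More than 5 years",
    "4": "N/A",
}

# Made-up logits for four texts, keyed like the entries of the test JSON.
logits_by_key = {
    "0": [2.1, -0.3, 0.4, -1.0, 0.2],
    "1": [-0.5, 0.1, 1.7, 0.3, -0.2],
    "2": [0.9, -1.2, 0.0, 0.1, 1.4],
    "3": [1.5, 0.2, -0.4, 0.0, 0.3],
}


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


predictions = {}
for key, row in logits_by_key.items():
    idx = argmax(row)
    # Softmax preserves ordering, so the predicted class is unchanged.
    assert idx == argmax(softmax(row))
    predictions[key] = label_mapping[str(idx)]

# Tally the predicted labels, mirroring the "Prediction Statistics" output.
counts = Counter(predictions.values())
for label, count in counts.items():
    print(f"{label}: {count} ({100 * count / len(predictions):.1f}%)")
```

Keying the predictions by the original dictionary keys, as done here, keeps each label attached to its source text even if the keys are not consecutive integers.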

# 4. Convert to Parquet for Kaggle Submission

The SemEval-2025 PromiseEval competition on Kaggle requires submissions in the Parquet file format. The code below converts our final prediction JSON file, which contains the predictions for all four tasks, into that format.

In [None]:
import pandas as pd
import json

with open("english_submission_file_restructured_T1_T2_T3_T4-labeled.json", 'r') as file:
    test_data = json.load(file)

df = pd.DataFrame.from_dict(test_data, orient='index')
df.to_parquet("test_data_base-model_T1_T2_T3_T4_submit.parquet")

print("Conversion complete! You may now submit the parquet file.")

Conversion complete!
