# Data Preprocessing Phase

# Data Loading Functions Documentation

## Overview
The `load_data_train_val` and `load_data_test` functions load dataset files in tab-separated format (`.tsv`). Each function reads a headerless file and assigns the appropriate column names.

## Parameters
- `file_path` (str): The path to the file containing the dataset.

## Expected Output
- `load_data_train_val(file_path)`: Returns a Pandas DataFrame with labeled training and validation data.
- `load_data_test(file_path)`: Returns a Pandas DataFrame with test data.

## Note
- Here, "validation data" refers to the development samples.
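
As a rough illustration of the expected file layout, the sketch below parses two made-up annotation rows with the standard-library `csv` module rather than pandas. The sample rows and the `COLUMNS` and `parse_rows` names are assumptions for illustration only. Short rows are padded with `None`, much as pandas fills missing label columns with `NaN`. Unlike pandas, `csv` leaves the offsets as strings.

```python
import csv
import io

# Column layout used by load_data_train_val: 4 entity fields + 13 label slots.
COLUMNS = ["file_name", "entity", "start_offset", "end_offset"] + [
    f"label_{i}" for i in range(13)
]

def parse_rows(tsv_text):
    """Parse headerless TSV text into dicts keyed by COLUMNS (hypothetical helper)."""
    rows = []
    for fields in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Pad short rows so every row has one value per column.
        padded = fields + [None] * (len(COLUMNS) - len(fields))
        rows.append(dict(zip(COLUMNS, padded)))
    return rows

# Made-up sample rows (not taken from the real dataset).
sample = (
    "doc_01.txt\tlab-grown meat\t27\t40\tProtagonist\tUnderdog\n"
    "doc_02.txt\tACME Corp\t5\t13\tAntagonist\tCorrupt\tDeceiver\n"
)
parsed = parse_rows(sample)
print(parsed[1]["label_0"], parsed[1]["label_1"], parsed[1]["label_2"])
```

In this layout `label_0` holds the main role and the remaining label columns hold the refined roles.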

In [None]:
import pandas as pd

def load_data_train_val(file_path):
  data = pd.read_csv(file_path, sep="\t", header=None, names=["file_name", "entity", "start_offset", "end_offset", "label_0", "label_1", "label_2", "label_3", "label_4", "label_5", "label_6", "label_7", "label_8", "label_9", "label_10", "label_11", "label_12"])
  return data

def load_data_test(file_path):
  data = pd.read_csv(file_path, sep="\t", header=None, names=["file_name", "entity", "start_offset", "end_offset"])
  return data

In [None]:
# Load the file
file_path_train = ...
file_path_val = ...
train_data = load_data_train_val(file_path_train)
val_data = load_data_train_val(file_path_val)

In [None]:
file_path_test = ...
test_data = load_data_test(file_path_test)

# BERT-based Text Vectorization Function

## Overview
The `get_text_vector` function leverages a pre-trained BERT model to generate a semantic vector representation for a given text input. It tokenizes the text, processes it through BERT, and extracts the representation of the `[CLS]` token as the final vector.

## Parameters
- `text` (str): The input text to be transformed into a semantic vector.

## Expected Output
- Returns a PyTorch tensor representing the `[CLS]` token's hidden state from the BERT model.
- The output tensor has the shape `(1, hidden_size)`, where `hidden_size` is 768 for `bert-base-cased`.


In [None]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT (cased) model and tokenizer
model_name = "bert-base-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

def get_text_vector(text):
    # Tokenize the input text and add special tokens [CLS] and [SEP]
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

    # Get the outputs from BERT model
    with torch.no_grad():
        outputs = model(**inputs)

    # Use the [CLS] token's representation for the semantic vector
    cls_vector = outputs.last_hidden_state[:, 0, :]  # Shape: (batch_size, hidden_size)
    return cls_vector

# Semantic Vector Extraction Function

## Overview
The `get_semantic_vector` function reads the text file associated with an entity and converts its content into a semantic vector using the BERT-based `get_text_vector` function. Because `get_text_vector` truncates its input to 512 tokens, only the beginning of a long document contributes to the vector.

## Parameters
- `entity_info` (dict): A dictionary containing entity details, including:
  - `"file_name"` (str): The name of the file containing the text.
  - Additional entity metadata (not used in this function).
- `folder_path` (str): The path to the folder where text files are stored.

## Expected Output
- Returns a PyTorch tensor representing the semantic vector of the text extracted from the file.
- The vector is generated using the `get_text_vector` function, which employs a BERT model to obtain the `[CLS]` token's hidden state.


In [None]:
def get_semantic_vector(entity_info, folder_path):
    """
    Read the text file of an entity and return its semantic vector.

    Parameters:
        entity_info (dict): Entity details; only "file_name" is used here.
        folder_path (str): Path to the folder containing the .txt files.

    Returns:
        torch.Tensor: The [CLS] vector produced by get_text_vector.
    """
    # Open and read the file content

    with open(folder_path+"/"+entity_info["file_name"], "r", encoding="utf-8") as file:
        text = file.read()

    semantic_vector = get_text_vector(text)

    return semantic_vector

# Entity Annotation with Special Tokens

## Overview
The `add_special_tokens` function reads a text file and returns its content with special tokens (`<T>` and `</T>`) inserted around an entity mention, located by its character offsets. This marks the entity in the text for further processing.

## Parameters
- `entity_info` (dict): A dictionary containing entity details, including:
  - `"file_name"` (str): The name of the text file.
  - `"start_offset"` (int): The starting character position of the entity in the text.
  - `"end_offset"` (int): The ending character position of the entity in the text (inclusive: the function slices up to `end_offset + 1`).
  - Additional entity metadata (not used in this function).
- `folder_path` (str): The path to the folder where the text files are stored.

## Expected Output
- Returns a modified string where the specified entity is wrapped in `<T>` and `</T>` tags.
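
The offset arithmetic can be tried on an in-memory string, without reading a file. The sketch below is a minimal illustration: `wrap_entity` is a hypothetical helper and the sentence is made up. It treats the end offset as inclusive, as `add_special_tokens` does.

```python
def wrap_entity(text, start_offset, end_offset):
    """Wrap text[start_offset..end_offset] (end inclusive) in <T> ... </T>."""
    end = end_offset + 1  # convert the inclusive end offset to a slice bound
    return text[:start_offset] + "<T> " + text[start_offset:end] + " </T>" + text[end:]

# Made-up sentence; the offsets are computed rather than hard-coded.
text = "Scientists are now serving lab-grown meat in cafes."
entity = "lab-grown meat"
start = text.index(entity)
end = start + len(entity) - 1  # inclusive end offset
marked = wrap_entity(text, start, end)
print(marked)
```

Note that the tokens are inserted with a surrounding space, so the marked entity reads `<T> lab-grown meat </T>`.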



In [None]:
def add_special_tokens(entity_info, folder_path):
    """
    Add special tokens to mark an entity in the text based on its offsets.

    Parameters:
        entity_info (dict): Entity details with "file_name", "start_offset" and
            "end_offset" (inclusive). Example of the offset fields:
            {"start_offset": 27, "end_offset": 40, "entity": "lab-grown meat"}
        folder_path (str): Path to the folder containing the .txt files.

    Returns:
        str: The modified text with special tokens.
    """
    # Open and read the file content

    with open(folder_path+"/"+entity_info["file_name"], "r", encoding="utf-8") as file:
        text = file.read()

    # Wrap the entity span (end offset inclusive) in <T> ... </T>

    start, end = entity_info["start_offset"], entity_info["end_offset"]
    text = text[:start] + "<T> " + text[start:end+1] + " </T>" + text[end+1:]

    return text

# Training and Test Dataset Preprocessing

## Overview
The `preprocess_train_val_dataset` and `preprocess_test_dataset` functions process training, validation, and test datasets by extracting relevant entity information, adding special tokens, and generating semantic vectors.

## Parameters
- `data` (Pandas DataFrame): A dataset containing entity details and labels.
- `folder_path` (str): Path to the folder containing text files.

## Expected Output
- `preprocess_train_val_dataset(data, folder_path)`: Returns a list of dictionaries where each entry includes:
  - `file_name`, `entity_name`, `start_offset`, `end_offset`
  - `main_role`, `refined_roles`
  - `text` with special tokens
  - `semantic_vector`
  
- `preprocess_test_dataset(data, folder_path)`: Returns a list of dictionaries where each entry includes:
  - `file_name`, `entity_name`, `start_offset`, `end_offset`
  - `text` with special tokens
  - `semantic_vector`


In [1]:
def preprocess_train_val_dataset(data, folder_path):
  semantic_similarity_dataset = []
  for index, row in data.iterrows():
    # Positional layout: 0-3 entity fields, 4 main role, 5-16 refined roles
    dict_entity = {"file_name": row.iloc[0], "entity": row.iloc[1], "start_offset": row.iloc[2], "end_offset": row.iloc[3]}
    # Keep only the refined-role columns that are filled in (missing values are NaN)
    label_list = [row.iloc[i] for i in range(5, 17) if pd.notna(row.iloc[i])]
    semantic_similarity_dataset.append({
      "file_name": dict_entity["file_name"],
      "entity_name": dict_entity["entity"],
      "start_offset": dict_entity["start_offset"],
      "end_offset": dict_entity["end_offset"],
      "main_role": row.iloc[4],
      "refined_roles": label_list,
      "text": add_special_tokens(dict_entity, folder_path),
      "semantic_vector": get_semantic_vector(dict_entity, folder_path),
    })
  return semantic_similarity_dataset

def preprocess_test_dataset(data, folder_path):
  semantic_similarity_dataset = []
  for index, row in data.iterrows():
    dict_entity = {"file_name": row.iloc[0], "entity": row.iloc[1], "start_offset": row.iloc[2], "end_offset": row.iloc[3]}
    semantic_similarity_dataset.append({
      "file_name": dict_entity["file_name"],
      "entity_name": dict_entity["entity"],
      "start_offset": dict_entity["start_offset"],
      "end_offset": dict_entity["end_offset"],
      "text": add_special_tokens(dict_entity, folder_path),
      "semantic_vector": get_semantic_vector(dict_entity, folder_path),
    })
  return semantic_similarity_dataset

semantic_similarity_train_dataset = preprocess_train_val_dataset(train_data, "./Datasets/Train/EN/raw-documents")
semantic_similarity_val_dataset = preprocess_train_val_dataset(val_data, "./Datasets/Development/EN/subtask-1-documents")
semantic_similarity_test_dataset = preprocess_test_dataset(test_data, "./Datasets/Test/EN/subtask-1-documents")

# Save Processed Data to JSON

## Overview
The `save_to_json_train_val` and `save_to_json_test` functions serialize processed datasets into JSON format, ensuring compatibility by converting tensors into lists.

## Parameters
- `data` (list of dicts): A list of dictionaries containing processed entity details.
- `file_path` (str): The destination file path where the JSON output will be saved.

## Expected Output
- `save_to_json_train_val(data, file_path)`: Saves training and validation data in JSON format, including entity details, roles, text with special tokens, and semantic vectors.
- `save_to_json_test(data, file_path)`: Saves test data in JSON format, including entity details, text with special tokens, and semantic vectors.
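
To show why the tensors are converted to lists, the sketch below round-trips one made-up record through the `json` module. The nested list stands in for the result of `tensor.tolist()` on a `(1, hidden_size)` tensor, shortened to four values. All field values are illustrative assumptions.

```python
import json

# Stand-in for tensor.tolist() on a (1, hidden_size) tensor; 4 values instead of 768.
record = {
    "file_name": "doc_01.txt",
    "entity_name": "lab-grown meat",
    "start_offset": 27,
    "end_offset": 40,
    "text": "Scientists are now serving <T> lab-grown meat </T> in cafes.",
    "semantic_vector": [[0.12, -0.5, 0.33, 0.0]],
}

serialized = json.dumps([record], indent=4)  # same layout as json.dump(..., indent=4)
restored = json.loads(serialized)
vector = restored[0]["semantic_vector"]
print(len(vector), len(vector[0]))  # outer batch dimension, then vector length
```

The nested-list shape survives the round trip, so the vector can later be rebuilt with the same `(1, hidden_size)` shape.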


In [None]:
import json

def save_to_json_train_val(data, file_path):
    # Transform the data into JSON-compatible format if necessary
    json_data = []
    for entry in data:
        # Extract details and prepare JSON object
        json_object = {
            "file_name": entry['file_name'],
            "entity_name": entry['entity_name'],
            "start_offset": entry['start_offset'],
            "end_offset": entry['end_offset'],
            "main_role": entry['main_role'],
            "refined_roles": entry['refined_roles'],
            "text": entry['text'],
            "semantic_vector": entry['semantic_vector'].tolist()  # Ensure compatibility with JSON
        }
        json_data.append(json_object)

    # Save the list of JSON objects to a file
    with open(file_path, "w") as file:
        json.dump(json_data, file, indent=4)  # Pretty-print with indentation

    print(f"Data saved as JSON to {file_path}")

def save_to_json_test(data, file_path):
    json_data = []
    for entry in data:

        # Extract details and prepare JSON object
        json_object = {
            "file_name": entry['file_name'],
            "entity_name": entry['entity_name'],
            "start_offset": entry['start_offset'],
            "end_offset": entry['end_offset'],
            "text": entry['text'],
            "semantic_vector": entry['semantic_vector'].tolist()  # Ensure compatibility with JSON
        }
        json_data.append(json_object)


    # Save the list of JSON objects to a file
    with open(file_path, "w") as file:
        json.dump(json_data, file, indent=4)  # Pretty-print with indentation

In [None]:
preprocess_train_json_path = ...
preprocess_val_json_path = ...
preprocess_test_json_path = ...

save_to_json_train_val(semantic_similarity_train_dataset, preprocess_train_json_path)
save_to_json_train_val(semantic_similarity_val_dataset, preprocess_val_json_path)
save_to_json_test(semantic_similarity_test_dataset, preprocess_test_json_path)

# Classification Phase

In [None]:
# Mount the Drive
from google.colab import drive
drive.mount('/content/drive')

# Hugging Face Model Authentication and Loading

## Overview
The script logs into the Hugging Face Model Hub, downloads a pre-trained `Llama-3.2-3B` model, and configures the tokenizer for text generation tasks.

## Parameters
- `token` (str): Authentication token for accessing the Hugging Face Model Hub.
- `model_id` (str): Identifier for the pre-trained model (`meta-llama/Llama-3.2-3B`).

## Expected Output
- Logs into the Hugging Face Model Hub.
- Loads the `Llama-3.2-3B` model into memory using `torch_dtype=torch.float16` and `device_map="auto"` for optimized inference.
- Initializes the tokenizer and sets the pad token to match the end-of-sequence (EOS) token.


In [None]:
from huggingface_hub import login

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Log in to the Hugging Face Hub (needed to download the gated Llama model)
token = ...
login(token=token)

model_id = "meta-llama/Llama-3.2-3B"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token

# Read JSON Data into List of Dictionaries

## Overview
The `read_file_into_list_of_dicts` function loads JSON-formatted data, reconstructs its structure, and converts semantic vectors back into PyTorch tensors for further processing.

## Parameters
- `path_to_file` (str): The path to the JSON file containing the dataset.
- `train_val` (bool, default=True): A flag indicating whether the dataset is for training/validation (`True`) or testing (`False`).

## Expected Output
- Returns a list of dictionaries where each dictionary represents an entity with relevant attributes:
  - **Training/Validation Format:** Includes `file_name`, `entity_name`, `main_role`, `refined_roles`, `text`, and `semantic_vector`.
  - **Test Format:** Includes `file_name`, `start_offset`, `end_offset`, `text`, and `semantic_vector`.


In [None]:
import json
import torch

def read_file_into_list_of_dicts(path_to_file, train_val=True):
    rows = []

    # Open and read the JSON file
    with open(path_to_file, 'r', encoding='utf-8') as f:
        data = json.load(f)  # Load the JSON file as a list of dictionaries

    for entry in data:
        file_name = entry['file_name']
        semantic_vector = torch.tensor(entry['semantic_vector'])  # Convert list to tensor
        text = entry['text']

        if train_val:  # Train / Val format
            main_role = entry['main_role']
            refined_roles = entry['refined_roles']
            entity_name = entry['entity_name']

            row_dict = {
                'file_name': file_name,
                'entity_name': entity_name,
                'main_role': main_role.lower(),
                'refined_roles': refined_roles,
                'text': text,
                'semantic_vector': semantic_vector

            }

        else:  # Test format
            start_offset = entry['start_offset']
            end_offset = entry['end_offset']

            row_dict = {
                'file_name': file_name,
                'start_offset': start_offset,
                'end_offset': end_offset,
                'text': text,
                'semantic_vector': semantic_vector
            }

        rows.append(row_dict)

    return rows


# Compute Cosine Similarities

## Overview
The `compute_cosine_similarities` function calculates the cosine similarity between a given query vector and a set of precomputed semantic vectors from training data. It assigns similarity scores to each entry in the dataset.

## Parameters
- `query_vector` (torch.Tensor): A tensor representing the semantic vector of the query.
- `train_data_semantic_vectors` (list of dicts): A list where each dictionary contains a `semantic_vector` representing an entity.

## Expected Output
- Returns an updated list of dictionaries where each entry includes a new key, `cosine_similarity`, representing the similarity score between the query vector and the respective training vector.
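
The scoring step can be sketched without scikit-learn or PyTorch. The example below is a simplified stand-in, not the notebook's implementation: it uses plain Python lists, and the `cosine` and `score_entries` helpers and the two-dimensional vectors are assumptions for illustration. It computes cos(u, v) = (u · v) / (‖u‖ ‖v‖) and writes the score into each entry in place.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def score_entries(query, entries):
    """Attach a 'cosine_similarity' key to each entry (in place) and return the list."""
    for entry in entries:
        entry["cosine_similarity"] = cosine(query, entry["semantic_vector"])
    return entries

# Made-up 2D vectors standing in for 768-dimensional BERT vectors.
entries = [
    {"file_name": "a.txt", "semantic_vector": [1.0, 0.0]},
    {"file_name": "b.txt", "semantic_vector": [0.0, 2.0]},
    {"file_name": "c.txt", "semantic_vector": [3.0, 3.0]},
]
scored = score_entries([1.0, 0.0], entries)
ranked = sorted(scored, key=lambda e: e["cosine_similarity"], reverse=True)
print([e["file_name"] for e in ranked])
```

Because the entries are modified in place, the list passed in and the list returned are the same object.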


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def compute_cosine_similarities(query_vector, train_data_semantic_vectors):
  # Flatten each stored vector to 1D so they can be stacked
  train_data_vectors = [item['semantic_vector'].flatten() for item in train_data_semantic_vectors]
  # Stack the semantic vectors into a single 2D tensor of shape (n_entries, hidden_size)
  semantic_vectors_tensor = torch.stack(train_data_vectors)
  # Ensure query_vector is 2D for cosine_similarity
  query_vector_2d = query_vector.cpu().detach().numpy().reshape(1, -1)

  # Convert semantic_vectors_tensor to numpy (already 2D after stacking)
  semantic_vectors_2d = semantic_vectors_tensor.cpu().detach().numpy()

  # Compute cosine similarities
  similarities = cosine_similarity(query_vector_2d, semantic_vectors_2d)

  # Assign similarities back to the original list (entries are updated in place)
  for i, similarity in enumerate(similarities[0]):
      train_data_semantic_vectors[i]['cosine_similarity'] = similarity.item()

  return train_data_semantic_vectors

# Keep the original (misspelled) name as an alias so existing calls still work
compute_cosine_similiraties = compute_cosine_similarities

# Role Encoding and Decoding

## Overview
The `roles_to_binary` and `binary_to_roles` functions facilitate the conversion between refined roles and binary vectors. These functions help encode refined roles into a binary format and decode them back into their respective roles, ensuring consistency with predefined main role mappings.

## Functions & Parameters

### `roles_to_binary(refined_roles, main_roles_mapping)`
- `refined_roles` (list): A list of refined roles associated with an entity.
- `main_roles_mapping` (dict): A dictionary where each key is a main role, and each value is a list of refined roles under that category.

### `binary_to_roles(binary_dict, main_roles_mapping)`
- `binary_dict` (dict): A dictionary containing a single main role and its associated binary vector.
- `main_roles_mapping` (dict): The mapping of main roles to their refined roles.

## Expected Output
- `roles_to_binary(refined_roles, main_roles_mapping)`: Returns a binary vector indicating whether each refined role under the identified main role is present (`Yes`) or absent (`No`).
- `binary_to_roles(binary_dict, main_roles_mapping)`: Returns a list of refined roles that correspond to the binary vector representation.
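
The encoding can be illustrated with a simplified round trip. In the sketch below, `ROLE_MAP` is a trimmed copy of the mapping, and `encode` and `decode` are hypothetical helpers. Unlike `roles_to_binary`, they take the main role explicitly instead of inferring it, and both sides use `Yes`/`No` flags.

```python
# Trimmed copy of main_roles_mapping, for illustration only.
ROLE_MAP = {
    "protagonist": ["Guardian", "Martyr", "Peacemaker", "Rebel", "Underdog", "Virtuous"],
    "innocent": ["Forgotten", "Exploited", "Victim", "Scapegoat"],
}

def encode(refined_roles, main_role, mapping=ROLE_MAP):
    """Return one 'Yes'/'No' flag per refined role of main_role, in mapping order."""
    return ["Yes" if role in refined_roles else "No" for role in mapping[main_role]]

def decode(flags, main_role, mapping=ROLE_MAP):
    """Return the refined roles of main_role whose flag is 'Yes'."""
    return [role for role, flag in zip(mapping[main_role], flags) if flag == "Yes"]

flags = encode(["Rebel", "Guardian"], "protagonist")
roles = decode(flags, "protagonist")
print(flags)
print(roles)
```

Decoding returns the roles in mapping order, which may differ from the order in which they were passed to the encoder.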


In [None]:
def roles_to_binary(refined_roles, main_roles_mapping):
    """
    Converts refined roles into a binary vector for a single main role.

    Args:
        refined_roles (list): Refined roles for the current test sample (e.g., ["Guardian", "Rebel"]).
        main_roles_mapping (dict): A dictionary where keys are main roles and values are their refined roles.

    Returns:
        list: One 'Yes'/'No' flag per refined role of the identified main role.

    Raises:
        ValueError: If the refined roles do not all belong to a single main role.
    """
    # Identify the main role that the refined roles belong to
    valid_main_role = None
    for main_role, refined_list in main_roles_mapping.items():
        if all(role in refined_list for role in refined_roles):
            valid_main_role = main_role
            break

    # Raise an error if the refined roles are not consistent with a single main role
    if valid_main_role is None:
        raise ValueError("Refined roles belong to multiple or invalid main roles.")

    # Build the binary vector for the identified main role
    binary_vector = ['Yes' if role in refined_roles else 'No' for role in main_roles_mapping[valid_main_role]]
    return binary_vector


def binary_to_roles(binary_dict, main_roles_mapping):
    """
    Converts a binary vector back into refined roles for a single main role.

    Args:
        binary_dict (dict): A dictionary with a single main role and its binary vector.
        main_roles_mapping (dict): A dictionary where keys are main roles and values are their refined roles.

    Returns:
        list: A list of refined roles for the given main role.
    """
    # Extract the main role and its binary vector
    if len(binary_dict) != 1:
        raise ValueError("Binary dictionary must have exactly one main role.")

    main_role = list(binary_dict.keys())[0]
    binary_vector = binary_dict[main_role]

    # Convert the binary vector into refined roles; accept both the 'Yes' flags
    # produced by roles_to_binary and numeric 1 flags
    refined_roles = [
        role for role, flag in zip(main_roles_mapping[main_role], binary_vector) if flag in (1, "Yes")
    ]
    return refined_roles


# Mapping from each main role to its refined roles
main_roles_mapping = {
    "protagonist": ["Guardian", "Martyr", "Peacemaker", "Rebel", "Underdog", "Virtuous"],
    "antagonist": [
        "Instigator", "Conspirator", "Tyrant", "Foreign Adversary", "Traitor",
        "Spy", "Saboteur", "Corrupt", "Incompetent", "Terrorist", "Deceiver", "Bigot"
    ],
    "innocent": ["Forgotten", "Exploited", "Victim", "Scapegoat"]
}


# Retrieve Top K Entries per Main Role

## Overview
The `get_top_k_per_main_role` function selects the top `k` entries from a dataset for each main role (`Protagonist`, `Antagonist`, and `Innocent`) based on cosine similarity. It ensures that each file name appears only once across roles.

## Parameters
- `train_data_with_vectors` (list of dicts): A list of dictionaries where each entry contains:
  - `file_name` (str): Identifier for the document.
  - `main_role` (str): The main role category (`Protagonist`, `Antagonist`, or `Innocent`).
  - `cosine_similarity` (float): The similarity score used for ranking.
  - `text` (str): The textual content associated with the document.
- `k` (int): The number of top entries to retrieve per main role.

## Expected Output
- Returns a list of dictionaries containing the top `k` entries for each main role.
- Each dictionary in the output includes:
  - `file_name`
  - `main_role`
  - `cosine_similarity`
  - `text`
- The returned list contains up to `k` elements per role (3 roles), so at most `3 * k` entries.
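
The selection logic can be sketched compactly as follows. This is a simplified variant under stated assumptions, not the notebook's function: `top_k_per_role` is a hypothetical helper, role names are compared case-insensitively, and when a file occurs several times within a role the best-scoring entry is kept.

```python
def top_k_per_role(entries, k, roles=("protagonist", "antagonist", "innocent")):
    """Pick up to k entries per role by similarity, using each file name only once."""
    used_files = set()
    selected = []
    for role in roles:
        candidates = sorted(
            (e for e in entries if e["main_role"].lower() == role),
            key=lambda e: e["cosine_similarity"],
            reverse=True,
        )
        picked = 0
        for entry in candidates:
            if picked == k:
                break
            if entry["file_name"] in used_files:
                continue  # file already used by this or an earlier role
            used_files.add(entry["file_name"])
            selected.append(entry)
            picked += 1
    return selected

# Made-up entries; a.txt appears under two roles.
entries = [
    {"file_name": "a.txt", "main_role": "Protagonist", "cosine_similarity": 0.9},
    {"file_name": "a.txt", "main_role": "Antagonist", "cosine_similarity": 0.95},
    {"file_name": "b.txt", "main_role": "Antagonist", "cosine_similarity": 0.8},
    {"file_name": "c.txt", "main_role": "Innocent", "cosine_similarity": 0.7},
    {"file_name": "d.txt", "main_role": "Protagonist", "cosine_similarity": 0.4},
]
top = top_k_per_role(entries, k=1)
print([(e["file_name"], e["main_role"]) for e in top])
```

Roles are processed in a fixed order, so a file claimed by an earlier role is unavailable to later ones even if it would score higher there.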


In [None]:
def get_top_k_per_main_role(train_data_with_vectors, k):
  file_names_list = []

  # TODO: remember to build the list [protagonist_data, antagonist_data, innocent_data] and then shuffle it
  # Main roles are compared case-insensitively (read_file_into_list_of_dicts lowercases them)
  protagonist_data = [item for item in train_data_with_vectors if str(item.get('main_role')).lower() == 'protagonist']

  # Keep a single entry per file name (the last one encountered)
  unique_protagonists = {item['file_name']: item for item in protagonist_data}.values()
  # Sort the data by 'cosine_similarity' in descending order and select the top k
  top_k_protagonists = sorted(unique_protagonists, key=lambda x: x['cosine_similarity'], reverse=True)[:k]
  file_names_list.extend(item['file_name'] for item in top_k_protagonists)

  # Skip files already selected for a previous role
  antagonist_data = [
    item for item in train_data_with_vectors
    if str(item.get('main_role')).lower() == 'antagonist' and item['file_name'] not in file_names_list
  ]
  unique_antagonists = {item['file_name']: item for item in antagonist_data}.values()
  # Sort the remaining data by 'cosine_similarity' in descending order and select the top k
  top_k_antagonists = sorted(unique_antagonists, key=lambda x: x['cosine_similarity'], reverse=True)[:k]
  # Append the new file names to the existing list
  file_names_list.extend(item['file_name'] for item in top_k_antagonists)

  innocent_data = [
    item for item in train_data_with_vectors
    if str(item.get('main_role')).lower() == 'innocent' and item['file_name'] not in file_names_list
  ]
  unique_innocents = {item['file_name']: item for item in innocent_data}.values()
  top_k_innocents = sorted(unique_innocents, key=lambda x: x['cosine_similarity'], reverse=True)[:k]

  # Concatenate the selections in role order
  top_k_list = []
  top_k_list.extend(top_k_protagonists)
  top_k_list.extend(top_k_antagonists)
  top_k_list.extend(top_k_innocents)
  return top_k_list

# Retrieve Top Examples for Each Refined Role

## Overview
The `get_refined_roles_examples` function extracts the top example for each refined role within a specified main role category (`protagonist`, `antagonist`, or `innocent`). It ranks the entries by cosine similarity and ensures that each file appears only once in the results.

## Parameters
- `train_data_with_vectors` (list of dicts): A dataset where each entry contains:
  - `file_name` (str): Identifier for the document.
  - `main_role` (str): The main role category.
  - `refined_roles` (list): A list of refined roles associated with the entry.
  - `cosine_similarity` (float): The similarity score used for ranking.
- `main_role` (str): The main role for which refined role examples should be retrieved. Must be one of:
  - `"protagonist"`
  - `"antagonist"`
  - `"innocent"`

## Expected Output
- Returns a list of dictionaries containing the top example for each refined role within the given main role.
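
The per-role selection can be sketched on a small made-up dataset. The helper below, `top_example_per_refined_role`, is hypothetical and omits the main-role filter: it walks the refined roles in order and, for each one, takes the most similar entry whose file has not been used yet.

```python
def top_example_per_refined_role(entries, refined_roles):
    """Return the best unused entry for each refined role, in role order."""
    ranked = sorted(entries, key=lambda e: e["cosine_similarity"], reverse=True)
    used_files = set()
    examples = []
    for role in refined_roles:
        for entry in ranked:
            if role in entry["refined_roles"] and entry["file_name"] not in used_files:
                examples.append(entry)
                used_files.add(entry["file_name"])
                break  # only the top entry per refined role is kept
    return examples

# Made-up entries for the "innocent" main role.
entries = [
    {"file_name": "a.txt", "refined_roles": ["Victim", "Exploited"], "cosine_similarity": 0.9},
    {"file_name": "b.txt", "refined_roles": ["Victim"], "cosine_similarity": 0.6},
    {"file_name": "c.txt", "refined_roles": ["Scapegoat"], "cosine_similarity": 0.3},
]
examples = top_example_per_refined_role(
    entries, ["Forgotten", "Exploited", "Victim", "Scapegoat"]
)
print([e["file_name"] for e in examples])
```

A refined role with no matching entry (here `Forgotten`) contributes nothing, so the result can be shorter than the list of refined roles.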


In [None]:
def get_refined_roles_examples(train_data_with_vectors, main_role):
  main_roles_mapping = {
    "protagonist": ["Guardian", "Martyr", "Peacemaker", "Rebel", "Underdog", "Virtuous"],
    "antagonist": [
        "Instigator", "Conspirator", "Tyrant", "Foreign Adversary", "Traitor",
        "Spy", "Saboteur", "Corrupt", "Incompetent", "Terrorist", "Deceiver", "Bigot"
    ],
    "innocent": ["Forgotten", "Exploited", "Victim", "Scapegoat"]
  }
  data = [item for item in train_data_with_vectors if item.get('main_role') == main_role]
  data = sorted(data, key=lambda x: x['cosine_similarity'], reverse=True)
  list_files_retrieved = []
  list_of_refined_roles = main_roles_mapping[main_role]


  # List storing the top element for each refined role
  refined_roles = []

  # Iterate through each refined role
  for refined_role in list_of_refined_roles:
      # Find the top element for the current refined role
      top_1_for_current_role = None
      for item in data:
          # Check if the item's refined_role matches the current refined_role
          # and if its file_name is not already in list_files_retrieved
          if refined_role in item.get('refined_roles', []) and item['file_name'] not in list_files_retrieved:
              top_1_for_current_role = item
              break

      # If a valid item was found, add it to the results
      if top_1_for_current_role:
          refined_roles.append(top_1_for_current_role)
          # Add the file_name to the retrieved list
          list_files_retrieved.append(top_1_for_current_role['file_name'])

  return refined_roles


# Role Classification Prompt Generator

## Overview
The `create_prompt` function generates structured prompts for role classification tasks based on given narratives. It supports two levels of classification:
1. **Main Role Classification** – Assigns an entity to one of three broad categories: **innocent, protagonist, or antagonist**.
2. **Refined Roles Classification** – Assigns an entity to more specific roles within the main category.

The function formats the input examples and test sample into a structured textual prompt.

## Parameters
- **`context`** *(str)*: A textual description providing background for the task.
- **`examples`** *(list of dicts)*: A list of example cases, each containing:
  - `text` *(str)*: The narrative where an entity appears.
  - `entity_name` *(str)*: The name of the entity being classified.
  - `main_role` *(str)*: The main role of the entity. It is shown in each example for main classification; for refined classification, the main role of the first example selects the list of refined roles.
  - `refined_roles` *(list of str, optional)*: A list of refined roles associated with the entity (only required for refined classification).
- **`test_sample`** *(dict)*: A dictionary representing the test case to classify, containing:
  - `text` *(str)*: The narrative for classification.
  - `entity_name` *(str)*: The entity to classify.
- **`type_role`** *(str)*: Defines the classification type:
  - `"main"`: Classifies the entity as **innocent, protagonist, or antagonist**.
  - `"refined"`: Classifies the entity into **specific roles** within the main category.

## Expected Output
The function returns a **formatted prompt** for classification, as a single string.
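
To make the layout of a main-role prompt concrete, the sketch below condenses the `"main"` case into a standalone helper and applies it to made-up inputs. `build_main_role_prompt`, the context sentence, and the narratives are assumptions for illustration. The sketch also places a blank line after the task instruction.

```python
def build_main_role_prompt(context, examples, test_sample):
    """Assemble a few-shot prompt that ends right before the role to be predicted."""
    parts = [context, "\n\n\nExample Section:\n"]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"### Example {i}\n")
        parts.append(f"**Narrative**: {ex['text']}\n\n")
        parts.append(f"**Main role**: entity {ex['entity_name']} is {ex['main_role']}\n\n\n")
    parts.append("End of Example Section\n\n\n")
    parts.append(
        "### Your Task\nNow choose just one **Main Role** between innocent, "
        "protagonist, antagonist for the entity framed between <T> and </T>.\n\n"
    )
    parts.append(f"**Narrative**: {test_sample['text']}\n\n")
    parts.append(f"**Main role**: entity {test_sample['entity_name']} is ")
    return "".join(parts)

# Made-up context, example, and test sample.
context = "Classify the role of the marked entity."
examples = [
    {"text": "The <T> firefighters </T> saved the town.",
     "entity_name": "firefighters", "main_role": "protagonist"},
]
test_sample = {"text": "The <T> minister </T> hid the report.", "entity_name": "minister"}
prompt = build_main_role_prompt(context, examples, test_sample)
print(prompt)
```

The prompt deliberately ends with `is `, so that the next token the model produces is the start of the role name.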

In [None]:
main_roles_mapping = {
    "protagonist": ["Guardian", "Martyr", "Peacemaker", "Rebel", "Underdog", "Virtuous"],
    "antagonist": [
        "Instigator", "Conspirator", "Tyrant", "Foreign Adversary", "Traitor",
        "Spy", "Saboteur", "Corrupt", "Incompetent", "Terrorist", "Deceiver", "Bigot"
    ],
    "innocent": ["Forgotten", "Exploited", "Victim", "Scapegoat"]
}

def create_prompt(context, examples, test_sample, type_role):
  prompt = context
  prompt +="\n\n\nExample Section:\n"

  if type_role == 'main':
    for i in range(len(examples)):
      prompt += f"""### Example {i+1}\n"""
      prompt+=f"""**Narrative**: {examples[i]['text']}\n\n"""
      prompt+=f"""**Main role**: entity {examples[i]['entity_name']} is {examples[i]['main_role']}\n\n\n"""

    prompt +=f"""End of Example Section\n\n\n"""

    prompt +=f"""### Your Task\nNow choose just one **Main Role** between innocent, protagonist, antagonist for the entity framed between <T> and </T>.\n\n"""

    prompt +=f"""**Narrative**: {test_sample['text']}\n\n"""
    prompt+= f"""**Main role**: entity {test_sample['entity_name']} is """


  elif type_role == 'refined':

    refined_roles = main_roles_mapping[examples[0]['main_role']]

    for i in range(len(examples)):
      binary_vector = roles_to_binary(examples[i]['refined_roles'], main_roles_mapping)

      binary_string = ''

      for j in range(len(refined_roles)):
        binary_string+=f"""{refined_roles[j]}: {binary_vector[j]}\n"""

      prompt += f"""### Example {i+1}:\n**Narrative**:\n{examples[i]['text']}"""

      prompt += f"""\n\n**Entity**: {examples[i]['entity_name']}"""

      prompt += f"""\n**Refined roles**:\n{binary_string}"""

      prompt += "\n\n\n"

    prompt +="\nEnd of Example Section\n\n"

    prompt +="### Your Task\n"

    prompt+= '''Now for each of the **Refined Roles**, answer "Yes" if the entity framed plays the Refined Role or answer "No" if the entity does not play the Refined Role in the following Narrative.\n\n'''

    prompt += f"""**Narrative**:\n{test_sample['text']}"""

    prompt += f"""\n\n**Entity**: {test_sample['entity_name']}"""

    prompt += "\n\n**Refined roles**:"



  return prompt

# Retrieve Refined Roles Descriptions

## Overview
The `get_refined_roles_descriptions` function retrieves textual descriptions of refined roles associated with a specified **main role**. It reads the content from a provided file path and returns the file's content as a string. This function is useful for dynamically loading role definitions from an external source.

## Parameters
- **`main_role`** *(str)*: The main role category for which refined role descriptions are needed. This parameter is currently unused within the function.
- **`file_path`** *(str)*: The path to the file containing refined role descriptions.

## Expected Output
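- If the file exists, the function returns its **entire content** as a string.
- If the file is missing or cannot be read, the function returns an error message string instead of raising an exception.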


In [None]:
def get_refined_roles_descriptions(main_role, file_path):
  try:
      with open(file_path, 'r', encoding='utf-8') as file:
          content = file.read()  # Read the file content as raw text
      return content

  except FileNotFoundError:
      return "Error: the file was not found."

  except Exception as e:
      return f"Error: {e}"

# Visualization of the main roles tokens

In [None]:
print("protagonist tokens:\n")
for el in tokenizer.encode("protagonist"):
  print('\t', tokenizer.decode(el) , ' --> ', el)

print("antagonist tokens:\n")
for el in tokenizer.encode("antagonist"):
  print('\t', tokenizer.decode(el) , ' --> ', el)

print("innocent tokens:\n")
for el in tokenizer.encode("innocent"):
  print('\t', tokenizer.decode(el) , ' --> ', el)

# Main Role Prediction

## Overview
The `get_main_role_prediction` function predicts the **main role** of an entity in a given text prompt. It classifies the entity as one of three roles:
- **Protagonist**
- **Antagonist**
- **Innocent**

This function runs the model on the prompt under `torch.inference_mode()` on CUDA, compares the next-token logits of the first token of each role name, and selects the most likely role.

## Parameters
- **`prompt`** *(str)*: The input text containing context, persona, examples and the entity whose main role needs to be classified.

## Expected Output
- Returns one of the following role classifications as a **string**:
  - `"Protagonist"`
  - `"Antagonist"`
  - `"Innocent"`

## Note
- Only the first token of each **lowercase** role name is compared. Lowercase was chosen to increase the probability of the Innocent label: in the LLaMA tokenizer, "Innocent" has "In" as its first token, while "innocent" starts with "inn".
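
The decision rule amounts to a softmax restricted to three candidate logits, followed by an argmax. The sketch below shows it in plain Python with made-up logit values. `restricted_softmax` is a hypothetical helper, and real logits would come from the model's output at the last position.

```python
import math

def restricted_softmax(logits):
    """Softmax over a small dict of candidate logits (numerically stabilized)."""
    m = max(logits.values())
    exps = {name: math.exp(value - m) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

# Made-up next-token logits for the first token of each role name.
logits = {"protagonist": 2.0, "antagonist": 0.5, "innocent": 1.0}
probs = restricted_softmax(logits)
prediction = max(probs, key=probs.get)
print(prediction)
```

Since softmax is monotonic, the argmax over the probabilities equals the argmax over the raw logits. The probabilities are mainly useful for inspecting how confident the choice is.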

In [None]:
import torch
import torch.nn.functional as F

def get_main_role_prediction(prompt):
    """
    Predicts the main role (Protagonist, Antagonist, or Innocent) in a memory-efficient way.
    """
    first_id_protagonist = tokenizer.encode("protagonist")[1]
    first_id_antagonist = tokenizer.encode("antagonist")[1]
    first_id_innocent = tokenizer.encode("innocent")[1]
    # Tokenize prompt and move to CUDA
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to("cuda")
    input_ids = inputs["input_ids"]
    eos_token_id = tokenizer.eos_token_id  # End-of-sequence token

    # Parameters for the generation loop
    max_length = 500
    i = 0

    mainrole = ""

    model.eval()
    with torch.inference_mode():  # More efficient than torch.no_grad()
        while i < max_length and (eos_token_id is None or input_ids[0, -1] != eos_token_id):
            output = model(input_ids=input_ids)

            # Select logits for the first token of each role name (already on CUDA)
            next_token_pro = output.logits[:, -1, first_id_protagonist]  # 4490
            next_token_ant = output.logits[:, -1, first_id_antagonist]  # 519
            next_token_inn = output.logits[:, -1, first_id_innocent]  # 6258

            # Apply temperature scaling (1.0 leaves the logits unchanged)
            temperature = 1.0
            next_token_pro = next_token_pro / temperature
            next_token_ant = next_token_ant / temperature
            next_token_inn = next_token_inn / temperature

            # Stack logits & apply softmax over the three candidates only
            selected_logits = torch.stack([next_token_pro, next_token_ant, next_token_inn], dim=-1)
            probabilities = F.softmax(selected_logits, dim=-1)
            print(probabilities)
            argmax_indices = torch.argmax(probabilities, dim=-1)

            # Assign Main Role Based on Predicted Token
            if argmax_indices.item() == 0:
                mainrole = "Protagonist"
                break
            elif argmax_indices.item() == 1:
                mainrole = "Antagonist"
                break
            elif argmax_indices.item() == 2:
                mainrole = "Innocent"
                break

            # Free up memory efficiently
            del output
            input_ids = input_ids.to("cpu")
            torch.cuda.empty_cache()

            # Prevent infinite loop
            i += 1

    return mainrole


# Refined Role Prediction

## Overview
The `get_refined_roles_predictions` function predicts **refined roles** for an entity in a given text prompt. It determines whether an entity plays each refined role by generating a **Yes/No** classification. This function optimizes memory usage with PyTorch and CUDA for efficient processing.

## Parameters
- **`prompt`** *(str)*: The input text containing the entity whose refined roles need to be classified.
- **`refined_roles`** *(list of str)*: A list of refined role names that the model will evaluate.

## Expected Output
- Returns a **list of refined roles** *(list of str)* where the entity is classified as playing the role.
- If no refined role is assigned based on the probability threshold, the function assigns the role with the highest probability.
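
The selection rule above, including the fallback, can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's implementation: `select_refined_roles` is a hypothetical helper, the logits are invented, and a role is accepted when its "Yes" probability is at least 0.5.

```python
import math

def select_refined_roles(yes_no_logits):
    """yes_no_logits maps each refined role to a (yes_logit, no_logit) pair."""
    selected = []
    best_role, best_p_yes = None, -1.0
    for role, (yes_logit, no_logit) in yes_no_logits.items():
        # A two-way softmax reduces to a sigmoid of the logit difference.
        p_yes = 1.0 / (1.0 + math.exp(no_logit - yes_logit))
        if p_yes > best_p_yes:
            best_role, best_p_yes = role, p_yes
        if p_yes >= 0.5:
            selected.append(role)
    # Fallback: if every role was rejected, keep the most probable one.
    if not selected and best_role is not None:
        selected.append(best_role)
    return selected

# Invented logits, for illustration only.
accepted = select_refined_roles(
    {"Guardian": (2.0, 1.0), "Martyr": (0.0, 3.0), "Rebel": (1.5, 1.0)}
)
fallback = select_refined_roles({"Victim": (0.0, 1.0), "Scapegoat": (-1.0, 2.0)})
print(accepted, fallback)
```

The fallback guarantees a non-empty result, which matters when every entity is expected to carry at least one refined role.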


In [None]:
def get_refined_roles_predictions(prompt, refined_roles):
    """
    Generates refined role predictions (Yes/No) in a memory-efficient way.
    """
    # Tokenize prompt and move to CUDA
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to("cuda")
    eos_token_id = tokenizer.eos_token_id  # End-of-sequence token

    # Precompute token IDs and move to CUDA (avoid repeated calls)
    Yes_id = torch.tensor(tokenizer.encode('Yes')[1:]).to('cuda').unsqueeze(0)
    No_id = torch.tensor(tokenizer.encode('No')[1:]).to('cuda').unsqueeze(0)
    space_id = torch.tensor(tokenizer.encode(' ')[1:]).to('cuda').unsqueeze(0)
    points_id = torch.tensor(tokenizer.encode(':')[1:]).to('cuda').unsqueeze(0)
    new_line_id = torch.tensor(tokenizer.encode('\n')[1:]).to('cuda').unsqueeze(0)

    predicted_refined_roles = []

    torch.cuda.empty_cache()  # Free unused GPU memory before starting

    input_ids = inputs["input_ids"]
    max_prob=0

    model.eval()
    with torch.inference_mode():  # More efficient than torch.no_grad()
        for refined_role in refined_roles:

            # Append "\n<refined role>: " to the running input
            refined_role_ids = torch.tensor(tokenizer.encode(refined_role)[1:]).to('cuda').unsqueeze(0)
            concat_input_ids = torch.cat((input_ids, new_line_id, refined_role_ids, points_id, space_id), dim=-1)

            # Run model inference on the prompt that includes the refined role
            output = model(input_ids=concat_input_ids)

            # Get logits for Yes/No (hard-coded token IDs; they must match the tokenizer in use)
            next_token_Yes = output.logits[:, -1, 9642]
            next_token_No = output.logits[:, -1, 2822]

            # Stack and apply softmax
            selected_logits = torch.stack([next_token_Yes, next_token_No], dim=-1)
            probabilities = F.softmax(selected_logits, dim=-1)
            argmax_indices = torch.argmax(probabilities, dim=-1)
            print(probabilities)
            # Track the role with the highest "Yes" probability as a fallback
            if probabilities[0][0] > max_prob:
                max_prob = probabilities[0][0]
                max_prob_role = refined_role

            # Append prediction based on highest probability
            if argmax_indices.item() == 0:
                predicted_refined_roles.append(refined_role)

            del input_ids
            input_ids = concat_input_ids
            # Free up memory
            del output, concat_input_ids, refined_role_ids
            torch.cuda.empty_cache()

    if len(predicted_refined_roles)==0:
      predicted_refined_roles.append(max_prob_role)
    return predicted_refined_roles


# Main and Refined Role Classification System

## Overview
This script performs a **two-stage classification** of entities in a narrative. It first predicts the **main role** of an entity as either:
- **Protagonist**
- **Antagonist**
- **Innocent**

After determining the main role, the script further assigns **refined roles** specific to the chosen main category. The classification is based on **semantic vector similarity** and **LLM-based inference**.

The script evaluates classification performance by comparing predictions with actual labels from a validation dataset.

---

## Process Workflow

### **1. Load Training and Validation Data**
- The script reads the training (`semantic_vectors_train.json`) and validation (`semantic_vectors_development.json`) datasets.
- These datasets contain **semantic vectors** representing entities and their roles.

### **2. Compute Cosine Similarity for Example Selection**
- For each validation entity, the script computes the **cosine similarity** between its semantic vector and all training vectors.
- It retrieves the **top `k=2` most similar examples** per main role.
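
The retrieval step above can be sketched in plain Python. This is an illustration only: `cosine_similarity` and `top_k_per_role` are hypothetical stand-ins for the notebook's `compute_cosine_similiraties` and `get_top_k_per_main_role`, the `entity` field is assumed, and the two-dimensional vectors are toy data.

```python
import math
from collections import defaultdict

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def top_k_per_role(query, train_items, k=2):
    """Return the k training items most similar to `query` for each main role."""
    by_role = defaultdict(list)
    for item in train_items:
        score = cosine_similarity(query, item["semantic_vector"])
        by_role[item["main_role"]].append((score, item))
    return {
        role: [item for _, item in
               sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]]
        for role, scored in by_role.items()
    }

# Toy training set, for illustration only.
train = [
    {"entity": "A", "main_role": "protagonist", "semantic_vector": [1.0, 0.0]},
    {"entity": "B", "main_role": "protagonist", "semantic_vector": [0.0, 1.0]},
    {"entity": "C", "main_role": "protagonist", "semantic_vector": [0.9, 0.1]},
    {"entity": "D", "main_role": "antagonist", "semantic_vector": [-1.0, 0.0]},
]
examples = top_k_per_role([1.0, 0.0], train, k=2)
print({role: [item["entity"] for item in items] for role, items in examples.items()})
```

Selecting examples per role, rather than globally, keeps every main role represented in the prompt even when the nearest neighbours all share one label.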

### **3. Generate Main Role Classification Prompt**
- A structured prompt is created to classify the entity as **Protagonist, Antagonist, or Innocent**.
- The prompt includes:
  - A **persona** assignment.
  - A **task description** explaining how to classify entities.
  - **Example cases** retrieved based on cosine similarity.
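
As a rough illustration of how these three parts could be combined, the sketch below assembles a few-shot prompt from a context string, retrieved examples and the target text. Everything here is hypothetical: `build_prompt`, the `text` and `label` fields, and the `Narrative:` / `Main Role:` layout are assumptions made for the example and do not describe the notebook's `create_prompt`.

```python
def build_prompt(context, examples, target_text, answer_label="Main Role"):
    """Join context, worked examples and the unanswered target into one prompt."""
    parts = [context.strip(), "", "Examples:"]
    for example in examples:
        parts.append(f"Narrative: {example['text']}")
        parts.append(f"{answer_label}: {example['label']}")
        parts.append("")
    # The target comes last and leaves the answer slot open for the model.
    parts.append(f"Narrative: {target_text}")
    parts.append(f"{answer_label}:")
    return "\n".join(parts)

# Invented context and examples, for illustration only.
demo_prompt = build_prompt(
    "You are an expert in classification of narrative entities.",
    [
        {"text": "<T>The mayor</T> rebuilt the school.", "label": "protagonist"},
        {"text": "<T>The gang</T> burned the bridge.", "label": "antagonist"},
    ],
    "<T>The villagers</T> lost their homes.",
)
print(demo_prompt)
```

Ending the prompt on the open answer slot is what lets the next-token logits be read directly as a classification.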

### **4. Predict Main Role Using LLM**
- The script uses `get_main_role_prediction(prompt)` to determine the entity’s **main role**.
- If the prediction matches the actual label, the **correctly_predicted** counter is incremented.

### **5. Retrieve and Describe Refined Roles**
- Based on the predicted **main role**, the script retrieves the corresponding **refined roles**:
  - **Protagonist** → Guardian, Martyr, Peacemaker, Rebel, Underdog, Virtuous
  - **Antagonist** → Instigator, Conspirator, Tyrant, Foreign Adversary, Traitor, Spy, Saboteur, Corrupt, Incompetent, Terrorist, Deceiver, Bigot
  - **Innocent** → Forgotten, Exploited, Victim, Scapegoat
- It loads descriptions of these refined roles from external text files.

### **6. Generate Refined Role Classification Prompt**
- A second structured prompt is generated for **refined role classification**.
- The model is tasked with answering **Yes/No** for each refined role.

### **7. Predict Refined Roles Using LLM**
- The script uses `get_refined_roles_predictions(prompt, refined_roles)` to determine which refined roles apply.
- If the predicted refined roles exactly match the expected roles, the **exactmatch** counter is incremented.
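
The two counters translate into two metrics: main-role accuracy and refined-role exact match. A minimal sketch follows, assuming predictions and gold labels are lists of dicts with `main_role` and `refined_roles` keys (`evaluate` is a hypothetical helper). Unlike a direct list comparison, it compares refined roles as sets, so the order of the predicted roles does not affect the exact-match score; this is a design choice of the sketch.

```python
def evaluate(predictions, gold):
    """Return (main-role accuracy, refined-role exact-match ratio)."""
    total = len(gold)
    if total == 0:
        raise ValueError("gold must contain at least one sample")
    main_correct = 0
    exact_match = 0
    for pred, ref in zip(predictions, gold):
        # Case-insensitive comparison of the main role.
        if pred["main_role"].strip().lower() == ref["main_role"].strip().lower():
            main_correct += 1
        # Order-insensitive comparison of the refined roles.
        if set(pred["refined_roles"]) == set(ref["refined_roles"]):
            exact_match += 1
    return main_correct / total, exact_match / total

# Invented samples, for illustration only.
gold = [
    {"main_role": "antagonist", "refined_roles": ["Tyrant", "Deceiver"]},
    {"main_role": "innocent", "refined_roles": ["Victim"]},
]
preds = [
    {"main_role": "Antagonist", "refined_roles": ["Deceiver", "Tyrant"]},
    {"main_role": "Protagonist", "refined_roles": ["Guardian"]},
]
accuracy, exact = evaluate(preds, gold)
print(accuracy, exact)
```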

---

## Functions & data structures used
- **`main_roles_mapping`** *(dict)*: Defines the available refined roles for each main role category.
- **`train_file_path`** *(str)*: Path to the training dataset.
- **`val_file_path`** *(str)*: Path to the validation (development) dataset.
- **`train_data_with_vectors`** *(list of dicts)*: Training dataset containing entity roles and semantic vectors.
- **`val_data_with_vectors`** *(list of dicts)*: Validation dataset used for evaluation, containing entity roles and semantic vectors.
- **`compute_cosine_similiraties`** *(function)*: Computes the similarity between an entity's semantic vector and training data.
- **`get_top_k_per_main_role`** *(function)*: Retrieves the top `k=2` most similar training examples for each main role.
- **`create_prompt`** *(function)*: Generates a structured classification prompt.
- **`get_main_role_prediction`** *(function)*: Predicts the **main role** based on the prompt.
- **`get_refined_roles_descriptions`** *(function)*: Loads textual descriptions of refined roles.
- **`get_refined_roles_examples`** *(function)*: Retrieves example cases for refined role classification.
- **`get_refined_roles_predictions`** *(function)*: Predicts the **refined roles** for the given entity.



In [None]:
main_roles_mapping = {
    "protagonist": ["Guardian", "Martyr", "Peacemaker", "Rebel", "Underdog", "Virtuous"],
    "antagonist": [
        "Instigator", "Conspirator", "Tyrant", "Foreign Adversary", "Traitor",
        "Spy", "Saboteur", "Corrupt", "Incompetent", "Terrorist", "Deceiver", "Bigot"
    ],
    "innocent": ["Forgotten", "Exploited", "Victim", "Scapegoat"]
}


train_file_path = ...
val_file_path = ...
train_data_with_vectors = read_file_into_list_of_dicts(train_file_path)
val_data_with_vectors = read_file_into_list_of_dicts(val_file_path)

correctly_predicted = 0.0

val_data_semantic_vectors = [item['semantic_vector'] for item in val_data_with_vectors]

i = 0
exactmatch = 0.0
for item in val_data_with_vectors:
  train_data_with_vectors = compute_cosine_similiraties(item['semantic_vector'], train_data_with_vectors)

  main_role_examples = get_top_k_per_main_role(train_data_with_vectors, k=2) # used for main_role inference

  context_main_role = f"""<<SYS>>You are an expert in classification of narrative entities. Your task is to classify the entity framed between "<T>" and "<\T>". You can choose just one label between innocent, protagonist, antagonist.
  Assign the **Main Role** label based on the following criteria:

  - protagonist: The central entity in the narrative, typically depicted as the main driver of events, actions, or decisions. This role is often associated with individuals, organizations, or groups that initiate key actions or are the primary focus of the story.

  - antagonist: The entity that opposes, challenges, or creates obstacles for the protagonist or other actors in the narrative. The antagonist may act directly or indirectly, and can include individuals, organizations, groups, or abstract forces. This role is often linked to conflict or controversy within the story.

  - innocent: An entity that is affected by the events of the narrative without playing an active role in driving them. The innocent may be a victim or a passive participant whose involvement is incidental rather than intentional. This role is typically associated with entities that experience consequences rather than cause them.

  In the Example section you have some examples<</SYS>>"""

  type_role = 'main'
  prompt = create_prompt(context_main_role, main_role_examples, item, type_role)
  ''' if i == 0:
    print(prompt)
    break '''

  main_role_predicted = get_main_role_prediction(prompt)

  if main_role_predicted.lower() == item['main_role'].strip():
    correctly_predicted += 1.0
  print(f"""Predicted: {main_role_predicted.strip()}\nActual main role: {item['main_role'].strip()}\n""")


  refined_roles = main_roles_mapping[main_role_predicted.strip().lower()]

  role=''
  if main_role_predicted == 'Protagonist':
    role='protagonist.txt'

  elif main_role_predicted == 'Antagonist':
    role='antagonist.txt'

  elif main_role_predicted == 'Innocent':
    role='innocent.txt'

  descriptions_file_path = ...

  refined_roles_descriptions = get_refined_roles_descriptions(main_role_predicted, descriptions_file_path)

  context_refined_roles = f"""<<SYS>>You are an expert in classification of narrative entities. Your task is to understand which **Refined Roles** the entity framed between "<T>" and "<\T>" plays in a narrative.
  The **Refined Roles** are: {refined_roles}.
  **Refine Roles** are described as follow: {refined_roles_descriptions}\nFor each of the **Refined Roles** previously described, answer "Yes" if the entity framed plays the role in the **Narrative** or answer "No" if the entity does not play the role. Insert the prediction after "**Refined Roles**:". In the Example section you have some examples<</SYS>>"""

  refined_roles_examples = get_refined_roles_examples(train_data_with_vectors, main_role_predicted.lower())


  type_role = 'refined'
  prompt = create_prompt(context_refined_roles, refined_roles_examples, item, type_role)
  refined_roles_predicted = get_refined_roles_predictions(prompt, refined_roles)

  print("Predicted refined roles: ", refined_roles_predicted)
  print("Expected refined roles: ", item['refined_roles'], '\n\n')
  if item['refined_roles'] == refined_roles_predicted:
    exactmatch += 1


print(f"""Correctly predicted:  {correctly_predicted}""")
print(f"""Total samples:  {len(val_data_with_vectors)}""")

print(f"""Main role predictions accuracy: {correctly_predicted/len(val_data_with_vectors)}""")
print(f"""Exact match: {exactmatch/len(val_data_with_vectors)}""")


In [None]:
def save_to_txt(data, file_path):
    with open(file_path, 'w', encoding='utf-8') as file:
        for entry in data:
            # Extract details
            file_name = entry['file_name']
            # Join labels with a single tab, preserving multi-word labels
            textual_labels = '\t'.join(entry['textual_labels'])
            semantic_vector = ','.join(map(str, entry['semantic_vector'].tolist()))

            # Format the line
            line = f"{file_name}\t{textual_labels}\t{semantic_vector}\n"
            file.write(line)

In [None]:
folder_path = ...
save_path = ...
semantic_similarity_train_dataset = preprocess_dataset(train_data, folder_path)
save_to_txt(semantic_similarity_train_dataset, save_path)