# 📂 **Project: Automatic Argumentation Analysis with LLM**

In this notebook, we explore an advanced approach to **argument mining** by leveraging a **Large Language Model (LLM)**. Unlike the other implementation, which relied on a naive argument matcher, we now use a state-of-the-art model to enhance the quality and accuracy of our argumentation analysis.

This approach integrates **transformers** and **pre-trained models** from the [Hugging Face Transformers library](https://huggingface.co). The provided code demonstrates how to fine-tune a **RoBERTa-based model** on a **Named Entity Recognition (NER)**-style token classification task that identifies argument components such as major claims, claims, and premises.

This project showcases a more sophisticated LLM-based solution, offering insights into the integration of **transformers**, **datasets**, and **tokenizers** within a machine learning pipeline. It serves as a stepping stone for exploring more robust and scalable solutions for **argumentation analysis**.

Authors: **Nassim Lattab** and **Mohamed Azzaoui**.

---


### Install Packages and Load Environment Variables

As explained in the notebook, you need to install all the required packages listed in `requirements.txt`. If you have already completed this step, there is no need to run the cell below again.





In [None]:
# Install all the necessary packages
%pip install -r requirements.txt


This cell loads environment variables and retrieves the Hugging Face API token for authentication.

In [6]:
# Import necessary libraries
import os
from dotenv import load_dotenv

# Load the .env file from the current directory
load_dotenv()

# Retrieve the Hugging Face token from environment variables
mytoken = os.environ.get('HUGGINGFACEHUB_API_TOKEN')

# Check if the token was successfully loaded
if mytoken:
    print("Token loaded successfully.")
else:
    print("Error: No token found. Make sure it is defined in the .env file.")

Token loaded successfully.


### Check CUDA Availability

This cell checks if CUDA (GPU support) is available on the system and prints the name of the GPU if available.


In [7]:
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

PyTorch version: 2.5.1+cu118
CUDA available: True
GPU: NVIDIA GeForce RTX 3070


### Preprocessing data
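This cell preprocesses the data by extracting, for each paragraph, its tokens and the index of each token's tag. Long paragraphs are not split here: over-length sequences are skipped when the training dataset is built and are chunked at prediction time.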





In [9]:
def build_data(data, tag2idx):
    """
    Prepares the data by extracting tokens and their associated tags.

    Args:
        data (list): List of documents with token dictionaries.
        tag2idx (dict): Mapping of tags to their corresponding indices.

    Returns:
        tuple: Two lists, one with tokens and the other with tags.
    """
    parags = []
    tags = []

    for doc in data:
        for parag in doc['tokens']:
            # Collect the token strings and the tag indices of the paragraph
            parag_tokens = [token['str'] for token in parag]
            parag_tags = [tag2idx.get(token['arg']) for token in parag]
            
            parags.append(parag_tokens)
            tags.append(parag_tags)
            
    return parags, tags
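
As an illustration, the sketch below runs the same extraction on a hand-made document. The document content and the reduced `toy_tag2idx` mapping are invented for the example; only the `tokens`, `str`, and `arg` field names come from the function above. It also shows a side effect worth knowing about: a tag that is missing from the mapping becomes `None`, because `dict.get` is called without a default.

```python
# Toy input mimicking the structure build_data expects:
# a list of documents, each holding a "tokens" list of paragraphs.
toy_data = [
    {
        "tokens": [
            [
                {"str": "Cats", "arg": "B-Claim"},
                {"str": "are", "arg": "I-Claim"},
                {"str": "great", "arg": "I-Claim"},
                {"str": ".", "arg": "O"},
            ],
            [
                {"str": "Really", "arg": "B-Unknown"},  # tag absent from the mapping
            ],
        ]
    }
]
toy_tag2idx = {"O": 0, "B-Claim": 1, "I-Claim": 2}  # hypothetical reduced tag set


def extract(data, tag2idx):
    """Same logic as build_data, repeated so the example is self-contained."""
    parags, tags = [], []
    for doc in data:
        for parag in doc["tokens"]:
            parags.append([token["str"] for token in parag])
            tags.append([tag2idx.get(token["arg"]) for token in parag])
    return parags, tags


toy_parags, toy_tags = extract(toy_data, toy_tag2idx)
print(toy_parags)  # [['Cats', 'are', 'great', '.'], ['Really']]
print(toy_tags)    # [[1, 2, 2, 0], [None]]
```

A `None` tag would fail later when the tags are turned into a tensor, so it may be worth checking the data for unexpected tag names before training.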

### Dataset and Data Preparation
This cell defines utility functions and a custom PyTorch `Dataset` class for handling tokenized data and aligning tags. It also includes padding logic for batching during training.


In [31]:
from torch.utils.data import Dataset

def align_tags_with_tokens(subtokens, tag_seq):
    """
    Aligns word-level tags with their corresponding subword tokens.
    Args:
        subtokens (list): List of subword tokens.
        tag_seq (list): List of tags corresponding to the original words.
    Returns:
        list: List of tags replicated to match subwords.
    """
    tag_id = 0
    exp_tags = []

    for subtoken in subtokens:
        # Increment tag index for new words (indicated by subword prefix 'Ġ')
        if subtoken.startswith('Ġ'):
            tag_id += 1
        exp_tags.append(tag_seq[tag_id])

    return exp_tags


class NerDataSet(Dataset):
    """
    Custom PyTorch Dataset for NER tasks.
    Wraps token and tag sequences, tokenizes inputs, and handles padding.
    """

    def __init__(self, tokens, tags, tokenizer, pad_tag_id, max_length=512):
        self.xsentences = []
        self.ytags = []

        # Tokenize sentences and align tags
        for sentence, tag_seq in zip(tokens, tags):
            subwords = tokenizer.tokenize(' '.join(sentence))
            exp_tags = align_tags_with_tokens(subwords, tag_seq)
            if(len(subwords) > max_length):
                continue
            self.xsentences.append(tokenizer.convert_tokens_to_ids(subwords))
            self.ytags.append(exp_tags)

        self.pad_tag_id = pad_tag_id
        self.pad_tok_id = tokenizer.pad_token_id
        
    def __len__(self):
        return len(self.xsentences)

    def __getitem__(self, idx):
        return self.xsentences[idx], self.ytags[idx]

    def pad_batch(self, itemlist):
        """
        Pads sequences in a batch to the maximum sequence length.
        Args:
            itemlist (list): List of (tokenized sentence, tag sequence) tuples.
        Returns:
            tuple: Padded token and tag tensors.
        """
        xbatch = [x for x, _ in itemlist]
        ybatch = [y for _, y in itemlist]
        maxseq = max(len(x) for x in xbatch)

        # Pad token and tag sequences to match max length
        xbatch = torch.LongTensor([x + [self.pad_tok_id] * (maxseq - len(x)) for x in xbatch])
        ybatch = torch.LongTensor([y + [self.pad_tag_id] * (maxseq - len(y)) for y in ybatch])

        return xbatch, ybatch
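
To make the alignment and padding steps concrete, here is a minimal pure-Python sketch of both. The subword list is written by hand (it is an assumption, not real tokenizer output) and follows the RoBERTa convention in which `Ġ` marks a subword that begins a new word. The helper names `align` and `pad` are illustrative, and plain lists stand in for the tensors used in the class above.

```python
def align(subtokens, tag_seq):
    """Replicate each word-level tag over the subwords of that word."""
    tag_id, expanded = 0, []
    for subtoken in subtokens:
        if subtoken.startswith("Ġ"):  # "Ġ" marks a subword that begins a new word
            tag_id += 1
        expanded.append(tag_seq[tag_id])
    return expanded


def pad(seqs, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    maxseq = max(len(s) for s in seqs)
    return [s + [pad_id] * (maxseq - len(s)) for s in seqs]


# Hand-written subwords: "banned" is assumed to split into two pieces
subtokens = ["Smoking", "Ġshould", "Ġbe", "Ġban", "ned"]
word_tags = [1, 4, 4, 4]  # B-Claim, I-Claim, I-Claim, I-Claim

expanded = align(subtokens, word_tags)
print(expanded)  # [1, 4, 4, 4, 4]

PAD_TAG_ID = 7  # len(tag2idx), the padding tag id used by the pipelines in this notebook
batch_tags = pad([expanded, [0, 0]], PAD_TAG_ID)
print(batch_tags)  # [[1, 4, 4, 4, 4], [0, 0, 7, 7, 7]]
```

The alignment relies on the first subword carrying no `Ġ` prefix and on every later word starting with one; input that breaks this assumption would shift the tags.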


### RoBERTa NER Tagger Model
This cell defines the `RobertaTagger` class for fine-tuning a pre-trained RoBERTa model for Named Entity Recognition (NER). It includes training, validation, and prediction functionalities.


In [28]:
import torch.nn as nn
from transformers import RobertaModel
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau  # Scheduler

class RobertaTagger(nn.Module):
    """
    A NER tagger that fine-tunes the RoBERTa model.
    """
    def __init__(self, output_vocab_size, output_pad_id):
        super().__init__()
        self.output_pad_id = output_pad_id
        self.output_vocab_size = output_vocab_size

    def allocate_params(self, device):
        # Load the pre-trained encoder and add a linear tagging head on top
        self.llm = RobertaModel.from_pretrained("FacebookAI/roberta-base", device_map=device)
        self.classifier = nn.Linear(self.llm.config.hidden_size, self.output_vocab_size, device=device)

    def forward(self, X, attention_mask=None):
        # attention_mask is optional; when omitted, every position is attended to
        representations = self.llm(X, attention_mask=attention_mask)
        tag_logits = self.classifier(representations.last_hidden_state)
        return tag_logits


    def __call__(self, X, attention_mask=None, device=None):
        """
        An interface for prediction
        Args:
          X   (tensor) : tokens ids in batches
          attention_mask (tensor, optional): attention mask (batch_size, seq_len).
          device (str) : a compute device
        Returns:
          Y (tensor)   : predicted tags ids in batches with same shape as X
        """
        X = X.to(device)
        if attention_mask is not None:
            attention_mask = attention_mask.to(device)

        # Forward the mask so it is no longer silently ignored
        logits = self.forward(X, attention_mask)
        Y = torch.argmax(logits, dim=-1)

        return Y


    def fine_tune(self, train_loader, val_loader, device, lr=0.00005, epochs=2, patience=3):
        """
        Fine-tune the model on training data with early stopping and a learning rate scheduler.
        """
        self.allocate_params(device)
        
        cross_entropy = nn.CrossEntropyLoss(ignore_index=self.output_pad_id)
        optimizer     = optim.AdamW(self.parameters(), lr=lr)
        
        # Scheduler to reduce learning rate when validation loss plateaus
        scheduler = ReduceLROnPlateau(optimizer, 'min', patience=1, factor=0.1, verbose=True)

        best_val_loss     = float('inf')  # To track best validation loss
        epochs_no_improve = 0  # To count the number of epochs with no improvement

        for epoch in range(epochs):
            self.train()  # Set model to training mode
            train_loss = 0

            # Training step
            for xbatch, ybatch in train_loader:
                xbatch = xbatch.to(device)
                ybatch = ybatch.to(device)

                tag_logits = self.forward(xbatch)
                batch, seq = ybatch.shape
                loss       = cross_entropy(tag_logits.reshape(batch * seq, -1), ybatch.reshape(batch * seq))

                # Backward pass
                loss.backward()
                train_loss += loss.item()

                # Update and reset gradients
                optimizer.step()
                optimizer.zero_grad()

            # Average training loss for the epoch
            train_loss /= len(train_loader)

            # Validation step
            val_loss = self.validate(val_loader, device, cross_entropy)
            scheduler.step(val_loss)  # Adjust learning rate based on validation loss

            print(f"Epoch {epoch+1}/{epochs}, Training Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}")

            # Early Stopping check
            if val_loss < best_val_loss:
                best_val_loss     = val_loss
                epochs_no_improve = 0  # Reset counter

            else:
                epochs_no_improve += 1

            if epochs_no_improve >= patience:
                print(f"Early stopping triggered after {epoch} epochs")
                break

    def validate(self, val_loader, device, loss_fn):
        """
        Evaluate the model on validation data.
        """
        self.eval()  # Set model to evaluation mode
        val_loss = 0

        with torch.no_grad():
            for xbatch, ybatch in val_loader:
                xbatch = xbatch.to(device)
                ybatch = ybatch.to(device)

                tag_logits = self.forward(xbatch)
                batch, seq = ybatch.shape
                loss       = loss_fn(tag_logits.reshape(batch * seq, -1), ybatch.reshape(batch * seq))
                val_loss += loss.item()

        # Validation loss
        val_loss/= len(val_loader)
        return val_loss
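
The loss above is built with `ignore_index=self.output_pad_id`, so padded positions should not contribute to training or validation loss. The following pure-Python sketch illustrates that idea on a toy example. It reflects my understanding of the default mean reduction (averaging over non-ignored positions only) rather than PyTorch's actual implementation, and the function name, logits, and padding id are all invented for the illustration.

```python
import math


def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(v) for v in row))
        total += log_norm - row[target]  # -log softmax(row)[target]
        count += 1
    return total / count  # assumes at least one non-padded position


PAD = 2  # hypothetical padding tag id for a toy problem with two real tags
logits = [
    [2.0, 0.0, 0.0],  # real token, gold tag 0
    [0.0, 2.0, 0.0],  # real token, gold tag 1
    [5.0, 0.0, 0.0],  # padding: ignored whatever the logits say
]
targets = [0, 1, PAD]

loss = masked_cross_entropy(logits, targets, ignore_index=PAD)

# Changing the logits at the padded position leaves the loss untouched
logits_changed = [logits[0], logits[1], [0.0, 0.0, 9.0]]
loss_changed = masked_cross_entropy(logits_changed, targets, ignore_index=PAD)
print(loss == loss_changed)  # True
```

This is also why the classifier is created with one more output than there are real tags: the extra index exists only so that the padding id is a valid target value.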

### Reconstruct Word-Level Labels
This function reconstructs word-level labels from subword token predictions, ensuring alignment with the original text.


In [12]:
def reconstruct_labels(subwords, labels):
    """
    Reconstructs word-level labels from subword labels.
    Args:
        subwords (list): List of subword tokens.
        labels (list): List of predicted labels for subwords.
    Returns:
        list: Reconstructed labels aligned with the original words.
    """
    word_labels = []
    tag_id = 0

    # Handle the first subword if it does not start with 'Ġ' (start of the sequence, which carries no space marker)
    if(not(subwords[0].startswith('Ġ'))):
        word_labels.append(labels[tag_id])
        tag_id += 1
        subwords = subwords[1:]

    # Process the remaining subwords
    for subword in subwords:
        if(subword.startswith('Ġ')): # Indicates the start of a new word
            word_labels.append(labels[tag_id])
        tag_id += 1

    return word_labels
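
A compact way to read this function is "the first subword of each word decides the word's label". The sketch below expresses that rule in a single loop and runs it on hand-written subwords. The helper name `first_subword_labels` and both inputs are illustrative assumptions, not real tokenizer output.

```python
def first_subword_labels(subwords, labels):
    """Keep the label predicted for the first subword of every word."""
    word_labels = []
    for position, subword in enumerate(subwords):
        # Position 0 always opens a word; later words are marked by "Ġ"
        if position == 0 or subword.startswith("Ġ"):
            word_labels.append(labels[position])
    return word_labels


# Case 1: plain subwords, no special tokens
subwords = ["Smoking", "Ġshould", "Ġbe", "Ġban", "ned"]
labels = [1, 4, 4, 4, 4]
plain = first_subword_labels(subwords, labels)
print(plain)  # [1, 4, 4, 4]

# Case 2: sequence wrapped in special tokens, as an encoder call would produce
wrapped_subwords = ["<s>", "Smoking", "Ġshould", "</s>"]
wrapped_labels = [0, 1, 4, 0]
wrapped = first_subword_labels(wrapped_subwords, wrapped_labels)
print(wrapped)  # [0, 4]
```

In the second case the number of labels still matches the number of words, but the first word receives the label predicted for `<s>` rather than its own. If the inputs to `reconstruct_labels` contain special tokens, it may be worth stripping them (and their predictions) first.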

### Predict Labels for Words in Paragraphs
This function predicts word-level labels for paragraphs by leveraging a trained model and tokenizer. It ensures that the labels align correctly with the original text structure.


In [13]:
def predict_labels(paragraphs, model, tokenizer, idx2tag, device, max_length=512):
    """
    Predicts the labels for a list of paragraphs (each paragraph is a list of words).
    Handles subword tokenization, processes sequences without truncation, and splits long sequences after encoding.
    Args:
        paragraphs (list of list of str): List of paragraphs, each paragraph is a list of words.
        model (RobertaTagger): Trained model for label prediction.
        tokenizer (AutoTokenizer): Tokenizer used to convert words into tokens.
        idx2tag (dict): Dictionary to map label indices to text labels.
        device (torch.device): Computation device (CPU/GPU).
        max_length (int): Maximum number of tokens allowed per sequence.
    Returns:
        list of list of str: List of actual labels for each word in each paragraph.
    """
    model.eval()
    predictions = []

    for paragraph in paragraphs:
        # Full tokenization without truncation
        encoding = tokenizer.encode_plus(
            ' '.join(paragraph),
            truncation=False,  # No truncation, so we don't lose any data
            padding=True,
            return_tensors="pt",
            return_attention_mask=True
        )

        input_ids = encoding['input_ids'][0]  # Remove batch dimension
        attention_mask = encoding['attention_mask'][0]

        # Split into manageable chunks of max_length
        chunks = [
            (input_ids[i:i + max_length], attention_mask[i:i + max_length])
            for i in range(0, input_ids.size(0), max_length)
        ]

        chunk_predictions = []
        for chunk_input_ids, chunk_attention_mask in chunks:
            # Add batch dimension to the chunks
            chunk_input_ids = chunk_input_ids.unsqueeze(0).to(device)
            chunk_attention_mask = chunk_attention_mask.unsqueeze(0).to(device)

            # Get predictions for the chunk
            with torch.no_grad():
                predicted_indices = model(chunk_input_ids, chunk_attention_mask)

            chunk_predictions.extend(predicted_indices.cpu().tolist()[0])

        # Map predictions back to subwords
        subtokens = tokenizer.convert_ids_to_tokens(input_ids.tolist())
        word_labels_indices = reconstruct_labels(subtokens, chunk_predictions)

        # Convert indices to actual labels
        word_labels = [idx2tag[idx] for idx in word_labels_indices]

        predictions.append(word_labels)

    return predictions


# def predict_labels(paragraphs, model, tokenizer, idx2tag, device):
#     """
#     Predicts the labels for a list of paragraphs (each paragraph is a list of words).
#     Args:
#         paragraphs (list of list of str): List of paragraphs, each paragraph is a list of words.
#         model (RobertaTagger): Trained model for label prediction.
#         tokenizer (AutoTokenizer): Tokenizer used to convert words into tokens.
#         idx2tag (dict): Dictionary to map label indices to text labels.
#         device (torch.device): Computation device (CPU/GPU).
#         max_length (int): Maximum number of tokens allowed.
#     Returns:
#         list of list of str: List of actual labels for each word in each paragraph.
#     """
#     model.eval()
#     predictions = []

#     for paragraph in paragraphs:
#         # Full tokenization with attention mask
#         for paragraph in paragraphs:
#             subwords = tokenizer.tokenize(' '.join(paragraph))
#             print(f"Original tokens: {len(paragraph)}, Subwords: {len(subwords)}")

#         encoding = tokenizer.encode_plus(
#             ' '.join(paragraph),
#             truncation=True,  # Ensure tokenization respects max_length
#             padding=True,
#             return_tensors="pt",
#             return_attention_mask=True
#         )

#         input_ids = encoding['input_ids']
#         print(f"Encoded input IDs length: {input_ids.size(1)}")

#         input_ids      = encoding['input_ids'] # Truncate to max_length
#         attention_mask = encoding['attention_mask']  # Truncate mask to align

#         # Pass through the model
#         with torch.no_grad():
#             predicted_indices = model(input_ids.to(device), attention_mask.to(device))

#         # Convert indices to text labels
#         predicted_labels = [
#             [idx2tag[idx] for idx in seq] for seq in predicted_indices.tolist()
#         ]

#         # Reconstruct labels by word
#         subtokens = tokenizer.convert_ids_to_tokens(input_ids[0].cpu().tolist())
#         word_labels = reconstruct_labels(subtokens, predicted_labels[0])
#         predictions.append(word_labels)
    
#     return predictions
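
The active version of `predict_labels` avoids truncation by slicing the encoded sequence into windows of at most `max_length` ids and concatenating the per-window predictions. A minimal sketch of that slicing, using a plain list of integers as a stand-in for real token ids (the helper name `split_into_chunks` is hypothetical):

```python
def split_into_chunks(ids, max_length):
    """Cut a sequence into consecutive windows of at most max_length items."""
    return [ids[i:i + max_length] for i in range(0, len(ids), max_length)]


ids = list(range(1200))  # stand-in for the token ids of one long paragraph
chunks = split_into_chunks(ids, 512)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)  # [512, 512, 176]

# Concatenating the windows restores the original order, so predictions
# made per window line up with the original positions.
merged = [tok for chunk in chunks for tok in chunk]
print(merged == ids)  # True
```

Fixed-size windows are simple, but a window boundary can fall in the middle of a word, and each window is encoded without the context of its neighbours. Overlapping windows would be one possible refinement.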


### JSON Reconstruction from Predictions
This cell defines functions to reconstruct a JSON file from model predictions while preserving the original structure of the input data. Predicted BIO labels are processed for each token and used to rebuild spans.


In [14]:
import json

def create_predictions_json(predictions, test_data, output_filepath):
    """
    Reconstructs a JSON file from model predictions while preserving the original structure.
    Predictions are processed for each token in each paragraph and grouped into their respective documents.
    Args:
        predictions (list): List of predicted BIO labels, flattened across all paragraphs.
        test_data (list): Original test data in JSON format.
        output_filepath (str): Filepath where the reconstructed JSON will be saved.
    Returns:
        None: The JSON file is saved to the specified location.
    """
    # Flatten predictions
    flat_predictions = [label for sublist in predictions for label in sublist]
    pred_index = 0  # Index to track the current position in flat_predictions

    output_data = []

    for doc in test_data:
        # Initialize a new document structure for predictions
        pred_doc = {"tokens": [], "spans": [], "rels": doc["rels"]}  # Keep the original relations

        for paragraph in doc["tokens"]:
            # Reconstruct tokens with predicted BIO labels
            reconstructed_paragraph = []
            bio_labels = []  # Collect BIO labels for span reconstruction
            for token in paragraph:
                # Assign the next predicted label from flat_predictions
                token["arg"] = flat_predictions[pred_index]
                bio_labels.append(flat_predictions[pred_index])
                pred_index += 1
                reconstructed_paragraph.append(token)

            pred_doc["tokens"].append(reconstructed_paragraph)

            # Reconstruct spans based on predicted BIO labels
            spans = reconstruct_spans(reconstructed_paragraph, bio_labels)
            if spans:  # Add spans if any were found
                pred_doc["spans"].extend(spans)

        # Append the reconstructed document to the output data
        output_data.append(pred_doc)

    # Save the reconstructed JSON file to the specified path
    with open(output_filepath, "w") as f:
        json.dump(output_data, f, indent=4)

    print(f"Complete JSON file saved at: {output_filepath}")

def reconstruct_spans(paragraph, bio_labels):
    """
    Reconstruct spans from predicted BIO labels.
    Args:
      paragraph (list): List of tokens in the paragraph.
      bio_labels (list): Predicted BIO labels for the tokens.
    Returns:
      list: A list of reconstructed spans, each represented as a dictionary.
    """
    spans = []
    current_span = None

    for idx, label in enumerate(bio_labels):
        if label.startswith("B-"):
            # If a new span starts, close the previous one (if it exists)
            if current_span:
                spans.append(current_span)
            # Create a new span
            current_span = {
                "name": label[2:],  # Extract the name (e.g., "Claim", "Premise") by removing "B-"
                "start": paragraph[idx]["idx"],  # Start index of the span
                "end": paragraph[idx]["idx"]  # Initialize the end index as the start index
            }
        elif label.startswith("I-") and current_span and current_span["name"] == label[2:]:
            # Extend the current span if it matches the "I-" label
            current_span["end"] = paragraph[idx]["idx"]
        else:
            # Close the current span if the label doesn't match "I-*" or if it's "O"
            if current_span:
                current_span["end"] = paragraph[idx]["idx"]  # Include the first non-I-* or O as the end index
                spans.append(current_span)
                current_span = None

    # Add the last span if it wasn't closed
    if current_span:
        spans.append(current_span)

    return spans
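
For readers less familiar with BIO decoding, the sketch below shows the general idea on a toy label sequence. It is a simplified variant, not a copy of `reconstruct_spans`: it takes token indices directly instead of token dictionaries, and it always treats `end` as the index of the last token inside the span, whereas the function above sets `end` to the index of the closing token when a span is ended by a following label. The name `bio_to_spans` is hypothetical.

```python
def bio_to_spans(token_indices, bio_labels):
    """Group BIO labels into spans with inclusive start and end indices."""
    spans, current = [], None
    for idx, label in zip(token_indices, bio_labels):
        if label.startswith("B-"):
            if current:
                spans.append(current)  # a new span closes the previous one
            current = {"name": label[2:], "start": idx, "end": idx}
        elif label.startswith("I-") and current and current["name"] == label[2:]:
            current["end"] = idx  # extend the open span
        else:
            if current:
                spans.append(current)  # "O" or a mismatched "I-" closes the span
                current = None
    if current:
        spans.append(current)
    return spans


labels = ["O", "B-Claim", "I-Claim", "I-Claim", "O", "B-Premise", "I-Premise"]
spans = bio_to_spans(list(range(len(labels))), labels)
print(spans)
# [{'name': 'Claim', 'start': 1, 'end': 3}, {'name': 'Premise', 'start': 5, 'end': 6}]
```

In both versions an `I-` label that appears without an open span of the same type is dropped rather than starting a new span, which makes span-level scores sensitive to a wrong prediction on the first token of an argument component.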


### Essay Dataset Pipeline - Data Preparation, Model Initialization, and Training
This pipeline uses the `aae_*.json` datasets, which focus on argumentative structures within essays. The model is trained, validated, and tested on this dataset to show how well it captures argumentative spans in natural language text. Only spans are predicted: relations are copied unchanged from the test data when the output JSON is built, so the relation scores in the evaluation are trivially 1.0.

The cell handles the following tasks:
1. Load and preprocess training and validation data.
2. Initialize the tokenizer and datasets for the essay dataset.
3. Set up the model and train it using a fine-tuning process.
4. Generate predictions on test data.


In [23]:
import json
from transformers import AutoTokenizer
import torch
from torch.utils.data import DataLoader
from pathlib import Path

# Define the tag mappings
tag2idx = {"O": 0, "B-Claim": 1, "B-MajorClaim": 2, "B-Premise": 3, "I-Claim": 4, "I-MajorClaim": 5, "I-Premise": 6}
idx2tag = {v: k for k, v in tag2idx.items()}

# Dataset Paths
data_folder = Path("data")
train_file_path = data_folder / "aae_train.json"
validation_file_path = data_folder / "aae_dev.json"
test_file_path = data_folder / "aae_test.json"

# Load training data and preprocess
with open(train_file_path, "r") as f:
    train_data = json.load(f)
sentences, tags = build_data(train_data, tag2idx)

# Load validation data and preprocess
with open(validation_file_path, "r") as f:
    dev_data = json.load(f)
dev_sentences, dev_tags = build_data(dev_data, tag2idx)

# Initialize tokenizer and datasets
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
train_set = NerDataSet(sentences, tags, tokenizer, len(tag2idx))
dev_set = NerDataSet(dev_sentences, dev_tags, tokenizer, len(tag2idx))

# Set up DataLoaders for batching
dataloader = DataLoader(train_set, batch_size=16, shuffle=True, collate_fn=train_set.pad_batch)
dev_dataloader = DataLoader(dev_set, batch_size=32, shuffle=False, collate_fn=dev_set.pad_batch)

# Set computation device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device used for training: {device}")
if device.type == "cuda":
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

# Initialize and fine-tune the NER model
ner_model = RobertaTagger(output_vocab_size=len(idx2tag) + 1, output_pad_id=len(idx2tag))
ner_model.fine_tune(dataloader, dev_dataloader, device, epochs=6)

# Load test data and generate predictions
with open(test_file_path, "r") as f:
    test_data = json.load(f)
sentences, _ = build_data(test_data, tag2idx)

predictions = predict_labels(
    paragraphs=sentences,
    model=ner_model,
    tokenizer=tokenizer,
    idx2tag=idx2tag,
    device=device
)

Device used for training: cuda
GPU Name: NVIDIA GeForce RTX 3070


Some weights of RobertaModel were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/6, Training Loss: 0.9360, Validation Loss: 0.6339
Epoch 2/6, Training Loss: 0.5372, Validation Loss: 0.5520
Epoch 3/6, Training Loss: 0.3783, Validation Loss: 0.4908
Epoch 4/6, Training Loss: 0.2659, Validation Loss: 0.5465
Epoch 5/6, Training Loss: 0.1656, Validation Loss: 0.5664
Epoch 6/6, Training Loss: 0.0894, Validation Loss: 0.5445
Early stopping triggered after 5 epochs
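
The early-stopping rule in `fine_tune` can be replayed on the validation losses logged above. The sketch below is a standalone re-implementation of the patience counter (the function name is hypothetical) and returns the 1-based epoch at which the rule fires.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best, stale = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0  # improvement: reset the counter
        else:
            stale += 1             # no improvement this epoch
        if stale >= patience:
            return epoch
    return None


# Validation losses copied from the essay run above
essay_val_losses = [0.6339, 0.5520, 0.4908, 0.5465, 0.5664, 0.5445]
stop_epoch = early_stopping_epoch(essay_val_losses, patience=3)
print(stop_epoch)  # 6
```

The best loss is reached at epoch 3, and epochs 4 to 6 bring no improvement, so the rule fires at epoch 6, which is also the last scheduled epoch. The log reports "after 5 epochs" because the message prints the zero-based loop index.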


### Save Predictions and Evaluate Model
This cell saves the predicted labels into a JSON file and evaluates the model's performance using a predefined evaluation script.


In [24]:
create_predictions_json(predictions, test_data, "aae_predictions_llm.json")

# Evaluate the predictions against the test dataset
!python evaluate.py aae_predictions_llm.json aae_test.json

Complete JSON file saved at: aae_predictions_llm.json


********************** SPANS *************************** 
   STRICT EVALUATION
    > Argument mining spans (unlabeled)
      Precision : 0.7450240240218107 
      Recall    : 0.7699594000622003
      F-score   : 0.7555197241169669
    > Argument mining spans (labeled)
      Precision : 0.6678192895858273 
      Recall    : 0.689371945407862 
      F-score   : 0.6768506469467651

    RELAXED EVALUATION (α = 0.5)
    > Argument mining spans (unlabeled)
      Precision : 0.777481394508183 
      Recall    : 0.8956024985140908
      F-score   : 0.8298550531967501
    > Argument mining spans (labeled)
      Precision : 0.7013360652167573 
      Recall    : 0.7925492975180652 
      F-score   : 0.7421477442445348



******************* RELATIONS *************************** 
   STRICT EVALUATION
    > Argument mining spans (unlabeled)
      Precision : 1.0 
      Recall    : 1.0
      F-score   : 1.0
    > Argument mining spans (labeled


### Medical Dataset Pipeline - Data Preparation, Model Initialization, and Training
This pipeline works with the `abstrct_*.json` datasets, which focus on argumentative structures within medical abstracts. As with the essay dataset, the model is trained, validated, and tested on this data. Its domain-specific language and structure make it a useful test of how well the approach generalizes beyond essays.

In [35]:
import json
import os
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from pathlib import Path

# Define the tag mappings
tag2idx = {"O": 0, "B-Claim": 1, "B-MajorClaim": 2, "B-Premise": 3, "I-Claim": 4, "I-MajorClaim": 5, "I-Premise": 6}
idx2tag = {v: k for k, v in tag2idx.items()}

# Dataset Paths
data_folder = Path("data")
train_file_path = data_folder / "train_data.json"
validation_file_path = data_folder / "validation_data.json"
test_file_path = data_folder / "test_data.json"

# Load training data and preprocess
with open(train_file_path, "r") as f:
    train_data = json.load(f)
sentences, tags = build_data(train_data, tag2idx)

# Load validation data and preprocess
with open(validation_file_path, "r") as f:
    dev_data = json.load(f)
dev_sentences, dev_tags = build_data(dev_data, tag2idx)

# Initialize tokenizer and datasets
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
train_set = NerDataSet(sentences, tags, tokenizer, len(tag2idx))
dev_set = NerDataSet(dev_sentences, dev_tags, tokenizer, len(tag2idx))

# Set up DataLoaders for batching
dataloader = DataLoader(train_set, batch_size=4, shuffle=True, collate_fn=train_set.pad_batch)
dev_dataloader = DataLoader(dev_set, batch_size=8, shuffle=False, collate_fn=dev_set.pad_batch)

# Set computation device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device used for training: {device}")
if device.type == "cuda":
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

# Initialize and fine-tune the NER model
ner_model = RobertaTagger(output_vocab_size=len(idx2tag) + 1, output_pad_id=len(idx2tag))
ner_model.fine_tune(dataloader, dev_dataloader, device, epochs=6)


# Load test data and generate predictions
with open(test_file_path, "r") as f:
    test_data = json.load(f)

sentences, _ = build_data(test_data, tag2idx)

predictions = predict_labels(
    paragraphs=sentences,
    model=ner_model,
    tokenizer=tokenizer,
    idx2tag=idx2tag,
    device=device
)

# Total number of tokens in the input data
total_tokens_input = len([token for tokens in sentences for token in tokens])

# Total number of predictions generated
total_tokens_predicted = sum(len(pred) for pred in predictions)

# Sanity check: one prediction per input token
assert total_tokens_input == total_tokens_predicted, (
    f"Mismatch between input tokens ({total_tokens_input}) "
    f"and predictions ({total_tokens_predicted})!"
)

print(f"Total input tokens: {total_tokens_input}")
print(f"Total predicted tokens: {total_tokens_predicted}")


Token indices sequence length is longer than the specified maximum sequence length for this model (755 > 512). Running this sequence through the model will result in indexing errors


Device used for training: cuda
GPU Name: NVIDIA GeForce RTX 3070


Some weights of RobertaModel were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/6, Training Loss: 1.0019, Validation Loss: 0.4881
Epoch 2/6, Training Loss: 0.4935, Validation Loss: 0.2955
Epoch 3/6, Training Loss: 0.3326, Validation Loss: 0.2864
Epoch 4/6, Training Loss: 0.2424, Validation Loss: 0.2668
Epoch 5/6, Training Loss: 0.2117, Validation Loss: 0.4000
Epoch 6/6, Training Loss: 0.1027, Validation Loss: 0.4090
Total input tokens: 8423
Total predicted tokens: 8423


### Save Predictions and Evaluate Model

In [36]:
create_predictions_json(predictions, test_data, "abstrct_neoplasm_predictions_llm.json")
!python view_data.py abstrct_neoplasm_predictions_llm.json

Complete JSON file saved at: abstrct_neoplasm_predictions_llm.json
0	Imatinib	O
1	(	I-MajorClaim
2	Gleevec	I-MajorClaim
3	)	I-MajorClaim
4	,	I-MajorClaim
5	a	I-MajorClaim
6	highly	I-MajorClaim
7	effective	I-MajorClaim
8	specific	I-MajorClaim
9	tyrosine	I-MajorClaim
10	kinase	I-MajorClaim
11	inhibitor	I-MajorClaim
12	,	I-MajorClaim
13	demonstrates	I-MajorClaim
14	a	I-MajorClaim
15	better	I-MajorClaim
16	side	I-MajorClaim
17	effect	I-MajorClaim
18	profile	I-MajorClaim
19	than	I-MajorClaim
20	interferon-alpha	I-MajorClaim
21	(	I-MajorClaim
22	IFN	I-MajorClaim
23	)	I-MajorClaim
24	,	I-MajorClaim
25	which	I-MajorClaim
26	impairs	I-MajorClaim
27	patients	I-MajorClaim
28	'	I-MajorClaim
29	quality	I-MajorClaim
30	of	I-MajorClaim
31	life	I-MajorClaim
32	(	I-MajorClaim
33	QoL	I-MajorClaim
34	)	I-MajorClaim
35	.	O
36	This	O
37	phase	O
38	III	O
39	international	O
40	study	O
41	evaluated	O
42	QoL	O
43	outcomes	O
44	in	O
45	1,106	O
46	newly	O
47	diagnosed	O
48	patients	O
49	with	O
50	chronic-phase	O

In [37]:
!python evaluate.py abstrct_neoplasm_predictions_llm.json abstrct_neoplasm_test.json



********************** SPANS *************************** 
   STRICT EVALUATION
    > Argument mining spans (unlabeled)
      Precision : 0.6514414983164982 
      Recall    : 0.7160428691678691
      F-score   : 0.6739365958115958
    > Argument mining spans (labeled)
      Precision : 0.6263392857142857 
      Recall    : 0.6877840909090908 
      F-score   : 0.6474190161690161

    RELAXED EVALUATION (α = 0.5)
    > Argument mining spans (unlabeled)
      Precision : 0.702831080956081 
      Recall    : 0.8294748075998076
      F-score   : 0.752459641536654
    > Argument mining spans (labeled)
      Precision : 0.6732956857956859 
      Recall    : 0.7787292568542569 
      F-score   : 0.7142452211531158



******************* RELATIONS *************************** 
   STRICT EVALUATION
    > Argument mining spans (unlabeled)
      Precision : 1.0 
      Recall    : 1.0
      F-score   : 1.0
    > Argument mining spans (labeled)
      Precision : 1.0 
      Recall    : 1.0 
      

In [20]:
# Load test data and generate predictions
with open("abstrct_glaucoma_test.json", "r") as f:
    test_data = json.load(f)

sentences, _ = build_data(test_data, tag2idx)

predictions = predict_labels(
    paragraphs=sentences,
    model=ner_model,
    tokenizer=tokenizer,
    idx2tag=idx2tag,
    device=device
)

create_predictions_json(predictions, test_data, "abstrct_glaucoma_predictions_llm.json")
!python evaluate.py abstrct_glaucoma_predictions_llm.json abstrct_glaucoma_test.json

Complete JSON file saved at: abstrct_glaucoma_predictions_llm.json
[error] division by zero


In [21]:
# Load test data and generate predictions
with open("abstrct_mixed_test.json", "r") as f:
    test_data = json.load(f)

sentences, _ = build_data(test_data, tag2idx)

predictions = predict_labels(
    paragraphs=sentences,
    model=ner_model,
    tokenizer=tokenizer,
    idx2tag=idx2tag,
    device=device
)
create_predictions_json(predictions, test_data, "abstrct_mixed_predictions_llm.json")
!python evaluate.py abstrct_mixed_predictions_llm.json abstrct_mixed_test.json

Complete JSON file saved at: abstrct_mixed_predictions_llm.json
[error] division by zero
