# Italian Sentence Boundary Detection - Inference Pipeline

This notebook provides an inference pipeline for sentence boundary detection in Italian text using fine-tuned encoder models with and without CRF layers.

## Models

The following models are evaluated:

1. **BERT-Italian-CRF**: dbmdz/bert-base-italian-xxl-cased with CRF layer
2. **XLM-RoBERTa-CRF**: FacebookAI/xlm-roberta-base with CRF layer  
3. **BERT-Italian-Base**: dbmdz/bert-base-italian-xxl-cased with a standard token-classification head (no CRF)

## Task

Binary token classification:
- Label 0: Token does not end a sentence
- Label 1: Token ends a sentence
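
To make the label scheme concrete, the sketch below rebuilds sentences from a token list and its 0/1 labels. The helper name `split_sentences` and the sample tokens are illustrative assumptions, not part of the pipeline itself.

```python
def split_sentences(tokens, labels):
    """Group tokens into sentences, closing a sentence after each label-1 token."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:          # this token ends a sentence
            sentences.append(current)
            current = []
    if current:                 # trailing tokens without a closing label
        sentences.append(current)
    return sentences


# Hypothetical sample input
tokens = ["C'", "era", "una", "volta", ".", "Poi", "basta", "."]
labels = [0, 0, 0, 0, 1, 0, 0, 1]
for sentence in split_sentences(tokens, labels):
    print(" ".join(sentence))
```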

---

## 1. Install Required Libraries

In [1]:
!pip install -q torch transformers pytorch-crf scikit-learn pandas numpy tqdm huggingface-hub safetensors protobuf


## 2. Configuration and Setup

In [2]:
# Configuration Cell
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Model configurations
MODELS = {
    "bert-italian-crf": {
        "model_path": "ArchitRastogi/bert-base-italian-xxl-cased-sentence-splitter-CRF",
        "base_model": "dbmdz/bert-base-italian-xxl-cased",
        "is_crf": True
    },
    "xlm-roberta-crf": {
        "model_path": "ArchitRastogi/xlm-roberta-base-italian-sentence-splitter-CRF",
        "base_model": "FacebookAI/xlm-roberta-base",
        "is_crf": True
    },
    "bert-italian-base": {
        "model_path": "ArchitRastogi/bert-base-italian-xxl-cased-sentence-splitter-base",
        "base_model": "dbmdz/bert-base-italian-xxl-cased",
        "is_crf": False
    },
}

# Inference parameters
MAX_LENGTH = 512
STRIDE = 64
BATCH_SIZE = 16
CACHE_DIR = "cache"
OUTPUT_DIR = "inference_output"

# Input file configuration
# Set to None to use default OOD_test.csv, or provide path to custom CSV
CUSTOM_INPUT_FILE = None  # e.g., "path/to/your/file.csv"
DEFAULT_INPUT_FILE = "OOD_test.csv"

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(CACHE_DIR, exist_ok=True)

print("Configuration loaded successfully.")
print(f"Number of models to evaluate: {len(MODELS)}")
print(f"Output directory: {OUTPUT_DIR}")

Configuration loaded successfully.
Number of models to evaluate: 3
Output directory: inference_output
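
`MAX_LENGTH` and `STRIDE` control how over-long inputs are cut into overlapping windows: with Hugging Face tokenizers, `stride` is the number of tokens shared by consecutive windows. The following is a simplified sketch of that windowing which ignores special tokens and padding; `window_spans` is a hypothetical helper, not a library call.

```python
def window_spans(n_tokens, max_length, stride):
    """Return (start, end) spans of overlapping windows covering n_tokens.

    Consecutive windows share `stride` tokens, so each step advances by
    max_length - stride positions.
    """
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    spans, start = [], 0
    while True:
        end = min(start + max_length, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += max_length - stride
    return spans


# A 1000-token input with the settings used in this notebook
print(window_spans(1000, 512, 64))
```

Inputs shorter than `MAX_LENGTH` produce a single window, in which case the stride has no effect.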


## 3. Import Libraries

In [3]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel, AutoModelForTokenClassification
from torchcrf import CRF
from sklearn.metrics import (
    precision_recall_fscore_support, 
    accuracy_score,
    classification_report,
    confusion_matrix
)
from tqdm.auto import tqdm
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Check device
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {DEVICE}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Using device: cuda
GPU: NVIDIA RTX A4500
Memory: 21.15 GB


## 4. Define Model Architecture

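The CRF wrapper adds a Conditional Random Field (CRF) layer on top of the encoder: a linear layer maps each token's hidden state to per-label emission scores, and the CRF decodes the highest-scoring label sequence jointly instead of choosing each label independently.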

In [4]:
class BERTWithCRF(nn.Module):
    """BERT encoder with CRF layer for token classification."""
    
    def __init__(self, base_model, num_labels=2):
        super().__init__()
        self.encoder = base_model
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)
        
    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state
        sequence_output = self.dropout(sequence_output)
        emissions = self.classifier(sequence_output)
        
        mask = attention_mask.bool()
        
        if labels is not None:
            labels_crf = labels.clone()
            valid_labels_mask = (labels != -100)
            labels_crf[labels_crf == -100] = 0
            
            log_likelihood = self.crf(emissions, labels_crf, mask=mask, reduction='none')
            
            batch_size = input_ids.size(0)
            masked_log_likelihood = []
            for i in range(batch_size):
                n_valid = valid_labels_mask[i].sum()
                if n_valid > 0:
                    masked_log_likelihood.append(log_likelihood[i])
            
            if len(masked_log_likelihood) > 0:
                loss = -torch.stack(masked_log_likelihood).mean()
            else:
                loss = -log_likelihood.mean()
            
            predictions = self.crf.decode(emissions, mask=mask)
            return {'loss': loss, 'logits': emissions, 'predictions': predictions}
        else:
            predictions = self.crf.decode(emissions, mask=mask)
            return {'logits': emissions, 'predictions': predictions}

print("Model architecture defined.")

Model architecture defined.
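
For intuition about what the CRF's `decode` step does, here is a minimal Viterbi sketch in plain Python. It is not `torchcrf`'s implementation: the emission and transition scores are made-up numbers, and start/end transitions are omitted for brevity.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label path.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(emissions[0])
    scores = list(emissions[0])          # best score of any path ending in each label
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_labels):
            best_prev = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label
    label = max(range(n_labels), key=lambda j: scores[j])
    path = [label]
    for pointers in reversed(backpointers):
        label = pointers[label]
        path.append(label)
    return path[::-1]


# Three positions, two labels (0 = no split, 1 = split); hypothetical scores.
emissions = [[2.0, 0.0], [0.4, 0.6], [0.0, 2.0]]
# Strong penalty for two consecutive sentence ends (1 -> 1).
transitions = [[0.0, 0.0], [0.0, -3.0]]

greedy = [max(range(2), key=lambda j: e[j]) for e in emissions]
print("greedy :", greedy)
print("viterbi:", viterbi(emissions, transitions))
```

Per-token argmax picks label 1 at both of the last two positions, while the joint decode avoids the penalised 1 -> 1 transition. This is the kind of sequence-level constraint a CRF layer can learn.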


## 5. Data Loading Functions

Expected CSV format (semicolon-separated):
```
token;label
C';0
era;0
una;0
volta;0
.;1
```
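
Because the separator is itself a plausible token in Italian text, a line such as `;;0` contains three fields when split naively, and a loader that requires exactly two fields would skip it. One way to handle this, sketched here under the assumption that the label is always the last field, is to split from the right. `parse_line` is a hypothetical helper and is not the loader used in this notebook.

```python
def parse_line(line):
    """Parse one 'token;label' line; return (token, label) or None if invalid."""
    line = line.rstrip("\n")
    if ";" not in line:
        return None
    token, label = line.rsplit(";", 1)   # split only at the last semicolon
    try:
        return token, int(label)
    except ValueError:                   # header row or malformed label
        return None


for raw in ["volta;0", ".;1", ";;0", "token;label", ""]:
    print(repr(raw), "->", parse_line(raw))
```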

In [5]:
def load_test_data(filepath):
    """Load test data from CSV file.
    
    Args:
        filepath: Path to CSV file with token;label format
        
    Returns:
        sentences: List of token lists
        labels: List of label lists
        df: Raw dataframe
    """
    print(f"Loading data from: {filepath}")
    
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"File not found: {filepath}")
    
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_no, line in enumerate(f):
            line = line.strip()
            if not line:
                continue
            # Skip a header on the first line, if present
            if line_no == 0 and (line.lower().startswith('pinocchio') or line == 'token;label'):
                continue
            parts = line.split(';')
            if len(parts) != 2:
                continue
            token, label = parts
            try:
                data.append([token, int(label)])
            except ValueError:
                # Non-integer label (e.g. a repeated header row): skip the line
                continue
    
    df = pd.DataFrame(data, columns=['token', 'label'])
    
    # Group into sentences (split at label=1)
    sentences = []
    labels = []
    current_sent = []
    current_labels = []
    
    for _, row in df.iterrows():
        token = str(row['token'])
        label = int(row['label'])
        current_sent.append(token)
        current_labels.append(label)
        if label == 1:
            sentences.append(current_sent)
            labels.append(current_labels)
            current_sent = []
            current_labels = []
    
    if current_sent:
        sentences.append(current_sent)
        labels.append(current_labels)
    
    print(f"Loaded {len(df)} tokens grouped into {len(sentences)} sentences")
    print(f"Label distribution:\n{df['label'].value_counts().sort_index()}")
    
    return sentences, labels, df

print("Data loading functions defined.")

Data loading functions defined.


## 6. Dataset Class

In [6]:
class InferenceDataset(Dataset):
    """Dataset for inference with stride support."""
    
    def __init__(self, sentences, labels, tokenizer, max_length=512, stride=64):
        self.encodings = []
        self.labels_aligned = []
        self.original_indices = []
        
        for idx, (sent, labs) in enumerate(tqdm(zip(sentences, labels), 
                                                 desc="Tokenizing", 
                                                 total=len(sentences))):
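            # Tokens are re-joined into a string, so word_ids() below follow the
            # tokenizer's own word splitting rather than the input token list.
            text = ' '.join(sent)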
            
            encoding = tokenizer(
                text, 
                truncation=True, 
                max_length=max_length,
                stride=stride,
                return_overflowing_tokens=True,
                return_offsets_mapping=True,
                padding='max_length'
            )
            
            for i in range(len(encoding['input_ids'])):
                self.encodings.append({
                    'input_ids': encoding['input_ids'][i],
                    'attention_mask': encoding['attention_mask'][i]
                })
                
                word_ids = encoding.word_ids(batch_index=i)
                label_ids = []
                previous_word_idx = None
                
                for word_idx in word_ids:
                    if word_idx is None:
                        label_ids.append(-100)
                    elif word_idx != previous_word_idx:
                        label_ids.append(labs[word_idx] if word_idx < len(labs) else 0)
                    else:
                        label_ids.append(-100)
                    previous_word_idx = word_idx
                
                self.labels_aligned.append(label_ids)
                self.original_indices.append(idx)
    
    def __len__(self):
        return len(self.encodings)
    
    def __getitem__(self, idx):
        item = {key: torch.tensor(val) for key, val in self.encodings[idx].items()}
        item['labels'] = torch.tensor(self.labels_aligned[idx])
        item['idx'] = self.original_indices[idx]
        return item

print("Dataset class defined.")

Dataset class defined.
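
The alignment rule used in the dataset class (label on the first sub-token of each word, `-100` elsewhere) can be illustrated without a tokenizer. The `word_ids` list below is a hand-written stand-in for what `encoding.word_ids()` returns, and `align_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from evaluation


def align_labels(word_ids, word_labels):
    """Assign each word's label to its first sub-token; mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:              # special token or padding
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:        # first sub-token of a new word
            aligned.append(word_labels[word_id])
        else:                            # continuation sub-token
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# [CLS] era una vol ##ta . [SEP] [PAD]  (hypothetical sub-token split)
word_ids = [None, 0, 1, 2, 2, 3, None, None]
word_labels = [0, 0, 0, 1]
print(align_labels(word_ids, word_labels))
```

The rule only holds if `word_ids` index the same units as the label list. When raw text is passed, fast tokenizers derive words from their own pre-tokenization, which may split on punctuation; passing the token list with `is_split_into_words=True` is one way to keep the two in step.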


## 7. Inference Functions

In [7]:
def run_inference(model, dataloader, is_crf=False):
    """Run inference on dataset.
    
    Args:
        model: PyTorch model
        dataloader: DataLoader instance
        is_crf: Whether model uses CRF layer
        
    Returns:
        predictions: Numpy array of predictions
        labels: Numpy array of true labels
    """
    model.eval()
    all_predictions = []
    all_labels = []
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Running inference"):
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            labels = batch['labels'].to(DEVICE)
            
            if is_crf:
                outputs = model(input_ids, attention_mask)
                predictions = outputs['predictions']
                
                for i, pred_seq in enumerate(predictions):
                    mask = (labels[i] != -100).cpu().numpy()
                    pred_seq_padded = pred_seq + [0] * (len(mask) - len(pred_seq))
                    pred_masked = np.array(pred_seq_padded)[mask]
                    label_masked = labels[i].cpu().numpy()[mask]
                    all_predictions.extend(pred_masked.tolist())
                    all_labels.extend(label_masked.tolist())
            else:
                outputs = model(input_ids, attention_mask=attention_mask)
                logits = outputs.logits
                predictions = torch.argmax(logits, dim=-1)
                
                for i in range(predictions.shape[0]):
                    mask = (labels[i] != -100).cpu().numpy()
                    pred_masked = predictions[i].cpu().numpy()[mask]
                    label_masked = labels[i].cpu().numpy()[mask]
                    all_predictions.extend(pred_masked.tolist())
                    all_labels.extend(label_masked.tolist())
    
    return np.array(all_predictions), np.array(all_labels)


def compute_metrics(predictions, labels):
    """Compute evaluation metrics.
    
    Args:
        predictions: Predicted labels
        labels: True labels
        
    Returns:
        Dictionary of metrics
    """
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='binary', zero_division=0
    )
    acc = accuracy_score(labels, predictions)
    
    class_report = classification_report(
        labels, predictions, 
        target_names=['No Split (0)', 'Split (1)'],
        digits=4,
        zero_division=0
    )
    
    conf_matrix = confusion_matrix(labels, predictions)
    
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'classification_report': class_report,
        'confusion_matrix': conf_matrix
    }

print("Inference functions defined.")

Inference functions defined.
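
The binary precision, recall and F1 returned by `compute_metrics` refer to the positive class (label 1) and can be derived from the confusion matrix by hand. The sketch below does so with the XLM-RoBERTa-CRF matrix printed later in this notebook; `prf_from_confusion` is a hypothetical helper.

```python
def prf_from_confusion(matrix):
    """Compute precision, recall and F1 for class 1.

    matrix follows scikit-learn's layout: rows are true labels,
    columns are predicted labels, i.e. [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


precision, recall, f1 = prf_from_confusion([[1394, 26], [1, 87]])
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With 87 true positives, 26 false positives and 1 false negative, this gives precision 0.7699, recall 0.9886 and F1 0.8657, matching the values reported for that model.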


## 8. Model Loading and Testing Function

In [8]:
def test_model(model_name, model_config, sentences, labels):
    """Load and test a single model.
    
    Args:
        model_name: Name identifier for the model
        model_config: Configuration dictionary
        sentences: List of sentence token lists
        labels: List of label lists
        
    Returns:
        Dictionary of results or None if failed
    """
    print(f"\n{'='*80}")
    print(f"Testing: {model_name}")
    print(f"Model: {model_config['model_path']}")
    print(f"Type: {'CRF' if model_config['is_crf'] else 'Base Encoder'}")
    print("="*80)
    
    try:
        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(
            model_config['base_model'],
            cache_dir=CACHE_DIR,
            trust_remote_code=True
        )
        print(f"Tokenizer loaded: {model_config['base_model']}")
        
        # Load model
        if model_config['is_crf']:
            from huggingface_hub import hf_hub_download
            from safetensors.torch import load_file
            
            # Load base encoder
            base_encoder = AutoModel.from_pretrained(
                model_config['base_model'],
                cache_dir=CACHE_DIR,
                trust_remote_code=True
            )
            print("Base encoder loaded")
            
            # Create CRF wrapper
            model = BERTWithCRF(base_encoder, num_labels=2)
            print("CRF wrapper created")
            
            # Load weights
            try:
                model_file = hf_hub_download(
                    repo_id=model_config['model_path'],
                    filename="model.safetensors",
                    cache_dir=CACHE_DIR
                )
                state_dict = load_file(model_file)
                model.load_state_dict(state_dict, strict=False)
                print("Weights loaded from safetensors")
            except Exception:
                # Fall back to the legacy checkpoint file if the safetensors weights cannot be loaded
                model_file = hf_hub_download(
                    repo_id=model_config['model_path'],
                    filename="pytorch_model.bin",
                    cache_dir=CACHE_DIR
                )
                state_dict = torch.load(model_file, map_location=DEVICE)
                model.load_state_dict(state_dict, strict=False)
                print("Weights loaded from pytorch_model.bin")
        else:
            model = AutoModelForTokenClassification.from_pretrained(
                model_config['model_path'],
                cache_dir=CACHE_DIR,
                num_labels=2,
                trust_remote_code=True
            )
            print("Model loaded directly")
        
        model = model.to(DEVICE)
        model.eval()
        print(f"Model moved to {DEVICE}")
        
        # Create dataset and dataloader
        test_dataset = InferenceDataset(sentences, labels, tokenizer, MAX_LENGTH, STRIDE)
        test_loader = DataLoader(
            test_dataset, 
            batch_size=BATCH_SIZE, 
            shuffle=False,
            num_workers=0,
            pin_memory=True
        )
        
        # Run inference
        predictions, true_labels = run_inference(model, test_loader, is_crf=model_config['is_crf'])
        
        # Compute metrics
        metrics = compute_metrics(predictions, true_labels)
        
        # Display results
        print(f"\nResults:")
        print(f"  Accuracy:  {metrics['accuracy']:.4f}")
        print(f"  Precision: {metrics['precision']:.4f}")
        print(f"  Recall:    {metrics['recall']:.4f}")
        print(f"  F1 Score:  {metrics['f1']:.4f}")
        print(f"\nClassification Report:\n{metrics['classification_report']}")
        print(f"\nConfusion Matrix:\n{metrics['confusion_matrix']}")
        
        # Cleanup
        del model, tokenizer, test_dataset, test_loader
        torch.cuda.empty_cache()
        
        return {
            'model_name': model_name,
            'model_path': model_config['model_path'],
            'model_type': 'CRF' if model_config['is_crf'] else 'Base',
            'huggingface_link': f"https://huggingface.co/{model_config['model_path']}",
            'accuracy': metrics['accuracy'],
            'precision': metrics['precision'],
            'recall': metrics['recall'],
            'f1': metrics['f1'],
            'predictions': predictions,
            'classification_report': metrics['classification_report']
        }
        
    except Exception as e:
        print(f"Error testing model {model_name}: {str(e)}")
        import traceback
        traceback.print_exc()
        return None

print("Model testing function defined.")

Model testing function defined.


## 9. Load Input Data

Loads either the custom input file or the default OOD test set.

In [9]:
# Determine input file
input_file = CUSTOM_INPUT_FILE if CUSTOM_INPUT_FILE else DEFAULT_INPUT_FILE

print(f"Input file: {input_file}")
print()

# Load data
sentences, labels, raw_df = load_test_data(input_file)

print(f"\nDataset loaded successfully.")
print(f"Total tokens: {len(raw_df)}")
print(f"Total sentences: {len(sentences)}")

Input file: OOD_test.csv

Loading data from: OOD_test.csv
Loaded 1508 tokens grouped into 88 sentences
Label distribution:
label
0    1420
1      88
Name: count, dtype: int64

Dataset loaded successfully.
Total tokens: 1508
Total sentences: 88


## 10. Run Inference on All Models

In [10]:
results = []
model_predictions = {}

for model_name, model_config in MODELS.items():
    result = test_model(model_name, model_config, sentences, labels)
    if result:
        predictions = result.pop('predictions')
        classification_report_text = result.pop('classification_report')
        results.append(result)
        model_predictions[model_name] = predictions
        
        # Save individual classification report
        report_file = os.path.join(OUTPUT_DIR, f"{model_name}_classification_report.txt")
        with open(report_file, 'w') as f:
            f.write(f"Model: {model_name}\n")
            f.write(f"HuggingFace: {result['huggingface_link']}\n\n")
            f.write(f"Metrics:\n")
            f.write(f"  Accuracy:  {result['accuracy']:.4f}\n")
            f.write(f"  Precision: {result['precision']:.4f}\n")
            f.write(f"  Recall:    {result['recall']:.4f}\n")
            f.write(f"  F1 Score:  {result['f1']:.4f}\n\n")
            f.write(f"Classification Report:\n{classification_report_text}\n")
        print(f"Classification report saved: {report_file}")

print("\nAll models evaluated.")


Testing: bert-italian-crf
Model: ArchitRastogi/bert-base-italian-xxl-cased-sentence-splitter-CRF
Type: CRF


Tokenizer loaded: dbmdz/bert-base-italian-xxl-cased


Base encoder loaded
CRF wrapper created


Weights loaded from safetensors
Model moved to cuda



Results:
  Accuracy:  0.9552
  Precision: 0.5938
  Recall:    0.6477
  F1 Score:  0.6196

Classification Report:
              precision    recall  f1-score   support

No Split (0)     0.9789    0.9736    0.9762      1475
   Split (1)     0.5938    0.6477    0.6196        88

    accuracy                         0.9552      1563
   macro avg     0.7863    0.8106    0.7979      1563
weighted avg     0.9572    0.9552    0.9561      1563


Confusion Matrix:
[[1436   39]
 [  31   57]]
Classification report saved: inference_output/bert-italian-crf_classification_report.txt

Testing: xlm-roberta-crf
Model: ArchitRastogi/xlm-roberta-base-italian-sentence-splitter-CRF
Type: CRF


Tokenizer loaded: FacebookAI/xlm-roberta-base


Base encoder loaded
CRF wrapper created


Weights loaded from safetensors
Model moved to cuda



Results:
  Accuracy:  0.9821
  Precision: 0.7699
  Recall:    0.9886
  F1 Score:  0.8657

Classification Report:
              precision    recall  f1-score   support

No Split (0)     0.9993    0.9817    0.9904      1420
   Split (1)     0.7699    0.9886    0.8657        88

    accuracy                         0.9821      1508
   macro avg     0.8846    0.9852    0.9280      1508
weighted avg     0.9859    0.9821    0.9831      1508


Confusion Matrix:
[[1394   26]
 [   1   87]]
Classification report saved: inference_output/xlm-roberta-crf_classification_report.txt

Testing: bert-italian-base
Model: ArchitRastogi/bert-base-italian-xxl-cased-sentence-splitter-base
Type: Base Encoder
Tokenizer loaded: dbmdz/bert-base-italian-xxl-cased


Model loaded directly
Model moved to cuda



Results:
  Accuracy:  0.9629
  Precision: 0.6630
  Recall:    0.6932
  F1 Score:  0.6778

Classification Report:
              precision    recall  f1-score   support

No Split (0)     0.9816    0.9790    0.9803      1475
   Split (1)     0.6630    0.6932    0.6778        88

    accuracy                         0.9629      1563
   macro avg     0.8223    0.8361    0.8290      1563
weighted avg     0.9637    0.9629    0.9633      1563


Confusion Matrix:
[[1444   31]
 [  27   61]]
Classification report saved: inference_output/bert-italian-base_classification_report.txt

All models evaluated.


## 11. Display Results Summary

In [11]:
if results:
    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values('f1', ascending=False)
    
    print("="*80)
    print("RESULTS SUMMARY")
    print("="*80)
    print()
    print(results_df[['model_name', 'model_type', 'accuracy', 'precision', 'recall', 'f1']].to_string(index=False))
    print()
    
    # Save summary CSV
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    summary_file = os.path.join(OUTPUT_DIR, f"evaluation_summary_{timestamp}.csv")
    results_df.to_csv(summary_file, index=False)
    print(f"Summary saved to: {summary_file}")
    
    # Display best model
    print("\nBest Performing Model:")
    best_model = results_df.iloc[0]
    print(f"  Model: {best_model['model_name']}")
    print(f"  F1 Score: {best_model['f1']:.4f}")
    print(f"  Accuracy: {best_model['accuracy']:.4f}")
    print(f"  Precision: {best_model['precision']:.4f}")
    print(f"  Recall: {best_model['recall']:.4f}")
    print(f"  Link: {best_model['huggingface_link']}")
else:
    print("No results generated. Please check the errors above.")

RESULTS SUMMARY

       model_name model_type  accuracy  precision   recall       f1
  xlm-roberta-crf        CRF  0.982095   0.769912 0.988636 0.865672
bert-italian-base       Base  0.962892   0.663043 0.693182 0.677778
 bert-italian-crf        CRF  0.955214   0.593750 0.647727 0.619565

Summary saved to: inference_output/evaluation_summary_20251214_154526.csv

Best Performing Model:
  Model: xlm-roberta-crf
  F1 Score: 0.8657
  Accuracy: 0.9821
  Precision: 0.7699
  Recall: 0.9886
  Link: https://huggingface.co/ArchitRastogi/xlm-roberta-base-italian-sentence-splitter-CRF


## 12. Generate Prediction Output Files

Creates CSV files with predictions for each model in the same format as the input.

In [13]:
print("Generating prediction output files...\n")

for model_name, predictions in model_predictions.items():
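    # `predictions` is a flat array over all sentences, in input order.
    # It can hold more entries than there are input tokens (overlapping stride
    # chunks, or a tokenizer that splits one input token into several words),
    # so it is re-sliced below to one prediction per original token.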
    
    # Reconstruct predictions to match original token count
    pred_idx = 0
    aligned_predictions = []
    
    for sent_idx, (sent, labs) in enumerate(zip(sentences, labels)):
        sent_length = len(sent)
        # Take only the number of predictions matching this sentence length
        if pred_idx + sent_length <= len(predictions):
            aligned_predictions.extend(predictions[pred_idx:pred_idx + sent_length])
            pred_idx += sent_length
        else:
            # Handle edge case: not enough predictions
            aligned_predictions.extend(predictions[pred_idx:])
            break
    
    # Ensure we have exactly the right number of predictions
    if len(aligned_predictions) != len(raw_df):
        print(f"Warning: {model_name} prediction length mismatch. Expected {len(raw_df)}, got {len(aligned_predictions)}")
        # Pad or truncate to match
        if len(aligned_predictions) < len(raw_df):
            aligned_predictions.extend([0] * (len(raw_df) - len(aligned_predictions)))
        else:
            aligned_predictions = aligned_predictions[:len(raw_df)]
    
    # Create output dataframe
    output_df = raw_df.copy()
    output_df['prediction'] = aligned_predictions
    
    # Save with predictions
    output_file = os.path.join(OUTPUT_DIR, f"{model_name}_predictions.csv")
    output_df[['token', 'prediction']].to_csv(output_file, sep=';', index=False)
    print(f"Saved: {output_file}")
    
    # Also save with original labels for comparison
    comparison_file = os.path.join(OUTPUT_DIR, f"{model_name}_comparison.csv")
    output_df[['token', 'label', 'prediction']].to_csv(comparison_file, sep=';', index=False)
    print(f"Saved: {comparison_file}")

print("\nAll prediction files generated successfully.")

Generating prediction output files...

Saved: inference_output/bert-italian-crf_predictions.csv
Saved: inference_output/bert-italian-crf_comparison.csv
Saved: inference_output/xlm-roberta-crf_predictions.csv
Saved: inference_output/xlm-roberta-crf_comparison.csv
Saved: inference_output/bert-italian-base_predictions.csv
Saved: inference_output/bert-italian-base_comparison.csv

All prediction files generated successfully.
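
Slicing one flat prediction array by sentence length is exact only when every sentence yields exactly one prediction per token. A more defensive variant, sketched here, keeps predictions grouped per sentence so that a length mismatch in one sentence cannot shift the following ones. It assumes the inference step were changed to return one list per sentence, which the code above does not do; `align_per_sentence` is a hypothetical helper.

```python
def align_per_sentence(pred_groups, sentence_lengths, pad_label=0):
    """Flatten per-sentence predictions to exactly one label per input token.

    Each group is truncated or padded to its sentence's length, so an
    error stays local to the sentence where it occurs.
    """
    aligned, mismatched = [], []
    for idx, (preds, length) in enumerate(zip(pred_groups, sentence_lengths)):
        preds = list(preds)
        if len(preds) != length:
            mismatched.append(idx)
        fixed = preds[:length] + [pad_label] * max(0, length - len(preds))
        aligned.extend(fixed)
    return aligned, mismatched


# Hypothetical input: the second group has one extra prediction
pred_groups = [[0, 0, 1], [0, 0, 0, 1], [0, 1]]
sentence_lengths = [3, 3, 2]
aligned, mismatched = align_per_sentence(pred_groups, sentence_lengths)
print(aligned)      # one label per token
print(mismatched)   # indices of sentences whose counts disagreed
```

Returning the list of mismatched sentence indices makes it possible to inspect the affected sentences instead of silently padding or truncating.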


## 13. Summary

The inference pipeline has completed. The following outputs are available in the `inference_output` directory:

1. **Evaluation Summary**: CSV file with metrics for all models
2. **Classification Reports**: Individual text files with detailed metrics per model
3. **Prediction Files**: CSV files with predicted labels for each model
4. **Comparison Files**: CSV files with original labels and predictions side-by-side

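The models are ranked by F1 score. On the default OOD test set, XLM-RoBERTa-CRF ranks first (F1 0.8657), followed by BERT-Italian-Base (0.6778) and BERT-Italian-CRF (0.6196). The two BERT runs were scored on 1563 positions and the XLM-RoBERTa run on 1508, the number of input tokens, so the scores are not computed over identical positions.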