# Italian Sentence Boundary Detection - Decoder Inference Pipeline

This notebook implements an inference pipeline for sentence boundary detection in Italian text, using decoder-only LLMs with several prompting strategies.

## Models

The following local models are configured; the run recorded in this notebook covers only the two Llama-3.2 models:

1. **Llama-3.2-1B**: meta-llama/Llama-3.2-1B-Instruct
2. **Llama-3.2-3B**: meta-llama/Llama-3.2-3B-Instruct
3. **Llama-3.1-8B**: meta-llama/Llama-3.1-8B-Instruct
4. **Qwen3-8B**: Qwen/Qwen3-8B

## Strategies

Seven prompting strategies are implemented:

1. **Sliding Window**: Binary YES/NO classification for each punctuation mark
2. **Next-Token Probability**: Sum the next-token probability of common sentence starters after each punctuation mark
3. **Marker Insertion**: Insert `<EOS>` markers and align them back to tokens
4. **Structured JSON**: Output boundary indices as JSON
5. **Few-Shot Hard**: Few-shot learning with edge case examples
6. **Chain-of-Thought**: Step-by-step reasoning for each punctuation
7. **Iterative Refinement**: Two-pass prediction with verification

## Task

Binary token classification:
- Label 0: Token does not end a sentence
- Label 1: Token ends a sentence
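
To make the labelling scheme concrete, here is a minimal sketch of how a label sequence maps back to sentences. The helper `split_sentences` is hypothetical (it is not part of the pipeline below); the sample tokens are taken from the few-shot examples used later in this notebook.

```python
from typing import List

def split_sentences(tokens: List[str], labels: List[int]) -> List[List[str]]:
    """Group tokens into sentences; a token labelled 1 closes the current sentence."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            sentences.append(current)
            current = []
    if current:  # trailing tokens that were never closed by a boundary
        sentences.append(current)
    return sentences

tokens = ["in", "nuovi", "seni", ".", "La", "costiera"]
labels = [0, 0, 0, 1, 0, 0]
sentences = split_sentences(tokens, labels)
print(sentences)
```

Only the token that closes a sentence carries label 1, so the positive class is rare by construction.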

---

## 1. Install Required Libraries

In [1]:
!pip install -q torch transformers accelerate bitsandbytes safetensors
!pip install -q pandas numpy scikit-learn tqdm huggingface-hub hf-transfer


## 2. Configuration and Setup

In [None]:
# Configuration Cell
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Model configurations
MODELS = {
    "llama-3.2-1b": {
        "model_path": "meta-llama/Llama-3.2-1B-Instruct",
        "model_type": "decoder"
    },
    "llama-3.2-3b": {
        "model_path": "meta-llama/Llama-3.2-3B-Instruct",
        "model_type": "decoder"
    },
    "llama-3.1-8b": {
        "model_path": "meta-llama/Llama-3.1-8B-Instruct",
        "model_type": "decoder"
    },
    "qwen3-8b": {
        "model_path": "Qwen/Qwen3-8B",
        "model_type": "decoder"
    },
}

# In this notebook we showcase only the two smallest models;
# uncomment the block below to restrict MODELS to them.
# MODELS = {
#     "llama-3.2-1b": {
#         "model_path": "meta-llama/Llama-3.2-1B-Instruct",
#         "model_type": "decoder"
#     },
#     "llama-3.2-3b": {
#         "model_path": "meta-llama/Llama-3.2-3B-Instruct",
#         "model_type": "decoder"
#     },
# }

# Strategy configurations
STRATEGIES = {
    1: "sliding_window",
    2: "next_token_prob",
    3: "marker_insertion",
    4: "structured_json",
    5: "few_shot_hard",
    6: "chain_of_thought",
    7: "iterative_refinement",
}

STRATEGY_NAMES = {
    1: "Sliding Window",
    2: "Next-Token Prob",
    3: "Marker Insertion",
    4: "Structured JSON",
    5: "Few-Shot Hard",
    6: "Chain-of-Thought",
    7: "Iterative Refinement",
}

# Inference parameters
CHUNK_SIZE = 150
BATCH_SIZE = 8
MAX_NEW_TOKENS = 300
CACHE_DIR = "cache"
OUTPUT_DIR = "decoder_inference_output"

# Input file configuration
CUSTOM_INPUT_FILE = None
DEFAULT_INPUT_FILE = "OOD_test.csv"

# Team name
GROUP_NAME = "exACSAI"

# Hugging Face access token (required for the gated Llama checkpoints)
HF_TOKEN = ""  # Add your HF access token here

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(CACHE_DIR, exist_ok=True)

print("Configuration loaded successfully.")
print(f"Number of models to evaluate: {len(MODELS)}")
print(f"Number of strategies: {len(STRATEGIES)}")
print(f"Output directory: {OUTPUT_DIR}")

Configuration loaded successfully.
Number of models to evaluate: 2
Number of strategies: 7
Output directory: decoder_inference_output


## 3. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import torch
import re
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import (
    precision_recall_fscore_support, 
    accuracy_score,
    classification_report,
    confusion_matrix
)
from tqdm.auto import tqdm
from datetime import datetime
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

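# Log in to the Hugging Face Hub; skip the call when no token is configured
from huggingface_hub import login, whoami
if HF_TOKEN:
    login(token=HF_TOKEN)
else:
    print("HF_TOKEN is empty; skipping Hub login (gated models may fail to download).")
# print(whoami())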
# Check device
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {DEVICE}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Using device: cuda
GPU: NVIDIA RTX A4000
Memory: 16.78 GB


## 4. Data Loading Functions

In [4]:
def load_test_data(filepath):
    """Load test data from CSV file.
    
    Args:
        filepath: Path to CSV file with token;label format
        
    Returns:
        tokens: List of tokens
        labels: List of labels
        df: Raw dataframe
    """
    print(f"Loading data from: {filepath}")
    
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"File not found: {filepath}")
    
    tokens = []
    labels = []
    
    with open(filepath, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    
    # Skip header lines
    start_idx = 0
    for i, line in enumerate(lines):
        line = line.strip().replace('\r', '')
        if line.lower().startswith('pinocchio') or line == 'token;label':
            start_idx = i + 1
            continue
        break
    
    for line in lines[start_idx:]:
        line = line.strip().replace('\r', '')
        if not line:
            continue
        
        # Handle quoted semicolons
        if line.startswith('";"'):
            tokens.append(';')
            parts = line.split(';')
            if len(parts) >= 3:
                try:
                    labels.append(int(parts[2]))
                except ValueError:
                    labels.append(0)
            else:
                labels.append(0)
            continue
        
        parts = line.split(';')
        if len(parts) >= 2:
            token = parts[0].strip('"')
            try:
                label = int(parts[1])
                tokens.append(token)
                labels.append(label)
            except ValueError:
                continue
    
    df = pd.DataFrame({'token': tokens, 'label': labels})
    
    print(f"Loaded {len(tokens)} tokens")
    print(f"Label distribution: 0={labels.count(0)}, 1={labels.count(1)}")
    
    return tokens, labels, df

print("Data loading functions defined.")

Data loading functions defined.
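
The loader above special-cases rows whose token is itself a semicolon (written as `";"` in the file). As an illustration of the same idea, the sketch below splits each row on its last semicolon instead. `parse_line` is a hypothetical helper, not the notebook's loader, and it assumes labels are always the single characters `0` or `1`.

```python
from typing import Optional, Tuple

def parse_line(line: str) -> Optional[Tuple[str, int]]:
    """Parse one `token;label` row; return None for blank or malformed rows."""
    line = line.strip()
    if not line:
        return None
    # Split on the LAST semicolon so that a token that is itself ';' survives.
    token, sep, label = line.rpartition(";")
    if not sep or label not in ("0", "1"):
        return None  # also rejects the 'token;label' header row
    # Drop the surrounding CSV quotes, e.g. '";"' becomes ';'.
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        token = token[1:-1]
    return token, int(label)

rows = [parse_line(l) for l in ["casa;0", '";";0', ".;1", "", "token;label"]]
print(rows)
```

Unlike `load_test_data`, this sketch does not skip a title line at the top of the file, so it is a starting point rather than a drop-in replacement.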


## 5. Model Loading Class

In [5]:
class DecoderModelInference:
    """Class for running inference with decoder LLM models."""
    
    def __init__(self, model_path: str, use_quantization: bool = False):
        self.model_path = model_path
        
        print(f"Loading model: {model_path}")
        
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path,
            trust_remote_code=True,
            padding_side="left"
        )
        
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
        
        # Load model
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            trust_remote_code=True,
            torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
            device_map="auto",
        )
        
        self.model.eval()
        print(f"Model loaded successfully")
    
    @torch.inference_mode()
    def generate_batch(self, prompts: List[str], max_new_tokens: int = 256, batch_size: int = 8) -> List[str]:
        """Batch generation for efficiency."""
        all_responses = []
        
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i + batch_size]
            
            # Apply chat template
            formatted_prompts = []
            for prompt in batch_prompts:
                messages = [{"role": "user", "content": prompt}]
                try:
                    formatted = self.tokenizer.apply_chat_template(
                        messages,
                        tokenize=False,
                        add_generation_prompt=True
                    )
                    formatted_prompts.append(formatted)
                except Exception:
                    formatted_prompts.append(prompt)
            
            # Tokenize batch
            inputs = self.tokenizer(
                formatted_prompts,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=2048
            ).to(self.model.device)
            
            # Generate
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.1,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.pad_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
                use_cache=True,
            )
            
            # Decode responses
            for j, output in enumerate(outputs):
                input_len = inputs['input_ids'][j].shape[0]
                generated = output[input_len:]
                response = self.tokenizer.decode(generated, skip_special_tokens=True)
                all_responses.append(response.strip())
        
        return all_responses
    
    @torch.inference_mode()
    def get_next_token_probs(self, text: str, target_tokens: List[str]) -> Dict[str, float]:
        """Get probability of specific next tokens."""
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        
        outputs = self.model(**inputs)
        logits = outputs.logits[0, -1, :]
        probs = torch.softmax(logits, dim=-1)
        
        result = {}
        for token in target_tokens:
            token_ids = self.tokenizer.encode(token, add_special_tokens=False)
            if token_ids:
                first_id = token_ids[0]
                if first_id < len(probs):
                    result[token] = probs[first_id].item()
        
        return result
    
    def cleanup(self):
        """Free GPU memory."""
        del self.model
        del self.tokenizer
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()

print("Model class defined.")

Model class defined.
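
`get_next_token_probs` applies a softmax to the logits at the last position and reads off the probability of each candidate token. The sketch below reproduces that arithmetic in plain Python on a made-up four-word vocabulary, so it runs without a model. The logit values and the helper names `softmax` and `starter_mass` are illustrative assumptions; the 0.15 threshold is the default used by Strategy 2.

```python
import math
from typing import Dict, Iterable

def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def starter_mass(logits: Dict[str, float], starters: Iterable[str]) -> float:
    """Total probability assigned to the given sentence-starter tokens."""
    probs = softmax(logits)
    return sum(probs.get(tok, 0.0) for tok in starters)

# Made-up logits for the position right after a full stop (illustrative only).
toy_logits = {"Il": 2.0, "La": 1.5, "che": 0.5, ",": 0.0}
mass = starter_mass(toy_logits, ["Il", "La"])
is_boundary = mass > 0.15  # same decision rule as Strategy 2's default threshold
print(f"starter mass = {mass:.3f}, boundary = {is_boundary}")
```

With a real tokenizer, only the first sub-token of each starter is scored, and whether a starter is encoded with or without a leading space depends on the vocabulary, which may affect how much probability mass this method captures.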


## 6. Prompting Strategies

In [6]:
# Common Italian sentence starters for Strategy 2
SENTENCE_STARTERS = [
    "Il", "La", "Lo", "Le", "Li", "I", "Gli",
    "Un", "Una", "Uno",
    "E", "Ma", "Se", "Non", "Per", "Con", "Da", "Di", "A", "In",
    "Quel", "Quella", "Questo", "Questa", "Chi", "Che", "Come", "Quando", "Dove",
    "Era", "Fu", "Si", "Ne", "Ci",
]

# Few-shot examples for Strategy 5
FEW_SHOT_EXAMPLES = """
### Example 1: Abbreviation (NOT a boundary) ###
Tokens: S. Maria della Stella era
Labels: 0,0,0,0,0,0

### Example 2: True sentence boundary ###
Tokens: in nuovi seni . La costiera
Labels: 0,0,0,1,0,0

### Example 3: Colon before speech (NOT a boundary) ###
Tokens: disse : - Non mi
Labels: 0,0,0,0,0

### Example 4: Semicolon in literary text (NOT a boundary) ###
Tokens: era schermito ; pero con
Labels: 0,0,0,0,0

### Example 5: Question mark (IS a boundary) ###
Tokens: Al sagrestano gli crede ? - Perche
Labels: 0,0,0,0,1,0,0
"""

def get_punctuation_indices(tokens: List[str]) -> List[int]:
    """Find indices of punctuation marks."""
    punctuation = {'.', '!', '?', ';', ':', '...', '\u2026'}  # '\u2026' is the single-character ellipsis
    indices = []
    for i, token in enumerate(tokens):
        if token in punctuation:
            indices.append(i)
    return indices

def parse_model_output(output: str, expected_length: int) -> List[int]:
    """Parse model output to list of predictions."""
    predictions = [0] * expected_length
    output = output.strip()
    
    # Look for comma-separated values
    lines = output.split('\n')
    candidates = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        clean = line.replace(' ', '').replace(',', '').replace('0', '').replace('1', '')
        if len(clean) < len(line) * 0.3:
            candidates.append(line)
    
    if candidates:
        output = candidates[-1]
    
    output = output.replace(' ', '')
    parts = output.split(',')
    
    labels = []
    for p in parts:
        p = p.strip()
        if p in ['0', '1']:
            labels.append(int(p))
        elif p.startswith('0') or p.startswith('1'):
            labels.append(int(p[0]))
    
    if len(labels) < expected_length:
        labels.extend([0] * (expected_length - len(labels)))
    elif len(labels) > expected_length:
        labels = labels[:expected_length]
    
    return labels

print("Strategy utilities defined.")

Strategy utilities defined.
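
`parse_model_output` has to cope with responses that wrap the label string in extra prose and with label strings that are too short or too long. The sketch below shows one stricter way to do the same job with a regular expression. `extract_labels` is a hypothetical alternative, not the function used in the reported runs.

```python
import re
from typing import List

LABEL_LINE = re.compile(r"[01](\s*,\s*[01])*,?")

def extract_labels(output: str, expected_length: int) -> List[int]:
    """Use the last line that is purely comma-separated 0/1 values, fitted to length."""
    candidates = [
        line.strip()
        for line in output.splitlines()
        if LABEL_LINE.fullmatch(line.strip())
    ]
    labels = [int(ch) for ch in re.findall(r"[01]", candidates[-1])] if candidates else []
    labels = labels[:expected_length]                      # truncate surplus labels
    return labels + [0] * (expected_length - len(labels))  # pad missing labels with 0

print(extract_labels("Here are the labels:\n0,0,1,0", 6))
print(extract_labels("0, 1, 0, 0, 1", 3))
print(extract_labels("I cannot do that.", 2))
```

Padding with zeros keeps the pipeline running, but it also hides how often the model returned the wrong number of labels; counting those cases separately may help when interpreting low recall.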


## 7. Strategy Implementations

In [7]:
def strategy_1_sliding_window(model: DecoderModelInference, tokens: List[str], batch_size: int = 8) -> List[int]:
    """Strategy 1: Sliding Window Binary Classification."""
    predictions = [0] * len(tokens)
    punct_indices = get_punctuation_indices(tokens)
    
    if not punct_indices:
        return predictions
    
    prompts = []
    for idx in punct_indices:
        start = max(0, idx - 15)
        end = min(len(tokens), idx + 16)
        context = " ".join(tokens[start:end])
        
        prompt = f"""Analyze Italian text for sentence boundaries.

Context: "{context}"
Token: [{tokens[idx]}]

Does this punctuation mark end a sentence? Answer ONLY "YES" or "NO":"""
        prompts.append(prompt)
    
    responses = []
    for i in tqdm(range(0, len(prompts), batch_size), desc="Strategy 1"):
        batch = prompts[i:i + batch_size]
        batch_responses = model.generate_batch(batch, max_new_tokens=10, batch_size=len(batch))
        responses.extend(batch_responses)
    
    for idx, response in zip(punct_indices, responses):
        response = response.strip().upper()
        if "YES" in response:
            predictions[idx] = 1
    
    return predictions


def strategy_2_next_token_prob(
    model: DecoderModelInference,
    tokens: List[str],
    threshold: float = 0.15,
    batch_size: int = None,   # unused; accepted so all strategies share one call signature
) -> List[int]:
    """Strategy 2: Next-Token Probability Analysis."""
    predictions = [0] * len(tokens)
    punct_indices = get_punctuation_indices(tokens)

    if not punct_indices:
        return predictions

    for idx in tqdm(punct_indices, desc="Strategy 2"):
        start = max(0, idx - 50)
        context = " ".join(tokens[start:idx + 1])

        probs = model.get_next_token_probs(context, SENTENCE_STARTERS)
        total_prob = sum(probs.values())

        if total_prob > threshold:
            predictions[idx] = 1

    return predictions


def strategy_3_marker_insertion(model: DecoderModelInference, tokens: List[str], chunk_size: int = 200, batch_size: int = 8) -> List[int]:
    """Strategy 3: Marker Insertion."""
    predictions = [0] * len(tokens)
    EOS_MARKER = "<EOS>"
    
    prompts = []
    chunk_ranges = []
    
    for i in range(0, len(tokens), chunk_size):
        end = min(i + chunk_size, len(tokens))
        chunk_tokens = tokens[i:end]
        text = " ".join(chunk_tokens)
        
        prompt = f"""Insert {EOS_MARKER} marker after each sentence-ending token.

Input: {text}

Output (same text with {EOS_MARKER} markers):"""
        prompts.append(prompt)
        chunk_ranges.append((i, end))
    
    responses = []
    for i in tqdm(range(0, len(prompts), batch_size), desc="Strategy 3"):
        batch = prompts[i:i + batch_size]
        batch_responses = model.generate_batch(batch, max_new_tokens=500, batch_size=len(batch))
        responses.extend(batch_responses)
    
    for (start, end), response in zip(chunk_ranges, responses):
        chunk_tokens = tokens[start:end]
        for i, token in enumerate(chunk_tokens):
            patterns = [f"{token} {EOS_MARKER}", f"{token}{EOS_MARKER}", 
                       f"{token} {EOS_MARKER.lower()}", f"{token}{EOS_MARKER.lower()}"]
            for pattern in patterns:
                if pattern in response:
                    predictions[start + i] = 1
                    break
    
    return predictions


def strategy_4_structured_json(model: DecoderModelInference, tokens: List[str], chunk_size: int = 150, batch_size: int = 8) -> List[int]:
    """Strategy 4: Structured JSON Output."""
    predictions = [0] * len(tokens)
    
    prompts = []
    chunk_ranges = []
    
    for i in range(0, len(tokens), chunk_size):
        end = min(i + chunk_size, len(tokens))
        chunk_tokens = tokens[i:end]
        numbered = [f"{j}:{t}" for j, t in enumerate(chunk_tokens)]
        token_str = " ".join(numbered)
        
        prompt = f"""Identify sentence boundaries.

Tokens: {token_str}

Output JSON with boundary indices: {{"boundaries": [...]}}:"""
        prompts.append(prompt)
        chunk_ranges.append((i, end))
    
    responses = []
    for i in tqdm(range(0, len(prompts), batch_size), desc="Strategy 4"):
        batch = prompts[i:i + batch_size]
        batch_responses = model.generate_batch(batch, max_new_tokens=400, batch_size=len(batch))
        responses.extend(batch_responses)
    
    for (start, end), response in zip(chunk_ranges, responses):
        chunk_len = end - start
        
        # Try to parse JSON
        json_match = re.search(r'\{[^}]+\}', response)
        if json_match:
            try:
                data = json.loads(json_match.group())
                if "boundaries" in data:
                    for idx in data["boundaries"]:
                        if isinstance(idx, int) and 0 <= idx < chunk_len:
                            predictions[start + idx] = 1
                    continue
            except json.JSONDecodeError:
                pass
        
        # Fallback: find numbers
        numbers = re.findall(r'\b(\d+)\b', response)
        for num_str in numbers:
            idx = int(num_str)
            if 0 <= idx < chunk_len:
                predictions[start + idx] = 1
    
    return predictions


def strategy_5_few_shot_hard(model: DecoderModelInference, tokens: List[str], chunk_size: int = 100, batch_size: int = 8) -> List[int]:
    """Strategy 5: Few-Shot Learning with Hard Examples."""
    predictions = [0] * len(tokens)
    
    prompts = []
    chunk_ranges = []
    
    for i in range(0, len(tokens), chunk_size):
        end = min(i + chunk_size, len(tokens))
        chunk_tokens = tokens[i:end]
        token_str = " ".join(chunk_tokens)
        
        prompt = f"""Italian sentence boundary detection.

Learn from examples:
{FEW_SHOT_EXAMPLES}

Tokens: {token_str}

Output comma-separated 0s and 1s ({len(chunk_tokens)} values):"""
        prompts.append(prompt)
        chunk_ranges.append((i, end))
    
    responses = []
    for i in tqdm(range(0, len(prompts), batch_size), desc="Strategy 5"):
        batch = prompts[i:i + batch_size]
        batch_responses = model.generate_batch(batch, max_new_tokens=300, batch_size=len(batch))
        responses.extend(batch_responses)
    
    for (start, end), response in zip(chunk_ranges, responses):
        chunk_len = end - start
        chunk_preds = parse_model_output(response, chunk_len)
        predictions[start:end] = chunk_preds
    
    return predictions


def strategy_6_chain_of_thought(model: DecoderModelInference, tokens: List[str], batch_size: int = 8) -> List[int]:
    """Strategy 6: Chain-of-Thought Reasoning."""
    predictions = [0] * len(tokens)
    punct_indices = get_punctuation_indices(tokens)
    
    if not punct_indices:
        return predictions
    
    prompts = []
    for idx in punct_indices:
        start = max(0, idx - 20)
        end = min(len(tokens), idx + 21)
        
        before = " ".join(tokens[start:idx])
        punct = tokens[idx]
        after = " ".join(tokens[idx + 1:end]) if idx + 1 < end else ""
        next_word = tokens[idx + 1] if idx + 1 < len(tokens) else "END"
        
        prompt = f"""Analyze if this punctuation ends a sentence.

Before: "{before}"
Punctuation: [{punct}]
After: "{after}"
Next word: "{next_word}"

Think step by step, then answer FINAL: YES or FINAL: NO"""
        prompts.append(prompt)
    
    responses = []
    for i in tqdm(range(0, len(prompts), batch_size), desc="Strategy 6"):
        batch = prompts[i:i + batch_size]
        batch_responses = model.generate_batch(batch, max_new_tokens=200, batch_size=len(batch))
        responses.extend(batch_responses)
    
    for idx, response in zip(punct_indices, responses):
        response = response.upper()
        final_match = re.search(r'FINAL:\s*(YES|NO)', response)
        if final_match:
            if final_match.group(1) == "YES":
                predictions[idx] = 1
        elif response.strip().endswith("YES"):
            predictions[idx] = 1
    
    return predictions


def strategy_7_iterative_refinement(model: DecoderModelInference, tokens: List[str], chunk_size: int = 100, batch_size: int = 8) -> List[int]:
    """Strategy 7: Iterative Refinement."""
    predictions = [0] * len(tokens)
    
    # Pass 1: Initial predictions
    prompts = []
    chunk_ranges = []
    
    for i in range(0, len(tokens), chunk_size):
        end = min(i + chunk_size, len(tokens))
        chunk_tokens = tokens[i:end]
        token_str = " ".join(chunk_tokens)
        
        prompt = f"""Sentence boundary detection.

Tokens: {token_str}

Output comma-separated 0s and 1s ({len(chunk_tokens)} values):"""
        prompts.append(prompt)
        chunk_ranges.append((i, end))
    
    responses = []
    for i in tqdm(range(0, len(prompts), batch_size), desc="Strategy 7 Pass 1"):
        batch = prompts[i:i + batch_size]
        batch_responses = model.generate_batch(batch, max_new_tokens=250, batch_size=len(batch))
        responses.extend(batch_responses)
    
    for (start, end), response in zip(chunk_ranges, responses):
        chunk_len = end - start
        chunk_preds = parse_model_output(response, chunk_len)
        predictions[start:end] = chunk_preds
    
    # Pass 2: Verify punctuation predictions
    punct_indices = get_punctuation_indices(tokens)
    if punct_indices:
        verify_prompts = []
        verify_indices = []
        
        for idx in punct_indices[:50]:  # Verify only the first 50 punctuation tokens to bound runtime
            start = max(0, idx - 10)
            end = min(len(tokens), idx + 11)
            context = " ".join(tokens[start:end])
            pred = "BOUNDARY" if predictions[idx] == 1 else "NOT boundary"
            
            prompt = f"""Verify: Token '{tokens[idx]}' predicted as {pred}.
Context: "{context}"

Is this correct? Answer: CORRECT or INCORRECT (and correct label 0 or 1):"""
            verify_prompts.append(prompt)
            verify_indices.append(idx)
        
        verify_responses = []
        for i in tqdm(range(0, len(verify_prompts), batch_size), desc="Strategy 7 Pass 2"):
            batch = verify_prompts[i:i + batch_size]
            batch_responses = model.generate_batch(batch, max_new_tokens=50, batch_size=len(batch))
            verify_responses.extend(batch_responses)
        
        for idx, response in zip(verify_indices, verify_responses):
            response = response.upper()
            if "INCORRECT" in response:
                if ":1" in response or "= 1" in response or "LABEL 1" in response:
                    predictions[idx] = 1
                elif ":0" in response or "= 0" in response or "LABEL 0" in response:
                    predictions[idx] = 0
    
    return predictions

# Map strategy IDs to functions
STRATEGY_FUNCTIONS = {
    1: strategy_1_sliding_window,
    2: strategy_2_next_token_prob,
    3: strategy_3_marker_insertion,
    4: strategy_4_structured_json,
    5: strategy_5_few_shot_hard,
    6: strategy_6_chain_of_thought,
    7: strategy_7_iterative_refinement,
}

print("All strategy functions defined.")

All strategy functions defined.
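
Strategy 3 maps markers back to tokens by substring search, so a single `. <EOS>` in the response marks every `.` in the chunk. As a possible alternative, the sketch below walks the response in order and assigns each marker to the most recently matched input token. `align_markers` and its `lookahead` parameter are hypothetical, and the sketch assumes the model echoes the tokens separated by spaces.

```python
from typing import List

EOS_MARKER = "<EOS>"

def align_markers(tokens: List[str], response: str, lookahead: int = 5) -> List[int]:
    """Map <EOS> markers in a response back onto the input tokens, in order."""
    predictions = [0] * len(tokens)
    # Put spaces around markers so ".<EOS>" splits the same way as ". <EOS>".
    words = response.replace(EOS_MARKER, f" {EOS_MARKER} ").split()
    pos = -1  # index of the last input token matched so far
    for word in words:
        if word == EOS_MARKER:
            if pos >= 0:
                predictions[pos] = 1
            continue
        try:
            # Search only a short window ahead; words the model invented are skipped.
            pos = tokens.index(word, pos + 1, pos + 1 + lookahead)
        except ValueError:
            pass
    return predictions

tokens = ["Era", "tardi", ".", "Il", "sole", "calava", "."]
print(align_markers(tokens, "Era tardi . <EOS> Il sole calava .<EOS>"))
print(align_markers(tokens, "Era tardi . <EOS> Il sole calava ."))
```

In the second call only the first full stop is marked, whereas substring matching on `. <EOS>` would have flagged both.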


## 8. Evaluation Functions

In [8]:
def compute_metrics(predictions, labels):
    """Compute evaluation metrics."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='binary', zero_division=0
    )
    acc = accuracy_score(labels, predictions)
    
    class_report = classification_report(
        labels, predictions, 
        target_names=['No Split (0)', 'Split (1)'],
        digits=4,
        zero_division=0
    )
    
    conf_matrix = confusion_matrix(labels, predictions)
    
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'classification_report': class_report,
        'confusion_matrix': conf_matrix
    }

print("Evaluation functions defined.")

Evaluation functions defined.
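
Because boundaries are rare, accuracy alone says little about a strategy. The sketch below computes the four metrics from raw confusion counts for a predictor that never predicts a boundary, using the label distribution reported for `OOD_test.csv` later in this notebook (1426 zeros, 96 ones). The second set of counts is not printed anywhere in the notebook; it is inferred here as the counts consistent with the Marker Insertion scores reported for llama-3.2-3b.

```python
from typing import Tuple

def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float, float]:
    """Return (accuracy, precision, recall, f1) for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Predictor that labels every token 0: all 96 boundaries are missed.
baseline = binary_metrics(tp=0, fp=0, fn=96, tn=1426)

# Counts inferred from the reported llama-3.2-3b Marker Insertion scores (assumption).
marker = binary_metrics(tp=31, fp=13, fn=65, tn=1413)

for name, m in [("all-zero baseline", baseline), ("marker insertion (inferred)", marker)]:
    print(f"{name}: acc={m[0]:.4f} prec={m[1]:.4f} rec={m[2]:.4f} f1={m[3]:.4f}")
```

The all-zero baseline already reaches an accuracy of about 0.937, so any result at that accuracy with zero precision and recall corresponds to a strategy that predicted no boundaries at all.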


## 9. Load Test Data

In [9]:
# Determine input file
input_file = CUSTOM_INPUT_FILE if CUSTOM_INPUT_FILE else DEFAULT_INPUT_FILE

if not os.path.exists(input_file):
    print(f"Input file not found: {input_file}")
    print("Please ensure the OOD_test.csv file is in the current directory.")
else:
    tokens, labels, raw_df = load_test_data(input_file)
    print(f"\nData loaded successfully.")
    print(f"Total tokens: {len(tokens)}")
    print(f"Sentence boundaries: {sum(labels)}")

Loading data from: OOD_test.csv
Loaded 1522 tokens
Label distribution: 0=1426, 1=96

Data loaded successfully.
Total tokens: 1522
Sentence boundaries: 96


## 10. Run Inference for All Models and Strategies

In [10]:
# Store results
results = []
all_predictions = {}  # {(model_name, strategy_id): predictions}

for model_name, model_config in MODELS.items():
    print(f"\n{'='*80}")
    print(f"Loading model: {model_name}")
    print(f"{'='*80}")
    
    try:
        model = DecoderModelInference(model_config['model_path'])
        
        for strategy_id, strategy_name in STRATEGIES.items():
            print(f"\n--- Running Strategy {strategy_id}: {STRATEGY_NAMES[strategy_id]} ---")
            
            try:
                strategy_func = STRATEGY_FUNCTIONS[strategy_id]
                predictions = strategy_func(model, tokens, batch_size=BATCH_SIZE)
                
                # Compute metrics
                metrics = compute_metrics(predictions, labels)
                
                print(f"  Accuracy:  {metrics['accuracy']:.4f}")
                print(f"  Precision: {metrics['precision']:.4f}")
                print(f"  Recall:    {metrics['recall']:.4f}")
                print(f"  F1 Score:  {metrics['f1']:.4f}")
                
                # Store results
                results.append({
                    'model_name': model_name,
                    'model_type': 'decoder',
                    'strategy_id': strategy_id,
                    'strategy_name': STRATEGY_NAMES[strategy_id],
                    'accuracy': metrics['accuracy'],
                    'precision': metrics['precision'],
                    'recall': metrics['recall'],
                    'f1': metrics['f1'],
                    'huggingface_link': f"https://huggingface.co/{model_config['model_path']}"
                })
                
                all_predictions[(model_name, strategy_id)] = predictions
                
            except Exception as e:
                print(f"  Error in strategy {strategy_id}: {e}")
                import traceback
                traceback.print_exc()
        
        # Cleanup model
        model.cleanup()
        
    except Exception as e:
        print(f"Error loading model {model_name}: {e}")
        import traceback
        traceback.print_exc()

print("\n" + "="*80)
print("All models and strategies evaluated.")
print("="*80)


Loading model: llama-3.2-1b
Loading model: meta-llama/Llama-3.2-1B-Instruct


`torch_dtype` is deprecated! Use `dtype` instead!


Model loaded successfully

--- Running Strategy 1: Sliding Window ---


Strategy 1:   0%|          | 0/15 [00:00<?, ?it/s]

  Accuracy:  0.9396
  Precision: 0.6000
  Recall:    0.1250
  F1 Score:  0.2069

--- Running Strategy 2: Next-Token Prob ---


Strategy 2:   0%|          | 0/119 [00:00<?, ?it/s]

  Accuracy:  0.9369
  Precision: 0.0000
  Recall:    0.0000
  F1 Score:  0.0000

--- Running Strategy 3: Marker Insertion ---


Strategy 3:   0%|          | 0/1 [00:00<?, ?it/s]

  Accuracy:  0.9021
  Precision: 0.1807
  Recall:    0.1562
  F1 Score:  0.1676

--- Running Strategy 4: Structured JSON ---


Strategy 4:   0%|          | 0/2 [00:00<?, ?it/s]

  Accuracy:  0.6367
  Precision: 0.0792
  Recall:    0.4479
  F1 Score:  0.1346

--- Running Strategy 5: Few-Shot Hard ---


Strategy 5:   0%|          | 0/2 [00:00<?, ?it/s]

  Accuracy:  0.9369
  Precision: 0.0000
  Recall:    0.0000
  F1 Score:  0.0000

--- Running Strategy 6: Chain-of-Thought ---


Strategy 6:   0%|          | 0/15 [00:00<?, ?it/s]

  Accuracy:  0.9435
  Precision: 0.7500
  Recall:    0.1562
  F1 Score:  0.2586

--- Running Strategy 7: Iterative Refinement ---


Strategy 7 Pass 1:   0%|          | 0/2 [00:00<?, ?it/s]

Strategy 7 Pass 2:   0%|          | 0/7 [00:00<?, ?it/s]

  Accuracy:  0.8955
  Precision: 0.0000
  Recall:    0.0000
  F1 Score:  0.0000

Loading model: llama-3.2-3b
Loading model: meta-llama/Llama-3.2-3B-Instruct


Model loaded successfully

--- Running Strategy 1: Sliding Window ---


Strategy 1:   0%|          | 0/15 [00:00<?, ?it/s]

  Accuracy:  0.9369
  Precision: 0.0000
  Recall:    0.0000
  F1 Score:  0.0000

--- Running Strategy 2: Next-Token Prob ---


Strategy 2:   0%|          | 0/119 [00:00<?, ?it/s]

  Accuracy:  0.9369
  Precision: 0.0000
  Recall:    0.0000
  F1 Score:  0.0000

--- Running Strategy 3: Marker Insertion ---


Strategy 3:   0%|          | 0/1 [00:00<?, ?it/s]

  Accuracy:  0.9488
  Precision: 0.7045
  Recall:    0.3229
  F1 Score:  0.4429

--- Running Strategy 4: Structured JSON ---


Strategy 4:   0%|          | 0/2 [00:00<?, ?it/s]

  Accuracy:  0.6846
  Precision: 0.0556
  Recall:    0.2500
  F1 Score:  0.0909

--- Running Strategy 5: Few-Shot Hard ---


Strategy 5:   0%|          | 0/2 [00:00<?, ?it/s]

  Accuracy:  0.9363
  Precision: 0.3333
  Recall:    0.0104
  F1 Score:  0.0202

--- Running Strategy 6: Chain-of-Thought ---


Strategy 6:   0%|          | 0/15 [00:00<?, ?it/s]

  Accuracy:  0.9396
  Precision: 0.8333
  Recall:    0.0521
  F1 Score:  0.0980

--- Running Strategy 7: Iterative Refinement ---


Strategy 7 Pass 1:   0%|          | 0/2 [00:00<?, ?it/s]

Strategy 7 Pass 2:   0%|          | 0/7 [00:00<?, ?it/s]

  Accuracy:  0.5953
  Precision: 0.0681
  Recall:    0.4271
  F1 Score:  0.1175

All models and strategies evaluated.
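
The progress-bar totals in the log above follow from the token count, the number of punctuation candidates, and each strategy's default `chunk_size`. The inference loop passes only `batch_size`, so the `CHUNK_SIZE = 150` setting from the configuration cell does not reach the strategies. The sketch below reproduces the totals; the helper names are hypothetical, and the punctuation count of 119 is read off the Strategy 2 progress bar.

```python
import math

N_TOKENS = 1522   # tokens in OOD_test.csv, from the load output
N_PUNCT = 119     # punctuation candidates, from the Strategy 2 progress bar
BATCH_SIZE = 8

def n_chunks(n_tokens: int, chunk_size: int) -> int:
    """Number of prompts when the text is cut into fixed-size token chunks."""
    return math.ceil(n_tokens / chunk_size)

def n_batches(n_prompts: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of generation calls needed for the given number of prompts."""
    return math.ceil(n_prompts / batch_size)

expected_totals = {
    "strategy_1": n_batches(N_PUNCT),                  # one prompt per punctuation token
    "strategy_3": n_batches(n_chunks(N_TOKENS, 200)),  # default chunk_size=200
    "strategy_4": n_batches(n_chunks(N_TOKENS, 150)),  # default chunk_size=150
    "strategy_5": n_batches(n_chunks(N_TOKENS, 100)),  # default chunk_size=100
    "strategy_7_pass_2": n_batches(min(N_PUNCT, 50)),  # verification capped at 50 tokens
}
print(expected_totals)
```

Strategy 3 therefore handles the whole test set in a single batch of eight 200-token prompts, while Strategy 7 re-checks only the first 50 of the 119 punctuation tokens in its second pass.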


## 11. Display Results Summary

In [11]:
if results:
    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values('f1', ascending=False)
    
    print("="*100)
    print("RESULTS SUMMARY")
    print("="*100)
    print()
    print(results_df[['model_name', 'strategy_name', 'accuracy', 'precision', 'recall', 'f1']].to_string(index=False))
    print()
    
    # Save summary CSV
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    summary_file = os.path.join(OUTPUT_DIR, f"decoder_evaluation_summary_{timestamp}.csv")
    results_df.to_csv(summary_file, index=False)
    print(f"Summary saved to: {summary_file}")
    
    # Display best result
    print("\nBest Performing Configuration:")
    best = results_df.iloc[0]
    print(f"  Model: {best['model_name']}")
    print(f"  Strategy: {best['strategy_name']}")
    print(f"  F1 Score: {best['f1']:.4f}")
    print(f"  Accuracy: {best['accuracy']:.4f}")
    print(f"  Precision: {best['precision']:.4f}")
    print(f"  Recall: {best['recall']:.4f}")
else:
    print("No results generated. Please check the errors above.")

RESULTS SUMMARY

  model_name        strategy_name  accuracy  precision   recall       f1
llama-3.2-3b     Marker Insertion  0.948752   0.704545 0.322917 0.442857
llama-3.2-1b     Chain-of-Thought  0.943495   0.750000 0.156250 0.258621
llama-3.2-1b       Sliding Window  0.939553   0.600000 0.125000 0.206897
llama-3.2-1b     Marker Insertion  0.902102   0.180723 0.156250 0.167598
llama-3.2-1b      Structured JSON  0.636662   0.079190 0.447917 0.134585
llama-3.2-3b Iterative Refinement  0.595269   0.068106 0.427083 0.117479
llama-3.2-3b     Chain-of-Thought  0.939553   0.833333 0.052083 0.098039
llama-3.2-3b      Structured JSON  0.684625   0.055556 0.250000 0.090909
llama-3.2-3b        Few-Shot Hard  0.936268   0.333333 0.010417 0.020202
llama-3.2-1b      Next-Token Prob  0.936925   0.000000 0.000000 0.000000
llama-3.2-1b        Few-Shot Hard  0.936925   0.000000 0.000000 0.000000
llama-3.2-1b Iterative Refinement  0.895532   0.000000 0.000000 0.000000
llama-3.2-3b       Sliding Window 

## 12. Generate Prediction Output Files

In [12]:
print("Generating prediction output files...\n")

prediction_files = []

for (model_name, strategy_id), predictions in all_predictions.items():
    strategy_name = STRATEGIES[strategy_id]
    
    # Ensure predictions match token count
    if len(predictions) != len(tokens):
        print(f"Warning: {model_name}-{strategy_name} length mismatch. Adjusting...")
        if len(predictions) < len(tokens):
            predictions = predictions + [0] * (len(tokens) - len(predictions))
        else:
            predictions = predictions[:len(tokens)]
    
    # Create output dataframe
    output_df = pd.DataFrame({
        'token': tokens,
        'label': predictions
    })
    
    # Save with naming convention: groupname-hw2_split-modelname-strategy.csv
    safe_model_name = model_name.replace('/', '_').replace('-', '_')
    output_file = os.path.join(OUTPUT_DIR, f"{GROUP_NAME}-hw2_split-{safe_model_name}-{strategy_name}.csv")
    output_df.to_csv(output_file, sep=',', index=False)
    prediction_files.append(output_file)
    print(f"Saved: {output_file}")

print("\nAll prediction files generated successfully.")
print(f"\nTotal prediction files: {len(prediction_files)}")

Generating prediction output files...

Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_1b-sliding_window.csv
Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_1b-next_token_prob.csv
Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_1b-marker_insertion.csv
Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_1b-structured_json.csv
Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_1b-few_shot_hard.csv
Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_1b-chain_of_thought.csv
Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_1b-iterative_refinement.csv
Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_3b-sliding_window.csv
Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_3b-next_token_prob.csv
Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_3b-marker_insertion.csv
Saved: decoder_inference_output/exACSAI-hw2_split-llama_3.2_3b-structured_json.csv
Saved: decoder_inference_output/exACSAI-hw2_
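
The length adjustment in the cell above can be read as a small helper. The sketch below is an illustration only, not the notebook's code: `fit_to_length` and `pad_label` are hypothetical names, and predictions are assumed to be a flat sequence of 0/1 labels.

```python
def fit_to_length(predictions, n_tokens, pad_label=0):
    """Return a copy of `predictions` with exactly `n_tokens` entries.

    Short inputs are padded with `pad_label`; long inputs are truncated.
    (Hypothetical helper mirroring the inline adjustment above.)
    """
    preds = list(predictions)
    if len(preds) < n_tokens:
        # Missing predictions become non-boundaries (label 0 by default)
        preds.extend([pad_label] * (n_tokens - len(preds)))
    # Drop any predictions beyond the last token
    return preds[:n_tokens]


padded = fit_to_length([1, 0], 4)
truncated = fit_to_length([1, 0, 1, 1, 0], 3)
```

Padding with 0 treats every missing prediction as a non-boundary, so a strategy that returns too few labels loses recall rather than raising an error. The warning printed on a length mismatch is therefore worth checking before comparing scores.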

## 13. Save Best Model Predictions

In [13]:
if results:
    # Get best configuration
    best_result = results_df.iloc[0]
    best_model = best_result['model_name']
    best_strategy = best_result['strategy_id']
    best_strategy_name = STRATEGIES[best_strategy]
    
    best_predictions = all_predictions.get((best_model, best_strategy))
    
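    if best_predictions is not None:
        # Pad with 0 or truncate so the label column has one entry per token,
        # mirroring the adjustment applied to the per-strategy files
        best_predictions = (list(best_predictions) + [0] * len(tokens))[:len(tokens)]
        # Save best model predictions with special naming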
        output_df = pd.DataFrame({
            'token': tokens,
            'label': best_predictions
        })
        
        best_file = os.path.join(OUTPUT_DIR, f"{GROUP_NAME}-hw2_split-decoder-best.csv")
        output_df.to_csv(best_file, sep=',', index=False)
        print(f"Best model predictions saved to: {best_file}")
        print(f"Best configuration: {best_model} + {STRATEGY_NAMES[best_strategy]}")
        print(f"F1 Score: {best_result['f1']:.4f}")

Best model predictions saved to: decoder_inference_output/exACSAI-hw2_split-decoder-best.csv
Best configuration: llama-3.2-3b + Marker Insertion
F1 Score: 0.4429


## 14. Summary

The decoder inference pipeline has completed. The following outputs are available in the `decoder_inference_output` directory:

1. **Evaluation Summary**: CSV file with metrics for all model-strategy combinations
2. **Prediction Files**: CSV files with predicted labels for each model-strategy pair
3. **Best Model Predictions**: CSV file with predictions from the best-performing configuration

Configurations are ranked by F1 score. In this run the best-performing combination was llama-3.2-3b with Marker Insertion (F1 0.4429, precision 0.7045, recall 0.3229).
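
Ranking by F1 rather than accuracy matters because sentence boundaries are rare. Several rows in the results pair an accuracy of about 0.937 with zero precision and recall, which is what a configuration that predicts label 0 for every token would score if roughly 6% of tokens are boundaries. The sketch below illustrates the effect on toy data. `binary_metrics` is a hypothetical pure-Python helper, not the notebook's evaluation code, and the 100-token label sequence is invented for the example.

```python
def binary_metrics(gold, pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (positive class = 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(gold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data (invented): 100 tokens, 5 of which end a sentence
gold = [1 if i % 17 == 16 else 0 for i in range(100)]

# A "model" that never predicts a boundary
all_zero = binary_metrics(gold, [0] * len(gold))

# A model that finds 4 of the 5 boundaries with no false positives
pred = list(gold)
pred[16] = 0
mostly_right = binary_metrics(gold, pred)

print(all_zero)
print(mostly_right)
```

On this toy data the all-zero predictor reaches 0.95 accuracy with an F1 of 0, while the second predictor gains only a few accuracy points but has a high F1. The gap between them is visible in F1, not in accuracy.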

### Strategy Descriptions

1. **Sliding Window**: Binary YES/NO classification for each punctuation mark with context window
2. **Next-Token Probability**: Analyzes probability of sentence-starting tokens after punctuation
3. **Marker Insertion**: Asks LLM to insert `<EOS>` markers, then aligns back to tokens
4. **Structured JSON**: Outputs boundary indices as JSON object
5. **Few-Shot Hard**: Uses curated examples covering edge cases (abbreviations, dialogue, etc.)
6. **Chain-of-Thought**: Step-by-step reasoning for each punctuation mark
7. **Iterative Refinement**: Two-pass approach with verification and correction
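
As an illustration of the alignment step in Marker Insertion, the sketch below maps `<EOS>` markers in a model's output back to per-token labels. It is a minimal sketch under stated assumptions, not the notebook's implementation: `align_markers` is a hypothetical name, the model is assumed to reproduce the tokens verbatim and in order (whitespace may differ), and alignment stops at the first token that does not match.

```python
def align_markers(tokens, marked_text, marker="<EOS>"):
    """Return one 0/1 label per token; 1 means a marker follows that token.

    Hypothetical helper: assumes `marked_text` contains the tokens in order,
    separated by optional whitespace, with `marker` inserted after some tokens.
    """
    labels = [0] * len(tokens)
    pos = 0      # cursor into marked_text
    last = -1    # index of the most recently matched token

    def consume_gap():
        """Skip whitespace and markers, labelling the last matched token."""
        nonlocal pos
        while True:
            while pos < len(marked_text) and marked_text[pos].isspace():
                pos += 1
            if not marked_text.startswith(marker, pos):
                return
            if last >= 0:
                labels[last] = 1
            pos += len(marker)

    for i, tok in enumerate(tokens):
        consume_gap()
        if not marked_text.startswith(tok, pos):
            # The model altered the text; leave remaining labels at 0
            break
        pos += len(tok)
        last = i
    consume_gap()  # pick up a marker after the final matched token
    return labels


tokens = ["Il", "dott", ".", "Rossi", "arriva", ".", "Poi", "parte", "."]
marked = "Il dott. Rossi arriva.<EOS> Poi parte. <EOS>"
labels = align_markers(tokens, marked)
print(labels)
```

Stopping at the first mismatch keeps the sketch simple but leaves every later token labelled 0. If a model paraphrases or drops text, that behavior alone would lower recall, so a more tolerant alignment (for example, fuzzy matching) may be worth considering.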