Positional Bias Analyzer for PoSum-Bench

Python 3.10+ | License: MIT | Hugging Face

Official implementation of the positional bias analysis tool for "PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization" (Sun et al., EMNLP 2025).

📁 Project Structure

positionalbias/
├── data/                   # Main data directory
├── scripts/                # Core analysis scripts
├── LICENSE
└── README.md

🗂️ File Descriptions

Core Scripts (scripts/)

  1. positionalbias_analyzer.py - Main script for replicating paper experiments

    python positionalbias_analyzer.py -h  # For detailed help
  2. summary_generation_marcel_template.py - Template for summary generation, used on the Marcel cluster

  3. llm_judger.py - LLM-based evaluation tool

  4. merge_all_pickle_into_one.ipynb - Merge individual model results

  5. extractive_stats_validation.ipynb - Validate extractive baseline statistics

Main Data Files (data/)

Input Files

  • 2773_abstractive_input.pkl - Original abstractive summarization dataset (2,773 conversations)
  • baseline_input_15_25_35.pkl - Extractive baseline with 15%, 25%, 35% extraction ratios

Result Files (with similarity matrices)

  • 2773_abstractive_input_result_with_similarity.pkl - Main experiment results with abstractive summaries

    python positionalbias_analyzer.py --input 2773_abstractive_input.pkl --output 2773_abstractive_input_result_with_similarity.pkl --alpha 1.0 --quantile --ratio 1.0
  • 2773_HF_input_result_with_similarity.pkl - Results from HuggingFace dataset format

    python positionalbias_analyzer.py --input huggingface --output 2773_HF_input_result_with_similarity.pkl --alpha 1.0 --quantile --hf_token YOUR_TOKEN --hf_dataset --ratio 1.0
  • result_baseline_input_15_25_35.pkl - Controlled experiment results with extractive baselines

    python positionalbias_analyzer.py --input baseline_input_15_25_35.pkl --output result_baseline_input_15_25_35.pkl --alpha 1.0 --quantile --device cuda --ratio 1.0
  • result_ablation_study_05_to_15_baseline.pkl - Ablation study results with baseline (α varies from 0.5 to 1.5)

📖 Abstract

Large language models (LLMs) are increasingly used for zero-shot conversation summarization, but often exhibit positional bias, tending to overemphasize content from the beginning or end of a conversation while neglecting the middle. To address this issue, we introduce PoSum-Bench, a comprehensive benchmark for evaluating positional bias in conversational summarization, featuring diverse English and French conversational datasets spanning formal meetings, casual conversations, and customer service interactions. We propose a novel semantic similarity-based sentence-level metric to quantify the direction and magnitude of positional bias in model-generated summaries, enabling systematic and reference-free evaluation across conversation positions, languages, and conversational contexts.

Our benchmark and methodology thus provide the first systematic, cross-lingual framework for reference-free evaluation of positional bias in conversational summarization, laying the groundwork for developing more balanced and unbiased summarization models.

Keywords: model bias/fairness evaluation, benchmarking, NLP datasets, conversational summarization

🎯 Key Contributions

  1. Novel Bias Quantification Method: Introduces a semantic similarity-based approach to measure leading and recency biases in summarization
  2. Comprehensive Evaluation: Analyzes 10 state-of-the-art LLMs across 2,773 conversations in English and French
  3. Open-Source Tool: Provides researchers with an easy-to-use tool for analyzing positional bias in their own models
  4. PoSum-Bench Dataset: Releases our benchmark dataset containing 52,687 summary instances with bias annotations

Requirements

python_requires=">=3.10"
torch>=2.0.0,<3.0.0
numpy>=1.21.0,<2.0.0
scikit-learn>=1.0.0,<2.0.0
sentence-transformers>=2.2.0,<4.0.0
datasets>=2.14.0,<4.0.0
pandas>=1.3.0,<3.0.0
matplotlib>=3.4.0,<4.0.0
seaborn>=0.11.0,<1.0.0
tqdm>=4.62.0
scipy>=1.7.0,<2.0.0

📊 PoSum-Bench Dataset

Our analyzer works with the PoSum-Bench dataset, available on Hugging Face. The dataset includes:

  • 2,773 conversations (2,273 English, 500 French)
  • 19 summarization methods (10 LLMs + 9 extractive strategies)
  • 52,687 summary instances with positional bias annotations
  • Multi-domain coverage: meetings, dialogues, social media, call centers

💻 Usage

1. Analyzing the PoSum-Bench Dataset

# Analyze the full PoSum-Bench dataset from Hugging Face
python positionalbias_analyzer.py \
    --input huggingface \
    --output posum_bench_results.pkl \
    --hf_dataset \
    --hf_token YOUR_HF_TOKEN \
    --alpha 1.0 \
    --quantile \
    --model all-MiniLM-L6-v2

2. Analyzing Custom Datasets

# Analyze your own dataset (must follow the required format)
python positionalbias_analyzer.py \
    --input your_dataset.pkl \
    --output custom_results.pkl \
    --alpha 1.0 \
    --quantile \
    --device cuda

3. Single Instance Analysis

from positionalbias_analyzer import PositionalBiasAnalyzer

# Initialize the analyzer
analyzer = PositionalBiasAnalyzer(
    sentence_transformer_model="all-MiniLM-L6-v2",
    device="cuda"
)

# Example conversation and summary
conversation = [
    "Alice: Should we start with the budget review?",
    "Bob: Yes, let's look at Q3 expenses first.",
    "Alice: We're 15% over budget in marketing.",
    "Bob: We need to cut back on digital ads.",
    "Alice: Agreed. What about R&D spending?",
    "Bob: R&D is on track, no concerns there.",
    "Alice: Good. Let's prepare the report.",
    "Bob: I'll have it ready by Friday."
]

summary = [
    "The team discussed being over budget in marketing.",
    "They agreed to cut digital ad spending.",
    "A report will be ready by Friday."
]

# Analyze bias
result = analyzer.analyze_single_instance(
    conversation=conversation,
    generated_summary=summary,
    alpha=1.0,
    plot_distribution=True
)

print(f"Leading Bias Score: {result['leading_score']:.3f}")
print(f"Recency Bias Score: {result['recency_score']:.3f}")
print(f"Ignored Turns: {result['ignored_indices']}")

🔬 Methodology

Our approach quantifies positional bias through four key steps:

  1. Semantic Similarity Computation: Calculate cosine similarity between conversation turns and summary sentences using sentence transformers

  2. Contribution Score Calculation: Apply softmax normalization and max-pooling to identify which conversation segments contribute most to the summary

  3. Dynamic Thresholding: Identify "ignored" conversation segments using threshold τ = μ - α·σ

  4. Bias Score Computation: Calculate position-weighted scores for leading and recency biases using log-normalized positions
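
The first three steps can be sketched in a few lines with sentence-transformers. The snippet below is illustrative only, assuming the all-MiniLM-L6-v2 model used throughout this README; the function name and the exact softmax/pooling axes are simplifications, not the analyzer's actual API (see scripts/positionalbias_analyzer.py for the reference implementation).

import numpy as np
from sentence_transformers import SentenceTransformer

def find_ignored_turns(conversation, summary_sentences, alpha=1.0,
                       model_name="all-MiniLM-L6-v2", device="cpu"):
    model = SentenceTransformer(model_name, device=device)

    # Step 1: cosine similarity between every conversation turn and every
    # summary sentence (normalized embeddings, so dot product = cosine)
    turn_emb = model.encode(conversation, normalize_embeddings=True)
    summ_emb = model.encode(summary_sentences, normalize_embeddings=True)
    similarity = turn_emb @ summ_emb.T              # shape (n_turns, n_sents)

    # Step 2: softmax over turns for each summary sentence, then max-pool
    # across sentences to get one contribution score per turn
    exp_sim = np.exp(similarity - similarity.max(axis=0, keepdims=True))
    softmax = exp_sim / exp_sim.sum(axis=0, keepdims=True)
    contribution = softmax.max(axis=1)

    # Step 3: dynamic threshold tau = mu - alpha * sigma; turns whose
    # contribution falls below tau are treated as "ignored"
    tau = contribution.mean() - alpha * contribution.std()
    ignored = [i for i, c in enumerate(contribution) if c < tau]
    return ignored, similarity

Step 4 then converts the ignored turn indices into the leading and recency scores defined in the Mathematical Formulation below.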

Mathematical Formulation

Leading Bias Score:

B_lead = (1/k) × Σ[(log(p_i+2)/log(n+1)) / (log(j+2)/log(k+1)) × w_i]

Recency Bias Score:

B_recency = (1/k) × Σ[(log(n-p_i+1)/log(n+1)) / (log(k-j+1)/log(k+1)) × w_i]

Where:

  • n: Total conversation turns
  • k: Number of ignored turns
  • p_i: Position of ignored turn in conversation
  • w_i: Length-based weight of ignored turn
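
A direct transcription of these formulas is sketched below. It assumes 0-indexed turn positions and reads j as the rank of the i-th ignored turn among the k ignored turns; that reading of the shared summation index is our assumption, not a detail stated above, so treat this as a sketch rather than the paper's reference code.

import math

def bias_scores(ignored_positions, weights, n):
    """ignored_positions: 0-indexed positions p_i of ignored turns, sorted;
    weights: length-based weights w_i; n: total number of conversation turns."""
    k = len(ignored_positions)
    if k == 0:
        return 0.0, 0.0   # nothing ignored, so no measurable bias
    lead = recency = 0.0
    for j, (p, w) in enumerate(zip(ignored_positions, weights)):
        lead += (math.log(p + 2) / math.log(n + 1)) \
                / (math.log(j + 2) / math.log(k + 1)) * w
        recency += (math.log(n - p + 1) / math.log(n + 1)) \
                   / (math.log(k - j + 1) / math.log(k + 1)) * w
    return lead / k, recency / k

# Ignoring the final turns yields a higher leading-bias score,
# i.e. the summary leans on the start of the conversation
print(bias_scores([7, 8, 9], [1.0, 1.0, 1.0], n=10))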

📈 Command Line Arguments

Argument      Type    Default            Description
--input       str     Required           Input dataset path or 'huggingface'
--output      str     Required           Output file path (.pkl)
--alpha       float   1.0                Threshold adjustment factor (α)
--hf_dataset  flag    False              Use PoSum-Bench from HuggingFace
--hf_token    str     None               HuggingFace access token
--ratio       float   0.5                Dataset sampling ratio (0.0-1.0)
--quantile    flag    False              Apply quantile normalization
--model       str     all-MiniLM-L6-v2   Sentence transformer model
--device      str     cuda               Computing device (cuda/cpu/mps)
--test        flag    False              Run test mode with sample data
--plot        flag    False              Generate distribution plots

📝 Data Format

Input Format (Local Dataset)

{
    "en": [  # Language code
        {
            "conversations": ["Turn 1", "Turn 2", ...],
            "llm_generated_summary": {
                "model_name": {
                    "splitted_summary": ["Sentence 1", "Sentence 2", ...]
                }
            }
        },
        ...
    ],
    "fr": [...]  # Additional languages
}
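
For example, a minimal local dataset in this format can be built and pickled as follows (the model name and file name are placeholders):

import pickle

dataset = {
    "en": [
        {
            "conversations": [
                "Alice: Should we start with the budget review?",
                "Bob: Yes, let's look at Q3 expenses first.",
            ],
            "llm_generated_summary": {
                "my_model": {
                    "splitted_summary": ["The team began a budget review."]
                }
            },
        }
    ]
}

with open("your_dataset.pkl", "wb") as f:
    pickle.dump(dataset, f)

# The file can then be passed to the analyzer via --input your_dataset.pkl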

Output Format

{
    # Original fields preserved
    ...
    # Added analysis fields
    "leading_score": 0.123,              # Leading bias score
    "recency_score": 0.456,              # Recency bias score
    "ignored_indices": [2, 5, 8],        # Indices of ignored turns
    "similarity_matrix": [[...], [...]]  # Turn-summary similarity matrix
}
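
Assuming the added analysis fields sit alongside the preserved fields as shown above, a result file can be inspected like this (a sketch; adapt the traversal to the nesting you actually find in the pickle):

import pickle

with open("custom_results.pkl", "rb") as f:
    results = pickle.load(f)

# Walk the English instances and print the added bias fields
for instance in results["en"]:
    print(instance.get("leading_score"),
          instance.get("recency_score"),
          instance.get("ignored_indices"))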

📊 Reproducing Paper Results

To reproduce the results from our EMNLP 2025 paper:

# 1. Download the full PoSum-Bench dataset
python positionalbias_analyzer.py \
    --input huggingface \
    --output emnlp2025_results.pkl \
    --hf_dataset \
    --hf_token YOUR_TOKEN \
    --alpha 1.0 \
    --quantile \
    --ratio 1.0

# 2. Generate analysis plots (requires separate script)
python generate_figures.py --input emnlp2025_results.pkl

📚 Citation

If you use PoSum-Bench or this analyzer in your research, please cite our paper:

@inproceedings{sun-etal-2025-posum,
    title = "{P}o{S}um-Bench: Benchmarking Position Bias in {LLM}-based Conversational Summarization",
    author = "Sun, Xu  and
      Delphin-Poulat, Lionel  and
      Tarnec, Christ{\`e}le  and
      Shimorina, Anastasia",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.404/",
    doi = "10.18653/v1/2025.emnlp-main.404",
    pages = "7985--8009",
    ISBN = "979-8-89176-332-6",
    abstract = "Large language models (LLMs) are increasingly used for zero-shot conversation summarization, but often exhibit positional bias{---}tending to overemphasize content from the beginning or end of a conversation while neglecting the middle. To address this issue, we introduce PoSum-Bench, a comprehensive benchmark for evaluating positional bias in conversational summarization, featuring diverse English and French conversational datasets spanning formal meetings, casual conversations, and customer service interactions. We propose a novel semantic similarity-based sentence-level metric to quantify the direction and magnitude of positional bias in model-generated summaries, enabling systematic and reference-free evaluation across conversation positions, languages, and conversational contexts.Our benchmark and methodology thus provide the first systematic, cross-lingual framework for reference-free evaluation of positional bias in conversational summarization, laying the groundwork for developing more balanced and unbiased summarization models."
}

🤝 Contributing

We welcome contributions to improve the analyzer! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Contact
