Official implementation of the positional bias analysis tool for "PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization" (Sun et al., EMNLP 2025).
```
positionalbias/
├── data/      # Main data directory
├── scripts/   # Core analysis scripts
├── LICENSE
└── README.md
```
- `positionalbias_analyzer.py` - Main script for replicating the paper's experiments (run `python positionalbias_analyzer.py -h` for detailed help)
- `summary_generation_marcel_template.py` - Template for summary generation, used on the Marcel cluster
- `llm_judger.py` - LLM-based evaluation tool
- `merge_all_pickle_into_one.ipynb` - Merge individual model results
- `extractive_stats_validation.ipynb` - Validate extractive baseline statistics
- `2773_abstractive_input.pkl` - Original abstractive summarization dataset (2,773 conversations)
- `baseline_input_15_25_35.pkl` - Extractive baseline with 15%, 25%, and 35% extraction ratios
- `2773_abstractive_input_result_with_similarity.pkl` - Main experiment results with abstractive summaries:

  ```bash
  python positionalbias_analyzer.py --input 2773_abstractive_input.pkl --output 2773_abstractive_input_result_with_similarity.pkl --alpha 1.0 --quantile --ratio 1.0
  ```

- `2773_HF_input_result_with_similarity.pkl` - Results from the HuggingFace dataset format:

  ```bash
  python positionalbias_analyzer.py --input huggingface --output 2773_HF_input_result_with_similarity.pkl --alpha 1.0 --quantile --hf_token YOUR_TOKEN --hf_dataset --ratio 1.0
  ```

- `result_baseline_input_15_25_35.pkl` - Controlled experiment results with extractive baselines:

  ```bash
  python positionalbias_analyzer.py --input baseline_input_15_25_35.pkl --output result_baseline_input_15_25_35.pkl --alpha 1.0 --quantile --device cuda --ratio 1.0
  ```

- `result_ablation_study_05_to_15_baseline.pkl` - Ablation study results with the extractive baseline (α varies from 0.5 to 1.5; see the sketch below)
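An ablation like this can be regenerated by sweeping `--alpha`. Below is a hypothetical Python driver; the exact α grid and output naming used for the paper may differ:

```python
# Hypothetical ablation sweep over the alpha threshold factor; the grid and
# output file naming below are illustrative, not the paper's exact protocol.
import subprocess

for alpha in [0.5, 0.75, 1.0, 1.25, 1.5]:
    subprocess.run(
        [
            "python", "positionalbias_analyzer.py",
            "--input", "baseline_input_15_25_35.pkl",
            "--output", f"result_ablation_alpha_{alpha}.pkl",
            "--alpha", str(alpha),
            "--quantile",
            "--ratio", "1.0",
        ],
        check=True,  # raise if the analyzer exits with an error
    )
```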
Large language models (LLMs) are increasingly used for zero-shot conversation summarization, but often exhibit positional bias, tending to overemphasize content from the beginning or end of a conversation while neglecting the middle. To address this issue, we introduce PoSum-Bench, a comprehensive benchmark for evaluating positional bias in conversational summarization, featuring diverse English and French conversational datasets spanning formal meetings, casual conversations, and customer service interactions. We propose a novel semantic similarity-based sentence-level metric to quantify the direction and magnitude of positional bias in model-generated summaries, enabling systematic and reference-free evaluation across conversation positions, languages, and conversational contexts.
Our benchmark and methodology thus provide the first systematic, cross-lingual framework for reference-free evaluation of positional bias in conversational summarization, laying the groundwork for developing more balanced and unbiased summarization models.
Keywords: model bias/fairness evaluation, benchmarking, NLP datasets, conversational summarization
- Novel Bias Quantification Method: Introduces a semantic similarity-based approach to measure leading and recency biases in summarization
- Comprehensive Evaluation: Analyzes 10 state-of-the-art LLMs across 2,773 conversations in English and French
- Open-Source Tool: Provides researchers with an easy-to-use tool for analyzing positional bias in their own models
- PoSum-Bench Dataset: Accompanies our benchmark dataset containing 52,687 summary instances with bias annotations
Requires Python >= 3.10 and the following packages:

```
torch>=2.0.0,<3.0.0
numpy>=1.21.0,<2.0.0
scikit-learn>=1.0.0,<2.0.0
sentence-transformers>=2.2.0,<4.0.0
datasets>=2.14.0,<4.0.0
pandas>=1.3.0,<3.0.0
matplotlib>=3.4.0,<4.0.0
seaborn>=0.11.0,<1.0.0
tqdm>=4.62.0
scipy>=1.7.0,<2.0.0
```

Our analyzer works with the PoSum-Bench dataset, available on Hugging Face (see the loading sketch after the list below). The dataset includes:
- 2,773 conversations (2,273 English, 500 French)
- 19 summarization methods (10 LLMs + 9 extractive strategies)
- 52,687 summary instances with positional bias annotations
- Multi-domain coverage: meetings, dialogues, social media, call centers
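Assuming access has been granted, the dataset can be loaded with the `datasets` library. The Hub ID below is a placeholder, since the exact path is not listed here:

```python
# Sketch: load PoSum-Bench from the Hugging Face Hub. "ORG/PoSum-Bench" is a
# placeholder ID; substitute the dataset's real Hub path.
from datasets import load_dataset

posum = load_dataset("ORG/PoSum-Bench", token="YOUR_HF_TOKEN")
print(posum)  # inspect the available splits and fields
```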
```bash
# Analyze the full PoSum-Bench dataset from Hugging Face
python positionalbias_analyzer.py \
    --input huggingface \
    --output posum_bench_results.pkl \
    --hf_dataset \
    --hf_token YOUR_HF_TOKEN \
    --alpha 1.0 \
    --quantile \
    --model all-MiniLM-L6-v2
```

```bash
# Analyze your own dataset (must follow the required input format below)
python positionalbias_analyzer.py \
    --input your_dataset.pkl \
    --output custom_results.pkl \
    --alpha 1.0 \
    --quantile \
    --device cuda
```

The analyzer can also be used programmatically:

```python
from positionalbias_analyzer import PositionalBiasAnalyzer

# Initialize the analyzer
analyzer = PositionalBiasAnalyzer(
    sentence_transformer_model="all-MiniLM-L6-v2",
    device="cuda"
)

# Example conversation and summary
conversation = [
    "Alice: Should we start with the budget review?",
    "Bob: Yes, let's look at Q3 expenses first.",
    "Alice: We're 15% over budget in marketing.",
    "Bob: We need to cut back on digital ads.",
    "Alice: Agreed. What about R&D spending?",
    "Bob: R&D is on track, no concerns there.",
    "Alice: Good. Let's prepare the report.",
    "Bob: I'll have it ready by Friday."
]

summary = [
    "The team discussed being over budget in marketing.",
    "They agreed to cut digital ad spending.",
    "A report will be ready by Friday."
]

# Analyze bias
result = analyzer.analyze_single_instance(
    conversation=conversation,
    generated_summary=summary,
    alpha=1.0,
    plot_distribution=True
)
print(f"Leading Bias Score: {result['leading_score']:.3f}")
print(f"Recency Bias Score: {result['recency_score']:.3f}")
print(f"Ignored Turns: {result['ignored_indices']}")Our approach quantifies positional bias through four key steps:

1. **Semantic Similarity Computation**: Calculate cosine similarity between conversation turns and summary sentences using sentence transformers.
2. **Contribution Score Calculation**: Apply softmax normalization and max-pooling to identify which conversation segments contribute most to the summary.
3. **Dynamic Thresholding**: Identify "ignored" conversation segments using the threshold τ = μ - α·σ.
4. **Bias Score Computation**: Calculate position-weighted scores for leading and recency biases using log-normalized positions.

Leading Bias Score:

    B_lead = (1/k) × Σ [ (log(p_i + 2) / log(n + 1)) / (log(j + 2) / log(k + 1)) × w_i ]

Recency Bias Score:

    B_recency = (1/k) × Σ [ (log(n - p_i + 1) / log(n + 1)) / (log(k - j + 1) / log(k + 1)) × w_i ]

Where:

- n: total number of conversation turns
- k: number of ignored turns
- p_i: position of the i-th ignored turn in the conversation
- j: rank of an ignored turn within the k ignored turns
- w_i: length-based weight of the i-th ignored turn
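For concreteness, here is a compact sketch of the four steps. It is illustrative, not the exact implementation in `positionalbias_analyzer.py`: it omits quantile normalization, assumes 0-indexed positions p_i and ranks j, and derives length weights from whitespace token counts.

```python
# Sketch of the four-step pipeline. Assumptions (not confirmed by this README):
# 0-indexed positions/ranks, no quantile normalization, whitespace token weights.
import math
import numpy as np
from sentence_transformers import SentenceTransformer

def analyze(conversation, summary, alpha=1.0, model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    # 1. Cosine similarity between every turn and every summary sentence.
    turn_emb = model.encode(conversation, normalize_embeddings=True)
    summ_emb = model.encode(summary, normalize_embeddings=True)
    sim = turn_emb @ summ_emb.T                       # (n_turns, n_summary)
    # 2. Contribution score: softmax over turns per summary sentence, max-pool.
    soft = np.exp(sim) / np.exp(sim).sum(axis=0, keepdims=True)
    contrib = soft.max(axis=1)                        # one score per turn
    # 3. Dynamic threshold tau = mu - alpha * sigma; turns below it are "ignored".
    tau = contrib.mean() - alpha * contrib.std()
    ignored = [i for i, c in enumerate(contrib) if c < tau]
    # 4. Position-weighted leading / recency bias scores.
    n, k = len(conversation), len(ignored)
    if k == 0:
        return 0.0, 0.0, ignored
    lengths = np.array([len(conversation[i].split()) for i in ignored], dtype=float)
    w = lengths / lengths.sum()                       # length-based weights w_i
    lead = recency = 0.0
    for j, (p, w_i) in enumerate(zip(ignored, w)):
        lead += (math.log(p + 2) / math.log(n + 1)) / (math.log(j + 2) / math.log(k + 1)) * w_i
        recency += (math.log(n - p + 1) / math.log(n + 1)) / (math.log(k - j + 1) / math.log(k + 1)) * w_i
    return lead / k, recency / k, ignored
```

For the paper's exact computation (including quantile normalization), use the `PositionalBiasAnalyzer` API shown earlier.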
| Argument | Type | Default | Description |
|---|---|---|---|
| `--input` | str | Required | Input dataset path or `huggingface` |
| `--output` | str | Required | Output file path (`.pkl`) |
| `--alpha` | float | 1.0 | Threshold adjustment factor (α) |
| `--hf_dataset` | flag | False | Use PoSum-Bench from HuggingFace |
| `--hf_token` | str | None | HuggingFace access token |
| `--ratio` | float | 0.5 | Dataset sampling ratio (0.0-1.0) |
| `--quantile` | flag | False | Apply quantile normalization |
| `--model` | str | `all-MiniLM-L6-v2` | Sentence transformer model |
| `--device` | str | cuda | Computing device (cuda/cpu/mps) |
| `--test` | flag | False | Run test mode with sample data |
| `--plot` | flag | False | Generate distribution plots |
Custom datasets passed via `--input` must follow this pickle structure:

```python
{
    "en": [  # Language code
        {
            "conversations": ["Turn 1", "Turn 2", ...],
            "llm_generated_summary": {
                "model_name": {
                    "splitted_summary": ["Sentence 1", "Sentence 2", ...]
                }
            }
        },
        ...
    ],
    "fr": [...]  # Additional languages
}
```
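For reference, a minimal sketch of serializing a compliant single-conversation dataset; the conversation, summary, and model name are illustrative placeholders:

```python
# Sketch: build and save a dataset pickle in the required input format.
import pickle

data = {
    "en": [
        {
            "conversations": ["Alice: Hi, shall we begin?", "Bob: Yes, let's start."],
            "llm_generated_summary": {
                "my_model": {  # placeholder model name
                    "splitted_summary": ["Alice and Bob agree to begin the meeting."]
                }
            },
        }
    ]
}

with open("your_dataset.pkl", "wb") as f:
    pickle.dump(data, f)
```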
The analyzer's output preserves each instance's original fields and adds the analysis fields:

```python
{
    # Original fields preserved
    ...

    # Added analysis fields
    "leading_score": 0.123,              # Leading bias score
    "recency_score": 0.456,              # Recency bias score
    "ignored_indices": [2, 5, 8],        # Indices of ignored turns
    "similarity_matrix": [[...], [...]]  # Turn-summary similarity matrix
}
```
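As a quick sanity check, the output pickle can be inspected directly. This sketch assumes the top level mirrors the input format (`{"en": [...], ...}`); the exact nesting of the added fields may differ:

```python
# Sketch: inspect analyzer output; adjust the keys to the actual nesting.
import pickle

with open("custom_results.pkl", "rb") as f:
    results = pickle.load(f)

for instance in results["en"][:3]:
    print(instance.get("leading_score"), instance.get("recency_score"))
```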
To reproduce the results from our EMNLP 2025 submission:

```bash
# 1. Download the full PoSum-Bench dataset
python positionalbias_analyzer.py \
--input huggingface \
--output emnlp2025_results.pkl \
--hf_dataset \
--hf_token YOUR_TOKEN \
--alpha 1.0 \
--quantile \
--ratio 1.0
# 2. Generate analysis plots (requires separate script)
python generate_figures.py --input emnlp2025_results.pkl
```

If you use PoSum-Bench or this analyzer in your research, please cite our paper:

```bibtex
@inproceedings{sun-etal-2025-posum,
title = "{P}o{S}um-Bench: Benchmarking Position Bias in {LLM}-based Conversational Summarization",
author = "Sun, Xu and
Delphin-Poulat, Lionel and
Tarnec, Christ{\`e}le and
Shimorina, Anastasia",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.404/",
doi = "10.18653/v1/2025.emnlp-main.404",
pages = "7985--8009",
ISBN = "979-8-89176-332-6",
abstract = "Large language models (LLMs) are increasingly used for zero-shot conversation summarization, but often exhibit positional bias{---}tending to overemphasize content from the beginning or end of a conversation while neglecting the middle. To address this issue, we introduce PoSum-Bench, a comprehensive benchmark for evaluating positional bias in conversational summarization, featuring diverse English and French conversational datasets spanning formal meetings, casual conversations, and customer service interactions. We propose a novel semantic similarity-based sentence-level metric to quantify the direction and magnitude of positional bias in model-generated summaries, enabling systematic and reference-free evaluation across conversation positions, languages, and conversational contexts.Our benchmark and methodology thus provide the first systematic, cross-lingual framework for reference-free evaluation of positional bias in conversational summarization, laying the groundwork for developing more balanced and unbiased summarization models."
}
```

We welcome contributions to improve the analyzer! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Paper: PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization (Sun et al., EMNLP 2025)
- Dataset: HuggingFace Dataset
- Issues: Please open a GitHub issue for bugs or questions
- Email: xu.sun@orange.com, anastasia.shimorina@orange.com