Positional Bias Analyzer for PoSum-Bench

Python 3.10+ | License: MIT | Hugging Face

Official implementation of the positional bias analysis tool for "PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization" (Sun et al., EMNLP 2025).

📁 Project Structure

positionalbias/
├── data/                   # Main data directory
├── scripts/                # Core analysis scripts
├── LICENSE
└── README.md

🗂️ File Descriptions

Core Scripts (scripts/)

  1. positionalbias_analyzer.py - Main script for replicating paper experiments

    python positionalbias_analyzer.py -h  # For detailed help
  2. summary_generation_marcel_template.py - Template for summary generation, used on the Marcel cluster

  3. llm_judger.py - LLM-based evaluation tool

  4. merge_all_pickle_into_one.ipynb - Merge individual model results

  5. extractive_stats_validation.ipynb - Validate extractive baseline statistics

Main Data Files (data/)

Input Files

  • 2773_abstractive_input.pkl - Original abstractive summarization dataset (2,773 conversations)
  • baseline_input_15_25_35.pkl - Extractive baseline with 15%, 25%, 35% extraction ratios

Result Files (with similarity matrices)

  • 2773_abstractive_input_result_with_similarity.pkl - Main experiment results with abstractive summaries

    python positionalbias_analyzer.py --input 2773_abstractive_input.pkl --output 2773_abstractive_input_result_with_similarity.pkl --alpha 1.0 --quantile --ratio 1.0
  • 2773_HF_input_result_with_similarity.pkl - Results from HuggingFace dataset format

    python positionalbias_analyzer.py --input huggingface --output 2773_HF_input_result_with_similarity.pkl --alpha 1.0 --quantile --hf_token YOUR_TOKEN --hf_dataset --ratio 1.0
  • result_baseline_input_15_25_35.pkl - Controlled experiment results with extractive baselines

    python positionalbias_analyzer.py --input baseline_input_15_25_35.pkl --output result_baseline_input_15_25_35.pkl --alpha 1.0 --quantile --device cuda --ratio 1.0
  • result_ablation_study_05_to_15_baseline.pkl - Ablation study results with baseline (α varies from 0.5 to 1.5)

📖 Abstract

Large language models (LLMs) are increasingly used for zero-shot conversation summarization, but often exhibit positional bias, tending to overemphasize content from the beginning or end of a conversation while neglecting the middle. To address this issue, we introduce PoSum-Bench, a comprehensive benchmark for evaluating positional bias in conversational summarization, featuring diverse English and French conversational datasets spanning formal meetings, casual conversations, and customer service interactions. We propose a novel semantic similarity-based sentence-level metric to quantify the direction and magnitude of positional bias in model-generated summaries, enabling systematic and reference-free evaluation across conversation positions, languages, and conversational contexts.

Our benchmark and methodology thus provide the first systematic, cross-lingual framework for reference-free evaluation of positional bias in conversational summarization, laying the groundwork for developing more balanced and unbiased summarization models.

Keywords: model bias/fairness evaluation, benchmarking, NLP datasets, conversational summarization

🎯 Key Contributions

  1. Novel Bias Quantification Method: Introduces a semantic similarity-based approach to measure leading and recency biases in summarization
  2. Comprehensive Evaluation: Analyzes 10 state-of-the-art LLMs across 2,773 conversations in English and French
  3. Open-Source Tool: Provides researchers with an easy-to-use tool for analyzing positional bias in their own models
  4. PoSum-Bench Dataset: Releases our benchmark dataset containing 52,687 summary instances with bias annotations

Requirements

python_requires=">=3.10"
torch>=2.0.0,<3.0.0
numpy>=1.21.0,<2.0.0
scikit-learn>=1.0.0,<2.0.0
sentence-transformers>=2.2.0,<4.0.0
datasets>=2.14.0,<4.0.0
pandas>=1.3.0,<3.0.0
matplotlib>=3.4.0,<4.0.0
seaborn>=0.11.0,<1.0.0
tqdm>=4.62.0
scipy>=1.7.0,<2.0.0

📊 PoSum-Bench Dataset

Our analyzer works with the PoSum-Bench dataset, available on Hugging Face. The dataset includes:

  • 2,773 conversations (2,273 English, 500 French)
  • 19 summarization methods (10 LLMs + 9 extractive strategies)
  • 52,687 summary instances with positional bias annotations
  • Multi-domain coverage: meetings, dialogues, social media, call centers

💻 Usage

1. Analyzing the PoSum-Bench Dataset

# Analyze the full PoSum-Bench dataset from Hugging Face
python positionalbias_analyzer.py \
    --input huggingface \
    --output posum_bench_results.pkl \
    --hf_dataset \
    --hf_token YOUR_HF_TOKEN \
    --alpha 1.0 \
    --quantile \
    --model all-MiniLM-L6-v2

2. Analyzing Custom Datasets

# Analyze your own dataset (must follow the required format)
python positionalbias_analyzer.py \
    --input your_dataset.pkl \
    --output custom_results.pkl \
    --alpha 1.0 \
    --quantile \
    --device cuda

3. Single Instance Analysis

from positionalbias_analyzer import PositionalBiasAnalyzer

# Initialize the analyzer
analyzer = PositionalBiasAnalyzer(
    sentence_transformer_model="all-MiniLM-L6-v2",
    device="cuda"
)

# Example conversation and summary
conversation = [
    "Alice: Should we start with the budget review?",
    "Bob: Yes, let's look at Q3 expenses first.",
    "Alice: We're 15% over budget in marketing.",
    "Bob: We need to cut back on digital ads.",
    "Alice: Agreed. What about R&D spending?",
    "Bob: R&D is on track, no concerns there.",
    "Alice: Good. Let's prepare the report.",
    "Bob: I'll have it ready by Friday."
]

summary = [
    "The team discussed being over budget in marketing.",
    "They agreed to cut digital ad spending.",
    "A report will be ready by Friday."
]

# Analyze bias
result = analyzer.analyze_single_instance(
    conversation=conversation,
    generated_summary=summary,
    alpha=1.0,
    plot_distribution=True
)

print(f"Leading Bias Score: {result['leading_score']:.3f}")
print(f"Recency Bias Score: {result['recency_score']:.3f}")
print(f"Ignored Turns: {result['ignored_indices']}")

🔬 Methodology

Our approach quantifies positional bias through four key steps:

  1. Semantic Similarity Computation: Calculate cosine similarity between conversation turns and summary sentences using sentence transformers

  2. Contribution Score Calculation: Apply softmax normalization and max-pooling to identify which conversation segments contribute most to the summary

  3. Dynamic Thresholding: Identify "ignored" conversation segments using threshold τ = μ - α·σ

  4. Bias Score Computation: Calculate position-weighted scores for leading and recency biases using log-normalized positions
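
The first three steps can be sketched in a few lines with sentence-transformers. The snippet below is illustrative only, assuming the all-MiniLM-L6-v2 model used throughout this README; the function name and the exact softmax/pooling axes are simplifications, not the analyzer's actual API (see scripts/positionalbias_analyzer.py for the reference implementation).

import numpy as np
from sentence_transformers import SentenceTransformer

def find_ignored_turns(conversation, summary_sentences, alpha=1.0,
                       model_name="all-MiniLM-L6-v2", device="cpu"):
    model = SentenceTransformer(model_name, device=device)

    # Step 1: cosine similarity between every conversation turn and every
    # summary sentence (normalized embeddings, so dot product = cosine)
    turn_emb = model.encode(conversation, normalize_embeddings=True)
    summ_emb = model.encode(summary_sentences, normalize_embeddings=True)
    similarity = turn_emb @ summ_emb.T              # shape (n_turns, n_sents)

    # Step 2: softmax over turns for each summary sentence, then max-pool
    # across sentences to get one contribution score per turn
    exp_sim = np.exp(similarity - similarity.max(axis=0, keepdims=True))
    softmax = exp_sim / exp_sim.sum(axis=0, keepdims=True)
    contribution = softmax.max(axis=1)

    # Step 3: dynamic threshold tau = mu - alpha * sigma; turns whose
    # contribution falls below tau are treated as "ignored"
    tau = contribution.mean() - alpha * contribution.std()
    ignored = [i for i, c in enumerate(contribution) if c < tau]
    return ignored, similarity

Step 4 then converts the ignored turn indices into the leading and recency scores defined in the Mathematical Formulation below.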

Mathematical Formulation

Leading Bias Score:

B_lead = (1/k) × Σ[(log(p_i+2)/log(n+1)) / (log(j+2)/log(k+1)) × w_i]

Recency Bias Score:

B_recency = (1/k) × Σ[(log(n-p_i+1)/log(n+1)) / (log(k-j+1)/log(k+1)) × w_i]

Where:

  • n: Total conversation turns
  • k: Number of ignored turns
  • p_i: Position of ignored turn in conversation
  • w_i: Length-based weight of ignored turn
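
A direct transcription of these formulas is sketched below. It assumes 0-indexed turn positions and reads j as the rank of the i-th ignored turn among the k ignored turns; that reading of the shared summation index is our assumption, not a detail stated above, so treat this as a sketch rather than the paper's reference code.

import math

def bias_scores(ignored_positions, weights, n):
    """ignored_positions: 0-indexed positions p_i of ignored turns, sorted;
    weights: length-based weights w_i; n: total number of conversation turns."""
    k = len(ignored_positions)
    if k == 0:
        return 0.0, 0.0   # nothing ignored, so no measurable bias
    lead = recency = 0.0
    for j, (p, w) in enumerate(zip(ignored_positions, weights)):
        lead += (math.log(p + 2) / math.log(n + 1)) \
                / (math.log(j + 2) / math.log(k + 1)) * w
        recency += (math.log(n - p + 1) / math.log(n + 1)) \
                   / (math.log(k - j + 1) / math.log(k + 1)) * w
    return lead / k, recency / k

# Ignoring the final turns yields a higher leading-bias score,
# i.e. the summary leans on the start of the conversation
print(bias_scores([7, 8, 9], [1.0, 1.0, 1.0], n=10))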

📈 Command Line Arguments

Argument      Type    Default            Description
--input       str     Required           Input dataset path or 'huggingface'
--output      str     Required           Output file path (.pkl)
--alpha       float   1.0                Threshold adjustment factor (α)
--hf_dataset  flag    False              Use PoSum-Bench from HuggingFace
--hf_token    str     None               HuggingFace access token
--ratio       float   0.5                Dataset sampling ratio (0.0-1.0)
--quantile    flag    False              Apply quantile normalization
--model       str     all-MiniLM-L6-v2   Sentence transformer model
--device      str     cuda               Computing device (cuda/cpu/mps)
--test        flag    False              Run test mode with sample data
--plot        flag    False              Generate distribution plots

📝 Data Format

Input Format (Local Dataset)

{
    "en": [  # Language code
        {
            "conversations": ["Turn 1", "Turn 2", ...],
            "llm_generated_summary": {
                "model_name": {
                    "splitted_summary": ["Sentence 1", "Sentence 2", ...]
                }
            }
        },
        ...
    ],
    "fr": [...]  # Additional languages
}
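
For example, a minimal local dataset in this format can be built and pickled as follows (the model name and file name are placeholders):

import pickle

dataset = {
    "en": [
        {
            "conversations": [
                "Alice: Should we start with the budget review?",
                "Bob: Yes, let's look at Q3 expenses first.",
            ],
            "llm_generated_summary": {
                "my_model": {
                    "splitted_summary": ["The team began a budget review."]
                }
            },
        }
    ]
}

with open("your_dataset.pkl", "wb") as f:
    pickle.dump(dataset, f)

# The file can then be passed to the analyzer via --input your_dataset.pkl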

Output Format

{
    # Original fields preserved
    ...
    # Added analysis fields
    "leading_score": 0.123,              # Leading bias score
    "recency_score": 0.456,              # Recency bias score
    "ignored_indices": [2, 5, 8],        # Indices of ignored turns
    "similarity_matrix": [[...], [...]]  # Turn-summary similarity matrix
}
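
Assuming the added analysis fields sit alongside the preserved fields as shown above, a result file can be inspected like this (a sketch; adapt the traversal to the nesting you actually find in the pickle):

import pickle

with open("custom_results.pkl", "rb") as f:
    results = pickle.load(f)

# Walk the English instances and print the added bias fields
for instance in results["en"]:
    print(instance.get("leading_score"),
          instance.get("recency_score"),
          instance.get("ignored_indices"))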

📊 Reproducing Paper Results

To reproduce the results from our EMNLP 2025 paper:

# 1. Download the full PoSum-Bench dataset
python positionalbias_analyzer.py \
    --input huggingface \
    --output emnlp2025_results.pkl \
    --hf_dataset \
    --hf_token YOUR_TOKEN \
    --alpha 1.0 \
    --quantile \
    --ratio 1.0

# 2. Generate analysis plots (requires separate script)
python generate_figures.py --input emnlp2025_results.pkl

📚 Citation

If you use PoSum-Bench or this analyzer in your research, please cite our paper:

@inproceedings{sun-etal-2025-posum,
    title = "{P}o{S}um-Bench: Benchmarking Position Bias in {LLM}-based Conversational Summarization",
    author = "Sun, Xu  and
      Delphin-Poulat, Lionel  and
      Tarnec, Christ{\`e}le  and
      Shimorina, Anastasia",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.404/",
    doi = "10.18653/v1/2025.emnlp-main.404",
    pages = "7985--8009",
    ISBN = "979-8-89176-332-6",
    abstract = "Large language models (LLMs) are increasingly used for zero-shot conversation summarization, but often exhibit positional bias{---}tending to overemphasize content from the beginning or end of a conversation while neglecting the middle. To address this issue, we introduce PoSum-Bench, a comprehensive benchmark for evaluating positional bias in conversational summarization, featuring diverse English and French conversational datasets spanning formal meetings, casual conversations, and customer service interactions. We propose a novel semantic similarity-based sentence-level metric to quantify the direction and magnitude of positional bias in model-generated summaries, enabling systematic and reference-free evaluation across conversation positions, languages, and conversational contexts.Our benchmark and methodology thus provide the first systematic, cross-lingual framework for reference-free evaluation of positional bias in conversational summarization, laying the groundwork for developing more balanced and unbiased summarization models."
}

🤝 Contributing

We welcome contributions to improve the analyzer! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Contact
