PrefScore

Code for PrefScore: Pairwise Preference Learning for Reference-free Summarization Quality Assessment

Requirements

Files

pre/            Code for negative sampling
human/          Code for human evaluation
config.py       Config file for folder and training settings
model.py        Script for training the models
evaluate.py     Script for evaluating the trained models on target datasets

Negative sampling (preprocessing)

Code for generating negative samples is in the pre/ folder.

cd pre
python3 ordered_generation.py  

Edit pre/sentence_conf.py to change negative sampling settings.
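
For intuition, the negative samples are inferior summaries produced by corrupting base summaries (for example, dropping or reordering sentences). The sketch below only illustrates this general idea; the actual corruption strategies and their settings live in pre/ordered_generation.py and pre/sentence_conf.py, and the function names here (delete_sentences, shuffle_sentences) are illustrative, not the repo's API.

# Illustrative sketch of summary corruption for negative sampling.
# Not the repo's implementation; see pre/ for the real strategies.
import random

def delete_sentences(summary, ratio=0.3, seed=0):
    """Drop a fraction of sentences to produce an inferior summary."""
    rng = random.Random(seed)
    sents = [s.strip() for s in summary.split(".") if s.strip()]
    keep = max(1, int(len(sents) * (1 - ratio)))
    kept = sorted(rng.sample(range(len(sents)), keep))
    return ". ".join(sents[i] for i in kept) + "."

def shuffle_sentences(summary, seed=0):
    """Scramble sentence order to damage coherence."""
    rng = random.Random(seed)
    sents = [s.strip() for s in summary.split(".") if s.strip()]
    rng.shuffle(sents)
    return ". ".join(sents) + "."

base = "The bill funds rural clinics. It caps insulin prices. It takes effect in 2024."
negatives = [delete_sentences(base), shuffle_sentences(base)]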

Training

Run python3 model.py -h for full command line arguments.

Example (training on the preprocessed billsum dataset):

python3 model.py --dataset billsum
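
At a high level, training follows pairwise preference (Bradley-Terry) learning: the scorer is pushed to rank a base summary above its corrupted counterparts. Below is a minimal sketch of such a loss in PyTorch; the actual training loop and Scorer interface are in model.py, and this is not the repo's exact implementation.

# Minimal sketch of a Bradley-Terry pairwise preference loss (illustrative only).
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_better, score_worse):
    """P(better > worse) = sigmoid(s_better - s_worse); minimize its negative log-likelihood."""
    return -F.logsigmoid(score_better - score_worse).mean()

# Toy scores for a batch of (better summary, corrupted summary) pairs.
s_better = torch.tensor([1.2, 0.7, 0.3], requires_grad=True)
s_worse = torch.tensor([0.4, 0.9, -0.1])
loss = bradley_terry_loss(s_better, s_worse)
loss.backward()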

Evaluating

To evaluate a trained model on newsroom, realsumm, or tac2010, see the human/ folder for detailed instructions on obtaining the processed files:

  • human/newsroom/newsroom-human-eval.csv
  • human/realsumm/realsumm_100.tsv
  • human/tac/TAC2010_all.json

Run python3 evaluate.py -h for full command line arguments.

Example (evaluating the model trained on billsum against newsroom):

python3 evaluate.py --dataset billsum --target newsroom

Alignment with human evaluations

Code for computing the correlation between our models' predictions and human ratings from the three datasets is in the human/ folder.
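
As a minimal sketch (assuming one model score and one aggregated human rating per summary), such a correlation can be computed with scipy; the repo's own scripts in human/ may differ in details such as the aggregation level.

# Illustrative correlation between model scores and human ratings (toy numbers).
from scipy.stats import pearsonr, spearmanr

model_scores = [0.81, 0.42, 0.63, 0.90]   # scorer outputs, one per summary
human_ratings = [4.5, 2.0, 3.5, 5.0]      # e.g. averaged annotator scores

print("Pearson:", pearsonr(model_scores, human_ratings)[0])
print("Spearman:", spearmanr(model_scores, human_ratings).correlation)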

Misc

  1. To evaluate on a custom dataset, format it as a TSV file where each line starts with a document, followed by several summaries of that document, all separated by '\t'. See example.tsv for an example; a short sketch for writing such a file appears after item 2 below.

  2. Example of using the metric in a script:

import torch
import config as CFG
from model import Scorer
from evaluate import evaluate

# CKPT_PATH is the path to a pretrained .pth checkpoint file
scorer = Scorer()
scorer.load_state_dict(torch.load(CKPT_PATH, map_location=CFG.DEVICE))
scorer.to(CFG.DEVICE)
scorer.eval()

# Test example: score the documents and their summaries
docs = ["This is a document.", "This is another document."]
sums = ["This is summary1.", "This is summary2."]
results = evaluate(docs, sums, scorer)
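
Returning to item 1, here is a short sketch of writing a custom dataset in the expected TSV layout (each line is a document followed by its summaries, tab-separated); the file name my_dataset.tsv is only an example.

# Sketch of writing a custom dataset in the expected TSV layout.
rows = [
    ("Document one text.", ["First summary.", "Second summary."]),
    ("Document two text.", ["Only summary."]),
]
with open("my_dataset.tsv", "w", encoding="utf-8") as f:
    for doc, summaries in rows:
        f.write("\t".join([doc] + summaries) + "\n")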

Cite

@inproceedings{luo-etal-2022-prefscore,
    title = "{P}ref{S}core: Pairwise Preference Learning for Reference-free Summarization Quality Assessment",
    author = "Luo, Ge  and
      Li, Hebi  and
      He, Youbiao  and
      Bao, Forrest Sheng",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.515",
    pages = "5896--5903",
    abstract = "Evaluating machine-generated summaries without a human-written reference summary has been a need for a long time. Inspired by preference labeling in existing work of summarization evaluation, we propose to judge summary quality by learning the preference rank of summaries using the Bradley-Terry power ranking model from inferior summaries generated by corrupting base summaries. Extensive experiments on several datasets show that our weakly supervised scheme can produce scores highly correlated with human ratings.",
}
