Dataset introduced in the paper "Analyzing the Dialect Diversity in Multi-document Summaries" (COLING 2022) by Olubusayo Olabisi, Aaron Hudson, Antonie Jetter, and Ameeta Agrawal.
DivSumm is a novel dataset consisting of dialect-diverse tweets and human-written extractive and abstractive summaries. It contains 90 tweets on each of 25 topics, written in multiple English dialects (African-American, Hispanic, and White), and two reference summaries per input.
input_docs - 90 tweets per topic, evenly distributed among the 3 dialects (30 tweets per dialect); 25 topics in total.
abstractive - Two annotators were asked to summarize each topic in 5 sentences using their own words.
extractive - Two annotators were asked to select 5 tweets from each topic that summarized the input tweets.
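
The field descriptions above imply a few simple size constraints, which the following minimal sketch checks. It is not part of the official release: the record layout (a dict keyed by `input_docs`, `abstractive`, and `extractive`, with each extractive summary stored as a list of tweet strings) and the helper name `check_topic` are assumptions made for illustration, so adapt them to the actual file format.

```python
from typing import Dict, List

# Sizes stated in the dataset description.
TWEETS_PER_TOPIC = 90
NUM_DIALECTS = 3
NUM_TOPICS = 25
SUMMARIES_PER_FIELD = 2    # two annotators per summary type
TWEETS_PER_EXTRACTIVE = 5  # each annotator selects 5 tweets


def tweets_per_dialect() -> int:
    """Tweets per dialect within one topic (even split across dialects)."""
    return TWEETS_PER_TOPIC // NUM_DIALECTS


def total_tweets() -> int:
    """Total number of input tweets across all topics."""
    return TWEETS_PER_TOPIC * NUM_TOPICS


def check_topic(record: Dict[str, list]) -> List[str]:
    """Return a list of problems; an empty list means the record matches
    the description. The record layout is hypothetical (see lead-in)."""
    problems = []
    docs = record["input_docs"]
    if len(docs) != TWEETS_PER_TOPIC:
        problems.append(f"expected {TWEETS_PER_TOPIC} tweets, got {len(docs)}")
    for field in ("abstractive", "extractive"):
        if len(record[field]) != SUMMARIES_PER_FIELD:
            problems.append(f"expected {SUMMARIES_PER_FIELD} {field} summaries")
    for i, summary in enumerate(record["extractive"]):
        if len(summary) != TWEETS_PER_EXTRACTIVE:
            problems.append(f"extractive summary {i} has {len(summary)} tweets")
        # Extractive summaries should only contain tweets from the input.
        if any(tweet not in docs for tweet in summary):
            problems.append(f"extractive summary {i} has a tweet not in input_docs")
    return problems
```

Calling `check_topic` on each of the 25 topic records and printing any non-empty result is a quick way to confirm that a loaded copy of the data matches the description before running experiments.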
You can find our paper here. If you use this dataset in your work, please cite our paper:
@inproceedings{olabisi-etal-2022-analyzing,
    title = "Analyzing the Dialect Diversity in Multi-document Summaries",
    author = "Olabisi, Olubusayo and Hudson, Aaron and Jetter, Antonie and Agrawal, Ameeta",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
}