HistSumm

Code and data for Summarising Historical Text in Modern Languages (EACL 2021)

TL;DR

Historical text summarisation is the task of summarising documents written in a historical form of a language in the corresponding modern language. We are the first to explore this fascinating and meaningful direction, which we approach through cross-lingual transfer learning. Our repo contains:

  • Testsets for German and Chinese, annotated by experts
  • Preprocessing code
  • Trained & aligned embeddings
  • Neural summariser

For more info, please read our paper or create an issue!

Corpus

Some of the entries may be puzzling, so we explain them here:

| entry | description |
| --- | --- |
| (de) germanc_file | the index of the story in the GermanC dataset |
| (de) region | historical German has dialects, so it is important to log the geographic source of each text |
| (zh) source | the academic paper from which we obtained the piece of Wanli Gazette news |
| human_eval_scores | annotation scores from expert validation |
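
As a rough illustration of how the entries described above might be read, here is a minimal Python sketch. It rests on several assumptions that this README does not confirm: that the test sets are stored as JSON, that each record carries a `lang` key (hypothetical), and that `human_eval_scores` is a list of numbers. The sample values are invented placeholders, not real corpus data, so please check the actual files before relying on it.

```python
import json

# Hypothetical records: the JSON layout, the "lang" key and all values
# below are assumptions made for illustration, not real corpus data.
SAMPLE_JSON = """
[
  {"lang": "de", "germanc_file": "example_story_id", "region": "example_region",
   "human_eval_scores": [4, 5]},
  {"lang": "zh", "source": "example_paper_reference",
   "human_eval_scores": [5, 5]}
]
"""

# Language-specific metadata entries named in the Corpus section.
LANGUAGE_FIELDS = {
    "de": ("germanc_file", "region"),
    "zh": ("source",),
}


def describe_entry(entry):
    """Return the language-specific metadata and the mean expert score."""
    fields = LANGUAGE_FIELDS[entry["lang"]]
    metadata = {name: entry.get(name) for name in fields}
    # Assumes the scores are numeric; returns None when none are present.
    scores = entry.get("human_eval_scores", [])
    mean_score = sum(scores) / len(scores) if scores else None
    return metadata, mean_score


entries = json.loads(SAMPLE_JSON)
for entry in entries:
    print(describe_entry(entry))
```

Keeping the per-language field names in one mapping makes it easy to adjust the sketch once the real file layout is known.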

Code and model

Preprocessing

As suggested by anonymous reviewers, we release our preprocessing code, both to allow reproducibility checks and to aid future studies on historical language processing. The code is provided as documented Jupyter Notebooks.

Embeddings and summariser code

To be released near the conference (late April) 👀

About

If you like our project or find it useful, please give us a ⭐ and cite us:

@inproceedings{HistSumm-2021,
    title = {Summarising Historical Text in Modern Languages},
    author = {Xutan Peng and Yi Zheng and Chenghua Lin and Advaith Siddharthan},
    year = {2021},
    booktitle = {Proceedings of the 16th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 1, Long Papers},
    address = {Online},
    publisher = {Association for Computational Linguistics}
}