SimpleDelete

This repository hosts the SimpleDelete corpus and the code used to create it.

The corpus can be found in TSV format in the file: simpleDeleteCorpus_Filtered.tsv

This file contains one instance per line and has the form:

The target word should appear in the context at the span bounded by the begin and end offsets. This target word is a candidate for deletion from that context.
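
As a quick sanity check on an instance, the relationship between the target word and the offsets can be sketched in Java as below. This is an illustration only, not code from this repository: the class and method names are hypothetical, the example sentence is made up, and it assumes 0-based character offsets with an exclusive end offset, which should be verified against the corpus file before relying on it.

```java
// Hypothetical helper, not part of this repository.
// Assumption: offsets are 0-based character positions and the end offset is exclusive.
final class OffsetCheck {

    // Returns true if the span [begin, end) of the context is exactly the target word.
    static boolean targetMatches(String context, int begin, int end, String target) {
        if (begin < 0 || end > context.length() || begin > end) {
            return false; // offsets fall outside the context
        }
        return context.substring(begin, end).equals(target);
    }

    // Returns the context with the span [begin, end) removed,
    // collapsing the whitespace left behind by the deletion.
    static String deleteSpan(String context, int begin, int end) {
        String joined = context.substring(0, begin) + context.substring(end);
        return joined.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Made-up example values, not taken from the corpus.
        String context = "The cat was very happy.";
        String target = "very";
        int begin = 12;
        int end = 16;

        System.out.println(targetMatches(context, begin, end, target));
        System.out.println(deleteSpan(context, begin, end));
    }
}
```

If the check fails for real instances, the offset convention assumed here (for example an inclusive end offset) is the first thing to revisit.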

This work was published at the Second Workshop on Text Simplification, Accessibility and Readability, co-located with RANLP 2023 in Varna, Bulgaria.

The paper is available via the ACL Anthology and is also included in this repository as 2023.tsar-1.5.pdf

https://aclanthology.org/2023.tsar-1.5/

The bibtex for citation is below:

@inproceedings{shardlow-przybyla-2023-simplification,
    title = "Simplification by Lexical Deletion",
    author = "Shardlow, Matthew  and
      Przyby{\l}a, Piotr",
    editor = "{\v{S}}tajner, Sanja  and
      Saggio, Horacio  and
      Shardlow, Matthew  and
      Alva-Manchego, Fernando",
    booktitle = "Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.tsar-1.5",
    pages = "44--50",
    abstract = "Lexical simplification traditionally focuses on the replacement of tokens with simpler alternatives. However, in some cases the goal of this task (simplifying the form while preserving the meaning) may be better served by removing a word rather than replacing it. In fact, we show that existing datasets rely heavily on the deletion operation. We propose supervised and unsupervised solutions for lexical deletion based on classification, end-to-end simplification systems and custom language models. We contribute a new silver-standard corpus of lexical deletions (called SimpleDelete), which we mine from simple English Wikipedia edit histories and use to evaluate approaches to detecting superfluous words. The results show that even unsupervised approaches (TerseBERT) can achieve good performance in this new task. Deletion is one part of the wider lexical simplification puzzle, which we show can be isolated and investigated.",
}
