
AASC: ACL Anthology Sentence Corpus


AASC is a corpus of natural language text extracted from scientific papers. It contains 2,339,195 sentences from PDF-format papers from the ACL Anthology [1], a comprehensive scientific paper repository on computational linguistics and natural language processing.

For PDF document analysis, we use PDFNLT 1.0 [2], a PDF paper analysis tool specifically trained for the ACL Anthology. After excluding papers with non-standard structures (e.g., no abstract or no references), the remaining 13,923 papers were further processed by (1) sentence splitting and (2) section type labeling.

The ACL_2018_v2.tar.gz file contains the extracted natural language sentences for each <paper_ID>, where the <paper_ID> is the unique identifier of the paper on the ACL Anthology. The corresponding PDF version can be found using the URL:<paper_ID>.
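As a minimal sketch of how the archive might be read without unpacking it to disk, the snippet below walks a gzip-compressed tar file and yields the contents of each sentence file. The internal directory layout of the archive and the UTF-8 encoding are assumptions, and `iter_sentence_files` is a hypothetical helper name, not part of AASC.

```python
import tarfile
from pathlib import PurePosixPath
from typing import Iterator, Tuple


def iter_sentence_files(archive_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (paper_id, file_text) for every .ss member in the archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            name = PurePosixPath(member.name)
            # Assumption: sentence files are regular files ending in ".ss",
            # possibly nested inside one or more directories.
            if not member.isfile() or name.suffix != ".ss":
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            yield name.stem, handle.read().decode("utf-8", errors="replace")
```

Calling `iter_sentence_files("ACL_2018_v2.tar.gz")` would then give one `(paper_ID, text)` pair per paper, with the paper ID taken from the file name.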

Each sentence file is named <paper_ID>.ss, and each line in it holds the tab-separated values for one sentence:

| Column | Example |
| --- | --- |
| Sentence ID | s-1-1-0-0 |
| Section type | abstract |
| Sentence text | The paper describes a natural language based expert system route advisor for the public bus transport in Trondheim, Norway. |
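
The three-column line format above can be parsed with a few lines of Python. This is a sketch under the assumption that every line has exactly the three fields in the order shown (sentence ID, section type, sentence text); the `Sentence` type and the function names are illustrative, not part of AASC.

```python
from typing import Iterator, NamedTuple


class Sentence(NamedTuple):
    sentence_id: str
    section_type: str
    text: str


def parse_ss_line(line: str) -> Sentence:
    """Parse one tab-separated line of a <paper_ID>.ss file."""
    # Split on the first two tabs only, so a stray tab inside the
    # sentence text stays part of the text field.
    fields = line.rstrip("\r\n").split("\t", 2)
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    return Sentence(*fields)


def parse_ss_text(content: str) -> Iterator[Sentence]:
    """Parse the full text of a .ss file, skipping blank lines."""
    for line in content.split("\n"):
        if line.strip():
            yield parse_ss_line(line)
```

For example, `parse_ss_line("s-1-1-0-0\tabstract\tSome sentence.")` returns a `Sentence` whose `section_type` is `"abstract"`.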

A simple dictionary-based classifier was used for the section type labeling.
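
To illustrate what a dictionary-based labeler can look like in general, here is a small sketch that maps a section heading to a section type by keyword lookup. The keyword dictionary, the label names other than `abstract`, and the `label_section` function are all hypothetical; the actual dictionary and label set used for AASC are not reproduced here.

```python
from typing import Dict

# Hypothetical keyword-to-label dictionary for illustration only;
# this is NOT the mapping used to build AASC.
SECTION_KEYWORDS: Dict[str, str] = {
    "abstract": "abstract",
    "introduction": "introduction",
    "related work": "related_work",
    "experiment": "experiment",
    "conclusion": "conclusion",
}


def label_section(heading: str, default: str = "other") -> str:
    """Return the label of the first keyword found in the heading."""
    normalized = heading.lower()
    for keyword, label in SECTION_KEYWORDS.items():
        if keyword in normalized:
            return label
    # No keyword matched, so fall back to a catch-all label.
    return default
```

A lookup of this kind is easy to inspect and extend, at the cost of missing headings that use wording outside the dictionary.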

For details, see our Overview of AASC.

