GitHub - KMCS-NII/AASC: AASC: ACL Anthology Sentence Corpus

AASC: ACL Anthology Sentence Corpus

AASC is a corpus of natural language text extracted from scientific papers. It contains 2,339,195 sentences from PDF-format papers from the ACL Anthology [1], a comprehensive scientific paper repository on computational linguistics and natural language processing.

For PDF document analysis, we use PDFNLT 1.0 [2], a PDF paper analysis tool specifically trained for the ACL Anthology. After excluding papers with non-standard structures (e.g., no abstract or no references), the remaining 13,923 papers were further processed by (1) sentence splitting and (2) section type labeling.

The ACL_2018_v2.tar.gz file contains the extracted natural language sentences for each <paper_ID>, where the <paper_ID> is the unique identifier of the paper on the ACL Anthology. The corresponding PDF version can be found using the URL: http://aclweb.org/anthology/<paper_ID>.
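As a minimal sketch of how the archive might be traversed, the Python below walks the tarball and maps each sentence file back to its PDF URL. The internal directory layout of the archive is not described here, so the code only assumes that sentence files end in `.ss`; the helper names (`pdf_url`, `paper_id_from_member`, `iter_sentence_files`) and the UTF-8 encoding are assumptions for illustration, not part of AASC.

```python
import posixpath
import tarfile
from typing import Iterator, Tuple

# URL prefix quoted in this README.
ANTHOLOGY_BASE = "http://aclweb.org/anthology/"


def pdf_url(paper_id: str) -> str:
    """Build the PDF URL for a paper ID such as 'A00-1001'."""
    return ANTHOLOGY_BASE + paper_id


def paper_id_from_member(name: str) -> str:
    """Derive the paper ID from an archive member name ending in '.ss'."""
    base = posixpath.basename(name)
    return base[: -len(".ss")] if base.endswith(".ss") else base


def iter_sentence_files(archive_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (paper_id, file_text) for every '.ss' member of the archive.

    Assumption: the files are UTF-8 encoded; adjust if decoding fails.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".ss")):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            yield paper_id_from_member(member.name), handle.read().decode("utf-8")
```

For example, `for paper_id, text in iter_sentence_files("ACL_2018_v2.tar.gz"): print(pdf_url(paper_id))` would list one URL per paper without unpacking the archive to disk.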

Each sentence file is named <paper_ID>.ss. Each line of the file describes one sentence as three tab-separated values:

Column	Example (A00-1001.ss)
Sentence ID	`s-1-1-0-0`
Section type	`abstract`
Sentence text	`The paper describes a natural language based expert system route advisor for the public bus transport in Trondheim, Norway.`
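The three-column layout above can be read with a few lines of Python. This is a sketch under stated assumptions: the `Sentence` type and the function names are hypothetical, and the code assumes no header row and exactly the column order shown in the table.

```python
from typing import Iterable, Iterator, NamedTuple


class Sentence(NamedTuple):
    sentence_id: str   # e.g. "s-1-1-0-0"
    section_type: str  # e.g. "abstract"
    text: str          # the sentence itself


def parse_ss_line(line: str) -> Sentence:
    """Parse one line of a <paper_ID>.ss file."""
    # Split on the first two tabs only, so a stray tab inside the
    # sentence text does not break the record.
    sentence_id, section_type, text = line.rstrip("\n").split("\t", 2)
    return Sentence(sentence_id, section_type, text)


def read_ss(lines: Iterable[str]) -> Iterator[Sentence]:
    """Parse every non-empty line from an open file or a list of lines."""
    for line in lines:
        if line.strip():
            yield parse_ss_line(line)
```

A typical use would be `with open("A00-1001.ss", encoding="utf-8") as f: sentences = list(read_ss(f))`, where the UTF-8 encoding is again an assumption.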

A simple dictionary-based classifier was used for the section type labeling.
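To illustrate what a dictionary-based section labeler can look like, here is a minimal Python sketch. It is not the classifier shipped in this repository: the keyword dictionary, the label names other than `abstract`, and the fallback label `other` are all hypothetical and chosen only for the example.

```python
# Hypothetical keyword dictionary: label -> substrings that suggest it.
# Only the "abstract" label is taken from the example above; the rest
# are illustrative assumptions.
SECTION_KEYWORDS = {
    "abstract": ("abstract",),
    "introduction": ("introduction",),
    "related_work": ("related work", "previous work"),
    "conclusion": ("conclusion", "concluding remarks"),
}


def classify_section(title: str, default: str = "other") -> str:
    """Map a section heading to a section type by keyword lookup."""
    # Lowercase and collapse whitespace so "Related  Work" still matches.
    normalized = " ".join(title.lower().split())
    for label, keywords in SECTION_KEYWORDS.items():
        if any(keyword in normalized for keyword in keywords):
            return label
    return default
```

Under these assumptions, `classify_section("1 Introduction")` yields `introduction`, and a heading that matches no keyword falls back to `other`.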

For details, see also our Overview of AASC.

Following the copyright policy of the original ACL Anthology, AASC materials are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License.
This work was supported by the National Institute of Informatics and JST CREST JPMJCR1513.
