Skip to content

A ‘thin’ version of the dataset that contains the annotated data in combination with the URLs of the scraped texts and the scraping tools

License

Notifications You must be signed in to change notification settings

TallChris91/CACAPO-Dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The CACAPO Dataset

CACAPO is a data-to-text dataset that contains sentences from news reports for the sports, weather, stock, and incidents domain in English and Dutch, aligned with relevant attribute-value paired data. This is the first data-to-text dataset based on "naturally occurring" human-written texts (i.e., texts that were not collected in a task-based setting), that covers various domains, as well as multiple languages. A description of the dataset can be found here.

This GitHub page

This GitHub page contains the "thin" version of the texts in the dataset, which means the annotated data, combined with links to the collected texts. Furthermore it contains the tools used to collect these texts and tools to calculate statistics for the dataset.

The full dataset

The full dataset is available via DataVerse. The format is the same as that of (Enriched) WebNLG (v1.5). Meaning that the dataset consists of XML files that also contain intermediate representations to enable the development and evaluation of pipeline data-to-text architectures that encompass tasks such as Discourse Ordering, Lexicalization, Aggregation and Referring Expression Generation.

Example

<entry category="EnglishIncidents" eid="Id118" size="2">
  <originaltripleset>
    <otriple>victimAge | 22-year-old</otriple>
    <otriple>victimStatus | grazed_in_the_thigh</otriple>
  </originaltripleset>
  <modifiedtripleset>
    <mtriple>victimAge | 22-year-old</mtriple>
    <mtriple>victimStatus | grazed_in_the_thigh</mtriple>
	</modifiedtripleset>
  <lex comment="good" lid="Id1">
    <sortedtripleset>
      <sentence ID="1">
        <striple>victimAge | 22-year-old</striple>
        <striple>victimStatus | grazed_in_the_thigh</striple>
      </sentence>
    </sortedtripleset>
    <references>
      <reference entity="22-year-old" number="1" tag="PATIENT-1" type="description">A 22-year-old</reference>
      <reference entity="grazed_in_the_thigh" number="2" tag="PATIENT-2" type="description">grazed in the thigh</reference>
    </references>
    <text>A 22-year-old was grazed in the thigh</text>
    <template>PATIENT-1 was PATIENT-2</template>
    <lexicalization>PATIENT-1 VP[aspect=simple,tense=past,voice=active,person=null,number=singular] be PATIENT-2</lexicalization>
  </lex>
  <entitymap>
    <entity>PATIENT-1 | 22-year-old</entity>
    <entity>PATIENT-2 | grazed_in_the_thigh</entity>
  </entitymap>
</entry>

The name "CACAPO"

A kakapo

Credits: Andrew Digby, Twitter: @takapodigs

We are aware that the acronym for this dataset is far-fetched, and have also been made aware that the name is phonetically similar to a naughty French word. However, the name of the dataset is a thinly-veiled excuse to raise awareness for the Kākāpō. The kākāpō is the world's only flightless parrot, and unfortunately, they are critically endangered. There are only 197 kākāpō alive today, but it is not too late to save this beautiful animal from extinction.

If you would like to support the conservation efforts, you can donate to Kākāpō Recovery.

About

A ‘thin’ version of the dataset that contains the annotated data in combination with the URLs of the scraped texts and the scraping tools

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages