StonyBrookNLP/PoMo

LUNR Lab.

PoMo: Post-Modifier dataset

Overview

PoMo is the dataset introduced in our NAACL 2019 paper, PoMo: Generating Entity-Specific Post-Modifiers in Context.

Our paper can be found here: https://www.aclweb.org/anthology/N19-1089

Post-Modifier Generation Task

A post-modifier is a short phrase that comes after an entity in a sentence and describes the entity in more detail. Post-modifiers are common in news articles. For example, in the sentence below, "the MIT professor and antiwar activist" is the post-modifier of Noam Chomsky.

Noam Chomsky, the MIT professor and antiwar activist, said Dr. Melman helped mobilize what once was weak and scattered resistance to war and other military operations.

We formulate the post-modifier generation task as a data-to-text generation problem, where the data consists of the context (a sentence without a post-modifier) and the set of known facts about the target entity. The text to be generated is a post-modifier that is relevant to the rest of the information conveyed in the sentence. The example below shows the input and output of the task.

Image of post-modifier generation task
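
The input/output shape described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the paper: the `PostModifierExample` class and the `strip_post_modifier` helper are hypothetical names, and the facts list is left empty here rather than invented (in the dataset, facts come from the entity's Wikidata claims).

```python
from dataclasses import dataclass, field


@dataclass
class PostModifierExample:
    """Hypothetical container for one task instance (not part of the dataset)."""
    context: str   # input: the sentence without the post-modifier
    entity: str    # input: the target entity
    target: str    # output: the post-modifier to generate
    facts: list = field(default_factory=list)  # input: known facts (omitted here)


def strip_post_modifier(sent, entity, post_modifier):
    """Remove the ', <post_modifier>,' span that directly follows the entity."""
    return sent.replace(f"{entity}, {post_modifier},", entity, 1)


sent = (
    "Noam Chomsky, the MIT professor and antiwar activist, said Dr. Melman "
    "helped mobilize what once was weak and scattered resistance to war and "
    "other military operations."
)
post_modifier = "the MIT professor and antiwar activist"

example = PostModifierExample(
    context=strip_post_modifier(sent, "Noam Chomsky", post_modifier),
    entity="Noam Chomsky",
    target=post_modifier,
)
print(example.context)
```

A model for this task would read `context`, `entity`, and `facts`, and be trained to produce `target`.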

Download

The dataset can be downloaded from https://github.com/StonyBrookNLP/PoMo

Citation

Please use the following BibTeX entry:

@inproceedings{Kang2019PoMo,
title={PoMo: Generating Entity-Specific Post-Modifiers in Context},
author={Jun Seok Kang and Robert L. Logan IV and Zewei Chu and Yang Chen and Dheeru Dua and Kevin Gimpel and Sameer Singh and Niranjan Balasubramanian},
booktitle={NAACL-HLT},
year={2019}
}

Dataset Information

  • Dataset Split

    The dataset is split into train/valid/test, along with their Wikidata entities. The split was done randomly, but there is no entity overlap across the splits. The splits show a similar distribution of entity occupations.

  • Dataset Sizes

    • train: 220,615 (Unique Entities: 55,367)
    • valid: 5,200 (Unique Entities: 1,257)
    • test: 5,242 (Unique Entities: 1,342)
  • Dataset Fields

    • Post-modifier instances (*.pm)

      • These files (train.pm/test.pm/valid.pm) have the following tab-separated fields:
        • sent_wo_post_modifier: A sentence without a post-modifier
        • entity_name: The entity that the post-modifier depends on
        • post_modifier: The post-modifier
        • sent: A full sentence with the post-modifier
        • wiki_id: Wikidata ID. Use this to look up the Wikidata entity from the accompanying file.
        • prev_sent: The previous sentence before [sent]. "n/a" if [sent] is the first sentence.
        • next_sent: The next sentence after [sent]. "n/a" if [sent] is the last sentence.
        • context_relevance_score: Crowdsourced rating of how sensitive the post-modifier of this instance is to its context
          • 1: not relevant, 5: relevant. For train, this field is set to 0.
        • file_info: The source of the instance: file path and line number. Since it is unique, it is used as the ID of the instance. If the value starts with a year (1987-2007), the instance is from the NYT corpus. For instances extracted from CNN and DailyMail, the value starts with "cnn" and "dm" respectively.
    • Wikidata entity (*.wiki)

      • These files (train.wiki/test.wiki/valid.wiki) have the following tab-separated fields:
        • Wikidata ID
        • entity_name: The Wikidata entity's label
        • aliases: Aliases of the label, separated by “,” if there is more than one.
        • descriptions: Descriptions of the Wikidata entity, separated by “,” if there is more than one.
        • claims: Processed claims of this entity, in JSON.
          • A list of

            {
               "property": [<field_name>,<value>],
               "qualifiers": [<field_name>,<value>],
               "use": [<fileinfo_of_pm_instance>, ...]    // A list of file_info values of the instances whose post-modifiers contain information from this claim.
            }
            
          • For qualifiers, if there is more than one field, the value becomes a list of lists, as below:

              {
                "property": [
                  "member of sports team",
                  "Wiggins"
                ],
                "qualifiers": [
                  [
                    "start time",
                    "+2015-04-30T00:00:00Z"
                  ],
                  [
                    "end time",
                    "+2016-12-31T00:00:00Z"
                  ]
                ]
              }
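
Because "qualifiers" can be either a single pair or a list of pairs, it can help to normalize it when loading. The following is a minimal sketch, assuming that each line of a *.wiki file holds the five fields in the order listed above and that the files have no header row (neither is confirmed here). The names `WIKI_FIELDS`, `normalize_qualifiers`, `parse_wiki_line`, and `claims_used_by` are hypothetical helpers, not part of the dataset.

```python
import json

# Assumed column order, taken from the field list above.
WIKI_FIELDS = ["wiki_id", "entity_name", "aliases", "descriptions", "claims"]


def normalize_qualifiers(qualifiers):
    """Return qualifiers as a list of [field_name, value] pairs."""
    if not qualifiers:
        return []
    if isinstance(qualifiers[0], list):
        return qualifiers      # already a list of pairs
    return [qualifiers]        # single flat pair -> wrap it


def parse_wiki_line(line):
    """Parse one tab-separated *.wiki line into a dict with decoded claims."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(WIKI_FIELDS):
        raise ValueError(f"expected {len(WIKI_FIELDS)} fields, got {len(values)}")
    row = dict(zip(WIKI_FIELDS, values))
    row["claims"] = json.loads(row["claims"])
    for claim in row["claims"]:
        claim["qualifiers"] = normalize_qualifiers(claim.get("qualifiers", []))
    return row


def claims_used_by(row, file_info):
    """Claims whose "use" list mentions the given post-modifier instance."""
    return [c for c in row["claims"] if file_info in c.get("use", [])]
```

After normalization, every claim exposes its qualifiers in the same shape, so downstream code can iterate over `claim["qualifiers"]` without checking which form was stored.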
            

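The *.pm files can be read in a similar way. This sketch assumes the nine fields appear in the order listed above, with no header row, and that the relevance score parses as a number; these are assumptions rather than documented guarantees. `PM_FIELDS`, `parse_pm_line`, `source_of`, and `read_pm` are hypothetical names introduced for illustration.

```python
# Assumed column order, taken from the field list above.
PM_FIELDS = [
    "sent_wo_post_modifier", "entity_name", "post_modifier", "sent",
    "wiki_id", "prev_sent", "next_sent", "context_relevance_score",
    "file_info",
]


def parse_pm_line(line):
    """Parse one tab-separated *.pm line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(PM_FIELDS):
        raise ValueError(f"expected {len(PM_FIELDS)} fields, got {len(values)}")
    row = dict(zip(PM_FIELDS, values))
    # Set to 0 for train; 1 (not relevant) to 5 (relevant) otherwise.
    row["context_relevance_score"] = float(row["context_relevance_score"])
    return row


def source_of(file_info):
    """Guess the source corpus from the file_info prefix."""
    if file_info.startswith("cnn"):
        return "cnn"
    if file_info.startswith("dm"):
        return "dailymail"
    year = file_info[:4]
    if year.isdigit() and 1987 <= int(year) <= 2007:
        return "nyt"
    return "unknown"


def read_pm(path):
    """Read a whole *.pm file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [parse_pm_line(line) for line in handle if line.strip()]
```

For example, `read_pm("valid.pm")` would return a list of dicts, and `source_of(row["file_info"])` could be used to group instances by corpus. The `wiki_id` field is the key for joining each instance with its entry in the matching *.wiki file.
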
Data Sources

We used various data sources to construct PoMo.

Contributors
