# Event co-reference task

News articles often report on incidents (e.g. a terrorist attack or a natural disaster). Different articles may report on the same incident but approach it from different perspectives. For instance, an article may have been written right after the event occurred, several weeks later (when new information is uncovered or follow-up events occur), or even years later (reflecting on its implications). There is a high degree of linguistic variation in how such events are expressed in text. It is very difficult for automatic systems to recognize whether two texts are about the same incident (i.e. to perform event coreference).

In the course of the Dutch FrameNet project, we have collected texts about real-world incidents using information about incidents from Wikidata and Wikipedia. Through these resources, we were able to compile corpora of news articles that all report on the same incident. The incidents can further be categorized into event types (e.g. mass shooting, royal wedding). This approach provides us with a challenging event-coreference dataset: we have texts reporting on the same incident written at different times after the event (potentially with a high degree of variation in how they refer to that incident), and texts about different incidents of the same type, which potentially share many similar expressions and can even be ambiguous.

Many of our texts have been manually annotated with semantic roles (as represented in FrameNet). In contrast to traditional semantic role labeling (SRL), we annotate frames at the level of the entire document rather than individual sentences. In addition, we record whether a frame expresses information about the target incident we are focusing on.
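To make the annotation scheme a bit more concrete, here is a minimal sketch of how one document-level frame annotation with a target-incident flag could be represented. The class name, field names and example values are all hypothetical and chosen for illustration; they are not the format used in the actual data release.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FrameAnnotation:
    """One frame annotated in a document (hypothetical schema, illustration only)."""
    frame: str                                            # FrameNet frame label
    predicate: str                                        # token(s) evoking the frame
    roles: Dict[str, str] = field(default_factory=dict)   # frame element -> text span
    about_target_incident: bool = False                   # frame describes the target incident?


def target_frames(annotations: List[FrameAnnotation]) -> List[FrameAnnotation]:
    """Keep only the frames flagged as describing the target incident."""
    return [a for a in annotations if a.about_target_incident]


# Made-up annotations for a single document.
doc_annotations = [
    FrameAnnotation("Killing", "shot", {"Killer": "a gunman", "Victim": "three people"}, True),
    FrameAnnotation("Statement", "said", {"Speaker": "the mayor"}, False),
]
evidence = target_frames(doc_annotations)
```

Under these assumptions, frames flagged as describing the target incident are the natural candidates for textual evidence that two documents report on the same incident.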

For this hackathon, we propose a binary event-coreference task containing three subtasks:


* Subtask 1: Are the two texts about the same incident? (gold: structured data information - target incidents)
* Subtask 2: Provide an explanation of your answer giving evidence from the text (gold: frame annotations)
* Subtask 3: Perform event linking: Name the incident or incidents the texts are about (gold: structured data information - target incidents)

Subtask 3 may seem particularly unrealistic for "a humble language model". However, given that models are being trained on virtually all data on the internet, it could be interesting to explore how much of this information the model captures and how reliably it can produce it.
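As a rough illustration of the binary setup in Subtask 1, the sketch below turns a mapping from incident IDs to document IDs into labeled document pairs. The function `build_pairs`, the toy mapping and the document IDs are assumptions made for this example; the sketch also assumes each document belongs to exactly one incident, and it is not necessarily how `utils_task` constructs the task data.

```python
import itertools
from typing import Dict, List, Tuple


def build_pairs(inc2docs: Dict[str, List[str]]) -> List[Tuple[str, str, int]]:
    """Return (doc_a, doc_b, label) for every unordered pair of documents.

    label is 1 when both documents report on the same incident, else 0.
    """
    # Invert the mapping: document ID -> incident ID.
    doc2inc = {doc: inc for inc, docs in inc2docs.items() for doc in docs}
    pairs = []
    for doc_a, doc_b in itertools.combinations(sorted(doc2inc), 2):
        label = int(doc2inc[doc_a] == doc2inc[doc_b])
        pairs.append((doc_a, doc_b, label))
    return pairs


# Toy mapping: incident ID -> document IDs (document IDs are made up).
toy = {
    "Q23012840": ["hesston_1", "hesston_2"],
    "Q4443728": ["seal_beach_1"],
}
pairs = build_pairs(toy)
```

Restricting the input mapping to incidents of a single event type would yield the harder negatives described above: different incidents that are likely to be described with similar expressions.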

Data: to be made available soon. 


In [1]:
import json
import pandas as pd
import os
from lxml import etree as et
from collections import defaultdict
import itertools

import utils_structured
import utils_docs
import utils_naf
import utils_task

In [2]:
path_data = '../dfn-data-cleaning/data-headlines/unstructured'
path_structured = '../DFNDataReleases/structured/'

In [3]:
type2inc, type2label, inc2label, inc2lang2doc, annotation_status_dict, inc2str = utils_structured.load_structured_data(path_structured)

In [4]:
# Overview
# table: event type, inc, annotated docs, not annotated docs, not found

lang = 'en'
df_en = utils_structured.generate_overview(lang, type2inc, type2label, inc2label, inc2lang2doc, annotation_status_dict, inc2str)
lang = 'nl'
df_nl = utils_structured.generate_overview(lang, type2inc, type2label, inc2label, inc2lang2doc, annotation_status_dict, inc2str)

table written to: overview_en.csv
table written to: overview_nl.csv


In [5]:
df_shooting = df_en[df_en['typeID'] == 'Q63442071']
print(len(df_shooting))

17


In [6]:

df_en_higher10 = df_en[df_en['annotated-manual'] > 9]
df_en_higher10

Unnamed: 0,type,typeID,inc,incID,inc_time,total,file not found,annotated-manual,file exists,annotated-system,annotated-deprecated
77,mass shooting,Q21480300,2016 Hesston shooting,Q23012840,2016-02-25,15,,14.0,1.0,,
100,mass shooting,Q21480300,2011 Seal Beach shooting,Q4443728,2011-10-12,12,,12.0,,,
108,mass shooting,Q21480300,2014 Fort Hood shooting,Q16016666,2014-04-02,12,,12.0,,,
123,mass shooting,Q21480300,2019 Utrecht shooting,Q62090804,0001-01-01,22,5.0,17.0,,,
127,mass shooting,Q21480300,2015 Bamako hotel attack,Q21516923,2015-11-20,14,2.0,12.0,,,
136,mass shooting,Q21480300,2015 San Bernardino shooting,Q21613643,2015-12-02,15,,14.0,,1.0,
139,mass shooting,Q21480300,2016 Citronelle homicides,Q27629584,2016-08-20,14,1.0,12.0,,1.0,
140,mass shooting,Q21480300,2016 Kalamazoo shootings,Q22910769,2016-02-20,19,2.0,16.0,,1.0,
150,mass shooting,Q21480300,2015 Chattanooga shootings,Q20671234,2015-07-16,13,,13.0,,,
154,mass shooting,Q21480300,"2015 Charleston, South Carolina shooting",Q20154675,2015-06-17,15,,15.0,,,


In [7]:

df_nl_higher10 = df_nl[df_nl['annotated-manual'] > 9]
df_nl_higher10

Unnamed: 0,type,typeID,inc,incID,inc_time,total,file not found,file exists,annotated-manual
74,mass shooting,Q21480300,Alphen aan den Rijn shopping mall shooting,Q473866,2011-04-09,12,2.0,,10.0
123,mass shooting,Q21480300,2019 Utrecht shooting,Q62090804,0001-01-01,49,6.0,,43.0
210,Eurovision Song Contest,Q276,2021 Dutch curfew riots,Q105077032,0001-01-01,43,4.0,1.0,38.0
215,disease outbreak,Q3241045,COVID-19 pandemic in the Netherlands,Q86756826,0001-01-01,333,318.0,2.0,13.0
234,aircraft shootdown,Q6539177,Malaysia Airlines Flight 17,Q17374096,0001-01-01,135,26.0,,109.0
417,music festival,Q868557,Eurovision Song Contest 2020,Q30973589,2020-01-01,37,9.0,2.0,26.0
420,music festival,Q868557,Eurovision Song Contest 2019,Q9095390,0001-01-01,12,2.0,,10.0


In [11]:
# selected types

types_selected = ['legal case', 'mass shooting', 'royal wedding']
lang = 'en'
for type_selected in types_selected:
    utils_task.create_event_type_data(type_selected, df_en, path_data, lang, inc2label, inc2lang2doc, inc2str, min_doc=10)

Incident info: Enumclaw horse sex case Q101209661
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q30 | United States of America'], 'sem:hasTimeStamp': ['2005-07-01T00:00:00UTC | 2005-07-01T00:00:00UTC']}

Incident info: Cambodian–Thai border dispute Q168329
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q869 | Thailand'], 'sem:hasTimeStamp': ['2011-12-15T00:00:00UTC | 2011-12-15T00:00:00UTC']}

Incident info: 2015 FIFA corruption case Q19984977
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q39 | Switzerland'], 'sem:hasTimeStamp': ['2015-01-01T00:00:00UTC | 2015-01-01T00:00:00UTC']}

Incident info: Enumclaw horse sex case Q101209661
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q30 | United States of America'], 'sem:hasTimeStamp': ['2005-07-01T00:00:00UTC | 2005-07-01T00:00:00UTC']}

Incident info: Cambodian–Thai border dispute Q168329
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q869 | Thailand'], 'sem:hasTimeStamp': ['2011-12-15T00:00:00UTC | 2011-12-15T00:00:00UTC'

In [12]:
types_selected = ['mass shooting', 'music festival']
lang = 'nl'
for type_selected in types_selected:
    utils_task.create_event_type_data(type_selected, df_nl, path_data, lang, inc2label, inc2lang2doc, inc2str, min_doc=10)

Incident info: Alphen aan den Rijn shopping mall shooting Q473866
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q29999 | Kingdom of the Netherlands'], 'sem:hasTimeStamp': ['2011-04-09T00:00:00UTC | 2011-04-09T00:00:00UTC']}

Incident info: 2019 Utrecht shooting Q62090804
{}

Incident info: Alphen aan den Rijn shopping mall shooting Q473866
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q29999 | Kingdom of the Netherlands'], 'sem:hasTimeStamp': ['2011-04-09T00:00:00UTC | 2011-04-09T00:00:00UTC']}

Incident info: 2019 Utrecht shooting Q62090804
{}

Created dataset for event type: mass shooting
Written dataset to: task_data/mass-shooting_nl.csv
Frame info written to: task_data/mass-shooting_nl_frames.json
Incident info: Eurovision Song Contest 2020 Q30973589
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q55 | Netherlands'], 'sem:hasTimeStamp': ['2020-01-01T00:00:00UTC | 2020-01-01T00:00:00UTC']}

Incident info: Eurovision Song Contest 2019 Q9095390
{}

Incident info: Eurovision