# Event co-reference task

News articles often report on incidents (e.g. a terrorist attack or a natural disaster). Different articles may report on the same incident, but approach it from different perspectives. For instance, an article may have been written right after the event occured, several weeks after the event occured (when new information is uncovered or follow-up events occur) or even years after the event (reflecting on its implications). There is a high degree of linguistic variation in how such events are expressed in text. It is very difficult for automatic systems to recognize if two texts are about the same indicident or not (i.e. perform event coreference).

In the course of the Dutch FrameNet project, we have collected texts about real-world incidents using information about indicdents from Wikidata and Wikipedia. Through these resources, we were able to compile corpora of news articles that all report on the same indicent. The indicents can further be categorized in event types (e.g. mass shooting, royal wedding, etc.). This approach provides us with a challenging event-coreference dataset; we have texts reporting on the same incident written at different times after the event (and potentially containing a high degree of variation in referring to the same indicent) and texts about different incidents of the same type, thus potentially containing a high degree of similar expressions and even ambiguity. 

Many of our texts have manually been annotated in terms of semantic roles (as represented in FrameNet). In contrast to traditional SRL, we annotate frames on the level of the entire document instead of on the level of sentences. In addition, we capture information about whether a frame expresses information about target incident we are focusing on. 

For this hackathon, we propose a binary event-coreference task containing three subtasks:


* Subtask 1: Are the two texts about the same incident? (gold: structured data information - target incidents)
* Subtask 2: Provide an explanation of your answer giving evidence from the text (gold: frame annotations)
* Subtask 3: Perform event linking: Name the incident or incidents the texts are about (gold: structured data information - target incidents)

Subtask 3 may seem particularly uncrealistic for "a humble language model". However, given that models are being trained on virtually all data on the internet, it could be interesting to explore how much of this information the model captures and how reliably it can produce it. 

Data: to be made available soon. 


In [1]:
import json
import pandas as pd
import os
from lxml import etree as et
import itertools
from collections import defaultdict

from utils_structured import generate_overview
import utils_docs
import utils_naf
import utils_task

In [2]:
path_data = '../dfn-data-cleaning/data-headlines/unstructured'
path_structured = f'../DFNDataReleases/structured/'

In [3]:
type2inc_path = f'{path_structured}/type2inc_index.json'
inc2str_path = f'{path_structured}/inc2str_index.json'

type2label_path = f'{path_structured}/type2label.json'
inc2label_path = f'{path_structured}/inc2label.json'

inc2lang2doc_path = f'{path_structured}/inc2lang2doc_index.json'
annotation_status_path = f'{path_structured}/annotation_status.json'

with open(type2inc_path) as infile:
    type2inc = json.load(infile)
     
with open(type2label_path) as infile:
    type2label = json.load(infile)
    
with open(inc2label_path) as infile:
    inc2label = json.load(infile)
  
with open(inc2lang2doc_path) as infile:
    inc2lang2doc = json.load(infile)
    
with open(annotation_status_path) as infile:
    annotation_status_dict = json.load(infile)
    
with open(f'{path_structured}/inc2str_index.json') as infile:
    inc2str = json.load(infile)

In [4]:
# Overview
# table: event type, inc, annotated docs, not annotated docs, not found

lang = 'en'
df_en = generate_overview(lang, type2inc, type2label, inc2label, inc2lang2doc, annotation_status_dict)
lang = 'nl'
df_nl = generate_overview(lang, type2inc, type2label, inc2label, inc2lang2doc, annotation_status_dict)

table written to: overview_en.csv
table written to: overview_nl.csv


In [5]:
df_shooting = df_en[df_en['typeID'] == 'Q63442071']
print(len(df_shooting))

17


In [6]:

df_en_higher10 = df_en[df_en['annotated-manual'] > 4]
df_en_higher10

Unnamed: 0,type,typeID,inc,incID,total,file not found,annotated-manual,file exists,annotated-system,annotated-deprecated
77,mass shooting,Q21480300,2016 Hesston shooting,Q23012840,15,,14.0,1.0,,
96,mass shooting,Q21480300,2010 University of Alabama in Huntsville shooting,Q2071916,10,3.0,6.0,,1.0,
100,mass shooting,Q21480300,2011 Seal Beach shooting,Q4443728,12,,12.0,,,
108,mass shooting,Q21480300,2014 Fort Hood shooting,Q16016666,12,,12.0,,,
118,mass shooting,Q21480300,2012 Azana Spa shootings,Q4624844,6,1.0,5.0,,,
123,mass shooting,Q21480300,2019 Utrecht shooting,Q62090804,22,5.0,17.0,,,
127,mass shooting,Q21480300,2015 Bamako hotel attack,Q21516923,14,2.0,12.0,,,
135,mass shooting,Q21480300,2011 Tucson shooting,Q757986,10,2.0,8.0,,,
136,mass shooting,Q21480300,2015 San Bernardino shooting,Q21613643,15,,14.0,,1.0,
139,mass shooting,Q21480300,2016 Citronelle homicides,Q27629584,14,1.0,12.0,,1.0,


In [7]:
# get texts of two shooting events


    

In [8]:
incident1 = '2016 Hesston shooting'
incident2 = '2011 Seal Beach shooting'
lang = 'en'

inc1_id, docs1 = utils_docs.get_text_names(incident1, lang, inc2str, inc2label, inc2lang2doc)
inc2_id, docs2 = utils_docs.get_text_names(incident2, lang, inc2str, inc2label, inc2lang2doc)

Q23012840
Incident info
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q30 | United States of America'], 'sem:hasTimeStamp': ['2016-02-25T00:00:00UTC | 2016-02-25T00:00:00UTC']}

Q4443728
Incident info
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q30 | United States of America'], 'sem:hasTimeStamp': ['2011-10-12T00:00:00UTC | 2011-10-12T00:00:00UTC']}



In [9]:
path_texts = f'{path_data}/{lang}'
doc_time_dict = utils_docs.show_docs_by_time(path_texts, docs1, annotation_status_dict)

2016-02-25T15:59:27UTC Gunman, 3 Others Dead in Kansas Shooting Rampage annotated-manual
2016-02-25T17:40:00UTC Four dead, including gunman, in Kansas workplace shooting | The Wichita Eagle annotated-manual
2016-02-25T20:17:00UTC At Least 4 Dead, Including Gunman, At Kansas Manufacturing Plant annotated-manual
2016-02-26T03:36:00UTC Kansas Workplace Shooting Leaves 4 Dead, 14 Wounded annotated-manual
2016-02-26T09:57:12UTC Inside the Deadly Kansas Shooting: Here's What Happened Step by Step annotated-manual
2016-02-26T10:48:53UTC Cedric Larry Ford Kills 3, Injures 14 in Kansas Shooting Spree annotated-manual
2016-02-28T14:30:00UTC Kansas Shooting Victim Says Coworker Hesitated Before Firing annotated-manual
2016-02-29T13:07:00UTC Sarah Jo Hopkins, accused of giving guns to Hesston shooter, appears in court annotated-manual
2016-03-04T17:48:00UTC Before Kansas mass shooting, a trail of domestic violence | Biloxi Sun Herald annotated-manual
2016-03-10T00:00:00UTC Two Weeks After Deadly S

In [10]:
doc_time_dict = utils_docs.show_docs_by_time(path_texts, docs2, annotation_status_dict)

2011-10-13T00:00:00UTC Michelle Fournier: Custody dispute may have led to ex-wife's death annotated-manual
2011-10-13T07:00:00UTC Police ID victims in Seal Beach shooting annotated-manual
2011-10-13T18:25:33UTC Neighbors Of Salon Owner Stunned By Deadly Shooting annotated-manual
2011-10-14T07:00:00UTC Salon patron recalled as ‘warm, brilliant' annotated-manual
2011-11-02T07:00:00UTC Suspect in Seal Beach shootings said he saw bruises on son annotated-manual
2012-03-22T18:20:45UTC Tearing Down, Then Rebuilding at Seal Beach Massacre Site annotated-manual
2012-11-18T08:00:00UTC A year after Seal Beach shootings, Salon Meritage reopens annotated-manual
2015-06-29T07:30:00UTC Work begins on Seal Beach salon shooting memorial annotated-manual


In [12]:
target_text1 = "Four dead, including gunman, in Kansas workplace shooting | The Wichita Eagle"
path_naf = f'{path_texts}/{target_text1}.naf'
tree = et.parse(path_naf)
root = tree.getroot()
text = utils_naf.get_text(root)
print(text)

Four dead, including gunman, in Kansas workplace shooting | The Wichita Eagle.

An employee went on a shooting spree Thursday afternoon that left three people dead and injured 12 at a manufacturing plant in Hesston, about 35 miles north of Wichita.
Two other people were shot earlier, one in Newton and one on U.S. 81.
The suspected gunman – identified by two co-workers who witnessed the shootings as Cedric Ford – was shot and killed inside the plant by a Hesston police officer.
“This is just a horrible incident here,” Harvey County Sheriff T. Walton said. “It’s going to be a lot of sad people before this is all over.”
Walton said authorities have an idea of the motive, “something that triggered this particular individual.” But he did not offer any details Thursday evening.
News alerts in your inbox Sign up for email alerts and be the first to know when news breaks. Recaptcha SIGN UP This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
The shootin

In [13]:
predicate_info1 = utils_naf.get_predicate_role_info(root, inc1_id, anchor_filter=True)
#predicate_info1

In [15]:
target_text2 = "Before Kansas mass shooting, a trail of domestic violence | Biloxi Sun Herald"
path_naf = f'{path_texts}/{target_text2}.naf'
tree = et.parse(path_naf)
root = tree.getroot()
text = utils_naf.get_text(root)
print(text)

Before Kansas mass shooting, a trail of domestic violence | Biloxi Sun Herald.

A makeshift memorial to shooting victims in Hesston, Kansas, on Feb. 26, 2016. The Wichita Eagle
Before a gunman opened fire last month in Kansas, killing three people and injuring 14, he established a pattern of domestic abuse that, experts say, can frequently be a precursor to deadly violence.
U.S. Justice Department statistics show that more than 50 percent of women who are killed with guns are killed by intimate partners or family members. Kansas, like many states, relies heavily on federal funding to support services to cope with domestic violence. But advocates in the state say that despite an increase in federal grant money last year, Kansas still has unmet needs.
And while federal law prohibits the possession of guns by domestic abusers, Kansas lacks a state law that empowers local courts and law enforcement agencies to enforce it.
Women are killed with guns at higher rates by abusive partners in st

In [16]:
predicate_info2 = utils_naf.get_predicate_role_info(root, inc2_id, anchor_filter=True)
predicate_info2

[]

In [17]:
# compare frame info
pair_dict = utils_task.get_frame_overlap(predicate_info1, predicate_info2)
pair_dict

{'shared_frames': set(), 'shared_elements': set()}

## Test pairs of same incident


In [None]:
print(inc1_id)

In [None]:
# all_pairs
l1 = ['a', 'b', 'c']
l2 = [1, 2, 3]

pairs = list(itertools.product(l1, l2))
pairs

In [78]:
# def get_inc_doc_pred_dict(path_texts, incidents, lang, inc1str, inc2label, inc2lang2doc):
    
#     inc1_id, docs1 = utils_docs.get_text_names(incident1, lang, inc2str, inc2label, inc2lang2doc)
#     inc2_id, docs2 = utils_docs.get_text_names(incident2, lang, inc2str, inc2label, inc2lang2doc)
    
#     inc_doc_pred_dict = #
#     inc_doc_text_dict = #
#     for inc in incidents:
#         inc_id, docs = utils_docs.get_text_names(inc, lang, inc2str, inc2label, inc2lang2doc)
#         pred_doc_dict = get_doc_pred_dict(path_texts, docs, inc_id)
#         inc_doc_pred_dict[inc_id] = 
        
    
def get_doc_pred_dict(path_texts, docs, inc_id):
    
    doc_pred_dict = dict()
    for doc in docs:
        path_naf = f'{path_texts}/{doc}.naf'
        tree = et.parse(path_naf)
        root = tree.getroot()
        pred_info = utils_naf.get_predicate_role_info(root, inc_id, anchor_filter=True)
        doc_pred_dict[doc] = pred_info
    return doc_pred_dict

def get_text_pairs(docs1 = [], docs2 = None, inc1_id = None, inc2_id = None):
    # collect pairs
    data = []
    # create text pairs
    if docs2 is None:
        doc_pairs = itertools.combinations(docs1, 2)
        label = 1
        inc_ids = [inc1_id, inc1_id]
        # doc pred dict
        doc_pred_dict = get_doc_pred_dict(path_texts, docs1, inc1_id)
        # inc_doc_pred_dict = dict()
        # inc_doc_pred_dict = 
            
    else:
        doc_pairs = itertools.product(docs1, docs2)
        doc_pred_dict = get_doc_pred_dict(path_texts, docs1, inc1_id)
        doc_pred_dict2 = get_doc_pred_dict(path_texts, docs2, inc2_id)
        doc_pred_dict.update(doc_pred_dict2)
        #print(doc_pred_dict.keys())
        label = 0
        inc_ids = [inc1_id, inc2_id]
    
    for pair in doc_pairs:
        pair_dict = dict()
        pair_dict['label'] = label
        #print(pair)
        pred_info_dicts = []
        for n, doc in enumerate(pair):
            #print(doc)
            path_naf = f'{path_texts}/{doc}.naf'
            #print(path_naf)
            tree = et.parse(path_naf)
            root = tree.getroot()
            
            if not root is None:
                #print(inc_ids[n])
                predicate_info = doc_pred_dict[doc]
                if len(predicate_info) > 0:
                    pred_info_dicts.append(predicate_info)
                    pair_dict[f'title{n}'] = doc
                    pair_dict[f'text{n}'] = utils_naf.get_text(root)
                    pair_dict[f'inc_id{n}'] = inc_ids[0]

        if len(pred_info_dicts) == 2:
            frame_dict = utils_task.get_frame_overlap(pred_info_dicts[0], pred_info_dicts[1])
            total_overlap = len(frame_dict['shared_frames']) + len(frame_dict['shared_elements'])
            pair_dict['shared_frames'] = len(frame_dict['shared_frames'])
            pair_dict['shared_elements'] = len(frame_dict['shared_elements'])
            pair_dict['total_overlap'] = total_overlap
            data.append(pair_dict)
    return data, doc_pred_dict
            
#data_different = get_text_pairs(docs1, docs2, inc1_id, inc2_id)

In [79]:
incident1 = '2016 Hesston shooting'
inc1_id, docs1 = utils_docs.get_text_names(incident1, lang, inc2str, inc2label, inc2lang2doc)
print(len(docs1))
data_same, doc_pred_dict = get_text_pairs(docs1 = docs1, inc1_id = inc1_id)

Q23012840
Incident info
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q30 | United States of America'], 'sem:hasTimeStamp': ['2016-02-25T00:00:00UTC | 2016-02-25T00:00:00UTC']}

15
No srl layer


In [80]:
print(len(doc_pred_dict))

15


In [81]:
df_same = pd.DataFrame(data_same)
df_same.to_csv('test_same.csv')

In [82]:

incident2 ='2011 Seal Beach shooting'
lang = 'en'

inc2_id, docs2 = utils_docs.get_text_names(incident2, lang, inc2str, inc2label, inc2lang2doc)
data_different, pred_docs_different = get_text_pairs(docs1 = docs1, docs2 = docs2, inc1_id = inc1_id, inc2_id = inc2_id)

Q4443728
Incident info
{'sem:hasPlace': ['http://www.wikidata.org/entity/Q30 | United States of America'], 'sem:hasTimeStamp': ['2011-10-12T00:00:00UTC | 2011-10-12T00:00:00UTC']}

No srl layer


In [83]:
print(len(data_different))

168


In [84]:
df_different = pd.DataFrame(data_different)
df_different

Unnamed: 0,label,title0,text0,inc_id0,title1,text1,inc_id1,shared_frames,shared_elements,total_overlap
0,0,Kansas Shooting Victim Says Coworker Hesitated...,Kansas Shooting Victim Says Coworker Hesitated...,Q23012840,Victoria Buzzo: Seal Beach shooting victim dan...,Victoria Buzzo: Seal Beach shooting victim dan...,Q23012840,0,0,0
1,0,Kansas Shooting Victim Says Coworker Hesitated...,Kansas Shooting Victim Says Coworker Hesitated...,Q23012840,"A year after Seal Beach shootings, Salon Merit...","A year after Seal Beach shootings, Salon Merit...",Q23012840,0,0,0
2,0,Kansas Shooting Victim Says Coworker Hesitated...,Kansas Shooting Victim Says Coworker Hesitated...,Q23012840,Laura Webb: She was shot while doing her mothe...,Laura Webb: She was shot while doing her mothe...,Q23012840,0,0,0
3,0,Kansas Shooting Victim Says Coworker Hesitated...,Kansas Shooting Victim Says Coworker Hesitated...,Q23012840,Christy Wilson: Victim of Seal Beach shooting ...,Christy Wilson: Victim of Seal Beach shooting ...,Q23012840,0,0,0
4,0,Kansas Shooting Victim Says Coworker Hesitated...,Kansas Shooting Victim Says Coworker Hesitated...,Q23012840,911 tape of Seal Beach shooting released,911 tape of Seal Beach shooting released.\n\nA...,Q23012840,0,0,0
...,...,...,...,...,...,...,...,...,...,...
163,0,Inside the Deadly Kansas Shooting: Here's What...,Inside the Deadly Kansas Shooting: Here's What...,Q23012840,Work begins on Seal Beach salon shooting memorial,Work begins on Seal Beach salon shooting memor...,Q23012840,2,0,2
164,0,Inside the Deadly Kansas Shooting: Here's What...,Inside the Deadly Kansas Shooting: Here's What...,Q23012840,"Salon patron recalled as ‘warm, brilliant'","Salon patron recalled as &#8216;warm, brillian...",Q23012840,1,0,1
165,0,Inside the Deadly Kansas Shooting: Here's What...,Inside the Deadly Kansas Shooting: Here's What...,Q23012840,"Tearing Down, Then Rebuilding at Seal Beach Ma...","Tearing Down, Then Rebuilding at Seal Beach Ma...",Q23012840,1,0,1
166,0,Inside the Deadly Kansas Shooting: Here's What...,Inside the Deadly Kansas Shooting: Here's What...,Q23012840,Neighbors Of Salon Owner Stunned By Deadly Sho...,Neighbors Of Salon Owner Stunned By Deadly Sho...,Q23012840,1,0,1


In [18]:
target_text1 = "Four dead, including gunman, in Kansas workplace shooting | The Wichita Eagle"
path_naf = f'{path_texts}/{target_text1}.naf'

text = utils_docs.load_text(path_naf)

In [19]:
text

'Four dead, including gunman, in Kansas workplace shooting | The Wichita Eagle.\n\nAn employee went on a shooting spree Thursday afternoon that left three people dead and injured 12 at a manufacturing plant in Hesston, about 35 miles north of Wichita.\nTwo other people were shot earlier, one in Newton and one on U.S. 81.\nThe suspected gunman – identified by two co-workers who witnessed the shootings as Cedric Ford – was shot and killed inside the plant by a Hesston police officer.\n“This is just a horrible incident here,” Harvey County Sheriff T. Walton said. “It’s going to be a lot of sad people before this is all over.”\nWalton said authorities have an idea of the motive, “something that triggered this particular individual.” But he did not offer any details Thursday evening.\nNews alerts in your inbox Sign up for email alerts and be the first to know when news breaks. Recaptcha SIGN UP This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.\nTh