## Data analysis

- **Train, Dev, and Test Splits**: Each line of a split file represents one data sample (a sentence pair containing the target event mentions).

- **Format**: Each line contains the following tab-separated fields:

    - Unique ID of event mention 1 from sentence 1 (e.g. 01_04_35#2_3_3)
    - Unique ID of event mention 2 from sentence 2 (e.g. 01_04_35#2_3_3)
    - Sentence 1
    - Start token index of event1 trigger word
    - End token index of event1 trigger word (inclusive)
    - Start token index of event1 participant phrase 1
    - End token index of event1 participant phrase 1 (inclusive)
    - Start token index of event1 participant phrase 2
    - End token index of event1 participant phrase 2 (inclusive)
    - Start token index of event1 time
    - End token index of event1 time (inclusive)
    - Start token index of event1 location
    - End token index of event1 location (inclusive)
    - Sentence 2 
    - Start token index of event2 trigger word 
    - End token index of event2 trigger word (inclusive) 
    - Start token index of event2 participant phrase 1
    - End token index of event2 participant phrase 1 (inclusive) 
    - Start token index of event2 participant phrase 2 
    - End token index of event2 participant phrase 2 (inclusive) 
    - Start token index of event2 time
    - End token index of event2 time (inclusive)
    - Start token index of event2 location 
    - End token index of event2 location (inclusive) 
    - label: A binary label indicating whether the events are coreferent (1) or not (0).
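
As a minimal sketch of the layout above, the following parser splits one tab-separated line into named fields. The names `EventMention`, `EventPair`, and `parse_line`, as well as the sample line, are hypothetical and used for illustration only. It assumes a line has either 25 fields (with the two leading IDs) or 23 fields (without them).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

SPAN_NAMES = ["trigger", "participant1", "participant2", "time", "loc"]


@dataclass
class EventMention:
    sentence: str
    # name -> (start, end), inclusive token indices; -1 means "not provided"
    spans: Dict[str, Tuple[int, int]]


@dataclass
class EventPair:
    id1: Optional[str]
    id2: Optional[str]
    event1: EventMention
    event2: EventMention
    label: int


def _parse_mention(fields: List[str]) -> EventMention:
    # fields = [sentence, then 10 integer indices (5 start/end pairs)]
    raw = [int(x) for x in fields[1:11]]
    spans = {name: (raw[2 * i], raw[2 * i + 1]) for i, name in enumerate(SPAN_NAMES)}
    return EventMention(fields[0], spans)


def parse_line(line: str) -> EventPair:
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 25:
        id1, id2, rest = fields[0], fields[1], fields[2:]
    elif len(fields) == 23:
        id1, id2, rest = None, None, fields
    else:
        raise ValueError(f"expected 23 or 25 fields, got {len(fields)}")
    return EventPair(
        id1, id2, _parse_mention(rest[0:11]), _parse_mention(rest[11:22]), int(rest[22])
    )


# Made-up sample line for illustration (not taken from the dataset).
sample = "\t".join(
    ["A storm hit the coast .", "2", "2"] + ["-1"] * 8
    + ["The hurricane struck Florida .", "2", "2"] + ["-1"] * 8
    + ["1"]
)
pair = parse_line(sample)
print(pair.event1.spans["trigger"], pair.label)
```

Raising on an unexpected field count, rather than skipping the line, makes malformed rows visible early.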

NOTE: An index of -1 means that the information is not provided in the data. You can leave it as -1, or you can extract the participants, time, and location of the event mentions yourself for extra credit. The event trigger word is provided in the data.
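
To illustrate how the inclusive indices and the -1 convention could be handled, here is a small sketch. The helper `span_text` and the example sentence are hypothetical, and the code assumes that tokens are separated by whitespace and that indices are 0-based, neither of which is stated in the source.

```python
from typing import Optional


def span_text(sentence: str, start: int, end: int) -> Optional[str]:
    """Return the tokens from start to end (inclusive), or None if not provided."""
    if start == -1 or end == -1:
        return None
    tokens = sentence.split()  # assumption: whitespace tokenization
    if not (0 <= start <= end < len(tokens)):
        raise IndexError(f"span ({start}, {end}) outside sentence of {len(tokens)} tokens")
    # end is inclusive, so the slice has to stop at end + 1
    return " ".join(tokens[start : end + 1])


sentence = "A strong earthquake struck the island on Monday ."  # made-up example
print(span_text(sentence, 3, 3))    # trigger span
print(span_text(sentence, -1, -1))  # span that is not provided
```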

In [1]:
import pandas as pd

In [28]:
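# Load the tab-separated splits.
# Note: read_csv treats the first line of each file as a header by default,
# and on_bad_lines='skip' silently drops lines that have too many fields.
train_set = pd.read_csv("../data/event_pairs.train", sep='\t', on_bad_lines='skip')
dev_set = pd.read_csv("../data/event_pairs.dev", sep='\t', on_bad_lines='skip')
test_set = pd.read_csv("../data/event_pairs.test", sep='\t', on_bad_lines='skip')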
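
By default, `pd.read_csv` uses the first line of a file as column names and gives double quotes a special meaning, which can alter tab-separated text that has no header row and contains literal quote characters. The sketch below shows a stricter way to load such data. It assumes the files have no header row, which the source does not confirm; the in-memory sample and the shortened column names are made up for illustration.

```python
import csv
import io

import pandas as pd

# Made-up sample: no header row, and a sentence that starts with a literal quote.
raw = 'He said hello .\t2\t0\n" Hi , " she said .\t5\t1\n'

cols = ["sentence", "trigger_start", "label"]  # hypothetical, shortened schema

df = pd.read_csv(
    io.StringIO(raw),        # a file path would go here instead
    sep="\t",
    header=None,             # do not consume the first sample as column names
    names=cols,
    quoting=csv.QUOTE_NONE,  # keep quote characters as ordinary text
)

print(len(df))
print(df.loc[1, "sentence"])
```

Comparing the row count of the resulting frame with the line count of the file is a quick way to check that no sample was dropped.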

In [30]:
# Rename columns (the test split has two extra leading columns for the event IDs)
col_names = [
    "sentence1",
    "e1_trigger_start",
    "e1_trigger_end",
    "e1_participant1_start",
    "e1_participant1_end",
    "e1_participant2_start",
    "e1_participant2_end",
    "e1_time_start",
    "e1_time_end",
    "e1_loc_start",
    "e1_loc_end",
    "sentence2",
    "e2_trigger_start",
    "e2_trigger_end",
    "e2_participant1_start",
    "e2_participant1_end",
    "e2_participant2_start",
    "e2_participant2_end",
    "e2_time_start",
    "e2_time_end",
    "e2_loc_start",
    "e2_loc_end",
    "label"
]

train_set.columns = col_names
dev_set.columns = col_names
test_set.columns = ['event_id_1', 'event_id_2'] + col_names

In [31]:
print(train_set.label.value_counts())
print(f"Number of unique sentence 1 values in train set: {len(train_set.sentence1.unique())}")
print(f"Number of unique sentence 2 values in train set: {len(train_set.sentence2.unique())}")

set_1 = set(train_set.sentence1.unique())
set_2 = set(train_set.sentence2.unique())
set_1.update(set_2)
print(f"Total number of unique sentences in train set: {len(set_1)}")

label
0.0    202078
1.0     19252
Name: count, dtype: int64
Number of unique sentence 1 values in train set: 6279
Number of unique sentence 2 values in train set: 6001
Total number of unique sentences in train set: 7011
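
The label counts above show a strong class imbalance. As a quick worked example using only those two numbers, the snippet below computes the share of coreferent pairs (roughly 8.7%) and the number of non-coreferent pairs per coreferent pair; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above.
n_neg, n_pos = 202078, 19252
total = n_neg + n_pos

pos_rate = n_pos / total      # share of coreferent pairs
neg_per_pos = n_neg / n_pos   # non-coreferent pairs per coreferent pair

print(f"total pairs: {total}")
print(f"coreferent share: {pos_rate:.2%}")
print(f"non-coreferent pairs per coreferent pair: {neg_per_pos:.1f}")
```

A classifier that always predicts 0 would already reach about 91.3% accuracy on this split, so accuracy alone is likely a weak metric here.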


In [38]:
# Count how many rows provide each piece of event information (index != -1) in the train set
event_info_names = [
    "e1_participant1_start",
    "e1_participant1_end",
    "e1_participant2_start",
    "e1_participant2_end",
    "e1_time_start",
    "e1_time_end",
    "e1_loc_start",
    "e1_loc_end",
    "e2_participant1_start",
    "e2_participant1_end",
    "e2_participant2_start",
    "e2_participant2_end",
    "e2_time_start",
    "e2_time_end",
    "e2_loc_start",
    "e2_loc_end",
]

for event_info_name in event_info_names:
    # Note: NaN != -1 evaluates to True, so rows with missing values are also counted as "provided"
    print(f"Number of {event_info_name} provided in train set: {len(train_set[train_set[event_info_name] != -1])}")

Number of e1_participant1_start provided in train set: 0
Number of e1_participant1_end provided in train set: 0
Number of e1_participant2_start provided in train set: 0
Number of e1_participant2_end provided in train set: 0
Number of e1_time_start provided in train set: 0
Number of e1_time_end provided in train set: 0
Number of e1_loc_start provided in train set: 0
Number of e1_loc_end provided in train set: 0
Number of e2_participant1_start provided in train set: 458
Number of e2_participant1_end provided in train set: 458
Number of e2_participant2_start provided in train set: 458
Number of e2_participant2_end provided in train set: 458
Number of e2_time_start provided in train set: 458
Number of e2_time_end provided in train set: 458
Number of e2_loc_start provided in train set: 458
Number of e2_loc_end provided in train set: 458
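
The identical count of 458 across all event 2 columns, together with the float label values (`0.0`, `1.0`) in the earlier output, may indicate rows with missing values rather than real annotations. This is a hypothesis, not something the source confirms. The sketch below uses a made-up column to show why the `!= -1` comparison cannot tell the two cases apart, and how a stricter count could separate them.

```python
import numpy as np
import pandas as pd

# Made-up column: two rows marked "not provided" (-1) and one missing value (NaN).
col = pd.Series([-1, -1, np.nan], name="e2_time_start")

naive_count = int((col != -1).sum())                   # NaN != -1 is True
strict_count = int((col.notna() & (col != -1)).sum())  # ignore missing values
missing_count = int(col.isna().sum())

print(naive_count, strict_count, missing_count)
```

Running `train_set[event_info_names].isna().sum()` on the real data would show whether the 458 rows are missing values or actual indices.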
