# Role Experiments

Here we compare different strategies for the role labeling functions and show how and why we arrived at the labeling functions we currently use.

For the role labeling functions, we first applied some basic criteria to filter out negative trigger-argument pairs with simple heuristics. The labeling functions abstain if any of the following holds: trigger and argument are not in the same sentence according to our sentence splitting, the trigger was not labeled by any of the trigger labeling functions, or trigger and argument are too far apart (this covers cases where the sentence splitting may have missed a sentence boundary). Based on our sentence splitting experiments, we chose SoMaJo over stanfordnlp to determine whether trigger and argument belong to the same sentence. Combined with the too-far-apart heuristic, this filtered out the majority of the trigger-argument pairs that only arise because we work at the document level instead of the sentence level.
Then we checked whether the entity type matched an argument role class, e.g. the duration entity type with the role class delay, the distance entity type with jam_length, or location_route with route. The location role class, the most prevalent and most important one, required more fine-grained location entity types for specific event types, such as location_route for CanceledRoute events or location_stop for CanceledStop events.
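
The basic filtering and the entity type check described above can be sketched as follows. This is a minimal, simplified illustration and not the code of the actual labeling functions in `wsee`: the function names, the boolean inputs and the distance cut-off of 15 are assumptions made for the example, while the role label indices follow the constants used later in this notebook.

```python
ABSTAIN = -1
# Role label indices as used later in this notebook.
DELAY, JAM_LENGTH, ROUTE = 1, 8, 9

# Entity type -> candidate role class, following the examples in the text.
ENTITY_TYPE_TO_ROLE = {
    "duration": DELAY,
    "distance": JAM_LENGTH,
    "location_route": ROUTE,
}

# Assumed cut-off for the too-far-apart heuristic (hypothetical value).
MAX_BETWEEN_DISTANCE = 15


def passes_basic_checks(same_sentence, trigger_is_event, between_distance,
                        max_distance=MAX_BETWEEN_DISTANCE):
    """Return True only if the trigger-argument pair survives all three filters."""
    return same_sentence and trigger_is_event and between_distance <= max_distance


def candidate_role(arg_entity_type, same_sentence, trigger_is_event, between_distance):
    """Abstain on filtered pairs, otherwise map the entity type to a role class."""
    if not passes_basic_checks(same_sentence, trigger_is_event, between_distance):
        return ABSTAIN
    # Entity types without a direct role mapping also lead to an abstain here.
    return ENTITY_TYPE_TO_ROLE.get(arg_entity_type, ABSTAIN)
```

For instance, `candidate_role("duration", True, True, 2)` yields the `delay` label, whereas the same argument in a different sentence than the trigger yields `ABSTAIN`.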

For some of the role classes we looked for typical context words, such as:
- "Richtung" for the direction
- "von" and "bis" for start_location and end_location, respectively, if the entity type was a location type
- "von" and "bis" for start_date and end_date, respectively, if the entity type was a date or time type
- "wegen" for cause
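
A context word check of this kind can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation in `wsee`: it only inspects the single token directly before the argument, the function name and the `is_location` flag are hypothetical, and the second example sentence is made up for the illustration.

```python
ABSTAIN = -1
# Role label indices as used later in this notebook.
DIRECTION, START_LOC, END_LOC, CAUSE = 2, 3, 4, 7


def context_word_role(tokens, arg_start, is_location):
    """Guess a role from the token directly preceding the argument.

    tokens: list of token strings of the sentence
    arg_start: token index at which the argument starts
    is_location: whether the argument has a location entity type
    """
    if arg_start == 0:
        return ABSTAIN  # no preceding token to inspect
    previous = tokens[arg_start - 1].lower()
    if previous == "richtung":
        return DIRECTION
    if previous == "wegen":
        return CAUSE
    # "von"/"bis" only count for locations in this sketch.
    if previous == "von" and is_location:
        return START_LOC
    if previous == "bis" and is_location:
        return END_LOC
    return ABSTAIN


tokens = "A7 Hannover Richtung Kassel 4 km Stau".split()
role = context_word_role(tokens, 3, is_location=True)  # "Kassel" follows "Richtung"

# Made-up sentence to illustrate the remaining context words.
example = "Sperrung von Lembeck bis Reken wegen Bergungsarbeiten".split()
```

A real labeling function would additionally apply the basic checks (same sentence, distance, event trigger) before trusting the context word.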

This worked well for most of the role classes except for the location class, as there were multiple role classes requiring a location entity type (location, direction, start_location, end_location, route). We tried approaches that label an example as location when all the labeling functions for the more specific location role classes abstain.
We found that a combination of heuristics and patterns worked best, such as:
- the first location entity in the sentence is usually the location argument, as we tend to put the most important information first
- the delay argument of a Delay event usually occurs right before the event trigger
- the jam length argument of a TrafficJam event usually occurs right before the event trigger
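
The positional heuristics above can be sketched as follows. The sketch assumes entities are dicts with `entity_type`, `start` and `end` token offsets (exclusive end), as in the outputs shown later in this notebook. The helper names are hypothetical, treating every entity type that starts with `location` as a location is an assumption, and the entity list is illustrative (only the offsets of `A7` and `Stau` are taken from a corpus example further down).

```python
def first_location_entity(entities):
    """Return the location entity with the smallest start offset, or None."""
    locations = [e for e in entities if e["entity_type"].startswith("location")]
    if not locations:
        return None
    return min(locations, key=lambda e: e["start"])


def is_right_before(argument, trigger):
    """True if the argument ends exactly where the trigger starts."""
    return argument["end"] == trigger["start"]


# Illustrative entities for:
# "A7 Hannover Richtung Kassel ... 4 km Stau nach einem Unfall"
a7 = {"text": "A7", "entity_type": "location_street", "start": 0, "end": 1}
hannover = {"text": "Hannover", "entity_type": "location_city", "start": 1, "end": 2}
jam_length_arg = {"text": "4 km", "entity_type": "distance", "start": 11, "end": 13}
stau = {"text": "Stau", "entity_type": "trigger", "start": 13, "end": 14}

entities = [hannover, jam_length_arg, a7, stau]
first_location = first_location_entity(entities)
jam_length_adjacent = is_right_before(jam_length_arg, stau)
```

Here `first_location` is the `A7` entity and `jam_length_adjacent` is `True`, so a labeling function built on these helpers would propose `location` for `A7` and `jam_length` for `4 km`.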

In addition, we reused existing patterns from past projects. Revisions of the NER annotations had lowered the coverage of these patterns. We therefore created a version of the patterns in which we relaxed the fine-grained location requirements, i.e. wherever a pattern required a specific fine-grained location type, we allowed any location entity type.
We found that while the accuracy of our labeling functions was relatively high, the coverage was relatively low, especially for the location class. Relaxing some of the conditions in the labeling functions resulted in poor accuracy.

In [1]:
import sys
sys.path.append("../")

## Data preparation
We first load the SD4M gold train data, build the role examples and add some information to help us during the experiments.

In [2]:
import pandas as pd
from wsee.utils import corpus_statistics
sd4m_train = pd.read_json("../data/daystream_corpus/train/train_with_events_and_defaults.jsonl", lines=True, encoding='utf8')
filtered_sd4m_train = sd4m_train[sd4m_train.apply(lambda document: corpus_statistics.has_triggers(document), axis=1)]
corpus_statistics.get_snorkel_event_stats(filtered_sd4m_train)

{'# Docs': 567,
 '# Docs with event triggers': 413,
 '# Event triggers with positive label': 488,
 '# Event triggers with negative label': 289,
 '# Event triggers with abstain': 0,
 'Trigger class frequencies': {'Accident': 59,
  'CanceledRoute': 61,
  'CanceledStop': 25,
  'Delay': 65,
  'Obstruction': 101,
  'RailReplacementService': 22,
  'TrafficJam': 155,
  'O': 289},
 '# Docs with event roles': 413,
 '# Event role with positive label': 2001,
 '# Event roles with negative label': 5284,
 '# Event roles with abstain': 0,
 'Role class frequencies': {'location': 571,
  'delay': 87,
  'direction': 277,
  'start_loc': 377,
  'end_loc': 352,
  'start_date': 35,
  'end_date': 41,
  'cause': 103,
  'jam_length': 135,
  'route': 23,
  'no_arg': 5284}}

In [3]:
from wsee.data import pipeline

df_sd_train, Y_sd_train = pipeline.build_event_role_examples(filtered_sd4m_train)

INFO:root:Building event role examples
INFO:root:DataFrame has 567 rows
INFO:root:Adding the following attributes to each document: entity_type_freqs, somajo_doc, mixed_ner, mixed_ner_spans
567it [00:12, 45.89it/s]
INFO:root:Adding the following attributes to each role example: not_an_event, arg_type_event_type_match, between_distance, is_multiple_same_event_type
INFO:root:Number of event roles: 2001
INFO:root:Number of event role examples: 7285


In [24]:
from wsee.preprocessors import preprocessors
from wsee import SD4M_RELATION_TYPES
import numpy as np

def get_simple_trigger(doc):
    trigger = doc.trigger
    simple_trigger = {
        'text': trigger['text'],
        'entity_type': trigger['entity_type'],
        'char_start': trigger['char_start'],
        'char_end': trigger['char_end']
    }
    return simple_trigger
        
def get_simple_argument(doc):
    argument = doc['argument']
    simple_argument = {
        'text': argument['text'],
        'entity_type': argument['entity_type'],
        'char_start': argument['char_start'],
        'char_end': argument['char_end']
    }
    return simple_argument

def get_simple_triggers(x):
    simple_triggers = []
    for trigger in x['event_triggers']:
        entity = preprocessors.get_entity(trigger['id'], x['entities'])
        simple_trigger = {
            'text': entity['text'],
            'entity_type': entity['entity_type'],
            'start': entity['start'],
            'end': entity['end'],
            'event_type': SD4M_RELATION_TYPES[np.argmax(trigger['event_type_probs'])]
        }
        simple_triggers.append(simple_trigger)
    return simple_triggers

def get_simple_somajo_sentences(doc):
    sentences = doc.somajo_doc['sentences']
    return [{'text': doc.text[sentence['char_start']:sentence['char_end']], 'char_start': sentence['char_start'], 'char_end': sentence['char_end']} for sentence in sentences]


def add_simple_columns(df):
    df['trigger_sm'] = df.apply(lambda doc: get_simple_trigger(doc), axis=1)
    df['argument_sm'] = df.apply(lambda doc: get_simple_argument(doc), axis=1)
    df['somajo_sm'] = df.apply(lambda doc: get_simple_somajo_sentences(doc), axis=1)
    df['simple_triggers'] = df.apply(lambda doc: get_simple_triggers(doc), axis=1)
    df['between_tokens'] = df.apply(lambda doc: preprocessors.get_between_tokens(doc), axis=1)
    df['between_distance'] = df.apply(lambda doc: preprocessors.get_between_distance(doc), axis=1)
    return df

In [25]:
from wsee import ROLE_LABELS
pd.set_option('display.max_colwidth', None)
labeled_sd4m_roles = df_sd_train.copy()
labeled_sd4m_roles['label'] = Y_sd_train
labeled_sd4m_roles['role'] = [ROLE_LABELS[label_idx] for label_idx in Y_sd_train]
labeled_sd4m_roles = add_simple_columns(labeled_sd4m_roles)

## Strategies for the labeling functions
Features:
- Same sentence, between distance, trigger is event trigger, entity type, occurrence of required argument entity types (location + trigger) as basic criteria
- Positional information for location, delay, jam length
- Context words for direction, start & end location, start & end date, cause
- Negation, parentheses, multiple occurrences of argument checks
- NER patterns

Most labeling functions in our pipeline use a combination of these features.

In [11]:
from snorkel.labeling import labeling_function
from wsee.labeling import event_argument_role_lfs as role_lfs

location = 0
delay = 1
direction = 2
start_loc = 3
end_loc = 4
start_date = 5
end_date = 6
cause = 7
jam_length = 8
route = 9
no_arg = 10
ABSTAIN = -1

# + positional information (first location entity in the sentence)
    
@labeling_function(pre=[])
def basic_checks_location(x):
    # location entity type check, same sentence + distance + event trigger + required arguments (with location role this is a given)
    arg_entity_type = x.argument['entity_type']
    if not role_lfs.is_location_entity_type(arg_entity_type):
        return ABSTAIN
    return role_lfs.lf_location(x, same_sentence=True, nearest=False, check_event_type=True)
    
@labeling_function(pre=[])
def basic_checks_exclusions_location(x):
    # ABSTAIN if more specific location related role classes match
    arg_entity_type = x.argument['entity_type']
    if not role_lfs.is_location_entity_type(arg_entity_type):
        return ABSTAIN
    if role_lfs.lf_start_location_type(x) == ABSTAIN and role_lfs.lf_end_location_type(x) == ABSTAIN and role_lfs.lf_direction(x) == ABSTAIN:
        return role_lfs.lf_location(x, same_sentence=True, nearest=False, check_event_type=True)
    else:
        return ABSTAIN

@labeling_function(pre=[])
def basic_checks_exclusions_heuristics_location(x):
    # uses heuristic that the location argument is often the first location entity while specifying more fine-grained location entity types
    # for cases where there might be a more general context location first (location, location_city) and then the relevant location argument
    arg_entity_type = x.argument['entity_type']
    if not role_lfs.is_location_entity_type(arg_entity_type):
        return ABSTAIN
    if role_lfs.lf_start_location_type(x) == ABSTAIN and role_lfs.lf_end_location_type(x) == ABSTAIN and \
            role_lfs.lf_direction(x) == ABSTAIN:
        first_street_stop_route = role_lfs.get_first_of_entity_types(
            preprocessors.get_sentence_entities(x), ['location_route', 'location_stop', 'location_street'])
        if first_street_stop_route and first_street_stop_route['id'] == x.argument['id']:
            return role_lfs.lf_location(x, same_sentence=True, nearest=False, check_event_type=True)
    return ABSTAIN

In [15]:
from snorkel.labeling import PandasLFApplier

lfs = [
    basic_checks_location,
    basic_checks_exclusions_location,
    basic_checks_exclusions_heuristics_location,
    role_lfs.lf_direction_pattern,  # may be very specific to corpus: A1 Hamburg-Bremen ..., where '-' may be a direction marker
    role_lfs.lf_start_date_adjacent  # as the name says, without checks for some specific context word, excluding end date
] 
applier = PandasLFApplier(lfs)

In [16]:
L_sd_train = applier.apply(df_sd_train)

100%|██████████| 7285/7285 [00:18<00:00, 384.50it/s]


In [17]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L_sd_train, lfs).lf_summary(Y_sd_train)

| | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| basic_checks_location | 0 | [0] | 0.224296 | 0.12862 | 0.00604 | 534 | 1100 | 0.326805 |
| basic_checks_exclusions_location | 1 | [0] | 0.122581 | 0.122581 | 0.0 | 524 | 369 | 0.586786 |
| basic_checks_exclusions_heuristics_location | 2 | [0] | 0.057653 | 0.057653 | 0.0 | 387 | 33 | 0.921429 |
| lf_direction_pattern | 3 | [2] | 0.006863 | 0.00604 | 0.00604 | 28 | 22 | 0.56 |
| lf_start_date_adjacent | 4 | [5] | 0.001784 | 0.0 | 0.0 | 6 | 7 | 0.461538 |
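
To make the trade-off between coverage and accuracy in the summary above concrete, the following sketch recomputes both columns from the `Correct` and `Incorrect` counts. It assumes that every example has a gold label, so that the number of examples an LF labels equals `Correct + Incorrect`; the helper name is hypothetical and the counts are copied from the summary.

```python
N_EXAMPLES = 7285  # number of event role examples

# (correct, incorrect) counts copied from the LF summary above.
lf_counts = {
    "basic_checks_location": (534, 1100),
    "basic_checks_exclusions_location": (524, 369),
    "basic_checks_exclusions_heuristics_location": (387, 33),
}


def coverage_and_accuracy(correct, incorrect, n_examples=N_EXAMPLES):
    """Coverage = labeled / all examples, accuracy = correct / labeled."""
    labeled = correct + incorrect
    return labeled / n_examples, correct / labeled


for name, (correct, incorrect) in lf_counts.items():
    coverage, accuracy = coverage_and_accuracy(correct, incorrect)
    print(f"{name}: coverage={coverage:.6f}, accuracy={accuracy:.6f}")
```

Each added restriction shrinks the set of labeled examples (1634, then 893, then 420) while removing mostly incorrect labels, which is why the accuracy rises from roughly 0.33 to 0.92 as the coverage drops from roughly 0.22 to 0.06.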


The SD4M train set contains 65 `Delay` events and 155 `TrafficJam` events. Given that the trigger lists were refined using the SD4M training data, the recall is expectedly high.
The strategy of only checking whether the entity type of the closest entity to a potential `TrafficJam` trigger is `distance` worked fairly well, with 133/155 recall and 133/137 precision.
It did not work as well for the `Delay` event type: while the precision of 22/26 was good, the recall of 22/65 is quite low.
However, both approaches improved on the simpler, more straightforward strategy of matching the trigger text to trigger lists.

### Error analysis:

In [18]:
from wsee.labeling import error_analysis

In [27]:
error_analysis.get_false_positives(labeled_df=labeled_sd4m_roles, lf_outputs=L_sd_train, lf_index=2, label_of_interest=0).sample(n=2)[['text', 'trigger', 'argument', 'simple_triggers', 'role']]

Unnamed: 0,text,trigger,argument,simple_triggers,role
5693,Auf der A1 Köln Richtung Dortmund ist die Ausfahrt Schwerte wegen Bergungsarbeiten gesperrt.\n,"{'id': 'c/576f3260-87aa-474c-ac04-95ec3293b457', 'text': 'gesperrt', 'entity_type': 'trigger', 'start': 12, 'end': 13, 'char_start': 83, 'char_end': 91}","{'id': 'c/9ce21202-5a2e-4a14-b78d-f05ce7c9b24c', 'text': 'A1', 'entity_type': 'location_street', 'start': 2, 'end': 3, 'char_start': 8, 'char_end': 10}","[{'text': 'Bergungsarbeiten', 'entity_type': 'trigger', 'start': 11, 'end': 12, 'event_type': 'O'}, {'text': 'gesperrt', 'entity_type': 'trigger', 'start': 12, 'end': 13, 'event_type': 'Obstruction'}]",no_arg
3827,A7 Hannover Richtung Kassel zwischen Lutterberg und Kreuz Kassel-Mitte 4 km Stau nach einem Unfall\n,"{'id': 'c/16605983-925d-4cb1-9a45-0858ee69fff1', 'text': 'Unfall', 'entity_type': 'trigger', 'start': 16, 'end': 17, 'char_start': 92, 'char_end': 98}","{'id': 'c/36ff6b63-54d2-411c-86eb-f5b222cf0cd1', 'text': 'A7', 'entity_type': 'location_street', 'start': 0, 'end': 1, 'char_start': 0, 'char_end': 2}","[{'text': 'Stau', 'entity_type': 'trigger', 'start': 13, 'end': 14, 'event_type': 'TrafficJam'}, {'text': 'Unfall', 'entity_type': 'trigger', 'start': 16, 'end': 17, 'event_type': 'O'}]",no_arg


In [28]:
error_analysis.get_false_positives(labeled_df=labeled_sd4m_roles, lf_outputs=L_sd_train, lf_index=1, label_of_interest=0).sample(n=2)[['text', 'trigger', 'argument', 'simple_triggers', 'role']]

Unnamed: 0,text,trigger,argument,simple_triggers,role
6935,■ #Hamburg: die Ohlsdorfer Straße ist zwischen Jahnring und Winterhuder Marktplatz wegen #Bauarbeiten bis Ende Juli gesperrt.\n,"{'id': 'c/0fb7209e-2ba7-49e5-884e-aa2c3f600b88', 'text': 'gesperrt', 'entity_type': 'trigger', 'start': 17, 'end': 18, 'char_start': 117, 'char_end': 125}","{'id': 'c/bc7010e3-2401-4f6f-9787-ff75bc60b14c', 'text': '#Hamburg', 'entity_type': 'location_city', 'start': 1, 'end': 2, 'char_start': 2, 'char_end': 10}","[{'text': '#Bauarbeiten', 'entity_type': 'trigger', 'start': 13, 'end': 14, 'event_type': 'O'}, {'text': 'gesperrt', 'entity_type': 'trigger', 'start': 17, 'end': 18, 'event_type': 'Obstruction'}]",no_arg
5016,Die A31 Bottrop Richtung Gronau ist zwischen Lembeck und Reken wegen Bergungsarbeiten gesperrt. Eine Umleitung führt ab Lembeck über die U25\n,"{'id': 'c/96eb3033-eb65-4d72-9fbb-c3985840e6c8', 'text': 'gesperrt', 'entity_type': 'trigger', 'start': 12, 'end': 13, 'char_start': 86, 'char_end': 94}","{'id': 'c/f312238e-a7ca-4825-87eb-51f40c1b146b', 'text': 'Bottrop', 'entity_type': 'location_city', 'start': 2, 'end': 3, 'char_start': 8, 'char_end': 15}","[{'text': 'gesperrt', 'entity_type': 'trigger', 'start': 12, 'end': 13, 'event_type': 'Obstruction'}]",no_arg


For both `Delay` & `duration` and `TrafficJam` & `distance`, there were false positives where the closest entity to the trigger was of the relevant entity type, but the trigger was of a different event type.