# Transform the data to work with Snorkel: Part 1 - Event Type

We need two labeling models: one assigns labels to event types, the other assigns labels to argument roles in event mentions.
This notebook covers the event types.

For event type labeling we create one row per event trigger.

Each row needs one additional column:
- trigger_id

The gold labels are kept in a separate numpy array containing the:
- event_type

Our labeling functions will mostly rely on keyword lists and a few heuristics.
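As a rough illustration of the one-row-per-trigger layout, the sketch below expands documents into trigger rows and a parallel list of event type indices. It does not reproduce `pipeline.build_event_trigger_examples`: the document fields (`entities`, `events`, `trigger_id` inside an event) and the helper name `build_trigger_rows` are assumptions made for this example, and the input document is a toy. Triggers that do not belong to a gold event receive the negative class `O`.

```python
EVENT_TYPES = ['Accident', 'CanceledRoute', 'CanceledStop', 'Delay',
               'Obstruction', 'RailReplacementService', 'TrafficJam', 'O']


def build_trigger_rows(documents):
    """Return one row per trigger plus a parallel list of event type indices.

    Hypothetical helper: the field names used here are assumptions.
    """
    rows, labels = [], []
    for doc in documents:
        # Map trigger id -> gold event type for the annotated events, if any.
        gold = {e['trigger_id']: e['event_type'] for e in doc.get('events', [])}
        for entity in doc['entities']:
            if entity['entity_type'] != 'trigger':
                continue
            rows.append({'doc_id': doc['id'], 'text': doc['text'],
                         'trigger_id': entity['id']})
            # Triggers that do not belong to an event get the negative class.
            labels.append(EVENT_TYPES.index(gold.get(entity['id'], 'O')))
    return rows, labels


# Toy document, made up for illustration.
docs = [{
    'id': 'd1',
    'text': 'Die A21 ist nach einem Unfall gesperrt.',
    'entities': [
        {'id': 't1', 'text': 'Unfall', 'entity_type': 'trigger'},
        {'id': 't2', 'text': 'gesperrt', 'entity_type': 'trigger'},
        {'id': 'l1', 'text': 'A21', 'entity_type': 'location'},
    ],
    'events': [{'trigger_id': 't1', 'event_type': 'Accident'}],
}]
rows, labels = build_trigger_rows(docs)
print([row['trigger_id'] for row in rows], labels)
```

Keeping the labels in a separate sequence means the same row layout works for unlabeled data, where only the rows are needed.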

In [1]:
import sys
sys.path.append("../")
import warnings
import pandas as pd
import numpy as np
from wsee.utils import utils
from wsee.data import pipeline

In [2]:
warnings.filterwarnings(action='once')
pd.set_option('display.max_colwidth', None)
DATA_DIR = '../data/daystream_corpus'  # replace path to corpus

### SD4M Event Types

| Number | Code                   | Description                                                                             |
|--------|------------------------|-----------------------------------------------------------------------------------------|
| -1     | ABSTAIN                | No vote, for Labeling Functions                                                         |
| 0      | Accident               | Collision of a vehicle with another vehicle, person, or obstruction                     |
| 1      | CanceledRoute          | Cancellation of public transport routes                                                 |
| 2      | CanceledStop           | Cancellation of public transport stops                                                  |
| 3      | Delay                  | Delay resulting from remaining traffic disturbances                                     |
| 4      | Obstruction            | Temporary installation to control traffic                                               |
| 5      | RailReplacementService | Replacement of a passenger train by buses or other substitute public transport services |
| 6      | TrafficJam             | Line of stationary or very slow-moving traffic                                          |
| 7      | O                      | No SD4M event.                                                                          |
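
The table translates directly into integer constants, and a keyword-list labeling function then either votes for one class or abstains. The sketch below is plain Python; the keyword lists are illustrative assumptions built from a few trigger words that appear in this notebook's samples, not the lists used by `wsee`. With Snorkel, such a function would typically be wrapped as a labeling function that receives a DataFrame row.

```python
ABSTAIN = -1
ACCIDENT, CANCELED_ROUTE, CANCELED_STOP, DELAY = 0, 1, 2, 3
OBSTRUCTION, RAIL_REPLACEMENT_SERVICE, TRAFFIC_JAM, O = 4, 5, 6, 7

# Illustrative keyword lists (assumptions), stored as lower-cased trigger texts.
KEYWORDS = {
    ACCIDENT: {'unfall'},
    CANCELED_ROUTE: {'fahrtausfälle'},
    OBSTRUCTION: {'gesperrt'},
}


def make_keyword_lf(label):
    """Build a function that votes `label` if the trigger text is a keyword."""
    def lf(trigger_text):
        return label if trigger_text.lower() in KEYWORDS[label] else ABSTAIN
    return lf


lf_accident_keywords = make_keyword_lf(ACCIDENT)
lf_obstruction_keywords = make_keyword_lf(OBSTRUCTION)

print(lf_accident_keywords('Unfall'))     # votes for Accident
print(lf_obstruction_keywords('Unfall'))  # abstains
```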

In [3]:
loaded_data = pipeline.load_data(DATA_DIR)
sd_train = loaded_data['train']
sd_dev = loaded_data['dev']
sd_test = loaded_data['test']

daystream = loaded_data['daystream']

INFO:wsee:Reading train data from: ../data/daystream_corpus/train/train_with_events_and_defaults.jsonl
INFO:wsee:Reading dev data from: ../data/daystream_corpus/dev/dev_with_events_and_defaults.jsonl
INFO:wsee:Reading test data from: ../data/daystream_corpus/test/test_with_events_and_defaults.jsonl
INFO:wsee:Reading daystream data from: ../data/daystream_corpus/daystream.jsonl


## Step 1: Create one row for every event trigger

We will use the (labeled) SD4M training set as our development data to create our labeling functions.
In this notebook we will run our labeling functions and our LabelModel on that data.
In the real pipeline we will instead label the Daystream data that does not have event type and event argument role labels.

In [4]:
df_sd_train, Y_sd_train = pipeline.build_event_trigger_examples(sd_train)

INFO:wsee:Building event trigger examples
INFO:wsee:DataFrame has 1273 rows
1273it [00:00, 1402.61it/s]
INFO:wsee:Number of events: 488
INFO:wsee:Number of event trigger examples: 777


We use the (labeled) SD4M development set as our "test set" to measure the performance of our LabelModel.

In [5]:
df_sd_dev, Y_sd_dev = pipeline.build_event_trigger_examples(sd_dev)

INFO:wsee:Building event trigger examples
INFO:wsee:DataFrame has 147 rows
147it [00:00, 1639.88it/s]
INFO:wsee:Number of events: 46
INFO:wsee:Number of event trigger examples: 71


In [6]:
from wsee import SD4M_RELATION_TYPES
print(SD4M_RELATION_TYPES)

['Accident', 'CanceledRoute', 'CanceledStop', 'Delay', 'Obstruction', 'RailReplacementService', 'TrafficJam', 'O']


## Step 2: Explore the data

In [7]:
from wsee.preprocessors.preprocessors import *
from wsee.data import explore, pipeline

We can apply all our preprocessors on our data and see if we can find something interesting for our labeling functions.
Let's first sample the SD4M training data, which is labeled.

In [8]:
labeled_sd4m_triggers = explore.add_labels(df_sd_train, Y_sd_train)
labeled_sd4m_triggers = explore.apply_preprocessors(labeled_sd4m_triggers, [pre_trigger_left_tokens, pre_mixed_ner, pre_trigger_right_tokens])
labeled_sd4m_triggers = explore.add_event_types(labeled_sd4m_triggers)

100%|██████████| 3/3 [00:05<00:00,  1.95s/it]


In [9]:
filtered_sd4m_triggers = labeled_sd4m_triggers[labeled_sd4m_triggers['label'] != 7]
print(f"Number of events: {len(labeled_sd4m_triggers)}\n")
for idx, class_name in enumerate(SD4M_RELATION_TYPES):
    class_sd4m_triggers = labeled_sd4m_triggers[labeled_sd4m_triggers['label'] == idx]
    print(f"{class_name}: {len(class_sd4m_triggers)} instances")

Number of events: 777

Accident: 59 instances
CanceledRoute: 61 instances
CanceledStop: 25 instances
Delay: 65 instances
Obstruction: 101 instances
RailReplacementService: 22 instances
TrafficJam: 155 instances
O: 289 instances


## Step 3: Evaluate the labeling functions on the SD4M training data

In [10]:
from wsee.labeling import event_trigger_lfs as trigger_lfs

In [11]:
from snorkel.labeling import PandasLFApplier
from wsee.data.pipeline import get_trigger_list_lfs

lfs = get_trigger_list_lfs()

applier = PandasLFApplier(lfs)

In [12]:
L_sd_train = applier.apply(df_sd_train)

100%|██████████| 777/777 [00:30<00:00, 25.65it/s]


In [13]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L_sd_train, lfs).lf_summary(Y_sd_train)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
lf_accident,0,[0],0.084942,0.084942,0.0,57,9,0.863636
lf_accident_street,1,[0],0.054054,0.054054,0.0,39,3,0.928571
lf_accident_no_cause_check,2,[0],0.105534,0.105534,0.020592,59,23,0.719512
lf_canceledroute_cat,3,[1],0.088803,0.088803,0.018018,60,9,0.869565
lf_canceledroute_replicated,4,[1],0.088803,0.088803,0.018018,60,9,0.869565
lf_canceledstop_cat,5,[2],0.033462,0.033462,0.001287,25,1,0.961538
lf_canceledstop_replicated,6,[2],0.033462,0.033462,0.001287,25,1,0.961538
lf_delay_cat,7,[3],0.083655,0.083655,0.0,65,0,1.0
lf_delay_duration_positional_check,8,[3],0.028314,0.028314,0.002574,20,2,0.909091
lf_delay_priorities,9,[3],0.087516,0.087516,0.003861,65,3,0.955882
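
To make the columns of this summary concrete, here is a minimal sketch of how the per-LF statistics can be computed from a label matrix. It follows our understanding of the definitions (coverage: the LF votes; overlaps: the LF votes and at least one other LF votes too; conflicts: at least one other LF votes for a different class) and is not Snorkel's implementation. The function name `lf_stats` and the toy matrix are assumptions.

```python
ABSTAIN = -1


def lf_stats(L, Y, j):
    """Summary statistics for labeling function j on label matrix L with gold labels Y."""
    n = len(L)
    covered = overlaps = conflicts = correct = incorrect = 0
    for row, gold in zip(L, Y):
        vote = row[j]
        if vote == ABSTAIN:
            continue
        covered += 1
        # Non-abstaining votes of all other labeling functions on this row.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        overlaps += bool(others)
        conflicts += any(v != vote for v in others)
        if vote == gold:
            correct += 1
        else:
            incorrect += 1
    return {
        'coverage': covered / n,
        'overlaps': overlaps / n,
        'conflicts': conflicts / n,
        'correct': correct,
        'incorrect': incorrect,
        'emp_acc': correct / covered if covered else float('nan'),
    }


# Toy label matrix: 4 data points, 2 labeling functions.
L = [[0, 0], [0, 1], [-1, 1], [-1, -1]]
Y = [0, 1, 1, 7]
print(lf_stats(L, Y, j=0))
```

For `lf_accident` in the summary above, 57 correct and 9 incorrect votes give 57 / 66 ≈ 0.8636, and 66 votes over 777 rows give the reported coverage of about 0.0849.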


## Step 4: Error Analysis
Now we can inspect the label matrix for errors. We reuse the DataFrame from the exploration section, which includes the output of the preprocessors.
This lets us look specifically at the instances that were labeled incorrectly.

In [14]:
from wsee.labeling import error_analysis
relevant_cols = ['text','trigger', 'event_types']

In [15]:
labeled_sd4m_triggers.iloc[L_sd_train[:, 3] == 1].sample()[['text', 'trigger', 'label']]

Unnamed: 0,text,trigger,label
1257,"am Donnerstag, 24. März, 7.00 – 10.00 Uhr<br />\n<br />\nMeldung:<br />\nEN 459 von Leipzig Hbf (planmäßige Ankunft 9.28 Uhr in Praha hl.n.) fällt von Dresden Hbf bis Děčín hl.n. aus. Als Ersatz nutzen Sie den Bus von Dresden Hbf (ab 7.14 Uhr) bis Ústí nad Labem (an 8.20 Uhr). In Ústí nad Labem besteht Anschluss an einen Ersatzzug nach Praha hl.n.<br />\n<br />Grund:<br />\nOberleitungsarbeiten zwischen Bad Schandau und Königstein (Sächs Schw)<br />\n<br />Link zur detaillierten Meldung: <br />\n<a href= / ><br />\nLink zum kompletten PDF-Dokument: <br />\n<a href=target=_blank>(141 kB)<br /><br />------------------<br /><br />\n","{'id': 'c/2cc43ab7-ce99-480d-86d0-2bd5777d8171', 'text': 'fällt', 'entity_type': 'trigger', 'start': 31, 'end': 32, 'char_start': 138, 'char_end': 143}",1


In [16]:
error_analysis.sample_fp(labeled_df=labeled_sd4m_triggers, lf_outputs=L_sd_train, lf_index=3, label_of_interest=1, sample_size=1)[relevant_cols]

Unnamed: 0,text,trigger,event_types
616,#RE10: Wegen eines Unfalls ist der Abschnitt Kleve - Goch aktuell gesperrt. Ein Busnotverkehr ist eingerichtet. https://t.co/ivZS6KPKvj\n,"{'id': 'c/8ba9b59f-3e29-409e-a3f3-02066ff57857', 'text': 'gesperrt', 'entity_type': 'trigger', 'start': 12, 'end': 13, 'char_start': 66, 'char_end': 74}","[(Unfalls, (19, 26), 7), (gesperrt, (66, 74), 4)]"


In [17]:
error_analysis.sample_abstained_instances(labeled_df=labeled_sd4m_triggers, lf_outputs=L_sd_train, lf_index=10, label_of_interest=4, sample_size=1)[relevant_cols]

Unnamed: 0,text,trigger,event_types
145,Die A21 Bargteheide - Kiel ist zwischen Bad Oldesloe-Nord und Leezen nach einem Unfall in beiden Richtungen gesperrt.\n,"{'id': 'c/9d4f26fe-e709-40b0-bfac-2793d9efd146', 'text': 'gesperrt', 'entity_type': 'trigger', 'start': 19, 'end': 20, 'char_start': 108, 'char_end': 116}","[(Unfall, (80, 86), 0), (gesperrt, (108, 116), 4)]"
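
Judging from their arguments, `sample_fp` and `sample_abstained_instances` select rows by comparing one column of the label matrix with the gold labels; their actual implementation in `wsee` may differ. A minimal sketch of that selection, with hypothetical helper names and a toy label matrix:

```python
ABSTAIN = -1


def false_positive_rows(L, Y, lf_index, label_of_interest):
    """Row indices where the LF voted for the label but the gold label differs."""
    return [i for i, (row, gold) in enumerate(zip(L, Y))
            if row[lf_index] == label_of_interest and gold != label_of_interest]


def abstained_rows(L, Y, lf_index, label_of_interest):
    """Row indices where the gold label applies but the LF abstained."""
    return [i for i, (row, gold) in enumerate(zip(L, Y))
            if row[lf_index] == ABSTAIN and gold == label_of_interest]


# Toy label matrix: 4 data points, 2 labeling functions.
L = [[1, -1], [1, 4], [-1, -1], [-1, 4]]
Y = [1, 4, 4, 4]
print(false_positive_rows(L, Y, lf_index=0, label_of_interest=1))
print(abstained_rows(L, Y, lf_index=1, label_of_interest=4))
```

False positives point to keywords that are too broad, while abstained rows point to missing keywords or overly strict heuristics.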


## Step 5: Train the label model and label the data

### Training the label model

In [18]:
from snorkel.labeling import LabelModel
from snorkel.labeling import filter_unlabeled_dataframe

In [19]:
df_daystream, Y_daystream = pipeline.build_event_trigger_examples(daystream)
L_daystream = applier.apply(df_daystream)

INFO:wsee:Building event trigger examples
INFO:wsee:DataFrame has 1955 rows
1955it [00:03, 626.34it/s]
INFO:wsee:Number of events: 0
INFO:wsee:Number of event trigger examples: 3076
100%|██████████| 3076/3076 [02:02<00:00, 25.04it/s]


In [20]:
LFAnalysis(L_daystream, lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
lf_accident,0,[0],0.11671,0.11671,0.002601
lf_accident_street,1,[0],0.108257,0.108257,0.002276
lf_accident_no_cause_check,2,[0],0.13264,0.13264,0.019506
lf_canceledroute_cat,3,[1],0.168075,0.168075,0.047789
lf_canceledroute_replicated,4,[1],0.168075,0.168075,0.047789
lf_canceledstop_cat,5,[2],0.020806,0.020806,0.009753
lf_canceledstop_replicated,6,[2],0.020806,0.020806,0.009753
lf_delay_cat,7,[3],0.126463,0.126463,0.000325
lf_delay_duration_positional_check,8,[3],0.03316,0.03316,0.008453
lf_delay_priorities,9,[3],0.179129,0.164824,0.038687


In [22]:
# Labeling function summary statistics over all LFs

print(f"Number of trigger labeling functions: \t{len(get_trigger_list_lfs())}")
print(f"Coverage: \t{LFAnalysis(L_daystream, lfs).label_coverage()}")  # fraction of data points with at least one label
print(f"Overlap: \t{LFAnalysis(L_daystream, lfs).label_overlap()}")  # fraction of data points with more than one label
print(f"Conflicts: \t{LFAnalysis(L_daystream, lfs).label_conflict()}")  # fraction of data points with conflicting labels

Number of trigger labeling functions: 	23
Coverage: 	1.0
Overlap: 	0.7418725617685306
Conflicts: 	0.10988296488946683


In [23]:
daystream_model = LabelModel(cardinality=8, verbose=True)
# Y_dev (here the SD4M training labels) is used to infer the class balance prior
daystream_model.fit(L_train=L_daystream, n_epochs=5000, log_freq=500, seed=12345, Y_dev=Y_sd_train)

INFO:root:Computing O...
INFO:root:Estimating \mu...
INFO:root:[0 epochs]: TRAIN:[loss=0.133]
INFO:root:[500 epochs]: TRAIN:[loss=0.004]
INFO:root:[1000 epochs]: TRAIN:[loss=0.003]
INFO:root:[1500 epochs]: TRAIN:[loss=0.002]
INFO:root:[2000 epochs]: TRAIN:[loss=0.002]
INFO:root:[2500 epochs]: TRAIN:[loss=0.002]
INFO:root:[3000 epochs]: TRAIN:[loss=0.002]
INFO:root:[3500 epochs]: TRAIN:[loss=0.002]
INFO:root:[4000 epochs]: TRAIN:[loss=0.002]
INFO:root:[4500 epochs]: TRAIN:[loss=0.002]
INFO:root:Finished Training


### Look at label model performance

Because we used the SD4M training data to develop our labeling functions, the model is likely overfitted to it, so we evaluate the LabelModel on the SD4M development data instead. Snorkel's built-in `score` function is limited and best suited to binary classification, so we compute the metrics ourselves from the predictions using sklearn.
For each model we first report the metrics over all classes and then the metrics without the majority negative class.

In [24]:
from wsee.utils.scorer import score_model

# All class indices except the last one, the negative class 'O'
positive_event_type_indices = list(range(len(SD4M_RELATION_TYPES) - 1))
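
Restricting the metrics to the positive classes changes how errors are counted: a gold `O` trigger that is predicted as an event still counts as a false positive, but a correct `O` prediction no longer counts as a true positive. The sketch below shows micro and macro precision/recall over a label subset in plain Python. It mirrors the `labels` semantics of sklearn's metrics as we understand them and is not the code behind `score_model`; the helper name and the toy labels are assumptions.

```python
def precision_recall(y_true, y_pred, labels, average):
    """Precision and recall restricted to `labels`, micro or macro averaged."""
    stats = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        stats.append((tp, fp, fn))

    def ratio(num, den):
        # Undefined ratios are set to 0.0.
        return num / den if den else 0.0

    if average == 'micro':
        # Pool the counts of all selected classes before dividing.
        tp, fp, fn = (sum(col) for col in zip(*stats))
        return ratio(tp, tp + fp), ratio(tp, tp + fn)
    # Macro: average the per-class scores, every class weighs the same.
    precisions = [ratio(tp, tp + fp) for tp, fp, _ in stats]
    recalls = [ratio(tp, tp + fn) for tp, _, fn in stats]
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy gold labels and predictions; 7 is the negative class.
y_true = [0, 0, 1, 7, 7]
y_pred = [0, 1, 1, 7, 0]
print(precision_recall(y_true, y_pred, labels=[0, 1], average='micro'))
print(precision_recall(y_true, y_pred, labels=[0, 1], average='macro'))
```

This is why micro precision and micro recall can differ once `O` is excluded, whereas over all classes they coincide with accuracy as long as every example receives a prediction.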

For comparison, we also create a MajorityLabelVoter and a second LabelModel that does not use the SD4M training data to infer a class balance prior.

In [25]:
from snorkel.labeling import MajorityLabelVoter

daystream_mlv = MajorityLabelVoter(cardinality=8, verbose=True)
daystream_without_sd4m_cb = LabelModel(cardinality=8, verbose=True)
daystream_without_sd4m_cb.fit(L_train=L_daystream, n_epochs=5000, log_freq=500, seed=12345)

INFO:root:Computing O...
INFO:root:Estimating \mu...
INFO:root:[0 epochs]: TRAIN:[loss=0.130]
INFO:root:[500 epochs]: TRAIN:[loss=0.003]
INFO:root:[1000 epochs]: TRAIN:[loss=0.002]
INFO:root:[1500 epochs]: TRAIN:[loss=0.002]
INFO:root:[2000 epochs]: TRAIN:[loss=0.002]
INFO:root:[2500 epochs]: TRAIN:[loss=0.002]
INFO:root:[3000 epochs]: TRAIN:[loss=0.002]
INFO:root:[3500 epochs]: TRAIN:[loss=0.002]
INFO:root:[4000 epochs]: TRAIN:[loss=0.002]
INFO:root:[4500 epochs]: TRAIN:[loss=0.002]
INFO:root:Finished Training
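
The class balance prior that distinguishes the two label models is, to our understanding, just the relative frequency of each class in the gold labels passed as `Y_dev`; without gold labels a uniform prior is assumed. The sketch below computes such a prior from the SD4M training class counts reported in Step 2. The helper name is hypothetical and the code is not taken from Snorkel.

```python
def class_balance_from_labels(labels, cardinality):
    """Relative frequency of each class index in `labels`."""
    counts = [0] * cardinality
    for label in labels:
        counts[label] += 1
    return [count / len(labels) for count in counts]


# Class counts of the SD4M training trigger examples (Accident ... O).
counts = [59, 61, 25, 65, 101, 22, 155, 289]
Y = [label for label, count in enumerate(counts) for _ in range(count)]

balance = class_balance_from_labels(Y, cardinality=8)
uniform = [1 / 8] * 8  # prior when no gold labels are available
print([round(p, 3) for p in balance])
```

With this prior the negative class `O` receives a far larger share than the uniform 0.125, which is one plausible reason for the different scores of the two label models.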


In [26]:
L_sd_dev = applier.apply(df_sd_dev)

100%|██████████| 71/71 [00:02<00:00, 25.08it/s]


#### With tie_break_policy set to "random"
Sometimes all labeling functions abstain, or the votes are tied between several classes.
Here we use the tie break policy "random", where the label model chooses randomly among the tied options using a deterministic hash.
(When all labeling functions abstain, all classes are tied.)
Note that coverage is still calculated as usual, i.e. as the ratio of labeled data points to all data points.
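
A minimal sketch of majority voting with a tie break policy, to illustrate the behaviour described above. The deterministic choice by row index is a stand-in chosen for this example and is an assumption, not necessarily the hash Snorkel uses; `majority_vote` is a hypothetical helper.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(row, index, cardinality, tie_break_policy='abstain'):
    """Predict a class for one row of the label matrix."""
    votes = Counter(v for v in row if v != ABSTAIN)
    best = max(votes.values(), default=0)
    # If every LF abstained, all classes are tied with zero votes.
    tied = [c for c in range(cardinality) if votes.get(c, 0) == best]
    if len(tied) == 1:
        return tied[0]
    if tie_break_policy == 'random':
        # Assumed stand-in for a deterministic hash: choose by row index.
        return tied[index % len(tied)]
    return ABSTAIN


print(majority_vote([0, 0, 1], index=0, cardinality=8))
print(majority_vote([0, 1, -1], index=0, cardinality=8))
print(majority_vote([0, 1, -1], index=3, cardinality=8, tie_break_policy='random'))
```

A deterministic choice keeps repeated runs reproducible while still spreading tied rows over the tied classes.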

**Label Model**

In [27]:
score_model(model=daystream_model, L=L_sd_dev, Y=Y_sd_dev, tie_break_policy="random")



Unnamed: 0,Metric,Micro Average,Macro Average
0,precision,0.929577,0.795521
1,recall,0.929577,0.828333
2,f1,0.929577,0.805606
3,accuracy,0.929577,0.929577
4,coverage,1.0,1.0


In [28]:
score_model(model=daystream_model, L=L_sd_dev, Y=Y_sd_dev, tie_break_policy="random", labels=positive_event_type_indices)



Unnamed: 0,Metric,Micro Average,Macro Average
0,precision,0.913043,0.772024
1,recall,0.913043,0.809524
2,f1,0.913043,0.78355
3,accuracy,0.929577,0.929577
4,coverage,1.0,1.0


**Label model without a class balance prior inferred from SD4M training set**

In [29]:
score_model(model=daystream_without_sd4m_cb, L=L_sd_dev, Y=Y_sd_dev, tie_break_policy="random")



Unnamed: 0,Metric,Micro Average,Macro Average
0,precision,0.859155,0.742361
1,recall,0.859155,0.803333
2,f1,0.859155,0.758428
3,accuracy,0.859155,0.859155
4,coverage,1.0,1.0


In [30]:
score_model(model=daystream_without_sd4m_cb, L=L_sd_dev, Y=Y_sd_dev, tie_break_policy="random", labels=positive_event_type_indices)



Unnamed: 0,Metric,Micro Average,Macro Average
0,precision,0.823529,0.712698
1,recall,0.913043,0.809524
2,f1,0.865979,0.74614
3,accuracy,0.859155,0.859155
4,coverage,1.0,1.0


**Majority Label Voter**

In [31]:
score_model(model=daystream_mlv, L=L_sd_dev, Y=Y_sd_dev, tie_break_policy="random")



Unnamed: 0,Metric,Micro Average,Macro Average
0,precision,0.901408,0.796412
1,recall,0.901408,0.818333
2,f1,0.901408,0.802069
3,accuracy,0.901408,0.901408
4,coverage,0.929577,0.929577


In [32]:
score_model(model=daystream_mlv, L=L_sd_dev, Y=Y_sd_dev, tie_break_policy="random", labels=positive_event_type_indices)



Unnamed: 0,Metric,Micro Average,Macro Average
0,precision,0.875,0.773539
1,recall,0.913043,0.809524
2,f1,0.893617,0.785698
3,accuracy,0.901408,0.901408
4,coverage,0.929577,0.929577


### Do predictions on the daystream data

In [33]:
daystream_probs = daystream_model.predict_proba(L=L_daystream)

In the proposed workflow one would filter out all data points that were not labeled by any of the labeling functions.
That case does not occur here, because we use a negative labeling function that outputs the negative trigger label when all other labeling functions abstain.
If it did occur, we would instead multiply the probabilities of the abstained examples by zero so that they look like padding instances when fed into the end model.
We propose this workaround because examples that are filtered out here are treated as negative examples by default in the end model.
We also cannot afford to filter out a whole document just because one trigger/role example was not labeled.
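
The zeroing workaround can be sketched as follows: rows of the probability matrix are replaced by zeros wherever every labeling function abstained, and left untouched otherwise. The helper name and the toy inputs are assumptions; in this notebook the step is not needed, since coverage is 1.0.

```python
ABSTAIN = -1


def zero_out_unlabeled(probs, L):
    """Replace the probabilities of rows without any LF vote by zeros."""
    masked = []
    for prob_row, votes in zip(probs, L):
        all_abstain = all(v == ABSTAIN for v in votes)
        masked.append([0.0] * len(prob_row) if all_abstain else list(prob_row))
    return masked


# Toy example: the second row received no votes.
probs = [[0.9, 0.1], [0.5, 0.5]]
L = [[0, -1], [-1, -1]]
print(zero_out_unlabeled(probs, L))
```

Unlike filtering, this keeps the number of trigger examples per document unchanged, so the examples can still be merged back into their documents.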

In [34]:
labeled_daystream = pipeline.merge_event_trigger_examples(df_daystream, daystream_probs)

INFO:wsee:Merging event trigger examples that belong to the same document


In [35]:
labeled_daystream.reset_index(level=0).to_json(DATA_DIR + "/save_daystreamv6_triggers.jsonl", orient='records', lines=True, force_ascii=False)

## Step 6: Check Daystream Labeling

To look at the daystream labeling it would be best to remove the abstains.

In [36]:
from snorkel.labeling import filter_unlabeled_dataframe

df_daystream_filtered, probs_daystream_filtered = filter_unlabeled_dataframe(
    X=df_daystream, y=daystream_probs, L=L_daystream
)

In [37]:
df_daystream_filtered['trigger_probs'] = list(probs_daystream_filtered)
df_daystream_filtered['most_probable_class'] = [SD4M_RELATION_TYPES[label_idx] for label_idx in probs_daystream_filtered.argmax(axis=1)]
df_daystream_filtered['max_class_prob'] = ["{:.2f}".format(class_prob) for class_prob in probs_daystream_filtered.max(axis=1)]

In [38]:
for trigger_class in SD4M_RELATION_TYPES:
    print(f"{trigger_class}: {len(df_daystream_filtered[df_daystream_filtered['most_probable_class'] == trigger_class])} instances")

Accident: 359 instances
CanceledRoute: 427 instances
CanceledStop: 35 instances
Delay: 432 instances
Obstruction: 644 instances
RailReplacementService: 186 instances
TrafficJam: 65 instances
O: 928 instances


Code to display all the rows of the dataframe:
```python
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(df_daystream_filtered[df_daystream_filtered['most_probable_class'] == 'O'][['text', 'trigger', 'most_probable_class', 'max_class_prob', 'trigger_probs']])
```

In [39]:
df_daystream_filtered[df_daystream_filtered['most_probable_class'] == 'CanceledRoute'].sample(1)[['text', 'trigger', 'most_probable_class', 'max_class_prob', 'trigger_probs']]

Unnamed: 0,text,trigger,most_probable_class,max_class_prob,trigger_probs
139,Störung Z10 S4 S5 S6 S60: Nordbahnhof - Hauptbahnhof (tief): Verspätungen und Fahrtausfälle wegen https://t.co/OvqLQqQd8B,"{'id': 'c/eae0a52f-7215-4fbf-b0ab-207d65bc8458', 'text': 'Fahrtausfälle', 'entity_type': 'trigger', 'start': 16, 'end': 17, 'char_start': 78, 'char_end': 91}",CanceledRoute,0.92,"[8.88346316577926e-09, 0.9182762726984268, 0.0007755455670196107, 9.786866199587348e-09, 0.033630494738167004, 3.312477790629557e-09, 0.04133703932971479, 0.0059806256838644545]"
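
The derived columns can be checked by hand against the sampled row above. This small sketch repeats the argmax and formatting steps in plain Python on the `trigger_probs` of that row.

```python
EVENT_TYPES = ['Accident', 'CanceledRoute', 'CanceledStop', 'Delay',
               'Obstruction', 'RailReplacementService', 'TrafficJam', 'O']

# trigger_probs of the sampled 'Fahrtausfälle' trigger shown above.
trigger_probs = [8.88346316577926e-09, 0.9182762726984268,
                 0.0007755455670196107, 9.786866199587348e-09,
                 0.033630494738167004, 3.312477790629557e-09,
                 0.04133703932971479, 0.0059806256838644545]

# Index of the largest probability, i.e. the most probable class.
best = max(range(len(trigger_probs)), key=trigger_probs.__getitem__)
most_probable_class = EVENT_TYPES[best]
max_class_prob = "{:.2f}".format(trigger_probs[best])
print(most_probable_class, max_class_prob)  # CanceledRoute 0.92
```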
