# Transform the data to work with Snorkel: Part 1 - Event Type

Essentially, we need two labeling models:
one assigns labels to event types, and the other assigns labels to argument roles in event mentions.

Either way, event type labeling requires one row per event trigger.

For this we need one additional column:
- trigger_id

and one numpy array containing the labels:
- event_type

We will probably focus on keyword lists and some heuristics to create our labeling functions.
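
As a rough illustration of what a keyword-list labeling function can look like, here is a minimal sketch in plain Python. The keyword lists and the function name `keyword_lf` are hypothetical and are not the lists used in `wsee.labeling.event_trigger_lfs`; the label ids follow the SD4M event type table in this notebook.

```python
# Label ids follow the SD4M event type table; ABSTAIN means "no vote".
ABSTAIN = -1
OBSTRUCTION = 4
TRAFFIC_JAM = 6

# Hypothetical keyword lists, for illustration only.
KEYWORDS = {
    OBSTRUCTION: {"gesperrt", "vollsperrung", "umleitung"},
    TRAFFIC_JAM: {"stau", "stockender verkehr"},
}


def keyword_lf(trigger_text: str) -> int:
    """Return the class whose keyword list contains the trigger, else ABSTAIN."""
    # Lowercase and drop surrounding punctuation or hashtag markers.
    normalized = trigger_text.lower().strip(" .:#")
    for label, keywords in KEYWORDS.items():
        if normalized in keywords:
            return label
    return ABSTAIN
```

In Snorkel, such a function would typically be wrapped with the `labeling_function` decorator and receive a DataFrame row instead of a plain string.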

In [1]:
import sys
sys.path.append("../")
from wsee.utils import utils
from wsee.data import pipeline

DATA_DIR = '/Users/phuc/data/snorkel-daystreamv5'  # replace path to corpus

### SD4M Event Types

| Number | Code                   | Description                                                                             |
|--------|------------------------|-----------------------------------------------------------------------------------------|
| -1     | ABSTAIN                | No vote, for Labeling Functions                                                         |
| 0      | Accident               | Collision of a vehicle with another vehicle, person, or obstruction                     |
| 1      | CanceledRoute          | Cancellation of public transport routes                                                 |
| 2      | CanceledStop           | Cancellation of public transport stops                                                  |
| 3      | Delay                  | Delay resulting from remaining traffic disturbances                                     |
| 4      | Obstruction            | Temporary installation to control traffic                                               |
| 5      | RailReplacementService | Replacement of a passenger train by buses or other substitute public transport services |
| 6      | TrafficJam             | Line of stationary or very slow-moving traffic                                          |
| 7      | O                      | No SD4M event.                                                                          |
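
The table can be mirrored in code so that labeling functions refer to names instead of bare integers. This is only a sketch; the enum name `EventType` is an assumption and is not part of the `wsee` package.

```python
from enum import IntEnum


class EventType(IntEnum):
    """SD4M event type ids as listed in the table above."""
    ABSTAIN = -1
    Accident = 0
    CanceledRoute = 1
    CanceledStop = 2
    Delay = 3
    Obstruction = 4
    RailReplacementService = 5
    TrafficJam = 6
    O = 7


def event_type_name(label: int) -> str:
    """Translate a numeric label back into its code name."""
    return EventType(label).name
```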

In [2]:
loaded_data = pipeline.load_data(DATA_DIR)
sd_train = loaded_data['train']
sd_dev = loaded_data['dev']
sd_test = loaded_data['test']

daystream = loaded_data['daystream']

In [3]:
sd_train.head()

Unnamed: 0,id,text,tokens,pos_tags,ner_tags,entities,event_triggers,event_roles
0,http://www.viz-info.de/LMS-BR_r_LMS-BR_60517@2...,Unfall\nAbschnitt: Marzahn (Berlin)\nGültig ab...,"[Unfall, Abschnitt, :, Marzahn, (, Berlin, ), ...","[NN, NN, $., NE, TRUNC, NE, TRUNC, NN, PTKVZ, ...","[B-TRIGGER, O, O, B-LOCATION, O, B-LOCATION_CI...",[{'id': 'c/e6ad8c7f-24a4-4742-a52d-90207de04f0...,[{'id': 'c/e6ad8c7f-24a4-4742-a52d-90207de04f0...,[{'trigger': 'c/e6ad8c7f-24a4-4742-a52d-90207d...
1,http://www.deutschlandradio.de/#17@2016-04-04T...,Vorsicht auf der A7 Ulm Richtung Füssen zwisch...,"[Vorsicht, auf, der, A7, Ulm, Richtung, Füssen...","[NN, APPR, ART, NE, NE, NN, NN, APPR, NN, NE, ...","[O, O, O, B-LOCATION_STREET, B-LOCATION_CITY, ...",[{'id': 'c/2db85836-812f-4ced-90d3-46df9495782...,[],[]
2,667383197769048064,"Genau in dem Bus sitzen, der im Stau steht. Fü...","[Genau, in, dem, Bus, sitzen, ,, der, im, Stau...","[ADV, APPR, ART, NN, VVFIN, $,, PRELS, APPRART...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]",[],[],[]
3,603844236484550658,Große Carsharing-Übernahme: Der französische C...,"[Große, Carsharing, -, Übernahme, :, Der, fran...","[ADJA, NN, $[, NN, $., ART, ADJA, NN, $[, NN, ...","[O, O, O, O, O, O, B-LOCATION, O, O, O, B-ORGA...",[{'id': 'c/f0fdb663-677e-4353-9159-8a9530f9777...,[],[]
4,http://bauarbeiten.bahn.de/fernverkehr/Linie/I...,"an mehreren Terminen\n an den Freitagen, 3. un...","[an, mehreren, Terminen, an, den, Freitagen, ,...","[APPR, PIAT, NN, APPR, ART, NN, $,, CARD, $., ...","[O, O, O, O, O, B-DATE, I-DATE, I-DATE, I-DATE...",[{'id': 'c/f46384bf-20c6-47f5-a019-2a11fc52079...,[{'id': 'c/f84a50a1-b58f-4077-a68c-ae95a4f81e3...,[{'trigger': 'c/f84a50a1-b58f-4077-a68c-ae95a4...


## Step 1: Create one row for every event trigger

We use the (labeled) SD4M training set as development data for writing our labeling functions.
In this notebook, we run our labeling functions and our LabelModel on that data.
In the real pipeline, we instead label the Daystream data, which has no event type or event argument role labels.

In [4]:
df_dev, Y_dev = pipeline.build_event_trigger_examples(sd_train)

116it [00:00, 656.26it/s]

DataFrame has 1273 rows


1273it [00:01, 1091.86it/s]


Number of events: 487
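
To make the one-row-per-trigger idea concrete, here is a minimal sketch in plain Python. It is not the implementation of `pipeline.build_event_trigger_examples`. The function name `build_trigger_rows` is hypothetical, and the field names `entity_type` and `event_type` are assumptions about the document schema; only `id`, `entities`, and `event_triggers` are visible in the output above.

```python
EVENT_TYPES = ["Accident", "CanceledRoute", "CanceledStop", "Delay",
               "Obstruction", "RailReplacementService", "TrafficJam", "O"]


def build_trigger_rows(documents):
    """Return (rows, labels): one row per trigger entity, one label per row."""
    rows, labels = [], []
    for doc in documents:
        # Map trigger id -> annotated event type (assumed field names).
        gold = {t["id"]: t["event_type"] for t in doc.get("event_triggers", [])}
        for entity in doc.get("entities", []):
            if entity.get("entity_type") != "trigger":
                continue
            row = dict(doc)                   # copy the document-level columns
            row["trigger_id"] = entity["id"]  # the one additional column
            rows.append(row)
            # Triggers without an event annotation fall back to "O".
            labels.append(EVENT_TYPES.index(gold.get(entity["id"], "O")))
    return rows, labels
```

A document with two trigger entities therefore produces two rows that share the same text and differ only in `trigger_id`.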


We use the (labeled) SD4M development set as our "test set" to measure the performance of our LabelModel.

In [5]:
df_test, Y_test = pipeline.build_event_trigger_examples(sd_dev)

147it [00:00, 1465.98it/s]

DataFrame has 147 rows
Number of events: 46





In [6]:
from wsee import SD4M_RELATION_TYPES
print(SD4M_RELATION_TYPES)

['Accident', 'CanceledRoute', 'CanceledStop', 'Delay', 'Obstruction', 'RailReplacementService', 'TrafficJam', 'O']


## Step 2: Explore the data

In [7]:
from wsee.preprocessors.preprocessors import *
from wsee.data import explore, pipeline

We can apply all our preprocessors to the data and look for patterns that are useful for our labeling functions.
Let's start by sampling the SD4M training data, which is labeled.

In [8]:
labeled_sd4m_triggers = explore.add_labels(df_dev, Y_dev)

In [9]:
labeled_sd4m_triggers = explore.apply_preprocessors(labeled_sd4m_triggers, [get_trigger, get_trigger_text, get_trigger_left_tokens, get_trigger_right_tokens, get_entity_type_freqs, get_mixed_ner])

100%|██████████| 6/6 [00:07<00:00,  1.30s/it]


In [10]:
labeled_sd4m_triggers = explore.add_event_types(labeled_sd4m_triggers)

Next, let's take a look at the trigger text for one class, Obstruction (label 4).

In [11]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # show full cell contents; None replaces the deprecated -1

In [12]:
labeled_sd4m_triggers[labeled_sd4m_triggers['label'] == 4].sample(10)[['trigger_left_tokens','trigger_text','trigger_right_tokens','entity_type_freqs','mixed_ner','label', 'event_types']]

Unnamed: 0,trigger_left_tokens,trigger_text,trigger_right_tokens,entity_type_freqs,mixed_ner,label,event_types
296,"[Die, A61, Mönchengladbach, Richtung, Venlo, ist, zwischen, Viersen, und, Nettetal, wegen, Bergungsarbeiten]",gesperrt,"[., Zurzeit, 2, km, Stau, .]","{'location_street': 1, 'location_city': 2, 'location': 2, 'trigger': 2, 'distance': 1}",Die LOCATION_STREET LOCATION_CITY Richtung LOCATION_CITY ist zwischen LOCATION und LOCATION wegen TRIGGER TRIGGER. Zurzeit DISTANCE Stau.\n,4,"[(Bergungsarbeiten, 7), (gesperrt, 4)]"
823,[],Vorsicht,"[bitte, in, beiden, Richtungen, auf, der, A20, Lübeck, -, Rostock, zwischen, Grevesmühlen, und, Rastplatz, Bretthäger, Wisch, laufen, Tiere, !]","{'trigger': 2, 'location': 3, 'location_street': 1, 'location_city': 2}",TRIGGER bitte in LOCATION auf der LOCATION_STREET LOCATION_CITY - LOCATION_CITY zwischen LOCATION und LOCATION laufen TRIGGER!\n,4,"[(Vorsicht, 4), (Tiere, 7)]"
723,"[■, #Hamburg, :, Auf, der, #B4]",behindern,"[im, Verlauf, Stresemannstraße, -, Neuer, Pferdemarkt, #Bauarbeiten, den, Verkehr, .]","{'location_city': 1, 'location_street': 3, 'trigger': 2}",■ LOCATION_CITY: Auf der LOCATION_STREET TRIGGER im Verlauf LOCATION_STREET - LOCATION_STREET TRIGGER den Verkehr.\n,4,"[(behindern, 4), (#Bauarbeiten, 7)]"
665,"[Kreis, Schleswig, -, Flensburg, :, Die, Kreisstraße, zwischen, Selk, und, Geltorf, ist, wegen, Überflutung, der, dortigen, Wiesen]",gesperrt,"[., Eine, Umleitung, ist, eingerichtet, .]","{'location': 1, 'location_street': 1, 'location_city': 2, 'trigger': 2}",Kreis LOCATION: Die LOCATION_STREET zwischen LOCATION_CITY und LOCATION_CITY ist wegen TRIGGER der dortigen Wiesen TRIGGER. Eine Umleitung ist eingerichtet.\n,4,"[(Überflutung, 7), (gesperrt, 4)]"
811,"[Auf, der, A2, Hannover, Richtung, Dortmund, ist, die, Anschlussstelle, Vlotho, -, West, wegen, Bauarbeiten, voraussichtlich, bis, zum, 15, ., Dezember]",gesperrt,"[., Weichen, Sie, über, die, Anschlussstellen, Bad, Oyenhausen, oder, Herford, -, Ost, aus, .]","{'location_street': 1, 'location_city': 2, 'location': 3, 'trigger': 2, 'date': 1}",Auf der LOCATION_STREET LOCATION_CITY Richtung LOCATION_CITY ist die Anschlussstelle LOCATION wegen TRIGGER voraussichtlich bis zum DATE TRIGGER. Weichen Sie über die Anschlussstellen LOCATION oder LOCATION aus.\n,4,"[(Bauarbeiten, 7), (gesperrt, 4)]"
877,"[A27, Bremerhaven, Richtung, Bremen, die, Ausfahrt, Bremen, -, Vahr, ist, nach, einem, Unfall]",gesperrt,[.],"{'location_street': 1, 'location_city': 2, 'location': 1, 'trigger': 2}",LOCATION_STREET LOCATION_CITY Richtung LOCATION_CITY die LOCATION ist nach einem TRIGGER TRIGGER.\n,4,"[(Unfall, 7), (gesperrt, 4)]"
430,"[Kreis, Biberach, ,, K7588, zwischen, Daugendorf, und, Unlingen, in, beiden, Richtungen, Gefahr, durch, Hochwasser, ,, Verbindungsfahrbahn]",gesperrt,"[,, eine, örtliche, Umleitung, ist, eingerichtet, ,, bis, 19.04.2016, 17:00, Uhr]","{'location': 1, 'location_street': 1, 'location_city': 2, 'trigger': 3, 'date': 1, 'time': 1}","LOCATION, LOCATION_STREET zwischen LOCATION_CITY und LOCATION_CITY in beiden Richtungen Gefahr durch TRIGGER, Verbindungsfahrbahn TRIGGER, eine örtliche TRIGGER ist eingerichtet, bis DATE TIME\n",4,"[(Hochwasser, 7), (gesperrt, 4), (Umleitung, 4)]"
1016,"[RT, @hannover, :, Ab, Montag, (, 18.7, ., ), wird, die, Dragonerstr, ., zw, ., Isernhagener, &, Vahrenwalder, Str, .]",gesperrt,"[:, https://t.co/dcztfI3JF6, ^, SW, https://t.…]","{'location_city': 1, 'date': 2, 'location_street': 3, 'trigger': 1}",RT LOCATION_CITY: Ab DATE (DATE) wird die LOCATION_STREET. zw. LOCATION_STREET & LOCATION_STREET TRIGGER: https://t.co/dcztfI3JF6 ^SW https://t.…\n,4,"[(gesperrt, 4)]"
663,"[Die, B217, Springe, Richtung, Hannover, ist, in, Höhe, Alvesrode, /, Wisentgehege, wegen, Bauarbeiten, bis, Samstag, 17:00, Uhr]",gesperrt.,"[Eine, Umleitung, ist, eingerichtet, .]","{'location_street': 1, 'location_city': 2, 'location': 1, 'trigger': 2, 'date': 1, 'time': 1}",Die LOCATION_STREET LOCATION_CITY Richtung LOCATION_CITY ist in Höhe LOCATION wegen TRIGGER bis DATE TIME TRIGGER Eine Umleitung ist eingerichtet.\n,4,"[(Bauarbeiten, 7), (gesperrt., 4)]"
371,"[Hamburg, :, Der, Schiffbeker, Weg, ist, zwischen, Billstedter, Hauptstraße, und, Reclamstraße, nach, einem, Unfall]",gesperrt,[.],"{'location_city': 1, 'location_street': 3, 'trigger': 2}",LOCATION_CITY: Der LOCATION_STREET ist zwischen LOCATION_STREET und LOCATION_STREET nach einem TRIGGER TRIGGER.\n,4,"[(Unfall, 7), (gesperrt, 4)]"


In [13]:
labeled_sd4m_triggers[labeled_sd4m_triggers['label'] == 4]['trigger_text'].value_counts()

gesperrt               61
umgeleitet             8 
#Bauarbeiten           4 
Vollsperrung           3 
Behinderungen          3 
blockiert              2 
Schwertransport        2 
behindern              2 
Umleitung              2 
Sperrung               2 
Bauarbeiten            2 
austelle               1 
Vorsicht               1 
brennender PKW         1 
Störung:               1 
Streckensperrung       1 
lahm                   1 
gesperrt.              1 
Verkehrsbehinderung    1 
unterbrochen           1 
voll gesperrt          1 
Name: trigger_text, dtype: int64

Now we can collect the trigger words per class.
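
One possible way to collect trigger words per class is to count normalized trigger texts for each label and keep the frequent ones. The sketch below uses plain Python; the function name `trigger_words_per_class` and the `min_count` threshold are hypothetical choices, not part of the `wsee` package.

```python
from collections import Counter, defaultdict


def trigger_words_per_class(trigger_texts, labels, min_count=2):
    """Group trigger texts by label and keep those seen at least min_count times."""
    counts = defaultdict(Counter)
    for text, label in zip(trigger_texts, labels):
        # Fold variants such as "gesperrt." and "#Bauarbeiten" into one form.
        counts[label][text.lower().strip(" .:#")] += 1
    return {
        label: [word for word, n in counter.most_common() if n >= min_count]
        for label, counter in counts.items()
    }
```

The normalization step matters here because the counts above list variants such as "gesperrt" and "gesperrt." as separate entries.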

## Step 3: Evaluate the labeling functions on the SD4M training data

In [14]:
from wsee.labeling.event_trigger_lfs import *

In [15]:
from snorkel.labeling import PandasLFApplier

lfs = [
    lf_accident_context,
    lf_canceledroute_cat,
    lf_canceledstop_cat,
    lf_delay_cat,
    lf_obstruction_cat,
    lf_railreplacementservice_cat,
    lf_trafficjam_cat,
    lf_negative
]

applier = PandasLFApplier(lfs)

In [16]:
L_dev = applier.apply(df_dev)
L_test = applier.apply(df_test)

100%|██████████| 817/817 [00:42<00:00, 19.34it/s]
100%|██████████| 75/75 [00:03<00:00, 19.55it/s]


In [17]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L_dev, lfs).lf_summary(Y_dev)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
lf_accident_context,0,"[0, 7]",0.118727,0.0,0.0,84,13,0.865979
lf_canceledroute_cat,1,[1],0.152999,0.028152,0.028152,61,64,0.488
lf_canceledstop_cat,2,[2],0.031824,0.001224,0.001224,25,1,0.961538
lf_delay_cat,3,"[3, 7]",0.091799,0.011016,0.004896,69,6,0.92
lf_obstruction_cat,4,"[4, 7]",0.176255,0.040392,0.034272,103,41,0.715278
lf_railreplacementservice_cat,5,[5],0.031824,0.0,0.0,22,4,0.846154
lf_trafficjam_cat,6,"[6, 7]",0.198286,0.0,0.0,156,6,0.962963
lf_negative,7,[7],0.238678,0.0,0.0,194,1,0.994872
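
The Coverage and Emp. Acc. columns can be reproduced by hand from the votes of a single labeling function. The sketch below is plain Python and is not Snorkel's implementation. The helper name `lf_stats` is hypothetical, and the usage example only rebuilds the counts of the `lf_accident_context` row (84 correct, 13 incorrect, 817 rows) with synthetic votes.

```python
ABSTAIN = -1


def lf_stats(votes, gold):
    """Coverage and empirical accuracy for the votes of one labeling function."""
    voted = [(v, y) for v, y in zip(votes, gold) if v != ABSTAIN]
    correct = sum(1 for v, y in voted if v == y)
    return {
        "coverage": len(voted) / len(votes),   # share of rows with a vote
        "correct": correct,
        "incorrect": len(voted) - correct,
        # Accuracy is measured only on rows where the function voted.
        "emp_acc": correct / len(voted) if voted else float("nan"),
    }


# Synthetic votes matching the counts in the lf_accident_context row.
votes = [0] * 84 + [0] * 13 + [ABSTAIN] * 720
gold = [0] * 84 + [7] * 13 + [7] * 720
stats = lf_stats(votes, gold)
print(round(stats["coverage"], 6), round(stats["emp_acc"], 6))  # 0.118727 0.865979
```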


## Step 4: Error Analysis 
Now we can inspect the label matrix for errors. We need the DataFrame from the exploration section, which includes the information from the preprocessors.
We can then look specifically at the instances that were labeled incorrectly.

We first look at the false positives of the keyword-based labeling function for obstructions (lf_obstruction_cat, index 4):

In [18]:
from wsee.labeling import error_analysis

In [19]:
error_analysis.sample_fp(labeled_df=labeled_sd4m_triggers, lf_outputs=L_dev, lf_index=4, label_of_interest=4)

Unnamed: 0,trigger_left_tokens,trigger_text,trigger_right_tokens,entity_type_freqs,mixed_ner,label,event_types
1040,"[■, #A7, #Flensburg, -, #Hamburg, zwischen, #Hamburg, -, #Stellingen, und, #Volkspark, behindern, in, beiden, Richtungen]",#Bauarbeiten,"[den, Verkehr]","{'location_street': 1, 'location_city': 2, 'location': 3, 'trigger': 2}",■ LOCATION_STREET LOCATION_CITY - LOCATION_CITY zwischen LOCATION und LOCATION TRIGGER in LOCATION TRIGGER den Verkehr\n,7,"[(behindern, 4), (#Bauarbeiten, 7)]"
833,[],Sperrung,"[Abschnitt, :, in, beiden, Richtungen, (, Berlin, ), Gültig, ab, :, 11.03.2016, 19:00, Vollsperrung, (, bis, 04:00, )]","{'trigger': 1, 'location': 1, 'date': 1, 'time': 2}",TRIGGER\nAbschnitt: in beiden Richtungen (LOCATION)\nGültig ab: DATE TIME\nVollsperrung (bis TIME)\n,7,"[(Sperrung, 7)]"
723,"[■, #Hamburg, :, Auf, der, #B4, behindern, im, Verlauf, Stresemannstraße, -, Neuer, Pferdemarkt]",#Bauarbeiten,"[den, Verkehr, .]","{'location_city': 1, 'location_street': 3, 'trigger': 2}",■ LOCATION_CITY: Auf der LOCATION_STREET TRIGGER im Verlauf LOCATION_STREET - LOCATION_STREET TRIGGER den Verkehr.\n,7,"[(behindern, 4), (#Bauarbeiten, 7)]"
1249,[],Bauarbeiten,"[der, Deutschen, Bahn, –, Schienenersatzverkehr, beim, Donau, -, Isar, -, Express, :, Einschränkungen, Passau, -, Reisende, https://t.co/YQH78iQeSQ]","{'trigger': 2, 'organization_company': 1, 'location_route': 1, 'location_city': 1}",TRIGGER der ORGANIZATION_COMPANY – TRIGGER beim LOCATION_ROUTE: Einschränkungen LOCATION_CITY-Reisende https://t.co/YQH78iQeSQ\n,7,"[(Bauarbeiten, 7), (Schienenersatzverkehr, 5)]"
760,"[Kreis, Breisgau, -, Hochschwarzwald, Erdrutsch, ,, Störungen, im, Schienenverkehr, ,, bis, 11.02.2016, Mitternacht, Die, Zugverbindung, Freiburg, im, Breisgau, -, Titisee, -, Neustadt, (, Höllentalbahn, ), ist, im, Bereich, Falkensteig]",unterbrochen,"[., Ein, Schienenersatzverkehr, ist, eingerichtet, ., Mit, Behinderungen, ist, zu, rechnen, .]","{'location': 1, 'date': 1, 'location_route': 1, 'location_stop': 1, 'trigger': 1}","Kreis LOCATION Erdrutsch, Störungen im Schienenverkehr, bis DATE Die LOCATION_ROUTE ist im Bereich LOCATION_STOP TRIGGER. Ein Schienenersatzverkehr ist eingerichtet. Mit Behinderungen ist zu rechnen.\n",1,"[(unterbrochen, 1)]"
228,"[Aufgrund, eines, Notarzteinsatzes, ist, derzeit, die, Strecke, zwischen, Augsburg, Hbf, und, Mering]",gesperrt,"[., Die, Züge, aus, Richtung, Ulm, fahren, bis, Augsburg, ., Die, Züge, aus, Richtung, München, fahren, bis, Mering, ., Aktuell, konnte, noch, kein, Schienenersatzverkehr, eingerichtet, werden, ., (, 18:00, Uhr, ), .]","{'location_route': 1, 'location_stop': 2, 'trigger': 1, 'location_city': 4, 'time': 1}",Aufgrund eines Notarzteinsatzes ist derzeit die LOCATION_ROUTE zwischen LOCATION_STOP und LOCATION_STOP TRIGGER. Die Züge aus Richtung LOCATION_CITY fahren bis LOCATION_CITY. Die Züge aus Richtung LOCATION_CITY fahren bis LOCATION_CITY. Aktuell konnte noch kein Schienenersatzverkehr eingerichtet werden. (TIME).\n\n\n,1,"[(gesperrt, 1)]"
484,"[■, #Hamburg, -, #Tonndorf, :, Die, Kuehnstraße, ist, wegen]",#Bauarbeiten,"[ab, Wilsonstraße, in, Richtung, Jenfelder, Allee, bis, zum, 20, ., Mai, als, ...]","{'location_city': 1, 'location_street': 3, 'trigger': 1, 'date': 1}",■ LOCATION_CITY: Die LOCATION_STREET ist wegen TRIGGER ab LOCATION_STREET in Richtung LOCATION_STREET bis zum DATE als ...\n,7,"[(#Bauarbeiten, 7)]"
909,"[■, #Hamburg, :, Die, #B75, Meiendorfer, Straße, ist, zwischen, Spitzbergenweg, und, Saseler, Straße, wegen]",#Bauarbeiten,"[bis, Anfang, Oktober, ...]","{'location_city': 1, 'location_street': 3, 'trigger': 1, 'date': 1}",■ LOCATION_CITY: Die LOCATION_STREET ist zwischen LOCATION_STREET und LOCATION_STREET wegen TRIGGER bis DATE ...\n,7,"[(#Bauarbeiten, 7)]"
150,"[https://t.co/rJmMnlRvrb, Vorangegangene]",Streckensperrung,"[:, BOB, 86833, (, München, Hbf, ab, 18:05, Uhr, ), fällt, zwischen, Schliersee, und, Bayrischzell, a, …]","{'trigger': 1, 'location_route': 1, 'location_stop': 3, 'time': 1}",https://t.co/rJmMnlRvrb Vorangegangene TRIGGER: LOCATION_ROUTE (LOCATION_STOP ab TIME Uhr) fällt zwischen LOCATION_STOP und LOCATION_STOP a…\n,1,"[(Streckensperrung, 1)]"
1013,"[■, #Hamburg, :, Die, Hammer, Straße, ist, zwischen, Jüthornstraße, und, Bärenallee, in, beiden, Richtungen, wegen]",#Bauarbeiten,"[bis, Ende, ...]","{'location_city': 1, 'location_street': 3, 'location': 1, 'trigger': 1}",■ LOCATION_CITY: Die LOCATION_STREET ist zwischen LOCATION_STREET und LOCATION_STREET in LOCATION wegen TRIGGER bis Ende ...\n,7,"[(#Bauarbeiten, 7)]"


In [20]:
error_analysis.trigger_text_counts_fp(labeled_df=labeled_sd4m_triggers, lf_outputs=L_dev, lf_index=4, label_of_interest=4)

#Bauarbeiten           6
gesperrt               4
unterbrochen           2
Bauarbeiten            1
Sperrung               1
Betriebsstörung        1
Verkehrsbehinderung    1
Technische Störung     1
Wanderbaustelle        1
Baustelle              1
umgeleitet.Mit         1
#Störung               1
Streckensperrung       1
Name: trigger_text, dtype: int64
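
The false positive selection used above can be thought of as a filter over the label matrix: keep the rows where the chosen labeling function voted for the label of interest but the gold label differs. The sketch below is an assumption about what `error_analysis.sample_fp` selects, not its actual code, and the function name `false_positive_indices` is hypothetical.

```python
def false_positive_indices(lf_outputs, gold, lf_index, label_of_interest):
    """Row indices where one labeling function wrongly voted for a label.

    lf_outputs is a label matrix (one row per example, one column per
    labeling function); gold holds the true label of each example.
    """
    return [
        i
        for i, (votes, y) in enumerate(zip(lf_outputs, gold))
        if votes[lf_index] == label_of_interest and y != label_of_interest
    ]
```

The resulting indices can then be used to look up the matching rows in the labeled DataFrame.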

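Next, we look at the false positives of the labeling function for traffic jams (lf_trafficjam_cat, index 6):

In [21]: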
error_analysis.sample_fp(labeled_df=labeled_sd4m_triggers, lf_outputs=L_dev, lf_index=6, label_of_interest=6)

Unnamed: 0,trigger_left_tokens,trigger_text,trigger_right_tokens,entity_type_freqs,mixed_ner,label,event_types
108,"[Stuttgart, Richtung, Karlsruhe, zwischen, Stuttgart, -, Möhringen, und, Kreuz, Stuttgart, 3, km]",stockender Verkehr,[],"{'location_city': 2, 'location': 2, 'distance': 1, 'trigger': 1}",LOCATION_CITY Richtung LOCATION_CITY zwischen LOCATION und LOCATION DISTANCE TRIGGER\n,7,"[(stockender Verkehr, 7)]"
533,"[#A2, -, Basel, Richtung, Luzern, -, Zwischen, Egerkingen, und, Autobahndreieck, Verzweigung, Härkingen, stockender, Verkehr, ,]",Verkehrsüberlastung,[],"{'location_street': 1, 'location_city': 2, 'location': 2, 'trigger': 2}","LOCATION_STREET - LOCATION_CITY Richtung LOCATION_CITY - Zwischen LOCATION und LOCATION TRIGGER, TRIGGER\n",7,"[(stockender Verkehr, 6), (Verkehrsüberlastung, 7)]"
706,"[Stuttgart, Richtung, Karlsruhe, zwischen, Kreuz, Stuttgart, und, Rutesheim, 12, km]",stockender Verkehr,[],"{'location_city': 2, 'location': 2, 'distance': 1, 'trigger': 1}",LOCATION_CITY Richtung LOCATION_CITY zwischen LOCATION und LOCATION DISTANCE TRIGGER\n,7,"[(stockender Verkehr, 7)]"
708,"[A14, Magdeburg, Richtung, Halle, zwischen, Löbejün, und, Halle, -, Tornau, stockender, Verkehr, an, einer, 10, km]",Stau,"[an, einer, Baustelle]","{'location_street': 1, 'location_city': 2, 'location': 2, 'trigger': 2, 'distance': 1}",LOCATION_STREET LOCATION_CITY Richtung LOCATION_CITY zwischen LOCATION und LOCATION TRIGGER an einer DISTANCE TRIGGER an einer Baustelle\n,7,"[(stockender Verkehr, 6), (Stau, 7)]"
999,"[#A1, -, Bern, -, >, Zürich, -, Zwischen, Aarau, -, West, und, Lenzburg, Stau, ,]",Überlastung,[],"{'location_street': 1, 'location_city': 2, 'location': 2, 'trigger': 2}","LOCATION_STREET - LOCATION_CITY -> LOCATION_CITY - Zwischen LOCATION und LOCATION TRIGGER, TRIGGER\n",7,"[(Stau, 6), (Überlastung, 7)]"
1266,"[170, :]",Hohes Verkehrsaufkommen,"[Verspätungen, von, bis, zu, 20, Minuten, ..., ., https://t.co/9NM2UUzbpK]","{'location_route': 1, 'trigger': 2, 'duration': 1}",LOCATION_ROUTE: TRIGGER TRIGGER von bis zu DURATION.... https://t.co/9NM2UUzbpK\n,7,"[(Hohes Verkehrsaufkommen, 7), (Verspätungen, 3)]"


## Step 5: Train the Labeling model and label the data

In [22]:
from snorkel.labeling import LabelModel

label_model = LabelModel(cardinality=8, verbose=True)
label_model.fit(L_train=L_dev, n_epochs=500, log_freq=100, seed=123)

In [23]:
label_model_acc = label_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

Label Model Accuracy:     82.7%


In [24]:
probs_train = label_model.predict_proba(L=L_dev)

In the proposed workflow, one filters out all data points that were not labeled by any of the labeling functions.
We follow this approach, since it does not affect the merging process in our Snorkel processing pipeline.
It may leave sentences without certain events; these would presumably be processed as dummy events in the AllenNLP model and factored out during the loss calculation (this still needs to be verified).

In [25]:
from snorkel.labeling import filter_unlabeled_dataframe

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_dev, y=probs_train, L=L_dev
)
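
Conceptually, this filter keeps exactly the rows on which at least one labeling function did not abstain. A minimal plain-Python sketch of the same idea follows; the function name `filter_unlabeled` is hypothetical, and the real Snorkel function operates on a DataFrame and numpy arrays.

```python
ABSTAIN = -1


def filter_unlabeled(rows, probs, label_matrix):
    """Drop examples on which every labeling function abstained."""
    keep = [any(v != ABSTAIN for v in votes) for votes in label_matrix]
    kept_rows = [r for r, k in zip(rows, keep) if k]
    kept_probs = [p for p, k in zip(probs, keep) if k]
    return kept_rows, kept_probs
```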

In the Snorkel processing pipeline, we would merge the labeled rows that belong to the same document back together and then proceed with labeling the event argument roles.
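
The merge step can be sketched as a group-by over the document id: every trigger row contributes one entry with its id and its predicted probabilities. This is an illustration in plain Python, not the code of `pipeline.merge_event_trigger_examples`. The function name `merge_by_document` is hypothetical, and the row fields `id` and `trigger_id` are assumed to hold the document id and the trigger id.

```python
def merge_by_document(rows, probs):
    """Collect per-trigger probabilities into one list per document."""
    merged = {}
    for row, trigger_probs in zip(rows, probs):
        merged.setdefault(row["id"], []).append({
            "id": row["trigger_id"],
            "event_type_probs": list(trigger_probs),
        })
    return merged
```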

## Step 6: Label the Daystream data with Snorkel

In [26]:
df_train, Y_train = pipeline.build_event_trigger_examples(daystream)
L_train = applier.apply(df_train)

191it [00:00, 888.19it/s]

DataFrame has 1955 rows


1955it [00:04, 431.08it/s]
  0%|          | 0/1845 [00:00<?, ?it/s]

Number of events: 0


100%|██████████| 1845/1845 [02:48<00:00, 10.93it/s]


In [27]:
LFAnalysis(L_train, lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
lf_accident_context,0,"[0, 7]",0.01897,0.002168,0.002168
lf_canceledroute_cat,1,[1],0.310027,0.080759,0.080759
lf_canceledstop_cat,2,[2],0.01355,0.00542,0.00542
lf_delay_cat,3,"[3, 7]",0.225474,0.007046,0.007046
lf_obstruction_cat,4,"[4, 7]",0.161518,0.083469,0.083469
lf_railreplacementservice_cat,5,[5],0.097019,0.001626,0.001626
lf_trafficjam_cat,6,[6],0.028184,0.001084,0.001084
lf_negative,7,[7],0.240108,0.0,0.0


In [28]:
daystream_model = LabelModel(cardinality=8, verbose=True)
# Note: as in In [22], the model is fitted on L_dev (SD4M), not on L_train (Daystream).
daystream_model.fit(L_train=L_dev, n_epochs=500, log_freq=100, seed=123)

In [29]:
daystream_model_acc = daystream_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {daystream_model_acc * 100:.1f}%")

Label Model Accuracy:     82.7%


In [30]:
daystream_probs = daystream_model.predict_proba(L=L_train)

In [31]:
labeled_daystream = pipeline.merge_event_trigger_examples(df_train, daystream_probs)

In [32]:
labeled_daystream.head()

Unnamed: 0_level_0,event_triggers
id,Unnamed: 1_level_1
1106219278641045504,"[{'id': 'c/c45026ac-2537-4d19-ad8c-010c1587c7bd', 'event_type_probs': [0.07253840792860496, 0.0021861230594413775, 0.09911352885515469, 0.6419880343341251, 0.08190383931926444, 0.09916077749651382, 0.001554644503447738, 0.001554644503447738]}, {'id': 'c/758086f7-c213-45a5-b9d0-981431fe5df4', 'event_type_probs': [0.054288361561215734, 0.6069224587923946, 0.07651211082020527, 0.05010262656385902, 0.13398453139224548, 0.07656778713122799, 0.0008110618694258922, 0.0008110618694258922]}, {'id': 'c/b597d583-7fe2-4616-950b-d32b2b2c435a', 'event_type_probs': [0.0872568889544155, 0.009897017097536279, 0.11418104248845252, 0.0815559595112779, 0.03210530437146285, 0.6662757083087492, 0.004364039634052925, 0.004364039634052925]}]"
1106220052636975105,"[{'id': 'c/e08ede13-b420-4b0e-9b96-f1c71595f631', 'event_type_probs': [0.07253840792860496, 0.0021861230594413775, 0.09911352885515469, 0.6419880343341251, 0.08190383931926444, 0.09916077749651382, 0.001554644503447738, 0.001554644503447738]}, {'id': 'c/08f09097-94cb-4342-ae4e-cf1cfb94572b', 'event_type_probs': [0.054288361561215734, 0.6069224587923946, 0.07651211082020527, 0.05010262656385902, 0.13398453139224548, 0.07656778713122799, 0.0008110618694258922, 0.0008110618694258922]}]"
1106221297904816130,"[{'id': 'c/d3812715-f08f-411e-904a-c69bcafd5b86', 'event_type_probs': [0.0872568889544155, 0.009897017097536279, 0.11418104248845252, 0.0815559595112779, 0.03210530437146285, 0.6662757083087492, 0.004364039634052925, 0.004364039634052925]}]"
1106222498524344320,"[{'id': 'c/9b2d513a-e521-4344-9776-4350235792eb', 'event_type_probs': [0.08530603024738755, 0.028533060799313028, 0.10895953478683099, 0.08257228315807126, 0.04689411732841202, 0.10936859660325718, 0.005775321204889781, 0.5325910558718383]}]"
1106230914458238976,"[{'id': 'c/d2dff8c6-4e5f-4c30-b216-c4f578fc1390', 'event_type_probs': [0.054288361561215734, 0.6069224587923946, 0.07651211082020527, 0.05010262656385902, 0.13398453139224548, 0.07656778713122799, 0.0008110618694258922, 0.0008110618694258922]}]"
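
Each `event_type_probs` list has one entry per event type, in the order printed for `SD4M_RELATION_TYPES`. To read off the most likely event type, one can take the index of the largest probability. A minimal sketch follows; the helper name `most_likely_event_type` is hypothetical, and the example values are the probabilities of the first trigger shown above.

```python
EVENT_TYPES = ["Accident", "CanceledRoute", "CanceledStop", "Delay",
               "Obstruction", "RailReplacementService", "TrafficJam", "O"]


def most_likely_event_type(event_type_probs):
    """Return (event type name, probability) for the highest-scoring class."""
    best = max(range(len(event_type_probs)), key=event_type_probs.__getitem__)
    return EVENT_TYPES[best], event_type_probs[best]


# Probabilities of the first trigger in the merged output above.
first_trigger = [0.07253840792860496, 0.0021861230594413775,
                 0.09911352885515469, 0.6419880343341251,
                 0.08190383931926444, 0.09916077749651382,
                 0.001554644503447738, 0.001554644503447738]
print(most_likely_event_type(first_trigger)[0])  # Delay
```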
