# Transform the data to work with Snorkel: Part 1 - Event Type

We will need two label models: one assigns event types to triggers, the other assigns argument roles within event mentions.

Either way, event type labeling requires one row per event trigger.

For this we need one additional column:
- trigger_id

and one NumPy array that holds, for each row, the:
- event_type
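
As a rough illustration of this layout, the following sketch expands a toy document-level DataFrame into one row per trigger and builds the matching label array. The column names `id`, `text` and `event_triggers` follow the data shown in this notebook; the helper `explode_triggers`, the `event_type` key on each trigger and the toy rows are assumptions made for the example, not the actual `wsee.data.pipeline` implementation.

```python
import numpy as np
import pandas as pd

# Label names in index order, as listed in the SD4M table of this notebook.
EVENT_TYPES = ['Accident', 'CanceledRoute', 'CanceledStop', 'Delay',
               'Obstruction', 'RailReplacementService', 'TrafficJam', 'O']


def explode_triggers(docs: pd.DataFrame):
    """Return (DataFrame with one row per trigger, integer label array).

    Assumes (hypothetically) that every trigger dict has 'id' and 'event_type'.
    """
    rows, labels = [], []
    for _, doc in docs.iterrows():
        for trigger in doc['event_triggers']:
            row = doc.to_dict()
            row['trigger_id'] = trigger['id']  # the one additional column
            rows.append(row)
            labels.append(EVENT_TYPES.index(trigger['event_type']))
    return pd.DataFrame(rows), np.array(labels, dtype=int)


# Toy documents, made up for illustration.
toy_docs = pd.DataFrame([
    {'id': 'doc1', 'text': 'Unfall, A9 gesperrt',
     'event_triggers': [{'id': 't1', 'event_type': 'Accident'},
                        {'id': 't2', 'event_type': 'Obstruction'}]},
    {'id': 'doc2', 'text': 'Kein Ereignis', 'event_triggers': []},
])

trigger_rows, y = explode_triggers(toy_docs)
print(trigger_rows[['id', 'trigger_id']])
print(y)
```

Documents without triggers contribute no rows, so the label array always has the same length as the trigger DataFrame.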

We will probably focus on keyword lists and some heuristics to create our labeling functions.

In [1]:
import sys
sys.path.append("../")
from wsee.utils import utils
from wsee.data import pipeline

DATA_DIR = '/Users/phuc/data/snorkel-daystreamv5'  # replace path to corpus

### SD4M Event Types

| Number | Code                   | Description                                                                                                                                          |
|--------|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| -1     | ABSTAIN                | No vote, for labeling functions                                                                                                                      |
| 0      | Accident               | Occurs when a vehicle collides with another vehicle, person, or obstruction.                                                                         |
| 1      | CanceledRoute          | Cancellation of public transport routes or stops                                                                                                     |
| 2      | CanceledStop           | Cancellation of public transport stops. Includes airport / train station shutdowns, even without mentioning a specific route                         |
| 3      | Delay                  | Remaining traffic disturbances should be labelled as delay                                                                                           |
| 4      | Obstruction            | A temporary installation to control traffic.                                                                                                         |
| 5      | RailReplacementService | A replacement service uses buses (or another transport service) to replace a passenger train on a temporary or permanent basis.                      |
| 6      | TrafficJam             | Condition on road networks that occurs as use increases, and is characterised by slower speeds, longer trip times, and increased vehicular queueing. |
| 7      | O                      | No SD4M event.                                                                                                                                       |

In [2]:
loaded_data = pipeline.load_data(DATA_DIR)
sd_train = loaded_data['train']
sd_dev = loaded_data['dev']
sd_test = loaded_data['test']

daystream = loaded_data['daystream']

In [3]:
sd_train.head()

Unnamed: 0,id,text,tokens,pos_tags,ner_tags,entities,event_triggers,event_roles
0,http://www.viz-info.de/LMS-BR_r_LMS-BR_60517@2...,Unfall\nAbschnitt: Marzahn (Berlin)\nGültig ab...,"[Unfall, Abschnitt, :, Marzahn, (, Berlin, ), ...","[NN, NN, $., NE, TRUNC, NE, TRUNC, NN, PTKVZ, ...","[B-TRIGGER, O, O, B-LOCATION, O, B-LOCATION_CI...",[{'id': 'c/e6ad8c7f-24a4-4742-a52d-90207de04f0...,[{'id': 'c/e6ad8c7f-24a4-4742-a52d-90207de04f0...,[{'trigger': 'c/e6ad8c7f-24a4-4742-a52d-90207d...
1,http://www.deutschlandradio.de/#17@2016-04-04T...,Vorsicht auf der A7 Ulm Richtung Füssen zwisch...,"[Vorsicht, auf, der, A7, Ulm, Richtung, Füssen...","[NN, APPR, ART, NE, NE, NN, NN, APPR, NN, NE, ...","[O, O, O, B-LOCATION_STREET, B-LOCATION_CITY, ...",[{'id': 'c/2db85836-812f-4ced-90d3-46df9495782...,[],[]
2,667383197769048064,"Genau in dem Bus sitzen, der im Stau steht. Fü...","[Genau, in, dem, Bus, sitzen, ,, der, im, Stau...","[ADV, APPR, ART, NN, VVFIN, $,, PRELS, APPRART...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]",[],[],[]
3,603844236484550658,Große Carsharing-Übernahme: Der französische C...,"[Große, Carsharing, -, Übernahme, :, Der, fran...","[ADJA, NN, $[, NN, $., ART, ADJA, NN, $[, NN, ...","[O, O, O, O, O, O, B-LOCATION, O, O, O, B-ORGA...",[{'id': 'c/f0fdb663-677e-4353-9159-8a9530f9777...,[],[]
4,http://bauarbeiten.bahn.de/fernverkehr/Linie/I...,"an mehreren Terminen\n an den Freitagen, 3. un...","[an, mehreren, Terminen, an, den, Freitagen, ,...","[APPR, PIAT, NN, APPR, ART, NN, $,, CARD, $., ...","[O, O, O, O, O, B-DATE, I-DATE, I-DATE, I-DATE...",[{'id': 'c/f46384bf-20c6-47f5-a019-2a11fc52079...,[{'id': 'c/f84a50a1-b58f-4077-a68c-ae95a4f81e3...,[{'trigger': 'c/f84a50a1-b58f-4077-a68c-ae95a4...


## Step 1: Create one row for every event trigger

We will use the (labeled) SD4M training set as our development data to create our labeling functions.
In this notebook we will run our labeling functions and our LabelModel on that data.
In the real pipeline we will instead label the Daystream data, which has no event type or event argument role labels.

In [4]:
df_dev, Y_dev = pipeline.build_event_trigger_examples(sd_train)

116it [00:00, 685.09it/s]

DataFrame has 1273 rows


1273it [00:01, 1138.36it/s]

Number of events: 487





We use the (labeled) SD4M development set as our "test set" to measure the performance of our LabelModel.

In [5]:
df_test, Y_test = pipeline.build_event_trigger_examples(sd_dev)

147it [00:00, 1530.77it/s]

DataFrame has 147 rows
Number of events: 46





In [6]:
from wsee import SD4M_RELATION_TYPES
print(SD4M_RELATION_TYPES)

['Accident', 'CanceledRoute', 'CanceledStop', 'Delay', 'Obstruction', 'RailReplacementService', 'TrafficJam', 'O']
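
The position of each name in this list corresponds to the number in the SD4M event type table, which is the integer a labeling function returns. A minimal sketch of that mapping, with the list copied from the output above; `ABSTAIN`, `SD4M_TYPES`, `LABEL_TO_INDEX` and `index_to_label` are illustrative names, not part of `wsee`.

```python
ABSTAIN = -1  # "no vote", see the SD4M event type table

# Copied from the printed list; the list position is the integer label.
SD4M_TYPES = ['Accident', 'CanceledRoute', 'CanceledStop', 'Delay',
              'Obstruction', 'RailReplacementService', 'TrafficJam', 'O']

LABEL_TO_INDEX = {name: index for index, name in enumerate(SD4M_TYPES)}


def index_to_label(index: int) -> str:
    """Map an integer label (including ABSTAIN) back to its name."""
    return 'ABSTAIN' if index == ABSTAIN else SD4M_TYPES[index]


print(LABEL_TO_INDEX['Obstruction'])
print(index_to_label(7))
```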


## Step 2: Explore the data

In [7]:
from wsee.preprocessors.preprocessors import *
from wsee.data import explore, pipeline

We can apply all our preprocessors to the data and look for patterns that are useful for our labeling functions.
Let's first sample the SD4M training data, which is labeled.

In [8]:
labeled_sd4m_triggers = explore.add_labels(df_dev, Y_dev)

In [9]:
labeled_sd4m_triggers = explore.apply_preprocessors(labeled_sd4m_triggers, [get_trigger, get_trigger_text, get_trigger_left_tokens, get_trigger_right_tokens, get_entity_type_freqs, get_mixed_ner])

100%|██████████| 6/6 [00:07<00:00,  1.23s/it]


In [10]:
labeled_sd4m_triggers = explore.add_event_types(labeled_sd4m_triggers)

Let's first take a look at the trigger text.

In [11]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0

In [12]:
labeled_sd4m_triggers[labeled_sd4m_triggers['label'] == 4].sample(10)[['trigger_left_tokens','trigger_text','trigger_right_tokens','entity_type_freqs','mixed_ner','label', 'event_types']]

Unnamed: 0,trigger_left_tokens,trigger_text,trigger_right_tokens,entity_type_freqs,mixed_ner,label,event_types
0,"[Unfall, Abschnitt, :, Marzahn, (, Berlin, ), Gültig, ab, :, 09.02.2016, 20:06]",gesperrt,"[,, Unfall]","{'trigger': 2, 'location': 1, 'location_city': 1, 'date': 1, 'time': 1}","TRIGGER\nAbschnitt: LOCATION (LOCATION_CITY)\nGültig ab: DATE TIME\nTRIGGER, Unfall\n",4,"[(Unfall, 0), (gesperrt, 4)]"
1264,"[■, #Hamburg, :, Die, Bundesstraße, ist, zwischen, Sedanstraße, und, Papendamm, wegen, #Bauarbeiten, in, beiden, Richtungen, bis, Ende, Juli]",gesperrt,[.],"{'location_city': 1, 'location_street': 3, 'trigger': 2, 'location': 1, 'date': 1}",■ LOCATION_CITY: Die LOCATION_STREET ist zwischen LOCATION_STREET und LOCATION_STREET wegen TRIGGER in LOCATION bis DATE TRIGGER.\n,4,"[(#Bauarbeiten, 7), (gesperrt, 4)]"
735,"[A255, Zweig, Hamburg, -, Veddel, stadtauswärts, zwischen, Hamburg, -, Süd, und, dem, Kreuz, Hamburg, -, Süd, ist, wegen, Bauarbeiten, eine, Spur]",gesperrt,[.],"{'location_street': 1, 'location': 4, 'trigger': 2}",LOCATION_STREET Zweig LOCATION LOCATION zwischen LOCATION und dem LOCATION ist wegen TRIGGER eine Spur TRIGGER.\n,4,"[(Bauarbeiten, 7), (gesperrt, 4)]"
663,"[Die, B217, Springe, Richtung, Hannover, ist, in, Höhe, Alvesrode, /, Wisentgehege, wegen, Bauarbeiten, bis, Samstag, 17:00, Uhr]",gesperrt.,"[Eine, Umleitung, ist, eingerichtet, .]","{'location_street': 1, 'location_city': 2, 'location': 1, 'trigger': 2, 'date': 1, 'time': 1}",Die LOCATION_STREET LOCATION_CITY Richtung LOCATION_CITY ist in Höhe LOCATION wegen TRIGGER bis DATE TIME TRIGGER Eine Umleitung ist eingerichtet.\n,4,"[(Bauarbeiten, 7), (gesperrt., 4)]"
1154,"[B95, ,, B173, Grenzübergang, Oberwiesenthal, Richtung, Zwickau, zwischen, Thum, und, Einfahrt, A72, ,, Chemnitz, -, Süd]",Schwertransport,"[,, Überholen, nicht, möglich]","{'location_street': 2, 'location': 4, 'location_city': 1, 'trigger': 1}","LOCATION_STREET, LOCATION_STREET LOCATION Richtung LOCATION_CITY zwischen LOCATION und LOCATION, LOCATION TRIGGER, Überholen nicht möglich\n",4,"[(Schwertransport, 4)]"
776,"[1, ., Aktualisierung, Nürnberg, -, Bamberg, :, Notarzteinsatz, am, Gleis, /, Schienenersatzverkehr, eingerichtet, (, Stand, 24.03.2016, ,, 12:45, Uhr, ), KBS820, :, Nürnberg, -, Bamberg, -, Lichtenfels, -, Sonneberg, S, 1, :, Hartmannshof, -, Nürnberg, -, Bamberg, Meldung, :, wegen, eines, Notarzteinsatzes, am, Gleis, ist, die, Strecke, zwischen, Hirschaid, und, Forchheim, weiterhin]",gesperrt,"[., Die, S, -, Bahnen, der, Linie, S, 1, aus, Richtung, Bamberg, verkehren, bis, Hirschaid, und, enden, dort, vorzeitig, ., Die, S, -, Bahnen, aus, Richtung, Hartmannshof, verkehren, bis, Forchheim, und, enden, vorzeitig, ., Die, Regionalzüge, aus, Richtung, Nürnberg, verkehren, bis, Forchheim, und, enden, dort, vorzeitig, ., Die, Züge, aus, Richtung, Lichtenfels, verkehren, bis, Bamberg, und, enden, vorzeitig, ., Ein, Schienenersatzverkehr, mit, Bus, zwischen, Hirschaid, und, Forchheim, ist, für, Sie, eingerichtet, ., Letzte, Aktualisierung, :, 2016, -03-24, 12:48:41, Meldehistorie, zu, dieser, Störung, einsehen, Ältere, Meldungen, :, Nürnberg, -, Bamberg, :, No]","{'location_stop': 23, 'trigger': 3, 'date': 2, 'time': 2, 'location_route': 7}","1. Aktualisierung LOCATION_STOP - LOCATION_STOP: Notarzteinsatz am Gleis / TRIGGER eingerichtet (Stand DATE, TIME)\nLOCATION_ROUTE:\n LOCATION_STOP - LOCATION_STOP - LOCATION_STOP - LOCATION_STOP\nLOCATION_ROUTE:\n LOCATION_STOP - LOCATION_STOP - LOCATION_STOP\nMeldung:\n wegen eines Notarzteinsatzes am Gleis ist die LOCATION_ROUTE zwischen LOCATION_STOP und LOCATION_STOP weiterhin TRIGGER. Die S-Bahnen der Linie LOCATION_ROUTE aus Richtung LOCATION_STOP verkehren bis LOCATION_STOP und enden dort vorzeitig. Die LOCATION_ROUTE aus Richtung LOCATION_STOP verkehren bis LOCATION_STOP und enden vorzeitig. \nDie LOCATION_ROUTE aus Richtung LOCATION_STOP verkehren bis LOCATION_STOP und enden dort vorzeitig. Die LOCATION_ROUTE aus Richtung LOCATION_STOP verkehren bis LOCATION_STOP und TRIGGER. 
\nEin Schienenersatzverkehr mit Bus zwischen LOCATION_STOP und LOCATION_STOP ist für Sie eingerichtet.\n \n Letzte Aktualisierung: DATE TIME\n \nMeldehistorie zu dieser Störung einsehen\nÄltere Meldungen:\nLOCATION_STOP - LOCATION_STOP: No\n",4,"[(Schienenersatzverkehr, 7), (gesperrt, 4), (enden vorzeitig, 7)]"
960,"[A17, Prag, Richtung, Dresden, zwischen, Dresden, -, Südvorstadt, und, Dresden, -, Gorbitz, rechter, Fahrstreifen]",blockiert,"[,, defekter, LKW]","{'location_street': 1, 'location_city': 2, 'location': 2, 'trigger': 1}","LOCATION_STREET LOCATION_CITY Richtung LOCATION_CITY zwischen LOCATION und LOCATION rechter Fahrstreifen TRIGGER, defekter LKW\n",4,"[(blockiert, 4)]"
1195,"[A61, Koblenz, Richtung, Mönchengladbach, zwischen, Swisttal, und, Kreuz, Bliesheim, 2, km, Stau, wegen, der]",Sperrung,"[der, A1]","{'location_street': 2, 'location_city': 2, 'location': 2, 'distance': 1, 'trigger': 2}",LOCATION_STREET LOCATION_CITY Richtung LOCATION_CITY zwischen LOCATION und LOCATION DISTANCE TRIGGER wegen der TRIGGER der LOCATION_STREET\n,4,"[(Stau, 6), (Sperrung, 4)]"
962,"[Unfall, bei, Dessau, -, Roßlau, :, A9, bleibt, nach, Lkw, -, Unfall, bis, Freitag]",gesperrt,"[-, FOCUS, Online, https://t.co/xwLYw8Vqq2, #dessau, #rosslau]","{'location': 1, 'location_street': 1, 'trigger': 2, 'date': 1, 'organization_company': 1, 'location_city': 2}",Unfall bei LOCATION: LOCATION_STREET bleibt nach TRIGGER bis DATE TRIGGER - ORGANIZATION_COMPANY https://t.co/xwLYw8Vqq2 LOCATION_CITY LOCATION_CITY\n,4,"[(Lkw-Unfall, 7), (gesperrt, 4)]"
328,"[am, Sonntag, ,, 1, ., Mai, ,, 21.10, –, 23.45, Uhr, Meldung, :, IC, 2437, von, Emden, Hbf, (, planmäßige, Ankunft, 23.00, Uhr, in, Magdeburg, Hbf, ), wird, von, Hannover, Hbf, bis, Braunschweig, Hbf]",umgeleitet,"[und, hält, nicht, in, Peine, ., Aufgrund, der, Umleitung, verspätet, sich, der, Zug, um, bis, zu, 20, Min, ., IC, 2030, von, Dresden, Hbf, (, planmäßige, Ankunft, 23.23, Uhr, in, Hannover, Hbf, ), wird, von, Braunschweig, Hbf, bis, Hannover, Hbf, umgeleitet, und, verspätet, sich, um, bis, zu, 20, Min, ., Grund, :, Weichenarbeiten, in, Hannover, Hbf, Link, zur, detaillierten, Meldung, :, Link, zum, kompletten, PDF, -, Dokument, :, (, 88, kB, ), ------------------]","{'date': 1, 'time': 4, 'location_route': 2, 'location_stop': 10, 'trigger': 4, 'duration': 2, 'number': 1}","am DATE, TIME – TIME\nMeldung:\n LOCATION_ROUTE von LOCATION_STOP (planmäßige Ankunft TIME in LOCATION_STOP) wird von LOCATION_STOP bis LOCATION_STOP TRIGGER und TRIGGER in LOCATION_STOP. Aufgrund der Umleitung verspätet sich der Zug um bis zu DURATION\n LOCATION_ROUTE von LOCATION_STOP (planmäßige Ankunft TIME in LOCATION_STOP) wird von LOCATION_STOP bis LOCATION_STOP TRIGGER und TRIGGER sich um bis zu DURATION\nGrund:\nWeichenarbeiten in LOCATION_STOP\nLink zur detaillierten Meldung: \nLink zum kompletten PDF-Dokument: \n(NUMBER kB)\n------------------\n",4,"[(umgeleitet, 4), (hält nicht, 2), (umgeleitet, 4), (verspätet, 3)]"


In [13]:
labeled_sd4m_triggers[labeled_sd4m_triggers['label'] == 4]['trigger_text'].value_counts()

gesperrt               61
umgeleitet             8 
#Bauarbeiten           4 
Vollsperrung           3 
Behinderungen          3 
blockiert              2 
Sperrung               2 
behindern              2 
Bauarbeiten            2 
Schwertransport        2 
Umleitung              2 
Streckensperrung       1 
brennender PKW         1 
Verkehrsbehinderung    1 
Störung:               1 
austelle               1 
voll gesperrt          1 
unterbrochen           1 
Vorsicht               1 
gesperrt.              1 
lahm                   1 
Name: trigger_text, dtype: int64

Now we can collect the trigger words per class.
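
One way to turn such counts into a labeling function is a keyword lookup on the trigger text. The sketch below is plain Python so that it runs on its own; the keyword set is a small hand-picked subset of the counts above, and `normalize` and `lf_obstruction_keywords` are hypothetical names, not functions from `wsee.labeling.event_trigger_lfs`. In Snorkel such a function would typically be wrapped with the `labeling_function` decorator from `snorkel.labeling`.

```python
ABSTAIN = -1
OBSTRUCTION = 4

# Hand-picked from the most frequent Obstruction triggers (lower-cased).
OBSTRUCTION_KEYWORDS = {'gesperrt', 'umgeleitet', 'vollsperrung',
                        'blockiert', 'sperrung'}


def normalize(trigger_text: str) -> str:
    """Lower-case, drop a leading hashtag and trailing punctuation."""
    return trigger_text.lower().lstrip('#').rstrip('.:,')


def lf_obstruction_keywords(row: dict) -> int:
    """Vote Obstruction if the trigger text is a known keyword, else abstain."""
    if normalize(row['trigger_text']) in OBSTRUCTION_KEYWORDS:
        return OBSTRUCTION
    return ABSTAIN


print(lf_obstruction_keywords({'trigger_text': 'gesperrt.'}))
print(lf_obstruction_keywords({'trigger_text': 'Stau'}))
```

A keyword alone will not be fully precise: in the error analysis in Step 4, `gesperrt` also appears with gold label 1 (CanceledRoute), which is why some context heuristics are needed on top of the keyword lists.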

## Step 3: Evaluate the labeling functions on the SD4M training data

In [14]:
from wsee.labeling.event_trigger_lfs import *

In [15]:
from snorkel.labeling import PandasLFApplier

lfs = [
    lf_accident_context,
    lf_canceledroute_cat,
    lf_canceledstop_cat,
    lf_delay_cat,
    lf_obstruction_cat,
    lf_railreplacementservice_cat,
    lf_trafficjam_cat,
    lf_negative
]

applier = PandasLFApplier(lfs)

In [16]:
L_dev = applier.apply(df_dev)
L_test = applier.apply(df_test)

100%|██████████| 817/817 [00:34<00:00, 23.77it/s]
100%|██████████| 75/75 [00:02<00:00, 25.27it/s]


In [17]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L_dev, lfs).lf_summary(Y_dev)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
lf_accident_context,0,"[0, 7]",0.118727,0.0,0.0,84,13,0.865979
lf_canceledroute_cat,1,[1],0.152999,0.028152,0.028152,61,64,0.488
lf_canceledstop_cat,2,[2],0.031824,0.001224,0.001224,25,1,0.961538
lf_delay_cat,3,"[3, 7]",0.091799,0.011016,0.004896,69,6,0.92
lf_obstruction_cat,4,"[4, 7]",0.176255,0.040392,0.034272,103,41,0.715278
lf_railreplacementservice_cat,5,[5],0.033048,0.0,0.0,22,5,0.814815
lf_trafficjam_cat,6,[6],0.200734,0.0,0.0,155,9,0.945122
lf_negative,7,[7],0.235006,0.0,0.0,191,1,0.994792
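
To make the columns concrete: Coverage is the fraction of rows on which a labeling function does not abstain, and Emp. Acc. is Correct / (Correct + Incorrect) on those rows. For `lf_negative` this gives 191 / (191 + 1) ≈ 0.9948. A small NumPy sketch of both quantities on a toy label matrix (the matrix and gold labels are made up for illustration):

```python
import numpy as np

ABSTAIN = -1

# Toy label matrix: rows are triggers, columns are labeling functions.
L = np.array([
    [4, -1],
    [4, 7],
    [-1, 7],
    [-1, -1],
])
Y = np.array([4, 7, 7, 0])  # toy gold labels

voted = L != ABSTAIN                               # where each LF cast a vote
coverage = voted.mean(axis=0)                      # fraction of rows labeled per LF
correct = ((L == Y[:, None]) & voted).sum(axis=0)  # votes that match the gold label
emp_acc = correct / voted.sum(axis=0)              # accuracy on non-abstain votes only

print(coverage)
print(emp_acc)
```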


## Step 4: Error Analysis
Now we can inspect the label matrix for errors. For this we use the DataFrame from the exploration step (Step 2), which already contains the preprocessor outputs.
We can then specifically look for the instances that were labeled incorrectly.
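
The idea behind a false-positive lookup can be sketched as follows: select the rows where the labeling function at `lf_index` voted for `label_of_interest` while the gold label is different. `false_positive_mask` is a hypothetical helper written for this illustration; the actual `error_analysis.sample_fp` in `wsee` may be implemented differently.

```python
import numpy as np


def false_positive_mask(lf_outputs, gold, lf_index, label_of_interest):
    """True where the LF voted `label_of_interest` but the gold label differs."""
    votes = np.asarray(lf_outputs)[:, lf_index]
    return (votes == label_of_interest) & (np.asarray(gold) != label_of_interest)


# Toy label matrix with a single labeling function, plus toy gold labels.
L_toy = np.array([[4], [4], [-1], [4]])
Y_toy = np.array([4, 7, 4, 1])

mask = false_positive_mask(L_toy, Y_toy, lf_index=0, label_of_interest=4)
print(np.flatnonzero(mask))  # row positions of the false positives
```

The resulting boolean mask can be used to index the labeled DataFrame, as long as its rows are in the same order as the label matrix.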

We will first look at the keyword-based labeling function for obstructions (`lf_index=4`, label 4):

In [18]:
from wsee.labeling import error_analysis

In [19]:
error_analysis.sample_fp(labeled_df=labeled_sd4m_triggers, lf_outputs=L_dev, lf_index=4, label_of_interest=4)

Unnamed: 0,trigger_left_tokens,trigger_text,trigger_right_tokens,entity_type_freqs,mixed_ner,label,event_types
1013,"[■, #Hamburg, :, Die, Hammer, Straße, ist, zwischen, Jüthornstraße, und, Bärenallee, in, beiden, Richtungen, wegen]",#Bauarbeiten,"[bis, Ende, ...]","{'location_city': 1, 'location_street': 3, 'location': 1, 'trigger': 1}",■ LOCATION_CITY: Die LOCATION_STREET ist zwischen LOCATION_STREET und LOCATION_STREET in LOCATION wegen TRIGGER bis Ende ...\n,7,"[(#Bauarbeiten, 7)]"
484,"[■, #Hamburg, -, #Tonndorf, :, Die, Kuehnstraße, ist, wegen]",#Bauarbeiten,"[ab, Wilsonstraße, in, Richtung, Jenfelder, Allee, bis, zum, 20, ., Mai, als, ...]","{'location_city': 1, 'location_street': 3, 'trigger': 1, 'date': 1}",■ LOCATION_CITY: Die LOCATION_STREET ist wegen TRIGGER ab LOCATION_STREET in Richtung LOCATION_STREET bis zum DATE als ...\n,7,"[(#Bauarbeiten, 7)]"
813,"[■, #Hamburg, :, Die, #B75, Bremer, Straße, ist, zwischen, Eißendorfer, Mühlenweg, und, Metzendorfer, Weg, wegen]",#Bauarbeiten,"[bis, Anfang, ...]","{'location_route': 1, 'location_street': 3, 'trigger': 1}",■ LOCATION_ROUTE: Die LOCATION_STREET ist zwischen LOCATION_STREET und LOCATION_STREET wegen TRIGGER bis Anfang ...\n,7,"[(#Bauarbeiten, 7)]"
37,"[161, :]",Betriebsstörung,"[Verspätungen, von, bis, zu, 20, Minuten, ..., ., https://t.co/pLUsLLsfwM]","{'location_route': 1, 'trigger': 2, 'duration': 1}",LOCATION_ROUTE: TRIGGER TRIGGER von bis zu DURATION.... https://t.co/pLUsLLsfwM\n,7,"[(Betriebsstörung, 7), (Verspätungen, 3)]"
43,"[ÖAMTC, meldet, :, Zwischen, Nußdorf, am, Attersee, -, Oberwang, und, Innerschwand, am, Mondsee]",Verkehrsbehinderung,"[,, Radrennen, …, https://t.co/Hm0BwzQFXs]","{'organization': 1, 'location': 3, 'trigger': 2}","ORGANIZATION meldet: Zwischen LOCATION - LOCATION und LOCATION TRIGGER, TRIGGER… https://t.co/Hm0BwzQFXs\n",7,"[(Verkehrsbehinderung, 7), (Radrennen, 7)]"
909,"[■, #Hamburg, :, Die, #B75, Meiendorfer, Straße, ist, zwischen, Spitzbergenweg, und, Saseler, Straße, wegen]",#Bauarbeiten,"[bis, Anfang, Oktober, ...]","{'location_city': 1, 'location_street': 3, 'trigger': 1, 'date': 1}",■ LOCATION_CITY: Die LOCATION_STREET ist zwischen LOCATION_STREET und LOCATION_STREET wegen TRIGGER bis DATE ...\n,7,"[(#Bauarbeiten, 7)]"
760,"[Kreis, Breisgau, -, Hochschwarzwald, Erdrutsch, ,, Störungen, im, Schienenverkehr, ,, bis, 11.02.2016, Mitternacht, Die, Zugverbindung, Freiburg, im, Breisgau, -, Titisee, -, Neustadt, (, Höllentalbahn, ), ist, im, Bereich, Falkensteig]",unterbrochen,"[., Ein, Schienenersatzverkehr, ist, eingerichtet, ., Mit, Behinderungen, ist, zu, rechnen, .]","{'location': 1, 'date': 1, 'location_route': 1, 'location_stop': 1, 'trigger': 1}","Kreis LOCATION Erdrutsch, Störungen im Schienenverkehr, bis DATE Die LOCATION_ROUTE ist im Bereich LOCATION_STOP TRIGGER. Ein Schienenersatzverkehr ist eingerichtet. Mit Behinderungen ist zu rechnen.\n",1,"[(unterbrochen, 1)]"
848,"[A8, München, Richtung, Stuttgart, zwischen, Günzburg, und, Leipheim]",Wanderbaustelle,"[,, die, rechte, Spur, ist, blockiert, .]","{'location_street': 1, 'location_city': 2, 'location': 2, 'trigger': 1}","LOCATION_STREET LOCATION_CITY Richtung LOCATION_CITY zwischen LOCATION und LOCATION TRIGGER, die rechte Spur ist blockiert.\n",7,"[(Wanderbaustelle, 7)]"
650,"[Wegen, eines, Notarzteinsatzes, ist, die, Strecke, zwischen, Geisenhausen, und, Landshut]",gesperrt,"[., Es, fahren, ersatzweise, Busse, ., (, 06:33, )]","{'trigger': 2, 'location_route': 1, 'location_stop': 2, 'time': 1}",Wegen eines TRIGGER ist die LOCATION_ROUTE zwischen LOCATION_STOP und LOCATION_STOP TRIGGER. Es fahren ersatzweise Busse. (TIME)\n,1,"[(Notarzteinsatzes, 7), (gesperrt, 1)]"
590,"[[, DB, Regio, ], 1, ., Akt, ., #Günzburg, -, #Mindelheim, :]",#Störung,"[an, einem, #Bahnübergang, /, #Schienenersatzverkehr, +, +, +, https://t.co/Ev7U6WwYnI]","{'organization_company': 1, 'number': 1, 'location_stop': 2, 'location_route': 1, 'trigger': 2}",[ORGANIZATION_COMPANY] NUMBER. Akt. LOCATION_STOPLOCATION_ROUTELOCATION_STOP: TRIGGER an einem #Bahnübergang / TRIGGER +++ https://t.co/Ev7U6WwYnI\n,7,"[(#Störung, 7), (#Schienenersatzverkehr, 5)]"


In [20]:
error_analysis.trigger_text_counts_fp(labeled_df=labeled_sd4m_triggers, lf_outputs=L_dev, lf_index=4, label_of_interest=4)

#Bauarbeiten           6
gesperrt               4
unterbrochen           2
Baustelle              1
#Störung               1
Streckensperrung       1
Verkehrsbehinderung    1
Betriebsstörung        1
Wanderbaustelle        1
Bauarbeiten            1
umgeleitet.Mit         1
Technische Störung     1
Sperrung               1
Name: trigger_text, dtype: int64

## Step 5: Train the label model and label the data

In [21]:
from snorkel.labeling import LabelModel

label_model = LabelModel(cardinality=8, verbose=True)
label_model.fit(L_train=L_dev, n_epochs=500, log_freq=100, seed=123)

In [22]:
label_model_acc = label_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

Label Model Accuracy:     82.7%


In [23]:
probs_train = label_model.predict_proba(L=L_dev)

In the proposed workflow, all data points that received no label from any labeling function are filtered out.
We follow this approach, since it does not affect the merging step in our Snorkel processing pipeline.
As a result, some sentences may be missing certain events; we expect, but have not yet verified, that these would be processed as dummy events in the AllenNLP model and factored out during the loss calculation.

In [24]:
from snorkel.labeling import filter_unlabeled_dataframe

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_dev, y=probs_train, L=L_dev
)
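
Conceptually, this filter keeps only the rows for which at least one labeling function did not abstain. A NumPy-only sketch of that idea on toy data (an illustration with made-up values, not Snorkel's own code):

```python
import numpy as np

ABSTAIN = -1

# Toy label matrix (3 triggers, 2 labeling functions) and made-up probabilities.
L_toy = np.array([[-1, -1], [4, -1], [-1, 7]])
probs_toy = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])

keep = (L_toy != ABSTAIN).any(axis=1)  # at least one LF voted on this row
probs_filtered = probs_toy[keep]

print(keep)
print(probs_filtered.shape)
```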

In the Snorkel processing pipeline we would then merge the labeled rows that belong to the same document back together and proceed with labeling the event argument roles.
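
A rough sketch of such a merge, assuming each trigger row carries the document `id` and a `trigger_id`, and that the result should resemble the `event_triggers` lists with `event_type_probs` shown at the end of this notebook. `merge_trigger_rows` and the toy values are hypothetical and only illustrate the grouping step, not the actual `pipeline.merge_event_trigger_examples`.

```python
import numpy as np
import pandas as pd


def merge_trigger_rows(trigger_rows: pd.DataFrame, probs: np.ndarray) -> pd.DataFrame:
    """Group trigger rows by document id and attach each trigger's probabilities."""
    rows = trigger_rows.reset_index(drop=True)  # make index equal to row position
    merged = {}
    for position, row in rows.iterrows():
        merged.setdefault(row['id'], []).append(
            {'id': row['trigger_id'], 'event_type_probs': probs[position].tolist()}
        )
    return pd.DataFrame(
        {'event_triggers': list(merged.values())},
        index=pd.Index(list(merged.keys()), name='id'),
    )


# Toy trigger rows and made-up probabilities (two classes for brevity).
toy_rows = pd.DataFrame({'id': ['doc1', 'doc1', 'doc2'],
                         'trigger_id': ['t1', 't2', 't3']})
toy_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

merged = merge_trigger_rows(toy_rows, toy_probs)
print(merged)
```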

## Step 6: Label the Daystream data with Snorkel

In [25]:
df_train, Y_train = pipeline.build_event_trigger_examples(daystream)
L_train = applier.apply(df_train)

92it [00:00, 919.52it/s]

DataFrame has 1955 rows


1955it [00:04, 459.09it/s]
  0%|          | 0/1845 [00:00<?, ?it/s]

Number of events: 0


100%|██████████| 1845/1845 [02:02<00:00, 15.08it/s]


In [26]:
LFAnalysis(L_train, lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
lf_accident_context,0,"[0, 7]",0.01897,0.002168,0.002168
lf_canceledroute_cat,1,[1],0.310027,0.080759,0.080759
lf_canceledstop_cat,2,[2],0.01355,0.00542,0.00542
lf_delay_cat,3,"[3, 7]",0.225474,0.007046,0.007046
lf_obstruction_cat,4,"[4, 7]",0.161518,0.083469,0.083469
lf_railreplacementservice_cat,5,[5],0.097561,0.002168,0.002168
lf_trafficjam_cat,6,[6],0.028184,0.001084,0.001084
lf_negative,7,[7],0.240108,0.0,0.0


In [27]:
daystream_model = LabelModel(cardinality=8, verbose=True)
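# Note: this fits on L_dev (the SD4M label matrix) with the same seed as in Step 5, not on the Daystream matrix L_train.
daystream_model.fit(L_train=L_dev, n_epochs=500, log_freq=100, seed=123)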

In [28]:
daystream_model_acc = daystream_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {daystream_model_acc * 100:.1f}%")

Label Model Accuracy:     82.7%


In [29]:
daystream_probs = daystream_model.predict_proba(L=L_train)

In [30]:
labeled_daystream = pipeline.merge_event_trigger_examples(df_train, daystream_probs)

In [31]:
labeled_daystream.head()

Unnamed: 0_level_0,event_triggers
id,Unnamed: 1_level_1
1106219278641045504,"[{'id': 'c/c45026ac-2537-4d19-ad8c-010c1587c7bd', 'event_type_probs': [0.07258689342951322, 0.00227831976204088, 0.09915398271049561, 0.6420814982987076, 0.0819887998844427, 0.09880081465756707, 0.0015548456286164183, 0.0015548456286164183]}, {'id': 'c/758086f7-c213-45a5-b9d0-981431fe5df4', 'event_type_probs': [0.05433896182049692, 0.607005981444464, 0.07654494493455964, 0.05015803055773239, 0.1340734475049488, 0.07625635972573566, 0.0008111370060313427, 0.0008111370060313427]}, {'id': 'c/b597d583-7fe2-4616-950b-d32b2b2c435a', 'event_type_probs': [0.08722699671500016, 0.009946107619590352, 0.11415798376947897, 0.08154182228395485, 0.03214040320302889, 0.6665829808289148, 0.004201852790015959, 0.004201852790015959]}]"
1106220052636975105,"[{'id': 'c/e08ede13-b420-4b0e-9b96-f1c71595f631', 'event_type_probs': [0.07258689342951322, 0.00227831976204088, 0.09915398271049561, 0.6420814982987076, 0.0819887998844427, 0.09880081465756707, 0.0015548456286164183, 0.0015548456286164183]}, {'id': 'c/08f09097-94cb-4342-ae4e-cf1cfb94572b', 'event_type_probs': [0.05433896182049692, 0.607005981444464, 0.07654494493455964, 0.05015803055773239, 0.1340734475049488, 0.07625635972573566, 0.0008111370060313427, 0.0008111370060313427]}]"
1106221297904816130,"[{'id': 'c/d3812715-f08f-411e-904a-c69bcafd5b86', 'event_type_probs': [0.08722699671500016, 0.009946107619590352, 0.11415798376947897, 0.08154182228395485, 0.03214040320302889, 0.6665829808289148, 0.004201852790015959, 0.004201852790015959]}]"
1106222498524344320,"[{'id': 'c/9b2d513a-e521-4344-9776-4350235792eb', 'event_type_probs': [0.08480709909559564, 0.028172826623828667, 0.10839924983590002, 0.08209961667539409, 0.04644703443396561, 0.10845480367162497, 0.0035421234553312947, 0.5380772462083598]}]"
1106230914458238976,"[{'id': 'c/d2dff8c6-4e5f-4c30-b216-c4f578fc1390', 'event_type_probs': [0.05433896182049692, 0.607005981444464, 0.07654494493455964, 0.05015803055773239, 0.1340734475049488, 0.07625635972573566, 0.0008111370060313427, 0.0008111370060313427]}]"
