# Transform the data to work with Snorkel: Part 1 - Event Type

We need two label models: one assigns event types to triggers, the other assigns argument roles within event mentions.
This notebook covers the first.

For event type labeling we need one row per event trigger.

This requires one additional column:
- trigger_id

In addition, we need one numpy array containing the:
- event_type

We will probably focus on keyword lists and some heuristics to create our labeling functions.
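
As a minimal sketch of what a keyword-based labeling function could look like, the plain-Python function below votes for a class when the trigger text matches a keyword list and abstains otherwise. The keyword lists and the name `lf_keyword` are hypothetical and only serve as illustration; in Snorkel such a function would additionally be wrapped with the `labeling_function` decorator.

```python
ABSTAIN = -1
ACCIDENT = 0
TRAFFIC_JAM = 6

# Hypothetical keyword lists; the real lists would come from exploring the data.
KEYWORDS = {
    ACCIDENT: {"unfall", "kollision"},
    TRAFFIC_JAM: {"stau", "stockender verkehr"},
}


def lf_keyword(trigger_text: str) -> int:
    """Vote for the class whose keyword list contains the trigger, else abstain."""
    normalized = trigger_text.strip().lower()
    for label, keywords in KEYWORDS.items():
        if normalized in keywords:
            return label
    return ABSTAIN
```

Abstaining instead of guessing keeps each function precise and leaves the remaining cases to other labeling functions.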

In [None]:
import sys
sys.path.append("../")
from wsee.utils import utils
from wsee.data import pipeline

DATA_DIR = '/Users/phuc/data/snorkel-daystreamv5'  # replace with the path to your corpus

### SD4M Event Types

| Number | Code                   | Description                                                                             |
|--------|------------------------|-----------------------------------------------------------------------------------------|
| -1     | ABSTAIN                | No vote, for Labeling Functions                                                         |
| 0      | Accident               | Collision of a vehicle with another vehicle, person, or obstruction                     |
| 1      | CanceledRoute          | Cancellation of public transport routes                                                 |
| 2      | CanceledStop           | Cancellation of public transport stops                                                  |
| 3      | Delay                  | Delay resulting from remaining traffic disturbances                                     |
| 4      | Obstruction            | Temporary installation to control traffic                                               |
| 5      | RailReplacementService | Replacement of a passenger train by buses or other substitute public transport services |
| 6      | TrafficJam             | Line of stationary or very slow-moving traffic                                          |
| 7      | O                      | No SD4M event.                                                                          |
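
The table above can be mirrored in code so that labels are referenced by name rather than by number. This is only a sketch: the `EventType` enum and `CARDINALITY` are hypothetical names and not part of the `wsee` package.

```python
from enum import IntEnum


class EventType(IntEnum):
    """Numeric codes of the SD4M event types, as listed in the table."""
    ABSTAIN = -1
    Accident = 0
    CanceledRoute = 1
    CanceledStop = 2
    Delay = 3
    Obstruction = 4
    RailReplacementService = 5
    TrafficJam = 6
    O = 7


# Number of real classes (everything except ABSTAIN).
CARDINALITY = len([t for t in EventType if t >= 0])
```

With eight non-abstain classes, `CARDINALITY` agrees with the `cardinality=8` passed to the `LabelModel` later in this notebook.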

In [None]:
loaded_data = pipeline.load_data(DATA_DIR)
sd_train = loaded_data['train']
sd_dev = loaded_data['dev']
sd_test = loaded_data['test']

daystream = loaded_data['daystream']

In [None]:
sd_train.head()

## Step 1: Create one row for every event trigger

We will use the (labeled) SD4M training set as our development data to create our labeling functions.
In this notebook we will run our labeling functions and our LabelModel on that data.
In the real pipeline we will instead label the Daystream data, which has neither event type nor event argument role labels.

In [None]:
SAMPLE = False

In [None]:
if SAMPLE:
    df_dev, Y_dev = pipeline.build_event_trigger_examples(sd_train.sample(n=200, random_state=42))
else:
    df_dev, Y_dev = pipeline.build_event_trigger_examples(sd_train)
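
The implementation of `build_event_trigger_examples` is not shown in this notebook. As a rough sketch of the underlying idea (one row per trigger plus a parallel list of event type labels), assuming a hypothetical document layout with `id`, `text` and `triggers` keys and made-up toy documents:

```python
from typing import List, Tuple


def build_trigger_rows(documents: List[dict]) -> Tuple[List[dict], List[int]]:
    """Create one row per event trigger and collect the event type labels."""
    rows, labels = [], []
    for doc in documents:
        for trigger in doc["triggers"]:
            rows.append(
                {"doc_id": doc["id"], "text": doc["text"], "trigger_id": trigger["id"]}
            )
            labels.append(trigger["event_type"])
    return rows, labels


# Toy documents, made up for illustration.
docs = [
    {
        "id": "d1",
        "text": "Stau auf der A7, 20 Minuten Verspätung",
        "triggers": [{"id": "t1", "event_type": 6}, {"id": "t2", "event_type": 3}],
    },
    {"id": "d2", "text": "Unfall auf der B1", "triggers": [{"id": "t3", "event_type": 0}]},
]
rows, labels = build_trigger_rows(docs)
```

A document with two triggers yields two rows that share the same text but differ in `trigger_id`.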

We use the (labeled) SD4M development set as our "test set" to measure the performance of our LabelModel.

In [None]:
if SAMPLE:
    df_test, Y_test = pipeline.build_event_trigger_examples(sd_dev.sample(n=100, random_state=42))
else:
    df_test, Y_test = pipeline.build_event_trigger_examples(sd_dev)

In [None]:
from wsee import SD4M_RELATION_TYPES
print(SD4M_RELATION_TYPES)

## Step 2: Explore the data

In [None]:
from wsee.preprocessors.preprocessors import *
from wsee.data import explore, pipeline

We can apply all our preprocessors on our data and see if we can find something interesting for our labeling functions.
Let's first sample the SD4M training data, which is labeled.

In [None]:
labeled_sd4m_triggers = explore.add_labels(df_dev, Y_dev)

In [None]:
labeled_sd4m_triggers = explore.apply_preprocessors(
    labeled_sd4m_triggers,
    [
        get_trigger,
        get_trigger_text,
        get_trigger_left_tokens,
        get_trigger_right_tokens,
        get_entity_type_freqs,
        get_mixed_ner,
    ],
)

In [None]:
labeled_sd4m_triggers = explore.add_event_types(labeled_sd4m_triggers)

Let's first take a look at the trigger text.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # show full column contents; None replaces the deprecated -1

In [None]:
# Sample ten triggers with label 4 (Obstruction) together with their context
labeled_sd4m_triggers[labeled_sd4m_triggers['label'] == 4].sample(10)[
    ['trigger_left_tokens', 'trigger_text', 'trigger_right_tokens',
     'entity_type_freqs', 'mixed_ner', 'label', 'event_types']
]

In [None]:
labeled_sd4m_triggers[labeled_sd4m_triggers['label'] == 4]['trigger_text'].value_counts()

Now we can collect the trigger words per class.

## Step 3: Evaluate the labeling functions on the SD4M training data

In [None]:
from wsee.labeling.event_trigger_lfs import *

In [None]:
from snorkel.labeling import PandasLFApplier

lfs = [
    lf_accident_context,
    lf_canceledroute_cat,
    lf_canceledstop_cat,
    lf_delay_cat,
    lf_obstruction_cat,
    lf_railreplacementservice_cat,
    lf_trafficjam_cat,
    lf_negative
]

applier = PandasLFApplier(lfs)

In [None]:
L_dev = applier.apply(df_dev)
L_test = applier.apply(df_test)

In [None]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L_dev, lfs).lf_summary(Y_dev)

In [None]:
LFAnalysis(L_test, lfs).lf_summary(Y_test)
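
To make two of the summary columns concrete, the sketch below computes coverage and empirical accuracy for a single labeling function on a toy label matrix. It is a plain-Python illustration of the definitions, not Snorkel's implementation; the function names and the toy data are assumptions.

```python
ABSTAIN = -1


def coverage(L, j):
    """Fraction of rows on which labeling function j does not abstain."""
    return sum(row[j] != ABSTAIN for row in L) / len(L)


def empirical_accuracy(L, Y, j):
    """Accuracy of labeling function j on the rows where it votes."""
    voted = [(row[j], y) for row, y in zip(L, Y) if row[j] != ABSTAIN]
    if not voted:
        return None  # undefined if the function never votes
    return sum(vote == y for vote, y in voted) / len(voted)


# Toy label matrix: rows are triggers, columns are labeling functions.
L_toy = [[0, -1], [0, 6], [-1, 6], [-1, -1]]
Y_toy = [0, 6, 6, 7]
```

Here the first function votes on half of the rows and is right on half of its votes, while the second also covers half of the rows but is always right.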

## Step 4: Error Analysis
Now we can inspect the label matrix for errors. For this we need the DataFrame from the exploration section, which includes the preprocessor outputs.
We can then look specifically at the instances that were labeled incorrectly.

We will first look at the false positives of individual labeling functions, starting with the one for traffic jams (index 6 in `lfs`):

In [None]:
from wsee.labeling import error_analysis
relevant_cols = ['trigger_left_tokens','trigger_text','trigger_right_tokens','entity_type_freqs','mixed_ner','label', 'event_types']

In [None]:
error_analysis.sample_fp(labeled_df=labeled_sd4m_triggers, lf_outputs=L_dev, lf_index=6, label_of_interest=6)[relevant_cols]

In [None]:
error_analysis.trigger_text_counts_fp(labeled_df=labeled_sd4m_triggers, lf_outputs=L_dev, lf_index=4, label_of_interest=4)

In [None]:
error_analysis.sample_abstained_instances(labeled_df=labeled_sd4m_triggers, lf_outputs=L_dev, lf_index=7, label_of_interest=7)[relevant_cols]
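
The helpers in `wsee.labeling.error_analysis` are not shown here. As a rough sketch of the selections they presumably perform, the functions below pick false positives (the function voted for a label that is not the gold label) and missed instances (the function abstained although the gold label matches). The function names and the toy data are assumptions.

```python
ABSTAIN = -1


def false_positive_indices(L, Y, lf_index, label):
    """Rows where the labeling function voted `label` but the gold label differs."""
    return [
        i for i, (row, y) in enumerate(zip(L, Y))
        if row[lf_index] == label and y != label
    ]


def abstained_indices(L, Y, lf_index, label):
    """Rows with gold label `label` on which the labeling function abstained."""
    return [
        i for i, (row, y) in enumerate(zip(L, Y))
        if row[lf_index] == ABSTAIN and y == label
    ]


# Toy label matrix with a single labeling function that votes for class 6.
L_toy = [[6], [6], [-1], [-1]]
Y_toy = [6, 3, 6, 7]
```

The resulting indices can then be used to select the matching rows of the exploration DataFrame, for example with `DataFrame.iloc`.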

## Step 5: Train the label model and label the data

In [None]:
from snorkel.labeling import LabelModel

label_model = LabelModel(cardinality=8, verbose=True)
label_model.fit(L_train=L_dev, n_epochs=500, log_freq=100, seed=123)

In [None]:
label_model_acc = label_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")
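
The `tie_break_policy` argument decides what happens when several classes share the highest probability. As a rough, library-independent sketch (not Snorkel's implementation), choosing among tied classes and computing accuracy could look like this; `probs_to_pred` and `accuracy` are hypothetical helper names and the probabilities are toy values.

```python
import random
from typing import List, Sequence


def probs_to_pred(probs: Sequence[float], rng: random.Random, tol: float = 1e-5) -> int:
    """Return the most probable class; choose among tied classes with rng."""
    best = max(probs)
    tied = [k for k, p in enumerate(probs) if best - p <= tol]
    return tied[0] if len(tied) == 1 else rng.choice(tied)


def accuracy(preds: List[int], gold: List[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


rng = random.Random(123)  # seeded so that the tie-breaking is reproducible
toy_probs = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.5, 0.5, 0.0]]
preds = [probs_to_pred(p, rng) for p in toy_probs]
```

Only the last toy row is tied, so it is the only prediction that depends on the random choice.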

In [None]:
probs_train = label_model.predict_proba(L=L_dev)

In the proposed workflow one would filter out all data points on which every labeling function abstained.
We follow this approach because it does not affect the merging step in our Snorkel processing pipeline.
Some sentences may lose events as a result; presumably these would be treated as dummy events in the AllenNLP model and factored out during the loss calculation, but this still needs to be verified.

In [None]:
from snorkel.labeling import filter_unlabeled_dataframe

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_dev, y=probs_train, L=L_dev
)
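
Conceptually, the filtering step keeps a row only if at least one labeling function voted on it. A minimal plain-Python sketch of that idea, with a hypothetical function name and toy data:

```python
ABSTAIN = -1


def filter_unlabeled(rows, probs, L):
    """Drop rows (and their probabilities) on which every labeling function abstained."""
    keep = [any(vote != ABSTAIN for vote in votes) for votes in L]
    kept_rows = [r for r, k in zip(rows, keep) if k]
    kept_probs = [p for p, k in zip(probs, keep) if k]
    return kept_rows, kept_probs


# Toy data: the second row received no votes at all.
toy_rows = ["trigger a", "trigger b", "trigger c"]
toy_probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
toy_L = [[0, -1], [-1, -1], [-1, 1]]
kept_rows, kept_probs = filter_unlabeled(toy_rows, toy_probs, toy_L)
```

Rows without any vote only carry the prior of the label model, which is why they add little training signal.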

In the Snorkel processing pipeline we would merge the labeled rows that belong to the same document back together and then proceed with labeling the event argument roles.

In [None]:
import pickle

labeled_sd_train = pipeline.merge_event_trigger_examples(df_dev, probs_train)
# Use a context manager so the file is closed after writing; replace the path as needed
with open("/Users/phuc/develop/python/wsee/data/save_sd_triggers.p", "wb") as f:
    pickle.dump(labeled_sd_train, f)

## Step 6: Label the Daystream data with Snorkel

In [None]:
df_train, Y_train = pipeline.build_event_trigger_examples(daystream)
L_train = applier.apply(df_train)

In [None]:
LFAnalysis(L_train, lfs).lf_summary()
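
Without gold labels the summary is limited to statistics such as coverage, overlaps and conflicts. The sketch below spells out the last two for a single labeling function on a toy label matrix; it illustrates the definitions in plain Python and is not Snorkel's implementation, and the function names are assumptions.

```python
ABSTAIN = -1


def overlaps(L, j):
    """Fraction of rows where function j votes and at least one other function votes too."""
    hits = sum(
        1 for row in L
        if row[j] != ABSTAIN
        and any(v != ABSTAIN for k, v in enumerate(row) if k != j)
    )
    return hits / len(L)


def conflicts(L, j):
    """Fraction of rows where function j votes and another function votes differently."""
    hits = sum(
        1 for row in L
        if row[j] != ABSTAIN
        and any(v != ABSTAIN and v != row[j] for k, v in enumerate(row) if k != j)
    )
    return hits / len(L)


# Toy label matrix: three labeling functions, the third never votes.
L_toy = [[3, 3, -1], [3, 6, -1], [-1, 6, -1], [3, -1, -1]]
```

A high conflict rate on unlabeled data is a hint that two keyword lists overlap and should be checked.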

In [None]:
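daystream_model = LabelModel(cardinality=8, verbose=True)
# NOTE: this fits on L_dev (SD4M training data) again; to fit the label model on the
# Daystream label matrix instead, pass L_train=L_train.
daystream_model.fit(L_train=L_dev, n_epochs=500, log_freq=100, seed=123)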

In [None]:
daystream_model_acc = daystream_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {daystream_model_acc * 100:.1f}%")

In [None]:
daystream_probs = daystream_model.predict_proba(L=L_train)

In [None]:
labeled_daystream = pipeline.merge_event_trigger_examples(df_train, daystream_probs)
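
The implementation of `merge_event_trigger_examples` is not shown in this notebook. As a rough sketch of the idea of merging per-trigger rows back into one record per document, assuming hypothetical `doc_id` and `trigger_id` keys and toy probabilities:

```python
from collections import defaultdict


def merge_trigger_rows(rows, probs):
    """Group per-trigger rows by document and attach the event type probabilities."""
    merged = defaultdict(list)
    for row, p in zip(rows, probs):
        merged[row["doc_id"]].append(
            {"trigger_id": row["trigger_id"], "event_type_probs": list(p)}
        )
    return dict(merged)


# Toy rows with two classes only, made up for illustration.
toy_rows = [
    {"doc_id": "d1", "trigger_id": "t1"},
    {"doc_id": "d1", "trigger_id": "t2"},
    {"doc_id": "d2", "trigger_id": "t3"},
]
toy_probs = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]
merged = merge_trigger_rows(toy_rows, toy_probs)
```

Keeping the full probability vector per trigger, rather than a hard label, preserves the uncertainty of the label model for later training.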

In [None]:
labeled_daystream.head()

In [None]:
import pickle

# Use a context manager so the file is closed after writing; replace the path as needed
with open("/Users/phuc/develop/python/wsee/data/save_triggers.p", "wb") as f:
    pickle.dump(labeled_daystream, f)