# Transform the data to work with Snorkel: Part 1 - Event Type

Essentially, we need two label models.
One assigns event type labels to triggers, and the other assigns argument role labels within event mentions.

In any case, event type labeling requires one row for each event trigger.

For this we need one additional column:
- trigger_id

We also need one NumPy array containing the:
- event_type

We will probably focus on keyword lists and some heuristics to create our labeling functions.
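
A keyword-based labeling function could look roughly like the following sketch. The keyword list, the `trigger_text` field, and the function name are hypothetical and chosen only for illustration; in Snorkel such a function would additionally be wrapped with the `@labeling_function()` decorator.

```python
ABSTAIN = -1
ACCIDENT = 0

# Hypothetical keyword list, for illustration only.
ACCIDENT_KEYWORDS = {"unfall", "kollision", "zusammenstoß"}


def lf_accident_keywords(example: dict) -> int:
    """Vote Accident if the trigger text contains a keyword, else abstain."""
    trigger = example["trigger_text"].lower()
    if any(keyword in trigger for keyword in ACCIDENT_KEYWORDS):
        return ACCIDENT
    return ABSTAIN


print(lf_accident_keywords({"trigger_text": "Unfall"}))     # 0
print(lf_accident_keywords({"trigger_text": "Baustelle"}))  # -1
```

Returning `ABSTAIN` rather than a negative label keeps the function silent on examples it knows nothing about, so other labeling functions can decide.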

In [None]:
import sys
sys.path.append("../")
import warnings
import pandas as pd
import numpy as np
from wsee.utils import utils
from wsee.data import pipeline

In [None]:
warnings.filterwarnings(action='once')
pd.set_option('display.max_colwidth', None)
DATA_DIR = '/Users/phuc/data/daystream_corpus'  # replace path to corpus

### SD4M Event Types

| Number | Code                   | Description                                                                             |
|--------|------------------------|-----------------------------------------------------------------------------------------|
| -1     | ABSTAIN                | No vote, for Labeling Functions                                                         |
| 0      | Accident               | Collision of a vehicle with another vehicle, person, or obstruction                     |
| 1      | CanceledRoute          | Cancellation of public transport routes                                                 |
| 2      | CanceledStop           | Cancellation of public transport stops                                                  |
| 3      | Delay                  | Delay resulting from remaining traffic disturbances                                     |
| 4      | Obstruction            | Temporary installation to control traffic                                               |
| 5      | RailReplacementService | Replacement of a passenger train by buses or other substitute public transport services |
| 6      | TrafficJam             | Line of stationary or very slow-moving traffic                                          |
| 7      | O                      | No SD4M event.                                                                          |

In [None]:
loaded_data = pipeline.load_data(DATA_DIR)
sd_train = loaded_data['train']
sd_dev = loaded_data['dev']
sd_test = loaded_data['test']

daystream = loaded_data['daystream']

## Step 1: Create one row for every event trigger

We will use the (labeled) SD4M training set as our development data to create our labeling functions.
In this notebook we will run our labeling functions and our LabelModel on that data.
In the real pipeline we will instead label the Daystream data, which has no event type or event argument role labels.
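
The idea of creating one row per trigger can be sketched as follows. The document layout (an `entities` list whose items carry an `entity_type` field) and the helper name `explode_triggers` are assumptions made for illustration; the sketch does not claim to mirror what `pipeline.build_event_trigger_examples` does internally.

```python
from typing import Dict, List


def explode_triggers(documents: List[dict]) -> List[Dict]:
    """Return one row per (document, trigger) pair."""
    rows = []
    for doc in documents:
        for entity in doc["entities"]:
            if entity["entity_type"] != "trigger":
                continue
            # Copy the document fields and add the trigger_id column.
            rows.append({"id": doc["id"], "text": doc["text"], "trigger_id": entity["id"]})
    return rows


# Toy document with two triggers and one non-trigger entity (assumed layout).
docs = [
    {
        "id": "doc-1",
        "text": "Unfall auf der A1, 5 km Stau",
        "entities": [
            {"id": "e1", "entity_type": "trigger"},
            {"id": "e2", "entity_type": "location"},
            {"id": "e3", "entity_type": "trigger"},
        ],
    }
]
rows = explode_triggers(docs)
print([row["trigger_id"] for row in rows])  # ['e1', 'e3']
```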

In [None]:
df_sd_train, Y_sd_train = pipeline.build_event_trigger_examples(sd_train)

We use the (labeled) SD4M development set as our "test set" to measure the performance of our LabelModel.

In [None]:
df_sd_dev, Y_sd_dev = pipeline.build_event_trigger_examples(sd_dev)

In [None]:
from wsee import SD4M_RELATION_TYPES
print(SD4M_RELATION_TYPES)

## Step 2: Explore the data

In [None]:
from wsee.preprocessors.preprocessors import *
from wsee.data import explore, pipeline

We can apply all our preprocessors on our data and see if we can find something interesting for our labeling functions.
Let's first sample the SD4M training data, which is labeled.

In [None]:
labeled_sd4m_triggers = explore.add_labels(df_sd_train, Y_sd_train)
labeled_sd4m_triggers = explore.apply_preprocessors(labeled_sd4m_triggers, [pre_trigger_left_tokens, pre_mixed_ner, pre_trigger_right_tokens])
labeled_sd4m_triggers = explore.add_event_types(labeled_sd4m_triggers)

In [None]:
filtered_sd4m_triggers = labeled_sd4m_triggers[labeled_sd4m_triggers['label'] != 7]
print(f"Number of events: {len(labeled_sd4m_triggers)}\n")
for idx, class_name in enumerate(SD4M_RELATION_TYPES):
    class_sd4m_triggers = labeled_sd4m_triggers[labeled_sd4m_triggers['label'] == idx]
    print(f"{class_name}: {len(class_sd4m_triggers)} instances")

## Step 3: Evaluate the labeling functions on the SD4M training data

In [None]:
from wsee.labeling import event_trigger_lfs as trigger_lfs

In [None]:
from snorkel.labeling import PandasLFApplier

lfs = [
    trigger_lfs.lf_accident_context,
    trigger_lfs.lf_accident_context_street,
    trigger_lfs.lf_accident_context_no_cause_check,
    trigger_lfs.lf_canceledroute_cat,
    trigger_lfs.lf_canceledroute_replicated,
    trigger_lfs.lf_canceledstop_cat,
    trigger_lfs.lf_canceledstop_replicated,
    trigger_lfs.lf_delay_cat,
    trigger_lfs.lf_delay_priorities,
    trigger_lfs.lf_delay_duration,
    trigger_lfs.lf_obstruction_cat,
    trigger_lfs.lf_obstruction_street,
    trigger_lfs.lf_obstruction_priorities,
    trigger_lfs.lf_railreplacementservice_cat,
    trigger_lfs.lf_railreplacementservice_replicated,
    trigger_lfs.lf_trafficjam_cat,
    trigger_lfs.lf_trafficjam_street,
    trigger_lfs.lf_trafficjam_order,
    trigger_lfs.lf_negative,
    trigger_lfs.lf_cause_negative,
    trigger_lfs.lf_obstruction_negative
]

applier = PandasLFApplier(lfs)

In [None]:
L_sd_train = applier.apply(df_sd_train)

In [None]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L_sd_train, lfs).lf_summary(Y_sd_train)

## Step 4: Error Analysis
Now we can inspect the label matrix for errors. We can reuse the DataFrame from the exploration section, which includes the output of the preprocessors.
We can then look specifically for the instances that were labeled incorrectly.
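
Selecting the false positives of a single labeling function amounts to a boolean filter over the label matrix and the gold labels. The sketch below uses plain Python lists and a hypothetical helper name; it is not the implementation of `error_analysis.sample_fp`.

```python
def false_positive_rows(label_matrix, gold, lf_index, label_of_interest):
    """Indices where the LF voted `label_of_interest` but the gold label differs."""
    return [
        row_idx
        for row_idx, (votes, gold_label) in enumerate(zip(label_matrix, gold))
        if votes[lf_index] == label_of_interest and gold_label != label_of_interest
    ]


# Toy label matrix: 4 examples x 2 labeling functions, -1 means ABSTAIN.
L_toy = [[1, -1], [1, 7], [-1, 7], [1, -1]]
Y_toy = [1, 7, 7, 3]
print(false_positive_rows(L_toy, Y_toy, lf_index=0, label_of_interest=1))  # [1, 3]
```

The returned indices can then be used to look up the corresponding rows, including the preprocessor columns, in the exploration DataFrame.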

In [None]:
from wsee.labeling import error_analysis
relevant_cols = ['text','trigger', 'event_types']

In [None]:
labeled_sd4m_triggers.iloc[L_sd_train[:, 3] == 1].sample()[['text', 'trigger', 'label']]

In [None]:
error_analysis.sample_fp(labeled_df=labeled_sd4m_triggers, lf_outputs=L_sd_train, lf_index=3, label_of_interest=1, sample_size=1)[relevant_cols]

In [None]:
error_analysis.sample_abstained_instances(labeled_df=labeled_sd4m_triggers, lf_outputs=L_sd_train, lf_index=10, label_of_interest=4, sample_size=1)[relevant_cols]

## Step 5: Train the label model and label the data

In [None]:
from snorkel.labeling import LabelModel
from snorkel.labeling import filter_unlabeled_dataframe

In [None]:
df_daystream, Y_daystream = pipeline.build_event_trigger_examples(daystream)
L_daystream = applier.apply(df_daystream)

In [None]:
LFAnalysis(L_daystream, lfs).lf_summary()

In [None]:
daystream_model = LabelModel(cardinality=8, verbose=True)
daystream_model.fit(L_train=L_daystream, n_epochs=5000, log_freq=500, seed=12345, Y_dev=Y_sd_train)

In [None]:
daystream_model_acc = daystream_model.score(L=L_sd_train, Y=Y_sd_train, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {daystream_model_acc * 100:.1f}%")

In [None]:
daystream_probs = daystream_model.predict_proba(L=L_daystream)

In the proposed workflow one would filter out all the data points that were not labeled by any of the labeling functions.
This case does not occur here because we use a negative labeling function that outputs the negative trigger label when all the other labeling functions abstain.
If that were not the case, we would instead multiply the probabilities of the abstained examples by zero so that they look like padding instances when fed into the end model.
We propose this workaround because examples that are filtered out here are treated as negative examples by default in the end model.
We also cannot afford to filter out a whole document just because one trigger/role example was not labeled.
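
The workaround described above could be sketched like this. The helper name `zero_out_unlabeled` is hypothetical, and plain Python lists are used to keep the sketch free of dependencies; with NumPy arrays the same effect should be achievable with a single broadcast multiplication by a row mask.

```python
ABSTAIN = -1


def zero_out_unlabeled(label_matrix, probs):
    """Multiply each probability row by 0 if every LF abstained, else by 1."""
    masked = []
    for votes, row in zip(label_matrix, probs):
        labeled = any(vote != ABSTAIN for vote in votes)
        masked.append([p * float(labeled) for p in row])
    return masked


# Toy example: the first row received no votes, the second one did.
L_toy = [[-1, -1], [0, -1]]
probs_toy = [[0.5, 0.5], [0.9, 0.1]]
print(zero_out_unlabeled(L_toy, probs_toy))  # [[0.0, 0.0], [0.9, 0.1]]
```

Unlike filtering, this keeps the number of rows unchanged, so the probabilities can still be merged back into their documents one-to-one.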

In [None]:
labeled_daystream = pipeline.merge_event_trigger_examples(df_daystream, daystream_probs)

In [None]:
labeled_daystream.reset_index(level=0).to_json(DATA_DIR + "/save_daystreamv6_triggers.jsonl", orient='records', lines=True, force_ascii=False)

## Step 6: Check Daystream Labeling

To inspect the Daystream labeling, it is best to remove the abstains first.

In [None]:
from snorkel.labeling import filter_unlabeled_dataframe

df_daystream_filtered, probs_daystream_filtered = filter_unlabeled_dataframe(
    X=df_daystream, y=daystream_probs, L=L_daystream
)

In [None]:
df_daystream_filtered['trigger_probs'] = list(probs_daystream_filtered)
df_daystream_filtered['most_probable_class'] = [SD4M_RELATION_TYPES[label_idx] for label_idx in probs_daystream_filtered.argmax(axis=1)]
df_daystream_filtered['max_class_prob'] = ["{:.2f}".format(class_prob) for class_prob in probs_daystream_filtered.max(axis=1)]

In [None]:
for trigger_class in SD4M_RELATION_TYPES:
    print(f"{trigger_class}: {len(df_daystream_filtered[df_daystream_filtered['most_probable_class'] == trigger_class])} instances")

Code to display all rows of the DataFrame:
```python
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(df_daystream_filtered[df_daystream_filtered['most_probable_class'] == 'O'][['text', 'trigger', 'most_probable_class', 'max_class_prob', 'trigger_probs']])
```

In [None]:
df_daystream_filtered[df_daystream_filtered['most_probable_class'] == 'CanceledRoute'].sample(1)[['text', 'trigger', 'most_probable_class', 'max_class_prob', 'trigger_probs']]