# Transform the data to work with Snorkel: Part 1 - Event Type

We need two label models: one assigns event types to triggers, the other assigns argument roles within event mentions.
This part covers the event type model.

Event type labeling needs one row per event (trigger).

Each row needs one additional column:
- trigger_id

The gold labels are kept separately in a NumPy array containing the:
- event_type

We will probably focus on keyword lists and some heuristics to create our labeling functions.
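
As a rough illustration of the keyword-list idea, the sketch below shows a plain Python function that votes for one event type when a keyword appears in the trigger text and abstains otherwise. The column name `trigger_text`, the keyword list, and the label index `TRAFFIC_JAM = 5` are assumptions made for this example and are not taken from the `wsee` code; only the abstain value `-1` follows the Snorkel convention. In Snorkel such a function would typically be wrapped with the `labeling_function` decorator from `snorkel.labeling`.

```python
ABSTAIN = -1      # Snorkel's convention for "no vote"
TRAFFIC_JAM = 5   # hypothetical label index, for illustration only

# Hypothetical keyword list; the real lists live in the project's labeling module.
TRAFFIC_JAM_KEYWORDS = ("stau", "stockender verkehr")


def keyword_trafficjam(row):
    """Vote TRAFFIC_JAM if the trigger text contains a keyword, else abstain."""
    trigger_text = row["trigger_text"].lower()  # "trigger_text" is an assumed column
    if any(keyword in trigger_text for keyword in TRAFFIC_JAM_KEYWORDS):
        return TRAFFIC_JAM
    return ABSTAIN


print(keyword_trafficjam({"trigger_text": "#Stau"}))        # keyword match
print(keyword_trafficjam({"trigger_text": "Bauarbeiten"}))  # no match, abstains
```

Abstaining instead of guessing keeps each function precise; the label model later combines the votes of many such narrow functions.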

In [None]:
import sys
sys.path.append("../")
from wsee.utils import utils
from wsee.data.pipeline import load_data, build_event_trigger_examples

DATA_DIR = '/Users/phuc/data/snorkel-daystreamv5'  # replace with the path to your corpus

In [None]:
loaded_data = load_data(DATA_DIR)
sd_train = loaded_data['train']
sd_dev = loaded_data['dev']
sd_test = loaded_data['test']

daystream = loaded_data['daystream']

In [None]:
sd_train.head()

Example record from a `.jsonl` file (pretty-printed and abbreviated here):
```json
{
  "id": "754201930264633344",
  "text": "■ #A1 #Bremen Richtung #Hamburg zwischen Horster Dreieck und #Stillhorn 9 km #Stau.  Dort ist wegen #Bauarbeiten nur eine Spur frei.\n",
  "entities": [
    {
      "id": "c/82bf4c32-861d-4e09-b8d1-bf7adc488f2b",
      "text": "#A1",
      "entity_type": "location_street",
      "start": 1,
      "end": 2,
      "char_start": 2,
      "char_end": 5
    },
    ...
  ],
  "event_triggers": [
    {
      "id": "c/3958da47-7b47-414f-8210-5b2c487de9df",
      "event_type_probs": [ 0.0, ..., 1.0, 0.0 ]
    }
  ],
  "event_roles": [
    {
      "trigger": "c/3958da47-7b47-414f-8210-5b2c487de9df",
      "argument": "c/82bf4c32-861d-4e09-b8d1-bf7adc488f2b",
      "event_argument_probs": [ 1.0, 0.0, ..., 0.0 ]
    },
    ...
  ]
}
```
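
To make the target layout concrete, here is a minimal sketch of how one record of this shape could be expanded into one row per event trigger. It is not the actual `build_event_trigger_examples` implementation; the function name `trigger_rows`, the toy document, and the choice to take the label as the index of the largest entry in `event_type_probs` are assumptions for illustration.

```python
def trigger_rows(document):
    """Return one (row, label) pair per event trigger in a document dict."""
    pairs = []
    for trigger in document["event_triggers"]:
        probs = trigger["event_type_probs"]
        # Assumed labeling rule: the most probable event type wins.
        label = max(range(len(probs)), key=probs.__getitem__)
        row = {
            "id": document["id"],
            "text": document["text"],
            "trigger_id": trigger["id"],  # the one additional column
        }
        pairs.append((row, label))
    return pairs


# Toy document with made-up ids and only three event types.
toy_document = {
    "id": "doc-1",
    "text": "toy text",
    "event_triggers": [
        {"id": "t-1", "event_type_probs": [0.0, 1.0, 0.0]},
        {"id": "t-2", "event_type_probs": [0.0, 0.0, 1.0]},
    ],
}

for row, label in trigger_rows(toy_document):
    print(row["trigger_id"], label)
```

A document with two triggers yields two rows that share the same text but differ in `trigger_id`, which is what lets a labeling function look at one trigger at a time.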

## Step 1: Create one row for every event trigger

In [None]:
event_type_rows, event_type_rows_y = build_event_trigger_examples(sd_train)

In [None]:
event_type_rows_y.shape

In [None]:
from wsee import SD4M_RELATION_TYPES
print(SD4M_RELATION_TYPES)

In [None]:
from wsee.labeling.event_trigger_lfs import (
    lf_accident_cat,
    lf_canceledroute_cat,
    lf_delay_cat,
    lf_obstruction_cat,
    lf_railreplacementservice_cat,
    lf_trafficjam_cat,
)

In [None]:
from snorkel.labeling import PandasLFApplier

lfs = [
    lf_accident_cat,
    lf_canceledroute_cat,
    # lf_canceledstop_cat
    lf_delay_cat,
    lf_obstruction_cat,
    lf_railreplacementservice_cat,
    lf_trafficjam_cat
]

applier = PandasLFApplier(lfs)
# Note: despite its name, L_valid holds the LF votes for the rows built from sd_train.
L_valid = applier.apply(event_type_rows)

In [None]:
from snorkel.labeling import LFAnalysis

# Gold labels for the same rows; passing them lets the summary report empirical accuracy.
Y_valid = event_type_rows_y
LFAnalysis(L_valid, lfs).lf_summary(Y_valid)
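
To help read the summary, the sketch below recomputes by hand two of the quantities it reports for a single labeling function: coverage (the share of rows where the function did not abstain) and empirical accuracy (the share of non-abstaining votes that match the gold label). It assumes integer gold labels and uses made-up votes; the helper name `lf_coverage_and_accuracy` is hypothetical and not part of Snorkel.

```python
ABSTAIN = -1  # Snorkel's convention for "no vote"


def lf_coverage_and_accuracy(votes, gold):
    """Return (coverage, accuracy) for one labeling function's votes."""
    voted = [(vote, label) for vote, label in zip(votes, gold) if vote != ABSTAIN]
    coverage = len(voted) / len(votes)
    # Accuracy is only defined over the rows where the function voted.
    accuracy = sum(vote == label for vote, label in voted) / len(voted) if voted else None
    return coverage, accuracy


# Made-up votes of one labeling function and the matching gold labels.
votes = [5, ABSTAIN, 5, ABSTAIN]
gold = [5, 2, 3, 5]

coverage, accuracy = lf_coverage_and_accuracy(votes, gold)
print(coverage, accuracy)
```

A function with low coverage but high accuracy is usually still worth keeping, since other functions can cover the remaining rows.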