# Data preparation
For our experiments, we prepared the data as follows.

1. We convert the corpus data from avro to jsonl using wsee/data/avro_to_jsonl.py & wsee/data/convert.py. The former converts n-ary relations into ACE style events. The latter converts the n-ary relations into events in a specialized Snorkel format, i.e. it turns string labels into class probabilities, while isolating the Daystream data from the corpus data.
2. We develop labeling functions, learn label models with Snorkel and probabilistically label the Daystream data using wsee/data/pipeline.py. We do this 5 times with different seeds in order to perform random repeats.
3. We create progressively bigger subsets (50%, 60%, 70%, 80%, 90%, 100%) from the probabilistically labeled data to examine whether more training data created using weak supervision improves model performance.
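
The conversion of string labels into class probabilities mentioned in step 1 can be illustrated with a minimal sketch. The label names and the helper `label_to_probabilities` are hypothetical and not taken from wsee/data/convert.py; the sketch only assumes that a gold string label becomes a one-hot distribution over the classes.

```python
from typing import List, Sequence

# Hypothetical label inventory, for illustration only.
EVENT_TYPES = ["Accident", "TrafficJam", "NoEvent"]


def label_to_probabilities(label: str, classes: Sequence[str] = EVENT_TYPES) -> List[float]:
    """Return a one-hot probability vector for a string label (hypothetical helper)."""
    if label not in classes:
        raise ValueError(f"Unknown label: {label!r}")
    # The known class gets probability 1.0, all other classes 0.0
    return [1.0 if cls == label else 0.0 for cls in classes]


print(label_to_probabilities("TrafficJam"))  # [0.0, 1.0, 0.0]
```

Representing gold labels this way puts them in the same shape as the probabilistic labels produced by a label model, so both can be merged into one training set.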

In order to perform the next steps, we need to download the corpus data:
https://cloud.dfki.de/owncloud/index.php/s/wSNN78s4Ck7omXm
with

```bash
wget -O ../data/daystream_corpus.zip --content-disposition https://cloud.dfki.de/owncloud/index.php/s/L5igzCiLNxnM3HD/download
unzip ../data/daystream_corpus.zip
```

Be sure to put the daystream_corpus into the data directory.

In [None]:
import sys
sys.path.append("../")
from pathlib import Path
from wsee.data import avro_to_jsonl, convert, pipeline
import pandas as pd

In [None]:
input_path = Path("../data/daystream_corpus")

## Avro to jsonl

### Conversion from avro to ACE style jsonl

In [None]:
avro_to_jsonl.convert_avros(input_path)

### Conversion from avro to Snorkel style jsonl

In [None]:
convert.convert_avros(input_path)

## Snorkel labeling
This probabilistically labels the Daystream data using our labeling functions & learned Snorkel label models.
It creates a merged version of the SD4M gold train data and the probabilistically labeled Daystream data.

In [None]:
save_path = Path("../data/daystream_corpus")
seed = 12345
pipeline.create_train_datasets(input_path, save_path, seed)

## Random repeat variants
This performs the Snorkel labeling steps `random_repeats` times with different seeds.

In [None]:
save_path = Path("../data/daystream_corpus")
random_repeats = 5
pipeline.create_random_repeats_train_datasets(input_path, save_path, random_repeats)
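
How the seeds for the individual repeats are chosen is internal to `pipeline.create_random_repeats_train_datasets` and is not shown in this notebook. As a rough sketch of the general idea, assuming we want a distinct but reproducible seed per run, a single base seed could drive the generation. The helper `make_run_seeds` and the base seed value are illustrative assumptions, not the pipeline's actual behavior.

```python
import random
from typing import List


def make_run_seeds(base_seed: int, random_repeats: int) -> List[int]:
    """Derive one distinct seed per repeat from a single base seed (hypothetical helper)."""
    rng = random.Random(base_seed)
    # sample() draws without replacement, so the seeds are guaranteed to differ
    return rng.sample(range(1, 100_000), k=random_repeats)


seeds = make_run_seeds(base_seed=12345, random_repeats=5)
for run, seed in enumerate(seeds, start=1):
    print(f"run_{run}: seed={seed}")
```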

## Daystream subsets

In [None]:
from wsee.utils import corpus_statistics

In [None]:
import pandas as pd
daystream_snorkeled_path = Path("../data/daystream_corpus/daystream_snorkeled.jsonl") 
output_path = Path("../data/daystream_corpus/")
daystream_snorkeled = pd.read_json(daystream_snorkeled_path, lines=True, encoding='utf8') 
sample_statistics = []
for percentage in range(50, 101, 10):
    row = {'sample_fraction': percentage}
    sample = daystream_snorkeled.sample(frac=percentage/100)
    print(f'{percentage}% sample statistics')
    row.update(corpus_statistics.get_snorkel_event_stats(sample))
    sample_statistics.append(row)
    sample.to_json(output_path.joinpath(f"daystream{percentage}_snorkeled.jsonl"), orient='records', lines=True, force_ascii=False)
sample_statistics = pd.DataFrame(sample_statistics)
sample_statistics.to_json(output_path.joinpath('sample_statistics.jsonl'), orient='records', lines=True, force_ascii=False)
sample_statistics
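
Note that `DataFrame.sample` draws a fresh random subset on every call unless `random_state` is set, so the subsets created above differ between notebook runs and are not guaranteed to be nested. If reproducible subsets are wanted, a fixed `random_state` can be passed. The toy frame and the seed value below are placeholders, not part of our data.

```python
import pandas as pd

# Toy stand-in for the probabilistically labeled documents
docs = pd.DataFrame({"id": [f"doc_{i}" for i in range(10)]})

seed = 12345  # placeholder value
for percentage in range(50, 101, 10):
    first = docs.sample(frac=percentage / 100, random_state=seed)
    second = docs.sample(frac=percentage / 100, random_state=seed)
    # Same seed and fraction yield the same rows in the same order
    assert first.equals(second)
    print(percentage, len(first))
```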

ALTERNATIVE: To reduce the chance that some samples contain disproportionately many events per document, we can instead create subsets for each of the random repeats and use the mean/median & standard deviation during the evaluation.

This variant samples from each random repeat, which introduces even more randomness via the seeds for the label models & eventx model in addition to the sample randomness. This did not work out well in past experiments.
```python
import pandas as pd
input_path = Path("../data/daystream_corpus")
random_repeats = 5
for run in range(random_repeats):
    run_path = input_path.joinpath(f"run_{run+1}")
    daystream_snorkeled = pd.read_json(run_path.joinpath('daystream_snorkeled.jsonl'), lines=True, encoding='utf8') 
    for percentage in range(50, 101, 10):
        sample = daystream_snorkeled.sample(frac=percentage/100)
        sample.to_json(run_path.joinpath(f"daystream{percentage}_snorkeled.jsonl"), orient='records', lines=True, force_ascii=False)
```

In [None]:
import pandas as pd
import os
input_path = Path("../data/daystream_corpus")
daystream_snorkeled_path = Path("../data/daystream_corpus/daystream_snorkeled.jsonl") 
daystream_snorkeled = pd.read_json(daystream_snorkeled_path, lines=True, encoding='utf8') 
sample_repeats = 5
for run in range(sample_repeats):
    run_path = input_path.joinpath(f"samples_{run+1}")
    for percentage in range(50, 101, 10):
        sample = daystream_snorkeled.sample(frac=percentage/100)
        sample_path = run_path.joinpath(f"daystream{percentage}_snorkeled.jsonl")
        os.makedirs(os.path.dirname(sample_path), exist_ok=True)
        sample.to_json(sample_path, orient='records', lines=True, force_ascii=False)
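
For the evaluation across repeats mentioned above, the scores per subset size can be aggregated by mean, median and standard deviation. The sketch below uses made-up placeholder scores purely to show the aggregation; they are not results from our experiments, and the dictionary layout is an assumption.

```python
from statistics import mean, median, stdev

# Placeholder scores (NOT experimental results): subset percentage -> one score per repeat
scores = {
    50: [0.1, 0.2, 0.3],
    100: [0.2, 0.3, 0.4],
}

# Aggregate the per-repeat scores for every subset size
summary = {
    percentage: {"mean": mean(values), "median": median(values), "std": stdev(values)}
    for percentage, values in scores.items()
}
for percentage, stats in summary.items():
    print(percentage, stats)
```

`stdev` computes the sample standard deviation, which is the usual choice for a small number of repeats.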

### SD4M Train Sample for experiments
We further draw a sample from the gold SD4M train set for our experiments.

We also count all event triggers and compare the count to the number of event triggers in the data.
We expect the latter to be lower as we converted n-ary relations into events, which excludes triggers with no arguments.

In [None]:
import pandas as pd
sd4m_train = pd.read_json("../data/daystream_corpus/train/train_with_events_and_defaults.jsonl", lines=True, encoding='utf8')

In [None]:
from wsee.utils import corpus_statistics

In [None]:
len(sd4m_train)  # contains documents with no trigger entity -> not relevant for event extraction

In [None]:
# Keep only documents that contain at least one trigger
filtered_sd4m_train = sd4m_train[sd4m_train.apply(corpus_statistics.has_triggers, axis=1)]

In [None]:
corpus_statistics.get_snorkel_event_stats(filtered_sd4m_train)

In [None]:
sample = filtered_sd4m_train.sample(n=100, random_state=42)

In [None]:
sample.to_json("../data/daystream_corpus/train_sample.jsonl", orient='records', lines=True, force_ascii=False)