# Sentence Splitting Experiments
Here we examine the quality of the sentence splitting in the corpus.
We found that the sentence splitting in the corpus, which was done with stanfordnlp, was too aggressive. Many sentences were split incorrectly, and trigger-argument pairs (from the document-level annotation) often ended up in different sentences.
We compare two negative labeling functions that differ only in the sentence splitting information they use (stanfordnlp vs. SoMaJo), and we manually examine the sentence splitting quality on the incorrectly labeled instances.

In [1]:
import sys
sys.path.append("../")

## spaCy, stanfordnlp, SoMaJo comparison

In [2]:
import spacy
nlp_spacy = spacy.load('de_core_news_md')

In [3]:
import stanfordnlp
# stanfordnlp.download('de')
nlp_stanford = stanfordnlp.Pipeline(lang='de')

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/Users/phuc/stanfordnlp_resources/de_gsd_models/de_gsd_tokenizer.pt', 'lang': 'de', 'shorthand': 'de_gsd', 'mode': 'predict'}
---
Loading: mwt
With settings: 
{'model_path': '/Users/phuc/stanfordnlp_resources/de_gsd_models/de_gsd_mwt_expander.pt', 'lang': 'de', 'shorthand': 'de_gsd', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
---
Loading: pos
With settings: 
{'model_path': '/Users/phuc/stanfordnlp_resources/de_gsd_models/de_gsd_tagger.pt', 'pretrain_path': '/Users/phuc/stanfordnlp_resources/de_gsd_models/de_gsd.pretrain.pt', 'lang': 'de', 'shorthand': 'de_gsd', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/Users/phuc/stanfordnlp_resources/de_gsd_models/de_gsd_lemmatizer.pt', 'lang': 'de', 'shorthand': 'de_gsd', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Usi

In [4]:
from somajo import SoMaJo
tokenizer = SoMaJo("de_CMC", split_camel_case=True)

### Test documents

In [5]:
doc1 = "#S1 Nach der Weichenstörung in Hohen Neuendorf verkehren die S-Bahnen wieder durchgehend, erster Zug ab #Frohnau 21:58 Uhr und erster Zug ab #Hohen_Neuendorf 22:03 Uhr."
doc2 = "Unfall\nAbschnitt: Marzahn (Berlin)\nGültig ab: 09.02.2016 20:06\ngesperrt, Unfall\n"
doc3 = "■ #A1 #Bremen Richtung #Hamburg zwischen Horster Dreieck und #Stillhorn 9 km #Stau.  Dort ist wegen #Bauarbeiten nur eine Spur frei.\n"
doc4 = "Wegen einer techn. Störung an der Strecke besteht für die Linien S41, S42 u. S46 zw. Halensee <> Westkreuz <> Messe Nord <> Westend S-Bahn-Pendelverkehr im 20-Minuten-Takt. Die Linien S41 u. S42 fahren nur im 10-Minuten-Takt, die Linie S46 fährt nur Königs Wusterhausen <> Tempelhof."
doc5 = "#S3, #S5, #S7, #S9: Nach einer ärztliche Versorgung eines Fahrgastes im Zug in Bellevue kommt es noch zu Verspätungen und vereinzelten Ausfällen."

In [6]:
test_docs = [doc1, doc2, doc3, doc4, doc5]

### Process documents with spaCy, stanfordnlp, somajo

In [7]:
spacy_docs = [nlp_spacy(doc) for doc in test_docs]

In [8]:
stanford_docs = [nlp_stanford(doc) for doc in test_docs]

In [9]:
somajo_docs = [list(tokenizer.tokenize_text([doc])) for doc in test_docs]

### Tokenization comparison
How to access tokens:

#### spaCy
`Doc` is a sequence of `Token`s. We can get the token text with `Token.text`.

#### stanfordnlp
Here we have to go through the sentences of a `Doc` and read the tokens from each sentence's `tokens` property. We can get the token text with `Token.text`.

#### SoMaJo
Similar to stanfordnlp.

In [10]:
def get_spacy_doc_tokens(doc):
    return [token.text for token in doc]

def get_stanford_doc_tokens(doc):
    return [token.text for sentence in doc.sentences for token in sentence.tokens]

def get_somajo_doc_tokens(doc):
    return [token.text for sentence in doc for token in sentence]

In [11]:
for spacy_doc, stanford_doc, somajo_doc in zip(spacy_docs, stanford_docs, somajo_docs):
    spacy_tokens = get_spacy_doc_tokens(spacy_doc)
    print("spaCy:", spacy_tokens)
    stanford_tokens = get_stanford_doc_tokens(stanford_doc)
    print("stanfordnlp:", stanford_tokens)
    somajo_tokens = get_somajo_doc_tokens(somajo_doc)
    print("somajo:", somajo_tokens)
    print("\n")

spaCy: ['#', 'S1', 'Nach', 'der', 'Weichenstörung', 'in', 'Hohen', 'Neuendorf', 'verkehren', 'die', 'S-Bahnen', 'wieder', 'durchgehend', ',', 'erster', 'Zug', 'ab', '#', 'Frohnau', '21:58', 'Uhr', 'und', 'erster', 'Zug', 'ab', '#', 'Hohen_Neuendorf', '22:03', 'Uhr', '.']
stanfordnlp: ['#S1', 'Nach', 'der', 'Weichenstörung', 'in', 'Hohen', 'Neuendorf', 'verkehren', 'die', 'S-', 'Bahnen', 'wieder', 'durchgehend', ',', 'erster', 'Zug', 'ab', '#', 'Frohnau', '21:58', 'Uhr', 'und', 'erster', 'Zug', 'ab', '#', 'Hohen_Neuendorf', '22:03', 'Uhr', '.']
somajo: ['#S1', 'Nach', 'der', 'Weichenstörung', 'in', 'Hohen', 'Neuendorf', 'verkehren', 'die', 'S-Bahnen', 'wieder', 'durchgehend', ',', 'erster', 'Zug', 'ab', '#Frohnau', '21:58', 'Uhr', 'und', 'erster', 'Zug', 'ab', '#Hohen_Neuendorf', '22:03', 'Uhr', '.']


spaCy: ['Unfall', '\n', 'Abschnitt', ':', 'Marzahn', '(', 'Berlin', ')', '\n', 'Gültig', 'ab', ':', '09.02.2016', '20:06', '\n', 'gesperrt', ',', 'Unfall', '\n']
stanfordnlp: ['Unfall', '

The spaCy tokenizer treats hashtags as separate tokens and keeps whitespace characters.
stanfordnlp more often than not treats hashtags as separate tokens and often does not handle abbreviations well, i.e. the tokenizer treats the dot as a separate token.
It also tends to split words containing punctuation marks more aggressively than the other tokenizers.
SoMaJo does not treat hashtags as separate tokens and handles abbreviations better. It does, however, split dates into multiple tokens.
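
The hashtag behaviour described above can be quantified with a small helper. The following is a minimal sketch that is not part of the original notebook: `count_split_hashtags` is a hypothetical name, and the token lists are abridged from the doc1 output printed above.

```python
def count_split_hashtags(tokens):
    """Count '#' tokens that were separated from the word following them."""
    return sum(
        1
        for current, following in zip(tokens, tokens[1:])
        if current == "#" and following[:1].isalnum()
    )


# Tokens for doc1, abridged from the printed output (middle of the sentence omitted).
doc1_tokens = {
    "spaCy": ["#", "S1", "Nach", "der", "Weichenstörung", "ab", "#", "Frohnau",
              "21:58", "Uhr", "ab", "#", "Hohen_Neuendorf", "22:03", "Uhr", "."],
    "stanfordnlp": ["#S1", "Nach", "der", "Weichenstörung", "ab", "#", "Frohnau",
                    "21:58", "Uhr", "ab", "#", "Hohen_Neuendorf", "22:03", "Uhr", "."],
    "somajo": ["#S1", "Nach", "der", "Weichenstörung", "ab", "#Frohnau",
               "21:58", "Uhr", "ab", "#Hohen_Neuendorf", "22:03", "Uhr", "."],
}

split_counts = {name: count_split_hashtags(tokens) for name, tokens in doc1_tokens.items()}
print(split_counts)
```

For doc1 this counts three split hashtags for spaCy, two for stanfordnlp and none for SoMaJo.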

### Sentence splitting comparison

In [12]:
def get_spacy_doc_sentences(doc):
    return [s.text for s in doc.sents]

def get_stanford_doc_sentences(doc):
    # introduces whitespaces
    # see: https://github.com/stanfordnlp/stanfordnlp/blob/dev/stanfordnlp/models/common/doc.py
    # to get original sentence text
    return [" ".join([token.text for token in sentence.tokens]) for sentence in doc.sentences]

def get_somajo_doc_sentences(doc):
    # introduces whitespaces
    return [" ".join([token.text for token in sentence]) for sentence in doc]
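
The comments in the helpers above note that joining tokens introduces whitespace. As a minimal sketch of the difference (the function names are hypothetical and the token list is illustrative), compare a token join with slicing the original text by character offsets, which is the approach the error-analysis helpers later in this notebook take with `char_start` and `char_end`:

```python
def join_tokens(tokens):
    """Rebuild a sentence by joining tokens with single spaces (loses original spacing)."""
    return " ".join(tokens)


def slice_sentence(text, char_start, char_end):
    """Recover the sentence exactly as written, given its character offsets."""
    return text[char_start:char_end]


# Beginning of doc2 from the test documents; the token list is illustrative.
text = "Unfall\nAbschnitt: Marzahn (Berlin)\n"
tokens = ["Unfall", "Abschnitt", ":", "Marzahn", "(", "Berlin", ")"]

joined = join_tokens(tokens)
original = slice_sentence(text, 0, len(text.rstrip()))

print(repr(joined))    # spaces inserted around punctuation, newline lost
print(repr(original))  # text exactly as it appears in the document
```

Offsets keep the sentence text aligned with the annotation's character spans, which a token join cannot guarantee.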

In [13]:
for spacy_doc, stanford_doc, somajo_doc in zip(spacy_docs, stanford_docs, somajo_docs):
    spacy_sentences = get_spacy_doc_sentences(spacy_doc)
    print("spaCy:", len(spacy_sentences), "\n", spacy_sentences)
    stanford_sentences = get_stanford_doc_sentences(stanford_doc)
    print("stanfordnlp:", len(stanford_sentences), "\n", stanford_sentences)
    somajo_sentences = get_somajo_doc_sentences(somajo_doc)
    print("somajo:", len(somajo_sentences), "\n", somajo_sentences)
    print("\n")

spaCy: 7 
 ['#S1', 'Nach der Weichenstörung in Hohen Neuendorf verkehren die S-Bahnen wieder durchgehend, erster Zug ab', '#', 'Frohnau', '21:58 Uhr und erster Zug ab', '#Hohen_Neuendorf', '22:03 Uhr.']
stanfordnlp: 1 
 ['#S1 Nach der Weichenstörung in Hohen Neuendorf verkehren die S- Bahnen wieder durchgehend , erster Zug ab # Frohnau 21:58 Uhr und erster Zug ab # Hohen_Neuendorf 22:03 Uhr .']
somajo: 1 
 ['#S1 Nach der Weichenstörung in Hohen Neuendorf verkehren die S-Bahnen wieder durchgehend , erster Zug ab #Frohnau 21:58 Uhr und erster Zug ab #Hohen_Neuendorf 22:03 Uhr .']


spaCy: 5 
 ['Unfall\nAbschnitt: Marzahn (Berlin)\n', 'Gültig ab', ':', '09.02.2016', '20:06\ngesperrt, Unfall\n']
stanfordnlp: 1 
 ['Unfall Abschnitt : Marzahn ( Berlin ) Gültig ab : 09.02.2016 20:06 gesperrt , Unfall']
somajo: 1 
 ['Unfall Abschnitt : Marzahn ( Berlin ) Gültig ab : 09. 02. 2016 20:06 gesperrt , Unfall']


spaCy: 6 
 ['■', '#', 'A1 #Bremen', 'Richtung #Hamburg zwischen Horster Dreieck und #Sti

In this small sample of documents we can observe that spaCy tends to split the document text very aggressively. It does not seem to handle hashtags, punctuation marks and abbreviations well.
stanfordnlp tends to do a little better, but seems rather ill-equipped to handle social media text that contains many abbreviations and unconventional punctuation.
SoMaJo does considerably better. In our testing it made mistakes only on the few occasions where it encountered unknown abbreviations.
Therefore we chose to do event extraction on a document level and to use SoMaJo's sentence splitting information for our negative labeling functions.

## Automatic approach to evaluate sentence splitting quality
To automatically evaluate the sentence splitting quality of stanfordnlp and SoMaJo, we compare two event role labeling functions. Each labels an example as `no_arg` when trigger and (potential) argument are in separate sentences according to the respective sentence boundary information, and abstains otherwise.

Caveats: If the annotators did not pay attention to sentence boundaries when labeling event roles, then it may seem that the splitter made a mistake. This would then measure the consistency / quality of the annotation rather than the quality of the sentence splitter. This approach only covers the sentence splitting errors where trigger and argument ended up in different sentences according to the splitter. There may be other sentence splitting errors where trigger and argument still ended up in the same sentence or splitting errors in sentences with no event roles.
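
The labeling rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation in `wsee`: the function names are hypothetical, sentences and spans are assumed to be dicts with `char_start` and `char_end`, `-1` is Snorkel's abstain value, and `10` is the `no_arg` index that appears in the LF summary later in this notebook. The sentence and trigger/argument offsets are taken from the A6 example in the error-analysis output further down; the "18 km" span is added for illustration.

```python
ABSTAIN = -1  # Snorkel's convention for "no vote"
NO_ARG = 10   # index of `no_arg` as shown in this notebook's LF summary


def find_sentence_index(span, sentences):
    """Return the index of the sentence that fully contains `span`, else None."""
    for idx, sentence in enumerate(sentences):
        if (sentence["char_start"] <= span["char_start"]
                and span["char_end"] <= sentence["char_end"]):
            return idx
    return None


def separate_sentence_vote(trigger, argument, sentences):
    """Vote NO_ARG if trigger and argument lie in different sentences."""
    trigger_idx = find_sentence_index(trigger, sentences)
    argument_idx = find_sentence_index(argument, sentences)
    if trigger_idx is None or argument_idx is None:
        # Assumption of this sketch: abstain if a span is not inside one sentence.
        return ABSTAIN
    return NO_ARG if trigger_idx != argument_idx else ABSTAIN


sentences = [{"char_start": 0, "char_end": 97}, {"char_start": 98, "char_end": 141}]
trigger = {"char_start": 92, "char_end": 97}        # 'Stau.'
delay_arg = {"char_start": 121, "char_end": 141}    # 'ein-einhalb Stunden.'
length_arg = {"char_start": 86, "char_end": 91}     # '18 km' (illustrative span)

print(separate_sentence_vote(trigger, delay_arg, sentences))   # different sentences
print(separate_sentence_vote(trigger, length_arg, sentences))  # same sentence
```

The first call votes `no_arg` and the second abstains. Swapping in the sentence spans of a different splitter changes only the `sentences` argument, which is what makes the two labeling functions comparable.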

In [14]:
import pandas as pd
from wsee.utils import corpus_statistics
sd4m_train = pd.read_json("../data/daystream_corpus/train/train_with_events_and_defaults.jsonl", lines=True, encoding='utf8')
filtered_sd4m_train = sd4m_train[sd4m_train.apply(lambda document: corpus_statistics.has_triggers(document), axis=1)]
corpus_statistics.get_snorkel_event_stats(filtered_sd4m_train)

{'# Docs': 567,
 '# Docs with event triggers': 413,
 '# Event triggers with positive label': 488,
 '# Event triggers with negative label': 289,
 '# Event triggers with abstain': 0,
 'Trigger class frequencies': {'Accident': 59,
  'CanceledRoute': 61,
  'CanceledStop': 25,
  'Delay': 65,
  'Obstruction': 101,
  'RailReplacementService': 22,
  'TrafficJam': 155,
  'O': 289},
 '# Docs with event roles': 413,
 '# Event role with positive label': 2001,
 '# Event roles with negative label': 5284,
 '# Event roles with abstain': 0,
 'Role class frequencies': {'location': 571,
  'delay': 87,
  'direction': 277,
  'start_loc': 377,
  'end_loc': 352,
  'start_date': 35,
  'end_date': 41,
  'cause': 103,
  'jam_length': 135,
  'route': 23,
  'no_arg': 5284}}

In [15]:
from wsee.data import pipeline

df_sd_train, Y_sd_train = pipeline.build_event_role_examples(filtered_sd4m_train)

INFO:root:Building event role examples
INFO:root:DataFrame has 567 rows
INFO:root:Adding the following attributes to each document: entity_type_freqs, somajo_doc, mixed_ner, mixed_ner_spans
567it [00:12, 46.02it/s]
INFO:root:Adding the following attributes to each role example: not_an_event, arg_type_event_type_match, between_distance, is_multiple_same_event_type
INFO:root:Number of event roles: 2001
INFO:root:Number of event role examples: 7285


In [16]:
from wsee.labeling import event_argument_role_lfs as role_lfs
from snorkel.labeling import PandasLFApplier

lfs = [
    role_lfs.lf_somajo_separate_sentence,
    role_lfs.lf_stanford_separate_sentence
]
applier = PandasLFApplier(lfs)

In [17]:
L_sd_train = applier.apply(df_sd_train)

100%|██████████| 7285/7285 [00:02<00:00, 3188.25it/s]


In [18]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L_sd_train, lfs).lf_summary(Y_sd_train)

| | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| lf_somajo_separate_sentence | 0 | [10] | 0.288813 | 0.287577 | 0.0 | 2097 | 7 | 0.996673 |
| lf_stanford_separate_sentence | 1 | [10] | 0.423747 | 0.287577 | 0.0 | 2993 | 94 | 0.96955 |


The SD4M train set contains 2001 positive event roles and 5284 negative event roles.
`lf_stanford_separate_sentence`, which uses the sentence splitting information from stanfordnlp, correctly labels 2993 of the negative event roles, but incorrectly labels 94 of the positive event roles as `no_arg`.
While `lf_somajo_separate_sentence` correctly labels fewer of the negative event roles (2097), it labels only 7 of the positive event roles incorrectly.
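
The coverage and empirical accuracy columns of the summary follow directly from these counts. A minimal sketch of the arithmetic (`lf_metrics` is a hypothetical helper, not a Snorkel function), assuming coverage is the fraction of the 7285 examples that received a label and empirical accuracy is the fraction of labeled examples that were correct:

```python
def lf_metrics(correct, incorrect, total_examples):
    """Compute coverage and empirical accuracy from LF summary counts."""
    labeled = correct + incorrect
    return {
        "coverage": labeled / total_examples,
        "accuracy": correct / labeled,
    }


TOTAL_EXAMPLES = 7285  # number of event role examples in the train set

somajo = lf_metrics(correct=2097, incorrect=7, total_examples=TOTAL_EXAMPLES)
stanford = lf_metrics(correct=2993, incorrect=94, total_examples=TOTAL_EXAMPLES)

for name, metrics in [("somajo", somajo), ("stanford", stanford)]:
    print(name, round(metrics["coverage"], 6), round(metrics["accuracy"], 6))
```

The rounded values agree with the table: the stanfordnlp-based function covers more examples, but at a lower accuracy.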

Event extraction on a sentence level, i.e. where the input to the model is a sentence rather than a document, hinges on the quality of the sentence splitting. We do not want to lose these 94 examples of positive event roles for model training.
That is why we decided to feed documents into the model.
As the information from the sentence splitting is crucial to the role labeling functions that only take proximity into account, we chose `lf_somajo_separate_sentence`.

To make sure that those errors are not due to inconsistencies in the annotation, we will manually examine the incorrect instances.
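
The incorrect instances are those where a labeling function voted `no_arg` although the gold label is a positive role. The sketch below shows one way to select them from plain lists; it is an assumption about the selection logic, not the code of `error_analysis.get_false_positives`, and the toy votes and gold labels are made up for illustration.

```python
ABSTAIN = -1  # Snorkel's convention for "no vote"
NO_ARG = 10   # index of `no_arg` in this notebook


def false_positive_indices(lf_votes, gold_labels, label_of_interest=NO_ARG):
    """Return positions where the LF voted `label_of_interest` but gold disagrees."""
    return [
        idx
        for idx, (vote, gold) in enumerate(zip(lf_votes, gold_labels))
        if vote == label_of_interest and gold != label_of_interest
    ]


# Hypothetical toy data: one LF's votes and the gold role indices.
lf_votes = [NO_ARG, ABSTAIN, NO_ARG, NO_ARG]
gold_labels = [NO_ARG, 0, 0, 3]

print(false_positive_indices(lf_votes, gold_labels))
```

Here the first vote is correct, the second is an abstain, and the last two are the false positives that would be inspected manually.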

In [19]:
from wsee.preprocessors import preprocessors

def get_simple_trigger(doc):
    trigger = doc.trigger
    simple_trigger = {
        'text': trigger['text'],
        'entity_type': trigger['entity_type'],
        'char_start': trigger['char_start'],
        'char_end': trigger['char_end']
    }
    return simple_trigger

def get_simple_argument(doc):
    argument = doc['argument']
    simple_argument = {
        'text': argument['text'],
        'entity_type': argument['entity_type'],
        'char_start': argument['char_start'],
        'char_end': argument['char_end']
    }
    return simple_argument

def get_simple_somajo_sentences(doc):
    sentences = doc.somajo_doc['sentences']
    return [{'text': doc.text[sentence['char_start']:sentence['char_end']], 'char_start': sentence['char_start'], 'char_end': sentence['char_end']} for sentence in sentences]

def get_simple_stanford_sentences(doc):
    sentences = doc.sentence_spans
    return [{'text': doc.text[sentence['char_start']:sentence['char_end']], 'char_start': sentence['char_start'], 'char_end': sentence['char_end']} for sentence in sentences]

def add_simple_columns(df):
    df['trigger_sm'] = df.apply(lambda doc: get_simple_trigger(doc), axis=1)
    df['argument_sm'] = df.apply(lambda doc: get_simple_argument(doc), axis=1)
    df['somajo_sm'] = df.apply(lambda doc: get_simple_somajo_sentences(doc), axis=1)
    df['stanford_sm'] = df.apply(lambda doc: get_simple_stanford_sentences(doc), axis=1)
    df['between_tokens'] = df.apply(lambda doc: preprocessors.get_between_tokens(doc), axis=1)
    return df

In [20]:
from wsee.labeling import error_analysis
from wsee import ROLE_LABELS
pd.set_option('display.max_colwidth', None)
labeled_sd4m_roles = df_sd_train.copy()
labeled_sd4m_roles['label'] = Y_sd_train
labeled_sd4m_roles['event_role'] = [ROLE_LABELS[label_idx] for label_idx in Y_sd_train]
labeled_sd4m_roles = add_simple_columns(labeled_sd4m_roles)

In [21]:
error_analysis.get_false_positives(labeled_df=labeled_sd4m_roles, lf_outputs=L_sd_train, lf_index=0, label_of_interest=10)[['text', 'between_tokens', 'trigger_sm', 'argument_sm', 'somajo_sm', 'event_role']]

Unnamed: 0,text,between_tokens,trigger_sm,argument_sm,somajo_sm,event_role
5054,"Kreis Breisgau-Hochschwarzwald Störungen im Schienenverkehr, Erdrutsch Die Zugverbindung Freiburg i.Br. - Titisee-Neustadt (Höllentalbahn) ist im Bereich Falkensteig unterbrochen. Ein Schienenersatzverkehr ist eingerichtet. Mit Behinderungen ist zu rechnen.\n","[ist, im, Bereich, Falkensteig]","{'text': 'unterbrochen', 'entity_type': 'trigger', 'char_start': 166, 'char_end': 178}","{'text': 'Zugverbindung Freiburg i.Br. - Titisee-Neustadt (Höllentalbahn)', 'entity_type': 'location_route', 'char_start': 75, 'char_end': 138}","[{'text': 'Kreis Breisgau-Hochschwarzwald Störungen im Schienenverkehr, Erdrutsch Die Zugverbindung Freiburg i.', 'char_start': 0, 'char_end': 100}, {'text': 'Br. - Titisee-Neustadt (Höllentalbahn) ist im Bereich Falkensteig unterbrochen.', 'char_start': 100, 'char_end': 179}, {'text': 'Ein Schienenersatzverkehr ist eingerichtet.', 'char_start': 180, 'char_end': 223}, {'text': 'Mit Behinderungen ist zu rechnen.', 'char_start': 224, 'char_end': 257}]",location
5081,#RE4 wird ab #Düsseldorf Hbf (17:26) bis #Mönchengladbach (17:48) ohne Halt umgeleitet.Mit 20 Minuten Verspätung muss in MG gerechnet werden\n,"[wird, ab, #Düsseldorf, Hbf, (, 17:26, ), bis, #Mönchengladbach, (, 17:48, ), ohne, Halt]","{'text': 'umgeleitet.Mit', 'entity_type': 'trigger', 'char_start': 76, 'char_end': 90}","{'text': '#RE4', 'entity_type': 'location_route', 'char_start': 0, 'char_end': 4}","[{'text': '#RE4 wird ab #Düsseldorf Hbf (17:26) bis #Mönchengladbach (17:48) ohne Halt umgeleitet.', 'char_start': 0, 'char_end': 87}, {'text': 'Mit 20 Minuten Verspätung muss in MG gerechnet werden', 'char_start': 87, 'char_end': 140}]",location
5082,#RE4 wird ab #Düsseldorf Hbf (17:26) bis #Mönchengladbach (17:48) ohne Halt umgeleitet.Mit 20 Minuten Verspätung muss in MG gerechnet werden\n,"[(, 17:26, ), bis, #Mönchengladbach, (, 17:48, ), ohne, Halt]","{'text': 'umgeleitet.Mit', 'entity_type': 'trigger', 'char_start': 76, 'char_end': 90}","{'text': '#Düsseldorf Hbf', 'entity_type': 'location_stop', 'char_start': 13, 'char_end': 28}","[{'text': '#RE4 wird ab #Düsseldorf Hbf (17:26) bis #Mönchengladbach (17:48) ohne Halt umgeleitet.', 'char_start': 0, 'char_end': 87}, {'text': 'Mit 20 Minuten Verspätung muss in MG gerechnet werden', 'char_start': 87, 'char_end': 140}]",start_loc
5084,#RE4 wird ab #Düsseldorf Hbf (17:26) bis #Mönchengladbach (17:48) ohne Halt umgeleitet.Mit 20 Minuten Verspätung muss in MG gerechnet werden\n,"[(, 17:48, ), ohne, Halt]","{'text': 'umgeleitet.Mit', 'entity_type': 'trigger', 'char_start': 76, 'char_end': 90}","{'text': '#Mönchengladbach', 'entity_type': 'location_stop', 'char_start': 41, 'char_end': 57}","[{'text': '#RE4 wird ab #Düsseldorf Hbf (17:26) bis #Mönchengladbach (17:48) ohne Halt umgeleitet.', 'char_start': 0, 'char_end': 87}, {'text': 'Mit 20 Minuten Verspätung muss in MG gerechnet werden', 'char_start': 87, 'char_end': 140}]",end_loc
5643,A43 Wuppertal Richtung Recklinghausen zwischen Witten-Herbede und Bochum-Querenburg Unfall 5 km Stau. Dort wird der Verkehr über die Parallelfahrbahn geleitet. (Zeitverlust: etwa eine halbe Stunde)\n,"[5, km, Stau, ., Dort, wird, der, Verkehr, über, die, Parallelfahrbahn, geleitet, ., (, Zeitverlust, :, etwa, eine]","{'text': 'Unfall', 'entity_type': 'trigger', 'char_start': 84, 'char_end': 90}","{'text': 'halbe Stunde', 'entity_type': 'duration', 'char_start': 184, 'char_end': 196}","[{'text': 'A43 Wuppertal Richtung Recklinghausen zwischen Witten-Herbede und Bochum-Querenburg Unfall 5 km Stau.', 'char_start': 0, 'char_end': 101}, {'text': 'Dort wird der Verkehr über die Parallelfahrbahn geleitet.', 'char_start': 102, 'char_end': 159}, {'text': '(Zeitverlust: etwa eine halbe Stunde)', 'char_start': 160, 'char_end': 197}]",delay
5651,A43 Wuppertal Richtung Recklinghausen zwischen Witten-Herbede und Bochum-Querenburg Unfall 5 km Stau. Dort wird der Verkehr über die Parallelfahrbahn geleitet. (Zeitverlust: etwa eine halbe Stunde)\n,"[., Dort, wird, der, Verkehr, über, die, Parallelfahrbahn, geleitet, ., (, Zeitverlust, :, etwa, eine]","{'text': 'Stau', 'entity_type': 'trigger', 'char_start': 96, 'char_end': 100}","{'text': 'halbe Stunde', 'entity_type': 'duration', 'char_start': 184, 'char_end': 196}","[{'text': 'A43 Wuppertal Richtung Recklinghausen zwischen Witten-Herbede und Bochum-Querenburg Unfall 5 km Stau.', 'char_start': 0, 'char_end': 101}, {'text': 'Dort wird der Verkehr über die Parallelfahrbahn geleitet.', 'char_start': 102, 'char_end': 159}, {'text': '(Zeitverlust: etwa eine halbe Stunde)', 'char_start': 160, 'char_end': 197}]",delay
6947,A6 Nürnberg Richtung Heilbronn zwischen Herrieden und Kreuz Feuchtwangen / Crailsheim 18 km Stau. Zeitverlust von bis zu ein-einhalb Stunden.\n,"[Zeitverlust, von, bis, zu]","{'text': 'Stau.', 'entity_type': 'trigger', 'char_start': 92, 'char_end': 97}","{'text': 'ein-einhalb Stunden.', 'entity_type': 'duration', 'char_start': 121, 'char_end': 141}","[{'text': 'A6 Nürnberg Richtung Heilbronn zwischen Herrieden und Kreuz Feuchtwangen / Crailsheim 18 km Stau.', 'char_start': 0, 'char_end': 97}, {'text': 'Zeitverlust von bis zu ein-einhalb Stunden.', 'char_start': 98, 'char_end': 141}]",delay


Out of the 7 instances, only one is due to SoMaJo splitting a sentence incorrectly, after an unknown abbreviation (Freiburg i.|Br.):
- Kreis Breisgau-Hochschwarzwald Störungen im Schienenverkehr, Erdrutsch Die **Zugverbindung Freiburg i.Br. - Titisee-Neustadt (Höllentalbahn)** ist im Bereich Falkensteig **unterbrochen**. Ein Schienenersatzverkehr ist eingerichtet. Mit Behinderungen ist zu rechnen.\n
    - Kreis Breisgau-Hochschwarzwald Störungen im Schienenverkehr, Erdrutsch Die **Zugverbindung Freiburg i.**
    - **Br. - Titisee-Neustadt (Höllentalbahn)** ist im Bereich Falkensteig **unterbrochen**.
    - Ein Schienenersatzverkehr ist eingerichtet.
    - Mit Behinderungen ist zu rechnen.\n

3 instances were incorrectly labeled because of a tokenization error from the annotation. The text 'umgeleitet.Mit' was not correctly tokenized into \['umgeleitet', '.', 'Mit'\] in the original annotation and ended up being the trigger text. The character offsets of the trigger were consequently incorrect as well.
SoMaJo correctly identified the sentence boundary within the trigger.
- **#RE4** wird ab **#Düsseldorf Hbf** (17:26) bis **#Mönchengladbach** (17:48) ohne Halt **umgeleitet.Mit** 20 Minuten Verspätung muss in MG gerechnet werden\n
    - **#RE4** wird ab **#Düsseldorf Hbf** (17:26) bis **#Mönchengladbach** (17:48) ohne Halt **umgeleitet.**
    - **Mit** 20 Minuten Verspätung muss in MG gerechnet werden\n
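
Annotation spans like this one can be detected automatically by checking whether a sentence boundary falls strictly inside an annotated span. A minimal sketch, with a hypothetical helper name and the offsets copied from the table output above:

```python
def boundaries_inside_span(span, sentences):
    """Return sentence start offsets that fall strictly inside the annotated span."""
    return [
        sentence["char_start"]
        for sentence in sentences
        if span["char_start"] < sentence["char_start"] < span["char_end"]
    ]


trigger = {"text": "umgeleitet.Mit", "char_start": 76, "char_end": 90}
somajo_sentences = [
    {"char_start": 0, "char_end": 87},
    {"char_start": 87, "char_end": 140},
]

inner = boundaries_inside_span(trigger, somajo_sentences)
print(inner)
for boundary in inner:
    cut = boundary - trigger["char_start"]
    # Show the two pieces of the trigger on either side of the boundary.
    print(repr(trigger["text"][:cut]), repr(trigger["text"][cut:]))
```

The boundary at offset 87 splits the trigger into 'umgeleitet.' and 'Mit', which points to the annotation rather than the splitter as the source of the error.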

The remaining 3 instances are due to inconsistencies in the annotation: 
- A43 Wuppertal Richtung Recklinghausen zwischen Witten-Herbede und Bochum-Querenburg **Unfall** 5 km **Stau**. Dort wird der Verkehr über die Parallelfahrbahn geleitet. (Zeitverlust: etwa eine **halbe Stunde**)\n
    - A43 Wuppertal Richtung Recklinghausen zwischen Witten-Herbede und Bochum-Querenburg **Unfall** 5 km **Stau**. 
    - Dort wird der Verkehr über die Parallelfahrbahn geleitet.
    - (Zeitverlust: etwa eine **halbe Stunde**)\n
- A6 Nürnberg Richtung Heilbronn zwischen Herrieden und Kreuz Feuchtwangen / Crailsheim 18 km **Stau**. Zeitverlust von bis zu **ein-einhalb Stunden**.\n
    - A6 Nürnberg Richtung Heilbronn zwischen Herrieden und Kreuz Feuchtwangen / Crailsheim 18 km **Stau**.
    - Zeitverlust von bis zu **ein-einhalb Stunden**.\n

Trigger and argument are in separate sentences. According to the annotation guidelines these pairs should not have been annotated.
> Annotate only explicit relation mentions that occur within a single sentence with all required arguments.

In [22]:
error_analysis.get_false_positives(labeled_df=labeled_sd4m_roles, lf_outputs=L_sd_train, lf_index=1, label_of_interest=10).iloc[0:2][['text', 'between_tokens', 'trigger_sm', 'argument_sm', 'stanford_sm', 'event_role']]

Unnamed: 0,text,between_tokens,trigger_sm,argument_sm,stanford_sm,event_role
888,"folgende Meldung ergänzt\nam Mittwoch, 6. und Donnerstag, 7. April, jeweils 20.45 – 24.00 Uhr\nMeldung:\nCNL 471 nach Zürich HB (planmäßig 21.33 Uhr ab Berlin Gesundbrunnen) fährt bis zu 46 Min. früher von Berlin-Gesundbrunnen bis Bitterfeld und hält nicht in Halle (Saale) Hbf.\nGrund:\nSoftwareanpassungen im Elektronischen Stellwerk Halle (Saale)\nLink zur detaillierten Meldung: \nLink zum kompletten PDF-Dokument: \n(142 kB)\n------------------\n","[nach, Zürich, HB, (, planmäßig, 21.33, Uhr, ab, Berlin, Gesundbrunnen, ), fährt, bis, zu, 46, Min, ., früher, von, Berlin, -, Gesundbrunnen, bis, Bitterfeld, und]","{'text': 'hält nicht', 'entity_type': 'trigger', 'char_start': 243, 'char_end': 253}","{'text': 'CNL 471', 'entity_type': 'location_route', 'char_start': 102, 'char_end': 109}","[{'text': 'folgende Meldung ergänzt am Mittwoch, 6.', 'char_start': 0, 'char_end': 40}, {'text': 'und Donnerstag, 7.', 'char_start': 41, 'char_end': 59}, {'text': 'April, jeweils 20.45 – 24.00 Uhr Meldung: CNL 471 nach Zürich HB (planmäßig 21.33 Uhr ab Berlin Gesundbrunnen) fährt bis zu 46 Min.', 'char_start': 60, 'char_end': 191}, {'text': 'früher von Berlin-Gesundbrunnen bis Bitterfeld und hält nicht in Halle (Saale) Hbf.', 'char_start': 192, 'char_end': 275}, {'text': 'Grund: Softwareanpassungen im Elektronischen Stellwerk Halle (Saale) Link zur detaillierten Meldung: Link zum kompletten PDF-Dokument: (142 kB) ------------------', 'char_start': 276, 'char_end': 440}]",route
889,"folgende Meldung ergänzt\nam Mittwoch, 6. und Donnerstag, 7. April, jeweils 20.45 – 24.00 Uhr\nMeldung:\nCNL 471 nach Zürich HB (planmäßig 21.33 Uhr ab Berlin Gesundbrunnen) fährt bis zu 46 Min. früher von Berlin-Gesundbrunnen bis Bitterfeld und hält nicht in Halle (Saale) Hbf.\nGrund:\nSoftwareanpassungen im Elektronischen Stellwerk Halle (Saale)\nLink zur detaillierten Meldung: \nLink zum kompletten PDF-Dokument: \n(142 kB)\n------------------\n","[(, planmäßig, 21.33, Uhr, ab, Berlin, Gesundbrunnen, ), fährt, bis, zu, 46, Min, ., früher, von, Berlin, -, Gesundbrunnen, bis, Bitterfeld, und]","{'text': 'hält nicht', 'entity_type': 'trigger', 'char_start': 243, 'char_end': 253}","{'text': 'Zürich HB', 'entity_type': 'location_stop', 'char_start': 115, 'char_end': 124}","[{'text': 'folgende Meldung ergänzt am Mittwoch, 6.', 'char_start': 0, 'char_end': 40}, {'text': 'und Donnerstag, 7.', 'char_start': 41, 'char_end': 59}, {'text': 'April, jeweils 20.45 – 24.00 Uhr Meldung: CNL 471 nach Zürich HB (planmäßig 21.33 Uhr ab Berlin Gesundbrunnen) fährt bis zu 46 Min.', 'char_start': 60, 'char_end': 191}, {'text': 'früher von Berlin-Gesundbrunnen bis Bitterfeld und hält nicht in Halle (Saale) Hbf.', 'char_start': 192, 'char_end': 275}, {'text': 'Grund: Softwareanpassungen im Elektronischen Stellwerk Halle (Saale) Link zur detaillierten Meldung: Link zum kompletten PDF-Dokument: (142 kB) ------------------', 'char_start': 276, 'char_end': 440}]",direction


As mentioned before, stanfordnlp seems to struggle with abbreviations and dates.
Manually examining all the examples where `lf_stanford_separate_sentence` falsely labeled positive event roles as `no_arg`, we found that in 2 cases the sentence splitter worked correctly and the error was due to annotation inconsistency (the same kind of Zeitverlust case as with SoMaJo).
The remaining 92 of 94 examples were labeled incorrectly because of sentence splitting errors. 4 of those 92 (from one document) were due to an exclamation mark used within a sentence that was full of hashtags and lacked a conventional sentence structure. In one instance a punctuation mark was used incorrectly in the text.
The 94 examples stemmed from 38 unique documents. We found 84 sentence splitting errors. In 31 cases the sentence splitter did not recognize the punctuation mark as part of an abbreviation and instead treated it as a sentence boundary marker. In 51 cases the sentence splitter split the sentence in the middle of a date.

- 2 Cases of correct split, but error due to annotation inconsistency
    - Same Zeitverlust examples (2)
- 92 Errors due to wrong split: 2, 5, 4, 2, 4, 4, 1, 2, 1, 1, 1, 2, 7, (4), 2, 3, 1, 2, 5, 1, 2, 2, 2, 2, 2, 1, 5, 1, 0, 3, 3, 2, 2, 2, (1), 1, 1, 4, 0, 2
    - (4): due to unconventional use of exclamation mark, very loose sentence structure
    - (1): wrong use of punctuation mark

- Sentence splitting errors in 94 examples (38 unique documents with 84 sentence splitting errors + 2 examples where the sentence splitter cannot do much):
    - Abbreviations (31): 1, 2, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 2, 3, 2, 0, 0, 0, 0, 2, 1, 0, 0, 0, 1, 2, 0, 1, 0, 0, 2, 1, 0, 2  
    - Dates (51): 2, 4, 0, 0, 0, 1, 1, 2, 1, 1, 1, 6, 0, 2, 6, 0, 2, 1, 1, 1, 2, 2, 1, 1, 0, 1, 1, 0, 1, 0, 5, 0, 3, 0, 1, 0, 0, 1
    - Unconventional use of punctuation marks: 1 (bunch of hashtags, one followed by exclamation mark)
    - Wrong use of punctuation mark in text data: 1
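
The tallies above can be cross-checked with a few sums. The lists below are copied from the bullet points; the variable names are only illustrative.

```python
# Examples with a wrong split, per entry of the list above (parentheses dropped).
wrong_split_examples = [2, 5, 4, 2, 4, 4, 1, 2, 1, 1, 1, 2, 7, 4, 2, 3, 1, 2, 5, 1,
                        2, 2, 2, 2, 2, 1, 5, 1, 0, 3, 3, 2, 2, 2, 1, 1, 1, 4, 0, 2]

# Sentence splitting errors per document, by category.
abbreviation_errors = [1, 2, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 2, 3, 2,
                       0, 0, 0, 0, 2, 1, 0, 0, 0, 1, 2, 0, 1, 0, 0, 2, 1, 0, 2]
date_errors = [2, 4, 0, 0, 0, 1, 1, 2, 1, 1, 1, 6, 0, 2, 6, 0, 2, 1, 1,
               1, 2, 2, 1, 1, 0, 1, 1, 0, 1, 0, 5, 0, 3, 0, 1, 0, 0, 1]
other_errors = 1 + 1  # unconventional exclamation mark + wrong punctuation in the text

total_split_errors = sum(abbreviation_errors) + sum(date_errors) + other_errors

print("examples with wrong split:", sum(wrong_split_examples))
print("abbreviation errors:", sum(abbreviation_errors))
print("date errors:", sum(date_errors))
print("total splitting errors:", total_split_errors)
```

The sums reproduce the reported figures: 92 examples with a wrong split, 31 abbreviation errors, 51 date errors and 84 sentence splitting errors in total. Both category lists have 38 entries, one per document.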
    
